I've written a translator in Perl for a message board migration. All it does is apply regexes and print the result; I redirect stdout to a file and away we go! The problem is that my program stops producing output after about 18 MB written!
My script is translate.pl ( https://gist.github.com/914450 )
and I launch it with this line:
$ perl translate.pl mydump.sql > mydump-bbcode.sql
Really sorry for the quality of the code, but I never use Perl... I tried sed for the same job but didn't manage to apply the regexes I found in the original script.
[EDIT]
I reworked the code and sanitized some regexes (see gist.github.com/914450) but I'm still stuck. I split the big dump into 15 MB files and launched translate.pl seven processes at a time to use all cores, but each script still stops at a variable size. A "tail" of the output doesn't show a complete message or any URL when it stops...
Thanks, guys! I'll let you know if I finally manage it.
yikes - start with the basics:
use strict;
use warnings;
...at the top of your script. It will complain about not properly declaring your lexicals, so go ahead and do that. I don't see anything obvious that would be truncating your file, but perhaps one or more of your regexes is pathological. Also, the undefs at the end are not needed.
For what you are doing, you might consider just using sed.
You say the "script stops". It keeps running but produces no more output? Or actually stops running? If it stops running, what does:
perl translate.pl mydump.sql > mydump-bbcode.sql
echo $?
show? And if you add a print STDERR "done!\n"; after your loop, does that show up?
Perl can certainly handle files much larger than 18 MB. I know because I routinely run files of 5 GB through Perl.
I think that your problem is in while($html=<FILE>).
Whenever $html is set to an empty line, the while will evaluate as false and exit the loop.
You need to use something like while( defined( $html = <FILE> ) )
Edit:
Hmm. I had always thought you need the defined but in my testing just now it didn't exit on blank lines or 0. Must be more of that special Perl magic that mostly works the way you intend -- except when it doesn't.
Indeed if you restructure the while loop enough you can fool Perl into working the way I always thought it worked. (And it might have, in Perl 4 or in earlier versions of Perl 5)
This will fail:
$x = <>;
chomp $x;
while( $x ) {
    print $x;
    $x = <>;
    chomp $x;
}
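For reference, a version that guards the read with defined (a minimal sketch) does not have this problem and only exits at end-of-file:
while ( defined( my $x = <> ) ) {
    chomp $x;
    print $x;
}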
There could be any number of things going on:
Try adding $| = 1; to the top of your script. This will make all output unbuffered.
One of your regexes is going crazy and is deleting strings when you're not expecting it.
You've run out of disk space.
There's nothing really wrong with your script (other than you're missing use strict; use warnings; and you're not using the three-argument form of open()) that would cause it to stop working after some magic number of bytes.
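For illustration, here is a minimal sketch combining those two suggestions (the filename is just a placeholder):
use strict;
use warnings;

$| = 1;    # make output unbuffered so results appear immediately

# three-argument open with a lexical filehandle
open my $fh, '<', 'mydump.sql' or die "Can't open mydump.sql: $!";
while (my $line = <$fh>) {
    # ... apply regexes to $line here ...
    print $line;
}
close $fh;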
Hello guys, and thank you so much for your help and ideas!
After trying to split and parallelize the jobs, I cut my program into three scripts: translate1.pl, translate2.pl, and translate3.pl... the job gets done, and it's fast with 8 active cores!
Then my launcher.sh runs the three scripts successively on each split file. Done with two loops, and here we go :)
Regards, Yoann
Related
I'm fairly new to the whole coding game, and am very grateful for every answer!
I am working on a directory with many .txt files in it and have a file with a looong list of regexes like perl -p -i -e 's/\n\n/\n/g' *.xml. They all work if I copy them to the terminal. But is there a possibility to run them straight from the file?
I tried ./unicode.sh but that resulted in:
No such file or directory.
Any ideas?
Thank you so much!
Here's a (mostly) equivalent Perl script to the one-liner perl -p -i -e 's/\n\n/\n/g' *.xml (one main difference being that this has strict and warnings enabled, which is strongly recommended). You could expand on it by putting more code to modify the current line in the body of the while loop.
#!/usr/bin/env perl
use warnings;
use strict;

if (!@ARGV) {               # if no files on command line
    @ARGV = glob('*.xml');  # get a default list of files
}
local $^I = '';             # enable in-place editing (like perl -i)
while (<>) {                # read each line of each file into $_
    s/\n\n/\n/g;            # modify $_ with a regex
    # more regexes here...
    print;                  # write the line $_ back out
}
You can save this script in a file such as process.pl, and then run it with perl process.pl, or do chmod u+x process.pl and then run it via ./process.pl.
On the other hand, you really shouldn't modify XML files with regular expressions; there are lots of Perl modules for XML processing - I wrote about that some more here. Also, in the example you showed, s/\n\n/\n/g actually won't have any effect, since when reading files line by line, no string will contain two \n's (you can change how Perl reads files, but I don't see any mention of that in the question).
Edit: You've named the script in your example unicode.sh - if you're processing Unicode files, then Perl has very powerful features to help with that, although the code won't necessarily end up as nice and short as I've shown above. You'll have to tell us some more about what you're doing, and show some example input and output, to get suggestions about that. See also e.g. perlunitut.
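For instance, here is a minimal sketch of reading and writing files as UTF-8 (the filenames are placeholders; perlunitut covers the details):
use strict;
use warnings;

# open both handles with an explicit UTF-8 encoding layer
open my $in,  '<:encoding(UTF-8)', 'input.txt'  or die "input.txt: $!";
open my $out, '>:encoding(UTF-8)', 'output.txt' or die "output.txt: $!";
while (my $line = <$in>) {
    # ... apply substitutions to $line here ...
    print $out $line;
}
close $in;
close $out;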
If you got "No such file or directory", it's likely that you forgot to make unicode.sh executable, as in chmod +x unicode.sh (assuming that's a script that you wrote).
Of course, the normal way to run multiple Perl commands is to put them in something that looks like runme.pl which you write, i.e., a Perl script.
That said, yes, everything will work from the terminal; you just need to be careful about the escaping that bash performs.
Having trouble with a script right now.
Trying to filter out portions of a file and put them into a scalar. Here is the code.
@value = (grep { m/(III[ABC])/g and m//g } <$fh>);
print @value;
@value = (grep { m/[012]iii/g } (<$fh>));
print @value;
When I run the first grep, the values appear in the print statement. But when I run the second grep, the second print statement doesn't print anything. Does adding a second grep cancel out the effectiveness of the first grep?
I know that both the first and second grep work, because even when I commented out the first grep, the second grep worked.
All I really want to do is filter out information into multiple different arrays. I am really confused about how to fix this problem, since I am planning on adding more greps to the script.
The first read on <$fh> gets to the end of the file, so the second invocation has nothing to read. Thus, if you comment out the first one, this doesn't happen and the second one works.
The code below adds to the same array; change to the commented-out code if needed. The regex is simplified, since as posted it requires a comment while it doesn't affect the actual question. Please put it back the way it was if that was what you really meant.
You can either rewind the filehandle after all lines have been read
my @vals = grep { /III[ABC]/ } <$fh>;

seek $fh, 0, 0;
# ready for reading again from the beginning

push @vals, grep { /[012]iii/ } <$fh>;
# or: my @vals_2 = grep { /[012]iii/ } <$fh>;
Or you can read all lines into an array that you can then process repeatedly.
my @original = <$fh>;

my @vals = grep { /III[ABC]/ } @original;
push @vals, grep { m/[012]iii/ } @original;
# or assign to a different array
If you don't need to store these results in this order, it would be far more efficient to read the file line by line, processing and adding as you go.
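A minimal sketch of that single-pass approach (assuming the same $fh, and collecting into separate arrays):
my (@vals_roman, @vals_lower);
while (my $line = <$fh>) {
    push @vals_roman, $line if $line =~ /III[ABC]/;
    push @vals_lower, $line if $line =~ /[012]iii/;
}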
Update
I simplified the originally posted regex in order to focus on the question at hand, since the exact condition inside the block has no bearing on it. See the Note below. Thanks to ikegami for bringing it up and for explaining that // "repeats the last successful query".
The m//g is tricky and I removed it.
grep checks a condition and passes a line through if the condition evaluates to true. In such a scalar context, the /g modifier has effects which are a very different story, so I removed it.
For the same reason as above, the capturing () is unneeded (excessive).
Cleaning up the syntax helps readability here, so I removed the m/ form as well.
Note on regex
In scalar context, the /g modifier does the following, per perlrequick:
successive matches against a string will have //g jump from match to match
The empty pattern m//g also has effects which are far from obvious, as stated above.
Taken together these produce non-trivial results in my tests and need mental tracing to understand. I removed them from the code here since leaving them begs a question on whether they are intended trickery or subtle bugs, thus distracting from the actual question -- which they do not affect at all.
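To illustrate just the first point, here is a small standalone demo of /g in scalar context (not from the original post):
my $str = 'aXbXc';
while ($str =~ /X/g) {
    # pos() reports where the next match attempt will resume
    print pos($str), "\n";   # prints 2, then 4
}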
I don't know what you think the g modifier does, but it makes no sense here.
I don't know what you think // (a match with an empty pattern) does, but it makes no sense here.
In list context, <$fh> returns all remaining lines in the file. It returns nothing the second time you evaluate it, since you've already reached the end of the file the first time you evaluated it.
Fix:
my @lines = <$fh>;
my @values1 = grep { /III[ABC]/ && /.../ } @lines;
my @values2 = grep { /[012]iii/ } @lines;
Of course, substitute ... for what you meant to use there.
I'd like to preface that I am unable to make changes to the underlying source code. This is code that gets checked out for each project for a team and I cannot make any changes at this time.
Okay so essentially, in a particular .cpp file, let's say foo.cpp, there is a unique line somewhere in the middle that reads:
FT_BAR, 1,
where 1 could be any number (but is going to be 1,2,3,4,5... practically never anything higher)...
I'd like to have a Bash or Perl script that allows me to automatically find this number and increase it by one. For what purpose you may ask... well, it would save me precious seconds multiple times per day and spare me a great deal of tedium endured by opening and closing this file to increase this number.
What is the best approach to this problem? I'm sure I will be embarrassed by a ridiculously simple one-line solution or some standard Unix tool that does exactly this, but I've been unable to find this so please forgive me if this is the case.
How about
perl -pe's/(\d+)/$1+1/e if /FT_BAR, \d+,/' foo.cpp > new.cpp
Perl has the /e substitution flag, which evaluates the replacement as code.
use strict;
use warnings;

while ( <DATA> ) {
    s/FT_BAR, (\d+),/"FT_BAR, " . ($1+1) . ","/eg;
    print;
}

__DATA__
FT_BAR, 1,
You can turn this into a one liner:
perl -pi.bak -e 's/FT_BAR, (\d+),/"FT_BAR, ".($1+1).","/e;' test_re.txt
I have a Perl routine that is causing frequent "out of memory" issues on the system.
The script does three things:
1> Get the output of a command into an array (@arr = `$command` --> the array will hold about 13 MB of data after the command)
2> Use a large regex to match the contents of individual array elements -->
The regex is something like this
if($new_element =~ m|([A-Z0-9\-\._\$]+);\d+\s+([0-9]+)-([A-Z][A-Z][A-Z])-([0-9][0-9][0-9][0-9])\s+([0-9]+)\:([0-9]+)\:([0-9]+)|io)
<put to hash>
3> Put the array in a persistent hash map.
$hash_var{$arr[0]} = "Some value"
edit:
Sample data processed by regex are
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5
5-APR-2014 10:43:11.70
Z4:[newuser.newdir]TEST_BOC.COM;20
5-APR-2014 10:41:01.63
Z4:[newuser.newdir]TEST_NEWRT.COM;17
4-APR-2014 10:30:56.11
About 10,000 lines like these.
I started by suspecting that the array and hash together may be consuming too much memory.
However, I have started to think this regex might have something to do with the out-of-memory condition as well.
Is the Perl regex (with the 'io' options!) really the main culprit causing the out-of-memory errors?
This has nothing to do with regexes.
If you are operating in a memory-constrained environment, you should process data records one at a time rather than fetching all of them at once. Let's assume you pull your data like:
my @data = `some command`;
for my $line (@data) {
    ... # process the line
}
This is incredibly wasteful because you need storage for the data, and for the output of your processing (in your case: the hash).
Instead, process the input line by line. We can use the open function instead of backticks for this:
open my $cmd, '-|', 'some', 'command' or die "Can't run some command: $!";
while (my $line = <$cmd>) {
    ... # process the line
}
There is no need for an array here, which saves us 13 MB of memory that we can now put to use otherwise.
What problem are you really trying to solve?
Use your words... not Perl.
Something like: "The script is picking apart the output from an openvms Directory output command and the objective is to report the number of file and oldest date ordered by directory"
First question is WHY keep the array. Will the script 'walk' it again?
If not, just processes there and then in a for loop.
The regex seems to pick out out a file-name, and date. That's been does before.
It is not hard, and can be simplified by trusting the OpenVMS directory format.
Something like this reads better IMHO:
if($new_element =~ m|](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)|)
3> $hash_var{$arr[0]} = "Some value"
Hmmm, that suggests to me that a whole line from the array is used as a key, with all 50+ spaces. So those 10,000 lines turn into 1,000,000+ bytes just for raw key bytes. A lot, but not crazy. Now, we know that the first word on the line MUST be unique, so why not exploit that:
$hash_var{$1} = xxx if /(\S+)/;
The program may also want to exploit the fact that the leading strings are highly repetitive, and substitute everything before the "]" with an ever-increasing directory number, maintained in a 'look-aside' array and/or hash.
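A hedged sketch of that look-aside idea (the names %dir_index, @dirs, and the key format are illustrative, not from the original script):
use strict;
use warnings;

my (%dir_index, @dirs, %hash_var);
while (my $line = <DATA>) {
    next unless $line =~ /^(.*\])(\S+)/;          # split the "Z4:[dir]" prefix from the filename
    my ($dir, $file) = ($1, $2);
    $dir_index{$dir} //= push(@dirs, $dir) - 1;   # first sighting of a directory gets the next index
    $hash_var{"$dir_index{$dir}]$file"} = 1;      # compact key, e.g. "0]TEST_BOC.CFG;5"
}
print "$_\n" for sort keys %hash_var;

__DATA__
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
Z4:[newuser.newdir]TEST_BOC.CFG;5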
Personally I would drop /NOHEAD from the command, and use a regex to pick up the directories as they come by on their own lines.
Or use a SUBSTR or whatever... of course you'd need to construct a similar key on re-access.
In the related topic, there is debugging output printed.
Perhaps include the line number in the array for your own understanding?
Perl encounters "out of memory" in openvms system
Good luck!
Hein
Well, I tried and failed, so here I am again.
I need to match my abs path pattern.
/public_html/mystuff/10000001/001/10/01.cnt
I am in taint mode, etc.
#!/usr/bin/perl -Tw
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;
$ENV{PATH} = "bin:/usr/bin";
delete @ENV{qw(IFS CDPATH BASH_ENV ENV)};
I need to open the same file a couple times or more and taint forces me to untaint the file name every time. Although I may be doing something else wrong, I still need help constructing this pattern for future reference.
my $file = "$var[5]";
if ($file =~ /(\w{1}[\w-\/]*)/) {
$under = "/$1\.cnt";
} else {
ErroR();
}
You can see by my beginner attempt that I am close to clueless.
I had to add the forward slash and extension to $1 due to my poorly constructed, but working, regex.
So, I need help learning how to fix my expression so $1 represents /public_html/mystuff/10000001/001/10/01.cnt
Could someone hold my hand here and show me how to make:
$file =~ /(\w{1}[\w-\/]*)/ match my absolute path /public_html/mystuff/10000001/001/10/01.cnt ?
Thanks for any assistance.
Edit: Using $ in the pattern (as I did before) is not advisable here because it can match \n at the end of the filename. Use \z instead because it unambiguously matches the end of the string.
Be as specific as possible in what you are matching:
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';

if ( $fn =~ m!
    ^(
        /public_html
        /mystuff
        /[0-9]{8}
        /[0-9]{3}
        /[0-9]{2}
        /[0-9]{2}\.cnt
    )\z!x ) {
    print $1, "\n";
}
Alternatively, you can reduce the vertical space taken by the code by putting what I assume to be a common prefix, '/public_html/mystuff', in a variable, combining the various components in a qr// construct (see perldoc perlop), and then using the conditional operator ?::
#!/usr/bin/perl
use strict;
use warnings;
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
my $prefix = '/public_html/mystuff';
my $re = qr!^($prefix/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;
$fn = $fn =~ $re ? $1 : undef;
die "Filename did not match the requirements" unless defined $fn;
print $fn, "\n";
Also, I cannot reconcile using a relative path as you do in
$ENV{PATH} = "bin:/usr/bin";
with using taint mode. Did you mean
$ENV{PATH} = "/bin:/usr/bin";
You talk about untainting the file path every time. That's probably because you aren't compartmentalizing your program steps.
In general, I break up these sort of programs into stages. One of the earlier stages is data validation. Before I let the program continue, I validate all the data that I can. If any of it doesn't fit what I expect, I don't let the program continue. I don't want to get half-way through something important (like inserting stuff into a database) only to discover something is wrong.
So, when you get the data, untaint all of it and store the values in a new data structure. Don't use the original data or the CGI functions after that. The CGI module is just there to hand data to your program. After that, the rest of the program should know as little about CGI as possible.
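A hedged sketch of that staging idea (the validation pattern and names here are illustrative, not from the original post; real validation should be as specific to your data as possible):
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Stage 1: validate and untaint everything up front.
my %clean;
for my $name ($q->param) {
    my $raw = $q->param($name);
    if (defined $raw && $raw =~ m!^([\w./-]+)\z!) {
        $clean{$name} = $1;                            # $1 is untainted
    } else {
        die "Invalid value for parameter '$name'\n";   # stop before any real work
    }
}

# Later stages use only %clean; the CGI object is no longer needed.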
I don't know what you are doing, but it's almost always a design smell to take actual filenames as input.