Segmentation Fault when using perl regex on large file

I am trying to find and replace a pattern in a very large file (43 GB) and am running into problems. I first tried to use sed, but it does not seem to be optimized for large files, even ones much smaller than 43 GB, so I switched to perl.
I have this command:
perl -0777 -i -pe 's/(public\..)*_seq/\1_id_seq/mg' dump.sql
But it generates a segmentation fault before exiting and turns my 43 GB dump into a 0-byte file. The file I am trying to parse is a simple PostgreSQL database dump.
Just for information:
# perl --version
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi
(with 67 registered patches, see perl -V for more detail)
Has anyone already faced this problem or have any idea how to solve it? I would prefer to keep this a one-line perl command, but if you have solutions with any other program I will take them too.

-0777 tells perl to load the whole file into memory (see perlrun). If your memory is less than 43 GB (whatever it is), you'll have to find a way to process the file in smaller chunks. For example, try dropping the option, or use -00 for "paragraph mode".
Also note that, unlike in sed, you need to use $1 instead of \1 in the replacement part of a substitution in Perl.
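For example, combining both fixes - a sketch that simply drops -0777 to process the file line by line and uses $1 in the replacement (the pattern itself is kept from the question, so I'm assuming it matches what you actually want):
perl -i -pe 's/(public\..)*_seq/${1}_id_seq/g' dump.sql
The braces in ${1} keep the capture reference from running into the _id_seq text that follows it.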


How to run list of perl regex from file in terminal

I'm fairly new to the whole coding game, and am very grateful for every answer!
I am working on a directory with many .txt files in it and have a file with a long list of regexes like "perl -p -i -e 's/\n\n/\n/g' *.xml". They all work if I copy them to the terminal, but is there a possibility to run them straight from the file?
I tried ./unicode.sh but that resulted in:
No such file or directory.
Any ideas?
Thank you so much!
Here's a (mostly) equivalent Perl script to the one-liner perl -p -i -e 's/\n\n/\n/g' *.xml (one main difference being that this has strict and warnings enabled, which is strongly recommended). You could expand on it by putting more code to modify the current line in the body of the while loop.
#!/usr/bin/env perl
use warnings;
use strict;
if (!@ARGV) {              # if no files on command line
    @ARGV = glob('*.xml'); # get a default list of files
}
local $^I = '';   # enable in-place editing (like perl -i)
while (<>) {      # read each line of each file into $_
    s/\n\n/\n/g;  # modify $_ with a regex
    # more regexes here...
    print;        # write the line $_ back out
}
You can save this script in a file such as process.pl, and then run it with perl process.pl, or do chmod u+x process.pl and then run it via ./process.pl.
On the other hand, you really shouldn't modify XML files with regular expressions; there are lots of Perl modules to do XML processing - I wrote about that some more here. Also, in the example you showed, s/\n\n/\n/g actually won't have any effect, since when reading files line by line, no string will contain two \n's (you can change how Perl reads files, but I don't see any mention of that in the question).
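For example, here is a minimal sketch using XML::LibXML, which is one of several CPAN options and my choice here, not necessarily the module the linked post recommends; the //title XPath and the foo/bar edit are made-up placeholders:
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(location => 'file.xml');
for my $node ($doc->findnodes('//title')) {  # hypothetical element name
    my $text = $node->textContent;
    $text =~ s/foo/bar/g;                    # the same kind of edit, applied per node
    $node->removeChildNodes;
    $node->appendText($text);
}
print $doc->toString;
The advantage over a regex is that the parser understands entities, CDATA, and nesting, so the edit can't accidentally corrupt the markup.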
Edit: You've named the script in your example unicode.sh - if you're processing Unicode files, then Perl has very powerful features to help with that, although the code won't necessarily end up as nice and short as I've shown above. You'll have to tell us some more about what you're doing, and show some example input and output, to get suggestions about that. See also e.g. perlunitut.
If you got "no such file or directory", it's likely your problem was that you forgot to make unicode.sh executable, as in chmod +x unicode.sh, assuming that's a script that you wrote.
Of course, the normal way to run multiple perl commands is to put them in something that looks like runme.pl which you write, i.e., a perl script.
That said, yes, everything will work from the terminal; you just need to be careful about the escaping that bash performs.
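For example, if unicode.sh is just your one-liners, one per line, you can turn it into a shell script (a sketch; the second command is a made-up placeholder standing in for the rest of your list):
#!/bin/sh
# unicode.sh - run each cleanup command in sequence
perl -p -i -e 's/\n\n/\n/g' *.xml
perl -p -i -e 's/foo/bar/g' *.xml   # ...more commands from your list
Then chmod +x unicode.sh && ./unicode.sh, or run it without the execute bit as sh unicode.sh.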

Why is this vim regex so expensive: s/\n/\\n/g

Attempting this on a sufficiently large file (say 80,000+ lines and about 500k+) will crash things or stall eventually both on my server and on my local Mac.
I've tried this at the command line as well, with the same result:
vim -es -c '%s/\n/\\n/g' -c wq $file
Also, the problem appears to be with the selection (\n) and not the replacement (\\n).
For my larger files I can of course split them and cat them back when finished, but the split points cannot be arbitrary in my case and must be adjusted manually for each and every split.
I appreciate that there are other ways to do this -- sed, etc. -- but I have similar and additional problems there, and I would like to be able to do this with vim.
I'm adding my comment as an answer:
Text editors usually don't like 'gigantic' lines (which is what you'll get with that replacement).
To test whether this is due to the 'big line' rather than the substitution itself, I did this test:
I created a simple ~500KB file with a script: no newline characters, just a single line. Then I tried to load the file with vim. Result? I had to kill it :-).
However, if on the same script I write some new lines every now and then, I have no problems opening the file.
Also, one thing you could try is the following: in vim, replace \n with \n\n. If that is fast, it should also confirm the 'big line' issue.
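If you want to reproduce that test, something like this should work (my own sketch, not the original poster's script):
perl -e 'print "a" x 500_000' > oneline.txt                 # one ~500KB line, no newlines
perl -e 'print "a" x 80, "\n" for 1..6250' > manylines.txt  # same amount of text, 80-char lines
Opening oneline.txt in vim is the case that hangs; manylines.txt opens fine.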

What is the difference between the two sed commands below?

Information about the environment I am working in:
$ uname -a
AIX prd231 1 6 00C6B1F74C00
$ oslevel -s
6100-03-10-1119
Code Block A
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \
/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleA.txt
Code Block B
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \n/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleB.txt
I have two code blocks (shown above) that make use of sed to trim the input down to 6 digits, but one command is behaving differently than I expected.
Sample of input for the two code blocks
Mar 25 14:06:16 prd231 ajbtux[33423660]: 20160325140616:~schd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT:~schdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341~ SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210~
I get the following output when the sample input above goes through the two code blocks.
cycleA.txt content
389210
cycleB.txt content
25140616231334236602016032514061610200705008341389210
I understand that my last piped sed command (sed 's/[^0-9]*//g') is deleting all characters other than numbers, so I omitted it from the code blocks and placed the output in two additional files. I get the following output.
cycleA1.txt content
SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210
cycleB1.txt content
Mar 25 15:27:58 prd231 ajbtux[33423660]: 20160325152758: nschd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT: nschdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341 n SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210 n
I can see that the first code block is removing everything other than (SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210), using the tilde, but the second code block is just replacing the tildes with the character n. I can also see that the first code block needs a line break right after sed 's/[~]/ , which is why I thought \n would simulate a line break, but that is not the case. I think my different output results are because of the way regular expressions are being used. I have tried to look into regular expressions and searched about them on Stack Overflow, but did not find what I was looking for. Could someone explain how I can achieve the same result from code block B as from code block A, without having part of my code on a second line?
Thank you in advance
This is an example of the XY problem (http://xyproblem.info/). You're asking for help to implement something that is the wrong solution to your problem. Why are you changing ~s to newlines, etc when all you need given your posted sample input and expected output is:
$ sed -n 's/.*schdCycCleanup.* \([0-9]*\).*/\1/p' file
389210
or:
$ awk -F'[ ~]' '/schdCycCleanup/{print $(NF-1)}' file
389210
If that's not all you need then please edit your question to clarify your requirements for WHAT you are trying to do (as opposed to HOW you are trying to do it) as your current approach is just wrong.
Etan Reisner's helpful answer explains the problem and offers a single-line solution based on an ANSI C-quoted string ($'...'), which is appropriate, given that you originally tagged your question bash.
(Ed Morton's helpful answer shows you how to bypass your problem altogether with a different approach that is both simpler and more efficient.)
However, it sounds like your shell is actually something different - presumably ksh88, an older version of the Korn shell that is the default sh on AIX 6.1 - in which such strings are not supported[1]
(ANSI C-quoted strings were introduced in ksh93, and are also supported not only in bash, but in zsh as well).
Thus, you have the following options:
With your current shell, you must stick with a two-line solution that contains a (\-escaped) actual newline, as in your code block A (see the variable-based sketch after this list for a way to keep the pipeline itself on one line).
Note that $(printf '\n') to create a newline does not work, because command substitutions invariably trim all trailing newlines, resulting in the empty string in this case.
Use a more modern shell that supports ANSI C-quoted strings, and use Etan's answer. http://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.cmds3/ksh.htm tells me that ksh93 is available as an alternative shell on AIX 6.1, as /usr/bin/ksh93.
If feasible: install GNU sed, which natively understands escape sequences such as \n in replacement strings.
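As for the first option, here is a sketch of the variable-based variant: the literal newline still occupies its own physical line, but only once, in an assignment, so the pipeline itself stays on one line (cycleC.txt is just a hypothetical output name):
nl='
'
( grep schdCycCleanup $DCCS_LOG_FILE | sed "s/[~]/ \\$nl/g" | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleC.txt
Inside the double quotes, \\ becomes a single backslash and $nl expands to a newline, so sed receives exactly the same script as in code block A.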
[1] As for what actually happens when you try echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g' in a POSIX-like shell that does not support $'...': the $ is left as-is, because what follows is not a valid variable name, and sed ends up seeing literal $s/[~]/\\\n/g, where the $ is interpreted as a context address applying to the last input line - which doesn't make a difference here, because there is only 1 line. \\ is interpreted as plain \, and \n as plain n, effectively replacing ~ instances with literal \n sequences.
GNU sed handles \n in the replacement the way you expect.
OS X (and presumably BSD) sed does not. It treats it as a normal escaped character and just unescapes it to n. (Though I don't see this in the manual anywhere at the moment.)
You can use $'' quoting to use \n as a literal newline if you want though.
echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g'
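For example, in bash with a sed that honors a backslash-newline in the replacement, this is the output I'd expect:
$ echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g'
foo
bar
baz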

Substitute first string match in 1-line 2GB file on Linux

I'm trying to substitute only the first match of one string in a huge file with only one line (2.1 GB); this substitution will occur in a shell script job. The big problem is that the machine that will run this script has only 1 GB of memory (approximately 300 MB free), so I need a buffered strategy that doesn't overflow my memory. I already tried sed, perl, and a python approach, but all of them gave me out-of-memory errors.
Here are my attempts (discovered in other questions):
# With perl
perl -pi -e '!$x && s/FROM_STRING/TO_STRING/ && ($x=1)' file.txt
# With sed
sed '0,/FROM_STRING/s//TO_STRING/' file.txt > file.txt.bak
# With python (in a custom script.py file)
import fileinput

for line in fileinput.input('file.txt', inplace=True):
    print line.replace(FROM_STRING, TO_STRING, 1)
    break
One good point is that the FROM_STRING I'm searching for is always at the beginning of this huge 1-line file, within the first 100 characters. Another good thing is that execution time is not a problem; it can take time without problems.
EDIT (SOLUTION):
I tested three solutions from the answers, and all of them solved the problem - thanks to all of you. I tested the performance with Linux time, and all of them took about the same time too, up to approximately 10 seconds... But I chose @Miller's solution because it's simpler (it just uses perl).
Since you know that your string is always in the first chunk of the file, you should use dd for this.
You'll also need a temporary file to work with, as in tmpfile="$(mktemp)"
First, copy the first block of the file to a new, temporary location:
dd bs=32k count=1 if=file.txt of="$tmpfile"
Then, do your substitution on that block:
sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"
Next, concatenate the new first block with the rest of the old file, again using dd:
dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1
EDIT: As per Mark Setchell's suggestion, I have added a specification of bs=32k to these commands to speed up the pace of the dd operations. This is tunable, per your needs, but if tuning separate commands distinctly, you may need to be careful about the changes in semantics between different input and output block sizes.
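Putting the whole sequence together (a sketch; note it assumes the substitution does not change the length of the first block, since the rest of the file is stitched back at a fixed 32k offset, and the final mv is my addition - you may prefer to keep the original as a backup instead):
tmpfile="$(mktemp)"
# copy the first 32k block to a temporary file
dd bs=32k count=1 if=file.txt of="$tmpfile"
# edit that block in place (the edit must not change its length)
sed -i 's/FROM_STRING/TO_STRING/' "$tmpfile"
# append everything after the first block, starting at the 32k mark
dd bs=32k if=file.txt of="$tmpfile" seek=1 skip=1
# replace the original file with the patched copy
mv "$tmpfile" file.txt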
If you're certain the string you're trying to replace is just in the first 100 characters, then the following perl one-liner should work:
perl -i -pe 'BEGIN {$/ = \1024} s/FROM_STRING/TO_STRING/ .. undef' file.txt
Explanation:
Switches:
-i: Edit <> files in place (makes backup if extension supplied)
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
BEGIN {$/ = \1024}: Set the $INPUT_RECORD_SEPARATOR to the number of characters to read for each “line”
s/FROM/TO/ .. undef: Use a flip-flop to perform the regex only once. Could also have used if $. == 1.
Given that the string to replace is in the first 100 bytes,
Given that Perl IO is slow unless you start using sysread to read large blocks,
Assuming that the substitution changes the size of the file[1], and
Assuming that binmode isn't needed[2],
I'd use
( head -c 100 | perl -0777pe's/.../.../' && cat ) <file.old >file.new
[1] A faster solution exists for the case where the substitution doesn't change the size of the file.
[2] Though it's easy to add if needed.
Untested, but I would do:
perl -pi -we 'BEGIN{$/=\65536} s/FROM_STRING/TO_STRING/ if 1..1' file.txt
to read in 64k chunks.
A practical (not very compact but efficient) approach would be to split the file, do the search-and-replace, and join, e.g.:
head -c 100 myfile | sed 's/FROM/TO/' > output.1
tail -c +101 myfile > output.2
cat output.1 output.2 > output && /bin/rm output.1 output.2
Or, in one line:
( ( head -c 100 myfile | sed 's/FROM/TO/' ) && (tail -c +101 myfile ) ) > output

How can I remove text at beginning of a file using a regex?

I have a bunch of files that contain a semi-standard header. That is, the look of it is very similar but the text changes somewhat.
I want to remove this header from all of the files.
From looking at the files, I know that what I want to remove is encapsulated between similar words.
So, for instance, I have:
Foo bar...some text here...
more text
Foo bar...I want to keep everything after this point
I tried this command in perl:
perl -pi -e "s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt
But it doesn't work. I'm not a regex expert but hoping someone knows how to basically remove a chunk of text from the beginning of a file based on a text match and not the number of characters...
By default, ARGV (aka <> which is used behind-the-scenes by -p) only reads a single line at a time.
Workarounds:
Unset $/, which tells Perl to read a whole file at a time.
perl -pi -e "BEGIN{undef$/}s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt
BEGIN is necessary to have that code run before the first read is done.
Use -0, which sets $/ = "\0".
perl -pi -0 -e "s/\A.*?Foo.bar*?Foo.bar//simxg" 00ws110.txt
Take advantage of the flip-flop operator.
perl -ni -e 'print unless 1 ... /^Foo.bar/'
This will skip printing starting from line 1 to /^Foo.bar/.
If your header stretches across more than one line you must tell perl how much to read. If the files are small in comparison to memory you may want to just slurp the whole file into memory:
perl -0777pi.orig -e 's/your regex/your replace/s' file1 file2 file3
The -0777 option sets perl to slurp mode, so $_ will hold each whole file each time through the loop. Also, always remember to set the backup extension. If you don't, you may find that you have wiped out your data accidentally and have no way to get it back. See perldoc perlrun for more information.
Given information from the comments, it looks like you are trying to strip all of the annoying stuff from the front of a Project Gutenberg ebook. If you understand all of the copyright issues involved, you should be able to get rid of the front matter like this:
perl -ni.orig -e 'print unless 1 .. /^\*END/' 00ws110.txt
The Project Gutenberg header ends with
*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*
A safer regex would take into account the *END* at the end of the line as well, but I am lazy.
I might be misinterpreting what you're asking for, but it looks to me like it's as simple as:
perl -ni -e 'print unless 1..($. > 1 && /^Foo bar/)'
Here you go! This replaces the first line of the file:
use Tie::File;
tie my @array, "Tie::File", "path_to_file" or die("can't tie the file");
$array[0] =~ s/text_i_want_to_replace/replacement_text/gi;
untie @array;
You can operate on the array and you will see the modifications reflected in the file. You can delete elements from the array and it will erase the line from the file. Applying a substitution to an element will substitute text in the corresponding line.
If you want to delete the first two lines, and keep something from the third, you can do something like this :
# tie the @array before this
shift @array;
shift @array;
$array[0] =~ s/foo bar\.\.\.//gi;
# untie the @array
and this will do exactly what you need!