Regex substitution over a large file

Given a large file (potentially 1.5GB) with some very long lines, I need to run substitutions. I thought of running a sliding window where I read 4096 bytes off the file, append them to the previous chunk and substitute on that, but of course that doesn't work so well for substitutions where the substitution might be across the fold.
Another thought came to mind: I could start with my first two chunks and s/// on that. If there are no substitutions made, I write chunk1 to disk and grab chunk3. Then I s/// on chunk2 and chunk3. If there are substitutions, I grab chunk4 and append it. Then I keep appending until there are no substitutions. At that point I write everything but the latest chunk to disk. Like this:
read( $data, $previous_chunk, 4096 );
while ( read( $data, $this_chunk, 4096 ) > 0 ) {
    my $chunk = $previous_chunk . $this_chunk;
    if ( 0 == $chunk =~ s/foo/bar/g ) {
        # There was nothing to substitute, so we'll append all
        # the old stuff to our file and just keep the latest chunk.
        print OUTPUT $previous_chunk;
        $previous_chunk = $this_chunk;
    }
    else {
        # There was a substitution, so we want to keep building
        # (keeping the substituted text) in case it crossed the fold.
        $previous_chunk = $chunk;
    }
}
print OUTPUT $previous_chunk;    # write out whatever is left at EOF
Does that sound sane? The only problem I can see is that the substitution might cause a new match in the running $previous_chunk. So we probably need to clear $previous_chunk up to the latest substitution somehow and only keep the clean content that follows it. (E.g., if we had s/foo/foobar/ we'd turn 'foo' into 'foobar', then into 'foobarbar', then into 'foobarbarbar'.) Is there a way to avoid that?
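Maybe something like this, where instead of growing the chunk while substitutions keep happening, I always flush everything except a short tail. This assumes I can bound how long a single match can be ('foo' is three bytes here), so text that has already been substituted is written out once and never rescanned ($data and OUTPUT are opened as before):
my $max_match = 3;                      # longest possible match ('foo')
my $keep      = $max_match - 1;         # tail carried into the next round
read( $data, $previous_chunk, 4096 );
while ( read( $data, $this_chunk, 4096 ) > 0 ) {
    my $chunk = $previous_chunk . $this_chunk;
    $chunk =~ s/foo/bar/g;
    if ( length($chunk) > $keep ) {
        # Flush everything except the tail; the flushed part is final.
        print OUTPUT substr( $chunk, 0, length($chunk) - $keep );
        $previous_chunk = substr( $chunk, -$keep );
    }
    else {
        $previous_chunk = $chunk;
    }
}
$previous_chunk =~ s/foo/bar/g;         # in case the whole file fit in one read
print OUTPUT $previous_chunk;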
Is there a better way to do it?

You ask if there is a better way to do it. "Better" is subjective, but this is what I would do if I were faced with the same problem.
One option is not to use Perl at all but to perform your search-and-replace in a text editor that supports gigantic files.
On Windows, one editor that handles files this size is EditPad Pro. Not my product, so this is not an ad; I use EPP because it is regex-centric (same author as RegexBuddy).
Quoting from the brochure: "Open and edit files of absolutely any size, including files larger than 4 GB, even on a 32-bit system with a modest amount of RAM."
This lets you offload the whole large-file problem onto the editor.
Other features that may or may not be relevant: you can save your regular expressions, chain them with macros, and invoke tools (such as your own Perl scripts) to manipulate the currently open file (though in that case you are once again at the mercy of the outside tool's memory management).

Related

Why is this vim regex so expensive: s/\n/\\n/g

Attempting this on a sufficiently large file (say 80,000+ lines and about 500k+) will crash things or stall eventually both on my server and on my local Mac.
I've tried this at the command line as well, with the same result:
vim -es -c '%s/\n/\\n/g' -c wq $file
Also, the problem appears to be with the selection (\n) and not the replacement (\\n).
For my larger files I can of course split them and cat them back when finished, but the split points cannot be arbitrary in my case and must be adjusted manually for each and every split.
I appreciate that there are other ways to do this -- sed, etc. -- but I have similar and additional problems there, and I would like to be able to do this with vim.
I'm adding my comment as an answer:
Text editors usually don't like 'gigantic' lines (which is what you'll get with that replacement).
To test whether this is due to the 'big line' rather than to the substitution itself, I did the following test:
I created a simple ~500KB file with a script: no newline characters, just a single line. Then I tried to load the file with vim. Result? I had to kill it :-).
However, if in the same script I write some newlines every now and then, I have no problem opening the file.
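I did it with a throwaway script; something along these lines produces equivalent files (the file names and exact sizes are only illustrative):
perl -e 'print "x" x 500_000' > single_line.txt
perl -e 'print "x" x 79, "\n" for 1 .. 6300' > many_lines.txt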
Also, one thing you could try is the following: in vim, replace \n with \n\n; if that is fast, it should also confirm the 'big line' issue.
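For example, using the same invocation style as in the question (note that in a Vim replacement you need \r, not \n, to insert a newline):
vim -es -c '%s/\n/\r\r/g' -c wq "$file"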

perl regex using too much memory?

I have a Perl routine that is causing frequent "out of memory" issues on the system.
The script does 3 things:
1> Get the output of a command into an array (@arr = `$command` --> the array will hold about 13MB of data after the command)
2> Use a large regex to match the contents of individual array elements -->
The regex is something like this
if($new_element =~ m|([A-Z0-9\-\._\$]+);\d+\s+([0-9]+)-([A-Z][A-Z][A-Z])-([0-9][0-9][0-9][0-9])\s+([0-9]+)\:([0-9]+)\:([0-9]+)|io)
<put to hash>
3> Put the array in a persistent hash map.
$hash_var{$arr[0]} = "Some value"
edit:
Sample data processed by regex are
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4                    8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5                          5-APR-2014 10:43:11.70
Z4:[newuser.newdir]TEST_BOC.COM;20                         5-APR-2014 10:41:01.63
Z4:[newuser.newdir]TEST_NEWRT.COM;17                       4-APR-2014 10:30:56.11
About 10000 lines like these
I started by suspecting that the array and hash together may be consuming too much memory.
However, I have started to think this regex might have something to do with the out-of-memory errors as well.
Is the Perl regex (with the 'io' modifiers!) really the main culprit causing the out-of-memory errors?
This has nothing to do with regexes.
If you are operating in a memory-constrained environment, you should process data records one at a time rather than fetching all of them at once. Let's assume you pull your data like:
my @data = `some command`;
for my $line (@data) {
    ... # process the line
}
This is incredibly wasteful because you need storage for the data, and for the output of your processing (in your case: the hash).
Instead, process the input line by line. We can use the open function instead of backticks for this:
open my $cmd, '-|', 'some', 'command' or die "Can't run some command: $!";
while (my $line = <$cmd>) {
    ... # process the line
}
There is no need for an array here, which saves us 13MB of memory that we can now put to use elsewhere.
What problem are you really trying to solve?
Use your words... not Perl.
Something like: "The script is picking apart the output from an openvms Directory output command and the objective is to report the number of file and oldest date ordered by directory"
First question is WHY keep the array. Will the script 'walk' it again?
If not, just processes there and then in a for loop.
The regex seems to pick out out a file-name, and date. That's been does before.
It is not hard, and can be simplified by trusting the OpenVMS directory format.
Somethign like this reads better imho:
if($new_element =~ m|](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)|)
Regarding $hash_var{$arr[0]} = "Some value":
Hmmm, that suggests to me that a whole line from the array is used as a key value, with all 50+ spaces. So those 10,000 lines turn into 1,000,000+ bytes just for raw key bytes. A lot, but not crazy. Now, we know that the first word on the line MUST be unique, so why not exploit that:
$hash_var{$1} = xxx if /(\S+)/;
The program may also want to exploit the fact that the leading strings are highly repetitive, and substitute everything before the "]" with an ever-increasing directory number, maintained in a 'look-aside' array and/or hash.
Personally I would drop /NOHEAD from the command, and use a regex to pick up the directories as they come by on their own lines.
Or use a SUBSTR or whatever... of course you'd need to construct a similar key on re-access.
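A rough sketch of how those pieces could fit together; this is not the original script, the command is only a placeholder, and the reporting step is left out:
use strict;
use warnings;

my %dir_index;      # look-aside map: directory prefix -> small number
my %seen_date;      # "<dir-number>/<file-name>" -> date string
my $next_dir = 1;

open my $dir, '-|', 'directory command here' or die "Can't run command: $!";
while ( my $line = <$dir> ) {
    next unless $line =~ m{^(.+\])([^;]+);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)};
    my ( $prefix, $name, $date ) = ( $1, $2, "$3-$4-$5 $6:$7:$8" );
    $dir_index{$prefix} //= $next_dir++;                 # number each directory only once
    $seen_date{"$dir_index{$prefix}/$name"} = $date;     # short key, not the whole line
}
close $dir;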
In the related topic (Perl encounters "out of memory" in openvms system) there is debugging output printed.
Perhaps include the line number in the array for your own understanding?
Good luck!
Hein

build a control file to reformat source file with <wbr>

My problem: long chemical terms, without any guidance to a browser about where to break the term. Some terms are over 70 characters.
My goal: introduce <wbr> at logical insertion points.
Example of problem:
isoquinolinetetramethylenesulfoxidetetrachlororuthenate (55 chars)
Example of opportunities to break a chemical term (e.g. the way a person would pronounce the term as opposed to typing the term):
iso<wbr>quinoline
tetra<wbr>methylene
methylene<wbr>sulfoxide
tetra<wbr>chloro
Usually (but not always) iso, tetra, and methyl are word_break_opportunities.
In general how should I set up an environment with:
control file with "rules" that introduce word_break opportunities
file on which to apply the rules from the control file
The control file will be updated with new rules as new chemical terms are encountered.
Would like to use: sed, awk, regex.
Perhaps the environment would look like:
awk rules.awk inputfile.txt > outputfile.txt
I am prepared for trial and error, so I would appreciate a basic explanation so that I can refine the control file.
My platform: Windows 7; 64 bit; 8 GB memory; GNUwin32; sed 4.1.5.4013; awk 3.1.6.2962
Thank you in advance.
Your first job is to come up with a list of what is and isn't breakable. Once you have this you can define a format to interpret, and build some code around it.
For example, I would probably go something like:
1. Opening chars:
iso
tetra
then some code like:
for each openingString {
    if (string.startsWith(openingString)) {
        insert wbr after openingString
    }
}
2. Opening chars, unless followed by:
iso|"tope|bob"
tetra|"pak"
for each openingString {
    if (string.startsWith(openingString)) {
        get the next element from the row (after the |, surrounded by ")
        split it around the |
        for each part {
            if (!string.startsWith(part, openingString.length)) {
                insert wbr after openingString
            }
        }
    }
}
Then build up from there. It's a pretty monumental task, though; it's going to take a lot of building on to get to something useful, but it's doable if you're committed to it. The first step is to decide how you're going to hold these mappings.
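To make the first format concrete, here is a minimal Perl sketch (you mentioned sed/awk, but the idea carries over; the script and file names are made up). Note that it inserts <wbr> after every occurrence of a listed prefix, not only at the start of the term, since your examples break mid-word:
use strict;
use warnings;

my ( $rules_file, $input_file ) = @ARGV;

open my $rules_fh, '<', $rules_file or die "Can't open $rules_file: $!";
chomp( my @prefixes = <$rules_fh> );
close $rules_fh;

# Longest prefixes first, so e.g. "methylene" wins over a shorter rule it contains.
my $pattern = join '|',
    map  { quotemeta }
    sort { length $b <=> length $a }
    grep { /\S/ } @prefixes;

open my $in_fh, '<', $input_file or die "Can't open $input_file: $!";
while ( my $line = <$in_fh> ) {
    $line =~ s/($pattern)/$1<wbr>/g;   # break opportunity after every listed prefix
    print $line;
}
close $in_fh;
Run it as: perl breakterms.pl rules.txt inputfile.txt > outputfile.txt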

Windows Batch File - Find and return string inside matching pattern

I'm using a batch file to identify and load fonts temporarily. It looks for strings like /FontFamily(Rubber Dinghy Rapids)/ occurring inside .ai and .pdf files.
Now if I do findstr /r FontFamily\(.*\) MyFile.ai, this command returns a hugely interminable line of crap data with FontFamily(Rubber Dinghy Rapids) lost somewhere in there. I ACTUALLY need it to return the value of .* it found inside - in this case Rubber Dinghy Rapids.
Can I do this more elegantly? Or maybe I can switch to using VBScript if it's more elegant there?
My current solution is slow as hell... nested for loops, with one of them delimiting the crap data by the ( character, then finding the line that says FontFamily(Rubber Dinghy Rapids then stripping out the FontFamily( string, leaving me finally with Rubber Dinghy Rapids.
I wrote a hybrid Batch-JScript program called FindRepl.bat that uses JScript's regular expressions to search for strings in a file. Using my program you may solve your problem this way:
FindRepl.bat "FontFamily\((.*)\)" /$:1 < input.txt
You may get my program from this site.

Perl splitting text string (from HTML page, text document, etc.) by line into array?

This is kind of a weird question, at least for me, as I don't exactly understand what is fully involved in this. Basically, I have been doing this process where I save a scraped document (such as a web page) to a .txt file. Then I can easily use Perl to read this file and put each line into an array. However, it is not doing this based on any visible thing in the document (i.e., it is not going by HTML linebreaks); it just knows where a new line is, based on the .txt format.
However, I would like to cut this process out and just do the same thing from within a variable: instead, I would have what would have been the contents of the .txt file in a string, and then I want to parse it, in the same way, line by line. The problem for me is that I don't know much about how this would work, as I don't really understand how Perl would be able to tell where a new line is (assuming I'm not going by HTML linebreaks; often it is just a web-based .txt file, which presents to my scraper, WWW::Mechanize, as a web page, so there is no HTML to go by). I figure I can do this using other parameters, such as blank spaces, but am interested to know if there is a way to do this by line. Any info is appreciated.
I'd like to cut the actual saving of a file to reduce issues related to permissions on servers I use and also am just curious if I can make the process more efficient.
Here's an idea that might help you: you can open from strings as well as files.
So if you used to do this:
open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my @list = <$io>;
You can just do this:
open( my $io, '<', \$text_I_captured );
my @list = <$io>;
It's hard to tell what your code's doing since we don't have it in front of us; it would be easier to help if you posted what you had. However, I'll give it a shot. If you scrape the text into a variable, you will have a string which may have embedded line breaks. These will either be \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can instead split on the newline sequence to get the lines. Thus, the single line you'll need should be
my @lines = split(/\r?\n/, $scraped_text);
Use the $/ variable; this determines what to break lines on. So:
local $/ = " ";
while(<FILE>)...
would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was - or better yet, go out of the local $/ scope and let the global one come back, just in case it was something other than "\n" to begin with.
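For instance (the file name is just a placeholder):
use strict;
use warnings;

open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";
{
    local $/ = " ";                    # break on spaces only inside this block
    while ( my $chunk = <$fh> ) {
        print "chunk='$chunk'\n";
    }
}                                      # previous value of $/ is restored here
close $fh;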
You can eliminate it altogether:
local $/ = undef;
to read whole files in one slurp, and then iterate through them however you like. Just be aware that if you do a split or a splice, you may end up copying the string over and over, using lots of CPU and lots of memory. One way to do it with less overhead is:
# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'
That's one way to break things apart by newlines, for example. \G matches either the beginning of the string or the end of the previous match, within a /g regex.
Another weird tidbit is $/=\10... if you give it a scalar reference to an integer (here 10), you can get record-length chunks:
# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'
More info: http://www.perl.com/pub/a/2004/06/18/variables.html
If you combine this with FM's answer of using:
$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }
I think you can get every combination of what you need...