regex_replace in large string - c++

std/boost regex_replace returns the modified string by value. In my case I have to search and replace by regex in a file. I have thousands of files to process, and many of them are over 1 MB in size. The string to be replaced is rare (e.g. only 5-10% of files will contain it).
So, with the available interface, is it possible to run a regex replace and, if the searched string isn't found, avoid creating a copy of the 1 MB buffer?
I can't figure out whether the C++ regex interface is so limited that the only way is to first search for the string in my buffer and call regex_replace only if it is found (effectively searching a second time). Or can I reuse the results from regex_match or regex_search and pass them to regex_replace?
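For illustration, the "search first, replace only on a hit" idea from the question could look roughly like this with std::regex. This is a minimal sketch, not a library facility: replace_if_found is a hypothetical helper name, and the matching files are still scanned twice.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces in place and reports whether anything matched.
// When the pattern is absent, the buffer is left untouched and no second
// buffer is allocated.
bool replace_if_found(std::string& buffer, const std::regex& re,
                      const std::string& replacement) {
    if (!std::regex_search(buffer, re)) {
        return false;  // common case in the question: nothing to replace
    }
    // Rare case: pay for the copy only when there is at least one match.
    buffer = std::regex_replace(buffer, re, replacement);
    return true;
}
```

With only 5-10% of files matching, the extra search is paid on every file, but the 1 MB copy is made only for the files that actually change.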

Related

Dart regex match against a stream

I'm writing code to parse a text file.
The user of my library can pass a line delimiter as a regex.
The file may be large so I need to stream the contents.
So the question is: how do I apply the regex to the stream as it passes through my line matcher?
I will apply a limit so that the line delimiter matched by the regex may not be longer than 100 chars; otherwise the regex has the potential to match the entire contents of the file.
I can't just buffer the 100-char maximum, as the delimiter may span the buffer boundary.
The only idea I can think of is preparsing the regex into segments and checking for partial matches as I go.
Any better ideas?
It's a thorny issue, and not one which has a simple solution.
Your file is large, so you do not want to load it entirely into memory first. That's reasonable.
You do need to buffer some of it, at least everything after the last detected line delimiter, so that you can combine that with the next chunk in order to look for delimiters that may be split between the chunks.
That would be my initial approach: Keep a "prefix" string, which is everything after the last line delimiter, and when you receive a new chunk, concatenate that onto the prefix, and then check for line delimiters in the entire available string. If the prefix is (way) more than 100 chars, you can split the prefix into the part that is definitely not part of the delimiter, which you then put into your StringBuffer directly, and the last 99 characters which you combine with the next chunk. I'd benchmark that, because it's not at all obvious that it'll be faster than just concatenating the entire thing, but it might be if you get lines spanning many chunks.
If you allow arbitrary RegExps, then there is no simpler solution. (And even that does not allow a RegExp which uses look-ahead or look-behind to check characters outside of the 100-char match, perhaps even in earlier or later chunks; you really need the entire file in memory as a string for that kind of shenanigans to work.)
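The prefix-plus-chunk approach can be sketched as follows. The question is about Dart; this sketch uses C++ and std::regex purely for illustration, and ChunkLineSplitter, feed, finish and max_delim are hypothetical names. It assumes the delimiter regex never matches the empty string and does not rely on anchors or look-around. A match is accepted only once max_delim characters after its start are available, so a delimiter that is split between two chunks is never cut short.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

class ChunkLineSplitter {
public:
    ChunkLineSplitter(std::regex delimiter, std::size_t max_delim)
        : delimiter_(std::move(delimiter)), max_delim_(max_delim) {}

    // Append a chunk and return every line now known to be complete.
    std::vector<std::string> feed(const std::string& chunk) {
        pending_ += chunk;
        return extract(false);
    }

    // Call once at end of input; the unterminated tail becomes the last line.
    std::vector<std::string> finish() {
        std::vector<std::string> lines = extract(true);
        if (!pending_.empty()) lines.push_back(pending_);
        pending_.clear();
        return lines;
    }

private:
    std::vector<std::string> extract(bool at_end) {
        std::vector<std::string> lines;
        std::size_t line_start = 0;
        std::smatch m;
        auto from = pending_.cbegin();
        while (std::regex_search(from, pending_.cend(), m, delimiter_)) {
            const std::size_t start = m[0].first - pending_.cbegin();
            const std::size_t end = m[0].second - pending_.cbegin();
            if (end == start) break;  // empty match: outside this sketch's assumptions
            // Too close to the end: the next chunk could still extend this match.
            if (!at_end && start + max_delim_ > pending_.size()) break;
            lines.emplace_back(pending_, line_start, start - line_start);
            line_start = end;
            from = pending_.cbegin() + end;
        }
        pending_.erase(0, line_start);  // keep only the text after the last delimiter
        return lines;
    }

    std::regex delimiter_;
    std::size_t max_delim_;
    std::string pending_;  // the "prefix": everything after the last delimiter
};
```

Feeding "A\r" and then "\nB\nC" with the delimiter \r\n|\r|\n and max_delim = 2 yields nothing for the first chunk and the lines A and B for the second; finish() then returns C. The optimization of moving the definitely-not-delimiter part of a long prefix out of the buffer early is left out here.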
Now, if that's too inefficient, perhaps because some lines are large, or some chunks are small, and you are doing a lot of copying to do the concatenation, then I'd move away from using RegExps (and you should be using Pattern, not RegExp, as the type you accept already), and start using just strings or code unit sequences to search for.
Then it will make sense to scan each incoming chunk to the end, and remember whether you have seen a partial delimiter, and how much of one, then you can continue with the next chunk and not need to first combine them in memory to run a RegExp over the combination.
It will even allow you to search for the delimiter in incoming bytes instead of converting them to a string first, reducing the copy-overhead even more.
It's a little tricky if you allow arbitrary character sequences as line delimiters. For example, using \r\n\r\n\t as delimiter, if you have seen \r\n\r\n at the end of one chunk, you need to be able to recognize both \t and \r\n\t at the start of the next.
(You might be able to adapt something like KMP string search for this purpose, or just disallow line delimiters that are not fairly simple).
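One way the KMP idea could be adapted is to keep the "how much of the delimiter have I seen" counter alive between chunks. The sketch below is in C++ for illustration only; StreamingMatcher and feed are hypothetical names, the pattern is assumed to be non-empty, and matches are reported as stream offsets just past the end of each delimiter.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class StreamingMatcher {
public:
    explicit StreamingMatcher(std::string pattern)
        : pat_(std::move(pattern)), fail_(pat_.size(), 0) {
        // Standard KMP failure table: fail_[i] is the length of the longest
        // proper prefix of pat_[0..i] that is also a suffix of it.
        for (std::size_t i = 1, k = 0; i < pat_.size(); ++i) {
            while (k > 0 && pat_[i] != pat_[k]) k = fail_[k - 1];
            if (pat_[i] == pat_[k]) ++k;
            fail_[i] = k;
        }
    }

    // Scan one chunk; returns the stream offset just past each complete match.
    // A partial match at the end of the chunk is remembered in matched_.
    std::vector<std::size_t> feed(const std::string& chunk) {
        std::vector<std::size_t> ends;
        for (char c : chunk) {
            while (matched_ > 0 && c != pat_[matched_]) matched_ = fail_[matched_ - 1];
            if (c == pat_[matched_]) ++matched_;
            ++offset_;
            if (matched_ == pat_.size()) {
                ends.push_back(offset_);
                matched_ = 0;  // delimiters do not overlap, so start over
            }
        }
        return ends;
    }

private:
    std::string pat_;
    std::vector<std::size_t> fail_;
    std::size_t matched_ = 0;  // length of the partial match carried across chunks
    std::size_t offset_ = 0;   // total characters consumed so far
};
```

With the delimiter \r\n\r\n\t, a chunk ending in \r\n\r\n leaves the matcher four characters into the pattern; the next chunk may start with either \t or \r\n\t and both are recognized, because the failure table falls back to the two-character prefix \r\n when the fifth character is \r.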
If you had a RegExp implementation based on a state machine, it would be trivial(ish) to keep the state at the end of one chunk and continue matching in the next chunk, but the Dart (JavaScript) RegExp is not such an implementation, and it isn't built to do partial matching.
Converting the RegExp itself into one which matches prefixes of what the original would match, and detecting which prefix, is not something I'd recommend. Simply because it's very non-trivial. And that's for RegExps that are actually regular, which the Dart ones are not (back-references are non-regular).

Using a regex_iterator on an istream

I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
Read the istream a character at a time
Collect the characters
When a delimiter is hit return the collection as a token
Regexes are very good at tokenizing strings with complex delimiters:
string foo{ "A\nB\rC\n\r" };
vector<string> bar;
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), regex("(.*)(?:\n\r?|\r)")), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
This question is about code appearance:
Since we know that a regex will work 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time rather than internally reading and parsing the istream 1 character at a time
Since parsing an istream 1 character at a time will still copy that one character to a temp variable (buffer) this code seeks to avoid buffering all the code internally, depending on a library instead to abstract that
C++11's regexes default to the ECMA-262 grammar, which supports look-aheads but not look-behinds: https://stackoverflow.com/a/14539500/2642059 This suggests that a regex could match using only an input_iterator_tag, but clearly those implemented in C++11 do not.
boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available in the C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
In the worst case, like "ABC\n" this saves the user no size and he must slurp the whole istream
If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
Any time the buffer size selected is too small, additional computations will be required compared to the slurping of the entire file, therefore this method excels in a delimiter-dense string
The inclusion of boost always causes bloat
Circling back to answer the question: A standard library regex_iterator cannot operate on an input_iterator_tag, so slurping of the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping of the whole file anyway, it is not a good answer to this question.
For the best code appearance slurping the whole file and running a standard regex_iterator over it is your best bet.
I think not. istream_iterator has the input_iterator_tag tag, whereas regex_iterator expects to be initialized using bi-directional iterators (bidirectional_iterator_tag).
If your delimiter regex is complex enough that you want to avoid parsing the stream yourself, the best way to do this is indeed to slurp the istream.
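The slurp-then-tokenize route recommended in these answers can be sketched as below. tokenize is a hypothetical helper name; it uses std::sregex_token_iterator with submatch index -1, which yields the text between delimiter matches rather than the matches themselves.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Reads the whole stream into memory, then splits it on the delimiter regex.
std::vector<std::string> tokenize(std::istream& in, const std::regex& delimiter) {
    // Extra parentheses avoid the "most vexing parse".
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    std::vector<std::string> tokens;
    // Index -1 selects the parts of the text that the regex did NOT match.
    std::sregex_token_iterator it(text.cbegin(), text.cend(), delimiter, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Called on a stream holding "A\nB\rC\n\r" with the delimiter \n\r?|\r, this returns A, B and C; no empty token is produced after the final delimiter.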

How to find special values in large file using C++ or C

I have some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and that its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I am considering using an embedded database (e.g. Berkeley DB), but the file I want to search in is very dynamic and I see a problem with loading it into the database every time. Due to limited memory it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined with the need to parse whole lines of text, and perform pattern matching on the first 10-chars, something interpreted, such as python or perl would seem to be simpler.
How about:
pattern = '0123456789'  # <-- replace with pattern
with open('myfile.txt') as f:
    for line in f:
        # Only the first ten characters of each line can match.
        if line.startswith(pattern):
            print("Eureka!")
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to search the read-in buffers for "\npattern" or some such (that is, including the newline character in your search), for example with memchr for the newline followed by memcmp for the rest, but you make it sound like the pattern is not precisely constant. Assuming it is not precisely constant, the most obvious method (see first paragraph) is the thing to do.
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
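To give an idea of what the Aho-Corasick automaton involves, here is a compact sketch. It is not the linked C or Go implementation: the class and method names are hypothetical, patterns are assumed to be non-empty, and it trades memory for simplicity by giving every node a full 256-entry transition table.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

class AhoCorasick {
public:
    explicit AhoCorasick(const std::vector<std::string>& patterns) {
        nodes_.emplace_back();  // root
        for (std::size_t id = 0; id < patterns.size(); ++id) {  // build the trie
            int cur = 0;
            for (unsigned char c : patterns[id]) {
                if (nodes_[cur].next[c] < 0) {
                    nodes_[cur].next[c] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                cur = nodes_[cur].next[c];
            }
            nodes_[cur].ends.push_back(id);
        }
        std::queue<int> q;  // breadth-first pass: failure links + full transitions
        for (int c = 0; c < 256; ++c) {
            int& t = nodes_[0].next[c];
            if (t < 0) t = 0; else q.push(t);
        }
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            // Patterns ending at the failure target also end here.
            const std::vector<std::size_t>& inherited = nodes_[nodes_[u].fail].ends;
            nodes_[u].ends.insert(nodes_[u].ends.end(), inherited.begin(), inherited.end());
            for (int c = 0; c < 256; ++c) {
                const int v = nodes_[u].next[c];
                const int via_fail = nodes_[nodes_[u].fail].next[c];
                if (v < 0) nodes_[u].next[c] = via_fail;
                else { nodes_[v].fail = via_fail; q.push(v); }
            }
        }
    }

    // Returns (offset just past the match, pattern index) for every occurrence.
    std::vector<std::pair<std::size_t, std::size_t>> find_all(const std::string& text) const {
        std::vector<std::pair<std::size_t, std::size_t>> out;
        int state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            state = nodes_[state].next[static_cast<unsigned char>(text[i])];
            for (std::size_t id : nodes_[state].ends) out.emplace_back(i + 1, id);
        }
        return out;
    }

private:
    struct Node {
        std::array<int, 256> next;
        int fail = 0;
        std::vector<std::size_t> ends;  // indices of patterns ending at this node
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes_;
};
```

The text is scanned once regardless of how many values are searched for. Since the only state carried between characters is a single integer, the same loop could be fed buffer by buffer for a file that does not fit in memory.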
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be loading and searching for say first 64 MB in memory, unload this then load the next 64 MB and so on (in multiples of 4 KB so that you are not overlooking any text which might be split at the block boundary)
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
/* skip two-byte units that cannot start the pattern (patterns[] == 0) */
while( (p < EOB) && ( !patterns[*p] ) ) ++p;
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.

Regular Expression performance issue to parse big contents

Suppose I have to filter some text from a file. Then I have two solutions:
Either I read all the contents of the file into a single variable (via a FileInputStream or something else), which can then be parsed using a regular expression.
Or I use a loop to read the file line by line, and then apply either a regular expression or some string function to each line.
Which method will be better and faster?
Most regular expression libraries (such as PCRE) are very efficient and highly optimized, so I say go with the first option.
But of course if performance is very important to you, you should use a profiler anyway; it could give you a better answer for your exact situation.

Incremental Stream Parsing in C++

I am reading data from a (serial) port using a non-blocking read() function (in C/C++).
This means the data comes in as chunks of undetermined (but reported) sizes (including 0) each time I "poll" the port. I then need to parse this "stream" for certain patterns (not XML).
My naive implementation concatenates the new string to the previous stream-string each time read() returns a non-zero buffer, and re-parses the whole string.
When a pattern is matched, the relevant part is discarded leaving only the tail of the string for the next time.
Obviously there are much more efficient ways to do this, for example incremental parsing a la SAX, deque-like buffers or similar string slices, etc.
Also, obviously, I am not the first to have to do this type of stream parsing.
Does anyone know of any library that already does this sort of thing?
Preventing memory-overflow in case of missing pattern matches would also be a big plus.
Thanks,
Adi
You can do some tricks depending on your pattern.
Looking for one character like a newline you only need to scan the new portion of the string.
If you are looking for \r\n then you only need to scan the new portion starting with the last character of the previous portion.
If you have a pattern with a known ending part then you only have to scan for that ending to know if you need to scan the whole buffer.
If you have some kind of synchronizing character, like semicolon in C/C++, then you can parse or declare a parse error whenever you find one.
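As a rough illustration of the \r\n trick, the sketch below rescans only the new data plus the last byte of what was already buffered, and caps the buffer so that a missing terminator cannot grow it without bound. CrlfFramer, feed and max_pending are hypothetical names, and discarding the oldest bytes on overflow is just one possible policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class CrlfFramer {
public:
    // max_pending (assumed >= 1) bounds the unterminated data kept in memory.
    explicit CrlfFramer(std::size_t max_pending) : max_pending_(max_pending) {}

    // Accepts one chunk (possibly empty) and returns the frames it completed.
    std::vector<std::string> feed(const std::string& chunk) {
        std::vector<std::string> frames;
        // The old buffer holds no "\r\n", so start one byte before the new
        // data: that covers a "\r" left at the end of the previous chunk.
        std::size_t scan_from = buf_.empty() ? 0 : buf_.size() - 1;
        buf_ += chunk;
        std::size_t frame_start = 0;
        std::size_t pos;
        while ((pos = buf_.find("\r\n", scan_from)) != std::string::npos) {
            frames.emplace_back(buf_, frame_start, pos - frame_start);
            frame_start = scan_from = pos + 2;
        }
        buf_.erase(0, frame_start);  // keep only the unterminated tail
        if (buf_.size() > max_pending_) {
            buf_.erase(0, buf_.size() - max_pending_);  // drop the oldest bytes
        }
        return frames;
    }

private:
    std::size_t max_pending_;
    std::string buf_;
};
```

Each byte is examined at most twice no matter how the port splits the data, instead of the whole accumulated string being re-parsed on every read().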