Incremental Stream Parsing in C++

I am reading data from a (serial) port using a non-blocking read() function (in C/C++).
This means the data comes in in chunks of undetermined (but reported) sizes (incl. 0) each time I "poll" the port. I then need to parse this "stream" for certain patterns (not XML).
My naive implementation concatenates the new string to the previous stream-string each time read() returns a non-zero buffer, and re-parses the whole string.
When a pattern is matched, the relevant part is discarded leaving only the tail of the string for the next time.
Obviously there are much more efficient ways to do this, for example incremental parsing à la SAX, deque-like buffers, string slices, etc.
Also, obviously, I am not the first to have to do this type of stream parsing.
Does anyone know of any library that already does this sort of thing?
Preventing memory overflow in case of missing pattern matches would also be a big plus.
Thanks,
Adi

You can do some tricks depending on your pattern.
If you are looking for a single character like a newline, you only need to scan the new portion of the string.
If you are looking for \r\n then you only need to scan the new portion starting with the last character of the previous portion.
If you have a pattern with a known ending part then you only have to scan for that ending to know if you need to scan the whole buffer.
If you have some kind of synchronizing character, like semicolon in C/C++, then you can parse or declare a parse error whenever you find one.
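For illustration, here is a minimal sketch of that second trick (the names pending, feed and on_record are just placeholders): each chunk from read() is appended to a tail buffer, but the search for "\r\n" restarts at most one byte before the new data, so nothing already scanned is re-scanned.

#include <string>
#include <cstddef>

std::string pending;                          // unconsumed tail of the stream

// Call once per non-empty chunk returned by read().
template <typename Callback>
void feed(const char* data, std::size_t len, Callback on_record)
{
    // Re-examine at most the last byte of the old tail (it might be a lone '\r').
    std::size_t scan_from = pending.empty() ? 0 : pending.size() - 1;
    pending.append(data, len);
    std::size_t pos;
    while ((pos = pending.find("\r\n", scan_from)) != std::string::npos) {
        on_record(pending.substr(0, pos));    // complete record, minus the "\r\n"
        pending.erase(0, pos + 2);
        scan_from = 0;
    }
}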

Related

Dart regex match against a stream

I'm writing code to parse a text file.
The user of my library can pass a line delimiter as a regex.
The file may be large so I need to stream the contents.
So the question is how do I apply the regex to the stream as it passes through my line matcher.
I will apply a limit so the line delimiter matched by the regex may not be greater than 100 chars otherwise the regex has the potential to match the entire contents of the file.
I can't just buffer the 100 char max as the delimiter may span the buffer.
The only idea I can think of is preparsing the regex into segments and checking for partial matches as I go.
Any better ideas?
It's a thorny issue, and not one which has a simple solution.
Your file is large, so you do not want to load it entirely into memory first. That's reasonable.
You do need to buffer some of it, at least everything after the last detected line delimiter, so that you can combine that with the next chunk in order to look for delimiters that may be split between the chunks.
That would be my initial approach: Keep a "prefix" string, which is everything after the last line delimiter, and when you receive a new chunk, concatenate that onto the prefix, and then check for line delimiters in the entire available string. If the prefix is (way) more than 100 chars, you can split the prefix into the part that is definitely not part of the delimiter, which you then put into your StringBuffer directly, and the last 99 characters which you combine with the next chunk. I'd benchmark that, because it's not at all obvious that it'll be faster than just concatenating the entire thing, but it might be if you get lines spanning many chunks.
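A rough sketch of that bookkeeping (shown in C++ only because that is the language of the rest of this page; the idea is language-agnostic, the 100-character limit comes from the question, and 'flushed' plays the role of the StringBuffer mentioned above): the part of the prefix that can no longer contain a delimiter is passed straight through, and only the last 99 characters are kept to combine with the next chunk.

#include <string>

std::string prefix;   // everything after the last detected line delimiter

// Call once per chunk; 'flushed' receives bytes that cannot be part of a delimiter.
void on_chunk(const std::string& chunk, std::string& flushed)
{
    prefix += chunk;
    // ... search 'prefix' for line delimiters here, emitting complete lines ...
    if (prefix.size() > 99) {
        flushed.append(prefix, 0, prefix.size() - 99);  // definitely not delimiter material
        prefix.erase(0, prefix.size() - 99);            // keep the last 99 chars
    }
}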
If you allow arbitrary RegExps, then there is no simpler solution. (And even that does not allow a RegExp which uses look-ahead or look-behind to check characters outside of the 100-char match, perhaps even in earlier or later chunks; you really need the entire file in memory as a string for that kind of shenanigans to work.)
Now, if that's too inefficient, perhaps because some lines are large, or some chunks are small, and you are doing a lot of copying to do the concatenation, then I'd move away from using RegExps (and you should be using Pattern, not RegExp, as the type you accept already), and start using just strings or code unit sequences to search for.
Then it will make sense to scan each incoming chunk to the end, and remember whether you have seen a partial delimiter, and how much of one, then you can continue with the next chunk and not need to first combine them in memory to run a RegExp over the combination.
It will even allow you to search for the delimiter in incoming bytes instead of converting them to a string first, reducing the copy-overhead even more.
It's a little tricky if you allow arbitrary character sequences as line delimiters. For example, using \r\n\r\n\t as delimiter, if you have seen \r\n\r\n at the end of one chunk, you need to be able to recognize both \t and \r\n\t at the start of the next.
(You might be able to adapt something like KMP string search for this purpose, or just disallow line delimiters that are not fairly simple).
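To make the KMP idea concrete, here is a sketch (in C++, to match the rest of the page; StreamMatcher and feed are made-up names) of a matcher that carries its state across chunk boundaries, so a delimiter such as \r\n\r\n\t is found even when it is split between chunks:

#include <string>
#include <vector>
#include <cstddef>

struct StreamMatcher {
    std::string pat;
    std::vector<int> fail;
    int state = 0;                            // pattern chars matched so far

    explicit StreamMatcher(std::string p) : pat(std::move(p)), fail(pat.size(), 0) {
        // Standard KMP failure function.
        for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
            while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
            if (pat[i] == pat[k]) ++k;
            fail[i] = k;
        }
    }

    // Feed one chunk; returns the indices within this chunk where a full
    // delimiter ends. The match state persists between calls.
    std::vector<std::size_t> feed(const char* data, std::size_t len) {
        std::vector<std::size_t> ends;
        for (std::size_t i = 0; i < len; ++i) {
            while (state > 0 && data[i] != pat[state]) state = fail[state - 1];
            if (data[i] == pat[state]) ++state;
            if (state == (int)pat.size()) {   // full delimiter matched
                ends.push_back(i);
                state = fail[state - 1];      // allow overlapping matches
            }
        }
        return ends;
    }
};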
If you had a RegExp implementation based on a state machine, it would be trivial(ish) to keep the state at the end of one chunk and continue matching in the next chunk, but the Dart (JavaScript) RegExp is not such an implementation, and it isn't built to do partial matching.
Converting the RegExp itself into one which matches prefixes of what the original would match, and detecting which prefix, is not something I'd recommend. Simply because it's very non-trivial. And that's for RegExps that are actually regular, which the Dart ones are not (back-references are non-regular).

regex_replace in large string

std/boost regex_replace returns modified string by value. In my case I have to search/replace by regex in a file. I have thousands of files to process and many of them are over 1MB in size. The string to be searched and replaced is rare (e.g. only 5-10% of files will have it).
So, with the available interface, is it possible to run regex_replace and, if the searched string isn't found, avoid creating a copy of the 1 MB buffer?
I just can't figure it out: is the regex interface in C++ so limited that the only way is to first search for the string in my buffer and, only if it's found, then use regex_replace (effectively searching a second time)? Or can I reuse the results from regex_match or regex_search and pass them to regex_replace?
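As far as I know there is no std::regex_replace overload that accepts an existing match_results, but you can at least gate the replace behind a regex_search so that the 90-95% of files with no match never pay for a copy (the pattern is still scanned twice in the files that do match). A rough sketch:

#include <iterator>
#include <regex>
#include <string>

// Returns true if anything was replaced; 'buf' is left untouched otherwise.
bool replace_if_found(std::string& buf, const std::regex& re, const std::string& fmt)
{
    if (!std::regex_search(buf, re))
        return false;                         // common case: no copy at all
    std::string out;
    out.reserve(buf.size());
    std::regex_replace(std::back_inserter(out), buf.begin(), buf.end(), re, fmt);
    buf.swap(out);                            // one extra buffer, only when needed
    return true;
}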

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through the data to find the next item. With a prefixed size, you can look at the size and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that, if you're using something to mark the end of a field, it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers, which might otherwise be translated from a newline character to (for example) a carriage-return/line-feed pair, won't be. Depending on the OS you're using, it might not even mean that much though; for example, on Linux there's normally no difference between binary and text mode at all.
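For contrast, a minimal sketch of the size-prefixed convention mentioned above (a fixed 4-byte length followed by the raw bytes; error handling is kept to a minimum):

#include <cstdint>
#include <fstream>
#include <string>

// Write a length-prefixed string: 4-byte size, then the raw characters.
void write_record(std::ofstream& out, const std::string& s)
{
    std::uint32_t n = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(s.data(), n);
}

// Read it back; the size tells us exactly how much to read (or how far to
// seekg if we want to skip this record entirely).
bool read_record(std::ifstream& in, std::string& s)
{
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    s.resize(n);
    return static_cast<bool>(in.read(&s[0], n));
}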
Well, there are no rules broken and you'll get away with that just fine, except that you may miss some of the precision of reading binary data from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get such info from the size of the string you passed into std::getline. But the stream will no longer encapsulate the number of bytes you consumed in the last unformatted operation.

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read characters that satisfy a certain condition from stdin with the iostream library, while leaving those that don't satisfy the condition in stdin so that the skipped characters can be read later. Is this possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First read in all characters in a-c (abca); after this input is finished, start reading the remaining characters (dddx). (These two reads can't happen simultaneously; they might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
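For the concrete example in the question, even a plain loop is enough to do the split (no regex needed); a short sketch:

#include <iostream>
#include <string>

int main()
{
    std::string all, wanted, rest;
    std::cin >> all;                          // read everything first, e.g. "abdddcxa"
    for (char c : all)
        (c >= 'a' && c <= 'c' ? wanted : rest) += c;
    // The first function gets "abca", the second gets "dddx".
    std::cout << wanted << '\n' << rest << '\n';
}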
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
string myString;
char putbackBuf[64] = {};  // characters to be put back, in reverse order, '\0'-terminated
cin >> myString;
// Do stuff to fill putbackBuf[] with characters in reverse order to be put back
const char* pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think so, but iostreams gurus may differ).

How to find special values in large file using C++ or C

I've some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I considered using an embedded database (e.g. Berkeley DB), but the file I want to search is very dynamic and I see a problem with importing it into the database every time. Due to a limit of memory it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined by the need to parse whole lines of text and perform pattern matching on the first 10 chars, something interpreted, such as Python or Perl, would seem to be simpler.
How about:
pattern = '0123456789'  # <-- replace with pattern
with open('myfile.txt') as f:
    for line in f:
        if line.startswith(pattern):
            print("Eureka!")
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search), but you make it sound like the pattern is not precisely constant. Assuming it is not precisely constant, the approach in the first paragraph is the most obvious thing to do.
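A sketch of that first-paragraph approach (the file name and the ten-character value are placeholders; it assumes lines fit in the buffer):

#include <cstdio>
#include <cstring>

int main()
{
    const char value[] = "0123456789";        // the ten-character value
    char line[4096];                          // assumes no line is longer than this
    FILE* f = fopen("huge.txt", "r");
    if (!f) { perror("fopen"); return 1; }
    long lineno = 0;
    while (fgets(line, sizeof line, f)) {
        ++lineno;
        if (strncmp(line, value, 10) == 0)    // value can only be at the start of a line
            printf("found on line %ld\n", lineno);
    }
    fclose(f);
}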
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory-mapped files before searching?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be to load and search, say, the first 64 MB in memory, unload it, then load the next 64 MB, and so on (in multiples of 4 KB, so that you are not overlooking any text that might be split at the block boundary).
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
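A minimal POSIX sketch of the memory-mapping suggestion above, using memchr to hop from one line start to the next (file name and value are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    const char* needle = "0123456789";        // the ten-character value
    int fd = open("huge.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    const char* base = static_cast<const char*>(mem);
    const char* p    = base;
    const char* end  = base + st.st_size;
    while (p < end) {                         // p is always at the start of a line
        if (end - p >= 10 && memcmp(p, needle, 10) == 0)
            printf("match at offset %ld\n", (long)(p - base));
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        if (!nl) break;
        p = nl + 1;
    }
    munmap(mem, st.st_size);
    close(fd);
}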
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
while( (p < EOB) && ( !patterns[*p] ) ) ++p;  /* skip positions that can't start a match */
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.