Dart regex match against a stream

I'm writing code to parse a text file.
The user of my library can pass a line delimiter as a regex.
The file may be large so I need to stream the contents.
So the question is: how do I apply the regex to the stream as it passes through my line matcher?
I will apply a limit so that the line delimiter matched by the regex may not be longer than 100 chars; otherwise the regex has the potential to match the entire contents of the file.
I can't just buffer the 100-char max, as the delimiter may span the buffer boundary.
The only idea I can think of is pre-parsing the regex into segments and checking for partial matches as I go.
Any better ideas?

It's a thorny issue, and not one which has a simple solution.
Your file is large, so you do not want to load it entirely into memory first. That's reasonable.
You do need to buffer some of it, at least everything after the last detected line delimiter, so that you can combine that with the next chunk in order to look for delimiters that may be split between the chunks.
That would be my initial approach: Keep a "prefix" string, which is everything after the last line delimiter, and when you receive a new chunk, concatenate that onto the prefix, and then check for line delimiters in the entire available string. If the prefix is (way) more than 100 chars, you can split the prefix into the part that is definitely not part of the delimiter, which you then put into your StringBuffer directly, and the last 99 characters which you combine with the next chunk. I'd benchmark that, because it's not at all obvious that it'll be faster than just concatenating the entire thing, but it might be if you get lines spanning many chunks.
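A minimal sketch of that approach in Dart (splitOn is an illustrative name, not a library function; it omits the 100-char prefix-trimming optimization described above):
Stream<String> splitOn(Stream<String> input, Pattern delimiter) async* {
  var prefix = ''; // everything after the last delimiter seen so far
  await for (final chunk in input) {
    prefix += chunk;
    var start = 0;
    for (final match in delimiter.allMatches(prefix)) {
      // A match touching the end of the buffer might still grow when the
      // next chunk arrives, so defer it until more input is available.
      if (match.end == prefix.length) break;
      yield prefix.substring(start, match.start);
      start = match.end;
    }
    prefix = prefix.substring(start);
  }
  // End of input: a match at the very end is final now.
  var start = 0;
  for (final match in delimiter.allMatches(prefix)) {
    yield prefix.substring(start, match.start);
    start = match.end;
  }
  if (start < prefix.length) yield prefix.substring(start);
}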
If you allow arbitrary RegExps, then there is no simpler solution. (And even that does not allow a RegExp which uses look-ahead or look-behind to check characters outside of the 100-char match, perhaps even in earlier or later chunks; you really need the entire file in memory as a string for that kind of shenanigans to work.)
Now, if that's too inefficient, perhaps because some lines are large, or some chunks are small, and you are doing a lot of copying for the concatenation, then I'd move away from using RegExps (and you should be accepting Pattern, not RegExp, as the type anyway), and start using plain strings or code-unit sequences to search for.
Then it will make sense to scan each incoming chunk to the end, and remember whether you have seen a partial delimiter, and how much of one, then you can continue with the next chunk and not need to first combine them in memory to run a RegExp over the combination.
It will even allow you to search for the delimiter in incoming bytes instead of converting them to a string first, reducing the copy-overhead even more.
It's a little tricky if you allow arbitrary character sequences as line delimiters. For example, using \r\n\r\n\t as delimiter, if you have seen \r\n\r\n at the end of one chunk, you need to be able to recognize both \t and \r\n\t at the start of the next.
(You might be able to adapt something like KMP string search for this purpose, or just disallow line delimiters that are not fairly simple).
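To make that concrete, here is a sketch of a KMP-adapted matcher in Dart (StreamingDelimiterFinder is an illustrative name; the same loop works on undecoded bytes as well as strings). The match state carries over between chunks, so it handles the \r\n\r\n\t example above:
class StreamingDelimiterFinder {
  final String delimiter;
  final List<int> _failure; // KMP failure (border) table
  int _matched = 0; // delimiter chars matched at the current position

  StreamingDelimiterFinder(this.delimiter)
      : _failure = _buildFailureTable(delimiter);

  static List<int> _buildFailureTable(String p) {
    final f = List<int>.filled(p.length, 0);
    var k = 0;
    for (var i = 1; i < p.length; i++) {
      while (k > 0 && p[i] != p[k]) k = f[k - 1];
      if (p[i] == p[k]) k++;
      f[i] = k;
    }
    return f;
  }

  /// Feeds one chunk; calls [onMatch] with the index just past each
  /// completed delimiter, even when the delimiter spans chunks.
  void feed(String chunk, void Function(int endIndex) onMatch) {
    for (var i = 0; i < chunk.length; i++) {
      while (_matched > 0 && chunk[i] != delimiter[_matched]) {
        _matched = _failure[_matched - 1];
      }
      if (chunk[i] == delimiter[_matched]) _matched++;
      if (_matched == delimiter.length) {
        onMatch(i + 1);
        _matched = _failure[_matched - 1];
      }
    }
  }
}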
If you had a RegExp implementation based on a state machine, it would be trivial(ish) to keep the state at the end of one chunk and continue matching in the next chunk, but the Dart (JavaScript) RegExp is not such an implementation, and it isn't built to do partial matching.
Converting the RegExp itself into one which matches prefixes of what the original would match, and detecting which prefix, is not something I'd recommend, simply because it's very non-trivial. And that's for RegExps that are actually regular, which Dart's are not (back-references are non-regular).

Related

Optimising regex matching around an index

Is there a way (3rd party libraries are OK) to locate a regex match that envelops a specific "anchor" position without the need to iterate through the matches?
For example, we have a string with a location X. Now I want to match a specific regex, with the X being part of the captured area.
My specific scenario is about non-break conditions for punctuation marks. E.g. we have a . character, and I want to ignore it for [0-9]+[.][0-9]+.
As the string may be quite long, I need an efficient way of doing it, without the need to check several matches until I'm at the right spot. The maximum length of the interval between the start of the match and the X is unknown.
Of course, iterating through the matches is also possible, but it's not efficient: while the number of punctuation marks is limited, the number of non-break conditions may be quite high.
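For reference, the iteration the question wants to avoid looks roughly like this in Dart (matchCovering is an illustrative name); since allMatches yields non-overlapping matches in order, you can at least stop once a match starts past the anchor:
Match? matchCovering(RegExp regExp, String text, int x) {
  for (final match in regExp.allMatches(text)) {
    if (match.start <= x && x < match.end) return match;
    if (match.start > x) break; // later matches start even further right
  }
  return null; // no match envelops position x
}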

Validate ASCII GnuPlot file with c++ regex

I have been trying to get this right, but cannot seem to make things work the way I want.
I have an ASCII file containing several million lines of floating-point values, separated by spaces. Reading these values is straightforward using std::istream_iterator<double>, but I wanted to validate the file up front to make sure it is really formatted the way I described. Since there is only one correct format, and gazillions of ways it can be ill-formed, I wanted to go about it using std::regex.
This is what I came up with:
std::string begln( "^" );
std::string endln( "$" );
std::string fp( "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?" );
std::string space( "[[:space:]]{1}" );
std::regex regexp( "(" + begln + fp + space + fp + space + fp + endln + ")+" );
What I wanted to express was: a line consists of something between the beginning and end of the line, which consists of three sets of floating-point values separated by single spaces, and I am looking for one or more of these lines.
I would expect a valid datafile to have a single match without prefix and suffix.
But hey, since these values will go into a std::vector<std::array<double, 3>>, why don't I reuse the regex machinery and obtain the values from a match list? If the file is valid, then an absolutely trivial regex could match just individual lines, and I could construct a std::sregex_iterator to iterate over the lines. At that point, it is only a matter of obsession how one obtains the values from a single std::string of a line, whether using regex again or std::stringstream.
Why not? The reason you wouldn't want this is that regexes are absolute overkill. They can match far more complex grammars, and are capable of reading in those grammars at runtime. That flexibility comes at a high price: all the possible parsers must be included. No current compiler is smart enough to see that you just used [[:space:]] as a regex. (In fact, no C++ compiler or linker knows anything about regex; that's purely a library thing.)
In comparison, operator>> is overloaded and the compiler sees exactly which overloads you use at compile time. The linker is told this, and includes just the relevant code.
Furthermore, the CPU branch predictor will soon notice that operator>> almost always succeeds, which is a further speedup. Your regex code is less likely to benefit in the same way - the conditional part in [0-9]* is at least one level of indirection deeper.

Java Regex to find if a given String contains a set of characters in the same order of their occurrence.

We need Java Regex to find if a given String contains a set of characters in the same order of their occurrence.
E.g. if the given String is "TYPEWRITER",
the following strings should return a match:
"YERT", "TWRR" & "PEWRR" (character by character match in the order of occurrence),
but not
"YERW" or "YERX" (this contains characters either not present in the given string or doesn't match the order of occurrence).
This can be done with character-by-character matching in a for loop, but that would be more time-consuming. A regex for this, or any pointers, will be highly appreciated.
First of all, regex has nothing to do with it. Regex is powerful, but not powerful enough to accomplish this.
What you are asking for is part of the Longest Common Subsequence (LCS) algorithm. For your case you need to change the algorithm a bit: instead of matching parts of both strings, you require your one string to match as a whole subsequence of the larger one.
LCS is a dynamic-programming algorithm, and so far this is the fastest way to achieve this. If you take a look at the LCS example here, you'll see what I am talking about.
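For the special case described here, where one string must match as a whole subsequence of the larger one, the full LCS table can be replaced by a greedy two-pointer scan. A minimal sketch (in Dart; isOrderedSubsequence is an illustrative name):
// True if every character of [candidate] appears in [source] in the same
// order, e.g. isOrderedSubsequence('YERT', 'TYPEWRITER') == true.
bool isOrderedSubsequence(String candidate, String source) {
  var i = 0;
  for (var j = 0; j < source.length && i < candidate.length; j++) {
    if (candidate[i] == source[j]) i++;
  }
  return i == candidate.length;
}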

How to find special values in large file using C++ or C

I have some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line, and its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I considered using an embedded database (e.g. Berkeley DB), but the file I want to search is very dynamic, and I see a problem with bringing it into the database every time. Due to memory limits it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined by the need to read whole lines of text and perform pattern matching on the first 10 chars, something interpreted, such as Python or Perl, would seem simpler.
How about:
pattern = '0123456789'  # <-- replace with your pattern
with open('myfile.txt') as f:
    for line in f:
        if line.startswith(pattern):
            print('Eureka!')
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search), but you make it sound like the pattern is not precisely constant. Assuming it is not, the most obvious method (see first paragraph) is the right thing to do.
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be loading and searching, say, the first 64 MB in memory, unloading it, then loading the next 64 MB, and so on (overlapping the blocks by the pattern length, so that you are not overlooking any text which might be split at a block boundary; see the sketch below).
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
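A sketch of that block-plus-overlap idea, written in Dart for concreteness (findAll and the block size are illustrative; the inner loop is a naive byte scan where Boyer-Moore would skip ahead):
import 'dart:io';
import 'dart:typed_data';

// Returns the file offsets of every occurrence of the (non-empty) byte
// pattern, reading the file one block at a time.
Future<List<int>> findAll(String path, List<int> pattern,
    {int blockSize = 64 * 1024 * 1024}) async {
  final hits = <int>[];
  final overlap = pattern.length - 1;
  final raf = await File(path).open();
  var base = 0; // file offset of buf[0]
  var carry = Uint8List(0); // tail of the previous block
  while (true) {
    final chunk = await raf.read(blockSize);
    if (chunk.isEmpty) break;
    final buf = Uint8List(carry.length + chunk.length)
      ..setAll(0, carry)
      ..setAll(carry.length, chunk);
    for (var i = 0; i + pattern.length <= buf.length; i++) {
      var j = 0;
      while (j < pattern.length && buf[i + j] == pattern[j]) j++;
      if (j == pattern.length) hits.add(base + i);
    }
    // Keep the last overlap bytes so a match split across the block
    // boundary is found on the next pass, and never found twice.
    final keep = buf.length < overlap ? buf.length : overlap;
    carry = Uint8List.sublistView(buf, buf.length - keep);
    base += buf.length - keep;
  }
  await raf.close();
  return hits;
}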
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
/* Skip forward while *p cannot be the start of the pattern. */
while( (p < EOB) && !patterns[*p] ) ++p;
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.

Incremental Stream Parsing in C++

I am reading data from a (serial) port using a non-blocking read() function (in C/C++).
This means the data comes in chunks of undetermined (but reported) sizes (incl. 0) each time I "poll" the port. I then need to parse this "stream" for certain patterns (not XML).
My naive implementation concatenates the new string to the previous stream-string each time read() returns a non-zero buffer, and re-parses the whole string.
When a pattern is matched, the relevant part is discarded leaving only the tail of the string for the next time.
Obviously there are much more efficient ways to do this, for example incremental parsing à la SAX, deque-like buffers, string slices, etc.
Also, obviously, I am not the first to have to do this type of stream parsing.
Does anyone know of any library that already does this sort of thing?
Preventing memory-overflow in case of missing pattern matches would also be a big plus.
Thanks,
Adi
You can do some tricks, depending on your pattern.
If you are looking for one character, like a newline, you only need to scan the new portion of the string.
If you are looking for \r\n, then you only need to scan the new portion, starting with the last character of the previous portion (see the sketch after this list).
If you have a pattern with a known ending part, then you only have to scan for that ending to know whether you need to scan the whole buffer.
If you have some kind of synchronizing character, like the semicolon in C/C++, then you can complete a parse, or declare a parse error, whenever you find one.
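A sketch of the scan-only-the-new-portion trick for a fixed pattern, in Dart (IncrementalFinder is an illustrative name): each feed backs up only pattern.length - 1 characters into already-scanned data, so no character is rescanned more than a bounded number of times.
import 'dart:math';

class IncrementalFinder {
  final String pattern;
  final StringBuffer _buffer = StringBuffer();
  int _scanned = 0; // input before this index has been fully scanned

  IncrementalFinder(this.pattern);

  /// Appends [chunk] and returns start indices (into the whole stream
  /// so far) of pattern occurrences not reported before. A production
  /// version would also discard consumed input to bound memory use.
  List<int> feed(String chunk) {
    _buffer.write(chunk);
    final text = _buffer.toString();
    // Back up just far enough to catch a match spanning the boundary.
    final from = max(0, _scanned - (pattern.length - 1));
    final hits = <int>[];
    for (var i = text.indexOf(pattern, from);
        i != -1;
        i = text.indexOf(pattern, i + 1)) {
      hits.add(i);
    }
    _scanned = text.length;
    return hits;
  }
}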