Validate ASCII GnuPlot file with C++ regex

I have been trying to get this right, but cannot seem to make things work the way I want them to.
I have an ASCII file containing several million lines of floating point values, separated by spaces. Reading these values is straightforward using std::istream_iterator<double>, but I wanted to validate the file upfront to make sure it is really formatted the way I described. Since there is only one correct format, and gazillions of ways it can be ill-formed, I wanted to go about it using std::regex.
This is what I came up with:
std::string begln( "^" );
std::string endln( "$" );
std::string fp( "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?" );
std::string space( "[[:space:]]{1}" );
std::regex regexp( "(" + begln + fp + space + fp + space + fp + endln + ")+" );
What I wanted to express was: a line consists of three floating point values separated by single spaces, between the beginning and end of the line, and I am looking for one or more of these lines.
I would expect a valid datafile to have a single match without prefix and suffix.
But hey, since these values will go into a std::vector<std::array<double, 3>>, why don't I reuse the regex machinery and obtain the values from a match list? If the file is valid, then an absolutely trivial regex could match just individual lines, and I could construct a std::sregex_iterator to iterate over the lines. At this point, it is only a matter of obsession how one obtains the values from a single std::string of a line, whether using regex again or std::stringstream.
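For instance, a minimal sketch of that idea (assuming the whole file has already been read into a std::string named content):

#include <array>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::array<double, 3>> parse(const std::string& content) {
    std::vector<std::array<double, 3>> data;
    std::regex line("[^\n]+");   // the trivial regex: match one line
    for (std::sregex_iterator it(content.begin(), content.end(), line), end;
         it != end; ++it) {
        std::istringstream ss(it->str());   // pull the three values out of the line
        std::array<double, 3> row{};
        ss >> row[0] >> row[1] >> row[2];
        data.push_back(row);
    }
    return data;
}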

Why not? The reason why you wouldn't want this is that regexes are absolute overkill. They can match far more complex grammars, and are capable of reading in those grammars at runtime. That flexibility comes at a high price: all the possible parsing machinery must be included. No current compiler is smart enough to see that you just used [[:space:]] as a regex. (In fact, no C++ compiler or linker knows anything about regex - that's purely a library thing.)
In comparison, operator>> is overloaded and the compiler sees exactly which overloads you use at compile time. The linker is told this, and includes just the relevant code.
Furthermore, the CPU branch predictor will soon notice that operator>> almost always succeeds, which is a further speedup. Your regex code is less likely to benefit in the same way - the conditional part in [0-9]* is at least one level of indirection deeper.
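For comparison, a minimal sketch of the operator>> route (the file name and the end-of-input check are my assumptions):

#include <array>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.txt");
    std::vector<std::array<double, 3>> data;
    std::array<double, 3> row;
    while (in >> row[0] >> row[1] >> row[2])
        data.push_back(row);
    // Anything other than a clean end of file means malformed input.
    // (A file ending mid-triple also sets eofbit, so a stricter check
    // would have to count values per line.)
    return (in.eof() && !in.bad()) ? 0 : 1;
}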

Related

Dart regex match against a stream

I'm writing code to parse a text file.
The user of my library can pass a line delimiter as a regex.
The file may be large so I need to stream the contents.
So the question is how do I apply the regex to the stream as it passes through my line matcher.
I will apply a limit, so the line delimiter matched by the regex may not be greater than 100 chars; otherwise the regex has the potential to match the entire contents of the file.
I can't just buffer the 100 char max as the delimiter may span the buffer.
The only idea I can think of is preparsing the regex into segments and checking for partial matches as I go.
Any better ideas?
It's a thorny issue, and not one which has a simple solution.
Your file is large, so you do not want to load it entirely into memory first. That's reasonable.
You do need to buffer some of it, at least everything after the last detected line delimiter, so that you can combine that with the next chunk in order to look for delimiters that may be split between the chunks.
That would be my initial approach: Keep a "prefix" string, which is everything after the last line delimiter, and when you receive a new chunk, concatenate that onto the prefix, and then check for line delimiters in the entire available string. If the prefix is (way) more than 100 chars, you can split the prefix into the part that is definitely not part of the delimiter, which you then put into your StringBuffer directly, and the last 99 characters which you combine with the next chunk. I'd benchmark that, because it's not at all obvious that it'll be faster than just concatenating the entire thing, but it might be if you get lines spanning many chunks.
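To illustrate the buffering idea (a sketch in C++ rather than Dart, with invented names; the 100-char cap on the prefix is left out for brevity, and the delimiter regex must not match the empty string):

#include <functional>
#include <regex>
#include <string>

class LineSplitter {
public:
    LineSplitter(std::regex delim, std::function<void(const std::string&)> onLine)
        : delim_(std::move(delim)), onLine_(std::move(onLine)) {}

    // Combine the leftover prefix with the new chunk, emit every complete
    // line found, and keep whatever follows the last delimiter.
    void addChunk(const std::string& chunk) {
        prefix_ += chunk;
        std::smatch m;
        while (std::regex_search(prefix_, m, delim_)) {
            onLine_(m.prefix().str());   // text before the delimiter
            prefix_ = m.suffix().str();  // remainder waits for the next chunk
        }
    }

    // The final partial line, if any, is flushed at end of input.
    void finish() {
        if (!prefix_.empty()) onLine_(prefix_);
    }

private:
    std::regex delim_;
    std::function<void(const std::string&)> onLine_;
    std::string prefix_;
};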
If you allow arbitrary RegExps, then there is no simpler solution. (And even that does not allow a RegExp which uses look-ahead or look-behind to check characters outside of the 100-char match, perhaps even in earlier or later chunks; you really need the entire file in memory as a string for that kind of shenanigans to work.)
Now, if that's too inefficient, perhaps because some lines are large, or some chunks are small, and you are doing a lot of copying to do the concatenation, then I'd move away from using RegExps (and you should be using Pattern, not RegExp, as the type you accept already), and start using just strings or code unit sequences to search for.
Then it will make sense to scan each incoming chunk to the end and remember whether you have seen a partial delimiter, and how much of one; then you can continue with the next chunk without needing to first combine them in memory to run a RegExp over the combination.
It will even allow you to search for the delimiter in incoming bytes instead of converting them to a string first, reducing the copy-overhead even more.
It's a little tricky if you allow arbitrary character sequences as line delimiters. For example, using \r\n\r\n\t as delimiter, if you have seen \r\n\r\n at the end of one chunk, you need to be able to recognize both \t and \r\n\t at the start of the next.
(You might be able to adapt something like KMP string search for this purpose, or just disallow line delimiters that are not fairly simple).
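If you do track partial matches by hand, a KMP-style matcher carries exactly the right state across chunks - the number of delimiter characters matched so far. A sketch in C++ rather than Dart (class and function names are mine; it assumes a non-empty, fixed delimiter string), which handles the \r\n\r\n\t example above:

#include <cstddef>
#include <string>
#include <vector>

// Build the KMP failure table for the delimiter.
std::vector<std::size_t> failureTable(const std::string& pat) {
    std::vector<std::size_t> fail(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Streaming matcher: the only state kept between chunks is `matched_`,
// the number of delimiter characters seen so far.
class StreamingDelimiterFinder {
public:
    explicit StreamingDelimiterFinder(std::string delim)
        : delim_(std::move(delim)), fail_(failureTable(delim_)), matched_(0) {}

    // Calls onMatch(end) for each delimiter ending inside this chunk,
    // where end is the offset just past the delimiter in the chunk.
    template <typename Callback>
    void feed(const std::string& chunk, Callback onMatch) {
        for (std::size_t i = 0; i < chunk.size(); ++i) {
            while (matched_ > 0 && chunk[i] != delim_[matched_])
                matched_ = fail_[matched_ - 1];
            if (chunk[i] == delim_[matched_]) ++matched_;
            if (matched_ == delim_.size()) {
                onMatch(i + 1);
                matched_ = 0;   // delimiters do not overlap; restart cleanly
            }
        }
    }

private:
    std::string delim_;
    std::vector<std::size_t> fail_;
    std::size_t matched_;
};

Feeding the chunks in order then yields delimiter positions without ever concatenating chunks in memory.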
If you had a RegExp implementation based on a state machine, it would be trivial(ish) to keep the state at the end of one chunk and continue matching in the next chunk, but the Dart (JavaScript) RegExp is not such an implementation, and it isn't built to do partial matching.
Converting the RegExp itself into one which matches prefixes of what the original would match, and detecting which prefix, is not something I'd recommend. Simply because it's very non-trivial. And that's for RegExps that are actually regular, which the Dart ones are not (back-references are non-regular).

Tokenize parse option

Consider a slightly different toy example from my previous question:
. local string my first name is Pearly,, and my surname is Spencer
. tokenize "`string'", parse(",,")
. display "`1'"
my first name is Pearly
. display "`2'"
,
. display "`3'"
,
. display "`4'"
and my surname is Spencer
I have two questions:
Does tokenize work as expected in this case? I thought local macro 2 should be ,, instead of ,, while local macro 3 should contain the rest of the string (and local macro 4 should be empty).
Is there a way to force tokenize to respect the double comma as a parsing character?
tokenize -- and gettoken too -- won't, from what I can see, accept repeated characters such as ,, as a composite parsing character. ,, is not illegal as a specification of parsing characters, but is just understood as meaning that , and , are acceptable parsing characters. The repetition in practice is ignored, just as adding "My name is Pearly" after "My name is Pearly" doesn't add information in a conversation.
To back up: know that without other instructions (such as might be given by a syntax command) Stata will parse a string according to spaces, except that double quotes (or compound double quotes) bind harder than spaces separate.
tokenize -- and gettoken too -- will accept multiple parse characters pchars and the help for tokenize gives an example with space and + sign. (It's much more common, in my experience, to want to use space and comma , when the syntax for a command is not quite what syntax parses completely.)
A difference between space and the other parsing characters is that spaces are discarded but other parsing characters are not discarded. The rationale here is that those characters often have meaning you might want to take forward. Thus in setting up syntax for a command option, you might want to allow something like myoption( varname [, suboptions])
and so whether a comma is present and other stuff follows is important for later code.
With composite characters, so that you are looking for say ,, as a separator, I think you'd need to loop around using substr() or an equivalent. In practice an easier work-around might be first to replace your composite characters with some neutral single character and then apply tokenize. That would need to rely on knowing that the neutral character does not otherwise occur. Thus I often use # as a character placeholder, because I know that it will not occur as part of variable or scalar names and it's not part of function names or an operator.
For what it's worth, I note that in first writing split I allowed composite characters as separators. As I recall, a trigger to that was a question on Statalist which was about data for legal cases with multiple variations on VS (versus) to indicate which party was which. This example survives into the help for the official command.
On what is a "serious" bug, much depends on judgment. I think a programmer would just discover on trying it out that composite characters don't work as desired with tokenize in cases like yours.

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. In this text, there may be some special sequences of characters which I wish to translate automatically. And from the translated version, I wish I could go back to the sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote, inside myTextEdit's text, the sequence \n, I would like MAGIC_TRANSLATE to convert the string \n to an actual newline character.
In the same way, if I give it a text with a newline inside, MAGIC_UNTRANSLATE will convert the newline into the \n string.
Now, of course I can implement these two functions by myself, but what I am asking is if there is something already made, easy to use, in Qt, which allows me to specify a dictionary and it does the rest for me.
Note that sequences with a common prefix can create conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope to have made clear my needs. I believe that this may be a common task, so I hope there is a Qt solution which takes into account this kind of issues when having conflicting prefixes.
I am sorry if this is a trivial topic (as I think it may be), but I am not familiar at all with encoding techniques and issues, and my knowledge of Qt encoding covers only very simple Unicode-related issues.
EDIT:
Btw, in my case a data-oriented approach, based on resources or external files or anything that does not require a recompilation, would be great.
It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.
It is always a bit ugly to self-answer questions, but... Maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. Plain replacement will always work if the sequences are all different and, in particular, if no sequence is a prefix of another sequence.
For example, if your sequences are the ones above plus some Greek letters, you will not like the \nu sequence, which should be translated to ν.
If the replacing function tests for \n before \nu, the result is wrong.
Assuming that both sequences will be translated into two completely different entities, there are two solutions: place a close-sequence character, for example \nu;, or just replace from the longest to the shortest string. This ensures that no sequence which is a prefix of another one is replaced before the longer one.
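A minimal sketch of the longest-first strategy (the function name is mine; note that sequential QString::replace calls can still re-match text produced by an earlier replacement, so the replacement values should not themselves contain sequences):

#include <QString>
#include <algorithm>
#include <utility>
#include <vector>

QString magicTranslate(QString text,
                       std::vector<std::pair<QString, QString>> dict) {
    // Longest sequences first, so \nu is replaced before \n.
    std::sort(dict.begin(), dict.end(),
              [](const auto& a, const auto& b) {
                  return a.first.size() > b.first.size();
              });
    for (const auto& entry : dict)
        text.replace(entry.first, entry.second);
    return text;
}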
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process the input; with a trie you avoid matching characters twice, so you go pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.

How to find special values in large file using C++ or C

I have some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I have considered using an embedded database (e.g. Berkeley DB), but the file I want to search in is very dynamic and I see a problem with bringing it into the database every time. Due to memory limits it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined by the need to parse whole lines of text and perform pattern matching on the first 10 chars, something interpreted, such as Python or Perl, would seem to be simpler.
How about:
pattern = '0123456789'  # <-- replace with pattern
with open('myfile.txt') as f:
    for line in f:
        if line.startswith(pattern):
            print("Eureka!")
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search), but you make it sound like the pattern is not precisely constant. Assuming it is not, the straightforward line-by-line method (see first paragraph) is the most obvious thing to do.
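For what it's worth, a minimal sketch of that straightforward method (the file name and pattern are placeholders):

#include <cstdio>
#include <cstring>

int main() {
    const char* pattern = "0123456789";   // the ten-character value
    char line[4096];
    FILE* f = std::fopen("myfile.txt", "r");
    if (!f) return 1;
    while (std::fgets(line, sizeof line, f)) {
        if (std::strncmp(line, pattern, 10) == 0)
            std::puts("match");
    }
    std::fclose(f);
    return 0;
}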
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
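A sketch of what that could look like on POSIX (names are placeholders and error handling is minimal): map the file once, then hop from newline to newline with memchr and compare ten bytes at each line start.

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* pattern = "0123456789";   // the ten-character value
    int fd = open("myfile.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) return 1;
    const char* p = static_cast<const char*>(mem);
    const char* end = p + st.st_size;
    while (p < end) {                     // p is always at a line start
        if (end - p >= 10 && std::memcmp(p, pattern, 10) == 0)
            std::puts("match");
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;
        p = nl + 1;
    }
    munmap(mem, st.st_size);
    close(fd);
    return 0;
}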
One way may be to load and search, say, the first 64 MB in memory, unload it, then load the next 64 MB, and so on (reading in multiples of 4 KB, and overlapping the block boundaries so that you are not overlooking any text which might be split across them).
Also see the Boyer-Moore string search algorithm:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = (unsigned short *) buffer;
while( (p < EOB) && ( !patterns[*p] ) ) ++p;  /* skip byte pairs that cannot start a match */
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to be set to 1 might include \nx or \rx, where x is the first character in your 10-character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention, you can include all four.
Once you have a candidate location (an EOL character followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two-byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you scanned single bytes.
There is one crazy optimization you can add to this, and that is to write a sentinel at the end of the buffer and put it in your patterns array. But that sentinel must be something that couldn't otherwise appear in the file. It gets the loop down to one test, one lookup and one increment, though.

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following forms:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you described can fit a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need (most probably from captures 2, 3 and 4).
If I understood what you mean, you will probably want to check that capture 2 and capture 4 are not both non-empty at the same time, since the grammar has no X*search#Y form.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
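Since the question is C++, here is a minimal sketch with std::regex (capture numbering as discussed above; the sample argument is one of the question's examples):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex re(R"(((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?)");
    std::smatch m;
    std::string arg = "2*foo";
    if (std::regex_match(arg, m, re)) {
        std::cout << "count:  " << m[2] << "\n"    // "all" or a number, may be empty
                  << "search: " << m[3] << "\n"    // keyword or quoted keyword list
                  << "index:  " << m[4] << "\n";   // "#Y" suffix, may be empty
    }
    return 0;
}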
Even without regex, I would hand write a function. It would be simpler than dealing with lex/yacc and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better, since the number of commands to be parsed is small.
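A minimal sketch of that splitting approach (using strchr rather than strtok so that it stays visible which delimiter was present; the input and the default count are my assumptions):

#include <cstdio>
#include <cstring>

int main() {
    char arg[] = "2*foo#8";          // hypothetical input
    const char* count = "1";         // default: first match only
    const char* index = "";
    char* hash = std::strchr(arg, '#');
    if (hash) { *hash = '\0'; index = hash + 1; }   // "#Y" suffix, if any
    const char* search = arg;
    char* star = std::strchr(arg, '*');
    if (star) { *star = '\0'; count = arg; search = star + 1; }  // "all" or X
    // note: a '#' inside a quoted search term would need extra care
    std::printf("count=%s search=%s index=%s\n", count, search, index);
    return 0;
}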