Parsing Strings/Tokens - C++

I was wondering what the most efficient way of parsing strings would be for protocols like HTTP, FTP, SMTP, IMAP, IRC, etc. where communication is done by sending information to a server, and reading the response.
For example, let's say I would like to parse a typical IRC message.
PING irc.example.com
What I am doing right now is dividing the response string into tokens, and iterating through them. If the token is "PING", my program calls the pong function. However, at the moment, "parsing" these strings merely consists of a bunch of strcmp()s.
I am curious about alternative, more efficient ways of 'parsing' this data (I was thinking of something like a map of tokens so my program can look them up easily).
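One way to realize the map idea is a minimal sketch along these lines (handle_ping and handle_privmsg are placeholder names, not from the question): key an unordered_map on the command token and dispatch to a handler, which replaces the chain of strcmp() calls with an average O(1) lookup. std::unordered_map needs C++11; on older compilers std::map or a Boost hash map works the same way.

    #include <functional>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    void handle_ping(const std::string& args)    { std::cout << "PONG " << args << "\r\n"; }
    void handle_privmsg(const std::string& args) { /* deliver the message ... */ }

    int main() {
        // Command token -> handler. Average O(1) lookup instead of chained strcmp().
        std::unordered_map<std::string, std::function<void(const std::string&)>> handlers = {
            {"PING",    handle_ping},
            {"PRIVMSG", handle_privmsg},
        };

        std::string line = "PING irc.example.com";
        std::istringstream iss(line);
        std::string command;
        iss >> command;                       // first whitespace-delimited token is the command
        std::string rest;
        std::getline(iss, rest);              // remainder of the line (may be empty)
        if (!rest.empty() && rest.front() == ' ')
            rest.erase(0, 1);

        auto it = handlers.find(command);
        if (it != handlers.end())
            it->second(rest);                 // dispatch to the matching handler
    }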

Define a grammar for it, or simply build an automaton that detects your tokens.
Example in this post.

Depending on how much you want to support, you've got a few options. At the first level is simple tokenizing, like what you're doing; this only works for a very limited set of commands. Next up you have regular expressions, which may give you a bit more flexibility. Finally, you've got a full-blown grammar as suggested, which would allow for the greatest flexibility.
Each of these is more complex than the last.
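As an illustration of the middle option, std::regex (C++11; Boost.Regex offers the same interface on older compilers) can both recognise a command and capture its argument in one step. The pattern below is only a sketch covering the PING example above, not a full IRC grammar:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        const std::regex ping_re(R"(^PING\s+(\S+)\s*$)");   // command plus one argument
        const std::string line = "PING irc.example.com";

        std::smatch m;
        if (std::regex_match(line, m, ping_re))
            std::cout << "PONG " << m[1].str() << "\r\n";   // m[1] is the captured server name
    }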

Related

How can I find out the unexpected token in a C++ bison parser?

I am using bison/flex to develop a parser in C++ for an expression that a user can type into a field in a GUI. I would like to be able to give feedback to the user about allowed tokens (basically autocomplete) as they are typing. The information that '%error-verbose' generates would be sufficient, but it is only available as a string. Is there a way I can get programmatic access to the unexpected token and the expected-token list while processing a parse error?
The token itself is in the variable yychar. That part is easy.
Finding the list of possibilities is trickier.
Conceptually, you could reparse the current input up to but not including the erroneous token; save the parser state; and then try every other possible token in turn to see if an error is produced.
The reason you need to reparse is that LALR parsers may execute erroneous reductions before encountering a syntax error. (They never execute erroneous shifts, though.) In order to discover the valid lookaheads for the parser state, these reductions would have to be undone, and there is no mechanism for doing that. In general, a reduction loses information so it is not even theoretically possible.
If you enable LAC (q.v.), which you need to do in order to get precise errors, error-verbose parsers avoid the reduction issue by doing an exploratory parse (without reduction actions) on every token which might trigger an incorrect reduction. If this parse fails, then the parser state is available for constructing the list of options; if it succeeds, then it is redone with reduction actions.
Unfortunately, bison does not provide an API for "copy the parser state"; you could reverse-engineer one easily enough, but that would be pretty fragile. So if you wanted to try this without access to the generated parser's internals, you would actually have to reparse the input many times, once for each possible lookahead token.
You could use a canonical LR parser, which has the property that errors are detected before any reduction. Full LR parsing tables can be enormous, but if your grammar is simple enough this might not be a problem. However, you still have no clean way to save parser state, so unless you reverse-engineer that, you would still have to reparse for every successful lookahead token. (Or enough of them to construct a valid error message. Bison's verbose error setting will only output a maximum of five possibilities, and for good reason.)
Possibly the simplest solution is to parse the bison error message, which has a simple fixed format. If you were to do this, I would recommend making your token names simple easily-parsed words and substituting human-readable text in your yyerror handler.
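A hedged sketch of that last suggestion, assuming the usual %error-verbose message shape "syntax error, unexpected X, expecting A or B or C" (extract_expected is an illustrative helper, not part of any bison API):

    #include <string>
    #include <vector>

    // Pulls the token names out of "..., expecting A or B or C".
    std::vector<std::string> extract_expected(const std::string& msg) {
        std::vector<std::string> expected;
        const std::string marker = "expecting ";
        std::string::size_type pos = msg.find(marker);
        if (pos == std::string::npos)
            return expected;                      // message lists no expectations
        pos += marker.size();
        while (pos < msg.size()) {
            std::string::size_type next = msg.find(" or ", pos);
            if (next == std::string::npos) {
                expected.push_back(msg.substr(pos));
                break;
            }
            expected.push_back(msg.substr(pos, next - pos));
            pos = next + 4;                       // skip " or "
        }
        return expected;
    }

    // Typical use inside the yyerror handler (the exact signature depends on
    // the skeleton and %parse-param settings):
    //   void yyerror(const char* msg) {
    //       std::vector<std::string> options = extract_expected(msg);
    //       // feed `options` to the GUI autocomplete
    //   }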
Enabling LAC definitely slows down a parse. In general, all precise error-detection and reporting modifications slow down parsers, sometimes even noticeably; this includes keeping position information (although that is also useful for debug output, so in practice it may be necessary anyway).
The recommendation I always give, because it has worked well for me in practice, is to build two parsers: one which is optimized for error-free code and makes no attempt to do anything more than reject input on the first error, and another (possibly much slower) one which can handle error detection and recovery. Erroneous inputs are then parsed twice, once with the quick parser and then again with the slow one; correct inputs only need to be parsed once and only with the quick parser. This makes project builds rapid, and normally doesn't slow down the initial write-"compile"-edit loop much, as long as the fast parser is actually fast. Keeping the two parsers in synch can be annoying, but most of the time the error-recovery parser only requires some additional methods which can be turned into no-ops and then optimized away in the fast parser. With this strategy, you might be able to use the fast parser to do the "legal lookahead" generation, and it might turn out to be fast enough.
As always, YMMV. Good luck.
I ended up customising the skeleton that bison uses to give me access to the information I wanted. This was a hack because, as @rici says in his answer, bison doesn't give public access to the information I was interested in. I modified the error function to take yytoken and yystate as extra parameters, the same variables that are passed into yysyntax_error_. I then used the same algorithm that yysyntax_error_ uses to generate its 'verbose' message to produce a list of expected tokens and pass them back to the driving program. It's messy, but for my simple grammar at the moment it achieves what I wanted.

Fast method to check for substrings

I'm currently programming a chat system based on a server-client model and using TCP as the communication protocol. Although it's working as expected, I'd like to further optimize important parts on the server side.
The server uses four extra threads to handle new connections, console input, etc., without blocking normal chat conversations. However, there is only one thread for all messages that are being sent from client to client, so I assume it would be good to optimize the code there, as it is the most obvious bottleneck. After reading the data on each client's socket, the data has to be processed in several steps. One of those steps is to check for blocked words. And that's where my original question starts.
I played with std::string::find() and the strstr() function. According to my tests, std::string::find() was clearly faster than the old C-style strstr() function.
I know that the std::string is very well optimized, but C-style char arrays and their own functions always seemed to be somewhat faster, especially if the string has to be constructed over and over again.
So, is there anything faster than std::string::find() to scan a series of characters for blocked words? Is std::string::find() faster than strstr(), or are my benchmarks lousy? I know that the gain may be negligible compared to the effort needed to keep C-style char arrays and their functions clean, but I'd like to keep it as fast as possible, even if it is just for testing purposes.
EDIT: Sorry, forgot to mention that I am using MSVC++2010 Express. I am only targeting Windows machines.
Have you benchmarked to verify that lots of time is in fact being taken in the check for blocked words? My completely naive guess is you're gonna be spending lots more time waiting for RPCs than any local processing...
Have you tried the regular expressions library in either C++11 if you use that, or Boost if you don't? I'm not sure about the speed, but I believe they perform quite well. Additionally, if you are using this as a form of profanity filter, you'd want regular expressions anyway to prevent trivial circumvention.
There are faster search algorithms than the linear scan typically used by the STL or strstr().
Boyer-Moore is quite popular. It requires preprocessing of the target string, which should be feasible for your use case.
Exact string matching algorithms is a free e-book with an in-depth description of different search algorithms and their tradeoffs.
Implementing more advanced algorithms could take considerable effort.
As said in the other answers, it is doubtful that string searching is a bottleneck in your chat server.
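That said, if a toolchain newer than MSVC++ 2010 ever becomes an option, C++17 ships a Boyer-Moore searcher in the standard library. A minimal sketch for one blocked word (repeat per word, or look at Aho-Corasick for matching many words in a single pass):

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <string>

    int main() {
        const std::string message = "this line contains a blockedword somewhere";
        const std::string blocked = "blockedword";

        // Preprocess the pattern once, reuse the searcher for every message.
        const std::boyer_moore_searcher searcher(blocked.begin(), blocked.end());
        const auto it = std::search(message.begin(), message.end(), searcher);

        if (it != message.end())
            std::cout << "blocked word at offset " << (it - message.begin()) << '\n';
    }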

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow up to this question.
I have a "legacy" program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I'd like to keep the solution as cross-platform as possible so I don't think I want to use PE resource files or something.
Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file "format" which adds a layer of complexity I'd rather not have.
So after all of this preamble, it seems like the best solution is to have a bunch of arrays of strings which I can then iterate over. This seems kind of messy, as I'm mixing code and data heavily, but with the above requirements is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is. So it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a 'roll your own regex system' idea. Rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can't be too big, but it's another option)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.
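A minimal sketch of that compiled-structure idea (TrieNode and matches_any are illustrative names, not from the post): load the pattern list once at startup, build a trie, and scan each HTML chunk against it so common prefixes are only compared once. A real radix tree would compress single-child chains and use a sparser child representation than a full 256-entry array; this only shows the shape of the idea.

    #include <array>
    #include <cstddef>
    #include <memory>
    #include <string>

    struct TrieNode {
        bool terminal = false;                              // some pattern ends here
        std::array<std::unique_ptr<TrieNode>, 256> next{};  // one slot per byte value
    };

    void insert(TrieNode& root, const std::string& pattern) {
        TrieNode* node = &root;
        for (unsigned char c : pattern) {
            if (!node->next[c])
                node->next[c] = std::make_unique<TrieNode>();
            node = node->next[c].get();
        }
        node->terminal = true;
    }

    // True if any inserted pattern occurs anywhere inside `text`.
    // O(len(text) * longest pattern); Aho-Corasick failure links would remove
    // the restart at every position, if this ever becomes the hot spot.
    bool matches_any(const TrieNode& root, const std::string& text) {
        for (std::size_t i = 0; i < text.size(); ++i) {
            const TrieNode* node = &root;
            for (std::size_t j = i; j < text.size(); ++j) {
                node = node->next[static_cast<unsigned char>(text[j])].get();
                if (!node) break;
                if (node->terminal) return true;
            }
        }
        return false;
    }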
If the strings that need to be matched can be locked down at compile time, you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it, lex takes a source file which has some regular expressions (including the simplest regular expressions -- string literals) and C action code to be executed when a match is found. It is often used in building compilers and similar programs, and there are several other similar programs that you could also use (flex and antlr come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program, which is what you are worried about. You could probably easily generate the regular expressions from the string literals in your existing code, but it may take a good bit of work to rework your program to use the code that lex generated.
If the strings you have to match change over time there are some regular expressions libraries that can compile regular expressions at run time, but these do use lots of RAM and depending on your program's architecture these might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
"string1"
"string2"
"string3"
and the input:
"string2"
The partial match for "string" would be done just once by a DFA (Deterministic Finite-state Automaton) regular expression system (like lex), which would probably speed up your system. Building these things does require a lot of work on lex's part, but all of the hard work is done up front.
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try and adjust the working set size to try and see if you can reduce the number of page faults, but given that you have so many instances, it might prove to be counterproductive (and besides your program needs to have quota privileges to adjust the working set size).
There are other tricks you can try to optimize IO performance like allocating large pages, but it depends on your file size and the privileges granted to your program.
The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)...

How should I go about building a simple LR parser?

I am trying to build a simple LR parser for a type of template (configuration) file that will be used to generate some other files. I've read and read about LR parsers, but I just can't seem to understand it! I understand that there is a parse stack, a state stack and a parsing table. Tokens are read onto the parse stack, and when a rule is matched then the tokens are shifted or reduced, depending on the parsing table. This continues recursively until all of the tokens are reduced and the parsing is then complete.
The problem is I don't really know how to generate the parsing table. I've read quite a few descriptions, but the language is technical and I just don't understand it. Can anyone tell me how I would go about this?
Also, how would I store things like the rules of my grammar?
http://codepad.org/oRjnKacH is a sample of the file I'm trying to parse with my attempt at a grammar for its language.
I've never done this before, so I'm just looking for some advice, thanks.
In your study of parser theory, you seem to have missed a much more practical fact: virtually nobody ever even considers hand-writing a table-driven, bottom-up parser like you're discussing. For most practical purposes, hand-written parsers use a top-down (usually recursive descent) structure.
The primary reason for using a table-driven parser is that it lets you write a (fairly) small amount of code that manipulates the table and such, and that code is almost completely generic (i.e. it works for any parser). Then you encode everything about a specific grammar into a form that's easy for a computer to manipulate (i.e. some tables).
Obviously, it would be entirely possible to do that by hand if you really wanted to, but there's almost never a real point. Generating the tables entirely by hand would be pretty excruciating all by itself.
For example, you normally start by constructing an NFA, which is a large table -- normally, one row for each parser state, and one column for each possible input. In each cell, you encode the next state to enter when you start in that state and then receive that input. Most of these transitions are basically empty (i.e. they just say that input isn't allowed when you're in that state). (Note: since the valid transitions are so sparse, most parser generators support some way of compressing these tables, but that doesn't change the basic idea.)
You then step through all of those and follow some fairly simple rules to collect sets of NFA states together to become a state in the DFA. The rules are simple enough that it's pretty easy to program them into a computer, but you have to repeat them for every cell in the NFA table, and do essentially perfect book-keeping to produce a DFA that works correctly.
A computer can and will do that quite nicely -- for it, applying a couple of simple rules to every one of twenty thousand cells in the NFA state table is a piece of cake. It's hard to imagine subjecting a person to doing the same though -- I'm pretty sure under UN guidelines, that would be illegal torture.
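For contrast, here is what a tiny hand-written recursive-descent parser looks like (purely illustrative, for a toy arithmetic grammar rather than the asker's template format): one function per grammar rule, with the call stack playing the role of the parse stack.

    // expr   := term ('+' term)*
    // term   := factor ('*' factor)*
    // factor := NUMBER | '(' expr ')'
    #include <cctype>
    #include <stdexcept>
    #include <string>

    class Parser {
    public:
        explicit Parser(std::string src) : src_(std::move(src)) {}
        long parse() {
            long v = expr();
            skip_ws();
            if (pos_ != src_.size()) fail("trailing input");
            return v;
        }
    private:
        long expr()   { long v = term();   while (accept('+')) v += term();   return v; }
        long term()   { long v = factor(); while (accept('*')) v *= factor(); return v; }
        long factor() {
            skip_ws();
            if (accept('(')) {
                long v = expr();
                if (!accept(')')) fail("expected ')'");
                return v;
            }
            if (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_]))) {
                long v = 0;
                while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
                    v = v * 10 + (src_[pos_++] - '0');
                return v;
            }
            fail("expected number or '('");
        }
        bool accept(char c) {
            skip_ws();
            if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
            return false;
        }
        void skip_ws() {
            while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
        }
        [[noreturn]] void fail(const std::string& msg) {
            throw std::runtime_error(msg + " at position " + std::to_string(pos_));
        }
        std::string src_;
        std::size_t pos_ = 0;
    };

    // Usage: Parser("2 + 3 * (4 + 1)").parse() == 17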
The classic solution is the lex/yacc combo:
http://dinosaur.compilertools.net/yacc/index.html
Or, as GNU calls them, flex/bison.
Edit:
Perl has Parse::RecDescent, which is a recursive descent parser, but it may work better for simple jobs.
You need to read about ANTLR.
I looked at the definition of your file format. While I am missing some of the context for why you specifically want an LR parser, my first thought was: why not use an existing format like XML or JSON? Going down the parser-generator route usually has a high startup cost that will not pay off for the simple data you are looking to parse.
As paul said, lex/yacc are an option; you might also want to have a look at Boost::Spirit.
I have worked with neither; a year ago I wrote a much larger parser using QLALR by the Qt/Nokia people. When I researched parsers, this one, even though very underdocumented, had the smallest footprint to get started (only one tool), but it does not support lexical analysis. IIRC, I could not figure out C++ support in ANTLR at that time.
10,000-mile view: in general you are looking at two components: a lexer that takes the input symbols and turns them into higher-order tokens, and a grammar that works off those tokens. Your grammar description states rules; usually you include some code with the rules, and this code is executed when the rule is matched. The compiler generator (e.g. yacc) takes your description of the rules and the code and turns it into compilable code. Unless you are doing this by hand, you would not be manipulating the tables yourself.
Well, you can't understand it like
"Function A1 does f to object B, then function A2 does g to D etc"
it's more like
"Function A does action {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o or p, or no-op} and shifts/reduces a certain count to objects {1-1567} at stack head of type {B,C,D,E,F,or G} and its containing objects up N levels which may have types {H,I,J,K or L etc} in certain combinations according to a rule list"
It really does need a data table (or code generated from a data table like thing, like a set of BNF grammar data) telling the function what to do.
You CAN write it from scratch. You can also paint walls with eyelash brushes. You can interpret the data table at run-time. You can also put Sleep(1000); statements in your code every other line. Not that I've tried either.
Compilers are complex. Hence compiler generators.
EDIT
You are attempting to define the tokens in terms of content in the file itself.
I assume the reason you "don't want to use regexes" is that you want to be able to access line number information for different tokens within a block of text and not just for the block of text as a whole. If line numbers for each word are unnecessary, and entire blocks are going to fit into memory, I'd be inclined to model the entire bracketed block as a token, as this may increase processing speed. Either way you'll need a custom yylex function. Start by generating one with lex with fixed markers "[" and "]" for content start and end, then freeze it and modify it to take updated data about what markers to look for from the yacc code.

Parsing a string in C++

I have a huge set of log lines and I need to parse each line (so efficiency is very important).
Each log line is of the form
cust_name time_start time_end (IP or URL)*
So: cust_name, time_start, time_end, and a possibly empty list of IP addresses or URLs separated by semicolons. If there is only one IP or URL in the list there is no separator; if there is more than one, they are separated by semicolons.
I need a way to parse this line and read it into a data structure. time_start or time_end could be either system time or GMT. cust_name could also have multiple strings separated by spaces.
I can do this by reading character by character and essentially writing my own parser.
Is there a better way to do this?
Maybe Boost RegExp lib will help you.
http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html
I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.
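For reference, a minimal sketch of the Boost.Tokenizer suggestion (the separator characters and the sample line are only assumptions about the log format described above):

    #include <boost/tokenizer.hpp>
    #include <iostream>
    #include <string>

    int main() {
        const std::string line = "acme 10:01:02 10:05:09 10.0.0.1;example.com;10.0.0.2";

        // Split on spaces and semicolons; empty tokens are dropped by default.
        boost::char_separator<char> sep(" ;");
        boost::tokenizer<boost::char_separator<char>> tokens(line, sep);

        for (const std::string& t : tokens)
            std::cout << t << '\n';
    }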
Using regular expressions (boost::regex is a nice implementation for C++), you can easily separate the different parts of your string - cust_name, time_start, ... - and find all those URLs/IPs.
The second step is more detailed parsing of those groups if needed. Dates, for example, you can parse using the boost::datetime library (writing a custom parser if the string format isn't standard).
Why do you want to do this in C++? It sounds like an obvious job for something like Perl.
Consider using a Regular Expressions library...
Custom input demands a custom parser. Or, pray that there is an ideal world and errors don't exist. Especially if you want efficiency. Posting some code may be of help.
For such a simple grammar you can use split; take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
UPDATE: changed answer drastically!
I have a huge set of log lines and I need to parse each line (so efficiency is very important).
Just be aware that C++ won't help much in terms of efficiency in this situation. Don't be fooled into thinking that just because you have a fast parsing code in C++ that your program will have high performance!
The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.
Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.
Storing a huge data structure in memory is very inefficient, no matter what language you're using!
What you need to do is "fetch" one line at a time, convert it to a data structure, and deal with it. Then, and only after you're done with that data structure, go and fetch the next line, convert it, deal with it, and repeat.
If you do that, you've already solved the major bottleneck.
For parsing the line of text, it seems the format of your data is quite simplistic, check out a similar question that I asked a while ago: C++ string parsing (python style)
In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.
see this answer for example code.
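A sketch of that stringstream idea (LogLine and parse_line are illustrative names; it assumes, for simplicity, that cust_name and each timestamp are single whitespace-delimited tokens, which the question says is not always the case, so the multi-word variants would need extra handling):

    #include <sstream>
    #include <string>
    #include <vector>

    struct LogLine {
        std::string cust_name;
        std::string time_start;
        std::string time_end;
        std::vector<std::string> endpoints;   // IPs or URLs
    };

    bool parse_line(const std::string& line, LogLine& out) {
        std::istringstream iss(line);
        if (!(iss >> out.cust_name >> out.time_start >> out.time_end))
            return false;                      // malformed line

        std::string rest;
        if (iss >> rest) {                     // optional semicolon-separated list
            std::istringstream list(rest);
            std::string item;
            while (std::getline(list, item, ';'))
                if (!item.empty()) out.endpoints.push_back(item);
        }
        return true;
    }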
Alternatively, (I didn't want to delete this part!!)
If you could write this in Python, it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still:
Look at this presentation for doing these kinds of task efficiently using python generator expressions: http://www.dabeaz.com/generators/Generators.pdf
It's a worthwhile read.
At slide 31 he deals with what seems to be something very similar to what you're trying to do.
It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not by the particular string-parsing code, but by the overall algorithm.
You could try using a simple lex/yacc (or flex/bison) grammar to parse this kind of input.
The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.