Bitstream lexer generator - bit-manipulation

I have a number of packet formats that are bit-oriented. Rather than writing many fairly complex lexers by hand, I am looking for a bit-level lexer generator, say a bit-oriented version of flex/lex. Obviously I could just write straight C, but I was wondering whether such a bit-level lexer generator exists. After a quick Google search I found a number of media decoders and the like; however, I am not parsing media files, but network packets.
Alternatively, is there a way to run flex in a bit-oriented mode?

You could look into redefining YY_INPUT (see "Generated Scanner" in the flex documentation) to decompose each input byte into its individual bits and use '0' and '1' as the alphabet of your regular expressions. You may also want to consider whether a slightly bigger alphabet can be produced by an equally simple definition of YY_INPUT.
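A rough sketch of that idea (the stream name bit_source is invented here, and the snippet assumes a plain C scanner, so it also compiles as C++): placed in the definitions section of the .l file, the macro below delivers each input byte to the scanner as eight '0'/'1' characters, most significant bit first.

%{
#include <stdio.h>

static FILE *bit_source;   /* set this to the packet stream before calling yylex() */

#define YY_INPUT(buf, result, max_size)                          \
    do {                                                         \
        int c = getc(bit_source);                                \
        if (c == EOF || (max_size) < 8) {                        \
            (result) = YY_NULL;                                  \
        } else {                                                 \
            int i;                                               \
            for (i = 0; i < 8; ++i)        /* MSB first */       \
                (buf)[i] = ((c >> (7 - i)) & 1) ? '1' : '0';     \
            (result) = 8;                                        \
        }                                                        \
    } while (0)
%}

The rules section can then match patterns such as 0{8} or 1[01]{6}0 directly. Emitting one character per nibble ('0'-'9', 'A'-'F') instead would be the "slightly bigger alphabet" variant.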

Related

Elegant way for C++ code generation

I am currently working on a database-related project in which I generate a lot of C++ code. This code is then compiled and loaded as a dynamic library. I use this technique to build efficient code for the database schema and queries.
Currently I am using simple file writes to generate the code (which was okay for the proof-of-concept implementation). Now I am searching for a more elegant but comparably flexible solution for generating C++ code.
I searched quite a lot but all the solutions I found so far are rather complex/extensive, not efficient enough, or not flexible enough.
What libraries are you using in your C++ projects to generate code?
Best,
Moritz
You can use a program transformation system (PTS) to define and compose code templates in a reliable way.
Most PTSes enable one to define a grammar and then parse source code into ASTs using that grammar. More importantly, they accept patterns: source code fragments (usually of a nonterminal or a list of nonterminals) with placeholders that correspond to well-formed sub-fragments (nonterminals representing subtrees). These patterns usually insist that a named placeholder match identically (see the example below). Such patterns can be used to match against a parsed AST as a way to find code fragments using the surface syntax.
So, one might use a pattern:
pattern x_squared(t: term): product
= " \t * \t ";
to hunt for subexpressions which consist of products of identical subtrees.
This will match
(p + q[17])*(p + q[17])
but not
2 * (x-3)
But just as interestingly, such patterns can be used as code generators, by instantiating the pattern with bound value (trees) for the variables. So,
"instantiate x_squared(2^x)" produces
(2^x)*(2^x)
By itself, this is just a fancy sort of macro scheme. It is a lot better, in that it can tell you "at compile time" (for the patterns) whether what you are composing makes sense or not. So you get type checking of the composition of the code fragments. For instance, you might accidentally code "instantiate x_squared(int q)", but a good PTS will object that "int q" is not a "term"; you find the bug when you build the code generator.
Where this gets really interesting is that one can build many different code fragments from many different patterns, and compose those fragments with yet more patterns. This allows one to build very complex code. All of this happens in a (syntax/type) safe way; the resulting trees are guaranteed to be valid syntax. (You can still bollix the semantics; nothing is perfect.) As the complexity of the code you can generate goes up, it is good to have this additional checking to help you avoid generating bad code.
A PTS has an additional advantage: after composing code fragments, it can apply source-to-source transformations to optimize the resulting code. Thus you can produce optimized code according to your ability to write matching transformations, and harnessing knowledge you have during code generation.
Imagine you generate code for a matrix multiply:
... P * Q ...
and your code generator somehow or other knows that Q is an identity matrix.
Then the following optimization can remove an expensive matrix multiply:
rule optimize_matrix_times_unit(m: term, n: term): product -> product
" \m * \n "
-> " \m "
if is_identity_matrix(n)
This transformation takes advantage of pattern matching (to find a matrix product) in the generated code, pattern instantiation (to generate a replacement for the matched product), and additional knowledge or analysis (is_identity_matrix) that the code generator can bring to bear.
You need a PTS capable of handling C++ parsing; those are a bit hard to find. The one I designed (DMS Software Reengineering Toolkit) happens to do this. The examples in this answer are DMS-style.
Here's a technical paper that describes a large-scale reengineering task done by DMS on C++ code. A number of examples in the paper are actually quite complex patterns used to instantiate code; the reengineering task had to generate a new set of APIs for an existing chunk of code.

Computation of DFA states

I want to compute the total number of DFA states for a certain regular expression using FLEX. Which C files or functions will help me to achieve this task using FLEX?
If you look in the file generated by flex, the number of entries in yy_accept (and yy_base) will probably give a good indication of the number of states used by the generated DFA. If you use the -Cf option, then yy_nxt contains the transition function of the DFA, and the number of rows in that table is again the number of states used.
You may have a different version of flex where the tables are named differently, but most likely their names will be very similar.
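If you would rather have the scanner report the count itself, a small helper can be dropped into the third (user code) section of the .l file, which flex appends to lex.yy.c after the table definitions. This is only a sketch and assumes the table name used by current flex versions:

#include <stdio.h>

int main(void)
{
    /* yy_accept has roughly one entry per DFA state, so its length
       approximates the number of states; adjust the name if your
       flex version calls the table something else. */
    printf("DFA states (approx.): %u\n",
           (unsigned) (sizeof yy_accept / sizeof yy_accept[0]));
    return 0;
}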
In reaction to your follow-up questions: the number of states in a DFA can be considered quite well defined, assuming the DFA has been minimized. The number of transitions, however, is much less well defined.
In the first place, flex has a transition for each input character, because it will ECHO any character that is not part of the defined language. This is implemented by an extra state that handles that case. Using a debugger you could reverse-engineer which state this is. But beware that if you use start conditions, you may have to consider the possibility that there are multiple such states. If you want to analyze many regular expressions, then you may want to look into other tools, or take the sources of flex and go from there.
In the second place, flex has strategies to minimize the total size of all the tables; the -Cf option instructs it not to do that. One such optimization is finding equivalence classes of characters and only storing transitions per character class. An input character is first translated to its class, which in turn is used to determine the transition. As a consequence the number of transitions is much lower, but an additional table (see yy_ec) is required for determining the character class.
The number of transitions is therefore not a well-defined concept. If you are interested in determining the memory footprint of the scanner, I would look at the size of the data section of the scanner. Use for example objdump -h on the lex.yy.o file. The size of the .rodata section will give a quite accurate estimate of the total size of the tables.
You seem to have already found the -v option of flex, which gives the number of states in the DFA in a more verbose form. As to why "a" {} gives 5 states, you may also use the --trace option, as it prints the DFA while it is being generated. Apparently there is also an End Marker rule; I assume it is used for end-of-file. For each start condition there are two states: one used at the start of a line and one in the middle of a line. That makes 3 accepting states (one for "a", one for the End Marker and one for (.|"\n")) plus two states for the single start condition.
The source file dfa.c is not part of the generated code, but if you feel brave you could of course change the sources of flex to do further analysis of your own. I had a quick look, and it does seem that generation of the code is intertwined with the transformations, which makes it a bit less modular than one would desire for an experimentation platform. Also beware of the K&R prototypes, which effectively disable any type checking of the prototypes.

symbolic computation

Does anyone know a good approach/libs for doing algebraic computations in C++?
I have an application being developed in C++ which needs to do algebraic computation. For now I have built a C++ parser that accepts expressions in the form of strings like "5 + (2 - MYFUNC(3))", which get tokenized into structs, converted into postfix notation using the Shunting Yard algorithm, and evaluated.
The MYFUNCs in these expressions are my own defined functions that may do some complex computations.
This is a high-performance app; the expressions also contain variables that are dynamically replaced with values, and the expression is then re-evaluated,
e.g. var1 + (2 - MYFUNC(var2)) -> with var1 and var2 replaced by some values during the course of the run and re-evaluated.
I'm using Linux and so far I have found the Giac library, but I am not sure if it's good; any feedback would be welcome.
How do people generally approach this problem? The main language in this case is C++.
Have a look at the Bison and Flex parser generators. The basic idea here is that a grammar file is written and converted into C code which can be integrated into your application. Any book on Flex and Bison (http://www.amazon.com/Flex-Bison-Text-Processing-Tools/dp/0596155972) is good enough for initial reading.
Maybe this helps you!
Probably the fastest way to handle this is to generate a compiled, optimized function at runtime for the defined function, and evaluate it for the different variable values that you may have. You can do this with LLVM, probably other tools.
I would write a recursive descent parser for the language, because the syntax doesn't look very complex. The parsing may be a tiny bit slower than that of, say, Flex/Bison, but I'm guessing the parsing will be the least computationally expensive part of your project.
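To give an idea of the scale involved, here is a minimal sketch of such a recursive-descent evaluator (the names Parser, vars and funcs are invented for illustration; there is no unary minus, and error handling is reduced to exceptions):

#include <cctype>
#include <cmath>
#include <functional>
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

struct Parser {
    std::string s;                                      // expression text
    size_t pos = 0;
    const std::map<std::string, double>& vars;          // variable bindings
    const std::map<std::string, std::function<double(double)>>& funcs;

    Parser(const std::string& src,
           const std::map<std::string, double>& v,
           const std::map<std::string, std::function<double(double)>>& f)
        : s(src), vars(v), funcs(f) {}

    void skip() { while (pos < s.size() && std::isspace((unsigned char)s[pos])) ++pos; }
    bool eat(char c) { skip(); if (pos < s.size() && s[pos] == c) { ++pos; return true; } return false; }

    // expr := term (('+' | '-') term)*
    double expr() {
        double v = term();
        for (;;) {
            if (eat('+')) v += term();
            else if (eat('-')) v -= term();
            else return v;
        }
    }
    // term := factor (('*' | '/') factor)*
    double term() {
        double v = factor();
        for (;;) {
            if (eat('*')) v *= factor();
            else if (eat('/')) v /= factor();
            else return v;
        }
    }
    // factor := number | name | name '(' expr ')' | '(' expr ')'
    double factor() {
        skip();
        if (eat('(')) {
            double v = expr();
            if (!eat(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        if (pos < s.size() && (std::isdigit((unsigned char)s[pos]) || s[pos] == '.')) {
            size_t used;
            double v = std::stod(s.substr(pos), &used);
            pos += used;
            return v;
        }
        std::string name;
        while (pos < s.size() && (std::isalnum((unsigned char)s[pos]) || s[pos] == '_'))
            name += s[pos++];
        if (name.empty()) throw std::runtime_error("unexpected character");
        if (eat('(')) {                                  // function call, e.g. MYFUNC(3)
            double arg = expr();
            if (!eat(')')) throw std::runtime_error("expected ')'");
            return funcs.at(name)(arg);
        }
        return vars.at(name);                            // plain variable
    }
};

int main() {
    std::map<std::string, double> vars = { {"var1", 5}, {"var2", 3} };
    std::map<std::string, std::function<double(double)>> funcs = {
        { "MYFUNC", [](double x) { return std::sqrt(x); } }   // stand-in for a user function
    };
    Parser p("var1 + (2 - MYFUNC(var2))", vars, funcs);
    std::cout << p.expr() << "\n";                       // 5 + (2 - sqrt(3)), about 5.268
}

Re-evaluation with new variable values only requires resetting pos to 0 (or, better, parsing once into an AST and walking it), so the parsing cost stays negligible next to whatever MYFUNC does.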

Identifying keywords of a (programming) language

This is a follow-up to my recent question (Code for identifying programming language in a text file). I'm really thankful for all the answers I got; they helped me very much. My code for this task is complete and it works fairly well - quick and reasonably accurate.
The method I used is the following: I have a "learning" Perl script that identifies the most frequently used words in a language by building a word histogram over a set of sample files. These data are then loaded by the C++ program, which checks the given text, accumulates a score for each language based on the words found, and then simply checks which language accumulated the highest score.
Now I would like to make it even better and work a bit on the quality of identification. The problem is that I often get "unknown" as the result (many languages accumulate a small score, but none anything bigger than my threshold). After some debugging, research etc. I found out that this is probably due to the fact that all words are considered equal. This means that seeing a "#include", for example, has the same effect as seeing a "while" - both of which indicate that it might be C/C++ (I'm now ignoring the fact that "while" is used in many other languages), but of course in larger .cpp files there might be a ton of "while" and most of the time only a few "#include".
So the fact that a "#include" is more important is ignored, because I could not come up with a good way to identify whether one word is more important than another. Now bear in mind that the script which creates the data is fairly stupid: it is only a word histogram, and it assigns every chosen word a score of 1. It does not even look at the words (so if there is a "#&|?/" in a file very often, it might get chosen as a good word).
Also, I would like to have the data-creation part fully automated, so nobody should have to look at the data and alter them, change scores, change words etc. All the "brainz" should be in the script and the C++ program.
Does somebody have a suggestion for how to identify keywords or, more generally, important words? Some things that might help: I have the number of occurrences of each word and the number of total words (so a ratio may be calculated). I have also thought about wiping out characters like ";" etc., since the histogram script often puts for example "continue;" in the result, when the important word is "continue". Last note: all checks for equality are exact matches - no substrings, case sensitive. This is mainly for speed, but substrings might help (or hurt, I don't know)...
NOTE: thanks to all who bothered to answer, you helped me a lot.
My work on this is almost finished, so I will describe what I did to get good results.
1) Get a decent training set, about 30-50 files per language from various sources, to avoid coding-style bias.
2) Write a Perl script that does a word histogram. Implement a blacklist and a whitelist (more about them below).
3) Add bogus words to the blacklist, like "license", "the" etc. These are often found at the start of a file in the license information.
4) Add about the five most important words per language to the whitelist. These are words that are found in most source code of a given language, but are not frequent enough to get into the histogram. For example, for C/C++ I had #include, #define, #ifdef, #ifndef and #endif in the whitelist.
5) Emphasize the start of a file, i.e. give more points to words found in the first 50-100 lines.
6) When doing the word histogram, tokenize the file using @words = split(/[\s\(\){}\[\];.,=]+/, $_); This should be OK for most languages, I think (it gives me the best results). For each language, keep about the 10-20 most frequent words in the final results.
7) When the histogram is complete, remove all words that are found in the blacklist and add all those that are found in the whitelist.
8) Write a program which processes a text file in the same way as the script - tokenize using the same rules. If a word is found in the histogram data, add points to the right language. Words in the histogram which correspond to only one language should add more points; those which belong to multiple languages should add less. A rough sketch of this scoring step is shown below.
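The sketch of that scoring step, assuming the histogram data is stored as plain "word language points" lines; the file format, the threshold and the 3x weight for single-language words are illustrative choices only:

#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    if (argc != 3) { std::cerr << "usage: classify histogram.txt source.file\n"; return 1; }

    // word -> (language -> points), as produced by the histogram script.
    std::map<std::string, std::map<std::string, int>> table;
    std::ifstream hist(argv[1]);
    std::string word, lang;
    int points;
    while (hist >> word >> lang >> points) table[word][lang] += points;

    // Tokenize the unknown file with the same delimiters as the Perl split.
    std::ifstream in(argv[2]);
    std::stringstream buf;
    buf << in.rdbuf();
    std::string text = buf.str();
    std::regex delim(R"([\s(){}\[\];.,=]+)");

    std::map<std::string, int> score;
    for (std::sregex_token_iterator it(text.begin(), text.end(), delim, -1), end; it != end; ++it) {
        std::string tok = *it;
        auto hit = table.find(tok);
        if (hit == table.end()) continue;
        int weight = hit->second.size() == 1 ? 3 : 1;   // words exclusive to one language count more
        for (const auto& [l, p] : hit->second) score[l] += weight * p;
    }

    const int threshold = 10;                           // scores must exceed this, else "unknown"
    std::string best = "unknown";
    int bestScore = threshold;
    for (const auto& [l, s] : score)
        if (s > bestScore) { best = l; bestScore = s; }
    std::cout << best << "\n";
    return 0;
}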
Comments are welcome. Currently, on about 1000 text files, I get 80 unknowns (mostly on extremely short files - mainly JavaScript with just one or two lines). About 20 files are recognized wrongly. The average file size is about 11 kB, ranging from 100 bytes to 100 kB (almost 11 MB in total). It takes one second to process them all, which is good enough for me.
I think you're approaching this from the wrong viewpoint. From your description, it sounds like you are building a classifier. A good classifier needs to discriminate between different classes; it doesn't need to precisely estimate the correspondence between the input and the most likely class.
Practically: your classifier doesn't need to assess precisely how close to C++ a certain input is; it merely needs to determine if the input is more like C than C++. This makes your work a lot easier - most of your current "unknown" cases will be close to one or two languages, even though they don't exceed your basic threshold.
Now, once you realize this, you will also see what training your classifier needs: not some random aspect of the sample files, but what sets two languages apart. Hence, when you have parsed your C samples, and your C++ samples, you will see that #include does not set them apart. However, class and template will be far more common in C++. On the other hand, #include does distinguish between C++ and Java.
There are of course other aspects besides keywords that you can use. For instance, the most obvious would be the frequency of {, and ; is similarly distinguishing. Another very useful feature for your classifier would be the comment tokens for the different languages. The basic problem of course would be automatically identifying them. Again, hardcoding //, /*, ', --, # and ! as pseudo-keywords would help.
This also identifies another classification rule: SQL will often have -- at the beginning of a line, whereas in C it will often appear somewhere else. Thus it may be useful for your classifier to take the context into account as well.
Use Google Code Search to learn weights for the set of keywords: #include in C++ gets 672,000 hits, in Python only ~5,000.
You can normalize the results by looking at the total number of results per language:
C++ gives about 770,000 files whereas Python returns 120,000.
Thus "#include" is extremely rare in Python files (roughly 5,000 / 120,000, about 4%), but appears in almost every C++ file (672,000 / 770,000, about 87%). (Now you still have to learn to distinguish C++ from C, of course.) All that is left is to do the correct reasoning about probabilities.
You need to get some exclusiveness into your lookup data.
When teaching it the programming languages you expect, you should search for words that are typical for one or a few languages. If a word appears in several code files of the same language but in few or none of the other languages' files, it is a strong suggestion of that language.
So the score of a word could be calculated at the lookup side by selecting the words that are exclusive to a language or a group of languages. Find several of these words, add up their scores per language, and the language with the highest total is the one you will have found.
In an answer to your other question, someone recommended a naïve Bayes classifier. You should implement this suggestion because the technique is good at separating according to distinguishing features. You mentioned the while keyword, but that's not likely to be useful because so many languages use it—and a Bayes classifier won't treat it as useful.
An interesting part of your problem is how to tokenize an unknown program. Whitespace-separated chunks is a decent rough start, but going meaningfully beyond that will be tricky.
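To make the naïve Bayes suggestion concrete, here is a minimal sketch (the training counts are made up, priors are ignored, and the per-language vocabulary size is used for Laplace smoothing):

#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// counts: language -> (word -> number of occurrences in the training set)
std::string classify(const std::vector<std::string>& tokens,
                     const std::map<std::string, std::map<std::string, int>>& counts)
{
    std::string best;
    double bestLog = 0;
    for (const auto& [lang, words] : counts) {
        double total = 0;
        for (const auto& kv : words) total += kv.second;
        double logp = 0;
        for (const auto& t : tokens) {
            auto it = words.find(t);
            int c = (it == words.end()) ? 0 : it->second;
            // Laplace smoothing so unseen words don't zero out the product.
            logp += std::log((c + 1.0) / (total + words.size()));
        }
        if (best.empty() || logp > bestLog) { best = lang; bestLog = logp; }
    }
    return best;
}

int main() {
    std::map<std::string, std::map<std::string, int>> counts = {
        { "C++",  { {"#include", 40}, {"class", 30}, {"while", 10} } },
        { "Java", { {"import", 45},   {"class", 35}, {"while", 10} } },
    };
    std::cout << classify({"#include", "while", "class"}, counts) << "\n";   // prints C++
}

Because the "while" counts are similar in both languages, that token contributes relatively little to the decision, which is exactly the behaviour described above.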

File Compressor In Assembly

In an effort to get better at programming assembly, and as an academic exercise, I would like to write a non-trivial program in x86 assembly. Since file compression has always been kind of an interest to me, I would like to write something like the zip utility in assembly.
I'm not exactly out of my element here, having written a simple web server using assembly and coded for embedded devices, and I've read some of the material for zlib (and others) and played with its C implementation.
My problem is finding a routine that is simple enough to port to assembly. Many of the utilities I've inspected thus far are full of #define's and other included code. Since this is really just for me to play with, I'm not really interested in super-awesome compression ratios or anything like that. I'm basically just looking for the RC4 of compression algorithms.
Is a Huffman Coding the path I should be looking down or does anyone have another suggestion?
Here is a more sophisticated algorithm which should not be too hard to implement: LZ77 (containing assembly examples) or LZ77 (this site contains many different compression algorithms).
One option would be to write a decompressor for DEFLATE (the algorithm behind zip and gzip). zlib's implementation is going to be heavily optimized, but the RFC gives pseudocode for a decoder. After you have learned the compressed format, you can move on to writing a compressor based on it.
I remember a project from second year computing science that was something similar to this (in C).
Basically, compressing involves replacing a run such as xxxxx (5 x's) with #\005x - a marker character, a byte with the value 5, and the repeated byte. This algorithm is very simple. It doesn't work that well for English text, but works surprisingly well for bitmap images.
Edit: what I am describing is run-length encoding.
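For reference while porting to assembly, here is a toy C++ version of that scheme; the '#' marker, the 4-byte threshold and the missing escaping of literal '#' bytes in the input are simplifications made for this sketch:

#include <iostream>
#include <string>

std::string rle_encode(const std::string& in)
{
    std::string out;
    for (size_t i = 0; i < in.size(); ) {
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        if (run >= 4) {                       // only worth encoding longer runs
            out += '#';                       // marker
            out += static_cast<char>(run);    // run length
            out += in[i];                     // repeated byte
        } else {
            out.append(run, in[i]);           // short runs are copied verbatim
        }
        i += run;
    }
    return out;
}

int main() {
    std::string s = "xxxxxyyab";
    std::cout << s.size() << " -> " << rle_encode(s).size() << " bytes\n";   // 9 -> 7 bytes
}

The decoder is the mirror image (read marker, count, byte; expand), which is part of what makes RLE a pleasant first target in assembly.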
Take a look at the UPX executable packer. It contains some low-level decompression code as part of its unpacking procedures...