I have an app that includes a 3 operator (& | !) boolean expression evaluator, with variables and constants. Generally the expressions aren't too long (perhaps 50 terms at the most, but usually a lot less). There can be very many expressions - I'm expecting the upper limit to be around a million. Currently I have a hand written parser with a very simple evaluator that simply recursively traverses the parse tree. One constraint is that this has to be callable from C++. I have no sharing between expressions. I'd like to investigate speeding this up.
I see two avenues of research.
Add sharing and store the state indicating whether an expression node has been evaluated or not.
Extract Common Subexpressions.
Also I would expect that a code generation approach will be faster than an interpretive approach working on parse trees or similar structures. It would probably be fairly straightforward to generate some C++ code, but considering the length of the functions, I don't know if a compiler like GCC will be able to optimize the CSEs.
I've seen that there are a few libraries available for expression evaluation, but in my work environment adding 3rd party libraries is not simple plus they all seem very complicated compared to my needs.
Lastly I've been looking at Antlr4 a bit recently, so that might be appropriate for me. In the past I've worked on C code generation, but I have no experience of using something like LLVM for optimisation and code generation.
Any suggestions for which way to go?
As far as I understand, your question is more about faster expression evaluation than it is about faster expression parsing, so my answer will focus on the former. Parsing, after all, should not be the bottleneck, as your expression language looks simple enough to implement a manually tuned parser for it.
So, to accelerate your evaluations, you can consider JIT execution of your formulas using LLVM. That is, given your formula F you can (relatively) easily generate corresponding LLVM IR and directly evaluate it. This SMT solver does just that. IR code generation is implemented in a single C++ class here.
Note that the boolean expressions you mentioned are a subset of the SMT language supported by that solver. Additionally, you can easily adjust how aggressive the LLVM optimizer needs to be.
However, IR generation and optimization have their own overhead. Therefore, if a given formula is not evaluated often enough to amortize that initial overhead, I would recommend direct interpretation instead. In that case you can look for opportunities to exploit structural similarities and common subexpressions.
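To make the suggestion concrete, here is a rough sketch of IR generation for the 3-operator boolean language using LLVM's C++ builder API. The AST node type and the function names are hypothetical, and the builder calls assume a recent LLVM with opaque pointers; the exact API varies between versions, so treat this as a sketch rather than working production code.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

// Hypothetical AST node for the &, |, ! language with variables and constants.
struct Node {
    enum Kind { Const, Var, And, Or, Not } kind;
    bool value = false;                        // for Const
    int varIndex = 0;                          // for Var
    const Node *lhs = nullptr, *rhs = nullptr; // children for And/Or/Not
};

// Emit an i1 value for one node; variable values are read from a byte array argument.
static llvm::Value *emit(llvm::IRBuilder<> &b, llvm::Value *vars, const Node *n) {
    switch (n->kind) {
    case Node::Const: return b.getInt1(n->value);
    case Node::Var: {
        llvm::Value *p = b.CreateGEP(b.getInt8Ty(), vars, b.getInt32(n->varIndex));
        return b.CreateICmpNE(b.CreateLoad(b.getInt8Ty(), p), b.getInt8(0));
    }
    case Node::And: return b.CreateAnd(emit(b, vars, n->lhs), emit(b, vars, n->rhs));
    case Node::Or:  return b.CreateOr(emit(b, vars, n->lhs), emit(b, vars, n->rhs));
    case Node::Not: return b.CreateNot(emit(b, vars, n->lhs));
    }
    return nullptr;
}

// Build "i1 eval(ptr vars)" for one expression. The module can then be handed to
// an LLVM JIT (e.g. ORC) and the resulting function pointer called from C++.
llvm::Function *buildEval(llvm::Module &m, const Node *root) {
    llvm::LLVMContext &ctx = m.getContext();
    auto *fnTy = llvm::FunctionType::get(llvm::Type::getInt1Ty(ctx),
                                         {llvm::PointerType::get(ctx, 0)}, false);
    auto *fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage, "eval", m);
    llvm::IRBuilder<> b(llvm::BasicBlock::Create(ctx, "entry", fn));
    b.CreateRet(emit(b, fn->getArg(0), root));
    return fn;
}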
As much as I'd like to suggest ANTLR4, I fear it won't meet your performance needs. There is a lot going on under the hood with its adaptive LL(*) algorithms, and though there are some common tricks to improve its performance, simply tracing an ANTLR4 interpreter at runtime suggests that unless your current expression evaluator is very inefficient, it is likely faster than ANTLR4, which is an industrial-duty engine meant to support grammars far more complicated than yours. I use ANTLR when a LALR(1) DFA shift-reduce engine won't support my grammar, and take the performance hit in return for the extra parsing power of ANTLR4.
Related
I am looking for a way to write fast code and be able to use built-in vector operations (for the sake of readability).
Fortran seems to be a good candidate. However, almost all resources I find on the web are about writing code without array expressions, and have only trivial examples of vector operations.
I feel a strong need for a good resource that covers the caveats and gives some insight into optimizing code that uses vector expressions.
Example:
Currently I am not even able to predict the behavior of code like this:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but is this correct? If I use OpenMP, will it behave the same way?
Personally, I would be very happy to have something like the following NumPy examples:
100 numpy exercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy
Your code is not standard-conforming. Fortran 2008, 6.5.3.3.2, paragraph 3:
"If a vector subscript has two or more elements with the same value, an array section with that vector subscript shall not appear in a variable definition context (16.6.7)."
See also NOTE 6.15.
Therefore the result of your operation is not defined by the standard.
Other parts of your question appear to be too broad to treat here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by vectorization most people in Fortran and C or C++ mean the use of SIMD instructions, not the vectorized expressions from NumPy; the latter are just array expressions in Fortran.
I have scanned many sources (~20 books and dozens of web pages); with some bad luck I may still have missed something really important. The question I posted is indeed incorrect and comes from my initial high expectations about array operations in Fortran.
The answer I would expect is: there are no tools to write short, readable code in Fortran with automatic parallelization (to be more precise: there are, but those are proprietary libraries).
The list of intrinsic functions available in Fortran is quite short
(link), and consists only of functions that map easily to SIMD ops.
There are lots of functions that one will be missing.
While this could be resolved by a separate library with a separate implementation for each platform, Fortran doesn't provide one. There are commercial options (see this thread).
Brief examples of missing functions:
No built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget about threads and CUDA).
No cumulative sum / running sum. One can implement it trivially, but the resulting code will never work well on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases Fortran provides a 2x speed-up even with simple implementations, but the code written this way will always be single-threaded, while MATLAB/NumPy functions can be reimplemented for a GPU or another parallel platform without any effort on the user's side (as has occasionally happened with MATLAB; also see gnumpy, theano and parakeet).
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code into proprietary software. And I'm still looking for an appropriate tool. (Julia is the current candidate.)
See also:
STL analogue in fortran, where ready-to-use algorithms are asked about.
Numerical Recipes: The Art of Parallel Programming, where the author implements basic MATLAB-like operations to get more expressive code.
I also find these notes useful for seeing the recommended ways of optimizing code (and for seeing that there is no place for vector operations in them).
numpy, fortran, blitz++: a case study
discussion about implementing unique in fortran, where proprietary tools are recommended.
I am currently working on a database-related project in which I generate a lot of C++ code. This code is then compiled and loaded as a dynamic library. I use this technique to build efficient code for the database schema and queries.
Currently, I am using simple file writes to generate the code (which was okay for the proof-of-concept implementation). Now, I am searching for a more elegant but comparably flexible solution to generate C++ code.
I searched quite a lot but all the solutions I found so far are rather complex/extensive, not efficient enough, or not flexible enough.
What libraries are you using in your C++ projects to generate code?
You can use a program transformation system (PTS) to define and compose code templates in a reliable way.
Most PTS enable one to define a grammar, and then parse source code into ASTs using that grammar. More importantly, they accept patterns: source code fragments (usually of a nonterminal or a list of nonterminals) with placeholders that correspond to well-formed sub-fragments (nonterminals representing subtrees). These patterns usually insist that a named placeholder match identically (see example below). Such patterns can be used to match against a parsed AST as a way to find code fragments using the surface syntax.
So, one might use a pattern:
pattern x_squared(t: term): product
= " \t * \t ";
to hunt for subexpressions which consist of products of identical subtrees.
This will match
(p + q[17]) * (p + q[17])
but not
2 * (x-3)
But just as interestingly, such patterns can be used as code generators, by instantiating the pattern with bound value (trees) for the variables. So,
"instantiate x_squared(2^x)" produces
(2^x)*(2^x)
By itself, this is just a fancy sort of macro scheme. It is a lot better, in that it can tell you "at compile time" (for the patterns) whether what you are composing makes sense or not. So you get type checking of the composition of the code fragments. For instance, you might accidentally code "instantiate x_squared(int q)", but a good PTS will object that "int q" is not a "term"; you find the bug when you build the code generator.
Where this gets really interesting is where one can build many different code fragments from many different patterns, and compose those fragments with yet more patterns. This allows one to build very complex code. All of this is done in a (syntactically) type-safe way; the resulting trees are valid syntax. (You can still bollix the semantics; nothing is perfect.) As the complexity of the code you can generate goes up, it is good to have this additional checking to help you avoid generating bad code.
A PTS has an additional advantage: after composing code fragments, it can apply source-to-source transformations to optimize the resulting code. Thus you can produce optimized code according to your ability to write matching transformations, and harnessing knowledge you have during code generation.
Imagine you generate code for a matrix multiply:
... P * Q ...
and your code generator somehow or other knows that Q is an identity matrix.
Then the following optimization can remove an expensive matrix multiply:
rule optimize_matrix_times_unit(m: term, q: term): product -> product
" \m * \q "
-> " \m "
if is_identity_matrix(q)
This transformation takes advantage of pattern matching (to find a matrix product) in the generated code, pattern instantiation (to generate a replacement for the matched product), and additional knowledge or analysis (is_identity_matrix) that the code generation can do.
You need a PTS capable of handling C++ parsing; those are a bit hard to find. The one I designed (DMS Software Reengineering Toolkit) happens to do this. The examples in this answer are DMS-style.
Here's a technical paper that describes a large-scale reengineering task done by DMS on C++ code. A number of examples in the paper are actually quite complex patterns used to instantiate code; the reengineering task had to generate a new set of APIs for an existing chunk of code.
I can imagine expression templates doing awful things to compile times for things as pervasive as vectors/matrices/quaternions etc., but if they are such a great speed boost, why don't games use them? It's quite obvious that SIMD instructions can exploit data-level parallelism to great effect. Expression templates and lazy evaluation together seem to make sense, at least when it comes to eliminating temporaries.
So while libraries like Eigen advertise such features, I don't see this done commonly in middleware (e.g. Havok) or games where things are extremely speed critical. Can anyone shed some light on this? Does it have to do with non-determinism or branch prediction?
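For concreteness, here is roughly what the technique looks like for a small vector type. This is an illustrative sketch (the names Vec3 and Add are made up, and real libraries such as Eigen are far more elaborate); the point is that d = a + b + c is evaluated in a single per-component loop with no temporary Vec3 objects.

#include <cstddef>

struct Vec3 {
    float v[3];
    float operator[](std::size_t i) const { return v[i]; }
    // Assigning from any expression evaluates it element-wise, so the whole
    // right-hand side is computed in one loop without temporaries.
    template <class E>
    Vec3 &operator=(const E &e) {
        for (std::size_t i = 0; i < 3; ++i) v[i] = e[i];
        return *this;
    }
};

// A node of the expression tree: holds references and computes lazily.
template <class L, class R>
struct Add {
    const L &l;
    const R &r;
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
Add<L, R> operator+(const L &l, const R &r) { return {l, r}; }

// Usage:
//   Vec3 a{{1, 2, 3}}, b{{4, 5, 6}}, c{{7, 8, 9}}, d;
//   d = a + b + c;   // builds Add<Add<Vec3, Vec3>, Vec3>, then runs one loop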
I can think of a lot of reasons:
it hurts compile times. Longer compile times mean that testing any change you made to the code takes longer. It hurts productivity.
it's complex. Most likely, many developers on the team are not familiar with expression templates, and will have a hard time reading and debugging them.
Games often have to work on multiple platforms, with various compilers which may have a wide range of shortcomings, which might for example make advanced template trickery problematic.
It's generally not necessary. You can write efficient code without expression templates. It just gets more verbose, and you have to do more hand-holding for the compiler (see the sketch after this list).
Game developers are extremely skeptical of anything that wasn't already used in games 10 years ago. It's not long ago that several major developers stuck to C: not because C++ wasn't good enough, but because it was "new". Game developers are conservative as hell.
And of course, the obvious question: where would they use expression templates? Is there enough complex math to really make it worthwhile? Games tend to rely on a fairly small number of linear algebra operations, which will typically be heavily hand-tuned in any case.
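To illustrate the "verbose but efficient" point above: without expression templates you typically spell out fused operations by hand. A tiny illustrative sketch (Vec3 and add3 are made-up names):

struct Vec3 { float x, y, z; };

// d = a + b + c written out by hand: one pass, no temporary Vec3 objects,
// at the cost of a dedicated function for each fused operation you care about.
inline void add3(const Vec3 &a, const Vec3 &b, const Vec3 &c, Vec3 &d) {
    d.x = a.x + b.x + c.x;
    d.y = a.y + b.y + c.y;
    d.z = a.z + b.z + c.z;
}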
I want to add one more reason not stated in the above answers. Apologies if it is and I missed it.
Adding templates to math based classes, such as a vec3 class, can change the meaning of operators and lead to functions that are invalid for some template types.
Take for instance,
vec3<int> myVec( 3, 5, 4 );
myVec.Normalize();
What would normalize mean for an integer vector? All of a sudden, when we add templates to math constructs, we invalidate many existing functions, such as in the example described above.
Also, another thing worth mentioning is that many math constructs are optimized for certain types because optimization is so important in games. GPUs are floating-point calculating machines. Doubles take up double the space of floats and are quite a bit slower to compute with, even though doubles may seem like an obvious choice to a new game developer.
I hope this example makes sense. Templates are a great tool but math constructs in games are just not the right place to use them.
Typically, the parts of a game that are both performance-sensitive and math-heavy, and still tend to run on the CPU rather than the GPU, apply the same basic operations to large numbers of elements. Some examples are animation blending, physics calculations, visibility tests, etc.
The best approach to optimizing these sorts of problems on current console hardware is generally to try and batch as much work together as possible and to aim for maximum data locality to avoid expensive cache misses. The actual math can then be optimized using SIMD intrinsics and will typically be carefully hand optimized. The kind of optimizations that expression templates give you can be performed relatively easily during that hand optimization phase but there are various other important optimizations that are also likely to be performed that expression templates won't give you. Often this critical code will have sections with custom optimizations for each target platform and won't be very portable.
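As an illustration of that batching-plus-intrinsics style, here is a small sketch (the function name and its layout assumptions are made up for the example): blending two poses stored as flat, structure-of-arrays float channels with SSE intrinsics.

#include <immintrin.h>
#include <cstddef>

// Blend two packed float channels: out = a * (1 - t) + b * t.
// Assumes count is a multiple of 4 and all pointers are 16-byte aligned.
void blend_channels(const float *a, const float *b, float t,
                    float *out, std::size_t count) {
    const __m128 vt  = _mm_set1_ps(t);
    const __m128 vit = _mm_set1_ps(1.0f - t);
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vit),
                                         _mm_mul_ps(vb, vt)));
    }
}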
I think the reason that expression templates aren't widely used is that they add software complexity (for all the reasons described by jalf) to non performance critical code that doesn't really warrant it while not covering all the optimizations that are necessary for the really performance critical code that shows up at the top of profiles.
I sank about a month of full-time work into a native C++ equation parser. It works, except it is slow (between 30 and 100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and an "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description if the slowness includes parsing, or it is just the interpretation time.
The parser, if you write it as recursive descent (LL(1)), should be I/O bound. In other words, the reading of characters by the parser, and the construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
Interpreted code is usually 10-100 times slower than compiled code, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
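A minimal sketch of that approach on Linux, assuming g++ is available at runtime. The file names, compiler flags, and generated function signature are illustrative, and you need to link the host program with -ldl.

#include <cstdlib>
#include <dlfcn.h>
#include <fstream>
#include <stdexcept>
#include <string>

using EvalFn = double (*)(const double *vars);

// Wrap the user's expression in a C++ function, compile it into a shared
// library, load the library, and return a pointer to the compiled function.
EvalFn compile_expression(const std::string &cxx_expr) {
    std::ofstream("expr_gen.cpp")
        << "extern \"C\" double eval(const double* v) { return "
        << cxx_expr << "; }\n";
    if (std::system("g++ -O2 -shared -fPIC expr_gen.cpp -o expr_gen.so") != 0)
        throw std::runtime_error("compilation failed");
    void *lib = dlopen("./expr_gen.so", RTLD_NOW);
    if (!lib)
        throw std::runtime_error(dlerror());
    return reinterpret_cast<EvalFn>(dlsym(lib, "eval"));
}

// Usage: EvalFn f = compile_expression("v[0] * v[0] + 2.0 * v[1]");
//        double vars[] = {3.0, 4.0};  double y = f(vars);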
The very first thing is: profile what actually went wrong. Is the bottleneck in parsing or in evaluation? Valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If it's in evaluation, remember that virtual functions can be pretty slow to evaluate. I've had pretty good experiences with recursive boost::variants.
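For what it's worth, a recursive boost::variant expression tree looks roughly like this (the node and visitor names are illustrative, and only two operations are shown):

#include <boost/variant.hpp>

struct AddNode; struct MulNode;
using Expr = boost::variant<double,
                            boost::recursive_wrapper<AddNode>,
                            boost::recursive_wrapper<MulNode>>;
struct AddNode { Expr lhs, rhs; };
struct MulNode { Expr lhs, rhs; };

// Evaluation is a static visitor; no virtual dispatch, no heap-allocated
// node hierarchy beyond what the recursive_wrapper needs.
struct Eval : boost::static_visitor<double> {
    double operator()(double v) const { return v; }
    double operator()(const AddNode &n) const {
        return boost::apply_visitor(*this, n.lhs) + boost::apply_visitor(*this, n.rhs);
    }
    double operator()(const MulNode &n) const {
        return boost::apply_visitor(*this, n.lhs) * boost::apply_visitor(*this, n.rhs);
    }
};

// Usage: Expr e = AddNode{2.0, MulNode{3.0, 4.0}};
//        double v = boost::apply_visitor(Eval{}, e);   // 14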
You know, building a recursive descent parser for expressions is really easy; the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair and everything else can work on the expression tree (basically while parsing): you collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This avoids function/class pointers for determining the call path at runtime altogether, relying instead on plain recursion (or you can build an iterative LL parser if you wish).
It seems that you're using a quite complicated data structure (as I understand it, a syntax tree with pointers etc.). Walking it through pointer dereferences is not very memory-efficient (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). As far as I know, Microsoft .NET provides this feature (dynamic compilation) with Reflection.Emit and Linq.Expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
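A minimal sketch of that stack-based evaluation, assuming space-separated tokens and only numeric operands (a real evaluator would add variables and error checking):

#include <sstream>
#include <stack>
#include <string>

double eval_rpn(const std::string &expr) {
    std::stack<double> st;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            // Operator: pop the two topmost operands and push the result.
            double rhs = st.top(); st.pop();
            double lhs = st.top(); st.pop();
            if (tok == "+")      st.push(lhs + rhs);
            else if (tok == "-") st.push(lhs - rhs);
            else if (tok == "*") st.push(lhs * rhs);
            else                 st.push(lhs / rhs);
        } else {
            st.push(std::stod(tok));  // operand
        }
    }
    return st.top();  // a well-formed expression leaves exactly one value
}

// Usage: eval_rpn("3 4 2 * +") == 11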
Just about the only thing that will usually give better performance than this is to do as @Mike Dunlavey advised, and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.
Does anyone know a good approach/libs for doing algebraic computations in C++?
I have an application being developed in C++ which needs to do algebraic computation. For now I have built a C++ parser that accepts expressions in the form of strings like "5 + (2 - MYFUNC(3))", which get tokenized into structs, converted into postfix notation using the Shunting Yard algorithm, and evaluated.
The MYFUNCs in these expressions are my own defined functions that may do some complex computations.
This is a high-performance app; the expressions also contain variables that are dynamically replaced with values, and the expression is re-evaluated,
e.g. var1 + (2 - MYFUNC(var2)) -> with var1 and var2 replaced by some values during the course of the run and re-evaluated.
I'm using Linux, and so far I have found the Giac library, but I'm not sure if it's good; any feedback would be welcome.
How do people generally approach this problem? The main language in this case is C++.
Have a look at the Bison and Flex parser generators. The basic idea here is that a grammar file is written and converted into C code, which can be integrated into your application. Any book on Flex and Bison (http://www.amazon.com/Flex-Bison-Text-Processing-Tools/dp/0596155972) is good enough for initial reading.
Maybe this helps you!
Probably the fastest way to handle this is to generate a compiled, optimized function at runtime for the defined function, and evaluate it for the different variable values that you may have. You can do this with LLVM, probably other tools.
I would write a recursive descent parser for the language, because the syntax doesn't look very complex. The parsing may be a tiny bit slower than that of, say, Flex/Bison, but I'm guessing parsing will be the least computationally expensive part of your project.
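As a rough sketch of how small such a parser can be for this kind of grammar (the Parser class and the callUserFunction hook are made up for the example; error handling and unary minus are kept minimal):

#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Grammar:  expr   := term   (('+'|'-') term)*
//           term   := factor (('*'|'/') factor)*
//           factor := number | '(' expr ')' | name '(' expr ')'
class Parser {
public:
    explicit Parser(std::string s) : src_(std::move(s)) {}
    double parse() {
        double v = expr();
        skipWs();
        if (pos_ != src_.size()) throw std::runtime_error("trailing input");
        return v;
    }
private:
    std::string src_;
    std::size_t pos_ = 0;

    void skipWs() { while (pos_ < src_.size() && std::isspace((unsigned char)src_[pos_])) ++pos_; }
    bool eat(char c) { skipWs(); if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; } return false; }

    double expr() {
        double v = term();
        for (;;) { if (eat('+')) v += term(); else if (eat('-')) v -= term(); else return v; }
    }
    double term() {
        double v = factor();
        for (;;) { if (eat('*')) v *= factor(); else if (eat('/')) v /= factor(); else return v; }
    }
    double factor() {
        skipWs();
        if (eat('(')) {
            double v = expr();
            if (!eat(')')) throw std::runtime_error("expected )");
            return v;
        }
        if (pos_ < src_.size() && std::isalpha((unsigned char)src_[pos_])) {
            std::string name;
            while (pos_ < src_.size() && (std::isalnum((unsigned char)src_[pos_]) || src_[pos_] == '_'))
                name += src_[pos_++];
            if (!eat('(')) throw std::runtime_error("expected ( after " + name);
            double arg = expr();
            if (!eat(')')) throw std::runtime_error("expected )");
            return callUserFunction(name, arg);   // hypothetical dispatch to MYFUNC etc.
        }
        std::size_t used = 0;
        double v = std::stod(src_.substr(pos_), &used);
        pos_ += used;
        return v;
    }
    static double callUserFunction(const std::string &, double arg) {
        return arg;  // placeholder; a real implementation would look up the function
    }
};

// Usage: double v = Parser("5 + (2 - MYFUNC(3))").parse();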