Building a static (but complicated) lookup table using templates - C++

I am currently in the process of optimizing a numerical analysis code. Within the code, there is a 200x150-element lookup table (currently a static std::vector<std::vector<double>>) that is constructed at the beginning of every run. The construction of the lookup table is actually quite complex: the values in the table are computed by an iterative secant method on a complicated set of equations. Currently, constructing the lookup table accounts for 20% of the run time (run times are on the order of 25 seconds; table construction takes 5 seconds). While 5 seconds might not seem like a lot, when running our Monte Carlo simulations, where we run 50k+ simulations, it suddenly becomes a big chunk of time.
Along with some other ideas, one thing that has been floated is: can we construct this lookup table using templates at compile time? The table itself never changes. Hard-coding a large array isn't a maintainable solution (the equations that go into generating the table are constantly being tweaked), but it seems that if the table could be generated at compile time, we would get the best of both worlds (easily maintainable, no overhead during runtime).
So, I propose the following (much simplified) scenario. Let's say you wanted to generate a static array (use whatever container suits you best: 2D C array, vector of vectors, etc.) at compile time. You have a function defined:
double f(int row, int col);
where the return value is the entry in the table, row is the lookup table row, and col is the lookup table column. Is it possible to generate this static array at compile time using templates, and how?

Usually the best solution is code generation. There you have all the freedom and you can be sure that the output is actually a double[][].
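For illustration, a minimal sketch of what such a generator could look like; the output file name, the dimensions, and the stand-in body for f() are placeholders, not anything from the actual project:

#include <cstdio>

// Stand-in for the real (slow, iterative secant-method) routine from the question.
double f(int row, int col) { return row * 0.01 + col * 0.001; }

int main() {
    std::FILE* out = std::fopen("lookup_table_generated.cpp", "w");
    if (!out) return 1;
    std::fprintf(out, "// Auto-generated; do not edit. Regenerate when the equations change.\n");
    std::fprintf(out, "extern const double lookup_table[200][150] = {\n");
    for (int r = 0; r < 200; ++r) {
        std::fprintf(out, "  {");
        for (int c = 0; c < 150; ++c)
            std::fprintf(out, "%.17g%s", f(r, c), c + 1 < 150 ? ", " : "");
        std::fprintf(out, "}%s\n", r + 1 < 200 ? "," : "");
    }
    std::fprintf(out, "};\n");
    std::fclose(out);
    return 0;
}

Hook the tool into the build so it reruns whenever the equation source changes; the simulator then just declares extern const double lookup_table[200][150]; and links against the generated file.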

Save the table on disk the first time the program is run, and only regenerate it if it is missing, otherwise load it from the cache.
Include a version string in the file so it is regenerated when the code changes.
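A minimal sketch of that caching scheme, assuming the f(row, col) routine from the question and placeholder names for the cache file and version tag:

#include <fstream>
#include <string>
#include <vector>

double f(int row, int col);  // the existing (slow) table routine from the question

static const std::string kTableVersion = "v1-tweak-me";  // bump whenever the equations change

std::vector<std::vector<double>> load_or_build_table(int rows, int cols) {
    std::vector<std::vector<double>> t(rows, std::vector<double>(cols));
    std::ifstream in("lookup_table.cache", std::ios::binary);
    std::string tag(kTableVersion.size(), '\0');
    if (in && in.read(&tag[0], tag.size()) && tag == kTableVersion) {
        for (auto& row : t)  // cache hit with a matching version: just read the doubles
            in.read(reinterpret_cast<char*>(row.data()), cols * sizeof(double));
        if (in) return t;
    }
    for (int r = 0; r < rows; ++r)  // cache miss: rebuild the table the slow way...
        for (int c = 0; c < cols; ++c)
            t[r][c] = f(r, c);
    std::ofstream out("lookup_table.cache", std::ios::binary);
    out.write(kTableVersion.data(), kTableVersion.size());
    for (const auto& row : t)  // ...and write it back for the next run
        out.write(reinterpret_cast<const char*>(row.data()), cols * sizeof(double));
    return t;
}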

A couple of things here.
What you want to do is almost certainly at least partially possible.
Floating point values are not valid template arguments (they just aren't, don't ask why). Although you can represent rational numbers in templates using an N1/N2 representation, the amount of math you can do on them does not cover everything that can be done on rational numbers; root(n), for instance, is unavailable (consider root(2)). Unless you want a bajillion instantiations of static double variables, you'll want your value accessor to be a function. (Maybe you can come up with a new template floating point representation that splits the exponent and mantissa, though, and then you're as well off as with the double type... have fun :P)
Metaprogramming code is hard to format in a legible way. Furthermore, by its very nature, it's rather tough to read. Even an expert is going to have a tough time analyzing a piece of TMP code they didn't write even when it's rather simple.
If an intern or anyone under senior level even THINKS about just looking at TMP code their head explodes. Although, sometimes senior devs blow up louder because they're freaking out at new stuff (making your boss feel incompetent can have serious repercussions even though it shouldn't).
All of that said...templates are a Turing-complete language. You can do "anything" with them...and by anything we mean anything that doesn't require some sort of external ability like system access (because you just can't make the compiler spawn new threads for example). You can build your table. The question you'll need to then answer is whether you actually want to.
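For what it's worth, if a C++14-or-later compiler is an option, constexpr gets you the compile-time table without any of the template gymnastics, and the floating-point restriction above simply doesn't apply. A minimal sketch with a stand-in generating function and placeholder names (a real iterative secant solve is allowed inside a constexpr function under C++14's relaxed rules, though compilers cap the number of constexpr evaluation steps, so a heavy solve over 30,000 cells may need that limit raised):

constexpr int kRows = 200;
constexpr int kCols = 150;

// Stand-in for the real generating routine; loops and local mutation are fine in a
// constexpr function from C++14 onwards.
constexpr double f(int row, int col) { return row * 0.01 + col * 0.001; }

struct Table { double v[kRows][kCols]; };

constexpr Table make_table() {
    Table t{};
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            t.v[r][c] = f(r, c);
    return t;
}

// Evaluated entirely at compile time; the binary just contains the 200x150 doubles.
constexpr Table lookup_table = make_table();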

Why not have separate programs? One that generates the table and stores it in a file, and one that loads the file and runs the simulation on it. That way, when you need to tweak the equations that generate the table, you only need to recompile that program.

If your table were a bunch of ints, then yes, you could. Maybe. But what you certainly couldn't do is generate doubles at compile time this way.
More importantly, I think a plain double[][] would be better than a vector of vectors here: you're paying for a LOT of dynamic allocation for a statically sized table.

Related

Can splitting a one-liner into multiple lines with temporary variables impact performance i.e. by inhibiting some optimizations?

This is a very general C++ question. Consider the following two blocks (they do the same thing):
v_od=((x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint()).cwiseAbs2().rowwise().sum()).array().sqrt();
and
MatrixXd wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(wtemp.cwiseAbs2().rowwise().sum()).array().sqrt();
Now the first construct feels more efficient. But is that true, or would the C++ compiler compile them down to the same thing? (Assume the compiler is a good one with all the safe optimization flags turned on. For argument's sake, wtemp is moderately sized, say a matrix with 100k elements all told.)
I know the generic answer to this is 'benchmark it and come back to us', but I want a general answer.
There are two cases where your second expression could be fundamentally less efficient than your first.
The first case is where the writer of the MatrixXd class provided rvalue-reference-to-this (&&-qualified) overloads of cwiseAbs2(). In the first snippet, the value we call the method on is a temporary; in the second it is not. We can fix this by simply changing the second expression to:
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
which casts wtemp to an rvalue and basically tells cwiseAbs2() that the matrix it is being called on can be reused as scratch space. This only matters if the writers of the MatrixXd class implemented this particular feature.
The second possible way it could be fundamentally slower is if the writers of the MatrixXd class used expression templates for pretty much every operation listed. This technique builds the parse tree of the operations, and only finalizes all of them when you assign the result to a value at the end.
Some expression templates are written to handle being able to be stored in an intermediate object like this:
auto&& wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
where the first stores the expression template wtemp rather than evaluating it into a matrix, and the second line consumes the first intermediate result. Other expression template implementations break horribly if you try to do something like the above.
Expression templates, too, are something that the matrix class writers would have to have specifically implemented. And the rvalue-reference overload above is again a somewhat obscure technique; it would mainly be of use in situations where extending a buffer is done by seemingly cheap operations, like string append.
Barring those two cases, any difference in performance is going to be purely "noise" -- there would be no reason, a priori, to expect the compiler to be confused by one or the other more or less.
And both of these are relatively advanced/modern techniques.
Neither of them will be implemented "by the compiler" without explicitly being done by the library writer.
In general the second version is much more readable, and that's why it is preferred. It clearly names the temporary variable, which helps you understand the code better. Moreover, it's much easier to debug! That's why I would strongly recommend going for the second option.
I would not worry much about the performance difference: a good compiler will generate identical code for both examples.
The most important aspects of code in order, most important -> less important:
Correct code
Readable code
Fast code
Of course, this can change (i.e. on embedded devices where you have to squeeze out every last bit of performance in limited memory space) but this is the general case.
Therefore, you want the code that is easier to read over a possibly negligible performance increase.
I wouldn't expect a performance hit for storing temporaries, at least not in the general case. In fact, in some cases you can expect it to be faster, e.g. caching the result of strlen() when working with C strings (the first example that comes to mind).
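(For the record, the strlen() case looks roughly like this; process() and its argument are made up purely for the illustration:)

#include <cstring>

void process(const char* s) {
    // Calling strlen() in the loop condition re-scans the string on every iteration.
    for (std::size_t i = 0; i < std::strlen(s); ++i) { /* use s[i] */ }

    // Naming the temporary pays for the scan once; here the extra variable is the faster version.
    const std::size_t len = std::strlen(s);
    for (std::size_t i = 0; i < len; ++i) { /* use s[i] */ }
}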
Once you have written the code, verified that it is correct, and found a performance problem, only then should you worry about profiling and making it faster, at which point you'll probably find that having more maintainable / readable code actually helps you isolate the problem.

Defending classes with 'magic numbers'

A few months ago I read a book on security practices, and it suggested the following method for protecting our classes from being overwritten by e.g. buffer overflows:
first define a magic number and a fixed-size array (it can be a simple integer too)
fill that array with the magic number, and place one copy at the top and one at the bottom of the class
a function compares these values; if they are equal to each other and to the static magic number, the object is fine and it returns true, otherwise the object is corrupt and it returns false
place this function at the start of every other class method, so that the validity of the object is checked on every call
it is important to place one array at the very start and the other at the very end of the class
At least this is as I remember it. I'm coding a file encryptor for learning purposes, and I'm trying to make this code exception safe.
So, in which scenarios is it useful, and when should I use this method, or is this something totally useless to count on? Does it depend on the compiler or OS?
PS: I forgot the name of the book mentioned in this post, so I cannot check it again; if any of you know which one it was, please tell me.
What you're describing sounds like a canary, but implemented within your program rather than by the compiler. Compiler-level canaries are usually on by default when using gcc or g++ (along with a few other buffer overflow countermeasures).
If you're doing mutable operations on your class and you want to make sure you don't have side effects, I don't know if having a magic number is very useful. Why rely on a homebrew validity check when there are methods out there that are more likely to be successful?
Checksums: I think it'd be more useful for you to hash the unencrypted text and add that to the end of the encrypted file. When decrypting, remove the hash and compare the hash(decrypted text) with what it should be.
I think most, if not all, widely used encryptors/decryptors store some sort of checksum in order to verify that the data has not changed.
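A minimal sketch of that idea; the names are placeholders, and std::hash stands in only to keep the example self-contained (a real encryptor would use a proper cryptographic hash or MAC, not std::hash):

#include <cstdint>
#include <functional>
#include <string>

// Tag computed over the plaintext and stored alongside the ciphertext.
std::uint64_t integrity_tag(const std::string& plaintext) {
    return std::hash<std::string>{}(plaintext);
}

// After decrypting, recompute the tag and compare it with the stored one.
bool verify(const std::string& decrypted, std::uint64_t stored_tag) {
    return integrity_tag(decrypted) == stored_tag;  // mismatch => corrupted or tampered
}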
This type of canary will partially protect you against a very specific type of overflow attack. You can make it a little more robust by randomizing the canary value every time you run the program.
If you're worried about buffer overflow attacks (and you should be if you are ever parsing user input), then go ahead and do this. It probably doesn't cost too much in speed to check your canaries every time. There will always be other ways to attack your program, and there might even be careful buffer overflow attacks that get around your canary, but it's a cheap measure to take so it might be worth adding to your classes.
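For concreteness, a minimal sketch of the in-class canary the question describes, including the per-run randomization suggested above; the class name, buffer size, and layout are placeholders:

#include <cstdint>
#include <cstdlib>
#include <random>

class GuardedBuffer {
    static std::uint32_t canary_value() {
        static const std::uint32_t v = std::random_device{}();  // fresh value on every run
        return v;
    }
    std::uint32_t front_canary_ = canary_value();  // must stay the very first member
    char data_[256] = {};
    std::uint32_t back_canary_ = canary_value();   // must stay the very last member

    bool intact() const {
        return front_canary_ == canary_value() && back_canary_ == canary_value();
    }
public:
    void write(std::size_t i, char c) {
        if (!intact()) std::abort();  // something scribbled over the object: fail fast
        if (i < sizeof(data_)) data_[i] = c;
    }
};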

Is defining a probability distribution costly?

I'm coding a physics simulation and I'm now feeling the need to optimize it. I'm thinking about improving one point: one of the methods of one of my classes (which I call billions of times in several cases) defines a set of probability distributions every time it is called. Here is the code:
void myClass::myMethod(){ //called billions of times in several cases
    uniform_real_distribution<> probd(0,1);
    uniform_int_distribution<> probh(1,h-2);
    uniform_int_distribution<> probv(1,v-2);
    //rest of the code
}
Could I make the distributions members of the class so that I don't have to define them every time, just initialize them in the constructor and redefine them when h and v change? Would that be a worthwhile optimization? And one last question: is this something that is already taken care of by the compiler (g++ in my case) when compiling with -O2 or -O3?
Thank you in advance!
Update: I coded it and timed both: the program actually turned out a bit slower (by a few percent), so I'm back where I started: creating the probability distributions in each call.
Answer A: I shouldn't think so; for a uniform distribution it's just going to copy the parameter values into place, maybe with a small amount of arithmetic, and that will be well optimized.
However, I believe distribution objects can have state. They can use part of the random data from a call to the generator, and are permitted to save the rest of the randomness to use next time the distribution is used, in order to reduce the total number of calls to the generator. So when you destroy a distribution object you might be discarding some possibly-costly random data.
Answer B: stop guessing and test it.
Time your code, then add static to the definition of probd and time it again.
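A minimal sketch of the variants worth timing against each other; the names mirror the question, but the engine member and setSize() are assumptions for the sake of the example:

#include <random>

class myClass {
    int h = 100, v = 100;                          // placeholder sizes
    std::mt19937 gen{std::random_device{}()};      // assumed engine, reused across calls
    // Member distributions: built once, reset only when h or v actually changes.
    std::uniform_int_distribution<> probh{1, h - 2};
    std::uniform_int_distribution<> probv{1, v - 2};
public:
    void setSize(int newH, int newV) {
        h = newH; v = newV;
        probh.param(std::uniform_int_distribution<>::param_type(1, h - 2));
        probv.param(std::uniform_int_distribution<>::param_type(1, v - 2));
    }
    void myMethod() {
        // probd's parameters never change, so it is safe as a function-local static.
        static std::uniform_real_distribution<> probd(0, 1);
        double d = probd(gen);
        int row = probh(gen);
        int col = probv(gen);
        (void)d; (void)row; (void)col;             // rest of the method elided
    }
};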
Yes
Yes
Well, there may be some advantage, but AFAIK those objects aren't really heavyweight/expensive to construct. Also, with locals you may gain something in data locality and in assumptions the optimizer can make.
I don't think they are automatically moved as class variables (especially if your class is POD - in that case I doubt the compiler will dare to modify its layout); most probably, instead, they are completely optimized away - only the code of the called methods - in particular operator() - may remain, referring directly to h and v. But this must be checked by looking at the generated assembly.
Incidentally, if you have a performance problem, besides optimizing the obvious points (non-optimal algorithms used in inner loops, continuous memory allocations, useless copies of big objects, ...), you should really try to use a profiler to find the real "hot spots" in your code, and concentrate on optimizing them instead of going randomly through all the code.
uniform_real_distribution maintains a state of type param_type which is two double values (using default template parameters). The constructor assigns to these and is otherwise trivial, the destructor is trivial.
Therefore, constructing a temporary within your function has an overhead of storing two double values, as compared to initializing one pointer (or reference) or going through an indirection via this. In theory it might therefore be faster (though what appears to be faster, or what would make sense to run faster, isn't necessarily any faster). Since it's not much work, it's certainly worth trying and timing whether there's a difference, even if it is a micro-optimization.
Some 3-4 extra cycles are normally negligible, but since you're saying "billions of times" it may very well make a measurable difference. 3 cycles times one billion is 1 second on a 3 GHz machine.
Of course, optimization without profiling is always somewhat... awkward. You might very well find that a different part in your code that's called billions of times saves a lot more cycles.
EDIT:
Since you're not going to modify it, and since the first distribution is initialized with literal values, you might actually make it a constant (such as a constexpr or namespace level static const). That should, regardless of the other two, allow the compiler to generate the most efficient code in any case for that one.

Optimization and testability at the same time - how to break up code into smaller units

I am trying to break up a long "main" program in order to be able to modify it, and also perhaps to unit-test it. It uses some huge data, so I hesitate:
What is best: to have function calls, with possibly extremely large (memory-wise) data being passed,
(a) by value, or
(b) by reference
(by extremely large, I mean maps and vectors of vectors of some structures and small classes... even images... that can be really large)
(c) Or to have private data that all the functions can access? That may also mean that main_processing() or something could have a vector of all of them, while some functions will only have an item... with the advantage of the functions being testable.
My question, though, has to do with optimization: while I am trying to break this monster into baby monsters, I also do not want to run out of memory.
It is not very clear to me how many copies of the data I am going to end up with if I create local variables.
Could someone please explain?
Edit: this is not a generic "how to break down a very large program into classes". This program is part of a large solution, that is already broken down into small entities.
The executable I am looking at, while fairly large, is a single entity with non-divisible data. So the data will either all be created as member variables of a single class, which I have already created, or all of it will be passed around as arguments between functions.
Which is better ?
If you want unit testing, you cannot "have private data that all the functions can access" because then, all of that data would be a part of each test case.
So, you must think about each function, and define exactly on which part of the data it works. As for function parameters and return values, it's very simple: use pass-by-value for small objects, and pass-by-reference for large objects.
You can use a guesstimate for the threshold that separates small and large. I use the rule "8 is small, anything more is large" but what is good for my system cannot be equally good for yours.
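A hedged illustration of that rule; the Image type and the functions are made up for the example:

#include <vector>

struct Image { std::vector<unsigned char> pixels; };  // stand-in for the "really large" data

// Large input goes in by const reference (no copy), small result comes back by value.
// A function shaped like this is also easy to unit-test: everything it touches is
// visible in its signature.
double average_brightness(const Image& img) {
    double sum = 0;
    for (unsigned char p : img.pixels) sum += p;
    return img.pixels.empty() ? 0.0 : sum / img.pixels.size();
}

// If a function genuinely needs to modify the big object, pass a non-const reference
// instead of copying it in and out.
void normalize(Image& img);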
This seems more like a general question about OOP. Split up your data into logically grouped concepts (classes), and place the code that works with those data elements with the data (member functions), then tie it all together with composition, inheritance, etc.
Your question is too broad to give more specific advice.

mmap-loadable data structure library for C++ (or C)

I have a large data structure (N > 10,000) that usually only needs to be created once (at runtime) and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhone OS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the lines of:
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar hash table class myself, but I find it pretty hard to maintain, so I would like to see whether there are existing solutions already. The library should:
Contain a creation routine that serializes the data structure into a file. This part doesn't need to be fast.
Contain a loading routine that mmaps a file into a read-only (or read-write) data structure that is usable within O(1) steps of processing.
Use O(N) disk/memory space with a small constant factor. (The device has serious memory constraints.)
Add only a small time overhead to accessors (i.e. the complexity isn't changed).
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and structs of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory-mapped file and then create the STL map structure with a custom allocator. Your custom allocator simply takes the beginning of the memory of the memory-mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory-mapped file and should be reloadable later.
You will have to check whether memory is freed by the STL map. If it is, your custom allocator will lose some memory of the memory-mapped file, but if this is limited you can probably live with it.
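A minimal sketch of that allocator idea, under a couple of loud assumptions: the region here is an anonymous POSIX mapping rather than a file (just to keep the example self-contained), nothing is ever reclaimed, and reloading a node-based container like std::map across runs only works if the file is mapped back at the same address, because the nodes store absolute pointers:

#include <cstddef>
#include <map>
#include <new>
#include <sys/mman.h>

struct Region {
    char* base;
    std::size_t used;
    std::size_t capacity;
};

template <class T>
struct RegionAllocator {
    using value_type = T;
    Region* region;

    explicit RegionAllocator(Region* r) : region(r) {}
    template <class U>
    RegionAllocator(const RegionAllocator<U>& other) : region(other.region) {}

    T* allocate(std::size_t n) {
        // Bump the offset, keeping the returned pointer suitably aligned for T.
        std::size_t offset = (region->used + alignof(T) - 1) & ~(alignof(T) - 1);
        std::size_t bytes = n * sizeof(T);
        if (offset + bytes > region->capacity) throw std::bad_alloc();
        region->used = offset + bytes;
        return reinterpret_cast<T*>(region->base + offset);
    }
    void deallocate(T*, std::size_t) {
        // Nothing is handed back individually; the whole region goes away with munmap().
    }
};

template <class T, class U>
bool operator==(const RegionAllocator<T>& a, const RegionAllocator<U>& b) { return a.region == b.region; }
template <class T, class U>
bool operator!=(const RegionAllocator<T>& a, const RegionAllocator<U>& b) { return !(a == b); }

int main() {
    const std::size_t cap = 1 << 20;
    void* mem = mmap(nullptr, cap, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    Region region{static_cast<char*>(mem), 0, cap};

    using Alloc = RegionAllocator<std::pair<const int, int>>;
    std::map<int, int, std::less<int>, Alloc> table{std::less<int>(), Alloc(&region)};
    table[42] = 7;  // every node of the map now lives inside the mapped region

    munmap(mem, cap);
    return 0;
}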
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time optimising the hash function for the particular data, so that there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
There's certainly unordered set/multiset support (C++ code for hash tables).
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known beforehand. If you know all the keys beforehand, then it is the way to go, since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of dconf, is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html