How do I optimize this postfix expression tree for speed?


Thanks to the help I received in this post:
I have a nice, concise recursive function to traverse a tree in postfix order:
deque<char*> d;
void Node::postfix()
{
    if (left != nullptr) { left->postfix(); }
    if (right != nullptr) { right->postfix(); }
    d.push_front(cargo);
}
This is an expression tree. The branch nodes are operators randomly selected from an array, and the leaf nodes are values or the variable 'x', also randomly selected from an array.
char *values[10]={"1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","x"};
char *ops[4]={"+","-","*","/"};
As this will be called billions of times during a run of the genetic algorithm of which it is a part, I'd like to optimize it for speed. I have a number of questions on this topic which I will ask in separate postings.
The first is: how can I get access to each 'cargo' as it is found. That is: instead of pushing 'cargo' onto a deque, and then processing the deque to get the value, I'd like to start processing it right away.
Edit: This question suggests that processing the deque afterwards is a better way.
I don't yet know about parallel processing in c++, but this would ideally be done concurrently on two different processors.
In python, I'd make the function a generator and access succeeding 'cargo's using .next().
See the above Edit.
But I'm using c++ to speed up the python implementation. I'm thinking that this kind of tree has been around for a long time, and somebody has probably optimized it already. Any Ideas? Thanks

Of course, you'd want to measure the overhead first before you bother with optimization here, as your genetic algorithm's next-generation production and mutations may swamp the evaluation time.
Once you've determined you want to optimize...
the obvious answer is to compile the expression ("as much as possible"). Fortunately, there are lots of ways to "compile".
If you are implementing this in Python, you may be able to ask Python (I'm not an expert) to compile a constructed abstract syntax tree into a function, and that might be a lot faster, especially if CPython supports this.
It appears that you are implementing this in C++, however. In that case, I wouldn't evaluate the expression tree as you have defined it, as it means lots of tree walking, indirect function calls, etc. which are pretty expensive.
One cheesy trick is to spit out the actual expression as a text string with appropriate C++ function body text around it into a file, and launch a C++ compiler on that.
You can automate all the spit-compile-relink with enough script magic, so that if
you do this rarely, this would work and you would get expression evaluation as fast as
the machine can do it.
Under the assumption you don't want to do that, I'd be tempted to walk your expression tree before you start the evaluation process, and "compile" that tree into a set of actions stored in a linear array called "code". The actions would be defined by an enum:
enum actions {
    // general actions first
    pushx, // action to push x on a stack
    push1,
    push2, // action to push 2 on a stack
    ...
    pushN,
    add,
    sub,
    mul, // action multiply top two stack elements together
    div,
    ...
    // optimized actions
    add1,
    sub1,
    mul1,
    div1, // action to divide top stack element by 1
    ...
    addN,
    subN,
    ...
    addx,
    subx,
    ...
};
In this case, I've defined the actions to implement a push-down stack expression evaluator, because that's easy to understand. Fortunately your expression vocabulary is pretty limited, so your actions can also be pretty limited (they'd be more complex if you had arbitrary variables or constants).
The expression ((x*2.0)+x)-1 would be executed by the series of actions
pushx
mul2
addx
sub1
It's probably hard to get a lot better than this.
One might instead define the actions to implement a register-oriented expression evaluator following the model of a multi-register CPU; that would enable even faster execution (I'd guess by a factor of two, but only if the expression got really complex).
What you want are actions that cover the most general computation you need to do (so you can always choose a valid action sequence regardless of your original expression) and actions that occur frequently in the expressions you encounter. add1 is pretty typical in machine code; I don't know what your statistics are like, and your remark that you are doing genetic programming suggests you don't know what the statistics will be, but you can measure them somehow or make an educated guess.
Your inner loop for evaluation would then look like (sloppy syntax here):
float stack[max_depth];
int stack_depth = 0; // index of the next free stack slot
for (int i = 0; i < expression_length; i++)
{
    switch (code[i]) // one case for each opcode in the enum
    {
    case pushx: stack[stack_depth++] = x;
        break;
    case push1: stack[stack_depth++] = 1;
        break;
    ...
    case add: stack[stack_depth-2] += stack[stack_depth-1]; // fold the top element into the one below it
        stack_depth--;
        break;
    ...
    case subx: stack[stack_depth-1] -= x; // operate on the top element in place
        break;
    ...
    }
}
// stack[0] contains the answer here
The above code implements a very fast "threaded interpreter" for a pushdown stack expression evaluator.
Now "all" you need to do is to generate the content of the code array. You can do that by using your original expression tree: execute your original recursive expression tree walk, but instead of doing the expression evaluation, write the action that your current expression evaluator would do into the code array, spitting out special-case actions when you find them (this amounts to "peephole optimization"). This is classic compilation from trees, and you can find out a lot more about how to do this in pretty much any compiler book.
Yes, this is all a fair bit of work. But then, you decided to run a genetic algorithm, which is computationally pretty expensive.
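The tree-to-code step described above can be sketched roughly as follows. This is only an illustration under assumptions: the names Op, Instr, Node, compile, run and example are hypothetical, the node layout is not the asker's Node class, and only the general actions are emitted (no peephole actions such as mul2 or addx).

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcode set: general actions only, no specialized ones.
enum class Op { PushX, PushConst, Add, Sub, Mul, Div };

struct Instr {
    Op op;
    double value;  // used only by PushConst
};

// Hypothetical tree node: a leaf is 'x' or a constant 'c', a branch is an operator.
struct Node {
    char kind;          // 'x', 'c', '+', '-', '*', '/'
    double value;       // constant value when kind == 'c'
    const Node* left;
    const Node* right;
};

// Post-order walk that emits instructions instead of evaluating.
void compile(const Node* n, std::vector<Instr>& code) {
    if (n->left)  compile(n->left, code);
    if (n->right) compile(n->right, code);
    switch (n->kind) {
        case 'x': code.push_back({Op::PushX, 0.0}); break;
        case 'c': code.push_back({Op::PushConst, n->value}); break;
        case '+': code.push_back({Op::Add, 0.0}); break;
        case '-': code.push_back({Op::Sub, 0.0}); break;
        case '*': code.push_back({Op::Mul, 0.0}); break;
        case '/': code.push_back({Op::Div, 0.0}); break;
    }
}

// Evaluates the linear code for one value of x using a push-down stack.
double run(const std::vector<Instr>& code, double x) {
    std::vector<double> stack;
    stack.reserve(code.size());
    for (const Instr& in : code) {
        switch (in.op) {
            case Op::PushX:     stack.push_back(x); break;
            case Op::PushConst: stack.push_back(in.value); break;
            default: {
                // Binary operator: pop the right operand, update the left in place.
                double rhs = stack.back();
                stack.pop_back();
                double& lhs = stack.back();
                if (in.op == Op::Add)      lhs += rhs;
                else if (in.op == Op::Sub) lhs -= rhs;
                else if (in.op == Op::Mul) lhs *= rhs;
                else                       lhs /= rhs;
            }
        }
    }
    return stack.back();
}

// Builds ((x*2.0)+x)-1, compiles it once, and evaluates it for the given x.
double example(double x) {
    Node nx{'x', 0.0, nullptr, nullptr};
    Node two{'c', 2.0, nullptr, nullptr};
    Node one{'c', 1.0, nullptr, nullptr};
    Node mul{'*', 0.0, &nx, &two};
    Node add{'+', 0.0, &mul, &nx};
    Node sub{'-', 0.0, &add, &one};
    std::vector<Instr> code;
    compile(&sub, code);
    return run(code, x);
}
```

In a genetic algorithm the point would be to call compile once per individual and run once per sample of x; a fixed-size array could replace the per-call vector if allocation shows up in measurements.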

Lots of good suggestions in this thread for speeding up tree iteration:
Tree iterator, can you optimize this any further?
As for the problem, I guess you could process cargo in a different thread, but you aren't actually doing THAT much work per item, so you would probably end up spending more time in thread synchronisation mechanisms than doing any actual work.
You may find that, instead of just pushing it into the deque, processing as you go along runs faster. You may find that processing it all in a separate loop at the end is faster. The best way to find out is to try both methods with a variety of different inputs and time them.
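For the "process as you go along" variant, one possible shape is to pass a callable into the traversal. This is a minimal sketch, assuming a hypothetical Node struct that stands in for the asker's class, with a hypothetical example function that collects the visiting order.

```cpp
#include <cassert>
#include <string>

// Hypothetical node type standing in for the question's Node class.
struct Node {
    const char* cargo;
    Node* left;
    Node* right;

    // Post-order walk that hands each cargo to `visit` as soon as it is reached.
    template <typename Visitor>
    void postfix(Visitor&& visit) const {
        if (left)  left->postfix(visit);
        if (right) right->postfix(visit);
        visit(cargo);
    }
};

// Builds the tree for (x * 2.0) + 1.0 and returns its cargos in visiting order.
std::string example() {
    Node x{"x", nullptr, nullptr};
    Node two{"2.0", nullptr, nullptr};
    Node one{"1.0", nullptr, nullptr};
    Node mul{"*", &x, &two};
    Node add{"+", &mul, &one};

    std::string out;
    add.postfix([&out](const char* cargo) {
        if (!out.empty()) out += ' ';
        out += cargo;
    });
    return out;
}
```

Because the visitor is a template parameter, the compiler can usually inline it, so timing this against the deque version should be a fair comparison.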

Assuming that processing a cargo is expensive enough that locking a mutex is relatively cheap, you can use a separate thread to access the queue as you put items on it.
Thread 1 would execute your current logic, but it would lock the queue's mutex before adding an item and unlock it afterwards.
Then thread 2 would just loop forever, checking the size of the queue. If it's not empty, it locks the queue, pulls off all available cargo, and processes it. If no cargo is available, it sleeps for a short period of time and repeats the loop.
If the locking is too expensive you can build up a queue of queues: First you put say 100 items into a cargo queue, and then put that queue into a locked queue (like the first example). Then start on a new "local" queue and continue.
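The two-thread arrangement above might look something like this. It is a sketch under assumptions: CargoQueue, produce, consume and example are hypothetical names, an int stands in for a cargo, "processing" is just summing, and a `done` flag is added so the consumer can stop instead of looping forever.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical shared queue; `int` stands in for a cargo.
struct CargoQueue {
    std::mutex m;
    std::deque<int> items;
    std::atomic<bool> done{false};
};

// Producer: lock, push, unlock for every item, then signal completion.
void produce(CargoQueue& q, int count) {
    for (int i = 1; i <= count; ++i) {
        std::lock_guard<std::mutex> lock(q.m);
        q.items.push_back(i);
    }
    q.done = true;
}

// Consumer: take everything available in one locked swap, process it unlocked.
long consume(CargoQueue& q) {
    long sum = 0;
    for (;;) {
        // Read the flag before draining so items pushed before it was set are not missed.
        bool finished = q.done.load();
        std::deque<int> batch;
        {
            std::lock_guard<std::mutex> lock(q.m);
            batch.swap(q.items);
        }
        for (int v : batch) sum += v;  // stand-in for real cargo processing
        if (batch.empty()) {
            if (finished) break;
            std::this_thread::sleep_for(std::chrono::microseconds(50));
        }
    }
    return sum;
}

// Runs one producer and one consumer and returns the consumer's total.
long example(int count) {
    CargoQueue q;
    long sum = 0;
    std::thread consumer([&] { sum = consume(q); });
    std::thread producer([&] { produce(q, count); });
    producer.join();
    consumer.join();
    return sum;
}
```

Swapping the whole deque out under the lock keeps the critical section short, which is the same idea as the "queue of queues" batching, applied on the consumer side.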

Related

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one Foreground task that mostly reads the values....but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data and doing calculation and storing the data whether it changes or not.
So should I create a custom class MyInt which has the structure and then include an array of mutexes to lock for updating/reading each value and then overload the [], =, ++, +=, -=, etc? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB

The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
If things still aren't fast enough for you, you could then look in to having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.
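A minimal sketch of the first step (one mutex for the whole array) is shown here, assuming a hypothetical GuardedArray wrapper. It gives up the plain R[5]++ notation in exchange for making each multi-element update a single locked step.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical wrapper: one mutex guards the whole array.
class GuardedArray {
public:
    // Runs `fn` with exclusive access, so a compound update such as
    // r[5] = (r[10] + r[20]) / 50 is never interleaved with another thread.
    template <typename Fn>
    void with_lock(Fn&& fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        fn(data_);
    }

    int read(std::size_t i) {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_[i];
    }

private:
    std::mutex mutex_;
    std::array<int, 5000> data_{};  // zero-initialized
};

// Two workers increment r[5] concurrently, then one compound update is applied.
int example() {
    GuardedArray R;
    R.with_lock([](std::array<int, 5000>& r) { r[10] = 300; r[20] = 200; });

    auto worker = [&R] {
        for (int i = 0; i < 10000; ++i)
            R.with_lock([](std::array<int, 5000>& r) { r[5]++; });
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();

    R.with_lock([](std::array<int, 5000>& r) { r[5] = (r[10] + r[20]) / 50 + r[5]; });
    return R.read(5);
}
```

Moving to several mutexes later would mostly mean choosing the mutex from the indices touched inside with_lock; an array of std::atomic<int> would keep R[5]++ safe on its own, but not the three-element expression.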

Possibility of Iteration For Compile-Time Computations

In my understanding, compile-time computation is anything that can be computed by the compiler, rather than during program execution, to increase performance. Iterative computation is possible when a program executes, but it is not allowed during compile-time computations. One troublesome and specific example is variadic templates, where one naturally thinks of iteration to handle the various types provided, yet the standard and compilers force programmers to handle them recursively.
In general, all compile-time computations are handled via recursion rather than iteration. As far as I know, constexpr functions that are expected to be computed at compile time are supposed to be recursive as well. What makes iteration forbidden for anything that is compile-time?

When I implemented elimination of constant subexpressions as optimizations in Hammer, the issue turned out to be that recursion basically already happens during code generation. You do not really need to define variables, because they simply get replaced with constants.
On the other hand, if you are trying to run a loop, you need code that does not just execute operations on constants, but is a full-blown interpreter of your language. You need to be able to declare variables, set and retrieve their values (as loop counters), and, even worse, you need to detect endless loops so your compiler won't hang (I mean, while(true); is a perfectly constant expression).
So in short, due to the nature of parsers, ASTs and optimizers, it is simply easier to recursively evaluate parts at compile time than it is to implement full control flow and implement loops and variable manipulation.
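For C++ specifically, the recursive style the question refers to looks roughly like the sketch below (factorial is just an illustrative function). It is written to the C++11 rules, where a constexpr function body is essentially a single return expression; C++14 and later relax those rules and do permit loops and local variables in constexpr functions.

```cpp
#include <cassert>

// C++11-style constexpr: one return expression, so repetition is expressed as recursion.
constexpr unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ULL : n * factorial(n - 1);
}

// Evaluated by the compiler; the build fails if the value is wrong.
static_assert(factorial(5) == 120, "factorial(5) is computed at compile time");
```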

memoization vs. state-free code

In the development of a stateless Clojure library I encountered a problem: many functions have to be called repeatedly with the same arguments. Since everything until now is side-effect-free, this will always lead to the same results. I'm considering ways to make this more performant.
My library works like this: Every time a function is called it needs to be passed a state-hash-map, the function returns a replacement with a manipulated state object. So this keeps everything immutable and every sort of state is kept outside of the library.
(require '[mylib.core :as l])
(def state1 (l/init-state))
(def state2 (l/proceed state1))
(def state3 (l/proceed state2))
If proceed should not perform the same operations repeatedly, I have several options to solve this:
Option 1: "doing it by hand"
Store the necessary state in the state-hash-map, and update only where it is necessary. Means: Having a sophisticated mechanism that knows which parts have to be recalculated, and which not. This is always possible, in my case it would be not that trivial. If I implemented this, I'd produce much more code, which in the end is more error prone. So is it necessary?
Option 2: memoize
So there is the temptation to use the memoize function at the critical points in the lib: At the points, at which I'd expect the possibility of repeated function calls with the same args. This is sort of another philosophy of programming: Modelling each step as if it was the first time it has to run. And separating the fact that is called several times to another implementation. (this reminds me of the idea of react/om/reagent's render function)
Option 3: core.memoize
But memoization is stateful - of course. And this - for example - becomes a problem when the lib runs in a web-server. The server would just keep on filling memory with captured results. In my case however it would make sense, to only capture calculated results for each user-session. So it would be perfect to attach the cache to the previously described state-hash-map, which will be passed back by lib.
And it looks like core.memoize provides some tools for this job. Unfortunately it's not that well documented - I can't really find useful information related to the described situation.
My question is: Do I more or less estimate the possible options correctly? Or are there other options that I have not considered? If not, it looks like the core.memoize is the way to go. Then, I'd appreciate if someone could give me a short pattern at hand, which one should use here.

If state1, state2 & state3 are different in your example, memoization will gain you nothing. proceed would be called with different arguments each time.
As a general design principle, do not impose caching strategies on the consumer. Design so that the consumers of your library have the possibility to use whatever memoization technique they like, or no memoization at all.
Also, you don't mention if init-state is side-effect free, and if it returns the same state1. If that is so, why not just keep all (or some) states as static literals? If they don't take much space, you can write a macro that calculates them at compile time. Say, the first 20 states hard-coded, then call proceed.

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer and others including writer can read from HashTable(they may simply iterate over entire hash table)
Case B) All threads are identical and all could read/write.
Can someone suggest best strategy to make the class thread safe with following consideration
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is this:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there could be eight instances of the writer resource for case B, where each writer resource locks one row (or range, for that matter).
(so I guess 1+8 mutexes)
Please let me know if I am thinking on the correct line, and how could we improve on this solution.

With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
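The push half of such a stack, written with std::atomic, might be sketched as follows. The names StackNode, LockFreeStack, drain and example are hypothetical. Instead of a single-item pop, which is where the ABA problem mentioned above arises, this sketch only offers a drain operation that detaches the whole list in one atomic exchange.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct StackNode {
    int value;
    StackNode* next;
};

class LockFreeStack {
public:
    // Read the current top, link the new node to it, and retry until the CAS commits.
    void push(int value) {
        StackNode* node = new StackNode{value, head_.load()};
        while (!head_.compare_exchange_weak(node->next, node)) {
            // On failure, compare_exchange_weak stores the current top into node->next.
        }
    }

    // Detach the whole list in one atomic exchange, then count and free its nodes.
    int drain() {
        StackNode* node = head_.exchange(nullptr);
        int count = 0;
        while (node) {
            StackNode* next = node->next;
            delete node;
            node = next;
            ++count;
        }
        return count;
    }

private:
    std::atomic<StackNode*> head_{nullptr};
};

// Several threads push concurrently; the drain afterwards should see every item.
int example(int threads, int per_thread) {
    LockFreeStack stack;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&stack, per_thread] {
            for (int i = 0; i < per_thread; ++i) stack.push(i);
        });
    for (auto& th : pool) th.join();
    return stack.drain();
}
```

The loop in push is the "read inputs, attempt to commit, repeat on failure" pattern in its smallest form; a real pop would additionally need a version tag or a safe memory reclamation scheme.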
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.

You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since in your interview they specifically mentioned an extremely high read:write ratio it makes sense trying to stuff as much overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with c++'s non-existent memory model you probably can't do this portably; c++0x should help here, but haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes - nice, but I'm not really sure how they reach that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.

No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node to the table requires a lock on the bucket about to have the head node modified. New nodes are always added at the head of the table so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has a r/w lock; readers iterating get a read lock. Node modification requires a write lock.
Iteration without the bucket lock leading to node removal requires an attempt to take the bucket lock, and if it fails it must release the locks and retry to avoid deadlock because the lock order is different.
Brief overview.
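A simplified sketch of the lock-per-bucket idea is given below, with a hypothetical StripedTable class. It is deliberately simpler than the scheme described: readers also take the bucket mutex, there are no per-node reader/writer locks, and the table never grows or removes entries.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class StripedTable {
public:
    explicit StripedTable(std::size_t buckets) : buckets_(buckets) {}

    // Locks only the bucket the key hashes to; new nodes go at the head of the chain.
    void put(const std::string& key, int value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m);
        for (auto& kv : b.chain)
            if (kv.first == key) { kv.second = value; return; }
        b.chain.emplace_front(key, value);
    }

    bool get(const std::string& key, int& out) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lock(b.m);
        for (const auto& kv : b.chain)
            if (kv.first == key) { out = kv.second; return true; }
        return false;
    }

private:
    struct Bucket {
        std::mutex m;
        std::forward_list<std::pair<std::string, int>> chain;
    };

    Bucket& bucket_for(const std::string& key) {
        return buckets_[std::hash<std::string>{}(key) % buckets_.size()];
    }

    std::vector<Bucket> buckets_;  // fixed size: this sketch never rehashes
};

// Two writers insert disjoint key ranges; afterwards every key should be found.
int example() {
    StripedTable table(16);
    auto writer = [&table](int base) {
        for (int i = 0; i < 500; ++i)
            table.put("k" + std::to_string(base + i), base + i);
    };
    std::thread a(writer, 0), b(writer, 500);
    a.join();
    b.join();

    int found = 0, value = 0;
    for (int i = 0; i < 1000; ++i)
        if (table.get("k" + std::to_string(i), value) && value == i) ++found;
    return found;
}
```

Threads that hash to different buckets never contend, which is where the parallelism comes from; with a high read:write ratio, swapping each bucket's mutex for a reader/writer lock would be the natural next experiment.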

You can try atomic_hashtable for C:
https://github.com/Taymindis/atomic_hashtable
It is meant for reading, writing, and deleting without locking while multiple threads access the data; simple and stable.
The API is documented in the README.

How to write a test case for ensuring thread-safe

I wrote a thread-safe (at least that is the aim) container class in C++. I lock mutexes while accessing the members and release them when finished.
Now I am trying to write a test case to check whether it is really thread safe.
Let's say, I have Container container and two threads Thread1 Thread2.
Container container;
Thread1()
{
//Add N items to the container
}
Thread2()
{
//Add N items to the container
}
In this way, it works with no problem with N=1000.
But I'm not sure this regression test is enough or not. Is there a deterministic way to test a class like that?
Thanks.

There is no real way to write a test that proves it's safe.
You can only design it so that it is safe and test that your design is implemented. The best you can do is stress test it.

I guess that you wrote a generic container and that you want to verify that two different threads cannot insert items at the same time.
If my assumptions are correct, then my proposal would be to write a custom class in which you overload the copy constructor, inserting a sleep whose duration can be parametrized.
To test your container, create an instance of it for your custom class and then in the first thread, insert an instance of the custom class with a long sleep, meanwhile you start the second thread trying to insert an instance of the custom class with a short sleep. If the second insertion comes back before the first one, you know that the test failed.

That's a reasonable starting point, though I'd make a few suggestions:
Run the test on a quad-core machine to improve the odds of real resource contention.
Instead of having a fixed number of threads, I'd suggest spawning a random number of threads with a lower bound equal to the number of processors on the test machine and an upper bound that's four times that number.
Consider doing occasional runs with a substantially larger number of items (say 100,000).
Run your tests on optimized, release (non-debug) builds.
If you're targeting Windows, you may want to consider using critical sections rather than mutexes as they're generally more performant.
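A stress test along these lines could be sketched as follows. The Container class here is a hypothetical stand-in for the asker's class, and the thread count simply follows what the machine reports rather than being randomized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical container under test.
class Container {
public:
    void add(int v) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(v);
    }
    std::size_t size() {
        std::lock_guard<std::mutex> lock(m_);
        return items_.size();
    }
private:
    std::mutex m_;
    std::vector<int> items_;
};

// Spawns `threads` workers that each add `per_thread` items, then reports the final size.
std::size_t stress(unsigned threads, int per_thread) {
    Container c;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&c, per_thread] {
            for (int i = 0; i < per_thread; ++i) c.add(i);
        });
    for (auto& th : pool) th.join();
    return c.size();
}

// At least as many threads as the machine reports cores; falls back to 2 if unknown.
unsigned suggested_threads() {
    return std::max(2u, std::thread::hardware_concurrency());
}
```

A passing run only shows that no lost update was observed this time; repeating it with larger item counts and on an optimized build raises the odds of catching a real race.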

Proving that it's safe is not possible, but to improve the chances of stress testing finding bugs, you can modify the container's add method so it looks like this:
// Assuming all this is thread safe
if ( in_use_flag == true ) {
    error!
}
in_use_flag = true;
... original add method code ....
sleep( long_time );
in_use_flag = false;
This way you can almost make sure that the two threads would try to access the container at the same time, and also check for such occurrences - thus making sure the thread-safety actually works.
PS I would also remove the mutex protection just to see it fail once.
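One way to make the in-use check itself race-free is an atomic flag. The sketch below assumes a hypothetical CheckedContainer, uses a yield instead of a long sleep to keep the run short, and counts overlaps rather than raising an error.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class CheckedContainer {
public:
    void add(int v) {
        std::lock_guard<std::mutex> lock(m_);  // the protection under test
        // exchange returns the previous value: true means another thread is inside add().
        if (in_use_.exchange(true)) ++overlaps_;
        items_.push_back(v);
        std::this_thread::yield();             // widen the window, as the sleep would
        in_use_.store(false);
    }
    int overlaps() const { return overlaps_.load(); }
private:
    std::mutex m_;
    std::atomic<bool> in_use_{false};
    std::atomic<int> overlaps_{0};
    std::vector<int> items_;
};

// Hammers add() from several threads and reports how many overlapping entries were seen.
int run_overlap_check(int threads, int per_thread) {
    CheckedContainer c;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&c, per_thread] {
            for (int i = 0; i < per_thread; ++i) c.add(i);
        });
    for (auto& th : pool) th.join();
    return c.overlaps();
}
```

With the lock_guard in place the count should stay at zero. Removing that line, as the PS suggests, lets overlaps happen, but it also makes the unguarded push_back a genuine data race, so that experiment belongs in a throwaway build.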
