While developing a stateless Clojure library I have run into a problem: many functions have to be called repeatedly with the same arguments. Since everything so far is side-effect-free, this always leads to the same results. I'm considering ways to make this more performant.
My library works like this: every time a function is called it is passed a state hash-map, and it returns a replacement with a modified state. This keeps everything immutable, and every sort of state is kept outside of the library.
(require '[mylib.core :as l])
(def state1 (l/init-state))
(def state2 (l/proceed state1))
(def state3 (l/proceed state2))
If proceed is not to perform the same operations over and over, I see several options:
Option 1: "doing it by hand"
Store the necessary intermediate results in the state hash-map, and update only what is necessary. That means having a sophisticated mechanism that knows which parts have to be recalculated and which do not. This is always possible, but in my case it would not be trivial. If I implemented it, I'd produce much more code, which in the end is more error prone. So is it necessary?
Option 2: memoize
So there is the temptation to use the memoize function at the critical points in the lib: at the points where I'd expect repeated function calls with the same args. This is sort of another philosophy of programming: modelling each step as if it were the first time it has to run, and pushing the fact that it is called several times into a separate concern. (This reminds me of the idea behind react/om/reagent's render function.)
Option 3: core.memoize
But memoization is stateful, of course. And this becomes a problem when, for example, the lib runs in a web server: the server would just keep filling memory with cached results. In my case, however, it would make sense to keep cached results only per user session. So it would be perfect to attach the cache to the previously described state hash-map, which is passed back by the lib.
And it looks like core.memoize provides some tools for this job. Unfortunately it's not that well documented; I can't really find useful information related to the described situation.
My question is: have I assessed the possible options more or less correctly? Or are there other options that I have not considered? If not, it looks like core.memoize is the way to go, and I'd appreciate it if someone could show me a short pattern that would work here.
If state1, state2 & state3 are different in your example, memoization will gain you nothing: proceed would be called with different arguments each time.
As a general design principle, do not impose caching strategies on the consumer. Design so that the consumers of your library can use whatever memoization technique they like, or no memoization at all.
Also, you don't mention whether init-state is side-effect free and always returns the same state1. If so, why not just keep all (or some) states as static literals? If they don't take much space, you can write a macro that calculates them at compile time: say, the first 20 states hard-coded, then call proceed.
Related
This is a very general C++ question. Consider the following two blocks (they do the same thing):
v_od=((x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint()).cwiseAbs2().rowwise().sum()).array().sqrt();
and
MatrixXd wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(wtemp.cwiseAbs2().rowwise().sum()).array().sqrt();
Now the first construct feels more efficient. But is that true, or would the C++ compiler compile them down to the same thing? (I'm assuming the compiler is a good one and has all the safe optimization flags turned on. For argument's sake, wtemp is mid-sized, say a matrix with 100k elements all told.)
I know the generic answer to this is 'benchmark it and come back to us'
but I want a general answer.
There are two cases where your second expression could be fundamentally less efficient than your first.
The first case is where the writer of the MatrixXd class provided "rvalue reference to *this" overloads of cwiseAbs2(). In the first snippet, the value we call the method on is a temporary; in the second it is not. We can fix this by simply changing the second expression to:
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
which casts wtemp into an rvalue reference, and basically tells cwiseAbs2() that the matrix it is being called on can be reused as scratch space. This only matters if the writers of the MatrixXd class implemented this particular feature.
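To make that concrete, here is a toy sketch of what such ref-qualified overloads look like (invented Matrix class, not Eigen's actual implementation):

#include <utility>
#include <vector>

// Toy illustration only: the &&-qualified overload is chosen when the object
// is a temporary (or has been std::move'd), so it can reuse the object's own
// storage as scratch space instead of allocating a copy.
class Matrix {
public:
    Matrix cwiseAbs2() const &            // called on lvalues: must work on a copy
    {
        Matrix result(*this);
        for (double &v : result.data) v *= v;
        return result;
    }
    Matrix cwiseAbs2() &&                 // called on rvalues: can work in place
    {
        for (double &v : data) v *= v;
        return std::move(*this);
    }
private:
    std::vector<double> data;
};

With overloads like these, std::move(wtemp).cwiseAbs2() takes the in-place path, while plain wtemp.cwiseAbs2() quietly takes the copying one.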
The second possible way it could be fundamentally slower is if the writers of the MatrixXd class used expression templates for pretty much every operation listed. This technique builds the parse tree of the operations, and only finalizes all of them when you assign the result to a value at the end.
Some expression-template implementations are written to handle being stored in an intermediate object like this:
auto&& wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
where the first line stores the expression template wtemp rather than evaluating it into a matrix, and the second line consumes that intermediate result. Other expression template implementations break horribly if you try to do something like the above.
Expression templates are also something that the matrix class writers would have to have specifically implemented, and it is again a somewhat obscure technique -- it would mainly be of use in situations where extending a buffer is done by seemingly cheap operations, like string append.
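For what it's worth, a stripped-down sketch of the expression-template idea looks something like this (invented names, nothing like Eigen's real internals):

#include <cstddef>
#include <vector>

// operator+ does no arithmetic; it just records its operands in a Sum node.
// The actual loop runs once, when the whole expression is evaluated into a Vec.
struct Vec {
    std::vector<double> data;
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

template <class L, class R>
struct Sum {                                   // one node of the "parse tree"
    const L &lhs; const R &rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

inline Sum<Vec, Vec> operator+(const Vec &a, const Vec &b) { return {a, b}; }

template <class L, class R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R> &a, const Vec &b) { return {a, b}; }

template <class Expr>
Vec evaluate(const Expr &e)                    // finalizes the whole tree in one pass
{
    Vec out;
    out.data.resize(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out.data[i] = e[i];
    return out;
}

Here evaluate(v1 + v2 + v3) runs a single loop with no intermediate vectors. It also shows why stashing the expression in a variable can go wrong: with more than one operator, the nested Sum nodes hold references to temporaries that die at the end of the full expression, which is exactly the "breaks horribly" case mentioned above.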
Barring those two cases, any difference in performance is going to be purely "noise" -- there would be no reason, a priori, to expect the compiler to be confused by one or the other more or less.
And both of these are relatively advanced/modern techniques.
Neither of them will be implemented "by the compiler" without explicitly being done by the library writer.
In general the second case is much more readable, and that's why it's preferred. It clearly names the temporary variable, which helps you understand the code better. Moreover, it's much easier to debug! That's why I would strongly recommend going for the second option.
I would not care much about the performance difference: I think a good compiler will generate identical code from both examples.
The most important aspects of code, in order from most important to least important:
Correct code
Readable code
Fast code
Of course, this can change (e.g. on embedded devices where you have to squeeze out every last bit of performance in limited memory space), but this is the general case.
Therefore, you want the code that is easier to read over a possibly negligible performance increase.
I wouldn't expect a performance hit for storing temporaries - at least not in the general case. In fact, in some cases you can expect it to be faster, e.g. caching the result of strlen() when working with C strings (the first example that comes to mind).
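For example (just to illustrate the strlen() point; not related to the matrix code above):

#include <cstdio>
#include <cstring>

// Hoisting the strlen() result into a named temporary avoids re-scanning the
// whole string on every loop iteration.
void print_chars(const char *s)
{
    const std::size_t len = std::strlen(s);   // computed once, cached in a local
    for (std::size_t i = 0; i < len; ++i)     // instead of the condition i < strlen(s)
        std::printf("%c\n", s[i]);
}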
Once you have written the code, verified that it is correct, and found a performance problem, only then should you worry about profiling and making it faster, at which point you'll probably find that having more maintainable/readable code actually helps you isolate the problem.
I remember I saw somewhere (probably on GitHub) an example like this in a setter:
void MyClass::setValue(int newValue)
{
    if (value != newValue) {
        value = newValue;
    }
}
For me it doesn't make a lot of sense, but I wonder if it gives any performance improvement.
It makes no sense for scalar types, but it may make sense for some user-defined types (since a type can be really "big" or its assignment operator can do some "heavy" work).
The deeper the instruction pipeline (and it only gets deeper and deeper on Intel platform at least), the higher the cost of a branch misprediction.
When a branch mispredicts, some instructions from the mispredicted path still move through the pipeline. All work performed on these instructions is wasted since they would not have been executed had the branch been correctly predicted.
So yes, adding an if in the code can actually hurt performance. The write would be L1 cached, possibly for a long time. If the write has to be visible then the operation would have to be interlocked to start with.
The only way you can really tell is by actually testing the different alternatives (benchmarking and/or profiling the code). Different compiler, different processors and different code calling it will make a big difference.
In general, and for "simple" data types (int, double, char, pointers, etc), it won't make sense. It will just make the code longer and more complex for the processor [at least if the compiler does what you ask of it; it may realize that "this doesn't make any sense, let's remove this check", but I wouldn't rely on that - compilers are often smarter than you, but making life more difficult for the compiler almost never leads to better code].
Edit: Additionally, it only makes GOOD sense to compare things that can be easily compared. If it's difficult to compare the data in the case where they are equal - for example, long strings take a lot of reads from both strings if they are equal (or if they begin the same and differ only in the last few characters) - then there is very little saving. The same applies for a class with a bunch of members that are often almost all the same, but one or two fields are not, and so on. On the other hand, if you have a "customer data" class that has an integer customer ID that must be unique, then comparing just the customer ID will be "cheap", but copying the customer name, address, phone number(s), and other data on the customer will be expensive. [Of course, in this case, why is it not a (smart) pointer or reference?] End Edit.
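A sketch of that customer case (fields invented for illustration): comparing is a single int, assigning copies several strings and a vector, so the guard is at least plausible here.

#include <string>
#include <vector>

struct Customer {
    int customerId;                        // unique, so it alone decides equality
    std::string name, address;
    std::vector<std::string> phoneNumbers;
};

void assignCustomer(Customer &current, const Customer &incoming)
{
    if (current.customerId != incoming.customerId)   // cheap comparison
        current = incoming;                          // expensive copy, skipped when equal
}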
If the data is "shared" between different processors (multiple threads accessing the same data), then it may help a little bit [in particular if this value is often read, and often written with the same value as before]. This is because "kicking out" the old value from the other processor's caches is expensive, and you only want to do that if you ACTUALLY change something.
And of course, it only makes ANY sense to worry about performance when you are working on code that you know is absolutely on the bleeding edge of the performance hot-path. Anywhere else, making the code as easily readable and as clear and concise as possible is always the best choice - this will also, typically, make the compiler more able to determine what is actually going on and ensure best optimization results.
This pattern is common in Qt, where the API is highly based on signals & slots. This pattern helps to avoid infinite looping in the case of cyclic connections.
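A hypothetical Qt-style version of the setter shows why (assuming MyClass declares a valueChanged(int) signal; this is a sketch, not code from any particular project):

void MyClass::setValue(int newValue)
{
    if (value != newValue) {       // without this guard, two objects whose
        value = newValue;          // valueChanged/setValue are cross-connected
        emit valueChanged(value);  // would keep re-entering each other forever
    }
}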
In your case, where signals aren't present, this code only kills performance, as pointed out by #remus-rusanu and #mats-petersson.
I am trying to break up a long "main" program in order to be able to modify it, and also perhaps to unit-test it. It uses some huge data, so I hesitate:
What is best: to have function calls, with possibly extremely large (memory-wise) data being passed,
(a) by value, or
(b) by reference
(by extremely large, I mean maps and vectors of vectors of some structures and small classes... even images... that can be really large)
(c) Or to have private data that all the functions can access? That may also mean that main_processing() or something could have a vector of all of them, while some functions will only have an item... with the advantage of the functions being testable.
My question, though, has to do with optimization: while I am trying to break this monster into baby monsters, I also do not want to run out of memory.
It is not very clear to me how many copies of data I am going to have, if I create local variables.
Could someone please explain ?
Edit: this is not a generic "how to break down a very large program into classes". This program is part of a large solution, that is already broken down into small entities.
The executable I am looking at, while fairly large, is a single entity with non-divisible data. So the data will either all be created as member variables in a single class, which I have already created, or it will (all of it) be passed around as arguments between functions.
Which is better ?
If you want unit testing, you cannot "have private data that all the functions can access" because then, all of that data would be a part of each test case.
So, you must think about each function, and define exactly on which part of the data it works. As for function parameters and return values, it's very simple: use pass-by-value for small objects, and pass-by-reference for large objects.
You can use a guesstimate for the threshold that separates small and large. I use the rule "8 is small, anything more is large", but what is good for my system may not be equally good for yours.
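As a sketch of that rule (types and names invented for illustration):

#include <cstddef>
#include <vector>

struct Pixel { unsigned char r, g, b; };            // a few bytes: "small"
using Image = std::vector<std::vector<Pixel>>;      // possibly huge: "large"

double brightness(Pixel p)                          // small object: pass by value
{
    return (p.r + p.g + p.b) / 3.0;
}

double averageBrightness(const Image &img)          // large, read-only: const reference, no copy
{
    double sum = 0;
    std::size_t count = 0;
    for (const auto &row : img)
        for (const Pixel &p : row) { sum += brightness(p); ++count; }
    return count ? sum / count : 0.0;
}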
This seems more like a general question about OOP. Split up your data into logically grouped concepts (classes), and place the code that works with those data elements with the data (member functions), then tie it all together with composition, inheritance, etc.
Your question is too broad to give more specific advice.
Is there a "rule" for this? What i'm wondering is there best practice that tells how to combine functions to an operation. For example SetRecord-operation: if id is specified for some kind of record the operation updates the record otherwise the operation creates the record. In this case return message would tell if insert or update was made, but would this be bad design (and if it is, why)?
Another example would be that there's a containment hierarchy of records, and sometimes you want to create all levels of the hierarchy, sometimes 2 levels and sometimes only 1. A (bad) example would be the hierarchy car - seat - arm rest. Sometimes only a car or a single seat is created. Sometimes a car with 4 seats (each having 2 arm rests) is created. How is this supposed to map to WSDL operations and types? If you have an opinion I would like to know why. I must say that I'm a bit lost here.
Thanks and BR - Matti
Although there's no problem in doing that, it violates some principles of good programming patterns.
Your methods and also your classes should do only one thing and no more than one. The Single Responsibility Principle says exactly that:
The Single Responsibility Principle (SRP) says that a class should have one, and only one, reason to change. To say this a different way, the methods of a class should change for the same reasons, they should not be affected by different forces that change at different rates.
It may also violate some other principles, like:
Separation of concerns
Cohesion
I don't even have to say that it can lead to a lot of Code Smells like:
Long Method
Conditional Complexity
Check this good text.
I did some research and I think the answer above presents quite a narrow view of WSDL interface design. It is stupid to combine my question's example Insert and Update into Set in a way where the operation performed is deduced from the data (checking whether an id or similar is filled in the request message). In that kind of case it's bad because the interface does not really state what will happen. Having 2 separate operations is much clearer and does not consume any more resources.
However, combining operations can be the correct way to do things. Think about my hierarchical data example: it would require 13 requests to create a car with 4 seats, each seat having both arm rests. All boundary crossings should be expected to be costly, so this one could be combined into a single operation.
Read for example:
Is this the Crudy anti pattern?
and
http://msdn.microsoft.com/en-us/library/ms954638.aspx
and you will find out that the answer above was definitely an oversimplification, and that not all programming principles can be automatically applied in web service interface design.
A good example in the SO answer above: creating first the order header and then the order items with separate requests is bad because, e.g., it can be slow and unreliable. They could be combined into
PlaceOrder(invoiceHeader, List<InvoiceLines>)
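For contrast, the same idea sketched as a plain interface (names and types invented; the WSDL/transport plumbing is left out):

#include <string>
#include <vector>

struct InvoiceHeader { std::string customerId; };
struct InvoiceLine   { std::string item; int quantity; };

class OrderService {
public:
    // Chatty variant: one network round trip per call.
    int  createOrderHeader(const InvoiceHeader &header);     // returns the new order id
    void addOrderItem(int orderId, const InvoiceLine &line);

    // Combined variant: the whole order crosses the boundary in one request.
    int  placeOrder(const InvoiceHeader &header,
                    const std::vector<InvoiceLine> &lines);
};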
So the answer is: it depends on what you are combining. A too-low-level CRUD kind of thing is not the way to go, but combining things that don't need to be combined shouldn't be done either. Moreover, defining a clear interface with clear message structures that state straight away what will be done is the key here, rather than simplifying it to a multiple-versus-single question.
-Matti
Thanks to the help I received in this post:
I have a nice, concise recursive function to traverse a tree in postfix order:
#include <deque>

std::deque<char*> d;

void Node::postfix()
{
    if (left != nullptr) { left->postfix(); }
    if (right != nullptr) { right->postfix(); }
    d.push_front(cargo);
}
This is an expression tree. The branch nodes are operators randomly selected from an array, and the leaf nodes are values or the variable 'x', also randomly selected from an array.
char *values[10]={"1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","x"};
char *ops[4]={"+","-","*","/"};
As this will be called billions of times during a run of the genetic algorithm of which it is a part, I'd like to optimize it for speed. I have a number of questions on this topic which I will ask in separate postings.
The first is: how can I get access to each 'cargo' as it is found. That is: instead of pushing 'cargo' onto a deque, and then processing the deque to get the value, I'd like to start processing it right away.
Edit: This question suggests that processing the deque afterwards is a better way.
I don't yet know about parallel processing in C++, but this would ideally be done concurrently on two different processors.
In python, I'd make the function a generator and access succeeding 'cargo's using .next().
See the above Edit.
But I'm using C++ to speed up the Python implementation. I'm thinking that this kind of tree has been around for a long time, and somebody has probably optimized it already. Any ideas? Thanks
Of course, you'd first want to measure the cost overhead before you bother with optimization here, as your genetic algorithm's next-generation production and mutations may swamp the evaluation time.
Once you've determined you want to optimize...
the obvious answer is to compile the expression ("as much as possible"). Fortunately, there's lots of ways to "compile".
If you are implementing this in Python, you may be able to ask Python (I'm not an expert) to compile a constructed abstract syntax tree into a function, and that might be a lot faster, especially if CPython supports this.
It appears that you are implementing this in C++, however. In that case, I wouldn't evaluate the expression tree as you have defined it, as it means lots of tree walking, indirect function calls, etc. which are pretty expensive.
One cheesy trick is to spit out the actual expression as a text string with appropriate C++ function body text around it into a file, and launch a C++ compiler on that.
You can automate all the spit-compile-relink with enough script magic, so that if you do this rarely, this would work and you would get expression evaluation as fast as the machine can do it.
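A rough sketch of that trick (compiler command and file names are illustrative, adapt them to your toolchain):

#include <cstdio>
#include <cstdlib>

// Writes the expression into a tiny translation unit, builds it as a shared
// library, and leaves loading it (dlopen()/dlsym() or LoadLibrary()) to you.
void emit_and_compile(const char *expr_as_cpp)      // e.g. "((x*2.0)+x)-1.0"
{
    std::FILE *f = std::fopen("expr_gen.cpp", "w");
    if (!f) return;
    std::fprintf(f, "extern \"C\" double evaluate(double x) { return %s; }\n",
                 expr_as_cpp);
    std::fclose(f);
    std::system("g++ -O2 -shared -fPIC expr_gen.cpp -o expr_gen.so");
}

After that, every call to the generated evaluate() runs at full machine speed, which is why this only pays off if the expression is evaluated far more often than it changes.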
Under the assumption you don't want to do that, I'd be tempted to walk your expression tree before you start the evaluation process, and "compile" that tree into a set of actions stored in a linear array called "code". The actions would be defined by an enum:
enum actions {
// general actions first
pushx, // action to push x on a stack
push1,
push2, // action to push 2 on a stack
...
pushN,
add,
sub,
mul, // action to multiply top two stack elements together
div,
...
// optimized actions
add1,
sub1,
mul1,
div1, // action to divide top stack element by 1
...
addN,
subN,
...
addx,
subx,
...
}
In this case, I've defined the actions to implement a push-down stack expression evaluator, because that's easy to understand. Fortunately your expression vocabulary is pretty limited, so your actions can also be pretty limited (they'd be more complex if you had arbitrary variables or constants).
The expression ((x*2.0)+x)-1 would be executed by the series of actions
pushx
mul2
addx
sub1
It's probably hard to get a lot better than this.
One might instead define the actions to implement a register-oriented expression evaluator following the model of a multi-register CPU; that would enable even faster execution (I'd guess by a factor of two, but only if the expression got really complex).
What you want are actions that cover the most general computation you need to do (so you can always choose a valid action sequence regardless of your original expression) and actions that occur frequently in the expressions you encounter (add1 is pretty typical in machine code; I don't know what your statistics are like, and your remark that you are doing genetic programming suggests you don't know what the statistics will be either, but you can measure them somehow or make an educated guess).
Your inner loop for evaluation would then look like this (schematic code; the "..." lines stand for the remaining opcodes):
float stack[max_depth];
int stack_depth = 0;

for (int i = 0; i < expression_length; i++)
{
    switch (code[i]) // one case for each opcode in the enum
    {
    case pushx: stack[stack_depth++] = x;
                break;
    case push1: stack[stack_depth++] = 1;
                break;
    ...
    case add:   stack[stack_depth - 2] += stack[stack_depth - 1]; // combine the top two elements
                stack_depth--;
                break;
    ...
    case subx:  stack[stack_depth - 1] -= x; // operate on the top element in place
                break;
    ...
    }
}
// stack[0] contains the answer here
The above code implements a very fast "threaded interpreter" for a pushdown stack expression evaluator.
Now "all" you need to do is to generate the content of the code array. You can do that
by using your original expression tree, executing your original recursive expression tree walk, but instead of doing the expression evaluation, write the action that your current expression evaluator would do into the code array, and spitting out special case actions when you find them (this amounts to "peephole optimization"). This is classic compilation from trees, and you can find out a lot more about how to do this in pretty much any compiler book.
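A sketch of that pass, assuming you add a compile member to the Node class from the question and reuse the actions enum above (opcode_for_leaf is a hypothetical helper mapping "x" to pushx, "1.0" to push1, and so on):

#include <vector>

actions opcode_for_leaf(const char *leaf);   // hypothetical: "x" -> pushx, "1.0" -> push1, ...

void Node::compile(std::vector<actions> &code) const
{
    if (left == nullptr && right == nullptr) {   // leaf: a constant or 'x'
        code.push_back(opcode_for_leaf(cargo));
        return;
    }
    left->compile(code);                         // same postfix order as postfix()
    right->compile(code);
    switch (cargo[0]) {                          // operators are single characters
        case '+': code.push_back(add); break;
        case '-': code.push_back(sub); break;
        case '*': code.push_back(mul); break;
        case '/': code.push_back(div); break;
    }
    // A peephole pass (or a smarter switch here) would spot patterns such as
    // "push1 then add" and replace them with the specialized add1 action.
}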
Yes, this is all a fair bit of work. But then, you decided to run a genetic algorithm, which is computationally pretty expensive.
Lots of good suggestions in this thread for speeding up tree iteration:
Tree iterator, can you optimize this any further?
As for the problem, I guess you could process cargo in a different thread, but seeing as you aren't actually doing THAT much, you would probably end up spending more time in thread synchronisation mechanisms than doing any actual work.
Instead of just pushing it onto the deque, you may find that processing each item as you go along makes things run faster; or you may find that processing it all in a separate loop at the end is faster. The best way to find out is to try both methods with a variety of different inputs and time them.
Assuming that processing a cargo is expensive enough that locking a mutex is relatively cheap, you can use a separate thread to access the queue as you put items on it.
Thread 1 would execute your current logic, but it would lock the queue's mutex before adding an item and unlock it afterwards.
Then thread 2 would just loop forever, checking the size of the queue. If it's not empty, lock the queue, pull off all available cargo and process it, then repeat the loop. If no cargo is available, sleep for a short period of time and repeat.
If the locking is too expensive you can build up a queue of queues: first you put, say, 100 items into a cargo queue, then put that queue into the locked queue (like the first example), then start on a new "local" queue and continue.
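A rough sketch of the two-thread version (the queue type and process() are placeholders; process() stands in for whatever you do per cargo):

#include <chrono>
#include <deque>
#include <mutex>
#include <thread>

void process(char *cargo);                 // hypothetical per-cargo work

std::deque<char*> shared_q;
std::mutex q_mutex;
bool traversal_done = false;

void push_cargo(char *cargo)               // called by thread 1 from the tree walk
{
    std::lock_guard<std::mutex> lock(q_mutex);
    shared_q.push_back(cargo);
}

void finish_traversal()                    // called by thread 1 when the walk ends
{
    std::lock_guard<std::mutex> lock(q_mutex);
    traversal_done = true;
}

void consumer()                            // thread 2: drain and process in batches
{
    for (;;) {
        std::deque<char*> batch;
        {
            std::lock_guard<std::mutex> lock(q_mutex);
            batch.swap(shared_q);          // grab everything available in one lock
            if (batch.empty() && traversal_done) return;
        }
        for (char *cargo : batch)
            process(cargo);
        if (batch.empty())
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

Swapping the whole deque out under the lock is a cheap way to get the batching effect described above without a second level of queues.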