Related
I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even, however, being an image processing affair, it loops a lot, and the roadblocks I'm encoutering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if(distance < threshold)
{
positiveRows.push_back(row);
positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp,
int ap,
int bp,
int Lq,
int aq,
int bq)
{
return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can take the calculation definition to a separate file, and have it perform the same as the original solution? Furthermore, should the fact that compilers are optimizing my program to run 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is because I don't know what is the name of this stuff. I've heard the terms "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code it its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger space in memory and then copying the old elements to the new vector. Finally deallocating the memory. The std::vector has the capability of reserving space upon construction. For large sizes of vectors, this can be a time saver, eliminating many reallocations.
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeve and can often spot locations to optimize that you or I miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by copying the content in a for loop many times. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data processing instructions. Unrolling a loop reduces the number of executed branch instructions. Again, the compiler may automatically perform this when you raise the optimization level.
Data cache optimization
Search the web for "Data cache optimization". Loading and unloading the data cache wastes time. If your data can fit into the processor's data cache, it doesn't have to keep loading an unloading (called cache misses). Also remember to perform all your operations on the data in one place before performing other operations. This reduces the likelihood of the processor reloading the cache.
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the Operating System delegate the tasks. The OS could delegate to different cores in your processor (if you have a multi-core processor) or it runs it in another thread. Again there is a cost: overhead of managing the task or thread and communications.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".
In his talk a few days ago at Facebook - slides, video, Andrei Alexandrescu talks about common intuitions that might prove us wrong. For me one very interesting point came up on Slide 7 where he states that the assumption "Fewer instructions = faster code" is not true and more instructions will not necessarily mean slower code.
Here comes my problem: The audio quality of his talk (around 6:20min) is not that well and I don't understand the explanation very well, but from what I get is that he is comparing retired instructions with optimality of an algorithm on a performance level.
However, from my understanding this cannot be done because these are two independent structural levels. Instructions (especially actually retired instructions) are one very important measure and basically, gives you an idea about performance to achieve a goal. If we leave out the latency of an instruction, we can generalize that fewer retired instructions = faster code. Now, of course there are cases where an algorithm that performs complex calculations inside a loop will yield better performance even though it is performed inside the loop, because it will break the loop earlier (think graph traversal). But wouldn't it be more useful to compare to algorithms on a complexity level rather than saying this loop has more instructions and is better than the other? From my point of view, the better algorithm will have less retired instructions in the end.
Can someone please help me to understand where he was going with his example, and how can there be a case where (significantly) more retired instructions lead to better performance?
The quality is indeed bad, but I think he leads to the fact that CPUs are good for calculations, but suffer from bad performance for memory seek (RAM is much slower then CPU), and branches (because CPU works as a pipeline, and branches might cause the pipeline to break).
Here are some cases where more instructions are faster:
Branch prediction - even if we need to do more instructions, but it causes for a better branch prediction, the pipeline of the CPU will be full more time, and less ops will be "thrown out" of it, which ultimately leads to better performance. This thread for example, shows how doing the same thing, but first sorting - improves performnce.
CPU Cache - If your code is more cache optimized, and follows the principle of locality - it is more likely to be faster then a code who doesn't, even if the code that doesn't do half the amount of instructions. This thread gives an example for a small cache optimization - that the same number of instructions might result in much slower code if it is not cache optimized.
It also matters which instructions are done. Sometimes - some instructions might be slower to perform then others, for example - divide might be slower then integer addition.
Note: All of the above are machine dependent and how/if they actually change the performance might vary from one architecture to the other.
The number of instructions is not a good measure in itself.
Fewer retired instructions (because there is nothing more to do) = faster code.
Fewer retired instructions (because they have to wait for dependencies) = slower code.
It can sometimes be that more instructions in the code also means more retired instructions, because they can use up execution slots that would otherwise be wasted in case 2.
I am writing an application whose purpose is to optimize a trading strategy. For the sake of simplicity, assume only that we have a trading strategy that says "enter here", then another that says "exit here if in a trade" and then lets have two models: one says how much risk we should take (how much we lose if we're on the wrong side of the market) and the other says how much profit we should take (i.e. how much profit we will take if the market agrees).
For simplicity sake, I will refer to historical realized trades as ticks. That means if I "enter on tick 28" this means I would have entered a trade in the time of 28th trade in my dataset at the price of this trade. Ticks are stored chronologically in my dataset.
Now, imagine the entry strategy on the whole dataset comes up with 500 entries. For each entry, I can precalculate the exact entry tick. I can also calculate the exit points determined by the exit strategy for each entry point (again as tick numbers). For each entry, I can also precalculate the modeled loss and profit and the ticks where these losses or profits would have been hit. The last thing that remains to be done is calculating what would have happenned first, i.e. exit on strategy, exit on a loss or exit on a profit.
Hence, I iterate through the array of trades and calculate exitTick[i] = min(exitTickByStrat[i], exitTickByLoss[i], exitTickByProfit[i]). And the whole process is bloody slow (let's say I do this 100M times). I suspect cache misses are the main culprit. And the question is: can this be made faster somehow? I have to iterate through 4 arrays of some non-trivial length. One suggestion I have come up with would be to group data in tuples of four, i.e. have one array of structures like (entryTick, exitOnStrat, exitOnLoss, exitOnProfit). This might be faster due to better cache predictability, but I cannot say for sure. Why I haven't tested it so far is that instrumenting profilers somehow don't work for release binaries of my app while sampling profilers seem to me to be unreliable (I have tried Intel's profiler).
So the final questions are: can this problem be made faster? What is the best profiler to use for mem profiling with release binaries? I work on Win7, VS2010.
Edit:
Many thanks to all. I tried to simplify my original question as much as possible, hence the confusion. Just to make sure it's readable - target means an envisaged/realized profit, stop means an envisaged/realized loss.
The optimizer is a brute-force one. So, i have some strat settings (e.g. indicator periods, whatever), then min/max breakEvenAfter/breakEvenBy and then formulas to give you stop/target values in ticks. These formulas are also objects of optimization. Hence, I have a structure of optimization like
for each in params
{
calculateEntries()
for each in beSettings
{
precalculateBeData()
for each in targetFormulaSettings
{
precalculateTargetsAndRespectiveExitTicks
for each in stopFormulaSettings
{
precalulcateStopsAndRespectiveExitsTicks
evaluateExitsAndDetermineImprovement()
}
}
}
}
So I precalculate stuff as much as possible and only calculate something when I need it. And out of 30 seconds, the calculation spends 25 seconds in the evaluateExitsAndDetermineImprovement() function which does just what I described in the original question, i.e. picks min(exitOnPattern, exitOnStop, exitOnTarget). The reason why I need to call the function 100M times is because I have 100M combinations of all params combined. But within the last for cycle only the exitOnStops array changes. I can post some code if that helps. Im grateful for all the comments!
I don't know much about trading strategies, but i usually do some optimisation.
Well, there are many optimisation methods.
Like, type of container, using a different min function(i think boost has a somewhat faster function than in stl library), try reducing same calculations,etc.
Also you can optimise by using faster functions to gain speed, or by redesinging your algorithm.
For profiling I use GlowCode under Win7 x64, and it's ok for release builds too.
Maybe I misunderstand your system completely, but:
what is it that you "pre-calculate" and when and WHY 100M times???
I don't know if it will help you but it may simplify your system significantly - there are 2 common trading strategies: (descriptions are my and not official)
1) "fixed point exit" - when the trade happens all exit points are calculated once and they are checked against market conditions/price periodically.
2) "variable point exit" - when the market moves the exit points are recalculated (usually to lock in more profit/reduce loss).
In case 1) the actual calculation happens only once so it should be VERY fast
In case 2) the calculations will happen every time, but it can be optimised in many different ways - one of them being that you may store your trades indexed by exit points and only get and re-calculate those close to the actual market situation.
I am not sure which cache misses you are referring to? You data cache? CPU cache?
So, after some work, I understood the advice by Alexandre C. When I ran cache-miss profiling, I found that out of 15M calls of the evaluateExits() function I have only 30K cache misses hence the performance of this function cannot be hindered by cache. Hence, I had to "start believing" that VTune is actually producing valid results, albeit weird. Since the analysis of VTune output does not match the current thread's name, I decided to start a new thread. Thank you all for opinions and recommendations.
I'm working on a physics engine and feel it would help having a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that a comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be check on if they're near each other before just going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =
You'll have to do profiling.
Basic operations, like additions or multiplications should only take one asm instructions.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will results in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as #Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables with instruction latency are http://www.agner.org/optimize/instruction_tables.pdf
1. it depends on the code a lot. Also don't forget it doesn't depend only on computations, but how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozens of cycles. The current processors does the prediction on conditions. If the prediction is right right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have a code, where a condition is met most of the time, it will be fast. Similarly if you have code where the condition is not met most the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around alot, or are they static? You will have to run tests. Possible data-structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application, you'd have to check if the maintenance of the data-structure and the lookup-cost are OK for you.
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
as much as I know all "inequality checks" take the same time.
regarding the rest calculations, I would advice you to run some tests like
take time stamp A
make 1,000,000 "+" calculation (or any other).
take time stamp B
calculate the diff between A and B.
then you can compare the calculations.
take in mind:
using different mathematical lib may change it (some math lib are more performance oriented and some more precision oriented)
the compiler optimization may change it.
each processor is doing it differently.
How to predict C++ program running time, if program executes different functions (working with database, reading files, parsing xml and others)? How installers do it?
They do not predict the time. They calculate the number of operations to be done on a total of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No you can't. For progress bars and such, most applications simply increase the bar length with a percentage based on the overall tasks done. Some psuedo-code:
for(int i=0; i<num_files_to_load; ++i){
files.push_back(File(filepath[i]));
SetProgressBarLength((float)i/((float)num_files_to_load) - 1.0f);
}
This is a very simplified example. Making a for-loop like this would surely block the window system's event/message queue. You would probably add a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem finding whether a program completes at all is undecidable. This is called the Halting problem. You can however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume is not constant anymore. So cache makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
Why does the copy dialog give such
horrible estimates?
Because the copy dialog is just
guessing. It can't predict the future,
but it is forced to try. And at the
very beginning of the copy, when there
is very little history to go by, the
prediction can be really bad.
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do more is assign weights to certain operations (for instance, if you know a db call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time by that - just time how long it takes for 1% and multiply by 100, for example. That is easily done if you can enumerate all that you have to do in advance, or have some kind of a loop going on..
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximatelyh linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make estimation of time needed for different phases of the program. For example: reading files - 50, working with database - 30, working with network - 20. In ideal it would be good if you make some progress callback during all of those phases, but it requires coding the progress calculation into the iterations of algorithm.