Caching in a high-performance financial application - C++

I am writing an application whose purpose is to optimize a trading strategy. For the sake of simplicity, assume we have one strategy that says "enter here", another that says "exit here if in a trade", and two models: one that says how much risk we should take (how much we lose if we're on the wrong side of the market) and one that says how much profit we should take (i.e. how much profit we will take if the market agrees).
For simplicity's sake, I will refer to historical realized trades as ticks. That means if I say "enter on tick 28", I would have entered a trade at the time of the 28th trade in my dataset, at the price of that trade. Ticks are stored chronologically in my dataset.
Now, imagine the entry strategy comes up with 500 entries on the whole dataset. For each entry, I can precalculate the exact entry tick. I can also calculate the exit point determined by the exit strategy for each entry (again as a tick number). For each entry, I can also precalculate the modeled loss and profit and the ticks where those losses or profits would have been hit. The last thing that remains is calculating what would have happened first: exit by strategy, exit on a loss, or exit on a profit.
Hence, I iterate through the array of trades and calculate exitTick[i] = min(exitTickByStrat[i], exitTickByLoss[i], exitTickByProfit[i]). And the whole process is bloody slow (say I do this 100M times). I suspect cache misses are the main culprit. And the question is: can this be made faster somehow? I have to iterate through four arrays of non-trivial length. One suggestion I have come up with is to group the data into tuples of four, i.e. have a single array of structures like (entryTick, exitOnStrat, exitOnLoss, exitOnProfit). This might be faster due to better cache predictability, but I cannot say for sure. The reason I haven't tested it yet is that instrumenting profilers somehow don't work on release binaries of my app, while sampling profilers seem unreliable to me (I have tried Intel's profiler).
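For illustration, the grouped layout I have in mind would be something like this (names are just placeholders):

#include <vector>

// One record per entry: a single linear scan touches all four values
// together, instead of striding through four separate arrays.
struct EntryRecord
{
    int entryTick;
    int exitOnStrat;
    int exitOnLoss;
    int exitOnProfit;
};

std::vector<EntryRecord> records;   // ~500 elements per optimization pass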
So the final questions are: can this problem be made faster? And what is the best profiler to use for memory profiling with release binaries? I work on Win7, VS2010.
Edit:
Many thanks to all. I tried to simplify my original question as much as possible, hence the confusion. Just to make sure it's readable: target means an envisaged/realized profit, stop means an envisaged/realized loss.
The optimizer is a brute-force one. So, I have some strategy settings (e.g. indicator periods, whatever), then min/max breakEvenAfter/breakEvenBy, and then formulas that give you stop/target values in ticks. These formulas are also objects of optimization. Hence, I have an optimization structure like
for each in params
{
    calculateEntries()
    for each in beSettings
    {
        precalculateBeData()
        for each in targetFormulaSettings
        {
            precalculateTargetsAndRespectiveExitTicks()
            for each in stopFormulaSettings
            {
                precalculateStopsAndRespectiveExitTicks()
                evaluateExitsAndDetermineImprovement()
            }
        }
    }
}
So I precalculate as much as possible and only calculate something when I need it. Out of 30 seconds, the calculation spends 25 seconds in the evaluateExitsAndDetermineImprovement() function, which does just what I described in the original question, i.e. picks min(exitOnPattern, exitOnStop, exitOnTarget). The reason I need to call the function 100M times is that I have 100M combinations of all the params combined. But within the innermost loop only the exitOnStops array changes. I can post some code if that helps. I'm grateful for all the comments!
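For concreteness, a stripped-down sketch of what the hot function does (heavily simplified; the improvement bookkeeping is omitted):

#include <algorithm>
#include <cstddef>
#include <vector>

// For each precalculated entry, pick whichever exit would have fired
// first. Only exitOnStop changes between calls in the innermost loop.
void evaluateExits(const std::vector<int>& exitOnPattern,
                   const std::vector<int>& exitOnStop,
                   const std::vector<int>& exitOnTarget,
                   std::vector<int>& exitTick)
{
    for (std::size_t i = 0; i < exitTick.size(); ++i)
    {
        exitTick[i] = std::min(exitOnPattern[i],
                               std::min(exitOnStop[i], exitOnTarget[i]));
    }
    // ...the "improvement" metric is then derived from exitTick...
}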

I don't know much about trading strategies, but I usually do some optimisation.
Well, there are many optimisation methods:
the choice of container, using a different min function (I believe Boost has a somewhat faster one than the standard library, though I'd verify that), eliminating repeated calculations, etc.
You can also optimise by using faster functions, or by redesigning your algorithm.
For profiling I use GlowCode under Win7 x64, and it's ok for release builds too.

Maybe I misunderstand your system completely, but: what is it that you "pre-calculate", and when, and WHY 100M times?
I don't know if it will help you, but it may simplify your system significantly - there are two common trading strategies (descriptions are mine, not official):
1) "fixed point exit" - when the trade happens, all exit points are calculated once and are checked against market conditions/price periodically.
2) "variable point exit" - when the market moves, the exit points are recalculated (usually to lock in more profit/reduce loss).
In case 1) the actual calculation happens only once, so it should be VERY fast.
In case 2) the calculations happen every time, but can be optimised in many different ways - one of them being to store your trades indexed by exit points and only fetch and recalculate those close to the actual market situation, as in the sketch below.
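A minimal sketch of that indexing idea, assuming exits are keyed by price (types and names are made up):

#include <map>

struct Trade;   // whatever your trade type is

// Exit points sorted by price: on a price move, only trades whose exits
// lie near the current price need to be re-examined.
typedef std::multimap<double, Trade*> ExitIndex;

void checkExits(ExitIndex& exitsByPrice, double price, double band)
{
    ExitIndex::iterator it  = exitsByPrice.lower_bound(price - band);
    ExitIndex::iterator end = exitsByPrice.upper_bound(price + band);
    for (; it != end; ++it)
    {
        // recalculate / trigger only these trades
    }
}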
Also, I am not sure which cache misses you are referring to. Your data cache? The CPU instruction cache?

So, after some work, I understood the advice by Alexandre C. When I ran cache-miss profiling, I found that out of 15M calls to the evaluateExits() function there were only 30K cache misses, so the performance of this function cannot be hindered by the cache. Hence, I had to "start believing" that VTune is actually producing valid results, albeit weird ones. Since the analysis of the VTune output no longer matches this thread's title, I decided to start a new thread. Thank you all for your opinions and recommendations.

Related

How do I segregate C++ code without impacting performance?

I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if that; however, being an image-processing affair, it loops a lot, and the roadblocks I'm encountering (I assume) come down to very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At two points, the program calculates a computationally cheaper variant of the Euclidean distance, called the taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if (distance < threshold)
{
    positiveRows.push_back(row);
    positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq)
{
    return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp/.cpp files, I get a significant slowdown. This directly opposes what I was told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100 ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing and pass arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function accept an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
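For reference, the split version looks roughly like this (reconstructed; file names made up):

// taxicab.hpp
int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq);

// taxicab.cpp
#include "taxicab.hpp"
#include <cstdlib>

int taxicab_dist(int Lp, int ap, int bp,
                 int Lq, int aq, int bq)
{
    // My guess: being in its own translation unit, this can no longer be
    // inlined at the call site (unless link-time code generation is on),
    // but I haven't verified that.
    return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}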
So my question is: how can I take the calculation's definition to a separate file and have it perform the same as the original solution? Furthermore, should the fact that compilers optimize my program to run in 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords/features)?
I'm guessing the main reason I've failed to find an answer is that I don't know the name of this stuff. I've heard the term "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code in its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger block of memory, copying the old elements to the new storage, and finally deallocating the old memory. std::vector has the capability of reserving space up front. For large vectors, this can be a time saver, eliminating many reallocations.
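A sketch, using the vectors from your question and assuming the frame dimensions (320x240 here) are known up front:

#include <vector>

std::vector<int> positiveRows, positiveCols;
// Worst case every pixel matches, so one allocation covers everything.
positiveRows.reserve(320 * 240);
positiveCols.reserve(320 * 240);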
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeves and can often spot locations to optimize that you or I would miss.
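For example, enabling speed optimizations from the command line (the file name is just a placeholder):

g++ -O2 image.cpp        # GCC/MinGW: optimize for speed; try -O3 as well
cl /O2 image.cpp         # Visual C++: maximize speed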
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by replicating the body of a for loop several times within one iteration. This is called loop unrolling. On some processors, branch or jump instructions cost more execution time than data-processing instructions; unrolling a loop reduces the number of executed branch instructions. Again, the compiler may perform this automatically when you raise the optimization level.
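A toy 4x unroll, for illustration (the data and its meaning are placeholders):

#include <cstddef>

double sumUnrolled(const double* data, std::size_t n)
{
    double sum = 0.0;
    std::size_t i = 0;
    // Process four elements per iteration: one branch per four additions.
    for (; i + 4 <= n; i += 4)
        sum += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < n; ++i)      // mop up the remainder
        sum += data[i];
    return sum;
}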
Data cache optimization
Search the web for "data cache optimization". Loading and unloading the data cache wastes time; if your data fits in the processor's data cache, it doesn't have to keep reloading it (reloads are cache misses). Also remember to perform all your operations on a piece of data in one place before moving on to other operations. This reduces the likelihood of the processor reloading the cache.
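For example, with a row-major image buffer (all names are placeholders):

void process(unsigned char value);   // whatever per-pixel work you do

void processFrame(const unsigned char* image, int width, int height)
{
    // Row-major layout: the inner loop walks memory sequentially, so each
    // cache line is fully used before it is evicted. Swapping the two loops
    // strides through memory and tends to miss the cache far more often.
    for (int row = 0; row < height; ++row)
        for (int col = 0; col < width; ++col)
            process(image[row * width + col]);
}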
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processing, you can have the operating system delegate the tasks. The OS could delegate to different cores of your processor (if you have a multi-core processor) or run them in other threads. Again there is a cost: the overhead of managing the task or thread, and of communication.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".

Profiler for C++ code, Very Sleepy

I'm a newbie at profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler - in my case, Very Sleepy. I did some searching but found no decent tutorial on Sleepy, so here is my question:
How do I use it properly? I grasped the general idea of profiling, so I sorted by % exclusive to find my bottlenecks. At the top of the list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators... and only after these take up some 60% does my first function appear, the first entry with child calls. Can someone explain why the above show up, what they mean, and how I can optimize my code if I have no access to this critical 60%? (Their "source file" is listed as unknown...)
Also, I'd expect to get a time for each line of my function, but that's not the case; e.g. some arithmetic and some function calls have no timing at all (and they are not nested in unused "if" clauses).
And one last thing: how do I find out that some line executes superfast but is called thousands of times, being the actual bottleneck?
Finally, is Sleepy any good? Or is there a better free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of the profiler, called plain Sleepy. It shows how many times each snippet was called, plus the line number (I guess it points to the critical one). So in my case... KiFastSystemCallRet takes 50%! Does that mean it's waiting for some data? How can I improve that - is there a decent approach to trace what causes these multiple calls and eventually remove/change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. With a 30% bottleneck, each pause has a 30% chance of landing in it, so on average you will see the problem on 3 of the pauses (and you miss it entirely only with probability 0.7^10, about 3%).
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers
are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (a commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data the tool produces. The instrumenting step injects a little trace function at every method's entry and exit points, and simply measures how long each function takes to run to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast: even elite race car drivers and twitch gamers have reaction times on the order of 150-200 milliseconds, and ordinary users like bank tellers are slower still. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second (i.e. 10 ms per user) brings it to 9 ms per user, meaning you could improve the service to handle roughly 111 users per second - about a 10% improvement.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
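A sketch of what that accumulator pattern drags in (illustrative only; names are made up):

#include <cstddef>
#include <vector>

class Invoice
{
    std::vector<double> lineAmounts;
    double runningTotal;                 // cached sum of lineAmounts
public:
    Invoice() : runningTotal(0.0) {}
    void addLine(double amount)
    {
        lineAmounts.push_back(amount);
        runningTotal += amount;          // must update the cache here...
    }
    void voidLine(std::size_t i)
    {
        runningTotal -= lineAmounts[i];  // ...and here, and in every other
        lineAmounts[i] = 0.0;            // mutation, or the total silently
    }                                    // goes stale
    double total() const { return runningTotal; }
};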

Performance of breaking apart one loop into two loops

Good Day,
Suppose that you have a simple for loop like below...
for (int i = 0; i < 10; i++)
{
    // statement 1
    // statement 2
}
Assume that statement 1 and statement 2 are O(1). Besides the small overhead of "starting" another loop, would breaking that for loop into two (not nested, but sequential) loops be just as fast? For example...
for (int i = 0; i < 10; i++)
{
    // statement 1
}
for (int i = 0; i < 10; i++)
{
    // statement 2
}
The reason I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS so I can simply call
cds.update(objectlist);
instead of having to break my CDS up. (Don't worry too much about my CDS implementation; I think I know what I am doing, I just don't know how to explain it. What I really need to know is whether I take a huge performance hit by looping through all my objects again.)
It depends on your application.
Possible Drawbacks (of splitting):
your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop
Possible Gains (of splitting):
your loop contains many variables; splitting helps reduce register/stack pressure, and the optimizer turns it into better machine code
the functions you use trash the L1 instruction cache, so it is reloaded on every iteration, while by splitting you manage to load it only once, at the first iteration of each loop
These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to make even an educated guess when we know neither.
When in doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed, and measure the time spent.
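For concreteness, the two shapes under discussion look roughly like this (the arrays and the work are placeholders):

void fused(int* a, int* b, int n)
{
    // Fused: one pass; both statements share the loop overhead, and any
    // data they have in common stays hot in cache.
    for (int i = 0; i < n; i++)
    {
        a[i] += 1;   // statement 1
        b[i] *= 2;   // statement 2
    }
}

void split(int* a, int* b, int n)
{
    // Split: each body is simpler (less register pressure), but large
    // arrays are streamed through the cache twice.
    for (int i = 0; i < n; i++)
        a[i] += 1;
    for (int i = 0; i < n; i++)
        b[i] *= 2;
}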
In terms of algorithmic complexity splitting the loops makes no difference.
In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.
As noted, the complexity remains.
But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:
Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes
(note: over all of them, there's the Damocles sword of misprediction; all are wikipedizable and googlable)
Especially the last factor can make it impossible to compile the one true binary for code whose performance depends on specific cache sizes. Some applications will run faster on a CPU with huge caches while running slower on one with small caches, and for other applications it will be the opposite.
Solutions:
Let your compiler do the job of loop transformation. Modern g++'s are quite good in that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.
It is always a good idea to strive for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, whether microscopic or macroscopic, measure real-world runs; then, and only then, optimize.
Literature:
* Agner Fog's Guides
* Intel's Guides
With two loops you will be paying for:
increased generated code size
2x as many branch predicts
depending on what the data layouts of statement 1 and statement 2 are, you could be reloading data into the cache.
The last point could have a huge impact in either direction. You should measure as with any perf optimization.
As far as big-O complexity is concerned, this makes no difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimisation, it is hard to say. The cost of a loop is rather small, we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.
You're correct in noting that there will be some performance overhead from creating a second loop, so it cannot be "equally fast": that overhead, while small, is still overhead.
I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.
Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

Inaccuracy in gprof output

I am trying to profile a C++ function using gprof; I am interested in the % time taken. I did more than one run, and for some reason I got a large difference in the results. I don't know what is causing this - I assume it's the sampling rate, or perhaps I/O, which other posts say has something to do with it. So is there a way to make it more accurate and generate roughly constant results?
I was thinking of the following:
increase the sampling rate
flush the caches before executing anything
use another profiler, but I want it to generate results in a format similar to gprof's (time%, function name). I tried Valgrind, but it gave me a massive file. Maybe I am generating the file with the wrong command.
Waiting for your input
Regards
I recommend printing a copy of the gprof paper and reading it carefully.
According to the paper, here's how gprof measures time. It samples the PC, and it counts how many samples land in each routine. Multiplied by the time between samples, that is each routine's total self time.
It also records in a table, by call site, how many times routine A calls routine B, assuming routine B is instrumented by the -pg option. By summing those up, it can tell how many times routine B was called.
Starting from the bottom of the call tree (where total time = self time), it assumes the average time per call of each routine is its total time divided by the number of calls.
Then it works back up to each caller of those routines. The time of each routine is its average self time plus the average number of calls to each subordinate routine times the average time of the subordinate routine.
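To make that attribution concrete (illustrative numbers): suppose sampling charges routine B 1 second of self time and the call-count table shows 100 calls to B. gprof then assumes 10 ms per call, so a caller that made 40 of those calls gets charged 400 ms of B's time - even if its particular calls were the cheap ones.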
You can see, even if recursions (cycles in the call graph) are not present, how this is fraught with possibilities for errors, such as assumptions about average times and average numbers of calls, and assumptions about subroutines being instrumented, which the authors point out. If there are recursions, they basically say "forget it".
All of this technology, even if it weren't problematic, raises the question: what is its purpose? Usually, the purpose is to "find bottlenecks". According to the paper, it can help people evaluate alternative implementations. That's not finding bottlenecks. They do recommend looking at routines that seem to be called many times, or that have high average times. Certainly routines with low average cumulative time should be ignored, but that doesn't localize the problem very much. And it completely ignores I/O, as if all the I/O that is done were unquestionably necessary.
So, to try to answer your question, try Zoom, for one, and don't expect to eliminate statistical noise in measurements.
gprof is a venerable tool, simple and rugged, but the problems it had in the beginning are still there, and far better tools have come along in the intervening decades.
Here's a list of the issues.
gprof is not very accurate, particularly for small functions, see http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC11
If this is Linux then I recommend a profiler that doesn't require the code to be instrumented, e.g. Zoom - you can get a free 30 day evaluation license, after that it costs money.
All sampling profilers suffer from statistical inaccuracies - if the error is too large, then you need to sample for longer and/or with a smaller sampling interval.

Predict C++ program running time

How can you predict a C++ program's running time if the program executes different functions (working with a database, reading files, parsing XML, and others)? How do installers do it?
They do not predict the time. They report the number of operations completed so far out of the total number of operations.
You can predict the time by using measurement and estimation. Of course the quality of the prediction will vary. And by the way: the word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
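A minimal sketch of that measure-and-update idea (how you split the work into tasks and time them is up to you; names are made up):

#include <cstddef>
#include <vector>

// Average cost of the tasks finished so far, extrapolated over the rest.
double estimateRemainingSeconds(const std::vector<double>& finishedDurations,
                                std::size_t totalTasks)
{
    if (finishedDurations.empty())
        return -1.0;                     // no history yet, no estimate
    double elapsed = 0.0;
    for (std::size_t i = 0; i < finishedDurations.size(); ++i)
        elapsed += finishedDurations[i];
    double average = elapsed / finishedDurations.size();
    return average * (totalTasks - finishedDurations.size());
}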
There really is no reliable way to do this, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No, you can't. For progress bars and such, most applications simply increase the bar length by a percentage based on the tasks done so far. Some pseudo-code:
for (int i = 0; i < num_files_to_load; ++i)
{
    files.push_back(File(filepath[i]));
    SetProgressBarLength((float)(i + 1) / (float)num_files_to_load);  // fraction loaded so far
}
This is a very simplified example. Making a for loop like this would surely block the window system's event/message queue. You would probably use a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem of determining whether a program completes at all is undecidable; this is called the halting problem. You can, however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings and failed to execute correctly on newer computers, due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the number of cycles an instruction consumes is no longer constant, so caching makes cycle counting unpredictable.
Raymond Chen discussed this issue in his blog.
Why does the copy dialog give such horrible estimates? Because the copy dialog is just guessing. It can't predict the future, but it is forced to try. And at the very beginning of the copy, when there is very little history to go by, the prediction can be really bad.
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do beyond that is assign weights to certain operations (for instance, if you know a DB call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time from that - just time how long the 1% takes and multiply by 100, for example. That is easily done if you can enumerate everything you have to do in advance, or if you have some kind of loop going on.
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximately linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make an estimate of the time needed for the different phases of the program. For example: reading files - 50, working with the database - 30, working with the network - 20. Ideally you would provide progress callbacks during each of those phases, but that requires coding the progress calculation into the iterations of the algorithm.