Related
Skippable Context: I have a simulation loop (using fixed update but variable rendering pattern), which instantiates entities of classes that are generated on the fly according to user-input and/or file configurations from a database of millions of components (of the container state modifying pattern).
I have implemented a system that automagically...deduces(some kind of logic cell/math I don't know the name to) and applies needed components whenever the user-input/config ignores the fact that one of their options requires additional components.
How so? Many of the components are complex formulae or fuzzy logic (gates?) or other complex scientific reasoning, coded in a way that can operate my simulation's structure, its objects, its environment, thus sometimes one component relies on the other and I need the 'deduction algorithm/system' to be able to pass that reliance to the class constructor.
I've used maximum granularity in the way I decided to store these 'knowledge pieces' because I really can't waste any memory given the sizes and compute intensity of the simulations and number of individual instances but now I'm running the issue of a single instance needing thousands, sometimes tens of thousands of components, and I need the instance's "creation-map" both saved and still bound as a private member so I can: 1st - know where my deductions are leading the constructor of instances and maybe be able to use memoization to reduce build times; 2nd - implement injection of changes to a live instance during the simulation¹.
What I think I need: I need a possibly infinite or at least very long
bit-mask, so I can iterate the creation faster and let the final
component-tree of my dynamically-constructed objects logged for future
use.
Possible approaches, which I have no idea will work: 1st - Manually
and sequentially store values for the bit flags in each RAM cell,
using the RAM wafer as my bit-mask. 2nd - Breakdown the map into
smaller bit-masks of known size(hard because the final number of
components is unknown until the creation is done and decoupling the
deduction is something which might even be impossible without
refactoring the entire system). 3rd - Figure out a way to make an
infinite bit-mask, or use some library that has implemented a really
long integer(5.12e+11 or bigger).
My code is written in C++, my rendering and compute kernels are Vulkan.
My objective question:
How do I implement an arbitrarily long bit-mask that is memory and compute efficient?
If I am allowed a bonus question, what is the most efficient way to implement such a bit-mask, assuming I have no architecture (software and hardware) constraints?
¹ I cannot browse an object's tree during the simulation and I also cannot afford to pause the simulation and wait for the browsing to be done before injecting the modification, I need to know and be able to make an arbitrary change on an arbitrary frame of simulation in both a pre-scheduled manner and a real-time manner.
what is the most efficient way to implement such a bit-mask
Put std::vector<bool> there and profile. If/when you’ll discover you’re spending significant time working with that vector, read further.
There’s no most efficient way, it all depends on what exactly does your code do with these bits.
One standard approach is keep continuous vector of integers, and treat them as a vector of bits. For example, std::vector<bool> on Windows uses 32-bit values, keeping 32 bools in each one.
However, the API of std::vector is too general. If your vector has majority of 0-s, or majority of 1-s, you’ll be able to implement many operations (get value, find first/next/previous 0/1) much faster than std::vector<bool> does.
Also depending on your access patterns, you may want either smaller or larger elements. If you’ll decide to use large SIMD types i.e. __m128i or __m256i don’t forget about alignment for them.
Suppose you have a very large graph with lots of processing upon its nodes (like tens of milions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be 2 such conditions which produces 4 cases (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. Conditions are established once (one set for each node independently) in a program and, from then on, never change. Unfortunately, they are determined in runtime and in a fully unpredictable way (based on data received via HTTP from and external server).
What is the fastest in such scenario? (taken into account modern compiler optimizations which I have no idea of)
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from base class (exposing method OPERATION) to have a proper operation and save tens milions of "ifs". [but I am not sure if this is a real saving, inheritance must have its overhead too)
use pointer to function to assign the function based on conditions once.
I would take me long to come to a point where I can test it by myself (I don't have such a big data yet and this will be integrated into bigger project so would not be easy to test all versions).
Reading answers: I know that I probably have to experiment with it. But apart from everything, this is sort of a question what is faster:
tens of milions of IF statements and normal statically known function calls VS function pointer calls VS inheritance which I think is not the best idea in this case and I am thinking of eliminating it from further inspection Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)
There is no real answer except to measure the actual code on the
real data. At times, in the past, I've had to deal with such
problems, and in the cases I've actually measured, virtual
functions were faster than if's. But that doesn't mean much,
since the cases I measured were in a different program (and thus
a different context) than yours. For example, a virtual
function call will generally prevent inlining, whereas an if is
inline by nature, and inlining may open up additional
optimization possibilities.
Also the machines I measured on handled virtual functions pretty
well; I've heard that some other machines (HP's PA, for example)
are very ineffective in their implementation of indirect jumps
(including not only virtual function calls, but also the return
from the function---again, the lost opportunity to inline
costs).
If you absolutely have to have the fastest way, and the process order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then in a container class, have four vectors, one for each node type. Upon creation of a new node get all the data you need to create the node, including the conditions, and create a node of the correct type and push it into the correct vector. Then, when you need to process all your nodes, process them type by type, first processing all nodes in the first vector, then the second, etc.
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you'd have inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance as any possible instruction cache thrashing would essentially be gone and depending on the methods of branch prediction the branch prediction misses could also be reduced.
Also, don't make a vector of pointers, but make a vector of objects. If they are pointers you have an extra adressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are pretty much randomly spread throughout your memory. If on the other hand your objects are directly put into the vector the processing will basically go through memory linearly and the cache will basically hit every time and cache prefetching might actually be able to do a good job.
Note though that you would pay heavily in data structure creation if you don't do it correctly, if at all possible, when making the vector reserve enough capacity in it immediately for all your nodes, reallocating and moving every time your vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented by a vtable at the head of the class, which is basically just a table of function pointers to the implementation of the given virtual for the actual type of the object. Whether ifs is faster than virtuals or the other way around is a very very difficult question to answer and depends completely on the implementation, compiler and platform used.
I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining which also can be dramatic. Virtual functions and pointer to function also involve loading from memory and could possibly cause cache misses
But, you have four conditions so branch misses can be expensive.
Without the ability to test and verify the answer really can't be answered. Especially since its not even clear that this would be a performance bottleneck sufficient enough to warrant optimization efforts.
In cases like this. I would err on the side of readability and ease of debugging and go with if
Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, cacheing, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc. etc. and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.
Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding a control speculation (and paying the penalty of a branch misprediction), and just wait for the actual data to arrive and resolve the control flow (more likely to occur using virtual functions). You'll have to benchmark of course to verify that, as it depends on the actual pattern (if for e.g. you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(logn) one, so for large sizes you'll lose unless you can sort once - traverse many times, or otherwise cheaply maintain the sort state.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree about the comments regarding early-optimizations - do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be process in parallel, so even if you run this on a single thread (and much more if you use multiple cores) - you should be bandwidth-bound and not latency bound.
I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler, for me it's Very Sleepy. I did some search but found no decent tutorial on Sleepy, and here my question:
How to use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. Firstly, on the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and after they take some like 60% there comes my first function, first position with Child Calls. Can someone explain me why above mentioned come out, what do they mean and how can I optimize my code if I have no access to this critical 60%? (for "source file": unknown...).
Also, for my function I'd think I get time for each line, but it's not the case, e.g. arithmetics or some functions have no timing (not nested in unused "if" clauses).
AND last thing: how to find out that some line can execute superfast, but is called thousands times, being the actual bottleneck?
Finally, is Sleepy good? Or some free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of profiler, called plain Sleepy. It shows how many times some snippet was called plus the number of line (I guess it points to the critical one). So in my case.. KiFastSystemCallRet takes 50%! It means that it waits for some data right? How to improve that matter, is there maybe a decent approach to trace what causes these multiple calls and eventually remove/change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers
are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every methods' entrypoints and exitpoints, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second will make a 10% improvement, meaning you could improve the service to handle 110 users per second.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
Good Day,
Suppose that you have a simple for loop like below...
for(int i=0;i<10;i++)
{
//statement 1
//statement 2
}
Assume that statement 1 and statement 2 were O(1). Besides the small overhead of "starting" another loop, would breaking down that for loop into two (not nested, but sequential) loops be as equally fast? For example...
for(int i=0;i<10;i++)
{
//statement 1
}
for(int i=0;i<10;i++)
{
//statement 2
}
Why I ask such a silly question is that I have a Collision Detection System(CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS system so I can simply call
cds.update(objectlist);
instead of having to break my cds system up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it, what I really need to know is if I take a huge performance hit for looping through all my objects again.
It depends on your application.
Possible Drawbacks (of splitting):
your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop
Possible Gains (of splitting):
your loop contains many variables, splitting helps reducing register/stack pressure and the optimizer turns it into better machine code
the functions you use trash the L1 instruction cache so the cache is loaded on each iteration, while by splitting you manage into loading it once (only) at the first iteration of each loop
These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to take an educated/a wild guess when we know neither.
In doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed. Measure the time spent.
In terms of algorithmic complexity splitting the loops makes no difference.
In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.
As noted, the complexity remains.
But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:
Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes
(note: over all of them, there's the Damocles sword of misprediction; all are wikipedizable and googlable)
Especially the last factor makes it sometimes impossible to compile the one true code for code whose performance relies on specific cache sizes. Some applications will run faster on CPU with huge caches, while running slower on small caches, and for some other applications it will be the opposite.
Solutions:
Let your compiler do the job of loop transformation. Modern g++'s are quite good in that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.
It is always a good idea to endeavor for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, either microscopic or macroscopic, measure real world runs, then and only then optimize.
Literature:
* Agner Fog's Guides
* Intel's Guides
With two loops you will be paying for:
increased generated code size
2x as many branch predicts
depending what the data layout of statement 1 and 2 are you could be reloading data into cache.
The last point could have a huge impact in either direction. You should measure as with any perf optimization.
As far as the big-o complexity is concerned, this doesn't make a difference if 1 loop is O(n), then so is the 2 loop solution.
As far as micro-optimisation, it is hard to say. The cost of a loop is rather small, we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.
You're correct in noting that there will be some performance overhead by creating a second loop. Therefore, it cannot be "equally fast"; as this overhead, while small, is still overhead.
I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.
Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.
How to predict C++ program running time, if program executes different functions (working with database, reading files, parsing xml and others)? How installers do it?
They do not predict the time. They calculate the number of operations to be done on a total of operations.
You can predict the time by using measurement and estimation. Of course the quality of the predictions will differ. And BTW: The word "predict" is correct.
You split the workload into small tasks, and create an estimation rule for each task, e.g.: if copying files one to ten took 10s, then the remaining 90 files may take another 90s. Measure the time that these tasks take at runtime, and update your estimations.
Each new measurement will make the prediction a bit more precise.
There really is no way to do this in any sort of reliable way, since it depends on thousands of factors.
Progress bars typically measure this in one of two ways:
Overall progress - I have n number of bytes/files/whatever to transfer, and so far I have done m.
Overall work divided by current speed - I have n bytes to transfer, and so far I have done m and it took t seconds, so if things continue at this rate it will take u seconds to complete.
Short answer:
No you can't. For progress bars and such, most applications simply increase the bar length with a percentage based on the overall tasks done. Some psuedo-code:
for(int i=0; i<num_files_to_load; ++i){
files.push_back(File(filepath[i]));
SetProgressBarLength((float)i/((float)num_files_to_load) - 1.0f);
}
This is a very simplified example. Making a for-loop like this would surely block the window system's event/message queue. You would probably add a timed event or something similar instead.
Longer answer:
Given N known parameters, the problem finding whether a program completes at all is undecidable. This is called the Halting problem. You can however, find the time it takes to execute a single instruction. Some very old games actually depended on exact cycle timings, and failed to execute correctly on newer computers due to race conditions that occur because of subtle differences in runtime. Also, on architectures with data and instruction caches, the cycles the instructions consume is not constant anymore. So cache makes cycle-counting unpredictable.
Raymond Chen discussed this issue in his blog.
Why does the copy dialog give such
horrible estimates?
Because the copy dialog is just
guessing. It can't predict the future,
but it is forced to try. And at the
very beginning of the copy, when there
is very little history to go by, the
prediction can be really bad.
In general it is impossible to predict the running time of a program. It is even impossible to predict whether a program will even halt at all. This is undecidable.
http://en.wikipedia.org/wiki/Halting_problem
As others have said, you can't predict the time. Approaches suggested by Partial and rmn are valid solutions.
What you can do more is assign weights to certain operations (for instance, if you know a db call takes roughly twice as long as some processing step, you can adjust accordingly).
A cool installer compiler would execute a faux install, time each op, then save this to disk for the future.
I used such a technique for a 3D application once, which had a pretty dead-on progress bar for loading and mashing data, after you've run it a few times. It wasn't that hard, and it made development much nicer. (Since we had to see that bar 10-15 times/day, startup was 10-20 secs)
You can't predict it entirely.
What you can do is wait until a fraction of the work is done, say 1%, and estimate the remaining time by that - just time how long it takes for 1% and multiply by 100, for example. That is easily done if you can enumerate all that you have to do in advance, or have some kind of a loop going on..
As I mentioned in a previous answer, it is impossible in general to predict the running time.
However, empirically it may be possible to predict with good accuracy.
Typically all of these programs are approximatelyh linear in some input.
But if you wanted a more sophisticated approach, you could define a large number of features (database size, file size, OS, etc. etc.) and input those feature values + running time into a neural network. If you had millions of examples (obviously you would have an automated method for gathering data, e.g. some discovery programs) you might come up with a very flexible and intelligent prediction algorithm.
Of course this would only be worth doing for fun, as I'm sure the value to your company over some crude guessing algorithm will probably be nil :)
You should make estimation of time needed for different phases of the program. For example: reading files - 50, working with database - 30, working with network - 20. In ideal it would be good if you make some progress callback during all of those phases, but it requires coding the progress calculation into the iterations of algorithm.