In C++, I've written a mathematical program (for diffusion limited aggregation) where each new point calculated is dependent on all of the preceding points.
Is it possible to have such a program work in a parallel or distributed manner to increase computing speed?
If so, what type of modifications to the code would I need to look into?
EDIT: My source code is available at...
http://www.bitbucket.org/damigu78/brownian-motion/downloads/
filename is DLA_full3D.cpp
I don't mind significant re-writes if that's what it would take. After all, I want to learn how to do it.
If your algorithm is fundamentally sequential, you can't make it fundamentally parallel.
What is the algorithm you are using?
EDIT: Googling "diffusion limited aggregation algorithm parallel" led me here, with the following quote:
DLA, on the other hand, has been shown [9,10] to belong to the class of inherently sequential or, more formally, P-complete problems. Therefore, it is unlikely that DLA clusters can be sampled in parallel in polylog time when restricted to a number of processors polynomial in the system size.
So the answer to your question is "all signs point to no".
Probably. There are parallel versions of most sequential algorithms, and for those sequential algorithms which are not immediately parallelisable there are usually parallel substitutes. This looks like one of those cases where you need to consider parallelisation or parallelisability before you choose an algorithm. But unless you tell us a bit (a lot?) more about your algorithm we can't provide much specific guidance. If it amuses you to watch SOers argue in the absence of hard data, sit back and watch; but if you want answers, edit your question.
The toxiclibs website gives some useful insight into how one DLA implementation is done
There is cilk, which is an enhancement to the C language (unfortunately not C++ (yet)) that allows you to add some extra information to your code. With just a few minor hints, the compiler can automatically parallelize parts of your code, such as running multiple iterations of a for loop in parallel instead of in series.
Without knowing more about your problem, I'll just say that this looks like a good candidate to implement as a parallel prefix scan (http://en.wikipedia.org/wiki/Prefix_sum). The simplest example of this is an array that you want to make a running sum out of:
1 5 3 2 5 6
becomes
1 6 9 11 16 22
This looks inherently serial (as all the points depend on the ones previous), but it can be done in parallel.
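With a C++17 toolchain that implements the parallel algorithms, the running-sum example above can be written as a parallel scan directly; a minimal sketch, using the toy array from above:

    #include <execution>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v{1, 5, 3, 2, 5, 6};
        std::vector<int> out(v.size());

        // The parallel execution policy lets the library split the scan across
        // threads, even though every output element depends on all earlier inputs.
        std::inclusive_scan(std::execution::par, v.begin(), v.end(), out.begin());

        for (int x : out) std::cout << x << ' ';   // prints: 1 6 9 11 16 22
        std::cout << '\n';
    }

Whether DLA itself can be recast as a scan is a separate question; this only shows that depending on all previous elements does not by itself rule out parallelism.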
You mention that each step depends on the results of all preceding steps, which makes it hard to parallelize such a program.
I don't know which algorithm you are using, but you could use multithreading for a speedup. Each thread would process one step, but it must wait for results that haven't been calculated yet (though it can work with the already calculated results if they don't change over time). That essentially means you need a locking/waiting mechanism so that a worker thread can block until a result it needs becomes available.
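A minimal sketch of such a waiting mechanism, assuming the steps publish their results in order into a shared vector (all names here are illustrative, not taken from your code):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    std::vector<double> results;   // results[0..done-1] are valid
    std::size_t done = 0;          // number of steps already computed
    std::mutex m;
    std::condition_variable cv;

    // Called by a worker that needs the result of step i before it can go on.
    double wait_for_result(std::size_t i) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return done > i; });   // block until step i is published
        return results[i];
    }

    // Called by whichever worker finishes step i (steps published in order).
    void publish_result(std::size_t i, double value) {
        {
            std::lock_guard<std::mutex> lock(m);
            results.push_back(value);
            done = i + 1;
        }
        cv.notify_all();   // wake up anyone waiting on this step
    }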
Related
I'm asking with regard to answers on this question. In my answer I initially just took the time before and after the loops and printed out the difference, but as an update to #cigiens answer, it seems I benchmarked inaccurately by not warming up the code.
What is warming up of the code? I think what happened here is that the string was moved into the cache first, and that made the benchmarking results for the following loops close to each other. In my old answer, the first benchmarking result was slower than the others, since it took more time to move the string into the cache, I think. Am I correct? If not, what does warming up actually do to the code? Generally speaking, what else should I have done besides warming up to get more accurate results, i.e. how do I benchmark C++ code correctly (and C too, if it's the same)?
To give you an example of warm-up, I've recently benchmarked some NVIDIA CUDA kernel calls:
The execution speed seems to increase over time, probably for several reasons, such as the GPU frequency being variable (to save power and keep temperatures down).
Sometimes a slower call has an even worse impact on the next call, so the benchmark can be misleading.
If you need to feel safe about these points, I advise you to:
reserve all the dynamic memory (like vectors) first
run the same work several times in a for loop before taking a measurement
this implies initializing the input data (especially random data) only once before the loop and copying it each time inside the loop, to ensure that you do the same work every time
if you deal with complex objects with caches, I advise you to pack them in a struct and make an array of these structs (with the same construction or cloning technique), to ensure that the same work is done on the same starting data in each iteration
you can avoid the for loop and the copying IF you alternate two calls very often and can assume that the differences in behaviour will cancel each other out, for example in a simulation over continuous data like positions
concerning the measurement tools, I've always had problems with high_resolution_clock across different machines, such as inconsistent durations. The Windows QueryPerformanceCounter, on the other hand, works very well.
I hope that helps!
EDIT
I forgot to add that, as mentioned in the comments, compiler optimization behavior can indeed be annoying to deal with. The simplest way I've found is to accumulate into a variable something that depends on non-trivial operations on both the warm-up and the measured data, in order to force the computations to actually happen.
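Putting those points together, a rough benchmarking skeleton might look like the following; work(), the input size and the iteration counts are placeholders, and steady_clock stands in for whatever timer you trust on your platform:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    volatile long long sink = 0;           // keeps the results observable

    long long work(const std::vector<int>& input) {
        long long sum = 0;
        for (int x : input) sum += x * x;  // some non-trivial operation
        return sum;
    }

    int main() {
        std::vector<int> input(1 << 20, 3);    // prepared once, outside the loops

        for (int i = 0; i < 10; ++i)           // warm-up: same work, not measured
            sink = sink + work(input);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i)          // measured runs of the same work
            sink = sink + work(input);
        auto stop = std::chrono::steady_clock::now();

        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        std::printf("average per run: %lld ns\n", (long long)(ns / 100));
    }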
I have looked into gprof, but I don't quite understand how to achieve the following:
I have written a clustering procedure. In each iteration, 4 functions are called repeatedly. There are about 100,000 iterations to be done. I want to find out how much time was spent in each function.
These functions might call other sub-functions and may involve data structures like hash maps, maps, etc., but I don't care about those sub-functions. I just want to know how much total time was spent in those parent functions over all the iterations. This will help me optimize my program better.
The problem with gprof is that it analyzes every function, so even the functions of the STL data structures are taken into account.
Currently I am using clock_gettime. For each function, I output the time taken in each iteration, then post-process this output file. This requires a lot of profiling code, which makes my program look very complex, and I want to avoid that. How is this done in industry?
Is there an easier way to do this?
If you know of any other, cleaner ways, please let me know.
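For reference, this is roughly the kind of manual instrumentation being described, wrapped in a small RAII helper so it stays down to one line per function; the names are made up and this is only a sketch, not a replacement for a real profiler:

    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <string>

    std::map<std::string, double> totals;   // seconds accumulated per label

    struct ScopeTimer {
        std::string label;
        std::chrono::steady_clock::time_point start;
        explicit ScopeTimer(std::string l)
            : label(std::move(l)), start(std::chrono::steady_clock::now()) {}
        ~ScopeTimer() {
            auto stop = std::chrono::steady_clock::now();
            totals[label] += std::chrono::duration<double>(stop - start).count();
        }
    };

    void cluster_step() {                   // stands in for one of the 4 functions
        ScopeTimer t("cluster_step");       // one line of instrumentation
        // ... the real work, including calls into STL containers ...
    }

    int main() {
        for (int i = 0; i < 100000; ++i)
            cluster_step();
        for (const auto& kv : totals)
            std::printf("%s: %.3f s total\n", kv.first.c_str(), kv.second);
    }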
If I understand correctly, you're interested in how much time was spent in the four target functions you're interested in, but not any of the child functions called by those functions.
This information is provided in gprof's "flat" profile under "self seconds". Alternatively, if you're looking at the call graph, this timing is in the "self" column.
I'd take a look at Telemetry. It's mainly targeted at game developers who want to compare per-frame data, but it seems to fit your requirements very well.
You want the self-time of those 4 functions, so you can optimize them specifically.
gprof will show you that, as a % of total time.
Suppose it is 10%. If so, even if you were able to optimize it to 0%, you would get a speedup factor of 100/90 = 1.11, or a speedup of 11%.
If it took 100 seconds, and that was too slow, chances are 90 seconds is also too slow.
However, the inclusive (self plus callees) time taken by those functions is likely to be a much larger %, 80%, to pick a number. If so, you could optimize it much more by having it make fewer calls to those callees. Alternatively, you could find that the callees are spending a big % doing things that you don't strictly need done, such as testing their arguments for generality's sake, in which case you could replace them with ad-hoc routines.
In fact, strictly speaking, there is no such thing as self time. Even the simplest instruction where the program counter is found is actually a call to a microcode subroutine.
Here is some discussion of the issues and a constructive recommendation.
I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
The P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays) that could be as large as 100MB and contain data which is populated independently. The data that I am looking for could be in any of these P_ROWS, and right now I am searching through them one at a time: P_ROWS[0], then P_ROWS[1], and so on until I get a positive or reach the end (P_ROWS[n-1]). This implies that if n is 100 I have to do this search (bit comparison) 100 times (and this search is done very often). Somebody suggested that I could improve the search performance with bit grouping (use a column-major order on the row-major order array -- I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be :
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS] = {{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows -- along the third dimension -- have similarly structured data. What I want after the grouping is this: all the a1's in one group (as a single entity, so that I can compare them against another bit to check whether they are on or off), all the b1's in another group, and so on.
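One way to read that grouping (this is only a sketch of a possible layout, with placeholder sizes, not your actual code): store the table position-major instead of plane-major, so that for a fixed (row, column, bit) the corresponding bits of up to 64 planes sit in one machine word. A single load then tests that bit in 64 of the independent bit arrays at once:

    #include <cstddef>
    #include <cstdint>

    // Original layout: plane-major, so testing one bit position means touching
    // P_ROWS far-apart locations, one per plane:
    //     unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
    //
    // Grouped layout: position-major. grouped[row][col][bit][w] packs bit `bit`
    // of byte (row, col) for planes 64*w .. 64*w+63 into one 64-bit word.
    const std::size_t ROWS = 256, COLUMNS = 256, BITS = 8;  // placeholder sizes
    const std::size_t PLANE_WORDS = 2;                      // ceil(P_ROWS / 64), e.g. P_ROWS = 100

    static std::uint64_t grouped[ROWS][COLUMNS][BITS][PLANE_WORDS];

    // Returns a mask telling which planes (within word w) have this bit set;
    // one load replaces up to 64 separate per-plane lookups.
    inline std::uint64_t planes_with_bit(std::size_t row, std::size_t col,
                                         unsigned bit, std::size_t w) {
        return grouped[row][col][bit][w];
    }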
Re-use Other People's Algorithms
There are a ton of bit-calculation optimizations out there including many that are non-obvious, like Hamming Weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized and use computational magic that will have you scratching your head: in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
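As a trivial illustration of that flattening (the dimension names mirror the question; the traversal order you optimize for is up to you):

    #include <cstddef>
    #include <vector>

    struct BitTable {
        std::size_t p_rows, rows, columns;
        std::vector<unsigned char> data;        // one flat, contiguous allocation

        BitTable(std::size_t p, std::size_t r, std::size_t c)
            : p_rows(p), rows(r), columns(c), data(p * r * c, 0) {}

        // Row-major index: entries that differ only in `col` are adjacent,
        // so scanning along a row of one plane walks memory sequentially.
        unsigned char& at(std::size_t p, std::size_t row, std::size_t col) {
            return data[(p * rows + row) * columns + col];
        }
    };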
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
Is there any sort of performance difference between the arithmetic operators in c++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of the operators -- for example, whether any of them run in time that is linear in the size of their inputs.
About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, the cost of adding or multiplying numbers grows at least linearly with the number of bits in the operands; but because the hardware operates on a fixed number of bits, each operation is effectively O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do. If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: which assembly instructions do the basic operations get transformed into, and what is the performance of those instructions on my specific architecture? And this is also your answer: the code they get translated to depends on your compiler and its knowledge of your architecture, and their performance depends on your architecture.
Mind you: in C++, operators can be overloaded for user-defined types. They can behave differently from built-in types, and the implementation of the overload can be non-trivial (not just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for GCC is -S. If you use some other compiler, have a look at its documentation.
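For example, with GCC you could put the two forms side by side in a small file and compile it with g++ -O2 -S; the file name and function names below are just for illustration:

    // increment.cpp -- compile with: g++ -O2 -S increment.cpp
    // then inspect increment.s; with optimizations on, both functions
    // typically compile to the exact same instruction sequence.

    int pre_increment(int i)   { ++i;    return i; }
    int plus_equals_one(int i) { i += 1; return i; }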
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and when necessary by examining the individual instructions. (When timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account: if the data is already in the CPU cache, it'll run much faster; if it has to be read from physical memory first, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind exactly what you want to measure.)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing. Sometimes it may make a difference whether you're adding a constant or a variable, but in this case you're adding the constant one in both cases.
And in general, instructions take a fixed, constant time regardless of their operands. Adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication and division.
But if you care about performance, you need to understand how the entire technology stack works, rather than relying on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=". (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case.)
Here is a twist on your evaluations: try loop unrolling. Loop unrolling is where you repeat the loop body several times within one iteration, to reduce the total number of iterations of the loop.
Most modern processors hate branch instructions. They keep a queue of pre-fetched instructions, which speeds up processing, and a branch forces the processor to clear out that queue and reload it. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which occur in loop constructs and decision constructs.
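A hand-unrolled version of a simple summation loop, just to illustrate the idea (note that compilers will often do this themselves at higher optimization levels, and that reassociating floating-point sums can change rounding slightly):

    #include <cstddef>

    // Plain loop: one branch per element.
    double sum_simple(const double* a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    // Unrolled by four: one branch per four elements, and four independent
    // partial sums give the processor more work it can overlap.
    double sum_unrolled(const double* a, std::size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; ++i)              // leftover elements
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }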
It depends on the architecture. The built-in operators for integer arithmetic translate directly to assembly (as I understand it): ++, += 1, and += 10000 are probably equally fast; multiplication depends on the platform; overloaded operators depend on you.
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Unless you are writing extremely time-critical software, you should probably worry about other things.
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...
The performance problems attributed to C++ operators do not come from the operators themselves, nor from their implementations. They come from the syntax: from hidden code being run without you knowing it.
The best example is implementing quicksort on an object that has operator[] implemented but internally uses a linked list. Instead of O(n log n) [1] you will get O(n^2 log n).
The problem with performance is that you cannot know exactly what your code will eventually become.
[1] I know that quicksort is actually O(n^2) in the worst case, but it rarely gets there; the average case gives you O(n log n).
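A contrived sketch of that hidden cost: an operator[] that looks like a cheap array access at the call site but walks a linked list internally (the class is made up purely for illustration):

    #include <cstddef>
    #include <iterator>
    #include <list>

    class Sequence {
        std::list<int> items_;               // linked list under the hood
    public:
        void push_back(int v) { items_.push_back(v); }
        std::size_t size() const { return items_.size(); }

        // Reads like O(1) indexing, but walks the list from the front,
        // so every a[i] inside a sorting routine silently costs O(i).
        int& operator[](std::size_t i) {
            auto it = items_.begin();
            std::advance(it, i);
            return *it;
        }
    };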
Out-of-order execution in CPUs means that a CPU can reorder instructions to gain better performance, and it means the CPU has to do some very nifty bookkeeping to make that work. There are other processor approaches too, such as hyper-threading.
Some fancy compilers understand the (un)interrelatedness of instructions to a limited extent, and will automatically interleave instruction flows (probably over a longer window than the CPU sees) to better utilise the processor. Deliberate compile-time interleaving of floating and integer instructions is another example of this.
Now, I have a highly parallel task, and I typically have an ageing single-core x86 processor without hyper-threading.
Is there a straightforward way to get the body of my 'for' loop for this highly parallel task interleaved so that two (or more) iterations are being done together? (This is slightly different from 'loop unwinding' as I understand it.)
My task is a 'virtual machine' running through a set of instructions, which I'll really simplify for illustration as:
void run(int num) {
    for(int n=0; n<num; n++) {
        vm_t data(n);
        for(int i=0; i<data.len(); i++) {
            data.insn(i).parse();
            data.insn(i).eval();
        }
    }
}
So the execution trail might look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, what I'd like is to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
I know, from profiling, (e.g. using Callgrind with --simulate-cache=yes), that parsing is about random memory accesses (cache missing) and eval is about doing ops in registers and then writing results back. Each step is several thousand instructions long. So if I can intermingle the two steps for two iterations at once, the processor will hopefully have something to do whilst the cache misses of the parse step are occurring...
Is there some c++ template madness to get this kind of explicit parallelism generated?
Of course I can do the interleaving - and even staggering - myself in code, but it makes for much less readable code. And if I really want unreadable, I can go so far as assembler! But surely there is some pattern for this kind of thing?
Given optimizing compilers and pipelined processors, I would suggest you just write clear, readable code.
Your best plan may be to look into OpenMP. It basically allows you to insert "pragmas" into your code which tell the compiler how it can split between processors.
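A minimal sketch of that applied to the loop from the question, assuming the iterations over n really are independent (this reuses the simplified vm_t from above, so it is a fragment rather than a complete program; compile with -fopenmp on GCC):

    void run(int num) {
        // Each n gets its own vm_t, so the outer iterations can be handed out
        // to different threads; the pragma is ignored if OpenMP is disabled.
        #pragma omp parallel for schedule(dynamic)
        for (int n = 0; n < num; n++) {
            vm_t data(n);
            for (int i = 0; i < data.len(); i++) {
                data.insn(i).parse();
                data.insn(i).eval();
            }
        }
    }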
Hyperthreading is a much higher-level system than instruction reordering. It makes the processor look like two processors to the operating system, so you'd need to use an actual threading library to take advantage of that. The same thing naturally applies to multicore processors.
If you don't want to use low-level threading libraries and instead want to use a task-based parallel system (and it sounds like that's what you're after) I'd suggest looking at OpenMP or Intel's Threading Building Blocks.
TBB is a library, so it can be used with any modern C++ compiler. OpenMP is a set of compiler extensions, so you need a compiler that supports it. GCC/G++ has supported it since version 4.2. Recent versions of the Intel and Microsoft compilers also support it. I don't know about any others, though.
EDIT: One other note. Using a system like TBB or OpenMP will scale the processing as much as possible - that is, if you have 100 objects to work on, they'll get split about 50/50 in a two-core system, 25/25/25/25 in a four-core system, etc.
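For comparison, the TBB version of the same loop looks roughly like this (again a fragment reusing the question's vm_t, and again assuming independent iterations):

    #include <tbb/parallel_for.h>

    void run(int num) {
        // The lambda is invoked for each n, possibly on different worker threads;
        // TBB decides how to chunk and schedule the range.
        tbb::parallel_for(0, num, [](int n) {
            vm_t data(n);
            for (int i = 0; i < data.len(); i++) {
                data.insn(i).parse();
                data.insn(i).eval();
            }
        });
    }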
Modern processors like the Core 2 have an enormous instruction reorder buffer on the order of nearly 100 instructions; even if the compiler is rather dumb the CPU can still make up for it.
The main issue would be if the code used a lot of registers, in which case the register pressure could force the code to be executed in sequence even if theoretically it could be done in parallel.
There is no support for parallel execution in the current C++ standard. This will change for the next version of the standard, due out next year or so.
However, I don't see what you are trying to accomplish. Are you referring to one single-core processor, or multiple processors or cores? If you have only one core, you should do whatever gets the fewest cache misses, which means whatever approach uses the smallest memory working set. This would probably be either doing all the parsing followed by all the evaluation, or doing the parsing and evaluation alternately.
If you have two cores, and want to use them efficiently, you're going to have to either use a particularly smart compiler or language extensions. Is there one particular operating system you're developing for, or should this be for multiple systems?
It sounds like you ran into the same problem chip designers face: Executing a single instruction takes a lot of effort, but it involves a bunch of different steps that can be strung together in an execution pipeline. (It is easier to execute things in parallel when you can build them out of separate blocks of hardware.)
The most obvious way is to split each task into different threads. You might want to create a single thread to execute each instruction to completion, or create one thread for each of your two execution steps and pass data between them. In either case, you'll have to be very careful with how you share data between threads and make sure to handle the case where one instruction affects the result of the following instruction. Even though you only have one core and only one thread can be running at any given time, your operating system should be able to schedule compute-intense threads while other threads are waiting for their cache misses.
(A few hours of your time would probably pay for a single very fast computer, but if you're trying to deploy it widely on cheap hardware it might make sense to consider the problem the way you're looking at it. Regardless, it's an interesting problem to consider.)
Take a look at cilk. It's an extension to ANSI C that has some nice constructs for writing parallelized code in C. However, since it's an extension of C, it has very limited compiler support, and can be tricky to work with.
This answer was written assuming the question did not contain the part "And I typically have an ageing single-core x86 processor without hyper-threading." I hope it might help other people who want to parallelize highly parallel tasks but target dual-core/multicore CPUs.
As already posted in another answer, OpenMP is a portable way to do this. However, my experience is that OpenMP's overhead is quite high and it is very easy to beat it by rolling a DIY (Do It Yourself) implementation. Hopefully OpenMP will improve over time, but as it is now, I would not recommend using it for anything other than prototyping.
Given the nature of your task, what you want to do is most likely data-based parallelism, which in my experience is quite easy: the programming style can be very similar to single-core code, because you know what the other threads are doing, which makes maintaining thread safety a lot easier. An approach which worked for me: avoid dependencies and call only thread-safe functions from the loop.
To create a DIY OpenMP-style parallel loop you need to (a rough sketch follows after this list):
as preparation, create a serial for loop template and change your code to use functors to implement the loop bodies. This can be tedious, as you need to pass all references through the functor object
create a virtual JobItem interface for the functor, and derive your functors from this interface
create a thread function which is able to process individual JobItem objects
create a thread pool of threads running this thread function
experiment with various synchronization primitives to see which works best for you. While a semaphore is very easy to use, its overhead is quite significant, and if your loop body is very short, you do not want to pay this overhead for each loop iteration. What worked great for me was a combination of a manual-reset event and an atomic (interlocked) counter as a much faster alternative
experiment with various JobItem scheduling strategies. If your loop is long enough, it is better if each thread picks up multiple successive JobItems at a time. This reduces the synchronization overhead and at the same time makes the threads more cache friendly. You may also want to do this in some dynamic way, reducing the length of the scheduled sequence as you exhaust your tasks, or letting individual threads steal items from other threads' schedules
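A stripped-down sketch of those pieces, using standard C++ threads and an atomic counter as the scheduler (no semaphores or events; the names are illustrative, error handling is omitted, and a real implementation would keep the pool alive between loops):

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // The "JobItem" interface: one loop body per item.
    struct JobItem {
        virtual void run() = 0;
        virtual ~JobItem() {}
    };

    // Each worker repeatedly claims the next unprocessed index; claiming a chunk
    // of several indices at a time would reduce contention further.
    void worker(std::vector<JobItem*>& jobs, std::atomic<std::size_t>& next) {
        for (;;) {
            std::size_t i = next.fetch_add(1);
            if (i >= jobs.size())
                return;
            jobs[i]->run();
        }
    }

    void run_parallel(std::vector<JobItem*>& jobs, unsigned nthreads) {
        std::atomic<std::size_t> next(0);
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back(worker, std::ref(jobs), std::ref(next));
        for (auto& th : pool)
            th.join();                   // every item has run once all workers exit
    }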