Related
What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads so can scale linearly with how many cores you have.
Now for the question of real cores vs. logical processors (Hyperthreading, SMT) you might find https://superuser.com/a/279803/112292 interesting. There is also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency. So the speedup of SMT you can gain is purely dependent on your application (is it compute heavy, is it memory heavy?) and the only way to find is benchmark.
There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world. Using dozens of CPU cores is trivial with STD: thread.
Straight to the facts.
My Neural network is a classic feedforward backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict next values basing on historical data.
This dataset is about 10MB large therefore training it on one core takes ages. I want to go multicore with the training, but i can't understand what happens with the training data for each core, and what exactly happens after cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes forward and backward propagations' - this means, each thread just trains itself with it's part of the dataset, right? How many iterations of the training per core ?
'At the en dof each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to neural network' - What exactly does that mean? When cores finish training with their datasets, wha does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
On this background, consider the following heuristic approach: Split the data in parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separated of the others (I would use some common parameters for the learning rate, etc.). You end up with a number of http://www.texify.com/img/%5Cnormalsize%5C%21N_%7B%5Ctext%7B%7D%7D.gif
trained networks http://www.texify.com/img/%5Cnormalsize%5C%21f_i%28x%29.gif.
Next, you need a scheme to combine the results. Choose http://www.texify.com/img/%5Cnormalsize%5C%21F%28x%29%3D%5Csum_%7Bi%3D1%7D%5EN%5C%2C%20%5Calpha_i%20f_i%28x%29.gif, then use least squares to adapt the parameters http://www.texify.com/img/%5Cnormalsize%5C%21%5Calpha_i.gif such that http://www.texify.com/img/%5Cnormalsize%5C%21%5Csum_%7Bj%3D1%7D%5EM%20%5C%2C%20%5Cbig%28F%28x_j%29%20-%20y_j%5Cbig%29%5E2.gif is minimized. This involves a singular value decomposition which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similiarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
Moreover, see these answers here.
Regarding your questions:
As Kris noted it will usually be one iteration. However, in general it can be also a small number chosen by you. I would play around with choices roughly in between 1 and 20 here. Note that the above suggestion uses infinity, so to say, but then replaces the recombination step by something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algoithm). Remember, what you aim for is a single trained network in the end, and one uses the splitted data for estimation of this.
To collect, often one does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
This is a form of iterative optimization. Variations of this algorithm:
Instead of using always the same split, use random splits at each iteration step (... or at each n-th iteration). Or, in the spirit of random forests, only use a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.
I have a question concerning runtime measurements in parallel programs (I used C++ but I think the question is more general).
Some short explanations: 3 threads are running parallel (pthread), solving the same problem in different ways. Each thread may pass information to the other thread (e.g. partial solutions obtained by the one thread but not by the other, yet) for speeding up the other threads, depending on his own status / available information in his own calculation. The whole process stops as soon as the first thread is ready.
Now I would like to have a unique time measurement for evaluating the runtime from start until the problem is solved. ( In the end, I want to determine if using synergy effects through a parallel calculation is faster then calculation on a single thread).
In my eyes, the problem is, that (because of the operating system pausing / unpausing the single threads), the point when information is passed in the process is not deterministic in each process' state. That means, a certain information is acquired after xxx units of cpu time on thread 1, but it can not be controlled, whether thread 2 receives this information after yyy or zzz units of cpu time spent in its calculations. Assumed that this information would have finished thread 2's calculation in any case, the runtime of thread 2 was either yyy or zzz, depending on the operating system's action.
What can I do for obtaining a deterministic behaviour for runtime comparisons? Can I order the operation system to run each thread "undisturbed" (on a multicore machine)? Is there something I can do on implementation (c++) - basis?
Or are there other concepts for evaluating runtime (time gain) of such implementations?
Best regards
Martin
Any time someone uses the terms 'deterministic' and 'multicore' in the same sentence, it sets alarm bells ringing :-)
There are two big sources of non-determinism in your program: 1) the operating system, which adds noise to thread timings through OS jitter and scheduling decisions; and 2) the algorithm, because the program follows a different path depending on the order in which communication (of the partial solutions) occurs.
As a programmer, there's not much you can do about OS noise. A standard OS adds a lot of noise even for a program running on a dedicated (quiescent) node. Special purpose operating systems for compute nodes go some way to reducing this noise, for example Blue Gene systems exhibit significantly less OS noise and therefore less variation in timings.
Regarding the algorithm, you can introduce determinism to your program by adding synchronisation. If two threads synchronise, for example to exchange partial solutions, then the ordering of the computation before and after the synchronisation is deterministic. Your current code is asynchronous, as one thread 'sends' a partial solution but does not wait for it to be 'received'. You could convert this to a deterministic code by dividing the computation into steps and synchronising between threads after each step. For example, for each thread:
Compute one step
Record partial solution (if any)
Barrier - wait for all other threads
Read partial solutions from other threads
Repeat 1-4
Of course, we would not expect this code to perform as well, because now each thread has to wait for all the other threads to complete their computation before proceeding to the next step.
The best approach is probably to just accept the non-determinism, and use statistical methods to compare your timings. Run the program many times for a given number of threads and record the range, mean and standard deviation of the timings. It may be enough for you to know e.g. the maximum computation time across all runs for a given number of threads, or you may need a statistical test such as Student's t-test to answer more complicated questions like 'how certain is it that increasing from 4 to 8 threads reduces the runtime?'. As DanielKO says, the fluctuations in timings are what will actually be experienced by a user, so it makes sense to measure these and quantify them statistically, rather than aiming to eliminate them altogether.
What's the use of such a measurement?
Suppose you can, by some contrived method, set up the OS scheduler in a way that the threads run undisturbed (even by indirect events such as other processes using caches, MMU, etc), will that be realistic for the actual usage of the parallel program?
It's pretty rare for a modern OS to let an application take control over general interrupts handling, memory management, thread scheduling, etc. Unless you are talking directly to the metal, your deterministic measurements will not only be impractical, but the users of your program will never experience them (unless they are equally close to the metal as when you did the measurements.)
So my question is, why do you need such strict conditions for measuring your program? In the general case, just accept the fluctuations, as that is what the users will most likely see. If the speed up of a certain algorithm/implementation is so insignificant as to be indistinguishable from the background noise, that's more useful information to me than knowing the actual speedup fraction.
I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, it takes maybe only 0.1s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
1.Tasks Pool
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
2.Separate Tasks
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just for example, in practice, it may be 1e8 or more tasks and 0.1 per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know kinds of options.
So one midway between the two approaches is to break into say 100 batches of 100 tasks each and let the a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for taskpools. There are lockfree queues available if you ever determined them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.
Both ways suggested in the question will perform well and similarly to each another (in simple cases with predictable and relatively long duration of the tasks). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudice yourself as to the optimal number of threads matching the number of the cores. If this is a regular server or desktop system, there will be various system processes kicking in here and then and you may see your 12 threads variously floating between processors which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to be grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even
a small speedup (the additional core will serve OS processes, so that
thread affinity of the rest will become more stable compared to 12
threads). Any concurrent interactive activity on the system will see
a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor
affinity mask to each, to impose a 1-1 mapping between threads and processors.
This is good in the simplest near-academical case
where there are no resources other than CPU and shared memory
involved; you will see no chronic migration of threads across
processes. The drawback is an
algorithm closely coupled to a particular machine; on another machine
it could behave so poorly as to finish never at all (because of an
unrelated real time task that
blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread
downgrade its own priority once it is past 40% and again once it is
past 80% of its load. This will improve load balancing inside your
process, but it will behave poorly if your application is competing
with other CPU-bound processes.
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
OTOH..
1E8 tasks # 0.1s/task = 10,000,000 seconds
= 2777.7r hours
= 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each working thread may have its own small task queue with the capacity of no more than one or two memory pages. When the queue size becomes low (a half of capacity) it should send a signal to some manager thread to populate it with more tasks. If queue is organized in batches then working threads do not need to enter critical sections as long as current batch is not empty. Avoiding critical sections will give you extra cycles for actual job. Two batches per queue are enough, and in this case one batch can take one memory page, and so queue takes two.
The point of memory pages is that thread does not have to jump all over the memory to fetch data. If all data are in one place (one memory page) you avoid cache misses.
What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8, 16 or more cores machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores work well with the SMP (symmetrical multiprocessing) model, while a higher number of core need a NUMA (non-uniform memory access) model.
I do neither think that there is a real way to do this, but one thing which comes to my mind is that you could use a virtual machine to simulate more cores. In VirtualBox for example you can select up to 16 cores out of the standard menu, but I am very confident that there are some hacks, which can make more of that and other VirtualMachines like VMware might even support more out of the Box.
bamboon and and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true there's a lot of variables at work in any program -- however if you have some basic input size n, it's possible to get some semblance of scaling. On a n-body simulator I developed a few years back, I varied the threads for fixed size and the input size per thread and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to largely threaded loads. But this may not be an issue if your application is only used on machines where there are small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case
I don't believe this is possible since there are too many variables to be able to accurately extrapolate performace. Even assuming you are 100% parallel. There are other factors like bus speed and cache misses that might limit your performance, not to mention periferal performace. How all of these factors affect your code can only be done though measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? So for six threads, running at say 50% utilization each, means you have 3 equivalent processors running. Dividing that by say four processors, means that your methods are achieving 75% utilization. Comparing that utilization, against the clock-time of actual speedup, tells you how much of your utilization is new overhead, and how much is real speed up. Isn't that what you are really interested in?
The processor utilization can be computed in real-time a couple different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to nonblocking machine cycles, for computing utilization. A real-time multithreading instrumentation package I developed uses such methods and they work well. The cpu clock counter in newer cpus reads on the inside of 20 machine cycles.