I’m conducting a molecular dynamics simulation and have been struggling for quite a while to parallelize it. Although I succeeded in fully loading my 4-thread processor, the computation time in parallel is greater than the computation time in serial mode.
Studying at which point of time each thread starts and finishes its loop iteration, I’ve noticed a pattern: it’s as if different threads are waiting for each other.
It was then that I turned my attention to the structure of my program. I have a class, an instance of which represents my system of particles, containing all the information about the particles and some functions that use this information. I also have a class, an instance of which represents my interatomic potential, containing the parameters of the potential function along with some functions (one of those functions calculates the force between two given particles).
And so in my program there exist instances of two different classes, and they interact with each other: some functions of one class take references to instances of another class.
And the block I’m trying to implement in parallel looks like this:
void Run_simulation(Class_system &system, Class_potential &potential, some other arguments){
#pragma omp parallel for
for(…)
}
for(...) is the actual computation, using data from the system instance of the Class_system class and some functions from the potential instance of the Class_potential class.
Am I right that it’s this structure that’s the source of my troubles?
Could you suggest what has to be done in this case? Must I rewrite my program in a completely different manner? Should I use a different tool to parallelize my program?
Without further details on your simulation type I can only speculate, so here are my speculations.
Did you look into the issue of load balancing? I guess the loop distributes the particles among threads but if you have some kind of a restricted range potential, then the computational time might differ from particle to particle in the different regions of the simulation volume, depending on the spatial density. This is a very common problem in molecular dynamics and one that is very hard to solve properly in distributed memory (MPI in most cases) codes. Fortunately with OpenMP you get direct access to all particles at each computing element and so the load balancing is much easier to achieve. It is not only easier, but it is also built-in, so to speak - simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value might vary from simulation to simulation. You might make chunk part of the input data to the program or you might instead write schedule(runtime) and then play with different scheduling classes by setting the OMP_SCHEDULE environment variable to values like "static", "dynamic,1", "dynamic,10", "guided", etc.
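As a minimal sketch of the scheduling advice above: the loop below uses schedule(runtime), so the scheduling class can be changed through OMP_SCHEDULE without recompiling. The functions particle_work and evaluate_all and the neighbour_count array are hypothetical stand-ins for a real force loop; they are not taken from the question.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a per-particle force evaluation (assumption): the cost
// grows with the number of neighbours, so iterations are unevenly expensive.
double particle_work(int neighbours) {
    double acc = 0.0;
    for (int k = 1; k <= neighbours; ++k) acc += 1.0 / k;
    return acc;
}

std::vector<double> evaluate_all(const std::vector<int>& neighbour_count) {
    const long n = static_cast<long>(neighbour_count.size());
    std::vector<double> out(neighbour_count.size());
    // schedule(runtime) defers the choice to the OMP_SCHEDULE environment
    // variable, e.g. "static", "dynamic,10" or "guided".
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < n; ++i)
        out[i] = particle_work(neighbour_count[i]);
    return out;
}
```

Each iteration writes only its own out[i], so the result does not depend on the schedule. Once a good value is known, schedule(runtime) can be replaced by schedule(dynamic, chunk) with that chunk size.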
Another possible source of performance degradation is false sharing and true sharing. False sharing occurs when your data structure is not suitable for concurrent modification. For example, if you keep 3D positional and velocity information for each particle (let's say you use a velocity Verlet integrator), given IEEE 754 double precision, each coordinate/velocity triplet takes 24 bytes. This means that a single cache line of 64 bytes accommodates 2 complete triplets and 2/3 of another one. The consequence of this is that no matter how you distribute the particles among the threads, there would always be at least two threads that would have to share a cache line. Suppose that those threads run on different physical cores. If one thread writes to its copy of the cache line (for example, it updates the position of a particle), the cache coherency protocol gets involved and invalidates the cache line in the other core, which then has to reread it from a slower cache or even from main memory. When the second thread updates its particle, this invalidates the cache line in the first core. The solution to this problem comes with proper padding and proper chunk size choice so that no two threads share a single cache line. For example, if you add an artificial 4th dimension (you can use it to store the potential energy of the particle in the 4th element of the position vector and the kinetic energy in the 4th element of the velocity vector), then each position/velocity quadruplet takes 32 bytes and information for exactly two particles fits in a single cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.
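The padding idea could look roughly like this. The names Vec4, kCacheLine and padded_chunk are made up for illustration, and the sketch assumes 64-byte cache lines, 8-byte doubles and an array whose first element sits on a cache line boundary.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Position or velocity with a fourth slot used as padding; the slot could
// hold the potential or kinetic energy, as suggested in the text.
struct Vec4 {
    double x, y, z;
    double w;
};
static_assert(sizeof(Vec4) == 32, "expected 4 packed 8-byte doubles");

constexpr std::size_t kPerLine = kCacheLine / sizeof(Vec4);  // 2 entries per line

// Chunk size for a static schedule, rounded up to a whole number of cache
// lines so that no two threads write to the same line.
std::size_t padded_chunk(std::size_t n_particles, std::size_t n_threads) {
    const std::size_t chunk = (n_particles + n_threads - 1) / n_threads;
    return (chunk + kPerLine - 1) / kPerLine * kPerLine;
}
```

The returned value can then be passed as the chunk size in schedule(static, chunk).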
True sharing occurs when threads concurrently access the same data structure and there is an overlap between the parts of the structure modified by the different threads. In molecular dynamics simulations this occurs very frequently, as we want to exploit Newton's third law in order to cut the computational time in two when dealing with pairwise interaction potentials. When one thread computes the force acting on particle i while enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j, so that contribution can be added to the total force on j. But j might belong to another thread that might be modifying it at the same time, so atomic operations have to be used for both updates (both, since another thread might update i if it happens to neighbour one or more of its own particles). Atomic updates on x86 are implemented with locked instructions. This is not as horribly slow as often presented, but still slower than a regular update. It also includes the same cache line invalidation effect as with false sharing. To get around this, at the expense of increased memory usage, one could use local arrays to store partial force contributions and then perform a reduction at the end. The reduction itself has to be performed either in serial or in parallel with locked instructions, so it might turn out that not only is there no gain from using this approach, but it could even be slower. Proper particle sorting and clever distribution between the processing elements, so as to minimise the interface regions, can be used to tackle this problem.
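One way to sketch the "local arrays plus reduction" variant is an OpenMP array-section reduction (available since OpenMP 4.5), which gives every thread a private copy of the force array and sums the copies when the loop ends. The 1D linear-spring force and the function name total_forces are assumptions chosen to keep the example short; a real potential would replace the line computing fij.

```cpp
#include <cstddef>
#include <vector>

// Toy 1D pair force (assumption): f_ij = x[j] - x[i], visited once per pair.
std::vector<double> total_forces(const std::vector<double>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<double> force(x.size(), 0.0);
    if (n == 0) return force;  // zero-length array sections are not allowed
    double* f = force.data();
    const double* pos = x.data();
    // Every thread accumulates into its own private copy of f[0..n);
    // the copies are added together after the loop.
    #pragma omp parallel for schedule(dynamic) reduction(+ : f[:n])
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            const double fij = pos[j] - pos[i];  // force on i due to j
            f[i] += fij;
            f[j] -= fij;                         // Newton's third law
        }
    }
    return force;
}
```

The private copies cost memory proportional to the number of threads times the number of particles, which is the trade-off described above.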
One more thing that I would like to touch is the memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating point operations performed at each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches and if it happens that your data does not quite fit in the CPU cache, then it might happen that the memory bus is unable to deliver enough data to feed so many threads executing on a single socket. Your Core i3-2370M has only 3 MiB of L3 cache so if you explicitly keep the position, velocity and force for each particle, you can only store about 43000 particles in the L3 cache and about 3600 particles in the L2 cache (or about 1800 particles per hyperthread).
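The particle counts above follow from simple arithmetic, reproduced here as a sketch. It assumes 8-byte doubles and three 3-component vectors (position, velocity, force) per particle; the 256 KiB L2 size is an assumption inferred from the figure of about 3600 particles, not something stated in the text.

```cpp
#include <cstddef>

static_assert(sizeof(double) == 8, "sketch assumes 8-byte doubles");

// position + velocity + force, three doubles each
constexpr std::size_t kBytesPerParticle = 3 * 3 * sizeof(double);  // 72 bytes

constexpr std::size_t particles_that_fit(std::size_t cache_bytes) {
    return cache_bytes / kBytesPerParticle;
}

constexpr std::size_t kL3Bytes = 3 * 1024 * 1024;  // 3 MiB L3, as in the text
constexpr std::size_t kL2Bytes = 256 * 1024;       // assumed 256 KiB L2 per core
```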
The last one is hyperthreading. As High Performance Mark has already noted, hyperthreads share a great deal of core machinery. For example, there is only one AVX vector FPU engine that is shared between both hyperthreads. If your code is not vectorised, you lose a great deal of the computing power available in your processor. If your code is vectorised, then both hyperthreads will get in each other's way as they fight for control over the AVX engine. Hyperthreading is useful only when it is able to hide memory latency by overlapping computation (in one hyperthread) with memory loads (in another hyperthread). With dense numerical codes that perform many register operations before they perform a memory load/store, hyperthreading gives no benefit whatsoever, and you'd be better off running with half the number of threads and explicitly binding them to different cores so as to prevent the OS scheduler from running them as hyperthreads. The scheduler on Windows is particularly dumb in this respect, see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled via environment variables. GNU's OpenMP implementation does too. I am not aware of any way to control thread binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.
I'm optimizing a solver (systems of linear equations) whose most critical part consists of
Many (1000+) short (~10-1000 Microseconds) massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU cache size) arrays, pseudocode:
for(i=0;i<num_iter;i++)
{
// SYNC-POINT
parallel_for(j=0;j<array_size;j++)
array_out[j] = some_function( array_in[j] )
swap( array_in, array_out );
}
Unfortunately, the standard parallelization constructs provided by OMP or TBB that I have tried so far (serial outer loop plus parallel inner loop, e.g. via tbb::parallel_for) don't seem to handle this extremely fine-grained parallelism very well, because the thread libraries' setup cost for the inner loop seems to dominate the time spent within the relatively short inner loop. (Note that very fine-grained inner loops are crucial for the total performance of the algorithm, because this way all data can be kept in the L2/L3 CPU cache.)
EDIT to address answers,questions & comments so far:
The example is just pseudocode to illustrate the idea. The actual implementation takes care of false sharing by padding array lines to the CPU cache-line size.
some_func(array_in, j) is a simple stencil that accesses the current point j and a small neighborhood around it, e.g. some_func( array, j ) = array[j-1]+array[j]+array[j+1];
The answer given by Jérôme Richard touches on a very interesting point about barriers (here is IMO the root of the problem). It is suggested to "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key". Interesting, but very general. How exactly would this be accomplished in this case?
Does "point-to-point-neighbor synchronization" involve an atomic primitive for every entry of the array ?
The general solution to this problem is to create the threads and distribute the work only once, and then use fast synchronization point in the threads. In this case, the outer loop is moved in the threaded function. This is possible with the TBB library by providing a range (tbb::blocked_range<size_t> ) and a function to tbb::parallel_for (see an example here).
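The paragraph above mentions TBB; the same idea is sketched here with OpenMP instead, because it fits in a few lines. The parallel region is opened once, the outer loop runs inside it, and the implicit barrier at the end of each omp for is the only synchronization per sweep. The function name sweep_persistent and the zero boundary values are assumptions made for the example.

```cpp
#include <utility>
#include <vector>

// Runs num_iter sweeps of out[j] = in[j-1] + in[j] + in[j+1]
// (zero outside the array) inside a single parallel region.
std::vector<double> sweep_persistent(std::vector<double> a, int num_iter) {
    std::vector<double> b(a.size());
    const long n = static_cast<long>(a.size());
    #pragma omp parallel
    {
        // Each thread keeps its own copies of the two buffer pointers.
        double* in = a.data();
        double* out = b.data();
        for (int it = 0; it < num_iter; ++it) {
            #pragma omp for schedule(static)
            for (long j = 0; j < n; ++j) {
                const double left = j > 0 ? in[j - 1] : 0.0;
                const double right = j + 1 < n ? in[j + 1] : 0.0;
                out[j] = left + in[j] + right;
            }
            // The implicit barrier of "omp for" ends the sweep; every
            // thread then swaps its private pointers the same way.
            std::swap(in, out);
        }
    }
    return (num_iter % 2 == 0) ? a : b;
}
```

With schedule(static) each thread keeps the same index range in every sweep, so its part of the arrays tends to stay in that core's cache.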
Barriers are known to scale poorly on many-core architectures, especially in such codes. The only way to make the program scale is to reduce the synchronization between threads so as to make it more local. For example, for stencils, the solution is to replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key to solving such problems. In fact, note that the fundamental laws of physics prevent barriers from scaling, because clocks cannot be fully synchronized in general relativity and computers (unfortunately) obey the laws of physics.
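A point-to-point scheme does not need one atomic per array entry; one counter per chunk is enough. The sketch below is one possible arrangement, not the only one: each chunk publishes how many sweeps it has completed, and a chunk starts sweep k only once both neighbouring chunks have completed sweep k-1. With two buffers, that single condition covers both reading the neighbours' halo values and overwriting values the neighbours were still reading. All names (sweep_p2p, done, num_chunks) are hypothetical, and plain std::thread is used in place of a task runtime.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <thread>
#include <vector>

// 3-point stencil with zero boundaries, one thread per chunk, no global barrier.
std::vector<double> sweep_p2p(std::vector<double> a, int num_iter, int num_chunks) {
    const std::size_t n = a.size();
    assert(num_chunks > 0 && n >= static_cast<std::size_t>(num_chunks));
    std::vector<double> b(n);
    double* buf[2] = {a.data(), b.data()};
    const std::size_t chunks = static_cast<std::size_t>(num_chunks);
    // done[c] = number of sweeps that chunk c has completed so far.
    std::vector<std::atomic<int>> done(chunks);
    for (auto& d : done) d.store(0);

    auto worker = [&](int c) {
        const std::size_t uc = static_cast<std::size_t>(c);
        const std::size_t lo = n * uc / chunks;
        const std::size_t hi = n * (uc + 1) / chunks;
        for (int k = 0; k < num_iter; ++k) {
            // Wait only for the left and right neighbour chunks.
            for (int nb : {c - 1, c + 1}) {
                if (nb < 0 || nb >= num_chunks) continue;
                while (done[static_cast<std::size_t>(nb)].load(std::memory_order_acquire) < k)
                    std::this_thread::yield();
            }
            const double* in = buf[k % 2];
            double* out = buf[(k + 1) % 2];
            for (std::size_t j = lo; j < hi; ++j) {
                const double left = j > 0 ? in[j - 1] : 0.0;
                const double right = j + 1 < n ? in[j + 1] : 0.0;
                out[j] = left + in[j] + right;
            }
            done[uc].store(k + 1, std::memory_order_release);
        }
    };

    std::vector<std::thread> threads;
    for (int c = 0; c < num_chunks; ++c) threads.emplace_back(worker, c);
    for (auto& t : threads) t.join();
    return (num_iter % 2 == 0) ? a : b;
}
```

Chunks far apart can drift several sweeps away from each other, which is exactly the slack a global barrier removes. A production version would reuse a thread pool and pad the counters to separate cache lines.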
Many-core systems are now nearly always NUMA ones. Given your configuration, you are almost certainly using an AMD Threadripper processor, which has multiple NUMA nodes. This means you should care about locality and the NUMA allocation policy. The default policy is generally first touch. This means that if your initialization is sequential, or if threads are mapped differently during initialization and computation, then cores have to fetch data from remote NUMA nodes, which is slow. In the worst case, all cores access the same NUMA node and saturate it, resulting in a possibly slower execution than the sequential version. You should generally make the initialization parallel for better performance. Getting high performance on such an architecture is far from easy. You need to carefully control the allocation policy (numactl can help with that), the initialization (parallel), the thread binding (with taskset, hwloc and/or environment variables) and the memory access pattern (by reading articles/books about how NUMA machines work and applying dedicated algorithms).
By the way, there is probably a false-sharing effect happening in your code, because cache lines of array_out are certainly shared between threads. This should not have a critical impact, but it does contribute to poor scalability (especially on your 64-core processor). The general solution to this problem is to align the array in memory on a cache line and take care that the parallel splitting is done on a cache line boundary. Alternatively, you can allocate each part of the array in the thread that uses it. This is generally a better approach, as it ensures data is locally allocated and locally filled, and it makes data sharing/communication between NUMA nodes and even cores more explicit (i.e. better control), though it can make the code more complex (there is no free lunch).
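Splitting on cache line boundaries comes down to rounding the chunk limits to whole lines. A minimal sketch, assuming the array itself starts on a cache line boundary; line_aligned_range and its parameters are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open index range [first, second) for thread t, with every
// internal boundary at a multiple of elems_per_line (e.g. 8 doubles per
// 64-byte line).
std::pair<std::size_t, std::size_t> line_aligned_range(
        std::size_t n, std::size_t num_threads, std::size_t t,
        std::size_t elems_per_line) {
    assert(num_threads > 0 && t < num_threads && elems_per_line > 0);
    // Distribute whole cache lines, then convert back to element indices.
    const std::size_t lines = (n + elems_per_line - 1) / elems_per_line;
    const std::size_t lo = std::min(n, lines * t / num_threads * elems_per_line);
    const std::size_t hi = std::min(n, lines * (t + 1) / num_threads * elems_per_line);
    return {lo, hi};
}
```

For 100 doubles and 4 threads this yields the ranges [0, 24), [24, 48), [48, 72) and [72, 100), so no cache line is written by two threads.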
Sharing data across threads is slow. This is because each CPU core has at least one layer of personal cache. The minute you share data between cores/threads, the personal caches need to be synchronised which is slow.
Threads running in parallel across different cores work best if they do not share data.
In your case, if data is read only you might be best off giving each thread its own copy of the data. For read write data, you have to accept the synchronisation slowdown.
I'm programming an application that processes a video stream using a `tbb::parallel_pipeline`. My first filter contains two important operations, one that must occur immediately after the other.
My tests show that the delay between two operations is anywhere from 3 to 20 milliseconds when I set max_number_of_live_tokens to 6 (# of filters I have) but is consistently 3 to 4 milliseconds when max_number_of_live_tokens is 1. The jitter in the first case is unacceptable for my application, but I need to allow multiple tokens to be in flight simultaneously to exploit parallelism.
Here is my pipeline setup:
tbb::parallel_pipeline(6, //max_number_of_live_tokens
// 1st Filter
tbb::make_filter< void, shared_ptr<PipelinePacket_t> >(tbb::filter::serial_in_order,
[&](tbb::flow_control& fc)->shared_ptr<PipelinePacket_t>
{
shared_ptr<PipelinePacket_t> pPacket = grabFrame();
return pPacket;
}
)
&
... // 5 other filters that process the image - all 'serial_in_order'
);
And here is my grabFrame() function:
shared_ptr<VisionPipeline::PipelinePacket_t> VisionPipeline::grabFrame() {
shared_ptr<PipelinePacket_t> pPacket(new PipelinePacket_t);
m_cap >> pPacket->frame; // Operation A (use opencv api to capture frame)
pPacket->motion.gyroDeg = m_imu.getGyroZDeg(); // Operation B (read a gyro value)
return pPacket;
}
I need operations A and B to happen as close as possible to each other (so that the gyro value reflects its value at the time the frame was captured).
My guess is that the jitter that occurs when multiple tokens are in flight simultaneously is caused by tasks from other filters running on the same thread as the first filter and interrupting it while grabFrame() is executing. I've dug through some TBB documentation but can't find anything on how parallel_pipeline breaks up filters into tasks, so it is unclear to me if TBB is somehow breaking up grabFrame() into multiple TBB tasks or not.
Is this assumption correct? If so, how can I tell tbb not to interrupt the first filter between operations A and B with other tasks?
OpenCV uses TBB internally itself for various operations. So if this is actually related to TBB, it's not that you were interrupted between A and B, but rather that OpenCV itself is fighting for priority with the remainder of the filter chain. Unlikely though.
so it is unclear to me if TBB is somehow breaking up grabFrame() into multiple TBB tasks or not.
That is never happening. Unless there are parts in there explicitly dispatching via TBB, it has no effect whatsoever. TBB is not magically splitting your functions into tasks.
But that may not even be your only issue. If your filters happen to be heavy on memory bandwidth, it's likely the case that you are slowing down the actual capture process significantly just by concurrent execution of the image processing.
Looks like you are running the full image through 5 filters in a row, is that correct? Full resolution, not tiled? If so, most of these filters are likely constrained not by the ALU but by memory bandwidth, as you are not staying within CPU cache bounds either.
If you wish to go parallel, you must get rid of the write-backs to main memory in between the filter stages. The only way to do that, is to either start tiling the images in the input filter of the filter chain, or to write a custom all-in-one filter kernel. If you have filters in that chain with spatial dependencies, that's obviously not as easy as I make it sound, then you have to include some overlap in the upper stages.
max_number_of_live_tokens then actually has a real meaning. It's the number of "tiles" in flight. Which is not primarily intended to limit each filter stage to only one concurrent execution (that's not happening anyway), but rather to keep the maximum working set size under control.
E.g. if you know that each of your tiles is now 128kB in size, you know that there are 2 copies involved in each filter (source and destination), and you know you have a 2MB L3 cache, then you would know that you can afford to have 8 tokens in flight without spilling to main memory. If you also happen to have (at least) 8 CPU cores, that yields ideal throughput, but even if you don't, at least you are not risking to become bottle-necked by exceeding cache size. Of course you can afford some spilling to main memory (past what you calculated to be safe), but then you have to perform in-depth profiling of your system to see if you are getting constrained.
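The token estimate above is plain arithmetic; a small sketch makes it easy to redo for other tile and cache sizes. The function name max_tokens_in_cache is hypothetical, and kB/MB are taken as powers of two.

```cpp
#include <cstddef>

// Number of tiles that can be in flight without the working set
// exceeding the cache: cache size / (copies per filter * tile size).
constexpr std::size_t max_tokens_in_cache(std::size_t cache_bytes,
                                          std::size_t tile_bytes,
                                          std::size_t copies_per_filter) {
    return cache_bytes / (tile_bytes * copies_per_filter);
}

// Values from the example in the text: 2 MB L3, 128 kB tiles,
// source and destination copy per filter.
constexpr std::size_t kTokens =
    max_tokens_in_cache(2 * 1024 * 1024, 128 * 1024, 2);
```

The result is what would be passed as max_number_of_live_tokens.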
I want to speed up an algorithm (complete local binary pattern with circular neighbours) in which I iterate through all pixels and calculate some values from each pixel's neighbours (so I need access to neighbouring pixels).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and calculating each ROI separately (with multiple threads).
The problem here is that the ROIs overlap (because to calculate a pixel, I sometimes need to look at neighbours far away), so it's possible that multiple threads access pixel data (reading) at the same time. Is it a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel but at different indices?
As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
Thus, it is safe to operate on the same matrices asynchronously in
different threads.
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.
Generally, parallel reading is not a problem as a cv::Mat is just a nice wrapper around an array, just like std::vector (yes there are differences but I don't see how they would affect the matter of the topic here so I'm going to ignore them). However parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource-heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high performance code (no matter if multi- or single threaded) you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk from Timur Doumler at CppCon 2016 about that topic. This should help you avoiding cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but there are a lot of people on SO who ask questions about performance and yet don't know what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI) which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI, you just have to pass a cv::UMat instead of a cv::Mat. Those two types are convertible to each other. However, the conversion is time intensive because a UMat is basically an array on the GPU memory (VRAM), which means it has to be copied each time you convert it. Also accessing the VRAM takes longer than accessing the RAM (for the CPU that is).
Though, you have to keep in mind that you cannot access VRAM data with the CPU without copying it to the RAM. This means you cannot iterate over your pixels if you use cv::UMat. It is only possible if you write your own OpenCL or Cuda code so your algorithm can run on the GPU.
In most consumer grade PCs, for sliding window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most effort to implement). Of course this only holds if the data buffer (your image) is large enough to make it worth copying to and from the VRAM.
For parallel writing: it's generally safe as long as you don't have overlapping areas. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to be considered.
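To make the read/write pattern concrete, here is a sketch that uses a plain row-major buffer instead of cv::Mat, so that it is self-contained. The Image struct and neighbour_sum are assumptions for illustration, and the window sum is only a placeholder for the real local binary pattern computation. All threads read the shared input, including overlapping neighbourhoods, while each output pixel is written by exactly one thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a single-channel cv::Mat.
struct Image {
    int rows;
    int cols;
    std::vector<int> px;  // row-major pixel values
    int at(int r, int c) const { return px[static_cast<std::size_t>(r) * cols + c]; }
};

// For every pixel, sums the values in a (2*radius+1)^2 window around it.
Image neighbour_sum(const Image& in, int radius) {
    Image out{in.rows, in.cols, std::vector<int>(in.px.size(), 0)};
    // Rows are distributed among threads. Reads of "in" overlap between
    // threads; writes to "out" never do.
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < in.rows; ++r) {
        for (int c = 0; c < in.cols; ++c) {
            int sum = 0;
            for (int dr = -radius; dr <= radius; ++dr) {
                for (int dc = -radius; dc <= radius; ++dc) {
                    const int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < in.rows && cc >= 0 && cc < in.cols)
                        sum += in.at(rr, cc);
                }
            }
            out.px[static_cast<std::size_t>(r) * in.cols + c] = sum;
        }
    }
    return out;
}
```

Writing the result to a separate output buffer, rather than in place, is what keeps the concurrent reads free of any simultaneous writes.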
I have seen many implementations of parallel scan; the two main ones are the Hillis & Steele and Blelloch scans. However, all the implementations I have seen work within shared memory, i.e. memory only shared between threads in a block.
Are there any implementations of scan that work well over arrays that have more elements than threads per block, i.e. the array will not fit into shared memory?
This link mentions a scan implementation I see in all my searches, a Hillis Steele version, example 39-1 https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
Is the only option to do a segmented scan on sub arrays within the array and then do a "final scan" adding a magnitude value from the prior sub array to the next?
With or without shared memory, CUDA kernels execute in chunks (threadblocks) that can execute in any order. To take full advantage of the hardware, you must have multiple threadblocks in your kernel call, but this creates an uncertain execution order.
Because of this, a scan algorithm that works across a large array will necessarily have to work in threadblock-sized pieces (in some fashion). If we have multiple threadblocks, then a given threadblock has no way of knowing whether other threadblocks have finished their work on adjacent data. (Yes, there are contrived mechanisms to allow inter-threadblock communication, but these are fraught with difficulty and don't solve the problem on a large scale.)
The net effect of this is that algorithms like this generally imply a global sync of some sort, and the only safe-in-any-scenario global sync is the kernel launch. Threadblocks can do a portion of their work independently, but when it comes time to stitch the work of threadblocks together, we must wait until step A is completed across all threadblocks before proceeding with step B.
Therefore I think you'll find that most device-wide scan algorithms, including the chapter 39 GPU Gems example you linked, as well as thrust and cub will launch multiple kernels to get this job done, since the kernel launch gives a convenient global sync.
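The multi-kernel structure can be sketched on the CPU in plain C++, with each loop standing in for one kernel launch and the end of each loop acting as the global sync. This is only an illustration of the three phases under that analogy; it is not how thrust or cub are actually implemented, and blocked_inclusive_scan is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<long> blocked_inclusive_scan(const std::vector<long>& in, std::size_t block) {
    assert(block > 0);
    const std::size_t n = in.size();
    const long num_blocks = static_cast<long>((n + block - 1) / block);
    std::vector<long> out(n);
    std::vector<long> block_sum(static_cast<std::size_t>(num_blocks), 0);

    // "Kernel" 1: every block scans its own slice independently.
    #pragma omp parallel for
    for (long b = 0; b < num_blocks; ++b) {
        const std::size_t lo = static_cast<std::size_t>(b) * block;
        const std::size_t hi = std::min(n, lo + block);
        long running = 0;
        for (std::size_t j = lo; j < hi; ++j) { running += in[j]; out[j] = running; }
        block_sum[static_cast<std::size_t>(b)] = running;
    }

    // "Kernel" 2: exclusive scan of the block totals (done serially here;
    // on a GPU this would be another, possibly recursive, scan).
    std::vector<long> offset(static_cast<std::size_t>(num_blocks), 0);
    for (std::size_t b = 1; b < offset.size(); ++b)
        offset[b] = offset[b - 1] + block_sum[b - 1];

    // "Kernel" 3: every block adds the total of all preceding blocks.
    #pragma omp parallel for
    for (long b = 0; b < num_blocks; ++b) {
        const std::size_t lo = static_cast<std::size_t>(b) * block;
        const std::size_t hi = std::min(n, lo + block);
        for (std::size_t j = lo; j < hi; ++j) out[j] += offset[static_cast<std::size_t>(b)];
    }
    return out;
}
```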
Note that we can certainly devise a scan that has individual threadblocks that "work on more elements than threads per block", but this does not ultimately solve our problem (unless we use only 1 threadblock), because we must launch multiple threadblocks to take full advantage of the hardware, and multiple threadblocks in the general case introduces the global sync necessity.
The cub and thrust implementations I mentioned are both open-source template libraries, so you can certainly study the code there if you wish (not a trivial undertaking). They do represent high-quality approaches designed and built by CUDA experts. You can also at a high level study their behavior quite easily using:
nvprof --print-gpu-trace ./mycode
to get a quick read on how many kernels are being launched and what data transfers may be occurring, or you can use nvvp, the visual profiler, to study this.
Suppose I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime for ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information
No need for blocking; each function invocation needs only read-only
access
The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, as with all hardware questions, to test and find out.
Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), with thread m (for m = 0 through M-1, where M is the number of threads) starting at offset SIZE/M*m. (0, 1000, 2000, 3000, for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing:
+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them with 0, 1, 2, 3, etc.. and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
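The two layouts drawn above correspond to two ways of computing which elements a thread visits. A small sketch, with function names invented for the example:

```cpp
#include <cstddef>
#include <vector>

// Blocked layout: thread t owns one contiguous range
// (suggested above for data that is being updated).
std::vector<std::size_t> blocked_indices(std::size_t size, std::size_t threads, std::size_t t) {
    std::vector<std::size_t> idx;
    for (std::size_t i = size * t / threads; i < size * (t + 1) / threads; ++i)
        idx.push_back(i);
    return idx;
}

// Interleaved layout: thread t starts at element t and strides by the
// thread count (suggested above for read-only data).
std::vector<std::size_t> interleaved_indices(std::size_t size, std::size_t threads, std::size_t t) {
    std::vector<std::size_t> idx;
    for (std::size_t i = t; i < size; i += threads)
        idx.push_back(i);
    return idx;
}
```

In real code the loops would call the work function directly instead of collecting indices; the vectors are only there to make the visiting order visible.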
I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much it will not migrate tasks to other cores that are otherwise idle.
Assuming ThrFunc is CPU-bound, you probably want one thread per core, and to divide the elements between them.
If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.
I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.
A good way to numerically determine the number of threads to start would be to use
std::thread::hardware_concurrency()
This is part of C++11 and should yield the number of logical cores in the current system. Logical cores means either the physical number of cores, in case the processor does not support hardware threads (i.e. HyperThreading), or the number of hardware threads.
There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.
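One detail worth handling when using this call: the standard allows it to return 0 when the value cannot be determined, so a fallback is needed. A minimal sketch, with pick_thread_count as a made-up helper name:

```cpp
#include <thread>

// Number of worker threads to start: the number of hardware threads
// reported by the implementation, or 1 if that number is unknown.
unsigned pick_thread_count() {
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0
    return hw == 0 ? 1u : hw;
}
```

The value is only a hint; measuring with a few nearby thread counts is still the way to confirm it.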
The optimal number of threads should equal the number of cores, in which case the computational capacity of each core will be fully utilized, provided the computation on each element is independent.
The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running generic software whose code and data are not optimized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).
I've found a real-world example that I'll put here for those who want a less technical, more intuitive answer:
Having multiple threads per core is like having two queues in an airport for each scanner (which people in both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyer belt, but only one at a time can pass through the scanner. Now at this point, obviously there's a contention point at the entrance of the scanner, but what happens in reality is most of the times both queues function very well.
In this example, the queues represent threads and the scanner is the main function of a core. As a general rule of thumb, a second thread on a core adds roughly a quarter of a core's throughput (about 1.25 cores in total), i.e., it's not like having an entire new core. So if the task is CPU-bound, a thread count slightly over the number of available processors is probably best.
But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2