Do I need batched parallel programming? - c++

I'm writing a raytracer with EM-processing (radar application). The following is done for many different simulation locations of the radar:
raytracing, generates a lot of data
EM-postprocessing, generates a scalar and then deletes the raytrace data
Each simulation point will be encapsulated in its own instance of a class (all grouped in a std::vector) with a specification of the radar location for that simulation point, references to data it will only read (shared by all simulation points), and properties for storing its results (so each simulation point has its own). Because of this setup I thought I could benefit from using a for_each loop with the std::execution::par_unseq policy without taking further measures. Correct?
The problem, however, is that the raytracing generates so much data that, with for example 10 thousand simulation locations, I may run out of memory if the scheduler decides to do all the raytracing first, which it is allowed to do with par_unseq. So my idea was to write a normal for loop which feeds an inner parallel for_each loop with, say, 100 simulation points at a time. Is this an optimal solution for my case? Or did I totally misinterpret how parallel things work?

Quoting the standard, §25.3.3/8, one can find (emphasis mine):
The semantics of invocation with execution::unsequenced_policy, execution::parallel_policy, or execution::parallel_unsequenced_policy allow the implementation to fall back to sequential execution if the system cannot parallelize an algorithm invocation, e.g., due to lack of resources.
So implementations are allowed by the standard to fall back to sequential execution if they run short of resources.
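The batching idea from the question can be sketched as follows. This is a minimal illustration, not the asker's code: SimPoint, its run() member and run_batched are hypothetical names, and the "raytracing" is a dummy buffer that is reduced to one scalar and freed again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical stand-in for one simulation point.
struct SimPoint {
    double radar_position = 0.0;  // input: where the radar sits
    double result = 0.0;          // output: owned by this point only

    void run() {
        // "Raytracing": a large temporary that lives only inside run().
        std::vector<double> rays(10'000, radar_position);
        // "EM post-processing": reduce to one scalar; rays is freed on return.
        result = std::accumulate(rays.begin(), rays.end(), 0.0);
    }
};

// Outer sequential loop over batches, inner parallel loop within one batch.
// No more than batch_size points can be in flight at any time.
void run_batched(std::vector<SimPoint>& points, std::size_t batch_size) {
    assert(batch_size > 0);
    for (std::size_t begin = 0; begin < points.size(); begin += batch_size) {
        const std::size_t end = std::min(begin + batch_size, points.size());
        std::for_each(std::execution::par_unseq,
                      points.begin() + static_cast<std::ptrdiff_t>(begin),
                      points.begin() + static_cast<std::ptrdiff_t>(end),
                      [](SimPoint& p) { p.run(); });
    }
}
```

Calling run_batched(points, 100) processes the vector 100 points at a time. Depending on the toolchain, the parallel policies may need an extra backend library at link time (with GCC's libstdc++ this is typically TBB).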

Related

Is openMp dynamic scheduling the same as LPT scheduling when tasks are sorted by processing time?

I am confused about dynamic scheduling and LPT scheduling (which I think is static).
What I learnt is that dynamic scheduling picks tasks based on chunk sizes, and when a thread has finished its tasks, it picks another. LPT scheduling picks the tasks based on the longest processing time required for each task.
So, if I sort the tasks based on processing time and then apply dynamic scheduling with chunk size 1, is it the same as LPT scheduling or not?
For example, suppose there is a loop with 15 iterations. In each iteration, the Cartesian product of some vectors is calculated. But in each iteration the sizes of the vectors are different, which means the load is unbalanced. If I calculate the resulting size for each iteration, sort the iterations in descending order and then use schedule(dynamic,1), is that the same as LPT in theory?
Firstly, OpenMP schedule clauses apply to loops, not tasks, so it is confusing to talk about tasks in this context, since OpenMP also has tasks (and even taskloop). To avoid confusion, I will call the loop scheduled entity a "chunk", since it is a contiguous chunk of iterations.
Assuming you want to discuss the schedule clause on loops, then
There is a fundamental difference between a scheduling algorithm
like LPT which assumes prior knowledge of chunk execution time and
any of the algorithms permitted by OpenMP, which do not require such
knowledge.
As of OpenMP 4.5 there is now a schedule modifier
(monotonic or nonmonotonic) which can be applied to the
dynamic schedule (as well as others), and that affects the sequences which the schedule can generate.
In OpenMP 5.0 the default, undecorated, schedule(dynamic) is equivalent to schedule(nonmonotonic:dynamic), which allows sequences which would not be possible with schedule(monotonic:dynamic), and would likely break your mapping (though you can use schedule(monotonic:dynamic), of course!)
Since all of the OpenMP scheduling behaviour is described in terms of the execution state of the machine, it is certainly possible that a sequence will be produced that is not that which you would expect, since the machine state represents ground-truth, reflecting issues like interference from other machine load, whereas a scheduling scheme like LPT is based on assumed prior knowledge of execution time which may not be reflected in reality.
You can see a discussion of schedule(nonmonotonic:dynamic) at https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
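As a rough sketch of what the question describes (sort iterations by estimated cost, largest first, then hand them out one at a time), assuming the cost estimates are available up front. The names order_by_cost_desc and run_sorted_dynamic are hypothetical, and the inner counting loop only stands in for the real work. Without OpenMP enabled the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the iteration indices ordered by estimated cost, largest first.
std::vector<std::size_t> order_by_cost_desc(const std::vector<std::size_t>& cost) {
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return cost[a] > cost[b]; });
    return order;
}

// Runs the iterations in descending-cost order, one iteration per chunk.
// out[i] receives the (dummy) result of original iteration i.
void run_sorted_dynamic(const std::vector<std::size_t>& cost,
                        std::vector<std::size_t>& out) {
    assert(out.size() == cost.size());
    const std::vector<std::size_t> order = order_by_cost_desc(cost);
    const long n = static_cast<long>(order.size());
    // monotonic: each thread receives chunks in increasing logical order.
    #pragma omp parallel for schedule(monotonic : dynamic, 1)
    for (long k = 0; k < n; ++k) {
        const std::size_t i = order[static_cast<std::size_t>(k)];
        std::size_t work = 0;
        for (std::size_t j = 0; j < cost[i]; ++j) ++work;  // stand-in for real work
        out[i] = work;  // each iteration writes its own element only
    }
}
```

Even with this ordering, which thread picks up which chunk is decided at run time from the actual machine state, so the resulting assignment only matches an LPT plan when the cost estimates hold in practice.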

Is it thread-safe to access a Mat with multiple threads in OpenCV?

I want to speed up an algorithm (complete local binary pattern with circle neighbours) for which I iterate through all pixels and calculate some values from each pixel's neighbours (so I need neighbour pixel access).
Currently I do this by iterating over all pixels with one thread/process. I want to parallelize this task by dividing the input image into multiple ROIs and calculating each ROI separately (with multiple threads).
The problem here is that the ROIs overlap (because to calculate a pixel, I sometimes need to look at neighbours far away), so it is possible that multiple threads access pixel data (READING) at the same time. Is it a problem if two or more threads read the same Mat at the same indices at the same time?
Is it also a problem if I write to the same Mat in parallel but at different indices?
As long as no writes happen simultaneously to the reads, it is safe to have multiple concurrent reads.
That holds for any sane system.
Consider the alternative:
If there was a race condition, it would mean that the memory storing your object gets modified during the read operation. If no memory (storing the object) gets written to during the read, there's no possible interaction between the threads.
Lastly, if you look at the doc,
https://docs.opencv.org/3.1.0/d3/d63/classcv_1_1Mat.html
You'll see two mentions of thread-safety:
Thus, it is safe to operate on the same matrices asynchronously in
different threads.
They mention it around ref-counting, performed during matrix assignment. So, at the very least, assigning from the same matrix to two others can be done safely in multiple threads. This pretty much guarantees that simple read access is also thread-safe.
Generally, parallel reading is not a problem as a cv::Mat is just a nice wrapper around an array, just like std::vector (yes there are differences but I don't see how they would affect the matter of the topic here so I'm going to ignore them). However parallelization doesn't automatically give you a performance boost. There are quite a few things to consider here:
Creating a thread is resource heavy and can have a large negative impact if the task is relatively short (in terms of computation time), so thread pooling has to be considered.
If you write high performance code (no matter if multi- or single-threaded) you should have a grasp of how your hardware works. In this case: memory and CPU. There is a very good talk from Timur Doumler at CppCon 2016 about that topic. This should help you avoid cache misses.
Also worth mentioning is compiler optimization. Turn it on. I know this sounds super obvious, but there are a lot of people on SO who ask questions about performance and yet don't know what compiler optimization is.
Finally, there is the OpenCV Transparent API (TAPI) which basically utilizes the GPU instead of the CPU. Almost all built-in algorithms of OpenCV support the TAPI, you just have to pass a cv::UMat instead of a cv::Mat. Those two types are convertible to each other. However, the conversion is time intensive because a UMat is basically an array on the GPU memory (VRAM), which means it has to be copied each time you convert it. Also accessing the VRAM takes longer than accessing the RAM (for the CPU that is).
Keep in mind, though, that you cannot access VRAM data with the CPU without copying it to the RAM. This means you cannot iterate over your pixels if you use cv::UMat. It is only possible if you write your own OpenCL or CUDA code so your algorithm can run on the GPU.
In most consumer grade PCs, for sliding window algorithms (basically anything that iterates over the pixels and performs a calculation around each pixel), using the GPU is usually by far the fastest method (but also requires the most effort to implement). Of course this only holds if the data buffer (your image) is large enough to make it worth copying to and from the VRAM.
For parallel writing: it's generally safe as long as you don't have overlapping areas. However, cache misses and false sharing (as pointed out by NathanOliver) are problems to be considered.
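The read/write pattern discussed above can be sketched without OpenCV. The example below uses a plain std::vector as a stand-in for the pixel data of a cv::Mat (an assumption made to keep it self-contained); box_sum_rows and box_sum_parallel are hypothetical names. All threads read the shared input, including overlapping neighbour rows, and each thread writes only to its own band of output rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// 3x3 box sum with clamped borders for rows [row_begin, row_end).
// `in` is only read; the rows of `out` in this range belong to one thread.
void box_sum_rows(const std::vector<std::uint8_t>& in, std::vector<int>& out,
                  int width, int height, int row_begin, int row_end) {
    const auto w = static_cast<std::size_t>(width);
    for (int y = row_begin; y < row_end; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const auto yy = static_cast<std::size_t>(std::clamp(y + dy, 0, height - 1));
                    const auto xx = static_cast<std::size_t>(std::clamp(x + dx, 0, width - 1));
                    sum += in[yy * w + xx];  // concurrent reads, no writes to `in`
                }
            }
            out[static_cast<std::size_t>(y) * w + static_cast<std::size_t>(x)] = sum;
        }
    }
}

// Splits the image into horizontal bands, one band per thread.
std::vector<int> box_sum_parallel(const std::vector<std::uint8_t>& in,
                                  int width, int height, int num_threads) {
    assert(width > 0 && height > 0 && num_threads > 0);
    assert(in.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<int> out(in.size(), 0);
    std::vector<std::thread> workers;
    const int rows_per_thread = (height + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(t * rows_per_thread, height);
        const int end = std::min(begin + rows_per_thread, height);
        if (begin == end) continue;  // more threads than bands
        workers.emplace_back([&in, &out, width, height, begin, end] {
            box_sum_rows(in, out, width, height, begin, end);
        });
    }
    for (auto& worker : workers) worker.join();
    return out;
}
```

Writing the result into a separate output buffer is the design choice that keeps this safe: if the threads wrote back into the input, a neighbour read in one band could race with a write in the adjacent band.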

What approaches can one recommend for accelerating a massively (CPU) parallel program?

My neuroevolution program (C++) is currently limited to small data sets, and I have projects for it that would (on my current workstation/cloud arrangement) take months to run. The biggest bottleneck is NOT the evaluation of the network or evolutionary processes; it is the size of the data sets. To obtain the fitness of a candidate network, it must be evaluated for EACH record in the set.
In a perfect world, I would have access to a cloud-based virtual machine instance with 1 core for each record in the 15,120-record Cover Type data set. However, the largest VMs I have found are 112-core. At present my program uses OpenMP to parallelize the for-loop implementing the evaluation of all records. The speedup is equal to the number of cores. The crossover/mutation is serial, but could easily be parallelized for the evaluation of each individual (100-10,000 of them).
The biggest problem is the way the network had to be implemented: the network is addressed directly from this structure.
struct DNA {
    vector<int> sizes;
    vector<Function> types;
    vector<vector<double>> biases;
    vector<vector<vector<double>>> weights;
};
GPU acceleration appears to be impossible. The program's structures must be made of multi-dimensional data types of sizes that can differ (not every layer is the same size). I selected STL vectors... THEN realized that these cannot be passed to kernels or addressed by them. Standard operations (vector/matrix) would require data conversion, transfer, run, and conversion back. It simply isn't viable.
MPI. I have considered this recently, and it would appear to be viable for the purposes of evaluating the fitness of each individual. If evaluating each takes more than a couple of seconds (which is a near-certainty), I can imagine this approach being the best way forward. However, I am considering 3 possibilities for how to proceed:
Initialize a "master" cloud instance, and use it to launch 100-10,000 smaller instances. Each would have a copy of the data set in-memory, and would need to be deleted once the program found a solution.
SBCs, with their low costs and increasing specifications could permit the construction of a small home computing cluster, eliminating any security concerns with the cloud and giving me more control over the hardware.
I have no idea what I'm doing, it is impossible to breed larger neural networks (practically) without GPU acceleration, I failed to understand that the "thrust" library could allow vector-based code to run on a GPU, and I haven't done my homework.
From what you described, I do not think GPU acceleration is impossible. My favorite approach is OpenCL, but even if you use CUDA, you can't easily use the C++ STL for the purpose. But if you go through the hurdle of converting your C++ code to C data structures (i.e., float, double, or int and arrays of them, instead of vector<> types, and redefine your vector<Function> into more primitive types), leveraging the GPU should be easy, especially if your program is mostly matrix operations. But be aware that GPU architecture is different from CPU architecture. If your logic has a lot of branching (i.e., if-then-else structures), the performance on a GPU would not be good.
The GPU is far more capable than you thought. All the memory on the GPU is dynamically allocated, which means you can allocate as much memory as you want. If you want to specify a different size for each thread, simply store the sizes in an array and use the thread ID to index it. Moreover, you can even store the network in shared memory and evaluate records across the threads to accelerate memory access. The most convenient way, as you mentioned, is to make use of the thrust library. You don't need to understand how it is implemented if your aim is not to study GPUs. You don't need to worry about performance either, because it is optimized by professional GPU experts (many from Nvidia, who build GPUs). Thrust is designed to be very similar to the STL, so it is easy to master if you are familiar with C++.
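To illustrate the conversion to plain arrays mentioned above, here is a minimal host-side sketch that flattens the nested weights vector into one contiguous buffer plus per-layer offsets, which is the kind of layout a kernel can index directly. FlatWeights, flatten and weight_at are hypothetical names, and the sketch assumes every layer is a rectangular matrix (all rows of a layer have the same length).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous layout: all layers back to back, each layer stored row-major.
struct FlatWeights {
    std::vector<double> values;        // every weight of every layer
    std::vector<std::size_t> offsets;  // offsets[l] = first weight of layer l; last entry = total
    std::vector<std::size_t> rows;     // rows[l] = number of rows in layer l
    std::vector<std::size_t> cols;     // cols[l] = number of columns in layer l
};

FlatWeights flatten(const std::vector<std::vector<std::vector<double>>>& weights) {
    FlatWeights flat;
    flat.offsets.push_back(0);
    for (const auto& layer : weights) {
        const std::size_t r = layer.size();
        const std::size_t c = r ? layer.front().size() : 0;
        flat.rows.push_back(r);
        flat.cols.push_back(c);
        for (const auto& row : layer) {
            assert(row.size() == c);  // assumption: rectangular layers
            flat.values.insert(flat.values.end(), row.begin(), row.end());
        }
        flat.offsets.push_back(flat.values.size());
    }
    return flat;
}

// Index computation a kernel would perform on the raw arrays.
double weight_at(const FlatWeights& f, std::size_t layer, std::size_t row, std::size_t col) {
    return f.values[f.offsets[layer] + row * f.cols[layer] + col];
}
```

With this layout only values.data(), offsets.data() and the two size arrays would need to be copied to the device once per network, rather than converting data for every operation.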

How is a parallel scan performed on an array with more elements than threads per block?

I have seen many implementations of parallel scan; the two main ones are the Hillis & Steele and Blelloch scans. However, all the implementations I have seen work within shared memory, i.e. memory shared only between the threads in a block.
Are there any implementations of scan that work well over arrays that have more elements than threads per block, i.e. the array will not fit into shared memory?
This link mentions a scan implementation I see in all my searches, a Hillis Steele version, example 39-1 https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
Is the only option to do a segmented scan on sub arrays within the array and then do a "final scan" adding a magnitude value from the prior sub array to the next?
With or without shared memory, CUDA kernels execute in chunks (threadblocks) that can execute in any order. To take full advantage of the hardware, you must have multiple threadblocks in your kernel call, but this creates an uncertain execution order.
Because of this, a scan algorithm that works across a large array will necessarily have to work in threadblock-sized pieces (in some fashion). If we have multiple threadblocks, then a given threadblock has no way of knowing whether other threadblocks have finished their work on adjacent data. (Yes, there are contrived mechanisms to allow inter-threadblock communication, but these are fraught with difficulty and don't solve the problem on a large scale.)
The net effect of this is that algorithms like this generally imply a global sync of some sort, and the only safe-in-any-scenario global sync is the kernel launch. Threadblocks can do a portion of their work independently, but when it comes time to stitch the work of threadblocks together, we must wait until step A is completed across all threadblocks before proceeding with step B.
Therefore I think you'll find that most device-wide scan algorithms, including the chapter 39 GPU Gems example you linked, as well as thrust and cub will launch multiple kernels to get this job done, since the kernel launch gives a convenient global sync.
Note that we can certainly devise a scan that has individual threadblocks that "work on more elements than threads per block", but this does not ultimately solve our problem (unless we use only 1 threadblock), because we must launch multiple threadblocks to take full advantage of the hardware, and multiple threadblocks in the general case introduces the global sync necessity.
The cub and thrust implementations I mentioned are both open-source template libraries, so you can certainly study the code there if you wish (not a trivial undertaking). They do represent high-quality approaches designed and built by CUDA experts. You can also at a high level study their behavior quite easily using:
nvprof --print-gpu-trace ./mycode
to get a quick read on how many kernels are being launched and what data transfers may be occurring, or you can use nvvp, the visual profiler, to study this.
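The multi-kernel structure described above can be sketched on the CPU. The following is a sequential C++ illustration, not CUDA code and not how thrust or cub are implemented: each of the three loops over blocks stands in for one kernel launch, and the boundary between them plays the role of the global sync. blocked_inclusive_scan is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> blocked_inclusive_scan(const std::vector<int>& in, std::size_t block_size) {
    assert(block_size > 0);
    const std::size_t n = in.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;
    std::vector<int> out(n);
    std::vector<int> block_sum(num_blocks, 0);

    // Pass 1 ("kernel 1"): every block scans its own slice independently.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        block_sum[b] = running;
    }

    // Pass 2 ("kernel 2"): exclusive scan over the per-block totals.
    std::vector<int> block_offset(num_blocks, 0);
    for (std::size_t b = 1; b < num_blocks; ++b) {
        block_offset[b] = block_offset[b - 1] + block_sum[b - 1];
    }

    // Pass 3 ("kernel 3"): every block adds the total of all blocks before it.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i) out[i] += block_offset[b];
    }
    return out;
}
```

Pass 3 cannot start for any block until pass 2 has finished, which in turn needs every block's total from pass 1; that dependency is why separate kernel launches are the usual way to structure this.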

OpenMP and OOP (Molecular Dynamics Simulation)

I’m conducting a molecular dynamics simulation, and I’ve been struggling for quite a while to implement it in parallel, and although I succeeded in fully loading my 4-thread processor, the computation time in parallel is greater than the computation time in serial mode.
Studying at which point of time each thread starts and finishes its loop iteration, I’ve noticed a pattern: it’s as if different threads are waiting for each other.
It was then that I turned my attention to the structure of my program. I have a class, an instance of which represents my system of particles, containing all the information about the particles and some functions that use this information. I also have a class, an instance of which represents my interatomic potential, containing the parameters of the potential function along with some functions (one of those functions calculates the force between two given particles).
And so in my program there exist instances of two different classes, and they interact with each other: some functions of one class take references to instances of another class.
And the block I’m trying to implement in parallel looks like this:
void Run_simulation(Class_system &system, Class_potential &potential, some other arguments){
#pragma omp parallel for
for(…)
}
for(...) is the actual computation, using data from the system instance of the Class_system class and some functions from the potential instance of the Class_potential class.
Am I right that it’s this structure that’s the source of my troubles?
Could you suggest what has to be done in this case? Must I rewrite my program in a completely different manner? Should I use a different tool to implement my program in parallel?
Without further details on your simulation type I can only speculate, so here are my speculations.
Did you look into the issue of load balancing? I guess the loop distributes the particles among threads but if you have some kind of a restricted range potential, then the computational time might differ from particle to particle in the different regions of the simulation volume, depending on the spatial density. This is a very common problem in molecular dynamics and one that is very hard to solve properly in distributed memory (MPI in most cases) codes. Fortunately with OpenMP you get direct access to all particles at each computing element and so the load balancing is much easier to achieve. It is not only easier, but it is also built-in, so to speak - simply change the scheduling of the for directive with the schedule(dynamic,chunk) clause, where chunk is a small number whose optimal value might vary from simulation to simulation. You might make chunk part of the input data to the program or you might instead write schedule(runtime) and then play with different scheduling classes by setting the OMP_SCHEDULE environment variable to values like "static", "dynamic,1", "dynamic,10", "guided", etc.
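A minimal sketch of the schedule(runtime) suggestion, with a dummy workload whose cost differs per particle. per_particle_energy and neighbour_count are hypothetical names; the inner loop only stands in for the real force or energy evaluation. If OpenMP is not enabled, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// energy[i] is filled from a loop over the neighbours of particle i, so the
// cost of an iteration is proportional to neighbour_count[i] (imbalanced load).
void per_particle_energy(const std::vector<int>& neighbour_count,
                         std::vector<double>& energy) {
    assert(energy.size() == neighbour_count.size());
    const long n = static_cast<long>(neighbour_count.size());
    // The schedule kind and chunk size are read from OMP_SCHEDULE at run time.
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < n; ++i) {
        const auto idx = static_cast<std::size_t>(i);
        double e = 0.0;
        for (int j = 0; j < neighbour_count[idx]; ++j) e += 1.0;  // stand-in for real work
        energy[idx] = e;
    }
}
```

The same binary can then be timed with different settings, for example OMP_SCHEDULE="static" against OMP_SCHEDULE="dynamic,10", without recompiling.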
Another possible source of performance degradation is false sharing and true sharing. False sharing occurs when your data structure is not suitable for concurrent modification. For example, if you keep 3D positional and velocity information for each particle (let's say you use the velocity Verlet integrator), given IEEE 754 double precision, each coordinate/velocity triplet takes 24 bytes. This means that a single cache line of 64 bytes accommodates 2 complete triplets and 2/3 of another one. The consequence of this is that no matter how you distribute the particles among the threads, there would always be at least two threads that would have to share a cache line. Suppose that those threads run on different physical cores. If one thread writes to its copy of the cache line (for example it updates the position of a particle), the cache coherency protocol would be involved and it will invalidate the cache line in the other core, which would then have to reread it from a slower cache or even from main memory. When the second thread updates its particle, this would invalidate the cache line in the first core. The solution to this problem comes with proper padding and proper chunk size choice so that no two threads would share a single cache line. For example, if you add an artificial 4-th dimension (you can use it to store the potential energy of the particle in the 4-th element of the position vector and the kinetic energy in the 4-th element of the velocity vector) then each position/velocity quadruplet would take 32 bytes and information for exactly two particles would fit in a single cache line. If you then distribute an even number of particles per thread, you automatically get rid of possible false sharing.
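The cache-line arithmetic above can be checked with a few lines of code. This sketch assumes a 64-byte cache line and an array whose first element starts on a cache-line boundary; Vec3, Vec4 and straddles_cache_line are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

struct Vec3 { double x, y, z; };     // unpadded triplet
struct Vec4 { double x, y, z, w; };  // padded; w can hold e.g. an energy term

static_assert(sizeof(Vec3) == 24, "three doubles take 24 bytes");
static_assert(sizeof(Vec4) == 32, "four doubles take 32 bytes");

// True if element `index` of a cache-line-aligned array spans two cache lines.
bool straddles_cache_line(std::size_t index, std::size_t element_bytes) {
    assert(element_bytes > 0 && element_bytes <= kCacheLine);
    const std::size_t first_byte = index * element_bytes;
    const std::size_t last_byte = first_byte + element_bytes - 1;
    return first_byte / kCacheLine != last_byte / kCacheLine;
}
```

With 24-byte elements the third triplet (index 2) occupies bytes 48 to 71 and therefore spans two lines, whereas with 32-byte elements no element ever does, and every even index starts a fresh cache line.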
True sharing occurs when threads concurrently access the same data structure and there is an overlap between the parts of the structure modified by the different threads. In molecular dynamics simulations this occurs very frequently, as we want to exploit Newton's third law in order to cut the computational time in two when dealing with pairwise interaction potentials. When one thread computes the force acting on particle i, while enumerating its neighbours j, computing the force that j exerts on i automatically gives you the force that i exerts on j, so that contribution can be added to the total force on j. But j might belong to another thread that might be modifying it at the same time, so atomic operations have to be used for both updates (both, since another thread might update i if it happens to neighbour one or more of its own particles). Atomic updates on x86 are implemented with locked instructions. This is not as horribly slow as often presented, but still slower than a regular update. It also includes the same cache line invalidation effect as with false sharing. To get around this, at the expense of increased memory usage, one could use local arrays to store partial force contributions and then perform a reduction in the end. The reduction itself has to be performed either in serial or in parallel with locked instructions, so it might turn out that not only is there no gain from using this approach, but it could be even slower. Proper particle sorting and clever distribution between the processing elements, so as to minimise the interface regions, can be used to tackle this problem.
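One way to get per-thread partial force arrays with a final reduction is an OpenMP array-section reduction (available since OpenMP 4.5), sketched below. This is an illustration under simplifying assumptions, not the asker's code: the system is one-dimensional, every pair interacts, and the toy pair force f_ij = x[j] - x[i] is made up for the example. pair_forces is a hypothetical name. Without OpenMP enabled the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes force[i] as the sum over j != i of (x[j] - x[i]), visiting each
// pair once and applying Newton's third law (+f on i, -f on j).
void pair_forces(const std::vector<double>& x, std::vector<double>& force) {
    assert(force.size() == x.size());
    std::fill(force.begin(), force.end(), 0.0);
    const long n = static_cast<long>(x.size());
    if (n == 0) return;
    double* f = force.data();
    // Each thread accumulates into its own private copy of f[0..n);
    // the copies are summed into the shared array when the loop ends.
    #pragma omp parallel for schedule(dynamic) reduction(+ : f[:n])
    for (long i = 0; i < n; ++i) {
        for (long j = i + 1; j < n; ++j) {
            const double fij = x[static_cast<std::size_t>(j)] - x[static_cast<std::size_t>(i)];
            f[i] += fij;  // force on i due to j
            f[j] -= fij;  // equal and opposite force on j
        }
    }
}
```

The trade-off named in the text applies here as well: every thread carries a full private copy of the force array, and combining those copies at the end has its own cost.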
One more thing that I would like to touch on is memory bandwidth. Depending on your algorithm, there is a certain ratio between the number of data elements fetched and the number of floating point operations performed at each iteration of the loop. Each processor has only a limited bandwidth available for memory fetches, and if your data does not quite fit in the CPU cache, the memory bus might be unable to deliver enough data to feed so many threads executing on a single socket. Your Core i3-2370M has only 3 MiB of L3 cache, so if you explicitly keep the position, velocity and force for each particle, you can only store about 43000 particles in the L3 cache and about 3600 particles in the L2 cache (or about 1800 particles per hyperthread).
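The particle counts above follow from simple division, sketched here. The assumptions are that position, velocity and force are each stored as three doubles (72 bytes per particle) and that the L2 cache is 256 KiB per core; the second figure is not stated in the text and is inferred from the quoted numbers. particles_that_fit is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(double) == 8, "the arithmetic assumes 8-byte doubles");

// Position, velocity and force: three vectors of three doubles each.
constexpr std::size_t kBytesPerParticle = 3 * 3 * sizeof(double);  // 72 bytes

// Number of whole particles whose data fits in a cache of the given size.
std::size_t particles_that_fit(std::size_t cache_bytes, std::size_t bytes_per_particle) {
    assert(bytes_per_particle > 0);
    return cache_bytes / bytes_per_particle;
}
```

For a 3 MiB L3 cache, particles_that_fit(3 * 1024 * 1024, kBytesPerParticle) gives 43690, and for an assumed 256 KiB L2 cache it gives 3640, or 1820 per hyperthread when two share the core.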
The last one is hyperthreading. As High Performance Mark has already noted, hyperthreads share a great deal of core machinery. For example, there is only one AVX vector FPU engine, which is shared between both hyperthreads. If your code is not vectorised, you lose a great deal of the computing power available in your processor. If your code is vectorised, then both hyperthreads will get in each other's way as they fight for control over the AVX engine. Hyperthreading is useful only when it is able to hide memory latency by overlapping computation (in one hyperthread) with memory loads (in another hyperthread). With dense numerical codes that perform many register operations before they perform a memory load/store, hyperthreading gives no benefit whatsoever, and you'd be better off running with half the number of threads and explicitly binding them to different cores so as to prevent the OS scheduler from running them as hyperthreads. The scheduler on Windows is particularly dumb in this respect, see here for an example rant. Intel's OpenMP implementation supports various binding strategies controlled via environment variables. GNU's OpenMP implementation does too. I am not aware of any way to control thread binding (a.k.a. affinity masks) in Microsoft's OpenMP implementation.