I have a C++ number crunching program. The structure is:
a) data input, data preparation
b) "big" loop, uses global and local data (lots of different variables in both cases)
c) postprocess results and write data
The most intensive part is "b", which is basically a loop. I need to speed up the program on a cluster: 25 blades, 4 cores each. I wonder whether I could use OpenMP and MPI here. Could you point me to tutorials that cover not the general case but complex, "big" for loops?
Thanks
Actually, you should use both.
Use MPI to distribute tasks between blades and OpenMP to fully utilize each blade. Take some time to understand how memory and sharing work in each case.
You cannot divide your task between blades using OpenMP alone. Try to divide your loop into several parts and distribute those parts across the blades.
For example, if you want to combine two vectors of size N, one half (N/2 elements) goes to one node and the other half to another.
But the cost of transferring data between blades is noticeable. So if your task is not actually that large, it may be better to just distribute it across the 4 cores of a single blade.
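A minimal sketch of that split, assuming the "big" loop has independent iterations and using element-wise addition as a stand-in for the real loop body. The names `block_range` and `add_slice` are made up for illustration. In an actual MPI program, `parts` and `rank` would come from `MPI_Comm_size` and `MPI_Comm_rank`; the MPI calls are left out here so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second) owned by `rank` when n iterations
// are split as evenly as possible across `parts` workers (e.g. MPI ranks).
std::pair<std::size_t, std::size_t> block_range(std::size_t n, std::size_t parts,
                                                std::size_t rank) {
    assert(parts > 0 && rank < parts);
    const std::size_t base = n / parts;   // every worker gets at least this many
    const std::size_t extra = n % parts;  // the first `extra` workers get one more
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t end = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}

// Work done on one blade: add the slice [begin, end) of a and b into out.
// With OpenMP enabled (e.g. -fopenmp) the pragma spreads the slice over the
// blade's cores; without it the pragma is ignored and the loop runs serially.
void add_slice(const std::vector<double>& a, const std::vector<double>& b,
               std::vector<double>& out, std::size_t begin, std::size_t end) {
    const long long first = static_cast<long long>(begin);
    const long long last = static_cast<long long>(end);
    #pragma omp parallel for
    for (long long i = first; i < last; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        out[k] = a[k] + b[k];
    }
}
```

Each blade would call `block_range(N, 25, rank)` once and then run `add_slice` on its own range. Gathering the results back onto one node is the communication cost mentioned above.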
I'm currently reading up on the OpenCL framework for my thesis work. What I've come across so far is that you can run kernels either in data parallel or in task parallel. Now I've got a question that I can't manage to find the answer to.
Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data parallel process and just run it. Fairly simple.
However, now say that you have 10+ different vectors that need to be summed up also. Is it possible to run these 10+ different vectors in task parallel, while still using a kernel that processes them as "data parallel"?
So you would basically parallelize tasks that are themselves run in parallel? What I've come to understand is that you can EITHER run the tasks in parallel, or run just one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, or they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue; this is all implementation-defined, to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, each of which iterates through a single vector, copying it (that might well be the fastest way on a CPU, for cache prefetching reasons). You could have each work-item copy one element, using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one per vector, and have the runtime decide whether it can run them together.
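As a rough host-side illustration of the "one work-group per vector" option, here is a plain C++ sketch (not OpenCL code), written for the summation from the question. The `PackedVectors` layout and the function names are assumptions made up for this example: all vectors are packed into one buffer plus an offsets table, and the outer loop index plays the role of the work-group id.

```cpp
#include <cstddef>
#include <vector>

// All vectors packed back to back in `data`; vector v occupies
// data[offsets[v]] up to (not including) data[offsets[v + 1]].
struct PackedVectors {
    std::vector<float> data;
    std::vector<std::size_t> offsets;  // size = number of vectors + 1
};

PackedVectors pack(const std::vector<std::vector<float>>& vectors) {
    PackedVectors p;
    p.offsets.push_back(0);
    for (const auto& v : vectors) {
        p.data.insert(p.data.end(), v.begin(), v.end());
        p.offsets.push_back(p.data.size());
    }
    return p;
}

// Stand-in for the kernel: each value of `group` corresponds to one
// work-group, which walks exactly one vector and writes one result.
std::vector<float> sum_each(const PackedVectors& p) {
    const std::size_t count = p.offsets.size() - 1;
    std::vector<float> sums(count, 0.0f);
    for (std::size_t group = 0; group < count; ++group) {
        for (std::size_t i = p.offsets[group]; i < p.offsets[group + 1]; ++i) {
            sums[group] += p.data[i];
        }
    }
    return sums;
}
```

In a real kernel the buffer and the offsets table would be passed as two arguments, and the inner loop could itself be split across the work-items of the group.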
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.
Someone asked me this question and I am a bit confused about it.
Q: How would you process data that arrives at twice your processing speed?
I thought of the following:
Using a queue to handle this. But with a plain queue, the required queue size grows without bound and I will still lag behind: in every interval t, twice as much data arrives as I can process, so the backlog keeps growing.
Using one thread for reading data and two more for processing. But what happens if my data has to be processed serially?
I am still confused, and any help on similar problems is welcome. I know there might be a standard solution for this, but I am unaware of it.
I would like to implement this in C/C++.
Short answer: you'll need some kind of parallel processing. It's not easy.
Long answer: depending on your workload requirements, and on whether the bottleneck is I/O or CPU, the solution might be multithreading on a single core, on a multicore processor, or on a shared-memory multiprocessor, or even distributing the work across multiple nodes. If the problem is simple enough (embarrassingly parallel), it can be just a matter of distributing and balancing the work between the worker units; otherwise you will need to do some explicit parallel programming. There are fundamentally two parallel programming models, each with a standard API: shared memory, typically programmed with OpenMP, for multithreading on multicore systems (with either symmetric or non-uniform memory access); and message passing, typically programmed with MPI, for distributed processing over a low-latency, high-bandwidth network. To complicate things further, OpenMP and MPI can run together in a hybrid setup: OpenMP distributes and coordinates the compute load between the cores on each node, and MPI does the same between the nodes. Be aware that this is very tough work.
I have seen many implementations of parallel scan; the two main ones are the Hillis & Steele scan and the Blelloch scan. However, all the implementations I have seen work within shared memory, which is only shared between the threads in a block.
Are there any implementations of scan that work well over arrays that have more elements than threads per block, i.e. the array will not fit into shared memory?
This link mentions a scan implementation I see in all my searches, a Hillis Steele version, example 39-1 https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html.
Is the only option to do a segmented scan on sub arrays within the array and then do a "final scan" adding a magnitude value from the prior sub array to the next?
With or without shared memory, CUDA kernels execute in chunks (threadblocks) that can execute in any order. To take full advantage of the hardware, you must have multiple threadblocks in your kernel call, but this creates an uncertain execution order.
Because of this, a scan algorithm that works across a large array will necessarily have to work in threadblock-sized pieces (in some fashion). If we have multiple threadblocks, then a given threadblock has no way of knowing whether other threadblocks have finished their work on adjacent data. (Yes, there are contrived mechanisms to allow inter-threadblock communication, but these are fraught with difficulty and don't solve the problem on a large scale.)
The net effect of this is that algorithms like this generally imply a global sync of some sort, and the only safe-in-any-scenario global sync is the kernel launch. Threadblocks can do a portion of their work independently, but when it comes time to stitch the work of threadblocks together, we must wait until step A is completed across all threadblocks before proceeding with step B.
Therefore I think you'll find that most device-wide scan algorithms, including the chapter 39 GPU Gems example you linked, as well as thrust and cub will launch multiple kernels to get this job done, since the kernel launch gives a convenient global sync.
Note that we can certainly devise a scan that has individual threadblocks that "work on more elements than threads per block", but this does not ultimately solve our problem (unless we use only 1 threadblock), because we must launch multiple threadblocks to take full advantage of the hardware, and multiple threadblocks in the general case introduces the global sync necessity.
The cub and thrust implementations I mentioned are both open-source template libraries, so you can certainly study the code there if you wish (not a trivial undertaking). They do represent high-quality approaches designed and built by CUDA experts. You can also at a high level study their behavior quite easily using:
nvprof --print-gpu-trace ./mycode
to get a quick read on how many kernels are being launched and what data transfers may be occurring, or you can use nvvp, the visual profiler, to study this.
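The multi-kernel structure described above can be sketched on the CPU in plain C++. This is only an illustration of the three phases, not how thrust or cub are actually implemented. Each numbered loop stands in for one kernel launch, each `b` iteration for one threadblock, and `blocked_inclusive_scan` and `block_size` are names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum computed in three phases:
// (1) scan each block independently, (2) scan the per-block totals,
// (3) add each block's offset to all of its elements.
std::vector<int> blocked_inclusive_scan(const std::vector<int>& in,
                                        std::size_t block_size) {
    assert(block_size > 0);
    std::vector<int> out(in);
    if (in.empty()) return out;
    const std::size_t num_blocks = (in.size() + block_size - 1) / block_size;
    std::vector<int> block_offsets(num_blocks, 0);

    // Phase 1 ("kernel 1"): blocks are independent and may run in any order.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, in.size());
        for (std::size_t i = begin + 1; i < end; ++i) out[i] += out[i - 1];
        block_offsets[b] = out[end - 1];  // total of this block
    }

    // Phase 2 ("kernel 2"): exclusive scan of the block totals. It may only
    // start once every block has finished phase 1 (the global sync).
    int running = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const int total = block_offsets[b];
        block_offsets[b] = running;
        running += total;
    }

    // Phase 3 ("kernel 3"): again independent per block.
    for (std::size_t b = 1; b < num_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end = std::min(begin + block_size, in.size());
        for (std::size_t i = begin; i < end; ++i) out[i] += block_offsets[b];
    }
    return out;
}
```

For very large arrays the list of block totals can itself exceed one block, in which case phase 2 is applied recursively.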
I'm writing an Ant-Simulation.
The kernel performance is very bad. Compared to a standard C++ solution, it is at a big performance disadvantage.
I don't understand why. The operations in the kernel are mostly free of control structures (like if/else).
Kernels:
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Ant.cl
https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/Pheromon.cl
I made a benchmark, and the OpenCL Kernel Performance is very bad.
(Left Axis: Execution time in ms, Bottom Axis: number of simulated Ants)
Can you give me advice?
You can find the whole code in the git repo, if you are interested (the OpenCL stuff is happening here: https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/clInitFunctions.cpp).
Thanks :)
You have a lot of if/else. Can't you write it in a different way?
Don't go down the if/else path; it will not get you anywhere.
You need to make sure the GPU only executes useful instructions, not millions of if/else branches.
It may be better to keep track of only the ants that are alive in the grid and execute just those: store their coordinates and move them around.
You will obviously also need a map with the ant positions and status, so you will need a multi-kernel system.
In addition, you have a lot of useless memory transfers, starting with using int variables to store single booleans. This can make roughly 90% of the transferred data useless, which can bottleneck the GPU.
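To make the "only execute live ants" idea concrete, here is a small host-side C++ sketch. It is not taken from the linked repository: the `Ants` layout, its field names, and both functions are hypothetical. It stores the alive flag in one byte instead of an int and builds a compact list of live indices, so the update loop has no per-ant branch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical structure-of-arrays layout for the ants.
struct Ants {
    std::vector<int> x, y;
    std::vector<std::uint8_t> alive;  // one byte per flag instead of a 4-byte int
};

// Collect the indices of live ants once, so that later passes touch
// only entries that actually need work.
std::vector<std::size_t> live_indices(const Ants& ants) {
    std::vector<std::size_t> live;
    for (std::size_t i = 0; i < ants.alive.size(); ++i) {
        if (ants.alive[i]) live.push_back(i);
    }
    return live;
}

// Move only the live ants; there is no alive check inside the hot loop.
void step_live(Ants& ants, const std::vector<std::size_t>& live, int dx, int dy) {
    for (std::size_t idx : live) {
        ants.x[idx] += dx;
        ants.y[idx] += dy;
    }
}
```

On the GPU, the index list would be built by one kernel (or on the host) and consumed by a second kernel launched with one work-item per live ant.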
Your OpenCL kernels have ifs. Current GPUs are not designed for heavy branching. AFAIK an AMD GPU has n groups of 64 cores that share the same instruction pointer (they are executing the exact same part of the exact same statement). Ifs are implemented by stopping some of the cores, executing the true branch, then stopping the others and executing the false branch. Imagine this with nested ifs or loops.
I am trying to learn threading in C++, and just had a few questions about it (more specifically, about <thread>).
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads? If I were to create 8 threads instead of 4, would this run slower on a 4 core machine? What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
I apologize if these questions have already been answered; I've been looking for information about threading with <thread>, which was introduced in C++11, so I haven't been able to find too much about it.
The program in question is going to run many independent simulations.
If anybody has any insight about <thread> or just multithreading in general, I would be glad to hear it.
If you are performing pure calculations with no I/O - and those calculations are freestanding and not relying on results from other calculations happening in another thread, the maximum number of such threads should be the number of cores (possibly one or two less if the system is also loaded with other tasks).
If you are doing network I/O or similar, more threads are certainly a possibility.
If you are doing disk I/O, a single thread reading from the disk is often best, because disk reads from multiple threads lead to the read/write head moving around on the disk, which just makes things slower.
If you're using threads to make the code simpler, then the number of threads will probably depend on what you are doing.
It also depends on how "freestanding" each thread is. If they need to share data in complex ways, the sharing/waiting for other thread/etc, may well make it slower with more threads.
And as others have said, try to make your framework for this flexible and test different options. Preferably on multiple machines (unless you only have one kind of machine that you will ever run your code on).
In C++ there is no such thing as <threads.h> (that is the optional C11 threads header); you mean <thread>, the thread support library introduced in C++11.
The only answer to your question is "test and see". You can make your code flexible enough, so that it can be run by passing an N parameter (where N is the desired number of threads).
If you are CPU-bound, the answer will be very different from the case when you are IO bound.
So, test and see! For your reference, this link can be helpful. And if you are serious, then go ahead and get this book. Multithreading, concurrency, and the like are hairy topics.
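A minimal sketch of the "pass N as a parameter" idea with `<thread>`, assuming the work is a simple sum that can be cut into contiguous chunks. The function name `parallel_sum` is made up for this example, and passing 0 is used here as a convention for "ask the hardware".

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `num_threads` worker threads; each thread handles one
// contiguous chunk and writes its result into its own slot.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    if (num_threads == 0) {
        // hardware_concurrency() may return 0 when the value is unknown.
        num_threads = std::max(1u, std::thread::hardware_concurrency());
    }
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Timing calls such as `parallel_sum(data, 1)`, `parallel_sum(data, 4)` and `parallel_sum(data, 8)` on the target machine is the "test and see" step. On some toolchains the program has to be built with `-pthread`.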
Let's say the machine this code will run on has 4 cores, should I split up an operation into 4 threads?
If some portions of your code can be run in parallel, then yes, it can be made to go faster, but this is very tricky to do, since creating threads and moving data between them is expensive.
If I were to create 8 threads instead of 4, would this run slower on a 4 core machine?
It depends on the context switching it has to do. Sometimes execution will switch between threads very often and sometimes it will not, and this is very difficult to control. For CPU-bound work it will generally not run faster than 4 threads doing the same work.
What if the processor has hyperthreading, should I try and make the threads match the number of physical cores or logical cores?
From the programmer's point of view, hyperthreading works nearly the same as having more cores. By the time you notice the differences between a physical core and a logical core, you will have enough knowledge to work around the caveats.
Should I just not worry about the number of cores a machine has, and try to create as many threads as possible?
NO, threads are hard to manage, avoid them as much as you can.
The program in question is going to run many independent simulations.
You should look into OpenMP. It is an API for C, C++ and Fortran (compiler directives plus a runtime library) made to parallelize computation when your program can be split up. Do not confuse parallel with concurrent: concurrent simply means that multiple threads make progress during the same period, while parallel means running work simultaneously specifically to speed up your application. OpenMP may be overkill for your case, but it is a good thing to know when you are approaching parallel computing.
Don't think of the number of threads you need in terms of the machine you're running on. Threading is valuable any time you have a process where:
A: There is some very slow operation that the rest of the process need not wait for.
B: Certain functions can run faster than others and don't need to be executed in sequence.
C: There is a lot of non-order-dependent I/O going on (web servers).
These are just a few of the obvious examples of when launching a thread makes sense. The number of threads you launch therefore depends more on the number of these scenarios that pop up in your code than on the architecture you expect to run on. In fact, unless you're running a process that really needs to be optimized, it is likely that you can only eke out a few percentage points of additional performance by tuning the number of threads for your architecture, and on modern computers this number shouldn't vary much at all.
Let's take the I/O example, as it is the scenario that will see the most benefit. Let's assume that some program needs to interact with 200 users over the network. Network I/O is very slow, thousands of times slower than the CPU. If we were to handle each user in turn, we would waste thousands of processor cycles just waiting for data to come from the first user. Could we not have been processing information from more than one user at a time? In this case, since we have roughly 200 users, and we know the data we're waiting for arrives thousands of times slower than we can handle it (assuming we have a minimal amount of processing to do on this data), we should launch as many threads as the operating system allows. A web server that takes advantage of threading can serve hundreds more people per second than one that does not.
Now, let's consider a less I/O-intensive example, where we have several functions that execute in turn but are independent of one another, and some of them might run faster, say because there is disk I/O in one and no disk I/O in another. In this case our I/O is still fairly fast, but we will certainly waste processing time waiting for the disk to catch up. As such, we can launch a few threads, just to take advantage of our processing power and minimize wasted cycles. However, if we launch as many threads as the operating system allows, we are likely to cause scheduling and memory overhead (context switches, cache and branch-predictor thrashing, etc.), so launching too many threads in this case is actually suboptimal and can slow the program down. Note that in all of this I never mentioned how many cores the machine has! Not that optimizing for different architectures isn't valuable, but if you optimize for one architecture you are likely very close to optimal for most, assuming, again, that you're dealing with reasonably modern processors.
I think most people would say that large-scale threading projects are better supported by languages other than C++ (Go, Scala, CUDA). Task parallelism, as opposed to data parallelism, works better in C++. I would say that you should create as many threads as you have tasks to dole out, but if your problem is closer to data parallelism, consider using CUDA and linking it to the rest of your project at a later time.
NOTE: if you look at some sort of system monitor, you will notice that there are likely far more than 8 threads running. I looked at my computer and it had hundreds of threads running at once, so don't worry too much about the overhead. The main reason I chose to mention the other languages is that managing many threads in C++ or C tends to be very difficult and error-prone, not that the C++ program will run slower (which, unless you use CUDA, it probably won't).
In regards to hyper-threading let me comment on what I have found from experience.
In large dense matrix multiplication hyper-threading actually gives worse performance. For example Eigen and MKL both use OpenMP (at least the way I have used them) and get better results on my system which has four cores and hyper-threading using only four threads instead of eight. Also, in my own GEMM code which gets better performance than Eigen I also get better results using four threads instead of eight.
However, in my Mandelbrot drawing code I get a big performance increase using hyper-threading with OpenMP (eight threads instead of four). The general trend (so far) seems to be that if the code works well using schedule(static) in OpenMP then hyper-threading does not help and may even be worse. If the code works better using schedule(dynamic) then hyper-threading may help.
In other words, my observation so far is that if the run time of each thread can vary a lot hyper-threading can help. If the run time of each thread is constant then it may even make performance worse. But YOU have to test and see for each case.