I want to optimize my sequential code to make a gradient.
The main thread compute gradient for the border of the image and the other threads each one compute the gradient for a chunk of the image,
using 2 threads and the main thread give result better than sequential code but using more than 2 threads, but it consume more time and looks worst than the sequential.
I tried this code to speed up the gradient process:
for (int n = 0; n<iter_outer; n++)
{
int chunk = 1 + ((row - 1) / num_threads); //ceiling
int start=0;
int end=0;
//Launch a group of threads
for (int tid = 0; tid < num_threads; ++tid)
{
start = tid * chunk;
end = start + chunk;
t[tid] = thread(gradient, tid, g, vx, vy, row, col, 1, start, end);
}
//Launched from the main;
gradient(1, g, vx, vy, row, col,0, start, end);
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i)
{
t[i].join();
}
}
For any parallel execution you have to take into account Amdahl's law. It states that the time required to do some task in parallel does not scale linear with the number of processors:
t = ( (1-p) + p/n ) * T
where
T is the time needed for the task when it is done sequentially
p fraction of time that can be parallelized
n is the number of processors
Note that I used a slightly different formulation, but the statement is the same: The total speedup you get is limited by 1/(1-p) (e.g. if p=50% your parallel version will run maximum twice as fast).
In addition to that you have to consider that adding more parallelism in reality also adds more overhead (for synchronisation, setting up threads, etc), so a more realistic estimate is:
t = ( (1-p) + p/n ) * T + o*p
^^ overhead
This t as a function of the number of processors p has a minimum for a certain number of processors. Adding more processors to the problem will not result in a speedup but rather in a slow down, because the minimum time you need to do that p portion is zero, but the overhead you add by adding more processors increases unlimited.
This does not explain why you dont get a speedup in your case, but in general it is not a big surprise that simply adding more processors on a task does not always result in a speedup.
The parallel execution is a huge benefit for tasks that are easily splittable and the threads will not depend on themselves, however creating threads does come with a price. Let's imagine that a computer does nothing else but running your program (there is not OS and no other processes). The processor has 2 cores, they are processors in their own regard and can concurrently run any code. In case of just one thread the second core sits and does absolutely nothing hence there is potential for speed up. If you spawn the second thread (and give it 50% of the task) the second core now works as well and theoretically the speedup is 2 (ignoring the sequential parts and practical aspects). Now, lets make 4 threads. Wait... we have two processors and 4 threads? Yes, now each CPU does more than one thing and before changing the task on which it works the CPU has to switch contexts (change the values of registers to hold appropriate variables values, go to different code section and so on) this takes time and if you create way too many threads it will in fact take more time than doing the job. This might be a huge impact on any threaded application and should be noted before deciding on how many threads to run.
Note that this post is as simplification many modern CPU's can run efficiently more then one thread per core (ie. HyperThreading).
It seems like your CPU is dual-core. So, actually, only 2 tasks could be done parallel
Related
Is it possible to use the opencl data parallel kernel to sum vector of size N, without doing the partial sum trick?
Say that if you have access to 16 work items and your vector is of size 16. Wouldn't it not be possible to just have a kernel doing the following
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
sum[0] += input[idx];
}
When I've tried this, the sum variable doesn't get updated, but only overwritten. I've read something about using barriers, and i tried inserting a barrier before the summation above, it does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands performs a read of sum[0], then adds input[idx] to that before writing the result back to sum[0].
There will will be a data race as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
All threads may read the value of sum[0] before any other thread
writes their updated result back to sum[0], in which case the final
result of sum[0] would be the value of input[idx] of the thread
which executed the slowest. Since this will be different each time,
if you run the example multiple times you should see different
results.
Or, one thread may execute slightly more slowly, in which case
another thread may have already written an updated result back to
sum[0] before this slow thread reads sum[0], in which case there
will be an addition using the values of more than one thread, but not
all threads.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since you can use the parallelism most effectively, and can give O(log(N)) performance. Here is a good overview of reduction using OpenCL : Reduction Example.
Option 3 (and worst of all)
__kernel void summation(__global float* input, __global float* sum)
{
int idx = get_global_id(0);
for(int j=0;j<N;j++)
{
barrier(CLK_GLOBAL_MEM_FENCE| CLK_LOCAL_MEM_FENCE);
if(idx==j)
sum[0] += input[idx];
else
doOtherWorkWhileSingleCoreSums();
}
}
using a mainstream gpu, this should sum all of them as slow as a pentium mmx . This is just like computing on a single core and giving other cores other jobs but in a slower way.
A cpu device could be better than gpu for this kind.
I wrote some Naiive GEMM code and I am wondering why it is much slower than the equivalent single threaded GEMM code.
With a 200x200 matrix, Single Threaded: 7ms, Multi Threaded: 108ms, CPU: 3930k, 12 threads in thread pool.
template <unsigned M, unsigned N, unsigned P, typename T>
static Matrix<M, P, T> multiply( const Matrix<M, N, T> &lhs, const Matrix<N, P, T> &rhs, ThreadPool & pool )
{
Matrix<M, P, T> result = {0};
Task<void> task(pool);
for (auto i=0u; i<M; ++i)
for (auto j=0u; j<P; j++)
task.async([&result, &lhs, &rhs, i, j](){
T sum = 0;
for (auto k=0u; k < N; ++k)
sum += lhs[i * N + k] * rhs[k * P + j];
result[i * M + j] = sum;
});
task.wait();
return std::move(result);
}
I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kind of multi-threading scenarios.
When using multi-threading, you introduce a couple of potential overheads, the most common of which usually are
creation/cleanup of starting/ending threads
context switches when (number of threads) > (number of CPU cores)
locking of resources, waiting to obtain a lock
cache synchronization issues
The items 2. and 3. probably don't play a role in your example: you are using 12 threads on 12 (hyperthreading) cores, and your algorithm does not involve locks.
However, 1. might be relevant in your case: You are creating a total of 40000 threads, each of which multiplying and adding 200 values. I'd suggest to try a less fine-grained threading, maybe only splitting after the first loop. It's always a good idea not to split up the problem into pieces smaller than necessary.
Also 4. will very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread is writing to its own index position), you are very likely to provoke a large overhead of cache syncs.
"Why?" you might think, because you're writing to different places in memory. That's because a typical CPU cache is organized in cache lines, which on the current Intel and AMD CPU models are 64 bytes long. This is the smallest size that can be used for transfers from and to the cache, when something is changed. Now that all CPU cores are reading and writing to adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you are writing just 4 bytes (or 8, depending on the size of the data type you're using).
If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. If you're using 4byte data types, this would mean to skip 15 array elements for each 1 real data element. The cache issues will also improve when you make your threading less fine-grained, because every thread will access its own continuous region in memory practically without interfering with other threads' memory.
Edit: A more detailed description by Herb Sutter (one of the Gurus of C++) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273
Edit2: BTW, it's suggested to avoid std::move in the return statement, as this might get in the way of return-value-optimization and copy-elision rules, which the standard now demands to happen automatically. See Is returning with `std::move` sensible in the case of multiple return statements?
Multi threading means always synchronization, context switching, function call. This all adds up and costs CPU cycles, you can spend on the main task itself.
If you have just a third nested loop, you save all these steps and can do the computation inline instead of a subroutine, where you must setup a stack, call into, switch to a different thread, return the result and switch back to the main thread.
Multi threading is useful only, if these costs are small compared to the main task. I guess, you will see better results with multi threading, when the matrix is larger than just 200x200.
In general multi-threading is well applicable for tasks which take a lot of time, most favourably because of complexity and not device access. The loop you showed us takes to short to execute for it to be effectively parallelized.
You have to remember that there is much overhead with thread creation. There is also some (but significantly less) overhead with synchronization.
I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.
#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>
using namespace std;
int main(){
srand(time(NULL));//Seed random number generator
vector<int>v;//Create vector to hold random numbers in interval [0,9]
vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0
for(int i=0;i<1e9;++i)
v.push_back(rand()%10);//Push back random numbers [0,9]
clock_t c=clock();
#pragma omp parallel for
for(int i=0;i<v.size();++i)
d[v[i]]+=1;//Count number stored at v[i]
cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;
for(vector<int>::iterator i=d.begin();i!=d.end();++i)
cout<<*i<<endl;
return 0;
}
The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)
Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?
My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.
Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?
Your code exibits:
race conditions due to unsyncronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time
Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.
False sharing occurs when two threads write to memory locations that are close to one another as to result on the same cache line. This results in the cache line constantly bouncing from core to core or CPU to CPU in multisocket systems and excess of cache coherency messages. With 32 bytes per cache line 8 elements of the vector could fit in one cache line. With 64 bytes per cache line the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones. True sharing occurs at those elements that are accesses by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to elements of d (faster or slower, depending on your OpenMP runtime) or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:
#pragma omp parallel for
for(int i=0;i<v.size();++i)
#pragma omp atomic
d[v[i]]+=1;//Count number stored at v[i]
The second is implemented like this:
omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
omp_init_lock(&locks[i]);
#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
int vv = v[i];
omp_set_lock(&locks[vv]);
d[vv]+=1;//Count number stored at v[i]
omp_unset_lock(&locks[vv]);
}
for (int i = 0; i < 10; i++)
omp_destroy_lock(&locks[i]);
(include omp.h to get access to the omp_* functions)
I leave it up to you to come up with an implementation of the third option.
You are measuring elapsed time using clock() but it measures the CPU time, not the runtime. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicata an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increate in CPU time of 8 seconds (that is 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high resolution timer API) instead.
EDIT
Once your race condition is resolved via correct synchronization, then the following paragraph applies, before that your data race conditions unfortunately make speed comparisons mute:
Your program is slowing down because you have 10 possible outputs during the pragma section which are being accessed randomly. OpenMP cannot access any of those elements without a lock (which you would need to provide via synchronization) as a result and locking will cause your threads to have a higher overhead than you gain from counting in parallel.
A solution to make this speed up, is to instead make a local variable for each OpenMP thread which counts all of the 0-10 values that a particular thread has seen. Then sum those up in the master count vector. This will be easily parallelized and much faster as the threads don't need to lock on a shared write vector. I would expect a close to Nx speed up where N is the number of threads from OpenMP as there should be very limited locking required. This solution also avoids a lot of the race conditions currently in your code.
See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP
I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (It gets called thousands of times on a 6000 or so wide vector).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
for( int idx = 0; idx < num; idx++ )
{
pOut[idx] = pIn1[idx] + pIn2[idx];
}
return true;
}
Its a very simple loop but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD as I can easily get a near 4 times speed up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts on advice as to whether I can thread this nicely with OpenMP will be much appreciated!
Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thounsands/4 calls to VecAdd. Right now you have this algorithm:
List item
serial execution
(re) start threads
do short computation
synchronize threads (at the end of the for loop)
back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.
From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. When having nested parallelizations - parallel for within parallel for, parallel function within parallel function etc. - what happens on the inner parallelization?
I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?
(1) Nested parallelism in OpenMP:
http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
You need to turn on nested parallelism by setting OMP_NESTED or omp_set_nested because many implementations turn off this feature by default, even some implementations didn't support nested parallelism fully. If turned on, whenever you meet parallel for, OpenMP will create the number of threads as defined in OMP_NUM_THREADS. So, if 2-level parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.
Such nested parallelism will cause oversubscription, (i.e., the number of busy threads is greater than the cores), which may degrade the speedup. In an extreme case, where nested parallelism is called recursively, threads could be bloated (e.g., creating 1000s threads), and computer just wastes time for context switching. In such case, you may control the number of threads dynamically by setting omp_set_dynamic.
(2) An example of matrix-vector multiplication: the code would look like:
// Input: A(N by M), B(M by 1)
// Output: C(N by 1)
for (int i = 0; i < N; ++i)
for (int j = 0; j < M; ++j)
C[i] += A[i][j] * B[j];
In general, parallelizing inner loops while outer loops are possible is bad because of forking/joining overhead of threads. (though many OpenMP implementations pre-create threads, it still requires some to dispatch tasks to threads and to call implicit barrier at the end of parallel-for)
Your concern is the case of where N < # of CPU. Yes, right, in this case, the speedup would be limited by N, and letting nested parallelism will definitely have benefits.
However, then the code would cause oversubscription if N is sufficiently large. I'm just thinking the following solutions:
Changing the loop structure so that only 1-level loop exists. (It looks doable)
Specializing the code: if N is small, then do nested parallelism, otherwise don't do that.
Nested parallelism with omp_set_dynamic. But, please make it sure how omp_set_dynamic controls the number of threads and the activity of threads. Implementations may vary.
For something like dense linear algebra, where all the potential parallelism is already lain bare in one place in nice wide for loops, you don't need nested parallism -- if you do want to protect against the case of having (say) really narrow matricies where the leading dimension might be smaller than the number of cores, then all you need is the collapse directive which notionally flattens the multiple loops into one.
Nested parallelism is for those cases where the parallelism isn't all exposed at once -- say you want to do 2 simultaneous function evaluations, each of which could usefully utilize 4 cores, and you have an 8 core system. You call the function in a parallel section, and within the function definition there is an additional, say, parallel for.
At the outer level use the NUM_THREADS(num_groups) clause to set the number of threads to use. If your outer loop has a count N, and the number of processors or cores is num_cores, use num_groups = min(N,num_cores). At the inner level, you need to set the number of sub-threads for each thread group so that the total number of subthreads equals the number of cores. So if num_cores = 8, N = 4, then num_groups = 4. At the lower level each sub-thread should use 2 threads (since 2+2+2+2 = 8) so use the NUM_THREADS(2) clause. You can collect the number of sub-threads into an array with one element per outer region thread (with num_groups elements).
This strategy always makes optimal use of your cores. When N < num_cores some nested parallelisation occurs. When N >= num_cores the array of subthread counts contains all 1s and so the inner loop is effectively serial.