I'm trying to parallelize a loop with OpenMP where each iteration is independent (code sample below).
!$OMP PARALLEL DO DEFAULT(PRIVATE)
do i = 1, 16
  begin = omp_get_wtime()
  allocate(array(100000000))
  do j = 1, 100000000
    array(j) = j
  end do
  deallocate(array)
  end = omp_get_wtime()
  write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$OMP END PARALLEL DO
I would expect a linear speedup from this piece of code, with each iteration taking as much time as in the sequential version, since there are no possible race conditions or false sharing issues. However, I get the following results on a machine with two Xeon E5-2670 CPUs (8 cores each):
With only one thread:
It 1 Thread 0 time 0.435683965682983
It 2 Thread 0 time 0.435048103332520
It 3 Thread 0 time 0.435137987136841
It 4 Thread 0 time 0.434695959091187
It 5 Thread 0 time 0.434970140457153
It 6 Thread 0 time 0.434894084930420
It 7 Thread 0 time 0.433521986007690
It 8 Thread 0 time 0.434685945510864
It 9 Thread 0 time 0.433223009109497
It 10 Thread 0 time 0.434834957122803
It 11 Thread 0 time 0.435106039047241
It 12 Thread 0 time 0.434649944305420
It 13 Thread 0 time 0.434831142425537
It 14 Thread 0 time 0.434768199920654
It 15 Thread 0 time 0.435182094573975
It 16 Thread 0 time 0.435090065002441
And with 16 threads:
It 1 Thread 0 time 1.14882898330688
It 3 Thread 2 time 1.19775915145874
It 4 Thread 3 time 1.24406099319458
It 14 Thread 13 time 1.28723978996277
It 8 Thread 7 time 1.39885497093201
It 10 Thread 9 time 1.46112895011902
It 6 Thread 5 time 1.50975203514099
It 11 Thread 10 time 1.63096308708191
It 16 Thread 15 time 1.69229602813721
It 7 Thread 6 time 1.74118590354919
It 9 Thread 8 time 1.78044819831848
It 15 Thread 14 time 1.82169485092163
It 12 Thread 11 time 1.86312794685364
It 2 Thread 1 time 1.90681600570679
It 5 Thread 4 time 1.96404480934143
It 13 Thread 12 time 2.00902700424194
Any idea where the 4x factor in the iteration time comes from?
I have tested with both the GNU compiler and the Intel compiler with the O3 optimization flag.
The speed of the operation
do j=1, 100000000
array(j) = j
end do
is limited not by the ALU speed but by the memory bandwidth. Typically you now have several channels to main memory per CPU socket, but their number is still smaller than the number of cores.
The allocation and deallocation are also memory-access bound. I am not sure whether some synchronization is also needed inside allocate and deallocate.
For the same reason, the STREAM benchmark http://www.cs.virginia.edu/stream/ gives different speed-ups than purely arithmetically intensive problems.
I'm sure I've covered that topic before, but since I cannot seem to find my earlier posts, here I go again...
Large memory allocations on Linux (and possibly on other platforms) are handled via anonymous memory mappings. That is, some area gets reserved in the virtual address space of the process by calling mmap(2) with flags MAP_ANONYMOUS. The maps are initially empty - there is no physical memory backing them up. Instead, they are associated with the so-called zero page, which is a read-only frame in physical memory filled with zeros. Since the zero page is not writeable, an attempt to write into a memory location still backed by it results in a segmentation fault. The kernel handles the fault by finding a free frame in physical memory and associating it with the virtual memory page where the fault has occurred. This process is known as faulting the memory.
Faulting the memory is a relatively slow process as it involves modifications to the process' PTEs (page table entries) and flushes of the TLB (Translation Lookaside Buffer) cache. On multicore and multisocket systems it is even slower as it involves invalidation of the remote TLBs (known as remote TLB shootdown) via expensive inter-processor interrupts. Freeing an allocation results in removal of the memory mapping and reset of the PTEs. Therefore, the whole process is repeated during the next iteration.
Indeed, if you look at the effective memory bandwidth in your serial case, it is (assuming an array of double precision floats):
(100000000 * 8) / 0.435 = 1.71 GiB/s
Should your array be of REAL or INTEGER elements, the computed bandwidth should be cut in half. This is nowhere near the memory bandwidth that even the very first generation of the E5-2670 provides.
For the parallel case, the situation is even worse, since the kernel locks the page tables while faulting the pages. That's why the average bandwidth for a single thread varies from 664 MiB/s down to 380 MiB/s for a total of 7.68 GiB/s, which is almost an order of magnitude slower than the memory bandwidth of a single CPU (and your system has two, hence twice the available bandwidth!).
A completely different picture will emerge if you move the allocation outside of the loop:
!$omp parallel default(private)
allocate(array(100000000))
!$omp do
do i = 1, 16
  begin = omp_get_wtime()
  do j = 1, 100000000
    array(j) = j
  end do
  end = omp_get_wtime()
  write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$omp end do
deallocate(array)
!$omp end parallel
Now the second and later iterations will take about half the time (at least on E5-2650). This is because after the first iteration all the memory has already been faulted. The gain is even larger in the multithreaded case (increase the loop count to 32 so that each thread does two iterations).
The time to fault the memory depends heavily on the system configuration. On systems that have THP (transparent huge pages) enabled, the kernel automatically uses huge pages to implement large mappings. This reduces the number of faults by a factor of 512 (for huge pages of 2 MiB). The above cited speed gains of 2x for the serial case and 2.5x for the parallel one are from a system with THP enabled. The mere use of huge pages decreases the time for the first iteration on E5-2650 to 1/4 (1/8 if your array is of integers or single-precision floats) of the time in your case.
This is usually not the case for smaller arrays, which are allocated via subdivision of a larger and reused persistent memory allocation known as arena. Newer memory allocators in glibc usually have one arena per CPU core in order to facilitate lock-less multithreaded allocation.
That is the reason why many benchmark applications simply throw away the very first measurement.
Just to substantiate the above with real-life measurements, my E5-2650 needs 0.183 seconds to perform in serial one iteration over already faulted memory and 0.209 seconds to perform it with 16 threads (on a dual-socket system).
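To see the first-touch cost described above in isolation, here is a minimal C++ sketch (assuming Linux, so that mmap with MAP_ANONYMOUS is available, and an OpenMP-enabled compiler for omp_get_wtime()): the first pass over the fresh mapping pays the page faults, while the second pass reuses the already-faulted pages.
#include <sys/mman.h>   // mmap, munmap (Linux/POSIX)
#include <cstdio>
#include <cstddef>
#include <omp.h>
int main()
{
    const std::size_t n = 100000000;                 // same element count as in the question
    void *mem = mmap(nullptr, n * sizeof(double), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    double *a = static_cast<double *>(mem);
    for (int pass = 0; pass < 2; ++pass)
    {
        double t = omp_get_wtime();
        for (std::size_t j = 0; j < n; ++j)
            a[j] = static_cast<double>(j);           // pass 0 faults the pages, pass 1 does not
        std::printf("pass %d: %.3f s\n", pass, omp_get_wtime() - t);
    }
    munmap(mem, n * sizeof(double));
    return 0;
}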
They're not independent. Allocate/deallocate will be sharing the heap.
Try allocating a bigger array outside of the parallel section, then timing just the memory access.
It's also a non-uniform memory architecture (NUMA): if all the allocations come from one CPU's local memory, accesses from the other CPU will be relatively slow because they get routed through the first CPU. This is tedious to work around.
I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether, between different runs, I can do something to get equal measurements for the same data (I want to measure the performance of different sorting algorithms on arrays of different sizes containing POD or some custom objects). I know that the current process can be interrupted during execution because of interrupts or I/O operations. I'm not doing any I/O, so it's only interrupts that may affect my measurement, and I assume the kernel also gives my process only a limited time slice, so it will be scheduled away as well.
How do people make accurate measurements through measuring the time of execution of a specific piece of code?
Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data, see Why is it faster to process a sorted array than an unsorted array?) and on the data set size (whether it fits into the L1, L2 or L3 cache, see http://igoro.com/archive/gallery-of-processor-cache-effects/).
That can hugely influence your measured times.
Also, the order of measurements can play a critical role. If you execute the sorting algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because the algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which has an entirely different access performance. That becomes very noticeable if you sort larger (many MB) arrays.
Below are the touch times of a 2 GB array from different threads, printed for the first and the second touch. Each page (4 KB) of memory is touched only once.
Threads Size_MB Time_ms us/Page MB/s Scenario
1 2000 355 0.693 5634 Touch 1
1 2000 11 0.021 N.a. Touch 2
2 2000 276 0.539 7246 Touch 1
2 2000 12 0.023 N.a. Touch 2
3 2000 274 0.535 7299 Touch 1
3 2000 13 0.025 N.a. Touch 2
4 2000 288 0.563 6944 Touch 1
4 2000 11 0.021 N.a. Touch 2
// From the compiler's point of view Touch is a no-op with no observable side effect.
// That is true for the data content, but performance-wise there is a huge
// difference. Turn optimizations off to prevent the compiler from outsmarting us.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
char *pB = (char *)p;
char tmp;
for (size_t i = 0; i < N; i += 4096)
{
tmp = pB[i];
}
}
#pragma optimize("", on)
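A hedged usage sketch (assuming Touch is turned into a free function and built with the optimization pragma above): time the first and the second touch of a freshly allocated 2 GB block, as in the table.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
void Touch(void *p, std::size_t N);   // the function shown above, as a free function
int main()
{
    const std::size_t N = 2000ull * 1024 * 1024;      // 2000 MB, as in the table
    void *p = std::malloc(N);
    if (!p) return 1;
    for (int pass = 1; pass <= 2; ++pass)
    {
        auto t0 = std::chrono::steady_clock::now();
        Touch(p, N);                                   // pass 1 faults the pages, pass 2 does not
        auto t1 = std::chrono::steady_clock::now();
        std::printf("Touch %d: %.0f ms\n", pass,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::free(p);
    return 0;
}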
To truly judge the performance of an algorithm it is not sufficient to perform time measurements; you also need a profiler (e.g. the free Windows Performance Toolkit, or Intel's VTune, which is not free) to ensure that you have measured the right thing and not something entirely different.
I just went to a conference where Andrei Alexandrescu spoke on Fastware, and he was addressing this exact issue: how to measure speed. Apparently taking the mean is a bad idea, BUT measuring many times is a great idea. So with that in mind you measure a million times and remember the smallest measurement, because that is where you get the least amount of noise.
Means are awful because you're actually adding more of the noise's weight to the actual speed you're measuring. (These are not the only things you should consider when evaluating code speed, but they are a good start; there's even more horrid stuff regarding where the code executes, and the overhead of starting execution on one core and finishing on another, but that's a different story and I don't think it applies to my sort.)
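A minimal sketch of that idea, assuming the workload can be wrapped in a callable:
#include <algorithm>
#include <chrono>
#include <limits>
// Run the workload many times and keep the smallest measurement - the run that
// was least disturbed by interrupts, scheduling and cold caches.
template <class F>
double min_time_seconds(F&& work, int repetitions = 1000)
{
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repetitions; ++r)
    {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}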
A good joke was: if you put Bill Gates into a bus, on average everybody in that bus is a millionaire :))
Cheers and thanks to all who provided input.
I have a data structure (a vector) whose elements have to be parsed by a function, and the elements can be parsed by different threads.
The parsing method is the following:
void ConsumerPool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}
Where:
n_threads is the total number of threads
id is the (unique) index of the current thread
and the threads are created as follows:
std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
threads.emplace_back(&ConsumerPool::parse, this, n_threads, i);
Unfortunately, even though this method works, the performance of my application decreases if the number of threads is too high. I would like to understand why the performance decreases even though there's no synchronization between these threads.
Following are the elapsed times (between the threads start and the last join() return) according to the number of threads used:
2 threads: 500 ms
3 threads: 385 ms
4 threads: 360 ms
5 threads: 475 ms
6 threads: 580 ms
7 threads: 635 ms
8 threads: 660 ms
The time necessary for thread creation is always between 1 and 2 ms.
The software has been tested by using its release build. Following is my configuration:
2x Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Maximum speed: 2.26 GHz
Sockets: 2
Cores: 8
Logical processors: 8
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 8.0 MB
EDIT:
What the parse() function does is the following:
// data shared between threads (around 300k elements)
std::vector<std::unique_ptr<Foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;
void parse(int idx)
{
    auto &p = vfoo[idx]; // reference, since unique_ptr is not copyable
    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}
The processor you are using, the Intel(R) Xeon(R) CPU E5507, has only 4 cores (see http://ark.intel.com/products/37100/Intel-Xeon-Processor-E5507-4M-Cache-2_26-GHz-4_80-GTs-Intel-QPI).
So having more than 4 threads causes the slowdown because of context switching, as is visible from the data you have provided.
You can read more about context switching at the following link: https://en.wikipedia.org/wiki/Context_switch
update:
We still don't have a lot of info about the memory access patterns of parse(), and how much time it spends reading input data from memory vs. how much time spent writing/reading private scratch memory.
You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free / allocate/free pattern by satisfying allocations from memory just freed by that thread, so it's probably still hot in private L1 or L2 cache on that core.
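A hedged sketch of that idea (parse_node_into and the buffer size are hypothetical, since the real parse() isn't shown): keep one reusable scratch buffer per thread instead of allocating and freeing for every node.
#include <cstddef>
#include <vector>
void parse_node_into(std::size_t idx, std::vector<char> &scratch);  // hypothetical helper
void parse_range(std::size_t begin, std::size_t end)
{
    // thread_local: one buffer per thread, allocated once and then reused, so it
    // stays hot in that core's private caches and never touches the global allocator.
    thread_local std::vector<char> scratch;
    scratch.reserve(64 * 1024);                 // grow once to a size that fits most nodes
    for (std::size_t idx = begin; idx < end; ++idx)
    {
        scratch.clear();                        // keeps the capacity, no deallocation
        parse_node_into(idx, scratch);          // parse into the reused buffer
    }
}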
Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be code that reads some memory.
Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into issues with threads slowing each other down if a non-HT-aware OS schedules two on two logical cores of the same physical core.
You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket creates more pressure on that. The relevant perf events will be something like LLC_something, IIRC.
You should definitely look at L1/L2 misses, and see how those scale with number of threads, and how that changes with strided vs. contiguous access to node[].
There are other perf counters you can check to look for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really just look for any perf events that change with number of threads; that could point the way towards an explanation.
A multi-socket system like your 2-socket Nehalem will have NUMA (Non-uniform_memory_access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.
So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
This may partially explain less-than-perfect scaling with the number of threads. Although, more likely, it has nothing to do with it if @AntonMalyshev's answer didn't help. Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2 / L1 cache efficiency.
node[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64 byte cache line it touches in node[]). However, each thread presumably touches way more memory in the pointed-to data structures and strings. If node entries point to monotonically-increasing positions in other data structures and the string, then the strided access to node[] creates non-contiguous access patterns to most of the memory touched by the thread.
One possible benefit of the strided access pattern: Strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while other threads catch up because they see L3 hits. (Unless something happens that lets one thread get too far behind, like the OS de-scheduling it for a time slice.)
So maybe L3 vs. RAM bandwidth / latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all cores at once, even if they all hit in L3.)
This argument applies to everything pointed to by node[] only if contiguous ranges of node[] point to contiguous ranges of other memory.
Try to parse contiguous ranges of elements inside each thread, e.g. change
for (size_t idx = id; idx < nodes.size(); idx += n_threads)
{
// parse node
parse(nodes[idx]);
}
to
for (size_t idx = id * nodes.size()/n_threads; idx < (id+1)*nodes.size()/n_threads; idx++)
{
// parse node
parse(nodes[idx]);
}
That should be better for caching.
Also, it's better to precompute size = (id+1)*nodes.size()/n_threads and use it in the loop's stop condition instead of computing it on each iteration.
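For example (same loop as above, written as a fragment with the bounds hoisted out of the loop condition):
const size_t begin = id * nodes.size() / n_threads;
const size_t end   = (id + 1) * nodes.size() / n_threads;
for (size_t idx = begin; idx < end; idx++)
{
    // parse node
    parse(nodes[idx]);
}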
For CPU bound processes, adding additional threads beyond the number of available cores will decrease overall performance. The decrease is due to scheduling and other kernel interactions. For such situations, the optimal number of threads is often the number of cores minus 1. The remaining core will be used by the kernel and other running processes.
I address this topic in a bit more detail here A case for minimal multithreading
Looking at the hardware and the numbers a bit more closely, I suspect you are hitting hyper-threading contention. For a 4-core CPU, 8 cores are simulated with hyper-threading. For a fully CPU-bound process, hyper-threading will actually decrease performance. There is some interesting discussion here: Hyper-threading, and details at Wikipedia hyper-threading.
2 CPUs (4 cores each)
Threads run in a shared memory space. The performance decrease is caused by moving shared data between the CPUs (threads cannot directly access the cache of a different CPU; more threads => more data movement => bigger performance decrease).
I'm currently optimizing parts of my code and am therefore doing some benchmarking.
I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this code is not parallelizable as it stands, I expanded the assignment into
!$OMP PARALLEL DO
do j = 1, N
  do i = 1, N
    A(i,j) = T(i,j) * A(i,j)
  end do
end do
!$OMP END PARALLEL DO
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing is that regardless of the number of threads (between 1 and 4) the execution time stays more or less the same (+-10%), but the CPU time increases with a greater number of threads. That made me think that all the threads do the same work (because I made a mistake regarding OpenMP) and therefore need the same time.
But on another computer (that has 96 CPU cores available) the program behaves as expected: With increasing thread number the execution time decreases. Surprisingly the CPU time decreases as well (up to ~10 threads, then rising again).
It might be that there are different versions of OpenMP or gfortran installed. If this could be the cause it'd be great if you could tell me how to find that out.
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE
Note that though this is quite standard OpenMP code, some compilers, and most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by the means of the SINGLE construct, giving therefore no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.

Also, the multiplication code has very low compute intensity - it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte. It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory should provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This quantity more or less exceeds the real memory bandwidth of current generation x86 CPUs. And this is only for a single core with non-vectorised code.

Adding more cores and threads would not increase the performance since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer since having more threads adds more overhead. When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), thus the performance increases, but only because there are more sockets and not because there are more cores.
Let's say we have this sorted array
0 1 1 1 1 2 2 2 2 2 3 10 10 10
I would like to find efficiently the positions where an element changes. For example in our array the positions are the following:
0 1 5 10 11
I know there are a few libraries (Thrust) that can achieve this; however, I would like to create my own efficient implementation for educational purposes.
You can find the whole code here: http://pastebin.com/Wu34F4M2
It includes validation as well.
The kernel is the following function:
__global__ void findPositions(int *device_data,
                              int totalAmountOfValuesPerThread, int *pos_ptr, int N)
{
    int res1 = 9999999;
    int res2 = 9999999;
    int index = totalAmountOfValuesPerThread * (threadIdx.x + blockIdx.x * blockDim.x);
    int start = index;            // from this index each thread will begin searching
    if (start < N) {              // if the index is out of bounds do nothing
        if (start != 0) {         // if start is not at the beginning, check the previous value
            if (device_data[start - 1] != device_data[start]) {
                res1 = start;
            }
        }
        else res1 = start;        // since it's the beginning we update
                                  // the first output buffer for the thread
        pos_ptr[index] = res1;
        start++;                  // move to the next place and see if the
                                  // second output buffer needs updating or not
        if (start < N && device_data[start] != device_data[start - 1]) {
            res2 = start;
        }
        if ((index + 1) < N)
            pos_ptr[index + 1] = res2;
    }
}
I launch enough threads so that each thread works with two values of the array.
device_data has all the numbers stored in the array
totalAmountOfValuesPerThread in this case is the total amount of values that each thread will have to work with
pos_ptr has the same length as device_data, each thread writes the results of the buffers to this device_vector
N is the total amount of numbers in the device_data array
In the output buffers called res1 and res2 each thread either saves a position that has not been found before, or it leaves it as it is.
Example:
0 <---- thread 1
1
1 <---- thread 2
1
2 <---- thread 3
2
3 <---- thread 4
The output buffers of each thread, assuming that the big number 9999999 is inf would be:
thread1 => {res1=0, res2=1}
thread2 => {res1=inf, res2=inf}
thread3 => {res1=4, res2=inf}
thread4 => {res1=6, res2=inf}
Each thread will update the pos_ptr device_vector so this vector will have as a result the following:
pos_ptr =>{0, 1, inf, inf, 4, inf, 6, inf}
After finishing the kernel, I sort the vector by using the library Thrust and save the results inside a host vector called host_pos. So the host_pos vector will have the following:
host_pos => {0, 1, 4, 6, inf, inf, inf, inf}
This implementation is horrible because:
A lot of branches are created inside the kernel, so inefficient warp handling will occur
I assume that each thread works with only 2 values, which is very inefficient because too many threads are created
I create and transfer a device_vector which is as big as the input and also resides in global memory. Each thread accesses this vector in order to write its results, which is very inefficient.
Here is a test case for an input of size 1 000 000 with 512 threads in each block.
CPU time: 0.000875688 seconds
GPU time: 1.35816 seconds
Another test case for an input of size 10 000 000:
CPU time: 0.0979209
GPU time: 1.41298 seconds
Notice that the CPU version became almost 100 times slower while the GPU gave almost the same times.
Unfortunately my GPU doesn't have enough memory for larger inputs, so let's try 50 000 000:
CPU time: 0.459832 seconds
GPU time: 1.59248 seconds
As I understand it, for huge inputs my GPU implementation might become faster; however, I believe a much more efficient approach could make the implementation a lot faster even for smaller inputs.
What design would you suggest in order to make my algorithm run faster? Unfortunately I can't think of anything better.
Thank you in advance
I'm not really understanding any of the reasons why you think this is horrible. Too many threads? What is the definition of too many threads? One thread per input element is a very common thread strategy in CUDA programs.
Since you seem to be willing to consider using thrust for much of the work (e.g. you're willing to call thrust::sort after you're done marking the data) and taking into account BenC's observation (that you are spending a lot of time trying to optimize 3% of your total run time) maybe you can have a much bigger impact by just making better use of thrust.
Suggestions:
Make your kernel as simple as possible: just have each thread look at one element and decide whether to make a marker based on comparing it with the previous element (a minimal sketch follows below). I'm not sure any significant gains are made by having each thread handle 2 elements. Alternatively, have a kernel that creates a much smaller number of blocks, but have them loop through your overall device_data array, marking the boundaries as they go. This might make a noticeable improvement in your kernel. But again, optimizing 3% is not necessarily where you want to spend a lot of effort.
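A minimal sketch of such a one-thread-per-element marking kernel (names are illustrative; INF stands for the large sentinel value used in the question):
// Each thread looks at exactly one element and writes either its index (a run
// boundary) or the sentinel. Reads of data[i] and data[i-1] are coalesced.
__global__ void markChanges(const int *data, int *pos, int N, int INF)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    bool change = (i == 0) || (data[i] != data[i - 1]);
    pos[i] = change ? i : INF;
}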
Your kernel is going to be memory bandwidth bound. Therefore rather than worry about things like branching, I would focus on efficient use of memory, i.e. minimizing reads and writes to global memory, and look for opportunities to make sure your reads and writes are coalesced. Test your kernel independently of the rest of the program, and use the visual profiler to tell you if you've done a good job on memory operations.
Consider using shared memory. By having every thread load its respective element into shared memory, you can easily coalesce all the global reads (and make sure you read every global element only once, or almost only once) and then operate out of shared memory, i.e. have each thread compare its element to the previous one in shared memory. A sketch of this variant follows below.
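A sketch of that shared-memory variant (again with illustrative names; launch it with blockDim.x * sizeof(int) bytes of dynamic shared memory):
// Each block stages its elements in shared memory; interior threads then read
// their left neighbour from shared memory instead of issuing a second global load.
__global__ void markChangesShared(const int *data, int *pos, int N, int INF)
{
    extern __shared__ int tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        tile[threadIdx.x] = data[i];
    __syncthreads();
    if (i >= N) return;

    bool change;
    if (i == 0)                 change = true;                          // first element starts a run
    else if (threadIdx.x == 0)  change = (data[i - 1] != tile[0]);      // block boundary: one global read
    else                        change = (tile[threadIdx.x - 1] != tile[threadIdx.x]);
    pos[i] = change ? i : INF;
}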
Once you've created your pos_ptr array, note that apart from the inf values it is already sorted. So maybe there's a smarter option than thrust::sort followed by trimming the array to produce the result. Take a look at thrust functions like remove_if and copy_if. I haven't benchmarked it, but my guess is they will be significantly less expensive than sort followed by trimming the array (removing the inf values). A hedged Thrust sketch of this idea is shown below.
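For instance, a Thrust-only sketch (the function name is illustrative; the input is assumed to be already sorted on the device): mark boundaries with adjacent_difference, then copy_if the indices whose difference is non-zero, so no sort and no trimming of sentinel values is needed.
#include <thrust/adjacent_difference.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

struct is_nonzero
{
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Returns the indices at which the sorted input changes value.
thrust::device_vector<int> change_positions(const thrust::device_vector<int> &data)
{
    const int n = static_cast<int>(data.size());
    thrust::device_vector<int> pos;
    if (n == 0) return pos;

    // diff[0] = data[0]; diff[i] = data[i] - data[i-1] for i > 0
    thrust::device_vector<int> diff(n);
    thrust::adjacent_difference(data.begin(), data.end(), diff.begin());
    diff[0] = 1;                                // position 0 always starts a run

    pos.resize(n);
    auto pos_end = thrust::copy_if(thrust::counting_iterator<int>(0),
                                   thrust::counting_iterator<int>(n),
                                   diff.begin(),            // stencil: non-zero means "value changed"
                                   pos.begin(),
                                   is_nonzero());
    pos.resize(pos_end - pos.begin());
    return pos;
}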
I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.
#include<iostream>
#include<vector>
#include<cstdlib>
#include<ctime>
#include<omp.h>
using namespace std;
int main(){
    srand(time(NULL));//Seed random number generator

    vector<int>v;//Create vector to hold random numbers in interval [0,9]
    vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0

    for(int i=0;i<1e9;++i)
        v.push_back(rand()%10);//Push back random numbers [0,9]

    clock_t c=clock();

    #pragma omp parallel for
    for(int i=0;i<v.size();++i)
        d[v[i]]+=1;//Count number stored at v[i]

    cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

    for(vector<int>::iterator i=d.begin();i!=d.end();++i)
        cout<<*i<<endl;

    return 0;
}
The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)
Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?
My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.
Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?
Your code exhibits:
race conditions due to unsynchronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time
Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.
False sharing occurs when two threads write to memory locations that are close enough to one another to end up on the same cache line. This results in the cache line constantly bouncing from core to core, or from CPU to CPU in multisocket systems, and an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector can fit in one cache line. With 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones.

True sharing occurs at those elements that are accessed by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:
#pragma omp parallel for
for(int i=0;i<v.size();++i)
#pragma omp atomic
d[v[i]]+=1;//Count number stored at v[i]
The second is implemented like this:
omp_lock_t locks[10];

for (int i = 0; i < 10; i++)
    omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
    int vv = v[i];
    omp_set_lock(&locks[vv]);
    d[vv]+=1;//Count number stored at v[i]
    omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
    omp_destroy_lock(&locks[i]);
(include omp.h to get access to the omp_* functions)
I leave it up to you to come up with an implementation of the third option.
You are measuring elapsed time using clock(), but it measures CPU time, not wall-clock run time. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increase in CPU time of 8 seconds (that is 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high resolution timer API) instead.
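For example, a minimal change to the timing in the question (a fragment, not a full program):
#include <omp.h>    // omp_get_wtime()

double t0 = omp_get_wtime();        // wall-clock time, not accumulated CPU time

#pragma omp parallel for
for (int i = 0; i < v.size(); ++i)  // counting loop from the question (still needs the
    d[v[i]] += 1;                   // synchronisation fixes discussed above)

cout << "Seconds: " << omp_get_wtime() - t0 << endl;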
EDIT
Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make speed comparisons moot:
Your program is slowing down because you have 10 possible outputs during the pragma section which are being accessed randomly. OpenMP cannot access any of those elements without a lock (which you would need to provide via synchronization); as a result, the locking causes your threads to have a higher overhead than you gain from counting in parallel.
A solution to speed this up is to instead make a local variable for each OpenMP thread which counts all of the values 0-9 that the particular thread has seen. Then sum those up into the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect a close to Nx speed-up, where N is the number of OpenMP threads, as there should be very limited locking required. This solution also avoids a lot of the race conditions currently in your code.
See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP
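A minimal sketch of the local-histogram approach described above, assuming v and d as declared in the question:
#include <omp.h>
#include <vector>
// Each thread counts into its own private histogram; the short merge at the end
// is the only synchronised part, so there is no contention on d inside the loop.
void count_parallel(const std::vector<int> &v, std::vector<int> &d)
{
    #pragma omp parallel
    {
        std::vector<int> local(10, 0);          // private per-thread counts
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            ++local[v[i]];
        #pragma omp critical                     // merge once per thread
        {
            for (int k = 0; k < 10; ++k)
                d[k] += local[k];
        }
    }
}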