I'm currently optimizing parts of my code and am therefore doing some benchmarking.
I have NxN matrices A and T and want to multiply them elementwise and store the result in A again, i.e. A = A*T. Since this array assignment is not parallelized as it stands, I expanded it into explicit loops:
!$OMP PARALLEL DO
do j = 1, N
  do i = 1, N
    A(i,j) = T(i,j) * A(i,j)
  end do
end do
!$OMP END PARALLEL DO
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing is that, regardless of the number of threads (between 1 and 4), the execution time stays more or less the same (±10%), while the CPU time increases with the number of threads. That made me suspect that all the threads perform the same full work (because I made a mistake regarding OpenMP) and therefore need the same time.
But on another computer (with 96 CPU cores available) the program behaves as expected: with increasing thread count the execution time decreases. Surprisingly, the CPU time decreases as well (up to ~10 threads, then rises again).
It might be that the two machines have different versions of OpenMP or gfortran installed. If that could be the cause, it would be great if you could tell me how to find that out.
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE
Note that though this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of the SINGLE construct, therefore giving no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.

Also, the multiplication code has very low compute intensity: it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.

It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle, and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory should provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This more or less exceeds the real memory bandwidth of current-generation x86 CPUs. And this is only for a single core with non-vectorised code. Adding more cores and threads would not increase the performance, since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer since having more threads adds more overhead.

When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), thus the performance increases, but only because there are more sockets and not because there are more cores.
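For convenience, here is a small standalone sketch (in C++, purely for illustration; N = 729, the 6 FLOPs per complex multiply and the hypothetical 2 GHz / 2-FLOP-per-cycle core are the assumptions used above) that reproduces these back-of-the-envelope numbers:

#include <cstdio>

int main() {
    const double n = 729;                  // matrix dimension used in the question
    const double elem_bytes = 2 * 8;       // double-precision complex = 16 bytes
    const double matrix_mib = n * n * elem_bytes / (1024 * 1024);

    const double flops = 6;                // 4 mul + 1 add + 1 sub per complex multiply
    const double oi        = flops / 48;   // 32 bytes read (A and T) + 16 bytes written back
    const double oi_duplex = flops / 32;   // reads only, if writes can overlap with reads

    const double core_flops = 2.0e9 * 2;   // hypothetical 2 GHz core retiring 2 FLOP/cycle
    const double needed_bw  = core_flops / oi_duplex;   // bytes/s needed to keep it busy

    std::printf("one matrix: %.1f MiB\n", matrix_mib);
    std::printf("intensity: %.3f FLOP/byte (%.3f if memory is full duplex)\n", oi, oi_duplex);
    std::printf("bandwidth needed to feed one scalar core: %.1f GB/s\n", needed_bw / 1e9);
}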
I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container holds different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);

    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz with 24 cores, 32 kB of L1 cache, 1 MB of L2 cache and 33 MB of L3 cache. The total memory bandwidth is 115 GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, the same processor). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: The meaning of the weird peak at 6 threads/problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clear this up?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
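As a concrete illustration, such an experiment could look like the sketch below (C++ with OpenMP; the kernel, the array size and the thread counts are placeholders rather than your actual code):

#include <cstddef>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const std::ptrdiff_t n = 1 << 26;          // fixed problem size (placeholder, ~1 GiB of data)
    std::vector<double> a(n, 1.0), b(n, 2.0);

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        omp_set_num_threads(threads);

        const double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < n; ++i)
            a[i] += b[i];                      // bandwidth-bound kernel
        const double t1 = omp_get_wtime();

        // roughly 3 * 8 bytes of traffic per element: read a, read b, write a
        std::printf("%2d threads: %.3f s, ~%.1f GB/s\n",
                    threads, t1 - t0, 3.0 * 8.0 * n / (t1 - t0) / 1e9);
    }
}

Plotting the resulting GB/s against the thread count makes it obvious where memory bandwidth, rather than core count, becomes the limit.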
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
Say I have a toy loop like this
float x[N];
float y[N];
for (int i = 1; i < N-1; i++)
    y[i] = a*(x[i-1] - x[i] + x[i+1]);
And I assume my cache line is 64 bytes (i.e. big enough). Then, per iteration, I basically have 2 accesses to RAM and 3 FLOPs:
1 (cached) read access: loading all 3 x[i-1], x[i], x[i+1]
1 write access: storing y[i]
3 FLOP (1 mul, 1 add, 1 sub)
The operational intensity is therefore
OI = 3 FLOP / (2 * 4 byte) = 0.375 FLOP/byte
Now what happens if I do something like this
float x[N];
for (int i = 1; i < N-1; i++)
    x[i] = a*(x[i-1] - x[i] + x[i+1]);
Note that there is no y anymore. Does this mean that I now have a single RAM access per iteration
1 (cached) read/write: loading x[i-1], x[i], x[i+1], storing x[i]
or still 2 RAM accesses
1 (cached) read: loading x[i-1], x[i], x[i+1]
1 (cached) write: storing x[i]
It matters because the operational intensity OI would be different in each case. Can anyone clarify this? Thanks
Disclaimer: I had never heard of the roofline performance model until today. As far as I can tell, it attempts to calculate a theoretical bound on the "arithmetic intensity" of an algorithm, which is the number of FLOPs per byte of data accessed. Such a measure may be useful for comparing similar algorithms as the size of N grows large, but is not very helpful for predicting real-world performance.
As a general rule of thumb, modern processors can execute instructions much more quickly than they can fetch/store data (this becomes drastically more pronounced as the data starts to grow larger than the size of the caches). So contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than a loop with lower arithmetic intensity; what matters most as N scales is the total amount of data touched (this will hold true as long as memory remains significantly slower than the processor, as is true in common desktop and server systems today).
In short, x86 CPUs are unfortunately too complex to be accurately described with such a simple model. An access to memory goes through several layers of caching (typically L1, L2, and L3) before hitting RAM. Maybe all your data fits in L1 -- the second time you run your loop(s) there could be no RAM accesses at all.
And there's not just the data cache. Don't forget that code is in memory too and has to be loaded into the instruction cache. Each read/write is also done from/to a virtual address, which is supported by the hardware TLB (that can in extreme cases trigger a page fault and, say, cause the OS to write a page to disk in the middle of your loop). All of this is assuming your program is hogging the hardware all to itself (in non-realtime OSes this is simply not the case, as other processes and threads are competing for the same limited resources).
Finally, the execution itself is not (directly) done with memory reads and writes, but rather the data is loaded into registers first (then the result is stored).
How the compiler allocates registers, whether it attempts loop unrolling or auto-vectorization, the instruction scheduling model (interleaving instructions so that data dependencies do not stall the pipeline), etc. will also all affect the actual throughput of the algorithm.
So, finally, depending on the code produced, the CPU model, the amount of data processed, and the state of various caches, the latency of the algorithm will vary by orders of magnitude. Thus, the operational intensity of a loop cannot be determined by inspecting the code (or even the assembly produced) alone, since there are many other (non-linear) factors in play.
To address your actual question, though, as far as I can see by the definition outlined here, the second loop would count as a single additional 4-byte access per iteration on average, so its OI would be θ(3N FLOPS / 4N bytes). Intuitively, this makes sense because the cache already has the data loaded, and the write can change the cache directly instead of going back to main memory (the data does eventually have to be written back, however, but that requirement is unchanged from the first loop).
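If you want to back the model up with a measurement, you can simply time both variants and convert the runtimes into effective rates yourself. Here is a rough, standalone C++ sketch (the array size is arbitrary, and which bytes-per-iteration model you divide by is exactly the question at hand):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 24;                  // large enough to spill out of the caches
    const float a = 0.5f;
    std::vector<float> x(n, 1.0f), y(n, 0.0f);

    auto time_it = [](auto&& body) {
        const auto t0 = std::chrono::steady_clock::now();
        body();
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    const double t_separate = time_it([&] {
        for (int i = 1; i < n - 1; ++i)
            y[i] = a * (x[i - 1] - x[i] + x[i + 1]);   // writes go to a second array
    });
    const double t_inplace = time_it([&] {
        for (int i = 1; i < n - 1; ++i)
            x[i] = a * (x[i - 1] - x[i] + x[i + 1]);   // writes go back into x
    });

    std::printf("separate output: %.3f s, in place: %.3f s (checksum %g)\n",
                t_separate, t_inplace, x[n / 2] + y[n / 2]);
    // Divide whatever bytes-per-iteration model you believe in by these times
    // to get an effective bandwidth for each variant.
}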
I have a data structure (a vector) whose elements have to be parsed by a function, where different elements can be parsed by different threads.
Following is the parsing method:
void ConsumerPool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}
Where:
n_threads is the total number of threads
id is the (unique) index of the current thread
and the threads are created as follows:
std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
threads.emplace_back(&ConsumerPool::parse, this, n_threads, i);
Unfortunately, even though this method works, the performance of my application decreases if the number of threads is too high. I would like to understand why the performance decreases even though there is no synchronization between these threads.
Following are the elapsed times (between the threads start and the last join() return) according to the number of threads used:
2 threads: 500 ms
3 threads: 385 ms
4 threads: 360 ms
5 threads: 475 ms
6 threads: 580 ms
7 threads: 635 ms
8 threads: 660 ms
The time necessary for thread creation is always between 1 and 2 ms.
The software has been tested by using its release build. Following is my configuration:
2x Intel(R) Xeon(R) CPU E5507 @ 2.27 GHz
Maximum speed: 2.26 GHz
Sockets: 2
Cores: 8
Logical processors: 8
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 8.0 MB
EDIT:
What the parse() function does is the following:
// data shared between threads (around 300k elements)
std::vector<std::unique_ptr<Foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;
void parse(int idx)
{
    auto& p = vfoo[idx];  // take a reference: a std::unique_ptr cannot be copied

    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}
The processor you are using, the Intel(R) Xeon(R) CPU E5507, has only 4 cores (see http://ark.intel.com/products/37100/Intel-Xeon-Processor-E5507-4M-Cache-2_26-GHz-4_80-GTs-Intel-QPI).
So having more than 4 threads causes the slowdown because of context switching, as is visible from the data you have provided.
You can read more about the context switching at the following link: https://en.wikipedia.org/wiki/Context_switch
update:
We still don't have a lot of info about the memory access patterns of parse(), and how much time it spends reading input data from memory vs. how much it spends writing/reading private scratch memory.
You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free / allocate/free pattern by satisfying allocations from memory just freed by that thread, so it's probably still hot in private L1 or L2 cache on that core.
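As a purely illustrative sketch (we haven't seen parse()'s internals, so the buffer name and size below are invented), each thread could keep one reusable scratch buffer instead of allocating and freeing on every call:

#include <cstddef>
#include <vector>

// Illustrative only: keep one scratch buffer per thread and reuse it across calls,
// so the global allocator is not hit for every node.
void parse_node(std::size_t idx)
{
    thread_local std::vector<char> scratch;     // one buffer per thread, stays allocated
    scratch.clear();                            // cheap: the capacity is kept
    if (scratch.capacity() < 64 * 1024)
        scratch.reserve(64 * 1024);             // grow once, then reuse

    (void)idx;  // a real implementation would parse nodes[idx] into `scratch` here
}

int main()
{
    for (std::size_t i = 0; i < 8; ++i)
        parse_node(i);                          // each thread would run this in its own loop
}

The point is simply that the global allocator is touched once per thread rather than once per node.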
Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be code that reads some memory.
Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into the issue of a non-HT-aware OS scheduling two threads onto the two logical cores of one physical core and having them slow each other down.
You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket creates more pressure on that. The relevant perf events will be something like LLC_something, IIRC.
You should definitely look at L1/L2 misses, and see how those scale with number of threads, and how that changes with strided vs. contiguous access to node[].
There are other perf counters you can check to look for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really just look for any perf events that change with number of threads; that could point the way towards an explanation.
A multi-socket system like your 2-socket Nehalem will have NUMA (Non-uniform_memory_access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.
So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
This may partially explain the less-than-perfect scaling with the number of threads. (Although, if @AntonMalyshev's answer didn't help, it more likely has nothing to do with it.) Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2 / L1 cache efficiency.
node[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64 byte cache line it touches in node[]). However, each thread presumably touches way more memory in the pointed-to data structures and strings. If node entries point to monotonically-increasing positions in other data structures and the string, then the strided access to node[] creates non-contiguous access patterns to most of the memory touched by the thread.
One possible benefit of the strided access pattern: Strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while other threads catch up because they see L3 hits. (Unless something happens that lets one thread get too far behind, like the OS de-scheduling it for a time slice.)
So maybe L3 vs. RAM bandwidth / latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all cores at once, even if they all hit in L3.)
This argument applies to everything pointed to by node[] only if contiguous ranges of node[] point to contiguous ranges of other memory.
Try to parse contiguous ranges of elements inside each thread, e.g. change
for (size_t idx = id; idx < nodes.size(); idx += n_threads)
{
    // parse node
    parse(nodes[idx]);
}
to
for (size_t idx = id * nodes.size()/n_threads; idx < (id+1)*nodes.size()/n_threads; idx++)
{
    // parse node
    parse(nodes[idx]);
}
which should be better for caching.
Also, it's better to precompute size = (id+1)*nodes.size()/n_threads once and use it in the loop's stop condition instead of recomputing it on each iteration.
For CPU-bound processes, adding threads beyond the number of available cores will decrease overall performance. The decrease is due to scheduling and other kernel interactions. For such situations, the optimal number of threads is often the number of cores - 1. The remaining core will be used by the kernel and the other running processes.
I address this topic in a bit more detail here A case for minimal multithreading
Looking at the hardware and numbers a bit closer, I suspect you are hitting hyper-threading contention. For a 4-core CPU, 8 logical cores are simulated with hyper-threading. For a fully CPU-bound process, hyper-threading will actually decrease performance. There is some interesting discussion here: Hyper-threading, and details at Wikipedia: hyper-threading.
2 CPUs (4 cores each)
Threads run in a shared memory space. The performance decrease is caused by moving shared data between CPUs (threads cannot directly access the cache of a different CPU; more threads => more data movement => bigger performance decrease).
I'm trying to parallelize a loop with OpenMP where each iteration is independent (code sample below).
!$OMP PARALLEL DO DEFAULT(PRIVATE)
do i = 1, 16
  begin = omp_get_wtime()
  allocate(array(100000000))
  do j = 1, 100000000
    array(j) = j
  end do
  deallocate(array)
  end = omp_get_wtime()
  write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$OMP END PARALLEL DO
I would expect a linear speedup out of this piece of code, with each iteration taking as much time as in the sequential version, as there are no possible race conditions or false sharing issues. However, I obtain the following results on a machine with two Xeon E5-2670 (8 cores each):
With only one thread:
It 1 Thread 0 time 0.435683965682983
It 2 Thread 0 time 0.435048103332520
It 3 Thread 0 time 0.435137987136841
It 4 Thread 0 time 0.434695959091187
It 5 Thread 0 time 0.434970140457153
It 6 Thread 0 time 0.434894084930420
It 7 Thread 0 time 0.433521986007690
It 8 Thread 0 time 0.434685945510864
It 9 Thread 0 time 0.433223009109497
It 10 Thread 0 time 0.434834957122803
It 11 Thread 0 time 0.435106039047241
It 12 Thread 0 time 0.434649944305420
It 13 Thread 0 time 0.434831142425537
It 14 Thread 0 time 0.434768199920654
It 15 Thread 0 time 0.435182094573975
It 16 Thread 0 time 0.435090065002441
And with 16 threads :
It 1 Thread 0 time 1.14882898330688
It 3 Thread 2 time 1.19775915145874
It 4 Thread 3 time 1.24406099319458
It 14 Thread 13 time 1.28723978996277
It 8 Thread 7 time 1.39885497093201
It 10 Thread 9 time 1.46112895011902
It 6 Thread 5 time 1.50975203514099
It 11 Thread 10 time 1.63096308708191
It 16 Thread 15 time 1.69229602813721
It 7 Thread 6 time 1.74118590354919
It 9 Thread 8 time 1.78044819831848
It 15 Thread 14 time 1.82169485092163
It 12 Thread 11 time 1.86312794685364
It 2 Thread 1 time 1.90681600570679
It 5 Thread 4 time 1.96404480934143
It 13 Thread 12 time 2.00902700424194
Any idea where the 4x factor in the iteration time is coming from?
I have tested with both the GNU and Intel compilers with the -O3 optimization flag.
The speed of the operation
do j=1, 100000000
array(j) = j
end do
is limited not by the ALU speed but by the memory bandwidth. Typically, you have several channels to main memory per CPU socket, but still fewer than the number of cores.
The allocation and deallocation are also memory-access bound. I am not sure whether some synchronization may also be needed for the allocate and deallocate.
For the same reason, the STREAM benchmark http://www.cs.virginia.edu/stream/ gives different speed-ups than purely arithmetically intensive problems.
I'm sure I've covered that topic before, but since I cannot seem to find my earlier posts, here I go again...
Large memory allocations on Linux (and possibly on other platforms) are handled via anonymous memory mappings. That is, some area gets reserved in the virtual address space of the process by calling mmap(2) with flags MAP_ANONYMOUS. The maps are initially empty - there is no physical memory backing them up. Instead, they are associated with the so-called zero page, which is a read-only frame in physical memory filled with zeros. Since the zero page is not writeable, an attempt to write into a memory location still backed by it results in a segmentation fault. The kernel handles the fault by finding a free frame in physical memory and associating it with the virtual memory page where the fault has occurred. This process is known as faulting the memory.
Faulting the memory is a relatively slow process as it involves modifications to the process' PTEs (page table entries) and flushes of the TLB (Translation Lookaside Buffer) cache. On multicore and multisocket systems it is even slower as it involves invalidation of the remote TLBs (known as remote TLB shootdown) via expensive inter-processor interrupts. Freeing an allocation results in removal of the memory mapping and reset of the PTEs. Therefore, the whole process is repeated during the next iteration.
Indeed, if you look at the effective memory bandwidth in your serial case, it is (assuming an array of double precision floats):
(100000000 * 8) / 0.435 = 1.71 GiB/s
Should your array be of single-precision REAL or default INTEGER elements (4 bytes each), this figure should be cut in half. Either way, it is nowhere near the memory bandwidth that even the very first generation of E5-2670 provides.
For the parallel case, the situation is even worse, since the kernel locks the page tables while faulting the pages. That's why the per-thread bandwidth varies from 664 MiB/s down to 380 MiB/s, for a total of 7.68 GiB/s, which is almost an order of magnitude slower than the memory bandwidth of a single CPU (and your system has two, hence twice the available bandwidth!).
A completely different picture will emerge if you move the allocation outside of the loop:
!$omp parallel default(private)
allocate(array(100000000))
!$omp do
do i = 1, 16
  begin = omp_get_wtime()
  do j = 1, 100000000
    array(j) = j
  end do
  end = omp_get_wtime()
  write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$omp end do
deallocate(array)
!$omp end parallel
Now the second and later iterations will be roughly twice as fast (at least on an E5-2650). This is because after the first iteration, all the memory is already faulted. The gain is even larger for the multithreaded case (increase the loop count to 32 so that each thread does two iterations).
The time to fault the memory depends heavily on the system configuration. On systems that have THP (transparent huge pages) enabled, the kernel automatically uses huge pages to implement large mappings. This reduces the number of faults by a factor of 512 (for huge pages of 2 MiB). The above cited speed gains of 2x for the serial case and 2.5x for the parallel one are from a system with THP enabled. The mere use of huge pages decreases the time for the first iteration on E5-2650 to 1/4 (1/8 if your array is of integers or single-precision floats) of the time in your case.
This mmap-based mechanism is usually not used for smaller arrays, which are allocated via subdivision of a larger, reused persistent memory allocation known as an arena. Newer memory allocators in glibc usually have one arena per CPU core in order to facilitate lock-less multithreaded allocation.
That is the reason why many benchmark applications simply throw away the very first measurement.
Just to substantiate the above with real-life measurements, my E5-2650 needs 0.183 seconds to perform in serial one iteration over already faulted memory and 0.209 seconds to perform it with 16 threads (on a dual-socket system).
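The effect is also easy to reproduce in isolation. Here is a rough standalone C++ sketch (it mirrors the 10^8-element array of the question, so it needs about 800 MB of RAM) that times a first, faulting pass over freshly allocated memory against a second pass over the same, already-faulted memory:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <memory>

int main() {
    const std::size_t n = 100000000;    // 10^8 doubles = ~800 MB, as in the question

    // new double[n] leaves the elements uninitialised, so (much like Fortran's
    // allocate on Linux/glibc) the pages are not faulted in until the first write.
    std::unique_ptr<double[]> a(new double[n]);

    auto fill = [&] {
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t j = 0; j < n; ++j)
            a[j] = double(j);
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    const double first  = fill();       // pays the page faults
    const double second = fill();       // same memory, already faulted in
    std::printf("first pass: %.3f s, second pass: %.3f s (a[1] = %g)\n",
                first, second, a[1]);
}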
They're not independent. Allocate/deallocate will be sharing the heap.
Try allocating a bigger array outside of the parallel section, then timing just the memory access.
It's also a non-uniform memory architecture (NUMA) - if all the allocations come from one CPU's local memory, accesses from the other CPU will be relatively slow, as they get routed via the first CPU. This is tedious to work around.
I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.
#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>
using namespace std;

int main(){
    srand(time(NULL)); //Seed random number generator

    vector<int> v;       //Create vector to hold random numbers in interval [0,9]
    vector<int> d(10,0); //Vector to hold counts of each integer initialized to 0

    for(int i=0;i<1e9;++i)
        v.push_back(rand()%10); //Push back random numbers [0,9]

    clock_t c=clock();

    #pragma omp parallel for
    for(int i=0;i<v.size();++i)
        d[v[i]]+=1; //Count number stored at v[i]

    cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

    for(vector<int>::iterator i=d.begin();i!=d.end();++i)
        cout<<*i<<endl;

    return 0;
}
The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)
Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?
My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.
Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?
Your code exhibits:
race conditions due to unsynchronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time
Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.
False sharing occurs when two threads write to memory locations that are close enough to one another to end up on the same cache line. This results in the cache line constantly bouncing from core to core, or from CPU to CPU in multisocket systems, and an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector could fit in one cache line. With 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones.

True sharing occurs at those elements that are accessed by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:
#pragma omp parallel for
for(int i=0;i<v.size();++i)
    #pragma omp atomic
    d[v[i]]+=1; //Count number stored at v[i]
The second is implemented like this:
omp_lock_t locks[10];

for (int i = 0; i < 10; i++)
    omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
    int vv = v[i];
    omp_set_lock(&locks[vv]);
    d[vv]+=1; //Count number stored at v[i]
    omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
    omp_destroy_lock(&locks[i]);
(include omp.h to get access to the omp_* functions)
I leave it up to you to come up with an implementation of the third option.
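That said, a minimal sketch of what the third option could look like is given below: per-thread local counts merged at the end, timed with omp_get_wtime() (which measures wall time; see the next paragraph). The structure and names are just one possibility, not the only correct way to write it.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <omp.h>

// v holds the digits 0-9, d receives the global counts.
void count_digits(const std::vector<int>& v, std::vector<int>& d)
{
    d.assign(10, 0);
    #pragma omp parallel
    {
        std::vector<int> local(10, 0);          // private histogram: no sharing, no locks
        #pragma omp for nowait
        for (long long i = 0; i < (long long)v.size(); ++i)
            ++local[v[i]];
        #pragma omp critical                    // short merge, executed once per thread
        for (int k = 0; k < 10; ++k)
            d[k] += local[k];
    }
}

int main()
{
    std::vector<int> v(100000000);              // smaller than the question's 1e9, same idea
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = std::rand() % 10;

    std::vector<int> d;
    const double t0 = omp_get_wtime();          // wall-clock time, unlike clock()
    count_digits(v, d);
    const double t1 = omp_get_wtime();

    std::printf("seconds: %.3f\n", t1 - t0);
    for (int k = 0; k < 10; ++k)
        std::printf("%d\n", d[k]);
}

Inside the loop each thread only ever touches its own local array, so the only synchronisation left is the short once-per-thread merge at the end.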
You are measuring elapsed time using clock(), but it measures CPU time, not the wall-clock run time. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high-resolution timer API) instead.
EDIT
Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make speed comparisons moot:
Your program is slowing down because you have 10 possible output elements in the parallel section which are being accessed randomly. OpenMP cannot safely update any of those elements without a lock (which you would need to provide via synchronization), and locking causes your threads to incur more overhead than you gain from counting in parallel.
A solution to speed this up is to instead give each OpenMP thread a local counter array which counts all of the values 0-9 that the particular thread has seen. Then sum those up into the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect a speed-up close to Nx, where N is the number of OpenMP threads, as very little locking should be required. This solution also avoids a lot of the race conditions currently in your code.
See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP