Missing small primes in C++ atomic prime sieve

I'm trying to develop a concurrent prime sieve using C++ atomics. However, as core_count is increased, more and more small primes are missing from the output.
My guess is that the producer threads overwrite each other's results before the consumer reads them, even though the construction should protect against that by using the magic number 0 to indicate the slot is ready to accept the next prime. It seems as if compare_exchange_weak were not really atomic in this case.
Things I've tried:
Replacing compare_exchange_weak with compare_exchange_strong
Changing the memory_order to anything else.
Swapping around the 'crossing-out' and the write.
I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.
Any ideas on this are welcome, including some best practices I might have missed.
#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>
int main() {
  using namespace std;
  constexpr memory_order order = memory_order_relaxed;
  atomic<int> output{0};
  vector<atomic_bool> sieve(10000);
  for (auto& each : sieve) atomic_init(&each, false);
  atomic<unsigned> finished_worker_count{0};
  auto const worker = [&output, &sieve, &finished_worker_count]() {
    for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
      current = find_if(current, sieve.end(), [](atomic_bool& value) {
        bool untrue = false;
        return value.compare_exchange_strong(untrue, true, order);
      });
      if (current == sieve.end()) break;
      int const claimed = static_cast<int>(distance(sieve.begin(), current));
      int zero = 0;
      while (!output.compare_exchange_weak(zero, claimed, order))
        ;
      for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
           product += claimed)
        sieve[product].store(true, order);
    }
    finished_worker_count.fetch_add(1, order);
  };
  const auto core_count = thread::hardware_concurrency();
  vector<future<void>> futures;
  futures.reserve(core_count);
  generate_n(back_inserter(futures), core_count,
             [&worker]() { return async(worker); });
  vector<int> result;
  while (finished_worker_count < core_count) {
    auto current = output.exchange(0, order);
    if (current > 0) result.push_back(current);
  }
  sort(result.begin(), result.end());
  for (auto each : result) cout << each << " ";
  cout << '\n';
  return 0;
}

compare_exchange_weak will update (change) the "expected" value (the local variable zero) when the exchange fails. Since zero then holds the previous, not-yet-consumed prime instead of 0, the next attempt will succeed and overwrite that prime with another one if the main thread doesn't handle the first prime quickly enough.
You'll want to reset zero back to zero before rechecking:
while (!output.compare_exchange_weak(zero, claimed, order))
    zero = 0;

Even with correctness bugs fixed, I think this approach is going to be low performance with multiple threads writing to the same cache lines.
As 1201ProgramAlarm points out in their answer, the CAS retry loop has to reset zero, but even with that fixed I wouldn't expect good performance! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write that as follows so you only need to write zero = 0 once in the source, although it still happens before every CAS.
do {
    zero = 0;
} while (!output.compare_exchange_weak(zero, claimed, order));
Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element (like in your find_if). That's almost certainly not a problem in practice in this case; find_if is just written in C++ in a header (in mainstream implementations) and likely in a way that there isn't actually any UB in the resulting program.
And it would be straightforward to replace the find_if with a loop, which would probably make the code more readable: you can just use array indexing the whole time instead of iterators, and let the compiler optimize that to a pointer and then pointer subtraction if it wants.
Scan read-only until you find a candidate to try to claim, don't try to atomic-RMW every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.
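For illustration, a minimal sketch of that scan-then-claim loop over the question's sieve, using an index instead of find_if and the same relaxed order (this is an assumption about how you'd restructure the worker, not a drop-in replacement):
for (size_t i = 2; i < sieve.size(); ) {
    if (sieve[i].load(order)) {   // read-only scan: already claimed or crossed off
        ++i;
        continue;
    }
    bool expected = false;
    if (!sieve[i].compare_exchange_strong(expected, true, order))
        continue;                 // lost the race for this element; re-check it read-only
    // i is now a claimed candidate: publish it and cross off its multiples as before
    ++i;
}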
Multiple threads storing to the array will cause contention
Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line until after it gets exclusive ownership of that cache line. (Usually 64 bytes on modern systems.)
Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).
But with other threads writing the array, at best you'll get hits in L3 cache. And since the sieve size is small, other threads won't be evicting their copies from their own L1d caches, so the RFO (read for ownership) from a core that wants to write will typically find that some other core has it Modified, so the L3 cache (or other tag directory) will have to send on a request to that core to write back or transfer directly. (Intel CPUs from Nehalem onwards use an inclusive L3 cache whose tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but I think client CPUs still have an inclusive L3 cache where the tags also work as a snoop filter / coherence directory.)
With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)
A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I made some benchmarks of a hand-written x86 asm version that used a bitmap version in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing off at i*i).
Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. Unlike with DRAM where the shared bus is the bottleneck, unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active).
You'd have to benchmark to see whether all the threads pile up in a bad way or not, e.g. whether they tend to end up close together, or whether one with a larger stride can pull ahead and get enough distance that its write-prefetches don't create extra contention.
The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.
(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)
At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)
Better strategy: each thread owns a chunk of the sieve
Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)
You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps appending them to an array of primes, and updating a valid-index (with a release store so readers can keep reading in that array up to that point, then spin-wait or use C++20 .wait() for a value other than what they last saw. But .wait would need a .notify in the writer, like maybe every 10 primes?)
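As a rough sketch of that publish/consume handshake (MAX_PRIMES and use() are made-up placeholders here; the spin-wait / .wait()+.notify_one() part is left out):
#include <array>
#include <atomic>

constexpr int MAX_PRIMES = 1024;
std::array<int, MAX_PRIMES> primes;     // slots at or above primes_valid aren't published yet
std::atomic<int> primes_valid{0};

void publish_prime(int p) {             // called only by the thread that owns the lowest chunk
    int n = primes_valid.load(std::memory_order_relaxed);
    primes[n] = p;                      // plain store: slot n isn't visible to readers yet
    primes_valid.store(n + 1, std::memory_order_release);   // publish
}

template <class Use>
int consume_new_primes(int last_seen, Use use) {
    int n = primes_valid.load(std::memory_order_acquire);
    for (int i = last_seen; i < n; ++i)
        use(primes[i]);                 // safe: acquire pairs with the release store above
    return n;                           // caller remembers this as its new last_seen
}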
If you want to move on in a larger sieve, divide up the next chunk of the full array between threads and have them each cross off the known primes from the first part of the array. No thread has to wait for any other, the first set of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
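A sketch of that per-chunk crossing-off step, using plain (non-atomic) char flags since each chunk is owned by exactly one thread; the only fiddly part is rounding up to the first multiple inside the chunk:
#include <algorithm>
#include <vector>

// Mark multiples of already-known primes inside one thread-owned chunk
// [chunk_begin, chunk_end) of the sieve. No atomics: only the owner writes here.
void mark_chunk(std::vector<char>& sieve, const std::vector<int>& primes,
                int chunk_begin, int chunk_end)
{
    for (int p : primes) {
        int first = std::max(2 * p, ((chunk_begin + p - 1) / p) * p);  // first multiple of p >= chunk_begin
        for (int m = first; m < chunk_end; m += p)
            sieve[m] = 1;                                              // crossed off
    }
}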
Probably you don't want an actual array of atomic_int; probably all threads should be scanning the sieve to find not-crossed-off positions. Especially if you can do that efficiently with SIMD, or with bit-scan like tzcnt if you use packed bitmaps for this.
(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)

Related

race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have this piece of code in my full program:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];

struct Particle{
    int ix,jy,kz;
    int ip;
};

Particle particles[GL];

int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;

#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    int ic=GetIndex(particles[i]);
    #pragma omp atomic update
    Length[ic]++;
}

Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];

#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    if(particles[i].ip==3)
    {
        int ic=GetIndex(particles[i]);
        if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
        int cnt=0;
        #pragma omp atomic capture
        cnt=Counter[ic]++;
        MIndex[Begin[ic]+cnt]=i;
    }
}
If I remove
#pragma omp parallel for
the code works properly and the output is always the same.
But with this pragma there is some undefined behaviour / race condition in the code, because it gives different output each time.
How to fix this issue?
Update: The task is the following. There are lots of particles with random coordinates. I need to output to the array MIndex the indices (into the array particles) of the particles that are in each cell (a cartesian cube, for example 1×1×1 cm) of the coordinate system. So at the beginning of MIndex there should be the indices of the particles in the 1st cell of the coordinate system, then those in the 2nd, then the 3rd, and so on. The order of indices within a given cell in MIndex is not important and may be arbitrary. If possible, this should be done in parallel, maybe using atomic operations.
There is a straightforward way: traverse all the coordinate cells in parallel and, in each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems slow. Is there a faster approach? Is it possible to traverse the particles array only once in parallel and fill the MIndex array using atomic operations, something like in the code above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] would go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
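A sketch of that counting pass, assuming Length is extended to a per-thread 2D array Length[NT][cub3] that is zeroed beforehand, and including the ip==3 check discussed above:
#pragma omp parallel
{
    int t = omp_get_thread_num();          // needs <omp.h>
    // static schedule: each thread gets a fixed contiguous block of i,
    // which must match the schedule used again in the second pass
    #pragma omp for schedule(static)
    for (int i = 0; i < N; ++i)
        if (particles[i].ip == 3)
            Length[t][GetIndex(particles[i])]++;   // non-atomic: row t is private to this thread
}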
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for(ic < cub3) {
    // perhaps do this "vertical" sum into a temporary array
    // or prefix-sum within Length before combining across threads?
    int pos = sum(Length[0..nthreads-1][ic-1]) + Begin[0][ic-1];
    Begin[0][ic] = pos;

    for (int t = 1 ; t<nthreads ; t++) {
        pos += Length[t][ic];   // prefix-sum across threads for this cube bucket
        Begin[t][ic] = pos;
    }
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 4096 bytes apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
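A sketch of that final loop, under the assumption that Begin[t][ic] holds the starting offset in MIndex for thread t's share of bucket ic, and that the same static schedule as the counting pass is used so each thread revisits exactly the particles it counted:
#pragma omp parallel
{
    int t = omp_get_thread_num();
    #pragma omp for schedule(static)          // same distribution of i as the counting pass
    for (int i = 0; i < N; ++i) {
        if (particles[i].ip == 3) {
            int ic  = GetIndex(particles[i]);
            int pos = --Length[t][ic];        // counts down to 0; non-atomic, row t is private
            MIndex[Begin[t][ic] + pos] = i;   // destinations are disjoint, so plain stores are fine
        }
    }
}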
Don't overdo it with number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 4096 bytes (too small to parallelize the zeroing of, BTW), it's really not that much.
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for Mindex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem, too.)
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
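A sketch of the kz-filtering idea applied to the final loop (each thread scans all particles but only handles those whose kz falls in its own slice, so its range of Counter[] and MIndex[] is private and nothing needs to be atomic):
#pragma omp parallel
{
    int t  = omp_get_thread_num();
    int nt = omp_get_num_threads();
    // slice the kz range [-cuba, cuba) into nt contiguous pieces
    int kz_lo = -cuba + (2 * cuba) * t / nt;
    int kz_hi = -cuba + (2 * cuba) * (t + 1) / nt;
    for (int i = 0; i < N; ++i) {                  // every thread scans all particles
        if (particles[i].ip != 3) continue;
        int kz = particles[i].kz;
        if (kz < kz_lo || kz >= kz_hi) continue;   // not this thread's slice
        int ic = GetIndex(particles[i]);
        MIndex[Begin[ic] + Counter[ic]++] = i;     // non-atomic: this ic range is owned by thread t
    }
}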
Re: your edit
You already loop over the whole array of points doing Length[ic]++;? Seems redundant to do the same histogramming work again with Counter[ic]++;, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 8*8*16 * sizeof(int) = 4096 bytes, just one page, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
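The combine step is just a vertical add over the per-thread arrays, for example (assuming the Length[NT][cub3] layout sketched earlier):
for (int t = 1; t < NT; ++t) {
    #pragma omp simd
    for (int ic = 0; ic < cub3; ++ic)
        Length[0][ic] += Length[t][ic];   // contiguous accesses, auto-vectorizes well
}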
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top of this answer.
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.
EDIT: the second line, however, is not of the right form for an atomic operation, so you have to make it critical. Which is going to make performance even worse.
Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.

Fast synchronized access to shared array with changing base address (in C11)

I am currently designing a user space scheduler in C11 for a custom co-processor under Linux (user space, because the co-processor does not run its own OS, but is controlled by software running on the host CPU). It keeps track of all the tasks' states with an array. Task states are regular integers in this case. The array is dynamically allocated and each time a new task is submitted whose state does not fit into the array anymore, the array is reallocated to twice its current size. The scheduler uses multiple threads and thus needs to synchronize its data structures.
Now, the problem is that I very often need to read entries in that array, since I need to know the states of tasks for scheduling decisions and resource management. If the base address was guaranteed to always be the same after each reallocation, I would simply use C11 atomics for accessing it. Unfortunately, realloc obviously cannot give such a guarantee. So my current approach is wrapping each access (reads AND writes) with one big lock in the form of a pthread mutex. Obviously, this is really slow, since there is locking overhead for each read, and the read is really small, since it only consists of a single integer.
To clarify the problem, I give some code here showing the relevant passages:
Writing:
// pthread_mutex_t mut;
// size_t len_arr;
// int *array, idx, x;
pthread_mutex_lock(&mut);
if (idx >= len_arr) {
    len_arr *= 2;
    array = realloc(array, len_arr*sizeof(int));
    if (array == NULL)
        abort();
}
array[idx] = x;
pthread_mutex_unlock(&mut);
Reading:
// pthread_mutex_t mut;
// int *array, idx;
pthread_mutex_lock(&mut);
int x = array[idx];
pthread_mutex_unlock(&mut);
I have already used C11 atomics for efficient synchronization elsewhere in the implementation and would love to use them to solve this problem as well, but I could not find an efficient way to do so. In a perfect world, there would be an atomic accessor for arrays which performs address calculation and memory read/write in a single atomic operation. Unfortunately, I could not find such an operation. But maybe there is a similarly fast or even faster way of achieving synchronization in this situation?
EDIT:
I forgot to specify that I cannot reuse slots in the array when tasks terminate. Since I guarantee access to the state of every task ever submitted since the scheduler was started, I need to store the final state of each task until the application terminates. Thus, static allocation is not really an option.
Do you need to be so economical with virtual address space? Can't you just set a very big upper limit and allocate enough address space for it (maybe even a static array, or dynamic if you want the upper limit to be set at startup from command-line options).
Linux does lazy memory allocation so virtual pages that you never touch aren't actually using any physical memory. See Why is iterating though `std::vector` faster than iterating though `std::array`?, which shows by example that reading or writing an anonymous page for the first time causes a page fault. If it was a read access, it gets the kernel to CoW (copy-on-write) map it to a shared physical zero page. Only an initial write, or a write to a CoW page, triggers actual allocation of a physical page.
Leaving virtual pages completely untouched avoids even the overhead of wiring them into the hardware page tables.
If you're targeting a 64-bit ISA like x86-64, you have boatloads of virtual address space. Using up more virtual address space (as long as you aren't wasting physical pages) is basically fine.
Practical example of allocating more virtual address space than you can use:
If you allocate more memory than you could ever practically use (touching it all would definitely segfault or invoke the kernel's OOM killer), that will be as large or larger than you could ever grow via realloc.
To allocate this much, you may need to globally set /proc/sys/vm/overcommit_memory to 1 (no checking) instead of the default 0 (heuristic which makes extremely large allocations fail). Or use mmap(MAP_NORESERVE) to allocate it, making that one mapping just best-effort growth on page-faults.
The documentation says you might get a SIGSEGV on touching memory allocated with MAP_NORESERVE, which is different than invoking the OOM killer. But I think once you've already successfully touched memory, it is yours and won't get discarded. I think it's also not going to spuriously fail unless you're actually running out of RAM + swap space. IDK how you plan to detect that in your current design (which sounds pretty sketchy if you have no way to ever deallocate).
Test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void) {
    size_t sz = 1ULL << 46;   // 2**46 = 64 TiB = max power of 2 for x86-64 with 48-bit virtual addresses
    // in practice 1ULL << 40 (1TiB) should be more than enough.
    // the smaller you pick, the less impact if multiple things use this trick in the same program

    //int *p = aligned_alloc(64, sz);   // doesn't use NORESERVE so it will be limited by overcommit settings
    int *p = mmap(NULL, sz, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);

    madvise(p, sz, MADV_HUGEPAGE);   // for good measure to reduce page-faults and TLB misses, since you're using large contiguous chunks of this array

    p[1000000000] = 1234;   // or sz/sizeof(int) - 1 will also work; this is only touching 1 page somewhere in the array.

    printf("%p\n", p);
}
$ gcc -Og -g -Wall alloc.c
$ strace ./a.out
... process startup
mmap(NULL, 70368744177664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x15c71ef7c000
madvise(0x15c71ef7c000, 70368744177664, MADV_HUGEPAGE) = 0
... stdio stuff
write(1, "0x15c71ef7c000\n", 15) = 15
0x15c71ef7c000
exit_group(0) = ?
+++ exited with 0 +++
My desktop has 16GiB of RAM (a lot of it in use by Chromium and some big files in /tmp) + 2GiB of swap. Yet this program allocated 64 TiB of virtual address space and touched 1 int of it nearly instantly. Not measurably slower than if it had only allocated 1MiB. (And future performance from actually using that memory should also be unaffected.)
The largest power-of-2 you can expect to work on current x86-64 hardware is 1ULL << 46. The total lower canonical range of the 48-bit virtual address space is 47 bits (user-space virtual address space on Linux), and some of that is already allocated for stack/code/data. Allocating a contiguous 64 TiB chunk of that still leaves plenty for other allocations.
(If you do actually have that much RAM + swap, you're probably waiting for a new CPU with 5-level page tables so you can use even more virtual address space.)
Speaking of page tables, the larger the array the more chance of putting some other future allocations very very far from existing blocks. This can have a minor cost in TLB-miss (page walk) time, if your actual in-use pages end up more scattered around your address space in more different sub-trees of the multi-level page tables. That's more page-table memory to keep in cache (including cached within the page-walk hardware).
The allocation size doesn't have to be a power of 2 but it might as well be. There's also no reason to make it that big. 1ULL << 40 (1TiB) should be fine on most systems. IDK if having more than half the available address space for a process allocated could slow future allocations; bookkeeping is I think based on extents (ptr + length) not bitmaps.
Keep in mind that if everyone starts doing this for random arrays in libraries, that could use up a lot of address space. This is great for the main array in a program that spends a lot of time using it. Keep it as small as you can while still being big enough to always be more than you need. (Optionally make it a config parameter if you want to avoid a "640kiB is enough for everyone" situation). Using up virtual address space is very low-cost, but it's probably better to use less.
Think of this as reserving space for future growth but not actually using it until you touch it. Even though by some ways of looking at it, the memory already is "allocated". But in Linux it really isn't. Linux defaults to allowing "overcommit": processes can have more total anonymous memory mapped than the system has physical RAM + swap. If too many processes try to use too much by actually touching all that allocated memory, the OOM killer has to kill something (because the "allocate" system calls like mmap have already returned success). See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
(With MAP_NORESERVE, it's only reserving address space which is shared between threads, but not reserving any physical pages until you touch them.)
You probably want your array to be page-aligned: #include <stdalign.h> so you can use something like
alignas(4096) struct entry process_array[MAX_LEN];
Or for non-static, allocate it with C11 aligned_alloc().
Give back early parts of the array when you're sure all threads are done with it
Page alignment makes it easy to do the calculations to "give back" a memory page (4kiB on x86) if your array's logical size shrinks enough. madvise(addr, 4096*n, MADV_FREE); (Linux 4.5 and later). This is kind of like mmap(MAP_FIXED) to replace some pages with new untouched anonymous pages (that will read as zeroes), except it doesn't split up the logical mapping extents and create more bookkeeping for the kernel.
Don't bother with this unless you're returning multiple pages, and leave at least one page unfreed above the current top to avoid page faults if you grow again soon. Like maybe maintain a high-water mark that you've ever touched (without giving back) and a current logical size. If high_water - logical_size > 16 pages, give back all pages from 4 past the logical size up to the high-water mark.
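A sketch of that give-back heuristic (the page size and thresholds are hard-coded for illustration; logical_size and high_water are element counts the caller is assumed to track):
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

// Give pages back when the logical size has shrunk well below the high-water mark,
// keeping ~4 pages of slack above the top so near-term regrowth doesn't fault again.
void maybe_give_back(int *array, size_t logical_size, size_t *high_water)
{
    const size_t page = 4096, per_page = page / sizeof(int);
    if (*high_water < logical_size + 16 * per_page)
        return;                                    // not enough to be worth a syscall
    size_t keep = logical_size + 4 * per_page;
    uintptr_t start = ((uintptr_t)(array + keep) + page - 1) & ~(uintptr_t)(page - 1);
    uintptr_t end   = (uintptr_t)(array + *high_water) & ~(uintptr_t)(page - 1);
    if (end > start)
        madvise((void *)start, end - start, MADV_FREE);   // Linux 4.5+
    *high_water = keep;
}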
If you will typically be actually using/touching at least 2MiB of your array, use madvise(MADV_HUGEPAGE) when you allocate it to get the kernel to prefer using transparent hugepages. This will reduce TLB misses.
(Use strace to see return values from your madvise system calls, and look at /proc/PID/smaps, to see if your calls are having the desired effect.)
If up-front allocation is unacceptable, RCU (read-copy-update) might be viable if it's read-mostly. https://en.wikipedia.org/wiki/Read-copy-update. But copying a gigantic array every time an element changes isn't going to work.
You'd want a different data-structure entirely where only small parts need to be copied. Or something other than RCU; like your answer, you might not need the read side being always wait-free. The choice will depend on acceptable worst-case latency and/or average throughput, and also how much contention there is for any kind of ref counter that has to bounce around between all threads.
Too bad there isn't a realloc variant that attempts to grow without copying so you could attempt that before bothering other threads. (e.g. have threads with idx>len spin-wait on len in case it increases without the array address changing.)
So, I came up with a solution:
Reading:
while(true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
int x = array[idx];
cnt--;
Writing:
if (idx == len) {
    wait = true;
    while (cnt > 0); // busy wait to minimize latency of reallocation
    array = realloc(array, 2*len*sizeof(int));
    if (!array) abort(); // shit happens
    len *= 2; // must not be updated before reallocation completed
    wait = false;
}
// this is why len must be updated after realloc,
// it serves for synchronization with other writers
// exceeding the current length limit
while (idx > len) {yield();}
while(true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
array[idx] = x;
cnt--;
wait is an atomic bool initialized as false, cnt is an atomic int initialized as zero.
This only works because I know that task IDs are chosen in ascending order without gaps and that no task state is read before it is initialized by the write operation. So I can always rely on the one thread which pulls the ID that exceeds the current array length by exactly 1. New tasks created concurrently will block their threads until the responsible thread has performed the reallocation. Hence the busy wait, since the reallocation should happen quickly so the other threads do not have to wait for too long.
This way, I eliminate the bottlenecking big lock. Array accesses can be made concurrently at the cost of two atomic additions. Since reallocation occurs seldom (due to exponential growth), array access is practically block-free.
EDIT:
After taking a second look, I noticed that one has to be careful about reordering of stores around the length update. Also, the whole thing only works if concurrent writes always use different indices. This is the case for my implementation, but might not generally be. Thus, this is not as elegant as I thought and the solution presented in the accepted answer should be preferred.

How to set bits of a bit vector efficiently in parallel?

Consider a bit vector of N bits in it (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in range 0..N-1 indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically __m256i, where 256 bits are packed into each __m256i structure.
How can this work be split efficiently across multiple threads?
Preferred language is C++ (MSVC++2017 toolset v141), assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if any benefit from it.
Let's assume you want to divide this work up among T threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of N and M.
Fully Concurrent Baseline
You could simply divide up the array M into T partitions and have each thread work on its own partition of M with a shared N. The main problem is that since M is not sorted, all threads may access any element of N and hence stomp on each other's work. To avoid this, you'd have to use atomic operations such as std::atomic::fetch_or for each modification of the shared N array, or else come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).
Let's look at ideas that are likely faster.
Private N
One relatively obvious idea to avoid the "shared N" problem which requires atomic operations for all mutations of N is simply to give each T a private copy of N and merge them at the end via or.
Unfortunately, this solution is O(N) + O(M/T) whereas the original single-threaded solution is O(M) and the "atomic" solution above is something like O(M/T) [4]. Since we know that N >> M this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the O(N) term, which comes from the merging step [0], can use 256-bit wide vpor instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the bit-setting step which is O(M/T) I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T even if the size of N is 10 or 100 times the size of M.
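The merge itself is just a wide OR over the private copies, something like the following word-granularity sketch (uint64_t words rather than __m256i, leaving the vectorization to the compiler):
#include <cstdint>
#include <vector>

// OR each thread's private bitmap into the shared one.
// With -O3 -mavx2 the inner loop compiles down to 256-bit vpor operations.
void merge_private_bitmaps(uint64_t* shared_N,
                           const std::vector<std::vector<uint64_t>>& private_N,
                           size_t n_words)
{
    for (const auto& priv : private_N)        // one pass per thread's private copy
        for (size_t w = 0; w < n_words; ++w)
            shared_N[w] |= priv[w];
}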
Partitions of M
The basic idea here is to partition the indexes in M such that each worker thread can then work on a disjoint part of the N array. If M was sorted, that would be trivial, but it's not, so...
A simple algorithm that will work well if M is smoothly distributed is to first partition the values of M into T buckets, with the buckets having values in the ranges [0, N/T), [N/T, 2N/T), ..., [(T-1)N/T, N). That is, divide N into T disjoint regions and then find the values of M that fall into each of them. You can spread that work across the T threads by assigning each thread an equal size chunk of M, and having them each create the T partitions and then logically merging [1] them at the end so you have the T partitions of M.
The second step is to actually set all the bits: you assign one partition to each thread T which can set the bits in a "single threaded" way, i.e., not worrying about concurrent updates, since each thread is working on a disjoint partition of N [2].
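A sketch of the two steps (T, N, and the per-thread slice of M are assumptions here; as footnote [1] below notes, the per-thread partitions can be linked together instead of copied):
#include <cstdint>
#include <vector>

// Step 1, run by thread t on its slice of M: bucket each index by which of the
// T equal regions of N it falls into. buckets_for_t starts as T empty vectors.
void bucket_slice(std::vector<std::vector<uint32_t>>& buckets_for_t,
                  const uint32_t* m_slice, size_t count, uint64_t N, unsigned T)
{
    for (size_t k = 0; k < count; ++k) {
        size_t region = (uint64_t)m_slice[k] * T / N;   // 64-bit math avoids overflow
        buckets_for_t[region].push_back(m_slice[k]);
    }
}

// Step 2, run by thread r: set every bit destined for region r, from all threads'
// buckets. Plain stores are fine: region r of N is touched only by thread r.
void set_region_bits(uint8_t* N_bytes, unsigned region,
                     const std::vector<std::vector<std::vector<uint32_t>>>& buckets) // [thread][region]
{
    for (const auto& per_thread : buckets)
        for (uint32_t bit : per_thread[region])
            N_bytes[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}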
Both steps are O(M) and the second step is identical to the single-threaded case, so the overhead for parallelizing this is the first step. I suspect the first will range from about the same speed as the second to perhaps 2-4 times as slow, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.
If the distribution of M is not smooth, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create say 10 * T partitions, rather than only T and have the threads in the second pass all consume from the same queue of partitions until complete. In this way you spread the work more evenly, unless the array M is very bunched up. In that case you might consider a refinement of the first step which first essentially creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.
Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already lots of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help in bit-setting phase, since accesses will be in-order and have the best spatial locality (helping with prefetching and caching, respectively).
[0] ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.
[1] The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of M, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.
[2] To make it truly disjoint from a threading point of view you want to ensure the partition of N falls on "byte boundaries", and perhaps even cache-line boundaries to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).
[4] In practice, the exact "order" of the baseline concurrent solution using shared N is hard to define because there will be contention so the O(M/T) scaling will break down for large enough T. If we assume N is quite large and T is limited to typical hardware concurrency of at most a dozen cores or so it's probably an OK approximation.
@IraBaxter posted an interesting but flawed idea which can be made to work (at significant cost). I suspect @BeeOnRope's idea of partial-sort / partitioning the M array will perform better (especially for CPUs with large private caches which can keep parts of N hot). I'll summarize the modified version of Ira's idea that I described in comments on his deleted answer. (That answer has some suggestions about how big N has to be before it's worth multi-threading.)
Each writer thread gets a chunk of M with no sorting/partitioning.
The idea is that conflicts are very rare because N is large compared to the number of stores that can be in flight at once. Since setting a bit is idempotent, we can handle conflicts (where two threads want to set different bits in the same byte) by checking the value in memory to make sure it really does have the bit set that we want after a RMW operation like or [N + rdi], al (with no lock prefix).
E.g. thread 1 tried to store 0x1 and stepped on thread 2's store of 0x2. Thread 2 must notice and retry the read-modify-write (probably with lock or to keep it simple and make multiple retries not possible) to end up with 0x3 in the conflict byte.
We need an mfence instruction before the read-back. Otherwise store-forwarding will give us the value we just wrote before other threads see our store. In other words, a thread can observe its own stores earlier than they appear in the global order. x86 does have a Total Order for stores, but not for loads. Thus, we need mfence to prevent StoreLoad reordering. (Intel's "Loads Are not Reordered with Older Stores to the Same Location" guarantee is not as useful as it sounds: store/reload isn't a memory barrier; they're just talking about out-of-order execution preserving program-order semantics.)
mfence is expensive, but the trick that makes this better than just using lock or [N+rdi], al is that we can batch operations. e.g. do 32 or instructions and then 32 read-back. It's a tradeoff between mfence overhead per operation vs. increased chance of false-sharing (reading back cache lines that had already been invalidated by another CPU claiming them).
Instead of an actual mfence instruction, we can do the last or of a group as a lock or. This is better for throughput on both AMD and Intel. For example, according to Agner Fog's tables, mfence has one per 33c throughput on Haswell/Skylake, where lock add (same performance as or) has 18c or 19c throughput. Or for Ryzen, ~70c (mfence) vs. ~17c (lock add).
If we keep the amount of operations per fence very low, the array index (m[i]/8) + mask (1<<(m[i] & 7)) can be kept in registers for all the operations. This probably isn't worth it; fences are too expensive to do as often as every 6 or operations. Using the bts and bt bit-string instructions would mean we could keep more indices in registers (because no shift-result is needed), but probably not worth it because they're slow.
Using vector registers to hold indices might be a good idea, to avoid having to reload them from memory after the barrier. We want the load addresses to be ready as soon as the read-back load uops can execute (because they're waiting for the last store before the barrier to commit to L1D and become globally visible).
Using single-byte read-modify-write makes actual conflicts as unlikely as possible. Each write of a byte only does a non-atomic RMW on 7 neighbouring bytes. Performance still suffers from false-sharing when two threads modify bytes in the same 64B cache-line, but at least we avoid having to actually redo as many or operations. 32-bit element size would make some things more efficient (like using xor eax,eax / bts eax, reg to generate 1<<(m[i] & 31) with only 2 uops, or 1 for BMI2 shlx eax, r10d, reg (where r10d=1).)
Avoid the bit-string instructions like bts [N], eax: it has worse throughput than doing the indexing and mask calculation for or [N + rax], dl. This is the perfect use-case for it (except that we don't care about the old value of the bit in memory, we just want to set it), but still its CISC baggage is too much.
In C, a function might look something like
/// UGLY HACKS AHEAD, for testing only.
// #include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void set_bits( volatile uint8_t * restrict N, const unsigned *restrict M, size_t len)
{
    const int batchsize = 32;

    // FIXME: loop bounds should be len-batchsize or something.
    for (int i = 0 ; i < len ; i+=batchsize ) {
        for (int j = 0 ; j<batchsize-1 ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            N[idx] |= mask;
        }

        // do the last operation of the batch with a lock prefix as a memory barrier.
        // seq_cst RMW is probably a full barrier on non-x86 architectures, too.
        unsigned idx = M[i+batchsize-1];
        unsigned mask = 1U << (idx&7);
        idx >>= 3;
        __atomic_fetch_or(&N[idx], mask, __ATOMIC_SEQ_CST);
        // _mm_mfence();

        // TODO: cache `M[]` in vector registers
        for (int j = 0 ; j<batchsize ; j++ ) {
            unsigned idx = M[i+j];
            unsigned mask = 1U << (idx&7);
            idx >>= 3;
            if (! (N[idx] & mask)) {
                __atomic_fetch_or(&N[idx], mask, __ATOMIC_RELAXED);
            }
        }
    }
}
This compiles to approximately what we want with gcc and clang. The asm (Godbolt) could be more efficient in several ways, but might be interesting to try this. This is not safe: I just hacked this together in C to get the asm I wanted for this stand-alone function, without inlining into a caller or anything. __atomic_fetch_or is not a proper compiler barrier for non-atomic variables the way asm("":::"memory") is. (At least the C11 stdatomic version isn't.) I should probably have used the legacy __sync_fetch_and_or, which is a full barrier for all memory operations.
It uses GNU C atomic builtins to do atomic RMW operations where desired on variables that aren't atomic_uint8_t. Running this function from multiple threads at once would be C11 UB, but we only need it to work on x86. I used volatile to get the asynchronous-modification-allowed part of atomic without forcing N[idx] |= mask; to be atomic. The idea is to make sure that the read-back checks don't optimize away.
I use __atomic_fetch_or as a memory barrier because I know it will be on x86. With seq_cst, it probably will be on other ISAs, too, but this is all a big hack.
There are a couple of operations involved in sets (A,B = set, X = element in a set):
Set operation                 Instruction
---------------------------------------------
Intersection of A,B           A and B
Union of A,B                  A or B
Symmetric difference of A,B   A xor B
A is subset of B              A and B = B
A is superset of B            A and B = A
A <> B                        A xor B <> 0
A = B                         A xor B = 0
X in A                        BT [A],X
Add X to A                    BTS [A],X
Subtract X from A             BTC [A],X
Given the fact that you can use the boolean operators to replace set operations you can use VPXOR, VPAND etc.
To set, reset or test individual bits you simply use
mov eax,BitPosition
BT [rcx],rax
You can set if a set is (equal to) empty (or something else) using the following code
vpxor ymm0,ymm0,ymm0 //ymm0 = 0
//replace the previous instruction with something else if you don't want
//to compare to zero.
vpcmpeqq ymm1,ymm0,[mem] //compare mem qwords to 0 per qword
vpslldq ymm2,ymm1,8 //line up qw0 and 1 + qw2 + 3
vpand ymm2,ymm1,ymm2 //combine qw0/1 and qw2/3
vpsrldq ymm1,ymm2,16 //line up qw0/1 and qw2/3
vpand ymm1,ymm1,ymm2 //combine qw0123, all in the lower 64 bits.
//if the set is empty, all bits in ymm1 will be 1.
//if its not, all bits in ymm1 will be 0.
(I'm sure this code can be improved using the blend/gather etc instructions)
From here you can just extend to bigger sets or other operations.
Note that bt, btc and bts with a memory operand are not limited to 64 bits.
The following will work just fine.
mov eax,1023
bts [rcx],rax //set the 1024th element (the first element is 0).

How to set number of threads in C++

I have written the following multi-threaded program for sorting using std::sort. In my program, grainSize is a parameter. Since grainSize (or the number of threads that can be spawned) is a system-dependent feature, I cannot figure out the optimal value to set grainSize to. I work on Linux.
int compare(const char*,const char*)
{
    //some complex user defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);

        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.

        future.wait();

        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting, so that for sorting large vectors I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm, then please do share.
I don't know why the result produced by multThreadedSort() is not sorted; if you see some logical error in it, please let me know.
This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not equal to grainSize: it will depend on the list size, and will potentially be much more than grainSize.
Just replace grainSize by:
unsigned int grainSize = std::max<std::size_t>(items.size()/nThreads, 40);
The 40 is arbitrary, but is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting the thread would be larger than the time spent sorting the few items). It may be optimized by trial-and-error, and is potentially larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you cannot assume more than n*2 threads running in your system, where n is the number of CPU cores.
Modern OSes use the concept of hyperthreading, so one CPU core can run 2 threads at a time.
As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency();

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4, i.e. it's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores: user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the implementation of glibc rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for computePixel:
class Camera {
    // constructors destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);

    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.

    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Another aspect: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays); could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_

// max number of thread states to store
const int maxThreadNum = 100;

void init_threadrand();

// requires openmp, for thread number
int threadrand();

#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
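A minimal usage sketch (the Monte Carlo loop here is hypothetical, just to show where the calls go):
#include <cstdlib>
#include <omp.h>
#include "threadrand.h"

int main() {
    init_threadrand();  // seed one state per possible thread, once, before any parallel region

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < 1000000; ++i) {
        // each thread indexes its own state via omp_get_thread_num(), so no locking
        sum += threadrand() / static_cast<double>(RAND_MAX);
    }
    return 0;
}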
Without knowing what machine you're running this on, and without really seeing the code of your computePixel function, the answer is: it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.
Your OS thread scheduler is non-deterministic, so when a thread will run is not predictable. If it so happens that your threads are not running much, or are contending for CPU cores, this can also slow things down.
Thread affinity also plays a role. A thread is scheduled on a particular core, and for performance reasons the scheduler normally tries to keep it on the same core unless there is a good reason to move it elsewhere. If more than one of your threads ends up running on a single core, they have to share that core, which is another reason things could slow down.
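If you want to rule scheduling and affinity effects out experimentally, on Linux you can pin a thread to a core yourself. This is only a sketch using the GNU-specific pthread_setaffinity_np (g++ on Linux defines _GNU_SOURCE by default); it is not something your code needs, just a way to experiment:
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single core (Linux/glibc specific).
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    int const rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
}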
There are some other factors which I don't remember off the top of my head; however, I suggest doing some reading on threading. It's a complicated and extensive subject, and there is lots of material out there.
Is the data written at the end something that other threads need in order to run computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread ends up writing the value of a pixel right beside one written by another thread (they all write to the sensor array). If two such output values share the same CPU cache line, this forces the cores to keep invalidating and re-fetching that line, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether that is enough depends on how big your rows are, and the end of each row will still overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and having each thread work on a series of sequential rows (for example rows 1..10, 11..20, 21..30, 31..40). This would greatly reduce the sharing.
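For example, something along these lines (a sketch of the suggestion above, reusing the render loop from the question; schedule(static) with no chunk size hands each thread one contiguous block of rows):
void Raytracer::render(Camera& cam) {
    cam.setSamplingFunc(getSamplingFunction());
    int const rows = cam.height();
    int const cols = cam.width();

    // One contiguous block of rows per thread, so neighbouring threads only
    // share a cache line of _sensor at the block boundaries instead of every
    // few rows as with schedule(dynamic, 4).
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            cam.computePixel(i, j);
        }
    }
}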
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find that in general hyper-threading is totally useless when you have a lot of computation, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.
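If you'd rather not touch the BIOS, capping the thread count has much the same effect for this program: either export OMP_NUM_THREADS=2 before running, or call the OpenMP API early in main (a minimal sketch):
#include <omp.h>

int main() {
    omp_set_num_threads(2);  // one thread per physical core on the i3-2310M; overrides OMP_NUM_THREADS
    // ... build the scene and call Raytracer::render(cam) as before ...
    return 0;
}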