C++ super fast thread-safe rand function - c++

void NetClass::Modulate(vector<synapse> & synapses)
{
    int size = synapses.size();
    int split = 200 * 0.5;
    for(int w = 0; w < size; w++)
        if(synapses[w].active)
            synapses[w].rmod = ((rand_r(seedp) % 200 - split) / 1000.0);
}
The function rand_r(seedp) is seriously bottlenecking my program. Specifically, it's slowing me down by 3x when run serially, and 4.4x when run on 16 cores. rand() is not an option because it's even worse. Is there anything I can do to streamline this? If it will make a difference, I think I can sustain a loss in statistical randomness. Would pre-generating (before execution) a list of random numbers and then loading it onto the thread stacks be an option?

The problem is that the seedp variable (and its memory location) is shared among several threads. Processor cores must synchronize their caches each time they access this ever-changing value, which hampers performance. The solution is for every thread to work with its own seedp, avoiding the cache synchronization.

It depends on how good the statistical randomness needs to be. For high quality, the Mersenne twister, or its SIMD variant, is a good choice. You can generate and buffer a large block of pseudo-random numbers at a time, and each thread can have its own state vector. The Park-Miller-Carta PRNG is extremely simple - these guys even implemented it as a CUDA kernel.

Marsaglia's xor-shift generator is probably the fastest "reasonable quality" generator you can use. It does not quite have the same "quality" as MT19937 or WELL, but honestly those differences are academic sophistries.
For all real, practical uses there is no observable difference, except for 1-2 orders of magnitude of difference in execution speed and 3 orders of magnitude of difference in memory consumption.
The xor-shift generator is also naturally thread-safe (in the sense that it will produce non-deterministic, pseudorandom results, and it will not crash) without anything special, and it can be trivially made thread-safe in another sense (in the sense that it will generate per-thread independent, deterministic, pseudorandom numbers) by having one instance per thread.
It could also be made threadsafe in yet another sense (generate a deterministic, pseudorandom sequence handed out to threads as they come) using atomic compare-exchange, but I don't think that's very useful.
The only three notable issues with the xor-shift generator are:
It is not k-distributed for up to 623 dimensions, but honestly who cares. I can't think in more than 4 dimensions (and even that's a lie!), and can't imagine many applications where more than 10 or 20 dimensions could possibly matter. That would have to be some quite esoteric simulation.
It passes most, but not every pedantic statistical test. Again, who cares. Most people use a random generator that does not pass even a single test and never notice.
A zero seed will produce a zero sequence. This is trivially fixed by adding a non-zero constant to one of the temporaries (I wonder why Marsaglia never thought of that?). Having said that, MT19937 also behaves extremely badly given a zero seed, and does not recover nearly as well.

Have a look at Boost: http://www.boost.org/doc/libs/1_47_0/doc/html/boost_random.html
It has a number of options which vary in complexity (= speed) and randomness (cycle length).
If you don't need maximum randomness, you might get away with a simple Mersenne Twister.

Do you absolutely need to have one shared random generator?
I had a similar contention problem a while ago; the solution that worked best for me was to create a new Random class (I was working in C#) for each thread. They're dead cheap anyway.
If you seed them properly, making sure you don't create duplicate seeds, you should be fine. Then you won't have shared state, so you don't need to use the thread-safe function.

Maybe you don't have to call it in every iteration? You could initialize an array of pre-generated random numbers and make successive use of it.

I think you can use OpenMP to parallelize the loop like this:
#pragma omp parallel for
for(int w = 0; w < size; w++)

How to generate a huge uniform distribution over a small range of int in C++ fast?

Edited: to clarify high dimension meaning
My problem is generating an array or vector of size N (corresponding to a mathematical N-dimensional vector). N is huge, more than 10^8, and each dimension is i.i.d. uniform over 1~q (or 0~q-1), where q is very small, q <= 2^4. The array must be statistically good. So the basic solution is
constexpr auto N = 1000000000UZ;
constexpr auto q = 12;
std::vector<std::uint8_t> Array(N);  // heap-allocated; a std::array this size would overflow the stack
std::random_device Device{};
std::mt19937_64 Eng{Device()};
std::uniform_int_distribution<unsigned> Dis(0, q - 1);  // IntType may not be a char type; bounds are inclusive
std::ranges::generate(Array, [&]{ return static_cast<std::uint8_t>(Dis(Eng)); });
But the problem lies in performance. I have several plans to improve it:
Because s = q^8 <= 2^32, we could use
std::uniform_int_distribution<std::uint32_t> Dis(0, s - 1);
and decompose the result t < q^8 into 8 different t_i < q, but this decomposition is not straightforward and may itself have a performance flaw.
Use boost::random::uniform_smallint, but I don't know how much improvement there will be; this can't be used together with method 1.
Use multi-threading like OpenMP or <thread>, but C++ PRNGs may not be thread-safe, so to my knowledge it's hard to write.
Use another generator such as pcg32 or anything else, but these are not thread-safe either.
Does anyone have any suggestions?
If the running time for a sequential initialization is prohibitive (for instance because it happens multiple times), you can indeed consider doing it in parallel. First of all, the C++ random number generation does not have global state, so you don't have to worry about thread-safety if you create a separate generator per thread.
However, you have to worry about independence. Clearly you cannot start them all on the same seed. But picking random seeds is also dangerous: just imagine what would happen if successive generators started on successive values in the sequence.
Here are two approaches:
If you know how many threads there are, say p, first start each generator at the same seed, and then let thread t generate t numbers. That's the starting point. Now every time you need a new number, you take p steps! This means the threads use interleaved values from the random sequence.
On the other hand, if you know how many iterations i will happen in total, start each thread on the same seed, and then let thread t take t * i steps. So now each thread gets a disjoint block of values from the sequence.
There are other approaches to parallel random number generation, but these are simple solutions based on the standard sequential generator.

How efficient is it to make a temporary uniform random distribution each time round a loop?

For example:
for (...)
{
... std::uniform_real_distribution<float>(min, max)(rng) ...
}
Intuitively it seems to me that the constructor can't need to do much besides store the two values, and there shouldn't be any state in the uniform_*_distribution instance. I haven't profiled it myself (I'm not at that kind of stage in the project yet), but I felt this question belonged out there :)
I am aware that this would be a bad idea for some distribution types - for example, std::normal_distribution might generate its numbers in pairs, and the second number would be wasted each time.
I feel what I have is more readable than just accessing rng() and doing the maths myself, but I'd be interested if there are any other ways to write this more straightforwardly.
std::uniform_real_distribution objects are lightweight, so it's not a problem to construct them each time inside the loop.
Sometimes the hidden internal state of a distribution is important, but not in this case. The reset() function does nothing in all popular STL implementations:
void
reset() { }
For example, it's not true for std::normal_distribution:
void
reset()
{ _M_saved_available = false; }
Well, some RNGs have substantial state; e.g. the Mersenne twister has something like 600 words of state (although I don't know about the C++ implementation of that). So there's the potential for taking up some time in the loop just by repeatedly creating the rng.
However, it may well be that even when taking the state into account, it's still fast enough that taking the extra time to construct the rng doesn't matter. Only profiling can tell you for sure.
My personal preference is to move the rng out of the loop and compute the desired value via a simple formula if possible (e.g. min + (max - min)*x for a uniform distribution or mu + sigma*x for a Gaussian distribution).

Thread-safe using of <random>

STL said on his lecture on Going Native, that
Multiple threads cannot simultaneously call a single object.
(Last slide on pptx file online.)
If I need uniform distribution of random numbers across multiple threads (not independent random generators for threads), how to correctly use the same uniform_int_distribution from multiple threads? Or it is not possible at all?
Just create multiple copies. A distribution is a lightweight object, cheaper than the mutex you'd need to protect it.
You're never meant to use a PRNG for long enough that you can see a perfectly uniform distribution from it. You should only see a random approximation of a uniform distribution. The fact that giving every thread its own PRNG means it would take numThreads times longer for every thread to see that perfect distribution does not matter.
PRNGs are subject to mathematical analysis to confirm that they produce each possible output the same number of times, but this does not reflect how they're meant to be used.
If they were used this way it would represent a weakness. If you watch it for long enough you know that having seen output x n times, and every other output n+1 times, the next output must be x. This is uniform but it is obviously not random.
A true RNG will never produce a perfectly uniform output, but it should also never show the same bias twice. A PRNG known to have non-uniform output will have the same non-uniform output every time it's used. This means that if you run simulations for longer to average out the noise, the PRNG's own bias will eventually become the most statistically significant factor. So an ideal PRNG should eventually emit a perfectly uniform distribution (actually often not perfect, but within a known, very small margin) over a sufficiently long period.
Your random seed will pick a random point somewhere in that sequence and your random number requests will proceed some way along that sequence, but if you ever find yourself going all the way around and coming back to your own start point (or even within 1/10th that distance) then you need to get a bigger PRNG.
That said, there is another consideration. Often PRNGs are used specifically so that their results will be predictable.
If you're in a situation where you need a mutex to safely access a single PRNG, then you've probably already lost your predictable behaviour because you can't tell which thread got the next [predictable] random number. Each thread still sees an unpredictable sequence because you can't be sure which subset of PRNG results it got.
If every thread has its own PRNG, then (so long as other ordering constraints are met as necessary) you can still get predictable results.
You can also use thread_local storage when defining your random generator engine (assuming you want only one declared in your program). That way each thread gets access to a "local" copy of it, so you won't have any data races. You cannot just use a single engine across all threads, since you may get data races when its state changes during the generation sequence.

How to set number of threads in C++

I have written the following multi-threaded program for sorting using std::sort. In my program, grainSize is a parameter. Since grainSize, or the number of threads which can be spawned, is a system-dependent feature, I am not sure what optimal value I should set grainSize to. I work on Linux.
int compare(const char*, const char*)
{
    //some complex user defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting so that, for sorting large vectors, I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm, please do share.
I don't know why the result returned by multThreadedSort() is not sorted; if you see a logical error in it, please let me know.
This gives you the optimal number of threads (typically the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not equal to grainSize: it will depend on the list size, and will potentially be much more than grainSize.
Just replace grainSize by:
unsigned int grainSize = std::max<size_t>(items.size() / nThreads, 40);
The 40 is arbitrary, but it is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting a thread would be larger than the time spent sorting the few items). It may be tuned by trial-and-error, and the optimum is potentially larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::async should already do the job for you, without your having to worry.
Note that std::async is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "ten thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you cannot assume more than n*2 threads running in your system, where n is the number of CPU cores.
Modern CPUs use hyperthreading, so one core can now run 2 threads at a time.
As mentioned in another answer, in C++11 you can get optimal number of threads using std::thread::hardware_concurrency();

Entropy and parallel random number generator seeding

I have a loop where I am adding noise to some points; these are later used as the basis for some statistical tests.
The datasets involved are quite large, so I would like to parallelise the loop using OpenMP to speed things up. The issue comes up when I want to have multiple PRNGs. I have my own PRNG class based on NR's modulo method (rand4, I think), but I am unsure how to seed the PRNGs correctly to ensure appropriate entropy.
Normally I would do something like this:
prng.initTimer();
But if I have an array of prngs, one per worker thread, then I cannot simply call initTimer on each instance -- the timer may not change, and the timers being close may introduce correlation.
I need to protect against natural correlations, not against malicious attackers (this is experimental data), so I need to have a safe way of seeding the rng array.
I thought of simply using
prng[0].initTimer()
for(int i=1; i<numRNGs; i++)
prng[i].init(prng[0].getRandNum());
Then calling my loop, but am unsure if this will introduce correlations in the modulo method.
Seeding PRNGs doesn't necessarily create independent streams. You should seed only the first instance (call it the reference) and initialise the remaining instances by fast-forwarding the reference instance. This only works if you know how many random numbers each thread will consume and a fast-forwarding algorithm is available.
I don't know much about your rand4 (I googled it, but nothing specific came up), but you shouldn't assume that it is possible to create independent streams just by seeding. You probably want to use a different (better) PRNG. Take a look at WELL. It is fast, has good statistical properties, and was developed by well-known experts. WELL512 and WELL1024 are among the fastest PRNGs available, and both have huge periods. You can initialise several WELL instances with distinct seeds in order to create independent streams. Thanks to the huge period there is almost zero chance that your PRNGs will generate overlapping streams of random numbers.
If your PRNGs are called frequently, beware of false sharing. This Herb Sutter article explains how false sharing can kill multi-core performance. Packing multiple PRNGs into a contiguous array is almost a perfect recipe for false sharing. In order to avoid it, either add padding between PRNGs or allocate the PRNGs on the heap/free store. In the latter case each RNG should be allocated individually using some sort of aligned allocator. Your compiler should provide a version of aligned malloc; check the docs (well, googling is actually faster than reading manuals). Visual C++ has _aligned_malloc; GCC has memalign and posix_memalign. The alignment value must be a multiple of the CPU's cache line size. The common practice is to align along 128-byte boundaries. For a portable solution you can use TBB's cache aligned allocator.
I think it depends on the properties of your PRNG. Usual PRNGs weaknesses are lower entropy in the lower bits and lower entropy for the first n values. So I think you should check your PRNG for such weaknesses and change your code accordingly.
Perhaps some of the diehard tests give useful information, but you can also check the first n values and their statistical properties like sum and variance yourself and compare them to the expected values.
For example, seed the PRNG and sum up the first 100 values modulo 11 of your PRNG, repeat this R times. If the total sum is very different from the expected (5*100*R), your PRNG suffers from one or both weaknesses mentioned above.
Knowing nothing about the PRNG, I'd feel safer using something like this:
prng[0].initTimer();
// Throw the first 100 values away
for(int i = 0; i < 100; i++)
    prng[0].getRandNum();
// Use only the higher bits for seed values (assuming 32-bit size)
for(int i = 1; i < numRNGs; i++)
    prng[i].init(((prng[0].getRandNum() >> 16) << 16)
                 + (prng[0].getRandNum() >> 16));
But of course, these are speculations about the PRNG. With an ideal PRNG, your approach should work fine as it is.
If you seed your PRNGs using a sequence of numbers from the same type of PRNG, they will all be producing the same sequence of numbers, offset by one from each other. If you want them to produce different numbers, you will need to seed them with a sequence of pseudorandom numbers from a different PRNG.
Alternatively, if you are on a unix-like system with a /dev/random, you can just read from that device to get a sequence of random numbers to use as your seeds.