Why isn't this random number generator thread-safe? (C++)

I was using the rand() function to generate pseudo-random numbers between 0 and 1 for a simulation, but when I decided to make my C++ code run in parallel (via OpenMP) I noticed that rand() isn't thread-safe and also isn't very uniform.
So I switched to a (so-called) more uniform generator presented in many answers to other questions. It looks like this:
double rnd(const double & min, const double & max) {
    static thread_local mt19937* generator = nullptr;
    if (!generator) generator = new mt19937(clock() + omp_get_thread_num());
    uniform_real_distribution<double> distribution(min, max);
    return fabs(distribution(*generator));
}
But I saw many scientific errors in the original problem I was simulating: results that contradicted both the output of rand() and common sense.
So I wrote a test that generates 500k random numbers with this function, calculates their average, repeats this 200 times, and plots the results.
double SUM=0;
for(r=0; r<=10; r+=0.05){            // repeat ~200 times
    #pragma omp parallel for ordered schedule(static)
    for(w=1; w<=500000; w++){
        double a;
        a=rnd(0,1);
        SUM=SUM+a;
    }
    SUM=SUM/w_max;                   // w_max == 500000, the sample count
    ft<<r<<'\t'<<SUM<<'\n';          // ft is the output file stream
    SUM=0;
}
We know that if, instead of 500k numbers, I could average infinitely many, the result would be a flat line at 0.5. With 500k numbers we expect small fluctuations around 0.5.
When running the code with a single thread, the result is acceptable; with 2, 3, and 4 threads it degrades.
[plots omitted: averages over the 200 runs for the single-thread, 2-thread, 3-thread, and 4-thread cases]
I do not have my 8-thread CPU at hand right now, but the results were even worse there.
As you can see, the averages are neither uniform nor stable: they fluctuate strongly, around a mean that is itself off.
So is this pseudo-random generator thread-unsafe too?
Or am I making a mistake somewhere?

There are three observations about your test output I would make:
The averages have much higher variance than a good random source should produce. You observed this yourself by comparing to the single-thread results.
The calculated average decreases with thread count and never reaches the original 0.5 (i.e. it's not just higher variance but also changed mean).
There is a temporal relation in the data, particularly visible in the 4 thread case.
All this is explained by the race condition present in your code: You assign to SUM from multiple threads. Incrementing a double is not an atomic operation (even on x86, where you'll probably get atomic reads and writes on registers). Two threads may read the current value (e.g. 10), increment it (e.g. both add 0.5) and then write the value back to memory. Now you have two threads writing a 10.5 instead of the correct 11.
The more threads try to write to SUM concurrently (without synchronization), the more of their changes are lost. This explains all observations:
How hard the threads race each other in each individual run determines how many results are lost, and this can vary from run to run.
The average is lower with more races (for example, more threads) because more results are lost. You can never exceed the statistical 0.5 average because you only ever lose writes.
As the threads and scheduler "settle in", the variance is reduced. This is a similar reason to why you should "warm up" your tests when benchmarking.
Needless to say, this is undefined behavior. It just shows benign behavior on x86 CPUs, but this is not something the C++ standard guarantees you. For all you know, the individual bytes of a double might be written to by different threads at the same time resulting in complete garbage.
The proper solution is to add the doubles thread-locally and then (with synchronization) combine the per-thread sums at the end. OpenMP has reduction clauses for this specific purpose.
For integral types, you could use std::atomic<IntegralType>::fetch_add(). std::atomic<double> exists, but before C++20 the mentioned function (and the other arithmetic operations) are only available for integral types.
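If you do need an atomic floating-point accumulator before C++20, a compare-exchange loop is the usual workaround; a minimal sketch:

#include <atomic>

// Atomically adds v to target by retrying until no other thread
// modified target between our read and our write.
void atomic_add(std::atomic<double>& target, double v) {
    double expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + v)) {
        // on failure, expected was refreshed with the current value; retry
    }
}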

The problem is not in your RNG, but in your summation. There is simply a race condition on SUM. To fix this, use a reduction, e.g. change the pragma to:
#pragma omp parallel for ordered schedule(static) reduction(+:SUM)
Note that using thread_local with OpenMP is technically undefined behavior: the interaction between OpenMP and C++11 threading concepts is not well defined, although it will probably work in practice. So the safe OpenMP alternative for you would be:
static mt19937* generator = nullptr;
#pragma omp threadprivate(generator)
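Putting both fixes together, here is a minimal sketch of the corrected code (the seeding via clock() + omp_get_thread_num() follows the question's own scheme; the names rnd and SUM are the question's):

#include <omp.h>
#include <ctime>
#include <random>

static std::mt19937* generator = nullptr;
#pragma omp threadprivate(generator)     // one generator instance per OpenMP thread

double rnd(double min, double max) {
    if (!generator)
        generator = new std::mt19937(clock() + omp_get_thread_num());
    std::uniform_real_distribution<double> distribution(min, max);
    return distribution(*generator);
}

double average_of(int n) {
    double SUM = 0;
    // reduction(+:SUM) gives every thread a private copy of SUM and
    // combines the copies once, with synchronization, after the loop.
    #pragma omp parallel for reduction(+:SUM)
    for (int w = 0; w < n; w++)
        SUM += rnd(0, 1);
    return SUM / n;
}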

Related

How to optimize a simple loop?

The loop is simple
void loop(int n, double* a, double const* b)
{
#pragma ivdep
for (int i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
I am using the Intel C++ compiler and currently use #pragma ivdep for optimization. Is there any way to make it perform better, such as using multiple cores and vectorization together, or other techniques?
This loop is absolutely vectorizable by the compiler. But make sure the loop was actually vectorized (using the compiler's -qopt-report5, the assembly output, Intel (Vectorization) Advisor, or other techniques). One overkill way is to create a performance baseline using the -no-vec option (which disables both ivdep-driven and auto-vectorization) and then compare execution times against it. This is not a good way to check for vectorization, but it's useful for the general performance analysis of the next bullets.
If the loop hasn't actually been vectorized, push the compiler to auto-vectorize it as described in the next bullet. Note that the next bullet can be useful even if the loop was successfully auto-vectorized.
To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested it to you); (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also the drawback of being unsupported in compilers older than Intel Compiler 14 and of being more "dangerous" for other loops). To re-emphasize: this may seem to do the same thing as ivdep, but depending on various circumstances it can be the better and more powerful option; see the sketch after this list.
The given loop has fine-grained iterations (too little computation per iteration) and overall is not purely compute-bound (the effort/cycles the CPU spends loading/storing data from/to cache/memory is comparable to, if not bigger than, the effort/cycles spent on the multiplication). Unrolling is often a good way to mitigate such disadvantages slightly, and I would recommend explicitly asking the compiler to unroll it with #pragma unroll. In fact, for certain compiler versions the unrolling happens automatically. Again, you can check whether the compiler did it using -qopt-report5, the loop assembly, or Intel (Vectorization) Advisor.
In the given loop you deal with a "streaming" access pattern: you load/store data contiguously from/to memory (and the cache subsystem will not help much for big values of n). So, depending on the target hardware, the use of multi-threading (on top of SIMD), etc., your loop will likely become memory-bandwidth bound in the end. Once that happens, you can use techniques like loop blocking, non-temporal stores, and aggressive prefetching. All of these techniques are worth a separate article, although for prefetching/NT-stores the Intel Compiler gives you some pragmas to play with.
If n is huge, and you are already prepared for memory-bandwidth troubles, you could use something like #pragma omp parallel for simd, which simultaneously thread-parallelizes and vectorizes the loop. However, the quality of this feature became decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually: n = n1 × n2 × n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking, and n3 is for vectorization. Rewrite the given loop as a nest of three loops, where the outer loop has n1 iterations (with #pragma omp parallel for applied), the next level has n2 iterations, and n3 is innermost (with #pragma omp simd applied); a sketch follows below.
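As a rough illustration of bullets 2 and 5, here is a minimal sketch; the __restrict spelling, the BLOCK size, and the collapsing of the n2/n3 split into a single blocked loop are illustrative assumptions that would need tuning per platform and compiler:

// Bullet 2: disambiguate the pointers and request vectorization.
void loop_simd(int n, double* __restrict a, double const* __restrict b)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}

// Bullet 5: split the iteration space, distributing cache-sized
// blocks among threads and vectorizing within each block.
void loop_split(int n, double* __restrict a, double const* __restrict b)
{
    const int BLOCK = 4096;                  // hypothetical blocking factor
    #pragma omp parallel for                 // threads take whole blocks
    for (int j = 0; j < n; j += BLOCK) {
        const int end = (j + BLOCK < n) ? j + BLOCK : n;
        #pragma omp simd                     // innermost loop is vectorized
        for (int i = j; i < end; ++i)
            a[i] *= b[i];
    }
}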
Some up to date links with syntax examples and more info:
unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
restrict vs. ivdep
NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf
Note 1: I apologize for not providing various code snippets here. There are at least two justifiable reasons: 1. My five bullets apply to very many kernels, not just yours. 2. The specific combination of pragmas/manual rewriting techniques and the corresponding performance results will vary depending on the target platform, ISA, and compiler version.
Note 2: A last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM; in fact your loop could end up very similar to some of them. Now think of the LINPACK/STREAM characteristics of x86 CPUs, and especially of the Intel Xeon Phi platform. They are very good indeed, and will become even better with high-bandwidth-memory platforms (like the 2nd-generation Xeon Phi). So theoretically there is no reason to think that your loop does not map well to at least some variants of x86 hardware (note that I didn't say the same about an arbitrary kernel).
Assuming the data pointed to by a can't overlap the data pointed to by b, the most important information you can give the compiler to let it optimize the code is that fact.
In older ICC versions, restrict was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact, ivdep is a weaker promise to the optimizer than it appears, and generally doesn't have the intended effect).
But if n is large, the whole thing will be dominated by the cache misses, so no local optimization can help.
Manual loop unrolling is a simple way to optimize your code, and my code follows below. The original loop costs 618.48 ms, while loop2 costs 381.10 ms on my PC; the compiler is GCC with option '-O2'. I don't have Intel ICC to verify the code, but I think the optimization principles are the same.
Similarly, I did some experiments comparing the execution time of two programs that XOR two blocks of memory, where one program is vectorized with the help of SIMD instructions and the other is manually loop-unrolled. If you are interested, see here.
P.S. Of course loop2 only works when n is even.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define LEN (512*1024)
#define times 1000

void loop(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

void loop2(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; i=i+2, a=a+2, b=b+2){   /* process two elements per iteration */
        *a *= *b;
        *(a+1) *= *(b+1);
    }
}

int main(void){
    double *la, *lb;
    struct timeval begin, end;
    int i;
    la = (double *)malloc(LEN*sizeof(double));
    lb = (double *)malloc(LEN*sizeof(double));
    /* initialize inputs so the loops do not read indeterminate values */
    for(i = 0; i < LEN; ++i){
        la[i] = 1.0;
        lb[i] = 1.0;
    }
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0
            +(end.tv_usec-begin.tv_usec)/1000.0);
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop2(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0
            +(end.tv_usec-begin.tv_usec)/1000.0);
    free(la);
    free(lb);
    return 0;
}
I assume that n is large. You can distribute the workload over k CPUs by starting k threads and giving each one n/k elements. Use big chunks of consecutive data for each thread; don't do fine-grained interleaving. Try to align the chunks with cache lines.
If you plan to scale to more than one NUMA node, consider explicitly copying the chunks of the workload to the node the thread runs on, and copying the results back. In this case it might not really help, because the workload for each step is very simple. You'll have to run tests for that.

How to generate a massive amount of high quality Random Numbers?

I'm working on a random walk simulation of particles moving in a lattice. For that I must create a massive amount of random numbers, about 10^12 and above. Currently I'm using the facilities C++11 provides in <random>. When profiling my program, I see that a major amount of time is spent in <random>. The vast majority of those numbers are between 0 and 1, evenly distributed. Here and there I also need a number from a binomial distribution, but the focus lies on the 0..1 numbers.
The question is: What can I do to reduce the CPU time needed to generate these numbers and what would the impact be on their quality?
As you can see, I tried different engines, but that had no big effect on CPU time. Further, what is the difference between my uniform01(gen) and generate_canonical<double,numeric_limits<double>::digits>(gen) anyhow?
Edit: Reading through the answers, I conclude that there is no single ideal solution to my problem. Thus I decided to first make my program multithreading-capable and run multiple RNGs in different threads (seeded with one random_device number plus a thread-individual increment). For the time being this seems to be the unavoidable step (multithreading would be required anyhow). As a further step, depending on the exact requirements, I may switch to the suggested Intel RNG or to Thrust. That means my RNG implementation should not be too complex, which it currently is not. But for now I'd like to focus on the physical correctness of my model rather than on programming details; that comes as soon as the output of my program is physically correct.
Thrust
Concerning Intel RNG
Here is what I do currently:
#include <random>
#include <limits>

class Generator {
public:
    Generator();
    virtual ~Generator();
    double rand01();                 //random number [0,1)
    int binomial(int n, double p);   //binomial distribution with n samples with probability p
private:
    std::random_device randev;       //seed
    /*Engines*/
    std::mt19937_64 gen;
    //std::mt19937 gen;
    //std::default_random_engine gen;
    /*Distributions*/
    std::uniform_real_distribution<double> uniform01;
    std::binomial_distribution<> binomialdist;
};

Generator::Generator() : randev(), gen(randev()), uniform01(0.,1.), binomialdist(1,1.) {
}

Generator::~Generator() { }

double Generator::rand01() {
    //return uniform01(gen);
    return std::generate_canonical<double, std::numeric_limits<double>::digits>(gen);
}

int Generator::binomial(int n, double p) {
    binomialdist.param(std::binomial_distribution<>::param_type(n,p));
    return binomialdist(gen);
}
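For what it's worth, a minimal sketch of using this class with one generator per thread (the thread_local keyword here is an assumption; as noted in the first question above, its interaction with OpenMP is technically not defined, and #pragma omp threadprivate is the conforming alternative):

#include <omp.h>

double sum_of_randoms(long n) {
    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        thread_local Generator g;   // each thread seeds its own Generator via random_device
        sum += g.rand01();
    }
    return sum;
}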
You can pre-generate random numbers and consume them later when you need them.
If you need true random numbers, I suggest you use a service like http://www.random.org/ that provides numbers derived from environmental noise rather than computed by an algorithm.
If you need a massive amount of random numbers, and I mean MASSIVE, do a careful search on the internet for IBM's floating point random number generator, published maybe ten years ago. You'll have to buy either a PowerPC machine, or a newer Intel machine with fused multiply-add. They achieved random numbers at a rate of one per cycle per core. So if you bought a new Mac Pro, you could achieve probably 50 billion random numbers per second.
Perhaps instead of using a CPU you could use a GPU to generate many numbers concurrently?
Efficient Random Number Generation and Application Using CUDA
On my i3, the following program runs in about five seconds:
#include <random>
#include <cstdio>

std::mt19937_64 foo;

double drand() {
    union {
        double d;
        long long l;
    } x;
    x.d = 1.0;                       // sign and exponent bits for [1,2)
    x.l |= foo() & (1LL<<53)-1;      // OR 53 random low bits in; only the 52
                                     // mantissa bits change, since bit 52 is
                                     // already set in the pattern of 1.0
    return x.d-1;                    // shift [1,2) down to [0,1)
}

int main() {
    double d = 0;
    for (int i = 0; i < 1e9; i++)
        d += drand();
    printf("%g\n", d);
}
whereas replacing the drand() call with the following results in a program that runs in about ten seconds:
double drand2() {
    return std::generate_canonical<double,
        std::numeric_limits<double>::digits>(foo);
}
Using the following instead of drand() also results in a program that runs in about ten seconds:
std::uniform_real_distribution<double> uni;

double drand3() {
    return uni(foo);
}
Perhaps the hacky drand() above suits your purposes better than the standard solutions.
Task Definition
The OP asks for an answer to both:
1. speed of generation, assuming a set of 10E+012 random numbers to be "massive", and
2. quality of the generator, with the weak assumption that numbers merely evenly distributed over some range of values are also random.
However, there are more cardinal aspects to be addressed and successfully solved for the real system:
A. Define whether your system simulation needs a guarantee that the sequence of random numbers is repeatable for future re-runs of an experiment. If it does not, so re-runs of the simulated experiment may yield principally different results, then the randomizer process (or pre-randomizer and randomized selector) need not worry about re-entrant, stateful operation, and its implementation gets much simpler.
B. Define to what level you need to prove the quality of randomness of the generated numbers (or whether the generated sets have to follow some specific law of statistical theory: some known synthetic distribution, or a truly random sequence with an utmost Kolmogorov complexity of the resulting set). One need not be an NSA expert to state that numerical generation of truly random sequences is a very hard problem with computational costs that grow with the randomness quality of the product.
Hyper-chaotic and true-random sequences are computationally extremely expensive. Using low- or poor-randomness generators is not an option for randomness-quality-sensitive applications (whatever the marketing papers may say, no MIL-STD- or NSA-graded system will ever accept this compromised quality in environments where the results indeed matter, so why settle for less in scientific simulations? Perhaps it is not a problem if you do not mind missing many "unvisited" states of the simulated phenomena).
C. Verify how many random numbers your simulation needs to consume per microsecond, and whether this design requirement is constant or may scale up as you move to a multi-threaded, vectorized, or Grid/Cloud-based distributed computation framework.
D. Determine whether your simulation needs global, per-thread, or per-Grid/CloudNode access management to the pool of randomized numbers under a vectorized or Grid/Cloud-based computing strategy.
Task Solution Approach
The fastest [1] and best [2] solution, with [A] and [B] solved and options for [D], is to pre-generate numbers of the utmost randomness quality into an adequate access pool (and pay an acceptable cost in [C] and [D] for the access-policy and access-management controls to re-read from the pool, rather than re-generate).

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting executable use multiple cores, i.e. make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just writing a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];
    float *joe = new float[50102133];
    int i,j,k,l;
    //cout << "Starting test...";
    for (i=0;i<50102133;i++)
        bob[i] = sin(i);
    for (j=0;j<50102133;j++)
        bob[j] = sin(j*j);
    for (k=0;k<50102133;k++)
        bob[k] = sin(sqrt(k));
    for (l=0;l<50102133;l++)
        bob[l] = cos(l*l);
    cout << "finished test.";
    cout << "the 1001200th element is " << bob[1001200];
    return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
    total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>

static const int size = 1024 * 1024 * 128;

int main(){
    double total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
    std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use threads or processes; you may also want to look at OpenMP.
C++11 has support for threading, but C++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

How do I force the compiler not to skip my function calls?

Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large array <double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:
//start timer here
double r;
for (int i = 0; i < 1000000; i+=2) {
r = a(vals[i], vals[i+1]);
}
//stop timer here
Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.
Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?
(I have seen other threads about inserting empty asm blocks but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of the idea of adding the results sum += r; during each iteration because that's extra work that should not be included in the resulting timings. For the purposes of this question, it would be great if we could focus on other alternative solutions, although for anyone interested in this there is a lively discussion in the comments where the consensus is that += is the most appropriate method in many cases. )
Put a in a separate compilation unit and do not use LTO (link-time optimizations). That way:
The loop is always identical (no difference due to optimizations based on a)
The overhead of the function call is always the same
To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a
Note that the compiler can not assume that the call to a has no side-effect, so it can not optimize the loop away and replace it with just the last call.
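A minimal sketch of that setup (file names and the body of a() are illustrative; compile both files together without -flto, e.g. g++ -O2 a.cpp bench.cpp, so the call stays opaque to the benchmark's optimizer):

// a.cpp -- one competing implementation, in its own compilation unit
double a(double b, double c) {
    return b * c + b;   // placeholder body under test
}

// bench.cpp
#include <chrono>
#include <cstdio>

double a(double b, double c);   // declared only; body invisible here

int main() {
    static double vals[1000000] = {0};  // input data as in the question
    const auto t0 = std::chrono::steady_clock::now();
    double r = 0;
    for (int i = 0; i < 1000000; i += 2)
        r = a(vals[i], vals[i + 1]);
    const auto t1 = std::chrono::steady_clock::now();
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    std::printf("r=%g, took %lld ns\n", r, (long long)ns.count());
}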
A totally different approach could use RDTSC, a hardware register in the CPU core that counts clock cycles. It's sometimes useful for micro-benchmarks, but it's not exactly trivial to interpret the results correctly; search SO and the web for more information on RDTSC.

How to properly choose rng seed for parallel processes

I'm currently working on a C/C++ project where I'm using a random number generator (gsl or boost). The whole idea can be simplified to a non-trivial stochastic process which receives a seed and returns results. I'm computing averages over different realisations of the process.
So, the seed is important: the processes must have different seeds or the averages will be biased.
So far, I'm using time(NULL) to provide a seed. However, if two processes start in the same second, the seed is the same. That happens because I'm using parallelisation (with OpenMP).
So, my question is: how to implement a "seed giver" on C/C++ which gives independent seeds?
For instance, I thought of using the thread number (thread_num): seed = time(NULL)*thread_num. However, this means the seeds are correlated: they are multiples of each other. Does that pose any problem for the "pseudo-randomness", or is it as good as sequential seeds?
The requirements are that it must work on both Mac OS (my PC) and a Linux distribution similar to CentOS (the cluster), and naturally give independent realisations.
A commonly used scheme for this is to have a "master" RNG used to generate seeds for each process-specific RNG.
The advantage of such a scheme is that the whole computation is determined by only one seed, which you can record somewhere to be able to replay any simulation (this might be useful to debug nasty bugs).
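A minimal sketch of such a scheme (the engine choice and thread count are illustrative assumptions):

#include <random>
#include <vector>

int main() {
    const int NUM_THREADS = 8;            // illustrative
    std::mt19937 master(12345);           // the single seed you record for replay
    std::vector<std::mt19937> engines;
    for (int t = 0; t < NUM_THREADS; ++t)
        engines.emplace_back(master());   // each worker RNG gets a master-drawn seed
    // inside a parallel region, thread t then uses engines[t]
}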
We ran into a similar problem on a Beowulf computing grid, the solution we used was to incorporate the pid of the process into the RNG seed, like so:
time(NULL)*thread_num*getpid()
Of course, you could just read from /dev/urandom or /dev/random into an integer.
When faced with this problem I often use seed_rng from Boost.Uuid. It uses time, clock and random data from /dev/urandom to calculate a seed. You can use it like
#include <boost/uuid/seed_rng.hpp>
#include <iostream>

int main() {
    int seed = boost::uuids::detail::seed_rng()();
    std::cout << seed << std::endl;
}
Note that seed_rng comes from a detail namespace, so it can go away without further notice. In that case writing your own implementation based on seed_rng shouldn't be too hard.
Mac OS is Unix too, so it probably has /dev/random. If so, that's the best solution for obtaining the seeds. Otherwise, if the generator is good, taking time(NULL) once and then incrementing it for the seed of each generator should give reasonably good results.
If you are on x86 and don't mind making the code non-portable then you could read the Time Stamp Counter (TSC) which is a 64-bit counter that increments at the CPU (max) clock rate (about 3 GHz) and use that as a seed.
#include <stdint.h>

static inline uint64_t rdtsc()
{
    uint64_t tsc;
    asm volatile
    (
        "rdtsc\n\t"
        "shl\t$32,%%rdx\n\t"   // rdx = TSC[ 63 : 32 ] : 0x00000000
        "add\t%%rdx,%%rax\n\t" // rax = TSC[ 63 : 0 ]
        : "=a" (tsc) : : "%rdx"
    );
    return tsc;
}
When comparing two infinite sequences produced by the same pseudo-random number generator with different seeds, we can see that they are the same sequence, just delayed by some lag tau. Usually this time scale is much bigger than your problem, which ensures that the two random walks are uncorrelated.
If your stochastic process lives in a high-dimensional phase space, I think one good suggestion could be:
seed = MAXIMUM_INTEGER/NUMBER_OF_PARALLEL_RW*thread_num + time(NULL)
Notice that with this scheme you are not guaranteeing that the lag tau is big!
If you have some knowledge of your system's time scale, you can call your random number generator some number of times in order to generate seeds that are equidistant by some time interval.
Maybe you could try std::chrono high resolution clock from C++11:
Class std::chrono::high_resolution_clock represents the clock with the smallest tick period available on the system. It may be an alias of std::chrono::system_clock or std::chrono::steady_clock, or a third, independent clock.
http://en.cppreference.com/w/cpp/chrono/high_resolution_clock
But to be honest, I'm not sure there is anything wrong with srand(0); srand(1); srand(2)... though my knowledge of rand is very basic. :/
For crazy safety consider this:
Note that all pseudo-random number generators described below are CopyConstructible and Assignable. Copying or assigning a generator will copy all its internal state, so the original and the copy will generate the identical sequence of random numbers.
http://www.boost.org/doc/libs/1_51_0/doc/html/boost_random/reference.html#boost_random.reference.generators
Since most of the generators have extremely long cycles, you could generate one, copy it as the first generator, generate X numbers with the original, copy it as the second, generate X numbers with the original, copy it as the third, and so on.
If your users each call their own generator fewer than X times, the streams will not overlap; a sketch follows below.
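A minimal sketch of that idea, using the standard engine's discard() to advance the original by X draws between copies (X and the stream count are illustrative; note that discard() on a Mersenne Twister takes time linear in X):

#include <random>
#include <vector>

int main() {
    const unsigned long long X = 100000000;   // per-stream budget (illustrative)
    std::mt19937_64 original(42);
    std::vector<std::mt19937_64> streams;
    for (int s = 0; s < 4; ++s) {
        streams.push_back(original);  // the copy carries the full internal state
        original.discard(X);          // the next copy will start X draws later
    }
    // the streams never overlap as long as each produces fewer than X numbers
}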
The way I understand your question, you have multiple processes using the same pseudo-random number generation algorithm, and you want each "stream" of random numbers (in each process) to be independent of the others. Am I correct?
In that case, you are right to suspect that giving different (correlated) seeds does not guarantee you anything unless the RNG algorithm says so. You basically have two solutions:
Simple version
Use a single source of random numbers, with a single seed. Then feed random numbers in a round-robin fashion to each process.
This solution is slow but provides some guarantee that the numbers you give to your processes are OK.
You can do the same thing by generating all the random numbers you need at once and then splitting the set into as many slices as you have processes; see the sketch below.
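A minimal sketch of the generate-once-and-slice variant (pool size and process count are illustrative assumptions):

#include <random>
#include <vector>

int main() {
    const std::size_t PER_PROCESS = 1000000;   // illustrative slice size
    const int PROCESSES = 4;

    std::mt19937 master(12345);                // single seed for the whole run
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::vector<double> pool(PER_PROCESS * PROCESSES);
    for (double& x : pool) x = u01(master);

    // process p consumes pool[p * PER_PROCESS] .. pool[(p+1) * PER_PROCESS - 1]
}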
Use an RNG designed for that
You can find in papers and on the web several algorithms specifically designed to provide independent streams of random numbers from a single initial state. They are complicated, but most come with source code. The idea is generally to "split" the RNG space (the values you can obtain from the initial state) into various chunks, like above. They are just faster because the algorithm used makes it possible to compute cheaply what the state of the RNG would be after skipping a given number of values.
These generators are generally called "parallel random number generators".
The most popular ones are probably these two:
RngStreams: http://statmath.wu.ac.at/software/RngStreams/
SPRNG: http://sprng.cs.fsu.edu/
Check their manuals to fully understand what they do, how they do it, and if it really is what you need.