I'm trying to speed up a program, at the heart of which is a trivial-looking loop:
double sum = 0.;
#pragma omp parallel for reduction(+:sum) // fails
for( size_t i = 0; i < _S.size(); ++i ){
    sum += _S[i].first * R(atoms, _S[i].second);
}
While the looping itself is trivial, the objects inside it are not PODs: here _S is in fact an
std::vector< std::pair<double, std::vector<size_t> > >, and R(...) is an overloaded operator()(...) const of some object. Both of its arguments are qualified as const, so the call should not have any side effects.
Since some 90% of the runtime is spent in this call, it seemed a simple thing to throw in an OpenMP pragma as shown above and enjoy a speedup by a factor of two or three;
but of course: the code works OK with a single thread, and gives plainly wrong results for more than one thread :-).
There is no data dependency, and both _S and R(...) seem safe to share between threads, but it still produces nonsense.
I'd really appreciate any pointers on how to find out what goes wrong.
UPD2:
Figured it out. Like all bugs, it's trivial. R(...) was calling the operator() of something of this sort:
class objR{
public:
    objR(const size_t N){
        _buffer.reserve(N);
    }

    double operator()(...) const{
        // do something, using the _buffer to store intermediaries
    }

private:
    // modified from the const operator(), hence mutable
    mutable std::vector<double> _buffer;
};
Clearly, different threads use the _buffer at the same time and mess it up. My solution so far is to allocate more space (memory is not a problem, the code is CPU-bound):
class objR{
public:
    objR(const size_t N){
        int nth=1;
        #ifdef _OPENMP
        nth=omp_get_max_threads();
        #endif
        // one scratch buffer of size N per thread
        _buffer.resize(nth, std::vector<double>(N));
    }

    double operator()(...) const{
        int thread_id=0;
        #ifdef _OPENMP
        thread_id = omp_get_thread_num();
        #endif
        // do something, using _buffer[thread_id] to store intermediaries
    }

private:
    mutable std::vector< std::vector<double> > _buffer;
};
This seems to work correctly. Still, since it's my first foray into things multithreaded, I'd appreciate it if somebody knowledgeable could comment on whether there is a better approach.
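For reference, one cleaner pattern (a sketch, not drop-in code: the real arguments and work are elided, as in the snippet above) is to make the scratch buffer a C++11 thread_local local variable, so every thread transparently gets its own copy and no indexing by thread id is needed:
#include <cstddef>
#include <vector>

class objR{
public:
    explicit objR(const std::size_t N) : _N(N) {}

    double operator()(...) const{
        // One buffer per thread, created lazily on first use:
        // no sharing between threads, hence no race and no locks.
        thread_local std::vector<double> buffer;
        buffer.assign(_N, 0.0);
        // do something, using buffer to store intermediaries
        return 0.0; // placeholder
    }

private:
    std::size_t _N;
};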
The access to _S[i].first and _S[i].second is perfectly safe (I can't guarantee anything about atoms). This means that your function call to R must be what's causing the problem. You need to find out what R is and post what it's doing.
On another point, names which begin with an underscore followed by an uppercase character are reserved for the implementation, and you invoke undefined behaviour by using them.
I am trying to parallelise code that I have been using for a while, without parallelisation and without issues, in a relatively large C++ project (compiled as C++11). The code relies heavily on the Eigen package (version 3.3.9 used here).
Here is a minimalistic version of the code that has to be parallelised and for which I get crashes (I hope I have not introduced errors when squashing things...):
The main function with the two for loops that need to be parallelised:
Data_eigensols solve(const VectorXd& nu_l0_in, const int el, const long double delta0l,
const long double DPl, const long double alpha, const long double q, const long double sigma_p,
const long double resol){
const int Nmmax=5000;
int np, ng, s, s0m;
double nu_p, nu_g;
VectorXi test;
VectorXd nu_p_all, nu_g_all, nu_m_all(Nmmax);
Data_coresolver sols_iter;
Data_eigensols nu_sols;
//----
// Some code that initialize several variables (that I do not show here for clarity).
// This includes initialising freq_max, freq_min, tol, nu_p_all, nu_g_all, deriv_p and deriv_g structures
//----
// Problematic loops that crash with omp parallel but works fine when not using omp
s0m=0;
#pragma omp parallel for collapse(2) num_threads(4) default(shared) private(np, ng)
for (np=0; np<nu_p_all.size(); np++)
{
for (ng=0; ng<nu_g_all.size();ng++)
{
nu_p=nu_p_all[np];
nu_g=nu_g_all[ng];
sols_iter=solver_mm(nu_p, nu_g); // Depends on several other function but for clarity, I do not show them here (all those 'const long double' or 'const int')
//printf("np = %d, ng= %d, threadId = %d \n", np, ng, omp_get_thread_num());
for (s=0;s<sols_iter.nu_m.size();s++)
{
// Cleaning doubles: Assuming exact matches or within a tolerance range
if ((sols_iter.nu_m[s] >= freq_min) && (sols_iter.nu_m[s] <= freq_max))
{
test=where_dbl(nu_m_all, sols_iter.nu_m[s], tol, 0, s0m); // This function returns -1 if there is no match for the condition (here means that sols_iter.nu_m[s] is not found in nu_m_all)
if (test[0] == -1) // Keep the solution only if it was not pre-existing in nu_m_all
{
nu_m_all[s0m]=sols_iter.nu_m[s];
s0m=s0m+1;
}
}
}
}
}
nu_m_all.conservativeResize(s0m); // Reduced the size of nu_m_all to its final size
nu_sols.nu_m=nu_m_all;
nu_sols.nu_p=nu_p_all;
nu_sols.nu_g=nu_g_all;
nu_sols.dnup=deriv_p.deriv;
nu_sols.dPg=deriv_g.deriv;
return nu_sols;
}
The types Data_coresolver and Data_eigensols are defined as:
struct Data_coresolver{
VectorXd nu_m, ysol, nu, pnu, gnu;
};
struct Data_eigensols{
VectorXd nu_p, nu_g, nu_m, dnup, dPg;
};
The where_dbl() is as follows:
VectorXi where_dbl(const VectorXd& vec, double value, const double tolerance){
/*
* Gives the indexes of values of an array that match the value.
* A tolerance parameter allows you to control how close the match
* is considered as acceptable. The tolerance is in the same unit
* as the value
*
*/
int cpt;
VectorXi index_out;
index_out.resize(vec.size());
cpt=0;
for(int i=0; i<vec.size(); i++){
if(vec[i] > value - tolerance && vec[i] < value + tolerance){
index_out[cpt]=i;
cpt=cpt+1;
}
}
if(cpt >=1){
index_out.conservativeResize(cpt);
} else{
index_out.resize(1);
index_out[0]=-1;
}
return index_out;
}
Regarding solver_mm():
I won't detail this function, as it calls a few subroutines and would be too long to show here; I don't think it is relevant anyway. It is basically a function that searches for the solution of an implicit equation.
What the main function is supposed to do:
The main function solve() calls solver_mm() iteratively in order to solve an implicit equation under different conditions, where the only variables are nu_p and nu_g. Sometimes a pair (nu_p(i), nu_g(j)) leads to the same solution as another pair (nu_p(k), nu_g(l)). This is why there is a section calling where_dbl() to detect those duplicated solutions and discard them, keeping only unique solutions.
What is the problem:
Without the #pragma, the code works fine. With it, the code fails at a random point of the execution.
After a few tests, it seems that the culprit is related to the part that removes duplicate solutions. My guess is that there is concurrent writing to the nu_m_all VectorXd. I tried to use #pragma omp barrier without success, but I am quite new to OpenMP and I may have misunderstood how the barrier works.
Can someone let me know why I get a crash here and how to solve it? The solution might be obvious to someone with good experience in OpenMP.
nu_p, nu_g, sols_iter and test should be private. Since these variables are declared as shared, multiple threads might write to the same memory region in a non-thread-safe manner. This is most likely your problem.
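A minimal sketch of the corrected region (variable names follow the question; the lookup-and-append on nu_m_all and s0m is additionally wrapped in a critical section, since privatizing variables alone does not protect those shared updates):
s0m=0;
#pragma omp parallel for collapse(2) num_threads(4) default(shared) \
        private(np, ng, s, nu_p, nu_g, sols_iter, test)
for (np=0; np<nu_p_all.size(); np++)
{
    for (ng=0; ng<nu_g_all.size(); ng++)
    {
        nu_p=nu_p_all[np];
        nu_g=nu_g_all[ng];
        sols_iter=solver_mm(nu_p, nu_g);
        for (s=0; s<sols_iter.nu_m.size(); s++)
        {
            if ((sols_iter.nu_m[s] >= freq_min) && (sols_iter.nu_m[s] <= freq_max))
            {
                // The duplicate check and the append must happen atomically
                // as a pair, so both go inside one critical section.
                #pragma omp critical
                {
                    test=where_dbl(nu_m_all, sols_iter.nu_m[s], tol, 0, s0m);
                    if (test[0] == -1)
                    {
                        nu_m_all[s0m]=sols_iter.nu_m[s];
                        s0m=s0m+1;
                    }
                }
            }
        }
    }
}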
I am trying to implement a parallel foreach loop for std::vector which runs the computations on the optimal number of threads (number of cores minus one, reserving one for the main thread); however, my implementation is far from fast enough: it actually runs 6 times slower than the single-threaded one!
Thread instantiation is often blamed for being a bottleneck, so I tried a larger vector; however, that did not seem to help.
I am currently stuck watching the parallel algorithm execute in 13000-20000 microseconds across separate threads, while the single-threaded one executes in 120-200 microseconds in the main thread, and I cannot figure out what I am doing wrong. Of those 13-20 ms of the parallel run, 8 or 9 ms are usually spent creating the threads; even so, I can see no reason why a std::for_each running through 1/3 of the vector in a separate thread should take several times longer than another std::for_each needs to iterate through the whole vector.
#include <iostream>
#include <vector>
#include <thread>
#include <algorithm>
#include <functional> // std::function
#include <chrono>
const unsigned int numCores = std::thread::hardware_concurrency();
const size_t numUse = numCores - 1;
struct foreach
{
inline static void go(std::function<void(uint32_t&)>&& func, std::vector<uint32_t>& cont)
{
std::vector<std::thread> vec;
vec.reserve(numUse);
std::vector<std::vector<uint32_t>::iterator> arr(numUse + 1);
size_t distance = cont.size() / numUse;
for (size_t i = 0; i < numUse; i++)
arr[i] = cont.begin() + i * distance;
arr[numUse] = cont.end();
for (size_t i = 0; i < numUse - 1; i++)
{
vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}
vec.emplace_back([&] { std::for_each(cont.begin() + (numUse - 1) * distance, cont.end(), func); });
for (auto &d : vec)
{
d.join();
}
}
};
int main()
{
std::chrono::steady_clock clock;
std::vector<uint32_t> numbers;
for (size_t i = 0; i < 50000000; i++)
numbers.push_back(i);
std::chrono::steady_clock::time_point t0m = clock.now();
std::for_each(numbers.begin(), numbers.end(), [](uint32_t& value) { ++value; });
std::chrono::steady_clock::time_point t1m = clock.now();
std::cout << "Single-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1m - t0m).count() << "mcs\n";
std::chrono::steady_clock::time_point t0s = clock.now();
foreach::go([](uint32_t& i) { ++i; }, numbers);
std::chrono::steady_clock::time_point t1s = clock.now();
std::cout << "Multi-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1s - t0s).count() << "mcs\n";
getchar();
}
Is there a way I can optimize this and increase the performance?
I am using the Visual Studio 2017 compiler, with the Release x86 configuration. I have also been advised to use a profiler and am currently figuring out how to use one.
I actually managed to get the parallel code to run faster than the regular one, but this required a vector of dozens of thousands of vectors of five elements each. If anyone has advice on how to improve performance, or on where I can find a better implementation to study its structure, that would be appreciated.
Thank you for providing some example code.
Getting good metrics (especially on parallel code) can be pretty tricky. Your metrics are tainted.
Use high_resolution_clock instead of steady_clock for profiling.
Don't include the thread startup time in your timing measurement. Thread launch/join is orders of magnitude longer than your actual work here. You should create the threads once and use condition variables to make them sleep until you signal them to work. This is not trivial, but it is essential that you don't measure the thread startup time.
Visual Studio has a profiler. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Another issue kind of related to having good metrics is that your "work" is just incrementing an integer. Is that really representative of the work your program is going to be doing? Increment is really fast. If you look at the assembly generated by your sequential version, everything gets inlined into a really short loop.
Lambdas have a very good chance of being inlined. But in your go function, you're casting the lambda to std::function. std::function has a very poor chance of being inlined.
So if you want to keep the chance of getting the lambda inlined, you have to do some template tricks:
template <typename FUNC>
inline static void go(FUNC&& func, std::vector<uint32_t>& cont)
By manually inlining your code (I moved the contents of the go function to main) and doing step 2 above, I was able to get the parallel version (4 threads on a hyperthreaded dual-core) to run in about 75% of the time. That's not particularly good scaling, but it's not bad considering that the original was already pretty fast. For further optimization, I would use SIMD aka "vector" operations (different from std::vector, except in the sense that they both relate to arrays), which apply the increment to multiple array elements in one iteration.
You have a race condition here:
for (size_t i = 0; i < numUse - 1; i++)
{
vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}
Because you set the default lambda capture to capture-by-reference, the i variable is captured by reference, so by the time a thread runs, i may have changed, causing some threads to check the wrong range or too long a range. You could write [&, i], but why risk shooting yourself in the foot again? Scott Meyers recommends against using default capture modes. Just write [&cont, &distance, &func, i].
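That is, a sketch of the loop with the index captured by value:
for (size_t i = 0; i < numUse - 1; i++)
{
    // i is captured by value, so each thread keeps a stable copy of its chunk index.
    vec.emplace_back([&cont, &distance, &func, i] {
        std::for_each(cont.begin() + i * distance,
                      cont.begin() + (i + 1) * distance, func);
    });
}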
UPDATE:
I think it's a fine idea to move your foreach to its own space. I think what you should do is separate the thread creation from task dispatch. That means you need some kind of signaling system (generally condition variables). You could look into thread pools.
An easy way to add threadpools is to use OpenMP, which Visual Studio 2017 has support for (OpenMP 2.0). A caveat is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section (it's implementation dependent). So it trades off performance with ease of use.
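For illustration, the OpenMP spelling of the increment loop from the question would be roughly this (compile with /openmp; MSVC's OpenMP 2.0 requires a signed loop index):
#pragma omp parallel for
for (int i = 0; i < (int)numbers.size(); i++)
    ++numbers[i];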
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload), as do most of the standard algorithms. https://en.cppreference.com/w/cpp/algorithm/for_each
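A sketch of that C++17 form applied to the vector from the question (requires the <execution> header; with GCC's libstdc++ the parallel policies also need linking against TBB):
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

int main()
{
    std::vector<uint32_t> numbers(50000000);
    // The standard library manages the worker threads;
    // no manual chunking, capture lists, or joins.
    std::for_each(std::execution::par, numbers.begin(), numbers.end(),
                  [](uint32_t& value) { ++value; });
}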
As for std::function: you can use it; you just don't want your basic operation (the one that will be called 50,000,000 times) to be a std::function.
Bad:
void go(std::function<...>& func)
{
std::thread t([&]{ std::for_each(v.begin(), v.end(), func); }); // func, a std::function, is invoked once per element
...
}
...
go([](int& i) { ++i; });
Good:
void go(std::function<...>& func)
{
std::thread t(func);
...
}
...
go([&v](){ std::for_each(v.begin(), v.end(), [](int& i) { ++i; }); });
In the good version, the short inner lambda (i.e. ++i) gets inlined into the call to for_each. That's important because it gets called 50 million times. The call to the bigger lambda is not inlined (because it's converted to std::function), but that's OK because it only gets called once per thread.
I'm writing the following code, which constructs a distance matrix between each point and all the other points that I have in the map dat[]. Although the code works correctly, its running time doesn't improve: it takes the same time whether I set the number of threads to 1 or to 10 on an 8-core machine. I'd therefore appreciate it if anyone can help me find what is wrong with my code, and any suggestion to make the code run faster would be very helpful too.
The following is the code:
map< int,string >::iterator datIt;
map <int, map< int, double> > dist;
int mycont=0;
datIt=dat.begin();
int size=dat.size();
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel //construct the distance matrix
{
map< int,string >::iterator datItLocal=datIt;
int lastIdx = 0;
#pragma omp for
for(int i=0;i<size;i++)
{
std::advance(datItLocal, i - lastIdx);
lastIdx = i;
map< int,string >::iterator datIt2=datItLocal;
datIt2++;
while(datIt2!=dat.end())
{
double ecl=0;
int c=count((*datItLocal).second.begin(),(*datItLocal).second.end(),delm);
string line1=(*datItLocal).second;
string line2=(*datIt2).second;
for (int i=0;i<c;i++)
{
double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
line1=line1.substr(line1.find_first_of(delm)+1).c_str();
double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
line2=line2.substr(line2.find_first_of(delm)+1).c_str();
ecl += (num1-num2)*(num1-num2);
}
ecl=sqrt(ecl);
omp_set_lock(&lock);
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
omp_unset_lock(&lock);
datIt2++;
}
}
}
omp_destroy_lock(&lock);
My guess is that using a single lock for protecting 'dist' serializes your program.
Option 1:
Consider using a fine-grained locking strategy. Typically, you benefit from this if dist.size() is much larger than the number of threads.
// The locks must be created and initialized before the parallel region:
// inserting into the map concurrently would itself be a data race.
map< int, omp_lock_t > locks;
for (datIt = dat.begin(); datIt != dat.end(); ++datIt)
    omp_init_lock(&locks[datIt->first]);
...
// Acquire the two locks in a fixed (sorted) order to avoid deadlock.
int key1 = std::min((*datItLocal).first, (*datIt2).first);
int key2 = std::max((*datItLocal).first, (*datIt2).first);
omp_set_lock(&(locks[key1]));
omp_set_lock(&(locks[key2]));
// Note: this still assumes dist already has a row for every key;
// concurrent insertion into the outer map is not thread safe.
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
omp_unset_lock(&(locks[key2]));
omp_unset_lock(&(locks[key1]));
Option 2:
Your compiler might already provide the optimization mentioned in option 1, so you can try dropping your lock and using the built-in critical section:
#pragma omp critical
{
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
}
I'm a bit unsure of exactly what you're trying to do with your loops etc., which look rather like they're going to do a quadratic nested loop over the map. Assuming that's expected, though, I think the following line will perform poorly when parallelised:
std::advance(datItLocal, i - lastIdx);
If OpenMP were disabled, that's advancing by one step each time, which is fine. But with OpenMP, multiple threads are going to handle chunks of that loop at random, so one of them might start at i=100000 and have to advance 100000 steps through the map before it can begin. That might happen quite a lot if many threads are each given relatively small chunks of the loop at a time. You might even end up memory/cache constrained, since you're constantly having to walk over all of this presumably large map. This seems like it might be (part of) your culprit, since it may get worse as more threads are available.
Fundamentally I guess I'm a bit suspicious of trying to parallelise iteration over a sequential data structure. You may know more about which parts of it really are slow or not though if you've profiled it.
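If the iterator walking is indeed the bottleneck, one workaround (a sketch, assuming it is acceptable to snapshot the map once) is to copy the entries into a random-access container and parallelise over indices:
#include <map>
#include <string>
#include <utility>
#include <vector>

void build_distances(const std::map<int, std::string>& dat)
{
    // Snapshot the map once: each thread can then index entries in O(1)
    // instead of advancing a node-based iterator from the beginning.
    std::vector<std::pair<int, std::string>> items(dat.begin(), dat.end());

    // dynamic scheduling helps because row i does size()-i-1 units of work
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < (int)items.size(); i++) {
        for (size_t j = i + 1; j < items.size(); j++) {
            // ... compute ecl from items[i].second and items[j].second,
            // then store it as before (still under a lock or critical section) ...
        }
    }
}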
I have a function that I eventually want to parallelize.
Currently, I call things in a for loop.
double temp = 0;
int y = 123; // is a value set by other code
for(vector<double>::iterator i=data.begin(); i != data.end(); i++){
temp += doStuff(i, y);
}
doStuff needs to know how far down the list it is, so I use i - data.begin() to calculate that.
Next, I'd like to use std::for_each instead. My challenge is that I need to pass the address of my iterator and the value of y. I've seen examples of using bind2nd to pass a parameter to the function, but how can I pass the address of the iterator as the first parameter?
The Boost FOREACH macro also looks like a possibility; however, I do not know if it will parallelize auto-magically like the STL version does.
Thoughts, ideas, suggestions?
If you want real parallelization here, use GCC with tree-vectorization optimization on (-O3) and SIMD (e.g. -march=native to get SSE support). If the operation (dostuff) is non-trivial, you could opt to do it ahead of time (std::transform or std::for_each) and accumulate afterwards (std::accumulate), since the accumulation will be optimized like nothing else with SSE instructions!
void apply_function(double& value)
{
value *= 3; // just a sample...
}
// ...
std::vector<double> data(1000);
std::for_each(data.begin(), data.end(), &apply_function);
double sum = std::accumulate(data.begin(), data.end(), 0.0); // 0.0, not 0: an int initial value would truncate the sum
Note that although this will not actually run on multiple threads, the performance increase will be massive, since SSE4 instructions can handle many floating-point operations in parallel on a single core.
If you wanted true parallelism, use one of the following
GNU Parallel Mode
Compile with g++ -fopenmp -D_GLIBCXX_PARALLEL:
__gnu_parallel::accumulate(data.begin(), data.end(), 0.0);
OpenMP directly
Compile with g++ -fopenmp
double sum = 0.0;
#pragma omp parallel for reduction (+:sum)
for (size_t i=0; i<data.size(); i++)
{
sum += do_stuff(i, data[i]);
}
This will result in the loop being parallelized across as many threads (the OMP team) as there are (logical) CPU cores on the machine, with the results 'magically' combined and synchronized.
Final remarks:
You can simulate the binary function for for_each by using a stateful function object. This is not exactly recommended practice, and it will also appear to be very inefficient (when compiling without optimization, it is). This is due to the fact that function objects are passed by value throughout the STL. However, it is reasonable to expect a compiler to completely optimize that potential overhead away, especially for simple cases like the following:
struct myfunctor
{
    size_t index;
    myfunctor() : index(0) {}
    double operator()(const double& v)
    {
        ++index;       // state: tracks how far along the sequence we are
        return v * 3;  // again, just a sample
    }
};
// ...
std::for_each(data.begin(), data.end(), myfunctor());
temp += doStuff( i, y ); cannot be auto-parallelized: the += operator doesn't play well with concurrency.
Furthermore, the STL algorithms don't parallelize anything. Both Visual Studio and GCC have parallel algorithms similar to std::for_each; if that is what you're after, you'll have to use those.
OpenMP can auto parallelize for loops, but you have to use pragmas to tell the compiler when and how (it can't figure it out for you).
You may have confused parallelization with loop unrolling, which is a common optimization in std::for_each implementations.
This is fairly straightforward if you can change doStuff so that it takes the value of the current element separately from the index at which the current element is located. Consider:
struct context {
    std::size_t _index;
    int         _y;
    double      _result;
};

context do_stuff_wrapper(context current, double value)
{
    current._result += doStuff(current._index, value, current._y);
    current._index++;
    return current; // accumulate needs the updated state back
}
context c = { 0, 123, 0.0 };
context result = std::accumulate(data.begin(), data.end(), c, do_stuff_wrapper);
Note, however, that the Standard Library algorithms cannot "auto-parallelize" because the functions they call may have side effects (the compiler knows whether side effects are produced, but the library functions don't). If you want a parallelized loop, you'll have to go with a special-purpose parallelizing algorithms library, like PPL or TBB.
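For instance, with TBB the reduction could look roughly like this (a sketch: doStuff is assumed here to take the element's index, its value, and y, which is not quite the signature in the question):
#include <cstddef>
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

double doStuff(std::size_t index, double value, int y); // hypothetical overload

double parallel_sum(const std::vector<double>& data, int y)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, data.size()),
        0.0, // identity element of +
        [&](const tbb::blocked_range<std::size_t>& r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += doStuff(i, data[i], y);
            return acc;
        },
        std::plus<double>()); // how partial sums are combined
}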
Hey, I'd like to make this as fast as possible because it gets called A LOT in a program I'm writing, so is there any faster way to initialize a C++ vector to random values than:
double range;//set to the range of a particular function i want to evaluate.
std::vector<double> x(30, 0.0);
for (int i=0;i<x.size();i++) {
x.at(i) = (rand()/(double)RAND_MAX)*range;
}
EDIT:Fixed x's initializer.
Right now, this should be really fast since the loop won't execute. (This referred to the code before the edit, where x was default-constructed and empty.)
Personally, I'd probably use something like this:
struct gen_rand {
double range;
public:
gen_rand(double r=1.0) : range(r) {}
double operator()() {
return (rand()/(double)RAND_MAX) * range;
}
};
std::vector<double> x(num_items);
std::generate_n(x.begin(), num_items, gen_rand());
Edit: It's purely a micro-optimization that might make no difference at all, but you might consider rearranging the computation to get something like:
struct gen_rand {
double factor;
public:
gen_rand(double r=1.0) : factor(r/RAND_MAX) {}
double operator()() {
return rand() * factor;
}
};
Of course, there's a really good chance the compiler will already do this (or something equivalent) but it won't hurt to try it anyway (though it's really only likely to help with optimization turned off).
Edit2: "sbi" (as is usually the case) is right: you might gain a bit by initially reserving space, then using an insert iterator to put the data into place:
std::vector<double> x;
x.reserve(num_items);
std::generate_n(std::back_inserter(x), num_items, gen_rand());
As before, we're into such microscopic optimization, I'm not at all sure I'd really expect to see a difference at all. In particular, since this is all done with templates, there's a pretty good chance most (if not all) the code will be generated inline. In that case, the optimizer is likely to notice that the initial data all gets overwritten, and skip initializing it.
In the end, however, nearly the only part that's really likely to make a significant difference is getting rid of the .at(i). The others might, but with optimizations turned on, I wouldn't really expect them to.
I have been using Jerry Coffin's functor method for some time, but with the arrival of C++11 we have loads of cool new random number functionality. To fill an array with random float values, we can now do something like the following...
#include <algorithm>  // std::generate_n
#include <functional> // std::bind
#include <random>     // std::uniform_real_distribution, std::mt19937
#include <vector>

const size_t elements = 300;
std::vector<float> y(elements);
std::uniform_real_distribution<float> distribution(0.0f, 2.0f); // values between 0 and 2
std::mt19937 engine; // Mersenne twister MT19937
auto generator = std::bind(distribution, engine);
std::generate_n(y.begin(), elements, generator);
See the relevant section of Wikipedia for more engines and distributions
Yes: whereas x.at(i) does bounds checking, x[i] does not. Also, your code (before the edit) was incorrect, as you had failed to specify the size of x in advance. You need std::vector<double> x(n), where n is the number of elements that you want to use; otherwise, the loop there will never execute.
Alternatively, you may want to make a custom iterator for generating random values and fill the vector using that iterator. Because the std::vector constructor will initialize its elements anyway, a custom iterator class that generates random values may let you eliminate a pass over the items.
In terms of implementing an iterator of your own, here is my untested code:
class random_iterator
{
public:
typedef std::input_iterator_tag iterator_category;
typedef double value_type;
typedef int difference_type;
typedef double* pointer;
typedef double& reference;
random_iterator() : _range(1.0), _count(0) {}
random_iterator(double range, int count) :
_range(range), _count(count) {}
random_iterator(const random_iterator& o) :
_range(o._range), _count(o._count) {}
~random_iterator(){}
double operator*()const{ return ((rand()/(double)RAND_MAX) * _range); }
int operator-(const random_iterator& o)const{ return o._count-_count; }
random_iterator& operator++(){ _count--; return *this; }
random_iterator operator++(int){ random_iterator cpy(*this); _count--; return cpy; }
bool operator==(const random_iterator& o)const{ return _count==o._count; }
bool operator!=(const random_iterator& o)const{ return _count!=o._count; }
private:
double _range;
int _count;
};
With the code above, it should be possible to use:
std::vector<double> x(random_iterator(range,number),random_iterator());
That said, the generate_n code in the other solution is simpler, and frankly I would just explicitly fill the vector without resorting to anything fancy like this... but it is kind of cool to think about.
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator> // std::ostream_iterator, std::back_inserter
#include <cstdlib>  // rand, RAND_MAX
struct functor {
functor(double v):val(v) {}
double operator()() const {
return (rand()/(double)RAND_MAX)*val;
}
private:
double val;
};
int main(int argc, const char** argv) {
const int size = 10;
const double range = 3.0f;
std::vector<double> dvec;
std::generate_n(std::back_inserter(dvec), size, functor(range));
// print all
std::copy(dvec.begin(), dvec.end(), (std::ostream_iterator<double>(std::cout, "\n")));
return 0;
}
Too late :(
You may consider using a pseudo-random number generator that gives its output as a sequence. Since most PRNGs just provide a sequence anyway, that will be a lot more efficient than calling rand() over and over again.
But then, I think I really need to know more about your situation.
Why does this piece of code execute so often? Can you restructure your code to avoid re-generating random data so frequently?
How big are your vectors?
How "good" does your random number generator need to be? High-quality distributions tend to be more expensive to calculate.
If your vectors are large, are you reusing their buffer space, or are you throwing it away and reallocating it elsewhere? Creating new vectors willy-nilly is a great way to destroy your cache.
@Jerry Coffin's answer looks very good. Two other thoughts, though:
Inlining - All of your vector access will be very fast, but if the call to rand() is out-of-line, the function call overhead might dominate. If that's the case, you may need to roll your own pseudorandom number generator.
SIMD - If you're going to roll your own PRNG, you might as well make it compute 2 doubles (or 4 floats) at once. This will reduce the number of the int-to-float conversions as well as the multiplications. I've never tried it, but apparently there's a SIMD version of the Mersenne Twister that's quite good. A simple linear congruential generator might be good enough too (and that's probably what rand() is using already).
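As a sketch of that last point, a minimal linear congruential generator (the constants are the common Numerical Recipes choice; statistically weak, but cheap and trivially inlined):
#include <cstdint>

// Minimal LCG: one multiply-add per number. Low statistical quality;
// fine for bulk filling, not for simulations.
struct lcg {
    std::uint32_t state;
    explicit lcg(std::uint32_t seed) : state(seed) {}
    double next(double range) {
        state = 1664525u * state + 1013904223u; // Numerical Recipes constants
        return (state / 4294967296.0) * range;  // divide by 2^32 -> [0, range)
    }
};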
int main() {
int size = 10;
srand(time(NULL));
std::vector<int> vec(size);
std::generate(vec.begin(), vec.end(), rand);
std::vector<int> vec_2(size);
std::generate(vec_2.begin(), vec_2.end(), [](){ return rand() % 50; });
}
You need to include <vector>, <algorithm>, <ctime>, and <cstdlib>.
The way I think about these is a rubber-meets-the-road approach.
In other words, there are certain minimal things that have to happen, no getting around it, such as:
the rand() function has to be called N times.
the result of rand() has to be converted to double and then multiplied by something.
the resulting numbers have to get stored in consecutive elements of an array.
The object is, at a minimum, to get those things done.
Other concerns, like whether or not to use an std::vector and iterators are fine as long as they don't add any extra cycles.
The easiest way to see if they add significant extra cycles is to single-step the code at the assembly language level.