I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.
So I first came up with some code:
int countPrimes(int n) {
    vector<bool> flag(n+1,1);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}
which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:
int countPrimes(int n) {
    // vector<bool> flag(n+1,1);
    bool flag[n+1] ;
    fill(flag,flag+n+1,true);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}
which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both the running time and memory consumption. I have read related questions like this one and that one but I am still confused.
EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.
std::vector<bool> isn't like any other vector. The documentation says:
std::vector<bool> is a possibly space-efficient specialization of
std::vector for the type bool.
That's why it may use less memory than an array: it can pack multiple boolean values into each byte, like a bitset. That also explains the performance difference, since accessing individual elements isn't as simple anymore. According to the documentation, it doesn't even have to store its elements as a contiguous array.
std::vector<bool> is a special case: it is a specialized template. Each value is stored in a single bit, so bit operations are needed to read or write it. This is memory-compact, but it has a couple of drawbacks (such as no way to have a pointer to a bool inside this container).
Now, for bool flag[n+1]; the compiler will usually allocate the same memory, in the same manner, as for char flag[n+1]; and it will do that on the stack, not on the heap.
Depending on page sizes, cache misses and the values of i, one can be faster than the other. It is hard to predict (for small n the array will be faster, but for larger n the result may change).
As an interesting experiment, you can change std::vector<bool> to std::vector<char>. In this case you will have a memory layout similar to that of the array, but it will be located on the heap, not the stack.
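To make the "no pointer or reference to an element" point concrete, here is a small sketch (not from the answers above; the exact proxy type is implementation-defined):

#include <vector>

int main()
{
    std::vector<bool> vb(8, false);
    std::vector<char> vc(8, 0);

    char* pc = &vc[0];      // fine: vector<char> hands out real references
    // bool* pb = &vb[0];   // does not compile: vb[0] is a proxy object, not a bool&
    auto bit = vb[0];       // std::vector<bool>::reference, a proxy
    bit = true;             // writes through the proxy into the packed bits
    (void)pc;
    return 0;
}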
I'd like to add some remarks to the good answers already posted.
The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.
See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).
This: bool flag[n+1]; declares a Variable Length Array, which (despite some performance advantages due to it being allocated on the stack) has never been part of the C++ standard, even though it is provided as an extension by some (C99-compliant) compilers.
Another way to increase performance could be to reduce the amount of computation (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.
If you can bear the less readable code, you could try to profile the following snippet.
int countPrimes(int n)
{
    if ( n < 2 )
        return 0;
    // Sieve the odd numbers from 3 up to (but not including) n;
    // there are n / 2 - 1 of them.
    int sieve_size = n / 2 - 1;
    std::vector<char> sieve(sieve_size);
    int result = 1; // 2 is a prime.
    for (int i = 0; i < sieve_size; ++i)
    {
        if ( sieve[i] == 0 )
        {
            // It's a prime, no need to scan the vector again
            ++result;
            // Some ugly transformations are needed, here
            int prime = i * 2 + 3;
            for ( int j = prime * 3, k = prime * 2; j < n; j += k)
                sieve[j / 2 - 1] = 1;
        }
    }
    return result;
}
Edit
As Peter Cordes noted in the comments, using an unsigned type for the variable j lets the compiler implement j/2 as cheaply as possible (a plain right shift). C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs far enough to prove that j will always be non-negative.
It's also possible to reduce the number of candidates further by exploiting the fact that all primes (past 2 and 3) are one below or one above a multiple of 6.
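As a rough illustration of that 6k±1 idea, applied here to plain trial division rather than to the sieve above (just a sketch, not a drop-in replacement):

#include <cstdint>

// Trial division that only tests 2, 3 and then numbers of the form 6k - 1 and
// 6k + 1, since every prime greater than 3 has one of those two forms.
bool is_prime(std::uint64_t n)
{
    if (n < 2) return false;
    if (n < 4) return true;                     // 2 and 3
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (std::uint64_t d = 5; d * d <= n; d += 6)
        if (n % d == 0 || n % (d + 2) == 0)     // d = 6k - 1, d + 2 = 6k + 1
            return false;
    return true;
}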
I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:
vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69
Conclusion: vector<bool> is the fastest option here; its compact bit representation touches far less memory, so the loop is less memory-bound, which shows up as a much higher IPC (instructions per clock).
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>
size_t countPrimes(size_t n) {
std::vector<bool> flag(n+1,1);
//std::vector<char> flag(n+1,1);
//bool flag[n+1]; std::fill(flag,flag+n+1,true);
for(size_t i=2;i<n;i++) {
if(flag[i]==1) {
for(size_t j=i;i*j<n;j++) {
flag[i*j]=0;
}
}
}
size_t result=0;
for(size_t i=2;i<n;i++) {
result+=flag[i];
}
return result;
}
int main() {
{
const rlim_t kStackSize = 16*1024*1024;
struct rlimit rl;
int result = getrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
if(rl.rlim_cur < kStackSize) {
rl.rlim_cur = kStackSize;
result = setrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
}
}
printf("%zu\n", countPrimes(10e6));
return 0;
}
Related
The problem statement is to find the prime numbers below 2 billion in a timeframe of less than 20 sec.
I followed the approaches below.
Divide the number n by the list of numbers k (k < sqrt(n)) - took 20 sec.
Divide the number n by the list of prime numbers below sqrt(n). In this scenario I stored the prime numbers in a std::list - took more than 180 sec.
Can someone help me understand why the 2nd approach took so long even though we reduced the number of divisions by about 50%? Or did I choose the wrong data structure?
Approach 1:
#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;
list<long long> primeno;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 20000000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int j = 0;
int limit = sqrt(i);
for (j = 2 ; j <= limit;j++)
{
if(i % j == 0)
{
break;
}
}
if( j > limit)
{
primeno.push_back(i);
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
Approach 2 :
#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;
list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
cout << "No of divisions : " << noofdiv;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 10000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int limit = sqrt(i);
for (int iter : primeno)
{
noofdiv++;
if (iter <= limit && (i%iter) == 0)
{
break;
}
else if (iter > limit)
{
primeno.push_back(i);
break;
}
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
The reason your second example takes longer is that you're iterating a std::list.
A std::list in C++ is a linked list, which means it doesn't use contiguous memory. This is bad because to iterate the list you must jump from node to node in a (to the CPU/prefetcher) unpredictable way. Also, you're most likely only "using" a few bytes of each cacheline. RAM is slow. Fetching a byte from RAM takes a lot longer than fetching it from L1. CPUs are fast these days, so your program is most of the time not doing anything and waiting for memory to arrive.
Use a std::vector instead. It stores all values one after the other and iterating is very cheap. Since you're iterating forward in memory without jumping, you're using the full cacheline and your prefetcher will be able to fetch further pages before you need them because your access of memory is predictable.
It has been proven by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where the std::list has "theoretically" better complexity (random insert, delete, ...) just because caching helps a lot. So always use std::vector as your default. And if you think a linked list would be faster in your case, measure it and be surprised that - most of the time - std::vector dominates.
Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.
constexpr int max_prime = 1000000000;
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);
for (int i = 4; i < max_prime; i += 2)
    bitset->set(i); // set all even numbers
int max = sqrt(max_prime);
for (int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
    if (!bitset->test(i)) { // If i is prime
        for (int j = i * 2; j < max_prime; j += i)
            bitset->set(j); // set all multiples of i to non-prime
    }
}
This takes about 30 seconds (it originally ran in 4.2 to 4.5 seconds; not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore) to find all primes below one billion (1,000,000,000) on my machine. Your approach took way too long even for 100 million. I cancelled the 1 billion search after about two minutes.
Comparison for 100 million:
time taken: 63.515 seconds
time taken bitset: 1.874 seconds
No of divisions : 1975961174
No of primes found: 5761455
No of primes found bitset: 5761455
I'm not a mathematician, so I'm pretty sure there are still ways to optimize it further; the only optimization here is skipping the even numbers.
The first thing to do is make sure you are compiling with optimisations enabled. The C++ standard library template classes tend to perform very poorly with unoptimised code as they generate lots of function calls. The optimiser inlines most of these function calls, which makes them much cheaper.
std::list is a linked list. It is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).
For the case where you are only appending to the end of a list std::list has the following issues:
Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
The list uses quite a lot more memory, each element needs a pointer to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to 20 bytes per element rather than 4 for a list of int
As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses
A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
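As a rough sketch of that substitution applied to the question's check_prime (same names as the question; this is just an illustration, not the thread's code):

#include <cmath>
#include <cstddef>
#include <vector>

std::vector<long long> primeno = {2, 3, 5};   // contiguous storage instead of std::list

void check_prime(long long i)
{
    long long limit = static_cast<long long>(std::sqrt(static_cast<double>(i)));
    for (std::size_t k = 0; k < primeno.size(); ++k)
    {
        if (primeno[k] > limit)        // no divisor up to sqrt(i): i is prime
        {
            primeno.push_back(i);
            return;
        }
        if (i % primeno[k] == 0)       // found a divisor: i is composite
            return;
    }
}

Calling primeno.reserve(...) once up front, with an estimate of the number of primes you expect (roughly limit / ln(limit)), avoids repeated reallocations as the vector grows.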
A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this might require random deletions (depending on your exact implementation), std::list might perform better than std::vector, though even in this case the overheads of std::list might outweigh its benefits.
A test at Ideone (the OP code with a few superficial alterations) completely contradicts the claims made in this question:
check_prime__list:
        time taken      No of divisions   No of primes
10M:    0.873 seconds       286144936         664579
20M:    2.169 seconds       721544444        1270607
 2B:    projected time: at least 16 minutes but likely much more (*)

check_prime__nums:
        time taken      No of divisions   No of primes
10M:    4.650 seconds      1746210131         664579
20M:   12.585 seconds      4677014576        1270607
 2B:    projected time: at least 3 hours but likely much more (*)
I also changed the type of the number-of-divisions counter to long int because it was wrapping around its data type limit, so the OP may well have been misreading that value.
But the run time wasn't affected by that. A wall clock is a wall clock.
The most likely explanation seems to be sloppy testing by the OP, with different values used in each test case by mistake.
(*) The time projection was made by the empirical orders of growth analysis:
100**1.32 * 2.169 / 60 = 15.8
100**1.45 * 12.585 / 3600 = 2.8
Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm, n^1.32 vs. the n^1.45 for the testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n^1.5/log n) vs. O(n^1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
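For reference, a tiny sketch of how such empirical orders of growth can be computed from two measurements (the timings are the ones quoted above; the exponents come out close to the n^1.32 and n^1.45 figures, give or take rounding of the timings):

#include <cmath>
#include <cstdio>

int main()
{
    // Empirical order of growth: assume t ~ C * n^b, so
    // b = log(t2 / t1) / log(n2 / n1) for two measurements (n1, t1) and (n2, t2).
    double b_list = std::log(2.169 / 0.873) / std::log(20.0 / 10.0);   // roughly 1.3
    double b_nums = std::log(12.585 / 4.650) / std::log(20.0 / 10.0);  // roughly 1.4
    std::printf("list: n^%.2f  nums: n^%.2f\n", b_list, b_nums);
    return 0;
}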
Consider the following code snippet
double *x, *id;
int i, n; // = vector size
// allocate and zero x
// set id to 0:n-1
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
The code uses values in vector id of type double as indices into vector x. In order for the indices to be valid I verify that they are greater than or equal to 0, less than vector size n, and that doubles stored in id are in fact integers. In this example id stores integers from 0 to n-1, so all vectors are accessed linearly and branch prediction of the if statement should always work.
For n=1e8 the code takes 0.21s on my computer. Since it seems to me it is a computationally light-weight loop, I expect it to be memory bandwidth bounded. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value, and 16 bytes per x value (it needs to be both written, and read from memory since I assume SSE streaming is not used). So a total of 24 bytes per vector entry.
The questions:
Am I wrong saying that this code should be memory bandwidth bounded, and that it can be improved?
If not, do you know a way in which I could improve the performance so that it works with the speed of the memory?
Or maybe everything is fine and I can not easily improve it otherwise than running it in parallel?
Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder if it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.
For convenience, the entire code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>   // gettimeofday, struct timeval
static struct timeval tb, te;
void tic()
{
gettimeofday(&tb, NULL);
}
void toc(const char *idtxt)
{
long s,u;
gettimeofday(&te, NULL);
s=te.tv_sec-tb.tv_sec;
u=te.tv_usec-tb.tv_usec;
printf("%-30s%10li.%.6li\n", idtxt,
(s*1000000+u)/1000000, (s*1000000+u)%1000000);
}
int main(int argc, char *argv[])
{
double *x = NULL;
double *id = NULL;
int i, n;
// vector size is a command line parameter
n = atoi(argv[1]);
printf("x size %i\n", n);
// not included in timing in MATLAB
x = calloc(sizeof(double),n);
memset(x, 0, sizeof(double)*n);
// create index vector
tic();
id = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("id = 1:n");
// use id to index x and set all entries to 4
tic();
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
toc("x(id) = 1");
}
EDIT: Disregard if you can't split the arrays!
I think it can be improved by taking advantage of a common cache concept. You can make data accesses close either in time (temporal locality) or in memory (spatial locality). With tight for-loops, you can achieve a better data hit-rate by shaping your data structures like your for-loop. In this case, you access two different arrays, usually at the same index in each array. Your machine loads chunks of both arrays on each iteration through that loop. To increase the use of each load, create a structure to hold an element of each array, and create a single array of that struct:
struct my_arrays
{
    double x;
    double id;   /* id has to stay a double, per the question */
};

struct my_arrays* arr = malloc(sizeof(struct my_arrays) * n);
Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented precisely in a double with no loss of precision, then I think your comparison is fine.
My previous answer had a reference to beware comparing large doubles after implicit casting, and I referenced this:
What is the most effective way for float and double comparison?
It might be worth considering examination of double type representation.
For example, the following code shows how to compare a double number greater than 1 to 999:
bool check(double x)
{
union
{
double d;
uint32_t y[2];
};
d = x;
bool answer;
uint32_t exp = (y[1] >> 20) & 0x3ff;
uint32_t fraction1 = y[1] << (13 + exp); // upper bits of the fractional part
uint32_t fraction2 = y[0]; // lower 32 bits of fractional part
if (fraction2 != 0 || fraction1 != 0)
answer = false;
else if (exp > 8)
answer = false;
else if (exp == 8)
answer = (y[1] < 0x408f3800); // this is the representation of 999
else
answer = true;
return answer;
}
This looks like a lot of code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, the code might simplify further.
I encountered a problem here at Codechef. I am trying to use a vector for memoization. As I am still new at programming and quite unfamiliar with STL containers, I have used a vector for the lookup table (although it was suggested to me that using a map helps to solve the problem).
So, my question is how the solution given below runs into a runtime error. In order to trigger the error, I used the boundary value for the problem (100000000) as the input. The error message displayed by my Netbeans IDE is RUN FAILED (exit value 1, total time: 4s) with input as 1000000000. Here is the code:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <string>
#define LCM 12
#define MAXSIZE 100000000
using namespace std;
/*
*
*/
vector<unsigned long> lookup(MAXSIZE,0);
int solve(int n)
{
if ( n < 12) {
return n;
}
else {
if (n < MAXSIZE) {
if (lookup[n] != 0) {
return lookup[n];
}
}
int temp = solve(n/2)+solve(n/3)+solve(n/4);
if (temp >= lookup[n] ) {
lookup[n] = temp;
}
return lookup[n];
}
}
int main(int argc, char** argv) {
int t;
cin>>t;
int n;
n = solve(t);
if ( t >= n) {
cout<<t<<endl;
}
else {
cout<<n<<endl;
}
return 0;
}
I doubt if this is a memory issue because he already said that the program actually runs and he inputs 100000000.
One thing that I noticed: you end up doing a lookup[n] even when n == MAXSIZE, because the later accesses are outside the if (n < MAXSIZE) check. Since C++ uses 0-indexed vectors, lookup[MAXSIZE] is one element beyond the end of the vector.
if (n < MAXSIZE) {
...
}
...
if (temp >= lookup[n] ) {
lookup[n] = temp;
}
return lookup[n];
I can't guess what the algorithm is doing but I think the closing brace } of the first "if" should be lower down and you could return an error on this boundary condition.
You either don't have enough memory or don't have enough contiguous address space to store 100,000,000 unsigned longs.
This mostly is a memory issue. For a vector, you need a contiguous memory allocation [so that it can keep up with its promise of constant-time lookup]. In your case, with an 8-byte unsigned long, you are basically asking your machine for around 762 MiB of memory, in a single block.
I don't know which problem you're solving, but it looks like you're solving Bytelandian coins. For this, it is much better to use a map, because:
You will mostly not be storing the values for all 100000000 cases in a test-case run. So what you need is a way to allocate memory only for those values that you actually memoize.
Even if you were, you have no need for constant-time lookup. Although constant time would speed up your program, std::map uses trees to give you logarithmic lookup time, and it does away with the requirement of 762 MiB of contiguous memory. 762 MiB is not a big deal, but expecting it in a single block is.
So the best thing to use in your situation is a std::map. In your case, actually just replacing std::vector<unsigned long> with std::map<int, unsigned long> should work, as map also provides operator[] access [for the most part, it should].
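A rough sketch of that substitution, assuming the problem really is Bytelandian coins as guessed above (the recurrence being max(n, f(n/2) + f(n/3) + f(n/4)); 64-bit types are used here just to be safe):

#include <cstdint>
#include <iostream>
#include <map>

std::map<std::uint64_t, std::uint64_t> lookup;   // stores only the values actually computed

std::uint64_t solve(std::uint64_t n)
{
    if (n < 12)
        return n;                    // below 12 the coin itself is worth the most
    auto it = lookup.find(n);
    if (it != lookup.end())
        return it->second;           // already memoized
    std::uint64_t best = solve(n / 2) + solve(n / 3) + solve(n / 4);
    if (best < n)
        best = n;
    lookup[n] = best;
    return best;
}

int main()
{
    std::uint64_t t;
    while (std::cin >> t)
        std::cout << solve(t) << '\n';
    return 0;
}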
I have the following C++ code:
const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
++size[id[i]];
sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0] += value[i*2+0];
    sum2[id1] += value[i*2+1];
}
// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}
// add the partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or a register int if you wanted to.
register int index;
for(int i = 0; i < N; ++i)
{
index = id[i];
++size[index];
sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
index = 2 * i;
++size[id[index]];
sum[id[index]] += value[index];
index++;
++size[id[index]];
sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit by comparison, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads. i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminate the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
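A rough single-threaded sketch of that bucketing idea (the function name is made up; the threading and SSE parts are left out, but each bucket could then be summed by its own thread or by a vectorized loop):

#include <vector>

// Split the values by id once, then sum each bucket with a plain loop over
// contiguous memory that the compiler can vectorize.
void bucket_and_sum(const int* id, const float* value, int n,
                    int size[10], float sum[10])
{
    std::vector<float> buckets[10];
    for (int i = 0; i < n; ++i)
        buckets[id[i]].push_back(value[i]);   // one pass to partition by id

    for (int b = 0; b < 10; ++b)
    {
        size[b] = static_cast<int>(buckets[b].size());
        float s = 0.0f;
        for (float v : buckets[b])            // contiguous, easily vectorized
            s += v;
        sum[b] = s;
    }
}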
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
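For what it's worth, a sketch of how _mm_prefetch might be dropped into the loop from the question (the prefetch distance of 16 elements is a guess and needs tuning per CPU; the function name and signature here are made up for illustration):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void accumulate(const int* id, const float* value, int n,
                int size[10], float sum[10])
{
    const int kPrefetchDistance = 16;
    for (int i = 0; i < n; ++i)
    {
        if (i + kPrefetchDistance < n)
        {
            // Ask for the data we will need a few iterations from now.
            _mm_prefetch(reinterpret_cast<const char*>(id + i + kPrefetchDistance), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(value + i + kPrefetchDistance), _MM_HINT_T0);
        }
        ++size[id[i]];
        sum[id[i]] += value[i];
    }
}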
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+ : size, sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
int i, j, k;
float tmp;
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];
double d[10];
int length = 10;
memset(d, length * sizeof(double), 0);
//or
for (int i = length; i--;)
d[i] = 0.0;
If you really care you should try and measure. However the most portable way is using std::fill():
std::fill( array, array + numberOfElements, 0.0 );
Note that for memset you have to pass the number of bytes, not the number of elements because this is an old C function:
memset(d, 0, sizeof(double)*length);
memset can be faster since it is written in assembler, whereas std::fill is a template function which simply does a loop internally.
But for type safety and more readable code I would recommend std::fill() - it is the c++ way of doing things, and consider memset if a performance optimization is needed at this place in the code.
Try this, if only to be cool xD
{
double *to = d;
int n=(length+7)/8;
switch(length%8){
case 0: do{ *to++ = 0.0;
case 7: *to++ = 0.0;
case 6: *to++ = 0.0;
case 5: *to++ = 0.0;
case 4: *to++ = 0.0;
case 3: *to++ = 0.0;
case 2: *to++ = 0.0;
case 1: *to++ = 0.0;
}while(--n>0);
}
}
Assuming the loop length is an integral constant expression, the most probable outcome is that a good optimizer will recognize both the for-loop and the memset(0). The result would be that the assembly generated is essentially equal. Perhaps the choice of registers could differ, or the setup. But the marginal cost per double should really be the same.
In addition to the several bugs and omissions in your code, using memset is not portable. You can't assume that a double with all zero bits is equal to 0.0. First make your code correct, then worry about optimizing.
memset(d,0,10*sizeof(*d));
is likely to be faster. Like they say you can also
std::fill_n(d,10,0.);
but it is most likely a prettier way to do the loop.
calloc(length, sizeof(double))
According to IEEE-754, the bit representation of a positive zero is all zero bits, and there's nothing wrong with requiring IEEE-754 compliance. (If you need to zero out the array to reuse it, then pick one of the above solutions).
According to this Wikipedia article on IEEE 754-1985 64-bit floating point, a bit pattern of all 0s will indeed properly initialize a double to 0.0. Unfortunately your memset code doesn't do that.
Here is the code you ought to be using:
memset(d, 0, length * sizeof(double));
As part of a more complete package...
{
double *d;
int length = 10;
d = malloc(sizeof(d[0]) * length);
memset(d, 0, length * sizeof(d[0]));
}
Of course, that's dropping the error checking you should be doing on the return value of malloc. sizeof(d[0]) is slightly better than sizeof(double) because it's robust against changes in the type of d.
Also, if you use calloc(length, sizeof(d[0])) it will clear the memory for you and the subsequent memset will no longer be necessary. I didn't use it in the example because then it seems like your question wouldn't be answered.
Memset will be faster when debug mode or a low level of optimization is used. Only at the highest optimization levels (-O3 for GCC, -O2 for Clang, in the results below) does it become equivalent to std::fill or std::fill_n.
For example, for the following code under Google Benchmark:
(Test setup: xubuntu 18, GCC 7.3, Clang 6.0)
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>
double total = 0;
static void memory_memset(benchmark::State& state)
{
int ints[50000];
for (auto _ : state)
{
std::memset(ints, 0, sizeof(int) * 50000);
}
for (int counter = 0; counter != 50000; ++counter)
{
total += ints[counter];
}
}
static void memory_filln(benchmark::State& state)
{
int ints[50000];
for (auto _ : state)
{
std::fill_n(ints, 50000, 0);
}
for (int counter = 0; counter != 50000; ++counter)
{
total += ints[counter];
}
}
static void memory_fill(benchmark::State& state)
{
int ints[50000];
for (auto _ : state)
{
std::fill(std::begin(ints), std::end(ints), 0);
}
for (int counter = 0; counter != 50000; ++counter)
{
total += ints[counter];
}
}
// Register the function as a benchmark
BENCHMARK(memory_filln);
BENCHMARK(memory_fill);
BENCHMARK(memory_memset);
int main (int argc, char ** argv)
{
benchmark::Initialize (&argc, argv);
benchmark::RunSpecifiedBenchmarks ();
printf("Total = %f\n", total);
getchar();
return 0;
}
Gives the following results in release mode for GCC (-O2 -march=native):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 16488 ns 16477 ns 42460
memory_fill 16493 ns 16493 ns 42440
memory_memset 8414 ns 8408 ns 83022
And the following results in debug mode (-O0):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 87209 ns 87139 ns 8029
memory_fill 94593 ns 94533 ns 7411
memory_memset 8441 ns 8434 ns 82833
While at -O3 or with clang at -O2, the following is obtained:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 8437 ns 8437 ns 82799
memory_fill 8437 ns 8437 ns 82756
memory_memset 8436 ns 8436 ns 82754
TLDR: use memset unless told you absolutely have to use std::fill or a for-loop, at least for POD types (excluding floating-point types on platforms that don't use IEEE-754 representations, where all-zero bits may not equal 0.0). There are no strong reasons not to.
(note: the for loops counting the array contents are necessary for clang not to optimize away the google benchmark loops entirely (it will detect they're not used otherwise))
The example will not work because you have to allocate memory for your array. You can do this on the stack or on the heap.
This is an example to do it on the stack:
double d[50] = {0.0};
No memset is needed after that.
Don't forget to compare a properly optimized for loop if you really care about performance.
Some variant of Duff's device if the array is sufficiently long, and prefix --i rather than postfix i-- (although most compilers will probably correct that automatically).
Although I'd question if this is the most valuable thing to be optimising. Is this genuinely a bottleneck for the system?
memset(d, 10, 0) is wrong: with the arguments in that order it sets zero bytes, and even memset(d, 0, 10) would only null 10 bytes, not 10 doubles.
prefer std::fill as the intent is clearest.
In general the memset is going to be much faster. Make sure you get your length right; obviously your example has not (m)allocated or defined the array of doubles. Now, if it truly is going to end up with only a handful of doubles, then the loop may turn out to be faster. But once the array grows beyond the point where the fill loop dwarfs memset's handful of setup instructions, memset will typically use larger and sometimes aligned chunks to maximize speed.
As usual, test and measure (although in this case you may end up in the cache and the measurement may turn out to be bogus).
One way of answering this question is to quickly run the code through Compiler Explorer: If you check this link, you'll see assembly for the following code:
void do_memset(std::array<char, 1024>& a) {
memset(&a, 'q', a.size());
}
void do_fill(std::array<char, 1024>& a) {
std::fill(a.begin(), a.end(), 'q');
}
void do_loop(std::array<char, 1024>& a) {
for (int i = 0; i < a.size(); ++i) {
a[i] = 'q';
}
}
The answer (at least for clang) is that with optimization levels -O0 and -O1, the assembly is different and std::fill will be slower because the use of the iterators is not optimized out. For -O2 and higher, do_memset and do_fill produce the same assembly. Even the plain loop ends up being compiled into a single call to memset over the whole array with -O3.
Assuming release builds tend to run -O2 or higher, there are no performance considerations and I'd recommend using std::fill when it's available, and memset for C.
If you're required to not use STL...
double aValues [10];
ZeroMemory (aValues, sizeof(aValues));
ZeroMemory at least makes the intent clear.
As an alternative to all the approaches proposed, I can suggest NOT setting the array to all zeros at startup. Instead, set a cell's value to zero only when you first access that cell. This sidesteps the question entirely and may be faster.
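One hedged sketch of that lazy-zeroing idea (a made-up LazyZeroArray class, not from this thread): each cell carries an "epoch" stamp, a read of a cell that wasn't written in the current epoch yields 0.0, and clearing the whole array becomes a single increment. Whether this beats a plain memset depends entirely on the access pattern.

#include <cstddef>
#include <vector>

class LazyZeroArray
{
public:
    explicit LazyZeroArray(std::size_t n) : values(n, 0.0), stamps(n, 0), epoch(1) {}

    double get(std::size_t i) const
    {
        return stamps[i] == epoch ? values[i] : 0.0;   // unwritten cells read as zero
    }
    void set(std::size_t i, double v)
    {
        values[i] = v;
        stamps[i] = epoch;
    }
    void clear() { ++epoch; }                          // invalidates every cell at once

private:
    std::vector<double> values;
    std::vector<unsigned> stamps;
    unsigned epoch;
};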
I think you mean
memset(d, 0, length * sizeof(d[0]))
and
for (int i = length; --i >= 0; ) d[i] = 0;
Personally, I do either one, but I suppose std::fill() is probably better.