How do I get clang/gcc to vectorize looped array comparisons? - c++

bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for(int ii = 0; ii < 64; ++ii){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
Looking at the assembly, clang and gcc don't seem to apply any optimization (with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, and no special large registers):
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for(int ii = 0; ii < 8; ++ii){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

"I assume" the following is most efficient
memcmp(b1, b2, any_size_you_need);
especially for huge arrays!
(For small arrays, there is not a lot to gain anyway!)
Otherwise, you would need to vectorize manually using Intel intrinsics (also mentioned by chtz). I started to look at that until I thought about memcmp.
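For the fixed 64-byte case from the question, a minimal sketch of the memcmp approach (not benchmarked; compilers typically expand a small fixed-size memcmp inline):

#include <cstring>
#include <cstdint>

// Equality of two 64-byte blocks via memcmp; the compiler is free to
// turn this into a couple of wide loads and compares.
bool equal(const uint8_t* b1, const uint8_t* b2){
    return std::memcmp(b1, b2, 64) == 0;
}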

The compiler must assume that once the function returns (i.e., once it exits the loop early), it can't read any bytes beyond the current index -- for example, if one of the pointers happens to point somewhere near the boundary of invalid memory.
You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    bool ret = true;
    for(int ii = 0; ii < 64; ++ii){
        ret &= (b1[ii] == b2[ii]);
    }
    return ret;
}
Godbolt demo: https://godbolt.org/z/3ePh7q5rM
gcc still fails to vectorize this, though. So you may need to write a manually vectorized version if this function is performance critical.
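If it comes to that, a minimal hand-vectorized sketch using AVX-512F intrinsics (assuming, as in the question, that both pointers are 64-byte aligned and exactly 64 bytes are compared; untested):

#include <immintrin.h>
#include <cstdint>

bool equal(const uint8_t* b1, const uint8_t* b2){
    // Both pointers are assumed 64-byte aligned, as in the question.
    __m512i a = _mm512_load_si512(b1);
    __m512i b = _mm512_load_si512(b2);
    // Compare all sixteen 32-bit lanes at once; every bit of the mask must be set.
    return _mm512_cmpeq_epi32_mask(a, b) == 0xFFFF;
}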

Related

Adding a print statement speeds up code by an order of magnitude

As suggested in the title, I've hit some extremely bizarre performance behavior in a piece of C/C++ code which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp], gg[pp];
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = gg[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl dual[pp];
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = 0.0;
        }
        for(int h1 = 0; h1 < pp; h1 ++) {
            for(int h2 = 0; h2 < pp; h2 ++) {
                cdbl avg_right = 0.0;
                for(int xx = 0; xx < pp; xx ++) {
                    int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
                }
                avg_right /= static_cast<cdbl>(pp);
                for(int xx = 0; xx < pp; xx ++) {
                    int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
                }
            }
        }
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
        }
        for(int ii = 0; ii < pp; ii++) {
            gg[ii] = dual[ii];
        }

#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
        printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
    }
    printf("%.15lf\n", gg[0].real());
    return 0;
}
Here are the results of running this on my system:
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me#mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me#mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me#mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g, printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp];
    cdbl blah = 0.0;
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = 1.0;
    }

    for(int it = 0; it < 1000; it++) {
        cdbl xx = 0.0;
        for(int kk = 0; kk < 100; kk++) {
            for(int ii = 0; ii < pp; ii++) {
                for(int jj = 0; jj < pp; jj++) {
                    xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
                }
            }
        }
        blah += xx;
        printf("%.15lf\n", blah.real());
    }
    printf("%.15lf\n", blah.real());
    return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time, puts the execution into slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly the first time __muldc3() is called we go into slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo; // 1
    char str[10];
    sprintf(str, "%c\n", 'c');
    //fflush(stdout); // 2

    for(int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }

    return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of __muldc3 (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things it zeros out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
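For reference, a minimal sketch of that workaround applied to the small test program above (hypothetical; the "same effect" claim is from the analysis above, not re-measured here):

#include <complex>

int main() {
    std::complex<double> foo = 0.0;
    foo += foo * foo;   // first PLT-resolved call to __muldc3 leaves the upper YMM bits dirty
    asm("vzeroupper");  // explicitly zero the upper 128 bits, avoiding the SSE/AVX transition penalty
    for (int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}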
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

Why in C++ is overwriting slower than writing?

USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from camera.
I am trying to speed it up, and I noticed some weird C++ behaviour. (I have to admit I am realizing I do not know C++.)
The first piece of code below runs faster than the second ones. Why? Could it be that the stack is almost full?
Faster version
double* temp = new double[N];
for(int i = 0; i < N; i++){
    temp[i] = operation(x[i], y[i]);
    res = res + (temp[i]*temp[i])*coeff[i];
}
Slower version1
double temp;
for(int i = 0; i < N; i++){
    temp = operation(x[i], y[i]);
    res = res + (temp*temp)*coeff[i];
}
Slower version2
for(int i = 0; i < N; i++){
    double temp = operation(x[i], y[i]);
    res = res + (temp*temp)*coeff[i];
}
EDIT
I realized the compiler was optimizing the product between elements of coeff and temp. I beg your pardon for the useless question. I will delete this post.
This has obviously nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, I can guess that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference is that in this version you allocate an array for temp, so each iteration uses its own element of that array, hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependence on a single temp variable. A primitive compiler might "buy" it, producing non-pipelined code.
Your "slow 2" version seems to be ok actually, loop iterations are independent.
Why is this still slower?
I can guess that this is due to the reuse of the same CPU registers. That is, arithmetic on double is usually implemented via FPU stack registers, and this causes interference between loop iterations.

Optimize check for a bit-vector being a proper subset of another?

I would like some help optimizing the most computationally intensive function of my program.
Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.
The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.
First up is the basic variant. It checks to see if blocks_ is a proper subset of x.blocks_. (Not exactly equal.) These are bitmaps, aka bit vectors or bitsets.
//Check for self comparison
if (this == &x)
    return false;

//A subset is equal to or smaller.
if (no_bits_ > x.no_bits_)
    return false;

int i;
bool equal = false;

//Pointers should not change.
const unsigned int *tptr = blocks_;
const unsigned int *xptr = x.blocks_;

for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
    if ((*tptr & *xptr) != *tptr)
        return false;
    if (*tptr != *xptr)
        equal = true;
}
return equal;
Then comes the SSE variant, which alas does not perform according to my expectations. Both of these snippets should look for the same things.
//starting pointers.
const __m128i* start = (__m128i*)&blocks_;
const __m128i* xstart = (__m128i*)&x.blocks_;
__m128i block;
__m128i xblock;

//Unsigned ints are 32 bits, meaning 4 can fit in a register.
for (i = 0; i < no_blocks_; i += 4) {
    block = _mm_load_si128(start + i);
    xblock = _mm_load_si128(xstart + i);

    //Equivalent to (block & xblock) != block
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
        return false;

    //Equivalent to block != xblock
    if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
        equal = true;
}
return equal;
Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?
I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.
There are three possibilities I see here.
First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.
Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_ is identical between the two samples, then start + i is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as the first sample.
Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.
Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();
    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));
        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));
        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));
        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));
        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }
    return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}
If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.
It seems that your problem is memory-bandwidth bound:
Asymptotically, you need about 2 operations to process each pair of integers scanned from memory. There is not enough arithmetic complexity to take advantage of the extra arithmetic throughput of the CPU's SSE instructions. In fact, your CPU spends most of its time waiting for data transfers.
Using SSE instructions in your case only adds instruction overhead, and the resulting code is not well optimized by the compiler.
There are some alternative strategies to improve performance in a bandwidth-bound problem:
Use multi-threading to hide memory access latency behind concurrent arithmetic operations (e.g., in a hyper-threading context).
Fine-tune the size of the data loaded at a time to improve memory bandwidth.
Improve pipeline continuity by adding supplementary independent operations in the loop (scan two different sets of data at each step of your "for" loop).
Keep more data in cache or in registers (some iterations of your code may need the same set of data many times).

Where is the loop-carried dependency here?

Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent
vectorization. Different iterations of the loop interfere with each
other such that vectorizing the loop would produce wrong answers, and
the auto-vectorizer cannot prove to itself that there are no such data
dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
These source and dest pointers do not ever overlap nor alias the same memory and I'm trying to provide the compiler with that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is a 8bpp->32bpp rendering/conversion function, operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;

for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't be auto-vectorized without the function call, even in its simplest form.
EDIT:
I've simplified the loop to its most trivial form that I can:
for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;

source = (const byte * __restrict) new byte[1600];
dest = (byte * __restrict) new byte[1600];

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
    byte *source = new byte[1000];
    DWORD *dest = new DWORD[1000];

    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case, I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.
So the only thing that's out of the ordinary is why the initial loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long-shot to expect the compiler to be able to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0, s0);
    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
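If the alignment of dest cannot be guaranteed, a hedged variant with unaligned stores (also untested; _mm_cvtepu8_epi32 needs SSE4.1, i.e. <smmintrin.h>):

for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((const __m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0, s0);
    // Unaligned stores, since dest + k*pitch may not be 16-byte aligned.
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 0)*pitch), _mm_cvtepu8_epi32(s0));
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 1)*pitch), _mm_cvtepu8_epi32(s1));
}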
From the MSDN documentation, a case in which an error 1203 would be reported
void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous
    // memory access is supported - for example, the gather/scatter pattern.
    for (int i = 0; i < 1000; ++i)
    {
        A[i] += A[0] + 1;       // constant memory access not vectorized
        A[i] += A[i*2+2] + 2;   // non-contiguous memory access not vectorized
    }
}
It really could be the computations at the indexes that mess with the auto-vectorizer. Funny the error code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages

Which is faster/preferred: memset or for loop to zero out an array of doubles?

double d[10];
int length = 10;

memset(d, length * sizeof(double), 0);

//or

for (int i = length; i--;)
    d[i] = 0.0;
If you really care you should try and measure. However the most portable way is using std::fill():
std::fill( array, array + numberOfElements, 0.0 );
Note that for memset you have to pass the number of bytes, not the number of elements because this is an old C function:
memset(d, 0, sizeof(double)*length);
memset can be faster since it is written in assembler, whereas std::fill is a template function which simply does a loop internally.
But for type safety and more readable code I would recommend std::fill() - it is the C++ way of doing things - and consider memset if a performance optimization is needed at this place in the code.
Try this, if only to be cool xD
{
    double *to = d;
    int n = (length + 7) / 8;
    switch (length % 8) {
    case 0: do { *to++ = 0.0;
    case 7:      *to++ = 0.0;
    case 6:      *to++ = 0.0;
    case 5:      *to++ = 0.0;
    case 4:      *to++ = 0.0;
    case 3:      *to++ = 0.0;
    case 2:      *to++ = 0.0;
    case 1:      *to++ = 0.0;
            } while (--n > 0);
    }
}
Assuming the loop length is an integral constant expression, the most probable outcome is that a good optimizer will recognize both the for-loop and the memset(0). The result would be that the assembly generated is essentially equal. Perhaps the choice of registers could differ, or the setup. But the marginal cost per double should really be the same.
In addition to the several bugs and omissions in your code, using memset is not portable. You can't assume that a double with all zero bits is equal to 0.0. First make your code correct, then worry about optimizing.
memset(d,0,10*sizeof(*d));
is likely to be faster. Like they say you can also
std::fill_n(d,10,0.);
but it is most likely a prettier way to do the loop.
calloc(length, sizeof(double))
According to IEEE-754, the bit representation of a positive zero is all zero bits, and there's nothing wrong with requiring IEEE-754 compliance. (If you need to zero out the array to reuse it, then pick one of the above solutions).
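For completeness, a minimal sketch of the calloc route (assuming length is defined as in the question; error handling kept to a null check):

#include <cstdlib>

double* d = static_cast<double*>(std::calloc(length, sizeof(double)));
if (d != nullptr) {
    // The memory comes back zero-filled, which is 0.0 for IEEE-754 doubles,
    // so no memset or fill loop is needed.
    // ... use d[0] .. d[length - 1] ...
    std::free(d);
}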
According to this Wikipedia article on IEEE 754-1985 64-bit floating point, a bit pattern of all 0s will indeed properly initialize a double to 0.0. Unfortunately your memset code doesn't do that.
Here is the code you ought to be using:
memset(d, 0, length * sizeof(double));
As part of a more complete package...
{
    double *d;
    int length = 10;

    d = malloc(sizeof(d[0]) * length);
    memset(d, 0, length * sizeof(d[0]));
}
Of course, that's dropping the error checking you should be doing on the return value of malloc. sizeof(d[0]) is slightly better than sizeof(double) because it's robust against changes in the type of d.
Also, if you use calloc(length, sizeof(d[0])) it will clear the memory for you and the subsequent memset will no longer be necessary. I didn't use it in the example because then it seems like your question wouldn't be answered.
Memset will always be faster, if debug mode or a low level of optimization is used. At higher levels of optimization, it will still be equivalent to std::fill or std::fill_n.
For example, for the following code under Google Benchmark:
(Test setup: xubuntu 18, GCC 7.3, Clang 6.0)
#include <cstdio>    // for printf/getchar
#include <cstring>
#include <algorithm>
#include <benchmark/benchmark.h>

double total = 0;

static void memory_memset(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::memset(ints, 0, sizeof(int) * 50000);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

static void memory_filln(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::fill_n(ints, 50000, 0);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

static void memory_fill(benchmark::State& state)
{
    int ints[50000];
    for (auto _ : state)
    {
        std::fill(std::begin(ints), std::end(ints), 0);
    }
    for (int counter = 0; counter != 50000; ++counter)
    {
        total += ints[counter];
    }
}

// Register the functions as benchmarks
BENCHMARK(memory_filln);
BENCHMARK(memory_fill);
BENCHMARK(memory_memset);

int main(int argc, char** argv)
{
    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    printf("Total = %f\n", total);
    getchar();
    return 0;
}
Gives the following results in release mode for GCC (-O2;-march=native):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 16488 ns 16477 ns 42460
memory_fill 16493 ns 16493 ns 42440
memory_memset 8414 ns 8408 ns 83022
And the following results in debug mode (-O0):
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 87209 ns 87139 ns 8029
memory_fill 94593 ns 94533 ns 7411
memory_memset 8441 ns 8434 ns 82833
While at -O3 or with clang at -O2, the following is obtained:
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
memory_filln 8437 ns 8437 ns 82799
memory_fill 8437 ns 8437 ns 82756
memory_memset 8436 ns 8436 ns 82754
TLDR: use memset unless you are told you absolutely have to use std::fill or a for-loop, at least for POD types that are not non-IEEE-754 floating-point. There are no strong reasons not to.
(note: the for loops counting the array contents are necessary for clang not to optimize away the google benchmark loops entirely (it will detect they're not used otherwise))
The example will not work because you have to allocate memory for your array. You can do this on the stack or on the heap.
This is an example to do it on the stack:
double d[50] = {0.0};
No memset is needed after that.
Don't forget to compare a properly optimized for loop if you really care about performance.
Some variant of Duff's device if the array is sufficiently long, and prefix --i not suffix i-- (although most compilers will probably correct that automatically).
Although I'd question if this is the most valuable thing to be optimising. Is this genuinely a bottleneck for the system?
memset(d, 0, 10) is wrong as it only nulls 10 bytes.
prefer std::fill as the intent is clearest.
In general memset is going to be much faster; make sure you get your length right (obviously your example has not (m)allocated or defined the array of doubles). If it truly ends up with only a handful of doubles, the loop may turn out to be faster. But once the fill loop dwarfs memset's handful of setup instructions, memset will typically use larger and sometimes aligned chunks to maximize speed.
As usual, test and measure. (Although in this case you end up in the cache and the measurement may turn out to be bogus.)
One way of answering this question is to quickly run the code through Compiler Explorer: If you check this link, you'll see assembly for the following code:
void do_memset(std::array<char, 1024>& a) {
    memset(&a, 'q', a.size());
}

void do_fill(std::array<char, 1024>& a) {
    std::fill(a.begin(), a.end(), 'q');
}

void do_loop(std::array<char, 1024>& a) {
    for (int i = 0; i < a.size(); ++i) {
        a[i] = 'q';
    }
}
The answer (at least for clang) is that with optimization levels -O0 and -O1, the assembly is different and std::fill will be slower because the use of the iterators is not optimized out. For -O2 and higher, do_memset and do_fill produce the same assembly. The loop ends up calling memset on every item in the array even with -O3.
Assuming release builds tend to run -O2 or higher, there are no performance considerations and I'd recommend using std::fill when it's available, and memset for C.
If you're required to not use STL...
double aValues [10];
ZeroMemory (aValues, sizeof(aValues));
ZeroMemory at least makes the intent clear.
As an alternative to everything proposed, I can suggest NOT setting the array to all zeros at startup. Instead, set a value to zero only when you first access that particular cell. This sidesteps your question and may be faster.
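A minimal sketch of that lazy-zeroing idea (hypothetical; it assumes one flag per cell is affordable, and note the flag array itself still has to be cleared, which is cheaper than clearing doubles but not free):

#include <cstddef>

// Hypothetical lazily-zeroed array: a cell is set to 0.0 on its first access.
struct LazyZeroed {
    double value[10];
    bool   written[10] = {};    // flags start false; zero-initializing bools is cheaper than doubles

    double& at(std::size_t i) {
        if (!written[i]) {      // first access: materialize the default 0.0 lazily
            value[i] = 0.0;
            written[i] = true;
        }
        return value[i];
    }
};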
I think you mean
memset(d, 0, length * sizeof(d[0]))
and
for (int i = length; --i >= 0; ) d[i] = 0;
Personally, I do either one, but I suppose std::fill() is probably better.