Using SSE for vector initialization - C++

I am relatively new to C++ (I moved from Java for performance in my scientific app) and I know nothing about SSE. Still, I need to improve the following very simple code:
int myMax = INT_MAX;
int size = 18000003;
vector<int> nodeCost(size);
/* init part */
for (int k = 0; k < size; k++) {
    nodeCost[k] = myMax;
}
I have measured the time for the initialization part and it takes 13 ms, which is far too much for my scientific app (the entire algorithm runs in 22 ms, which means the initialization takes half of the total time). Keep in mind that the initialization will be repeated multiple times for the same vector.
As you see, the size of the vector is not divisible by 4. Is there a way to accelerate the initialization with SSE? Can you suggest how? Do I need to use arrays, or can SSE be used with vectors as well?
Please, since I need your help, let's all avoid a) "how did you measure the time" or b) "premature optimization is the root of all evil". Both are reasonable things to ask, but a) the measured time is correct and b) I agree with it, yet I have no other choice. I do not want to parallelize the code with OpenMP, so SSE is the only fallback.
Thanks for your help

Use the vector's constructor:
std::vector<int> nodeCost(size, myMax);
This will most likely use an optimized "memset"-type of implementation to fill the vector.
Also tell your compiler to generate architecture-specific code (e.g. -march=native -O3 on GCC). On my x86_64 machine, this produces the following code for filling the vector:
L5:
add r8, 1 ;; increment counter
vmovdqa YMMWORD PTR [rax], ymm0 ;; magic, ymm contains the data, and eax...
add rax, 32 ;; ... the "end" pointer for the vector
cmp r8, rdi ;; loop condition, rdi holds the total size
jb .L5
The vmovdqa instruction (the VEX-encoded form of movdqa, here operating on a 256-bit ymm register) stores 32 bytes to memory at once; it is part of the AVX instruction set.
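Since the question says the same vector gets re-initialized many times, the constructor only covers the first pass; for the later passes, a plain std::fill (or assign) typically compiles down to the same memset/AVX-style fill loop. A minimal sketch:
#include <algorithm>

// re-initialize an existing vector before the next run of the algorithm
std::fill(nodeCost.begin(), nodeCost.end(), myMax);
// or equivalently: nodeCost.assign(nodeCost.size(), myMax);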

Try std::fill first as already suggested, and then if that's still not fast enough you can go to SIMD if you really need to. Note that, depending on your CPU and memory sub-system, for large vectors such as this you may well hit your DRAM's maximum bandwidth and that could be the limiting factor. Anyway, here's a fairly simple SSE implementation:
#include <emmintrin.h>

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;
for (k = 0; k < size - 3; k += 4)
{
    _mm_storeu_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)
{
    pNodeCost[k] = myMax;
}
This should work well on modern CPUs - for older CPUs you might need to handle the potential data misalignment better, i.e. use _mm_store_si128 rather than _mm_storeu_si128. E.g.
#include <emmintrin.h>

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;
for (k = 0; k < size && (((intptr_t)&pNodeCost[k] & 15ULL) != 0); ++k)
{                                              // initial scalar loop until we
    pNodeCost[k] = myMax;                      // hit 16 byte alignment
}
for ( ; k < size - 3; k += 4)                  // 16 byte aligned SIMD loop
{
    _mm_store_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)                         // scalar loop to take care of any
{                                              // remaining elements at end of vector
    pNodeCost[k] = myMax;
}
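If AVX is available, the same pattern works with 256-bit stores; a hedged sketch of that variant (same idea as above, just wider registers):
#include <immintrin.h>

const __m256i vMyMax256 = _mm256_set1_epi32(myMax);
int k;
for (k = 0; k < size - 7; k += 8)
{
    _mm256_storeu_si256((__m256i *)&pNodeCost[k], vMyMax256);   // 32 bytes per store
}
for ( ; k < size; ++k)
{
    pNodeCost[k] = myMax;                                       // scalar tail
}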

This is an extension of the ideas in Mats Petersson's comment.
If you really care about this, you need to improve your referential locality. Plowing through 72 megabytes of initialization, only to come back later to overwrite it, is extremely unfriendly to the memory hierarchy.
I do not know how to do this in straight C++, since std::vector always initializes itself. But you might try (1) using calloc and free to allocate the memory; and (2) interpreting the elements of the array as "0 means myMax and n means n-1". (I am assuming "cost" is non-negative. Otherwise you need to adjust this scheme a bit. The point is to avoid the explicit initialization.)
On a Linux system, this can help because calloc of a sufficiently large block does not need to explicitly zero the memory, since pages acquired directly from the kernel are already zeroed. Better yet, they only get mapped and zeroed the first time you touch them, which is very cache-friendly.
(On my Ubuntu 13.04 system, Linux calloc is smart enough not to explicitly initialize. If yours is not, you might have to do an mmap of /dev/zero to use this approach...)
Yes, this does mean every access to the array will involve adding/subtracting 1. (Although not for operations like "min" or "max".) Main memory is pretty darn slow by comparison, and simple arithmetic like this can often happen in parallel with whatever else you are doing, so there is a decent chance this could give you a big performance win.
Of course whether this helps will be platform dependent.
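A minimal sketch of that idea, assuming non-negative costs (the helper names get_cost/set_cost are illustrative, not from the original code):
#include <cstdlib>
#include <climits>

// encoding: stored 0 means "myMax", stored n means cost n-1
static inline int  get_cost(const int *a, int k)     { return a[k] == 0 ? INT_MAX : a[k] - 1; }
static inline void set_cost(int *a, int k, int cost) { a[k] = cost + 1; }

int *nodeCost = (int *)calloc(size, sizeof(int));  // large calloc: pages arrive zeroed and are mapped lazily on first touch

// ... use get_cost/set_cost instead of indexing nodeCost directly ...

// "re-initializing" is just releasing and re-calloc'ing the block
free(nodeCost);
nodeCost = (int *)calloc(size, sizeof(int));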

Related

How to let the GCC compiler turn variable division into mul (if faster)

int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);
When compiling, I got
...
::00401C1E:: C70424 24304000 MOV DWORD PTR [ESP],403024 %d %d
::00401C25:: E8 36FFFFFF CALL 00401B60 scanf
::00401C2A:: 0FB64C24 1C MOVZX ECX,BYTE PTR [ESP+1C]
::00401C2F:: 8B4424 18 MOV EAX,[ESP+18]
::00401C33:: 31D2 XOR EDX,EDX
::00401C35:: F7F1 DIV ECX
::00401C37:: 894424 04 MOV [ESP+4],EAX
::00401C3B:: C70424 2A304000 MOV DWORD PTR [ESP],40302A %d\x0A
::00401C42:: E8 21FFFFFF CALL 00401B68 printf
Will it be faster if the DIV is turned into a MUL, using an array to store the multiplier values? If so, how can I get the compiler to do that optimization?
int main() {
    uint a, s=0, i, t;
    scanf("%d", &a);
    diviuint aa = a;
    t = clock();
    for (i=0; i<1000000000; i++)
        s += i/a;
    printf("Result:%10u\n", s);
    printf("Time:%12u\n", clock()-t);
    return 0;
}
where diviuint aa = a precomputes what is needed to divide by a and replaces the division with a multiplication.
Using s += i/aa runs about twice as fast as s += i/a.
You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
Using a multiplicative inverse
If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is worth the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.
Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.
(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)
Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.
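As a sketch of how libdivide is typically used for the scalar case (the exact API is libdivide's, so check its documentation; the function name here is just for illustration):
#include <cstdint>
#include "libdivide.h"

uint32_t sum_quotients(uint32_t a) {
    libdivide::divider<uint32_t> fast_a(a);   // precompute the magic multiplier and shift once
    uint32_t s = 0;
    for (uint32_t i = 0; i < 1000000000; i++)
        s += i / fast_a;                      // overloaded operator/: multiply + shift, no div instruction
    return s;
}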
Other ways to get the div out of the loop
for (i=0; i<1000000000; i++)
    s += i/a;
In your example, you might get better results from using a uint128_t s accumulator and dividing by a outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because integer division truncates instead of rounding to nearest.)
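A sketch of that wide-accumulator idea, using the GCC/Clang unsigned __int128 extension (as noted, the rounding differs from summing i/a each iteration):
unsigned __int128 total = 0;               // compiles to cheap add/adc pairs on x86-64
for (uint32_t i = 0; i < 1000000000; i++)
    total += i;                            // no division inside the loop
uint64_t s_once = (uint64_t)(total / a);   // a single division at the end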
I think you can account for the truncation difference by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible.

It's off-topic for this question, but worth saying anyway: look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing faster. Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*1 + a*2 + a*3 ... a*max. I may have an off-by-one here, but I'm confident a closed-form, constant-time calculation will give the same answer as the loop for any a:
uint32_t n = 1000000000 / a,  r = 1000000000 % a;
// i/a equals k for a consecutive values of i (k = 0..n-1), then equals n for the last r values
uint64_t total = (uint64_t)a * n * (n - 1) / 2 + (uint64_t)n * r;
uint32_t s = (uint32_t)total;   // same mod-2^32 wrap-around as the uint accumulator in the loop
If you just needed i/a in a loop, it might be worth it to do something like:
// another optimization for an unlikely case
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
    // use i_over_a
    ++remainder;
    if (remainder == a) {
        // If you don't need the remainder inside the loop, it could save an insn or two
        // to count down from a to 0 instead of up from 0 to a, e.g. on x86. But then you
        // need a clever variable name other than "remainder".
        remainder = 0;
        ++i_over_a;
    }
}
Again, this is unlikely: it only works if you're dividing the loop counter by a constant. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture. Branchless asm is probably possible but not worth it. e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition. The branchless asm would be simplified by decrementing from a to 0. (don't need a zeroed reg for cmov, and could have a in a reg instead of a-1))
Replacing DIV with MUL may make sense (but doesn't have to in all cases) when one of the values is known at compile time. When both are user inputs, you don't know what's the range, so all usual tricks will not work.
Basically you need to handle both a and b between INT_MAX and INT_MIN. There's no space left for scaling them up/down. Even if you wanted to extend them to larger types, it would probably take longer time just to invert b and check that the result will be consistent.
The only way to KNOW whether div or mul is faster is by testing both in a benchmark [obviously, if you use your code above, you'd mostly measure the time spent reading/writing the inputs and results, not the actual divide instruction, so you need something where you can isolate the divide instruction(s) from the input and output].
My guess would be that on slightly older processors, mul is a bit faster, on modern processors, div will be as fast as, if not faster than, a lookup of 256 int values.
If you have ONE target system, then it's plausible to test this. If you have several different systems you want to run on, you will have to ensure the "improved code" is faster on at least some of them - and not significantly slower on the rest.
Note also that you would introduce a dependency, which may in itself slow down the sequence of operations - modern CPU's are pretty good at "hiding" latency as long as there are other instructions to execute [so you should use this in an "as realistic scenario as possible"].
There is a wrong assumption in the question. The multiplicative inverse of an integer greater than 1 is a fraction less than one. These don't exist in the world of integers. A lookup table doesn't work because you can't lookup what doesn't exist. Even if you "scale" the dividend the results will not be correct in the sense of being the same as an integer division. Take this example:
printf("%x %x\n", 0x10/0x9, 0x30/0x9);
// prints: 1 5
Assuming a multiplicative inverse existed, both terms are divided by the same divisor (9) so must have the same lookup table value (multiplicative inverse). Any fixed lookup value corresponding to the divisor (9) multiplied by an integer will be precisely 3 times greater in the second term relative to the first term. As you can see from the example, the result of an actual integer division is a 5, not a 3.
You can approximate things by using a scaled lookup table. For instance a lookup table that is the multiplicative inverse when the result is divided by 2^16. You would then multiply by the lookup table value and shift the result 16 bits to the right. Time consuming and requires a 1024 byte lookup table. Even so, this would not produce the same results as an integer divide. A compiler optimization is not going to produce "approximate" results of an integer division.
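A hedged sketch of the scaled lookup table described above (the helper names are illustrative; as stated, the result is an approximation, not exact integer division):
#include <stdint.h>

static uint32_t recip16[256];                  // 256 entries * 4 bytes = 1 KiB table

void init_recip16(void) {
    for (uint32_t b = 1; b < 256; b++)
        recip16[b] = (65536u + b/2) / b;       // rounded 16.16 fixed-point reciprocal of b
}

uint32_t approx_div(uint32_t a, uint8_t b) {
    return (uint32_t)(((uint64_t)a * recip16[b]) >> 16);   // multiply, then shift right by 16
}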

How to write Linux C++ debug information when performance is critical?

I'm trying to debug a rather large program with many variables. The code is set up in this way:
while (condition1) {
    //non timing sensitive code
    while (condition2) {
        //timing sensitive code
        //many variables that change each iteration
    }
}
I have many variables on the inner loop that I want to save for viewing. I want to write them to a text file each outer loop iteration. The inner loop executes a different number of times each iteration. It can be just 2 or 3, or it can be several thousands.
I need to see all the variables values from each inner iteration, but I need to keep the inner loop as fast as possible.
Originally, I tried just storing each data variable in its own vector where I just appended a value at each inner loop iteration. Then, when the outer loop iteration came, I would read from the vectors and write the data to a debug file. This quickly got out of hand as variables were added.
I thought about using a string buffer to store the information, but I'm not sure if this is the fastest way given strings would need to be created multiple times within the loop. Also, since I don't know the number of iterations, I'm not sure how large the buffer would grow.
With the information stored being in formats such as:
"Var x: 10\n
Var y: 20\n
.
.
.
Other Text: Stuff\n"
So, is there a cleaner option for writing large amounts of debug data quickly?
If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending records to a log buffer of binary records inside the critical loop. The outer loop can either write that directly to a binary file (which can be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple extra variables (pointers to the end of used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>

struct loop_log {
    // Generally sort in order of size for better packing.
    // Use as narrow types as possible to reduce memory bandwidth.
    // e.g. logging an int loop counter into a short log record is fine if you're sure
    // it always in-practice fits in a short, and has zero performance downside
    int64_t  x, y, z;
    uint64_t ux, uy, uz;
    int32_t  a, b, c;
    uint16_t t, i, j;
    uint8_t  c1, c2, c3;

    // isn't there a less-repetitive way to write this?
    loop_log(int64_t x, int32_t a, int outer_counter, char c1)
        : x(x), a(a), i(outer_counter), c1(c1)
        // leaves other members *uninitialized*, not zeroed.
        // note lack of gcc warning for initializing uint16_t i from an int
        // and for not mentioning every member
    {}
};

static constexpr size_t initial_reserve = 10000;

// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
    std::vector<struct loop_log> log;
    log.reserve(initial_reserve);

    int outer_counter = outer_iterations;
    while (--outer_counter) {
        //non timing sensitive code
        int32_t a = inner_param - outer_counter;
        while (a != 0) {
            //timing sensitive code
            a <<= 1;
            int64_t x = outer_counter * (100LL + a);
            char c1 = x;
            // much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
            log.emplace_back(x, a, outer_counter, c1);
        }

        const auto logdata = log.data();
        const size_t bytes = log.size() * sizeof(*logdata);
        // write group size, then a group of records
        logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
        logfile.write( reinterpret_cast<const char *>(logdata), bytes );
        // you could format the records into strings at this point if you want
        log.clear();
    }
}

#include <fstream>
int main() {
    std::ofstream logfile("dbg.log");
    foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
    // much more efficient code with gcc 5.3 -O3.
    log.emplace_back(x, a, outer_counter, c1);
#else
    // Put the mapping from local var names to struct member names right here in with the loop
    log.push_back( (struct loop_log) {
        .x = x, .y =0, .z=0,   // C99 designated-initializers are a GNU extension to C++,
        .ux=0, .uy=0, .uz=0,   // but gcc doesn't support leaving uninitialized elements before the last initialized one:
        .a = a, .b=0, .c=0,    // without all the ...=0, you get "sorry, unimplemented: non-trivial designated initializers not supported"
        .t=0, .i = outer_counter, .j=0,
        .c1 = (uint8_t)c1
    } );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced-initializer-list unfortunately seems to always result in a copy that doesn't get optimized away by gcc 5.3. It would be nice to avoid writing a lot of repetitive code for a constructor. And with designated initializer lists ({.x = val}), the code inside the loop wouldn't have to care much about what order the struct actually stores things. You could just write them in easy-to-read order.
BTW, .x= val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced-list with gcc's -Wextra (which enables -Wmissing-field-initializers).
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
    int64_t x=0, y=1;
    struct loop_log {int64_t logx=x, logy=y;};   // loop vars as default initializers
    // error: default initializers can't be local vars with automatic storage.
    while (innerloop) { x+=y; y+=x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away any kind of std::vector operation is less good than just making a big array of structs (static, local, or dynamic) and keeping a count yourself of how many records are valid. std::vector checks to see if you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper-bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, and that gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while (inner_loop) {
    int64_t x = ...;
    current_record->x = x;
    ...
    current_record->i = (short)outer_counter;
    ...
    // or maybe
    // *current_record = { .x = x, .i = (short)outer_counter };
    // compilers will probably have an easier time avoiding any copying with a
    // braced initializer list in this case than with vector.push_back
    current_record++;
}
size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or: size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write((const char*)logbuf, record_bytes);
Scattering the stores throughout the inner loop will require the array pointer to be live all the time, but OTOH doesn't require all the loop variables to be live at the same time. IDK if gcc would optimize an emplace_back to store each variable into the vector once the variable was no longer needed, or if it might spill variables to the stack and then copy them all into the vector in one group of instructions.
Using log[records++].x = ... might lead to the compiler keeping the array and counter tying up two registers, since we'd use the record count in the outer loop. We want the inner loop to be fast, and can take the time to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to only use one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into the memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling the vector about it with .resize() kind of defeats the purpose. And of course .resize() will initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
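For completeness, a minimal sketch of how the flat logbuf used above could be allocated; it assumes a loop_log without the user-written constructor (so it is trivially default-constructible), and MAX_RECORDS is a hypothetical upper bound you would have to choose yourself:
#include <cstddef>

constexpr size_t MAX_RECORDS = 1 << 16;   // hypothetical worst-case inner-loop count per outer iteration
static loop_log logbuf[MAX_RECORDS];      // static storage: pages stay untouched (zeroed, lazily mapped) until written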
It sounds like what you really want is to look at your program from within a debugger. You haven't specified a platform, but if you build with debug information (-g using gcc or clang) you should be able to step through the loop when starting the program from within the debugger (gdb on Linux). Assuming you are on Linux, tell it to break at the beginning of the function (break ) and then run. If you tell the debugger to display all the variables you want to see after each step or breakpoint hit, you'll get to the bottom of your problem in no time.
Regarding performance: unless you do something fancy like set conditional breakpoints or watch memory, running the program through the debugger will not dramatically affect perf as long as the program is not stopped. You may need to turn down the optimization level to get meaningful information though.

Performance deterioration for certain array sizes

I'm having an issue with the following code, and I fail to understand where the problem is. The problem occurs, however, only with a v2 Intel processor and not v3.
Consider the following code in C++:
struct Tuple {
    size_t _a;
    size_t _b;
    size_t _c;
    size_t _d;
    size_t _e;
    size_t _f;
    size_t _g;
    size_t _h;
};

void
deref_A(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
}

void
deref_AB(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
}

void
deref_ABC(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
    aTuple._c = C[aIdx];
}

....

void
deref_ABCDEFG(Tuple& aTuple, const size_t& aIdx) {
    aTuple._a = A[aIdx];
    aTuple._b = B[aIdx];
    aTuple._c = C[aIdx];
    aTuple._d = D[aIdx];
    aTuple._e = E[aIdx];
    aTuple._f = F[aIdx];
    aTuple._g = G[aIdx];
}
Note that A, B, C, ..., G are simple arrays (declared globally). Arrays are filled with integers.
The "deref_*" methods simply assign some values from the arrays (accessed via the index aIdx) to the given struct parameter "aTuple". I start by assigning to a single field of the struct parameter, and continue all the way to all fields. That is, each method assigns one more field than the previous one. The "deref_*" methods are called with an index (aIdx) starting from 0, up to the MAX size of the arrays (the arrays all have the same size, by the way). The index is used to access the array elements, as shown in the code -- pretty simple.
Now, consider the graph (http://docdro.id/AUSil1f), which depicts the performance for array sizes starting with 20 million (size_t = 8 bytes) integers, up to 24 million (the x-axis denotes the array size).
For arrays with 21 million integers (size_t), the performance degrades for the methods touching at least 5 different arrays (i.e., deref_ABCDE...G), so you will see peaks in the graph. The performance then improves again for arrays with 22 million integers and onwards. I'm wondering why this is happening for an array size of 21 million only. This happens only when I'm testing on a server with CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, but not with Haswell, i.e., v3. Clearly this is a known issue to Intel and has been resolved, but I don't know what it is, or how to improve the code for v2.
I would highly appreciate any hint from your side.
I suspect you might be seeing cache-bank conflicts. Sandybridge/Ivybridge (Xeon Exxxx v1/v2) have them, Haswell (v3) doesn't.
Update from the OP: it was DTLB misses. Cache bank conflicts will usually only be an issue when your working set fits in cache. Being limited to one 8B read per clock instead of 2 shouldn't stop a CPU from keeping up with main memory, even single-threaded. (8B * 3GHz = 24GB/s, which is about equal to main-memory sequential-read bandwidth.)
I think there's a perf counter for that, which you can check with perf or other tools.
Quoting Agner Fog's microarchitecture doc (section 9.13):
Cache bank conflicts
Each consecutive 128 bytes, or two cache lines, in the data cache is divided into 8 banks of 16 bytes each. It is not possible to do two memory reads in the same clock cycle if the two memory addresses have the same bank number, i.e. if bit 4 - 6 in the two addresses are the same.
; Example 9.5. Sandy bridge cache
mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
Changing the total size of your arrays changes the distance between two elements with the same index, if they're laid out more or less head to tail.
If you have each array aligned to a different 16B offset (modulo 128), this will help some for SnB/IvB. Access to the same index in each array will be in a different cache bank, and thus can happen in parallel. Achieving this can be as simple as allocating 128B-aligned arrays, with 16*n extra bytes at the start of each one. (Keeping track of the pointer to eventually free separately from the pointer to dereference will be an annoyance.)
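A hedged sketch of that staggered-offset allocation (make_staggered is an illustrative helper, not part of the question's code):
#include <cstdlib>

// Give array number 'which' (0..7) a different 16-byte offset modulo 128, so that
// A[i], B[i], ... with the same index land in different cache banks on SnB/IvB.
static size_t* make_staggered(size_t n, int which, void** raw_out) {
    void* raw = nullptr;
    if (posix_memalign(&raw, 128, n * sizeof(size_t) + 128) != 0)   // 128B-aligned, plus slack for the offset
        return nullptr;
    *raw_out = raw;                                                 // keep this pointer for free()
    return reinterpret_cast<size_t*>(static_cast<char*>(raw) + 16 * (which % 8));
}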
If the tuple where you're writing the results has the same address as a read, modulo 4096, you also get a false dependence. (i.e. a read from one of the arrays might have to wait for a store to the tuple.) See Agner Fog's doc for the details on that. I didn't quote that part because I think cache-bank conflicts are the more likely explanation. Haswell still has the false-dependence issue, but the cache-bank conflict issue is completely gone.

How to fast initialize with 1 really big array

I have an enormous array:
int* arr = new int[BIGNUMBER];
How can I fill it with one number really fast? Normally I would do
for (int i = 0; i < BIGNUMBER; i++)
    arr[i] = 1;
but I think it would take too long.
Can I use memcpy or similar?
You could try using the standard function std::uninitialized_fill_n:
#include <memory>
// ...
std::uninitialized_fill_n(arr, BIGNUMBER, 1);
In any case, when it comes to performance, the rule is to always make measurements to back up your assumptions - especially if you are going to abandon a clear, simple design to embrace a more complex one because of an alleged performance improvement.
EDIT:
Notice that - as Benjamin Lindley mentioned in the comments - for trivial types std::uninitialized_fill_n does not bring any advantage over the more obvious std::fill_n. The advantage would exist for non-trivial types, since std::uninitialized_fill would allow you to allocate a memory region and then construct objects in place.
However, one should not fall into the trap of calling std::uninitialized_fill_n for a memory region that is not uninitialized. The following, for instance, would give undefined behavior:
my_object* v = new my_object[BIGNUMBER];
std::uninitialized_fill_n(v, BIGNUMBER, my_object(42)); // UB! the objects are already constructed
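For the trivial int case in the question, the plain std::fill_n mentioned above is the simple, safe spelling; a minimal sketch:
#include <algorithm>

std::fill_n(arr, BIGNUMBER, 1);   // fills the already-allocated array with 1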
An alternative to a dynamic array is std::vector<int>, using the constructor that accepts an initial value for each element:
std::vector<int> v(BIGNUMBER, 1); // 'BIGNUMBER' elements, all with value 1.
As already stated, performance would need to be measured. This approach provides the additional benefit that the memory will be freed automatically.
Some possible alternatives to Andy Prowl's std::uninitialized_fill_n() solution, just for posterity:
If you are lucky and your value is composed of all the same bytes, memset will do the trick.
Some implementations offer a 16-bit version memsetw, but that's not everywhere.
GCC has an extension for Designated Initializers that can fill ranges.
I've worked with a few ARM systems whose libraries had accelerated CPU and DMA variants of word-fill, hand-coded in assembly; you might look and see if your platform offers any of this, if you aren't terribly concerned about portability.
Depending on your processor, even looking into loops around SIMD intrinsics may provide a boost; some of the SIMD units have load/store pipelines that are optimized for moving data around like this. On the other hand you may take severe penalties for moving between register types.
Last but definitely not least, to echo some of the commenters: you should test and see. Compilers tend to be pretty good at recognizing and optimizing patterns like this -- you probably are just trading off portability or readability with anything other than the simple loop or uninitialized_fill_n.
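Illustrating the memset caveat from above: memset writes bytes, so it only fills ints correctly when every byte of the desired value is identical (0 or -1 work, 1 does not):
#include <cstring>

std::memset(arr, 0x00, BIGNUMBER * sizeof(int));   // every element becomes 0
std::memset(arr, 0xFF, BIGNUMBER * sizeof(int));   // every element becomes -1 (all bytes 0xFF)
// memset(arr, 1, ...) would set each element to 0x01010101, not 1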
You may be interested in prior questions:
Is there memset() that accepts integers larger than char?
initializing an array of ints
How to initialize all members of an array to the same value?
Under Linux/x86 gcc with optimizations turned on, your code will compile to the following:
rax = arr
rdi = BIGNUMBER
400690: c7 04 90 01 00 00 00 movl $0x1,(%rax,%rdx,4)
Move the immediate value 1 into the int at address rax + rdx*4
400697: 48 83 c2 01 add $0x1,%rdx
Increment register rdx
40069b: 48 39 fa cmp %rdi,%rdx
Cmp rdi to rdx
40069e: 75 f0 jne 400690 <main+0xa0>
If BIGNUMBER has not been reached yet, jump back to the start of the loop.
It takes about 1 second per gigabyte on my machine, but most of that I bet is paging in physical memory to back the uninitialized allocation.
Just unroll the loop by, say, 8 or 16 times. Functions like memcpy are fast, but they're really there for convenience, not to be faster than anything you could possibly write:
int i;
for (i = 0; i < BIGNUMBER-8; i += 8){
    arr[i+0] = 1;   // this gets rid of the test against BIGNUMBER, and the increment, on 7 out of 8 items.
    arr[i+1] = 1;   // the compiler should be able to see that arr[i] is being calculated repeatedly here
    ...
    arr[i+7] = 1;
}
for (; i < BIGNUMBER; i++) arr[i] = 1;
The compiler might be able to unroll the loop for you, but why take the chance?
Use memset or memcpy:
memset(arr, 0, sizeof(int) * BIGNUMBER);
(Note that memset writes bytes, so it can only fill ints with values whose bytes are all identical; 0 works, 1 does not.)

Try using memset?
memset(arr, 1, sizeof(int) * BIGNUMBER);
http://www.cplusplus.com/reference/cstring/memset/
Be aware that this sets every byte to 1, so each int ends up as 0x01010101 rather than 1.

Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplications or divisions by 2 (or powers of 2) on my values. I would really like these values to be int so that I could use the bit-shift operators:
int a = 1;
int b = a << 24;
However, I cannot, and I have to stick with doubles.
My question is : as there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?
I can even assume that the number of bits is going to be fixed (the software will run on machines where double is always 64 bits).
P.S : And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).
Edit : Or am I completely mistaken and clever compilers already optimize things for me?
Temporary results (with Qt to measure time, overkill, but I don't care):
#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>
#include <iostream>
#include <math.h>

using namespace std;

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    while(true)
    {
        QElapsedTimer timer;
        timer.start();

        int n=100000000;
        volatile double d=12.4;
        volatile double D;
        for(unsigned int i=0; i<n; ++i)
        {
            //D = d*32;        // 200 ms
            //D = d*(1<<5);    // 200 ms
            D = ldexp (d,5);   // 6000 ms
        }

        qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
    }
    return a.exec();
}
Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms), whereas D = ldexp(d,5); is much slower (6000 ms). I know that this is a micro-benchmark, and that suddenly my RAM has exploded because Chrome has decided to compute Pi behind my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.
On the other hand, I'm having trouble doing reinterpret_cast<uint64_t *> because of a const violation (it seems the volatile keyword interferes).
This is one of those highly-application specific things. It may help in some cases and not in others. (In the vast majority of cases, a straight-forward multiplication is still best.)
The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (this will work as long as you don't hit NAN or INF)
So something like this:
union {
    uint64_t i;
    double   f;
};

f = 123.;
i += 0x0010000000000000ull;   // adds 1 to the exponent field, i.e. multiplies by 2

// Check for zero. And if it matters, denormals as well.
Note that this code is not compliant C or C++ in any way and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or with SSE intrinsics.
However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing a multiplication outright. This is especially the case for pre-SSE era where the value needs to be stored from the x87 FPU into memory and then read back into the integer registers.
In the SSE era, the Integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1 to 2 cycle penalty for moving data between the Integer SSE and FP SSE execution units. So the cost is much better than in the x87 era, but it's still there.
All-in-all, it will depend on what else you have on your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before so I'm speaking from first-hand experience.
Now with 256-bit AVX, which (in its first version) only supports FP instructions, there's even less incentive to play tricks like this.
How about ldexp?
Any half-decent compiler will generate optimal code on your platform.
But as #Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.
Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).
And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.
[update]
OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...
...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.
So, multiplying by 1 << p is probably your best bet, as you have discovered.
You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (esp. when you get into subnormals). In the common cases, however, this should work:
const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;

void unsafe_shl(double* d, int shift) {
    unsigned long long* i = (unsigned long long*)d;
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;   // normal, finite: just add 'shift' to the exponent field
    } else if (*i) {
        *d *= (1 << shift);                                    // subnormal (or inf/NaN): fall back to an ordinary multiply
    }
}
EDIT: After doing some timing, this method is oddly slower than the plain double multiply on my compiler and machine, even stripped down to the minimum executed code:
#include <cstdio>
#include <ctime>

double ds[0x1000];
for (int i = 0; i != 0x1000; i++)
    ds[i] = 1.2;

clock_t t = clock();
for (int j = 0; j != 1000000; j++)
    for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
        ds[i] *= 1 << 4;
#else
        ((unsigned int*)&ds[i])[1] += 4 << 20;   // add 4 to the exponent via the high 32 bits (assumes little-endian)
#endif
clock_t e = clock();

printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);
The DOUBLE_SHIFT version completes in 1.6 seconds, with an inner loop of:
movupd xmm0,xmmword ptr [ecx]
lea ecx,[ecx+10h]
mulpd xmm0,xmm1
movupd xmmword ptr [ecx-10h],xmm0
Versus 2.4 seconds otherwise, with an inner loop of:
add dword ptr [ecx],400000h
lea ecx, [ecx+8]
Truly unexpected!
EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds with an inner loop of:
movsd xmm1,mmword ptr [esp+eax*8+38h]
mulsd xmm1,xmm0
movsd mmword ptr [esp+eax*8+38h],xmm1
inc eax
VC10 without /arch:SSE2 (even with /arch:SSE) is 5.3 seconds... with 1/100th of the iterations!! Inner loop:
fld qword ptr [esp+eax*8+38h]
inc eax
fmul st,st(1)
fstp qword ptr [esp+eax*8+30h]
I knew the x87 FP stack was awful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, say, matrix ops to SSE or int hacks, since this is the worst case: loading into the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything performance-related.
The fastest way to do this is probably:
x *= (1 << p);
This sort of thing may simply be done by a machine instruction that adds p to the exponent. Telling the compiler to instead extract some bits with a mask and fiddle with them manually will probably make things slower, not faster.
Remember, C/C++ is not assembly language. Using a bit-shift operator does not necessarily compile to a bit-shift assembly operation, nor does using multiplication necessarily compile to a multiplication. There are all sorts of weird and wonderful things going on, like which registers are being used and which instructions can run simultaneously, that I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.
p.s. Keep in mind, if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply 2, or even 4, doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
Since C++17 you can also use hexadecimal floating-point literals. That way you can multiply by higher powers of 2. For instance:
d *= 0x1p64;
will multiply d by 2^64. I use it to implement my fast integer arithmetic in a conversion to double.
What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.
Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
Although there is little or no practical benefit to treating powers of two specially for float or double types, there is a case for this for double-double types. Double-double multiplication and division is complicated in general but is trivial for multiplying and dividing by a power of two.
E.g. for
typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2
In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.
//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
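A hedged sketch of what those overloads could look like (not the answerer's actual code); scaling by an exact power of two keeps both components exact:
inline doubledouble operator<<(doubledouble x, int n) {
    double f = double(1LL << n);         // exact power of two for n in [0, 62]
    x.hi *= f; x.lo *= f;                // multiply x by 2^n
    return x;
}

inline doubledouble operator>>(doubledouble x, int n) {
    double f = 1.0 / double(1LL << n);   // the reciprocal of a power of two is also exact
    x.hi *= f; x.lo *= f;                // divide x by 2^n
    return x;
}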
Depending on what you're multiplying, if you have data that is recurring enough, a look up table might provide better performance, at the expense of memory.