c overflows inside an expression? - c++

a + b wraps past 255 back to 4 as I would expect, and c / 2 then gives 2 as I expect. But why does the last example not overflow when it evaluates the same two steps?
I'm guessing the intermediate values are stored with more bits and only truncated down to 8 bits at the assignment. In that case, where is the limit? It must overflow at some point.
uint8_t a = 250;
uint8_t b = 10;
uint8_t c = (a + b);
uint8_t d = c / 2;
uint8_t e = (a + b) / 2;
std::cout << unsigned(c) << ", " << unsigned(d) << ", " << unsigned(e) << "\n";
4, 2, 130

It's called integral promotion. The operations themselves are done in your CPU's native integer type, int, which can hold numbers far greater than 255. In the a + b case the result must be stored in a uint8_t, and that's where the truncation happens. In the last case the division is also done as int, and its result, 130, fits in a uint8_t without any truncation.
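A minimal sketch making the promotion and the truncation points explicit (reusing the values from the question; the casts are only there to show where the narrowing happens):
#include <cstdint>
uint8_t a = 250;
uint8_t b = 10;
int sum = a + b;                       // both operands are promoted to int, so sum == 260
uint8_t c = static_cast<uint8_t>(sum); // the narrowing conversion wraps 260 to 4
uint8_t d = c / 2;                     // 4 / 2 == 2
uint8_t e = sum / 2;                   // 260 / 2 == 130, which fits in a uint8_t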

a + b gives the value 260, which is never assigned to a uint8_t in the last case, so you are fine. Only when you assign a value greater than 255 to a uint8_t is there an overflow.

In the following, (a + b) does not overflow: the compiler promotes a and b to int, so the addition is done as int, and the result of the expression is not restricted by the size of the terms or factors in it.
Let's assume instead that the type of a variable like a or b limited the result to that same type. While possible, a language like that would be almost impossible to use. Imagine five variables that, when no type consideration is made, sum to 500, i.e. this:
uint8_t a = 98;
uint8_t b = 99;
uint8_t c = 100;
uint8_t d = 101;
uint8_t e = 102;
The sum of the above variables is 500. Now suppose that in the following, the result of any expression could not exceed the range of one of its terms:
int incorrect = (a + b + c + d + e);
In that case (a + b + c) == 41 (297 wraps modulo 256 to 41), and then (41 + d + e) == 244. That is a nonsensical answer. The alternative is the one most people recognize, i.e.
(98 + 99 + 100 + 101 + 102) == 500;
This is one reason why type conversion exists.
Intermediate results in expressions should not be restricted by the terms or factors in the expression, but only by the resultant type, i.e. the lvalue.
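What C++ actually does is closer to this (a small sketch using the five variables above; the names are illustrative):
int correct = a + b + c + d + e; // every uint8_t operand is promoted to int, so correct == 500
uint8_t narrowed = correct;      // only this final narrowing wraps: 500 % 256 == 244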

#atturri is correct. Here is what happens to your variables in x86 machine language:
REP STOS DWORD PTR ES:[EDI]
MOV BYTE PTR SS:[a],0FA
MOV BYTE PTR SS:[b],0A
MOVZX EAX,BYTE PTR SS:[a] ; promotion to 32-bit integer
MOVZX ECX,BYTE PTR SS:[b] ; promotion to 32-bit integer
ADD EAX,ECX
MOV BYTE PTR SS:[c],AL ; demotion to 8-bit integer
MOVZX EAX,BYTE PTR SS:[c]
CDQ
SUB EAX,EDX
SAR EAX,1
MOV BYTE PTR SS:[d],AL
MOVZX EAX,BYTE PTR SS:[a]
MOVZX ECX,BYTE PTR SS:[b]
ADD EAX,ECX
CDQ
SUB EAX,EDX
SAR EAX,1
MOV BYTE PTR SS:[e],AL

Related

Efficient symmetric comparison based on a bool toggle

my code has a lot of patterns like
int a, b.....
bool c = x ? a >= b : a <= b;
and similarly for other inequality comparison operators. Is there a way to write this to achieve better performance/branchlessness on x86?
Please spare me the "have you benchmarked your code? Is this really your bottleneck?" type of comments. I am asking for other ways to write this so I can benchmark and test them.
EDIT:
bool x
Original expression:
x ? a >= b : a <= b
Branch-free equivalent expression without short-circuit evaluation:
!!x & a >= b | !x & a <= b
This is an example of a generic pattern without resorting to arithmetic trickery. Watch out for operator precedence; you may need parentheses for more complex examples.
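Wrapped into a function, with the parentheses made explicit (a sketch; the function name is hypothetical):
bool branchless_cmp(int a, int b, bool x) {
    return (!!x & (a >= b)) | (!x & (a <= b));
}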
Another way would be:
bool c = (2*x - 1) * (a - b) >= 0;
This generates a branch-less code here: https://godbolt.org/z/1nAp7G
#include <stdbool.h>
bool foo(int a, int b, bool x)
{
return (2*x - 1) * (a - b) >= 0;
}
------------------------------------------
foo:
movzx edx, dl
sub edi, esi
lea eax, [rdx-1+rdx]
imul eax, edi
not eax
shr eax, 31
ret
Since you're just looking for equivalent expressions, this comes from patching #AlexanderZhang's comment:
(a==b) || (x != (a<b))
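A quick exhaustive sanity check (hypothetical test code, not from the answer) that the patched expression matches the original ternary:
#include <cassert>
int main() {
    for (int a : {0, 1, 2})
        for (int b : {0, 1, 2})
            for (bool x : {false, true})
                assert(((a == b) || (x != (a < b))) == (x ? a >= b : a <= b));
}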
The way you currently have it is possibly unbeatable.
But for positive integral a and b and bool x you can use
a / b * x + b / a * !x
(You could adapt this, at the cost of extra cpu burn, by replacing a with a + 1 and similarly for b if you need to support zero.)
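As a sketch (the function name is hypothetical, and both operands are assumed to be strictly positive):
bool cmp_by_division(unsigned a, unsigned b, bool x) {
    // a / b is non-zero exactly when a >= b, and b / a is non-zero exactly when a <= b
    return a / b * x + b / a * !x;
}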
If a >= b, then a - b is non-negative and its first bit (the sign bit) is 0. Otherwise a - b is negative and the sign bit is 1.
So we can simply XOR the sign bit of a - b with the value of x:
constexpr auto shiftBit = sizeof(int)*8-1;
bool foo(bool x, int a, int b){
return x ^ bool((a-b)>>shiftBit);
}
foo(bool, int, int):
sub esi, edx
mov eax, edi
shr esi, 31
xor eax, esi
ret

Why is A / <constant-int> faster when A is unsigned vs signed? [duplicate]

This question already has answers here:
performance of unsigned vs signed integers
(12 answers)
Closed 4 years ago.
I have been reading through the Optimizing C++ wikibook. In the faster operations chapter, one piece of advice is as follows:
Integer division by a constant
When you divide an integer (that is known to be positive or zero) by a constant, convert the integer to unsigned.
If s is a signed integer, u is an unsigned integer, and C is a constant integer expression (positive or negative), the operation s / C is slower than u / C, and s % C is slower than u % C. This is most significant when C is a power of two, but in all cases, the sign must be taken into account during division.
The conversion from signed to unsigned, however, is free of charge, as it is only a reinterpretation of the same bits. Therefore, if s is a signed integer that you know to be positive or zero, you can speed up its division using the following (equivalent) expressions: (unsigned)s / C and (unsigned)s % C.
I tested this statement with gcc, and the u / C expression seems to perform consistently better than s / C.
The following example is provided below:
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <vector>
#include <numeric>
using namespace std;
int main(int argc, char *argv[])
{
constexpr int vsize = 1e6;
std::vector<int> x(vsize);
std::iota(std::begin(x), std::end(x), 0); //0 is the starting number
constexpr int a = 5;
auto start_signed = std::chrono::system_clock::now();
int sum_signed = 0;
for ([[gnu::unused]] auto i : x)
{
// signed is by default
int v = rand() % 30 + 1985; // v in the range 1985-2014
sum_signed += v / a;
}
auto end_signed = std::chrono::system_clock::now();
auto start_unsigned = std::chrono::system_clock::now();
int sum_unsigned = 0;
for ([[gnu::unused]] auto i : x)
{
int v = rand() % 30 + 1985; // v in the range 1985-2014
sum_unsigned += static_cast<unsigned int>(v) / a;
}
auto end_unsigned = std::chrono::system_clock::now();
// signed
std::chrono::duration<double> diff_signed = end_signed - start_signed;
std::cout << "sum_signed: " << sum_signed << std::endl;
std::cout << "Time it took SIGNED: " << diff_signed.count() * 1000 << "ms" << std::endl;
// unsigned
std::chrono::duration<double> diff_unsigned = end_unsigned - start_unsigned;
std::cout << "sum_unsigned: " << sum_unsigned << std::endl;
std::cout << "Time it took UNSIGNED: " << diff_unsigned.count() * 1000 << "ms" << std::endl;
return 0;
}
You can compile and run the example here: http://cpp.sh/8kie3
Why is this happening?
After some toying around, I believe I've tracked down the source of the problem to be the guarantee by the standard that negative integer divisions are rounded towards zero since C++11. For the simplest case, which is division by two, check out the following code and the corresponding assembly (godbolt link).
constexpr int c = 2;
int signed_div(int in){
return in/c;
}
int unsigned_div(unsigned in){
return in/c;
}
Assembly:
signed_div(int):
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
unsigned_div(unsigned int):
mov eax, edi
shr eax
ret
What do these extra instructions accomplish? shr eax, 31 (right shift by 31) just isolates the sign bit, meaning that if the input is non-negative, eax == 0, otherwise eax == 1. Then the input is added to eax. In other words, these two instructions translate to "if the input is negative, add 1 to it". The implications of the addition are the following (only for negative input).
If the input is even, the addition sets its least significant bit to 1, but the shift discards it. The output is not affected by this operation.
If the input is odd, its least significant bit was already 1, so the addition causes a carry to propagate to the rest of the digits. When the right shift occurs, the least significant bit is discarded and the output is greater by one than the output we'd have if we hadn't added the sign bit to the input. Because by default a right shift in two's complement rounds towards negative infinity, the output now is the result of the same division but rounded towards zero.
In short, even negative numbers aren't affected, and odd numbers are now rounded towards zero instead of towards negative infinity.
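A small sketch showing the two roundings side by side (the shift lines assume a typical two's-complement target, where right-shifting a negative value is an arithmetic shift):
#include <cassert>
int main() {
    assert(-7 / 2 == -3);          // C++11 mandates truncation towards zero
    assert((-7 >> 1) == -4);       // a bare arithmetic shift rounds towards negative infinity
    assert(((-7 + 1) >> 1) == -3); // adding the sign bit first reproduces the division result
}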
For non-power-of-2 constants it gets a bit more complicated. Not all constants give the same output, but for a lot of them it looks similar to the following (godbolt link).
constexpr int c = 3;
int signed_div(int in){
return in/c;
}
int unsigned_div(unsigned in){
return in/c;
}
Assembly:
signed_div(int):
mov eax, edi
mov edx, 1431655766
sar edi, 31
imul edx
mov eax, edx
sub eax, edi
ret
unsigned_div(unsigned int):
mov eax, edi
mov edx, -1431655765
mul edx
mov eax, edx
shr eax
ret
We don't care about the change of the constant in the assembly output, because it does not affect execution time. Assuming that mul and imul take the same amount of time (which I don't know for sure but hopefully someone more knowledgeable than me can find a source on it), the signed version once again takes longer because it has extra instructions to handle the sign bit for negative operands.
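For reference, the unsigned sequence is the classic multiply-by-reciprocal trick: the magic constant -1431655765 is 0xAAAAAAAB, and taking the high 32 bits of the product followed by shr eax amounts to a shift by 33. A hedged sketch of the idea (not the compiler's exact code):
#include <cassert>
#include <cstdint>
std::uint32_t div3(std::uint32_t x) {
    // multiply by ceil(2^33 / 3) and keep the top bits; exact for all 32-bit inputs
    return static_cast<std::uint32_t>((x * 0xAAAAAAABull) >> 33);
}
int main() {
    for (std::uint32_t x : {0u, 1u, 5u, 6u, 1000000u, 0xFFFFFFFFu})
        assert(div3(x) == x / 3);
}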
Notes
Compilation was done on godbolt using x86-64 GCC 7.3 with the -O2 flag.
Round-towards-zero behavior has been mandated by the standard since C++11. Before that it was implementation-defined, according to this cppreference page.

In special cases: Is & faster than %?

I saw the chosen answer to this post.
I was surprised that (x & 255) == (x % 256) if x is an unsigned integer, and I wondered whether it makes sense to always replace % with & in x % n for n = 2^a (a = [1, ...]) when x is a positive integer.
This is a special case in which I, as a human, can decide, because I know the values the program will deal with and the compiler does not. Can I gain a significant performance boost if my program uses a lot of modulo operations?
Sure, I could just compile and look at the disassembly. But that would only answer my question for one compiler/architecture. I would like to know whether this is in principle faster.
If your integral type is unsigned, the compiler will optimize it, and the result will be the same. If it's signed, something is different...
This program:
int mod_signed(int i) {
return i % 256;
}
int and_signed(int i) {
return i & 255;
}
unsigned mod_unsigned(unsigned int i) {
return i % 256;
}
unsigned and_unsigned(unsigned int i) {
return i & 255;
}
will be compiled (by GCC 6.2 with -O3; Clang 3.9 produces very similar code) into:
mod_signed(int):
mov edx, edi
sar edx, 31
shr edx, 24
lea eax, [rdi+rdx]
movzx eax, al
sub eax, edx
ret
and_signed(int):
movzx eax, dil
ret
mod_unsigned(unsigned int):
movzx eax, dil
ret
and_unsigned(unsigned int):
movzx eax, dil
ret
The resulting assembly of mod_signed is different because
If both operands to a multiplication, division, or modulus expression have the same sign, the result is positive. Otherwise, the result is negative. The result of a modulus operation's sign is implementation-defined.
and AFAICT, most implementations decided that the result of a modulus expression always has the same sign as the first operand. See this documentation.
Hence, mod_signed is optimized to (from nwellnhof's comment):
int d = i < 0 ? 255 : 0;
return ((i + d) & 255) - d;
Logically, we can prove that i % 256 == (i & 255) for all unsigned integers, so we can trust the compiler to do its job.
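If you want an empirical spot check on top of the proof, a brute-force loop (hypothetical test code) is easy to write:
#include <cassert>
int main() {
    for (unsigned i = 0; i < (1u << 20); ++i)
        assert(i % 256 == (i & 255));
}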
I did some measurements with gcc, and
if the right operand of / or % is a compile-time constant that's a power of 2, gcc can turn it into the corresponding bit operation.
Here are some of my benchmarks for divisions:
What has a better performance: multiplication or division? As you can see, the running times with divisors that are statically known powers of two are noticeably lower than with other statically known divisors.
So if / and % with statically known power-of-two arguments describe your algorithm better than bit ops, feel free to prefer / and %.
You shouldn't lose any performance with a decent compiler.

Can someone explain the meaning of malloc(20 * c | -(20 * (unsigned __int64)(unsigned int)c >> 32 != 0))

In decompiled code generated by IDA I see expressions like:
malloc(20 * c | -(20 * (unsigned __int64)(unsigned int)c >> 32 != 0))
malloc(6 * n | -(3 * (unsigned __int64)(unsigned int)(2 * n) >> 32 != 0))
Can someone explain the purpose of these calculations?
c and n are int (signed integer) values.
Update.
Original C++ code was compiled with MSVC for 32-bit platform.
Here's assembly code for second line of decompiled C-code above (malloc(6 * ..)):
mov ecx, [ebp+pThis]
mov [ecx+4], eax
mov eax, [ebp+pThis]
mov eax, [eax]
shl eax, 1
xor ecx, ecx
mov edx, 3
mul edx
seto cl
neg ecx
or ecx, eax
mov esi, esp
push ecx ; Size
call dword ptr ds:__imp__malloc
I'm guessing that the original source code used the C++ new[] operator to allocate an array and was compiled with Visual C++. As user3528438's answer indicates, this code is meant to prevent overflow. Specifically, it's a 32-bit unsigned saturating multiply: if the result of the multiplication would be greater than 4,294,967,295, the maximum value of a 32-bit unsigned number, the result is clamped, or "saturated", to that maximum.
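A rough C++ rendering of that saturating allocation (a sketch only; alloc_array is a hypothetical helper, not what the compiler actually emits):
#include <cstdint>
#include <cstdlib>
void* alloc_array(std::uint32_t count, std::uint32_t elem_size) {
    std::uint64_t bytes = static_cast<std::uint64_t>(count) * elem_size;
    // if the product does not fit in 32 bits, saturate to 0xFFFFFFFF so the malloc fails
    std::uint32_t size = (bytes >> 32) ? UINT32_MAX : static_cast<std::uint32_t>(bytes);
    return std::malloc(size); // on the 32-bit target in question, size_t is 32 bits
}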
Since Visual Studio 2005, Microsoft's C++ compiler has generated code to protect against overflows. For example, I can generate assembly code that could be decompiled into your examples by compiling the following with Visual C++:
#include <stdlib.h>
void *
operator new[](size_t n) {
return malloc(n);
}
struct S {
char a[20];
};
struct T {
char a[6];
};
void
foo(int n, S **s, T **t) {
*s = new S[n];
*t = new T[n * 2];
}
Which, with Visual Studio 2015's compiler generates the following assembly code:
mov esi, DWORD PTR _n$[esp]
xor ecx, ecx
mov eax, esi
mov edx, 20 ; 00000014H
mul edx
seto cl
neg ecx
or ecx, eax
push ecx
call _malloc
mov ecx, DWORD PTR _s$[esp+4]
; Line 19
mov edx, 6
mov DWORD PTR [ecx], eax
xor ecx, ecx
lea eax, DWORD PTR [esi+esi]
mul edx
seto cl
neg ecx
or ecx, eax
push ecx
call _malloc
Most of the decompiled expression is actually meant to handle just one assembly statement. The assembly instruction seto cl sets CL to 1 if the previous MUL instruction overflows, otherwise it sets CL to 0. Similarly the expression 20 * (unsigned __int64)(unsigned int)c >> 32 != 0 evaluates to 1 if the result of 20 * c overflows, and evaluates to 0 otherwise.
If this overflow protection wasn't there and the result of 20 * c did actually overflow then the call to malloc would probably succeed, but allocate much less memory than the program intended. The program would then likely write past the end of the memory actually allocated and trash other bits of memory. This would amount to a buffer overrun, one that could be potentially exploited by hackers.
Since this code is decompiled from assembly, we can only guess what it actually does.
Let's first format it to figure out the precedence:
malloc(20 * c | -(20 * (unsigned __int64)(unsigned int)c >> 32 != 0))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                       //this is evaluated first, promoting c to a
                       //64-bit unsigned int without doing sign
                       //extension, regardless of the type of c
malloc(20 * c | -(20 * (uint64_t)c >> 32 != 0))
                  ^^^^^^^^^^^^^^^^
                  //then, multiply by 20, with a uint64 result
malloc(20 * c | -(20 * (uint64_t)c >> 32 != 0))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  //if 20c is greater than 2^32-1, then the result is true;
                  //-1 is used to generate a mask of 0xffffffff, and the
                  //bitwise operator | then forces 20c to 0xffffffff
                  //(2^32-1, the maximum of size_t, the input type of malloc),
                  //regardless of what 20c actually is.
                  //if 20c is smaller than 2^32-1, then the result is false,
                  //the mask is 0, and the bitwise operator | keeps the final
                  //input to malloc as 20c, untouched
What are 20 and 6?
Those probably come from the common usage of
malloc(sizeof(Something)*count). Those two calls to malloc are probably made with sizeof(Something) and sizeof(SomethingElse) evaluated to 20 and 6 at compile time.
So what this code actually does:
My guess is that it's trying to prevent sizeof(Something)*count from overflowing, which would let the malloc succeed with a too-small size and cause a buffer overflow when the memory is used.
By evaluating the product as a 64-bit unsigned int and testing it against 2^32-1, the input to malloc is set, whenever the size exceeds 2^32-1, to a value so large that the allocation is guaranteed to fail (no 32-bit system can allocate 2^32-1 bytes of memory).
Can someone explain the purpose of these calculations?
It is important to understand that compiling changes the semantic meaning of code. Much unspecified behavior of the original code becomes specified by the compilation process.
IDA has no idea whether things the generated assembly code just happens to do are important or not. To be safe, it tries to perfectly replicate the behavior of the assembly code, even in cases that cannot possibly happen given the way the code is used.
Here, IDA is probably replicating the overflow characteristics that the conversion of types just happens to have on this platform. It can't just replicate the original C code because the original C code likely had unspecified behavior for some values of c or n, likely negative ones.
For example, say I write this C code: int f(unsigned j) { return j; }. My compiler will likely turn that into very simple assembly code giving whatever behavior for negative values of j that my platform just happens to give.
But if you decompile the generated assembly, you cannot decompile it to int f(unsigned j) { return j; } because that will not behave the same as my assembly code did on platforms with different overflow behavior. That could compile to code (on other platforms) that returns different values than my assembly code does for negative values of j.
So it is often literally impossible (in fact, incorrect) to decompile C code into the original code, it will often have these kinds of "portably replicate this platform's behavior" oddities.
It's rounding up to the nearest block size.
Forgive me. What it's actually doing is calculating a multiple of c while simultaneously checking for a negative value (overflow):
#include <iostream>
#include <cstdint>
size_t foo(char c)
{
return 20 * c | -(20 * (std::uint64_t)(unsigned int)c >> 32 != 0);
}
int main()
{
using namespace std;
for (char i = -4 ; i < 4 ; ++i)
{
cout << "input is: " << int(i) << ", result is " << foo(i) << endl;
}
return 0;
}
results:
input is: -4, result is 18446744073709551615
input is: -3, result is 18446744073709551615
input is: -2, result is 18446744073709551615
input is: -1, result is 18446744073709551615
input is: 0, result is 0
input is: 1, result is 20
input is: 2, result is 40
input is: 3, result is 60
To me the number 18446744073709551615 doesn't mean much, at a glance. Only after seeing it expressed in hex I went "ah". – Jongware
adding << hex:
input is: -1, result is ffffffffffffffff

Accessing three static arrays is quicker than one static array containing 3x data?

I have 700 items, and I loop through all 700; for each one I obtain the item's three attributes and perform some basic calculations. I have implemented this using two techniques:
1) Three 700-element arrays, one array for each of the three attributes. So:
item0.a = array1[0]
item0.b = array2[0]
item0.e = array3[0]
2) One 2100-element array containing data for the three attributes consecutively. So:
item0.a = array[(0*3)+0]
item0.b = array[(0*3)+1]
item0.e = array[(0*3)+2]
Now, the three item attributes a, b and e are used together within the loop, so it would make sense that storing them in one array should perform better than the three-array technique (due to spatial locality). However:
Three 700-element arrays = 3300 CPU cycles on average for the whole loop
One 2100-element array = 3500 CPU cycles on average for the whole loop
Here is the code for the 2100-array technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];
//I have left out code for simplicity. You can assume by now the array is populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned short j = i * 3;
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array[j + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
and here is the code for the three 700-element arrays technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];
//I have left out code for simplicity. You can assume by now the arrays are populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned int a= array1[i]; //Array 1
unsigned int b= array2[i]; //Array 2
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array3[i]; //Array 3
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
Why isn't the technique using one 2100-element array faster? It should be, since the three attributes are used together for each of the 700 items.
I used MSVC 2012, Win 7 64
Assembly for 3x 700-element array technique:
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r8,[this]
or rax,rdx
mov dword ptr [r8],ecx
mov r8d,8ch
mov r9,rax
lea rdx,[rbx+0Ch]
for(int i=0; i < 700; i++){
sub rdi,rbx
unsigned int a = array1[i];
unsigned int b = array2[i];
data_for_all_items = data_for_all_items & (a != -1 & b != -1);
cmp dword ptr [rdi+rdx-0Ch],0FFFFFFFFh
lea rdx,[rdx+14h]
setne cl
cmp dword ptr [rdi+rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-14h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-20h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-14h],0FFFFFFFFh
setne al
and cl,al
and r15b,cl
dec r8
jne 013F26DA53h
unsigned int e = array3[i];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Assembly for the 2100-element array technique:
start = __rdtscp(&x);
rdtscp
lea r8,[this]
shl rdx,20h
or rax,rdx
mov dword ptr [r8],ecx
for(int i=0; i < 700; i++){
xor r8d,r8d
mov r10,rax
unsigned short j = i*3;
movzx ecx,r8w
add cx,cx
lea edx,[rcx+r8]
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (best_ask != -1 & best_bid != -1);
movzx ecx,dx
cmp dword ptr [r9+rcx*4+4],0FFFFFFFFh
setne dl
cmp dword ptr [r9+rcx*4],0FFFFFFFFh
setne al
inc r8d
and dl,al
and r14b,dl
cmp r8d,2BCh
jl 013F05DA10h
unsigned int e = array[pos + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Edit: Given your assembly code, the second loop is unrolled five times. The unrolled version could run faster on an out-of-order execution CPU such as any modern x86/x86-64 CPU.
The second code is vectorisable: two elements of each array could be loaded at each iteration in one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU, four doubles could be loaded into a YMM register, and the number of cycles should therefore be cut to a quarter.
The first loop is not vectorisable along i, since the value of a in iteration i+1 comes from a location 3 elements after the one where the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set.
Using proper data structures is crucial when programming CPUs with vector capabilities. Converting codes like your first loop into something like your second loop is 90% of the job that one has to do in order to get good performance on Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.
The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100-element array, SIMD friendly. You need to use a hybrid Struct of Arrays, aka an Array of Structs of Arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....
Below is code using GCC's vector extensions to do this. Note that now the 2100-element array code looks almost the same as the 700-element array code, but it uses one array instead of three. And instead of having 700 elements between a, b and e, there are only 12 elements between them.
I did not find an easy solution to convert uint4 to double4 with the GCC vector extensions, and I don't want to spend the time writing intrinsics to do this right now, so I made the accumulators (cv and dv) unsigned int vectors; but for performance I would not want to be converting uint4 to double4 in a loop anyway.
typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};
uint4* av = (uint4*)&array[0];
uint4* bv = (uint4*)&array[4];
uint4* ev = (uint4*)&array[8];
for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3
test = test & ((av[i]!= -1) & (bv[i] != -1));
cv += (av[i] * ev[i]);
dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double v = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
The concept of 'spatial locality' is throwing you off a little bit. Chances are that with both solutions, your processor is doing its best to cache the arrays.
Unfortunately, the version of your code that uses one array also performs some extra math (the index calculations). This is probably where your extra cycles are being spent.
Spatial locality is indeed useful, but it's actually helping you in the second case (3 distinct arrays) much more.
The cache line size is 64 bytes (note that it isn't divisible by 3), so a single access to a 4- or 8-byte value effectively prefetches the next elements. In addition, keep in mind that the CPU HW prefetcher is likely to go on and prefetch even further elements ahead.
However, when a, b, e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e: the next loads are already going there (and would likely just merge in the CPU with the first load, or wait for it to retrieve the data). In fact, when the arrays are merged, you fetch a new memory line only once per 64/(3*4) =~ 5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.
In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependencies on the loads sent by the last ones). However, you would have to run ahead pretty far in order to issue the next load every time, and eventually the finite size of the CPU instruction queues will block you, maybe before reaching the full potential memory bandwidth (number of parallel outstanding loads).
The alternative option, on the other hand, where you have 3 distinct arrays, uses the spatial locality / HW prefetching solely across iterations. On each iteration, you'll issue 3 loads, which would fetch a full line once every 64/4 = 16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of the 5. The difference becomes even bigger when HW prefetching is involved, because you have 3 streams instead of one, meaning you can issue more prefetches (and look even further ahead).