Why are all arrays aligned to 16 bytes on my implementation?

My very simple code is shown below:
#include <iostream>
#include <stdalign.h>
int main() {
    char array_char[2] = {'a', 'b'};
    float array_float[2] = {1, 2};
    std::cout << "alignof(array_char): " << alignof(array_char) << std::endl;
    std::cout << "alignof(array_float): " << alignof(array_float) << std::endl;
    std::cout << "address of array_char: " << (void *) array_char << std::endl;
    std::cout << "address of array_float: " << array_float << std::endl;
}
The output of this code is
alignof(array_char): 1
alignof(array_float): 4
address of array_char: 0x7fff5e8ec580
address of array_float: 0x7fff5e8ec570
The results of the alignof operator are as expected, but the actual addresses of the two arrays are not consistent with them. No matter how many times I run it, the addresses are always 16-byte aligned.
I'm using gcc 5.4.0 on Ubuntu 16.04 with Intel CORE i5 7th Gen CPU.

I have found this patch.
This seems to have been a bug for x86_64 fixed in GCC 6.4.
The System V x86-64 ABI requires aggregate types (such as arrays and structs) to be aligned to at least 16 bytes if they are at least 16 bytes large. According to a comment in the ABI specification this is meant to facilitate use of SSE instructions.
GCC seems to have mistakenly applied that rule to aggregates of 16 bits (instead of 16 bytes) and larger.
I suggest you upgrade your compiler to a more recent GCC version.
This is, however, only an optimization issue, not a correctness one. There is nothing wrong with stricter alignment for the variables, and (as with the mentioned SSE) overalignment may have performance benefits in some situations that outweigh the cost of the wasted stack memory.
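As a quick check of the ABI rule above, here is a minimal sketch (my own illustration, not from the original question) comparing a small aggregate with one at the 16-byte threshold. On a fixed compiler only the larger array is required to be 16-byte aligned, though the small one may still land on a 16-byte boundary by chance:
#include <cstdint>
#include <iostream>
int main() {
    char small_arr[8];   // under 16 bytes: no 16-byte ABI requirement
    char large_arr[32];  // 16 bytes or more: ABI requires 16-byte alignment
    std::cout << "small % 16 = "
              << (reinterpret_cast<std::uintptr_t>(small_arr) % 16) << '\n';
    std::cout << "large % 16 = "
              << (reinterpret_cast<std::uintptr_t>(large_arr) % 16) << '\n';
}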

Related

Max number of elements in a vector

I was looking to see how many elements I can stick into a vector before the program crashes. When running the code below, the program crashed with a bad_alloc at i=90811045, i.e. when trying to add the 90811045th element. My question is: Why 90811045?
it is:
not a power of two
not the value that vector.max_size() gives
the same number both in debug and release
the same number after restarting my computer
the same number regardless of what the value of the long long is
note: I know I can fix this by using vector.reserve() or other methods; I am just interested in where 90811045 comes from.
code used:
#include <iostream>
#include <vector>
int main() {
    std::vector<long long> myLongs;
    std::cout << "Max size expected : " << myLongs.max_size() << std::endl;
    for (int i = 0; i < 160000000; i++) {
        myLongs.push_back(i);
        if (i % 10000 == 0) {
            std::cout << "Still going! : " << i << " \r";
        }
    }
    return 0;
}
extra info:
I am currently using 64-bit Windows with 16 GB of RAM.
Why 90811045?
It's probably just incidental.
That vector is not the only thing that uses memory in your process. There is the execution stack, where local variables are stored. There is memory allocated for buffering the input and output streams. Furthermore, the global memory allocator uses some of the memory for bookkeeping.
90811044 elements were added successfully. The vector implementation (typically) has a deterministic strategy for allocating a larger internal buffer. Typically, it multiplies the previous capacity by a constant factor (greater than 1). Hence, we can conclude that 90811044 * sizeof(long long) + other_usage is consistently small enough to be allocated successfully, but (90811044 * sizeof(long long)) * some_factor + other_usage is consistently too much.
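To see that deterministic growth strategy directly, here is a small sketch (mine, not the asker's code) that prints each capacity jump. The exact growth factor is an implementation detail, not guaranteed by the standard (libstdc++ doubles; MSVC grows by roughly 1.5x):
#include <iostream>
#include <vector>
int main() {
    std::vector<long long> v;
    std::size_t last_cap = v.capacity();
    for (int i = 0; i < 1000000; i++) {
        v.push_back(i);
        if (v.capacity() != last_cap) {  // a reallocation just happened
            last_cap = v.capacity();
            std::cout << "size " << v.size() << " -> capacity " << last_cap << '\n';
        }
    }
}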

C++ pointer showing impossible memory location

I made a C++ variable and printed its address and it came out to be very large: 0x7ffdf584da2c.
My code is as follows:
#include <iostream>
using namespace std;
int main()
{
    int var = 10;
    cout << "value: " << var << " address: " << &var << endl;
    return 0;
}
value: 10 address: 0x7ffec6f111c4
This kind of hexadecimal memory address (0x7ffdf584da2c) looks impossible, as its decimal value (140728722577964) is far larger than anything my laptop should have.
I have a dual-boot laptop with Windows 10 and Ubuntu, and memory of around 500 GB. The code was written on Ubuntu.
It's fine.
Computers in 2020 are very complicated. Your process gets a virtual address space whose maximum "slot" will almost certainly exceed the size of the actual RAM on your system (and page file).
You are seeing the stack, which grows from the top of memory (not actual RAM, but the virtual address space the operating system manages) toward the bottom. So it normally begins at a high value, limited by the OS and the architecture's bit width (32, 64, ...).
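To see that layout for yourself, here is a minimal sketch (my own; exact values vary per run because of ASLR) comparing a stack address with a heap address. On typical x86-64 Linux the stack value sits near the top of the 47-bit user-space range, which is exactly why it looks so large:
#include <iostream>
#include <memory>
int main() {
    int on_stack = 0;
    auto on_heap = std::make_unique<int>(0);
    std::cout << "stack: " << &on_stack << '\n';      // high address, near top of user space
    std::cout << "heap:  " << on_heap.get() << '\n';  // much lower address
}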

Why does a VS Debug build allocate variables so far apart?

I'm using Visual Studio 2019, and I noticed that in debug builds, the variables are allocated so far apart from one another. I looked at Project Properties and tried searching online but could not find anything. I ran the following code below in both Debug and Release mode and here are the respective outputs.
#include <iostream>

int main() {
    int a = 3;
    int b = 5;
    int c = 8;
    int d[5] = { 10,10,10,10,10 };
    int e = 14;
    std::cout << "a: " << &a
              << "\nb: " << &b
              << "\nc: " << &c
              << "\nd_start: " << &d[0]
              << "\nd_end: " << &d[4] + 1
              << "\ne: " << &e
              << std::endl;
}
As you can see below, variables are allocated as you would expect (one after the other) with no wasted memory in between. Even the last variable, e, is optimized to slot between c and d.
// Release_x64 Build Output
a: 0000003893EFFC40
b: 0000003893EFFC44
c: 0000003893EFFC48
d_start: 0000003893EFFC50
d_end: 0000003893EFFC64
e: 0000003893EFFC4C // e is optimized in between c and d
Below is the output that confuses me. Here you can see that a and b are allocated 32 bytes apart! So there are 28 bytes of wasted/uninitialized memory between them. The same thing happens for the other variables except for the int d[5]. d has 32 uninitialized bytes after c but only 24 uninitialized bytes before e.
// Debug_x64 Build Output
a: 00000086D7EFF3F4
b: 00000086D7EFF414
c: 00000086D7EFF434
d_start: 00000086D7EFF458
d_end: 00000086D7EFF46C
e: 00000086D7EFF484
My question is: why is this happening? Why does MSVC allocate these variables so far apart from one another, and what determines how much space to separate them by, such that it's different for arrays?
The debug version of the allocator allocates storage differently than the release version. In particular, the debug version allocates some space at the beginning and end of each block of storage, so its allocation patterns are somewhat different.
The debug allocator also checks the storage at the start and end of the block it allocated to see if it has been damaged in any way.
Storage is allocated in quantized chunks, where the quantum is unspecified but is something like 16 or 32 bytes. Thus, if you allocate a DWORD array of six elements (size = 6 * sizeof(DWORD) bytes = 24 bytes), then the allocator will actually deliver 32 bytes (one 32-byte quantum or two 16-byte quanta). So if you write element [6] (the seventh element), you overwrite some of the "dead space" and the error is not detected. But in the release version, the quantum might be 8 bytes, and three 8-byte quanta would be allocated, and writing the [6] element of the array would overwrite a part of the storage allocator data structure that belongs to the next chunk. After that it is all downhill. The error might not even show up until the program exits!

You can construct similar "boundary condition" situations for any size quantum. Because the quantum size is the same for both versions of the allocator, but the debug version of the allocator adds hidden space for its own purposes, you will get different storage allocation patterns in debug and release mode.
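Following the answer's allocator framing, here is a minimal sketch (my own; the numbers it prints are implementation-specific) that makes the chunk spacing visible by printing the addresses of successive small heap allocations; compare the spacing between a debug and a release run:
#include <iostream>
int main() {
    char* blocks[4];
    for (int i = 0; i < 4; ++i) {
        blocks[i] = new char[24];  // request 24 bytes; the delivered chunk is larger
        std::cout << "allocation " << i << ": "
                  << static_cast<void*>(blocks[i]) << '\n';
    }
    for (char* p : blocks)
        delete[] p;
}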

Bus error with allocated memory on a heap

I get a Bus Error in this code:
char* mem_original;
int int_var = 987411;
mem_original = new char [250];
memcpy(&mem_original[250-sizeof(int)], &int_var, sizeof(int));
...
const unsigned char* mem_u_const = (unsigned char*)mem_original;
...
const unsigned char *location = mem_u_const + 250 - sizeof(int);
std::cout << "sizeof(int) = " << sizeof(int) << std::endl;//it's printed out as 4
std::cout << "byte 0 = " << int(*location) << std::endl;
std::cout << "byte 1 = " << int(*(location+1)) << std::endl;
std::cout << "byte 2 = " << int(*(location+2)) << std::endl;
std::cout << "byte 3 = " << int(*(location+3)) << std::endl;
int original_var = *((const int*)location);
std::cout << "original_var = " << original_var << std::endl;
That works well a few times, printing out:
sizeof(int) = 4
byte 0 = 0
byte 1 = 15
byte 2 = 17
byte 3 = 19
original_var = 987411
And then it fails with:
sizeof(int) = 4
byte 0 = 0
byte 1 = 15
byte 2 = 17
byte 3 = 19
Bus Error
It's built & run on Solaris OS (C++ 5.12).
The same code works well on Linux (gcc 4.12) & Windows (msvc-9.0).
We can see:
memory was allocated on the heap by new[].
memory is accessible (we can read it byte by byte)
memory contains exactly what there should be, not corrupted.
So what may be the reason for the Bus Error? Where should I look?
UPD:
If I memcpy(...) the location into original_var at the end, it works. But what is the problem with *((const int*)location)?
This is a common issue for developers with no experience on hardware that has alignment restrictions - such as SPARC. x86 hardware is very forgiving of misaligned access, albeit with performance impacts. Other types of hardware? SIGBUS.
This line of code:
int original_var = *((const int*)location);
invokes undefined behavior. You're taking an unsigned char * and interpreting what it points to as an int. You can't do that safely. Period. It's undefined behavior - for the very reason you're experiencing.
You're violating the strict aliasing rule. See What is the strict aliasing rule? Put simply, you can't refer to an object of one type as though it were an object of another type. The bytes in a char array are not an int object, and reading them through an int * is not allowed.
Oracle's Solaris Studio compilers actually provide a command-line argument that will let you get away with that on SPARC hardware - -xmemalign=1i (see https://docs.oracle.com/cd/E19205-01/819-5265/bjavc/index.html). Although to be fair to GCC, without that option, the forcing you do in your code will still SIGBUS under the Studio compiler.
Or, as you've already noted, you can use memcpy() to copy bytes around no matter what they are - as long as you know the source object is safe to copy into the target object - yes, there are cases when that's not true.
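For concreteness, here is a minimal sketch of that memcpy approach applied to the code from the question (condensed by me). Copying the bytes into a real int object makes no alignment or aliasing assumptions:
#include <cstring>
#include <iostream>
int main() {
    char* mem_original = new char[250];
    int int_var = 987411;
    std::memcpy(&mem_original[250 - sizeof(int)], &int_var, sizeof(int));

    const unsigned char* location =
        reinterpret_cast<const unsigned char*>(mem_original) + 250 - sizeof(int);

    int original_var;
    std::memcpy(&original_var, location, sizeof(int));  // safe: byte-wise copy, no misaligned read
    std::cout << "original_var = " << original_var << std::endl;
    delete[] mem_original;
}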
I get the following warning when I compile your code:
main.cpp:19:26: warning: cast from 'const unsigned char *' to 'const int *' increases required alignment from 1 to 4 [-Wcast-align]
int original_var = *((const int*)location);
^~~~~~~~~~~~~~~~~~~~
This seems to be the cause of the bus error, because improperly aligned access can cause a bus error.
Although I don’t have access to a SPARC right now to test this, I’m pretty sure from my experiences on that platform that this line is your problem:
const unsigned char *location = mem_u_const + 250 - sizeof(int);
The mem_u_const block was originally allocated by new for an array of characters. Since sizeof(unsigned char) is 1 and sizeof(int) is 4, you are adding 246 bytes. This is not a multiple of 4.
On SPARC, the CPU can only read 4-byte words if they are aligned to 4-byte boundaries. Your attempt to read a misaligned word is what causes the bus error.
I recommend allocating a struct with an array of unsigned char followed by an int, rather than a bunch of pointer math and casts like the one that caused this bug.
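A minimal sketch of that struct idea (sizes are my own choice to mirror the question's 250-byte block; note the compiler may pad the struct so the int lands on an aligned offset):
#include <iostream>
struct Block {
    unsigned char payload[246];  // the char data
    int value;                   // the compiler places this at a 4-byte-aligned offset
};
int main() {
    Block* b = new Block;
    b->value = 987411;
    std::cout << "value = " << b->value << std::endl;  // aligned read, no bus error
    delete b;
}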

How to speed up floating-point to integer number conversion? [duplicate]

This question already has answers here:
What is the fastest way to convert float to int on x86
(10 answers)
Closed 8 years ago.
We're doing a great deal of floating-point to integer number conversions in our project. Basically, something like this
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any workaround (maybe a hand-tuned function) that can speed up the process a little bit? We don't care much about precision.
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
inline int float2int( double d )
{
    union Cast
    {
        double d;
        long l;  // note: assumes a 32-bit long; see the warning about 64-bit Linux further down
    };
    volatile Cast c;
    c.d = d + 6755399441055744.0;
    return c.l;
}

// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}

for (int i = 0; i < HUGE_NUMBER; i++)
    int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function will round the float to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
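Here is a self-contained way to check the trick (my own harness, not from the answer; it reads the low 32 bits with memcpy to sidestep both the optimizer-safety issue and the 64-bit-long pitfall mentioned further down, and it assumes a little-endian IEEE 754 double):
#include <cstring>
#include <iostream>
inline int float2int(double d)
{
    d += 6755399441055744.0;                   // 2^52 + 2^51
    int result;
    std::memcpy(&result, &d, sizeof(result));  // low 32 bits of the mantissa (little-endian)
    return result;
}
int main() {
    std::cout << float2int(2.6) << ' ' << float2int(-2.6) << '\n';  // prints "3 -3" (round to nearest)
}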
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in a single instruction. You don't need assembly to do this; MSVC exposes the relevant instructions as compiler intrinsics -- _mm_cvtps_epi32() in this case.
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software-pipeline a little and process sixteen floats at once in each loop, e.g. (writing the "functions" as the actual SSE2 intrinsics from <emmintrin.h>):
for (int i = 0; i < HUGE_NUMBER; i += 16)
{
    // int_array[i] = float_array[i];
    __m128 a = _mm_load_ps(float_array + i + 0);
    __m128 b = _mm_load_ps(float_array + i + 4);
    __m128 c = _mm_load_ps(float_array + i + 8);
    __m128 d = _mm_load_ps(float_array + i + 12);
    __m128i ia = _mm_cvtps_epi32(a);  // four floats -> four int32s
    __m128i ib = _mm_cvtps_epi32(b);
    __m128i ic = _mm_cvtps_epi32(c);
    __m128i id = _mm_cvtps_epi32(d);
    _mm_store_si128((__m128i*)(int_array + i + 0), ia);
    _mm_store_si128((__m128i*)(int_array + i + 4), ib);
    _mm_store_si128((__m128i*)(int_array + i + 8), ic);
    _mm_store_si128((__m128i*)(int_array + i + 12), id);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
Failing this SSE juju, you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating-point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
You might be able to load the floats into the SSE unit of your processor using some magic assembly code, do the conversion to ints there, and then read them back out. I'm not sure this would be any faster though. I'm not an SSE guru, so I don't know how to do this. Maybe someone else can chime in.
In Visual C++ 2008, the compiler generates SSE2 code by itself if you do a release build with maxed-out optimization options and look at a disassembly (though some conditions have to be met; play around with your code).
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
Most C compilers generate calls to _ftol or something similar for every float-to-int conversion. Putting a reduced floating-point-conformance switch (like fp:fast) might help - IF you understand AND accept the other effects of this switch. Other than that, put the thing in a tight assembly or SSE intrinsic loop, IF you are OK with AND understand the different rounding behavior.
For large loops like your example, you should write a function that sets up the floating-point control word once, does the bulk rounding with only fistp instructions, and then resets the control word - IF you are OK with an x86-only code path, but at least you will not change the rounding.
Read up on the fld and fistp FPU instructions and the FPU control word.
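A hedged sketch of that control-word approach (x86-only, GCC/Clang inline assembly of my own; verify against your compiler before relying on it): set the rounding-control bits to truncate once, convert the whole array with fistp, then restore the original mode:
#include <cstddef>
#include <cstdint>
void bulk_truncate(const double* src, int* dst, std::size_t n) {
    std::uint16_t old_cw, new_cw;
    asm volatile("fnstcw %0" : "=m"(old_cw));  // save the FPU control word
    new_cw = old_cw | 0x0C00;                  // RC bits = 11b: round toward zero (truncate)
    asm volatile("fldcw %0" : : "m"(new_cw));  // install truncating mode once
    for (std::size_t i = 0; i < n; ++i)
        asm volatile("fldl %1\n\tfistpl %0"    // load double, store as int32
                     : "=m"(dst[i]) : "m"(src[i]));
    asm volatile("fldcw %0" : : "m"(old_cw));  // restore the original mode
}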
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and works by emulating FP operations to some extent. If you are using a MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).
Also note that modern CPUs use the same FP unit for both single (32 bit) and double (64 bit) FP numbers, so unless you are trying to save memory storing a lot of floats, there's really no reason to favor float over double.
On Intel your best bet is inline SSE2 calls.
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
int i;
for (i = 0; i < HUGE_NUMBER - 3; i += 4) {
    int_array[i] = float_array[i];
    int_array[i+1] = float_array[i+1];
    int_array[i+2] = float_array[i+2];
    int_array[i+3] = float_array[i+3];
}
for (; i < HUGE_NUMBER; i++)
    int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding and it can be much faster.
Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern G++ will).
lrint documentation
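A minimal usage sketch (mine): lrint() rounds according to the current rounding mode, round-to-nearest by default, so the compiler can emit a single conversion instruction instead of a truncating helper call:
#include <cmath>
#include <iostream>
int main() {
    double values[] = {2.4, 2.6, -2.4, -2.6};
    for (double v : values)
        std::cout << v << " -> " << std::lrint(v) << '\n';  // 2, 3, -2, -3
}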
rounding only
Excellent trick, except that using 6755399441055743.5 (0.5 less) to do truncation won't work.
6755399441055744 = 2^52 + 2^51. Adding it overflows the fractional decimals off the end of the mantissa, leaving the integer that you want in bits 51-0 of the FPU register.
In IEEE 754:
6755399441055744.0 =
sign exponent    mantissa
0    10000110011 1000000000000000000000000000000000000000000000000000
6755399441055743.5, however, will also compile to
0100001100111000000000000000000000000000000000000000000000000000
The 0.5 overflows off the end (rounding up), which is why this works in the first place.
To do truncation you would have to add 0.5 to your double and then do this; the guard digits should take care of rounding to the correct result done this way.
Also watch out for 64-bit GCC on Linux, where long rather annoyingly means a 64-bit integer.
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;

double const _xs_doublemagic = double(6755399441055744.0);

inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
    val = val + dmr;
    return ((int*)&val)[0];
}

static size_t const N = 256*1024*1024/sizeof(double);
int    I[N];
double F[N];

static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[]    = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};

int main(int argc, char *argv[])
{
    __int64 freq;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

    cout << "WRITING ONLY..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz    = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = 13;
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    cout << "USING CASTING..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz    = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = (int)F[n];
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    cout << "USING xs_CRoundToInt..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz    = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = xs_CRoundToInt(F[n]);
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    return 0;
}