C++ program crashing due to seemingly unrelated array

I'm developing a terrain generator in OpenGL and have run into a strange bug. I define the side length of the triangles that make up the terrain like this:
#define TRIANGLE_WIDTH 0.25f
#define INVERSE_TRIANGLE_WIDTH 4
With those sizes I create arrays that store the terrain data. I have made some typedefs for the arrays that look like this:
typedef std::array<std::array<glm::vec3, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec3Type;
typedef std::array<std::array<glm::vec2, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec2Type;
I am using OpenGL's index-based rendering. To minimize memory use, I automatically determine whether unsigned shorts are sufficient or whether I need unsigned ints, like this:
#define CHUNK_SIDE_LENGHT (CHUNK_WIDTH * INVERSE_TRIANGLE_WIDTH)
#define CHUNK_SIDE_LENGHT_FLOAT (float)(CHUNK_SIDE_LENGHT)
#define CHUNK_ARRAY_SIZE CHUNK_SIDE_LENGHT * CHUNK_SIDE_LENGHT
#if CHUNK_ARRAY_SIZE >= 65535
#define MAP_INDICES_ARRAY_TYPE unsigned int
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_INT
#else
#define MAP_INDICES_ARRAY_TYPE unsigned short
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_SHORT
#endif
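For illustration, the same selection can be sketched as a compile-time type alias instead of a preprocessor branch. IndexTypeFor is a hypothetical name that does not appear in my project; the threshold is the same as in the #if above.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not from the original code): picks the index type
// for a mesh with VertexCount vertices, using the same ">= 65535" threshold
// as the preprocessor test.
template <std::size_t VertexCount>
using IndexTypeFor = typename std::conditional<
    (VertexCount >= 65535),
    std::uint32_t,   // would pair with GL_UNSIGNED_INT
    std::uint16_t    // would pair with GL_UNSIGNED_SHORT
>::type;
```

A declaration such as IndexTypeFor<CHUNK_ARRAY_SIZE> would then replace MAP_INDICES_ARRAY_TYPE; the matching GL enum would still have to be chosen separately.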
This all works, but when TRIANGLE_WIDTH gets smaller than 0.125 I get a crash when running the program.
I am using Xcode and whenever a program crashes, Xcode shows me where the error is in the assembly code:
SDL-OpenGL-Tests-3`main:
0x100030760 <+0>: pushq %rbp
0x100030761 <+1>: movq %rsp, %rbp
0x100030764 <+4>: subq $0x807f60, %rsp ; imm = 0x807F60
0x10003076b <+11>: movq 0x5b936(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x100030772 <+18>: movq (%rax), %rax
0x100030775 <+21>: movq %rax, -0x8(%rbp)
0x100030779 <+25>: movl $0x0, -0x291c(%rbp)
0x100030783 <+35>: movl %edi, -0x2920(%rbp)
0x100030789 <+41>: movq %rsi, -0x2928(%rbp)
0x100030790 <+48>: leaq 0x4357b(%rip), %rsi ; "/dev/urandom"
0x100030797 <+55>: leaq -0x2948(%rbp), %rax
0x10003079e <+62>: movq %rax, %rdi
-> 0x1000307a1 <+65>: movq %rax, -0x807178(%rbp)
0x1000307a8 <+72>: callq 0x100030090 ; std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string<std::nullptr_t> at string:817
0x1000307ad <+77>: leaq -0x2930(%rbp), %rdi
The error is marked by the arrow. From my limited knowledge of assembly, I think this part corresponds to the start of my int main(), which looks like this:
int main(int argc, const char * argv[]) {
std::random_device randomDevice;
std::mt19937 randomEngine(randomDevice());
bool capFps = true;
int fpsCap = 60;
double frameTimeCap = double(1e6) / double(fpsCap);
std::chrono::steady_clock::time_point start, end;
std::cout << "Main Thread ID: " << std::this_thread::get_id() << std::endl;
I really don't get how my terrain generation is affecting the random functions of C++ since I don't use the random engine in my terrain generation, I use it to randomly place some light sources.
Update:
elsamuko suggested replacing the C++ random functions with fixed pseudo-random numbers, so I removed the random functions and now just set all the random values to zero. It still crashes, now at this code:
SDL-OpenGL-Tests-3`main:
0x10002fc30 <+0>: pushq %rbp
0x10002fc31 <+1>: movq %rsp, %rbp
0x10002fc34 <+4>: subq $0x807510, %rsp ; imm = 0x807510
0x10002fc3b <+11>: movsd 0x5aacd(%rip), %xmm0 ; typeinfo name for Sphere + 12, xmm0 = mem[0],zero
0x10002fc43 <+19>: movq 0x5c45e(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x10002fc4a <+26>: movq (%rax), %rax
0x10002fc4d <+29>: movq %rax, -0x8(%rbp)
0x10002fc51 <+33>: movl $0x0, -0x291c(%rbp)
0x10002fc5b <+43>: movl %edi, -0x2920(%rbp)
0x10002fc61 <+49>: movq %rsi, -0x2928(%rbp)
0x10002fc68 <+56>: movb $0x1, -0x2929(%rbp)
0x10002fc6f <+63>: movl $0x3c, -0x2930(%rbp)
0x10002fc79 <+73>: cvtsi2sdl -0x2930(%rbp), %xmm1
0x10002fc81 <+81>: divsd %xmm1, %xmm0
0x10002fc85 <+85>: movsd %xmm0, -0x2938(%rbp)
0x10002fc8d <+93>: leaq -0x2940(%rbp), %rdi
-> 0x10002fc94 <+100>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fc99 <+105>: leaq -0x2948(%rbp), %rdi
0x10002fca0 <+112>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fca5 <+117>: movq 0x5c384(%rip), %rdi ; (void *)0x00007fff92814760: std::__1::cout
0x10002fcac <+124>: leaq 0x44007(%rip), %rsi ; "Main Thread ID: "
0x10002fcb3 <+131>: callq 0x10007132c ; symbol stub for: std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::operator<<<std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*)
It seems as if the code just picks the first instruction it reaches to crash on.

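The crash is due to a stack overflow that occurs when the array gets too large. The prologue of main reserves 0x807f60 bytes (about 8.4 MB) of stack for locals, which is more than the 8 MB the main thread gets by default on macOS, so the first write into that region faults. The simplest solution is to hold the array in a std::unique_ptr, because the array will then be allocated on the heap.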
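A minimal sketch of that fix, assuming C++14 for std::make_unique. Vec3 and kSide are hypothetical stand-ins for glm::vec3 and CHUNK_SIDE_LENGHT so that the snippet compiles on its own, and makeMapData is an illustrative name.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Stand-ins so the sketch is self-contained (assumptions, not the
// question's real definitions).
struct Vec3 { float x, y, z; };
constexpr std::size_t kSide = 1024;
using MapDataVec3Type = std::array<std::array<Vec3, kSide>, kSide>;

// Allocates the roughly 12 MB grid on the heap. Only the pointer itself
// lives on the stack, so main's frame stays small.
std::unique_ptr<MapDataVec3Type> makeMapData() {
    return std::make_unique<MapDataVec3Type>();  // elements are zero-initialized
}
```

Elements are then accessed as (*mapData)[row][col], and the memory is released automatically when the unique_ptr goes out of scope.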

Related

Does a For loop with Vector iterator copy values, making it inefficient? [duplicate]

This question already has answers here:
What is the correct way of using C++11's range-based for?
I'm using a for loop to go through all elements in a vector, and I've seen the slick code:
std::vector<int> vi;
// ... assume the vector gets populated
for(int i : vi)
{
// do stuff with i
}
However, from a quick test, it looks like it's copying the value from the vector into i each time (I tried modifying i in the for loop and the vector remains unchanged).
The reason I'm asking, is that I'm actually doing this with a vector of a large struct.
std::vector<MyStruct> myStructList;
for(MyStruct oneStruct : myStructList)
{
cout << oneStruct;
}
So... is this a poor way of doing things, given the amount of memory copying? Is it more efficient to use traditional indexing?
for(int i=0; i<myStructList.size(); i++)
{
cout << myStructList[i];
}
Thanks,
I tested this on Compiler Explorer and found that the copy is actually performed even with -O3 optimization in gcc 10.3.
Here is my test code:
#include <iostream>
#include <vector>
using std::cout;
struct MyStruct {
int a[32];
};
std::ostream& operator<<(std::ostream& s, const MyStruct& m) {
for (int i = 0; i < 32; i++) s << m.a[i] << ' ';
return s;
}
std::vector<MyStruct> myStructList;
void test(void) {
for(MyStruct oneStruct : myStructList)
{
cout << oneStruct;
}
}
Here is a part of the result:
test():
pushq %r13
pushq %r12
pushq %rbp
pushq %rbx
subq $152, %rsp
movq myStructList(%rip), %r12
movq myStructList+8(%rip), %r13
cmpq %r13, %r12
je .L8
leaq 144(%rsp), %rbp
.L11:
movdqu (%r12), %xmm0
movdqu 16(%r12), %xmm1
leaq 16(%rsp), %rbx
movdqu 32(%r12), %xmm2
movdqu 48(%r12), %xmm3
movdqu 64(%r12), %xmm4
movdqu 80(%r12), %xmm5
movups %xmm0, 16(%rsp)
movdqu 96(%r12), %xmm6
movdqu 112(%r12), %xmm7
movups %xmm1, 32(%rsp)
movups %xmm2, 48(%rsp)
movups %xmm3, 64(%rsp)
movups %xmm4, 80(%rsp)
movups %xmm5, 96(%rsp)
movups %xmm6, 112(%rsp)
movups %xmm7, 128(%rsp)
.L10:
movl (%rbx), %esi
movl $_ZSt4cout, %edi
addq $4, %rbx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movl $1, %edx
leaq 15(%rsp), %rsi
movb $32, 15(%rsp)
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
cmpq %rbp, %rbx
jne .L10
subq $-128, %r12
cmpq %r12, %r13
jne .L11
.L8:
addq $152, %rsp
popq %rbx
popq %rbp
popq %r12
popq %r13
ret
The lines between .L10: and jne .L10 correspond to the inlined operator<< function, and you can see that a large copy is performed before that.
You should add & between MyStruct and oneStruct in the for loop to make it a reference and avoid unwanted data copies. Here is a part of a result with & added:
test():
pushq %r12
pushq %rbp
pushq %rbx
subq $16, %rsp
movq myStructList(%rip), %rbp
movq myStructList+8(%rip), %r12
cmpq %r12, %rbp
je .L8
subq $-128, %rbp
.L11:
leaq -128(%rbp), %rbx
.L10:
movl (%rbx), %esi
movl $_ZSt4cout, %edi
addq $4, %rbx
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movl $1, %edx
leaq 15(%rsp), %rsi
movb $32, 15(%rsp)
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
cmpq %rbp, %rbx
jne .L10
leaq 128(%rbp), %rax
cmpq %rbp, %r12
je .L8
movq %rax, %rbp
jmp .L11
.L8:
addq $16, %rsp
popq %rbx
popq %rbp
popq %r12
ret
Now you can see the large copying is eliminated and the pointer to the structure is directly used for execution of operator<<.
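The same effect can be observed without reading assembly. The sketch below uses a hypothetical instrumented type, Counted, and a global counter g_copies (neither is from the question) to count copy constructions in a by-value loop and in a const-reference loop.

```cpp
#include <cstddef>
#include <vector>

int g_copies = 0;  // number of copy constructions observed so far

// Hypothetical 128-byte element type whose copy constructor counts calls.
struct Counted {
    int payload[32] = {};
    Counted() = default;
    Counted(const Counted& other) {
        for (std::size_t i = 0; i < 32; ++i) payload[i] = other.payload[i];
        ++g_copies;
    }
};

// Iterates by value: every element is copied into `item`.
int sumByValue(const std::vector<Counted>& v) {
    int sum = 0;
    for (Counted item : v) sum += item.payload[0];
    return sum;
}

// Iterates by const reference: `item` refers to the element, no copy is made.
int sumByConstRef(const std::vector<Counted>& v) {
    int sum = 0;
    for (const Counted& item : v) sum += item.payload[0];
    return sum;
}
```

When the loop body only reads the element, const MyStruct& (or const auto&) is the usual choice; a plain & is needed only when the loop modifies the elements.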

Cause of EXC_BREAKPOINT crash

I've got a crash occurring on some users' computers in a C++ Audio Unit component running inside Logic X. Unfortunately I can't reproduce it locally, and while trying to work out how it might occur I've come up with some questions.
Here's the relevant info from the crash dump:
Exception Type: EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Trace/BPT trap: 5
Termination Reason: Namespace SIGNAL, Code 0x5
Terminating Process: exc handler [0]
The questions are:
What might cause an EXC_BREAKPOINT in the situation I'm looking at? Is this information from Apple complete and accurate: "Similar to an Abnormal Exit, this exception is intended to give an attached debugger the chance to interrupt the process at a specific point in its execution. You can trigger this exception from your own code using the __builtin_trap() function. If no debugger is attached, the process is terminated and a crash report is generated."
Why would it occur at SharedObject + 200 (see disassembly)?
Is RBX the 'this' pointer at the moment the crash occurs?
The crash occurs here:
juce::ValueTree::SharedObject::SharedObject(juce::ValueTree::SharedObject const&) + 200
The C++ is as follows:
SharedObject (const SharedObject& other)
: ReferenceCountedObject(),
type (other.type), properties (other.properties), parent (nullptr)
{
for (int i = 0; i < other.children.size(); ++i)
{
SharedObject* const child = new SharedObject (*other.children.getObjectPointerUnchecked(i));
child->parent = this;
children.add (child);
}
}
The disassembly:
-> 0x127167950 <+0>: pushq %rbp
0x127167951 <+1>: movq %rsp, %rbp
0x127167954 <+4>: pushq %r15
0x127167956 <+6>: pushq %r14
0x127167958 <+8>: pushq %r13
0x12716795a <+10>: pushq %r12
0x12716795c <+12>: pushq %rbx
0x12716795d <+13>: subq $0x18, %rsp
0x127167961 <+17>: movq %rsi, %r12
0x127167964 <+20>: movq %rdi, %rbx
0x127167967 <+23>: leaq 0x589692(%rip), %rax ; vtable for juce::ReferenceCountedObject + 16
0x12716796e <+30>: movq %rax, (%rbx)
0x127167971 <+33>: movl $0x0, 0x8(%rbx)
0x127167978 <+40>: leaq 0x599fe9(%rip), %rax ; vtable for juce::ValueTree::SharedObject + 16
0x12716797f <+47>: movq %rax, (%rbx)
0x127167982 <+50>: leaq 0x10(%rbx), %rdi
0x127167986 <+54>: movq %rdi, -0x30(%rbp)
0x12716798a <+58>: leaq 0x10(%r12), %rsi
0x12716798f <+63>: callq 0x12711cf70 ; juce::Identifier::Identifier(juce::Identifier const&)
0x127167994 <+68>: leaq 0x18(%rbx), %rdi
0x127167998 <+72>: movq %rdi, -0x38(%rbp)
0x12716799c <+76>: leaq 0x18(%r12), %rsi
0x1271679a1 <+81>: callq 0x12711c7b0 ; juce::NamedValueSet::NamedValueSet(juce::NamedValueSet const&)
0x1271679a6 <+86>: movq $0x0, 0x30(%rbx)
0x1271679ae <+94>: movl $0x0, 0x38(%rbx)
0x1271679b5 <+101>: movl $0x0, 0x40(%rbx)
0x1271679bc <+108>: movq $0x0, 0x48(%rbx)
0x1271679c4 <+116>: movl $0x0, 0x50(%rbx)
0x1271679cb <+123>: movl $0x0, 0x58(%rbx)
0x1271679d2 <+130>: movq $0x0, 0x60(%rbx)
0x1271679da <+138>: cmpl $0x0, 0x40(%r12)
0x1271679e0 <+144>: jle 0x127167aa2 ; <+338>
0x1271679e6 <+150>: xorl %r14d, %r14d
0x1271679e9 <+153>: nopl (%rax)
0x1271679f0 <+160>: movl $0x68, %edi
0x1271679f5 <+165>: callq 0x12728c232 ; symbol stub for: operator new(unsigned long)
0x1271679fa <+170>: movq %rax, %r13
0x1271679fd <+173>: movq 0x30(%r12), %rax
0x127167a02 <+178>: movq (%rax,%r14,8), %rsi
0x127167a06 <+182>: movq %r13, %rdi
0x127167a09 <+185>: callq 0x127167950 ; <+0>
0x127167a0e <+190>: movq %rbx, 0x60(%r13) // MY NOTES: child->parent = this
0x127167a12 <+194>: movl 0x38(%rbx), %ecx
0x127167a15 <+197>: movl 0x40(%rbx), %eax
0x127167a18 <+200>: cmpl %eax, %ecx
Update 1:
It looks like RIP is suggesting we are in the middle of the 'add' call which is this function, inlined:
/** Appends a new object to the end of the array.
This will increase the new object's reference count.
@param newObject the new object to add to the array
@see set, insert, addIfNotAlreadyThere, addSorted, addArray
*/
ObjectClass* add (ObjectClass* const newObject) noexcept
{
data.ensureAllocatedSize (numUsed + 1);
jassert (data.elements != nullptr);
data.elements [numUsed++] = newObject;
if (newObject != nullptr)
newObject->incReferenceCount();
return newObject;
}
Update 2:
Values of the relevant registers at the point of the crash:
this == rbx: 0x00007fe5bc37c950
&other == r12: 0x00007fe5bc348cc0
rax = 0
rcx = 0
There may be a few problems in this code:
As SM mentioned, other.children.getObjectPointerUnchecked(i) could return nullptr.
In ObjectClass* add (ObjectClass* const newObject) noexcept, newObject is null-checked before incReferenceCount is called (which implies that a null can reach this method), but not before it is stored by data.elements [numUsed++] = newObject;. The array may therefore contain a nullptr, and any code that later uses its elements without checking may crash.
We don't really know the type of data.elements (I assume an array of pointers). If it is not an array of pointers, or if some operator is overloaded, the assignment could trigger a copy operation and the crash could happen there (but that is unlikely).
There is a circular reference between children and parent, so there may be a problem during object destruction.
I suspect that the problem is that a shared, ref-counted object that should have been placed on the heap was inadvertently allocated on the stack. If the stack unwinds and is overwritten afterwards, and a reference to the ref-counted object still exists, then random things will happen at unpredictable times when that reference is accessed. (The proximity in address-space of this and &other also makes me suspect that both are on the stack.)
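If accidental stack allocation is indeed the cause, one way to rule it out at compile time is to make the constructor and destructor inaccessible and hand out objects only through a factory. The sketch below is a generic illustration with hypothetical names (RefCounted, Node, create); it is not JUCE's ReferenceCountedObject and makes no claim about how JUCE is implemented.

```cpp
// Hypothetical minimal intrusive ref-counted base class.
class RefCounted {
public:
    void incRef() noexcept { ++count_; }

    // Drops one reference; returns true if this call destroyed the object.
    bool decRef() noexcept {
        if (--count_ == 0) {
            delete this;
            return true;
        }
        return false;
    }

    int refCount() const noexcept { return count_; }

protected:
    RefCounted() = default;
    virtual ~RefCounted() = default;  // not public: no stack or static instances

private:
    int count_ = 0;
};

class Node : public RefCounted {
public:
    // The only way to obtain a Node, so every instance lives on the heap.
    static Node* create() { return new Node(); }

private:
    Node() = default;
    ~Node() override = default;
};
```

With this layout a local declaration such as Node n; fails to compile, so a reference can never outlive a stack frame that owned the object.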

Vectorization of sin and cos

I was playing around with Compiler Explorer and ran into an anomaly (I think). If I want to make the compiler vectorize a sin calculation using libmvec, I would write:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T s(const T x)
{
return sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = s(x[i]);
}
}
Compiling with gcc 6.2 and -O3 -march=native -ffast-math gives:
func(float*, float*, int):
testl %edx, %edx
jle .L10
leaq 8(%rsp), %r10
andq $-32, %rsp
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r14
xorl %r14d, %r14d
pushq %r13
leal -8(%rdx), %r13d
pushq %r12
shrl $3, %r13d
movq %rsi, %r12
pushq %r10
addl $1, %r13d
pushq %rbx
movq %rdi, %rbx
subq $8, %rsp
.L4:
vmovaps (%rbx), %ymm0
addl $1, %r14d
addq $32, %r12
addq $32, %rbx
call _ZGVcN8v_sinf // YAY! Vectorized trig!
vmovaps %ymm0, -32(%r12)
cmpl %r13d, %r14d
jb .L4
vzeroupper
addq $8, %rsp
popq %rbx
popq %r10
popq %r12
popq %r13
popq %r14
popq %rbp
leaq -8(%r10), %rsp
.L10:
ret
But when I add a cosine to the function, there is no vectorization:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T f(const T x)
{
return cosf(x)+sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = f(x[i]);
}
}
which gives:
func(float*, float*, int):
testl %edx, %edx
jle .L10
pushq %r12
leal -1(%rdx), %eax
pushq %rbp
leaq 4(%rdi,%rax,4), %r12
movq %rsi, %rbp
pushq %rbx
movq %rdi, %rbx
subq $16, %rsp
.L4:
vmovss (%rbx), %xmm0
leaq 8(%rsp), %rsi
addq $4, %rbx
addq $4, %rbp
leaq 12(%rsp), %rdi
call sincosf // No vectorization
vmovss 12(%rsp), %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vmovss %xmm0, -4(%rbp)
cmpq %rbx, %r12
jne .L4
addq $16, %rsp
popq %rbx
popq %rbp
popq %r12
.L10:
ret
I see two good alternatives: either call a vectorized version of sincosf, or call the vectorized sin and cos sequentially. I tried adding -fno-builtin-sincos, to no avail. -fopt-info-vec-missed complains about complex float, of which there are none in the code.
Is this a known issue with gcc? Either way, is there a way I can convince gcc to vectorize the latter example?
(As an aside, is there any way to get gcc < 6 to vectorize trigonometric functions automatically?)

Segmentation fault with array of __m256i when using clang/g++

I'm attempting to generate arrays of __m256i's to reuse in another computation. When I attempt to do that (even with a minimal testcase), I get a segmentation fault - but only if the code is compiled with g++ or clang. If I compile the code with the Intel compiler (version 16.0), no segmentation fault occurs. Here is a test case I created:
int main() {
__m256i *table = new __m256i[10000];
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
table[99] = zeroes;
}
When compiling the above with clang 3.6 and g++ 4.8, a segmentation fault occurs.
Here's the assembly generated by the Intel compiler (from https://gcc.godbolt.org/, icc 13.0):
pushq %rbx #3.12
movq %rsp, %rbx #3.12
andq $-32, %rsp #3.12
pushq %rbp #3.12
pushq %rbp #3.12
movq 8(%rbx), %rbp #3.12
movq %rbp, 8(%rsp) #3.12
movq %rsp, %rbp #3.12
subq $112, %rsp #3.12
movl $3200, %eax #4.38
vzeroupper #4.38
movq %rax, %rdi #4.38
call operator new[](unsigned long) #4.38
movq %rax, -112(%rbp) #4.38
movq -112(%rbp), %rax #4.38
movq %rax, -104(%rbp) #4.20
vxorps %ymm0, %ymm0, %ymm0 #5.22
vmovdqu %ymm0, -80(%rbp) #5.22
vmovdqu -80(%rbp), %ymm0 #5.22
vmovdqu %ymm0, -48(%rbp) #5.20
movl $3168, %eax #6.17
addq -104(%rbp), %rax #6.5
vmovdqu -48(%rbp), %ymm0 #6.17
vmovdqu %ymm0, (%rax) #6.5
movl $0, %eax #7.1
vzeroupper #7.1
leave #7.1
movq %rbx, %rsp #7.1
popq %rbx #7.1
ret #7.1
And here's from clang 3.7:
pushq %rbp
movq %rsp, %rbp
andq $-32, %rsp
subq $192, %rsp
xorl %eax, %eax
movl $3200, %ecx # imm = 0xC80
movl %ecx, %edi
movl %eax, 28(%rsp) # 4-byte Spill
callq operator new[](unsigned long)
movq %rax, 88(%rsp)
movq $0, 168(%rsp)
movq $0, 160(%rsp)
movq $0, 152(%rsp)
movq $0, 144(%rsp)
vmovq 168(%rsp), %xmm0 # xmm0 = mem[0],zero
vmovq 160(%rsp), %xmm1 # xmm1 = mem[0],zero
vpunpcklqdq %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0]
vmovq 152(%rsp), %xmm1 # xmm1 = mem[0],zero
vpslldq $8, %xmm1, %xmm1 # xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
vmovaps %xmm1, %xmm2
vinserti128 $1, %xmm0, %ymm2, %ymm2
vmovaps %ymm2, 96(%rsp)
vmovaps %ymm2, 32(%rsp)
movq 88(%rsp), %rax
vmovaps %ymm2, 3168(%rax)
movl 28(%rsp), %eax # 4-byte Reload
movq %rbp, %rsp
popq %rbp
vzeroupper
retq
Am I running into a compiler bug in clang/g++? Or am I simply doing something wrong?
I have said many times before that implicit SIMD loads/stores are a bad idea. Stop using them. Use explicit loads/stores like this:
int64_t* table = new int64_t[4*10000];
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
_mm256_storeu_si256((__m256i*)&table[4*99], zeroes);
Or, since this is POD, use the cross-compiler/OS function _mm_malloc:
int64_t* table = (int64_t*)_mm_malloc(sizeof(int64_t)*4*10000, 32);
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
_mm256_store_si256((__m256i*)&table[4*99], zeroes);
You can use _mm256_setzero_si256() instead of _mm256_set_epi64x(0, 0, 0, 0) (note that _mm256_set_epi64x does not work in 32-bit mode on some versions of MSVC), but GCC and Clang are smart enough to know they are the same thing.
Since you're using intrinsics, which are not part of the C/C++ specification, some rules such as strict aliasing may be overlooked.
I guess the problem has to do with wrong memory alignment. vmovaps requires the memory location to start at a 32-byte boundary and vmovdqu does not. That's why the Intel version works whereas the clang/g++ code crashes. I don't know if this is a compiler bug, but you may want alignment anyway.
The following code should work, although it's more C than C++.
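#include <immintrin.h> // __m256i, _mm256_set_epi64x
#include <malloc.h>    // memalign (glibc; not available on every platform)
#include <cstdlib>     // free
int main() {
__m256i *table = (__m256i*) memalign( 32, 10000 * sizeof(__m256i) ); // 32-byte aligned block
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
table[99] = zeroes;
free(table);
}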
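To check the alignment explanation on your own machine without any AVX code, you can test the addresses directly. Before C++17's aligned new, new[] only guarantees the fundamental alignment, which is typically 16 bytes on x86-64. The sketch below is portable C++11; isAligned and AlignedBuffer are hypothetical helper names, and the buffer obtains a 32-byte-aligned region by over-allocating and calling std::align.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// True when p lies on an `alignment`-byte boundary (alignment: power of two).
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: owns a byte buffer with extra slack and exposes a
// region of `bytes` bytes that starts on an `alignment`-byte boundary.
struct AlignedBuffer {
    std::vector<unsigned char> storage;
    void* aligned = nullptr;

    explicit AlignedBuffer(std::size_t bytes, std::size_t alignment = 32)
        : storage(bytes + alignment) {
        void* p = storage.data();
        std::size_t space = storage.size();
        aligned = std::align(alignment, bytes, p, space);  // nullptr on failure
    }

    // Copying would leave `aligned` pointing into the source's storage.
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;
};
```

Calling isAligned(table, 32) on the pointer returned by new __m256i[10000] shows whether a given run happened to receive a suitably aligned block, which would also explain why such a crash can appear intermittent.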

OpenCL clGetProgramInfo causing Core Foundation crash

I am writing a C++ command line program in Xcode (6.4) on OS X Yosemite (10.10.4). I am using Apple's OpenCL framework and am trying to save my OpenCL binaries to disk.
First I create my program from source and build as follows:
cl_int err;
cl_program result = clCreateProgramWithSource(context, numFiles, (const char **)sourceStrings, NULL, &err);
ErrorManager::CheckError(err, "Failed to create a compute program");
err = clBuildProgram(result, deviceCount, devices, NULL, NULL, NULL);
ErrorManager::CheckError(err, "Failed to build program");
The above code works fine and I can launch my kernels without any error.
Then I try to access the binary sizes...
size_t *programBinarySizes = new size_t[deviceCount];
err = clGetProgramInfo(result, CL_PROGRAM_BINARY_SIZES, sizeof(size_t) * deviceCount, programBinarySizes, NULL);
However this call to clGetProgramInfo causes Xcode to throw an EXC_BREAKPOINT that has the following output:
CoreFoundation`__CFTypeCollectionRetain:
0x7fff8e2c2480 <+0>: pushq %rbp
0x7fff8e2c2481 <+1>: movq %rsp, %rbp
0x7fff8e2c2484 <+4>: pushq %r14
0x7fff8e2c2486 <+6>: pushq %rbx
0x7fff8e2c2487 <+7>: movq %rsi, %rbx
0x7fff8e2c248a <+10>: testq %rbx, %rbx
0x7fff8e2c248d <+13>: je 0x7fff8e2c25ae ; <+302>
0x7fff8e2c2493 <+19>: cmpb $0x0, -0x17f41f28(%rip) ; __CFDeallocateZombies
0x7fff8e2c249a <+26>: je 0x7fff8e2c257c ; <+252>
0x7fff8e2c24a0 <+32>: testq %rdi, %rdi
0x7fff8e2c24a3 <+35>: je 0x7fff8e2c24b5 ; <+53>
0x7fff8e2c24a5 <+37>: leaq -0x17f917cc(%rip), %rax ; kCFAllocatorSystemDefault
0x7fff8e2c24ac <+44>: cmpq %rdi, (%rax)
0x7fff8e2c24af <+47>: jne 0x7fff8e2c257c ; <+252>
0x7fff8e2c24b5 <+53>: testb $0x1, %bl
0x7fff8e2c24b8 <+56>: je 0x7fff8e2c24e4 ; <+100>
0x7fff8e2c24ba <+58>: movl %ebx, %eax
0x7fff8e2c24bc <+60>: shrl %eax
0x7fff8e2c24be <+62>: andl $0x7, %eax
0x7fff8e2c24c1 <+65>: cmpl $0x6, %eax
0x7fff8e2c24c4 <+68>: ja 0x7fff8e2c259c ; <+284>
0x7fff8e2c24ca <+74>: leaq 0xef(%rip), %rcx ; <+320>
0x7fff8e2c24d1 <+81>: movslq (%rcx,%rax,4), %rax
0x7fff8e2c24d5 <+85>: addq %rcx, %rax
0x7fff8e2c24d8 <+88>: jmpq *%rax
0x7fff8e2c24da <+90>: callq 0x7fff8e2df7b0 ; CFStringGetTypeID
0x7fff8e2c24df <+95>: jmp 0x7fff8e2c2594 ; <+276>
0x7fff8e2c24e4 <+100>: movl 0x8(%rbx), %eax
0x7fff8e2c24e7 <+103>: shrl $0x8, %eax
0x7fff8e2c24ea <+106>: andl $0x3ff, %eax
0x7fff8e2c24ef <+111>: movq (%rbx), %rcx
0x7fff8e2c24f2 <+114>: testq %rcx, %rcx
0x7fff8e2c24f5 <+117>: je 0x7fff8e2c2522 ; <+162>
0x7fff8e2c24f7 <+119>: cmpq -0x17f41f96(%rip), %rcx ; __CFConstantStringClassReferencePtr
0x7fff8e2c24fe <+126>: je 0x7fff8e2c2522 ; <+162>
0x7fff8e2c2500 <+128>: leaq -0x17f43fa7(%rip), %rdx ; __CFRuntimeObjCClassTable
0x7fff8e2c2507 <+135>: movq (%rdx,%rax,8), %r14
0x7fff8e2c250b <+139>: cmpq %r14, %rcx
0x7fff8e2c250e <+142>: je 0x7fff8e2c2522 ; <+162>
0x7fff8e2c2510 <+144>: testb $0x1, %cl
0x7fff8e2c2513 <+147>: je 0x7fff8e2c2594 ; <+276>
0x7fff8e2c2515 <+149>: movq %rbx, %rdi
0x7fff8e2c2518 <+152>: callq 0x7fff8e481b6a ; symbol stub for: object_getClass
0x7fff8e2c251d <+157>: cmpq %r14, %rax
0x7fff8e2c2520 <+160>: jne 0x7fff8e2c2594 ; <+276>
0x7fff8e2c2522 <+162>: callq 0x7fff8e481aaa ; symbol stub for: objc_collectableZone
0x7fff8e2c2527 <+167>: movq %rax, %rdi
0x7fff8e2c252a <+170>: movq %rbx, %rsi
0x7fff8e2c252d <+173>: callq 0x7fff8e481594 ; symbol stub for: auto_zone_is_valid_pointer
0x7fff8e2c2532 <+178>: testl %eax, %eax
0x7fff8e2c2534 <+180>: je 0x7fff8e2c255b ; <+219>
0x7fff8e2c2536 <+182>: movl 0x8(%rbx), %eax
0x7fff8e2c2539 <+185>: shrl $0x8, %eax
0x7fff8e2c253c <+188>: andl $0x3ff, %eax
0x7fff8e2c2541 <+193>: leaq -0x17f45ff8(%rip), %rcx ; __CFRuntimeClassTable
0x7fff8e2c2548 <+200>: movq (%rcx,%rax,8), %rax
0x7fff8e2c254c <+204>: testb $0x4, (%rax)
0x7fff8e2c254f <+207>: je 0x7fff8e2c2594 ; <+276>
0x7fff8e2c2551 <+209>: movq %rbx, %rdi
0x7fff8e2c2554 <+212>: callq 0x7fff8e2bf500 ; CFRetain
0x7fff8e2c2559 <+217>: jmp 0x7fff8e2c2594 ; <+276>
0x7fff8e2c255b <+219>: cmpl $0x0, 0xc(%rbx)
0x7fff8e2c255f <+223>: je 0x7fff8e2c2594 ; <+276>
0x7fff8e2c2561 <+225>: leaq -0x17f72248(%rip), %rsi ; @"storing a non-GC object %p in a GC collection, break on CFCollection_non_gc_storage_error to debug."
0x7fff8e2c2568 <+232>: movl $0x4, %edi
0x7fff8e2c256d <+237>: xorl %eax, %eax
0x7fff8e2c256f <+239>: movq %rbx, %rdx
0x7fff8e2c2572 <+242>: callq 0x7fff8e3b61e0 ; CFLog
0x7fff8e2c2577 <+247>: callq 0x7fff8e3c3290 ; CFCollection_non_gc_storage_error
0x7fff8e2c257c <+252>: movq %rbx, %rdi
0x7fff8e2c257f <+255>: popq %rbx
0x7fff8e2c2580 <+256>: popq %r14
0x7fff8e2c2582 <+258>: popq %rbp
0x7fff8e2c2583 <+259>: jmp 0x7fff8e2bf500 ; CFRetain
0x7fff8e2c2588 <+264>: callq 0x7fff8e2df7f0 ; CFNumberGetTypeID
0x7fff8e2c258d <+269>: jmp 0x7fff8e2c2594 ; <+276>
0x7fff8e2c258f <+271>: callq 0x7fff8e2df850 ; CFDateGetTypeID
0x7fff8e2c2594 <+276>: movq %rbx, %rax
0x7fff8e2c2597 <+279>: popq %rbx
0x7fff8e2c2598 <+280>: popq %r14
0x7fff8e2c259a <+282>: popq %rbp
0x7fff8e2c259b <+283>: retq
0x7fff8e2c259c <+284>: int3
0x7fff8e2c259d <+285>: callq 0x7fff8e482020 ; symbol stub for: getpid
0x7fff8e2c25a2 <+290>: movl $0x9, %esi
0x7fff8e2c25a7 <+295>: movl %eax, %edi
0x7fff8e2c25a9 <+297>: callq 0x7fff8e482080 ; symbol stub for: kill
0x7fff8e2c25ae <+302>: leaq 0x36325e(%rip), %rax ; "*** __CFTypeCollectionRetain() called with NULL; likely a collection has been corrupted ***"
0x7fff8e2c25b5 <+309>: movq %rax, -0x17f460c4(%rip) ; gCRAnnotations + 8
0x7fff8e2c25bc <+316>: int3
-> 0x7fff8e2c25bd <+317>: jmp 0x7fff8e2c259d ; <+285>
0x7fff8e2c25bf <+319>: nop
This exception doesn't get thrown if I call clGetProgramInfo with a param name other than CL_PROGRAM_BINARY_SIZES.
I cannot figure out the reason why I'm getting this exception. It's puzzling because I'm writing a C++ program, not an ObjC or Swift one, so I don't see why I would be getting any errors related to Core Foundation or retain counts.
My only guess is maybe there is some build settings option that I have set incorrectly that is causing my program to think this is an objc environment but that seems unlikely.
Any help would be appreciated