Until now I have used inline asm with clobbers, which is not the best choice for performance. I am just starting with assembly; I am developing on my own machine with GCC, but the resulting code has to run on another machine built with ICC, both 64-bit (Sandy Bridge and Haswell).
Calling a function without arguments can be done with a plain CALL, but I do not understand very well how to call a function with arguments, so instead I am trying to use an inline __asm__ block inside the whole function. Is that a good choice?
My function:
void add_N(size_t *cnum, size_t *ap, size_t *bp, long &n, unsigned int &c){
    __asm__(
        //Insert my code here
    );
}
And when I look at the disassembly (with GCC), I get:
add_N(unsigned long*, unsigned long*, unsigned long*, long&, unsigned int&):
0x100001ff0 <+0>: pushq %rbp
0x100001ff1 <+1>: movq %rsp, %rbp
0x100001ff4 <+4>: movq %rdi, -0x8(%rbp)
0x100001ff8 <+8>: movq %rsi, -0x10(%rbp)
0x100001ffc <+12>: movq %rdx, -0x18(%rbp)
0x100002000 <+16>: movq %rcx, -0x20(%rbp)
0x100002004 <+20>: movq %r8, -0x28(%rbp)
0x100002008 <+24>: popq %rbp
0x100002009 <+25>: retq
I understand what is happening. Will different compilers/microarchitectures always assign the same registers to the parameters if the function signature is the same?
Then I put some code inside my function (NOT __asm__ code), and the disassembly pushes a lot of registers. Why does that happen? Why is there no need to push %rax and %rsi (for example), while %r13, %r14 and %r15 do get pushed?
If I need to push those r-registers myself, can I do it from the inline __asm__?
0x100001ea0 <+0>: pushq %rbp
0x100001ea1 <+1>: movq %rsp, %rbp
0x100001ea4 <+4>: pushq %r15
0x100001ea6 <+6>: pushq %r14
0x100001ea8 <+8>: pushq %r13
0x100001eaa <+10>: pushq %r12
0x100001eac <+12>: pushq %rbx
0x100001ead <+13>: movq %rdi, -0x30(%rbp)
0x100001eb1 <+17>: movq %rsi, -0x38(%rbp)
0x100001eb5 <+21>: movq %rdx, -0x40(%rbp)
0x100001eb9 <+25>: movq %rcx, -0x48(%rbp)
0x100001ebd <+29>: movq %r8, -0x50(%rbp)
For the last question - yes, compilers will use the same registers for the parameters as long as they target the same ABI. The Linux x86-64 ABI is defined here: http://www.x86-64.org/documentation/abi.pdf and all compilers have to conform to it. Specifically, you are interested in page 16, "Parameter Passing".
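For the question's exact signature, the resulting mapping (which matches the disassembly posted above) is:
// SysV x86-64 passes the first six integer/pointer arguments in
// %rdi, %rsi, %rdx, %rcx, %r8, %r9 (references are passed like pointers), so for
// add_N(size_t *cnum, size_t *ap, size_t *bp, long &n, unsigned int &c):
//   cnum -> %rdi,  ap -> %rsi,  bp -> %rdx,  n -> %rcx,  c -> %r8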
Windows has a slightly different ABI, I believe, so you cannot take a program or library compiled on Linux and run it on Windows, for example (there are additional reasons for that, though).
For details about GCC inline assembly, check an existing tutorial, as it is quite a long topic. This is a good start: http://asm.sourceforge.net/articles/rmiyagi-inline-asm.txt
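As a tiny illustration of what that tutorial covers (my own sketch, not taken from the article): with GCC extended asm you describe operands through constraints and list clobbers, and the compiler assigns the registers for you instead of you hard-coding the ABI ones:
#include <cstddef>

// Assumed example: extended asm with constraints. The compiler chooses the
// registers, so the same source stays correct on any compiler using this ABI.
static inline size_t add2(size_t a, size_t b)
{
    size_t result;
    __asm__("addq %2, %0"        // AT&T syntax: result += b
            : "=r"(result)       // output: any general-purpose register
            : "0"(a), "r"(b)     // "0" ties a to the same register as the output
            : "cc");             // the add modifies the flags
    return result;
}
As far as I know, ICC on Linux also accepts this GCC-style extended asm, which fits the develop-with-GCC / build-with-ICC setup described in the question.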
I'm developing a terrain generator in OpenGL and have run into a strange bug. I can define how wide the triangles that make up the terrain are, like this:
#define TRIANGLE_WIDTH 0.25f
#define INVERSE_TRIANGLE_WIDTH 4
With those sizes I create the arrays that store the terrain data. I have made some typedefs for the arrays that look like this:
typedef std::array<std::array<glm::vec3, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec3Type;
typedef std::array<std::array<glm::vec2, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec2Type;
I am using OpenGL's index-based rendering. To minimize memory use, I automatically determine whether unsigned shorts are sufficient or whether I need unsigned ints, like this:
#define CHUNK_SIDE_LENGHT (CHUNK_WIDTH * INVERSE_TRIANGLE_WIDTH)
#define CHUNK_SIDE_LENGHT_FLOAT (float)(CHUNK_SIDE_LENGHT)
#define CHUNK_ARRAY_SIZE CHUNK_SIDE_LENGHT * CHUNK_SIDE_LENGHT
#if CHUNK_ARRAY_SIZE >= 65535
#define MAP_INDICES_ARRAY_TYPE unsigned int
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_INT
#else
#define MAP_INDICES_ARRAY_TYPE unsigned short
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_SHORT
#endif
This all works, but when TRIANGLE_WIDTH gets smaller than 0.125 the program crashes at runtime.
I am using Xcode, and whenever a program crashes, Xcode shows me where the error is in the assembly code:
SDL-OpenGL-Tests-3`main:
0x100030760 <+0>: pushq %rbp
0x100030761 <+1>: movq %rsp, %rbp
0x100030764 <+4>: subq $0x807f60, %rsp ; imm = 0x807F60
0x10003076b <+11>: movq 0x5b936(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x100030772 <+18>: movq (%rax), %rax
0x100030775 <+21>: movq %rax, -0x8(%rbp)
0x100030779 <+25>: movl $0x0, -0x291c(%rbp)
0x100030783 <+35>: movl %edi, -0x2920(%rbp)
0x100030789 <+41>: movq %rsi, -0x2928(%rbp)
0x100030790 <+48>: leaq 0x4357b(%rip), %rsi ; "/dev/urandom"
0x100030797 <+55>: leaq -0x2948(%rbp), %rax
0x10003079e <+62>: movq %rax, %rdi
-> 0x1000307a1 <+65>: movq %rax, -0x807178(%rbp)
0x1000307a8 <+72>: callq 0x100030090 ; std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string<std::nullptr_t> at string:817
0x1000307ad <+77>: leaq -0x2930(%rbp), %rdi
The error is marked by the arrow. From my limited knowledge of assembly, I think this part corresponds to the start of my int main(), which looks like this:
int main(int argc, const char * argv[]) {
std::random_device randomDevice;
std::mt19937 randomEngine(randomDevice());
bool capFps = true;
int fpsCap = 60;
double frameTimeCap = double(1e6) / double(fpsCap);
std::chrono::steady_clock::time_point start, end;
std::cout << "Main Thread ID: " << std::this_thread::get_id() << std::endl;
I really don't get how my terrain generation affects the C++ random functions, since I don't use the random engine in the terrain generation at all; I only use it to randomly place some light sources.
Update:
Because elsamuko pointed out that I should swap the C++ random functions for a fixed pseudo-random value, I removed the random functions and now just set all the random values to zero. It still crashes, now with this code:
SDL-OpenGL-Tests-3`main:
0x10002fc30 <+0>: pushq %rbp
0x10002fc31 <+1>: movq %rsp, %rbp
0x10002fc34 <+4>: subq $0x807510, %rsp ; imm = 0x807510
0x10002fc3b <+11>: movsd 0x5aacd(%rip), %xmm0 ; typeinfo name for Sphere + 12, xmm0 = mem[0],zero
0x10002fc43 <+19>: movq 0x5c45e(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x10002fc4a <+26>: movq (%rax), %rax
0x10002fc4d <+29>: movq %rax, -0x8(%rbp)
0x10002fc51 <+33>: movl $0x0, -0x291c(%rbp)
0x10002fc5b <+43>: movl %edi, -0x2920(%rbp)
0x10002fc61 <+49>: movq %rsi, -0x2928(%rbp)
0x10002fc68 <+56>: movb $0x1, -0x2929(%rbp)
0x10002fc6f <+63>: movl $0x3c, -0x2930(%rbp)
0x10002fc79 <+73>: cvtsi2sdl -0x2930(%rbp), %xmm1
0x10002fc81 <+81>: divsd %xmm1, %xmm0
0x10002fc85 <+85>: movsd %xmm0, -0x2938(%rbp)
0x10002fc8d <+93>: leaq -0x2940(%rbp), %rdi
-> 0x10002fc94 <+100>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fc99 <+105>: leaq -0x2948(%rbp), %rdi
0x10002fca0 <+112>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fca5 <+117>: movq 0x5c384(%rip), %rdi ; (void *)0x00007fff92814760: std::__1::cout
0x10002fcac <+124>: leaq 0x44007(%rip), %rsi ; "Main Thread ID: "
0x10002fcb3 <+131>: callq 0x10007132c ; symbol stub for: std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::operator<<<std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*)
It seems as if the code just picks the first instruction it reaches to crash on.
The crash is due to a stack overflow that occurs when the arrays get too large. Note the subq $0x807f60, %rsp in the prologue: main reserves a little over 8 MiB for its locals, which is more than the usual 8 MiB default stack. The simplest solution is to hold the arrays in a std::unique_ptr, so that they are allocated on the heap.
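A minimal sketch of that fix, reusing the MapDataVec3Type typedef from the question (so glm and CHUNK_SIDE_LENGHT are assumed to come from the question's own headers):
#include <memory>

// MapDataVec3Type grows to several megabytes once TRIANGLE_WIDTH shrinks,
// which overflows the default stack; make_unique puts it on the heap instead.
auto mapData = std::make_unique<MapDataVec3Type>();
(*mapData)[0][0] = glm::vec3(0.0f);   // elements are accessed through the pointer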
I've got a crash occurring on some users' computers in a C++ audiounit component running inside Logic X. Unfortunately I can't reproduce it locally, and in the process of trying to work out how it might occur I've come up with some questions.
Here's the relevant info from the crash dump:
Exception Type: EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Trace/BPT trap: 5
Termination Reason: Namespace SIGNAL, Code 0x5
Terminating Process: exc handler [0]
The questions are:
What might cause an EXC_BREAKPOINT in the situation I'm looking at? Is this information from Apple complete and accurate: "Similar to an Abnormal Exit, this exception is intended to give an attached debugger the chance to interrupt the process at a specific point in its execution. You can trigger this exception from your own code using the __builtin_trap() function. If no debugger is attached, the process is terminated and a crash report is generated."
Why would it occur at SharedObject + 200 (see disassembly)?
Is RBX the 'this' pointer at the moment the crash occurs?
The crash occurs here:
juce::ValueTree::SharedObject::SharedObject(juce::ValueTree::SharedObject const&) + 200
The C++ is as follows:
SharedObject (const SharedObject& other)
: ReferenceCountedObject(),
type (other.type), properties (other.properties), parent (nullptr)
{
for (int i = 0; i < other.children.size(); ++i)
{
SharedObject* const child = new SharedObject (*other.children.getObjectPointerUnchecked(i));
child->parent = this;
children.add (child);
}
}
The disassembly:
-> 0x127167950 <+0>: pushq %rbp
0x127167951 <+1>: movq %rsp, %rbp
0x127167954 <+4>: pushq %r15
0x127167956 <+6>: pushq %r14
0x127167958 <+8>: pushq %r13
0x12716795a <+10>: pushq %r12
0x12716795c <+12>: pushq %rbx
0x12716795d <+13>: subq $0x18, %rsp
0x127167961 <+17>: movq %rsi, %r12
0x127167964 <+20>: movq %rdi, %rbx
0x127167967 <+23>: leaq 0x589692(%rip), %rax ; vtable for juce::ReferenceCountedObject + 16
0x12716796e <+30>: movq %rax, (%rbx)
0x127167971 <+33>: movl $0x0, 0x8(%rbx)
0x127167978 <+40>: leaq 0x599fe9(%rip), %rax ; vtable for juce::ValueTree::SharedObject + 16
0x12716797f <+47>: movq %rax, (%rbx)
0x127167982 <+50>: leaq 0x10(%rbx), %rdi
0x127167986 <+54>: movq %rdi, -0x30(%rbp)
0x12716798a <+58>: leaq 0x10(%r12), %rsi
0x12716798f <+63>: callq 0x12711cf70 ; juce::Identifier::Identifier(juce::Identifier const&)
0x127167994 <+68>: leaq 0x18(%rbx), %rdi
0x127167998 <+72>: movq %rdi, -0x38(%rbp)
0x12716799c <+76>: leaq 0x18(%r12), %rsi
0x1271679a1 <+81>: callq 0x12711c7b0 ; juce::NamedValueSet::NamedValueSet(juce::NamedValueSet const&)
0x1271679a6 <+86>: movq $0x0, 0x30(%rbx)
0x1271679ae <+94>: movl $0x0, 0x38(%rbx)
0x1271679b5 <+101>: movl $0x0, 0x40(%rbx)
0x1271679bc <+108>: movq $0x0, 0x48(%rbx)
0x1271679c4 <+116>: movl $0x0, 0x50(%rbx)
0x1271679cb <+123>: movl $0x0, 0x58(%rbx)
0x1271679d2 <+130>: movq $0x0, 0x60(%rbx)
0x1271679da <+138>: cmpl $0x0, 0x40(%r12)
0x1271679e0 <+144>: jle 0x127167aa2 ; <+338>
0x1271679e6 <+150>: xorl %r14d, %r14d
0x1271679e9 <+153>: nopl (%rax)
0x1271679f0 <+160>: movl $0x68, %edi
0x1271679f5 <+165>: callq 0x12728c232 ; symbol stub for: operator new(unsigned long)
0x1271679fa <+170>: movq %rax, %r13
0x1271679fd <+173>: movq 0x30(%r12), %rax
0x127167a02 <+178>: movq (%rax,%r14,8), %rsi
0x127167a06 <+182>: movq %r13, %rdi
0x127167a09 <+185>: callq 0x127167950 ; <+0>
0x127167a0e <+190>: movq %rbx, 0x60(%r13) // MY NOTES: child->parent = this
0x127167a12 <+194>: movl 0x38(%rbx), %ecx
0x127167a15 <+197>: movl 0x40(%rbx), %eax
0x127167a18 <+200>: cmpl %eax, %ecx
Update 1:
It looks like RIP suggests we are in the middle of the 'add' call, which is this function, inlined:
/** Appends a new object to the end of the array.
This will increase the new object's reference count.
@param newObject the new object to add to the array
@see set, insert, addIfNotAlreadyThere, addSorted, addArray
*/
ObjectClass* add (ObjectClass* const newObject) noexcept
{
data.ensureAllocatedSize (numUsed + 1);
jassert (data.elements != nullptr);
data.elements [numUsed++] = newObject;
if (newObject != nullptr)
newObject->incReferenceCount();
return newObject;
}
Update 2:
At the point of the crash, the values of the relevant registers are:
this == rbx: 0x00007fe5bc37c950
&other == r12: 0x00007fe5bc348cc0
rax = 0
rcx = 0
There may be a few problems in this code:
As SM mentioned, other.children.getObjectPointerUnchecked(i) could return nullptr.
In ObjectClass* add (ObjectClass* const newObject) noexcept, you check that newObject isn't null before calling incReferenceCount (which implies a null can reach this method), but you don't null-check before storing it in data.elements [numUsed++] = newObject;, so a nullptr can end up in the array, and if you access that array somewhere else without checking, you may crash (see the sketch after this list).
We don't really know the type of data.elements (I assume an array), but there could be a copy operation occurring because of the pointer assignment (if the pointer is converted to an object, if data.elements is not an array of pointers, or if there is some operator overload); there could be a crash there, but it's unlikely.
There is a circular reference between children and parent, so there may be a problem during object destruction.
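A sketch of the null check the second point suggests, applied to the add() shown in the question (a suggested change, not JUCE's actual implementation):
ObjectClass* add (ObjectClass* const newObject) noexcept
{
    if (newObject == nullptr)              // reject nulls up front so the array
        return nullptr;                    // never stores a null element

    data.ensureAllocatedSize (numUsed + 1);
    jassert (data.elements != nullptr);
    data.elements [numUsed++] = newObject;
    newObject->incReferenceCount();        // safe: newObject is known non-null here
    return newObject;
}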
I suspect that the problem is that a shared, ref-counted object that should have been placed on the heap was inadvertently allocated on the stack. If the stack unwinds and is overwritten afterwards, and a reference to the ref-counted object still exists, then random things will happen at unpredictable times when that reference is accessed. (The proximity in address-space of this and &other also makes me suspect that both are on the stack.)
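A generic illustration of that failure mode (hypothetical types, not JUCE code): a ref-counted object created on the stack, with a reference that outlives the stack frame.
#include <cstdio>

struct RefCounted {
    int refCount = 0;
    int payload  = 42;
    void incReferenceCount() { ++refCount; }
};

RefCounted* leakedRef = nullptr;

void buildTree() {
    RefCounted node;                 // should have been: new RefCounted
    node.incReferenceCount();
    leakedRef = &node;               // reference escapes the stack frame
}                                    // node's stack slot is reclaimed here

int main() {
    buildTree();
    // Other calls have reused that stack memory by now, so this read is
    // undefined behaviour and can return garbage or crash unpredictably.
    std::printf("%d\n", leakedRef->payload);
}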
I was playing around with Compiler Explorer and ran into an anomaly (I think). If I want to make the compiler vectorize a sin calculation using libmvec, I would write:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T s(const T x)
{
return sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = s(x[i]);
}
}
I compile with GCC 6.2 and -O3 -march=native -ffast-math and get:
func(float*, float*, int):
testl %edx, %edx
jle .L10
leaq 8(%rsp), %r10
andq $-32, %rsp
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r14
xorl %r14d, %r14d
pushq %r13
leal -8(%rdx), %r13d
pushq %r12
shrl $3, %r13d
movq %rsi, %r12
pushq %r10
addl $1, %r13d
pushq %rbx
movq %rdi, %rbx
subq $8, %rsp
.L4:
vmovaps (%rbx), %ymm0
addl $1, %r14d
addq $32, %r12
addq $32, %rbx
call _ZGVcN8v_sinf // YAY! Vectorized trig!
vmovaps %ymm0, -32(%r12)
cmpl %r13d, %r14d
jb .L4
vzeroupper
addq $8, %rsp
popq %rbx
popq %r10
popq %r12
popq %r13
popq %r14
popq %rbp
leaq -8(%r10), %rsp
.L10:
ret
But when I add a cosine to the function, there is no vectorization:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T f(const T x)
{
return cosf(x)+sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = f(x[i]);
}
}
which gives:
func(float*, float*, int):
testl %edx, %edx
jle .L10
pushq %r12
leal -1(%rdx), %eax
pushq %rbp
leaq 4(%rdi,%rax,4), %r12
movq %rsi, %rbp
pushq %rbx
movq %rdi, %rbx
subq $16, %rsp
.L4:
vmovss (%rbx), %xmm0
leaq 8(%rsp), %rsi
addq $4, %rbx
addq $4, %rbp
leaq 12(%rsp), %rdi
call sincosf // No vectorization
vmovss 12(%rsp), %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vmovss %xmm0, -4(%rbp)
cmpq %rbx, %r12
jne .L4
addq $16, %rsp
popq %rbx
popq %rbp
popq %r12
.L10:
ret
I see two good alternatives: either call a vectorized version of sincosf, or call the vectorized sin and cos sequentially. I tried adding -fno-builtin-sincos, to no avail. -fopt-info-vec-missed complains about complex float, of which there is none.
Is this a known issue with gcc? Either way, is there a way I can convince gcc to vectorize the latter example?
(As an aside, is there any way to get gcc < 6 to vectorize trigonometric functions automatically?)
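A sketch of the second alternative mentioned above (keeping sinf and cosf in separate loops, so each loop body contains a single math call the vectorizer can map to its libmvec routine). Whether a particular GCC version actually vectorizes both loops is an assumption here, not something verified:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;

void func(AT* __restrict x, AT* __restrict y, int length)
{
    if (length & NN-1) __builtin_unreachable();
    for (int i = 0; i < length; i++)   // first pass: only sinf in the loop body
        y[i] = sinf(x[i]);
    for (int i = 0; i < length; i++)   // second pass: only cosf, added to the result
        y[i] += cosf(x[i]);
}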
My code looks like this:
template<typename F>
void printHello(F f)
{
f("Hello!");
}
int main() {
std::string buf;
printHello([&buf](const char*msg) { buf += msg; });
printHello([&buf]() { });
}
The question is - how can I restrict the F type to accept only lambdas that have a signature void(const char*), so that the second call to printHello doesn't fail at some obscure place inside printHello but instead on the line that calls printHello incorrectly?
==EDIT==
I know that std::function can solve it in this particular case (it is what I'd use if I really wanted to print 'hello'). But std::function is really something else and comes at a cost (however small that cost is; as of today, April 2016, GCC and MSVC cannot optimize away the virtual call). So my question can be seen as purely academic - is there a "template" way to solve it?
Unless you're using an ancient standard library, std::function will have optimisations for small function objects (of which yours is one). You will see no performance reduction whatsoever.
People who tell you not to use std::function because of performance reasons are the very same people who 'optimise' code before measuring performance bottlenecks.
Write the code that expresses intent. IF it becomes a performance bottleneck (it won't) then look at changing it.
I once worked on a financial forwards pricing system. Someone decided that it ran too slowly (64 cores, multiple server boxes, hundreds of thousands of discrete algorithms running in parallel in a massive DAG). So we profiled it.
What did we find?
The processing took almost no time at all. The program spent 99% of its time converting doubles to strings and strings to doubles at the boundaries of the IO, where we had to communicate with a message bus.
Using a lambda in place of a std::function for the callbacks would have made no difference whatsoever.
Write elegant code. Express your intent clearly. Compile with optimisations. Marvel as the compiler does its job and turns your 100 lines of C++ into 5 machine-code instructions.
A simple demonstration:
#include <functional>
// external function forces an actual function call
extern void other_func(const char* p);
// try to force std::function to call polymorphically
void test_it(const std::function<void(const char*)>& f, const char* p)
{
f(p);
}
int main()
{
// make our function object
auto f = std::function<void(const char*)>([](const char* p) { other_func(p); });
const char* const data[] = {
"foo",
"bar",
"baz"
};
// call it in a tight loop
for(auto p : data) {
test_it(f, p);
}
}
Compiled with Apple clang at -O2, the result is:
.globl _main
.align 4, 0x90
_main: ## #main
Lfunc_begin1:
.cfi_startproc
.cfi_personality 155, ___gxx_personality_v0
.cfi_lsda 16, Lexception1
## BB#0: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i
#
# the normal stack frame stuff...
#
pushq %rbp
Ltmp13:
.cfi_def_cfa_offset 16
Ltmp14:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp15:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %rbx
subq $72, %rsp
Ltmp16:
.cfi_offset %rbx, -40
Ltmp17:
.cfi_offset %r14, -32
Ltmp18:
.cfi_offset %r15, -24
movq ___stack_chk_guard#GOTPCREL(%rip), %rbx
movq (%rbx), %rbx
movq %rbx, -32(%rbp)
leaq -80(%rbp), %r15
movq %r15, -48(%rbp)
#
# take the address of std::function's vtable... we'll need it (once)
#
leaq __ZTVNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEE+16(%rip), %rax
#
# here's the tight loop...
#
movq %rax, -80(%rbp)
leaq L_.str(%rip), %rdi
movq %rdi, -88(%rbp)
Ltmp3:
#
# oh look! std::function's call has been TOTALLY INLINED!!
#
callq __Z10other_funcPKc
Ltmp4:
LBB1_2: ## %_ZNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEclEOS6_.exit
## =>This Inner Loop Header: Depth=1
#
# notice that the loop itself uses more instructions than the call??
#
leaq L_.str1(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
je LBB1_1
## BB#3: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.1
## in Loop: Header=BB1_2 Depth=1
#
# destructor called once (constant time, therefore irrelevant)
#
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp5:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp6:
## BB#4: ## in Loop: Header=BB1_2 Depth=1
leaq L_.str2(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
jne LBB1_5
#
# the rest of this function is exception handling. Executed at most
# once, in exceptional circumstances. Therefore, irrelevant.
#
LBB1_1: ## in Loop: Header=BB1_2 Depth=1
movl $8, %edi
callq ___cxa_allocate_exception
movq __ZTVNSt3__117bad_function_callE#GOTPCREL(%rip), %rcx
addq $16, %rcx
movq %rcx, (%rax)
Ltmp10:
movq __ZTINSt3__117bad_function_callE#GOTPCREL(%rip), %rsi
movq __ZNSt3__117bad_function_callD1Ev#GOTPCREL(%rip), %rdx
movq %rax, %rdi
callq ___cxa_throw
Ltmp11:
jmp LBB1_2
LBB1_9: ## %.loopexit.split-lp
Ltmp12:
jmp LBB1_10
LBB1_5: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.2
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp7:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp8:
## BB#6:
movq -48(%rbp), %rdi
cmpq %r15, %rdi
je LBB1_7
## BB#15:
testq %rdi, %rdi
je LBB1_17
## BB#16:
movq (%rdi), %rax
callq *40(%rax)
jmp LBB1_17
LBB1_7:
movq -80(%rbp), %rax
leaq -80(%rbp), %rdi
callq *32(%rax)
LBB1_17: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
cmpq -32(%rbp), %rbx
jne LBB1_19
## BB#18: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
xorl %eax, %eax
addq $72, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
Can we stop arguing about performance now please?
You can add compile-time checking of template parameters by defining constraints.
This allows you to catch such errors early, and there is no runtime overhead either, as current compilers generate no code for a constraint.
For example, we can define such a constraint:
template<class F, class T> struct CanCall
{
    static void constraints(F f, T a) { f(a); }   // compiles only if F is callable with a T
    CanCall() { void(*p)(F, T) = constraints; }   // taking its address forces instantiation
};
CanCall checks (at compile time) that an F can be called with a T.
Usage:
template<typename F>
void printHello(F f)
{
CanCall<F, const char*>();
f("Hello!");
}
As a result compilers also give readable error messages for a failed constraint.
Well... Just use SFINAE
template<typename T>
auto printHello(T f) -> void_t<decltype(f(std::declval<const char*>()))> {
f("hello");
}
And void_t is implemented as:
template<typename...>
using void_t = void;
The return type acts as a constraint on the parameter sent to your function. If the expression inside the decltype cannot be evaluated, substitution fails and the call is rejected with an error at the call site.
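A compilable sketch combining the answer's pieces with the question's main; the commented-out call is the one that now fails at the call site instead of inside printHello:
#include <string>
#include <utility>

template<typename...>
using void_t = void;

template<typename F>
auto printHello(F f) -> void_t<decltype(f(std::declval<const char*>()))> {
    f("hello");
}

int main() {
    std::string buf;
    printHello([&buf](const char* msg) { buf += msg; });  // OK: f(const char*) is well-formed
    // printHello([&buf]() { });  // error here: no matching function for call to 'printHello'
}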
I'm attempting to generate arrays of __m256i's to reuse in another computation. When I attempt to do that (even with a minimal testcase), I get a segmentation fault - but only if the code is compiled with g++ or clang. If I compile the code with the Intel compiler (version 16.0), no segmentation fault occurs. Here is a test case I created:
#include <immintrin.h>

int main() {
    __m256i *table = new __m256i[10000];
    __m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
    table[99] = zeroes;
}
When the above is compiled with clang 3.6 or g++ 4.8, running it produces a segmentation fault.
Here's the assembly generated by the Intel compiler (from https://gcc.godbolt.org/, icc 13.0):
pushq %rbx #3.12
movq %rsp, %rbx #3.12
andq $-32, %rsp #3.12
pushq %rbp #3.12
pushq %rbp #3.12
movq 8(%rbx), %rbp #3.12
movq %rbp, 8(%rsp) #3.12
movq %rsp, %rbp #3.12
subq $112, %rsp #3.12
movl $3200, %eax #4.38
vzeroupper #4.38
movq %rax, %rdi #4.38
call operator new[](unsigned long) #4.38
movq %rax, -112(%rbp) #4.38
movq -112(%rbp), %rax #4.38
movq %rax, -104(%rbp) #4.20
vxorps %ymm0, %ymm0, %ymm0 #5.22
vmovdqu %ymm0, -80(%rbp) #5.22
vmovdqu -80(%rbp), %ymm0 #5.22
vmovdqu %ymm0, -48(%rbp) #5.20
movl $3168, %eax #6.17
addq -104(%rbp), %rax #6.5
vmovdqu -48(%rbp), %ymm0 #6.17
vmovdqu %ymm0, (%rax) #6.5
movl $0, %eax #7.1
vzeroupper #7.1
leave #7.1
movq %rbx, %rsp #7.1
popq %rbx #7.1
ret #7.1
And here's from clang 3.7:
pushq %rbp
movq %rsp, %rbp
andq $-32, %rsp
subq $192, %rsp
xorl %eax, %eax
movl $3200, %ecx # imm = 0xC80
movl %ecx, %edi
movl %eax, 28(%rsp) # 4-byte Spill
callq operator new[](unsigned long)
movq %rax, 88(%rsp)
movq $0, 168(%rsp)
movq $0, 160(%rsp)
movq $0, 152(%rsp)
movq $0, 144(%rsp)
vmovq 168(%rsp), %xmm0 # xmm0 = mem[0],zero
vmovq 160(%rsp), %xmm1 # xmm1 = mem[0],zero
vpunpcklqdq %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0]
vmovq 152(%rsp), %xmm1 # xmm1 = mem[0],zero
vpslldq $8, %xmm1, %xmm1 # xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0,1,2,3,4,5,6,7]
vmovaps %xmm1, %xmm2
vinserti128 $1, %xmm0, %ymm2, %ymm2
vmovaps %ymm2, 96(%rsp)
vmovaps %ymm2, 32(%rsp)
movq 88(%rsp), %rax
vmovaps %ymm2, 3168(%rax)
movl 28(%rsp), %eax # 4-byte Reload
movq %rbp, %rsp
popq %rbp
vzeroupper
retq
Am I running into a compiler bug in clang/g++? Or am I simply doing something wrong?
I have said many times before that implicit SIMD loads/stores are a bad idea. Stop using them. Use explicit loads/stores like this
int64_t* table = new int64_t[4*10000];
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
_mm256_storeu_si256((__m256i*)&table[4*99], zeroes);
or, since this is POD, use the cross-compiler/OS function _mm_malloc (memory from _mm_malloc must be released with _mm_free):
int64_t* table = (int64_t*)_mm_malloc(sizeof(int64_t)*4*10000, 32);
__m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
_mm256_store_si256((__m256i*)&table[4*99], zeroes);
You can use _mm256_setzero_si256() instead of _mm256_set_epi64x(0, 0, 0, 0) (note that _mm256_set_epi64x does not work in 32-bit mode on some versions of MSVC), but GCC and Clang are smart enough to know they are the same thing.
Since you're using intrinsics, which are not part of the C/C++ standard, some rules such as strict aliasing may be overlooked.
I guess the problem has to do with incorrect memory alignment: vmovaps requires the memory location to start at a 32-byte boundary, while vmovdqu does not. That's why the Intel version works whereas the clang/g++ code crashes. I don't know whether this counts as a compiler bug, but you probably want the alignment anyway.
The following code should work, although it's more C than C++.
#include <malloc.h>      // memalign (glibc); aligned_alloc or posix_memalign are portable alternatives
#include <immintrin.h>

int main() {
    __m256i *table = (__m256i*) memalign( 32, 10000 * sizeof(__m256i) );
    __m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0);
    table[99] = zeroes;
}