This question already has answers here: How to get the CPU cycle count in x86_64 from C++?
I have successfully written some inline assembly in GCC to rotate a value right by one bit, following some nice instructions: http://www.cs.dartmouth.edu/~sergey/cs108/2009/gcc-inline-asm.pdf
Here's an example:
static inline int ror(int v) {
    asm ("ror %0;" : "=r"(v) /* output */ : "0"(v) /* input */);
    return v;
}
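(Editor's note: a quick sanity check of this helper, added for illustration and not part of the original question; it assumes the ror function above is in scope.)

#include <stdio.h>

int main(void) {
    // rotating 0x00000001 right by one bit wraps the low bit to the top
    printf("%08x\n", (unsigned)ror(1));   // prints 80000000
    return 0;
}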
However, I want code to count clock cycles, and the example I've seen is in the wrong (probably Microsoft) format. I don't know how to do these things in GCC. Any help?
unsigned __int64 inline GetRDTSC() {
    __asm {
        ; Flush the pipeline
        XOR eax, eax
        CPUID
        ; Get RDTSC counter in edx:eax
        RDTSC
    }
}
I tried:

static inline unsigned long long getClocks() {
    asm("xor %%eax, %%eax");
    asm(CPUID);
    asm(RDTSC : : %%edx %%eax); // Get RDTSC counter in edx:eax
}

but I don't know how to get the edx:eax pair to return as 64 bits cleanly, and I don't know how to really flush the pipeline.
Also, the best source code I found was at: http://www.strchr.com/performance_measurements_with_rdtsc
and it mentions the Pentium, so if there are different ways of doing this on different Intel/AMD variants, please let me know. I would prefer something that works on all x86 platforms, even if it's a bit ugly, over a range of per-variant solutions, though I wouldn't mind knowing about them.
The following does what you want:
inline unsigned long long rdtsc() {
    unsigned int lo, hi;
    asm volatile (
        "cpuid \n"
        "rdtsc"
        : "=a"(lo), "=d"(hi)   /* outputs */
        : "a"(0)               /* inputs */
        : "%ebx", "%ecx");     /* clobbers */
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
It is important to put as little inline ASM as possible in your code, because the compiler cannot optimize through it. That's why I've done the shifting and ORing of the result in C code rather than coding that in ASM as well. Similarly, I use the "a" input of 0 to let the compiler decide when and how to zero out eax. It could be that some other code in your program has already zeroed it out, and the compiler can save an instruction if it knows that.
Also, the "clobbers" above are very important. CPUID overwrites everything in eax, ebx, ecx, and edx. You need to tell the compiler that you're changing these registers so that it knows not to keep anything important there. You don't have to list eax and edx because you're using them as outputs. If you don't list the clobbers, there's a serious chance your program will crash and you will find it extremely difficult to track down the issue.
This will store the result in value. Combining the results takes extra cycles, so the number of cycles between calls to this code will be a few less than the difference in results.
unsigned int hi, lo;
unsigned long long value;
asm volatile (
    "cpuid\n\t"
    "rdtsc"
    : "=d" (hi), "=a" (lo)   /* outputs */
    : "a" (0)                /* input: cpuid leaf 0 */
    : "ebx", "ecx");         /* cpuid also clobbers these */
value = (((unsigned long long)hi) << 32) | lo;
Is there a way to compute:
/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
    return a > b ? a - b : 0;
}
using some binary operations instead of a test?
The original function compiles to a conditional move and is not subject to the performance concern you have. You are attempting to perform premature optimization here, and it will do absolutely no good for you. Write readable code, then profile it, then identify and optimize performance bottlenecks.
You say "tests can lower performances". If you want to do micro-optimizations like this, you should aim to understand the reasons underlying these rules of thumb, not apply them as unconditional truth.
What lowers performance about "tests" are (mispredicted) control flow branches. Modern CPUs heavily use instruction pipelining, and control flow branches mean it's unclear which instruction comes next. The common approach is that the CPU guesses (using branch prediction hardware/algorithms) which way the branch will go. If it gets it wrong, it has to flush the entire pipeline and refill it, which wastes cycles.
Well, the code in question has no such issue. It compiles to a conditional move:
https://godbolt.org/z/upkgok
clipped_substract(unsigned char, unsigned char):
    mov    eax, edi
    mov    edx, 0
    sub    eax, esi
    cmp    dil, sil
    cmovbe eax, edx
    ret
You can see that there is no control-flow branch here. See "Why is a conditional move not vulnerable for Branch Prediction Failure?" for why that is not subject to the above performance concern.
I repeat: Write readable code, then profile it, then identify and optimize performance bottlenecks. What you attempted here was to fix a performance problem that didn't even exist in the first place (not to mention you didn't prove that the performance of this code is even relevant to your program's performance).
Using clang 11.0, compiler optimizations enabled
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
    long c = a - b;   /* widen so the sign of the difference survives */
    /* the shifted-down sign bit is 1 when c is negative; multiply by 0
       in that case, by 1 otherwise */
    return (unsigned char)((1 - ((unsigned long)c >> (sizeof(c) * 8 - 1))) * c);
}
The time taken by the alternatives differs only negligibly, but over 10,000,000 iterations the other approaches take around 0.300 s on my machine, while this solution takes around 0.200 s.
This is only valid if you use a type wider than char for the intermediate result; here I used long, which is wider than char on my machine. Adjust it accordingly for yours.
Basically, I just store the result of the subtraction in a wider variable. If the result is negative, its most significant bit will be 1; otherwise it will be 0.
unsigned int getAbs(int n)
{
    /* mask is all ones when n is negative, all zeros otherwise */
    int const mask = n >> (sizeof(int) * 8 - 1);
    return ((n + mask) ^ mask);   /* branchless two's-complement abs */
}

/** clipped(a - b) **/
unsigned char clipped_substract(unsigned char a, unsigned char b)
{
    return a - (a + b) / 2 + getAbs(a - b) / 2;
}
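(Editor's note: this works because of the identity max(x, 0) = (x + |x|) / 2: here a - (a+b)/2 + |a-b|/2 = (a - b + |a-b|)/2, and since a+b and a-b always have the same parity, the two truncating divisions lose the same half-unit and the errors cancel. A quick exhaustive check, mine rather than the answer's:)

#include <assert.h>

int main(void) {
    // compare against the straightforward reference for every byte pair
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            unsigned char expect = a > b ? (unsigned char)(a - b) : 0;
            assert(clipped_substract((unsigned char)a, (unsigned char)b) == expect);
        }
    return 0;
}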
(Editor's note: this question was originally: How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem, and the accepted answer addresses the underlying problem instead. Another answer does answer the literal question.)
I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking.
I continue getting the error "request for member ‘m128i_i8’ in ‘(my var name)', which is of non-class type ‘wirelabel {aka __vector(2) long long int}’" which I don't understand because I've included all the correct headers and it does recognize __m128i variables.
Note1: wirelabel is a typedef for __m128i i.e. there exists in a header
typedef __m128i wirelabel;
Note2: The reason Note1 was used is explained in the following other question:
tbb::cache_aligned_allocator: Getting "request for member...which is of non-class type" with __m128i. User error or bug?
Note3: I'm using the compiler g++
Note4: This following question doesn't answer mine but does discuss related information Why should you not access the __m128i fields directly?
I also know that there is a _mm_set_epi8 function, but it requires you to set all sixteen 8-bit elements at once, and that is not an option for me currently.
The question the accepted answer answers:
Edit: I was asked for more specifics as to why I think I need to access each of the 16 8-bit parts of the __m128i object, and here is why: I have a bool array with size 'n*128' (n is a size_t) and I need to store these within an array of 'wirelabel' with size 'n'.
Now because wirelabel is just an alias/typedef (correct me if there is a difference) for __m128i, each of the 'n' indices of 128 bools can be stored in the 'wirelabel' array.
However, in order to do this I believe I need to convert every 8 bits into its signed equivalent and store that at the correct 8-bit index in each 'wirelabel' in the array.
So your source data is contiguous? You should use _mm_load_si128 instead of messing around with scalar components of vector types.
Your real problem is packing an array of bool (1 byte per element in the ABI used by g++ on x86) into a bitmap. You should do this with SIMD, not with scalar code to set 1 bit or byte at a time.
pmovmskb (_mm_movemask_epi8) is fantastic for extracting one bit per byte of input. You just need to arrange to get the bit you want into the high bit.
The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb on Haswell (port 0). (http://agner.org/optimize/). Instead, adding 0x7F will produce 0x80 (high bit set) for an input of 1, but 0x7F (high bit clear) for an input of 0. (And a bool in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value).
Why not pcmpeqb against _mm_set1_epi8(1)? Skylake runs pcmpeqb on ports 0/1, but paddb on all 3 vector ALU ports (0/1/5). It's very common to use pmovmskb on the result of pcmpeqb/w/d/q, though.
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// n is the number of uint16_t dst elements
// We access n*16 bool elements from src.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
    // you can later access dst with __m128i loads/stores
    __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    for (size_t i = 0; i < n; i++) {
        __m128i boolvec  = _mm_loadu_si128((const __m128i*)&src[i*16]);
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        dst[i] = (uint16_t)_mm_movemask_epi8(highbits);
    }
}
Because we want to use scalar stores when writing this bitmap, we want dst to be in uint16_t for strict-aliasing reasons. With AVX2, you'd want uint32_t. (Or if you did combine = tmp1 << 16 | tmp to combine two pmovmskb results. But probably don't do that.)
This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer)
.L3:
    movdqu   xmm0, XMMWORD PTR [rsi]
    add      rsi, 16
    add      rdi, 2
    paddb    xmm0, xmm1
    pmovmskb eax, xmm0
    mov      WORD PTR [rdi-2], ax
    cmp      rdx, rsi
    jne      .L3
So it's not wonderful (7 fused-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.
Using a shift (pslld xmm0, 7) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0.
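(Editor's note: a minimal usage sketch for pack_bools, not from the answer; the test pattern and names are invented.)

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

int main(void) {
    bool flags[32];                       // n*16 bools, with n = 2
    for (int i = 0; i < 32; i++)
        flags[i] = (i % 3 == 0);          // arbitrary test pattern
    uint16_t packed[2];
    pack_bools(packed, flags, 2);
    // bit i of packed[j] corresponds to flags[j*16 + i]
    printf("%04x %04x\n", (unsigned)packed[0], (unsigned)packed[1]);  // prints 9249 4924
    return 0;
}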
Create an anonymous union containing a __m128i member and an array of the other type whose members you want to set. Type-punning through unions is legal in C, and supported as an extension in g++, clang++, and MSVC. If you want to set individual bits, you can declare the other member as a struct of bitfields. The order of bitfields is implementation-defined, but you're using an Intel intrinsic anyway, so it'll be little-endian.
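(Editor's note: a minimal sketch of this approach, with invented names; it relies on the GCC/Clang/MSVC union-punning semantics described above.)

#include <immintrin.h>
#include <stdint.h>

// Union view of a 128-bit vector: .vec for intrinsics, .i8 for byte access
// (the compiler extension makes reading the "other" member well-defined).
typedef union {
    __m128i vec;
    int8_t  i8[16];
} v128;

int main(void) {
    v128 u;
    u.vec = _mm_setzero_si128();   // start from all-zero bytes
    u.i8[5] = -1;                  // set a single 8-bit lane directly
    __m128i result = u.vec;        // back to intrinsic land
    (void)result;
    return 0;
}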
I'm compiling the Python extension igraph from source for x64, instead of the x86 build which is available in the distro. I have gotten it all sorted out in VS 2012, and it compiles when I comment out as follows in src/math.c:
#ifndef HAVE_LOGBL
long double igraph_logbl(long double x) {
    long double res;
/**#if defined(_MSC_VER)
    __asm { fld [x] }
    __asm { fxtract }
    __asm { fstp st }
    __asm { fistp [res] }
#else
    __asm__ ("fxtract\n\t"
             "fstp %%st" : "=t" (res) : "0" (x));
#endif*/
    return res;
}
#endif
The problem is I don't know asm, certainly not well enough to know whether there are issues going from x86 to x64. This is a short snippet of four assembly instructions that would have to be converted to x64 intrinsics, from what I can see.
Any pointers? Is going intrinsic the right way, or should it be a subroutine or pure C?
Edit: Link for igraph extension if anyone wanted to see http://igraph.sourceforge.net/download.html
On x64, floating point will generally be performed using SSE2 instructions, as these are generally a lot faster. Your only problem here is that there is no SSE equivalent to the fxtract op (which generally means the FPU version is implemented as a compound instruction and hence very slow). So implementing this as a C function will likely be just as fast on x64.
I'm finding the function a bit hard to read, however, as from what I can tell it is calling fxtract and then storing an integer value to the address pointed to by a long double. This means the long double is going to have a 'partially' undefined value in it. As best I can tell, the above assembly shouldn't work... but it's been a VERY long time since I wrote any x87 code, so I'm probably just rusty.
Anyway, the function appears to be an implementation of logb, which you won't find implemented in MSVC. It can, however, be implemented as follows using the frexpl function:
#include <math.h>

long double igraph_logbl(long double x)
{
    int exp = 0;
    frexpl(x, &exp);
    /* frexpl's mantissa is in [0.5, 1), so its exponent is one above
       logb's convention; subtract 1 to match logbl. */
    return (long double)(exp - 1);
}
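(Editor's note: a quick check of the exponent convention, mine rather than the answer's; it assumes the function above is in scope.)

#include <stdio.h>

int main(void) {
    // frexpl(8.0L) yields mantissa 0.5 and exponent 4, so logb is 4 - 1 = 3
    printf("%Lf\n", igraph_logbl(8.0L));   // prints 3.000000
    return 0;
}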
The function below calculates the absolute value of a 32-bit floating-point value:
__forceinline static float Abs(float x)
{
    union {
        float x;
        int a;
    } u;
    //u.x = x;
    u.a &= 0x7FFFFFFF;
    return u.x;
}
The union u declared in the function holds a member x, which is different from the x passed as the function's parameter. Is there any way to create the union directly from the function's argument x?
Any reason why the function above, with that line uncommented, would execute longer than this one?
__forceinline float fastAbs(float a)
{
    int b = *((int *)&a) & 0x7FFFFFFF;
    return *((float *)(&b));
}
I'm trying to figure out the best way to take the absolute value of a floating-point value in as few reads/writes to memory as possible.
For the first question, I'm not sure why you can't just do what you want with an assignment; the compiler will do whatever optimizations can be done. In your second sample code you violate strict aliasing, so it isn't the same.
As for why it's slower:
It's because CPUs today tend to have separate integer and floating-point units. By type-punning like that, you force the value to be moved from one unit to the other. This has overhead. (This is often done through memory, so you have extra loads and stores.)
In the second snippet: a which is originally in the floating-point unit (either the x87 FPU or an SSE register), needs to be moved into the general purpose registers to apply the mask 0x7FFFFFFF. Then it needs to be moved back.
In the first snippet: The compiler is probably smart enough to load a directly into the integer unit. So you bypass the FPU in the first stage.
(I'm not 100% sure until you show us the assembly. It will also depend heavily on whether the parameter starts off in a register or on the stack. And whether the output is used immediately by another floating-point operation.)
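(Editor's note: for illustration, a sketch of masking the sign bit without leaving the SSE domain; this is essentially the andps that optimizing compilers emit for fabsf, not code from the answer.)

#include <emmintrin.h>

static inline float sseAbs(float x)
{
    // Build the 0x7FFFFFFF mask in an XMM register and AND away the sign bit,
    // so the value never crosses into the general-purpose registers.
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(x), mask));
}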
Looking at the disassembly of the code compiled in release mode, the difference is quite clear! I removed the inline and used two virtual functions, to keep the compiler from optimizing too much and let us see the differences.
This is the first function.
013D1002  in al,dx
          union {
              float x;
              int a;
          } u;
          u.x = x;
013D1003  fld dword ptr [x]            // Loads a float on top of the FPU stack.
013D1006  fstp dword ptr [x]           // Pops a float from the top of the FPU stack into the destination address.
          u.a &= 0x7FFFFFFF;
013D1009  and dword ptr [x],7FFFFFFFh  // Executes a 32-bit binary AND on the specified address.
          return u.x;
013D1010  fld dword ptr [x]            // Loads the result on top of the FPU stack.
}
This is the second function.
013D1020  push ebp                     // Standard function entry... I'm using a virtual function here to show the difference.
013D1021  mov ebp,esp
          int b = *((int *)&a) & 0x7FFFFFFF;
013D1023  mov eax,dword ptr [a]        // Loads our parameter into eax.
013D1026  and eax,7FFFFFFFh            // Executes a 32-bit binary AND between our register and our constant.
013D102B  mov dword ptr [a],eax        // Moves the register value into our destination variable.
          return *((float *)(&b));
013D102E  fld dword ptr [a]            // Loads the result on top of the FPU stack.
The number of floating-point operations, and the use of the FPU stack, is greater in the first case. The functions execute exactly what you asked for, so no surprise: I expect the second function to be faster.
Now, removing the virtual and inlining, things are a little different. It is hard to show the disassembly here, because the compiler does a good job, but I repeat: if the values are not constants, the compiler will use more floating-point operations in the first function. Of course, integer operations are faster than floating-point operations.
Are you sure that directly using the math.h fabs function is slower than your method? If correctly inlined, fabs will just do this!

00D71016  fabs

Micro-optimizations like this are hard to see in long code, but if your function is called in a long chain of floating-point operations, fabs will work better, since values will already be on the FPU stack or in SSE registers; it is also better optimized by the compiler.
You cannot measure the performance of optimizations by running a loop over an isolated piece of code; you must see how the compiler mixes it all together in the real code.
I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;

#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))

inline void SetBit(LongWord array[], const int bit)
{
    array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
    __asm {
        mov eax, bit
        mov ecx, array
        bts [ecx], eax
    }
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow (more than 10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).
This is in fact what gcc 5.2 generates from your C source (a memory-destination OR or TEST). It looks optimal to me (fewer uops than a memory-destination bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits rather than CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
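(Editor's note: a bugfixed, 64-bit-clean sketch of the C version along the lines the answer suggests, with invented names; not the answer's own code.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Fixed-width words make the divide/modulo constants self-evidently correct,
// and the uint32_t cast on the shifted 1 avoids the signed-int pitfall.
static inline void SetBit32(uint32_t array[], size_t bit)
{
    array[bit / 32] |= (uint32_t)1 << (bit % 32);
}

static inline bool TestBit32(const uint32_t array[], size_t bit)
{
    return (array[bit / 32] >> (bit % 32)) & 1;
}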
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code like:
    ...             ; inline asm
    bt   reg, reg
    setc al         ; end of inline asm
    test al, al     ; compiler-generated
    jz   bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(LongWord *array, const int bit)
{
    asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}
This version efficiently returns the carry flag (via the GCC 6 flag-output extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand, since using a memory operand is very slow, as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
    int oldbit;
    asm("bts %2,%0"
        : "+r" (n), "=@ccc" (oldbit)   // "=@ccc" captures the carry flag as an output
        : "r" (bit));
    return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away, and the produced assembly will have bts followed immediately by a jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if (!wasSet) {
    *(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
As another, slightly indirect, answer: GCC exposes a number of atomic builtins starting with version 4.1.
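(Editor's note: a brief sketch of that route, my own example; __sync_fetch_and_or has been available since GCC 4.1 and typically compiles to a lock or instruction.)

// Atomically set bit 'bit' in *word and report whether it was already set.
static inline int atomic_test_and_set_bit(unsigned long *word, unsigned bit)
{
    unsigned long mask = 1UL << bit;
    return (__sync_fetch_and_or(word, mask) & mask) != 0;
}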