Converting inline ASM to intrinsic for x64 igraph - c++

I'm compiling from source the python extension IGRAPH for x64 instead of x86 which is available in the distro. I have gotten it all sorted out in VS 2012 and it compiles when I comment out as follows in src/math.c
#ifndef HAVE_LOGBL
long double igraph_logbl(long double x) {
long double res;
/**#if defined(_MSC_VER)
__asm { fld [x] }
__asm { fxtract }
__asm { fstp st }
__asm { fistp [res] }
#else
__asm__ ("fxtract\n\t"
"fstp %%st" : "=t" (res) : "0" (x));
#endif*/
return res;
}
#endif
The problem is I don't know asm well and I don't know it well enough to know if there are issues going from x86 to x64. This is a short snippet of 4 assembly intsructions that have to be converted to x64 intrinsics, from what I can see.
Any pointers? Is going intrinsic the right way? Or should it be subroutine or pure C?
Edit: Link for igraph extension if anyone wanted to see http://igraph.sourceforge.net/download.html

In x64 floating point will generally be performed using the SSE2 instructions as these are generally a lot faster. Your only problem here is that there is no equivalent to the fxtract op in SSE (Which generally means the FPU version will be implemented as a compound instruction and hence very slow). So implementing as a C function will likely be just as fast on x64.
I'm finding the function a bit hard to read however as from what I can tell it is calling fxtract and then storing an integer value to the address pointed to by a long double. This means the long double is going to have a 'partially' undefined value in it. As best I can tell the above code assembly shouldn't work ... but its been a VERY long time since I wrote any x87 code so I'm probably just rusty.
Anyway the function, appears to be an implementation of logb which you won't find implemented in MSVC. It can, however, be implemented as follows using the frexp function:
long double igraph_logbl(long double x)
{
int exp = 0;
frexpl( x, &exp );
return (long double)exp;
}

Related

Why is fetestexcept in C++ compiled to a function call rather than inlined

I am evaluating the usage (clearing and querying) of Floating-Point Exceptions in performance-critical/"hot" code. Looking at the binary produced I noticed that neither GCC nor Clang expand the call to an inline sequence of instructions that I would expect; instead they seem to generate a call to the runtime library. This is prohibitively expensive for my application.
Consider the following minimal example:
#include <fenv.h>
#pragma STDC FENV_ACCESS on
inline int fetestexcept_inline(int e)
{
unsigned int mxcsr;
asm volatile ("vstmxcsr" " %0" : "=m" (*&mxcsr));
return mxcsr & e & FE_ALL_EXCEPT;
}
double f1(double a)
{
double r = a * a;
if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
else return r;
}
double f2(double a)
{
double r = a * a;
if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
else return r;
}
And the output as produced by GCC: https://godbolt.org/z/jxjzYY
The compiler seems to know that he can use the CPU-family-dependent AVX-instructions for the target (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it will always produce the much more expensive function call to glibc rather than the assembly that (as far as I understand) should do what the corresponding glibc function does.
This is not intended as a complaint, I am OK with adding the inline assembly. I just wonder whether there might be a subtle difference that I am overlooking that could be a bug in the inline-assembly-version.
It's required to support long double arithmetic. fetestexcept needs to merge the SSE and FPU states because long double operations only update the FPU state, but not the MXSCR register. Therefore, the benefit from inlining is somewhat reduced.

Porting to Mac OS X error

I have the cross-platform audio processing app. It is written using Qt and PortAudio libraries. I also use Chaotic-Daw sources for some audio processing functions (Vibarto effect and Soft-Knee Dynamic range compression). The problem is that I cannot port my app from Windows to Mac OSX because of I get the compiler errors for __asm parts (I use Mac OSX Yosemite and Qt Creator 3.4.1 IDE):
/Users/admin/My
projects/MySound/daw/basics/rosic_NumberManipulations.h:69:
error:
expected '(' after 'asm'
{
^
for such lines:
INLINE int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
#ifndef LINUX
__asm
{ // <========= error indicates that row
fld x;
fadd st, st (0);
fadd round_towards_m_i;
fistp i;
sar i, 1;
}
#else
i = (int) floor(x);
#endif
return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
__asm__("fadd %[x], %[x] \n\t"
"fadds %[adj] \n\t"
"fistpl %[i] \n\t"
"sarl $1, %[i]"
: [i] "=m" (i) // store result in memory (as required by FISTP)
: [x] "t" (x), // load input onto top of x87 stack (equivalent to FLD)
[adj] "m" (round_towards_m_i)
: "st");
return (i);
}
But, because of the additional expressivity of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even more optimal object code:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
x += x; // equivalent to the first FADD
x += round_towards_m_i; // equivalent to the second FADD
__asm__("fistpl %[i]"
: [i] "=m" (i)
: [x] "t" (x)
: "st");
return (i >> 1); // equivalent to the final SAR
}
Live demonstration
(Note that, technically, a signed right-shift like that done by the last line is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform and can therefore rely on implementation-specific behavior. In this case, I know and it can easily be demonstrated that all C compilers will generate SAR instructions to do an arithmetic right-shift on signed integer values.)
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by ensuring that you are defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.

32-bit C++ to 64-bit C++ using Visual Studio 2010

I recently want to convert a 32-bit C++ project to 64-bit, but I am stuck with the first try. Could you point out any suggestions/checklist/points when converting 32-bit C++ to 64-bit in VS (like converting 32-bit Delphi to 64-bit).
int GetVendorID_0(char *pVendorID,int iLen)
{
#ifdef WIN64 // why WIN64 is not defined switching to Active (x64) ?
// what to put here?
#else
DWORD dwA,dwB,dwC,dwD;
__asm
{
PUSHAD
MOV EAX,0
CPUID //CPUID(EAX=0),
MOV dwA,EAX
MOV dwC,ECX
MOV dwD,EDX
MOV dwB,EBX
POPAD
}
memset( pVendorID, 0,iLen);
memcpy( pVendorID, &dwB,4);
memcpy(&pVendorID[4], &dwD,4);
memcpy(&pVendorID[8], &dwC,4);
return dwA;
#endif
}
Microsoft's compilers (some of them, anyway) have a flag to point out at least some common problems where code will probably need modification to work as 64-bit code.
As far as your GetVendorID_0 function goes, I'd use Microsoft's _cpuid function, something like this:
int GetVendorID_0(char *pVendorID, int iLen) {
DWORD data[4];
_cpuid(0, data);
memcpy(pVendorID, data+1, 12);
return data[0];
}
That obviously doesn't replace all instances of inline assembly language. You choices are fairly simple (though not necessarily easy). One is to find an intrinsic like this to do the job. The other is to move the assembly code into a separate file and link it with your code in C++ (and learn the x64 calling convention). The third is to simply forego what you're doing now, and write the closest equivalent you can with more portable code.

making mistake in inline assembler in gcc [duplicate]

This question already has answers here:
How to get the CPU cycle count in x86_64 from C++?
(5 answers)
Closed 4 years ago.
I have successfully written some inline assembler in gcc to rotate right one bit
following some nice instructions: http://www.cs.dartmouth.edu/~sergey/cs108/2009/gcc-inline-asm.pdf
Here's an example:
static inline int ror(int v) {
asm ("ror %0;" :"=r"(v) /* output */ :"0"(v) /* input */ );
return v;
}
However, I want code to count clock cycles, and have seen some in the wrong (probably microsoft) format. I don't know how to do these things in gcc. Any help?
unsigned __int64 inline GetRDTSC() {
__asm {
; Flush the pipeline
XOR eax, eax
CPUID
; Get RDTSC counter in edx:eax
RDTSC
}
}
I tried:
static inline unsigned long long getClocks() {
asm("xor %%eax, %%eax" );
asm(CPUID);
asm(RDTSC : : %%edx %%eax); //Get RDTSC counter in edx:eax
but I don't know how to get the edx:eax pair to return as 64 bits cleanly, and don't know how to really flush the pipeline.
Also, the best source code I found was at: http://www.strchr.com/performance_measurements_with_rdtsc
and that was mentioning pentium, so if there are different ways of doing it on different intel/AMD variants, please let me know. I would prefer something that works on all x86 platforms, even if it's a bit ugly, to a range of solutions for each variant, but I wouldn't mind knowing about it.
The following does what you want:
inline unsigned long long rdtsc() {
unsigned int lo, hi;
asm volatile (
"cpuid \n"
"rdtsc"
: "=a"(lo), "=d"(hi) /* outputs */
: "a"(0) /* inputs */
: "%ebx", "%ecx"); /* clobbers*/
return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
It is important to put as little inline ASM as possible in your code, because it prevents the compiler from doing any optimizations. That's why I've done the shift and oring of the result in C code rather than coding that in ASM as well. Similarly, I use the "a" input of 0 to let the compiler decide when and how to zero out eax. It could be that some other code in your program already zeroed it out, and the compiler could save an instruction if it knows that.
Also, the "clobbers" above are very important. CPUID overwrites everything in eax, ebx, ecx, and edx. You need to tell the compiler that you're changing these registers so that it knows not to keep anything important there. You don't have to list eax and edx because you're using them as outputs. If you don't list the clobbers, there's a serious chance your program will crash and you will find it extremely difficult to track down the issue.
This will store the result in value. Combining the results takes extra cycles, so the number of cycles between calls to this code will be a few less than the difference in results.
unsigned int hi,lo;
unsigned long long value;
asm (
"cpuid\n\t"
"rdtsc"
: "d" (hi), "a" (lo)
);
value = (((unsigned long long)hi) << 32) | lo;

Using bts assembly instruction with gcc compiler

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;
#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))
inline void SetBit(LongWord array[], const int bit)
{
array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}
inline bool TestBit(const LongWord array[], const int bit)
{
return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
__asm {
mov eax, bit
mov ecx, array
bts [ecx], eax
}
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned_long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cqde instruction, due to the badly-written C which uses 1 instead of 1UL.
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(*array, bit) {
asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}
This version efficiently returns the carry flag (via the gcc-v6 extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand since use of a memory operand is very slow as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
int oldbit;
asm("bts %2,%0"
: "+r" (n), "=#ccc" (oldbit)
: "r" (bit));
return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away and the produced assembly will have bts followed immediately by jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if(!wasSet) {
*(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
Another slightly indirect answer, GCC exposes a number of atomic operations starting with version 4.1.