I have the cross-platform audio processing app. It is written using Qt and PortAudio libraries. I also use Chaotic-Daw sources for some audio processing functions (Vibarto effect and Soft-Knee Dynamic range compression). The problem is that I cannot port my app from Windows to Mac OSX because of I get the compiler errors for __asm parts (I use Mac OSX Yosemite and Qt Creator 3.4.1 IDE):
/Users/admin/My
projects/MySound/daw/basics/rosic_NumberManipulations.h:69:
error:
expected '(' after 'asm'
{
^
for such lines:
INLINE int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
#ifndef LINUX
__asm
{ // <========= error indicates that row
fld x;
fadd st, st (0);
fadd round_towards_m_i;
fistp i;
sar i, 1;
}
#else
i = (int) floor(x);
#endif
return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
__asm__("fadd %[x], %[x] \n\t"
"fadds %[adj] \n\t"
"fistpl %[i] \n\t"
"sarl $1, %[i]"
: [i] "=m" (i) // store result in memory (as required by FISTP)
: [x] "t" (x), // load input onto top of x87 stack (equivalent to FLD)
[adj] "m" (round_towards_m_i)
: "st");
return (i);
}
But, because of the additional expressivity of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even more optimal object code:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
x += x; // equivalent to the first FADD
x += round_towards_m_i; // equivalent to the second FADD
__asm__("fistpl %[i]"
: [i] "=m" (i)
: [x] "t" (x)
: "st");
return (i >> 1); // equivalent to the final SAR
}
Live demonstration
(Note that, technically, a signed right-shift like that done by the last line is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform and can therefore rely on implementation-specific behavior. In this case, I know and it can easily be demonstrated that all C compilers will generate SAR instructions to do an arithmetic right-shift on signed integer values.)
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by ensuring that you are defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.
Related
I'm seeing some 10% run-time overhead when using a clone of a constexpr enhanced boost.units with the float value type using clang and -O3 level optimization. This is showing up with some of the more elaborate applications of a library that I've been working on. Given this situation, I have two questions that I'd really like to solve and would love help with:
Boost units is supposed to be a zero-overhead library so why am I seeing the overhead?
More importantly, besides not using boost.units, how can I get the overhead to go away?
Details...
I've been working on an interactive physics engine written in C++14. With the many different physical quantities and units it uses, I love using the compile-time enforced units and quantities that boost.units provides. Unfortunately enabling boost units seems to be coming with this run-time cost. The engine comes with a benchmark application that uses google's benchmark library to provide this insight and it takes some of the more elaborate simulations to see the overhead.
At present, due to the overhead, the engine builds by default without using boost units. By defining the right preprocessor macro name, the engine can be built with boost units. I achieved this switching using code like the following:
// #define USE_BOOST_UNITS
#if defined(USE_BOOST_UNITS)
...
#include <boost/units/systems/si/time.hpp>
...
#endif // defined(USE_BOOST_UNITS)
#if defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) boost::units::quantity<BoostDimension, float>
#define UNIT(Quantity, BoostUnit) Quantity{BoostUnit * float{1}}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) Quantity{BoostUnit * float{Ratio}}
#else // defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) float
#define UNIT(Quantity, BoostUnit) float{1}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) float{Ratio}}
#endif // defined(USE_BOOST_UNITS)
using Time = QUANTITY(boost::units::si::time);
constexpr auto Second = UNIT(Time, boost::units::si::second);
What I did with the UNIT macro feels a bit suspect to me in that it's taking a boost unit type and turning it into a value. That makes switching between using or not using boost units easier however since either way expressions like 3.0f * Second compile without warning. Checking what clang and gcc do with expressions like these appeared to confirm that they were smart enough to avoid run-time multiplying 3.0f * 1.0f and just recognized the expression as 3.0f. I wonder anyway if that's the cause of the overhead or if it's something else that I've done.
I've also wondered if maybe the problem is rooted in the constexpr enhancement code I'm using or if the author(s) of that code had any idea about this overhead. On search the internet, I found a mention of overhead with the normal boost units library so seems safe to assume the enhanced units are not at fault. A suggestion that came out of my inquiring though (and my thanks go to GitHub user muggenhor for it) was the following:
I expect this is likely caused by the amount of inlining done by the compiler. Because of the wrapper functions for the operators this adds at least one function call that needs to be inlined per operation. For expressions depending on the result of sub-expressions this requires the sub-expressions to be inlined first. As a result I expect the minimum amount of inlining passes to be able to properly optimize your code to be equal to the depth of the produced expression tree...
This sounds like a pretty viable theory to me. Unfortunately, I don't know how to test it and admittedly I'm more fond of digging into my own code at the moment than into clang/LLVM code. I've tried using -inline-threshold=10000 but that doesn't seem to make the overhead go away. To my understanding of clang at least, I don't believe that specifically increases the number of inlining passes. Is there another command line argument that does? Or are there parameters within clang's sources that someone can point me to looking at as a starting point to maybe recompiling clang and trying the modified compiler?
Another theory I've had is whether using float is the problem. I can rebuild my physics engine to use double instead and compare benchmark results between building with and without the boost units support enabled. What I find when using double is that the overhead at least seems to decrease. I've wondered if maybe boost units is somewhere using double even when I use float in its quantity template and maybe that's causing the overhead.
Lastly, I built boost unit's performance example with the constexpr enhancements and ran it with both double and float. Got no reliable sign of any overhead which seems to eliminate my theory of float being the problem.
Update With Data & Code
Got some more isolated data and code on this where it seems I'm seeing significantly more than 10% overhead...
Some benchmark data where Length is basically boost::units::si::length:
LesserLength/1000 953 ns 953 ns 724870
LesserFloat/1000 590 ns 590 ns 1093647
LesserDouble/1000 619 ns 618 ns 1198938
What the related code looks like:
static void LesserLength(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0f * playrho::Meter, 100.0f * playrho::Meter);
auto c = 0.0f * playrho::Meter;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
static_assert(std::is_same<decltype(b), const playrho::Length>::value, "not Length");
const auto v = (a < b)? a: b;
benchmark::DoNotOptimize(c = v);
}
}
}
static void LesserFloat(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0f, 100.0f);
auto c = 0.0f;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
const auto v = (a < b)? a: b;
static_assert(std::is_same<decltype(v), const float>::value, "not float");
benchmark::DoNotOptimize(c = v);
}
}
}
static void LesserDouble(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0, 100.0);
auto c = 0.0;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
const auto v = (a < b)? a: b;
static_assert(std::is_same<decltype(v), const double>::value, "not double");
benchmark::DoNotOptimize(c = v);
}
}
}
With this as a hint to me, I checked Godbolt with the following code to see what clang 5.0.0 and gcc 7.2 would generate:
#include <algorithm>
#include <boost/units/systems/si/length.hpp>
#include <boost/units/cmath.hpp>
using length = boost::units::quantity<boost::units::si::length, float>;
float f(float a, float b)
{
return a < b? a: b;
}
length f(length a, length b)
{
return a < b? a: b;
}
I see that the generated assembly looks quite different between the two functions and between clang and gcc. Here's a gist of the relevant assembly from clang (with the boost stuff here simply shown as length):
f(float, float): # #f(float, float)
minss xmm0, xmm1
ret
f(length, length)
movss xmm0, dword ptr [rdx] # xmm0 = mem[0],zero,zero,zero
ucomiss xmm0, dword ptr [rsi]
cmova rdx, rsi
mov eax, dword ptr [rdx]
mov dword ptr [rdi], eax
mov rax, rdi
ret
Shouldn't both of these compilers using -O3 optimization be returning the same assembly though for the length version as they do for the float version? Is the problem that they're not quite optimizing down all the way to the same code as for float? Seems like this is the problem and if so that's progress but I still want to figure out what can be done to really get zero overhead.
Given this code:
#include <stdio.h>
int main(int argc, char **argv)
{
int x = 1;
printf("Hello x = %d\n", x);
}
I'd like to access and manipulate the variable x in inline assembly. Ideally, I want to change its value using inline assembly. GNU assembler, and using the AT&T syntax.
In GNU C inline asm, with x86 AT&T syntax:
(But https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it).
// this example doesn't really need volatile: the result is the same every time
asm volatile("movl $0, %[some]"
: [some] "=r" (x)
);
after this, x contains 0.
Note that you should generally avoid mov as the first or last instruction of an asm statement. Don't copy from %[some] to a hard-coded register like %%eax, just use %[some] as a register, letting the compiler do register allocation.
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html and https://stackoverflow.com/tags/inline-assembly/info for more docs and guides.
Not all compilers support GNU syntax.
For example, for MSVC you do this:
__asm mov x, 0 and x will have the value of 0 after this statement.
Please specify the compiler you would want to use.
Also note, doing this will restrict your program to compile with only a specific compiler-assembler combination, and will be targeted only towards a particular architecture.
In most cases, you'll get as good or better results from using pure C and intrinsics, not inline asm.
asm("mov $0, %1":"=r" (x):"r" (x):"cc"); -- this may get you on the right track. Specify register use as much as possible for performance and efficiency. However, as Aniket points out, highly architecture dependent and requires gcc.
I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline ASM. The issue I'm running into, is calling the library since the library does not have a consistent calling convention. While I found a solution, I've found that in some cases the compiler will make optimizations that clobber registers immediately before the call, I think there is just something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop pc
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
But if you really, really, really must do this via inline assembly in a C function, then (as #JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
:
: "r"(r1)
: "r1", "lr" // Compare the codegen with and without this "lr"!
);
}
I'm compiling from source the python extension IGRAPH for x64 instead of x86 which is available in the distro. I have gotten it all sorted out in VS 2012 and it compiles when I comment out as follows in src/math.c
#ifndef HAVE_LOGBL
long double igraph_logbl(long double x) {
long double res;
/**#if defined(_MSC_VER)
__asm { fld [x] }
__asm { fxtract }
__asm { fstp st }
__asm { fistp [res] }
#else
__asm__ ("fxtract\n\t"
"fstp %%st" : "=t" (res) : "0" (x));
#endif*/
return res;
}
#endif
The problem is I don't know asm well and I don't know it well enough to know if there are issues going from x86 to x64. This is a short snippet of 4 assembly intsructions that have to be converted to x64 intrinsics, from what I can see.
Any pointers? Is going intrinsic the right way? Or should it be subroutine or pure C?
Edit: Link for igraph extension if anyone wanted to see http://igraph.sourceforge.net/download.html
In x64 floating point will generally be performed using the SSE2 instructions as these are generally a lot faster. Your only problem here is that there is no equivalent to the fxtract op in SSE (Which generally means the FPU version will be implemented as a compound instruction and hence very slow). So implementing as a C function will likely be just as fast on x64.
I'm finding the function a bit hard to read however as from what I can tell it is calling fxtract and then storing an integer value to the address pointed to by a long double. This means the long double is going to have a 'partially' undefined value in it. As best I can tell the above code assembly shouldn't work ... but its been a VERY long time since I wrote any x87 code so I'm probably just rusty.
Anyway the function, appears to be an implementation of logb which you won't find implemented in MSVC. It can, however, be implemented as follows using the frexp function:
long double igraph_logbl(long double x)
{
int exp = 0;
frexpl( x, &exp );
return (long double)exp;
}
I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;
#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))
inline void SetBit(LongWord array[], const int bit)
{
array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}
inline bool TestBit(const LongWord array[], const int bit)
{
return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
__asm {
mov eax, bit
mov ecx, array
bts [ecx], eax
}
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned_long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cqde instruction, due to the badly-written C which uses 1 instead of 1UL.
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(*array, bit) {
asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}
This version efficiently returns the carry flag (via the gcc-v6 extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand since use of a memory operand is very slow as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
int oldbit;
asm("bts %2,%0"
: "+r" (n), "=#ccc" (oldbit)
: "r" (bit));
return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away and the produced assembly will have bts followed immediately by jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if(!wasSet) {
*(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
Another slightly indirect answer, GCC exposes a number of atomic operations starting with version 4.1.