In Borland, there is a macro __emit__, "a pseudo-function that injects literal values directly into the object code" (James Holderness).
Is there an equivalent for gcc / g++?
(I can't seem to find one in the documentation)
If not, how could I implement it in my C++ source code?
Usage can be found at Metamorphic Code Examples
You can take a look at the .byte assembler directive:
__asm__ __volatile__ (".byte 0xEA, 0x00, 0x00, 0xFF, 0xFF");
GCC's optimizers sometimes discard asm statements if they determine there is no need for the output variables. Also, the optimizers may move code out of loops if they believe that the code will always return the same result (i.e. none of its input values change between calls). Using the volatile qualifier disables these optimizations.
Anyway, you should pay attention to many corner cases (e.g. gcc may skip asm code placed after a goto...)
I would like to bit-wise xor zmm0 with zmm1.
I read around the internet and tried:
asm volatile(
"vmovdqa64 (%0),%%zmm0;\n"
"vmovdqa64 (%1),%%zmm1;\n"
"vpxorq %%zmm1, %%zmm0;\n"
"vmovdqa64 %%zmm0,(%0);\n"
:: "r"(p_dst), "r" (p_src)
: );
But the compiler gives "Error: number of operands mismatch for `vpxorq'".
What am I doing wrong?
Inline asm for this is pointless (https://gcc.gnu.org/wiki/DontUseInlineAsm), and your code is unsafe and inefficient even if you fixed the syntax error by adding the 3rd operand.
Use the intrinsic _mm512_xor_epi64(__m512i a, __m512i b), as documented in Intel's asm manual entry for pxor. Look at the compiler-generated asm if you want to see how it's done.
Unsafe because you don't have a "memory" clobber to tell the compiler that you read/write memory, and you don't declare clobbers on zmm0 or zmm1.
And inefficient for many reasons, including forcing the addressing modes and not using a memory source operand. And not letting the compiler pick which registers to use.
Just fixing the asm syntax so it compiles will go from having an obvious compile-time bug to a subtle and dangerous runtime bug that might only be visible with optimization enabled.
See https://stackoverflow.com/tags/inline-assembly/info for more about inline asm. But again, there is basically zero reason to use it for most SIMD because you can get the compiler to make asm that's just as efficient as what you can do by hand, and more efficient than this.
Most AVX512 instructions use 3+ operands, i.e. you need to add an additional operand, the destination register (it can be the same as one of the other operands).
This is also true for AVX2 version, see https://www.felixcloutier.com/x86/pxor:
VPXOR ymm1, ymm2, ymm3/m256
VPXORD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Note that the above is Intel syntax and would roughly translate into *mm1 = *mm2 ^ *mm3; in your case I guess you wanted to use "vpxorq %%zmm1, %%zmm0, %%zmm0;\n"
Be advised that using inline assembly is generally bad practice, reserved for really special occasions. SIMD programming is better (faster, easier) done using intrinsics, which are supported by all major compilers. You can browse them here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I want to create a function for adding two 16-bit integers with overflow detection. I have a generic variant written in portable C, but the generic variant is not optimal for x86 targets, because the CPU internally calculates the overflow flag when executing ADD/SUB/etc. Of course, there is __builtin_add_overflow(), but in my case it generates some boilerplate.
So I write the following code:
#include <cstdint>
struct result_t
{
uint16_t src;
uint16_t dst;
uint8_t of;
};
static void add_u16_with_overflow(result_t& r)
{
char of, cf;
asm (
" addw %[dst], %[src] "
: [dst] "+mr"(r.dst)//, "=#cco"(of), "=#ccc"(cf)
: [src] "imr" (r.src)
: "cc"
);
asm (" seto %0 " : "=rm" (r.of) );
}
uint16_t test_add(uint16_t a, uint16_t b)
{
result_t r;
r.src = a;
r.dst = b;
add_u16_with_overflow(r);
add_u16_with_overflow(r);
return (r.dst + r.of); // use r.dst and r.of for prevent discarding
}
I've played with https://godbolt.org/g/2mLF55 (gcc 7.2 -O2 -std=c++11) and it produces:
test_add(unsigned short, unsigned short):
seto %al
movzbl %al, %eax
addw %si, %di
addw %si, %di
addl %esi, %eax
ret
So, seto %0 is reordered. It seems gcc thinks there is no dependency between the two consecutive asm() statements, and the "cc" clobber has no effect on flag dependencies.
I can't use volatile, because seto %0 (or the whole function) can be (and should be) optimized out if the result (or some part of it) is not used.
I can add a dependency on r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) );, and the reordering will not happen. But that is not the "right thing", and the compiler can still insert code that changes the flags (but not r.dst) between the add and seto statements.
Is there way to say "this asm() statement change some cpu flags" and "this asm() use some cpu flags" for dependency between statement and prevent reordering?
I haven't looked at gcc's output for __builtin_add_overflow, but how bad is it? @David's suggestion to use it, along with https://gcc.gnu.org/wiki/DontUseInlineAsm, is usually good advice, especially if you're worried about how this will optimize. asm defeats constant propagation, among other things.
Also, if you are going to use inline asm, note that AT&T syntax uses add %[src], %[dst] operand order. See the tag wiki for details, unless you're always going to build your code with -masm=intel.
Is there way to say "this asm() statement change some cpu flags" and "this asm() use some cpu flags" for dependency between statement and prevent reordering?
No. Put the flag-consuming instruction (seto) inside the same asm block as the flag-producing instruction. An asm statement can have as many input and output operands as you like, limited only by register-allocation difficulty (but multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.
I was going to suggest that if you want multiple flag outputs from one instruction, you could use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. This is often inconvenient and seems like a bad design choice, because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS, so OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. Since that isn't the case, setc + seto are probably better than pushf / reload, though the latter is worth considering.
Even if there was syntax for flag-input (like there is for flag-output), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov) between your two separate asm statements.
You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency so it's not a big bottleneck to put a dependent instruction right after it.
And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto doesn't support output operands. You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload sucks more than using setc or seto to produce an input for a compiler-generated test/jnz.
If you didn't also need an output, you could put C labels on a return true and a return false statement, which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if(). e.g. see how Linux does it: (with extra complicating factors in these two examples I found): setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch.)
In my C++ / C project I want to set the stack pointer equal to the base pointer... Intuitively I would use something like this:
asm volatile(
"movl %%ebp %%esp"
);
However, when I execute this, I get this error message:
Error: bad register name `%%ebp %%esp'
I use gcc / g++ version 4.9.1 compiler.
I don't know whether I need to set a specific g++ or gcc flag, though... There should be a way to manipulate the esp and ebp registers, but I just don't know the right way to do it.
Does anybody know how to manipulate these two registers in C++? Maybe I should do it with hex opcodes?
You're using GNU C Basic Asm syntax (no input/output/clobber constraints), so % is not special and therefore, it shouldn't be escaped.
It's only in Extended Asm (with constraints) that % needs to be escaped to end up with a single % in front of hard-coded register names in the compiler's asm output (as required in AT&T syntax).
You also have to separate the operands with a comma:
asm volatile(
"movl %ebp, %esp"
);
asm statements with no output operands are implicitly volatile, but it doesn't hurt to write an explicit volatile.
Note, however, that putting this statement inside a function will likely interfere with the way the compiler handles the stack frame.
I have a simple (but performance critical) algorithm in C (embedded in C++) to manipulate a data buffer... the algorithm 'naturally' uses 64-bit big-endian register values - and I'd like to optimise this using assembler to gain direct access to the carry flag and BSWAP and, hence, avoid having to manipulate the 64-bit values one byte at a time.
I want the solution to be portable between OS/Compilers - minimally supporting GNU g++ and Visual C++ - and between Linux and Windows respectively. For both platforms, obviously, I'm assuming a processor that supports the x86-64 instruction set.
I've found this document about inline assembler for MSVC/Windows, and several fragments via Google detailing an incompatible syntax for g++. I accept that I might need to implement this functionality separately in each dialect. I've not been able to find sufficiently detailed documentation on syntax/facilities to tackle this development.
What I'm looking for is clear documentation detailing the facilities available to me with both the MS and GNU tool sets. While I wrote some 32-bit assembler many years ago, I'm rusty; I'd benefit from a concise document detailing what facilities are available at the assembly level.
A further complication is that I'd like to compile for Windows using Visual C++ Express Edition 2010... I recognise that this is a 32-bit compiler, but, I wondered, is it possible to embed 64-bit assembly into its executables? I only care about 64-bit performance in the section I plan to hand-code.
Can anyone offer any pointers (please pardon the pun...)?
Just to give you a taste of the obstacles that lie in your path, here is a simple inline assembler function, in two dialects. First, the Borland C++ Builder version (I think this compiles under MSVC++ too):
int BNASM_AddScalar (DWORD* result, DWORD x)
{
int carry = 0 ;
__asm
{
mov ebx,result
xor eax,eax
mov ecx,x
add [ebx],ecx
adc carry,eax // Return the carry flag
}
return carry ;
}
Now, the g++ version:
int BNASM_AddScalar (DWORD* result, DWORD x)
{
int carry = 0 ;
asm volatile (
" addl %%ecx,(%%edx)\n"
" adcl $0,%%eax\n" // Return the carry flag
: "+a"(carry) // Output (and input): carry in eax
: "d"(result), "c"(x) // Input: result in edx and x in ecx
) ;
return carry ;
}
As you can see, the differences are major. And there is no way around them. These are from a large integer arithmetic library that I wrote for a 32-bit environment.
As for embedding 64-bit instructions in a 32-bit executable, I think this is forbidden. As I understand it, a 32-bit executable runs in 32-bit mode, and any 64-bit instruction just generates a trap.
Unfortunately, MSVC++ doesn't support inline assembly in 64-bit code and it does not support __emit either. With MSVC++ you should either implement pieces of code in separate .asm files and compile and link them with the rest of the code or resort to dirty hacks like the following (implemented for 32-bit code as proof of concept):
#include <windows.h>
#include <stdio.h>
unsigned char BswapData[] =
{
0x0F, 0xC9, // bswap ecx
0x89, 0xC8, // mov eax, ecx
0xC3 // ret
};
unsigned long (__fastcall *Bswap)(unsigned long) =
(unsigned long (__fastcall *)(unsigned long))BswapData;
int main(void)
{
DWORD dummy;
VirtualProtect(BswapData, sizeof(BswapData), PAGE_EXECUTE_READWRITE, &dummy);
printf("0x%lX\n", Bswap(0x10203040));
return 0;
}
Output: 0x40302010
I think you should be able to do the same not only with gcc but also on Linux, with about two minor differences (the equivalent of VirtualProtect() is one, calling conventions are the other).
EDIT: Here's how BSWAP can be done for 64-bit values in 64-bit mode on Windows (untested):
unsigned char BswapData64[] =
{
0x48, 0x0F, 0xC9, // bswap rcx
0x48, 0x89, 0xC8, // mov rax, rcx
0xC3 // ret
};
unsigned long long (*Bswap64)(unsigned long long) =
(unsigned long long (*)(unsigned long long))BswapData64;
And the rest is trivial.
There are many functions available for swapping endianness, for example from BSD sockets:
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
64-bit swaps are less portable:
unsigned __int64 _byteswap_uint64(unsigned __int64); // Visual C++
int64_t __builtin_bswap64(int64_t x);                // GCC
Don't resort to assembly every time something is not expressible in standard C++.
By definition, asm statements in C or C++ are not portable, because they are tied to a particular instruction set. Don't expect your code to run on ARM if your assembler statements are for x86.
Besides, even on the same hardware platform, like x86-64 (that is, modern PCs), different systems (e.g. Linux vs Windows) use different assembler syntaxes and different calling conventions, so you would need several variants of your code.
If you use GCC, it offers a lot of builtin functions which can help you. And a recent GCC (i.e. version 4.6 or later) can probably optimize your function quite efficiently.
If performance is very important, and if your system have a GPU (that is a powerful graphic card), you might consider recoding numerical kernels in OpenCL or in CUDA.
Inline assembler is not one of your possibilities: Win64 Visual C compilers do not support __asm; you'll need to use separate [m|y|n]asm-compiled files.
I've been messing around with the free Digital Mars Compiler at work (naughty, I know), and created some code to inspect compiled functions and look at the machine code for learning purposes, to see if I can learn anything valuable from how the compiler builds its functions. However, recreating the same method in MSVC++ has failed miserably, and the results I am getting are quite confusing. I have a function like this:
unsigned int __stdcall test()
{
return 42;
}
Then later I do:
unsigned char* testCode = (unsigned char*)test;
I can't seem to get the C++ static_cast to work in this case (it throws a compiler error)... hence the C-style cast, but that's beside the point... I've also tried using the reference &test, but that doesn't help.
Now, when I examine the contents of the memory pointed to by testCode I am confused because what I see doesn't even look like valid code, and even has a debug breakpoint stuck in there... it looks like this (target is IA-32):
0xe9, 0xbc, 0x18, 0x00, 0x00, 0xcc...
This is clearly wrong: 0xe9 is a relative jump instruction, and looking 0x18bc bytes away it looks like this:
0xcc, 0xcc, 0xcc...
i.e. memory initialised to the debug breakpoint opcode as expected for unallocated or unused memory.
Where as what I would expect from a function returning 42 would be something like:
0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3
or at least some flavour of mov followed by a ret (0xc2, 0xc3, 0xca or 0xcb) a little further down.
Is MSVC++ taking steps to prevent me from doing this sort of thing for security reasons, or am I doing something stupid and not realising it? This method seems to work fine using DMC as the compiler...
I'm also having trouble going the other way (executing bytes), but I suspect that the underlying cause is the same.
Any help or tips would be greatly appreciated.
I can only guess, but I'm pretty sure you are inspecting a debug build.
In debug mode, the MSVC++ compiler routes all calls through jump stubs. This means that every function starts with a jump to the real function, which is exactly what you are facing here.
The surrounding 0xCC bytes are indeed breakpoint instructions, meant to trip an attached debugger in case you execute code where you shouldn't.
Try the same with a release build. That should work as expected.
Edit:
This is actually affected by the linker setting /INCREMENTAL. The reason that the effect you're describing doesn't show up in release builds is that these jump stubs are simply optimized away if any kind of optimization is turned on (which is of course usually the case for release builds).
For your cast you want:
unsigned char* testCode = reinterpret_cast<unsigned char*>( test );
Switch Debug Information Format from 'Program Database for Edit & Continue (/ZI)' to 'Program Database (/Zi)' in Project -> Properties -> C/C++ -> General. I believe it's that setting which causes the compiler to insert jump code so the debugger can rebuild a function and hot-patch it in while the program is running. Probably turn off 'Enable Minimal Rebuild' as well.
A much simpler way of inspecting the code in MSVC is to set a breakpoint and inspect the disassembly (right-click on the line and select 'Go To Disassembly' from the pop-up menu). It annotates the disassembly with the source code, so you can see what each line is compiled to.
If you want to look at assembly and machine code for a given compiled function, it'll be easier to supply the /FAcs command line option to the compiler and look at the ensuing .asm file.
I'm not sure what the defined behavior is for casting a function pointer to a byte stream (it may not even work properly), but one possible source of additional confusion is that x86 instructions are variable-length, and multi-byte values are stored little-endian, too.
If this is with incremental linking turned on, then what you're seeing is a jmp [destination]. You can run the debugger and see what the disassembly is to verify as well.