Determine processor support for SSE2? - c++

I need to determine processor support for SSE2 prior to installing some software. From what I understand, I came up with this:
bool TestSSE2(TCHAR * szErrorMsg)  // TCHAR so the _tcscpy_s/_T() calls compile under both ANSI and UNICODE
{
    __try
    {
        __asm
        {
            xorpd xmm0, xmm0   // execute an SSE2 instruction
        }
    }
    #pragma warning (suppress: 6320)
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        if (_exception_code() == STATUS_ILLEGAL_INSTRUCTION)
        {
            _tcscpy_s(szErrorMsg, MSGSIZE, _T("Streaming SIMD Extensions 2 (SSE2) is not supported by the CPU.\r\nUnable to launch APP"));
            return false;
        }
        // Some other exception occurred while probing for SSE2.
        _tcscpy_s(szErrorMsg, MSGSIZE, _T("Unexpected exception while testing for SSE2 support.\r\nUnable to launch APP"));
        return false;
    }
    return true;
}
Would this work? I'm not really sure how to test it, since my CPU supports SSE2, so the function never returns false for me.
How do I determine processor support for SSE2?

I found this one by accident in the MSDN:
BOOL sse2supported = ::IsProcessorFeaturePresent( PF_XMMI64_INSTRUCTIONS_AVAILABLE );
Windows-only, but if you are not interested in anything cross-platform, very simple.

Call CPUID with eax = 1 to load the feature flags into edx. Bit 26 is set if SSE2 is available. Some code for demonstration purposes, using MSVC++ inline assembly (x86 only and not portable!):
inline unsigned int get_cpu_feature_flags()
{
    unsigned int features;
    __asm
    {
        // Save registers
        push    eax
        push    ebx
        push    ecx
        push    edx
        // Get the feature flags (eax=1) from edx
        mov     eax, 1
        cpuid
        mov     features, edx
        // Restore registers
        pop     edx
        pop     ecx
        pop     ebx
        pop     eax
    }
    return features;
}
// Bit 26 for SSE2 support
static const bool cpu_supports_sse2 = (get_cpu_feature_flags() & 0x04000000) != 0;

The most basic way to check for SSE2 support is to use the CPUID instruction (on platforms where it is available), either via inline assembly or via compiler intrinsics.

You can use the __cpuid intrinsic from <intrin.h>. All of it is explained in the MSDN.
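For example, a minimal sketch using that intrinsic (the helper name check_sse2 is mine):
#include <intrin.h>

bool check_sse2()
{
    int info[4];                       // eax, ebx, ecx, edx
    __cpuid(info, 1);                  // leaf 1: processor info and feature bits
    return (info[3] & (1 << 26)) != 0; // edx bit 26 = SSE2
}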

Related

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        npad    1
        mov     eax, DWORD PTR [rcx]
        npad    1
        add     eax, edx
        ret     0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?
p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.
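For example, assuming the Visual C++ command-line driver (the file name is hypothetical):
cl /O2 /volatileMetadata- atomic_load.cpp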

Error : Invalid Character '(' in mnemonic

Hi, I am trying to compile the assembly code below on Linux using gcc 7.5, but I keep getting the error:
Error : Invalid Character '(' in mnemonic
bool InterlockedCompareAndStore128(int *dest, int *newVal, int *oldVal)
{
    asm(
        "push %rbx\n"
        "push %rdi\n"
        "mov %rcx, %rdi\n"      // ptr to dest -> RDI
        "mov 8(%rdx), %rcx\n"   // newVal -> RCX:RBX
        "mov (%rdx), %rbx\n"
        "mov 8(%r8), %rdx\n"    // oldVal -> RDX:RAX
        "mov (%r8), %rax\n"
        "lock (%rdi), cmpxchg16b\n"
        "mov $0, %rax\n"
        "jnz exit\n"
        "inc1 %rax\n"
        "exit:;\n"
        "pop %rdi\n"
        "pop %rbx\n"
    );
}
Can anyone suggest how to resolve this? I have checked many online links and tutorials on assembly but could not pin down the exact issue.
Thanks for the help in advance.
In Windows I could see the implementation of the above function as:
function InterlockedCompareExchange128;
asm
    .PUSHNV RBX
    MOV     R10,RCX
    MOV     RBX,R8
    MOV     RCX,RDX
    MOV     RDX,[R9+8]
    MOV     RAX,[R9]
    LOCK CMPXCHG16B [R10]
    MOV     [R9+8],RDX
    MOV     [R9],RAX
    SETZ    AL
    MOVZX   EAX, AL
end;
For PUSHNV, I could not find anything similar on Linux. So, basically, I am trying to implement the same functionality in C++ on Linux.
The question here was about Invalid Character '(' in mnemonic which the other answer addresses.
However, the OP's code has a number of issues beyond that problem. Here are (what I think are) two better approaches to this problem. Note that I've changed the order of the parameters and made them const.
This one continues to use inline asm, but uses Extended asm instead of Basic. While I'm of the don't use inline asm school of thought, this might be useful or at least educational.
bool InterlockedCompareAndStore128B(__int64 *dest, const __int64 *oldVal, const __int64 *newVal)
{
    bool result;
    __int64 ovl = oldVal[0];
    __int64 ovh = oldVal[1];
    asm volatile ("lock cmpxchg16b %[ptr]"
                  : "=@ccz" (result), [ptr] "+m" (*dest),
                    "+d" (ovh), "+a" (ovl)
                  : "c" (newVal[1]), "b" (newVal[0])
                  : "cc", "memory");
    // cmpxchg16b changes rdx:rax to the current value in dest. Useful if you need
    // to loop until you succeed, but OP's code doesn't save the values, so I'm
    // just following that spec.
    //oldVal[0] = ovl;
    //oldVal[1] = ovh;
    return result;
}
In addition to solving the problems with the original code, it's also inlineable and shorter. The constraints likely make it harder to read, but the fact that there's only 1 line of asm might help offset that. If you want to understand what the constraints mean, check out this page (scroll down to x86 family) and the description of flag output constraints (again, scroll down for x86 family).
As an alternative, this code uses a gcc builtin and allows the compiler to generate the appropriate asm instructions. Note that this must be built with -mcx16 for best results.
bool InterlockedCompareAndStore128C(__int128 *dest, const __int128 *oldVal, const __int128 *newVal)
{
    // While a sensible person would use __atomic_compare_exchange_n and let gcc generate
    // cmpxchg16b, gcc decided they needed to turn this into a big hairy function call:
    // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
    // In short, if someone wants to compare/exchange against readonly memory, you can't just
    // use cmpxchg16b cuz it would crash. Why would anyone try to exchange memory that can't
    // be written to? Apparently because it's expected to *not* crash if the compare fails
    // and nothing gets written. So no one gets to use that 1 line instruction and everyone
    // gets an entire routine (that uses a mutex instead of being lock-free) to support this
    // absurd corner case. Sounds dumb to me, but that's where things stand as of 2021-05-07.
    // Use the legacy function instead.
    bool b = __sync_bool_compare_and_swap(dest, *oldVal, *newVal);
    return b;
}
For the kibitzers in the crowd, here's the code generated by -m64 -O3 -mcx16 for that last one:
InterlockedCompareAndStore128C(__int128*, __int128 const*, __int128 const*):
        mov     rcx, rdx
        push    rbx
        mov     rax, QWORD PTR [rsi]
        mov     rbx, QWORD PTR [rcx]
        mov     rdx, QWORD PTR [rsi+8]
        mov     rcx, QWORD PTR [rcx+8]
        lock cmpxchg16b XMMWORD PTR [rdi]
        pop     rbx
        sete    al
        ret
If someone wants to fiddle, here's the godbolt link.
There are a number of problems with this code, and I'm not convinced I'm doing you any favors by telling you how to fix the specific problem.
But the short answer is that
"lock (%rdi), cmpxchg16b\n"
should be
"lock cmpxchg16b (%rdi)\n"
Tada, now it compiles. Well, it would if inc1 were a real instruction.
But I can't help but notice that the pointers here are int *, which is 4 bytes, not 16. And that this function is not declared as naked. And using Extended asm would save you from having to push all these registers around by hand, making this code a lot slower than it needs to be.
But most of all, you should really use the builtins, like __atomic_compare_exchange because inline asm is error prone, not portable, and really hard to maintain.
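For reference, a minimal sketch of that builtin (subject to the gcc __int128 caveats discussed in the other answer; assumes -mcx16):
bool cas128(__int128 *dest, __int128 *expected, __int128 desired)
{
    // On failure, *expected is updated with the value currently in *dest.
    return __atomic_compare_exchange_n(dest, expected, desired,
                                       false,             // strong CAS
                                       __ATOMIC_SEQ_CST,  // success memory order
                                       __ATOMIC_SEQ_CST); // failure memory order
}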

Compiler optimization of static constexpr

Given the following C++ code:
#include <stdio.h>

static constexpr int x = 1;

void testfn() {
    if (x == 2)
        printf("This is test.\n");
}

int main() {
    for (int a = 0; a < 10; a++)
        testfn();
    return 0;
}
Visual Studio 2019 produces the following Debug build assembly (viewed using Approach 1 of accepted answer at: How to view the assembly behind the code using Visual C++?)
int main() {
00EC1870 push ebp
00EC1871 mov ebp,esp
00EC1873 sub esp,0CCh
00EC1879 push ebx
00EC187A push esi
00EC187B push edi
00EC187C lea edi,[ebp-0CCh]
00EC1882 mov ecx,33h
00EC1887 mov eax,0CCCCCCCCh
00EC188C rep stos dword ptr es:[edi]
00EC188E mov ecx,offset _6D4A0457_how_compiler_treats_staticconstexpr#cpp (0ECC003h)
00EC1893 call #__CheckForDebuggerJustMyCode#4 (0EC120Dh)
for (int a = 0; a < 10; a++)
00EC1898 mov dword ptr [ebp-8],0
00EC189F jmp main+3Ah (0EC18AAh)
00EC18A1 mov eax,dword ptr [ebp-8]
00EC18A4 add eax,1
00EC18A7 mov dword ptr [ebp-8],eax
00EC18AA cmp dword ptr [ebp-8],0Ah
00EC18AE jge main+47h (0EC18B7h)
testfn();
00EC18B0 call testfn (0EC135Ch)
00EC18B5 jmp main+31h (0EC18A1h)
return 0;
00EC18B7 xor eax,eax
}
As can be seen in the assembly, possibly because this is a Debug build, there are pointless references to the for loop and testfn in main. I would have hoped that they would not appear in the assembly code at all, given that the printf in testfn can never be hit since static constexpr int x = 1.
I have 2 questions:
(1) Perhaps in the Release build, the for loop is optimized away. How can I check this? Viewing the release-build assembly code does not work for me even when using Approach 2 specified at: How to view the assembly behind the code using Visual C++?. The file with the assembly code is not produced at all.
(2) When using static constexpr int/double/char as opposed to #defines, under what circumstances is one guaranteed that the former does not involve any unnecessary overhead (runtime computations/evaluations)? #defines, though much maligned, seem to offer a much greater guarantee than static constexpr in this regard.
The issue here is that you are compiling the code as a debug build. If you want sane asm, compile as release instead. A debug build is meant to help you confirm the logic of the underlying code. The logic of your code is that it should call testfn() 10 times; as a result, you should be able to place a breakpoint on that method and hit it at the correct point in the execution. In a release build, that breakpoint would never be hit (because the call would have been optimised away).
In your case however, it's entirely incorrect to say that the constexpr is being ignored. You may notice that there are no calls to printf() in the generated asm, so the compiler has correctly identified that if (x == 2) can never be true, and has removed it. However, if the compiler removed the call to testfn() completely, your breakpoint would never be hit, and the debugger would basically be useless.
Don't look at the output of a debug build and imagine it tells you anything useful about the code or compiler. You should expect the code to be deliberately de-optimised.
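Regarding question (1): assuming the MSVC command-line tools, one way to get a release listing is to ask the compiler for it directly with the /FA switch (file name hypothetical):
cl /O2 /FA main.cpp
This writes main.asm alongside the object file; Compiler Explorer (godbolt.org) is another convenient way to inspect optimized output.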

Intel DRNG only giving 4 bytes of data instead of 8

I am trying to implement Intel DRNG in C++.
According to its guide, to generate a 64-bit unsigned long long the code should be:
int rdrand64_step (unsigned long long *rand)
{
    unsigned char ok;
    asm volatile ("rdrand %0; setc %1"
                  : "=r" (*rand), "=qm" (ok));
    return (int) ok;
}
However, this function is only giving me 32 bits of output, as shown:
bd4a749d
d461c2a8
8f666eee
d1d5bcc4
c6f4a412
Any reason why this is happening?
More info: the IDE I'm using is Code::Blocks.
Use int _rdrand64_step (unsigned __int64* val) from immintrin.h instead of writing inline asm. You don't need it, and there are many reasons (including this one) to avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm
In this case, the problem is that you're probably compiling 32-bit code, so of course 64-bit rdrand is not encodeable. But the way you used inline-asm ended up giving you a 32-bit rdrand, and storing garbage from another register for the high half.
gcc -Wall -O3 -m32 -march=ivybridge (and similar for clang) produces (on Godbolt):
In function 'rdrand64_step':
7 : <source>:7:1: warning: unsupported size for integer register

rdrand64_step:
    push    ebx
    rdrand  ecx; setc al
    mov     edx, DWORD PTR [esp+8]    # load the pointer arg
    movzx   eax, al
    mov     DWORD PTR [edx], ecx
    mov     DWORD PTR [edx+4], ebx    # store garbage in the high half of *rand
    pop     ebx
    ret
I guess you called this function with a caller that happened to have ebx=0. Or else you used a different compiler that did something different. Maybe something else happens after inlining. If you looked at disassembly of what you actually compiled, you could explain exactly what's going on.
If you'd used the intrinsic, you would have gotten error: '_rdrand64_step' was not declared in this scope, because immintrin.h only declares it in 64-bit mode (and with a -march setting that implies rdrand support, or -mrdrnd. Best option: use -march=native if you're building on the target machine).
You'd also get significantly more efficient code for a retry loop, at least with clang:
unsigned long long use_intrinsic(void) {
    unsigned long long rand;
    while(!_rdrand64_step(&rand));   // TODO: retry limit in case RNG is broken.
    return rand;
}

use_intrinsic:                     # @use_intrinsic
.LBB2_1:                           # =>This Inner Loop Header: Depth=1
    rdrand  rax
    jae     .LBB2_1
    ret
That avoids setcc and then testing that, which is of course redundant. gcc6 has syntax for returning flag results from inline asm. You can also use asm goto and put a jcc inside the asm, jumping to a label: return 1; target or falling through to a return 0. (The inline-asm docs have an example of doing this. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. See also the inline-assembly tag wiki.)
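For instance, a minimal sketch using that flag-output syntax (assumes gcc 6+ or recent clang in 64-bit mode; the function name is mine):
int rdrand64_flagout(unsigned long long *rand)
{
    int ok;
    // "=@ccc" exposes the carry flag directly as an output, so no setc is
    // needed; rdrand sets CF=1 on success.
    asm volatile ("rdrand %0" : "=r" (*rand), "=@ccc" (ok));
    return ok;
}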
Using your inline-asm, clang (in 64-bit mode) compiles it to:
use_asm:
.LBB1_1:
    rdrand  rax
    setb    byte ptr [rsp - 1]
    cmp     byte ptr [rsp - 1], 0
    je      .LBB1_1
    ret
(clang makes bad decisions for constraints with multiple options that include memory.)
gcc7.2 and ICC17 actually end up with better code from the asm than from the intrinsic. They use cmovc to get a 0 or 1 and then test that. It's pretty dumb. But that's a gcc/ICC missed optimization that will hopefully be fixed.

Passing a pointer from C to assembly

I want to use a "test_and_set lock" assembly-language implementation with an atomic swap instruction in my C/C++ program.
class LockImpl
{
public:
    static void lockResource(DWORD resourceLock )
    {
        __asm
        {
        InUseLoop:
            mov     eax, 0          ; 0 = In Use
            xchg    eax, resourceLock
            cmp     eax, 0
            je      InUseLoop
        }
    }

    static void unLockResource(DWORD resourceLock )
    {
        __asm
        {
            mov     resourceLock, 1
        }
    }
};
This works but there is a bug in here.
The problem is that I want to pass DWORD *resourceLock instead of DWORD resourceLock.
So the question is: how do I pass a pointer from C/C++ to assembly and get it back?
Thanks in advance.
Regards,
-Jay.
P.S. This is done to avoid context switches between user space and kernel space.
If you're writing this for Windows, you should seriously consider using a critical section object. The critical section API functions are optimised such that they won't transition into kernel mode unless they really need to, so the normal case of no contention has very little overhead.
The biggest problem with your spin lock is that if you're on a single CPU system and you're waiting for the lock, then you're using all the cycles you can and whatever is holding the lock won't even get a chance to run until your timeslice is up and the kernel preempts your thread.
Using a critical section will be more successful than trying to roll your own user mode spin lock.
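For illustration, a minimal sketch of the critical section API (Windows-only; the names g_cs etc. are mine, error handling omitted):
#include <windows.h>

CRITICAL_SECTION g_cs;

void init_lock()    { InitializeCriticalSection(&g_cs); }
void lock()         { EnterCriticalSection(&g_cs); }  // avoids a kernel transition when uncontended
void unlock()       { LeaveCriticalSection(&g_cs); }
void destroy_lock() { DeleteCriticalSection(&g_cs); }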
In terms of your actual question, it's pretty simple: just change the function headers to use volatile DWORD *resourceLock, and change the assembly lines that touch resourceLock to use indirection:
    mov     ecx, dword ptr [resourceLock]
    xchg    eax, dword ptr [ecx]
and
    mov     ecx, dword ptr [resourceLock]
    mov     dword ptr [ecx], 1    ; note: the lock prefix is not valid on mov; an aligned dword store is already atomic
However, note that you've got a couple of other problems looming:
You say you're developing this on Windows, but want to switch to Linux. However, you're using MSVC-specific inline assembly - this will have to be ported to gcc-style when you move to Linux (in particular that involves switching from Intel syntax to AT&T syntax). You will be much better off developing with gcc even on Windows; that will minimise the pain of migration (see mingw for gcc for Windows).
Greg Hewgill is absolutely right about spinning uselessly, stopping the lock-holder from getting CPU. Consider yielding the CPU if you've been spinning for too long.
On a multiprocessor x86, you might well have a problem with memory loads and stores being re-ordered around your lock - mfence instructions in the lock and unlock procedures might be necessary.
Really, if you're worrying about locking that means you're using threading, which probably means you're using the platform-specific threading APIs already. So use the native synchronisation primitives, and switch out to the pthreads versions when you switch to Linux.
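For example, a minimal pthreads equivalent of the same lock/unlock pattern (POSIX-only; the variable name g_lock is mine):
#include <pthread.h>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

void lockResource()   { pthread_mutex_lock(&g_lock); }
void unLockResource() { pthread_mutex_unlock(&g_lock); }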
Apparently, you are compiling with MSVC using inline assembly blocks in your C++ code.
As a general remark, you should really use compiler intrinsics, since inline assembly has no future: it is no longer supported by the MS compilers when targeting x64.
If you need functions fine-tuned in assembly, you will have to implement them in separate files.
The main problems with the original version in the question is that it needs to use register indirect addressing and take a reference (or pointer parameter) rather than a by-value parameter for the lock DWORD.
Here's a working solution for Visual C++. EDIT: I have worked offline with the author and we have verified the code in this answer works in his test harness correctly.
But if you're using Windows, you should really be using the Interlocked API (i.e. InterlockedExchange).
Edit: As noted by CAF, lock xchg is not required because xchg automatically asserts a BusLock.
I also added a faster version that does a non-locking read before attempting to do the xchg. This significantly reduces BusLock contention on the memory interface. The algorithm can be sped up quite a bit more (in a contentious multithreaded case) by doing backoffs (yield, then sleep) for locks held a long time. For the single-threaded-CPU case, using an OS lock that sleeps immediately on held locks will be fastest.
class LockImpl
{
    // This is a simple SpinLock
    // 0 - in use / busy
    // 1 - free / available
public:
    static void lockResource(volatile DWORD &resourceLock )
    {
        __asm
        {
            mov     ebx, resourceLock
        InUseLoop:
            mov     eax, 0          ; 0 = In Use
            xchg    eax, [ebx]
            cmp     eax, 0
            je      InUseLoop
        }
    }

    static void lockResource_FasterVersion(DWORD &resourceLock )
    {
        __asm
        {
            mov     ebx, resourceLock
        InUseLoop:
            mov     eax, [ebx]      ;// Read without BusLock
            cmp     eax, 0
            je      InUseLoop       ;// Retry Read if Busy
            mov     eax, 0
            xchg    eax, [ebx]      ;// XCHG with BusLock
            cmp     eax, 0
            je      InUseLoop       ;// Retry if Busy
        }
    }

    static void unLockResource(volatile DWORD &resourceLock)
    {
        __asm
        {
            mov     ebx, resourceLock
            mov     [ebx], 1
        }
    }
};

// A little testing code here
volatile DWORD aaa = 1;

void test()
{
    LockImpl::lockResource(aaa);
    LockImpl::unLockResource(aaa);
}
You should be using something like this:
volatile LONG resourceLock = 1;

if(InterlockedCompareExchange(&resourceLock, 0, 1) == 1) {
    // success!
    // do something, and then
    resourceLock = 1;
} else {
    // failed, try again later
}
See InterlockedCompareExchange.
Look at your compiler documentation to find out how to print the generated assembly language for functions.
Print the assembly language for this function:
static void unLockResource(DWORD resourceLock )
{
    resourceLock = 0;
    return;
}
This may not work because the compiler can optimize the function and remove all the code. You should change the above function to pass a pointer to resourceLock and then have the function set the lock. Print the assembly of this working function.
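For instance, that pointer-taking variant might look like this (a sketch; as an externally visible function, the store through the pointer can no longer be removed as dead code):
static void unLockResource(DWORD *resourceLock)
{
    *resourceLock = 0;   // store through the pointer so the write is observable
    return;
}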
I already provided a working version which answered the original poster's question, both on how to get the parameters passed in ASM and on how to get his lock working correctly.
Many other answers have questioned the wisdom of using ASM at all and mentioned that either intrinsics or OS calls should be used. The following works as well and is a C++ version of my ASM answer. There is a snippet of ASM in there that only needs to be used if your platform does not support InterlockedExchange().
class LockImpl
{
    // This is a simple SpinLock
    // 0 - in use / busy
    // 1 - free / available
public:
#if 1
    static DWORD MyInterlockedExchange(volatile DWORD *variable, DWORD newval)
    {
        // InterlockedExchange() uses LONG / He wants to use DWORD
        return((DWORD)InterlockedExchange(
            (volatile LONG *)variable, (LONG)newval));
    }
#else
    // You can use this if you don't have InterlockedExchange()
    // on your platform. Otherwise no ASM is required.
    static DWORD MyInterlockedExchange(volatile DWORD *variable, DWORD newval)
    {
        DWORD old;
        __asm
        {
            mov     ebx, variable
            mov     eax, newval
            xchg    eax, [ebx]      ;// XCHG with BusLock
            mov     old, eax
        }
        return(old);
    }
#endif

    static void lockResource(volatile DWORD &resourceLock )
    {
        DWORD oldval;
        do
        {
            while(0 == resourceLock)
            {
                // Could have a yield, spin count, exponential
                // backoff, OS CS fallback, etc. here
            }
            oldval = MyInterlockedExchange(&resourceLock, 0);
        } while (0 == oldval);
    }

    static void unLockResource(volatile DWORD &resourceLock)
    {
        // _ReadWriteBarrier() is a VC++ intrinsic that generates
        // no instructions / only prevents compiler reordering.
        // GCC uses __sync_synchronize() or __asm__ ( :::"memory" )
        _ReadWriteBarrier();
        resourceLock = 1;
    }
};
};