I am writing a program, and at this point I need to make it efficient.
I am targeting the Haswell microarchitecture (64-bit) and compiling with g++.
The objective is to make use of the ADC instruction until the loop ends.
//I removed all the carry handling from this preview, to keep it simple
size_t anum = ap[i], bnum = bp[i];
unsigned carry;
// The Carry flag is set here with a common addition
anum += bnum;
cnum[0]= anum;
carry = check_Carry(anum, bnum);
for (int i=1; i<n; i++){
anum = ap[i];
bnum = bp[i];
//I want to remove this line and insert the __asm__ block
anum += (bnum + carry);
carry = check_Carry(anum, bnum);
//This block is not working
__asm__(
"movq -64(%rbp), %rcx;"
"adcq %rdx, %rcx;"
"movq %rsi, -88(%rbp);"
);
cnum[i] = anum;
}
Is the CF set only in the first addition, or is it set every time I do an ADC instruction?
I think the problem is that the CF is lost every time the loop iterates. If that is the problem, how can I solve it?
You use asm like this in the gcc family of compilers:
int src = 1;
int dst;
asm ("mov %1, %0\n\t"
"add $1, %0"
: "=r" (dst)
: "r" (src));
printf("%d\n", dst);
That is, you can refer to variables, rather than guessing where they might be in memory/registers.
[Edit] On the subject of carries: It's not completely clear what you are wanting, but: ADC takes the CF as an input and produces it as an output. However, MANY other instructions muck with the flags (such as, likely, those used by the compiler to construct your for-loop), so you probably need to use some instructions to save/restore the CF (perhaps LAHF/SAHF).
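If the goal is simply a multi-word add, one alternative is to keep the carry in a C variable and let the compiler pick the instructions. A minimal sketch, assuming GCC 5+ or clang for __builtin_add_overflow (recent compilers will often turn this pattern into an ADC chain on their own):
#include <cstddef>

// Adds the n-limb numbers in ap and bp into cnum, propagating the carry in C.
void add_bignum(const unsigned long long* ap, const unsigned long long* bp,
                unsigned long long* cnum, size_t n)
{
    unsigned long long carry = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned long long sum;
        bool c1 = __builtin_add_overflow(ap[i], bp[i], &sum); // a + b
        bool c2 = __builtin_add_overflow(sum, carry, &sum);   // ... + carry-in
        carry = c1 | c2; // at most one of the two adds can carry out
        cnum[i] = sum;
    }
}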
I'm writing this code to make a candlestick chart and I want a red box if the open price for the day is greater than the close. I also want the box to be green if the close is higher than the open price.
if(open > close) {
boxColor = red;
} else {
boxColor = green;
}
Pseudo code is easier than an English sentence for this.
So I wrote this code first and then tried to benchmark it, but I don't know how to get meaningful results.
for(int i = 0; i < history->close.size(); i++) {
auto open = history->open[i];
auto close = history->close[i];
int red = ((int)close - (int)open) >> ((int)sizeof(close) * 8);
int green = ((int)open - (int)close) >> ((int)sizeof(close) * 8);
gl::color(red,green,0);
gl::drawSolidRect( Rectf(vec2(i - 1, open), vec2(i + 1, close)) );
}
This is how I tried to benchmark it. Each run just shows 2ns. My main question to the community is this:
Can I actually make it faster by using a right shift and avoid a conditional branch?
#include <benchmark/benchmark.h>
#include <cstdlib> // rand, RAND_MAX
static void BM_red_noWork(benchmark::State& state) {
double open = (double)rand() / RAND_MAX;
double close = (double)rand() / RAND_MAX;
while (state.KeepRunning()) {
}
}
BENCHMARK(BM_red_noWork);
static void BM_red_fast_work(benchmark::State& state) {
double open = (double)rand() / RAND_MAX;
double close = (double)rand() / RAND_MAX;
while (state.KeepRunning()) {
int red = ((int)open - (int)close) >> sizeof(int) - 1;
}
}
BENCHMARK(BM_red_fast_work);
static void BM_red_slow_work(benchmark::State& state) {
double open = (double)rand() / RAND_MAX;
double close = (double)rand() / RAND_MAX;
while (state.KeepRunning()) {
int red = open > close ? 0 : 1;
}
}
BENCHMARK(BM_red_slow_work);
BENCHMARK_MAIN(); // generates a main() that runs the registered benchmarks
Thanks!
As I stated in my comment, the compiler will do these optimizations for you. Here is a minimal compilable example:
int main() {
volatile int a = 42;
if (a <= 0) {
return 0;
} else {
return 1;
}
}
The volatile is simply to prevent the optimizer from "knowing" the value of a; it forces the value to actually be read.
This was compiled with the command g++ -O3 -S test.cpp and it produces a file named test.s
Inside test.s is the assembly generated by the compiler (pardon AT&T syntax):
movl $42, -4(%rsp)
movl -4(%rsp), %eax
testl %eax, %eax
setg %al
movzbl %al, %eax
ret
As you can see, it is branchless. It uses testl to set the flags according to the value, setg to materialize the "greater than zero" result, and movzbl to zero-extend it into the proper register before returning.
It should be noted that this was adapted from your code. A much better way to write this is simply:
int main() {
volatile int a = 42;
return a > 0;
}
It also generates the same assembly.
This is likely to be better than anything readable you could write directly in C++. For instance, your code (hopefully corrected for bit arithmetic errors):
#include <climits> // CHAR_BIT

int main() {
volatile int a = 42;
return ~(a >> (sizeof(int) * CHAR_BIT - 1)) & 1;
}
Compiles to:
movl $42, -4(%rsp)
movl -4(%rsp), %eax
notl %eax
shrl $31, %eax
ret
Which is indeed very slightly smaller. But it's not significantly faster, especially not when you have a GL call right next to it. I'd rather spend 1-3 additional cycles to get readable code than have to scratch my head wondering what my coworker (or me from 6 months ago, which is essentially the same thing) did.
EDIT: It should be remarked that the compiler also optimized the bit arithmetic I wrote, because I wrote it less well than I could have. The assembly is actually (~a) >> 31, which is equivalent to the ~(a >> 31) & 1 that I wrote (at least in most implementations, with an unsigned integer; see comments for details).
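Incidentally, this is also why your benchmark reports ~2 ns for every variant: red is never used, so the optimizer deletes the loop body entirely and you are timing an empty loop. A minimal sketch of a fix, assuming the Google Benchmark library (and with the shift corrected to 31 bits); benchmark::DoNotOptimize forces the result to be materialized:
#include <benchmark/benchmark.h>
#include <cstdlib>

static void BM_red_fast_work(benchmark::State& state) {
    double open = (double)rand() / RAND_MAX;
    double close = (double)rand() / RAND_MAX;
    while (state.KeepRunning()) {
        int red = ((int)open - (int)close) >> (sizeof(int) * 8 - 1);
        benchmark::DoNotOptimize(red); // keep the computation alive
    }
}
BENCHMARK(BM_red_fast_work);
BENCHMARK_MAIN();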
The Xeon Phi Knights Landing cores have a fast exp2 instruction, vexp2pd (intrinsic _mm512_exp2a23_pd). The Intel C++ compiler can vectorize the exp function using the Short Vector Math Library (SVML) which comes with the compiler. Specifically, it calls the function __svml_exp8.
However, when I step through a debugger I don't see that __svml_exp8 uses the vexp2pd instruction. It is a complicated function with many FMA operations. I understand that vexp2pd is less accurate than exp, but with -fp-model fast=1 (the default) or -fp-model fast=2 I expect the compiler to use this instruction; however, it does not.
I have two questions.
Is there a way to get the compiler to use vexp2pd?
How do I safely override the call to __svml_exp8?
As to the second question this is what I have done so far.
//exp(x) = exp2(log2(e)*x)
extern "C" __m512d __svml_exp8(__m512d x) {
return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
Is this safe? Is there a better solution e.g. one that inlines the function? In the test code below this is about 3 times faster than if I don't override.
//https://godbolt.org/g/adI11c
//icpc -O3 -xMIC-AVX512 foo.cpp
#include <math.h>
#include <stdio.h>
#include <x86intrin.h>
extern "C" __m512d __svml_exp8(__m512d x) {
//exp(x) = exp2(log2(e)*x)
return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
void foo(double * __restrict x, double * __restrict y) {
__assume_aligned(x, 64);
__assume_aligned(y, 64);
for(int i=0; i<1024; i++) y[i] = exp(x[i]);
}
int main(void) {
double x[1024], y[1024];
for(int i=0; i<1024; i++) x[i] = 1.0*i;
for(int r=0; r<1000000; r++) foo(x,y);
double sum=0;
//for(int i=0; i<1024; i++) sum+=y[i];
for(int i=0; i<8; i++) printf("%f ", y[i]); puts("");
//printf("%lf",sum);
}
ICC will generate vexp2pd, but only under very relaxed math requirements, as specified by the targeted -fimf-* switches.
#include <math.h>
void vfoo(int n, double * a, double * r)
{
int i;
#pragma simd
for ( i = 0; i < n; i++ )
{
r[i] = exp(a[i]);
}
}
E.g. compile with -xMIC-AVX512 -fimf-domain-exclusion=1 -fimf-accuracy-bits=22
..B1.12:
vmovups (%rsi,%rax,8), %zmm0
vmulpd .L_2il0floatpacket.2(%rip){1to8}, %zmm0, %zmm1
vexp2pd %zmm1, %zmm2
vmovupd %zmm2, (%rcx,%rax,8)
addq $8, %rax
cmpq %r8, %rax
jb ..B1.12
Please be sure to understand the accuracy implications: not only is the end result only about 22 bits accurate, but vexp2pd also flushes any denormalized results to zero, irrespective of the FTZ/DAZ bits set in the MXCSR.
To the second question: "How do I safely override the call to __svml_exp8?"
Your approach is generally not safe. SVML routines are internal to Intel Compiler and rely on custom calling conventions, so a generic routine with the same name can potentially clobber more registers than a library routine would, and you may end up in a hard-to-debug ABI mismatch.
A better way of providing your own vector functions would be to utilize #pragma omp declare simd, e.g. see https://software.intel.com/en-us/node/524514, and possibly the vector_variant attribute if you prefer coding with intrinsics, see https://software.intel.com/en-us/node/523350. Just don't try to override standard math names or you'll get an error.
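For example, a minimal sketch (the name fast_exp and the simdlen(8) choice are mine, not anything mandated by the compiler): the pragma makes the compiler generate vector variants of the scalar function, which vectorized loops can then call instead of the SVML routine.
#include <math.h>

// Vector-enabled scalar function; compile with e.g. icpc -qopenmp-simd.
#pragma omp declare simd simdlen(8) notinbranch
double fast_exp(double x) {
    return exp2(M_LOG2E * x); // exp(x) = 2^(log2(e)*x)
}

void vfoo(int n, const double* a, double* r) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        r[i] = fast_exp(a[i]);
}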
I use Linux x86_64 and clang 3.3.
Is this even possible in theory?
std::atomic<__int128_t> doesn't work (undefined references to some functions).
__atomic_add_fetch also doesn't work ('error: cannot compile this atomic library call yet').
Both std::atomic and __atomic_add_fetch work with 64-bit numbers.
It's not possible to do this with a single instruction, but you can emulate it and still be lock-free. Except for the very earliest AMD64 CPUs, x64 supports the CMPXCHG16B instruction. With a little multi-precision math, you can do this pretty easily.
I'm afraid I don't know the intrinsic for CMPXCHG16B in GCC, but hopefully you get the idea of having a spin loop of CMPXCHG16B. Here's some untested code for VC++:
#include <intrin.h>
#include <stdint.h>

// Atomically adds the 128-bit value in src to dst, with src getting the old dst.
// Index 0 is the low half and index 1 the high half, matching the little-endian
// layout that _InterlockedCompareExchange128 expects in memory.
// Note: dst must be 16-byte aligned.
void fetch_add_128b(uint64_t *dst, uint64_t *src)
{
    uint64_t srclo = src[0], srchi = src[1];
    uint64_t exchlo, exchhi;
    __int64 olddst[2] = { (__int64)dst[0], (__int64)dst[1] };
    do
    {
        exchlo = (uint64_t)olddst[0] + srclo;
        exchhi = (uint64_t)olddst[1] + srchi + (exchlo < srclo); // add and carry
    }
    // On failure the intrinsic refreshes olddst with the current value of *dst.
    while(!_InterlockedCompareExchange128((__int64 volatile*)dst,
                                          (__int64)exchhi, (__int64)exchlo,
                                          olddst));
    src[0] = (uint64_t)olddst[0];
    src[1] = (uint64_t)olddst[1];
}
Edit: here's some untested code going off of what I could find for the GCC intrinsics:
// atomically adds 128-bit src to dst, returning the old dst.
// (With GCC on x86-64, the 16-byte __sync builtins require compiling with -mcx16.)
__uint128_t fetch_add_128b(__uint128_t *dst, __uint128_t src)
{
__uint128_t dstval, olddst;
dstval = *dst;
do
{
olddst = dstval;
dstval = __sync_val_compare_and_swap(dst, dstval, dstval + src);
}
while(dstval != olddst);
return dstval;
}
That isn't possible. There is no x86-64 instruction that does a 128-bit add in one go, and for something to be atomic, a basic starting point is that it is a single instruction (there are some instructions which aren't atomic even then, but that's another matter).
You will need to use some other lock around the 128-bit number.
Edit: It is possible that one could come up with something that uses something like this:
__asm__ __volatile__(
    // arg1 must be 16-byte aligned for cmpxchg16b.
    "   movq    (%0), %%rax\n"
    "   movq    8(%0), %%rdx\n"
    "1:\n"
    "   movq    (%1), %%rbx\n"
    "   movq    8(%1), %%rcx\n"
    "   addq    %%rax, %%rbx\n"
    "   adcq    %%rdx, %%rcx\n"
    "   lock cmpxchg16b (%0)\n"
    "   jnz     1b\n"
    :
    : "r"(&arg1), "r"(&arg2)
    : "rax", "rbx", "rcx", "rdx", "cc", "memory");
That's just something I hacked up quickly, and I haven't compiled it, never mind validated that it will work. But the principle is that it repeats until the compare comes out equal.
Edit2: Darn, typing too slow, Cory Nelson just posted the same thing, but using intrinsics.
Edit3: Updated the loop to not unnecessarily read memory that doesn't need reading; CMPXCHG16B does that for us.
Yes; you need to tell your compiler that you're on hardware that supports it.
This answer is going to assume you're on x86-64; there's likely a similar spec for arm.
From the generic x86-64 microarchitecture levels, you'll want at least x86-64-v2 to let the compiler know that you have the cmpxchg16b instruction.
Here's a working godbolt, note the compiler flag -march=x86-64-v2:
https://godbolt.org/z/PvaojqGcx
For more reading on the x86-64-psABI, the spec is published here.
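For example, a minimal sketch of the std::atomic route (assuming a recent Clang, which inlines the CAS loop with this flag; GCC may still route 16-byte atomics through libatomic):
// Compile with: clang++ -O2 -march=x86-64-v2 atomic128.cpp
#include <atomic>

std::atomic<__int128_t> counter{0};

__int128_t bump(__int128_t delta) {
    // With cmpxchg16b available, this lowers to a lock-free CAS loop.
    return counter.fetch_add(delta);
}

int main() { return (int)bump(1); }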
I want to write the following loop using GCC extended inline ASM:
long* arr = new long[ARR_LEN]();
long* act_ptr = arr;
long* end_ptr = arr + ARR_LEN;
while (act_ptr < end_ptr)
{
*act_ptr = SOME_VALUE;
act_ptr += STEP_SIZE;
}
delete[] arr;
An array of type long with length ARR_LEN is allocated and zero-initialized. The loop walks through the array with an increment of STEP_SIZE. Every touched element is set to SOME_VALUE.
Well, this was my first attempt in GAS:
long* arr = new long[ARR_LEN]();
asm volatile
(
"loop:"
"movl %[sval], (%[aptr]);"
"leal (%[aptr], %[incr], 4), %[aptr];"
"cmpl %[eptr], %[aptr];"
"jl loop;"
: // no output
: [aptr] "r" (arr),
[eptr] "r" (arr + ARR_LEN),
[incr] "r" (STEP_SIZE),
[sval] "i" (SOME_VALUE)
: "cc", "memory"
);
delete[] arr;
As mentioned in the comments, it is true that this assembler code is more of a do {...} while loop, but it does in fact do the same work.
The strange thing about that piece of code is that it worked fine for me at first. But when I later tried to make it work in another project, it just seemed as if it wouldn't do anything. I even made some 1:1 copies of the working project, compiled again and... still the result is random.
Maybe I took the wrong constraints for the input operands, but I've actually tried nearly all of them by now and I have no real idea left. What puzzles me in particular is that it still works in some cases.
I am not an expert at ASM whatsoever, although I learned it when I was still at university. Please note that I am not looking for optimization - I am just trying to understand how inline assembly works. So here is my question: Is there anything fundamentally wrong with my attempt or did I make a more subtle mistake here? Thanks in advance.
(Working with g++ MinGW Win32 x86 v.4.8.1)
Update
I have already tried out every single suggestion that has been contributed here so far. In particular I tried
using the "q" operand constraint instead of "r", sometimes it works, sometimes it doesn't,
writing ... : [aptr] "=r" (arr) : "0" (arr) ... instead, same result,
or even ... : [aptr] "+r" (arr) : ..., still the same.
Meanwhile I know the official documentation pretty much by heart, but I still can't see my error.
You are modifying an input operand (aptr), which is not allowed. Either constrain it to match an output operand or change it to an input/output operand.
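For example, here is a minimal 64-bit sketch of the second option (fill_stride is my own wrapper name, and it assumes x86-64 where long is 8 bytes); the "+r" constraint tells the compiler the asm both reads and writes the pointer:
void fill_stride(long* arr, long* end, long step, long value)
{
    asm volatile(
        "0:\n\t"
        "movq %[sval], (%[aptr])\n\t"
        "leaq (%[aptr], %[incr], 8), %[aptr]\n\t"
        "cmpq %[eptr], %[aptr]\n\t"
        "jl 0b\n\t"
        : [aptr] "+r" (arr)   // input/output: the asm advances this pointer
        : [eptr] "r" (end),
          [incr] "r" (step),
          [sval] "r" (value)
        : "cc", "memory");
}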
Here is a complete code that has the intended behavior.
Note that the code is written for a 64-bit machine; therefore, for example, %%rbx is used instead of %%ebx as the base address for the array. For the same reason leaq and cmpq should be used instead of leal and cmpl.
movq should be used since the array is of type long.
Type long is 8 bytes, not 4, on a 64-bit machine.
jl in the question should be changed to jg.
Register labels cannot be used, since they will be replaced by the compiler with the 32-bit version of the chosen register (e.g., ebx).
Constraint "r" cannot be used. "r" means any register can be used; however, not every combination of registers is acceptable to leaq. Look here: x86 addressing modes
#include <iostream>
using namespace std;
int main(){
int ARR_LEN=20;
int STEP_SIZE=2;
long SOME_VALUE=100;
long* arr = new long[ARR_LEN];
int i;
for (i=0; i<ARR_LEN; i++){
arr[i] = 0;
}
__asm__ __volatile__
(
"loop:"
"movq %%rdx, (%%rbx);"
"leaq (%%rbx, %%rcx, 8), %%rbx;"
"cmpq %%rbx, %%rax;"
"jg loop;"
: // no output
: "b" (arr),
"a" (arr+ARR_LEN),
"c" (STEP_SIZE),
"d" (SOME_VALUE)
: "cc", "memory"
);
for (i=0; i<ARR_LEN; i++){
cout << "element " << i << " is " << arr[i] << endl;
}
delete[] arr;
return 0;
}
How about an answer that works for both x86 and x64 (although it does assume longs are always 4 bytes, a la Windows)? The main change from the OP is using "+r" and (temp).
#include <iostream>
using namespace std;
int main(){
int ARR_LEN=20;
size_t STEP_SIZE=2;
long SOME_VALUE=100;
long* arr = new long[ARR_LEN];
for (int i=0; i<ARR_LEN; i++){
arr[i] = 0;
}
long* temp = arr;
asm volatile (
"loop:\n\t"
"movl %[sval], (%[aptr])\n\t"
"lea (%[aptr], %[incr], %c[size]), %[aptr]\n\t"
"cmp %[eptr], %[aptr]\n\t"
"jl loop\n\t"
: [aptr] "+r" (temp)
: [eptr] "r" (arr + ARR_LEN),
[incr] "r" (STEP_SIZE),
[sval] "i" (SOME_VALUE),
[size] "i" (sizeof(long))
: "cc", "memory"
);
for (int i=0; i<ARR_LEN; i++){
cout << "element " << i << " is " << arr[i] << endl;
}
delete[] arr;
return 0;
}
How can I detect which CPU is being used at runtime? The C++ code needs to differentiate between AMD and Intel architectures. Using gcc 4.2.
The cpuid instruction, used with EAX=0 will return a 12-character vendor string in EBX, EDX, ECX, in that order.
For Intel, this string is "GenuineIntel". For AMD, it's "AuthenticAMD". Other companies that have created x86 chips have their own strings. The Wikipedia page for cpuid has many (all?) of the strings listed, as well as an example ASM listing for retrieving the details.
You really only need to check whether ECX matches the last four characters. You can't use the first four, because some Transmeta CPUs also start with "Genuine".
For Intel, this is 0x6c65746e
For AMD, this is 0x444d4163
If you convert each byte in those to a character, they'll appear to be backwards. This is just a result of the little endian design of x86. If you copied the register to memory and looked at it as a string, it would work just fine.
Example Code:
bool IsIntel() // returns true on an Intel processor, false on anything else
{
    unsigned int eax = 0, id_str;  // EAX = 0 selects the vendor-ID leaf
    __asm__ ("cpuid"               // run the cpuid instruction with...
             : "+a" (eax),         // EAX as a read-write operand (input 0, then overwritten)...
               "=c" (id_str)       // and id_str set to the value of ECX after cpuid runs.
             :
             : "ebx", "edx");      // cpuid also clobbers EBX and EDX.
    if(id_str == 0x6c65746e)       // "ntel": little-endian last 4 chars of Genuine[Intel]
        return true;
    else
        return false;
}
EDIT: One other thing - this can easily be changed into an IsAMD function, IsVIA function, IsTransmeta function, etc. just by changing the magic number in the if.
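For instance, a sketch of the AMD variant, changing only the constant (mirroring the function above):
bool IsAMD() // returns true on an AMD processor, false on anything else
{
    unsigned int eax = 0, id_str;
    __asm__ ("cpuid"
             : "+a" (eax), "=c" (id_str)
             :
             : "ebx", "edx");
    return id_str == 0x444d4163;   // "cAMD": little-endian last 4 chars of Authentic[AMD]
}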
If you're on Linux (or on Windows running under Cygwin), you can figure that out by reading the special file /proc/cpuinfo and looking for the line beginning with vendor_id. If the string is GenuineIntel, you're running on an Intel chip. If you get AuthenticAMD, you're running on an AMD chip.
#include <stdio.h>
#include <string.h>

void get_vendor_id(char *vendor_id) // must be at least 13 bytes
{
    FILE *cpuinfo = fopen("/proc/cpuinfo", "r");
    if(cpuinfo == NULL)
        ; // handle error
    char line[256];
    while(fgets(line, 256, cpuinfo))
    {
        if(strncmp(line, "vendor_id", 9) == 0)
        {
            char *colon = strchr(line, ':');
            if(colon == NULL || colon[1] == 0)
                ; // handle error
            strncpy(vendor_id, colon + 2, 12); // skip ": ", copy the 12-char ID
            vendor_id[12] = 0;
            fclose(cpuinfo);
            return;
        }
    }
    // if we got here, handle error
    fclose(cpuinfo);
}
If you know you're running on an x86 architecture, a less portable method would be to use the CPUID instruction:
#include <string.h>

void get_vendor_id(char *vendor_id) // must be at least 13 bytes
{
    // GCC inline assembler
    __asm__ __volatile__
    ("movl $0, %%eax\n\t"
     "cpuid\n\t"
     "movl %%ebx, %0\n\t"
     "movl %%edx, %1\n\t"
     "movl %%ecx, %2\n\t"
     : "=m"(*(int*)(vendor_id)),     // outputs: the three dwords of the
       "=m"(*(int*)(vendor_id + 4)), // vendor string, stored in
       "=m"(*(int*)(vendor_id + 8))  // EBX, EDX, ECX order
     : // no inputs
     : "%eax", "%ebx", "%ecx", "%edx", "memory"); // clobbered registers
    vendor_id[12] = 0;
}
int main(void)
{
char vendor_id[13];
get_vendor_id(vendor_id);
if(strcmp(vendor_id, "GenuineIntel") == 0)
; // it's Intel
else if(strcmp(vendor_id, "AuthenticAMD") == 0)
; // it's AMD
else
; // other
return 0;
}
On Windows, you can use the GetNativeSystemInfo function
On Linux, try sysinfo
You probably should not check at all. Instead, check whether the CPU supports the features you need, e.g. SSE3. The differences between two Intel chips might be greater than between AMD and Intel chips.
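For example, a minimal sketch of that approach using GCC's __builtin_cpu_* built-ins (note these arrived in GCC 4.8, so they are newer than the gcc 4.2 in the question):
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();              // populate CPU model/feature data
    if (__builtin_cpu_supports("sse3"))
        printf("SSE3 is available\n");
    if (__builtin_cpu_is("intel"))
        printf("Intel CPU\n");
    else if (__builtin_cpu_is("amd"))
        printf("AMD CPU\n");
    return 0;
}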
You have to define it in your Makefile, arch=`uname -p 2>&1`, then use #ifdef i386 ... #endif blocks for the different architectures.
I have posted a small project:
http://sourceforge.net/projects/cpp-cpu-monitor/
which uses the libgtop library and exposes data through UDP. You can modify it to suit your needs. GPL open-source. Please ask if you have any questions regarding it.