I want to write the following loop using GCC extended inline ASM:
long* arr = new long[ARR_LEN]();
long* act_ptr = arr;
long* end_ptr = arr + ARR_LEN;
while (act_ptr < end_ptr)
{
*act_ptr = SOME_VALUE;
act_ptr += STEP_SIZE;
}
delete[] arr;
An array of type long with length ARR_LEN is allocated and zero-initialized. The loop walks through the array with an increment of STEP_SIZE. Every touched element is set to SOME_VALUE.
Well, this was my first attempt in GAS:
long* arr = new long[ARR_LEN]();
asm volatile
(
"loop:"
"movl %[sval], (%[aptr]);"
"leal (%[aptr], %[incr], 4), %[aptr];"
"cmpl %[eptr], %[aptr];"
"jl loop;"
: // no output
: [aptr] "r" (arr),
[eptr] "r" (arr + ARR_LEN),
[incr] "r" (STEP_SIZE),
[sval] "i" (SOME_VALUE)
: "cc", "memory"
);
delete[] arr;
As mentioned in the comments, it is true that this assembler code is more of a do {...} while loop, but it does in fact do the same work.
The strange thing about that piece of code is that it worked fine for me at first. But when I later tried to make it work in another project, it just seemed as if it wouldn't do anything. I even made some 1:1 copies of the working project, compiled again and... still the result is random.
Maybe I took the wrong constraints for the input operands, but I've actually tried nearly all of them by now and I have no real idea left. What puzzles me in particular is, that it still works in some cases.
I am not an expert at ASM whatsoever, although I learned it when I was still at university. Please note that I am not looking for optimization - I am just trying to understand how inline assembly works. So here is my question: Is there anything fundamentally wrong with my attempt or did I make a more subtle mistake here? Thanks in advance.
(Working with g++ MinGW Win32 x86 v.4.8.1)
Update
I have already tried out every single suggestion that has been contributed here so far. In particular I tried
using the "q" operand constraint instead of "r", sometimes it works, sometimes it doesn't,
writing ... : [aptr] "=r" (arr) : "0" (arr) ... instead, same result,
or even ... : [aptr] "+r" (arr) : ..., still the same.
Meanwhile I know the official documentation pretty much by heart, but I still can't see my error.
You are modifying an input operand (aptr), which is not allowed. Either constrain it to match an output operand or change it to an input/output operand.
Here is complete code that has the intended behavior.
Note that the code is written for a 64-bit machine. Therefore, for example %%rbx is used instead of %%ebx as the base address for the array. For the same reason leaq and cmpq should be used instead of leal and cmpl.
movq should be used since the array is of type long.
Type long is 8 bytes, not 4, on a 64-bit machine.
jl in the question should be changed to jg, since the operand order of the cmpq below is reversed relative to the question's cmpl.
Register labels cannot be used, since they will be replaced by the compiler with the 32-bit version of the chosen register (e.g., ebx).
Constraint "r" cannot be used. "r" means any register can be used; however, not every combination of registers is acceptable for leaq. Look here: x86 addressing modes
#include <iostream>
using namespace std;
int main(){
int ARR_LEN=20;
int STEP_SIZE=2;
long SOME_VALUE=100;
long* arr = new long[ARR_LEN];
int i;
for (i=0; i<ARR_LEN; i++){
arr[i] = 0;
}
__asm__ __volatile__
(
"loop:"
"movq %%rdx, (%%rbx);"
"leaq (%%rbx, %%rcx, 8), %%rbx;"
"cmpq %%rbx, %%rax;"
"jg loop;"
: // no output
: "b" (arr),
"a" (arr+ARR_LEN),
"c" (STEP_SIZE),
"d" (SOME_VALUE)
: "cc", "memory"
);
for (i=0; i<ARR_LEN; i++){
cout << "element " << i << " is " << arr[i] << endl;
}
delete[] arr;
return 0;
}
How about an answer that works for both x86 and x64 (although it does assume longs are always 4 bytes, a la Windows)? The main change from the OP is using "+r" and (temp).
#include <iostream>
using namespace std;
int main(){
int ARR_LEN=20;
size_t STEP_SIZE=2;
long SOME_VALUE=100;
long* arr = new long[ARR_LEN];
for (int i=0; i<ARR_LEN; i++){
arr[i] = 0;
}
long* temp = arr;
asm volatile (
"loop:\n\t"
"movl %[sval], (%[aptr])\n\t"
"lea (%[aptr], %[incr], %c[size]), %[aptr]\n\t"
"cmp %[eptr], %[aptr]\n\t"
"jl loop\n\t"
: [aptr] "+r" (temp)
: [eptr] "r" (arr + ARR_LEN),
[incr] "r" (STEP_SIZE),
[sval] "i" (SOME_VALUE),
[size] "i" (sizeof(long))
: "cc", "memory"
);
for (int i=0; i<ARR_LEN; i++){
cout << "element " << i << " is " << arr[i] << endl;
}
delete[] arr;
return 0;
}
Related
I am trying to count how many bits an int takes up, e.g. count(10)=4, count(7)=3, count(127)=7, etc.
I have tried brute-forcing (<<ing a 1 until it's strictly bigger than the number) and using floor(log2(v))+1, but both are too slow for my needs.
I know that there exists a __builtin_popcount function that quickly counts how many 1s there are in an int, but I had trouble finding a built-in that fits my application. Is there no such function, or have I overlooked something?
Edit: I'm working with g++ version 9.3.0
Edit 2: mediocrevegetable1's answer was chosen because it was the only one usable with g++ at the time. However, future readers may also try out chris's answer for better compatibility, or in hopes that the compiler will give a more efficient implementation.
C++20 added std::bit_width (live example):
#include <bit>
#include <iostream>
int main() {
std::cout
<< std::bit_width(10u) << ' ' // 4
<< std::bit_width(7u) << ' ' // 3
<< std::bit_width(127u); // 7
}
With Clang trunk and, for example, -O3 -march=skylake, a function doing nothing but calling bit_width produces the following assembly:
count(unsigned int):
lzcnt ecx, edi
mov eax, 32
sub eax, ecx
ret
There are a number of similar functions in this new <bit> header as well. You can see them all here.
You can use the GCC built-in function __builtin_clz, which gives the number of leading zeros:
#include <climits>
constexpr unsigned bit_space(unsigned n)
{
return (sizeof n * CHAR_BIT) - __builtin_clz(n);
}
You can use asm:
int index;
int value = 0b100010000000;
asm("bsr %1, %0":"=a"(index):"a"(value)); // now index equals 11
This method returns the zero-based index of the highest set bit, which means it will also return zero when no bit is set. To get around this, first check whether value is non-zero: if value is zero, no bit is set; if value is non-zero, then index + 1 is the number of bits used by value.
There is a standard library function for this.
No need to write anything yourself.
https://en.cppreference.com/w/cpp/utility/bitset/count
#include <cassert>
#include <bitset>
int main()
{
std::bitset<32> value = 0b10010110;
auto count = value.count();
assert(count == 4);
}
After the feedback below, it turned out this was the incorrect answer: count() counts the set bits, not the bit width.
I don't know if there is a built-in function, but this compile-time constexpr will give the desired result.
#include <cassert>
#include <cstddef>
static constexpr auto bits_needed(unsigned int value)
{
std::size_t n = 0;
for (; value > 0; value >>= 1, ++n);
return n;
}
int main()
{
assert(3 == bits_needed(7));
assert(4 == bits_needed(10));
assert(4 == bits_needed(9));
assert(16 == bits_needed(65000));
}
I am writing a program, and at this point I need to make it efficient.
I am using a Haswell microarchitecture (64-bit) and g++.
The objective is to make use of the ADC instruction until the loop ends.
//I removed every carry handler from this preview, to keep it simple
size_t anum = ap[i], bnum = bp[i];
unsigned carry;
// The carry flag is set here with a common addition
anum += bnum;
cnum[0]= anum;
carry = check_Carry(anum, bnum);
for (int i=1; i<n; i++){
anum = ap[i];
bnum = bp[i];
//I want to remove this line and insert the __asm__ block
anum += (bnum + carry);
carry = check_Carry(anum, bnum);
//This block is not working
__asm__(
"movq -64(%rbp), %rcx;"
"adcq %rdx, %rcx;"
"movq %rsi, -88(%rbp);"
);
cnum[i] = anum;
}
Is the CF set only in the first addition, or every time I do an ADC instruction?
I think the problem is the loss of the CF every time the loop iterates. If this is the problem, how can I solve it?
You use asm like this in the gcc family of compilers:
int src = 1;
int dst;
asm ("mov %1, %0\n\t"
"add $1, %0"
: "=r" (dst)
: "r" (src));
printf("%d\n", dst);
That is, you can refer to variables, rather than guessing where they might be in memory/registers.
[Edit] On the subject of carries: It's not completely clear what you are wanting, but: ADC takes the CF as an input and produces it as an output. However, MANY other instructions muck with the flags, (such as likely those used by the compiler to construct your for-loop), so you probably need to use some instructions to save/restore the CF (perhaps LAHF/SAHF).
I am trying to delve into some inline assembly. It is interesting stuff, but the documentation is scarce and newb-unfriendly.
This code works as expected; it correctly multiplies:
{
int other_var=3;
asm volatile
(
"mov $3,%0\n\t"
"roll $2,%0;"
:"=r"(other_var)
:"r"(other_var)
);
cout << "other_var equals " << other_var <<endl;
return 0;
}
but this
int other_var=3;
cout << "other_var equals " << other_var <<endl;
asm volatile
(
"roll $2,%0;"
:"=r"(other_var)
:"r"(other_var)
);
cout << "other_var equals " <<hex<< other_var <<endl;
return 0;
}
When I remove the seemingly arbitrary mov, the code behaves as if undefined and outputs garbage. Suddenly the program does not load other_var from memory into a register, and the "=m" and "m" options are needed. Why is that? What piece of information am I missing here?
You should probably find yourself a couple of reference books, PDFs, or websites: one that documents the very compiler-specific nature of inline assembly, and one that documents the specifics of the assembly language itself. Then hope nobody ever tries to run your code on different hardware.
In the first chunk of code you assign the constant value 3, "$3", to the output-bound register, "%0".
Then you perform a roll on the output-bound register, "%0", by the constant 2, "$2", bits.
Effectively multiplying 3 by 4.
Neither block of code actually reads the original value from the variable other_var.
m is for memory, r is for register. = is for output, no = is used for input.
mov %1, %0; loads the register used for output with the value of the register used for input.
roll $2, %0; then rolls the output register.
When you just grab a register and start using the existing bit pattern found there, you are likely going to see something that resembles "garbage".
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#s5
http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html
I'm not sure whether the following code causes redundant calculations, or whether it's compiler-specific.
for (int i = 0; i < strlen(ss); ++i)
{
// blabla
}
Will strlen() be calculated every time when i increases?
Yes, strlen() will be evaluated on each iteration. It's possible that, under ideal circumstances, the optimiser might be able to deduce that the value won't change, but I personally wouldn't rely on that.
I'd do something like
for (int i = 0, n = strlen(ss); i < n; ++i)
or possibly
for (int i = 0; ss[i]; ++i)
as long as the string isn't going to change length during the iteration. If it might, then you'll need to either call strlen() each time, or handle it through more complicated logic.
Yes, every time you run the loop it will recalculate the length of the string.
So use it like this:
char str[30];
for ( int i = 0; str[i] != '\0'; i++)
{
//Something;
}
In the above code, str[i] only checks one particular character of the string, at location i, each time the loop starts a cycle, so it avoids rescanning the string and is more efficient.
In the code below, strlen will count the length of the whole string every time the loop runs, which is less efficient and takes more time:
char str[30];
for ( int i = 0; i < strlen(str); i++)
{
//Something;
}
A good compiler may not calculate it every time, but I don't think you can be sure that every compiler avoids it.
In addition to that, the compiler has to know that strlen(ss) does not change. This is only true if ss is not changed in the for loop.
For example, if you use a read-only function on ss in the for loop but don't declare the ss parameter as const, the compiler cannot even know that ss is not changed in the loop, and has to calculate strlen(ss) in every iteration.
If ss is of type const char * and you're not casting away the constness within the loop the compiler might only call strlen once, if optimizations are turned on. But this is certainly not behavior that can be counted upon.
You should save the strlen result in a variable and use this variable in the loop. If you don't want to create an additional variable, depending on what you're doing, you may be able to get away with reversing the loop to iterate backwards.
for( auto i = strlen(s); i > 0; --i ) {
// do whatever
// remember value of s[strlen(s)] is the terminating NULL character
}
Formally yes, strlen() is expected to be called on every iteration.
Anyway, I do not want to rule out the existence of some clever compiler optimisation that will optimise away any successive call to strlen() after the first one.
The predicate code in its entirety will be executed on every iteration of the for loop. In order to memoize the result of the strlen(ss) call, the compiler would need to know that at least
The function strlen was side effect free
The memory pointed to by ss doesn't change for the duration of the loop
The compiler doesn't know either of these things and hence can't safely memoize the result of the first call
Yes, strlen will be calculated every time i increases.
If you don't change ss within the loop, this won't affect the logic; otherwise it will.
It is safer to use following code.
int length = strlen(ss);
for ( int i = 0; i < length ; ++ i )
{
// blabla
}
Yes, the strlen(ss) will calculate the length at each iteration. If you were somehow growing ss while also increasing i, there would be an infinite loop.
Yes, the strlen() function is called every time the loop is evaluated.
If you want to improve the efficiency, always remember to save such results in local variables. It takes an extra line, but it's very useful.
You can use code like below:
int l = strlen(ss);
for ( int i = 0; i < l ; i++ )
{
// blablabla
}
Yes, strlen(ss) will be calculated every time the code runs.
Not common nowadays, but 20 years ago on 16-bit platforms I'd recommend this:
for ( char* p = str; *p; p++ ) { /* ... */ }
Even if your compiler isn't very smart about optimization, the above code can still result in good assembly.
Yes. The test doesn't know that ss doesn't get changed inside the loop. If you know that it won't change then I would write:
int stringLength = strlen (ss);
for ( int i = 0; i < stringLength; ++ i )
{
// blabla
}
Arrgh, it will, even under ideal circumstances, dammit!
As of today (January 2018), and gcc 7.3 and clang 5.0, if you compile:
#include <string.h>
void bar(char c);
void foo(const char* __restrict__ ss)
{
for (int i = 0; i < strlen(ss); ++i)
{
bar(*ss);
}
}
So, we have:
ss is a constant pointer.
ss is marked __restrict__
The loop body cannot in any way touch the memory pointed to by ss (well, unless it violates the __restrict__).
and still, both compilers execute strlen() every single iteration of that loop. Amazing.
This also means the allusions/wishful thinking of #Praetorian and #JaredPar don't pan out.
YES, in simple words.
There is a small "no" in the rare case where the compiler, as an optimization step, finds that no change is made to ss at all. But to be safe you should treat the answer as YES. In some situations, such as multithreaded or event-driven programs, it may get buggy if you treat it as NO.
Play it safe, as it is not going to increase the program's complexity much.
Yes.
strlen() is calculated every time i increases and is not optimized away.
The code below shows why the compiler should not optimize strlen():
for ( int i = 0; i < strlen(ss); ++i )
{
// Change ss string.
ss[i] = 'a'; // Compiler should not optimize strlen().
}
We can easily test it :
char nums[] = "0123456789";
size_t end;
int i;
for( i=0, end=strlen(nums); i<strlen(nums); i++ ) {
putchar( nums[i] );
nums[--end] = 0;
}
The loop condition is evaluated after each repetition, before restarting the loop.
Also be careful about the type you use to handle the length of strings: it should be size_t, which is an unsigned type defined in <stddef.h>. Comparing it with, or casting it to, int might cause serious vulnerability issues.
Well, I noticed that someone was saying it is optimized by default by any "clever" modern compiler. By the way, look at the results without optimization. I tried this minimal C code:
#include <stdio.h>
#include <string.h>
int main()
{
char *s="aaaa";
for (int i=0; i<strlen(s);i++)
printf ("a");
return 0;
}
My compiler: g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Command for generation of assembly code: g++ -S -masm=intel test.cpp
Gotten assembly code at the output:
...
.L3:
mov DWORD PTR [esp], 97
call putchar
add DWORD PTR [esp+40], 1
.L2:
THIS LOOP IS HERE:
mov ebx, DWORD PTR [esp+40]
mov eax, DWORD PTR [esp+44]
mov DWORD PTR [esp+28], -1
mov edx, eax
mov eax, 0
mov ecx, DWORD PTR [esp+28]
mov edi, edx
repnz scasb
AS YOU CAN SEE, it's done every time
mov eax, ecx
not eax
sub eax, 1
cmp ebx, eax
setb al
test al, al
jne .L3
mov eax, 0
.....
Elaborating on Prætorian's answer, I recommend the following:
for( auto i = strlen(s); i > 0; --i ) { foo(s[i-1]); }
auto because you don't want to care about which type strlen returns. A C++11 compiler (e.g. gcc -std=c++0x; not completely C++11, but auto types work) will do that for you.
i = strlen(s) because you want to compare to 0 (see below).
i > 0 because comparison to 0 is (slightly) faster than comparison to any other number.
The disadvantage is that you have to use i-1 in order to access the string characters.
It's really a simple problem:
I'm programming a Go program. Should I represent the board with a QVector<int> or a QVector<Player> where
enum Player
{
EMPTY = 0,
BLACK = 1,
WHITE = 2
};
I guess that of course, using Player instead of integers will be slower. But I wonder how much more, because I believe that using enum is better coding.
I've done a few tests regarding assigning and comparing Players (as opposed to int)
QVector<int> vec;
vec.resize(10000000);
int size = vec.size();
for(int i =0; i<size; ++i)
{
vec[i] = 0;
}
for(int i =0; i<size; ++i)
{
bool b = (vec[i] == 1);
}
QVector<Player> vec2;
vec2.resize(10000000);
int size = vec2.size();
for(int i =0; i<size; ++i)
{
vec2[i] = EMPTY;
}
for(int i =0; i<size; ++i)
{
bool b = (vec2[i] == BLACK);
}
Basically, it's only 10% slower. Is there anything else I should know before continuing?
Thanks!
Edit : The 10% difference is not a figment of my imagination, it seems to be specific to Qt and QVector. When I use std::vector, the speed is the same
Enums are completely resolved at compile time (enum constants as integer literals, enum variables as integer variables), there's no speed penalty in using them.
In general the average enumeration won't have an underlying type bigger than int (unless you put very big constants in it); in fact, §7.2 ¶5 explicitly says:
The underlying type of an enumeration is an integral type that can represent all the enumerator values defined in the enumeration. It is implementation-defined which integral type is used as the underlying type for an enumeration except that the underlying type shall not be larger than int unless the value of an enumerator cannot fit in an int or unsigned int.
You should use enumerations when it's appropriate because they usually make the code easier to read and to maintain (have you ever tried to debug a program full of "magic numbers"? :S).
As for your results: probably your test methodology doesn't take into account the normal speed fluctuations you get when you run code on "normal" machines [1]; have you tried running the test many (100+) times and calculating the mean and standard deviation of your times? The results should be compatible: the difference between the means shouldn't be bigger than 1 or 2 times the RSS [2] of the two standard deviations (assuming, as usual, a Gaussian distribution for the fluctuations).
Another check you could do is to compare the generated assembly code (with g++ you can get it with the -S switch).
On "normal" PCs you have some indeterministic fluctuations because of other tasks running, cache/RAM/VM state, ...
Root Sum Squared, the square root of the sum of the squared standard deviations.
In general, using an enum should make absolutely no difference to performance. How did you test this?
I just ran tests myself. The differences are pure noise.
Just now, I compiled both versions to assembler. Here's the main function from each:
int
LFB1778:
pushl %ebp
LCFI11:
movl %esp, %ebp
LCFI12:
subl $8, %esp
LCFI13:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
Player
LFB1774:
pushl %ebp
LCFI10:
movl %esp, %ebp
LCFI11:
subl $8, %esp
LCFI12:
movl $65535, %edx
movl $1, %eax
call __Z41__static_initialization_and_destruction_0ii
leave
ret
It's hazardous to base any statement regarding performance on micro-benchmarks. There are too many extraneous factors skewing the data.
Enums should be no slower. They're implemented as integers.
If you use Visual Studio, for example, you can create a simple project where you have
a = Player::EMPTY;
and if you right-click and choose "Go To Disassembly" the code will be
mov dword ptr [a],0
So the compiler replaces the enum with its value, and normally it will not generate any overhead.
Well, I did a few tests and there wasn't much difference between the integer and enum forms. I also added a char form which was consistently about 6% quicker (which isn't surprising as it is using less memory). Then I just used a char array rather than a vector and that was 300% faster! Since we've not been given what QVector is, it could be a wrapper for an array rather than the std::vector I've used.
Here's the code I used, compiled using standard release options in Dev Studio 2005. Note that I've changed the timed loop a small amount as the code in the question could be optimised to nothing (you'd have to check the assembly code).
#include <windows.h>
#include <vector>
#include <iostream>
using namespace std;
enum Player
{
EMPTY = 0,
BLACK = 1,
WHITE = 2
};
template <class T, T search>
LONGLONG TimeFunction ()
{
vector <T>
vec;
vec.resize (10000000);
size_t
size = vec.size ();
for (size_t i = 0 ; i < size ; ++i)
{
vec [i] = static_cast <T> (rand () % 3);
}
LARGE_INTEGER
start,
end;
QueryPerformanceCounter (&start);
for (size_t i = 0 ; i < size ; ++i)
{
if (vec [i] == search)
{
break;
}
}
QueryPerformanceCounter (&end);
return end.QuadPart - start.QuadPart;
}
LONGLONG TimeArrayFunction ()
{
size_t
size = 10000000;
char
*vec = new char [size];
for (size_t i = 0 ; i < size ; ++i)
{
vec [i] = static_cast <char> (rand () % 3);
}
LARGE_INTEGER
start,
end;
QueryPerformanceCounter (&start);
for (size_t i = 0 ; i < size ; ++i)
{
if (vec [i] == 10)
{
break;
}
}
QueryPerformanceCounter (&end);
delete [] vec;
return end.QuadPart - start.QuadPart;
}
int main ()
{
cout << " Char form = " << TimeFunction <char, 10> () << endl;
cout << "Integer form = " << TimeFunction <int, 10> () << endl;
cout << " Player form = " << TimeFunction <Player, static_cast <Player> (10)> () << endl;
cout << " Array form = " << TimeArrayFunction () << endl;
}
The compiler should convert enum into integers. They get inlined at compile time, so once your program is compiled, it's supposed to be exactly the same as if you used the integers themselves.
If your testing produces different results, there could be something going on with the test itself. Either that, or your compiler is behaving oddly.
This is implementation-dependent, and it is quite possible for enums and ints to have different performance and either the same or different assembly code, although that is probably a sign of a suboptimal compiler. Some ways to get differences are:
QVector may be specialized on your enum type to do something surprising.
enum doesn't get compiled to int but to "some integral type no larger than int". QVector of int may be specialized differently from QVector of some_integral_type.
even if QVector isn't specialized, the compiler may do a better job of aligning ints in memory than of aligning some_integral_type, leading to a greater cache miss rate when you loop over the vector of enums or of some_integral_type.