void func(int depth) {
    if (depth == 0) return;
    int number(12345);
    cout << number; // it does something with number
    func(--depth);
}

void func2(int depth) {
    if (depth == 0) return;
    {
        int number(12345);
        cout << number; // it does something with number
    }
    func2(--depth);
}

int main() {
    func(10);  // will this function cost more memory?
    func2(10); // will this function cost less memory?
}
Hi. I have two functions here. Will func2 cost less memory because its number(12345) is enclosed in braces, so that by the time func2 makes the next recursive call, number(12345) has gone out of scope and disappeared?
I believe func will cost more because its number(12345) is still in scope when the next recursive call happens. Is that right?
Assuming we have AMD/Intel x86_64 architecture and our compiler is GCC.
Let's take the assembly output (-O2 -S) and analyze it:
func:
.LFB1560:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
movl %ecx, %ebx
testl %ecx, %ecx
je .L3
movq .refptr._ZSt4cout(%rip), %rsi
.p2align 4,,10
.L5:
movl $12345, %edx
movq %rsi, %rcx
call _ZNSolsEi
subl $1, %ebx
jne .L5
.L3:
addq $40, %rsp
popq %rbx
popq %rsi
ret
.seh_endproc
func2:
.LFB2046:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
movl %ecx, %ebx
testl %ecx, %ecx
je .L10
movq .refptr._ZSt4cout(%rip), %rsi
.p2align 4,,10
.L12:
movl $12345, %edx
movq %rsi, %rcx
call _ZNSolsEi
subl $1, %ebx
jne .L12
.L10:
addq $40, %rsp
popq %rbx
popq %rsi
ret
.seh_endproc
So as you can see, both functions are completely identical to each other.
Well, GCC 6.2.1 with -O0 produces the same assembly for both of your functions.
What closing braces do, within a single function, is determine when destructors are called. For instance consider the following:
struct Number {
    int n;
    ~Number() { std::cout << "~" << n << ", "; }
};

void func(int depth) {
    if (depth == 0) return;
    Number number{depth};
    std::cout << number.n << ", ";
    func(--depth);
}

void func2(int depth) {
    if (depth == 0) return;
    {
        Number number{depth};
        std::cout << number.n << ", ";
    }
    func2(--depth);
}

int main(int, char**) {
    std::cout << "Calling func. \n";
    func(10);
    std::cout << "\n\n";
    std::cout << "Calling func2. \n";
    func2(10);
}
This outputs
Calling func.
10, 9, 8, 7, 6, 5, 4, 3, 2, 1, ~1, ~2, ~3, ~4, ~5, ~6, ~7, ~8, ~9, ~10,

Calling func2.
So if number were some std::vector or maybe std::ifstream, then yes those braces would be quite necessary.
Why speculate?
Since your depth is small (only 10 levels), you can simply run your MCVE.
Just two small changes will provide the information needed to determine your answer:
a) cout the address of number (in addition to the value).
b) instead of a constant number, fetch something that is not constant (see below).
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <ctime>

class T595_t
{
public:
    T595_t() = default;
    ~T595_t() = default;

    int exec(int, char**)
    {
        std::cout << "\n  func1(10)" << std::flush;
        func1(10);
        std::cout << "\n  func2(10)" << std::flush;
        func2(10);
        return 0;
    }

private: // methods
    void func1(int depth)
    {
        if (depth == 0) return;
        uint64_t number(std::time(nullptr));
        std::cout << "\n  " << depth
                  << "  " << number << "  " << &number;
        func1(--depth);
    }

    void func2(int depth)
    {
        if (depth == 0) return;
        {
            uint64_t number(std::time(nullptr));
            std::cout << "\n  " << depth
                      << "  " << number << "  " << &number;
        }
        func2(--depth);
    }
}; // class T595_t

int main(int argc, char* argv[])
{
    T595_t t595;
    return t595.exec(argc, argv);
}
With output (on Ubuntu 17.10, using g++ version 7.2.0.)
func1(10)
10 1523379410 0x7ffd449d7d90
9 1523379410 0x7ffd449d7d60
8 1523379410 0x7ffd449d7d30
7 1523379410 0x7ffd449d7d00
6 1523379410 0x7ffd449d7cd0
5 1523379410 0x7ffd449d7ca0
4 1523379410 0x7ffd449d7c70
3 1523379410 0x7ffd449d7c40
2 1523379410 0x7ffd449d7c10
1 1523379410 0x7ffd449d7be0
func2(10)
10 1523379410 0x7ffd449d7d90
9 1523379410 0x7ffd449d7d60
8 1523379410 0x7ffd449d7d30
7 1523379410 0x7ffd449d7d00
6 1523379410 0x7ffd449d7cd0
5 1523379410 0x7ffd449d7ca0
4 1523379410 0x7ffd449d7c70
3 1523379410 0x7ffd449d7c40
2 1523379410 0x7ffd449d7c10
1 1523379410 0x7ffd449d7be0
When unoptimized, the generated code for each funcX does the same thing: number lives at the same per-frame address in both versions. And since the program completes in well under a second, all the values of number are the same.
Xcode generates an EXC_BAD_ACCESS error.
I suppose the problem is that I messed up the registers when accessing the arrays' values.
A plain step-by-step explanation would be much appreciated, thanks!
void countingSort(int array[], int length, int digit) {
int i, count[10] = { };
int sorted[length];
// Store number of occurrences in count[].
// for (i = 0; i < length; i++)
// count[ (array[i] / digit) % 10 ]++;
// Inline assembler loop
// in place of commented out for loop.
asm(
"loopOne: \n\t"
"movl $0, %%ecx \n\t"
"cmpl %%ecx, (%[length]) \n\t"
"je loopTwo \n\t"
"movl %[array], %%esp \n\t"
"movl (%%esp, %%ecx, 4), %%eax \n\t"
"movl (%[digit]), %%ebx \n\t"
"divl %%ebx \n\t"
"movl %[count], %%ebp \n\t"
"movl (%%ebp, %%edx, 4), %%esi \n\t"
"inc %%esi \n\t"
"inc %%ecx \n\t"
"jmp loopOne \n\t"
"loopTwo: \n\t"
// ...
: [array] "=g" (array), [count] "=g" (count)
: [digit] "r" ((long) digit), [length] "r" ((long) length)
);
}
void radixSort(int array[], int length) {
// Maximum number helps later when counting number of digits.
int max = findMax(array, length);
// Do Counting sort for every digit.
for (int digit = 1; max / digit > 0; digit *= 10)
countingSort(array, length, digit); // Thread 1: EXC_BAD_ACCESS (code=1, address=0x0)
}
Some years ago I needed a way to do some basic 128-bit integer math with CUDA:
128 bit integer on cuda?.
Now I am having the same problem, but this time I need to run some basic 128-bit arithmetic (sums, bit shifts and multiplications) on a 32-bit embedded system (an Intel Edison) that does not support 128-bit integers of any kind. There are, however, 64-bit integers supported directly (unsigned long long int).
I naively tried to use the asm code that was given in the answer last time on the CPU, but I got a bunch of errors. I am really not experienced with asm, so: what is the most efficient way, given 64-bit integers, to implement additions, multiplications and bit shifting in 128 bits?
Update: Since the OP hasn't accepted an answer yet <hint><hint>, I've attached a bit more code.
Using the libraries discussed above is probably a good idea. While you might only need a few functions today, eventually you may find that you need one more. Then one more after that. Until eventually you end up writing, debugging and maintaining your own 128-bit math library, which is a waste of your time and effort.
That said. If you are determined to roll your own:
1) The CUDA question you asked previously already has C code for multiplication. Was there some problem with it?
2) The shift probably won't benefit from using asm, so a C solution makes sense to me here as well. Although if performance is really an issue here, I'd see if the Edison supports SHLD/SHRD, which might make this a bit faster. Otherwise, maybe an approach like this?
my_uint128_t lshift_uint128 (const my_uint128_t a, int b)
{
    my_uint128_t res;
    if (b < 32) {
        res.x = a.x << b;
        res.y = (a.y << b) | (a.x >> (32 - b));
        res.z = (a.z << b) | (a.y >> (32 - b));
        res.w = (a.w << b) | (a.z >> (32 - b));
    } else if (b < 64) {
        ...
    }
    return res;
}
Update: Since it appears that the Edison may support SHLD/SHRD, here's an alternative which might be more performant than the C code above. As with all code purporting to be faster, you should test it.
inline
unsigned int __shld(unsigned int into, unsigned int from, unsigned int c)
{
unsigned int res;
if (__builtin_constant_p(into) &&
__builtin_constant_p(from) &&
__builtin_constant_p(c))
{
res = (into << c) | (from >> (32 - c));
}
else
{
asm("shld %b3, %2, %0"
: "=rm" (res)
: "0" (into), "r" (from), "ic" (c)
: "cc");
}
return res;
}
inline
unsigned int __shrd(unsigned int into, unsigned int from, unsigned int c)
{
unsigned int res;
if (__builtin_constant_p(into) &&
__builtin_constant_p(from) &&
__builtin_constant_p(c))
{
res = (into >> c) | (from << (32 - c));
}
else
{
asm("shrd %b3, %2, %0"
: "=rm" (res)
: "0" (into), "r" (from), "ic" (c)
: "cc");
}
return res;
}
my_uint128_t lshift_uint128 (const my_uint128_t a, unsigned int b)
{
my_uint128_t res;
if (b < 32) {
res.x = a.x << b;
res.y = __shld(a.y, a.x, b);
res.z = __shld(a.z, a.y, b);
res.w = __shld(a.w, a.z, b);
} else if (b < 64) {
res.x = 0;
res.y = a.x << (b - 32);
res.z = __shld(a.y, a.x, b - 32);
res.w = __shld(a.z, a.y, b - 32);
} else if (b < 96) {
res.x = 0;
res.y = 0;
res.z = a.x << (b - 64);
res.w = __shld(a.y, a.x, b - 64);
} else if (b < 128) {
res.x = 0;
res.y = 0;
res.z = 0;
res.w = a.x << (b - 96);
} else {
memset(&res, 0, sizeof(res));
}
return res;
}
my_uint128_t rshift_uint128 (const my_uint128_t a, unsigned int b)
{
my_uint128_t res;
if (b < 32) {
res.x = __shrd(a.x, a.y, b);
res.y = __shrd(a.y, a.z, b);
res.z = __shrd(a.z, a.w, b);
res.w = a.w >> b;
} else if (b < 64) {
res.x = __shrd(a.y, a.z, b - 32);
res.y = __shrd(a.z, a.w, b - 32);
res.z = a.w >> (b - 32);
res.w = 0;
} else if (b < 96) {
res.x = __shrd(a.z, a.w, b - 64);
res.y = a.w >> (b - 64);
res.z = 0;
res.w = 0;
} else if (b < 128) {
res.x = a.w >> (b - 96);
res.y = 0;
res.z = 0;
res.w = 0;
} else {
memset(&res, 0, sizeof(res));
}
return res;
}
3) The addition may benefit from asm. You could try this:
struct my_uint128_t
{
unsigned int x;
unsigned int y;
unsigned int z;
unsigned int w;
};
my_uint128_t add_uint128 (const my_uint128_t a, const my_uint128_t b)
{
my_uint128_t res;
asm ("addl %5, %[resx]\n\t"
"adcl %7, %[resy]\n\t"
"adcl %9, %[resz]\n\t"
"adcl %11, %[resw]\n\t"
: [resx] "=&r" (res.x), [resy] "=&r" (res.y),
[resz] "=&r" (res.z), [resw] "=&r" (res.w)
: "%0"(a.x), "irm"(b.x),
"%1"(a.y), "irm"(b.y),
"%2"(a.z), "irm"(b.z),
"%3"(a.w), "irm"(b.w)
: "cc");
return res;
}
I just dashed this off, so use at your own risk. I don't have an Edison, but this works with x86.
Update: If you are just doing accumulation (think to += from instead of the code above which is c = a + b), this code might serve you better:
inline
void addto_uint128 (my_uint128_t *to, const my_uint128_t from)
{
asm ("addl %[fromx], %[tox]\n\t"
"adcl %[fromy], %[toy]\n\t"
"adcl %[fromz], %[toz]\n\t"
"adcl %[fromw], %[tow]\n\t"
: [tox] "+&r"(to->x), [toy] "+&r"(to->y),
[toz] "+&r"(to->z), [tow] "+&r"(to->w)
: [fromx] "irm"(from.x), [fromy] "irm"(from.y),
[fromz] "irm"(from.z), [fromw] "irm"(from.w)
: "cc");
}
If using an external library is an option, then have a look at this question. You can use TTMath, which is a very simple header-only library for big-precision math. On 32-bit architectures ttmath::UInt<4> will create a 128-bit integer type with four 32-bit limbs. Some other alternatives are (u)int128_t in Boost.Multiprecision or calccrypto/uint128_t.
If you must write it on your own, then there are already a lot of solutions on SO and I'll summarize them here.
For addition and subtraction, it's very easy and straightforward: simply add/subtract the words (which big-int libraries often call limbs) from least significant to most significant, with carry of course.
typedef struct INT128 {
    uint64_t H, L;
} my_uint128_t;

inline my_uint128_t add(my_uint128_t a, my_uint128_t b)
{
    my_uint128_t c;
    c.L = a.L + b.L;
    c.H = a.H + b.H + (c.L < a.L); // c = a + b
    return c;
}
The assembly output can be checked with Compiler Explorer
Compilers can already generate efficient code for double-word operations, but many aren't smart enough to use "add with carry" when compiling multi-word operations from high-level languages, as you can see in the question efficient 128-bit addition using carry flag. Therefore using two long longs as above not only makes the code more readable but also makes it easier for the compiler to emit slightly more efficient code.
If that still doesn't suit your performance requirements, you must use intrinsics or write it in assembly. To add the 128-bit value stored in bignum to the 128-bit value in {eax, ebx, ecx, edx} you can use the following code:
add edx, [bignum]
adc ecx, [bignum+4]
adc ebx, [bignum+8]
adc eax, [bignum+12]
The equivalent intrinsics will look like this for Clang:
unsigned *x, *y, *z, carryin=0, carryout;
z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
carryin = carryout;
z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
carryin = carryout;
z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
carryin = carryout;
z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);
You need to change the intrinsic to the one supported by your compiler, for example __builtin_uadd_overflow in gcc, or _addcarry_u32 for MSVC and ICC
For more information read these
Working with Big Numbers Using x86 Instructions
How can I add and subtract 128 bit integers in C or C++ if my compiler does not support them?
Producing good add with carry code from clang
multi-word addition using the carry flag
For bit shifts you can find the C solution in the question Bitwise shift operation on a 128-bit number. This is a simple left shift, but you can unroll the recursive call for more performance:
void shiftl128 (
    unsigned int& a,
    unsigned int& b,
    unsigned int& c,
    unsigned int& d,
    size_t k)
{
    assert (k <= 128);
    if (k >= 32) // shifting a 32-bit integer by more than 31 bits is "undefined"
    {
        a = b;
        b = c;
        c = d;
        d = 0;
        shiftl128(a, b, c, d, k - 32);
    }
    else if (k > 0) // k == 0 would make the (32 - k) shifts below undefined too
    {
        a = (a << k) | (b >> (32 - k));
        b = (b << k) | (c >> (32 - k));
        c = (c << k) | (d >> (32 - k));
        d = (d << k);
    }
}
The assembly for less-than-32-bit shifts can be found in the question 128-bit shifts using assembly language?
shld edx, ecx, cl
shld ecx, ebx, cl
shld ebx, eax, cl
shl eax, cl
Right shifts can be implemented similarly, or just copy from the above linked question
Multiplication and division are a lot more complex; you can reference the solution in the question Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit):
class int128_t
{
uint32_t dw3, dw2, dw1, dw0;
// Various constructors, operators, etc...
int128_t& operator*=(const int128_t& rhs) __attribute__((always_inline))
{
int128_t Urhs(rhs);
uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
dw0 ^= lhs_xor_mask;
dw1 ^= lhs_xor_mask;
dw2 ^= lhs_xor_mask;
dw3 ^= lhs_xor_mask;
Urhs.dw0 ^= rhs_xor_mask;
Urhs.dw1 ^= rhs_xor_mask;
Urhs.dw2 ^= rhs_xor_mask;
Urhs.dw3 ^= rhs_xor_mask;
*this += (lhs_xor_mask & 1);
Urhs += (rhs_xor_mask & 1);
struct mul128_t
{
int128_t dqw1, dqw0;
mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
};
mul128_t data(Urhs,*this);
asm volatile(
"push %%ebp \n\
movl %%eax, %%ebp \n\
movl $0x00, %%ebx \n\
movl $0x00, %%ecx \n\
movl $0x00, %%esi \n\
movl $0x00, %%edi \n\
movl 28(%%ebp), %%eax #Calc: (dw0*dw0) \n\
mull 12(%%ebp) \n\
addl %%eax, %%ebx \n\
adcl %%edx, %%ecx \n\
adcl $0x00, %%esi \n\
adcl $0x00, %%edi \n\
movl 24(%%ebp), %%eax #Calc: (dw1*dw0) \n\
mull 12(%%ebp) \n\
addl %%eax, %%ecx \n\
adcl %%edx, %%esi \n\
adcl $0x00, %%edi \n\
movl 20(%%ebp), %%eax #Calc: (dw2*dw0) \n\
mull 12(%%ebp) \n\
addl %%eax, %%esi \n\
adcl %%edx, %%edi \n\
movl 16(%%ebp), %%eax #Calc: (dw3*dw0) \n\
mull 12(%%ebp) \n\
addl %%eax, %%edi \n\
movl 28(%%ebp), %%eax #Calc: (dw0*dw1) \n\
mull 8(%%ebp) \n\
addl %%eax, %%ecx \n\
adcl %%edx, %%esi \n\
adcl $0x00, %%edi \n\
movl 24(%%ebp), %%eax #Calc: (dw1*dw1) \n\
mull 8(%%ebp) \n\
addl %%eax, %%esi \n\
adcl %%edx, %%edi \n\
movl 20(%%ebp), %%eax #Calc: (dw2*dw1) \n\
mull 8(%%ebp) \n\
addl %%eax, %%edi \n\
movl 28(%%ebp), %%eax #Calc: (dw0*dw2) \n\
mull 4(%%ebp) \n\
addl %%eax, %%esi \n\
adcl %%edx, %%edi \n\
movl 24(%%ebp), %%eax #Calc: (dw1*dw2) \n\
mull 4(%%ebp) \n\
addl %%eax, %%edi \n\
movl 28(%%ebp), %%eax #Calc: (dw0*dw3) \n\
mull (%%ebp) \n\
addl %%eax, %%edi \n\
pop %%ebp \n"
:"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
:"a"(&data):"%ebp");
dw0 ^= result_xor_mask;
dw1 ^= result_xor_mask;
dw2 ^= result_xor_mask;
dw3 ^= result_xor_mask;
return (*this += (result_xor_mask & 1));
}
};
You can also find a lot of related questions with the 128bit tag
I have been using the memcmp function to compare two integers in my performance-critical application. I had to use it rather than the equality operator because I have to deal with other data types generically. However, I suspected memcmp's performance for primitive data types and changed it to the equality operator. Then the performance increased.
I just did some simple tests as follows.
Using memcmp
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string.h>
using namespace std;
int main(int argc, char **argv)
{
int iValue1 = atoi(argv[1]);
int iValue2 = atoi(argv[2]);
struct timeval start;
gettimeofday(&start, NULL);
for (int i = 0; i < 2000000000; i++)
{
// if (iValue1 == iValue2)
if (memcmp(&iValue1, &iValue2, sizeof(int)) == 0)
{
cout << "Hello" << endl;
};
};
struct timeval end;
gettimeofday(&end, NULL);
cout << "Time taken : " << ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) << " us" << endl;
return 0;
}
The output of the program was as follows.
sujith#linux-1xs7:~> g++ -m64 -O3 Main.cpp
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13539618 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13534932 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13599818 us
sujith#linux-1xs7:~> ./a.out 3424 234
Time taken : 13639394 us
Using equal operator
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <string.h>
using namespace std;
int main(int argc, char **argv)
{
int iValue1 = atoi(argv[1]);
int iValue2 = atoi(argv[2]);
struct timeval start;
gettimeofday(&start, NULL);
for (int i = 0; i < 2000000000; i++)
{
if (iValue1 == iValue2)
// if (memcmp(&iValue1, &iValue2, sizeof(int)) == 0)
{
cout << "Hello" << endl;
};
};
struct timeval end;
gettimeofday(&end, NULL);
cout << "Time taken : " << ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) << " us" << endl;
return 0;
}
The output of the program was as follows.
sujith#linux-1xs7:~> g++ -m64 -O3 Main.cpp
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 9 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 13 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 14 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 15 us
sujith#linux-1xs7:~> ./a.out 234 23423
Time taken : 16 us
Can someone please tell me whether the equality operator works faster than memcmp for primitive data types? If so, what is happening there? Doesn't the equality operator use memcmp internally?
Microbenchmarks are hard to write.
The loop in the first case compiles to (at g++ -O3):
movl $2000000000, %ebx
jmp .L3
.L2:
subl $1, %ebx
je .L7
.L3:
leaq 12(%rsp), %rsi
leaq 8(%rsp), %rdi
movl $4, %edx
call memcmp
testl %eax, %eax
jne .L2
; code to do the printing omitted
subl $1, %ebx
jne .L3
.L7:
addq $16, %rsp
xorl %eax, %eax
popq %rbx
ret
The loop in the second case compiles to
cmpl %eax, %ebp
je .L7
.L2:
addq $8, %rsp
xorl %eax, %eax
popq %rbx
popq %rbp
ret
.L7:
movl $2000000000, %ebx
.L3:
; code to do the printing omitted
subl $1, %ebx
jne .L3
jmp .L2
Note that in the first case memcmp is called 2000000000 times. In the second case, the optimizer hoisted the comparison out of the loop, so it is done only once. Moreover, in the second case the compiler placed the two variables entirely in registers, while in the first one they need to be placed on the stack because you are taking their address.
Even looking at just the comparison, comparing two ints takes a single cmpl instruction. Using memcmp incurs a function call, and internally memcmp is likely to perform some extra checks.
In this particular case, clang++ -O3 compiles the memcmp down to a single cmpl instruction. However, it still doesn't hoist the check out of the loop when you use memcmp.
The SSE2 instruction (paddd xmm, m128) behaves really strangely. The code tells all.
#include <iostream>
using namespace std;
int main()
{
int * v0 = new int [80];
for (int i=0; i<80; ++i)
v0[i] = i;
int * v1 = new int [80];
for (int i=0; i<80; ++i)
v1[i] = i;
asm(
".intel_syntax noprefix;"
"mov rcx , 20;"
"mov rax , %0;"
"mov rbx , %1;"
"m_start:;"
"cmp rcx , 0;"
"je m_end;"
"movdqu xmm0 , [rax];"
"paddd xmm0 , [rbx];"
"movdqu [rax] , xmm0;"
"add rbx , 16;" /* WTF?? If I put there 128, it's work really bad */
"add rax , 16;" /* but why?? I must add 128 because XMM width is 128 bits ... */
"dec rcx;"
"jmp m_start;"
"m_end:;"
".att_syntax noprefix;"
: //
: "r"(v0) , "r"(v1)
: //
);
for (int i=1; i<81; ++i)
{
cout << v0[i-1] << (char*)((i%10==0) ? "\n" : ", ");
}
return 0;
}
You must add 16 because 128 bits is 16 bytes.
Additional notes: you forgot to tell the compiler that you clobber some registers (rax, rbx, rcx and xmm0), and you are not supposed to switch syntax inside the asm string without telling the compiler either (use the -masm=intel switch instead).