left shift of 128 bit number using AVX2 instruction - c++

I am trying to do left rotation of a 128 bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task.
Here is a snippet of my code to do the same.
l = 4;
r = 4;
targetrotate = _mm_set_epi64x (l, r);
targetleftrotate = _mm_sllv_epi64 (target, targetrotate);
The above code snippet rotates target by 4 to the left.
When I tested the above code with a sample input, I could see the result is not rotated correctly.
Here is the sample input and output
input: 01 23 45 67 89 ab cd ef fe dc ba 98 76 54 32 10
obtained output: 10 30 52 74 96 b8 da fc e0 cf ad 8b 69 47 25 03
But, the output I expect is
12 34 56 78 9a bc de f0 ed cb a9 87 65 43 21 00
I know that I am doing something wrong. I want to know whether my expected output is right and if so, I want to know what am I doing wrong here.
Any kind of help would be greatly appreciated and thanks in advance.

I think you have an endian issue with how you're printing your input and output.
The left-most bytes within each 64-bit half are the least-significant bytes in your actual output, so 0xfe << 4 becomes 0xe0, with the f shifting into a higher byte.
See Convention for displaying vector registers for more discussion of that.
Your "expected" output matches what you'd get if you were printing values high element first (highest address when stored). But that's not what you're doing; you're printing each byte separately in ascending memory order. x86 is little-endian. This conflicts with the numeral system we use in English, where we read Arabic numerals from left to right, highest place-value on the left, effectively human big-endian. Fun fact: The Arabic language reads from right to left so for them, written numbers are "human little-endian".
(And across elements, higher elements are at higher addresses; printing high elements first makes whole-vector shifts like _mm_bslli_si128 aka pslldq make sense in the way it shifts bytes left between elements.)
If you're using a debugger, you're probably printing within that. If you're using debug-prints, see print a __m128i variable.
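To make the display issue concrete, here is a minimal scalar sketch (no intrinsics). The helper names byte_in_memory and byte_in_numeral are made up for illustration, and the constant is the high 64-bit element of the sample input, read as a little-endian integer. It shows that after a shift by 4, the byte at the lowest address is e0, exactly as in the "obtained" output, while the same value written as a numeral starts with 03.

```cpp
#include <cstdint>

// Byte i as it would sit in memory on a little-endian machine
// (i = 0 is the lowest address, i.e. the least-significant byte).
constexpr std::uint8_t byte_in_memory(std::uint64_t v, unsigned i) {
    return static_cast<std::uint8_t>(v >> (8 * i));
}

// Byte i as we write the number on paper
// (i = 0 is the most-significant byte).
constexpr std::uint8_t byte_in_numeral(std::uint64_t v, unsigned i) {
    return byte_in_memory(v, 7 - i);
}

// High element of the sample input: memory bytes fe dc ba 98 76 54 32 10.
constexpr std::uint64_t high_element = 0x1032547698badcfeULL;

// What a 64-bit left shift by 4 does to that element.
constexpr std::uint64_t shifted = high_element << 4;
```

Printing byte_in_memory(shifted, 0) through byte_in_memory(shifted, 7) gives e0 cf ad 8b 69 47 25 03, the second half of the obtained output.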
BTW, you can use _mm_set1_epi64x(4) to put the same value in both elements of a vector, instead of using separate l and r variables with the same value.
In _mm_set intrinsics, the high elements come first, matching the diagrams in Intel's asm manuals, and matching the semantic meaning of "left" shift moving bits/bytes to the left. (e.g. see Intel's diagrams and element-numbering for pshufd, _mm_shuffle_epi32)
BTW, AVX512 has vprolvq rotates. But yes, to emulate rotates you want a SIMD version of (x << n) | x >> (64-n). Note that x86 SIMD shifts saturate the shift count, unlike scalar shifts which mask the count. So x >> 64 will shift out all the bits. If you want to support rotate counts above 63, you probably need to mask.
(Best practices for circular shift (rotate) operations in C++ but you're using intrinsics so you don't have to worry about C shift-count UB, just the actual known hardware behaviour.)
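A rough sketch of that shift-and-OR emulation, under some assumptions: lane_shl, lane_shr, lane_rotl and rotl_epi64 are names I made up, the scalar functions only model one 64-bit lane with the saturating count behaviour described above, and the intrinsic version is compiled only when AVX2 is enabled (e.g. with -mavx2). Note that it rotates each 64-bit element independently; it is not a rotate of the whole 128-bit value.

```cpp
#include <cstdint>

// Scalar model of one 64-bit lane of a SIMD variable shift: counts above 63
// produce 0 instead of being masked to the low 6 bits.
constexpr std::uint64_t lane_shl(std::uint64_t x, std::uint64_t n) {
    return n > 63 ? 0 : x << n;
}
constexpr std::uint64_t lane_shr(std::uint64_t x, std::uint64_t n) {
    return n > 63 ? 0 : x >> n;
}

// Rotate left built from two shifts. The count is masked so that counts
// above 63 wrap around; n == 0 relies on lane_shr(x, 64) being 0.
constexpr std::uint64_t lane_rotl(std::uint64_t x, std::uint64_t n) {
    return lane_shl(x, n & 63) | lane_shr(x, 64 - (n & 63));
}

#if defined(__AVX2__)
#include <immintrin.h>
// The same idea with AVX2 intrinsics: rotate each 64-bit element of x
// left by the matching element of n.
inline __m128i rotl_epi64(__m128i x, __m128i n) {
    const __m128i masked = _mm_and_si128(n, _mm_set1_epi64x(63));
    const __m128i left   = _mm_sllv_epi64(x, masked);
    const __m128i right  = _mm_srlv_epi64(
        x, _mm_sub_epi64(_mm_set1_epi64x(64), masked));
    return _mm_or_si128(left, right);
}
#endif
```

With this, rotl_epi64(target, _mm_set1_epi64x(4)) would rotate both halves of target by 4 bits.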

Related

Why GCC generates strange way to move stack pointer

I have observed that GCC's C++ compiler generates the following assembler code:
sub $0xffffffffffffff80,%rsp
This is equivalent to
add $0x80,%rsp
i.e. remove 128 bytes from the stack.
Why does GCC generate the first sub variant and not the add variant? The add variant seems way more natural to me than to exploit that there is an underflow.
This only occurred once in a quite large code base. I have no minimal C++ code example to trigger this. I am using GCC 7.5.0
Try assembling both and you'll see why.
0: 48 83 ec 80 sub $0xffffffffffffff80,%rsp
4: 48 81 c4 80 00 00 00 add $0x80,%rsp
The sub version is three bytes shorter.
This is because the add and sub immediate instructions on x86 have two forms. One takes an 8-bit sign-extended immediate, and the other a 32-bit sign-extended immediate. See https://www.felixcloutier.com/x86/add; the relevant forms are (in Intel syntax) add r/m64, imm8 and add r/m64, imm32. The 32-bit one is obviously three bytes larger.
The number 0x80 can't be represented as an 8-bit signed immediate; since the high bit is set, it would sign-extend to 0xffffffffffffff80 instead of the desired 0x0000000000000080. So add $0x80, %rsp would have to use the 32-bit form add r/m64, imm32. On the other hand, 0xffffffffffffff80 would be just what we want if we subtract instead of adding, and so we can use sub r/m64, imm8, giving the same effect with smaller code.
I wouldn't really say it's "exploiting an underflow". I'd just interpret it as sub $-0x80, %rsp. The compiler is just choosing to emit 0xffffffffffffff80 instead of the equivalent -0x80; it doesn't bother to use the more human-readable version.
Note that 0x80 is actually the only possible number for which this trick is relevant; it's the unique nonzero 8-bit number which is its own negative mod 2^8. Any smaller number can just use add, and any larger number has to use 32 bits anyway. In fact, 0x80 is the only reason that we couldn't just omit sub r/m, imm8 from the instruction set and always use add with negative immediates in its place. I guess a similar trick does come up if we want to do a 64-bit add of 0x0000000080000000; sub will do it, but add can't be used at all, as there is no imm64 version; we'd have to load the constant into another register first.
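If you want to convince yourself that 0x80 really is the only such value, here is a small sketch. The function names are mine, and the sizes 4 and 7 are just the byte counts from the two encodings shown above (REX.W prefix, opcode, ModRM, then an 8-bit or 32-bit immediate); it only considers adjustments that fit a 32-bit immediate.

```cpp
#include <cstdint>

// True if v can be encoded as a sign-extended 8-bit immediate.
constexpr bool fits_imm8(std::int64_t v) { return v >= -128 && v <= 127; }

// Size in bytes of the shortest "add/sub $imm, %rsp" that adds k to rsp.
// Either add $k or sub $-k may be used, whichever has the short form.
constexpr int shortest_adjust_size(std::int64_t k) {
    return (fits_imm8(k) || fits_imm8(-k)) ? 4 : 7;
}

// True when only the sub form reaches the short encoding.
constexpr bool needs_sub_trick(std::int64_t k) {
    return !fits_imm8(k) && fits_imm8(-k);
}

// Count the values in [lo, hi] for which the trick applies.
constexpr int count_sub_trick(std::int64_t lo, std::int64_t hi) {
    int n = 0;
    for (std::int64_t k = lo; k <= hi; ++k) {
        n += needs_sub_trick(k) ? 1 : 0;
    }
    return n;
}
```

Scanning any range around zero, for example count_sub_trick(-1000, 1000), finds exactly one hit, and that hit is 0x80.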

C++ data-types, and their effects on executable size

I'm basically new to C++, aside from attempting to learn the language over 10 years ago and giving up, as I didn't really have a project to motivate me... Anyways, I'm just stating that I'm pretty much a n00b to C++ to let you guys/gals know my current knowledge level. That said, I am fairly proficient with Python and PHP. And since both of those languages are loosely typed, I am not that familiar with the impact type casting in C++ has on executable size, if any.
I am writing an Arduino program to take some data from a couple of ultra-sonic distance sensors and apply the data to a servo control algorithm. No problems with that, but I am now trying to optimize my code, as I'm getting close to the Arduino Micro's limit of 28,672 bytes. My first thought was to change my data types wherever possible to things like short int's and char's, expecting it to have either no effect, or to slightly reduce my executable size. What I found is that the executable actually increased in size, after these changes, by a few hundred bytes.
Could someone with more C++ knowledge than I kindly help me understand the reason for this, and why I should, or shouldn't, even bother trying to choose the smallest possible data types for my variables? Obviously the results dictate what I should do here, but I really like to understand the 'why' behind things, and after some Googling, I still came up unsure.
Also, if it's not too much to ask; does anyone have some tips, or a link to some info on optimizing C++ code for limited-memory micro-controllers such as the Arduino?
You ask many things, but this can be answered with an example:
What I found is that the executable actually increased in size, after these changes, by a few hundred bytes.
... help me understand the reason for this ...
In general, you cannot predict whether a smaller data type is better or worse, which the small bit of code below will demonstrate.
To see what is going on, you have to look at the assembly code produced by the compiler. The AVR tool chain has a component that will produce such a listing, typically an .LSS file. I don't think Arduino supports this. The assembly listings below are via Eclipse which drives the extended listing by default.
Here is a little section of an LED blink program that can be used to demonstrate your confusion. It has a brightness value that it sets to the LED in the loop:
boolean fadein = true;
int bright = 0; // we will change this data type int <-> int8_t

void loop() {
    // adjust brightness based on current direction
    if (fadein) {
        bright += 1;
    }
    else {
        bright -= 1;
    }
    // apply current light level
    analogWrite(13, bright);
}
To demonstrate, the bright variable is changed between 1 byte and 2 byte int's and we compare the assembly listing:
Compare The Increment Line
Here is the listing for just the increment line with two data types:
// int bright - increment line - must load and store 2 bytes
// 18 bytes of code
bright += 1;
18a: 80 91 02 01 lds r24, 0x0102
18e: 90 91 03 01 lds r25, 0x0103
192: 01 96 adiw r24, 0x01 ; 1
194: 90 93 03 01 sts 0x0103, r25
198: 80 93 02 01 sts 0x0102, r24
The first column is the code-space address, the second column is the actual code bytes, and the last column is the human-readable assembly. LDS loads from memory, ADIW is the add, and STS stores back to memory.
Here is the smaller data type, with the expected result:
// int8_t bright - increment line - only load and store 1 byte
// 10 bytes of code
bright += 1;
18a: 80 91 02 01 lds r24, 0x0102
18e: 8f 5f subi r24, 0xFF ; 255
190: 80 93 02 01 sts 0x0102, r24
Note the weirdness of SUBI 255 instead of adding 1: AVR has no add-immediate instruction for a single 8-bit register, so the compiler subtracts 0xFF (that is, -1 modulo 256) instead.
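If the SUBI line looks suspicious, this tiny model shows why it works. It is only a sketch of the arithmetic, not of the real instruction (flags are ignored), and the function name subi is just borrowed from the mnemonic.

```cpp
#include <cstdint>

// Model of AVR "subi Rd, K" on an 8-bit register: Rd = Rd - K, modulo 256.
constexpr std::uint8_t subi(std::uint8_t rd, std::uint8_t k) {
    return static_cast<std::uint8_t>(rd - k);
}
```

Subtracting 0xFF from any 8-bit value gives the same result as adding 1, including the wrap from 255 back to 0.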
So there you go, the smaller data type produces smaller code as you expected. You were correct! Oh wait, you already stated you were not correct...
Compare the function call
But what about function calls? The analogWrite() method expects an int, so the compiler will be forced to create a conversion if needed
// int - needs no type conversion, can directly load value
// from addresses 0x0102 and 0x0103 and call
// 16 bytes code
// apply current light level
analogWrite(13,bright);
1b0: 20 91 02 01 lds r18, 0x0102
1b4: 30 91 03 01 lds r19, 0x0103
1b8: 8d e0 ldi r24, 0x0D ; 13
1ba: b9 01 movw r22, r18
1bc: 0e 94 87 02 call 0x50e ; 0x50e <analogWrite>
LDI is loading the constant, MOVW is moving variable in preparation for call.
// int8_t - needs a type conversion before call
// 20 bytes code
// apply current light level
analogWrite(13,bright);
1a0: 80 91 02 01 lds r24, 0x0102
1a4: 28 2f mov r18, r24
1a6: 33 27 eor r19, r19
1a8: 27 fd sbrc r18, 7
1aa: 30 95 com r19
1ac: 8d e0 ldi r24, 0x0D ; 13
1ae: b9 01 movw r22, r18
1b0: 0e 94 76 02 call 0x4ec ; 0x4ec <analogWrite>
No need to understand the assembly for the type conversion to see the effect. The smaller data type has produced more code.
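For the curious, the extra instructions are a sign extension from 8 to 16 bits. Here is a sketch of what they compute, written as a C++ function with a made-up name; it mirrors the eor/sbrc/com sequence in the listing rather than anything the compiler exposes.

```cpp
#include <cstdint>

// Widen the 8-bit value in `low` to 16 bits the way the listing does:
// start with a zero high byte (eor), then invert it to 0xFF (com) only
// when bit 7 of the low byte is set (sbrc skips the com otherwise).
constexpr std::uint16_t sign_extend_8_to_16(std::uint8_t low) {
    std::uint8_t high = 0x00;
    if (low & 0x80) {
        high = 0xFF;
    }
    return static_cast<std::uint16_t>((high << 8) | low);
}
```

So a negative int8_t such as -1 (0xFF) becomes 0xFFFF, while positive values just get a zero high byte.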
So what does it mean? The smaller data type both reduces code size and increases code size. Unless you can compile code in your head, you cannot figure this out by inspection, you have to just try it.
First, take a look at how to optimize your Arduino memory usage and optimizing Arduino memory use. Also, take a look at saving RAM space.
Generally, you have to distinguish between code size and data size. Optimizing data size is likely to increase your code size (and also slow things down), because the compiler needs to put more instructions into the code to convert forth and back between the various possible data sizes.
So, as a rule of thumb: Use the default data size (e.g. "int") for any value, that appears in the data at most a few times. On the other hand, if you have large arrays, setting the optimum data size (e.g. "short", if the value is guaranteed to be in the range -32768 .. 32767) can greatly reduce the memory footprint of your app at runtime.
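To put numbers on the array case, here is a small sketch using fixed-width types so the sizes do not depend on the platform's int. The element count of 1000 is an arbitrary value picked for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sample buffers: same element count, different element width.
constexpr std::size_t kSamples = 1000;

constexpr std::size_t bytes_as_int32 = sizeof(std::int32_t[kSamples]);
constexpr std::size_t bytes_as_int16 = sizeof(std::int16_t[kSamples]);
constexpr std::size_t bytes_as_int8  = sizeof(std::int8_t[kSamples]);
```

For a single variable the difference is a byte or two, which the extra conversion code can easily eat up; for an array it scales with the element count.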
In your case, where you don't have much data, focus more on optimizing code size: Reduce the number of libraries used, and avoid wrappers etc.
Floating point numbers are among the biggest memory consumers (both in RAM and flash). RAM because the types are larger than integers, and flash because the Arduino does not have a floating point unit. Thus all floating point operations will result in a larger executable.
Also take care that using libraries may link lots of not really needed stuff that consumes significant amounts of your memory.
Having said that: without any more details on your code it is pretty hard to determine why you have such a large memory footprint.

How to nicely print Buffer in OCaml?

I have a Buffer.
Question 1
How can I print out all byte inside one by one?
Question 2
How can I control the format of the printing?
For example, if I have a buffer like 33 33 33 33 33 33 14 40 (every byte is in HEX format), how can I print it as \x33\x33\x33\x33\x33\x33\x14\x40?
To apply an imperative function f to every byte in a buffer b, you can use String.iter f (Buffer.contents b).
To print a value with a desired format, you can use Printf.printf.
To get the integer value of a byte in a string you can use Char.code.
As a side comment, many of your recent questions could be answered extremely quickly by reading through the OCaml standard library documentation. I think this would be a good thing for you to do. There's not a lot of deep intellectual content, it's just something you should know about as an OCaml programmer.

Add two big endian values on little endian machine

I am currently facing a problem that I have no idea how to avoid.
I am trying to process data which can be either big endian or little endian. That in itself is not really a problem, because the data always starts with a header, so I can check which endian mode I have to use. But during the decoding of the values there are some operations which I don't know how to implement for big endian data.
The code runs on a nVidia Tegra (Cortex-A9 based on ARMv7 architecture) which is little endian (or runs in little endian mode) but sometimes I get big endian data.
Most operations on the data are not really a problem, but I don't know how to get the addition right.
Example: D5 1B EE 96 | 96 EE 1B D5
+ AC 84 F4 D5 | + D5 F4 84 AC
= 1 81 A0 E3 6B | = 1 6C E2 A0 81
As you can see, most bytes are already correct in the result but some are not. They differ by +1 or -1 from the expected result because the addition is always made from right to left (little endian machine) and so we take the carry (if any) to the left.
In the case of the big endian addition on this little endian machine I would have to add from left to right and take the carry (if any) to the right.
My question now is, whether there is a possibility (maybe using special instructions for the processor?) to get the right result? Maybe there are further operations I can make on the result to get rid of these +1/-1 differences which are "cheaper" than to revert both operands and also the result?
Best Regards,
Tobias
The most logical way to do this is to simply convert the numbers to the correct endianness, then perform the calculation, then (if needed) convert back again.
You could of course use a loop to do the byte-by-byte backwards calculation and handle the carry - but it's more complicated, and I'm pretty certain that it won't be faster either, because there are more conditionals and processors are pretty good at "byteswapping".
You should be able to use the ntohl and htonl networking functions to convert the numbers.
Something like this:
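#include <arpa/inet.h>  // ntohl, htonl
#include <stdint.h>

uint32_t add_big_endian(uint32_t a, uint32_t b)
{
    uint32_t x = ntohl(a);  // big-endian (network) order -> host order
    uint32_t y = ntohl(b);
    uint32_t z = x + y;     // ordinary addition; wraps modulo 2^32
    return htonl(z);        // host order -> big-endian
}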
You have two options: you can write two sets of code, one for each endianness, and try to keep track of what's going on where, or you can use a single internal representation and convert incoming and outgoing values appropriately. The latter is much simpler.
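As a sketch of the convert, add, convert back approach without relying on the networking headers, here is a portable version checked against the numbers from the question. The names bswap32 and add_swapped are made up, and unlike ntohl this always swaps, so it assumes the host really is little endian and the data really is big endian.

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value.
constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Add two values whose bytes were loaded in the wrong order: swap to the
// native order, add (wrapping modulo 2^32), then swap the sum back.
constexpr std::uint32_t add_swapped(std::uint32_t a, std::uint32_t b) {
    return bswap32(bswap32(a) + bswap32(b));
}
```

With the example data, the bytes D5 1B EE 96 and AC 84 F4 D5 load as 0x96EE1BD5 and 0xD5F484AC on the little endian machine, and add_swapped returns 0x6BE3A081, i.e. the bytes 81 A0 E3 6B in memory (the carry out of the top byte is dropped in 32 bits).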

binary protocol - byte swap trick

Let's say we have a binary protocol, with fields network ordered (big endian).
struct msg1
{
    int32 a;
    int16 b;
    uint32 c;
};
if instead of copying the network buffer to my msg1 and then use the "networkToHost" functions to read msg1
I rearrange / reverse msg1 to
struct msg1
{
    uint32 c;
    int16 b;
    int32 a;
};
and simply do a reverse copy from the network buffer to create msg1. In that case, there is no need for networkToHost functions. This approach doesn't work on big endian machines, but for me this is not a problem. Apart from that, is there any other drawback that I miss?
thanks
P.S. for the above we enforce strict alignment(#pragma pack(1) etc)
Apart from that, is there any other drawback that I miss?
I'm afraid you've misunderstood the nature of endian conversion problems. "Big endian" doesn't mean your fields are laid out in reverse, so that a
struct msg1_bigendian
{
    int32 a;
    int16 b;
    uint32 c;
};
on a big endian architecture is equivalent to a
struct msg1_littleendian
{
    uint32 c;
    int16 b;
    int32 a;
};
on a little endian architecture. Rather, it means that the byte-order within each field is reversed. Let's assume:
a = 0x1000000a;
b = 0xb;
c = 0xc;
On a big-endian architecture, this will be laid out as:
10 00 00 0a
00 0b
00 00 00 0c
The high-order (most significant) byte comes first.
On a little-endian machine, this will be laid out as:
0a 00 00 10
0b 00
0c 00 00 00
The lowest order byte comes first, the highest order last.
Serialize them and overlay the serialized form of the messages on top of each other, and you will discover the incompatibility:
10 00 00 0a   00 0b     00 00 00 0c   (big endian)
0a 00 00 10   0b 00     0c 00 00 00   (little endian)
int32 a       int16 b   uint32 c
Note that this isn't simply a case of the fields running in reverse. Your proposal would result in a little endian machine mistaking the big endian representation as:
a = 0xc000000;
b = 0xb00;
c = 0xa000010;
Certainly not what was transmitted!
You really do have to convert every individual field to network byte order and back again, for every field transmitted.
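The misreading can be checked with a short sketch. The load_* helpers below are names I made up; they read a field from a byte buffer with an explicit byte order, so the result does not depend on the machine running the code. The wire array holds the big endian message from the example (a = 0x1000000a, b = 0xb, c = 0xc).

```cpp
#include <cstdint>

// Read a field from a byte buffer, treating the first byte as most significant.
constexpr std::uint32_t load_be32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}
constexpr std::uint16_t load_be16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Same, but treating the first byte as least significant, which is what a
// little endian machine does when it reads the bytes in place.
constexpr std::uint32_t load_le32(const std::uint8_t* p) {
    return (std::uint32_t(p[3]) << 24) | (std::uint32_t(p[2]) << 16) |
           (std::uint32_t(p[1]) << 8)  |  std::uint32_t(p[0]);
}
constexpr std::uint16_t load_le16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[1] << 8) | p[0]);
}

// The serialized big endian message: a, then b, then c.
constexpr std::uint8_t wire[10] = {0x10, 0x00, 0x00, 0x0a,
                                   0x00, 0x0b,
                                   0x00, 0x00, 0x00, 0x0c};
```

Reading the buffer in place with the reversed struct amounts to load_le32(wire), load_le16(wire + 4) and load_le32(wire + 6), which give 0x0a000010, 0x0b00 and 0x0c000000; the load_be* versions at the same offsets recover the transmitted values.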
UPDATE:
Ok, I understand what you are trying to do now. You want to define the struct in reverse, then memcpy from the end of the byte string to the beginning (reverse copy) and reverse the byte order that way. In which case I would say, yes, this is a hack, and yes, it makes your code un-portable, and yes, it isn't worth it. Converting between byte orders is not, in fact, a very expensive operation and it is far easier to deal with than reversing the layout of every structure.
Are you sure this is required? More than likely, your network traffic is going to be your bottleneck, rather than CPU speed.
Agree with #ribond -
This has great potential to be very confusing to developers, since they'll have to work to keep these two semantically identical structures separate.
Given that network latency is on the order of 10,000,000x slower than it would take the CPU to process it, I'd just keep them the same.
Depending on how your compiler packs the bytes inside a struct, the 16-bit number in the middle might not end up in the right place. It might be stored in a 32-bit field and when you reverse the bytes it will "vanish".
Seriously, tricks like this may seem cute when you write them but in the long term they simply aren't worth it.
edit
You added the "pack 1" information so the bug goes away but the thing about "cute tricks" still stands - not worth it. Write a function to reverse 32-bit and 16-bit numbers.
inline void reverse(int16 &n)
{
...
}
inline void reverse(int32 &n)
{
...
}
Unless you can demonstrate that there is a significant performance penalty, you should use the same code to transfer data onto and off the network regardless of the endian-ness of the machine. As an optimization, for the platforms where the network order is the same as the hardware byte order, you can use tricks, but remember about alignment requirements and the like.
In the example, many machines (especially, as it happens, big-endian ones) will require a 2-byte pad between the end of the int16 member and the next int32 member. So, although you can read into a 10-byte buffer, you cannot treat that buffer as an image of the structure - which will be 12 bytes on most platforms.
As you say, this is not portable to big-endian machines. That is an absolute dealbreaker if you ever expect your code to be used outside of the x86 world. Do the rest of us a favor and just use the ntoh/hton routines or you'll probably find yourself featured on thedailywtf someday.
Please do the programmers that come after you a favor and write explicit conversions to and from a sequence of bytes in some buffer. Trickery with structures will lead you straight into endianness and alignment hell (been there).
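A minimal sketch of what such explicit conversions could look like for the message in the question. Everything here is illustrative: Msg1, encode, decode and the two small check helpers are hypothetical names, standard fixed-width types stand in for the question's int32/int16/uint32, and the wire format is assumed to be the 10 bytes a, b, c in big endian order with no padding.

```cpp
#include <cstdint>

// In-memory form of the message. Its layout and padding do not matter,
// because it is never copied to or from the wire as raw bytes.
struct Msg1 {
    std::int32_t  a;
    std::int16_t  b;
    std::uint32_t c;
};

constexpr int kWireSize = 10;  // 4 + 2 + 4 bytes on the wire

// Write the fields to buf in network (big endian) order.
constexpr void encode(const Msg1& m, std::uint8_t* buf) {
    const std::uint32_t a = static_cast<std::uint32_t>(m.a);
    const std::uint16_t b = static_cast<std::uint16_t>(m.b);
    buf[0] = static_cast<std::uint8_t>(a >> 24);
    buf[1] = static_cast<std::uint8_t>(a >> 16);
    buf[2] = static_cast<std::uint8_t>(a >> 8);
    buf[3] = static_cast<std::uint8_t>(a);
    buf[4] = static_cast<std::uint8_t>(b >> 8);
    buf[5] = static_cast<std::uint8_t>(b);
    buf[6] = static_cast<std::uint8_t>(m.c >> 24);
    buf[7] = static_cast<std::uint8_t>(m.c >> 16);
    buf[8] = static_cast<std::uint8_t>(m.c >> 8);
    buf[9] = static_cast<std::uint8_t>(m.c);
}

// Rebuild the fields from network-order bytes.
constexpr Msg1 decode(const std::uint8_t* buf) {
    return Msg1{
        static_cast<std::int32_t>(
            (std::uint32_t(buf[0]) << 24) | (std::uint32_t(buf[1]) << 16) |
            (std::uint32_t(buf[2]) << 8)  |  std::uint32_t(buf[3])),
        static_cast<std::int16_t>(
            static_cast<std::uint16_t>((buf[4] << 8) | buf[5])),
        (std::uint32_t(buf[6]) << 24) | (std::uint32_t(buf[7]) << 16) |
            (std::uint32_t(buf[8]) << 8) | std::uint32_t(buf[9])};
}

// Check helper: encode then decode and compare every field.
constexpr bool round_trips(const Msg1& m) {
    std::uint8_t buf[kWireSize] = {};
    encode(m, buf);
    const Msg1 d = decode(buf);
    return d.a == m.a && d.b == m.b && d.c == m.c;
}

// Check helper: byte i of the encoded message.
constexpr std::uint8_t wire_byte(const Msg1& m, int i) {
    std::uint8_t buf[kWireSize] = {};
    encode(m, buf);
    return buf[i];
}
```

The same two functions work unchanged on big and little endian hosts, and they never depend on struct packing or alignment, which is the whole point of writing the conversion out by hand.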