C++ Inline Assembly Selection Sort Implementation

I'm trying to implement a selection sort algorithm in C++ using inline assembly blocks. The code below shows the sorter function with the assembly block inside it; I am trying to emulate the selection sort algorithm shown below my code. When I compile my .cpp file and run it (with test code not shown here), I get the exact same numbers back in the same order (the first data set has 10 numbers: [10, -20, 5, 12, 30, -5, -22, 55, 52, 0]). What should I change to get my desired results?
void sorter (long* list, long count, long opcode)
{
    /* Move the array pointer to rax, opcode to rbx, count to rcx */
    /* The following sample code swaps the array elements in reverse order */
    /* You would need to replace it with the logic from the selection sort algorithm */
    long temp;
    long y;
    asm
    (
        "movq %0, %%rax;"                // array pointer (base address of array) -> rax
        "movq %1, %%rbx;"                // opcode (1 for asc. | 2 for desc.) -> rbx
        "movq %2, %%rcx;"                // count (total amount of numbers) -> rcx
        "xorq %%rdx, %%rdx;"             // rdx (x counter) = 0
        "movq %3, %%r9;"                 // temp (used for swapping) -> r9
        "loop_start:"
        "dec %%rcx;"                     // decrement rcx (count)
        "cmpq %%rdx, %%rcx;"             // compare rdx (x counter) to rcx (count)
        "jle done;"                      // if count <= x, finish
        "cmpq $1, %%rbx;"                // compare rbx (opcode) to 1
        "jne desc;"                      // jump to descending if opcode != 1
        "mov %%rdx, %%rdi;"              // rdi (y counter) = x
        "inner_loop:"
        "inc %%rdi;"                     // increment rdi (y counter) (y++)
        "movq (%%rax, %%rdx, 8), %%rsi;" // rsi = array[x]
        "movq (%%rax, %%rdi, 8), %%r8;"  // r8 = array[y]
        "cmpq %%r8, %%rsi;"
        "jle swap;"
        "cmpq %%rdi, %%rcx;"             // compare rdi (y) and rcx (count)
        "jb inner_loop;"                 // jump to inner_loop if y < count
        "inc %%rdx;"                     // increment rdx (x counter) (x++)
        "jmp loop_start;"                // close the outer loop (loop_start)
        "swap:"
        "xchgq %%rsi, %%r9;"
        "xchgq %%r8, %%rsi;"
        "xchgq %%r9, %%r8;"
        "jmp inner_loop;"
        "desc:"                          // if opcode is 2, reverse the list
        "movq (%%rax, %%rcx, 8), %%r10;" // r10 = array[count] (starts at the last index)
        "movq (%%rax, %%rdx, 8), %%r11;" // r11 = array[x] (starts at the first index)
        "xchgq %%r10, (%%rax, %%rdx, 8);"
        "xchgq %%r11, (%%rax, %%rcx, 8);"
        "inc %%rdx;"
        "jmp loop_start;"
        "done:"
        :
        : "m" (list), "m" (opcode), "m" (count), "m" (temp)
        :
    );
}
Selection sort algorithm to implement:
void sorter (long* list, long count, long opcode)
{
    long x, y, temp;
    for (x = 0; x < count - 1; x++)
        for (y = x; y < count; y++)
            if (list[x] > list[y])
            {
                temp = list[x];
                list[x] = list[y];
                list[y] = temp;
            }
}
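A likely diagnosis (a sketch, not a verified fix): the symptom of getting the numbers back unchanged matches the swap block above. rsi and r8 only hold register copies of list[x] and list[y], and xchgq between registers never writes anything back, so the array in memory is untouched. (Note also that after cmpq %%r8, %%rsi, the jle swap branch is taken when list[x] <= list[y], while the reference algorithm swaps when list[x] > list[y], which would be jg.) Below is a minimal self-contained example, assuming GCC/Clang extended asm on x86-64 as in the question, of a swap that actually goes through memory:

#include <cstdio>

// Minimal sketch: a swap must store the values back to the array;
// exchanging register copies changes nothing in memory.
void swap_elems(long* list, long i, long j)
{
    asm(
        "movq (%0, %1, 8), %%rsi;"   // rsi = list[i]
        "movq (%0, %2, 8), %%r8;"    // r8  = list[j]
        "movq %%r8,  (%0, %1, 8);"   // list[i] = old list[j]
        "movq %%rsi, (%0, %2, 8);"   // list[j] = old list[i]
        :
        : "r" (list), "r" (i), "r" (j)
        : "rsi", "r8", "memory"
    );
}

int main()
{
    long a[2] = {10, -20};
    swap_elems(a, 0, 1);
    std::printf("%ld %ld\n", a[0], a[1]);   // prints: -20 10
}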

Related

creating shellcode problems with mov reg to reg

OK, so I'm trying to create a function that generates shellcode.
I'm having a lot of problems working out the REX / ModRM encoding.
My current code kind of works.
So far, if both regs are below R8, it works fine.
If I use just one reg that is R8 or above, it's also fine.
The problem is once I have two regs that aren't below R8 and they are the same, or if the src is the extended one; then I get problems.
enum Reg64 : uint8_t {
    RAX = 0, RCX = 1, RDX = 2, RBX = 3,
    RSP = 4, RBP = 5, RSI = 6, RDI = 7,
    R8 = 8, R9 = 9, R10 = 10, R11 = 11,
    R12 = 12, R13 = 13, R14 = 14, R15 = 15
};
inline uint8_t encode_rex(uint8_t is_64_bit, uint8_t extend_sib_index, uint8_t extend_modrm_reg, uint8_t extend_modrm_rm) {
    struct Result {
        uint8_t b : 1;
        uint8_t x : 1;
        uint8_t r : 1;
        uint8_t w : 1;
        uint8_t fixed : 4;
    } result{ extend_modrm_rm, extend_modrm_reg, extend_sib_index, is_64_bit, 0b100 };
    return *(uint8_t*)&result;
}
inline uint8_t encode_modrm(uint8_t mod, uint8_t rm, uint8_t reg) {
    struct Result {
        uint8_t rm : 3;
        uint8_t reg : 3;
        uint8_t mod : 2;
    } result{ rm, reg, mod };
    return *(uint8_t*)&result;
}
inline void mov(Reg64 dest, Reg64 src) {
    if (dest >= 8)
        put<uint8_t>(encode_rex(1, 2, 0, 1));
    else if (src >= 8)
        put<uint8_t>(encode_rex(1, 1, 0, 2));
    else
        put<uint8_t>(encode_rex(1, 0, 0, 0));
    put<uint8_t>(0x89);
    put<uint8_t>(encode_modrm(3, dest, src));
}
//c.mov(Reg64::RAX, Reg64::RAX); // works
//c.mov(Reg64::RAX, Reg64::R9);  // works
//c.mov(Reg64::R9, Reg64::RAX);  // works
//c.mov(Reg64::R9, Reg64::R9);   // does not work, encodes (mov r9, rcx)
Also, if there is a shorter way to do this without all the ifs, that would be great.
FYI, most people create shellcode by assembling with a normal assembler like NASM, then hexdumping that binary into a C string. Writing your own assembler can be fun, but it's basically a separate project.
Your encode_rex looks somewhat sensible, taking four args for the four bits. But the code in mov that calls it sometimes passes a 2, which will truncate to 0 in a 1-bit field!
Also, there are 4 possibilities for the 2 relevant extension bits (b and x) you're using for reg-reg moves, but your if/else if/else chain only covers 3 of them, ignoring the possibility of dest >= 8 && src >= 8 => x:b = 3.
Since those two bits are orthogonal, you should just calculate them separately like this:
put<uint8_t>(encode_rex(1, 0, dest>=8, src>=8));
The SIB-index x field should always be 0 because you don't have a SIB byte, just ModRM for a reg-reg mov.
You have your struct initializer in encode_rex mixed up, with extend_modrm_reg being 2nd where it will initialize the x field instead of r. Your bitfield names match https://wiki.osdev.org/X86-64_Instruction_Encoding#Encoding, but you have the wrong C++ variables initializing them. See that link for descriptions.
Possibly I have the dest, src order backwards, depending on whether you're using the mov r/m, r or the mov r, r/m opcode. I didn't double-check which is which.
Sanity check from NASM: I assembled with nasm -felf64 -l/dev/stdout to get a listing:
1 00000000 4889C8 mov rax, rcx
2 00000003 4889C0 mov rax, rax
3 00000006 4D89C0 mov r8, r8
4 00000009 4989C0 mov r8, rax
5 0000000C 4C89C0 mov rax, r8
You're using the same 0x89 opcode that NASM uses, so your REX prefixes should match.
return *(uint8_t*)&result; is strict-aliasing UB and not safe outside of MSVC.
Use memcpy to safely type-pun. (Or a union; most real-world C++ compilers, including gcc/clang/MSVC, do define the behaviour of union type-punning as in C99, unlike ISO C++.)
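Putting those points together, here is a minimal sketch of what the encoder could look like. Assumptions beyond the answer above: it reuses the Reg64 enum from the question, a hypothetical byte buffer stands in for wherever put<> writes (the original isn't shown), and the REX/ModRM bytes are built with plain shifts, which sidesteps the bitfield punning entirely. The operand order follows the mov r/m64, r64 form of opcode 0x89 and matches the NASM listing above:

#include <cstdint>
#include <vector>

std::vector<uint8_t> code;   // hypothetical output buffer

inline void put8(uint8_t b) { code.push_back(b); }

// mov r/m64, r64 (opcode 0x89): ModRM.reg = src, ModRM.rm = dest,
// so REX.R extends src, REX.B extends dest, and REX.X stays 0 (no SIB byte).
inline void mov(Reg64 dest, Reg64 src) {
    put8(0x48 | ((src >= 8) << 2) | (dest >= 8));   // REX.W | REX.R | REX.B
    put8(0x89);                                     // mov r/m64, r64
    put8(0xC0 | ((src & 7) << 3) | (dest & 7));     // mod=11 (reg-reg) | reg | rm
}

// mov(Reg64::R9, Reg64::R9) now emits 4D 89 C9 ("mov r9, r9"),
// consistent with the NASM listing above (mov r8, r8 -> 4D 89 C0).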

Variable in ternary operator never increments

q.rel can be either 1, 2, or 3
for (q in qry) {
pCode = (q.rel NEQ 3 ? q.rel
: pCode GTE 3 ? pCode++ : 3);
...
}
If there are a bunch of q.rel values of 3 in a row, pCode is supposed to increment, but it only ever shows 3.
Note that pCode is not initialized anywhere else; this is the complete code.
That's how postfix incrementing works. I will demonstrate the order of operations with assembler and use registers to avoid confusion about a temporary variable (that actually doesn't exist).
These instructions (postfix increment):
x = 0;
x = x++;
// x is still 0
translate to:
mov eax, 0 ; x = 0
mov ebx, eax ; x assignment on the right-hand side
mov eax, ebx ; x = x
inc ebx ; x++
"x" was incremented after the assignment was executed and thus the value never changed.
Now these instructions (prefix increment):
x = 0;
x = ++x;
// x is now 1
translate to:
mov eax, 0 ; x = 0
mov ebx, eax ; x assignment on the right-hand side
inc ebx ; ++x
mov eax, ebx ; x = x
"x" was incremented before the assignment was executed and thus the value changed.
That's pretty much how every programming language handles it. This is not related to the ternary operator at all.
When q.rel is 3, all of this can be reduced down to:
pCode = pCode++;
Q. So what does this equal?
A. It does nothing; pCode never gets incremented.
pCode = ++pCode;
does increment. Or more completely:
for (q in qry) {
pCode = (q.rel NEQ 3 ? q.rel
: pCode GTE 3 ? ++pCode : 3);
...
}
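One caveat for C++ readers (an aside, not part of the original answer): in C++, the exact statement pCode = pCode++ was undefined behaviour before C++17, because the increment and the assignment both write to the variable without sequencing. A well-defined way to observe the same old-value vs new-value distinction is to assign to a separate variable, as in this sketch:

#include <iostream>

int main() {
    int x = 0;
    int old_val = x++;                         // postfix yields the value before incrementing
    std::cout << old_val << " " << x << "\n";  // prints: 0 1

    int y = 0;
    int new_val = ++y;                         // prefix yields the value after incrementing
    std::cout << new_val << " " << y << "\n";  // prints: 1 1
}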

c++: Is table lookup vectorizable for small lookup-table

I want to vectorize the following snippet of code with SIMD intrinsics. Is this possible?
unsigned char chain[3][3] = {
    { 3,  2, 1 },   // y --> x
    { 4, -1, 0 },   // |
    { 5,  6, 7 }    // |
};                  // v
std::vector<int> x;
std::vector<int> y;
// initialize x, y
std::vector<int> chain_code(x.size());
for (std::size_t i = 0; i < x.size(); ++i)
    chain_code[i] = chain[x[i]][y[i]];
EDIT:
Support for: SSE - SSE4.2 and AVX
Architecture: Sandy Bridge i5 2500
If you make your x, y, chain_code 8-bit integers (instead of 32-bit ones), then you can process 16 values at once.
Here is the code using SSSE3:
std::vector<uint8_t> x;
std::vector<uint8_t> y;
...
int n = x.size();
std::vector<uint8_t> chain_code(n);
//initialize table register
__m128i table = _mm_setr_epi8(
    chain[0][0], chain[0][1], chain[0][2], 99,
    chain[1][0], chain[1][1], chain[1][2], 99,
    chain[2][0], chain[2][1], chain[2][2], 99,
    99, 99, 99, 99
);
int b = (n / 16) * 16;
for (int i = 0; i < b; i += 16) {
    //load 16 X/Y bytes
    __m128i regX = _mm_loadu_si128((__m128i*)&x[i]);
    __m128i regY = _mm_loadu_si128((__m128i*)&y[i]);
    //shift all X values left by 2 bits (as 16-bit integers)
    __m128i regX4 = _mm_slli_epi16(regX, 2);
    //calculate linear indices (x * 4 + y)
    __m128i indices = _mm_add_epi8(regX4, regY);
    //perform 16 lookups
    __m128i res = _mm_shuffle_epi8(table, indices);
    //store results
    _mm_storeu_si128((__m128i*)&chain_code[i], res);
}
//scalar tail for the remaining elements
for (int i = b; i < n; i++)
    chain_code[i] = chain[x[i]][y[i]];
The fully working version of this code is here. Generated assembly is quite simple (MSVC2013 x64):
movdqu xmm1, XMMWORD PTR [rdi+rax]
movdqu xmm0, XMMWORD PTR [rax]
psllw xmm1, 2
paddb xmm1, xmm0
movdqa xmm0, xmm6
pshufb xmm0, xmm1
movdqu XMMWORD PTR [rsi+rax], xmm0
P.S. I guess you'll have various performance issues with std::vector containers. Perhaps unaligned accesses are no longer expensive, but zero-filling the vector on construction will certainly happen, and it can take more time than the vectorized code itself.
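To make the index arithmetic explicit: the 99 entries pad each table row to a stride of 4, because the shuffle indices are computed as x * 4 + y (the shift-by-2 plus add above). A scalar equivalent of the same lookup, shown here only for reference:

#include <cstdint>

// Scalar equivalent of the SSSE3 lookup: the 3x3 table flattened to a
// stride-4 array, matching the x * 4 + y indices the vector code builds.
static const uint8_t table_flat[16] = {
    3,   2, 1, 99,   // row x = 0 (99 = padding, never addressed)
    4, 255, 0, 99,   // row x = 1 (-1 stored as unsigned byte 255)
    5,   6, 7, 99,   // row x = 2
    99, 99, 99, 99
};

inline uint8_t lookup(uint8_t xi, uint8_t yi) {
    return table_flat[(xi << 2) + yi];   // same index _mm_shuffle_epi8 consumes
}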

Fastest way to get IPv4 address from string

I have the following code, which is about 7 times faster than inet_addr. I was wondering if there is a way to improve this to make it even faster, or if a faster alternative exists.
This code requires that a valid null-terminated IPv4 address is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or if a faster alternative exists, I would really appreciate it.
UINT32 GetIP(const char *p)
{
    UINT32 dwIP = 0, dwIP_Part = 0;
    while (true)
    {
        if (p[0] == 0)
        {
            dwIP = (dwIP << 8) | dwIP_Part;
            break;
        }
        if (p[0] == '.')
        {
            dwIP = (dwIP << 8) | dwIP_Part;
            dwIP_Part = 0;
            p++;
        }
        dwIP_Part = (dwIP_Part * 10) + (p[0] - '0');
        p++;
    }
    return dwIP;
}
Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.
Here is an x86-specific fast solution (it needs SSE4.1, or at least SSSE3 with a slightly slower final extraction):
__m128i shuffleTable[65536]; //can be reduced 256x times, see @IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
    __m128i input = _mm_lddqu_si128((const __m128i*)str);   //"192.167.1.3"
    input = _mm_sub_epi8(input, _mm_set1_epi8('0'));        //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
    __m128i cmp = input;                                    //...X...X.X.XX... (signs)
    UINT32 mask = _mm_movemask_epi8(cmp);                   //6792 - magic index
    __m128i shuf = shuffleTable[mask];                      //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
    __m128i arr = _mm_shuffle_epi8(input, shuf);            //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
    __m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
    __m128i prod = _mm_maddubs_epi16(coeffs, arr);          //3 0 | 1 0 | 67 100 | 92 100
    prod = _mm_hadd_epi16(prod, prod);                      //3 | 1 | 167 | 192 | ? | ? | ? | ?
    __m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
    prod = _mm_shuffle_epi8(prod, imm);                     //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
    return _mm_extract_epi32(prod, 0);
    //return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE4.1
}
And here is the required precalculation for shuffleTable:
void MyInit() {
    memset(shuffleTable, -1, sizeof(shuffleTable));
    int len[4];
    for (len[0] = 1; len[0] <= 3; len[0]++)
        for (len[1] = 1; len[1] <= 3; len[1]++)
            for (len[2] = 1; len[2] <= 3; len[2]++)
                for (len[3] = 1; len[3] <= 3; len[3]++) {
                    int slen = len[0] + len[1] + len[2] + len[3] + 4;
                    int rem = 16 - slen;
                    for (int rmask = 0; rmask < 1 << rem; rmask++) {
                        // { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
                        int mask = 0;
                        char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
                        int pos = 0;
                        for (int i = 0; i < 4; i++) {
                            for (int j = 0; j < len[i]; j++) {
                                shuf[(3 - i) * 4 + (len[i] - 1 - j)] = pos;
                                pos++;
                            }
                            mask ^= (1 << pos);
                            pos++;
                        }
                        mask ^= (rmask << slen);
                        _mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
                    }
                }
}
Full code with testing is available here. On an Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
This means the suggested solution is 7.8 times faster in throughput than the OP's code, processing 336 million addresses per second (on a single core at 3.4 GHz).
Now I'll try to explain how it works. Note that on each line of the listing you can see the contents of the value just computed. All the arrays are printed in little-endian order (though the set intrinsics use big-endian).
First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this tends to work automatically. On 32-bit it may theoretically cause an out-of-range access, though I do not believe it really can. The subsequent code works properly regardless of the values in the after-the-end bytes. In any case, you had better ensure that each IP address takes at least 16 bytes of storage.
Then we subtract '0' from all the chars. After that, '.' turns into -2 and the terminating zero byte turns into -48; all the digits remain nonnegative. Now we take the bitmask of the signs of all the bytes with _mm_movemask_epi8.
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffle mask from the lookup table shuffleTable. The table is quite large: 1 MB total, and it takes quite some time to precompute. However, it does not take precious space in the CPU cache, because only 81 of its entries are actually used: each part of an IP address can be one, two, or three digits long, hence 3^4 = 81 variants in total.
Note that random trashy bytes after the end of the string may in principle cause an increased memory footprint in the lookup table.
EDIT: you can find a version modified by @IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 KB (it is a bit slower, though).
The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result, the XMM register contains four 4-byte blocks, each holding the digits of one octet in little-endian order. We convert each block into a 16-bit number with _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder the bytes of the register so that the whole IP address occupies the lower 4 bytes.
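If the multiply-add step looks opaque, here is the same arithmetic for a single 4-byte block in scalar form (a sketch; the digit layout is the one the shuffle produces, lowest digit first, zero-padded):

#include <cstdint>
#include <cstdio>

int main() {
    uint8_t block[4] = {2, 9, 1, 0};    // digits of "192" after the shuffle, low digit first
    uint8_t coeff[4] = {1, 10, 100, 0}; // per-digit weights, as in the _mm_set_epi8 above
    // _mm_maddubs_epi16 multiplies byte pairs and sums adjacent products into 16-bit lanes:
    int16_t lo = block[0] * coeff[0] + block[1] * coeff[1];  // 2*1 + 9*10  = 92
    int16_t hi = block[2] * coeff[2] + block[3] * coeff[3];  // 1*100 + 0*0 = 100
    // _mm_hadd_epi16 then adds the two lanes of each block:
    std::printf("%d\n", lo + hi);   // prints: 192
}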
Finally, we extract the lower 4 bytes from the XMM register into a GP register. This is done with an SSE4.1 intrinsic (_mm_extract_epi32); if you don't have it, replace that line with the commented-out alternative using _mm_extract_epi16, but it will run a bit slower.
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx //can be removed, see @EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
P.S. If you are still reading it, be sure to check out comments =)
As for alternatives: this is similar to yours but with some error checking:
#include <iostream>
#include <string>
#include <cstdint>

uint32_t getip(const std::string &sip)
{
    uint32_t r = 0, b, p = 0, c = 0;
    const char *s = sip.c_str();
    while (*s)
    {
        r <<= 8;
        b = 0;
        while (*s && ((*s == ' ') || (*s == '\t'))) s++;
        while (*s)
        {
            if ((*s == ' ') || (*s == '\t'))
            {
                while (*s && ((*s == ' ') || (*s == '\t'))) s++;
                if (*s != '.') break;
            }
            if (*s == '.') { p++; s++; break; }
            if ((*s >= '0') && (*s <= '9'))
            {
                b *= 10;
                b += (*s - '0');
                s++;
            }
        }
        if ((b > 255) || (*s == '.')) return 0;
        r += b;
        c++;
    }
    return ((c == 4) && (p == 3)) ? r : 0;
}

void testip(const std::string &sip)
{
    uint32_t nIP = getip(sip);
    std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}

int main()
{
    testip("192.167.1.3");
    testip("292.167.1.3");
    testip("192.267.1.3");
    testip("192.167.1000.3");
    testip("192.167.1.300");
    testip("192.167.1.");
    testip("192.167.1");
    testip("192.167..1");
    testip("192.167.1.3.");
    testip("192.1 67.1.3.");
    testip("192 . 167 . 1 . 3");
    testip(" 192 . 167 . 1 . 3 ");
    return 0;
}

Concatenation of the 7th lower bits of n bytes

I just wanted to know if there is any difference between these two expressions:
1 : a = ( a | ( b&0x7F ) >> 7 );
2 : a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
I'm not only asking about the result, but also about efficiency (though the first one looks cleaner).
The purpose is to concatenate the 7 lower bits of multiple bytes, and at first I was using number 2 like this:
while(thereIsByte)
{
    a = ( ( a << 8 ) | ( b&0x7F ) << i );
    ++i;
}
Thanks.
The two expressions don't do the same thing at all:
a = ( a | ( b&0x7F ) >> 7 );
Explaining:
a              = 0010001000100011
b              = 1000100010001100
0x7f           = 0000000001111111
b&0x7f         = 0000000000001100
(b&0x7f) >> 7  = 0000000000000000 (always 0: this selects the lowest 7 bits
                                   of 'b' and then shifts them right by 7,
                                   discarding all of the selected bits)
(a | (b&0x7f) >> 7) is therefore always equal to 'a'
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
Explaining:
a                        = 0010001000100011
b                        = 1000100010001100
0x7f                     = 0000000001111111
b&0x7f                   = 0000000000001100
(b&0x7f) << 1            = 0000000000011000
(a << 8)                 = 0010001100000000 (truncated to 16 bits)
(a << 8) | (b&0x7F) << 1 = 0010001100011000
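A quick way to confirm the tables above is to evaluate both expressions on those exact sample values (a sketch; uint16_t keeps the 16-bit truncation the worked example shows):

#include <cstdint>
#include <cstdio>

int main() {
    uint16_t a = 0x2223;   // 0010001000100011
    uint16_t b = 0x888C;   // 1000100010001100
    uint16_t e1 = a | ((b & 0x7F) >> 7);          // 0x2223: unchanged, the 7 bits vanish
    uint16_t e2 = (a << 8) | ((b & 0x7F) << 1);   // 0x2318: matches the last row above
    std::printf("%04X %04X\n", e1, e2);           // prints: 2223 2318
}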
In the second expression, the result keeps the lower bytes of a shifted up by one byte, with the lowest byte of b (minus its highest bit) shifted 1 bit to the left in the low byte. It is equivalent to a = a * 256 + (b & 0x7f) * 2.
If you want to concatenate the lowest 7 bits of b into a, it would be:
while (thereIsByte)
{
    a = (a << 7) | (b & 0x7f);
    // read the next byte into 'b'
}
For example, if sizeof(a) is 4 bytes and you concatenate four 7-bit groups, the result of the pseudocode would be:
a = uuuuzzzzzzzyyyyyyyxxxxxxxwwwwwww
where the z bits are the 7 bits of the first byte read, the y bits are the 7 bits of the second, and so on. The u bits are unused (they hold whatever was in the lowest 4 bits of a at the start).
In this case, a needs to be larger than the total number of bits you want to concatenate (e.g., at least 32 bits to hold four 7-bit groups).
If a and b are each one byte in size, there isn't really much to concatenate; you would probably want a data structure like boost::dynamic_bitset, which lets you append bits repeatedly and grows accordingly.
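A self-contained version of that loop, with hypothetical input bytes chosen purely for illustration:

#include <cstdint>
#include <cstdio>

int main() {
    // hypothetical input: four bytes whose low 7 bits get concatenated
    const uint8_t bytes[4] = {0x5A, 0x7F, 0x01, 0x33};
    uint32_t a = 0;
    for (uint8_t b : bytes)
        a = (a << 7) | (b & 0x7F);   // append the low 7 bits of each byte
    std::printf("%08X\n", a);        // prints: 0B5FC0B3
}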
Yes, they are different. Here is the MSVC 2010 disassembly when both a and b are chars.
a = ( a | ( b&0x7F ) >> 7 );
012713A6 movsx eax,byte ptr [a]
012713AA movsx ecx,byte ptr [b]
012713AE and ecx,7Fh
012713B1 sar ecx,7
012713B4 or eax,ecx
012713B6 mov byte ptr [a],al
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
012813A6 movsx eax,byte ptr [a]
012813AA shl eax,8
012813AD movsx ecx,byte ptr [b]
012813B1 and ecx,7Fh
012813B4 shl ecx,1
012813B6 or eax,ecx
012813B8 mov byte ptr [a],al
Notice that the second method does two shift operations while the first does only a single shift and an OR; this falls out of the order of operations, so the first expression is slightly cheaper. However, shifts are among the cheapest operations a CPU performs (a shift by any count is constant-time on modern x86), so the difference is negligible for most applications.
Notice also that the compiler loads the values as sign-extended bytes (movsx) and stores only the low byte of the result back, since a and b are chars, not full ints.