I'm new to the world of intrinsics, and I got here because I saw them as a way to achieve transparent code compilation, i.e. what you see is what you get, as well as reproducibility: for a system supporting e.g. AVX2, I know I'll end up with the same instructions in the end, given that I use AVX2 intrinsics. This seems like an important step towards writing HPC libraries that make use of SIMD. Feel free to correct my way of thinking.
Now, I have implemented a 3D vector dot product function in three variants in a micro-benchmarking setting. The code has been compiled with the GNU compiler v11.1.0 and run on a machine with an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz and 32 GiB of DDR4 RAM. The single-thread read-write memory bandwidth of this system has been measured at ~34 GiB/s by running a DAXPY benchmark.
First, let me present the elementary structures.
template<class Treal_t>
struct vector3
{
    Treal_t data[3] = {};
    inline Treal_t& operator()(const std::size_t& index) { return data[index]; }
    inline const Treal_t& operator()(const std::size_t& index) const { return data[index]; }
    inline Treal_t l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};
// strictly speaking, the following is a class of its own that implements a subset of
// the functionality of std::vector. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold" microbenchmarking.
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
The first is my scalar code. I'm compiling it with the flag "-O0".
void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
dot_products[dot_product_index] = float_stream[index] * float_stream[index] + float_stream[index + 1] * float_stream[index + 1]
+ float_stream[index + 2] * float_stream[index + 2];
dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
+ float_stream[index + 4] * float_stream[index + 4] + float_stream[index + 5] * float_stream[index + 5];
}
for (auto index = dot_product_index; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Next up is my auto-vectorized loop, where I strongly encourage vectorization using the corresponding directive from OpenMP 4.0. Compiled with the flags "-O3;-ffast-math;-march=native;-fopenmp".
void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
#pragma omp simd safelen(16)
for (auto index = std::size_t{}; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Finally, here's my version which has been vectorized using intrinsics. Compiled using "-O3;-ffast-math;-march=native;-mfma;-mavx2".
void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
// 1. load and permute the vectors
const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
// 2. do a multiply
const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
// 3. do 2 horizontal additions
const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
// 4. permute to target position
const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
// 5. store
_mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
}
for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
I've tested the code, so I know it works.
Now, brief details about the microbenchmarking:
I use a small library which I've written for this purpose: https://gitlab.com/anxiousprogrammer/tixl.
20 warm-up runs, 100 timed runs.
fresh allocations in each run for cold microbenchmarking; first-touching the test data (zeroing the first datum in each memory page) prevents page faults from being measured.
I'm modelling the dot product as 5 * size FLOPs against 5 * size * sizeof(float) bytes of data traffic, i.e. a code balance of 4 bytes/FLOP or a computational intensity of 0.25 FLOP/byte (a small sketch of how I turn this into an effective bandwidth follows the results list). Using this information, here are the performance results in terms of effective bandwidth:
no-vec: 18.6 GB/s
auto-vec: 21.3 GB/s
intrinsic-vec: 16.4 GB/s
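For concreteness, here is a minimal sketch of how I derive the effective-bandwidth figures from that model. The per-vector traffic breakdown is my reading of the model (3 floats read, 1 float written, plus presumably 1 float read by the write-allocate of the destination line), and mean_runtime_seconds stands for the averaged timing of the 100 runs:
#include <cstddef>
// 5 * sizeof(float) bytes of traffic and 5 FLOPs (3 mul + 2 add) per vector3<float>.
double effective_bandwidth_gbps(std::size_t size, double mean_runtime_seconds)
{
    const double bytes = 5.0 * static_cast<double>(size) * sizeof(float);
    return bytes / mean_runtime_seconds / 1.0e9; // GB/s
}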
Questions:
Is my motivation (mentioned in paragraph 1) a sensible one?
Why is my version slower than the scalar code?
Why are they all far from the peak read-write BW of 34 GiB/s?
Please excuse the lack of a minimal reproducer; the amount of code would be too much. Thanks a lot for your thoughts and inputs.
Your manually-vectorized code is not particularly efficient.
Try to benchmark the following 2 versions instead.
This one is simpler, and only requires the SSE 4.1 instruction set.
inline __m128 loadFloat3( const float* rsi )
{
__m128 xy = _mm_castpd_ps( _mm_load_sd( (const double*)rsi ) );
// Compilers should merge following 2 lines into single INSERTPS with a memory operand
__m128 z = _mm_load_ss( rsi + 2 );
return _mm_insert_ps( xy, z, 0x20 );
}
// Simple version which uses DPPS instruction from SSE 4.1 set
void dotProductsSimple( float* rdi, size_t length, const float* rsi )
{
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
// Load complete 16 byte vector, discard the W
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
// For the last vector, load exactly 12 bytes.
// Avoids potential crash when loading from out of bounds
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
This one is more complicated and requires AVX1 support. It will probably be slightly faster on most processors.
void dotProductTransposed( float* rdi, size_t length, const float* rsi )
{
constexpr size_t maskAlign8 = ~(size_t)7;
const float* const rsiEndAligned = rsi + ( length & maskAlign8 ) * 3;
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
while( rsi < rsiEndAligned )
{
// Load lower halves
__m256 m03, m14, m25;
m03 = _mm256_castps128_ps256( _mm_loadu_ps( rsi ) );
m14 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 4 ) );
m25 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 8 ) );
// Load upper halves; VINSERTF128 supports memory operand for the second argument.
m03 = _mm256_insertf128_ps( m03, _mm_loadu_ps( rsi + 12 ), 1 );
m14 = _mm256_insertf128_ps( m14, _mm_loadu_ps( rsi + 16 ), 1 );
m25 = _mm256_insertf128_ps( m25, _mm_loadu_ps( rsi + 20 ), 1 );
rsi += 24;
// Transpose these SIMD vectors
__m256 xy = _mm256_shuffle_ps( m14, m25, _MM_SHUFFLE( 2, 1, 3, 2 ) );
__m256 yz = _mm256_shuffle_ps( m03, m14, _MM_SHUFFLE( 1, 0, 2, 1 ) );
__m256 x = _mm256_shuffle_ps( m03, xy, _MM_SHUFFLE( 2, 0, 3, 0 ) );
__m256 y = _mm256_shuffle_ps( yz, xy, _MM_SHUFFLE( 3, 1, 2, 0 ) );
__m256 z = _mm256_shuffle_ps( yz, m25, _MM_SHUFFLE( 3, 0, 3, 1 ) );
// Now we have 3 SIMD vectors with gathered x/y/z fields of 8 source 3D vectors
// Compute squares
x = _mm256_mul_ps( x, x );
y = _mm256_mul_ps( y, y );
z = _mm256_mul_ps( z, z );
// Add squares
x = _mm256_add_ps( x, y );
x = _mm256_add_ps( x, z );
// Store 8 values
_mm256_storeu_ps( rdi, x );
rdi += 8;
}
// Handle the remainder
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
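For what it's worth, here is a minimal sketch of how these free functions could be driven from the containers in the question; the adapter is mine, and it assumes vector3_array<float> stores its floats contiguously:
void dot_product_sse(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    // Reinterpret the array of vector3<float> as a flat float stream, as the question already does.
    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    dotProductTransposed(dot_products.data(), varray.size(), float_stream);
}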
I'm trying to implement a selection sort algorithm using C++ assembly blocks. The code below shows the sorter function with the assembly block inside it. I am trying to emulate the selection sort algorithm shown below my code. When I compile my .cpp file and run it (the driver code is separate from this), I get the exact same numbers back in the same order (the first data set has 10 numbers: [10, -20, 5, 12, 30, -5, -22, 55, 52, 0]). What should I change to get my desired results?
void sorter (long* list, long count, long opcode)
{
/* Move the array pointer to rax, opcode to rbx, count to rcx */
/* The following sample code swaps the array elements in reverse order */
/* You would need to replace it with the logic from the bubble sort algorithm */
long temp;
long y;
asm
(
"movq %0, %%rax;" //Sets array pointer (base address of array) to rax register
"movq %1, %%rbx;" //Sets opcode (1 for Asc. | 2 for Desc) to rbx register
"movq %2, %%rcx;" //Sets count (total amount of #'s) to rcx register
"xorq %%rdx, %%rdx;" //Sets rdx (x counter) to 0
"movq %3, %%r9;" //Sets temp (used for swapping) to r9 register
"loop_start:"
"dec %%rcx;" //Decrements rcx (count)
"cmpq %%rdx, %%rcx;" //Compares rdx (x counter) to rcx (count)
"jle done;" //If rcx (total amount of #'s) is zero, then finish
"cmpq $1,%%rbx;" //Compares rbx (opcode) to 1
"jne desc;" //Jump to descending if opcode != 1 (2 or more)
"mov %%rdx, %%rdi;" //Sets rdi (y counter) = x
"inner_loop:"
"inc %%rdi;" //Increments rdi (y counter) (y++)
"movq (%%rax, %%rdx, 8), %%rsi;" //Sets rsi to array pointer + 8*rdx (array[x])
"movq (%%rax, %%rdi, 8), %%r8;" //Sets r8 to array pointer + 8*rdi (array[y])
"cmpq %%r8, %%rsi;"
"jle swap;"
"cmpq %%rdi, %%rcx;" //Compares rdi (y) and rcx (count)
"jb inner_loop;" //Jump to inner_loop if y < count
"inc %%rdx;" //Increment rdx (x counter) (x++)
"jmp loop_start;" //Closing for outer loop (loop_start)
"swap:"
"xchgq %%rsi,%%r9;"
"xchgq %%r8, %%rsi;"
"xchgq %%r9, %%r8;"
"jmp inner_loop;"
"desc:" //if opcode is 2 then reverse the list
"movq (%%rax, %%rcx, 8), %%r10;" //Moves array pointer + 8*rcx(count) to r10 (starts at last index of the array)
"movq (%%rax, %%rdx, 8), %%r11;" //Moves array pointer + 8*rdx to r11 (starts at first index of the array)
"xchgq %%r10, (%%rax, %%rdx, 8);"
"xchgq %%r11, (%%rax, %%rcx, 8);"
"inc %%rdx;"
"jmp loop_start;"
"done:"
:
: "m" (list), "m" (opcode), "m" (count), "m" (temp)
:
);
}
Selection sort algorithm to implement:
void sorter (long* list, long count, long opcode)
{
long x, y, temp;
for (x = 0; x < count - 1; x++)
for (y = x; y < count; y++)
if (list[x] > list[y])
{
temp = list[x];
list[x] = list[y];
list[y] = temp;
}
}
OK, so I'm trying to create a function that generates shellcode.
I'm having a lot of problems working out the REX / ModRM stuff.
My current code kind of works.
As long as both regs are below R8 it works fine.
If only one of the regs is R8 or above, it's also fine.
The problem is once both regs are R8 or above (and/or the same extended reg is on both sides): then I get the wrong bytes.
enum Reg64 : uint8_t {
RAX = 0, RCX = 1, RDX = 2, RBX = 3,
RSP = 4, RBP = 5, RSI = 6, RDI = 7,
R8 = 8, R9 = 9, R10 = 10, R11 = 11,
R12 = 12, R13 = 13, R14 = 14, R15 = 15
};
inline uint8_t encode_rex(uint8_t is_64_bit, uint8_t extend_sib_index, uint8_t extend_modrm_reg, uint8_t extend_modrm_rm) {
struct Result {
uint8_t b : 1;
uint8_t x : 1;
uint8_t r : 1;
uint8_t w : 1;
uint8_t fixed : 4;
} result{ extend_modrm_rm, extend_modrm_reg, extend_sib_index, is_64_bit, 0b100 };
return *(uint8_t*)&result;
}
inline uint8_t encode_modrm(uint8_t mod, uint8_t rm, uint8_t reg) {
struct Result {
uint8_t rm : 3;
uint8_t reg : 3;
uint8_t mod : 2;
} result{ rm, reg, mod };
return *(uint8_t*)&result;
}
inline void mov(Reg64 dest, Reg64 src) {
if (dest >= 8)
put<uint8_t>(encode_rex(1, 2, 0, 1));
else if (src >= 8)
put<uint8_t>(encode_rex(1, 1, 0, 2));
else
put<uint8_t>(encode_rex(1, 0, 0, 0));
put<uint8_t>(0x89);
put<uint8_t>(encode_modrm(3, dest, src));
}
//c.mov(Reg64::RAX, Reg64::RAX); // works
//c.mov(Reg64::RAX, Reg64::R9); // works
//c.mov(Reg64::R9, Reg64::RAX); // works
//c.mov(Reg64::R9, Reg64::R9); // Does not work returns (mov r9,rcx)
Also, if there is a shorter way to do this without all the ifs, that would be great.
FYI, most people create shellcode by assembling with a normal assembler like NASM, then hexdumping that binary into a C string. Writing your own assembler can be a fun project but is basically a separate project.
Your encode_rex looks somewhat sensible, taking four args for the four bits. But the code in mov that calls it passes a 2 sometimes, which will truncate to 0!
Also, there are 4 possibilities for the 2 relevant extension bits (b and x) you're using for reg-reg moves. But your if/else if/else chain only covers 3 of them, ignoring the possibility of dest>=8 && src >= 8 => x:b = 3
Since those two bits are orthogonal, you should just calculate them separately like this:
put<uint8_t>(encode_rex(1, 0, dest>=8, src>=8));
The SIB-index x field should always be 0 because you don't have a SIB byte, just ModRM for a reg-reg mov.
You have your struct initializer in encode_rex mixed up, with extend_modrm_reg being 2nd where it will initialize the x field instead of r. Your bitfield names match https://wiki.osdev.org/X86-64_Instruction_Encoding#Encoding, but you have the wrong C++ variables initializing them. See that link for descriptions.
Possibly I have the dest, src order backwards, depending on whether you're using the mov r/m, r or the mov r, r/m opcode. I didn't double-check which is which.
Sanity check from NASM: I assembled with nasm -felf64 -l/dev/stdout to get a listing:
1 00000000 4889C8 mov rax, rcx
2 00000003 4889C0 mov rax, rax
3 00000006 4D89C0 mov r8, r8
4 00000009 4989C0 mov r8, rax
5 0000000C 4C89C0 mov rax, r8
You're using the same 0x89 opcode that NASM uses, so your REX prefixes should match.
return *(uint8_t*)&result; is strict-aliasing UB and not safe outside of MSVC.
Use memcpy to safely type-pun. (Or a union; most real-world C++ compilers including gcc/clang/MSVC do define the behaviour of union type-punning as in C99, unlike ISO C++).
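Putting those points together, here is a minimal sketch of what the fixed pair could look like. This is my sketch, not the questioner's code: it keeps the original signatures, computes the REX byte with plain shifts (which sidesteps both the bitfield-order mix-up and the type-punning question), and resolves the dest/src order for the 0x89 opcode, where ModRM.reg holds the source and ModRM.rm the destination:
inline uint8_t encode_rex(uint8_t is_64_bit, uint8_t extend_sib_index, uint8_t extend_modrm_reg, uint8_t extend_modrm_rm) {
    // REX = 0100 W R X B
    return uint8_t(0x40 | (is_64_bit << 3) | (extend_modrm_reg << 2) | (extend_sib_index << 1) | extend_modrm_rm);
}
inline uint8_t encode_modrm(uint8_t mod, uint8_t rm, uint8_t reg) {
    return uint8_t((mod << 6) | ((reg & 7) << 3) | (rm & 7));
}
inline void mov(Reg64 dest, Reg64 src) {
    // 0x89 /r is mov r/m64, r64: ModRM.reg = src (extended by REX.R), ModRM.rm = dest (extended by REX.B).
    // A register-register move has no SIB byte, so REX.X stays 0.
    put<uint8_t>(encode_rex(1, 0, src >= 8, dest >= 8));
    put<uint8_t>(0x89);
    put<uint8_t>(encode_modrm(3, dest, src));
}
This reproduces the NASM listing above, e.g. mov(R8, R8) emits 4D 89 C0.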
Let's start with the code. I have two structures, one for vectors and the other for matrices.
struct AVector
{
explicit AVector(float x=0.0f, float y=0.0f, float z=0.0f, float w=0.0f):
x(x), y(y), z(z), w(w) {}
AVector(const AVector& a):
x(a.x), y(a.y), z(a.z), w(a.w) {}
AVector& operator=(const AVector& a) {x=a.x; y=a.y; z=a.z; w=a.w; return *this;}
float x, y, z, w;
};
struct AMatrix
{
// Row-major
explicit AMatrix(const AVector& a=AVector(), const AVector& b=AVector(), const AVector& c=AVector(), const AVector& d=AVector())
{row[0]=a; row[1]=b; row[2]=c; row[3]=d;}
AMatrix(const AMatrix& m) {row[0]=m.row[0]; row[1]=m.row[1]; row[2]=m.row[2]; row[3]=m.row[3];}
AMatrix& operator=(const AMatrix& m) {row[0]=m.row[0]; row[1]=m.row[1]; row[2]=m.row[2]; row[3]=m.row[3]; return *this;}
AVector row[4];
};
Next, the code performing calculations on those structures. Dot product using inline assembly and SSE instructions:
inline AVector AVectorDot(const AVector& a, const AVector& b)
{
// XXX
/*const double v=a.x*b.x+a.y*b.y+a.z*b.z+a.w*b.w;
return AVector(v, v, v, v);*/
AVector c;
asm volatile(
"movups (%1), %%xmm0\n\t"
"movups (%2), %%xmm1\n\t"
"mulps %%xmm1, %%xmm0\n\t" // xmm0 -> (a1+b1, , , )
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0xB1, %%xmm1, %%xmm1\n\t" // 0xB1 = 10110001
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x, y, z, w)+(y, x, w, z)=(x+y, x+y, z+w, z+w)
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0x0A, %%xmm1, %%xmm1\n\t" // 0x0A = 00001010
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x+y+z+w, , , )
"movups %%xmm0, %0\n\t"
: "=m"(c)
: "r"(&a), "r"(&b)
);
return c;
}
Matrix transposition:
inline AMatrix AMatrixTranspose(const AMatrix& m)
{
AMatrix c(
AVector(m.row[0].x, m.row[1].x, m.row[2].x, m.row[3].x),
AVector(m.row[0].y, m.row[1].y, m.row[2].y, m.row[3].y),
AVector(m.row[0].z, m.row[1].z, m.row[2].z, m.row[3].z),
AVector(m.row[0].w, m.row[1].w, m.row[2].w, m.row[3].w));
// XXX
/*printf("AMcrix c:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
c.row[0].x, c.row[0].y, c.row[0].z, c.row[0].w,
c.row[1].x, c.row[1].y, c.row[1].z, c.row[1].w,
c.row[2].x, c.row[2].y, c.row[2].z, c.row[2].w,
c.row[3].x, c.row[3].y, c.row[3].z, c.row[3].w);*/
return c;
}
Matrix-matrix multiplication: I transpose the first matrix, because when it is stored column-major and the second one row-major, I can perform the multiplication using dot products.
inline AMatrix AMatrixMultiply(const AMatrix& a, const AMatrix& b)
{
AMatrix c;
const AMatrix at=AMatrixTranspose(a);
// XXX
/*printf("AMatrix at:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
at.row[0].x, at.row[0].y, at.row[0].z, at.row[0].w,
at.row[1].x, at.row[1].y, at.row[1].z, at.row[1].w,
at.row[2].x, at.row[2].y, at.row[2].z, at.row[2].w,
at.row[3].x, at.row[3].y, at.row[3].z, at.row[3].w);*/
for(int i=0; i<4; ++i)
{
c.row[i].x=AVectorDot(at.row[0], b.row[i]).w;
c.row[i].y=AVectorDot(at.row[1], b.row[i]).w;
c.row[i].z=AVectorDot(at.row[2], b.row[i]).w;
c.row[i].w=AVectorDot(at.row[3], b.row[i]).w;
}
return c;
}
Now, time for the main (pun intended) part:
int main(int argc, char *argv[])
{
AMatrix a(
AVector(0, 1, 0, 0),
AVector(1, 0, 0, 0),
AVector(0, 0, 0, 1),
AVector(0, 0, 1, 0)
);
AMatrix b(
AVector(1, 0, 0, 0),
AVector(0, 2, 0, 0),
AVector(0, 0, 3, 0),
AVector(0, 0, 0, 4)
);
AMatrix c=AMatrixMultiply(a, b);
printf("AMatrix c:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
c.row[0].x, c.row[0].y, c.row[0].z, c.row[0].w,
c.row[1].x, c.row[1].y, c.row[1].z, c.row[1].w,
c.row[2].x, c.row[2].y, c.row[2].z, c.row[2].w,
c.row[3].x, c.row[3].y, c.row[3].z, c.row[3].w);
AVector v(1, 2, 3, 4);
AVector w(1, 1, 1, 1);
printf("Dot product: %f (1+2+3+4 = 10)\n", AVectorDot(v, w).w);
return 0;
}
In the above code I make two matrices, multiply them and print the resulting matrix.
It works fine if I don't use any compiler optimizations (g++ main.cpp -O0 -msse). With optimizations enabled (g++ main.cpp -O1 -msse), the resulting matrix is empty (all fields are zeroes).
Uncommenting any of the blocks marked with XXX makes the program print the correct result.
It seems to me that GCC optimizes out the matrix at from AMatrixMultiply, because it wrongly assumes it is not used in AVectorDot, which is written using SSE inline assembly.
The last few lines check whether the dot-product function really works, and yes, it does.
So, the question is: did I do or understand something wrong, or is this some kind of bug in GCC? My guess is a 7:3 mix of the two.
I'm using GCC version 5.1.0 (tdm-1).
This is also a very inefficient way of multiplying matrices using SSE; I'd be surprised if it were much faster than a scalar implementation, given how much floating-point throughput modern CPUs have. A better method, which needs no explicit transpose, is outlined below:
AMatrix & operator *= (AMatrix & m0, const AMatrix & m1)
{
__m128 r0 = _mm_load_ps(& m1[0][x]);
__m128 r1 = _mm_load_ps(& m1[1][x]);
__m128 r2 = _mm_load_ps(& m1[2][x]);
__m128 r3 = _mm_load_ps(& m1[3][x]);
for (int i = 0; i < 4; i++)
{
__m128 ti = _mm_load_ps(& m0[i][x]), t0, t1, t2, t3;
t0 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(0, 0, 0, 0));
t1 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(1, 1, 1, 1));
t2 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(2, 2, 2, 2));
t3 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(3, 3, 3, 3));
ti = t0 * r0 + t1 * r1 + t2 * r2 + t3 * r3;
_mm_store_ps(& m0[i][x], ti);
}
return m0;
}
On modern compilers, like gcc and clang, t0 * r0 + t1 * r1 + t2 * r2 + t3 * r3 is actually operating on __m128 types; though you can replace these with _mm_mul_ps and _mm_add_ps intrinsics if you want.
Return by value is then just a matter of adding a function like:
inline AMatrix operator * (const AMatrix & m0, const AMatrix & m1)
{
AMatrix lhs (m0); return (lhs *= m1);
}
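For reference, here is a minimal sketch of the same broadcast-and-accumulate idea written directly against the question's AMatrix/AVector layout; this adapter is mine, and it uses unaligned loads because AVector is not declared alignas(16):
#include <xmmintrin.h>
inline AMatrix AMatrixMultiplySSE(const AMatrix& a, const AMatrix& b)
{
    AMatrix c;
    const __m128 r0 = _mm_loadu_ps(&b.row[0].x);
    const __m128 r1 = _mm_loadu_ps(&b.row[1].x);
    const __m128 r2 = _mm_loadu_ps(&b.row[2].x);
    const __m128 r3 = _mm_loadu_ps(&b.row[3].x);
    for (int i = 0; i < 4; ++i)
    {
        const __m128 ai = _mm_loadu_ps(&a.row[i].x);
        // Broadcast each element of row i of a, scale the matching row of b, and accumulate.
        __m128 t = _mm_mul_ps(_mm_shuffle_ps(ai, ai, _MM_SHUFFLE(0, 0, 0, 0)), r0);
        t = _mm_add_ps(t, _mm_mul_ps(_mm_shuffle_ps(ai, ai, _MM_SHUFFLE(1, 1, 1, 1)), r1));
        t = _mm_add_ps(t, _mm_mul_ps(_mm_shuffle_ps(ai, ai, _MM_SHUFFLE(2, 2, 2, 2)), r2));
        t = _mm_add_ps(t, _mm_mul_ps(_mm_shuffle_ps(ai, ai, _MM_SHUFFLE(3, 3, 3, 3)), r3));
        _mm_storeu_ps(&c.row[i].x, t);
    }
    return c;
}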
Personally, I'd just replace the float x, y, z, w; with alignas(16) float _s[4] = {}; or similar, so you get a 'zero-vector' by default together with a defaulted constructor:
constexpr AVector () = default;
as well as nice constructors, like:
constexpr AVector (float x, float y, float z, float w)
: _s {x, y, z, w} {}
Your inline assembly lacks some constraints:
asm volatile(
"movups (%1), %%xmm0\n\t"
"movups (%2), %%xmm1\n\t"
"mulps %%xmm1, %%xmm0\n\t" // xmm0 -> (a1+b1, , , )
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0xB1, %%xmm1, %%xmm1\n\t" // 0xB1 = 10110001
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x, y, z, w)+(y, x, w, z)=(x+y, x+y, z+w, z+w)
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0x0A, %%xmm1, %%xmm1\n\t" // 0x0A = 00001010
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x+y+z+w, , , )
"movups %%xmm0, %0\n\t"
: "=m"(c)
: "r"(&a), "r"(&b)
);
GCC does not know that this assembler fragment clobbers %xmm0 and %xmm1, so it may assume those registers still hold their previous values after the fragment has run. A clobber for the memory read through the pointers is missing as well, since GCC cannot see that the fragment loads *&a and *&b via the "r" operands.
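To make that concrete, here is a minimal sketch of the same fragment with the missing pieces spelled out (my sketch): "xmm0" and "xmm1" are listed as clobbers, and a "memory" clobber tells GCC that the asm reads *&a and *&b through the pointer operands (alternatively, you could pass them as "m" input constraints):
asm volatile(
    "movups (%1), %%xmm0\n\t"
    "movups (%2), %%xmm1\n\t"
    "mulps %%xmm1, %%xmm0\n\t"
    "movaps %%xmm0, %%xmm1\n\t"
    "shufps $0xB1, %%xmm1, %%xmm1\n\t"
    "addps %%xmm1, %%xmm0\n\t"
    "movaps %%xmm0, %%xmm1\n\t"
    "shufps $0x0A, %%xmm1, %%xmm1\n\t"
    "addps %%xmm1, %%xmm0\n\t"
    "movups %%xmm0, %0\n\t"
    : "=m"(c)
    : "r"(&a), "r"(&b)
    : "xmm0", "xmm1", "memory"
);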
I have the following code, which is about 7 times faster than inet_addr. I was wondering if there is a way to improve it to make it even faster, or if a faster alternative exists.
This code requires that a valid null-terminated IPv4 address string is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or if a faster alternative exists, I would really appreciate it.
UINT32 GetIP(const char *p)
{
UINT32 dwIP=0,dwIP_Part=0;
while(true)
{
if(p[0] == 0)
{
dwIP = (dwIP << 8) | dwIP_Part;
break;
}
if(p[0]=='.')
{
dwIP = (dwIP << 8) | dwIP_Part;
dwIP_Part = 0;
p++;
}
dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
p++;
}
return dwIP;
}
Since we are speaking about maximizing throughput of IP address parsing, I suggest using a vectorized solution.
Here is an x86-specific fast solution (it needs SSE4.1, or at least SSSE3 if you use the slower extraction at the end):
__m128i shuffleTable[65536]; //can be reduced 256x, see the comment by @IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
__m128i input = _mm_lddqu_si128((const __m128i*)str); //"192.167.1.3"
input = _mm_sub_epi8(input, _mm_set1_epi8('0')); //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
__m128i cmp = input; //...X...X.X.XX... (signs)
UINT32 mask = _mm_movemask_epi8(cmp); //6792 - magic index
__m128i shuf = shuffleTable[mask]; //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
__m128i arr = _mm_shuffle_epi8(input, shuf); //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
__m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
__m128i prod = _mm_maddubs_epi16(coeffs, arr); //3 0 | 1 0 | 67 100 | 92 100
prod = _mm_hadd_epi16(prod, prod); //3 | 1 | 167 | 192 | ? | ? | ? | ?
__m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
prod = _mm_shuffle_epi8(prod, imm); //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
return _mm_extract_epi32(prod, 0);
// return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}
And here is the required precalculation for shuffleTable:
void MyInit() {
memset(shuffleTable, -1, sizeof(shuffleTable));
int len[4];
for (len[0] = 1; len[0] <= 3; len[0]++)
for (len[1] = 1; len[1] <= 3; len[1]++)
for (len[2] = 1; len[2] <= 3; len[2]++)
for (len[3] = 1; len[3] <= 3; len[3]++) {
int slen = len[0] + len[1] + len[2] + len[3] + 4;
int rem = 16 - slen;
for (int rmask = 0; rmask < 1<<rem; rmask++) {
// { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
int mask = 0;
char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
int pos = 0;
for (int i = 0; i < 4; i++) {
for (int j = 0; j < len[i]; j++) {
shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
pos++;
}
mask ^= (1<<pos);
pos++;
}
mask ^= (rmask<<slen);
_mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
}
}
}
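For reference, a minimal usage sketch as I understand it (the full benchmark with timings is behind the link below): the shuffle table must be filled once before the first call, and each address string should sit in a buffer with at least 16 readable bytes:
#include <cstdio>
int main() {
    MyInit();                             // precompute shuffleTable once
    char buf[16] = "192.167.1.3";         // 16 readable bytes; the tail is zero-filled
    std::printf("%08X\n", MyGetIP(buf));  // prints C0A70103
    return 0;
}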
Full code with testing is available here. On an Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
This means that the suggested solution is 7.8 times faster in terms of throughput than the code by the OP. It processes 336 million addresses per second (on a single core at 3.4 GHz).
Now I'll try to explain how it works. Note that on each line of the listing you can see the contents of the value just computed. All the arrays are printed in little-endian order (though the set intrinsics use big-endian argument order).
First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this works well automatically; on 32-bit it may theoretically cause an out-of-range access, though I do not believe it really can. The subsequent code works properly regardless of the values in the bytes after the end of the string. In any case, you had better ensure that each IP address takes at least 16 bytes of storage.
Then we subtract '0' from all the chars. After that, '.' turns into -2 and the terminating zero turns into -48, while all the digits remain nonnegative. Now we take the bitmask of the signs of all the bytes with _mm_movemask_epi8.
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from the lookup table shuffleTable. The table is quite large: 1 MB in total, and it takes quite some time to precompute. However, it does not take up precious space in the CPU cache, because only 81 of its elements are really used: each part of an IP address can be one, two, or three digits long, hence 3^4 = 81 variants in total.
Note that random trashy bytes after the end of the string may in principle cause an increased memory footprint of the lookup table.
EDIT: you can find a version modified by @IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 kB (it is a bit slower, though).
The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes according to our shuffle mask. As a result, the XMM register contains four 4-byte blocks, each containing the digits of one part in little-endian order. We convert each block into a 16-bit number with _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder the bytes of the register so that the whole IP address occupies the lower 4 bytes.
Finally, we extract the lower 4 bytes from the XMM register into a GP register. This is done with an SSE4.1 intrinsic (_mm_extract_epi32); if you don't have it, replace that line with the other one using _mm_extract_epi16, but it will run a bit slower.
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx //can be removed, see @EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
P.S. If you are still reading this, be sure to check out the comments =)
As for alternatives: this is similar to yours but with some error checking:
#include <iostream>
#include <string>
#include <cstdint>
uint32_t getip(const std::string &sip)
{
uint32_t r=0, b, p=0, c=0;
const char *s;
s = sip.c_str();
while (*s)
{
r<<=8;
b=0;
while (*s&&((*s==' ')||(*s=='\t'))) s++;
while (*s)
{
if ((*s==' ')||(*s=='\t')) { while (*s&&((*s==' ')||(*s=='\t'))) s++; if (*s!='.') break; }
if (*s=='.') { p++; s++; break; }
if ((*s>='0')&&(*s<='9'))
{
b*=10;
b+=(*s-'0');
s++;
}
}
if ((b>255)||(*s=='.')) return 0;
r+=b;
c++;
}
return ((c==4)&&(p==3))?r:0;
}
void testip(const std::string &sip)
{
uint32_t nIP=0;
nIP = getip(sip);
std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}
int main()
{
testip("192.167.1.3");
testip("292.167.1.3");
testip("192.267.1.3");
testip("192.167.1000.3");
testip("192.167.1.300");
testip("192.167.1.");
testip("192.167.1");
testip("192.167..1");
testip("192.167.1.3.");
testip("192.1 67.1.3.");
testip("192 . 167 . 1 . 3");
testip(" 192 . 167 . 1 . 3 ");
return 0;
}