Translating SSE to Neon: How to pack and then extract 32bit result - c++

I have to translate the following instructions from SSE to Neon
uint32_t res = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK));
Where:
static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1);
So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing operation (in SSE I seem to remember using a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
How does this operation translate to Neon? Should I use packing instructions? How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)
Edit:
To start with, vgetq_lane_u32 should allow me to replace _mm_cvtsi128_si32
(but I will have to reinterpret my uint8x16_t as a uint32x4_t)
uint32_t vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);
or directly store the lane with vst1q_lane_u32:
void vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
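For example, a minimal sketch of that replacement (assuming the four bytes have already been packed into the low 32 bits of v):
#include <arm_neon.h>
#include <stdint.h>

// Reinterpret the byte vector as 32-bit lanes and read lane 0;
// this plays the role of _mm_cvtsi128_si32.
uint32_t low32(uint8x16_t v)
{
    return vgetq_lane_u32(vreinterpretq_u32_u8(v), 0);
}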

I found this excellent guide.
I am working on that; it seems that my operation could be done with one VTBL instruction (table lookup), but I will implement it with two deinterleaving operations because for the moment that looks simpler.
uint8x8x2_t vuzp_u8(uint8x8_t a, uint8x8_t b);
So something like:
uint8x16_t a;
uint8_t* out;
[...]
//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
vst1q_lane_u32(out,a,0);
The last line does not give a warning when __attribute__((optimize("lax-vector-conversions"))) is used.
But, because of the data conversion, the two assignments are not possible. One workaround looks like this (Edit: this breaks the strict-aliasing rules! The compiler may assume that a is not modified by the writes made through d.):
uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
I have implemented a more general workaround through a flexible data type:
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
Edit:
Here is the version with a shuffle mask / lookup table. It does indeed make my inner loop a little faster. Again, I have used the flexible data type described above.
static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);

I would write it like so:
uint32_t extract (uint8x16_t x)
{
    uint8x8x2_t a = vuzp_u8 (vget_low_u8 (x), vget_high_u8 (x));
    uint8x8x2_t b = vuzp_u8 (a.val[0], a.val[1]);
    return vget_lane_u32 (vreinterpret_u32_u8 (b.val[0]), 0);
}
Which on a recent GCC version compiles to:
extract:
vuzp.8 d0, d1
vuzp.8 d0, d1
vmov.32 r0, d0[0]
bx lr


gltf, are matrices ever to be transposed?

I am learning glTF and already have two working skinned models; now I am trying the RiggedFigure. The two other models worked just fine, and I am using the same code. I am using the VS Code glTF extension to verify my output.
The documentation states:
Accessors of matrix type have data stored in column-major order; start of each column must be aligned to 4-byte boundaries.
Eigen matrices are also column-major, so copying the raw bytes into an STL vector of Eigen::Matrix4f should yield the correct data, and indeed this is the case for two of the three models I have tried so far.
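As a minimal sketch of that assumption (illustrative only; the accessor data for one matrix is just 16 consecutive floats, column by column):
#include <Eigen/Dense>

// Eigen's default storage is column-major, so a raw column-major float[16]
// can be viewed directly as a Matrix4f without any transposition.
Eigen::Matrix4f FromColumnMajor(const float* raw)
{
    return Eigen::Map<const Eigen::Matrix4f>(raw);
}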
However, for the RiggedFigure, VS Code says the matrices should be as follows (excuse the screenshot, but I cannot copy-paste the matrices for some reason):
My code prints:
0.999983 0.000442018 0.00581419 -0.00398856
0 0.997123 -0.0758045 0.0520021
-0.005831 0.0758032 0.997106 -0.684015
0 0 0 1
1 0 0 0
0 -0.01376 0.999905 -0.85674
0 -0.999905 -0.0137601 0.024791
0 0 0 1
1 0 0 0
0 0.979842 0.199774 -0.224555
0 -0.199774 0.979842 -1.05133
0 0 0 1
1 0 0 0
0 -0.00751853 0.999972 -1.12647
0 -0.999972 -0.00751847 0.00796944
0 0 0 1
-1 -1.50995e-07 0 0
0 0.00364935 0.999993 -1.19299
-1.51869e-07 0.999993 -0.00364941 0.00535393
0 0 0 1
-0.0623881 0.998036 -0.00569177 0.00162297
0.891518 0.0531644 -0.449853 0.404156
-0.448667 -0.0331397 -0.893084 0.998987
0 0 0 1
0.109672 0.988876 -0.100484 0.107683
-0.891521 0.0531632 -0.449849 0.404152
-0.439503 0.13892 0.887434 -0.993169
0 0 0 1
0.530194 0.847874 0.001751 -0.183428
0.760039 -0.474352 -0.444218 0.206564
-0.375811 0.236853 -0.895917 0.973213
0 0 0 1
-0.0705104 -0.619322 0.781965 -0.761146
-0.760038 -0.474352 -0.444223 0.206569
0.646043 -0.625645 -0.437261 0.633599
0 0 0 1
0.631434 0.775418 -0.00419003 -0.228155
0.649284 -0.53166 -0.543845 0.154659
-0.423935 0.340682 -0.839175 0.951451
0 0 0 1
0.111378 -0.773831 0.623523 -0.550204
-0.649284 -0.531661 -0.543845 0.15466
0.752347 -0.344271 -0.561651 0.809067
0 0 0 1
-0.830471 -0.549474 0.091635 -0.00030848
0.0339727 -0.214148 -0.97621 0.596867
0.556025 -0.807601 0.196511 -0.159297
0 0 0 1
-0.994689 0.102198 0.0121981 -0.0750653
-0.0339737 -0.214147 -0.97621 0.596867
-0.0971548 -0.97144 0.216482 -0.140501
0 0 0 1
-0.99973 0.0232223 -7.82996e-05 0.0784336
0.0051282 0.217484 -0.97605 0.357951
-0.0226493 -0.975788 -0.217544 0.0222206
0 0 0 1
-0.998171 -0.0599068 -0.00810355 -0.0775425
-0.00512856 0.217484 -0.97605 0.357951
0.0602345 -0.974224 -0.217393 0.0251548
0 0 0 1
-0.999327 0.0366897 0 0.0783684
0.0287104 0.781987 0.622632 -0.0567413
0.0228442 0.622213 -0.782514 0.0634761
0 0 0 1
-0.999326 0.00828946 0.0357652 -0.0814984
0.0287402 0.782804 0.621604 -0.0521458
-0.0228444 0.622213 -0.782514 0.0634761
0 0 0 1
0.994013 0.109264 0.000418345 -0.0755577
0.109252 -0.993835 -0.0188101 -0.0405796
-0.00164008 0.0187438 -0.999822 0.0227357
0 0 0 1
0.994011 -0.109281 0.000483894 0.0755372
-0.109253 -0.993836 -0.018811 -0.0405797
0.00253636 0.0186453 -0.999823 0.0228038
0 0 0 1
These are the transposed versions of what VS Code shows.
My loading code is this (instantiated with type Eigen::Matrix4f):
void CopySparseBuffer(
    void* dest,
    const void* src,
    const size_t element_count,
    const size_t stride,
    const size_t type_size)
{
    assert(stride >= type_size);
    // Treat the src and dest addresses as byte pointers
    const unsigned char* csrc = (const unsigned char*)src;
    unsigned char* cdest = (unsigned char*)dest;
    // Iterate over the total number of elements to copy
    for (size_t i = 0; i < element_count; i++)
        // Copy each byte of the element. Since the stride could be different from the
        // type size (in the case of padding bytes, for example), the read access
        // should skip over any interleaved data; that's why we use the stride.
        for (size_t j = 0; j < type_size; j++)
            *(cdest + i * type_size + j) = *(csrc + i * stride + j);
}
template<typename T>
std::vector<T> ExtractDataFromAccessor(
const tinygltf::Model& model, const int accessor_index, bool print = false)
{
const int buffer_view_index = model.accessors[accessor_index].bufferView;
const int array_type = model.accessors[accessor_index].type;
const int component_type = model.accessors[accessor_index].componentType;
const int accessor_offset = model.accessors[accessor_index].byteOffset;
const int element_num = model.accessors[accessor_index].count;
const int buffer_index = model.bufferViews[buffer_view_index].buffer;
const int buffer_length = model.bufferViews[buffer_view_index].byteLength;
const int buffer_offset = model.bufferViews[buffer_view_index].byteOffset;
const int buffer_stride = model.bufferViews[buffer_view_index].byteStride;
const std::vector<unsigned char>& data = model.buffers[buffer_index].data;
assert(
component_type == ComponentCode<T>() &&
"The component type found here should match that of the type (e.g. float and "
"float).");
assert(array_type == TypeCode<T>());
// Size in bytes of a single element (e.g. 12 for a vec3 of floats)
const int type_size = sizeof(T);
assert(
buffer_stride == 0 || buffer_stride >= sizeof(T) &&
"It doesn't make sense for a positive buffer "
"stride to be less than the type size");
assert(element_num * type_size <= buffer_length);
const size_t stride = std::max(buffer_stride, type_size);
std::vector<T> holder(element_num);
CopySparseBuffer(
holder.data(),
data.data() + buffer_offset + accessor_offset,
element_num,
stride,
type_size);
return holder;
}
I just figured it out, so I will leave this here in case someone is in the same situation in the future.
The VS Code vectors are the columns, not the rows, so my code and VS Code actually agree; it is just that the VS Code output is confusing.
In short, everything works; the output is just confusing.

The size of a structure (sizeof) in C++ doesn't correspond to the real size in the case of arrays

I use dynamic arrays of the following structure:
struct TestStructure
{
    unsigned int serial;
    int channel;
    int pedestal;
    int noise;
    int test;
};
sizeof(TestStructure) returns 20, so I assume that there is no padding/alignment in the structure. That is logical, because it contains only 4-byte types.
But I discovered that the size of the structure multiplied by the element count is not equal to the size of the array. There is an additional pad between elements of the array! So, in the following code:
TestStructure* test_struct = new TestStructure[element_count];
for (int i = 0; i < element_count; i++)
    FillStructure(test_struct, i, i, i, i, i, i); // assigning 'i' for all elements
Long_t size_value = element_count * sizeof(TestStructure);
unsigned char* p_value = new unsigned char[size_value];
memcpy(p_value, test_struct, size_value);
The output array of chars contains the additional pads between elements:
sizeof(TestStructure) = 20. element_count = 10. size_value = 200. char array in the hex format:
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
6c 6c 2f 6c
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
6f 70 74 2f
2 0 0 0
...
Please explain this to me:
Does a dynamic array add padding between elements, or
does the sizeof operator report the wrong size of the structure?
P.S. I use GCC 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.2).
EDIT: I use this code in a macro with the ROOT CINT interpreter together with a GCC-compiled library. Sorry, it seems this issue is related not to GCC but to ROOT CINT.
EDIT2: Yes, in my ROOT macro (executed by the CINT interpreter) sizeof(TestStructure) returns 24, whereas when I call a function of the GCC-compiled library (containing the code fragment listed above), sizeof(TestStructure) returns 20 in the compiled function.
Although a compiler can add padding to the end of a struct, a compiler absolutely cannot add additional padding between the elements when manufacturing an array.
For an array TestStructure arr[n], the address of the i-th element must be (char*)arr + i * sizeof(TestStructure). If this were not true, pointer arithmetic would break horribly.
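A minimal check of both guarantees (a sketch, assuming C++11 for static_assert):
#include <cassert>
#include <cstddef>

struct TestStructure
{
    unsigned int serial;
    int channel;
    int pedestal;
    int noise;
    int test;
};

int main()
{
    // sizeof already includes any trailing padding, and an array of N elements
    // occupies exactly N * sizeof(TestStructure) bytes with no extra gaps.
    static_assert(sizeof(TestStructure[10]) == 10 * sizeof(TestStructure),
                  "arrays cannot contain extra padding between elements");
    TestStructure* a = new TestStructure[10];
    // Pointer arithmetic relies on this: &a[i] is exactly i elements past a.
    assert((char*)&a[3] - (char*)a == (std::ptrdiff_t)(3 * sizeof(TestStructure)));
    delete[] a;
    return 0;
}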

The indices of non-zero bytes of an SSE/AVX register

If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all non-zero elements?
For example, if xmm value is
| r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 |
the result should be something like (1, 3, 5, ..., 15). The result should be placed in another __m128i variable or a char[16] array.
If it helps, we can assume that register's value is such that all bytes are either 0 or some constant nonzero value (not necessary 1).
I am pretty much wondering if there is an instruction for that, or preferably a C/C++ intrinsic, in any SSE or AVX instruction set.
EDIT 1:
It was correctly observed by @zx485 that the original question was not clear enough. I was looking for any "consecutive" solution.
The example 0 1 0 1 0 1 0 1... above should result in either of the following:
If we assume that indices start from 1, then 0 would be a termination byte and the result might be
002 004 006 008 010 012 014 016 000 000 000 000 000 000 000 000
If we assume that negative byte is a termination byte the result might be
001 003 005 007 009 011 013 015 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF 0xFF
Anything that gives us consecutive bytes which we can interpret as the indices of the non-zero elements in the original value.
EDIT 2:
Indeed, as @harold and @Peter Cordes suggest in the comments to the original post, one possible solution is to create a mask first (e.g. with pmovmskb) and check the non-zero indices there. But that will lead to a loop.
Your question was unclear about whether you want the result array to be "compressed". What I mean by "compressed" is that the result should be consecutive. So, for example, for 0 1 0 1 0 1 0 1..., there are two possibilities:
Non-consecutive:
XMM0: 000 001 000 003 000 005 000 007 000 009 000 011 000 013 000 015
Consecutive:
XMM0: 001 003 005 007 009 011 013 015 000 000 000 000 000 000 000 000
One problem of the consecutive approach is: how do you decide if it's index 0 or a termination value?
I'm offering a simple solution to the first, non-consecutive approach, which should be quite fast:
.data
ddqZeroToFifteen db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
ddqTestValue: db 0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1
.code
movdqa xmm0, xmmword ptr [ddqTestValue]
pxor xmm1, xmm1 ; zero XMM1
pcmpeqb xmm0, xmm1 ; set to -1 for all matching
pandn xmm0, xmmword ptr [ddqZeroToFifteen] ; invert and apply indices
Just for the sake of completeness: the second, the consecutive approach, is not covered in this answer.
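For reference, a rough intrinsics version of the same non-consecutive approach (the function name is illustrative):
#include <emmintrin.h> // SSE2

__m128i nonzero_indices(__m128i x)
{
    const __m128i indices = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
    __m128i is_zero = _mm_cmpeq_epi8(x, _mm_setzero_si128()); // 0xFF where the byte is 0
    // andnot inverts the mask, so an index survives only where the byte was non-zero
    return _mm_andnot_si128(is_zero, indices);
}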
Updated answer: the new solution is slightly more efficient.
You can do this without a loop by using the pext instruction from the Bit Manipulation Instruction Set 2 (BMI2), in combination with a few other SSE instructions.
/*
gcc -O3 -Wall -m64 -mavx2 -march=broadwell ind_nonz_avx.c
*/
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
__m128i nonz_index(__m128i x){
/* Set some constants that will (hopefully) be hoisted out of a loop after inlining. */
uint64_t indx_const = 0xFEDCBA9876543210; /* 16 4-bit integers, all possible indices from 0 to 15 */
__m128i cntr = _mm_set_epi8(64,60,56,52,48,44,40,36,32,28,24,20,16,12,8,4);
__m128i pshufbcnst = _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80, 0x0E,0x0C,0x0A,0x08,0x06,0x04,0x02,0x00);
__m128i cnst0F = _mm_set1_epi8(0x0F);
__m128i msk = _mm_cmpeq_epi8(x,_mm_setzero_si128()); /* Generate 16x8 bit mask. */
msk = _mm_srli_epi64(msk,4); /* Pack 16x8 bit mask to 16x4 bit mask. */
msk = _mm_shuffle_epi8(msk,pshufbcnst); /* Pack 16x8 bit mask to 16x4 bit mask, continued. */
uint64_t msk64 = ~ _mm_cvtsi128_si64x(msk); /* Move to general purpose register and invert 16x4 bit mask. */
/* Compute the termination byte nonzmsk separately. */
int64_t nnz64 = _mm_popcnt_u64(msk64); /* Count the nonzero bits in msk64. */
__m128i nnz = _mm_set1_epi8(nnz64); /* May generate vmovd + vpbroadcastb if AVX2 is enabled. */
__m128i nonzmsk = _mm_cmpgt_epi8(cntr,nnz); /* nonzmsk is a mask of the form 0xFF, 0xFF, ..., 0xFF, 0, 0, ...,0 to mark the output positions without an index */
uint64_t indx64 = _pext_u64(indx_const,msk64); /* parallel bits extract. pext shuffles indx_const such that indx64 contains the nnz64 4-bit indices that we want.*/
__m128i indx = _mm_cvtsi64x_si128(indx64); /* Use a few integer instructions to unpack 4-bit integers to 8-bit integers. */
__m128i indx_024 = indx; /* Even indices. */
__m128i indx_135 = _mm_srli_epi64(indx,4); /* Odd indices. */
indx = _mm_unpacklo_epi8(indx_024,indx_135); /* Merge odd and even indices. */
indx = _mm_and_si128(indx,cnst0F); /* Mask out the high bits 4,5,6,7 of every byte. */
return _mm_or_si128(indx,nonzmsk); /* Merge indx with nonzmsk . */
}
int main(){
int i;
char w[16],xa[16];
__m128i x;
/* Example with bytes 15, 12, 7, 5, 4, 3, 2, 1, 0 set. */
x = _mm_set_epi8(1,0,0,1, 0,0,0,0, 1,0,1,1, 1,1,1,1);
/* Other examples. */
/*
x = _mm_set_epi8(1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1);
x = _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
x = _mm_set_epi8(1,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
x = _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1);
*/
__m128i indices = nonz_index(x);
_mm_storeu_si128((__m128i *)w,indices);
_mm_storeu_si128((__m128i *)xa,x);
printf("counter 15..0 ");for (i=15;i>-1;i--) printf(" %2d ",i); printf("\n\n");
printf("example xmm: ");for (i=15;i>-1;i--) printf(" %2d ",xa[i]); printf("\n");
printf("result in dec ");for (i=15;i>-1;i--) printf(" %2hhd ",w[i]); printf("\n");
printf("result in hex ");for (i=15;i>-1;i--) printf(" %2hhX ",w[i]); printf("\n");
return 0;
}
It takes about five instructions to get 0xFF (the termination byte) at the unwanted positions.
Note that a variant of nonz_index that returns the indices and only the position of the termination byte, without actually inserting the termination byte(s), would be much cheaper to compute and might be just as suitable in a particular application (a sketch of such a variant follows after the example output below). The position of the first termination byte is nnz64>>2.
The result is:
$ ./a.out
counter 15..0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
example xmm: 1 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1
result in dec -1 -1 -1 -1 -1 -1 -1 15 12 7 5 4 3 2 1 0
result in hex FF FF FF FF FF FF FF F C 7 5 4 3 2 1 0
The pext instruction is supported on Intel Haswell processors or newer.
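For completeness, here is a sketch of that cheaper variant (the name nonz_index_pos is illustrative; it assumes the same headers and compiler flags as the full example above):
__m128i nonz_index_pos(__m128i x, int *nnz){
    __m128i pshufbcnst = _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80, 0x0E,0x0C,0x0A,0x08,0x06,0x04,0x02,0x00);
    __m128i cnst0F     = _mm_set1_epi8(0x0F);
    uint64_t indx_const = 0xFEDCBA9876543210;                  /* 16 4-bit integers, all possible indices from 0 to 15 */
    __m128i msk = _mm_cmpeq_epi8(x,_mm_setzero_si128());       /* Generate 16x8 bit mask. */
    msk = _mm_shuffle_epi8(_mm_srli_epi64(msk,4),pshufbcnst);  /* Pack it to a 16x4 bit mask. */
    uint64_t msk64 = ~ _mm_cvtsi128_si64x(msk);                /* Move to a general purpose register and invert. */
    *nnz = (int)(_mm_popcnt_u64(msk64) >> 2);                  /* Position of the first termination byte. */
    uint64_t indx64 = _pext_u64(indx_const,msk64);             /* Gather the wanted 4-bit indices. */
    __m128i indx = _mm_cvtsi64x_si128(indx64);
    indx = _mm_unpacklo_epi8(indx,_mm_srli_epi64(indx,4));     /* Unpack 4-bit to 8-bit integers. */
    return _mm_and_si128(indx,cnst0F);                         /* Positions at or past *nnz hold 0 and must be ignored. */
}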

SSE2 intrinsics - comparing unsigned integers

I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and clamping the result to 0xFF:
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m3 = _mm_adds_epu8(m1, m2);
I would be interested in performing comparison for "less than" on these unsigned integers, similar to _mm_cmplt_epi8 for signed:
__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);
If an "epu8" equivalent was available, mask would have 0xFF where m3[i] < m1[i] (overflow!), 0x00 otherwise, and we would be able to clamp m1 using the "or", so m1 will hold the addition result where valid, and 0xFF where it overflowed.
Problem is, _mm_cmplt_epi8 performs a signed comparison, so for instance if m1[i] = 0x70 and m2[i] = 0x10, then m3[i] = 0x80 and mask[i] = 0xFF, which is obviously not what I require.
Using VS2012.
I would appreciate another approach for performing this. Thanks!
One way of implementing compares for unsigned 8-bit vectors is to exploit _mm_max_epu8, which returns the maximum of unsigned 8-bit int elements. You can compare the (unsigned) maximum of two elements for equality with one of the source elements and then return the appropriate result. This translates to 2 instructions for >= or <=, and 3 instructions for > or <.
Example code:
#include <stdio.h>
#include <emmintrin.h> // SSE2
#define _mm_cmpge_epu8(a, b) \
_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
#define _mm_cmple_epu8(a, b) _mm_cmpge_epu8(b, a)
#define _mm_cmpgt_epu8(a, b) \
_mm_xor_si128(_mm_cmple_epu8(a, b), _mm_set1_epi8(-1))
#define _mm_cmplt_epu8(a, b) _mm_cmpgt_epu8(b, a)
int main(void)
{
__m128i va = _mm_setr_epi8(0, 0, 1, 1, 1, 127, 127, 127, 128, 128, 128, 254, 254, 254, 255, 255);
__m128i vb = _mm_setr_epi8(0, 255, 0, 1, 255, 0, 127, 255, 0, 128, 255, 0, 254, 255, 0, 255);
__m128i v_ge = _mm_cmpge_epu8(va, vb);
__m128i v_le = _mm_cmple_epu8(va, vb);
__m128i v_gt = _mm_cmpgt_epu8(va, vb);
__m128i v_lt = _mm_cmplt_epu8(va, vb);
printf("va = %4vhhu\n", va);
printf("vb = %4vhhu\n", vb);
printf("v_ge = %4vhhu\n", v_ge);
printf("v_le = %4vhhu\n", v_le);
printf("v_gt = %4vhhu\n", v_gt);
printf("v_lt = %4vhhu\n", v_lt);
return 0;
}
Compile and run:
$ gcc -Wall _mm_cmplt_epu8.c && ./a.out
va = 0 0 1 1 1 127 127 127 128 128 128 254 254 254 255 255
vb = 0 255 0 1 255 0 127 255 0 128 255 0 254 255 0 255
v_ge = 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255 255
v_le = 255 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255
v_gt = 0 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0
v_lt = 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0 0
The other answers got me thinking of a simpler method to answer the specific question more directly:
To simply detect clamping, do saturating and non-saturating additions, and compare the results.
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m1m2_sat = _mm_adds_epu8(m1, m2);
__m128i m1m2_wrap = _mm_add_epi8(m1, m2);
__m128i non_clipped = _mm_cmpeq_epi8(m1m2_sat, m1m2_wrap);
So that's just two instructions beyond the adds, and one of them can run in parallel with the adds. So the non_clipped mask is ready one cycle after the addition result. (Potentially 3 instructions (an extra movdqa) without AVX 3-operand non-destructive vector ops.)
If the non-saturating add result is 0xFF, it will match the saturating-add result, and be detected as not clipping. This is why it's different from just checking the output of the saturating add for 0xFF bytes.
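If the mask of positions that did overflow is wanted instead (0xFF where the addition saturated, as asked in the question), the result only needs inverting, for example:
__m128i overflowed = _mm_xor_si128(non_clipped, _mm_set1_epi8(-1)); // 0xFF where the add saturated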
Another way to compare unsigned bytes: add 0x80 and compare them as signed ones.
__m128i _mm_cmplt_epu8(__m128i a, __m128i b) {
__m128i as = _mm_add_epi8(a, _mm_set1_epi8((char)0x80));
__m128i bs = _mm_add_epi8(b, _mm_set1_epi8((char)0x80));
return _mm_cmplt_epi8(as, bs);
}
I don't think it is very efficient, but it works, and it may be useful in some cases. Also, you can use xor instead of addition if you want.
In some cases you can even do bidirectional range checking at once, i.e. compare a value with both lower and upper bounds. To do so, align the lower bound with 0x80, similar to what this answer does.
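A sketch of that trick (assuming lo <= hi for each byte; the helper name is illustrative):
__m128i in_range_epu8(__m128i x, __m128i lo, __m128i hi)
{
    const __m128i sign = _mm_set1_epi8((char)0x80);
    // Re-base on lo and flip the sign bit: lo maps to INT8_MIN, so a single
    // signed compare against the re-based upper bound covers both bounds.
    __m128i xb  = _mm_xor_si128(_mm_sub_epi8(x,  lo), sign);
    __m128i hib = _mm_xor_si128(_mm_sub_epi8(hi, lo), sign);
    __m128i out = _mm_cmpgt_epi8(xb, hib);          // 0xFF where x < lo or x > hi
    return _mm_xor_si128(out, _mm_set1_epi8(-1));   // 0xFF where lo <= x <= hi
}
For example, in_range_epu8(v, _mm_set1_epi8('0'), _mm_set1_epi8('9')) yields a mask of the bytes of v that are ASCII digits.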
Here is an implementation of comparisons of 8-bit unsigned integers:
inline __m128i NotEqual8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), _mm_set1_epi8(-1));
}
inline __m128i Greater8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i GreaterOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}
inline __m128i Lesser8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i LesserOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}

Fastest way to get IPv4 address from string

I have the following code, which is about 7 times faster than inet_addr. I was wondering if there is a way to improve it further, or whether a faster alternative exists.
This code requires that a valid, null-terminated IPv4 address string is supplied with no whitespace, which in my case is always the way, so I optimized for that case. Usually you would have more error checking, but if there is a way to make the following even faster, or a faster alternative exists, I would really appreciate it.
UINT32 GetIP(const char *p)
{
    UINT32 dwIP=0,dwIP_Part=0;
    while(true)
    {
        if(p[0] == 0)
        {
            dwIP = (dwIP << 8) | dwIP_Part;
            break;
        }
        if(p[0]=='.')
        {
            dwIP = (dwIP << 8) | dwIP_Part;
            dwIP_Part = 0;
            p++;
        }
        dwIP_Part = (dwIP_Part*10)+(p[0]-'0');
        p++;
    }
    return dwIP;
}
Since we are speaking about maximizing the throughput of IP address parsing, I suggest using a vectorized solution.
Here is an x86-specific fast solution (it needs SSE4.1, or at least SSSE3 for a poorer variant):
__m128i shuffleTable[65536]; //can be reduced by a factor of 256, see @IwillnotexistIdonotexist
UINT32 MyGetIP(const char *str) {
__m128i input = _mm_lddqu_si128((const __m128i*)str); //"192.167.1.3"
input = _mm_sub_epi8(input, _mm_set1_epi8('0')); //1 9 2 254 1 6 7 254 1 254 3 208 245 0 8 40
__m128i cmp = input; //...X...X.X.XX... (signs)
UINT32 mask = _mm_movemask_epi8(cmp); //6792 - magic index
__m128i shuf = shuffleTable[mask]; //10 -1 -1 -1 8 -1 -1 -1 6 5 4 -1 2 1 0 -1
__m128i arr = _mm_shuffle_epi8(input, shuf); //3 0 0 0 | 1 0 0 0 | 7 6 1 0 | 2 9 1 0
__m128i coeffs = _mm_set_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
__m128i prod = _mm_maddubs_epi16(coeffs, arr); //3 0 | 1 0 | 67 100 | 92 100
prod = _mm_hadd_epi16(prod, prod); //3 | 1 | 167 | 192 | ? | ? | ? | ?
__m128i imm = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 4, 2, 0);
prod = _mm_shuffle_epi8(prod, imm); //3 1 167 192 0 0 0 0 0 0 0 0 0 0 0 0
return _mm_extract_epi32(prod, 0);
// return (UINT32(_mm_extract_epi16(prod, 1)) << 16) + UINT32(_mm_extract_epi16(prod, 0)); //no SSE 4.1
}
And here is the required precalculation for shuffleTable:
void MyInit() {
    memset(shuffleTable, -1, sizeof(shuffleTable));
    int len[4];
    for (len[0] = 1; len[0] <= 3; len[0]++)
        for (len[1] = 1; len[1] <= 3; len[1]++)
            for (len[2] = 1; len[2] <= 3; len[2]++)
                for (len[3] = 1; len[3] <= 3; len[3]++) {
                    int slen = len[0] + len[1] + len[2] + len[3] + 4;
                    int rem = 16 - slen;
                    for (int rmask = 0; rmask < 1<<rem; rmask++) {
                        // { int rmask = (1<<rem)-1; //note: only maximal rmask is possible if strings are zero-padded
                        int mask = 0;
                        char shuf[16] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
                        int pos = 0;
                        for (int i = 0; i < 4; i++) {
                            for (int j = 0; j < len[i]; j++) {
                                shuf[(3-i) * 4 + (len[i]-1-j)] = pos;
                                pos++;
                            }
                            mask ^= (1<<pos);
                            pos++;
                        }
                        mask ^= (rmask<<slen);
                        _mm_store_si128(&shuffleTable[mask], _mm_loadu_si128((__m128i*)shuf));
                    }
                }
}
The full code with testing is available here. On an Ivy Bridge processor it prints:
C0A70103
Time = 0.406 (1556701184)
Time = 3.133 (1556701184)
This means that the suggested solution is 7.8 times faster in terms of throughput than the OP's code. It processes 336 million addresses per second (on a single core at 3.4 GHz).
Now I'll try to explain how it works. Note that on each line of the listing you can see the contents of the value just computed. All the arrays are printed in little-endian order (though the set intrinsics use big-endian order).
First of all, we load 16 bytes from an unaligned address with the lddqu instruction. Note that in 64-bit mode memory is allocated in 16-byte chunks, so this works well automatically. On 32-bit it may theoretically cause issues with out-of-range access, though I do not believe that it really can. The subsequent code works properly regardless of the values in the after-the-end bytes. Anyway, you had better ensure that each IP address takes at least 16 bytes of storage.
Then we subtract '0' from all the chars. After that, '.' turns into -2 and the terminating zero turns into -48, while all the digits remain nonnegative. Now we take a bitmask of the signs of all the bytes with _mm_movemask_epi8.
Depending on the value of this mask, we fetch a nontrivial 16-byte shuffling mask from the lookup table shuffleTable. The table is quite large: 1 MB in total. And it takes quite some time to precompute. However, it does not take precious space in the CPU cache, because only 81 elements of this table are really used. That is because each part of an IP address can be one, two, or three digits long, hence 81 variants in total.
Note that random trashy bytes after the end of the string may in principle cause an increased memory footprint in the lookup table.
EDIT: you can find a version modified by @IwillnotexistIdonotexist in the comments, which uses a lookup table of only 4 KB (it is a bit slower, though).
The ingenious _mm_shuffle_epi8 intrinsic allows us to reorder the bytes with our shuffle mask. As a result, the XMM register contains four 4-byte blocks, each containing its digits in little-endian order. We convert each block into a 16-bit number with _mm_maddubs_epi16 followed by _mm_hadd_epi16. Then we reorder the bytes of the register so that the whole IP address occupies the lower 4 bytes.
Finally, we extract the lower 4 bytes from the XMM register into a GP register. This is done with the SSE4.1 intrinsic _mm_extract_epi32. If you don't have it, replace it with the other line using _mm_extract_epi16, but it will run a bit slower.
Finally, here is the generated assembly (MSVC2013), so that you can check that your compiler does not generate anything suspicious:
lddqu xmm1, XMMWORD PTR [rcx]
psubb xmm1, xmm6
pmovmskb ecx, xmm1
mov ecx, ecx //useless, see @PeterCordes and @IwillnotexistIdonotexist
add rcx, rcx //can be removed, see @EvgenyKluev
pshufb xmm1, XMMWORD PTR [r13+rcx*8]
movdqa xmm0, xmm8
pmaddubsw xmm0, xmm1
phaddw xmm0, xmm0
pshufb xmm0, xmm7
pextrd eax, xmm0, 0
P.S. If you are still reading, be sure to check out the comments =)
As for alternatives: this is similar to yours but with some error checking:
#include <iostream>
#include <string>
#include <cstdint>
uint32_t getip(const std::string &sip)
{
uint32_t r=0, b, p=0, c=0;
const char *s;
s = sip.c_str();
while (*s)
{
r<<=8;
b=0;
while (*s&&((*s==' ')||(*s=='\t'))) s++;
while (*s)
{
if ((*s==' ')||(*s=='\t')) { while (*s&&((*s==' ')||(*s=='\t'))) s++; if (*s!='.') break; }
if (*s=='.') { p++; s++; break; }
if ((*s>='0')&&(*s<='9'))
{
b*=10;
b+=(*s-'0');
s++;
}
}
if ((b>255)||(*s=='.')) return 0;
r+=b;
c++;
}
return ((c==4)&&(p==3))?r:0;
}
void testip(const std::string &sip)
{
uint32_t nIP=0;
nIP = getip(sip);
std::cout << "\nsIP = " << sip << " --> " << std::hex << nIP << "\n";
}
int main()
{
testip("192.167.1.3");
testip("292.167.1.3");
testip("192.267.1.3");
testip("192.167.1000.3");
testip("192.167.1.300");
testip("192.167.1.");
testip("192.167.1");
testip("192.167..1");
testip("192.167.1.3.");
testip("192.1 67.1.3.");
testip("192 . 167 . 1 . 3");
testip(" 192 . 167 . 1 . 3 ");
return 0;
}