I am trying to optimize a small piece of code with SSE intrinsics (I am a complete beginner on the topic), but I am a little stuck on the use of conditionals.
My original code is:
unsigned long c;
unsigned long constant = 0x12345678;
unsigned long table[256];
int n, k;

for( n = 0; n < 256; n++ )
{
    c = n;
    for( k = 0; k < 8; k++ )
    {
        if( c & 1 ) c = constant ^ (c >> 1);
        else        c >>= 1;
    }
    table[n] = c;
}
The goal of this code is to compute a CRC table (the constant can be any polynomial; its exact value doesn't matter here).
I suppose my optimized code would be something like:
__m128i x;
__m128i y;
__m128i *table;
__m128i offset;

x = _mm_set_epi32(3, 2, 1, 0);
y = _mm_set_epi32(3, 2, 1, 0);
// offset for incrementation
offset = _mm_set1_epi32(4);

for( n = 0; n < 64; n++ )
{
    y = x;
    for( k = 0; k < 8; k++ )
    {
        // if: do something with y
        // else: do something with y
    }
    table[n] = y;
    x = _mm_add_epi32(x, offset);
}
I have no idea how to handle the if-else statement, but I suspect there is a clever trick. Does anybody have an idea how to do that?
(Aside from this, my optimization attempt is probably quite poor; any advice or correction on it would be treated with the greatest sympathy.)
You can get rid of the if/else entirely. Back in the days when I produced MMX assembly code, that was a common programming activity. Let me start with a series of transformations on the "false" statement:
c >>= 1;
c = c >> 1;
c = 0 ^ (c >> 1);
Why did I introduce the exclusive-or? Because exclusive-or is also found in the "true" statement:
c = constant ^ (c >> 1);
Note the similarity? In the "true" part, we xor with a constant, and in the false part, we xor with zero.
Now I'm going to show you a series of transformations on the entire if/else statement:
if (c & 1)
    c = constant ^ (c >> 1);   // same as before
else
    c = 0 ^ (c >> 1);          // just different layout

if (c & 1)
    c = constant ^ (c >> 1);
else
    c = (constant & 0) ^ (c >> 1);   // 0 == x & 0

if (c & 1)
    c = (constant & -1) ^ (c >> 1);  // x == x & -1
else
    c = (constant & 0) ^ (c >> 1);
Now the two branches only differ in the second argument to the binary-and, which can be calculated trivially from the condition itself, thus enabling us to get rid of the if/else:
c = (constant & -(c & 1)) ^ (c >> 1);
Disclaimer: This solution only works on a two's complement architecture where -1 means "all bits set".
The idea in SSE is to build both results and then blend the results together.
E.g.:
__m128i one = _mm_set1_epi32(1);    // per-lane LSB mask
__m128i constant = ...;
__m128i tmp_c  = _mm_xor_si128( _mm_srli_epi32( c, 1 ), constant ); // "true" branch
__m128i tmp_c2 = _mm_srli_epi32( c, 1 );                            // "false" branch
__m128i v = _mm_cmpeq_epi32( _mm_and_si128( c, one ), one ); // all-ones where c & 1
tmp_c  = _mm_and_si128( tmp_c, v );
tmp_c2 = _mm_andnot_si128( v, tmp_c2 );
c = _mm_or_si128( tmp_c, tmp_c2 );
// or in SSE4.1, replacing the three lines above:
c = _mm_blendv_epi8( tmp_c2, tmp_c, v );
Note: this is not complete code, just a demonstration of the principle.
The first step in efficiently computing CRC is using a wider basic unit than the bit. See here for an example of how to do this byte per byte.
Related
I am in the quite fortunate position to say that my code for a simple SHA1 Hash generator seems to work well. Unfortunately I know that this Arduino Program runs with Little Endianness and the description on the method to generate a hash requires the original message length to be appended as Big Endian integer.
This means for the message char m[] = "Applecake" I would have 9*8 bits, expressed as a 64-bit unsigned integer that is 0x0000 0000 0000 0048. That means, stored with Little Endian, the memory would look like this: 0x0048 0000 0000 0000.
As described in Section 4 of RFC 3174 Step c) I have to
Obtain the 2-word representation of l, the number of bits in the original message. If l < 2^32 then the first word is all zeroes. Append these two words to the padded message.
So with my memory as described above, I would have to convert it to Big Endian first and then append the lower 32 bits to the padded message.
The problem is, that if I do convert the Endianness of the length, which I know is Little Endian, I get the wrong padding and therefore the wrong hash.
Why is my code working without conversion of the Endianness?
Which limitations does my code have concerning the compatibility across different Arduinos, microcontrollers and compilers?
// initialize variables
h0 = 0x67452301;
h1 = 0xEFCDAB89;
h2 = 0x98BADCFE;
h3 = 0x10325476;
h4 = 0xC3D2E1F0;

// calculate the number of required cycles and create a blocks array
uint32_t numCycles = ((ml+65)/512)+1;
uint32_t blocks[numCycles*16] = {};

// copy message
uint32_t messageBytes = ml/8 + (ml%8!=0 ? 1 : 0);
for (uint32_t i = 0; i < messageBytes; i++) {
    blocks[i/4] |= ((uint32_t) message[i]) << (8*(3-(i%4)));
}

// append the 1 bit
blocks[ml/32] |= ((uint32_t) 0b1) << (31-(ml%32));

// append the 64-bit big endian ml at the end
if (ml < 0x80000000)
    blocks[(numCycles*16)-1] = (uint32_t) ml;
else {
    blocks[(numCycles*16)-2] = (uint32_t) ml;
    blocks[(numCycles*16)-1] = (uint32_t) (ml >> 32);
}

for (uint32_t iCycle = 0; iCycle < numCycles; iCycle++) {
    // initialize locals
    uint32_t w[80] = {};
    uint32_t a = h0, b = h1, c = h2, d = h3, e = h4;
    for (uint8_t i = 0; i < 80; i++) {
        // convert words to big-endian and copy to 80-elem array
        if (i < 16)
            w[i] = blocks[(iCycle*16)+i];
        else
            w[i] = rotL((w[i-3]^w[i-8]^w[i-14]^w[i-16]), 1);
        // run defined formulas
        uint32_t f, k, temp;
        if (i < 20) {
            f = (b & c) | ((~b) & d);
            k = 0x5A827999;
        }
        else if (i < 40) {
            f = b ^ c ^ d;
            k = 0x6ED9EBA1;
        }
        else if (i < 60) {
            f = (b & c) | (b & d) | (c & d);
            k = 0x8F1BBCDC;
        }
        else {
            f = b ^ c ^ d;
            k = 0xCA62C1D6;
        }
        temp = rotL(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rotL(b, 30); b = a; a = temp;
    }
    // write back the results
    h0 += a; h1 += b; h2 += c; h3 += d; h4 += e;
}
// append the 64-bit big endian ml at the end
if (ml < 0x80000000)
    blocks[(numCycles*16)-1] = (uint32_t) ml;
else {
    blocks[(numCycles*16)-2] = (uint32_t) ml;
    blocks[(numCycles*16)-1] = (uint32_t) (ml >> 32);
}
For messages shorter than 2^31 bits, this puts the most-significant 32-bit word (all zeroes, from the array's zero-initialization) first and the least-significant word second, exactly as the RFC requires. That's half the reason your code works. (Note that the else branch stores the low word first and the high word second, which is backwards; it would produce the wrong padding for messages of 2^31 bits or more.)
The other half is that while the 32-bit values are in little-endian form, you are reading their values on a little-endian platform. That will always give you the correct value. You never try to access the individual bytes of the 32-bit values, so which byte goes where makes no difference.
I have a bit-mask of N chars in size, which is statically known (i.e. can be calculated at compile time, but it's not a single constant, so I can't just write it down), with bits set to 1 denoting the "wanted" bits. And I have a value of the same size, which is only known at runtime. I want to collect the "wanted" bits from that value, in order, into the beginning of a new value. For simplicity's sake let's assume the number of wanted bits is <= 32.
Completely unoptimized reference code which hopefully has the correct behaviour:
#include <climits>   // CHAR_BIT
#include <cstdlib>   // abort

template<int N, const char mask[N]>
unsigned gather_bits(const char* val)
{
    unsigned result = 0;
    char* result_p = (char*)&result;
    int pos = 0;
    for (int i = 0; i < N * CHAR_BIT; i++)
    {
        if (mask[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
        {
            if (val[i/CHAR_BIT] & (1 << (i % CHAR_BIT)))
            {
                if (pos < (int)(sizeof(unsigned) * CHAR_BIT))
                {
                    result_p[pos/CHAR_BIT] |= 1 << (pos % CHAR_BIT);
                }
                else
                {
                    abort();
                }
            }
            pos += 1;
        }
    }
    return result;
}
Although I'm not sure whether that formulation actually allows access to the contents of the mask at compile time; in any case it's available for use (maybe a constexpr function would be a better fit). I'm not looking for the necessary C++ wizardry here (I'll figure that out), just the algorithm.
An example of input/output, with 16-bit values and imaginary binary notation for clarity:
mask = 0b0011011100100110
val = 0b0101000101110011
--
wanted = 0b__01_001__1__01_ // retain only those bits which are set in the mask
result = 0b0000000001001101 // bring them to the front
^ gathered bits begin here
My questions are:
What's the most performant way to do this? (Are there any hardware instructions that can help?)
What if both the mask and the value are restricted to be unsigned, so a single word, instead of an unbounded char array? Can it then be done with a fixed, short sequence of instructions?
Intel Haswell will have pext (parallel bit extract), which does exactly what you want. I don't know how fast that instruction will be, but it's probably better than the alternatives. This operation is also known as "compress-right" or simply "compress"; the implementation from Hacker's Delight is this:
unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;

    x = x & m;         // Clear irrelevant bits.
    mk = ~m << 1;      // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);             // Parallel prefix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                     // Bits to move.
        m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = (x ^ t) | (t >> (1 << i));   // Compress x.
        mk = mk & ~mp;
    }
    return x;
}
So I've got a custom randomizer class that uses a Mersenne Twister (the code I use is adapted from this site). All seemed to be working well, until I started testing different seeds (I normally use 42 as a seed, to ensure that each time I run my program, the results are the same, so I can see how code changes influence things).
It turns out that, no matter what seed I choose, the code produces the exact same series of numbers each time. Clearly I'm doing something wrong, but I don't know what. Here is my seed function:
void Randomizer::Seed(unsigned long int Seed)
{
    int ii;
    x[0] = Seed & 0xffffffffUL;
    for (ii = 0; ii < N; ii++)
    {
        x[ii] = (1812433253UL * (x[ii - 1] ^ (x[ii - 1] >> 30)) + ii);
        x[ii] &= 0xffffffffUL;
    }
}
And this is my Rand() function
unsigned long int Randomizer::Rand()
{
    unsigned long int Result;
    unsigned long int a;
    int ii;

    // Refill x if exhausted
    if (Next == N)
    {
        Next = 0;
        for (ii = 0; ii < N - 1; ii++)
        {
            Result = (x[ii] & U) | (x[ii + 1] & L);
            a = (Result & 0x1UL) ? A : 0x0UL;
            x[ii] = x[(ii + M) % N] ^ (Result >> 1) ^ a;
        }
        Result = (x[N - 1] & U) | (x[0] & L);
        a = (Result & 0x1UL) ? A : 0x0UL;
        x[N - 1] = x[M - 1] ^ (Result >> 1) ^ a;
    }

    Result = x[Next++];

    // Improves distribution
    Result ^= (Result >> 11);
    Result ^= (Result << 7) & 0x9d2c5680UL;
    Result ^= (Result << 15) & 0xefc60000UL;
    Result ^= (Result >> 18);

    return Result;
}
The various values are:
#define A 0x9908b0dfUL
#define U 0x80000000UL
#define L 0x7fffffffUL
int Randomizer::N = 624;
int Randomizer::M = 397;
int Randomizer::Next = 0;
unsigned long Randomizer::x[624];
Can anyone help me figure out why different seeds don't result in different sequences of numbers?
Your Seed() function assigns to x[0], then starts looping at ii = 0, which overwrites x[0] with an undefined value (it reads x[-1], which is out of bounds). Start your loop at 1 and you'll probably be all set.
Writing your own randomizer is dangerous. Why? It's hard to get right (see above), it's hard to know whether you've done it right, and if it's wrong, anything that relies on correctly distributed random numbers won't work quite right. Hopefully that thing is not cryptography or statistical modeling where the tails matter... Think about using the C++11 <random> facilities (e.g. std::mt19937), or if you're not on C++11 yet, boost::random.
I want to extract the n most significant bits from an integer in C++ and convert those n bits to an integer.
For example
int a=1200;
// its binary representation within 32 bit word-size is
// 00000000000000000000010010110000
Now I want to extract the 4 most significant digits from that representation, starting at the highest set bit:
00000000000000000000010010110000
                     ^^^^
and convert them back to an integer (binary 1001 = 9 in decimal).
How is possible with a simple c++ function without loops?
Some processors have an instruction to count the leading binary zeros of an integer, and some compilers have instrinsics to allow you to use that instruction. For example, using GCC:
uint32_t significant_bits(uint32_t value, unsigned bits) {
    unsigned leading_zeros = __builtin_clz(value); // note: undefined for value == 0
    unsigned highest_bit = 32 - leading_zeros;
    unsigned lowest_bit = highest_bit - bits;
    return value >> lowest_bit;
}
For simplicity, I left out checks that the requested number of bits are available. For Microsoft's compiler, the intrinsic is called __lzcnt.
If your compiler doesn't provide that intrinsic, and you processor doesn't have a suitable instruction, then one way to count the zeros quickly is with a binary search:
unsigned leading_zeros(uint32_t value) {
    unsigned count = 0;
    if ((value & 0xffff0000u) == 0) {
        count += 16;
        value <<= 16;
    }
    if ((value & 0xff000000u) == 0) {
        count += 8;
        value <<= 8;
    }
    if ((value & 0xf0000000u) == 0) {
        count += 4;
        value <<= 4;
    }
    if ((value & 0xc0000000u) == 0) {
        count += 2;
        value <<= 2;
    }
    if ((value & 0x80000000u) == 0) {
        count += 1;
    }
    return count;
}
It's not fast, but (int)(log(x)/log(2) + .5) + 1 will tell you the position of the most significant non-zero bit. Finishing the algorithm from there is fairly straightforward.
This seems to work (done in C# with UInt32 and then ported, so apologies to Bjarne):
unsigned int input = 1200;
unsigned int most_significant_bits_to_get = 4;
// shift + or the msb over all the lower bits
unsigned int m1 = input | input >> 8 | input >> 16 | input >> 24;
unsigned int m2 = m1 | m1 >> 2 | m1 >> 4 | m1 >> 6;
unsigned int m3 = m2 | m2 >> 1;
unsigned int nbitsmask = m3 ^ m3 >> most_significant_bits_to_get;
unsigned int v = nbitsmask;
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -((int)v);
if (v>0) c--;
if ((v & 0x0000FFFF) >0) c -= 16;
if ((v & 0x00FF00FF) >0) c -= 8;
if ((v & 0x0F0F0F0F) >0 ) c -= 4;
if ((v & 0x33333333) >0) c -= 2;
if ((v & 0x55555555) >0) c -= 1;
unsigned int result = (input & nbitsmask) >> c;
I assumed you meant using only integer math.
I used some code from #OliCharlesworth's link, you could remove the conditionals too by using the LUT for trailing zeroes code there.
Let's say that I have an array of 4 32-bit integers which I use to store a 128-bit number.
How can I perform left and right shifts on this 128-bit number?
Thanks!
Working with uint128? If you can, use the x86 SSE instructions, which were designed for exactly that. (Then, when you've bitshifted your value, you're ready to do other 128-bit operations...)
SSE2 bit shifts take ~4 instructions on average, with one branch (a case statement). No issues with shifting more than 32 bits, either. The full code, using gcc intrinsics rather than raw assembler, is in sseutil.c (github: "Unusual uses of SSE2"); it's a bit bigger than makes sense to paste here.
The hurdle for many people in using SSE2 is that shift ops take immediate (constant) shift counts. You can solve that with a bit of C preprocessor twiddling (wordpress: C preprocessor tricks). After that, you have op sequences like:
LeftShift(uint128 x, int n) = _mm_slli_epi64(_mm_slli_si128(x, n/8), n%8)
for n = 65..71, 73..79, … 121..127
... doing the whole shift in two instructions.
void shiftl128 (
    unsigned int& a,
    unsigned int& b,
    unsigned int& c,
    unsigned int& d,
    size_t k)
{
    assert (k <= 128);
    if (k >= 32) // shifting a 32-bit integer by 32 or more bits is undefined
    {
        a = b;
        b = c;
        c = d;
        d = 0;
        shiftl128(a, b, c, d, k - 32);
    }
    else if (k > 0) // the 32-k shifts below require k > 0
    {
        a = (a << k) | (b >> (32 - k));
        b = (b << k) | (c >> (32 - k));
        c = (c << k) | (d >> (32 - k));
        d = (d << k);
    }
}

void shiftr128 (
    unsigned int& a,
    unsigned int& b,
    unsigned int& c,
    unsigned int& d,
    size_t k)
{
    assert (k <= 128);
    if (k >= 32) // shifting a 32-bit integer by 32 or more bits is undefined
    {
        d = c;
        c = b;
        b = a;
        a = 0;
        shiftr128(a, b, c, d, k - 32);
    }
    else if (k > 0) // the 32-k shifts below require k > 0
    {
        d = (c << (32 - k)) | (d >> k);
        c = (b << (32 - k)) | (c >> k);
        b = (a << (32 - k)) | (b >> k);
        a = (a >> k);
    }
}
Instead of using a 128-bit number, why not use a std::bitset? With a bitset you can adjust how big you want it to be, and you can perform quite a few operations on it.
You can find more information on bitsets here:
http://www.cppreference.com/wiki/utility/bitset/start?do=backlink
First, if you're shifting by n bits and n is greater than or equal to 32, divide n by 32 and shift whole integers. That should be trivial. You're then left with a remaining shift count from 0 to 31. If it's zero, return early; you're done.
For each integer you'll need to shift by the remaining n, then shift the adjacent integer by the same amount and combine the valid bits from each.
Since you mentioned you're storing your 128-bit value in an array of 4 integers, you could do the following:
void left_shift(unsigned int* array)
{
    for (int i = 3; i >= 0; i--)
    {
        array[i] = array[i] << 1;
        if (i > 0)
        {
            unsigned int top_bit = (array[i-1] >> 31) & 0x1;
            array[i] = array[i] | top_bit;
        }
    }
}

void right_shift(unsigned int* array)
{
    for (int i = 0; i < 4; i++)
    {
        array[i] = array[i] >> 1;
        if (i < 3)
        {
            unsigned int bottom_bit = (array[i+1] & 0x1) << 31;
            array[i] = array[i] | bottom_bit;
        }
    }
}