I need a good pseudo-random number generator (PRNG), and it seems the current state of the art is the xorshift128+ algorithm. Unfortunately, I have found two different versions. The one on the Wikipedia page Xorshift reads:
uint64_t s[2];

uint64_t xorshift128plus(void) {
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23; // a
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
    return s[1] + y;
}
This seems straightforward enough. What's more, the edit logs appear to show that this code snippet was added by a user named "Vigna", presumably Sebastiano Vigna, the author of the paper on xorshift128+: Further scramblings of Marsaglia’s xorshift generators. Unfortunately, the implementation in that paper is slightly different:
uint64_t next(void) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23; // a
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
    return s[1] + s0;
}
Apart from some different names, these two snippets are identical except for the final two shifts. In the Wikipedia version those shifts are by 17 and 26, while the shifts in the paper are by 18 and 5.
Does anyone know which is the "right" algorithm? Does it make a difference? This is apparently a fairly widely used algorithm - but which version is used?
Thanks to @Blastfurnace, it appears that the answer is that the most recent set of constants, according to the author of the algorithm, is 23, 18, and 5. Apparently it doesn't matter too much, but those are theoretically better than the initial set he used. Sebastiano Vigna made these comments in response to the news that the V8 JavaScript engine is switching to this algorithm.
The implementation that I am using is:
uint64_t a = s[0];
uint64_t b = s[1];
s[0] = b;
a ^= a << 23;
a ^= a >> 18;
a ^= b;
a ^= b >> 5;
s[1] = a;
return a + b;
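One additional note: the generator's state must not be all zeros, and Vigna recommends initializing xorshift-family state by running the seed through a splitmix64 generator. A minimal sketch of that approach (the seeding function name is my own):

uint64_t splitmix64(uint64_t *state) {
    // splitmix64 by Sebastiano Vigna; used here only to spread the seed bits
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

void seed_xorshift128plus(uint64_t seed) {
    s[0] = splitmix64(&seed);
    s[1] = splitmix64(&seed); // s[0] and s[1] must not both be zero
}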
I am in the fortunate position of being able to say that my code for a simple SHA1 hash generator seems to work well. Unfortunately, I know that this Arduino program runs on a little-endian platform, and the description of the method for generating a hash requires the original message length to be appended as a big-endian integer.
This means for the message char m[] = "Applecake" I would have 9*8 = 72 bits, expressed as the 64-bit unsigned integer 0x0000000000000048. Stored little-endian, the memory would contain the bytes 48 00 00 00 00 00 00 00.
As described in Section 4, step (c) of RFC 3174, I have to
Obtain the 2-word representation of l, the number of bits in the original message. If l < 2^32 then the first word is all zeroes. Append these two words to the padded message.
So with my memory as described above, I would have to convert it to big-endian first and then append the lower 32 bits to the padded message.
The problem is that if I convert the endianness of the length, which I know is little-endian, I get the wrong padding and therefore the wrong hash.
Why does my code work without converting the endianness?
What limitations does my code have concerning compatibility across different Arduinos, microcontrollers, and compilers?
// initialize variables
h0 = 0x67452301;
h1 = 0xEFCDAB89;
h2 = 0x98BADCFE;
h3 = 0x10325476;
h4 = 0xC3D2E1F0;

// calculate the number of required cycles and create a blocks array
uint32_t numCycles = ((ml+65)/512)+1;
uint32_t blocks[numCycles*16] = {};

// copy message
uint32_t messageBytes = ml/8 + (ml%8!=0 ? 1 : 0);
for (uint32_t i = 0; i < messageBytes; i++) {
    blocks[i/4] |= ((uint32_t) message[i]) << (8*(3-(i%4)));
}

// append the 1 bit
blocks[ml/32] |= ((uint32_t) 0b1) << (31-(ml%32));

// append the 64-bit big endian ml at the end
if (ml < 0x80000000)
    blocks[(numCycles*16)-1] = (uint32_t) ml;
else {
    blocks[(numCycles*16)-2] = (uint32_t) (ml >> 32); // high word first (big-endian word order)
    blocks[(numCycles*16)-1] = (uint32_t) ml;
}
for (uint32_t iCycle = 0; iCycle < numCycles; iCycle++) {
    // initialize locals
    uint32_t w[80] = {};
    uint32_t a = h0, b = h1, c = h2, d = h3, e = h4;

    for (uint8_t i = 0; i < 80; i++) {
        // convert words to big-endian and copy to 80-elem array
        if (i < 16)
            w[i] = blocks[(iCycle*16)+i];
        else
            w[i] = rotL((w[i-3]^w[i-8]^w[i-14]^w[i-16]), 1);

        // run defined formulas
        uint32_t f, k, temp;
        if (i < 20) {
            f = (b & c) | ((~b) & d);
            k = 0x5A827999;
        }
        else if (i < 40) {
            f = b ^ c ^ d;
            k = 0x6ED9EBA1;
        }
        else if (i < 60) {
            f = (b & c) | (b & d) | (c & d);
            k = 0x8F1BBCDC;
        }
        else {
            f = b ^ c ^ d;
            k = 0xCA62C1D6;
        }
        temp = rotL(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rotL(b, 30); b = a; a = temp;
    }

    // write back the results
    h0 += a; h1 += b; h2 += c; h3 += d; h4 += e;
}
// append the 64-bit big endian ml at the end
if (ml < 0x80000000)
    blocks[(numCycles*16)-1] = (uint32_t) ml;
else {
    blocks[(numCycles*16)-2] = (uint32_t) (ml >> 32);
    blocks[(numCycles*16)-1] = (uint32_t) ml;
}
This puts the most-significant 32-bit value first and the least-significant 32-bit value second. That's half the reason your code works.
The other half is that while the 32-bit values are stored in little-endian form, you are also reading them on a little-endian platform, so you always get the correct values back. You never access the individual bytes of the 32-bit values, so which byte goes where makes no difference.
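To see the difference, compare reading a 32-bit value as a whole word with reading its individual bytes; only the latter depends on the platform. A small sketch (not part of the original code):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint32_t w = 0x11223344;      // read as a word, this is the same value on any platform
    uint8_t bytes[4];
    memcpy(bytes, &w, sizeof w);  // the byte order now depends on the platform
    // a little-endian MCU prints "44 33 22 11"; a big-endian one prints "11 22 33 44"
    for (int i = 0; i < 4; i++) printf("%02X ", bytes[i]);
    printf("\n");
    return 0;
}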
My task is to design a function that fulfils those requirements:
Function shall sum members of given one-dimensional array. However, it should sum only members whose number of ones in the binary representation is higher than defined threshold (e.g. if the threshold is 4, number 255 will be counted and 15 will not)
The array length is arbitrary
The function shall utilize as little memory as possible and shall be written in an efficient way
The production function code (sum_filtered(){..}) shall not use any standard C library functions (or any other libraries)
The function shall return 0 on success and error code on error
The array elements are of a type 16-bit signed integer and an overflow during calculation shall be regarded as a failure
Use data types that ensure portability between different CPUs (so the calculations will be the same on 8/16/32-bit MCU)
The function code should contain a reasonable amount of comments in doxygen annotation
Here is my solution:
#include <iostream>
using namespace std;

int sum_filtered(short array[], int treshold)
{
    // return 1 if invalid input parameters
    if((treshold < 0) || (treshold > 16)){ return(1); }

    int sum = 0;
    int bitcnt = 0;
    for(int i=0; i < sizeof(array); i++)
    {
        // Count one bits of integer
        bitcnt = 0;
        for (int pos = 0; pos < 16; pos++) { if (array[i] & (1 << pos)) { bitcnt++; } }
        // Add integer to sum if bitcnt>treshold
        if(bitcnt>treshold){ sum += array[i]; }
    }
    return(0);
}

int main()
{
    short array[5] = {15, 2652, 14, 1562, -115324};
    int result = sum_filtered(array, 14);
    cout << result << endl;

    short array2[5] = {15, 2652, 14, 1562, 15324};
    result = sum_filtered(array2, -2);
    cout << result << endl;
}
However, I'm not sure whether this code is portable between different CPUs.
I also don't understand how an overflow could occur during the calculation, or what other errors could arise when processing arrays with this function.
Can somebody more experienced give me their opinion?
Well, I can foresee one problem:
for(int i=0; i < sizeof(array); i++)
array in this context is a pointer, so sizeof(array) will likely be 4 on 32-bit systems, or 8 on 64-bit systems. You really want to pass a count variable (in this case 5) into the sum_filtered function (at the call site, where the array is still a real array, you can compute the count as sizeof(array) / sizeof(short)).
Anyhow, this code:
// Count one bits of integer
bitcnt = 0;
for (int pos = 0; pos < 16; pos++) { if (array[i] & (1 << pos)) { bitcnt++; } }
Effectively you are doing a popcount here (which can be done using __builtin_popcount on gcc/clang, or __popcnt on MSVC; these are compiler-specific, but they usually boil down to a single popcount CPU instruction on most CPUs).
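For illustration, a wrapper that dispatches between the two might look like this (an untested sketch; the name popcount16 is made up):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
// MSVC: __popcnt takes and returns an unsigned int
static inline uint16_t popcount16(uint16_t v) { return (uint16_t)__popcnt(v); }
#else
// GCC/Clang: __builtin_popcount takes an unsigned int
static inline uint16_t popcount16(uint16_t v) { return (uint16_t)__builtin_popcount(v); }
#endif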
If you do want to do this the slow way, then an efficient approach is to treat the computation as a form of bitwise SIMD operation:
#include <cstdint> // or stdint.h if you have a rubbish compiler :)

uint16_t popcount(uint16_t s)
{
    // perform 8x 1-bit adds
    uint16_t a0 = s & 0x5555;
    uint16_t b0 = (s >> 1) & 0x5555;
    uint16_t s0 = a0 + b0;

    // perform 4x 2-bit adds
    uint16_t a1 = s0 & 0x3333;
    uint16_t b1 = (s0 >> 2) & 0x3333;
    uint16_t s1 = a1 + b1;

    // perform 2x 4-bit adds
    uint16_t a2 = s1 & 0x0F0F;
    uint16_t b2 = (s1 >> 4) & 0x0F0F;
    uint16_t s2 = a2 + b2;

    // perform 1x 8-bit add
    uint16_t a3 = s2 & 0x00FF;
    uint16_t b3 = (s2 >> 8) & 0x00FF;
    return a3 + b3;
}
I know it says you can't use stdlib functions (your 4th point), but that shouldn't apply to the standardised integer types surely? (e.g. uint16_t) If it does, well then there is no way to guarantee portability across platforms. You're out of luck.
Personally I'd just use a 64-bit integer for the sum. That should all but eliminate the risk of overflow (e.g. if the threshold is zero and every element is -32768, the minimum int16_t, the sum can't overflow until the array exceeds 2^48 elements, which is 281,474,976,710,656 in decimal).
#include <cstdint>
#include <cstddef> // for size_t

int64_t sum_filtered(int16_t array[], uint16_t threshold, size_t array_length)
{
    // changing the type of threshold to be unsigned means we don't need to test
    // for negative numbers.
    if(threshold > 16) { return 1; }

    int64_t sum = 0;
    for(size_t i = 0; i < array_length; i++)
    {
        if (popcount(array[i]) > threshold)
        {
            sum += array[i];
        }
    }
    return sum;
}
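If you need the exact contract from the requirements (return 0 on success and an error code on failure, with overflow treated as a failure), the sum has to come back through an out-parameter instead. A hypothetical sketch reusing the popcount function above (the signature and error codes are my own):

#include <cstdint>
#include <cstddef>

// 0 = success, 1 = bad arguments, 2 = overflow
int sum_filtered_checked(const int16_t array[], size_t length, uint16_t threshold, int32_t *result)
{
    if (array == 0 || result == 0 || threshold > 16) { return 1; }

    int32_t sum = 0;
    for (size_t i = 0; i < length; i++)
    {
        if (popcount((uint16_t)array[i]) > threshold)
        {
            // test for signed overflow before adding
            if (array[i] > 0 && sum > INT32_MAX - array[i]) { return 2; }
            if (array[i] < 0 && sum < INT32_MIN - array[i]) { return 2; }
            sum += array[i];
        }
    }
    *result = sum;
    return 0;
}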
void Tools::Swap(uint32_t number){
    int temp1 = (number >> 31) & 1;
    int temp2 = number & 1;
    int ans = number & 7ffffffe;
    int mask = (temp2 << 31) | temp1;
    ans = ans | mask;
    cout << ans << endl;
}
I've worked it out on paper and it does seem to swap the first and last bits but I want to be sure it's the best way I can be doing this.
No. (temp2 << 31) causes undefined behaviour if temp2 is 1, and int is 32-bit or narrower.
However, if you replace all of the int by uint32_t and slap an 0x on the front of 7ffffffe, then it seems correct.
It might be better to let the compiler create temps as needed. With shift counts of 31, there's no need to use &.
uint32_t number;
// ...
number = (number<<31) | (number & 0x7ffffffe) | (number>>31);
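A quick sanity check of that one-liner (the test value is mine):

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t number = 0x80000000u; // bit 31 set, bit 0 clear
    number = (number << 31) | (number & 0x7ffffffe) | (number >> 31);
    printf("0x%08X\n", (unsigned)number); // bits 0 and 31 swapped: prints 0x00000001
    return 0;
}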
So I've got a custom randomizer class that uses a Mersenne Twister (the code I use is adapted from this site). All seemed to be working well, until I started testing different seeds (I normally use 42 as a seed, to ensure that each time I run my program, the results are the same, so I can see how code changes influence things).
It turns out that, no matter what seed I choose, the code produces the exact same series of numbers each time. Clearly I'm doing something wrong, but I don't know what. Here is my seed function:
void Randomizer::Seed(unsigned long int Seed)
{
    int ii;
    x[0] = Seed & 0xffffffffUL;
    for (ii = 0; ii < N; ii++)
    {
        x[ii] = (1812433253UL * (x[ii - 1] ^ (x[ii - 1] >> 30)) + ii);
        x[ii] &= 0xffffffffUL;
    }
}
And this is my Rand() function
unsigned long int Randomizer::Rand()
{
    unsigned long int Result;
    unsigned long int a;
    int ii;

    // Refill x if exhausted
    if (Next == N)
    {
        Next = 0;
        for (ii = 0; ii < N - 1; ii++)
        {
            Result = (x[ii] & U) | (x[ii + 1] & L);
            a = (Result & 0x1UL) ? A : 0x0UL;
            x[ii] = x[(ii + M) % N] ^ (Result >> 1) ^ a;
        }
        Result = (x[N - 1] & U) | (x[0] & L);
        a = (Result & 0x1UL) ? A : 0x0UL;
        x[N - 1] = x[M - 1] ^ (Result >> 1) ^ a;
    }

    Result = x[Next++];

    // Improves distribution (tempering)
    Result ^= (Result >> 11);
    Result ^= (Result << 7) & 0x9d2c5680UL;
    Result ^= (Result << 15) & 0xefc60000UL;
    Result ^= (Result >> 18);

    return Result;
}
The various values are:
#define A 0x9908b0dfUL
#define U 0x80000000UL
#define L 0x7fffffffUL
int Randomizer::N = 624;
int Randomizer::M = 397;
int Randomizer::Next = 0;
unsigned long Randomizer::x[624];
Can anyone help me figure out why different seeds don't result in different sequences of numbers?
Your Seed() function assigns to x[0], then starts looping at ii=0, which overwrites x[0] with an undefined value (it references x[-1]). Start your loop at 1, and you'll probably be all set.
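In code, the fix would look something like this (a sketch mirroring the reference MT19937 initializer):

void Randomizer::Seed(unsigned long int Seed)
{
    x[0] = Seed & 0xffffffffUL;
    for (int ii = 1; ii < N; ii++) // start at 1 so x[ii - 1] is always initialized
    {
        x[ii] = (1812433253UL * (x[ii - 1] ^ (x[ii - 1] >> 30)) + ii);
        x[ii] &= 0xffffffffUL;
    }
}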
Writing your own randomizer is dangerous. Why? It's hard to get right (see above), it's hard to know if you've done it right, and if it's wrong, things that rely on correctly distributed random numbers will not work quite right. Hopefully that thing is not cryptography or statistical modeling where the tails matter... Think about using the standard <random> header (e.g. std::mt19937), or if you're not on C++11 yet, boost::random.
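For example, with C++11 the same use case takes only a few lines:

#include <random>
#include <iostream>

int main() {
    std::mt19937 gen(42);                          // Mersenne Twister, explicitly seeded
    std::uniform_int_distribution<int> dist(1, 6);
    for (int i = 0; i < 5; i++)
        std::cout << dist(gen) << ' ';             // different seeds give different sequences
    std::cout << '\n';
    return 0;
}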
I am trying to optimize a small piece of code with SSE intrinsics (I am a complete beginner on the topic), but I am a little stuck on the use of conditionals.
My original code is:
unsigned long c;
unsigned long constant = 0x12345678;
unsigned long table[256];
int n, k;

for( n = 0; n < 256; n++ )
{
    c = n;
    for( k = 0; k < 8; k++ )
    {
        if( c & 1 ) c = constant ^ (c >> 1);
        else        c >>= 1;
    }
    table[n] = c;
}
The goal of this code is to compute a CRC table (the constant can be any polynomial; it doesn't play a role here).
I suppose my optimized code would be something like:
__m128i x;
__m128i y;
__m128i *table;
__m128i offset;

x = _mm_set_epi32(3, 2, 1, 0);
y = _mm_set_epi32(3, 2, 1, 0);
// offset for incrementation
offset = _mm_set1_epi32(4);

for( n = 0; n < 64; n++ )
{
    y = x;
    for( k = 0; k < 8; k++ )
    {
        // if: do something with y
        // else: do something with y
    }
    table[n] = y;
    x = _mm_add_epi32(x, offset);
}
I have no idea how to go through the if-else statement, but I suspect there is a clever trick. Has anybody an idea on how to do that?
(Aside from this, my optimization is probably quite poor - any advice or correction on it would be treated with the greatest sympathy)
You can get rid of the if/else entirely. Back in the days when I produced MMX assembly code, that was a common programming activity. Let me start with a series of transformations on the "false" statement:
c >>= 1;
c = c >> 1;
c = 0 ^ (c >> 1);
Why did I introduce the exclusive-or? Because exclusive-or is also found in the "true" statement:
c = constant ^ (c >> 1);
Note the similarity? In the "true" part, we xor with a constant, and in the false part, we xor with zero.
Now I'm going to show you a series of transformations on the entire if/else statement:
if (c & 1)
    c = constant ^ (c >> 1); // same as before
else
    c = 0 ^ (c >> 1);        // just different layout

if (c & 1)
    c = constant ^ (c >> 1);
else
    c = (constant & 0) ^ (c >> 1); // 0 == x & 0

if (c & 1)
    c = (constant & -1) ^ (c >> 1); // x == x & -1
else
    c = (constant & 0) ^ (c >> 1);
Now the two branches only differ in the second argument to the binary-and, which can be calculated trivially from the condition itself, thus enabling us to get rid of the if/else:
c = (constant & -(c & 1)) ^ (c >> 1);
Disclaimer: This solution only works on a two's complement architecture where -1 means "all bits set".
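Folding that expression back into the original loop gives a branch-free version (my restatement; note that with unsigned arithmetic, 0UL - (c & 1) builds the all-ones mask without relying on signed negation):

unsigned long c;
const unsigned long constant = 0x12345678;
unsigned long table[256];

for (int n = 0; n < 256; n++)
{
    c = n;
    for (int k = 0; k < 8; k++)
        c = (constant & (0UL - (c & 1))) ^ (c >> 1); // branch-free select
    table[n] = c;
}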
The idea in SSE is to build both results and then blend them together. E.g.:
__m128i ones     = _mm_set1_epi32( 1 );          // lane mask: mask[n] = 0x1
__m128i constant = _mm_set1_epi32( 0x12345678 );

__m128i tmp_c  = _mm_xor_si128( _mm_srli_epi32( c, 1 ), constant ); // result if low bit set
__m128i tmp_c2 = _mm_srli_epi32( c, 1 );                            // result if low bit clear
__m128i v = _mm_cmpeq_epi32( _mm_and_si128( c, ones ), ones );      // all-ones lanes where (c & 1)

// select per lane: v ? tmp_c : tmp_c2
c = _mm_or_si128( _mm_and_si128( v, tmp_c ), _mm_andnot_si128( v, tmp_c2 ) );
// or in sse4_1
c = _mm_blendv_epi8( tmp_c2, tmp_c, v );
As an aside, this is not complete code; it is only meant to demonstrate the principle.
The first step in efficiently computing a CRC is using a wider basic unit than the bit. See here for an example of how to do this byte by byte.
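For reference, byte-by-byte processing with a 256-entry table like the one built in the question typically looks like this (a sketch, assuming a reflected CRC):

#include <stddef.h>

// consume one byte per step by indexing the precomputed table
unsigned long crc_update(unsigned long crc, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        crc = table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8);
    return crc;
}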