Is there a clever (i.e. branchless) way to "compact" a hex number, basically moving all the 0s to one side?
e.g.:
0x10302040 -> 0x13240000
or
0x10302040 -> 0x00001324
I looked on Bit Twiddling Hacks but didn't see anything.
It's for an SSE numerical pivoting algorithm. I need to remove any pivots that become 0. I can use _mm_cmpgt_ps to find good pivots, _mm_movemask_ps to convert that into a mask, and then bit hacks to get something like the above. The hex value gets munged into a mask for a _mm_shuffle_ps instruction to perform a permutation on the 128-bit SSE register.
To compute the mask for _pext:
mask = arg;
mask |= (mask << 1) & 0xAAAAAAAA | (mask >> 1) & 0x55555555;
mask |= (mask << 2) & 0xCCCCCCCC | (mask >> 2) & 0x33333333;
First OR together pairs of bits, then quads. The masks prevent shifted values from spilling into neighboring digits.
After computing the mask this way or harold's way (which is probably faster), you don't need the full power of _pext, so if the target hardware doesn't support it you can replace it with this:
for(int i = 0; i < 7; i++) {
    stay_mask = mask & (~mask - 1);
    arg = arg & stay_mask | (arg >> 4) & ~stay_mask;
    mask = stay_mask | (mask >> 4);
}
Each iteration moves all nibbles one digit to the right if there is some space. stay_mask marks bits that are in their final positions. This uses somewhat fewer operations than the Hacker's Delight solution, but might still benefit from branching.
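Putting the two snippets above together, a minimal self-contained sketch could look like the following. The function names nonzero_nibble_mask and compact_right are mine (not from any library), and the code assumes a 32-bit value whose nonzero digits are packed toward the low end.

```cpp
#include <cstdint>

// Returns a value with an F in every nibble position where x has a nonzero nibble.
std::uint32_t nonzero_nibble_mask(std::uint32_t x) {
    std::uint32_t mask = x;
    // OR neighbouring bits within each pair, then neighbouring pairs within each nibble.
    mask |= ((mask << 1) & 0xAAAAAAAAu) | ((mask >> 1) & 0x55555555u);
    mask |= ((mask << 2) & 0xCCCCCCCCu) | ((mask >> 2) & 0x33333333u);
    return mask;
}

// Packs the nonzero nibbles of x toward the least significant end, keeping their order.
std::uint32_t compact_right(std::uint32_t x) {
    std::uint32_t mask = nonzero_nibble_mask(x);
    for (int i = 0; i < 7; ++i) {                  // a digit moves at most 7 places
        std::uint32_t stay = mask & (~mask - 1u);  // trailing run of 1s: digits already in place
        x = (x & stay) | ((x >> 4) & ~stay);       // everything above that run moves down a digit
        mask = stay | (mask >> 4);
    }
    return x;
}
```

With the value from the question, compact_right(0x10302040) gives 0x00001324.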
Supposing we can use _pext_u32, the issue then is computing a mask that has an F for every nibble that isn't zero. I'm not sure what the best approach is, but you can compute the OR of the 4 bits of the nibble and then "spread" it back out to F's like this:
// calculate horizontal OR of every nibble
x |= x >> 1;
x |= x >> 2;
// clean up junk
x &= 0x11111111;
// spread
x *= 0xF;
Then use that as the mask of _pext_u32.
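As a rough sketch of how the pieces fit together: nibble_mask is the code above wrapped in a function, and pext_ref is a deliberately naive, bit-by-bit stand-in for _pext_u32, there only so the example runs on hardware without BMI2. Both names, and compact_right_pext, are made up for illustration. On a BMI2 target you would presumably call _pext_u32 from <immintrin.h> instead of pext_ref.

```cpp
#include <cstdint>

// F in every nibble position where x has a nonzero nibble.
std::uint32_t nibble_mask(std::uint32_t x) {
    x |= x >> 1;          // after these two steps, bit 0 of each nibble
    x |= x >> 2;          // holds the OR of all four bits of that nibble
    x &= 0x11111111u;     // clean up junk
    return x * 0xFu;      // spread each surviving 1 to a full F
}

// Naive stand-in for _pext_u32: gathers the bits of src selected by mask
// into the low bits of the result, preserving their order.
std::uint32_t pext_ref(std::uint32_t src, std::uint32_t mask) {
    std::uint32_t out = 0;
    for (std::uint32_t bit = 1; mask != 0; mask &= mask - 1) {
        std::uint32_t lowest = mask & (0u - mask);  // lowest set bit of mask
        if (src & lowest) out |= bit;
        bit <<= 1;
    }
    return out;
}

std::uint32_t compact_right_pext(std::uint32_t x) {
    return pext_ref(x, nibble_mask(x));
}
```

For the example in the question, nibble_mask(0x10302040) is 0xF0F0F0F0 and the extraction yields 0x00001324.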
_pext_u32 can be emulated by this (taken from Hacker's Delight, figure 7.6)
unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m; // Clear irrelevant bits.
    mk = ~m << 1; // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1); // Parallel prefix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m; // Bits to move.
        m = m ^ mv | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i)); // Compress x.
        mk = mk & ~mp;
    }
    return x;
}
But that's a bit of a disaster. It's probably better to just resort to branching code then.
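uint32_t fun(uint32_t val) {
    uint32_t retVal = 0;
    uint32_t sa = 28; // Shift amount of the next free output digit.
    for (int sb = 28; sb >= 0; sb -= 4) {
        uint32_t digit = (val >> sb) & 0x0Fu; // Current input digit.
        if (digit != 0) {
            retVal |= digit << sa; // Pack it next to the digits already kept.
            sa -= 4;
        }
    }
    return retVal;
}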
I think this (or something similar) is what you're looking for: it eliminates the 0 nibbles within a number. I haven't debugged it, and it only compacts toward one side at the moment.
If your processor supports conditional instruction execution, you may get a benefit from this algorithm:
uint32_t compact(uint32_t orig_value)
{
    uint32_t mask = 0xF0000000u; // Mask for isolating a hex digit.
    uint32_t new_value = 0u;
    for (unsigned int i = 0; i < 8; ++i) // 8 hex digits
    {
        if ((orig_value & mask) == 0u) // Parentheses needed: == binds tighter than &.
        {
            orig_value = orig_value << 4; // Drop the zero digit; the same output slot is retried.
        }
        else
        {
            new_value |= orig_value & mask; // Keep the nonzero digit.
            mask = mask >> 4; // next output digit
        }
    }
    return new_value;
}
This looks like a good candidate for loop unrolling.
The algorithm assumes that when the original value is shifted left, zeros are shifted in, filling in the "empty" bits.
Edit 1:
On a processor that supports conditional execution of instructions, the shifting of the original value would be conditionally executed depending on the result of the ANDing of the original value and the mask. Thus no branching, only ignored instructions.
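If conditional execution isn't available, the same digit-by-digit idea can be written without an if in the loop body by letting the comparison result drive the shift amount. This is only a sketch of that variation (compact_left is a made-up name), and whether a given compiler actually emits branch-free code for it is not guaranteed.

```cpp
#include <cstdint>

// Packs the nonzero hex digits of v toward the most significant end.
std::uint32_t compact_left(std::uint32_t v) {
    std::uint32_t out = 0;
    unsigned shift = 28;                  // position of the next free output digit
    for (int i = 0; i < 8; ++i) {         // 8 hex digits, most significant first
        std::uint32_t digit = v >> 28;    // current top digit
        v <<= 4;                          // zeros are shifted in from the right
        out |= digit << shift;            // a zero digit ORs in nothing
        shift -= 4u * (digit != 0u);      // advance only after a nonzero digit
    }
    return out;
}
```

For example, compact_left(0x10302040) gives 0x13240000. The final decrement of shift may wrap around after the last digit, which is harmless because the value is never used again.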
I came up with the following solution. Please take a look, maybe it will help you.
#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
class IsZero
{
public:
bool operator ()(char c)
{
return '0' == c;
}
};
int main()
{
int a = 0x01020334; // INPUT
ostringstream my_sstream;
my_sstream << hex << a;
string str = my_sstream.str();
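cout << "Input hex: " << str << endl;
// Count the zeros before removing them: argument evaluation order is unspecified,
// so the count and the removal must not happen inside a single call.
string::size_type zero_count = count_if(begin(str), end(str), IsZero());
str.erase(remove_if(begin(str), end(str), IsZero()), end(str));
str.append(zero_count, '0');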
cout << "Processed hex: " << str << endl;
return 0;
}
Output:
Input hex: 1020334
Processed hex: 1233400
Let's say that I have an array of four 32-bit integers which I use to store a 128-bit number.
How can I perform left and right shifts on this 128-bit number?
Thanks!
Working with uint128? If you can, use the x86 SSE instructions, which were designed for exactly that. (Then, when you've bitshifted your value, you're ready to do other 128-bit operations...)
SSE2 bit shifts take ~4 instructions on average, with one branch (a case statement). No issues with shifting more than 32 bits, either. The full code for doing this, using gcc intrinsics rather than raw assembler, is in sseutil.c (github: "Unusual uses of SSE2"); it's a bit bigger than makes sense to paste here.
The hurdle for many people in using SSE2 is that shift ops take immediate (constant) shift counts. You can solve that with a bit of C preprocessor twiddling (wordpress: C preprocessor tricks). After that, you have op sequences like:
LeftShift(uint128 x, int n) = _mm_slli_epi64(_mm_slli_si128(x, n/8), n%8)
for n = 65..71, 73..79, … 121..127
... doing the whole shift in two instructions.
#include <assert.h>
#include <stddef.h>
void shiftl128 (
    unsigned int& a, // most significant word
    unsigned int& b,
    unsigned int& c,
    unsigned int& d, // least significant word
    size_t k)
{
    assert (k <= 128);
    if (k >= 32) // shifting a 32-bit integer by more than 31 bits is "undefined"
    {
        a=b;
        b=c;
        c=d;
        d=0;
        shiftl128(a,b,c,d,k-32);
    }
    else if (k > 0) // k == 0 would turn the (32-k) shifts below into undefined 32-bit shifts
    {
        a = (a << k) | (b >> (32-k));
        b = (b << k) | (c >> (32-k));
        c = (c << k) | (d >> (32-k));
        d = (d << k);
    }
}
void shiftr128 (
    unsigned int& a, // most significant word
    unsigned int& b,
    unsigned int& c,
    unsigned int& d, // least significant word
    size_t k)
{
    assert (k <= 128);
    if (k >= 32) // shifting a 32-bit integer by more than 31 bits is "undefined"
    {
        d=c;
        c=b;
        b=a;
        a=0;
        shiftr128(a,b,c,d,k-32);
    }
    else if (k > 0) // k == 0 would turn the (32-k) shifts below into undefined 32-bit shifts
    {
        d = (c << (32-k)) | (d >> k);
        c = (b << (32-k)) | (c >> k);
        b = (a << (32-k)) | (b >> k);
        a = (a >> k);
    }
}
Instead of using a 128-bit number, why not use a bitset? With a bitset you can choose how big you want it to be, and it supports quite a few operations, including shifts.
You can find more information on these here:
http://www.cppreference.com/wiki/utility/bitset/start?do=backlink
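For illustration, here is a small sketch of what the bitset route might look like with std::bitset<128>. The helpers from_words and word_at are hypothetical names of my own, and the sketch assumes word 0 of the array is the least significant one.

```cpp
#include <bitset>
#include <cstdint>

// Builds a 128-bit bitset from four 32-bit words; w[0] is assumed least significant.
std::bitset<128> from_words(const std::uint32_t w[4]) {
    std::bitset<128> bits;
    for (int i = 3; i >= 0; --i) {
        bits <<= 32;                       // make room for the next lower word
        bits |= std::bitset<128>(w[i]);
    }
    return bits;
}

// Extracts 32-bit word `index` (0 = least significant) back out of the bitset.
std::uint32_t word_at(const std::bitset<128>& bits, unsigned index) {
    const std::bitset<128> low32(0xFFFFFFFFull);
    return static_cast<std::uint32_t>(((bits >> (32 * index)) & low32).to_ulong());
}
```

Shifting is then just from_words(w) << n or from_words(w) >> n, with no special handling for shift counts of 32 or more.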
First, if you're shifting by n bits and n is greater than or equal to 32, divide by 32 and shift whole integers. This should be trivial. Now you're left with a remaining shift count from 0 to 31. If it's zero, return early, you're done.
For each integer you'll need to shift by the remaining n, then shift the adjacent integer by the same amount and combine the valid bits from each.
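The two steps described above could be sketched roughly like this. The function names are mine, and the code assumes the 128-bit value is stored with w[0] as the least significant 32-bit word.

```cpp
#include <cstdint>

// Both functions assume w[0] is the least significant 32-bit word.

void shift_left_128(std::uint32_t w[4], unsigned n) {
    if (n >= 128) {                        // everything is shifted out
        w[0] = w[1] = w[2] = w[3] = 0;
        return;
    }
    const unsigned words = n / 32;         // whole-word part of the shift
    const unsigned bits = n % 32;          // remaining shift, 0..31
    for (int i = 3; i >= 0; --i)           // step 1: move whole words
        w[i] = (static_cast<unsigned>(i) >= words) ? w[i - words] : 0u;
    if (bits == 0) return;                 // done; also avoids a shift by 32 below
    for (int i = 3; i > 0; --i)            // step 2: combine adjacent words
        w[i] = (w[i] << bits) | (w[i - 1] >> (32 - bits));
    w[0] <<= bits;
}

void shift_right_128(std::uint32_t w[4], unsigned n) {
    if (n >= 128) {
        w[0] = w[1] = w[2] = w[3] = 0;
        return;
    }
    const unsigned words = n / 32;
    const unsigned bits = n % 32;
    for (unsigned i = 0; i < 4; ++i)       // step 1: move whole words
        w[i] = (i + words < 4) ? w[i + words] : 0u;
    if (bits == 0) return;
    for (unsigned i = 0; i < 3; ++i)       // step 2: combine adjacent words
        w[i] = (w[i] >> bits) | (w[i + 1] << (32 - bits));
    w[3] >>= bits;
}
```

The early return for a remaining count of zero matters: without it, the (32 - bits) shifts would shift a 32-bit value by 32, which is undefined.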
Since you mentioned you're storing your 128-bit value in an array of 4 integers, you could do the following to shift by one bit:
// array[0] is the least significant word, array[3] the most significant.
void left_shift(unsigned int* array)
{
    for (int i=3; i >= 0; i--)
    {
        array[i] = array[i] << 1;
        if (i > 0)
        {
            // Carry the top bit of the next lower word into bit 0.
            unsigned int top_bit = (array[i-1] >> 31) & 0x1;
            array[i] = array[i] | top_bit;
        }
    }
}
void right_shift(unsigned int* array)
{
    for (int i=0; i < 4; i++)
    {
        array[i] = array[i] >> 1;
        if (i < 3)
        {
            // Carry the bottom bit of the next higher word into bit 31.
            unsigned int bottom_bit = (array[i+1] & 0x1) << 31;
            array[i] = array[i] | bottom_bit;
        }
    }
}
What's the proper way of going about this? Let's say I have ABCD and abcd and the output bits should be something like AaBbCcDd.
unsigned int JoinBits(unsigned short a, unsigned short b) { }
#include <stdint.h>
uint32_t JoinBits(uint16_t a, uint16_t b) {
    uint32_t result = 0;
    for(int8_t ii = 15; ii >= 0; ii--){
        result |= (a >> ii) & 1;
        result <<= 1;
        result |= (b >> ii) & 1;
        if(ii != 0){
            result <<= 1;
        }
    }
    return result;
}
Also tested on ideone here: http://ideone.com/lXTqB.
First, spread your bits:
unsigned int Spread(unsigned short x)
{
    unsigned int result=0;
    for (unsigned int i=0; i<16; ++i) // all 16 bits, including bit 15
        result |= ((x>>i)&1u)<<(i*2);
    return result;
}
Then merge the two with an offset in your function like this:
Spread(a) | (Spread(b)<<1);
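The spreading step can also be done without a loop. The sketch below swaps in the "binary magic numbers" technique from Bit Twiddling Hacks rather than the loop above; spread16 and join_bits are names I made up. Note that it puts the bits of a in the odd (higher) position of each pair, to match the AaBbCcDd order in the question; swap the arguments for the opposite convention.

```cpp
#include <cstdint>

// Spreads the 16 bits of v so that bit i lands at bit 2*i; odd bits are zero.
std::uint32_t spread16(std::uint16_t v) {
    std::uint32_t x = v;
    x = (x | (x << 8)) & 0x00FF00FFu;   // separate the two bytes
    x = (x | (x << 4)) & 0x0F0F0F0Fu;   // separate the nibbles
    x = (x | (x << 2)) & 0x33333333u;   // separate the bit pairs
    x = (x | (x << 1)) & 0x55555555u;   // separate the single bits
    return x;
}

// Interleaves a and b as a15 b15 a14 b14 ... a0 b0.
std::uint32_t join_bits(std::uint16_t a, std::uint16_t b) {
    return (spread16(a) << 1) | spread16(b);
}
```

For example, join_bits(0xA, 0x5) interleaves 1010 and 0101 into 10011001, i.e. 0x99.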
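If you want true bitwise interleaving, the simplest and most elegant way might be this:
unsigned int JoinBits(unsigned short a, unsigned short b)
{
    unsigned int r = 0;
    // a lands in the even bit positions, b in the odd ones.
    // 1u keeps the arithmetic unsigned, so shifting into bit 31 is well defined.
    for (int i = 0; i < 16; i++)
        r |= ((a & (1u << i)) << i) | ((b & (1u << i)) << (i + 1));
    return r;
}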
Without any math trick to exploit, my first naive solution would be to use a BitSet-like data structure to compute the output number bit by bit. This takes a loop over lg(a) + lg(b) bits, which gives you the complexity.
Quite possible with some bit manipulation, but the exact code depends on the byte order of the platform. Assuming little-endian (which is the most common), a 4-byte unsigned int, and treating A, B, C, D as hex digits, you could do:
#include <string.h>
unsigned int JoinBits(unsigned short x, unsigned short y) {
    // x := AB-CD
    // y := ab-cd
    unsigned char bytes[4];
    /* Dd */ bytes[0] = ((x & 0x000F) << 4) | (y & 0x000F);
    /* Cc */ bytes[1] = (x & 0x00F0) | ((y & 0x00F0) >> 4);
    /* Bb */ bytes[2] = ((x & 0x0F00) >> 4) | ((y & 0x0F00) >> 8);
    /* Aa */ bytes[3] = ((x & 0xF000) >> 8) | ((y & 0xF000) >> 12);
    unsigned int result;
    memcpy(&result, bytes, sizeof result); // avoids the misaligned, aliasing-unsafe pointer cast
    return result;
}
From Sean Anderson's website:
static const unsigned short MortonTable256[256] =
{
0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,
0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,
0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,
0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,
0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,
0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,
0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,
0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,
0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,
0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,
0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,
0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,
0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,
0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,
0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,
0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,
0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,
0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,
0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,
0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,
0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,
0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,
0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,
0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,
0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,
0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,
0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,
0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,
0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};
unsigned short x; // Interleave bits of x and y, so that all of the
unsigned short y; // bits of x are in the even positions and y in the odd;
unsigned int z; // z gets the resulting 32-bit Morton Number.
z = MortonTable256[y >> 8] << 17 |
MortonTable256[x >> 8] << 16 |
MortonTable256[y & 0xFF] << 1 |
MortonTable256[x & 0xFF];
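If you'd rather not paste the table, it can be generated at startup. The following is only a sketch under my own naming (make_morton_table and interleave_lookup are not from the original page): each entry is its 8-bit index with the bits spread to the even positions. The lookup casts to a 32-bit unsigned type before shifting, so the shift by 17 cannot overflow a signed int.

```cpp
#include <array>
#include <cstdint>

// Entry v holds the 8 bits of v spread out to the even bit positions.
std::array<std::uint16_t, 256> make_morton_table() {
    std::array<std::uint16_t, 256> table{};
    for (unsigned v = 0; v < 256; ++v) {
        unsigned spread = 0;
        for (unsigned bit = 0; bit < 8; ++bit)
            spread |= ((v >> bit) & 1u) << (2 * bit);
        table[v] = static_cast<std::uint16_t>(spread);
    }
    return table;
}

// Bits of x go to the even positions and bits of y to the odd positions.
std::uint32_t interleave_lookup(std::uint16_t x, std::uint16_t y) {
    static const std::array<std::uint16_t, 256> table = make_morton_table();
    return (static_cast<std::uint32_t>(table[y >> 8]) << 17) |
           (static_cast<std::uint32_t>(table[x >> 8]) << 16) |
           (static_cast<std::uint32_t>(table[y & 0xFF]) << 1) |
            static_cast<std::uint32_t>(table[x & 0xFF]);
}
```

The generated entries match the listed table, for instance index 3 maps to 0x0005 and index 0xFF to 0x5555.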