In shader intBitsToFloat and floatBitsToInt - opengl

So the issue I'm running into is a lack of knowledge. I need to stay in GLSL 120 (OpenGL 2.1), which locks the software down heavily. Essentially I need many of the features from 440 shaders in 120: using shader5 on supported hardware covers it just fine, but on shader4 (Mac and other targets, i.e. Mesa) I need to implement a lot of the functions myself. The ones that keep nagging me are the bit-related functions; I'm very bad at tracking bits, so any help would be lovely. The functions I'm having the biggest issues with are intBitsToFloat and floatBitsToInt. I've tried a few things, but with no success.
int floatToIntBits(float a) {
    // NaN
    if (a != a) return 0x7fc00000;
    // +0 / -0 (the sign of 1/a distinguishes them)
    if (a == 0.0) return (1.0 / a < 0.0) ? 0x80000000 : 0;
    bool neg = false;
    if (a < 0.0) {
        neg = true;
        a = -a;
    }
    // no isinf() in GLSL 120: infinity is the only positive value unchanged by halving
    if (a * 0.5 == a) {
        return neg ? 0xff800000 : 0x7f800000;
    }
    // floats can't be shifted or masked in GLSL, and a float has a 23-bit
    // mantissa and a bias of 127 (the 52/1023 constants belong to doubles),
    // so extract the exponent and mantissa arithmetically instead
    int e = int(floor(log2(a)));
    float m = a / exp2(float(e));
    if (m < 1.0) { e--; m *= 2.0; }        // guard against log2() rounding
    else if (m >= 2.0) { e++; m *= 0.5; }
    int mantissa = int((m - 1.0) * 8388608.0); // 2^23
    if (e <= -127) {
        // denormal: fold the implicit leading 1 back in and shift down
        mantissa = (0x800000 | mantissa) >> (-127 - e + 1);
        e = -127;
    }
    int bits = neg ? 0x80000000 : 0;
    bits |= (e + 127) << 23;
    bits |= mantissa;
    return bits;
}
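For reference, here is the behaviour being emulated, checked on the CPU (a minimal C++ sketch, assuming IEEE-754 single precision; the helper name and test values are made up, not part of the shader):
#include <cassert>
#include <cstdint>
#include <cstring>

// reference float -> bits conversion via memcpy (well-defined type punning)
int32_t floatBitsToIntRef(float f) {
    int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

int main() {
    assert(floatBitsToIntRef(1.0f) == 0x3f800000);  // exponent field 127, empty mantissa
    assert(floatBitsToIntRef(-2.0f) == int32_t(0xc0000000));
    assert(floatBitsToIntRef(0.5f) == 0x3f000000);
    return 0;
}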
Here are some of my other functions; any feedback would be appreciated.
bitfieldReverse
int bitfieldReverse(int x) {
    // mask after the right shifts: >> on a signed int is an arithmetic
    // shift, so masking first would smear the sign bit into the result
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF);
    x = (x << 16) | ((x >> 16) & 0x0000FFFF);
    return x;
}
I have all the variations; if something needs to change for uints, let me know.
LSB and MSB
// note: the builtins return the *index* of the bit (-1 for zero input),
// not the isolated bit itself, so both functions go through this helper
int bitIndex(int onehot) {
    int i = 0;
    if ((onehot & 0xFFFF0000) != 0) i += 16;
    if ((onehot & 0xFF00FF00) != 0) i += 8;
    if ((onehot & 0xF0F0F0F0) != 0) i += 4;
    if ((onehot & 0xCCCCCCCC) != 0) i += 2;
    if ((onehot & 0xAAAAAAAA) != 0) i += 1;
    return i;
}
int findLSB(int x) { return (x == 0) ? -1 : bitIndex(x & -x); }
int findMSB(int x) {
    // only valid for x >= 0; the real builtin special-cases negatives
    if (x <= 0) return -1;
    x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16);
    return bitIndex(x - (x >> 1)); // isolate the highest set bit, then index it
}
Same goes for these
bitCount
int bitCount(int a) {
    a = (a & 0x55555555) + ((a >> 1) & 0x55555555);
    a = (a & 0x33333333) + ((a >> 2) & 0x33333333);
    a = (a + (a >> 4)) & 0x0f0f0f0f;
    a = a + (a >> 8);
    a = a + (a >> 16);
    return a & 0xff;
}

Hardware that doesn't offer GL 3.0+ is almost always hardware that also doesn't have integers as a type distinct from float. As such, all of your ints are really floats in disguise, which means they must live with the ranges and restrictions of a float.
Because of this, you cannot effectively do the kind of bit-manipulation you're trying on such hardware.
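A quick CPU-side illustration of that restriction (C++, assuming IEEE-754): a float carries a 24-bit significand, so integers stored in one stop being exact at 2^24, and any bit trick that relies on exact integer arithmetic beyond that silently breaks:
#include <cassert>

int main() {
    float b = 16777216.0f; // 2^24: still exactly representable
    float c = b + 1.0f;    // 2^24 + 1 rounds straight back to 2^24
    assert(c == b);        // the "integer" lost its low bit
    return 0;
}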

Related

How to unit test bit manipulation logic

I have a method which converts RGBA to BGRA. Below is the method:
unsigned int ConvertRGBAToBGRA(unsigned int v) {
    unsigned char r = v & 0xFF;
    unsigned char g = (v >> 8) & 0xFF;
    unsigned char b = (v >> 16) & 0xFF;
    unsigned char a = (v >> 24) & 0xFF;
    return (a << 24) | (r << 16) | (g << 8) | b;
}
How can I unit test this nicely? Is there a way I can read back the bits and unit test this method somehow?
I am using googletest.
Inspired by @Yves Daoust's comment: why can't you just write a series of checks like the ones below? You can use C++14 digit separators for nice formatting:
unsigned int ConvertRGBAToBGRA(unsigned int v) {
    unsigned char r = v & 0xFF;
    unsigned char g = (v >> 8) & 0xFF;
    unsigned char b = (v >> 16) & 0xFF;
    unsigned char a = (v >> 24) & 0xFF;
    return (a << 24) | (r << 16) | (g << 8) | b;
}
TEST(ConvertRGBAToBGRATest, Test1) {
    EXPECT_EQ(ConvertRGBAToBGRA(0x12'34'56'78), 0x12'78'56'34);
    EXPECT_EQ(ConvertRGBAToBGRA(0x12'78'56'34), 0x12'34'56'78);
    EXPECT_EQ(ConvertRGBAToBGRA(0x11'11'11'11), 0x11'11'11'11);
    EXPECT_EQ(ConvertRGBAToBGRA(0x00'00'00'00), 0x00'00'00'00);
    EXPECT_EQ(ConvertRGBAToBGRA(0xAa'Bb'Cc'Dd), 0xAa'Dd'Cc'Bb);
    EXPECT_EQ(ConvertRGBAToBGRA(ConvertRGBAToBGRA(0x12'34'56'78)), 0x12'34'56'78);
    EXPECT_EQ(ConvertRGBAToBGRA(ConvertRGBAToBGRA(0x12'78'56'34)), 0x12'78'56'34);
    EXPECT_EQ(ConvertRGBAToBGRA(ConvertRGBAToBGRA(0x11'11'11'11)), 0x11'11'11'11);
    EXPECT_EQ(ConvertRGBAToBGRA(ConvertRGBAToBGRA(0x00'00'00'00)), 0x00'00'00'00);
    EXPECT_EQ(ConvertRGBAToBGRA(ConvertRGBAToBGRA(0xAa'Bb'Cc'Dd)), 0xAa'Bb'Cc'Dd);
}
Live example: https://godbolt.org/z/eEajYYYsf
You could also define a custom matcher and use the EXPECT_THAT macro. Note that the matcher needs all four byte comparisons; the R byte check was missing originally:
// A custom matcher for comparing BGRA and RGBA.
MATCHER_P(IsBgraOf, n, "") {
    return ((n & 0xFF000000) == (arg & 0xFF000000)) &&          // A
           ((n & 0x00FF0000) == ((arg << 16) & 0x00FF0000)) &&  // B
           ((n & 0x0000FF00) == (arg & 0x0000FF00)) &&          // G
           ((n & 0x000000FF) == ((arg >> 16) & 0x000000FF));    // R
}
TEST(ConvertRGBAToBGRATest, WithExpectThat) {
    EXPECT_THAT(ConvertRGBAToBGRA(0x12'34'56'78), IsBgraOf(0x12'34'56'78));
    EXPECT_THAT(ConvertRGBAToBGRA(0x12'78'56'34), IsBgraOf(0x12'78'56'34));
    EXPECT_THAT(ConvertRGBAToBGRA(0xAa'Bb'Cc'Dd), IsBgraOf(0xAa'Bb'Cc'Dd));
    EXPECT_THAT(ConvertRGBAToBGRA(0x00'00'00'00), IsBgraOf(0x00'00'00'00));
    EXPECT_THAT(ConvertRGBAToBGRA(0x11'11'11'11), IsBgraOf(0x11'11'11'11));
}
Live example: https://godbolt.org/z/P4EcW19s9
You can split the value into four bytes by viewing it through a byte pointer, then swap the R and B bytes. On a little-endian machine those are bytes 0 and 2 (the original snippet swapped bytes 1 and 3, which exchanges G and A instead):
uint8_t* pV = reinterpret_cast<uint8_t*>(&v);
uint8_t swap = pV[0]; pV[0] = pV[2]; pV[2] = swap;
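Tying that back to the tests above, a sketch of a hypothetical ConvertRGBAToBGRABySwap wrapper, checked against the shift version in the same googletest style (assumes a little-endian platform):
#include <cstdint>
#include <utility>

unsigned int ConvertRGBAToBGRABySwap(unsigned int v) {
    uint8_t* pV = reinterpret_cast<uint8_t*>(&v);
    std::swap(pV[0], pV[2]); // exchange R and B; G and A stay put
    return v;
}
TEST(ConvertRGBAToBGRATest, SwapMatchesShiftVersion) {
    EXPECT_EQ(ConvertRGBAToBGRABySwap(0x12'34'56'78), ConvertRGBAToBGRA(0x12'34'56'78));
    EXPECT_EQ(ConvertRGBAToBGRABySwap(0xAa'Bb'Cc'Dd), ConvertRGBAToBGRA(0xAa'Bb'Cc'Dd));
}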

Efficiently deinterleave even and odd bits from an integer in arm/neon [duplicate]

How do you encode/decode Morton codes (Z-order) given [x, y] as 32-bit unsigned integers producing a 64-bit Morton code, and vice versa?
I do have xy2d and d2xy, but only for coordinates that are 16 bits wide, producing a 32-bit Morton number. I've searched a lot on the net but couldn't find anything. Please help.
If it is possible for you to use architecture-specific instructions, you'll likely be able to accelerate the operation beyond what is possible using bit-twiddling hacks:
For example, if you write code for Intel Haswell and later CPUs, you can use the BMI2 instruction set, which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
Here is a complete example (tested with GCC):
#include <immintrin.h>
#include <stdint.h>
// on GCC, compile with option -mbmi2, requires Haswell or better.
uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    // use the 64-bit deposit so all 32 bits of x and y reach the result
    return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}
void morton_to_xy(uint64_t m, uint32_t *x, uint32_t *y)
{
    *x = _pext_u64(m, 0x5555555555555555);
    *y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
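A quick round-trip sanity check of the two functions above (the test values are arbitrary):
#include <assert.h>

int main(void)
{
    uint32_t x, y;
    morton_to_xy(xy_to_morton(99, 123456), &x, &y);
    assert(x == 99 && y == 123456);
    // interleaving x = 3 (0b0011) and y = 5 (0b0101) gives 0b100111 = 39
    assert(xy_to_morton(3, 5) == 39);
    return 0;
}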
If you have to support earlier CPUs or the ARM platform, not all is lost. You may still at least get help for the xy_to_morton function from instructions designed for cryptography.
A lot of CPUs have support for carry-less multiplication these days. On ARM that'll be vmul_p8 from the NEON instruction set. On x86 you'll find it as PCLMULQDQ from the CLMUL instruction set (available since 2010).
The trick here is that a carry-less multiplication of a number with itself returns a bit pattern that contains the original bits of the argument with zero bits interleaved. So it is identical to the _pdep_u64(x, 0x5555555555555555) shown above. E.g. it turns the following byte:
+----+----+----+----+----+----+----+----+
| b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 |
+----+----+----+----+----+----+----+----+
Into:
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 0 | b7 | 0 | b6 | 0 | b5 | 0 | b4 | 0 | b3 | 0 | b2 | 0 | b1 | 0 | b0 |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
Now you can build the xy_to_morton function as (here shown for CLMUL instruction set):
#include <wmmintrin.h>
#include <stdint.h>
// on GCC, compile with option -mpclmul
uint64_t carryless_square(uint32_t x)
{
    // build the operand with intrinsics instead of pointer-casting a
    // uint64_t array, which is neither alignment- nor aliasing-safe
    __m128i a = _mm_set_epi64x(0, x);
    a = _mm_clmulepi64_si128(a, a, 0);
    return (uint64_t)_mm_cvtsi128_si64(a);
}
uint64_t xy_to_morton(uint32_t x, uint32_t y)
{
    return carryless_square(x) | (carryless_square(y) << 1);
}
_mm_clmulepi64_si128 generates a 128-bit result, of which we only use the lower 64 bits. So you can even improve upon the version above and use a single _mm_clmulepi64_si128 to do the job.
That is as good as you can get on mainstream platforms (e.g. modern ARM with NEON and x86). Unfortunately I don't know of any trick to speed up the morton_to_xy function using the cryptography instructions, and I tried really hard for several months.
For hardware without those instruction sets, the portable mask-and-shift version does the same job:
void xy2d_morton(uint64_t x, uint64_t y, uint64_t *d)
{
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x << 2)) & 0x3333333333333333;
    x = (x | (x << 1)) & 0x5555555555555555;
    y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
    y = (y | (y << 8)) & 0x00FF00FF00FF00FF;
    y = (y | (y << 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y | (y << 2)) & 0x3333333333333333;
    y = (y | (y << 1)) & 0x5555555555555555;
    *d = x | (y << 1);
}
// morton_1 - extract even bits
uint32_t morton_1(uint64_t x)
{
    x = x & 0x5555555555555555;
    x = (x | (x >> 1)) & 0x3333333333333333;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0F;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FF;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFF;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFF;
    return (uint32_t)x;
}
void d2xy_morton(uint64_t d, uint64_t &x, uint64_t &y)
{
    x = morton_1(d);
    y = morton_1(d >> 1);
}
The naïve code would be the same regardless of the bit count. If you don't need a super-fast bit-twiddling version, this will do:
uint32_t x;  // input coordinates to interleave
uint32_t y;
uint64_t z = 0;
for (int i = 0; i < sizeof(x) * 8; i++)
{
    z |= (x & (uint64_t)1 << i) << i | (y & (uint64_t)1 << i) << (i + 1);
}
If you need faster bit twiddling, then this one should work. Note that x and y have to be held in 64-bit variables:
uint64_t x;
uint64_t y;
uint64_t z = 0;
x = (x | (x << 16)) & 0x0000FFFF0000FFFF;
x = (x | (x << 8)) & 0x00FF00FF00FF00FF;
x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0F;
x = (x | (x << 2)) & 0x3333333333333333;
x = (x | (x << 1)) & 0x5555555555555555;
y = (y | (y << 16)) & 0x0000FFFF0000FFFF;
y = (y | (y << 8)) & 0x00FF00FF00FF00FF;
y = (y | (y << 4)) & 0x0F0F0F0F0F0F0F0F;
y = (y | (y << 2)) & 0x3333333333333333;
y = (y | (y << 1)) & 0x5555555555555555;
z = x | (y << 1);
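A small sanity check for the loop version (wrapped in a hypothetical helper so it can be asserted against known values):
#include <assert.h>
#include <stdint.h>

// hypothetical wrapper around the naive loop above
static uint64_t interleave_naive(uint32_t x, uint32_t y)
{
    uint64_t z = 0;
    for (int i = 0; i < 32; i++)
        z |= (x & (uint64_t)1 << i) << i | (y & (uint64_t)1 << i) << (i + 1);
    return z;
}

int main(void)
{
    assert(interleave_naive(3, 5) == 39); // 0b100111
    assert(interleave_naive(0xFFFFFFFFu, 0) == 0x5555555555555555u);
    return 0;
}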

Optimizing Bitshift into Array

I have a piece of code that runs at ~1.2 million iterations per second after performing some tasks; the bulkiest part is filling a uint8_t array with bit-shifted data from two uint32_t values. The excerpted code is as follows:
static inline uint32_t RotateRight(uint32_t val, int n)
{
    return (val >> n) + (val << (32 - n));
}
static inline uint32_t CSUInt32BE(const uint8_t *b)
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
static uint32_t ReverseBits(uint32_t val) // Usually just static, tried inline/static inline
{
    // uint32_t res = 0;
    // for (int i = 0; i < 32; i++)
    // {
    //     res <<= 1;
    //     res |= val & 1;
    //     val >>= 1;
    // }
    // Original code above, benched ~220k l/s

    //val = ((val & 0x55555555) << 1) | ((val >> 1) & 0x55555555);
    //val = ((val & 0x33333333) << 2) | ((val >> 2) & 0x33333333);
    //val = ((val & 0x0F0F0F0F) << 4) | ((val >> 4) & 0x0F0F0F0F);
    //val = ((val & 0x00FF00FF) << 8) | ((val >> 8) & 0x00FF00FF);
    //val = (val << 16) | (val >> 16);
    // Option 0, benched ~770k on MBP

    uint32_t c = 0;
    c = (BitReverseTable256[val & 0xff] << 24) |
        (BitReverseTable256[(val >> 8) & 0xff] << 16) |
        (BitReverseTable256[(val >> 16) & 0xff] << 8) |
        (BitReverseTable256[val >> 24]); // was (val >> 24) & 0xff
    // Option 1, benched ~970k l/s on MBP, Current, minor tweak to 24

    //unsigned char * p = (unsigned char *)&val;
    //unsigned char * q = (unsigned char *)&c;
    //q[3] = BitReverseTable256[p[0]];
    //q[2] = BitReverseTable256[p[1]];
    //q[1] = BitReverseTable256[p[2]];
    //q[0] = BitReverseTable256[p[3]];
    // Option 2 at ~970k l/s on MBP from http://stackoverflow.com/questions/746171/best-algorithm-for-bit-reversal-from-msb-lsb-to-lsb-msb-in-c

    return c; // Current
    // return val; // option 0
    // return res; // original

    //uint32_t m;
    //val = (val >> 16) | (val << 16); // swap halfwords
    //m = 0x00ff00ff; val = ((val >> 8) & m) | ((val << 8) & ~m); // swap bytes
    //m = m^(m << 4); val = ((val >> 4) & m) | ((val << 4) & ~m); // swap nibbles
    //m = m^(m << 2); val = ((val >> 2) & m) | ((val << 2) & ~m);
    //m = m^(m << 1); val = ((val >> 1) & m) | ((val << 1) & ~m);
    //return val;
    // Benches at 850k l/s on MBP

    //uint32_t t;
    //val = (val << 15) | (val >> 17);
    //t = (val ^ (val >> 10)) & 0x003f801f;
    //val = (t + (t << 10)) ^ val;
    //t = (val ^ (val >> 4)) & 0x0e038421;
    //val = (t + (t << 4)) ^ val;
    //t = (val ^ (val >> 2)) & 0x22488842;
    //val = (t + (t << 2)) ^ val;
    //return val;
    // Benches at 820k l/s on MBP
}
static void StuffItDESCrypt(uint8_t data[8], StuffItDESKeySchedule *ks, BOOL enc)
{
    uint32_t left = ReverseBits(CSUInt32BE(&data[0]));
    uint32_t right = ReverseBits(CSUInt32BE(&data[4]));
    right = RotateRight(right, 29);
    left = RotateRight(left, 29);
    //Encryption function runs here
    left = RotateRight(left, 3);
    right = RotateRight(right, 3);
    uint32_t left1 = ReverseBits(left);
    uint32_t right1 = ReverseBits(right);
    data[0] = right1 >> 24;
    data[1] = (right1 >> 16) & 0xff;
    data[2] = (right1 >> 8) & 0xff;
    data[3] = right1 & 0xff;
    data[4] = left1 >> 24;
    data[5] = (left1 >> 16) & 0xff;
    data[6] = (left1 >> 8) & 0xff;
    data[7] = left1 & 0xff;
}
Is this the optimal way to accomplish this? I have a uint64_t version as well:
uint64_t both = ((uint64_t)ReverseBits(left) << 32) | (uint64_t)ReverseBits(right);
data[0] = (both >> 24) & 0xff;
data[1] = (both >> 16) & 0xff;
data[2] = (both >> 8) & 0xff;
data[3] = both & 0xff;
data[4] = (both >> 56);
data[5] = (both >> 48) & 0xff;
data[6] = (both >> 40) & 0xff;
data[7] = (both >> 32) & 0xff;
I tested what would happen if I completely skipped this assignment (the ReverseBits function is still done), and the code runs at ~6.5 million runs per second. In addition, the speed hit happens even if I do just one of the assignments, leveling out at 1.2 million without touching the other 7.
I'd hate to think that this operation takes a massive 80% speed hit due to this work and can't be made any faster.
This is on Windows Visual Studio 2015 (though I try to keep the source as portable to macOS and Linux as possible).
Edit: The full base code is at GitHub. I am not the original author of the code; however, I have forked it and maintain a password-recovery solution using a modified-for-speed version. You can see my speed-up successes in ReverseBits with various solutions and benched speeds.
These files are 20+ years old and have successfully recovered files, albeit at a low speed, for years. See the blog post.
You're certainly doing more work than you need to do. Note how function ReverseBits() goes to some effort to put the bytes of the reversed word in the correct order, and how the next thing that happens -- the part to which you are attributing the slowdown -- is to reorder those same bytes.
You could write and use a modified version of ReverseBits() that puts the bytes of the reversed representation directly into the correct places in the array, instead of packing them into integers just to unpack them again. That ought to be at least a bit faster, as you would be strictly removing operations.
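For instance, a sketch of that idea (assuming the same BitReverseTable256 table as Option 1, and matching the byte order the original assignments produce):
static void ReverseBitsToBytes(uint32_t val, uint8_t *out)
{
    // same lookups as Option 1, but each byte goes straight
    // to its final position instead of through a uint32_t
    out[0] = BitReverseTable256[val & 0xff];
    out[1] = BitReverseTable256[(val >> 8) & 0xff];
    out[2] = BitReverseTable256[(val >> 16) & 0xff];
    out[3] = BitReverseTable256[val >> 24];
}
The two tails of StuffItDESCrypt() would then collapse to ReverseBitsToBytes(right, &data[0]); ReverseBitsToBytes(left, &data[4]);.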
My immediate thought was to "view" the uint32_t values as if they were an array of uint8_t, like:
uint8_t data2[8];
*((uint32_t*)&data2[0]) = right1;
*((uint32_t*)&data2[4]) = left1;
However, you store the most significant bits of right1 in data[0], whereas this approach puts the least significant bits in data[0] (on a little-endian machine). Anyway, as I do not know what ReverseBits does and whether you could also adapt your code to a different order, maybe it helps...
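For completeness, the same stores can be written with memcpy, which sidesteps the alignment and strict-aliasing concerns of the pointer cast (a hypothetical helper; the byte-order caveat above still applies):
#include <stdint.h>
#include <string.h>

static void StoreHalves(uint8_t data2[8], uint32_t right1, uint32_t left1)
{
    // compilers lower these calls to single 4-byte stores
    memcpy(&data2[0], &right1, sizeof(right1));
    memcpy(&data2[4], &left1, sizeof(left1));
}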

reverseBytes using Bitwise operators

reverseBytes - reverse bytes
Example: reverseBytes(0x0123456789abcdef) = 0xefcdab8967452301
Legal ops: ! ~ & ^ | + << >>
I'm required to solve the above problem. There is no limit on the number of operations. I already have a different solution to this, but I would like to know what's wrong with the following solution I came up with. Thank you.
long reverseBytes(long x) {
    int a = x;        //Get first byte, lowest 8 bits
    int b = (a >> 8); //Get 2nd byte
    int c = (b >> 8); //3rd byte
    int d = (c >> 8); //4th
    int e = (d >> 8); //5th
    int f = (e >> 8); //6th
    int g = (f >> 8); //7th
    int h = (g >> 8); //8th
    a = a & 0xFF;     //Remove the rest except LSB byte
    b = b & 0xFF;     // same
    c = c & 0xFF;
    d = d & 0xFF;
    e = e & 0xFF;
    f = f & 0xFF;
    g = g & 0xFF;
    h = h & 0xFF;
    return ((a << 56) + (b << 48) + (c << 40) + (d << 32) +
            (e << 24) + (f << 16) + (g << 8) + (h));
}
(The problem with the posted version: a through h are 32-bit ints, so shifts like a << 56 exceed the width of the type, which is undefined behavior, and the upper bytes are lost; the intermediates would need to be 64-bit and unsigned.)
For practical application, you can use a library byte swap such as bswap64() or GCC's __builtin_bswap64().
For student practice, you need to write a simple loop like:
unsigned long reverseBytes(unsigned long x) {
    unsigned long rc = 0; // must be initialized before the shift-or loop
    for (int i = 0; i < 8; i++, x >>= 8)
        rc = (rc << 8) | (unsigned char)x;
    return rc;
}
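A quick check against the example from the problem statement (assuming a 64-bit unsigned long):
#include <assert.h>

int main(void)
{
    assert(reverseBytes(0x0123456789abcdefUL) == 0xefcdab8967452301UL);
    assert(reverseBytes(reverseBytes(42UL)) == 42UL); // reversing twice is the identity
    return 0;
}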

Best c++ way to choose randomly position of set bit in bitset

I have a std::bitset<32> word and I want to randomly choose an index (0-31) of some bit which is 1. How can I do that without loops and counters? Is there any std::algorithm suitable for that?
If it's easier, I can convert the bitset to a string or an int and do it on the string or int.
Here's a first stab at it:
std::bitset<32> bitset{...};
std::mt19937 prng(std::time(nullptr));
// note: bitset.count() must be nonzero, or the distribution below is invalid
std::uniform_int_distribution<std::size_t> dist{1, bitset.count()};
std::size_t p = 0;
for (std::size_t c = dist(prng); c; ++p)
    c -= bitset[p];
// (p - 1) is now the index of the chosen bit.
It works by counting the set bits, doing the random pick c in that interval, then looking for the cth set bit.
If you have a 32-bit (or even 64-bit) bitset, a more efficient solution would be to convert it to an integer and then use bitwise operations on that integer to get a random set bit.
Here is how you can convert your bitset to unsigned long:
std::bitset<32> word(0x1028);
unsigned long ulWord = word.to_ulong(); // ulWord == 0x1028
Then you can use the "Select the bit position" function from the Bit Twiddling Hacks page to select a random set bit efficiently:
unsigned int bitcnt = word.count();
unsigned int randomSetBitIndex = 63 - selectBit(ulWord, random() % bitcnt + 1);
unsigned long randomSetBit = 1UL << randomSetBitIndex; // 1UL: a plain int 1 would overflow at bit 31
Here is the full code:
// Select random set bit from a bitset
#include <iostream>
#include <bitset>
#include <random>
using namespace std;
unsigned int selectBit(unsigned long long v, unsigned int r) {
    // Source: https://graphics.stanford.edu/~seander/bithacks.html
    // v - Input: value to find position with rank r.
    // r - Input: bit's desired rank [1-64].
    unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
    uint64_t a, b, c, d; // Intermediate temporaries for bit count.
    unsigned int t;      // Bit count temporary.
    // Do a normal parallel bit count for a 64-bit integer,
    // but store all intermediate steps.
    // (~0ULL rather than ~0UL: unsigned long is only 32 bits on some platforms)
    a = v - ((v >> 1) & ~0ULL/3);
    b = (a & ~0ULL/5) + ((a >> 2) & ~0ULL/5);
    c = (b + (b >> 4)) & ~0ULL/0x11;
    d = (c + (c >> 8)) & ~0ULL/0x101;
    t = (d >> 32) + (d >> 48);
    // Now do branchless select!
    s = 64;
    s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
    t = (d >> (s - 16)) & 0xff;
    s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
    t = (c >> (s - 8)) & 0xf;
    s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
    t = (b >> (s - 4)) & 0x7;
    s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
    t = (a >> (s - 2)) & 0x3;
    s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
    t = (v >> (s - 1)) & 0x1;
    s -= ((t - r) & 256) >> 8;
    return 64 - s;
}
int main() {
    // Input
    std::bitset<32> word(0x1028);
    // Initialize random number generator
    std::random_device randDevice;
    std::mt19937 random(randDevice());
    // Select random bit
    unsigned long ulWord = word.to_ulong();
    unsigned int bitcnt = word.count();
    unsigned int randomSetBitIndex = 63 - selectBit(ulWord, random() % bitcnt + 1);
    unsigned long randomSetBit = 1UL << randomSetBitIndex;
    // Output
    cout << "0x" << std::hex << randomSetBit << endl; // either 0x8, 0x20 or 0x1000
    return 0;
}
Run it on Ideone.