Getting the number of trailing 1 bits - c++

Are there any efficient bitwise operations I can do to get the number of set bits that an integer ends with? For example, 11 (decimal) = 1011 (binary) has two trailing 1 bits, and 8 (decimal) = 1000 (binary) has zero trailing 1 bits.
Is there a better algorithm for this than a linear search? I'm implementing a randomized skip list and using random numbers to determine the maximum level of an element when inserting it. I am dealing with 32 bit integers in C++.
Edit: assembler is out of the question, I'm interested in a pure C++ solution.

Calculate ~i & (i + 1) and use the result as a lookup in a table with 32 entries. 1 means zero 1s, 2 means one 1, 4 means two 1s, and so on, except that 0 means 32 1s.
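A minimal sketch of that lookup, assuming a mod-37 hash to turn the isolated bit into a small table index. The hash, the 37-slot table and the function names are illustrative choices, not part of the answer above. Every power of two below 2^32 leaves a distinct non-zero remainder modulo 37, and 0 stays 0, so 37 slots are enough to hold the 33 possible results:

```cpp
#include <array>
#include <cstdint>

// Build the table once: slot (2^k mod 37) holds k, slot 0 holds 32.
// The modulus 37 is an assumption of this sketch, not something the answer prescribes.
static std::array<std::uint8_t, 37> make_trailing_ones_table() {
    std::array<std::uint8_t, 37> table{};
    table[0] = 32;  // ~i & (i + 1) is 0 only when i is all ones
    for (unsigned k = 0; k < 32; ++k) {
        table[(std::uint32_t{1} << k) % 37u] = static_cast<std::uint8_t>(k);
    }
    return table;
}

// Hypothetical helper: number of trailing 1 bits in i.
inline unsigned trailing_ones_lookup(std::uint32_t i) {
    static const std::array<std::uint8_t, 37> table = make_trailing_ones_table();
    std::uint32_t isolated = ~i & (i + 1u);  // single bit just left of the trailing 1s
    return table[isolated % 37u];
}
```

The modulo keeps the sketch short; a multiply-and-shift hash would probably be faster where division is expensive.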

Taking the answer from Ignacio Vazquez-Abrams and completing it with the count rather than a table:
b = ~i & (i+1); // this gives a 1 to the left of the trailing 1's
b--; // this gets us just the trailing 1's that need counting
b = (b & 0x55555555) + ((b>>1) & 0x55555555); // 2 bit sums of 1 bit numbers
b = (b & 0x33333333) + ((b>>2) & 0x33333333); // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f) + ((b>>4) & 0x0f0f0f0f); // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff) + ((b>>8) & 0x00ff00ff); // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff) + ((b>>16) & 0x0000ffff); // sum of 16 bit numbers
at the end b will contain the count of 1's (the masks, adding and shifting count the 1's).
Unless I goofed of course. Test before use.

The Bit Twiddling Hacks page has a number of algorithms for counting trailing zeros. Any of them can be adapted by simply inverting your number first, and there are probably clever ways to alter the algorithms in place without doing that as well. On a modern CPU with cheap floating point operations the best is probably thus:
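unsigned int v=~input; // find the number of trailing ones in input
int r; // the result goes here
float f = (float)(v & -v); // cast the least significant set bit in v to a float
uint32_t bits;
memcpy(&bits, &f, sizeof bits); // needs <cstring>; avoids the strict-aliasing violation of *(uint32_t *)&f
r = (int)(bits >> 23) - 0x7f; // unbiased exponent = index of the isolated bit
if(r==-127) r=32; // v was 0, i.e. input was all ones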

GCC has __builtin_ctz and other compilers have their own intrinsics. Just protect it with an #ifdef:
#ifdef __GNUC__
int trailingones( uint32_t in ) {
return ~ in == 0? 32 : __builtin_ctz( ~ in );
}
#else
// portable implementation
#endif
On x86, this builtin will compile to one very fast instruction. Other platforms might be somewhat slower, but most have some kind of bit-counting functionality that will beat what you can do with pure C operators.
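One way the `#else` branch could be filled in is sketched below, assuming a plain shift loop as the portable fallback. The function name follows the answer, but the fallback itself is an assumption of this sketch and is not tuned for speed:

```cpp
#include <cstdint>

// Counts trailing 1 bits; uses the GCC/Clang builtin when available.
inline int trailingones(std::uint32_t in) {
#if defined(__GNUC__) || defined(__clang__)
    std::uint32_t inverted = ~in;
    // __builtin_ctz(0) is undefined, so the all-ones input is handled separately.
    return inverted == 0 ? 32 : __builtin_ctz(inverted);
#else
    // Portable fallback (assumed here): shift out 1 bits until a 0 appears.
    int count = 0;
    while (in & 1u) {
        ++count;
        in >>= 1;
    }
    return count;
#endif
}
```

Since the question uses random numbers for skip-list levels, the loop usually exits after one or two iterations, so the fallback is less costly in that setting than its worst case suggests.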

There may be better answers available, particularly if assembler isn't out of the question, but one viable solution would be to use a lookup table. It would have 256 entries, each returning the number of contiguous trailing 1 bits. Apply it to the lowest byte. If it's 8, apply to the next and keep count.

Implementing Steven Sudit's idea...
uint32_t n; // input value
uint8_t o = 0; // number of trailing one bits in n
uint8_t trailing_ones[256] = {
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 8};
uint8_t t;
do {
t=trailing_ones[n&255];
o+=t;
} while(t==8 && (n>>=8));
Cost: the loop body (1 lookup + 1 comparison + 3 arithmetic operations) runs 1 (best) to 4 (worst) times, about 1.004 times on average for uniformly random input, minus one arithmetic operation overall.

This code counts the number of trailing zero bits, taken from here (there's also a version that depends on the IEEE 32 bit floating point representation, but I wouldn't trust it, and the modulus/division approaches look really slick - also worth a try):
int CountTrailingZeroBits(unsigned int v) // 32 bit
{
unsigned int c = 32; // c will be the number of zero bits on the right
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
static const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers
for (int i = 4; i >= 0; --i) // unroll for more speed
{
if (v & B[i])
{
v <<= S[i];
c -= S[i];
}
}
if (v)
{
c--;
}
return c;
}
and then to count trailing ones:
int CountTrailingOneBits(unsigned int v)
{
return CountTrailingZeroBits(~v);
}

http://graphics.stanford.edu/~seander/bithacks.html might give you some inspiration.

Implementation based on Ignacio Vazquez-Abrams's answer
uint8_t trailing_ones(uint32_t i) {
return log2(~i & (i + 1));
}
Implementation of log2() is left as an exercise for the reader (see here)

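Taking @phkahler's answer, you can define the following macro:
#define trailing_ones(x) __builtin_ctz(~(x) & ((x) + 1))
Because the expression leaves a single 1 bit just to the left of the trailing ones, you can simply count its trailing zeros. Note that __builtin_ctz(0) is undefined, so an all-ones input needs separate handling.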

Blazingly fast ways to find the number of trailing 0's are given in Hacker's Delight.
You could complement your integer (or more generally, word) to find the number of trailing 1's.

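I have this sample for you. Note that trailbits as written counts the runs of two or more consecutive 1 bits (or 0 bits when zero is true) in the word, not the length of the trailing run: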
#include <stdio.h>
int trailbits ( unsigned int bits, bool zero )
{
int bitsize = sizeof(int) * 8;
int len = 0;
int trail = 0;
unsigned int compbits = bits;
if ( zero ) compbits = ~bits;
for ( ; bitsize; bitsize-- )
{
if ( compbits & 0x01 ) trail++;
else
{
if ( trail > 1 ) len++;
trail = 0;
}
compbits = compbits >> 1;
}
if ( trail > 1 ) len++;
return len;
}
void PrintBits ( unsigned int bits )
{
unsigned int pbit = 0x80000000;
for ( int len=0 ; len<32; len++ )
{
printf ( "%c ", pbit & bits ? '1' : '0' );
pbit = pbit >> 1;
}
printf ( "\n" );
}
int main(void)
{
unsigned int forbyte = 0x0CC00990;
PrintBits ( forbyte );
printf ( "Trailing ones is %d\n", trailbits ( forbyte, false ));
printf ( "Trailing zeros is %d\n", trailbits ( forbyte, true ));
}

Related

Index of first byte having its MSB set

I have eight 8-bit values stored in a 64-bit integer. The MSB of each byte can either be 1 or 0, and the rest of their bits are all 0. Example:
MSB 10000000 00000000 10000000 ... 10000000 00000000 00000000 LSB
I now need to find the index of first byte that has its bit set. First meaning that we search from the least significant direction. In the above example the result would be 2.
Using de Bruijn we could scan for the first set bit and divide by 8 to get its byte index.
Here's my question: de Bruijn is generic, it works for any input. But in my use case we are limited to bytes having only their MSB set. Is it possible to optimize for this case?
The implementation is in C++. I can't use any intrinsics or inline assembly (_BitScanForward64(), __builtin_clzll etc).
(Edit)
Isolate the lowest set bit x &= (-x) then see How to find the position of the only-set-bit in a 64-bit value using bit manipulation efficiently? which is examining this exact problem (despite the title).
The answers below are slightly more general.
A couple cycles of latency could be saved over the de Bruijn bitscan by eliminating the table lookup.
uint64_t ByteIndexOfLowestSetBit(uint64_t val) {
assert(val != 0);
const uint64_t m = UINT64_C(0x0101010101010101);
return ((((val - 1) ^ val) & (m - 1)) * m) >> 56;
}
Use trailing bit manipulation to get a mask covering the lowest set bit and below.
Set each byte covered by the mask to 1. Count how many 1 bytes we have by prefix-summing them horizontally. We now have placed a 1-based byte index into the most significant byte of the u64 word. Shift the count to the bottom and subtract 1 to get a 0-based index. However, we don’t want the -1 on the critical path... so instead subtract 1 from m so we never count the least significant byte in the total.
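The multiply trick is easy to get subtly wrong, so a plain loop to compare against can help. The sketch below repeats the routine from the answer (without the assert) next to a hypothetical reference scan. The names byte_index_mul and byte_index_loop are made up for this illustration:

```cpp
#include <cstdint>

// The multiply-based routine from the answer above. Precondition: val is
// non-zero and only the top bit of each byte may be set.
inline std::uint64_t byte_index_mul(std::uint64_t val) {
    const std::uint64_t m = UINT64_C(0x0101010101010101);
    // (val - 1) ^ val  : mask of the lowest set bit and everything below it
    // & (m - 1)        : one bit per covered byte, skipping byte 0
    // * m, >> 56       : sum those bits into the top byte and move it down
    return ((((val - 1) ^ val) & (m - 1)) * m) >> 56;
}

// Hypothetical reference: scan bytes from the least significant one upward.
inline int byte_index_loop(std::uint64_t val) {
    for (int i = 0; i < 8; ++i) {
        if (val & (UINT64_C(0x80) << (8 * i))) return i;
    }
    return -1;  // no byte has its top bit set
}
```

For the question's example (bytes 2, 5 and 7 marked), both functions return 2.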
Finding the highest set bit (MS1B) is more complicated because there is no bit-manipulation trick that isolates it. In that case, extract the bits with a single multiplication and use them as an index into a table. If an input value of zero is not allowed, then the value of the least significant byte either doesn't matter or is non-zero. This allows the use of a lookup table with 7-bit indices instead of 8-bit ones.
Adapt as needed.
uint64_t ReversedIndexOf_Highest_Byte_With_LSB_Set (uint64_t val) {
static const unsigned char ctz7_tab[128] = {
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
};
assert(val != 0);
assert((val & 0xFEFEFEFEFEFEFEFEULL) == 0);
val = (val * UINT64_C(0x0080402010080402)) >> 57;
return ctz7_tab[val];
}
Here is a simple way.
int LeastSignificantSetBitByteIndex(uint64_t value) // uint64_t from <cstdint>; long may be only 32 bits wide
{
if((value & 0x80ULL) != 0) return 0;
if((value & 0x8000ULL) != 0) return 1;
if((value & 0x800000ULL) != 0) return 2;
if((value & 0x80000000ULL) != 0) return 3;
if((value & 0x8000000000ULL) != 0) return 4;
if((value & 0x800000000000ULL) != 0) return 5;
if((value & 0x80000000000000ULL) != 0) return 6;
if((value & 0x8000000000000000ULL) != 0) return 7;
return -1;
}
int MostSignificantSetBitByteIndex(uint64_t value) // index 0 is the most significant byte here
{
if((value & 0x8000000000000000ULL) != 0) return 0;
if((value & 0x80000000000000ULL) != 0) return 1;
if((value & 0x800000000000ULL) != 0) return 2;
if((value & 0x8000000000ULL) != 0) return 3;
if((value & 0x80000000ULL) != 0) return 4;
if((value & 0x800000ULL) != 0) return 5;
if((value & 0x8000ULL) != 0) return 6;
if((value & 0x80ULL) != 0) return 7;
return -1;
}

How can I make something happen x percentage?

I have to write a piece of code in the form of c*b, where c and b are random numbers and the product is smaller than INT_MAX. But b or c has to be equal to 0 10% of the time, and I don't know how to do that.
srand ( time(NULL) );
int product = b*c;
c = rand() % 10000;
b = rand() % INT_MAX/c;
b*c < INT_MAX;
cout<<""<<endl;
cout << "What is " << c << "x" << b << "?"<<endl;
cin >> guess;
You can use std::piecewise_constant_distribution
std::random_device rd;
std::mt19937 gen(rd());
double interval[] = {0, 0, 1, Max};
double weights[] = { .10, 0, .9};
std::piecewise_constant_distribution<> dist(std::begin(interval),
std::end(interval),
weights);
dist(gen);
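An int is always less than or equal to INT_MAX, so you can simply multiply a random boolean variable that is true with 90% probability by the product of two uniformly distributed integers. Be aware that with the default range [0, INT_MAX] the product of the two integers can overflow, which is undefined behavior for int, so restrict the range if the product must stay representable: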
std::random_device rd;
std::mt19937 generator(rd());
std::uniform_int_distribution<int> uniform;
std::bernoulli_distribution bernoulli(0.9); // 90% 1 ; 10% 0
const int product = bernoulli(generator) * uniform(generator) * uniform(generator);
If you had a specific limit in mind, like say N for the individual numbers and M for the product of the two numbers you can do:
std::default_random_engine generator;
std::uniform_int_distribution<int> uniform(0,N);
std::bernoulli_distribution bernoulli(0.9); // 90% 1 ; 10% 0
int product;
do { product = bernoulli(generator) * uniform(generator) * uniform(generator); }
while(!(product<M));
edit: std::piecewise_constant_distribution is more elegant, didn't know about it until I read the other answer.
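A minimal sketch that keeps the product below INT_MAX without any rejection loop, assuming c is drawn first and b is then limited to what still fits. The helper name make_factors, the range 1..10000 (loosely taken from the question's rand() % 10000) and the coin flip deciding which factor becomes 0 are all assumptions of this example:

```cpp
#include <climits>
#include <random>
#include <utility>

// Hypothetical helper: returns {b, c} with b * c < INT_MAX, where exactly
// one factor is forced to 0 in about 10% of the calls.
template <class Generator>
std::pair<int, int> make_factors(Generator& gen) {
    std::uniform_int_distribution<int> pick_c(1, 10000);  // assumed range
    std::bernoulli_distribution force_zero(0.10);         // true 10% of the time
    std::bernoulli_distribution zero_b(0.5);              // which factor to zero (arbitrary)

    int c = pick_c(gen);
    // Limiting b to (INT_MAX - 1) / c guarantees b * c <= INT_MAX - 1.
    std::uniform_int_distribution<int> pick_b(1, (INT_MAX - 1) / c);
    int b = pick_b(gen);

    if (force_zero(gen)) {
        if (zero_b(gen)) b = 0; else c = 0;
    }
    return {b, c};
}
```

A call such as `std::mt19937 gen(std::random_device{}()); auto f = make_factors(gen);` then yields factors whose product can be computed in int without overflow. Because both ranges start at 1, zeros come only from the Bernoulli draw.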
If you want a portable solution that does not depend on the C++ standard library, and that is also faster and maybe simpler to understand, you can use the following snippet. The variable random_sequence is a pre-generated array of values in which 0 occurs 10% of the time. The variables runs and len are used to index into this array as an endless sequence. This is, however, a simplistic solution, since the pattern repeats after 90 runs. If you don't care about the pattern repeating, this method will work fine.
int runs = 0;
int len = 90; //The length of the random sequence.
int random_sequence[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
int coefficient = random_sequence[runs % len];
runs++;
Then, whatever variable you want to be 0 10% of the time you do it like this:
float b = coefficient * rand();
or
float c = coefficient * rand();
If you want both variables to be 0 10% of the time individually, then it's like this:
float b = coefficient * rand();
coefficient = random_sequence[runs % len];
float c = coefficient * rand();
And if you want them to be 0 10% of the times jointly then the random_sequence array must be like this:
int len = 80;
int random_sequence[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
And use
float b = coefficient * rand();
float c = coefficient * rand();
I gave it a shot here (http://codepad.org/DLbVfNVQ). The average value is somewhere in the neighborhood of 0.4. -CR

'std::wstring_convert' to convert as much as possible (from a UTF8 file-read chunk)

I am fetching text from a utf-8 text file, and doing it by chunks to increase performance.
std::ifstream.read(myChunkBuff_str, myChunkBuff_str.length())
Here is a more detailed example
I am getting around 16 thousand characters with each chunk.
My next step is to convert this std::string into something that can allow me to work on these "complex characters" individually, thus converting that std::string into std::wstring.
I am using the following function for converting, taken from here:
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.to_bytes (wide_string);
}
std::wstring widen (const std::string& utf8_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.from_bytes (utf8_string);
}
However, at the end of the chunk one of the Russian characters might be cut off, and the conversion will fail with a std::range_error exception.
For example, in UTF-8 "привет" takes 12 bytes and "приве" takes 10 bytes (each of these Cyrillic letters is encoded as 2 bytes).
So, if my chunk was hypothetically 11 bytes long, the 'т' would be partially missing, and the conversion would throw an exception.
Question:
How can I detect such a partially loaded character ('т' in this case)? That would allow me to convert without it, and perhaps start the next chunk a bit earlier than planned so that it includes the problematic 'т' next time.
I don't want to wrap these functions in try/catch, as that might slow the program down. It also doesn't tell me how much of the character was missing for the conversion to actually succeed.
I know about wstring_convert::converted(), but it's not really useful if my program crashes before I get to it.
You could do this using a couple of functions. UTF-8 has a way to detect the beginning of a multibyte character and (from the beginning) the size of the multibyte character.
So two functions:
// returns zero if this is the first byte of a UTF-8 char
// otherwise non-zero.
static unsigned is_continuation(char c)
{
return (c & 0b10000000) && !(c & 0b01000000);
}
// if c is the *first* byte of a UTF-8 multibyte character, returns
// the total number of bytes of the character.
static unsigned size(const unsigned char c)
{
constexpr static const char u8char_size[] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0
};
return u8char_size[(unsigned char)c];
}
You could track back from the end of your buffer until is_continuation(c) is false. Then check if size(c) of the current UTF-8 char is longer than the end of the buffer.
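The backtracking described above could look roughly like the sketch below. It is self-contained and therefore does not reuse is_continuation() and size(); the names utf8_sequence_length and complete_utf8_prefix are hypothetical, and malformed input is deliberately passed through unchanged so that the converter still reports it:

```cpp
#include <cstddef>
#include <string>

// Total length announced by a lead byte, or 0 for a continuation/invalid byte.
// Only the 1..4 byte forms of modern UTF-8 are recognized (an assumption here).
static std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Returns how many leading bytes of chunk form only complete characters.
// The remaining bytes (at most 3) belong to a character cut off by the chunk end.
std::size_t complete_utf8_prefix(const std::string& chunk) {
    const std::size_t n = chunk.size();
    std::size_t i = n;
    // Step back over at most 4 bytes looking for the last non-continuation byte.
    for (int steps = 0; i > 0 && steps < 4; ++steps) {
        --i;
        const unsigned char b = static_cast<unsigned char>(chunk[i]);
        if ((b & 0xC0) != 0x80) {                     // lead byte or ASCII
            const std::size_t len = utf8_sequence_length(b);
            if (len == 0) return n;                   // invalid lead: leave as is
            return (i + len > n) ? i : n;             // truncated -> cut before it
        }
    }
    return n;  // empty chunk, or no lead byte nearby (malformed): leave as is
}
```

The caller would convert `chunk.substr(0, complete_utf8_prefix(chunk))` and prepend the leftover bytes to the next chunk.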
Disclaimer - last time I looked these functions were working but have not used them in a while.
Edit: to add.
If you feel like doing the whole thing manually, I may as well post the code to convert a UTF-8 multibyte character to a UTF-16 multibyte character or a UTF-32 char.
UTF-32 Is easy:
// returns a UTF-32 char from a `UTF-8` multibyte
// character pointed to by cp
static char32_t char32(const char* cp)
{
auto sz = size(*cp); // function above
if(sz == 1)
return *cp;
char32_t c32 = (0b0111'1111 >> sz) & (*cp);
for(unsigned i = 1; i < sz; ++i)
c32 = (c32 << 6) | (cp[i] & 0b0011'1111);
return c32;
}
UTF-16 Is a little more tricky:
// UTF-16 characters can be 1 or 2 characters wide...
using char16_pair = std::array<char16_t, 2>;
// outputs a UTF-16 char in cp16 from a `UTF-8` multibyte
// character pointed to by cp
//
// returns the number of characters in this `UTF-16` character
// (1 or 2).
static unsigned char16(const char* cp, char16_pair& cp16)
{
char32_t c32 = char32(cp);
if(c32 < 0xD800 || (c32 > 0xDFFF && c32 < 0x10000))
{
cp16[0] = char16_t(c32);
cp16[1] = 0;
return 1;
}
c32 -= 0x010000;
cp16[0] = ((0b1111'1111'1100'0000'0000 & c32) >> 10) + 0xD800;
cp16[1] = ((0b0000'0000'0011'1111'1111 & c32) >> 00) + 0xDC00;
return 2;
}
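To make the surrogate arithmetic concrete, here is a small worked sketch that isolates just that step. The helper name to_surrogates is hypothetical, and it assumes the code point is already known to lie above U+FFFF:

```cpp
#include <array>

// Hypothetical helper mirroring the surrogate arithmetic above.
// Precondition: 0x10000 <= code_point <= 0x10FFFF.
inline std::array<char16_t, 2> to_surrogates(char32_t code_point) {
    const char32_t offset = code_point - 0x10000;  // 20 significant bits remain
    return {{
        static_cast<char16_t>(0xD800 + (offset >> 10)),    // high surrogate: top 10 bits
        static_cast<char16_t>(0xDC00 + (offset & 0x3FF)),  // low surrogate: bottom 10 bits
    }};
}
```

For U+1F600 the offset is 0xF600, so the high surrogate is 0xD800 + 0x3D = 0xD83D and the low surrogate is 0xDC00 + 0x200 = 0xDE00.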

Parallel algorithm that does a small insertion/shifting

Say I have an array A of 8 numbers, and another array B that determines how many places each number in A should be shifted to the right:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means the number stays where it is; 1 means the number should move one place to the right. Here a 0 is inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether a 0 or something else is inserted is not important; the point is that all numbers after 3 are shifted by one place. Numbers shifted past the end of the array are dropped.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using a scan/prefix sum or something similar to solve this problem. The array is also small enough (<32 numbers) to fit in one warp, so I should be able to use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1, 0, 1, 1, 1 and 0, 1, 0 ,0 all produce the same data offset pattern, for example) it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern will be created if each zero in the shift pattern gets replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value where each bit is one if the value at the corresponding index in the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane will now take this value and add to it a value that has a 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2 and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this) and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction:
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
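For checking GPU results, a sequential host-side reference of the same idea may be handy. The sketch below is plain C++, not the warp-level algorithm: it simply carries the most recent non-zero shift forward as the offset, which is the "offset pattern" interpretation used in this answer. The function name shift_right_reference is made up, and both arrays are assumed to have the same length:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: each element moves right by the most recent
// non-zero shift value at or before its index; elements pushed past the end
// are dropped and untouched slots stay 0.
std::vector<int> shift_right_reference(const std::vector<int>& data,
                                       const std::vector<unsigned>& shift) {
    std::vector<int> result(data.size(), 0);
    unsigned offset = 0;  // last non-zero shift seen so far
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (shift[i] != 0) offset = shift[i];  // a non-zero entry starts a new offset
        const std::size_t target = i + offset;
        if (target < result.size()) result[target] = data[i];
    }
    return result;
}
```

With A = {3, 6, 7, 8, 1, 2, 3, 5} and B = {0, 1, 0, 0, 2, 0, 0, 0}, the offsets are 0, 1, 1, 1, 2, 2, 2, 2 and the result is {3, 0, 6, 7, 8, 0, 1, 2}, matching the second test case from the question.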
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ int mydelta(const int shift){
unsigned nz = __ballot(shift == 0);
unsigned mylane = (threadIdx.x & 31);
unsigned lanebit = 1<<mylane;
unsigned temp = nz + lanebit;
temp = nz ^ temp;
unsigned delta = __popc(temp);
return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
if (threadIdx.x < limit){
unsigned lshift = shift[(limit - 1) - threadIdx.x];
unsigned delta = mydelta(lshift);
unsigned myshift = __shfl_down(lshift, delta);
myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
result[threadIdx.x] = 0;
if ((myshift + threadIdx.x) < limit)
result[threadIdx.x + myshift] = data[threadIdx.x];
}
}
int main(){
int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
int *d_data, *d_result, *h_result;
unsigned *d_shift;
h_result = (int *)malloc(DSIZE*sizeof(int));
if (h_result == NULL) { printf("malloc fail\n"); return 1;}
cudaMalloc(&d_data, DSIZE*sizeof(int));
cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
cudaMalloc(&d_result, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("index: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", i);
printf("\nA: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", A[i]);
printf("\ntc1 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc1B[i]);
printf("\ntc1 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc2 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc2B[i]);
printf("\ntc2 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc3 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc3B[i]);
printf("\ntc3 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
printf("\n");
return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$

An integer [0,4095] (12 bits) to a tuple {A,B,C} the fastest way in C++

Input: an integer in [0,4095] (12 bits).
Output: a tuple {A,B,C}, each in [0,255].
A, B and C are given as 0 to 255, where 15 in the 4 bits maps to 255. The reason is that I want to construct a Color struct with RGB defined from 0 to 255.
I assume the solution is something like bit shifting the input to extract the three sets of 4 bits and then multiplying by 17 (255/15, since 15 = 1111 in binary).
How would you compute this fastest?
my own solution:
QColor mycolor(int value)
{
if(value > 0xFFF)
value = 0xFFF;
int a=0,b=0,c=0;
a = (value & 0xF) * 17;
b = ((value&(0xF<<4))>>4) *17;
c = ((value&(0xF<<8))>>8) *17;
return QColor(c,b,a);
}
cv::Mat cv_image(10,10,CV_16U,cv::Scalar::all(1));
QImage image(cv_image.data, 10,10,QImage::Format_RGB444);
QPainter p(&image);
p.setPen(mycolor(255));
p.drawLine(0,0,9,0);
p.setPen(mycolor(4095));
p.drawLine(0,1,9,1);
p.setPen(mycolor(0));
p.drawLine(0,2,9,2);
p.setPen(mycolor(10000));
p.drawLine(0,3,9,3);
********* Start testing of Test1 *********
Config: Using QTest library 4.7.4, Qt 4.7.4
PASS : Test1::initTestCase()
[255, 255, 255, 255, 255, 255, 255, 255, 255, 255;
4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095;
0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095, 4095;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1;
1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
PASS : Test1::test1()
First of all, the input 0...4095 is in fact 12 bits, which makes the question easier to understand. Here is one possible solution:
int val; // 0...4095
int red = ((val&(15<<8))>>8)*17; // each channel is 4 bits wide, so the mask is 15, not 255
int green = ((val&(15<<4))>>4)*17;
int blue = ((val&(15<<0))>>0)*17;
I have kept the bit shifting for blue as well so you can spot the similarity in the calculation. Hope this helps.
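You can use a union to parse your color-coded 12-bit value more conveniently. Note that bit-field layout is implementation-defined, and that anonymous structs and reading a union member other than the one last written are not standard C++, so this is not portable.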
union colorCoding
{
unsigned int val:12;
struct
{
unsigned int red:4;
unsigned int blue:4;
unsigned int green:4;
};
};
To get the first four bits from the input, you can AND it with 1111, then bitshift the input to the right by four bits and repeat the process. This gets you three integers in the range of 0 to 15.
If you then want to convert that to something in [0,255], then bitshift everything to the left by four bits and OR it with 1111 (for simplicity).
A = (input&15)<<4|15;
input >>= 4;
B = (input&15)<<4|15;
input >>= 4;
C = (input&15)<<4|15;
or (if you want 0 to map to 0)
A = input&15;
A = A<<4|A;
input >>= 4;
B = input&15;
B = B<<4|B;
input >>= 4;
C = input&15;
C = C<<4|C;
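Putting the pieces together, a compact version might look like the sketch below. The Rgb8 struct is a hypothetical stand-in for QColor, the function name expand_rgb444 is made up, and the channel order (red in the top nibble) follows the question's own code:

```cpp
#include <cstdint>

// Hypothetical stand-in for the question's QColor.
struct Rgb8 {
    std::uint8_t r, g, b;
};

// Expands a 12-bit 4:4:4 value to 8 bits per channel.
inline Rgb8 expand_rgb444(unsigned value) {
    if (value > 0xFFFu) value = 0xFFFu;  // clamp, as the question's code does
    const unsigned r = (value >> 8) & 0xFu;
    const unsigned g = (value >> 4) & 0xFu;
    const unsigned b = value & 0xFu;
    // For a 4-bit n, n * 17 equals (n << 4) | n, so 0 maps to 0 and 15 to 255.
    return Rgb8{static_cast<std::uint8_t>(r * 17u),
                static_cast<std::uint8_t>(g * 17u),
                static_cast<std::uint8_t>(b * 17u)};
}
```

For example, 0x123 expands to (17, 34, 51). Whether the multiply or the shift-and-OR form is faster depends on the compiler and target; both give the same result.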