AVX512 - Left packing elements by index using mask - c++

In short, I am trying to compress (left-pack) 64-bit integers by index. Neither the scatter nor the compress intrinsics solve this problem directly.
Suppose you have eight 64-bit integers in a and want to left-pack the elements selected by mask k into memory starting at base_addr, placing them according to index.
int64_t* dst; // memory to store the result
__m512i a = _mm512_loadu_si512 ( arr ); // load data from memory into a
__mmask8 k = _mm512_cmpgt_epi64_mask ( a, _mm512_set1_epi64(6) ); // compare for greater-than
__m512i index = _mm512_set_epi64 ( 14, 12, 10, 8, 6, 4, 2, 0 ); // index vector
_mm512_mask_compressstoreu_epi64_by_index ( dst, k, index, a ); // How can I implement this function efficiently?
So the _mm512_mask_compressstoreu_epi64_by_index function should compress 64-bit integers from a into memory at dst using index. Only the elements of a that are active in the writemask k are stored to memory.
The result of this function will look like:
dst = [10, 0, 7, 0, 9, 0, 0, 0 ...].
The elements 10, 7 and 9 are stored at indices 0, 2 and 4, respectively.
I've tried the _mm512_mask_compressstoreu_epi64 and _mm512_mask_i64scatter_epi64 intrinsics, but these instructions store the elements differently. They give the following results:
_mm512_mask_compressstoreu_epi64( dst, k, a ) produces: dst = [ 10, 7, 9 , ... ]
_mm512_mask_i64scatter_epi64 ( dst, k, index, a, 8 ) produces: dst = [ 0, 10, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 9, ...]
What I want is _mm512_mask_compressstoreu_epi64_by_index( dst, k, index, a ), which results in dst = [10, 0, 7, 0, 9, 0, 0, 0 ...]
How can I solve this problem?
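To pin down the semantics being asked for, here is a scalar reference sketch. It is not an AVX-512 implementation, and the name compress_store_by_index_ref and its array-based signature are assumptions made for illustration only: the j-th element of a that is active in the mask is written to dst[index[j]].

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical helper, not an intrinsic): the j-th active
// element of `a` is stored at dst[index[j]]; inactive elements are skipped.
// Returns the number of elements written.
std::size_t compress_store_by_index_ref(std::int64_t* dst,
                                        std::uint8_t mask,
                                        const std::array<std::int64_t, 8>& index,
                                        const std::array<std::int64_t, 8>& a) {
    assert(dst != nullptr);
    std::size_t packed = 0;  // number of active elements seen so far
    for (std::size_t lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane)) {
            dst[index[packed]] = a[lane];
            ++packed;
        }
    }
    return packed;
}
```

With a = {1, 10, 3, 7, 2, 9, 4, 5}, mask 0x2A (the lanes holding values greater than 6) and index = {0, 2, 4, ..., 14}, a zeroed dst ends up starting with 10, 0, 7, 0, 9, 0. One vectorized possibility, untested here, would be to compress in registers with _mm512_maskz_compress_epi64 and then scatter only the first popcount(k) lanes with _mm512_mask_i64scatter_epi64.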

Related

Is there any efficient way to do "shuffling" of vector

I have a large size unsorted array, each element contains a unique integer number,
std::vector<size_t> Vec= {1, 5, 3, 7, 18...}
I need to shuffle the vector in the following way: given a specific number, look it up and swap it with the number at a new desired position. This swap needs to be done many times.
Currently I use another vector, PositionLookup, to remember and update the positions after every swap. Is there a more efficient way or data structure for this?
Current solution,
// look for a specific number "key" and swap it with the number in desired position "position_new"
void shuffle(size_t key, size_t position_new)
{
    size_t temp = Vec[position_new]; // main vector
    size_t position_old = PositionLookup[key]; // auxiliary vector
    Vec[position_old] = temp;
    PositionLookup[temp] = position_old;
    Vec[position_new] = key;
    PositionLookup[key] = position_new;
}
A couple microoptimizations to start with: If the vector has a fixed size, you could use a std::array or a plain C array instead of a std::vector. You can also use the most compact integer type that can hold all the values in the vector (e.g. std::int8_t/signed char for values in the interval [-128,127], std::uint16_t/unsigned short for values in the interval [0,65535], etc.)
The bigger optimization opportunity: Since the values themselves never change, only their indexes, you only need to keep track of the indexes.
Suppose for simplicity's sake the values are 0 through 4. In that case we can have an array
std::array<std::int8_t, 5> indices{{2, 3, 1, 4, 0}};
Each entry indices[v] is the position of value v in an imaginary array, here 4, 2, 0, 1, 3. In other words, indices[0] is 2, which is the index of 0 in the imaginary array.
Then to swap the positions of 0 and 1 you only need to do
std::swap(indices[0], indices[1]);
Which makes the indices array 3, 2, 1, 4, 0 and the imaginary array 4, 2, 1, 0, 3.
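To check the relationship between the two arrays, the imaginary array can be rebuilt from the index array. This is only a sketch; the helper name imaginary_array is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// indices[v] is the position of value v; rebuild the array that those
// positions describe (hypothetical helper, for checking only).
std::vector<std::size_t> imaginary_array(const std::vector<std::size_t>& indices) {
    std::vector<std::size_t> arr(indices.size());
    for (std::size_t value = 0; value < indices.size(); ++value) {
        assert(indices[value] < arr.size());
        arr[indices[value]] = value;
    }
    return arr;
}
```

Calling it on {2, 3, 1, 4, 0} gives 4, 2, 0, 1, 3, and after swapping the first two entries it gives 4, 2, 1, 0, 3.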
Of course the imaginary array's values might not be the same as its indices.
If the (sorted) values are something like -2, -1, 0, 1, 2 you could obtain the index from the value by adding 2, or if they're 0, 3, 6, 9, 12 you could divide by 3, or if they're -5, -3, -1, 1, 3 you could add 5 then divide by 2, etc.
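For evenly spaced values, the conversion in both directions could be sketched like this. AffineValues is a hypothetical helper, with first being the smallest value and step the spacing between neighbours.

```cpp
#include <cassert>

// Hypothetical helper for sorted, evenly spaced values:
// first, first + step, first + 2 * step, ...
struct AffineValues {
    int first;
    int step;

    int value_to_index(int value) const {
        assert(step > 0 && value >= first && (value - first) % step == 0);
        return (value - first) / step;
    }

    int index_to_value(int index) const { return first + step * index; }
};
```

For the values -5, -3, -1, 1, 3 this would be AffineValues{-5, 2}, so value_to_index(1) adds 5 and divides by 2 to give 3.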
If the values don't follow a defined pattern, you can create a second array to look up the value that goes with an index.
std::array<std::int8_t, 5> indices{{2, 3, 1, 4, 0}};
constexpr std::array<std::int8_t, 5> Values{{1, 3, 5, 7, 18}};
// Imaginary array before: 18, 5, 1, 3, 7
std::swap(indices[0], indices[1]);
// Imaginary array after: 18, 5, 3, 1, 7
const auto index_to_value = [&](decltype(indices)::value_type idx) noexcept {
return Values[idx];
};
const auto value_to_index = [&](decltype(Values)::value_type val) noexcept {
return std::lower_bound(Values.begin(), Values.end(), val)
- Values.begin();
};
It's the same thing if the values aren't known until runtime, just obviously the values lookup table can't be const or constexpr.
std::array<std::int8_t, 5> indices{{2, 3, 1, 4, 0}};
std::array<std::int8_t, 5> values; // Not known yet at compile-time
// ... set `values` at runtime to e.g. -93, -77, -64, 8, 56
// Imaginary array before: 56, -64, -93, -77, 8
std::swap(indices[0], indices[1]);
// Imaginary array after: 56, -64, -77, -93, 8
const auto index_to_value = [&](decltype(indices)::value_type idx) noexcept {
return values[idx];
};
const auto value_to_index = [&](decltype(values)::value_type val) noexcept {
return std::lower_bound(values.cbegin(), values.cend(), val)
- values.cbegin();
};

Max subarray with start and end index

I'm trying to find the maximum contiguous subarray with start and end index. The method I've adopted is divide-and-conquer, with O(nlogn) time complexity.
I have tested with several test cases, and the start and end index always work correctly. However, I found that if the array contains an odd number of elements, the maximum sum is sometimes correct and sometimes incorrect (seemingly at random). For arrays with an even number of elements it is always correct. Here is my code:
int maxSubSeq(int A[], int n, int &s, int &e)
{
    // s and e stand for the start and end index respectively,
    // and both are passed by reference
    if(n == 1){
        return A[0];
    }
    int sum = 0;
    int midIndex = n / 2;
    int maxLeftIndex = midIndex - 1;
    int maxRightIndex = midIndex;
    int leftMaxSubSeq = A[maxLeftIndex];
    int rightMaxSubSeq = A[maxRightIndex];
    int left = maxSubSeq(A, midIndex, s, e);
    int right = maxSubSeq(A + midIndex, n - midIndex, s, e);
    for(int i = midIndex - 1; i >= 0; i--){
        sum += A[i];
        if(sum > leftMaxSubSeq){
            leftMaxSubSeq = sum;
            s = i;
        }
    }
    sum = 0;
    for(int i = midIndex; i < n; i++){
        sum += A[i];
        if(sum > rightMaxSubSeq){
            rightMaxSubSeq = sum;
            e = i;
        }
    }
    return max(max(leftMaxSubSeq + rightMaxSubSeq, left),right);
}
Below are two of the test cases I was working with: one has an odd number of elements, the other an even number.
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
Edit: The following are the 2 kinds of outputs:
// TEST 1
Test file : T2-Data-1.txt
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
maxSubSeq : A[3..7] = 32769 // Index is correct, but sum should be 20
Test file : T2-Data-2.txt
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
maxSubSeq : A[9..17] = 39 // correct
// TEST 2
Test file : T2-Data-1.txt
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
maxSubSeq : A[3..7] = 20
Test file : T2-Data-2.txt
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
maxSubSeq : A[9..17] = 39
Can anyone point out why this is occurring? Thanks in advance!
Assuming that n is the correct size of your array (we see it being passed in as a parameter and later used to initialize midIndex, but we do not see the actual call and so must assume you're doing it correctly), the issue lies here:
int midIndex = n / 2;
In the case that your array has an odd number of elements, which we can represent as
n = 2k + 1
the exact value of your middle index would be
(2k + 1) / 2 = k + (1/2)
so for every integer k, the exact quotient is k plus one half.
Dividing two ints in C++ is integer division: the fractional part is discarded (truncation toward zero), not rounded. So while you might expect k + 0.5 to round up to k + 1, you actually get k.
This means that, for example, when your array size is 11, midIndex is 5, so the left half has 5 elements and the right half has 6. Therefore, you need to adjust your code accordingly.
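A small sketch of the arithmetic, in case it helps. Split and split_sizes are names made up here; they only mirror how the code in the question divides the array with midIndex and n - midIndex.

```cpp
#include <cassert>

// Sizes of the two halves produced by `midIndex = n / 2`
// (hypothetical helper mirroring the recursion in the question).
struct Split {
    int left;
    int right;
};

Split split_sizes(int n) {
    assert(n >= 1);
    const int mid = n / 2;  // integer division: the fraction is discarded
    return Split{mid, n - mid};
}
```

For n = 11 this yields halves of 5 and 6 elements, and for n = 20 it yields 10 and 10.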

Efficient C++ way to shift a cv::Mat with OpenCV

What is an efficient way to "shift" an OpenCV cv::Mat?
With shift I mean that if I have a row like this
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
and I shift it by 3 positions, I will get a row like this
3, 4, 5, 6, 7, 8, 9, 0, 1, 2
Now I am using this function:
void shift( const cv::Mat& in, cv::Mat& out, int shift )
{
if ( shift < 0 || shift > in.cols ) return;
if ( shift == 0 || shift==in.cols ) {
out = in.clone();
} else {
cv::hconcat(in(cv::Rect(shift,0,in.cols-shift,in.rows)),in(cv::Rect(0,0,shift,in.rows)),out);
}
}
but I am looking for a more efficient way.
If the size of your array is not too big and rows=1, then you can store the row twice, back to back, like this
0,1,2,3,4,5,6,7,8,9, 0,1,2,3,4,5,6,7,8,9
Then, to shift it, just offset the array's data pointer by the shift amount and read cols elements from there.
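A sketch of that idea on a plain std::vector<int> rather than a cv::Mat, so it does not depend on OpenCV. The helper names make_doubled and shifted_view are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Store the row twice, back to back: r0..r(n-1), r0..r(n-1).
std::vector<int> make_doubled(const std::vector<int>& row) {
    std::vector<int> doubled;
    doubled.reserve(row.size() * 2);
    doubled.insert(doubled.end(), row.begin(), row.end());
    doubled.insert(doubled.end(), row.begin(), row.end());
    return doubled;
}

// Pointer to the start of the row shifted by `shift` positions; the next
// n elements read through it form the shifted row. No data is copied.
const int* shifted_view(const std::vector<int>& doubled, std::size_t shift) {
    const std::size_t n = doubled.size() / 2;
    assert(shift <= n);
    return doubled.data() + shift;
}
```

For the row 0..9 and a shift of 3, reading ten elements from the returned pointer gives 3, 4, 5, 6, 7, 8, 9, 0, 1, 2. The trade-off is twice the memory and the need to rebuild the doubled buffer whenever the row changes.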

Parallel algorithm that does a small insertion/shifting

Say I have an array A of 8 numbers, and another array B that determines how many places each number in A should be shifted to the right
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means valid (no shift) and 1 means this number should be placed 1 position later. A 0 is inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether 0 or something else is inserted is not important; the point is that all numbers after the 3 get shifted by one place. Numbers shifted past the end are no longer in the array.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using scan/prefix-sum or something similar to solve this problem. Also, the array is small enough that I should be able to fit it in one warp (<32 numbers) and use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1, 0, 1, 1, 1 and 0, 1, 0 ,0 all produce the same data offset pattern, for example) it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern will be created if each zero in the shift pattern gets replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value, where each bit is one if the value at the corresponding index into the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane will now take this value, and add a value to it which has a 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2, and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this), and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction.
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
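The bit manipulation in the steps above can be checked on the host before moving to a GPU. The following is a plain C++ emulation, not CUDA: a loop over lanes stands in for the warp, popcount32 stands in for __popc(), and the names offset_pattern and apply_offsets are made up for this sketch. As in the description, lane i is assumed to hold the shift value at array position n-1-i, so "to the left" in the array means a higher lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in for CUDA's __popc().
inline int popcount32(std::uint32_t x) {
    int count = 0;
    while (x) { x &= x - 1; ++count; }
    return count;
}

// Host emulation of the five steps: turn a shift pattern into an offset pattern.
std::vector<unsigned> offset_pattern(const std::vector<unsigned>& shift) {
    const std::size_t n = shift.size();
    assert(n <= 32);
    std::uint32_t zero_mask = 0;  // step 1: emulates __ballot(shift == 0)
    for (std::size_t lane = 0; lane < n; ++lane)
        if (shift[n - 1 - lane] == 0) zero_mask |= std::uint32_t{1} << lane;

    std::vector<unsigned> offsets(n, 0);
    for (std::size_t lane = 0; lane < n; ++lane) {
        const std::uint32_t lanebit = std::uint32_t{1} << lane;
        const std::uint32_t carried = zero_mask + lanebit;  // step 2
        const std::uint32_t diff = zero_mask ^ carried;     // step 3
        const int delta = popcount32(diff) - 1;             // step 4
        // Step 5: read the shift value `delta` lanes up (emulates __shfl_down);
        // lanes outside the data are treated as holding 0.
        const std::size_t src = lane + static_cast<std::size_t>(delta);
        offsets[n - 1 - lane] = (src < n) ? shift[n - 1 - src] : 0u;
    }
    return offsets;
}

// Move each data item right by its offset; items pushed past the end are dropped.
std::vector<int> apply_offsets(const std::vector<int>& data,
                               const std::vector<unsigned>& offsets) {
    assert(data.size() == offsets.size());
    std::vector<int> result(data.size(), 0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        const std::size_t target = i + offsets[i];
        if (target < result.size()) result[target] = data[i];
    }
    return result;
}
```

For the shift pattern 0, 1, 0, 0, 2, 0, 0, 0 this produces the offset pattern 0, 1, 1, 1, 2, 2, 2, 2, and applying it to A = 3, 6, 7, 8, 1, 2, 3, 5 gives 3, 0, 6, 7, 8, 0, 1, 2.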
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ int mydelta(const int shift){
unsigned nz = __ballot(shift == 0);
unsigned mylane = (threadIdx.x & 31);
unsigned lanebit = 1<<mylane;
unsigned temp = nz + lanebit;
temp = nz ^ temp;
unsigned delta = __popc(temp);
return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
if (threadIdx.x < limit){
unsigned lshift = shift[(limit - 1) - threadIdx.x];
unsigned delta = mydelta(lshift);
unsigned myshift = __shfl_down(lshift, delta);
myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
result[threadIdx.x] = 0;
if ((myshift + threadIdx.x) < limit)
result[threadIdx.x + myshift] = data[threadIdx.x];
}
}
int main(){
int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
int *d_data, *d_result, *h_result;
unsigned *d_shift;
h_result = (int *)malloc(DSIZE*sizeof(int));
if (h_result == NULL) { printf("malloc fail\n"); return 1;}
cudaMalloc(&d_data, DSIZE*sizeof(int));
cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
cudaMalloc(&d_result, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("index: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", i);
printf("\nA: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", A[i]);
printf("\ntc1 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc1B[i]);
printf("\ntc1 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc2 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc2B[i]);
printf("\ntc2 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc3 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc3B[i]);
printf("\ntc3 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
printf("\n");
return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$

Getting the number of trailing 1 bits

Are there any efficient bitwise operations I can do to get the number of set bits that an integer ends with? For example, 11 (decimal) = 1011 (binary) would be two trailing 1 bits. 8 (decimal) = 1000 (binary) would be 0 trailing 1 bits.
Is there a better algorithm for this than a linear search? I'm implementing a randomized skip list and using random numbers to determine the maximum level of an element when inserting it. I am dealing with 32 bit integers in C++.
Edit: assembler is out of the question, I'm interested in a pure C++ solution.
Calculate ~i & (i + 1) and use the result as a lookup in a table with 32 entries. 1 means zero 1s, 2 means one 1, 4 means two 1s, and so on, except that 0 means 32 1s.
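The answer does not say how to turn the isolated bit into a table index. One known option, which is an assumption here and not necessarily what the answer had in mind, is to take the value modulo 37: every power of two from 2^0 to 2^31 has a distinct remainder, so the table needs 37 slots instead of 32. The sketch below builds the table at startup; make_table and trailing_ones_table are made-up names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Build the lookup table: slot (2^k mod 37) holds k, for k = 0..31.
std::array<std::uint8_t, 37> make_table() {
    std::array<std::uint8_t, 37> table{};
    for (unsigned k = 0; k < 32; ++k) {
        const std::uint32_t slot = (std::uint32_t{1} << k) % 37;
        assert(slot != 0);  // a power of two is never a multiple of 37
        table[slot] = static_cast<std::uint8_t>(k);
    }
    table[0] = 32;  // ~i & (i + 1) is 0 only when all 32 bits of i are set
    return table;
}

int trailing_ones_table(std::uint32_t i) {
    static const std::array<std::uint8_t, 37> table = make_table();
    const std::uint32_t isolated = ~i & (i + 1);  // one bit, just above the trailing ones
    return table[isolated % 37];
}
```

The modulo costs a division, so whether this beats the other approaches on a given CPU would need measuring.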
Taking the answer from Ignacio Vazquez-Abrams and completing it with the count rather than a table:
b = ~i & (i+1); // this gives a 1 to the left of the trailing 1's
b--; // this gets us just the trailing 1's that need counting
b = (b & 0x55555555) + ((b>>1) & 0x55555555); // 2 bit sums of 1 bit numbers
b = (b & 0x33333333) + ((b>>2) & 0x33333333); // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f) + ((b>>4) & 0x0f0f0f0f); // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff) + ((b>>8) & 0x00ff00ff); // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff) + ((b>>16) & 0x0000ffff); // sum of 16 bit numbers
at the end b will contain the count of 1's (the masks, adding and shifting count the 1's).
Unless I goofed of course. Test before use.
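Wrapped into a function, the same steps might look like the sketch below. The function name is made up, and it assumes a 32-bit unsigned input, as in the question.

```cpp
#include <cstdint>

// Count trailing 1 bits: isolate them, then add them up in parallel.
int trailing_ones_popcount(std::uint32_t i) {
    std::uint32_t b = ~i & (i + 1);  // a single 1 just left of the trailing 1's
    b--;                             // 1's exactly where the trailing 1's were
    b = (b & 0x55555555u) + ((b >> 1) & 0x55555555u);   // 2-bit sums
    b = (b & 0x33333333u) + ((b >> 2) & 0x33333333u);   // 4-bit sums
    b = (b & 0x0f0f0f0fu) + ((b >> 4) & 0x0f0f0f0fu);   // 8-bit sums
    b = (b & 0x00ff00ffu) + ((b >> 8) & 0x00ff00ffu);   // 16-bit sums
    b = (b & 0x0000ffffu) + ((b >> 16) & 0x0000ffffu);  // final sum
    return static_cast<int>(b);
}
```

When every bit of i is set, ~i & (i + 1) is 0, the decrement wraps to all ones, and the count comes out as 32.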
The Bit Twiddling Hacks page has a number of algorithms for counting trailing zeros. Any of them can be adapted by simply inverting your number first, and there are probably clever ways to alter the algorithms in place without doing that as well. On a modern CPU with cheap floating point operations the best is probably thus:
unsigned int v=~input; // find the number of trailing ones in input
int r; // the result goes here
float f = (float)(v & -v); // cast the least significant bit in v to a float
r = (*(uint32_t *)&f >> 23) - 0x7f;
if(r==-127) r=32;
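Reading a float through a uint32_t pointer, as in the snippet above, is not allowed by the C++ aliasing rules. A variant that copies the bits with std::memcpy instead might look like this sketch; the function name is made up, and it assumes float is an IEEE 754 32-bit type.

```cpp
#include <cstdint>
#include <cstring>

// Trailing 1 bits via the float exponent; assumes IEEE 754 binary32 floats.
int trailing_ones_float(std::uint32_t input) {
    const std::uint32_t v = ~input;
    const std::uint32_t lowest = v & (0u - v);  // isolate the lowest set bit of v
    if (lowest == 0) return 32;                 // input was all ones
    const float f = static_cast<float>(lowest); // exact, since it is a power of two
    std::uint32_t bits;
    static_assert(sizeof bits == sizeof f, "float must be 32 bits wide");
    std::memcpy(&bits, &f, sizeof bits);        // well-defined way to read the bits
    return static_cast<int>(bits >> 23) - 127;  // remove the exponent bias
}
```

The sign bit is always zero here, so shifting right by 23 leaves just the biased exponent.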
GCC has __builtin_ctz and other compilers have their own intrinsics. Just protect it with an #ifdef:
#ifdef __GNUC__
int trailingones( uint32_t in ) {
return ~ in == 0? 32 : __builtin_ctz( ~ in );
}
#else
// portable implementation
#endif
On x86, this builtin will compile to one very fast instruction. Other platforms might be somewhat slower, but most have some kind of bit-counting functionality that will beat what you can do with pure C operators.
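The "portable implementation" branch could be filled with a simple loop, for example as in the sketch below. The function name is made up to avoid clashing with the one above, and the fallback is the plain linear search, not a fast one.

```cpp
#include <cstdint>

int trailing_ones_portable(std::uint32_t in) {
#if defined(__GNUC__)
    // __builtin_ctz is undefined for 0, so handle the all-ones input first.
    return ~in == 0 ? 32 : __builtin_ctz(~in);
#else
    // Fallback: shift out set bits one at a time.
    int count = 0;
    while (in & 1u) {
        in >>= 1;
        ++count;
    }
    return count;
#endif
}
```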
There may be better answers available, particularly if assembler isn't out of the question, but one viable solution would be to use a lookup table. It would have 256 entries, each returning the number of contiguous trailing 1 bits. Apply it to the lowest byte. If it's 8, apply to the next and keep count.
Implementing Steven Sudit's idea...
uint32_t n; // input value
uint8_t o = 0; // number of trailing one bits in n (must start at zero)
uint8_t trailing_ones[256] = {
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 8};
uint8_t t;
do {
    t=trailing_ones[n&255];
    o+=t;
} while(t==8 && (n>>=8));
Cost: between 1 (best) and 4 (worst) loop iterations, about 1.004 on average, where each iteration is 1 lookup + 1 comparison + 3 arithmetic operations; the total is one arithmetic operation less than that.
This code counts the number of trailing zero bits, taken from here (there's also a version that depends on the IEEE 32 bit floating point representation, but I wouldn't trust it, and the modulus/division approaches look really slick - also worth a try):
int CountTrailingZeroBits(unsigned int v) // 32 bit
{
unsigned int c = 32; // c will be the number of zero bits on the right
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
static const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers
for (int i = 4; i >= 0; --i) // unroll for more speed
{
if (v & B[i])
{
v <<= S[i];
c -= S[i];
}
}
if (v)
{
c--;
}
return c;
}
and then to count trailing ones:
int CountTrailingOneBits(unsigned int v)
{
return CountTrailingZeroBits(~v);
}
http://graphics.stanford.edu/~seander/bithacks.html might give you some inspiration.
Implementation based on Ignacio Vazquez-Abrams's answer
uint8_t trailing_ones(uint32_t i) {
return log2(~i & (i + 1));
}
Implementation of log2() is left as an exercise for the reader (see here)
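Since the argument is always a power of two (or zero), an integer log2 is enough. A possible sketch follows; log2_pow2 and trailing_ones_log2 are made-up names, and the all-ones input is handled separately because the isolated bit is then 0.

```cpp
#include <cassert>
#include <cstdint>

// log2 of a value known to have exactly one bit set.
int log2_pow2(std::uint32_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    int result = 0;
    while (x >>= 1) ++result;
    return result;
}

int trailing_ones_log2(std::uint32_t i) {
    const std::uint32_t isolated = ~i & (i + 1);  // 0 when all 32 bits of i are set
    return isolated == 0 ? 32 : log2_pow2(isolated);
}
```

The loop makes this a linear search again, so it only illustrates what the log2 call has to compute.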
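Taking @phkahler's answer, you can define the following preprocessor macro:
#define trailing_ones(x) __builtin_ctz(~(x) & ((x) + 1))
Since this leaves a single one immediately to the left of all the trailing ones, you can simply count its trailing zeros. Note that __builtin_ctz is undefined for an argument of 0, which is what you get when every bit of x is set.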
Blazingly fast ways to find the number of trailing 0's are given in Hacker's Delight.
You could complement your integer (or more generally, word) to find the number of trailing 1's.
I have this sample for you :
#include <stdio.h>
int trailbits ( unsigned int bits, bool zero )
{
    int bitsize = sizeof(int) * 8;
    int trail = 0;
    unsigned int compbits = bits;
    // to count trailing zeros, complement and count trailing ones
    if ( zero ) compbits = ~bits;
    // stop at the first 0 bit, or after all bits have been examined
    for ( ; bitsize && ( compbits & 0x01 ); bitsize-- )
    {
        trail++;
        compbits = compbits >> 1;
    }
    return trail;
}
void PrintBits ( unsigned int bits )
{
unsigned int pbit = 0x80000000;
for ( int len=0 ; len<32; len++ )
{
printf ( "%c ", pbit & bits ? '1' : '0' );
pbit = pbit >> 1;
}
printf ( "\n" );
}
int main(void)
{
unsigned int forbyte = 0x0CC00990;
PrintBits ( forbyte );
printf ( "Trailing ones is %d\n", trailbits ( forbyte, false ));
printf ( "Trailing zeros is %d\n", trailbits ( forbyte, true ));
}