Eigen rowwise cross product between arrays - c++

I have two Eigen::ArrayX3d objects, that's N rows and 3 columns. To make this concrete, the first array consists of 3d velocities of N particles. The other one consists of magnetic field vectors at the position of each of the particles. I'm trying to compute the Lorentz force, v x B - this means I have to take each pair of rows and compute the cross product. In Python, this would mean simply doing numpy.cross(v, B).
I'm trying to figure out how to do this in Eigen and failing hard. It seems as though cross is defined for Matrix and Vectors only, but it doesn't really make sense to me to keep my data as a Matrix (though I'm of course open to suggestions).
Is there any reasonable way to perform this operation? I'd be very grateful for any pointers.
This setup is a good example:
ArrayX3d a(4,3);
ArrayX3d b(4,3);
a <<1,0,0,
0,1,0,
0,0,1,
1,0,0;
b <<0,1,0,
0,0,1,
1,0,0,
0,1,0;
A successful application of the a x b operation should just shift the 1 in each row of b one place to the right, wrapping around at the end of the row.

I can get the result row by row using either a matrix or an array. With a matrix:
MatrixX3d a(4, 3);
MatrixX3d b(4, 3);
a << 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0;
b << 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0;
for(int i = 0; i < a.rows(); i++){
cout << a.row(i).cross(b.row(i)) << endl;
}
With an array:
ArrayX3d a(4, 3);
ArrayX3d b(4, 3);
a << 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0;
b << 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0;
for(int i = 0; i < a.rows(); i++){
cout << a.row(i).matrix().cross(b.matrix().row(i)) << endl;
}
The output:
0 0 1
1 0 0
0 1 0
0 0 1
This result could be saved into a matrix or array for each row.
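For reference, what each per-row cross product computes can be written out component by component. The following is only a minimal sketch in plain C++ without Eigen; the Vec3 alias and the rowwiseCross name are assumptions made for illustration and are not part of Eigen.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;  // hypothetical alias for one row (x, y, z)

// Row-wise cross product: out[i] = a[i] x b[i] for every row both inputs share.
std::vector<Vec3> rowwiseCross(const std::vector<Vec3>& a, const std::vector<Vec3>& b) {
    std::vector<Vec3> out(a.size() < b.size() ? a.size() : b.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = Vec3{a[i][1] * b[i][2] - a[i][2] * b[i][1],   // x component
                      a[i][2] * b[i][0] - a[i][0] * b[i][2],   // y component
                      a[i][0] * b[i][1] - a[i][1] * b[i][0]};  // z component
    }
    return out;
}
```

With Eigen arrays, the same three formulas can presumably be applied to whole columns at once, for example a.col(1) * b.col(2) - a.col(2) * b.col(1) for the first component, since Array products are coefficient-wise.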

How can I make something happen x percentage?

I have to write a piece of code in the form of c*b, where c and b are random numbers and the product is smaller than INT_MAX. But b or c has to be equal to 0 10% of the time, and I don't know how to do that.
srand ( time(NULL) );
int product = b*c;
c = rand() % 10000;
b = rand() % INT_MAX/c;
b*c < INT_MAX;
cout<<""<<endl;
cout << "What is " << c << "x" << b << "?"<<endl;
cin >> guess;
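You can use std::piecewise_constant_distribution:
std::random_device rd;
std::mt19937 gen(rd());
// Interval boundaries must be strictly increasing:
// [0, 1) gets weight 0.1 and [1, Max) gets weight 0.9.
double interval[] = {0, 1, Max};
double weights[] = { .10, .90};
std::piecewise_constant_distribution<> dist(std::begin(interval),
std::end(interval),
weights);
// Truncating to int maps every draw from [0, 1) to 0, i.e. about 10% of the draws.
int value = static_cast<int>(dist(gen));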
An int is always less than or equal to INT_MAX, so the only real constraint is that the multiplication itself must not overflow. You can therefore multiply a random boolean variable that is true with 90% probability by the product of two uniformly distributed integers, each capped at 46340 (the largest value whose square still fits in a 32-bit int):
std::random_device rd;
std::mt19937 generator(rd());
std::uniform_int_distribution<int> uniform(0, 46340); // 46340 * 46340 < INT_MAX for a 32-bit int
std::bernoulli_distribution bernoulli(0.9); // 90% 1 ; 10% 0
const int product = bernoulli(generator) * uniform(generator) * uniform(generator);
If you have specific limits in mind, say N for the individual numbers and M for the product of the two numbers, you can do:
std::default_random_engine generator;
std::uniform_int_distribution<int> uniform(0,N);
std::bernoulli_distribution bernoulli(0.9); // 90% 1 ; 10% 0
int product;
do { product = bernoulli(generator) * uniform(generator) * uniform(generator); }
while(!(product<M));
edit: std::piecewise_constant_distribution is more elegant, didn't know about it until I read the other answer.
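As a rough sketch of how the pieces could fit the original question (two factors whose product stays within INT_MAX, with one factor forced to 0 about 10% of the time), something like the following might work. The makeFactors name and the 1 to 9999 range for c are assumptions for illustration only.

```cpp
#include <climits>
#include <random>
#include <utility>

// Returns (c, b) with c * b <= INT_MAX; one of the two is set to 0 about 10% of the time.
std::pair<int, int> makeFactors(std::mt19937& gen) {
    std::uniform_int_distribution<int> pickC(1, 9999);          // assumed range for c
    int c = pickC(gen);
    std::uniform_int_distribution<int> pickB(1, INT_MAX / c);   // bound b so c * b cannot overflow
    int b = pickB(gen);

    std::bernoulli_distribution zeroOut(0.1);   // true about 10% of the time
    std::bernoulli_distribution coin(0.5);      // decides which factor becomes 0
    if (zeroOut(gen)) {
        if (coin(gen)) c = 0; else b = 0;
    }
    return std::make_pair(c, b);
}
```

Because b is drawn from 1 to INT_MAX / c, the product never needs a retry loop, and the 10% rate is controlled by the Bernoulli draw alone.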
If you want a portable solution that does not depend on the C++11 random library, and that is also faster and perhaps simpler to understand, you can use the following snippet. The variable random_sequence is a pre-generated array in which 0 occurs 10% of the time. The variables runs and len are used to index into this array as an endless sequence. This is a simplistic solution, however, since the pattern repeats after 90 runs. If you don't care about the pattern repeating, this method works fine.
int runs = 0;
int len = 90; //The length of the random sequence.
int random_sequence[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
int coefficient = random_sequence[runs % len];
runs++;
Then, for whichever variable should be 0 10% of the time, do this:
float b = coefficient * rand();
or
float c = coefficient * rand();
If you want both variables to be 0 10% of the time individually, then it's like this:
float b = coefficient * rand();
coefficient = random_sequence[runs % len];
float c = coefficient * rand();
And if you want them to be 0 10% of the time jointly, then the random_sequence array must be like this:
int len = 80;
int random_sequence[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
And use
float b = coefficient * rand();
float c = coefficient * rand();
I gave it a shot here: http://codepad.org/DLbVfNVQ (the average value is somewhere in the neighborhood of 0.4). -CR

C++: Why my recursion trims my array rather than perform the intended purpose that it should recursively fill an array with values?

While I was doing the C++ test, I had a problem asking me to recursively fill in an array with the values falling in the range in which the lowest value and highest value are randomly generated. Here is my code of the recursive function:
int * recursivelyFillTheArray(int &arrLength, int &minValue, int &maxValue, int *arrToFill){
if (arrLength == 0) {
return arrToFill;
} else {
arrToFill[arrLength - 1] = rand() % (abs(minValue) + maxValue) - abs(minValue);
arrLength -= 1;
return recursivelyFillTheArray(arrLength, minValue, maxValue, arrToFill);
}
}
However, the returned output only shows an array of length 2, and the second value is always 0, like [-1,0] or [4,0].
I then added a printArray() call inside recursivelyFillTheArray() and, surprisingly, found that the function actually trims my array on each call rather than filling it with values. Like:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0]
[0, 0]
[0, 0]
[-8, 0]
I cannot figure out why. Here is the complete code with the debug output:
https://ideone.com/quszeM
Could you please help me see the reason and how to fix it?
Thanks a lot!
The main problem is that you're modifying arrLength - don't pass by reference unnecessarily.
You're also filling the array "backwards", with decreasing indices, but printing it "forwards", so you stop printing immediately before the value you just generated.
(As a bonus, printing those array elements is undefined since you never initialized the array. If you initialize it with 999, you'll see a whole lot of 999s.)
Perhaps this gets clearer if you replace the function call in main with the equivalent loop:
while (arrLength > 0) {
arrToFill[arrLength - 1] = rand() % (abs(minValue) + maxValue) - abs(minValue);
arrLength -= 1;
printArray(theArray, arrLength);
}
There doesn't seem to be any point in returning the arrToFill parameter, so something like this perhaps:
void recursivelyFillTheArray(int arrLength, int minValue, int maxValue, int *arrToFill){
if (arrLength > 0) {
arrToFill[arrLength - 1] = rand() % (abs(minValue) + maxValue) - abs(minValue);
recursivelyFillTheArray(arrLength - 1, minValue, maxValue, arrToFill);
}
}
or if you want a "forward fill",
void recursivelyFillTheArray(int arrLength, int minValue, int maxValue, int *arrToFill){
if (arrLength > 0) {
arrToFill[0] = rand() % (abs(minValue) + maxValue) - abs(minValue);
recursivelyFillTheArray(arrLength - 1, minValue, maxValue, arrToFill + 1);
}
}
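For comparison, here is a hedged sketch of the forward-fill variant that swaps rand() % ... for std::uniform_int_distribution, which draws from the closed range [minValue, maxValue] and copes with negative bounds directly. The fillRecursively name and the generator parameter are assumptions for illustration, not part of the original code.

```cpp
#include <random>

// Fills arr[0 .. length-1] with values in [minValue, maxValue], one element per call.
// All parameters except the generator are passed by value, so the caller's length is untouched.
void fillRecursively(int* arr, int length, int minValue, int maxValue, std::mt19937& gen) {
    if (length <= 0) return;                       // base case: nothing left to fill
    std::uniform_int_distribution<int> dist(minValue, maxValue);
    arr[0] = dist(gen);                            // fill the first remaining slot
    fillRecursively(arr + 1, length - 1, minValue, maxValue, gen);  // recurse on the rest
}
```

Passing arr + 1 together with length - 1 keeps the "remaining part of the array" in one place, so the function needs no separate index.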

Defining 2D arrays in C++

int train [4] [3] = { 0, 0, 0,
0, 1, 0,
1, 0, 0,
1, 1, 1 };
Is that a valid initialization of a 2D array in C++?
And will the rows be (0,0,0) (row 1), (0,1,0) (row 2), (1,0,0) (row 3) and (1,1,1) (row 4)?
And is it equivalent to
int train [4] [3] = {{0, 0, 0},
{0, 1, 0},
{1, 0, 0},
{1, 1, 1}};
int train [4] [3] = { 0, 0, 0,
0, 1, 0,
1, 0, 0,
1, 1, 1 };
is a valid initialization of a 2D array in C++.
From the C++11 Standard:
8.5.1 Aggregates, paragraph 10:
When initializing a multi-dimensional array, the initializer-clauses initialize the elements with the last (right-most) index of the array varying the fastest (8.3.4). [ Example:
int x[2][2] = { 3, 1, 4, 2 };
initializes x[0][0] to 3, x[0][1] to 1, x[1][0] to 4, and x[1][1] to 2. On the other hand,
float y[4][3] = {
{ 1 }, { 2 }, { 3 }, { 4 }
};
initializes the first column of y (regarded as a two-dimensional array) and leaves the rest zero. — end example ]
Yes! It is a valid initialization in C++.
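One way to convince yourself that the two forms are equivalent is to compare the resulting arrays directly. This is only a small sketch; the flatMatchesNested name is made up for illustration.

```cpp
#include <cstring>

// Returns true when the flat and the nested initializer lists produce identical arrays.
bool flatMatchesNested() {
    int flat[4][3] = { 0, 0, 0,
                       0, 1, 0,
                       1, 0, 0,
                       1, 1, 1 };
    int nested[4][3] = {{0, 0, 0},
                        {0, 1, 0},
                        {1, 0, 0},
                        {1, 1, 1}};
    // Both arrays occupy 12 contiguous ints, so a bytewise comparison is sufficient.
    return std::memcmp(flat, nested, sizeof flat) == 0;
}
```

Some compilers may warn about the missing inner braces in the flat form, but the initialization itself is well-formed.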

Parallel algorithm that does a small insertion/shifting

Say I have an array A of 8 numbers, and another array B that determines how many places each number in A should be shifted to the right:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means the number stays where it is; 1 means the number should end up one place further to the right. The output array should therefore have a 0 inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether a 0 or something else is inserted is not important; the point is that all numbers after the 3 get shifted by one place. Numbers shifted past the end are no longer in the array.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using a scan/prefix sum or something similar to solve this problem. The array is also small enough to fit in one warp (<32 numbers), so I should be able to use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (0, 1, 0, 1, 0, 1, 1, 1 and 0, 1, 0 ,0 all produce the same data offset pattern, for example) it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern will be created if each zero in the shift pattern gets replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value, where each bit is one if the value at the corresponding index into the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane will now take this value, and add a value to it which has a 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2, and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this), and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction.
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
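To make the bit manipulation easier to follow, the same steps can be emulated on the host in plain C++ (not CUDA). This is only an illustrative sketch: deltaToNonZero and shiftData are hypothetical names, std::bitset stands in for __popc(), and an ordinary array lookup stands in for the warp shuffle.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Steps 1-4 for one lane. zeroMask has bit i set when lane i holds a zero shift value.
// Returns the distance from `lane` up to the nearest lane holding a non-zero shift.
unsigned deltaToNonZero(unsigned zeroMask, unsigned lane) {
    unsigned carried = zeroMask + (1u << lane);   // the carry ripples through the run of 1 bits
    return static_cast<unsigned>(std::bitset<32>(zeroMask ^ carried).count()) - 1u;
}

// Host-side emulation of the whole procedure for at most 32 items.
std::vector<int> shiftData(const std::vector<int>& data, const std::vector<unsigned>& shift) {
    const std::size_t n = data.size();
    unsigned zeroMask = 0;
    for (std::size_t lane = 0; lane < n; ++lane)           // lane i holds shift[n-1-i]
        if (shift[n - 1 - lane] == 0) zeroMask |= 1u << lane;

    std::vector<int> result(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        unsigned lane = static_cast<unsigned>(n - 1 - i);
        std::size_t src = lane + deltaToNonZero(zeroMask, lane);
        // Step 5: fetch the offset from the source lane; 0 if there is no non-zero shift to the left.
        unsigned offset = (src < n) ? shift[n - 1 - src] : 0u;
        if (i + offset < n) result[i + offset] = data[i];  // items shifted past the end are dropped
    }
    return result;
}
```

For the ballot value 1 0 1 1 0 1 1 1 used in the steps above, deltaToNonZero(0xB7, 1) evaluates to 2, matching the lane 1 walkthrough.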
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#include <stdlib.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ int mydelta(const int shift){
unsigned nz = __ballot(shift == 0);
unsigned mylane = (threadIdx.x & 31);
unsigned lanebit = 1<<mylane;
unsigned temp = nz + lanebit;
temp = nz ^ temp;
unsigned delta = __popc(temp);
return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
if (threadIdx.x < limit){
unsigned lshift = shift[(limit - 1) - threadIdx.x];
unsigned delta = mydelta(lshift);
unsigned myshift = __shfl_down(lshift, delta);
myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
result[threadIdx.x] = 0;
if ((myshift + threadIdx.x) < limit)
result[threadIdx.x + myshift] = data[threadIdx.x];
}
}
int main(){
int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
int *d_data, *d_result, *h_result;
unsigned *d_shift;
h_result = (int *)malloc(DSIZE*sizeof(int));
if (h_result == NULL) { printf("malloc fail\n"); return 1;}
cudaMalloc(&d_data, DSIZE*sizeof(int));
cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
cudaMalloc(&d_result, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("index: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", i);
printf("\nA: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", A[i]);
printf("\ntc1 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc1B[i]);
printf("\ntc1 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc2 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc2B[i]);
printf("\ntc2 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc3 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc3B[i]);
printf("\ntc3 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
printf("\n");
return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$

Getting the number of trailing 1 bits

Are there any efficient bitwise operations I can do to get the number of set bits that an integer ends with? For example, 11 (decimal) = 1011 (binary) would be two trailing 1 bits, and 8 (decimal) = 1000 (binary) would be 0 trailing 1 bits.
Is there a better algorithm for this than a linear search? I'm implementing a randomized skip list and using random numbers to determine the maximum level of an element when inserting it. I am dealing with 32 bit integers in C++.
Edit: assembler is out of the question, I'm interested in a pure C++ solution.
Calculate ~i & (i + 1) and use the result as a lookup in a table with 32 entries. 1 means zero 1s, 2 means one 1, 4 means two 1s, and so on, except that 0 means 32 1s.
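One possible way to index such a 32-entry table is the de Bruijn multiplication described on the Bit Twiddling Hacks page. The sketch below assumes that technique; the answer above does not say how the lookup should be done, and the trailingOnes name and the table layout are illustrative choices.

```cpp
#include <cstdint>

// Counts the trailing 1 bits of i using ~i & (i + 1) and a 32-entry lookup table.
int trailingOnes(std::uint32_t i) {
    // Maps the top 5 bits of (power of two * 0x077CB531) to the position of that power of two.
    static const int position[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
    std::uint32_t isolated = ~i & (i + 1u);   // single 1 bit just above the trailing ones
    if (isolated == 0) return 32;             // every bit of i is set
    return position[static_cast<std::uint32_t>(isolated * 0x077CB531u) >> 27];
}
```

The multiplication is done in unsigned arithmetic, so the wraparound is well defined; the explicit check for 0 covers the all-ones input mentioned above.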
Taking the answer from Ignacio Vazquez-Abrams and completing it with the count rather than a table:
b = ~i & (i+1); // this gives a 1 to the left of the trailing 1's
b--; // this gets us just the trailing 1's that need counting
b = (b & 0x55555555) + ((b>>1) & 0x55555555); // 2 bit sums of 1 bit numbers
b = (b & 0x33333333) + ((b>>2) & 0x33333333); // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f) + ((b>>4) & 0x0f0f0f0f); // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff) + ((b>>8) & 0x00ff00ff); // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff) + ((b>>16) & 0x0000ffff); // sum of 16 bit numbers
at the end b will contain the count of 1's (the masks, adding and shifting count the 1's).
Unless I goofed of course. Test before use.
The Bit Twiddling Hacks page has a number of algorithms for counting trailing zeros. Any of them can be adapted by simply inverting your number first, and there are probably clever ways to alter the algorithms in place without doing that as well. On a modern CPU with cheap floating point operations the best is probably thus:
unsigned int v=~input; // find the number of trailing ones in input
int r; // the result goes here
float f = (float)(v & -v); // cast the least significant bit in v to a float
r = (*(uint32_t *)&f >> 23) - 0x7f;
if(r==-127) r=32;
GCC has __builtin_ctz and other compilers have their own intrinsics. Just protect it with an #ifdef:
#ifdef __GNUC__
int trailingones( uint32_t in ) {
return ~ in == 0? 32 : __builtin_ctz( ~ in );
}
#else
// portable implementation
#endif
On x86, this builtin will compile to one very fast instruction. Other platforms might be somewhat slower, but most have some kind of bit-counting functionality that will beat what you can do with pure C operators.
There may be better answers available, particularly if assembler isn't out of the question, but one viable solution would be to use a lookup table. It would have 256 entries, each returning the number of contiguous trailing 1 bits. Apply it to the lowest byte. If it's 8, apply to the next and keep count.
Implementing Steven Sudit's idea...
uint32_t n; // input value
uint8_t o = 0; // number of trailing one bits in n (must start at zero)
uint8_t trailing_ones[256] = {
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 8};
uint8_t t;
do {
t=trailing_ones[n&255];
o+=t;
} while(t==8 && (n>>=8));
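Cost: the loop runs 1 (best) to 4 (worst) times, about 1.004 times on average for uniformly random input. Each iteration costs 1 lookup, 1 comparison, and 3 arithmetic operations, except that the final iteration skips one arithmetic operation.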
This code counts the number of trailing zero bits, taken from here (there's also a version that depends on the IEEE 32 bit floating point representation, but I wouldn't trust it, and the modulus/division approaches look really slick - also worth a try):
int CountTrailingZeroBits(unsigned int v) // 32 bit
{
unsigned int c = 32; // c will be the number of zero bits on the right
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
static const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers
for (int i = 4; i >= 0; --i) // unroll for more speed
{
if (v & B[i])
{
v <<= S[i];
c -= S[i];
}
}
if (v)
{
c--;
}
return c;
}
and then to count trailing ones:
int CountTrailingOneBits(unsigned int v)
{
return CountTrailingZeroBits(~v);
}
http://graphics.stanford.edu/~seander/bithacks.html might give you some inspiration.
Implementation based on Ignacio Vazquez-Abrams's answer
uint8_t trailing_ones(uint32_t i) {
return log2(~i & (i + 1));
}
Implementation of log2() is left as an exercise for the reader (see here)
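Taking @phkahler's answer you can define the following preprocessor macro:
#define trailing_ones(x) __builtin_ctz(~(x) & ((x) + 1))
Since the expression leaves a single 1 bit immediately to the left of the trailing ones, you can simply count its trailing zeros. Note that the result of __builtin_ctz is undefined when its argument is 0, which happens here when every bit of x is set.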
Blazingly fast ways to find the number of trailing 0's are given in Hacker's Delight.
You could complement your integer (or more generally, word) to find the number of trailing 1's.
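I have this sample for you:
#include <stdio.h>
// Note: as written, this counts the runs of two or more consecutive 1 bits
// (or 0 bits when zero is true) anywhere in the word, not just the trailing run.
int trailbits ( unsigned int bits, bool zero )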
{
int bitsize = sizeof(int) * 8;
int len = 0;
int trail = 0;
unsigned int compbits = bits;
if ( zero ) compbits = ~bits;
for ( ; bitsize; bitsize-- )
{
if ( compbits & 0x01 ) trail++;
else
{
if ( trail > 1 ) len++;
trail = 0;
}
compbits = compbits >> 1;
}
if ( trail > 1 ) len++;
return len;
}
void PrintBits ( unsigned int bits )
{
unsigned int pbit = 0x80000000;
for ( int len=0 ; len<32; len++ )
{
printf ( "%c ", pbit & bits ? '1' : '0' );
pbit = pbit >> 1;
}
printf ( "\n" );
}
int main(void)
{
unsigned int forbyte = 0x0CC00990;
PrintBits ( forbyte );
printf ( "Trailing ones is %d\n", trailbits ( forbyte, false ));
printf ( "Trailing zeros is %d\n", trailbits ( forbyte, true ));
}