I have written simulation software for highly parallelized execution, using MPI for internode and threads for intranode parallelization, so that shared memory reduces the memory footprint where possible. (The largest data structures are mostly read-only, so I can easily manage thread safety.)
Although my program works fine (finally), I am having second thoughts about whether this approach is really best, mostly because managing two types of parallelization requires some messy asynchronous code here and there.
I found a paper (pdf draft) introducing a shared memory extension to MPI, allowing the use of shared data structures within MPI parallelization on a single node.
I am not very experienced with MPI, so my question is: Is this possible with recent standard Open MPI implementations and where can I find an introduction / tutorial on how to do it?
Note that I am not asking how message passing is accomplished with shared memory; I know that MPI does that. I would like to (read-)access the same object in memory from multiple MPI processes.
This can be done. Here is test code that sets up a small table on each shared-memory node. Only one process (node rank 0) actually allocates and initialises the table, but all processes on a node can read it. (Apologies for the formatting; it seems to be a space/tab issue.)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(void)
{
int i, flag;
int nodesize, noderank;
int size, rank, irank;
int tablesize, localtablesize;
int *table, *localtable;
int *model;
MPI_Comm allcomm, nodecomm;
char verstring[MPI_MAX_LIBRARY_VERSION_STRING];
char nodename[MPI_MAX_PROCESSOR_NAME];
MPI_Aint winsize;
int windisp;
int *winptr;
int version, subversion, verstringlen, nodestringlen;
allcomm = MPI_COMM_WORLD;
MPI_Win wintable;
tablesize = 5;
MPI_Init(NULL, NULL);
MPI_Comm_size(allcomm, &size);
MPI_Comm_rank(allcomm, &rank);
MPI_Get_processor_name(nodename, &nodestringlen);
MPI_Get_version(&version, &subversion);
MPI_Get_library_version(verstring, &verstringlen);
if (rank == 0)
{
printf("Version %d, subversion %d\n", version, subversion);
printf("Library <%s>\n", verstring);
}
// Create node-local communicator
MPI_Comm_split_type(allcomm, MPI_COMM_TYPE_SHARED, rank,
MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nodesize);
MPI_Comm_rank(nodecomm, &noderank);
// Only rank 0 on a node actually allocates memory
localtablesize = 0;
if (noderank == 0) localtablesize = tablesize;
// debug info
printf("Rank %d of %d, rank %d of %d in node <%s>, localtablesize %d\n",
rank, size, noderank, nodesize, nodename, localtablesize);
MPI_Win_allocate_shared(localtablesize*sizeof(int), sizeof(int),
MPI_INFO_NULL, nodecomm, &localtable, &wintable);
MPI_Win_get_attr(wintable, MPI_WIN_MODEL, &model, &flag);
if (1 != flag)
{
printf("Attribute MPI_WIN_MODEL not defined\n");
}
else
{
if (MPI_WIN_UNIFIED == *model)
{
if (rank == 0) printf("Memory model is MPI_WIN_UNIFIED\n");
}
else
{
if (rank == 0) printf("Memory model is *not* MPI_WIN_UNIFIED\n");
MPI_Finalize();
return 1;
}
}
// need to get local pointer valid for table on rank 0
table = localtable;
if (noderank != 0)
{
MPI_Win_shared_query(wintable, 0, &winsize, &windisp, &table);
}
// All table pointers should now point to copy on noderank 0
// Initialise table on rank 0 with appropriate synchronisation
MPI_Win_fence(0, wintable);
if (noderank == 0)
{
for (i=0; i < tablesize; i++)
{
table[i] = rank*tablesize + i;
}
}
MPI_Win_fence(0, wintable);
// Check we did it right
for (i=0; i < tablesize; i++)
{
printf("rank %d, noderank %d, table[%d] = %d\n",
rank, noderank, i, table[i]);
}
// Release the shared window before shutting MPI down
MPI_Win_free(&wintable);
MPI_Finalize();
}
Here is some sample output for 6 processes across two nodes:
Version 3, subversion 1
Library <SGI MPT 2.14 04/05/16 03:53:22>
Rank 3 of 6, rank 0 of 3 in node <r1i0n1>, localtablesize 5
Rank 4 of 6, rank 1 of 3 in node <r1i0n1>, localtablesize 0
Rank 5 of 6, rank 2 of 3 in node <r1i0n1>, localtablesize 0
Rank 0 of 6, rank 0 of 3 in node <r1i0n0>, localtablesize 5
Rank 1 of 6, rank 1 of 3 in node <r1i0n0>, localtablesize 0
Rank 2 of 6, rank 2 of 3 in node <r1i0n0>, localtablesize 0
Memory model is MPI_WIN_UNIFIED
rank 3, noderank 0, table[0] = 15
rank 3, noderank 0, table[1] = 16
rank 3, noderank 0, table[2] = 17
rank 3, noderank 0, table[3] = 18
rank 3, noderank 0, table[4] = 19
rank 4, noderank 1, table[0] = 15
rank 4, noderank 1, table[1] = 16
rank 4, noderank 1, table[2] = 17
rank 4, noderank 1, table[3] = 18
rank 4, noderank 1, table[4] = 19
rank 5, noderank 2, table[0] = 15
rank 5, noderank 2, table[1] = 16
rank 5, noderank 2, table[2] = 17
rank 5, noderank 2, table[3] = 18
rank 5, noderank 2, table[4] = 19
rank 0, noderank 0, table[0] = 0
rank 0, noderank 0, table[1] = 1
rank 0, noderank 0, table[2] = 2
rank 0, noderank 0, table[3] = 3
rank 0, noderank 0, table[4] = 4
rank 1, noderank 1, table[0] = 0
rank 1, noderank 1, table[1] = 1
rank 1, noderank 1, table[2] = 2
rank 1, noderank 1, table[3] = 3
rank 1, noderank 1, table[4] = 4
rank 2, noderank 2, table[0] = 0
rank 2, noderank 2, table[1] = 1
rank 2, noderank 2, table[2] = 2
rank 2, noderank 2, table[3] = 3
rank 2, noderank 2, table[4] = 4
Let's say I have a vector v with random 1 and 0.
std::vector<int> v = {1,0,1,0,0,1,0,1};
I want to find out the max sequence with the property v[i] != v[i-1]. Basically the numbers need to be different. In this example the max sequence is 4 (1, 0, 1, 0) from position v[0] to v[3]. There is also (0,1,0,1) from position v[4] to v[7]. There are 2 max sequences so the final output should look like this:
4 2
Where 4 is the length of the max sequence and 2 is the number of max sequences.
Let's take another example:
std::vector<int> v2 = {1,0,1,1,1,0,1,0,1,0};
The output here should be:
6 1
The max sequence starts from v[4] to v[9]. There is only one max sequence so it will print 1 this time.
I tried to solve this using a for loop:
n - number of integers in the vector
k - length of the current run of alternating integers
maxk - length of the max sequence
many - how many max sequences there are
for(int i{1}; i < n; i++) {
if(v[i] != v[i-1]) {
k++;
if(k > maxk) {
maxk = k;
}
}
else {
if(k == maxk) {
many++;
}
else {
many = 1;
}
k = 1;
}
}
But if you give it a vector like {1, 0, 0} it does not work. Can someone give me a tip on how to solve this problem? Sorry for my bad English.
First, sequence isn't the right word. A sequence can jump past elements. You mean a subarray.
Second, you talk about arrays with 0 and 1 in them, then give an example with 2. Do you want to not count subarrays with 2? Or count them? In other words, if the input is [1, 2, 2], are you expecting an answer of 1 1 or 2 1?
That said, just make an array of where the best current subarray begins. For your first example that array would look like this:
1, 0, 1, 0, 0, 1, 0, 1
0, 0, 0, 0, 4, 4, 4, 4
And then a linear scan finds that you have a group of 4 starting at index 0, and another group of 4 starting at index 4.
For your next example,
1, 0, 1, 1, 1, 0, 1, 0, 1, 0
0, 0, 0, 3, 4, 4, 4, 4, 4, 4
And you have a group of 3 starting at index 0, 1 starting at 3, and 6 starting at 4. So we've found the 1 group of 6.
For your last example, what you'd get would depend on the answer you want.
I'll leave coding this to you.
I have a matrix of size 8x8. I want to send the even rows (rows number 0, 2, 4, 6) to another process using MPI_Type_vector. I came up with this code:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define INITIATOR 0
#define SIZE 8
int main(int argc, char **argv) {
srand(time(NULL));
MPI_Init(&argc, &argv);
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Datatype MPI_EVEN_ROWS;
MPI_Type_vector(SIZE / 2, SIZE, SIZE * 2, MPI_INT, &MPI_EVEN_ROWS);
MPI_Type_commit(&MPI_EVEN_ROWS);
if (rank == INITIATOR) {
int a[SIZE][SIZE];
printf("Matrix: \n");
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
a[i][j] = rand() % 11;
printf("%d ", a[i][j]);
}
printf("\n");
}
MPI_Send(a, 1, MPI_EVEN_ROWS, 1, 0, MPI_COMM_WORLD);
} else {
int b[SIZE / 2][SIZE];
MPI_Recv(b, 1, MPI_EVEN_ROWS, INITIATOR, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Received matrix: \n");
for (int i = 0; i < SIZE / 2; i++) {
for (int j = 0; j < SIZE; j++) {
printf("%d ", b[i][j]);
}
printf("\n");
}
}
MPI_Finalize();
}
After executing the program I see that it is not working as expected. I receive the same matrix, but in place of rows 1, 3, 5, 7 there is random junk.
The example of output:
Matrix:
1 0 2 3 1 9 0 2
8 7 2 4 10 10 7 8
2 1 4 1 3 8 1 5
7 4 5 2 8 9 8 9
9 8 1 8 4 0 8 1
7 10 3 6 7 0 1 10
1 6 9 3 3 10 8 8
1 2 8 9 9 3 6 5
Received matrix:
1 0 2 3 1 9 0 2
1408046848 64 7 0 0 0 -81026544 21851
2 1 4 1 3 8 1 5
2 0 969750056 32581 972617984 32581 965895305 32581
I was pretty sure that I would receive only the even rows. Am I doing something wrong, or have I misunderstood how MPI_Type_vector works?
Oh, and I need this task to be done using MPI_Type_vector; I know that it can be done using MPI_Type_struct without those problems.
Thanks for your help in advance!
I know it's a little late, but
I had the same problem and got this advice: create a second datatype for the receive:
int size = 8; // Size of initial matrix 8x8
int sizeGet = 4; // size of the received even/odd-row matrix: 4x8
MPI_Datatype MPI_EVEN_ODD_ROWS;
MPI_Datatype MPI_EVEN_ODD_ROWS_RECIEVE;
MPI_Type_vector(sizeGet, size,size*2, MPI_DOUBLE, &MPI_EVEN_ODD_ROWS);
MPI_Type_vector(sizeGet, size, size , MPI_DOUBLE, &MPI_EVEN_ODD_ROWS_RECIEVE);
MPI_Type_commit(&MPI_EVEN_ODD_ROWS);
MPI_Type_commit(&MPI_EVEN_ODD_ROWS_RECIEVE);
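In process #0, we send the even and then the odd rows of matrix a (8x8); they end up in matrices b and c (4x8) on the receiver. (This assumes a is a flat pointer to the first element, so that a+8 is the start of row 1.)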
MPI_Send(a, 1, MPI_EVEN_ODD_ROWS,
1, 0, MPI_COMM_WORLD);
MPI_Send(a+8, 1, MPI_EVEN_ODD_ROWS, 1, 0, MPI_COMM_WORLD);
and receive the data in the other process like this:
MPI_Recv(b, 1, MPI_EVEN_ODD_ROWS_RECIEVE, 0,
0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(c, 1, MPI_EVEN_ODD_ROWS_RECIEVE, 0,
0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
This worked for me and I hope it will work for others
Cheers
I'm trying to find the maximum contiguous subarray with start and end index. The method I've adopted is divide-and-conquer, with O(nlogn) time complexity.
I have tested with several test cases, and the start and end index always work correctly. However, I found that if the array contains an odd number of elements, the maximum sum is sometimes correct and sometimes incorrect (seemingly at random). For arrays with an even number of elements, it is always correct. Here is my code:
int maxSubSeq(int A[], int n, int &s, int &e)
{
// s and e stands for start and end index respectively,
// and both are passed by reference
if(n == 1){
return A[0];
}
int sum = 0;
int midIndex = n / 2;
int maxLeftIndex = midIndex - 1;
int maxRightIndex = midIndex;
int leftMaxSubSeq = A[maxLeftIndex];
int rightMaxSubSeq = A[maxRightIndex];
int left = maxSubSeq(A, midIndex, s, e);
int right = maxSubSeq(A + midIndex, n - midIndex, s, e);
for(int i = midIndex - 1; i >= 0; i--){
sum += A[i];
if(sum > leftMaxSubSeq){
leftMaxSubSeq = sum;
s = i;
}
}
sum = 0;
for(int i = midIndex; i < n; i++){
sum += A[i];
if(sum > rightMaxSubSeq){
rightMaxSubSeq = sum;
e = i;
}
}
return max(max(leftMaxSubSeq + rightMaxSubSeq, left),right);
}
Below are two of the test cases I was working with; one has an odd number of elements, the other an even number.
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
Edit: The following are the 2 kinds of outputs:
// TEST 1
Test file : T2-Data-1.txt
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
maxSubSeq : A[3..7] = 32769 // Index is correct, but sum should be 20
Test file : T2-Data-2.txt
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
maxSubSeq : A[9..17] = 39 // correct
// TEST 2
Test file : T2-Data-1.txt
Array with 11 elements:
1, 3, -7, 9, 6, 3, -2, 4, -1, -9,
2,
maxSubSeq : A[3..7] = 20
Test file : T2-Data-2.txt
Array with 20 elements:
1, 3, 2, -2, 4, 5, -9, -4, -8, 6,
5, 9, 7, -1, 5, -2, 6, 4, -3, -1,
maxSubSeq : A[9..17] = 39
Can anyone point out why this is occurring? Thanks in advance!
Assuming that n is the correct size of your array (we see it being passed in as a parameter and later used to initialize midIndex, but we do not see the actual call and so must assume it is correct), the issue lies here:
int midIndex = n / 2;
In the case that your array has an odd number of elements, which we can represent as
n = 2k + 1
we can find that your middle index will always equate to
(2k + 1) / 2 = k + (1/2)
which means that, mathematically, the midpoint lies half a step past k for every integer k.
Integer division in C++ does not round; it truncates toward zero, and no floating-point value is ever produced. So while you might expect k + 0.5 to round up to k+1, n / 2 actually evaluates to k.
This means that, for example, when your array size is 11, midIndex is defined to be 5. Therefore, you need to adjust your code accordingly.
Say I have an array A of 8 numbers and another array B that determines how many places each number in A should be shifted to the right:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 0, 0, 0, 0
0 means the number stays where it is; 1 means the number should move one place to the right. The output should therefore have a 0 inserted after the 3, so the output array C should be:
C: 3,0,6,7,8,1,2,3
Whether to insert 0 or something else is not important; the point is that all numbers after 3 are shifted by one place. Numbers shifted past the end of the array are dropped.
Another example:
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 2, 0, 0, 0
C 3, 0, 6, 7, 8, 0, 1, 2
.......................................
A 3, 6, 7, 8, 1, 2, 3, 5
B 0, 1, 0, 0, 1, 0, 0, 0
C 3, 0, 6, 7, 8, 1, 2, 3
I am thinking about using a scan/prefix sum or something similar to solve this problem. The array is also small enough to fit in one warp (<32 numbers), so I should be able to use shuffle instructions. Does anyone have an idea?
One possible approach.
Due to the ambiguity of your shifting (the shift patterns 0, 1, 0, 1 and 0, 1, 1, 1 and 0, 1, 0, 0 all produce the same data offset pattern, for example), it's not possible to just create a prefix sum of the shift pattern to produce the relative offset at each position. An observation we can make, however, is that a valid offset pattern results if each zero in the shift pattern is replaced by the first non-zero shift value to its left:
0, 1, 0, 0 (shift pattern)
0, 1, 1, 1 (offset pattern)
or
0, 2, 0, 2 (shift pattern)
0, 2, 2, 2 (offset pattern)
So how to do this? Let's assume we have the second test case shift pattern:
0, 1, 0, 0, 2, 0, 0, 0
Our desired offset pattern would be:
0, 1, 1, 1, 2, 2, 2, 2
1. For a given shift pattern, create a binary value in which each bit is one if the value at the corresponding index of the shift pattern is zero, and zero otherwise. We can use a warp vote instruction, called __ballot(), for this. Each lane will get the same value from the ballot:
1 0 1 1 0 1 1 1 (this is a single binary 8-bit value in this case)
2. Each warp lane now takes this value and adds to it a value which has a single 1 bit at the warp lane position. Using lane 1 for the remainder of the example:
+ 0 0 0 0 0 0 1 0 (the only 1 bit in this value will be at the lane index)
= 1 0 1 1 1 0 0 1
3. We now take the result of step 2 and bitwise exclusive-OR it with the result from step 1:
= 0 0 0 0 1 1 1 0
4. We now count the number of 1 bits in this value (there is a __popc() intrinsic for this) and subtract one from the result. So for the lane 1 example above, the result of this step would be 2, since there are 3 bits set. This gives us the distance to the first value to our left that is non-zero in the original shift pattern. So for the lane 1 example, the first non-zero value to the left of lane 1 is 2 lanes higher, i.e. lane 3.
5. For each lane, we use the result of step 4 to grab the appropriate offset value for that lane. We can process all lanes at once using a __shfl_down() warp shuffle instruction:
0, 1, 1, 1, 2, 2, 2, 2
Thus producing our desired "offset pattern".
Once we have the desired offset pattern, the process of having each warp lane use its offset value to appropriately shift its data item is straightforward.
Here is a fully worked example, using your 3 test cases. Steps 1-4 above are contained in the __device__ function mydelta. The remainder of the kernel is performing the step 5 shuffle, appropriately indexing into the data, and copying the data. Due to the usage of the warp shuffle instructions, we must compile this for a cc3.0 or higher GPU. (However, it would not be difficult to replace the warp shuffle instructions with other indexing code that would allow operation on cc2.0 or greater devices.) Also, due to the various intrinsics used, this function cannot work for more than 32 data items, but that was a prerequisite condition stated in your question.
$ cat t475.cu
#include <stdio.h>
#define DSIZE 8
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ int mydelta(const int shift){
unsigned nz = __ballot(shift == 0);
unsigned mylane = (threadIdx.x & 31);
unsigned lanebit = 1<<mylane;
unsigned temp = nz + lanebit;
temp = nz ^ temp;
unsigned delta = __popc(temp);
return delta-1;
}
__global__ void mykernel(const int *data, const unsigned *shift, int *result, const int limit){ // limit <= 32
if (threadIdx.x < limit){
unsigned lshift = shift[(limit - 1) - threadIdx.x];
unsigned delta = mydelta(lshift);
unsigned myshift = __shfl_down(lshift, delta);
myshift = __shfl(myshift, ((limit -1) - threadIdx.x)); // reverse offset pattern
result[threadIdx.x] = 0;
if ((myshift + threadIdx.x) < limit)
result[threadIdx.x + myshift] = data[threadIdx.x];
}
}
int main(){
int A[DSIZE] = {3, 6, 7, 8, 1, 2, 3, 5};
unsigned tc1B[DSIZE] = {0, 1, 0, 0, 0, 0, 0, 0};
unsigned tc2B[DSIZE] = {0, 1, 0, 0, 2, 0, 0, 0};
unsigned tc3B[DSIZE] = {0, 1, 0, 0, 1, 0, 0, 0};
int *d_data, *d_result, *h_result;
unsigned *d_shift;
h_result = (int *)malloc(DSIZE*sizeof(int));
if (h_result == NULL) { printf("malloc fail\n"); return 1;}
cudaMalloc(&d_data, DSIZE*sizeof(int));
cudaMalloc(&d_shift, DSIZE*sizeof(unsigned));
cudaMalloc(&d_result, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy(d_data, A, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_shift, tc1B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("index: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", i);
printf("\nA: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", A[i]);
printf("\ntc1 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc1B[i]);
printf("\ntc1 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc2B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc2 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc2B[i]);
printf("\ntc2 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
cudaMemcpy(d_shift, tc3B, DSIZE*sizeof(unsigned), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMempcyH2D fail");
mykernel<<<1,32>>>(d_data, d_shift, d_result, DSIZE);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
cudaMemcpy(h_result, d_result, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMempcyD2H fail");
printf("\ntc3 B: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", tc3B[i]);
printf("\ntc3 C: ");
for (int i = 0; i < DSIZE; i++)
printf("%d, ", h_result[i]);
printf("\n");
return 0;
}
$ nvcc -arch=sm_35 -o t475 t475.cu
$ ./t475
index: 0, 1, 2, 3, 4, 5, 6, 7,
A: 3, 6, 7, 8, 1, 2, 3, 5,
tc1 B: 0, 1, 0, 0, 0, 0, 0, 0,
tc1 C: 3, 0, 6, 7, 8, 1, 2, 3,
tc2 B: 0, 1, 0, 0, 2, 0, 0, 0,
tc2 C: 3, 0, 6, 7, 8, 0, 1, 2,
tc3 B: 0, 1, 0, 0, 1, 0, 0, 0,
tc3 C: 3, 0, 6, 7, 8, 1, 2, 3,
$
Are there any efficient bitwise operations I can do to get the number of set bits that an integer ends with? For example, 11 (decimal) = 1011 (binary) has two trailing 1 bits, and 8 (decimal) = 1000 (binary) has zero trailing 1 bits.
Is there a better algorithm for this than a linear search? I'm implementing a randomized skip list and using random numbers to determine the maximum level of an element when inserting it. I am dealing with 32 bit integers in C++.
Edit: assembler is out of the question, I'm interested in a pure C++ solution.
Calculate ~i & (i + 1) and use the result as a lookup in a table with 32 entries. 1 means zero 1s, 2 means one 1, 4 means two 1s, and so on, except that 0 means 32 1s.
Taking the answer from Ignacio Vazquez-Abrams and completing it with the count rather than a table:
b = ~i & (i+1); // this gives a 1 to the left of the trailing 1's
b--; // this gets us just the trailing 1's that need counting
b = (b & 0x55555555) + ((b>>1) & 0x55555555); // 2 bit sums of 1 bit numbers
b = (b & 0x33333333) + ((b>>2) & 0x33333333); // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f) + ((b>>4) & 0x0f0f0f0f); // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff) + ((b>>8) & 0x00ff00ff); // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff) + ((b>>16) & 0x0000ffff); // sum of 16 bit numbers
at the end b will contain the count of 1's (the masks, adding and shifting count the 1's).
Unless I goofed of course. Test before use.
The Bit Twiddling Hacks page has a number of algorithms for counting trailing zeros. Any of them can be adapted by simply inverting your number first, and there are probably clever ways to alter the algorithms in place without doing that as well. On a modern CPU with cheap floating point operations the best is probably thus:
unsigned int v=~input; // find the number of trailing ones in input
int r; // the result goes here
float f = (float)(v & -v); // cast the least significant bit in v to a float
r = (*(uint32_t *)&f >> 23) - 0x7f;
if(r==-127) r=32;
GCC has __builtin_ctz and other compilers have their own intrinsics. Just protect it with an #ifdef:
#ifdef __GNUC__
int trailingones( uint32_t in ) {
return ~ in == 0? 32 : __builtin_ctz( ~ in );
}
#else
// portable implementation
#endif
On x86, this builtin will compile to one very fast instruction. Other platforms might be somewhat slower, but most have some kind of bit-counting functionality that will beat what you can do with pure C operators.
There may be better answers available, particularly if assembler isn't out of the question, but one viable solution would be to use a lookup table. It would have 256 entries, each returning the number of contiguous trailing 1 bits. Apply it to the lowest byte. If it's 8, apply to the next and keep count.
Implementing Steven Sudit's idea...
uint32_t n; // input value
uint8_t o = 0; // number of trailing one bits in n (must start at zero)
uint8_t trailing_ones[256] = {
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4,
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 8};
uint8_t t;
do {
t=trailing_ones[n&255];
o+=t;
} while(t==8 && (n>>=8));
Cost: the loop runs 1 (best) to 4 (worst) times, about 1.004 times on average for random input. Each iteration is 1 lookup + 1 comparison + 3 arithmetic operations, and the final iteration skips one arithmetic operation.
This code counts the number of trailing zero bits, taken from here (there's also a version that depends on the IEEE 32 bit floating point representation, but I wouldn't trust it, and the modulus/division approaches look really slick - also worth a try):
int CountTrailingZeroBits(unsigned int v) // 32 bit
{
unsigned int c = 32; // c will be the number of zero bits on the right
static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
static const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers
for (int i = 4; i >= 0; --i) // unroll for more speed
{
if (v & B[i])
{
v <<= S[i];
c -= S[i];
}
}
if (v)
{
c--;
}
return c;
}
and then to count trailing ones:
int CountTrailingOneBits(unsigned int v)
{
return CountTrailingZeroBits(~v);
}
http://graphics.stanford.edu/~seander/bithacks.html might give you some inspiration.
Implementation based on Ignacio Vazquez-Abrams's answer
uint8_t trailing_ones(uint32_t i) {
return log2(~i & (i + 1));
}
Implementation of log2() is left as an exercise for the reader (see here)
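Taking @phkahler's answer you can define the following preprocessor macro:
#define trailing_ones(x) __builtin_ctz(~(x) & ((x) + 1))
Since the expression leaves a single 1 bit just to the left of the trailing ones, you can simply count its trailing zeros. Note that the expression is 0 when x is all ones, and the result of __builtin_ctz(0) is undefined.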
Blazingly fast ways to find the number of trailing 0's are given in Hacker's Delight.
You could complement your integer (or more generally, word) to find the number of trailing 1's.
I have this sample for you :
#include <stdio.h>
int trailbits ( unsigned int bits, bool zero )
{
int bitsize = sizeof(int) * 8;
int trail = 0;
unsigned int compbits = bits;
// To count trailing zeros, count the trailing ones of the complement
if ( zero ) compbits = ~bits;
// Stop at the first 0 bit: only the run that touches bit 0 is "trailing"
for ( ; bitsize && ( compbits & 0x01 ); bitsize-- )
{
trail++;
compbits = compbits >> 1;
}
return trail;
}
void PrintBits ( unsigned int bits )
{
unsigned int pbit = 0x80000000;
for ( int len=0 ; len<32; len++ )
{
printf ( "%c ", pbit & bits ? '1' : '0' );
pbit = pbit >> 1;
}
printf ( "\n" );
}
int main(void)
{
unsigned int forbyte = 0x0CC00990;
PrintBits ( forbyte );
printf ( "Trailing ones is %d\n", trailbits ( forbyte, false ));
printf ( "Trailing zeros is %d\n", trailbits ( forbyte, true ));
}