What's a speedy method to count the number of zero-valued bytes in a large, contiguous array? (Or, conversely, the number of non-zero bytes.) By large, I mean 2^16 bytes or larger. The array's position and length may have any byte alignment.
Naive way:
int countZeroBytes(byte[] values, int length)
{
    int zeroCount = 0;
    for (int i = 0; i < length; ++i)
        if (!values[i])
            ++zeroCount;
    return zeroCount;
}
For my problem, I usually just maintain zeroCount and update it based on specific changes to values. However, I'd like to have a fast, general method of re-computing zeroCount after an arbitrary bulk change to values occurs. I'm sure there's a bit-twiddly method of accomplishing this more quickly, but alas, I'm but a novice twiddler.
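For context, that incremental bookkeeping can be as simple as the following sketch (illustrative helper, not the actual code):

// Illustrative helper (not from the question): keep zeroCount in sync when a single byte changes.
void setByte(unsigned char* values, int& zeroCount, int index, unsigned char newValue)
{
    zeroCount += (newValue == 0) - (values[index] == 0);
    values[index] = newValue;
}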
EDIT: A few people have asked about the nature of the data being zero-checked, so I'll describe it. (It'd be nice if solutions were still general, though.)
Basically, envision a world composed of voxels (e.g. Minecraft), with procedurally generated terrain segregated into cubic chunks, or effectively pages of memory indexed as three-dimensional arrays. Each voxel is fly-weighted as a unique byte corresponding to a unique material (air, stone, water, etc.). Many chunks contain only air or water, while others contain varying combinations of 2-4 voxel types in large quantities (dirt, sand, etc.), with effectively 2-10% of voxels being random outliers. Voxels existing in large quantities tend to be highly clustered along every axis.
It seems as though a zero-byte-counting method would be useful in a number of unrelated scenarios, though. Hence, the desire for a general solution.
This is a special case of How to count character occurrences using SIMD, with c=0 as the char (byte) value to count matches for. See that Q&A for a well-optimized, manually-vectorized AVX2 implementation of char_count(char const* vector, size_t size, char c); with a much tighter inner loop than the code below, avoiding reducing each vector of 0/-1 matches to scalar separately.
This will run in O(n), so the best you can do is decrease the constant. One quick fix is to remove the branch. This gives a result as fast as my SSE version below if the zeros are randomly distributed. This is likely because GCC vectorizes this loop. However, for long runs of zeros, or for a random density of zeros below 1%, the SSE version below is still faster.
int countZeroBytes_fix(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        zeroCount += values[i] == 0;
    }
    return zeroCount;
}
I originally thought that the density of zeros would matter. That turns out not to be the case, at least with SSE. Using SSE is a lot faster independent of the density.
Edit: actually, it does depend on the density; the density of zeros just has to be smaller than I expected. 1/64 zeros (1.5% zeros) means one zero per four SSE registers, so the branch prediction does not work very well. However, 1/1024 zeros (0.1% zeros) is faster (see the table of times).
SIMD is even faster if the data has long runs of zeros.
You can pack 16 bytes into an SSE register. Then you can compare all 16 bytes at once with zero using _mm_cmpeq_epi8. To handle runs of zeros, you can then use _mm_movemask_epi8 on the result, and most of the time it will be zero. You could get a speedup of up to 16x in this case (for the first half 1 and the second half 0, I got over a 12x speedup).
Here is a table of times in seconds for 2^16 bytes (with a repeat of 10000).
                     1.5% zeros   50% zeros   0.1% zeros   1st half 1, 2nd half 0
countZeroBytes       0.8s         0.8s        0.8s         0.95s
countZeroBytes_fix   0.16s        0.16s       0.16s        0.16s
countZeroBytes_SSE   0.2s         0.15s       0.10s        0.07s
You can see the results for the last-half-zeros case at http://coliru.stacked-crooked.com/a/67a169ddb03d907a
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h> // SSE2
#include <omp.h>

int countZeroBytes(char* values, int length) {
    int zeroCount = 0;
    for(int i=0; i<length; i++) {
        if (!values[i])
            ++zeroCount;
    }
    return zeroCount;
}

int countZeroBytes_SSE(char* values, int length) {
    int zeroCount = 0;
    __m128i zero16 = _mm_set1_epi8(0);
    __m128i and16 = _mm_set1_epi8(1);
    for(int i=0; i<length; i+=16) {
        __m128i values16 = _mm_loadu_si128((__m128i*)&values[i]);
        __m128i cmp = _mm_cmpeq_epi8(values16, zero16);
        int mask = _mm_movemask_epi8(cmp);
        if(mask) {
            if(mask == 0xffff) zeroCount += 16;
            else {
                cmp = _mm_and_si128(and16, cmp); //change -1 values to 1
                //horizontal sum of 16 bytes
                __m128i sum1 = _mm_sad_epu8(cmp,zero16);
                __m128i sum2 = _mm_shuffle_epi32(sum1,2);
                __m128i sum3 = _mm_add_epi16(sum1,sum2);
                zeroCount += _mm_cvtsi128_si32(sum3);
            }
        }
    }
    return zeroCount;
}

int main() {
    const int n = 1<<16;
    const int repeat = 10000;
    char *values = (char*)_mm_malloc(n, 16);
    for(int i=0; i<n; i++) values[i] = rand()%64; //1.5% zeros
    //for(int i=0; i<n/2; i++) values[i] = 1;
    //for(int i=n/2; i<n; i++) values[i] = 0;

    int zeroCount = 0;
    double dtime;

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) zeroCount = countZeroBytes_SSE(values,n);
    dtime = omp_get_wtime() - dtime;
    printf("zeroCount %d, time %f\n", zeroCount, dtime);
}
I've come up with this OpenMP implementation, which may take advantage of the array being in the local cache of each processor to actually read it in parallel.
nzeros_total = 0;
#pragma omp parallel for reduction(+:nzeros_total)
for (i = 0; i < NDATA; i++)
{
    if (v[i] == 0)
        nzeros_total++;
}
A quick benchmark: running a naive implementation (the same one the OP wrote in the question) 1000 times versus the OpenMP implementation, also run 1000 times, taking the best time for both methods, with an array of 65536 ints and a 50% probability of a zero-valued element, on Windows 7 with a quad-core CPU, compiled with VStudio 2012 Ultimate, yields these numbers:
                  DEBUG               RELEASE
Naive method:     580 microseconds    341 microseconds
OpenMP method:    159 microseconds    99 microseconds
NOTE: I've tried #pragma loop(hint_parallel(4)) but, apparently, this didn't make the naive version perform any better, so my guess is that the compiler was already applying this optimization, or it couldn't be applied at all. Also, #pragma loop(no_vector) didn't make the naive version perform worse.
You can also use the POPCNT instruction, which returns the number of bits set. This allows you to further simplify the code and speed it up by eliminating unnecessary branches. Here is an example with AVX2 and POPCNT:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include "immintrin.h"

int countZeroes(uint8_t* bytes, int length)
{
    const __m256i vZero = _mm256_setzero_si256();
    int count = 0;
    for (int n = 0; n < length; n += 32)
    {
        __m256i v = _mm256_load_si256((const __m256i*)&bytes[n]);
        v = _mm256_cmpeq_epi8(v, vZero);
        int k = _mm256_movemask_epi8(v);
        count += _mm_popcnt_u32(k);
    }
    return count;
}

#define SIZE 1024

int main()
{
    uint8_t bytes[SIZE] __attribute__((aligned(32)));
    for (int z = 0; z < SIZE; ++z)
        bytes[z] = z % 2;

    int n = countZeroes(bytes, SIZE);
    printf("%d\n", n);
    return 0;
}
For situations where zeros are common, it would be faster to check 64 bytes at a time and only check the individual bytes if the 64-byte span is non-zero. If zeros are rare, this will be more expensive. This code assumes that the large block is divisible by 64. It also assumes that memcmp is as efficient as you can get.
#include <string.h>

int countZeroBytes(const unsigned char* values, int length)
{
    static const unsigned char zeros[64] = {};
    int zeroCount = 0;
    for (int i = 0; i < length; i += 64)
    {
        if (::memcmp(values + i, zeros, 64) == 0)
        {
            zeroCount += 64;
        }
        else
        {
            for (int j = i; j < i + 64; ++j)
            {
                if (!values[j])
                {
                    ++zeroCount;
                }
            }
        }
    }
    return zeroCount;
}
Brute force to count zero bytes: Use a vector compare instruction which sets each byte of a vector to 1 if that byte was 0, and to 0 if that byte was not zero.
Do this 255 times to process up to 255 x 64 bytes (if you have 512-bit instructions available, or 255 x 32 / 255 x 16 bytes with 256-bit / 128-bit vectors). Then you just add up the 255 result vectors. Since each byte after the compare had a value of 0 or 1, each sum is at most 255, so you now have one vector of 64 / 32 / 16 bytes, down from about 16,000 / 8,000 / 4,000 bytes.
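For concreteness, here is a hedged SSE2 sketch of that scheme (the function name and loop structure are my own; the answer describes the idea but gives no code). On x86, _mm_cmpeq_epi8 actually produces 0xFF rather than 1 per match, so the sketch subtracts the compare mask, which adds 1 to that byte's counter:

#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <cstdint>

// Hedged sketch: per-byte counters are accumulated for at most 255 vectors so
// they cannot overflow, then _mm_sad_epu8 widens the 16 counters into two
// 64-bit sums. Assumes length is a multiple of 16.
int64_t countZeroBytes_block(const uint8_t* values, size_t length)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i total = _mm_setzero_si128();              // two 64-bit partial sums
    size_t i = 0;
    while (i < length) {
        size_t chunk = length - i;
        if (chunk > 255 * 16) chunk = 255 * 16;       // stay within byte-counter range
        __m128i acc = _mm_setzero_si128();
        for (size_t j = 0; j < chunk; j += 16) {
            __m128i v = _mm_loadu_si128((const __m128i*)(values + i + j));
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, zero));   // acc += (v == 0) per byte
        }
        total = _mm_add_epi64(total, _mm_sad_epu8(acc, zero));
        i += chunk;
    }
    total = _mm_add_epi64(total, _mm_unpackhi_epi64(total, total)); // fold high half into low
    return _mm_cvtsi128_si64(total);
}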
It may be faster to avoid the condition and trade it for a look-up and an add:
char isCharZeroLUT[256] = { 1 }; /* 1 0 0 ... */

int zeroCount = 0;
for (int i = 0; i < length; ++i) {
    zeroCount += isCharZeroLUT[values[i]];
}
I haven't measured the differences, though. It is also worth noting that some compilers happily vectorize sufficiently simple loops.
Related
Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume the number of buffers is <= 255, so that we can make the sum uint8_t)? Basically I want to know, for each bit position i, how many of the buffers have that bit set.
Ex: For 2 buffers
uint8 buf1[k]  -> 0011 0001 ...
uint8 buf2[k]  -> 0101 1000 ...
uint8 sum[k*8] -> 0 1 1 2 1 0 0 1 ...
Are there any BLAS or Boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE:
Following is a naive implementation of the requirement:
for (auto&& buf: buffers){
    for (int i = 0; i < buf_len; i++){
        for (int b = 0; b < 8; ++b) {
            sum[i*8 + b] += (buf[i] >> b) & 1;
        }
    }
}
An alternative to OP's naive code:
Perform 8 additions at once. Use a lookup table to expand the 8 bits into 8 bytes, with each bit mapped to a corresponding byte - see ones[].
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
    static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
        /* 249 more pre-computed constants */ 0x0101010101010101};
    uint64_t sum[k];
    memset(sum, 0, sizeof sum);

    for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
        for (size_t i = 0; i < k; i++) {
            sum[i] += ones[buf[buf_index][i]];
        }
    }

    for (size_t i = 0; i < k; i++) {
        for (size_t bit = 0; bit < 8; bit++) {
            printf("%llu ", (unsigned long long)(0xFF & (sum[i] >> (8*bit))));
        }
    }
}
See also #Eric Postpischil.
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}
Try it on godbolt.
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
        ld1r    {v0.8b}, [x1]          // replicate buf[buf_index][i] to all elements of v0
        add     x0, x0, 1
        add     x1, x1, x20
        ushl    v0.8b, v0.8b, v2.8b    // shift
        and     v0.8b, v0.8b, v3.8b    // mask
        add     v1.8b, v1.8b, v0.8b    // accumulate
        cmp     x0, x19
        bne     .L4
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
I profiled my code and the most expensive part of the code is the loop included in the post. I want to improve the performance of this loop using AVX. I have tried manually unrolling the loop and, while that does improve performance, the improvements are not satisfactory.
int N = 100000000;
int8_t* data = new int8_t[N];
for(int i = 0; i < N; i++) { data[i] = 1; }
std::array<float, 10> f = {1,2,3,4,5,6,7,8,9,10};
std::vector<float> output(N, 0);
int k = 0;

for (int i = k; i < N; i = i + 2) {
    for (int j = 0; j < 10; j++, k = j + 1) {
        output[i] += f[j] * data[i - k];
        output[i + 1] += f[j] * data[i - k + 1];
    }
}
Could I have some guidance on how to approach this?
I would assume that data is a large input array of signed bytes, and f is a small array of floats of length 10, and output is the large output array of floats. Your code goes out of bounds for the first 10 iterations by i, so I will start i from 10 instead. Here is a clean version of the original code:
int s = 10;
for (int i = s; i < N; i += 2) {
    for (int j = 0; j < 10; j++) {
        output[i] += f[j] * data[i-j-1];
        output[i+1] += f[j] * data[i-j];
    }
}
As it turns out, processing two iterations by i does not change anything, so we simplify it further to:
for (int i = s; i < N; i++)
    for (int j = 0; j < 10; j++)
        output[i] += f[j] * data[i-j-1];
This version of the code (along with declarations of the input/output data) should have been present in the question itself, without others having to clean up and simplify the mess.
Now it is obvious that this code applies a one-dimensional convolution filter, which is a very common thing in signal processing. For instance, it can be computed in Python using the numpy.convolve function. The kernel has a very small length of 10, so a Fast Fourier Transform won't provide any benefits compared to the brute-force approach. Given that the problem is well-known, you can read a lot of articles on vectorizing small-kernel convolution. I will follow the article by hgomersall.
First, let's get rid of reverse indexing. Obviously, we can reverse the kernel before running the main algorithm. After that, we have to compute the so-called cross-correlation instead of convolution. In simple words, we move the kernel array along the input array, and compute the dot product between them for every possible offset.
std::reverse(f.data(), f.data() + 10);
for (int i = s; i < N; i++) {
    int b = i-10;
    float res = 0.0;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[b+j];
    output[i] = res;
}
In order to vectorize it, let's compute 8 consecutive dot products at once. Recall that we can pack eight 32-bit float numbers into one 256-bit AVX register. We will vectorize the outer loop by i, which means that:
The loop by i will be advanced by 8 every iteration.
Every value inside the outer loop turns into an 8-element pack, such that the k-th element of the pack holds this value for the (i+k)-th iteration of the outer loop from the scalar version.
Here is the resulting code:
//reverse the kernel
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]); //every component will have same value

//note: you have to compute the last 16 values separately!
for (size_t i = s; i + 16 <= N; i += 8) {
    int b = i-10;
    __m256 res = _mm256_setzero_ps();
    for (size_t j = 0; j < 10; j++) {
        //load: data[b+j], data[b+j+1], data[b+j+2], ..., data[b+j+15]
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]);
        //convert first 8 bytes of loaded 16-byte pack into 8 floats
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
        //compute res = res + floats * revKernel[j] elementwise
        res = _mm256_fmadd_ps(revKernel[j], floats, res);
    }
    //store 8 values packed in res into: output[i], output[i+1], ..., output[i+7]
    _mm256_storeu_ps(&output[i], res);
}
For 100 million elements, this code takes about 120 ms on my machine, while the original scalar implementation took 850 ms. Beware: I have a Ryzen 1600 CPU, so results on Intel CPUs may be somewhat different.
Now if you really want to unroll something, the inner loop by 10 kernel elements is the perfect place. Here is how it is done:
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]);

for (size_t i = s; i + 16 <= N; i += 8) {
    size_t b = i-10;
    __m256 res = _mm256_setzero_ps();
    #define DOIT(j) {\
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]); \
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); \
        res = _mm256_fmadd_ps(revKernel[j], floats, res); \
    }
    DOIT(0);
    DOIT(1);
    DOIT(2);
    DOIT(3);
    DOIT(4);
    DOIT(5);
    DOIT(6);
    DOIT(7);
    DOIT(8);
    DOIT(9);
    _mm256_storeu_ps(&output[i], res);
}
It takes 110 ms on my machine (slightly better than the first vectorized version).
The simple copy of all elements (with conversion from bytes to floats) takes 40 ms for me, which means that this code is not memory-bound yet, and there is still some room for improvement left.
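For reference, that kind of baseline might look like the following sketch (my own guess at the shape of such a measurement, reusing data, output, and N from above; it only converts and stores, without applying the kernel):

//hedged sketch: convert-and-copy baseline, not the author's measured code
for (size_t i = 0; i + 16 <= N; i += 8) {
    __m128i bytes = _mm_loadu_si128((__m128i*)&data[i]);             //load 16 bytes
    __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); //convert the first 8 to floats
    _mm256_storeu_ps(&output[i], floats);                            //store 8 floats
}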
My code does the following:
Do some long-running intense computation (called useless below)
Do a small latency-critical task
I find that the time it takes to execute the latency-critical task is higher with the long-running computation than without it.
Here is some stand-alone C++ code to reproduce this effect:
#include <stdio.h>
#include <stdint.h>
#define LEN 128
#define USELESS 1000000000
//#define USELESS 0
// Read timestamp counter
static inline long long get_cycles()
{
unsigned low, high;
unsigned long long val;
asm volatile ("rdtsc" : "=a" (low), "=d" (high));
val = high;
val = (val << 32) | low;
return val;
}
// Compute a simple hash
static inline uint32_t hash(uint32_t *arr, int n)
{
uint32_t ret = 0;
for(int i = 0; i < n; i++) {
ret = (ret + (324723947 + arr[i])) ^ 93485734985;
}
return ret;
}
int main()
{
uint32_t sum = 0; // For adding dependencies
uint32_t arr[LEN]; // We'll compute the hash of this array
for(int iter = 0; iter < 3; iter++) {
// Create a new array to hash for this iteration
for(int i = 0; i < LEN; i++) {
arr[i] = (iter + i);
}
// Do intense computation
for(int useless = 0; useless < USELESS; useless++) {
sum += (sum + useless) * (sum + useless);
}
// Do the latency-critical task
long long start_cycles = get_cycles() + (sum & 1);
sum += hash(arr, LEN);
long long end_cycles = get_cycles() + (sum & 1);
printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
}
}
When compiled with -O3 with USELESS set to 1 billion, the three iterations took 588, 4184, and 536 cycles, respectively. When compiled with USELESS set to 0, the iterations took 394, 358, and 362 cycles, respectively.
Why could this (particularly the 4184 cycles) be happening? I suspected cache misses or branch mis-predictions induced by the intense computation. However, without the intense computation, the zeroth iteration of the latency critical task is pretty fast so I don't think that cold cache/branch predictor is the cause.
Moving my speculative comment to an answer:
It is possible that while your busy loop is running, other tasks on the server are pushing the cached arr data out of the L1 cache, so that the first memory access in hash needs to reload from a lower level cache. Without the compute loop this wouldn't happen. You could try moving the arr initialization to after the computation loop, just to see what the effect is.
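Concretely, the modified iteration body might look roughly like this (a sketch of the suggested experiment, not tested code):

// Sketch: same iteration body as in the question, but with arr initialized
// after the busy loop, so it is still hot in L1 when hash() runs.
for(int iter = 0; iter < 3; iter++) {
    // Do intense computation first
    for(int useless = 0; useless < USELESS; useless++) {
        sum += (sum + useless) * (sum + useless);
    }

    // Create the array to hash only now, right before the latency-critical task
    for(int i = 0; i < LEN; i++) {
        arr[i] = (iter + i);
    }

    // Do the latency-critical task
    long long start_cycles = get_cycles() + (sum & 1);
    sum += hash(arr, LEN);
    long long end_cycles = get_cycles() + (sum & 1);
    printf("Iteration %d cycles: %lld\n", iter, end_cycles - start_cycles);
}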
Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.
Output is an array of the indices of set bits of the bitarray.
Example:
bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}
Getting either set A or set B is fine.
The set is stored as an array of uint32_t so each element of the set is an unsigned 32 bit integer in the array.
How to do this about 5 times faster on a single cpu core?
current code:
#include <iostream>
#include <vector>
#include <cstdint>
#include <time.h>

using namespace std;

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base);

template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    for(i = 0; i < size; i++){
        find_set_bit(v[i], ptr_set_new, base);
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    int k = base;
    while(n){
        if (n & 1){
            *(ptr_set) = k;
            ptr_set++;
        }
        n = n >> 1;
        k++;
    }
}

template <typename T>
void rand_vector(T& v){
    srand(time(NULL));
    int i;
    int size = v.capacity();
    for (i=0;i<size;i++){
        v[i] = rand();
    }
}

template <typename T>
void print_vector(T& v, int size_in = 0){
    int i;
    int size;
    if (size_in == 0){
        size = v.capacity();
    } else {
        size = size_in;
    }
    for (i=0;i<size;i++){
        cout << v[i] << ' ';
    }
    cout << endl;
}

int main(void){
    const int test_size = 6000;
    vector<uint32_t> vec(test_size);
    vector<uint32_t> set(test_size*sizeof(uint32_t)*8);
    rand_vector(vec);
    //for (int i = 0; i < 64; i++) vec[i] = -1;
    //cout << "input" << endl;
    print_vector(vec);
    //cout << "calculate result" << endl;

    int i;
    int rep = 10000;
    uint32_t res_size;
    struct timespec tp_start, tp_end;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    for (i=0;i<rep;i++){
        res_size = bitarray2set(vec, set.data());
    }
    clock_gettime(CLOCK_MONOTONIC, &tp_end);

    double timing;
    const double nano = 0.000000001;
    timing = ((double)(tp_end.tv_sec - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) / (rep);

    cout << "timing per cycle: " << timing << endl;
    cout << "print result" << endl;
    //print_vector(set, res_size);
}
result (compiled with icc -O3 code.cpp -lrt)
...
timing per cycle: 0.000739613 (7.4E-4).
print result
0.0008 seconds to convert 768,000 bits to a set. But there are at least 10,000 arrays of 768,000 bits in each cycle. That is 8 seconds per cycle. That is slow.
The CPU has the POPCNT instruction and the SSE4.2 instruction set.
Thanks.
Update
template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    uint32_t * ptr_v;
    uint32_t * ptr_v_end = &(v[size]);
    for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
        while(*ptr_v) {
            *ptr_set_new++ = base + __builtin_ctz(*ptr_v);
            (*ptr_v) &= (*ptr_v) - 1; // zeros the lowest 1-bit in n
        }
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}
This updated version uses the inner loop provided by rhashimoto. I don't know if the inlining actually makes the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled with icc -O3 code.cpp -lrt, and benchmarked on a random vector).
Warning:
I just found that reserving instead of resizing a std::vector, and then writing directly to the vector's data through a raw pointer, is a bad idea. Resizing first and then using the raw pointer is fine, though. See Robᵩ's answer at Resizing a C++ std::vector<char> without initializing data. I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each element of the vector... at least vectors actually use contiguous memory, like a plain array (Are std::vector elements guaranteed to be contiguous?).
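In other words, something like this (a minimal sketch of the resize-then-raw-pointer pattern, reusing names from the code above):

vector<uint32_t> set;
set.resize(test_size * sizeof(uint32_t) * 8);        // elements now exist, so writing through data() is well-defined
uint32_t res_size = bitarray2set(vec, set.data());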
I notice that you use .capacity() when you probably mean to use .size(). That could make you do extra unnecessary work, as well as giving you the wrong answer.
Your loop in find_set_bit() iterates over all 32 bits in the word. You can instead iterate only over the set bits and use the BSF instruction to determine the index of the lowest one. GCC has an intrinsic function, __builtin_ctz(), to generate BSF or the equivalent - I think the Intel compiler also supports it (you can use inline assembly if not). The modified function would look like this:
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    while(n) {
        *ptr_set++ = base + __builtin_ctz(n);
        n &= n - 1; // zeros the lowest 1-bit in n
    }
}
On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.
There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.
As suggested by #davidbak, you could use a table lookup to process 4 elements of the bitmap at once.
Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.
#rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.
I'm thinking something like
// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0, ... };

// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, int input) {
    int *set_start = set;
    __m128i offset = _mm_setzero_si128();
    for (nibble in input[]) { // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[nibble * 4]);
        __m128i vpos = _mm_add_epi32(v, offset);
        _mm_storeu_si128((__m128i*)set, vpos);
        set += _mm_popcnt_u32(nibble); // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4)); // increment the offset by 4
    }
    return set - set_start; // set size
}
When a nibble isn't 1111, the next store will overlap, but that's fine.
Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.
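For completeness, the 16-entry table itself could be generated at startup with something like this (my own sketch; the answer leaves the constants elided, and I drop the const so the table can be filled at runtime):

// Hedged sketch (mine, not from the answer): entry `pattern` lists the
// positions (0..3) of that pattern's set bits; unused trailing slots are
// zeroed, which is harmless because later stores overwrite them anyway.
alignas(16) static int LUT[16*4];

static void init_LUT(void) {
    for (int pattern = 0; pattern < 16; pattern++) {
        int out = 0;
        for (int bit = 0; bit < 4; bit++)
            if (pattern & (1 << bit))
                LUT[pattern*4 + out++] = bit;
        while (out < 4)
            LUT[pattern*4 + out++] = 0;   // don't-care padding
    }
}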
I have an array of integers; let's assume they are of type int64_t. Now, I know that only the first n bits of every integer are meaningful (that is, I know that they are limited by some bounds).
What is the most efficient way to convert the array so that all unnecessary space is removed (i.e. I have the first integer at a[0], the second one at a[0] + n bits, and so on)?
I would like it to be as general as possible, because n will vary from time to time, though I guess there might be smart optimizations for specific n, like powers of 2 or the like.
Of course I know that I can just iterate value by value; I just want to ask you StackOverflowers if you can think of a more clever way.
Edit:
This question is not about compressing the array to take as little space as possible. I just need to "cut" n bits from every integer; given the array, I know the exact number of bits I can safely cut.
Today I released: PackedArray: Packing Unsigned Integers Tightly (github project).
It implements a random access container where items are packed at the bit level. In other words, it acts as if you were able to manipulate, e.g., a uint9_t or uint17_t array:
PackedArray principle:
. compact storage of <= 32 bits items
. items are tightly packed into a buffer of uint32_t integers
PackedArray requirements:
. you must know in advance how many bits are needed to hold a single item
. you must know in advance how many items you want to store
. when packing, behavior is undefined if items have more than bitsPerItem bits
PackedArray general in memory representation:
 |-------------------------------------------------- - - -
 |       b0       |       b1       |       b2       |
 |-------------------------------------------------- - - -
 | i0 | i1 | i2 | i3 | i4 | i5 | i6 | i7 | i8 | i9 |
 |-------------------------------------------------- - - -
. items are tightly packed together
. several items end up inside the same buffer cell, e.g. i0, i1, i2
. some items span two buffer cells, e.g. i3, i6
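As an illustration of that layout (a sketch of my own, not PackedArray's actual API), reading one item out of such a buffer could look like this:

#include <stdint.h>

/* Hedged sketch, not PackedArray's API: read one item of width bitsPerItem
   (1..32) from a tightly packed uint32_t buffer. Reads a 64-bit window so an
   item straddling two cells is still covered; assumes the buffer has one spare
   trailing cell so buf[cell + 1] is always valid. */
static uint32_t packed_get(const uint32_t *buf, uint32_t bitsPerItem, uint32_t index)
{
    uint64_t bitPos = (uint64_t)index * bitsPerItem;
    uint32_t cell   = (uint32_t)(bitPos / 32);
    uint32_t offset = (uint32_t)(bitPos % 32);
    uint64_t window = buf[cell] | ((uint64_t)buf[cell + 1] << 32);
    uint32_t mask   = (bitsPerItem == 32) ? 0xFFFFFFFFu : ((1u << bitsPerItem) - 1u);
    return (uint32_t)(window >> offset) & mask;
}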
I agree with keraba that you need to use something like Huffman coding or perhaps the Lempel-Ziv-Welch algorithm. The problem with bit-packing the way you are talking about is that you have two options:
Pick a constant n such that the largest integer can be represented.
Allow n to vary from value to value.
The first option is relatively easy to implement, but is really going to waste a lot of space unless all integers are rather small.
The second option has the major disadvantage that you have to convey changes in n somehow in the output bitstream. For instance, each value will have to have a length associated with it. This means you are storing two integers (albeit smaller integers) for every input value. There's a good chance you'll increase the file size with this method.
The advantage of Huffman or LZW is that they create codebooks in such a way that the length of the codes can be derived from the output bitstream without actually storing the lengths. These techniques allow you to get very close to the Shannon limit.
I decided to give your original idea (constant n, remove unused bits and pack) a try for fun and here is the naive implementation I came up with:
#include <sys/types.h>
#include <stdio.h>

int pack(int64_t* input, int nin, void* output, int n)
{
    int64_t inmask = 0;
    unsigned char* pout = (unsigned char*)output;
    int obit = 0;
    int nout = 0;
    *pout = 0;

    for(int i=0; i<nin; i++)
    {
        inmask = (int64_t)1 << (n-1);
        for(int k=0; k<n; k++)
        {
            if(obit>7)
            {
                obit = 0;
                pout++;
                *pout = 0;
            }
            *pout |= (((input[i] & inmask) >> (n-k-1)) << (7-obit));
            inmask >>= 1;
            obit++;
            nout++;
        }
    }
    return nout;
}

int unpack(void* input, int nbitsin, int64_t* output, int n)
{
    unsigned char* pin = (unsigned char*)input;
    int64_t* pout = output;
    int nbits = nbitsin;
    unsigned char inmask = 0x80;
    int inbit = 0;
    int nout = 0;

    while(nbits > 0)
    {
        *pout = 0;
        for(int i=0; i<n; i++)
        {
            if(inbit > 7)
            {
                pin++;
                inbit = 0;
            }
            *pout |= ((int64_t)((*pin & (inmask >> inbit)) >> (7-inbit))) << (n-i-1);
            inbit++;
        }
        pout++;
        nbits -= n;
        nout++;
    }
    return nout;
}

int main()
{
    int64_t input[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    int64_t output[21];
    unsigned char compressed[21*8];
    int n = 5;

    int nbits = pack(input, 21, compressed, n);
    int nout = unpack(compressed, nbits, output, n);

    for(int i=0; i<=20; i++)
        printf("input: %lld output: %lld\n", input[i], output[i]);
}
This is very inefficient because it steps one bit at a time, but that was the easiest way to implement it without dealing with issues of endianness. I have not tested this with a wide range of values, just the ones in the test. Also, there is no bounds checking, and it is assumed the output buffers are long enough. So what I am saying is that this code is probably only good for educational purposes to get you started.
Most any compression algorithm will get close to the minimum entropy needed to encode the integers, for example, Huffman coding, but accessing it like an array will be non-trivial.
Starting from Jason B's implementation, I eventually wrote my own version which processes bit blocks instead of single bits. One difference is that it is LSB-first: it starts from the lowest output bits and goes to the highest. This only makes it harder to read with a binary dump, like Linux xxd -b. As a detail, int* can be trivially changed to int64_t*, and it would be even better as unsigned. I have already tested this version with a few million arrays and it seems solid, so I will share it with the rest:
#include <algorithm> // std::min

int pack2(int *input, int nin, unsigned char* output, int n)
{
    int obit = 0;
    int ibit = 0;
    int ibite = 0;
    int nout = 0;
    if(nin>0) output[0] = 0;

    for(int i=0; i<nin; i++)
    {
        ibit = 0;
        while(ibit < n) {
            ibite = std::min(n, ibit + 8 - obit);
            output[nout] |= (input[i] & (((1 << ibite)-1) ^ ((1 << ibit)-1))) >> ibit << obit;
            obit += ibite - ibit;
            nout += obit >> 3;
            if(obit & 8) output[nout] = 0;
            obit &= 7;
            ibit = ibite;
        }
    }
    return nout;
}

int unpack2(int *oinput, int nin, unsigned char* ioutput, int n)
{
    int obit = 0;
    int ibit = 0;
    int ibite = 0;
    int nout = 0;

    for(int i=0; i<nin; i++)
    {
        oinput[i] = 0;
        ibit = 0;
        while(ibit < n) {
            ibite = std::min(n, ibit + 8 - obit);
            oinput[i] |= (ioutput[nout] & (((1 << (ibite-ibit+obit))-1) ^ ((1 << obit)-1))) >> obit << ibit;
            obit += ibite - ibit;
            nout += obit >> 3;
            obit &= 7;
            ibit = ibite;
        }
    }
    return nout;
}
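As a quick round-trip check, a usage sketch mirroring the main() from the earlier answer (mine, not part of the original post):

#include <cstdio>

int main()
{
    int input[21], output[21];
    unsigned char packed[21 * sizeof(int)] = {0};
    for (int i = 0; i < 21; i++) input[i] = i;

    int n = 5;                      // bits kept per value
    pack2(input, 21, packed, n);    // pack 21 values into ceil(21*5/8) = 14 bytes
    unpack2(output, 21, packed, n); // expand them back out

    for (int i = 0; i < 21; i++)
        printf("input: %d output: %d\n", input[i], output[i]);
}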
I know this might seem like the obvious thing to say, and I'm sure there's actually a solution, but why not use a smaller type, like uint8_t (max 255) or uint16_t (max 65535)? I'm sure you could bit-manipulate an int64_t using defined values and OR operations and the like, but, aside from an academic exercise, why?
And on the note of academic exercises, Bit Twiddling Hacks is a good read.
If you have fixed sizes, e.g. you know your number is 38-bit rather than 64, you can build structures using bit-field specifications. Assuming you also have smaller elements to fit in the remaining space:
struct example {
    /* 64bit number cut into 3 different sized sections */
    uint64_t big_num:38;
    uint64_t small_num:16;
    uint64_t itty_num:10;

    /* 8 bit number cut in two */
    uint8_t nibble_A:4;
    uint8_t nibble_B:4;
};
This isn't big/little endian safe without some hoop-jumping, so it can only be used within a program rather than in an exported data format. This technique is quite often used to store boolean values in single bits without defining shifts and masks.
I don't think you can avoid iterating across the elements.
AFAIK, Huffman encoding requires the frequencies of the "symbols", which, unless you know the statistics of the "process" generating the integers, you will have to compute by iterating across every element.
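For instance, gathering those frequencies is itself a full pass over the data; a minimal sketch (the map type and function name are illustrative):

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hedged sketch: one pass to collect the symbol frequencies a Huffman coder needs.
std::unordered_map<int64_t, std::size_t> count_frequencies(const std::vector<int64_t>& values)
{
    std::unordered_map<int64_t, std::size_t> freq;
    for (int64_t v : values)
        ++freq[v];          // each distinct integer is a "symbol"
    return freq;
}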