Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.
Output is an array of the indices of set bits of the bitarray.
Example:
bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}
Getting either set A or set B is fine.
The set is stored as an array of uint32_t so each element of the set is an unsigned 32 bit integer in the array.
How can I do this about 5 times faster on a single CPU core?
Current code:
#include <iostream>
#include <vector>
#include <time.h>
using namespace std;
// Forward declaration so bitarray2set compiles (find_set_bit is defined below).
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base);

template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    for(i = 0; i < size; i++){
        find_set_bit(v[i], ptr_set_new, base);
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    int k = base;
    while(n){
        if (n & 1){
            *(ptr_set) = k;
            ptr_set++;
        }
        n = n >> 1;
        k++;
    }
}
template <typename T>
void rand_vector(T& v){
    srand(time(NULL));
    int size = v.capacity();
    for (int i = 0; i < size; i++){
        v[i] = rand();
    }
}

template <typename T>
void print_vector(T& v, int size_in = 0){
    int size;
    if (size_in == 0){
        size = v.capacity();
    } else {
        size = size_in;
    }
    for (int i = 0; i < size; i++){
        cout << v[i] << ' ';
    }
    cout << endl;
}
int main(void){
    const int test_size = 6000;
    vector<uint32_t> vec(test_size);
    vector<uint32_t> set(test_size*sizeof(uint32_t)*8);

    rand_vector(vec);
    //for (int i = 0; i < 64; i++) vec[i] = -1;
    //cout << "input" << endl;
    print_vector(vec);
    //cout << "calculate result" << endl;

    int i;
    int rep = 10000;
    uint32_t res_size;
    struct timespec tp_start, tp_end;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    for (i = 0; i < rep; i++){
        res_size = bitarray2set(vec, set.data());
    }
    clock_gettime(CLOCK_MONOTONIC, &tp_end);

    double timing;
    const double nano = 0.000000001;
    timing = ((double)(tp_end.tv_sec - tp_start.tv_sec)
             + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) / (rep);

    cout << "timing per cycle: " << timing << endl;
    cout << "print result" << endl;
    //print_vector(set, res_size);
}
Result (compiled with icc -O3 code.cpp -lrt):
...
timing per cycle: 0.000739613 (7.4E-4).
print result
That is 0.0008 seconds to convert 192,000 bits (6,000 uint32_t words) to a set. But my real workload has at least 10,000 arrays of 768,000 bits in each cycle, so at that rate each cycle takes well over 8 seconds. That is slow.
The CPU has the POPCNT instruction and the SSE4.2 instruction set.
Thanks.
Update
template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    uint32_t * ptr_v;
    uint32_t * ptr_v_end = &(v[size]);
    for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
        while(*ptr_v) {
            *ptr_set_new++ = base + __builtin_ctz(*ptr_v);
            (*ptr_v) &= (*ptr_v) - 1; // zeros the lowest 1-bit in n
        }
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}
This updated version uses the inner loop provided by rhashimoto. I don't know whether manually inlining it actually makes the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled with icc -O3 code.cpp -lrt and benchmarked on a random vector).
Warning:
I just found that reserving (instead of resizing) a std::vector and then writing directly to its data through a raw pointer is a bad idea; resizing first and then using a raw pointer is fine. See Robᵩ's answer at Resizing a C++ std::vector<char> without initializing data. I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each element of the vector. At least vectors actually use contiguous memory, like a plain array (Are std::vector elements guaranteed to be contiguous?).
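In code, the distinction looks like this (a minimal sketch; max_set_size is a hypothetical bound, not from the code above):

std::vector<uint32_t> set;
set.resize(max_set_size);      // elements now exist; writing through data() is defined
uint32_t* p = set.data();      // fine after resize()
// set.reserve(max_set_size);  // NOT enough: size() is still 0, and writing
//                             // through data() beyond size() is undefined behavior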
I notice that you use .capacity() when you probably mean to use .size(). That could make you do extra unnecessary work, as well as give you the wrong answer.
Your loop in find_set_bit() iterates over all 32 bits in the word. You can instead iterate only over the set bits and use the BSF instruction to determine the index of the lowest one. GCC has an intrinsic function, __builtin_ctz(), that generates BSF or the equivalent; I think the Intel compiler also supports it (you can use inline assembly if not). The modified function would look like this:
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    while(n) {
        *ptr_set++ = base + __builtin_ctz(n);
        n &= n - 1; // zeros the lowest 1-bit in n
    }
}
On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.
There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.
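For completeness, if you ever hit a compiler without __builtin_ctz(), one well-known portable fallback from that linked question is the De Bruijn multiplication from Bit Twiddling Hacks; a sketch (only valid for non-zero v):

#include <cstdint>

// Index of the lowest set bit of v, for v != 0 (De Bruijn constant 0x077CB531).
static inline int ctz32(uint32_t v)
{
    static const int tab[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };
    return tab[((v & (0u - v)) * 0x077CB531u) >> 27];
}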
As suggested by #davidbak, you could use a table lookup to process 4 elements of the bitmap at once.
Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.
#rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.
I'm thinking something like
// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0, ... };
// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, int input) {
    int *set_start = set;
    __m128i offset = _mm_setzero_si128();
    for (nibble in input[]) { // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[nibble * 4]);
        __m128i vpos = _mm_add_epi32(v, offset);
        _mm_storeu_si128((__m128i*)set, vpos);
        set += _mm_popcnt_u32(nibble); // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4)); // increment the offset by 4
    }
    return set - set_start; // set size
}
When a nibble isn't 1111, the next store will overlap, but that's fine.
Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.
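For illustration, here is one concrete way to flesh out that sketch (my own hedged version, assuming SSE2 plus the SSE4.2 popcnt; the LUT contents and the name bitarray2set_lut are mine). Note that the output buffer needs at least 3 extra elements of slack, because every store writes a full 16-byte vector even when fewer indices are produced:

#include <cstdint>
#include <cstddef>
#include <emmintrin.h>  // SSE2
#include <nmmintrin.h>  // _mm_popcnt_u32 (SSE4.2)

// For every 4-bit pattern, the positions (0-3) of its set bits, padded to 4 lanes.
alignas(16) static const uint32_t LUT[16][4] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

uint32_t bitarray2set_lut(const uint32_t* in, size_t nwords, uint32_t* set)
{
    uint32_t* set_start = set;
    __m128i offset = _mm_setzero_si128();
    const __m128i four = _mm_set1_epi32(4);
    for (size_t i = 0; i < nwords; ++i) {
        uint32_t w = in[i];
        for (int nib = 0; nib < 8; ++nib) {      // 8 nibbles per 32-bit word
            unsigned nibble = w & 0xF;
            __m128i v = _mm_load_si128((const __m128i*)LUT[nibble]);
            _mm_storeu_si128((__m128i*)set, _mm_add_epi32(v, offset));
            set += _mm_popcnt_u32(nibble);       // advance by the number of set bits
            offset = _mm_add_epi32(offset, four);
            w >>= 4;
        }
    }
    return (uint32_t)(set - set_start);
}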
Related
My task is to design a function that fulfils these requirements:
Function shall sum members of a given one-dimensional array. However, it should sum only members whose number of ones in the binary representation is higher than a defined threshold (e.g. if the threshold is 4, the number 255 will be counted and 15 will not)
The array length is arbitrary
The function shall utilize as little memory as possible and shall be written in an efficient way
The production function code (‘sum_filtered(){..}’) shall not use any standard C library functions (or any other libraries)
The function shall return 0 on success and error code on error
The array elements are of a type 16-bit signed integer and an overflow during calculation shall be regarded as a failure
Use data types that ensure portability between different CPUs (so the calculations will be the same on 8/16/32-bit MCU)
The function code should contain a reasonable amount of comments in doxygen annotation
Here is my solution:
#include <iostream>
using namespace std;

int sum_filtered(short array[], int treshold)
{
    // return 1 if invalid input parameters
    if((treshold < 0) || (treshold > 16)){return(1);}

    int sum = 0;
    int bitcnt = 0;
    for(int i=0; i < sizeof(array); i++)
    {
        // Count one bits of integer
        bitcnt = 0;
        for (int pos = 0; pos < 16; pos++) {if (array[i] & (1 << pos)) {bitcnt++;}}

        // Add integer to sum if bitcnt > treshold
        if(bitcnt>treshold){sum += array[i];}
    }
    return(0);
}

int main()
{
    short array[5] = {15, 2652, 14, 1562, -115324};
    int result = sum_filtered(array, 14);
    cout << result << endl;

    short array2[5] = {15, 2652, 14, 1562, 15324};
    result = sum_filtered(array2, -2);
    cout << result << endl;
}
However, I'm not sure whether this code is portable between different CPUs.
I also don't know how an overflow can occur during the calculation, or what other errors could arise when processing arrays with this function.
Can somebody more experienced give me their opinion?
Well, I can foresee one problem:
for(int i=0; i < sizeof(array); i++)
array in this context is a pointer, so sizeof(array) will likely be 4 on 32-bit systems or 8 on 64-bit systems. You really want to pass a count variable (in this case 5) into the sum_filtered function (at the call site you can compute it as sizeof(array) / sizeof(short)).
Anyhow, this code:
// Count one bits of integer
bitcnt = 0;
for (int pos = 0 ; pos < 16 ; pos++) {if (array[i] & (1 << pos)) {bitcnt++;}}
Effectively you are doing a popcount here (which can be done using __builtin_popcount on gcc/clang, or __popcnt on MSVC; they are compiler specific, but usually boil down to a single popcount CPU instruction on most CPUs).
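For illustration, a tiny portability shim along those lines might look like this (a sketch; the wrapper name popcount16 is mine):

#include <cstdint>

#if defined(_MSC_VER)
    #include <intrin.h>
    static inline int popcount16(uint16_t v) { return (int)__popcnt(v); }
#else
    static inline int popcount16(uint16_t v) { return __builtin_popcount(v); }
#endif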
If you do want to do this the slow way, then an efficient approach is to treat the computation as a form of bitwise SIMD operation:
#include <cstdint> // or stdint.h if you have a rubbish compiler :)

uint16_t popcount(uint16_t s)
{
    // perform 8x 1bit adds
    uint16_t a0 = s & 0x5555;
    uint16_t b0 = (s >> 1) & 0x5555;
    uint16_t s0 = a0 + b0;

    // perform 4x 2bit adds
    uint16_t a1 = s0 & 0x3333;
    uint16_t b1 = (s0 >> 2) & 0x3333;
    uint16_t s1 = a1 + b1;

    // perform 2x 4bit adds
    uint16_t a2 = s1 & 0x0F0F;
    uint16_t b2 = (s1 >> 4) & 0x0F0F;
    uint16_t s2 = a2 + b2;

    // perform 1x 8bit add
    uint16_t a3 = s2 & 0x00FF;
    uint16_t b3 = (s2 >> 8) & 0x00FF;
    return a3 + b3;
}
I know it says you can't use stdlib functions (your 4th point), but surely that shouldn't apply to the standardised integer types (e.g. uint16_t)? If it does, then there is no way to guarantee portability across platforms; you're out of luck.
Personally I'd just use a 64-bit integer for the sum. That makes any realistic overflow impossible (e.g. if the threshold is zero and every value is -32768, the sum would only overflow once the array exceeded 2^48 elements, i.e. 281,474,976,710,656 in decimal).
#include <cstdint>
#include <cstddef> // for size_t

int64_t sum_filtered(int16_t array[], uint16_t threshold, size_t array_length)
{
    // changing the type of threshold to be unsigned means we don't need to test
    // for negative numbers.
    if(threshold > 16) { return 1; }

    int64_t sum = 0;
    for(size_t i = 0; i < array_length; i++)
    {
        if (popcount(array[i]) > threshold)
        {
            sum += array[i];
        }
    }
    return sum;
}
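A call site under the same assumptions (hypothetical data, reusing the popcount() above) might look like:

#include <iostream>

int main()
{
    int16_t data[] = {15, 2652, 14, 1562, 15324};
    std::cout << sum_filtered(data, 4, sizeof(data) / sizeof(data[0])) << std::endl;
}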
TL;DR: How do I safely perform a single-bit update A[n/8] |= (1<<n%8); (i.e., setting bit n of A to true) where A is a huge array of chars, when computing in parallel using C++11's <thread> library?
I'm performing a computation that's easy to parallelize. I'm computing elements of a certain subset of the natural numbers, and I want to find the elements that are not in the subset. For this I create a huge array (like A = new char[20l*1024l*1024l*1024l], i.e., 20 GiB). Bit n of this array is true if n lies in my set.
When doing it in parallel and setting the bits with A[n/8] |= (1<<n%8);, I seem to get a small loss of information, presumably due to concurrent work on the same byte of A (each thread has to read the byte, update the single bit, and write the byte back). How can I get around this? Is there a way to do this update as an atomic operation?
The code follows. GCC version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609. The machine is an 8-core Intel(R) Xeon(R) CPU E5620 @ 2.40GHz with 37GB RAM. Compiler options: g++ -std=c++11 -pthread -O3
#include <iostream>
#include <thread>

typedef long long myint; // long long to be sure

const myint max_A = 20ll*1024ll*1024ll; // 20 MiB for testing
//const myint max_A = 20ll*1024ll*1024ll*1024ll; // 20 GiB in the real code
const myint n_threads = 1; // Number of threads
const myint prime = 1543; // Tested prime

char *A;
const myint max_n = 8*max_A;

inline char getA(myint n) { return A[n/8] & (1<<(n%8)); }
inline void setAtrue(myint n) { A[n/8] |= (1<<n%8); }

void run_thread(myint startpoint) {
    // Calculate all values of x^2 + 2y^2 + prime*z^2 up to max_n
    // We loop through x == startpoint (mod n_threads)
    for(myint x = startpoint; 1*x*x < max_n; x+=n_threads)
        for(myint y = 0; 1*x*x + 2*y*y < max_n; y++)
            for(myint z = 0; 1*x*x + 2*y*y + prime*z*z < max_n; z++)
                setAtrue(1*x*x + 2*y*y + prime*z*z);
}

int main() {
    myint n;

    // Only n_threads-1 threads, as we will use the master thread as well
    std::thread T[n_threads-1];

    // Initialize the array
    A = new char[max_A]();

    // Start the threads
    for(n = 0; n < n_threads-1; n++) T[n] = std::thread(run_thread, n);
    // We use also the master thread
    run_thread(n_threads-1);
    // Synchronize
    for(n = 0; n < n_threads-1; n++) T[n].join();

    // Print and count all elements not in the set and n != 0 (mod prime)
    myint cnt = 0;
    for(n=0; n<max_n; n++) if(( !getA(n) )&&( n%1543 != 0 )) {
        std::cout << n << std::endl;
        cnt++;
    }
    std::cout << "cnt = " << cnt << std::endl;

    return 0;
}
When n_threads = 1, I get the correct value cnt = 29289. When n_threads = 7, I got cnt = 29314 and cnt = 29321 on two different runs, suggesting that some of the bitwise operations on a single byte were racing.
std::atomic provides all the facilities that you need here:
std::array<std::atomic<char>, max_A> A;
static_assert(sizeof(A[0]) == 1, "Shall not have memory overhead");
static_assert(std::atomic<char>::is_always_lock_free,
              "No software-level locking needed on common platforms");

inline char getA(myint n)     { return A[n / 8] & (1 << (n % 8)); }
inline void setAtrue(myint n) { A[n / 8].fetch_or(1 << (n % 8)); }
The load in getA is atomic (equivalent to load()), and std::atomic even has built-in support for OR-ing the stored value with another one (fetch_or), atomically of course.
When initializing A, the naive way of for (auto& a : A) a = 0; would require synchronization after every store, which you can avoid by waiving some thread-safety. std::memory_order_release only requires that what we write is visible to other threads (not that other threads' writes are visible to us). And indeed, if you do
// Initialize the array
for (auto& a : A)
    a.store(0, std::memory_order_release);
you get the safety you need without any assembly-level synchronization on x86. You could do the reverse for the loads after the threads finish, but that has no added benefit on x86 (it's just a mov either way).
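As a further (hedged) refinement: since the worker threads only ever OR bits in, and the final reads happen after join() (which already synchronizes), relaxed ordering is sufficient for the updates themselves:

// The RMW stays atomic; only the ordering constraint is weakened.
inline void setAtrue(myint n) {
    A[n / 8].fetch_or(1 << (n % 8), std::memory_order_relaxed);
}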
Demo on the full code: https://godbolt.org/z/nLPlv1
I'm looking to convert a byte array of LSB,MSB pairs into an array of int.
Currently, I'm using a for loop and converting each pair of values individually:
void ConvertToInt(int OutArray[], byte InArray[], int InSize)
{
    for(int i = 0; i < InSize/2; i++)
    {
        int value = InArray[2*i] + (InArray[2*i+1] << 8);
        OutArray[i] = value;
    }
}
However, given that:
OutArray[] is created in the parent function for this specific purpose.
I don't need InArray[] after this operation
Is there a more efficient way to directly convert my byte array to an Int array?
Operate on the array as bytes, then assemble and place into integers:
void ConvertToInt(int OutArray[], byte InArray[], int InSize)
{
    int * p_output = &OutArray[0];
    for (int i = 0; i < InSize; ++i)
    {
        byte lsb = InArray[i++];
        byte msb = InArray[i];
        int value = (msb << 8) | lsb;
        *p_output++ = value;
    }
}
You may need to convert to larger integers, depending on the warning level:
for (size_t i = 0U; i < InSize; i += 2U)
{
    const byte lsb = InArray[i + 0U];
    const byte msb = InArray[i + 1U];
    const int lsb_as_int(static_cast<int>(lsb));
    const int msb_as_int(static_cast<int>(msb));

    *p_output++ = (msb_as_int * 256) + lsb_as_int;
}
In the above code, the promotion of byte to int is explicit. The variables are temporary and the compiler should simplify this (so don't worry about the temporary variables). Also, the temporary variables allow you to see the intermediate values when using a debugger.
Print out the assembly language generated by the compiler, in both debug and release (optimized) versions, before optimizing or panicking. A good compiler should optimize the loop contents to a few instructions.
If your machine is little endian, then this can be done in O(1) time and additional-memory complexity.
int16_t *ToInt(byte inArray[])
{
    return reinterpret_cast<int16_t*>(inArray);
}
If you either want the result in a new array, or your machine is big-endian, you must walk over all the elements. In that case the best you can get is O(n) time complexity.
The only way to get around that is to wrap the original array in an accessor class that converts between byte pairs and ints on the fly. Such a wrapper speeds up the first several accesses, but if at some point the whole array has to be read, the lazy evaluation will cost more than converting the array up front.
On the positive side, the wrapper costs only O(1) additional memory. Also, if you want to keep the array as bytes, you don't have to convert it at all.
class as_int {
public:
    class proxy
    {
    public:
        proxy& operator=(int16_t value)
        {
            pair_[0] = value & 255;
            pair_[1] = ((unsigned)value >> 8) & 255;
            return *this;
        }
        operator int16_t() ......
    private:
        proxy(byte* pair): pair_(pair) {}
        friend class as_int;
    };

    as_int(byte *arr, unsigned num_bytes)
      : arr_(arr), size_(num_bytes/2)
    {}

    int16_t operator[](unsigned i) const
    {
        assert(i < size_);
        byte *pair = arr_ + (i*2);
        return pair[0] + (pair[1]<<8);
    }

    proxy operator[](unsigned i)
    {
        assert(i < size_);
        return proxy(arr_ + (i*2));
    }
    ....
And the use is quite trivial:
as_int arr(InByteArray, InSize);
std::cout << arr[3] << '\n';
arr[5] = 30000;
arr[3] = arr[6] = 500;
I have an array of integers; let's assume they are of type int64_t. Now, I know that only the first n bits of every integer are meaningful (that is, I know they are limited by some bounds).
What is the most efficient way to convert the array so that all unnecessary space is removed (i.e. I have the first integer at a[0], the second one at a[0] + n bits, and so on)?
I would like it to be as general as possible, because n will vary from time to time, though I guess there might be smart optimizations for specific n, like powers of 2 and such.
Of course I know that I can just iterate value by value; I just want to ask you StackOverflowers whether you can think of some more clever way.
Edit:
This question is not about compressing the array to take as little space as possible. I just need to "cut" n bits from every integer, and given the array I know the exact number n of bits I can safely cut.
Today I released: PackedArray: Packing Unsigned Integers Tightly (github project).
It implements a random access container where items are packed at the bit level. In other words, it acts as if you were able to manipulate, say, a uint9_t or uint17_t array:
PackedArray principle:
. compact storage of <= 32 bits items
. items are tightly packed into a buffer of uint32_t integers
PackedArray requirements:
. you must know in advance how many bits are needed to hold a single item
. you must know in advance how many items you want to store
. when packing, behavior is undefined if items have more than bitsPerItem bits
PackedArray general in memory representation:
|-------------------------------------------------- - - -
| b0 | b1 | b2 |
|-------------------------------------------------- - - -
| i0 | i1 | i2 | i3 | i4 | i5 | i6 | i7 | i8 | i9 |
|-------------------------------------------------- - - -
. items are tightly packed together
. several items end up inside the same buffer cell, e.g. i0, i1, i2
. some items span two buffer cells, e.g. i3, i6
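The core read operation behind such a container can be sketched like this (my own illustration of the principle, not PackedArray's actual API; it assumes one extra uint32_t of padding at the end of the buffer so the 64-bit window never reads out of bounds):

#include <cstdint>
#include <cstddef>

// Read item i of width `bits` (1..32) from a tightly packed uint32_t buffer.
static uint32_t packed_get(const uint32_t* buf, size_t i, unsigned bits)
{
    const uint64_t bitpos = (uint64_t)i * bits;
    const size_t   cell   = (size_t)(bitpos / 32);
    const unsigned shift  = (unsigned)(bitpos % 32);
    // Load 64 bits so items spanning two cells (like i3 and i6 above) still work.
    const uint64_t window = buf[cell] | ((uint64_t)buf[cell + 1] << 32);
    const uint32_t mask   = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1);
    return (uint32_t)(window >> shift) & mask;
}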
I agree with keraba that you need to use something like Huffman coding or perhaps the Lempel-Ziv-Welch algorithm. The problem with bit-packing the way you are talking about is that you have two options:
Pick a constant n such that the largest integer can be represented.
Allow n to vary from value to value.
The first option is relatively easy to implement, but is really going to waste a lot of space unless all integers are rather small.
The second option has the major disadvantage that you have to convey changes in n somehow in the output bitstream. For instance, each value will have to have a length associated with it. This means you are storing two integers (albeit smaller integers) for every input value. There's a good chance you'll increase the file size with this method.
The advantage of Huffman or LZW is that they create codebooks in such a way that the length of the codes can be derived from the output bitstream without actually storing the lengths. These techniques allow you to get very close to the Shannon limit.
I decided to give your original idea (constant n, remove unused bits and pack) a try for fun and here is the naive implementation I came up with:
#include <sys/types.h>
#include <stdio.h>

int pack(int64_t* input, int nin, void* output, int n)
{
    int64_t inmask = 0;
    unsigned char* pout = (unsigned char*)output;
    int obit = 0;
    int nout = 0;
    *pout = 0;

    for(int i=0; i<nin; i++)
    {
        inmask = (int64_t)1 << (n-1);
        for(int k=0; k<n; k++)
        {
            if(obit>7)
            {
                obit = 0;
                pout++;
                *pout = 0;
            }
            *pout |= (((input[i] & inmask) >> (n-k-1)) << (7-obit));
            inmask >>= 1;
            obit++;
            nout++;
        }
    }
    return nout;
}

int unpack(void* input, int nbitsin, int64_t* output, int n)
{
    unsigned char* pin = (unsigned char*)input;
    int64_t* pout = output;
    int nbits = nbitsin;
    unsigned char inmask = 0x80;
    int inbit = 0;
    int nout = 0;

    while(nbits > 0)
    {
        *pout = 0;
        for(int i=0; i<n; i++)
        {
            if(inbit > 7)
            {
                pin++;
                inbit = 0;
            }
            *pout |= ((int64_t)((*pin & (inmask >> inbit)) >> (7-inbit))) << (n-i-1);
            inbit++;
        }
        pout++;
        nbits -= n;
        nout++;
    }
    return nout;
}

int main()
{
    int64_t input[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20};
    int64_t output[21];
    unsigned char compressed[21*8];
    int n = 5;

    int nbits = pack(input, 21, compressed, n);
    int nout = unpack(compressed, nbits, output, n);

    for(int i=0; i<=20; i++)
        printf("input: %lld output: %lld\n", input[i], output[i]);
}
This is very inefficient because it steps one bit at a time, but that was the easiest way to implement it without dealing with endianness issues. I have not tested it with a wide range of values, just the ones in the test. Also, there is no bounds checking, and it is assumed that the output buffers are long enough. So what I am saying is that this code is probably only good for educational purposes to get you started.
Most any compression algorithm will get close to the minimum entropy needed to encode the integers, for example, Huffman coding, but accessing it like an array will be non-trivial.
Starting from Jason B's implementation, I eventually wrote my own version which processes bit blocks instead of single bits. One difference is that it is LSB-first: it starts from the lowest output bits and moves toward the highest. This only makes it harder to read with a binary dump, like Linux xxd -b. As a detail, int* can be trivially changed to int64_t*, and it should preferably even be unsigned. I have already tested this version with a few million arrays and it seems solid, so I'll share it with the rest:
#include <algorithm> // for std::min

int pack2(int *input, int nin, unsigned char* output, int n)
{
    int obit = 0;
    int ibit = 0;
    int ibite = 0;
    int nout = 0;
    if(nin > 0) output[0] = 0;

    for(int i=0; i<nin; i++)
    {
        ibit = 0;
        while(ibit < n) {
            ibite = std::min(n, ibit + 8 - obit);
            output[nout] |= (input[i] & (((1 << ibite)-1) ^ ((1 << ibit)-1))) >> ibit << obit;
            obit += ibite - ibit;
            nout += obit >> 3;
            if(obit & 8) output[nout] = 0;
            obit &= 7;
            ibit = ibite;
        }
    }
    return nout;
}

int unpack2(int *oinput, int nin, unsigned char* ioutput, int n)
{
    int obit = 0;
    int ibit = 0;
    int ibite = 0;
    int nout = 0;

    for(int i=0; i<nin; i++)
    {
        oinput[i] = 0;
        ibit = 0;
        while(ibit < n) {
            ibite = std::min(n, ibit + 8 - obit);
            oinput[i] |= (ioutput[nout] & (((1 << (ibite-ibit+obit))-1) ^ ((1 << obit)-1))) >> obit << ibit;
            obit += ibite - ibit;
            nout += obit >> 3;
            obit &= 7;
            ibit = ibite;
        }
    }
    return nout;
}
I know this might seem like the obvious thing to say, and I'm sure there's actually a solution, but why not use a smaller type, like uint8_t (max 255) or uint16_t (max 65535)? I'm sure you could bit-manipulate an int64_t using defined values, OR operations, and the like, but, aside from an academic exercise, why?
And on the note of academic exercises, Bit Twiddling Hacks is a good read.
If you have fixed sizes, e.g. you know your number is 38 bits rather than 64, you can build structures using bit-field specifications, assuming you also have smaller elements to fit in the remaining space:

struct example {
    /* 64bit number cut into 3 different sized sections */
    uint64_t big_num:38;
    uint64_t small_num:16;
    uint64_t itty_num:10;

    /* 8 bit number cut in two */
    uint8_t nibble_A:4;
    uint8_t nibble_B:4;
};

This isn't big/little-endian safe without some hoop-jumping, so it can only be used within a program rather than in an exported data format. It's quite often used to store boolean values in single bits without defining shifts and masks.
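Usage is then plain member access; for instance (illustrative values):

example e{};                    // zero-initialize all fields
e.big_num   = (1ull << 38) - 1; // largest value a 38-bit field can hold
e.small_num = 65535;
e.nibble_A  = 0xA;
e.nibble_B  = 0x5;
// sizeof(example) is implementation-defined, but typically 9-16 bytes here.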
I don't think you can avoid iterating across the elements.
AFAIK Huffman encoding requires the frequencies of the "symbols", which, unless you know the statistics of the "process" generating the integers, you will have to compute (by iterating across every element).
I have a byte array generated by a random number generator. I want to put this into the STL bitset.
Unfortunately, it looks like Bitset only supports the following constructors:
A string of 1's and 0's like "10101011"
An unsigned long. (my byte array will be longer)
The only solution I can think of now is to read the byte array bit by bit and make a string of 1's and 0's. Does anyone have a more efficient solution?
Something like this?
#include <bitset>
#include <climits>
#include <cstdint>

template<size_t numBytes>
std::bitset<numBytes * CHAR_BIT> bytesToBitset(uint8_t *data)
{
    std::bitset<numBytes * CHAR_BIT> b;

    for(size_t i = 0; i < numBytes; ++i)
    {
        uint8_t cur = data[i];
        size_t offset = i * CHAR_BIT;

        for(int bit = 0; bit < CHAR_BIT; ++bit)
        {
            b[offset] = cur & 1;
            ++offset;   // Move to next bit in b
            cur >>= 1;  // Move to next bit in array
        }
    }

    return b;
}
And an example usage:
#include <array>
#include <iostream>

int main()
{
    std::array<uint8_t, 4> bytes = { 0xDE, 0xAD, 0xBE, 0xEF };
    auto bits = bytesToBitset<bytes.size()>(bytes.data());
    std::cout << bits << std::endl;
}
There's a third constructor for bitset<>: it takes no parameters and sets all the bits to 0. I think you'll need to use that, then walk through the array calling set() for each bit in the byte array that's a 1.
A bit brute-force, but it'll work. There will be a bit of complexity in converting the byte index and bit offset within each byte to a bitset index, but it's nothing a little thought (and maybe a run under the debugger) won't solve. I think it's most likely simpler and more efficient than trying to run the array through a string conversion or a stream.
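A minimal sketch of that set()-based approach (the name bytesToBitsetSet and the LSB-first bit order within each byte are my choices):

#include <bitset>
#include <climits>
#include <cstdint>
#include <cstddef>

template <size_t numBytes>
std::bitset<numBytes * CHAR_BIT> bytesToBitsetSet(const uint8_t* data)
{
    std::bitset<numBytes * CHAR_BIT> b; // default-constructed: all zeros
    for (size_t i = 0; i < numBytes; ++i)
        for (size_t bit = 0; bit < CHAR_BIT; ++bit)
            if (data[i] & (1u << bit))
                b.set(i * CHAR_BIT + bit);
    return b;
}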
I have spent a lot of time writing a reverse function (bitset -> byte/char array). Here it is:

bitset<SIZE> data = ...

// bitset to char array (assumes SIZE is a multiple of CHAR_BIT)
char buf[SIZE / CHAR_BIT] = {};
char current = 0;
int offset = 0;

for (int i = 0; i < SIZE; ++i) {
    if (data[i]) { // if bit is true
        current |= (char)(1 << (i - offset * CHAR_BIT)); // set that bit in the current byte
    } // otherwise leave it false
    if ((i + 1) % CHAR_BIT == 0) { // every CHAR_BIT bits...
        buf[offset++] = current; // ...save the byte to the buffer and advance the offset
        current = 0; // clear the byte
    }
}

// now we have the result in "buf" (the final number of bytes in the buffer is "offset")
Here is my implementation using template meta-programming.
Loops are done at compile time.
I took #strager's version and modified it in order to prepare it for TMP:
changed the order of iteration (so that I could make recursion from it);
reduced the number of variables used.
Modified version with loops at run time:
template <size_t nOfBytes>
void bytesToBitsetRunTimeOptimized(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) {
    for(int i = nOfBytes - 1; i >= 0; --i) {
        for(int bit = 0; bit < CHAR_BIT; ++bit) {
            result[i * CHAR_BIT + bit] = ((arr[i] >> bit) & 1);
        }
    }
}
TMP version based on it:
template<size_t nOfBytes, int I, int BIT> struct LoopOnBIT {
    static inline void bytesToBitset(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) {
        result[I * CHAR_BIT + BIT] = ((arr[I] >> BIT) & 1);
        LoopOnBIT<nOfBytes, I, BIT+1>::bytesToBitset(arr, result);
    }
};

// stop case for LoopOnBIT
template<size_t nOfBytes, int I> struct LoopOnBIT<nOfBytes, I, CHAR_BIT> {
    static inline void bytesToBitset(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) { }
};

template<size_t nOfBytes, int I> struct LoopOnI {
    static inline void bytesToBitset(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) {
        LoopOnBIT<nOfBytes, I, 0>::bytesToBitset(arr, result);
        LoopOnI<nOfBytes, I-1>::bytesToBitset(arr, result);
    }
};

// stop case for LoopOnI
template<size_t nOfBytes> struct LoopOnI<nOfBytes, -1> {
    static inline void bytesToBitset(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) { }
};

template <size_t nOfBytes>
void bytesToBitset(uint8_t* arr, std::bitset<nOfBytes * CHAR_BIT>& result) {
    LoopOnI<nOfBytes, nOfBytes - 1>::bytesToBitset(arr, result);
}
client code:
uint8_t arr[]={0x6A};
std::bitset<8> b;
bytesToBitset<1>(arr,b);
Well, let's be honest, I was bored and started to think there had to be a slightly faster way than setting each bit.
template<int numBytes>
std::bitset<numBytes * CHAR_BIT> bytesToBitset(byte *data)
{
    std::bitset<numBytes * CHAR_BIT> b = *data;

    for(int i = 1; i < numBytes; ++i)
    {
        b <<= CHAR_BIT; // Make room for the next byte
        b |= data[i];   // Set the lowest CHAR_BIT bits
    }

    return b;
}

This is indeed slightly faster, at least as long as the byte array is smaller than about 30 elements (depending on the optimization flags passed to the compiler). For larger arrays, the time spent shifting the bitset makes setting each bit individually faster.
You can initialize the bitset from a stream. I can't remember how to wrangle a byte[] into a stream, but...
From http://www.sgi.com/tech/stl/bitset.html:
bitset<12> x;
cout << "Enter a 12-bit bitset in binary: " << flush;
if (cin >> x) {
    cout << "x = " << x << endl;
    cout << "As ulong: " << x.to_ulong() << endl;
    cout << "And with mask: " << (x & mask) << endl;
    cout << "Or with mask: " << (x | mask) << endl;
}
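One hedged way to fill that gap: render the bytes as a '0'/'1' string (the helper name bytesToBinaryString is my own) and then use either the string constructor or stream extraction:

#include <bitset>
#include <sstream>
#include <string>
#include <cstdint>
#include <cstddef>

std::string bytesToBinaryString(const uint8_t* data, size_t n)
{
    std::string s;
    s.reserve(n * 8);
    for (size_t i = 0; i < n; ++i)          // MSB-first within each byte,
        for (int bit = 7; bit >= 0; --bit)  // matching "10101011"-style input
            s += ((data[i] >> bit) & 1) ? '1' : '0';
    return s;
}

// std::bitset<32> b(bytesToBinaryString(bytes, 4));        // string constructor
// std::istringstream(bytesToBinaryString(bytes, 4)) >> b;  // or via a stream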