Reading bits performance

I'm writing a helper class which I intend to use for reading bits in reverse from a data block.
I tried doing an optimization where I used "rol" instructions for masking the data. However, to my surprise this is actually slower than creating a new bitmask during each access.
class reverse_bit_reader
{
public:
static const size_t bits_per_block = sizeof(unsigned long)*8;
static const size_t high_bit = 1 << (bits_per_block-1);
reverse_bit_reader(const void* data, size_t size)
: data_(reinterpret_cast<const unsigned long*>(data))
, index_(size-1)
{
// Bits are stored in left to right order, potentially ignore the last bits
size_t last_bit_index = index_ % bits_per_block;
bit_mask_ = high_bit >> (last_bit_index+1);
if(bit_mask_ == 0)
bit_mask_ = high_bit;
}
bool next_bit1()
{
return get_bit(index_--);
}
bool next_bit2() // Why is next_bit1 faster?
{
__asm // Rotate bit_mask.
{
mov eax, [ecx+0];
rol eax, 1;
mov [ecx+0], eax;
}
return data_[index_-- / bits_per_block] & bit_mask_;
}
bool eof() const{return index_ < 0;}
private:
bool get_bit(size_t index) const
{
const size_t block_index = index / bits_per_block;
const size_t bit_index = index % bits_per_block;
const size_t bit_mask = high_bit >> bit_index;
return data_[block_index] & bit_mask;
}
unsigned long bit_mask_;
int index_;
const unsigned long* data_;
};
Can anyone explain why next_bit1 is faster than next_bit2?

If you're going to be reading bits sequentially out of longs, starting from the most significant bit, and you want it to be as fast as possible, could you do something along these lines?
#define GETBIT ((theBit = (theLong < 0)), (theLong <<= 1), theBit)
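A minimal sketch of the same idea written as a small struct instead of a macro. It assumes the word is held in an unsigned 32-bit type, so the left shift is well defined (shifting a negative signed value left is undefined before C++20). The names msb_reader and next_bit are hypothetical, not from the question.

```cpp
#include <cstdint>

// Hypothetical helper: hands out the bits of one word, most significant first.
struct msb_reader {
    std::uint32_t word;  // remaining bits; the next bit to read sits in the top position

    bool next_bit() {
        const std::uint32_t top = std::uint32_t(1) << 31;
        const bool bit = (word & top) != 0;  // test the current top bit
        word <<= 1;                          // move the following bit into the top position
        return bit;
    }
};
```

Each call costs one mask test and one shift, with no division or per-call mask construction. The caller reloads word from the data block after 32 reads.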

Related

Efficient multi-row vector

I need an efficient implementation of a vector with multiple rows, each having the same number of columns, which is not too ugly in C++. Currently I have the following:
class BaseVector {
protected: // variables
int64_t _capacity;
int64_t _nColumns;
protected:
template<typename taItem> void Allocate(taItem * &p, const int64_t nItems) {
p = static_cast<taItem*>(MemPool::Instance().Acquire(sizeof(taItem)*nItems));
if (p == nullptr) {
__debugbreak();
}
}
template<typename taItem> void Reallocate(taItem * &p, const int64_t newCap) {
taItem *np;
Allocate(np, newCap);
Utils::AlignedNocachingCopy(np, p, _nColumns * sizeof(taItem));
MemPool::Instance().Release(p, _capacity * sizeof(taItem));
p = np;
}
// Etc for Release() operation
public:
explicit BaseVector(const int64_t initCap) : _capacity(initCap), _nColumns(0) { }
void Clear() { _nColumns = 0; }
int64_t Size() const { return _nColumns; }
};
class DerivedVector : public BaseVector {
__m256d *_pRowA;
__m256i *_pRowB;
uint64_t *_pRowC;
uint8_t *_pRowD;
// Etc. for other rows
public:
DerivedVector(const int64_t nColumns) : BaseVector(nColumns) {
Allocate(_pRowA, nColumns);
Allocate(_pRowB, nColumns);
Allocate(_pRowC, nColumns);
Allocate(_pRowD, nColumns);
// Etc. for the other rows
}
void IncSize() {
if(_nColumns >= _capacity) {
const int64_t newCap = _capacity + (_capacity >> 1) + 1;
Reallocate(_pRowA, newCap);
Reallocate(_pRowB, newCap);
Reallocate(_pRowC, newCap);
Reallocate(_pRowD, newCap);
// Etc. for other rows
_capacity = newCap;
}
_nColumns++;
}
~DerivedVector() {
// Call here the Release() operation for all rows
}
};
The problem with this approach is that there can be 30 rows, so I have to type manually (and repeat myself) 30 times Allocate, 30 times Reallocate, 30 times Release, etc.
So is there a way in C++ to keep this code DRY and fast? I am ok with macros, but not heavy polymorphism in each access to a cell in the vector because this would kill performance.

undefined symbol: vtable for SomeClass

Just ran into these 2 Clang errors:
ld.lld: error: undefined symbol: vtable for HashFn
>>> referenced by hashFn.h:40 ......
HashFn::HashFn(int, bool)
And
ld.lld: error: undefined symbol: vtable for HashFn
>>> referenced by hashFn.h:36 ......
HashFn::HashFn(HashFn const&)
hashFn.h -->
#ifndef HASHFN_H_
#define HASHFN_H_
#include "./base.h"
typedef uint64_t uint64Array[30];
static int precomputedArraySize = sizeof(uint64Array) / sizeof(uint64_t);
inline uint64_t customPow(uint64Array *precomputedPowers, bool usePrecomputed,
uint64_t base, int exp) {
if (usePrecomputed && exp < precomputedArraySize) {
return (*precomputedPowers)[exp];
}
// TODO: Optimization possible here when passed in toSize which is bigger
// than precomputedArraySize, we can start from the value of the last
// precomputed value.
uint64_t result = 1;
while (exp) {
if (exp & 1)
result *= base;
exp >>= 1;
base *= base;
}
return result;
}
// Functor for a hashing function
// Implements a Rabin fingerprint hash function
class HashFn {
public:
// Initialize a HashFn with the prime p which is used as the base of the Rabin
// fingerprint algorithm
explicit HashFn(int p, bool precompute = true) {
this->p = p;
this->precompute = precompute;
if (precompute) {
uint64_t result = 1;
for (int i = 0; i < precomputedArraySize; i++) {
precomputedPowers[i] = result;
result *= p;
}
}
}
//virtual ~HashFn(){}
~HashFn(){}
virtual uint64_t operator()(const char *input, int len,
unsigned char lastCharCode, uint64_t lastHash);
virtual uint64_t operator()(const char *input, int len);
private:
int p;
bool precompute;
uint64Array precomputedPowers;
};
#endif // HASHFN_H_
hashFn.cc -->
#include "hashFn.h"
uint64_t HashFn::operator()(const char *input, int len,
unsigned char lastCharCode, uint64_t lastHash) {
// See the abracadabra example:
// https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
return (lastHash - lastCharCode *
customPow(&precomputedPowers, precompute, p, len - 1)) *
p + input[len - 1];
}
uint64_t HashFn::operator()(const char *input, int len) {
uint64_t total = 0;
for (int i = 0; i < len; i++) {
total += input[i] *
customPow(&precomputedPowers, precompute, p, len - i - 1);
}
return total;
}
There is already a derived class from HashFn:
class HashFn2Byte : public HashFn {
public:
HashFn2Byte() : HashFn(0, false) {
}
uint64_t operator()(const char *input, int len,
unsigned char lastCharCode, uint64_t lastHash) override;
uint64_t operator()(const char *input, int len) override;
};
......
What went wrong? I vaguely understand that this has something to do with the virtual destructor, but I'm not sure why the vtable is undefined (if I define everything I declare, shouldn't the vtable be emitted automatically?).
Also, I'm now working in Chromium, so I don't know how the fact that all files are compiled into a "jumbo" object affects the result. The standalone version of this code (a Node native module) compiles and runs normally.
Any input is appreciated! Thanks.

BloomFilter in C++ using the MurmurHash3 hash function

I am trying to code a C++ implementation of a Bloom filter using the MurmurHash3 hash function. My implementation is based on this site: http://blog.michaelschmatz.com/2016/04/11/how-to-write-a-bloom-filter-cpp/
Somehow, in my BloomFilter header file, the hash function throws an incomplete type error. Also, when I use the hash function inside the add function, I get a "hash is ambiguous" error.
What can I do to fix this? I am somewhat new to C++, so I'm not sure whether I am using the interface/implementation of a structure correctly.
I am also using a main function that includes this file and runs some tests to analyze the false positive rate, number of bits, filter size, etc.
#ifndef BLOOM_FILTER_H
#define BLOOM_FILTER_H
#include "MurmurHash3.h"
#include <vector>
//basic structure of a bloom filter object
struct BloomFilter {
BloomFilter(uint64_t size, uint8_t numHashes);
void add(const uint8_t *data, std::size_t len);
bool possiblyContains(const uint8_t *data, std::size_t len) const;
private:
uint8_t m_numHashes;
std::vector<bool> m_bits;
};
//Bloom filter constructor
BloomFilter::BloomFilter(uint64_t size, uint8_t numHashes)
: m_bits(size),
m_numHashes(numHashes) {}
//Hash array created using the MurmurHash3 code
std::array<uint64_t, 2> hash(const uint8_t *data, std::size_t len)
{
std::array<uint64_t, 2> hashValue;
MurmurHash3_x64_128(data, len, 0, hashValue.data());
return hashValue;
}
//Hash array created using the MurmurHash3 code
inline uint64_t nthHash(uint8_t n,
uint64_t hashA,
uint64_t hashB,
uint64_t filterSize) {
return (hashA + n * hashB) % filterSize;
}
//Adds an element to the array
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = hash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())] = true;
}
}
//Returns true or false based on a probabilistic assessment of the array using MurmurHash3
bool BloomFilter::possiblyContains(const uint8_t *data, std::size_t len) const {
auto hashValues = hash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
if (!m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())])
{
return false;
}
}
return true;
}
#endif

If your MurmurHash3_x64_128 returns two 64-bit numbers as a hash value, I'd treat that as 4 distinct uint32_t hashes as long as you don't need more than 4 billion bits in your bit string. Most likely you don't need more than 2-3 hashes, but that depends on your use case. To figure out how many hashes you need, you can check "How many hash functions does my bloom filter need?".
Using MurmurHash3_x64_128 I'd do it this way (if I were to treat it as 4 x uint32_t hashes):
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = hash(data, len);
uint32_t* hx = reinterpret_cast<uint32_t*>(&hashValues[0]);
assert(m_numHashes <= 4);
for (int n = 0; n < m_numHashes; n++)
m_bits[hx[n] % m_bits.size()] = true;
}
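As a hedged alternative to the pointer cast above, the four 32-bit values can be extracted with shifts. This avoids reading uint64_t storage through a uint32_t pointer and gives the same order on any byte order. The helper name split_hash is hypothetical.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: view a 128-bit hash (two 64-bit halves) as four 32-bit values.
inline std::array<std::uint32_t, 4> split_hash(const std::array<std::uint64_t, 2>& h) {
    return {{
        static_cast<std::uint32_t>(h[0]),        // low 32 bits of the first half
        static_cast<std::uint32_t>(h[0] >> 32),  // high 32 bits of the first half
        static_cast<std::uint32_t>(h[1]),        // low 32 bits of the second half
        static_cast<std::uint32_t>(h[1] >> 32),  // high 32 bits of the second half
    }};
}
```

Inside add, the loop body would then index the returned array instead of hx.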
Your code has some type conversion issues, which is why it didn't compile:
missing #include <array>
you have to use size_t for the size (it might be a 32-bit or a 64-bit unsigned int)
it's better to rename your hash function to something else (e.g. myhash) and make it static.
Here's a version of your code with these corrections, which should work:
#ifndef BLOOM_FILTER_H
#define BLOOM_FILTER_H
#include "MurmurHash3.h"
#include <vector>
#include <array>
//basic structure of a bloom filter object
struct BloomFilter {
BloomFilter(size_t size, uint8_t numHashes);
void add(const uint8_t *data, std::size_t len);
bool possiblyContains(const uint8_t *data, std::size_t len) const;
private:
uint8_t m_numHashes;
std::vector<bool> m_bits;
};
//Bloom filter constructor
BloomFilter::BloomFilter(size_t size, uint8_t numHashes)
: m_bits(size),
m_numHashes(numHashes) {}
//Hash array created using the MurmurHash3 code
static std::array<uint64_t, 2> myhash(const uint8_t *data, std::size_t len)
{
std::array<uint64_t, 2> hashValue;
MurmurHash3_x64_128(data, len, 0, hashValue.data());
return hashValue;
}
//Derives the n-th hash from the two 64-bit halves of the MurmurHash3 output
inline size_t nthHash(int n,
uint64_t hashA,
uint64_t hashB,
size_t filterSize) {
return (hashA + n * hashB) % filterSize; // <- not sure if that is OK, perhaps it is.
}
//Adds an element to the array
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = myhash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())] = true;
}
}
//Returns true or false based on a probabilistic assessment of the array using MurmurHash3
bool BloomFilter::possiblyContains(const uint8_t *data, std::size_t len) const {
auto hashValues = myhash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
if (!m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())])
{
return false;
}
}
return true;
}
#endif
Run this code on ideone.
If you are just starting with C++, start with a basic example first; maybe try std::hash? Create a working implementation, then extend it with an optional hash function parameter. If you need your BloomFilter to be fast, I'd probably stay away from vector<bool> and use an array of unsigned ints instead.
A basic implementation could look something like this, provided that you have MurmurHash3 implemented:
uint32_t MurmurHash3(const char *str, size_t len);
class BloomFilter
{
public:
BloomFilter(int count_elements = 0, double bits_per_element = 10)
{
mem = NULL;
init(count_elements, bits_per_element);
}
~BloomFilter()
{
delete[] mem;
}
void init(int count_elements, double bits_per_element)
{
assert(!mem);
sz = (uint32_t)(count_elements*bits_per_element + 0.5);
mem = new uint8_t[sz / 8 + 8](); // value-initialized so every bit starts cleared
}
void add(const std::string &str)
{
add(str.data(), str.size());
}
void add(const char *str, size_t len)
{
if (len <= 0)
return;
add(MurmurHash3(str, len));
}
bool test(const std::string &str)
{
return test(str.data(), str.size());
}
bool test(const char *str, size_t len)
{
return test_hash(MurmurHash3(str, len));
}
bool test_hash(uint32_t h)
{
h %= sz;
if (0 != (mem[h / 8] & (1u << (h % 8))))
return true;
return false;
}
int mem_size() const
{
return (sz + 7) / 8;
}
private:
void add(uint32_t h)
{
h %= sz;
mem[h / 8] |= (1u << (h % 8));
}
public:
uint32_t sz;
uint8_t *mem;
};

C++: Strict aliasing vs union abuse

Apologies in advance for what may be a silly first post on well-trodden ground. While there is plenty of material on the subject, very little of it is definitive and/or intelligible to me.
I have an AlignedArray template class to dynamically allocate memory on the heap with arbitrary alignment (I need 32-byte alignment for AVX assembly routines). This requires some ugly pointer manipulation.
Agner Fog provides a sample class in cppexamples.zip that abuses a union to do so (http://www.agner.org/optimize/optimization_manuals.zip). However, I know that writing to one member of a union and then reading from another results in UB.
AFAICT it is safe to alias any pointer type to a char *, but only in one direction. This is where my understanding gets fuzzy. Here's an abridged version of my AlignedArray class (essentially a rewrite of Agner's, to help my understanding):
template <typename T, size_t alignment = 32>
class AlignedArray
{
size_t m_size;
char * m_unaligned;
T * m_aligned;
public:
AlignedArray (size_t const size)
: m_size(0)
, m_unaligned(0)
, m_aligned(0)
{
this->size(size);
}
~AlignedArray ()
{
this->size(0);
}
T const & operator [] (size_t const i) const { return m_aligned[i]; }
T & operator [] (size_t const i) { return m_aligned[i]; }
size_t const size () { return m_size; }
void size (size_t const size)
{
if (size > 0)
{
if (size != m_size)
{
char * unaligned = 0;
unaligned = new char [size * sizeof(T) + alignment - 1];
if (unaligned)
{
// Agner:
/*
union {
char * c;
T * t;
size_t s;
} aligned;
aligned.c = unaligned + alignment - 1;
aligned.s &= ~(alignment - 1);
*/
// Me:
T * aligned = reinterpret_cast<T *>((reinterpret_cast<size_t>(unaligned) + alignment - 1) & ~(alignment - 1));
if (m_unaligned)
{
// Agner:
//memcpy(aligned.c, m_aligned, std::min(size, m_size));
// Me:
memcpy(aligned, m_aligned, std::min(size, m_size));
delete [] m_unaligned;
}
m_size = size;
m_unaligned = unaligned;
// Agner:
//m_aligned = aligned.t;
// Me:
m_aligned = aligned;
}
return;
}
return;
}
if (m_unaligned)
{
delete [] m_unaligned;
m_size = 0;
m_unaligned = 0;
m_aligned = 0;
}
}
};
So which method is safe(r)?

I have code that implements the (replacement) new and delete operators, suitable for SIMD (i.e., SSE / AVX). It uses the following functions that you might find useful:
static inline void *G0__SIMD_malloc (size_t size)
{
constexpr size_t align = G0_SIMD_ALIGN;
void *ptr, *uptr;
static_assert(G0_SIMD_ALIGN >= sizeof(void *),
"insufficient alignment for pointer storage");
static_assert((G0_SIMD_ALIGN & (G0_SIMD_ALIGN - 1)) == 0,
"G0_SIMD_ALIGN value must be a power of (2)");
size += align; // raw pointer storage with alignment padding.
if ((uptr = malloc(size)) == nullptr)
return nullptr;
// size_t addr = reinterpret_cast<size_t>(uptr);
uintptr_t addr = reinterpret_cast<uintptr_t>(uptr);
ptr = reinterpret_cast<void *>
((addr + align) & ~(align - 1));
*(reinterpret_cast<void **>(ptr) - 1) = uptr; // (raw ptr)
return ptr;
}
static inline void G0__SIMD_free (void *ptr)
{
if (ptr != nullptr)
free(*(reinterpret_cast<void **>(ptr) - 1)); // (raw ptr)
}
This should be easy to adapt. Obviously you would replace malloc and free, since you're using the global new and delete for raw (char) storage. The address arithmetic uses uintptr_t from <cstdint>; size_t (the commented-out line) is wide enough in practice, but uintptr_t is more correct.

To answer your question, both of those methods are just as safe. The only two operations that are really stinky there are the cast to size_t and new char[stuff]. You should at least be using uintptr_t from <cstdint> for the first. The second operation creates your only pointer aliasing issue as technically the char constructor is run on each char element and that constitutes accessing the data through the char pointer. You should use malloc instead.
The other supposed 'pointer aliasing' isn't an issue. And that's because other than the new operation you aren't accessing any data through the aliased pointers. You are only accessing data through the T * you get after alignment.
Of course, you have to remember to construct all of your array elements. This is true even in your version. Who knows what kind of T people will put there. And, of course, if you do that, you'll have to remember to call their destructors, and have to remember to handle exceptions when you copy them (memcpy doesn't cut it).
If your standard library provides one particular C++11 feature, you do not need to do the pointer arithmetic by hand. C++11 has a function specifically for aligning pointers to arbitrary boundaries. The interface is a little funky, but it should do the job. The call is ::std::align, defined in <memory>. Thanks to R. Martinho Fernandes for pointing it out.
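A minimal sketch of how std::align can be called, assuming a caller-provided raw buffer. The wrapper name align_within and its parameters are hypothetical; only std::align itself comes from <memory>. The alignment must be a power of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: returns the first address inside [buf, buf + space) that is
// aligned to `alignment` and still leaves room for `bytes`, or nullptr if none fits.
inline void* align_within(void* buf, std::size_t space,
                          std::size_t alignment, std::size_t bytes) {
    // std::align takes the pointer and the space by reference and adjusts both in
    // place; here they are local copies, so the caller's values are left untouched.
    return std::align(alignment, bytes, buf, space);
}
```

In an AlignedArray-style class, buf would be the pointer returned by malloc and space the over-allocated size (size * sizeof(T) + alignment - 1).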
Here is a version of your class with the suggested fixes:
#include <cstdint> // For uintptr_t
#include <cstdlib> // For malloc and free
#include <new> // For placement new
#include <algorithm>
template <typename T, size_t alignment = 32>
class AlignedArray
{
size_t m_size;
void * m_unaligned;
T * m_aligned;
public:
AlignedArray (size_t const size)
: m_size(0)
, m_unaligned(0)
, m_aligned(0)
{
this->size(size);
}
~AlignedArray ()
{
this->size(0);
}
T const & operator [] (size_t const i) const { return m_aligned[i]; }
T & operator [] (size_t const i) { return m_aligned[i]; }
size_t size() const { return m_size; }
void size (size_t const size)
{
using ::std::uintptr_t;
using ::std::malloc;
if (size > 0)
{
if (size != m_size)
{
void * unaligned = 0;
unaligned = malloc(size * sizeof(T) + alignment - 1);
if (unaligned)
{
T * aligned = reinterpret_cast<T *>((reinterpret_cast<uintptr_t>(unaligned) + alignment - 1) & ~(alignment - 1));
if (m_unaligned)
{
::std::size_t constructed = 0;
const ::std::size_t num_to_copy = ::std::min(size, m_size);
try {
for (constructed = 0; constructed < num_to_copy; ++constructed) {
new(aligned + constructed) T(m_aligned[constructed]);
}
for (; constructed < size; ++constructed) {
new(aligned + constructed) T;
}
} catch (...) {
for (::std::size_t i = 0; i < constructed; ++i) {
aligned[i].T::~T();
}
::std::free(unaligned);
throw;
}
for (size_t i = 0; i < m_size; ++i) {
m_aligned[i].T::~T();
}
::std::free(m_unaligned);
}
m_size = size;
m_unaligned = unaligned;
m_aligned = aligned;
}
}
} else if (m_unaligned) { // and size <= 0
for (::std::size_t i = 0; i < m_size; ++i) {
m_aligned[i].T::~T();
}
::std::free(m_unaligned);
m_size = 0;
m_unaligned = 0;
m_aligned = 0;
}
}
};

C++ class to access bytes/words of an unsigned integer

union LowLevelNumber
{
unsigned int n;
struct
{
unsigned int lowByte : 8;
unsigned int highByte : 8;
unsigned int upperLowByte : 8;
unsigned int upperHighByte : 8;
} bytes;
struct
{
unsigned int lowWord : 16;
unsigned int highWord : 16;
} words;
};
This union allows me to access the unsigned integer byte or word-wise.
However, the code looks rather ugly:
var.words.lowWord = 0x66;
Is there a way which would allow me to write code like this:
var.lowWord = 0x66;
Update:
This is really about writing short / beautiful code as in the example above. The union solution itself does work, I just don't want to write .words or .bytes every time I access lowWord or lowByte.

union LowLevelNumber {
unsigned int n;
struct {
unsigned int lowByte : 8;
unsigned int highByte : 8;
unsigned int upperLowByte : 8;
unsigned int upperHighByte : 8;
};
struct {
unsigned int lowWord : 16;
unsigned int highWord : 16;
};
};
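Note the removed bytes and words names. Anonymous structs are not part of standard C++ (only anonymous unions are), but GCC, Clang, and MSVC accept them as an extension.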

In C++, would http://www.cplusplus.com/reference/stl/bitset/ serve your needs?
Plain C version would look something like this:
int32 foo;
//...
//Set to 0x66 at the low byte
foo &= 0xffffff00;
foo |= 0x66;
This is probably going to be more maintainable down the road than writing a custom class/union, because it follows the typical C idiom.

You can make
short& loword() { return *reinterpret_cast<short*>(&m_source); }
and use it if you don't mind the parentheses (note that which half of m_source this refers to depends on the byte order).
Or you can go fancy
class lowordaccess
{
unsigned int m_source;
public:
void assign(unsigned int& source) { m_source = source; }
short& operator=(short& value) { ... set m_source }
operator short() { return m_source & 0xFFFF; }
}
and then
struct LowLevelNumber
{
LowLevelNumber() { loword.assign(number); }
unsigned int number;
lowordaccess loword;
}
var.loword = 1;
short n = var.loword;
The latter technique is a known way to emulate properties in C++.

You could easily wrap that in a class and use get/set accessors.

Using a union for this is bad, because it is not portable w.r.t. endianness.
Use accessor functions and implement them with bit masks and shifts.
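A minimal sketch of what such accessors could look like, assuming a 32-bit value. The class name LowLevelNumber32 and the setter names are hypothetical; the getter names mirror the union members in the question.

```cpp
#include <cstdint>

// Hypothetical wrapper: byte/word access via masks and shifts, independent of byte order.
class LowLevelNumber32 {
public:
    explicit LowLevelNumber32(std::uint32_t n = 0) : n_(n) {}

    std::uint32_t value() const { return n_; }

    // Getters: mask or shift out the requested part.
    std::uint16_t lowWord() const { return static_cast<std::uint16_t>(n_ & 0xFFFFu); }
    std::uint16_t highWord() const { return static_cast<std::uint16_t>(n_ >> 16); }
    std::uint8_t lowByte() const { return static_cast<std::uint8_t>(n_ & 0xFFu); }

    // Setters: clear the target bits, then OR in the new value.
    void setLowWord(std::uint16_t w) { n_ = (n_ & 0xFFFF0000u) | w; }
    void setHighWord(std::uint16_t w) {
        n_ = (n_ & 0x0000FFFFu) | (static_cast<std::uint32_t>(w) << 16);
    }
    void setLowByte(std::uint8_t b) { n_ = (n_ & 0xFFFFFF00u) | b; }

private:
    std::uint32_t n_;
};
```

With this, var.setLowWord(0x66) replaces var.words.lowWord = 0x66, and the result is the same on little-endian and big-endian machines.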
