This question already has an answer here:
Parameter "size" of member operator new[] increases if class has destructor/delete[]
(1 answer)
Closed 5 years ago.
I am seeing a problem with operator new[]:
#include <stdlib.h>
#include <stdio.h>
class V4 { public:
float v[ 4 ];
V4() {}
void *operator new( size_t sz ) { return aligned_alloc( 16, sz ); }
void *operator new[]( size_t sz ) { printf( "sz: %zu\n", sz ); return aligned_alloc( 16, sz ); }
void operator delete( void *p, size_t sz ) { free( p ); }
//void operator delete[]( void *p, size_t sz ) { free( p ); }
};
class W4 { public:
float w[ 4 ];
W4() {}
void *operator new( size_t sz ) { return aligned_alloc( 16, sz ); }
void *operator new[]( size_t sz ) { printf( "sz: %zu\n", sz ); return aligned_alloc( 16, sz ); }
void operator delete( void *p, size_t sz ) { free( p ); }
void operator delete[]( void *p, size_t sz ) { free( p ); }
};
int main( int argc, char **argv ) {
printf( "sizeof( V4 ): %zu\n", sizeof( V4 ));
V4 *p = new V4[ 1 ];
printf( "p: %p\n", p );
printf( "sizeof( W4 ): %zu\n", sizeof( W4 ));
W4 *q = new W4[ 1 ];
printf( "q: %p\n", q );
exit(0);
}
Produces:
$ g++ -Wall main.cpp && ./a.out
sizeof( V4 ): 16
sz: 16
p: 0x55be98a10030
sizeof( W4 ): 16
sz: 24
q: 0x55be98a10058
Why does the allocation size increase to 24 when I include operator delete[]? This breaks my aligned allocation.
$ g++ --version
g++ (Debian 7.2.0-18) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
From looking at other questions, it seems the extra 8 bytes may be used to store the array size. Even if this is expected behaviour, why is it triggered by operator delete[], and what is the correct procedure for allocating memory-aligned arrays?
EDIT Thanks, the linked questions appear to be relevant. I still think the question as asked needs an answer, however. It ought to be possible to change the example code to produce memory-aligned arrays without recourse to std::vector, in my opinion. My current thinking is that it will be necessary to allocate a still larger block of 16-byte-aligned bytes and return a pointer into it, such that the initial 8 bytes bring the rest of the block to alignment on a 16-byte boundary. The delete[] operator would then have to perform the reverse adjustment before calling free(). This is pretty disgusting, but I think it is required to satisfy both the runtime (which requires its 8 bytes for size storage) and the use case, which is to get 16-byte-aligned Vector4s.
EDIT The linked answer is certainly relevant, but it does not address the problem of ensuring correct memory alignment.
EDIT It looks like this code will do what I want, but I don't like the magic number 8 in delete[]:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
class W16 { public:
float w[ 16 ];
W16() {}
void *operator new( size_t sz ) { return aligned_alloc( 16, sz ); }
void *operator new[]( size_t sz ) {
size_t r = sz % sizeof( W16 );
size_t ofs = sizeof( W16 ) - r;
size_t _sz = sz + ofs;
void *p1 = aligned_alloc( sizeof( W16 ), _sz );
void *p2 = ((uint8_t *) p1) + ofs;
printf( "sizeof( W16 ): %zx, sz: %zx, r: %zx, ofs: %zx, _sz: %zx\np1: %p\np2: %p\n\n", sizeof( W16 ), sz, r, ofs, _sz, p1, p2 );
return p2;
}
void operator delete( void *p, size_t sz ) { free( p ); }
void operator delete[]( void *p, size_t sz ) {
void *p1 = ((int8_t*) p) + 8 - sizeof( W16 );
printf( "\np2: %p\np1: %p", p, p1 );
free( p1 );
}
};
int main( int argc, char **argv ) {
printf( "sizeof( W16 ): %zx\n", sizeof( W16 ));
W16 *q = new W16[ 16 ];
printf( "&q[0]: %p\n", &q[0] );
delete[] q;
}
Output:
$ g++ -Wall main.cpp && ./a.out
sizeof( W16 ): 40
sizeof( W16 ): 40, sz: 408, r: 8, ofs: 38, _sz: 440
p1: 0x559876c68080
p2: 0x559876c680b8
&q[0]: 0x559876c680c0
p2: 0x559876c680b8
p1: 0x559876c68080
EDIT Title changed from feedback in comments. I don't think this is a 'duplicate' of the linked answer anymore, though I don't know if I can get it removed.
It looks as though this will do for me:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
inline void *array_alloc( size_t sz_obj, size_t sz_req ) {
size_t r = sz_req % sz_obj;
size_t ofs = sz_obj - r;
size_t sz = sz_req + ofs;
void *p1 = aligned_alloc( sz_obj, sz );
void *p2 = (void*) (((uintptr_t ) p1) + ofs);
//printf( "sz_obj: %zx, sz_req: %zx, r: %zx, ofs: %zx, sz: %zx\np1: %p\np2: %p\n\n", sz_obj, sz_req, r, ofs, sz, p1, p2 );
return p2;
}
inline void array_free( size_t sz_obj, void *p2 ) {
void *p1 = (void*) (((uint8_t*)p2) - (((uintptr_t)p2) % sz_obj));
//printf( "\np2: %p\np1: %p", p2, p1 );
free( p1 );
}
class W16 { public:
float w[ 16 ];
W16() {}
void *operator new( size_t sz ) { return aligned_alloc( 16, sz ); }
void *operator new[]( size_t sz ) { return array_alloc( sizeof( W16 ), sz ); }
void operator delete( void *p, size_t sz ) { free( p ); }
void operator delete[]( void *p, size_t sz ) { array_free( sizeof( W16 ), p ); }
};
int main( int argc, char **argv ) {
//printf( "sizeof( W16 ): %zx\n", sizeof( W16 ));
W16 *q = new W16[ 16 ];
printf( "&q[0]: %p\n", &q[0] );
delete[] q;
}
EDIT Thanks to n.m., this code works without a magic number.
Related
I have written a custom allocator which I'm using with std::vector. The code compiles and works in debug mode, but it fails to compile in release mode with a strange error.
Here is my allocator :
template< class T >
class AllocPowOf2
{
public:
typedef size_t size_type;
typedef ptrdiff_t difference_type;
typedef T * pointer;
typedef const T * const_pointer;
typedef T & reference;
typedef const T & const_reference;
typedef T value_type;
private:
size_type m_nMinNbBytes;
public:
template< class U >
struct rebind
{
typedef AllocPowOf2< U > other;
};
inline pointer address( reference value ) const
{
return & value;
}
inline const_pointer address( const_reference value ) const
{
return & value;
}
inline AllocPowOf2( size_type nMinNbBytes = 32 )
: m_nMinNbBytes( nMinNbBytes ) { }
inline AllocPowOf2( const AllocPowOf2 & oAlloc )
: m_nMinNbBytes( oAlloc.m_nMinNbBytes ) { }
template< class U >
inline AllocPowOf2( const AllocPowOf2< U > & oAlloc )
: m_nMinNbBytes( oAlloc.m_nMinNbBytes ) { }
inline ~AllocPowOf2() { }
inline bool operator != ( const AllocPowOf2< T > & oAlloc )
{
return m_nMinNbBytes != oAlloc.m_nMinNbBytes;
}
inline size_type max_size() const
{
return size_type( -1 ) / sizeof( value_type );
}
static size_type OptimizeNbBytes( size_type nNbBytes, size_type nMin )
{
if( nNbBytes < nMin )
{
nNbBytes = nMin;
}
else
{
size_type j = nNbBytes;
j |= (j >> 1);
j |= (j >> 2);
j |= (j >> 4);
j |= (j >> 8);
#if ENV_32BITS || ENV_64BITS
j |= (j >> 16);
#endif
#if ENV_64BITS
j |= (j >> 32);
#endif
++j; // Least power of two greater than nNbBytes and nMin
if( j > nNbBytes )
{
nNbBytes = j;
}
}
return nNbBytes;
}
pointer allocate( size_type nNum )
{
return new value_type[ OptimizeNbBytes( nNum * sizeof( value_type ), 32 ) ]; // ERROR HERE, line 97
}
void construct( pointer p, const value_type & value )
{
new ((void *) p) value_type( value );
}
void destroy( pointer p )
{
p->~T();
}
void deallocate( pointer p, size_type nNum )
{
(void) nNum;
delete[] p;
}
};
Here is the error :
Error 1 error C2512: 'std::_Aux_cont' : no appropriate default constructor available c:\XXX\AllocPowOf2.h 97
The code compiles correctly in debug mode both on Windows with VS2008 and on Android with the Android NDK and Eclipse.
Any ideas?
return new value_type[ OptimizeNbBytes( nNum * sizeof( value_type ), 32 ) ];
Ignoring OptimizeNbBytes for now, you are newing up nNum * sizeof(value_type) value_types, which also calls value_type's constructor that many times.
In other words, asked to allocate memory for 16 ints, you would allocate enough for 64 ints instead; not only that, but you were asked for raw memory, and instead ran constructors all over them, creating objects that will be overwritten by the container without being destroyed - and then the delete[] in deallocate will result in double destruction.
allocate should allocate raw memory:
return pointer(::operator new(OptimizeNbBytes( nNum * sizeof( value_type ), 32 )));
and deallocate should deallocate the memory without running any destructor:
::operator delete((void*)p);
I played a bit with the experimental device lambdas that were introduced in CUDA 7.5 and promoted in this blog post by Mark Harris.
For the following example I removed a lot of stuff that is not needed to show my problem (my actual implementation looks a bit nicer...).
I tried to write a foreach function that operates on vectors either on the device (one thread per element) or on the host (serially), depending on a template parameter. With this foreach function I can easily implement BLAS functions. As an example, I use assigning a scalar to each component of a vector (I attach the complete code at the end):
template<bool onDevice> void assignScalar( size_t size, double* vector, double a )
{
auto assign = [=] __host__ __device__ ( size_t index ) { vector[index] = a; };
if( onDevice )
{
foreachDevice( size, assign );
}
else
{
foreachHost( size, assign );
}
}
However, this code gives a compiler error because of the __host__ __device__ lambda:
The closure type for a lambda ("lambda ->void") cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function
I get the same error if I remove the __device__ from the lambda expression and I get no compile error if I remove __host__ (only __device__ lambda), but in this case the host part is not executed...
If I define the lambda as either __host__ or __device__ separately, the code compiles and works as expected.
template<bool onDevice> void assignScalar2( size_t size, double* vector, double a )
{
if( onDevice )
{
auto assign = [=] __device__ ( size_t index ) { vector[index] = a; };
foreachDevice( size, assign );
}
else
{
auto assign = [=] __host__ ( size_t index ) { vector[index] = a; };
foreachHost( size, assign );
}
}
However, this introduces code duplication and actually makes the whole idea of using lambdas useless for this example.
Is there a way to accomplish what I want to do or is this a bug in the experimental feature? Actually, defining a __host__ __device__ lambda is explicitly mentioned in the first example in the programming guide. Even for that simpler example (just return a constant value from the lambda) I couldn't find a way to use the lambda expression on both host and device.
Here is the full code, compile with options -std=c++11 --expt-extended-lambda:
#include <iostream>
using namespace std;
template<typename Operation> void foreachHost( size_t size, Operation o )
{
for( size_t i = 0; i < size; ++i )
{
o( i );
}
}
template<typename Operation> __global__ void kernel_foreach( Operation o )
{
size_t index = blockIdx.x * blockDim.x + threadIdx.x;
o( index );
}
template<typename Operation> void foreachDevice( size_t size, Operation o )
{
size_t blocksize = 32;
size_t gridsize = size/32;
kernel_foreach<<<gridsize,blocksize>>>( o );
}
__global__ void printFirstElementOnDevice( double* vector )
{
printf( "dVector[0] = %f\n", vector[0] );
}
void assignScalarHost( size_t size, double* vector, double a )
{
auto assign = [=] ( size_t index ) { vector[index] = a; };
foreachHost( size, assign );
}
void assignScalarDevice( size_t size, double* vector, double a )
{
auto assign = [=] __device__ ( size_t index ) { vector[index] = a; };
foreachDevice( size, assign );
}
// compile error:
template<bool onDevice> void assignScalar( size_t size, double* vector, double a )
{
auto assign = [=] __host__ __device__ ( size_t index ) { vector[index] = a; };
if( onDevice )
{
foreachDevice( size, assign );
}
else
{
foreachHost( size, assign );
}
}
// works:
template<bool onDevice> void assignScalar2( size_t size, double* vector, double a )
{
if( onDevice )
{
auto assign = [=] __device__ ( size_t index ) { vector[index] = a; };
foreachDevice( size, assign );
}
else
{
auto assign = [=] __host__ ( size_t index ) { vector[index] = a; };
foreachHost( size, assign );
}
}
int main()
{
size_t SIZE = 32;
double* hVector = new double[SIZE];
double* dVector;
cudaMalloc( &dVector, SIZE*sizeof(double) );
// clear memory
for( size_t i = 0; i < SIZE; ++i )
{
hVector[i] = 0;
}
cudaMemcpy( dVector, hVector, SIZE*sizeof(double), cudaMemcpyHostToDevice );
assignScalarHost( SIZE, hVector, 1.0 );
cout << "hVector[0] = " << hVector[0] << endl;
assignScalarDevice( SIZE, dVector, 2.0 );
printFirstElementOnDevice<<<1,1>>>( dVector );
cudaDeviceSynchronize();
assignScalar2<false>( SIZE, hVector, 3.0 );
cout << "hVector[0] = " << hVector[0] << endl;
assignScalar2<true>( SIZE, dVector, 4.0 );
printFirstElementOnDevice<<<1,1>>>( dVector );
cudaDeviceSynchronize();
// assignScalar<false>( SIZE, hVector, 5.0 );
// cout << "hVector[0] = " << hVector[0] << endl;
//
// assignScalar<true>( SIZE, dVector, 6.0 );
// printFirstElementOnDevice<<<1,1>>>( dVector );
// cudaDeviceSynchronize();
cudaError_t error = cudaGetLastError();
if(error!=cudaSuccess)
{
cout << "ERROR: " << cudaGetErrorString(error);
}
}
I used the production release of CUDA 7.5.
Update
I tried this third version for the assignScalar function:
template<bool onDevice> void assignScalar3( size_t size, double* vector, double a )
{
#ifdef __CUDA_ARCH__
#define LAMBDA_HOST_DEVICE __device__
#else
#define LAMBDA_HOST_DEVICE __host__
#endif
auto assign = [=] LAMBDA_HOST_DEVICE ( size_t index ) { vector[index] = a; };
if( onDevice )
{
foreachDevice( size, assign );
}
else
{
foreachHost( size, assign );
}
}
It compiles and runs without error, but the device version (assignScalar3<true>) is not executed. Actually, I thought that __CUDA_ARCH__ would always be undefined (since the function is not __device__), but I checked explicitly that there is a compile path where it is defined.
The task that I tried to accomplish with the examples provided in the question is not possible with CUDA 7.5, though it was not explicitly excluded from the allowed cases for the experimental lambda support.
NVIDIA announced that CUDA Toolkit 8.0 will support __host__ __device__ lambdas as an experimental feature, according to the blog post CUDA 8 Features Revealed.
I verified that my example works with the CUDA 8 Release Candidate (Cuda compilation tools, release 8.0, V8.0.26).
Here is the code that I finally used, compiled with nvcc -std=c++11 --expt-extended-lambda:
#include <iostream>
using namespace std;
template<typename Operation> __global__ void kernel_foreach( Operation o )
{
size_t i = blockIdx.x * blockDim.x + threadIdx.x;
o( i );
}
template<bool onDevice, typename Operation> void foreach( size_t size, Operation o )
{
if( onDevice )
{
size_t blocksize = 32;
size_t gridsize = size/32;
kernel_foreach<<<gridsize,blocksize>>>( o );
}
else
{
for( size_t i = 0; i < size; ++i )
{
o( i );
}
}
}
__global__ void printFirstElementOnDevice( double* vector )
{
printf( "dVector[0] = %f\n", vector[0] );
}
template<bool onDevice> void assignScalar( size_t size, double* vector, double a )
{
auto assign = [=] __host__ __device__ ( size_t i ) { vector[i] = a; };
foreach<onDevice>( size, assign );
}
int main()
{
size_t SIZE = 32;
double* hVector = new double[SIZE];
double* dVector;
cudaMalloc( &dVector, SIZE*sizeof(double) );
// clear memory
for( size_t i = 0; i < SIZE; ++i )
{
hVector[i] = 0;
}
cudaMemcpy( dVector, hVector, SIZE*sizeof(double), cudaMemcpyHostToDevice );
assignScalar<false>( SIZE, hVector, 3.0 );
cout << "hVector[0] = " << hVector[0] << endl;
assignScalar<true>( SIZE, dVector, 4.0 );
printFirstElementOnDevice<<<1,1>>>( dVector );
cudaDeviceSynchronize();
cudaError_t error = cudaGetLastError();
if(error!=cudaSuccess)
{
cout << "ERROR: " << cudaGetErrorString(error);
}
}
template<typename T>
struct UninitializedField
{
T& X;
inline UninitializedField( ) : X( *( T* )&DATA )
{
}
protected:
char DATA[ sizeof( T ) ];
};
int main( )
{
UninitializedField<List<int>> LetsTest;
printf( "%u, %u\n", sizeof( LetsTest ), sizeof( List<int> ) );
}
I am trying to program a class that wraps an object without it being automatically initialized/constructed.
But when I execute my program the output is:
8, 4
Is there a way to optimize out the dereference needed to reach the object through X, and the space the reference takes?
template<typename T>
struct UninitializedField {
__inline UninitializedField( const T &t ) {
*( ( T* )this ) = t;
}
__inline UninitializedField( bool Construct = false, bool Zero = true ) {
if ( Zero )
memset( this, 0, sizeof( *this ) );
if ( Construct )
*( ( T* )this ) = T( );
}
__inline T *operator->( ) {
return ( T* )this;
}
__inline T &operator*( ) {
return *( ( T* )this );
}
protected:
char DATA[ sizeof( T ) ];
};
There isn't any extra space taken, and with compiler optimization on there are no function calls.
As I was implementing the Sieve of Eratosthenes I ran into an issue with std::vector<bool> : there is no access to the raw data.
So I decided to use a custom minimalistic implementation where I would have access to the data pointer.
#ifndef LIB_BITS_T_H
#define LIB_BITS_T_H
#include <algorithm>
template <typename B>
class bits_t{
public:
typedef B block_t;
static const size_t block_size = sizeof(block_t) * 8;
block_t* data;
size_t size;
size_t blocks;
class bit_ref{
public:
block_t* const block;
const block_t mask;
bit_ref(block_t& block, const block_t mask) noexcept : block(&block), mask(mask){}
inline void operator=(bool v) const noexcept{
if(v) *block |= mask;
else *block &= ~mask;
}
inline operator bool() const noexcept{
return (bool)(*block & mask);
}
};
bits_t() noexcept : data(nullptr){}
void resize(const size_t n, const bool v) noexcept{
block_t fill = v ? ~block_t(0) : block_t(0);
size = n;
blocks = (n + block_size - 1) / block_size;
data = new block_t[blocks];
std::fill(data, data + blocks, fill);
}
inline block_t& block_at_index(const size_t i) const noexcept{
return data[i / block_size];
}
inline size_t index_in_block(const size_t i) const noexcept{
return i % block_size;
}
inline bit_ref operator[](const size_t i) noexcept{
return bit_ref(block_at_index(i), block_t(1) << index_in_block(i));
}
~bits_t(){
delete[] data;
}
};
#endif // LIB_BITS_T_H
The code is nearly the same as the one in /usr/include/c++/4.7/bits/stl_bvector.h, but it is slower.
I tried an optimization,
#ifndef LIB_BITS_T_H
#define LIB_BITS_T_H
#include <algorithm>
template <typename B>
class bits_t{
const B mask[64] = {
0b0000000000000000000000000000000000000000000000000000000000000001,
0b0000000000000000000000000000000000000000000000000000000000000010,
0b0000000000000000000000000000000000000000000000000000000000000100,
0b0000000000000000000000000000000000000000000000000000000000001000,
0b0000000000000000000000000000000000000000000000000000000000010000,
0b0000000000000000000000000000000000000000000000000000000000100000,
0b0000000000000000000000000000000000000000000000000000000001000000,
0b0000000000000000000000000000000000000000000000000000000010000000,
0b0000000000000000000000000000000000000000000000000000000100000000,
0b0000000000000000000000000000000000000000000000000000001000000000,
0b0000000000000000000000000000000000000000000000000000010000000000,
0b0000000000000000000000000000000000000000000000000000100000000000,
0b0000000000000000000000000000000000000000000000000001000000000000,
0b0000000000000000000000000000000000000000000000000010000000000000,
0b0000000000000000000000000000000000000000000000000100000000000000,
0b0000000000000000000000000000000000000000000000001000000000000000,
0b0000000000000000000000000000000000000000000000010000000000000000,
0b0000000000000000000000000000000000000000000000100000000000000000,
0b0000000000000000000000000000000000000000000001000000000000000000,
0b0000000000000000000000000000000000000000000010000000000000000000,
0b0000000000000000000000000000000000000000000100000000000000000000,
0b0000000000000000000000000000000000000000001000000000000000000000,
0b0000000000000000000000000000000000000000010000000000000000000000,
0b0000000000000000000000000000000000000000100000000000000000000000,
0b0000000000000000000000000000000000000001000000000000000000000000,
0b0000000000000000000000000000000000000010000000000000000000000000,
0b0000000000000000000000000000000000000100000000000000000000000000,
0b0000000000000000000000000000000000001000000000000000000000000000,
0b0000000000000000000000000000000000010000000000000000000000000000,
0b0000000000000000000000000000000000100000000000000000000000000000,
0b0000000000000000000000000000000001000000000000000000000000000000,
0b0000000000000000000000000000000010000000000000000000000000000000,
0b0000000000000000000000000000000100000000000000000000000000000000,
0b0000000000000000000000000000001000000000000000000000000000000000,
0b0000000000000000000000000000010000000000000000000000000000000000,
0b0000000000000000000000000000100000000000000000000000000000000000,
0b0000000000000000000000000001000000000000000000000000000000000000,
0b0000000000000000000000000010000000000000000000000000000000000000,
0b0000000000000000000000000100000000000000000000000000000000000000,
0b0000000000000000000000001000000000000000000000000000000000000000,
0b0000000000000000000000010000000000000000000000000000000000000000,
0b0000000000000000000000100000000000000000000000000000000000000000,
0b0000000000000000000001000000000000000000000000000000000000000000,
0b0000000000000000000010000000000000000000000000000000000000000000,
0b0000000000000000000100000000000000000000000000000000000000000000,
0b0000000000000000001000000000000000000000000000000000000000000000,
0b0000000000000000010000000000000000000000000000000000000000000000,
0b0000000000000000100000000000000000000000000000000000000000000000,
0b0000000000000001000000000000000000000000000000000000000000000000,
0b0000000000000010000000000000000000000000000000000000000000000000,
0b0000000000000100000000000000000000000000000000000000000000000000,
0b0000000000001000000000000000000000000000000000000000000000000000,
0b0000000000010000000000000000000000000000000000000000000000000000,
0b0000000000100000000000000000000000000000000000000000000000000000,
0b0000000001000000000000000000000000000000000000000000000000000000,
0b0000000010000000000000000000000000000000000000000000000000000000,
0b0000000100000000000000000000000000000000000000000000000000000000,
0b0000001000000000000000000000000000000000000000000000000000000000,
0b0000010000000000000000000000000000000000000000000000000000000000,
0b0000100000000000000000000000000000000000000000000000000000000000,
0b0001000000000000000000000000000000000000000000000000000000000000,
0b0010000000000000000000000000000000000000000000000000000000000000,
0b0100000000000000000000000000000000000000000000000000000000000000,
0b1000000000000000000000000000000000000000000000000000000000000000
};
public:
typedef B block_t;
static const size_t block_size = sizeof(block_t) * 8;
block_t* data;
size_t size;
size_t blocks;
class bit_ref{
public:
block_t* const block;
const block_t mask;
bit_ref(block_t& block, const block_t mask) noexcept : block(&block), mask(mask){}
inline void operator=(bool v) const noexcept{
if(v) *block |= mask;
else *block &= ~mask;
}
inline operator bool() const noexcept{
return (bool)(*block & mask);
}
};
bits_t() noexcept : data(nullptr){}
void resize(const size_t n, const bool v) noexcept{
block_t fill = v ? ~block_t(0) : block_t(0);
size = n;
blocks = (n + block_size - 1) / block_size;
data = new block_t[blocks];
std::fill(data, data + blocks, fill);
}
inline block_t& block_at_index(const size_t i) const noexcept{
return data[i / block_size];
}
inline size_t index_in_block(const size_t i) const noexcept{
return i % block_size;
}
inline bit_ref operator[](const size_t i) noexcept{
return bit_ref(block_at_index(i), mask[index_in_block(i)]);
}
~bits_t(){
delete[] data;
}
};
#endif // LIB_BITS_T_H
(Compiling with g++4.7 -O3)
Eratosthenes sieve algorithm (33.333.333 bits)
std::vector<bool> 19.1s
bits_t<size_t> 19.9s
bits_t<size_t> (with lookup table) 19.7s
ctor + resize(33.333.333 bits) + dtor
std::vector<bool> 120ms
bits_t<size_t> 150ms
QUESTION : Where does the slowdown come from?
Aside from the problems pointed out by other users, your resize() allocates more memory each time the current block limit is reached, adding ONE block at a time. std::vector will double the size of the buffer (so if you already had 16 blocks, you now have 32). In other words, it will call new far less often than you do.
That said, you do not do the necessary delete and copy, and that could have a "positive" impact in your version... ("positive" speed-wise; it is not correct that you neither delete the old data nor copy it into your new buffer.)
Also, std::vector will properly enlarge the buffer and thus copy data that is likely already in your CPU cache. With your version, that cache benefit is lost since you simply ignore the old buffer on each resize().
Also, when a class manages a memory buffer it is customary to implement the copy constructor and assignment operator, for good reasons... and you could look into using a shared_ptr<> too. The delete is then hidden, and since the class is a template it is very fast (it does not add any code that you would not already have in your own version).
=== Update
There is one other thing. Your operator[] implementation:
inline bit_ref operator[](const size_t i) noexcept{
return bit_ref(block_at_index(i), mask[index_in_block(i)]);
}
(Side note: the inline is not required; defining the function within the class body already makes it implicitly inline.)
You only offer a non-const version, which "is slow" because it creates a proxy object. You should try implementing a const version that returns bool and see whether that accounts for the ~3% difference you see.
bool operator[](const size_t i) const noexcept
{
return (block_at_index(i) & mask[index_in_block(i)]) != 0;
}
Also, using a mask[] array can slow things down. (1LL << (index & 0x3F)) should be faster (two CPU instructions and zero memory accesses).
Apparently, wrapping i % block_size in a function was the culprit:
inline size_t index_in_block ( const size_t i ) const noexcept {
return i % block_size;
}
inline bit_ref operator[] ( const size_t i ) noexcept {
return bit_ref( block_at_index( i ), block_t( 1 ) << index_in_block( i ) );
}
so replacing the above code with
inline bit_ref operator[] ( const size_t i ) noexcept {
return bit_ref( block_at_index( i ), block_t( 1 ) << ( i % block_size ) );
}
solves the issue. However, I still don't know why. My best guess is that I didn't get the signature of index_in_block right, so the optimizer is not able to inline the function in the same way as the manual inlining.
Here is the new code.
#ifndef LIB_BITS_2_T_H
#define LIB_BITS_2_T_H
#include <algorithm>
template <typename B>
class bits_2_t {
public:
typedef B block_t;
static const int block_size = sizeof( block_t ) * __CHAR_BIT__;
private:
block_t* _data;
size_t _size;
size_t _blocks;
public:
class bit_ref {
public:
block_t* const block;
const block_t mask;
bit_ref ( block_t& block, const block_t mask) noexcept
: block( &block ), mask( mask ) {}
inline bool operator= ( const bool v ) const noexcept {
if ( v ) *block |= mask;
else *block &= ~mask;
return v;
}
inline operator bool() const noexcept {
return (bool)( *block & mask );
}
};
bits_2_t () noexcept : _data( nullptr ), _size( 0 ), _blocks( 0 ) {}
bits_2_t ( const size_t n ) noexcept : _data( nullptr ), _size( n ) {
_blocks = number_of_blocks_needed( n );
_data = new block_t[_blocks];
const block_t fill( 0 );
std::fill( _data, _data + _blocks, fill );
}
bits_2_t ( const size_t n, const bool v ) noexcept : _data( nullptr ), _size( n ) {
_blocks = number_of_blocks_needed( n );
_data = new block_t[_blocks];
const block_t fill = v ? ~block_t( 0 ) : block_t( 0 );
std::fill( _data, _data + _blocks, fill );
}
void resize ( const size_t n ) noexcept {
resize( n, false );
}
void resize ( const size_t n, const bool v ) noexcept {
const size_t tmpblocks = number_of_blocks_needed( n );
const size_t copysize = std::min( _blocks, tmpblocks );
block_t* tmpdata = new block_t[tmpblocks];
std::copy( _data, _data + copysize, tmpdata );
const block_t fill = v ? ~block_t( 0 ) : block_t( 0 );
std::fill( tmpdata + copysize, tmpdata + tmpblocks, fill );
delete[] _data;
_data = tmpdata;
_blocks = tmpblocks;
_size = n;
}
inline size_t number_of_blocks_needed ( const size_t n ) const noexcept {
return ( n + block_size - 1 ) / block_size;
}
inline block_t& block_at_index ( const size_t i ) const noexcept {
return _data[i / block_size];
}
inline bit_ref operator[] ( const size_t i ) noexcept {
return bit_ref( block_at_index( i ), block_t( 1 ) << ( i % block_size ) );
}
inline bool operator[] ( const size_t i ) const noexcept {
return (bool)( block_at_index( i ) & ( block_t( 1 ) << ( i % block_size ) ) );
}
inline block_t* data () {
return _data;
}
inline const block_t* data () const {
return _data;
}
inline size_t size () const {
return _size;
}
void clear () noexcept {
delete[] _data;
_size = 0;
_blocks = 0;
_data = nullptr;
}
~bits_2_t () {
clear();
}
};
#endif // LIB_BITS_2_T_H
Here are the results for this new code on my amd64 machine for primes up to 1.000.000.000 (best of 3 runs, real time).
Sieve of Eratosthenes with 1 memory unit per number ( not skipping multiples of 2 ).
bits_t<uint8_t>
real 0m23.614s user 0m23.493s sys 0m0.092s
bits_t<uint16_t>
real 0m24.399s user 0m24.294s sys 0m0.084s
bits_t<uint32_t>
real 0m23.501s user 0m23.372s sys 0m0.108s <-- best
bits_t<uint64_t>
real 0m24.393s user 0m24.304s sys 0m0.068s
std::vector<bool>
real 0m24.362s user 0m24.276s sys 0m0.056s
std::vector<uint8_t>
real 0m38.303s user 0m37.570s sys 0m0.683s
Here is the code of the sieve (where (...) should be replaced by the bit array of your choice).
#include <iostream>
typedef (...) array_t;
int main ( int argc, char const *argv[] ) {
if ( argc != 2 ) {
std::cout << "#0 missing" << std::endl;
return 1;
}
const size_t count = std::stoull( argv[1] );
array_t prime( count, true );
prime[0] = prime[1] = false;
for ( size_t k = 2 ; k * k < count ; ++k ) {
if ( prime[k] ) {
for ( size_t i = k * k ; i < count ; i += k ) {
prime[i] = false;
}
}
}
return 0;
}
This is my first post, I'm new to this site, but I've been lurking around for a while now. I've got a good knowledge of C and very limited knowledge of C++. I guess. I'm on Windows (XPx64), VS2008.
I'm trying to wrap a C++ library, kdtree2, so that I can use it from C. The main issues relate to accessing the kdtree2 and kdtree2_result_vector classes. As the author's FTP server does not respond, I've uploaded a copy of the original distribution: kdtree2 src
Just some quick info on the kd-tree (a form of binary tree): "the data" are coordinates in n-dimensional Cartesian space plus an index. It is used for nearest-neighbour searches, so after constructing the tree (which will not be modified), one can query it for various types of NN searches. The results in this case are returned in a vector of (C-like) structs.
struct kdtree2_result {
//
// the search routines return a (wrapped) vector
// of these.
//
public:
float dis; // its square Euclidean distance
int idx; // which neighbor was found
};
My imagined solution is to have an array of kdtree2 objects (one per thread). For the kdtree2_result_vector class I haven't got a solution yet as I'm not getting past first base. It is not necessary to access the kdtree2 class directly.
I only need to fill it with data and then use it (the second function below is an example of such use). For this I've defined:
kdtree2 *global_kdtree2;
extern "C" void new_kdtree2 ( float **data, const int n, const int dim, bool arrange ) {
multi_array_ref<float,2> kdtree2_data ( ( float * ) &data [ 0 ][ 0 ], extents [ n ][ dim ], c_storage_order ( ) );
global_kdtree2 = new kdtree2 ( kdtree2_data, arrange );
}
For then using that tree, I've defined:
extern "C" void n_nearest_around_point_kdtree2 ( int idxin, int correltime, int nn ) {
kdtree2_result_vector result;
global_kdtree2->n_nearest_around_point ( idxin, correltime, nn, result );
}
kdtree2_result_vector is derived from std::vector. This compiles without error, and the resulting library can be linked and its C functions called from C.
The problem is that the invocation of n_nearest_around_point_kdtree2 crashes the program. I suspect that somewhere between setting up the tree and using it in the second function call, the tree gets freed/destroyed. The calling C test program is posted below:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include "kdtree2.h"
#define MALLOC_2D(type,x,y) ((type**)malloc_2D_kdtree2((x),(y),sizeof(type)))
void **malloc_2D_kdtree2 ( const int x, const int y, const int type_size ) {
const int y_type_size = y * type_size;
void** x_idx = ( void ** ) malloc ( x * ( sizeof ( void ** ) + y_type_size ) );
if ( x_idx == NULL )
return NULL;
char* y_idx = ( char * ) ( x_idx + x );
for ( int i = 0; i < x; i++ )
x_idx [ i ] = y_idx + i * y_type_size;
return x_idx;
}
int main ( void ) {
float **data = MALLOC_2D ( float, 100, 3 );
for ( int i = 0; i < 100; i++ )
for ( int j = 0; j < 3; j++ )
data [ i ][ j ] = ( float ) ( 3 * i + j );
// this works fine
tnrp ( data, 100, 3, false );
new_kdtree2 ( data, 100, 3, false );
// this crashes the program
n_nearest_around_point_kdtree2 ( 9, 3, 6 );
delete_kdtree2 ( );
free ( data );
return 0;
}
As far as I can see from searching the internet, it should work, but I'm obviously missing something vital in the (for me) brave new world of C++.
EDIT:
Resolution, thanks to larsmans. I've defined the following class (derived from what larsmans posted earlier):
class kdtree {
private:
// data_ref is declared before tree on purpose: members are initialized
// in declaration order, so data_ref is constructed before tree and
// stays alive for as long as tree does.
multi_array_ref<float,2> data_ref;
kdtree2 tree;
public:
kdtree2_result_vector result;
kdtree ( float **data, int n, int dim, bool arrange ) :
data_ref ( ( float * ) &data [ 0 ][ 0 ], extents [ n ][ dim ], c_storage_order ( ) ),
tree ( data_ref, arrange )
{
}
void n_nearest_brute_force ( std::vector<float>& qv ) {
tree.n_nearest_brute_force ( qv, result ); }
void n_nearest ( std::vector<float>& qv, int nn ) {
tree.n_nearest ( qv, nn, result ); }
void n_nearest_around_point ( int idxin, int correltime, int nn ) {
tree.n_nearest_around_point ( idxin, correltime, nn, result ); }
void r_nearest ( std::vector<float>& qv, float r2 ) {
tree.r_nearest ( qv, r2, result ); }
void r_nearest_around_point ( int idxin, int correltime, float r2 ) {
tree.r_nearest_around_point ( idxin, correltime, r2, result ); }
int r_count ( std::vector<float>& qv, float r2 ) {
return tree.r_count ( qv, r2 ); }
int r_count_around_point ( int idxin, int correltime, float r2 ) {
return tree.r_count_around_point ( idxin, correltime, r2 ); }
};
The code to call these functions from C:
kdtree* global_kdtree2 [ 8 ];
extern "C" void new_kdtree2 ( const int thread_id, float **data, const int n, const int dim, bool arrange ) {
global_kdtree2 [ thread_id ] = new kdtree ( data, n, dim, arrange );
}
extern "C" void delete_kdtree2 ( const int thread_id ) {
delete global_kdtree2 [ thread_id ];
}
extern "C" void n_nearest_around_point_kdtree2 ( const int thread_id, int idxin, int correltime, int nn, struct kdtree2_result **result ) {
global_kdtree2 [ thread_id ]->n_nearest_around_point ( idxin, correltime, nn );
*result = &( global_kdtree2 [ thread_id ]->result.front ( ) );
}
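For completeness, the C side needs matching declarations to include. The post doesn't show its kdtree2.h, so the following is only a guess at what such a shared header could look like, assuming the wrapper functions above and the standard extern "C" guard pattern (the include-guard boilerplate is mine, not from the library):

```c
/* kdtree2.h (sketch) -- shared between the C++ wrapper and C callers.
 * Function names and signatures are taken from the wrapper code above. */
#ifndef KDTREE2_H
#define KDTREE2_H

#include <stdbool.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Plain-C mirror of the C++ kdtree2_result: same members, same order,
 * so C code can read the array returned by the wrapper directly. */
struct kdtree2_result {
    float dis; /* squared Euclidean distance */
    int   idx; /* index of the neighbour found */
};

void new_kdtree2 ( const int thread_id, float **data, const int n,
                   const int dim, bool arrange );
void delete_kdtree2 ( const int thread_id );
void n_nearest_around_point_kdtree2 ( const int thread_id, int idxin,
                                      int correltime, int nn,
                                      struct kdtree2_result **result );

#ifdef __cplusplus
}
#endif

#endif /* KDTREE2_H */
```

When compiled as C++, the extern "C" block gives the wrapper functions unmangled names; when compiled as C, the guard compiles away and only the plain declarations remain.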
and finally the C program that uses it all:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include "kdtree2.h"
int main ( void ) {
float **data = MALLOC_2D ( float, 100, 3 );
for ( int i = 0; i < 100; i++ )
for ( int j = 0; j < 3; j++ )
data [ i ][ j ] = ( float ) ( 3 * i + j );
int thread_id = 0;
new_kdtree2 ( thread_id, data, 100, 3, false );
struct kdtree2_result *result;
n_nearest_around_point_kdtree2 ( thread_id, 28, 3, 9, &result );
for ( int i = 0; i < 9; i++ )
printf ( "result[%d]= (%d,%f)\n", i , result [ i ].idx, result [ i ].dis );
printf ( "\n" );
n_nearest_around_point_kdtree2 ( thread_id, 9, 3, 6, &result );
for ( int i = 0; i < 6; i++ )
printf ( "result[%d]= (%d,%f)\n", i , result [ i ].idx, result [ i ].dis );
delete_kdtree2 ( thread_id );
free ( data );
return 0;
}
The API docs in the referenced paper are rather flaky and the author's FTP server doesn't respond, so I can't tell with certainty, but my hunch is that
multi_array_ref<float,2> kdtree2_data((float *)&data[0][0], extents[n][dim],
c_storage_order( ));
global_kdtree2 = new kdtree2(kdtree2_data, arrange);
constructs the kdtree2 by storing a reference to kdtree2_data in the global_kdtree2 object, rather than making a full copy. Since kdtree2_data is a local variable, it is destroyed when new_kdtree2 returns, leaving the tree with a dangling reference. You'll have to keep kdtree2_data alive until n_nearest_around_point_kdtree2 is done.