I'm going to write a templatized implementation of a KDTree, which for now should only work as a quadtree or octree for a Barnes-Hut implementation.
The crucial point here is the design: I would like to specify the number of dimensions the tree is defined in as a template parameter and then simply declare some common methods, which automatically behave the correct way (I think some template specialization is needed for that).
I would like to specialize the template in order to have 2^2 (quadtree) or 2^3 (octree) nodes.
Does anyone have some design ideas? I would like to avoid inheritance because it would force me into dynamic memory allocation rather than static allocation.
Here N can be 2 or 3
template<int N>
class NTree
{
public:
    NTree( const std::vector<Mass *> & );
    ~NTree()
    {
        for (int i = 0; i < pow(2,N); i++)
            delete nodes[i];
    }
private:
    void insert( Mass *m );
    NTree *nodes[pow(2,N)]; // is it possible in a templatized way?
};
Another problem is that a quadtree has 4 nodes but 2 dimensions, and an octree has 8 nodes but 3 dimensions, i.e. the number of nodes is 2^dimension. Can I specify this via template metaprogramming? I would like to keep the numbers 4 and 8 explicit so the loop unroller can be faster.
Thank you!
You can use 1 << N instead of pow(2, N). This works because 1 << N is a compile-time constant, whereas pow(2, N) is not a compile time constant (even though it will be evaluated at compile-time anyway).
If you are using a C++11 compiler that supports constexpr, you could write yourself a constexpr version of pow to do the calculation at compile time.
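For illustration, a minimal sketch of how the question's NTree could use 1 << N (or a constexpr pow) as the node count; the member names and the default constructor here are my own additions:
#include <vector>

struct Mass; // the question's particle type, assumed to be defined elsewhere

// C++11 constexpr alternative to pow(2, n); 1 << n works just as well.
constexpr int pow2(int n) { return n == 0 ? 1 : 2 * pow2(n - 1); }

template<int N>
class NTree
{
public:
    static const int NumNodes = 1 << N; // 4 for a quadtree (N==2), 8 for an octree (N==3)

    NTree()  { for (int i = 0; i < NumNodes; ++i) nodes[i] = 0; }
    ~NTree() { for (int i = 0; i < NumNodes; ++i) delete nodes[i]; }

    NTree( const std::vector<Mass *> &masses ); // as in the question; definition omitted

private:
    void insert( Mass *m );   // as in the question; definition omitted
    NTree *nodes[NumNodes];   // legal: the array size is now a compile-time constant
};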
I'm writing a constructor that will transfer elements from a 2D initializer list declared as initializer_list<initializer_list<T>> myInitializerList to a 1D normal array declared as T * arr (this will store the elements in row-major order). I have two nested for loops to iterate over the initializer lists but I'm trying to optimize the test expressions in each for loop (the second of the three statements in the loop declaration).
I have a couple ideas for how to do this but I'm not sure which will be faster. The first is to compare iterators and make the test expression for the outer loop rowItr != myInitializerList.end() and for the inner loop colItr != rowItr->end().
My second idea involves comparing indexes, not iterators, and it might be faster (although I'm not sure). Below is a complete loop demonstrating how I've implemented the second strategy:
template <typename T> //a numerical type
//intentionally left out constructor definition, variable initialization
for (unsigned currentPosition{ 0 }; currentPosition < numRows*numCols; rowItr++) {
    typename initializer_list<T>::iterator colItr = rowItr->begin();
    for (unsigned currentColumn{ 0 }; currentColumn++ < numCols; colItr++) {
        arr[currentPosition++] = *colItr;
    }
}
//end constructor definition
The arrays will store numerical types, most likely float, and have dimensions of up to 1024x1024.
Based on these constraints, is one technique faster than the other? Alternatively, is there a third algorithm I didn't consider that will be faster than both of the ones I already have?
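For comparison, a minimal sketch of the first (iterator-comparison) strategy, assuming the same members numRows, numCols and arr as in the snippet above:
unsigned currentPosition{ 0 };
for (auto rowItr = myInitializerList.begin(); rowItr != myInitializerList.end(); ++rowItr) {
    for (auto colItr = rowItr->begin(); colItr != rowItr->end(); ++colItr) {
        arr[currentPosition++] = *colItr; // same row-major fill as the index-based version
    }
}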
I need to check whether structures were changed; for this I have a static assert implemented through a template technique. But so far I have only checked sizeof, and one can add something to a structure without changing its sizeof (padding may absorb the new member).
How can I calculate the sum of the sizeofs of the members?
struct A
{
    int i;
    bool b;
};
I need __sizeof(A) to be sizeof(A::i) + sizeof(A::b), i.e. the sum of the members' sizes rather than the padded sizeof(A).
I need this done for a large number of structures in my project, and the structures must remain padded as usual, since that is supposed to give the best performance.
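One possible direction, sketched under the assumption that a hand-maintained member-type list per struct is acceptable (C++ has no reflection over members, so the list and any recorded expected value are my own convention here):
#include <cstddef>

// Recursively sum sizeof over a hand-maintained list of member types.
template <typename... Ts> struct sizeof_sum;

template <> struct sizeof_sum<>
{
    static const std::size_t value = 0;
};

template <typename T, typename... Ts>
struct sizeof_sum<T, Ts...>
{
    static const std::size_t value = sizeof(T) + sizeof_sum<Ts...>::value;
};

// __sizeof(A) from the question then becomes sizeof_sum<int, bool>::value.
// The member list (and any recorded expected value, e.g. a hypothetical
// kExpectedSizeofA constant) must be kept in sync with the struct by hand:
// static_assert(sizeof_sum<int, bool>::value == kExpectedSizeofA, "struct A changed");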
I have to create a three-dimensional array using class A as the element type. Class A is defined as below; should I use vector<vector<vector<A> > > or boost::multi_array? Which one is better?
struct C
{
    int C_1;
    short C_2;
};

class B
{
public:
    bool B_1;
    vector<C> C_;
};

class A
{
public:
    bool A_1;
    B B_[6];
};
If you know the size of all three dimensions at the time you write your code, and if you don't need array bounds checking, then just use traditional arrays:
const int N1 = ...
const int N2 = ...
const int N3 = ...
A a[N1][N2][N3];
If the array dimensions can only be determined at run time, but remain constant after program initialization, and if array usage is distributed uniformly, then boost::multi_array is your friend. However, if a lot of dynamic extension is going on at runtime, and/or if array sizes are not uniform (for example, you need A[0][0][0...99] but only A[2][3][0...3]), then the nested vector is likely the best solution. In the case of non-uniform sizes, put the dimension whose size varies the most as the last dimension. Also, in the nested vector solution, it is generally a good idea to put small dimensions first.
The main concern that I would have about using vector<vector<vector<A> > > would be making sure that the second- and third-level vectors all have the same length like they would in a traditional 3D array, since there would be nothing in the data type to enforce that. I'm not terribly familiar with boost::multi_array, but it looks like this isn't an issue there - you can resize() the whole array, but unless I'm mistaken you can't accidentally remove an item from the third row and leave it a different size than all of the other rows (for example).
So assuming concerns like file size and compile time aren't much of an issue, I would think you'd want boost::multi_array. If those things are an issue, you might want to consider using a plain-old 3D array, since that should beat either of the other two options hands-down in those areas.
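For what it's worth, a minimal usage sketch of the boost::multi_array option (n1, n2 and n3 are hypothetical runtime sizes; A is the class from the question):
#include <boost/multi_array.hpp>
#include <cstddef>

void example(std::size_t n1, std::size_t n2, std::size_t n3)
{
    boost::multi_array<A, 3> arr(boost::extents[n1][n2][n3]);
    arr[0][0][0].A_1 = true;                     // element access reads like a plain 3D array
    arr.resize(boost::extents[n1][n2][n3 + 1]);  // resizing the whole array keeps it rectangular
}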
Problem
Suppose I have a large array of bytes (think up to 4GB) containing some data. These bytes correspond to distinct objects in such a way that every s bytes (think s up to 32) will constitute a single object. One important fact is that this size s is the same for all objects, not stored within the objects themselves, and not known at compile time.
At the moment, these objects are logical entities only, not objects in the programming language. I have a comparison on these objects which consists of a lexicographical comparison of most of the object data, with a bit of different functionality to break ties using the remaining data. Now I want to sort these objects efficiently (this is really going to be a bottleneck of the application).
Ideas so far
I've thought of several possible ways to achieve this, but each of them appears to have some rather unfortunate consequences. You don't necessarily have to read all of these. I tried to print the central question of each approach in bold. If you are going to suggest one of these approaches, then your answer should respond to the related questions as well.
1. C quicksort
Of course the C quicksort algorithm is available in C++ applications as well. Its signature matches my requirements almost perfectly. But using that function prohibits inlining of the comparison function, which means every comparison carries function call overhead. I had hoped for a way to avoid that. Any experience about how C qsort_r compares to the STL in terms of performance would be very welcome.
2. Indirection using Objects pointing at data
It would be easy to write a bunch of objects holding pointers to their respective data. Then one could sort those. There are two aspects to consider here. On the one hand, just moving pointers around instead of all the data would mean fewer memory operations. On the other hand, not moving the objects would probably hurt memory locality and thus cache performance. The chance that the deeper levels of quicksort recursion could actually access all their data from a few cache pages would vanish almost completely. Instead, each cached memory page would yield only very few usable data items before being replaced. If anyone could provide some experience about the tradeoff between copying and memory locality, I'd be very glad.
3. Custom iterator, reference and value objects
I wrote a class which serves as an iterator over the memory range. Dereferencing this iterator yields not a reference but a newly constructed object to hold the pointer to the data and the size s which is given at construction of the iterator. So these objects can be compared, and I even have an implementation of std::swap for these. Unfortunately, it appears that std::swap isn't enough for std::sort. In some parts of the process, my gcc implementation uses insertion sort (as implemented in __insertion_sort in file stl_algo.h) which moves a value out of the sequence, moves a number of items by one step, and then moves the first value back into the sequence at the appropriate position:
typename iterator_traits<_RandomAccessIterator>::value_type
__val = _GLIBCXX_MOVE(*__i);
_GLIBCXX_MOVE_BACKWARD3(__first, __i, __i + 1);
*__first = _GLIBCXX_MOVE(__val);
Do you know of a standard sorting implementation which doesn't require a value type but can operate with swaps alone?
So I'd not only need my class which serves as a reference, but I would also need a class to hold a temporary value. And as the size of my objects is dynamic, I'd have to allocate that on the heap, which means memory allocations at the very leaves of the recursion tree. Perhaps one alternative would be a value type with a static size that should be large enough to hold objects of the sizes I currently intend to support. But that would mean that there would be even more hackery in the relation between the reference_type and the value_type of the iterator class. And it would mean I would have to update that size for my application to one day support larger objects. Ugly.
If you can think of a clean way to get the above code to manipulate my data without having to allocate memory dynamically, that would be a great solution. I'm using C++11 features already, so using move semantics or similar won't be a problem.
4. Custom sorting
I even considered reimplementing all of quicksort. Perhaps I could make use of the fact that my comparison is mostly a lexicographical compare, i.e. I could sort sequences by the first byte and only switch to the next byte when the first byte is the same for all elements. I haven't worked out the details on this yet, but if anyone can suggest a reference, an implementation or even a canonical name to be used as a keyword for such a byte-wise lexicographical sorting, I'd be very happy. I'm still not convinced that with reasonable effort on my part I could beat the performance of the STL template implementation.
5. Completely different algorithm
I know there are many, many kinds of sorting algorithms out there. Some of them might be better suited to my problem. Radix sort comes to my mind first, but I haven't really thought this through yet. If you can suggest a sorting algorithm more suited to my problem, please do so. Preferably with an implementation, but even without one.
Question
So basically my question is this:
“How would you efficiently sort objects of dynamic size in heap memory?”
Any answer to this question which is applicable to my situation is good, no matter whether it is related to my own ideas or not. Answers to the individual questions marked in bold, or any other insight which might help me decide between my alternatives, would be useful as well, particularly if no definite answer to a single approach turns up.
The most practical solution is to use the C style qsort that you mentioned.
#include <cstdlib> // for qsort

template <unsigned S>
struct my_obj {
    enum { SIZE = S };
    const void *p_;
    my_obj (const void *p) : p_(p) {}
    //...accessors to get data from the pointer, and an operator< comparing it
    static int c_style_compare (const void *a, const void *b) {
        my_obj aa(a);
        my_obj bb(b);
        return (aa < bb) ? -1 : (bb < aa);
    }
};

template <unsigned N, typename OBJ>
void my_sort (char (&large_array)[N], const OBJ &) {
    qsort(large_array, N/OBJ::SIZE, OBJ::SIZE, OBJ::c_style_compare);
}
(Or, you can call qsort_r if you prefer.) Since STL sort can inline the comparison calls while qsort cannot, you may not get the fastest possible sorting this way. If all your system does is sorting, it may be worth it to add the code needed to get custom iterators to work. But if most of the time your system is doing something other than sorting, the extra gain may just be noise in your overall system.
Since there are only 31 different object variations (1 to 32 bytes), you could easily create an object type for each and select a call to std::sort based on a switch statement. Each call will get inlined and highly optimized.
Some object sizes might require a custom iterator, as the compiler will insist on padding native objects to align to address boundaries. Pointers can be used as iterators in the other cases since a pointer has all the properties of an iterator.
I'd agree with std::sort using a custom iterator, reference and value type; it's best to use the standard machinery where possible.
You worry about memory allocations, but modern memory allocators are very efficient at handing out small chunks of memory, particularly when they are reused repeatedly. You could also consider using your own (stateful) allocator, handing out length-s chunks from a small pool.
If you can overlay an object onto your buffer, then you can use std::sort, as long as your overlay type is copyable. (In the example below, the overlay is 64 bytes, compared by its leading 64-bit integer.) With 4GB of data, you're going to need a lot of memory though.
As discussed in the comments, you can have a selection of possible sizes based on some number of fixed-size templates. You would have to pick from these types at runtime (using a switch statement, for example). Here's an example of the template type with various sizes, and of sorting the 64-byte size.
Here's a simple example:
#include <vector>
#include <algorithm>
#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <ctime>
template <int WIDTH>
struct variable_width
{
    unsigned char w_[WIDTH];
};
typedef variable_width<8> vw8;
typedef variable_width<16> vw16;
typedef variable_width<32> vw32;
typedef variable_width<64> vw64;
typedef variable_width<128> vw128;
typedef variable_width<256> vw256;
typedef variable_width<512> vw512;
typedef variable_width<1024> vw1024;
bool operator<(const vw64& l, const vw64& r)
{
    // compare by the leading 64-bit integer of each 64-byte object
    const std::int64_t* l64 = reinterpret_cast<const std::int64_t*>(l.w_);
    const std::int64_t* r64 = reinterpret_cast<const std::int64_t*>(r.w_);
    return *l64 < *r64;
}

std::ostream& operator<<(std::ostream& out, const vw64& w)
{
    const std::int64_t* w64 = reinterpret_cast<const std::int64_t*>(w.w_);
    out << *w64;
    return out;
}
int main()
{
    srand(time(NULL));
    std::vector<unsigned char> buffer(10 * sizeof(vw64));
    vw64* w64_arr = reinterpret_cast<vw64*>(&buffer[0]);
    for(int x = 0; x < 10; ++x)
    {
        *reinterpret_cast<std::int64_t*>(w64_arr[x].w_) = rand();
    }
    std::sort(
        w64_arr,
        w64_arr + 10);
    for(int x = 0; x < 10; ++x)
    {
        std::cout << w64_arr[x] << '\n';
    }
    std::cout << std::endl;
    return 0;
}
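To tie this to the runtime element size s from the question, the dispatch could look roughly like the sketch below (my own; it assumes an operator< has been defined for each width actually used, as was done above only for vw64):
void sort_buffer(unsigned char* data, std::size_t count, std::size_t s)
{
    switch (s)
    {
    case 64:
        std::sort(reinterpret_cast<vw64*>(data),
                  reinterpret_cast<vw64*>(data) + count);
        break;
    // ...one case per supported width (vw8, vw16, ...), each with its own operator<
    default:
        break; // unsupported size: fall back to qsort or a hand-written sort
    }
}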
Given the enormous size (4GB), I would seriously consider dynamic code generation. Compile a custom sort into a shared library, and dynamically load it. The only non-inlined call should be the call into the library.
With precompiled headers, the compilation times may actually be not that bad. The whole <algorithm> header doesn't change, nor does your wrapper logic. You just need to recompile a single predicate each time. And since it's a single function you get, linking is trivial.
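On a POSIX system, the loading side of that approach might look roughly like this (the library and symbol names here are hypothetical; the generated library would contain the std::sort instantiation with the inlined comparison):
#include <dlfcn.h>
#include <cstddef>

typedef void (*sort_fn)(unsigned char *data, std::size_t count);

void sort_via_generated_code(unsigned char *data, std::size_t count)
{
    void *lib = dlopen("./libsort_s32.so", RTLD_NOW);                     // built for the current s
    sort_fn custom_sort = reinterpret_cast<sort_fn>(dlsym(lib, "sort_s32"));
    custom_sort(data, count);                                             // the only non-inlined call
    dlclose(lib);
}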
#include <vector>
#include <algorithm>
#include <cstdlib>

#define OBJECT_SIZE 32

struct structObject
{
    unsigned char* pObject;

    // Byte-wise lexicographical comparison of the pointed-to data
    bool operator < (const structObject &n) const
    {
        for(int i=0; i<OBJECT_SIZE; i++)
        {
            if(*(pObject + i) != *(n.pObject + i))
                return (*(pObject + i) < *(n.pObject + i));
        }
        return false;
    }
};
int main()
{
    std::vector<structObject> vObjects;
    unsigned char* pObjects = (unsigned char*)malloc(10 * OBJECT_SIZE); // 10 objects
    for(int i=0; i<10; i++)
    {
        structObject stObject;
        stObject.pObject = pObjects + (i*OBJECT_SIZE);
        *stObject.pObject = 'A' + 9 - i; // Add a value to the start to check the sort
        vObjects.push_back(stObject);
    }
    std::sort(vObjects.begin(), vObjects.end());

    free(pObjects);
    return 0;
}
To skip the #define
struct structObject
{
    unsigned char* pObject;
};

struct structObjectComparerAscending
{
    int iSize;

    structObjectComparerAscending(int _iSize)
    {
        iSize = _iSize;
    }

    bool operator ()(const structObject &stLeft, const structObject &stRight) const
    {
        for(int i=0; i<iSize; i++)
        {
            if(*(stLeft.pObject + i) != *(stRight.pObject + i))
                return (*(stLeft.pObject + i) < *(stRight.pObject + i));
        }
        return false;
    }
};
int main()
{
    int iObjectSize = 32; // Read it from somewhere
    std::vector<structObject> vObjects;
    unsigned char* pObjects = (unsigned char*)malloc(10 * iObjectSize);
    for(int i=0; i<10; i++)
    {
        structObject stObject;
        stObject.pObject = pObjects + (i*iObjectSize);
        *stObject.pObject = 'A' + 9 - i; // Add a value to the start to work with something...
        vObjects.push_back(stObject);
    }
    std::sort(vObjects.begin(), vObjects.end(), structObjectComparerAscending(iObjectSize));

    free(pObjects);
    return 0;
}
I have a 1D array containing N-dimensional data, and I would like to traverse it efficiently with std::transform or std::for_each.
unsigned int nelems;
unsigned int stride = 3; // we are going to have 3D points
float *pP;               // this will keep xyzxyzxyz...
Load(pP);
std::transform(pP, pP+nelems, strMover<float>(pP, stride)); // How to define the strMover??
The answer is not to change strMover, but to change your iterator. Define a new iterator class which wraps a float * but moves forward 3 places when operator++ is called.
You can use boost's Permutation Iterator and use a nonstrict permutation which only includes the range you are interested in.
If you try to roll your own iterator, there are some gotchas: to remain strict to the standard, you need to think carefully about what the correct "end" iterator for such a stride iterator is, since the naive implementation will merrily stride over and beyond the allowed "one-past-the-end" to the murky area far past the end of the array which pointers should never enter, for fear of nasal demons.
But I have to ask: why are you storing an array of 3d points as an array of floats in the first place? Just define a Point3D datatype and create an array of that instead. Much simpler.
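For illustration, a minimal sketch of such a hand-rolled stride iterator (my own; it assumes the element count is an exact multiple of the stride, which sidesteps the end-iterator gotcha mentioned above):
#include <cstddef>
#include <iterator>

class stride_iter
{
    float *p_;
    std::size_t stride_;
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef float                     value_type;
    typedef std::ptrdiff_t            difference_type;
    typedef float*                    pointer;
    typedef float&                    reference;

    stride_iter(float *p, std::size_t stride) : p_(p), stride_(stride) {}
    float &operator*() const { return *p_; }                   // the x of the current xyz triple
    stride_iter &operator++() { p_ += stride_; return *this; } // jump to the next point
    stride_iter operator++(int) { stride_iter tmp(*this); p_ += stride_; return tmp; }
    bool operator==(const stride_iter &o) const { return p_ == o.p_; }
    bool operator!=(const stride_iter &o) const { return p_ != o.p_; }
};

// Usage with the question's buffer: visit the x coordinate of every point.
// std::for_each(stride_iter(pP, 3), stride_iter(pP + nelems, 3), f);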
This is terrible; people have told you to use stride iterators instead. Apart from not being able to use function objects from the standard library with this approach, you make it very, very complicated for the compiler to produce multicore or SSE optimizations by using crutches like this.
Look for "stride iterator" for a proper solution, for example in the C++ Cookbook.
And back to the original question... use valarray and strided slices to simulate multidimensional arrays.
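For illustration, a minimal sketch of the valarray suggestion (a std::slice with a stride of 3 selects one coordinate of every point; the function and variable names here are my own):
#include <valarray>
#include <cstddef>

void shift_x(std::valarray<float> &points) // xyzxyz..., size a multiple of 3
{
    std::size_t npoints = points.size() / 3;
    std::valarray<float> xs = points[std::slice(0, npoints, 3)]; // all x coordinates
    xs += 1.0f;                                                  // element-wise operation
    points[std::slice(0, npoints, 3)] = xs;                      // write them back
}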
Well, I have decided to use for_each instead of transform; any other suggestions are welcome:
generator<unsigned int> gen(0, 1);
vector<unsigned int> idx(m_nelem);//make an index
std::generate(idx.begin(), idx.end(),gen);
std::for_each(idx.begin(), idx.end(), strMover<float>(&pPOS[0],&m_COM[0],stride));
where
template<class T> T op_sum (T i, T j) { return i+j; }

template<class T>
class strMover
{
    T *pP_;
    T *pMove_;
    unsigned int stride_;
public:
    strMover(T *pP, T *pMove, unsigned int stride) : pP_(pP), pMove_(pMove), stride_(stride)
    {}
    void operator() ( const unsigned int ip )
    {
        // add the displacement vector pMove_ to point ip (stride_ components)
        std::transform(&pP_[ip*stride_], &pP_[ip*stride_]+stride_,
                       pMove_, &pP_[ip*stride_], op_sum<T>);
    }
};
At first glance this is a thread-safe solution.
Use Boost range adaptors; you can get iterators out of them. The only disadvantage is compilation time.
vector<float> pp = vector_load(pP);
boost::for_each(pp | boost::adaptors::strided(3) | boost::adaptors::transformed(dosmtn()));