custom alignment options for an efficient quad edge implementation - d

I'm fairly new to the D2 programming language. I need to implement a quad edge data structure. This is an adjacency data structure for effective graph embeddings and their duals.
From previous experience with C, I started with the following implementation:
struct edge(T) {
    edge* next;
    T* data;
    uint position;
}
alias edge[4] qedge;

edge* new_edge() {
    qedge* q = new qedge; // does not compile, just analogous
    // initialize the four edges (q[i].position = i)
    return cast(edge*) q;
}
That works fine, but this position field is bugging me - it is just wasting space there.
Is there a way to align to an (8 * ${machine word size}) boundary when allocating the memory for the qedge array? Then I would not need the additional position field, because the position of an edge within the qedge array would be encoded in its address.
I'm aiming at an efficient and cute implementation (that's why I've chosen D), so any other suggestions are welcome.
EDIT: Here is a piece of code to make things clearer (http://dpaste.dzfl.pl/e26baaad):
module qedge;

class edge(T) {
    edge next;
    T* data;
}

alias edge!int iedge;
alias iedge[4] qedge;

import std.stdio;
import std.c.stdlib;

void main() {
    writeln(iedge.sizeof); // 8
    writeln(qedge.sizeof); // 32
    // need something for the next line,
    // which always returns an address divisible by 32
    qedge* q = cast(qedge*) malloc(qedge.sizeof);
    writeln(q); // e.g. 0x10429A0 = 17050016, which does not start at a 32-byte boundary
}

To directly answer the question: use align(N) to specify the alignment you want. However, keep this in mind (quote from dlang.org): "Do not align references or pointers that were allocated using NewExpression on boundaries that are not a multiple of size_t."
The reference manual at http://dlang.org has a section about alignment: http://dlang.org/attribute.html#align .
Say T is int: edge!int is already 24 bytes on a 64-bit architecture (two 8-byte pointers plus a 4-byte uint, padded to 24).
D is not so different from C in alignment. Check the following (runnable/editable code at: http://dpaste.dzfl.pl/de0121e1):
module so_0001;
// http://stackoverflow.com/questions/11383240/custom-alignment-options-for-an-efficient-quad-edge-implementation

struct edge(T) {
    edge* next;
    T* data;
    uint position;
}

alias edge!int iedge;
alias edge!int[4] qedge;

/* not a good idea for a struct... it is a value type...
edge!int new_edge() {
    // in the original example it used new... D is not C++ to mix value and reference types! :)
}
*/

import std.stdio;

int main() {
    writeln(iedge.sizeof);
    // 64-bit
    // ----------
    writeln(iedge.next.offsetof);     // 0
    writeln(iedge.data.offsetof);     // 8
    writeln(iedge.position.offsetof); // 16
    writeln(qedge.sizeof);
    return 0;
}

Related

Best way to wrap a C array pointer as a Chapel array

When interoperating with C, I often find myself being handed a pointer to an array. Chapel currently lets me treat this pointer as a 1D 0-indexed array. However, there are cases where I'd like to treat this pointer as a Chapel array (with, e.g., a multidimensional domain). What would be the most idiomatic way to achieve this in Chapel?
I might have tried to do this by wrapping the C pointer in a class (with a domain) and defining this, and these (serial and parallel) methods so that one could index into and iterate over the class. In order to implement this, it would be useful to have a function that maps an index in a domain to a 0-indexed location. Is there such a built in function?
Unfortunately, there does not appear to be a function that works for every domain. DefaultRectangularArr uses a method called getDataIndex under the covers that performs this computation, and it looks like other array types rely on a similar method defined on an inner storage class. It looks like these are not available on the domains themselves. I suspect relying on any of these would be inadvisable anyway, as they may be changed as part of an implementation adjustment.
Our hope is that eventually pointers like what you are describing could be wrapped in Chapel arrays using something like the makeArrayFromPtr function defined for interoperability. Unfortunately this function only supports 1D 0-indexed arrays today, but work is currently being done to expand our support for array interoperability. I would expect that function to adjust its arguments or for another version to be defined for multi-dimensional arrays, and we are still figuring that out.
I was curious whether I could trick a Chapel array into referring to a buffer allocated in C without much effort. I was able to do it, but am not proud of the result. Specifically:
it uses features that aren't user-facing, so may change or break at any future point
it currently only works for rectangular Chapel arrays that start indexing at 0 in each dimension
With those caveats in mind, here's a simple C header that exposes a few simple routines to be called from Chapel. The first allocates a trivial 9-element array; the second frees its argument.
#include <stdlib.h>
double* getDataPtr() {
double* dataPtr = (double*)malloc(9*sizeof(double));
dataPtr[0] = 1.1;
dataPtr[1] = 1.2;
dataPtr[2] = 1.3;
dataPtr[3] = 2.1;
dataPtr[4] = 2.2;
dataPtr[5] = 2.3;
dataPtr[6] = 3.1;
dataPtr[7] = 3.2;
dataPtr[8] = 3.3;
return dataPtr;
}
void freeDataPtr(double* ptr) {
free(ptr);
}
and here's the Chapel code that calls into it, then forces the C pointer into an existing array of the appropriate size and 0-based indices:
//
// Declare a Chapel array. Note that this program will only work as
// written if it uses 0-based indexing.
//
var A: [0..2, 0..2] real;
//
// testit.h is the C code above. It defines a simple C stub that returns a pointer
// to floating point data.
//
require "testit.h";
//
// Here are the key routines that testit.h exposes back to Chapel to
// get and free a pointer to floating point data.
//
extern proc getDataPtr(): c_ptr(real);
extern proc freeDataPtr(ptr: c_ptr(real));
//
// Grab the pointer from C
//
const myCPtr = getDataPtr();
//
// Save two pointer values defined in A's descriptor. Note that these
// are not part of its public interface, so are not recommended for
// typical users and could change / break at any future point.
//
const saveData = A._value.data;
const saveShiftedData = A._value.shiftedData;
//
// Replace these pointers with the one we got from C.
//
A._value.data = (myCPtr: _ddata(real));
A._value.shiftedData = (myCPtr: _ddata(real));
//
// print out A, "proving" that we're referring to the data from C
//
writeln(A);
//
// restore the original pointers to avoid having Chapel try to free
// the C memory / leak the Chapel memory.
//
A._value.data = saveData;
A._value.shiftedData = saveShiftedData;
//
// Free the C data
//
freeDataPtr(myCPtr);
The output of Chapel's writeln(A) statement is:
1.1 1.2 1.3
2.1 2.2 2.3
3.1 3.2 3.3
I think it'd be completely reasonable to file a feature request on Chapel's GitHub issues page proposing a better, more official user-facing interface for adopting a C pointer like this.
A slightly different approach might be below. The downside here is that this isn't a Chapel array, but it does allow for indexing/iterating just like a Chapel array would. This doesn't handle strides or non-zero alignments and there is no bounds-checking done.
prototype module CPtrArray {
  use SysCTypes;

  record CPtrArray {
    type eltType;
    param rank : int;
    const first : rank*int;
    const blk : rank*int;
    var data : c_ptr(eltType);

    proc init(type t, D : domain, ptr : c_ptr(t)) {
      this.eltType = t;
      this.rank = D.rank;
      this.first = D.first;
      var blktmp : rank*int;
      blktmp(rank) = 1;
      if (rank > 1) {
        for param idim in (rank-1)..1 by -1 {
          blktmp(idim) = blktmp(idim+1)*D.shape(idim+1);
        }
      }
      this.blk = blktmp;
      this.complete();
      data = ptr;
    }

    proc getDataOffset(ndx : rank*int) : int {
      var offset = ndx(rank)-first(rank);
      if (rank > 1) {
        for param idim in 1..(rank-1) do
          offset += (ndx(idim)-first(idim))*blk(idim);
      }
      return offset;
    }

    inline proc this(ndx : rank*int) ref {
      return data[getDataOffset(ndx)];
    }

    inline proc this(i:int ...rank) ref {
      return this(i);
    }

    // Should provide iterators as well.
  }
}
A simple test program that uses it:
use CPtrArray;

var A : [0.. #10] real;
forall ii in 0.. #10 do A[ii] = ii+1.0;
writeln(A);
var ptr = c_ptrTo(A[0]);

// Code for testing the array class goes here
const D = {1.. #5, 3.. #2};
var A1 = new CPtrArray(real, D, ptr);
for ndx in D {
  writeln(ndx," ",A1[ndx]);
}

Implementing concurrent_vector according to intel blog

I am trying to implement a thread-safe, lock-free container analogous to std::vector, following this Intel blog post: https://software.intel.com/en-us/blogs/2008/07/24/tbbconcurrent_vector-secrets-of-memory-organization
From what I understood, to prevent re-allocations and the invalidation of all iterators on all threads, instead of keeping a single contiguous array they add new contiguous blocks.
Each block they add has a size that is an increasing power of 2, so they can use log(index) to find the proper segment where the item at [index] is supposed to be.
From what I gather, they keep a static array of pointers to segments so they can access them quickly. Since they don't know in advance how many segments the user will need, they start with a small one, and if the number of segments exceeds the current capacity, they allocate a huge one and switch to it.
The problem is that adding a new segment can't be done in a lock-free, thread-safe manner, or at least I haven't figured out how. I can atomically increment the current size, but only that.
Also, switching from the small to the large array of segment pointers involves a big allocation and memory copies, so I can't understand how they are doing it.
They have some code posted online, but all the important functions lack available source code; they are in their Threading Building Blocks DLL. Here is some code that demonstrates the issue:
template<typename T>
class concurrent_vector
{
private:
    int size = 0;
    int lastSegmentIndex = 0;
    union
    {
        T* segmentsSmall[3];
        T** segmentsLarge;
    };

    void switch_to_large()
    {
        // Bunch of allocations; basically creates a T* segmentsLarge[32]
        // and reassigns all old entries into it
    }

public:
    concurrent_vector()
    {
        // The initial array is contiguous just for the sake of cache optimization
        T* initialContiguousBlock = new T[2 + 4 + 8]; // 2^1 + 2^2 + 2^3
        segmentsSmall[0] = initialContiguousBlock;
        segmentsSmall[1] = initialContiguousBlock + 2;
        segmentsSmall[2] = initialContiguousBlock + 2 + 4;
    }

    void push_back(T& item)
    {
        if(size > 2 + 4 + 8)
        {
            // This is the problem part: there is no possible way to make this
            // thread-safe without a mutex lock. I don't understand how Intel
            // does it. It includes a bunch of allocations and memory copies.
            switch_to_large();
        }
        InterlockedIncrement(&size); // Ok, so size is atomically increased
        // afterwards adds the item to the appropriate slot in the appropriate segment
    }
};
I would not try to make segmentsLarge and segmentsSmall a union. Yes, this wastes one more pointer, but then a single pointer, let's call it just segments, can initially point to segmentsSmall.
All the other methods can then always use that same pointer, which makes them simpler.
And switching from small to large can be accomplished by one compare-exchange of that pointer.
I am not sure how this could be accomplished safely with a union.
The idea would look something like this (note that I used C++11, which the Intel library predates, so they likely did it with their atomic intrinsics).
This probably misses quite a few details which I am sure the Intel people have thought more about, so you will likely have to check this against the implementations of all other methods.
#include <atomic>
#include <array>
#include <algorithm>
#include <cstddef>
#include <climits>

template<typename T>
class concurrent_vector
{
private:
    // one segment slot per bit of size_t is enough for any address space
    static const unsigned numSegments = sizeof(size_t) * CHAR_BIT;

    std::atomic<size_t> size;
    std::atomic<T**> segments;
    std::array<T*, 3> segmentsSmall;
    unsigned lastSegmentIndex = 0;

    void switch_to_large()
    {
        T** segmentsOld = segments;
        if(segmentsOld == segmentsSmall.data()) {
            // not yet switched
            T** segmentsLarge = new T*[numSegments];
            // note that we leave the original segment allocations alone and just copy the pointers
            std::copy(segmentsSmall.begin(), segmentsSmall.end(), segmentsLarge);
            for(unsigned i = segmentsSmall.size(); i < numSegments; ++i) {
                segmentsLarge[i] = nullptr;
            }
            // now both the old and the new segments array are valid
            if(segments.compare_exchange_strong(segmentsOld, segmentsLarge)) {
                // success!
                return;
            } else {
                // already switched, just clean up
                delete[] segmentsLarge;
            }
        }
    }

public:
    concurrent_vector() : size(0), segments(segmentsSmall.data())
    {
        // The initial array is contiguous just for the sake of cache optimization
        T* initialContiguousBlock = new T[2 + 4 + 8]; // 2^1 + 2^2 + 2^3
        segmentsSmall[0] = initialContiguousBlock;
        segmentsSmall[1] = initialContiguousBlock + 2;
        segmentsSmall[2] = initialContiguousBlock + 2 + 4;
    }

    void push_back(T& item)
    {
        if(size > 2 + 4 + 8) {
            switch_to_large();
        }
        // here we may have to allocate more segments atomically
        ++size;
        // afterwards adds the item to the appropriate slot in the appropriate segment
    }
};

Size of class object in C++

On what basis is the size of the class object shown as 12?
class testvector
{
public:
    vector<int> test;
};

int main()
{
    testvector otestvector;
    cout << "size :" << sizeof(otestvector) << "\n";
    cout << "size of int :" << sizeof(int);
}
Output:
size :12
size of int :4
Think of it like this. Let's imagine the standard C++ library didn't have a vector class. And you decided it would be a good idea to have one.
You just might, at a very minimum, come up with something like this. (Disclaimer: the actual vector class for C++ is far more complicated)
template <class T>
class vector
{
    T* items;        // array of items
    size_t length;   // number of items inserted
    size_t capacity; // how many items we've allocated
public:
    void push_back(const T& item) {
        if (length >= capacity) {
            grow(length * 2); // double capacity
        }
        items[length] = item;
        length++;
    }
    ...
};
Let's break an instance of my simple vector class down on a 32-bit system:
sizeof(items) == 4 // pointers are 4 bytes on 32-bit systems
sizeof(length) == 4; // since size_t is typically a long, it's 32-bits as well
sizeof(capacity) == 4; // same as above
So there's 12 bytes of member variables just to start out. Hence sizeof(vector<T>) == 12 for my simple example. And it doesn't matter what type T actually is. The sizeof() operator just accounts for the member variables, not any heap allocations associated with each.
The above is just a crude example. The actual vector class has a more complex structure, support for custom allocators, and other optimizations for efficient iterations, insertion, and removal. Hence, likely more member variables inside the class.
So at the very least, my minimal example is already 12 bytes long. It will probably be 24 bytes with a 64-bit compiler, since sizeof(pointer) and sizeof(size_t) typically double on 64-bit.

Memory allocation for struct (low performance)

I have a question about the slow performance of allocating memory for a large number of structs.
I have a struct which looks like this:
typedef struct _node
{
    // Pointers to leaves & neighbours
    struct _node *children[nrChild], *neighb[nrNeigh];
    // Pointer to parent node
    struct _node *parentNode;
    struct _edgeCorner *edgePointID[nrOfEdge];
    int indexID;      // Value
    double f[latDir]; // Lattice velocities
    double rho;       // Density
    double Umag;      // Mag. velocity
    int depth;        // Depth of octree element
} node;
At the beginning of my code I have to create a lot of them (100,000 – 1,000,000) using:
tree = new node();
and initializing the members afterwards.
Unfortunately, this is pretty slow. Does anyone have an idea how to improve the performance?
Firstly, you'll want to fix it so that it's actually written in C++.
struct node
{
    // Pointers to leaves & neighbours
    std::array<std::unique_ptr<node>, nrChild> children;
    std::array<node*, nrNeigh> neighb;
    // Pointer to parent node
    node* parentNode;
    std::array<_edgeCorner*, nrOfEdge> edgePointID;
    int indexID;                  // Value
    std::array<double, latDir> f; // Lattice velocities
    double rho;                   // Density
    double Umag;                  // Mag. velocity
    int depth;                    // Depth of octree element
};
Secondly, in order to improve your performance, you will require a custom allocator. Boost.Pool would be a fine choice: it's a pre-existing solution that is explicitly designed for repeated allocations of the same size, in this case sizeof(node). There are other schemes, like a memory arena, that can be even faster, depending on your deallocation needs.
If you know how many nodes you will have, you could allocate them all in one go:
node* Nodes = new node[1000000];
You will need to set the values afterwards, just like you would do if you did it one by one. If it's a lot faster this way, you could try an architecture where you find out how many nodes you will need before allocating them, even if you don't have that number right now.

How do I allocate variably-sized structures contiguously in memory?

I'm using C++, and I have the following structures:
struct ArrayOfThese {
    int a;
    int b;
};

struct DataPoint {
    int a;
    int b;
    int c;
};
In memory, I want to have 1 or more ArrayOfThese elements at the end of each DataPoint. There are not always the same number of ArrayOfThese elements per DataPoint.
Because I have a ridiculous number of DataPoints to assemble and then stream across a network, I want all my DataPoints and their ArrayOfThese elements to be contiguous. Wasting space for a fixed number of the ArrayOfThese elements is unacceptable.
In C, I would have made an element at the end of DataPoint that was declared as ArrayOfThese d[0];, allocated a DataPoint plus enough extra bytes for however many ArrayOfThese elements I had, and used the dummy array to index into them. (Of course, the number of ArrayOfThese elements would have to be in a field of DataPoint.)
In C++, is using placement new and the same 0-length array hack the correct approach? If so, does placement new guarantee that subsequent calls to new from the same memory pool will allocate contiguously?
Since you are dealing with plain structures that have no constructors, you could revert to C memory management:
void *ptr = malloc(sizeof(DataPoint) + n * sizeof(ArrayOfThese));
DataPoint *dp = reinterpret_cast<DataPoint *>(ptr);
ArrayOfThese *aotp = reinterpret_cast<ArrayOfThese *>(reinterpret_cast<char *>(ptr) + sizeof(DataPoint));
Since your structs are PODs you might as well do it just as you would in C. The only thing you'll need is a cast. Assuming n is the number of things to allocate:
DataPoint *p=static_cast<DataPoint *>(malloc(sizeof(DataPoint)+n*sizeof(ArrayOfThese)));
Placement new does come into this sort of thing if your objects have a non-trivial constructor. It guarantees nothing about any allocations, though, for it does no allocating itself and requires the memory to have been allocated already somehow. Instead, it treats the block of memory passed in as space for the as-yet-unconstructed object, then calls the right constructor to construct it. If you were to use it, the code might go like this. Assume DataPoint has the ArrayOfThese arr[0] member you suggest:
void *p = malloc(sizeof(DataPoint) + n * sizeof(ArrayOfThese));
DataPoint *dp = new(p) DataPoint;
for(size_t i = 0; i < n; ++i)
    new(&dp->arr[i]) ArrayOfThese;
What gets constructed must get destructed so if you do this you should sort out the call of the destructor too.
(Personally I recommend using PODs in this sort of situation, because it removes any need to call constructors and destructors, but this sort of thing can be done reasonably safely if you are careful.)
As Adrian said in his answer, what you do in memory doesn't have to be the same as what you stream over the network. In fact, it might even be good to clearly separate the two, because having a communication protocol rely on your data being laid out in a specific way creates huge problems if you later need to refactor your data.
The C++ way to store an arbitrary number of elements contiguously is of course std::vector. Since you didn't even consider this, I assume that there's something that makes it undesirable. (Do you only have small numbers of ArrayOfThese and fear the space overhead associated with std::vector?)
While the trick of over-allocating a zero-length array probably isn't guaranteed to work and might, technically, invoke the dreaded undefined behavior, it's widespread. What platform are you on? On Windows, this is done in the Windows API itself, so it's hard to imagine a vendor shipping a C++ compiler which wouldn't support it.
If there's a limited number of possible ArrayOfThese element counts, you could also use fnieto's trick to specify those few numbers and then new one of the resulting template instances, depending on the run-time number:
struct DataPoint {
    int a;
    int b;
    int c;
};

template <std::size_t sz>
struct DataPointWithArray : DataPoint {
    ArrayOfThese array[sz];
};

DataPoint* create(std::size_t n)
{
    switch(n) {
        case 1:  return new DataPointWithArray<1>;
        case 2:  return new DataPointWithArray<2>;
        case 5:  return new DataPointWithArray<5>;
        case 7:  return new DataPointWithArray<7>;
        case 27: return new DataPointWithArray<27>;
        default: assert(false);
    }
    return NULL;
}
Prior to C++0X, the language had no memory model to speak of. And with the new standard, I don't recall any talk of guarantees of contiguity.
Regarding this particular question, it sounds as if what you want is a pool allocator, many examples of which exist. Consider, for instance, Modern C++ Design, by Alexandrescu. The small object allocator discussion is what you should look at.
I think boost::variant might accomplish this. I haven't had an opportunity to use it, but I believe it's a wrapper around unions, and so a std::vector of them should be contiguous, but of course each item will take up the larger of the two sizes, you can't have a vector with differently-sized elements.
Take a look at the comparison of boost::variant and boost::any.
If you want the offset of each element to be dependent on the composition of the previous elements, you will have to write your own allocator and accessors.
Seems like it would be simpler to allocate an array of pointers and work with that rather than using placement new. That way you could just reallocate the whole array to the new size with little runtime cost. Also if you use placement new, you have to explicitly call destructors, which means mixing non-placement and placement in a single array is dangerous. Read http://www.parashift.com/c++-faq-lite/dtors.html before you do anything.
Don't confuse the data organisation inside your program with the data organisation used for serialization: they do not have the same goal.
For streaming across a network, you have to consider both sides of the channel, the sending and the receiving side: how does the receiving side differentiate between a DataPoint and an ArrayOfThese? How does the receiving side know how many ArrayOfThese are appended after a DataPoint? (Also to consider: what is the byte ordering of each side? Do data types have the same size in memory?)
Personally, I think you need a different structure for streaming your data, in which you send the number of DataPoints as well as the number of ArrayOfThese after each DataPoint. I would also not worry about the way data is already organized in my program, and instead reorganize/reformat it to suit my protocol rather than my program. After that, writing a function for sending and another for receiving is not a big deal.
Why not have DataPoint contain a variable-length array of ArrayOfThese items? This will work in C or C++, though there are some concerns if either struct contains non-primitive types.
But use free() rather than delete on the result:
struct ArrayOfThese {
    int a;
    int b;
};

struct DataPoint {
    int a;
    int b;
    int c;
    int length;
    ArrayOfThese those[0];
};

DataPoint* allocDP(int a, int b, int c, size_t length)
{
    // There might be alignment issues, but not for most compilers:
    size_t sz = sizeof(DataPoint) + length * sizeof(ArrayOfThese);
    DataPoint* dp = (DataPoint*)calloc(1, sz);
    // (Check for out of memory)
    dp->a = a; dp->b = b; dp->c = c; dp->length = length;
    return dp;
}
Then you can use it "normally" in a loop where the DataPoint knows its length:
DataPoint *dp = allocDP(5, 8, 3, 20);
for(int i = 0; i < dp->length; ++i)
{
    // Initialize or access: dp->those[i]
}
Could you make those into classes with a common superclass and then use your favourite STL container of choice, with the superclass as the element type?
Two questions: Is the similarity between ArrayOfThese and DataPoint real, or a simplification for posting? I.e. is the real difference just one int (or some arbitrary number of the same type of items)?
Is the number of ArrayOfThese associated with a particular DataPoint known at compile time?
If the first is true, I'd think hard about simply allocating an array of as many items as necessary for one DataPoint+N ArrayOfThese. I'd then build a quick bit of code to overload operator[] for that to return item N+3, and overload a(), b() and c() to return the first three items.
If the second is true, I was going to suggest essentially what I see fnieto has just posted, so I won't go into more detail.
As far as placement new goes, it doesn't really guarantee anything about allocation -- in fact, the whole idea about placement new is that it's completely unrelated to memory allocation. Rather, it allows you to create an object at an arbitrary address (subject to alignment restrictions) in a block of memory that's already allocated.
Here's the code I ended up writing:
#include <iostream>
#include <cstdlib>
#include <cassert>
#include <new>

using namespace std;

struct ArrayOfThese {
    int e;
    int f;
};

struct DataPoint {
    int a;
    int b;
    int c;
    int numDPars;
    ArrayOfThese d[0];

    DataPoint(int numDPars) : numDPars(numDPars) {}

    DataPoint* next() {
        return reinterpret_cast<DataPoint*>(
            reinterpret_cast<char*>(this) + sizeof(DataPoint) + numDPars * sizeof(ArrayOfThese));
    }
    const DataPoint* next() const {
        return reinterpret_cast<const DataPoint*>(
            reinterpret_cast<const char*>(this) + sizeof(DataPoint) + numDPars * sizeof(ArrayOfThese));
    }
};

int main() {
    const size_t BUF_SIZE = 1024*1024*200;
    char* const buffer = new char[BUF_SIZE];
    char* bufPtr = buffer;
    const int numDataPoints = 1024*1024*2;
    for (int i = 0; i < numDataPoints; ++i) {
        // This wouldn't really be random.
        const int numArrayOfTheses = random() % 10 + 1;
        DataPoint* dp = new(bufPtr) DataPoint(numArrayOfTheses);
        // Here, do some stuff to fill in the fields.
        dp->a = i;
        bufPtr += sizeof(DataPoint) + numArrayOfTheses * sizeof(ArrayOfThese);
    }
    DataPoint* dp = reinterpret_cast<DataPoint*>(buffer);
    for (int i = 0; i < numDataPoints; ++i) {
        assert(dp->a == i);
        dp = dp->next();
    }
    // Here, send it out.
    delete[] buffer;
    return 0;
}