Storing unused class data members on disk - c++

I have a GUI application which works with point cloud data and a quadtree data structure behind it to handle the data. As the point format I'm working with has changed recently, I had to modify my point class to hold the new attributes, which caused Point objects to grow in size significantly and in effect reduced the performance of my quadtree. Some of these attributes are not needed for displaying and processing the data, but they still need to be preserved in the output. This is roughly how my point class looks at the moment:
class Point {
public:
/* ... */
private:
/* Used data members */
double x;
double y;
double z;
double time;
int attr1;
int attr2;
/* Unused data members */
int attr3;
double attr4;
float attr5;
float attr6;
float attr7;
};
When the data is loaded from the file, the Points are stored in a Point* array and then handled by the quadtree. Similarly, when they are saved, an array of points is passed from the quadtree and saved to a file. Note that the Point objects I'm using in my quadtree are different from those stored in the file, but I'm using a library that provides reader and writer objects which I use to create my points. Here's an example:
int PointLoader::load(int n, Point* points) {
Point temp;
int pointCounter = 0;
/* reader object is provided by the library and declared elsewhere */
while (pointCounter < n && reader->read_point()) {
temp = Point(reader->get_x(), reader->get_y(), reader->get_z(), /* ... */ );
points[pointCounter] = temp;
++pointCounter;
}
return pointCounter;
}
Now, my idea is to reduce size of the Point class, and store unused attributes in another class (or struct) called PointData on the hard drive. This is necessary because the data usually doesn't fit in memory and there's a caching system in place, which again would benefit from smaller point objects. So given the example it would look something like this:
int PointLoader::load(int n, Point* points) {
Point temp;
PointData tempData;
int pointCounter = 0;
while (pointCounter < n && reader->read_point()) {
temp = Point(reader->get_x(), reader->get_y(), reader->get_z(), /* ... */ );
tempData = PointData(reader->get_attr3(), reader->get_attr4(), /* ... */ );
temp.dataHandle = /* some kind of handle to the data object */;
points[pointCounter] = temp;
/* Save tempData to file to retrieve when saving points */
++pointCounter;
}
return pointCounter;
}
Then when I save my modified points I'd simply use the dataHandle (file offset? an index in a memory-mapped array?) to retrieve the PointData of each point and write it back to the file.
Does that sound like a good idea? What would be the most sensible approach of achieving this?

I would suggest you use mapped files for storing the additional data. This will automatically cause them to be flushed to disk and removed from RAM if there is memory pressure, but they will stay resident in RAM most of the time if there is enough memory.
In your Point class, storing offsets in the file is better than storing direct pointers into the mapped memory region, as offsets will still be correct if you have to remap the file in order to grow it (you have to grow the file yourself, e.g. with ftruncate(), or with lseek() past the end followed by a write, as you can only map as much as the size of the file).
This mechanism is very convenient to code to, but you must have enough address space to map the whole file - no issue in a 64-bit app, but possibly a problem if you're 32-bit and need more than a few hundred MB of data in the file. You can of course map and unmap multiple files, but it requires more coding work and is less performant (there is some cost to mapping and unmapping files).
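As a rough illustration of the offset idea, here is a minimal POSIX-only sketch. PointData, PointDataStore, and the choice of a record index as the handle are assumptions made for the example, not part of the library from the question. Error handling is minimal, there is no bounds checking in get(), and the initial capacity must be greater than zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical record for the attributes the quadtree never looks at.
struct PointData {
    int attr3;
    double attr4;
    float attr5, attr6, attr7;
};

// Append-only store backed by a memory-mapped file (POSIX only).
class PointDataStore {
public:
    // initialCapacity is a record count and must be greater than zero.
    PointDataStore(const char* path, std::size_t initialCapacity) {
        fd_ = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd_ < 0) throw std::runtime_error("open failed");
        try {
            remap(initialCapacity);
        } catch (...) {
            ::close(fd_);
            throw;
        }
    }
    ~PointDataStore() {
        if (base_) ::munmap(base_, capacity_ * sizeof(PointData));
        ::close(fd_);
    }
    PointDataStore(const PointDataStore&) = delete;
    PointDataStore& operator=(const PointDataStore&) = delete;

    // Stores a record and returns its index; a Point would keep this as dataHandle.
    std::uint64_t append(const PointData& d) {
        if (count_ == capacity_) remap(capacity_ * 2);
        std::memcpy(base_ + count_, &d, sizeof d);
        return count_++;
    }

    // Looks the record up again; the index stays valid across remaps.
    PointData get(std::uint64_t handle) const {
        PointData d;
        std::memcpy(&d, base_ + handle, sizeof d);
        return d;
    }

private:
    // Grows the file, then replaces the mapping. Old pointers into the
    // mapping become invalid here, which is why callers only hold indices.
    void remap(std::size_t newCapacity) {
        if (base_) {
            ::munmap(base_, capacity_ * sizeof(PointData));
            base_ = nullptr;
        }
        const std::size_t bytes = newCapacity * sizeof(PointData);
        if (::ftruncate(fd_, static_cast<off_t>(bytes)) != 0)
            throw std::runtime_error("ftruncate failed");
        void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
        base_ = static_cast<PointData*>(p);
        capacity_ = newCapacity;
    }

    int fd_ = -1;
    PointData* base_ = nullptr;
    std::size_t capacity_ = 0;
    std::uint64_t count_ = 0;
};
```

When saving, you would walk the points, call get(point.dataHandle) for each, and hand the result to the writer together with the in-memory attributes. An index rather than a byte offset is used here only because all records have the same size.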

Related

C++ can you design a data structure that keeps pointers in contiguous memory and doesn't invalidate them?

I am trying to implement the half edge data structure for mesh representation. My current hurdle is,
For this DS you need multiple pointers, an example of a possible implementation is this:
class HalfEdge {
Edge * next;
Edge * prev;
Edge * pair;
Vertex * vertex;
Face * face;
};
There's nothing wrong with that. However, if you use pointers (let's say we use smart pointers instead of raw ones to avoid known issues with pointers), your data now lives sparsely in memory and you lose performance due to caching. Worse, if you want to send that mesh to the GPU, you must now serialize the vertices, and imagine you have to do that every frame!
So an alternative is to have arrays and then do this:
// Somewhere else we have vectors of edges, faces and vertices
// we store their indices in the array rather than pointers.
class HalfEdge {
uint next_index;
uint prev_index;
uint pair_index;
uint vertex_index;
uint face_index;
};
This, combined with the arrays, will allow you to keep stuff contiguous in memory, and now you can send the mesh to the GPU without the overhead of making the linearized buffer. However, this has a syntax problem.
Before you could do face->field; now you have to do face_array[face_index].field, which is super clunky.
Naively one could think that combining vectors for allocation and pointers for access would work, something like Face* face = &face_array[index]. However, anyone with enough experience in C++ knows that pointer is going to become invalid the second the array is resized.
Not knowing the potential size of the mesh, the array can't be preallocated, so it's not possible to do that.
Based on all of the above, can you do better than face_array[index].field if you want the memory to be contiguous?
Assuming you store your members in a list, you could do something like:
class HalfEdge {
public:
Edge& next() { return (*edge_list)[next_index]; }
Edge& prev() { return (*edge_list)[prev_index]; }
// ...
// const-qualified versions are probably a good idea too
private:
// or global if needed, or whatever
std::vector<Edge>* edge_list;
// ...
std::size_t next_index;
std::size_t prev_index;
// ...
};
This will allow you to access values in your local scope like
HalfEdge he = /* ... */;
auto& edge = he.next(); // which is otherwise equivalent to
// auto& edge = (*he.edge_list)[he.next_index];
In essence it's similar to your idea of storing a reference, but rather than storing a reference to a data member that can get invalidated, we instead store a reference to the actual array and recalculate the offset as needed.
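If you want the face->field syntax back, the same "container plus index" idea can be wrapped in a small handle type. This is only a sketch: Handle and the Face stand-in are hypothetical names, and the handle is larger than a bare index, so the arrays you upload to the GPU would still store plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical handle: a (container, index) pair that behaves like a pointer.
template <typename T>
class Handle {
public:
    Handle(std::vector<T>& storage, std::size_t index)
        : storage_(&storage), index_(index) {}

    // The element address is recomputed on every access, so growth of the
    // vector (and the reallocation that comes with it) does not break it.
    T* operator->() const { return &(*storage_)[index_]; }
    T& operator*() const { return (*storage_)[index_]; }
    std::size_t index() const { return index_; }

private:
    std::vector<T>* storage_;  // the vector object itself must outlive the handle
    std::size_t index_;
};

struct Face {
    int material = 0;  // stand-in for real per-face data
};
```

With this, Handle<Face> face(face_array, index); face->material = 7; keeps working after face_array grows, whereas a Face* taken before the growth would not.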

On memory allocation in armadillo sparse matrices

I want to know whether I need to free up the memory occupied by the locations and values objects after the sparse matrix has been created. Here is the code:
void load_data(umat &locations, vec& values){
// fill up locations and values
}
int main(int argc, char**argv){
umat loc;
vec val;
load_data(loc,val);
sp_mat X(loc,val);
return 0;
}
In the above code, load_data() fills up the locations and values objects and then the sparse matrix is created in main(). My question: do I need to free up the memory used by locations and values after construction of X? The reason is that X could be large and I am low on RAM. I know that when main returns, the OS will free up locations and values as well as X. But the real question is whether the memory occupied by X is the same as that of locations and values, or whether X is allocated memory separately and I need to free locations and values in that case.
The constructor (SpMat_meat.hpp:231) you are using
template<typename T1, typename T2>
inline SpMat(const Base<uword,T1>& locations, const Base<eT,T2>& values, const bool sort_locations = true);
fills the sparse matrix with copies of the values in values.
I understand you are worried that you will run out of memory and if you keep loc, val and X separately you basically have two copies of the same data, taking up twice as much memory as actually needed (this is indeed what happens in your code snippet), so I will try to focus on addressing this problem and give you a few options:
1) If you are fine with keeping two copies of the data for a short while, the easiest solution is to dynamically allocate loc and val and delete them right after initialization of X
int main(int argc, char**argv){
umat* ploc = new umat;
vec* pval = new vec;
load_data(*ploc,*pval);
// at this point we have one copy of the data
sp_mat X(*ploc,*pval);
// at this point we have two copies of the data
delete ploc;
delete pval;
// at this point we have one copy of the data
return 0;
}
Of course you can use smart pointers (e.g. std::unique_ptr) instead of raw ones, but you get the idea.
2) If you absolutely don't want to have two copies of the data at any time, I would suggest that you modify your load_data routine to sequentially load values one by one and directly insert them into X
void load_data(umat &locations, vec& values, sp_mat& X){
// fill up locations and values one by one into X
}
Other options would be to i) use move semantics to directly move the values in val into X or ii) directly use the memory allocated for val as the memory for X, similar to the advanced constructor for matrices
Mat(eT* aux_mem, const uword aux_n_rows, const uword aux_n_cols, const bool copy_aux_mem = true, const bool strict = false)
Both options would however require modifications on the level of the armadillo library, as such a functionality is not supplied yet for sparse matrices (There is only a plain move constructor so far). It would be a good idea to request these features from the developers though!

How can I allocate memory for a data structure that contains a vector?

If I have a struct InstanceData:
struct InstanceData
{
unsigned usedInstances;
unsigned allocatedInstances;
void* buffer;
Entity* entity;
std::vector<float> *vertices;
};
And I allocate enough memory for an Entity and std::vector:
newData.buffer = size * (sizeof(Entity) + sizeof(std::vector<float>)); // Pseudo code
newData.entity = (Entity *)(newData.buffer);
newData.vertices = (std::vector<float> *)(newData.entity + size);
And then attempt to copy a vector of any size to it:
SetVertices(unsigned i, std::vector<float> vertices)
{
instanceData.vertices[i] = vertices;
}
I get an Access Violation Reading location error.
I've chopped up my code to make it concise, but it's based on Bitsquid's ECS, so just assume it works if I'm not dealing with vectors (it does). With this in mind, I'm assuming it's having issues because it doesn't know what size the vector is going to scale to. However, I thought the vectors might increase along another dimension, like this?:
Am I wrong? Either way, how can I allocate memory for a vector in a buffer like this?
And yes, I know vectors manage their own memory. That's beside the point. I'm trying to do something different.
It looks like you want InstanceData.buffer to have the actual memory space which is allocated/deallocated/accessed by other things. The entity and vertices pointers then point into this space. But by trying to use std::vector, you are mixing up two completely incompatible approaches.
1) You can do this with the language and the standard library, which means no raw pointers, no "new", no "sizeof".
struct Point { float x; float y; }; // usually this is int, not float
struct InstanceData {
Entity entity;
std::vector<Point> vertices;
};
This is the way I would recommend. If you need to output to a specific binary format for serialization, just handle that in the save method.
2) You can manage the memory internal to the class, using oldschool C, which means using N*sizeof(float) for the vertices. Since this will be extremely error prone for a new programmer (and still rough for vets), you must make all of this private to class InstanceData, and do not allow any code outside InstanceData to manage them. Use unit tests. Provide public getter functions. I've done stuff like this for data structures that go across the network, or when reading/writing files with a specified format (Tiff, pgp, z39.50). But just to store in memory using difficult data structures -- no way.
Some other questions you asked:
How do I allocate memory for std::vector?
You don't. The vector allocates its own memory, and manages it. You can tell it to resize() or reserve() space, or push_back, but it will handle it. Look at http://en.cppreference.com/w/cpp/container/vector
How do I allocate memory for a vector [sic] in a buffer like this?
You seem to be thinking of an array. You're way off with your pseudo code so far, so you really need to work your way up through a tutorial. You have to allocate with "new". I could post some starter code for this, if you really need, which I would edit into the answer here.
Also, you said something about vector increasing along another dimension. Vectors are one dimensional. You can make a vector of vectors, but let's not get into that.
edit addendum:
The basic idea with a megabuffer is that you allocate all the required space in the buffer, then you initialize the values, then you use it through the getters.
The data layout is "Header, Entity1, Entity2, ..., EntityN"
// I did not check this code in a compiler, sorry, need to get to work soon
// assumes members: unsigned char* buffer; Header* header; Entity* entity;
MegaBuffer::MegaBuffer() : buffer(nullptr), header(nullptr), entity(nullptr) { AllocateBuffer(0); }
MegaBuffer::~MegaBuffer() { ReleaseBuffer(); }
void MegaBuffer::AllocateBuffer(size_t size /*, whatever is needed for the header*/) {
if (nullptr != buffer)
ReleaseBuffer();
size_t total_bytes = sizeof(Header) + size * sizeof(Entity);
buffer = new unsigned char[total_bytes];
header = reinterpret_cast<Header*>(buffer);
// need to set up the header
header->count = 0;
header->allocated = size;
// set up internal pointer to the first entity, just past the header
entity = reinterpret_cast<Entity*>(buffer + sizeof(Header));
}
void MegaBuffer::ReleaseBuffer() {
delete[] buffer;
buffer = nullptr;
}
Entity* MegaBuffer::operator[](int n) { return &entity[n]; }
The header is always a fixed size, appears exactly once, and tells you how many entities you have. In your case there's no header because you are using the member variables "usedInstances" and "allocatedInstances" instead. So you do sort of have a header, but it is not part of the allocated buffer. But you don't want to allocate 0 bytes, so just set usedInstances=0; allocatedInstances=0; buffer=nullptr;
I did not code for changing the size of the buffer, because the bitsquid ECS example covers that, but he doesn't show the first time initialization. Make sure you initialize n and allocated, and assign meaningful values for each entity before you use them.
You are not doing the Bitsquid ECS the same way as the link you posted. In that, he has several different objects of fixed size in parallel arrays. There is an entity, its mass, its position, etc. So entity[4] is an entity which has mass equal to "mass[4]" and its acceleration is "acceleration[4]". This uses pointer arithmetic to access array elements. (built-in array, NOT std::array, NOT std::vector)
The data layout is "Entity1, Entity2, ..., EntityN, mass1, mass2, ..., massN, position1, position2, ..., positionN, velocity1 ... " you get the idea.
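For what it's worth, that parallel-array layout can be sketched in a few lines for the two-array case. This is an illustration under assumptions, not the article's code: Entity is a stand-in type, mass is the only component, and allocate()/release() are hypothetical helper names. It only works because both element types are trivially copyable, which is exactly what std::vector is not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct Entity { unsigned id; };  // stand-in for the real Entity type

struct InstanceData {
    unsigned n = 0;            // instances in use
    unsigned allocated = 0;    // instances the buffer has room for
    void* buffer = nullptr;    // the single allocation
    Entity* entity = nullptr;  // start of the Entity array
    float* mass = nullptr;     // start of the mass array
};

// The mass array starts right after the Entity array, so it must be aligned.
static_assert(sizeof(Entity) % alignof(float) == 0, "mass array would be misaligned");

// Allocates one block laid out as "Entity1..EntityN, mass1..massN".
// Returns false if the allocation fails (which may include count == 0).
inline bool allocate(InstanceData& d, unsigned count) {
    const std::size_t bytes = count * (sizeof(Entity) + sizeof(float));
    d.buffer = std::malloc(bytes);
    if (d.buffer == nullptr) return false;
    d.entity = static_cast<Entity*>(d.buffer);
    d.mass = reinterpret_cast<float*>(d.entity + count);
    d.n = 0;
    d.allocated = count;
    return true;
}

// Frees the block and resets every field to its empty state.
inline void release(InstanceData& d) {
    std::free(d.buffer);
    d = InstanceData{};
}
```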
If you read the article, you'll notice he says basically the same thing everyone else said about the standard library. You can use a std container to store each of these arrays, OR you can allocate one megabuffer and use pointers and "built-in array" math to get to the exact memory location within that buffer for each item. In the classic faux pas, he even says "This avoids any hidden overheads that might exist in the Array class and we only have a single allocation to keep track of." But you don't know if this is faster or slower than std::array, and you're introducing a lot of bugs and extra development time dealing with raw pointers.
I think I see what you are trying to do.
There are numerous issues. First, you are making a buffer of random data and telling C++ that a vector-sized piece of it is a std::vector. But at no time do you actually call the std::vector constructor, which would initialize the pointers and other internals to viable values.
This has already been answered here: Call a constructor on a already allocated memory
The second issue is the line
instanceData.vertices[i] = vertices;
instanceData.vertices is a pointer to a Vector, so you actually need to write
(*(instanceData.vertices))[i]
The third issue is that the contents of *(instanceData.vertices) are floats, and not Vector, so you should not be able to do the assignment there.
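To make the first issue concrete, here is a minimal sketch of constructing vectors inside raw storage with placement new. createVectors and destroyVectors are hypothetical helper names. Note that only the small vector objects live in the buffer; each vector still allocates its floats on the heap by itself, so the buffer never needs to know how large a vector will grow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

using FloatVec = std::vector<float>;

// Allocates raw storage for `count` vectors and constructs each one in place.
inline FloatVec* createVectors(std::size_t count) {
    void* raw = std::malloc(count * sizeof(FloatVec));  // uninitialized bytes
    if (raw == nullptr && count != 0) throw std::bad_alloc();
    FloatVec* vecs = static_cast<FloatVec*>(raw);
    for (std::size_t i = 0; i < count; ++i)
        new (vecs + i) FloatVec();  // placement new: runs the constructor here
    return vecs;
}

// Every placement-new'd object needs an explicit destructor call
// before the raw storage is released.
inline void destroyVectors(FloatVec* vecs, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        vecs[i].~FloatVec();
    std::free(vecs);
}
```

After createVectors, an assignment such as vecs[i] = vertices; is well defined, because the left-hand side is now a real, constructed vector.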

Prealloc memory list

We are trying to develop a realtime application. In this program, 4 cameras each send their image array to a method 100 times a second.
In this method I have to make copy of each array. (Used for ImageProcessing in other thread).
I would like to store the last 100 images of each camera in a list.
The problem is: How to prealloc such memory in list (new them in constructor?).
I would like to use something like a ring buffer with a fixed size, preallocated image memory, and the FIFO principle.
Any idea how?
Edit1:
Example pseudo code:
// called from writer thread
void receiveImage(const char *data, int length)
{
Image *image = images.nextStorage();
std::copy(data, data + length, image->data);
}
// prealloc
void preallocImages()
{
for (int i = 0; i < 100; i++)
images.preAlloc(new Image(400, 400));
}
// consumer thread
void imageProcessing()
{
Image image = images.WaitAndGetImage();
// ... todo
}
Say you create an Image class to hold the data for an image, having a ring buffer amounts to something like:
std::vector<Image> images(100);
int next = 0;
...
while (whatever)
{
images[next++] = get_image();
next %= images.size();
}
You talk about preallocating memory: each Image constructor can own the task of preallocating memory for its own image. It could do that with new, or if you have fixed-size images that aren't particularly huge you could try a correspondingly sized array in the Image class... that way all the image data will be kept contiguously in memory, and it might be a little faster to iterate images "in order". Note that simply having allocated virtual addresses doesn't mean there's physical backing memory yet, and that stuff may still be swapped out to disk. If you have memory access speed issues, you might want to think about scanning over the memory for an image you expect to use shortly before using it, or using OS functions to advise the OS of your intended memory use patterns. Might as well get something working and profile it first ;-).
For FIFO handling - just have another variable also starting at 0, if it's != next then you can "process" the image at that index in the vector, then increment the variable until it catches up with next.
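Putting the two indices together, a preallocated ring could look roughly like the sketch below. Image and ImageRing are hypothetical, one slot is kept empty to tell "full" from "empty", and slots must be at least 1. It is NOT thread-safe as written; with a writer thread and a consumer thread you would still need a mutex or atomics around the indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical image type that owns a fixed-size pixel buffer.
struct Image {
    std::vector<unsigned char> data;
    explicit Image(std::size_t bytes = 0) : data(bytes) {}
};

class ImageRing {
public:
    // All image memory is allocated here, once.
    ImageRing(std::size_t slots, std::size_t bytesPerImage)
        : images_(slots, Image(bytesPerImage)) {}

    // Writer side: copies into the next slot. Returns false if the ring is
    // full or the data does not fit into a preallocated image.
    bool push(const unsigned char* src, std::size_t length) {
        const std::size_t after = (next_ + 1) % images_.size();
        if (after == read_ || length > images_[next_].data.size()) return false;
        std::memcpy(images_[next_].data.data(), src, length);
        next_ = after;
        return true;
    }

    // Consumer side: oldest unprocessed image, or nullptr if there is none.
    const Image* front() const {
        return read_ == next_ ? nullptr : &images_[read_];
    }

    // Marks the oldest image as processed so its slot can be reused.
    void pop() {
        if (read_ != next_) read_ = (read_ + 1) % images_.size();
    }

private:
    std::vector<Image> images_;
    std::size_t next_ = 0;  // slot the writer fills next
    std::size_t read_ = 0;  // slot the consumer reads next
};
```

Whether a full ring should reject new images (as here) or overwrite the oldest one is a design choice that depends on what the image processing can tolerate.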

Initializing and maintaining structs of structs

I’m writing C++ code to deal with a bunch of histograms that are populated from laboratory measurements. I’m running into problems when I try to organize things better, and I think my problems come from mishandling pointers and/or structs.
My original design looked something like this:
// the following are member variables
Histogram *MassHistograms[3];
Histogram *MomentumHistograms[3];
Histogram *PositionHistograms[3];
where element 0 of each array corresponded to one laboratory measurement, element 1 of each corresponded to another, etc. I could access the individual histograms via MassHistograms[0] or similar, and that worked okay. However, the organization didn't seem right to me—if I were to perform a new measurement, I’d have to add an element to each of the histogram arrays. Instead, I came up with
struct Measurement {
Histogram *MassHistogram;
Histogram *MomentumHistogram;
Histogram *PositionHistogram;
};
As an added layer of complexity, I further wanted to bundle these measurements according to the processing that has been done on their data, so I made
struct MeasurementSet {
Measurement SignalMeasurement;
Measurement BackgroundMeasurement;
};
I think this arrangement is much more logical and extensible—but it doesn’t work ;-) If I have code like
MeasurementSet ms;
Measurement m = ms.SignalMeasurement;
Histogram *h = m.MassHistogram;
and then try to do stuff with h, I get a segmentation fault. Since the analogous code was working fine before, I assume that I’m not properly handling the structs in my code. Specifically, do structs need to be initialized explicitly in any way? (The Histograms are provided by someone else’s library, and just declaring Histogram *SomeHistograms[4] sufficed to initialize them before.)
I appreciate the feedback. I’m decently familiar with Python and Clojure, but my limited knowledge of C++ doesn’t extend to [what seems like] the arcana of the care and feeding of structs :-)
What I ended up doing
I turned Measurement into a full-blown class:
class Measurement {
public:
Measurement() {
MassHistogram = new Histogram();
MomentumHistogram = new Histogram();
PositionHistogram = new Histogram();
}
~Measurement() {
delete MassHistogram;
delete MomentumHistogram;
delete PositionHistogram;
}
Histogram *MassHistogram;
Histogram *MomentumHistogram;
Histogram *PositionHistogram;
};
(The generic Histogram() constructor I call works fine.) The other problem I was having was solved by always passing Measurements by reference; otherwise, the destructor would be called at the end of any function that received a Measurement and the next attempt to do something with one of the histograms would segfault.
Thank you all for your answers!
Are you sure that Histogram *SomeHistograms[4] initialized the data? How do you populate the Histogram structs?
The problem here is not the structs so much as the pointers that are tripping you up. When you do this: MeasurementSet ms; it declares an 'automatic variable' of type MeasurementSet. What it means is that all the memory for MeasurementSet is 'allocated' and ready to go. MeasurementSet, in turn, has two variables of type Measurement that are also 'allocated' and 'ready to go'. Measurement, in turn, has 3 variables of type Histogram * that are also 'allocated' and 'ready to go'... but wait! The type 'Histogram *' is a 'pointer'. That means it's an address - a 32 or 64 bit (or whatever bit) value that describes an actual memory location. And that's it. It's up to you to make it point to something - to put something at that location. Before it points to anything, it will have literally random data in it (or 0'd out data, or some special debug data, or something like that) - the point is that if you try to do something with it, you'll get a segmentation fault, because you will likely be attempting to read a part of data your program isn't supposed to be reading.
In C++, a struct is almost exactly the same thing as a class (which has a similar concept in Python), and you typically allocate one like so:
m.MassHistogram = new Histogram();
...after that, the histogram is ready-to-go. However, YMMV: can you allocate one yourself? Or can you only get one from some library, maybe from a device reading, etc? Furthermore, although you can do what I wrote, it's not necessarily 'pretty'. A C++-ic solution would be to put the allocation in a constructor (like __init__ in Python) and delete in a destructor.
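One way to get the constructor/destructor pairing without writing either by hand is std::unique_ptr. The sketch below assumes C++14 (for std::make_unique) and a default-constructible Histogram; the Histogram defined here is only a stand-in for the library type.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

struct Histogram { int entries = 0; };  // stand-in for the third-party type

struct Measurement {
    // Each pointer is initialized on construction and freed automatically.
    std::unique_ptr<Histogram> MassHistogram = std::make_unique<Histogram>();
    std::unique_ptr<Histogram> MomentumHistogram = std::make_unique<Histogram>();
    std::unique_ptr<Histogram> PositionHistogram = std::make_unique<Histogram>();
};

struct MeasurementSet {
    Measurement SignalMeasurement;
    Measurement BackgroundMeasurement;
};

// Accidental copies (e.g. passing by value) are rejected at compile time.
static_assert(!std::is_copy_constructible<Measurement>::value,
              "Measurement is move-only");
```

Because a Measurement can no longer be copied, passing one by value simply fails to compile, instead of running a destructor on a copy and leaving the original with deleted histograms.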
When your struct contains a pointer, you have to initialize that variable yourself.
Example
struct foo
{
int *value;
};
foo bar;
// bar.value is not initialized yet and points to a random piece of data
bar.value = new int(0);
// bar.value now points to an int with the value 0
// remember that you have to delete everything that you new'd once you're done with it:
delete bar.value;
First, always remember that structs and classes are almost exactly the same things. The only difference is that struct members are public by default, and a class member is private by default.
But all the rest is exactly the same.
Second, carefully differentiate between pointers and objects.
If I write
Histogram h;
space for the Histogram's data will be allocated, and its constructor will be called. (A constructor is a method with exactly the same name as the class, here Histogram().)
If I write
Histogram* h;
I'm declaring a variable of 32/64 bits that will be used as a pointer to memory. It's uninitialized, so it holds an arbitrary value. Dangerous!
If I write
Histogram* h = new Histogram();
memory will be allocated for one Histogram's data members, and its constructor will be called. The address in memory will be stored in "h".
If I write
Histogram* copy = h;
I'm again declaring a 32/64 bit variable that points to exactly the same address in memory as h
If I write
Histogram* h = new Histogram;
Histogram* copy = h;
delete h;
the following happens
memory is allocated for a Histogram object
The constructor of Histogram will be called (even if you didn't write it, your compiler will generate one).
h will contain the memory address of this object
the delete operator will call the destructor of Histogram (even if you didn't write it, your compiler will generate one).
the memory allocated for the Histogram object will be deallocated
copy will still contain the memory address where the object used to be allocated. But you're not allowed to use it. It's called a "dangling pointer"
h's contents will be undefined
In short: the "m.MassHistogram" in your code is referring to a random area in memory. Don't use it. Either allocate it first using operator "new", or declare it as "Histogram" (object instead of pointer).
Welcome to CPP :D
You are aware that your definition of Measurement does not allocate memory for actual Histograms? In your code, m.MassHistogram is a dangling (uninitialized) pointer, it's not pointing to any measured Histogram, nor to any memory capable of storing a Histogram. As #Nari Rennlos posted just now, you need to point it to an existing (or newly allocated) Histogram.
What does your 3rd party library's interface look like? If it's at all possible, you should have a Measurement containing 3 Histograms (as opposed to 3 pointers to Histograms). That way when you create a Measurement or a MeasurementSet the corresponding Histograms will be created for you, and the same goes for destruction. If you still need a pointer, you can use the & operator:
struct Measurement2 {
Histogram MassHistogram;
Histogram MomentumHistogram;
Histogram PositionHistogram;
};
struct MeasurementSet2 {
Measurement2 SignalMeasurement;
Measurement2 BackgroundMeasurement;
};
MeasurementSet2 ms;
Histogram *h = &ms.SignalMeasurement.MassHistogram; //h valid as long as ms lives
Also note that as long as you're not working with pointers (or references), objects will be copied and assigned by value:
MeasurementSet ms; //6 uninitialized pointers to Histograms
Measurement m = ms.SignalMeasurement; //3 more pointers, values taken from first 3 above
Histogram *h = m.MassHistogram; //one more pointer, same uninitialized value
Though if the pointers had been initialized, all 10 of them would be pointing to an actual Histogram at this point.
It gets worse if you have actual members instead of pointers:
MeasurementSet2 ms; //6 Histograms
Measurement2 m = ms.SignalMeasurement; //3 more Histograms, copies of first 3 above
Histogram h = m.MassHistogram; //one more Histogram
h.firstPoint = 42;
m.MassHistogram.firstPoint = 43;
ms.SignalMeasurement.MassHistogram.firstPoint = 44;
...now you have 3 slightly different mass signal histograms, 2 pairs of identical momentum and position signal histograms, and a triplet of background histograms.
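The usual way around these accidental copies is to bind references instead of creating new objects. A small sketch, with a stand-in Histogram (the firstPoint member is taken from the snippet above, the rest of the type is assumed) and two hypothetical helper functions:

```cpp
#include <cassert>

struct Histogram { int firstPoint = 0; };  // stand-in for the library type

struct Measurement2 {
    Histogram MassHistogram;
    Histogram MomentumHistogram;
    Histogram PositionHistogram;
};

struct MeasurementSet2 {
    Measurement2 SignalMeasurement;
    Measurement2 BackgroundMeasurement;
};

// Writes through a reference, so the caller's histogram is modified.
inline void setFirstPoint(Measurement2& m, int value) {
    m.MassHistogram.firstPoint = value;
}

// Takes its argument by value, so only a local copy is modified.
inline void setFirstPointOnCopy(Measurement2 m, int value) {
    m.MassHistogram.firstPoint = value;
}
```

Likewise, Measurement2& m = ms.SignalMeasurement; and Histogram& h = m.MassHistogram; give you short names for the histograms inside ms rather than fresh copies of them.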