C++ Transferring objects from memory to a file - c++

Sometimes it is needed to store a collection of linked objects in a file for future use. In this regard, there are two most obvious approaches, neither of which seems quite satisfactory. The first approach is to create a mapping of pointers to file offsets, like this:
struct A
{
int data;
std::list<B*> links;
};
struct B
{
char data;
std::list<C*> links;
};
typedef unsigned Offset;
std::map<void*,Offset> ptr2ofs;
The problem with this approach is that it requires additional mapping, which may be hashed for faster access, but overall will introduce time and space overhead per each saved link.
The second approach is to include the offset field directly in the data structures:
struct A
{
int data;
Offset offset;
std::list<B*> links;
};
This makes writing operations much faster, but the offset fields become redundant after saving, and will produce memory overhead after loading. So, in this case two sets of structures will be needed, one for saving data, and another one for loading it:
struct A_write
{
int data;
Offset offset;
std::list<B*> links;
};
struct A_read
{
int data;
std::list<B*> links;
};
Thus, both of the approaches obviously have significant drawbacks and neither can be considered as the reference approach. But is there a way to improve them?

Ever heard of offset pointers? They store the relative address rather than the absolute address, so if you just want to plunk down a block of memory into a file and read it back into memory another day, all the address references will still be valid.
If you want to use a serialization library, I'd look at one of the following. Yeah, you could design it yourself, but someone's already been through this trouble before....
Google Protocol Buffers
Cap'n Proto
Google FlatBuffers

Related

Map a layout onto memory address

In C++ is there a way to "map" my desired layout onto a memory data, without memcopying it?
I.e. there is a void* buffer, and I know its layout:
byte1: uint8_t
byte2-3: uint16_t
byte4: uint8_t
I know I can create a struct, and memcpy the data to the struct, and then I can have the values as fields of struct.
But is there a way achieving this without copying? The data is already there, I just need to get some fields, and I'm looking a way for something can help with the layout.
(I can have some static ints for the memory offsets, but I'm hoping for some more generic).
I.e: I would have more "layouts", and based on type of the raw data I'd map the appropriate layout and access its fields which still points to the original data.
I know I can point structs to data, it is easy:
struct message {
uint8_t type;
};
struct request:message {
uint8_t rid;
uint8_t other;
};
struct response:message {
uint8_t result;
};
vector<uint8_t> data;
data.push_back(1); //type
data.push_back(10);
data.push_back(11);
data.push_back(12);
data.push_back(13);
struct request* ptrRequest;
ptrRequest = (struct request*)&data[1];
cout << (int)ptrRequest->rid; //10
cout << (int)ptrRequest->other; //11
But what I'd like to achieve is to have a map with the layouts, i.e:
map<int, struct message*> messagetypes;
But I have no clue on how can I proceed as emplacing would need a new object, and casting is also challenging if the maps stores the base pointers only.
If your layout structure is POD you can do placement new-expression with no initialization, that serves as an object creation marker. E.g.:
#include <new> // Placement new.
// ...
uint8_t* data = ...; // Read from disk, network, or elsewhere.
static_assert(std::is_pod<request>::value, "struct request must be POD.");
request* ptrRequest = new (static_cast<void*>(data)) request;
That only works with PODs. This is a long-standing issue documented in P0593R6
Implicit creation of objects for low-level object manipulation.
If your target architecture requires data to be aligned, add data pointer alignment check.
As another answer states, memcpy may be eliminated by the compiler, examine the assembly output.
In C++ is there a way to "map" my desired layout onto a memory data, without memcopying it?
No, not in standard C++.
If the layout matches that of the class1, then what you might be able to do is to write the memory data onto the class instance initially, so that it doesn't need for copying afterwards.
If the above is not possible, then what you might do is copy (yes, this is memcopy, but hold that thought) the data onto an automatic instance of the class, then placement-new a copy of the automatic instance onto the source array. A good optimiser can see that these copies back and forth do not change the value, and can optimise them away. Matching layout is also necessary here. Example:
struct data {
std::uint8_t byte;
std::uint8_t another;
std::uint16_t properly_aligned;
};
void* buffer = get_some_buffer();
if (!std::align(alignof(data), sizeof(data), buffer, space))
throw std::invalid_argument("bad alignment");
data local{};
std::memcpy(&local, buffer, sizeof local);
data* dataptr = new(buffer) data{local};
std::uint16_t value_from_offset = dataptr->properly_aligned;
https://godbolt.org/z/uvrXS2 Notice how there is no call to std::memcpy in the generated assembly.
One thing to consider here is that the multi-byte integers must have the same byte order as the CPU uses natively. Therefore the data is not portable across systems (of different byte endienness). More advanced de-serialisation is required for portability.
1 It however seems unlikely that the data could possibly match the layout of the class, because the second element which is uint16_t is not aligned to two a 16 bit boundary from start of the layout.

How to let the compiler do the offset computations for an odd polymorphism structure, with as little code as possible?

I am not sure if this is possible at all in standard C++, so whether it even is possible to do, could be a secondary way to put my question.
I have this binary data which I want to read and re-create using structs. This data is originally created as a stream with the content appended to a buffer, field by field at a time; nothing special about that. I could simply read it as a stream, the same way it was written. Instead, I merely wanted to see if letting the compiler do the math for me, was possible, and instead implementing the binary data as a data structure instead.
The fields of the binary data have a predictable order which allows it to be represented as a data type, the issue I am having is with the depth and variable length of repeating fields. I am hoping the example code below makes it clearer.
Simple Example
struct Common {
int length;
};
struct Boo {
long member0;
char member1;
};
struct FooSimple : Common {
int count;
Boo boo_list[];
};
char buffer[1024];
int index = 15;
((FooSimple *)buffer)->boo_list[index].member0;
Advanced Example
struct Common {
int length;
};
struct Boo {
long member0;
char member1;
};
struct Goo {
int count;
Boo boo_list[];
};
struct FooAdvanced : Common {
int count;
Goo goo_list[];
};
char buffer[1024];
int index0 = 5, index1 = 15;
((FooAdvanced *)buffer)->goo_list[index0].boo_list[index1].member0;
The examples are not supposed to relate. I re-used some code due to lack of creativity for unique names.
For the simple example, there is nothing unusual about it. The Boo struct is of fixed size, therefore the compiler can do the calculations just fine, to reach the member0 field.
For the advanced example, as far as I can tell at least, it isn't as trivial of a case. The problem that I see, is that if I use the array selector operator to select a Goo object from the inline array of Goo-elements (goo_list), the compiler will not be able to do the offset calculations properly unless it makes some assumptions; possibly assuming that all preceding Goo-elements in the array have zero Boo-elements in the inline array (boo_list), or some other constant value. Naturally, that won't be the case.
Question(s):
What ways are there to achieve the offset computations to be done by the compiler, despite the inline arrays having variable lengths? Unless I am missing something, I believe templates can't help at all, due to their compile-time nature.
Is this even possible to achieve in C++?
How do you handle the case with instantiating a FoodAdvanced object, by feeding a variable number of Goo and Boo element counts to the goo_list and boo_list members, respectively?
If it is impossible, would I have to write some sort of wrapper code to handle the calculations instead?

How to fill buffers with mixed types conveniently in standard conformant way?

There are problems, where we need to fill buffers with mixed types. Two examples:
programming OpenGL/DirectX, we need to fill vertex buffers, which can have mixed types (which is basically an array of struct, but the struct maybe described by a run-time data)
creating a memory allocator: putting header/trailer information to the buffer (size, flags, next/prev pointer, sentinels, etc.)
The problem can be described like this:
there is an allocation function, which gives back some memory (new, malloc, OS dependent allocation function, like mmap or VirtualAlloc)
there is a need to put mixed types into an allocated buffer, at various offsets
A solution can be this, for example writing an int to an offset:
void *buffer = <allocate>;
int offset = <some_offset>;
char *ptr = static_cast<char*>(buffer);
*reinterpret_cast<int*>(ptr+offset) = int_value;
However, this is inconvenient, and has UB at least two places:
ptr+offset is UB, as there is no char array at ptr
writing to the result of reinterpret_cast is UB, as there is no int there
To solve the inconvenience problem, this solution is often used:
union Pointer {
void *asVoid;
bool *asBool;
byte *asByte;
char *asChar;
short *asShort;
int *asInt;
Pointer(void *p) : asVoid(p) { }
};
So, with this union, we can do this:
Pointer p = <allocate>;
p.asChar += offset;
*p.asInt++ = int_value; // write an int to offset
*p.asShort++ = short_value; // then a short afterwards
// other writes here
This solution is convenient for filling buffers, but has further UB, as the solution uses non-active union members.
So, my question is: how can one solve this problem in a strictly standard conformant, and most convenient way? I mean, I'd like to have the functionality which the union solution gives me, but in a standard conformant way.
(Note: suppose, that we have no alignment issues here, alignment is taken care of by using proper offsets)
A simple (and conformant) way to handle these things is leveraging std::memcpy to move whatever values you need into the correct offsets in your storage area, e.g.
std::int32_t value;
char *ptr;
int offset;
// ...
std::memcpy(ptr+offset, &value, sizeof(value));
Do not worry about performance, since your compiler will not actually perform std::memcpy calls in many cases (e.g. small values). Of course, check the assembly output (and profile!), but it should be fine in general.

An array of structures within a structure - what's the pointer type?

I have the following declaration in a file that gets generated by a perl script ( during compilation ):
struct _gamedata
{
short res_count;
struct
{
void * resptr;
short id;
short type;
} res_table[3];
}
_gamecoderes =
{
3,
{
{ &char_resource_ID_RES_welcome_object_ID,1002, 1001 },
{ &blah_resource_ID_RES_another_object_ID,1004, 1003 },
{ &char_resource_ID_RES_someting_object_ID,8019, 1001 },
}
};
My problem is that struct _gamedata is generated during compile time and the number of items in res_table will vary. So I can't provide a type declaring the size of res_table in advance.
I need to parse an instance of this structure, originally I was doing this via a pointer to a char ( and not defining struct _gamedata as a type. But I am defining res_table.
e.g.
char * pb = (char *)_gamecoderes;
// i.e. pb points to the instance of `struct _gamedata`.
short res_count = (short *)pb;
pb+=2;
res_table * entry = (res_table *)pb;
for( int i = 0; i < res_count; i++ )
{
do_something_with_entry(*entry);
}
I'm getting wierd results with this. I'm not sure how to declare a type _struct gamedata as I need to be able to handle a variable length for res_table at compile time.
Since the struct is anonymous, there's no way to refer to the type of this struct. (res_table is just the member name, not the type's name). You should provide a name for the struct:
struct GameResult {
short type;
short id;
void* resptr;
};
struct _gamedata {
short res_count;
GameResult res_table[3];
};
Also, you shouldn't cast the data to a char*. The res_count and entry's can be extracted using the -> operator. This way the member offsets can be computed correctly.
_gamedata* data = ...;
short res_count = data->res_count;
GameResult* entry = data->res_table;
or simply:
_gamedata* data;
for (int i = 0; i < data->res_count; ++ i)
do_something_with_entry(data->res_table[i]);
Your problem is alignment. There will be at least two bytes of padding in between res_count and res_table, so you cannot simply add two to pb. The correct way to get a pointer to res_table is:
res_table *table = &data->res_table;
If you insist on casting to char* and back, you must use offsetof:
#include <stddef.h>
...
res_table *table = (res_table *) (pb + offsetof(_gamedata, res_table));
Note: in C++ you may not use offsetof with "non-POD" data types (approximately "types you could not have declared in plain C"). The correct idiom -- without casting to char* and back -- works either way.
Ideally use memcpy(3), at least use type _gamedata, or define a protocol
We can consider two use cases. In what I might call the programmer-API type, serialization is an internal convenience and the record format is determined by the compiler and library. In the more formally defined and bulletproof implementation, a protocol is defined and a special-purpose library is written to portably read and write a stream.
The best practice will differ depending on whether it makes sense to create a versioned protocol and develop stream I/O operations.
API
The best and most completely portable implementation when reading from compiler-oject serialized streams would be to declare or dynamically allocate an exact or max-sized _gamedata and then use memcpy(3) to pull the data out of the serial stream or device memory or whatever it is. This lets the compiler allocate the object that is accessed by compiler code and it lets the developer allocate the object that is accessed by developer (i.e., char *) logic.
But at a minimum, set a pointer to _gamedata and the compiler will do everything for you. Note also that res_table[n] will always be at the "right" address regardless of the size of the res_table[] array. It's not like making it bigger changes the location of the first element.
General serialization best practice
If the _gamedata object itself is in a buffer and potentially misaligned, i,e., if it is anything other than an object allocated for a _gamedata type by the compiler or dynamically by a real allocator, then you still have potential alignment issues and the only correct solution is to memcpy(3) each discrete type out of the buffer.
A typical error is to use the misaligned pointer anyway, because it works (slowly) on x86. But it may not work on mobile devices, or future architectures, or on some architectures when in kernel mode, or with advanced optimizations enabled. It's best to stick with real C99.
It's a protocol
Finally, when serializing binary data in any fashion you are really defining a protocol. So, for maximum robustness, don't let the compiler define your protocol. Since you are in C, you can generally handle each fundamental object discretely with no loss in speed. If both the writer and reader do it, then only the developers have to agree on the protocol, not the developers and the compilers and the build team, and the C99 authors, and Dennis M. Ritchie, and probably some others.
As #Zack points out, there is padding between elements of your structure.
I'm assuming you have a char* because you've serialized the structure (in a cache, on disk, or over the network). Just because you are starting with a char * doesn't mean you have to access the entire struct the hard way. Cast it to a typed pointer, and let the compiler do the work for you:
_gamedata * data = (_gamedata *) my_char_pointer;
for( int i = 0; i < data->res_count; i++ )
{
do_something_with_entry(*data->res_table[i]);
}

c++: at what point should I start using "new char[N]" vs a static buffer "char[Nmax]"

My question is with regard to C++
Suppose I write a function to return a list of items to the caller. Each item has 2 logical fields: 1) an int ID, and 2) some data whose size may vary, let's say from 4 bytes up to 16Kbytes. So my question is whether to use a data structure like:
struct item {
int field1;
char field2[MAX_LEN];
OR, rather, to allocate field2 from the heap, and require the caller to destroy when he's done:
struct item{
int field1;
char *field2; // new char[N] -- destroy[] when done!
Since the max size of field #2 is large, is makes sense that this would be allocated from the heap, right? So once I know the size N, I call field2 = new char[N], and populate it.
Now, is this horribly inefficient?
Is it worse in cases where N is always small, i.e. suppose I have 10000 items that have N=4?
You should instead use one of the standard library containers, like std::string or std::vector<char>; then you don't have to worry about managing the memory yourself.
The thing that's horribly in efficient is all the time you will waste tracking down memory leaks. Use classes that take care of this for you.
But if you don't want to do that:
suppose I have 10000 items that have N=4?
So you waste 40k of memory - your PC has at least a gigabyte, probably two, don't worry about it. A consistent interface, even if you're doing new/delete, is better than something fancy that will be harder to debug.
The only time when can safely use fixed-size buffers in production code is sizes are compile-time system constants, such as MAX_PATH.
You could do both:
struct item {
...
char *field2; // Points to buf if < 8 chars (assuming null-terminator).
char buf[8];
};
This does require some clever copy semantics, so you'll need a custom copy-constructor and assignment operator.
Alternatively, if item is always heap-allocated, you could ensure that item and its data are always allocated together:
struct item {
...
char field2[1];
}
item* new_item(int size) {
int offset = &((item*)0)->field2[0] - (char*)0;
return new(malloc(offset + size)) item;
}
Actually it depends. As I see it:
statically sized buffer
Good
No need to manage memory
Very efficient in terms of execution speed
Bad
Might waste some memory
dynamically sized buffer
Good
Does not have to waste any memory, as the exact amount needed is known
Bad
Memory must be managed.
Might be slow(er)
With that in mind, and based on the situation (Is it likely sizes will vary much? Is execution speed extra important? ... ), pick one.