How to convert memory to a byte array? - c++

I have a database in which one field is a byte array.
I also have a piece of memory, or an object. How can I convert this piece of memory, or even an object, to a byte array so that I can store it in the database?
Suppose the object is
Foo foo
The memory is
buf (actually, I don't know how to declare it yet)
The database field is
byte data[256]
Only hex values like x'1' can be inserted into the field.
Thanks so much!

There are two methods.
One is simple but has serious limitations. You can write out the memory image of the Foo object. The drawback is that if you ever change the compiler or the structure of Foo, then your data may no longer be loadable (because the image no longer matches the object). To do this, simply use
&foo
(cast to a byte pointer) as the byte array.
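For example, a minimal sketch of this approach, assuming Foo is trivially copyable (no pointers, no virtual functions) and small enough to fit in the field; the Foo members here are invented for illustration:

#include <cstring>

struct Foo {            // hypothetical example; must be trivially copyable
    int id;
    double value;
    char label[16];
};

unsigned char data[256];    // mirrors the database field

void to_bytes(const Foo& foo) {
    static_assert(sizeof(Foo) <= sizeof(data), "Foo must fit in the field");
    std::memcpy(data, &foo, sizeof(Foo));   // raw memory image of the object
}

void from_bytes(Foo& foo) {
    std::memcpy(&foo, data, sizeof(Foo));   // only safe with the same compiler/layout
}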
The other method is called 'serialization'. It still works if the object changes, but it adds extra space to encode the structural information. If you only have 256 bytes, you need to be careful that serialization doesn't produce output too large to save.

Boost has a serialization library you may want to look at, though you'll need to be careful about the size of the output it creates. If you're only doing this with a small set of classes, you may want to write the marshalling and unmarshalling functions yourself.
From the documentation:
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. "

Related

Access field value raw bytes from a constructed Flatbuffer

Say I have the following FlatBuffers definition compiled to C++.
table ParentObj { // this is the root table
  timestamp:uint64;
  child:ChildObj;
}
table ChildObj {
  ...some fields...
}
I build a ParentObj (which includes the ChildObj) with Flatbuffer builder and send the final bytes over the network to another party.
Is there a way the receiver can access the raw bytes making up the ChildObj inside the received buffer? I can access individual fields in the child obj via the Flatbuffer generated C++ code interface. But can I get the buffer offset and length of the bytes making up the entire ChildObj object? I need this to generate a cryptographic signature of the ChildObj bytes.
No, because ChildObj is not necessarily contiguous in the buffer. It refers to a vtable (which may or may not be shared), and any sub-string/vector/table is an offset that may point to a non-adjacent part of the buffer.
Typically, you use nested_flatbuffer to store children that need to be treated as their own isolated buffer further down the line: child:[ubyte] (nested_flatbuffer: "ChildObj").
However, getting a cryptographic signature from this is still a bad idea, since depending on how it was serialized (which implementation), the bytes may differ subtly due to differences in alignment and ordering of fields/objects. To reliably get a hash out of this, you'd need to hash only the actual data bytes, in a fixed field order.
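As a sketch, with the field declared as child:[ubyte] (nested_flatbuffer: "ChildObj"), flatc's generated C++ code exposes both the raw bytes and a typed view of the nested buffer; the accessor names below follow the usual flatc naming conventions, so check them against your generated header:

#include <cstdint>
#include "flatbuffers/flatbuffers.h"
#include "parent_generated.h"   // hypothetical name of flatc's output

void handle(const uint8_t* received) {
    const ParentObj* parent = flatbuffers::GetRoot<ParentObj>(received);

    // Raw bytes of the nested ChildObj buffer, suitable for hashing/signing:
    const flatbuffers::Vector<uint8_t>* raw = parent->child();
    sign(raw->data(), raw->size());   // sign() stands in for your crypto routine

    // Typed access to the same bytes:
    const ChildObj* child = parent->child_nested_root();
    (void)child;
}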

What does reinterpret_cast do binary-wise?

I'm writing a logger in C++, and I've come to the part where I'd like to take a log record and write it to a file.
I have created a LogRecord struct, and would like to serialize it and write it to a file in binary mode.
I have read some posts about serialization in C++, and one of the answers included this following snippet:
reinterpret_cast<char*>(&logRec)
I've tried reading about reinterpret_cast and what it does, but I couldn't fully understand what's really happening in the background.
From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true? How can that work?
A memory address is just a memory address. Memory isn't inherently special - it's just a huge array of bytes, for all we care. What gives memory its meaning is what we do with it, and the lenses through which we view it.
A pointer to a struct is just an integer that specifies some offset into memory - surely you can treat one integer in any way you want, in your case, as a pointer to some arbitrary number of bytes (chars).
reinterpret_cast() doesn't do anything special except allow you to convert one view of a memory address into another view of a memory address. It's still up to you to treat that memory address correctly.
For instance, char* is the conventional way to refer to a string of characters in C++ - but the type char* literally means "a pointer to a single char". How does it come to mean a pointer to a null-terminated string of characters? By convention, that's how. We treat the type differently depending on the context, but it's up to us to make sure we do so correctly.
For instance, how do you know how many bytes to read through your char* pointer to your struct? The type itself gives you zero information - it's up to you to know that you've really got a byte-oriented pointer to a struct of fixed length.
Remember, under the hood, the machine has no types. A piece of paper doesn't care if you write an essay on each line, or if you scribble all over the thing. It's how we treat it - and how the tools we use (C++) treat it.
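A minimal sketch of how that snippet is typically used, assuming LogRecord is a fixed-size struct with no pointers (the members are invented for illustration) and that the streams are opened with std::ios::binary:

#include <cstdint>
#include <fstream>

struct LogRecord {              // hypothetical layout
    std::uint64_t timestamp;
    int           severity;
    char          message[128];
};

void write_record(std::ofstream& out, const LogRecord& logRec) {
    // View the struct as a plain run of bytes and write them verbatim.
    out.write(reinterpret_cast<const char*>(&logRec), sizeof(logRec));
}

bool read_record(std::ifstream& in, LogRecord& logRec) {
    // Only valid when read back by a build with the identical layout.
    return static_cast<bool>(
        in.read(reinterpret_cast<char*>(&logRec), sizeof(logRec)));
}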
Binary-wise, it does nothing at all. This casting is a higher-level concept that has no bearing in any actual machine instructions.
At a low level, a pointer is just a numeric value that holds a memory address. Nothing actually needs to happen when you tell the compiler "although you thought the destination memory contained a struct, now please treat it as containing chars". The address itself doesn't change in any way.
From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true?
Yes.
How can that work?
A string is just a sequence of bytes, and your object is just a sequence of bytes, so that's how it works.
But it won't if your object is logically more than just a sequence of bytes. Any indirection, and you're hosed. Furthermore, any implementation-defined padding or representation/endianness and your data is non-portable. This might be acceptable; it really depends on your requirements.
Casting a struct into an array of bytes (chars) is a classic low-impact method of binary serialization. It relies on the assumption that the content of the struct exists contiguously in memory. The casting allows us to write this data to a file or socket using the normal APIs.
This only works, though, if the data is contiguous. That is true for C-style structs, or PODs in C++ terminology. It will not work with complex C++ objects or any struct with pointers to storage outside the struct. For text data you will need to use fixed-size character arrays.
struct Fixed {
    int num;
    char name[50];
};
will serialize correctly.
struct Dynamic {
    int num;
    char* name;
};
will not serialize correctly, since the data for the string is stored outside the struct.
If you are sending data across a network, you will also need to ensure that the struct is packed, or at least of known alignment, and that integers are converted to a consistent endianness (network byte order is conventionally big-endian).
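A sketch of that conversion step, using the standard htonl/ntohl functions (from POSIX arpa/inet.h here; Winsock offers the same names) and the Fixed struct from above:

#include <arpa/inet.h>   // htonl, ntohl
#include <cstdint>
#include <cstring>

// Encode into a caller-supplied buffer (at least 54 bytes) in a layout we control.
void encode(const Fixed& in, unsigned char* out) {
    std::uint32_t be = htonl(static_cast<std::uint32_t>(in.num));  // host -> network order
    std::memcpy(out, &be, sizeof(be));
    std::memcpy(out + sizeof(be), in.name, sizeof(in.name));
}

void decode(const unsigned char* in, Fixed& out) {
    std::uint32_t be;
    std::memcpy(&be, in, sizeof(be));
    out.num = static_cast<int>(ntohl(be));                         // network -> host order
    std::memcpy(out.name, in + sizeof(be), sizeof(out.name));
}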

Binary file read error

It runs with no problem the first time, but when I comment out a part of the code, the program terminates:
The problem is that you are trying to read objects you can't really save to files, or load from files.
Let's take that std::string member, name. A std::string object is basically just a pointer to a dynamically allocated array of characters (i.e. a C-style zero-terminated string), plus the length of the contained string. The problem is two-fold: first, when attempting to save the name object, it doesn't save the string but the pointer; and second, pointers to dynamically allocated data are unique per process.
What happens when you load the object is that you read and set the pointer, but only the pointer. This pointer was valid in the process that wrote the object, but not in the current process; it doesn't point to any valid memory allocated by your process. Using this pointer, which happens whenever you use the string object, leads to undefined behavior, and UB is one of the most common reasons for crashes.
What you need to do is serialize the string. If you want to write the code yourself and not use a library for it (there are many great serialization libraries you can use), then you need to write the string length as a fixed-size integer, followed by the actual string data. When you deserialize, you must first know that the next piece of data is a string, then read the length, then the string data, and finally construct your string object from that.
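A minimal sketch of that scheme, assuming we only need to handle a single string member and agree on a fixed-width 32-bit length prefix (add an endianness convention too if the files move between machines):

#include <cstdint>
#include <fstream>
#include <string>

void write_string(std::ofstream& out, const std::string& name) {
    std::uint32_t len = static_cast<std::uint32_t>(name.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));  // length first
    out.write(name.data(), len);                                  // then the characters
}

std::string read_string(std::ifstream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string name(len, '\0');
    in.read(&name[0], len);     // rebuild the string from the raw character data
    return name;
}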

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is d actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste. If I'm just wrapping a few value types, then the class significantly bloats the program, especially if I am using a lot of them. A char becomes 8 times larger, while an int becomes 3 times as large.
A struct's minimum size is 1 byte.
In D, objects have a header containing two pointers (so it may be 8 or 16 bytes depending on your architecture).
The first pointer is the virtual method table. This is a compiler-generated array filled with function pointers, so that virtual dispatch is possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not certain that this field will stay here forever, because D emphasizes thread-local storage and immutability, which make synchronization on many objects pointless. As this field predates those features, it is still here and can be used; however, it may disappear in the future.
Such a header on objects is very common; you'll find the same in Java or C#, for instance. You can look here for more information: http://dlang.org/abi.html
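The same kind of per-object cost shows up in other native languages. As a rough C++ analogue (C++ has no monitor field, so a polymorphic object carries only the vtable pointer; exact sizes are implementation-defined):

#include <cstdio>

class Empty {};                               // no hidden fields in C++
class Poly { public: virtual ~Poly() {} };    // gains a vtable pointer

int main() {
    std::printf("%zu\n", sizeof(Empty));  // typically 1 (minimum object size)
    std::printf("%zu\n", sizeof(Poly));   // typically 8 on a 64-bit target
    return 0;
}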
D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").
I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
If you would not be able to readily determine when the last reference to an object goes out of scope, eight bytes would be a very reasonable level of overhead. I would expect most frameworks to force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte pushes the size to nine rather than twelve). If a system is going to have a garbage collector that works better than a Commodore 64's(*), it needs an absolute minimum of one bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will need every object either to include space for a supplemental-information pointer, or to include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.
You should look into the storage requirements for various types. Every storage allocation (i.e. variable, object, etc.) uses up a specific amount of space. In C#, an Int32 stores its integer information in 4 bytes (32 bits), and it may carry other information too, because it is an object, whereas your character data type probably only requires 1 byte. Note, though, that the instructions for constructs like for or while in your class are compiled once per class, not stored in each instance; the 8 initial bytes come from the per-instance bookkeeping the runtime needs to manage the object.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

Why Serialization when a class object in memory is already binary (C/C++)?

My guess is that the data is scattered in physical memory (even if the data of a class object is sequential in virtual memory), so in order to send the data correctly it needs to be reassembled, and, to be able to send it over the network, one additional step is the transformation from host byte order to network byte order. Is that correct?
Proper serialization can be used to send data to arbitrary systems, that might not work under the same architecture as the source host.
Even an object that consists only of native types can be troublesome to share between two systems, because of the extra padding that might exist between and after members, among other things. Sharing raw memory dumps of objects between programs compiled for the same architecture but with different compiler versions can also turn into a big hassle. There is no guarantee of how a variable of type T is actually stored in memory.
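For instance, a sketch of the padding problem (the exact numbers vary by compiler and ABI, which is precisely the point):

#include <cstddef>
#include <cstdio>

struct Sample {
    char tag;     // 1 byte
    // the compiler typically inserts 3 bytes of padding here
    int count;    // usually 4 bytes, aligned to 4
};

int main() {
    std::printf("sizeof(Sample)  = %zu\n", sizeof(Sample));          // often 8, not 5
    std::printf("offsetof(count) = %zu\n", offsetof(Sample, count)); // often 4, not 1
    return 0;
}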
If you are not working with pointers (references included), and the data is meant to be read by the same binary it was dumped from, it's usually safe just to dump a raw struct to disk. But when sending data to another host... drum roll... serialization is the way to go.
I've heard developers talk about ntohl / htonl / ntohs / htons as methods of serializing/deserializing integers, and when you think about it, saying that isn't far from the truth.
The word "serialization" is often used to describe this "complicated method of storing data in a generic way", but then again; your first programming assignment where you were asked to save information about Dogs to file (hopefully*) made use of serialization, in some way or another.
* "hopefully" meaning that you didn't dump the raw memory representation of your Dog object to disk
Pointers!
If you've allocated memory on the heap, you'll just end up with a serialised pointer pointing to an arbitrary area of memory. If you just have a few ints and chars, then yes, you can write them out directly to a file, but that then becomes platform-dependent because of the byte ordering you mentioned.
Pointers and data packing (alignment)
If you memcpy your object's memory, there is a danger of copying a wild pointer value instead of the data it points to. There is another risk: if the sender and receiver use different data packing (alignment) rules, you will get rubbish after decoding.
Binary representations may differ between different architectures, compilers, and even different versions of the same compiler. There's no guarantee that what system A sees as a signed integer will be seen as the same on system B. Byte ordering, word lengths, struct padding, etc. will become hard-to-debug problems if you don't properly define the protocol or file format for exchanging the data.
A class (when we speak of C++) may also include virtual method pointers, and those must be reconstructed on the receiving end.