What does reinterpret_cast do binary-wise? - c++

I'm writing a logger in C++, and I've come to the part where I'd like to take a log record and write in to a file.
I have created a LogRecord struct, and would like to serialize it and write it to a file in binary mode.
I have read some posts about serialization in C++, and one of the answers included this following snippet:
reinterpret_cast<char*>(&logRec)
I've tried reading about reinterpret_cast and what it does, but I couldn't fully understand what's really happening in the background.
From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true? How can that work?

A memory address is just a memory address. Memory isn't inherently special - it's just a huge array of bytes, for all we care. What gives memory its meaning is what we do with it, and the lenses through which we view it.
A pointer to a struct is just an integer that specifies some offset into memory - surely you can treat one integer in any way you want, in your case, as a pointer to some arbitrary number of bytes (chars).
reinterpret_cast() doesn't do anything special except allow you to convert one view of a memory address into another view of a memory address. It's still up to you to treat that memory address correctly.
For instance, char* is the conventional way to refer to a string of characters in C++ - but the type char* literally means "a pointer to a single char". How does it come to mean a pointer to a null-terminated string of characters? By convention, that's how. We treat the type differently depending on the context, but it's up to us to make sure we do so correctly.
For instance, how do you know how many bytes to read through your char* pointer to your struct? The type itself gives you zero information - it's up to you to know that you've really got a byte-oriented pointer to a struct of fixed length.
Remember, under the hood, the machine has no types. A piece of paper doesn't care if you write an essay on each line, or if you scribble all over the thing. It's how we treat it - and how the tools we use (C++) treat it.

Binary-wise, it does nothing at all. This casting is a higher-level concept that has no bearing in any actual machine instructions.
At a low level, a pointer is just a numeric value that holds a memory address. There is nothing to be done in telling the compiler "although you thought the destination memory contained a struct, now please think that it contains a char". The actual address itself doesn't change in any way.

From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true?
Yes.
How can that work?
A string is just a sequence of bytes, and your object is just a sequence of bytes, so that's how it works.
But it won't if your object is logically more than just a sequence of bytes. Any indirection, and you're hosed. Furthermore, any implementation-defined padding or representation/endianness and your data is non-portable. This might be acceptable; it really depends on your requirements.

Casting a struct into an array of bytes (chars) is a classic low impact method of binary serialization. This is based on the assumption that the content of the struct exists contiguously in memory. The casting allows us write this data to a file or socket using the normal APIs.
This only works though if the data is contiguous. This is true for C style structs or PODs in C++ terminology. It will not work with complex C++ objects or any struct with pointers to storage outside the struct. For text data you will need to use fixed size character arrays.
struct {
int num;
char name[50];
};
will serialize correctly.
struct {
int num;
char* name;
};
will not serialize correctly since the data for the string is stored outside the struct;
If you are sending data across a nework you will also need to ensure that the struct is packed or at least of known alignment and that integers are converted to a consistent endianness (network byte order is normally big endian)

Related

storing flag inside pointer

I have heard quite a lot about storing external data in pointer.
For example in (short string optimization).
For example:
when we want to overload << for our SSO class, dependant of the length of the string we want to print either value of pointer or string.
Instead of creating bool flag we could encode this flag inside pointer itself. If i am not mistaken its thanks PC architecture that adds padding to prevent unalligned memory access.
But i have yet to see it in example. How could we detect such flag, when binary operation such as & to check if RSB or LSB is set to 1 ( as a flag ) are not allowed on pointers? Also wouldnt this mess up dereferencing pointers?
All answers are appreciated.
It is quite possible to do such things (unlike other's have said). Most modern architectures (x86-64, for example) enforce alignment requirements that allow you to use the fact that the least significant bits of a pointer may be assumed to be zero, and make use of that storage for other purposes.
Let me pause for a second and say that what I'm about to describe is considered 'undefined behavior' by the C & C++ standard. You are going off-the-rails in a non-portable way by doing what I describe, but there are more standards governing the rules of a computer than the C++ standard (such as the processors assembly reference and architecture docs). Caveat emptor.
With the assumption that we're working on x86_64, let us say that you have a class/structure that starts with a pointer member:
struct foo {
bar * ptr;
/* other stuff */
};
By the x86 architectural constraints, that pointer in foo must be aligned on an 8-byte boundary. In this trivial example, you can assume that every pointer to a struct foo is therefore an address divisible by 8, meaning the lowest 3 bits of a foo * will be zero.
In order to take advantage of such a constraint, you must play some casting games to allow the pointer to be treated as a different type. There's a bunch of different ways of performing the casting, ranging from the old C method (not recommended) of casting it to and from a uintptr_t to cleaner methods of wrapping the pointer in a union. In order to access either the pointer or ancillary data, you need to logically 'and' the datum with a bitmask that zeros out the part of the datum you don't wish.
As an example of this explanation, I wrote an AVL tree a few years ago that sinks the balance book-keeping data into a pointer, and you can take a look at that example here: https://github.com/jschmerge/structures/blob/master/tree/avl_tree.h#L31 (everything you need to see is contained in the struct avl_tree_node at the line I referenced).
Swinging back to a topic you mentioned in your initial question... Short string optimization isn't implemented quite the same way. The implementations of it in Clang and GCC's standard libraries differ somewhat, but both boil down to using a union to overload a block of storage with either a pointer or an array of bytes, and play some clever tricks with the string's internal length field for differentiating whether the data is a pointer or local array. For more of the details, this blog post is rather good at explaining: https://shaharmike.com/cpp/std-string/
"encode this flag inside pointer itself"
No, you are not allowed to do this in either C or C++.
The behaviour on setting (let alone dereferencing) a pointer to memory you don't own is undefined in either language.
Sadly what you want to achieve is to be done at the assembler level, where the distinction between a pointer and integer is sufficiently blurred.

(Why) does an empty string have an address?

I guessed no, but this output of something like this shows it does
string s="";
cout<<&s;
what is the point of having empty string with an address ?
Do you think that should not cost any memory at all ?
Yes, every variable that you keep in memory has an address. As for what the "point" is, there may be several:
Your (literal) string is not actually "empty", it contains a single '\0' character. The std::string object that is created to contain it may allocate its own character buffer for holding this data, so it is not necessarily empty either.
If you are using a language in which strings are mutable (as is the case in C++), then there is no guarantee that an empty string will remain empty.
In an object-oriented language, a string instance with no data associated with it can still be used to call various instance methods on the string class. This requires a valid object instance in memory.
There is a difference between an empty string and a null string. Sometimes the distinction can be important.
And yes, I very much agree with the implementation of the language that an "empty" variable should still exist in and consume memory. In an object-oriented language an instance of an object is more than just the data that it stores, and there's nothing wrong with having an instance of an object that is not currently storing any actual data.
Following your logic, int i; would also not allocate any memory space, since you are not assigning any value to it. But how is it possible then, that this subsequent operation i = 10; works after that?
When you declare a variable, you are actually allocating memory space of a certain size (depending on the variable's type) to store something. If you want to use this space right way or not is up to you, but the declaration of the variable is what triggers memory allocation for it.
Some coding practices say you shouldn't declare a variable until the moment you need to use it.
An 'empty' string object is still an object - there may be more to its internal implementation than just the memory required to store the literal string itself. Besides that, most C-style strings (like the ones used in C++) are null-terminated, meaning even that "empty" string still uses one byte for the terminator.
Every named object in C++ has an address. There is even a specific requirement that the size of every type be at least 1 so that T[N] and T[N+1] are different, or so that in T a, b; both variables have distinct addresses.
In your case, s is a named object of type std::string, so it has an address. The fact that you constructed s from a particular value is immaterial. What matters is that s has been constructed, so it is an object, so it has an address.
s is a string object so it has an address. It has some internal data structures keeping track of the string. For example, current length of the string, current storage reserved for string, etc.
More generally, the C++ standard requires all objects to have a nonzero size. This helps ensure that every object has a unique address.
9 Classes
Complete objects and member subobjects of class type shall have nonzero size.
In C++, all classes are a specific, unchanging size. (varying by compiler and library, but specific at compile-time.) The std::string usually consists of a pointer, a length of allocation, and a length used. That's ~12 bytes, no matter how long the string is, and you have allocated std::string s on the call stack. When you display the address of the std::string, cout displays the location of the std::string in memory.
If the string doesn't point at anything, it won't allocate any space from the heap, which is like what you're thinking. But, all c-strings end in a trailing NULL, so the c-string "" is one character long, not zero. This means when you assign the c-string "" to the std::string, the std::string allocates 1 (or more) bytes, and assigns it the value of the trailing NULL character (usually zero '\0').
If there truly was no point to the empty string, then the programmer would not write the instruction at all. The language is loyal and trusting! And will never assume memory you allocate to be "wasted". Even if you are lost and heading over a cliff, it will hold your hand to the bitter end.
I think it'd be interesting to know, just as a curiosity though, that if you create a variable that isn't 'used' later, such as your empty string, the compiler may very well optimize it away so it incurs no cost to begin with. I guess compilers aren't as trusting...

Memory padding issue

I am working on a sample application in this application I am serializing some of the data. In client application I am reading the serialized data back. While doing this I observed some strange behavior.
In sample application size of object is different from size of data in client. I think this is because of memory padding. My problem is I am trying to write “BRUSHOBJ” to file. This structure is defined by Microsoft. I can change the declaration of this structure. Please let me know how to solve this problem.
Please let me know how to apply memory padding on slandered data type.
It sounds like you're trying to just cast the address of a struct to
char*, and use ostream::write on it. This simply doesn't work.
There's padding, but there's also the size of different types (which
varies from one platform to the next), byte order, and on some more
exotic platforms (including most mainframes) data representation itself.
Generally, you need a specification of what the output data should look
like, byte by byte, and you have to then write each byte with the
required value.
And this is just for simple types. A quick glance at BRUSHOBJ shows
that it contains a pointer, which you'll probably have to
follow—you'll certainly have to do something with it, since the
receiving end won't be able to do anything with a pointer into your
data. (I suspect, given the description, that you'll have to convert it
into some sort of identifier, and also transmit a dictionary mapping
such identifiers to objects. But I don't know enough about how this
structure is used to be sure.)
you have 2 options
serializing data
modify memory padding via #pragma pack
Serializing data has no relation with memory padding, you are just defining a way to write/ read back memory to/from a memory location (the memory stream).
I see the that _BRUSHOBJ struct has the following definition,
typedef struct _BRUSHOBJ {
ULONG iSolidColor;
PVOID pvRbrush;
FLONG flColorType;
} BRUSHOBJ;
please note that sending a pointer across process is nonsens. serializing a pointer should be done by writing the size of memory and the the memory itself. Anyway if you want to pass this BRUSHOBJ to a windows function you can get undefined behavior. It's not a supported/documented way of passing a BRUSHOBJ across process.
memory padding can by applied like this
#pragma pack(push)
#pragma pack(4)
struct myStruct
{
char Char1
int Int1;
};
#pragma pack(pop)
If you what to modify padding you should doit for a structure that is written by you.

However convert a memory to a byte array?

Now I have a database, which one field type is an array of byte.
Now I have a piece of memory, or an object. How to convert this piece of memory or even an object to a byte array and so that I can store the byte array to the database.
Suppose the object is
Foo foo
The memory is
buf (actually, don't know how to declare it yet)
The database field is
byte data[256]
Only hex value like x'1' can be insert into the field.
Thanks so much!
There are two methods.
One is simple but has serious limitations. You can write the memory image of the Foo object. The drawback is that if you ever change the compiler or the structure of Foo then all your data may no longer loadable (because the image no longer matches the object). To do this simply use
&Foo
as the byte array.
The other method is called 'serialization'. It can be used if the object changes
but adds a lot of space to encode the information. If you only have 256 bytes then you
need to be watchful serialization doesn't create a string too large to save.
Boost has a serialization library you may want to look at, though you'll need to careful about the size of the objects created. If you're only doing this with a small set of classes, you may want to write the marshalling and unmarshalling functions yourself.
From the documentation:
"Here, we use the term "serialization" to mean the reversible deconstruction of an arbitrary set of C++ data structures to a sequence of bytes. "

Access Violation Using memcpy or Assignment to an Array in a Struct

Update 2:
Well I’ve refactored the work-around that I have into a separate function. This way, while it’s still not ideal (especially since I have to free outside the function the memory that is allocated inside the function), it does afford the ability to use it a little more generally. I’m still hoping for a more optimal and elegant solution…
Update:
Okay, so the reason for the problem has been established, but I’m still at a loss for a solution.
I am trying to figure out an (easy/effective) way to modify a few bytes of an array in a struct. My current work-around of dynamically allocating a buffer of equal size, copying the array, making the changes to the buffer, using the buffer in place of the array, then releasing the buffer seems excessive and less-than optimal. If I have to do it this way, I may as well just put two arrays in the struct and initialize them both to the same data, making the changes in the second. My goal is to reduce both the memory footprint (store just the differences between the original and modified arrays), and the amount of manual work (automatically patch the array).
Original post:
I wrote a program last night that worked just fine but when I refactored it today to make it more extensible, I ended up with a problem.
The original version had a hard-coded array of bytes. After some processing, some bytes were written into the array and then some more processing was done.
To avoid hard-coding the pattern, I put the array in a structure so that I could add some related data and create an array of them. However now, I cannot write to the array in the structure. Here’s a pseudo-code example:
main() {
char pattern[]="\x32\x33\x12\x13\xba\xbb";
PrintData(pattern);
pattern[2]='\x65';
PrintData(pattern);
}
That one works but this one does not:
struct ENTRY {
char* pattern;
int somenum;
};
main() {
ENTRY Entries[] = {
{"\x32\x33\x12\x13\xba\xbb\x9a\xbc", 44}
, {"\x12\x34\x56\x78", 555}
};
PrintData(Entries[0].pattern);
Entries[0].pattern[2]='\x65'; //0xC0000005 exception!!! :(
PrintData(Entries[0].pattern);
}
The second version causes an access violation exception on the assignment. I’m sure it’s because the second version allocates memory differently, but I’m starting to get a headache trying to figure out what’s what or how to get fix this. (I’m currently working around it by dynamically allocating a buffer of the same size as the pattern array, copying the pattern to the new buffer, making the changes to the buffer, using the buffer in the place of the pattern array, and then trying to remember to free the—temporary—buffer.)
(Specifically, the original version cast the pattern array—+offset—to a DWORD* and assigned a DWORD constant to it to overwrite the four target bytes. The new version cannot do that since the length of the source is unknown—may not be four bytes—so it uses memcpy instead. I’ve checked and re-checked and have made sure that the pointers to memcpy are correct, but I still get an access violation. I use memcpy instead of str(n)cpy because I am using plain chars (as an array of bytes), not Unicode chars and ignoring the null-terminator. Using an assignment as above causes the same problem.)
Any ideas?
It is illegal to attempt to modify string literals. Your
Entries[0].pattern[2]='\x65';
line attempts exactly that. In your second example you are not allocating any memory for the strings. Instead, you are making your pointers (in the struct objects) to point directly at string literals. And string literals are not modifiable.
This question gets asked several times every day. Read Why is this string reversal C code causing a segmentation fault? for more details.
The problem boils down to the fact that a char[] is not a char*, even if the char[] acts a lot like a char* in expressions.
Other answers have addressed the reason for the error: you're modifying a string literal which is not allowed.
This question is tagged C++ so the easy way to solve your problem is to use std::string.
struct ENTRY {
std::string pattern;
int somenum;
};
Based on your updates, your real problem is this: You want to know how to initialize the strings in your array of structs in such a way that they're editable. (The problem has nothing to do with what happens after the array of structs is created -- as you show with your example code, editing the strings is easy enough if they're initialized correctly.)
The following code sample shows how to do this:
// Allocate the memory for the strings, on the stack so they'll be editable, and
// initialize them:
char ptn1[] = "\x32\x33\x12\x13\xba\xbb\x9a\xbc";
char ptn2[] = "\x12\x34\x56\x78";
// Now, initialize the structs with their char* pointers pointing at the editable
// strings:
ENTRY Entries[] = {
{ptn1, 44}
, {ptn2, 555}
};
That should work fine. However, note that the memory for the strings is on the stack, and thus will go away if you leave the current scope. That's not a problem if Entries is on the stack too (as it is in this example), of course, since it will go away at the same time.
Some Q/A on this:
Q: Why can't we initialize the strings in the array-of-structs initialization? A: Because the strings themselves are not in the structs, and initializing the array only allocates the memory for the array itself, not for things it points to.
Q: Can we include the strings in the structs, then? A: No; the structs have to have a constant size, and the strings don't have constant size.
Q: This does save memory over having a string literal and then malloc'ing storage and copying the string literal into it, thus resulting in two copies of the string, right? A: Probably not. When you write
char pattern[] = "\x12\x34\x56\x78";
what happens is that that literal value gets embedded in your compiled code (just like a string literal, basically), and then when that line is executed, the memory is allocated on the stack and the value from the code is copied into that memory. So you end up with two copies regardless -- the non-editable version in the source code (which has to be there because it's where the initial value comes from), and the editable version elsewhere in memory. This is really mostly about what's simple in the source code, and a little bit about helping the compiler optimize the instructions it uses to do the copying.