Is it evil to serialize struct objects using memcpy?
In one of my projects I am doing the following: I memcpy a struct object, base64 encode it, and write it to file. I do the inverse when parsing the data. It seems to work OK, but in certain situations (for example when using the WINDOWPLACEMENT for the HWND of Windows Media Player) it turns out that the size of the decoded data does not match sizeof(WINDOWPLACEMENT).
Here are some code fragments:
// Using WINDOWPLACEMENT from the Windows API headers:
typedef struct tagWINDOWPLACEMENT {
    UINT length;
    UINT flags;
    UINT showCmd;
    POINT ptMinPosition;
    POINT ptMaxPosition;
    RECT rcNormalPosition;
#ifdef _MAC
    RECT rcDevice;
#endif
} WINDOWPLACEMENT;
static std::string EncodeWindowPlacement(const WINDOWPLACEMENT & inWindowPlacement)
{
    std::stringstream ss;
    {
        Poco::Base64Encoder encoder(ss); // From the Poco C++ libraries
        const char * offset = reinterpret_cast<const char*>(&inWindowPlacement);
        std::vector<char> buffer(offset, offset + sizeof(inWindowPlacement));
        for (size_t idx = 0; idx != buffer.size(); ++idx)
        {
            encoder << buffer[idx];
        }
        encoder.close();
    }
    return ss.str();
}
static WINDOWPLACEMENT DecodeWindowPlacement(const std::string & inEncoded)
{
    std::string decodedString;
    {
        std::istringstream istr(inEncoded);
        Poco::Base64Decoder decoder(istr); // From the Poco C++ libraries
        decoder >> decodedString;
        assert(decoder.eof());
        if (decoder.fail())
        {
            throw std::runtime_error("Failed to parse Window placement data from the configuration file.");
        }
    }
    if (decodedString.size() != sizeof(WINDOWPLACEMENT))
    {
        // !! Occurs frequently !!
        throw std::runtime_error("Errors occurred during parsing of the Window placement.");
    }
    WINDOWPLACEMENT windowPlacement;
    memcpy(&windowPlacement, &decodedString[0], decodedString.size());
    return windowPlacement;
}
I'm aware that copying classes in C++ using memcpy is likely to cause trouble because the copy constructors are not properly executed. I'm not sure if this also applies to C-style structs. Or is serialization by memory dumping simply not done?
Update:
A bug in Poco's Base64Encoder/Decoder is not impossible, but unlikely. Its test cases seem pretty thorough: Base64Test.cpp.
You will run into problems if you need to transfer these files between machines that do not all share the same endianness and word size, or if you add or remove fields from the structs in future versions and need to retain binary compatibility.
I'm not sure how operator>>() is implemented in Poco::Base64Decoder. If it is the same as istream's operator>>(), then after decoder >> decodedString; the string may not contain all the characters from the input. For example, if there is any whitespace character in the encoded string, then decoder >> decodedString; will only read up to that whitespace.
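If that is the cause, one workaround (a sketch, assuming Poco::Base64Decoder is a std::istream, which its documentation describes) is to drain the stream with std::istreambuf_iterator, which does not skip whitespace:

#include <iterator>
#include <sstream>
#include <string>

std::istringstream istr(inEncoded);
Poco::Base64Decoder decoder(istr);
// Reads every decoded byte, including any that follow whitespace.
std::string decodedString((std::istreambuf_iterator<char>(decoder)),
                          std::istreambuf_iterator<char>());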
Doing a memcpy of classes/structs is okay if they're just Plain Old Data (POD), but if that's the case, then you could rely on C++ doing the copying for you via copy constructors (which exist for both struct and class types in C++).
Certainly you can do it the way you have been doing it - one of the products I've worked on serializes data using memcpy, sends the data over the wire, and client applications decode the bytestream to get the data back.
But if you have a choice, you might want something higher level like boost.serialization, which offers more flexibility and deep-pointer copying. Google Protocol Buffers, mentioned elsewhere in this thread, would work nicely too.
Here are some threads discussing serialization methods in C++:
boost serialization vs google protocol buffers?
C++ Serialization Performance
I wouldn't go as far as to say that it's evil, but I think it is asking for trouble and weird problems in many cases.
I know it has been done and it can work (I've seen people serialize structs like that to send over a network connection), but it has a number of drawbacks that have been pointed out already (inflexibility, endianness problems, structs containing pointers, packing, etc).
I'd recommend a more robust way of serializing and deserializing your data. I've heard lots of good things about Google protocol buffers, something like that will be a lot more flexible and will probably save you headaches in the end.
Serializing data in the manner you've done it is not particularly evil, if you know you're staying on a machine with the same byte size, word size, endianness, etc. Since you're serializing the window placement information, you probably don't care about portability between two different machines, and only want to save this information between sessions on the same machine. I'd hazard a guess that you're storing this in the Registry. If you want portability for other data that is actually useful when it's ported to other architectures, then you can look at many of the other suggestions posted here already, such as Google protocol buffers, etc.

Whitespace is a red herring, as all whitespace is irrelevant in a base64-encoded data stream and all decoders should ignore it (Poco does). I am curious to know what the sizes of the string and the structure are when it fails. Knowing this might give you some insight into the problem.
Related
Let there be a structure
struct MyDataStructure
{
    int a;
    int b;
    std::string c;
};
Let there be a function in the interface exposed by a dll.
class IDllInterface
{
public:
    virtual void getData(MyDataStructure&) = 0;
};
From a client exe which loads the dll, would the following code be safe?
...
IDllInterface* dll = DllFactory::getInterface(); // Imagine this exists
MyDataStructure data;
dll->getData(data);
...
Assume, of course, that MyDataStructure is known to both the client and the dll. Also, as I understand it, because the code is compiled separately for the dll and the exe, the layout of MyDataStructure could differ between compilers/compiler versions. Is my understanding correct?
If so, how can you pass data across dll boundaries safely when working with different compilers/compiler versions?
You could use a "protocol" approach. For this, you could use a memory buffer to transfer the data and both sides just have to agree on the buffer layout.
The protocol agreement could be something like:
We don't use a struct, we just use a memory buffer (pass a pointer, or whatever means your toolkit allows for sharing a memory buffer).
We clear the buffer to 0s before setting any data in it.
All ints use 4 bytes in the buffer. This means each side uses whatever int type under their compiler is 4 bytes e.g. int/long.
For the particular case of two ints, the first 8 bytes hold the ints and after that comes the string data.
#define MAX_STRING_SIZE_I_NEED 128
// 8 bytes for ints.
#define DATA_SIZE (MAX_STRING_SIZE_I_NEED + 8)
char xferBuf[DATA_SIZE];
So the Dll sets the ints etc., e.g.
void GetData(void* p)
{
    char* base = (char*) p; // no pointer arithmetic on void*
    // "int" is whatever type is known to use 4 bytes
    *(int*) base = intA_ValueImSending;
    *(int*) (base + 4) = intB_ValueImSending;
    strcpy(base + 8, stringBuf_ImSending);
}
On the receiving end it's easy enough to place the buffered values in the struct:
char buf[DATA_SIZE];
char* p = buf;
theDll.GetData(p);
theStructInstance.intA = *(int*) p;
theStructInstance.intB = *(int*) (p + 4);
...
If you want you could even agree on the endianness of the bytes per integer and set each of the 4 bytes of each integer in the buffer - but you probably wouldn't need to go to that extent.
For more general purpose both sides could agree on "markers" in the buffer. The buffer would look like this:
<marker>
<data>
<marker>
<data>
<marker>
<data>
...
Marker: 1st byte indicates the data type, the 2nd byte indicates the length (very much like a network protocol).
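A minimal sketch of appending one marker-plus-data field under that scheme (the type codes are invented for illustration):

#include <vector>

// Hypothetical type codes both sides agree on.
enum : unsigned char { TYPE_INT32 = 1, TYPE_STRING = 2 };

// Append one marker (type byte, length byte) followed by the raw data.
void appendField(std::vector<unsigned char>& buf,
                 unsigned char type,
                 const void* data, unsigned char len)
{
    buf.push_back(type);
    buf.push_back(len);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    buf.insert(buf.end(), bytes, bytes + len);
}

The receiver then walks the buffer: read a marker, dispatch on the type byte, consume length bytes, and move on to the next marker.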
If you want to pass a string in COM, you normally want to use a COM BSTR object. You can create one with SysAllocString. This is defined to be neutral between compilers, versions, languages, etc. Contrary to popular belief, COM does directly support the int type--but from its perspective, int is always a 32-bit type. If you want a 64-bit integer, that's a Hyper, in COM-speak.
Of course, you could use some other format that both sides of your connection know/understand/agree upon, but unless you have an extremely good reason to do this, it's almost certain to be a poor idea. One of the major strengths of COM is exactly the sort of interoperation you seem to want, and inventing your own string format would limit that substantially.
Using JSON for communication.
I think I have found an easier way to do it, hence I am answering my own question. As suggested in the answer by @Greg, one has to make sure that the data representation follows a protocol, e.g. a network protocol. This makes the object representation between different binary components (exe and dll here) irrelevant. If we think about it again, this is the same problem that JSON solves by defining a simple object representation protocol.
So a simple yet powerful solution according to me would be to construct a JSON object from your object in the exe, serialise it, pass it across the dll boundary as bytes and deserialise it in the dll. The only agreement between the dll and exe would be that both use the same string encoding (e.g. UTF-8).
https://en.wikibooks.org/wiki/JsonCpp
One can use the above JsonCpp library. Strings are encoded as UTF-8 by default in JsonCpp, which is convenient as well :-)
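A rough sketch of such a round trip with JsonCpp (the encode/decode helpers and field names are invented for illustration, reusing MyDataStructure from above):

#include <json/json.h>
#include <sstream>
#include <stdexcept>
#include <string>

// Serialize on one side of the boundary...
std::string encode(const MyDataStructure& d)
{
    Json::Value root;
    root["a"] = d.a;
    root["b"] = d.b;
    root["c"] = d.c;
    Json::StreamWriterBuilder writer;
    return Json::writeString(writer, root); // UTF-8 by default
}

// ...and deserialize on the other.
MyDataStructure decode(const std::string& text)
{
    Json::Value root;
    Json::CharReaderBuilder reader;
    std::string errs;
    std::istringstream in(text);
    if (!Json::parseFromStream(reader, in, &root, &errs))
        throw std::runtime_error(errs);
    MyDataStructure d;
    d.a = root["a"].asInt();
    d.b = root["b"].asInt();
    d.c = root["c"].asString();
    return d;
}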
I am working on project in C++ that adopts many ideas from a golang project.
I don't properly understand how binary.Write works from the documentation, or how I can replicate it in C++. I am stuck at this line in my project.
binary.Write(e.offsets, nativeEndian, e.offset)
The type of e.offsets is *bytes.Buffer and e.offset is uint64
In C++ standard libs, it is generally up to you to deal with endian concerns. So let's skip that for the time being. If you just want to write binary data to a stream such as a file, you can do something like this:
uint64_t value = 0xfeedfacedeadbeef;
std::ofstream file("output.bin", std::ios::binary);
file.write(reinterpret_cast<char*>(&value), sizeof(value));
The cast is necessary because the file stream deals with char*, but you can write whatever byte streams to it you like.
You can write entire structures this way as well so long as they are "Plain Old Data" (POD). For example:
struct T {
    uint32_t a;
    uint16_t b;
};
T value2 = { 123, 45 };
std::ofstream file("output.bin", std::ios::binary);
file.write(reinterpret_cast<char*>(&value2), sizeof(value2));
Reading these things back is similar using file.read, but as mentioned, if you REALLY do care about endian, then you need to take care of that yourself.
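For instance, reading the POD struct back might look like this (a sketch; real code should also check the stream state after the read):

T value3;
std::ifstream in("output.bin", std::ios::binary);
in.read(reinterpret_cast<char*>(&value3), sizeof(value3));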
If you are dealing with non-POD types (such as std::string), then you will need to deal with a more involved data serialization system. There are numerous options to deal with this if needed.
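One simple hand-rolled option (a sketch, assuming writer and reader agree on this layout and on endianness) is a length prefix followed by the raw characters:

#include <cstdint>
#include <fstream>
#include <string>

// Write a length-prefixed string: 4-byte size, then the bytes.
void writeString(std::ofstream& out, const std::string& s)
{
    uint32_t size = static_cast<uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof(size));
    out.write(s.data(), size);
}

// Read it back in the same order.
std::string readString(std::ifstream& in)
{
    uint32_t size = 0;
    in.read(reinterpret_cast<char*>(&size), sizeof(size));
    std::string s(size, '\0');
    in.read(&s[0], size);
    return s;
}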
I am doing in-memory image conversions between two frameworks (OpenSceneGraph and wxWidgets). Not wanting to care about the underlying classes (osg::Image and wxImage), I use the stream oriented I/O features both APIs provide like so:
1) Create an std::stringstream
2) Write to the stream using OSG's writers
3) Read from the stream using wxWidgets readers
This works fairly well. Until now I've been using direct access to the stream buffer, but my attention has been recently caught by the "non-contiguous underlying buffer" problem of the std::stringstream. I had been using a kludge to get a const char* ptr to the buffer - but it worked (tested on Windows, Linux and OSX, using MSVC 9 and GCC 4.x), so I never fixed it.
Now I understand that this code is a time bomb and I want to get rid of it. This problem has been brought up several times on SO (here for instance), but I could not find an answer that could really help me do the simplest thing that could possibly work.
I think the most reasonable thing to do is to create my own streambuf using a vector behind the scenes - this would guarantee that the buffer is contiguous. I am aware that this would not be a generic solution, but given my constraints:
1) the required size is not infinite and actually quite predictable
2) my stream really needs to be an std::iostream (I can't use a raw char array) because of the APIs
does anybody know how I can write a custom stringbuf using a vector of chars? Please do not answer "use std::stringstream::str()", since I know we can, but I'm precisely looking for something else (even though you'd say that copying 2-3 MB is so fast that I wouldn't even notice the difference, let's say I am still interested in custom stringbufs just for the beauty of the exercise).
If you can use just an istream or an ostream (rather than bidirectional), and don't need seeking, it's pretty simple (about 10 lines of code) to create your own streambuf using std::vector<char>.

But unless the strings are very, very large, why bother? The C++11 standard guarantees that std::string is contiguous; a char* obtained by &myString[0] can be used as a C-style array. And the reason C++11 added this guarantee was in recognition of existing practice; there simply weren't any implementations where this wasn't the case (and now that it's required, there won't be any implementations in the future where this isn't the case).
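For completeness, a minimal sketch of that roughly-ten-line streambuf (output side only, no seeking):

#include <streambuf>
#include <vector>

// An output-only streambuf that appends every character to a
// std::vector<char>, so the buffer is guaranteed contiguous.
class VectorOutBuf : public std::streambuf {
public:
    const std::vector<char>& data() const { return buffer_; }
protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            buffer_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
private:
    std::vector<char> buffer_;
};

Attach it with std::ostream out(&buf); everything written to out then lands in the contiguous vector, readable via buf.data().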
boost::iostreams have a few ready-made sinks for this. There's array_sink if you have some sort of upper limit and can allocate the chunk upfront; such a sink won't grow dynamically, but on the other hand that can be a positive as well. There's also back_inserter_device, which is more generic and works straight up with std::vector, for example. An example using back_inserter_device:
#include <string>
#include <iostream>
#include "boost/iostreams/stream_buffer.hpp"
#include "boost/iostreams/device/back_inserter.hpp"
int main()
{
    std::string destination;
    destination.reserve( 1024 );
    boost::iostreams::stream_buffer< boost::iostreams::back_insert_device< std::string > > outBuff( ( destination ) );
    std::streambuf* cur = std::cout.rdbuf( &outBuff );
    std::cout << "Hello!" << std::endl;
    // If we used array_sink we'd need tellp here to retrieve how much we've
    // actually written - and don't forget to flush if you don't end with an endl!
    std::cout.rdbuf( cur );
    std::cout << destination;
}
Background:
I'm using Google's protobuf, and I would like to read/write several gigabytes of protobuf marshalled data to a file using C++. As it's recommended to keep the size of each protobuf object under 1MB, I figured a binary stream (illustrated below) written to a file would work. Each offset contains the number of bytes to the next offset until the end of the file is reached. This way, each protobuf can stay under 1MB, and I can glob them together to my heart's content.
[int32 offset]
[protobuf blob 1]
[int32 offset]
[protobuf blob 2]
...
[eof]
I have an implementation that works on GitHub:
src/glob.hpp
src/glob.cpp
test/readglob.cpp
test/writeglob.cpp
But I feel I have written some poor code, and would appreciate some advice on how to improve it. Thus,
Questions:
I'm using reinterpret_cast<char*> to read/write the 32 bit integers to and from the binary fstream. Since I'm using protobuf, I'm making the assumption that all machines are little-endian. I also assert that an int is indeed 4 bytes. Is there a better way to read/write a 32 bit integer to a binary fstream given these two limiting assumptions?
In reading from fstream, I create a temporary fixed-length char buffer, so that I can then pass this fixed-length buffer to the protobuf library to decode using ParseFromArray, as ParseFromIstream will consume the entire stream. I'd really prefer just to tell the library to read at most the next N bytes from fstream, but there doesn't seem to be that functionality in protobuf. What would be the most idiomatic way to pass a function at most N bytes of an fstream? Or is my design sufficiently upside down that I should consider a different approach entirely?
Edit:
#codymanix: I'm casting to char since istream::read requires a char array if I'm not mistaken. I'm also not using the extraction operator >> since I read it was poor form to use with binary streams. Or is this last piece of advice bogus?
#Martin York: Removed new/delete in favor of std::vector<char>. glob.cpp is now updated. Thanks!
Don't use new []/delete[].
Instead use a std::vector, as deallocation is guaranteed in the event of exceptions.
Don't assume that reading will return all the bytes you requested.
Check with gcount() to make sure that you got what you asked for.
Rather than have Glob implement the code for both input and output depending on a switch in the constructor, implement two specialized classes like ifstream/ofstream. This will simplify both the interface and the usage.
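A rough sketch of the read side under the question's assumptions (little-endian machine, 4-byte length prefix before each blob), checking gcount() as advised above:

#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <vector>

// Read one length-prefixed blob from the glob file.
std::vector<char> readBlob(std::ifstream& in)
{
    int32_t size = 0;
    in.read(reinterpret_cast<char*>(&size), sizeof(size));
    if (in.gcount() != static_cast<std::streamsize>(sizeof(size)))
        throw std::runtime_error("truncated length prefix");
    std::vector<char> buffer(size);
    in.read(buffer.data(), buffer.size());
    if (in.gcount() != static_cast<std::streamsize>(buffer.size()))
        throw std::runtime_error("truncated blob");
    return buffer; // hand buffer.data()/buffer.size() to ParseFromArray
}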
If you have the following class as a network packet payload:
class Payload
{
    char field0;
    int field1;
    char field2;
    int field3;
};
Does using a class like Payload leave the recipient of the data susceptible to alignment issues when receiving the data over a socket? I would think that the class would either need to be reordered or add padding to ensure alignment.
Either reorder:
class Payload
{
    int field1;
    int field3;
    char field0;
    char field2;
};
or add padding:
class Payload
{
    char field0;
    char pad0[3];
    int field1;
    char field2;
    char pad1[3];
    int field3;
};
If reordering doesn't make sense for some reason, I would think adding the padding would be preferred since it would avoid alignment issues even though it would increase the size of the class.
What is your experience with such alignment issues in network data?
Correct - blindly ignoring alignment can cause problems, even on the same operating system, if two components were compiled with different compilers or different compiler versions.
It is better to...
1) Pass your data through some sort of serialization process.
2) Or pass each of your primitives individually, while still paying attention to byte ordering (endianness)
A good place to start would be Boost Serialization.
You should look into Google protocol buffers, or Boost::serialize like another poster said.
If you want to roll your own, please do it right.
If you use types from stdint.h (i.e. uint32_t, int8_t, etc.) and make sure every variable has "native alignment" (meaning its address is evenly divisible by its size: int8_ts can go anywhere, uint16_ts go on even addresses, uint32_ts go on addresses divisible by 4), you won't have to worry about alignment or packing.
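For example, a layout like this (a sketch; the offsets assume the common ABIs where these fixed-width types have no extra alignment requirements) is naturally aligned, so no padding is inserted:

#include <cstdint>

// Every member's offset is divisible by its size, so the compiler
// adds no padding; the struct is exactly 12 bytes on common ABIs.
struct Packet {
    uint32_t id;      // offset 0
    uint16_t flags;   // offset 4
    uint8_t  type;    // offset 6
    uint8_t  version; // offset 7
    uint32_t length;  // offset 8
};
static_assert(sizeof(Packet) == 12, "unexpected padding");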
At a previous job we had all structures sent over our databus (ethernet or CANbus or byteflight or serial ports) defined in XML. There was a parser that would validate alignment on the variables within the structures (alerting you if someone wrote bad XML), and then generate header files for various platforms and languages to send and receive the structures. This worked really well for us, we never had to worry about hand-writing code to do message parsing or packing, and it was guaranteed that all platforms wouldn't have stupid little coding errors. Some of our datalink layers were pretty bandwidth constrained, so we implemented things like bitfields, with the parser generating the proper code for each platform. We also had enumerations, which was very nice (you'd be surprised how easy it is for a human to screw up coding bitfields on enumerations by hand).
Unless you need to worry about it running on 8051s and HC11s with C, or over data link layers that are very bandwidth constrained, you are not going to come up with something better than protocol buffers, you'll just spend a lot of time trying to be on par with them.
We use packed structures that are overlaid directly over the binary packet in memory today and I am rueing the day that I decided to do that. The only way that we have gotten this to work is by:
carefully defining bit-width specific types based on the compilation environment (typedef unsigned int uint32_t)
inserting the appropriate compiler-specific pragmas in to specify tight packing of structure members
requiring that everything is in one byte order (use network or big-endian ordering)
carefully writing both the server and client code
If you are just starting out, I would advise you to skip the whole mess of trying to represent what's on the wire with structures. Just serialize each primitive element separately. If you choose not to use an existing library like Boost Serialize or a middleware like TibCo, then save yourself a lot of headache by writing an abstraction around a binary buffer that hides the details of your serialization method. Aim for an interface like:
class ByteBuffer {
public:
    ByteBuffer(uint8_t *bytes, size_t numBytes) {
        buffer_.assign(&bytes[0], &bytes[numBytes]);
    }

    void encode8Bits(uint8_t n);
    void encode16Bits(uint16_t n);
    //...
    void overwrite8BitsAt(unsigned offset, uint8_t n);
    void overwrite16BitsAt(unsigned offset, uint16_t n);
    //...
    void encodeString(std::string const& s);
    void encodeString(std::wstring const& s);

    uint8_t decode8BitsFrom(unsigned offset) const;
    uint16_t decode16BitsFrom(unsigned offset) const;
    //...

private:
    std::vector<uint8_t> buffer_;
};
Each of your packet classes would then have a method to serialize to a ByteBuffer or be deserialized from a ByteBuffer and offset. This is one of those things that I absolutely wish I could go back in time and correct. I cannot count the number of times that I have spent debugging an issue that was caused by forgetting to swap bytes or forgetting to pack a struct.
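For instance, a hypothetical packet type built on that interface (the names are invented for illustration) might look like:

#include <cstdint>
#include <string>

// The packet never touches raw bytes itself; ByteBuffer does the
// big-endian splitting, so the wire format stays consistent.
struct LoginPacket {
    uint16_t version;
    std::string user;

    void serializeTo(ByteBuffer& out) const {
        out.encode16Bits(version);
        out.encodeString(user);
    }
};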
The other trap to avoid is using a union to represent bytes or memcpying to an unsigned char buffer to extract bytes. If you always use Big-Endian on the wire, then you can use simple code to write the bytes to the buffer and not worry about the htonl stuff:
void ByteBuffer::encode8Bits(uint8_t n) {
    buffer_.push_back(n);
}
void ByteBuffer::encode16Bits(uint16_t n) {
    encode8Bits(uint8_t((n & 0xff00) >> 8));
    encode8Bits(uint8_t((n & 0x00ff)     ));
}
void ByteBuffer::encode32Bits(uint32_t n) {
    encode16Bits(uint16_t((n & 0xffff0000) >> 16));
    encode16Bits(uint16_t((n & 0x0000ffff)      ));
}
void ByteBuffer::encode64Bits(uint64_t n) {
    encode32Bits(uint32_t((n & 0xffffffff00000000) >> 32));
    encode32Bits(uint32_t((n & 0x00000000ffffffff)      ));
}
This remains nicely platform agnostic since the numerical representation is always logically Big-Endian. This code also lends itself very nicely to using templates based on the size of the primitive type (think encode<sizeof(val)>((unsigned char const*)&val))... not so pretty, but very, very easy to write and maintain.
My experience is that the following approaches are to be preferred (in order of preference):
Use a high level framework like Tibco, CORBA, DCOM or whatever that will manage all these issues for you.
Write your own libraries on both sides of the connection that are aware of packing, byte order and other issues.
Communicate only using string data.
Trying to send raw binary data without any mediation will almost certainly cause lots of problems.
You practically can't use a class or structure for this if you want any sort of portability. In your example, the ints may be 32-bit or 64-bit depending on your system. You're most likely using a little endian machine, but the older Apple macs are big endian. The compiler is free to pad as it likes too.
In general you'll need a method that writes each field to the buffer a byte at a time, after ensuring you get the byte order right with htonl, htons, or a 64-bit equivalent (and their ntoh counterparts when reading).
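For example, a sketch of writing two fields one at a time in network byte order (the function name and fields are invented for illustration; htonl/htons come from arpa/inet.h on POSIX, winsock2.h on Windows):

#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

// Copy each field into the buffer in network byte order, field by
// field, instead of dumping the whole struct at once.
size_t writePayload(unsigned char* buf, uint32_t field1, uint16_t field2)
{
    uint32_t f1 = htonl(field1);
    uint16_t f2 = htons(field2);
    std::memcpy(buf, &f1, sizeof f1);
    std::memcpy(buf + sizeof f1, &f2, sizeof f2);
    return sizeof f1 + sizeof f2;
}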
If you don't have natural alignment in the structures, compilers will usually insert padding so that alignment is proper. If, however, you use pragmas to "pack" the structures (remove the padding), there can be very harmful side effects. On PowerPCs, non-aligned floats generate an exception. If you're working on an embedded system that doesn't handle that exception, you'll get a reset. If there is a routine to handle that interrupt, it can DRASTICALLY slow down your code, because it'll use a software routine to work around the misalignment, which will silently cripple your performance.