I was wondering if it is a good idea to load/save an array of a certain type of structure using fstream. Note, I am talking about loading/saving to a binary file. Should I load/save independent variables such as int, float, and bool rather than a whole struct? I ask because I've heard that a structure may contain padding, which could throw off the save/load.
A structure may contain padding, which will be written to the file. This is no big deal if the file is going to be read back on the same platform, using code emitted by the same compiler that did the write. However, this is difficult to guarantee, and if you cannot guarantee it, you should normally write the data in some textual format, such as XML or JSON.
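To make the padding concrete, here is a minimal sketch (the struct and its members are invented for illustration):

#include <cstdio>
#include <fstream>

struct Sample {
    char flag;    // 1 byte, typically followed by 3 padding bytes
    int  value;   // 4 bytes, aligned to a 4-byte boundary
};

int main() {
    std::printf("%zu\n", sizeof(Sample));  // usually prints 8, not 5

    // A raw write also dumps the padding bytes into the file, and a
    // compiler with different padding rules cannot read the file back.
    Sample s{'y', 42};
    std::ofstream out("sample.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&s), sizeof s);
}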
Without serialization, your binary data will not be portable across different platforms (and compilers). So if you need portability, you need to serialize the data before storing it in the file as binary.
Have a look at these:
Boost Serialization Tutorial
Boost Serializable Concept
It's not deprecated (it's not part of any formal spec, so where would it be deprecated from?), but it's highly non-portable and probably the worst way to go about serialising stuff. Use Boost.Serialization or a similar library.
As you pointed out in your answer, this will happen when writing structs this way. If you want your files to be portable across platforms (e.g. a file written on Linux i686 being opened by Solaris on SPARC), then even writing individual floats won't work.
Try writing your data to something like text or XML and then zip/tar the files to make one document of them.
As Neil said, prefer a textual representation of the data. The XML format may be overkill; simpler options are comma-separated values (CSV) or one value per text line.
What I have is:
a hex file with the bytes of a C struct in it, ordered big-endian
the struct definition as a *.h file
the struct information as DWARF2 debug info
My application has to be written in C/C++. Intermediate scripts using, for example, Python would be OK.
What I have to do is read the bytes of the hex file and cast them into the struct type on a system that is little-endian.
During this process, I will have to reverse the bytes of each struct member.
The obvious solution would be to write a conversion function that does byte-swapping for each struct member. But since the struct has multiple layers and ~1200 members that change faster than I can update my conversion function, writing it by hand is no solution.
So I could generate the conversion function automatically by:
Finding and parsing the types inside multiple *.h files
Iterating the members of all struct types and generating swaps for them (without some sort of reflection API, not that easy)
Loading the struct via the conversion function.
Since this seems like quite a bit of work, I was wondering if there is an easier way, like telling the compiler to swap the bytes or using the debug info somehow.
Does anybody know a trick that might help in this case?
Thanks and greetings!
Remark:
Changing any of the processes leading to this, changing the input conditions, or delegating responsibilities to the other developers involved is not possible.
Changing anything about the hex file as an input is not possible. This file comes out of another system that will not be changed to fix this problem here.
Padding, data type sizes, etc. are identical; this is ensured by other measures too. So endianness is definitely the only problem. This is also why I see no reason against using the DWARF2 info to identify the bytes of every struct member.
I agree that the layout of the struct is very bad. But there are reasons why it is the way it is, and in short, I cannot (and am not allowed to) change it anyway, for process reasons and backwards compatibility.
To give some more scope:
The software that all of this is used in is deployed to multiple different embedded devices (of multiple types). The hex file contains the calibration information for the software and is thus stored in a specific system that can only output this hex file.
I am now porting the software to a little-endian device, and I have to use the hex file from the "main" branch of the software, which is big-endian, as input.
There is no way to tell a C or C++ compiler to swap bytes from LE to BE or vice versa automatically. You really have to do it yourself. If your data structs are really huge, probably the best way is to implement automatic generation of the conversion code.
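The per-member swap itself is simple; here is a minimal sketch of a helper the generated code could call (the names are mine, not from any real codebase):

#include <algorithm>
#include <cstring>

// Reverse the bytes of any trivially copyable scalar.
template <typename T>
T byteswap(T value)
{
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    std::reverse(bytes, bytes + sizeof(T));
    std::memcpy(&value, bytes, sizeof(T));
    return value;
}

// A generated converter then boils down to one line per member, e.g.:
// out.gain   = byteswap(in.gain);
// out.offset = byteswap(in.offset);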
The problem, as far as I understand it, is tricky but tractable. The data extraction won't be running on an embedded device, so it won't be resource constrained. I say: embrace the runtime inefficiency that desktop hardware allows, and optimize for ease of debugging instead.
Instead of thinking of the source file as "almost what I need modulo a couple of minor adjustments", think of it as "a generic binary file with an open-ended, evolving schema". The schema description is the DWARF data.
What I would do: start a Python project. Use the pyelftools PyPI module to parse the DWARF. Go through the compile units (CUs); in each CU, go through the top-level entries (DIEs). Look for a DW_TAG_structure_type DIE with the specific value of DW_AT_name (I hope the struct name is known in advance). Then go through its DW_TAG_member sub-DIEs. DW_AT_data_member_location will give you the offset, letting you work around the padding. Look at DW_AT_type to determine the member type (you'd have to resolve the DIE reference for that). Recurse into struct- and array-typed members as necessary.
From that, generate a format string for the struct.unpack method - it can read big-endian ints seamlessly. Then use struct.pack to write the values back out in whatever format the C++ consumer expects.
This depends on you being able to match the data file to the DWARF info of the generating executable - exactly the same build. I hope the processes of the organization allow for that.
Recent versions of GCC allow declaring the desired endianness irrespective of the target platform, either for a section of source code using the pragma scalar_storage_order, or for a specific type using an attribute of the same name. The main catch: g++ does not support this. Also, it won't work in all cases; for example, taking a pointer to a member with transparent endianness conversion leads to an error. Unless you're okay with sticking to C for struct access (it all depends on your current codebase), this is not an option.
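For completeness, usage looks roughly like this (GCC 6 or later, C only; the struct is invented for illustration):

/* gcc only; g++ does not support the pragma, as noted above */
#pragma scalar_storage_order big-endian
struct CalibrationRecord {
    int   gain;
    short offset;
};
#pragma scalar_storage_order default

/* Plain member accesses like rec.gain are transparently byte-swapped
   on a little-endian host; taking the address of such a member is
   where the compiler reports an error. */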
The persistence layout is based on the original struct layout - so be it. However, a more explicit approach to serializing the structs should be preferred, for exactly the reason you bring up. Besides the endianness issue, struct packing also affects compatibility and should be explicitly specified. For persistence, a packing of 1 would be optimal; for in-memory data structures, that alignment is far from optimal in terms of performance and concurrency characteristics. Also, different platforms might have incompatible data types (e.g. sizeof(long) on 64-bit Linux vs. Windows - LP64 vs. LLP64). So keeping the persistence layout separate from the in-memory data structures tends to have a long list of advantages, and therefore usually outweighs the disadvantage of having to maintain the serialization code separately - particularly if portability is a major concern.
You could take advantage of C/C++ reflection libraries or implement one yourself. In the case of C, this will definitely require macros (e.g. Metaresc). In the case of C++, you might actually get away with your original struct definitions (e.g. Boost.PFR, "Precise and Flat Reflection").
If reflection is not an option, you could generate the serialization code by parsing either the headers or the debug symbols. Generally, parsing C/C++ is complex, but by moving the structs involved into dedicated headers, you might get away with a simple C/C++ parser. Alternatively, you could simplify parsing by processing the output of gdb's ptype command, which works from the debug symbols. Or you could parse the debug symbols directly. With a scripting language like Python, both approaches should be feasible (pygccxml and pyelftools come to mind).
Rather than sticking to generating the serialization code as part of the build process, you could generate that code once and require updates whenever the structs change in the future. That's what I would do in a multi-platform scenario. Doing that would also spare you the pain of implementing a perfect parser that can deal with all kinds of C/C++ input, it would only have to be good enough for one-time generation.
Whenever I need to define a file structure, I'm using compiler-specific commands (like #pragma pack(1)) to ensure that I can safely read and write this file and don't need to worry about padding issues.
However, is there any other way to reach the same goal? I don't need to de-/serialize complex objects, just POD types.
It is impossible to define a cross-platform binary format that always nicely maps to the in-memory representation of types.
The two options for defining cross-platform file formats are:
Use text
Define a binary format in terms of what your favourite cross-platform serialisation library can provide and use that library to convert the file contents between their internal and external representation.
The Boost Serialization Library might be an option, if you want this solved fast and without much ado.
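A minimal sketch of what that looks like (the Point type and file name are invented):

#include <fstream>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>

struct Point {
    int x, y;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        ar & x;  // the library takes care of the representation
        ar & y;
    }
};

int main() {
    {
        const Point out{1, 2};
        std::ofstream ofs("point.dat");
        boost::archive::text_oarchive oa(ofs);  // text archives are portable;
        oa << out;                              // the stock binary one is not
    }
    Point in;
    std::ifstream ifs("point.dat");
    boost::archive::text_iarchive ia(ifs);
    ia >> in;
}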
I am trying to partially truncate (or shorten) an existing file, using fstream. I have tried writing an EOF character, but this seems to do nothing.
Any help would be appreciated...
I don't think you can. There are many functions for moving "up and down" the wrapper hierarchy HANDLE <-> int <-> FILE *, at least on Windows, but there is no "proper" way to extract the FILE * from an iostreams object (if indeed it is even implemented with one).
You may find this question to be of assistance.
Personally I would strongly recommend steering clear of iostreams, they're poorly designed, heavily C++, and nasty to look at. Take a look at Boost's iostreams, or wrap stdio.h if you need to use classes.
The relevant function for stdio is ftruncate().
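A sketch of how you might use it on a POSIX system (file name and size invented):

#include <fcntl.h>    // open
#include <unistd.h>   // ftruncate, close

int main() {
    int fd = open("data.bin", O_WRONLY);
    if (fd != -1) {
        ftruncate(fd, 1024);  // cut the file down to its first 1024 bytes
        close(fd);
    }
}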
The Boost.Interprocess library defines a portable truncate function. For some reason it is not documented, but you can find it in this header file.
It'll depend on the OS. Most OSes support this, but in different ways. On Windows, there's SetEndOfFile(): you move the file pointer to where you want the file to end and call it. On Unix and similar systems, you call ftruncate() (or truncate()) with the desired length. Other OSes undoubtedly use other methods.
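A sketch of the Windows route (error handling omitted; file name and length invented):

#include <windows.h>

int main() {
    HANDLE h = CreateFileA("data.bin", GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        LARGE_INTEGER len;
        len.QuadPart = 1024;                        // desired new length
        SetFilePointerEx(h, len, NULL, FILE_BEGIN); // move to the new end
        SetEndOfFile(h);                            // truncate the file here
        CloseHandle(h);
    }
}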
I bit the bullet in the end and read the part of the file to be kept into an array, then re-wrote it. It's not the best solution, but as the files will always be small, I have decided to accept this method.
I have written a C++ library that saves my data (a collection of custom structs etc.) into a binary file. I currently use (i.e. create and consume) the files locally, on my Windows (XP) machine. For simplicity, let's think of the library in two parts: a writer (creates the files) and a reader or consumer (simply reads data from the files).
Recently though, I would like to also consume (i.e. read) the data files I have created on my XP machine, on my Linux machine. I must point out at this stage that both machines are PCs (so have the same endianness etc.).
I can build a reader (and compile for Linux [Ubuntu 9.10 to be precise]), since I am the library creator. My question, before I embark down this road (of building the reader etc) is:
Assuming I have successfully built the reader for Linux:
Can I simply copy files that were created on the Windows (XP) machine across to the Linux (Ubuntu 9.10) machine, and use the Linux reader to successfully read the copied file?
For the files to be binary compatible:
endianness must match (as it does for you)
bitfield packing order must be the same
sizes and signedness of types must be the same
the compiler must make the same decisions about padding and alignment
It's certainly possible for all of these conditions to be fulfilled, or for you to not happen to be hitting any cases for which they are not. At the very least, though, I'd add some sanity checks and/or sentinel members to detect problems.
Binary files should be compatible across machines with the same endianness.
The issue you may have in your code is the size of ints: you can't necessarily assume that compilers on different OSes use the same size of int. So either copy blocks of bytes and cast them, or use fixed-width types such as int16_t and int32_t.
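A sketch of the fixed-width approach (the Record layout is invented):

#include <cstdint>
#include <fstream>

// int32_t and int16_t have the same size with every conforming compiler.
struct Record {
    int32_t id;
    int16_t flags;
};

int main() {
    Record r{};
    std::ifstream in("data.bin", std::ios::binary);
    // Reading field by field keeps struct padding out of the file format.
    in.read(reinterpret_cast<char*>(&r.id),    sizeof r.id);
    in.read(reinterpret_cast<char*>(&r.flags), sizeof r.flags);
}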
Structs are not a file format, and you shouldn't try to use them as such.
Making structs work with fread and fwrite takes a huge number of hacks. You byte-swap integers so that you can share files between little-endian and big-endian machines. You change your structs to use fixed-width integer types, so you can share files between machines with different word sizes (such as between x86 and x64 machines). You add compiler-specific pragmas to control the padding of structs so you can share files between compiler versions.
It works, but it's ugly. Not to mention, easy to get wrong.
Much like the recommendation in The byte order fallacy, a much better idea is to write code to read/write the fields individually. By writing your own code, you can ensure there's no padding, and you can choose integer sizes independently of the local size of integers, and you can support both endiannesses without byte-swapping (by reading/writing the bytes of an integer separately).
Unlike the hacky approach, this is hard to get wrong. Further, because you don't rely on any compiler or architecture specific behaviors, either your code will work on all compilers and architectures, or none. If you do it right, you shouldn't have any platform-specific bugs.
There is one downside: individually reading/writing the fields is slower than a single fread/fwrite. You can set up a buffer (uint8_t buffer[]), write the entirety of the data into it, and then write everything out at once, which might help. It will still be slower, because you still have to move the fields into the buffer one at a time, but for most purposes it will be fast enough (exceptions being embedded/real-time systems and extremely high-performance computing).
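As an illustration, here is how a 32-bit little-endian field can be read without any byte-swapping, regardless of the host's endianness (a sketch in the spirit of the linked article):

#include <cstdint>

// 'p' points at 4 bytes that the file stores little-endian.
uint32_t read_u32_le(const unsigned char* p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}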
If:
the machines have the same endianess (as you stated they have) and
you open the streams in binary mode, as text mode might do funny things, e.g. with line endings, and
you have programmed cleanly so you don't stumble over implementation-defined stuff like alignments, data type sizes, and struct packing,
then yes, your files should be portable.
The third bullet point is what makes a file format a "portable" one. Depending on what kind of data you have in your structs, it can be very easy or a bit tricky. Bitfields, or data being reinterpreted from a different type are especially tricky.
You might consider taking a look at the Boost Serialization Library.
A lot of thought has been put into it, and it will handle many of the potential cross-platform incompatibilities for you.
Of course, it's possible that it's overkill for your particular use case, especially if you've already got your writers & readers implemented.
I have a C++ class that looks a bit like this:
class BinaryStream : private std::iostream
{
public:
explicit BinaryStream(const std::string& file_name);
bool read();
bool write();
private:
Header m_hdr;
std::vector<Row> m_rows;
};
This class reads and writes data in a binary format to disk. I am not using any platform-specific coding, relying instead on the STL. I have successfully compiled it on XP. I am wondering if I can FTP the files written on the XP platform and read them on my Linux machine (once I recompile the binary stream library on Linux).
Summary:
Files created on an XP machine using a cross-platform library compiled for XP.
Compile the same library (used in 1 above) on a Linux machine
Question: Can files created in 1 above be read on a Linux machine (2)?
If no, please explain why not, and how I may get around this issue.
Derive from std::basic_streambuf. That's what they are there for. Note that most STL classes are not designed to be derived from; the one I mention is an exception.
This depends entirely on the specifics of the binary encoding. One thing that's different about Linux vs. XP is that you're much more likely to find yourself on a big-endian platform, and if your binary encoding is endian specific you'll end up with issues.
You may also end up with issues relating to the end-of-line character. There isn't enough information here about how you're using ::std::iostream to give you a good answer to this question.
I would strongly suggest looking at the protobuf library. It is an excellent library for creating fast cross-platform binary encodings.
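To give a feel for it, a rough sketch (message definition and names invented; person.pb.h is what protoc generates from the .proto file):

// person.proto, compiled with protoc into person.pb.h / person.pb.cc:
//   syntax = "proto3";
//   message Person { int32 id = 1; string name = 2; }

#include <fstream>
#include "person.pb.h"

int main() {
    Person p;
    p.set_id(42);
    p.set_name("example");

    std::ofstream out("person.bin", std::ios::binary);
    p.SerializeToOstream(&out);  // portable binary encoding
}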
If you want your code to be portable across machines with different endianness, you need to stick to one endianness in your files. Whenever you read or write files, you convert between the host byte order and the file byte order. It's common to use what is called network byte order when you want to write files that are portable across all machines. Network byte order is defined to be big-endian, and there are pre-made functions for those conversions (although they are very easy to write yourself).
For example, before writing a 32-bit integer to a file, you convert it to network byte order using htonl(), and when reading it back you convert it to host byte order with ntohl(). On big-endian systems, htonl() and ntohl() simply return the number passed to them; on little-endian systems, they swap the bytes of the value.
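In code, that looks roughly like this (POSIX headers; on Windows the same functions come from winsock2.h):

#include <arpa/inet.h>  // htonl, ntohl
#include <cstdint>
#include <cstdio>

void write_u32(std::FILE* f, uint32_t host_value)
{
    uint32_t net = htonl(host_value);   // host order -> big-endian
    std::fwrite(&net, sizeof net, 1, f);
}

uint32_t read_u32(std::FILE* f)
{
    uint32_t net = 0;
    std::fread(&net, sizeof net, 1, f);
    return ntohl(net);                  // big-endian -> host order
}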
If you don't care about supporting big-endian systems, none of this is an issue though, although it's still good practice.
Another important thing to pay attention to is the padding of the structs/classes that you write, if you write them directly to the file (e.g. Header and Row). Different compilers on different platforms can use different padding, which means that variables are aligned differently in memory. This can break things big-time if the compilers you use on the different platforms pad differently. So for structs that you intend to write directly to files or other streams, you should always specify the padding, by telling the compiler to pack your structs like this:
#pragma pack(push, 1)
struct Header {
// This struct uses 1-byte padding
...
};
#pragma pack(pop)
Remember that doing this will make using the struct less efficient in your application, because access to unaligned memory addresses means more work for the system. This is why it's generally a good idea to have separate types: a packed struct that you write to streams, and a type that you actually use in the application (you just copy the members from one to the other).
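A sketch of that separation (the type names are mine):

#include <cstdint>

#pragma pack(push, 1)
struct RowOnDisk {   // exact file layout, 1-byte packing
    uint32_t id;
    uint16_t flags;
};
#pragma pack(pop)

struct Row {         // naturally aligned type used by the application
    uint32_t id;
    uint16_t flags;
};

Row to_row(const RowOnDisk& d)
{
    Row r;
    r.id    = d.id;      // copy the members from one type to the other
    r.flags = d.flags;
    return r;
}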
EDIT: Another way to deal with the issue, of course, is to serialize those structs yourself, which doesn't require #pragma (pragmas are a compiler-dependent feature, although all major compilers to my knowledge support pragma pack).
Here is an article on Endianness that is related to your question; look for "Endianness in files and byte swap". Briefly: if your Linux machine has the same endianness, it's OK; if not, there might be problems.
For example, when the 16-bit integer 1 is written to a file on XP (little-endian), the bytes look like this: 01 00
But when integer 1 is written to a file on a machine with the other endianness, it will look like this: 00 01
But if you only use single-byte characters, there should be no problem.
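You can see this for yourself with a small check (hypothetical snippet):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint16_t v = 1;
    unsigned char b[2];
    std::memcpy(b, &v, sizeof v);
    // Little-endian machine (like the XP box): prints 01 00.
    // Big-endian machine: prints 00 01.
    std::printf("%02x %02x\n", b[0], b[1]);
}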
As long as they're plain binary files, it should work.
Because you're using the STL for everything, there's no reason your program shouldn't be able to read the files on a different platform.
If you are writing a struct / class directly out to the disc, then don't.
This might not be compatible between different builds on the same compiler, and almost certainly will break when you move to a different platform or compiler. It will definitely break if you change to a different architecture.
It isn't clear from the above code what you're actually writing to the file.