I am trying to read chunks of data from a file directly into a struct but the padding is causing too much data to be read and the data to be misaligned.
Do I have to manually read each part into the struct or is there an easier way to do this?
My code:
The structs
typedef unsigned char byte;
struct Header
{
char ID[10];
int version;
};
struct Vertex //cannot rearrange the order of the members
{
byte flags;
float vertex[3];
char bone;
byte referenceCount;
};
How I am reading in the data:
std::ifstream in(path.c_str(), std::ifstream::in | std::ifstream::binary);
Header header;
in.read((char*)&header.ID, sizeof(header.ID));
header.ID[9] = '\0';
in.read((char*)&header.version, sizeof(header.version));
std::cout << header.ID << " " << header.version << "\n";
in.read((char*)&NumVertices, sizeof(NumVertices));
std::cout << NumVertices << "\n";
std::vector<Vertex> Vertices(NumVertices);
for(std::vector<Vertex>::iterator it = Vertices.begin(); it != Vertices.end(); ++it)
{
Vertex& v = (*it);
in.read((char*)&v.flags, sizeof(v.flags));
in.read((char*)&v.vertex, sizeof(v.vertex));
in.read((char*)&v.bone, sizeof(v.bone));
in.read((char*)&v.referenceCount, sizeof(v.referenceCount));
}
I tried doing: in.read((char*)&Vertices[0], sizeof(Vertices[0]) * NumVertices); but this produces incorrect results because of what I believe to be the padding.
Also: at the moment I am using C-style casts, what would be the correct C++ cast to use in this scenario or is a C-style cast okay?
If you're writing the entire structure out in binary, you don't need to read it as if you had stored each variable separately. You would just read in the size of the structure from file into the struct you have defined.
Header header;
in.read((char*)&header, sizeof(Header));
If you're always running on the same architecture or the same machine, you won't need to worry about endian issues as you'll be writing them out the same way your application needs to read them in. If you are creating the file on one architecture and expect it to be portable/usable on another, then you will need to swap bytes accordingly. The way I have done this in the past is to create a swap method of my own. (for example Swap.h)
Swap.h - This is the header you use within your code
void swap(unsigned char *x, int size);
------------------
SwapIntel.cpp - This is what you would compile and link against when building for Intel
void swap(unsigned char *x, int size)
{
return; // Do nothing: the file was written on Intel (little-endian), which is already the native byte order
}
------------------
SwapSolaris.cpp - This is what you would compile and link against when building for Solaris
void swap(unsigned char *x, int size)
{
// Byte swapping code here to switch from little-endian to big-endian as the file was written on Intel
// and this file will be the implementation used within the Solaris build of your product
return;
}
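For what it's worth, the byte-swapping body that is only a comment above can be as small as reversing the buffer in place. This is a minimal sketch, assuming x points at exactly one value of size bytes and that the Solaris build always needs a full reversal:

```cpp
#include <algorithm>

// One possible body for the Solaris build of swap(): reverse the bytes of a
// single value in place (little-endian <-> big-endian).
void swap(unsigned char *x, int size)
{
    if (x == nullptr || size <= 1)
        return; // nothing to reverse
    std::reverse(x, x + size);
}
```

Call it once per multi-byte field (for example on the bytes of header.version), never on a whole struct, since reversing a struct would also reorder its members.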
No, you don't have to read each field separately. This is called alignment/packing. See http://en.wikipedia.org/wiki/Data_structure_alignment
In this case the C-style cast is equivalent to reinterpret_cast, and you are using it correctly. You may use the C++-specific syntax instead, but it is a lot more typing.
You can change padding by explicitly asking your compiler to align structs on 1 byte instead of 4 or whatever its default is. Depending on environment, this can be done in many different ways, sometimes file by file ('compilation unit') or even struct by struct (with pragmas and such) or only on the whole project.
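As an illustration of the struct-by-struct approach, here is a minimal sketch that packs the question's Vertex with #pragma pack(push, 1) (accepted by MSVC, GCC and Clang) and reads all vertices in one call. The names PackedVertex and readVertices are made up for this example, and it assumes the file was written with the same byte order and float format as the reading machine:

```cpp
#include <cstddef>
#include <istream>
#include <sstream> // only needed to try the function on an in-memory std::istringstream
#include <string>
#include <vector>

#pragma pack(push, 1) // no padding between members
struct PackedVertex
{
    unsigned char flags;
    float vertex[3];
    char bone;
    unsigned char referenceCount;
};
#pragma pack(pop) // restore the default packing

// Fails to compile if the compiler still inserted padding.
static_assert(sizeof(PackedVertex) == 3 + 3 * sizeof(float),
              "PackedVertex has unexpected padding");

// Reads `count` vertices with a single read(); returns false on a short read.
bool readVertices(std::istream &in, std::size_t count, std::vector<PackedVertex> &out)
{
    out.resize(count);
    if (count == 0)
        return true;
    const std::streamsize bytes =
        static_cast<std::streamsize>(count * sizeof(PackedVertex));
    in.read(reinterpret_cast<char *>(out.data()), bytes);
    return in.gcount() == bytes;
}
```

The static_assert is the useful part: if another compiler or setting reintroduces padding, the build breaks instead of the data silently going out of step.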
header.ID[10] = '\0'; writes past the end of the array.
header.ID[9] is the last element of the array.
If you are using a Microsoft compiler then explore the align pragma. There are also the alignment include files:
#include <pshpack1.h>
// your code here
#include <poppack.h>
GNU gcc has a different system that allows you to add alignment/padding to the structure definition.
If you are reading and writing this file yourself, try Google Protobuf library. It will handle all byteorder, alignment, padding and language interop issues.
http://code.google.com/p/protobuf/
Related
I have a binary file with some layout I know. For example let format be like this:
2 bytes (unsigned short) - length of a string
5 bytes (5 x chars) - the string - some id name
4 bytes (unsigned int) - a stride
24 bytes (6 x float - 2 strides of 3 floats each) - float data
The file should look like (I added spaces for readability):
5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5
Here 5 - is 2 bytes: 0x05 0x00. "hello" - 5 bytes and so on.
Now I want to read this file. Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
P.S.: while I was writing the question, a clearer formulation came to mind: how do I cast an arbitrary number of bytes from an arbitrary position in a char [x]?
Update: I forgot to mention explicitly that string and float data length is not known at compile time and is variable.
If it is not for learning purpose, and if you have freedom in choosing the binary format you'd better consider using something like protobuf which will handle the serialization for you and allow to interoperate with other platforms and languages.
If you cannot use a third party API, you may look at QDataStream for inspiration
Documentation
Source code
The C way, which would work fine in C++, would be to declare a struct:
#pragma pack(1)
struct contents {
// data members;
};
Note that
You need to use a pragma to make the compiler align the data as-it-looks in the struct;
This technique only works with POD types
And then cast the read buffer directly into the struct type:
std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());
Now if your data's size is variable, you can separate in several chunks. To read a single binary object from the buffer, a reader function comes handy:
template<typename T>
const char *read_object(const char *buffer, T& target) {
target = *reinterpret_cast<const T*>(buffer);
return buffer + sizeof(T);
}
The main advantage is that such a reader can be specialized for more advanced c++ objects:
template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
size_t size = target.size();
CT const *buf_start = reinterpret_cast<const CT*>(buffer);
std::copy(buf_start, buf_start + size, target.begin());
return buffer + size * sizeof(CT);
}
And now in your main parser:
int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);
Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.
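One way around the processor-alignment problem mentioned in the note is to copy the bytes out of the buffer instead of dereferencing a cast pointer. A minimal sketch, assuming the buffer holds the value in the host's byte order; read_object_copy is a made-up name for a variant of read_object with the same contract:

```cpp
#include <cstring>
#include <type_traits>

// Same contract as read_object, but copies the bytes instead of dereferencing
// a possibly misaligned pointer, so it is safe at any buffer offset.
template <typename T>
const char *read_object_copy(const char *buffer, T &target)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be filled with memcpy");
    std::memcpy(&target, buffer, sizeof(T));
    return buffer + sizeof(T);
}
```

Compilers typically turn a small fixed-size memcpy like this into a plain load, so the safety usually costs nothing.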
Currently I do it so:
load file to ifstream
read this stream to char buffer[2]
cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.
That last risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values that are aligned at an even address), performance (some CPUs will read misaligned values but slower; others like modern x86s are fine and fast) and/or endianness issues. I'd suggest reading the two characters then you can say (x[0] << 8) | x[1] or vice versa, using htons if needing to correct for endianness.
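The shift-and-or suggestion can be sketched as two small helpers. The names load_u16_le and load_u16_be are made up for this example; the bytes are taken as unsigned char so that values above 0x7F do not sign-extend:

```cpp
#include <cstdint>

// Assemble a 16-bit value from two bytes. This works at any address and gives
// the same result on little-endian and big-endian hosts.
inline std::uint16_t load_u16_le(const unsigned char *p) // low byte stored first
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint16_t load_u16_be(const unsigned char *p) // high byte stored first
{
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

For the file in the question, where 5 is stored as 0x05 0x00, load_u16_le is the one that yields 5.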
read a stream to vector<char> and create a std::string from this vector. Now I have string id.
No need... just read directly into the string:
std::string s(the_size, ' ');
if (input_stream.read(&s[0], s.size()) &&
input_stream.gcount() == s.size())
...use s...
the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
while not end of file read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
Better to read the data directly over the unsigned ints and floats, as that way the compiler will ensure correct alignment.
This works, but for me it looks ugly. Can I read directly to unsigned short or float or string etc. without char [x] creating? If no, what is the way to cast correctly (I read that style I'm using - is an old style)?
struct Data
{
uint32_t x;
float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
input_stream.gcount() == sizeof data)
...use x and y...
Note the code above avoids reading data into potentially unaligned character arrays; it is unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) due to alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it. Indexing past the declared array size of y generally works in practice as long as the memory at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream, although the C++ standard does not strictly guarantee it. Simpler, but with an additional read and so possibly slower: read the uint32_t first, then new float[n], and do a further read into there.
Practically, this type of approach can work and a lot of low level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....
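The simpler two-read variant could look roughly like this. It is a sketch that assumes the floats are preceded by a 32-bit count in host byte order, which is not exactly the layout in the question; read_counted_floats and max_count are made-up names, and a std::vector stands in for new float[n]:

```cpp
#include <cstdint>
#include <istream>
#include <sstream> // only needed to try the function on an in-memory std::istringstream
#include <string>
#include <vector>

// Reads a 32-bit count followed by that many floats, in host byte order.
// `max_count` is a sanity limit so a corrupt count cannot trigger a huge allocation.
bool read_counted_floats(std::istream &in, std::vector<float> &out, std::uint32_t max_count)
{
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char *>(&n), sizeof n) || n > max_count)
        return false;
    out.resize(n);
    if (n == 0)
        return true;
    const std::streamsize bytes = static_cast<std::streamsize>(n * sizeof(float));
    // The vector's storage is correctly aligned for float, so reading straight
    // into it avoids the char-buffer cast entirely.
    in.read(reinterpret_cast<char *>(out.data()), bytes);
    return in.gcount() == bytes;
}
```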
I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern I decided to use C++ templates.
On some specific platforms, a packed struct could work, however there are things it does not handle well... such as fields of variable length. With templates, however, there is no such issue: you can get arbitrarily complex structures (and return types).
A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:
using Buffer = std::pair<unsigned char const*, size_t>;
template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
UInt16LEReader() {}
explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}
uint16_t read(Buffer const& buffer) const {
OffsetReader const& reader = *this;
size_t const offset = reader.read(buffer);
assert(offset <= buffer.second && "Incorrect offset");
assert(offset + 2 <= buffer.second && "Too short buffer");
unsigned char const* begin = buffer.first + offset;
// http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
return (uint16_t(begin[0]) << 0)
+ (uint16_t(begin[1]) << 8);
}
}; // class UInt16LEReader
// Declined for UInt[8|16|32][LE|BE]...
Of course, the basic OffsetReader actually has a constant result:
template <size_t O>
class FixedOffsetReader {
public:
size_t read(Buffer const&) const { return O; }
}; // class FixedOffsetReader
and since we are talking templates, you can switch the types at leisure (you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).
What is interesting, though, is the end-result:
// http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
class LocalFileHeader {
public:
template <size_t O>
using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
template <size_t O>
using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;
UInt32< 0> signature;
UInt16< 4> versionNeededToExtract;
UInt16< 6> generalPurposeBitFlag;
UInt16< 8> compressionMethod;
UInt16<10> fileLastModificationTime;
UInt16<12> fileLastModificationDate;
UInt32<14> crc32;
UInt32<18> compressedSize;
UInt32<22> uncompressedSize;
using FileNameLength = UInt16<26>;
using ExtraFieldLength = UInt16<28>;
using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
using ExtraField = StringReader<
CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
ExtraFieldLength
>;
FileName filename;
ExtraField extraField;
}; // class LocalFileHeader
This is rather simplistic, obviously, but incredibly flexible at the same time.
An obvious axis of improvement would be to improve chaining since here there is a risk of accidental overlaps. My archive reading code worked the first time I tried it though, which was evidence enough for me that this code was sufficient for the task at hand.
I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:
BEGIN_STRUCT(foo)
UNSIGNED_SHORT(length)
STRING_FIELD(length, label)
UNSIGNED_INT(stride)
FLOAT_ARRAY(3 * stride)
END_STRUCT(foo)
Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.
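To make the idea concrete, here is a minimal single-file sketch of the same trick, with the field list held in a macro rather than in a separate include file. The FOO_FIELDS list, the scale field and unpack_foo are made up for this example; it only handles fixed-size fields and assumes the raw data is in host byte order:

```cpp
#include <cstdint>
#include <cstring>

// The one and only description of the record: (type, name) pairs.
#define FOO_FIELDS(X)        \
    X(std::uint16_t, length) \
    X(std::uint32_t, stride) \
    X(float, scale)

// First expansion: generate the struct declaration.
struct foo
{
#define DECLARE_FIELD(type, name) type name;
    FOO_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Second expansion: generate the unpacking code. Each field is copied from the
// tightly packed byte buffer, so padding inside `foo` does not matter.
inline const unsigned char *unpack_foo(const unsigned char *p, foo &out)
{
#define UNPACK_FIELD(type, name)             \
    std::memcpy(&out.name, p, sizeof(type)); \
    p += sizeof(type);
    FOO_FIELDS(UNPACK_FIELD)
#undef UNPACK_FIELD
    return p; // first byte after the record
}
```

Adding a field to FOO_FIELDS updates both the struct and the unpacker, which is the whole point of generating them from one description.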
NB I first saw this technique used in gcc for abstract syntax tree-related code generation.
If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).
It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low level foundation code like this.
It is better to declare a structure with 1-byte packing (how to do that depends on the compiler). Write using that structure, and read using the same structure. Put only POD members in the structure, hence no std::string etc. Use this structure only for file I/O or other inter-process communication; use a normal struct or class to hold the data for further use in the C++ program.
Since all of your data is variable, you can read the two blocks separately and still use casting:
struct id_contents
{
uint16_t len;
char id[];
} __attribute__((packed)); // assuming gcc, ymmv
struct data_contents
{
uint32_t stride;
float data[];
} __attribute__((packed)); // assuming gcc, ymmv
class my_row
{
const id_contents* id_;
const data_contents* data_;
size_t size_;
public:
my_row(const char* buffer) {
id_= reinterpret_cast<const id_contents*>(buffer);
size_ = sizeof(*id_) + id_->len;
data_ = reinterpret_cast<const data_contents*>(buffer + size_);
size_ += sizeof(*data_) +
data_->stride * sizeof(float); // or however many, 3*float?
}
size_t size() const { return size_; }
};
That way you can use Mr. kbok's answer to parse correctly:
const char* buffer = getPointerToDataSomehow();
my_row data1(buffer);
buffer += data1.size();
my_row data2(buffer);
buffer += data2.size();
// etc.
I personally do it this way:
// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)
someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;
Very effective way for fixed-size structs at the start of the file.
Use a serialization library. Here are a few:
Boost serialization and Boost fusion
Cereal (my own library)
Another library called cereal (same name as mine but mine predates theirs)
Cap'n Proto
The Kaitai Struct library provides a very effective declarative approach, which has the added bonus of working across programming languages.
After installing the compiler, you will want to create a .ksy file that describes the layout of your binary file. For your case, it would look something like this:
# my_type.ksy
meta:
id: my_type
endian: be # for big-endian, or "le" for little-endian
seq: # describes the actual sequence of data one-by-one
- id: len
type: u2 # unsigned short in C++, two bytes
- id: my_string
type: str
size: 5
encoding: UTF-8
- id: stride
type: u4 # unsigned int in C++, four bytes
- id: float_data
type: f4 # a four-byte floating point number
repeat: expr
repeat-expr: 6 # repeat six times
You can then compile the .ksy file using the kaitai struct compiler ksc:
# wherever the compiler is installed
# -t specifies the target language, in this case C++
/usr/local/bin/kaitai-struct-compiler my_type.ksy -t cpp_stl
This will create a my_type.cpp file as well as a my_type.h file, which you can then include in your C++ code:
#include <fstream>
#include <iostream>
#include <kaitai/kaitaistream.h>
#include "my_type.h"
int main()
{
std::ifstream ifs("my_data.bin", std::ifstream::binary);
kaitai::kstream ks(&ifs);
my_type_t obj(&ks);
std::cout << obj.len() << '\n'; // you can now access properties of the object
return 0;
}
Hope this helped! You can find the full documentation for Kaitai Struct here. It has a load of other features and is a fantastic resource for binary parsing in general.
I use the ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. The generated code does not use any file I/O or buffering, and ragel produces both easy-to-debug code and a .dot/.pdf file with a state machine diagram.
ragel can also output Go, Java, etc. code for parsing, but I did not use these features.
The key feature of ragel is the ability to parse any byte-oriented data, but you can't dig into bit fields. Another limitation is that ragel can parse regular structures but has no support for recursion or grammar-based parsing.
I am exchanging a struct called struct update_packet with other servers (of identical or similar system) running the same program through UDP socket using sendto(..) and recvfrom().
update_packet needs to be in the general message format, which means its fields have predetermined fixed sizes and the size of the struct is the sum of the field sizes.
struct node {
uint32_t IP;
uint16_t port;
int16_t nil;
uint16_t server_id;
uint16_t cost;
};
struct update_packet {
uint16_t num_update_fields;
uint16_t port;
uint32_t IP;
struct node * nodes;
update_packet() :
num_update_fields(num_nodes), port(myport), IP(myIP)
{ /* fill in nodes array */ }
};
(update_packet contains a pointer array of struct node)
I used reinterpret_cast to send an instance of update packet via UDP, and the following compiles and sends to the correct destination.
int update_packet_size = sizeof(up);
sendto(s, reinterpret_cast<const char*>(&up), update_packet_size, 0,
(struct sockaddr *)&dest_addr, sizeof(dest_addr));
However, when I receive it and try to decode it by
struct update_packet update_msg =
reinterpret_cast<struct update_packet>(recved_msg);
I get an error
In function ‘int main(int, char**)’:
error: invalid cast from type ‘char*’ to type ‘update_packet’
struct update_packet update_msg =
reinterpret_cast<struct update_packet>(recved_msg);
Why does this error occur, and how can I fix this?
Also, is this a correct way to exchange data in an instance of struct through sockets? If not, what should I do? Do I need a pack()ing function like in http://beej.us/guide/bgnet/examples/pack2.c ?
Generalities
The cast issue has been properly answered in other questions.
However, you should never rely on pointer cast for sending/receiving a struct through the network, for many reasons including:
Packing : the compiler may align struct variables and insert padding bytes. This is compiler dependent, thus your code will not be portable. If the two communicating machines run your program compiled with different compilers, it will likely not work.
Endianness : for the same reason, the byte order when sending a multibyte number (such as int) may be different between the two machines.
This results in code that might work for a while but cause a lot of problems a few years later, when someone changes the compiler, the platform, etc. As this is for an educational project, you should try to do it the proper way.
For this reason, converting data from a struct into a char array for sending through the network or writing to file should be done carefully, variable by variable, and if possible taking endianness into account. This process is called "serializing".
Serialization in details
Serialization means you convert a data structure into an array of bytes, that can be sent over the network.
The serialized format is not necessarily binary : text or xml are possible options. If the amount of data is small, text is maybe the best solution, and you can rely on the STL only with stringstreams (std::istringstream and std::ostringstream)
There are several good libraries for serializing to binary, for instance Boost::serialization or QDataStream in Qt.
You may also do it yourself, look SO for "C++ serializing"
Simple serializing to text using the STL
In your case, you might just serialize to a text string using something like:
std::ostringstream oss;
oss << up.port << ' ';
oss << up.IP << ' ';
oss << up.num_update_fields << ' ';
for(unsigned int i=0;i<up.num_update_fields;i++)
{
oss << up.nodes[i].IP << ' ';
oss << up.nodes[i].port << ' ';
oss << up.nodes[i].nil << ' ';
oss << up.nodes[i].server_id << ' ';
oss << up.nodes[i].cost << ' ';
}
std::string str = oss.str();
const char * data_to_send = str.data();
unsigned int num_bytes_to_send = str.size();
And for deserializing received data:
std::string str(data_received, num_bytes_received);
std::istringstream iss(str);
update_packet up;
iss >> up.port;
iss >> up.IP;
iss >> up.num_update_fields;
//maximum number of nodes should be checked here before doing memory allocation!
up.nodes = (node*)malloc(sizeof(node)*up.num_update_fields);
for(unsigned int i=0;i<up.num_update_fields;i++)
{
iss >> up.nodes[i].IP;
iss >> up.nodes[i].port;
iss >> up.nodes[i].nil;
iss >> up.nodes[i].server_id;
iss >> up.nodes[i].cost;
}
This will be 100% portable and safe. You may verify data validity by checking the iss error flags.
Also you might, for safety:
Use a std::vector instead of the node pointer. This will prevent memory leaks and other issues
Check the number of nodes just after iss >> up.num_update_fields;. If it's too big, abort the decoding before allocating a huge buffer that will crash your program and maybe the system. Network attacks are based on "holes" like that: an attacker may crash a server by making it allocate a buffer 100x larger than its RAM if this kind of check is not made.
If your networking API has a std::iostream interface, you may use directly the << and >> operators from it, without using the intermediate string and stringstreams
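Putting the first two safety suggestions together, a round trip with a std::vector and a node-count check might look like the sketch below. The type names Node and UpdatePacket, the functions to_text and from_text, and the limit kMaxNodes are all made up for this example; pick a limit that fits your protocol:

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct Node
{
    std::uint32_t ip;
    std::uint16_t port;
    std::int16_t nil;
    std::uint16_t server_id;
    std::uint16_t cost;
};

struct UpdatePacket
{
    std::uint16_t port = 0;
    std::uint32_t ip = 0;
    std::vector<Node> nodes; // owns its memory, so no malloc/free to get wrong
};

const std::size_t kMaxNodes = 64; // hypothetical upper bound

std::string to_text(const UpdatePacket &p)
{
    std::ostringstream oss;
    oss << p.port << ' ' << p.ip << ' ' << p.nodes.size();
    for (const Node &n : p.nodes)
        oss << ' ' << n.ip << ' ' << n.port << ' ' << n.nil << ' '
            << n.server_id << ' ' << n.cost;
    return oss.str();
}

// Returns false on malformed input or when the node count exceeds kMaxNodes.
bool from_text(const std::string &text, UpdatePacket &p)
{
    std::istringstream iss(text);
    UpdatePacket tmp;
    std::size_t count = 0;
    if (!(iss >> tmp.port >> tmp.ip >> count) || count > kMaxNodes)
        return false; // reject before allocating anything
    tmp.nodes.resize(count);
    for (Node &n : tmp.nodes)
        if (!(iss >> n.ip >> n.port >> n.nil >> n.server_id >> n.cost))
            return false;
    p = tmp;
    return true;
}
```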
You might think using space-separated text is a waste of bandwidth. That only matters if your number of nodes is large enough for bandwidth use to become non-negligible and critical. In that case, you need to serialize to binary. But don't do it if the text solution works perfectly (beware of premature optimization!)
Simple Binary serialization (not byte-order/endianness aware):
Replace:
oss << up.port;
By:
oss.write((const char *)&up.port, sizeof(up.port));
Endianness
But in your project, Big-Endian is required. If you are running on a PC (x86) you need to invert bytes in every field.
1)First option : by hand
const char * ptr = reinterpret_cast<const char *>(&up.port);
unsigned int s = sizeof(up.port);
for(unsigned int i=0; i<s; i++)
oss.put(ptr[s-1-i]);
Ultimate code : detect endianness (this is not difficult to do - look for it on SO) and adapt your serialization code.
2)Second option : use a library like boost or Qt
These libraries let you choose the endianness of the output data. Then they auto-detect the platform endianness and do the job automatically.
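A variation on the by-hand option that needs no endianness detection at all is to build the bytes with shifts, which produce the same output on any host. A minimal sketch; put_u16_be, put_u32_be and get_u32_be are made-up helper names:

```cpp
#include <cstdint>
#include <vector>

// Append values to a byte buffer in big-endian (network) order.
inline void put_u16_be(std::vector<unsigned char> &out, std::uint16_t v)
{
    out.push_back(static_cast<unsigned char>(v >> 8));   // high byte first
    out.push_back(static_cast<unsigned char>(v & 0xFF)); // then low byte
}

inline void put_u32_be(std::vector<unsigned char> &out, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}

// Read a big-endian 32-bit value back from four bytes.
inline std::uint32_t get_u32_be(const unsigned char *p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}
```

Because the shifts operate on values rather than on memory, the same code is correct on x86 and on a big-endian machine.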
You can't cast a char* to a struct object, but you can cast it to a pointer to that struct.
Change
struct update_packet update_msg =
reinterpret_cast<struct update_packet>(recved_msg);
to
update_packet * update_msg =
reinterpret_cast<update_packet *>(recved_msg);
And yes, you need at least pack(), because the compiler on the sending side might add padding differently. However, it is not 100% safe. You also have to take into account that the sending and receiving machines may differ in endianness. I would suggest that you look into proper serialization mechanisms.
You may also use:
struct update_packet update_msg;
memcpy(&update_msg, recved_msg, size-of-message);
You must however ensure that size-of-message is exactly what you are looking for.
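A sketch of that size check, limited to the fixed-size part of the message. The struct update_header and the function decode_header are made up for this example, and it assumes both ends use the same layout and byte order. The nodes pointer is deliberately left out: a pointer value sent over the network is meaningless on the receiver, so the node entries have to travel as bytes after the header.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fixed-size part of the message only: no pointer members.
struct update_header
{
    std::uint16_t num_update_fields;
    std::uint16_t port;
    std::uint32_t IP;
};
static_assert(sizeof(update_header) == 8, "unexpected padding in update_header");

// Copies the header out of a received datagram; returns false if it is too short.
bool decode_header(const char *recved_msg, std::size_t recved_len, update_header &out)
{
    if (recved_msg == nullptr || recved_len < sizeof(update_header))
        return false;
    std::memcpy(&out, recved_msg, sizeof(update_header));
    return true;
}
```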
Speaking just of decoding (your computer - your rules), both endianness and packing can be taken into account on GCC and Clang with a combo like this (it's using the Boost.Endian library):
#include <cstddef> // std::byte
#include <cstdint> // uint8_t
#include <boost/endian/arithmetic.hpp>
using boost::endian::big_uint16_t;
using boost::endian::big_uint32_t;
using boost::endian::big_uint64_t;
#pragma pack(push, 1)
enum class e_message_type: uint8_t {
hello = 'H',
goodbye = 'G'
};
struct message_header {
big_uint16_t size;
e_message_type message_type;
std::byte reserved;
};
static_assert(sizeof(message_header) == 4);
struct price_quote {
big_uint64_t price;
big_uint32_t size;
big_uint32_t timestamp;
};
static_assert(sizeof(price_quote) == 16);
template<class T> struct envelope {
message_header header;
T payload;
};
static_assert(sizeof(envelope<price_quote>) == 20);
#pragma pack(pop)
// and then
auto& x = *reinterpret_cast<envelope<price_quote> const*>(buffer.data());
I am new in here and I have a question
I have a struct, let's say overall size is 8 bytes, here the struct:
struct Header
{
int ID; // 4 bytes
char Title [4]; // 4 bytes too
}; // so it 8 bytes right?
and I have a file with 8 bytes too...
I just want to ask how to parse the data in that file into that struct.
I have tried this one:
Header* ParseHeader(char* filename)
{
char* buffer = new char[8];
fstream fs(filename);
if (fs.is_open() != true)
throw new exception("Couldn't Open file for Parsing Header.");
fs.read(buffer, 8);
if (!fs)
{
delete[] buffer;
throw new exception("Couldn't Read header OJN file.\nHeader data was corrupted");
}
Header* header = (Header*)((void*)buffer);
delete[] buffer;
fs.close();
return header;
}
but it fails and returns invalid data instead of what I expect (I can assure you this is not the file's fault; the file is structured correctly)
Can someone help me?
Thanks
Seems like you do everything fine until this point:
Header* header = (Header*)((void*)buffer);
delete[] buffer;
fs.close();
Notice that you delete the buffer after the cast, meaning that header points to a deleted location (junk). You need to either not delete it or copy the data if you still want to use it.
Also, to be quite honest, I don't understand how your code compiles, your function states it returns a Header, while you return a Header*..
You are deleting the data that is being returned. Therefore header is no longer accessible.
I think you meant the line to be:
Header header = *(Header*)((void*)buffer);
This will actually copy the header.
The fact that your 8-byte file correctly maps to your struct Header is mere luck as far as C++ is concerned. The structure could have internal padding that makes it bigger than 8 bytes, and the data endianness could be different between your file and your CPU.
I realize your code works with your particular compiler version, on your operating system and on your CPU but you should not get into the habit of coding like that, otherwise you'll probably get into big trouble as soon as you change any of those parameters (or maybe even just some compiler flags). In other words, what you are doing is extremely bad practice. In C++ you don't even have the guarantee that an int is actually 4 bytes.
The Right Way™ to load such binary data from a file is to load each field individually and ensure proper endianness conversion depending on the CPU you're using (eg. through hton* / ntoh* or similar functions). Using a fixed-size type like int32_t also helps.
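For the 8-byte header in the question, field-by-field decoding could look like this. It is a sketch that assumes the file stores ID in little-endian order (the question does not say); parseHeader is a made-up name, and the header is returned by value through a reference so nothing points into a freed buffer:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Header
{
    std::int32_t ID; // fixed-size type instead of int
    char Title[4];
};

// Decodes the 8-byte on-disk header field by field.
// ASSUMPTION: the file stores ID in little-endian byte order.
bool parseHeader(const unsigned char *bytes, std::size_t size, Header &out)
{
    if (bytes == nullptr || size < 8)
        return false;
    const std::uint32_t id = std::uint32_t(bytes[0]) |
                             (std::uint32_t(bytes[1]) << 8) |
                             (std::uint32_t(bytes[2]) << 16) |
                             (std::uint32_t(bytes[3]) << 24);
    std::memcpy(&out.ID, &id, sizeof id); // reinterpret the 32 bits as signed
    std::memcpy(out.Title, bytes + 4, sizeof out.Title);
    return true;
}
```

Padding inside Header no longer matters here, because the struct layout is never assumed to match the file layout.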
Just define your structure in 1-byte boundary as:
#pragma pack(1)
struct Header
{
int ID; // 4 bytes
char Title [4]; // 4 bytes too
};
#pragma pack()
The first pack statement instructs the compiler to align structure members on one-byte boundaries, i.e. to insert no padding. This way the size of Header will be 8 bytes. The second pack statement restores the default setting. You may need to use the push and pop forms instead, but this isn't required here.
Secondly, and more importantly, you should not use hard-coded values like 8. Always use sizeof to read or write a structure. Also, this statement is absolutely not needed:
char* buffer = new char[8];
...
Just declare Header variable itself, and read on it:
Header header;
...
fs.read(reinterpret_cast<char*>(&header), sizeof(Header));
struct Vector
{
float x, y, z;
};
func(Vector *vectors) {...}
usage:
float *coords = load(file);
func(coords);
I have a question about the alignment of structures in C++. I will pass a set of points to the function func(). Is it OK to do it in the way shown above, or is this relying on platform-dependent behavior? (It works at least with my current compiler.) Can somebody recommend a good article on the topic?
Or, is it better to directly create a set of points while loading the data from the file?
Thanks
Structure alignment is implementation-dependent. However, most compilers give you a way of specifying that a structure should be "packed" (that is, arranged in memory with no padding bytes between fields). For example:
struct Vector {
float x;
float y;
float z;
} __attribute__((__packed__));
The above code will cause the gcc compiler to pack the structure in memory, making it easier to dump to a file and read back in later. The exact way to do this may be different for your compiler (details should be in your compiler's manual).
I always list members of packed structures on separate lines in order to be clear about the order in which they should appear. For most compilers this should be equivalent to float x, y, z; but I'm not certain if that is implementation-dependent behavior or not. To be safe, I would use one declaration per line.
If you are reading the data from a file, you need to validate the data before passing it to func. No amount of data alignment enforcement will make up for a lack of input validation.
Edit:
After further reading your code, I understand more what you are trying to do. You have a structure that contains three float values, and you are accessing it with a float* as if it were an array of floats. This is very bad practice. You don't know what kind of padding that your compiler might be using at the beginning or end of your structure. Even with a packed structure, it's not safe to treat the structure like an array. If an array is what you want, then use an array. The safest way is to read the data out of the file, store it into a new object of type struct Vector, and pass that to func. If func is defined to take a struct Vector* as an argument and your compiler is allowing you to pass a float* without griping, then this is indeed implementation-dependent behavior that you should not rely on.
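As a sketch of that safest route, the flat list of floats can be converted into real Vector objects before calling func. The helper name toVectors is made up for this example, and it assumes the floats come in x, y, z order:

```cpp
#include <cstddef>
#include <vector>

struct Vector
{
    float x, y, z;
};

// Builds real Vector objects from a flat x,y,z,x,y,z,... list.
// A trailing incomplete triple is ignored.
std::vector<Vector> toVectors(const std::vector<float> &coords)
{
    std::vector<Vector> result;
    result.reserve(coords.size() / 3);
    for (std::size_t i = 0; i + 2 < coords.size(); i += 3)
    {
        Vector v = {coords[i], coords[i + 1], coords[i + 2]};
        result.push_back(v);
    }
    return result;
}
```

The result can then be passed as func(vectors.data()) without relying on how the compiler lays out the struct.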
Use an operator>> extraction overload.
std::istream& operator>>(std::istream& stream, Vector& vec) {
stream >> vec.x;
stream >> vec.y;
stream >> vec.z;
return stream;
}
Now you can do:
std::ifstream MyFile("My Filepath"); // pass open-mode flags such as std::ios::in as a second argument if needed
Vector vec;
MyFile >> vec;
func(&vec);
Prefer passing by reference to passing by pointer:
void func(Vector& vectors)
{ /*...*/ }
The difference here between a pointer and a reference is that a pointer can be NULL or point to some strange place in memory. A reference refers to an existing object.
As far as alignment goes, don't concern yourself. Compilers handle this automagically (at least alignment in memory).
If you are talking about alignment of binary data in a file, search for the term "serialization".
First of all, your example code is bad:
float *coords = load(file);
func(coords);
You're passing func() a pointer to a float var instead of a pointer to a Vector object.
Secondly, Vector's total size is equal to (sizeof(float) * 3), or in other words 12 bytes with 4-byte floats.
I'd consult my compiler's manual to see how to control the struct's alignment, and just for peace of mind I'd set it to, say, 16 bytes.
That way I'll know that the file, if it contains one vector, is always 16 bytes in size and I need to read only 16 bytes.
Edit:
Check MSVC9's align capabilities.
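As a rough sketch of the 16-byte idea, here is one way to request the alignment with the standard C++11 alignas specifier instead of a compiler-specific extension. The name AlignedVector is made up for illustration, and the size check assumes 4-byte floats. Note that this only fixes the in-memory size; the file is 16 bytes per vector only if the writer dumps the whole struct, padding included.

```cpp
#include <cstddef>

// Hypothetical 16-byte-aligned variant of the Vector struct discussed above.
struct alignas(16) AlignedVector
{
    float x;
    float y;
    float z;
};

// The 12 bytes of payload are padded up to a multiple of the alignment.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(alignof(AlignedVector) == 16, "alignment request was not honoured");
static_assert(sizeof(AlignedVector) == 16, "unexpected struct size");
```

If any of the assumptions fail on a given compiler, the build stops instead of silently reading the wrong number of bytes.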
Writing binary data is non portable between machines.
About the only portable thing is text, and even that cannot be relied on completely, as not all systems use the same text format (luckily most accept the 7-bit ASCII characters, and hopefully soon we will standardize on something like Unicode, he says with a smile).
If you want to write data to a file you must decide the exact format of the file. Then write code that will read the data from that format and convert it into your specific hardware's representation for that type. This format could be binary or it could be a serialized text format; it does not matter much for performance, as the disk I/O speed will probably be your limiting factor. In terms of compactness the binary format will probably be more efficient. In terms of ease of writing decoding functions on each platform, the text format is definitely easier, as a lot of it is already built into the streams.
So simple solution:
Read/Write to a serialized text format.
Also no alignment issues.
#include <algorithm>
#include <fstream>
#include <vector>
#include <iterator>
struct Vector
{
float x, y, z;
};
std::ostream& operator<<(std::ostream& stream, Vector const& data)
{
return stream << data.x << " " << data.y << " " << data.z << " ";
}
std::istream& operator>>(std::istream& stream, Vector& data)
{
return stream >> data.x >> data.y >> data.z;
}
int main()
{
// Copy an array to a file
Vector data[] = {{1.0,2.0,3.0}, {2.0,3.0,4.0}, { 3.0,4.0,5.0}};
std::ofstream file("plop");
std::copy(data, data+3, std::ostream_iterator<Vector>(file));
// Read data from a file.
std::vector<Vector> newData; // use a vector as we don't know how big the file is.
std::ifstream input("inputFile");
std::copy(std::istream_iterator<Vector>(input),
std::istream_iterator<Vector>(),
std::back_inserter(newData)
);
}
I have a binary file that was created on a unix machine. It's just a bunch of records written one after another. The record is defined something like this:
struct RECORD {
UINT32 foo;
UINT32 bar;
CHAR fooword[11];
CHAR barword[11];
UINT16 baz;
};
I am trying to figure out how I would read and interpret this data on a Windows machine. I have something like this:
fstream f;
f.open("file.bin", ios::in | ios::binary);
RECORD r;
f.read((char*)&r, sizeof(RECORD));
cout << "fooword = " << r.fooword << endl;
I get a bunch of data, but it's not the data I expect. I suspect that my problem has to do with the endian difference of the machines, so I've come to ask about that.
I understand that multiple bytes will be stored in little-endian on windows and big-endian in a unix environment, and I get that. For two bytes, 0x1234 on windows will be 0x3412 on a unix system.
Does endianness affect the byte order of the struct as a whole, or of each individual member of the struct? What approaches would I take to convert a struct created on a unix system to one that has the same data on a windows system? Any links that are more in depth than the byte order of a couple bytes would be great, too!
As well as the endian, you need to be aware of padding differences between the two platforms. Particularly if you have odd length char arrays and 16 bit values, you may well find different numbers of pad bytes between some elements.
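To see whether padding is an issue for a particular build, the layout can be checked at compile time. The following is only a sketch: Record mirrors the RECORD struct from the question using the fixed-width types from cstdint, and the constant and function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the RECORD struct from the question.
struct Record
{
    std::uint32_t foo;
    std::uint32_t bar;
    char fooword[11];
    char barword[11];
    std::uint16_t baz;
};

// Bytes the fields occupy when written back to back with no padding.
constexpr std::size_t kPackedRecordSize = 4 + 4 + 11 + 11 + 2;

// True when this build lays Record out exactly like the packed file record.
constexpr bool layout_matches_packed_file()
{
    return sizeof(Record) == kPackedRecordSize &&
           offsetof(Record, fooword) == 8 &&
           offsetof(Record, barword) == 19 &&
           offsetof(Record, baz) == 30;
}
```

Putting static_assert(layout_matches_packed_file(), "...") next to the reading code would turn a layout mismatch into a build error on the platform where it occurs.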
Edit: if the structure was written out with no packing, then it should be fairly straightforward. Something like this (untested) code should do the job:
// Functions to swap the endian of 16 and 32 bit values
inline void SwapEndian(UINT16 &val)
{
val = (val<<8) | (val>>8);
}
inline void SwapEndian(UINT32 &val)
{
val = (val<<24) | ((val<<8) & 0x00ff0000) |
((val>>8) & 0x0000ff00) | (val>>24);
}
Then, once you've loaded the struct, just swap each element:
SwapEndian(r.foo);
SwapEndian(r.bar);
SwapEndian(r.baz);
Actually, endianness is a property of the underlying hardware, not the OS.
The best solution is to convert to a standard when writing the data -- Google for "network byte order" and you should find the methods to do this.
Edit: here's the link: http://www.gnu.org/software/hello/manual/libc/Byte-Order.html
Don't read directly into a struct from a file! The packing might be different, and you have to fiddle with pragma pack or similar compiler-specific constructs. Too unreliable. A lot of programmers get away with this since their code isn't compiled on a wide range of architectures and systems, but that doesn't mean it's an OK thing to do!
A good alternative approach is to read the header, or whatever, into a buffer and parse from there, to avoid the I/O overhead of atomic operations like reading an unsigned 32-bit integer!
char buffer[32];
char* temp = buffer;
f.read(buffer, 32);
RECORD rec;
rec.foo = parse_uint32(temp); temp += 4;
rec.bar = parse_uint32(temp); temp += 4;
memcpy(&rec.fooword, temp, 11); temp += 11;
memcpy(&rec.barword, temp, 11); temp += 11;
rec.baz = parse_uint16(temp); temp += 2;
The declaration of parse_uint32 would look like this:
uint32 parse_uint32(char* buffer)
{
uint32 x;
// ...
return x;
}
This is a very simple abstraction, it doesn't cost any extra in practise to update the pointer as well:
uint32 parse_uint32(char*& buffer)
{
uint32 x;
// ...
buffer += 4;
return x;
}
The latter form allows cleaner code for parsing the buffer; the pointer is automatically updated when you parse from the input.
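The body of the parse function is left open above. One possible way to fill it in, assuming the file stores integers most significant byte first (big-endian), is sketched here. The names parse_uint32_be and parse_uint16_be are hypothetical, and the fixed-width types from cstdint stand in for uint32.

```cpp
#include <cstdint>

// Hypothetical big-endian parser: assembles the value byte by byte, so the
// result does not depend on the byte order or alignment rules of the host.
inline std::uint32_t parse_uint32_be(const char*& buffer)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(buffer);
    const std::uint32_t x = (static_cast<std::uint32_t>(p[0]) << 24) |
                            (static_cast<std::uint32_t>(p[1]) << 16) |
                            (static_cast<std::uint32_t>(p[2]) << 8) |
                            static_cast<std::uint32_t>(p[3]);
    buffer += 4; // advance past the bytes just consumed
    return x;
}

// Same idea for a 16-bit field.
inline std::uint16_t parse_uint16_be(const char*& buffer)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(buffer);
    const std::uint16_t x = static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    buffer += 2;
    return x;
}
```

Going through unsigned char avoids sign extension when a byte has its top bit set.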
Likewise, memcpy could have a helper, something like:
void parse_copy(void* dest, char*& buffer, size_t size)
{
memcpy(dest, buffer, size);
buffer += size;
}
The beauty of this kind of arrangement is that you can have namespace "little_endian" and "big_endian", then you can do this in your code:
using namespace little_endian;
// do your parsing for little_endian input stream here..
It is easy to switch endianness for the same code, though that is a rarely needed feature; file formats usually have a fixed endianness anyway.
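A minimal sketch of that namespace arrangement might look like the following. The two namespaces expose the same function name, so the calling code only changes its using-directive. The function read_length_field is a made-up caller, and unsigned char pointers are used here for simplicity.

```cpp
#include <cstdint>

namespace little_endian
{
    // Least significant byte comes first in the input.
    inline std::uint32_t parse_uint32(const unsigned char*& p)
    {
        const std::uint32_t x = static_cast<std::uint32_t>(p[0]) |
                                (static_cast<std::uint32_t>(p[1]) << 8) |
                                (static_cast<std::uint32_t>(p[2]) << 16) |
                                (static_cast<std::uint32_t>(p[3]) << 24);
        p += 4;
        return x;
    }
}

namespace big_endian
{
    // Most significant byte comes first in the input.
    inline std::uint32_t parse_uint32(const unsigned char*& p)
    {
        const std::uint32_t x = (static_cast<std::uint32_t>(p[0]) << 24) |
                                (static_cast<std::uint32_t>(p[1]) << 16) |
                                (static_cast<std::uint32_t>(p[2]) << 8) |
                                static_cast<std::uint32_t>(p[3]);
        p += 4;
        return x;
    }
}

// Hypothetical caller: the byte order is chosen by a single using-directive.
inline std::uint32_t read_length_field(const unsigned char*& p)
{
    using namespace big_endian; // switch to little_endian for such files
    return parse_uint32(p);
}
```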
DO NOT abstract this into a class with virtual methods; that would just add overhead. But feel free to if so inclined:
little_endian_reader reader(data, size);
uint32 x = reader.read_uint32();
uint32 y = reader.read_uint32();
The reader object would obviously just be a thin wrapper around pointer. The size parameter would be for error checking, if any. Not really mandatory for the interface per-se.
Notice how the choice of endianness here was made at COMPILATION TIME (since we create a little_endian_reader object), so we incur the virtual method overhead for no particularly good reason, so I wouldn't go with this approach. ;-)
At this stage there is no real reason to keep the "fileformat struct" around as-is, you can organize the data to your liking and not necessarily read it into any specific struct at all; after all, it's just data. When you read files like images, you don't really need the header around.. you should have your image container which is same for all file types, so the code to read a specific format should just read the file, interpret and reformat the data & store the payload. =)
I mean, does this look complicated?
uint32 xsize = buffer.read<uint32>();
uint32 ysize = buffer.read<uint32>();
float aspect = buffer.read<float>();
The code can look that nice, and be really low-overhead! If the endianness is the same for the file and the architecture the code is compiled for, the inner loop can look like this:
uint32 value = *reinterpret_cast<uint32*>(ptr); ptr += 4;
return value;
That might be illegal on some architectures (an unaligned access, for instance), so that optimization might be a Bad Idea; use the slower but more robust approach instead:
uint32 value = ptr[0] | (static_cast<uint32>(ptr[1]) << 8) | ...; ptr += 4;
return value;
On x86 that can compile into bswap or mov, which is reasonably low-overhead if the method is inlined; the compiler would insert a "move" node into the intermediate code, nothing else, which is fairly efficient. If alignment is a problem the full read-shift-or sequence might get generated, ouch, but still not too shabby. A compare-and-branch could allow the optimization: test the address LSBs and see whether the fast or the slow version of the parsing can be used. But this would mean a penalty for the test in every read. Might not be worth the effort.
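For the same-endianness case there is also a middle ground that is not mentioned above: copying the bytes with memcpy instead of casting the pointer. This is a swapped-in technique, sketched here with a made-up function name; it sidesteps the alignment question, and compilers commonly reduce the fixed-size copy to a plain load.

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit value stored in host byte order from a possibly unaligned
// address, then advances the pointer past it.
inline std::uint32_t read_uint32_native(const char*& ptr)
{
    std::uint32_t value;
    std::memcpy(&value, ptr, sizeof(value)); // no aligned access required
    ptr += sizeof(value);
    return value;
}
```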
Oh, right, we are reading HEADERS and stuff; I don't think that is a bottleneck in too many applications. If some codec is doing some really TIGHT inner loop, again, reading into a temporary buffer and decoding from there is well advised. Same principle: no one reads a byte at a time from a file when processing a large volume of data. Well, actually, I have seen that kind of code very often, and the usual reply to "why do you do it" is that file systems do block reads and the bytes come from memory anyway. True, but they go through a deep call stack, which is high overhead for getting a few bytes!
Still, write the parser code once and use zillion times -> epic win.
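To make the "write once" idea concrete, here is a small sketch of a non-virtual reader with a read<T>() member, in the spirit of the buffer.read<uint32>() calls above. The class name and its details are assumptions for illustration: it handles unsigned integers in little-endian order only (floats would need separate treatment) and uses the size for bounds checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Thin wrapper around a pointer range; no virtual methods involved.
class LittleEndianReader
{
public:
    LittleEndianReader(const unsigned char* data, std::size_t size)
        : cur_(data), end_(data + size) {}

    // Reads one unsigned integer, least significant byte first.
    template <typename T>
    T read()
    {
        static_assert(std::is_unsigned<T>::value,
                      "this sketch handles unsigned integers only");
        if (remaining() < sizeof(T))
            throw std::out_of_range("read past end of buffer");
        T value = 0;
        for (std::size_t i = 0; i < sizeof(T); ++i)
            value = static_cast<T>(value | (static_cast<T>(cur_[i]) << (8 * i)));
        cur_ += sizeof(T);
        return value;
    }

    // Bytes left between the cursor and the end of the buffer.
    std::size_t remaining() const
    {
        return static_cast<std::size_t>(end_ - cur_);
    }

private:
    const unsigned char* cur_;
    const unsigned char* end_;
};
```

A big-endian counterpart would differ only in the shift amounts, so the two could share the same calling code.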
Reading directly into struct from a file: DON'T DO IT FOLKS!
It affects each member independently, not the whole struct. Also, it does not affect things like arrays. For instance, it just makes the bytes in an int be stored in reverse order.
PS. That said, there could be a machine with weird endianness. What I just said applies to most used machines (x86, ARM, PowerPC, SPARC).
You have to correct the endianess of each member of more than one byte, individually. Strings do not need to be converted (fooword and barword), as they can be seen as sequences of bytes.
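A small sketch of the per-member conversion: detect the byte order of the host at run time and swap a 32-bit member only when it differs from the file. The function names are made up for illustration, and the file is assumed to be big-endian.

```cpp
#include <cstdint>
#include <cstring>

// True when the least significant byte of an integer is stored first.
inline bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x02;
}

// Reverses the four bytes of one 32-bit member.
inline std::uint32_t swap32(std::uint32_t v)
{
    return (v << 24) | ((v << 8) & 0x00ff0000u) |
           ((v >> 8) & 0x0000ff00u) | (v >> 24);
}

// Interprets a member whose bytes were copied verbatim from a big-endian file.
inline std::uint32_t from_big_endian32(std::uint32_t raw)
{
    return host_is_little_endian() ? swap32(raw) : raw;
}
```

Character arrays such as fooword and barword need no such call, since each element is a single byte.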
However, you must take care of another problem: alignment of the members in your struct. Basically, you must check if sizeof(RECORD) is the same on both unix and windows code. Compilers usually provide pragmas to define the alignment you want (for example, #pragma pack).
You also have to consider alignment differences between the two compilers. Each compiler is allowed to insert padding between members in a structure in the way that best suits the architecture. So you really need to know:
How the UNIX program writes to the file.
If it is a binary copy of the object, the exact layout of the structure.
If it is a binary copy, the endianness of the source architecture.
This is why most programs that I have seen, at least those that need to be platform neutral, serialize the data as a text stream that can be easily read by the standard iostreams.
I like to implement a ByteSwap function for each data type that needs swapping, like this:
inline u_int ByteSwap(u_int in)
{
u_int out;
char *indata = (char *)&in;
char *outdata = (char *)&out;
outdata[0] = indata[3] ;
outdata[3] = indata[0] ;
outdata[1] = indata[2] ;
outdata[2] = indata[1] ;
return out;
}
inline u_short ByteSwap(u_short in)
{
u_short out;
char *indata = (char *)&in;
char *outdata = (char *)&out;
outdata[0] = indata[1] ;
outdata[1] = indata[0] ;
return out;
}
Then I add a function to the structure that needs swapping, like this:
struct RECORD {
UINT32 foo;
UINT32 bar;
CHAR fooword[11];
CHAR barword[11];
UINT16 baz;
void SwapBytes()
{
foo = ByteSwap(foo);
bar = ByteSwap(bar);
baz = ByteSwap(baz);
}
};
Then you can modify your code that reads (or writes) the structure like this:
fstream f;
f.open("file.bin", ios::in | ios::binary);
RECORD r;
f.read((char*)&r, sizeof(RECORD));
r.SwapBytes();
cout << "fooword = " << r.fooword << endl;
To support different platforms you just need to have a platform specific implementation of each ByteSwap overload.
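One way such platform-specific implementations could look, as a sketch: use the byte-swap intrinsics that MSVC and GCC/Clang provide, and fall back to portable shifts elsewhere. The names ByteSwap32 and ByteSwap16 are chosen here for illustration and differ from the overloads above so that calls with plain literals are unambiguous.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h> // _byteswap_ulong, _byteswap_ushort
#endif

// Swaps the byte order of a 32-bit value.
inline std::uint32_t ByteSwap32(std::uint32_t in)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(in);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(in);
#else
    // Portable fallback using shifts and masks.
    return (in << 24) | ((in << 8) & 0x00ff0000u) |
           ((in >> 8) & 0x0000ff00u) | (in >> 24);
#endif
}

// Swaps the byte order of a 16-bit value.
inline std::uint16_t ByteSwap16(std::uint16_t in)
{
#if defined(_MSC_VER)
    return _byteswap_ushort(in);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap16(in);
#else
    return static_cast<std::uint16_t>((in << 8) | (in >> 8));
#endif
}
```

Note that these swap unconditionally; whether a swap is needed at all still depends on the byte order of the file versus the host.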
Something like this should work:
#include <algorithm>
#include <fstream>
using namespace std;
struct RECORD {
UINT32 foo;
UINT32 bar;
CHAR fooword[11];
CHAR barword[11];
UINT16 baz;
};
// Reverse the bytes of one member in place.
void ReverseBytes( void *start, int size )
{
char *beg = static_cast<char *>( start );
char *end = beg + size;
std::reverse( beg, end );
}
int main() {
fstream f;
f.open( "file.bin", ios::in | ios::binary );
// for each entry {
RECORD r;
f.read( (char *)&r, sizeof( RECORD ) );
// Pass the address of each multi-byte member; the char arrays stay as they are.
ReverseBytes( &r.foo, sizeof( UINT32 ) );
ReverseBytes( &r.bar, sizeof( UINT32 ) );
ReverseBytes( &r.baz, sizeof( UINT16 ) );
// }
return 0;
}