I started out just reading/writing 8-bit integers to files using chars. It was not very long before I realized that I needed to work with more than just 256 possible values. I did some research on how to read/write 16-bit integers to files and became aware of the concept of big and little endian. I did even more research, found a few different ways to deal with endianness, and also learned some ways to write endianness-independent code. My overall conclusion was that I have to first check whether the system I am using is big- or little-endian, convert the values accordingly, and then work with them.
The one thing I have not been able to find is the best/most common way to deal with endianness when reading/writing to files in C++ (no networking). So how should I go about doing this? To help clarify, I am asking for the best way to read/write 16/32-bit integers to files between big and little endian systems. Because I am concerned about the endianness between different systems, I would also like a cross-platform solution.
The most common way is simply to pass your in-memory values through htons() or htonl() before writing them to the file, and also pass the read data through ntohs() or ntohl() after reading it back from the file. (htons()/ntohs() handle 16-bit values, htonl()/ntohl() handle 32-bit values)
When compiled for a big-endian CPU, these functions are no-ops (they just return the value you passed in to them verbatim), so the values will get written to the file in big-endian format. When compiled for a little-endian CPU, these functions endian-swap the passed-in value and return the swapped version, so again the values will get written to the file in big-endian format.
That way the values in the file are always stored in big-endian format, and they always get converted to/from the appropriate (CPU-native) format when being transferred to/from memory. This is the simplest way to do it (since you don't have to write or debug any conditional logic), and the most common (these functions are implemented and available on just about all platforms).
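For illustration, a minimal sketch of that round trip for a 32-bit value might look like the following (the file name and variable names are just placeholders, and error handling is omitted):

#include <arpa/inet.h>   // htonl()/ntohl(); on Windows these live in <winsock2.h>
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t value = 123456789;

    // Write: convert from host byte order to big-endian ("network") order first.
    uint32_t be = htonl(value);
    FILE* out = std::fopen("data.bin", "wb");
    std::fwrite(&be, sizeof(be), 1, out);
    std::fclose(out);

    // Read: pull the big-endian bytes back in, then convert to host order.
    uint32_t stored = 0;
    FILE* in = std::fopen("data.bin", "rb");
    std::fread(&stored, sizeof(stored), 1, in);
    std::fclose(in);
    value = ntohl(stored);

    return 0;
}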
In practice, a good habit is to avoid binary data for exchanging data between computers and to prefer text files and textual protocols instead. You could use textual formats like JSON, YAML, XML, ... (or sometimes invent your own). There are many C++ libraries for them, e.g. jsoncpp.
Textual data is indeed more verbose (it takes more disk space) and slightly slower to parse (but disk I/O is often the bottleneck, not the CPU time "wasted" in parsing or encoding formats like JSON), but it is much easier to work with.
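As a rough sketch (the key names and file name below are made up), a jsoncpp round trip could look like this:

#include <json/json.h>   // jsoncpp
#include <fstream>

int main() {
    // Build a value and write it out as text.
    Json::Value root;
    root["width"] = 640;
    root["height"] = 480;
    root["title"] = "example";
    std::ofstream out("config.json");
    out << root;
    out.close();

    // Read it back; the textual file looks the same on any platform.
    Json::Value loaded;
    std::ifstream in("config.json");
    in >> loaded;
    int width = loaded["width"].asInt();
    (void)width;
    return 0;
}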
Read also about serialization. You'll find lots of libraries doing that (using some "common", well-defined data format such as XDR or ASN.1). Many file formats contain a header describing the concrete encoding. The elf(5) format is a good example of that.
Be aware that most of the time the data is more valuable (economically) than the software working on it. So it is very important to document very well how your data is organized in files.
Consider also using databases. Sometimes simply using sqlite with tables containing JSON is very effective.
PS. Without an actual real world case, your question is too broad, and has no meaningful universal answer. There is no single best way!
Basile, I agree that there is no universal answer.
In my world, embedded real-time systems, using a text representation is blasphemy. Textual representations such as JSON are at least two orders of magnitude slower than binary representations. That may be fine for the web, but it makes a difference when you have to process several kilobytes of data per second (to handle voice, for instance) across DSPs and GPPs.
For a more in-depth discussion of this topic, check out chapter 7 of the ZeroMQ book.
I am reading code for a project written by others. The main task of the project is to read the contents of a large structured text file (.txt) with 8 columns into a KnowledgeBase object, which has a number of methods and variables. The KnowledgeBase object is then output into a binary file. For example, the KnowledgeBase class has at least these two variables:
map<string, pair<string, string>> key_info
vector<ObjectInfo> objects
...
These variables are easy to understand when I track the code with gdb. Then it seems the code converts such vectors and maps into binary forms, and the two variables above have corresponding binary counterparts:
BinaryKeyInfo *bkeys
BinaryObjectInfo *bObjects
Later on when outputting to binary file, it has such code:
fwrite((char*)(&wcount), sizeof(int32_t), 1, output);      // write the element count
fwrite((char*)bkeys, sizeof(KeyInfo_t), wcount, output);    // write wcount raw KeyInfo_t records
The converting code from the original KnowledgeBase to binary is complicated. My question is: what is the main purpose of this conversion? Is it for faster loading of the binary file into memory than the plain text file? The plain text file is large. I learnt that object serialization is primarily for transmitting objects over the network, but I don't think that is the purpose here. It seems more like a way to speed up data loading and save memory. Could that be part of object serialization in C++?
Is the main purpose of object serialization in C++ for faster object loading?
No. The most important purpose of serialisation is to transform the state of the program into a format that can be stored on the filesystem, or that can be communicated across a network, and that can be de-serialised back. Often, the purpose of either is for another program to do the de-serialisation. Sometimes the de-serialiser is another instance of the same program.
The speed of de-serialisation is one metric that can be used to gauge whether one particular serialisation format is a good one. The ability to quickly undo what you have done is not the reason why you do it in the first place.
what's the benefit of converting them into binary vectors or maps?
As I mention above, the benefit of serialisation is the ability to store the serialised data on the filesystem, or to send it over a network.
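To make the fwrite lines in the question concrete: once the data has been serialised like that, loading it back is just a matter of reading the count and then the whole array in one go. This is only a sketch of the usual pattern, assuming the same KeyInfo_t layout on both sides (same compiler, padding, and endianness); the struct fields here are placeholders, not the project's real ones.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder for the project's real fixed-size record type.
struct KeyInfo_t { char key[32]; char value[32]; };

std::vector<KeyInfo_t> loadKeys(std::FILE* input) {
    int32_t wcount = 0;
    std::fread(&wcount, sizeof(int32_t), 1, input);                   // read the element count
    std::vector<KeyInfo_t> keys(static_cast<std::size_t>(wcount));
    std::fread(keys.data(), sizeof(KeyInfo_t), keys.size(), input);   // read the raw records
    return keys;
}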
what's the benefit of plain text files vs. binary files?
Pros of text serialisation format:
Humans are able to read and write plain text. Humans generally are not able to read or write binary files.
It's generally easier to implement a plain text format de-/serialiser in a way that works across differing computers than it is to implement a binary format de-/serialiser that achieves the same.
Pros of binary serialisation format:
Typically faster and uses less storage and bandwidth.
Can be easier to implement if there is no need for communication between differing systems. This is typically true only in very simple cases. (Furthermore, there usually is a need for cross-system compatibility, even if that need hasn't been realised yet.)
We aim to use HDF5 for our data format. HDF5 has been selected because it is a hierarchical, filesystem-like, cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which are not made up by large amounts of data), considering also file versioning issues and the efforts to build the library. Parameters inside the HDF5 could be stored as either (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have, for instance, a variable named Polygon holding the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon stored as a [2 x 3] matrix.
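For concreteness (just as a sketch, and with error checking omitted), with the plain HDF5 C API the two options might look roughly like this:

#include <hdf5.h>
#include <string.h>

int main(void) {
    hid_t file = H5Fcreate("params.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Option A: the polygon as a human-readable string attribute. */
    const char* text = "(1, 2); (3, 4); (4, 1)";
    hid_t str_type = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_type, strlen(text) + 1);
    hid_t scalar = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(file, "Polygon", str_type, scalar, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, str_type, text);

    /* Option B: the polygon as a 2 x 3 numeric dataset. */
    double vertices[2][3] = { {1.0, 3.0, 4.0}, {2.0, 4.0, 1.0} };
    hsize_t dims[2] = {2, 3};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "Polygon", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, vertices);

    H5Dclose(dset); H5Sclose(space); H5Aclose(attr);
    H5Sclose(scalar); H5Tclose(str_type); H5Fclose(file);
    return 0;
}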
We have some ideas, but it would be great to have input from people who have already worked with something similar. More precisely, could you please list the pros/cons of A and B and also say under what circumstances each would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an HDF5 library, I assume serializing and parsing take roughly equivalent human effort either way.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and to maintenance costs. You'll be able to sed, grep, and even edit the data in Excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This will mean either engineering effort or having the same libraries available on all platforms. Plain text makes this easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you'll cross that threshold, start with plain text and write the interface to your persistence layer in such a way that it will be easy to switch later (see the sketch below). This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
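A minimal sketch of that idea (all names here are made up): hide the storage format behind one small interface, so swapping plain text for binary later only touches the implementations.

#include <string>
#include <vector>

struct Record {                 // placeholder for your actual data
    std::string name;
    std::vector<double> values;
};

class PersistenceLayer {
public:
    virtual ~PersistenceLayer() = default;
    virtual void save(const std::vector<Record>& data, const std::string& path) = 0;
    virtual std::vector<Record> load(const std::string& path) = 0;
};

// Start with a text-backed implementation...
class TextStore : public PersistenceLayer { /* ... */ };
// ...and add a binary- or HDF5-backed one only when the speed actually hurts.
class BinaryStore : public PersistenceLayer { /* ... */ };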
If you expect to edit the file by hand often (like XMLs or JSONs), then go with a human-readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?
I need a program that reads the contents of a file and writes into another file only the characters that are valid UTF-8. The problem is that the file may come in any encoding and the contents of the file may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the w_char functions, managing them as integers, and discarding all the values that are not in some range. Is this the optimal solution?
Also what's the most efficient way to read and write in C/C++?
EDIT: The problem is not the I/O operations; that part of the question is intended as extra help to make an even quicker program, but the real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain byte patterns do NOT occur, such as C0, C1, and F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to code something that does a simple fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut/paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
A C program will rapidly become I/O bound, so the suggestions about I/O apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat in performance if you do it right. UTF-8 is nice in that you can find character boundaries even if you start in the middle of the file, so this lends itself nicely to parallel algorithms.
If you build your own, watch for a BOM (byte order mark) that might appear at the start of some files.
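As a minimal sketch of that direct byte inspection (function names are made up; the range checks follow the table in RFC 3629, including the C0/C1/F5-FF cases mentioned above):

#include <cstddef>

// Returns the length (1..4) of a valid UTF-8 sequence starting at p,
// or 0 if the bytes at p do not form a valid sequence. `end` guards
// against reading past the buffer.
static std::size_t valid_utf8_sequence(const unsigned char* p, const unsigned char* end)
{
    unsigned char b0 = p[0];
    if (b0 < 0x80) return 1;                                  // ASCII
    if (b0 < 0xC2) return 0;                                  // stray continuation byte, or overlong C0/C1
    if (b0 < 0xE0) {                                          // 2-byte sequence
        if (end - p < 2 || (p[1] & 0xC0) != 0x80) return 0;
        return 2;
    }
    if (b0 < 0xF0) {                                          // 3-byte sequence
        if (end - p < 3 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return 0;
        if (b0 == 0xE0 && p[1] < 0xA0) return 0;              // overlong
        if (b0 == 0xED && p[1] >= 0xA0) return 0;             // UTF-16 surrogate range
        return 3;
    }
    if (b0 < 0xF5) {                                          // 4-byte sequence
        if (end - p < 4 || (p[1] & 0xC0) != 0x80 ||
            (p[2] & 0xC0) != 0x80 || (p[3] & 0xC0) != 0x80) return 0;
        if (b0 == 0xF0 && p[1] < 0x90) return 0;              // overlong
        if (b0 == 0xF4 && p[1] >= 0x90) return 0;             // beyond U+10FFFF
        return 4;
    }
    return 0;                                                 // F5..FF never appear in UTF-8
}

// Copies only the valid sequences from `in` to `out`, dropping invalid bytes
// one at a time. Returns the number of bytes written.
std::size_t clean_utf8(const unsigned char* in, std::size_t len, unsigned char* out)
{
    const unsigned char* p = in;
    const unsigned char* end = in + len;
    unsigned char* o = out;
    while (p < end) {
        std::size_t n = valid_utf8_sequence(p, end);
        if (n == 0) { ++p; continue; }
        for (std::size_t i = 0; i < n; ++i) *o++ = *p++;
    }
    return static_cast<std::size_t>(o - out);
}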
Links
http://en.wikipedia.org/wiki/UTF-8 a nice, clear overview with a table showing valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the rfc describing utf8
http://www.unicode.org/ homepage of the Unicode Consortium.
Your best bet, in my opinion, is to parallelize. If you can parallelize the cleaning and clean many chunks of content simultaneously, then the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something I know from the Microsoft world; I'm not sure whether it exists on Unix etc., but it likely does.
Basically you open the file, point the OS at it, and it maps the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file, you could map perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should, I would think, be the fastest way to perform big I/O, but you would need to test it to say for sure.
HTH, good luck!
Unix/Linux and any other POSIX-compliant OS support memory mapping (mmap) too.
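A minimal sketch of the POSIX side (the file name is a placeholder and error handling is abbreviated): map a read-only view of the input and walk it byte by byte.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only. For terabyte inputs you would map a window
    // at a time by passing a non-zero, page-aligned offset as the last argument.
    std::size_t length = static_cast<std::size_t>(st.st_size);
    unsigned char* data = static_cast<unsigned char*>(
        mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0));

    // ... scan data[0 .. length) here ...

    munmap(data, length);
    close(fd);
    return 0;
}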
I want to do the following in C++:
create a command object
serialize it
(send it to another computer)
deserialize
execute
Two cases:
sender and receiver are both win 7 computers
sender is *nix and receiver is win 7
I found a tutorial on serialization: http://www.functionx.com/cpp/articles/serialization.htm. Is this the way to go? In Python I could do:
def setAndPackCommand(self, object):
    outFile = StringIO.StringIO()
    pickledC = pickle.dump(object, outFile)  # this packs object to outFile
    stringToSend = outFile.getvalue()        # decoding to string

def unpackAndExecute(self, stringToReceive):
    inFile = StringIO.StringIO()
    inFile.write(stringToReceive)
    inFile.seek(0, 0)
    receivedC = pickle.load(inFile)
    receivedC.execute()
In this code the main points are pickle.dump and pickle.load. What are their C++ counterparts? Wikipedia says that C++ does not support serialization? What is the link above about, then?
What does binary serialization mean? Is memory just dumped to disk, so that deserialization needs exactly the same kind of computer (no cross-platform transfers)?
br,
Juha
I would also recommend using a stable library like boost.serialization for serializing data.
If you are new to serialization, it means transforming objects into a data representation suitable for transmission or storage and rebuilding them from that representation. The difficulty is not really big with so-called PODs (Plain Old Data objects): you can transmit the buffer as data and cast it back after the transfer, taking care of data alignment and byte ordering (endianness). It becomes more complicated if objects reference other objects, and then it really makes sense to use a well-designed library. Boost's serialization also supports versioning, so you can update your format and keep forward- and backward-compatible readers and writers (with some effort, of course).
Here is a good introduction.
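As a rough sketch of the pattern with Boost.Serialization (the Command type and its fields below are made up; a text archive keeps the data portable between your Windows and *nix cases):

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <sstream>
#include <string>

struct Command {
    std::string name;
    int argument;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & name;
        ar & argument;
    }
};

int main() {
    Command cmd{"move", 42};

    // Serialize to a string; this is what you would send over the network.
    std::ostringstream os;
    {
        boost::archive::text_oarchive oa(os);
        oa << cmd;
    }
    std::string wire = os.str();

    // Deserialize on the receiving side.
    Command received;
    std::istringstream is(wire);
    {
        boost::archive::text_iarchive ia(is);
        ia >> received;
    }
    return 0;
}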
To briefly answer your questions: Wikipedia is right, C++ doesn't natively support serialisation. That doesn't mean you can't roll your own solution, though, as demonstrated in the article you linked.
Binary serialisation refers to writing an object to a binary file format. Contrast this with, for example, XML serialisation, where the object is written to an XML-based format: in the former, you get a binary file where (for example) an int consists of 4 bytes of raw binary data. In the latter, you might get an int tag with a name attribute and its contents being the text value of the integer, such as <int name="myInt">12345</int>.
The big advantage of binary serialisation is (in most cases) that it is very compact, and very simple to convert to/from the object on the target platform. The downside is that, as you've suggested, it tends to be very machine-specific and so not very useful in your situation. Issues such as byte ordering and field alignment tend to vary from platform to platform, and so a binary format where those are not taken into account will most likely not be portable. That said, you can add code to take into account these differences, but it does increase the complexity of the solution.
The alternatives (text-based serialisation, XML serialisation, etc) have the advantage of generally being more cross platform and easier to edit by hand, but are generally less compact than the binary approach.
For the reasons outlined above, I'd avoid binary serialisation in your case (if possible) and go with a text-based approach. An article I like is this one, which describes (among other things) an approach to serialisation that allows you to easily specify any kind of serialisation method you prefer to use.
Of course a C++ program can do serialization, just not out of the box. Check out the Boost.Serialization library or Google's protocol buffers. The latter implements fast and portable cross-platform (binary) serialization, but it requires the use of a code generator.
(The tutorial you link to demonstrates a very simplistic, unportable approach to serialization. It also demonstrates very well how not to handle strings in C++.)
I want to store a graph of different objects for a game. Their classes may or may not be related, and they may or may not contain vectors of simple structures.
I want the parsing operation to be fast; the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean making objects serialize themselves, which is effective, but for that I will need to write a different serialization method for each kind of object.
By binary parsing/composing I mean creating a new tree of parsers/composers that holds and reads data for these objects, and passing this around to have my objects push/pull their data.
I could also use JSON, but it can be pretty slow to read, and it is not very size-efficient when it comes to pretty big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one vote for binary. Personally, I'd go with text for everything except images (and other data that is "naturally" binary). Then store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out protocol buffers from Google or Thrift from Apache. Although billed as a way to write wire protocols easily, each is basically an object serialization mechanism that can create bindings in a dozen languages, has an efficient binary representation, supports easy versioning, performs fast, and is well supported.
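As a rough sketch of the protocol buffers route (the Command message and its fields are made up; the C++ class below would be generated from a .proto file by protoc):

// command.proto, compiled with: protoc --cpp_out=. command.proto
//   message Command {
//     optional string name     = 1;
//     optional int32  argument = 2;
//   }
#include "command.pb.h"   // generated header
#include <string>

int main() {
    Command cmd;
    cmd.set_name("spawn");
    cmd.set_argument(42);

    std::string wire;
    cmd.SerializeToString(&wire);    // compact, portable binary encoding

    Command parsed;
    parsed.ParseFromString(wire);    // readable on any platform/language sharing the .proto
    return 0;
}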
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.