deserialize data on the fly from stream or file - c++

It is fairly easy to serialize and deserialize data structures with common libraries like boost::serialization.
But there is also a common case where I simply do something like this (pseudo code):
// receiver (pseudo code)
NetworkInputStreamSerializer stream;
while (true) // read new data objects from stream
{
    stream & data;
}
As I understand it, the data package must already have been received completely from the network socket. If only part of the object can be read, the deserialization will fail. Especially with large data sets, TCP will fragment the data.
Is there a generic way to deal with this problem? I have not found any hints about it in the boost::serialization docs.
As this problem is generic to any kind of streamed data, not only for TCP-based streaming but also for files where one program sends and another receives the data, there must be a general solution, but I could not find anything about it.
My question is not specific to Boost. I use it only as an example.
EDIT:
Maybe some more explanation of what I mean by "fragmentation":
Any kind of data, independent of its size in serialized form, can be split into several packets while being transferred via TCP or while being written to any kind of file. There is no "atomic" write and read operation supported by the OS or by the serialization libraries I know.
So when reading an int from a human-readable format like XML or JSON, I can run into the problem that I read "11" instead of "112" if the "2" is not yet in the stream or file at the moment I read from it. Writing the length of the following content in a human-readable format is therefore not a solution either, because the size information itself can be corrupted if the read happens at a moment when the length string is not yet complete.

[Note: I get a sense from your question that you want a better alternative to boost::serialization for your specific case. If this doesn't answer your question, let me know and I will delete it.]
I recommend Google Protocol Buffers, based on my own practical experience. Below are a few advantages:
It can be used on the wire (TCP etc.)
Simple grammar to write the .proto file for composing your own messages
Cross-platform & available for multiple languages
Very efficient compared to JSON & XML
Generates header & source files with handy getters, setters, serialize, deserialize & debugging functions
Easy to serialize & deserialize -- store to & retrieve from a file
The last point is a bit tricky. When storing to a file, you may have to write the length of the message first; when retrieving, you first read that length and then use the read() method to read exactly that number of bytes.
You may want to use the same trick when sending over TCP. The length can be passed in the first few bytes. Once the length is known, you can keep collecting the remaining fragments until the whole message has arrived.
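The length-prefix trick can be sketched roughly as follows. This is only a minimal illustration, not the Protocol Buffers API: encode_frame, FrameDecoder and the 4-byte big-endian header are names and choices made up for this example. The length field has a fixed width, so the receiver can always tell whether the length itself has arrived completely, which is the concern raised in the question about human-readable lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption for this sketch: the length is a fixed-size, 4-byte big-endian
// integer, so the receiver always knows how many bytes the length field has.
constexpr std::size_t kHeaderSize = 4;

// Hypothetical sender-side helper: prefixes a payload with its length.
std::string encode_frame(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu);  // must fit into the 4-byte header
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string frame;
    frame.push_back(static_cast<char>((n >> 24) & 0xFF));
    frame.push_back(static_cast<char>((n >> 16) & 0xFF));
    frame.push_back(static_cast<char>((n >> 8) & 0xFF));
    frame.push_back(static_cast<char>(n & 0xFF));
    frame += payload;
    return frame;
}

// Hypothetical receiver-side buffer. Feed it whatever the socket or file
// read returned; it hands back only payloads that have arrived completely
// and keeps any trailing partial data for the next call.
class FrameDecoder {
public:
    std::vector<std::string> feed(const char* data, std::size_t size) {
        buffer_.append(data, size);
        std::vector<std::string> complete;
        for (;;) {
            if (buffer_.size() < kHeaderSize) break;  // length field incomplete
            std::uint32_t n = 0;
            for (std::size_t i = 0; i < kHeaderSize; ++i) {
                n = (n << 8) | static_cast<unsigned char>(buffer_[i]);
            }
            if (buffer_.size() - kHeaderSize < n) break;  // payload incomplete
            complete.push_back(buffer_.substr(kHeaderSize, n));
            buffer_.erase(0, kHeaderSize + n);
        }
        return complete;
    }

private:
    std::string buffer_;
};
```

Each complete payload returned by feed() could then be handed to the deserializer of your choice, which never sees a half-received object.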

Related

Is there a way to populate an ellipses parameter programmatically?

I'm going to be reading data from a file of my own making. This file will contain a printf format string and the parameters passed to it. I've already written the code that generates it.
Now I want to do the reverse: read the format string and the parameters and pass them back to the printf functions. Can I somehow generate the appropriate call stack, or am I going to have to reparse the format string and send it to printf() piecemeal?
Edit
I know the risks with the printf functions. I understand that there are security vulnerabilities. These issues are non-issues as:
This is to be used in a debugging context only and will not be handled outside of that scope.
The executable that reads the file is run by the person who made the file.
The data file will be read by an executable that simply expands the file and is not accessible to third parties.
It cannot write anything to memory (%n is not valid).
Use Case
To compress a stream with minimal CPU overhead by tracking constantly repeating strings and replacing them with enumerations. All other data is saved as binary data, thus requiring only minimal processing instead of having to convert it to a large string every time.

How to unit test a generator/serialization method?

I want to write unit tests for a serialization method. By serialization method I mean a method that outputs a set of data in a special format.
For example, a method that outputs data in XML format. (I write in C++ but it is the same in every language.)
#include <string>

class Generator
{
public:
    std::string serialize();
};

// unit test (pseudo-code)
Generator gen;
// set some data in gen
std::string actual = gen.serialize();
std::string expected = "<xml>...</xml>";
ASSERT_EQUAL(expected, actual);
The problem with this is that the unit test depends heavily on unimportant details, like the formatting of the XML (line breaks) or the order of XML attributes.
While with XML the previous method will work, it will not work with generators that output binary data.
So, what is a robust way to test serialization methods?
The ideas I have are the following, but all have serious drawbacks.
using external libraries to parse the data (for proprietary formats, none may exist).
always write pairs of serialization/deserialization and test them in combination (bugs in both methods might remain undiscovered).
store the serialized data in external files and compare against them in the test (the unit test is difficult to read and maintain).
As you are asking about unit testing, I assume that you know the intended behaviour of the serializer in all its details. That is, you know where you want line breaks, indentation etc. to be inserted.
The problem now is that in every single test case only a subset of these details is relevant. In other words, in some tests you want to check the indentation, and in others you just want to be sure that a number is inserted in the right way.
In addition to the options you have provided I recommend another approach: Use regular expression matching instead of string comparison. With the help of regular expressions you can reduce the serialized string to the essential parts which are of interest in the respective test. To check, for example, if the result string contains a certain number, say, 42, you could match it against ^[^0-9]*42[^0-9]*$. Then, the enclosing XML would be ignored in this particular test. This test would then be robust against a large number of changes in the serialization.
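A minimal sketch of such a regex-based check, using std::regex from the standard library. The function names are made up for this example, and the XML literal is only a stand-in for the result of gen.serialize():

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the serialized text contains the number 42
// and no other digits anywhere. The pattern is the one from the text above;
// std::regex_match requires the whole string to match it.
bool contains_only_number_42(const std::string& serialized) {
    static const std::regex pattern("^[^0-9]*42[^0-9]*$");
    return std::regex_match(serialized, pattern);
}

// Sketch of a test case in the spirit of the question's pseudo-code.
void test_number_is_serialized() {
    // Stand-in for: std::string actual = gen.serialize();
    const std::string actual = "<xml><value>42</value></xml>";
    assert(contains_only_number_42(actual));
}
```

Note that this particular pattern rejects any other digit in the output, so an XML declaration such as version="1.0" would make it fail. In that case the pattern would have to be narrowed to the element of interest.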
With this approach you avoid the dependency on external parsing libraries (well, you depend on the regular expression library, but in many languages today that is even part of the standard library). You can also test aspects which the serialize/deserialize round trip cannot test (indentation), and your tests run fast and are not OS specific (no dependency on the file system).
This is more like a long comment with my first thoughts on the topic.
I think you have to look at two different scenarios. Your data <-> serialized data relation could be either 1:1 or 1:n.
XML would be a 1:n relation, where your XML code has quite a bit of freedom but is still deserialized to the same data again. In this case it seems to me that developing and testing serialization and deserialization in combination is the way to go. If external libraries are available as well, use them of course. If no external libraries are available, then as long as serialization and deserialization yield the same result, you will probably not have "bugs", but "features"...
Testing the deserialization with stored external data files also makes sense, but this does not apply to the serialization, imho.
For a 1:1 relation, such as putting the data into a certain binary format, you should go for the stored data in external files. Always use external libraries as well if they exist, of course.
I would suggest using all three of those approaches together, where applicable. You should not rely on a single one of them.

Is it inappropriate to use a std::string to store binary data?

I was surprised to see in this question that someone modified a working snippet just because, as the author of the second answer says:
"it didn't seem appropriate to me that I should work with binary data stored within std::string object"
Is there a reason why I should not do so?
For binary data, in my opinion, the best option is std::vector<unsigned char>.
Using std::string technically works, but it sends the wrong message to the user: that the data being handled is text.
On the other hand, being able to accept any byte in a string is important because sometimes you know the content is text, but in an unknown encoding. Forcing std::string to contain only valid and decoded text would be a big limitation for real-world use.
This kind of limitation is one of the few things I don't like about QString: this limitation makes it impossible for example to use a file selection dialog to open a file if the filename has a "wrong" (unexpected) encoding or if the encoding is actually invalid (it contains mistakes).
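To illustrate the std::string versus std::vector<unsigned char> point, here is a small sketch. ByteBuffer and the helper functions are names invented for this example. It shows that std::string does store arbitrary bytes, including zero bytes, but that text-oriented APIs will misread them, which is where the "wrong message" becomes an actual problem.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

using ByteBuffer = std::vector<unsigned char>;  // alias made up for this sketch

// Some arbitrary bytes, including an embedded zero and a value above 127.
ByteBuffer sample_bytes() { return ByteBuffer{0x89, 0x50, 0x00, 0xFF, 0x0A}; }

// std::string can hold the same bytes when built from an iterator range.
std::string to_string_storage(const ByteBuffer& bytes) {
    return std::string(bytes.begin(), bytes.end());
}

ByteBuffer to_byte_buffer(const std::string& s) {
    return ByteBuffer(s.begin(), s.end());
}

// What a C-style, text-oriented API sees: everything up to the first zero byte.
std::size_t length_seen_by_c_api(const std::string& s) {
    return std::strlen(s.c_str());
}

void demonstrate_binary_in_string() {
    const ByteBuffer raw = sample_bytes();
    const std::string s = to_string_storage(raw);
    assert(s.size() == raw.size());        // all five bytes are stored
    assert(length_seen_by_c_api(s) == 2);  // text APIs stop at the zero byte
    assert(to_byte_buffer(s) == raw);      // the round trip loses nothing
}
```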

Handling unget and putback with file streams

I have implemented a std::basic_streambuf-derived wrapper around std::basic_filebuf which converts between encodings. Within this wrapper I use a single buffer for both input and output. The buffering technique comes from this article.
Now, the problem I can't figure out is this. My internal buffer is filled on calls to underflow. According to the article, when switching from input to output, the buffer should be put in a state of limbo. To do this, I need to unget the unread data in the buffer. From reading the docs and the source code, unget and putback are not guaranteed to succeed. This would leave me with an invalid tellg pointer at the next input operation.
I'm not asking for somebody to write this for me, but I am asking advice as to how to manage ungetting data from std::basic_filebuf in a way that will not fail.
I think the only sure way is to calculate the bytes that would be written to the file and adjust the offset accordingly. It's not as simple as it sounds, though. The filebuf may have an associated locale that is unknown at compile time. I tried getting the facet and passing the data through its out member, but it doesn't work. The previously read data may have a non-default mbstate_t value, and some codecvt objects also write a BOM.
Basically, it's almost impossible to calculate the 'on file' length of a section of file data after it has passed through a codecvt.
I have tagged this question with 'c' since 'c'-file streams also work with buffers and also use get and put pointers. std::basic_filebuf is just a wrapper around a 'c'-file stream. Answers in 'c' are also applicable to this problem.
Does anybody have any suggestions as to how to implement unlimited unget on file streams?

What is a stream in C++?

I have been hearing about streams, more specifically file streams.
So what are they?
Is it something that has a location in the memory?
Is it something that contains data?
Is it just a connection between a file and an object?
The term stream is an abstraction of a construct that allows you to send or receive an unknown number of bytes. The metaphor is a stream of water. You take the data as it comes, or send it as needed. Contrast this to an array, for example, which has a fixed, known length.
Examples where streams are used include reading and writing to files, receiving or sending data across an external connection. However the term stream is generic and says nothing about the specific implementation.
IOStreams are a front-end interface (std::istream, std::ostream) used to define input and output functions. The streams also store formatting options, e.g., the base to use for integer output and hold a std::locale object for all kind of customization. Their most important component is a pointer to a std::streambuf which defines how to access a sequence of characters, e.g., a file, a string, an area on the screen, etc. Specifically for files and strings special stream buffers are provided and classes derived from the stream base classes are provided for easier creation. Describing the entire facilities of the IOStreams library can pretty much fill an entire book: In C++ 2003 about half the library section was devoted to stream related functionality.
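A small sketch of what that separation means in practice: a function written against std::istream works with any stream buffer behind it. The function names are invented for this example; std::istringstream, std::stringbuf and the std::istream constructor taking a std::streambuf pointer are standard library facilities.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts whitespace-separated words. It only knows about std::istream,
// not about where the characters actually come from.
std::size_t count_words(std::istream& in) {
    std::size_t n = 0;
    std::string word;
    while (in >> word) ++n;
    return n;
}

void demonstrate_stream_buffers() {
    // A convenience class that bundles a stream with a string buffer.
    std::istringstream from_string("one two three");
    assert(count_words(from_string) == 3);

    // The same thing spelled out: a plain std::istream holding a pointer
    // to a stream buffer. A std::filebuf could be plugged in the same way.
    std::stringbuf buffer("four five");
    std::istream generic(&buffer);
    assert(count_words(generic) == 2);
}
```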
A stream is a linear queue that connects a file to the program and maintains the flow of data in both directions. Here the source can be any file or I/O device: hard disk, CD/DVD etc.
Basically, streams are of two types: 1. text streams, 2. binary streams.
Text stream: a sequence of characters arranged in lines, each line terminated by a newline (Unix).
Binary stream: data as it is coded internally in the computer's main memory, without any modification.
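The difference between the two can be sketched with the same integer written both ways. The function names are made up for this example, and a std::ostringstream stands in for a file:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Text representation: the value is formatted as decimal characters.
std::string as_text(std::int32_t value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Binary representation: the bytes are copied exactly as they sit in memory.
// The byte order therefore depends on the machine.
std::string as_binary(std::int32_t value) {
    std::ostringstream out;
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    return out.str();
}

// Reads a value back from its in-memory byte representation.
std::int32_t from_binary(const std::string& bytes) {
    assert(bytes.size() == sizeof(std::int32_t));
    std::int32_t value = 0;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```

Here as_text(258) produces the three characters "258", while as_binary(258) always produces sizeof(std::int32_t) bytes whose order depends on the platform.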
The file system is designed to work with a wide variety of devices, including terminals, disk drives, tape drives etc. Even though each device is different, the file system transforms each into a logical device called a stream. Streams are device independent, so the same function can be used to write a disk file and a tape file. In more technical terms, a stream provides an abstraction between the programmer and the actual device being used.