I have been hearing about streams, more specifically file streams.
So what are they?
Is it something that has a location in memory?
Is it something that contains data?
Is it just a connection between a file and an object?
The term stream is an abstraction of a construct that allows you to send or receive an unknown number of bytes. The metaphor is a stream of water. You take the data as it comes, or send it as needed. Contrast this to an array, for example, which has a fixed, known length.
Examples where streams are used include reading and writing files, and receiving or sending data across an external connection. However, the term stream is generic and says nothing about the specific implementation.
IOStreams are a front-end interface (std::istream, std::ostream) used to define input and output functions. The streams also store formatting options, e.g., the base to use for integer output, and hold a std::locale object for all kinds of customization. Their most important component is a pointer to a std::streambuf, which defines how to access a sequence of characters, e.g., a file, a string, an area on the screen, etc. For files and strings, special stream buffers are provided, and classes derived from the stream base classes exist to make their creation easier. Describing the entire facilities of the IOStreams library could pretty much fill a book: in the C++ 2003 standard, about half the library section was devoted to stream-related functionality.
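As a rough illustration (my own sketch, not from the standard), the same function can write to the console, a file, or an in-memory string, because the stream front-end only talks to whichever std::streambuf is plugged in behind it:

#include <fstream>
#include <iostream>
#include <sstream>

// The function only depends on the std::ostream front-end; the actual
// destination is decided by the streambuf behind it.
void report(std::ostream& out, int value)
{
    out << "value = " << std::hex << value << '\n';  // formatting state lives in the stream
}

int main()
{
    report(std::cout, 255);              // console

    std::ofstream file("report.txt");    // hypothetical file name
    report(file, 255);                   // file

    std::ostringstream text;
    report(text, 255);                   // in-memory string
    std::cout << text.str();
}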
A stream is a linear queue that connects a file to the program and maintains the flow of data in both directions. The source can be any file or I/O device: a hard disk, a CD/DVD, etc.
Basically, streams are of two types: 1. Text stream 2. Binary stream
Text stream: a sequence of characters arranged in lines, each line terminated by a newline character (on Unix).
Binary stream: data exactly as it is encoded internally in the computer's main memory, without any modification.
The file system is designed to work with a wide variety of devices, including terminals, disk drives, tape drives, etc. Even though each device is different, the file system transforms each into a logical device called a stream. Streams are device independent, so the same function can be used to write to a disk file and to a tape file. In more technical terms, a stream provides an abstraction between the programmer and the actual device being used.
It is more or less easy to serialize and deserialize data structures with common libraries like boost::serialize.
But there is also a common case where I simply do something like this (pseudo code):
// receiver
NetworkInputStreamSerializer stream;
while (true) // read new data objects from the stream
{
    stream & data;
}
As I expect, the data package must already have been received completely from the network socket; if only part of the object can be read, the deserialization will fail. Especially with large data sets, TCP will fragment the data.
Is there a generic way to deal with this problem? I have not found any hints about it in the docs for boost::serialize.
As this problem is generic to any kind of streamed data, not only TCP-based streaming but also files where one program sends and another receives the data, there must be a general solution, but I could not find anything about it.
My question is not specialized to boost. I use it only as an example.
EDIT:
Maybe some more explanation of my wording "fragmentation":
Any kind of data, independent of the size it produces in serialized form, can be fragmented into several packages while being transferred via TCP or while being written to any kind of file. There is no "atomic" write and read operation supported by the OS, nor by any of the serialization libraries I know.
So when reading an int from a human-readable format like XML or JSON, I can get the problem that I read "11" instead of "112" if the "2" is not yet in the stream or file at the moment I read. So writing the length of the following content in a human-readable format is also not a solution, because the size information itself can be corrupt if the read occurs at a moment when the content string is not yet complete.
[Note: I get the sense from your question that you want a better alternative to boost::serialization for your specific case. If this doesn't answer your question, let me know and I shall delete it.]
I recommend Google Protocol Buffers from my own practical experience. Below are a few advantages:
It can be used on the wire (TCP etc.)
Simple grammar to write the .proto file for composing your own messages
Cross platform & available with multiple languages
Very efficient compared to JSON & XML
Generates header & source files with handy getter, setter, serialize, deserialize & debugging helpers
Easy to serialize & deserialize -- store to & retrieve from file
The last point is a bit tricky. While storing to a file, you may have to write the length of the message first; while retrieving, you may have to read that length first and then use the read() method to read the exact number of bytes.
You may want to use the same trick when passing data over TCP: send the length in the first couple of bytes. Once the length is known, you can always collect the remaining fragments of the message.
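A minimal sketch of that length-prefix idea (my own illustration, independent of Protocol Buffers; the payload is just a serialized byte string):

#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

// Write one message as: 4-byte big-endian length, then the payload bytes.
void write_frame(std::ostream& out, const std::string& payload)
{
    std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    unsigned char hdr[4] = {
        static_cast<unsigned char>(n >> 24), static_cast<unsigned char>(n >> 16),
        static_cast<unsigned char>(n >> 8),  static_cast<unsigned char>(n)
    };
    out.write(reinterpret_cast<const char*>(hdr), 4);
    out.write(payload.data(), static_cast<std::streamsize>(n));
}

// Read one message back; returns false when no complete frame is available.
bool read_frame(std::istream& in, std::string& payload)
{
    unsigned char hdr[4];
    if (!in.read(reinterpret_cast<char*>(hdr), 4)) return false;
    std::uint32_t n = (std::uint32_t(hdr[0]) << 24) | (std::uint32_t(hdr[1]) << 16)
                    | (std::uint32_t(hdr[2]) << 8)  |  std::uint32_t(hdr[3]);
    payload.resize(n);
    return static_cast<bool>(in.read(&payload[0], static_cast<std::streamsize>(n)));
}

int main()
{
    std::stringstream channel;               // stands in for a socket or file
    write_frame(channel, "first message");
    write_frame(channel, "second message");

    std::string msg;
    while (read_frame(channel, msg))
        std::cout << msg << '\n';
}

With a real TCP socket, the receiving side additionally has to loop until all n bytes have arrived, since a single receive call may return fewer.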
When processing input, it can be useful to know if you are being sent data in a firehose-like manner; that is, you get to see the data once and it's forever lost once it goes through the stream.
There are lots of examples of firehoses, such as console input, signal capture devices, data streams from network sockets, or named pipes.
C++ streams and streambufs are supposed to encapsulate input and output behavior, but I'm not sure there's a standard non-destructive way to detect whether the associated sequence lurking behind a streambuf you've been handed is seekable or not.
What if you called streambuf::pubseekoff(0, std::ios_base::cur, std::ios_base::in)? Can you rely on that?
Does the standard (or at least design sense) require such a call to seekoff to return streampos(-1) if the associated sequence is not physically seekable? Would it set failbit? Is it undefined behavior?
These streams don't have a viable concept of "absolute position" and so a seekoff probably should return -1 even if the seek does nothing.
But if that behavior is not required, some library designer might have decided to return something else from seekoff in such a case. Perhaps the number of characters fed through the firehose so far. Or gptr()-eback() (useless). Or a random number.
The streambuf type I have the most uncertainty about is basic_filebuf. You should certainly be able to seek within a disk file, but the OS can also encapsulate firehose type streams within the 'file' concept.
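A probe along the lines described above might look roughly like this (my own sketch; whether the -1 return is guaranteed for non-seekable sequences is exactly what is being asked):

#include <iostream>
#include <streambuf>

// Heuristic check: ask the streambuf for its current input position.
// On non-seekable sequences (pipes, terminals, sockets) implementations
// commonly return pos_type(off_type(-1)), but relying on that is the
// open question here.
bool looks_seekable(std::streambuf& sb)
{
    std::streambuf::pos_type pos =
        sb.pubseekoff(0, std::ios_base::cur, std::ios_base::in);
    return pos != std::streambuf::pos_type(std::streambuf::off_type(-1));
}

int main()
{
    std::cout << std::boolalpha
              << looks_seekable(*std::cin.rdbuf()) << '\n';
}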
I have implemented std::basic_streambuf derived wrapper around std::basic_filebuf which converts between encodings. Within this wrapper I use a single buffer for both input and output. The buffering technique comes from this article.
Now, the problem I can't figure out is this. My internal buffer is filled on calls to underflow. According to the article, when switching from input to output, the buffer should be put into a state of limbo. To do this, I need to unget the unread data in the buffer. Reading the docs and the source code, unget and putback are not guaranteed to succeed, which would leave me with an invalid tellg pointer for the next input operation.
I'm not asking for somebody to write this for me, but I am asking advice as to how to manage ungetting data from std::basic_filebuf in a way that will not fail.
I think the only sure way is to calculate the number of bytes that would be written to the file and adjust the offset accordingly. It's not as simple as it sounds, though. The filebuf may have an associated locale, unknown at compile time. I tried getting the facet and passing the data through its out member, but it doesn't work. The previously read data may have a non-default mbstate_t value, and some codecvt objects also write a BOM.
Basically, it's almost impossible to calculate the 'on file' length of a section of file data after it has passed through a codecvt.
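For illustration, the kind of length calculation attempted above looks roughly like this (hypothetical helper; it ignores a carried-over mbstate_t and any BOM the facet may write, which is precisely the problem):

#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Estimate how many external ("on file") bytes a wide string would occupy
// after passing through a locale's codecvt facet. Only an approximation,
// because it starts from a default-constructed conversion state.
std::size_t external_length(const std::wstring& s, const std::locale& loc)
{
    using cvt_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const cvt_t& cvt = std::use_facet<cvt_t>(loc);

    std::mbstate_t state{};
    std::vector<char> out(s.size() * cvt.max_length() + 1);

    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    cvt.out(state,
            s.data(), s.data() + s.size(), from_next,
            out.data(), out.data() + out.size(), to_next);

    return static_cast<std::size_t>(to_next - out.data());
}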
I have tagged this question with 'c' since 'c'-file streams also work with buffers and also use get and put pointers. std::basic_filebuf is just a wrapper around a 'c'-file stream. Answers in 'c' are also applicable to this problem.
Does anybody have any suggestions as to how to implement unlimited unget on file streams?
I have been reading a few articles on some sites about formatted and unformatted I/O; however, my mind is more messed up now.
I know this is a very basic question, but I would be grateful if anyone could give a link [to a site or a previously answered question on Stack Overflow] that explains the idea of streams in C and C++.
Also, I would like to know about formatted and unformatted I/O.
The standard doesn't define what these terms mean, it just says which of the functions defined in the standard are formatted IO and which are not. It places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with text representation of the data, they involve some parsing, analyzing and conversion of the data being read or written. Formatted input skips whitespace:
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (possibly applying the codecvt of the imbued locale). It's meant to read and write binary data, or to serve as a lower level used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
And allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
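A small illustration of the difference (my own example, not from the standard):

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("  42 hello");

    int n = 0;
    in >> n;                              // formatted input: skips whitespace, parses "42"

    char buf[8] = {};
    in.read(buf, 5);                      // unformatted input: raw characters, nothing skipped
    std::streamsize got = in.gcount();    // characters extracted by the last unformatted read

    std::cout << n << " |" << std::string(buf, static_cast<std::size_t>(got)) << "|\n";
    // prints: 42 | hell|
}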
Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file or whatever input as a whole. For small data that does work, but imagine you need to process terabytes of data: there is no way this will fit into a single byte array without your machine running out of memory. That's why streaming lets you process data in smaller chunks, one at a time, one after the other, so that at any given time you only have to deal with a fixed-size amount of data. You read the data into a helper variable over and over again and process it, until the underlying stream tells you that you are done and there is no more data left.
The same works on the output side: you write your output step by step, chunk by chunk, rather than writing the whole thing at once.
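A rough sketch of that chunked style of processing (hypothetical file name, assumed 64 KiB buffer size):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("big_input.dat", std::ios::binary);  // hypothetical file

    std::vector<char> chunk(64 * 1024);    // fixed-size working buffer
    std::uintmax_t total = 0;

    while (in) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        std::streamsize got = in.gcount(); // bytes actually read this round
        if (got == 0) break;
        // ... process chunk[0 .. got) here ...
        total += static_cast<std::uintmax_t>(got);
    }

    std::cout << "processed " << total << " bytes\n";
}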
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful. For example, C and C++ streams buffer the data they read from, say, a file, to avoid unnecessary calls and to read the data in optimized chunks, so the overall performance is much better than if you read directly from the file system.
Unformatted input/output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters, which are written to the output file. Formatted input reads characters from the input file and converts them to internal form.
I'm writing a set of unit tests that write calculated values out to files. Each test produces a square matrix that holds anywhere from 50,000 to 500,000 doubles, and I have a total of 128 combinations of test cases.
Is there any significant overhead involved in writing cout statements and then piping that output to files, or would I be better off writing directly to the file using an ofstream?
This is going to be dependent on your system and environment. There is likely to be very little difference, but there is only one way to be sure: try both approaches and measure them.
Since the dimensions involved are so large, I'm assuming these files are not meant to be read by a human being? Just make sure you write them out as binary rather than human-readable text, because that will make far more difference than the difference between using ofstream and piping cout.
Whether this means you have to use ofstream or not I don't know. I've never written binary to cout so I can't say whether that's possible...
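A minimal sketch of writing such a matrix as raw binary (my own illustration; assumes the doubles sit contiguously in a std::vector and that the file is read back on the same platform):

#include <fstream>
#include <vector>

// Dump the matrix as raw bytes; no per-value text formatting is involved.
void write_matrix(const std::vector<double>& m, const char* path)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(m.data()),
              static_cast<std::streamsize>(m.size() * sizeof(double)));
}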
As Charles Bailey said, it's implementation dependent; what follows is mostly for the Linux implementation with the GNU toolchain, but I can hardly imagine it being very different on other OSes.
In libstdc++ 4.4.2:
An fstream contains an underlying stdio_filebuf, which is a basic_filebuf. This basic_filebuf contains its own buffer (by inheriting from basic_streambuf) and actually contains a __basic_file, itself containing an underlying plain C stdio abstraction (FILE* or std::__c_file*), into which it flushes the buffer.
cout, which is an ostream, is initialized with a stdio_sync_filebuf, itself initialized with the C file abstraction stdout. stdio_sync_filebuf calls plain C stdio functions.
Considering only C++, it appears that an fstream may be more efficient thanks to the two layers of buffering.
Considering only C, if the process is forked with the stdout file descriptor redirected to a file, there should be no difference between writing to a newly opened file (which is what fstream does in the end) and writing to stdout (which is what cout does in the end), since the fd points to a file either way.
If I were you, I would use an fstream since it's your intent.