Read binary data from std::cin - c++

What is the easiest way to read binary (unformatted) data from std::cin into either a string or a stringstream?

std::cin is not opened with std::ios_base::binary. If you must use cin, then you need to reopen it in binary mode, and there is no standard way to do that.
Some ideas here: https://comp.unix.programmer.narkive.com/jeVj1j3I/how-can-i-reopen-std-cin-and-std-cout-in-binary-mode
Once it's binary, you can use cin.read() to read bytes. If you know that in your system, there is no difference between text and binary (and you don't need to be portable), then you can just use read without worrying.

For windows, you can use the _setmode function in conjunction with cin.read(), as already mentioned.
_setmode(_fileno(stdin), _O_BINARY);
cin.read(...);
See solution source here: http://talmai-oliveira.blogspot.com/2011/06/reading-binary-files-from-cin.html
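A hedged sketch of how the _setmode call might be wrapped so the same source also builds on non-Windows systems. The function name set_stdin_binary is made up for this example; _setmode, _fileno and _O_BINARY are the Microsoft CRT names used above.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h>  // _O_BINARY
#include <io.h>     // _setmode
#endif

// Hypothetical helper: switches stdin to binary mode where the
// text/binary distinction exists. Returns false if the switch failed.
bool set_stdin_binary() {
#ifdef _WIN32
    // _setmode returns -1 on failure, otherwise the previous mode.
    return _setmode(_fileno(stdin), _O_BINARY) != -1;
#else
    // POSIX systems make no text/binary distinction, so nothing to do.
    return true;
#endif
}
```

Call it once at the start of main, before the first read from std::cin.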

cin.read would store a fixed number of bytes, without any logic searching for delimiters of the type that @Jason mentioned.
However, there may still be translations active on the stream, such as CRLF -> NL, so it still isn't ideal for binary data.

On a Unix/POSIX system, you can use the cin.get() method to read byte-by-byte and save the data into a container like a std::vector<unsigned char>, or you can use cin.read() to read a fixed number of bytes into a buffer. You could also use cin.peek() to check for any end-of-data-stream indicators.
Keep in mind to avoid using the operator>> overload for this type of operation ... using operator>> will cause breaks to occur whenever a delimiter character is observed, and it will also remove the delimiting character from the stream itself. This would include any binary values that are equivalent to a space, tab, etc. Thus the binary data you end up storing from std::cin using that method will not match the input binary stream byte-for-byte.

All predefined iostream objects are obligated to be bound to corresponding C streams:
The object cin controls input from a stream buffer associated with the object stdin, declared in <cstdio>.
http://eel.is/c++draft/narrow.stream.objects
and thus the method of obtaining binary data is the same as for C:
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary
mode. In the normal mode, reading from stdin on Windows will convert
\r\n (Windows newline) to the single character ASCII 10. Using the
"rb" mode disables this conversion so that you can properly read in
binary data.
https://stackoverflow.com/a/1599093/6049796

cplusplus.com:
Unformatted input
Most of the other member functions of the istream
class are used to perform unformatted input, i.e. no interpretation is
made on the characters got from the input. These member functions can
get a determined number of characters from the input character
sequence (get, getline, peek, read, readsome)...
As Lou Franco pointed out, std::cin isn't opened with std::ios_base::binary, but one of those functions might get you close to the behavior you're looking for.

With windows/mingw/msys/bash, if you need to pipe different commands with binary streams in between, you need to manipulate std::cin and std::cout as binary streams.
The _setmode solution from Mikhail works perfectly.
Using MinGW, the needed headers are the following:
#include <io.h>
#include <fcntl.h>

Related

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function. Which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers that might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair won't be. Depending on the OS you're using, it might not even mean that much, though: for example, on Linux, there's normally no difference between binary and text mode at all.
Well, there are no rules broken and you'll get away with that just fine, except that you may lose some of the precision of reading binary data from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed to std::getline. But the stream will no longer record the number of bytes you consumed in the last unformatted operation.

C++ file stream

Could someone please explain how c++ reads in files? I'm not asking the code to read in a file but after the ifstream >> variable, what are the rules to how c++ grabs the data ?
Is the file read like how a cin would read in the user input? meaning it stops after each whitespace? What happens after it reaches the end of a line? does it automatically proceed to the next line or do I have to write code for that? I know that it stops after eof, but I'm unsure of the process of extracting data and I can't write code if I don't understand the process. Thanks
Yes, the input operator >> always "tokenizes" (stops at) whitespace. Reading from any input stream works the same way.
For very good information I suggest this reference. Especially the reference for the input operator is very detailed.
Basically, when you declare ifstream f;, you are creating a stream object that is not yet attached to a file. From there you must declare your intentions with that variable. Using f.open(fileName, ios::in); you can input from fileName using the >> operator, which operates just like it does on cin. It stops at white spaces like you'd expect. Once it reaches the end of a line, it continues as long as you have code that asks the operator to extract more. You don't have to do anything extra to tell it to move on to the next line.
More info can be found here.
The iostream formatted input and output operators are essentially defined in terms of the C library functions strtol/strtoul/strtod (cf. 22.4.2.1.2) and sprintf (cf. 22.4.2.2.2), respectively.
In C/C++ there is generally no difference between input streams (besides details such as seeking).
In C++ there are two distinct kinds of input:
Formatted Input
Unformatted Input
All formatted input operations involve an operator stream& operator >> (stream&, T). However, not all stream& operator >> (stream&, T) are performing formatted input (eg.: some are involving manipulators or a stream buffer)
Each formatted input starts with skipping white spaces and stops at the first character not being part of the input format (Note: It may be any character, it is not limited to white spaces).
Unformatted input reads all characters (does not ignore any white space) and stops if a requested amount of characters is retrieved or the stream reaches the end (EOF). Specialized functions (like std::getline) might stop early and ignore the delimiting condition character.

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read some characters that satisfy certain condition from stdin with iostream library while leave those not satisfying the condition in stdin so that those skipped characters can be read later. Is it possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First read in all characters in a-c - abca; after this input has finished, start reading the remaining characters dddx. (These two inputs can't happen simultaneously. They might be in two different functions).
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
cin >> myString;
// Do stuff to fill putbackBuf[] with characters in reverse order to be put back
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think that but iostreams gurus may differ).
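The "read everything, then split" suggestion from the answers above could look roughly like this. It is only a sketch: split_by_range is a made-up name, and the a-c range is hard-coded to match the example in the question.

```cpp
#include <string>
#include <utility>

// Splits `input` into (characters in 'a'..'c', everything else),
// keeping the original order within each part.
std::pair<std::string, std::string> split_by_range(const std::string& input) {
    std::pair<std::string, std::string> parts;
    for (char ch : input) {
        if (ch >= 'a' && ch <= 'c') {
            parts.first.push_back(ch);
        } else {
            parts.second.push_back(ch);
        }
    }
    return parts;
}
```

For the input abdddcxa this gives abca as the first part and dddx as the second. Each part can then be wrapped in a std::istringstream and handed to the function that needs it.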

Formatted and unformatted input and output and streams

I have been reading a few articles on some sites about formatted and unformatted I/O; however, I am now more confused than before.
I know this is a very basic question, but could anyone give a link [to some site or previously answered question on Stack Overflow] that explains the idea of streams in C and C++?
Also, I would like to know about formatted and unformatted I/O.
The standard doesn't define what these terms mean, it just says which of the functions defined in the standard are formatted IO and which are not. It places some requirements on the implementation of these functions.
Formatted IO is simply the IO done using the << and >> operators. They are meant to be used with text representation of the data, they involve some parsing, analyzing and conversion of the data being read or written. Formatted input skips whitespace:
Each formatted input function begins execution by constructing an object of class sentry with the noskipws (second) argument false.
Unformatted IO reads and writes the data just as a sequence of 'characters' (with possibly applying the codecvt of the imbued locale). It's meant to read and write binary data, or function as a lower-level used by the formatted IO implementation. Unformatted input doesn't skip whitespace:
Each unformatted input function begins execution by constructing an object of class sentry with the default argument noskipws (second) argument true.
And allows you to retrieve the number of characters read by the last input operation using gcount():
Returns: The number of characters extracted by the last unformatted input member function called for the object.
Formatted IO means that your output is determined by a "format string", that means you provide a string with certain placeholders, and you additionally give arguments that should be used to fill these placeholders:
const char *daughter_name = "Lisa";
int daughter_age = 5;
printf("My daughter %s is %d years old\n", daughter_name, daughter_age);
The placeholders in the example are %s, indicating that this shall be substituted using a string, and %d, indicating that this is to be replaced by a signed integer number. There are a lot more options that give you control over how the final string will present itself. It's a convenience for you as the programmer, because it relieves you from the burden of converting the different data types into a string and it additionally relieves you from string appending operations via strcat or anything similar.
Unformatted IO on the other hand means you simply write character or byte sequences to a stream, not using any format string while you are doing so.
Which brings us to your question about streams. The general concept behind "streaming" is that you don't have to load a file or whatever input as a whole all the time. For small data this does work though, but imagine you need to process terabytes of data - no way this will fit into a single byte array without your machine running out of memory. That's why streaming allows you to process data in smaller-sized chunks, one at a time, one after the other, so that at any given time you just have to deal with a fixed-size amount of data. You read the data into a helper variable over and over again and process it, until your underlying stream tells you that you are done and there is no more data left.
The same works on the output side, you write your output step for step, chunk for chunk, rather than writing the whole thing at once.
This concept brings other nice features, too. Because you can nest streams within streams within streams, you can build a whole chain of transformations, where each stream may modify the data until you finally receive the end result, not knowing about the single transformations, because you treat your stream as if there were just one.
This can be very useful, for example C or C++ streams buffer the data that they read natively from e.g. a file to avoid unnecessary calls and to read the data in optimized chunks, so that the overall performance will be much better than if you would read directly from the file system.
Unformatted Input/Output is the most basic form of input/output. Unformatted input/output transfers the internal binary representation of the data directly between memory and the file. Formatted output converts the internal binary representation of the data to ASCII characters which are written to the output file. Formatted input reads characters from the input file and converts them to internal form.

Problem with getline and "strange characters"

I have a strange problem,
I use
wifstream a("a.txt");
wstring line;
while (a.good()) //!a.eof() not helping
{
getline (a,line);
//...
wcout<<line<<endl;
}
and it works nicely for txt file like this
http://www.speedyshare.com/files/29833132/a.txt
(sorry for the link, but it is just 80 bytes so it shouldn't be a problem to get it , if i c/p on SO newlines get lost)
BUT when I add for example 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples) to any line, that is the line where loading stops. I was under the wrong impression that getline, taking a wstring as one input and a wifstream as the other, can chew any txt input...
Is there any way to read every single line in the file even if it contains funky characters?
The not-very-satisfying answer is that you need to imbue the input stream with a locale which understands the particular character encoding in question. If you don't know which locale to choose, you can use the empty locale.
For example (untested):
std::wifstream a("a.txt");
std::locale loc("");
a.imbue(loc);
Unfortunately, there is no standard way to determine what locales are available for a given platform, let alone select one based on the character encoding.
The above code puts the locale selection in the hands of the user, and if they set it to something plausible (e.g. en_AU.UTF-8) it might all Just Work.
Failing this, you probably need to resort to third-party libraries such as iconv or ICU.
Also relevant this blog entry (apologies for the self-promotion).
The problem is with your call to the global function getline (a,line). This takes a std::string. Use the std::wistream::getline method instead of the getline function.
C++ fstreams delegate I/O to their filebufs. filebufs always read "raw bytes" from disk and then use the stream locale's codecvt facet to convert these raw bytes into their "internal encoding".
A wfstream is a basic_fstream<wchar_t> and thus has a basic_filebuf<wchar_t> which uses the locale's codecvt<wchar_t, char> to convert the bytes read from disk into wchar_ts. If you read a UCS-2 encoded file, the conversion must thus be performed with a codecvt that "knows" that the external encoding is UCS-2. You thus need a locale with such a codecvt (see, for example, this SO question).
By default, the stream's locale is the global locale at the stream construction. To use a specific locale, it should be imbue()-d on the stream.