std::ifstream in binary mode and locale in C++ - c++

A comment by James Kanze on How to copy a .txt file to a char array in c++ makes it sound like in order to be sure that a standard string would get the exact binary contents of a file when iterated through by a standard string constructor, one would have to both:
open the file in binary mode,
ensure that the file is imbued with the "C" locale.
In code, I'm guessing that means:
std::ifstream in(filename, std::ios_base::binary);
in.imbue(std::locale("C"));
Is that really necessary? More specifically, why would the locale have any impact when the file is opened in binary mode?
Note that what I am trying to do is more or less what the above mentioned question was about:
std::string contents((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
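For reference, here is a minimal sketch of the whole-file read being asked about. The helper name read_file_bytes is hypothetical, and the imbue call is included only to mirror the question; as far as I know the codecvt facet for a plain char stream performs no conversion, so it should be harmless rather than required.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper (not from the question): returns the raw bytes of the
// file at `path`, or an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path) {
    assert(!path.empty());  // precondition: a path was supplied
    std::ifstream in;
    // Imbue before opening, mirroring the advice quoted in the question.
    in.imbue(std::locale::classic());
    // Binary mode switches off newline translation on platforms that have it.
    in.open(path, std::ios_base::in | std::ios_base::binary);
    if (!in) {
        return std::string();
    }
    // The extra parentheses keep this from being parsed as a function
    // declaration (the "most vexing parse").
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}
```

Calling read_file_bytes(filename) should then give back every byte of the file, including any "\r\n" pairs and embedded null characters.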

Based on binary and text modes:
A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream always equals to the data that were earlier written out to that stream. Implementations are only allowed to append a number of null characters to the end of the stream.
I think
std::ifstream in(filename, std::ios_base::binary);
together with:
in.imbue(std::locale("C"));
does not make sense.
Either the stream is in binary mode and the locale does not apply, or the programmer chooses to set the locale, which implicitly means that the stream is opened in text mode (std::ios_base::binary should not be passed to the stream constructor). In that case, the data read may or may not equal the data in the file, depending on the OS and the contents of the file.

Related

C++ ifstream, ofstream: What's the difference between raw read()/write() calls and opening file in binary mode?

This question concerns the behaviour of ifstream and ofstream when reading and writing data to files.
From reading around stackoverflow.com I have managed to find out that operator<< (stream insertion operator) converts objects such as doubles to text representation before output, and calls to read() and write() read and write raw data as it is stored in memory (binary format) respectively. EDIT: This much is obvious, nothing unexpected here.
I also found out that opening a file in binary mode prevents automatic translation of newline characters as required by different operating systems.
So my question is this: Does this automatic translation, eg; from \n to \r\n occur when calling functions read() and write()? Or is this behaviour just specific to the operator<<. (And also operator>>.)
Note there is a similar but slightly less specific question here. It does not give a definite answer. Difference in using read/write when stream is opened with/without ios::binary mode
The difference between binary and text mode is at a lower level.
If you open a file in text mode you will get translated data even when using read and write operations.
Please also note that you're allowed to seek to a position in a text file only if the position was obtained from a previous tell (or 0). To be able to do random positioning, the file must have been opened in binary mode.
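One way to check this on a given platform is to write the same bytes with write() in both modes and compare the resulting file sizes. The sketch below does that; the helper name written_size is made up for illustration. On Windows I would expect the text-mode file to be one byte longer per "\n", while on POSIX systems both sizes should be the same.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: writes `data` with the unformatted write() call using
// the given open mode, then reports how many bytes ended up in the file.
std::streamoff written_size(const std::string& path, const std::string& data,
                            std::ios_base::openmode mode) {
    assert(!path.empty());  // precondition: a path was supplied
    {
        std::ofstream out(path, mode | std::ios_base::trunc);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
    }  // leaving the scope closes the file and flushes the buffer

    // Measure in binary mode so the measurement itself is not translated.
    std::ifstream in(path, std::ios_base::in | std::ios_base::binary |
                               std::ios_base::ate);
    return in.tellg();  // -1 if the file could not be opened
}
```

Comparing written_size(path, "a\nb", std::ios_base::out) with written_size(path, "a\nb", std::ios_base::binary) shows whether write() is subject to newline translation on the system at hand.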

Convert UCS-2 inside character array to UTF-8 std::string

Well this is a direct "followup" of this question; I decided to split the problem into two - Originally I posted the whole picture to prevent getting another close with "YZ problem". For now consider I know already the character encoding.
However I read a string using std::getline from a file. This file is encoded in a format I know -say UTF16 big endian-.
But not "all" files are UTF16 (actually most are UTF8), I prefer to have as little code-copying as possible.
Now my first response is to "just read the bytes" and "then do the conversion to UTF-8", and skip the conversion if the input is already UTF-8. So I read it first into a std::string (please ignore the "ugliness" of OpenFilestreams()[file_index]);
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
After this I obviously have a nonsense string, as the bytes are ordered as if the string was UCS2/UTF-16. So how can I convert this std::string to another std::string holding the UTF-8 byte sequence? Or should I do this at the line-reading level (or even when opening the file stream)?
I prefer to keep to the C++11 standard, maybe Boost/ICU if it is really better (I already have Boost, but no ICU library on my PC).
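One possible direction, sketched here with hand-written conversion so that it needs nothing beyond C++11: treat the std::string as a buffer of UTF-16 big-endian code units and re-encode it as UTF-8. The names append_utf8 and utf16be_to_utf8 are hypothetical. The sketch assumes the bytes were read in binary mode without a BOM; note that std::getline splits on the byte 0x0A, which in UTF-16 can be half of an unrelated code unit, so reading the raw bytes first and splitting lines after conversion is probably safer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to `out` as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range is a caller bug
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reads the big-endian 16-bit code unit that starts at byte offset `i`.
std::uint32_t unit_at(const std::string& bytes, std::size_t i) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 8) |
           static_cast<unsigned char>(bytes[i + 1]);
}

// Hypothetical helper: `bytes` holds UTF-16 big-endian code units (no BOM).
// Unpaired surrogates and a trailing odd byte are replaced with U+FFFD.
std::string utf16be_to_utf8(const std::string& bytes) {
    std::string out;
    const std::size_t n = bytes.size();
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        std::uint32_t cp = unit_at(bytes, i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < n) {
            const std::uint32_t low = unit_at(bytes, i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Valid surrogate pair: combine and skip the low half.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;  // high surrogate without a low one
            }
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate or truncated pair
        }
        append_utf8(out, cp);
    }
    if (n % 2 != 0) {
        append_utf8(out, 0xFFFD);  // dangling final byte
    }
    return out;
}
```

Files that are already UTF-8 would simply skip the call, which keeps the rest of the code path identical for both encodings.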

How is std::fstream with both in and out supposed to work?

I've just started wondering: how is std::fstream opened with both std::ios::in and std::ios::out actually supposed to work? What should it do? Write something to (for example) an empty file, then read... what? The value just written? Where would the file "pointer"/"cursor" be? Maybe the answer's already out there, but I just couldn't find it.
What is std::fstream?
std::fstream is a bidirectional file stream class. That is, it provides an interface for both input and output for files. It is commonly used when a user needs to read from and write to the same external sequence.
When instantiating a bidirectional file stream (unlike std::ofstream or std::ifstream), the openmodes ios_base::in and ios_base::out are specified by default. This means that this:
std::fstream f("test.txt", std::ios_base::in | std::ios_base::out);
is the same as
std::fstream f("test.txt");
One would specify both options if they needed to also add some non-default openmodes such as trunc, ate, app, or binary. The ios_base::trunc openmode is needed if you intend to create a new file for bidirectional I/O, because the ios_base::in openmode disables the creation of a new file.
Bidirectional I/O
Bidirectional I/O is the utilization of a bidirectional stream for both input and output. In IOStreams, the standard streams maintain their character sequences in a buffer where it serves as a source or sink for data. For output streams, there is a "put" area (the buffer that holds characters for output). Likewise, for input streams, there is the "get" area.
In the case of std::fstream (a class for both input and output), it holds a joint file buffer representing both the get and put area respectively. The position indicator that marks the current position in the file is affected by both input and output operations. As such, in order to perform I/O correctly on a bidirectional stream, there are certain rules you must follow:
When you switch from writing to reading, or from reading to writing, the stream must be repositioned first with a seek (seekg/seekp); after a write, a flush is also enough.
If an input operation hit the end-of-file, performing a write directly thereafter is fine.
This only refers to std::fstream. The above rules are not needed for std::stringstream.
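As a rough illustration of the first rule, here is a small sketch that writes to a new file and reads the text back through the same stream. The helper name write_then_read is made up for this example, and std::ios_base::trunc is passed so that the file is created if it does not exist.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to a fresh file, then reads it back
// through the same std::fstream. Returns an empty string if opening fails.
std::string write_then_read(const std::string& path, const std::string& text) {
    assert(text.find('\n') == std::string::npos);  // one line only, for getline
    // in | out alone would fail on a missing file; trunc allows creation.
    std::fstream f(path, std::ios_base::in | std::ios_base::out |
                             std::ios_base::trunc);
    if (!f) {
        return std::string();
    }
    f << text;    // output: the position is now just past the written text
    f.seekg(0);   // reposition before switching from output to input
    std::string line;
    std::getline(f, line);  // input: reads back what was just written
    return line;
}
```

Without the seekg(0) call the read would start at the end of the file and find nothing, which answers the "then read... what?" part of the question.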
I hope these answer your questions. If you have any more, you can just ask in the comments.

Read binary data from std::cin

What is the easiest way to read binary (non-formated) data from std::cin into either a string or a stringstream?
std::cin is not opened with std::ios_base::binary. If you must use cin, then you need to reopen it, which isn't part of the standard.
Some ideas here: https://comp.unix.programmer.narkive.com/jeVj1j3I/how-can-i-reopen-std-cin-and-std-cout-in-binary-mode
Once it's binary, you can use cin.read() to read bytes. If you know that in your system, there is no difference between text and binary (and you don't need to be portable), then you can just use read without worrying.
For windows, you can use the _setmode function in conjunction with cin.read(), as already mentioned.
_setmode(_fileno(stdin), _O_BINARY);
cin.read(...);
See solution source here: http://talmai-oliveira.blogspot.com/2011/06/reading-binary-files-from-cin.html
cin.read would store a fixed number of bytes, without any logic searching for delimiters of the type that @Jason mentioned.
However, there may still be translations active on the stream, such as CRLF -> NL, so it still isn't ideal for binary data.
On a Unix/POSIX system, you can use the cin.get() method to read byte-by-byte and save the data into a container like a std::vector<unsigned int>, or you can use cin.read() in order to read a fixed amount of bytes into a buffer. You could also use cin.peek() to check for any end-of-data-stream indicators.
Avoid the operator>> overload for this type of operation: operator>> stops whenever a delimiter character is observed, and it also discards the delimiting characters from the stream. This includes any binary values that are equivalent to a space, tab, etc. Thus the binary data you end up storing from std::cin using that method will not match the input binary stream byte-for-byte.
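To make the difference concrete, here is a small sketch that drains a stream once with read() and once with operator>>. The function names are hypothetical, and a std::istringstream stands in for std::cin so the behaviour can be checked without piping data in; on Windows, std::cin would still need to be switched to binary mode separately.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: drains `in` with unformatted read() calls.
std::string read_all(std::istream& in) {
    std::string data;
    char buf[4096];
    // read() sets failbit on the short final chunk, so gcount() decides
    // whether there is still something to append.
    while (in.read(buf, static_cast<std::streamsize>(sizeof buf)) ||
           in.gcount() > 0) {
        data.append(buf, static_cast<std::size_t>(in.gcount()));
    }
    return data;
}

// For contrast: formatted extraction, which skips and drops whitespace.
std::string read_with_extraction(std::istream& in) {
    std::string data, token;
    while (in >> token) {
        data += token;
    }
    return data;
}

// Usage sketch with an in-memory stream standing in for std::cin.
void demo() {
    const std::string raw("a b\t\0c\n", 7);
    std::istringstream s1(raw), s2(raw);
    assert(read_all(s1) == raw);              // byte-for-byte copy
    assert(read_with_extraction(s2) != raw);  // whitespace bytes were lost
}
```

With real input the call would simply be read_all(std::cin).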
All predefined iostream objects are obligated to be bound to corresponding C streams:
The object cin controls input from a stream buffer associated with the object stdin, declared in <cstdio>.
http://eel.is/c++draft/narrow.stream.objects
and thus the method of obtaining binary data is same as for C:
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary
mode. In the normal mode, reading from stdin on Windows will convert
\r\n (Windows newline) to the single character ASCII 10. Using the
"rb" mode disables this conversion so that you can properly read in
binary data.
https://stackoverflow.com/a/1599093/6049796
cplusplus.com:
Unformatted input
Most of the other member functions of the istream
class are used to perform unformatted input, i.e. no interpretation is
made on the characters got from the input. These member functions can
get a determined number of characters from the input character
sequence (get, getline, peek, read, readsome)...
As Lou Franco pointed out, std::cin isn't opened with std::ios_base::binary, but one of those functions might get you close to the behavior you're looking for.
With windows/mingw/msys/bash, if you need to pipe different commands with binary streams in between, you need to manipulate std::cin and std::cout as binary streams.
The _setmode solution from Mikhail works perfectly.
Using MinGW, the needed headers are the following:
#include <io.h>
#include <fcntl.h>

Problem with getline and "strange characters"

I have a strange problem,
I use
wifstream a("a.txt");
wstring line;
while (a.good()) //!a.eof() not helping
{
getline (a,line);
//...
wcout<<line<<endl;
}
and it works nicely for txt file like this
http://www.speedyshare.com/files/29833132/a.txt
(sorry for the link, but it is just 80 bytes so it shouldn't be a problem to get it , if i c/p on SO newlines get lost)
BUT when I add for example 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples )to any line that is the line where loading stops. I was under the wrong impression that getline that takes wstring as one input and wifstream as other can chew any txt input...
Is there any way to read every single line in the file even if it contains funky characters?
The not-very-satisfying answer is that you need to imbue the input stream with a locale which understands the particular character encoding in question. If you don't know which locale to choose, you can use the empty locale.
For example (untested):
std::wifstream a("a.txt");
std::locale loc("");
a.imbue(loc);
Unfortunately, there is no standard way to determine what locales are available for a given platform, let alone select one based on the character encoding.
The above code puts the locale selection in the hands of the user, and if they set it to something plausible (e.g. en_AU.UTF-8) it might all Just Work.
Failing this, you probably need to resort to third-party libraries such as iconv or ICU.
Also relevant this blog entry (apologies for the self-promotion).
The problem is with your call to the global function getline (a,line). This takes a std::string. Use the std::wistream::getline method instead of the getline function.
C++ fstreams delegate I/O to their filebufs. filebufs always read "raw bytes" from disk and then use the stream locale's codecvt facet to convert these raw bytes into their "internal encoding".
A wfstream is a basic_fstream<wchar_t> and thus has a basic_filebuf<wchar_t>, which uses the locale's codecvt<wchar_t, char, mbstate_t> to convert the bytes read from disk into wchar_ts. If you read a UCS-2 encoded file, the conversion must thus be performed with a codecvt that "knows" that the external encoding is UCS-2. You thus need a locale with such a codecvt (see, for example, this SO question)
By default, the stream's locale is the global locale at the stream construction. To use a specific locale, it should be imbue()-d on the stream.
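A minimal sketch of that last step, assuming the locale is imbued before the file is opened so that no bytes have been converted yet. The helper name open_with_locale is hypothetical, and which locale names exist depends entirely on the platform, so the function reports failure instead of assuming a name such as en_AU.UTF-8 is available.

```cpp
#include <cassert>
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: imbues `in` with the named locale, then opens `path`.
// Returns false if the platform does not know the locale name or the file
// cannot be opened.
bool open_with_locale(std::wifstream& in, const std::string& path,
                      const char* locale_name) {
    assert(!in.is_open());  // imbue before any I/O has happened
    try {
        // std::locale throws std::runtime_error for an unknown name.
        in.imbue(std::locale(locale_name));
    } catch (const std::runtime_error&) {
        return false;
    }
    in.open(path);
    return in.is_open();
}
```

A caller could try open_with_locale(a, "a.txt", "en_AU.UTF-8") first and fall back to the user's preferred locale, open_with_locale(a, "a.txt", ""), if that returns false.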