Ifstream reads wrong characters from text file - c++

I have the following simple code, that reads contents of a text file into array of chars:
const char* name = "test.txt";
std::cout << "Loading file " << name << std::endl;
std::ifstream file;
file.open(name);
file.seekg (0, std::ios::end);
int length = file.tellg();
std::cout << "Size: " << length << " bytes" << std::endl;
file.seekg (0, std::ios::beg);
char* buffer = new char[length];
file.read(buffer,length);
file.close();
std::cout.write(buffer,length);
However, it seems ifstream reads the wrong number of chars from the file: 1 additional char for each line. I searched the web, and it looks like on Win7 text files have a carriage return symbol (\r) in addition to the newline (\n) at the end of each line. However, the stream somehow does not see these \r, but still uses the original number of symbols in the file, reading additional bytes from beyond the end of the file. Is it possible to somehow solve this problem?
If it helps: I use MinGW compiler and Windows 7 64bit.

You might want to open the file in binary mode:
file.open(name, std::ios_base::in | std::ios_base::binary);
Otherwise what happens is that the standard library translates every Windows newline (CR+LF) into a single \n for you.
This means that the number of characters that you can read from the file is not the same as the size of the file. When you call read(), it reads as many characters as it can. If it can't read the number of characters you requested, it sets the stream's failbit.
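If you do want the text-mode translation, a workable alternative is to check gcount() after the read to learn how many characters were actually stored. A minimal sketch, assuming the same test.txt as in the question:
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream file("test.txt"); // text mode: CR+LF arrives as '\n'
    file.seekg(0, std::ios::end);
    std::streamsize length = file.tellg(); // file size in bytes, an upper bound
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer(static_cast<std::size_t>(length));
    file.read(buffer.data(), length);    // may read fewer chars than length
    std::streamsize got = file.gcount(); // the number actually read
    std::cout.write(buffer.data(), got);
}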

You're starting from some very erroneous (but widespread) opinions.
file.tellg() doesn't return an int; it returns an implementation
defined object of type streampos, which must be a class type, and may
or may not be convertible into an integral type. And if it is
convertible into an integral type (and I don't know of an implementation
where it isn't, even if it is not required), there is no guarantee that
the resulting integer represents anything more than a magic cookie which
would allow reseeking to the same position.
In practice, this is probably not a big issue on modern machines: both
Unix and Windows return the offset in bytes from the start of the file.
In the case of Unix, this works fine, because the mapping of the
internal representation to the external one is one to one. In the case
of Windows, there is a remapping of line endings: in a text file, a line
ending is a two byte sequence of 0x0D, 0x0A, which becomes, when read,
the single char '\n'. And streampos (converted to an integral type)
gives the offset in bytes to where you have to seek in the file, and not
the number of char you have to read to get to that position. For things
like what you seem to be doing, this is not a problem; the allocated
buffer may be a little larger than necessary, but it will never be too
small.
Be aware that this may not be true on mainframes. Historically, at
least, mainframes used block oriented files, and the integral value of a
streampos could easily be something broken up into fields, with a
certain number of bits for the block number, and other bits for the byte
offset in the block. Depending on how these are laid out in the word,
a buffer allocated as you do could easily be several orders of magnitude
too big, or if the offset is placed on the high order bits, too small.
The only reliable way of getting the exact size of buffer you need is
system dependent, and on some systems (including Windows), there may be
no other way except by reading all of the characters and counting them.
(The reason streampos is required to be a class type is because,
historically, many older multibyte encodings had an encoding state; you
couldn't correctly decode a character without knowing what characters
preceded it. So streampos is required to contain two different
pieces of information: the position to seek in the file, and information about
this state. I don't think that there are any state dependent multibyte
encodings in wide use today, however.)
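To illustrate that last point, here is a minimal sketch of the read-and-count approach, which never consults tellg() and therefore works regardless of how streampos maps to bytes:
#include <fstream>
#include <iterator>
#include <string>

int main() {
    std::ifstream file("test.txt"); // text mode: line endings already remapped
    std::string contents((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());
    // contents.size() is the exact number of chars the stream delivered
}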

Read about opening files for binary reading.

Related

What happens when I read a file into a string

For a small program, I found that with gcc/libstdc++ and clang++/libc++, reading file contents into a string works as intended with std::string itself:
std::string filecontents;
{
    std::ifstream t(file);
    std::stringstream buffer;
    buffer << t.rdbuf();
    filecontents = buffer.str();
}
Later on I modify the string. E.g.
ending_it = std::find(ending_it, filecontents.end(), '$');
*ending_it = '\\';
auto ending_pos
= static_cast<size_t>(std::distance(filecontents.begin(), ending_it));
filecontents.insert(ending_pos + 1, ")");
This worked even if the file included non-ASCII characters like a Greek lambda. I never searched for these Unicode characters, but they were in the string. Later on I output the string to std::cout.
Is this guaranteed to work in C++17 (and beyond)?
The question is: What are the conditions, under which I can read file contents into std::string via std::ifstream, work on the string like above and expect things to work correctly.
As far as I know, std::string uses char, which has only 1 byte.
Therefore it surprised me that the method worked with non-ascii chars in the file.
Thanks #user4581301 and #PeteBecker for their helpful comments making me understand the problem.
The question stems from a wrong mental model of std::string, or more fundamentally a wrong model of char.
This is nicely explained here and here.
I implicitly thought that a char holds a "character" in the more colloquial sense and therefore knows its encoding. Instead, a char really only holds a single byte (in C++; in C it's defined slightly differently). Therefore it is always well-defined to read a file into a string, as a string is first and foremost only an array of bytes.
This also means that reading a file in an encoding where a "character" can span multiple bytes results in those characters spanning multiple indices in the std::string.
This can be seen when outputting a single char from the string.
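A minimal sketch of that, assuming the file contained a Greek lambda (U+03BB, which UTF-8 encodes as the two bytes 0xCE 0xBB):
#include <iostream>
#include <string>

int main() {
    std::string s = "\xCE\xBB";    // one lambda, two chars
    std::cout << s.size() << '\n'; // prints 2: one "character", two indices
    for (unsigned char c : s)
        std::cout << std::hex << static_cast<int>(c) << '\n'; // ce, then bb
}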
Luckily, whenever the file is ASCII-encoded or UTF-8-encoded, the byte representation of an ASCII character can only ever appear when encoding that character. This means that searching the string of the file for an ASCII character will find exactly those characters and nothing else. Therefore the above operations of searching for '$' and inserting a substring after an index that points to an ASCII character will not corrupt the characters in the string.
Outputting the string to a terminal then just hands over the bytes to be interpreted by the terminal. If the terminal knows UTF-8, it will interpret the bytes accordingly.

Why does std::wofstream not print the whole wstring into the file?

I have a std::wstring whose size is 139,580,199 characters.
For debugging I printed it into file with this code:
std::wofstream f(L"C:\\some file.txt");
f << buffer;
f.close();
After that I noticed that the end of the string is missing. The created file size is 109,592,584 bytes (and the "size on disk" is 109,596,672 bytes).
I also checked whether the buffer contains null chars:
size_t pos = buffer.find(L'\0');
I expected the result to be std::wstring::npos, but it is 18446744073709551615; my string doesn't have a null char at the end, though, so that is probably ok.
Can somebody explain why not all of the string is printed into the file?
A lot depends on the locale, but typically, files on disk will
not use the same encoding form (or even the same encoding) as
that used by wchar_t; the filebuf which does the actual
reading and writing translates the encodings according to its
imbued locale. And there is only a vague relationship between
the length of a string in different encodings or encoding form.
(And the size the system sees doesn't correspond directly to the
number of bytes you can read from the file.)
To see if everything was written, check the status of f
after the close, i.e.:
f.close();
if ( !f ) {
    // Something went wrong...
}
One thing that can go wrong is that the external encoding
doesn't have a representation for one of the characters. If
you're in the "C" locale, this could occur for any character
outside of the basic execution character set.
If there is no error above, there's no reason off hand to assume
that not all of the string has been written. What happens if
you try to read it in another program? Do you get the same
number of characters or not?
For the rest, nul characters are characters like any others in
a std::wstring; there's nothing special about them, including
when they are output to a stream. And 18446744073709551615
looks very much like the value I would expect for
std::wstring::npos on a 64 bit machine.
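A quick way to convince yourself, assuming a 64-bit size_type:
#include <iostream>
#include <string>

int main() {
    // npos is size_type(-1); on a 64-bit implementation this prints
    // 18446744073709551615
    std::cout << std::wstring::npos << '\n';
}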
EDIT:
Following up on Mat Petersson's comment: it's actually highly
unlikely that the file ends up with fewer bytes than there are
code points in the std::wstring. (std::wstring::size()
returns the number of code points.) I was thinking in terms of
bytes, not in terms of what std::wstring::size() returns. So
the most likely explanation is that you have some characters in
your string which aren't representable in the target encoding
(which probably only supports characters with code points
32-126, plus a few control characters, by default).
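If unrepresentable characters are indeed the culprit, one possible fix is to imbue the stream, before any output, with a locale whose encoding can represent them. A sketch, noting that locale names are platform dependent (the name below is common on Unix; Windows spells its locales differently):
#include <fstream>
#include <locale>

int main() {
    std::wofstream f("some file.txt");
    f.imbue(std::locale("en_US.UTF-8")); // throws std::runtime_error if unknown
    f << L"\x03bb"; // Greek lambda, unrepresentable in the "C" locale
    f.close();
    if (!f) {
        // the write or the encoding failed
    }
}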

Any way to get rid of the null character at the end of an istream get?

I'm currently trying to write a bit of code to read a file and extract bits of it and save them as variables.
Here's the relevant code:
char address[10];
ifstream tracefile;
tracefile.open ("trace.txt");
tracefile.seekg(2, ios::beg);
tracefile.get(address, 10, ' ');
cout << address;
The contents of the file: (just the first line)
R 0x00000000
The issue I'm having is that address misses the final '0' because get() puts a '\0' character there, and I'm not sure how to get around that. So it outputs:
0x0000000
I'm also having issues with
tracefile.seekg(2, ios::cur);
It doesn't seem to work, which is why I've changed it to ios::beg just to try to get something working, although obviously that won't be usable once I try to read multiple lines one after another.
Any help would be appreciated.
ifstream::get() will attempt to produce a null-terminated C string, which you haven't provided enough space for.
You can either:
Allocate char address[11]; (or bigger) to hold a null-terminated string longer than 9 characters.
Use ifstream::read() instead to read the 10 bytes without a null-terminator.
Edit:
If you want a buffer that can dynamically account for the length of the line, use std::getline with a std::string.
std::string buffer;
tracefile.seekg(2, ios::beg);
std::getline( tracefile, buffer );
Edit 2
If you only want to read to the next whitespace, use:
std::string buffer;
tracefile.seekg(2, ios::beg);
tracefile >> buffer;
Make the buffer bigger, so that you can read the entire input text into it, including the terminating '\0'. Or use std::string, which doesn't have a pre-determined size.
There are several issues with your code. The first is that
seekg( 2, ios::beg ) is undefined behavior unless the stream
is opened in binary mode (which yours isn't). It will work
under Unix, and depending on the contents of the file, it
might work under Windows (but it could also send you to the
wrong place). On some other systems, it might systematically
fail, or do just about anything else. You cannot reliably seek
to arbitrary positions in a text stream.
The second is that if you want to read exactly 10 characters,
the function you need is istream::read, and not
istream::get. On the other hand, if you want to read up to
the next white space, using >> into a string will work best.
If you want to limit the number of characters extracted to a
maximum, set the width before calling >>:
std::string address;
// ...
tracefile >> std::setw( 10 ) >> address;
This avoids all issues of '\0', etc.
Finally, of course, you need error checking. You should
probably check whether the open succeeded before doing anything
else, and you should definitely check whether the read succeeded
before using the results. (As you've written the code, if the
open fails for any reason, you have undefined behavior.)
If you're reading multiple lines, of course, the best solution
is usually to use std::getline to read each line into a
string, and then parse that string (possibly using
std::istringstream). This prevents the main stream from
entering error state if there is a format error in the line, and
it provides automatic resynchronization in such cases.
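A sketch of that line-by-line structure, assuming the trace format from the question ("R 0x00000000", one record per line):
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream tracefile("trace.txt");
    if (!tracefile) {
        std::cerr << "open failed\n";
        return 1;
    }
    std::string line;
    while (std::getline(tracefile, line)) {
        std::istringstream parser(line);
        char op;
        std::string address;
        if (parser >> op >> address) {
            std::cout << op << ' ' << address << '\n';
        }
        // a malformed line only fails the istringstream; tracefile
        // stays good and the loop resynchronizes on the next line
    }
}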

Seeking istreambuf_iterator <wchar_t> clarifications, reading a complete text file of Unicode characters

In the book “Effective STL” by Scott Meyers, there is a nice example of reading an entire text file into a std::string object:
std::string sData;
/*** Open the file for reading, binary mode ***/
std::ifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode
/*** Read in all the data from the file into one string object ***/
sData.assign (std::istreambuf_iterator <char> (ifFile),
std::istreambuf_iterator <char> ());
Note that it reads it in as 8-bit characters. This works very well. Recently, though, I have a need to read a file containing Unicode text (i.e., two bytes per char). However, when I try to (naively) change it to read the data from a Unicode text file into a std::wstring object like so:
std::wstring wsData;
/*** Open the file for reading, binary mode ***/
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode
/*** Read in all the data from the file into one string object ***/
wsData.assign (std::istreambuf_iterator <wchar_t> (ifFile),
std::istreambuf_iterator <wchar_t> ());
The string that I get back, while being of wide characters, still has the alternate nulls. For example, if the file contains the Unicode string "ABC", the bytes of the file (ignoring the Unicode lead bytes of 0xFF, 0xFE) are:
<'A'> <0> <'B'> <0> <'C'> <0>
The first code fragment above would correctly result in the following contents of the (char) string:
sData [0] = 'A'
sData [1] = 0x00
sData [2] = 'B'
sData [3] = 0x00
sData [4] = 'C'
sData [5] = 0x00
However, when the second code fragment is run, it undesirably results in the following contents of the (wchar_t) string:
wsData [0] = L'A'
wsData [1] = 0x0000
wsData [2] = L'B'
wsData [3] = 0x0000
wsData [4] = L'C'
wsData [5] = 0x0000
It's as if the file were still being read byte by byte and then simply translated into individual wchar_t characters.
I would have thought that std::istreambuf_iterator, being specialized on wchar_t, should have resulted in the file being read two bytes at a time, shouldn't it? If not, what's its purpose then?
I have traced into the templates (no easy feat ;-), and the iterator does indeed still seem to be reading the file byte by byte and passing it on to its internal convert routine which dutifully states that conversion is done after each byte (not only after receiving 2 bytes).
I have searched a number of sites on the web (including this one) for this seemingly trivial task but have not found an explanation of this behavior or a good alternative that does not involve more code than I feel should be necessary (e.g., A Google search of the web produces that same second code fragment as a viable piece of code as well).
The only thing that I have found that works is the following, and I consider it to be a cheat, as it needs direct access to the wstring's internal buffer and then type-coerces it at that.
std::wstring wsData;
/*** Open the file for reading, binary mode ***/
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary); // Open for input, binary mode
wsData.resize (<Size of file in bytes> / sizeof (wchar_t));
ifFile.read ((char *) &wsData [0], <Size of file in bytes>);
Oh, and to forestall the inevitable "Why open the file in binary mode, why not in text mode?" question: that open is intentional, since if the file were opened in text mode (the default), CR/LF ("\r\n" or 0x0D0A) sequences would be converted into just LF ("\n" or 0x0A) sequences, whereas a pure byte read of the file preserves them. Regardless, for those diehards, changing that had, unsurprisingly, no effect.
So two questions here: why does the second case not work as one might expect (i.e., what is going on with those iterators), and what would be your favorite "kosher STL" way of loading a file of Unicode characters into a wstring?
What am I missing here; it has to be something silly.
Chris
You must be disappointed with SO to have received no answers to your first question after
four and a half months. It is a good question, and most good questions are answered
(well or badly) within minutes. Two of the likely reasons for the neglect of yours are:
You did not tag it "C++", so many C++ programmers who might have been able to help will never have
noticed it. (I have now tagged it "C++".)
Your question is about unicode stream-handling, which is no-one's idea of cool coding.
The misconception that has thwarted your investigations seems to be this: You appear to
believe that a wide-character stream, std::wfstream, and wide-character string, std::wstring,
are respectively the same as a "unicode stream" and a "unicode string", and specifically that
they are respectively the same as a UTF-16 stream and a UTF-16 string. Neither of these things is true.
An std::wifstream (std::basic_ifstream<wchar_t>) is an input stream that converts an
external sequence of bytes to an internal sequence of wchar_t, according to a specified
or default encoding of the external sequence.
Likewise an std::wofstream (std::basic_ofstream<wchar_t>) is an output stream that
converts an internal sequence of wchar_t to an external sequence of bytes, according to a
specified or default encoding of the external sequence.
And an std::wstring (std::basic_string<wchar_t>) is a string type that simply stores
a sequence of wchar_t, without knowledge of the encoding - if-any - from which they resulted.
Unicode is a family of byte-sequence encodings - UTF-8/-16/-32, and some more obscure others -
related by the principle that UTF-N encodes alphabets using a sequence of 1 or more
N-bit units per symbol. UTF-16 is apparently the encoding you are trying to read
into a std::wstring. You say:
I would have thought that the std::istreambuf_iterator, being specialized to wchar_t, should have resulted in the file being read two bytes at a time, shouldn't it? If not, what's its purpose then?
But once you know that wchar_t is not necessarily 2 bytes wide (it is in Microsoft's C libraries,
both 32 and 64-bit, but in GCC's it is 4 bytes wide), and also that a UTF-16 code-point (character)
need not fit into 2 bytes (it can require 4), you will see that that specifying an extraction
unit of wchar_t cannot be all there is to decoding a UTF-16 stream.
When you construct and open your input stream with:
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);
It is prepared to extract characters (of some alphabet) from "MyFile.txt" into values
of type wchar_t and it will extract those characters from the byte-sequence in the
file according to the encoding specified by the std::locale
that is operative on the stream when it does the extracting.
Your code does not specify an std::locale for your stream, so the library's default takes effect.
That default is the global C++ locale, which in turn by default is the
"C" locale; and the "C" locale assumes
the "identity encoding" of I/O byte sequences, i.e. 1 byte = 1 character (
setting aside the newline exception for text-mode I/O).
Thus, when you employ your std::istreambuf_iterator<wchar_t> to
extract the characters, the extraction proceeds by converting each byte
in the file to a wchar_t which it appends to the std::wstring wsData. The bytes
in the file are, as you say:
0xFF, 0xFE, 'A', 0x00, 'B', 0x00, 'C', 0x00
The first two, which you discount as "unicode lead bytes", are indeed a
UTF-16 byte-order mark (BOM) but in the default encoding they just are what they are.
Accordingly the wide-characters assigned to wsData are, as you observed:
0x00FF, 0x00FE, L'A', 0x0000, L'B', 0x0000, L'C', 0x0000
It's as if the file were still being read byte by byte and then just simply translated into individual wchar_t characters.
because that is precisely what is happening.
To stop this happening, you need to do something before you start extracting characters from the stream
to tell it that it is supposed to decode a UTF-16 character sequence. The way to do that
is conceptually rather tortuous. You need to imbue
the stream with an std::locale that possesses an
std::locale::facet that is an instantiation of
std::codecvt<InternT, ExternT, StateT> (or is derived from such),
which will provide the stream with the correct methods for decoding UTF-16 into wchar_t.
But the gist of this is that you need to plug the right UTF-16 encoder/decoder into the stream and
in practice it is (or should be) simple enough. I am guessing that your compiler is a recent MS VC++.
If that's right, then, you can fix your code by:
Adding #include <locale> and #include <codecvt> to your headers
Adding the line:
ifFile.imbue(std::locale(ifFile.getloc(),new std::codecvt_utf16<wchar_t,0x10ffff,std::little_endian>));
right after:
std::wifstream ifFile ("MyFile.txt", std::ios_base::binary);
The effect of this new line is to "imbue" ifFile with a new locale that is the same
as the one it already had - ifFile.getloc() - but with a modified encoder/decoder facet
- std::codecvt_utf16<wchar_t,0x10ffff,std::little_endian>. This codecvt facet is
one that will decode UTF-16 characters with a maximum value of 0x10ffff into little-endian
wchar_t values (0x10ffff being the maximum value of UTF-16 code-points).
When you debug into the code thus amended you will now find that wsData is only 4 wide-characters long
and that those characters are:
0xFEFF, L'A', L'B', L'C'
as you expect them to be, with the first one being the UTF-16 little-endian BOM.
Notice that the order FE,FF is the reverse of what it was before application
of the codecvt facet, showing us that little-endian decoding was done as requested.
And it needed to be. Just edit the new line by removing std::little_endian,
debug it again, and you will then find that the first element of wsData becomes 0xFFFE
and that other three wide-characters become pictograms of the
IICore pictographic
character-set (if your debugger can display them). (Now, whenever a colleague
complains in amazement that their code is turning English Unicode into "Chinese",
you will know a likely explanation.)
Should you want to populate wsData without the leading BOM, you can do that by
amending the new line again and replacing std::little_endian with
std::codecvt_mode(std::little_endian|std::consume_header)
Finally, you may well have noted a bug in the new code, namely that a 2-byte wchar_t
is insufficiently wide to represent the UTF-16 code-points between 0x10000 and 0x10ffff
that could be read.
You will get away with this as long as all the code-points you have to read lie in the
UTF-16 Basic Multilingual Plane,
which spans [0,0xffff], and you might know that all inputs will forever obey that
constraint. Otherwise, a 16-bit wchar_t is not fit for purpose. Replace:
wchar_t with char32_t
std::wstring with std::basic_string<char32_t>
std::wifstream with std::basic_ifstream<char32_t>
and the code is fully fit to read an arbitrary UTF-16 encoded file into a string.
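Putting those substitutions together, a sketch of the char32_t version (illustrative only; note that <codecvt> and std::codecvt_utf16 were later deprecated in C++17):
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

int main() {
    std::basic_ifstream<char32_t> ifFile("MyFile.txt", std::ios_base::binary);
    // decode little-endian UTF-16, consuming the BOM
    ifFile.imbue(std::locale(ifFile.getloc(),
        new std::codecvt_utf16<char32_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    std::basic_string<char32_t> wsData(
        (std::istreambuf_iterator<char32_t>(ifFile)),
        std::istreambuf_iterator<char32_t>());
}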
(Readers who are working with the GNU C++ library will find that as of v4.7.2
it does not yet provide the <codecvt> standard header. The header <bits/codecvt.h> exists and presumably will some day graduate to being <codecvt>, but at this point it only
exports the specializations class codecvt<char, char, mbstate_t> and
class codecvt<wchar_t, char, mbstate_t>, which are respectively the identity
conversion and the conversion between ASCII/UTF-8 and wchar_t. To solve the OP's problem
you need to subclass std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
yourself, as per this answer)

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X characters of a multibyte char set (such as UTF-8), then you'll either have to read a (single-byte) char at a time (e.g. using getchar(), or X chars, speculatively, using istream::getline()) and test the MBCS signals yourself, or use a third-party library to do it.
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
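When the charset is UTF-8 specifically, the variable-length test is simple enough to do by hand: every continuation byte has the bit pattern 10xxxxxx, so a new code point begins at each byte that does not match it. A sketch under that assumption (readCodePoints is an illustrative name, not a standard function):
#include <istream>
#include <string>

// Read up to n UTF-8 code points from 'in'; assumes well-formed UTF-8.
std::string readCodePoints(std::istream& in, std::size_t n) {
    std::string out;
    std::size_t count = 0;
    char c;
    while (in.get(c)) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) { // start of a code point
            if (count == n) { // already have n complete code points
                in.unget();
                break;
            }
            ++count;
        }
        out.push_back(c);
    }
    return out;
}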
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.