For a small program, seen here, I found out that with gcc/libstdc++ and clang++/libc++ reading file contents into a string works as intended with std::string itself:
std::string filecontents;
{
    std::ifstream t(file);
    std::stringstream buffer;
    buffer << t.rdbuf();
    filecontents = buffer.str();
}
Later on I modify the string. E.g.
ending_it = std::find(ending_it, filecontents.end(), '$');
*ending_it = '\\';
auto ending_pos
= static_cast<size_t>(std::distance(filecontents.begin(), ending_it));
filecontents.insert(ending_pos + 1, ")");
This worked even if the file included non-ASCII characters like a Greek lambda. I never searched for these Unicode characters, but they were in the string. Later on I output the string to std::cout.
Is this guaranteed to work in C++17 (and beyond)?
The question is: under what conditions can I read file contents into a std::string via std::ifstream, work on the string like above, and expect things to work correctly?
As far as I know, std::string uses char, which has only 1 byte.
Therefore it surprised me that the method worked with non-ASCII chars in the file.
Thanks to @user4581301 and @PeteBecker for their helpful comments, which made me understand the problem.
The question stems from a wrong mental model of std::string, or more fundamentally a wrong model of char.
This is nicely explained here and here.
I implicitly thought that a char holds a "character" in the more colloquial sense and therefore knows about its encoding. Instead a char really only holds a single byte (in C++; in C it's defined slightly differently). Therefore it is always well-defined to read a file into a string, as a string is first and foremost only an array of bytes.
This also means that reading a file in an encoding where a "character" can span multiple bytes results in those characters spanning multiple indices in the std::string.
This can be seen when outputting a single char from the string.
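A minimal sketch illustrating this (it assumes the source code and the terminal both use UTF-8):
#include <iostream>
#include <string>

int main() {
    std::string s = "\xCE\xBB";     // the two bytes of the UTF-8 encoding of λ
    std::cout << s.size() << '\n';  // prints 2: one "character", two bytes/indices
    std::cout << s << '\n';         // a UTF-8 terminal renders the single glyph λ
    std::cout << s[0] << '\n';      // a single char is only the first byte, not a full character
}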
Luckily, whenever the file is ASCII- or UTF-8-encoded, the byte representation of an ASCII character can only ever appear when encoding that character. This means that searching the file's string for an ASCII character will find exactly those characters and nothing else. Therefore the above operations of searching for '$' and inserting a substring after an index that points to an ASCII character will not corrupt the characters in the string.
Outputting the string to a terminal then just hands over the bytes to be interpreted by the terminal. If the terminal knows UTF-8, it will interpret the bytes accordingly.
Thanks for taking your time to read this!
I'm parsing a file with input redirection and having trouble reading integers and characters.
Without using getline(), how do you read in the file including integers, characters, and any amount of whitespace? (I know the >> operator can skip whitespace but fails when it hits a character.)
Thanks!
The first thing you need to realise is that, fundamentally, there are no things like "integers" in your file. Your file does not contain typed data: it contains bytes.
Now, since C++ doesn't support any text encodings, for our purposes here we can consider bytes equivalent to "characters". (In reality, you'll probably layer something like a UTF-8 support library on top of your code, at which point "characters" takes on a whole new meaning. But we'll save that discussion for another day.)
At the most basic, then, we can just extract a bunch of bytes. Let's say 50 at a time:
std::ifstream ifs("filename.dat");
static constexpr const size_t CHUNK_SIZE = 50;
char buf[CHUNK_SIZE];
while (ifs.read(buf, CHUNK_SIZE) || ifs.gcount() > 0) {
    // the gcount() check keeps the final, partial chunk at the end of the file
    const size_t num_extracted = static_cast<size_t>(ifs.gcount());
    parseData(buf, num_extracted);
}
The function parseData would then examine those bytes in whatever manner you see fit.
For many text files this is unnecessarily arduous. So, as you've discovered, the IOStreams part of the C++ Standard Library provides us with some shortcuts. For example, std::getline will read bytes up to a delimiter, rather than reading a certain quantity of bytes.
Using this, we can read in things "line by line" — assuming a "line" is a sequence of bytes terminated by a \n (or \r\n if your platform performs line-ending translation, and you haven't put the stream into binary mode):
std::ifstream ifs("filename.dat");
std::string line;
while (std::getline(ifs, line)) {
    parseLine(line);
}
Instead of \n you can provide, as a third argument to std::getline, some other delimiter.
The other facility it offers is operator>>, which will pick out tokens (sequences of bytes delimited by whitespace) and attempt to "lexically cast" them; that is, it'll try to interpret friendly human ASCII text into C++ data. So if your input is "123 abc", you can pull out the "123" into an int with value 123, and the "abc" into another string.
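For example, a minimal sketch of that extraction (using a std::istringstream as a stand-in for the redirected input):
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::istringstream in("123 abc");  // stand-in for std::cin or an std::ifstream
    int n;
    std::string word;
    if (in >> n >> word) {             // >> skips the whitespace between the tokens
        std::cout << (n + 1) << ' ' << word << '\n';  // prints: 124 abc
    }
}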
If you need more complex parsing, though, you're back to the initial offering, and to the conclusion of my answer: read everything and parse it byte-by-byte as you see fit. To help with this, there's sscanf inherited from the C standard library, or spooky incantations from Boost; or you could just write your own algorithms.
The above is true of any compatible input stream, be it a std::ifstream, a std::istringstream, or the good old ready-provided std::istream instance named std::cin (which I guess is how you're accepting the data, given your mention of input redirection: shell scripting?).
Well, this is a direct follow-up to this question; I decided to split the problem in two. Originally I posted the whole picture to avoid getting another close vote for being an "XY problem". For now, consider that I already know the character encoding.
I read a string from a file using std::getline. This file is encoded in a format I know, say UTF-16 big-endian.
But not "all" files are UTF16 (actually most are UTF8), I prefer to have as little code-copying as possible.
Now my first response is to "just read the bytes" and "then do the conversion to UTF-8", skipping the conversion if the input is already UTF-8. So I read it first into a std::string (please ignore the ugliness of OpenFilestreams()[file_index]):
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
    std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
After this I obviously have a nonsense string, as the bytes are ordered as if the string were UCS-2/UTF-16. So how can I convert this std::string to another std::string with UTF-8 byte ordering? Or should I do this at the line-reading level (or even when opening the file stream)?
I prefer to keep to the C++11 standard, maybe Boost/ICU if it is really better (I already have Boost, but no ICU library on my PC).
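For reference, this is roughly the kind of conversion helper I have in mind, a C++11-only sketch using <codecvt> (those facilities are deprecated since C++17, and the function name is just a placeholder), in case it clarifies the question:
#include <codecvt>
#include <locale>
#include <string>

// Placeholder helper: takes the raw UTF-16 big-endian bytes that std::getline
// produced and returns a UTF-8 encoded std::string.
std::string utf16be_bytes_to_utf8(const std::string& raw)
{
    // UTF-16 bytes (big-endian is the default) -> UTF-32 code points
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> utf16conv;
    std::u32string codepoints = utf16conv.from_bytes(raw);

    // UTF-32 code points -> UTF-8 bytes
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf8conv;
    return utf8conv.to_bytes(codepoints);
}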
I have a std::wstring whose size is 139,580,199 characters.
For debugging I printed it into file with this code:
std::wofstream f(L"C:\\some file.txt");
f << buffer;
f.close();
After that I noticed that the end of the string is missing. The created file size is 109,592,584 bytes (and the "size on disk" is 109,596,672 bytes).
I also checked whether buffer contains null chars:
size_t pos = buffer.find(L'\0');
I expected the result to be std::wstring::npos, but it is 18446744073709551615; my string doesn't have a null char at the end, so it's probably OK.
Can somebody explain why not all of the string was printed into the file?
A lot depends on the locale, but typically, files on disk will not use the same encoding form (or even the same encoding) as that used by wchar_t; the filebuf which does the actual reading and writing translates the encodings according to its imbued locale. And there is only a vague relationship between the length of a string in different encodings or encoding forms. (And the size the system sees doesn't correspond directly to the number of bytes you can read from the file.)
To see if everything was written, check the status of f after the close, i.e.:
f.close();
if ( !f ) {
    // Something went wrong...
}
One thing that can go wrong is that the external encoding doesn't have a representation for one of the characters. If you're in the "C" locale, this could occur for any character outside of the basic execution character set.
If there is no error above, there's no reason off hand to assume that not all of the string has been written. What happens if you try to read it in another program? Do you get the same number of characters or not?
For the rest, nul characters are characters like any others in a std::wstring; there's nothing special about them, including when they are output to a stream. And 18446744073709551615 looks very much like the value I would expect for std::wstring::npos on a 64-bit machine.
EDIT:
Following up on Mat Petersson's comment: it's actually highly unlikely that the file ends up with fewer bytes than there are code points in the std::wstring. (std::wstring::size() returns the number of code points.) I was thinking in terms of bytes, not in terms of what std::wstring::size() returns. So the most likely explanation is that you have some characters in your string which aren't representable in the target encoding (which probably only supports characters with code points 32-126, plus a few control characters, by default).
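As a hedged sketch of the direction suggested above, imbue the stream with a locale whose external encoding covers your data before writing anything (the locale name is an assumption; which named locales exist varies by platform, and std::locale throws if the name is unknown):
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wofstream f("some file.txt");
    // "en_US.UTF-8" is a common POSIX spelling; it may not exist on every system.
    f.imbue(std::locale("en_US.UTF-8"));   // must happen before any output
    std::wstring buffer = L"\x03BB plus other non-ASCII text";  // Greek lambda
    f << buffer;
    f.close();
    if (!f) {
        // a character could not be represented, or the write failed
    }
}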
I have a problem with wchar_t* to char* conversion.
I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that string is correct.
Assume that the wide string is "New Text File.txt".
In the Visual Studio debugger, hovering over the variable shows "N" and some unknown Chinese letters, though in the Watch window the string is represented correctly.
When I try to convert wchar to char with wcstombs
wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);
it converts just two letters to char* ("Ne") and then generates an error.
Some internal error in wcstombs.c at function _wcstombs_l_helper() at this block:
if (*pwcs > 255) /* validate high byte */
{
    errno = EILSEQ;
    return (size_t)-1; /* error */
}
It's not thrown up as exception.
What can be the problem?
In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.
Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:
The maximum number of bytes that can be stored in the multibyte output string.
Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that each wide character in the input string will generate exactly one byte in the output string for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:
If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.
Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:
A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.
The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:
The size of the file name portion of the record, in bytes.
Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).
Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:
If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.
So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().
Putting this all together, here is a snippet of code that does what you are trying to do:
size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!
Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName when you're done with them to clean up.
ONE LAST THING
After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,
FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;
If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:
typedef struct _FILE_NOTIFY_INFORMATION {
    DWORD NextEntryOffset;
    DWORD Action;
    DWORD FileNameLength;
    WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;
When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable-length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct to be an extra two bytes long (so that its length is an integer multiple of the size of an int), which is why a second WCHAR is copied as well.
My guess is that the wide string that you are passing is invalid or incorrectly defined.
How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?
wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);
The error you get says it all: it found a character that it cannot convert to MB (because it has no representation in MB). Source:
If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns –1 cast to type size_t and sets errno to EILSEQ
In cases like this you should avoid 'assumed' input, and give an actual test case that fails.
The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. A char is always the same size, as is a wchar_t. So if you need to read X characters of a multibyte character set (such as UTF-8), you'll either have to read a (single-byte) char at a time (e.g. using istream::get(), or X chars speculatively using istream::getline()) and test the MBCS signals yourself, or use a third-party library to do it.
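As an illustration, a minimal sketch of that byte-at-a-time approach for UTF-8 (read_utf8_codepoints is a hypothetical helper; it assumes the stream is positioned at a code-point boundary and contains valid UTF-8):
#include <cstddef>
#include <istream>
#include <string>

// Hypothetical helper: reads up to n UTF-8 code points, byte by byte.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
std::string read_utf8_codepoints(std::istream& in, std::size_t n)
{
    std::string out;
    std::size_t count = 0;
    char c;
    while (count < n && in.get(c)) {
        out += c;
        int next = in.peek();
        bool continuation = next != std::istream::traits_type::eof()
                            && (static_cast<unsigned char>(next) & 0xC0) == 0x80;
        if (!continuation)
            ++count;   // the next byte starts a new code point (or we hit EOF)
    }
    return out;
}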
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
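A minimal sketch of that overload (the member istream::getline, which takes a count), assuming a single-byte fixed-width encoding:
#include <fstream>

int main() {
    std::ifstream fileStream("file.txt");
    char buf[11];                          // room for 10 chars plus the null terminator
    fileStream.getline(buf, sizeof(buf));  // stops after 10 chars or at '\n';
                                           // sets failbit if the buffer fills before '\n'
}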
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
std::streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.