Convert UCS-2 inside character array to UTF-8 std::string - c++

This is a direct follow-up to this question; I decided to split the problem in two. Originally I posted the whole picture to avoid another close as an "XY problem". For now, assume I already know the character encoding.
I read a string from a file using std::getline. The file is encoded in a format I know (say, UTF-16 big-endian).
But not all files are UTF-16 (most are actually UTF-8), and I would like to duplicate as little code as possible.
My first idea is to "just read the bytes" and "then do the conversion to UTF-8", skipping the conversion if the input is already UTF-8. So I read it into a std::string first (please ignore the ugliness of OpenFilestreams()[file_index]):
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
After this I obviously have a nonsense string, as the bytes are laid out as UCS-2/UTF-16. So how can I convert this std::string into another std::string that holds the UTF-8 byte sequence? Or should I do this at the line-reading level (or even when opening the file stream)?
I would prefer to stick to the C++11 standard, or maybe Boost/ICU if it is really better (I already have Boost, but no ICU library on my PC).
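One way to do the conversion by hand, staying within C++11 and without Boost or ICU, is sketched below. The function names utf16be_to_utf8, utf16be_unit_at and append_utf8 are made up for this illustration, and the sketch assumes the std::string holds complete UTF-16 big-endian code units with no BOM. Note that std::getline splits on the single byte 0x0A, which in UTF-16 is only half of a newline code unit, so a line read that way may carry a stray 0x00 byte at one end; reading the whole file in binary mode first avoids that.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read the big-endian 16-bit code unit that starts at byte index i.
std::uint32_t utf16be_unit_at(const std::string& bytes, std::size_t i) {
    return (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 8)
           | static_cast<unsigned char>(bytes[i + 1]);
}

// Convert raw UTF-16BE bytes (no BOM) held in a std::string to UTF-8.
std::string utf16be_to_utf8(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must have an even number of bytes");
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint32_t cp = utf16be_unit_at(bytes, i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 3 >= bytes.size())
                throw std::invalid_argument("truncated surrogate pair");
            std::uint32_t low = utf16be_unit_at(bytes, i + 2);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 2;  // the pair consumed two code units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unexpected low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Files that are already UTF-8 can simply skip the call, so the rest of the code only ever sees UTF-8 in a std::string.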

Related

What happens when I read a file into a string

For a small program, seen here, I found out that with GCC (libstdc++) and Clang (libc++) reading file contents into a string works as intended with std::string itself:
std::string filecontents;
{
std::ifstream t(file);
std::stringstream buffer;
buffer << t.rdbuf();
filecontents = buffer.str();
}
Later on I modify the string. E.g.
ending_it = std::find(ending_it, filecontents.end(), '$');
*ending_it = '\\';
auto ending_pos
= static_cast<size_t>(std::distance(filecontents.begin(), ending_it));
filecontents.insert(ending_pos + 1, ")");
This worked even if the file included non-ascii characters like a greek lambda. I never searched for these unicode characters, but they were in the string. Later on I output the string to std::cout.
Is this guaranteed to work in C++17 (and beyond)?
The question is: What are the conditions, under which I can read file contents into std::string via std::ifstream, work on the string like above and expect things to work correctly.
As far as I know, std::string uses char, which has only 1 byte.
Therefore it surprised me that the method worked with non-ascii chars in the file.
Thanks @user4581301 and @PeteBecker for their helpful comments, which made me understand the problem.
The question stems from a wrong mental model of std::string, or more fundamentally a wrong model of char.
This is nicely explained here and here.
I implicitly thought that a char holds a "character" in a more colloquial sense and therefore knows its encoding. Instead, a char really only holds a single byte (in C++; in C it's defined slightly differently). Therefore it is always well-defined to read a file into a string, as a string is first and foremost only an array of bytes.
This also means that reading a file in an encoding where a "character" can span multiple bytes results in those characters spanning multiple indices in the std::string.
This can be seen, when outputting a single char from the string.
Luckily, whenever the file is ASCII- or UTF-8-encoded, the byte value of an ASCII character can only ever appear as that character: in UTF-8, every byte of a multi-byte sequence has its high bit set (0x80 or above). This means that searching the file's string for an ASCII character finds exactly those characters and nothing else. Therefore the above operations, searching for '$' and inserting a substring after an index that points to an ASCII character, will not corrupt the characters in the string.
Outputting the string to a terminal then just hands the bytes over to be interpreted by the terminal. If the terminal understands UTF-8, it will interpret the bytes accordingly.
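A small sketch of that property, using a hypothetical helper escape_first_dollar that mirrors the edit from the question (with an added check for the case where no '$' is found). The input is assumed to be valid UTF-8; the Greek lambda is written as its two UTF-8 bytes 0xCE 0xBB so the example does not depend on the source file's encoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Replace the first '$' with a backslash and insert ')' right after it.
// The search only matches the ASCII byte 0x24, which never occurs inside
// a multi-byte UTF-8 sequence, so non-ASCII characters stay intact.
std::string escape_first_dollar(std::string text) {
    auto it = std::find(text.begin(), text.end(), '$');
    if (it == text.end())
        return text;  // nothing to do
    *it = '\\';
    auto pos = static_cast<std::size_t>(std::distance(text.begin(), it));
    text.insert(pos + 1, ")");
    return text;
}
```

For the input "\xCE\xBB = $x$" the lambda occupies two indices of the string (size() reports 8 bytes for 7 visible characters), and those two bytes are still in place after the edit.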

how to convert utf8 to std::string?

I am working on this code which receives a cpprest sdk response containing a base64_encoded payload which is a json. here is my code snippet:
typedef std::wstring string_t; //defined in basic_types.h in cpprest lib
void demo() {
http_response response;
//code to handle response ...
json::value output= response.extract_json();
string_t payload = output.at(L"payload").as_string();
vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
std::string utf8_payload(base64_encoded_payload.begin(), base64_encoded_payload.end()); //in debugger I see the Japanese chars are garbled.
string_t utf16_payload = utf8_to_utf16(utf8_payload); //in debugger I see the Japanese chars are good here
//then I need to process the utf8_payload which is an xml.
//I have an API available to process the xml which takes an string
processXML(utf16_payload); //need to convert utf16_payload to a string here;
}
I also tried this and I see str contains garbled chars!
#include <codecvt> // for codecvt_utf8_utf16
#include <locale> // for wstring_convert
#include <string> // for string, wstring
void wstr2str(void) {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
std::wstring japanese = L"北島 美奈";
std::string str = conversion.to_bytes(japanese); //str is garbled:(
}
My question is: can UTF-8 containing Japanese characters be converted to std::string without being garbled?
Update: I gained access to the processXML() code and changed the input argument type to std::wstring and it worked.
I figured when the xml was getting created, it was converting the std::string to wstring; however, it was not turning out good!
void processXML(std::wstring xmlStrBuf) { //changed xmlStrBuf to wstring and it worked
// more code
CComBSTR xmlBuff = xmlStrBuf.c_str();
VARIANT_BOOL bSuccess = false;
xmlDoc->loadXML(xmlBuff, &bSuccess);
//more code
}
Thanks for the answers; they were helpful in pointing out that the string is only storage.
You are confusing different concepts here.
Storage
This is how we save/store/hold our data. A std::string is a collection of chars, which are bytes. A std::wstring is a collection of wchar_ts, which are sometimes 2-byte wide values (but this is not guaranteed!).
Encoding
This is what the data means, and how it should be interpreted. A std::string, a collection of bytes, could hold UTF-8, or UTF-16, or UTF-32, or ASCII, or ShiftJIS, or morse code, or a JPEG, or a movie, or my DNA (lucky string!).
There are some strong conventions in play in the world. For example, on Windows, a std::wstring is generally accepted to hold UTF-16 (because the two-byte storage is convenient for this, and also because that's how the Windows API does it).
Newer versions of C++ give us things like std::u16string and std::u32string as well, which still do not directly have any notion of encoding, but are intended to be used for UTF-16 and UTF-32 respectively because their names make that intention more obvious to readers of code. C++20 introduces std::u8string, which is intended to signify a UTF-8 encoded string (and is otherwise more or less like a std::string).
But these are just conventions. Nothing about the type std::string says "UTF-8" or any other thing. It doesn't know about or care about or enforce any encoding. It just stores bytes.
So, your question about "converting UTF-8 to std::string" does not really make any sense; it's like asking how to convert a road into a car.
"What should I do, then?"
Well, Base64 is also not an encoding. Well, actually, it totally is, but it's an encoding on top of the string encoding. It's a way of transmitting/escaping/sanitising the raw bytes, not a way of describing how to interpret them later. By asking cpprest to convert from Base64, that's just transforming the way the raw bytes are provided. That's why it gives you a std::vector<unsigned char> rather than a std::string because, although (as discussed above) std::string doesn't care about encoding, we sometimes use a vector of bytes to really, properly, completely say that "this collection does not have any particular encoding, so please don't try to guess from convention or whatever what the encoding is in this use case; all it knows is that it is a bunch of bytes". This is down to opinion. Some people will still use a std::string for that; the authors of cpprest decided not to.
The point is that the use of the function from_base64 cannot tell us anything about the encoding of the text that you've retrieved. For that, we have to go back to the documentation for the text. We have no access to that, and you did not tell us anything about it. If it were just a JSON string, the encoding would be down to the cpprest JSON library and so you'd already be done. However, it's not: it's something packed into a Base64 representation by whoever created the JSON object. Again, that information is not something that you shared with us.
But, based on the variable names you've chosen, the data you're looking at is already UTF-8. You've then attempted to convert it to UTF-16, which is rather the opposite of what you've described you wanted to do.
(Similarly, in your second example, you've taken a std::wstring that [probably] already stores UTF-16 thanks to the L"wide string literal", then told the computer that it's UTF-8 and to convert it "again" to UTF-16, then extracted the raw bytes into a std::string. None of that makes sense.)
Instead, why not literally just processXML(utf8_payload);?
General advice
Encoding can be quite complex, although it's significantly easier to deal with once you've wrapped your mind around the basic concepts of all these layers of abstraction. For the future, and for this question if you wish to clarify it, you will need to ensure that you are absolutely clear, at each stage of the "pipeline" of your data as it gets transmitted from place A to place B, and gets converted from type C to type D, and whatever else, about what encoding it should be at each of those steps. If you want to change the encoding at one of those steps, then do so (though this should be rare!). But before you write any code make sure that you know for sure what it is that you need, otherwise you'll get yourself in a massive tangle.
Eventually you'll start to detect patterns that can help, though. For example, if you were expecting some delicious non-ASCII output and instead see strange text with lots of "Å" characters in it, that's probably UTF-8 that's being interpreted as ASCII by mistake. That's because of the way that the special sequence denoting Unicode codepoints larger than one byte in UTF-8 often starts with a byte whose numerical value is the same as that of the letter "Å" in ASCII (well, ISO/IEC 8859, but close enough).
Similarly, if you get Japanese and didn't expect it, in my experience that's usually because you've given the computer some bytes and told it that they are a string in UTF-16 encoding, when actually they were UTF-8. You just get more experienced at recognising these patterns as you work more, and it can help you to fix your bugs faster.
Just last week the last example there saved me quite a bit of time: I knew immediately that my source data must have been UTF-8, and was therefore able to quickly decide to remove the byte-copy into a std::wstring that I'd been attempting. Examining the bytes in an encoding-agnostic way revealed the "Å" pattern as well and then that was that. This was important because I had no documentation for the data source and thus no way to just look up what the encoding was supposed to be. I had to guess/deduce it. Hopefully that won't be the case for you here.
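"Examining the bytes in an encoding-agnostic way" can be as simple as a hex dump. Here is a minimal sketch; to_hex is a name invented for this example, not part of any library mentioned above.

```cpp
#include <cstddef>
#include <string>

// Render every byte of `bytes` as two lowercase hex digits, separated by
// spaces, without interpreting the data as text in any encoding.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0)
            out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

In such a dump, UTF-8 text shows up as runs of bytes at or above 0x80 between plain ASCII values, whereas mostly-ASCII UTF-16 text shows a 00 byte in every other position.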
std::string is just a container for 8-bit wide chars, and does not know/care about the encoding. Always think in symbols (letters, numbers, punctuation, etc.). The first 128 characters (0-127) were defined by the ASCII standard, so a single char is enough to store each of those symbols. With all the languages and symbols there are, we couldn't represent each of them with just 256 possibilities. The UTF-8 encoding deals with this problem by allowing a single symbol to take 1, 2, 3 or 4 chars. But, for the std::string object, this is entirely transparent and it's still dealing with a series of chars.
The reason why you're thinking the string is garbled is probably because your debugger assumes the contents of the std::string is always 1 symbol per char (extended ASCII for example), and as such, it's displaying the wrong characters.
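To make the difference between chars and symbols concrete, here is a small sketch that counts code points in a std::string. The helper name count_utf8_code_points is made up, and it assumes the string holds valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte that
// is not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is valid UTF-8; no validation is done.
std::size_t count_utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the two kanji 北 and 美 separated by a space, stored as the bytes "\xE5\x8C\x97 \xE7\xBE\x8E", size() reports 7 while the function reports 3. A debugger that shows one glyph per char will display 7 odd-looking symbols even though the data is fine.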
Edit: you might want to read this post also.

Correctly reading a utf-16 text file into a string without external libraries?

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:
I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.
I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved Win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?
edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.
edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?
edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.
The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know), would be:
#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
// open as a byte stream
std::wifstream fin("text.txt", std::ios::binary);
// apply BOM-sensitive UTF-16 facet
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
// read
for(wchar_t c; fin.get(c); )
std::cout << std::showbase << std::hex << c << '\n';
}
When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. There are some UTF-16 characters that will have one of those bytes as half of the character code and will mess up the reading of the file. This is not a bug, it is intentional behavior and is the sole reason for having separate text and binary modes.
For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
Edit:
So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.
The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.
You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.
Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):
1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb/mbrtoc32
7. c16rtomb/mbrtoc16
And what each one does:
1. A codecvt facet that always converts between UTF-8 and UTF-32
2. Converts between UTF-8 and UTF-16
3. Converts between UTF-8 and UCS-2 or UCS-4 depending on the size of the target element (characters outside the BMP are probably truncated)
4. Converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. Converts between UTF-8 and UTF-16
6. If the macro __STDC_UTF_32__ is defined, these functions convert between the current locale's char encoding and UTF-32
7. If the macro __STDC_UTF_16__ is defined, these functions convert between the current locale's char encoding and UTF-16
If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine, since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows specific facility, such as the _O_U16TEXT mode on files.
This should build and run anywhere, but makes a bunch of assumptions to actually work:
#include <cstring> // for std::memcpy
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
int main ()
{
std::stringstream ss;
std::ifstream fin("filename", std::ios::binary); // binary mode: 0x1a and 0x0d are not treated specially
ss << fin.rdbuf(); // dump file contents into a stringstream
std::string const &s = ss.str();
if (s.size()%sizeof(wchar_t) != 0)
{
std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
return 1;
}
std::wstring ws;
ws.resize(s.size()/sizeof(wchar_t));
if (!ws.empty())
std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring
}
You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically, so you need to do that manually.
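A rough sketch of those missing pieces, written portably by decoding into a std::u16string instead of relying on the size of wchar_t. The function name utf16_bytes_to_units is made up, and treating data without a BOM as little-endian is an assumption that matches typical Windows files, not something the standard prescribes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode raw UTF-16 bytes into 16-bit code units.
// - A leading BOM (FE FF = big-endian, FF FE = little-endian) selects the
//   byte order and is dropped from the result.
// - Without a BOM, little-endian is ASSUMED.
// - Each "\r\n" pair is collapsed to "\n", since binary mode does not
//   translate Windows newlines.
std::u16string utf16_bytes_to_units(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must have an even number of bytes");

    bool big_endian = false;
    std::size_t start = 0;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) {
            big_endian = true;
            start = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            start = 2;
        }
    }

    std::u16string units;
    units.reserve((bytes.size() - start) / 2);
    for (std::size_t i = start; i < bytes.size(); i += 2) {
        const unsigned hi = static_cast<unsigned char>(bytes[big_endian ? i : i + 1]);
        const unsigned lo = static_cast<unsigned char>(bytes[big_endian ? i + 1 : i]);
        const char16_t unit = static_cast<char16_t>((hi << 8) | lo);
        if (unit == u'\n' && !units.empty() && units.back() == u'\r')
            units.back() = u'\n';  // collapse CR LF into LF
        else
            units.push_back(unit);
    }
    return units;
}
```

On Windows, where wchar_t is 16 bits wide, the result can be copied into a std::wstring with std::wstring(units.begin(), units.end()).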

Problem with getline and "strange characters"

I have a strange problem,
I use
wifstream a("a.txt");
wstring line;
while (a.good()) //!a.eof() not helping
{
getline (a,line);
//...
wcout<<line<<endl;
}
and it works nicely for txt file like this
http://www.speedyshare.com/files/29833132/a.txt
(sorry for the link, but it is just 80 bytes so it shouldn't be a problem to get it , if i c/p on SO newlines get lost)
BUT when I add for example 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples )to any line that is the line where loading stops. I was under the wrong impression that getline that takes wstring as one input and wifstream as other can chew any txt input...
Is there any way to read every single line in the file even if it contains funky characters?
The not-very-satisfying answer is that you need to imbue the input stream with a locale which understands the particular character encoding in question. If you don't know which locale to choose, you can use the empty locale.
For example (untested):
std::wifstream a("a.txt");
std::locale loc("");
a.imbue(loc);
Unfortunately, there is no standard way to determine what locales are available for a given platform, let alone select one based on the character encoding.
The above code puts the locale selection in the hands of the user, and if they set it to something plausible (e.g. en_AU.UTF-8) it might all Just Work.
Failing this, you probably need to resort to third-party libraries such as iconv or ICU.
Also relevant this blog entry (apologies for the self-promotion).
The problem is with your call to the global function getline (a,line). This takes a std::string. Use the std::wistream::getline method instead of the getline function.
C++ fstreams delegate I/O to their filebufs. filebufs always read "raw bytes" from disk and then use the stream locale's codecvt facet to convert these raw bytes into their "internal encoding".
A wfstream is a basic_fstream<wchar_t> and thus has a basic_filebuf<wchar_t> which uses the locale's codecvt<wchar_t, char> to convert the bytes read from disk into wchar_ts. If you read a UCS-2 encoded file, the conversion must thus be performed by a codecvt that "knows" that the external encoding is UCS-2. You thus need a locale with such a codecvt (see, for example, this SO question).
By default, the stream's locale is the global locale at the stream construction. To use a specific locale, it should be imbue()-d on the stream.

Using C++, how do I read a string of a specific length, from a non-binary file?

The cplusplus.com example for reading text files shows that a line can be read using the getline function. However, I don't want to get an entire line; I want to get only a certain number of characters. How can this be done in a way that preserves character encoding?
I need a function that does something like this:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
resultStream << getstring(fileStream, 10); // read first 10 chars
file.ftell(10); // move to the next item
resultStream << getstring(fileStream, 10); // read 10 more chars
I thought about reading to a char buffer, but wouldn't this change the character encoding?
I really suspect that there's some confusion here regarding the term "character." Judging from the OP's question, he is using the term "character" to refer to a char (as opposed to a logical "character", like a multi-byte UTF-8 character), and thus for the purpose of reading from a text-file the term "character" is interchangeable with "byte."
If that is the case, you can read a certain number of bytes from disk using ifstream::read(), e.g.
ifstream fileStream;
fileStream.open("file.txt", ios::in);
char buffer[1024];
fileStream.read(buffer, sizeof(buffer));
Reading into a char buffer won't affect the character encoding at all. The exact sequence of bytes stored on disk will be copied into the buffer.
However, it is a different story if you are using a multi-byte character set where each character is variable-length. If characters are not fixed-size, there's no way to read exactly N characters from disk with a single disk read. This is not a limitation of C++, this is simply the reality of dealing with block devices (disks). At the lowest levels of your OS, block devices are addressed in terms of blocks, which in turn are made up of bytes. So you can always read an exact number of bytes from disk, but you can't read an exact number of logical characters from disk, unless each character is a fixed number of bytes. For character-sets like UTF-8 where each character is variable length, you'll have to either read in the entire file, or else perform speculative reads and parse the read buffer after each read to determine if you need to read more.
C++ itself doesn't have a concept of character encoding. chars are always the same size, as are wchar_ts. So if you need to read X chars of a multibyte char set (such as UTF-8) then you'll either have to read a (single byte) char at a time (e.g. using istream::get(), or X chars, speculatively, using istream::getline()) and test the MBCS signals yourself, or use a third-party library to do it.
If the charset is a fixed width encoding, and you don't mind stopping when you get to a newline, then getline(), which allows you to specify the maximum number of chars to read, is probably what you want.
As a few people have mentioned, the C/C++ Standard Libraries don't really provide anything that operates above essentially byte level. So if you're wanting to do this using only the core libraries you don't have a ready made option.
Which leaves either checking if your chosen platform(s) provide another library that implements this capability, writing your own parser for handling character encodings, or punching something like "c++ utf8 library" or "posix unicode" into Google and taking a look at what turns up.
Possible interesting hits:
UTF-8 and Unicode FAQ
UTF-CPP
I'll leave further investigation to the reader.
I think you can use the sgetn member function of the stream's associated streambuf...
char buf[32];
streamsize i = fileStream.rdbuf()->sgetn( &buf[0], 10 );
Which will read 10 chars into buf (if there are 10 available to read), returning the number of chars read.