Convert utf8 wstring to string on windows in C++ - c++

I am representing folder paths with boost::filesystem::path which is a wstring on windows OS and I would like to convert it to std::string with the following method:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());
but unfortunatelly the result of the following text is this:
"c:\git\myproject\bin\árvíztűrőtükörfúrógép" ->
"c:\git\myproject\bin\árvíztűrőtükörfúrógép"
What do I do wrong?
#include <string>
#include <locale>
#include <codecvt>
int main()
{
// wide character data
std::wstring wstr = L"árvíztűrőtükörfúrógép";
// wide to UTF-8
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
std::string str = conv1.to_bytes(wstr);
}
I was checking the value of the variable in visual studio debug mode.

The code is fine.
You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.
I was checking the value of the variable in visual studio debug mode.
Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.
So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.
Nothing is wrong here.
If you were to output the string like std::cout << str, and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.
You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window type str,s8; or, in the Command window, type ? &str[0],s8. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".
It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely the code page Windows-1252 given the platform. ASCII only defines the lower 7 bits. Historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not at all considered by the component rendering the string to your screen, let alone specifically its UTF-8-ness.

Related

C++ output Unicode in variable

I'm trying to output a string containing unicode characters, which is received with a curl call. Therefore, I'm looking for something similar to u8 and L options for literal strings, but than applicable for variables. E.g.:
const char *s = u8"\u0444";
However, since I have a string containing unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong and how can I achieve the correct output. I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded-text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
// ü : U+00FC
// ß : U+00DF
const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest you using the UTF-8 encoding (and for example storing text in std::strings) in your cross-platform C++ portions of code, and convert to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs, that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;
int main()
{
string s="mit freundlichen Grüßen";
cout << s << endl;
return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which unicode encoding does the string contain. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately or else the client cannot know how to interpret the response. For example Content-Type: application/json; charset=utf-8.
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have had support for UTF-8 a long time now.
Answer regarding printing to terminal:
Assuming that
UnicodeString indeed contains an UTF-8 encoded string
and that the terminal uses UTF-8 encoding
and the font that the terminal uses has the graphemes that you use
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions don't hold.
Whether 1. is true, you can verify by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out what encoding does the string actually use, and either convert the encoding, or configure the terminal to use that encoding.
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UFT-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide unicode, but even that support is deprecated). You can find the unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
In the case the character encoding of the terminal doesn't have the needed grapehemes - or if you don't want to implement encoding conversion - is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test if the font itself has the required graphemes simply by typing the characters into the terminal and see if they show as they should - although, this test will also fail if the terminal encoding does not have the graphemes, so check that first. Manual of your terminal should explain how to change the font, should it be necessary. That said, I would expect üß to exist in most fonts.

Converting UTF16(Windows wchar_t) to UTF8 in C++ Non-English letters corrupted(Korean)

I'm trying to make a multiplatform app. On the Windows Store App(winrt) side, open a file and read its path in Platform::String format which is wchar_t, UTF16 in Windows.
Since my core logic is platform independent and only use standard C++ data types, I've converted the path into std::string in UTF8 via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >, wchar_t > convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in debugger, it shows corrupted letters for non-English chars. Far as I know, UTF-8 is perfectly capable of encoding non-English languages since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote it to a file and examine it, it printed out correctly.

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n"
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; Although Microsoft's implementation of streams supports other multibyte encodings it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
#include <codecvt>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::puts(convert.to_bytes(L"こんにちは世界\n"));
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement you own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode you want) to Windows console without depending on locales, console code pages and even without using wide characters.
It may not look very straightforward, but it's convenient and reusable solution, which is also able to give you a portable utf8-everywhere style user code. Please, don't beat me for my English :)
Or you can change Windows locale to Japanese.

Can't read unicode (japanese) from a file

Hi I have a file containing japanese text, saved as unicode file.
I need to read from the file and display the information to the stardard output.
I am using Visual studio 2008
int main()
{
wstring line;
wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
//myfile.imbue(locale("Japanese_Japan"));
if(!myfile)
cout<<"While opening a file an error is encountered"<<endl;
else
cout << "File is successfully opened" << endl;
//wcout.imbue (locale("Japanese_Japan"));
while ( myfile.good() )
{
getline(myfile,line);
wcout << line << endl;
}
myfile.close();
system("PAUSE");
return 0;
}
This program generates some random output and I don't see any japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not unicode on windows. The only way you'll ever see Japanese characters in a console application is if you set your non-unicode (ANSI) locale to Japanese. Which will also make backslashes look like yen symbols and break paths containing european accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is is UTF-8? UTF-16 (and if so, little or big endian?) Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
Someone here had the same problem with Russian characters (He's using basic_ifstream<wchar_t> wich should be the same as wifstream according to this page). In the comments of that question they also link to this which should help you further.
If understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If not so, you will be in trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, 128);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read in lines from the UCS-2LE file into strings and parse out the field values and then write them out to a ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline – while it reads the string from the file, functions like substr(start, length) do interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc.. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading an UTF-16 (or UCS2-LE) file is apparently manageable in C++11, see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since the boost::locale library is now part of C++11, one can just use codecvt_utf16 (see bullet below for eventual code samples)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in test so I think that kind of complications are ok there) to execute my task.
Hope this spares others some time, happy to help.
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the files from the locale encoding to wchar_t, which will cause each byte becoming its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.