Converting C++ std::wstring to UTF-8 with std::codecvt_xxx

C++11 has tools to convert wide char strings std::wstring from/to a UTF-8 representation: std::codecvt, std::codecvt_utf8, std::codecvt_utf8_utf16, etc.
Which one is usable by a Windows app to convert a regular wide char Windows string std::wstring to a UTF-8 std::string? Does it always work without configuring locales?

It depends on how you convert them.
You need to specify the source encoding and the target encoding.
std::wstring is not a format; it just defines a data type.
Usually, when one says "Unicode" one means UTF-16, which is what Microsoft Windows uses and what a std::wstring usually contains.
So the right way to convert from UTF-8 to UTF-16 is:
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = L"blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );
And to add to the confusion:
When you use std::string on a Windows platform (e.g. in a multi-byte build), it is NOT UTF-8. It is ANSI, more specifically whatever default code page your Windows installation is using.
Also note that std::wstring is not exactly the same as UTF-16.
The Windows API functions come in two flavors and expect these formats:
CommandA - multi-byte - ANSI
CommandW - Unicode - UTF-16
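A minimal sketch of that last point, assuming plain Win32 (the function name ansi_to_utf8 is made up for illustration, and error handling is minimal): an ANSI std::string is converted to UTF-8 by going through UTF-16 with MultiByteToWideChar and WideCharToMultiByte.
#include <windows.h>
#include <string>

std::string ansi_to_utf8(const std::string& ansi)
{
    // ANSI (CP_ACP) -> UTF-16; -1 means "include the terminating null"
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, nullptr, 0);
    if (wlen == 0) return std::string();
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, &wide[0], wlen);

    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, nullptr, 0, nullptr, nullptr);
    if (ulen == 0) return std::string();
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &utf8[0], ulen, nullptr, nullptr);

    utf8.resize(ulen - 1);   // drop the terminating '\0' counted by the -1 lengths
    return utf8;
}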

It seems that std::codecvt_utf8 works well for the std::wstring -> UTF-8 conversion. It passed all my tests (Windows app, Visual Studio 2015, Windows 8 with an EN locale).
I needed a way to convert filenames to UTF-8, so my tests are about filenames.
In my app I use boost::filesystem::path 1.60.0 to deal with file paths. It works well, but it is not able to convert filenames to UTF-8 properly.
Internally the Windows version of boost::filesystem::path uses std::wstring to store the file path. Unfortunately, the built-in conversion to std::string works badly.
Test case:
create a file with mixed symbols: c:\test\皀皁皂皃的 (some random Asian symbols)
scan the dir with boost::filesystem::directory_iterator and get the boost::filesystem::path for the file
convert it to a std::string via the built-in conversion filenamePath.string()
you get c:\test\?????. The Asian symbols are converted to '?'. Not good.
boost::filesystem uses std::codecvt internally, and it doesn't work for the std::wstring -> std::string conversion.
Instead of the built-in boost::filesystem::path conversion you can define a conversion function like this (original snippet, with the misleading name fixed):
#include <codecvt>
#include <locale>
#include <string>

std::string wstring_to_utf8(const std::wstring & str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
Then you can convert a file path to UTF-8 easily: wstring_to_utf8(filenamePath.wstring()). It works perfectly.
It works for any file path. I tested ASCII strings c:\test\test_file, Asian strings c:\test\皀皁皂皃的, Russian strings c:\test\абвгд, and mixed strings c:\test\test_皀皁皂皃的, c:\test\test_абвгд, c:\test\test_皀皁皂皃的_абвгд. For every string I received a valid UTF-8 representation.
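For reference, a sketch of the test loop described above, assuming the wstring_to_utf8 helper from the previous snippet and the c:\test directory from the test case; the output file name is made up. It writes every filename as raw UTF-8 bytes so they can be inspected in a UTF-8-aware editor.
#include <boost/filesystem.hpp>
#include <fstream>

void dump_filenames_as_utf8()
{
    namespace fs = boost::filesystem;
    std::ofstream out("filenames_utf8.txt", std::ios::binary);   // keep the UTF-8 bytes untouched
    for (fs::directory_iterator it(L"c:\\test"), end; it != end; ++it)
    {
        // path::wstring() returns the native UTF-16 form on Windows;
        // convert it ourselves instead of relying on path::string()
        out << wstring_to_utf8(it->path().wstring()) << "\n";
    }
}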

Related

Converting UTF-16 (Windows wchar_t) to UTF-8 in C++: non-English letters corrupted (Korean)

I'm trying to make a multiplatform app. On the Windows Store App (WinRT) side, I open a file and read its path as a Platform::String, which is wchar_t, i.e. UTF-16 on Windows.
Since my core logic is platform independent and only uses standard C++ data types, I've converted the path into a UTF-8 std::string via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in the debugger, it shows corrupted letters for non-English chars. As far as I know, UTF-8 is perfectly capable of encoding non-English languages, since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote it to a file and examined it, it printed out correctly.
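For completeness, one way to check the bytes outside the debugger is to dump the converted string to a file opened in binary mode and open it in a UTF-8-aware editor (a rough sketch; the file name is made up):
#include <fstream>
#include <string>

void dump_utf8(const std::string& u8CopyPath)
{
    // binary mode so the UTF-8 bytes are written untouched
    std::ofstream out("u8CopyPath.txt", std::ios::binary);
    out << u8CopyPath;
}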

boost.log issue creating a file with a UTF-8 file name

I use boost.log to create multiple log files according to a string value. But when the string is UTF-8 encoded, the created file has an incorrect name (like this: è°.æ­¦ç).
BOOST_LOG_SCOPED_LOGGER_ATTR(Logger::motion_checker, "RoleName", boost::log::attributes::constant< std::string >(name))
typedef boost::log::sinks::asynchronous_sink<boost::log::sinks::text_multifile_backend> multifile_sink;
boost::shared_ptr<multifile_sink> sink(new multifile_sink);
sink->locked_backend()->set_file_name_composer(boost::log::sinks::file::as_file_name_composer(
boost::log::expressions::stream << "./log/MotionCheck/" << boost::log::expressions::attr< std::string >("RoleName") << ".log"));
sink->set_formatter
(
boost::log::expressions::format("[%1%] - %2%")
% boost::log::expressions::attr< boost::posix_time::ptime >("TimeStamp")
% boost::log::expressions::smessage
);
sink->set_filter(channel == motion_check_channel);
core->add_sink(sink);
How can I make boost.log handle a UTF-8 file name?
Boost.Log composes the file name in the encoding that is native for the underlying operating system. On Windows the file name is a UTF-16 string (the character type is wchar_t), on most POSIX systems it is typically UTF-8 (the character type is char).
In order to produce the file name in the native encoding the as_file_name_composer adapter creates a stream that performs character code conversion as needed when the adapted formatter is invoked. This basically lets you use both narrow and wide strings in the formatter as long as the encoding can be converted to the native one. You have to know though that the same-typed strings are assumed to have the same encoding, so if the native multibyte encoding is UTF-8 then all your narrow strings must also be UTF-8.
When character code conversion happens, the stream uses a locale that you can provide as the second argument for as_file_name_composer. By default the locale is default-constructed. If your default-constructed locale is not UTF-8 then the conversion will produce an incorrect result, which I think is what's happening here. You have to either set up your global locale to be UTF-8 or create a UTF-8 locale and pass it to the as_file_name_composer adapter. You can use Boost.Locale to generate a UTF-8 locale easily.
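A rough sketch of that last suggestion, reusing the sink from the snippet above: build a UTF-8 locale with boost::locale::generator and pass it as the second argument of as_file_name_composer (the locale name "en_US.UTF-8" is just an example).
#include <boost/locale.hpp>

boost::locale::generator gen;
std::locale utf8_locale = gen("en_US.UTF-8");   // any UTF-8 locale name works here

sink->locked_backend()->set_file_name_composer(boost::log::sinks::file::as_file_name_composer(
    boost::log::expressions::stream << "./log/MotionCheck/"
        << boost::log::expressions::attr< std::string >("RoleName") << ".log",
    utf8_locale));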

Convert wide CString to char*

This question has been asked many times, with as many answers - none of which work for me and, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not just in a specific instance.
void Dosomething(CString csFileName)
{
    char cLocFileNamestr[1024];
    char cIntFileNamestr[1024];
    // Convert from whatever version of CString is supplied
    // to an 8 bit char string
    cIntFileNamestr = ConvertCStochar(csFileName);
    sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt");
    m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, it will be filled with garbage characters if the Unicode string you are translating from contains characters that are not in the default code page.
Instead of fopen(), you could use _wfopen(), which opens a file with a Unicode filename. To create your file name, you would use swprintf_s().
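A rough sketch of that all-wide approach (a UNICODE MFC build is assumed, the helper name DosomethingW is made up, and error handling is minimal):
#include <cstdio>

void DosomethingW(const CString& csFileName)   // CString is wide in a UNICODE build
{
    wchar_t wLocFileName[1024];
    // build the file name entirely in wide chars with swprintf_s
    swprintf_s(wLocFileName, L"%ls_%ls", csFileName.GetString(), L"pling.txt");
    // open it with _wfopen instead of fopen, so no narrowing conversion happens
    FILE* file = _wfopen(wLocFileName, L"wt");
    if (file)
        fclose(file);
}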
an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
If you need to send data to a web page or some other UTF-8 system, then use CW2A and CA2W for the UTF-8/UTF-16 conversions.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
std::ofstream f(L"c:\\stuff\\okay.txt"); // the wide-path constructor is an MSVC extension
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

Why can't I convert UTF-16 text to another encoding on Windows using boost::locale::conv::between

My C++ code uses Boost to convert encodings.
If I compile and run the code on Cygwin it works OK, but if I compile the code directly on the Windows command line (cmd) with mingw-w64 or msvc11, the following code throws invalid_charset_error.
boost::locale::conv::between( encheckbeg, encheckend, consoleEncoding,
getCodingName(codingMethod) )
encheckbeg and encheckend are pointers to char.
consoleEncoding is a C string; it can be "Big5" or "UTF-8".
getCodingName returns a C string whose content is a charset name.
When getCodingName returns "UTF-16LE" or "UTF-16BE", I get the exception. I have tested other charset names like "Big5", "GB18030" and "UTF-8", and boost::locale::conv::between recognizes them, so I believe the problem is with UTF-16.
Does Boost's charset conversion depend on the OS locale mechanism, and is that why this problem appears? Why not use ICU to convert UTF-16? And how do I solve this problem?
Boost Locale is not a header-only library. There are 3 implementations:
ICU: uses the ICU4C library
iconv: uses the iconv library
wconv: uses the Windows API
wconv is the default choice when you build Boost.Locale with MSVC.
Unfortunately, the Windows APIs such as MultiByteToWideChar that are used to perform the conversion do not support UTF-16 as a multi-byte code page (take a look at the API description; I think the reason is that wchar_t (LPWSTR) is UTF-16 already...).
A possible solution is to add extra code for UTF-16, for example:
// needs <locale> and <codecvt>
std::string mbcs = std::string("...");
std::wstring wstr = boost::locale::conv::to_utf<wchar_t>(mbcs, "Big5"); // for Big5/GBK...
//wstr = boost::locale::conv::utf_to_utf<wchar_t>(utf8str);             // for UTF-8
std::wstring_convert<std::codecvt_utf16<wchar_t>> utf16conv;            // for UTF-16BE
//std::wstring_convert<std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>> utf16conv; // for UTF-16LE
std::string utf16str = utf16conv.to_bytes(wstr);
Of course, you can also build Boost.Locale against ICU. Just remember to build it first and to ship the required runtime libraries/files with your program.
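If Boost.Locale was built with ICU support, you can also select the ICU backend at runtime via boost::locale::localization_backend_manager, roughly like this (a sketch; backend names follow the Boost.Locale documentation):
#include <boost/locale.hpp>

boost::locale::localization_backend_manager mgr =
    boost::locale::localization_backend_manager::global();
mgr.select("icu");   // other backend names include "winapi" and "std"
boost::locale::localization_backend_manager::global(mgr);
// conversions performed after this point use the ICU backend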

Convert hexadecimal into unicode character

In C++, I would like to save a hexadecimal value into a file as the corresponding Unicode character.
Ex: 0x4E3B saved to the file ---> 主
Any suggestions or ideas are appreciated.
What encoding? I assume UTF-8.
What platform?
If you are on Linux, then:
std::locale loc("en_US.UTF-8"); // or "" for system default
std::wofstream file;
file.imbue(loc); // make the UTF-8 locale for the stream as default
file.open("file.txt");
wchar_t cp = 0x4E3B;
file << cp;
However, if you need Windows it is quite a different story:
You need to convert the code point to UTF-8. There are many ways. If it is bigger than 0xFFFF, convert it to UTF-16 first, then look up how to use WideCharToMultiByte, and then save the result to a file.
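A rough sketch of that Windows path (the helper name is made up and error handling is omitted): encode the code point as UTF-16, surrogate pair included if needed, convert it with WideCharToMultiByte, and write the raw UTF-8 bytes.
#include <windows.h>
#include <fstream>
#include <string>

void save_code_point_utf8(char32_t cp, const char* fileName)
{
    std::wstring utf16;
    if (cp <= 0xFFFF) {
        utf16 += static_cast<wchar_t>(cp);
    } else {
        cp -= 0x10000;
        utf16 += static_cast<wchar_t>(0xD800 + (cp >> 10));    // high surrogate
        utf16 += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }

    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), (int)utf16.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(), (int)utf16.size(),
                        &utf8[0], len, nullptr, nullptr);

    std::ofstream file(fileName, std::ios::binary);
    file << utf8;   // e.g. save_code_point_utf8(0x4E3B, "file.txt") writes 主
}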