boost.log issue creating a file with a UTF8 file name - c++

I use boost.log to create multiple log files, one per string value. But when the string is UTF-8 encoded, the created file has an incorrect name (like this: è°.æ­¦ç).
BOOST_LOG_SCOPED_LOGGER_ATTR(Logger::motion_checker, "RoleName", boost::log::attributes::constant< std::string >(name))
typedef boost::log::sinks::asynchronous_sink<boost::log::sinks::text_multifile_backend> multifile_sink;
boost::shared_ptr<multifile_sink> sink(new multifile_sink);
sink->locked_backend()->set_file_name_composer(boost::log::sinks::file::as_file_name_composer(
boost::log::expressions::stream << "./log/MotionCheck/" << boost::log::expressions::attr< std::string >("RoleName") << ".log"));
sink->set_formatter
(
boost::log::expressions::format("[%1%] - %2%")
% boost::log::expressions::attr< boost::posix_time::ptime >("TimeStamp")
% boost::log::expressions::smessage
);
sink->set_filter(channel == motion_check_channel);
core->add_sink(sink);
How can I make boost.log handle UTF-8 file names?

Boost.Log composes the file name in the encoding that is native for the underlying operating system. On Windows the file name is a UTF-16 string (the character type is wchar_t), on most POSIX systems it is typically UTF-8 (the character type is char).
In order to produce the file name in the native encoding the as_file_name_composer adapter creates a stream that performs character code conversion as needed when the adapted formatter is invoked. This basically lets you use both narrow and wide strings in the formatter as long as the encoding can be converted to the native one. You have to know though that the same-typed strings are assumed to have the same encoding, so if the native multibyte encoding is UTF-8 then all your narrow strings must also be UTF-8.
When character code conversion happens, the stream uses a locale that you can provide as the second argument to as_file_name_composer. By default the locale is default-constructed. If your default-constructed locale is not UTF-8, the conversion will produce an incorrect result, which I think is what's happening here. You have to either set up your global locale to be UTF-8, or create a UTF-8 locale and pass it to the as_file_name_composer adapter. You can use Boost.Locale to generate a UTF-8 locale easily.

Related

Converting C++ std::wstring to utf8 with std::codecvt_xxx

C++11 has tools to convert wide char strings std::wstring from/to utf8 representation: std::codecvt, std::codecvt_utf8, std::codecvt_utf8_utf16 etc.
Which one can a Windows app use to convert regular wide char Windows strings (std::wstring) to a UTF-8 std::string? Does it always work without configuring locales?
Depends how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.
Now usually when one says "Unicode", one means UTF16, which is what Microsoft Windows uses, and that is usually what wstring contains.
So, the right way to convert from UTF8 to UTF16:
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = L"blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );
And to add to the confusion:
When you use std::string on a Windows platform (for example, in a multibyte build), it is NOT UTF8. It is ANSI.
More specifically, it is the default code page your Windows installation is using.
Also, note that wstring is not exactly the same as UTF-16.
When compiling in Unicode the windows API commands expect these formats:
CommandA - multibyte - ANSI
CommandW - Unicode - UTF16
Seems that std::codecvt_utf8 works well for conversion std::wstring -> utf8. It passed all my tests. (Windows app, Visual Studio 2015, Windows 8 with EN locale)
I needed a way to convert filenames to UTF8. Therefore my test is about filenames.
In my app I use boost::filesystem::path 1.60.0 to deal with file paths. It works well, but it is not able to convert filenames to UTF8 properly.
Internally, the Windows version of boost::filesystem::path uses std::wstring to store the file path. Unfortunately, the built-in conversion to std::string works badly.
Test case:
create file with mixed symbols c:\test\皀皁皂皃的 (some random Asian symbols)
scan dir with boost::filesystem::directory_iterator, get boost::filesystem::path for the file
convert it to std::string via the built-in conversion filenamePath.string()
you get c:\test\?????. The Asian symbols are converted to '?'. Not good.
boost::filesystem uses std::codecvt internally. By default it does not work for the conversion std::wstring -> std::string in this case.
Instead of the built-in boost::filesystem::path conversion, you can define your own conversion function like this:
#include <codecvt>
#include <locale>
#include <string>

// Converts a wide string to a UTF-8 encoded narrow string.
std::string wstring_to_utf8(const std::wstring & str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}
Then you can convert the file path to UTF8 easily: wstring_to_utf8(filenamePath.wstring()). It worked perfectly in my tests.
It works for any filepath. I tested ASCII strings c:\test\test_file, Asian strings c:\test\皀皁皂皃的, Russian strings c:\test\абвгд, mixed strings c:\test\test_皀皁皂皃的, c:\test\test_абвгд, c:\test\test_皀皁皂皃的_абвгд. For every string I receive valid UTF8 representation.
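One caveat worth testing separately: all of the strings above are inside the Basic Multilingual Plane. std::codecvt_utf8<wchar_t> is specified as a UCS-2 conversion when wchar_t is 16 bits wide, and std::wstring_convert together with the <codecvt> facets is deprecated since C++17. Below is a minimal hand-written sketch of a UTF-16 to UTF-8 conversion that also combines surrogate pairs. The function name utf16_to_utf8 and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example, not something from the answer above.

```cpp
#include <cstddef>
#include <string>

// Sketch: converts UTF-16 code units to UTF-8 bytes.
// Surrogate pairs are combined into one code point; an unpaired
// surrogate is replaced with U+FFFD (a choice made for this example).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)            // high surrogate
        {
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                 // consume the low surrogate
            }
            else
            {
                cp = 0xFFFD;                         // unpaired high surrogate
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;                             // stray low surrogate
        }

        if (cp < 0x80)                               // 1 byte
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)                         // 2 bytes
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)                       // 3 bytes
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else                                         // 4 bytes
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits, a std::wstring w could be passed as std::u16string(w.begin(), w.end()). That step assumes the wide string really holds UTF-16.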

Converting UTF16(Windows wchar_t) to UTF8 in C++ Non-English letters corrupted(Korean)

I'm trying to make a multiplatform app. On the Windows Store App (WinRT) side, I open a file and read its path in Platform::String format, which is wchar_t (UTF-16) on Windows.
Since my core logic is platform independent and only use standard C++ data types, I've converted the path into std::string in UTF8 via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >, wchar_t > convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in the debugger, it shows corrupted letters for non-English chars. As far as I know, UTF-8 is perfectly capable of encoding non-English languages since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote it to a file and examined it, it printed out correctly.
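If you would rather not trust how the debugger renders the string, you can look at the raw bytes instead. Here is a small sketch that dumps a std::string as hex; the helper name to_hex_bytes is made up for this example.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Sketch: returns the bytes of `s` as space-separated uppercase hex pairs,
// so the encoding can be checked independently of how a debugger or
// console chooses to display the string.
std::string to_hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0)
        {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For example, the Korean syllable U+D55C is ED 95 9C in UTF-8, so a path containing it should show those three bytes in the dump.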

Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file which contains a string table, I would like to use some special
characters: ö ü ó ú ő ű á é. so I save the string with UTF-8 encoding.
But when I call in my cpp file, something like this:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result:
How do I solve this problem?
As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag for the first parameter is the magic that enables this conversion. Example:
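#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
    // Determine the size required for the destination buffer.
    // Because the input length is -1, the result includes the terminating null.
    const int length = MultiByteToWideChar(CP_UTF8,
                                           0,   // no flags required
                                           pszUtf8String,
                                           -1,  // automatically determine length
                                           nullptr,
                                           0);
    if (length == 0)
    {
        // Call the GetLastError() function for additional error information.
        throw std::runtime_error("The MultiByteToWideChar function failed");
    }
    // Allocate a buffer of the appropriate length.
    std::wstring utf16String(length, L'\0');
    // Call the function again to do the conversion.
    if (!MultiByteToWideChar(CP_UTF8,
                             0,
                             pszUtf8String,
                             -1,
                             &utf16String[0],
                             length))
    {
        // Uh-oh! Something went wrong.
        // Handle the failure condition, perhaps by throwing an exception.
        // Call the GetLastError() function for additional error information.
        throw std::runtime_error("The MultiByteToWideChar function failed");
    }
    // Drop the terminating null that the API wrote into the buffer.
    utf16String.resize(length - 1);
    // Return the converted UTF-16 string.
    return utf16String;
}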
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.

Convert hexadecimal into unicode character

In C++, I would like to save hexadecimal string into file as unicode character
Ex: 0x4E3B save to file ---> 主
Any suggestions or ideas are appreciated.
What encoding? I assume UTF-8.
What platform?
If you are on Linux, then:
std::locale loc("en_US.UTF-8"); // or "" for system default
std::wofstream file;
file.imbue(loc); // use the UTF-8 locale for this stream
file.open("file.txt");
wchar_t cp = 0x4E3B;
file << cp;
However, if you need Windows, it is quite a different story:
You need to convert the code point to UTF-8 yourself, and there are many ways to do it. One option is to convert it to UTF-16 first (a code point bigger than 0xFFFF becomes a surrogate pair), then use WideCharToMultiByte with CP_UTF8, and then save the result to the file.
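Another of the "many ways" is to encode the code point by hand, which works the same on every platform. A minimal sketch, assuming UTF-8 is the wanted encoding; the function name encode_utf8 and the decision to return an empty string for invalid input are choices made for this example.

```cpp
#include <string>

// Sketch: encodes one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)                                   // 1 byte: 0xxxxxxx
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)                             // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)                           // 3 bytes
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            return out;                              // surrogates are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)                         // 4 bytes
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0x4E3B this yields the three bytes E4 B8 BB. Writing them through a std::ofstream opened with std::ios::binary avoids any newline or locale translation on the way to the file.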

_wfopen equivalent under Mac OS X

I'm looking for the equivalent of Windows _wfopen() under Mac OS X. Any idea?
I need this in order to port a Windows library that uses wchar_t* for its File interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path and give it to the library.
POSIX API in Mac OS X are usable with UTF-8 strings. In order to convert a wchar_t string to UTF-8, it is possible to use the CoreFoundation framework from Mac OS X.
Here is a class that wraps a UTF-8 string generated from a wchar_t string.
class Utf8
{
public:
Utf8(const wchar_t* wsz): m_utf8(NULL)
{
// OS X uses 32-bit wchar
const int bytes = wcslen(wsz) * sizeof(wchar_t);
// comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
: kCFStringEncodingUTF32BE;
CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
(const UInt8*)wsz, bytes,
encoding, false,
kCFAllocatorNull
);
const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
m_utf8 = new char[bytesUtf8];
CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
CFRelease(str);
}
~Utf8()
{
if( m_utf8 )
{
delete[] m_utf8;
}
}
public:
operator const char*() const { return m_utf8; }
private:
char* m_utf8;
};
Usage:
const wchar_t* wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8(wsz);
FILE* file = fopen(utf8, "r");
This will work for reading or writing files.
You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.
If the path came from the stock Mac OS X frameworks (for example, an Open panel whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.
If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen.
Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored in the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed from those two components.
Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs because filesystem encoding is not necessarily identical to the generated UTF-8. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.
For example, they need to decide whether "é" will be stored as one or two code points (either LATIN SMALL LETTER E WITH ACUTE, or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These result in two different byte sequences of different lengths, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.
The rules for how to perform this canonical decomposition can get pretty hairy, so rather than try to implement it yourself it's best to leave it to the functions the system frameworks have provided for you to do the heavy lifting.
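To make the difference concrete, here are the two spellings of "é" written out as UTF-8 bytes. This is only an illustration of why a byte-wise comparison treats them as different names; the constant names are made up for this example, and it does not perform any normalization itself.

```cpp
#include <string>

// "é" as one precomposed code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::string e_acute_composed = "\xC3\xA9";

// "é" as two code points: U+0065 LATIN SMALL LETTER E
// followed by U+0301 COMBINING ACUTE ACCENT.
const std::string e_acute_decomposed = "e\xCC\x81";

// Byte-wise comparison, which is all that functions like strcmp do.
// It knows nothing about canonical equivalence.
bool same_bytes(const std::string& a, const std::string& b)
{
    return a == b;
}
```

The first string is 2 bytes long and the second is 3, so same_bytes reports them as different even though a user sees the same letter.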
@JKP:
Not all functions in MacOS X accept UTF8, but filenames and filepaths may be UTF8, thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF8.
See here. Quote:
How a file name looks at the API level
depends on the API. Current Carbon
APIs handle file names as an array of
UTF-16 characters; POSIX ones handle
them as an array of UTF-8, which is
why UTF-8 works well in Terminal. How
it's stored on disk depends on the
disk format; HFS+ uses UTF-16, but
that's not important in most cases.
Some other POSIX functions handle UTF8 as well. E.g. functions dealing with user names, group names or user passwords use UTF8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).
But not all handle UTF8. E.g. for all string functions a UTF8 string is just a normal C string, and characters above 127 have no special meaning. They don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle a char * pointer being passed to them differs from API to API. However, as a rule of thumb you can say:
Either the function only accepts C strings with pure ASCII characters (only in the range 0 to 127) or it will accept UTF8. It is unusual for a function to allow characters above 127 and interpret them in an encoding other than UTF8. If that really is the case, it is documented, and then there must be a way to pass the encoding along with the string.
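As a small illustration of "just a normal C string": strlen counts bytes, not characters. Counting code points in UTF-8 takes an extra step, sketched below under the assumption that the input is valid UTF-8. The function name utf8_code_points is made up for this example.

```cpp
#include <cstddef>
#include <cstring>

// Sketch: counts code points in a valid UTF-8 C string by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
// std::strlen on the same string returns the number of bytes instead.
std::size_t utf8_code_points(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s)
    {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

For the string "\xC3\xA9t\xC3\xA9" ("été"), std::strlen returns 5 while utf8_code_points returns 3.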
If you're using Cocoa it's fairly easy with NSString. Just load the UTF16 data in using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF8 version by calling UTF8String on the result. Then, just call fopen with your new UTF8 string as the param.
You can definitely call fopen with a UTF-8 string, regardless of language - can't help with C++ on OSX though - sorry.
I read a file name from a UTF8 configuration file through wifstream (which uses a wchar_t buffer).
The Mac implementation is different from Linux and Windows.
wifstream reads each byte from the file into a separate wchar_t cell in the buffer, so each cell holds one byte of data and 3 empty bytes, while open requires a char string. The programmer can therefore use the wcstombs function to convert the wide character string to a multibyte string.
The API supports UTF8. For a better understanding, inspect the buffer with a memory watcher and the file with a hex editor.