My question seems to have confused folks. Here's something concrete:
Our code does the following:
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);
Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).
Under UNICODE build target, the above produces a garbage file full of ????.
I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.
Here's the question as it used to exist:
FILE* streams can be opened in t(ranslated) or b(inary) modes.
Desktop applications can be compiled for UNICODE or MBCS (under
Windows).
If my application is compiled for MBCS, then writing MBCS strings to a
"wt" stream results in a well-formed text file containing MBCS text
for the system code page (i.e. the code page "for non Unicode
software").
Because our software generally uses the _t versions of most string &
stream functions, in MBCS builds output is handled primarily by
puts(pszMBString) or something similar putc etc. Since
pszMBString is already in the system code page (e.g. 932 when
running on a Japanese machine), the string is written out verbatim
(although line terminators are massaged by puts and gets
automatically).
However, if my application is compiled for UNICODE, then writing MBCS
strings to a "wt" stream results in garbage (lots of "?????"
characters) (i.e. I convert the UNICODE to the system's default code
page and then write that to the stream using, for example,
fwrite(pszNarrow, 1, length, stream)).
I can open my streams in binary mode, in which case I'll get the
correct MBCS text... but, the line terminators will no longer be
PC-style CR+LF, but instead will be UNIX-style LF only. This, because
in binary (non-translated) mode, the file stream doesn't handle the
LF->CR+LF translation.
But what I really need, is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct
line terminators and MBCS text files using the system's code page.
Obviously I can manually adjust the line terminators myself and use
binary streams. However, this is a very invasive approach, as I now
have to find every bit of code throughout the system that writes text
files, and alter it so that it does all of this correctly. What blows
my mind, is that UNICODE target is stupider / less capable than the
MBCS target we used to use! Surely there is a way to toggle the C
library to say "output narrow strings as-is but handle line
terminators properly, exactly as you'd do in MBCS builds"?!
Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer is only going to cover Windows desktop applications, compiled for C++ with or without MFC.
Please Note: this pertains to wanting to read in and write out MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode, and you must handle BOM and line feed conversions manually (i.e. on input, you must skip the BOM (if any), and both convert the external encoding to Windows Unicode [i.e. UTF-16LE] as well as convert any CR+LF sequences to LF only; and for output, you must write the BOM (if any), and convert from UTF-16LE to whatever target encoding you want, plus you must convert LF to CR+LF sequences for it to be a properly formatted PC text file).
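The output half of that manual handling can be sketched as follows. This is only an illustration under stated assumptions: the helper name to_utf16le_file_bytes is hypothetical, the input is assumed to use LF-only line ends, and the result is meant to be written to a stream opened in binary mode.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: serialize UTF-16 code units as UTF-16LE bytes,
// prefixed with a BOM, expanding each LF to CR+LF.
// Assumes the input uses LF-only line ends (an existing CR+LF would
// come out as CR+CR+LF).
std::string to_utf16le_file_bytes(const std::u16string& text)
{
    std::string out;
    auto put = [&out](char16_t unit) {
        out.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };
    put(0xFEFF);                        // byte order mark, FF FE on disk
    for (char16_t unit : text) {
        if (unit == u'\n') put(u'\r');  // LF becomes CR+LF
        put(unit);
    }
    return out;
}
```

The returned bytes can then be passed to fwrite on a "wb" stream, so the C library performs no translation of its own.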
BEWARE of MS's std C library's puts and gets and fwrite and so on: on a stream opened in text/translated mode, they convert every 0x0A (LF) to a 0x0D 0x0A (CR+LF) sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, or a wide character, or a stream of random binary data -- it doesn't care, and all of these functions boil down to doing blind byte-conversions in text/translated mode!!!
Also be aware that many of the Windows API functions use CP_ACP internally, without any external control over their behavior (e.g. WritePrivateProfileString()). This is why one might want to ensure that all libraries operate with the same code page, CP_ACP and not some other one: since you can't control the behavior of those functions, you're forced to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));
I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++: those which use the thread's locale for their operations, and another set which use the multibyte code page set via _setmbcp() (which you can query via _getmbcp()). This is an actual code page (not a locale) that is used for all narrow string interpretation (NOTE: this is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software
The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.
It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless if _UNICODE is defined or not:
FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);
In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, so each of them is replaced with a question mark.
Perhaps you could use an encoding other than the current code page. I recommend using the UTF-8 encoding wherever possible.
Update: Testing your example code on a Windows machine running on code page 1252, the call to _fputts returns -1, indicating an error. errno was set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:
When a Unicode stream-I/O function operates in text mode (the
default), the source or destination stream is assumed to be a sequence
of multibyte characters. Therefore, the Unicode stream-input functions
convert multibyte characters to wide characters (as if by a call to
the mbtowc function). For the same reason, the Unicode stream-output
functions convert wide characters to multibyte characters (as if by a
call to the wctomb function).
This is key information for this error. wctomb will use the locale for the C standard library. By explicitly setting the locale for the C standard library to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded in Shift JIS in the output file.
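#include <locale.h>
#include <stdio.h>
#include <share.h>   // _SH_DENYNO

int main()
{
    // The wide text-mode stream functions convert as if by wctomb,
    // which follows the C locale: select code page 932 explicitly.
    setlocale(LC_ALL, ".932");
    FILE * fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    if (!fout) return 1;
    fputws(L"刃物種類\n", fout);
    fclose(fout);
}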
An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.
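A minimal sketch of doing the conversion yourself, using only the standard C library. The helper name narrow_in_current_locale is hypothetical. The target encoding is whatever LC_CTYPE currently selects, so on Windows you would first call setlocale(LC_CTYPE, ".932") as in the example above. Unlike the wide stream functions, this lets you detect an unconvertible character before anything is written.

```cpp
#include <cassert>
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string to the multibyte encoding of
// the current C locale (LC_CTYPE). Returns false if some character has no
// representation in that encoding (the library sets errno to EILSEQ).
bool narrow_in_current_locale(const std::wstring& wide, std::string& out)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide.c_str();

    // First pass with a null destination: only measure the byte count.
    const std::size_t needed = std::wcsrtombs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1))
        return false;

    // Second pass: convert into a buffer with room for the terminator.
    std::vector<char> buffer(needed + 1);
    src = wide.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(buffer.data(), &src, buffer.size(), &state);
    out.assign(buffer.data(), needed);
    return true;
}
```

The resulting std::string can be handed to fputs on a "wt" stream, which then only performs the line terminator translation.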
When you compile for UNICODE, c++ library knows nothing about MBCS. If you say you open the file for outputting text, it will attempt to treat the buffers you pass to it as UNICODE buffers.
Also, MBCS is variable-length encoding. To parse it, c++ library needs to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".
I would suggest that you either prepare your strings beforehand, or make your own function that writes string to file. Not sure if writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, putting everything that doesn't contain \n in one go.
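The piecewise idea could look like the sketch below. The helper name lf_to_crlf is hypothetical, and the approach assumes a code page in which the byte 0x0A never occurs inside a multibyte character (to my knowledge this holds for code page 932, whose trail bytes start at 0x40), so a plain byte search for LF is safe.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copy already-encoded narrow bytes, replacing each LF
// with CR+LF. Everything between line feeds is appended in one run.
// Assumes 0x0A is never part of a multibyte character in the encoding used.
std::string lf_to_crlf(const std::string& bytes)
{
    std::string out;
    out.reserve(bytes.size() + 16);
    std::size_t start = 0;
    for (;;) {
        const std::size_t lf = bytes.find('\n', start);
        if (lf == std::string::npos) {
            out.append(bytes, start, std::string::npos);  // trailing run
            break;
        }
        out.append(bytes, start, lf - start);  // run without any LF
        out += "\r\n";
        start = lf + 1;
    }
    return out;
}
```

The result is intended for a stream opened in binary mode, so the C library does not translate the line terminators a second time.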
Related
I ported some code from C to C++ and have just found a problem with paths that contain em-dash, e.g. "C:\temp\test—1.dgn". A call to fstream::open() will fail, even though the path displays correctly in the Visual Studio 2005 debugger.
The weird thing is that the old code that used the C library fopen() function works fine. I thought I'd try my luck with the wfstream class instead, and then found that converting my C string using mbstowcs() loses the em-dash altogether, meaning it also fails.
I suppose this is a locale issue, but why isn't em-dash supported in the default locale? And why can't fstream handle an em-dash? I would have thought any byte character supported by the Windows filesystem would be supported by the file stream classes.
Given these limitations, what is the correct way to handle opening a file stream that may contain valid Windows file names that doesn't just fall over on certain characters?
The em-dash is code point U+2014: a single UTF-16 code unit 0x2014 (bytes 0x14 0x20 in little endian), 0xE2 0x80 0x94 in UTF-8, and some other value, or no value at all, depending on the charset and code page used. The Windows-1252 code page (very common for western European languages) encodes the em-dash as 0x97.
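The UTF-8 byte sequence quoted above can be checked with a small encoder. This is a sketch for illustration only; utf8_encode is a hypothetical helper that assumes a valid Unicode scalar value and performs no validation.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (assumes a valid scalar value).
std::string utf8_encode(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling utf8_encode(0x2014) yields the three bytes 0xE2 0x80 0x94.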
Windows manages paths internally as UTF-16, so every time a function is called through its misnamed "ANSI" interface (functions ending in A), the path is converted to UTF-16 using the code page currently configured for the user.
The C and C++ runtime libraries, on the other hand, may be implemented on top of either the "ANSI" interface or the "Unicode" interface (functions ending in W). In the first case, the code page used to encode the string must match the code page used by the system. In the second case, either we use UTF-16 strings from the beginning, or the functions that convert to UTF-16 must be configured to use the code page of the source string for the mapping.
Yes, it is a complex problem. And several of the commonly proposed solutions are wrong or have problems:
Use wfstream instead of fstream: wfstream treats paths no differently than fstream. Not at all. It just means "treat the stream contents as wchar_t". (And it does that differently from what one might expect, which makes the class useless in most cases, but that is another story.) To use the Unicode interface, the Visual Studio implementation provides an overloaded constructor and open() function that accept const wchar_t*. These are overloaded for both fstream and wfstream. Use fstream with the right open().
mbstowcs(): The problem here is which locale (which carries the code page of the string) to use. If the default locale happens to match the system one, fine. If not, you can try mbstowcs_l(). But these are unsafe C functions, so you have to be careful with the buffer size. In any case, this approach only makes sense if the path to convert is obtained at run time. If it is a static string known at compile time, it is better to write it directly in your code.
L"C:\\temp\\test—1.dgn": The L prefix in the string doesn't mean "convert this string to UTF-16" (source code is usually stored in 8-bit characters), at least not in the Visual Studio implementation. The L prefix can behave as if it meant "add a 0x00 byte after each character between the quotes". So —, equivalent to byte 0x97 in a narrow (ordinary) string, becomes 0x97 0x00 in a wide (L-prefixed) string, not 0x14 0x20. It is better to use its universal character name instead: L"C:\\temp\\test\u20141.dgn"
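The universal character name form can be verified at compile time. A small sketch, in which the constant names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// A universal character name pins the code point regardless of how the
// source file itself is encoded.
constexpr wchar_t kEmDashPath[] = L"C:\\temp\\test\u20141.dgn";

// Index of the em-dash: it follows the 12 characters of C:\temp\test
constexpr std::size_t kDashIndex = 12;

static_assert(kEmDashPath[kDashIndex] == 0x2014, "em-dash is U+2014");
static_assert(kEmDashPath[kDashIndex + 1] == L'1', "the digit follows the dash");
```

Note that \u consumes exactly four hexadecimal digits, so the trailing 1 stays part of the file name.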
One popular approach is to always use either UTF-8 or UTF-16 in your code and convert only when strictly necessary. When you receive a string in some specific code page, first identify the right code page, then convert the string to UTF-8 or UTF-16. Which conversion function to use depends on where the string comes from. If you get your string from an XML file, the code page is usually stated there (and is usually UTF-8). If it comes from a Windows control, use a Windows API function such as MultiByteToWideChar (CP_ACP or GetACP() usually works as the default code page).
Always use fstream (not wfstream) with its wide interfaces (open and constructor), not its narrow ones. (You can again use MultiByteToWideChar to convert from UTF-8 to UTF-16.)
There are several articles and posts with advice on this approach. One that I recommend: http://www.nubaria.com/en/blog/?p=289.
This should work, provided that everything you do is in wide-char notation with wide-char functions. That is, use wfstream, but instead of using mbstowcs, use wide-string literals prefixed with the L char:
const wchar_t* filename = L"C:\\temp\\test—1.dgn";
Also, make sure your source file is saved as UTF-8 in Visual Studio. Otherwise the compiler may misinterpret the em-dash according to the current code page.
Posting this solution for others who run into this. The problem is that Windows assigns the "C" locale on startup by default, and em-dash (0x97) is defined in the "Windows-1252" codepage but is unmapped in the normal ASCII table used by the "C" locale. So the simple solution is to call:
setlocale ( LC_ALL, "" );
Call this prior to fstream::open. It sets the current codepage to the OS-defined codepage. In my program, the file I wanted to open with fstream was defined by the user, so it was in the system-defined codepage (Windows-1252).
So while fiddling with unicode and wide chars may be a solution to avoid unmapped characters, it wasn't the root of the problem. The actual problem was that the input string's codepage ("Windows-1252") didn't match the active codepage ("C") used by default in Windows programs.
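To see which locale is active before and after such a call, something like the following sketch can help. The function names are hypothetical, and the sketch changes only LC_CTYPE rather than LC_ALL, which is a narrower choice than the call shown in this answer.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Name of the C library's current LC_CTYPE locale (a query; changes nothing).
std::string current_ctype_locale()
{
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name ? name : "";
}

// Switch LC_CTYPE to the user's environment default and report the locale
// that is now active. Returns an empty string if the switch failed.
std::string adopt_user_ctype_locale()
{
    const char* name = std::setlocale(LC_CTYPE, "");
    return name ? name : "";
}
```

At program startup current_ctype_locale() reports "C". What adopt_user_ctype_locale() reports depends on the machine.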
Is there any way to use environment variables in C++ for a path to a file?
The idea is to use them without expanding them, so that I don't need to use wchar for languages that need Unicode when I want to save/read a file.
//EDIT
Little edit with more explanations.
So what I am trying to achieve is to read from and write to a file without worrying about the characters in its path. I don't want to use wchar for the path, but it should still work if the path contains wide characters.
There are the functions getenv and GetEnvironmentVariable, but they require the proper language to be set under Language for non-Unicode programs in the Windows settings (Control Panel -> Clock, Language, and Region -> Region and Language -> Administrative), which requires action from users, and that is something I am trying to avoid.
There are functions getenv and GetEnvironmentVariable but they need to set proper language in Language for non-Unicode programs in windows settings
This is specifically a Windows problem.
On other platforms such as Linux, filepaths and environment variables are natively byte-based; you can access them using the standard C library functions that take byte-string paths like fopen() and getenv(). The pathnames may represent Unicode strings to the user (decoded using some encoding, almost always UTF-8 which can encode any character), but to the code they're just byte strings.
Windows, on the other hand, has filenames and environment variables that are natively strings of 16-bit (UTF-16) code units (which are nearly the same thing as Unicode character code points, but not quite, because that would be too easy... but that's a sadness for another time). You can call Win32 file-handling APIs like CreateFileW() and GetEnvironmentVariableW() using UTF-16 code unit strings (wchar_t, when compiled on Windows) and access any file names directly.
There are also old-school legacy byte-based Win32 functions like GetEnvironmentVariableA() (which is what GetEnvironmentVariable() points to if you are compiling a non-Unicode project). If you call those functions, Windows has to convert from the char byte strings you give it to UTF-16 strings, using some encoding.
That encoding is the ‘ANSI’ (‘A’) locale-specific default code page, which is what “Language for non-Unicode programs” sets.
Although that encoding can be changed by the user, it can't be set to UTF-8 or any other encoding that supports all characters, so even if you ask the user to change it, that still doesn't let you access all files. Thus the Win32 A APIs are always to be avoided.
The problem comes when you want to access files in a manner that works on both Windows and the other platforms. If you call the C standard library with byte strings, the Microsoft C runtime library adapts those calls to call the Win32 A byte-based APIs, which as above are irritatingly limited.
So your unattractive choices are:
use wchar_t and std::wstring strings in your code, using only Win32 APIs for interacting with filenames and environment variables, and accepting that your code will never run on other platforms, or;
use char and UTF-8-encoded std::string strings, and give up on your code accessing filenames and environment variables with non-ASCII characters in on Windows, or;
write a load of branching #ifdef code to switch between using C standard functions for filename and environment interaction, or using Win32 APIs with a bunch of UTF-8-char-to-wchar_t string conversions in between, so that code works across multiple platforms, or;
use a library that encapsulates (3) for you.
Notably there is boost::nowide (since Boost 1.73) which contains boost::nowide::getenv.
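For a sense of what option 3 involves, here is a rough sketch of one such wrapper. The names getenv_utf8, widen_utf8 and narrow_utf8 are hypothetical. The Windows branch uses GetEnvironmentVariableW together with MultiByteToWideChar and WideCharToMultiByte, and for simplicity treats an empty variable the same as an unset one.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

#ifdef _WIN32
#include <windows.h>

// UTF-8 to UTF-16, for passing names to the W APIs.
static std::wstring widen_utf8(const std::string& s)
{
    if (s.empty()) return std::wstring();
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                      static_cast<int>(s.size()), nullptr, 0);
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), static_cast<int>(s.size()), &w[0], n);
    return w;
}

// UTF-16 to UTF-8, for returning values to byte-string code.
static std::string narrow_utf8(const std::wstring& w)
{
    if (w.empty()) return std::string();
    const int n = WideCharToMultiByte(CP_UTF8, 0, w.data(), static_cast<int>(w.size()),
                                      nullptr, 0, nullptr, nullptr);
    std::string s(static_cast<std::size_t>(n), '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), static_cast<int>(w.size()),
                        &s[0], n, nullptr, nullptr);
    return s;
}
#endif

// Fetch an environment variable as UTF-8. Returns false if it is not set.
bool getenv_utf8(const std::string& name, std::string& value)
{
#ifdef _WIN32
    const std::wstring wname = widen_utf8(name);
    // With a null buffer the call returns the size needed, terminator included.
    const DWORD size = GetEnvironmentVariableW(wname.c_str(), nullptr, 0);
    if (size == 0) return false;  // unset (an empty value is treated alike here)
    std::wstring buffer(size, L'\0');
    const DWORD written = GetEnvironmentVariableW(wname.c_str(), &buffer[0], size);
    if (written == 0 || written >= size) return false;  // changed underneath us
    buffer.resize(written);
    value = narrow_utf8(buffer);
    return true;
#else
    // Elsewhere the environment is byte-based already (assumed UTF-8 here).
    const char* raw = std::getenv(name.c_str());
    if (!raw) return false;
    value = raw;
    return true;
#endif
}
```

The rest of the program can then keep paths in UTF-8 std::string and convert to wchar_t only at the point where a Win32 W function is called.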
This isn't entirely Microsoft's fault: Windows NT was designed in their early days of Unicode before UTF-8 or the astral planes were invented, when it was thought that 16-bit code unit strings were a totally sensible way to store text, and not a lamentable disaster like we know it is now. It is, however, very sad that Windows has not been updated since then to treat UTF-8 as a first-class citizen and provide an easy way to write cross-platform applications.
The standard library gives you the function getenv. Here is an example:
#include <cstdlib>
#include <iostream>

int main()
{
    // getenv returns a null pointer if the variable is not set
    const char* pPath = std::getenv("PATH");
    if (pPath)
        std::cout << "Path =" << pPath << std::endl;
    return 0;
}
ADDENDUM A tentative answer of my own appears at the bottom of the question.
I am converting an archaic VC6 C++/MFC project to VS2013 and Unicode, based on the recommendations at utf8everywhere.org.
Along the way, I have been studying Unicode, UTF-16, UCS-2, UTF-8, the standard library and STL support of Unicode & UTF-8 (or, rather, the standard library's lack of support), ICU, Boost.Locale, and of course the Windows SDK and MFC's API that requires UTF-16 wchar's.
As I have been studying the above issues, a question continues to recur that I have not been able to answer to my satisfaction in a clarified way.
Consider the C library function mbstowcs. This function has the following signature:
size_t mbstowcs (wchar_t* dest, const char* src, size_t max);
The second parameter src is (according to the documentation) a
C-string with the multibyte characters to be interpreted. The
multibyte sequence shall begin in the initial shift state.
My question is in regards to this multibyte string. It is my understanding that the encoding of a multibyte string can differ from string to string, and the encoding is not specified by the standard. Nor does a particular encoding seem to be specified by the MSVC documentation for this function.
My understanding at this point is that on Windows, this multibyte string is expected to be encoded with the ANSI code page of the active locale. But my clarity begins to fade at this point.
I have been wondering whether the encoding of the source code file itself makes a difference in the behavior of mbstowcs, at least on Windows. And, I'm also confused about what happens at compile time vs. what happens at run time for the code snippet above.
Suppose you have a string literal passed to mbstowcs, like this:
wchar_t dest[1024];
mbstowcs (dest, "Hello, world!", 1024);
Suppose this code is compiled on a Windows machine. Suppose that the code page of the source code file itself is different than the code page of the current locale on the machine on which the compiler runs. Will the compiler take into consideration the source code file's encoding? Will the resulting binary be affected by the fact that the code page of the source code file is different than the code page of the active locale on which the compiler runs?
On the other hand, maybe I have it wrong - maybe the active locale of the runtime machine determines the code page that is expected of the string literal. Therefore, does the code page with which the source code file is saved need to match the code page of the computer on which the program ultimately runs? That seems so whacked to me that I find it hard to believe this would be the case. But as you can see, my clarity is lacking here.
On the other hand, if we change the call to mbstowcs to explicitly pass a UTF-8 string:
wchar_t dest[1024];
mbstowcs (dest, u8"Hello, world!", 1024);
... I assume that mbstowcs will always do the right thing - regardless of the code page of the source file, the current locale of the compiler, or the current locale of the computer on which the code runs. Am I correct about this?
I would appreciate clarity on these matters, in particular in regards to the specific questions I have raised above. If any or all of my questions are ill-formed, I would appreciate knowing that, as well.
ADDENDUM From the lengthy comments beneath @TheUndeadFish's answer, and from the answer to a question on a very similar topic here, I believe I have a tentative answer to my own question that I'd like to propose.
Let's follow the raw bytes of the source code file to see how the actual bytes are transformed through the entire process of compilation to runtime behavior:
The C++ standard 'ostensibly' requires that all characters in any source code file be a (particular) 96-character subset of ASCII called the basic source character set. (But see following bullet points.)
In terms of the actual byte-level encoding of these 96 characters in the source code file, the standard does not specify any particular encoding, but all 96 characters are ASCII characters, so in practice, there is never a question about what encoding the source file is in, because all encodings in existence represent these 96 ASCII characters using the same raw bytes.
However, character literals and code comments might commonly contain characters outside these basic 96.
This is typically supported by the compiler (even though this isn't required by the C++ standard). The source code's character set is called the source character set. But the compiler needs to have these same characters available in its internal character set (called the execution character set), or else those missing characters will be replaced by some other (dummy) character (such as a square or a question mark) prior to the compiler actually processing the source code - see the discussion that follows.
How the compiler determines the encoding that is used to encode the characters of the source code file (when characters appear that are outside the basic source character set) is implementation-defined.
Note that it is possible for the compiler to use a different character set (encoded however it likes) for its internal execution character set than the character set represented by the encoding of the source code file!
This means that even if the compiler knows about the encoding of the source code file (which implies that the compiler also knows about all the characters in the source code's character set), the compiler might still be forced to convert some characters in the source code's character set to different characters in the execution character set (thereby losing information). The standard states that this is acceptable, but that the compiler must not convert any characters in the source character set to the NULL character in the execution character set.
Nothing is said by the C++ standard about the encoding used for the execution character set, just as nothing is said about the characters that are required to be supported in the execution character set (other than the characters in the basic execution character set, which include all characters in the basic source character set plus a handful of additional ones such as the NULL character and the backspace character).
How MSVC handles any of this does not seem to be documented very clearly anywhere, even by Microsoft. I.e., how the compiler figures out what the encoding and corresponding character set of the source code file is, and/or what the choice of execution character set is, and/or what the encoding is that will be used for the execution character set during compilation of the source code file.
It seems that in the case of MSVC, the compiler will make a best-guess effort in its attempt to select an encoding (and corresponding character set) for any given source code file, falling back on the current locale's default code page of the machine the compiler is running on. Or you can take special steps to save the source code files as Unicode using an editor that will provide the proper byte-order mark (BOM) at the beginning of each source code file. This includes UTF-8, for which the BOM is typically optional or excluded - in the case of source code files read by the MSVC compiler, you must include the UTF-8 BOM.
And in terms of the execution character set and its encoding for MSVC, continue on with the next bullet point.
The compiler proceeds to read the source file and converts the raw bytes of the characters of the source code file from the encoding for the source character set into the (potentially different) encoding of the corresponding character in the execution character set (which will be the same character, if the given character is present in both character sets).
Ignoring code comments and character literals, all such characters are typically in the basic execution character set noted above. This is a subset of the ASCII character set, so encoding issues are irrelevant (all of these characters are, in practice, encoded identically on all compilers).
Regarding the code comments and character literals, though: the code comments are discarded, and if the character literals contain only characters in the basic source character set, then no problem - these characters will belong in the basic execution character set and still be ASCII.
But if the character literals in the source code contain characters outside of the basic source character set, then these characters are, as noted above, converted to the execution character set (possibly with some loss). But as noted, neither the characters, nor the encoding for this character set is defined by the C++ standard. Again, the MSVC documentation seems to be very weak on what this encoding and character set will be. Perhaps it is the default ANSI encoding indicated by the active locale on the machine on which the compiler runs? Perhaps it is UTF-16?
In any case, the raw bytes that will be burned into the executable for the character string literal correspond exactly to the compiler's encoding of the characters in the execution character set.
At runtime, mbstowcs is called and it is passed the bytes from the previous bullet point, unchanged.
It is now time for the C runtime library to interpret the bytes that are passed to mbstowcs.
Because no locale is provided with the call to mbstowcs, the C runtime has no idea what encoding to use when it receives these bytes - this is arguably the weakest link in this chain.
It is not documented by the C++ (or C) standard what encoding should be used to read the bytes passed to mbstowcs. I am not sure if the standard states that the input to mbstowcs is expected to be in the same execution character set as the characters in the execution character set of the compiler, OR if the encoding is expected to be the same for the compiler as for the C runtime implementation of mbstowcs.
But my tentative guess is that in the MSVC C runtime, apparently the locale of the current running thread will be used to determine both the runtime execution character set, and the encoding representing this character set, that will be used to interpret the bytes passed to mbstowcs.
This means that it will be very easy for these bytes to be mis-interpreted as different characters than were encoded in the source code file - very ugly, as far as I'm concerned.
If I'm right about all this, then if you want to force a particular encoding to be used, you should call the Windows SDK's MultiByteToWideChar, as @HarryJohnston's comment indicates, because you can pass the desired encoding to that function.
Due to the above mess, there really isn't an automatic way to deal with character literals in source code files.
Therefore, as https://stackoverflow.com/a/1866668/368896 mentions, if there's a chance you'll have non-ASCII characters in your character literals, you should use resources (such as GetText's method, which also works via Boost.Locale on Windows in conjunction with the xgettext .exe that ships with Poedit), and in your source code, simply write functions to load the resources as raw (unchanged) bytes.
Make sure to save your resource files as UTF-8, and then make sure to call functions at runtime that explicitly support UTF-8 for their char *'s and std::string's, such as (from the recommendations at utf8everywhere.org) using Boost.Nowide (not really in Boost yet, I think) to convert from UTF-8 to wchar_t at the last possible moment prior to calling any Windows API functions that write text to dialog boxes, etc. (and using the W forms of these Windows API functions). For console output, you must call the SetConsoleOutputCP-type functions, such as is also described at https://stackoverflow.com/a/1866668/368896.
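The compile-time half of this chain can be illustrated with a short sketch. It assumes a C++11 or later compiler, and the helper stored_bytes is hypothetical. A u8 literal written with a universal character name has a fixed byte sequence, while an ordinary narrow literal is stored in whatever execution character set the compiler uses.

```cpp
#include <cassert>
#include <cstddef>

// Spelled with a universal character name, so the source file's own
// encoding cannot change which character is meant. As UTF-8, U+00E9 is
// the two bytes C3 A9, plus the terminator.
static_assert(sizeof(u8"\u00E9") == 3, "UTF-8 encoding of U+00E9 is two bytes");

// Number of bytes (excluding the terminator) that the compiler stored for
// an ordinary narrow literal in its execution character set.
template <std::size_t N>
constexpr std::size_t stored_bytes(const char (&)[N]) { return N - 1; }
```

For an ASCII-only literal such as "abc" the count is the same under every execution character set. For a literal containing non-ASCII characters it depends on the compiler's settings, which is exactly the ambiguity described above.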
Thanks to those who took the time to read the lengthy proposed answer here.
The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.
On the MSDN documentation you linked is:
mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale.
That linked page about locales then references setlocale which is how the behavior of mbstowcs can be affected.
Now, taking a look at your proposed way of passing UTF-8:
mbstowcs (dest, u8"Hello, world!", 1024);
Unfortunately, as far as I know, that isn't going to work properly once you use interesting data. If it even compiles, it only does so because the compiler treats a u8 literal the same as a char*. And as far as mbstowcs is concerned, the string is encoded in whatever encoding the current locale specifies.
Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.
So that would happen to work for ASCII characters (the first 128 characters) only because they happen to have the exact same binary values in various ANSI encodings as well as UTF-8. If you try with any characters beyond that (for instance anything with an accent or umlaut) then you'll see problems.
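As a rough illustration of the point above, here is a minimal sketch that checks whether a narrow string contains only 7-bit ASCII bytes, i.e. whether it would be read the same way under UTF-8 and the common ANSI code pages. The helper name is_pure_ascii is hypothetical, not part of any library.

```cpp
#include <string>

// Hypothetical helper: returns true if every byte is in the 7-bit ASCII range.
// Such strings have the same byte values in UTF-8 and in the usual ANSI code
// pages, so they convert the same way regardless of the current locale.
bool is_pure_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;  // a byte with the high bit set depends on the encoding
        }
    }
    return true;
}
```

For example, is_pure_ascii("Hello, world!") is true, while the two-byte UTF-8 sequence "\xC3\xA9" (an accented e) makes it return false.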
Personally, I think mbstowcs and such are rather limited and clunky. I've found the Windows API function MultiByteToWideChar to be more effective in general. In particular it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.
mbstowcs() semantics are defined in terms of the currently installed C locale. If you are processing strings with different encodings you will need to use setlocale() to change what encoding is currently being used. The relevant statement in the C standard is in 7.22.8 paragraph 1:
The behavior of the multibyte string functions is affected by the LC_CTYPE category of
the current locale.
I don't know enough about the C library but as far as I know none of these functions is really thread-safe. I consider it much easier to deal with different encodings and, in general, cultural conventions, using the C++ std::locale facilities. With respect to encoding conversions you'd look at the std::codecvt<...> facets. Admittedly, these aren't easy to use, though.
The current locale needs a bit of clarification: the program has a current global locale. Initially, this locale is somehow set up by the system and is possibly controlled by the user's environment in some form. For example, on UNIX systems there are environment variables which choose the initial locale. Once the program is running, it can change the current locale, however. How that is done depends a bit on what is being used exactly: a running C++ program actually has two locales: one used by the C library and one used by the C++ library.
The C locale is used for all locale-dependent functions from the C library, e.g., mbstowcs() but also tolower() and printf(). The C++ locale is used for all locale-dependent functions which are specific to the C++ library. Since C++ uses locale objects, the global locale is just used as the default for entities not setting a locale specifically, and primarily for streams (you'd set a stream's locale using s.imbue(loc)). Depending on which locale you set, there are different methods to set the global locale:
For the C locale you use setlocale().
For the C++ locale you use std::locale::global().
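A minimal sketch of querying and setting the two locales, using only the classic "C" locale because it is the one name guaranteed to exist everywhere. The three function names are made up for this illustration.

```cpp
#include <clocale>
#include <locale>
#include <string>

// Name of the C library's current LC_CTYPE locale (query only, nothing changes).
std::string c_ctype_locale_name() {
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name ? name : "";
}

// Name of the C++ global locale: a default-constructed std::locale is a copy of it.
std::string cpp_global_locale_name() {
    return std::locale().name();
}

// Put both libraries back into the classic "C" locale.
void reset_both_to_classic() {
    std::setlocale(LC_ALL, "C");                  // C library side
    std::locale::global(std::locale::classic());  // C++ library side
}
```

Note that std::locale::global() also calls std::setlocale() when the locale it is given has a name, so in practice the C++ call can change the C locale too; the reverse is not true.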
I'm working with a C++ sourcefile in which I would like to have a quoted string that contains Asian Unicode characters.
I'm working with Qt on Windows, and the Qt Creator development environment has no problem displaying the Unicode. The QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:
#define MY_STRING 鸟
However, when I save, my lovely Unicode characters all become ? marks.
I tried to open up the source file and resave it as Unicode encoded. It then displays and saves correctly in Qt Creator. However, on compile, it seems like the compiler has no idea what to do with this, and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".
What's the correct way to include Unicode in C++ source files?
Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitrary Unicode characters in your source files, you have to worry about the encoding that the compiler considers the source file to be in, what execution character set it will use, and how it's going to do the source to execution character set conversion.
I think that it's a much better idea to have Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs, and not worry about how the compiler behaves, which may be influenced by the local locale settings at compile time.
It does require a bit more infrastructure, but if you're having to internationalize it's well worth spending the time choosing or developing a flexible and robust strategy.
While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made it's easier for most people involved in the process to be able to deal with text in an agreed universal character encoding scheme.
Using the L prefix and \u or \U notation for escaping Unicode characters:
Section 6.4.3 of the C99 specification defines the \u escape sequences.
Example:
#define MY_STRING L"A \u2261 B"
/* A congruent-to B */
Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing Unicode character U+1234 (hex 0x1234). (Looking at the QString header file I think this is what you need.)
If not and your interface is UTF-8 then you'll need to encode your character in UTF-8 first and then create a narrow string containing those bytes, e.g. "\xE1\x88\xB4" for U+1234.
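To show where such byte sequences come from, here is a minimal sketch of encoding a single code point as UTF-8 by hand. The function name utf8_encode is hypothetical, and the sketch assumes its input is a valid Unicode scalar value (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
// Assumes cp is a valid scalar value; no validation is performed.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, utf8_encode(0x1234) yields the three bytes E1 88 B4, which is what you would then write as escaped bytes in a narrow string literal.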
I have a problem reading and using the content from unicode files.
I am working on a Unicode release build, and I am trying to read the content from a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or other legacy code page) you will run into the risk of corrupting/losing data.
Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
So your file might be utf-16, or utf-8 (utf-32 is quite rare).
For utf-16 the endianness might also matter. If there is a BOM that will help a lot.
Quick steps:
open file with _wopen, or _wfopen, as binary
read the first bytes to identify encoding using the BOM
if the encoding is utf-8, read in a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
if the encoding is utf-16be (big endian) read in a wchar_t array and _swab
if the encoding is utf-16le (little endian) read in a wchar_t array and you are done
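The BOM check in the steps above could be sketched roughly as follows. This is portable C++ rather than Windows-specific code; the Encoding enum and detect_bom name are invented for the illustration, and UTF-32 is deliberately ignored (a UTF-32LE BOM would be reported as UTF-16LE here).

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file (read in binary mode) for a byte order mark.
// Returns Encoding::Unknown when no recognised BOM is present.
Encoding detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return Encoding::Utf8;      // EF BB BF
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return Encoding::Utf16LE;   // FF FE
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return Encoding::Utf16BE;   // FE FF
    }
    return Encoding::Unknown;
}
```

A file without a BOM comes back as Unknown, in which case you have to know or guess the encoding some other way.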
Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt+, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
Warning: to be cross-platform is problematic, wchar_t can be 2 or 4 bytes, the conversion routines are not portable...
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O
function operates in text mode, the
source or destination stream is
assumed to be a sequence of multibyte
characters. Therefore, the Unicode
stream-input functions convert
multibyte characters to wide
characters (as if by a call to the
mbtowc function). For the same reason,
the Unicode stream-output functions
convert wide characters to multibyte
characters (as if by a call to the
wctomb function).
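The conversion the quoted passage describes can be tried in isolation. Below is a minimal sketch, with a made-up helper name wide_to_multibyte, that converts a wide string through the current C locale the way text-mode wide output would. It relies on the common (POSIX and Microsoft CRT) behaviour that wcstombs with a null destination returns the required length.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: convert a wide string to multibyte using the current
// LC_CTYPE locale. Returns false if a character cannot be represented.
bool wide_to_multibyte(const wchar_t* wide, std::string& out) {
    // With a null destination, wcstombs reports how many bytes are needed
    // (an extension to the C standard, supported by POSIX and the MS CRT).
    std::size_t needed = std::wcstombs(nullptr, wide, 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return false;  // some character has no encoding in this locale
    }
    out.assign(needed, '\0');
    if (needed > 0) {
        std::wcstombs(&out[0], wide, needed);
    }
    return true;
}
```

In the default "C" locale only plain ASCII is sure to convert; what happens to other characters depends on the runtime and on the locale you selected with setlocale().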
Unicode is the mapping from numerical codes into characters. The step before Unicode is the file's encoding: how do you transform some consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, the BOM (byte order mark) is written at the very start of the file; for UTF-16 it is the first two bytes, either FF FE (little-endian) or FE FF (big-endian).
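To make the byte order question concrete, here is a minimal sketch that assembles UTF-16 code units from raw bytes for either byte order. The function name utf16_from_bytes is hypothetical; it does not skip a BOM or validate surrogate pairs, and a trailing odd byte is ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build UTF-16 code units from raw bytes.
// big_endian selects which byte of each pair is the high-order byte.
std::u16string utf16_from_bytes(const unsigned char* data, std::size_t size,
                                bool big_endian) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        unsigned hi = big_endian ? data[i] : data[i + 1];
        unsigned lo = big_endian ? data[i + 1] : data[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

The bytes 34 12 read as little-endian and the bytes 12 34 read as big-endian both give the single code unit 0x1234; reading either pair with the wrong byte order gives 0x3412 instead, which is how mojibake from a wrong endianness guess arises.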
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
BTW you tag your question C++, you wrote about fgets and fgetws but not
IOStreams; is your problem C++ or C ?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++
#include <locale>
std::locale::global(std::locale(""));
Then wide I/O (wide streams, fgetws) should work if your environment is correctly
set for Unicode. If not, you'll have to change your environment (I don't know
how it works under Windows; for Unix, setting the LC_ALL variable is the
way, see locale -a for supported values). Alternatively, replacing the
empty string by the locale name would also work, but then you hardcode the
locale in your program and your users perhaps won't appreciate that.
If your system doesn't support an adequate locale, in C++ you have the
possibility to write a facet for the conversion yourself. But that is outside
the scope of this answer.
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8 encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem, I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as does the rest of the library).
If this is too much, use a special unicode-library like for example: http://utfcpp.sourceforge.net/.
And learn about unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different unicode-encodings.