Using Unicode in a C++ source file

I'm working with a C++ source file in which I would like to have a quoted string that contains Asian Unicode characters.
I'm working with Qt on Windows, and the Qt Creator development environment has no problem displaying the Unicode. The QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:
#define MY_STRING 鸟
However, when I save, my lovely Unicode characters all become ? marks.
I tried to open up the source file and resave it as Unicode encoded. It then displays and saves correctly in Qt Creator. However, on compile, it seems like the compiler has no idea what to do with this, and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".
What's the correct way to include Unicode in C++ source files?

Personally, I don't use any non-ASCII characters in source code. The reason is that if you use arbitrary Unicode characters in your source files, you have to worry about the encoding the compiler considers the source file to be in, what execution character set it will use, and how it's going to do the source-to-execution character set conversion.
I think it's a much better idea to have Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you can control how the encoding occurs, and not worry about compiler behavior that may be influenced by the locale settings in effect at compile time.
It does require a bit more infrastructure, but if you're having to internationalize it's well worth spending the time choosing or developing a flexible and robust strategy.
While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made it's easier for most people involved in the process to be able to deal with text in an agreed universal character encoding scheme.

Using the L prefix and \u or \U notation for escaping Unicode characters:
Section 6.4.3 of the C99 specification defines the \u escape sequences.
Example:
#define MY_STRING L"A \u2261 B"
/* A congruent-to B */

Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing Unicode character U+1234 (hex 0x1234). (Looking at the QString header file I think this is what you need.)
If not and your interface is UTF-8, then you'll need to encode your character in UTF-8 first and then create a narrow string containing those bytes, e.g. "\xE9\xB8\x9F" (the UTF-8 encoding of 鸟, U+9E1F).
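Here is a minimal sketch of both approaches applied to the asker's character 鸟 (U+9E1F), and how each might typically be turned into a QString. The exact conversion calls depend on the surrounding code, so treat the names below as illustrative:
#include <QString>

const wchar_t kBirdWide[] = L"\u9E1F";       // wide string, written with an escape
const char    kBirdUtf8[] = "\xE9\xB8\x9F";  // the same character as raw UTF-8 bytes

QString fromWide = QString::fromWCharArray(kBirdWide);
QString fromUtf8 = QString::fromUtf8(kBirdUtf8);
Either way, the source file itself stays pure ASCII, so its encoding no longer matters to the compiler.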

Related

How to store Unicode characters in an array?

I'm writing a C++ wxWidgets calculator application, and I need to store the characters for the operators in an array. I have something like int ops[10] = {'+', '-', '*', '/', '^'};. What if I wanted to also store characters such as √, ÷ and × in said array, in a way so that they are also displayable inside a wxTextCtrl and a custom button?
This is actually a hairy question, even though it does not look like it at first. Your best option is to use Unicode escape sequences instead of adding the special characters with your source code editor.
wxString ops[]={L"+", L"-", L"*", L"\u00F7"};
You need to make sure that characters such as √, ÷ and × are being compiled correctly.
Your source file (.cpp) needs to store them in a way that ensures the compiler generates the correct characters. This is harder than it looks, especially when SVN, Git, Windows and Linux are involved.
Most of the time .cpp files are ANSI or 8-bit encoded and do not support Unicode constants out of the box.
You could save your source file as UTF-8 so that these characters are preserved, but not all compilers accept UTF-8.
The best way is to write these characters using Unicode escape sequences.
wxString div(L"\u00F7"); is the string for ÷. Or in your case perhaps wxChar div(L'\u00F7'). You have to look up the Unicode code points for the other special chars. This way your source file will contain plain ASCII characters only and will be accepted by all compilers. You will also avoid code page problems when you exchange source files between different OS platforms.
Then you have to make sure that you compile wxWidgets with UNICODE awareness (although I think this is the default for wx3.x). Then, if your OS supports it, these special characters should show up.
Read up on Unicode escape sequences (Wikipedia). Also good input is found at utf8everywhere.org. An .editorconfig file can also be of help.
Prefer to use wchar_t instead of int.
wchar_t ops[10] = {L'+', L'-', L'*', L'/', L'^', L'√', L'÷', L'×'};
These trivially support the characters you describe, and trivially and correctly convert to wxStrings.
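For completeness, here is a minimal sketch of how such a wide-character array might be turned into wxString labels, assuming a Unicode build of wxWidgets (the helper name is made up; wxString accepts a wide C string directly):
#include <wx/string.h>

wchar_t ops[] = {L'+', L'-', L'*', L'/', L'^', L'√', L'÷', L'×'};

wxString MakeLabel(wchar_t op)
{
    wchar_t buf[2] = {op, L'\0'};  // one operator plus a terminator
    return wxString(buf);          // wxString(const wchar_t*) copies the wide data
}
Such a label can then be handed to a wxButton or appended to a wxTextCtrl as usual.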

Storing math symbols into string c++

Is there a way to store math symbols in strings in C++?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
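To make the two steps concrete, here is a minimal sketch (Windows only, error handling omitted, names illustrative) of converting a UTF-8 std::string to a wide string with MultiByteToWideChar and handing it to a wide API:
#include <windows.h>
#include <string>

void ShowUtf8(const std::string& utf8)
{
    // First call asks how many wchar_t units are needed (the count includes the terminator).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring wide(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);

    MessageBoxW(nullptr, wide.c_str(), L"Example", MB_OK);  // the "wide" API, step (1)
}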
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters, but don't expect this code to be portable. You could also use Unicode escapes, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

Euro Symbol for C++

I am creating a small currency converting program in C++ in Visual Studio 2013.
For a pound sign, I have created a char variable using the Windows-1252 code for £ symbol.
const char poundSign(156);
I need to find a way to display the € (euro) symbol when I run the program.
You should not use any of the deprecated extended ASCII encodings, as they are long obsolete nowadays. As user1937198 said, 156 is the character code of £ in the archaic Windows-1252 encoding. The appearance of non-ASCII characters in these encodings depends on the code page selected by the user, which makes it impossible to mix characters from different code pages under such a scheme. Additionally, your users will be very confused if they pick the wrong code page.
Consider using Unicode instead. Since you're on Windows, you should probably use UTF-16, in which case the correct way is to declare:
// make sure the source code is saved as UTF-16!
const wchar_t poundSign[] = L"£";
const wchar_t euroSign[] = L"€";
In UTF-16, there's no guarantee that a single character will occupy only one 16-bit code unit, due to the surrogate mechanism, hence it's best to store "characters"* as strings.
Keep in mind that this means the rest of your program should switch to Unicode as well: use the "W" versions of the Windows API (the WCHAR-versions).
[*] In technical lingo, each "Unicode character" is referred to as a code point.
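A minimal illustration of the surrogate point, assuming Windows where wchar_t is 16 bits: U+1D11E (the musical G clef) lies outside the BMP, so its wide literal occupies two code units plus the terminator:
const wchar_t clef[] = L"\U0001D11E";  // stored as the surrogate pair 0xD834 0xDD1E
static_assert(sizeof(clef) / sizeof(clef[0]) == 3, "two UTF-16 code units plus the null terminator");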
You have to distinguish between the translation environment and the execution environment.
For example, which glyph is assigned to the character code you produce depends heavily on where your program is run (cf. 'code page').
You may want to check whether your terminal (or whatever you are outputting to) is Unicode-aware and then just use the appropriate code points.
Also, you should not create character constants the way you did. Use a character literal instead.
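A minimal sketch of that suggestion, using escapes so the source file can stay plain ASCII (the variable names are just examples):
const wchar_t poundSign = L'\u00A3';  // £  U+00A3
const wchar_t euroSign  = L'\u20AC';  // €  U+20AC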

Does the multibyte-to-wide-string conversion function "mbstowcs", when passed a string literal, use the encoding of the source file?

ADDENDUM A tentative answer of my own appears at the bottom of the question.
I am converting an archaic VC6 C++/MFC project to VS2013 and Unicode, based on the recommendations at utf8everywhere.org.
Along the way, I have been studying Unicode, UTF-16, UCS-2, UTF-8, the standard library and STL support of Unicode & UTF-8 (or, rather, the standard library's lack of support), ICU, Boost.Locale, and of course the Windows SDK and MFC's API that requires UTF-16 wchar's.
As I have been studying the above issues, a question continues to recur that I have not been able to answer to my satisfaction in a clarified way.
Consider the C library function mbstowcs. This function has the following signature:
size_t mbstowcs (wchar_t* dest, const char* src, size_t max);
The second parameter src is (according to the documentation) a
C-string with the multibyte characters to be interpreted. The
multibyte sequence shall begin in the initial shift state.
My question is in regards to this multibyte string. It is my understanding that the encoding of a multibyte string can differ from string to string, and the encoding is not specified by the standard. Nor does a particular encoding seem to be specified by the MSVC documentation for this function.
My understanding at this point is that on Windows, this multibyte string is expected to be encoded with the ANSI code page of the active locale. But my clarity begins to fade at this point.
I have been wondering whether the encoding of the source code file itself makes a difference in the behavior of mbstowcs, at least on Windows. And, I'm also confused about what happens at compile time vs. what happens at run time for the code snippet above.
Suppose you have a string literal passed to mbstowcs, like this:
wchar_t dest[1024];
mbstowcs (dest, "Hello, world!", 1024);
Suppose this code is compiled on a Windows machine. Suppose that the code page of the source code file itself is different than the code page of the current locale on the machine on which the compiler runs. Will the compiler take into consideration the source code file's encoding? Will the resulting binary be affected by the fact that the code page of the source code file is different than the code page of the active locale on which the compiler runs?
On the other hand, maybe I have it wrong - maybe the active locale of the runtime machine determines the code page that is expected of the string literal. Therefore, does the code page with which the source code file is saved need to match the code page of the computer on which the program ultimately runs? That seems so whacked to me that I find it hard to believe this would be the case. But as you can see, my clarity is lacking here.
On the other hand, if we change the call to mbstowcs to explicitly pass a UTF-8 string:
wchar_t dest[1024];
mbstowcs (dest, u8"Hello, world!", 1024);
... I assume that mbstowcs will always do the right thing - regardless of the code page of the source file, the current locale of the compiler, or the current locale of the computer on which the code runs. Am I correct about this?
I would appreciate clarity on these matters, in particular in regards to the specific questions I have raised above. If any or all of my questions are ill-formed, I would appreciate knowing that, as well.
ADDENDUM From the lengthy comments beneath @TheUndeadFish's answer, and from the answer to a question on a very similar topic here, I believe I have a tentative answer to my own question that I'd like to propose.
Let's follow the raw bytes of the source code file to see how the actual bytes are transformed through the entire process of compilation to runtime behavior:
The C++ standard 'ostensibly' requires that all characters in any source code file be drawn from a (particular) 96-character subset of ASCII called the basic source character set. (But see following bullet points.)
In terms of the actual byte-level encoding of these 96 characters in the source code file, the standard does not specify any particular encoding, but all 96 characters are ASCII characters, so in practice, there is never a question about what encoding the source file is in, because all encodings in existence represent these 96 ASCII characters using the same raw bytes.
However, character literals and code comments commonly contain characters outside these basic 96.
This is typically supported by the compiler (even though this isn't required by the C++ standard). The source code's character set is called the source character set. But the compiler needs to have these same characters available in its internal character set (called the execution character set), or else those missing characters will be replaced by some other (dummy) character (such as a square or a question mark) prior to the compiler actually processing the source code - see the discussion that follows.
How the compiler determines the encoding that is used to encode the characters of the source code file (when characters appear that are outside the basic source character set) is implementation-defined.
Note that it is possible for the compiler to use a different character set (encoded however it likes) for its internal execution character set than the character set represented by the encoding of the source code file!
This means that even if the compiler knows about the encoding of the source code file (which implies that the compiler also knows about all the characters in the source code's character set), the compiler might still be forced to convert some characters in the source code's character set to different characters in the execution character set (thereby losing information). The standard states that this is acceptable, but that the compiler must not convert any characters in the source character set to the NULL character in the execution character set.
Nothing is said by the C++ standard about the encoding used for the execution character set, just as nothing is said about the characters that are required to be supported in the execution character set (other than the characters in the basic execution character set, which include all characters in the basic source character set plus a handful of additional ones such as the NULL character and the backspace character).
It does not really seem to be documented anywhere very clearly, even by Microsoft, how any of this process is handled in MSVC. I.e., how the compiler figures out what the encoding and corresponding character set of the source code file is, and/or what the choice of execution character set is, and/or what the encoding is that will be used for the execution character set during compilation of the source code file.
It seems that in the case of MSVC, the compiler will make a best-guess effort in its attempt to select an encoding (and corresponding character set) for any given source code file, falling back on the current locale's default code page of the machine the compiler is running on. Or you can take special steps to save the source code files as Unicode using an editor that will provide the proper byte-order mark (BOM) at the beginning of each source code file. This includes UTF-8, for which the BOM is typically optional or excluded - in the case of source code files read by the MSVC compiler, you must include the UTF-8 BOM.
And in terms of the execution character set and its encoding for MSVC, continue on with the next bullet point.
The compiler proceeds to read the source file and converts the raw bytes of the characters of the source code file from the encoding for the source character set into the (potentially different) encoding of the corresponding character in the execution character set (which will be the same character, if the given character is present in both character sets).
Ignoring code comments and character literals, all such characters are typically in the basic execution character set noted above. This is a subset of the ASCII character set, so encoding issues are irrelevant (all of these characters are, in practice, encoded identically on all compilers).
Regarding the code comments and character literals, though: the code comments are discarded, and if the character literals contain only characters in the basic source character set, then no problem - these characters will belong in the basic execution character set and still be ASCII.
But if the character literals in the source code contain characters outside of the basic source character set, then these characters are, as noted above, converted to the execution character set (possibly with some loss). But as noted, neither the characters, nor the encoding for this character set is defined by the C++ standard. Again, the MSVC documentation seems to be very weak on what this encoding and character set will be. Perhaps it is the default ANSI encoding indicated by the active locale on the machine on which the compiler runs? Perhaps it is UTF-16?
In any case, the raw bytes that will be burned into the executable for the character string literal correspond exactly to the compiler's encoding of the characters in the execution character set.
At runtime, mbstowcs is called and it is passed the bytes from the previous bullet point, unchanged.
It is now time for the C runtime library to interpret the bytes that are passed to mbstowcs.
Because no locale is provided with the call to mbstowcs, the C runtime has no idea what encoding to use when it receives these bytes - this is arguably the weakest link in this chain.
It is not documented by the C++ (or C) standard what encoding should be used to read the bytes passed to mbstowcs. I am not sure if the standard states that the input to mbstowcs is expected to be in the same execution character set as the characters in the execution character set of the compiler, OR if the encoding is expected to be the same for the compiler as for the C runtime implementation of mbstowcs.
But my tentative guess is that in the MSVC C runtime, apparently the locale of the current running thread will be used to determine both the runtime execution character set, and the encoding representing this character set, that will be used to interpret the bytes passed to mbstowcs.
This means that it will be very easy for these bytes to be mis-interpreted as different characters than were encoded in the source code file - very ugly, as far as I'm concerned.
If I'm right about all this, then if you want to force the C runtime to use a particular encoding, you should call the Windows SDK's MultiByteToWideChar, as @HarryJohnston's comment indicates, because you can pass the desired encoding to that function.
Due to the above mess, there really isn't an automatic way to deal with character literals in source code files.
Therefore, as https://stackoverflow.com/a/1866668/368896 mentions, if there's a chance you'll have non-ASCII characters in your character literals, you should use resources (such as gettext's method, which also works via Boost.Locale on Windows in conjunction with the xgettext.exe that ships with Poedit), and in your source code, simply write functions to load the resources as raw (unchanged) bytes.
Make sure to save your resource files as UTF-8, and then make sure to call functions at runtime that explicitly support UTF-8 for their char *'s and std::string's, such as (from the recommendations at utf8everywhere.org) using Boost.Nowide (not really in Boost yet, I think) to convert from UTF-8 to wchar_t at the last possible moment prior to calling any Windows API functions that write text to dialog boxes, etc. (and using the W forms of these Windows API functions). For console output, you must call the SetConsoleOutputCP-type functions, such as is also described at https://stackoverflow.com/a/1866668/368896.
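For the console-output case, here is a minimal sketch of the SetConsoleOutputCP approach mentioned above. The string is spelled with hex escapes so the bytes are UTF-8 regardless of how the source file is encoded or what the compiler's execution character set turns out to be; whether the console font can actually display the result is a separate problem:
#include <windows.h>
#include <stdio.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8);      // tell the console that the bytes we write are UTF-8
    fputs("caf\xC3\xA9\n", stdout);   // "café" -- \xC3\xA9 is the UTF-8 encoding of U+00E9
    return 0;
}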
Thanks to those who took the time to read the lengthy proposed answer here.
The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.
On the MSDN documentation you linked is:
mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale.
That linked page about locales then references setlocale which is how the behavior of mbstowcs can be affected.
Now, taking a look at your proposed way of passing UTF-8:
mbstowcs (dest, u8"Hello, world!", 1024);
Unfortunately, that isn't going to work properly, as far as I know, once you use interesting data. If it even compiles, it only does so because the compiler would have to be treating the u8 literal the same as a plain char*. And as far as mbstowcs is concerned, it will believe the string is encoded under whatever the locale is set for.
Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.
So that would happen to work for ASCII characters (the first 128 characters) only because they happen to have the exact same binary values in various ANSI encodings as well as UTF-8. If you try with any characters beyond that (for instance anything with an accent or umlaut) then you'll see problems.
Personally, I think mbstowcs and such are rather limited and clunky. I've found the Windows API function MultiByteToWideChar to be more effective in general. In particular it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.
mbstowcs() semantics are defined in terms of the currently installed C locale. If you are processing strings with different encodings you will need to use setlocale() to change what encoding is currently being used. The relevant statement in the C standard is in 7.22.8 paragraph 1: "The behavior of the multibyte string functions is affected by the LC_CTYPE category of the current locale."
I don't know enough about the C library but as far as I know none of these functions is really thread-safe. I consider it much easier to deal with different encodings and, in general, cultural conventions, using the C++ std::locale facilities. With respect to encoding conversions you'd look at the std::codecvt<...> facets. Admittedly, these aren't easy to use, though.
The current locale needs a bit of clarification: the program has a current global locale. Initially, this locale is somehow set up by the system and is possibly controlled by the user's environment in some form. For example, on UNIX systems there are environment variables which choose the initial locale. Once the program is running, it can change the current locale, however. How that is done depends a bit on what is being used exactly: a running C++ program actually has two locales, one used by the C library and one used by the C++ library.
The C locale is used for all locale-dependent functions from the C library, e.g., mbstowcs() but also tolower() and printf(). The C++ locale is used for all locale-dependent functions which are specific to the C++ library. Since C++ uses locale objects, the global locale is just used as the default for entities not setting a locale specifically, primarily the streams (you'd set a stream's locale using s.imbue(loc)). Depending on which locale you set, there are different methods to set the global locale:
For the C locale you use setlocale().
For the C++ locale you use std::locale::global().
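A minimal sketch of both calls, assuming a Windows-style code-page locale name; ".932" (Shift-JIS) is just an example value:
#include <clocale>
#include <locale>

int main()
{
    std::setlocale(LC_ALL, ".932");            // the C locale: affects mbstowcs(), printf(), ...
    std::locale::global(std::locale(".932"));  // the C++ locale (with a named locale this also resets the C locale)
}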

How do I write MBCS files from a UNICODE application?

My question seems to have confused folks. Here's something concrete:
Our code does the following:
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);
Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).
Under UNICODE build target, the above produces a garbage file full of ????.
I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.
Here's the question as it used to exist:
FILE* streams can be opened in t(ranslated) or b(inary) modes.
Desktop applications can be compiled for UNICODE or MBCS (under Windows).
If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non Unicode software").
Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).
However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters) (i.e. I convert the UNICODE to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream)).
I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF, but instead will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't handle the LF->CR+LF translation.
But what I really need is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.
Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files, and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!
Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer is only going to cover Windows desktop applications, compiled for C++ with or without MFC.
Please Note: this pertains to wanting to read in and write out MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode, and you must handle BOM and line feed conversions manually (i.e. on input, you must skip the BOM (if any), and both convert the external encoding to Windows Unicode [i.e. UTF-16LE] as well as convert any CR+LF sequences to LF only; and for output, you must write the BOM (if any), and convert from UTF-16LE to whatever target encoding you want, plus you must convert LF to CR+LF sequences for it to be a properly formatted PC text file).
BEWARE of MS's std C library's puts and gets and fwrite and so on, which, if the stream is opened in text/translated mode, will convert any 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, or a wide character, or a stream of random binary data -- it doesn't care, and all of these functions boil down to doing blind byte conversions in text/translated mode!!!
Also be aware that many of the Windows API functions use CP_ACP internally, without any external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries are operating with the same character locale (CP_ACP and not some other one): since you can't control some of these functions' behavior, you're forced to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-conveters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));
I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
One other thing that one should be aware of: there are two different sets of multibyte functions in the C standard library under VS C++ - those which use the thread's locale for their operations, and another set which use something called the multibyte code page, set via _setmbcp() (and queried via _getmbcp()). This is an actual code page (not a locale) that is used for all narrow string interpretation (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
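A minimal sketch of querying and setting that code page (MSVC-specific; I believe the _MB_CP_ANSI constant lives in <mbctype.h>, but treat the exact names as an assumption):
#include <mbctype.h>
#include <stdio.h>

int main()
{
    printf("current multibyte code page: %d\n", _getmbcp());
    _setmbcp(_MB_CP_ANSI);  // follow the system default (ANSI) code page
}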
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software
The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.
It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless if _UNICODE is defined or not:
FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);
In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide string contains characters that cannot be converted to the current code page, and thus each of them is replaced with a question mark.
Perhaps you could use some other encoding than the current code page. I recommend using the UTF-8 encoding where ever possible.
Update: Testing your example code on a Windows machine running on code page 1252, the call to _fputts returns -1, indicating an error. errno was set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:
When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
This is key information for this error. wctomb will use the locale for the C standard library. By explicitly setting the locale for the C standard library to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded in Shift JIS in the output file.
#include <locale.h>
#include <share.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    setlocale(LC_ALL, ".932");
    FILE * fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    fputws(L"刃物種類\n", fout);
    fclose(fout);
}
An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.
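Here is a minimal sketch of what "handle the conversions yourself" might look like, converting to the system default ANSI code page with WideCharToMultiByte before handing the bytes to the narrow CRT (error handling omitted, names illustrative):
#include <windows.h>
#include <stdio.h>
#include <string>

void WriteNarrow(FILE* fout, const std::wstring& wide)
{
    int len = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                  nullptr, 0, nullptr, nullptr);
    std::string narrow(static_cast<size_t>(len), '\0');
    WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                        &narrow[0], len, nullptr, nullptr);
    fputs(narrow.c_str(), fout);  // a text-mode stream will still expand \n to \r\n
}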
When you compile for UNICODE, the C++ library knows nothing about MBCS. If you open the file for text output, it will attempt to treat the buffers you pass to it as UNICODE buffers.
Also, MBCS is a variable-length encoding. To parse it, the C++ library would need to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".
I would suggest that you either prepare your strings beforehand, or write your own function that writes a string to a file. I'm not sure whether writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, writing everything up to the next \n in one go.
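A minimal sketch of that piecewise idea: the stream is assumed to be opened in binary ("wb") mode and the input is assumed to be already-encoded MBCS bytes, so the function only has to expand \n to \r\n itself (names and buffering are illustrative):
#include <stdio.h>
#include <string.h>

void WriteMbcsText(FILE* f, const char* s)
{
    while (const char* nl = strchr(s, '\n'))
    {
        fwrite(s, 1, (size_t)(nl - s), f);  // everything before the \n in one go
        fwrite("\r\n", 1, 2, f);            // PC-style line terminator
        s = nl + 1;
    }
    fwrite(s, 1, strlen(s), f);             // any trailing piece without a \n
}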