I have a problem reading and using the content of Unicode files.
I am working on a Unicode release build, and I am trying to read the content of a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting or losing data.
Since you are "working on a Unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
Your file might be UTF-16 or UTF-8 (UTF-32 is quite rare).
For UTF-16 the endianness also matters. If there is a BOM, that will help a lot.
Quick steps:
open the file with _wfopen (or _wopen) in binary mode
read the first few bytes to identify the encoding from the BOM, if present
if the encoding is UTF-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
if the encoding is UTF-16BE (big endian), read into a wchar_t array and byte-swap with _swab
if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt, ccs=<encoding>"); with the encoding being UTF-8, UTF-16LE, or UNICODE). When reading, it can also detect the encoding from the BOM. See the sketch below.
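A minimal sketch of that extension (Visual Studio CRT assumed; "input.txt" is a placeholder file name, and error handling is minimal):

#include <stdio.h>
#include <wchar.h>

int main()
{
    // ccs= asks the CRT to hand us wchar_t data; a BOM in the file, if present,
    // overrides the encoding named here (so UTF-16LE files also work).
    FILE* f = _wfopen(L"input.txt", L"rt, ccs=UTF-8");
    if (!f)
        return 1;

    wchar_t line[512];
    while (fgetws(line, 512, f)) {
        // 'line' now holds wide (UTF-16 on Windows) characters; process as needed.
    }

    fclose(f);
    return 0;
}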
Warning: being cross-platform is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
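A minimal sketch of the workaround, assuming a UTF-16LE file on Windows (where wchar_t is 16 bits) and a placeholder file name: open the file in binary mode so no multibyte conversion happens, skip the BOM yourself, and read straight into wchar_t.

#include <stdio.h>
#include <wchar.h>

int main()
{
    FILE* f = fopen("input.txt", "rb");        /* binary: no text-mode conversion */
    if (!f)
        return 1;

    wchar_t bom = 0;
    if (fread(&bom, sizeof bom, 1, f) != 1 || bom != 0xFEFF)
        fseek(f, 0, SEEK_SET);                 /* no BOM found: rewind and read it all */

    wchar_t buf[256];
    size_t n = fread(buf, sizeof(wchar_t), 255, f);
    buf[n] = L'\0';                            /* buf now holds raw UTF-16LE text */

    fclose(f);
    return 0;
}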
Unicode is the mapping from numerical codes into characters. The step before Unicode is the file's encoding: how do you transform some consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, a BOM (byte order mark) is written at the start of the file: FF FE or FE FF for UTF-16, or EF BB BF for UTF-8.
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
BTW, you tagged your question C++, but you wrote about fgets and fgetws, not IOStreams; is your problem C++ or C?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++
#include <locale>
std::locale::global(std::locale(""));
Then wide IO (wstream, fgetws) should work if your environment is correctly set up for Unicode. If not, you'll have to change your environment (I don't know how that works under Windows; for Unix, setting the LC_ALL environment variable is the way, see locale -a for supported values). Alternatively, replacing the empty string with an explicit locale name would also work, but then you hardcode the locale in your program and your users perhaps won't appreciate that.
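A minimal sketch of the C++ variant, assuming the environment locale is UTF-8 capable and that "input.txt" (a placeholder) is encoded accordingly:

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale::global(std::locale(""));   // pick up the environment's locale
    std::wcout.imbue(std::locale());        // wcout already exists, so imbue it explicitly

    std::wifstream in("input.txt");         // streams created from now on use the global locale
    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L'\n';
    return 0;
}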
If your system doesn't support an adequate locale, in C++ you have the possibility to write a facet for the conversion yourself. But that is outside the scope of this answer.
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8-encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as well as the rest of the library).
If that is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.
General question
Is there a possibility to avoid character set conversion when writing to std::cout / std::cerr?
I do something like
std::cout << "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" << std::endl;
And I want the output to be written to the console maintaining the UTF-8 encoding (my console uses UTF-8 encoding, but my C++ Standard Library, GNUs libstdc++, doesn't think so for some reason).
If there's no possibility to forbid character encoding conversion: Can I set std::cout to use UTF-8, so it hopefully figures out itself that no conversion is needed?
Background
I used the Windows API function SetConsoleOutputCP(CP_UTF8); to set my console's encoding to UTF-8.
The problem seems to be that UTF-8 does not match the code page typically used for my system's locale, and libstdc++ therefore sets up std::cout with the default ANSI code page instead of correctly recognizing the switch.
Edit: Turns out I misinterpreted the issue and the solution is actually a lot simpler (or not...).
The "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" was just meant as a placeholder (and I shouldn't have used it as it has hidden the actual issue).
In my real code the "UTF-8 string" is a Glib::ustring, and those are by definition UTF-8 encoded.
However I did not realize that the output operator << was defined in glibmm in a way that forces character set conversion.
It uses g_locale_from_utf8() internally which in turn uses g_get_charset() to determine the target encoding.
Unfortunately the documentation for g_get_charset() states
On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.
which simply means that glib will neither respect the C locale I set nor attempt to determine the encoding my console actually uses, which basically makes it impossible to use many glib functions to create UTF-8 output. (As a matter of fact this also means that this issue has the exact same cause as the issue that triggered my other question: Force UTF-8 encoding in glib's "g_print()".)
I'm currently considering this a bug in glib (or a serious limitation at best) and will probably open a report in the issue tracker for it.
You are looking at the wrong side: you are talking about a string literal included in your source code (not input from your keyboard), and for that to work properly you have to tell the compiler which encoding is used for those characters (I think the first C++ spec that mentions non-ASCII character sets is C++11).
Since you are actually using the Unicode character set, you either have to encode the characters in at least a wchar_t to be treated as such, or agree with the compiler (probably this is what happens) that Unicode characters in string literals will be UTF-8 encoded. This commonly means they will be printed as UTF-8 and, if you use a UTF-8 compliant console device, they will be printed fine, without any other problem.
GCC has options to specify the encoding used in string literals in a source file (-finput-charset and -fexec-charset), and Clang has similar ones. Check the documentation; this will probably solve any issues. But the most portable thing is not to depend on the code set, or to use one like ISO-10646 (and know that full Unicode coverage is not only UTF-8; UTF-8 is just one way to encode, and thus represent, Unicode characters).
Another issue is that C++11 doesn't refer to the Unicode Consortium standard but to the ISO counterpart (ISO-10646, I think). The two are similar but not identical, and the character encodings are similar but not equal (the ISO code space is 32 bits while the Unicode Consortium's is 21 bits, for example). These and other differences make some tricks in C++ go wrong and produce problems when one is thinking in strict Unicode terms.
Of course, to output correct strings on a UTF-8 terminal, you have to encode the code points as UTF-8 before sending them to the terminal. This is true even if they are already stored UTF-8 encoded in a string object. If you say they are already UTF-8, no conversion is made at all... but if you don't say so, the usual assumption is that you are using plain code points limited to 8 bits, which then get encoded to UTF-8 before printing. This leads to double encoding: something like ú (code point U+00FA) should be encoded in UTF-8 as the byte sequence { 0xC3, 0xBA }, but if you don't say the string literal is indeed UTF-8, both bytes will be treated as the character codes for Ã (U+00C3) and º (U+00BA) and re-encoded as { 0xC3, 0x83, 0xC2, 0xBA }, which displays them incorrectly. This is a very common error and you have probably seen it wherever some encoding is handled incorrectly.
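A small self-contained sketch of that double-encoding error; the byte values are spelled out explicitly so the example doesn't depend on the source file's own encoding:

#include <cstdio>

int main()
{
    const char correct[] = "\xC3\xBA";           // "ú" (U+00FA) encoded once as UTF-8
    const char doubled[] = "\xC3\x83\xC2\xBA";   // the same two bytes re-encoded as if they were Ã and º

    std::printf("%s\n%s\n", correct, doubled);   // on a UTF-8 terminal: "ú", then the mojibake "Ãº"
    return 0;
}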
I ported some code from C to C++ and have just found a problem with paths that contain em-dash, e.g. "C:\temp\test—1.dgn". A call to fstream::open() will fail, even though the path displays correctly in the Visual Studio 2005 debugger.
The weird thing is that the old code that used the C library fopen() function works fine. I thought I'd try my luck with the wfstream class instead, and then found that converting my C string using mbstowcs() loses the em-dash altogether, meaning it also fails.
I suppose this is a locale issue, but why isn't em-dash supported in the default locale? And why can't fstream handle an em-dash? I would have thought any byte character supported by the Windows filesystem would be supported by the file stream classes.
Given these limitations, what is the correct way to handle opening a file stream that may contain valid Windows file names that doesn't just fall over on certain characters?
The em-dash character is encoded as U+2014: 0x14 0x20 in UTF-16 little endian, 0xE2 0x80 0x94 in UTF-8, and with other codes (or no code at all) depending on the character set and code page used. The Windows-1252 code page (very common in western European languages) has the em-dash at 0x97.
Windows internally manages paths as UTF-16, so every time a function is called through its so-called ANSI interface (functions ending with A) the path is converted from the user's current code page to UTF-16.
On the other hand, the C and C++ runtime libraries may be implemented on top of either the "ANSI" interface or the "Unicode" interface (functions ending in W). In the first case, the code page used to represent the string must match the system code page. In the second case, either we use UTF-16 strings directly from the beginning, or the functions used to convert to UTF-16 must be configured to use the same code page as the source string.
Yes, it is a complex problem. And there are several wrong (or problematic) proposals to solve it:
Use wfstream instead of fstream: wfstream does nothing different from fstream where paths are concerned. Nothing. It just means "treat the stream of bytes as wchar_t". (And it does that in a different way than one might expect, which makes the class useless in most cases, but that is another story.) To use the Unicode interface in the Visual Studio implementation, there are overloaded constructors and open() functions that accept const wchar_t*. These overloads exist for both fstream and wfstream. Use fstream with the right open().
mbstowcs(): The problem here is which locale (which contains the code page of the string) to use. If the default locale happens to match the system one, fine. If not, you can try _mbstowcs_l(). But these are unsafe C functions, so you have to be careful with the buffer size. Anyway, this approach only makes sense if the path to convert is obtained at runtime. If it is a static string known at compile time, it is better to use it directly in your code.
L"C:\\temp\\test—1.dgn": The L prefix on the string doesn't mean "convert this string to UTF-16" (source code tends to be stored as 8-bit characters), at least not in the Visual Studio implementation. The L prefix essentially means "add a 0x00 byte after each character between the quotes". So —, the byte 0x97 in a narrow (ordinary) string, becomes 0x97 0x00 in a wide (L-prefixed) string, not 0x14 0x20. It is better to use its universal character name instead: L"C:\\temp\\test\u20141.dgn"
One popular approach is to always use either UTF-8 or UTF-16 in your code and to convert only when strictly necessary. When converting a string in a specific code page to UTF-8 or UTF-16, first identify the right code page, then use conversion functions appropriate to where the string comes from. If you get your string from an XML file, the code page used is usually stated there (and tends to be UTF-8). If it comes from a Windows control, use a Windows API function like MultiByteToWideChar (CP_ACP, i.e. GetACP(), usually works as the default code page).
Always use fstream (not wfstream), and use its wide interfaces (open and constructor), not its narrow ones. (You can again use MultiByteToWideChar to convert from UTF-8 to UTF-16.)
There are several articles and posts with advice on this approach. One that I recommend: http://www.nubaria.com/en/blog/?p=289.
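A minimal sketch of that approach (Windows and the Visual Studio CRT assumed; Utf8ToUtf16 is a hypothetical helper written for this example): keep the path as UTF-8 inside the program and convert to UTF-16 only at the boundary, using fstream's wide constructor/open().

#include <windows.h>
#include <fstream>
#include <string>

// Hypothetical helper for this example: UTF-8 -> UTF-16 via MultiByteToWideChar.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring wide(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    wide.resize(static_cast<size_t>(len) - 1);   // drop the terminating null the API wrote
    return wide;
}

int main()
{
    // Em-dash spelled as explicit UTF-8 bytes so the source encoding doesn't matter.
    std::string path = "C:\\temp\\test\xE2\x80\x94" "1.dgn";
    std::ifstream in(Utf8ToUtf16(path).c_str());   // the wide overload is an MSVC extension
    return in.is_open() ? 0 : 1;
}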
This should work, provided that everything you do is in wide-char notation with wide-char functions. That is, use wfstream, but instead of using mbstowcs, use wide-string literals prefixed with the L char:
const wchar_t* filename = L"C:\\temp\\test—1.dgn";
Also, make sure your source file is saved as UTF-8 in Visual Studio; otherwise the em-dash in the literal could be misinterpreted.
Posting this solution for others who run into this. The problem is that the C runtime starts with the "C" locale by default, and the em-dash (0x97) is defined in the Windows-1252 code page but is unmapped in the plain ASCII table used by the "C" locale. So the simple solution is to call:
setlocale ( LC_ALL, "" );
prior to fstream::open(). This sets the current locale to the OS-defined code page. In my program, the file I wanted to open with fstream was specified by the user, so its name was in the system-defined code page (Windows-1252).
So while fiddling with Unicode and wide chars may be a way to avoid unmapped characters, it wasn't the root of the problem. The actual problem was that the input string's code page (Windows-1252) didn't match the active locale ("C") used by default in Windows programs.
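A minimal sketch of that fix, using the em-dash path from the question; it assumes the system code page is Windows-1252, and the em-dash is therefore written here as the explicit byte 0x97:

#include <clocale>
#include <fstream>

int main()
{
    std::setlocale(LC_ALL, "");                        // switch from "C" to the OS-defined code page
    std::ifstream in("C:\\temp\\test\x97" "1.dgn");    // 0x97 = em-dash in Windows-1252
    return in.is_open() ? 0 : 1;
}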
So, I've been trying to do a bit of research on strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII, Unicode, UTF-8 and UTF-16.
I believe I have an okay understanding of how these work conceptually, but what I'm still having trouble with is how they are actually stored in chars, strings, wchar_ts and wstrings.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? And are these types limited to using only those character sets / encodings?
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to tell it explicitly what to use?
From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code points 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and use it wherever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best I can.
I'm working in C++, by the way.
They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using Unicode nowadays. On older systems anything might happen.
UTF-8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained quite well here: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF-16), but on GCC it is 32 bits (and could thus be used to store Unicode code points directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF-8, and convert only when needed.
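A tiny sketch that makes the platform difference visible:

#include <cstdio>

int main()
{
    // Typically prints 2 with Visual Studio (UTF-16 code units) and 4 with
    // GCC on Linux (a whole code point fits in one wchar_t).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}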
Which character set and encoding is used for char and wchar_t? And are these types limited to using only those character sets / encodings?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to tell it explicitly what to use?
The compiler knows what is appropriate for each system.
From my understanding, UTF-8 uses 1 byte for the first 128 code points in the set but can use more than 1 byte for code points 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same goes for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and use it wherever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
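A minimal sketch of those fixed-encoding types (C++11 or later):

#include <string>

int main()
{
    std::u16string s16 = u"caf\u00E9";   // UTF-16 code units (char16_t), encoding fixed by the standard
    std::u32string s32 = U"caf\u00E9";   // UTF-32 code points (char32_t)
    return static_cast<int>(s16.size() + s32.size());   // 4 + 4 for "café"
}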
Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file and the OS console settings.
Types like string and wstring, operations on them, and C facilities like strcmp/wcscmp expect fixed-width encodings, so they would not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful not to apply fixed-width operations to them. C strings do have some functions for manipulating such strings in the standard library. You can use the classes from the <codecvt> header to convert between different encodings for C++ strings (see the sketch below).
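A minimal sketch of the <codecvt> route (added in C++11, deprecated since C++17 but still shipped by the major implementations):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::string utf8 = "gr\xC3\xBC\xC3\x9F dich";   // "grüß dich" as explicit UTF-8 bytes
    std::u16string utf16 = conv.from_bytes(utf8);   // UTF-8  -> UTF-16
    std::string back = conv.to_bytes(utf16);        // UTF-16 -> UTF-8
    return back == utf8 ? 0 : 1;                    // round-trips cleanly
}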
I would avoid wstring and use the C++11 exact-width character strings: std::u16string or std::u32string.
As an example, here is some info on how Windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16; note this means that some Unicode characters will use two wchar_ts (a surrogate pair)
If you call a system function, e.g. puts, then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. whether you are using Unicode); see the sketch after this list.
So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.
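A minimal sketch of the puts/_putws switch mentioned in the list above (Windows and <tchar.h> assumed): the same line of source maps to the narrow or the wide CRT function depending on whether _UNICODE is defined.

#include <tchar.h>
#include <stdio.h>

int _tmain(void)
{
    _putts(_T("hello"));   // _putws(L"hello") under _UNICODE, puts("hello") otherwise
    return 0;
}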
My question seems to have confused folks. Here's something concrete:
Our code does the following:
FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);
Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).
Under UNICODE build target, the above produces a garbage file full of ????.
I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.
Here's the question as it used to exist:
FILE* streams can be opened in t(ranslated) or b(inary) modes. Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non-Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters) (i.e. I convert the UNICODE to the system's default code page and then write that to the stream using, for example, fwrite(pszNarrow, 1, length, stream)).

I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF, but instead UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't handle the LF->CR+LF translation.

But what I really need is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!
Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).
My answer is only going to cover Windows desktop applications, compiled for C++ with or without MFC.
Please Note: this pertains to wanting to read in and write out MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode, and you must handle BOM and line feed conversions manually (i.e. on input, you must skip the BOM (if any), and both convert the external encoding to Windows Unicode [i.e. UTF-16LE] as well as convert any CR+LF sequences to LF only; and for output, you must write the BOM (if any), and convert from UTF-16LE to whatever target encoding you want, plus you must convert LF to CR+LF sequences for it to be a properly formatted PC text file).
BEWARE of MS's standard C library's puts and gets and fwrite and so on: if the stream was opened in text/translated mode, they will convert any 0x0A into a 0x0D 0x0A sequence on write, and the reverse on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data -- they don't care, and all of these functions boil down to doing blind byte conversions in text/translated mode!!!
Also be aware that many of the Windows API functions use CP_ACP internally, without any external control over their behavior (e.g. WritePrivateProfileString()). Hence one might want to ensure that all libraries operate with the same character locale: CP_ACP and not some other one. Since you can't control some of these functions' behavior, you're forced to conform to their choice or not use them at all.
If using MFC, one needs to:
// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE
For C++ and C libraries, one must tell the libraries to use the system code page:
// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the LC_CTYPE, not collation or date/time formatting
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), LC_CTYPE));
I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
The build target is UNICODE, and for most of our I/O, we use explicit string conversions before outputting via CStringA(my_wide_string).
One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++ - those which use the thread's locale for their operations, and another set which use the code page set by _setmbcp() (which you can query via _getmbcp()). This is an actual code page, not a locale, and it is used for all narrow string interpretation by the _mbs* functions (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).
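A minimal sketch (Windows and the Visual Studio CRT assumed) that simply queries the two separate "narrow string" code-page settings just described:

#include <windows.h>
#include <mbctype.h>   // _getmbcp / _setmbcp
#include <cstdio>

int main()
{
    std::printf("GetACP()   = %u\n", GetACP());     // system default ANSI code page
    std::printf("_getmbcp() = %d\n", _getmbcp());   // code page used by the _mbs* functions
    return 0;
}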
Useful reference materials:
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software
The C library has support for both narrow (char) and wide (wchar_t) strings. In Windows these two types of strings are called MBCS (or ANSI) and Unicode respectively.
It is fully possible to use the narrow functions even though you have defined _UNICODE. The following code should produce the same output, regardless if _UNICODE is defined or not:
FILE* f = fopen("foo.txt", "wt");
fputs("foo\nbar\n", f);
fclose(f);
In your question you wrote: "I convert the UNICODE to the system's default code page and write that to the stream". This leads me to believe that your wide strings contain characters that cannot be converted to the current code page, so each of them is replaced with a question mark.
Perhaps you could use some other encoding than the current code page. I recommend using the UTF-8 encoding wherever possible.
Update: Testing your example code on a Windows machine running on code page 1252, the call to _fputts returns -1, indicating an error. errno was set to EILSEQ, which means "Illegal byte sequence". The MSDN documentation for fopen states that:
When a Unicode stream-I/O function operates in text mode (the default), the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
This is key information for this error. wctomb will use the locale for the C standard library. By explicitly setting the locale for the C standard library to code page 932 (Shift JIS), the code ran perfectly and the output was correctly encoded in Shift JIS in the output file.
#include <locale.h>
#include <share.h>   /* for _SH_DENYNO */
#include <stdio.h>
#include <wchar.h>

int main()
{
    setlocale(LC_ALL, ".932");   /* Shift-JIS (code page 932) for the narrow-string conversions */
    FILE * fout = _wfsopen(L"丸穴種類.txt", L"w", _SH_DENYNO);
    fputws(L"刃物種類\n", fout);
    fclose(fout);
}
An alternative (and perhaps preferable) solution to this would be to handle the conversions yourself before calling the narrow string functions of the C standard library.
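A minimal sketch of that alternative (Windows assumed; "out.txt" is a placeholder, and the text is the wide string from the example): convert to the system code page yourself with WideCharToMultiByte, then hand the bytes to the narrow CRT functions in text mode.

#include <windows.h>
#include <cstdio>
#include <string>

int main()
{
    const wchar_t* wide = L"\u5203\u7269\u7A2E\u985E\n";   // 刃物種類 written as universal character names

    // Will produce '?' replacements unless the system ANSI code page (e.g. 932) covers these characters.
    int len = WideCharToMultiByte(CP_ACP, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    if (len <= 0)
        return 1;
    std::string narrow(static_cast<size_t>(len), '\0');
    WideCharToMultiByte(CP_ACP, 0, wide, -1, &narrow[0], len, nullptr, nullptr);

    FILE* f = std::fopen("out.txt", "wt");   // text mode: '\n' becomes CR+LF
    if (!f)
        return 1;
    std::fputs(narrow.c_str(), f);           // fputs stops at the terminating '\0'
    std::fclose(f);
    return 0;
}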
When you compile for UNICODE, the C++ library knows nothing about MBCS. If you say you are opening the file for outputting text, it will attempt to treat the buffers you pass to it as Unicode buffers.
Also, MBCS is a variable-length encoding. To parse it, the C++ library would need to iterate over characters, which is of course impossible when it knows nothing about MBCS. Hence it's impossible to "just handle line terminators correctly".
I would suggest that you either prepare your strings beforehand, or make your own function that writes a string to a file (see the sketch below). Not sure whether writing characters one by one would be efficient (measurements required), but if not, you can handle strings piecewise, writing everything that doesn't contain \n in one go.
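A minimal sketch of the "write it yourself" suggestion ("out.txt" and the helper name are placeholders): open the file in binary mode and expand LF to CR+LF by hand, so the already-encoded MBCS bytes pass through untouched.

#include <cstdio>

void write_mbcs(FILE* f, const char* s)
{
    for (const char* p = s; *p; ++p) {
        if (*p == '\n')
            std::fputc('\r', f);   // do the text-mode translation ourselves
        std::fputc(*p, f);
    }
}

int main()
{
    FILE* f = std::fopen("out.txt", "wb");   // binary: the CRT leaves every byte alone
    if (!f)
        return 1;
    write_mbcs(f, "line one\nline two\n");
    std::fclose(f);
    return 0;
}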
I have a Windows application written in C++. In it we check whether a file name is Unicode or not using the wcstombs() function: if the conversion fails, we assume the file name is Unicode. But when I tried the same on Linux, the conversion doesn't fail. I know that on Windows the default charset is Latin, whereas the default charset on Linux is UTF-8. Based on whether the file name is Unicode or not, we take different code paths. Since I couldn't figure this out on Linux, I can't make my application portable for Unicode characters. Is there any other workaround for this, or am I doing something wrong?
UTF-8 has the nice property that all ASCII characters are represented as in ASCII, and all non-ASCII characters are represented as sequences of two or more bytes >= 128. So all you have to check for ASCII is the numerical magnitude of each unsigned byte: if it is >= 128, the name is non-ASCII, which with UTF-8 as the basic encoding means "Unicode" (even if within the range of Latin-1; note that Latin-1 is a proper subset of Unicode, constituting the first 256 code points).
However, note that while on Windows a filename is a sequence of characters, on *nix it is a sequence of bytes.
And so ideally you should really ignore what those bytes might encode.
That might be difficult to reconcile with a naïve user's view, though.
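A minimal sketch of such a byte-level check (the helper name is made up for this example); it behaves the same on Windows and Linux because it never tries to decode the bytes:

#include <string>

bool contains_non_ascii(const std::string& name)
{
    for (unsigned char c : name)
        if (c >= 128)      // any byte >= 0x80 is non-ASCII (e.g. a UTF-8 lead or continuation byte)
            return true;
    return false;
}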