Convert a multibyte UTF8 to wchar_t for usage with _wfopen() - c++

There are various threads about similar problems, but after searching and trying a lot I could not find a solution. So here is what I have:
There is a pathname of a file which originally has the name "C:\F\鸡汤饭\abstr.txt". This comes from some internal representation that I do not have access to.
What I get in my application is this string converted to UTF-8 multi-byte handed over as a char array. So in this array I can find the data "C:\F\鸡汤饭\abstr.txt".
Now I want to open the related file. I found _wfopen() could do that job, but it expects a wchar_t string. So I tried to convert this multibyte UTF-8 char array to wchar_t via mbstowcs() - but this does not work, the resulting wchar_t array contains exactly the same data and _wfopen() fails.
So... any idea how I can open this file correctly?

Finally the solution
fs::path p = fs::u8path(u8"要らない.txt");
using std::filesystem did the job properly. Surprisingly, fs::u8path() was only introduced with C++17 but is already deprecated as of C++20 :-O
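For completeness, a minimal sketch of what that looks like end to end (this assumes the UTF-8 pathname arrives at runtime as a plain char array; openUtf8Path and the "rb" mode are only illustrative):

#include <cstdio>
#include <filesystem>
namespace fs = std::filesystem;

// utf8Name is the UTF-8 encoded pathname received from the external component (hypothetical helper)
FILE* openUtf8Path(const char* utf8Name)
{
    // u8path() interprets the bytes as UTF-8 regardless of the current locale
    fs::path p = fs::u8path(utf8Name);
    // On Windows, path::c_str() yields a wchar_t*, which is what _wfopen() wants
    return _wfopen(p.c_str(), L"rb");
}

On C++20, where u8path() is deprecated, constructing the path from a std::u8string achieves the same thing.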

Related

Convert from std::wstring to std::string

I'm converting a wstring to a string with std::codecvt_utf8 as described in this question, but when I try Greek or Chinese alphabet symbols they come out corrupted. I can see it in the debug Locals window, for example 日本 became "æ—¥æœ¬"
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv; //also tried codecvt_utf8_utf16
std::string str = myconv.to_bytes(wstr);
What am I doing wrong?
std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.
Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.
But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF-8, but probably code page 1252.
As verification, python gives the following:
>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'
Your string does seem to be the UTF8 encoding of 日本 interpreted as if it was cp1252 encoded.
Therefore the conversion seems to have worked as intended.
As mentioned by @MarkTolonen in the comments, the encoding to assume for a string variable can be set to UTF-8 in the Visual Studio debugger with the s8 format specifier, as explained in the documentation.

In C++ when to use WCHAR and when to use CHAR

I have a question:
Some libraries use WCHAR as the text parameter and others use CHAR (as UTF-8): I need to know when to use WCHAR or CHAR when I write my own library.
Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (and better a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (instead with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
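As a rough sketch of one such wrapper (UTF-8 to UTF-16 direction only; Utf8ToUtf16 is a hypothetical name and error handling is kept minimal):

#include <stdexcept>
#include <string>
#include <windows.h>

// Converts a UTF-8 encoded std::string to a UTF-16 std::wstring (hypothetical helper).
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call asks how many wchar_t elements the result needs.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    if (len == 0) throw std::runtime_error("invalid UTF-8 sequence");
    std::wstring utf16(len, L'\0');
    // Second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);
    return utf16;
}

The opposite direction uses WideCharToMultiByte with the same two-call pattern (size query, then convert).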
The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer to me, is use char and consider everything utf-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you use strings correctly, though. Some APIs, like fopen(), accept a char* string but treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
In Windows we stick to WCHARs (std::wstring), mainly because if you don't, you end up having to convert all the time when calling Windows functions.
I have a feeling that trying to use utf8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
When developing a Windows application, it is often recommended to use TCHARs. The good thing about TCHARs is that they can be either regular chars or wchar_ts, depending on whether the Unicode setting is set or not. Once you use TCHARs, make sure that all string manipulation functions you use also carry the _t prefix (e.g. _tcslen for the length of a string). That way you know your code will work in both Unicode and ASCII environments.
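For illustration only, a tiny TCHAR-style fragment (a sketch; TCHAR expands to wchar_t when UNICODE/_UNICODE are defined in the project settings and to char otherwise):

#include <stdio.h>
#include <tchar.h>
#include <windows.h>

int _tmain()
{
    // _T() expands to L"..." in Unicode builds and to "..." in ANSI builds
    const TCHAR* msg = _T("Hello");
    size_t len = _tcslen(msg);                       // length in TCHARs, not bytes
    _tprintf(_T("%s has %Iu characters\n"), msg, len);
    return 0;
}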

How to use utf8 character arrays in c++?

Is it possible to have char* strings work with UTF-8 encoding in C++ (VC2010)?
For example if my source file is saved in utf8 and I write something like this:
const char* c = "aäáéöő";
Is it possible to make it UTF-8 encoded? And if yes, how is it possible to use
char* c2 = new char[strlen("aäáéöő")];
for dynamic allocation if characters can be variable length?
The encoding for narrow character string literals is implementation defined, so you'd really have to read the documentation (if you can find it). A quick experiment shows that both VC++ (VC8, anyway) and g++ (4.4.2, anyway) actually just copy the bytes from the source file; the string literal will be in whatever encoding your editor saved it in. (This is clearly in violation of the standard, but it seems to be common practice.)
C++11 has UTF-8 string literals, which allow you to write u8"text" and be assured that "text" is encoded in UTF-8. But I don't really expect it to work reliably: the problem is that in order to do this, the compiler has to know what encoding your source file has. In all probability, compiler writers will continue to ignore the issue, just copying the bytes from the source file, and achieve conformance simply by documenting that the source file must be in UTF-8 for these features to work.
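For what it's worth, a small sketch of the C++11 literal forms (assuming the compiler knows the source encoding; on newer MSVC versions the /utf-8 switch forces both the source and execution character sets to UTF-8, which sidesteps the guessing):

// C++11: the u8 prefix guarantees UTF-8 in the program image, independent of
// the execution character set (note: in C++20 the element type becomes char8_t)
const char* utf8Text = u8"aäáéöő";
// UTF-16 and UTF-32 literals, for comparison:
const char16_t* utf16Text = u"aäáéöő";
const char32_t* utf32Text = U"aäáéöő";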
If the text you want to put in the string is in your source code, make sure your source code file is in UTF-8.
If that doesn't work, try using \u1234, with 1234 being a code point value.
You could also try the UTF8-CPP library.
Take a look at this answer : Using Unicode in C++ source code
See this MSDN article which talks about converting between string types (that should give you examples on how to use them). The strings types that are covered include char *, wchar_t*, _bstr_t, CComBSTR, CString, basic_string, and System.String:
How to: Convert Between Various String Types
There is a hotfix for VisualStudio 2010 SP1 which can help: http://support.microsoft.com/kb/980263.
The hotfix adds a pragma to override Visual Studio's control of the character encoding for the char type:
#pragma execution_character_set("utf-8")
Without the pragma, char* based literals are typically interpreted as the default code page (typically 1252)
This should all be superseded eventually by the new string literal prefix modifiers specified by C++0x (u8, u, and U for UTF-8, UTF-16, and UTF-32 respectively), which ideally will be supported in the next major version of Visual Studio after 2010.
It is possible; save the file in UTF-8 encoding without a BOM signature.
// Save the source file as UTF-8 without a BOM signature
#include <stdio.h>
#include <string.h>
#include <windows.h>

int main() {
    SetConsoleOutputCP(65001);               // switch the console to UTF-8 (code page 65001)
    const char *c1 = "aäáéöő";
    char *c2 = new char[strlen(c1) + 1];     // +1 for the terminating '\0'
    strcpy(c2, c1);
    printf("%s\n", c1);
    printf("%s\n", c2);
    delete[] c2;
}
Result:
D:\Debug>program
aäáéöő
aäáéöő
Redirecting the program's output to a file really does produce a UTF-8 encoded file.
This answer is compiler-independent (compiled on Windows).
(A similar question.)

How can I convert a wchar_t* to char* without losing data?

I'm using a Japanese string as a wchar_t, and I need to convert it to a char*. Is there any method or function to convert wchar_t* to char* without losing data?
It is not enough to say "I have a string as wchar_t". You must also know what encoding the characters of the string are in. This is probably UTF-16, but you need to know definitely.
It is also not enough to say "I want to convert to char". Again, you must make a decision on what encoding the characters will be represented in. JIS? Shift-JIS? EUC? UTF-8? Another encoding?
If you know the answers to the two questions above, you can do the conversion without any problem using WideCharToMultiByte.
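To sketch the call for the UTF-16 to UTF-8 case (Utf16ToUtf8 is a hypothetical helper; for CP_UTF8 the last two parameters must be NULL):

#include <string>
#include <windows.h>

// Hypothetical helper: NUL-terminated UTF-16 wchar_t string -> UTF-8 std::string.
std::string Utf16ToUtf8(const wchar_t* wide)
{
    // With cchWideChar = -1 the returned size includes the terminating '\0'.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &utf8[0], bytes, nullptr, nullptr);
    utf8.pop_back();                         // drop the embedded terminator
    return utf8;
}

Passing a different code page (e.g. 932 for Shift-JIS) instead of CP_UTF8 gives that encoding instead.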
What you have to do first is to choose the string encoding such as UTF-8 or UTF-16. And then, encode your wchar_t[] strings in the encoding you choose via libiconv or other similar string encoding library.
You need to call WideCharToMultiByte and pass in the code page identifier for the Japanese multibyte encoding you want. See MSDN for that function. On Windows, the local multibyte set is CP932, the MS variation of Shift-JIS. However, you might conceivably want UTF-8 to send to someone who wants it.

Read Unicode Files

I have a problem reading and using the content from Unicode files.
I am working on a Unicode release build, and I am trying to read the content from a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or another legacy code page) you run the risk of corrupting/losing data.
Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
So your file might be utf-16, or utf-8 (utf-32 is quite rare).
For utf-16 the endianness might also matter. If there is a BOM, that will help a lot.
Quick steps:
open the file with _wopen or _wfopen as binary
read the first bytes to identify encoding using the BOM
if the encoding is utf-8, read it into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8 (see the sketch after this list)
if the encoding is utf-16be (big endian) read in a wchar_t array and _swab
if the encoding is utf-16le (little endian) read in a wchar_t array and you are done
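A rough sketch of those steps on Windows (ReadTextFile is a hypothetical helper; the fixed buffer size and minimal error handling are only illustrative):

#include <cstdio>
#include <stdlib.h>
#include <string>
#include <vector>
#include <windows.h>

// Hypothetical helper: read a small text file into a wide string, using the BOM.
std::wstring ReadTextFile(const wchar_t* name)
{
    FILE* f = _wfopen(name, L"rb");                       // open as binary
    if (!f) return std::wstring();
    std::vector<unsigned char> raw(1024 * 1024);          // illustrative size cap
    size_t n = fread(raw.data(), 1, raw.size(), f);
    fclose(f);

    std::wstring text;
    if (n >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF) {
        // UTF-8 BOM: convert the remaining bytes with MultiByteToWideChar
        int len = MultiByteToWideChar(CP_UTF8, 0, (const char*)raw.data() + 3, (int)(n - 3), nullptr, 0);
        text.resize(len);
        MultiByteToWideChar(CP_UTF8, 0, (const char*)raw.data() + 3, (int)(n - 3), &text[0], len);
    } else if (n >= 2 && raw[0] == 0xFE && raw[1] == 0xFF) {
        // UTF-16BE BOM: swap byte pairs into native little-endian order with _swab
        text.resize((n - 2) / 2);
        _swab((char*)raw.data() + 2, (char*)&text[0], (int)(n - 2));
    } else if (n >= 2 && raw[0] == 0xFF && raw[1] == 0xFE) {
        // UTF-16LE BOM: the bytes already match the Windows wchar_t layout
        text.assign((const wchar_t*)(raw.data() + 2), (n - 2) / 2);
    }
    // No BOM: you would have to guess (e.g. try CP_UTF8, then fall back to CP_ACP)
    return text;
}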
Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rw, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
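For example, a small sketch of that extension (the file name is illustrative; with ccs=UTF-8 the CRT converts the file's UTF-8 content to wchar_t as you read, and per MSDN a BOM in the file takes precedence over the ccs flag):

#include <stdio.h>
#include <wchar.h>

int main()
{
    // "rt, ccs=UTF-8" opens in text mode; ccs=UTF-16LE and ccs=UNICODE are also accepted
    FILE* f = _wfopen(L"newfile.txt", L"rt, ccs=UTF-8");
    if (!f) return 1;
    wchar_t line[256];
    while (fgetws(line, 256, f))
        wprintf(L"%ls", line);
    fclose(f);
    return 0;
}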
Warning: being cross-platform is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
Unicode is the mapping from numerical codes into characters. The step before Unicode is the file's encoding: how do you transform some consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, the BOM (Byte order marker) is written as the first two bytes in the file: either FF FE or FE FF.
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
BTW, you tagged your question C++ but wrote about fgets and fgetws, not IOStreams; is your problem C++ or C?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++
#include <locale>
std::locale::global(std::locale(""));
Then wide IO (wstream, fgetws) should work if your environment is correctly set for Unicode. If not, you'll have to change your environment (I don't know how that works under Windows; for Unix, setting the LC_ALL variable is the way, see locale -a for supported values). Alternatively, replacing the empty string with an explicit locale name would also work, but then you hardcode the locale in your program and your users perhaps won't appreciate that.
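A minimal C++ sketch of that approach (assuming the environment locale, e.g. something like en_US.UTF-8, matches the file's encoding; the file name is illustrative):

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Use the user's environment locale for all streams created afterwards
    std::locale::global(std::locale(""));

    std::wifstream in("unicode.txt");
    in.imbue(std::locale());                 // make sure this stream uses it too
    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L'\n';
}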
If your system doesn't support an adequate locale, in C++ you have the possibility of writing a facet for the conversion yourself. But that is outside the scope of this answer.
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8 encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as well as the rest of the library).
If this is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.