How can I convert a wchar_t* to char* without losing data? - c++

I'm using a Japanese string as a wchar_t*, and I need to convert it to a char*. Is there any method or function to convert wchar_t* to char* without losing data?

It is not enough to say "I have a string as wchar_t". You must also know what encoding the characters of the string are in. This is probably UTF-16, but you need to know for certain.
It is also not enough to say "I want to convert to char". Again, you must make a decision on what encoding the characters will be represented in. JIS? Shift-JIS? EUC? UTF-8? Another encoding?
If you know the answers to the two questions above, you can do the conversion without any problem using WideCharToMultiByte.

What you have to do first is choose the target string encoding, such as UTF-8 or UTF-16. Then encode your wchar_t[] strings in the encoding you chose via libiconv or another similar string-encoding library.
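For example, with iconv the conversion could look roughly like the sketch below. This is only an illustration: the helper name wide_to_utf8 is made up, it assumes the glibc/GNU iconv prototype (char** input pointer) and that the implementation understands the "WCHAR_T" encoding name (GNU iconv does; others may differ).

#include <iconv.h>
#include <stdexcept>
#include <string>

// Minimal sketch: convert a native wchar_t string to UTF-8 with iconv.
std::string wide_to_utf8(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");   // to-encoding, from-encoding
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4 + 4, '\0');      // generous worst-case buffer
    char*  src      = const_cast<char*>(reinterpret_cast<const char*>(in.data()));
    size_t src_left = in.size() * sizeof(wchar_t);
    char*  dst      = &out[0];
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);             // trim the unused space
    return out;
}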

You need to call WideCharToMultiByte and pass in the code page identifier for the Japanese multibyte encoding you want; see MSDN for that function. On Windows, the local multibyte character set is CP932, the MS variation on Shift-JIS. However, you might conceivably want UTF-8 to send to someone who wants it.
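As a rough sketch (the helper name narrow() is just for illustration), the usual two-call pattern looks like this; pass 932 instead of CP_UTF8 if you want the Shift-JIS/CP932 result:

#include <windows.h>
#include <string>

std::string narrow(const std::wstring& wide, UINT codePage = CP_UTF8)
{
    if (wide.empty())
        return std::string();

    // First call: ask how many bytes the converted string will need.
    int len = WideCharToMultiByte(codePage, 0, wide.data(), (int)wide.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');

    // Second call: perform the actual conversion into the buffer.
    WideCharToMultiByte(codePage, 0, wide.data(), (int)wide.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}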

Convert from std::wstring to std::string

I'm converting a wstring to a string with std::codecvt_utf8 as described in this question, but when I try Greek or Chinese alphabet symbols they are corrupted; I can see it in the debugger's Locals window. For example, 日本 became "æ—¥æœ¬".
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv; //also tried codecvt_utf8_utf16
std::string str = myconv.to_bytes(wstr);
What am I doing wrong?
std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.
Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.
But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF-8 but, I suppose, probably code page 1252.
As verification, Python gives the following:
>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'
Your string does seem to be the UTF8 encoding of 日本 interpreted as if it was cp1252 encoded.
Therefore the conversion seems to have worked as intended.
As mentioned by @MarkTolonen in the comments, the encoding to assume for a string variable can be set to UTF-8 in the Visual Studio debugger with the s8 format specifier, as explained in the documentation.
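If you want to convince yourself without relying on the debugger at all, you can dump the bytes of the converted string. This is only a sketch and assumes the compiler builds the wide literal correctly (e.g. with /utf-8 on MSVC):

#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17 but still available)
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    std::wstring wstr = L"日本";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    std::string str = myconv.to_bytes(wstr);

    // A correct conversion prints E6 97 A5 E6 9C AC, the UTF-8 bytes of 日本.
    for (unsigned char c : str)
        std::printf("%02X ", c);
    std::printf("\n");
}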

Can I read åäö from wxWidget wxTextCtrl?

In C++, I have a wxTextCtrl and want the user to input a Swedish translation of a word.
Everything works until the user types å, ä or ö (UTF-8).
Converting wxString to UTF-8 is not the problem; the problem is I cannot even get the text out of the field. For the other strings I just use the line below (where ans is a pointer to the text field) and it works perfectly. Any idea?
std::string ch = std::string((ans->GetValue()));
You can't convert an arbitrary Unicode string to std::string without specifying the encoding. By default, the encoding is that of the current locale which, especially under Windows, is not necessarily UTF-8 which is what you almost certainly want to use precisely because the characters not representable in this encoding will be simply lost during conversion.
So the correct thing to do is to explicitly use ans->GetValue().ToUTF8() and then your std::string will contain UTF-8-encoded representation of your characters. Of course, you need to realize that the string won't be of length 1, even for a single character, so perhaps you need to use std::wstring instead.
P.S. In wxWidgets 3.1.5+ you also have utf8_string() directly returning std::string, so you can also use this one if you have a new enough version.
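A minimal sketch of both suggestions, assuming ans is a wxTextCtrl* as in the question; the helper name read_text is made up:

#include <wx/textctrl.h>
#include <string>

std::string read_text(wxTextCtrl* ans)
{
    // Explicit UTF-8 conversion; works with any wxWidgets 3.x.
    std::string utf8 = std::string(ans->GetValue().ToUTF8());

    // Or, with wxWidgets 3.1.5 and later, utf8_string() returns std::string directly:
    // std::string utf8 = ans->GetValue().utf8_string();

    return utf8;
}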

How are character sets stored in strings and wstrings?

So, I've been trying to do a bit of research on strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII and Unicode, and UTF-8 and UTF-16.
I believe I have an okay understanding of the concept of how these work, but what I'm still having trouble with is how they are actually stored in 'char's, 'string's, 'wchar_t's and 'wstring's.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? And are these types limited to using only these character sets / encodings?
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it automatically decided at compile time, for example, or do we have to explicitly tell it what to use?
From my understanding, UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from a string to a wstring and then use this when a wstring is required, making my code exclusively string-based, or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best as I can.
I'm working in C++, btw.
They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters; you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using unicode, nowadays. On older systems anything might happen.
UTF8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF16), but on GCC it is 32 bits (and could thus be used to store unicode codepoints directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF8, and convert only when needed.
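A quick way to see this difference on your own toolchain (it prints 2 with MSVC and 4 with GCC or Clang on most Unix-like systems):

#include <cstdio>

int main()
{
    std::printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
}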
Which character set and encoding is used for char and wchar_t? And are these types limited to using only these character sets / encodings?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it automatically decided at compile time, for example, or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding, UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same goes for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from a string to a wstring and then use this when a wstring is required, making my code exclusively string-based, or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
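For illustration, the fixed-encoding types and their literals look like this (valid in C++11 through C++17; note that in C++20 the u8 literal produces char8_t and std::u8string instead):

#include <string>

std::u16string s16 = u"日本";   // char16_t elements, UTF-16 encoded
std::u32string s32 = U"日本";   // char32_t elements, one code point per element
std::string    s8  = u8"日本";  // char elements, UTF-8 encoded (pre-C++20 only)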
Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file and the OS console settings.
Types like string and wstring, operations on them and C facilities like strcmp/wcscmp expect fixed-width encodings. So they would not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful not to use fixed-width operations on them. C strings do have some functions for manipulating such strings in the standard library. You can use classes from the <codecvt> header to convert between different encodings for C++ strings.
I would avoid wstring and use the C++11 exact-width character strings: std::u16string or std::u32string.
As an example, here is some info on how Windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16, note this means that some unicode characters will use 2 wchar_t's
If you call a system function, e.g. puts, then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. whether you are using Unicode).
So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings, you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.

Choosing encoding for icu::UnicodeString

I found myself in need of a way to change a string to lower case that was safe to use for ASCII and for UTF-16LE (as found in some Windows registry strings), and came across this question: How to convert std::string to lower case?
The answer that seemed to be the "most correct" to me (I'm not using Boost) was one that demonstrated using the icu library.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
ISO-8859-1 has worked for the few unit tests I've run against ASCII encoded strings that used only Latin characters, but I don't like using it if I don't know why.
If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry has the ability to store things in UTF-16LE and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe stuff.
I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison
UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
Because the author is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString represents a UTF-16 encoded string. If you construct it using a char* as input, you have to specify the correct charset the input data is encoded with so UnicodeString can decode it to Unicode and then re-encode it as UTF-16.
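A small sketch of that approach with ICU, assuming a UTF-8 source file and the usual icu-uc/icu-i18n headers; fromUTF8() avoids guessing a charset name, and caseCompare() removes the need to lower-case anything yourself:

#include <unicode/uchar.h>    // U_FOLD_CASE_DEFAULT
#include <unicode/unistr.h>   // icu::UnicodeString
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString a = icu::UnicodeString::fromUTF8("Straße");
    icu::UnicodeString b = icu::UnicodeString::fromUTF8("STRASSE");

    // Case-insensitive comparison with full case folding (ß folds to ss).
    if (a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0)
        std::cout << "equal ignoring case\n";

    // If you really do need a lower-cased copy back as UTF-8:
    std::string lowered;
    icu::UnicodeString(a).toLower().toUTF8String(lowered);
    std::cout << lowered << "\n";
}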

Why use MultiByteToWideChar to convert std::string to std::wstring?

I want to convert a std::string to std::wstring. There are two approaches which i have come across.
Given a string str, we can convert it into a wide string using the following code:
wstring widestring = std::wstring(str.begin(),str.end());
The other approach is to use MultiByteToWideChar().
What I wanted to understand is: what is the drawback of using the first approach, and how does the second approach solve this problem?
MultiByteToWideChar offers more options (like the ability to select "codepages") and translates non-standard symbols correctly.
The first option doesn't support multibyte encoding. It will iterate through each byte (char) in the string and convert it to a wide character. When you have a string with multibyte encoding, individual characters can take more than one byte, so a standard string iterator is inappropriate.
The MultiByteToWideChar function has support for different multibyte formats, as specified by the codepage parameter.
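A sketch of the second approach for UTF-8 input (the helper name widen() is just for illustration), followed by a note on what the first approach would have done instead:

#include <windows.h>
#include <string>

std::wstring widen(const std::string& utf8)
{
    // Two-call pattern: query the required length, then convert.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                                  nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                        &wide[0], len);
    return wide;
}

// By contrast, std::wstring(str.begin(), str.end()) widens each byte on its
// own, so a two-byte UTF-8 character such as "é" (0xC3 0xA9) comes out as the
// two unrelated wide characters 0x00C3 and 0x00A9.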