Is there any conceivable reason why I would see different results using Unicode string literals versus the actual hex value for the UChar?
UnicodeString s1(0x0040); // @ sign
UnicodeString s2("\u0040");
s1 isn't equivalent to s2. Why?
How a \u escape sequence in a narrow string literal gets encoded depends on the compiler's execution character set, which is implementation-defined, so it's hard to say why the two are not equivalent without knowing the details of your particular compiler. Either way, it's simply not a safe way of doing things.
UnicodeString has a constructor taking a UChar and one for UChar32. I'd be explicit when using them:
UnicodeString s(static_cast<UChar>(0x0040));
UnicodeString also provides an unescape() method that's fairly handy:
UnicodeString s = UNICODE_STRING_SIMPLE("\\u4ECA\\u65E5\\u306F").unescape(); // 今日は
I couldn't reproduce this on ICU 4.8.1.1:
#include <stdio.h>
#include "unicode/unistr.h"

int main(int argc, const char *argv[]) {
    UnicodeString s1(0x0040); // @ sign
    UnicodeString s2("\u0040");

    printf("s1==s2: %s\n", (s1==s2)?"T":"F");
    // printf("s1.equals s2: %d\n", s1.equals(s2));
    printf("s1.length: %d s2.length: %d\n", s1.length(), s2.length());
    printf("s1.charAt(0)=U+%04X s2.charAt(0)=U+%04X\n", s1.charAt(0), s2.charAt(0));
    return 0;
}
=>
s1==s2: T
s1.length: 1 s2.length: 1
s1.charAt(0)=U+0040 s2.charAt(0)=U+0040
gcc 4.4.5 RHEL 6.1 x86_64
For anyone else who finds this, here's what I found (in ICU's documentation):
The compiler's and the runtime character set's codepage encodings are
not specified by the C/C++ language standards and are usually not a
Unicode encoding form. They typically depend on the settings of the
individual system, process, or thread. Therefore, it is not possible
to instantiate a Unicode character or string variable directly with
C/C++ character or string literals. The only safe way is to use
numeric values. It is not an issue for User Interface (UI) strings
that are translated.
[1] http://userguide.icu-project.org/strings
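For completeness, here is a minimal sketch of the "numeric values" approach the quoted documentation recommends, building the string from explicit UTF-16 code units (the helper name makeAtSign is mine, and I'm assuming the default ICU namespace setting so UnicodeString is visible unqualified):

#include "unicode/unistr.h"

UnicodeString makeAtSign() {
    static const UChar units[] = { 0x0040 };   // U+0040 COMMERCIAL AT
    return UnicodeString(units, 1);            // length-counted UChar buffer constructor
}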
The double quotes in your \u constant are the problem. This evaluated properly:
wchar_t m1( 0x0040 );
wchar_t m2( '\u0040' );
bool equal = ( m1 == m2 );
equal was true.
This is the way I try to do it:
#include <stdio.h>
#include <windows.h>
using namespace std;

int main() {
    SetConsoleOutputCP(CP_UTF8);
    // German chars won't appear
    char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
    int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
    wchar_t *unicode_text = new wchar_t[len];
    MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
    wprintf(L"%s", unicode_text);
}
The effect is that only US-ASCII characters are displayed. No errors are shown. The source file is encoded in UTF-8.
So, what am I doing wrong here?
to WouterH:
int main() {
    SetConsoleOutputCP(CP_UTF8);
    const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", unicode_text);
}
This also doesn't work; the effect is just the same. My font is, of course, Lucida Console.
Third take:
#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT 0x20000
#include <fcntl.h>
using namespace std;

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", u_text);
}
OK, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz.
By default the wide print functions on Windows do not handle characters outside the ASCII range.
There are a few ways to get Unicode data to the Windows console.
1. Use the console API directly, WriteConsoleW. You'll have to ensure you're actually writing to a console and use other means when the output is to something else.
2. Set the mode of the standard output file descriptors to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console then they cause the output stream of bytes to be UTF-16 and UTF-8 respectively. N.B. after setting these modes the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.
3. UTF-8 text can be printed directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
The problem with the third method is this:
putchar('\302'); putchar('\260'); // doesn't work with CP_UTF8
puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8
Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.
You might solve it by implementing your own streambuf subclass which handles the conversion to wchar_t correctly, i.e. accounting for the fact that the bytes of a multibyte character may arrive separately, and maintaining conversion state between writes (e.g., std::mbstate_t).
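Here is a rough sketch of that idea (my own, not the answer's code): a streambuf that keeps the bytes of an unfinished UTF-8 sequence between writes and sends only whole characters to the console via WriteConsoleW. Error handling and non-console output are omitted.

#include <streambuf>
#include <string>
#include <windows.h>

class ConsoleUtf8Buf : public std::streambuf {
    std::string pending_;                              // bytes of an unfinished UTF-8 sequence
    HANDLE con_ = GetStdHandle(STD_OUTPUT_HANDLE);

    static int seqLen(unsigned char lead) {            // expected sequence length from the lead byte
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1;                                      // invalid lead byte: pass through
    }

    int_type overflow(int_type ch) override {
        if (ch == traits_type::eof()) return 0;
        pending_.push_back(static_cast<char>(ch));
        // Flush every complete character sitting at the front of the buffer.
        while (!pending_.empty()) {
            int need = seqLen(static_cast<unsigned char>(pending_[0]));
            if (static_cast<int>(pending_.size()) < need) break;     // wait for more bytes
            wchar_t wide[2];                                         // room for a surrogate pair
            int n = MultiByteToWideChar(CP_UTF8, 0, pending_.data(), need, wide, 2);
            DWORD written;
            WriteConsoleW(con_, wide, n, &written, nullptr);
            pending_.erase(0, need);
        }
        return ch;
    }
};

You would then swap it in with something like std::cout.rdbuf(&myConsoleBuf), keeping the returned old buffer so you can restore it before the stream is destroyed.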
Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:
// Includes needed for _setmode() and _fileno()
#include <io.h>
#include <fcntl.h>
#include <stdio.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
    wprintf(L"%s", unicode_text);
    return 0;
}
Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);
// Save as UTF-8 without signature
#include <stdio.h>
#include <windows.h>

int main() {
    SetConsoleOutputCP(65001);
    const char unicode_text[] = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
    printf("%s\n", unicode_text);
}
Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz
I had similar problems, but none of the existing answers worked for me. Something else I observed is that, if I stick UTF-8 characters in a plain string literal, they would print properly, but if I tried to use a UTF-8 literal (u8"text"), the characters get butchered by the compiler (proved by printing out their numeric values one byte at a time; the raw literal had the correct UTF-8 bytes, as verified on a Linux machine, but the UTF-8 literal was garbage).
After some poking around, I found the solution: the MSVC /utf-8 compiler option. With that, everything Just Works; my sources are UTF-8, I can use explicit UTF-8 literals, and output works with no other changes needed.
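For reference, on the command line the option would be passed like this (the file name is only an example):

cl /utf-8 /EHsc main.cpp

In the Visual Studio IDE it can go into the project's Additional Options box under C/C++ > Command Line.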
The console can be set to display UTF-8 characters: as @vladasimovic answers, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare your console with the DOS command chcp 65001 or with the system call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.
To check for UTF-8 support, run:

#include <stdio.h>
#include <windows.h>

BOOL CALLBACK showCPs(LPTSTR cp) {
    puts(cp);
    return TRUE;
}

int main() {
    EnumSystemCodePages(showCPs, CP_SUPPORTED);
}
65001 should appear in the list.
The Windows console uses OEM codepages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing chars (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points out in his answer, the Windows console is not a stream device; you need to write all bytes of a multibyte character at once. Sometimes printf messes up the job, putting the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or build up the output and fflush the accumulated buffer in one go.
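As a small illustration of that workaround (the escape bytes below are just the UTF-8 encoding of "př", used as an example):

char line[16];
sprintf(line, "%s", "p\xC5\x99");  // assemble the complete multibyte string first
puts(line);                        // hand it to the console in a single call
fflush(stdout);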
If everything fails
Note the UTF-8 format: one character is encoded as 1-4 bytes. Use this function to shift to the next character in the string:
// Advances the pointer by len UTF-8 characters (code points).
const char* ucshift(const char* str, int len=1) {
    for(int i=0; i<len; ++i) {
        if(*str==0) return str;
        if(*str<0) {                      // lead byte of a multibyte sequence (assumes signed char)
            unsigned char c = *str;
            while((c<<=1)&128) ++str;     // skip one continuation byte per extra leading 1 bit
        }
        ++str;
    }
    return str;
}
...and this function to transform the bytes into a Unicode code point:
int ucchar(const char* str) {
    if(!(*str&128)) return *str;          // plain ASCII byte
    unsigned char c = *str, bytes = 0;
    while((c<<=1)&128) ++bytes;           // count the continuation bytes that follow
    int result = 0;
    for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
    int mask = 1;
    for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
    result|= (*str&mask)<<(6*bytes);      // add the data bits of the lead byte
    return result;
}
Then you can try the WinAPI function MultiByteToWideChar (note that, unlike the CRT conversion functions, it does not depend on setlocale()), or you can use your own mapping from the Unicode table to your active working codepage. Example:
int main() {
    system("chcp 65001 > nul");
    char str[] = "příšerně"; // file saved in UTF-8
    for(const char* p=str; *p!=0; p=ucshift(p)) {
        int c = ucchar(p);
        if(c<128) printf("%c\n",c);
        else printf("%d\n",c);
    }
}
This should print
p
345
237
353
e
r
n
283
If your codepage doesn't support those Czech accented characters, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. Displaying readable characters across different Windows locales is a horror.
UTF-8 doesn't work for the Windows console. Period. I have tried all combinations with no success. Problems arise due to different ANSI/OEM character assignments, so some answers say that there is no problem, but such answers may come from programmers using 7-bit plain ASCII or having identical ANSI/OEM code pages (Chinese, Japanese).
Either you stick to UTF-16 and the wide-char functions (but you are still restricted to the 256 characters of your OEM code page, except for Chinese/Japanese), or you use OEM code page ASCII strings in your source file.
Yes, it is all a mess.
For multilingual programs I use string resources, and wrote a LoadStringOem() function that auto-translates the UTF-16 resource to an OEM string using WideCharToMultiByte(), without an intermediate buffer. As Windows auto-selects the right language out of the resource, it will hopefully load a string in a language that is convertible to the target OEM code page.
As a consequence, you should not use 8-bit typographic characters in the English-US language resource (such as the ellipsis … and curly quotes “”), as English-US is the resource Windows selects when no language match has been detected (i.e. the fallback).
As an example: if you have resources in German, Czech, Russian, and English-US, and the user's system is Chinese, he/she will see English plus garbage instead of your nicely made typography.
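For illustration, a rough sketch of what such a LoadStringOem() helper might look like (the signature is my guess, not the answer's actual code):

#include <windows.h>

// Loads a UTF-16 string resource and converts it straight into the caller's
// buffer in the OEM code page, with no intermediate heap buffer.
int LoadStringOem(HINSTANCE hInst, UINT id, char* buf, int bufSize)
{
    if (bufSize <= 0) return 0;
    const wchar_t* wide = nullptr;
    // With nBufferMax == 0, LoadStringW stores a read-only pointer to the
    // resource itself and returns its length in characters.
    int wideLen = LoadStringW(hInst, id, reinterpret_cast<LPWSTR>(&wide), 0);
    if (wideLen <= 0 || wide == nullptr) { buf[0] = '\0'; return 0; }
    // Convert directly into the output buffer using the OEM code page.
    int written = WideCharToMultiByte(CP_OEMCP, 0, wide, wideLen,
                                      buf, bufSize - 1, nullptr, nullptr);
    buf[written] = '\0';
    return written;
}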
Now, on Windows 7 and 10, SetConsoleOutputCP(65001 /*aka CP_UTF8*/) works as expected. You should keep your source file in UTF-8 without BOM, otherwise your string literals will be recoded to ANSI by the compiler. Moreover, the console font must contain the desired characters, and must not be "Terminal". Unfortunately, there is no font covering both umlauts and Chinese characters, even when you install both language packs, so you cannot truly display all character shapes at once.
I solved the problem in the following way:
Lucida Console doesn't seem to support umlauts, so changing the console font to Consolas, for example, works.
#include <stdio.h>
#include <Windows.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8);

    // I'm using Visual Studio, so encoding the source file in UTF-8 won't work
    const char* message = "a" "\xC3\xA4" "bcdefghijklmno" "\xC3\xB6" "pqrs" "\xC3\x9F" "tu" "\xC3\xBC" "vwxyz";

    // Note the capital S in the first argument, when used with wprintf it
    // specifies a single-byte or multi-byte character string (at least on
    // Visual C, not sure about the C library MinGW is using)
    wprintf(L"%S", message);
}
EDIT: fixed stupid typos and the decoding of the string literal, sorry about those.
I wrote a function to convert a wstring to a string. If I remove the call setlocale(LC_CTYPE, ""), the program goes wrong. I looked at the documentation on cplusplus.com:
C string containing the name of a C locale. These are system specific,
but at least the two following locales must exist:
"C" Minimal "C" locale
"" Environment's default locale
If the value of this parameter is NULL, the function does not make any
changes to the current locale, but the name of the current locale is
still returned by the function.
My code is below; the source code is from cplusplus.com (I added some Chinese characters):
/* wcstombs example */
#include <stdio.h>      /* printf */
#include <stdlib.h>     /* wcstombs, wchar_t(C) */
#include <locale.h>     /* setlocale */

int main()
{
    setlocale(LC_CTYPE, "");
    const wchar_t str[] = L"中国、wcstombs example";
    char buffer[64];
    int ret;

    printf ("wchar_t string: %ls \n",str);

    ret = wcstombs ( buffer, str, sizeof(buffer) );
    if (ret==64)
        buffer[63]='\0';
    if (ret)
        printf ("length:%d,multibyte string: %s \n",ret,buffer);

    return 0;
}
If I remove the call setlocale(LC_CTYPE, ""), the program does not run as I expect.
My question is: if I run it on a different machine, will the program behave differently? As the doc says, if the locale is "", the function does not make any changes to the current locale, but the name of the current locale is still returned by the function.
Is that because the current locale on different machines may differ?
Here is my C++ version of converting between wstring and string. The string-to-wstring direction does not need setlocale, and the program runs well:
/*
    string converts to wstring
*/
std::wstring s2ws(const std::string& src)
{
    std::wstring res = L"";
    size_t const wcs_len = mbstowcs(NULL, src.c_str(), 0);
    std::vector<wchar_t> buffer(wcs_len + 1);
    mbstowcs(&buffer[0], src.c_str(), src.size());
    res.assign(buffer.begin(), buffer.end() - 1);
    return res;
}

/*
    wstring converts to string
*/
std::string ws2s(const std::wstring & src)
{
    setlocale(LC_CTYPE, "");
    std::string res = "";
    size_t const mbs_len = wcstombs(NULL, src.c_str(), 0);
    std::vector<char> buffer(mbs_len + 1);
    wcstombs(&buffer[0], src.c_str(), buffer.size());
    res.assign(buffer.begin(), buffer.end() - 1);
    return res;
}
If the second argument to setlocale is NULL, it does nothing apart from returning the current locale. But you're not doing that. You're passing it an empty string, "", a string consisting of nothing but the terminating nil byte. My setlocale man page says:
If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent.
So what this is doing for you is setting the locale to whatever the user has specified or to the system default.
Not calling setlocale at all leaves the program in the minimal "C" locale, which cannot handle the Chinese characters in your string; that is why your program fails without that call.
Two other man pages for stuff you're using say
The behavior of mbstowcs() depends on the LC_CTYPE category of the current locale.
The behavior of wcstombs() depends on the LC_CTYPE category of the current locale.
Presumably these routines are what is failing if you haven't set the locale at all.
I would guess that you probably don't need to run the setlocale statement on every invocation of these routines, but you do need to make sure it's run at least once before running them.
As far as what happens differently depending on the current locale, I believe that would be how exactly the multibyte string is converted to wide characters and vice versa. I think the man pages for those routines leave it vague because of that difference. Personally, I'd prefer them to give some examples, such as "if the current locale is C, the multibyte string is ASCII characters." I would guess there's also at least one locale in which it is interpreted as UTF-8, but I don't know enough about the different locales to say exactly which one that is. There's probably also at least one locale where the multibyte string happens to be another two-byte-per-character encoding, but C and C++ would still treat it as bytes.
Edit: Thinking about this more, given the characters you added to the example code, it might make sense to explicitly state that using a locale that does not support Chinese characters will cause the final printf to report that the length was -1, and this includes the default C locale. In this case, the contents of the buffer are not clearly specified by the standard; my reading indicates that the buffer will probably hold all of the characters up to, but not including, the one that failed to convert, while neither the C++ documentation nor the C documentation states what happens regarding the character that could not be converted. I haven't paid for the official standards, but I do have copies of the last free releases. C++17 defers to C17. C17 also refrains from commenting on this aspect of the function. For wcsrtombs, it explicitly states that the conversion state is unspecified. However, for wcstombs_s, C17 states:
If the conversion stops without converting a null wide character and dst is not a null pointer, then a null character is stored into the array pointed to by dst immediately following any multibyte characters already stored.
In my own experiments with the code provided by the OP above, it appears that the wcstombs implementation on Fedora 28 simply refrains from making any further changes to the buffer. That suggests to me that, if the exact behavior matters in this situation, it may make sense to use wcstombs_s instead. But at a minimum, you should check whether the length returned is (size_t)-1, and if it is, report an error rather than assuming the conversion worked.
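For example, a minimal check along those lines, reusing the buffer and str variables from the question's code:

size_t n = wcstombs(buffer, str, sizeof(buffer));
if (n == (size_t)-1)
    printf("conversion failed: not representable in the current locale\n");
else
    printf("converted %zu bytes: %s\n", n, buffer);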
Short version: I am using Unicode and am attempting to pass a std::string to a function that requires a const WCHAR string; DrawString(const WCHAR, ...
I compile with GCC. Everything is Unicode; I have specified that.
I have been trying to convert a string into a wchar_t*. The purpose is so that I can output it using a GDI+ function whose parameters require it.
Here is how I have outputted string literals, no problems, debugs fine, works fine.
http://msdn.microsoft.com/en-us/library/ms535991%28v=vs.85%29.aspx for reference why:
// works fine
wchar_t* wcBuff;
wcBuff = (wchar_t*)L"Some text here.\0";
AddString(wcBuff, wcslen(wcBuff), &gFontFamilyInfo, FontStyleBold, 20, ptOrg_Controls, &strFormat_Info);
Now this is what I have been trying all day. A side note: my conversion function works fine elsewhere; it is not the issue, nor is it creating one.
// problems
string s = "Level " + convert::intToString(6) + "\0";
// try 1 - Segfault
wchar_t* wcBuff = new wchar_t[s.length() + 1];
copy(s.begin(), s.end(), wcBuff);
// random tries; these compile, but give access violations (my conversion function has worked in other places, though I don't know for sure here)
wchar_t* wcBuff;
wstring wstr = convert::stringToWideChar(s);
wstring strvalue = convert::stringToWideChar(s);
wcBuff = (wchar_t*)strvalue.c_str();
wcBuff = (wchar_t*)wstr.c_str();
wstring foo;
foo.assign(s.begin(), s.end());
wcBuff = (wchar_t*)foo.c_str();
Everything compiles, but then presents problems. Some attempts give runtime errors as soon as execution reaches that point, others access violations and segfaults. Some compile and debug with no problem, but the output string constantly changes to random characters.
Any ideas?
(this is not really an answer, but it's too big for a comment)
Try 1: you didn't null-terminate the string
Try 2: can't comment without seeing the conversion function. Remove the casts.
Try 3: Remove the casts, should be OK.
In all cases use wchar_t const *wcBuff. If "Try 3" fails then it means you have a bug somewhere else in your code that is showing up here. Try to produce an MCVE. You should be able to get it down to about 10-20 lines.
Even if you manage to write the correct code for what you're intending, this is a fairly naive conversion as it doesn't handle characters outside the 0-127 range properly. You need to think about whether that is what you want, or whether you want to do a UTF-8 conversion, etc.
In Windows you can use MultiByteToWideChar.
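A minimal sketch of that approach, converting a UTF-8 std::string to a std::wstring with MultiByteToWideChar (the helper name utf8_to_wide is mine):

#include <string>
#include <windows.h>

std::wstring utf8_to_wide(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // First call: ask for the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()), nullptr, 0);
    std::wstring result(len, L'\0');
    // Second call: perform the conversion into the string's own buffer.
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()), &result[0], len);
    return result;
}

The returned std::wstring owns the memory, so result.c_str() stays valid as long as the object lives, which avoids the dangling-pointer problems in the attempts above.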
#include <string>

int main() {
    // Can use convenient wstring
    std::wstring wstr = L"My wide string";

    // When you need a wchar_t* just do this
    const wchar_t* s = wstr.c_str();

    // Unicode form of strcpy
    wchar_t buf[100] = {0};
    wcscpy (buf,s);

    // And if you want to convert from string to wstring
    std::string thin = "I only take up one byte per character!";
    std::wstring wide(thin.begin(), thin.end());

    return 0;
}
I first get my data into a wstring. Like this:
(Converting from string):
std::string sString = "This is my string text";
std::wstring str1(sString.begin(), sString.end());
(Converting from int):
wstring str1 = std::to_wstring(BirthDate);
Then, I use it in GDI+ Command like this:
graphics.DrawString(str1.c_str(), -1,
&font, PointF(10, 5), &st);
First thing first. GDI+ is a C++ library. It uses Microsoft C++ ABI. Microsoft C++ ABI is wildly incompatible with gcc so you might just forget about using it. You can try to use WinAPI or any other library that uses C calling conventions.
Now for the wstring question. wchar_t is 32 bits wide in gcc, but Windows APIs require it to be 16 bits wide. You cannot use any native Windows call that requires wchar_t.
You can use the -fshort-wchar command line option in gcc; that would make wchar_t 16 bits wide and you would regain compatibility with Windows APIs, but lose compatibility with libc, so no library functions that act on wchar_t for you. std::wstring will probably work as it's header-only, but wprintf or wcscpy or all the other compiled stuff won't.
None of this is detected at compile time, as the only things gcc sees are header files. It cannot tell whether corresponding libraries are compiled with 16-bit wchar_t or 32-bit wchar_t.
You can use uint16_t when you need to pass an array of wchar_t to a Windows function. If you can use C++11, it has char16_t that you can use too. Here's an example that should work with Basic Multilingual Plane characters:
std::wstring myLittleNiceWstring;
...
std::vector<uint16_t> myUglyCompatibilityString;
std::copy(myLittleNiceWstring.begin(),
          myLittleNiceWstring.end(),
          std::back_inserter(myUglyCompatibilityString));
myUglyCompatibilityString.push_back(0);
UglyWindowsAPI(reinterpret_cast<WCHAR*>(myUglyCompatibilityString.data()));
If you have non-BMP characters, you need to convert UTF-32 to UTF-16 rather than just copy characters with std::copy. You can use libiconv for that, write a conversion routine yourself (it's rather simple, see the sketch below), or just borrow some code from the internet.
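For what it's worth, here is a small sketch of such a routine (mine, not the answer's), turning UTF-32 code points into UTF-16 code units with surrogate pairs; it does no validation of the input:

#include <cstdint>
#include <vector>

std::vector<uint16_t> utf32_to_utf16(const std::vector<uint32_t>& in)
{
    std::vector<uint16_t> out;
    for (uint32_t cp : in) {
        if (cp < 0x10000) {
            out.push_back(static_cast<uint16_t>(cp));                    // BMP: one code unit
        } else {
            cp -= 0x10000;                                               // 20 bits remain
            out.push_back(static_cast<uint16_t>(0xD800 + (cp >> 10)));   // high surrogate
            out.push_back(static_cast<uint16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
        }
    }
    return out;
}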
It is my opinion that Windows-centric development with GCC is rather difficult because of this and other issues. You can use gcc as long as you stick to POSIX-ish APIs.
I ran the same code, which determines the number of characters in a wide-character string. The tested string has ASCII, numbers, and Korean characters.
#include <iostream>
using namespace std;

template <class T,class trait>
void DumpCharacters(T& a)
{
    size_t length = a.size();
    for(size_t i=0;i<length;i++)
    {
        trait n = a[i];
        cout<<i<<" => "<<n<<endl;
    }
    cout<<endl;
}

int main(int argc, char* argv[])
{
    wstring u = L"123abc가1나1다";
    wcout<<u<<endl;
    DumpCharacters<wstring,wchar_t>(u);

    string s = "123abc가1나1다";
    cout<<s<<endl;
    DumpCharacters<string,char>(s);

    return 0;
}
The obvious thing is that wstring.size() in Visual C++ 2010 returns the number of letters (11 characters), regardless of whether each is an ASCII or an international character. However, in Xcode 4.2 on Mac OS X it returns the byte count of the string data (17 bytes).
Please tell me how to get the character length of a wide-character string, not the byte count, in Xcode.
--- added on 12 Feb --
I found that wcslen() also returns 17 in Xcode; it returns 11 in VC++.
Here's the tested code:
const wchar_t *p = L"123abc가1나1다";
size_t plen = wcslen(p);
--- added on 18 Feb --
I found that LLVM 3.0 causes the wrong length. This problem was fixed after changing the compiler front end from LLVM 3.0 to 4.2.
See wcslen() works differently in Xcode and VC++ for the details.
It is an error if the std::wstring version uses 17 characters: it should only use 11 characters. Using recent SVN heads of gcc and clang it uses 11 characters for the std::wstring and 17 characters for the std::string. I think this is what expected.
Please note that the standard C++ library internally has a different idea of what a "character" is than what might be expected when multi-word encodings (e.g. UTF-8 for words of type char and UTF-16 for words with 16 bits) are used. Here is the first paragraph of the chapter describing string (21.1 [strings.general]):
This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types , and objects of char-like types are called char-like objects or simply characters.
This basically means that when using Unicode the various functions won't pay attention to what constitutes a code point but rather process the strings as a sequence of words. This has severe impacts on what will happen, e.g., when producing substrings, because these may easily split multi-byte characters apart. Currently, the standard C++ library doesn't have any support for processing multi-byte encodings internally because it is assumed that the translation from an encoding to characters is done when reading data (and correspondingly the other way when writing data). If you are processing multi-byte encoded strings internally, you need to be aware of this as there is no support at all.
It is recognized that this state of affairs is actually a problem. For C++2011 the character type char32_t was added, which should support Unicode characters better than wchar_t (because Unicode needs 21 bits while wchar_t was allowed to support only 16 bits, a choice made on some platforms at a time when Unicode was promising to use at most 16 bits). However, this would still not deal with combining characters. It is recognized by the C++ committee that this is a problem and that proper character processing in the standard C++ library would be something nice to have, but so far nobody has come forward with a comprehensive proposal to address this problem (if you feel you want to propose something like this but you don't know how, please feel free to contact me and I will help you with how to submit a proposal).
Xcode 4.2 apparently used UTF-8 (or something very similar) as the narrow multibyte encoding to represent your character string literal "123abc가1나1다" in the program's source code when initializing string s. The UTF-8 representation of that string happens to be 17 bytes long.
The wide character representation (stored in u) is 11 wide characters. There are many ways to convert from narrow to wide encoding. Try this:
#include <iostream>
#include <string>
#include <clocale>
#include <cstdlib>

int main()
{
    std::wstring u = L"123abc가1나1다";
    std::cout << "Wide string contains " << u.size() << " characters\n";

    std::string s = "123abc가1나1다";
    std::cout << "Narrow string contains " << s.size() << " bytes\n";

    std::setlocale(LC_ALL, "");
    std::cout << "Which can be converted to "
              << std::mbstowcs(NULL, s.c_str(), s.size())
              << " wide characters in the current locale,\n";
}
Use .length(), not .size() to get the string length.
std::string and std::wstring are typedefs of std::basic_string templated on char and wchar_t. The size() member function returns the number of elements in the string - the number of chars or wchar_ts. "" and L"" don't deal with encodings.
I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments) but also the article implies that in future there will be better methods.
If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.
mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t and whatever the locale's wchar_t encoding is. All Windows locales use a two-byte wchar_t and UTF-16 as the encoding, but the other major platforms use a 4-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one-byte wchar_t and have the encoding differ by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *
Some better options have been introduced in C++11: new specializations of std::codecvt, new codecvt classes, and a new template to make using them for conversions very convenient.
First the new template class for using codecvt is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:
std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);
In order to do different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:
std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)
Examples of using these:
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");
The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:
template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;
The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization, UTF-32 and UTF-8.
Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.
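For example, a UTF-32 to UTF-16 conversion along those lines might chain the two converters through a temporary UTF-8 string (a sketch, using the facets listed above):

#include <codecvt>
#include <locale>
#include <string>

std::u16string utf32_to_utf16(const std::u32string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> to_utf8;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> to_utf16;
    std::string utf8 = to_utf8.to_bytes(s);   // UTF-32 -> UTF-8
    return to_utf16.from_bytes(utf8);         // UTF-8 -> UTF-16
}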
***** I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer https://stackoverflow.com/a/11107667/365496
What is wchar_t?
wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5
This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.
Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.
The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.
Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.
This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
What use is wchar_t today?
Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However you can't rely only on it to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
The reason Windows doesn't define __STDC_ISO_10646__ I think is because Windows use UTF-16 as its wchar_t encoding, and because UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').
In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.
I've written helper functions to convert to/from UTF8 strings (C++11):
#include <string>
#include <locale>
#include <codecvt>
using namespace std;

template <typename T>
string toUTF8(const basic_string<T, char_traits<T>, allocator<T>>& source)
{
    string result;
    wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
    result = convertor.to_bytes(source);
    return result;
}

template <typename T>
void fromUTF8(const string& source, basic_string<T, char_traits<T>, allocator<T>>& result)
{
    wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
    result = convertor.from_bytes(source);
}
Usage example:
// Unicode <-> UTF8
{
    wstring uStr = L"Unicode string";
    string str = toUTF8(uStr);

    wstring after;
    fromUTF8(str, after);
    assert(uStr == after);
}

// UTF16 <-> UTF8
{
    u16string uStr;
    uStr.push_back('A');
    string str = toUTF8(uStr);

    u16string after;
    fromUTF8(str, after);
    assert(uStr == after);
}
As far as I know, C++ provides no standard methods to convert from or to UTF-32. However, for UTF-16 there are the methods mbstowcs (Multi-Byte to Wide character string), and the inverse, wcstombs.
If you need UTF-32 too, you need iconv, which is in POSIX 2001 but not in standard C, so on Windows you'll need a replacement like libiconv.
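As a rough sketch of the iconv route (written against the glibc/libiconv API; the exact parameter constness and the accepted encoding names vary slightly by platform, and error handling is omitted):

#include <iconv.h>
#include <string>
#include <vector>

std::u32string utf8_to_utf32(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");         // to-code, from-code
    std::vector<char> out(in.size() * 4 + 4);
    char* inbuf = const_cast<char*>(in.data());           // glibc takes char**
    char* outbuf = out.data();
    size_t inleft = in.size(), outleft = out.size();
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);        // converts as much as it can
    iconv_close(cd);
    size_t produced = (out.size() - outleft) / sizeof(char32_t);
    return std::u32string(reinterpret_cast<const char32_t*>(out.data()), produced);
}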
Here's an example on how to use mbstowcs:
#include <string>
#include <iostream>
#include <stdlib.h>
using namespace std;

wstring widestring(const string &text);

int main()
{
    string text;
    cout << "Enter something: ";
    cin >> text;

    wcout << L"You entered " << widestring(text) << ".\n";
    return 0;
}

wstring widestring(const string &text)
{
    wstring result;
    result.resize(text.length());
    mbstowcs(&result[0], &text[0], text.length());
    return result;
}
The reverse goes like this:
string mbstring(const wstring &text)
{
    string result;
    result.resize(text.length());
    wcstombs(&result[0], &text[0], text.length());
    return result;
}
Nitpick: Yes, I know, the size of wchar_t is implementation defined, so it could be 4 Bytes (UTF-32). However, I don't know a compiler which does that.