I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments) but also the article implies that in future there will be better methods.
If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.
mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t and whatever the locale's wchar_t encoding is. All Windows locales use a two-byte wchar_t with UTF-16 as the encoding, but the other major platforms use a four-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one-byte wchar_t and have the encoding differ by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *
Some better options have been introduced in C++11: new specializations of std::codecvt, new codecvt classes, and a new template to make using them for conversions very convenient.
First the new template class for using codecvt is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:
std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);
To do a different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:
std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)
Examples of using these:
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");
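To make the UCS-2 warning above concrete, here is a minimal sketch of how the two char16_t facets differ on a code point above U+FFFF. The function names utf8_to_utf16 and ucs2_can_decode are my own, not part of the library, and I'm assuming the implementation reports the failure by throwing std::range_error, which is what wstring_convert does when no error string was supplied. Note also that <codecvt> is deprecated since C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-16; code points above U+FFFF become surrogate pairs.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

// Try to decode UTF-8 into UCS-2. Returns false when the conversion fails,
// which is expected for any code point that does not fit in one 16-bit unit.
bool ucs2_can_decode(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
    try {
        convert.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

With the four-byte UTF-8 sequence for U+1F600 ("\xF0\x9F\x98\x80"), I'd expect utf8_to_utf16 to return two code units while ucs2_can_decode returns false.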
The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:
template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;
The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization, UTF-32 and UTF-8.
Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.
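As a rough sketch of combining two std::wstring_convert instances, the following goes from UTF-16 to UTF-32 and back with UTF-8 as the intermediate form. The function names utf16_to_utf32 and utf32_to_utf16 are hypothetical helpers of mine, and the extra pass through UTF-8 is the price of staying within the standard facets.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-16 -> UTF-8 -> UTF-32, using one converter per step.
std::u32string utf16_to_utf32(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert32;
    return convert32.from_bytes(convert16.to_bytes(utf16));
}

// UTF-32 -> UTF-8 -> UTF-16, the same two converters in the other order.
std::u16string utf32_to_utf16(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert32;
    return convert16.from_bytes(convert32.to_bytes(utf32));
}
```

Both functions throw std::range_error on malformed input, because the converters are constructed without an error string.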
* I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer https://stackoverflow.com/a/11107667/365496
What is wchar_t?
wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5
This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.
Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.
The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.
Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.
This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
What use is wchar_t today?
Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
The reason Windows doesn't define __STDC_ISO_10646__, I think, is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
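To show why a 16-bit wchar_t cannot hold every codepoint in one value, here is a small sketch of how UTF-16 splits a codepoint above U+FFFF into a surrogate pair. The function name encode_utf16 is my own, and it assumes the input is a valid Unicode scalar value (at most 0x10FFFF and not itself a surrogate); it does no validation.

```cpp
#include <string>

// Encode one Unicode codepoint as one or two UTF-16 code units.
// Assumes cp is a valid scalar value; no validation is performed.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 bits remain after the offset
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate: top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate: bottom 10 bits
    }
    return out;
}
```

For U+1F600 this yields the two code units 0xD83D and 0xDE00, so one codepoint occupies two wchar_t values on Windows.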
For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').
In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.
I've written helper functions to convert to/from UTF8 strings (C++11):
#include <string>
#include <locale>
#include <codecvt>
#include <cassert> // for the assert() calls in the usage example
using namespace std;
template <typename T>
string toUTF8(const basic_string<T, char_traits<T>, allocator<T>>& source)
{
string result;
wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
result = convertor.to_bytes(source);
return result;
}
template <typename T>
void fromUTF8(const string& source, basic_string<T, char_traits<T>, allocator<T>>& result)
{
wstring_convert<codecvt_utf8_utf16<T>, T> convertor;
result = convertor.from_bytes(source);
}
Usage example:
// Unicode <-> UTF8
{
wstring uStr = L"Unicode string";
string str = toUTF8(uStr);
wstring after;
fromUTF8(str, after);
assert(uStr == after);
}
// UTF16 <-> UTF8
{
u16string uStr;
uStr.push_back('A');
string str = toUTF8(uStr);
u16string after;
fromUTF8(str, after);
assert(uStr == after);
}
As far as I know, C++ provides no standard methods to convert from or to UTF-32. However, for UTF-16 there are the methods mbstowcs (Multi-Byte to Wide character string), and the inverse, wcstombs.
If you need UTF-32 too, you need iconv, which is in POSIX 2001 but not in standard C, so on Windows you'll need a replacement like libiconv.
Here's an example of how to use mbstowcs:
#include <string>
#include <iostream>
#include <stdlib.h>
using namespace std;
wstring widestring(const string &text);
int main()
{
string text;
cout << "Enter something: ";
cin >> text;
wcout << L"You entered " << widestring(text) << ".\n";
return 0;
}
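wstring widestring(const string &text)
{
wstring result;
result.resize(text.length()); // upper bound: at most one wide character per byte
size_t n = mbstowcs(&result[0], text.c_str(), text.length());
if (n == (size_t)-1) return wstring(); // invalid multibyte sequence in the current locale
result.resize(n); // drop the unused tail
return result;
}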
The reverse goes like this:
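string mbstring(const wstring &text)
{
string result;
result.resize(text.length() * MB_CUR_MAX); // worst case: MB_CUR_MAX bytes per wide character
size_t n = wcstombs(&result[0], text.c_str(), result.size());
if (n == (size_t)-1) return string(); // a wide character has no representation in the current locale
result.resize(n); // keep only the bytes actually written
return result;
}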
Nitpick: Yes, I know, the size of wchar_t is implementation defined, so it could be 4 Bytes (UTF-32). However, I don't know a compiler which does that.
The following code works as expected. The source code and the files "file.txt" and "out.txt" are all encoded in UTF-8. But it does not work when I change wchar_t to char16_t on the first line in main(). I've tried both gcc5.4 and clang8.0 with -std=c++11. My goal is to replace wchar_t with char16_t, as wchar_t takes twice the space in RAM. I thought these two types were equally well supported in C++11 and later standards. What am I missing here?
#include<iostream>
#include<fstream>
#include<locale>
#include<codecvt>
#include<string>
int main(){
typedef wchar_t my_char;
std::locale::global(std::locale("en_US.UTF-8"));
std::ofstream out("file.txt");
out << "123正则表达式abc" << std::endl;
out.close();
std::basic_ifstream<my_char> win("file.txt");
std::basic_string<my_char> wstr;
win >> wstr;
win.close();
std::ifstream in("file.txt");
std::string str;
in >> str;
in.close();
std::wstring_convert<std::codecvt_utf8<my_char>, my_char> my_char_conv;
std::basic_string<my_char> conv = my_char_conv.from_bytes(str);
std::cout << (wstr == conv ? "true" : "false") << std::endl;
std::basic_ofstream<my_char> wout("out.txt");
wout << wstr << std::endl << conv << std::endl;
wout.close();
return 0;
}
EDIT
The modified code does not compile with clang8.0. It compiles with gcc5.4 but crashes at run-time as shown by @Brian.
The various stream classes need a set of definitions to be operational. The standard library requires the relevant definitions and objects only for char and wchar_t but not for char16_t or char32_t. Off the top of my head the following is needed to use std::basic_ifstream<cT> or std::basic_ofstream<cT>:
std::char_traits<cT> to specify how the character type behaves. I think this template is specialized for char16_t and char32_t.
The used std::locale needs to contain an instance of the std::num_put<cT> facet to format numeric types. This facet can just be instantiated and a new std::locale containing it can be created but the standard doesn't mandate that it is present in a std::locale object.
The used std::locale needs to contain an instance of the facet std::num_get<cT> to read numeric types. Again, this facet can be instantiated but isn't required to be present by default.
The facet std::numpunct<cT> needs to be specialized and put into the used std::locale to deal with decimal points, thousand separators, and textual boolean values. Even if it isn't really used it will be referenced from the numeric formatting and parsing functions. There is no ready specialization for char16_t or char32_t.
The facet std::ctype<cT> needs to be specialized and put into the used std::locale to support widening, narrowing, and classification of the character type. There is no ready specialization for char16_t or char32_t.
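The facet std::codecvt<cT, char, std::mbstate_t> needs to be present in the used std::locale to convert between external byte sequences and internal "character" sequences. C++11 does specify specializations for char16_t and char32_t (converting between UTF-8 and UTF-16 or UTF-32 respectively), so this is the one facet on this list that should not need to be written by hand for those types.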
Most of the facets are reasonably easy to do: they just need to forward a simple conversion or do table look-ups. However, the std::codecvt facet tends to be rather tricky, especially because std::mbstate_t is an opaque type from the point of view of the standard C++ library.
All of that can be done. It is a while since I last did a proof-of-concept implementation for a character type. It took me about a day's worth of work. Of course, I knew what I needed to do when I embarked on the work, having implemented the locales and IOStreams library before. To add a reasonable amount of tests rather than merely having a simple demo would probably take me a week or so (assuming I can actually concentrate on this work).
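If the goal is only to end up with char16_t data in memory, one way to sidestep the missing facets is to keep the stream itself narrow and convert afterwards. This is a sketch of that workaround, not of the facet work described above; the function name read_all_as_utf16 is mine, and it assumes the stream content is valid UTF-8.

```cpp
#include <codecvt>
#include <istream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Read every byte from a narrow stream, then decode the bytes as UTF-8
// into a UTF-16 string. Throws std::range_error on malformed input.
std::u16string read_all_as_utf16(std::istream& in) {
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(bytes);
}
```

The same function works with a std::ifstream opened on "file.txt", because it only relies on the char-based stream classes that every implementation has to support.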
I ran the same code, which determines the number of characters in a wide-character string. The test string contains ASCII letters, digits, and Korean characters.
#include <iostream>
#include <string>
using namespace std;
template <class T,class trait>
void DumpCharacters(T& a)
{
size_t length = a.size();
for(size_t i=0;i<length;i++)
{
trait n = a[i];
cout<<i<<" => "<<n<<endl;
}
cout<<endl;
}
int main(int argc, char* argv[])
{
wstring u = L"123abc가1나1다";
wcout<<u<<endl;
DumpCharacters<wstring,wchar_t>(u);
string s = "123abc가1나1다";
cout<<s<<endl;
DumpCharacters<string,char>(s);
return 0;
}
The obvious thing is that wstring.size() in Visual C++ 2010 returns the number of letters (11 characters), regardless of whether each one is an ASCII or an international character. However, it returns the byte count of the string data (17 bytes) in Xcode 4.2 on Mac OS X.
Please tell me how to get the character length of a wide-character string, not the byte count, in Xcode.
--- added on 12 Feb --
I found that wcslen() also returns 17 in Xcode. It returns 11 in VC++.
Here's the tested code:
const wchar_t *p = L"123abc가1나1다";
size_t plen = wcslen(p);
--- added on 18 Feb --
I found that the LLVM 3.0 compiler front end causes the wrong length. The problem went away after changing the compiler front end from LLVM 3.0 to 4.2.
The question "wcslen() works differently in Xcode and VC++" has the details.
It is an error if the std::wstring version uses 17 characters: it should only use 11 characters. Using recent SVN heads of gcc and clang it uses 11 characters for the std::wstring and 17 characters for the std::string. I think this is what is expected.
Please note that the standard C++ library internally has a different idea of what a "character" is than what might be expected when multi-word encodings (e.g. UTF-8 for words of type char and UTF-16 for words with 16 bits) are used. Here is the first paragraph of the chapter describing string (21.1 [strings.general]):
This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types , and objects of char-like types are called char-like objects or simply characters.
This basically means that when using Unicode the various functions won't pay attention to what constitutes a code point but rather process the strings as a sequence of words. This has severe impacts on what happens, e.g., when producing substrings, because these may easily split multi-byte characters apart. Currently, the standard C++ library doesn't have any support for processing multi-byte encodings internally because it is assumed that the translation from an encoding to characters is done when reading data (and correspondingly the other way when writing data). If you are processing multi-byte encoded strings internally, you need to be aware of this as there is no support at all.
It is recognized that this state of affairs is actually a problem. For C++2011 the character type char32_t was added, which should support Unicode characters better than wchar_t (because Unicode needs 21 bits while wchar_t was allowed to support only 16 bits, a choice made on some platforms at a time when Unicode was promising to use at most 16 bits). However, this would still not deal with combining characters. The C++ committee recognizes that this is a problem and that proper character processing in the standard C++ library would be nice to have, but so far nobody has come forward with a comprehensive proposal to address it (if you feel you want to propose something like this but you don't know how, please feel free to contact me and I will help you with how to submit a proposal).
XCode 4.2 apparently used UTF-8 (or something very similar) as the narrow multibyte encoding to represent your character string literal "123abc가1나1다" in the program's source code when initializing string s. The UTF-8 representation of that string happens to be 17 bytes long.
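Since the library only counts words, counting code points in a UTF-8 std::string has to be done by hand. Here is a minimal sketch; the name count_code_points is my own, and it assumes the input is well-formed UTF-8. It counts code points, so combining characters are still counted separately from their base character.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte that
// is not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 encoding of "123abc가1나1다" this gives 11, while size() on the same std::string gives 17.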
The wide character representation (stored in u) is 11 wide characters. There are many ways to convert from narrow to wide encoding. Try this:
#include <iostream>
#include <string>
#include <clocale>
#include <cstdlib>
int main()
{
std::wstring u = L"123abc가1나1다";
std::cout << "Wide string contains " << u.size() << " characters\n";
std::string s = "123abc가1나1다";
std::cout << "Narrow string contains " << s.size() << " bytes\n";
std::setlocale(LC_ALL, "");
std::cout << "Which can be converted to "
<< std::mbstowcs(NULL, s.c_str(), s.size())
<< " wide characters in the current locale,\n";
}
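Note that .length() and .size() are equivalent for std::basic_string; both return the number of elements, so switching between them does not change the result.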
std::string and std::wstring are typedefs of std::basic_string templated on char and wchar_t. The size() member function returns the number of elements in the string - the number of char's or wchar_t's. "" and L"" don't deal with encodings.
I'd like to transcode character encoding on-the-fly. I'd like to use iostreams and my own transcoding streambuf, e.g.:
xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );
char *utf8_s; // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s; // characters are written in ISO-8859-1
The implementation of xcoder_streambuf would use ICU's converters API. It would take the data coming in (in this case, from utf8_s), transcode it, and write it out using the iostream's original streambuf.
Is that a reasonable way to go? If not, what would be better?
Is that a reasonable way to go?
Yes, but it is not the way you are expected to do it in modern (as in 1997) iostream.
The behaviour of outputting through basic_streambuf<> is defined by the overflow(int_type c) virtual function.
The description of basic_filebuf<>::overflow(int_type c = traits::eof()) includes a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:
const codecvt<charT,char,typename traits::state_type>& a_codecvt
= use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());
so you are expected to imbue a locale with the appropriate codecvt<charT,char,typename traits::state_type> converter.
The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.
The standard library support for Unicode made some progress since 1997:
the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.
This seems to be what you want (every ISO-8859-1 code has the same numeric value as its UCS-4 code, and UCS-4 = UTF-32).
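To illustrate the point that ISO-8859-1 codes are the first 256 UTF-32 codes, here is a rough whole-string sketch of UTF-8 to ISO-8859-1 conversion using the standard UTF-8 facet. It is not the streambuf or imbue approach discussed above, just the conversion step on its own. The function name utf8_to_latin1 and the '?' replacement policy for characters outside ISO-8859-1 are my assumptions.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Decode UTF-8 to UTF-32, then narrow each code point to one ISO-8859-1 byte.
// Code points above U+00FF have no ISO-8859-1 equivalent and are replaced.
std::string utf8_to_latin1(const std::string& utf8, char replacement = '?') {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    const std::u32string code_points = convert.from_bytes(utf8);

    std::string latin1;
    latin1.reserve(code_points.size());
    for (char32_t cp : code_points) {
        latin1.push_back(cp <= 0xFF
                             ? static_cast<char>(static_cast<unsigned char>(cp))
                             : replacement);
    }
    return latin1;
}
```

For example, the UTF-8 bytes of "café" ("caf\xC3\xA9") come out as "caf\xE9", and the euro sign, which ISO-8859-1 lacks, comes out as '?'.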
If not, what would be better?
I would introduce a different type for UTF8, like:
struct utf8 {
unsigned char d; // d for data
};
struct latin1 {
unsigned char c; // c for character
};
This way you cannot accidentally pass UTF8 where ISO-8859-* is expected. But then you would have to write some interface code, and the type of your streams won't be istream/ostream.
Disclaimer: I never actually did such a thing, so I don't know if it is workable in practice.
Is there any conceivable reason why I would see different results using Unicode string literals versus the actual hex value for the UChar?
UnicodeString s1(0x0040); // @ sign
UnicodeString s2("\u0040");
s1 isn't equivalent to s2. Why?
The \u escape sequence AFAIK is implementation defined, so it's hard to say why they are not equivalent without knowing details on your particular compiler. That said, it's simply not a safe way of doing things.
UnicodeString has a constructor taking a UChar and one for UChar32. I'd be explicit when using them:
UnicodeString s(static_cast<UChar>(0x0040));
UnicodeString also provides an unescape() method that's fairly handy:
UnicodeString s = UNICODE_STRING_SIMPLE("\\u4ECA\\u65E5\\u306F").unescape(); // 今日は
I couldn't reproduce this on ICU 4.8.1.1:
#include <stdio.h>
#include "unicode/unistr.h"
int main(int argc, const char *argv[]) {
UnicodeString s1(0x0040); // @ sign
UnicodeString s2("\u0040");
printf("s1==s2: %s\n", (s1==s2)?"T":"F");
// printf("s1.equals s2: %d\n", s1.equals(s2));
printf("s1.length: %d s2.length: %d\n", s1.length(), s2.length());
printf("s1.charAt(0)=U+%04X s2.charAt(0)=U+%04X\n", s1.charAt(0), s2.charAt(0));
return 0;
}
=>
s1==s2: T
s1.length: 1 s2.length: 1
s1.charAt(0)=U+0040 s2.charAt(0)=U+0040
gcc 4.4.5 RHEL 6.1 x86_64
For anyone else who finds this, here's what I found (in ICU's documentation).
The compiler's and the runtime character set's codepage encodings are
not specified by the C/C++ language standards and are usually not a
Unicode encoding form. They typically depend on the settings of the
individual system, process, or thread. Therefore, it is not possible
to instantiate a Unicode character or string variable directly with
C/C++ character or string literals. The only safe way is to use
numeric values. It is not an issue for User Interface (UI) strings
that are translated.
[1] http://userguide.icu-project.org/strings
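The "numeric values" advice in the quote can be shown without ICU. This sketch builds the same text from numeric UTF-16 code units and, for comparison, from a C++11 u"" literal, which as far as I understand is defined to be UTF-16 regardless of the narrow execution character set. Both function names are made up for the example, and nothing here uses ICU's UnicodeString.

```cpp
#include <string>

// "今日は" spelled out as numeric UTF-16 code units, so the result does not
// depend on the source file encoding or the execution character set.
std::u16string konnichiwa_from_numbers() {
    const char16_t units[] = {0x4ECA, 0x65E5, 0x306F};
    return std::u16string(units, 3);
}

// The same text as a C++11 u"" literal written with universal character names.
std::u16string konnichiwa_from_literal() {
    return u"\u4ECA\u65E5\u306F";
}
```

On a C++11 compiler I'd expect both functions to return the same three code units; the plain "..." literal from the question carries no such guarantee.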
The double quotes in your \u constant are the problem. This evaluated properly:
wchar_t m1( 0x0040 );
wchar_t m2( '\u0040' );
bool equal = ( m1 == m2 );
equal was true.