How does one convert std::u16string -> std::wstring using <codecvt>? - c++

I found a bunch of questions on a similar topic, but nothing regarding wide-to-wide conversion with <codecvt>, which is supposed to be the correct choice in modern code.
The std::codecvt_utf16<wchar_t> seems to be a logical choice to perform the conversion.
However, std::wstring_convert seems to expect a std::string at one end; its from_bytes and to_bytes methods emphasize that purpose.
The best solution I have so far is something like std::copy, which might work for my specific case, but seems kinda low-tech and probably not entirely correct either.
I have a strong feeling that I am missing something rather obvious.
Cheers.

The std::wstring_convert and std::codecvt... classes are deprecated from C++17 onward. There is no longer a standard way to convert between the various string classes.
If your compiler still supports the classes, you can certainly use them. However, you cannot convert directly from std::u16string to std::wstring (and vice versa) with them. You will have to convert to an intermediate UTF-8 std::string first, and then convert that to std::wstring, e.g.:
std::u16string utf16 = ...;
// UTF-16 -> UTF-8
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16conv;
std::string utf8 = utf16conv.to_bytes(utf16);
// UTF-8 -> wchar_t
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> wconv;
std::wstring wstr = wconv.from_bytes(utf8);
Just know that this approach will break when the classes are eventually dropped from the standard library.
Using std::copy() (or simply the various std::wstring construct/assign methods) will work only on Windows, where wchar_t and char16_t are both 16 bits in size and both represent UTF-16:
std::u16string utf16 = ...;
std::wstring wstr;
#ifdef _WIN32
wstr.reserve(utf16.size());
std::copy(utf16.begin(), utf16.end(), std::back_inserter(wstr));
/*
or: wstr = std::wstring(utf16.begin(), utf16.end());
or: wstr.assign(utf16.begin(), utf16.end());
or: wstr = std::wstring(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
or: wstr.assign(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
*/
#else
// do something else ...
#endif
But, on other platforms, where wchar_t is 32 bits in size and represents UTF-32, you will need to actually convert the data, using the code shown above, a platform-specific API, or a 3rd party Unicode library that can do the data conversion, such as libiconv, ICU, etc.
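For the "do something else" branch, a minimal hand-rolled sketch of the UTF-16 to UTF-32 conversion might look like this (the function name is hypothetical, a 32-bit wchar_t is assumed, and lone surrogates are simply replaced with U+FFFD rather than reported):
#include <cstddef>
#include <string>

// Sketch only: decode UTF-16 code units into 32-bit wchar_t code points.
std::wstring utf16_to_utf32_wstring(const std::u16string& utf16) {
    std::wstring out;
    out.reserve(utf16.size());
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        char16_t c = utf16[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < utf16.size() &&
            utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            // high surrogate followed by low surrogate -> one supplementary code point
            char32_t cp = 0x10000 + ((char32_t(c) - 0xD800) << 10) + (char32_t(utf16[i + 1]) - 0xDC00);
            out.push_back(static_cast<wchar_t>(cp));
            ++i; // consumed two code units
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            out.push_back(static_cast<wchar_t>(0xFFFD)); // lone surrogate -> replacement character
        } else {
            out.push_back(static_cast<wchar_t>(c));
        }
    }
    return out;
}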

you cannot convert directly from std::u16string to std::wstring (and vice versa) with them. You will have to convert to an intermediate UTF-8 std::string first, and then convert that afterwards
This doesn't appear to be the case, as clang: converting const char16_t* (UTF-16) to wstring (UCS-4) shows:
u16string s = u"hello";
wstring_convert<codecvt_utf16<wchar_t, 0x10ffff, little_endian>,
wchar_t> conv;
wstring ws = conv.from_bytes(
reinterpret_cast<const char*> (&s[0]),
reinterpret_cast<const char*> (&s[0] + s.size()));

Related

c++ how to convert wchar_t to char16_t

Convert between string, u16string & u32string
This post explains the opposite of my question, so I need to post a new question.
I need to convert wchar_t to char16_t. I found a sample doing the opposite (char16_t -> wchar_t) here:
I am not familiar with templates etc., sorry. Can anybody give me an example of converting wchar_t to char16_t, please?
I have this piece of code that I want to adapt for converting wchar_t to char16_t.
std::wstring u16fmt(const char16_t* str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert_wstring;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::string utf8str = convert.to_bytes(str);
    std::wstring wstr = convert_wstring.from_bytes(utf8str);
    return wstr;
}
Ah, and it should run on Windows and Linux
If sizeof( wchar_t ) == 2 (*), you're saddled with Windows and can only hope your wstring holds UTF-16 (and hasn't been smashed flat to UCS-2 by some old Windows function).
If sizeof( wchar_t ) == 4 (*), you're not on Windows and need to do a UTF-32 to UTF-16 conversion.
(*): Assuming CHAR_BIT == 8.
I am, however, rather pessimistic about the standard library's Unicode capabilities beyond simple "piping through", so if you're going to do any actual work on those strings, I'd recommend ICU, the de-facto C/C++ standard library for all things Unicode.
icu::UnicodeString has a wchar_t * constructor, and you can call getTerminatedBuffer() to get a (non-owning) const char16_t *. Or, of course, just use icu::UnicodeString, which uses UTF-16 internally.
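If you would rather stay with the (deprecated) standard machinery instead of ICU, a minimal sketch of the direction actually being asked for, simply reversing the u16fmt() sample from the question, might look like this (hypothetical function name; on Windows codecvt_utf8<wchar_t> treats wchar_t as UCS-2, the same caveat that applies to u16fmt() itself):
#include <codecvt>
#include <locale>
#include <string>

// Sketch only: wchar_t -> char16_t via an intermediate UTF-8 string.
std::u16string u16from_wide(const wchar_t* str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert_wstring;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::string utf8str = convert_wstring.to_bytes(str);
    return convert.from_bytes(utf8str);
}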

Is there a function to concatenate two char16_t

There is the function wcsncat_s() for concatenating two wchar_t*:
errno_t wcsncat_s( wchar_t *restrict dest, rsize_t destsz, const wchar_t *restrict src, rsize_t count );
Is there an equivalent function for concatenating two char16_t?
Not really.
On Windows, though, wchar_t is functionally identical to char16_t, so you could just cast your char16_t* to a wchar_t*.
Otherwise you can do it simply enough by writing yourself a function for it.
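A minimal sketch of such a hand-written helper (hypothetical name, loosely modeled on strncat; the caller must ensure dest is NUL-terminated and large enough for the result):
#include <cstddef>

// Sketch only: append at most count char16_t units from src onto dest.
char16_t* u16ncat(char16_t* dest, const char16_t* src, std::size_t count) {
    char16_t* p = dest;
    while (*p) ++p;                        // find the end of dest
    while (count-- && *src) *p++ = *src++; // copy up to count units from src
    *p = u'\0';                            // re-terminate
    return dest;
}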
You could use std::u16string if you want something portable.
std::u16string str1(u"The quick brown fox ");
std::u16string str2(u"Jumped over the lazy dog");
std::u16string str3 = str1 + str2; // concatenate
const char16_t* psz = str3.c_str();
The validity of psz lasts as long as str3 doesn't go out of scope.
But the more portable and flexible solution is to just use wchar_t everywhere (which is 32-bit on Mac). Unless you are explicitly using 16-bit char strings (perhaps for a specific UTF-16 processing routine), it's easier to just keep your code in the wide-char (wchar_t) space. It plays nicer with native APIs and libraries on Mac and Windows.

Convert ICU Unicode string to std::wstring (or wchar_t*)

Is there an ICU function to create a std::wstring from an ICU UnicodeString? I have been searching the ICU manual but haven't been able to find one.
(I know I can convert the UnicodeString to UTF-8 and then convert to the platform-dependent wchar_t*, but I am looking for a single function in UnicodeString which can do this conversion.)
The C++ standard doesn't dictate any specific encoding for std::wstring. On Windows systems, wchar_t is 16-bit, and on Linux, macOS, and several other platforms, wchar_t is 32-bit. As far as C++'s std::wstring is concerned, it is just an arbitrary sequence of wchar_t in much the same way that std::string is just an arbitrary sequence of char.
It seems that icu::UnicodeString has no in-built way of creating a std::wstring, but if you really want to create a std::wstring anyway, you can use the C-based API u_strToWCS() like this:
icu::UnicodeString ustr = /* get from somewhere */;
std::wstring wstr;
int32_t requiredSize;
UErrorCode error = U_ZERO_ERROR;
// obtain the size of string we need
u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
// resize accordingly (this will not include any terminating null character, but it also doesn't need to either)
wstr.resize(requiredSize);
// copy the UnicodeString buffer to the std::wstring.
u_strToWCS(wstr.data(), wstr.size(), nullptr, ustr.getBuffer(), ustr.length(), &error);
Supposedly, u_strToWCS() will use the most efficient method for converting from UChar to wchar_t (if they are the same size, it is presumably just a straightforward copy).
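Wrapped into a reusable helper with minimal error handling, the same calls might look like this (a sketch; the function name is made up, and an empty string is returned on failure):
#include <cstddef>
#include <string>
#include <unicode/unistr.h>
#include <unicode/ustring.h>

// Sketch only: convert an icu::UnicodeString to std::wstring via u_strToWCS().
std::wstring to_wstring(const icu::UnicodeString& ustr) {
    int32_t requiredSize = 0;
    UErrorCode error = U_ZERO_ERROR;
    // preflight to get the required length; this sets U_BUFFER_OVERFLOW_ERROR by design
    u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
    if (error != U_BUFFER_OVERFLOW_ERROR || requiredSize <= 0)
        return std::wstring();
    std::wstring wstr(static_cast<std::size_t>(requiredSize), L'\0');
    error = U_ZERO_ERROR; // must be reset after preflighting
    u_strToWCS(&wstr[0], static_cast<int32_t>(wstr.size()), nullptr, ustr.getBuffer(), ustr.length(), &error);
    if (U_FAILURE(error))
        return std::wstring();
    return wstr;
}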

Conversion (const char*) var goes wrong

I need to convert from CString to double in Embedded Visual C++, which supports only old-style C++. I am using the following code:
CString str = "4.5";
double var = atof( (const char*) (LPCTSTR) str );
and the result is var=4.0, so I am losing decimal digits.
I have made another test
LPCTSTR str = "4.5";
const char* var = (const char*) str;
and result again var=4.0
Can anyone help me to get a correct result?
The issue here is that you are lying to the compiler, and the compiler trusts you. Since you are using Embedded Visual C++, I'm going to assume that you are targeting Windows CE. Windows CE exposes a Unicode API surface only, so your project is very likely set to use Unicode (UTF-16 LE encoding).
In that case, CString expands to CStringW, which stores code units as wchar_t. When doing (const char*) (LPCTSTR) str you are then casting from a wchar_t const* to a char const*. Given the input, the first byte has the value 52 (the ASCII encoding for the character 4). The second byte has the value 0. That is interpreted as the terminator of the C-style string. In other words, you are passing the string "4" to your call to atof. Naturally, you'll get the value 4.0 as the result.
To fix the code, use something like the following:
CStringW str = L"4.5";
double var = _wtof( str.GetString() );
_wtof is a Microsoft-specific extension to its CRT.
Note two things in particular:
The code uses a CString variant with explicit character encoding (CStringW). Always be explicit about your string types. This helps when reading your code and catches bugs before they happen (although all those C-style casts in the original code defeat that entirely).
The code calls the CString::GetString member to retrieve a pointer to the immutable buffer. This, too, makes the code easier to read, by not using what looks to be a C-style cast (but is an operator instead).
Also consider defining the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION macro to prevent inadvertent character set conversions from happening (e.g. CString str = "4.5";). This, too, helps you catch bugs early (unless you defeat that with C-style casts as well).
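For instance, a sketch of what that looks like in practice (the exact header that brings in CString varies by project):
#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
#include <atlstr.h> // or whichever MFC/ATL header provides CString in your project

CString good = _T("4.5");  // fine: matches the build's character set
// CString bad = "4.5";    // with the macro defined, this is a compile-time error in a Unicode build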
CString is not a const char*. To convert a TCHAR CString to ASCII, use the CT2A macro; this will also allow you to convert the string to UTF-8 (or any other Windows code page):
// Convert using the local code page
CString str(_T("Hello, world!"));
CT2A ascii(str);
TRACE(_T("ASCII: %S\n"), ascii.m_psz);
// Convert to UTF8
CString str(_T("Some Unicode goodness"));
CT2A ascii(str, CP_UTF8);
TRACE(_T("UTF8: %S\n"), ascii.m_psz);
Found a solution using scanf
CString str="4.5"
double var=0.0;
_stscanf( str, _T("%lf"), &var );
This gives a correct result var=4.5
Thanks everyone for comments and help.

Assigning a "const char*" to std::string is allowed, but assigning to std::wstring doesn't compile. Why?

I assumed that std::wstring and std::string both provide more or less the same interface.
So I tried to enable unicode capabilities for our application
# ifdef APP_USE_UNICODE
typedef std::wstring AppStringType;
# else
typedef std::string AppStringType;
# endif
However that gives me a lot of compile errors when -DAPP_USE_UNICODE is used.
It turned out, that the compiler chokes when a const char[] is assigned to std::wstring.
EDIT: improved example by removing the usage of literal "hello".
#include <string>
void myfunc(const char h[]) {
    std::string s = h;  // compiles OK
    std::wstring w = h; // compile error
}
Why does it make such a difference?
Assigning a const char* to std::string is allowed, but assigning to std::wstring gives compile errors.
Shouldn't std::wstring provide the same interface as std::string? At least for such a basic operation as assignment?
(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)
You should do:
#include <string>
int main() {
    const wchar_t h[] = L"hello";
    std::wstring w = h;
    return 0;
}
std::string is a typedef of std::basic_string<char>, while std::wstring is a typedef of std::basic_string<wchar_t>. As such, the 'equivalent' C-string of a wstring is an array of wchar_ts.
The 'L' in front of the string literal is to indicate that you are using a wide-char string constant.
The relevant part of the string API is this constructor:
basic_string(const charT*);
For std::string, charT is char. For std::wstring it's wchar_t. So the reason it doesn't compile is that wstring doesn't have a char* constructor. Why doesn't wstring have a char* constructor?
There is no single unique way to convert a string of char to a string of wchar_t. What's the encoding used for the char string? Is it just 7-bit ASCII? Is it UTF-8? Is it UTF-7? Is it SHIFT-JIS? So I don't think it would entirely make sense for std::wstring to have an automatic conversion from char*, even though you could cover most cases. You can use:
w = std::wstring(h, h + std::strlen(h)); // strlen from <cstring>; h is a pointer here, so sizeof would not give its length
which will convert each char in turn to wchar (except the NUL terminator), and in this example that's probably what you want. As int3 says though, if that's what you mean it's most likely better to use a wide string literal in the first place.
To convert from a multibyte encoding to a wide character encoding, take a look at the header <locale> and the type std::codecvt. The Dinkumware library has a class Dinkum::wstring_convert that makes performing such multibyte-to-wide conversions easier.
The function std::codecvt_byname allows one to find a codecvt instance for a particular named encoding. Unfortunately, discovering the names of the encodings (or locales) on your system is implementation-specific.
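For example, using the (now-deprecated) std::wstring_convert facility discussed near the top of this page, a UTF-8 encoded char string can be widened like this (a sketch; it assumes the input really is UTF-8):
#include <codecvt>
#include <locale>
#include <string>

// Sketch only: UTF-8 narrow string -> std::wstring (deprecated machinery since C++17).
std::wstring widen_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    return conv.from_bytes(utf8);
}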
Small suggestion... Do not use "Unicode" strings under Linux (a.k.a. wide strings). std::string is perfectly fine and holds Unicode very well (UTF-8).
Most Linux APIs work with char* strings, and the most popular encoding is UTF-8.
So... Just don't bother yourself using wstring.
In addition to the other answers, you could use a trick from Microsoft's book (specifically, tchar.h), and write something like this:
# ifdef APP_USE_UNICODE
typedef std::wstring AppStringType;
#define _T(s) (L##s)
# else
typedef std::string AppStringType;
#define _T(s) (s)
# endif
AppStringType foo = _T("hello world!");
(Note: my macro-fu is weak, and this is untested, but you get the idea.)
Looks like you can do something like this:
#include <sstream>
// ...
std::wstringstream tmp;
tmp << "hello world";                  // narrow characters are widened one at a time
std::wstring our_string = tmp.str();
Although for a more complex situation, you may want to break down and use mbstowcs.
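A sketch of the mbstowcs() route (the result depends on the current C locale's narrow encoding, and the helper name is made up):
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Sketch only: widen a locale-encoded narrow string with mbstowcs().
std::wstring widen_mbs(const char* narrow) {
    std::setlocale(LC_ALL, "");                              // usually done once at program start
    std::size_t needed = std::mbstowcs(nullptr, narrow, 0);  // measure the converted length
    if (needed == static_cast<std::size_t>(-1))
        return std::wstring();                               // invalid multibyte sequence
    std::vector<wchar_t> buf(needed + 1);
    std::mbstowcs(buf.data(), narrow, buf.size());
    return std::wstring(buf.data(), needed);
}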
You should use:
#include <tchar.h>
tstring instead of wstring/string
TCHAR* instead of char*
and _T("hello") instead of "hello" or L"hello"
This will use the appropriate form of string/char when _UNICODE is defined.
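Note that <tchar.h> only provides TCHAR, _T() and the _tcs* functions; tstring is a typedef you add yourself, for example (a sketch):
#include <tchar.h>
#include <string>

typedef std::basic_string<TCHAR> tstring; // wchar_t-based when _UNICODE is defined, char-based otherwise

tstring s = _T("hello"); // compiles the same way in both ANSI and Unicode builds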