Convert ICU Unicode string to std::wstring (or wchar_t*) - C++

Is there an ICU function to create a std::wstring from an ICU UnicodeString? I have been searching the ICU manual but haven't been able to find one.
(I know I can convert the UnicodeString to UTF-8 and then convert to a platform-dependent wchar_t*, but I am looking for a single function in UnicodeString which can do this conversion.)

The C++ standard doesn't dictate any specific encoding for std::wstring. On Windows systems, wchar_t is 16-bit, and on Linux, macOS, and several other platforms, wchar_t is 32-bit. As far as C++'s std::wstring is concerned, it is just an arbitrary sequence of wchar_t in much the same way that std::string is just an arbitrary sequence of char.
It seems that icu::UnicodeString has no in-built way of creating a std::wstring, but if you really want to create a std::wstring anyway, you can use the C-based API u_strToWCS() like this:
#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <string>

icu::UnicodeString ustr = /* get from somewhere */;
std::wstring wstr;
int32_t requiredSize = 0;
UErrorCode error = U_ZERO_ERROR;
// pre-flight to obtain the size of string we need
u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
// pre-flighting reports U_BUFFER_OVERFLOW_ERROR by design, so reset it
error = U_ZERO_ERROR;
// resize accordingly (this will not include any terminating null character, but it doesn't need to either)
wstr.resize(requiredSize);
// copy the UnicodeString buffer to the std::wstring
u_strToWCS(&wstr[0], requiredSize, nullptr, ustr.getBuffer(), ustr.length(), &error);
Supposedly, u_strToWCS() will use the most efficient method for converting from UChar to wchar_t (if they are the same size, it is presumably just a straightforward copy).
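For the opposite direction (std::wstring back to icu::UnicodeString), ICU has the matching C API u_strFromWCS(). A minimal sketch using the same pre-flighting pattern as above (the variable names are illustrative):

#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <string>

std::wstring wstr = /* get from somewhere */;
int32_t requiredSize = 0;
UErrorCode error = U_ZERO_ERROR;
// pre-flight to get the UTF-16 length (this reports U_BUFFER_OVERFLOW_ERROR)
u_strFromWCS(nullptr, 0, &requiredSize, wstr.data(), static_cast<int32_t>(wstr.size()), &error);
error = U_ZERO_ERROR;
icu::UnicodeString ustr;
UChar* buffer = ustr.getBuffer(requiredSize); // writable scratch buffer
u_strFromWCS(buffer, requiredSize, nullptr, wstr.data(), static_cast<int32_t>(wstr.size()), &error);
ustr.releaseBuffer(requiredSize);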

Related

c++ how to convert wchar_t to char16_t

Convert between string, u16string & u32string
This post explains the opposite of my question, so I need to post a new question.
I need to convert wchar_t to char16_t. I found a sample doing the opposite (char16_t -> wchar_t) here:
I am not familiar with templates etc., sorry. Can anybody give me an example of converting wchar_t to char16_t, please?
I have this piece of code that I want to adapt for converting wchar_t to char16_t.
#include <codecvt>
#include <locale>
#include <string>

std::wstring u16fmt(const char16_t* str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert_wstring;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    // char16_t -> UTF-8 -> wchar_t
    std::string utf8str = convert.to_bytes(str);
    std::wstring wstr = convert_wstring.from_bytes(utf8str);
    return wstr;
}
Ah, and it should run on Windows and Linux
If sizeof( wchar_t ) == 2 (*), you're saddled with Windows and can only hope your wstring holds UTF-16 (and hasn't been smashed flat to UCS-2 by some old Windows function).
If sizeof( wchar_t ) == 4 (*), you're not on Windows and need to do a UTF-32 to UTF-16 conversion.
(*): Assuming CHAR_BIT == 8.
I am, however, rather pessimistic about the standard library's Unicode capabilities beyond simple "piping through", so if you're going to do any actual work on those strings, I'd recommend ICU, the de-facto C/C++ standard library for all things Unicode.
icu::UnicodeString has a wchar_t * constructor (available where wchar_t is 16-bit), and you can call getTerminatedBuffer() to get a (non-owning) const char16_t *. Or, of course, just use icu::UnicodeString throughout, which uses UTF-16 internally.
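A minimal sketch of the conversion the question actually asks for (wchar_t -> char16_t), obtained by simply reversing the question's own sample; it uses the same <codecvt> facilities (deprecated since C++17), and u16convert is a hypothetical name:

#include <codecvt>
#include <locale>
#include <string>

std::u16string u16convert(const wchar_t* str) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert_wstring;
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    // wchar_t -> UTF-8 -> char16_t, mirroring the sample above
    std::string utf8str = convert_wstring.to_bytes(str);
    return convert.from_bytes(utf8str);
}

Because both facets pivot through UTF-8, this runs on both Windows and Linux (with the usual codecvt_utf8<wchar_t> caveat that on Windows it treats wchar_t as UCS-2, so characters outside the BMP are not handled).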

how does one convert std::u16string -> std::wstring using <codecvt>?

I found a bunch of questions on a similar topic, but nothing regarding wide-to-wide conversion with <codecvt>, which is supposed to be the correct choice in modern code.
The std::codecvt_utf16<wchar_t> seems to be a logical choice to perform the conversion.
However, std::wstring_convert seems to expect std::string at one end. The methods from_bytes and to_bytes emphasize this purpose.
I mean, the best solution so far is something like std::copy, which might work for my specific case, but seems kinda low-tech and probably not too correct either.
I have a strong feeling that I am missing something rather obvious.
Cheers.
The std::wstring_convert and std::codecvt... classes are deprecated in C++17 onward. There is no longer a standard way to convert between the various string classes.
If your compiler still supports the classes, you can certainly use them. However, you cannot convert directly from std::u16string to std::wstring (and vice versa) with them. You will have to convert to an intermediate UTF-8 std::string first, and then convert that afterwards, eg:
#include <codecvt>
#include <locale>
#include <string>

std::u16string utf16 = ...;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16conv;
std::string utf8 = utf16conv.to_bytes(utf16);
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> wconv;
std::wstring wstr = wconv.from_bytes(utf8);
Just know that this approach will break when the classes are eventually dropped from the standard library.
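For convenience, the same two-step round trip can be wrapped in a helper; a minimal sketch, with the same deprecation caveat (u16_to_wstring is a hypothetical name):

#include <codecvt>
#include <locale>
#include <string>

std::wstring u16_to_wstring(const std::u16string& utf16) {
    // UTF-16 -> UTF-8 -> wchar_t encoding, exactly as above
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16conv;
    std::string utf8 = utf16conv.to_bytes(utf16);
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> wconv;
    return wconv.from_bytes(utf8);
}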
Using std::copy() (or simply the various std::wstring data construct/assign methods) will work only on Windows, where wchar_t and char16_t are both 16-bit in size representing UTF-16:
std::u16string utf16 = ...;
std::wstring wstr;
#ifdef _WIN32
wstr.reserve(utf16.size());
std::copy(utf16.begin(), utf16.end(), std::back_inserter(wstr));
/*
or: wstr = std::wstring(utf16.begin(), utf16.end());
or: wstr.assign(utf16.begin(), utf16.end());
or: wstr = std::wstring(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
or: wstr.assign(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
*/
#else
// do something else ...
#endif
But on other platforms, where wchar_t is 32-bit in size representing UTF-32, you will need to actually convert the data, using the code shown above, a platform-specific API, or a 3rd-party Unicode library that can do the data conversion, such as libiconv, ICU, etc.
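For that non-Windows branch, a hedged sketch using ICU's u_strToWCS() (the same C API shown in the first answer above); it assumes ICU is available and that UChar is char16_t (the default since ICU 59):

#include <unicode/ustring.h>
#include <string>

std::wstring utf16_to_wstring(const std::u16string& utf16) {
    std::wstring wstr;
    int32_t requiredSize = 0;
    UErrorCode error = U_ZERO_ERROR;
    // pre-flight for the required length (reports U_BUFFER_OVERFLOW_ERROR)
    u_strToWCS(nullptr, 0, &requiredSize, utf16.data(),
               static_cast<int32_t>(utf16.size()), &error);
    error = U_ZERO_ERROR;
    wstr.resize(requiredSize);
    u_strToWCS(&wstr[0], requiredSize, nullptr, utf16.data(),
               static_cast<int32_t>(utf16.size()), &error);
    return wstr;
}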
you cannot convert directly from std::u16string to std::wstring (and vice versa) with them. You will have to convert to an intermediate UTF-8 std::string first, and then convert that afterwards
This doesn't appear to be the case as
clang: converting const char16_t* (UTF-16) to wstring (UCS-4)
shows:
#include <codecvt>
#include <locale>
#include <string>

std::u16string s = u"hello";
// codecvt_utf16 converts between a UTF-16 *byte stream* and wide characters,
// which is why the UTF-16 units are reinterpreted as raw bytes below.
std::wstring_convert<std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>,
                     wchar_t> conv;
std::wstring ws = conv.from_bytes(
    reinterpret_cast<const char*>(&s[0]),
    reinterpret_cast<const char*>(&s[0] + s.size()));

Conversion (const char*) var goes wrong

I need to convert from CString to double in Embedded Visual C++, which supports only old-style C++. I am using the following code:
CString str = "4.5";
double var = atof( (const char*) (LPCTSTR) str );
and the result is var=4.0, so I am losing decimal digits.
I have made another test
LPCTSTR str = "4.5";
const char* var = (const char*) str;
and the result again is var=4.0.
Can anyone help me to get a correct result?
The issue here is that you are lying to the compiler, and the compiler trusts you. Since you are using Embedded Visual C++, I'm going to assume that you are targeting Windows CE. Windows CE exposes a Unicode-only API surface, so your project is very likely set to use Unicode (UTF-16 LE encoding).
In that case, CString expands to CStringW, which stores code units as wchar_t. When doing (const char*) (LPCTSTR) str you are then casting from a wchar_t const* to a char const*. Given the input, the first byte has the value 52 (the ASCII encoding for the character 4). The second byte has the value 0. That is interpreted as the terminator of the C-style string. In other words, you are passing the string "4" to your call to atof. Naturally, you'll get the value 4.0 as the result.
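A tiny illustration of that terminator effect (assuming a 16-bit little-endian wchar_t, as on Windows):

#include <cstdio>
#include <cstring>

int main() {
    const wchar_t* wide = L"4.5"; // stored as the bytes 34 00 2E 00 35 00 00 00
    const char* bytes = reinterpret_cast<const char*>(wide);
    std::printf("%s\n", bytes);   // prints "4": the second byte (0) terminates it
    std::printf("%u\n", static_cast<unsigned>(std::strlen(bytes))); // 1, not 3
    return 0;
}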
To fix the code, use something like the following:
CStringW str = L"4.5";
double var = _wtof( str.GetString() );
_wtof is a Microsoft-specific extension to its CRT.
Note two things in particular:
The code uses a CString variant with explicit character encoding (CStringW). Always be explicit about your string types. This helps read your code and catch bugs before they happen (although all those C-style casts in the original code defeat that entirely).
The code calls the CString::GetString member to retrieve a pointer to the immutable buffer. This, too, makes the code easier to read, by not using what looks to be a C-style cast (but is an operator instead).
Also consider defining the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION macro to prevent inadvertent character set conversions from happening (e.g. CString str = "4.5";). This, too, helps you catch bugs early (unless you defeat that with C-style casts as well).
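A minimal sketch of that macro in use; it must be defined before the ATL/MFC string headers are included (e.g. in stdafx.h or via the project settings):

#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
#include <atlstr.h>

// In a Unicode build, the implicit narrow->wide conversion now fails to compile:
// CString str = "4.5";   // error
CString str = L"4.5";     // an explicit wide literal is fine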
CString is not const char*. To convert a TCHAR-based CString to ANSI, use the CT2A macro - this will also allow you to convert the string to UTF-8 (or any other Windows code page):
// Convert using the local code page
CString str(_T("Hello, world!"));
CT2A ascii(str);
TRACE(_T("ASCII: %S\n"), ascii.m_psz);
// Convert to UTF8
CString str(_T("Some Unicode goodness"));
CT2A ascii(str, CP_UTF8);
TRACE(_T("UTF8: %S\n"), ascii.m_psz);
Found a solution using scanf:
CString str = "4.5";
double var = 0.0;
_stscanf( str, _T("%lf"), &var );
This gives the correct result var=4.5.
Thanks everyone for comments and help.

Converting "normal" std::string to utf-8

Let's see if I can explain this without too many factual errors...
I'm writing a string class and I want it to use utf-8 (stored in a std::string) as its internal storage.
I want it to be able to take both "normal" std::string and std::wstring as input and output.
Working with std::wstring is not a problem, I can use std::codecvt_utf8<wchar_t> to convert both from and to std::wstring.
However, after extensive googling and searching on SO, I have yet to find a way to convert between a "normal/default" C++ std::string (which I assume on Windows uses the local system code page?) and a utf-8 std::string.
I guess one option would be to first convert the std::string to a std::wstring using std::codecvt<wchar_t, char> and then convert it to utf-8 as above, but this seems quite inefficient, given that at least the first 128 values of a char should translate straight over to utf-8 without conversion, regardless of localization, if I understand correctly.
I found this similar question: C++: how to convert ASCII or ANSI to UTF8 and stores in std::string
Although I'm a bit skeptical towards that answer, as it's hard-coded to Latin-1 and I want this to work with all types of localization to be on the safe side.
No answers involving boost thanks, I don't want the headache of getting my codebase to work with it.
If your "normal string" is encoded using the system's code page and you want to convert it to UTF-8 then this should work:
#include <windows.h>
#include <string>

std::string codepage_str;
int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                               static_cast<int>(codepage_str.length()), nullptr, 0);
std::wstring utf16_str(size, L'\0');
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                    static_cast<int>(codepage_str.length()), &utf16_str[0], size);
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                    static_cast<int>(utf16_str.length()), nullptr, 0,
                                    nullptr, nullptr);
std::string utf8_str(utf8_size, '\0');
WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                    static_cast<int>(utf16_str.length()), &utf8_str[0], utf8_size,
                    nullptr, nullptr);
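Wrapped as a reusable helper for readability (acp_to_utf8 is a hypothetical name; Windows-only, and error handling is omitted):

#include <windows.h>
#include <string>

std::string acp_to_utf8(const std::string& codepage_str) {
    // system code page -> UTF-16
    int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                   static_cast<int>(codepage_str.length()),
                                   nullptr, 0);
    std::wstring utf16_str(size, L'\0');
    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                        static_cast<int>(codepage_str.length()),
                        &utf16_str[0], size);
    // UTF-16 -> UTF-8
    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                        static_cast<int>(utf16_str.length()),
                                        nullptr, 0, nullptr, nullptr);
    std::string utf8_str(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                        static_cast<int>(utf16_str.length()),
                        &utf8_str[0], utf8_size, nullptr, nullptr);
    return utf8_str;
}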

How to convert std::wstring to a TCHAR*?

How to convert a std::wstring to a TCHAR*? std::wstring.c_str() does not work since it returns a wchar_t*.
How do I get from wchar_t* to TCHAR*, or from std::wstring to TCHAR*?
Use this:
std::wstring str1(L"Hello world");
const TCHAR* v1 = str1.c_str(); // valid in a Unicode build, where TCHAR is wchar_t
#include <atlconv.h>

USES_CONVERSION; // required before using the ATL W2T/W2A conversion macros
TCHAR *dst = W2T(src.c_str());
Will do the right thing in ANSI or Unicode builds.
TCHAR* is defined to be wchar_t* if UNICODE is defined, otherwise it's char*. So your code might look something like this:
wchar_t* src;
TCHAR* result;
#ifdef UNICODE
result = src;
#else
//I think W2A is defined in atlconv.h (it needs USES_CONVERSION), and it returns a stack-allocated var.
//If that's not OK, look at the documentation for wcstombs.
result = W2A(src);
#endif
In general this is not possible, since wchar_t may not be the same size as TCHAR.
Several solutions are already listed for converting between character sets. These can work if the character sets overlap for the range being converted.
I prefer to sidestep the issue entirely wherever possible and use a standard string that is defined on the TCHAR character set as follows:
typedef std::basic_string<TCHAR> tstring;
Using this, you now have a standard-library-compatible string that is also compatible with the Windows TCHAR macro.
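A short usage sketch of the tstring approach (assumes <tchar.h> for the _T() macro):

#include <string>
#include <tchar.h>

typedef std::basic_string<TCHAR> tstring;

int main() {
    tstring s = _T("Hello, world");
    const TCHAR* p = s.c_str(); // compiles unchanged in both ANSI and Unicode builds
    return 0;
}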
You can use:
std::wstring ws = L"Testing123";
std::string s(ws.begin(), ws.end()); // note: this narrows each wchar_t to a char, so it is only safe for ASCII content
// s.c_str() is what you're after
Assuming that you are operating on Windows.
If you are in Unicode build configuration, then TCHAR and wchar_t are the same thing. You might need a cast depending on whether you have /Z set for wchar_t is a type versus wchar_t is a typedef.
If you are in a multibyte build configuration, you need WideCharToMultiByte to get from std::wstring to a char-based TCHAR* (and MultiByteToWideChar for the reverse).
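Putting that together, a hedged sketch covering both build configurations (wstring_to_tstring is a hypothetical name; assumes Windows headers):

#include <windows.h>
#include <string>

std::basic_string<TCHAR> wstring_to_tstring(const std::wstring& ws) {
#ifdef UNICODE
    return ws; // TCHAR is wchar_t: no conversion needed
#else
    // TCHAR is char: convert UTF-16 to the active code page
    int size = WideCharToMultiByte(CP_ACP, 0, ws.c_str(),
                                   static_cast<int>(ws.size()),
                                   nullptr, 0, nullptr, nullptr);
    std::string s(size, '\0');
    WideCharToMultiByte(CP_ACP, 0, ws.c_str(), static_cast<int>(ws.size()),
                        &s[0], size, nullptr, nullptr);
    return s;
#endif
}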