libxml2 xmlChar * to std::wstring

libxml2 xmlChar * to std::wstring - c++

libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there's no provided routines to get an std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
if(!xmlString){abort();} //provided string was null
int charLength = xmlStrlen(xmlString); //excludes null terminator
wchar_t *wideBuffer = new wchar_t[charLength];
size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
std::wstring wideString(wideBuffer, wcharLength);
delete[] wideBuffer;
return wideString;
}
Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string; I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought it would have been clearer as I have both charLength and wcharLength. As for the correctness of the code, the wideBuffer will be larger or equal to the required size to hold the buffer, always (I believe). As characters that require more space than wide_t will be truncated (I think).

xmlStrlen() returns the number of UTF-8 encoded codeunits in the xmlChar* string. That is not going to be the same number of wchar_t encoded codeunits needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbtowc() once to get the correct length, then allocate the memory, and call mbtowc() again to fill the memory. You will also have to use std::setlocale() to tell mbtowc() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
if (!xmlString) { abort(); } //provided string was null
std::wstring wideString;
int charLength = xmlStrlen(xmlString);
if (charLength > 0)
{
char *origLocale = setlocale(LC_CTYPE, NULL);
setlocale(LC_CTYPE, "en_US.UTF-8");
size_t wcharLength = mbtowc(NULL, (const char*) xmlString, charLength); //excludes null terminator
if (wcharLength != (size_t)(-1))
{
wideString.resize(wcharLength);
mbtowc(&wideString[0], (const char*) xmlString, charLength);
}
setlocale(LC_CTYPE, origLocale);
if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
}
return wideString;
}
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead so you do not have to deal with locales:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
if (!xmlString) { abort(); } //provided string was null
try
{
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
return conv.from_bytes((const char*)xmlString);
}
catch(const std::range_error& e)
{
abort(); //wstring_convert failed
}
}
An alternative option is to use an actual Unicode library, such as ICU or ICONV, to handle Unicode conversions.

There are some problems in this code, besides the fact that you are using wchar_t and std::wstring which is a bad idea unless you're making calls to the Windows API.
xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbtowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.
My recommendations:
Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).
If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:
std::wstring utf8_to_wstring(const char *utf8)
{
size_t utf8len = std::strlen(utf8);
int wclen = MultiByteToWideChar(
CP_UTF8, 0, utf8, utf8len, NULL, 0);
wchar_t *wc = NULL;
try {
wc = new wchar_t[wclen];
MultiByteToWideChar(
CP_UTF8, 0, utf8, utf8len, wc, wclen);
std::wstring wstr(wc, wclen);
delete[] wc;
wc = NULL;
return wstr;
} catch (std::exception &) {
if (wc)
delete[] wc;
}
}
If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.

add
#include <boost/locale.hpp>
convert xmlChar* to string
std::string strGbk((char*)node);
convert string to wstring
std::string strGbk = "china powerful forever";
std::wstring wstr = boost::locale::conv::to_utf<wchar_t>(strGbk, "gbk");
std::cout << strGbk << std::endl;
std::wcout << wstr. << std::endl;
it works for me,good lucks.

Related

What is the right way to convert UTF16 string to wchar_t on Mac?

In the project that still uses XCode 3 (no C++11 features like codecvt)

Use a conversion library, like libiconv. You can set its input encoding to "UTF-16LE" or "UTF-16BE" as needed, and set its output encoding to "wchar_t" rather than any specific charset.
#include <iconv.h>
uint16_t *utf16 = ...; // input data
size_t utf16len = ...; // in bytes
wchar_t *outbuf = ...; // allocate an initial buffer
size_t outbuflen = ...; // in bytes
char *inptr = (char*) utf16;
char *outptr = (char*) outbuf;
iconv_t cvt = iconv_open("wchar_t", "UTF-16LE");
while (utf16len > 0)
{
if (iconv(cvt, &inptr, &utf16len, &outptr, &outbuflen) == (size_t)(−1))
{
if (errno == E2BIG)
{
// resize outbuf to a larger size and
// update outptr and outbuflen according...
}
else
break; // conversion failure
}
}
iconv_close(cvt);

Why do you want wchar_t on mac? wchar_t does not necessary be 16 bit, it is not very useful on mac.
I suggest to convert yo NSString using
char* payload; // point to string with UTF16 encoding
NSString* s = [NSString stringWithCString:payload encoding: NSUTF16LittleEndianStringEncoding];
To convert NSString to UTF16
const char* payload = [s cStringUsingEncoding:NSUTF16LittleEndianStringEncoding];
Note that mac support NSUTF16BigEndianStringEncoding as well.
Note2: Although const char* is used, the data is encoded with UTF16 so don't pass it to strlen().

I would go the safest route.
Get the UTF-16 string as a UTF-8 string (using NSString)
set the locale to UTF-8
use mbstowcs() to convert the UTF-8 multi-byte string to a wchart_t
At each step you are ensured the string value will be protected.

InternetCanonicalizeUrl fails to decode diacritic letters

I'm having lots of troubles while dealing with some characters into a URL, let's suppose that I have the following URL:
http: //localhost/somewere/myLibrary.dll/rest/something?parameter=An%C3%A1lisis
Which must be converted to:
http: //localhost/somewere/myLibrary.dll/rest/something?parameter=Análisis
In order to deal with the decoding of diacritic letters, I've decided to use the InternetCanonicalizeUrl function, because the application that I'm working on is going to work only in Windows and I don't want to install additional libraries, the helper function I've used is the following:
String DecodeURL(const String &a_URL)
{
String result;
unsigned long size = a_reportType.Length() * 2;
wchar_t *buffer = new wchar_t[size];
if (InternetCanonicalizeUrlW(a_URL.c_str(), buffer, &size, ICU_DECODE | ICU_NO_ENCODE))
{
result = buffer;
}
delete [] buffer;
return result;
}
That works kind of well for almost any of the URL passed through it, except for diacritic letters, my example URL is decoded as follows:
http: //localhost/somewere/myLibrary.dll/rest/something?parameter=AnÃ¡lisis
The IDE I'm working with is CodeGear™ C++Builder® 2009 (that's why I'm forced to use String instead of std::string), I've also tried with a AnsiString and char buffer version with the same results.
Any hint/alternative about how to deal with this error?
Thanks in advance.

InternetCanonicalizeUrl() is doing the right thing, you just have to take into account what it is actually doing.
URLs do not support Unicode (IRIs do), so Unicode data has to be charset-encoded into byte octets and then those octets are url-encoded using %HH sequences as needed. In this case, the data was encoded as UTF-8 (not uncommon in many URLs nowadays, but also not guaranteed), but InternetCanonicalizeUrl() has no way of knowing that as URLs do not have a syntax for describing which charset is being used. All it can do is decode %HH sequences to the relevant byte octet values, it cannot charset-decode the octets for you. In the case of the Unicode version, InternetCanonicalizeUrlW() returns those byte values as-is as wchar_t elements. But either way, you have to charset-decode the octets yourself to recover the original Unicode data.
So what you can do in this case is copy the decoded data to a UTF8String and then assign/return that as a String so it gets decoded to UTF-16. That will only work for UTF-8 encoded URLs, of course. For example:
String DecodeURL(const String &a_URL)
{
DWORD size = 0;
if (!InternetCanonicalizeUrlW(a_URL.c_str(), NULL, &size, ICU_DECODE | ICU_NO_ENCODE))
{
if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
{
String buffer;
buffer.SetLength(size-1);
if (InternetCanonicalizeUrlW(a_URL.c_str(), buffer.c_str(), &size, ICU_DECODE | ICU_NO_ENCODE))
{
UTF8String utf8;
utf8.SetLength(buffer.Length());
for (int i = 1; i <= buffer.Length(); ++i)
utf8[i] = (char) buffer[i];
return utf8;
}
}
}
return String();
}
Alternatively:
// encoded URLs are always ASCII, so it is safe
// to pass an encoded URL UnicodeString as an
// AnsiString...
String DecodeURL(const AnsiString &a_URL)
{
DWORD size = 0;
if (!InternetCanonicalizeUrlA(a_URL.c_str(), NULL, &size, ICU_DECODE | ICU_NO_ENCODE))
{
if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
{
UTF8String buffer;
buffer.SetLength(size-1);
if (InternetCanonicalizeUrlA(a_URL.c_str(), buffer.c_str(), &size, ICU_DECODE | ICU_NO_ENCODE))
{
return utf8;
}
}
}
FYI, C++Builder ships with Indy pre-installed. Indy has a TIdURI class, which can decode URL and take charsets into account, eg:
#include <IdGlobal.hpp>
#include <IdURI.hpp>
String DecodeURL(const String &a_URL)
{
return TIdURI::URLDecode(URL, enUTF8);
}
In any case, you have to know the charset used to encode the URL data. If you do not, all you can do is decode the raw octets and then use heuristic analysis to guess what the charset might be, but that is not 100% reliable for non-ASCII and non-UTF charsets.

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to_executable> <path-to-file> and then call popen() on this command. Right now, both these two components are declared in a char*. I need to read the output of the command so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!

You'll need an UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, and not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>
char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
char result_buffer[MAXPATHLEN];
iconv_t converter = iconv_open("UTF-8", "wchar_t");
char* result = result_buffer;
char* input = (char*)wchar;
size_t output_available_size = sizeof result_buffer;
size_t input_available_size = utf32_bytes;
size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
if (result_code == -1)
{
perror("iconv");
return NULL;
}
iconv_close(converter);
return strdup(result_buffer);
}
int main()
{
wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
char* utf8 = utf8path(hello_world, sizeof hello_world);
printf("%s\n", utf8);
free(utf8);
return 0;
}
The utf8_hello_world function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.

Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the UTF-16 to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?

Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program with set_locale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
return s;
}
// Real worker
std::wstring get_wstring(const std::string & s)
{
const char * cs = s.c_str();
const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
std::vector<wchar_t> buf(wn + 1);
const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
assert(cs == NULL); // successful conversion
return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
return s;
}
// Real worker
std::string get_locale_string(const std::wstring & s)
{
const wchar_t * cs = s.c_str();
const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
std::vector<char> buf(wn + 1);
const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
assert(cs == NULL); // successful conversion
return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use the mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1);, wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.

You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
std::wstring target(source.size()+1, L' ');
std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
target.resize(newLength);
return target;
}
I'm not entirely sure that that usage of &target[0] is entirely standard-conforming, if someone has a good answer to that please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string - a logical assumption that still I'm not sure it's covered by the standard.
On the other hand, it seems that there's no way to ask to mbstowcs the size of the needed buffer, so either you go this way, or go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground, two equivalent strings may be evaluated different when compared bitwise.
Long story short: this should work, and I think it's the maximum you can do with just the standard library, but it's a lot implementation-dependent in how Unicode is handled, and I wouldn't trust it a lot. In general, it's just better to stick with an encoding inside your application and avoid this kind of conversions unless absolutely necessary, and, if you are working with definite encodings, use APIs that are less implementation-dependent.

Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
wstring ConvertToUnicode(const string & str)
{
UINT codePage = CP_ACP;
DWORD flags = 0;
int resultSize = MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, NULL // lpWideCharStr
, 0 // cchWideChar
);
vector<wchar_t> result(resultSize + 1);
MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, &result[0] // lpWideCharStr
, resultSize // cchWideChar
);
return &result[0];
}

C++ Convert string (or char) to wstring (or wchar_t)

string s = "おはよう";
wstring ws = FUNCTION(s, ws);
How would i assign the contents of s to ws?
Searched google and used some techniques but they can't assign the exact content. The content is distorted.

Assuming that the input string in your example (おはよう) is a UTF-8 encoded (which it isn't, by the looks of it, but let's assume it is for the sake of this explanation :-)) representation of a Unicode string of your interest, then your problem can be fully solved with the standard library (C++11 and newer) alone.
The TL;DR version:
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
Longer online compilable and runnable example:
(They all show the same example. There are just many for redundancy...)
http://ideone.com/KA1oty
http://ide.geeksforgeeks.org/5pRLSh
http://rextester.com/DIJZK52174
Note (old):
As pointed out in the comments and explained in https://stackoverflow.com/a/17106065/6345 there are cases when using the standard library to convert between UTF-8 and UTF-16 might give unexpected differences in the results on different platforms. For a better conversion, consider std::codecvt_utf8 as described on http://en.cppreference.com/w/cpp/locale/codecvt_utf8
Note (new):
Since the codecvt header is deprecated in C++17, some worry about the solution presented in this answer were raised. However, the C++ standards committee added an important statement in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0618r0.html saying
this library component should be retired to Annex D, along side , until a suitable replacement is standardized.
So in the foreseeable future, the codecvt solution in this answer is safe and portable.

int StringToWString(std::wstring &ws, const std::string &s)
{
std::wstring wsTmp(s.begin(), s.end());
ws = wsTmp;
return 0;
}

Your question is underspecified. Strictly, that example is a syntax error. However, std::mbstowcs is probably what you're looking for.
It is a C-library function and operates on buffers, but here's an easy-to-use idiom, courtesy of Mooing Duck:
std::wstring ws(s.size(), L' '); // Overestimate number of code points.
ws.resize(std::mbstowcs(&ws[0], s.c_str(), s.size())); // Shrink to fit.

If you are using Windows/Visual Studio and need to convert a string to wstring you could use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java). A standard way of converting a std::wstring to utf8 std::string is showed in this answer.
//
// using ATL
CA2W ca2w(str, CP_UTF8);
//
// or the standard way taken from the answer above
#include <codecvt>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable the security warning #4995', I don't know of other workaround (to me it happen when I compiled for WindowsXp in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article the article by Joel appears to be: "while entertaining, it is pretty light on actual technical details". Article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.

Windows API only, pre C++11 implementation, in case someone needs it:
#include <stdexcept>
#include <vector>
#include <windows.h>
using std::runtime_error;
using std::string;
using std::vector;
using std::wstring;
wstring utf8toUtf16(const string & str)
{
if (str.empty())
return wstring();
size_t charsNeeded = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), NULL, 0);
if (charsNeeded == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
vector<wchar_t> buffer(charsNeeded);
int charsConverted = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), &buffer[0], buffer.size());
if (charsConverted == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
return wstring(&buffer[0], charsConverted);
}

Here's a way to combining string, wstring and mixed string constants to wstring. Use the wstringstream class.
This does NOT work for multi-byte character encodings. This is just a dumb way of throwing away type safety and expanding 7 bit characters from std::string into the lower 7 bits of each character of std:wstring. This is only useful if you have a 7-bit ASCII strings and you need to call an API that requires wide strings.
#include <sstream>
std::string narrow = "narrow";
std::wstring wide = L"wide";
std::wstringstream cls;
cls << " abc " << narrow.c_str() << L" def " << wide.c_str();
std::wstring total= cls.str();

From char* to wstring:
char* str = "hello worlddd";
wstring wstr (str, str+strlen(str));
From string to wstring:
string str = "hello worlddd";
wstring wstr (str.begin(), str.end());
Note this only works well if the string being converted contains only ASCII characters.

This variant of it is my favourite in real life. It converts the input, if it is valid UTF-8, to the respective wstring. If the input is corrupted, the wstring is constructed out of the single bytes. This is extremely helpful if you cannot really be sure about the quality of your input data.
std::wstring convert(const std::string& input)
{
try
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
return converter.from_bytes(input);
}
catch(std::range_error& e)
{
size_t length = input.length();
std::wstring result;
result.reserve(length);
for(size_t i = 0; i < length; i++)
{
result.push_back(input[i] & 0xFF);
}
return result;
}
}

using Boost.Locale:
ws = boost::locale::conv::utf_to_utf<wchar_t>(s);

You can use boost path or std path; which is a lot more easier.
boost path is easier for cross-platform application
#include <boost/filesystem/path.hpp>
namespace fs = boost::filesystem;
//s to w
std::string s = "xxx";
auto w = fs::path(s).wstring();
//w to s
std::wstring w = L"xxx";
auto s = fs::path(w).string();
if you like to use std:
#include <filesystem>
namespace fs = std::filesystem;
//The same
c++ older version
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
//The same
The code within still implement a converter which you dont have to unravel the detail.

For me the most uncomplicated option without big overhead is:
Include:
#include <atlbase.h>
#include <atlconv.h>
Convert:
char* whatever = "test1234";
std::wstring lwhatever = std::wstring(CA2W(std::string(whatever).c_str()));
If needed:
lwhatever.c_str();

String to wstring
std::wstring Str2Wstr(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
wstring to String
std::string Wstr2Str(const std::wstring& wstr)
{
typedef std::codecvt_utf8<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
return converterX.to_bytes(wstr);
}

If you have QT and if you are lazy to implement a function and stuff you can use
std::string str;
QString(str).toStdWString()

Here is my super basic solution that might not work for everyone. But would work for a lot of people.
It requires usage of the Guideline Support Library.
Which is a pretty official C++ library that was designed by many C++ committee authors:
https://github.com/isocpp/CppCoreGuidelines
https://github.com/Microsoft/GSL
std::string to_string(std::wstring const & wStr)
{
std::string temp = {};
for (wchar_t const & wCh : wStr)
{
// If the string can't be converted gsl::narrow will throw
temp.push_back(gsl::narrow<char>(wCh));
}
return temp;
}
All my function does is allow the conversion if possible. Otherwise throw an exception.
Via the usage of gsl::narrow (https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#es49-if-you-must-use-a-cast-use-a-named-cast)

method s2ws works well. Hope helps.
std::wstring s2ws(const std::string& s) {
std::string curLocale = setlocale(LC_ALL, "");
const char* _Source = s.c_str();
size_t _Dsize = mbstowcs(NULL, _Source, 0) + 1;
wchar_t *_Dest = new wchar_t[_Dsize];
wmemset(_Dest, 0, _Dsize);
mbstowcs(_Dest,_Source,_Dsize);
std::wstring result = _Dest;
delete []_Dest;
setlocale(LC_ALL, curLocale.c_str());
return result;
}

Based upon my own testing (On windows 8, vs2010) mbstowcs can actually damage original string, it works only with ANSI code page. If MultiByteToWideChar/WideCharToMultiByte can also cause string corruption - but they tends to replace characters which they don't know with '?' question marks, but mbstowcs tends to stop when it encounters unknown character and cut string at that very point. (I have tested Vietnamese characters on finnish windows).
So prefer Multi*-windows api function over analogue ansi C functions.
Also what I've noticed shortest way to encode string from one codepage to another is not use MultiByteToWideChar/WideCharToMultiByte api function calls but their analogue ATL macros: W2A / A2W.
So analogue function as mentioned above would sounds like:
wstring utf8toUtf16(const string & str)
{
USES_CONVERSION;
_acp = CP_UTF8;
return A2W( str.c_str() );
}
_acp is declared in USES_CONVERSION macro.
Or also function which I often miss when performing old data conversion to new one:
string ansi2utf8( const string& s )
{
USES_CONVERSION;
_acp = CP_ACP;
wchar_t* pw = A2W( s.c_str() );
_acp = CP_UTF8;
return W2A( pw );
}
But please notice that those macro's use heavily stack - don't use for loops or recursive loops for same function - after using W2A or A2W macro - better to return ASAP, so stack will be freed from temporary conversion.

std::string -> wchar_t[] with safe mbstowcs_s function:
auto ws = std::make_unique<wchar_t[]>(s.size() + 1);
mbstowcs_s(nullptr, ws.get(), s.size() + 1, s.c_str(), s.size());
This is from my sample code

use this code to convert your string to wstring
std::wstring string2wString(const std::string& s){
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
int main(){
std::wstring str="your string";
std::wstring wStr=string2wString(str);
return 0;
}

string s = "おはよう"; is an error.
You should use wstring directly:
wstring ws = L"おはよう";

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

libxml2 xmlChar * to std::wstring - c++

Related

What is the right way to convert UTF16 string to wchar_t on Mac?

InternetCanonicalizeUrl fails to decode diacritic letters

Call popen() on a command with Chinese characters on Mac

Compare std::wstring and std::string

C++ Convert string (or char) to wstring (or wchar_t)

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

libxml2 xmlChar * to std::wstring - c++

Related

What is the right way to convert UTF16 string to wchar_t on Mac?

InternetCanonicalizeUrl fails to decode diacritic letters

Call popen() on a command with Chinese characters on Mac

Compare std::wstring and std::string

C++ Convert string (or char*) to wstring (or wchar_t*)

Categories

Resources

C++ Convert string (or char) to wstring (or wchar_t)