One-file library to convert UTF-8 (char*) to wchar_t? - c++

I am using libjson, which is awesome. The only problem is that I need to convert a UTF-8 string (char*) to a wide-character string (wchar_t*). I googled and tried three different libraries, and they all failed (due to missing headers).
I don't need anything fancy, just a one-way conversion. How do I do this?

If you're on Windows (which chances are you are, given your need for wchar_t), use the MultiByteToWideChar function (declared in windows.h), like so:
int length = MultiByteToWideChar(CP_UTF8, 0, src, src_length, 0, 0);
wchar_t *output_buffer = new wchar_t [length + 1]; // +1 for the terminator
MultiByteToWideChar(CP_UTF8, 0, src, src_length, output_buffer, length);
output_buffer[length] = L'\0'; // not written for you unless src_length includes the null
Alternatively, if your input is a multibyte string in the current C locale rather than UTF-8 (improbable, but possible), use mbstowcs (stdlib.h). Call it once with a null destination to get the required length, then again to convert:
size_t length = mbstowcs(NULL, src, 0); // length in wchar_t's, excluding the terminator
wchar_t *output_buffer = NULL;
if (length != (size_t)-1) { // (size_t)-1 signals an invalid multibyte sequence
    output_buffer = new wchar_t[length + 1];
    mbstowcs(output_buffer, src, length + 1);
}
Hope this helps.

The code below successfully enables CreateDirectoryW() to write to C:\Users\ПетрКарасев. It is basically an easier-to-understand wrapper around the MultiByteToWideChar call mentioned in an earlier answer.
#include <stdexcept>
#include <string>
#include <windows.h>

std::wstring utf16_from_utf8(const std::string & utf8)
{
    // Special case of empty input string
    if (utf8.empty())
        return std::wstring();
    // Step 1: get the length (in wchar_t's) of the resulting UTF-16 string
    const int utf16_length = ::MultiByteToWideChar(
        CP_UTF8, // convert from UTF-8
        0, // default flags
        utf8.data(), // source UTF-8 string
        static_cast<int>(utf8.length()), // length (in chars) of source UTF-8 string
        NULL, // unused - no conversion done in this step
        0 // request size of destination buffer, in wchar_t's
    );
    if (utf16_length == 0)
    {
        // Error: a bare "throw;" with no active exception would call std::terminate,
        // so throw a real exception carrying the Win32 error code
        throw std::runtime_error("MultiByteToWideChar failed, error " + std::to_string(::GetLastError()));
    }
    // Step 2: allocate a properly sized destination buffer for the UTF-16 string
    std::wstring utf16;
    utf16.resize(utf16_length);
    // Step 3: do the actual conversion from UTF-8 to UTF-16
    if ( ! ::MultiByteToWideChar(
        CP_UTF8, // convert from UTF-8
        0, // default flags
        utf8.data(), // source UTF-8 string
        static_cast<int>(utf8.length()), // length (in chars) of source UTF-8 string
        &utf16[0], // destination buffer
        static_cast<int>(utf16.length()) // size of destination buffer, in wchar_t's
    ) )
    {
        // The conversion failed
        throw std::runtime_error("MultiByteToWideChar failed, error " + std::to_string(::GetLastError()));
    }
    return utf16; // success
}

Here is a piece of code I wrote. It seems to work well enough. It returns 0 on a UTF-8 error or when the value is > 0xFFFF (which cannot be held in a 16-bit wchar_t).
#include <string>
using namespace std;
wchar_t* utf8_to_wchar(const char*utf8){
wstring sz;
wchar_t c;
auto p=utf8;
while(*p!=0){
unsigned char v = static_cast<unsigned char>(*p); // plain char may be signed or unsigned
if(v < 0x80){
c = v;
sz+=c;
++p;
continue;
}
int shiftCount=0;
if((v&0xE0) == 0xC0){
shiftCount=1;
c = v&0x1F;
}
else if((v&0xF0) == 0xE0){
shiftCount=2;
c = v&0xF;
}
else
return 0;
++p;
while(shiftCount){
v = *p;
++p;
if((v&0xC0) != 0x80) return 0;
c<<=6;
c |= (v&0x3F);
--shiftCount;
}
sz+=c;
}
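// sz is destroyed on return, so hand back a heap copy; the caller must delete[] it
wchar_t* result = new wchar_t[sz.size() + 1];
sz.copy(result, sz.size());
result[sz.size()] = L'\0';
return result;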
}
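As a hedged sketch, the same decoding loop could be extended to four-byte sequences, so that code points above U+FFFF are kept rather than rejected. The name utf8_to_wstring is hypothetical and not part of any library. It reports failure through a bool and assumes surrogate pairs are wanted when wchar_t is 16 bits wide. For brevity it does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into a std::wstring.
// Returns false on malformed input. Overlong forms are not rejected.
bool utf8_to_wstring(const std::string& utf8, std::wstring& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        std::uint32_t cp;
        int extra; // number of continuation bytes still expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false; // stray continuation byte or invalid lead byte
        for (; extra > 0; --extra) {
            if (i >= utf8.size()) return false; // truncated sequence
            const unsigned char next = static_cast<unsigned char>(utf8[i++]);
            if ((next & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (next & 0x3F);
        }
        if (cp > 0x10FFFF) return false;
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (e.g. Windows): emit a UTF-16 surrogate pair
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return true;
}
```

Returning a std::wstring by reference avoids handing out a pointer whose owner has already been destroyed.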

The following (untested) code shows how to convert a multibyte string in your current locale into a wide string. So if your current locale is UTF-8, then this will suit your needs.
const char * inputStr = ... // your UTF-8 input
size_t maxSize = strlen(inputStr) + 1;
wchar_t * outputWStr = new wchar_t[maxSize];
size_t result = mbstowcs(outputWStr, inputStr, maxSize);
if (result == (size_t)-1) {
cerr << "Invalid multibyte characters in input";
}
You can use setlocale() to set your locale.
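A minimal sketch of the locale-based approach with exact buffer sizing follows. The helper name locale_to_wide is hypothetical. It assumes the platform lets mbstowcs() take a null destination to report the required length, which POSIX systems and MSVC do. The caller is expected to have selected a UTF-8 locale first, for example with std::setlocale(LC_ALL, "en_US.UTF-8"). That locale name is an assumption and varies by system; setlocale() returns a null pointer when the locale is unavailable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string in the *current* C locale.
// Returns false if the input contains an invalid multibyte sequence.
bool locale_to_wide(const char* input, std::wstring& out)
{
    // First call: a null destination asks only for the required length.
    const std::size_t needed = std::mbstowcs(NULL, input, 0);
    if (needed == static_cast<std::size_t>(-1))
        return false;
    std::vector<wchar_t> buffer(needed + 1); // +1 for the terminator
    std::mbstowcs(buffer.data(), input, buffer.size());
    out.assign(buffer.data(), needed);
    return true;
}
```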

Related

How to convert std::string to wchar_t*

std::regex regexpy("y:(.+?)\"");
std::smatch my;
regex_search(value.text, my, regexpy);
y = my[1];
std::wstring wide_string = std::wstring(y.begin(), y.end());
const wchar_t* p_my_string = wide_string.c_str();
wchar_t* my_string = const_cast<wchar_t*>(p_my_string);
URLDownloadToFile(my_string, aDest);
I'm building with Unicode, and the source string is ASCII. URLDownloadToFile expands to URLDownloadToFileW (which takes wchar_t*). The code above compiles in debug mode, but with a lot of warnings like:
warning C4244: 'argument': conversion from 'wchar_t' to 'const _Elem', possible loss of data
So my question is: how can I convert a std::string to a wchar_t*?
First off, you don't need the const_cast, as URLDownloadToFileW() takes a const wchar_t* as input, so passing it wide_string.c_str() will work as-is:
URLDownloadToFile(..., wide_string.c_str(), ...);
That being said, you are constructing a std::wstring with the individual char values of a std::string as-is. That will work without data loss only for ASCII characters <= 127, which have the same numeric values in both ASCII and Unicode. For non-ASCII characters, you need to actually convert the char data to Unicode, such as with MultiByteToWideChar() (or equivalent), e.g.:
std::wstring to_wstring(const std::string &s)
{
std::wstring wide_string;
// NOTE: be sure to specify the correct codepage that the
// std::string data is actually encoded in...
int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), static_cast<int>(s.size()), NULL, 0);
if (len > 0) {
wide_string.resize(len);
MultiByteToWideChar(CP_ACP, 0, s.c_str(), static_cast<int>(s.size()), &wide_string[0], len);
}
return wide_string;
}
URLDownloadToFileW(..., to_wstring(y).c_str(), ...);
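To make the ASCII-only caveat concrete, here is a small sketch that widens by plain value copy only when every byte is 7-bit ASCII. The names is_ascii and widen_ascii are hypothetical helpers introduced for illustration, not part of any library.

```cpp
#include <string>

// Hypothetical helper: true if every byte is 7-bit ASCII, the only case in
// which std::wstring(s.begin(), s.end()) preserves the text.
bool is_ascii(const std::string& s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) > 127)
            return false;
    return true;
}

// Widens by plain value copy, but only when that is known to be safe.
bool widen_ascii(const std::string& s, std::wstring& out)
{
    if (!is_ascii(s))
        return false; // needs a real conversion such as MultiByteToWideChar()
    out.assign(s.begin(), s.end());
    return true;
}
```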
That being said, there is a simpler solution. If the std::string is encoded in the user's default locale, you can simply call URLDownloadToFileA() instead, passing it the original std::string as-is, and let the OS handle the conversion for you, eg:
URLDownloadToFileA(..., y.c_str(), ...);
There is a cross-platform solution. You can use std::mbtowc.
std::wstring convert_mb_to_wc(std::string s) {
std::wstring out;
std::mbtowc(nullptr, 0, 0);
int offset;
size_t index = 0;
for (wchar_t wc;
(offset = std::mbtowc(&wc, &s[index], s.size() - index)) > 0;
index += offset) {
out.push_back(wc);
}
return out;
}
Adapted from an example on cppreference.com at https://en.cppreference.com/w/cpp/string/multibyte/mbtowc .
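The loop above stops silently at the first invalid sequence. As a hedged alternative, the sketch below uses std::mbrtowc, which keeps its conversion state in a caller-owned std::mbstate_t, and reports errors to the caller. The function name convert_mb_to_wc_checked is hypothetical. Like the original, it converts according to the current C locale.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical variant using std::mbrtowc, which keeps its shift state in a
// caller-owned std::mbstate_t instead of hidden global state.
// Returns false on an invalid or truncated multibyte sequence.
bool convert_mb_to_wc_checked(const std::string& s, std::wstring& out)
{
    out.clear();
    std::mbstate_t state = std::mbstate_t();
    const char* p = s.data();
    std::size_t remaining = s.size();
    while (remaining > 0) {
        wchar_t wc;
        const std::size_t n = std::mbrtowc(&wc, p, remaining, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            return false; // invalid or incomplete sequence
        const std::size_t used = (n == 0) ? 1 : n; // 0 means an embedded '\0'
        out.push_back(wc);
        p += used;
        remaining -= used;
    }
    return true;
}
```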

How to get the name of a Unicode character?

I think I saw this a long time ago; a way to get a string containing the name of a unicode character by using Win32 API calls. I'm using C++ Builder so if there is support for it in the VCL library that would work fine too.
For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".
Or if there is some other way to get the same result from Windows with C or C++.
The worst case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).
You can use the undocumented GetUName function from getuname.dll:
std::string GetUnicodeCharacterName(wchar_t character)
{
// https://github.com/reactos/reactos/tree/master/dll/win32/getuname
typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));
if (!pfnGetUName)
return {};
std::array<WCHAR, 256> buffer;
int length = pfnGetUName(character, buffer.data());
return utf8::narrow(buffer.data(), length);
}
// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
if (!std::iswgraph(character))
{
if (character <= 0x21)
character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
else
character = 0xFFFD; // REPLACEMENT CHARACTER
}
return character;
}
// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
// UTF-8 <=> UTF-32 converter
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;
// UTF-8 to UTF-32
std::u32string utf32string = utf32conv.from_bytes(string);
std::string characterNames;
characterNames.reserve(35 * utf32string.size());
for (const char32_t& codePoint : utf32string)
{
if (!characterNames.empty())
characterNames.append(", ");
char32_t visibleCodePoint = (codePoint < 0xFFFF) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
std::string charName = (codePoint < 0xFFFF) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";
// UTF-32 to UTF-8
std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
}
return characterNames;
}
The downside is that it only covers characters from the Unicode Basic Multilingual Plane (BMP).
Update: You can use the u_charName() ICU API, which has shipped with Windows since the Fall Creators Update (Version 1709 Build 16299):
std::string GetUCharNameWrapper(char32_t codePoint)
{
typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));
if (!pfnU_charName)
return {};
int errorCode = 0;
std::array<char, 512> buffer;
int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);
if (errorCode != 0)
return {};
return std::string(buffer.data(), length);
}
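For the fallback the question mentions (a lookup table restricted to the characters of interest), a minimal sketch might look like the following. The names CharName, kNames and LookupUnicodeName are hypothetical. The handful of entries are standard Unicode character names; a real table would more likely be generated from the Unicode Character Database than typed by hand.

```cpp
#include <cstddef>
#include <string>

struct CharName { unsigned long codePoint; const char* name; };

// A few sample entries; extend with whatever characters are of interest.
static const CharName kNames[] = {
    { 0x0021, "EXCLAMATION MARK" },
    { 0x0041, "LATIN CAPITAL LETTER A" },
    { 0x0061, "LATIN SMALL LETTER A" },
    { 0x00DF, "LATIN SMALL LETTER SHARP S" },
    { 0x00E8, "LATIN SMALL LETTER E WITH GRAVE" },
};

// Hypothetical fallback: returns an empty string for unknown code points.
std::string LookupUnicodeName(unsigned long codePoint)
{
    for (std::size_t i = 0; i < sizeof(kNames) / sizeof(kNames[0]); ++i)
        if (kNames[i].codePoint == codePoint)
            return kNames[i].name;
    return std::string();
}
```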

C++: socket encoding (working with TeamSpeak)

As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users, which I'm doing with sockets. That's working fine so far. In my UI I'm displaying all clients in a ListBox, which is basically working. Nevertheless, I'm having problems with incorrectly displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
i = 0;
queryString.str("");
queryString.clear();
queryString << clientlist << " \n";
send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
TeamSpeak::getAnswer(1);
while(p_1 != -1){
p_1 = lastLog.find(L"client_nickname=", sPos + 1);
if(p_1 != -1){
sPos = p_1;
p_2 = lastLog.find(L" ", p_1);
temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
users[i].assign(temporary.begin(), temporary.end());
SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
i++;
}
else{
sPos = 0;
p_1 = 0;
break;
}
}
TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), and none of them shows an encoding problem with characters or symbols (for example Andrè). If I add a string directly, SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè"), it is displayed correctly in the ListBox. What might be the issue here? Is it a problem with my code or something else?
Update 1: I recently continued working on this problem and looked at the word Olè! as received from the socket. The result I got is the following: O (79) | l (108) | � (-61) | � (-88) | ! (33). How can I convert this char array to a wstring containing the correct characters?
Solution: As @isanae mentioned in his answer, the std::wstring_convert template did the trick for me. Thank you very much!
Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
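If breaking into the debugger is inconvenient, the same byte-level check can be done by dumping the buffer as hex. This is a small sketch; hex_dump is a hypothetical helper name, not an existing API.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical debugging helper: renders raw memory as space-separated hex bytes.
std::string hex_dump(const void* data, std::size_t size)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char field[4];
    for (std::size_t i = 0; i < size; ++i) {
        std::snprintf(field, sizeof field, "%02X", static_cast<unsigned>(bytes[i]));
        if (i != 0)
            out += ' ';
        out += field;
    }
    return out;
}
```

For a std::wstring, pass users[i].data() and users[i].size() * sizeof(wchar_t). Little-endian UTF-16 text made of ASCII characters then shows a 00 after every byte, whereas UTF-8 bytes that were widened one by one show values such as C3 00 A8 00.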
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to the same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.to_bytes(s);
}
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
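// MultiByteToWideChar() fails on zero-length input, so handle that up front
if (s.empty()) return std::wstring();
// getting the required size in characters (not bytes) of the
// output buffer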
const int size = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
nullptr, 0);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);
// converting from utf8 to utf16
const int written = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
buffer.get(), size);
// error handling
assert(written != 0);
return std::wstring(buffer.get(), buffer.get() + written);
}
std::string utf16_to_utf8(const std::wstring& ws)
{
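// WideCharToMultiByte() fails on zero-length input, so handle that up front
if (ws.empty()) return std::string();
// getting the required size in bytes of the output buffer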
const int size = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
nullptr, 0, nullptr, nullptr);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<char[]> buffer(new char[size]);
// converting from utf16 to utf8
const int written = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
buffer.get(), size, nullptr, nullptr);
// error handling
assert(written != 0);
return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);
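To see why the bytes -61 and -88 are a single UTF-8 character, here is the bit arithmetic for a two-byte sequence as a small sketch. The function name decode_two_byte_utf8 is hypothetical, and it deliberately skips validation.

```cpp
#include <cstdint>

// Combines a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
// Hypothetical helper for illustration; it does not validate its input.
std::uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char continuation)
{
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (continuation & 0x3F);
}
```

Reinterpreted as unsigned, -61 is 0xC3 and -88 is 0xA8. The lead byte contributes 0x03 << 6 = 0xC0, the continuation byte contributes 0x28, and together they give U+00E8, which is è.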

What is the right way to convert UTF16 string to wchar_t on Mac?

In the project that still uses XCode 3 (no C++11 features like codecvt)
Use a conversion library, like libiconv. You can set its input encoding to "UTF-16LE" or "UTF-16BE" as needed, and set its output encoding to "wchar_t" rather than any specific charset.
#include <iconv.h>
uint16_t *utf16 = ...; // input data
size_t utf16len = ...; // in bytes
wchar_t *outbuf = ...; // allocate an initial buffer
size_t outbuflen = ...; // in bytes
char *inptr = (char*) utf16;
char *outptr = (char*) outbuf;
iconv_t cvt = iconv_open("wchar_t", "UTF-16LE");
while (utf16len > 0)
{
if (iconv(cvt, &inptr, &utf16len, &outptr, &outbuflen) == (size_t)(-1))
{
if (errno == E2BIG)
{
// resize outbuf to a larger size and
// update outptr and outbuflen accordingly...
}
else
break; // conversion failure
}
}
iconv_close(cvt);
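If pulling in a library is not an option, the conversion can also be written by hand, since UTF-16 only needs surrogate pairs combined. The sketch below is written without C++11 features. The name utf16_to_code_points is hypothetical, and it assumes the input is already in native byte order.

```cpp
#include <cstddef>
#include <stdint.h>
#include <vector>

// Hypothetical helper: decodes native-endian UTF-16 code units into code points.
// Unpaired surrogates are replaced with U+FFFD.
std::vector<uint32_t> utf16_to_code_points(const uint16_t* units, std::size_t count)
{
    std::vector<uint32_t> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const uint32_t unit = units[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < count &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // high surrogate followed by low surrogate: combine the pair
            const uint32_t low = units[++i];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(0xFFFD); // lone surrogate
        } else {
            out.push_back(unit);
        }
    }
    return out;
}
```

On macOS wchar_t is 32 bits wide, so each resulting code point fits in one wchar_t and the vector can be copied into a wide string with std::wstring(cp.begin(), cp.end()).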
Why do you want wchar_t on a Mac? wchar_t is not necessarily 16 bits wide, so it is not very useful there.
I suggest converting to an NSString using:
char* payload; // point to string with UTF16 encoding
NSString* s = [NSString stringWithCString:payload encoding: NSUTF16LittleEndianStringEncoding];
To convert the NSString back to UTF-16:
const char* payload = [s cStringUsingEncoding:NSUTF16LittleEndianStringEncoding];
Note that macOS supports NSUTF16BigEndianStringEncoding as well.
Note 2: Although const char* is used, the data is encoded as UTF-16, so don't pass it to strlen().
I would go the safest route:
1. Get the UTF-16 string as a UTF-8 string (using NSString).
2. Set the locale to UTF-8.
3. Use mbstowcs() to convert the UTF-8 multibyte string to a wchar_t string.
At each step the string value is preserved.

Why is the following C++ code printing only the first character?

I am trying to convert a char string to a wchar_t string.
In more detail: I am trying to convert a char[] to a wchar_t[] first, then append " 1" to that string and print it.
char src[256] = "c:\\user";
wchar_t temp_src[256];
mbtowc(temp_src, src, 256);
wchar_t path[256];
StringCbPrintf(path, 256, _T("%s 1"), temp_src);
wcout << path;
But it prints just c.
Is this the right way to convert from char to wchar_t? I have since learned of another way, but I'd like to know why the above code behaves the way it does.
mbtowc converts only a single character. Did you mean to use mbstowcs?
Typically you call this function twice; the first to obtain the required buffer size, and the second to actually convert it:
#include <cstdlib> // for mbstowcs
const char* mbs = "c:\\user";
size_t requiredSize = ::mbstowcs(NULL, mbs, 0);
wchar_t* wcs = new wchar_t[requiredSize + 1];
if(::mbstowcs(wcs, mbs, requiredSize + 1) != (size_t)(-1))
{
// Do what's needed with the wcs string
}
delete[] wcs;
If you would rather use mbstowcs_s (because of deprecation warnings), then do this:
#include <cstdlib> // also for mbstowcs_s
const char* mbs = "c:\\user";
size_t requiredSize = 0;
::mbstowcs_s(&requiredSize, NULL, 0, mbs, 0);
wchar_t* wcs = new wchar_t[requiredSize + 1];
::mbstowcs_s(&requiredSize, wcs, requiredSize + 1, mbs, requiredSize);
if(requiredSize != 0)
{
// Do what's needed with the wcs string
}
delete[] wcs;
Make sure you take care of locale issues via setlocale(), or use the versions of mbstowcs() that take a locale argument (such as mbstowcs_l() or, on MSVC, _mbstowcs_s_l()).
Why are you using C code rather than writing it in a more portable way? For example, here I would use the STL (note that copying chars into a wstring like this is only correct for ASCII text):
std::string src = std::string("C:\\user") +
std::string(" 1");
std::wstring dne = std::wstring(src.begin(), src.end());
wcout << dne;
it's so simple it's easy :D
L"Hello World"
The prefix L in front of a string literal makes it a wide-character string.