C++: socket encoding (working with TeamSpeak) - c++

As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users which I'm doing with sockets - that's working fine so far.In my UI I'm displaying all clients in a ListBox which is basically working. Nevertheless I'm having problems with wrong displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
i = 0;
queryString.str("");
queryString.clear();
queryString << clientlist << " \n";
send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
TeamSpeak::getAnswer(1);
while(p_1 != -1){
p_1 = lastLog.find(L"client_nickname=", sPos + 1);
if(p_1 != -1){
sPos = p_1;
p_2 = lastLog.find(L" ", p_1);
temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
users[i].assign(temporary.begin(), temporary.end());
SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
i++;
}
else{
sPos = 0;
p_1 = 0;
break;
}
}
TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), but all of them have no encoding problem with characters or symbols (for example Andrè). If I add a string directly:SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè", it is displayed correctly in the ListBox.What might be the issue here, is it a problem with my code or something else?
Update 1:I recently continued working on this problem and considered the word Olè! receiving it from the socket. The result I got, is the following:O (79) | l (108) | � (-61) | � (-88) | ! (33).How can I convert this char array to a wstring containing the correct characters?
Solution: As #isanae mentioned in his post, the std::wstring_convert-template did the trick for me, thank you very much!

Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.to_bytes(s);
}
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
// getting the required size in characters (not bytes) of the
// output buffer
const int size = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
nullptr, 0);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);
// converting from utf8 to utf16
const int written = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
buffer.get(), size);
// error handling
assert(written != 0);
return std::wstring(buffer.get(), buffer.get() + written);
}
std::string utf16_to_utf8(const std::wstring& ws)
{
// getting the required size in bytes of the output buffer
const int size = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
nullptr, 0, nullptr, nullptr);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<char[]> buffer(new char[size]);
// converting from utf16 to utf8
const int written = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
buffer.get(), size, nullptr, nullptr);
// error handling
assert(written != 0);
return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);

Related

On Windows, stat and GetFileAttributes fail for paths containing strange characters

The code below demonstrates how stat and GetFileAttributes fail when the path contains some strange (but valid) ASCII characters.
As a workaround, I would use the 8.3 DOS file name. But this does not work when the drive has 8.3 names disabled.
(8.3 names are disabled with the fsutil command: fsutil behavior set disable8dot3 1).
Is it possible to get stat and/or GetFileAttributes to work in this case?
If not, is there another way of determining whether or not a path is a directory or file?
#include "stdafx.h"
#include <sys/stat.h>
#include <string>
#include <Windows.h>
#include <atlpath.h>
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
// The final characters in the path below are 0xc3 (Ã) and 0x3f (?).
// Create a test directory with the name à and set TEST_DIR below to your test directory.
const char* TEST_DIR = "D:\\tmp\\VisualStudio\\TestProject\\ConsoleApplication1\\test_data\\Ã";
int main()
{
std::string testDir = TEST_DIR;
// test stat and _wstat
struct stat st;
const auto statSucceeded = stat(testDir.c_str(), &st) == 0;
if (!statSucceeded)
{
printf("stat failed\n");
}
std::wstring testDirW = s2ws(testDir);
struct _stat64i32 stW;
const auto statSucceededW = _wstat(testDirW.data(), &stW) == 0;
if (!statSucceededW)
{
printf("_wstat failed\n");
}
// test PathIsDirectory
const auto isDir = PathIsDirectory(testDirW.c_str()) != 0;
if (!isDir)
{
printf("PathIsDirectory failed\n");
}
// test GetFileAttributes
const auto fileAttributes = ::GetFileAttributes(testDirW.c_str());
const auto getFileAttributesWSucceeded = fileAttributes != INVALID_FILE_ATTRIBUTES;
if (!getFileAttributesWSucceeded)
{
printf("GetFileAttributes failed\n");
}
return 0;
}
The problem you have encountered comes from using the MultiByteToWideChar function. Using CP_ACP can default to a code page that does not support some characters. If you change the default system code page to UTF8, your code will work. Since you cannot tell your clients what code page to use, you can use a third party library such as International Components for Unicode to convert from the host code page to UTF16.
I ran your code using console code page 65001 and VS2015 and your code worked as written. I also added positive printfs to verify that it did work.
Don't start with a narrow string literal and try to convert it, start with a wide string literal - one that represents the actual filename. You can use hexadecimal escape sequences to avoid any dependency on the encoding of the source code.
If the actual code doesn't use string literals, the best resolution depends on the situation; for example, if the file name is being read from a file, you need to make sure that you know what encoding the file is in and perform the conversion accordingly.
If the actual code reads the filename from the command line arguments, you can use wmain() instead of main() to get the arguments as wide strings.

Get the user's codepage name for functions in boost::locale::conv

The task at hand
I'm parsing a filename from an UTF-8 encoded XML on Windows. I need to pass that filename on to a function that I can't change. Internally it uses _fsopen() which does not support Unicode strings.
Current approach
My current approach is to convert the filename to the user's charset hoping that the filename is representable in that encoding. I'm then using boost::locale::conv::from_utf() to convert from UTF-8 and I'm using boost::locale::util::get_system_locale() to get the name of the current locale.
Life is good?
I'm on a German system using code page Windows-1252 thus get_system_locale() correctly yields de_DE.windows-1252. If I test the approach with a filename containing an umlaut everything works as expected.
The Problem
Just to make sure I switched my system locale to Ukrainian which uses code page Windows-1251. Using some Cyrillic letter in the filename my approach fails. The reason is that get_system_locale() still yields de_DE.windows-1252 which is now incorrect.
On the other side GetACP() correctly yields 1252 for the German locale and 1251 for the Ukrainian locale. I also know that Boost.Locale can convert to a given locale as this small test program works as I expect:
#include <boost/locale.hpp>
#include <iostream>
#include <string>
#include <windows.h>
int main()
{
std::cout << "Codepage: " << GetACP() << std::endl;
std::cout << "Boost.Locale: " << boost::locale::util::get_system_locale() << std::endl;
namespace blc = boost::locale::conv;
// Cyrillic small letter zhe -> \xe6 (ш on 1251, æ on 1252)
std::string const test1251 = blc::from_utf(std::string("\xd0\xb6"), "windows-1251");
std::cout << "1251: " << static_cast<int>(test1251.front()) << std::endl;
// Latin small letter sharp s -> \xdf (Я on 1251, ß on 1252)
auto const test1252 = blc::from_utf(std::string("\xc3\x9f"), "windows-1252");
std::cout << "1252: " << static_cast<int>(test1252.front()) << std::endl;
}
Questions
How can I query the name of the user locale in a format Boost.Locale supports? Using std::locale("").name() yields German_Germany.1252, using it results in a boost::locale::conv::invalid_charset_error exception.
Is it possible that the system locale remains de_DE.windows-1252 although I'm supposedly changing it as local admin? Similarly system language is German although my account's language is English. (Log in screen is German until I log in)
should I stick with using short filenames? Does not seem to work reliably though.
Fine-print
Compiler is MSVC18
Boost is version 1.56.0, backend supposedly winapi
System is Win7, system language is German, user language English
ANSI is deprecated so don't bother with it.
Windows uses UTF16, you must convert from UTF8 to UTF16 using MultiByteToWideChar. This conversion is safe.
std::wstring getU16(const std::string &str)
{
if (str.empty()) return std::wstring();
int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), 0, 0);
std::wstring res(sz, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
return res;
}
You then use _wfsopen (from the link you provided) to open file with UTF16 name.
int main()
{
//UTF8 source:
std::string filename_u8;
//This line works in VS2015 only
//For older version comment out the next line, obtain UTF8 from another source
filename_u8 = u8"c:\\test\\__ελληνικά.txt";
//convert to UTF16
std::wstring filename_utf16 = getU16(filename_u8);
FILE *file = NULL;
_wfopen_s(&file, filename_utf16.c_str(), L"w");
if (file)
{
//Add BOM, optional...
//Write the file name in to file, for testing...
fwrite(filename_u8.data(), 1, filename_u8.length(), file);
fclose(file);
}
else
{
cout << "access denined, or folder doesn't exits...
}
return 0;
}
Edit, getting ANSI from UTF8, using GetACP()
std::wstring string_to_wstring(const std::string &str, int codepage)
{
if (str.empty()) return std::wstring();
int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
std::wstring res(sz, 0);
MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
return res;
}
std::string wstring_to_string(const std::wstring &wstr, int codepage)
{
if (wstr.empty()) return std::string();
int sz = WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), 0, 0, 0, 0);
std::string res(sz, 0);
WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), &res[0], sz, 0, 0);
return res;
}
std::string get_ansi_from_utf8(const std::string &utf8, int codepage)
{
std::wstring utf16 = string_to_wstring(utf8, CP_UTF8);
std::string ansi = wstring_to_string(utf16, codepage);
return ansi;
}
Barmak's way is the best way to do it.
To clear up the locale stuff, the process always starts with the "C" locale. You can use the setlocale function to set the locale to the system default or any arbitrary locale.
#include <clocale>
// Get the current locale
setlocale(LC_ALL,NULL);
// Set locale to system default
setlocale(LC_ALL,"");
// Set locale to German
setlocale(LC_ALL,"de-DE");

Store non-English string in std::string

I have a simple string in std::wstring
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
I want to store this string in a std::string.
I have tried the below code but the result is not the same as input string
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
//setup converter
typedef std::codecvt_utf8_utf16 <wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
//use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( tempStr );
The Korean string present in the input string is converted to "ìžë™".
Is there any way I can get the same string in std::string?
Expected result:
converted_str should contain F:\Projects\Current_자동_\Cam.xml
Below is an screenshot of debugging showing 3 values in 3 scenarios (conversion in 3 ways). But none of them gives the desired value.
Your conversion code is fine.
In fact, in UTF-8 (the string you store in std::string), the characters 자동 corresponds to:
자 (UTF-16 0xC790) ---> UTF-8: EC 9E 90
동 (UTF-16 0xB3D9) ---> UTF-8: EB 8F 99
If you run the following program, which just prints the converted UTF-8 bytes, you get this output:
ec 9e 90 eb 8f 99
#include <iomanip> // For std::hex
#include <iostream> // For console output
#include <string> // For STL strings
#include <codecvt> // For Unicode conversions
void print_char_hex(const char ch)
{
auto * p = reinterpret_cast<const unsigned char*>(&ch);
int i = *p;
std::cout << std::hex << i << ' ';
}
int main()
{
std::wstring utf16_str = L"\xC790\xB3D9";
// setup converter
typedef std::codecvt_utf8_utf16<wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
// use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( utf16_str );
// Output the converted bytes (UTF-8)
for (size_t i = 0; i < converted_str.length(); ++i)
{
print_char_hex(converted_str[i]);
}
std::cout << std::endl;
}
I think the best solution would be to use the wide-char APIs to open the file, e.g. CreateFileW(...);, because then you can use the wide-char file name directly.
If this is not possible, maybe the string should not be converted to UTF8, but to the system default ANSI code page.
I think this might work:
char out[200];
wchar_t * in = L"F:\\Projects\\Current_자동_\\Cam.xml";
WideCharToMultiByte(CP_ACP, 0, in, 100, out, 100, 0, 0);
or maybe another Korean code page:
WideCharToMultiByte(949, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(1361, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(10003, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(20833, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(20949, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(50225, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(50933, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(51949, 0, in, 100, out, 100, 0, 0);
The code page ids can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
good luck :-)
This works.. You can tell because the conversion back to UTF16 is valid.. If you write the UTF8 string to a file, it will also display properly. This way, you now have two ways of validating that it works.
// UTF16ToUTF8.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <iostream>
#include <codecvt>
std::wstring ToUTF16(const std::string &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}
std::string ToUTF8(const std::wstring &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}
int _tmain(int argc, _TCHAR* argv[])
{
std::wstring u16 = L"_자동_";
std::string u8 = ToUTF8(u16);
MessageBoxW(NULL, ToUTF16(u8).c_str(), L"", 0);
std::cin.get();
return 0;
}
You can store UTF-8 in std:string as regular char sequence. Here's library with some useful things, such as length() and everything about indexing, you may want to have http://utfcpp.sourceforge.net/.
For windows console you need to set codepage to 65001 and will become UTF-8.
Sadly or not, std::wstring and whole wchar_t thing doesn't specify any specific encoding.
By the way, you're using Managed C++, why wouldn't you use .NET Framework's System::String^? There's no problems with encodings at all. http://msdn.microsoft.com/ru-ru/library/system.string(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp
The problem is not in your string conversion code. This is a typical source file encoding problem. Visual studio does not use Unicode as default so you should convert your source file's encoding to UTF-8 yourself. To make this conversion you can open your file with notepad++ and click Encoding->Convert to UTF-8
Note1: In VS2010 and vs2012 if you write non-ascii characters to a source file visual studio now warns you and offers to make this conversion.
Note2: From your use of macro _T() I predict this is targeted only to Windows. If you try to build UTF-8 encoded source files that contains BOM with gcc you may get different errors. In any case the best approach would be to read your UTF-8 encoded text data from a file during run-time.

How to set file encoding format to UTF8 in C++

A requirement for my software is that the encoding of a file which contains exported data shall be UTF8. But when I write the data to the file the encoding is always ANSI. (I use Notepad++ to check this.)
What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.
line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*
// ...
if( inputFile.is_open() )
{
while( inputFile.good() )
{
getline(inputFile,line);
//1
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
wchar_t *pwcharText;
pwcharText = new wchar_t[ dwCount];
//2
MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );
//3
dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
char *pText;
pText = new char[ dwCount ];
//4
WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );
fprintf(pOutputFile,pText);
fprintf(pOutputFile,"\n");
delete[] pwcharText;
delete[] pText;
}
}
// ...
Unfortunately the encoding is still ANSI. I searched a while for a solution but I always encounter the solution via MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?
I also looked here on SO for a solution but most UTF8 questions deal with C# and php stuff.
On Windows in VC++2010 it is possible (not yet implemented in GCC, as far as i know) using localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code from cppreference.com has all basic information you would need to read/write UTF-8 file.
std::wstring wFromFile = _T("𤭢teststring");
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;
It sets the ANSI encoded file to UTF-8 (checked in Notepad). Hope this is what you need.
On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, eg:
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );
if (dwCount == 0) continue;
std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;
std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");
The type char has no clue of any encoding, all it can do is store 8 bits. Therefore any text file is just a sequence of bytes and the user must guess the underlying encoding. A file starting with a BOM indicates UTF 8, but using a BOM is not recommended any more. The type wchar_t in contrast is in Windows always interpreted as UTF 16.
So let's say you have a file encoded in UTF 8 with just one line: "Confucius says: Smile. 孔子说:微笑!😊." The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk because it assumes my local codepage 1252 for the char* string.
Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful, the CP_Whatever argument is optional and if omitted the local codepage is used.
#include <iostream>
#include <fstream>
#include <filesystem>
#include <atlbase.h>
int main(int argc, char** argv)
{
std::fstream afile;
std::string line1A = u8"Confucius says: Smile. 孔子说:微笑! 😊";
std::wstring line1W;
afile.open("Test.txt", std::ios::out | std::ios::app);
if (!afile.is_open())
return 0;
afile << "\n" << line1A;
afile.close();
afile.open("Test.txt", std::ios::in);
std::getline(afile, line1A);
line1W = CA2W(line1A.c_str(), CP_UTF8);
MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
afile.close();
return 0;
}

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program with set_locale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
return s;
}
// Real worker
std::wstring get_wstring(const std::string & s)
{
const char * cs = s.c_str();
const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
std::vector<wchar_t> buf(wn + 1);
const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
assert(cs == NULL); // successful conversion
return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
return s;
}
// Real worker
std::string get_locale_string(const std::wstring & s)
{
const wchar_t * cs = s.c_str();
const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
std::vector<char> buf(wn + 1);
const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
assert(cs == NULL); // successful conversion
return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use the mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1);, wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
std::wstring target(source.size()+1, L' ');
std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
target.resize(newLength);
return target;
}
I'm not entirely sure that that usage of &target[0] is entirely standard-conforming, if someone has a good answer to that please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string - a logical assumption that still I'm not sure it's covered by the standard.
On the other hand, it seems that there's no way to ask to mbstowcs the size of the needed buffer, so either you go this way, or go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground, two equivalent strings may be evaluated different when compared bitwise.
Long story short: this should work, and I think it's the maximum you can do with just the standard library, but it's a lot implementation-dependent in how Unicode is handled, and I wouldn't trust it a lot. In general, it's just better to stick with an encoding inside your application and avoid this kind of conversions unless absolutely necessary, and, if you are working with definite encodings, use APIs that are less implementation-dependent.
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
wstring ConvertToUnicode(const string & str)
{
UINT codePage = CP_ACP;
DWORD flags = 0;
int resultSize = MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, NULL // lpWideCharStr
, 0 // cchWideChar
);
vector<wchar_t> result(resultSize + 1);
MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, &result[0] // lpWideCharStr
, resultSize // cchWideChar
);
return &result[0];
}