How do I convert a std::wstring to a std::string? I have the following example:
#include <string>
#include <iostream>
int main()
{
    std::wstring ws = L"Hello";
    std::string s( ws.begin(), ws.end() );

    //std::cout <<"std::string = "<<s<<std::endl;
    std::wcout<<"std::wstring = "<<ws<<std::endl;
    std::cout <<"std::string = "<<s<<std::endl;
}
With the commented line included (uncommented), the output is:
std::string = Hello
std::wstring = Hello
std::string = Hello
but with it commented out, the output is only:
std::wstring = Hello
Is anything wrong in this example? Can I do the conversion as shown above?
EDIT
A new example (taking some of the answers into account) is:
#include <string>
#include <iostream>
#include <sstream>
#include <locale>
int main()
{
    setlocale(LC_CTYPE, "");

    const std::wstring ws = L"Hello";
    const std::string s( ws.begin(), ws.end() );

    std::cout<<"std::string = "<<s<<std::endl;
    std::wcout<<"std::wstring = "<<ws<<std::endl;

    std::stringstream ss;
    ss << ws.c_str();
    std::cout<<"std::stringstream = "<<ss.str()<<std::endl;
}
The output is :
std::string = Hello
std::wstring = Hello
std::stringstream = 0x860283c
so a stringstream cannot be used this way to convert a wstring into a string: inserting ws.c_str() into a narrow stream prints the pointer value, not the text.
As Cubbi pointed out in one of the comments, std::wstring_convert (C++11) provides a neat simple solution (you need to #include <locale> and <codecvt>):
std::wstring string_to_convert;
//setup converter
using convert_type = std::codecvt_utf8<wchar_t>;
std::wstring_convert<convert_type, wchar_t> converter;
//use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( string_to_convert );
I was using a combination of wcstombs and tedious allocation/deallocation of memory before I came across this.
http://en.cppreference.com/w/cpp/locale/wstring_convert
update(2013.11.28)
A one-liner can be written as follows (thank you Guss for your comment):
std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some string");
Wrapper functions can be written as follows (thank you ArmanSchwarz for your comment):
std::wstring s2ws(const std::string& str)
{
    using convert_typeX = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_typeX, wchar_t> converterX;

    return converterX.from_bytes(str);
}

std::string ws2s(const std::wstring& wstr)
{
    using convert_typeX = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_typeX, wchar_t> converterX;

    return converterX.to_bytes(wstr);
}
Note: there's some controversy over whether string/wstring should be passed to functions by reference or by value (due to C++11 and compiler updates). I'll leave that decision to the implementer, but it's worth knowing about.
Note: I'm using std::codecvt_utf8 in the above code, but if you're not using UTF-8 you'll need to change that to the appropriate encoding you're using:
http://en.cppreference.com/w/cpp/header/codecvt
An older solution from: http://forums.devshed.com/c-programming-42/wstring-to-string-444006.html
std::wstring wide( L"Wide" );
std::string str( wide.begin(), wide.end() );
// Will print no problemo!
std::cout << str << std::endl;
Update (2021): However, at least on more recent versions of MSVC, this may generate a wchar_t-to-char truncation warning. The warning can be silenced by using std::transform (from <algorithm>) instead, with an explicit conversion in the transformation function, e.g.:
std::wstring wide( L"Wide" );
std::string str;
std::transform(wide.begin(), wide.end(), std::back_inserter(str), [] (wchar_t c) {
    return (char)c;
});
Or if you prefer to preallocate and not use back_inserter:
std::string str(wide.length(), 0);
std::transform(wide.begin(), wide.end(), str.begin(), [] (wchar_t c) {
    return (char)c;
});
See example on various compilers here.
Beware that there is no character-set conversion going on here at all. What this does is simply assign each iterated wchar_t to a char, a truncating conversion. It uses this std::string constructor:
template< class InputIt >
basic_string( InputIt first, InputIt last,
              const Allocator& alloc = Allocator() );
As stated in comments:
values 0-127 are identical in virtually every encoding, so truncating values that are all less than 127 results in the same text. Put in a Chinese character and you'll see the failure.
the values 128-255 of Windows codepage 1252 (the Windows English default) and the values 128-255 of Unicode are mostly the same, so if that's the codepage you're using, most of those characters should be truncated to the correct values. (I totally expected á and õ to work; I know our code at work relies on this for é, which I will soon fix)
And note that code points in the range 0x80 - 0x9F in Win1252 will not work. This includes €, œ, ž, Ÿ, ...
Here is a worked-out solution based on the other suggestions:
#include <string>
#include <iostream>
#include <clocale>
#include <locale>
#include <vector>
int main() {
    std::setlocale(LC_ALL, "");
    const std::wstring ws = L"ħëłlö";
    const std::locale locale("");
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    const converter_type& converter = std::use_facet<converter_type>(locale);
    std::vector<char> to(ws.length() * converter.max_length());
    std::mbstate_t state = std::mbstate_t(); // must be zero-initialized
    const wchar_t* from_next;
    char* to_next;
    const converter_type::result result = converter.out(state, ws.data(), ws.data() + ws.length(), from_next, &to[0], &to[0] + to.size(), to_next);
    if (result == converter_type::ok or result == converter_type::noconv) {
        const std::string s(&to[0], to_next);
        std::cout <<"std::string = "<<s<<std::endl;
    }
}
This will usually work on Linux, but will create problems on Windows.

Default encodings:

Windows: UTF-16.
Linux: UTF-8.
macOS: UTF-8.
My solution handles embedded null characters (\0) to avoid truncation, without relying on the deprecated <codecvt> facilities. The steps:

Add macros to detect the platform (Windows/Linux and others).
Create functions to convert std::wstring to std::string and, inversely, std::string to std::wstring.
Create functions for printing std::string / std::wstring.
Check raw string literals and the raw string suffix.

On Linux, print std::string directly with std::cout; the default encoding on Linux is UTF-8, so no extra functions are needed. On Windows, if you need to print Unicode, use WriteConsole to print Unicode characters from a std::wstring. Finally, on Windows you need a console with complete, robust support for Unicode characters; I recommend Windows Terminal.
QA

Tested on Microsoft Visual Studio 2019 with VC++, /std:c++17 (Windows project).
Tested on repl.it using the Clang compiler, -std=c++17.

Q. Why not use the <codecvt> header functions and classes?
A. They are deprecated (and slated for removal), which makes them impossible to build with VC++, although g++ accepts them. I prefer zero warnings and headaches.

Q. Is std::wstring cross-platform?
A. No. std::wstring uses wchar_t elements. On Windows, wchar_t is 2 bytes and each character is stored in UTF-16 code units; a character above U+FFFF is represented by two UTF-16 units (two wchar_t elements), called a surrogate pair. On Linux, wchar_t is 4 bytes and each character is stored in a single wchar_t element; no surrogate pairs are needed. See "Standard data types on UNIX, Linux, and Windows".

Q. Is std::string cross-platform?
A. Yes. std::string uses char elements, and char is guaranteed to be 1 byte in every conforming compiler. See "Standard data types on UNIX, Linux, and Windows".
Full example code
#include <iostream>
#include <set>
#include <string>
#include <locale>
// WINDOWS
#if (_WIN32)
#include <Windows.h>
#include <conio.h>
#define WINDOWS_PLATFORM 1
#define DLLCALL STDCALL
#define DLLIMPORT _declspec(dllimport)
#define DLLEXPORT _declspec(dllexport)
#define DLLPRIVATE
#define NOMINMAX
//EMSCRIPTEN
#elif defined(__EMSCRIPTEN__)
#include <emscripten/emscripten.h>
#include <emscripten/bind.h>
#include <unistd.h>
#include <termios.h>
#define EMSCRIPTEN_PLATFORM 1
#define DLLCALL
#define DLLIMPORT
#define DLLEXPORT __attribute__((visibility("default")))
#define DLLPRIVATE __attribute__((visibility("hidden")))
// LINUX - Ubuntu, Fedora, , Centos, Debian, RedHat
#elif (__LINUX__ || __gnu_linux__ || __linux__ || __linux || linux)
#define LINUX_PLATFORM 1
#include <unistd.h>
#include <termios.h>
#define DLLCALL CDECL
#define DLLIMPORT
#define DLLEXPORT __attribute__((visibility("default")))
#define DLLPRIVATE __attribute__((visibility("hidden")))
#define CoTaskMemAlloc(p) malloc(p)
#define CoTaskMemFree(p) free(p)
//ANDROID
#elif (__ANDROID__ || ANDROID)
#define ANDROID_PLATFORM 1
#define DLLCALL
#define DLLIMPORT
#define DLLEXPORT __attribute__((visibility("default")))
#define DLLPRIVATE __attribute__((visibility("hidden")))
//MACOS
#elif defined(__APPLE__)
#include <unistd.h>
#include <termios.h>
#define DLLCALL
#define DLLIMPORT
#define DLLEXPORT __attribute__((visibility("default")))
#define DLLPRIVATE __attribute__((visibility("hidden")))
#include "TargetConditionals.h"
#if TARGET_OS_IPHONE && TARGET_IPHONE_SIMULATOR
#define IOS_SIMULATOR_PLATFORM 1
#elif TARGET_OS_IPHONE
#define IOS_PLATFORM 1
#elif TARGET_OS_MAC
#define MACOS_PLATFORM 1
#else
#endif
#endif
typedef std::string String;
typedef std::wstring WString;
#define EMPTY_STRING u8""s
#define EMPTY_WSTRING L""s
using namespace std::literals::string_literals;
class Strings
{
public:
static String WideStringToString(const WString& wstr)
{
if (wstr.empty())
{
return String();
}
size_t pos;
size_t begin = 0;
String ret;
#if WINDOWS_PLATFORM
int size;
pos = wstr.find(static_cast<wchar_t>(0), begin);
while (pos != WString::npos && begin < wstr.length())
{
WString segment = WString(&wstr[begin], pos - begin);
size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, &segment[0], segment.size(), NULL, 0, NULL, NULL);
String converted = String(size, 0);
WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, &segment[0], segment.size(), &converted[0], converted.size(), NULL, NULL);
ret.append(converted);
ret.append({ 0 });
begin = pos + 1;
pos = wstr.find(static_cast<wchar_t>(0), begin);
}
if (begin <= wstr.length())
{
WString segment = WString(&wstr[begin], wstr.length() - begin);
size = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, &segment[0], segment.size(), NULL, 0, NULL, NULL);
String converted = String(size, 0);
WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS, &segment[0], segment.size(), &converted[0], converted.size(), NULL, NULL);
ret.append(converted);
}
#elif LINUX_PLATFORM || MACOS_PLATFORM || EMSCRIPTEN_PLATFORM
size_t size;
pos = wstr.find(static_cast<wchar_t>(0), begin);
while (pos != WString::npos && begin < wstr.length())
{
WString segment = WString(&wstr[begin], pos - begin);
size = wcstombs(nullptr, segment.c_str(), 0);
String converted = String(size, 0);
wcstombs(&converted[0], segment.c_str(), converted.size());
ret.append(converted);
ret.append({ 0 });
begin = pos + 1;
pos = wstr.find(static_cast<wchar_t>(0), begin);
}
if (begin <= wstr.length())
{
WString segment = WString(&wstr[begin], wstr.length() - begin);
size = wcstombs(nullptr, segment.c_str(), 0);
String converted = String(size, 0);
wcstombs(&converted[0], segment.c_str(), converted.size());
ret.append(converted);
}
#else
static_assert(false, "Unknown Platform");
#endif
return ret;
}
static WString StringToWideString(const String& str)
{
if (str.empty())
{
return WString();
}
size_t pos;
size_t begin = 0;
WString ret;
#ifdef WINDOWS_PLATFORM
int size = 0;
pos = str.find(static_cast<char>(0), begin);
while (pos != std::string::npos) {
std::string segment = std::string(&str[begin], pos - begin);
std::wstring converted = std::wstring(segment.size() + 1, 0);
size = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, &segment[0], segment.size(), &converted[0], converted.length());
converted.resize(size);
ret.append(converted);
ret.append({ 0 });
begin = pos + 1;
pos = str.find(static_cast<char>(0), begin);
}
if (begin < str.length()) {
std::string segment = std::string(&str[begin], str.length() - begin);
std::wstring converted = std::wstring(segment.size() + 1, 0);
size = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, segment.c_str(), segment.size(), &converted[0], converted.length());
converted.resize(size);
ret.append(converted);
}
#elif LINUX_PLATFORM || MACOS_PLATFORM || EMSCRIPTEN_PLATFORM
size_t size;
pos = str.find(static_cast<char>(0), begin);
while (pos != String::npos)
{
String segment = String(&str[begin], pos - begin);
WString converted = WString(segment.size(), 0);
size = mbstowcs(&converted[0], &segment[0], converted.size());
converted.resize(size);
ret.append(converted);
ret.append({ 0 });
begin = pos + 1;
pos = str.find(static_cast<char>(0), begin);
}
if (begin < str.length())
{
String segment = String(&str[begin], str.length() - begin);
WString converted = WString(segment.size(), 0);
size = mbstowcs(&converted[0], &segment[0], converted.size());
converted.resize(size);
ret.append(converted);
}
#else
static_assert(false, "Unknown Platform");
#endif
return ret;
}
};
enum class ConsoleTextStyle
{
DEFAULT = 0,
BOLD = 1,
FAINT = 2,
ITALIC = 3,
UNDERLINE = 4,
SLOW_BLINK = 5,
RAPID_BLINK = 6,
REVERSE = 7,
};
enum class ConsoleForeground
{
DEFAULT = 39,
BLACK = 30,
DARK_RED = 31,
DARK_GREEN = 32,
DARK_YELLOW = 33,
DARK_BLUE = 34,
DARK_MAGENTA = 35,
DARK_CYAN = 36,
GRAY = 37,
DARK_GRAY = 90,
RED = 91,
GREEN = 92,
YELLOW = 93,
BLUE = 94,
MAGENTA = 95,
CYAN = 96,
WHITE = 97
};
enum class ConsoleBackground
{
DEFAULT = 49,
BLACK = 40,
DARK_RED = 41,
DARK_GREEN = 42,
DARK_YELLOW = 43,
DARK_BLUE = 44,
DARK_MAGENTA = 45,
DARK_CYAN = 46,
GRAY = 47,
DARK_GRAY = 100,
RED = 101,
GREEN = 102,
YELLOW = 103,
BLUE = 104,
MAGENTA = 105,
CYAN = 106,
WHITE = 107
};
class Console
{
private:
static void EnableVirtualTermimalProcessing()
{
#if defined WINDOWS_PLATFORM
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD dwMode = 0;
GetConsoleMode(hOut, &dwMode);
if (!(dwMode & ENABLE_VIRTUAL_TERMINAL_PROCESSING))
{
dwMode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING;
SetConsoleMode(hOut, dwMode);
}
#endif
}
static void ResetTerminalFormat()
{
std::cout << u8"\033[0m";
}
static void SetVirtualTerminalFormat(ConsoleForeground foreground, ConsoleBackground background, std::set<ConsoleTextStyle> styles)
{
String format = u8"\033[";
format.append(std::to_string(static_cast<int>(foreground)));
format.append(u8";");
format.append(std::to_string(static_cast<int>(background)));
if (styles.size() > 0)
{
for (auto it = styles.begin(); it != styles.end(); ++it)
{
format.append(u8";");
format.append(std::to_string(static_cast<int>(*it)));
}
}
format.append(u8"m");
std::cout << format;
}
public:
static void Clear()
{
#ifdef WINDOWS_PLATFORM
std::system(u8"cls");
#elif LINUX_PLATFORM || defined MACOS_PLATFORM
std::system(u8"clear");
#elif EMSCRIPTEN_PLATFORM
emscripten::val::global()["console"].call<void>(u8"clear");
#else
static_assert(false, "Unknown Platform");
#endif
}
static void Write(const String& s, ConsoleForeground foreground = ConsoleForeground::DEFAULT, ConsoleBackground background = ConsoleBackground::DEFAULT, std::set<ConsoleTextStyle> styles = {})
{
#ifndef EMSCRIPTEN_PLATFORM
EnableVirtualTermimalProcessing();
SetVirtualTerminalFormat(foreground, background, styles);
#endif
String str = s;
#ifdef WINDOWS_PLATFORM
WString unicode = Strings::StringToWideString(str);
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), unicode.c_str(), static_cast<DWORD>(unicode.length()), nullptr, nullptr);
#elif defined LINUX_PLATFORM || defined MACOS_PLATFORM || EMSCRIPTEN_PLATFORM
std::cout << str;
#else
static_assert(false, "Unknown Platform");
#endif
#ifndef EMSCRIPTEN_PLATFORM
ResetTerminalFormat();
#endif
}
static void WriteLine(const String& s, ConsoleForeground foreground = ConsoleForeground::DEFAULT, ConsoleBackground background = ConsoleBackground::DEFAULT, std::set<ConsoleTextStyle> styles = {})
{
Write(s, foreground, background, styles);
std::cout << std::endl;
}
static void Write(const WString& s, ConsoleForeground foreground = ConsoleForeground::DEFAULT, ConsoleBackground background = ConsoleBackground::DEFAULT, std::set<ConsoleTextStyle> styles = {})
{
#ifndef EMSCRIPTEN_PLATFORM
EnableVirtualTermimalProcessing();
SetVirtualTerminalFormat(foreground, background, styles);
#endif
WString str = s;
#ifdef WINDOWS_PLATFORM
WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), static_cast<DWORD>(str.length()), nullptr, nullptr);
#elif LINUX_PLATFORM || MACOS_PLATFORM || EMSCRIPTEN_PLATFORM
std::cout << Strings::WideStringToString(str);
#else
static_assert(false, "Unknown Platform");
#endif
#ifndef EMSCRIPTEN_PLATFORM
ResetTerminalFormat();
#endif
}
static void WriteLine(const WString& s, ConsoleForeground foreground = ConsoleForeground::DEFAULT, ConsoleBackground background = ConsoleBackground::DEFAULT, std::set<ConsoleTextStyle> styles = {})
{
Write(s, foreground, background, styles);
std::cout << std::endl;
}
static void WriteLine()
{
std::cout << std::endl;
}
static void Pause()
{
char c;
do
{
c = getchar();
std::cout << "Press Key " << std::endl;
} while (c != 64);
std::cout << "KeyPressed" << std::endl;
}
static int PauseAny(bool printWhenPressed = false, ConsoleForeground foreground = ConsoleForeground::DEFAULT, ConsoleBackground background = ConsoleBackground::DEFAULT, std::set<ConsoleTextStyle> styles = {})
{
int ch;
#ifdef WINDOWS_PLATFORM
ch = _getch();
#elif LINUX_PLATFORM || MACOS_PLATFORM || EMSCRIPTEN_PLATFORM
struct termios oldt, newt;
tcgetattr(STDIN_FILENO, &oldt);
newt = oldt;
newt.c_lflag &= ~(ICANON | ECHO);
tcsetattr(STDIN_FILENO, TCSANOW, &newt);
ch = getchar();
tcsetattr(STDIN_FILENO, TCSANOW, &oldt);
#else
static_assert(false, "Unknown Platform");
#endif
if (printWhenPressed)
{
Console::Write(String(1, ch), foreground, background, styles);
}
return ch;
}
};
int main()
{
std::locale::global(std::locale(u8"en_US.UTF8"));
auto str = u8"🐶\0Hello\0🐶123456789也不是可运行的程序123456789日本"s;//
WString wstr = L"🐶\0Hello\0🐶123456789也不是可运行的程序123456789日本"s;
WString wstrResult = Strings::StringToWideString(str);
String strResult = Strings::WideStringToString(wstr);
bool equals1 = wstr == wstrResult;
bool equals2 = str == strResult;
Console::WriteLine(u8"█ Converted Strings printed with Console::WriteLine"s, ConsoleForeground::GREEN);
Console::WriteLine(wstrResult, ConsoleForeground::BLUE);//Printed OK on Windows/Linux.
Console::WriteLine(strResult, ConsoleForeground::BLUE);//Printed OK on Windows/Linux.
Console::WriteLine(u8"█ Converted Strings printed with std::cout/std::wcout"s, ConsoleForeground::GREEN);
std::cout << strResult << std::endl;//Printed OK on Linux. BAD on Windows.
std::wcout << wstrResult << std::endl; //Printed BAD on Windows/Linux.
Console::WriteLine();
Console::WriteLine(u8"Press any key to exit"s, ConsoleForeground::DARK_GRAY);
Console::PauseAny();
}
You can test this code at https://repl.it/#JomaCorpFX/StringToWideStringToString#main.cpp
Screenshots
Using Windows Terminal
Using cmd/powershell
Repl.it capture
Instead of including <locale> and all that fancy stuff, if you know for a fact that your string is convertible, just do this:
#include <iostream>
#include <string>
using namespace std;
int main()
{
    wstring w(L"bla");
    string result;
    for(char x : w)
        result += x;

    cout << result << '\n';
}
Live example here
Besides just converting the types, you should also be conscious of the string's actual encoding.
When compiling for the Multi-byte Character Set, Visual Studio and the Win API assume UTF-8 (actually the Windows encoding, which is Windows-28591).
When compiling for the Unicode Character Set, Visual Studio and the Win API assume UTF-16.
So, you must convert the string from UTF-16 to UTF-8 format as well, not just convert it to std::string.
This becomes necessary when working with multi-byte encodings, such as some non-Latin languages.
The idea is to adopt the convention that std::wstring always represents UTF-16 and std::string always represents UTF-8. This isn't enforced by the compiler; it's just a good policy to have.
Note the string prefixes I use to define UTF16 (L) and UTF8 (u8).
To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>:
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";

    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

    // convert to UTF-8 and std::string
    std::string utf8NativeString = convert.to_bytes(original16);
    std::wstring utf16NativeString = convert.from_bytes(original8);

    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);

    return 0;
}
I believe the official way is still to go through codecvt facets (you need some sort of locale-aware translation), as in
resultCode = use_facet<codecvt<char, wchar_t, ConversionState> >(locale).
in(stateVar, scratchbuffer, scratchbufferEnd, from, to, toLimit, curPtr);
or something like that, I don't have working code lying around. But I'm not sure how many people these days use that machinery and how many simply ask for pointers to memory and let ICU or some other library handle the gory details.
There are two issues with the code:
The conversion in const std::string s( ws.begin(), ws.end() ); is not required to correctly map the wide characters to their narrow counterpart. Most likely, each wide character will just be typecast to char.
The resolution to this problem is already given in the answer by kem and involves the narrow function of the locale's ctype facet.
You are writing output to both std::cout and std::wcout in the same program. Both cout and wcout are associated with the same stream (stdout) and the results of using the same stream both as a byte-oriented stream (as cout does) and a wide-oriented stream (as wcout does) are not defined.
The best option is to avoid mixing narrow and wide output to the same (underlying) stream. For stdout/cout/wcout, you can try switching the orientation of stdout when switching between wide and narrow output (or vice versa):
#include <iostream>
#include <stdio.h>
#include <wchar.h>
int main() {
std::cout << "narrow" << std::endl;
fwide(stdout, 1); // switch to wide
std::wcout << L"wide" << std::endl;
fwide(stdout, -1); // switch to narrow
std::cout << "narrow" << std::endl;
fwide(stdout, 1); // switch to wide
std::wcout << L"wide" << std::endl;
}
At the time of writing this answer, the number one Google search for "convert string wstring" would land you on this page. My answer shows how to convert a string to a wstring, even though this is NOT the actual question; I should probably delete this answer, but that is considered bad form. You may want to jump to this Stack Overflow answer, which is now ranked higher than this page.
Here's a way of combining string, wstring, and mixed string constants into a wstring: use the wstringstream class.
#include <sstream>

std::string narrow = "narrow";
std::wstring wide = L"wide";

std::wstringstream cls;
cls << " abc " << narrow.c_str() << L" def " << wide.c_str();
std::wstring total = cls.str();
If you are dealing with file paths (as I often am when I find the need for wstring-to-string) you can use filesystem::path (since C++17):
#include <filesystem>
const std::wstring wPath = GetPath(); // some function that returns wstring
const std::string path = std::filesystem::path(wPath).string();
You might as well just use the ctype facet's narrow method directly:
#include <clocale>
#include <locale>
#include <string>
#include <vector>
inline std::string narrow(std::wstring const& text)
{
    std::locale const loc("");
    wchar_t const* from = text.c_str();
    std::size_t const len = text.size();
    std::vector<char> buffer(len + 1);
    std::use_facet<std::ctype<wchar_t> >(loc).narrow(from, from + len, '_', &buffer[0]);
    return std::string(&buffer[0], &buffer[len]);
}
This solution is inspired by dk123's solution, but uses a locale-dependent codecvt facet. The result is in the locale's encoding rather than UTF-8 (unless UTF-8 is set as the locale):
std::string w2s(const std::wstring &var)
{
    static std::locale loc("");
    auto &facet = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
    return std::wstring_convert<std::remove_reference<decltype(facet)>::type, wchar_t>(&facet).to_bytes(var);
}

std::wstring s2w(const std::string &var)
{
    static std::locale loc("");
    auto &facet = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
    return std::wstring_convert<std::remove_reference<decltype(facet)>::type, wchar_t>(&facet).from_bytes(var);
}
I searched for a ready-made solution but couldn't find one. Eventually I found that I could get the right facet from std::locale using the std::use_facet() function with the right type name. Hope this helps.
I spent many sad days trying to come up with a way to do this for C++17, which deprecated the codecvt facets, and this is the best I was able to come up with by combining code from a few different sources:
#include <clocale> // setlocale
#include <cstdlib> // std::wctomb, std::mbstowcs, MB_CUR_MAX
#include <string>

setlocale( LC_ALL, "en_US.UTF-8" ); // invoked in main()

std::string wideToMultiByte( std::wstring const & wideString )
{
    std::string ret;
    std::string buff( MB_CUR_MAX, '\0' );

    for ( wchar_t const & wc : wideString )
    {
        int mbCharLen = std::wctomb( &buff[ 0 ], wc );

        if ( mbCharLen < 1 ) { break; }

        for ( int i = 0; i < mbCharLen; ++i )
        {
            ret += buff[ i ];
        }
    }

    return ret;
}

std::wstring multiByteToWide( std::string const & multiByteString )
{
    std::wstring ws( multiByteString.size(), L' ' );
    ws.resize(
        std::mbstowcs( &ws[ 0 ],
                       multiByteString.c_str(),
                       multiByteString.size() ) );

    return ws;
}
I tested this code on Windows 10, and at least for my purposes, it seems to work fine. Please don't lynch me if this doesn't consider some crazy edge cases that you might need to handle, I'm sure someone with more experience can improve on this! :-)
Also, credit where it's due:
Adapted for wideToMultiByte()
Copied for multiByteToWide
In my case, I had to use multibyte characters (MBCS) and wanted to use std::string and std::wstring, but couldn't use C++11. So I used mbstowcs and wcstombs.
I wrote the same functions using new and delete[], but they were slower than this.
This may help: How to: Convert Between Various String Types.
EDIT
However, when converting to a wstring whose source is non-alphabetic, multi-byte text, it didn't work, so I changed mbstowcs/wcstombs to MultiByteToWideChar/WideCharToMultiByte.
#include <cstring> // strlen, wcslen
#include <string>

std::wstring get_wstr_from_sz(const char* psz)
{
    // I think a fixed-size buffer is enough for my case
    wchar_t buf[0x400];
    const wchar_t *pbuf = buf;
    size_t len = strlen(psz) + 1;

    if (len >= sizeof(buf) / sizeof(wchar_t))
    {
        pbuf = L"error";
    }
    else
    {
        size_t converted;
        mbstowcs_s(&converted, buf, psz, _TRUNCATE);
    }

    return std::wstring(pbuf);
}

std::string get_string_from_wsz(const wchar_t* pwsz)
{
    char buf[0x400];
    const char *pbuf = buf;
    size_t len = wcslen(pwsz)*2 + 1;

    if (len >= sizeof(buf))
    {
        pbuf = "error";
    }
    else
    {
        size_t converted;
        wcstombs_s(&converted, buf, pwsz, _TRUNCATE);
    }

    return std::string(pbuf);
}
EDIT: using MultiByteToWideChar and WideCharToMultiByte instead of mbstowcs_s and wcstombs_s:
#include <Windows.h>
#include <boost/shared_ptr.hpp>
#include "string_util.h"
std::wstring get_wstring_from_sz(const char* psz)
{
    int res;
    wchar_t buf[0x400];
    const wchar_t *pbuf = buf;
    boost::shared_ptr<wchar_t[]> shared_pbuf;

    res = MultiByteToWideChar(CP_ACP, 0, psz, -1, buf, sizeof(buf)/sizeof(wchar_t));

    if (0 == res && GetLastError() == ERROR_INSUFFICIENT_BUFFER)
    {
        res = MultiByteToWideChar(CP_ACP, 0, psz, -1, NULL, 0);
        shared_pbuf = boost::shared_ptr<wchar_t[]>(new wchar_t[res]);
        pbuf = shared_pbuf.get();
        res = MultiByteToWideChar(CP_ACP, 0, psz, -1, shared_pbuf.get(), res);
    }
    else if (0 == res)
    {
        pbuf = L"error";
    }

    return std::wstring(pbuf);
}

std::string get_string_from_wcs(const wchar_t* pcs)
{
    int res;
    char buf[0x400];
    const char* pbuf = buf;
    boost::shared_ptr<char[]> shared_pbuf;

    res = WideCharToMultiByte(CP_ACP, 0, pcs, -1, buf, sizeof(buf), NULL, NULL);

    if (0 == res && GetLastError() == ERROR_INSUFFICIENT_BUFFER)
    {
        res = WideCharToMultiByte(CP_ACP, 0, pcs, -1, NULL, 0, NULL, NULL);
        shared_pbuf = boost::shared_ptr<char[]>(new char[res]);
        pbuf = shared_pbuf.get();
        res = WideCharToMultiByte(CP_ACP, 0, pcs, -1, shared_pbuf.get(), res, NULL, NULL);
    }
    else if (0 == res)
    {
        pbuf = "error";
    }

    return std::string(pbuf);
}
#include <boost/locale.hpp>
namespace lcv = boost::locale::conv;
inline std::wstring fromUTF8(const std::string& s)
{ return lcv::utf_to_utf<wchar_t>(s); }
inline std::string toUTF8(const std::wstring& ws)
{ return lcv::utf_to_utf<char>(ws); }
In case anyone else is interested: I needed a class that could be used interchangeably wherever either a string or a wstring was expected. The following class, convertible_string, based on dk123's solution, can be initialized with a string, char const*, wstring, or wchar_t const*, and can be assigned to by, or implicitly converted to, either a string or a wstring (so it can be passed to functions that take either).
class convertible_string
{
public:
    // default ctor
    convertible_string()
    {}

    /* conversion ctors */
    convertible_string(std::string const& value) : value_(value)
    {}
    convertible_string(char const* val_array) : value_(val_array)
    {}
    convertible_string(std::wstring const& wvalue) : value_(ws2s(wvalue))
    {}
    convertible_string(wchar_t const* wval_array) : value_(ws2s(std::wstring(wval_array)))
    {}

    /* assignment operators */
    convertible_string& operator=(std::string const& value)
    {
        value_ = value;
        return *this;
    }
    convertible_string& operator=(std::wstring const& wvalue)
    {
        value_ = ws2s(wvalue);
        return *this;
    }

    /* implicit conversion operators */
    operator std::string() const { return value_; }
    operator std::wstring() const { return s2ws(value_); }

private:
    std::string value_;
};
I am using the following to convert a wstring to a string:

std::string strTo;
char *szTo = new char[someParam.length() + 1];
szTo[someParam.size()] = '\0';
WideCharToMultiByte(CP_ACP, 0, someParam.c_str(), -1, szTo, (int)someParam.length(), NULL, NULL);
strTo = szTo;
delete[] szTo; // must be delete[], since szTo was allocated with new[]
// Embarcadero C++ Builder

// conversion string to wstring
string str1 = "hello";
String str2 = str1; // typedef UnicodeString String; -> str2 now contains u"hello"

// conversion wstring to string
String str2 = u"hello";
string str1 = UTF8string(str2).c_str(); // -> str1 now contains "hello"
Related
I'm working on a C++/WinRT project and would like to use Boost (1.73.0) Locale in it. Inspired by a Microsoft blog post, I replaced calls to unsupported WinAPI functions in the library sources and compiled with windows-api=store. I managed to build all the intended library variants (arm / x64, debug / release) and went on to build the UWP app. That is the point where I am stuck right now. The build fails to link with the following output:
vccorlibd.lib(init.obj) : error LNK2038: mismatch detected for 'vccorlib_lib_should_be_specified_before_msvcrt_lib_to_linker': value '1' doesn't match value '0' in MSVCRTD.lib(app_appinit.obj)
vccorlibd.lib(init.obj) : error LNK2005: __crtWinrtInitType already
defined in MSVCRTD.lib(app_appinit.obj)
The line triggering the failure:
const auto text = boost::locale::conv::to_utf<char>("xxx", "yyy");
My setup is Visual Studio 17 & 19, SDK 10.0.19041.0. I have tried this solution proposed to a similar question but to no avail. The modifications I made in the boost_1_73_0/libs/locale/src/win32 directory follow:
api.h
// detecting UWP
#include <boost/predef.h>
// ...
class winlocale{
public:
winlocale() :
lcid(0)
{
}
winlocale(std::string const &name)
{
#if BOOST_PLAT_WINDOWS_RUNTIME
str.assign(cbegin(name), cend(name));
#endif
lcid = locale_to_lcid(name);
}
#if BOOST_PLAT_WINDOWS_RUNTIME
std::wstring str;
#endif
unsigned lcid;
bool is_c() const
{
return lcid == 0;
}
};
// ...
inline std::wstring win_map_string_l(unsigned flags,wchar_t const *begin,wchar_t const *end,winlocale const &l)
{
std::wstring res;
#if BOOST_PLAT_WINDOWS_RUNTIME
int len = LCMapStringEx(l.str.c_str(), flags, begin, end - begin, 0, 0,
/* >=win_vista must set to null*/ 0,
/*Reserved; must be NULL.*/ 0,
/*Reserved; must be 0.*/ 0);
if (len == 0)
return res;
std::vector<wchar_t> buf(len + 1);
int l2 = LCMapStringEx(l.str.c_str(), flags, begin, end - begin, &buf.front(), buf.size(),
/* >=win_vista must set to null*/ 0,
/*Reserved; must be NULL.*/ 0,
/*Reserved; must be 0.*/ 0);
res.assign(&buf.front(), l2);
#else
int len = LCMapStringW(l.lcid,flags,begin,end-begin,0,0);
if(len == 0)
return res;
std::vector<wchar_t> buf(len+1);
int l2 = LCMapStringW(l.lcid,flags,begin,end-begin,&buf.front(),buf.size());
res.assign(&buf.front(),l2);
#endif
return res;
}
// …
inline std::wstring wcsfmon_l(double value,winlocale const &l)
{
std::wostringstream ss;
ss.imbue(std::locale::classic());
ss << std::setprecision(std::numeric_limits<double>::digits10+1) << value;
std::wstring sval = ss.str();
#if BOOST_PLAT_WINDOWS_RUNTIME
int len = GetCurrencyFormatEx(l.str.c_str(), 0, sval.c_str(), 0, 0, 0);
std::vector<wchar_t> buf(len + 1);
GetCurrencyFormatEx(l.str.c_str(), 0, sval.c_str(), 0, &buf.front(), len);
#else
int len = GetCurrencyFormatW(l.lcid,0,sval.c_str(),0,0,0);
std::vector<wchar_t> buf(len+1);
GetCurrencyFormatW(l.lcid,0,sval.c_str(),0,&buf.front(),len);
#endif
return &buf.front();
}
inline std::wstring wcs_format_date_l(wchar_t const *format,SYSTEMTIME const *tm,winlocale const &l)
{
#if BOOST_PLAT_WINDOWS_RUNTIME
int len = GetDateFormatEx(l.str.c_str(), 0, tm, format, 0, 0,
/*Reserved; must set to NULL.*/ 0);
std::vector<wchar_t> buf(len + 1);
GetDateFormatEx(l.str.c_str(), 0, tm, format, &buf.front(), len,
/*Reserved; must set to NULL.*/ 0);
#else
int len = GetDateFormatW(l.lcid,0,tm,format,0,0);
std::vector<wchar_t> buf(len+1);
GetDateFormatW(l.lcid,0,tm,format,&buf.front(),len);
#endif
return &buf.front();
}
inline std::wstring wcs_format_time_l(wchar_t const *format,SYSTEMTIME const *tm,winlocale const &l)
{
#if BOOST_PLAT_WINDOWS_RUNTIME
int len = GetTimeFormatEx(l.str.c_str(), 0, tm, format, 0, 0);
std::vector<wchar_t> buf(len + 1);
GetTimeFormatEx(l.str.c_str(), 0, tm, format, &buf.front(), len);
#else
int len = GetTimeFormatW(l.lcid, 0, tm, format, 0, 0);
std::vector<wchar_t> buf(len + 1);
GetTimeFormatW(l.lcid, 0, tm, format, &buf.front(), len);
#endif
return &buf.front();
}
lcid.cpp
BOOL CALLBACK procEx(LPWSTR s, DWORD, LPARAM)
{
table_type& tbl = real_lcid_table();
try {
std::wistringstream ss;
ss.str(s);
ss >> std::hex;
unsigned lcid;
ss >> lcid; // <----- not sure about this line, but anyway...
if (ss.fail() || !ss.eof()) {
return FALSE;
}
char iso_639_lang[16];
char iso_3166_country[16];
if (GetLocaleInfoA(lcid, LOCALE_SISO639LANGNAME, iso_639_lang, sizeof(iso_639_lang)) == 0)
return FALSE;
std::string lc_name = iso_639_lang;
if (GetLocaleInfoA(lcid, LOCALE_SISO3166CTRYNAME, iso_3166_country, sizeof(iso_3166_country)) != 0) {
lc_name += "_";
lc_name += iso_3166_country;
}
table_type::iterator p = tbl.find(lc_name);
if (p != tbl.end()) {
if (p->second > lcid)
p->second = lcid;
}
else {
tbl[lc_name] = lcid;
}
}
catch (...) {
tbl.clear();
return FALSE;
}
return TRUE;
}
table_type const &get_ready_lcid_table()
{
if(table)
return *table;
else {
boost::unique_lock<boost::mutex> lock(lcid_table_mutex());
if(table)
return *table;
#if BOOST_PLAT_WINDOWS_RUNTIME
EnumSystemLocalesEx(procEx, LCID_INSTALLED,
/*callback arg*/ 0, /*reserved*/ 0);
#else
EnumSystemLocalesA(proc, LCID_INSTALLED);
#endif
table = &real_lcid_table();
return *table;
}
}
UPDATE
I found out that the problem is actually caused by the inclusion of <boost/predef.h> in api.h, which I use to define the BOOST_PLAT_WINDOWS_RUNTIME macro when appropriate. However, I still don't know how to handle this situation. Can anybody point me somewhere? Thank you.
I want to convert wstring to u16string in C++.
I can convert wstring to string, or the reverse, but I don't know how to convert to u16string.
u16string CTextConverter::convertWstring2U16(wstring str)
{
int iSize;
u16string szDest[256] = {};
memset(szDest, 0, 256);
iSize = WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, NULL, 0,0,0);
WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize,0,0);
u16string s16 = szDest;
return s16;
}
The error is in the szDest argument of WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0);: a u16string cannot be used where an LPSTR is expected.
How can I fix this code?
For a platform-independent solution see this answer.
If you need a solution only for the Windows platform, the following code will be sufficient:
std::wstring wstr( L"foo" );
std::u16string u16str( wstr.begin(), wstr.end() );
On the Windows platform, a std::wstring is interchangeable with std::u16string because sizeof(wstring::value_type) == sizeof(u16string::value_type) and both are UTF-16 (little endian) encoded.
wstring::value_type = wchar_t
u16string::value_type = char16_t
The only difference is that wchar_t and char16_t are distinct types (whose signedness may differ between implementations), so an explicit conversion is still required. It can be performed by using the u16string constructor that takes an iterator pair as arguments. This constructor will implicitly convert each wchar_t to char16_t.
Full example console application:
#include <windows.h>
#include <string>
int main()
{
static_assert( sizeof(std::wstring::value_type) == sizeof(std::u16string::value_type),
"std::wstring and std::u16string are expected to have the same character size" );
std::wstring wstr( L"foo" );
std::u16string u16str( wstr.begin(), wstr.end() );
// The u16string constructor performs an implicit conversion like:
wchar_t wch = L'A';
char16_t ch16 = wch;
// Need to reinterpret_cast because char16_t const* is not implicitly convertible
// to LPCWSTR (aka wchar_t const*).
::MessageBoxW( 0, reinterpret_cast<LPCWSTR>( u16str.c_str() ), L"test", 0 );
return 0;
}
Update
I had thought the standard version did not work, but in fact this was simply due to bugs in the Visual C++ and libstdc++ 3.4.21 runtime libraries. It does work with clang++ -std=c++14 -stdlib=libc++. Here is a version that tests whether the standard method works on your compiler:
#include <codecvt>
#include <cstdlib>
#include <cstring>
#include <cwctype>
#include <iostream>
#include <locale>
#include <clocale>
#include <vector>
using std::cout;
using std::endl;
using std::exit;
using std::memcmp;
using std::size_t;
using std::wcout;
#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif
using std::size_t;
void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
// Windows needs a little non-standard magic.
constexpr char cp_utf16le[] = ".1200";
setlocale( LC_ALL, cp_utf16le );
_setmode( _fileno(stdout), _O_U16TEXT );
#else
// The correct locale name may vary by OS, e.g., "en_US.utf8".
constexpr char locale_name[] = "";
std::locale::global(std::locale(locale_name));
std::wcout.imbue(std::locale());
#endif
}
int main(void)
{
constexpr char16_t msg_utf16[] = u"¡Hola, mundo! \U0001F600"; // Shouldn't assume endianness.
constexpr wchar_t msg_w[] = L"¡Hola, mundo! \U0001F600";
constexpr char32_t msg_utf32[] = U"¡Hola, mundo! \U0001F600";
constexpr char msg_utf8[] = u8"¡Hola, mundo! \U0001F600";
init_locale();
const std::codecvt_utf16<wchar_t, 0x1FFFF, std::little_endian> converter_w;
const size_t max_len = sizeof(msg_utf16);
std::vector<char> out(max_len);
std::mbstate_t state{}; // value-initialize: an indeterminate mbstate_t must not be passed to codecvt
const wchar_t* from_w = nullptr;
char* to_next = nullptr;
converter_w.out( state, msg_w, msg_w+sizeof(msg_w)/sizeof(wchar_t), from_w, out.data(), out.data() + out.size(), to_next );
if (memcmp( msg_utf8, out.data(), sizeof(msg_utf8) ) == 0 ) {
wcout << L"std::codecvt_utf16<wchar_t> converts to UTF-8, not UTF-16!" << endl;
} else if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
wcout << L"std::codecvt_utf16<wchar_t> conversion not equal!" << endl;
} else {
wcout << L"std::codecvt_utf16<wchar_t> conversion is correct." << endl;
}
out.clear();
out.resize(max_len);
const std::codecvt_utf16<char32_t, 0x1FFFF, std::little_endian> converter_u32;
const char32_t* from_u32 = nullptr;
converter_u32.out( state, msg_utf32, msg_utf32+sizeof(msg_utf32)/sizeof(char32_t), from_u32, out.data(), out.data() + out.size(), to_next );
if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
wcout << L"std::codecvt_utf16<char32_t> conversion not equal!" << endl;
} else {
wcout << L"std::codecvt_utf16<char32_t> conversion is correct." << endl;
}
wcout << msg_w << endl;
return EXIT_SUCCESS;
}
Previous
A bit late to the game, but here’s a version that additionally checks whether wchar_t is 32-bits (as it is on Linux), and if so, performs surrogate-pair conversion. I recommend saving this source as UTF-8 with a BOM. Here is a link to it on ideone.
#include <cassert>
#include <cwctype>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>
#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif
using std::size_t;
void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
// Windows needs a little non-standard magic.
constexpr char cp_utf16le[] = ".1200";
setlocale( LC_ALL, cp_utf16le );
_setmode( _fileno(stdout), _O_U16TEXT );
#else
// The correct locale name may vary by OS, e.g., "en_US.utf8".
constexpr char locale_name[] = "";
std::locale::global(std::locale(locale_name));
std::wcout.imbue(std::locale());
#endif
}
std::u16string make_u16string( const std::wstring& ws )
/* Creates a UTF-16 string from a wide-character string. Any wide characters
* outside the allowed range of UTF-16 are mapped to the sentinel value U+FFFD,
* per the Unicode documentation. (http://www.unicode.org/faq/private_use.html
* retrieved 12 March 2017.) Unpaired surrogates in ws are also converted to
* sentinel values. Noncharacters, however, are left intact. As a fallback,
* if wide characters are the same size as char16_t, this does a more trivial
* construction using that implicit conversion.
*/
{
/* We assume that, if this test passes, a wide-character string is already
* UTF-16, or at least converts to it implicitly without needing surrogate
* pairs.
*/
if ( sizeof(wchar_t) == sizeof(char16_t) ) {
return std::u16string( ws.begin(), ws.end() );
} else {
/* The conversion from UTF-32 to UTF-16 might possibly require surrogates.
* A surrogate pair suffices to represent all wide characters, because all
* characters outside the range will be mapped to the sentinel value
* U+FFFD. Add one character for the terminating NUL.
*/
const size_t max_len = 2 * ws.length() + 1;
// Our temporary UTF-16 string.
std::u16string result;
result.reserve(max_len);
for ( const wchar_t& wc : ws ) {
const std::wint_t chr = wc;
if ( chr < 0 || chr > 0x10FFFF || (chr >= 0xD800 && chr <= 0xDFFF) ) {
// Invalid code point. Replace with sentinel, per Unicode standard:
constexpr char16_t sentinel = u'\uFFFD';
result.push_back(sentinel);
} else if ( chr < 0x10000UL ) { // In the BMP.
result.push_back(static_cast<char16_t>(wc));
} else {
const char16_t leading = static_cast<char16_t>(
((chr-0x10000UL) / 0x400U) + 0xD800U );
const char16_t trailing = static_cast<char16_t>(
((chr-0x10000UL) % 0x400U) + 0xDC00U );
result.append({leading, trailing});
} // end if
} // end for
/* The returned string is shrunken to fit, which might not be the Right
* Thing if there is more to be added to the string.
*/
result.shrink_to_fit();
// We depend here on the compiler to optimize the move constructor.
return result;
} // end if
// Not reached.
}
int main(void)
{
static const std::wstring wtest(L"☪☮∈✡℩☯✝ \U0001F644");
static const std::u16string u16test(u"☪☮∈✡℩☯✝ \U0001F644");
const std::u16string converted = make_u16string(wtest);
init_locale();
std::wcout << L"sizeof(wchar_t) == " << sizeof(wchar_t) << L".\n";
for( size_t i = 0; i <= u16test.length(); ++i ) {
if ( u16test[i] != converted[i] ) {
std::wcout << std::hex << std::showbase
<< std::right << std::setfill(L'0')
<< std::setw(4) << (unsigned)converted[i] << L" ≠ "
<< std::setw(4) << (unsigned)u16test[i] << L" at "
<< i << L'.' << std::endl;
return EXIT_FAILURE;
} // end if
} // end for
std::wcout << wtest << std::endl;
return EXIT_SUCCESS;
}
Footnote
Since someone asked: The reason I suggest UTF-8 with BOM is that some compilers, including MSVC 2015, will assume a source file is encoded according to the current code page unless there is a BOM or you specify an encoding on the command line. No encoding works on all toolchains, unfortunately, but every tool I’ve used that’s modern enough to support C++14 also understands the BOM.
- To convert CString to std::wstring and string
string CString2string(CString str)
{
int bufLen = WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, NULL, 0, NULL,NULL);
char *buf = new char[bufLen];
WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, buf, bufLen, NULL, NULL);
string sRet(buf);
delete[] buf;
return sRet;
}
CString strFileName = "test.txt";
wstring wFileName(strFileName.GetBuffer());
strFileName.ReleaseBuffer();
string sFileName = CString2string(strFileName);
- To convert string to CString
CString string2CString(string s)
{
int bufLen = MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, NULL, 0);
WCHAR *buf = new WCHAR[bufLen];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, buf, bufLen);
CString strRet(buf);
delete[] buf;
return strRet;
}
string sFileName = "test.txt";
CString strFileName = string2CString(sFileName);
I have a lot of code written based on UTF-8 using C++03, STL and Boost 1.54.
All the code outputs data to the console via std::cout or std::cerr.
I do not want to introduce a new library to my code base or switch to C++11,
but I want to port the code to Windows.
Rewriting all code to either use std::wcout or std::wcerr instead of
std::cout or std::cerr is not what I intend but I still want to display
all on console as UTF-16.
Is there a way to change std::cout and std::cerr to transform all char based data (which is UTF-8 encoded) to wchar_t based data (which would be UTF-16 encoded) before writing it to the console?
It would be great to see a solution for this using just C++03, STL and Boost 1.54.
I found that Boost Locale has conversion functions for single strings and there is a UTF-8 to UTF-32 iterator in Boost Spirit available but I could not find any facet codecvt to transform UTF-8 to UTF-16 without using an additional library or switching to C++11.
Thanks in advance.
PS: I know it is doable with something like this, but I hope to find a better solution here.
I did not come up with a better solution than the ones already hinted at.
So I will just share the solution based on streambuf here for anyone who is interested in it.
Hopefully, someone will come up with a better solution and share it here.
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
#define TEST_ARG_TYPE wchar_t
#else /* not windows, unicode */
#define TEST_ARG_TYPE char
#endif /* windows, unicode */
#ifndef _O_U16TEXT
#define _O_U16TEXT 0x20000
#endif
static size_t countValidUtf8Bytes(const unsigned char * buf, const size_t size) {
size_t i, charSize;
const unsigned char * src = buf;
for (i = 0; i < size && (*src) != 0; i += charSize, src += charSize) {
charSize = 0;
if ((*src) >= 0xFC) {
charSize = 6;
} else if ((*src) >= 0xF8) {
charSize = 5;
} else if ((*src) >= 0xF0) {
charSize = 4;
} else if ((*src) >= 0xE0) {
charSize = 3;
} else if ((*src) >= 0xC0) {
charSize = 2;
} else if ((*src) >= 0x80) {
/* Skip continuation bytes of a UTF-8 character (should never happen). */
for (; (i + charSize) < size && src[charSize] != 0 && src[charSize] >= 0x80; charSize++) {
/* nothing to do; the loop increment skips each continuation byte */
}
} else {
/* ASCII character. */
charSize = 1;
}
if ((i + charSize) > size) break;
}
return i;
}
#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
#include <locale>
#include <streambuf>
#include <boost/locale.hpp>
extern "C" {
#include <fcntl.h>
#include <io.h>
#include <windows.h>
int _CRT_glob;
extern void __wgetmainargs(int *, wchar_t ***, wchar_t ***, int, int *);
}
class Utf8ToUtf16Buffer : public std::basic_streambuf< char, std::char_traits<char> > {
private:
char * outBuf;
FILE * outFd;
public:
static const size_t BUFFER_SIZE = 1024;
typedef std::char_traits<char> traits_type;
typedef traits_type::int_type int_type;
typedef traits_type::pos_type pos_type;
typedef traits_type::off_type off_type;
explicit Utf8ToUtf16Buffer(FILE * o) : outBuf(new char[BUFFER_SIZE]), outFd(o) {
/* Initialize the put pointer. Overflow won't get called until this
* buffer is filled up, so we need to use valid pointers.
*/
this->setp(outBuf, outBuf + BUFFER_SIZE - 1);
}
~Utf8ToUtf16Buffer() {
delete[] outBuf;
}
protected:
virtual int_type overflow(int_type c);
virtual int_type sync();
};
Utf8ToUtf16Buffer::int_type Utf8ToUtf16Buffer::overflow(Utf8ToUtf16Buffer::int_type c) {
char * iBegin = this->outBuf;
char * iEnd = this->pptr();
int_type result = traits_type::not_eof(c);
/* If this is the end, add an eof character to the buffer.
* This is why the pointers passed to setp are off by 1
* (to reserve room for this).
*/
if ( ! traits_type::eq_int_type(c, traits_type::eof()) ) {
*iEnd = traits_type::to_char_type(c);
iEnd++;
}
/* Calculate output data length. */
int_type iLen = static_cast<int_type>(iEnd - iBegin);
int_type iLenU8 = static_cast<int_type>(
countValidUtf8Bytes(reinterpret_cast<const unsigned char *>(iBegin), static_cast<size_t>(iLen))
);
/* Convert string to UTF-16 and write to defined file descriptor. */
if (fwprintf(this->outFd, L"%ls", boost::locale::conv::utf_to_utf<wchar_t>(std::string(outBuf, outBuf + iLenU8)).c_str()) < 0) {
/* Failed to write data to output file descriptor. */
result = traits_type::eof();
}
/* Reset the put pointers to indicate that the buffer is free. */
if (iLenU8 == iLen) {
this->setp(outBuf, outBuf + BUFFER_SIZE - 1);
} else {
/* Move incomplete UTF-8 characters remaining in buffer. */
const size_t overhead = static_cast<size_t>(iLen - iLenU8);
memmove(outBuf, outBuf + iLenU8, overhead);
this->setp(outBuf + overhead, outBuf + BUFFER_SIZE - 1);
}
return result;
}
Utf8ToUtf16Buffer::int_type Utf8ToUtf16Buffer::sync() {
return traits_type::eq_int_type(this->overflow(traits_type::eof()), traits_type::eof()) ? -1 : 0;
}
#endif /* windows, unicode */
int test_main(int argc, TEST_ARG_TYPE ** argv);
#if defined(_WIN32) && defined(_UNICODE) && (defined(__MSVCRT__) ||defined(_MSC_VER))
int main(/*int argc, char ** argv*/) {
wchar_t ** wenpv, ** wargv;
int wargc, si = 0;
/* this also creates the global variable __wargv */
__wgetmainargs(&wargc, &wargv, &wenpv, _CRT_glob, &si);
/* enable UTF-16 output to standard output console */
_setmode(_fileno(stdout), _O_U16TEXT);
std::locale::global(boost::locale::generator().generate("UTF-8"));
Utf8ToUtf16Buffer u8cout(stdout);
std::streambuf * out = std::cout.rdbuf();
std::cout.rdbuf(&u8cout);
/* process user defined main function */
const int result = test_main(wargc, wargv);
/* revert stream buffers to let cout clean up remaining memory correctly */
std::cout.rdbuf(out);
return result;
#else /* not windows or unicode */
int main(int argc, char ** argv) {
return test_main(argc, argv);
#endif /* windows, unicode */
}
int test_main(int /*argc*/, TEST_ARG_TYPE ** /*argv*/) {
const std::string str("\x61\x62\x63\xC3\xA4\xC3\xB6\xC3\xBC\xE3\x81\x82\xE3\x81\x88\xE3\x81\x84\xE3\x82\xA2\xE3\x82\xA8\xE3\x82\xA4\xE4\xBA\x9C\xE6\xB1\x9F\xE6\x84\x8F");
for (size_t i = 1; i <= str.size(); i++) {
const std::string part(str.begin(), str.begin() + i);
const size_t validByteCount = countValidUtf8Bytes(reinterpret_cast<const unsigned char *>(part.c_str()), part.size());
wprintf(L"i = %u, v = %u\n", static_cast<unsigned>(i), static_cast<unsigned>(validByteCount));
const std::string valid(str.begin(), str.begin() + validByteCount);
std::cout << valid << std::endl;
std::cout.flush();
for (size_t j = 0; j < part.size(); j++) {
wprintf(L"%02X", static_cast<int>(part[j]) & 0xFF);
}
wprintf(L"\n");
}
return EXIT_SUCCESS;
}
I feel like this is probably a bad idea, but I think it should still be seen, as it does work provided that the console has the right font.
#include <iostream>
#include <windows.h>
//#include <io.h>
//#include <fcntl.h>
std::wstring UTF8ToUTF16(const char* utf8)
{
std::wstring utf16;
int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
if (len > 1)
{
utf16.resize(len);
MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &utf16[0], len);
utf16.resize(len - 1); // drop the terminating null written by the conversion
}
return utf16;
}
std::ostream& operator << (std::ostream& os, const char* data)
{
//_setmode(_fileno(stdout), _O_U16TEXT);
SetConsoleCP(1200);
std::wstring str = UTF8ToUTF16(data);
DWORD slen = str.size();
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), slen, &slen, nullptr);
MessageBoxW(NULL, str.c_str(), L"", 0);
return os;
}
std::ostream& operator << (std::ostream& os, const std::string& data)
{
//_setmode(_fileno(stdout), _O_U16TEXT);
SetConsoleCP(1200);
std::wstring str = UTF8ToUTF16(&data[0]);
DWORD slen = str.size();
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), slen, &slen, nullptr);
return os;
}
std::wostream& operator <<(std::wostream& os, const wchar_t* data)
{
DWORD slen = wcslen(data);
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data, slen, &slen, nullptr);
return os;
}
std::wostream& operator <<(std::wostream& os, const std::wstring& data)
{
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), nullptr, nullptr);
return os;
}
int main()
{
std::cout<<"Россия";
}
And now std::cout and std::wcout both use the WriteConsoleW function. You'd have to overload it for const char*, char*, std::string, char, etc., whatever you need. Maybe template it.
I have gotten a WCHAR[MAX_PATH] from (PROCESSENTRY32) pe32.szExeFile on Windows. The following do not work:
std::string s;
s = pe32.szExeFile; // compile error; a cast to (const char*) doesn't work either
and
std::string s;
char DefChar = ' ';
WideCharToMultiByte(CP_ACP,0,pe32.szExeFile,-1, ch,260,&DefChar, NULL);
s = pe32.szExeFile;
For your first example you can just do:
std::wstring s(pe32.szExeFile);
and for second:
char DefChar = ' ';
WideCharToMultiByte(CP_ACP,0,pe32.szExeFile,-1, ch,260,&DefChar, NULL);
std::wstring s(pe32.szExeFile);
as std::wstring has a constructor taking wchar_t*
Your call to WideCharToMultiByte looks correct, provided ch is a sufficiently large buffer. After that, however, you want to assign the buffer (ch) to the string (or use it to construct the string), not pe32.szExeFile.
There are convenient conversion classes from ATL; you may want to use some of them, e.g.:
std::string s( CW2A(pe32.szExeFile) );
Note however that a conversion from Unicode UTF-16 to ANSI can be lossy. If you want a non-lossy conversion, you could convert from UTF-16 to UTF-8 and store the UTF-8 inside std::string.
If you don't want to use ATL, there are some convenient freely available C++ wrappers around raw Win32 WideCharToMultiByte to convert from UTF-16 to UTF-8 using STL strings.
#ifndef __STRINGCAST_H__
#define __STRINGCAST_H__
#include <vector>
#include <string>
#include <cstring>
#include <cwchar>
#include <cassert>
#include <windows.h> // needed for WideCharToMultiByte and CP_ACP
template<typename Td>
Td string_cast(const wchar_t* pSource, unsigned int codePage = CP_ACP);
#endif // __STRINGCAST_H__
template<>
std::string string_cast( const wchar_t* pSource, unsigned int codePage )
{
assert(pSource != 0);
size_t sourceLength = std::wcslen(pSource);
if(sourceLength > 0)
{
int length = ::WideCharToMultiByte(codePage, 0, pSource, sourceLength, NULL, 0, NULL, NULL);
if(length == 0)
return std::string();
std::vector<char> buffer( length );
::WideCharToMultiByte(codePage, 0, pSource, sourceLength, &buffer[0], length, NULL, NULL);
return std::string(buffer.begin(), buffer.end());
}
else
return std::string();
}
and use this template as follows:
PWSTR CurWorkDir;
std::string CurWorkLogFile;
CurWorkDir = new WCHAR[length];
CurWorkLogFile = string_cast<std::string>(CurWorkDir);
....
delete [] CurWorkDir;
So we have this function:
std::string url_encode_wstring(const std::wstring &input)
{
std::string output;
int cbNeeded = WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1, NULL, 0, NULL, NULL);
if (cbNeeded > 0) {
char *utf8 = new char[cbNeeded];
if (WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1, utf8, cbNeeded, NULL, NULL) != 0) {
for (char *p = utf8; *p; p++) {
char onehex[5];
_snprintf(onehex, sizeof(onehex), "%%%02.2X", (unsigned char)*p);
output.append(onehex);
}
}
delete[] utf8;
}
return output;
}
It's great for Windows, but I wonder how (and whether it is possible) to make it work under Linux?
IMHO you should use a portable character codec library.
Here's an example of minimal portable code using iconv, which should be more than enough.
It's supposed to work on Windows (if it does, you can get rid of your Windows-specific code altogether).
I follow the GNU guidelines not to use the wcstombs & co functions ( https://www.gnu.org/s/hello/manual/libc/iconv-Examples.html )
Depending on the use case, handle errors appropriately... and to enhance performance, you can create a class out of it.
#include <iostream>
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <stdexcept>
std::string wstring_to_utf8_string(const std::wstring &input)
{
size_t in_size = input.length() * sizeof(wchar_t);
char * in_buf = (char*)input.data();
size_t buf_size = input.length() * 6; // pessimistic: max UTF-8 char size
char * buf = new char[buf_size];
memset(buf, 0, buf_size);
char * out_buf(buf);
size_t out_size(buf_size);
iconv_t conv_desc = iconv_open("UTF-8", "wchar_t");
if (conv_desc == iconv_t(-1))
throw std::runtime_error(std::string("Could not open iconv: ") + strerror(errno));
size_t iconv_value = iconv(conv_desc, &in_buf, &in_size, &out_buf, &out_size);
if (iconv_value == size_t(-1))
throw std::runtime_error(std::string("When converting: ") + strerror(errno));
int ret = iconv_close(conv_desc);
if (ret != 0)
throw std::runtime_error(std::string("Could not close iconv: ") + strerror(errno));
std::string s(buf);
delete [] buf;
return s;
}
int main() {
std::wstring in(L"hello world");
std::wcout << L"input: [" << in << L"]" << std::endl;
std::string out(wstring_to_utf8_string(in));
std::cerr << "output: [" << out << "]" << std::endl;
return 0;
}