c++ can't convert string to wstring - c++

I would like to convert a string variable to wstring because some German characters cause problems when I call substr on the variable. The start position is wrong whenever one of these special characters appears before it. (For instance, for "ä" size() returns 2 instead of 1.)
I know that the following conversion works:
wstring ws = L"ä";
Since I am trying to convert a variable, I would like to know if there is an alternative way to do it, such as
wstring wstr = L"%s"+str // this is syntactically wrong, but I want something like it
Besides that, I have already tried the following example to convert string to wstring:
string foo("ä");
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo.data());
cout << foo.size() << endl;
cout << wfoo.size() << endl;
but I get errors like
‘wstring_convert’ was not declared in this scope
I am using ubuntu 14.04 and my main.cpp is compiled with cmake. Thanks for your help!
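To illustrate the size() problem described above, here is a minimal sketch, assuming the string holds UTF-8 (the usual case on Ubuntu). The helper names utf8_length and utf8_substr are hypothetical and not from the question; they count code points rather than bytes, and they do not validate the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Hypothetical helper: like substr, but pos and len are measured in
// code points instead of bytes. Assumes valid UTF-8.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t len) {
    std::size_t index = 0;          // index of the current code point
    std::size_t begin = s.size();   // byte offset where the result starts
    std::size_t end = s.size();     // byte offset just past the result
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) continue;
        if (index == pos) begin = i;
        if (index == pos + len) { end = i; break; }
        ++index;
    }
    return begin < s.size() ? s.substr(begin, end - begin) : std::string();
}
```

With the bytes "\xC3\xA4" (the UTF-8 encoding of "ä"), size() reports 2 while utf8_length reports 1, which is the off-by-one the question runs into with substr.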

The solution from "hahakubile" worked for me:
std::wstring s2ws(const std::string& s) {
std::string curLocale = setlocale(LC_ALL, "");
const char* _Source = s.c_str();
size_t _Dsize = mbstowcs(NULL, _Source, 0) + 1;
wchar_t *_Dest = new wchar_t[_Dsize];
wmemset(_Dest, 0, _Dsize);
mbstowcs(_Dest,_Source,_Dsize);
std::wstring result = _Dest;
delete []_Dest;
setlocale(LC_ALL, curLocale.c_str());
return result;
}
But the return value is not 100% correct:
string s = "101446012MaßnStörfall PAt #Maßnahme Störfall 00810000100121000102000020100000000000000";
wstring ws2 = s2ws(s);
cout << ws2.size() << endl; // returns 110 which is correct
wcout << ws2.substr(29,40) << endl; // returns #Ma�nahme St�rfall with symbols
I am wondering why it replaced the German characters with symbols.
Thanks again!

If you are using Windows/Visual Studio and need to convert a string to wstring you should use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java).
CA2W ca2w(str, CP_UTF8);
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable the security warning #4995; I don't know of any other workaround (it happened to me when I compiled for Windows XP in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article, Joel's article is "while entertaining, it is pretty light on actual technical details". Article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.

The main point is that
string foo("ä")
is already an error. Start from here and read all the answers. And beware, one is very wrong :)

Related

Reading an input of mixed unicode characters and integers [duplicate]

I am asking for a code snippet that reads a Unicode text with cin, concatenates another Unicode text to it, and then writes the result with cout.
P.S. This code will help me solve another, bigger problem with Unicode, but the key thing first is to accomplish what I ask here.
ADDED: BTW, I can't type any Unicode symbol on the command line when I run the executable file. How should I do that?
I had a similar problem in the past, in my case imbue and sync_with_stdio did the trick. Try this:
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
ios_base::sync_with_stdio(false);
wcin.imbue(locale("en_US.UTF-8"));
wcout.imbue(locale("en_US.UTF-8"));
wstring s;
wstring t(L" la Polynésie française");
wcin >> s;
wcout << s << t << endl;
return 0;
}
It depends on what type of Unicode you mean. I assume you mean you are just working with std::wstring, though. In that case use std::wcin and std::wcout.
For conversion between encodings you can use your OS functions, such as WideCharToMultiByte and MultiByteToWideChar on Win32, or you can use a library like libiconv.
Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work (but only if stdin/stdout aren't redirected). Note that you still need a font that contains the character you want to show (Lucida Console supports at least Greek and Cyrillic). Note that everything here is completely non-portable, there is just no portable way to input/output Unicode strings on the terminal.
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif
#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN
#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <conio.h>
#include <windows.h>
void testIostream();
void testStdio();
void testConio();
void testWindows();
int wmain() {
testIostream();
testStdio();
testConio();
testWindows();
std::system("pause");
}
void testIostream() {
std::wstring first, second;
std::getline(std::wcin, first);
if (!std::wcin.good()) return;
std::getline(std::wcin, second);
if (!std::wcin.good()) return;
std::wcout << first << second << std::endl;
}
void testStdio() {
wchar_t buffer[0x1000];
if (!_getws_s(buffer)) return;
const std::wstring first = buffer;
if (!_getws_s(buffer)) return;
const std::wstring second = buffer;
const std::wstring result = first + second;
_putws(result.c_str());
}
void testConio() {
wchar_t buffer[0x1000];
std::size_t numRead = 0;
if (_cgetws_s(buffer, &numRead)) return;
const std::wstring first(buffer, numRead);
if (_cgetws_s(buffer, &numRead)) return;
const std::wstring second(buffer, numRead);
const std::wstring result = first + second + L'\n';
_cputws(result.c_str());
}
void testWindows() {
const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
WCHAR buffer[0x1000];
DWORD numRead = 0;
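// ReadConsoleW expects the buffer size in characters, not bytes.
if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof buffer[0], &numRead, NULL)) return;
const std::wstring first(buffer, numRead - 2); // drop the trailing CR LF
if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof buffer[0], &numRead, NULL)) return;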
const std::wstring second(buffer, numRead);
const std::wstring result = first + second;
const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD numWritten = 0;
WriteConsoleW(stdOut, result.c_str(), result.size(), &numWritten, NULL);
}
Edit 1: I've added a method based on conio.
Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had wgets interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate this a bit further during the weekend.
If you have actual text (i.e., a string of logical characters), then insert to the wide streams instead. The wide streams will automatically encode your characters to match the bits expected by the locale encoding. (And if you have encoded bits instead, the streams will decode the bits, then re-encode them to match the locale.)
There is a lesser solution if you KNOW you have UTF-encoded bits (i.e., an array of bits intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream is expecting that very same bit-format, then you can skip the decoding and re-encoding steps and write() the bits as-is. This only works when you know both sides use the same encoding format, which may be the case for small utilities not intended to communicate with processes in other locales.
It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.

Coding a path in unicode c++

I had a problem opening files whose path contains UTF-8 characters (like Cyrillic or Latin ones). I found a way to solve that with _wfopen, but it only worked when I encoded the characters by hand as escapes (\Uxxxx).
Is there a function, macro or anything else that returns the Unicode form when I supply the string (path)?
Something like this:
https://www.branah.com/unicode-converter
I tried MultiByteToWideChar, but it returns some hex numbers that are not relevant.
Tried:
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed Boost, and now I am trying to do it with Boost. Can someone help me out with Boost?
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string s = "test";
std::cout << std::hex << std::setfill('0');
std::cout << "Input `char` data: ";
for (char c : s) {
std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
}
std::cout << '\n';
std::wstring ws = convert.from_bytes(s);
std::cout << "Output `wchar_t` data: ";
for (wchar_t wc : ws) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above: there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input char data: c4 87
Output wchar_t data: 0107
This is in contrast to if you use test strings such as:
std::string s = "ć";
Or
std::string s = "\u0107";
Which may result in the following output:
Input char data: 3f
Output wchar_t data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
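For checking what a byte sequence decodes to without relying on the compiler's literal handling, here is a small hand-written decoder. This is a swapped-in technique, not the wstring_convert approach from the answer, and decode_utf8 is a hypothetical name. It is a sketch: it rejects truncated sequences and bad lead or continuation bytes, but it does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on truncated or malformed sequences.
// Simplification: overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= bytes.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Feeding it "\xC4\x87" yields the single code point 0x0107, whereas a lone "?" (0x3f) decodes to 0x003f, which matches the two outputs shown above.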
So I have a path: wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
wchar_t path[16] = L"čaćšžđ\\test.txt";
std::cout << std::hex << std::setfill('0');
for (wchar_t wc : path) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
for (wchar_t wc : s) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
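If the goal is only to see which code units a string holds (what the online converter in the question displays), the \uXXXX text can be produced at run time as plain ASCII. A sketch with a hypothetical helper name, to_escaped; the result is display text only, and passing it to _wfopen would not open the file, for the reason given above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders each UTF-16 code unit as the six ASCII
// characters "\uXXXX". The result is for display or logging only; it is
// not a universal character name and the compiler never sees it.
std::string to_escaped(const std::u16string& text) {
    std::string out;
    char buf[8];
    for (char16_t unit : text) {
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(unit));
        out += buf;
    }
    return out;
}
```

I used std::u16string here so the sketch does not depend on the platform's wchar_t width; on Windows a std::wstring holds the same UTF-16 code units.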
Also, you're using _T() incorrectly. IMO you shouldn't be using TCHAR and the related macros at all, but if you do use them then you ought to use them consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to allow code to be independent and switch between the wchar_t APIs and Microsoft's "ANSI" APIs, so using TCHAR and then hard coding an assumption that TCHAR is wchar_t defeats the entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
Your code is Windows-specific, and you're using Visual C++. So, just use wide literals. Visual C++ supports wide strings for file stream constructors.
It's as simple as that, when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
wchar_t const path[] = L"cacšžd/test.txt";
ifstream f( path );
int ch;
while( (ch = f.get()) != EOF )
{
cout.put( ch );
}
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have Boost file system library adopted into the standard library, and then for conformance g++ will ideally offer the constructor used here.
The problem was that I was saving the CPP file as ANSI; I had to convert it to UTF-8. I tried this before posting, but VS 2015 turns it back into ANSI, so I had to change the encoding inside VS to get it working.
I tried opening the cpp file with Notepad++ and changing the encoding, but when I open it in VS the encoding automatically reverts. So I looked at the Save As option, but there is no encoding option there. Finally I found it in Visual Studio 2015:
File -> Advanced Save Options, then in the Encoding dropdown change it to Unicode.
One thing is still strange to me: how did VS display the characters normally when the file opened in N++ showed ? (as it was supposed to, because of ANSI)?

Is it possible to concatenate string and wstring? [duplicate]

string s = "おはよう";
wstring ws = FUNCTION(s, ws);
How would I assign the contents of s to ws?
I searched Google and tried some techniques, but they can't assign the exact content. The content is distorted.
Assuming that the input string in your example (おはよう) is a UTF-8 encoded (which it isn't, by the looks of it, but let's assume it is for the sake of this explanation :-)) representation of a Unicode string of your interest, then your problem can be fully solved with the standard library (C++11 and newer) alone.
The TL;DR version:
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
Longer online compilable and runnable example:
(They all show the same example. There are just many for redundancy...)
http://ideone.com/KA1oty
http://ide.geeksforgeeks.org/5pRLSh
http://rextester.com/DIJZK52174
Note (old):
As pointed out in the comments and explained in https://stackoverflow.com/a/17106065/6345 there are cases when using the standard library to convert between UTF-8 and UTF-16 might give unexpected differences in the results on different platforms. For a better conversion, consider std::codecvt_utf8 as described on http://en.cppreference.com/w/cpp/locale/codecvt_utf8
Note (new):
Since the codecvt header is deprecated in C++17, some worries about the solution presented in this answer were raised. However, the C++ standards committee added an important statement in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0618r0.html saying
this library component should be retired to Annex D, along side <codecvt>, until a suitable replacement is standardized.
So in the foreseeable future, the codecvt solution in this answer is safe and portable.
int StringToWString(std::wstring &ws, const std::string &s)
{
std::wstring wsTmp(s.begin(), s.end());
ws = wsTmp;
return 0;
}
Your question is underspecified. Strictly, that example is a syntax error. However, std::mbstowcs is probably what you're looking for.
It is a C-library function and operates on buffers, but here's an easy-to-use idiom, courtesy of Mooing Duck:
std::wstring ws(s.size(), L' '); // Overestimate number of code points.
ws.resize(std::mbstowcs(&ws[0], s.c_str(), s.size())); // Shrink to fit.
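One caveat with the idiom above: std::mbstowcs returns (size_t)-1 when the input is not valid in the current locale, and passing that to resize would throw a length error. A sketch of a wrapper that checks for this; the name widen_current_locale is hypothetical, and it assumes the caller has already selected a suitable locale with std::setlocale (in the default "C" locale only ASCII is guaranteed to convert).

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around std::mbstowcs that reports invalid input.
// Assumption: the caller has already chosen the locale via std::setlocale.
std::wstring widen_current_locale(const std::string& s) {
    // One wide character per input byte is an upper bound on the result.
    std::wstring ws(s.size(), L'\0');
    if (s.empty()) return ws;
    const std::size_t written = std::mbstowcs(&ws[0], s.c_str(), ws.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    }
    ws.resize(written);  // shrink to the number of wide characters produced
    return ws;
}
```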
If you are using Windows/Visual Studio and need to convert a string to wstring you could use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java). A standard way of converting a std::wstring to a UTF-8 std::string is shown in this answer.
//
// using ATL
CA2W ca2w(str, CP_UTF8);
//
// or the standard way taken from the answer above
#include <codecvt>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable the security warning #4995; I don't know of any other workaround (it happened to me when I compiled for Windows XP in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article, Joel's article is "while entertaining, it is pretty light on actual technical details". Article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.
Windows API only, pre C++11 implementation, in case someone needs it:
#include <stdexcept>
#include <vector>
#include <windows.h>
using std::runtime_error;
using std::string;
using std::vector;
using std::wstring;
wstring utf8toUtf16(const string & str)
{
if (str.empty())
return wstring();
size_t charsNeeded = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), NULL, 0);
if (charsNeeded == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
vector<wchar_t> buffer(charsNeeded);
int charsConverted = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), &buffer[0], buffer.size());
if (charsConverted == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
return wstring(&buffer[0], charsConverted);
}
Here's a way to combine string, wstring and mixed string constants into a wstring. Use the wstringstream class.
This does NOT work for multi-byte character encodings. This is just a dumb way of throwing away type safety and expanding 7-bit characters from std::string into the lower 7 bits of each character of std::wstring. It is only useful if you have 7-bit ASCII strings and you need to call an API that requires wide strings.
#include <sstream>
std::string narrow = "narrow";
std::wstring wide = L"wide";
std::wstringstream cls;
cls << " abc " << narrow.c_str() << L" def " << wide.c_str();
std::wstring total= cls.str();
From char* to wstring:
const char* str = "hello worlddd"; // string literals are const; strlen needs <cstring>
wstring wstr (str, str+strlen(str));
From string to wstring:
string str = "hello worlddd";
wstring wstr (str.begin(), str.end());
Note this only works well if the string being converted contains only ASCII characters.
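Since the iterator-range constructor simply widens each byte, a cheap guard is to verify the ASCII assumption first. A sketch with a hypothetical helper name, widen_ascii, which refuses non-ASCII input rather than silently producing wrong characters:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widens byte by byte, but throws if any byte is
// outside 7-bit ASCII instead of producing garbage for multi-byte input.
std::wstring widen_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) throw std::invalid_argument("input is not 7-bit ASCII");
    }
    return std::wstring(s.begin(), s.end());
}
```

For anything that can contain non-ASCII text, one of the real conversions in the other answers is needed instead.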
This variant of it is my favourite in real life. It converts the input, if it is valid UTF-8, to the respective wstring. If the input is corrupted, the wstring is constructed out of the single bytes. This is extremely helpful if you cannot really be sure about the quality of your input data.
std::wstring convert(const std::string& input)
{
try
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
return converter.from_bytes(input);
}
catch(std::range_error& e)
{
size_t length = input.length();
std::wstring result;
result.reserve(length);
for(size_t i = 0; i < length; i++)
{
result.push_back(input[i] & 0xFF);
}
return result;
}
}
using Boost.Locale:
ws = boost::locale::conv::utf_to_utf<wchar_t>(s);
You can use boost path or std path, which is a lot easier.
boost path is easier for cross-platform applications:
#include <boost/filesystem/path.hpp>
namespace fs = boost::filesystem;
//s to w
std::string s = "xxx";
auto w = fs::path(s).wstring();
//w to s
std::wstring w = L"xxx";
auto s = fs::path(w).string();
If you prefer to use std:
#include <filesystem>
namespace fs = std::filesystem;
//The same
For older C++ versions:
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
//The same
The code inside still implements a converter, but you don't have to unravel its details.
For me the most uncomplicated option without big overhead is:
Include:
#include <atlbase.h>
#include <atlconv.h>
Convert:
const char* whatever = "test1234"; // string literals are const
std::wstring lwhatever = std::wstring(CA2W(std::string(whatever).c_str()));
If needed:
lwhatever.c_str();
String to wstring
std::wstring Str2Wstr(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
wstring to String
std::string Wstr2Str(const std::wstring& wstr)
{
typedef std::codecvt_utf8<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
return converterX.to_bytes(wstr);
}
If you have Qt and are too lazy to implement a function yourself, you can use
std::string str;
QString::fromStdString(str).toStdWString()
Here is my super basic solution that might not work for everyone, but it would work for a lot of people.
It requires the Guidelines Support Library (GSL),
which is a pretty official C++ library designed by many C++ committee authors:
https://github.com/isocpp/CppCoreGuidelines
https://github.com/Microsoft/GSL
std::string to_string(std::wstring const & wStr)
{
std::string temp = {};
for (wchar_t const & wCh : wStr)
{
// If the string can't be converted gsl::narrow will throw
temp.push_back(gsl::narrow<char>(wCh));
}
return temp;
}
All my function does is allow the conversion if it is possible; otherwise it throws an exception,
via the usage of gsl::narrow (https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#es49-if-you-must-use-a-cast-use-a-named-cast).
The s2ws method works well. Hope it helps.
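#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>
std::wstring s2ws(const std::string& s) {
// Remember the current locale, then switch to the user's environment locale.
const char* previous = setlocale(LC_ALL, NULL);
std::string oldLocale = previous ? previous : "C";
setlocale(LC_ALL, "");
std::wstring result;
// The first call only measures; it returns (size_t)-1 on an invalid sequence.
size_t length = mbstowcs(NULL, s.c_str(), 0);
if (length != static_cast<size_t>(-1)) {
std::vector<wchar_t> buffer(length + 1, L'\0');
mbstowcs(&buffer[0], s.c_str(), length + 1);
result.assign(&buffer[0], length);
}
// Restore the locale that was active before the call.
setlocale(LC_ALL, oldLocale.c_str());
return result;
}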
Based on my own testing (on Windows 8, VS2010), mbstowcs can actually damage the original string; it works only with the ANSI code page. MultiByteToWideChar/WideCharToMultiByte can also cause string corruption, but they tend to replace characters they don't know with '?' question marks, whereas mbstowcs tends to stop when it encounters an unknown character and cuts the string at that very point. (I tested Vietnamese characters on a Finnish Windows.)
So prefer the Multi* Windows API functions over the analogous ANSI C functions.
I have also noticed that the shortest way to convert a string from one codepage to another is not to call MultiByteToWideChar/WideCharToMultiByte directly but to use their analogous ATL macros: W2A / A2W.
So the analogous function to the one mentioned above would look like:
wstring utf8toUtf16(const string & str)
{
USES_CONVERSION;
_acp = CP_UTF8;
return A2W( str.c_str() );
}
_acp is declared in USES_CONVERSION macro.
Here is another function, which I often need when converting old data to a new format:
string ansi2utf8( const string& s )
{
USES_CONVERSION;
_acp = CP_ACP;
wchar_t* pw = A2W( s.c_str() );
_acp = CP_UTF8;
return W2A( pw );
}
But please note that these macros use the stack heavily: don't use them inside loops or recursive functions. After using the W2A or A2W macro, it is better to return as soon as possible, so that the stack is freed from the temporary conversion.
std::string -> wchar_t[] with safe mbstowcs_s function:
auto ws = std::make_unique<wchar_t[]>(s.size() + 1);
mbstowcs_s(nullptr, ws.get(), s.size() + 1, s.c_str(), s.size());
This is from my sample code.
Use this code to convert your string to wstring:
std::wstring string2wString(const std::string& s){
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
int main(){
std::string str="your string"; // the input must be a narrow string
std::wstring wStr=string2wString(str);
return 0;
}
string s = "おはよう"; is an error.
You should use wstring directly:
wstring ws = L"おはよう";

C++ Convert string (or char*) to wstring (or wchar_t*)

string s = "おはよう";
wstring ws = FUNCTION(s, ws);
How would i assign the contents of s to ws?
Searched google and used some techniques but they can't assign the exact content. The content is distorted.
Assuming that the input string in your example (おはよう) is a UTF-8 encoded (which it isn't, by the looks of it, but let's assume it is for the sake of this explanation :-)) representation of a Unicode string of your interest, then your problem can be fully solved with the standard library (C++11 and newer) alone.
The TL;DR version:
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
Longer online compilable and runnable example:
(They all show the same example. There are just many for redundancy...)
http://ideone.com/KA1oty
http://ide.geeksforgeeks.org/5pRLSh
http://rextester.com/DIJZK52174
Note (old):
As pointed out in the comments and explained in https://stackoverflow.com/a/17106065/6345 there are cases when using the standard library to convert between UTF-8 and UTF-16 might give unexpected differences in the results on different platforms. For a better conversion, consider std::codecvt_utf8 as described on http://en.cppreference.com/w/cpp/locale/codecvt_utf8
Note (new):
Since the codecvt header is deprecated in C++17, some worries about the solution presented in this answer were raised. However, the C++ standards committee added an important statement in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0618r0.html saying
this library component should be retired to Annex D, along side <codecvt>, until a suitable replacement is standardized.
So in the foreseeable future, the codecvt solution in this answer is safe and portable.
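As a self-contained sketch of the TL;DR snippet, the two conversions can be wrapped in small functions. The names widen_utf8 and narrow_utf8 are made up for this illustration, and the input is assumed to be valid UTF-8:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> wide string (UTF-16 code units).
std::wstring widen_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: wide string (UTF-16 code units) -> UTF-8 bytes.
std::string narrow_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.to_bytes(wide);
}
```

For example, the UTF-8 string "\xC3\xA4" (the letter ä) has size() 2 as a std::string, while widen_utf8 turns it into a std::wstring of size() 1. Compilers targeting C++17 or later may emit deprecation warnings for these components.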
// Widens each char individually, so this is correct only for 7-bit ASCII input.
int StringToWString(std::wstring &ws, const std::string &s)
{
std::wstring wsTmp(s.begin(), s.end());
ws = wsTmp;
return 0;
}
Your question is underspecified. Strictly, that example is a syntax error. However, std::mbstowcs is probably what you're looking for.
It is a C-library function and operates on buffers, but here's an easy-to-use idiom, courtesy of Mooing Duck:
std::wstring ws(s.size(), L' '); // Overestimate number of code points.
ws.resize(std::mbstowcs(&ws[0], s.c_str(), s.size())); // Shrink to fit.
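One thing the two-line idiom does not handle is failure: std::mbstowcs returns (size_t)-1 when the input is not valid in the current C locale, and passing that value to resize() would not end well. A hedged sketch with an explicit check follows; the function name widen_with_locale is an assumption made for this example:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts s using the current C locale.
// Returns false (and leaves out untouched) if the bytes are not valid
// in that locale.
bool widen_with_locale(const std::string& s, std::wstring& out) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring ws(s.size(), L'\0');
    const std::size_t converted = std::mbstowcs(&ws[0], s.c_str(), s.size());
    if (converted == static_cast<std::size_t>(-1)) {
        return false;  // invalid multibyte sequence for this locale
    }
    ws.resize(converted);  // shrink to the number of characters written
    out.swap(ws);
    return true;
}
```

The result depends on the locale in effect, so a program that expects UTF-8 input typically has to select a UTF-8 locale with setlocale before calling it.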
If you are using Windows/Visual Studio and need to convert a string to wstring you could use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF-8 (that's pretty nice when working with JNI/Java). A standard way of converting a std::wstring to a UTF-8 std::string is shown in this answer.
//
// using ATL
CA2W ca2w(str, CP_UTF8);
//
// or the standard way taken from the answer above
#include <codecvt>
#include <locale> // std::wstring_convert
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable warning 4995. I don't know of another workaround (it happened to me when I compiled for Windows XP in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to another article, Joel's article, "while entertaining, [...] is pretty light on actual technical details". That article is: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.
Windows API only, pre C++11 implementation, in case someone needs it:
#include <stdexcept>
#include <string>
#include <vector>
#include <windows.h>
using std::runtime_error;
using std::string;
using std::vector;
using std::wstring;
wstring utf8toUtf16(const string & str)
{
if (str.empty())
return wstring();
size_t charsNeeded = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), NULL, 0);
if (charsNeeded == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
vector<wchar_t> buffer(charsNeeded);
int charsConverted = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), &buffer[0], buffer.size());
if (charsConverted == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
return wstring(&buffer[0], charsConverted);
}
Here's a way to combine string, wstring and mixed string constants into a wstring. Use the wstringstream class.
This does NOT work for multi-byte character encodings. It is just a dumb way of throwing away type safety and expanding 7-bit characters from a std::string into the lower 7 bits of each character of a std::wstring. It is only useful if you have 7-bit ASCII strings and you need to call an API that requires wide strings.
#include <sstream>
std::string narrow = "narrow";
std::wstring wide = L"wide";
std::wstringstream cls;
cls << " abc " << narrow.c_str() << L" def " << wide.c_str();
std::wstring total= cls.str();
From char* to wstring:
const char* str = "hello worlddd"; // string literals are const in C++
wstring wstr (str, str+strlen(str)); // strlen needs <cstring>
From string to wstring:
string str = "hello worlddd";
wstring wstr (str.begin(), str.end());
Note this only works well if the string being converted contains only ASCII characters.
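To make the ASCII-only restriction explicit rather than silent, the iterator-range constructor can be guarded by a check. This is only a sketch; the function name widen_ascii and the choice of exception are assumptions for this example:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: widens a string byte by byte, but refuses input
// that contains anything outside 7-bit ASCII, because per-byte widening
// would turn multi-byte sequences (e.g. UTF-8) into garbage.
std::wstring widen_ascii(const std::string& s) {
    for (unsigned char byte : s) {
        if (byte > 0x7F) {
            throw std::invalid_argument("widen_ascii: non-ASCII byte in input");
        }
    }
    return std::wstring(s.begin(), s.end());
}
```

With this guard, a UTF-8 encoded "ä" (two bytes) is rejected instead of being widened into two unrelated wide characters.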
This variant of it is my favourite in real life. It converts the input, if it is valid UTF-8, to the respective wstring. If the input is corrupted, the wstring is constructed out of the single bytes. This is extremely helpful if you cannot really be sure about the quality of your input data.
std::wstring convert(const std::string& input)
{
try
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
return converter.from_bytes(input);
}
catch(std::range_error& e)
{
size_t length = input.length();
std::wstring result;
result.reserve(length);
for(size_t i = 0; i < length; i++)
{
result.push_back(input[i] & 0xFF);
}
return result;
}
}
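If a fixed placeholder is acceptable instead of the byte-wise fallback, std::wstring_convert also has a constructor that takes error strings; from_bytes then returns the wide error string instead of throwing std::range_error. The following is a sketch of that variation, not the method from the answer above, and the name widen_or_default is made up:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: returns fallback when the facet reports a
// conversion error, instead of throwing std::range_error.
std::wstring widen_or_default(const std::string& utf8,
                              const std::wstring& fallback) {
    // First argument: error string for to_bytes (unused here).
    // Second argument: error string returned by from_bytes on failure.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter(
        std::string(), fallback);
    return converter.from_bytes(utf8);
}
```

Exactly which inputs are reported as errors is decided by the facet implementation, so truncated sequences may be treated differently across standard libraries.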
Using Boost.Locale:
ws = boost::locale::conv::utf_to_utf<wchar_t>(s);
You can use boost path or std path, which is a lot easier.
boost path is easier for cross-platform applications:
#include <boost/filesystem/path.hpp>
namespace fs = boost::filesystem;
//s to w
std::string s = "xxx";
auto w = fs::path(s).wstring();
//w to s
std::wstring w = L"xxx";
auto s = fs::path(w).string();
If you prefer to use std:
#include <filesystem>
namespace fs = std::filesystem;
//The same
For older C++ versions:
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
//The same
The library code still implements a converter internally, but you don't have to unravel its details.
For me the most uncomplicated option without big overhead is:
Include:
#include <atlbase.h>
#include <atlconv.h>
Convert:
const char* whatever = "test1234"; // string literals are const in C++
std::wstring lwhatever = std::wstring(CA2W(std::string(whatever).c_str()));
If needed:
lwhatever.c_str();
String to wstring
std::wstring Str2Wstr(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
wstring to String
std::string Wstr2Str(const std::wstring& wstr)
{
typedef std::codecvt_utf8<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
return converterX.to_bytes(wstr);
}
If you have Qt and you are too lazy to implement a function yourself, you can use
std::string str;
QString::fromStdString(str).toStdWString()
Here is my super basic solution, which might not work for everyone but would work for a lot of people.
It requires the Guidelines Support Library (GSL),
a pretty official C++ library that was designed by many C++ committee authors:
https://github.com/isocpp/CppCoreGuidelines
https://github.com/Microsoft/GSL
std::string to_string(std::wstring const & wStr)
{
std::string temp = {};
for (wchar_t const & wCh : wStr)
{
// If the string can't be converted gsl::narrow will throw
temp.push_back(gsl::narrow<char>(wCh));
}
return temp;
}
All my function does is allow the conversion if possible and throw an exception otherwise,
via gsl::narrow (https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#es49-if-you-must-use-a-cast-use-a-named-cast).
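For readers without the GSL, the same "convert if possible, otherwise throw" idea can be sketched with the standard library alone. This is not gsl::narrow itself: it is a stricter stand-in that accepts only 7-bit ASCII, and the name narrow_ascii is made up for this example:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: narrows a wide string character by character and
// throws if any character does not fit into 7-bit ASCII.
std::string narrow_ascii(const std::wstring& wide) {
    std::string result;
    result.reserve(wide.size());
    for (wchar_t wc : wide) {
        // The cast to unsigned long also maps negative values (where
        // wchar_t is signed) to large numbers, so they are rejected too.
        if (static_cast<unsigned long>(wc) > 0x7FUL) {
            throw std::range_error("narrow_ascii: character is not ASCII");
        }
        result.push_back(static_cast<char>(wc));
    }
    return result;
}
```

Restricting the range to ASCII keeps the behaviour identical regardless of whether char is signed or unsigned on the platform.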
The method s2ws works well. Hope it helps.
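// Requires <clocale>, <cstdlib>, <string> and <vector>.
std::wstring s2ws(const std::string& s) {
    // Query the current locale first so that it can really be restored.
    const char* previous = setlocale(LC_ALL, NULL);
    const std::string savedLocale = previous ? previous : "C";
    setlocale(LC_ALL, ""); // switch to the user's environment locale
    std::wstring result;
    // The first call only measures; it returns (size_t)-1 on invalid input.
    const size_t length = mbstowcs(NULL, s.c_str(), 0);
    if (length != static_cast<size_t>(-1)) {
        std::vector<wchar_t> buffer(length + 1, L'\0');
        mbstowcs(&buffer[0], s.c_str(), length + 1);
        result.assign(&buffer[0], length);
    }
    setlocale(LC_ALL, savedLocale.c_str());
    return result;
}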
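The save-and-restore of the locale around the conversion can also be expressed as a small RAII guard, so the previous locale comes back even on an early return or an exception. This is a sketch; the class name ScopedLocale is an assumption and not part of the answer above:

```cpp
#include <clocale>
#include <string>

// Hypothetical RAII guard: remembers the current C locale, switches to
// the requested one, and restores the remembered locale on destruction.
class ScopedLocale {
public:
    explicit ScopedLocale(const char* name) {
        // Passing nullptr only queries the current locale.
        const char* current = std::setlocale(LC_ALL, nullptr);
        saved_ = current ? current : "C";
        std::setlocale(LC_ALL, name);
    }
    ~ScopedLocale() { std::setlocale(LC_ALL, saved_.c_str()); }

    ScopedLocale(const ScopedLocale&) = delete;
    ScopedLocale& operator=(const ScopedLocale&) = delete;

private:
    std::string saved_;  // copy, because setlocale may reuse its buffer
};
```

Inside a conversion function, declaring `ScopedLocale guard("");` at the top would select the environment locale for the duration of the call. Note that setlocale changes process-wide state, so this is not safe if other threads depend on the locale at the same time.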
Based on my own testing (on Windows 8, VS2010), mbstowcs can actually damage the original string; it works only with the ANSI code page. MultiByteToWideChar/WideCharToMultiByte can also corrupt a string, but they tend to replace characters they don't know with '?' question marks, whereas mbstowcs tends to stop when it encounters an unknown character and cuts the string at that very point. (I tested Vietnamese characters on a Finnish Windows.)
So prefer the Multi* Windows API functions over the analogous ANSI C functions.
I've also noticed that the shortest way to convert a string from one code page to another is not the MultiByteToWideChar/WideCharToMultiByte API calls but their ATL macro analogues: W2A / A2W.
So the analogue of the function mentioned above would look like:
wstring utf8toUtf16(const string & str)
{
USES_CONVERSION;
_acp = CP_UTF8;
return A2W( str.c_str() );
}
_acp is declared in the USES_CONVERSION macro.
Here is another function that I often miss when converting old data to a new format:
string ansi2utf8( const string& s )
{
USES_CONVERSION;
_acp = CP_ACP;
wchar_t* pw = A2W( s.c_str() );
_acp = CP_UTF8;
return W2A( pw );
}
But please note that these macros use the stack heavily: don't call them in loops or in recursive functions. After using the W2A or A2W macro, it is better to return ASAP so that the stack space used by the temporary conversion is freed.
std::string -> wchar_t[] with safe mbstowcs_s function:
auto ws = std::make_unique<wchar_t[]>(s.size() + 1);
mbstowcs_s(nullptr, ws.get(), s.size() + 1, s.c_str(), s.size());
This is from my sample code.
Use this code to convert your string to wstring:
std::wstring string2wString(const std::string& s){
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
int main(){
std::string str="your string";
std::wstring wStr=string2wString(str);
return 0;
}
string s = "おはよう"; is an error.
You should use wstring directly:
wstring ws = L"おはよう";