I'm trying to convert UTF-16 encoded strings to UCS-4.
If I understand correctly, C++11 provides this conversion through codecvt_utf16.
My code is something like:
#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>
using namespace std;
int main()
{
u16string s;
s.push_back('h');
s.push_back('e');
s.push_back('l');
s.push_back('l');
s.push_back('o');
wstring_convert<codecvt_utf16<wchar_t>, wchar_t> conv;
wstring ws = conv.from_bytes(reinterpret_cast<const char*> (s.c_str()));
wcout << ws << endl;
return 0;
}
Note: the explicit push_back calls work around the fact that my version of clang (Xcode 4.2) doesn't support Unicode string literals.
When I run the code, it terminates with an uncaught exception. Am I doing something illegal here? I thought it should work, because the const char* that I pass to wstring_convert points to UTF-16 encoded data, right? I have also considered endianness as the cause, but I have checked and that is not the case.
Two errors:
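1) The from_bytes() overload that takes a single const char* expects a null-terminated byte string, but the second byte of your data is already '\0', so the conversion stops in the middle of the first character.
2) codecvt_utf16 assumes big-endian input by default. Your system is likely little-endian, so you need to convert from UTF-16LE to UCS-4: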
#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>
using namespace std;
int main()
{
u16string s;
s.push_back('h');
s.push_back('e');
s.push_back('l');
s.push_back('l');
s.push_back('o');
wstring_convert<codecvt_utf16<wchar_t, 0x10ffff, little_endian>,
wchar_t> conv;
wstring ws = conv.from_bytes(
reinterpret_cast<const char*> (&s[0]),
reinterpret_cast<const char*> (&s[0] + s.size()));
wcout << ws << endl;
return 0;
}
Tested with Visual Studio 2010 SP1 on Windows and clang++/libc++-svn on Linux.
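For comparison, here is a minimal hand-rolled sketch of the same UTF-16 to UCS-4 step that does not go through codecvt_utf16 at all. This is not the method used in the answer above. The helper name utf16_to_ucs4 is hypothetical, and it assumes the u16string already holds code units in host byte order, so no byte reinterpretation or endianness flag is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-16 code units (host byte order)
// into UCS-4 / UTF-32 code points.
std::u32string utf16_to_ucs4(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size())
                throw std::range_error("truncated surrogate pair");
            char32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::range_error("unpaired high surrogate");
            // Combine the two 10-bit halves and add the 0x10000 offset.
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed as well
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            throw std::range_error("unpaired low surrogate");
        } else {
            // Basic Multilingual Plane: the code unit is the code point.
            out.push_back(unit);
        }
    }
    return out;
}
```

Calling utf16_to_ucs4(s) on the u16string built with push_back above should give a u32string with the same five characters. Characters outside the BMP arrive as surrogate pairs and come out as a single code point.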
Consider the following piece of code:
#include <iostream>
#include <string>
#include <codecvt>
std::wstring string_to_wstring(const std::string& str)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
return converter.from_bytes(str);
}
int main()
{
std::string str = "abcä"; // without the "ä" it works
std::wstring wstr = string_to_wstring(str);
std::wcout << wstr << L"\n";
}
This throws a "bad_conversion" exception, which seems to be caused by the umlaut: if I remove the "ä", everything works.
I found the code for the string_to_wstring function here on SO some time ago, and it has worked very well until now, mainly because I never came across any umlauts.
Can we fix this function to work with any characters? Or is there a better (more efficient/safe) way to convert between string and wstring?
#include <iostream>
using namespace std;
int main() {
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
Why doesn't this work, and how can I fix it?
It doesn't work because in the default C locale, there is no character which corresponds to U+00A2.
If you're using a standard ubuntu install, it is most likely that your user locale uses a larger character set than US-ASCII, quite possibly Unicode encoded with UTF-8. So you just need to switch to the locale specified in the environment, as follows:
#include <iostream>
/* <clocale> declares std::setlocale */
#include <clocale>
#include <string>
int main() {
/* The following switches all locale categories (LC_ALL)
 * to the locale specified by the user's environment.
 */
std::setlocale (LC_ALL, "");
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
If you use std::string instead of std::wstring and std::cout instead of std::wcout, then you don't need the setlocale because no translation is needed (provided the console expects UTF-8).
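To make the last point concrete, here is a small sketch of what "no translation" means: with std::string you hand the console the UTF-8 bytes yourself. The helper encode_utf8 is a hypothetical name introduced only for illustration, and the sketch assumes the terminal interprets its input as UTF-8.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: plain ASCII
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // Surrogates are not valid code points on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::range_error("surrogate code point");
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        throw std::range_error("code point above U+10FFFF");
    }
    return out;
}
```

With this, std::cout << encode_utf8(0xA2) writes the two bytes 0xC2 0xA2 unchanged, and a UTF-8 console should render them as the cent sign without any setlocale call.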
I have written some code, but there is a problem with it. I am trying to convert a string to a wstring, and the string contains "█" characters. This character has code 219 in my character set, which is outside 7-bit ASCII.
The conversion gives the wrong result.
In my code:
string strs= "█and█something else";
wstring wstr(strs.begin(),strs.end());
In the debugger, I get a result like this:
?and?something else
How do I correct this problem?
Thanks...
The C-library solution for converting between the system's narrow and wide encodings uses the mbsrtowcs and wcsrtombs functions from the <cwchar> header. I've spelt this out in this answer.
In C++11, you can use the wstring_convert template instantiated with a suitable codecvt facet. Unfortunately this requires some custom rigging, which is spelt out on the cppreference page.
I've adapted it here into a self-contained example which converts a wstring to a string, converting from the system's wide into the system's narrow encoding:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template <typename Facet>
struct deletable_facet : Facet
{
using Facet::Facet;
};
int main()
{
std::wstring_convert<
deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
std::wstring ws(L"Hello world.");
std::string ns = conv.to_bytes(ws);
std::cout << ns << std::endl;
}
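For the C-library route mentioned at the start of this answer, a minimal sketch of the narrow-to-wide direction with mbsrtowcs could look like this. The helper name narrow_to_wide is hypothetical. The sketch assumes the input is encoded in the current C locale's narrow encoding (so call std::setlocale(LC_ALL, "") first if you need anything beyond ASCII), and it stops at the first embedded null character.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert from the locale's narrow encoding
// to its wide encoding using std::mbsrtowcs.
std::wstring narrow_to_wide(const std::string& narrow)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();

    // First pass: with a null destination, mbsrtowcs only counts the
    // wide characters needed (excluding the terminator).
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    // Second pass: do the actual conversion into a buffer that has
    // room for the terminating L'\0'.
    std::vector<wchar_t> buf(len + 1);
    src = narrow.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);

    return std::wstring(buf.data(), len);
}
```

Unlike constructing a wstring from strs.begin() and strs.end(), which copies each byte into its own wide character, this decodes multibyte sequences according to the locale, so a character that occupies several bytes becomes one wchar_t.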
Consider the following example:
#include <iostream>
#include <clocale>
#include <cstdlib>
#include <string>
int main()
{
std::setlocale(LC_ALL, "en_US.utf8");
std::string s = "03A0";
wchar_t wstr = std::strtoul(s.c_str(), nullptr, 16);
std::wcout << wstr;
}
This outputs Π on Coliru.
Question
std::strtoul is from <cstdlib>. I'm perfectly fine with using it, but I was wondering whether the above example is possible using only the C++ parts of the standard library (perhaps stringstreams)?
Note also that there is no 0x prefix on the string to indicate hexadecimal.
Sure, std::stoul:
wchar_t wstr = std::stoul(s, nullptr, 16);
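The main difference is that it reports errors by throwing exceptions: std::invalid_argument if no conversion could be performed, and std::out_of_range if the value does not fit in an unsigned long.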
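If you would rather not let those exceptions escape, a small wrapper can turn them into a boolean result. This is only a sketch: parse_hex_code_point is a hypothetical name, and rejecting trailing characters is a choice made here, not something std::stoul does on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a hexadecimal string such as "03A0".
// Returns false instead of throwing when the input is not usable.
bool parse_hex_code_point(const std::string& text, unsigned long& value)
{
    try {
        std::size_t consumed = 0;
        unsigned long parsed = std::stoul(text, &consumed, 16);
        // stoul stops at the first non-hex character; treat leftover
        // characters as an error rather than silently ignoring them.
        if (consumed != text.size())
            return false;
        value = parsed;
        return true;
    } catch (const std::invalid_argument&) {
        return false; // no digits could be converted
    } catch (const std::out_of_range&) {
        return false; // value does not fit in unsigned long
    }
}
```

On success the value can then be narrowed to wchar_t as in the one-liner above.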
#include "stdafx.h"
#include <string>
#include <windows.h>
using namespace std;
int main()
{
string FilePath = "C:\\Documents and Settings\\whatever";
CreateDirectory(FilePath, NULL);
return 0;
}
Error: error C2664: 'CreateDirectory' : cannot convert parameter 1 from 'const char *' to 'LPCTSTR'
How do I make this conversion?
The next step is to set today's date as a string or char and concatenate it with the filepath. Will this change how I do step 1?
I am terrible at data types and conversions. Is there a good explanation for five-year-olds out there?
std::string is a class that holds char-based data. To pass a std::string's data to API functions, you have to use its c_str() method to get a const char* pointer to the string's actual data.
CreateDirectory() takes a const TCHAR* (LPCTSTR) as input. If UNICODE is defined, TCHAR maps to wchar_t; otherwise it maps to char. If you need to stick with std::string and do not want to make your code UNICODE-aware, then use CreateDirectoryA() instead, e.g.:
#include "stdafx.h"
#include <string>
#include <windows.h>
int main()
{
std::string FilePath = "C:\\Documents and Settings\\whatever";
CreateDirectoryA(FilePath.c_str(), NULL);
return 0;
}
To make this code TCHAR-aware, you can do this instead:
#include "stdafx.h"
#include <string>
#include <windows.h>
int main()
{
std::basic_string<TCHAR> FilePath = TEXT("C:\\Documents and Settings\\whatever");
CreateDirectory(FilePath.c_str(), NULL);
return 0;
}
However, ANSI-based OS versions are long dead and everything is Unicode nowadays, so TCHAR should not be used in new code anymore:
#include "stdafx.h"
#include <string>
#include <windows.h>
int main()
{
std::wstring FilePath = L"C:\\Documents and Settings\\whatever";
CreateDirectoryW(FilePath.c_str(), NULL);
return 0;
}
If you're not building a Unicode executable, calling c_str() on the std::string will result in a const char* (aka non-Unicode LPCTSTR) that you can pass into CreateDirectory().
The code would look like this:
CreateDirectory(FilePath.c_str(), NULL);
Please note that this will result in a compile error if you're trying to build a Unicode executable.
If you have to append to FilePath, I would recommend that you either continue to use std::string or use Microsoft's CString for the string manipulation, as that's less painful than doing it the C way and juggling raw char*. Personally, I would use std::string unless you are already in an MFC application that uses CString.
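As a rough sketch of the "append today's date" step with std::string: the helper name append_date, the backslash separator, and the YYYY-MM-DD layout are all assumptions chosen for illustration, not something the question specifies.

```cpp
#include <cassert>
#include <ctime>
#include <stdexcept>
#include <string>

// Hypothetical helper: append a date (formatted as YYYY-MM-DD)
// to a directory path as a new path component.
std::string append_date(const std::string& dir, const std::tm& date)
{
    char buf[11]; // "YYYY-MM-DD" plus the terminating '\0'
    // strftime returns 0 if the formatted text does not fit in buf.
    if (std::strftime(buf, sizeof buf, "%Y-%m-%d", &date) == 0)
        throw std::runtime_error("date does not fit the buffer");
    return dir + "\\" + buf;
}
```

For today's date you could fill the std::tm from std::time and std::localtime, then pass append_date(FilePath, date).c_str() to CreateDirectoryA(). The conversion step itself stays the same: you still call c_str() on the final string.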