How to convert UTF-8 std::string to UTF-16 std::wstring? - c++

If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.

This is how you do it with C++11 (note that std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17, but they still work):
std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wstr = converter.from_bytes(str);
And these are the headers you need:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
A more complete example is available here:
http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
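Since the goal is to compare two Persian words, here is a minimal sketch of how that might look (assuming both words arrive as UTF-8 and a C++11/14 compiler; the words are just examples):
#include <codecvt>
#include <locale>
#include <string>

bool same_word(const std::string& a_utf8, const std::string& b_utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    // convert both UTF-8 strings to UTF-16 and compare code unit by code unit
    return converter.from_bytes(a_utf8) == converter.from_bytes(b_utf8);
}

// same_word(u8"سلام", u8"سلام") == true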

Here's some code. It's only lightly tested, and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 then it will throw an exception, otherwise it returns the equivalent UTF-16 wstring.
// needs <string>, <vector>, and <stdexcept>
std::wstring utf8_to_utf16(const std::string& utf8)
{
    // First pass: decode the UTF-8 bytes into Unicode code points.
    std::vector<unsigned long> unicode;
    size_t i = 0;
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // a continuation byte is not valid as a lead byte
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        // consume the expected number of continuation bytes
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        // surrogate code points and values above U+10FFFF are not legal
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    // Second pass: encode the code points as UTF-16, using surrogate
    // pairs for anything outside the Basic Multilingual Plane.
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
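A quick usage sketch (the two words are just illustrative UTF-8 literals):
std::wstring w1 = utf8_to_utf16(u8"سلام");
std::wstring w2 = utf8_to_utf16(u8"درود");
bool same = (w1 == w2);   // false -- different words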

There are some relevant Q&A here and here which are worth a read.
Basically you need to convert the strings to a common format -- my preference is always to convert to UTF-8, but your mileage may vary.
A lot of software has been written for doing this conversion -- it is straightforward and can be written in a few hours -- but why not pick up something already done, such as UTF8-CPP?
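For example, with UTF8-CPP the conversion is essentially a one-liner. A sketch, assuming the library's single header is on the include path (the exact header path and exception type depend on the version you install):
#include <utf8.h>      // UTF8-CPP
#include <iterator>
#include <string>

std::u16string to_utf16(const std::string& u8)
{
    std::u16string u16;
    // throws (e.g. utf8::invalid_utf8) on malformed input
    utf8::utf8to16(u8.begin(), u8.end(), std::back_inserter(u16));
    return u16;
}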

To convert between the two types, you should use std::codecvt_utf8_utf16<wchar_t>.
Note the string prefixes I use to define UTF-16 (L) and UTF-8 (u8). Keep in mind that L gives you the platform's native wide encoding (UTF-16 on Windows, usually UTF-32 elsewhere), while the converter always produces UTF-16 code units; for the BMP-only text below the two coincide, so the asserts hold either way.
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>
int main()
{
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";
    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    // convert to UTF-8 and std::string
    std::string utf8NativeString = convert.to_bytes(original16);
    // convert to UTF-16 and std::wstring
    std::wstring utf16NativeString = convert.from_bytes(original8);
    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);
    return 0;
}

Microsoft has developed a nice library for such conversions as part of their Casablanca project, also known as the C++ REST SDK (cpprestsdk). The conversion helpers live in the utility::conversions namespace.
A simple usage looks something like this:
utility::conversions::utf8_to_utf16("sample_string");

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx
In the comment section of that page, there are also some interesting suggestions for this task like:
// Get an ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";
std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();
Or
// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());
Though I'm not sure about the validity of these approaches -- they simply widen each byte, so they are only safe for pure ASCII input.
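A quick way to see the limitation (assuming the narrow string holds UTF-8):
std::string utf8 = "\xC3\xA9";              // "é" encoded as UTF-8
std::wstring ws(utf8.begin(), utf8.end());
// ws.size() == 2 and neither element is U+00E9: the two bytes were
// widened individually, not decoded, so this only works for ASCII.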

Related

How to convert accented character to hex in C++? [duplicate]

This question already has answers here:
Convert string from UTF-8 to ISO-8859-1
Referring to the ISO-8859-1 (Latin-1) encoding:
The capital E acute (É) has a hex value of C9.
I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.
Currently, I am only able to write a function that converts an ASCII string to hex:
std::string Helper::ToHex(std::string input) {
    std::stringstream strstream;
    std::string output;
    for (int i = 0; i < input.length(); i++) {
        strstream << std::hex << unsigned(input[i]);
    }
    strstream >> output;
    return output;
}
However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.
std::string has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string is already holding so it can be converted to ISO-8859-1 if needed.
FYI, ffffffc3ffffff89 is simply the two char values 0xc3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits. Which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before then casting to unsigned. You also will need to account for unsigned values < 10 so that the output is an even multiple of 2 hex digits per char, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
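Putting those two fixes together, a corrected version of the original function might look like this (a sketch, shown as a free function; note it still dumps the raw UTF-8 bytes rather than ISO-8859-1, and it needs <iomanip> for std::setw/std::setfill):
#include <iomanip>
#include <sstream>
#include <string>

std::string ToHex(const std::string& input) {
    std::ostringstream strstream;
    for (std::size_t i = 0; i < input.length(); i++) {
        // cast to unsigned char first to avoid sign extension,
        // then pad to two hex digits per byte
        strstream << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
    }
    return strstream.str();
}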
So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
size_t nextUtf8CodepointLen(const char* data)
{
    unsigned char ch = static_cast<unsigned char>(*data);
    if ((ch & 0x80) == 0) {
        return 1;
    }
    if ((ch & 0xE0) == 0xC0) {
        return 2;
    }
    if ((ch & 0xF0) == 0xE0) {
        return 3;
    }
    if ((ch & 0xF8) == 0xF0) {
        return 4;
    }
    return 0;
}

unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
    if (data_size == 0) return -1;
    unsigned char ch = static_cast<unsigned char>(*data);
    size_t len = nextUtf8CodepointLen(data);
    ++data;
    --data_size;
    if (len < 2) {
        return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
    }
    --len;
    unsigned cp;
    if (len == 1) {
        cp = ch & 0x1F;
    }
    else if (len == 2) {
        cp = ch & 0x0F;
    }
    else {
        cp = ch & 0x07;
    }
    if (len > data_size) {
        data += data_size;
        data_size = 0;
        return 0xFFFD;
    }
    for (size_t j = 0; j < len; ++j) {
        ch = static_cast<unsigned char>(data[j]);
        if ((ch & 0xC0) != 0x80) {
            cp = 0xFFFD;
            break;
        }
        cp = (cp << 6) | (ch & 0x3F);
    }
    data += len;
    data_size -= len;
    return cp;
}

std::string Helper::ToHex(const std::string &input) {
    const char *data = input.c_str();
    size_t data_size = input.size();
    std::ostringstream oss;
    unsigned cp;
    while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
        if (cp > 0xFF) {
            cp = static_cast<unsigned>('?');
        }
        oss << std::hex << std::setw(2) << std::setfill('0') << cp;
    }
    return oss.str();
}
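A quick usage sketch (assuming the Helper class from the question; "É" is passed as its two UTF-8 bytes):
std::string hex = Helper::ToHex("\xC3\x89"); // "É" encoded as UTF-8
// hex == "c9" -- the ISO-8859-1 code point, not the raw UTF-8 bytes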

How to parse std::string containing unicode literals?

I have a std::string which stores characters as \uXXXX escape sequences. Example:
std::string a = "\\u00c1\\u00c4\\u00d3";
Note that the length of a is 18 (3 characters, 6 ASCII symbols for each UTF character).
Question: How can I convert a into a C++ string that has only 3 characters? Are there any standard functions (libraries) to do that?
There is nothing in the standard C++ library to handle this kind of conversion automatically for you. You are going to have to parse this string yourself, manually converting each 6-char "\uXXXX" substring into a 1-wchar value 0xXXXX that you can then store into a std::wstring or std::u16string as needed.
For example:
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
for (size_t i = 0; i < a.size();)
{
    char ch = a[i++];
    if ((ch == '\\') && (i < a.size()) && (a[i] == 'u'))
    {
        wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(++i, 4), nullptr, 16));
        i += 4;
        ws.push_back(wc);
    }
    else
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.push_back(static_cast<wchar_t>(ch));
    }
}
Alternatively:
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
size_t start = 0;
do
{
    size_t found = a.find("\\u", start);
    if (found == std::string::npos) break;
    if (start < found)
    {
        // depending on the charset used for encoding the string,
        // this may or may not need to be decoded further...
        ws.insert(ws.end(), a.begin()+start, a.begin()+found);
    }
    wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(found+2, 4), nullptr, 16));
    ws.push_back(wc);
    start = found + 6;
}
while (true);
if (start < a.size())
{
    // depending on the charset used for encoding the string,
    // this may or may not need to be decoded further...
    ws.insert(ws.end(), a.begin()+start, a.end());
}
Otherwise, use a 3rd party library that already does this kind of translation for you.
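As one illustration of the library route (purely an example, assuming nlohmann/json happens to be available -- any library that understands JSON-style \uXXXX escapes would do), you can let a JSON parser do the unescaping and then convert the resulting UTF-8 to a wide string as in the earlier answers:
#include <nlohmann/json.hpp>
#include <string>

std::string decode_escapes(const std::string& a)
{
    // wrap the text in quotes so it parses as a JSON string literal,
    // letting the parser turn \u00c1 etc. into UTF-8
    return nlohmann::json::parse("\"" + a + "\"").get<std::string>();
}

// decode_escapes("\\u00c1\\u00c4\\u00d3") yields the 3-character (6-byte) UTF-8 string "ÁÄÓ"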

Convert escape sequences the way a compiler would [duplicate]

What's the easiest way to convert a C++ std::string to another std::string, which has all the unprintable characters escaped?
For example, for the string of two characters [0x61,0x01], the result string might be "a\x01" or "a%01".
Take a look at Boost's String Algorithm Library. You can use its is_print classifier (together with its operator! overload) to pick out nonprintable characters, and its find_format() functions can replace those with whatever formatting you wish.
#include <iostream>
#include <boost/format.hpp>
#include <boost/algorithm/string.hpp>

struct character_escaper
{
    template<typename FindResultT>
    std::string operator()(const FindResultT& Match) const
    {
        std::string s;
        for (typename FindResultT::const_iterator i = Match.begin();
             i != Match.end();
             i++) {
            // cast through unsigned char to avoid sign extension for bytes >= 0x80
            s += str(boost::format("\\x%02x") % static_cast<int>(static_cast<unsigned char>(*i)));
        }
        return s;
    }
};

int main (int argc, char **argv)
{
    std::string s("a\x01");
    boost::find_format_all(s, boost::token_finder(!boost::is_print()), character_escaper());
    std::cout << s << std::endl;
    return 0;
}
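For the two-character input above this should print a\x01: the printable 'a' passes through unchanged, and the 0x01 byte is replaced by its escaped form.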
The following assumes the execution character set is a superset of ASCII and that CHAR_BIT is 8. For the OutIter, pass a back_inserter (e.g. to a vector<char> or another string), an ostream_iterator, or any other suitable output iterator.
template<class OutIter>
OutIter write_escaped(std::string const& s, OutIter out) {
    *out++ = '"';
    for (std::string::const_iterator i = s.begin(), end = s.end(); i != end; ++i) {
        unsigned char c = *i;
        if (' ' <= c and c <= '~' and c != '\\' and c != '"') {
            *out++ = c;
        }
        else {
            *out++ = '\\';
            switch (c) {
            case '"':  *out++ = '"';  break;
            case '\\': *out++ = '\\'; break;
            case '\t': *out++ = 't';  break;
            case '\r': *out++ = 'r';  break;
            case '\n': *out++ = 'n';  break;
            default:
                char const* const hexdig = "0123456789ABCDEF";
                *out++ = 'x';
                *out++ = hexdig[c >> 4];
                *out++ = hexdig[c & 0xF];
            }
        }
    }
    *out++ = '"';
    return out;
}
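A minimal usage sketch (assumption: escaping into another std::string via back_inserter):
#include <iterator>
#include <string>

void demo() {
    std::string raw("a\x01");
    std::string escaped;
    write_escaped(raw, std::back_inserter(escaped));
    // escaped == "\"a\\x01\"" -- the escaped text wrapped in double quotes
}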
Assuming that "easiest way" means short and yet easily understandable while not depending on any other resources (like libs) I would go this way:
#include <cctype>
#include <sstream>
#include <string>

// s is our escaped output string
std::string s = "";
// loop through all characters
for (char c : your_string)
{
    // check if a given character is printable
    // the cast is necessary to avoid undefined behaviour
    if (isprint((unsigned char)c))
        s += c;
    else
    {
        std::stringstream stream;
        // if the character is not printable
        // we'll convert it to a hex string using a stringstream
        // note that since char is signed we have to cast it to unsigned first
        stream << std::hex << (unsigned int)(unsigned char)(c);
        std::string code = stream.str();
        s += std::string("\\x") + (code.size() < 2 ? "0" : "") + code;
        // alternatively for URL encodings:
        //s += std::string("%") + (code.size() < 2 ? "0" : "") + code;
    }
}
One person's unprintable character is another's multi-byte character. So you'll have to define the encoding before you can work out what bytes map to what characters, and which of those is unprintable.
Have you seen the article about how to Generate Escaped String Output Using Spirit.Karma?

Convert string from UTF-8 to ISO-8859-1

I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution and then just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
I'm going to modify my code from another answer to implement the suggestion from Alf.
std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points; for other values, decide whether you want to treat that as an error, or perhaps replace it with code point 127 (my favorite, the ASCII "del") or a question mark or something.
The C++ standard library defines a std::codecvt specialization that can be used:
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
Alf's suggestion implemented in C++11:
#include <string>
#include <codecvt>
#include <algorithm>
#include <iterator>
auto i = u8"H€llo Wørld";
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
auto wide = utf8.from_bytes(i);
std::string out;
out.reserve(wide.length());
std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
[](const wchar_t c) { return (c <= 255) ? c : '?'; });
// out now contains "H?llo W\xf8rld"