C++ & Boost: encode/decode UTF-8

I'm trying to do a very simple task: take a unicode-aware wstring and convert it to a string, encoded as UTF8 bytes, and then the opposite way around: take a string containing UTF8 bytes and convert it to unicode-aware wstring.
The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with
http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.
For instance, in Python it would look like so:
>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'
What I'm ultimately after is this:
wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws);
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}
I really don't want to add another dependency on the ICU or something in that spirit... but to my understanding, it should be possible with Boost.
Some sample code would be greatly appreciated! Thanks

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing a demo code here, should anyone find it useful:
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}
Usage:
wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
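For anyone curious what the encoding step actually does, here is a minimal dependency-free sketch. It is not how utfcpp implements it; the names append_utf8 and encode_utf8_sketch are made up for illustration, and it takes std::u32string rather than std::wstring so that one element is always one code point regardless of platform.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
void append_utf8(char32_t cp, std::string& out) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        throw std::invalid_argument("surrogate code point");
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          // 4 bytes: 11110xxx followed by 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
}

// Hypothetical helper: encode a whole UTF-32 string as UTF-8 bytes.
std::string encode_utf8_sketch(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(cp, out);
    return out;
}
```

With the sample from the question, encode_utf8_sketch(U"\u05e9\u05dc\u05d5\u05dd") gives the bytes d7 a9 d7 9c d7 95 d7 9d, the same as the Python output.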

There's already a boost link in the comments, but in the almost-standard C++0x, there is wstring_convert that does this
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
std::string s = conv.to_bytes(uchars);
std::wstring ws2 = conv.from_bytes(s);
std::cout << std::boolalpha
<< (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
<< (ws2 == uchars ) << '\n';
}
output when compiled with MS Visual Studio 2010 EE SP1 or with CLang++ 2.9
true
true
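One thing worth knowing about wstring_convert: when it is constructed without error strings, from_bytes throws std::range_error on input it cannot convert. Below is a small sketch of wrapping that into a non-throwing call; try_decode_utf8 is a made-up name, not part of the standard library, and newer compilers may warn that these facilities are deprecated.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns false instead of throwing on malformed UTF-8.
// On success, `out` holds the decoded wide string.
bool try_decode_utf8(const std::string& bytes, std::wstring& out) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        // from_bytes reports conversion failures this way when no
        // error string was passed to the converter's constructor.
        return false;
    }
}
```

This keeps the exception handling in one place, which is handy when the bytes come from an untrusted source such as a network socket or a log file.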

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
Almost as easy as Python encoding/decoding :)
Note that Boost.Locale is not a header-only library.

For a drop-in replacement for std::string/std::wstring that handles utf8, see TINYUTF8.
In combination with <codecvt> you can convert pretty much from/to every encoding from/to utf8, which you then handle through the above library.

Related

How to convert a text like "\320\272\320\276\320\274..." to std::wstring in C++?

I am working on a code that processes message from Ubuntu, some of the messages contains, for example:
localhost sshd 1658 - - Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712 ]
where "\320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274" is the user name that originally is in Russian. How to convert it to std::wstring?
The numbers after the backslashes are the UTF-8 byte sequence values of the Cyrillic letters, each byte represented as an octal number.
You could for example use a regex replace to replace each \ooo with its value so that you get a real UTF-8 string out:
See it on Wandbox
#include <cstdlib>
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
std::string const source = R"(Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712)";
// Match a backslash followed by exactly three octal digits.
boost::regex const re(R"(\\[0-7]{3})");
// Write the byte that each \ooo escape stands for to the output iterator.
auto const replacer = [](boost::smatch const& match, auto it) {
auto const byteVal = std::stoi(&match[0].str()[1], nullptr, 8);
*it = static_cast<char>(byteVal);
return ++it;
};
std::string const out = boost::regex_replace(source, re, replacer);
std::cout << out << std::endl;
return EXIT_SUCCESS;
}
If you really need to, you can then convert this std::string to std::wstring using e.g. Thomas's method.
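If pulling in Boost just for this is too much, the same replacement can be done with a plain loop. This is only a sketch under the assumption that every escape is a backslash followed by exactly three octal digits with a value up to 377; unescape_octal is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace each \ooo escape (three octal digits, value
// at most 0377) with the byte it denotes; all other text is copied unchanged.
std::string unescape_octal(const std::string& in) {
    auto is_octal = [](char c) { return c >= '0' && c <= '7'; };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool is_escape =
            in[i] == '\\' && i + 3 < in.size() &&
            in[i + 1] >= '0' && in[i + 1] <= '3' &&   // keeps the value within one byte
            is_octal(in[i + 2]) && is_octal(in[i + 3]);
        if (is_escape) {
            const int value = (in[i + 1] - '0') * 64 +
                              (in[i + 2] - '0') * 8 +
                              (in[i + 3] - '0');
            out += static_cast<char>(value);
            i += 3;  // skip the three digits just consumed
        } else {
            out += in[i];
        }
    }
    return out;
}
```

The result is a std::string holding raw UTF-8 bytes, so whether the output is valid UTF-8 still depends on the log line itself being complete.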
If you have a std::string containing UTF-8 encoded text and you wish to convert it to std::wstring, you can do so using the std::codecvt_utf8 facet and the std::wstring_convert class template:
#include <codecvt>
#include <locale>
#include <string>
std::wstring convert(const std::string& utf8String) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> converter{};
return converter.from_bytes(utf8String);
}
The format of the resulting std::wstring will either be UCS2 (on Windows platforms) or UCS4 (most non-Windows platforms).
Note, that the std::codecvt_utf8 facet is deprecated as of C++17, and instead consumers are encouraged to rely on specialized unicode/text-processing libraries. But this should suffice for now.
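If you would rather not depend on the deprecated facet, the decoding itself is small enough to write by hand. The following is a sketch, not a replacement for a proper Unicode library: decode_utf8_sketch is a made-up name, and it returns std::u32string instead of std::wstring so the result does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points,
// throwing std::invalid_argument on malformed input.
std::u32string decode_utf8_sketch(const std::string& bytes) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const char32_t min_for_extra[] = {0x0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                       { cp = lead;        extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0)        { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_for_extra[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

On platforms where wchar_t is 32 bits wide, the code points can be copied into a std::wstring one to one; where it is 16 bits wide, values above U+FFFF would additionally need to be split into surrogate pairs.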

Print unicode char

I tried a very simple code in C++:
#include <iostream>
#include <string>
int main()
{
std::wstring test = L"asdfa-";
test += u'ç';
std::wcout << test;
}
But the result was:
asdfa-?
I couldn't print 'ç' with either cout or wcout. How can I print this string correctly?
OS: Linux.
PS: I use wstring instead of string because I sometimes need to calculate the length of the string, and it must match what is shown on the screen.
PS: I need to concatenate the Unicode char; it can't be in the string constructor.
First, here's something that does work:
#include <iostream>
#include <string>
int main() {
std::string test = "asdfa-";
test += "ç";
std::cout << test;
}
I used just regular strings here and let C++ keep everything in UTF-8. I think you already know that this would work because you mentioned that you wanted to concatenate the ç rather than just leaving it in the string constructor.
Dealing with char, char16_t, char32_t, and wchar_t in C++ has never really been fun. You have to be careful with the L, u, and U prefixes.
However, where possible, if you deal with utf-8 strings, and avoid characters, you can generally get things to work much better. And since most consoles (with the possible exception of old Windows machines) understand utf-8 pretty well, this is the approach that often just works the best. So if you have wide characters, see if you can convert them to regular std::string objects and work in that domain.
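Since the question mentions needing the string length, here is a sketch of how that can be done while staying in UTF-8: count every byte that is not a continuation byte (those look like 10xxxxxx). The name utf8_code_points is made up, and the function assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a valid UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx and are skipped.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For "asdfa-ç" this returns 7 while size() returns 8. Note that this counts code points, which is not always the same as columns on the screen: combining marks and double-width characters will throw it off.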
One general way of handling this would be:
Input (convert from multibyte to wide using current locale)
Your App: work with wide strings
Output or saving to a file (convert from wide to multibyte)
For wide-string manipulations such as counting characters or taking substrings, there is the wcsXXX family of functions.
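The input and output conversions in that scheme can be sketched with the C library's std::mbstowcs and std::wcstombs, which use the current C locale. The names to_wide_sketch and to_multibyte_sketch are made up, and both stop at the first embedded NUL because they go through c_str().

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: multibyte (current C locale) -> wide.
std::wstring to_wide_sketch(const std::string& s) {
    // A wide string never has more elements than the multibyte one has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for this locale");
    return std::wstring(buf.data(), n);
}

// Hypothetical helper: wide -> multibyte (current C locale).
std::string to_multibyte_sketch(const std::wstring& ws) {
    // MB_CUR_MAX is the most bytes one character can take in this locale.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");
    return std::string(buf.data(), n);
}
```

For non-ASCII text to convert, the program has to select the user's locale first, for example with std::setlocale(LC_ALL, "") from <clocale> at startup; in the default "C" locale only plain ASCII is guaranteed to work.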
If you are using libstdc++ on Linux: you forgot an essential call at the beginning of the program
std::locale::global(std::locale(""));
This is assuming you are on Linux and your locale supports UTF-8.
If you are using libc++: forget about using wstreams. This library does not support I/O of wide characters in a useful way (i.e. translation to UTF-8 like libstdc++ does).
Windows has a wholly separate set of quirks regarding Unicode. You are lucky if you don't have to deal with them.
demo with gcc/libstdc++ and a call to std::locale
demo with gcc/libstdc++ and no call to std::locale
Different versions of clang/libc++ behave differently with this example: some output ? instead of the non-ascii char, some output nothing; some crash on call to std::locale, some don't. None do the right thing, which is printing the ç, or maybe I just haven't found one that works. I don't recommend using libc++ if you need anything related to locale or wchar_t.
I solved this problem using a conversion function:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
std::string wstr2str(const std::wstring& wstr) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(wstr);
}
int main()
{
std::wstring test = L"asdfa-";
test += L'ç';
std::string str = wstr2str(test);
std::cout << str;
}

character and number Persian in regular Expression C++ [duplicate]

I have to use a Unicode range in a regex in C++. Basically, what I need is a regex that accepts all valid Unicode characters. I tried the test expression below and ran into some issues with it.
std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");
Is the issue with \\u?
This should work fine but you need to use std::wregex and std::wsmatch. You will need to convert the source string and regular expression to wide character unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.
This works for me where source text is UTF-8:
inline std::wstring from_utf8(const std::string& utf8)
{
// code to convert from utf8 to utf32/utf16
}
inline std::string to_utf8(const std::wstring& ws)
{
// code to convert from utf32/utf16 to utf8
}
int main()
{
std::string test = "john.doe#神谕.com"; // utf8
std::string expr = "[\\u0080-\\uDB7F]+"; // utf8
std::wstring wtest = from_utf8(test);
std::wstring wexpr = from_utf8(expr);
std::wregex we(wexpr);
std::wsmatch wm;
if(std::regex_search(wtest, wm, we))
{
std::cout << to_utf8(wm.str(0)) << '\n';
}
}
Output:
神谕
Note: If you need a UTF conversion library I used THIS ONE in the example above.
Edit: Or, you could use the functions given in this answer:
Any good solutions for C++ string code point and code unit?

How to make languages-friendly function to lower?

I want a single 'to lower' function (applied to a word) that works correctly for two languages, for example English and Russian. What should I do? Should I use std::wstring for it, or can I get by with std::string?
I also want it to be cross-platform, and I don't want to reinvent the wheel.
The canonical library for this kind of things is ICU:
http://site.icu-project.org/
There is also a boost wrapper:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html
See also this question:
Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library
Make sure first that you understand the concept of locales, and that you have a firm grasp of what Unicode and, more generally, character encodings are all about.
Some good reads for a quick start:
http://joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/Locale
I think this solution is OK. I'm not sure it suits every situation, but it may well be enough.
#include <locale>
#include <codecvt>
#include <string>
std::string toLowerCase (const std::string& word) {
std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
std::locale loc("en_US.UTF-8");
std::wstring wword = conv.from_bytes(word);
for (std::size_t i = 0; i < wword.length(); ++i) {
// Lower-case the decoded wide character, not the raw UTF-8 byte at the same index.
wword[i] = std::tolower(wword[i], loc);
}
return conv.to_bytes(wword);
}

Wide to narrow characters

What is the cleanest way of converting a std::wstring into a std::string? I have used W2A et al macros in the past, but I have never liked them.
What you might be looking for is icu, an open-source, cross-platform library for dealing with Unicode and legacy encodings amongst many other things.
The most native way is std::ctype<wchar_t>::narrow(), but that does little more than std::copy as gishu suggested and you still need to manage your own buffers.
If you're not trying to perform any translation but just want a one-liner, you can do std::string my_string( my_wstring.begin(), my_wstring.end() ).
If you want actual encoding translation, you can use locales/codecvt or one of the libraries from another answer, but I'm guessing that's not what you're looking for.
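To make the caveat about the one-liner concrete: the iterator-range constructor simply converts each wchar_t to char, so anything outside ASCII is silently mangled. A sketch of a checked variant follows; narrow_ascii is a made-up name, and throwing is just one possible way to report the problem.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow a wide string that is expected to be pure
// ASCII, throwing instead of silently truncating anything else.
std::string narrow_ascii(const std::wstring& ws) {
    std::string out;
    out.reserve(ws.size());
    for (wchar_t wc : ws) {
        // The cast makes the check work whether wchar_t is signed or unsigned.
        if (static_cast<unsigned long>(wc) > 0x7F)
            throw std::range_error("non-ASCII character in wide string");
        out += static_cast<char>(wc);
    }
    return out;
}
```

This is only appropriate when the data is known to be ASCII (identifiers, numbers, protocol keywords); for real text an encoding conversion is needed.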
Since this is one of the first results for a search of "c++ narrow string," and it is from before C++11, here is the C++11 way of solving this problem:
#include <codecvt>
#include <locale>
#include <string>
std::string narrow( const std::wstring& str ){
std::wstring_convert<
std::codecvt_utf8_utf16< std::wstring::value_type >,
std::wstring::value_type
> utf16conv;
return utf16conv.to_bytes( str );
}
std::wstring_convert: http://en.cppreference.com/w/cpp/locale/wstring_convert
std::codecvt_utf8_utf16: http://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16
If the encoding in the wstring is UTF-16 and you want conversion to a UTF-8 encoded string, you can use UTF8 CPP library:
utf8::utf16to8(wstr.begin(), wstr.end(), back_inserter(str));
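For reference, this is roughly what a UTF-16 to UTF-8 conversion involves. The sketch below is not the UTF8 CPP implementation; utf16_to_utf8_sketch is a made-up name, and it takes std::u16string so that the input is UTF-16 on every platform.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-16 code units to UTF-8 bytes,
// combining surrogate pairs into a single code point first.
std::string utf16_to_utf8_sketch(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                                      // consume the low surrogate
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The surrogate handling is the part that a plain element-by-element copy cannot do, which is why characters outside the Basic Multilingual Plane are a good test case for any wide-to-narrow routine.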
See if this helps. This one uses std::copy to achieve your goal.
http://www.codeguru.com/forum/archive/index.php/t-193852.html
I don't know if it's the "cleanest" but I've used copy() function without any problems so far.
#include <iostream>
#include <algorithm>
#include <string>
using namespace std;
// Each element is converted with a plain narrowing/widening cast,
// so this is only safe for ASCII content.
string wstring2string(const wstring & wstr)
{
string str(wstr.length(),' ');
copy(wstr.begin(),wstr.end(),str.begin());
return str;
}
wstring string2wstring(const string & str)
{
wstring wstr(str.length(),L' ');
copy(str.begin(),str.end(),wstr.begin());
return wstr;
}
http://agraja.wordpress.com/2008/09/08/cpp-string-wstring-conversion/