Why can't codecvt convert Unicode outside the BMP to u16string? - c++

I am trying to understand Unicode handling in C++, and the following has me confused.
Code:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
using namespace std;
void trial1() {
    string a = "\U00010000z";
    cout << a << endl;
    u16string b;
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    b = converter.from_bytes(a);
    u16string c = b.substr(0, 1);
    string q = converter.to_bytes(c);
    cout << q << endl;
}

void trial2() {
    u16string a = u"\U00010000";
    cout << a.length() << endl; // 2
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    string b = converter.to_bytes(a);
}

int main() {
    // both don't work
    // trial1();
    // trial2();
    return 0;
}
I have verified that u16string can store Unicode outside the BMP as surrogate pairs, e.g. u"\U00010000" is stored as 2 char16_t units.
So why does std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter; fail for both trial1 and trial2 and throw an exception?

std::codecvt_utf8 does not support conversions to/from UTF-16, only UCS-2 and UTF-32. You need to use std::codecvt_utf8_utf16 instead.
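For instance, a minimal sketch of trial2 with the converter swapped to std::codecvt_utf8_utf16 (the whole <codecvt> facility is deprecated since C++17, but still widely available):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // UTF-8 <-> UTF-16 converter; handles surrogate pairs for codepoints outside the BMP
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;

    std::u16string a = u"\U00010000";          // stored as a surrogate pair (2 char16_t)
    std::string utf8 = converter.to_bytes(a);  // 4-byte UTF-8 sequence, no exception thrown
    std::cout << utf8.size() << '\n';          // 4

    std::u16string back = converter.from_bytes(utf8);
    std::cout << back.size() << '\n';          // 2
    return 0;
}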

Related

HEX string to UTF-8 (Unicode) string

I have a hex string that contains Unicode code points, and I need to convert it to a UTF-8 (Unicode) string and store it in a string variable.
I am new to Unicode and don't have much of an idea what to try.
std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
std::string unicode_string = getUnicodeString(HEX_string);
I expect صباح الخير to end up in the unicode_string variable.
Since that hex string is a bunch of space-separated base-16 encoded Unicode codepoints, it's easy to convert using just standard functions, in particular std::c32rtomb():
#include <iostream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <clocale>
#include <cuchar>
#include <climits>
std::string
getUnicodeString(const std::string &hex)
{
    std::istringstream codepoints{hex};
    std::string cp;
    std::string out;
    std::mbstate_t state{};   // value-initialize the conversion state
    char u8[MB_LEN_MAX];

    while (codepoints >> cp) {
        char32_t c = std::stoul(cp, nullptr, 16);   // parse one base-16 codepoint
        auto len = std::c32rtomb(u8, c, &state);    // encode it as UTF-8 into u8
        if (len == std::size_t(-1)) {
            std::cerr << "Unable to convert " << cp << " to UTF-8 codepoint!\n";
            std::exit(EXIT_FAILURE);
        } else if (len > 0) {
            out.append(u8, len);
        }
    }
    return out;
}
int main() {
    // Make sure that c32rtomb() works with UTF-32 code units
    static_assert(__STDC_UTF_32__);
    // Requires a UTF-8 locale to get a UTF-8 string.
    std::setlocale(LC_ALL, "");
    std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
    std::string unicode_string = getUnicodeString(HEX_string);
    std::cout << unicode_string << '\n';
    return 0;
}
After compiling it, running it produces:
$ echo $LANG
en_US.utf8
$ ./a.out
صباح الخير
Your sample doesn't contain any codepoints outside the BMP, so it isn't possible to tell whether your input is encoded as UTF-16 or UTF-32 code units. The above code assumes UTF-32; if it is actually UTF-16, you can change c32rtomb() to c16rtomb() and char32_t to char16_t, and it will handle UTF-16 surrogate pairs correctly.
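For reference, a minimal sketch of that UTF-16 variant (same structure as getUnicodeString above; the name getUnicodeStringUtf16 is made up here):

#include <climits>
#include <cstdlib>
#include <cuchar>
#include <iostream>
#include <sstream>
#include <string>

std::string getUnicodeStringUtf16(const std::string &hex)
{
    std::istringstream codeunits{hex};
    std::string cu;
    std::string out;
    std::mbstate_t state{};
    char u8[MB_LEN_MAX];

    while (codeunits >> cu) {
        char16_t c = std::stoul(cu, nullptr, 16);   // parse one UTF-16 code unit
        auto len = std::c16rtomb(u8, c, &state);    // a high surrogate yields 0 bytes;
        if (len == std::size_t(-1)) {               // the following low surrogate completes the pair
            std::cerr << "Unable to convert " << cu << " to UTF-8!\n";
            std::exit(EXIT_FAILURE);
        } else if (len > 0) {
            out.append(u8, len);
        }
    }
    return out;
}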

How to uppercase a u32string (char32_t) with a specific locale?

On Windows with Visual Studio 2017 I can use the following code to uppercase a u32string (which is based on char32_t):
#include <locale>
#include <iostream>
#include <string>
void toUpper(std::u32string& u32str, std::string localeStr)
{
    std::locale locale(localeStr);
    for (unsigned i = 0; i < u32str.size(); ++i)
        u32str[i] = std::toupper(u32str[i], locale);
}
The same code does not work on macOS with Xcode.
I'm getting errors like this:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:795:44: error: implicit instantiation of undefined template 'std::__1::ctype<char32_t>'
return use_facet<ctype<_CharT> >(__loc).toupper(__c);
Is there a portable way of doing this?
I have found a solution:
Instead of std::u32string I'm now using std::string with UTF-8 encoding.
Conversion from std::u32string to a UTF-8 std::string can be done via utf8-cpp: http://utfcpp.sourceforge.net/
The UTF-8 string then needs to be converted to std::wstring, because std::toupper is not implemented for char32_t on all platforms.
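As an illustration, a minimal sketch of that first conversion step using utf8-cpp's utf32to8 (assuming the library's utf8.h header is available) might look like this:

#include <iterator>
#include <string>
#include "utf8.h" // UTF8-CPP

// Convert UTF-32 code points to a UTF-8 encoded std::string.
std::string u32_to_utf8(const std::u32string& in)
{
    std::string out;
    utf8::utf32to8(in.begin(), in.end(), std::back_inserter(out));
    return out;
}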
void toUpper(std::string& str, std::string localeStr)
{
    //unicode to wide string converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    //convert to wstring (because std::toupper is not implemented on all platforms for u32string)
    std::wstring wide = converter.from_bytes(str);

    std::locale locale;
    try
    {
        locale = std::locale(localeStr);
    }
    catch (const std::exception&)
    {
        std::cerr << "locale not supported by system: " << localeStr << " (" << getLocaleByLanguage(localeStr) << ")" << std::endl;
    }

    auto& f = std::use_facet<std::ctype<wchar_t>>(locale);
    f.toupper(&wide[0], &wide[0] + wide.size());

    //convert back
    str = converter.to_bytes(wide);
}
Note:
On Windows, localeStr has to be something like: en, de, fr, ...
On other systems, localeStr must be something like: de_DE, fr_FR, en_US, ...
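A hedged usage sketch tying this together (assuming toUpper() from above is visible; the locale names are illustrative and must be installed on the target system):

#include <iostream>
#include <string>

int main()
{
    std::string s = u8"un café";    // UTF-8 encoded input
#ifdef _WIN32
    toUpper(s, "fr");               // Windows-style locale name
#else
    toUpper(s, "fr_FR");            // POSIX-style locale name
#endif
    std::cout << s << '\n';         // expected: "UN CAFÉ", given an é -> É mapping in the locale
    return 0;
}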

Why doesn't Boost.Locale provide a character-level rule type?

Environment: Boost 1.53.0, C++11; new to C++.
In Boost.Locale boundary analysis, rule types are specified for word (e.g. boundary::word_letter, boundary::word_number) and sentence, but there is no boundary rule type for character. All I want is something like isUpperCase(), isLowerCase(), isDigit(), isPunctuation().
I tried Boost String Algorithms, which didn't work:
boost::locale::generator gen;
std::locale loc = gen("ru_RU.UTF-8");
std::string context = "ДВ";
std::cout << boost::algorithm::all(context, boost::algorithm::is_upper(loc));
Why can these features be accessed so easily in Java or Python but are so confusing in C++? Is there a consistent way to achieve this?
This works for me under VS 2013.
locale::global(locale("ru-RU"));
std::string context = "ДВ";
std::cout << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
Prints 1
It is important how you initialize the locale.
UPDATE:
Here's a solution that will work under Ubuntu.
#include <algorithm>
#include <iostream>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/predicate.hpp>
#include <boost/locale.hpp>

using namespace std;

int main()
{
    locale::global(locale("ru_RU"));

    wstring context = L"ДВ";
    wcout << boolalpha << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
    wcout << endl;

    wstring context1 = L"ПРИВЕТ, МИР"; // HELLO WORLD in Russian
    wcout << boolalpha << any_of(context1.begin(), context1.end(), boost::algorithm::is_upper());
    wcout << endl;

    wstring context2 = L"привет мир"; // hello world in Russian
    wcout << boolalpha << any_of(context2.begin(), context2.end(), boost::algorithm::is_upper());
    return 0;
}
Prints
true
true
false
This will work with boost::algorithm::all as well.
wstring context = L"ДВ";
wcout << boolalpha << boost::algorithm::all(context, boost::algorithm::is_upper());
Boost.Locale is based on ICU, and ICU itself does provide character-level classification, which is fairly consistent and readable (more Java-style).
Here is a simple example.
#include <unicode/brkiter.h>
#include <unicode/utypes.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <cstdio>
#include <string>

using namespace icu;

int main()
{
    UnicodeString s("А аБ Д д2 -");
    UErrorCode status = U_ZERO_ERROR;
    Locale ru("ru", "RU");
    BreakIterator* bi = BreakIterator::createCharacterInstance(ru, status);
    bi->setText(s);

    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        std::string type;
        if (u_isUUppercase(s.charAt(p)))
            type = "upper";
        if (u_isULowercase(s.charAt(p)))
            type = "lower";
        if (u_isUWhiteSpace(s.charAt(p)))
            type = "whitespace";
        if (u_isdigit(s.charAt(p)))
            type = "digit";
        if (u_ispunct(s.charAt(p)))
            type = "punc";
        printf("Boundary at position %d is %s\n", p, type.c_str());
        p = bi->next();
    }
    delete bi;
    return 0;
}

UTF-8 to URI percent encoding

I'm trying to convert Unicode code points to percent encoded UTF-8 code units.
The Unicode -> UTF-8 conversion seems to be working correctly as shown by some testing with Hindi and Chinese characters which show up correctly in Notepad++ with UTF-8 encoding, and can be translated back properly.
I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the code point U+5123).
What am I doing wrong?
Added code (note that utf8.h belongs to the UTF8-CPP library).
#include <fstream>
#include <iostream>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0, 0, 0, 0, 0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }
    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }
    ofs.close();
    return 0;
}
You actually need to convert each byte value to the corresponding ASCII hex string, for example:
"é" in UTF-8 is the byte sequence { 0xc3, 0xa9 }. Please note that these are bytes, i.e. char values in C++.
Each byte needs to be converted to "%C3" and "%A9" respectively.
The best way to do so is to use a string stream:
std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";
for (int i = 0; i < utf8str.length(); ++i) {
out << '%' << std::hex << std::uppercase << (int)(unsigned char)utf8str[i];
}
Or in C++11:
for (auto c: utf8str) {
out << '%' << std::hex << std::uppercase << (int)(unsigned char)c;
}
Please note that the bytes need to be cast to int, because otherwise the << operator outputs the literal byte value rather than its hex representation.
Casting to unsigned char first is needed because otherwise the sign bit propagates into the int value, producing output like FFFFFFE5 for negative values.
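Putting that together, a minimal sketch of a reusable helper (the name percent_encode is made up here; unlike the loop above it also zero-pads bytes below 0x10):

#include <iomanip>
#include <sstream>
#include <string>

// Percent-encode every byte of an already UTF-8 encoded string.
std::string percent_encode(const std::string& utf8str)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char c : utf8str) {
        out << '%' << std::setw(2) << static_cast<int>(c);  // e.g. 0xE5 -> "%E5"
    }
    return out.str();
}

// percent_encode("\xE5\x84\xA3") yields "%E5%84%A3"

A real URI encoder would leave unreserved ASCII characters unescaped, but for encoding the UTF-8 bytes of non-ASCII text this is sufficient.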

Converting a character to \uxxxx format in C/C++

I want to convert a string/char to \uxxxx format in a C/C++ program.
Suppose I have the character 'A'; I want to print it converted to \u0041 (the standard Unicode escape).
Secondly, I was using a Unix command-line utility (printf) to print a \uxxxx string as a character. I tried "\u092b" and it printed a different character than the one in my font file. Can anyone please explain the reason behind this?
Here's a function using only standard C++ to do this (though, depending on CharT, it has requirements that the implementation-defined behavior of some conforming implementations may not meet).
#include <codecvt>
#include <sstream>
#include <iomanip>
#include <iostream>

template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const &input)
{
    // string converter from CharT to char. If CharT = char then no conversion is done.
    // If CharT is char32_t or char16_t then the conversion is UTF-32/16 -> UTF-8. Not all implementations support this yet.
    // If CharT is something else then this uses implementation-defined encodings and will only work for us if the implementation uses UTF-8 as the narrow char encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;
    // string converter from UTF-8 -> UTF-32. Not all implementations support this yet.
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;

    // convert from the input encoding to UTF-32 (assuming convertA produces a UTF-8 string)
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));

    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}

template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const *input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}

int main() {
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000