How to uppercase a u32string (char32_t) with a specific locale? - c++

On Windows with Visual Studio 2017 I can use the following code to uppercase a u32string (which is based on char32_t):
#include <locale>
#include <iostream>
#include <string>
void toUpper(std::u32string& u32str, std::string localeStr)
{
    std::locale locale(localeStr);
    for (unsigned i = 0; i < u32str.size(); ++i)
        u32str[i] = std::toupper(u32str[i], locale);
}
The same code does not work on macOS with Xcode.
I'm getting errors like this:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:795:44: error: implicit instantiation of undefined template 'std::__1::ctype<char32_t>'
return use_facet<ctype<_CharT> >(__loc).toupper(__c);
Is there a portable way of doing this?

I have found a solution:
Instead of using std::u32string I'm now using std::string with UTF-8 encoding.
Conversion from std::u32string to std::string (UTF-8) can be done via utf8-cpp: http://utfcpp.sourceforge.net/
The UTF-8 string then has to be converted to std::wstring, because std::toupper is not implemented for char32_t on all platforms.
void toUpper(std::string& str, std::string localeStr)
{
    //unicode to wide string converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    //convert to wstring (because std::toupper is not implemented on all platforms for u32string)
    std::wstring wide = converter.from_bytes(str);
    std::locale locale;
    try
    {
        locale = std::locale(localeStr);
    }
    catch (const std::exception&)
    {
        std::cerr << "locale not supported by system: " << localeStr << " (" << getLocaleByLanguage(localeStr) << ")" << std::endl;
    }
    auto& f = std::use_facet<std::ctype<wchar_t>>(locale);
    f.toupper(&wide[0], &wide[0] + wide.size());
    //convert back
    str = converter.to_bytes(wide);
}
Note:
On Windows, localeStr has to be something like: en, de, fr, ...
On other systems, localeStr must be de_DE, fr_FR, en_US, ...

Related

Why can't codecvt convert Unicode outside the BMP to u16string?

I am trying to understand Unicode in C++, and this has me confused.
Code:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
using namespace std;
void trial1() {
    string a = "\U00010000z";
    cout << a << endl;
    u16string b;
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    b = converter.from_bytes(a);
    u16string c = b.substr(0, 1);
    string q = converter.to_bytes(c);
    cout << q << endl;
}
void trial2() {
    u16string a = u"\U00010000";
    cout << a.length() << endl; // 2
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    string b = converter.to_bytes(a);
}
int main() {
    // both don't work
    // trial1();
    // trial2();
    return 0;
}
I have verified that u16string can store Unicode outside the BMP as surrogate pairs, e.g. u"\U00010000" is stored as two char16_t.
So why does std::wstring_convert<codecvt_utf8<char16_t>, char16_t> fail for both trial1 and trial2 and throw an exception?
std::codecvt_utf8 does not support conversions to/from UTF-16, only UCS-2 and UTF-32. You need to use std::codecvt_utf8_utf16 instead.

HEX string to UTF-8 (Unicode) string

I have a hex string that contains Unicode code points, and I need to convert it to a UTF-8 (Unicode) string stored in a string variable.
I am new to Unicode and don't have much idea what to try.
std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
std::string unicode_string = getUnicodeString(HEX_string);
I expect the value صباح الخير in the unicode_string variable.
Since that hex string is a bunch of space-separated base-16 encoded Unicode codepoints, it's easy to convert using just standard functions, in particular std::c32rtomb():
#include <iostream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <clocale>
#include <cuchar>
#include <climits>
std::string getUnicodeString(const std::string &hex)
{
    std::istringstream codepoints{hex};
    std::string cp;
    std::string out;
    std::mbstate_t state{}; // must be zero-initialized before first use
    char u8[MB_LEN_MAX];
    while (codepoints >> cp) {
        char32_t c = std::stoul(cp, nullptr, 16);
        auto len = std::c32rtomb(u8, c, &state);
        if (len == std::size_t(-1)) {
            std::cerr << "Unable to convert " << cp << " to UTF-8 codepoint!\n";
            std::exit(EXIT_FAILURE);
        } else if (len > 0) {
            out.append(u8, len);
        }
    }
    return out;
}
int main() {
    // Make sure that c32rtomb() works with UTF-32 code units
    static_assert(__STDC_UTF_32__);
    // Requires a UTF-8 locale to get a UTF-8 string.
    std::setlocale(LC_ALL, "");
    std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
    std::string unicode_string = getUnicodeString(HEX_string);
    std::cout << unicode_string << '\n';
    return 0;
}
Compiling and running it produces:
$ echo $LANG
en_US.utf8
$ ./a.out
صباح الخير
Your sample has no code points outside the BMP, so there's no way to be sure whether your input is encoded as UTF-16 or UTF-32. The code above assumes UTF-32, but if it's UTF-16 you can change c32rtomb() to c16rtomb() and char32_t to char16_t, and it will handle UTF-16 surrogate pairs correctly.

C++ tolower on special characters such as ü

I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings it works as expected, but special characters are not converted successfully.
How I use my function:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}
For example:
Test -> test
TeST2 -> test2
Grüßen -> gr????en
(§) -> ()
As you can see, the third and fourth cases are not converted correctly.
How can I fix this issue? I have to keep the special chars, but as lowercase.
The sample code (below) from tolower shows how you fix this; you have to use something other than the default "C" locale.
#include <iostream>
#include <cctype>
#include <clocale>
int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1
    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
You might also change std::string to std::wstring, which is Unicode on many C++ implementations.
wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}
Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.
Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as used in Germany, making Grüßen all upper-case turns it into GRÜSSEN, because ß traditionally uppercases to SS (although there is now a capital ẞ). There are numerous other problems, such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
Finally, C++ has more sophisticated support for managing locales, see <locale> for details.
I think the most portable way to do this is to use the user-selected locale, which is achieved by setting the locale to "" (empty string).
std::locale::global(std::locale(""));
That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs and std::wcsrtombs) that convert between multi-byte and wide characters.
Then you can use those functions to convert from the system/user selected multi-byte characters (such as UTF-8) to system standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.
This is important because multi-byte encodings like UTF-8 cannot be processed with single-character operations such as std::tolower().
Once you have converted the wide string version to upper/lower case it can then be converted back to the system/user multibyte character set for printing to the console.
// Convert from multi-byte codes to wide string codes
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();
    // with a null destination, mbsrtowcs only counts; its length argument is ignored
    std::size_t len = 1 + mbsrtowcs(nullptr, &src, 0, &ps);
    ws.resize(len);
    src = mb.data();
    mbsrtowcs(&ws[0], &src, ws.size(), &ps);
    if (src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");
    ws.pop_back(); // drop the terminating null written by mbsrtowcs
    return ws;
}
// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();
    std::size_t len = 1 + wcsrtombs(nullptr, &src, 0, &ps);
    mb.resize(len);
    src = ws.data();
    wcsrtombs(&mb[0], &src, mb.size(), &ps);
    if (src)
        throw std::runtime_error("invalid wide character");
    mb.pop_back(); // drop the terminating null written by wcsrtombs
    return mb;
}
int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));
    try
    {
        std::string NotLowerCase = "Grüßen";
        std::cout << NotLowerCase << '\n';
        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;
        for (unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale(""));
        }
        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);
        std::cout << LowerCase << '\n';
    }
    catch (std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch (...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
Code not heavily tested
Use ASCII only, replacing everything else with '?':
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if (NotLowerCase[i] < 65 || NotLowerCase[i] > 122)
        LowerCase += '?';
    else
        LowerCase += tolower(NotLowerCase[i]);
}
Note that this throws the special characters away rather than lowercasing them.

Why doesn't Boost.Locale provide a character-level rule type?

Env: boost1.53.0 c++11;
New to c++.
In Boost.Locale boundary analysis, rule types are specified for words (e.g. boundary::word_letter, boundary::word_number) and sentences, but there is no boundary rule type for characters. All I want is something like isUpperCase(), isLowerCase(), isDigit(), isPunctuation().
I tried Boost String Algorithms, which didn't work:
boost::locale::generator gen;
std::locale loc = gen("ru_RU.UTF-8");
std::string context = "ДВ";
std::cout << boost::algorithm::all(context, boost::algorithm::is_upper(loc));
Why can these features be accessed easily in Java or Python but are so confusing in C++? Is there a consistent way to achieve this?
This works for me under VS 2013.
locale::global(locale("ru-RU"));
std::string context = "ДВ";
std::cout << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
Prints 1
It is important how you initialize the locale.
UPDATE:
Here's solution which will work under Ubuntu.
#include <iostream>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/predicate.hpp>
#include <boost/locale.hpp>
using namespace std;
int main()
{
    locale::global(locale("ru_RU"));
    wstring context = L"ДВ";
    wcout << boolalpha << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
    wcout << endl;
    wstring context1 = L"ПРИВЕТ, МИР"; // HELLO WORLD in Russian
    wcout << boolalpha << any_of(context1.begin(), context1.end(), boost::algorithm::is_upper());
    wcout << endl;
    wstring context2 = L"привет мир"; // hello world in Russian
    wcout << boolalpha << any_of(context2.begin(), context2.end(), boost::algorithm::is_upper());
    return 0;
}
Prints
true
true
false
This will work with boost::algorithm::all as well.
wstring context = L"ДВ";
wcout << boolalpha << boost::algorithm::all(context, boost::algorithm::is_upper());
Boost.Locale is based on ICU, and ICU itself does provide character-level classification, which seems pretty consistent and readable (more Java-style).
Here is a simple example.
#include <unicode/brkiter.h>
#include <unicode/utypes.h>
#include <unicode/uchar.h>
int main()
{
    UnicodeString s("А аБ Д д2 -");
    UErrorCode status = U_ZERO_ERROR; // start from a clean status
    Locale ru("ru", "RU");
    BreakIterator* bi = BreakIterator::createCharacterInstance(ru, status);
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        std::string type;
        if (u_isUUppercase(s.charAt(p)))
            type = "upper";
        if (u_isULowercase(s.charAt(p)))
            type = "lower";
        if (u_isUWhiteSpace(s.charAt(p)))
            type = "whitespace";
        if (u_isdigit(s.charAt(p)))
            type = "digit";
        if (u_ispunct(s.charAt(p)))
            type = "punc";
        printf("Boundary at position %d is %s\n", p, type.c_str());
        p = bi->next();
    }
    delete bi;
    return 0;
}

Converting a character to \uXXXX format in C/C++

I want to convert a string/char to \uXXXX format in a C/C++ program.
Suppose I have the character 'A'; I want to convert it to \u0041 (its standard Unicode escape).
Second, I was using a Unix command-line utility (printf) to print a \uXXXX string as a character. I tried "\u092b" and it printed a different character than my font file shows. Can anyone explain the reason behind this?
Here's a function using standard C++ to do this (though depending on CharT it may have some requirements that some valid implementation defined behavior doesn't meet).
#include <codecvt>
#include <sstream>
#include <iomanip>
#include <iostream>
template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const& input)
{
    // string converter from CharT to char. If CharT = char then no conversion is done.
    // If CharT is char32_t or char16_t then the conversion is UTF-32/16 -> UTF-8. Not all implementations support this yet.
    // If CharT is something else then this uses implementation-defined encodings and will only work for us if the implementation uses UTF-8 as the narrow char encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;
    // string converter from UTF-8 -> UTF-32. Not all implementations support this yet.
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;
    // convert from input encoding to UTF-32 (assuming convertA produces a UTF-8 string)
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));
    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}
template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const* input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}
int main() {
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000