C++ tolower on special characters such as ü - c++

I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings it works as expected; however, special characters are not converted successfully.
How I use my function:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}
For example:
Test -> test
TeST2 -> test2
Grüßen -> gr????en
(§) -> ()
As you can see, the third and fourth examples are not working as expected.
How can I fix this issue? I have to keep the special chars, but as lowercase.

The sample code below, from the tolower reference documentation, shows how you fix this: you have to use something other than the default "C" locale.
#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1
    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
You might also change std::string to std::wstring, whose wide characters hold Unicode code units on many C++ implementations (UTF-16 on Windows, UTF-32 on most Unix-like systems).
wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}
Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.
Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as spoken in Germany, making Grüßen all upper-case traditionally turns it into GRÜSSEN, with ß becoming SS (although there is now a capital ẞ). There are numerous other "problems" such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
Finally, C++ has more sophisticated support for managing locales, see <locale> for details.

I think the most portable way to do this is to use the user-selected locale, which is achieved by setting the locale to "" (an empty string).
std::locale::global(std::locale(""));
That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs & std::wcsrtombs) that convert between multi-byte and wide characters.
Then you can use those functions to convert from the system/user-selected multi-byte characters (such as UTF-8) to system-standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.
This is important because multi-byte character sets like UTF-8 cannot be converted using single-character operations like std::tolower().
Once you have converted the wide string version to upper/lower case, it can then be converted back to the system/user multibyte character set for printing to the console.
#include <cstdlib>
#include <cwchar>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Convert from multi-byte codes to wide string codes
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();
    std::size_t len = 1 + std::mbsrtowcs(nullptr, &src, 0, &ps);
    ws.resize(len);
    src = mb.data();
    std::mbsrtowcs(&ws[0], &src, ws.size(), &ps);
    if(src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");
    ws.pop_back(); // drop the terminating null written by mbsrtowcs
    return ws;
}

// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();
    std::size_t len = 1 + std::wcsrtombs(nullptr, &src, 0, &ps);
    mb.resize(len);
    src = ws.data();
    std::wcsrtombs(&mb[0], &src, mb.size(), &ps);
    if(src)
        throw std::runtime_error("invalid wide character");
    mb.pop_back(); // drop the terminating null
    return mb;
}
int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));
    try
    {
        std::string NotLowerCase = "Grüßen";
        std::cout << NotLowerCase << '\n';
        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;
        for(unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale(""));
        }
        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);
        std::cout << LowerCase << '\n';
    }
    catch(std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch(...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
Code not heavily tested

use ASCII
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if ((unsigned char)NotLowerCase[i] > 127) {
        LowerCase += '?'; // replace non-ASCII bytes
    } else {
        LowerCase += tolower(NotLowerCase[i]);
    }
}

Related

How to uppercase a u32string (char32_t) with a specific locale?

On Windows with Visual Studio 2017 I can use the following code to uppercase a u32string (which is based on char32_t):
#include <locale>
#include <iostream>
#include <string>

void toUpper(std::u32string& u32str, std::string localeStr)
{
    std::locale locale(localeStr);
    for (unsigned i = 0; i < u32str.size(); ++i)
        u32str[i] = std::toupper(u32str[i], locale);
}
The same thing is not working with macOS and XCode.
I'm getting such errors:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:795:44: error: implicit instantiation of undefined template 'std::__1::ctype<char32_t>'
return use_facet<ctype<_CharT> >(__loc).toupper(__c);
Is there a portable way of doing this?
I have found a solution: instead of using std::u32string I'm now using std::string with UTF-8 encoding.
Conversion from std::u32string to std::string (UTF-8) can be done via utf8-cpp: http://utfcpp.sourceforge.net/
The UTF-8 string then needs to be converted to std::wstring, because std::toupper is not implemented for std::u32string on all platforms.
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void toUpper(std::string& str, std::string localeStr)
{
    // unicode to wide string converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    // convert to wstring (because std::toupper is not implemented
    // on all platforms for u32string)
    std::wstring wide = converter.from_bytes(str);
    std::locale locale;
    try
    {
        locale = std::locale(localeStr);
    }
    catch(const std::exception&)
    {
        std::cerr << "locale not supported by system: " << localeStr
                  << " (" << getLocaleByLanguage(localeStr) << ")" << std::endl;
    }
    auto& f = std::use_facet<std::ctype<wchar_t>>(locale);
    f.toupper(&wide[0], &wide[0] + wide.size());
    // convert back
    str = converter.to_bytes(wide);
}
Note:
On Windows, localeStr has to be something like: en, de, fr, ...
On other systems, localeStr must be de_DE, fr_FR, en_US, ...

Why boost locale didn't provide character level rule type?

Env: boost1.53.0 c++11;
New to c++.
In Boost.Locale boundary analysis, the rule type is specified for word (e.g. boundary::word_letter, boundary::word_number) and sentence, but there is no boundary rule type for character. All I want is something like isUpperCase(), isLowerCase(), isDigit(), isPunctuation().
Tried boost string algorithm which didn't work.
boost::locale::generator gen;
std::locale loc = gen("ru_RU.UTF-8");
std::string context = "ДВ";
std::cout << boost::algorithm::all(context, boost::algorithm::is_upper(loc));
Why can these features be accessed easily in Java or Python but are so confusing in C++? Is there any consistent way to achieve this?
This works for me under VS 2013.
locale::global(locale("ru-RU"));
std::string context = "ДВ";
std::cout << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
Prints 1
It is important how you initialize the locale.
UPDATE:
Here's solution which will work under Ubuntu.
#include <iostream>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/predicate.hpp>
#include <boost/locale.hpp>

using namespace std;

int main()
{
    locale::global(locale("ru_RU"));
    wstring context = L"ДВ";
    wcout << boolalpha << any_of(context.begin(), context.end(), boost::algorithm::is_upper());
    wcout << endl;
    wstring context1 = L"ПРИВЕТ, МИР"; // HELLO WORLD in Russian
    wcout << boolalpha << any_of(context1.begin(), context1.end(), boost::algorithm::is_upper());
    wcout << endl;
    wstring context2 = L"привет мир"; // hello world in Russian
    wcout << boolalpha << any_of(context2.begin(), context2.end(), boost::algorithm::is_upper());
    return 0;
}
Prints
true
true
false
This will work with boost::algorithm::all as well.
wstring context = L"ДВ";
wcout << boolalpha << boost::algorithm::all(context, boost::algorithm::is_upper());
Boost.Locale is based on ICU, and ICU itself does provide character-level classification, which seems pretty consistent and readable (more Java-style).
Here is a simple example.
#include <unicode/brkiter.h>
#include <unicode/utypes.h>
#include <unicode/uchar.h>
#include <cstdio>
#include <string>

int main()
{
    UnicodeString s("А аБ Д д2 -");
    UErrorCode status = U_ZERO_ERROR;
    Locale ru("ru", "RU");
    BreakIterator* bi = BreakIterator::createCharacterInstance(ru, status);
    bi->setText(s);
    int32_t p = bi->first();
    while(p != BreakIterator::DONE) {
        std::string type;
        if(u_isUUppercase(s.charAt(p)))
            type = "upper";
        if(u_isULowercase(s.charAt(p)))
            type = "lower";
        if(u_isUWhiteSpace(s.charAt(p)))
            type = "whitespace";
        if(u_isdigit(s.charAt(p)))
            type = "digit";
        if(u_ispunct(s.charAt(p)))
            type = "punc";
        printf("Boundary at position %d is %s\n", p, type.c_str());
        p = bi->next();
    }
    delete bi;
    return 0;
}

Utf-8 to URI percent encoding

I'm trying to convert Unicode code points to percent encoded UTF-8 code units.
The Unicode -> UTF-8 conversion seems to be working correctly as shown by some testing with Hindi and Chinese characters which show up correctly in Notepad++ with UTF-8 encoding, and can be translated back properly.
I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the unicode U+5123).
What am I doing wrong?
Added code (note that utf8.h belongs to the UTF8-CPP library).
#include <fstream>
#include <iostream>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0,0,0,0,0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }
    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }
    ofs.close();
    return 0;
}
You actually need to convert byte values to the corresponding ASCII strings. For example:
"é" in UTF-8 is the byte sequence { 0xC3, 0xA9 }. Please note that these are bytes, char values in C++.
Each byte needs to be converted to "%C3" and "%A9" respectively.
The best way to do so is to use a string stream, padding each byte to two hex digits (this needs <sstream> and <iomanip>):
std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";
for (std::size_t i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::hex << std::uppercase << std::setw(2) << std::setfill('0')
        << (int)(unsigned char)utf8str[i];
}
Or in C++11:
for (auto c : utf8str) {
    out << '%' << std::hex << std::uppercase << std::setw(2) << std::setfill('0')
        << (int)(unsigned char)c;
}
Please note that each byte needs to be cast to int, because otherwise the << operator writes the raw byte value instead of its hex digits.
Casting to unsigned char first is needed because otherwise the sign bit propagates into the int value, causing output of negative values like FFFFFFE5.

converting a character to \uXXXX format in C/C++

I want to convert a string/char to \uXXXX format in a C/C++ program.
Suppose I have the character 'A'; I want to print it converted as \u0041 (the standard Unicode notation).
Secondly, I was using a Unix command-line utility (printf) to print a \uXXXX string as a character. I tried "\u092b" and it prints a different character than the one in my font file. Can anyone please explain the reason behind this?
Here's a function using standard C++ to do this (though depending on CharT it may have some requirements that some valid implementation defined behavior doesn't meet).
#include <codecvt>
#include <sstream>
#include <iomanip>
#include <iostream>

template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const &input)
{
    // String converter from CharT to char. If CharT = char then no conversion
    // is done. If CharT is char32_t or char16_t then the conversion is
    // UTF-32/16 -> UTF-8. Not all implementations support this yet.
    // If CharT is something else then this uses implementation-defined
    // encodings and will only work for us if the implementation uses UTF-8
    // as the narrow char encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;
    // String converter from UTF-8 -> UTF-32. Not all implementations support this yet.
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;
    // Convert from input encoding to UTF-32 (assuming convertA produces a UTF-8 string).
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));
    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}

template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const *input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}

int main()
{
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000

restore runtime unicode strings

I'm building an application that receives runtime strings with encoded Unicode via TCP; an example string would be "\u7cfb\u8eca\u4e21\uff1a\u6771\u5317 ...". I have the following, but unfortunately I can only benefit from it at compile time, due to: incomplete universal character name \u, since the compiler expects 4 hexadecimal characters at compile time.
QString restoreUnicode(QString strText)
{
QRegExp rx("\\\\u([0-9a-z]){4}");
return strText.replace(rx, QString::fromUtf8("\u\\1"));
}
I'm seeking a solution at runtime. I could break up these strings and convert the hexadecimal digits after the "\u" delimiters to base 10, then pass them into the constructor of a QChar, but I'm looking for a better way if one exists, as I am very concerned about the time complexity incurred by such a method and am not an expert.
Does anyone have any solutions or tips.
You should decode the string yourself. Just find the Unicode entry (rx.indexIn(strText)), parse it (int result; std::istringstream iss(s); if (!(iss >> std::hex >> result).fail()) ...) and replace the original \\uXXXX substring with (wchar_t)result.
For closure, and for anyone who comes across this thread in future, here is my initial solution (before optimising the scope of these variables). I'm not a fan of it, but it works given the unpredictable mix of Unicode and/or ASCII in the stream, over which I have no control (client only). While the Unicode presence is low, it is good to handle it instead of showing ugly \u1234 sequences.
QString restoreUnicode(QString strText)
{
    QRegExp rxUnicode("\\\\u([0-9a-z]){4}");
    bool bSuccessFlag;
    int iSafetyOffset = 0;
    int iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);
    while (iNeedle != -1)
    {
        QChar cCodePoint(strText.mid(iNeedle + 2, 4).toInt(&bSuccessFlag, 16));
        if ( bSuccessFlag )
            strText = strText.replace(strText.mid(iNeedle, 6), QString(cCodePoint));
        else
            iSafetyOffset = iNeedle + 1; // hop over non code point to avoid lock
        iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);
    }
    return strText;
}
#include <assert.h>
#include <iostream>
#include <string>
#include <sstream>
#include <locale>
#include <codecvt> // C++11

using namespace std;

int main()
{
    char const data[] = "\\u7cfb\\u8eca\\u4e21\\uff1a\\u6771\\u5317";
    istringstream stream( data );
    wstring ws;
    int code;
    char slashCh, uCh;
    while( stream >> slashCh >> uCh >> hex >> code )
    {
        assert( slashCh == '\\' && uCh == 'u' );
        ws += wchar_t( code );
    }
    cout << "Unicode code points:" << endl;
    for( auto it = ws.begin(); it != ws.end(); ++it )
    {
        cout << hex << 0 + *it << endl;
    }
    cout << endl;
    // The following is C++11 specific.
    cout << "UTF-8 encoding:" << endl;
    wstring_convert< codecvt_utf8< wchar_t > > converter;
    string const bytes = converter.to_bytes( ws );
    for( auto it = bytes.begin(); it != bytes.end(); ++it )
    {
        cout << hex << 0 + (unsigned char)*it << ' ';
    }
    cout << endl;
}