Hex string to UTF-8 (Unicode) string - C++

I have a hex string that contains Unicode characters, and I need to convert it to a UTF-8 (Unicode) string and store it in a string variable.
I am new to Unicode and don't have much idea of what to try.
std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
std::string unicode_string = getUnicodeString(HEX_string);
I expect the value صباح الخير in the unicode_string variable.

Since that hex string is a bunch of space-separated base-16 encoded Unicode codepoints, it's easy to convert using just standard functions, in particular std::c32rtomb():
#include <iostream>
#include <string>
#include <sstream>
#include <cstdlib>
#include <clocale>
#include <cuchar>
#include <climits>

std::string
getUnicodeString(const std::string &hex)
{
    std::istringstream codepoints{hex};
    std::string cp;
    std::string out;
    std::mbstate_t state{};    // must be zero-initialized before first use
    char u8[MB_LEN_MAX];
    while (codepoints >> cp) {
        // Parse one base-16 codepoint and append its UTF-8 encoding
        char32_t c = std::stoul(cp, nullptr, 16);
        auto len = std::c32rtomb(u8, c, &state);
        if (len == std::size_t(-1)) {
            std::cerr << "Unable to convert " << cp << " to UTF-8!\n";
            std::exit(EXIT_FAILURE);
        } else if (len > 0) {
            out.append(u8, len);
        }
    }
    return out;
}
int main() {
    // Make sure that c32rtomb() works with UTF-32 code units
    static_assert(__STDC_UTF_32__);
    // Requires a UTF-8 locale to get a UTF-8 string.
    std::setlocale(LC_ALL, "");
    std::string HEX_string = "0635 0628 0627 062d 0020 0627 0644 062e 064a 0631";
    std::string unicode_string = getUnicodeString(HEX_string);
    std::cout << unicode_string << '\n';
    return 0;
}
After compiling it, running it produces:
$ echo $LANG
en_US.utf8
$ ./a.out
صباح الخير
That sample doesn't contain any codepoints outside the BMP, so it's impossible to tell whether your input is encoded as UTF-16 or UTF-32. The code above assumes UTF-32, but if it's UTF-16 you can change c32rtomb() to c16rtomb() and char32_t to char16_t, and it will handle UTF-16 surrogate pairs correctly.
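For reference, a minimal sketch of what that UTF-16 variant could look like, assuming the same space-separated hex input (getUnicodeStringUTF16 is a hypothetical name; surrogate halves arrive as separate code units and c16rtomb() pairs them up through the mbstate_t):
std::string
getUnicodeStringUTF16(const std::string &hex)
{
    std::istringstream codepoints{hex};
    std::string cp;
    std::string out;
    std::mbstate_t state{};    // zero-initialized; also carries pending high surrogates
    char u8[MB_LEN_MAX];
    while (codepoints >> cp) {
        char16_t c = std::stoul(cp, nullptr, 16);
        // Returns 0 for a high surrogate (stored in state), and the
        // full UTF-8 sequence length once the low surrogate arrives.
        auto len = std::c16rtomb(u8, c, &state);
        if (len == std::size_t(-1)) {
            std::cerr << "Unable to convert " << cp << " to UTF-8!\n";
            std::exit(EXIT_FAILURE);
        } else if (len > 0) {
            out.append(u8, len);
        }
    }
    return out;
}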

Related

Why can't codecvt convert Unicode outside the BMP to u16string?

I am trying to understand C++ Unicode handling, and this has me confused.
Code:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
using namespace std;

void trial1() {
    string a = "\U00010000z";
    cout << a << endl;
    u16string b;
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    b = converter.from_bytes(a);
    u16string c = b.substr(0, 1);
    string q = converter.to_bytes(c);
    cout << q << endl;
}

void trial2() {
    u16string a = u"\U00010000";
    cout << a.length() << endl; // 2
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    string b = converter.to_bytes(a);
}

int main() {
    // both don't work
    // trial1();
    // trial2();
    return 0;
}
I have verified that u16string can store Unicode outside the BMP as surrogate pairs; e.g. u"\U00010000" is stored as 2 char16_t.
So why does std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter; not work in either trial1 or trial2, throwing an exception instead?
std::codecvt_utf8 does not support conversions to/from UTF-16, only UCS-2 and UTF-32. You need to use std::codecvt_utf8_utf16 instead.
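For illustration, a minimal sketch of the fix (keeping in mind that std::wstring_convert and these codecvt facets were deprecated in C++17):
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // codecvt_utf8_utf16 converts UTF-8 <-> UTF-16 and understands
    // surrogate pairs, so code points outside the BMP round-trip.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    // Assumes a UTF-8 execution charset, as in the question's trial1.
    std::u16string b = converter.from_bytes("\U00010000z");
    std::cout << b.length() << '\n';          // 3: surrogate pair + 'z'
    std::string q = converter.to_bytes(b);    // no exception this time
    std::cout << q << '\n';
    return 0;
}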

How to uppercase a u32string (char32_t) with a specific locale?

On Windows with Visual Studio 2017 I can use the following code to uppercase a u32string (which is based on char32_t):
#include <locale>
#include <iostream>
#include <string>

void toUpper(std::u32string& u32str, std::string localeStr)
{
    std::locale locale(localeStr);
    for (unsigned i = 0; i < u32str.size(); ++i)
        u32str[i] = std::toupper(u32str[i], locale);
}
The same code does not work on macOS with Xcode.
I'm getting errors like this:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:795:44: error: implicit instantiation of undefined template 'std::__1::ctype<char32_t>'
return use_facet<ctype<_CharT> >(__loc).toupper(__c);
Is there a portable way of doing this?
I have found a solution: instead of std::u32string I'm now using std::string with UTF-8 encoding.
Conversion from std::u32string to std::string (UTF-8) can be done via utf8-cpp: http://utfcpp.sourceforge.net/
The UTF-8 string then needs to be converted to std::wstring, because std::toupper is not implemented for std::u32string on all platforms.
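For reference, that first u32string -> UTF-8 step could look like this with utf8-cpp (a sketch using the library's utf8::utf32to8 iterator interface):
#include <iterator>
#include <string>
#include "utf8.h"

// Convert a std::u32string to a UTF-8 encoded std::string.
std::string u32_to_utf8(const std::u32string& u32)
{
    std::string out;
    utf8::utf32to8(u32.begin(), u32.end(), std::back_inserter(out));
    return out;
}
The case conversion itself then happens on the UTF-8 string: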
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void toUpper(std::string& str, std::string localeStr)
{
    // unicode to wide string converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    // convert to wstring (because std::toupper is not implemented
    // on all platforms for u32string)
    std::wstring wide = converter.from_bytes(str);
    std::locale locale;
    try
    {
        locale = std::locale(localeStr);
    }
    catch (const std::exception&)
    {
        std::cerr << "locale not supported by system: " << localeStr << std::endl;
    }
    auto& f = std::use_facet<std::ctype<wchar_t>>(locale);
    f.toupper(&wide[0], &wide[0] + wide.size());
    // convert back
    str = converter.to_bytes(wide);
}
Note:
On Windows, localeStr has to be something like: en, de, fr, ...
On other systems, localeStr must be something like: de_DE, fr_FR, en_US, ...
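A hypothetical usage sketch (the locale name must be one your system actually provides, per the note above):
#include <iostream>
#include <string>

int main()
{
    std::string s = "grüßen";   // assumes a UTF-8 source/execution charset
    toUpper(s, "de_DE");        // "de" on Windows
    // Expect something like "GRÜßEN": a per-character mapping cannot
    // turn ß into "SS", since that would change the string length.
    std::cout << s << '\n';
    return 0;
}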

C++ tolower on special characters such as ü

I have trouble transforming a string to lowercase with the tolower() function in C++. With plain ASCII strings it works as expected, but special characters are not converted successfully.
How I use my function:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}
For example:
Test -> test
TeST2 -> test2
Grüßen -> gr????en
(§) -> ()
The third and fourth examples are not working as expected, as you can see.
How can I fix this issue? I have to keep the special chars, but as lowercase.
The sample code (below) from tolower shows how you fix this; you have to use something other than the default "C" locale.
#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1
    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
You might also change std::string to std::wstring which is Unicode on many C++ implementations.
wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}
Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.
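A sketch of the same loop normalized to uppercase instead (towupper comes from <cwctype>):
wstring Upper;
for (auto&& ch : NotLowerCase) {
    Upper += towupper(ch);
}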
Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as spoken in Germany, making Grüßen all upper-case traditionally turns it into GRÜSSEN, with ß becoming SS (although there is now a capital ẞ). There are numerous other problems, such as combining characters; if you're doing real production work with strings, you really want a completely different approach.
Finally, C++ has more sophisticated support for managing locales, see <locale> for details.
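For example, a minimal sketch that lowercases a whole wide string through the ctype facet (to_lower is a hypothetical helper; the locale passed in must exist on the system):
#include <locale>
#include <string>

std::wstring to_lower(std::wstring s, const std::locale& loc)
{
    // The facet transforms the buffer in place, one call for the whole range.
    auto& f = std::use_facet<std::ctype<wchar_t>>(loc);
    f.tolower(&s[0], &s[0] + s.size());
    return s;
}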
I think the most portable way to do this is to use the user-selected locale, which is achieved by setting the locale to "" (the empty string).
std::locale::global(std::locale(""));
That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs & std::wcsrtombs) that convert between multi-byte and wide characters.
Then you can use those functions to convert from the system/user selected multi-byte characters (such as UTF-8) to system standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.
This is important because multi-byte character sets like UTF-8 cannot be converted using single-character operations like std::tolower().
Once you have converted the wide string version to upper/lower case it can then be converted back to the system/user multibyte character set for printing to the console.
#include <cstdlib>
#include <cwchar>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Convert from multi-byte codes to wide string codes
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();
    // First call with a null destination just computes the length
    std::size_t len = 1 + std::mbsrtowcs(nullptr, &src, 0, &ps);
    ws.resize(len);
    src = mb.data();
    std::mbsrtowcs(&ws[0], &src, ws.size(), &ps);
    if (src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");
    ws.pop_back();    // drop the terminating null
    return ws;
}

// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();
    std::size_t len = 1 + std::wcsrtombs(nullptr, &src, 0, &ps);
    mb.resize(len);
    src = ws.data();
    std::wcsrtombs(&mb[0], &src, mb.size(), &ps);
    if (src)
        throw std::runtime_error("invalid wide character");
    mb.pop_back();    // drop the terminating null
    return mb;
}
int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));
    try
    {
        std::string NotLowerCase = "Grüßen";
        std::cout << NotLowerCase << '\n';
        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;
        for (unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale(""));
        }
        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);
        std::cout << LowerCase << '\n';
    }
    catch (std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch (...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
Code not heavily tested
Use ASCII:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if (NotLowerCase[i] < 65 || NotLowerCase[i] > 122) {
        LowerCase += '?';
    }
    else {
        LowerCase += tolower(NotLowerCase[i]);
    }
}

UTF-8 to URI percent encoding

I'm trying to convert Unicode code points to percent encoded UTF-8 code units.
The Unicode -> UTF-8 conversion seems to be working correctly as shown by some testing with Hindi and Chinese characters which show up correctly in Notepad++ with UTF-8 encoding, and can be translated back properly.
I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the Unicode code point U+5123).
What am I doing wrong?
Added code (note that utf8.h belongs to the UTF8-CPP library).
#include <fstream>
#include <iostream>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0, 0, 0, 0, 0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }
    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }
    ofs.close();
    return 0;
}
You actually need to convert the byte values to their corresponding ASCII strings. For example:
"é" in UTF-8 is the byte sequence { 0xc3, 0xa9 }. Please note that these are bytes, i.e. char values in C++.
Each byte needs to be converted to "%C3" and "%A9" respectively.
The best way to do this is with a stringstream:
std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";
for (std::size_t i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)utf8str[i];
}
Or in C++11:
for (auto c : utf8str) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)c;
}
Please note that each byte needs to be cast to int, because otherwise the << operator would print it as a character rather than as a numeric value.
Casting to unsigned char first is needed because otherwise the sign bit would propagate into the int value, causing output of negative values like FFFFFFE5.
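Putting it together, a minimal sketch of a complete encoder (percent_encode is a hypothetical helper; the std::setw(2)/std::setfill('0') padding is an addition to the snippet above, needed so bytes below 0x10 still produce two hex digits):
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Percent-encode every byte of a UTF-8 string. A real URI encoder
// would leave unreserved characters (A-Z a-z 0-9 - _ . ~) as-is;
// this sketch encodes everything for simplicity.
std::string percent_encode(const std::string& utf8str)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char c : utf8str)
        out << '%' << std::setw(2) << static_cast<int>(c);
    return out.str();
}

int main()
{
    std::cout << percent_encode("\xE5\x84\xA3") << '\n'; // %E5%84%A3
}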

Converting a character to \uxxxx format in C/C++

I want to convert a string/char to \uxxxx format in a C/C++ program.
Suppose I have the character 'A'; I want to print it converted to \u0041 (its standard Unicode escape).
Secondly, I was using a Unix command-line utility (printf) to print a \uxxxx string as a character. I tried "\u092b" and it printed a different character than the one in my font file. Can anyone please explain the reason behind this?
Here's a function using standard C++ to do this (though depending on CharT it may have some requirements that some valid implementation-defined behavior doesn't meet).
#include <codecvt>
#include <sstream>
#include <iomanip>
#include <iostream>

template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const &input)
{
    // String converter from CharT to char. If CharT = char then no conversion is done.
    // If CharT is char32_t or char16_t then the conversion is UTF-32/16 -> UTF-8.
    // Not all implementations support this yet.
    // If CharT is something else then this uses implementation-defined encodings
    // and will only work for us if the implementation uses UTF-8 as the narrow
    // char encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;
    // String converter from UTF-8 -> UTF-32. Not all implementations support this yet.
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;
    // Convert from the input encoding to UTF-32 (assuming convertA produces a UTF-8 string)
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));
    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}

template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const *input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}

int main() {
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000