Converting a character to \uXXXX format in C/C++ - c++

I want to convert a string/char to \uXXXX format in a C/C++ program.
Suppose I have the character 'A'; I want to print it as \u0041 (its standard Unicode escape).
Second, I was using the Unix printf command-line utility to print a \uXXXX string as a character. I tried "\u092b" and it printed a different character than the one in my font file. Can anyone please explain the reason behind this?

Here's a function using standard C++ to do this (though, depending on CharT, it has requirements that some valid implementation-defined behavior doesn't meet).
#include <codecvt>
#include <iomanip>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const &input)
{
    // String converter from CharT to char. If CharT = char then no conversion is done.
    // If CharT is char32_t or char16_t then the conversion is UTF-32/16 -> UTF-8.
    // Not all implementations support this yet.
    // If CharT is something else then this uses implementation-defined encodings and will
    // only work for us if the implementation uses UTF-8 as the narrow char encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;
    // String converter from UTF-8 -> UTF-32. Not all implementations support this yet.
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;

    // Convert from the input encoding to UTF-32 (assuming convertA produces a UTF-8 string).
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));

    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}

template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const *input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}

int main()
{
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000

Related

Why can't codecvt convert Unicode outside the BMP to u16string?

I am trying to understand Unicode in C++, and the following has me confused.
Code:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

using namespace std;

void trial1() {
    string a = "\U00010000z";
    cout << a << endl;

    u16string b;
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    b = converter.from_bytes(a);

    u16string c = b.substr(0, 1);
    string q = converter.to_bytes(c);
    cout << q << endl;
}

void trial2() {
    u16string a = u"\U00010000";
    cout << a.length() << endl; // 2
    std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter;
    string b = converter.to_bytes(a);
}

int main() {
    // both don't work
    // trial1();
    // trial2();
    return 0;
}
I have tested that u16string can store Unicode characters outside the BMP as surrogate pairs, e.g. u"\U00010000" is stored as 2 char16_t units.
So why does std::wstring_convert<codecvt_utf8<char16_t>, char16_t> converter; not work for either trial1 or trial2, instead throwing an exception?
std::codecvt_utf8 does not support conversions to/from UTF-16, only UCS-2 and UTF-32. You need to use std::codecvt_utf8_utf16 instead.
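For illustration, a minimal sketch of the fixed round trip using std::codecvt_utf8_utf16, which does handle surrogate pairs (these facets are deprecated since C++17 but still available; the printed sizes are my expectation, not output from the original program):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // codecvt_utf8_utf16 converts between UTF-8 bytes and UTF-16 code units,
    // so a character outside the BMP round-trips as a surrogate pair.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;

    std::u16string utf16 = u"\U00010000z";              // surrogate pair + 'z'
    std::string    utf8  = converter.to_bytes(utf16);   // 4 bytes + 1 byte
    std::u16string back  = converter.from_bytes(utf8);

    std::cout << utf8.size() << ' ' << back.size() << '\n'; // expected: 5 3
}

Note that the substr(0, 1) in trial1 would still fail even with the right facet, because it slices the surrogate pair in half and to_bytes then sees an invalid lone surrogate.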

How to uppercase a u32string (char32_t) with a specific locale?

On Windows with Visual Studio 2017 I can use the following code to uppercase a u32string (which is based on char32_t):
#include <locale>
#include <iostream>
#include <string>

void toUpper(std::u32string& u32str, std::string localeStr)
{
    std::locale locale(localeStr);
    for (unsigned i = 0; i < u32str.size(); ++i)
        u32str[i] = std::toupper(u32str[i], locale);
}
The same code does not work on macOS with Xcode.
I'm getting errors like this:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__locale:795:44: error: implicit instantiation of undefined template 'std::__1::ctype<char32_t>'
return use_facet<ctype<_CharT> >(__loc).toupper(__c);
Is there a portable way of doing this?
I have found a solution:
Instead of std::u32string, I'm now using std::string with UTF-8 encoding.
Conversion from std::u32string to std::string (UTF-8) can be done via utf8-cpp: http://utfcpp.sourceforge.net/
The UTF-8 string then needs to be converted to std::wstring, because std::toupper is not implemented for std::u32string on all platforms.
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

void toUpper(std::string& str, std::string localeStr)
{
    // Unicode (UTF-8) to wide string converter.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    // Convert to wstring (because std::toupper is not implemented on all platforms for u32string).
    std::wstring wide = converter.from_bytes(str);

    std::locale locale;
    try
    {
        locale = std::locale(localeStr);
    }
    catch (const std::exception&)
    {
        std::cerr << "locale not supported by system: " << localeStr
                  << " (" << getLocaleByLanguage(localeStr) << ")" << std::endl;
    }

    auto& f = std::use_facet<std::ctype<wchar_t>>(locale);
    f.toupper(&wide[0], &wide[0] + wide.size());

    // Convert back to UTF-8.
    str = converter.to_bytes(wide);
}
Note:
On Windows, localeStr has to be something like: en, de, fr, ...
On other systems, localeStr must be something like de_DE, fr_FR, en_US, ...
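Since the accepted spellings differ per platform, one option is to try several candidates and fall back to the classic locale; a small sketch (the helper name findLocale is my own, not part of the answer):

#include <locale>
#include <string>
#include <vector>

// Try several spellings of a locale name: "de" tends to work on Windows,
// while POSIX systems usually want "de_DE" or "de_DE.UTF-8". Falls back to
// the classic "C" locale if none of them exist on the system.
std::locale findLocale(const std::vector<std::string>& candidates)
{
    for (const auto& name : candidates)
    {
        try { return std::locale(name); }
        catch (const std::runtime_error&) { /* try the next spelling */ }
    }
    return std::locale::classic();
}

// Usage: std::locale loc = findLocale({"de_DE.UTF-8", "de_DE", "de"});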

How to store an Arabic symbol in wchar_t or char32_t?

Compiler: clang 3.5.0
The following code works as I expected:
#include <iostream>

char cp[] = "ي";

int main()
{
    std::cout << cp; // Prints ي
}
But if we try to store that symbol in a char32_t or wchar_t, we get an error:
#include <iostream>

wchar_t t = 'ي'; // error: character too large for enclosing character literal type

int main() { }
Is it possible to store such symbols in a wchar_t or char32_t object? I suspect it depends on the particular compiler and OS I'm using.
Use L"ي" to put it in wchar_t, and later print it out using std::wcout:
Use U"ي" to put it in char32_t, and later print it out using std::wcout after being cast by wchar_t:
#include <iostream>

wchar_t  cp[]  = L"ي";
char32_t cp2[] = U"ي";

int main()
{
    std::wcout << cp;               // Prints ي
    std::wcout << (wchar_t) cp2[0]; // Prints ي (first character, cast to wchar_t)
}
PS: This doesn't work if you are trying to write text that cannot be represented in your default locale.
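One thing that often helps on POSIX systems is imbuing the wide streams with the environment's locale rather than the default "C" locale; a minimal sketch (whether the glyph actually appears still depends on the terminal and font):

#include <iostream>
#include <locale>

int main()
{
    // Use the user's environment locale ("" = whatever LANG/LC_ALL say)
    // for wide-character output instead of the default "C" locale.
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    wchar_t cp[] = L"ي";
    std::wcout << cp << L'\n';
}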

UTF-8 to URI percent-encoding

I'm trying to convert Unicode code points to percent-encoded UTF-8 code units.
The Unicode -> UTF-8 conversion seems to be working correctly, as shown by testing with Hindi and Chinese characters, which show up correctly in Notepad++ with UTF-8 encoding and can be translated back properly.
I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the code point U+5123).
What am I doing wrong?
Added code (note that utf8.h belongs to the UTF8-CPP library).
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0, 0, 0, 0, 0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }

    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }

    ofs.close();
    return 0;
}
You actually need to convert the byte values to their corresponding ASCII hex strings. For example:
"é" in UTF-8 is the sequence { 0xC3, 0xA9 }. Please note that these are bytes, i.e. char values in C++.
Each byte needs to be converted to "%C3" and "%A9" respectively.
The best way to do this is with a string stream:
std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";
for (std::string::size_type i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)utf8str[i];
}
Or in C++11:
for (auto c : utf8str) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)c;
}
Please note that each byte needs to be cast to int, otherwise the << operator writes the literal character rather than its numeric value.
Casting to unsigned char first is needed because otherwise the sign bit propagates into the int value, producing output of negative values like FFFFFFE5.
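Putting it together, a self-contained sketch of such an encoder (the function name percent_encode and the setw/setfill zero padding are my additions; the padding only matters for bytes below 0x10, which never appear inside multi-byte UTF-8 sequences but do appear for ASCII control characters):

#include <iomanip>
#include <sstream>
#include <string>

// Percent-encode every byte of an already UTF-8 encoded string.
// setw(2)/setfill('0') guarantee exactly two hex digits per byte.
// A real URI encoder would also pass unreserved ASCII characters
// (A-Z, a-z, 0-9, -._~) through unescaped.
std::string percent_encode(const std::string& utf8str)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char byte : utf8str)
        out << '%' << std::setw(2) << static_cast<int>(byte);
    return out.str();
}

// percent_encode("\xE5\x84\xA3") == "%E5%84%A3"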

C++ convert integer to hex string for Color in GraphicsMagick

How can I convert an integer ranging from 0 to 255 to a string with exactly two characters containing the hexadecimal representation of the number?
Example
input: 180
output: "B4"
My goal is to set a grayscale color in GraphicsMagick. So, using the same example, I want the following final output:
"#B4B4B4"
so that I can use it to assign the color: Color("#B4B4B4");
Should be easy, right?
You don't need to; there's an easier way:
ColorRGB(red/255., green/255., blue/255.)
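A minimal usage sketch, assuming the usual Magick++ entry points shipped with GraphicsMagick (both color objects below describe the same grey value 180):

#include <Magick++.h>

int main(int argc, char** argv)
{
    Magick::InitializeMagick(argv[0]);

    int grey = 180;
    // Equivalent ways to express grey 180: normalized doubles vs. a hex string.
    Magick::ColorRGB fromDoubles(grey / 255., grey / 255., grey / 255.);
    Magick::Color    fromHex("#B4B4B4");

    // Create a small solid-grey image and write it out.
    Magick::Image image(Magick::Geometry(64, 64), fromDoubles);
    image.write("grey.png");
    return 0;
}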
You can use the native formatting features of the IOStreams part of the C++ Standard Library, like this:
#include <string>
#include <sstream>
#include <iostream>
#include <ios>
#include <iomanip>

std::string getHexCode(unsigned char c) {
    // Not necessarily the most efficient approach,
    // creating a new stringstream each time.
    // It'll do, though.
    std::stringstream ss;

    // Set stream modes
    ss << std::uppercase << std::setw(2) << std::setfill('0') << std::hex;

    // Stream in the character's ASCII code
    // (using `+` for promotion to `int`)
    ss << +c;

    // Return resultant string content
    return ss.str();
}

int main() {
    // Output: "B4, 04"
    std::cout << getHexCode(180) << ", " << getHexCode(4);
}
Use printf with the %x format specifier. (Going the other direction, strtol with base 16 parses a hex string back into an integer.)
#include <cstdio>

int main()
{
    int a = 180;
    printf("%x\n", a); // prints "b4"
    return 0;
}
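If the goal is the full GraphicsMagick color string, the same idea with zero padding and uppercase ("%02X") can build it in one call; a small sketch (greyHex is a hypothetical helper, not from the answer):

#include <cstdio>
#include <string>

// Build "#B4B4B4" from a grey value in 0..255.
// "%02X" prints exactly two uppercase hex digits per component.
std::string greyHex(int value)
{
    char buf[8]; // "#RRGGBB" + terminating NUL
    std::snprintf(buf, sizeof buf, "#%02X%02X%02X", value, value, value);
    return std::string(buf);
}

// greyHex(180) == "#B4B4B4"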