Print char32_t to console - c++

How can I print (cout / wcout / ...) char32_t to console in C++11?
The following code prints hex values:
u32string s2 = U"Добрый день";
for (auto x : s2) {
    wcout << (char32_t)x << endl;
}

First, I don't think wcout is supposed to print as characters anything but char and wchar_t. char32_t is neither.
Here's a sample program that prints individual wchar_t's:
#include <iostream>
using namespace std;
int main()
{
    wcout << (wchar_t)0x41 << endl;
    return 0;
}
Output (ideone):
A
Currently, it's impossible to get consistent Unicode output in the console even in major OSes. Simplistic Unicode text output via cout, wcout, printf(), wprintf() and the like won't work on Windows without major hacks. The core problem with getting readable Unicode text in the Windows console is having, and being able to select, fonts that actually cover the characters you print; Windows' console is quite broken in this respect. See this answer of mine and follow the link(s) in it.

I know this is very old, but I had to solve it on my own and there you go.
The idea is to switch between the UTF-8 and UTF-32 encodings of Unicode: you can cout u8 strings directly, so just translate the UTF-32 encoded char32_t values to UTF-8 and you're done. These are the low-level functions I came up with (no modern C++). They can probably be optimized, so any suggestion is appreciated.
char* char_utf32_to_utf8(char32_t utf32, const char* buffer)
// Encodes the UTF-32 encoded char into a UTF-8 string.
// Stores the result in the buffer and returns the position
// of the end of the buffer
// (unchecked access, be sure to provide a buffer that is big enough)
{
    char* end = const_cast<char*>(buffer);
    if (utf32 < 0x80) {          // one byte (0x80, not 0x7F: U+007F is still single-byte)
        *(end++) = static_cast<unsigned>(utf32);
    }
    else if (utf32 < 0x800) {    // two bytes (0x800, not 0x7FF, for the same reason)
        *(end++) = 0b1100'0000 + static_cast<unsigned>(utf32 >> 6);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    }
    else if (utf32 < 0x10000) {  // three bytes
        *(end++) = 0b1110'0000 + static_cast<unsigned>(utf32 >> 12);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 6) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    }
    else if (utf32 < 0x110000) { // four bytes
        *(end++) = 0b1111'0000 + static_cast<unsigned>(utf32 >> 18);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 12) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 6) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    }
    else throw encoding_error(end); // encoding_error is a user-defined exception type (not shown)
    *end = '\0';
    return end;
}
You can wrap this function in a class, a constructor, a template, or whatever you prefer.
Here is the overloaded operator for a char32_t array:
std::ostream& operator<<(std::ostream& os, const char32_t* s)
{
    char buffer[5] {0}; // That's the famous "big-enough buffer" (must not be
                        // const: char_utf32_to_utf8 writes into it)
    while (s && *s)
    {
        char_utf32_to_utf8(*(s++), buffer);
        os << buffer;
    }
    return os;
}
and with the u32string:
std::ostream& operator<<(std::ostream& os, const std::u32string& s)
{
    return (os << s.c_str());
}
Running the simplest stupidest test with the Unicode characters found on Wikipedia
int main()
{
    std::cout << std::u32string(U"\x10437\x20AC") << std::endl;
}
leads to 𐐷€ printed on the (Linux) console. This should be tested with different Unicode characters, though...
Also, reading raw UTF-32 data from elsewhere varies with endianness, but I'm sure you can work out the solution starting from this.

Related

How to convert accented character to hex in C++? [duplicate]

Referring to the ISO-8859-1 (Latin-1) encoding:
The capital E acute (É) has a hex value of C9.
I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.
Currently, I am only able to write a function that converts an ASCII string to hex:
std::string Helper::ToHex(std::string input) {
    std::stringstream strstream;
    std::string output;
    for (int i = 0; i < input.length(); i++) {
        strstream << std::hex << unsigned(input[i]);
    }
    strstream >> output;
    return output;
}
However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.
std::string has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string is already holding so it can be converted to ISO-8859-1 if needed.
FYI, ffffffc3ffffff89 is simply the two char values 0xc3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits, which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before then casting to unsigned. You will also need to account for byte values below 0x10 so that the output is padded to exactly 2 hex digits per char, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
#include <iomanip>
#include <sstream>
#include <string>

size_t nextUtf8CodepointLen(const char* data)
{
    unsigned char ch = static_cast<unsigned char>(*data);
    if ((ch & 0x80) == 0) {
        return 1;
    }
    if ((ch & 0xE0) == 0xC0) {
        return 2;
    }
    if ((ch & 0xF0) == 0xE0) {
        return 3;
    }
    if ((ch & 0xF8) == 0xF0) {
        return 4;
    }
    return 0;
}
unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
    if (data_size == 0) return -1;
    unsigned char ch = static_cast<unsigned char>(*data);
    size_t len = nextUtf8CodepointLen(data);
    ++data;
    --data_size;
    if (len < 2) {
        return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
    }
    --len;
    unsigned cp;
    if (len == 1) {
        cp = ch & 0x1F;
    }
    else if (len == 2) {
        cp = ch & 0x0F;
    }
    else {
        cp = ch & 0x07;
    }
    if (len > data_size) {
        data += data_size;
        data_size = 0;
        return 0xFFFD;
    }
    for (size_t j = 0; j < len; ++j) {
        ch = static_cast<unsigned char>(data[j]);
        if ((ch & 0xC0) != 0x80) {
            cp = 0xFFFD;
            break;
        }
        cp = (cp << 6) | (ch & 0x3F);
    }
    data += len;
    data_size -= len;
    return cp;
}
std::string Helper::ToHex(const std::string &input) {
    const char *data = input.c_str();
    size_t data_size = input.size();
    std::ostringstream oss;
    unsigned cp;
    while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
        if (cp > 0xFF) {
            cp = static_cast<unsigned>('?');
        }
        oss << std::hex << std::setw(2) << std::setfill('0') << cp;
    }
    return oss.str();
}


Convert string from UTF-8 to ISO-8859-1

I'm trying to convert a UTF-8 string to a ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution; then I can just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
I'm going to modify my code from another answer to implement the suggestion from Alf.
std::string UTF8toISO8859_1(const char* in)
{
    std::string out;
    if (in == NULL)
        return out;
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
The C++ standard library defines a std::codecvt specialization that can be used:
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
Alf's suggestion, implemented in C++11:
#include <string>
#include <codecvt>
#include <algorithm>
#include <iterator>
auto i = u8"H€llo Wørld";
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
auto wide = utf8.from_bytes(i);

std::string out;
out.reserve(wide.length());
std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
               [](const wchar_t c) { return (c <= 255) ? static_cast<char>(c) : '?'; });
// out now contains "H?llo W\xf8rld"

c++ string erase does not work for UTF8 string, what library can I use?

Program:
void foo() {
    string sourceStr = "Tag:贾鑫#VoltDB";
    string insertStr = "XinJia";
    int start = 4;
    int length = 2;
    sourceStr.erase(start, length);
    sourceStr.insert(start, insertStr);
    cout << sourceStr << endl;
}
For this program, I want the output to be "Tag:XinJia#VoltDB", but it seems that std::string's erase and insert do not work for UTF-8 strings.
Is there any boost library that I can use? How should I solve this problem?
After talking with others, I realized that there is no standard library facility that solves this problem. So I wrote a function to do the work and would like to share it with others who have a similar problem:
std::string overlay_function(const char* sourceStr, size_t sourceLength,
                             std::string insertStr, size_t startPos, size_t length) {
    size_t i = 0, j = 0;
    while (i < sourceLength) {
        if ((sourceStr[i] & 0xc0) != 0x80) {
            if (++j == startPos) break;
        }
        i++;
    }
    std::string result = std::string(sourceStr, i);
    result.append(insertStr);
    bool reached = false;
    j = 0;
    while (i < sourceLength) {
        if ((sourceStr[i] & 0xc0) != 0x80) {
            if (reached) break;
            if (++j == length) reached = true;
        }
        i++;
    }
    result.append(std::string(&sourceStr[i], sourceLength - i));
    return result;
}
With this function, my program can be:
cout << overlay_function(sourceStr.c_str(), sourceStr.length(), insertStr, 4+1, 2) << endl;
Hope it helps.
Indices into a C++ string are code unit indices, not character (or, in your case, ideogram) indices. With UTF-8, each character can consist of more than one code unit, and in your case it does. Find the correct code unit index.
Tip 1: I'd use .substr and + string concatenation for this.
Tip 2: it seems that you can search for the characters : and #. Note that these code units can never occur inside a multi-unit UTF-8 sequence. Check out the methods of std::string.

how to convert ucs4 to ucs2 using C++ and ucs2 to ucs4?

Is there any C++ method that supports this conversion?
So far I have just been zero-filling the extra bits to convert UCS-2 to UCS-4; is that safe?
Thanks!
It's correct for UCS2, but that's most likely not what you have. Nowadays, you're more likely to encounter UTF-16. Unlike UCS-2, UTF-16 encodes Unicode characters as either one or two 16-bit units. This is necessary because Unicode has more than 65536 characters in its current version.
The more complex conversions usually can be done by your OS, and there are several (non-standard) libraries that offer the same functionality, e.g. ICU.
I have something like this; hope it helps:
String^ StringFromUCS4(const char32_t* element, int length)
{
    StringBuilder^ result = gcnew StringBuilder(length);
    const char32_t* pUCS4 = element;
    int characterCount = 0;
    while (*pUCS4 != 0)
    {
        if (*pUCS4 < 0x10000)
        {
            // BMP code point: a single UTF-16 unit
            result->Append((wchar_t)*pUCS4);
        }
        else
        {
            // Supplementary code point: append a surrogate pair
            unsigned int t = *pUCS4 - 0x10000;
            wchar_t h = (wchar_t)((t >> 10) + 0xD800);
            wchar_t l = (wchar_t)((t & 0x3FF) + 0xDC00);
            result->Append(h);
            result->Append(l);
        }
        characterCount++;
        if (characterCount >= length)
        {
            break;
        }
        pUCS4++;
    }
    return result->ToString();
}