Learning C++, I'm trying to find a way to display UTF-16 characters by adding the 4 digits after the "\u". But, for example, if I try to directly add 0000:
string temp = "\u" + "0000";
I get the error: incorrectly formed universal character name. So is there a way to get these two to form one Unicode character? Also, I realize that the last four digits range from 0-F, but for now I just want to focus on the 0-9 characters.
How can I add "\u" to a different string?
Edit: I was looking for the C++ equivalent of the JavaScript function:
String.fromCharCode()
You can't say "\u" + "0000", because the parsing of escape sequences happens early in the process, before the actual compilation begins. By the time the strings would be tacked together, escape sequences are already parsed and won't be again. And since \u is not a valid escape sequence on its own, you get an error about it.
You can't split a string literal like that. The special sequence inside the quotes is a directive to the compiler to insert the relevant Unicode character at compile time, so if you break it into two pieces it is no longer recognized as a directive.
To programmatically generate a UTF-16 character based on its Unicode codepoint number you could use the Standard Library Unicode conversion facets. Unfortunately there is no direct conversion between UTF-32 (Unicode codepoints) and UTF-16, so you have to go through UTF-8 as an intermediate value:
#include <codecvt>
#include <cwchar>
#include <stdexcept>
#include <string>

// UTF-16 may contain either one or two char16_t code units, so
// we return a string to potentially contain both.
std::u16string codepoint_to_utf16(char32_t cp)
{
    // convert UTF-32 (standard Unicode codepoint) to a UTF-8 intermediate value
    char utf8[4];
    char* end_of_utf8;
    {
        char32_t const* from = &cp;
        std::mbstate_t mbs{}; // zero-initialize the conversion state
        std::codecvt_utf8<char32_t> ccv;
        if(ccv.out(mbs, from, from + 1, from, utf8, utf8 + 4, end_of_utf8))
            throw std::runtime_error("bad conversion");
    }

    // Now convert the UTF-8 intermediate value to UTF-16
    char16_t utf16[2];
    char16_t* end_of_utf16;
    {
        char const* from = nullptr;
        std::mbstate_t mbs{};
        std::codecvt_utf8_utf16<char16_t> ccv;
        if(ccv.in(mbs, utf8, end_of_utf8, from, utf16, utf16 + 2, end_of_utf16))
            throw std::runtime_error("bad conversion");
    }

    return {utf16, end_of_utf16};
}

int main()
{
    std::u16string s; // can hold UTF-16

    // iterate through some Greek codepoint values
    for(char32_t u = 0x03b1; u < 0x03c9; ++u)
    {
        // append the converted UTF-16 code units to our string
        s += codepoint_to_utf16(u);
    }

    // do whatever you want with s here...
}
What you're trying to do is not possible. C++ parsing is split into multiple phases. Per [lex.phases], escape sequences are processed (in phase 5) before adjacent string literals are concatenated (phase 6).
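The flip side is that splitting plain text across adjacent literals is fine, because it is the incomplete escape, not the concatenation, that is the problem; a minimal illustration (exact diagnostics vary by compiler):
const char* ok = "\u00e9" "x";    // fine: each literal is well formed on its own
// const char* bad = "\u00" "e9"; // error: "\u00" is an incomplete universal character name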
Related
Background:
I am making a hash that will allow you to lookup the description you see below by feeding it a QString containing its character.
I got a full list of the relevant data, looking something like this:
QHash<QString, QString> lookupCharacterDescription;
...
lookupCharacterDescription.insert("003F","QUESTION MARK");
lookupCharacterDescription.insert("0040","COMMERCIAL AT");
lookupCharacterDescription.insert("0041","LATIN CAPITAL LETTER A");
lookupCharacterDescription.insert("0042","LATIN CAPITAL LETTER B");
...
lookupCharacterDescription.insert("1F648","SEE-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F649","HEAR-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64A","SPEAK-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64B","HAPPY PERSON RAISING ONE HAND");
...
lookupCharacterDescription.insert("FFFD","REPLACEMENT CHARACTER");
lookupCharacterDescription.insert("FFFE","<not a character>");
lookupCharacterDescription.insert("FFFF","<not a character>");
lookupCharacterDescription.insert("FFFFE","<not a character>");
lookupCharacterDescription.insert("FFFFF","<not a character>");
Now obviously "1F64B" needs to be wrapped in something here. I have tried playing around with things like 0x1F64B as a QChar, but I am honestly groping in the dark here. I could make it work with the lower values like the Latin letters, but it fails with the 5-digit code points.
Questions:
How do I classify 1F64B?
Is this considered UTF-32?
What can I wrap this value "1F64B" in to produce the QString("🙋")?
Will the wrappings also work for the lower values?
When you use QString(0x1F64B) it'll call QString::QString(QChar ch). Since QChar is a 16-bit type, it'll truncate the value to 0xF64B and you get an invalid character, since that code point is currently unassigned. I'm pretty sure you'll get an out-of-range warning on that line. You can see the value F64B easily in the character if you zoom in or use a hex editor. Since 0x1F64B cannot fit into a single 16-bit QChar and must be represented by a surrogate pair, you can't initialize the string that way.
OTOH QString("🙋") works since it's constructing the string from another string. You must construct the string with a string like that, or manually by assigning the UTF-8/16 code units.
Is this considered UTF-32?
No. UTF-32 is a Unicode encoding that uses 32 bits for a code unit. You only have QString and not a bare byte array, so you don't need to care about its underlying encoding (which is actually UTF-16)
What can I wrap this value "1F64B" in to produce the QString("🙋")?
You shouldn't deal with the numeric values as strings. Store them as a numeric type instead:
QHash<qint32, QString> lookupCharacterDescription;
lookupCharacterDescription.insert(0x1F64B, "HAPPY PERSON RAISING ONE HAND");
and then to make a string that contains the character at code point 0x1F64B use
uint cp = 0x1F64B;
QString mystr = QString::fromUcs4(&cp, 1);
Will the wrappings also work for the lower values?
Yes, since UCS-4, a.k.a. UTF-32, can store any possible Unicode character.
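Putting the pieces together for the lookup described in the Background section (a sketch; reusing mystr and lookupCharacterDescription from above is an assumption, not part of the original answer):
// given a QString holding one character (possibly a surrogate pair),
// recover its code point and use it as the hash key
const auto codepoints = mystr.toUcs4(); // for "🙋" this contains the single value 0x1F64B
if (!codepoints.isEmpty()) {
    QString description = lookupCharacterDescription.value(codepoints.at(0));
    // description is now "HAPPY PERSON RAISING ONE HAND"
}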
Alternatively you can construct the character from UTF-16 or UTF-8. U+1F64B is encoded in UTF-16 as D83D DE4B, or as F0 9F 99 8B in UTF-8, therefore you can use any of the below
QChar utf16[2] = { 0xD83D, 0xDE4B };
str1 = QString(utf16, 2);
const char utf8[4] = { '\xF0', '\x9F', '\x99', '\x8B' };
str2 = QString::fromUtf8(utf8, 4);
If you want to include the string in its literal form in source code then either of the following will work
str1 = QString::fromWCharArray(L"\xD83D\xDE4B");
str2 = QString::fromUtf8("\xF0\x9F\x99\x8B");
If you have C++11 support then simply use the prefixes u8, u and U for UTF-8, UTF-16 and UTF-32 string literals respectively, like
u8"🙋"
u"🙋"
U"🙋"
u8"\U0001F64B"
u"\U0001F64B"
U"\U0001F64B"
(Note that the code point must be written with the 8-digit \U form; spelling out the surrogate pair as \uD83D\uDE4B is ill-formed, because universal character names may not designate surrogate code points.)
Mandatory article to understand text and encodings: There Ain't No Such Thing as Plain Text
I have a problem with a std::string comparison that I think is related to the encoding. The problem is that I have to compare a string that is received, whose encoding I don't know, with a Spanish string containing unusual characters. I can't change s_area.m_s_area_text, so I need to set the s2 string to an identical value, and I don't know how to do it in a generic way for other cases.
std::string s2= "Versión de sistema";
std::cout << s_area.m_s_area_text << std::endl;
for (const char* p = s2.c_str(); *p; ++p)
{
    printf("%02x", *p);
}
printf("\n");
for (const char* p = s_area.m_s_area_text.c_str(); *p; ++p)
{
    printf("%02x", *p);
}
printf("\n");
And the result of the execution is:
Versi├│n de sistema
5665727369fffffff36e2064652073697374656d61
5665727369ffffffc3ffffffb36e2064652073697374656d61
Obviously, as the two strings do not have the same byte values, all the comparison methods fail: strncmp, std::string operator==, std::string::compare, etc.
Any idea how to do that without touching the s_area.m_s_area_text string?
In general it is impossible to guess the encoding of a string by inspecting its raw bytes. The exception to this rule is when a byte order mark (BOM) is present at the start of the byte stream. The BOM will tell you which Unicode encoding the bytes use and their endianness.
As an aside, if at some point in the future you decide you need a canonical string encoding (as some have pointed out in the comments, it would be a good idea), there are strong arguments in favour of UTF-8 as the best choice for C++. See UTF-8 Everywhere for further information on this.
First of all, to compare two strings correctly you at least need to know their encodings. In your example s_area.m_s_area_text happens to be encoded with UTF-8, while s2 uses ISO/IEC 8859-1 (Latin-1).
If you are sure that s_area.m_s_area_text will always be encoded in UTF-8, you can try to make s2 use the same encoding and then just compare them. One way of defining a UTF-8 encoded string is escaping every character that is not in the basic character set with \u.
std::string s2 = u8"Versi\u00F3n de sistema";
...
if (s_area.m_s_area_text == s2)
...
It should also be possible to do it without escaping the characters by setting an appropriate encoding for the source file and specifying the encoding to the compiler.
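For example, saving the source file as UTF-8 and building with the corresponding flags (a sketch of the common options; check your compiler's documentation):
g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp
cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp   (MSVC; /utf-8 sets both)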
As #nwp mentioned, you may also want to normalise the strings before comparing. Otherwise, two strings that look the same may have different Unicode representation and that will cause your comparison to yield a false negative result.
For example, "Versión de sistema" will not be equal to "Versión de sistema".
I have a working algorithm to convert a UTF-8 string to a UTF-32 string; however, I have to allocate all the space for my UTF-32 string ahead of time. Is there any way to know how many characters in UTF-32 a UTF-8 string will take up?
For example, the UTF-8 string "¥0" is 3 chars, and once converted to UTF-32 is 2 unsigned ints. Is there any way to know the number of UTF-32 'chars' I will need before doing the conversion? Or am I going to have to re-write the algorithm?
There are two basic options:
You could make two passes through the UTF-8 string, the first one counting the number of UTF-32 characters you'll need to generate, and the second one actually writing them to a buffer.
Allocate the max number of 32-bit chars you could possibly need -- i.e., the length of the UTF-8 string. This is wasteful of memory, but means you can transform utf8->utf32 in one pass.
You could also use a hybrid -- e.g., if the string is shorter than some threshold then use the second approach, otherwise use the first.
For the first approach, the first pass would look something like this:
size_t len = 0; // warning: untested code.
for(const char *p = src; *p; ++p) {
    // bytes of the form 10xxxxxx are continuation bytes; every other
    // byte starts a new UTF-32 char (assuming valid UTF-8 input)
    if((*p & 0xc0) != 0x80) ++len;
}
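For completeness, a sketch of the second pass (the actual decode) under the same assumption of valid UTF-8 input; like the first pass it is untested, and src, dst and the length computed above are assumed to come from the surrounding code:
uint32_t* out = dst; // dst points at a buffer of at least len 32-bit units
for(const unsigned char* p = (const unsigned char*)src; *p; ) {
    uint32_t cp;
    int extra;                                            // continuation bytes to follow
    if     (*p < 0x80) { cp = *p;        extra = 0; }     // 0xxxxxxx
    else if(*p < 0xE0) { cp = *p & 0x1F; extra = 1; }     // 110xxxxx
    else if(*p < 0xF0) { cp = *p & 0x0F; extra = 2; }     // 1110xxxx
    else               { cp = *p & 0x07; extra = 3; }     // 11110xxx
    ++p;
    while(extra-- > 0)
        cp = (cp << 6) | (*p++ & 0x3F);                   // take 6 bits from each continuation byte
    *out++ = cp;
}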
I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
    string str;
    getline (in, str);
    stringstream ss(str);
    ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char based fstream and then use wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack, remember to destroy the objects in
// reverse order of creation: is, then wb, then fs.

std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
    is.putback(ch);

// now the stream can be passed around and used without worrying about the
// extra character in the stream.
int i;
readFromStream<int>(is, i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
    fs.clear(); // clear eofbit in case the file had fewer than three bytes
    fs.seekg(0);
} else {
    std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
* wchar_t is worthless because it is specified to do just one thing; provide a fixed size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.)
The fixed sized representation itself is worthless for two reasons; first, many code points have semantic meanings and so understanding text means you have to process multiple code points anyway. Secondly, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; If no locale supports any character outside the BMP then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream and deciding whether it is part of a BOM or not. It's a bit of a pain, since you can only putback a single byte, whereas you typically will want to read four. The simplest solution is to open the file, read the initial bytes, memorize how many you need to skip, then seek back to the beginning and skip them.
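A minimal sketch of that approach (skip_any_bom is a hypothetical helper, and only the common UTF-8/UTF-16/UTF-32 signatures are checked):
#include <fstream>
#include <string>

// Returns the number of BOM bytes at the start of the file and leaves
// the stream positioned just past them.
std::size_t skip_any_bom(std::ifstream& in)
{
    char buf[4] = {};
    in.read(buf, 4);
    std::string head(buf, static_cast<std::size_t>(in.gcount()));

    std::size_t bom = 0;
    if      (head.compare(0, 4, "\xFF\xFE\x00\x00", 4) == 0) bom = 4; // UTF-32 LE (check before UTF-16 LE)
    else if (head.compare(0, 4, "\x00\x00\xFE\xFF", 4) == 0) bom = 4; // UTF-32 BE
    else if (head.compare(0, 3, "\xEF\xBB\xBF", 3) == 0)     bom = 3; // UTF-8
    else if (head.compare(0, 2, "\xFF\xFE", 2) == 0)         bom = 2; // UTF-16 LE
    else if (head.compare(0, 2, "\xFE\xFF", 2) == 0)         bom = 2; // UTF-16 BE

    in.clear();                  // the read may have hit EOF on a very short file
    in.seekg(bom, std::ios::beg);
    return bom;
}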
As a not-so-clean solution, I solved it by removing non-printing chars:
// despite the name, this matches anything outside the printable ASCII range
bool isNotAlnum(unsigned char c)
{
    return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotAlnum), str.end());
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows
void skip_bom(auto& fs) {
    const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
    bool have_bom{ true };
    for(const auto& c : boms) {
        if((unsigned char)fs.get() != c) have_bom = false;
    }
    if(!have_bom) {
        fs.clear(); // clear eofbit in case the file had fewer than three bytes
        fs.seekg(0);
    }
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC 11, but that's clearly not portable. See @Dúthomhas's comment below.
I'd like to transcode character encoding on-the-fly. I'd like to use iostreams and my own transcoding streambuf, e.g.:
xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );
char *utf8_s; // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s; // characters are written in ISO-8859-1
The implementation of xcoder_streambuf would use ICU's converters API. It would take the data coming in (in this case, from utf8_s), transcode it, and write it out using the iostream's original steambuf.
Is that a reasonable way to go? If not, what would be better?
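For reference, a minimal sketch (hypothetical, not a tested implementation) of what such a transcoding streambuf might look like: it buffers the bytes written to it and, when the stream is flushed, transcodes them with ICU's ucnv_convert and forwards the result to the wrapped streambuf. The constructor argument order (from, to, sink) matches the example above; a real implementation would also convert incrementally in xsputn and keep conversion state across calls.
#include <streambuf>
#include <string>
#include <unicode/ucnv.h>

class xcoder_streambuf : public std::streambuf {
public:
    xcoder_streambuf(const char* from, const char* to, std::streambuf* sink)
        : from_(from), to_(to), sink_(sink) {}

protected:
    int_type overflow(int_type ch) override {
        if (ch != traits_type::eof())
            pending_.push_back(static_cast<char>(ch)); // collect raw input bytes
        return ch;
    }

    int sync() override {
        UErrorCode err = U_ZERO_ERROR;
        // preflight: ask ICU how big the converted output will be
        int32_t n = ucnv_convert(to_.c_str(), from_.c_str(), nullptr, 0,
                                 pending_.data(), (int32_t)pending_.size(), &err);
        if (err == U_BUFFER_OVERFLOW_ERROR && n > 0) {
            err = U_ZERO_ERROR;
            std::string out((size_t)n, '\0');
            ucnv_convert(to_.c_str(), from_.c_str(), &out[0], n,
                         pending_.data(), (int32_t)pending_.size(), &err);
            if (U_SUCCESS(err))
                sink_->sputn(out.data(), (std::streamsize)out.size());
        }
        pending_.clear();
        return U_SUCCESS(err) ? 0 : -1;
    }

private:
    std::string from_, to_;   // ICU converter names, e.g. "UTF-8", "ISO-8859-1"
    std::streambuf* sink_;    // the original streambuf we forward to
    std::string pending_;     // bytes accumulated since the last flush
};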
Is that a reasonable way to go?
Yes, but it is not the way you are expected to do it in modern (as in 1997) iostream.
The behaviour of outputting through basic_streambuf<> is defined by the overflow(int_type c) virtual function.
The description of basic_filebuf<>::overflow(int_type c = traits::eof()) includes a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:
const codecvt<charT,char,typename traits::state_type>& a_codecvt
= use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());
so you are expected to imbue a locale with the appropriate codecvt<charT,char,typename traits::state_type> converter.
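For instance, to have wide-character output converted to UTF-8 on the way out, the intended usage looks roughly like this (a sketch; the file name is arbitrary, codecvt_utf8 only handles code points representable in wchar_t, and the whole <codecvt> family was deprecated in C++17):
#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("out.txt");
    // replace the stream's codecvt facet so wchar_t is written out as UTF-8
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << L"caf\u00e9\n";
}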
The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.
The standard library's support for Unicode has made some progress since 1997:
the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.
This seems to be what you want (ISO-8859-1 code points coincide with the first 256 Unicode code points, so their values are valid UCS-4/UTF-32 code units).
If not, what would be better?
I would introduce a different type for UTF8, like:
struct utf8 {
    unsigned char d; // d for data
};

struct latin1 {
    unsigned char c; // c for character
};
This way you cannot accidentally pass UTF8 where ISO-8859-* is expected. But then you would have to write some interface code, and the type of your streams won't be istream/ostream.
Disclaimer: I never actually did such a thing, so I don't know if it is workable in practice.
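To make the idea a bit more concrete, a hypothetical sketch (not from the answer above, and only handling code points that Latin-1 can represent) of a conversion routine between the two tagged types:
#include <stdexcept>
#include <vector>

// hypothetical helper: decode a UTF-8 byte sequence into Latin-1,
// rejecting anything outside U+0000..U+00FF
std::vector<latin1> utf8_to_latin1(const std::vector<utf8>& in)
{
    std::vector<latin1> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = in[i].d;
        unsigned cp;
        if (b < 0x80) {                                       // 0xxxxxxx: ASCII
            cp = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < in.size()) { // 110xxxxx 10xxxxxx: two-byte sequence
            cp = ((b & 0x1Fu) << 6) | (in[i + 1].d & 0x3Fu);
            i += 2;
        } else {
            throw std::runtime_error("not representable in Latin-1");
        }
        if (cp > 0xFF)
            throw std::runtime_error("not representable in Latin-1");
        out.push_back(latin1{ static_cast<unsigned char>(cp) });
    }
    return out;
}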