I am processing Unicode strings in C with libunistring; I can't use another library. My goal is to read a single character from the Unicode string at its index position, print it, and compare it to a fixed value. This should be really simple, but well ...
Here's my try (complete C program):
/* This file must be UTF-8 encoded in order to work */
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>
int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}

int main() {
    setlocale(LC_ALL, ""); /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);
    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];
    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}
In order to run this program:
Save the program to a UTF-8 encoded file called ustridx.c
sudo apt-get install libunistring-dev
gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx ustridx.o -lunistring
Make sure the terminal is set to a UTF-8 locale (locale)
Run it with ./ustridx
Output:
Current locale charset: UTF-8 (should be UTF-8)
foo 楽あり bébé
- char 0: f
- char 5: あ
- last : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.
The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.
'あ' and 'é' are not valid ordinary character literals: strictly, only characters from the basic source character set and escape sequences are allowed in character literals.
GCC, however, merely emits a warning (see godbolt): warning: multi-character character constant. That warning is normally about constants such as 'abc', which contain several characters; it applies here because あ and é are encoded with multiple bytes in UTF-8, so the literal holds more than one char. According to cppreference, the value of such a literal is implementation-defined, so you can't rely on it being the corresponding Unicode code point, and GCC specifically doesn't make it so, as seen here.
Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t value equal to the Unicode code point of the character. Although by my reading the standard doesn't allow using characters such as あ in literals, the examples on cppreference suggest that compilers commonly allow it.
A standard-compliant portable solution is using Unicode escape sequences for the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.
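For illustration, here is a minimal sketch (independent of libunistring; the hard-coded 0x3042 stands in for mbcs[5] from the question, and the snippet compiles as both C11 and C++11):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Stand-in for mbcs[5] from the question: the code point of あ (U+3042). */
    uint32_t c5 = 0x3042;

    /* A UTF-32 character literal and a plain integer constant both denote
     * the code point, so either can be compared against the uint32_t. */
    if (c5 == U'\u3042' && c5 == 0x3042)
        puts("Char 5 compares equal to U+3042");
    return 0;
}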
From libunistring's documentation:
Compares S1 and S2, each of length N, lexicographically. Returns a
negative value if S1 compares smaller than S2, a positive value if
S1 compares larger than S2, or 0 if they compare equal.
The comparison in the if statement was wrong; that was the reason for the mismatch. Of course, this reveals other, unrelated issues that also need to be fixed, but it is the reason for the puzzling result of the comparison.
Related
I am trying to compile this C++17 code on VS2019:
int main() {
    if(!testCodepointEncode(U'\u221A', '\xFB') ||
       !testCodepointEncode(U'\u0040', '\x40') ||
       !testCodepointEncode(U'\u03A3', '\xE4') ||
       !testCodepointEncode(U'𠲖', '\xFE')) {
        return 1;
    }
    // Test 1 byte
    if(!testEncode("\u0040", "\x40")) {
        return 2;
    }
    // Test 2 byte
    if(!testEncode("\u03A3", "\xE4")) {
        return 3;
    }
    // Test 3 byte
    if(!testEncode("\u2502", "\xB3")) {
        return 4;
    }
    // Test 4 byte
    if(!testEncode("𠲖", "\xFE")) {
        return 5;
    }
    if(!testArray("F ⌠ Z", "\x46\x20\xF4\x20\x5A")) {
        return 6;
    }
    if(!testView("F ⌠ Z", "\x46\x20\xF4\x20\x5A")) {
        return 7;
    }
    return 0;
}
It compiles and works fine with gcc and clang on Linux, but MSVC complains:
UNICODE_TEST.CPP(65,27): error C2015: too many characters in constant
UNICODE_TEST.CPP(75,18): warning C4566: character represented by universal-character-name '\u03A3' cannot be represented in the current code page (1252)
UNICODE_TEST.CPP(80,18): warning C4566: character represented by universal-character-name '\u2502' cannot be represented in the current code page (1252)
I tried setting the current codepage to UTF-8, but the errors persisted.
How is one supposed to compile this code on Windows?
Look carefully at what you are doing on this line:
if(!testEncode("\u03A3", "\xE4")) {
It references the string literal:
"\u03a3"
You are trying to express a UTF-16 character inside an 8-bit (char*) string literal. That just won't work. That's kind of equivalent to doing this:
char sz[2] = {0};
sz[0] = (char)(0x03a3);
And expecting sz[0] to hold the original UTF-16 character. That's what the compiler is warning you about.
If you want to express a 16-bit Unicode character inside a string literal, use a wide string literal, with the L prefix:
L"\u03a3"
The above is a string literal which holds a single wide-char character: L"Σ"
And if we really want to be pedantic, the portable way to express a UTF-16 character string is the u prefix:
u"\u03a3"
But on Windows wchar_t is 16-bit, so it doesn't really matter.
You'll probably need to fix your testEncode functions to expect a const wchar_t* instead of a const char* parameter. (I'm honestly not sure what your test* functions are doing, but some of your parameters look suspicious if the goal is to confirm UTF-8 to UTF-16 conversions.)
If you want to express a UTF-8 string in code, you could say this:
"\xCE\xA3"
The above is the UTF-8 representation of the sigma character Σ, expressed as a narrow string.
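As a small sketch of that narrow-string side (an assumption about the intent of the tests, not the actual test code): with MSVC you can also compile with /utf-8, which makes both the source and execution character sets UTF-8, so a \u escape in a narrow literal is stored as its UTF-8 bytes, the same bytes you can spell out with hex escapes:
#include <cstring>
#include <iostream>

int main() {
    // u8"..." guarantees UTF-8 encoding of the literal regardless of the
    // execution character set (in C++17 it is still a const char array).
    const char* sigma_u8  = u8"\u03A3";   // UTF-8 encoding of Σ
    const char* sigma_hex = "\xCE\xA3";   // the same two bytes, spelled out

    std::cout << (std::strcmp(sigma_u8, sigma_hex) == 0 ? "equal\n" : "different\n");
}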
I am trying to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: the user inputs a name, it's limited to 10 letters (symbols in the user's language, not bytes), and it's stored.
It can be done this way in ASCII.
// ASCII
char * input;   // user's input
char buf[11];   // 10 letters + zero
snprintf(buf, 11, "%s", input); buf[10] = 0;
int len = strlen(buf); // returns 10 (correct)
Now, how do I do it in UTF-8? Let's assume a charset that uses up to 4 bytes per character (like Chinese).
// UTF-8
char * input;   // user's input
char buf[41];   // 10 letters * 4 bytes + zero
snprintf(buf, 41, "%s", input); // ?? makes no sense, it limits by number of bytes, not letters
int len = strlen(buf); // returns the number of bytes, not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements for those functions for use with UTF-8 (in PHP there was an mb_ prefix for such functions, IIRC)? If not, do I need to write them myself? Or do I need to approach it another way?
Note: I would prefer to avoid a wide-character solution...
EDIT: Let's limit it to the Basic Multilingual Plane only.
I would prefer to avoid a wide-character solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16-bit wchar_t character (assuming wchar_t is 16 bits wide, as it is on Windows).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in their Normal Form C (canonical composition) or the compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ff (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, also displayed as é).
References and other examples on Unicode equivalence
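To make the equivalence concrete, here is a small sketch (assuming a UTF-8 execution character set, as with GCC/Clang defaults): the two spellings of é render the same but differ in byte count and content:
#include <cstring>
#include <iostream>

int main()
{
    const char* nfc = "\u00E9";    // é as one code point, U+00E9 (2 UTF-8 bytes)
    const char* nfd = "e\u0301";   // 'e' + U+0301 combining acute (3 UTF-8 bytes)

    std::cout << "NFC: " << nfc << " (" << std::strlen(nfc) << " bytes)\n"
              << "NFD: " << nfd << " (" << std::strlen(nfd) << " bytes)\n"
              << "byte-wise equal: " << (std::strcmp(nfc, nfd) == 0) << '\n'; // prints 0
}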
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable-length encoding (as is, to a lesser extent, UTF-16), so code points can be encoded using one to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
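If counting code points (rather than user-perceived characters) is good enough for the 10-letter limit, a small hand-rolled helper will do; a minimal sketch, assuming well-formed UTF-8 input:
#include <cstddef>
#include <cstring>
#include <iostream>

// Counts UTF-8 code points by ignoring continuation bytes (10xxxxxx).
// Note: combining marks still count separately, so this is not a glyph count.
std::size_t utf8_codepoints(const char* s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    return n;
}

int main()
{
    const char* name = "bébé";   // source file saved as UTF-8
    std::cout << std::strlen(name) << " bytes, "
              << utf8_codepoints(name) << " code points\n";   // 6 bytes, 4 code points
}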
std::strlen indeed considers only one-byte characters. To compute the length of a wide (wchar_t) NUL-terminated string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
    const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
    std::setlocale(LC_ALL, "en_US.utf8");
    std::wcout.imbue(std::locale("en_US.utf8"));
    std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can use a temporary conversion to wide characters to cut your input string. You do not need to store the intermediate values.
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    auto wstring = cvt.from_bytes(in);
    if(len < wstring.length())
    {
        wstring = wstring.substr(0, len);
        return cvt.to_bytes(wstring);
    }
    return in;
}

int main(){
    std::string test = "你好世界這是演示樣本";
    std::string res = cutString(test, 5);
    std::cout << test << '\n' << res << '\n';
    return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/
I am trying to compare a specific character in a QString, but getting odd results:
My QString named strModified contains: "[y]£trainstrip+[height]£trainstrip+8"
I convert the string to a standard string:
std::string stdstr = strModified.toStdString();
I can see in the debugger that 'stdstr' contains the correct contents, but when I attempt to extract a character:
char cCheck = stdstr.c_str()[3];
I get something completely different: I expected to see '£' but instead I get -62. I realise that '£' is outside the ASCII character set and has a code of 156.
But what is it returning?
I've modified the original code to simplify, now:
const QChar cCheck = strModified.at(intClB + 1);
if ( cCheck == mccAttrMacroDelimiter ) {
...
}
Where mccAttrMacroDelimiter is defined as:
const QChar clsXMLnode::mccAttrMacroDelimiter = '£';
In the debugger when looking at both definitions of what should be the same value, I get:
cCheck: -93 '£'
mccAttrMacroDelimiter: -93 with what looks like a Chinese character
The comparison fails...what is going on?
I've gone through my code changing all QChar references to unsigned char, now I get a warning:
large integer implicitly truncated to unsigned type [-Woverflow]
on:
const unsigned char clsXMLnode::mcucAttrMacroDelimiter = '£';
Again, why? According to a Google search, this may be a bogus message.
I am happy to say that this has fixed the problem. The solution: declare the check character as unsigned char and use:
const char cCheck = strModified.at(intClB + 1).toLatin1();
I think because '£' is not in the ASCII table, you will get weird behaviour from char. The compiler in Xcode does not even let me compile:
char c = '£'; error-> character too large for enclosing literal type
You could use Unicode instead, since '£' can be found in the Unicode character table:
£ : U+00A3 | Dec: 163.
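A sketch of the comparison done with QChar directly (assuming Qt is available; this keeps everything in UTF-16 and avoids narrowing to char):
#include <QChar>
#include <QString>
#include <QDebug>

int main()
{
    const QString str = QStringLiteral("[y]£trainstrip+[height]£trainstrip+8");
    const QChar pound(0x00A3);          // '£' by its Unicode code point

    if (str.at(3) == pound)             // index 3 is the first '£'
        qDebug() << "character 3 is the pound sign";
    return 0;
}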
The answer to this question heavily inspired the code I wrote to extract the decimal value for '£'.
#include <iostream>
#include <codecvt>
#include <locale>
#include <string>
using namespace std;
// this takes the character at [index], prints its Unicode decimal value
// and returns it as a one-character UTF-32 string
u32string foo(std::string const & utf8str, int index)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string utf32str = conv.from_bytes(utf8str);
    char32_t u = utf32str[index];
    cout << u << endl;
    return u32string(1, u);   // one-character string holding u
}

int main(int argc, const char * argv[]) {
    string r = "[y]£trainstrip+[height]£trainstrip+8";
    // compare the characters at indices 3 and 23 since they are the same
    cout << (foo(r, 3) == foo(r, 23)) << endl;
    return 0;
}
You can use a for loop to get all of the characters in the string if you want. Hopefully this helps
For example, I want to create a typewriter effect, so I need to print strings like this:
#include <string>
#include <cstdio>

int main(){
    std::string st1 = "ab》cd《ef";
    for(int i = 0; i < st1.size(); i++){
        std::string st2 = st1.substr(0, i);
        printf("%s\n", st2.c_str());
    }
    return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How can I tell whether the upcoming character is a multi-byte Unicode character?
A similar question: printing each character individually has the same problem:
#include <string>
#include <cstdio>

int main(){
    std::string st1 = "ab》cd《ef";
    for(int i = 0; i < st1.size(); i++){
        std::string st2 = st1.substr(i, 1);
        printf("%s\n", st2.c_str());
    }
    return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding: your string is likely UTF-8-encoded, which uses variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is, in Unicode you can only reliably iterate one fixed-size character at a time with the UTF-32 encoding.
So what you can do is use a UTF library like ICU to convert between UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
    UErrorCode status = U_ZERO_ERROR;
    char target[1024];
    int32_t len = ucnv_convert(
        "UTF-8", "UTF-32"
        , target, sizeof(target)
        , (const char*)s.data(), s.size() * sizeof(char32_t)
        , &status);
    return std::string(target, len);
}

// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    char32_t target[256];
    int32_t len = ucnv_convert(
        "UTF-32", "UTF-8"
        , (char*)target, sizeof(target)
        , utf8.data(), utf8.size()
        , &status);
    return std::u32string(target, (len / sizeof(char32_t)));
}

int main()
{
    // UTF-8 input (needs UTF-8 editor)
    std::string utf8 = "ab》cd《ef"; // UTF-8

    // convert to UTF-32
    std::u32string utf32 = to_utf32(utf8);

    // Now it is safe to use string indexing
    // But i is for length so starting from 1
    for(std::size_t i = 1; i < utf32.size(); ++i)
    {
        // convert back to UTF-8 for output
        // NOTE: i + 1 to include the BOM
        std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
    }
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that's converting octets encoded in its default character set to Unicode characters.
Based on your example, it looks like your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are well specified (Google is your friend), so all you have to do is check the leading byte of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
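For example, a minimal sketch of that leading-byte check, applied to the typewriter loop from the question (assuming well-formed UTF-8 input):
#include <cstddef>
#include <cstdio>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead` (1..4);
// falls back to 1 so iteration still advances on invalid input.
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // stray continuation byte
}

int main()
{
    std::string st1 = "ab》cd《ef";   // source file saved as UTF-8
    for (std::size_t i = 0; i < st1.size(); )
    {
        i += utf8_seq_len(static_cast<unsigned char>(st1[i])); // advance by whole characters
        std::printf("%s\n", st1.substr(0, i).c_str());         // always cut at a boundary
    }
    return 0;
}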
Here is a snippet of code that uses the std::codecvt_utf8<> facet to convert from wchar_t to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong? Why? Or is this a Visual Studio 2012 library issue?
#include <locale>
#include <codecvt>
#include <cstdlib>
int main ()
{
    std::mbstate_t state = std::mbstate_t ();
    std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
    codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);

    wchar_t ch = L'\u5FC3';
    wchar_t const * from_first = &ch;
    wchar_t const * from_mid = &ch;
    wchar_t const * from_end = from_first + 1;

    char out_buf[1];
    char * out_first = out_buf;
    char * out_mid = out_buf;
    char * out_end = out_buf + 1;

    std::codecvt_base::result cvt_res
        = cvt.out (state, from_first, from_end, from_mid,
                   out_first, out_end, out_mid);

    // This is what I expect:
    if (cvt_res == std::codecvt_base::partial
        && out_mid == out_end
        && state != 0)
        ;
    else
        abort ();
}
The expectation here is that the out() function outputs one byte of the UTF-8 conversion at a time, but the middle condition of the if statement above is false with Visual Studio 2012.
UPDATE
What fails are the out_mid == out_end and state != 0 conditions. Basically, I expect at least one byte to be produced, and the state necessary for producing the next byte of the UTF-8 sequence to be stored in the state variable.
The standard's description of the partial return code of codecvt::do_out says exactly this:
In Table 83:
partial: not all source characters converted
In 22.4.1.4.2[locale.codecvt.virtuals]/5:
Returns: An enumeration value, as summarized in Table 83. A return value of partial, if (from_next==from_end), indicates that either the destination sequence
has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.
In your case, not all (in fact zero) source characters were converted, which technically says nothing about the contents of the output sequence (the 'if' clause in the sentence is not entered). But speaking generally, "the destination sequence has not absorbed all the available destination elements" here talks about valid multibyte characters: they are the elements of the multibyte character sequence produced by codecvt_utf8.
It would be nice to have a more explicit standard wording, but here are two circumstantial pieces of evidence:
One: C's old wide-to-multibyte conversion function std::wcsrtombs (whose locale-specific variants are usually called by existing implementations of codecvt::do_out for system-supplied locales) is defined as follows:
Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.
And two, look at the existing implementations of codecvt_utf8: you've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out here calls ucs2_to_utf8 on Windows and ucs4_to_utf8 on other systems, and ucs2_to_utf8 does the following (comments mine):
else if (wc < 0x0800)
{
    // not relevant
}
else // if (wc <= 0xFFFF)
{
    if (to_end - to_nxt < 3)
        return codecvt_base::partial; // <- look here
    *to_nxt++ = static_cast<uint8_t>(0xE0 |  (wc >> 12));
    *to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
    *to_nxt++ = static_cast<uint8_t>(0x80 |  (wc & 0x003F));
}
Nothing is written to the output sequence if it cannot fit the whole multibyte character that results from consuming one input wide character.
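To see this "whole characters only" behaviour directly, here is a minimal sketch (the same conversion as in the question, but with room for the full three-byte sequence; based on the excerpt above it should report ok and 3 bytes; note that codecvt_utf8 is deprecated since C++17 but still available):
#include <codecvt>
#include <cwchar>
#include <iostream>
#include <locale>

int main()
{
    std::codecvt_utf8<wchar_t> cvt;      // standalone facet, usable directly
    std::mbstate_t state{};
    wchar_t ch = L'\u5FC3';
    const wchar_t* from_next = nullptr;
    char out[4];
    char* out_next = nullptr;

    std::codecvt_base::result res =
        cvt.out(state, &ch, &ch + 1, from_next,
                out, out + sizeof out, out_next);

    std::cout << "ok: " << (res == std::codecvt_base::ok)
              << ", bytes written: " << (out_next - out) << '\n';
    // Expected: "ok: 1, bytes written: 3" (0xE5 0xBF 0x83 for U+5FC3).
}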
Although there is no direct reference to it, I'd say this is the most logical behaviour of std::codecvt::out. Consider the following scenario:
You use std::codecvt::out in the same manner as you did, translating no characters (possibly without knowing it) into your out_buf.
You now want to translate another string into your out_buf (again using std::codecvt::out) so that it is appended to the content that is already inside.
To do so, you decide to continue at out_mid, since you know it points directly past the string you translated in the first step.
Now, if std::codecvt::out worked according to your expectations (out_mid pointing to the position after the first character), then the first character of your out_buf would never actually be written, which is not exactly what you would want or expect in this case.
In essence, extern_type*& to_next (the last parameter of std::codecvt::out) is there as a reference to where you left off, so you know where to continue, which in your case is indeed the same position as where you started (the extern_type* to parameter).
cppreference.com on std::codecvt::out
cplusplus.com on std::codecvt::out