MSVC UTF8 string encoding uses incorrect code points - c++

I'm trying to write the character "Ā" (https://www.fileformat.info/info/unicode/char/0100/index.htm) into a C++11 UTF8 string (using u8 prefix).
const char *const utf8 = u8"Ā";
const char *const utf8_2 = u8"\u0100";
const char *const chars = "Ā";
const int utf8_len = strlen(utf8);
const int utf8_2_len = strlen(utf8_2);
const int chars_len = strlen(chars);
Running this under MSVC (16.2.4) results in:
utf8_len == 5
utf8_2_len == 2
chars_len == 2
Where:
utf8 == "Ä€"
utf8_2 == "Ä€"
chars == "Ä€"
The source file is set to UTF8 (without BOM).
Trying the same with Clang and GCC works as expected:
https://godbolt.org/z/PNZFCa
Does anyone know why this behaviour is occurring? Why is the u8 prefixed Unicode character being encoded as 5 bytes (when it should be 2)?

The Microsoft compiler assumes the local ANSI encoding for files without BOM, which is probably Windows-1252 in your case. If you run cl /? from the command line, you'll see the following command line switches:
...
/source-charset:<iana-name>|.nnnn set source character set
/execution-charset:<iana-name>|.nnnn set execution character set
/utf-8 set source and execution character set to UTF-8
...
Use /source-charset:UTF-8 or /utf-8 if you don't want to save with BOM.
Test code saved in UTF-8 without BOM:
#include <stdio.h>
#include <string.h>
int main()
{
    const char *const utf8 = u8"Ā";
    printf("%zu\n", strlen(utf8));
}
Output:
C:\>cl /nologo test.cpp
test.cpp
C:\>test
5
C:\>cl /nologo /utf-8 test.cpp
test.cpp
C:\>test
2
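As an aside (my addition, not part of the original answer): if you want a wrong source-charset guess to fail at build time rather than silently produce mojibake at run time, a compile-time size check works. u8"Ā" must be exactly two UTF-8 bytes plus the terminating NUL; when MSVC decodes the file as Windows-1252, the literal silently grows to six bytes and the assertion fires:
// Minimal sketch: fails to compile when the source file is not decoded as UTF-8.
// Ā is U+0100, i.e. 2 UTF-8 bytes; sizeof includes the trailing '\0'.
static_assert(sizeof(u8"Ā") == 3, "source file was not decoded as UTF-8");

int main()
{
    return 0;
}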

Related

How to get single characters from unicode string and compare, print them?

I am processing unicode strings in C with libunistring. Can't use another library. My goal is to read a single character from the unicode string at its index position, print it, and compare it to a fixed value. This should be really simple, but well ...
Here's my try (complete C program):
/* This file must be UTF-8 encoded in order to work */
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>

int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}

int main() {
    setlocale(LC_ALL, ""); /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);
    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];
    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}
In order to run this program:
Save the program to a UTF-8 encoded file called ustridx.c
sudo apt-get install libunistring-dev
gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
Make sure the terminal is set to a UTF-8 locale (locale)
Run it with ./ustridx
Output:
Current locale charset: UTF-8 (should be UTF-8)
foo 楽あり bébé
- char 0: f
- char 5: あ
- last : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.
The desired behavior is that char 5 and last char are recognized correctly, and printed correctly in the last two lines of the output.
'あ' and 'é' are invalid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.
GCC, however, emits a warning (see godbolt): warning: multi-character character constant. That warning is about character constants such as 'abc' that contain more than one character, and it applies here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation defined, so you can't rely on its value being the corresponding Unicode code point; as seen here, GCC specifically doesn't make it the code point.
Since C11 you can use UTF-32 character literals such as U'あ' which results in a char32_t value of the Unicode code point of the character. Although by my reading the standard doesn't allow using characters such as あ in literals, the examples on cppreference seem to suggest that it is common for compilers to allow this.
A standard-compliant portable solution is using Unicode escape sequences for the character literal, like U'\u3042' for あ, but this is hardly different from using an integer constant such as 0x3042.
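To make the difference concrete, here is a minimal sketch (my code, not from the answer above; written as C++ for brevity, though C11 offers the same U'...' literals via <uchar.h>):
#include <cstdint>
#include <cstdio>

int main()
{
    // Code point a decoder such as u32_strconv_from_locale would yield for あ.
    const std::uint32_t c5 = 0x3042;

    // UTF-32 character literal: its value is the Unicode code point,
    // so this comparison behaves as the question expects.
    if (c5 == U'\u3042')
        std::printf("Char 5 is recognized, good!\n");

    // By contrast, 'あ' is a multibyte character constant whose value is
    // implementation-defined (GCC uses the raw UTF-8 bytes, 0xE38182),
    // so comparing it against a code point is not reliable.
    return 0;
}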
From libunistring's documentation:
Compares S1 and S2, each of length N, lexicographically. Returns a
negative value if S1 compares smaller than S2, a positive value if
S1 compares larger than S2, or 0 if they compare equal.
The comparison in the if statement was wrong. That was the reason for the mismatch. Of course, this reveals other, unrelated, issues that also need to be fixed. But, that's the reason for the puzzling result of the comparison.

Compiling Unicode with Visual Studio 2019

I try to compile this C++17 code on VS2019:
int main() {
    if(!testCodepointEncode(U'\u221A', '\xFB') ||
       !testCodepointEncode(U'\u0040', '\x40') ||
       !testCodepointEncode(U'\u03A3', '\xE4') ||
       !testCodepointEncode(U'𠲖', '\xFE')) {
        return 1;
    }

    // Test 1 byte
    if(!testEncode("\u0040", "\x40")) {
        return 2;
    }
    // Test 2 byte
    if(!testEncode("\u03A3", "\xE4")) {
        return 3;
    }
    // Test 3 byte
    if(!testEncode("\u2502", "\xB3")) {
        return 4;
    }
    // Test 4 byte
    if(!testEncode("𠲖", "\xFE")) {
        return 5;
    }

    if(!testArray("F ⌠ Z", "\x46\x20\xF4\x20\x5A")) {
        return 6;
    }
    if(!testView("F ⌠ Z", "\x46\x20\xF4\x20\x5A")) {
        return 7;
    }
    return 0;
}
It compiles and works fine with gcc and clang on Linux, but MSVC complains:
UNICODE_TEST.CPP(65,27): error C2015: too many characters in constant
UNICODE_TEST.CPP(75,18): warning C4566: character represented by universal-character-name '\u03A3' cannot be represented in the current code page (1252)
UNICODE_TEST.CPP(80,18): warning C4566: character represented by universal-character-name '\u2502' cannot be represented in the current code page (1252)
I tried setting the current codepage to UTF-8, but the errors persisted.
How is one supposed to compile this code on Windows?
Look carefully at what you are doing on this line:
if(!testEncode("\u03A3", "\xE4")) {
It references the string literal:
"\u03a3"
You are trying to express a UTF-16 character inside an 8-bit (char*) string literal. That just won't work. That's kind of equivalent to doing this:
char sz[2] = {0};
sz[0] = (char)(0x03a3);
And expecting sz[0] to hold the original UTF-16 character. That's what the compiler is warning you about.
If you want to express a 16-bit unicode character inside a string literal, use a wide string. Like follows with the L prefix:
L"\u03a3"
The above is a string literal which holds a single wide-char character: L"Σ"
And if we really want to be pedantic, to portably express a UTF-16 character string we could use the u prefix:
u"\u03a3"
But on Windows wchar_t is 16-bit, so it doesn't really matter.
You'll probably need to fix your testEncode functions to expect a const wchar_t* instead of a const char* parameter. (I'm honestly not sure what your test* functions are doing, but some of your parameters look suspicious if the goal is to confirm UTF-8 to UTF-16 conversions.)
If you want to express a UTF-8 string in code, you could say this:
"\xCE\xA3"
The above is the UTF-8 representation of the sigma character Σ as a narrow string.
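As a quick self-contained check (my addition, assuming C++17, where u8 string literals still have type const char[]): the u8 prefix with a \u escape yields exactly those two bytes regardless of the source or execution code page:
#include <cstdio>
#include <cstring>

int main()
{
    // u8"\u03A3" is always encoded as UTF-8, i.e. the two bytes 0xCE 0xA3.
    const char *sigma = u8"\u03A3";

    std::printf("bytes: %02X %02X\n",
                static_cast<unsigned char>(sigma[0]),
                static_cast<unsigned char>(sigma[1]));

    return std::strcmp(sigma, "\xCE\xA3") == 0 ? 0 : 1; // returns 0 on success
}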

Putting Hebrew string in a variable using C++ on Windows

I have problem putting Hebrew string in a variable like this:
wchar_t* hebrewString = L"א";
The value in unicode of א is 0x05d0 in hex or 1488 in dec.
The problem is that my memory shows a value that is totally unconnected
to the real value of א.
If I write:
wchar_t hebrewChar = 0x05d0;
it is obvious that the right value will be in hebrewChar, but I want to write a regular string.
I thought maybe I did something wrong, so I looked at the generated ASM code, and even there the value was wrong.
How can I write a Hebrew string in a simple way?
Edit 1:
Added source code (the assembly appears in the comment above each line):
wchar_t d = 0x05D0;
// DB 0f3H, 05H, 090H, 00H, 00H, 00H
wchar_t *test = L"א";
// mov eax, 1523 ; 000005f3H
wchar_t test1 = L'א';
// mov eax, -112 ; ffffff90H
char test2 = 'א';
By specifying L in front of a string or character literal, the compiler converts it based on the encoding it assumes the source file is saved in. Therefore you have to change the file encoding via FILE -> Advanced Save Options and choose, for example, Unicode (UTF-8 with signature) - Codepage 65001.
Also bear in mind that the Windows console isn't capable of printing all Unicode characters (you could print them if you had a different default language and encoding).
Here is also an example to see if your code is working by saving character into a text file:
#include <iostream>
#include <fstream>
using namespace std;

int main()
{
    // UCS-2 little endian text file magic number (BOM), written as character
    // escapes to avoid a narrowing error when initializing a signed char array
    char magic_number[] = { '\xFF', '\xFE' };
    wchar_t unicode_char = L'א';
    wchar_t unicode_val = 0x05d0;

    if (unicode_char == unicode_val)
        cout << "Works!" << endl;

    // binary mode so the raw bytes are written untouched
    ofstream f("out.txt", ios::out | ios::binary);
    f.write(magic_number, 2);
    f.write((char *)&unicode_char, 2); // wchar_t is 2 bytes with MSVC
    f.close();
    return 0;
}
Open the file and check if the value is printed correctly.
Otherwise, for storing non-ANSI characters in code, I'd strongly recommend using a library like ICU for saving, loading, and in general all operations on strings.
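A further option the answer doesn't mention (my addition): spell the character as a universal-character-name. The compiler then stores the code point regardless of how the source file itself is encoded or decoded:
#include <cassert>

int main()
{
    // \u05d0 is the code point of א, so this does not depend on the
    // source file encoding or on Advanced Save Options.
    const wchar_t *hebrewString = L"\u05d0";
    wchar_t hebrewChar = L'\u05d0';

    assert(hebrewString[0] == 0x05d0);
    assert(hebrewChar == 0x05d0);
    return 0;
}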

how to print each character of strings that mix ascii character with unicode?

For example, I want to create a typewriter effect, so I need to print strings like this:
#include <string>
#include <cstdio>

int main() {
    std::string st1 = "ab》cd《ef";
    for (int i = 0; i < st1.size(); i++) {
        std::string st2 = st1.substr(0, i).c_str();
        printf("%s\n", st2.c_str());
    }
    return 0;
}
but the output is
a
ab
ab?
ab?
ab》
ab》c
ab》cd
ab》cd?
ab》cd?
ab》cd《
ab》cd《e
and not:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
How can I tell that the upcoming character is a multi-byte Unicode character?
A similar question: printing each character on its own has the same problem:
#include <string>
#include <cstdio>

int main() {
    std::string st1 = "ab》cd《ef";
    for (int i = 0; i < st1.size(); i++) {
        std::string st2 = st1.substr(i, 1).c_str();
        printf("%s\n", st2.c_str());
    }
    return 0;
}
the output is:
a
b
?
?
?
c
d
?
?
?
e
f
not:
a
b
》
c
d
《
e
f
I think the problem is encoding. Your string is likely in UTF-8, which has variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is that, in Unicode, you can only reliably iterate one character at a time with a fixed-width encoding such as UTF-32.
So what you can do is use a UTF library like ICU to convert between UTF-8 and UTF-32.
If you have C++11 then there are some tools to help you here, mostly std::u32string which is able to hold UTF-32 encoded strings:
#include <string>
#include <iostream>
#include <unicode/ucnv.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>

// convert from UTF-32 to UTF-8
std::string to_utf8(std::u32string s)
{
    UErrorCode status = U_ZERO_ERROR;
    char target[1024];
    int32_t len = ucnv_convert(
        "UTF-8", "UTF-32"
        , target, sizeof(target)
        , (const char*)s.data(), s.size() * sizeof(char32_t)
        , &status);
    return std::string(target, len);
}

// convert from UTF-8 to UTF-32
std::u32string to_utf32(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    char32_t target[256];
    int32_t len = ucnv_convert(
        "UTF-32", "UTF-8"
        , (char*)target, sizeof(target)
        , utf8.data(), utf8.size()
        , &status);
    return std::u32string(target, (len / sizeof(char32_t)));
}

int main()
{
    // UTF-8 input (needs UTF-8 editor)
    std::string utf8 = "ab》cd《ef"; // UTF-8

    // convert to UTF-32
    std::u32string utf32 = to_utf32(utf8);

    // Now it is safe to use string indexing
    // But i is for length so starting from 1
    for(std::size_t i = 1; i < utf32.size(); ++i)
    {
        // convert back to UTF-8 for output
        // NOTE: i + 1 to include the BOM
        std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
    }
}
Output:
a
ab
ab》
ab》c
ab》cd
ab》cd《
ab》cd《e
ab》cd《ef
NOTE:
The ICU library adds a BOM (Byte Order Mark) at the beginning of the strings it converts into Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter to include the BOM.
Your C++ code is simply echoing octets to your terminal, and it is your terminal display that converts octets encoded in its default character set into Unicode characters.
It looks like, based on your example, that your terminal display uses UTF-8. The rules for converting UTF-8-encoded characters to Unicode are fairly well specified (Google is your friend), so all you have to do is check the first octet of a UTF-8 sequence to figure out how many octets make up the next Unicode character.
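To illustrate that approach, here is a rough sketch (my code, not the answerer's; it assumes the input is well-formed UTF-8): the lead byte tells you how many octets the current character occupies, which is enough to produce the typewriter effect without any extra library.
#include <cstdio>
#include <string>

// Number of octets in the UTF-8 sequence that starts with lead byte c.
// Assumes well-formed UTF-8 (continuation bytes never start a character).
static std::size_t utf8_len(unsigned char c)
{
    if (c < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    return 4;                         // 11110xxx
}

int main()
{
    const std::string st1 = "ab》cd《ef"; // source must be saved as UTF-8

    // Typewriter effect: grow the prefix by one whole character per line.
    std::size_t i = 0;
    while (i < st1.size())
    {
        i += utf8_len(static_cast<unsigned char>(st1[i]));
        std::printf("%s\n", st1.substr(0, i).c_str());
    }
    return 0;
}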

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to_executable> <path-to-file> and then call popen() on this command. Right now, both these two components are declared in a char*. I need to read the output of the command so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!
You'll need a UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <stdlib.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>

char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
    char result_buffer[MAXPATHLEN];
    iconv_t converter = iconv_open("UTF-8", "wchar_t");
    char* result = result_buffer;
    char* input = (char*)wchar;
    size_t output_available_size = sizeof result_buffer;
    size_t input_available_size = utf32_bytes;
    size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
    if (result_code == -1)
    {
        perror("iconv");
        return NULL;
    }
    iconv_close(converter);
    return strdup(result_buffer);
}

int main()
{
    wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
    char* utf8 = utf8path(hello_world, sizeof hello_world);
    printf("%s\n", utf8);
    free(utf8);
    return 0;
}
The utf8path function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
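For instance, a hypothetical caller (my sketch, building on utf8path above; the path literal is made up) would look like this:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

char* utf8path(const wchar_t* wchar, size_t utf32_bytes); // defined above

int main()
{
    // The path is only known through a pointer, so use wcslen + 1 for the NUL.
    const wchar_t *wide = L"/楽あり/bébé.txt";
    char *utf8 = utf8path(wide, (wcslen(wide) + 1) * sizeof(wchar_t));
    if (utf8 != NULL)
    {
        printf("%s\n", utf8);
        free(utf8);
    }
    return 0;
}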
Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the UTF-16 to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.
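Since the answer mentions rolling your own conversion, here is a rough sketch of what that might look like (my code, not the answerer's; it assumes wchar_t holds whole Unicode code points, as on macOS where wchar_t is 32-bit, and does no error checking):
#include <cstdio>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to out.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
static void append_utf8(std::string &out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide-character string of code points to a UTF-8 std::string.
std::string wide_to_utf8(const wchar_t *wide)
{
    std::string out;
    for (; *wide != L'\0'; ++wide)
        append_utf8(out, static_cast<char32_t>(*wide));
    return out;
}

int main()
{
    // The resulting narrow string can then be used to build the popen() command.
    std::string command = "/path/to/tool " + wide_to_utf8(L"/楽あり/bébé.txt");
    std::puts(command.c_str());
    return 0;
}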