I wrote this test program in a Latin-1 encoded file...
#include <cstring>
#include <iostream>
using namespace std;
const char str[] = "ÅÄÖ";
int main() {
    cout << sizeof(str) << ' ' << strlen(str) << ' '
         << sizeof("Åäö") << ' ' << strlen("åäö") << endl;
    return 0;
}
...and compiled it with g++ -fexec-charset=UTF-8 -pedantic -Wall on OpenBSD 5.3. I was expecting the size of the strings to be 6 chars + 1 NUL char, but I get the output 4 3 4 3.
I tried changing my system locale from ISO-8859-1 to UTF-8 with export LC_CTYPE=sv_SE.UTF-8, but that didn't help. (According to the gcc manual, that only changes the input character set, but hey, it was worth a try.)
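For reference, a quick diagnostic sketch (not part of the program above) that dumps the bytes the compiler actually stored in the literal; unconverted Latin-1 would show up as C5 C4 D6, a UTF-8 re-encoding as C3 85 C3 84 C3 96:
#include <cstdio>
const char str[] = "ÅÄÖ";
int main() {
    // Print every byte of the literal (including the terminating NUL) in hex
    // to see which encoding ended up in the executable.
    for (unsigned char c : str)
        std::printf("%02X ", c);
    std::printf("\n");
    return 0;
}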
So, what am I doing wrong?
Related
How to display character string literals with hex properly with std::cout in C++?
I want to use octal and hex escape sequences in string literals printed with std::cout in C++.
I want to print "bee".
#include <iostream>
int main() {
std::cout << "b\145e" << std::endl;//1
std::cout << "b\x65e" << std::endl;//2
return 0;
}
//1 works fine, but //2 fails with "hex escape sequence out of range".
Now I want to print "be3".
#include <iostream>
int main() {
std::cout << "b\1453" << std::endl;//1
std::cout << "b\x653" << std::endl;//2
return 0;
}
Again, //1 works fine, but //2 fails with "hex escape sequence out of range".
Can I now conclude that hex escapes are not a good way to write characters in string literals?
I get the feeling I am wrong, but I don't know why.
Can someone explain whether hex can be used and how?
There's actually an example of this exact same situation on cppreference's documentation on string literals.
If a valid hex digit follows a hex escape in a string literal, it would fail to compile as an invalid escape sequence. String concatenation can be used as a workaround:
They provide the example below:
// const char* p = "\xfff"; // error: hex escape sequence out of range
const char* p = "\xff""f"; // OK : the literal is const char[3] holding {'\xff','f','\0'}
Applying what they explain to your problem, we can print the string literal be3 in two ways:
std::cout << "b\x65" "3" << std::endl;
std::cout << "b\x65" << "3" << std::endl;
The hex escape sequences become \x65e and \x653, so you need to help the compiler stop after 65:
#include <iostream>
int main() {
std::cout << "b\x65""e" << std::endl;//2
std::cout << "b\x65""3" << std::endl;//2
}
I'm trying to get a regex to match a char containing a space ' '.
When compiled with g++ (MinGW 8.1.0 on Windows), it reliably fails to match.
When compiled with onlinegdb, it reliably matches.
Why would the behaviour differ between these two? What would be the best way to get my regex to match properly without using a std::string (which does match correctly)?
My code:
#include <iostream>
#include <regex>
#include <string>
int main() {
    char a = ' ';
    std::string b = " ";
    std::cout << std::regex_match(b, std::regex("\\s+")) << '\n'; // always writes 1
    std::cout << std::regex_match(&a, std::regex("\\s+")) << '\n'; // writes 1 in onlinegdb, 0 with MinGW
}
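For reference, a sketch of one possible workaround (not from the original post): the iterator-range overload of std::regex_match needs no null terminator, so a single char can be passed as a begin/end pair.
#include <iostream>
#include <regex>
int main() {
    char a = ' ';
    // Pass an explicit [begin, end) range so regex_match never has to look
    // for a terminating '\0' after the single char.
    std::cout << std::regex_match(&a, &a + 1, std::regex("\\s+")) << '\n'; // 1
}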
Is there any function available to calculate the formatted string size for wchar_t, similar to snprintf for char?
MSVC has _snwprintf, but I couldn't find an equivalent on Mac.
If not, is there a way to calculate this with the standard library (without Boost)?
I tried this on OSX (Apple LLVM version 8.0.0, clang-800.0.42.1) and CentOS (g++ 4.8.5 20150623 Red Hat 4.8.5-11) using the swprintf() function recommended by Cubbi.
NB the man entry for swprintf() on Mac indicates that its return value is the "number of chars written", not the number of wchars.
I ran the following code, and got the output results shown in the comments on each platform:
#include <iostream>
#include <clocale>
#include <string>
#include <cstring>
#include <cwchar>
int main()
{
    std::setlocale(LC_ALL, "en_US.utf8");
    char narrow_str[] = "z\u00df\u6c34\U0001f34c";
    std::cout << narrow_str << std::endl;               // zß水🍌
    std::cout << strlen(narrow_str) << std::endl;       // 10 (= 1 + 2 + 3 + 4)
    std::wstring wide_str = L"z\u00df\u6c34\U0001f34c";
    std::cout << wide_str.length() << std::endl;        // 4
    std::cout << wcslen(wide_str.c_str()) << std::endl; // 4
    wchar_t warr[100];
    int n1 = std::swprintf(warr, sizeof warr/sizeof *warr, L"%s", narrow_str);
    int n2 = std::swprintf(warr, sizeof warr/sizeof *warr, L"Converted from UTF-8: '%s'", narrow_str);
    std::cout << n1 << std::endl; // 10 with LLVM, 4 with gcc
    std::cout << n2 << std::endl; // 34 with LLVM, 28 with gcc
}
From this it appears that with LLVM, (1) the original code at cppreference fails and returns -1 because the warr[] array was too small, and (2) the return value from swprintf() is measured in char.
This may, or may not, be helpful.
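Since swprintf, unlike snprintf, cannot be called with a null buffer to query the required size, one possible workaround (a sketch, not from the answer above) is to retry with a growing buffer until the call succeeds:
#include <clocale>
#include <cwchar>
#include <vector>

// Returns the formatted length reported by swprintf (see the caveat above about
// chars vs. wchars on Mac), or -1 if the output would be unreasonably large.
int formatted_wide_length(const wchar_t* fmt, const char* arg) {
    std::vector<wchar_t> buf(64);
    for (;;) {
        int n = std::swprintf(buf.data(), buf.size(), fmt, arg);
        if (n >= 0)
            return n;                  // success: swprintf reports the length
        if (buf.size() > (1u << 20))
            return -1;                 // give up instead of growing forever
        buf.resize(buf.size() * 2);    // buffer too small (or encoding error): retry bigger
    }
}

int main() {
    std::setlocale(LC_ALL, "");        // needed so %s can convert the UTF-8 argument
    std::wprintf(L"%d\n", formatted_wide_length(L"Converted from UTF-8: '%s'", "z\u00df\u6c34\U0001f34c"));
}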
'_snwprintf' is a wide-character version of _snprintf; the pointer arguments to '_snwprintf' are wide-character strings. Detection of encoding errors in '_snwprintf' might differ from the detection in _snprintf. '_snwprintf', just like swprintf, writes output to a string instead of a destination of type FILE.
int _snwprintf(
    wchar_t *buffer,
    size_t count,
    const wchar_t *format [,
    argument] ...
);
Parameters:
buffer
Storage location for the output.
count
Maximum number of characters to store.
format
Format-control string.
argument
Optional arguments.
I believe the output has to do with UTF, but I do not know how.
Would someone, please, explain?
#include <iostream>
#include <cstdint>
#include <iomanip>
#include <string>
int main()
{
std::cout << "sizeof(char) = " << sizeof(char) << std::endl;
std::cout << "sizeof(std::string::value_type) = " << sizeof(std::string::value_type) << std::endl;
std::string _s1 ("abcde");
std::cout << "s1 = " << _s1 << ", _s1.size() = " << _s1.size() << std::endl;
std::string _s2 ("abcdé");
std::cout << "s2 = " << _s2 << ", _s2.size() = " << _s2.size() << std::endl;
return 0;
}
The output is:
sizeof(char) = 1
sizeof(std::string::value_type) = 1
s1 = abcde, _s1.size() = 5
s2 = abcdé, _s2.size() = 6
g++ --version prints g++ (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609
QTCreator compiles like this:
g++ -c -m32 -pipe -g -std=c++0x -Wall -W -fPIC -I../strsize -I. -I../../Qt/5.5/gcc/mkspecs/linux-g++-32 -o main.o ../strsize/main.cpp
g++ -m32 -Wl,-rpath,/home/rodrigo/Qt/5.5/gcc -o strsize main.o
Thanks a lot!
é is encoded as 2 bytes, 0xC3 0xA9, in UTF-8.
gcc's default input character set is UTF-8. Your editor also probably saved the file as UTF-8, so in your input .cpp file the string abcdé takes 6 bytes (as Peter already answered, LATIN SMALL LETTER E WITH ACUTE is encoded in UTF-8 with 2 bytes). std::string::length returns the length in bytes, i.e. 6. QED
You should open your source .cpp file in a hex editor to confirm.
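The same check can also be done at run time; here is a small sketch (assuming the file really is saved as UTF-8) that dumps the bytes of the literal:
#include <cstdio>
#include <string>
int main() {
    std::string s = "abcdé";        // é should be the two bytes 0xC3 0xA9 in UTF-8
    for (unsigned char c : s)
        std::printf("%02X ", c);    // expected: 61 62 63 64 C3 A9
    std::printf("\n%zu bytes\n", s.size());
}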
Even in C++11 std::string has nothing to do with UTF-8. In the description of size and length methods of std::string we can see:
For std::string, the elements are bytes (objects of type char), which are not the same as characters if a multibyte encoding such as UTF-8 is used.
Thus, you should use some third-party Unicode-aware library to handle Unicode strings.
If you continue to use non-Unicode string classes with Unicode strings, you may face lots of other problems. For example, you'll get a bogus result when comparing a precomposed character with its same-looking combining-character sequence.
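For instance, a small sketch (assuming a UTF-8 source and terminal) showing that byte-wise std::string comparison treats the precomposed é and the e + combining-accent sequence as different strings, even though they render identically:
#include <iostream>
#include <string>
int main() {
    std::string precomposed = "\xC3\xA9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    std::string combining   = "e\xCC\x81";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    std::cout << precomposed << " vs " << combining << '\n';               // both render as é
    std::cout << precomposed.size() << " vs " << combining.size() << '\n'; // 2 vs 3 bytes
    std::cout << (precomposed == combining) << '\n';                       // 0: compared byte by byte
}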
How can I print a string like this: €áa¢cée£ on the console/screen? I tried this:
#include <iostream>
#include <string>
using namespace std;
wstring wStr = L"€áa¢cée£";
int main (void)
{
    wcout << wStr << " : " << wStr.length() << endl;
    return 0;
}
which is not working. Even more confusing: if I remove € from the string, the printout looks like this: ?a?c?e? : 7, but with € in the string, nothing gets printed after the € character.
If I write the same code in python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
wStr = u"€áa¢cée£"
print u"%s" % wStr
it prints out the string correctly on the very same console. What am I missing in C++ (well, I'm just a noob)? Cheers!!
Update 1: based on n.m.'s suggestion
#include <iostream>
#include <string>
using namespace std;
string wStr = "€áa¢cée£";
char *pStr = 0;
int main (void)
{
    cout << wStr << " : " << wStr.length() << endl;
    pStr = &wStr[0];
    for (unsigned int i = 0; i < wStr.length(); i++) {
        cout << "char " << i+1 << " # " << *pStr << " => " << pStr << endl;
        pStr++;
    }
    return 0;
}
First of all, it reports 14 as the length of the string: €áa¢cée£ : 14. Is it because it's counting 2 bytes per character?
And all I get this:
char 1 # ? => €áa¢cée£
char 2 # ? => ??áa¢cée£
char 3 # ? => ?áa¢cée£
char 4 # ? => áa¢cée£
char 5 # ? => ?a¢cée£
char 6 # a => a¢cée£
char 7 # ? => ¢cée£
char 8 # ? => ?cée£
char 9 # c => cée£
char 10 # ? => ée£
char 11 # ? => ?e£
char 12 # e => e£
char 13 # ? => £
char 14 # ? => ?
as the last cout output. So the actual problem still remains, I believe. Cheers!!
Update 2: based on n.m.'s second suggestion
#include <iostream>
#include <string>
#include <clocale>
using namespace std;
wchar_t wStr[] = L"€áa¢cée£";
int iStr = sizeof(wStr) / sizeof(wStr[0]); // length of the string
wchar_t *pStr = 0;
int main (void)
{
    setlocale (LC_ALL,"");
    wcout << wStr << " : " << iStr << endl;
    pStr = &wStr[0];
    for (int i = 0; i < iStr; i++) {
        wcout << *pStr << " => " << static_cast<void*>(pStr) << " => " << pStr << endl;
        pStr++;
    }
    return 0;
}
And this is what I get as my result:
€áa¢cée£ : 9
€ => 0x1000010e8 => €áa¢cée£
á => 0x1000010ec => áa¢cée£
a => 0x1000010f0 => a¢cée£
¢ => 0x1000010f4 => ¢cée£
c => 0x1000010f8 => cée£
é => 0x1000010fc => ée£
e => 0x100001100 => e£
£ => 0x100001104 => £
=> 0x100001108 =>
Why is it reported as 9 rather than 8? Or is this what I should expect? Cheers!!
Drop the L before the string literal. Use std::string, not std::wstring.
UPD: There's a better (correct) solution: keep wchar_t, wstring and the L, and call setlocale(LC_ALL,"") at the beginning of your program.
You should call setlocale(LC_ALL,"") at the beginning of your program anyway. This instructs your program to work with your environment's locale instead of the default "C" locale. Your environment has a UTF-8 one, so everything should work.
Without calling setlocale(LC_ALL,""), the program works with UTF-8 sequences without "realizing" that they are UTF-8. If a correct UTF-8 sequence is printed on the terminal, it will be interpreted as UTF-8 and everything will look fine. That's what happens if you use string and char: gcc uses UTF-8 as a default encoding for strings, and the ostream happily prints them without applying any conversion. It thinks it has a sequence of ASCII characters.
But when you use wchar_t, everything breaks: gcc uses UTF-32, the correct re-encoding is not applied (because the locale is "C") and the output is garbage.
When you call setlocale(LC_ALL,"") the program knows it should recode UTF-32 to UTF-8, and everything is fine and dandy again.
This all assumes that we only ever want to work with UTF-8. Using arbitrary locales and encodings is beyond the scope of this answer.
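Putting the fix together with the original std::wstring version, here is a minimal sketch (assuming a UTF-8 environment locale and a platform where wchar_t holds UTF-32):
#include <clocale>
#include <iostream>
#include <string>
int main() {
    // Without this call the program stays in the default "C" locale and the
    // UTF-32 wchar_t data cannot be re-encoded to UTF-8 for the terminal.
    std::setlocale(LC_ALL, "");

    std::wstring wide = L"€áa¢cée£";
    std::wcout << wide << L" : " << wide.length() << L'\n';   // prints the text and 8
}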