C++ literal u8 and BOM (Byte Order Mark)

I decided to write a simple example:
#include <iostream>
int main()
{
std::cout << u8"это строка6" << std::endl;
return 0;
}
I executed the following command in the console:
chcp 65001
Program output:
��то строка6
Why is the first character not displayed correctly? I think that code page 65001 uses a BOM and the first symbol is read as a BOM. Is this true?

Well, the entire standard IO library is dodgy with that code page. Here's another test program (\xe2\x86\x92 is the arrow → in UTF-8):
#include <stdio.h>
int main(void)
{
char s[] = "\xe2\x86\x92 a \xe2\x86\x92 b\n";
int l = (int) sizeof(s) - 1;
int wr = fwrite(s, 1, l, stdout);
printf("%d/%d written\n", wr, l);
return 0;
}
And its output:
��� a → b
10/12 written
Note that the first character is again replaced by the ��� (it's 3 bytes in UTF-8), and the fwrite call returns the number of characters written on the console. This is a violation of the C standard (it should return the number of bytes), and it will break every program using fwrite or related functions correctly (for instance, try to print "☺☺☺☺☺☺☺☺☺☺☺☺" with Python 3.4).
So your only options to reliably output Unicode text are Windows-specific (unless these issues are fixed in the latest version of MSVC):
Use wide output functions, as described here: Output unicode strings in Windows console app
Use WriteConsoleW (the wide version). Make sure you test whether the standard output or error handle is actually a console; a sketch follows below.
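For instance, here is a minimal sketch of the WriteConsoleW route. The GetConsoleMode test is one common way to check for a real console handle; the sample string and the fallback comment are just illustrative, and the source encoding is assumed to be understood by the compiler.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    const wchar_t text[] = L"это строка6\n";

    // GetConsoleMode succeeds only for a real console handle; if output is
    // redirected to a file or pipe, WriteConsoleW fails, so check first.
    if (GetConsoleMode(out, &mode))
    {
        DWORD written = 0;
        WriteConsoleW(out, text,
                      (DWORD)(sizeof(text) / sizeof(text[0]) - 1), // character count, minus the terminator
                      &written, NULL);
    }
    else
    {
        // Output is redirected (file/pipe): fall back to byte-oriented output,
        // e.g. write UTF-8 bytes with fwrite, since WriteConsoleW only works on a console.
    }
    return 0;
}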

Related

I can't get the infinity symbol to print in Visual Studio 2019 using C++

I'm trying to get the infinity symbol (∞) to print but I keep getting garbage. I've tried everything mentioned here but nothing is working.
What I'm trying to accomplish is this:
modifies strength by 9 ∞
I've tried
printf ("%c", 236);
printf ("%c", 236u);
and I get
modifies strength by 9 ì
I've tried
printf("∞");
and I get
modifies strength by 9 ?
I tried this
if ( paf->duration == -1 ){
setlocale(LC_ALL, "en_US.UTF-8");
wprintf(L"%lc\n", 8734);
ch->printf("∞");
just to see if I could get wprintf to print it, but it completely ignores setlocale and wprintf and still gives me
modifies strength by 9 ?
I tried
if ( paf->duration == -1 ){
std::cout << "\u221E";
ch->printf("∞");
But got this warning and error:
Error C2664 'int _CrtDbgReportW(int,const wchar_t *,int,const wchar_t *,const wchar_t *,...)': cannot convert argument 5 from 'int' to 'const wchar_t *' testROS1a C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\ucrt\malloc.h 164
Warning C4566 character represented by universal-character-name '\u221E' cannot be represented in the current code page (1252) testROS1a C:\_Reign of Shadow\TEST\src\CPP\act_info.cpp 3724
which I can't make heads or tails of. I've exhausted the scope of my knowledge so does anyone know how to make this happen?
This does the trick on my machine (Windows, code page 1252), though I'm not sure how universal it is. I never really got to work much with Unicode/localization stuff, and there always seems to be one more gotcha.
#include <iostream>
#include <io.h>
#include <fcntl.h>
const wchar_t infinity_symbol = 0x221E;
int main()
{
// switch the Windows console to Unicode (UTF-16) output
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << infinity_symbol;
}
To use the Windows command prompt with wide strings, change the mode as follows; printf/cout won't work until the mode is switched back. Make sure to flush between mode changes:
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main()
{
// To use wprintf/wcout and output any BMP (<= U+FFFF) code point
int org = _setmode(_fileno(stdout), _O_U16TEXT);
wcout << L'\u221e' << endl;
wprintf(L"\u221e\n");
fflush(stdout);
_setmode(_fileno(stdout), org); // to switch back to cout/printf and default code page
cout << "hello, world!" << endl;
printf("hello, world!\n");
}
Output:
∞
∞
hello, world!
hello, world!
If you use UTF-8 source and your compiler accepts it, you can also change the code page of the terminal to 65001 (UTF-8) and it could work with printf as is:
test.c
#include <stdio.h>
int main() {
printf("∞\n");
}
Console output:
C:\demo>cl /W4 /utf-8 /nologo test.c
test.c
C:\demo>chcp
Active code page: 437
C:\demo>test
Γê₧     (wrong code page: the UTF-8 bytes of ∞ are interpreted as CP437 characters)
C:\demo>chcp 65001
Active code page: 65001
C:\demo>test
∞
The best option on Windows is to use UTF-16 with _setmode as shown in other answers. You can also use _setmode to switch back and forth between Unicode and ANSI.
warning C4566: character represented by universal-character-name
'\u221E' cannot be represented in the current code page (1252)
That's because your *.cpp file is probably saved as Unicode; the compiler sees ∞ as the 2-byte UTF-16 wchar_t '\u221E' and can't convert it to a char in code page 1252.
printf("\xEC") might print ∞, but only on English systems, and only if the console font feels like interpreting it that way. Even if it worked, it may stop working tomorrow if you change some small settings. This is the old ANSI encoding which many problems, that's why Unicode was brought in.
You can use this alternate solution in Visual Studio with C++20:
SetConsoleOutputCP(CP_UTF8);
printf((const char*)u8"∞");
And this solution in Visual Studio and C++17:
SetConsoleOutputCP(CP_UTF8);
printf(u8"∞");
And who knows how it's going to change later.
But UTF-16 is at least natively supported; wprintf and wcout with _setmode are more consistent.
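For reference, here is the C++20 variant above as a self-contained program; the includes and the \u221E escape are my additions, and it assumes a C++20 compiler on Windows:
#include <windows.h>
#include <stdio.h>

int main()
{
    // Tell the console to interpret the bytes we write as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    // In C++20, u8"..." is char8_t-based, hence the cast for printf.
    printf("%s\n", (const char*)u8"\u221E"); // U+221E = infinity symbol
    return 0;
}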

How to detect non-ASCII characters in C++ on Windows?

I'm simply trying to detect non-ASCII characters in my C++ program on Windows.
Using something like isascii() or:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));
does not work because non-ASCII characters are getting mapped to ASCII characters before or while getchar() is doing its thing. For example, if I have some code like:
#include <cstdio>   // getchar, printf
#include <cctype>   // isascii
#include <iostream>
using namespace std;
int main()
{
int c;
c = getchar();
cout << isascii(c) << endl;
cout << c << endl;
printf("0x%x\n", c);
cout << (char)c;
return 0;
}
and input 😁 (because I am so happy right now), the output is
1
63
0x3f
?
Furthermore, if I feed the program something outside of the extended ASCII range (code page 437), like 'Ĥ', I get the output
1
72
0x48
H
The same happens with similar inputs such as Ĭ or ō (they map to I and o). So this seems algorithmic and not just mojibake. A quick check in Python (via the same terminal) with a program like
i = input()
print(ord(i))
gives me the expected code point instead of the ASCII-mapped one (so it's not the code page or the terminal?). This makes me believe getchar() or the C++ compilers (tested with the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and I cannot reproduce the issue, which makes me inclined to believe it is something to do with Windows (10 Pro). Can anyone explain what is going on here?
Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.
I think the problem is that getchar() expects input as a char type, which is 8 bits and only supports ASCII, while getwchar() supports the wchar_t type, which allows for other text encodings. "😁" isn't ASCII, and from this page: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings , it seems like Windows encodes extended characters like this in UTF-16. The 0x3f you printed is the code for '?', which suggests the character is simply being replaced because it can't be represented as a single char.
Okay, I have solved this. I was not aware of translation modes.
_setmode(_fileno(stdin), _O_WTEXT);
was the solution. The link below essentially explains that there are translation modes, and I think phase 5 (character-set mapping) explains what happened.
https://en.cppreference.com/w/cpp/language/translation_phases
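For completeness, a minimal sketch of that fix, assuming Windows with the MSVC CRT; the stdout mode switch is only there so wprintf can show the result:
#include <stdio.h>
#include <wchar.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
    // Put stdin and stdout into wide (UTF-16) mode so console input is not
    // squeezed through the ANSI code page first.
    _setmode(_fileno(stdin), _O_WTEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    wint_t c = getwchar();               // one UTF-16 code unit
    // Characters outside the BMP (like 😁) arrive as a surrogate pair,
    // i.e. two successive getwchar() calls.
    wprintf(L"U+%04X\n", (unsigned)c);
    return 0;
}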

Is it possible to print a UTF-8 string with Boost and STL in the Windows console?

I'm trying to output a UTF-8 encoded string with cout with no success. I'd like to use Boost.Locale in my program. I've found some info regarding Windows console specifics. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set the output console code page to 65001 and save all my sources in UTF-8 encoding with BOM. So, here is my simple example:
#include <windows.h>
#include <boost/locale.hpp>
using namespace std;
using namespace boost::locale;
int wmain(int argc, const wchar_t* argv[])
{
//system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
SetConsoleOutputCP(CP_UTF8);
locale::global(generator().generate(""));
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << "cout: " << utf8_string << endl;
printf("printf: %s\n", utf8_string);
return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf do it well while cout doesn't? Can Boost's locale generator help with it? Or should I use something else to print UTF-8 text to the console in stream mode (a cout-like approach)?
It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as an ASCII one and finds 21 non-ASCII characters that it outputs as the unmapped character �. AFAIK the Windows C++ console driver insists on each character from a narrow char string being mapped to a position on screen and does not support multibyte character sets.
Here what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the UTF-8 conversion):
utf8_string = { 0xe2, 0x99, 0xa3, 0xe2, 0x98, 0xbb, 0xe2, 0x96,
                0xbc, 0xe2, 0x96, 0xba, 0xe2, 0x99, 0x80, 0xe2, 0x99,
                0x82, 0xe2, 0x98, 0xbc, '\0' };
that is, 21 bytes, none of which is in the ASCII range 0-0x7f.
On the other side, printf just outputs the bytes without any conversion, giving the correct output.
I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.
But you should notice that your code fails to imbue the Boost locale into cout.
The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile for every character.
If you'd like to debug it, set a breakpoint inside the _write function in the write.c file of the CRT sources, write something to cout, and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with equivalent (and faster!) one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for(size_t i = 0; i < utf8_string_len; ++i)
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile and the UTF-8 console output is perfect:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, utf8_string_len, &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003), both of them behave identically.
Obviously the Windows console implementation wants whole characters per WriteFile/WriteConsole call and cannot take UTF-8 characters one byte at a time. :)
What can we do here?
My first idea is to make output buffered, like in files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once, I explain it later)
The first issue is that console output becomes delayed: it waits until end of line or buffer overflow.
The second issue is that it doesn't work.
Why? After the first buffer flush (at the first << endl) cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of written bytes in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of written characters (the problem described here). The CRT detects that the number of bytes requested and the number reported as written differ, and stops writing to the 'failed' stream.
What more can we do?
Well, I suppose we could implement our own std::basic_streambuf to write to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad.
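For anyone who wants to try, here is a rough, minimal sketch of such a streambuf. It assumes the program writes valid UTF-8, that stdout really is a console, and that flushes happen at character boundaries; it simply buffers the bytes, converts them with MultiByteToWideChar, and hands whole characters to WriteConsoleW, which sidesteps the byte-count problem above.
#include <windows.h>
#include <iostream>
#include <streambuf>
#include <string>

// Collects UTF-8 bytes and flushes them to the console as whole UTF-16
// characters, so WriteConsoleW never sees a partial multi-byte sequence.
class ConsoleUtf8Buf : public std::streambuf
{
    std::string pending_;   // raw UTF-8 bytes not yet written
    HANDLE      console_;

    int_type overflow(int_type ch) override
    {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

    int sync() override
    {
        if (pending_.empty())
            return 0;
        // Note: this assumes the stream is flushed at character boundaries
        // (e.g. by << endl); flushing mid-sequence would garble the output.
        int wide_len = MultiByteToWideChar(CP_UTF8, 0, pending_.data(),
                                           (int)pending_.size(), NULL, 0);
        std::wstring wide(wide_len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, pending_.data(), (int)pending_.size(),
                            &wide[0], wide_len);
        DWORD written = 0;
        WriteConsoleW(console_, wide.data(), (DWORD)wide.size(), &written, NULL);
        pending_.clear();
        return 0;
    }

public:
    ConsoleUtf8Buf() : console_(GetStdHandle(STD_OUTPUT_HANDLE)) {}
};

int main()
{
    static const char* utf8_string = "♣☻▼►♀♂☼"; // UTF-8 source assumed
    ConsoleUtf8Buf buf;
    std::streambuf* old = std::cout.rdbuf(&buf);
    std::cout << utf8_string << std::endl;       // endl triggers sync()
    std::cout.rdbuf(old);                        // restore the original buffer
    return 0;
}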
Other options are (a) using std::wcout and strings of wchar_t characters, or (b) using WriteFile/WriteConsole directly. Sometimes those solutions are acceptable.
Working with UTF-8 console in Microsoft versions of C++ is really horrible.

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know C++ has the w-prefixed facilities for reading wide characters. I tried the following:
#include <fstream>
#include <iostream>
int main() {
std::wfstream file("file"); // aaaàaaa
std::wstring str;
std::getline(file, str);
std::wcout << str << std::endl; // aaa
}
But as you can see, it did not read a full line. It stops when it reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8, which is likely given that you are on Linux. I'll assume your terminal also supports it. If you read directly into a std::string, with chars, you will have everything apparently working. Look:
// olá
#include <iostream>
#include <fstream>
int main() {
std::fstream file("test.cpp");
std::string str;
std::getline(file, str);
std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/    /    ' '   o     l     á
47   47   32    111   108   195 161
Note that á is encoded with two bytes. If you ask the size of the string (str.size()), you will indeed get the wrong value: 7. This happens because the string thinks every byte is a char. When you send it to std::cout, the string will be given to the terminal to print. And the magical part: The terminal works with utf-8 by default. So it just assumes the string is utf-8 and correctly prints 6 chars.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
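A tiny sketch of that pitfall (UTF-8 source and terminal assumed): size() counts bytes, and cutting the string at a byte index can split the two-byte á.
#include <iostream>
#include <string>

int main() {
    std::string str = "// olá";                 // 6 characters, 7 bytes in UTF-8
    std::cout << str.size() << std::endl;       // prints 7: bytes, not letters
    std::cout << str.substr(0, 6) << std::endl; // chops 'á' in half: broken UTF-8 output
    return 0;
}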
Let's go for wstrings. They store each letter with a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible unicode character. But it will not work directly because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last factor is important and the default "C" encoding says: "Assume everything is ASCII". When it is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
int main() {
std::ios::sync_with_stdio(false);
std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
std::wcout.imbue(loc); // Use it for output
std::wfstream file("test.cpp");
file.imbue(loc); // Use it for file input
std::wstring str;
std::getline(file, str); // str.size() will be 6
std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because it enables you to use both cout and printf in the same program. We have to disable it here because the C streams will break the UTF-8 encoding and produce garbage on the output.

Matching Russian vowels in C++

I wanted to write a function which returns true if a given character is a Russian vowel. But the results I get are strange to me. This is what I've got so far:
#include <iostream>
using namespace std;
bool is_vowel_p(char working_char)
// returns true if the character is a russian vowel
{
string matcher = "аяёоэеуюыи";
if (find(matcher.begin(), matcher.end(), working_char) != matcher.end())
return true;
else
return false;
}
void main()
{
cout << is_vowel_p('е') << endl; // russian vowel
cout << is_vowel_p('Ж') << endl; // russian consonant
cout << is_vowel_p('D') << endl; // latin letter
}
The result is:
1
1
0
what is strange to me. I expected the following result:
1
0
0
It seems that there is some kind of internal mechanism which I don't know about yet. I'm interested, first, in how to fix this function to work properly, and second, in what is going on that I get this result.
string and char are only guaranteed to represent characters in the basic character set - which does not include the Cyrillic alphabet.
Using wstring and wchar_t, and adding L before the string and character literals to indicate that they use wide characters, should allow you to work with those letters.
Also, for portability you need to include <algorithm> for find, and give main a return type of int.
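A minimal sketch of those fixes combined (wstring, L literals, <algorithm>, int main), assuming the compiler reads the Cyrillic source correctly and the console can show wide output:
#include <algorithm>
#include <iostream>
#include <string>

bool is_vowel_p(wchar_t working_char)
// returns true if the character is a Russian vowel
{
    const std::wstring matcher = L"аяёоэеуюыи";
    return std::find(matcher.begin(), matcher.end(), working_char) != matcher.end();
}

int main()
{
    std::wcout << is_vowel_p(L'е') << std::endl; // Russian vowel     -> 1
    std::wcout << is_vowel_p(L'Ж') << std::endl; // Russian consonant -> 0
    std::wcout << is_vowel_p(L'D') << std::endl; // Latin letter      -> 0
    return 0;
}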
Your C++ source code is being treated as ASCII, but you are entering Unicode characters, so the comparison is being done using 8-bit values. I bet one of the vowels fulfills the following:
vowel & 255 == (code point for 'Ж') & 255
You need to use Unicode functions to do this, not ASCII functions, i.e. functions that take wchar_t values. Also, make sure your compiler can parse the non-ASCII vowel string. With MS VC, the compiler requires:
L"аяё×эеуюыи" or TEXT("аяё×эеуюыи")
The latter is a macro that adds the L prefix when compiling with Unicode support.
Convert the code to use wchar_t and it should work.
A very useful function in locale.h:
setlocale(LC_ALL, "Russian");
Paste this at the beginning of the program.
Example:
#include <stdio.h>
#include <locale.h>
int main(void)
{
setlocale(LC_ALL, "Russian");
printf("Здравствуй, мир!\n");//Hello, world!
}
Make sure your system default locale is Russian, and make sure your file is saved as codepage 1251 (Cyrillic/Windows). If it's saved as Unicode, this won't ever work.
The system default locale is the one used by non-Unicode-compliant programs. It's in Control Panel, under Regional settings.
Alternatively, rewrite the code to use wstring and wchar_t and L"" string/char literals.