Is it possible to print UTF-8 string with Boost and STL in windows console? - c++

I'm trying to output a UTF-8 encoded string with cout, without success. I'd like to use Boost.Locale in my program. I've found some information specific to the Windows console. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set the console output code page to 65001 and save all my sources in UTF-8 encoding with a BOM. So, here is my simple example:
#include <windows.h>
#include <cstdio>
#include <iostream>
#include <boost/locale.hpp>
using namespace std;
using namespace boost::locale;
int wmain(int argc, const wchar_t* argv[])
{
//system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
SetConsoleOutputCP(CP_UTF8);
locale::global(generator().generate(""));
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << "cout: " << utf8_string << endl;
printf("printf: %s\n", utf8_string);
return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf handle it correctly while cout doesn't? Can the Boost locale generator help with this? Or should I use something else to print UTF-8 text to the console in stream mode (a cout-like approach)?

It looks like std::cout is too clever here: it appears to treat your UTF-8 encoded string as an ASCII one and finds 21 non-ASCII chars, which are output as the unmapped character �. AFAIK the Windows C++ console driver insists on each char of a narrow string being mapped to a position on screen and does not support multi-byte character sets.
Here is what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the UTF-8 conversion):
const char utf8_string[] = { '\xe2', '\x99', '\xa3', '\xe2', '\x98', '\xbb', '\xe2', '\x96',
'\xbc', '\xe2', '\x96', '\xba', '\xe2', '\x99', '\x80', '\xe2', '\x99',
'\x82', '\xe2', '\x98', '\xbc', '\0' };
that is, 21 bytes (chars), none of which is in the ASCII range 0-0x7f.
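To double-check the byte values, here is a minimal sketch of the UTF-8 encoding rule. The helper name encode_utf8 is made up for this illustration; it is not part of Boost or the standard library.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Usage: U+2663 (the club symbol) gives the first three bytes of the array.
void encode_utf8_example() {
    assert(encode_utf8(0x2663) == "\xe2\x99\xa3");
}
```

Each of the seven symbols in the string lies between U+0800 and U+FFFF, so each one takes three bytes, which is where the total of 21 comes from.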
By contrast, printf just outputs the bytes without any conversion, which gives the correct output.
I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.
But you should note that your code fails to imbue the Boost locale into cout.

The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile once for every single char (byte).
If you'd like to debug it, set a breakpoint inside the _write function in the write.c file of the CRT sources, write something to cout and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with an equivalent (and faster!) one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for(size_t i = 0; i < utf8_string_len; ++i)
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile and the UTF-8 console works brilliantly:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, static_cast<DWORD>(utf8_string_len), &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003); both behave identically.
Obviously the Windows console implementation wants whole characters in each call to WriteFile/WriteConsole and cannot accept a UTF-8 character one byte at a time. :)
What can we do here?
My first idea is to make the output buffered, as with files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once; explained below)
The first issue is that console output becomes delayed: it waits until the end of the line or until the buffer overflows.
The second issue: it doesn't work.
Why? After the first buffer flush (at the first << endl) cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of bytes written in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of characters written (problem described here). The CRT detects that the number of bytes requested and the number written differ, and stops writing to the 'failed' stream.
What else can we do?
Well, I suppose we could implement our own std::basic_streambuf that writes to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad.
Other options are (a) to use std::wcout and strings of wchar_t characters, or (b) to use WriteFile/WriteConsole directly. Sometimes those solutions are acceptable.
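As a starting point for the custom stream buffer idea, here is a portable sketch, not a finished console driver. The names complete_utf8_prefix and utf8_chunk_buf are invented for this example, and the sink callback is an assumption: in a Windows program it would be the place to make one WriteFile or WriteConsole call per chunk. The buffer collects bytes and, on flush, hands over only whole UTF-8 characters, keeping an unfinished trailing sequence for the next flush.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Length of the longest prefix of s that does not end inside a UTF-8 sequence.
std::size_t complete_utf8_prefix(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = n;
    std::size_t back = 0;
    // Step back over at most three trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return n;  // no lead byte found: pass the data through as-is
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need = 1;
    if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    const std::size_t have = n - (i - 1);  // bytes available for the last sequence
    return have >= need ? n : i - 1;
}

// Stream buffer that forwards only complete UTF-8 characters to a sink.
class utf8_chunk_buf : public std::streambuf {
public:
    using sink_type = std::function<void(const std::string&)>;
    explicit utf8_chunk_buf(sink_type sink) : sink_(std::move(sink)) {}

protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    int sync() override {
        const std::size_t n = complete_utf8_prefix(pending_);
        if (n > 0) {
            sink_(pending_.substr(0, n));  // one call per chunk of whole characters
            pending_.erase(0, n);
        }
        return 0;
    }

private:
    sink_type sink_;
    std::string pending_;
};

// Usage: a character split across two flushes reaches the sink in one piece.
void utf8_chunk_buf_example() {
    std::string log;
    utf8_chunk_buf buf([&log](const std::string& chunk) { log += chunk; log += '|'; });
    std::ostream out(&buf);
    out << "a\xe2\x99" << std::flush;  // ends in the middle of U+2663
    assert(log == "a|");
    out << "\xa3" << std::flush;       // completes the character
    assert(log == "a|\xe2\x99\xa3|");
}
```

Installing such a buffer with cout.rdbuf(&buf) would route cout through it; whether that fully avoids the badbit problem described above depends on how the sink writes to the console, which this sketch leaves open.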
Working with UTF-8 console in Microsoft versions of C++ is really horrible.

Related

Is there a proper way to receive input from console in UTF-8 encoding?

When getting input from std::cin on Windows, the input is apparently always in the windows-1252 encoding (the default for the host machine in my case) regardless of any configuration I apply; the settings seem to affect only the output. Is there a proper way to capture input on Windows in UTF-8 encoding?
For instance, let's check out this program:
#include <iostream>
#include <locale>
#include <string>
int main(int argc, char* argv[])
{
std::cin.imbue(std::locale("es_ES.UTF-8"));
std::cout.imbue(std::locale("es_ES.UTF-8"));
std::cout << "ñeñeñe> ";
std::string in;
std::getline( std::cin, in );
std::cout << in;
}
I've compiled it using Visual Studio 2022 on a Windows machine with a Spanish locale. The source code is in UTF-8. When executing the resulting program (in a Windows PowerShell session, after running chcp 65001 to set the console code page to UTF-8), I see the following:
PS C:\> .\test_program.exe
ñeñeñe> ñeñeñe
e e e
The first "ñeñeñe" is correct: the "ñ" character is displayed correctly on the console. So far, so good. The user input is echoed back to the console correctly: another good point. But when the program sends the captured string back to the output, the "ñ" character is replaced by a blank space.
When debugging this program, I see that the variable "in" has captured the input in an encoding that is not UTF-8: it uses only one char for "ñ", whereas in UTF-8 that character takes two. The conclusion is that the input is not affected by the chcp command. Am I doing something wrong?
UPDATE
Somebody asked me to check what happens when switching to wcout/wcin:
std::wcout << u"ñeñeñe> ";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;
Behaviour:
PS C:\> .\test.exe
0,000,7FF,6D1,B76,E30ñeñeñe
e e e
Another try (setting the string as L"ñeñeñe"):
ñeñeñe> ñeñeñe
e e e
Leaving it as is:
std::wcout << "ñeñeñe> ";
Result is:
eee>
This is the closest to the solution I've found so far:
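#include <fcntl.h> // _O_WTEXT
#include <io.h> // _setmode
#include <cstdio> // _fileno, stdout, stdin
#include <iostream>
#include <string>
int main(int argc, char* argv[])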
{
_setmode(_fileno(stdout), _O_WTEXT);
_setmode(_fileno(stdin), _O_WTEXT);
std::wcout << L"ñeñeñe";
std::wstring in;
std::getline(std::wcin, in);
std::wcout << in;
return 0;
}
The solution depicted here goes in the right direction. One catch: both stdin and stdout must be put in the same mode, because the console echo rewrites the input. The remaining problem is having to write the strings with \uXXXX codes. I am still looking for a way around that, or for #defines that would keep the text literals readable.

How to apply <cctype> functions on text files with different encoding in c++

I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly. However, the files are mostly in German and use different encodings:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem I am facing is that I cannot find a correct way to apply character conversion functions such as tolower(), and I also get some weird symbols in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files, the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the "Unicode replacement character" � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case. I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check encoding of file that is about to be opened
open file according to its specific encoding
Convert file input to UTF-8
Process file and apply tolower() etc
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled globally in order to process the text (regardless of how the console displays it)? (On Linux, for example, de_DE is not listed when I run locale -a.)
2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My Linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
Output of locale -a:
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote; it doesn't work the way I want at the moment.
void processFiles() {
std::string filename = "17454-8.txt";
std::ifstream inFile;
inFile.open(filename);
if (!inFile) {
std::cerr << "Failed to open file" << std::endl;
exit(1);
}
//calculate file size
std::string s = "";
s.reserve(filesize(filename) + std::ifstream::pos_type(1));
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + "\n");
}
inFile.close();
std::cout << s << std::endl;
//remove punctuation, numbers, tolower,
//TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
std::setlocale(LC_ALL, "de_DE.iso88591");
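for (unsigned int i = 0; i < s.length(); ++i) {
// Cast to unsigned char: passing a negative char to the <cctype> functions is undefined behavior
unsigned char c = static_cast<unsigned char>(s[i]);
if (std::ispunct(c) || std::isdigit(c))
s[i] = ' ';
else if (std::isupper(c))
s[i] = static_cast<char>(std::tolower(c));
}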
//std::cout << s << std::endl;
//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
for (auto & i : tokens)
std::cout << i << std::endl;
//PROCESS TOKENS
return;
}
Unicode defines "code points" for characters. A code point is a number in the range U+0000 to U+10FFFF, so it fits in a 32-bit value. There are several encodings. ASCII only uses 7 bits, which gives 128 different chars. The 8th bit was used by various vendors and standards (IBM, Microsoft, ISO) to define another 128 chars, depending on the locale; these sets are known as "code pages" or, in the ISO case, as ISO-8859-x ("Latin-1" is another name for ISO-8859-1). Nowadays Windows uses UTF-16, with 2-byte code units. Because 16 bits are not enough for the whole Unicode set, UTF-16 represents code points above U+FFFF with a pair of code units (a surrogate pair). Unlike code pages, UTF-16 does not depend on the locale.
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 code points are exactly the ASCII chars, with just one byte per character. UTF-8 can use up to 4 bytes to represent a character. More info on Wikipedia.
While Windows uses UTF-16 for wide strings in memory (wchar_t is 16 bits there), on Linux wchar_t is 32 bits and typically holds UTF-32, while files and narrow strings are usually UTF-8.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, like in this SO answer
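For the specific mix in the question (UTF-8, ASCII and ISO-8859-1), one rough heuristic is to treat data that is well-formed UTF-8 as UTF-8 and everything else as ISO-8859-1. The sketch below rests on that assumption; the function names are invented for this example, and the guess can be wrong for short ISO-8859-1 texts that happen to form valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if s is well-formed UTF-8 (rejects overlong forms, surrogates, > U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b < 0x80) { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > n) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// ISO-8859-1 to UTF-8: every byte is the code point with the same value.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Heuristic: keep valid UTF-8 (ASCII included), otherwise assume ISO-8859-1.
std::string to_utf8_guess(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}

// Usage: the word "für" in both encodings ends up as the same UTF-8 bytes.
void to_utf8_guess_example() {
    assert(to_utf8_guess("f\xFCr") == "f\xC3\xBCr");      // ISO-8859-1 input is converted
    assert(to_utf8_guess("f\xC3\xBCr") == "f\xC3\xBCr");  // valid UTF-8 passes through
}
```

After this step all files share one encoding, so the tokenizing code only has to deal with UTF-8.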
tolower and such functions can work with a locale, e.g.
#include <iostream>
#include <locale>
int main() {
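wchar_t s = L'\u00D6'; // Latin capital letter O with diaeresis, decimal 214
// std::locale throws std::runtime_error if "en_US.UTF-8" is not installed
wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
// Print the numeric values; streaming a wchar_t to std::cout is ill-formed since C++20
std::cout << "s = " << static_cast<unsigned long>(s) << std::endl;
std::cout << "sL= " << static_cast<unsigned long>(sL) << std::endl;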
return 0;
}
outputs:
s = 214
sL= 246
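The example above needs a named locale to be installed, and the question's locale -a output lists no de_DE entry. For data that is known to be ISO-8859-1, a locale-independent sketch is possible because the upper-case letters of that charset sit exactly 0x20 below their lower-case forms. The helper names are made up for this illustration, and it does not apply to UTF-8 input.

```cpp
#include <cassert>
#include <string>

// Lower-cases one ISO-8859-1 byte without consulting any locale.
unsigned char latin1_lower_char(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return static_cast<unsigned char>(c + 0x20);
    // 0xC0..0xDE are the accented capitals, except 0xD7 (multiplication sign).
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return static_cast<unsigned char>(c + 0x20);
    return c;
}

// Lower-cases a whole ISO-8859-1 encoded string.
std::string latin1_lower(std::string text) {
    for (char& ch : text)
        ch = static_cast<char>(latin1_lower_char(static_cast<unsigned char>(ch)));
    return text;
}

// Usage: "Österreich" in ISO-8859-1 starts with byte 0xD6, which becomes 0xF6.
void latin1_lower_example() {
    assert(latin1_lower("\xD6sterreich") == "\xF6sterreich");
}
```

The sharp s (0xDF) and y with diaeresis (0xFF) are already lower case in ISO-8859-1, so they are left unchanged.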
In this other SO answer you can find good solutions, such as using the iconv library (available on Linux and as a Win32 port).
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
# Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

c++ literal u8 and BOM (Byte Order Mark)

I decided to write a simple example:
#include <iostream>
int main()
{
std::cout << u8"это строка6" << std::endl;
return 0;
}
I executed the following command in the console:
chcp 65001
Program output:
��то строка6
Why is the first character not displayed correctly? I think that code page 65001 uses a BOM and reads the first symbol as a BOM. Is this true?
Well, the entire standard IO library is dodgy with that code page. Here's another test program (\xe2\x86\x92 is the arrow → in UTF-8):
#include <stdio.h>
int main(void)
{
char s[] = "\xe2\x86\x92 a \xe2\x86\x92 b\n";
int l = (int) sizeof(s) - 1;
int wr = (int) fwrite(s, 1, l, stdout);
printf("%d/%d written\n", wr, l);
return 0;
}
And its output:
��� a → b
10/12 written
Note that the first character is again replaced by the ��� (it's 3 bytes in UTF-8), and the fwrite call returns the number of characters written on the console. This is a violation of the C standard (it should return the number of bytes), and it will break every program using fwrite or related functions correctly (for instance, try to print "☺☺☺☺☺☺☺☺☺☺☺☺" with Python 3.4).
So your only options to reliably output Unicode text are Windows-specific (unless these issues are fixed in the latest version of MSVC):
Use wide output functions, as described here: Output unicode strings in Windows console app
Use WriteConsoleW (the wide version). Make sure you test if the standard output or error handle is actually a console.
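A minimal sketch of the second option follows. The function name write_utf8_stdout is invented for this example; the Windows calls used are GetStdHandle, GetConsoleMode, MultiByteToWideChar, WriteConsoleW and WriteFile. GetConsoleMode failing is taken as the sign that output is redirected, and the non-Windows branch is only a plain fwrite fallback. It assumes the text is shorter than INT_MAX bytes and has not been tested against every console host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Writes UTF-8 text to standard output; returns true on success.
bool write_utf8_stdout(const std::string& text) {
    if (text.empty()) return true;
#ifdef _WIN32
    std::fflush(stdout);  // keep ordering with earlier stdio output
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(out, &mode)) {
        // Real console: convert to UTF-16 and use the wide console API.
        const int bytes = static_cast<int>(text.size());
        const int wlen = MultiByteToWideChar(CP_UTF8, 0, text.data(), bytes, nullptr, 0);
        if (wlen <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, text.data(), bytes, &wide[0], wlen);
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wlen), &written, nullptr) != 0;
    }
    // Redirected to a file or pipe: write the UTF-8 bytes unchanged.
    DWORD written = 0;
    return WriteFile(out, text.data(), static_cast<DWORD>(text.size()), &written, nullptr) != 0 &&
           written == text.size();
#else
    return std::fwrite(text.data(), 1, text.size(), stdout) == text.size();
#endif
}

// Usage: the string from the test program above.
void write_utf8_stdout_example() {
    assert(write_utf8_stdout("\xe2\x86\x92 a \xe2\x86\x92 b\n"));
}
```

Because the console branch never goes through the byte-oriented CRT path, it does not depend on the active code page, so chcp 65001 should not be needed for it.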

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know that C++ has wide-character counterparts (the w-prefixed types) for reading wchar_t data. I tried the following:
#include <fstream>
#include <iostream>
int main() {
std::wfstream file("file"); // aaaàaaa
std::wstring str;
std::getline(file, str);
std::wcout << str << std::endl; // aaa
}
But as you can see, it did not read the full line. It stops when it reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8, this is likely given that you are on Linux. I'll assume your terminal also supports it. If you directly read using a std::string, with chars, you will have everything working. Look:
// olá
#include <iostream>
#include <fstream>
int main() {
std::fstream file("test.cpp");
std::string str;
std::getline(file, str);
std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/ / o l á
47 47 32 111 108 195 161
Note that á is encoded with two bytes. If you ask for the size of the string (str.size()), you get 7, which is the number of bytes, not the number of characters (6). This happens because the string treats every byte as a char. When you send it to std::cout, the bytes are passed to the terminal unchanged. And the magical part: the terminal works with UTF-8 by default, so it just assumes the bytes are UTF-8 and correctly prints 6 characters.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
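If only the number of characters is needed, it can be computed while staying with std::string. The sketch below (the function name is invented here) counts code points by skipping UTF-8 continuation bytes, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a valid UTF-8 string:
// every byte that is not a continuation byte (10xxxxxx) starts a new one.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Usage: the first line of the file above, "// olá".
void count_code_points_example() {
    const std::string line = "// ol\xC3\xA1";
    assert(line.size() == 7);               // bytes
    assert(count_code_points(line) == 6);   // characters
}
```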
Let's go for wstrings. They store each letter in a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible Unicode character. But it will not work directly, because C++ uses the "C" locale by default. A locale is a specification of how to deal with various aspects of the system, like "how to print a date", "how to format a currency value" or even "how to decode text". The last aspect is the important one here, and the default "C" locale says: "Assume everything is ASCII". When the stream is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
int main() {
std::ios::sync_with_stdio(false);
std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
std::wcout.imbue(loc); // Use it for output
std::wfstream file("test.cpp");
file.imbue(loc); // Use it for file input
std::wstring str;
std::getline(file, str); // str.size() will be 6
std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because it enables you to use both cout and printf in the same program. We have to disable it because the C streams will break the UTF-8 encoding and produce garbage on the output.

C++ - string.compare issues when output to text file is different to console output?

I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
However, the string.compare returns stating they are not equal. When outputting to a text file, the two strings appear as follows:
ThisIsATestStringOutputtedToAFile
T^#h^#i^#s^#I^#s^#A^#T^#e^#s^#t^#S^#t^#r^#i^#n^#g^#O^#u^#t^#p^#u^#t^#
t^#e^#d^#T^#o^#A^#F^#i^#l^#e
I'm guessing this is some kind of encoding problem, and if I was in my native language (good old C#), I wouldn't have too many problems. As it is I'm with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at maybe converting to/from ansi/unicode, and also removing the odd characters, but I'm not even sure if they really exist or not..
Thanks in advance for any suggestions.
EDIT
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
myInput.open(fileLocation.c_str());
myOutput.open("test.txt");
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
cin.get();
myInput.close();
myOutput.close();
TEST_ASSERT(compare1.compare(compare2) == 0);
How did you create the content of myInput? I would guess that this file was created in a two-byte encoding. You can use a hex dump to verify this theory, or use a different editor to create the file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
UPDATE:
If you cannot change the encoding of the myInput file, you can try to use wide-chars in your program. I.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
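The following works for me and writes the text pasted below into the file. Note the '\0' character embedded in the string literal: because the literal is converted to std::string through a const char*, the stream's content ends at that '\0'.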
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
std::ofstream myOutput("test.txt");
//std::ostringstream myOutput;
std::string str1 = "ThisIsATestStringOutputtedToAFile";
std::string fileBuffer;
std::getline(myInput, fileBuffer);
std::string str2 = fileBuffer.substr(10,100);
std::cout << str1 + "\n";
std::cout << str2 + "\n";
myOutput << str1 + "\n";
myOutput << str2 + "\n";
std::cout << str1.compare(str2) << '\n';
//std::cout << myOutput.str() << '\n';
return 0;
}
Output:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. The way to convert between them under the OS limitations I had for this project (Linux, C/C++ code) was to use iconv. To stay compatible with the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
NOTE
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case, though, the code was in some rarely used unit tests and so didn't need to be highly optimized, hence the creation, destruction and I/O of a few text files wasn't an issue.
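For comparison, the same conversion can be sketched in-process, without temporary files. This is not the code used in the project above: the function names are invented for this example, and it assumes the bytes read from the file are little-endian UTF-16, optionally starting with a BOM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends one code point (a valid, non-surrogate value) as UTF-8.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Converts raw UTF-16LE bytes to UTF-8. Unpaired surrogates become U+FFFD;
// a trailing odd byte is ignored.
std::string utf16le_to_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    auto unit_at = [&bytes](std::size_t pos) -> unsigned long {
        return static_cast<unsigned char>(bytes[pos]) |
               (static_cast<unsigned long>(static_cast<unsigned char>(bytes[pos + 1])) << 8);
    };
    std::size_t i = 0;
    if (n >= 2 && unit_at(0) == 0xFEFF) i = 2;  // skip the byte order mark
    std::string out;
    while (i + 1 < n) {
        unsigned long cp = unit_at(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {      // high surrogate: needs a low one
            const unsigned long low = (i + 1 < n) ? unit_at(i) : 0;
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                          // low surrogate without a high one
        }
        append_utf8(out, cp);
    }
    return out;
}

// Usage: the interleaved zero bytes seen in the output file disappear.
void utf16le_to_utf8_example() {
    assert(utf16le_to_utf8(std::string("T\0h\0i\0s\0", 8)) == "This");
}
```

Note that std::getline on a narrow ifstream splits UTF-16 data at the byte 0x0A, so for this approach the file would be better read in binary mode as a whole and converted before taking substrings.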