On Linux with g++, if I set a UTF-8 global locale, then wcin correctly transcodes UTF-8 to the internal wchar_t encoding.
However, if I use the classic locale and imbue a UTF-8 locale into wcin, this doesn't happen: input either fails altogether, or each individual byte gets converted to wchar_t independently.
With clang++ and libc++, neither setting the global locale nor imbuing the locale in wcin works.
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    if (true)
        // this works with g++, but not with clang++/libc++
        locale::global(locale("C.UTF-8"));
    else
        // this doesn't work with either implementation
        wcin.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    cout << s.length() << " " << (s == L"áéú");
    return 0;
}
The input stream contains only the characters áéú (encoded as UTF-8, not in any single-byte encoding).
I can't reproduce the other behaviour with online compilers.
Is this standard-conforming? Shouldn't I be able to leave the global locale alone and use imbue instead?
Should either of the described behaviours be classified as an implementation bug?
First of all, you should use wcout together with wcin.
You have two possible solutions:
1) Deactivate synchronization of the iostream and cstdio streams by calling
ios_base::sync_with_stdio(false);
Note that this must be the first call, before any other I/O; otherwise the behaviour is implementation-dependent.
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    ios_base::sync_with_stdio(false);
    wcin.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}
2) Set the C locale globally with setlocale and imbue the locale into wcout:
#include <clocale>
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main() {
    std::setlocale(LC_ALL, "C.UTF-8");
    wcout.imbue(locale("C.UTF-8"));
    wstring s;
    wcin >> s;
    wcout << s.length() << " " << (s == L"áéú");
    return 0;
}
I tested both of them using ideone and they work fine. I don't have clang++/libc++ at hand, so I wasn't able to test this behaviour, sorry.
Related
I was trying to create a simple program to help my students train German irregular verbs, but I have had problems with special characters and the if statement. Basically it does not recognise ä, ö, ü and ß, so the output is always the else branch ("Nicht Gut"). How could I fix it?
#include <iostream>
#include <string>
#include <conio.h>
#include <locale.h>

using namespace std;

int main() {
    setlocale(LC_CTYPE, "German");
    string Antwort1;
    string Antwort2;
    string Antwort3;
    string str;
    getline(cin, str);
    cout << str;
    cout << "Präsens von BEHALTEN (du)" << endl;
    cin >> Antwort1;
    if (Antwort1 == "behältst") {
        cout << "Gut!" << endl;
    }
    else {
        cout << "Nicht Gut" << endl;
    }
    cout << "Präsens von BEHALTEN (er/sie/es/man) " << endl;
    cin >> Antwort1;
    if (Antwort1 == "behält") {
        cout << "Gut!" << endl;
    }
    else {
        cout << "Nicht Gut" << endl;
    }
    return 0;
}
I tried with
if (Antwort1 == (LC_CTYPE, "German"),"behält")
but that causes the contrary effect: then every single string I write is accepted as valid ("Gut").
My answer applies to the Windows 10 console using the classic default Command Prompt (I haven't tried it with other systems like PowerShell, nor have I tried these experiments on Linux yet).
It seems to me that, as of today (23 February 2022), Windows 10's Command Prompt and the Microsoft C/C++ Runtime of VS2019 don't support UTF-8 well: see, for example, this blog post showing a CRT crash you get when trying to call:
_setmode(_fileno(stdout), _O_U8TEXT);
and printing UTF-8 text using std::cout.
In my experience, you can make Unicode work in Windows Command Prompt using Unicode UTF-16. You can still use UTF-8 in your C++ application, but you have to convert between UTF-8 and UTF-16 at the Windows boundaries.
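If you keep UTF-8 strings inside the application, the conversion at the Windows boundary can be done with the Win32 MultiByteToWideChar/WideCharToMultiByte calls. Below is a minimal sketch of the UTF-8 to UTF-16 direction; the helper name utf8_to_utf16 is my own, and error handling is omitted.

#include <string>
#include <Windows.h>

std::wstring utf8_to_utf16(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call: compute the required length in wchar_t units
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');
    // Second call: perform the actual conversion
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}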
I modified your code to use Unicode UTF-16, and the code seems to work correctly when compiled with Visual Studio 2019, and executed inside the Windows Command Prompt:
// Used for _setmode calls
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

// Console I/O with Unicode UTF-16: wcin, wcout and wstring
#include <iostream>
#include <string>

using std::wcin;
using std::wcout;
using std::wstring;

int main() {
    // Enable Unicode UTF-16 console input/output
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);

    wcout << L"Präsens von BEHALTEN (du) \n";

    wstring Antwort1;
    wcin >> Antwort1;

    if (Antwort1 == L"behältst") {
        wcout << L"Gut! \n";
    } else {
        wcout << L"Nicht Gut \n";
    }
}
Note the use of L"..." to represent UTF-16 string literals, and the use of wchar_t-based std::wcout, std::wcin, and std::wstring instead of the char-based std::cout, std::cin and std::string.
How come istringstream can't seem to fully read numeric literals with suffixes?
#include <iostream>
#include <sstream>

using namespace std;

int main() {
    long long x = 123ULL; // shows 123ULL is a valid long long literal
    istringstream iss("123ULL");
    iss >> x;
    cout << "x is " << x << endl;

    char extra;
    iss >> extra;
    cout << "remaining characters: ";
    while (!iss.eof()) {
        cout << extra;
        iss >> extra;
    }
    cout << endl;
    return 0;
}
The output of this code is
x is 123
remaining characters: ULL
Is this behavior controlled by the locale? Could anyone point me to clear documentation on what strings are accepted by istringstream::operator>>(long long)?
Yes, it's controlled by the locale (via the num_get facet), but no locale I've ever heard of supports C++ language literals, and the locale would be the wrong place to customize this anyway.
Streams are for general-purpose I/O, and C++ integer literal suffixes are very specialized.
The exact behavior of the default num_get facet is described in the C++11 standard in section 22.4.2.1. The description partially references the strto* family of functions from the C standard library. You can find a somewhat condensed version here:
http://en.cppreference.com/w/cpp/locale/num_get/get
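If you need to accept suffixed literals, one workaround (a sketch of my own, not anything num_get offers) is to extract the number first and then read and validate the suffix by hand:

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main() {
    istringstream iss("123ULL");
    long long x = 0;
    iss >> x;          // num_get stops at the first non-numeric character
    string suffix;
    iss >> suffix;     // collect whatever follows the digits
    // accept the usual integer-literal suffixes, reject anything else
    if (suffix.empty() || suffix == "U" || suffix == "L" || suffix == "UL" ||
        suffix == "LL" || suffix == "ULL") {
        cout << "parsed " << x << " with suffix '" << suffix << "'\n";
    } else {
        cout << "unexpected trailing characters: " << suffix << "\n";
    }
    return 0;
}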
I have this code to serialize/deserialize class objects to a file, and it seems to work.
However, I have two questions.
What if, instead of two wstrings (as I have now), I want to have one wstring and one string member variable in my class? (I think in that case my code won't work.)
Finally, below in main, when I initialize s2.product_name_ = L"megatex";, if instead of megatex I write something in Russian (e.g., s2.product_name_ = L"логин"), the code no longer works as intended.
What can be wrong? Thanks.
Here is the code:
// ConsoleApplication3.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <limits>  // std::numeric_limits
#include <string>
#include <fstream> // std::ifstream

using namespace std;

// product
struct Product
{
    double price_;
    double product_index_;
    wstring product_name_;
    wstring other_data_;

    friend std::wostream& operator<<(std::wostream& os, const Product& p)
    {
        return os << p.price_ << endl
                  << p.product_index_ << endl
                  << p.product_name_ << endl
                  << p.other_data_ << endl;
    }

    friend wistream& operator>>(std::wistream& is, Product& p)
    {
        is >> p.price_ >> p.product_index_;
        is.ignore(std::numeric_limits<streamsize>::max(), '\n');
        getline(is, p.product_name_);
        getline(is, p.other_data_);
        return is;
    }
};
int _tmain(int argc, _TCHAR* argv[])
{
    Product s1, s2;
    s1.price_ = 100;
    s1.product_index_ = 0;
    s1.product_name_ = L"flex";
    s1.other_data_ = L"dat001";

    s2.price_ = 300;
    s2.product_index_ = 2;
    s2.product_name_ = L"megatex";
    s2.other_data_ = L"dat003";

    // write
    wofstream binary_file("c:\\test.dat", ios::out | ios::binary | ios::app);
    binary_file << s1 << s2;
    binary_file.close();

    // read
    wifstream binary_file2("c:\\test.dat");
    Product p;
    while (binary_file2 >> p)
    {
        if (2 == p.product_index_) {
            cout << p.price_ << endl;
            cout << p.product_index_ << endl;
            wcout << p.product_name_ << endl;
            wcout << p.other_data_ << endl;
        }
    }
    if (!binary_file2.eof())
        std::cerr << "error during parsing of input file\n";
    else
        std::cerr << "Ok \n";
    return 0;
}
What if, instead of two wstrings (as I have now), I want to have one wstring and one string member variable in my class? (I think in that case my code won't work.)
There is an inserter defined for const char* in any basic_ostream specialization (both ostream and wostream), so you can use the result of a c_str() member function call for the string member. For example, if the string member is other_data_:
return os << p.price_ << endl
          << p.product_index_ << endl
          << p.product_name_ << endl
          << p.other_data_.c_str() << endl;
The extractor case is more complex, since you'll have to read a wstring and then convert it to string. The simplest way is to read a wstring and then narrow each character:
wstring temp;
getline(is, temp);
p.other_data_ = string(temp.begin(), temp.end());
I'm not using locales in this sample, just converting a sequence of bytes (8 bits) to a sequence of words (16 bits) for output, and the opposite (truncating values) for input. That is OK if you are using ASCII chars, or single-byte chars when you don't require a specific format (such as Unicode) for the output.
Otherwise, you will need to deal with locales. A locale gives the cultural context needed to interpret a string (remember that a string is just a sequence of bytes, not of characters in the sense of letters or symbols; the mapping between the bytes and the symbols they represent is defined by the locale). A locale is not a very easy concept to use (human culture isn't either). As you suggest yourself, it would be better to first investigate how it works.
Anyway, the idea is:
Identify the charset used in the string and the charset used in the file (Unicode, e.g. UTF-16).
Convert the strings from the original charset to Unicode using a locale, for output.
Convert the wstrings read from the file (in Unicode) to strings using a locale.
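As a concrete illustration of those conversions, C++11 added std::wstring_convert combined with std::codecvt_utf8 (note that both were later deprecated in C++17). A minimal sketch, assuming the narrow string is meant to hold UTF-8:

#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated in C++17)
#include <iostream>
#include <locale>    // std::wstring_convert
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    // wide -> narrow: produces the UTF-8 byte sequence for "логин"
    std::string utf8 = conv.to_bytes(L"\u043B\u043E\u0433\u0438\u043D");

    // narrow -> wide: parses the UTF-8 bytes back into wchar_t
    std::wstring wide = conv.from_bytes(utf8);

    std::cout << utf8.size() << " UTF-8 bytes, "
              << wide.size() << " wide characters\n"; // 10 bytes, 5 characters
    return 0;
}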
Finally, below in main, when I initialize s2.product_name_ = L"megatex";, if instead of megatex I write something in Russian (e.g., s2.product_name_ = L"логин"), the code no longer works as intended.
When you define an array of wchar_t using L"", you're not really specifying that the string is Unicode, just that the array is of wchar_t instead of char. I suppose the intent is for s2.product_name_ to store the name in Unicode format, but the compiler will take every char in that string (as it would without the L) and convert it to wchar_t by just padding the most significant byte with zeros. Unicode was not well supported by the C++ standard until C++11 (and even then support is limited). It works for ASCII characters only because they have the same encoding in Unicode (UTF-8).
To use Unicode characters in a static string, you can use escape sequences: \uXXXX. Doing that for every non-English character is not very comfortable, I know. You can find a list of Unicode characters on multiple sites on the web, for example on Wikipedia: http://en.wikipedia.org/wiki/List_of_Unicode_characters.
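For example, the Russian string from the question can be written entirely with such escapes:

// equivalent to L"логин", but independent of the source file's encoding
s2.product_name_ = L"\u043B\u043E\u0433\u0438\u043D";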
The following code:
#include <iostream>

using std::wcin;
using std::wcout;
using std::locale;

int main()
{
    locale::global(locale("Portuguese_Brazil"));
    wcout << "wcin Test using \"ção\": "; // shows that wcout works properly
    wchar_t wcinTest[] = L"";
    wcin >> wcinTest;
    wcout << wcinTest << " should be \"ção\".";
    return 0;
}
Results in:
wcin Test using "ção": ção
╬Æo should be "ção".
The ╬ character is U+2021 or 8225, and the ç is U+00E7 or 231.
I changed the multi-byte option, and set and unset UNICODE in the project properties. Nothing worked.
I already set the console font into Consolas, a true type font capable of displaying the ç character correctly.
I'd like this to be as simple and reproducible as possible, to use as a standard practice for future Unicode console applications.
Any ideas?
wcinTest is a wchar_t buffer of length 1; you overflow it when you read into it. Use a std::wstring instead.
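Applied to the question's code, a minimal fix might look like this (keeping the Windows-specific locale name from the question; whether the accents then display correctly still depends on the console configuration):

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale::global(std::locale("Portuguese_Brazil"));
    std::wcout << L"wcin Test using \"ção\": ";
    std::wstring wcinTest;       // grows as needed, no buffer to overflow
    std::wcin >> wcinTest;
    std::wcout << wcinTest << L" should be \"ção\".";
    return 0;
}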
This finally worked:
#include <iostream>
#include <string>
#include <Windows.h>

using std::cin;
using std::cout;
using std::string;

int main()
{
    SetConsoleOutputCP(1252);
    SetConsoleCP(1252);
    cout << "wcin Test using \"ção\": "; // cout works once the output code page is set
    string wcinTest;
    cin >> wcinTest;
    cout << wcinTest << " should be \"ção\".";
    return 0;
}
I'm too much of a newbie to understand why I need both SetConsoleOutputCP and SetConsoleCP. I thought maybe just SetConsoleCP would fix everything, but no, I need both: SetConsoleOutputCP fixed cout, and SetConsoleCP fixed cin.
Thanks anyway, @StoryTeller.
Possible Duplicate:
I can't see the russian alpabet in Visual Studio 2008
I'm trying to input a symbol in the Russian alphabet from the console. This is the code:
#include <iostream>
#include <windows.h>
#include <locale.h>

using namespace std;

int main() {
    char c;
    setlocale(LC_ALL, "rus");
    cout << "Я хочу видеть это по-русски!" << endl;
    cin >> c;
    cout << c;
}
I entered 'ф', but it prints 'д'. I tried to use
char buf[2];
char str[2];
str[0] = c;
str[1] = '\0';
OemToAnsi(buf, str);
But I have
+ str 0x0015fef4 "¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦ф¦¦¦¦d §" char [2]
+ buf 0x0015ff00 "¦¦¦ф¦¦¦¦d §" char [2]
And then I have an error Run-Time Check Failure #2 - Stack around the variable 'str' was corrupted.
I assume the setup you're using is to have the source saved in cp1251 (Cyrillic Windows) and to have the console using cp866 (Cyrillic DOS). (This will be the default setup on Russian versions of Windows.) The problem you're running into seems to be that setting the locale as you do causes output to be converted from cp1251 to cp866, but does not cause the inverse conversion for input. So when you read a character in, the program gets the cp866 representation. This cp866 representation, when output, is incorrectly treated as a cp1251 representation and converted to cp866, resulting in the ф-to-д transformation.
I think the conversion is just done by the CRT based on the C locale, but I don't know how to enable a similar conversion for input. There are different options for getting your program to work:
Manually convert input data from cp866 to cp1251 before echoing it.
Replace setlocale(LC_ALL,"rus"), which changes how the CRT deals with output, with calls to SetConsoleCP(1251); SetConsoleOutputCP(1251);, which instead change the console's behavior (and the changes will persist for the lifetime of the console rather than the lifetime of your program).
Replace uses of cin and cout with Windows APIs using UTF-16. Microsoft's implementation of the standard library forces the use of legacy encodings and causes all sorts of similar problems on Windows, so just avoid it altogether. (A sketch of this option follows the example below.)
Here's an example of the second option:
#include <iostream>
#include <clocale>
#include <Windows.h>

int main() {
    char c;
    SetConsoleCP(1251);
    SetConsoleOutputCP(1251);
    std::cout << "Я хочу видеть это по-русски!\n";
    std::cin >> c;
    std::cout << c;
}
Assuming the source is cp1251-encoded, the output will appear correctly and an input ф will not be transformed into д.
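For completeness, a minimal sketch of the third option, bypassing iostreams entirely and using the console's native UTF-16 APIs (error handling omitted; the source must be saved in a Unicode encoding so the wide literal is correct):

#include <Windows.h>

int main() {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    HANDLE in  = GetStdHandle(STD_INPUT_HANDLE);

    const wchar_t prompt[] = L"Я хочу видеть это по-русски!\r\n";
    DWORD count = 0;
    // WriteConsoleW takes a length in wchar_t units, not bytes
    WriteConsoleW(out, prompt, sizeof(prompt) / sizeof(prompt[0]) - 1, &count, nullptr);

    wchar_t c = L'\0';
    ReadConsoleW(in, &c, 1, &count, nullptr); // reads one UTF-16 code unit
    WriteConsoleW(out, &c, 1, &count, nullptr);
    return 0;
}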
The locale might be wrong. Try
setlocale(LC_ALL, "");
This sets the locale to "the default, which is the user-default ANSI code page obtained from the operating system".
#include <string>

const int N = 34;
// DosABC is meant to hold the Cyrillic alphabet in DOS (cp866) encoding and
// WinABC the same letters in Windows (cp1251) encoding; the literals appear
// garbled here because of encoding round-trips
const char DosABC[N] = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
const char WinABC[N] = " ЎўЈ¤Ґс¦§Ё©Є«¬®Їабвгдежзийклмноп";

std::string ToDosStr(std::string input)
{
    std::string output;
    for (unsigned i = 0; i < input.length(); i++)
    {
        bool Ok = false;
        for (int j = 0; j < N; j++)
            if (input[i] == WinABC[j])
            {
                output += DosABC[j];
                Ok = true;
            }
        if (!Ok)
            output += input[i];
    }
    return output;
}
I did it and it works, but everyone is welcome to find an easier answer.
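For instance, a usage sketch (it assumes ToDosStr from above is in scope; the cp1251 byte values are spelled out explicitly so the example does not depend on the source file's encoding):

#include <iostream>
#include <string>

int main() {
    // "привет" as cp1251 bytes
    std::string win1251 = "\xEF\xF0\xE8\xE2\xE5\xF2";
    // convert to cp866 so the default Russian console displays it correctly
    std::cout << ToDosStr(win1251) << std::endl;
    return 0;
}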