Which characters are transformed by toupper() in the default C++ locale?

With the default "C" locale only a-z get transformed by std::toupper() as is documented for example here. Which characters exactly get transformed by std::ctype<CharT>::toupper() with the default C++ locale?
I'm asking because std::toupper(L'ω', std::locale::classic()) returns L'Ω' on Windows and I'm wondering for which other characters the C++ locale also returns an upper case form. In the "C" locale the same character is not transformed: static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(L'ω'))) returns L'ω' as expected.
I used the following program to verify this:
#include <cwctype>
#include <fstream>
#include <locale>
int main()
{
std::wofstream fs("out.txt");
fs.imbue(std::locale("en_US.UTF8"));
fs << L"std::toupper(L'ω', std::locale::classic()): " << std::toupper(L'ω', std::locale::classic()) << std::endl;
fs << L"static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(L'ω'))): "
<< static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(L'ω'))) << std::endl;
return 0;
}
Content of out.txt when compiled with Visual Studio 2019 (save source file with UTF-8 encoding and add compiler switch /utf-8) and executed on Windows 10:
std::toupper(L'ω', std::locale::classic()): Ω
static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(L'ω'))): ω
Output with gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04):
std::toupper(L'ω', std::locale::classic()): ω
static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(L'ω'))): ω

Related

架 (U+67B6) is not graphical with en_US.UTF-8. What's going on?

This is a follow up question to:
std::isgraph asserts, how to fix?
After setting locale to "en_US.UTF-8", std::isgraph no longer asserts.
However, the Unicode character 架 (U+67B6) is reported as not graphical (the function returns false). What is going on?
It's a Unicode build on the Windows platform.
If you want to test characters that are too large to fit in an unsigned char, you can try using the wide-character versions, or a Unicode library as already suggested (which is really the better option for portable code, as it removes any system- or locale-based differences in behavior).
This program:
#include <clocale>
#include <cwctype>
#include <iostream>
int main() {
wchar_t x = L'\u67B6';
char *loc = std::setlocale(LC_CTYPE, "");
std::wcout << "Using locale " << loc << ".\n";
std::wcout << "Character " << x << " is graphical: " << std::boolalpha
<< static_cast<bool>(std::iswgraph(x)) << '\n';
return 0;
}
when compiled and run on my Ubuntu test system, outputs
Using locale en_US.utf8.
Character 架 is graphical: true
You said you're using Windows, but I don't have a Windows computer available for testing, so I can't confirm if this'll work there or not.
std::isgraph is not a Unicode-aware function.
It's an antiquity from C.
From the documentation:
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
It only takes int because, again, it's an antiquity from C, just like std::tolower.
You should be using something like ICU instead.

How to change Locale in MinGW

It seems like only the "C" locale is working with MinGW. I tried the example found here and no commas were added even though the system locale is set to Canada.
#include <iostream>
#include <locale>
int main()
{
std::wcout << "User-preferred locale setting is " << std::locale("").name().c_str() << '\n';
// on startup, the global locale is the "C" locale
std::wcout << 1000.01 << '\n';
// replace the C++ global locale as well as the C locale with the user-preferred locale
std::locale::global(std::locale(""));
// use the new global locale for future wide character output
std::wcout.imbue(std::locale());
// output the same number again
std::wcout << 1000.01 << '\n';
}
The output is
User-preferred locale setting is C
100000
100000
I tried std::locale("en-CA") and got locale::facet::_S_create_c_locale name not valid at run time. I compile from CMD using g++. I'm running 64-bit Windows 10.
Also I tried compiling this program found in the accepted answer here and got the compiler error 'LOCALE_ALL' was not declared in this scope.
How can I set the locale in MinGW to the system default or something explicit?

wcout does not output as desired

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted, some of entries have non-ASCII characters like 'ş' and 'İ')
I have a function that basically outputs everything in AllClasses; the problem is that wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
Output is 'Class: Bili' and nothing else. The program does not even try to output the other entries and just hangs. I am on Windows using G++ 6.3.0. I am not using Windows' cmd; I am using bash from MinGW, so encoding should not be a problem (or should it?). Any advice?
Edit: Source code encoding is not the problem either; I just checked and it is UTF-8, the default in VS Code.
Edit: I also checked whether the problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
I entered some non-ASCII characters like 'ö' and 'ş', and it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.
The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
std::ios_base::sync_with_stdio(false);
std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
std::wcout.imbue(utf8);
std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output; it only supports ANSI, plus partial UTF-8 support. ANSI is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8 (ref). So you need to configure wcout to convert the output to UTF-8.
Calling imbue with a codecvt_utf8_utf16 facet achieves this; however, you also need to disable sync_with_stdio, otherwise the stream doesn't even use the facet and just defers to stdout, which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream and call conversion functions in your code to convert wstring to UTF-8, if you are using wstring internally; you can of course also use UTF-8 internally.
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
If you have at least Windows 10 1903 (May 2019) and at least Windows Terminal 0.3.2142 (Aug 2019), then set the OEM code page to UTF-8:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
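#include <iostream>
#include <string>
int main() {
// narrow string literals; the source file must be saved as UTF-8
std::string a[] = {
"Bilişim Sistemleri Mühendisliğine Giriş",
"Türk Dili 1",
"İş Sağlığı ve Güvenliği",
"ıiÖöUuÜü",
"☺☻♥♦♣♠•◘○"
};
// iterate by const reference to avoid copying each string
for (const auto& s : a) {
std::cout << s << std::endl;
}
}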

How to universally define the number of characters per line, regardless of the encoding?

I'm interested in how to process multilingual text correctly.
// C ++. How to universally define the number of
// characters per line, regardless of the encoding?
#include <iostream>
#include <string>
using namespace std;
int main()
{
// the test string is 13 characters long,
// but the reported size is 19
string test_string = "string_строка";
cout << "String length " << test_string.size() << " characters.\n";
return 0;
}
I think this is because Latin and Cyrillic characters take up different amounts of memory.
How can this be solved universally? Or at least for Cyrillic?
My system Ubuntu 14.04 (Unity). Compiler GCC 4.9.1 20140922 (Red Hat 4.9.1-10), 64 bit.

How to get current locale of my environment?

I tried the following code on Linux, but it always returns 'C' under different LANG settings.
#include <iostream>
#include <locale.h>
#include <locale>
using namespace std;
int main()
{
cout<<"locale 1: "<<setlocale(LC_ALL, NULL)<<endl;
cout<<"locale 2: "<<setlocale(LC_CTYPE, NULL)<<endl;
locale l;
cout<<"locale 3: "<<l.name()<<endl;
}
$ ./a.out
locale 1: C
locale 2: C
locale 3: C
$
$ export LANG=zh_CN.UTF-8
$ ./a.out
locale 1: C
locale 2: C
locale 3: C
What should I do to get the current locale setting on Linux (e.g. Ubuntu)?
Another question: is the locale obtained the same way on Windows?
From man 3 setlocale (New maxim: "When in doubt, read the entire manpage."):
If locale is "", each part of the locale that should be modified is set according to the environment variables.
So, we can read the environment variables by calling setlocale at the beginning of the program, as follows:
#include <iostream>
#include <locale.h>
using namespace std;
int main()
{
setlocale(LC_ALL, "");
cout << "LC_ALL: " << setlocale(LC_ALL, NULL) << endl;
cout << "LC_CTYPE: " << setlocale(LC_CTYPE, NULL) << endl;
return 0;
}
My system does not support the zh_CN locale, so in the second run setlocale fails and the locale stays at "C", as the following output reveals:
$ ./a.out
LC_ALL: en_US.utf8
LC_CTYPE: en_US.utf8
$ export LANG=zh_CN.UTF-8
$ ./a.out
LC_ALL: C
LC_CTYPE: C
Windows: I have no idea about Windows locales. I suggest starting with an MSDN search, and then opening a separate Stack Overflow question if you still have questions.
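I just figured out how to get the locale in C++: simply construct std::locale from an empty string "", which selects the same user-preferred locale as setlocale(LC_ALL, "") (although, unlike setlocale, constructing the object does not by itself change the global locale).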
locale l("");
cout<<"Locale by C++: "<<l.name()<<endl;
This link describes the differences between the C locale and the C++ locale in detail.
For Windows use the following code:
LCID lcid = GetThreadLocale();
wchar_t name[LOCALE_NAME_MAX_LENGTH];
if (LCIDToLocaleName(lcid, name, LOCALE_NAME_MAX_LENGTH, 0) == 0)
error(GetLastError());
std::wcout << L"Locale name = " << name << std::endl;
This is going to print something like "en-US".
To strip the sublanguage information, use the following:
wchar_t parentLocaleName[LOCALE_NAME_MAX_LENGTH];
if (GetLocaleInfoEx(name, LOCALE_SPARENT, parentLocaleName, LOCALE_NAME_MAX_LENGTH) == 0)
error(GetLastError());
std::wcout << L"parentLocaleName = " << parentLocaleName << std::endl;
This will give you just "en".
A good alternative to consider instead of std::locale is boost::locale which is capable of returning more reliable information - see http://www.boost.org/doc/libs/1_52_0/libs/locale/doc/html/locale_information.html
boost::locale::info has the following member functions:
std::string name() -- the full name of the locale, for example en_US.UTF-8
std::string language() -- the ISO-639 language code of the current locale, for example "en".
std::string country() -- the ISO-3166 country code of the current locale, for example "US".
std::string variant() -- the variant of current locale, for example "euro".
std::string encoding() -- the encoding used for char based strings, for example "UTF-8"
bool utf8() -- a fast way to check whether the encoding is UTF-8.
The default constructor of std::locale creates a copy of the global C++ locale.
So to get the name of the current locale:
std::cout << std::locale().name() << '\n';