If I use fputws without setlocale, only the ASCII letters get printed. It seems that setlocale is necessary, and according to this site, both setlocale(LC_CTYPE, "UTF-8") and setlocale(LC_CTYPE, "en_us.UTF-8") should work.
But when I tried it, setlocale(LC_CTYPE, "UTF-8") did not work, and only "Hello " was printed. Only with setlocale(LC_CTYPE, "en_us.UTF-8") do I get "Hello ねこ 2". I read the comments on that page, and a commenter also said "UTF-8" did not work (he used GCC, I used VC++). I read some manual pages, but they did not specifically say what the format should be. On cppreference the example uses only formats like "en_US.UTF-8", but on Tutorialspoint the example uses formats like "en_GB".
Is setlocale(LC_CTYPE, "UTF-8") a correct call?
setlocale(LC_CTYPE, "UTF-8");           // prints only "Hello "
//setlocale(LC_CTYPE, "en_us.UTF-8");   // prints "Hello ねこ 2"
FILE* output = _wfopen(L"output.txt", L"w");
fputws(L"Hello ねこ 2", output);
fclose(output);
The answer is system-specific. To quote the cppreference page you linked to:
system-specific locale identifier. Can be "" for the user-preferred locale or "C" for the minimal locale
I think "UTF-8" is an invalid locale name generally (it needs the language identifier at minimum, not 100% on that though). However generally you will want to simply use "" in order to default to the user preference (which, I would hope, is correct for displaying their input).
Related
I'm trying to output a string containing Unicode characters, which is received via a curl call. Therefore, I'm looking for something similar to the u8 and L prefixes for string literals, but applicable to variables. E.g.:
const char *s = u8"\u0444";
However, my string already contains Unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, I get:
mit freundlichen Gren
What am I doing wrong, and how can I achieve the correct output? I return the output with RapidJSON, which gives me the string as:
mit freundlichen Gr��en
Important to note: the application is a CGI running on Ubuntu, replying to browser requests.
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>

int main()
{
    // Put stdout into UTF-16 mode so wcout can write wide characters.
    _setmode(_fileno(stdout), _O_U16TEXT);

    // ü : U+00FC
    // ß : U+00DF
    const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
    std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest using the UTF-8 encoding (for example, storing text in std::strings) in the cross-platform portions of your C++ code, and converting to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
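As a rough sketch (my own wrapper, not the article's exact code; the name Utf8ToUtf16 is made up), the UTF-8 to UTF-16 direction can look like this:

#include <stdexcept>
#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to a UTF-16 std::wstring via Win32.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    // First call: query the required length in wchar_t units.
    const int len = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.size()), nullptr, 0);
    if (len == 0) throw std::runtime_error("invalid UTF-8 sequence");

    std::wstring utf16(len, L'\0');

    // Second call: do the actual conversion into the buffer.
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
        utf8.data(), static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}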
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string s = "mit freundlichen Grüßen";
    cout << s << endl;
    return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which Unicode encoding the string contains. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note: the application is a CGI running on Ubuntu, replying to browser requests.
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately, or else the client cannot know how to interpret the response. For example: Content-Type: application/json; charset=utf-8 (see the sketch after this list).
You still need to make sure that the source string is in fact in the encoding promised by the header. See the old answer below for an overview.
The browser has to support the encoding. Most modern browsers have supported UTF-8 for a long time now.
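For instance, a bare-bones CGI response might be laid out like this (a sketch; the JSON payload is illustrative):

#include <iostream>

int main()
{
    // A CGI response is headers, then a blank line, then the body.
    std::cout << "Content-Type: application/json; charset=utf-8\r\n\r\n";

    // The body passes through as UTF-8 bytes; here the non-ASCII
    // characters are JSON-escaped, so this line is plain ASCII.
    std::cout << "{\"closing\":\"mit freundlichen Gr\\u00fc\\u00dfen\"}";
    return 0;
}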
Answer regarding printing to a terminal:
Assuming that
1. UnicodeString indeed contains a UTF-8 encoded string,
2. the terminal uses UTF-8 encoding, and
3. the font that the terminal uses has the graphemes that you use,
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions doesn't hold.
You can verify whether 1. is true by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8, as in the sketch below. If 1. isn't true, then you need to figure out what encoding the string actually uses, and either convert the encoding or configure the terminal to use that encoding.
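One way to do that inspection is a small hex dump of the code units (a sketch; dump_code_units is a made-up helper):

#include <cstdio>
#include <string>

// Print each code unit in hex. In valid UTF-8 you would expect,
// e.g., "ü" to show up as C3 BC and "ß" as C3 9F.
void dump_code_units(const std::string& s)
{
    for (unsigned char c : s)
        std::printf("%02X ", c);
    std::printf("\n");
}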
The terminal typically, but not necessarily, uses the system's native encoding. The first step in figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UTF-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide support for arbitrary character encoding conversions (there is some support for converting between narrow and wide Unicode, but even that support is deprecated). You can find the Unicode standard here, although I would point out that using an existing conversion implementation can save a lot of work.
If the character encoding of the terminal doesn't have the needed graphemes - or if you don't want to implement encoding conversion - the remaining option is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test whether the font itself has the required graphemes simply by typing the characters into the terminal and seeing if they show as they should - although this test will also fail if the terminal encoding doesn't have the graphemes, so check that first. The manual of your terminal should explain how to change the font, should that be necessary. That said, I would expect ü and ß to exist in most fonts.
I am having trouble printing out Korean text.
I have tried various methods, to no avail.
I have tried
1.
cout << "한글" << endl;
2.
wcout << "한글" << endl;
3.
wprintf(L"한글\n");
4.
setlocale(LC_ALL, "korean");
wprintf("한글");
and more. But all of those print "한글".
I am using the MinGW compiler, and my OS is Windows 7.
P.S. Strangely, Java prints Korean fine:
String kor = "한글";
System.out.println(kor);
works.
Set the console code page to UTF-8 before printing the text:
::SetConsoleOutputCP(65001);
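A minimal sketch of that approach (it assumes the string literal really is stored as UTF-8 in the executable; see the answer below about compiler switches):

#include <windows.h>
#include <cstdio>

int main()
{
    // 65001 is CP_UTF8: the console will decode our output bytes as UTF-8.
    ::SetConsoleOutputCP(CP_UTF8);
    std::printf("한글\n");
    return 0;
}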
Since you are using Windows 7, you can use WriteConsoleW, which is part of the Windows API. #include <windows.h> and try the following code:
DWORD numCharsToWrite = str.length();
LPDWORD numCharsWritten = NULL;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), numCharsToWrite, numCharsWritten, NULL);
where str is a std::wstring.
More on WriteConsoleW: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687401%28v=vs.85%29.aspx
After having tried other methods this worked for me.
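For reference, here is the snippet above assembled into a complete, compilable sketch:

#include <windows.h>
#include <string>

int main()
{
    std::wstring str = L"한글";
    DWORD numCharsWritten = 0;

    // Write the UTF-16 string straight to the console, bypassing the
    // C runtime's code-page conversion entirely.
    ::WriteConsoleW(::GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(),
        static_cast<DWORD>(str.length()), &numCharsWritten, NULL);
    return 0;
}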
The problem is that there are a lot of places where this could be broken.
Here is an answer I've posted some time ago (it covers Korean). The answer is for MSVC, but the same applies to MinGW (the compiler switches are different, and the locale name may be different).
Here are 5 traps which make this hard:
Source code encoding. The source has to use an encoding which supports all required characters. Nowadays UTF-8 is recommended. It is best to make sure your editor (IDE) is properly configured to enforce the source encoding.
You have to inform the compiler what the encoding of the source file is. For gcc it is -finput-charset=utf-8 (this is the default).
Encoding used by the executable. You have to define what encoding string literals should be encoded with in the final executable. This encoding should also cover the required characters; here UTF-8 is also the best choice. The gcc option is -fexec-charset=utf-8.
When you run the application, you have to inform the standard library what encoding your string literals are defined in, i.e. what encoding is used by the program logic. So somewhere in your code, at the beginning of execution, you need something like this (here UTF-8 is enforced):
std::locale::global(std::locale{".utf-8"});
and finally you have to instruct the streams what encoding they should use. So for std::cout and std::cin you should set the locale which is the default for the system:
auto streamLocale = std::locale{""};
// this impacts date/time/floating-point formats, so you may want to tweak it
// to use a specific encoding only and keep the C locale for formatting
std::cout.imbue(streamLocale);
std::cin.imbue(streamLocale);
After this, everything should work as desired without any code that explicitly does conversions.
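Putting the runtime steps together, a minimal sketch looks like this (locale names vary by runtime, as noted above; ".utf-8" works on recent MSVC runtimes but may throw elsewhere):

#include <iostream>
#include <locale>

int main()
{
    // Internal strings/literals are UTF-8 (matches -fexec-charset=utf-8).
    std::locale::global(std::locale{".utf-8"});

    // Streams talk to the terminal in the system-default encoding.
    std::locale streamLocale{""};
    std::cout.imbue(streamLocale);
    std::cin.imbue(streamLocale);

    std::cout << "한글\n";  // no explicit conversion code needed
    return 0;
}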
Since there are 5 places to make a mistake, this is why people have trouble with it and why the internet is full of "hack" solutions.
Note that if the system is not configured to support all needed characters (for example, the wrong code page is set), then with this configuration the characters which cannot be converted will be replaced with question marks.
Note: I'm asking about implementation-defined behavior, specifically that of Microsoft Visual C++ 2008 (possibly the same on 2005+). OS: a simplified Chinese installation of Windows 7.
It surprises me when I'm performing non-ASCII I/O with printf. E.g.:
// This won't be necessary as it's the system default code page.
//system("chcp 936");
// NULL to show current locale, which is "C"
printf ("%s\n", setlocale(LC_ALL, NULL));
printf ("中\n");
printf ("%s\n", setlocale(LC_ALL, "English"));
printf ("中\n");
Output:
Active code page: 936
C
中
English_United States.1252
?D
The memory footprint in the debugger shows that "中" is encoded in two bytes, 0xD6 0xD0, which is the code point of that character in code page 936 (simplified Chinese). It shouldn't be in the code point range of the "C" locale which, most likely, is 0x0 ~ 0x7F.
Question:
Why can it still display the character correctly in the "C" locale? I guessed that the locale had no bearing on printf, but then why can't it display the character anymore after changing to the "English" locale, which is also different from 936? Interesting, isn't it?
Edit:
I redirected the standard output to a file and ran some tests. It shows that whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() is connected to the way the console displays the character, which contradicts my understanding of how it works: printf puts the bytes/code points into the input buffer of the console, which interprets those bytes using its own code page (what chcp returns).
936 is a rather tricky code page: it allows two-byte characters (similar to what UTF-8 does). For example, Cyrillic (866) doesn't allow two-byte characters, and its behavior will be the same as "English".
So when you use the default (936) code page, it knows how to process two-byte characters, while "English" deals with 0x0 ~ 0x7F only.
Let me also answer why wprintf(L"中") fails. There is a big difference between console applications and windowed Windows applications: they use different code pages.
The following are the matches between console and Windows code pages:
DOS | Windows
------+----------
850 | 1252
936 | 54936
866 | 1251
So if you would like to see the correct symbols in the console, use WideCharToMultiByte first - that provides the conversion needed for the console to work with 936.
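A sketch of that conversion (PrintWideToConsole is my own name for it):

#include <windows.h>
#include <cstdio>
#include <string>

// Convert a wide (UTF-16) string to the console's current output code
// page (e.g. 936) and print it with the narrow printf.
void PrintWideToConsole(const wchar_t* wide)
{
    const UINT cp = ::GetConsoleOutputCP();
    const int len = ::WideCharToMultiByte(cp, 0, wide, -1,
        nullptr, 0, nullptr, nullptr);
    std::string narrow(len, '\0');
    ::WideCharToMultiByte(cp, 0, wide, -1, &narrow[0], len, nullptr, nullptr);
    std::printf("%s", narrow.c_str());
}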
The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.
According to the locale documentation on MSDN, the only effect that the locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).
I suspect it may be a bug in Microsoft's compiler, or at the very least undocumented behaviour.
For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix character, though.
OK. For the default "C" locale, the CRT assumes that characters passed to printf don't need any conversion. There is a reason for this: ASCII characters almost always fall into the basic character set of the execution environment (shared among the different Windows code pages). When switched to "English", the CRT assumes the input is encoded in code page 1252, and thus tries to perform a conversion from "English" to "Chinese", the locale used by the console. But the CRT simply cannot find the character 中 in code page 1252. That's why it outputs a question mark.
When the output is redirected to a file, the CRT knows this and won't do the conversion, because the console code page is no longer in use. It just passes the bytes through as-is. How those bytes are interpreted is up to the program you open the file with (e.g., whether it honors a BOM).
Refer to this MSDN forum link: Why printf can display non-ASCII characters when “C” locale is used?
Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information to the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); // file containing Japanese characters, saved as a Unicode file
    //myfile.imbue(locale("Japanese_Japan"));

    if (!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;

    //wcout.imbue(locale("Japanese_Japan"));
    while (myfile.good())
    {
        getline(myfile, line);
        wcout << line << endl;
    }
    myfile.close();

    system("PAUSE");
    return 0;
}
This program generates some random output, and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. This will also make backslashes look like yen signs and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...).
So the first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise for the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little- or big-endian)? Shift-JIS? EUC-JP? You can only use a wstream to read the file directly if it is in little-endian UTF-16, and even then you need to futz with its internal buffer. With anything other than UTF-16 you'll get unreadable junk. And all of this is only the case on Windows; other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows; other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value; CP_ACP and CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - e.g., if you see a byte order mark, chances are it's whichever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess and rely on the user to correct you if you're wrong, or you have to select a fixed character set and not attempt to support any others.
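For the BOM hint specifically, a small sketch (GuessEncodingFromBom is a made-up helper; remember that the absence of a BOM proves nothing):

#include <fstream>
#include <string>

// Check the first bytes of a file for a Unicode byte order mark.
std::string GuessEncodingFromBom(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0};
    in.read(reinterpret_cast<char*>(b), 3);

    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 (BOM)";
    if (b[0] == 0xFF && b[1] == 0xFE)                 return "UTF-16 LE";
    if (b[0] == 0xFE && b[1] == 0xFF)                 return "UTF-16 BE";
    return "unknown (no BOM)";
}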
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly, but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, little-endian. If it is not, you will have trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally, which means that the text in the file is assumed to be narrow and is converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];

wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx
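An alternative sketch uses std::codecvt_utf16 (added in C++11, deprecated since C++17 but still shipped) so the wide stream decodes UTF-16LE itself; with a 16-bit wchar_t this effectively covers BMP characters only:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream myfile("D:\\sample.txt", std::ios::binary);

    // Decode UTF-16 little-endian and skip a leading BOM if present.
    myfile.imbue(std::locale(myfile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::wstring line;
    while (std::getline(myfile, line))
        std::wcout << line << L'\n';
    return 0;
}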
I'm working on a C++ string library that has 4 main classes dealing with ASCII, UTF-8, UTF-16, and UTF-32 strings. Every class has a Print function that formats an input string and prints the result to stdout or stderr. My problem is that I don't know the default character encoding for those streams.
For now my classes work on Windows; later I'll add support for Mac and Linux, so if you know anything about those streams' encodings I'd appreciate it.
So my question is: what is the default encoding for stdout and stderr? Can I change that encoding later, and if so, what happens to the data already stored there?
Thank you.
stdout and stderr use the "C" locale. The "C" locale is neutral, and on most systems it is translated into the current user's locale. You can force the program to use a specific locale using the setlocale function:
// Set all categories and return "English_USA.1252"
setlocale( LC_ALL, "English" );
// Set only the LC_MONETARY category and return "French_France.1252"
setlocale( LC_MONETARY, "French" );
setlocale( LC_ALL, NULL );
The locale strings supported are system and compiler specific. Only "C" and "" are required to be supported.
http://www.cplusplus.com/reference/clibrary/clocale/
You might take a look at this SO answer (most upvoted answer).
This is not exactly your question, but it is surely related and gives a lot of useful information.
I'm no expert here, but I guess we can assume you should use std::cout whenever you use std::string and std::wcout whenever you use std::wstring.