How can C++ wcout print a UTF-16 encoded char array? - c++

I was reading the well-known answer about string and wstring and ran into some confusion.
The source charset and execution charset are both set to UTF-8. Environment: Windows x64, VC++ compiler, Git Bash console (which can print Unicode characters), system default code page 936 (GB2312).
My experiment code:
#include <cstring>
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
wchar_t c[] = L"olé";
wchar_t d[] = L"abc";
wcout << c << endl;
wcout << d << endl;
return 0;
}
It prints "abc" but cannot print "é".
I understand that wchar_t is used together with L-prefixed string literals, and that under Windows wchar_t holds UTF-16. (That is hard-coded, right? No matter which source charset or execution charset I choose, L"abc" always has the same UTF-16 code units.)
The question is: how can it wcout a UTF-16 encoded string ("abc") while my source file is UTF-8 and the execution charset is UTF-8? I would expect the program not to recognize UTF-16 encoded data unless I set everything to UTF-16.
And if it can print UTF-16 in some way, why can't it print é?

You need a non-standard, Microsoft-specific runtime call (_setmode) to enable UTF-16 output.
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <stdio.h>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT); // <=== Windows madness
std::wcout << L"olé\n";
}
Note that you cannot use cout after doing this, only wcout.
Also note that your source file must have a BOM (or be compiled with an explicit source charset option such as /utf-8); otherwise the compiler will not recognise it as Unicode.

The Windows Console does not support UTF-16 output. It only supports 8-bit output, and it has partial support for 8-bit MBCS, such as Big5 or UTF-8.
To display Unicode characters on the console you will need to do conversion to UTF-8 or another MBCS in your code, and also put the console into UTF-8 mode (which requires an undocumented system call).
See also this answer

Related

Unicode output not showing

I'm trying to learn Unicode programming in Windows.
I have this simple program:
#include <iostream>
#include <string>
int main()
{
std::wstring greekWord = L"Ελληνικά";
std::wcout << greekWord << std::endl;
return 0;
}
However, it outputs nothing. Any ideas how to make it output Greek?
I tried adding non-Greek letters, and that didn't work quite right either.
The first thing to try is to make the program independent of the encoding of the source file. So use Unicode escapes instead of literal Unicode letters:
std::wstring greekWord = L"\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC";
An incorrect source-file encoding is only one of many things that could prevent you from printing Greek. The other obvious issue is whether your terminal can print Greek letters. If it cannot, or if it needs to be configured first, nothing you do in your program will work.
You probably also want to fix the source encoding issue so that you can use unescaped literals in your code, but that depends on the compiler/IDE you are using.
If you are sending output to a normal console, the console usually doesn't support Unicode text such as Greek by default. Try setting it up for Unicode text, or find another way to output your data, such as text files or a GUI.
There are two ways to do this.
The old, non-standard Microsoft way is as follows:
#include <fcntl.h>
#include <io.h>
#include <stdio.h> // for _fileno, stdout and stdin
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
_setmode(_fileno(stdin), _O_WTEXT);
// your code here
}
You will find this everywhere, but it is not necessarily a good way to solve the problem.
The more standards-compliant way is as follows:
#include <locale>
int main()
{
std::locale l(""); // or std::locale l("en_US.utf-8");
std::locale::global(l); // or std::wcout.imbue(l); std::wcin.imbue(l);
// your code here
}
This should work with other modern compilers and operating systems too.
Try this; it works for me:
#include <iostream>
#include <clocale>
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main() {
_setmode(_fileno(stdout),_O_U16TEXT);
wcout<<L"Ελληνικά";
setlocale(LC_ALL,"");
return 0;
}

How to read a single char with std::wifstream? [duplicate]

You wouldn't imagine that something as basic as opening a file using the C++ standard library for a Windows application would be tricky, but it appears to be. By Unicode here I mean UTF-8, but I can convert to UTF-16 or whatever; the point is getting an ofstream instance from a Unicode filename. Before I hack up my own solution, is there a preferred route here? Especially a cross-platform one?
The C++ standard library is not Unicode-aware. char and wchar_t are not required to be Unicode encodings.
On Windows, wchar_t is UTF-16, but there's no direct support for UTF-8 filenames in the standard library (the char datatype is not Unicode on Windows)
With MSVC (and thus the Microsoft STL), a constructor for filestreams is provided which takes a const wchar_t* filename, allowing you to create the stream as:
wchar_t const name[] = L"filename.txt";
std::fstream file(name);
However, this overload is not specified by the C++11 standard (it only guarantees the presence of the char based version). It is also not present on alternative STL implementations like GCC's libstdc++ for MinGW(-w64), as of version g++ 4.8.x.
Note that just like char on Windows is not UTF8, on other OS'es wchar_t may not be UTF16. So overall, this isn't likely to be portable. Opening a stream given a wchar_t filename isn't defined according to the standard, and specifying the filename in chars may be difficult because the encoding used by char varies between OS'es.
Since C++17, there is a cross-platform way to open an std::fstream with a Unicode filename using the std::filesystem::path overload. Example:
std::ofstream out(std::filesystem::path(u8"こんにちは"));
out << "hello";
In current versions of Visual C++, std::basic_fstream has an open() method that takes a const wchar_t*, according to http://msdn.microsoft.com/en-us/library/4dx08bh4.aspx.
Use std::wofstream, std::wifstream and std::wfstream. They accept a Unicode filename. The file name has to be a wstring, an array of wchar_t, or a literal that is wrapped in the _T() macro or prefixed with L.
Have a look at Boost.Nowide:
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/cout.hpp>
using boost::nowide::ifstream;
using boost::nowide::cout;
// #include <fstream>
// #include <iostream>
// using std::ifstream;
// using std::cout;
#include <string>
int main() {
ifstream f("UTF-8 (e.g. ß).txt");
std::string line;
std::getline(f, line);
cout << "UTF-8 content: " << line;
}
Use wfstream instead of fstream, wofstream instead of ofstream, and so on...
You can find this information in the iosfwd header file.
If you're using Qt mixed with std::ifstream:
return std::wstring(reinterpret_cast<const wchar_t*>(qString.utf16()));
Note that the std::basic_ifstream constructor normally doesn't accept a const wchar_t*, but in the MS implementation of the STL it does. With other implementations you would probably call qString.utf8() and use the const char* constructor.

Is it possible to unify std::wstring behavior in MSVC and GCC?

Here is a little code that reads a line from a UTF-8 file:
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <locale>
#include <fstream>
#include <codecvt>
int main()
{
_setmode(_fileno(stdout), _O_U8TEXT);
auto inputFileStream = std::wifstream("input.txt");
const auto utf8Locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
inputFileStream.imbue(utf8Locale);
std::wstring line;
std::getline(inputFileStream, line);
std::wcout << line << std::endl;
inputFileStream.close();
return 0;
}
When I build it with the Visual Studio Visual C++ compiler, I get the following result:
test τεστ тест
as expected.
But when I use MinGW with the GCC compiler, I get
琀攀猀琀 쐃딃쌃쐃 䈄㔄䄄䈄
As you can see, this is not the expected result.
Is there a simple way to fix the output for GCC so that it matches the expected string?
OR
Is there a simple way to use UTF-8 with both MSVC and GCC?
Answer (thanks to Igor Tandetnik and Remy Lebeau):
It seems we must specify the endian mode explicitly, because MSVC and GCC have different defaults. So
new std::codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>()
should be used.
Fixed code:
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <locale>
#include <fstream>
#include <codecvt>
int main()
{
_setmode(_fileno(stdout), _O_U8TEXT);
auto inputFileStream = std::wifstream("input.txt");
const auto utf8Locale = std::locale(std::locale(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::little_endian>());
inputFileStream.imbue(utf8Locale);
std::wstring line;
std::getline(inputFileStream, line);
std::wcout << line << std::endl;
inputFileStream.close();
return 0;
}
For your second question, one option is to limit the use of UTF-16 and std::w-prefixed facilities to the cases where you need to exchange UTF-16 encoded strings with the operating system. This happens when you receive arguments in wmain, open a file with _wfopen, call a Windows API function, etc. Otherwise, you store, read from the user and return to the user UTF-8 strings using the char type (char*, std::string, etc.). Conversion between UTF-8 and UTF-16 can be done with MultiByteToWideChar and WideCharToMultiByte, bypassing the awkward C++ encoding API.
The place where this does not work well is console input/output. Overall, you can output UTF-8 to the console if the user sets chcp 65001 and a TrueType font. At least on Windows 7, you also have to make sure not to split a character between two write calls, otherwise it will not print correctly (this also implies you cannot use std::cout, because msvcrt calls putc for every byte separately; you need to use puts, fprintf, etc. instead). I heard that this was fixed in Windows 10, but cannot confirm.
Reading UTF-8 from the console with the file API does not work as far as I know; if you want that, you need to detect that stdin is attached to a console and use the console API instead.

How to store unicode character in wstring on linux?

#include <iostream>
using namespace std;
int main() {
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
Why doesn't this work, and how can I solve it?
It doesn't work because in the default C locale, there is no character which corresponds to U+00A2.
If you're using a standard ubuntu install, it is most likely that your user locale uses a larger character set than US-ASCII, quite possibly Unicode encoded with UTF-8. So you just need to switch to the locale specified in the environment, as follows:
#include <iostream>
/* clocale is needed for std::setlocale */
#include <clocale>
#include <string>
int main() {
/* The following switches to the locale specified
* by the environment (LC_ALL, the LC_* variables or LANG).
*/
std::setlocale (LC_ALL, "");
std::wstring str = L"\u00A2";
std::wcout << str;
return 0;
}
If you use std::string instead of std::wstring and std::cout instead of std::wcout, then you don't need the setlocale because no translation is needed (provided the console expects UTF-8).

How to write UTF-8 file with fprintf in C++

I am programming (just occasionally) in C++ with Visual Studio and MFC. I write a file with fopen and fprintf. The file should be encoded in UTF-8. Is there a way to do this? Whatever I try, the file is either double-byte Unicode or ISO-8859-2 (Latin-2) encoded.
Glanebridge
You shouldn't need to set your locale or set any special modes on the file if you just want to use fprintf. You simply have to use UTF-8 encoded strings.
#include <cstdio>
#include <codecvt>
#include <locale> // std::wstring_convert
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
std::string utf8_string = convert.to_bytes(L"кошка 日本国");
if(FILE *f = fopen("tmp","w")) {
fprintf(f,"%s\n",utf8_string.c_str());
fclose(f);
}
}
Save the program as UTF-8 with signature or UTF-16 (i.e. don't use UTF-8 without signature, otherwise VS won't produce the right string literal). The file written by the program will contain the UTF-8 version of that string. Or you can do:
int main() {
if(FILE *f = fopen("tmp","w"))
fprintf(f,"%s\n","кошка 日本国");
}
In this case you must save the file as UTF-8 without signature, because you want the compiler to think the source encoding is the same as the execution encoding... This is a bit of a hack that relies on the compiler's, IMO, broken behavior.
You can do basically the same thing with any of the other APIs for writing narrow characters to a file, but note that none of these methods work for writing UTF-8 to the Windows console. Because the C runtime and/or the console is a bit broken, you can only write UTF-8 directly to the console by calling SetConsoleOutputCP(65001) and then using one of the puts family of functions.
If you want to use wide characters instead of narrow characters then locale based methods and setting modes on file descriptors could come into play.
#include <cstdio>
#include <fcntl.h>
#include <io.h>
int main() {
if(FILE *f = fopen("tmp","w")) {
_setmode(_fileno(f), _O_U8TEXT);
fwprintf(f,L"%s\n",L"кошка 日本国");
}
}
#include <fstream>
#include <codecvt>
int main() {
if(auto f = std::wofstream("tmp")) {
f.imbue(std::locale(std::locale(),
new std::codecvt_utf8_utf16<wchar_t>)); // assumes wchar_t is UTF-16
f << L"кошка 日本国\n";
}
}
Yes, but you need Visual Studio 2005 or later. You can then call fopen with the parameters:
LPCTSTR strText = _T("абв");
FILE *f = fopen(pszFilePath, "w,ccs=UTF-8");
_ftprintf(f, _T("%s"), strText);
fclose(f);
Keep in mind that this is a Microsoft extension; it probably won't work with GCC or other compilers.
In theory, you should simply set a locale which uses UTF-8 as external encoding. My understanding -- I'm not a Windows programmer -- is that Windows has no such locale, so you have to resort to implementation specific means or non standard libraries (link from Dave's comment).