I am using wcslen to determine the length of a null-terminated wide string (wchar_t*), but I have some problems with this function in the MSVC compiler.
Code example:
#include <iostream>
#include <cstring>
#include <cwchar>
#include <string>
int main()
{
    auto sc = "The good and bad";
    auto wsc = L"Уставший лесник";
    auto ws = std::wstring(wsc);
    std::cout << "sc len:" << std::strlen(sc) << std::endl;
    std::cout << "wsc len:" << std::wcslen(wsc) << std::endl;
    std::cout << "ws len:" << ws.length() << std::endl;
}
MSVC (amd64 16.8.2 x64) output:
sc len:16
wsc len:29
ws len:29
Clang (10.0.0 (GNU CLI) for MSVC 16.8.30717.126) output:
sc len:16
wsc len:15
ws len:15
Is this a problem with the MSVC compiler, some undefined behavior, or a nuance of the MSVC implementation?
You need to save your file as either UTF-16 or UTF-8 with a BOM. MSVC doesn't seem to be able to handle a UTF-8 file without a BOM (which is understandable, as the character encoding of such a file is a matter of interpretation).
Some editors (I am using Notepad2) call this 'UTF-8 with signature'.
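The numbers in the question are consistent with this: "Уставший лесник" has 15 characters, but its UTF-8 encoding is 29 bytes (14 two-byte Cyrillic letters plus one space), so a compiler that decodes the file with a single-byte code page would see 29 characters. A minimal sketch to check that arithmetic follows. The helper name utf8_code_points and the constant kSample are hypothetical, and the literal is spelled with hex escapes so the result does not depend on the source file encoding:

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// "Уставший лесник" written out as explicit UTF-8 bytes.
const std::string kSample =
    "\xD0\xA3\xD1\x81\xD1\x82\xD0\xB0\xD0\xB2\xD1\x88\xD0\xB8\xD0\xB9"
    "\x20"
    "\xD0\xBB\xD0\xB5\xD1\x81\xD0\xBD\xD0\xB8\xD0\xBA";
```

Here kSample.size() is 29 and utf8_code_points(kSample) is 15, which matches the two different outputs in the question.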
Related
I open a file in read only mode, I read a line, and I cannot understand the meaning of the line contents as shown by the debugger:
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
int main()
{
    std::string fidMapPath = "./file.txt";
    std::ifstream ifsFidMap(fidMapPath);
    if(ifsFidMap.good() == false) {
        std::cerr << "Error" << std::endl;
        exit(1);
    }
    std::string line;
    while(ifsFidMap.eof() == false)
    {
        std::getline(ifsFidMap, line);
        std::cout << "Line: " << line << std::endl;
    }
}
This is the contents of the text file:
; Document title
123;456
123;123
456;456
...
When running, nothing is printed from the line variable. In the debugger, its contents are "" (empty) before getline() and \000\000\000\000... afterwards, repeated up to a length of 2411 characters.
What is the meaning of this behavior?
These are my platform details:
Operating system: Windows 10, building remotely on Linux (Red Hat, kernel 2.6.32-431.5.1.el6.x86_64) through NetBeans 8.2.
Compiler: GCC 4.8.2 (C++11)
Debugger: GDB Red Hat Enterprise Linux 7.6.1-47.el6
P.S.: I tried, as suggested, to move the getline in the while argument:
while(std::getline(ifsFidMap, line))
but I still have the same issue.
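For reference, a minimal sketch of the loop shape from the P.S., which tests the result of std::getline itself rather than eof(). The helper name read_lines is hypothetical, and it takes any std::istream so that it can be tried on an in-memory std::istringstream as well as on a file:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads all lines from a stream. The loop body only runs when getline
// actually extracted a line, so a failed read is never processed.
std::vector<std::string> read_lines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Called on a std::istringstream holding "; Document title\n123;456\n", it returns two lines.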
With the version of GCC I used, 4.8.2, we must explicitly tell the compiler that the code being compiled is C++11 by passing the -std=c++11 option. Otherwise it defaults to C++98.
This applies to GCC versions up to 5.x.
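One way to check which language standard a build is actually using is the predefined __cplusplus macro. A minimal sketch, where the helper name standard_name and the constant kThisBuild are hypothetical and the thresholds are the values defined by the published standards:

```cpp
#include <string>

// Maps a __cplusplus value to a readable name.
// 199711L = C++98/03, 201103L = C++11, 201402L = C++14, 201703L = C++17.
std::string standard_name(long value)
{
    if (value >= 201703L) return "C++17 or later";
    if (value >= 201402L) return "C++14";
    if (value >= 201103L) return "C++11";
    return "C++98/03";
}

// The standard selected for this translation unit.
const std::string kThisBuild = standard_name(__cplusplus);
```

With GCC 4.8.2 and no -std option, printing kThisBuild would be expected to show C++98/03; adding -std=c++11 should change it to C++11.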
I have a simple problem in a Windows console program. I wrote the following single-file (UTF-8 encoded) program in C++:
// utf8 source file encoding
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <iomanip>
int main() {
    // compile and run on windows 10 and change code page to utf
    // execute from cmd.exe
    // WHY first letter 'П' - first time not printed? But after space printed?
    std::system("chcp 65001>nul");
    const char* utf8 = u8"Привет Мир"; // this will skip first letter with MinGW compiler
    const char* utf8s = u8" Привет Мир"; // this will print nice!!! Just add space
    std::cout << "cout\n";
    std::cout << utf8 << std::endl;
    std::cout << utf8s << std::endl;
    std::cout << utf8 << std::endl;
    std::cout << std::flush;
    std::wcout << L"wcout\n";
    std::wcout << L"Привет Мир" << std::endl;
    std::wcout << L" Привет Мир" << std::endl;
    std::wcout << L"Привет Мир" << std::endl;
    std::wcout << std::flush;
    std::printf("printf\n");
    std::printf("%s\n", utf8);
    std::printf("%s\n", utf8s);
    std::printf("wprintf\n");
    std::wprintf(L"Привет Мир\n");
    std::wprintf(L" Привет Мир\n");
    std::wprintf(L"Привет Мир\n");
    std::fflush(stdout);
    std::system("pause");
    return 0;
}
Output in the Windows console when run from CMD:
Output in the terminal when run from bash:
The program's output is correct when run by git-bash.exe in Windows 10, but not when run by CMD in the console with Lucida Console font. It's not printing the first letter of Russian "Hello World" ("Привет Мир"), even with chcp 65001. I tried to compile this code with MinGW 7.1 and with MSVC 2017, but nothing changed. I know that rustc.exe (Rust Lang compiler) produces binary files that work in both the console and git-bash.exe with UTF-8 text printing. How can the same be done in a C++ program in Windows? Thanks in advance.
This is my sample code:
#pragma execution_character_set("utf-8")
#include <boost/locale.hpp>
#include <boost/algorithm/string/case_conv.hpp>
#include <iostream>
int main()
{
    std::locale loc = boost::locale::generator().generate("");
    std::locale::global(loc);
#ifdef MSVC
    std::cout << boost::locale::conv::from_utf("grüßen vs ", "ISO8859-15");
    std::cout << boost::locale::conv::from_utf(boost::locale::to_upper("grüßen"), "ISO8859-15") << std::endl;
    std::cout << boost::locale::conv::from_utf(boost::locale::fold_case("grüßen"), "ISO8859-15") << std::endl;
    std::cout << boost::locale::conv::from_utf(boost::locale::normalize("grüßen", boost::locale::norm_nfd), "ISO8859-15") << std::endl;
#else
    std::cout << "grüßen vs ";
    std::cout << boost::locale::to_upper("grüßen") << std::endl;
    std::cout << boost::locale::fold_case("grüßen") << std::endl;
    std::cout << boost::locale::normalize("grüßen", boost::locale::norm_nfd) << std::endl;
#endif
    return 0;
}
Output on Windows 7 is:
grüßen vs GRÜßEN
grüßen
grußen
Output on Linux (openSuSE 12.3) is:
grüßen vs GRÜSSEN
grüssen
grüßen
On Linux the German letter 'ß' is converted to 'SS' as expected, while this character remains unchanged on Windows.
Question: why is this so? How can I correct the conversion?
Some notes: Windows console codepage is set to 1252. In both cases locales are set to de_DE. I tried to replace the default locale setting in the listing above by "de_DE.UTF-8" - without any effect.
On Windows this code is compiled with Visual Studio 2013, on Linux with GCC 4.7, c++11 enabled.
Any suggestions are appreciated - thanks in advance for your support!
Windows doesn't do this conversion because "it would be too confusing" for developers if the string length changed all of a sudden. Boost presumably just delegates all the Unicode conversions to the underlying Windows APIs.
I guess the robust way to handle it would be to use a third-party Unicode library such as ICU.
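To illustrate why a full case mapping can change the string length, here is a deliberately tiny sketch that uppercases UTF-8 text and expands 'ß' to "SS". The function name upper_latin1_sketch is hypothetical, it only understands ASCII and the Latin-1 Supplement letters, and it is an illustration of the length change rather than a substitute for ICU:

```cpp
#include <cstddef>
#include <string>

// Uppercases ASCII and Latin-1 Supplement letters in a UTF-8 string.
// 'ß' (bytes C3 9F) is expanded to "SS", so the result can contain
// more characters than the input. Everything else is copied as-is.
std::string upper_latin1_sketch(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 'a' && c <= 'z') {
            out += static_cast<char>(c - 'a' + 'A');
        } else if (c == 0xC3 && i + 1 < in.size()) {
            unsigned char d = static_cast<unsigned char>(in[++i]);
            if (d == 0x9F) {
                out += "SS";                        // ß -> SS
            } else if (d >= 0xA0 && d <= 0xBE && d != 0xB7) {
                out += static_cast<char>(0xC3);     // à..þ -> À..Þ
                out += static_cast<char>(d - 0x20); // (÷ is skipped)
            } else {
                out += static_cast<char>(0xC3);     // already upper or
                out += static_cast<char>(d);        // not a letter
            }
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

For the UTF-8 bytes of "grüßen" this returns the bytes of "GRÜSSEN": six characters in, seven out.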
I wrote the following program using VS2008:
#include <fstream>
int main()
{
    std::wofstream fout("myfile");
    fout << L"Հայաստան Россия Österreich Ελλάδα भारत" << std::endl;
}
When I tried to compile it, the IDE asked me whether I wanted to save my source file in Unicode. I said "yes, please".
Then I ran the program, and myfile appeared in my project's folder. I opened it with Notepad; the file was empty. I recalled that Notepad supported only ASCII data, so I opened it with WordPad; it was still empty. Finally the little genius inside me urged me to look at the file size, and not surprisingly it was 0 bytes. So I rebuilt and reran the program, to no effect. Finally I decided to ask very intelligent people on StackOverflow as to what I am missing, and here I am :)
Edited:
After the abovementioned intelligent people left some comments, I decided to follow their advice and rewrote the program like this:
#include <fstream>
#include <iostream>
int main()
{
    std::wofstream fout("myfile");
    if(!fout.is_open())
    {
        std::cout << "Before: Not open...\n";
    }
    fout << L"Հայաստան Россия Österreich Ελλάδα भारत" << std::endl;
    if(!fout.good())
    {
        std::cout << "After: Not good...\n";
    }
}
Built it. Ran it. And... the console clearly read, to my surprise: "After: Not good...".
So I edited my post to provide the new information and started waiting for answers which would explain why this is and what I could do. :)
MSVC offers the codecvt_utf8 locale facet for this problem.
#include <codecvt>
// ...
std::wofstream fout(fileName);
std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
fout.imbue(loc);
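A self-contained version of the fragment above might look like the following sketch. The helper names write_utf8 and read_bytes are hypothetical and exist only so the written bytes can be inspected. Note also that <codecvt> was deprecated in C++17, so newer compilers may warn about it:

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Writes wide text to a file as UTF-8 by imbuing the stream with a
// codecvt_utf8 facet before any output. Returns false on failure.
bool write_utf8(const std::string& path, const std::wstring& text)
{
    std::wofstream fout(path.c_str(), std::ios_base::binary);
    if (!fout.is_open()) {
        return false;
    }
    // The locale takes ownership of the facet allocated with new.
    fout.imbue(std::locale(std::locale::classic(),
                           new std::codecvt_utf8<wchar_t>));
    fout << text;
    fout.flush();
    return fout.good();
}

// Reads the raw bytes back so the encoding can be inspected.
std::string read_bytes(const std::string& path)
{
    std::ifstream fin(path.c_str(), std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(fin),
                       std::istreambuf_iterator<char>());
}
```

Writing L"Россия" this way should produce 12 bytes, two per Cyrillic letter, and the stream should still report good() afterwards.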
In Visual Studio, the output stream is always written in an ANSI encoding, and UTF-8 output is not supported.
What you basically need to do is create a locale object, install a UTF-8 facet into it, and then imbue the fstream with it.
What happens is that the code points are not being converted to a UTF encoding. So basically this would not work under MSVC, as it does not support UTF-8.
This would work under Linux with a UTF-8 locale:
#include <fstream>
int main()
{
    std::locale::global(std::locale(""));
    std::wofstream fout("myfile");
    fout << L"Հայաստան Россия Österreich Ελλάδα भारत" << std::endl;
}
And under Windows this would work:
#include <fstream>
int main()
{
    std::locale::global(std::locale("Russian_Russia"));
    std::wofstream fout("myfile");
    fout << L"Россия" << std::endl;
}
This works because only ANSI encodings are supported by MSVC.
A codecvt facet can be found in some Boost libraries, for example: http://www.boost.org/doc/libs/1_38_0/libs/serialization/doc/codecvt.html
I found that the following code works properly. I am using VS2019.
#include <iostream>
#include <fstream>
#include <codecvt>
#include <locale>
#include <string>
int main()
{
    std::wstring str = L"abàdëef€hhhhhhhµa";
    // Change std::ios_base::app to std::ios_base::in or std::ios_base::out as relevant.
    std::wofstream fout(L"C:\\app.log.txt", std::ios_base::app);
    std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    fout.imbue(loc);
    fout << str;
    fout.close();
}