Converting char[] with Unicode encoding to std::filesystem::path - C++

To simplify the problem I'm trying to solve, let's say I'm building a CLI that checks whether a path exists using std::filesystem::exists(path). The path comes from user input.
Here are my constraints:
I'm developing on Windows.
I cannot use wmain because I don't have access to the main function (imagine the CLI is third-party software and I'm writing a plugin for it).
argc and argv will be passed to my function with the exact signature below (see the code snippet).
Here's an example of the code I wrote:
#include <iostream>
#include <filesystem>

void my_func(int argc, char *argv[]) {
    // This is the only place I can do my work...
    if (argc > 1)
    {
        bool is_exist = std::filesystem::exists(argv[1]);
        std::cout << "The path (" << argv[1] << ") existence is: " << is_exist << std::endl;
    }
    else
    {
        std::cout << "No path defined" << std::endl;
    }
}

int main(int argc, char *argv[]) {
    my_func(argc, argv);
    return 0; // returning non-zero would signal failure to the shell
}
The user can use the software with the following command:
./a.exe path/to/my/folder/狗猫
Currently, rubbish is printed to the terminal. But from my research, this is not a C++ problem; it's a cmd.exe problem.
And, if it's not already obvious, the code snippet above does not work even though a folder called 狗猫 exists.
My guess is that I have to manually convert the char[] to a filesystem::path somehow. Any help is greatly appreciated.

There is a way to get wchar_t *argv[] without wmain (see GetCommandLineW, together with CommandLineToArgvW), if you're willing to change your API slightly. Then you can construct std::filesystem::path directly from the wide strings.
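A minimal sketch of that approach, assuming you can call Win32 APIs from the plugin (my_func_wide is a hypothetical name; CommandLineToArgvW lives in Shell32):

#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW
#include <filesystem>
#include <iostream>

void my_func_wide() {
    int wargc = 0;
    // Re-parse the process's full command line into wide arguments.
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv != nullptr && wargc > 1) {
        // The wchar_t constructor involves no codepage conversion, so nothing is lost.
        std::filesystem::path p(wargv[1]);
        std::wcout << L"exists: " << std::filesystem::exists(p) << L'\n';
    }
    LocalFree(wargv);   // CommandLineToArgvW returns a single LocalAlloc'd block
}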
With char *argv[] you're bound to the user's current console codepage, and you can only hope that it supports the characters you're interested in (characters outside the codepage get irrecoverably corrupted). For example, for Shift-JIS, run chcp 932 before starting the program.
Then follow these steps:
Get the current console codepage with GetConsoleCP,
Convert the char string to UTF-16 with MultiByteToWideChar,
Use the wchar_t overload to construct std::filesystem::path.
Code example:
#include <windows.h>
#include <filesystem>
#include <string>

const char *mbStr = argv[1];
UINT mbCP = GetConsoleCP();
// With a source length of -1, the returned length includes the null terminator.
int wLen = MultiByteToWideChar(mbCP, 0, mbStr, -1, nullptr, 0);
std::wstring wStr(wLen, L'\0');
MultiByteToWideChar(mbCP, 0, mbStr, -1, wStr.data(), wLen);
wStr.resize(wLen - 1);   // drop the embedded null terminator (error handling omitted)
std::filesystem::path myPath(wStr);
if (std::filesystem::exists(myPath)) {
    // . . .
}

Related

C++ _findfirst and TCHAR

I've been given the following code:
#include <iostream>
#include <io.h>      // _findfirst family
#include <tchar.h>   // _tmain, _TCHAR
using namespace std;

int _tmain(int argc, _TCHAR* argv[]) {
    _finddata_t dirEntry;
    intptr_t dirHandle;
    dirHandle = _findfirst("C:/*", &dirEntry);
    int res = (int)dirHandle;
    while (res != -1) {
        cout << dirEntry.name << endl;
        res = _findnext(dirHandle, &dirEntry);
    }
    _findclose(dirHandle);
    cin.get();
    return (0);
}
This prints the name of everything the given directory (C:) contains. Now I have to make it print the name of everything in the subdirectories (if there are any) as well. I've got this so far:
int _tmain(int argc, _TCHAR* argv[]) {
    _finddata_t dirEntry;
    intptr_t dirHandle;
    dirHandle = _findfirst(argv[1], &dirEntry);
    vector<string> dirArray;
    int res = (int)dirHandle;
    unsigned int attribT;
    while (res != -1) {
        cout << dirEntry.name << endl;
        res = _findnext(dirHandle, &dirEntry);
        // The fifth bit of attrib says whether the current entry is a folder.
        attribT = (dirEntry.attrib >> 4) & 1; // put the fifth bit into a temporary variable
        if (attribT) { // if it is indeed a folder, continue (has been tested and confirmed already)
            dirArray.push_back(dirEntry.name);
            cout << "Pass" << endl;
            //res = _findfirst(dirEntry.name, &dirEntry); // needs dirEntry.name combined with the directory specified in argv[1]
        }
    }
    _findclose(dirHandle);
    std::cin.get();
    return (0);
}
Now, I'm not asking for the whole solution (I want to be able to do it on my own), but there is just one thing I can't get my head around, which is the _TCHAR* argv. I know argv[1] contains what I put in my project properties under "command arguments", and right now that is the directory I want to test my application in (C:/users/name/New folder/*), which contains some folders with subfolders and some random files.
The argv[1] currently gives the following error:
Error: argument of type "_TCHAR*" is incompatible with parameter of type "const char *"
Now, I've googled TCHAR and I understand it is either wchar_t* or char*, depending on whether the project uses the Unicode character set or the multi-byte character set (I'm currently using Unicode). I also understand that converting is a massive pain. So what I'm asking is: how do I best get around this with _TCHAR and the _findfirst parameter?
I'm planning to concatenate dirEntry.name onto argv[1], as well as a "*" at the end, and use that in another _findfirst. Any comments on my code are appreciated as well, since I'm still learning C++.
See here: _findfirst is for multibyte strings, while _wfindfirst is for wide characters. If you use TCHAR in your code, then use _tfindfirst (a macro), which resolves to _findfirst on non-UNICODE builds and _wfindfirst on UNICODE builds.
Also, instead of _finddata_t, use _tfinddata_t, which likewise resolves to the correct structure depending on the UNICODE configuration.
Another thing: you should also use the correct literal; _T("C:/*") will be L"C:/*" on a UNICODE build and "C:/*" otherwise. If you know you are building with UNICODE defined, then use std::vector<std::wstring>.
By the way, Visual Studio creates projects with UNICODE defined by default; you could just use the wide versions of functions like _wfindfirst directly, as there is no good reason to build non-UNICODE projects.
As for "TCHAR and I understand it is either a wchar_t* or a char* depending on using UTF-8 character set or multi-byte character set (I'm currently using UTF-8)":
this is wrong. With UNICODE, the Windows APIs use UTF-16, and sizeof(wchar_t) == 2.
Use this simple typedef:
typedef std::basic_string<TCHAR> TCharString;
Then use TCharString wherever you were using std::string, such as here:
vector<TCharString> dirArray;
See here for information on std::basic_string.
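Putting those pieces together, here is a minimal sketch of a TCHAR-clean directory listing (listDirectory is a hypothetical name; error handling kept minimal):

#include <io.h>       // _findfirst family, _A_SUBDIR
#include <tchar.h>    // TCHAR, _T, _tfindfirst/_tfindnext mappings
#include <string>
#include <vector>

typedef std::basic_string<TCHAR> TCharString;

std::vector<TCharString> listDirectory(const TCHAR *pattern) {
    std::vector<TCharString> entries;
    _tfinddata_t dirEntry;                        // _wfinddata_t under UNICODE
    intptr_t dirHandle = _tfindfirst(pattern, &dirEntry);
    if (dirHandle != -1) {
        do {
            if (dirEntry.attrib & _A_SUBDIR)      // clearer than shifting bits
                entries.push_back(dirEntry.name);
        } while (_tfindnext(dirHandle, &dirEntry) == 0);
        _findclose(dirHandle);
    }
    return entries;
}

// Usage: listDirectory(_T("C:/*"));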

Is it possible to print UTF-8 string with Boost and STL in windows console?

I'm trying to output a UTF-8 encoded string with cout, with no success. I'd like to use Boost.Locale in my program. I've found some info regarding Windows console specifics. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set the output console code page to 65001 and save all my sources in UTF-8 encoding with a BOM. So, here is my simple example:
#include <windows.h>
#include <boost/locale.hpp>
#include <iostream>
#include <cstdio>
using namespace std;
using namespace boost::locale;

int wmain(int argc, const wchar_t* argv[])
{
    //system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
    SetConsoleOutputCP(CP_UTF8);
    locale::global(generator().generate(""));
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    cout << "cout: " << utf8_string << endl;
    printf("printf: %s\n", utf8_string);
    return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf handle it well while cout doesn't? Can Boost's locale generator help with this? Or should I use something else to print UTF-8 text to the console in stream mode (a cout-like approach)?
It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as ASCII and finds 21 non-ASCII characters, which it outputs as the unmapped character �. As far as I know, the Windows C++ console driver insists on each character of a narrow char string mapping to a position on screen and does not support multi-byte character sets.
Here what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the utf8 conversion):
utf8_string = { 0xe2, 0x99, 0xa3, 0xe2, 0x98, 0xbb, 0xe2, 0x96,
                0xbc, 0xe2, 0x96, 0xba, 0xe2, 0x99, 0x80, 0xe2, 0x99,
                0x82, 0xe2, 0x98, 0xbc, '\0' };
that is, 21 bytes, none of which is in the ASCII range 0-0x7f.
On the opposite side, printf just outputs the bytes without any conversion, giving the correct output.
I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.
But you should notice that your code fails to imbue the Boost locale into cout.
The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile for every character.
If you'd like to debug it, set a breakpoint inside the _write function in write.c of the CRT sources, write something to cout, and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with an equivalent one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for (size_t i = 0; i < utf8_string_len; ++i)
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile, and the UTF-8 console gets it right:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, utf8_string_len, &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003); both behave identically.
Evidently the Windows console implementation wants whole characters per call to WriteFile/WriteConsole and cannot accept UTF-8 characters byte by byte. :)
What we can do here?
My first idea is to make the output buffered, like for files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once; I'll explain why below)
The first issue is that console output becomes delayed: it waits until the end of a line or a buffer overflow.
The second issue: it doesn't work.
Why? After the first buffer flush (at the first << endl), cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of written bytes in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of written characters (the problem is described here). The CRT detects that the number of bytes requested and the number reported written differ, and stops writing to the 'failed' stream.
What more can we do?
Well, I suppose we could implement our own std::basic_streambuf to write to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad.
Other options are to (a) use std::wcout with wchar_t strings, or (b) use WriteFile/WriteConsole directly. Sometimes those solutions are acceptable.
Working with a UTF-8 console in Microsoft's C++ implementations is really horrible.
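For option (b), here is a minimal sketch (writeUtf8ToConsole is a hypothetical helper name; error handling omitted): convert the UTF-8 bytes to UTF-16 and hand the whole string to WriteConsoleW in one call, bypassing the console codepage entirely.

#include <windows.h>
#include <string>

// Hypothetical helper: print a UTF-8 string via the wide console API.
void writeUtf8ToConsole(const std::string& utf8) {
    int wLen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                                   (int)utf8.size(), nullptr, 0);
    std::wstring wide(wLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                        (int)utf8.size(), &wide[0], wLen);
    DWORD written = 0;
    // WriteConsoleW takes UTF-16 directly; no codepage conversion happens.
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide.c_str(),
                  (DWORD)wLen, &written, nullptr);
}

Note that WriteConsoleW only works when stdout really is a console; if output is redirected to a file or pipe, you need a WriteFile fallback.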

Wrong filename when using chinese characters

I'm trying to create a file on Windows using a Chinese character. The entire path is in the variable std::string originalPath; however, I have a charset problem that I simply cannot understand how to overcome.
I have written the following code:
#include <iostream>
#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include <boost/locale.hpp>
#include <boost/filesystem/fstream.hpp>
#include <windows.h>

int main(int argc, char *argv[])
{
    // Start the rand
    srand(time(NULL));
    // Create and install the global locale
    std::locale::global(boost::locale::generator().generate(""));
    // Make boost.filesystem use it
    boost::filesystem::path::imbue(std::locale());
    // Check if set to utf-8
    if (std::use_facet<boost::locale::info>(std::locale()).encoding() != "utf-8") {
        std::cerr << "Wrong encoding" << std::endl;
        return -1;
    }
    std::string originalPath = "C:/test/s/一.png";
    // Convert to wstring (**WRONG!**)
    std::wstring newPath(originalPath.begin(), originalPath.end());
    LPWSTR lp = (LPWSTR)newPath.c_str();
    CreateFileW(lp, GENERIC_READ | GENERIC_WRITE,
                FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    return 0;
}
Running it, however, I get inside the folder C:\test\s a file named "¦ᄌタ.png" instead of "一.png", which is what I want. The only way I found to overcome this is to exchange the lines
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
to simply
std::wstring newPath = L"C:/test/s/一.png";
In this case the file "一.png" appears perfectly inside the folder C:\test\s. Nonetheless, I cannot do that, because the software gets its path from a std::string variable. I think the conversion from std::string to std::wstring is being performed the wrong way, but as you can see, I'm having deep trouble understanding this logic. I read and researched Google exhaustively and read many good articles, but all my attempts seem useless. I tried the MultiByteToWideChar function and also boost::filesystem, but to no avail; I simply cannot get the right filename written to the folder.
I'm still learning, so I'm very sorry if I'm making a dumb mistake. My IDE is Eclipse, and it is set to UTF-8.
You need to actually convert the UTF-8 string to UTF-16. For that, look up how to use boost::locale::conv or (Windows-only) the MultiByteToWideChar function.
std::wstring newPath(originalPath.begin(), originalPath.end()); won't work: it simply copies the bytes one by one and casts each to a wchar_t.
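A minimal sketch of the Windows-only route (utf8ToWide is a hypothetical helper name; error handling omitted):

#include <windows.h>
#include <string>

// Hypothetical helper: interpret the bytes as UTF-8 and convert to UTF-16.
std::wstring utf8ToWide(const std::string& utf8) {
    int wLen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                                   (int)utf8.size(), nullptr, 0);
    std::wstring wide(wLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                        (int)utf8.size(), &wide[0], wLen);
    return wide;
}

// Usage: std::wstring newPath = utf8ToWide(originalPath);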
Thank you for your help, roeland. I finally managed to find a solution: I used the library http://utfcpp.sourceforge.net/ and its function utf8::utf8to16 to convert my original UTF-8 string to UTF-16, which lets Windows write the Chinese characters correctly.
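For reference, the utfcpp call mentioned above looks roughly like this (a sketch using the library's iterator-based API, on Windows where wchar_t is 16-bit):

#include <string>
#include <iterator>
#include "utf8.h"   // from http://utfcpp.sourceforge.net/

std::string originalPath = u8"C:/test/s/一.png";
std::wstring widePath;
// Appends UTF-16 code units one by one to widePath.
utf8::utf8to16(originalPath.begin(), originalPath.end(),
               std::back_inserter(widePath));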

`std::wcout << L"\u25a0" << std::endl;` outputs nothing, and anything <<'d to wcout thereafter also outputs nothing [duplicate]

Consider the following code snippet, compiled as a Console Application with MS Visual Studio 2010/2012 and executed on Windows 7:
#include "stdafx.h"
#include <iostream>
#include <string>
const std::wstring test = L"hello\xf021test!";
int _tmain(int argc, _TCHAR* argv[])
{
std::wcout << test << std::endl;
std::wcout << L"This doesn't print either" << std::endl;
return 0;
}
The first wcout statement outputs "hello" (instead of something like "hello?test!").
The second wcout statement outputs nothing.
It's as if 0xf021 (and other?) Unicode characters cause wcout to fail.
This particular Unicode character, 0xf021 (encoded as UTF-16), is part of the "Private Use Area" in the Basic Multilingual Plane. I've noticed that Windows Console applications do not have extensive support for Unicode characters, but typically each character is at least represented by a default character (e.g. "?"), even if there is no support for rendering a particular glyph.
What is causing the wcout stream to choke? Is there a way to reset it after it enters this state?
wcout, or to be precise, the wfilebuf instance it uses internally, converts wide characters to narrow characters and then writes those to the file (in your case, stdout). The conversion is performed by the codecvt facet in the stream's locale; by default, that just calls wctomb_s, converting to the system default ANSI codepage, aka CP_ACP.
Apparently, the character '\xf021' is not representable in the default codepage configured on your system, so the conversion fails and failbit is set in the stream. Once failbit is set, all subsequent calls fail immediately.
I do not know of any way to get wcout to successfully print arbitrary Unicode characters to the console. wprintf works, though, with a little tweak:
#include <fcntl.h>
#include <io.h>
#include <cstdio>   // wprintf
#include <tchar.h>  // _tmain, _TCHAR
#include <string>

const std::wstring test = L"hello\xf021test!";

int _tmain(int argc, _TCHAR* argv[])
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(test.c_str());
    return 0;
}
Setting the mode of stdout to _O_U16TEXT will allow you to write Unicode characters to the wcout stream as well as with wprintf (see "Conventional wisdom is retarded, aka What the ##%&* is _O_U16TEXT?"). This is the right way to make this work:
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << L"hello\xf021test!" << std::endl;
std::wcout << L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd" << std::endl;
std::wcout << L"Now this prints!" << std::endl;
It shouldn't be necessary anymore but you can reset a stream that has entered an error state by calling clear:
if (std::wcout.fail())
{
    std::wcout.clear();
}

Unicode Windows console application (WxDev-C++/minGW 4.6.1)

I'm trying to make a simple multilingual Windows console app, just for educational purposes. I'm using C++ with WxDev-C++/MinGW 4.6.1, and I know this kind of question has been asked a million times. I've searched possibly the entire internet and seen probably all the forums, but nothing really helps.
Here's the sample working code:
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    /* English version of Hello world */
    wchar_t EN_helloWorld[] = L"Hello world!";
    wcout << EN_helloWorld << endl;
    cout << "\nPress the enter key to continue...";
    cin.get();
    return 0;
}
It works perfectly until I put in a really wide character, as in "Ahoj světe!". The problem is the "ě", which is U+011B. The compiler gives me this error: "Illegal byte sequence."
Not working code:
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    /* Czech version of Hello world */
    wchar_t CS_helloWorld[] = L"Ahoj světe!"; /* error: Illegal byte sequence */
    wcout << CS_helloWorld << endl;
    cout << "\nPress the enter key to continue...";
    cin.get();
    return 0;
}
I've heard about things like #define UNICODE/_UNICODE, -municode, and downloading wrappers for older MinGW. I tried them, but they don't work; maybe I don't know how to use them properly. Anyway, I need some help. In Visual Studio this is a simple task.
Big thanks for any response.
Apparently, using the standard output streams for UTF-16 does not work in MinGW.
I found that I could either use the Windows API or use UTF-8. See this other answer for code samples.
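A rough sketch of the UTF-8 route, sidestepping the wide-literal problem entirely by spelling the non-ASCII character as explicit UTF-8 bytes (you may also need a TrueType console font such as Lucida Console for the glyph to show):

#include <windows.h>
#include <cstdio>

int main()
{
    // Make the console interpret our narrow output as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    // A plain char string holding UTF-8 bytes; "\xc4\x9b" is 'ě' (U+011B)
    // encoded as UTF-8, so no wide literal is involved at all.
    const char CS_helloWorld[] = "Ahoj sv\xc4\x9bte!";
    printf("%s\n", CS_helloWorld);
    return 0;
}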
Here is an answer, though I'm not sure it will work for MinGW.
There are also some details specific to MinGW here.