C++ _findfirst and TCHAR

I've been given the following code:
int _tmain(int argc, _TCHAR* argv[]) {
    _finddata_t dirEntry;
    intptr_t dirHandle;
    dirHandle = _findfirst("C:/*", &dirEntry);
    int res = (int)dirHandle;
    while (res != -1) {
        cout << dirEntry.name << endl;
        res = _findnext(dirHandle, &dirEntry);
    }
    _findclose(dirHandle);
    cin.get();
    return (0);
}
What this does is print the name of everything the given directory (C:) contains. Now I have to make it print the name of everything in the subdirectories (if there are any) as well. Here is what I have so far:
int _tmain(int argc, _TCHAR* argv[]) {
    _finddata_t dirEntry;
    intptr_t dirHandle;
    dirHandle = _findfirst(argv[1], &dirEntry);
    vector<string> dirArray;
    int res = (int)dirHandle;
    unsigned int attribT;
    while (res != -1) {
        cout << dirEntry.name << endl;
        res = _findnext(dirHandle, &dirEntry);
        attribT = (dirEntry.attrib >> 4) & 1; //put the fifth bit into a temporary variable
        //the fifth bit of attrib says if the current object that the _finddata instance contains is a folder.
        if (attribT) { //if it is indeed a folder, continue (has been tested and confirmed already)
            dirArray.push_back(dirEntry.name);
            cout << "Pass" << endl;
            //res = _findfirst(dirEntry.name, &dirEntry); //needs to get a variable which is the dirEntry.name combined with the directory specified in argv[1].
        }
    }
    _findclose(dirHandle);
    std::cin.get();
    return (0);
}
Now I'm not asking for the whole solution (I want to be able to do it on my own), but there is just this one thing I can't get my head around, which is the _TCHAR* argv. I know argv[1] contains what I put in my project properties under "command arguments", and right now this contains the directory I want to test my application in (C:/users/name/New folder/*), which contains some folders with subfolders and some random files.
The argv[1] currently gives the following error:
Error: argument of type "_TCHAR*" is incompatible with parameter of type "const char *"
Now I've googled TCHAR and I understand it is either wchar_t or char, depending on whether the Unicode character set or the multi-byte character set is used (I'm currently using Unicode). I also understand converting is a massive pain. So what I'm asking is: how can I best get around this with the _TCHAR and the _findfirst parameter?
I'm planning to concatenate dirEntry.name onto argv[1], as well as appending a "*" at the end, and use that in another _findfirst call. Any comments on my code are appreciated as well, since I'm still learning C++.

See here: _findfirst is for multibyte strings, while _wfindfirst is for wide characters. If you use TCHAR in your code, then use _tfindfirst (a macro), which resolves to _findfirst on non-UNICODE builds and to _wfindfirst on UNICODE builds.
Also, instead of _finddata_t use _tfinddata_t, which likewise resolves to the correct structure depending on the UNICODE configuration.
Another thing: you should also use the correct literal; _T("C:/*") will be L"C:/*" on a UNICODE build and "C:/*" otherwise. If you know you are building with UNICODE defined, then use std::vector<std::wstring>.
By the way, Visual Studio by default creates projects with UNICODE defined, so you may just use the wide versions of functions, like _wfindfirst, as there is no good reason to build non-UNICODE projects.
Regarding "I understand it is either wchar_t or char depending on whether the Unicode character set or the multi-byte character set is used": one clarification here. With UNICODE defined, the Windows APIs use UTF-16, and sizeof(wchar_t) == 2.
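Putting those pieces together, a minimal sketch of the original listing written with the TCHAR-neutral names might look like this (my own sketch, assuming a Visual Studio project where <tchar.h> and <io.h> are available; error handling omitted):

#include <tchar.h>
#include <io.h>
#include <iostream>

int _tmain(int argc, _TCHAR* argv[]) {
    _tfinddata_t dirEntry;                                    // _finddata_t or _wfinddata_t
    intptr_t dirHandle = _tfindfirst(_T("C:/*"), &dirEntry);  // "C:/*" or L"C:/*"
    if (dirHandle == -1)
        return 1;                                             // nothing found / bad pattern
    do {
#ifdef _UNICODE
        std::wcout << dirEntry.name << std::endl;             // name is wchar_t[] here
#else
        std::cout << dirEntry.name << std::endl;              // name is char[] here
#endif
    } while (_tfindnext(dirHandle, &dirEntry) == 0);
    _findclose(dirHandle);
    return 0;
}

On a UNICODE build this compiles to the _wfindfirst/_wfinddata_t pair; on a multi-byte build it compiles to _findfirst/_finddata_t.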

Use this simple typedef:
typedef std::basic_string<TCHAR> TCharString;
Then use TCharString wherever you were using std::string, such as here:
vector<TCharString> dirArray;
See here for information on std::basic_string.
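For example, building the search pattern for a subdirectory (a hypothetical snippet: dirPath is an assumed variable holding the parent directory without the trailing pattern, and dirEntry comes from the question's loop):

TCharString dirPath = _T("C:/users/name/New folder/");        // parent directory, no pattern
TCharString subPattern = dirPath + dirEntry.name + _T("/*");  // what _tfindfirst would receive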

Related

Converting char[] with unicode encoding to filesystem::path

To simplify the problem I'm trying to solve, let's say I'm trying to build a CLI that checks whether a path exists using std::filesystem::exists(path). This path comes from user input.
Here are the constraints:
I'm developing on Windows.
I cannot use wmain because I don't have access to the main function (imagine the CLI is third-party software and I'm writing a plugin for it).
The argc and argv will be passed to my function with the exact signature below (see the code snippet).
Here's an example of the code I wrote:
#include <iostream>
#include <filesystem>

void my_func(int argc, char *argv[]) {
    // This is the only place I can do my work...
    if (argc > 1)
    {
        bool is_exist = std::filesystem::exists(argv[1]);
        std::cout << "The path (" << argv[1] << ") existence is: " << is_exist << std::endl;
    }
    else
    {
        std::cout << "No path defined" << std::endl;
    }
}

int main(int argc, char *argv[]) {
    my_func(argc, argv);
    return 1;
}
The user can use the software with the following command:
./a.exe path/to/my/folder/狗猫
Currently, rubbish is printed on the terminal. But from my research, this is not a C++ problem but rather a cmd.exe problem.
And if it is not already obvious, the above code snippet does not work even though there is a folder called 狗猫.
My guess is I have to manually convert char[] to filesystem::path somehow. Any help is greatly appreciated.
There is a way to get the wchar_t* argv[] without wmain (see GetCommandLineW), if you're willing to change your API slightly; a sketch of that approach appears after the code example below. Then you can construct the std::filesystem::path directly from the wide string.
With char *argv[] you're bound to the user's current console codepage, and you can only hope that it supports the characters you're interested in (characters outside the codepage get irrecoverably corrupted). For example, for Shift-JIS, run chcp 932 before starting the program.
Then follow these steps:
Get the current console codepage with GetConsoleCP,
Convert the char string to UTF-16 with MultiByteToWideChar,
Use the wchar_t overload to construct std::filesystem::path.
Code example:
const char* mbStr = argv[1];
unsigned mbCP = GetConsoleCP();
// first call computes the required length, including the terminating null
int wLen = MultiByteToWideChar(mbCP, 0, mbStr, -1, nullptr, 0);
std::wstring wStr(wLen, 0);
MultiByteToWideChar(mbCP, 0, mbStr, -1, wStr.data(), wLen);
wStr.resize(wLen > 0 ? wLen - 1 : 0); // drop the trailing null written by the API
std::filesystem::path myPath(wStr);
if (std::filesystem::exists(myPath)) {
    // . . .
}
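And here is a rough sketch of the GetCommandLineW route mentioned at the top (my_func_w is a hypothetical name; CommandLineToArgvW comes from <shellapi.h> and needs Shell32.lib):

#include <iostream>
#include <filesystem>
#include <windows.h>
#include <shellapi.h>

void my_func_w() {
    int wargc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv != nullptr && wargc > 1) {
        std::filesystem::path p(wargv[1]);  // built from UTF-16, no codepage involved
        std::cout << "exists: " << std::filesystem::exists(p) << std::endl;
    }
    LocalFree(wargv);
}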

Is it possible to print UTF-8 string with Boost and STL in windows console?

I'm trying to output a UTF-8 encoded string with cout, with no success. I'd like to use Boost.Locale in my program. I've found some info regarding Windows console specifics. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set the console output code page to 65001 and save all my sources in UTF-8 encoding with BOM. So, here is my simple example:
#include <windows.h>
#include <boost/locale.hpp>
using namespace std;
using namespace boost::locale;

int wmain(int argc, const wchar_t* argv[])
{
    //system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
    SetConsoleOutputCP(CP_UTF8);
    locale::global(generator().generate(""));
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    cout << "cout: " << utf8_string << endl;
    printf("printf: %s\n", utf8_string);
    return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf do it well while cout doesn't? Can Boost's locale generator help with it? Or should I use something else to print UTF-8 text to the console in stream mode (a cout-like approach)?
It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as an ASCII one and finds 21 non-ASCII characters, which it outputs as the unmapped character �. AFAIK the Windows C++ console driver insists on each character of a narrow char string mapping to one position on screen and does not support multi-byte character sets.
Here is what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the utf8 conversion):
utf8_string = { '\xe2', '\x99', '\xa3', '\xe2', '\x98', '\xbb', '\xe2', '\x96',
                '\xbc', '\xe2', '\x96', '\xba', '\xe2', '\x99', '\x80', '\xe2', '\x99',
                '\x82', '\xe2', '\x98', '\xbc', '\0' };
that is, 21 bytes, none of which is in the ASCII range 0-0x7f.
On the other side, printf just outputs the bytes without any conversion, giving the correct output.
I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.
But you should notice that your code fails to imbue the Boost locale into cout.
The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile for every single character.
If you'd like to debug it, set a breakpoint inside the _write function in write.c of the CRT sources, write something to cout, and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with this equivalent (and faster!) one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for (size_t i = 0; i < utf8_string_len; ++i)
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile and the UTF-8 console becomes brilliant:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, utf8_string_len, &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003); both behave identically.
Obviously the Windows console implementation wants whole characters per WriteFile/WriteConsole call and cannot take UTF-8 characters byte by byte. :)
What can we do here?
My first idea is to make the output buffered, like for files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once; I'll explain why later)
The first issue is that console output becomes delayed: it waits until the end of a line or a buffer overflow.
The second issue: it doesn't work.
Why? After the first buffer flush (at the first << endl) cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of written bytes in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of written characters (the problem is described here). The CRT detects that the number of bytes requested and the number written differ, and stops writing to the 'failed' stream.
What more can we do?
Well, I suppose we could implement our own std::basic_streambuf that writes to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad.
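For what it's worth, here is a rough, untested sketch of that idea: a streambuf that collects bytes and hands the console one WriteFile call per flush, bypassing the CRT path that sets the badbit:

#include <windows.h>
#include <iostream>
#include <streambuf>
#include <string>

class Utf8ConsoleBuf : public std::streambuf {
    std::string pending; // bytes collected since the last flush
protected:
    int_type overflow(int_type ch) override {
        if (ch != traits_type::eof())
            pending.push_back(static_cast<char>(ch));
        return ch;
    }
    int sync() override {
        DWORD written = 0;
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE),
                  pending.data(), static_cast<DWORD>(pending.size()), &written, NULL);
        pending.clear();
        return 0;
    }
};

// usage: SetConsoleOutputCP(CP_UTF8); then redirect cout through the buffer:
// Utf8ConsoleBuf buf;
// std::streambuf* old = std::cout.rdbuf(&buf);
// std::cout << u8"♣☻▼►♀♂☼" << std::endl;   // endl calls sync() -> one WriteFile call
// std::cout.rdbuf(old);                     // restore before buf goes out of scope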
Other options are: (a) use std::wcout and strings of wchar_t characters, or (b) use WriteFile/WriteConsole directly. Sometimes those solutions are acceptable.
Working with a UTF-8 console in Microsoft's implementations of C++ is really horrible.
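For completeness, a minimal sketch of option (a): switch stdout to UTF-16 mode with _setmode and write wide strings through std::wcout (note that mixing narrow output such as printf into the same stream is not supported after this):

#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT); // stdout now expects UTF-16
    std::wcout << L"♣☻▼►♀♂☼" << std::endl;
    return 0;
}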

Wrong filename when using chinese characters

I'm trying to create a file on Windows using a Chinese character. The entire path is inside the variable std::string originalPath; however, I have a charset problem that I simply cannot figure out how to overcome.
I have written the following code:
#include <iostream>
#include <boost/locale.hpp>
#include <boost/filesystem/fstream.hpp>
#include <windows.h>

int main( int argc, char *argv[] )
{
    // Start the rand
    srand( time( NULL ) );

    // Create and install global locale
    std::locale::global( boost::locale::generator().generate( "" ) );

    // Make boost.filesystem use it
    boost::filesystem::path::imbue( std::locale() );

    // Check if set to utf-8
    if( std::use_facet<boost::locale::info>( std::locale() ).encoding() != "utf-8" ){
        std::cerr << "Wrong encoding" << std::endl;
        return -1;
    }

    std::string originalPath = "C:/test/s/一.png";

    // Convert to wstring (**WRONG!**)
    std::wstring newPath( originalPath.begin(), originalPath.end() );

    LPWSTR lp = (LPWSTR)newPath.c_str();

    CreateFileW( lp, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ |
                 FILE_SHARE_WRITE, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL );
    return 0;
}
Running it, however, I get inside the folder "C:\test\s" a file named "¦ᄌタ.png" instead of "一.png", which is what I want. The only way I found to overcome this is to exchange the lines
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
to simply
std::wstring newPath = L"C:/test/s/一.png";
In this case the file "一.png" appears perfectly inside the folder "C:\test\s". Nonetheless, I cannot do that because the software gets its path from a std::string variable. I think the conversion from std::string to std::wstring is being performed the wrong way, but as you can see, I'm having deep problems trying to understand the logic here. I read and researched exhaustively on Google and read many good texts, but all my attempts seem to be useless. I tried the MultiByteToWideChar function and also boost::filesystem, but neither helped; I simply cannot get the right filename written to the folder.
I'm still learning, so I'm very sorry if I'm making a dumb mistake. My IDE is Eclipse and it is set up to UTF-8.
You need to actually convert the UTF-8 string to UTF-16. For that you have to look up how to use boost::locale::conv or (on Windows only) the MultiByteToWideChar function.
std::wstring newPath( originalPath.begin(), originalPath.end() ); won't work: it simply copies the bytes one by one and casts each of them to a wchar_t.
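A minimal sketch of the MultiByteToWideChar route (Windows-only, and it assumes originalPath really is UTF-8, which is the case when the source file and Eclipse are set to UTF-8 as described):

#include <string>
#include <windows.h>

std::wstring utf8_to_wide(const std::string& utf8) {
    // first call computes the required length, including the terminating null
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
    if (len > 0)
        wide.resize(len - 1); // drop the trailing null written by the API
    return wide;
}

// CreateFileW(utf8_to_wide(originalPath).c_str(), ...) then creates "一.png" correctly.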
Thank you for your help, roeland. I finally managed to find a solution: I simply used the library at http://utfcpp.sourceforge.net/ and its function utf8::utf8to16 to convert my original UTF-8 string to UTF-16, which allows Windows to display the Chinese characters correctly.

How to check if a file already exists before creating it? (C++, Unicode, cross platform)

I have the following code which works on Windows:
bool fileExists(const wstring& src)
{
#ifdef PLATFORM_WINDOWS
    return (_waccess(src.c_str(), 0) == 0);
#else
    // ???? how do I make the C access() function accept the wstring on Unix/Linux/MacOS ?
#endif
}
How do I make this code work on *nix platforms the same way it does on Windows, considering that src is a Unicode string and might contain a file path with Unicode characters?
I have seen various StackOverflow answers which partly answer my question, but I have problems putting it all together. My system relies on wide strings, especially on Windows, where file names might contain non-ASCII characters. I know that generally it's better to just write to the file and check for errors, but my case is the opposite: I need to skip the file if it already exists. I just want to check whether the file exists, regardless of whether I can read or write it.
On many filesystems other than FAT and NTFS, filenames aren't exactly well defined as strings; they're technically byte sequences. What those byte sequences mean is a matter of interpretation. A common interpretation is UTF-8-like, but not exact UTF-8, because Unicode specifies string equality regardless of encoding while most systems use byte equality instead. (Again, FAT and NTFS are exceptions, using case-insensitive comparisons.)
A good portable solution I use is the following:
ifstream my_file(myFilenameHere);
if (my_file.good())
{
    // file exists and do what you need to do when it exists
}
else
{
    // the file doesn't exist do what you need to do to create it etc.
}
For example, a small file-existence-checker function could look like this (it works on Windows, Linux and Unix):
// requires <sys/stat.h>, <cstdio> and <fstream> depending on the branch taken
inline bool doesMyFileExist (const std::string& myFilename)
{
#if defined(__unix__) || defined(__posix__) || defined(__linux__ )
    // all UNIXes, POSIX (including OS X I think (cant remember been a while)) and
    // all the various flavours of Linus Torvalds digital offspring:)
    struct stat buffer;
    return (stat (myFilename.c_str(), &buffer) == 0);
#elif defined(__APPLE__) || defined(_WIN32)
    // this includes IOS AND OSX and Windows (x64 and x86)
    // note the underscore in the windows define, without it can cause problems
    if (FILE *file = fopen(myFilename.c_str(), "r"))
    {
        fclose(file);
        return true;
    }
    else
    {
        return false;
    }
#else // a catch-all fallback, this is the slowest method, but works on them all:)
    ifstream myFile(myFilename.c_str());
    if (myFile.good())
    {
        myFile.close();
        return true;
    }
    else
    {
        myFile.close();
        return false;
    }
#endif
}
The function above uses the fastest available method to check for the file on each OS variant, and has a fallback in case you are on an OS other than the ones explicitly listed (the original Amiga OS, for example). It has been used with GCC 4.8.x and VS 2010/2012.
The good() method will check that everything is as it should be, and this way you actually have the file open.
The only caveat is to pay close attention to how the file name is represented in the OS (as mentioned in another answer).
So far this has worked cross-platform for me just fine :)
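As an aside, on toolchains where C++17 is available, std::filesystem can express the same check once for all platforms; a minimal sketch (the wide-string path constructor covers the Unicode case the question asks about):

#include <filesystem>
#include <string>

bool fileExists(const std::wstring& src) {
    std::error_code ec; // non-throwing overload, in case the path is inaccessible
    return std::filesystem::exists(std::filesystem::path(src), ec);
}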
I spent some hours experimenting on my Ubuntu machine. It took many trials and errors, but finally I got it working. I'm not sure if it will work on MacOS or on other *nixes.
As many suspected, direct casting to char* did not work: that way I got only the first slash of my test path /home/progmars/абвгдāēī. The trick was to use wcstombs() combined with setlocale(). Although I could not get the text to display in the console after this conversion, the access() function still got it right.
Here is the code which worked for me:
bool fileExists(const wstring& src)
{
#ifdef PLATFORM_WINDOWS
    return (_waccess(src.c_str(), 0) == 0);
#else
    // hopefully this will work on most *nixes...
    size_t outSize = src.size() * sizeof(wchar_t) + 1; // max possible bytes plus \0 char
    char* conv = new char[outSize];
    memset(conv, 0, outSize);
    // MacOS claims to have wcstombs_l which takes a locale argument,
    // but I could not find something similar on Ubuntu,
    // thus I had to use setlocale();
    char* oldLocale = setlocale(LC_ALL, NULL);
    setlocale(LC_ALL, "en_US.UTF-8"); // let's hope most machines will have "en_US.UTF-8" available
                                      // "Works on my machine", that is, Ubuntu 12.04
    size_t wcsSize = wcstombs(conv, src.c_str(), outSize);
    // we might get an error code (size_t)-1 in wcsSize; ignoring it for now
    // now be good, restore the locale
    setlocale(LC_ALL, oldLocale);
    bool exists = (access(conv, 0) == 0);
    delete[] conv; // don't leak the conversion buffer
    return exists;
#endif
}
And here is some experimental code which led me to the solution:
// this is crucial to output correct unicode characters in console and for wcstombs to work!
// an empty string also works instead of en_US.UTF-8
// setlocale(LC_ALL, "en_US.UTF-8");

wstring unicoded = wstring(L"/home/progmars/абвгдāēī");
int outSize = unicoded.size() * sizeof(wchar_t) + 1; // max possible bytes plus \0 char
char* conv = new char[outSize];
memset(conv, 0, outSize);

size_t szt = wcstombs(conv, unicoded.c_str(), outSize); // this needs setlocale - only then does it return 31;
                                                        // otherwise it returns some big number - most likely an error
wcout << "wcstombs result " << szt << endl;

int resDirect = access("/home/progmars/абвгдāēī", 0); // always works fine
int resCast = access((char*)unicoded.c_str(), 0);
int resConv = access(conv, 0);

wcout << "Raw " << unicoded.c_str() << endl; // outputs /home/progmars/абвгдāēī but only if setlocale has been called; else the output is /home/progmars/????????
wcout << "Casted " << (char*)unicoded.c_str() << endl; // outputs /
wcout << "Converted " << conv << endl; // outputs /home/progmars/ - for some reason, Unicode chars are cut away in the console, but they are still there because access() picks them up correctly
wcout << "resDirect " << resDirect << endl; // gives the correct result depending on the file's existence
wcout << "resCast " << resCast << endl; // wrong result - always 0 because it looks for /, which is the filesystem root and always exists
wcout << "resConv " << resConv << endl; // gives the correct result, but only if setlocale() is present
Of course, I could avoid all that hassle with ifdefs by defining my own version of string which would be wstring on Windows and string on *nix, because *nix seems to be more liberal about UTF-8 symbols and doesn't mind having them in plain strings. Still, I wanted to keep my function declarations consistent across all platforms, and I also wanted to learn how Unicode filenames work on Linux.

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German Umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;

    wchar_t currentLetter;
    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF8 and the console's encoding is also set to UTF8.
Here's a solution provided by Sehe:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;

template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";

    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes, you need Boost, but it seems you're going to need an external library anyway.
1. C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.
2. Since UTF-8 has variable length, all indexing is done in code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings (a tiny illustration appears at the end of this answer).
3. The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms, although on Windows the WideCharToMultiByte function is guaranteed to produce UTF-8.
C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than in the way I described.
4. (about source files, but still somewhat relevant)
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters using universal character names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used.
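A tiny illustration of point 2 (the U prefix gives a fixed-width UTF-32 string, so plain indexing addresses whole code points):

#include <string>

int main() {
    std::u32string word = U"für";   // one char32_t per code point
    char32_t first  = word[0];      // U'f'
    char32_t second = word[1];      // U'ü' - random access on code points works here
    return (first == U'f' && second == U'\u00FC') ? 0 : 1;
}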