Wrong filename when using chinese characters - c++

I'm trying to create a file on Windows using a Chinese character. The entire path is inside the variable "std::string originalPath", however, I have a charset problem that I simply cannot understand to overcome.
I have written the following code:
#include <iostream>
#include <boost/locale.hpp>
#include <boost/filesystem/fstream.hpp>
#include <windows.h>
int main( int argc, char *argv[] )
{
// Start the rand
srand( time( NULL ) );
// Create and install global locale
std::locale::global( boost::locale::generator().generate( "" ) );
// Make boost.filesystem use it
boost::filesystem::path::imbue( std::locale() );
// Check if set to utf-8
if( std::use_facet<boost::locale::info>( std::locale() ).encoding() != "utf-8" ){
std::cerr << "Wrong encoding" << std::endl;
return -1;
}
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
LPWSTR lp=(LPWSTR )newPath.c_str();
CreateFileW(lp,GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ |
FILE_SHARE_WRITE, NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL );
return 0;
}
Running it, however, I get inside the folder "C:\test\s" a file of name "¦ᄌタ.png", instead of "一.png", which I want. The only way I found to overcome this is to exchange the lines
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
to simply
std::wstring newPath = L"C:/test/s/一.png";
In this case the file "一.png" appears perfectly inside the folder "C:\test\s". Nonetheless, I cannot do that because the software get its path from a std::string variable. I think the conversion from std::string to std::wstring is being performed the wrong way, however, as it can be seen, I'm having deep problem trying to understand this logic. I read and researched Google exhaustively, read many qualitative texts, but all my attempts seem to be useless. I tried the MultiByteToWideChar function and also boost::filesystem, but both for no help, I simply cannot get the right filename when written to the folder.
I'm still learning, so I'm very sorry if I'm making a dumb mistake. My IDE is Eclipse and it is set up to UTF-8.

You need to actually convert the UTF-8 string to UTF-16. For that you have to look up how to use boost::locale::conv or (on Windows only) the MultiByteToWideChar function.
std::wstring newPath( originalPath.begin(), originalPath.end() ); won't work, it will simply copy all the bytes one by one and cast them to a wchar_t.

Thank you for your help, roeland. Finally I managed to find a solution and I simply used this following library: "http://utfcpp.sourceforge.net/". I used the function "utf8::utf8to16" to convert my original UTF-8 string to UTF-16, this way allowing Windows to display the Chinese characters correctly.

Related

How to properly convert USC-2 little endian into UTF-8?

I have a file and the line endings are in the windows style \r\n; it is encoded in USC-2 little endian.
Say this is my file fruit.txt (USC-2 little endian):
So I open it in a std::wifstream and try to parse the contents:
// open the file
std::wifstream file("fruit.txt");
if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));
// create container for the lines
std::forward_list<std::string> lines;
// Add each line to the container
std::wstring line;
while(std::getline(file,line)) lines.emplace_front(wstring_to_string(line));
If I try to print to cout...
// Printing to cout
for( auto it = lines.cbegin(); it != lines.cend(); ++it )
std::cout << *it << std::endl;
...This is what it outputs:
Cherry
Banana
ÿþApple
Worse yet, if I open it in Notepad++, this is what it looks like
I can sort-of rectify this by forcibly converting the encoding back to USC-2 which results in this:
My wstring_to_string function is defined as this:
std::string wstring_to_string( const std::wstring& wstr ) {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
return convert.to_bytes(wstr);
}
What in the world is going on here? How can I get a normal UTF-8 string? I have tried this method too: How to read utf-16 file into utf-8 std::string line by line, but imbuing the std::wifstream first results in no outputs altogether. Can someone please help direct me in the best way to go about converting USC-2 LE data to readable UTF-8 data?
Edit I think there may be a bug with mingw64/mingw-w64-x86_64-gcc 6.3.0-2 which is provided by MSYS2. I have tried everyone's suggestions and imbuing the locale into the streams is just rendering no output at all. I do know there are only two native locales provided, "C" and "POSIX". I was going to try Visual Studio but don't have sufficient internet speed for the 4GB download. I have used ICU like #Andrei R. suggested and it is working great.
I would have loved to use standard libraries but I am ok with this. Please take a look at my code if you need this solution: https://pastebin.com/qudy7yva
The code itself is fine as-is.
The real problem is that your input file is NOT valid UTF-16LE to begin with (your use of std::codecvt_utf8_utf16 requires UTF-16, not UCS-2). This is clearly shown in your Notepad++ screenshots.
Offhand, the file data looks like a UTF-16LE file with a BOM (ÿþ is the UTF-16LE BOM when viewed as 8bit ANSI) was appended as-is to the end of a UCS-2BE (or UTF-16BE) file that did not have a BOM.
You need to fix the input file so the entire file is valid UTF-16LE from beginning to end (with or without a BOM in front, not in the middle).
Then the code you already have will work.
converting to/from unicode is in general not so trivial. Have a look at ICU libraries, I believe, this is by far most complete encoding conversion library for c/c++.
There are also platform-dependent ways like WideCharToMultibyte (Win) or iconv (Linux). Or, with Qt, you can use QString::fromUtf16. Probably you will have to reverse endianness yourself.
For your case, the main issue is that you made the wifstream read the file in a wrong way. If you print the size of wstr in wstring_to_string, you will find that it's not what you expect.
https://stackoverflow.com/a/19698449/4005852
Set proper locale will fix this issue.
std::string wstring_to_string( const std::wstring& wstr ) {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
return convert.to_bytes(wstr);
}
int main()
{
// open the file
std::wifstream file("fruit.txt", std::ios::binary);
file.imbue(std::locale(file.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));
// create container for the lines
std::forward_list<std::string> lines;
// Add each line to the container
std::wstring line;
file.get(); // remove BOM
while(std::getline(file,line)) lines.emplace_front(wstring_to_string(line));
// Printing to cout
for( auto it = lines.cbegin(); it != lines.cend(); ++it )
std::cout << *it << std::endl;
return 0;
}

C++ _findfirst and TCHAR

I've been given the following code:
int _tmain(int argc, _TCHAR* argv[]) {
_finddata_t dirEntry;
intptr_t dirHandle;
dirHandle = _findfirst("C:/*", &dirEntry);
int res = (int)dirHandle;
while(res != -1) {
cout << dirEntry.name << endl;
res = _findnext(dirHandle, &dirEntry);
}
_findclose(dirHandle);
cin.get();
return (0);
}
what this does is printing the name of everything that the given directory (C:) contains. Now I have to make this print out the name of everything in the subdirectories (if there are any) as well. I've got this so far:
int _tmain(int argc, _TCHAR* argv[]) {
_finddata_t dirEntry;
intptr_t dirHandle;
dirHandle = _findfirst(argv[1], &dirEntry);
vector<string> dirArray;
int res = (int)dirHandle;
unsigned int attribT;
while (res != -1) {
cout << dirEntry.name << endl;
res = _findnext(dirHandle, &dirEntry);
attribT = (dirEntry.attrib >> 4) & 1; //put the fifth bit into a temporary variable
//the fifth bit of attrib says if the current object that the _finddata instance contains is a folder.
if (attribT) { //if it is indeed a folder, continue (has been tested and confirmed already)
dirArray.push_back(dirEntry.name);
cout << "Pass" << endl;
//res = _findfirst(dirEntry.name, &dirEntry); //needs to get a variable which is the dirEntry.name combined with the directory specified in argv[1].
}
}
_findclose(dirHandle);
std::cin.get();
return (0);
}
Now I'm not asking for the whole solution (I want to be able to do it on my own) but there is just this one thing I can't get my head around which is the TCHAR* argv. I know argv[1] contains what I put in my project properties under "command arguments", and right now this contains the directory I want to test my application in (C:/users/name/New folder/*), which contains some folders with subfolders and some random files.
The argv[1] currently gives the following error:
Error: argument of type "_TCHAR*" is incompatible with parameter of type "const char *"
Now I've googled for the TCHAR and I understand it is either a wchar_t* or a char* depending on using Unicode character set or multi-byte character set (I'm currently using Unicode). I also understand converting is a massive pain. So what I'm asking is: how can I best go around this with the _TCHAR and _findfirst parameter?
I'm planning to concat the dirEntry.name to the argv[1] as well as concatenating a "*" on the end, and using this in another _findfirst. Any comments on my code are appreciated as well since I'm still learning C++.
See here: _findfirst is for multibyte strings while _wfindfirst is for wide characters. If you use TCHAR in your code then use _tfindfirst (macro) which will resolve to _findfirst on non UNICODE, and _wfindfirst on UNICODE builds.
Also instead of _finddata_t use _tfinddata_t which will also resolve to correct structure depending on UNICODE config.
Another thing is that you should use also correct literal, _T("C:/*") will be L"C:/*" on UNICODE build, and "C:/*" otherwise. If you know you are building with UNICODE defined, then use std::vector<std::wstring>.
btw. Visual Studio by default will create project with UNICODE, you may use only wide versions of functions like _wfindfirst as there is no good reason to build non UNICODE projects.
TCHAR and I understand it is either a wchar_t* or a char* depending on using UTF-8 character set or multi-byte character set (I'm currently using UTF-8).
this is wrong, in UNICODE windows apis uses UTF-16. sizeof(wchar_t)==2.
Use this simple typedef:
typedef std::basic_string<TCHAR> TCharString;
Then use TCharString wherever you were using std::string, such as here:
vector<TCharString> dirArray;
See here for information on std::basic_string.

Stumped with Unicode, Boost, C++, codecvts

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.
But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.
My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows), I've been trying to avoid wchar_t.
The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.
#include <string>
#include <boost/locale.hpp>
#include <locale>
int main(void)
{
std::string data("Testing, 㤹");
std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
std::locale toLoc = boost::locale::generator().generate("en_US.UTF-32");
typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);
std::locale convLoc = std::locale(fromLoc, toCvt);
std::cout.imbue(convLoc);
std::cout << data << std::endl;
// Output is unconverted -- what?
return 0;
}
I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?
Okay, after a long few months I've figured it out, and I'd like to help people in the future.
First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there's others not based on locales).
#include <boost/locale.hpp>
namespace loc = boost::locale;
int main(void)
{
loc::generator gen;
std::locale blah = gen.generate("en_US.utf-32");
std::string UTF8String = "Tésting!";
// from_utf will also work with wide strings as it uses the character size
// to detect the encoding.
std::string converted = loc::conv::from_utf(UTF8String, blah);
// Outputs a UTF-32 string.
std::cout << converted << std::endl;
return 0;
}
As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.
I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.
As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.
std::cout.imbue(convLoc);
std::cout << data << std::endl;
This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.
The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.
However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.
A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.
Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h" // CmdLineArgs
#include "throwx.h" // throwX, hopefully
#include "string_conversions.h" // ansiOrFillerFrom( wstring )
#include <boost/filesystem/fstream.hpp> // boost::filesystem::ifstream
#include <iostream> // std::cout, std::cerr, std::endl
#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;
inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }
int main()
{
try
{
CmdLineArgs const args;
wstring const programPath = args.at( 0 );
hopefully( args.nArgs() == 2 )
|| throwX( "Usage: " + ansi( programPath ) + " FILENAME" );
wstring const filePath = args.at( 1 );
bfs::ifstream stream( filePath ); // Nice Boost ifstream subclass.
hopefully( !stream.fail() )
|| throwX( "Failed to open file '" + ansi( filePath ) + "'" );
string line;
while( getline( stream, line ) )
{
cout << line << endl;
}
hopefully( stream.eof() )
|| throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );
return EXIT_SUCCESS;
}
catch( exception const& x )
{
cerr << "!" << x.what() << endl;
}
return EXIT_FAILURE;
}

how to get system or user temp folder in unix and windows?

I am writing a C++ problem. It need to work on both Windows and Unix OS.
How to get user or system tmp folder on different OS?
Update: Thanks #RoiDanton, the most up to date answer is std::filesystem::temp_directory_path (C++17)
Try boost::filesystem's temp_directory_path() which internally uses:
ISO/IEC 9945 (POSIX): The path supplied by the first environment variable found in the list TMPDIR, TMP, TEMP, TEMPDIR. If none of these are found, "/tmp", or, if macro __ANDROID__ is defined, "/data/local/tmp"
Windows: The path reported by the Windows GetTempPath API function.
Interestingly, Window's GetTempPath uses similar logic to the POSIX version: the first environment variable in the list TMP, TEMP, USERPROFILE. If none of these are found, it returns the Windows directory.
The fact that these methods primarily rely on environment variables seems a bit yuck. But thats how it seems to be determined. Seeing as how mundane it really is, you could easily roll your own using cstdlib's getenv function, especially if you want specific order prioritization/requirements or dont want to use another library.
Use the $TMPDIR environment variable, according to POSIX.
char const *folder = getenv("TMPDIR");
if (folder == 0)
folder = "/tmp";
if you use QT(Core) you can try QString QDir::tempPath() , or use it's implementation in your code (QT is open, so, check how they do).
The doc say : On Unix/Linux systems this is usually /tmp; on Windows this is usually the path in the TEMP or TMP environment variable.
According to the docs, the max path is MAX_PATH (260). If the path happens to be 260, the code in the sample above (als plougy) will fail because 261 will be returned. Probably the buffer size should be MAX_PATH + 1.
TCHAR szPath[MAX_PATH + 1];
DWORD result = GetTempPath(MAX_PATH + 1, szPath);
if (result != ERROR_SUCCESS) {
// check GetLastError()
}
Handy function :
std::string getEnvVar( std::string const & key )
{
char * val = getenv( key.c_str() );
return val == NULL ? std::string("") : std::string(val);
}
I guess TEMP or something could be passed as an argument? Depending on the OS of course. getenv is part of stdlib so this should also be portable.
If you get an access to main() function code, may be better is to put necessary folder names through the main()'s **argv and use an OS-dependend batch launcher.
For example, for UNIX
bash a_launcher.sh
where a_launcher.sh is like
./a.out /tmp
On Windows: Use GetTempPath() to retrieve the path of the directory designated for temporary files.
wstring TempPath;
wchar_t wcharPath[MAX_PATH];
if (GetTempPathW(MAX_PATH, wcharPath))
TempPath = wcharPath;
None of these examples are really concrete and provide a working example (besides std::filesystem::temp_directory_path) rather they're referring you to microsoft's documentation, here's a working example using "GetTempPath()" (tested on windows 10):
//temp.cpp
#include <iostream>
#include <windows.h>
int main()
{
TCHAR path_buf[MAX_PATH];
DWORD ret_val = GetTempPath(MAX_PATH, path_buf);
if ( ret_val > MAX_PATH || (ret_val == 0) )
{
std::cout << "GetTempPath failed";
} else {
std::cout << path_buf;
}
}
outputs:
C:\>temp.exe
C:\Users\username\AppData\Local\Temp\

find window text and save txt to file named that wont work

my code wont work and idk why. the point of my code is to find the top window and save a text file with the name the same as the text on the top menu bar (task bar i think?). then save some data to that text file. but everytime i try to use it the write fails if i set the name of the text file before hand so it wont change it will write the data to the file. but if i don't set it before hand it will make the text doc but not write anything to it. or sometimes it will just write numbers for the name (i think it's the handle number) then it will write the data. :\ it's odd can anyone help?
#include <iostream>
#include <windows.h>
#include <fstream>
#include <string>
#include <sstream>
#include <time.h>
using namespace std;
string header_str = ("NULL");
#define DTTMFMT "%Y-%m-%d %H:%M:%S "
#define DTTMSZ 21
char buff[DTTMSZ];
fstream filestr;
string ff = ("C:\\System logs\\txst.txt");
TCHAR buf[255];
int main()
{
GetWindowText(GetForegroundWindow(), buf, 255);
stringstream header(stringstream::in | stringstream::out);
header.flush();
header << ("C:\\System logs\\");
header << buf;
header << (".txt");
header_str = header.str();
ff = header_str;
cout << header_str << "\n";
filestr.open (ff.c_str(), fstream::in | fstream::out | fstream::app | ios_base::binary | ios_base::out);
filestr << "dfg";
filestr.close();
Sleep(10000);
return 0;
}
You are not sanitizing the name of your text file. There are quite a few illegal file names. Primarily, characters such as ":", "/" and "\" are not allowed in a filename.
Your buffer containing the window title is of type TCHAR. In Visual C++ TCHAR expands to type wchar_t when the _UNICODE preprocessor macro is defined (I believe this is the default setting). You then use the non-wide version of stringstream, which when will convert a wchar_t* to a string representation of the pointer value rather than the string contents of the buffer.
You have 2 options:
Change your settings to not use the "Unicode Character Set". (Look under General Project Settings).
Use wstringstream and wfstream instead of stringstream and fstream. These will work with wchar_t.
You are probably compiling your application in unicode mode. So buf is actually wchar_t[255], and when you execute header<<buf, you actually output the pointer value of buf instead of its contents since header is a non-unicode string stream and doesn't know how to write unicode buffers.
You can either compile your app as non-unicode (In the general project settings, set character set to multi-byte), or use the unicode alternatives to stringstream and fstream (wstringstream, wfstream).
That's the main cause of the problem. In addition, theatrus is right about needing to remove illegal characters from the filename, and you should also check if GetForegroundWindow returned NULL or if GetWindowText returned an empty string.