On Windows, stat and GetFileAttributes fail for paths containing strange characters

On Windows, stat and GetFileAttributes fail for paths containing strange characters - c++

The code below demonstrates how stat and GetFileAttributes fail when the path contains some strange (but valid) ASCII characters.
As a workaround, I would use the 8.3 DOS file name. But this does not work when the drive has 8.3 names disabled.
(8.3 names are disabled with the fsutil command: fsutil behavior set disable8dot3 1).
Is it possible to get stat and/or GetFileAttributes to work in this case?
If not, is there another way of determining whether or not a path is a directory or file?
#include "stdafx.h"
#include <sys/stat.h>
#include <string>
#include <Windows.h>
#include <atlpath.h>
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
// The final characters in the path below are 0xc3 (Ã) and 0x3f (?).
// Create a test directory with the name Ã and set TEST_DIR below to your test directory.
const char* TEST_DIR = "D:\\tmp\\VisualStudio\\TestProject\\ConsoleApplication1\\test_data\\Ã";
int main()
{
std::string testDir = TEST_DIR;
// test stat and _wstat
struct stat st;
const auto statSucceeded = stat(testDir.c_str(), &st) == 0;
if (!statSucceeded)
{
printf("stat failed\n");
}
std::wstring testDirW = s2ws(testDir);
struct _stat64i32 stW;
const auto statSucceededW = _wstat(testDirW.data(), &stW) == 0;
if (!statSucceededW)
{
printf("_wstat failed\n");
}
// test PathIsDirectory
const auto isDir = PathIsDirectory(testDirW.c_str()) != 0;
if (!isDir)
{
printf("PathIsDirectory failed\n");
}
// test GetFileAttributes
const auto fileAttributes = ::GetFileAttributes(testDirW.c_str());
const auto getFileAttributesWSucceeded = fileAttributes != INVALID_FILE_ATTRIBUTES;
if (!getFileAttributesWSucceeded)
{
printf("GetFileAttributes failed\n");
}
return 0;
}

The problem you have encountered comes from using the MultiByteToWideChar function. Using CP_ACP can default to a code page that does not support some characters. If you change the default system code page to UTF8, your code will work. Since you cannot tell your clients what code page to use, you can use a third party library such as International Components for Unicode to convert from the host code page to UTF16.
I ran your code using console code page 65001 and VS2015 and your code worked as written. I also added positive printfs to verify that it did work.

Don't start with a narrow string literal and try to convert it, start with a wide string literal - one that represents the actual filename. You can use hexadecimal escape sequences to avoid any dependency on the encoding of the source code.
If the actual code doesn't use string literals, the best resolution depends on the situation; for example, if the file name is being read from a file, you need to make sure that you know what encoding the file is in and perform the conversion accordingly.
If the actual code reads the filename from the command line arguments, you can use wmain() instead of main() to get the arguments as wide strings.

Related

Renaming a file with an en dash in the name in C++

In the project I'm working on, I work with files and I check if they exists before proceeding. Renaming or even working with files featuring that 'en dash' in the file path seems impossible.
std::string _old = "D:\\Folder\\This – by ABC.txt";
std::rename(_old.c_str(), "New.txt");
here the _old variable is interpreted as D:\Folder\This û by ABC.txt
I tried
setlocale(LC_ALL, "");
//and
setlocale(LC_ALL, "C");
//or
setlocale(LC_ALL, "en_US.UTF-8");
but none of them worked.. What should be done?

It depends on the operation system. In Linux file names are simple byte arrays: forget about encoding and just rename the file.
But seems you are using Windows and file name is actually a null-terminated string containing 16-bit characters. In this case the best way is to use wstring instead of messing with encodings.
Don't try to write platform-independent code to solve platform-specific problems. Windows uses Unicode for file names so you have to write platform-specific code instead of using standard function rename.
Just write L"D:\\Folder\\This \u2013 by ABC.txt" and call _wrename.

The Windows ANSI Western encoding has the Unicode n-dash, U+2013, “–”, as code point 150 (decimal). When you output that to a console with active code page 437, the original IBM PC character set, or compatible, then it's interpreted as an “û”. So you have the right codepage 1252 character in your string literal, either because
you're using Visual C++, which defaults to the Windows ANSI codepage for encoding narrow string literals, or
you're using an old version of g++ that doesn't do the standard-mandated conversions and checking but just passes narrow character bytes directly through its machinery, and your source code is encoded as Windows ANSI Western (or compatible), or
something I didn't think of.
For either of the first two possibilities
the rename call will work.
I tested that it does indeed work with Visual C++. I do not have an old version of g++ around, but I tested that it works with version 5.1. That is, I tested that the file is really renamed to New.txt.
// Source encoding: UTF-8
// Execution character set: Windows ANSI Western a.k.a. codepage 1252.
#include <stdio.h> // rename
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <string> // std::string
using namespace std;
auto main()
-> int
{
string const a = ".\\This – by ABC.txt"; // Literal encoded as CP 1252.
return rename( a.c_str(), "New.txt" ) == 0? EXIT_SUCCESS : EXIT_FAILURE;
}
Example:
[C:\my\forums\so\265]
> dir /b *.txt
File Not Found
[C:\my\forums\so\265]
> g++ r.cpp -fexec-charset=cp1252
[C:\my\forums\so\265]
> type nul >"This – by ABC.txt"
[C:\my\forums\so\265]
> run a
Exit code 0
[C:\my\forums\so\265]
> dir /b *.txt
New.txt
[C:\my\forums\so\265]
> _
… where run is just a batch file that reports the exit code.
If your Windows ANSI codepage is not codepage 1252, then you need to use your particular Windows ANSI codepage.
You can check the Windows ANSI codepage via the GetACP API function, or e.g. via this command:
[C:\my\forums\so\265]
> wmic os get codeset /value | find "="
CodeSet=1252
[C:\my\forums\so\265]
> _
The code will work if that codepage supports the n-dash character.
This model of coding is based on having one version of the executable for each relevant main locale (including character encoding).
An alternative is to do everything in Unicode. This can be done portably via Boost file system, which will be adopted into the standard library in C++17. Or you can use the Windows API, or de facto standard extensions to the standard library in Windows, i.e. _rename.
Example of using the experimental file system module with Visual C++ 2015:
// Source encoding: UTF-8
// Execution character set: irrelevant (everything's done in Unicode).
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <filesystem> // In C++17 and later, or Visual C++ 2015 and later.
using namespace std::tr2::sys;
auto main()
-> int
{
path const old_path = L".\\This – by ABC.txt"; // Literal encoded as wide string.
path const new_path = L"New.txt";
try
{
rename( old_path, new_path );
return EXIT_SUCCESS;
}
catch( ... )
{}
return EXIT_FAILURE;
}
To do this properly for portable code you can use Boost, or you can create a wrapper header that uses whatever implementation is available.

It really platform dependant, Unicode is headache. Depends on which compiler you use. For older ones from MS (VS2010 or older), you would need use API described in MSDN. This test example creates file with name you have problem with, then renames it
// #define _UNICODE // might be defined in project
#include <string>
#include <tchar.h>
#include <windows.h>
using namespace std;
// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
if( wstr.empty() ) return std::string();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
std::string strTo( size_needed, 0 );
WideCharToMultiByte (CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
if( str.empty() ) return std::wstring();
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo( size_needed, 0 );
MultiByteToWideChar (CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
int _tmain(int argc, _TCHAR* argv[] ) {
std::string pFileName = "C:\\This \xe2\x80\x93 by ABC.txt";
std::wstring pwsFileName = utf8_decode(pFileName);
// can use CreateFile id instead
HANDLE hf = CreateFileW( pwsFileName.c_str() ,
GENERIC_READ | GENERIC_WRITE,
0,
0,
CREATE_NEW,
FILE_ATTRIBUTE_NORMAL,
0);
CloseHandle(hf);
MoveFileW(utf8_decode("C:\\This \xe2\x80\x93 by ABC.txt").c_str(), utf8_decode("C:\\This \xe2\x80\x93 by ABC 2.txt").c_str());
}
There is still problem with those helpers so that you can have a null terminated string.
std::string utf8_encode(const std::wstring &wstr)
{
std::string strTo;
char *szTo = new char[wstr.length() + 1];
szTo[wstr.size()] = '\0';
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, szTo, (int)wstr.length(), NULL, NULL);
strTo = szTo;
delete[] szTo;
return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
std::wstring wstrTo;
wchar_t *wszTo = new wchar_t[str.length() + 1];
wszTo[str.size()] = L'\0';
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, wszTo, (int)str.length());
wstrTo = wszTo;
delete[] wszTo;
return wstrTo;
}
a problem with size of character for conversion.. call to WideCharToMultiByte with 0 as the size of target buffer allows to get size of character required for conversion. It will then return the number of bytes needed for the target buffer size. All this juggling with code explains why the frameworks like Qt got so convoluted code to support Unicode-based file system. Actually, best cost-effective way to get rid of all possible bugs for you is to use such framework.
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name... but for file API, you still need do proper conversion

Is it possible to concatenate string and wstring? [duplicate]

string s = "おはよう";
wstring ws = FUNCTION(s, ws);
How would i assign the contents of s to ws?
Searched google and used some techniques but they can't assign the exact content. The content is distorted.

Assuming that the input string in your example (おはよう) is a UTF-8 encoded (which it isn't, by the looks of it, but let's assume it is for the sake of this explanation :-)) representation of a Unicode string of your interest, then your problem can be fully solved with the standard library (C++11 and newer) alone.
The TL;DR version:
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string narrow = converter.to_bytes(wide_utf16_source_string);
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
Longer online compilable and runnable example:
(They all show the same example. There are just many for redundancy...)
http://ideone.com/KA1oty
http://ide.geeksforgeeks.org/5pRLSh
http://rextester.com/DIJZK52174
Note (old):
As pointed out in the comments and explained in https://stackoverflow.com/a/17106065/6345 there are cases when using the standard library to convert between UTF-8 and UTF-16 might give unexpected differences in the results on different platforms. For a better conversion, consider std::codecvt_utf8 as described on http://en.cppreference.com/w/cpp/locale/codecvt_utf8
Note (new):
Since the codecvt header is deprecated in C++17, some worry about the solution presented in this answer were raised. However, the C++ standards committee added an important statement in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0618r0.html saying
this library component should be retired to Annex D, along side , until a suitable replacement is standardized.
So in the foreseeable future, the codecvt solution in this answer is safe and portable.

int StringToWString(std::wstring &ws, const std::string &s)
{
std::wstring wsTmp(s.begin(), s.end());
ws = wsTmp;
return 0;
}

Your question is underspecified. Strictly, that example is a syntax error. However, std::mbstowcs is probably what you're looking for.
It is a C-library function and operates on buffers, but here's an easy-to-use idiom, courtesy of Mooing Duck:
std::wstring ws(s.size(), L' '); // Overestimate number of code points.
ws.resize(std::mbstowcs(&ws[0], s.c_str(), s.size())); // Shrink to fit.

If you are using Windows/Visual Studio and need to convert a string to wstring you could use:
#include <AtlBase.h>
#include <atlconv.h>
...
string s = "some string";
CA2W ca2w(s.c_str());
wstring w = ca2w;
printf("%s = %ls", s.c_str(), w.c_str());
Same procedure for converting a wstring to string (sometimes you will need to specify a codepage):
#include <AtlBase.h>
#include <atlconv.h>
...
wstring w = L"some wstring";
CW2A cw2a(w.c_str());
string s = cw2a;
printf("%s = %ls", s.c_str(), w.c_str());
You could specify a codepage and even UTF8 (that's pretty nice when working with JNI/Java). A standard way of converting a std::wstring to utf8 std::string is showed in this answer.
//
// using ATL
CA2W ca2w(str, CP_UTF8);
//
// or the standard way taken from the answer above
#include <codecvt>
#include <string>
// convert UTF-8 string to wstring
std::wstring utf8_to_wstring (const std::string& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.from_bytes(str);
}
// convert wstring to UTF-8 string
std::string wstring_to_utf8 (const std::wstring& str) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(str);
}
If you want to know more about codepages there is an interesting article on Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
These CA2W (Convert Ansi to Wide=unicode) macros are part of ATL and MFC String Conversion Macros, samples included.
Sometimes you will need to disable the security warning #4995', I don't know of other workaround (to me it happen when I compiled for WindowsXp in VS2012).
#pragma warning(push)
#pragma warning(disable: 4995)
#include <AtlBase.h>
#include <atlconv.h>
#pragma warning(pop)
Edit:
Well, according to this article the article by Joel appears to be: "while entertaining, it is pretty light on actual technical details". Article: What Every Programmer Absolutely, Positively Needs To Know About Encoding And Character Sets To Work With Text.

Windows API only, pre C++11 implementation, in case someone needs it:
#include <stdexcept>
#include <vector>
#include <windows.h>
using std::runtime_error;
using std::string;
using std::vector;
using std::wstring;
wstring utf8toUtf16(const string & str)
{
if (str.empty())
return wstring();
size_t charsNeeded = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), NULL, 0);
if (charsNeeded == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
vector<wchar_t> buffer(charsNeeded);
int charsConverted = ::MultiByteToWideChar(CP_UTF8, 0,
str.data(), (int)str.size(), &buffer[0], buffer.size());
if (charsConverted == 0)
throw runtime_error("Failed converting UTF-8 string to UTF-16");
return wstring(&buffer[0], charsConverted);
}

Here's a way to combining string, wstring and mixed string constants to wstring. Use the wstringstream class.
This does NOT work for multi-byte character encodings. This is just a dumb way of throwing away type safety and expanding 7 bit characters from std::string into the lower 7 bits of each character of std:wstring. This is only useful if you have a 7-bit ASCII strings and you need to call an API that requires wide strings.
#include <sstream>
std::string narrow = "narrow";
std::wstring wide = L"wide";
std::wstringstream cls;
cls << " abc " << narrow.c_str() << L" def " << wide.c_str();
std::wstring total= cls.str();

From char* to wstring:
char* str = "hello worlddd";
wstring wstr (str, str+strlen(str));
From string to wstring:
string str = "hello worlddd";
wstring wstr (str.begin(), str.end());
Note this only works well if the string being converted contains only ASCII characters.

This variant of it is my favourite in real life. It converts the input, if it is valid UTF-8, to the respective wstring. If the input is corrupted, the wstring is constructed out of the single bytes. This is extremely helpful if you cannot really be sure about the quality of your input data.
std::wstring convert(const std::string& input)
{
try
{
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
return converter.from_bytes(input);
}
catch(std::range_error& e)
{
size_t length = input.length();
std::wstring result;
result.reserve(length);
for(size_t i = 0; i < length; i++)
{
result.push_back(input[i] & 0xFF);
}
return result;
}
}

using Boost.Locale:
ws = boost::locale::conv::utf_to_utf<wchar_t>(s);

You can use boost path or std path; which is a lot more easier.
boost path is easier for cross-platform application
#include <boost/filesystem/path.hpp>
namespace fs = boost::filesystem;
//s to w
std::string s = "xxx";
auto w = fs::path(s).wstring();
//w to s
std::wstring w = L"xxx";
auto s = fs::path(w).string();
if you like to use std:
#include <filesystem>
namespace fs = std::filesystem;
//The same
c++ older version
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
//The same
The code within still implement a converter which you dont have to unravel the detail.

For me the most uncomplicated option without big overhead is:
Include:
#include <atlbase.h>
#include <atlconv.h>
Convert:
char* whatever = "test1234";
std::wstring lwhatever = std::wstring(CA2W(std::string(whatever).c_str()));
If needed:
lwhatever.c_str();

String to wstring
std::wstring Str2Wstr(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
wstring to String
std::string Wstr2Str(const std::wstring& wstr)
{
typedef std::codecvt_utf8<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
return converterX.to_bytes(wstr);
}

If you have QT and if you are lazy to implement a function and stuff you can use
std::string str;
QString(str).toStdWString()

Here is my super basic solution that might not work for everyone. But would work for a lot of people.
It requires usage of the Guideline Support Library.
Which is a pretty official C++ library that was designed by many C++ committee authors:
https://github.com/isocpp/CppCoreGuidelines
https://github.com/Microsoft/GSL
std::string to_string(std::wstring const & wStr)
{
std::string temp = {};
for (wchar_t const & wCh : wStr)
{
// If the string can't be converted gsl::narrow will throw
temp.push_back(gsl::narrow<char>(wCh));
}
return temp;
}
All my function does is allow the conversion if possible. Otherwise throw an exception.
Via the usage of gsl::narrow (https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#es49-if-you-must-use-a-cast-use-a-named-cast)

method s2ws works well. Hope helps.
std::wstring s2ws(const std::string& s) {
std::string curLocale = setlocale(LC_ALL, "");
const char* _Source = s.c_str();
size_t _Dsize = mbstowcs(NULL, _Source, 0) + 1;
wchar_t *_Dest = new wchar_t[_Dsize];
wmemset(_Dest, 0, _Dsize);
mbstowcs(_Dest,_Source,_Dsize);
std::wstring result = _Dest;
delete []_Dest;
setlocale(LC_ALL, curLocale.c_str());
return result;
}

Based upon my own testing (On windows 8, vs2010) mbstowcs can actually damage original string, it works only with ANSI code page. If MultiByteToWideChar/WideCharToMultiByte can also cause string corruption - but they tends to replace characters which they don't know with '?' question marks, but mbstowcs tends to stop when it encounters unknown character and cut string at that very point. (I have tested Vietnamese characters on finnish windows).
So prefer Multi*-windows api function over analogue ansi C functions.
Also what I've noticed shortest way to encode string from one codepage to another is not use MultiByteToWideChar/WideCharToMultiByte api function calls but their analogue ATL macros: W2A / A2W.
So analogue function as mentioned above would sounds like:
wstring utf8toUtf16(const string & str)
{
USES_CONVERSION;
_acp = CP_UTF8;
return A2W( str.c_str() );
}
_acp is declared in USES_CONVERSION macro.
Or also function which I often miss when performing old data conversion to new one:
string ansi2utf8( const string& s )
{
USES_CONVERSION;
_acp = CP_ACP;
wchar_t* pw = A2W( s.c_str() );
_acp = CP_UTF8;
return W2A( pw );
}
But please notice that those macro's use heavily stack - don't use for loops or recursive loops for same function - after using W2A or A2W macro - better to return ASAP, so stack will be freed from temporary conversion.

std::string -> wchar_t[] with safe mbstowcs_s function:
auto ws = std::make_unique<wchar_t[]>(s.size() + 1);
mbstowcs_s(nullptr, ws.get(), s.size() + 1, s.c_str(), s.size());
This is from my sample code

use this code to convert your string to wstring
std::wstring string2wString(const std::string& s){
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
int main(){
std::wstring str="your string";
std::wstring wStr=string2wString(str);
return 0;
}

string s = "おはよう"; is an error.
You should use wstring directly:
wstring ws = L"おはよう";

C++: socket encoding (working with TeamSpeak)

As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users which I'm doing with sockets - that's working fine so far.In my UI I'm displaying all clients in a ListBox which is basically working. Nevertheless I'm having problems with wrong displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
i = 0;
queryString.str("");
queryString.clear();
queryString << clientlist << " \n";
send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
TeamSpeak::getAnswer(1);
while(p_1 != -1){
p_1 = lastLog.find(L"client_nickname=", sPos + 1);
if(p_1 != -1){
sPos = p_1;
p_2 = lastLog.find(L" ", p_1);
temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
users[i].assign(temporary.begin(), temporary.end());
SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
i++;
}
else{
sPos = 0;
p_1 = 0;
break;
}
}
TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), but all of them have no encoding problem with characters or symbols (for example Andrè). If I add a string directly:SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè", it is displayed correctly in the ListBox.What might be the issue here, is it a problem with my code or something else?
Update 1:I recently continued working on this problem and considered the word Olè! receiving it from the socket. The result I got, is the following:O (79) | l (108) | � (-61) | � (-88) | ! (33).How can I convert this char array to a wstring containing the correct characters?
Solution: As #isanae mentioned in his post, the std::wstring_convert-template did the trick for me, thank you very much!

Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.to_bytes(s);
}
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
// getting the required size in characters (not bytes) of the
// output buffer
const int size = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
nullptr, 0);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);
// converting from utf8 to utf16
const int written = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
buffer.get(), size);
// error handling
assert(written != 0);
return std::wstring(buffer.get(), buffer.get() + written);
}
std::string utf16_to_utf8(const std::wstring& ws)
{
// getting the required size in bytes of the output buffer
const int size = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
nullptr, 0, nullptr, nullptr);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<char[]> buffer(new char[size]);
// converting from utf16 to utf8
const int written = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
buffer.get(), size, nullptr, nullptr);
// error handling
assert(written != 0);
return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);

Store non-English string in std::string

I have a simple string in std::wstring
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
I want to store this string in a std::string.
I have tried the below code but the result is not the same as input string
std::wstring tempStr = _T("F:\\Projects\\Current_자동_\\Cam.xml");
//setup converter
typedef std::codecvt_utf8_utf16 <wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
//use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( tempStr );
The Korean string present in the input string is converted to "ìžë™".
Is there any way I can get the same string in std::string?
Expected result:
converted_str should contain F:\Projects\Current_자동_\Cam.xml
Below is an screenshot of debugging showing 3 values in 3 scenarios (conversion in 3 ways). But none of them gives the desired value.

Your conversion code is fine.
In fact, in UTF-8 (the string you store in std::string), the characters 자동 corresponds to:
자 (UTF-16 0xC790) ---> UTF-8: EC 9E 90
동 (UTF-16 0xB3D9) ---> UTF-8: EB 8F 99
If you run the following program, which just prints the converted UTF-8 bytes, you get this output:
ec 9e 90 eb 8f 99
#include <iomanip> // For std::hex
#include <iostream> // For console output
#include <string> // For STL strings
#include <codecvt> // For Unicode conversions
void print_char_hex(const char ch)
{
auto * p = reinterpret_cast<const unsigned char*>(&ch);
int i = *p;
std::cout << std::hex << i << ' ';
}
int main()
{
std::wstring utf16_str = L"\xC790\xB3D9";
// setup converter
typedef std::codecvt_utf8_utf16<wchar_t> convert_type;
std::wstring_convert<convert_type, wchar_t> converter;
// use converter (.to_bytes: wstr->str, .from_bytes: str->wstr)
std::string converted_str = converter.to_bytes( utf16_str );
// Output the converted bytes (UTF-8)
for (size_t i = 0; i < converted_str.length(); ++i)
{
print_char_hex(converted_str[i]);
}
std::cout << std::endl;
}

I think the best solution would be to use the wide-char APIs to open the file, e.g. CreateFileW(...);, because then you can use the wide-char file name directly.
If this is not possible, maybe the string should not be converted to UTF8, but to the system default ANSI code page.
I think this might work:
char out[200];
wchar_t * in = L"F:\\Projects\\Current_자동_\\Cam.xml";
WideCharToMultiByte(CP_ACP, 0, in, 100, out, 100, 0, 0);
or maybe another Korean code page:
WideCharToMultiByte(949, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(1361, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(10003, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(20833, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(20949, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(50225, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(50933, 0, in, 100, out, 100, 0, 0);
WideCharToMultiByte(51949, 0, in, 100, out, 100, 0, 0);
The code page ids can be found here:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
good luck :-)

This works.. You can tell because the conversion back to UTF16 is valid.. If you write the UTF8 string to a file, it will also display properly. This way, you now have two ways of validating that it works.
// UTF16ToUTF8.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <windows.h>
#include <iostream>
#include <codecvt>
std::wstring ToUTF16(const std::string &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(data);
}
std::string ToUTF8(const std::wstring &data)
{
return std::wstring_convert<std::codecvt_utf8<wchar_t>>().to_bytes(data);
}
int _tmain(int argc, _TCHAR* argv[])
{
std::wstring u16 = L"_자동_";
std::string u8 = ToUTF8(u16);
MessageBoxW(NULL, ToUTF16(u8).c_str(), L"", 0);
std::cin.get();
return 0;
}

You can store UTF-8 in std:string as regular char sequence. Here's library with some useful things, such as length() and everything about indexing, you may want to have http://utfcpp.sourceforge.net/.
For windows console you need to set codepage to 65001 and will become UTF-8.
Sadly or not, std::wstring and whole wchar_t thing doesn't specify any specific encoding.
By the way, you're using Managed C++, why wouldn't you use .NET Framework's System::String^? There's no problems with encodings at all. http://msdn.microsoft.com/ru-ru/library/system.string(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp

The problem is not in your string conversion code. This is a typical source file encoding problem. Visual studio does not use Unicode as default so you should convert your source file's encoding to UTF-8 yourself. To make this conversion you can open your file with notepad++ and click Encoding->Convert to UTF-8
Note1: In VS2010 and vs2012 if you write non-ascii characters to a source file visual studio now warns you and offers to make this conversion.
Note2: From your use of macro _T() I predict this is targeted only to Windows. If you try to build UTF-8 encoded source files that contains BOM with gcc you may get different errors. In any case the best approach would be to read your UTF-8 encoded text data from a file during run-time.

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to_executable> <path-to-file> and then call popen() on this command. Right now, both these two components are declared in a char*. I need to read the output of the command so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!

You'll need an UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, and not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>
char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
char result_buffer[MAXPATHLEN];
iconv_t converter = iconv_open("UTF-8", "wchar_t");
char* result = result_buffer;
char* input = (char*)wchar;
size_t output_available_size = sizeof result_buffer;
size_t input_available_size = utf32_bytes;
size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
if (result_code == -1)
{
perror("iconv");
return NULL;
}
iconv_close(converter);
return strdup(result_buffer);
}
int main()
{
wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
char* utf8 = utf8path(hello_world, sizeof hello_world);
printf("%s\n", utf8);
free(utf8);
return 0;
}
The utf8_hello_world function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.

Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the UTF-16 to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

On Windows, stat and GetFileAttributes fail for paths containing strange characters - c++

Related

Renaming a file with an en dash in the name in C++

Is it possible to concatenate string and wstring? [duplicate]

C++: socket encoding (working with TeamSpeak)

Store non-English string in std::string

Call popen() on a command with Chinese characters on Mac

Categories

Resources