Open UTF-8 encoded filename in C++ on Windows

Consider the following code:
#include <iostream>
#include <boost\locale.hpp>
#include <Windows.h>
#include <fstream>

std::string ToUtf8(std::wstring str)
{
    std::string ret;
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}

int main()
{
    std::wstring wfilename = L"D://Private//Test//एउटा फोल्दर//भित्रको फाईल.txt";
    std::string utf8path = ToUtf8(wfilename);
    std::ifstream iFileStream(utf8path, std::ifstream::in | std::ifstream::binary);
    if (iFileStream.is_open())
    {
        std::cout << "Opened the File\n";
        // Do the work here.
    }
    else
    {
        std::cout << "Cannot open the file\n";
    }
    return 0;
}
When I run this, the file fails to open and the else block executes. Even using boost::locale::conv::from_utf(utf8path, "utf_8") instead of utf8path doesn't work. The code works if I use wifstream with wfilename as its parameter, but I don't want to use wifstream. Is there any way to open the file with its name UTF-8 encoded? I am using Visual Studio 2010.

On Windows, you MUST use 8-bit ANSI (and it must match the user's locale) or UTF-16 for filenames; there is no other option available. You can keep using std::string and UTF-8 in your main code, but you will have to convert UTF-8 filenames to UTF-16 when opening files. Less efficient, but that is what you need to do.
Fortunately, VC++'s implementation of std::ifstream and std::ofstream have non-standard overloads of their constructors and open() methods to accept wchar_t* strings for UTF-16 filenames.
explicit basic_ifstream(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::in,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::in,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode
);

explicit basic_ofstream(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::out,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode = ios_base::out,
    int _Prot = (int)ios_base::_Openprot
);

void open(
    const wchar_t *_Filename,
    ios_base::openmode _Mode
);
You will have to use an #ifdef to detect Windows compilation (unfortunately, different C++ compilers identify that differently) and temporarily convert your UTF-8 string to UTF-16 when opening a file.
#ifdef _MSC_VER
std::wstring ToUtf16(std::string str)
{
    std::wstring ret;
    int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0);
    if (len > 0)
    {
        ret.resize(len);
        MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len);
    }
    return ret;
}
#endif

int main()
{
    std::string utf8path = ...;
    std::ifstream iFileStream(
#ifdef _MSC_VER
        ToUtf16(utf8path).c_str()
#else
        utf8path.c_str()
#endif
        , std::ifstream::in | std::ifstream::binary);
    ...
    return 0;
}
Note that this is only guaranteed to work in VC++. Other C++ compilers for Windows are not guaranteed to provide similar extensions.
UPDATE: as of Windows 10 Insider Preview Build 17035, Microsoft now supports UTF-8 as a system-wide encoding that users can set their locale to. And as of Windows 10 Version 1903 (build 18362), applications can now opt in via their app manifest to use UTF-8 as a process-wide codepage, even if the user locale is not set to UTF-8. These features allow ANSI-based APIs (like CreateFileA(), which std::ifstream/std::ofstream use internally) to work with UTF-8 strings. So, in theory, with this feature turned on, you might be able to pass a UTF-8 encoded string to std::ifstream/std::ofstream and it would "just work". I can't confirm that, as it very much depends on the implementation. It would be safer to stick with passing in UTF-16 filenames, since that is Windows' native encoding, which the ANSI APIs will simply convert to internally.
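The manifest opt-in itself lives in the application's XML manifest rather than in code, but a program can check at runtime whether it took effect. A minimal sketch (the path string is a placeholder; GetACP() reports the process's active ANSI code page):
#include <Windows.h>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    if (GetACP() == CP_UTF8) // CP_UTF8 == 65001
    {
        // With the UTF-8 code page active, ANSI APIs such as CreateFileA()
        // interpret char* names as UTF-8, so a UTF-8 std::string path
        // can be handed straight to std::ifstream.
        std::string utf8path = "..."; // placeholder: UTF-8 bytes of the filename
        std::ifstream f(utf8path, std::ios::binary);
        std::cout << (f.is_open() ? "Opened the file\n" : "Could not open the file\n");
    }
    else
    {
        std::cout << "Active ANSI code page is " << GetACP() << ", not UTF-8\n";
    }
    return 0;
}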

You can use std::filesystem::u8path in C++17 (available earlier through std::experimental::filesystem):
std::filesystem::path pa = std::filesystem::u8path(yourStdStringPath.c_str());
std::ofstream ofs(pa);
u8path is deprecated in C++20, where you can construct a path directly from a u8-prefixed (char8_t) string instead.
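A minimal sketch of both spellings (the Devanagari filename is carried over from the question; the narrow literal assumes this source file is compiled as UTF-8, e.g. with /utf-8 on MSVC):
#include <filesystem>
#include <fstream>
#include <string>

int main()
{
    // C++17 spelling: u8path() interprets the bytes of a std::string as UTF-8.
    std::string utf8path = "भित्रको फाईल.txt";
    std::filesystem::path p17 = std::filesystem::u8path(utf8path);

    // C++20 spelling: u8 literals produce char8_t, which path always treats as UTF-8.
    std::filesystem::path p20{u8"भित्रको फाईल.txt"};

    std::ifstream ifs(p17, std::ios::binary);
    return ifs.is_open() ? 0 : 1;
}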

Related

Strange Unicode error when converting Chinese wide strings to regular strings in C++

Some of my Chinese software users noticed a strange C++ exception being thrown when my C++ code for Windows tried to list all running processes:
在多字节的目标代码页中,没有此 Unicode 字符可以映射到的字符。
Translated to English this roughly means:
There are no characters to which this Unicode character can be mapped
in the multi-byte target code page.
The code which prints this is:
try
{
    list_running_processes();
}
catch (std::runtime_error &exception)
{
    LOG_S(ERROR) << exception.what();
    return EXIT_FAILURE;
}
The most likely culprit source code is:
std::vector<running_process_t> list_running_processes()
{
    std::vector<running_process_t> running_processes;
    const auto snapshot_handle = unique_handle(CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0));
    if (snapshot_handle.get() == INVALID_HANDLE_VALUE)
    {
        throw std::runtime_error("CreateToolhelp32Snapshot() failed");
    }

    PROCESSENTRY32 process_entry{};
    process_entry.dwSize = sizeof process_entry;
    if (Process32First(snapshot_handle.get(), &process_entry))
    {
        do
        {
            const auto process_id = process_entry.th32ProcessID;
            const auto executable_file_path = get_file_path(process_id);
            // *** HERE ***
            const auto process_name = wide_string_to_string(process_entry.szExeFile);
            running_processes.emplace_back(executable_file_path, process_name, process_id);
        } while (Process32Next(snapshot_handle.get(), &process_entry));
    }
    return running_processes;
}
Or alternatively:
std::string get_file_path(const DWORD process_id)
{
    std::string file_path;
    const auto snapshot_handle = unique_handle(CreateToolhelp32Snapshot(TH32CS_SNAPMODULE, process_id));

    MODULEENTRY32W module_entry32{};
    module_entry32.dwSize = sizeof(MODULEENTRY32W);
    if (Module32FirstW(snapshot_handle.get(), &module_entry32))
    {
        do
        {
            if (module_entry32.th32ProcessID == process_id)
            {
                return wide_string_to_string(module_entry32.szExePath); // *** HERE ***
            }
        } while (Module32NextW(snapshot_handle.get(), &module_entry32));
    }
    return file_path;
}
This is the code for performing a conversion from a std::wstring to a regular std::string:
std::string wide_string_to_string(const std::wstring& wide_string)
{
    if (wide_string.empty())
    {
        return std::string();
    }

    const auto size_needed = WideCharToMultiByte(CP_UTF8, 0, &wide_string.at(0),
        static_cast<int>(wide_string.size()), nullptr, 0, nullptr, nullptr);
    std::string str_to(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wide_string.at(0), static_cast<int>(wide_string.size()), &str_to.at(0),
        size_needed, nullptr, nullptr);
    return str_to;
}
Is there any reason this can fail on Chinese-language file paths or on Chinese-language Windows? The code works fine on regular Western Windows machines. Let me know if I'm missing any crucial pieces of information; I cannot debug or test this myself right now without access to one of the affected machines.
I managed to test on a Chinese machine, and it turns out that converting a file path from a wide string to a regular string produces a bad file path if the path contains e.g. Chinese (non-ASCII) characters.
I could fix this bug by replacing calls to wide_string_to_string() with std::filesystem::path(wide_string_file_path).string(), since the std::filesystem API handles the conversion correctly for file paths, unlike wide_string_to_string().
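A minimal sketch of that replacement (wide_string_file_path stands in for any wide path from the snippets above):
#include <filesystem>
#include <string>

std::string to_narrow_path(const std::wstring& wide_string_file_path)
{
    // .string() converts to the native narrow encoding; use .u8string()
    // instead if the caller specifically needs UTF-8 bytes.
    return std::filesystem::path(wide_string_file_path).string();
}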

Bitmap file fails to open

I run the following code and it always returns null, because the bitmap file fails to open for some reason. Please help!
const XCHAR* szFilePathW = L"C:\\Users\\Simrat\\Desktop";
std::ofstream bmpF;
char szFilePathA[MSO_MAX_PATH]; // std::ofstream.open() takes char* in the Android C++ compiler, whereas it takes both char* and wchar_t* in VC++
WideCharToMultiByte(CP_UTF8, 0, szFilePathW, -1, szFilePathA, MSO_MAX_PATH, NULL, NULL);
bmpF.open(szFilePathA, std::ofstream::binary | std::ofstream::out);
if (!bmpF.is_open())
    return null;
The documentation for the WideCharToMultiByte() function can be found here:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx
As commented, you appear to be missing a filename. And the code should simply be:
const WCHAR* szFilePath = L"C:\\Users\\Simrat\\Desktop\\name_of_bitmap_file_I_want_to_create.bmp";
std::ofstream bmpF;
bmpF.open (szFilePath, std::ofstream::binary | std::ofstream::out);
No need to convert to multibyte; see MSDN (note that open() has a const wchar_t* overload for the filename parameter on Windows).

WideCharToMultiByte std analog UTF8

This code is working properly for me:
std::wstring wmsg_text = L"キエオイウカクケコサシスセソタチツテア";
char buffer[100] = { 0 };
WideCharToMultiByte(CP_UTF8, 0, wmsg_text.data(), wmsg_text.size(), buffer, sizeof(buffer)-1, NULL, NULL);
I am wondering what the cross-platform analog of this code would be. I have looked at std::wcstombs and std::codecvt_utf8, but I can't figure out the right way to use them.
You want to use std::wcsrtombs, something like:
std::wstring wmsg_text = L"キエオイウカクケコサシスセソタチツテア";
const wchar_t* wstr = wmsg_text.data();
std::mbstate_t state = std::mbstate_t();
std::size_t len = std::wcsrtombs(nullptr, &wstr, 0, &state);
if (len == static_cast<std::size_t>(-1))
{
    // conversion failed: the string contains a character that is not
    // representable in the current locale's narrow encoding
}
std::vector<char> mbstr(len + 1);
std::wcsrtombs(&mbstr[0], &wstr, mbstr.size(), &state);
char* buffer = mbstr.data();
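One caveat the snippet above glosses over: std::wcsrtombs converts to the current C locale's narrow encoding, which is only UTF-8 if a UTF-8 locale is active. A minimal sketch (the locale name "en_US.UTF-8" is an assumption; available names vary by platform):
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::string wide_to_narrow(const std::wstring& w)
{
    // Select a UTF-8 locale so wcsrtombs actually emits UTF-8.
    // The set of installed locale names differs between platforms.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const wchar_t* src = w.c_str();
    std::mbstate_t state{};
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return {}; // a wide character was not representable

    std::vector<char> buf(len + 1);
    src = w.c_str();
    state = std::mbstate_t{};
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), len);
}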
This code is working properly, too:
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
std::string u8str = conv.to_bytes(msg);
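Note that std::codecvt_utf8 treats the wide input as UCS-2/UCS-4 and mangles surrogate pairs; on Windows, where wchar_t holds UTF-16, std::codecvt_utf8_utf16 is the safer facet. Both are deprecated since C++17 but still shipped by the major standard libraries. A minimal sketch:
#include <codecvt>
#include <locale>
#include <string>

// Converts UTF-16 held in wchar_t (the Windows case), including
// surrogate pairs, to UTF-8.
std::string utf16_to_utf8(const std::wstring& w)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(w);
}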

How to convert BSTR to std::string in Visual Studio C++ 2010?

I am working on a COM dll. I wish to convert a BSTR to a std::string to pass to a method that takes a const reference parameter.
It seems that using _com_util::ConvertBSTRToString() to get the char* equivalent of the BSTR is an appropriate way to do so. However, the API documentation is sparse, and the implementation is potentially buggy:
http://msdn.microsoft.com/en-us/library/ewezf1f6(v=vs.100).aspx
http://www.codeproject.com/Articles/1969/BUG-in-_com_util-ConvertStringToBSTR-and-_com_util
Example:
#include <comutil.h>
#include <string>

void Example(const std::string& Str) {}

int main()
{
    BSTR BStr = SysAllocString(L"Test"); // SysAllocString() takes a wide string
    char* CharStr = _com_util::ConvertBSTRToString(BStr);
    if (CharStr != NULL)
    {
        std::string StdStr(CharStr);
        Example(StdStr);
        delete[] CharStr;
    }
    SysFreeString(BStr);
}
What are the pros and cons of alternatives to using ConvertBSTRToString(), preferably based on standard methods and classes?
You can do this yourself. I prefer to convert into a caller-provided std::string if possible; if not, use an overload that returns the result by value.
// convert a BSTR to a std::string.
std::string& BstrToStdString(const BSTR bstr, std::string& dst, int cp = CP_UTF8)
{
    if (!bstr)
    {
        // define NULL functionality. I just clear the target.
        dst.clear();
        return dst;
    }

    // request content length in single-chars through a terminating
    // nullchar in the BSTR. note: BSTRs support embedded nullchars,
    // so this will only convert through the first nullchar.
    int res = WideCharToMultiByte(cp, 0, bstr, -1, NULL, 0, NULL, NULL);
    if (res > 0)
    {
        dst.resize(res);
        WideCharToMultiByte(cp, 0, bstr, -1, &dst[0], res, NULL, NULL);
    }
    else
    {
        // no content. clear target
        dst.clear();
    }
    return dst;
}
// conversion with temp.
std::string BstrToStdString(BSTR bstr, int cp = CP_UTF8)
{
    std::string str;
    BstrToStdString(bstr, str, cp);
    return str;
}
Invoke as:
BSTR bstr = SysAllocString(L"Test Data String");
std::string str;

// convert directly into str-allocated buffer.
BstrToStdString(bstr, str);

// or by-temp-val conversion
std::string str2 = BstrToStdString(bstr);

// release BSTR when finished
SysFreeString(bstr);
Something like that, anyway.
Easy way
BSTR => CStringW => CW2A => std::string.
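A minimal sketch of that chain, assuming ATL is available (CP_UTF8 selects the target encoding; drop it to use the system ANSI code page):
#include <atlbase.h> // CW2A
#include <atlstr.h>  // CStringW
#include <string>

std::string BstrToUtf8(BSTR bstr)
{
    CStringW wide(bstr);        // BSTR => CStringW (a null BSTR yields an empty string)
    CW2A narrow(wide, CP_UTF8); // CStringW => narrow buffer converted to UTF-8
    return std::string(narrow); // => std::string
}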

Convert UTF-16 to UTF-8

I am currently using VC++ 2008 with MFC. Because PostgreSQL doesn't support UTF-16 (the encoding Windows uses for Unicode), I need to convert strings from UTF-16 to UTF-8 before storing them.
Here is my code snippet.
// demo.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "demo.h"
#include "Utils.h"
#include <iostream>

#ifdef _DEBUG
#define new DEBUG_NEW
#endif

// The one and only application object
CWinApp theApp;

using namespace std;

int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
    int nRetCode = 0;

    // initialize MFC and print an error on failure
    if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0))
    {
        // TODO: change error code to suit your needs
        _tprintf(_T("Fatal Error: MFC initialization failed\n"));
        nRetCode = 1;
    }
    else
    {
        // TODO: code your application's behavior here.
    }

    CString utf16 = _T("Hello");
    std::cout << utf16.GetLength() << std::endl;
    CStringA utf8 = UTF8Util::ConvertUTF16ToUTF8(utf16);
    std::cout << utf8.GetLength() << std::endl;

    getchar();
    return nRetCode;
}
And here are the conversion functions:
namespace UTF8Util
{
    //----------------------------------------------------------------------------
    // FUNCTION: ConvertUTF8ToUTF16
    // DESC: Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
    //----------------------------------------------------------------------------
    CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
    {
        //
        // Special case of NULL or empty input string
        //
        if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
        {
            // Return empty string
            return L"";
        }

        //
        // Consider CHAR's count corresponding to total input string length,
        // including end-of-string (\0) character
        //
        const size_t cchUTF8Max = INT_MAX - 1;
        size_t cchUTF8;
        HRESULT hr = ::StringCchLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
        if ( FAILED( hr ) )
        {
            AtlThrow( hr );
        }

        // Consider also terminating \0
        ++cchUTF8;

        // Convert to 'int' for use with MultiByteToWideChar API
        int cbUTF8 = static_cast<int>( cchUTF8 );

        //
        // Get size of destination UTF-16 buffer, in WCHAR's
        //
        int cchUTF16 = ::MultiByteToWideChar(
            CP_UTF8,              // convert from UTF-8
            MB_ERR_INVALID_CHARS, // error on invalid chars
            pszTextUTF8,          // source UTF-8 string
            cbUTF8,               // total length of source UTF-8 string,
                                  // in CHAR's (= bytes), including end-of-string \0
            NULL,                 // unused - no conversion done in this step
            0                     // request size of destination buffer, in WCHAR's
        );
        ATLASSERT( cchUTF16 != 0 );
        if ( cchUTF16 == 0 )
        {
            AtlThrowLastWin32();
        }

        //
        // Allocate destination buffer to store UTF-16 string
        //
        CStringW strUTF16;
        WCHAR * pszUTF16 = strUTF16.GetBuffer( cchUTF16 );

        //
        // Do the conversion from UTF-8 to UTF-16
        //
        int result = ::MultiByteToWideChar(
            CP_UTF8,              // convert from UTF-8
            MB_ERR_INVALID_CHARS, // error on invalid chars
            pszTextUTF8,          // source UTF-8 string
            cbUTF8,               // total length of source UTF-8 string,
                                  // in CHAR's (= bytes), including end-of-string \0
            pszUTF16,             // destination buffer
            cchUTF16              // size of destination buffer, in WCHAR's
        );
        ATLASSERT( result != 0 );
        if ( result == 0 )
        {
            AtlThrowLastWin32();
        }

        // Release internal CString buffer
        strUTF16.ReleaseBuffer();

        // Return resulting UTF16 string
        return strUTF16;
    }

    //----------------------------------------------------------------------------
    // FUNCTION: ConvertUTF16ToUTF8
    // DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
    //----------------------------------------------------------------------------
    CStringA ConvertUTF16ToUTF8( __in const WCHAR * pszTextUTF16 )
    {
        //
        // Special case of NULL or empty input string
        //
        if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') )
        {
            // Return empty string
            return "";
        }

        //
        // Consider WCHAR's count corresponding to total input string length,
        // including end-of-string (L'\0') character.
        //
        const size_t cchUTF16Max = INT_MAX - 1;
        size_t cchUTF16;
        HRESULT hr = ::StringCchLengthW( pszTextUTF16, cchUTF16Max, &cchUTF16 );
        if ( FAILED( hr ) )
        {
            AtlThrow( hr );
        }

        // Consider also terminating \0
        ++cchUTF16;

        //
        // WC_ERR_INVALID_CHARS flag is set to fail if invalid input character
        // is encountered.
        // This flag is supported on Windows Vista and later.
        // Don't use it on Windows XP and previous.
        //
#if (WINVER >= 0x0600)
        DWORD dwConversionFlags = WC_ERR_INVALID_CHARS;
#else
        DWORD dwConversionFlags = 0;
#endif

        //
        // Get size of destination UTF-8 buffer, in CHAR's (= bytes)
        //
        int cbUTF8 = ::WideCharToMultiByte(
            CP_UTF8,                      // convert to UTF-8
            dwConversionFlags,            // specify conversion behavior
            pszTextUTF16,                 // source UTF-16 string
            static_cast<int>( cchUTF16 ), // total source string length, in WCHAR's,
                                          // including end-of-string \0
            NULL,                         // unused - no conversion required in this step
            0,                            // request buffer size
            NULL, NULL                    // unused
        );
        ATLASSERT( cbUTF8 != 0 );
        if ( cbUTF8 == 0 )
        {
            AtlThrowLastWin32();
        }

        //
        // Allocate destination buffer for UTF-8 string
        //
        CStringA strUTF8;
        int cchUTF8 = cbUTF8; // sizeof(CHAR) = 1 byte
        CHAR * pszUTF8 = strUTF8.GetBuffer( cchUTF8 );

        //
        // Do the conversion from UTF-16 to UTF-8
        //
        int result = ::WideCharToMultiByte(
            CP_UTF8,                      // convert to UTF-8
            dwConversionFlags,            // specify conversion behavior
            pszTextUTF16,                 // source UTF-16 string
            static_cast<int>( cchUTF16 ), // total source string length, in WCHAR's,
                                          // including end-of-string \0
            pszUTF8,                      // destination buffer
            cbUTF8,                       // destination buffer size, in bytes
            NULL, NULL                    // unused
        );
        ATLASSERT( result != 0 );
        if ( result == 0 )
        {
            AtlThrowLastWin32();
        }

        // Release internal CString buffer
        strUTF8.ReleaseBuffer();

        // Return resulting UTF-8 string
        return strUTF8;
    }
} // namespace UTF8Util
However, at runtime I get an exception at
ATLASSERT( cbUTF8 != 0 );
while trying to get the size of the destination UTF-8 buffer.
What have I missed?
And if I test using Chinese characters, how can I verify that the resultant UTF-8 string is correct?
You can also use the ATL string conversion macros: to convert from UTF-16 to UTF-8, use CW2A and pass CP_UTF8 as the code page, e.g.:
CW2A utf8(buffer, CP_UTF8);
const char* data = utf8.m_psz;
The problem is you specified the WC_ERR_INVALID_CHARS flag:
Windows Vista and later: Fail if an invalid input character is encountered. If this flag is not set, the function silently drops illegal code points. A call to GetLastError returns ERROR_NO_UNICODE_TRANSLATION. Note that this flag only applies when CodePage is specified as CP_UTF8 or 54936 (for Windows Vista and later). It cannot be used with other code page values.
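A minimal sketch of handling that failure case explicitly instead of asserting (same inputs as the question's function):
int cbUTF8 = ::WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                   pszTextUTF16, -1, NULL, 0, NULL, NULL);
if (cbUTF8 == 0 && ::GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
{
    // The UTF-16 input contains invalid data (e.g. an unpaired surrogate).
    // Either reject the input, or retry with dwConversionFlags = 0 and let
    // the API silently drop the offending code points.
}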
Your conversion function seems quite long. How does this one work for you?
//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in LPCWSTR pszTextUTF16 )
{
    if (pszTextUTF16 == NULL) return "";

    int utf16len = wcslen(pszTextUTF16);
    int utf8len = WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
                                      NULL, 0, NULL, NULL);

    CArray<CHAR> buffer;
    buffer.SetSize(utf8len + 1);
    buffer.SetAt(utf8len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
                        buffer.GetData(), utf8len, 0, 0);

    return buffer.GetData();
}
I see you use a function called StringCchLengthW to get the required length of the output buffer. Most places I've looked recommend using the WideCharToMultiByte function itself to tell you how many CHARs it wants.
Edit:
As Rob pointed out, you can use CW2A with the CP_UTF8 code page:
CStringA str = CW2A(wStr, CP_UTF8);
While I'm editing, I can answer your second question:
How can I verify the resultant UTF-8 string is correct?
Write it to a text file, then open it in Mozilla Firefox or an equivalent program. In the View menu, you can go to Character Encoding and switch manually to UTF-8 (assuming Firefox didn't guess it correctly to begin with). Compare it with a UTF-16 document containing the same text and see if there are any differences.
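For example, a minimal sketch that dumps the converted bytes for inspection (utf8 here is the CStringA from the question's main()):
#include <fstream>

// Write the raw UTF-8 bytes untranslated, then open the file in an
// editor or browser with the encoding forced to UTF-8.
std::ofstream out("utf8_check.txt", std::ios::binary);
out.write(utf8.GetString(), utf8.GetLength());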