WideCharToMultiByte std analog for UTF-8 - C++

This code is working properly for me:
std::wstring wmsg_text = L"キエオイウカクケコサシスセソタチツテア";
char buffer[100] = { 0 };
WideCharToMultiByte(CP_UTF8, 0, wmsg_text.data(), wmsg_text.size(), buffer, sizeof(buffer)-1, NULL, NULL);
I am wondering what the cross-platform analog of this code is. I looked at std::wcstombs with std::codecvt_utf8, but I can't figure out the right way to use them.

You want to use std::wcsrtombs, something like the code below. Note that wcsrtombs converts according to the currently installed C locale, so you only get UTF-8 output if the locale's encoding is UTF-8 (e.g. set via std::setlocale).
std::wstring wmsg_text = L"キエオイウカクケコサシスセソタチツテア";
const wchar_t* wstr = wmsg_text.data();
std::mbstate_t state = std::mbstate_t();
// the first call with a null destination only measures the converted length;
// it returns (size_t)-1 on failure, which should be checked before use
std::size_t len = 1 + std::wcsrtombs(nullptr, &wstr, 0, &state); // +1 for the terminating null
std::vector<char> mbstr(len);
std::wcsrtombs(mbstr.data(), &wstr, mbstr.size(), &state);
char* buffer = mbstr.data();

This code is working properly, too (note that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17):
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
std::string u8str = conv.to_bytes(wmsg_text);

Related

convert emoji string to icu::UnicodeString

I have a method that reads a JSON file and returns a const char* that can be any text, including emojis. I don't have access to the source of this method.
For example, I created a json file with the england flag, 🏴󠁧󠁢󠁥󠁮󠁧󠁿 ({message: "\uD83C\uDFF4\uDB40\uDC67\uDB40\uDC62\uDB40\uDC65\uDB40\uDC6E\uDB40\uDC67\uDB40\uDC7F"}).
When I call that method, it returns something like 🏴󠁧󠁢󠁥󠁮󠁧󠁿, but in order to use it properly, I need to convert it to icu::UnicodeString because I use another method (closed source again) that expects it.
The only way I found to make it work was something like:
icu::UnicodeString unicode;
unicode.setTo((UChar*)convertMessage().data());
std::string messageAsString;
unicode.toUTF8String(messageAsString);
after doing that, messageAsString is usable and everything works.
convertMessage() is a method that uses std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>::from_bytes(str).
My question is, is there a way to create a icu::UnicodeString without using that extra convertMessage() call?
This is sample usage of the ucnv_toUChars function. I took these functions from the PostgreSQL source code and used them in my project.
#include <assert.h>
#include <stdlib.h>
#include <unicode/ucnv.h>

UConverter *icu_converter;

static int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, int32_t nbytes)
{
    UErrorCode status;
    int32_t len_uchar;

    // first pass: ask the converter for the required length only
    status = U_ZERO_ERROR;
    len_uchar = ucnv_toUChars(icu_converter, NULL, 0, buff, nbytes, &status);
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
        return -1;

    *buff_uchar = (UChar *) malloc((len_uchar + 1) * sizeof(**buff_uchar));

    // second pass: do the actual conversion
    status = U_ZERO_ERROR;
    len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1, buff, nbytes, &status);
    if (U_FAILURE(status))
        assert(0); // (errmsg("ucnv_toUChars failed: %s", u_errorName(status)))
    return len_uchar;
}

static int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
{
    UErrorCode status;
    int32_t len_result;

    status = U_ZERO_ERROR;
    len_result = ucnv_fromUChars(icu_converter, NULL, 0, buff_uchar, len_uchar, &status);
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
        assert(0); // (errmsg("ucnv_fromUChars failed: %s", u_errorName(status)))

    *result = (char *) malloc(len_result + 1);

    status = U_ZERO_ERROR;
    len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1, buff_uchar, len_uchar, &status);
    if (U_FAILURE(status))
        assert(0); // (errmsg("ucnv_fromUChars failed: %s", u_errorName(status)))
    return len_result;
}

int main() {
    const char *utf8String = "Hello";
    int len = 5;
    UErrorCode status = U_ZERO_ERROR;
    icu_converter = ucnv_open("utf8", &status);
    assert(status <= U_ZERO_ERROR);
    UChar *buff_uchar;
    int32_t len_uchar = icu_to_uchar(&buff_uchar, utf8String, len);
    // use buff_uchar
    free(buff_uchar);
    ucnv_close(icu_converter);
    return 0;
}
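If the input is already known to be valid UTF-8, a shorter route (not in the answer above) is icu::UnicodeString::fromUTF8, which accepts a StringPiece and needs no converter object. A minimal sketch, where readMessage() is a hypothetical stand-in for the closed-source method that returns the const char*:
#include <unicode/unistr.h>

const char* jsonText = readMessage(); // hypothetical accessor for the JSON text
icu::UnicodeString unicode = icu::UnicodeString::fromUTF8(jsonText);

std::string messageAsString;
unicode.toUTF8String(messageAsString);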

Cannot add TCHAR* to const char

Hello, I'm trying to make my application run on startup. For this to work on my clients' PCs, I first need to get the PC username, but when I try to build this I get the following error:
E2140 expression must have integral or unscoped enum type
Here's the code:
HKEY hKey;
const char* czStartName = "MY application";
TCHAR pcusername[UNLEN + 1];
DWORD pcusername_len = UNLEN + 1;
GetUserName((TCHAR*)pcusername, &pcusername_len);
const char* czExePath = "\"C:\\Users\\" + pcusername + "\\Desktop\\Myapplication.exe\" /background";
How can I convert TCHAR* to const char*?
As others have said in the comments, you cannot concatenate C-style strings (plain char arrays and pointers) with the + operator. You can do something like this instead:
#include <cstdio>

char buf[4096];
// %ls formats a wide (wchar_t) argument; pcusername is wide in a UNICODE
// build. In an ANSI build (TCHAR == char), use %s instead.
snprintf(buf, sizeof(buf), "\"C:\\Users\\%ls\\Desktop\\Myapplication.exe\" /background", pcusername);
const char* czExePath = buf;
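Alternatively, std::string does support + concatenation. A sketch assuming an ANSI build where TCHAR is plain char (in a UNICODE build you would have to convert pcusername to a narrow string first):
#include <string>

std::string exePath = std::string("\"C:\\Users\\") + pcusername +
                      "\\Desktop\\Myapplication.exe\" /background";
const char* czExePath = exePath.c_str(); // valid only while exePath is alive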

Open utf8 encoded filename in c++ Windows

Consider the following code:
#include <iostream>
#include <boost\locale.hpp>
#include <Windows.h>
#include <fstream>
std::string ToUtf8(std::wstring str)
{
std::string ret;
int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0, NULL, NULL);
if (len > 0)
{
ret.resize(len);
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len, NULL, NULL);
}
return ret;
}
int main()
{
std::wstring wfilename = L"D://Private//Test//एउटा फोल्दर//भित्रको फाईल.txt";
std::string utf8path = ToUtf8(wfilename );
std::ifstream iFileStream(utf8path , std::ifstream::in | std::ifstream::binary);
if(iFileStream.is_open())
{
std::cout << "Opened the File\n";
//Do the work here.
}
else
{
std::cout << "Cannot Opened the file\n";
}
return 0;
}
When I run this, the file fails to open and I enter the else block. Even using boost::locale::conv::from_utf(utf8path, "utf_8") instead of utf8path doesn't work. The code works if I use wifstream with wfilename as its parameter, but I don't want to use wifstream. Is there any way to open a file whose name is UTF-8 encoded? I am using Visual Studio 2010.
On Windows, you MUST use 8-bit ANSI (and it must match the user's locale) or UTF-16 for filenames; there is no other option available. You can keep using string and UTF-8 in your main code, but you will have to convert UTF-8 filenames to UTF-16 when opening files. Less efficient, but that is what you need to do.
Fortunately, VC++'s implementation of std::ifstream and std::ofstream have non-standard overloads of their constructors and open() methods to accept wchar_t* strings for UTF-16 filenames.
explicit basic_ifstream(
const wchar_t *_Filename,
ios_base::openmode _Mode = ios_base::in,
int _Prot = (int)ios_base::_Openprot
);
void open(
const wchar_t *_Filename,
ios_base::openmode _Mode = ios_base::in,
int _Prot = (int)ios_base::_Openprot
);
void open(
const wchar_t *_Filename,
ios_base::openmode _Mode
);
explicit basic_ofstream(
const wchar_t *_Filename,
ios_base::openmode _Mode = ios_base::out,
int _Prot = (int)ios_base::_Openprot
);
void open(
const wchar_t *_Filename,
ios_base::openmode _Mode = ios_base::out,
int _Prot = (int)ios_base::_Openprot
);
void open(
const wchar_t *_Filename,
ios_base::openmode _Mode
);
You will have to use an #ifdef to detect Windows compilation (unfortunately, different C++ compilers identify that differently) and temporarily convert your UTF-8 string to UTF-16 when opening a file.
#ifdef _MSC_VER
std::wstring ToUtf16(std::string str)
{
std::wstring ret;
int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0);
if (len > 0)
{
ret.resize(len);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len);
}
return ret;
}
#endif
int main()
{
std::string utf8path = ...;
std::ifstream iFileStream(
#ifdef _MSC_VER
ToUtf16(utf8path).c_str()
#else
utf8path.c_str()
#endif
, std::ifstream::in | std::ifstream::binary);
...
return 0;
}
Note that this is only guaranteed to work in VC++. Other C++ compilers for Windows are not guaranteed to provide similar extensions.
UPDATE: as of Windows 10 Insider Preview Build 17035, Microsoft now supports UTF-8 as a system-wide encoding that users can set their locale to. And as of Windows 10 Version 1903 (build 18362), applications can now opt in via their app manifest to use UTF-8 as a process-wide codepage, even if the user locale is not set to UTF-8. These features allow ANSI-based APIs (like CreateFileA(), which std::ifstream/std::ofstream use internally) to work with UTF-8 strings. So, in theory, with this feature turned on, you might be able to pass a UTF-8 encoded string to std::ifstream/std::ofstream and it would "just work". I can't confirm that, as it very much depends on the implementation. It would be safer to stick with passing in UTF-16 filenames, since that is Windows' native encoding, which the ANSI APIs will simply convert to internally.
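If you rely on that opt-in, you can check at runtime whether the UTF-8 code page is actually in effect for the process. A minimal sketch using GetACP() (CP_UTF8 is the code-page constant 65001):
#include <windows.h>

// true only when the manifest opt-in or the system-wide UTF-8 setting applies
bool processUsesUtf8 = (GetACP() == CP_UTF8);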
You can use std::filesystem::u8path in C++17:
std::filesystem::path pa = std::filesystem::u8path(yourStdStringPath);
std::ofstream ofs(pa);
It's deprecated in C++20, where you construct the path directly from a u8-prefixed (char8_t) string instead.
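A minimal C++20 sketch of that replacement, constructing the path directly from a u8 (char8_t) literal, which is always interpreted as UTF-8:
#include <filesystem>
#include <fstream>

std::filesystem::path pa = std::filesystem::path(u8"भित्रको फाईल.txt"); // char8_t string, interpreted as UTF-8
std::ofstream ofs(pa);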

How to convert BSTR to std::string in Visual Studio C++ 2010?

I am working on a COM dll. I wish to convert a BSTR to a std::string to pass to a method that takes a const reference parameter.
It seems that using _com_util::ConvertBSTRToString() to get the char* equivalent of the BSTR is an appropriate way to do so. However, the API documentation is sparse, and the implementation is potentially buggy:
http://msdn.microsoft.com/en-us/library/ewezf1f6(v=vs.100).aspx
http://www.codeproject.com/Articles/1969/BUG-in-_com_util-ConvertStringToBSTR-and-_com_util
Example:
#include <comutil.h>
#include <string>
void Example(const std::string& Str) {}
int main()
{
BSTR BStr = SysAllocString(L"Test");
char* CharStr = _com_util::ConvertBSTRToString(BStr);
if(CharStr != NULL)
{
std::string StdStr(CharStr);
Example(StdStr);
delete[] CharStr;
}
SysFreeString(BStr);
}
What are the pros and cons of alternatives to ConvertBSTRToString(), preferably based on standard methods and classes?
You can do this yourself. I prefer to convert into a caller-supplied std::string if possible. If not, use the overload that returns a temporary by value.
// convert a BSTR to a std::string.
std::string& BstrToStdString(const BSTR bstr, std::string& dst, int cp = CP_UTF8)
{
    if (!bstr)
    {
        // define NULL functionality. I just clear the target.
        dst.clear();
        return dst;
    }

    // request content length in single-chars through a terminating
    // nullchar in the BSTR. note: BSTRs support embedded nullchars,
    // so this will only convert through the first nullchar.
    int res = WideCharToMultiByte(cp, 0, bstr, -1, NULL, 0, NULL, NULL);
    if (res > 0)
    {
        dst.resize(res);
        WideCharToMultiByte(cp, 0, bstr, -1, &dst[0], res, NULL, NULL);
        dst.resize(res - 1); // res counted the terminating nullchar; drop it
    }
    else
    {   // no content. clear target
        dst.clear();
    }
    return dst;
}

// conversion with temp.
std::string BstrToStdString(BSTR bstr, int cp = CP_UTF8)
{
    std::string str;
    BstrToStdString(bstr, str, cp);
    return str;
}
Invoke as:
BSTR bstr = SysAllocString(L"Test Data String");
std::string str;
// convert directly into str-allocated buffer.
BstrToStdString(bstr, str);
// or by-temp-val conversion
std::string str2 = BstrToStdString(bstr);
// release BSTR when finished
SysFreeString(bstr);
Something like that, anyway.
Easy way
BSTR => CStringW => CW2A => std::string.
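A minimal sketch of that chain, assuming the ATL/MFC string headers (atlstr.h) are available:
#include <atlstr.h>
#include <string>

std::string BstrToUtf8(BSTR bstr)
{
    CStringW wide(bstr);            // BSTR -> CStringW (a null BSTR yields an empty string)
    CW2A utf8(wide, CP_UTF8);       // CStringW -> narrow buffer, converted to UTF-8
    return std::string(utf8);       // narrow buffer -> std::string
}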

Convert UTF-16 to UTF-8

I am currently using VC++ 2008 MFC. Because PostgreSQL doesn't support UTF-16 (the encoding used by Windows for Unicode), I need to convert strings from UTF-16 to UTF-8 before storing them.
Here is my code snippet.
// demo.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "demo.h"
#include "Utils.h"
#include <iostream>
#ifdef _DEBUG
#define new DEBUG_NEW
#endif
// The one and only application object
CWinApp theApp;
using namespace std;
int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
int nRetCode = 0;
// initialize MFC and print an error on failure
if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0))
{
// TODO: change error code to suit your needs
_tprintf(_T("Fatal Error: MFC initialization failed\n"));
nRetCode = 1;
}
else
{
// TODO: code your application's behavior here.
}
CString utf16 = _T("Hello");
std::cout << utf16.GetLength() << std::endl;
CStringA utf8 = UTF8Util::ConvertUTF16ToUTF8(utf16);
std::cout << utf8.GetLength() << std::endl;
getchar();
return nRetCode;
}
and the conversion functions.
namespace UTF8Util
{
//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF8ToUTF16
// DESC: Converts Unicode UTF-8 text to Unicode UTF-16 (Windows default).
//----------------------------------------------------------------------------
CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
//
// Special case of NULL or empty input string
//
if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
{
// Return empty string
return L"";
}
//
// Consider CHAR's count corresponding to total input string length,
// including end-of-string (\0) character
//
const size_t cchUTF8Max = INT_MAX - 1;
size_t cchUTF8;
HRESULT hr = ::StringCchLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
if ( FAILED( hr ) )
{
AtlThrow( hr );
}
// Consider also terminating \0
++cchUTF8;
// Convert to 'int' for use with MultiByteToWideChar API
int cbUTF8 = static_cast<int>( cchUTF8 );
//
// Get size of destination UTF-16 buffer, in WCHAR's
//
int cchUTF16 = ::MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
MB_ERR_INVALID_CHARS, // error on invalid chars
pszTextUTF8, // source UTF-8 string
cbUTF8, // total length of source UTF-8 string,
// in CHAR's (= bytes), including end-of-string \0
NULL, // unused - no conversion done in this step
0 // request size of destination buffer, in WCHAR's
);
ATLASSERT( cchUTF16 != 0 );
if ( cchUTF16 == 0 )
{
AtlThrowLastWin32();
}
//
// Allocate destination buffer to store UTF-16 string
//
CStringW strUTF16;
WCHAR * pszUTF16 = strUTF16.GetBuffer( cchUTF16 );
//
// Do the conversion from UTF-8 to UTF-16
//
int result = ::MultiByteToWideChar(
CP_UTF8, // convert from UTF-8
MB_ERR_INVALID_CHARS, // error on invalid chars
pszTextUTF8, // source UTF-8 string
cbUTF8, // total length of source UTF-8 string,
// in CHAR's (= bytes), including end-of-string \0
pszUTF16, // destination buffer
cchUTF16 // size of destination buffer, in WCHAR's
);
ATLASSERT( result != 0 );
if ( result == 0 )
{
AtlThrowLastWin32();
}
// Release internal CString buffer
strUTF16.ReleaseBuffer();
// Return resulting UTF16 string
return strUTF16;
}
//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in const WCHAR * pszTextUTF16 )
{
//
// Special case of NULL or empty input string
//
if ( (pszTextUTF16 == NULL) || (*pszTextUTF16 == L'\0') )
{
// Return empty string
return "";
}
//
// Consider WCHAR's count corresponding to total input string length,
// including end-of-string (L'\0') character.
//
const size_t cchUTF16Max = INT_MAX - 1;
size_t cchUTF16;
HRESULT hr = ::StringCchLengthW( pszTextUTF16, cchUTF16Max, &cchUTF16 );
if ( FAILED( hr ) )
{
AtlThrow( hr );
}
// Consider also terminating \0
++cchUTF16;
//
// WC_ERR_INVALID_CHARS flag is set to fail if invalid input character
// is encountered.
// This flag is supported on Windows Vista and later.
// Don't use it on Windows XP and previous.
//
#if (WINVER >= 0x0600)
DWORD dwConversionFlags = WC_ERR_INVALID_CHARS;
#else
DWORD dwConversionFlags = 0;
#endif
//
// Get size of destination UTF-8 buffer, in CHAR's (= bytes)
//
int cbUTF8 = ::WideCharToMultiByte(
CP_UTF8, // convert to UTF-8
dwConversionFlags, // specify conversion behavior
pszTextUTF16, // source UTF-16 string
static_cast<int>( cchUTF16 ), // total source string length, in WCHAR's,
// including end-of-string \0
NULL, // unused - no conversion required in this step
0, // request buffer size
NULL, NULL // unused
);
ATLASSERT( cbUTF8 != 0 );
if ( cbUTF8 == 0 )
{
AtlThrowLastWin32();
}
//
// Allocate destination buffer for UTF-8 string
//
CStringA strUTF8;
int cchUTF8 = cbUTF8; // sizeof(CHAR) = 1 byte
CHAR * pszUTF8 = strUTF8.GetBuffer( cchUTF8 );
//
// Do the conversion from UTF-16 to UTF-8
//
int result = ::WideCharToMultiByte(
CP_UTF8, // convert to UTF-8
dwConversionFlags, // specify conversion behavior
pszTextUTF16, // source UTF-16 string
static_cast<int>( cchUTF16 ), // total source string length, in WCHAR's,
// including end-of-string \0
pszUTF8, // destination buffer
cbUTF8, // destination buffer size, in bytes
NULL, NULL // unused
);
ATLASSERT( result != 0 );
if ( result == 0 )
{
AtlThrowLastWin32();
}
// Release internal CString buffer
strUTF8.ReleaseBuffer();
// Return resulting UTF-8 string
return strUTF8;
}
} // namespace UTF8Util
However, at runtime I get an exception at
ATLASSERT( cbUTF8 != 0 );
while trying to get the size of the destination UTF-8 buffer.
What have I missed?
If I test with Chinese characters, how can I verify that the resulting UTF-8 string is correct?
You can also use the ATL String Conversion Macros - to convert from UTF-16 to UTF-8 use CW2A and pass CP_UTF8 as the code page, e.g.:
CW2A utf8(buffer, CP_UTF8);
const char* data = utf8.m_psz;
The problem is you specified the WC_ERR_INVALID_CHARS flag:
Windows Vista and later: Fail if an invalid input character is encountered. If this flag is not set, the function silently drops illegal code points. A call to GetLastError returns ERROR_NO_UNICODE_TRANSLATION. Note that this flag only applies when CodePage is specified as CP_UTF8 or 54936 (for Windows Vista and later). It cannot be used with other code page values.
Your conversion function seems quite long. How does this one work for you?
//----------------------------------------------------------------------------
// FUNCTION: ConvertUTF16ToUTF8
// DESC: Converts Unicode UTF-16 (Windows default) text to Unicode UTF-8.
//----------------------------------------------------------------------------
CStringA ConvertUTF16ToUTF8( __in LPCWSTR pszTextUTF16 ) {
    if (pszTextUTF16 == NULL) return "";
    int utf16len = (int)wcslen(pszTextUTF16);
    int utf8len = WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
                                      NULL, 0, NULL, NULL);
    CArray<CHAR> buffer;
    buffer.SetSize(utf8len + 1);
    buffer.SetAt(utf8len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, pszTextUTF16, utf16len,
                        buffer.GetData(), utf8len, 0, 0);
    return buffer.GetData();
}
I see you use a function called StringCchLengthW to get the required length of the output buffer. Most references recommend calling the WideCharToMultiByte function itself with a zero-sized output buffer to find out how many CHARs it needs.
Edit:
As Rob pointed out, you can use CW2A with the CP_UTF8 code page:
CStringA str = CW2A(wStr, CP_UTF8);
While I'm editing, I can answer your second question:
How can I verify the resulting UTF-8 string is correct?
Write it to a text file, then open it in Mozilla Firefox or an equivalent program. In the View menu, you can go to Character Encoding and switch manually to UTF-8 (assuming Firefox didn't guess it correctly to begin with). Compare it with a UTF-16 document containing the same text and see if there are any differences.
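Alternatively, you can check the raw bytes programmatically. A minimal sketch, assuming one of the ConvertUTF16ToUTF8 implementations above and the Chinese character 中 (U+4E2D), whose UTF-8 encoding is the byte sequence 0xE4 0xB8 0xAD:
#include <cassert>

CStringA utf8 = ConvertUTF16ToUTF8(L"中");
assert(utf8.GetLength() == 3);
assert((unsigned char)utf8[0] == 0xE4);
assert((unsigned char)utf8[1] == 0xB8);
assert((unsigned char)utf8[2] == 0xAD);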