Is it good/safe/possible to use the tiny utfcpp library for converting everything I get back from the wide Windows API (FindFirstFileW and such) to a valid UTF8 representation using utf16to8?
I would like to use UTF8 internally, but am having trouble getting the correct output (via wcout after another conversion or plain cout). Normal ASCII characters work of course, but ñä gets messed up.
Or is there an easier alternative?
Thanks!
UPDATE: Thanks to Hans (below), I now have an easy UTF8<->UTF16 conversion through the Windows API. Two-way conversion works, although the UTF8 string produced from UTF16 has some extra characters that might cause me trouble later on. I'll share it here out of pure friendliness :):
// UTF16 -> UTF8 conversion
std::string toUTF8( const std::wstring &input )
{
    // get required length in bytes
    int length = WideCharToMultiByte( CP_UTF8, 0,
                                      input.c_str(), (int)input.size(),
                                      NULL, 0,
                                      NULL, NULL );
    if( length <= 0 )
        return std::string();
    else
    {
        std::string result;
        result.resize( length );
        if( WideCharToMultiByte( CP_UTF8, 0,
                                 input.c_str(), (int)input.size(),
                                 &result[0], (int)result.size(),
                                 NULL, NULL ) > 0 )
            return result;
        else
            throw std::runtime_error( "Failure to execute toUTF8: conversion failed." );
    }
}
// UTF8 -> UTF16 conversion
std::wstring toUTF16( const std::string &input )
{
    // get required length in characters
    int length = MultiByteToWideChar( CP_UTF8, 0,
                                      input.c_str(), (int)input.size(),
                                      NULL, 0 );
    if( length <= 0 )
        return std::wstring();
    else
    {
        std::wstring result;
        result.resize( length );
        if( MultiByteToWideChar( CP_UTF8, 0,
                                 input.c_str(), (int)input.size(),
                                 &result[0], (int)result.size() ) > 0 )
            return result;
        else
            throw std::runtime_error( "Failure to execute toUTF16: conversion failed." );
    }
}
The Win32 API already has a function to do this, WideCharToMultiByte() with CodePage = CP_UTF8. Saves you from having to rely on another library.
You cannot normally use the result with wcout; its output goes to the console, which uses an 8-bit OEM encoding for legacy reasons. You can change the output code page with SetConsoleOutputCP(); 65001 is the code page for UTF-8 (CP_UTF8).
Your next stumbling block would be the font that's used for the console. You'll have to change it, but finding a fixed-pitch font with a full set of glyphs covering Unicode is going to be difficult. You'll know you have a font problem when you see square rectangles in the output; question marks are encoding problems.
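For illustration, here is a minimal sketch of that setup (my example, assuming the selected console font can render the characters):

#include <windows.h>
#include <iostream>

int main()
{
    // 65001 = CP_UTF8; SetConsoleCP() would cover input the same way.
    ::SetConsoleOutputCP( CP_UTF8 );
    std::cout << "\xC3\xB1\xC3\xA4\n";  // the UTF-8 bytes for "ñä"
    return 0;
}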
Why do you want to use UTF8 internally? Are you working with so much text that using UTF16 would create unreasonable memory demands? Even if that was the case, you're probably better off using wide chars anyway, and dealing with memory issues in some other way (using a disk cache, better algorithms or data structures).
Your code will be much cleaner and easier to deal with if you use the wide chars native to the Win32 API internally, and only do UTF8 conversions when reading or writing data that requires it (e.g. XML files or REST APIs).
Your problem may also occur at the point where you print your output to the console, see: Output unicode strings in Windows console app
Finally, I haven't used the utfcpp library, but UTF8 conversions are fairly trivial to perform using Win32's WideCharToMultiByte and MultiByteToWideChar with CP_UTF8 as the code page. Personally, I would do a one-time conversion and work with the text in UTF16 until it was time to output or transfer it in UTF8.
Related
In the project I'm working on, I work with files, and I check whether they exist before proceeding. Renaming, or even working with, files featuring that 'en dash' in the file path seems impossible.
std::string _old = "D:\\Folder\\This – by ABC.txt";
std::rename(_old.c_str(), "New.txt");
Here the _old variable is interpreted as D:\Folder\This û by ABC.txt
I tried
setlocale(LC_ALL, "");
//and
setlocale(LC_ALL, "C");
//or
setlocale(LC_ALL, "en_US.UTF-8");
but none of them worked. What should be done?
It depends on the operating system. On Linux, file names are simple byte arrays: forget about encoding and just rename the file.
But it seems you are using Windows, where a file name is actually a null-terminated string of 16-bit characters. In this case the best way is to use wstring instead of messing with encodings.
Don't try to write platform-independent code to solve platform-specific problems. Windows uses Unicode for file names, so you have to write platform-specific code instead of using the standard function rename.
Just write L"D:\\Folder\\This \u2013 by ABC.txt" and call _wrename.
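A minimal sketch of that approach (using the Microsoft-specific _wrename from <stdio.h>):

#include <stdio.h>  // _wrename is a Microsoft extension

int main()
{
    // The n-dash is spelled as the universal character name \u2013,
    // so the source file's encoding no longer matters.
    return _wrename( L"D:\\Folder\\This \u2013 by ABC.txt", L"New.txt" );
}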
The Windows ANSI Western encoding has the Unicode n-dash, U+2013, “–”, as code point 150 (decimal). When you output that to a console whose active code page is 437 (the original IBM PC character set) or compatible, it's interpreted as “û”. So you have the right codepage 1252 character in your string literal, either because
you're using Visual C++, which defaults to the Windows ANSI codepage for encoding narrow string literals, or
you're using an old version of g++ that doesn't do the standard-mandated conversions and checking but just passes narrow character bytes directly through its machinery, and your source code is encoded as Windows ANSI Western (or compatible), or
something I didn't think of.
For either of the first two possibilities, the rename call will work.
I tested that it does indeed work with Visual C++. I do not have an old version of g++ around, but I tested that it works with version 5.1. That is, I tested that the file is really renamed to New.txt.
// Source encoding: UTF-8
// Execution character set: Windows ANSI Western a.k.a. codepage 1252.
#include <stdio.h> // rename
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <string> // std::string
using namespace std;
auto main()
    -> int
{
    string const a = ".\\This – by ABC.txt";  // Literal encoded as CP 1252.
    return rename( a.c_str(), "New.txt" ) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
Example:
[C:\my\forums\so\265]
> dir /b *.txt
File Not Found
[C:\my\forums\so\265]
> g++ r.cpp -fexec-charset=cp1252
[C:\my\forums\so\265]
> type nul >"This – by ABC.txt"
[C:\my\forums\so\265]
> run a
Exit code 0
[C:\my\forums\so\265]
> dir /b *.txt
New.txt
[C:\my\forums\so\265]
> _
… where run is just a batch file that reports the exit code.
If your Windows ANSI codepage is not codepage 1252, then the literal needs to be encoded in your particular Windows ANSI codepage instead.
You can check the Windows ANSI codepage via the GetACP API function, or e.g. via this command:
[C:\my\forums\so\265]
> wmic os get codeset /value | find "="
CodeSet=1252
[C:\my\forums\so\265]
> _
The code will work if that codepage supports the n-dash character.
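For completeness, the same check can be done programmatically with the GetACP function mentioned above; a minimal sketch:

#include <windows.h>
#include <stdio.h>

int main()
{
    printf( "Windows ANSI codepage: %u\n", ::GetACP() );  // e.g. 1252
    return 0;
}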
This model of coding is based on having one version of the executable for each relevant main locale (including character encoding).
An alternative is to do everything in Unicode. This can be done portably via Boost file system, which is slated for adoption into the standard library in C++17. Or you can use the Windows API, or de facto standard extensions to the standard library in Windows, i.e. _wrename.
Example of using the experimental file system module with Visual C++ 2015:
// Source encoding: UTF-8
// Execution character set: irrelevant (everything's done in Unicode).
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <filesystem> // In C++17 and later, or Visual C++ 2015 and later.
using namespace std::tr2::sys;
auto main()
    -> int
{
    path const old_path = L".\\This – by ABC.txt";  // Literal encoded as wide string.
    path const new_path = L"New.txt";
    try
    {
        rename( old_path, new_path );
        return EXIT_SUCCESS;
    }
    catch( ... )
    {}
    return EXIT_FAILURE;
}
To do this properly for portable code you can use Boost, or you can create a wrapper header that uses whatever implementation is available.
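As an illustration, such a wrapper header could look roughly like this (a hypothetical sketch; the name portable_rename is my invention):

// portable_rename.h -- hypothetical per-platform wrapper, sketch only
#pragma once
#include <cstdio>

#ifdef _WIN32
// Windows file names are UTF-16, so take wide paths and use the
// Microsoft extension _wrename.
#include <stdio.h>
inline bool portable_rename( const wchar_t* from, const wchar_t* to )
{
    return ::_wrename( from, to ) == 0;
}
#else
// On POSIX systems file names are byte arrays, so plain rename suffices.
inline bool portable_rename( const char* from, const char* to )
{
    return std::rename( from, to ) == 0;
}
#endif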
It is really platform-dependent; Unicode is a headache. It also depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described on MSDN. This test example creates a file with the name you have a problem with, then renames it:
// #define _UNICODE // might be defined in project
#include <string>
#include <tchar.h>
#include <windows.h>
using namespace std;
// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if( wstr.empty() ) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo( size_needed, 0 );
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    if( str.empty() ) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo( size_needed, 0 );
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
int _tmain(int argc, _TCHAR* argv[])
{
    std::string pFileName = "C:\\This \xe2\x80\x93 by ABC.txt";
    std::wstring pwsFileName = utf8_decode(pFileName);
    // can use CreateFile instead
    HANDLE hf = CreateFileW( pwsFileName.c_str(),
                             GENERIC_READ | GENERIC_WRITE,
                             0,
                             0,
                             CREATE_NEW,
                             FILE_ATTRIBUTE_NORMAL,
                             0 );
    CloseHandle(hf);
    MoveFileW(utf8_decode("C:\\This \xe2\x80\x93 by ABC.txt").c_str(), utf8_decode("C:\\This \xe2\x80\x93 by ABC 2.txt").c_str());
    return 0;
}
There is still a problem with those helpers; here are variants that pass -1 as the source length so the terminating null is converted too:
std::string utf8_encode(const std::wstring &wstr)
{
    std::string strTo;
    char *szTo = new char[wstr.length() + 1];
    szTo[wstr.size()] = '\0';
    // NOTE: one output byte per input wchar_t is too small once a
    // character needs several UTF-8 bytes -- see the discussion below.
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, szTo, (int)wstr.length(), NULL, NULL);
    strTo = szTo;
    delete[] szTo;
    return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    std::wstring wstrTo;
    wchar_t *wszTo = new wchar_t[str.length() + 1];
    wszTo[str.size()] = L'\0';
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, wszTo, (int)str.length());
    wstrTo = wszTo;
    delete[] wszTo;
    return wstrTo;
}
There is a problem with the size of the buffer used for conversion: these variants allocate one output unit per input unit, which is too small when a character needs more than one byte in UTF-8. Calling WideCharToMultiByte with 0 as the size of the target buffer makes it return the number of bytes needed for the target buffer; a second call then performs the conversion. All this juggling with code explains why frameworks like Qt have such convoluted code to support a Unicode-based file system. Actually, the most cost-effective way to get rid of all possible bugs is to use such a framework.
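For reference, a sketch of the two-call pattern just described, applied to the null-terminated variant (my reworking, not the original answer's code):

#include <string>
#include <windows.h>

std::string utf8_encode_sized( const std::wstring &wstr )
{
    // First call: pass -1 so the terminator is converted too, and a
    // zero-sized target buffer so the required size in bytes is returned.
    int size = WideCharToMultiByte( CP_UTF8, 0, wstr.c_str(), -1,
                                    NULL, 0, NULL, NULL );
    if( size <= 0 )
        return std::string();
    std::string result( size, '\0' );
    // Second call: convert into the correctly sized buffer.
    WideCharToMultiByte( CP_UTF8, 0, wstr.c_str(), -1,
                         &result[0], size, NULL, NULL );
    result.pop_back();  // drop the converted null; c_str() re-adds one
    return result;
}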
For VS2015:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
For MinGW:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
The output contains the proper file name, but for the file API you still need to do a proper conversion.
As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users, which I'm doing with sockets - that's working fine so far. In my UI I'm displaying all clients in a ListBox, which is basically working. Nevertheless, I'm having problems with wrongly displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
    i = 0;
    queryString.str("");
    queryString.clear();
    queryString << clientlist << " \n";
    send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
    TeamSpeak::getAnswer(1);
    while(p_1 != -1){
        p_1 = lastLog.find(L"client_nickname=", sPos + 1);
        if(p_1 != -1){
            sPos = p_1;
            p_2 = lastLog.find(L" ", p_1);
            temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
            users[i].assign(temporary.begin(), temporary.end());
            SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
            i++;
        }
        else{
            sPos = 0;
            p_1 = 0;
            break;
        }
    }
    TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), but none of them have an encoding problem with characters or symbols (for example Andrè). If I add a string directly, SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè"), it is displayed correctly in the ListBox. What might be the issue here: is it a problem with my code or something else?
Update 1: I recently continued working on this problem and looked at the word Olè! as received from the socket. The result I got is the following: O (79) | l (108) | � (-61) | � (-88) | ! (33). How can I convert this char array to a wstring containing the correct characters?
Solution: As @isanae mentioned in his post, the std::wstring_convert template did the trick for me, thank you very much!
Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
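A rough programmatic version of that check might look like this (a heuristic sketch of mine, not production code):

#include <cstddef>

// Mostly-ASCII text stored as UTF-16LE has a zero byte in every second
// position; counting them gives a crude detector.
bool looks_like_utf16le( const char* bytes, std::size_t n )
{
    std::size_t zeros = 0;
    for( std::size_t i = 1; i < n; i += 2 )
        if( bytes[i] == 0 )
            ++zeros;
    return zeros * 4 > n;  // more than half of the odd positions are zero
}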
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
    static_assert(sizeof(wchar_t) == 2, "wchar_t needs to be 2 bytes");
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
    static_assert(sizeof(wchar_t) == 2, "wchar_t needs to be 2 bytes");
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(s);
}
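A quick usage check with the byte sequence from the question (a hypothetical round-trip test):

#include <cassert>

int main()
{
    const std::string s = { 79, 108, -61, -88, 33 };  // UTF-8 for "Olè!"
    const std::wstring w = utf8_to_utf16( s );
    assert( utf16_to_utf8( w ) == s );  // converts back losslessly
    return 0;
}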
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
    // getting the required size in characters (not bytes) of the
    // output buffer
    const int size = ::MultiByteToWideChar(
        CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
        nullptr, 0);

    // error handling
    assert(size != 0);

    // creating a buffer with enough characters in it
    std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);

    // converting from utf8 to utf16
    const int written = ::MultiByteToWideChar(
        CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
        buffer.get(), size);

    // error handling
    assert(written != 0);

    return std::wstring(buffer.get(), buffer.get() + written);
}

std::string utf16_to_utf8(const std::wstring& ws)
{
    // getting the required size in bytes of the output buffer
    const int size = ::WideCharToMultiByte(
        CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
        nullptr, 0, nullptr, nullptr);

    // error handling
    assert(size != 0);

    // creating a buffer with enough characters in it
    std::unique_ptr<char[]> buffer(new char[size]);

    // converting from utf16 to utf8
    const int written = ::WideCharToMultiByte(
        CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
        buffer.get(), size, nullptr, nullptr);

    // error handling
    assert(written != 0);

    return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);
In response to discussion in
Cross-platform strings (and Unicode) in C++
How to deal with Unicode strings in C/C++ in a cross-platform friendly way?
I'm trying to assign a UTF-8 string to a std::string variable in a Visual Studio 2010 environment:
std::string msg = "महसुस";
However, when I view the string in the debugger, I only see "?????".
I have the file saved as Unicode (UTF-8 with signature), and I'm using the "Use Unicode Character Set" setting.
"महसुस" is a Nepali word; it contains 5 characters and will occupy 15 bytes in UTF-8. But the Visual Studio debugger shows the msg size as 5.
My question is:
How do I use std::string to just store the utf-8 without needing to manipulate it?
If you were using C++11 then this would be easy:
std::string msg = u8"महसुस";
But since you are not, you can use escape sequences and not rely on the source file's charset to manage the encoding for you; this way your code is more portable (in case you accidentally save it in a non-UTF8 format):
std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
Otherwise, you might consider doing a conversion at runtime instead:
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}
std::string msg = toUtf8(L"महसुस");
You can write msg.c_str(), s8 in the Watches window to see the UTF-8 string correctly.
If you have C++11, you can write u8"महसुस". Otherwise, you'll have to write the actual byte sequence, using \xxx for each byte in the UTF-8 sequence.
Typically, you're better off reading such text from a configuration file.
There is a way to display the right values thanks to the 's8' format specifier. If we append ',s8' to the variable names, Visual Studio reparses the text as UTF-8 and renders it correctly.
If you are using Microsoft Visual Studio 2008 Service Pack 1, you need to apply this hotfix:
http://support.microsoft.com/kb/980263
I created a simple function:
std::wstring GetRegKey(const std::string& location, const std::string& name){
    const int valueLength = 10240;
    auto platformFlag = KEY_WOW64_64KEY;
    HKEY key;
    TCHAR value[valueLength];
    DWORD bufLen = valueLength*sizeof(TCHAR);
    long ret;
    ret = RegOpenKeyExA(HKEY_LOCAL_MACHINE, location.c_str(), 0, KEY_READ | platformFlag, &key);
    if( ret != ERROR_SUCCESS ){
        return std::wstring();
    }
    ret = RegQueryValueExA(key, name.c_str(), NULL, NULL, (LPBYTE) value, &bufLen);
    RegCloseKey(key);
    if ( (ret != ERROR_SUCCESS) || (bufLen > valueLength*sizeof(TCHAR)) ){
        return std::wstring();
    }
    std::wstring stringValue(value, (size_t)bufLen - 1);
    size_t i = stringValue.length();
    while( i > 0 && stringValue[i-1] == '\0' ){
        --i;
    }
    return stringValue;
}
And I call it like this: auto result = GetRegKey("SOFTWARE\\Microsoft\\Cryptography", "MachineGuid");
yet the string looks like
㤴ㄷ㤵戰㌭㉣ⴱ㔴㍥㤭慣ⴹ㍥摢㘵〴㉡ㄵ\0009ca9-e3bd5640a251
not like in RegEdit:
4971590b-3c21-45e3-9ca9-e3bd5640a251
So I wonder what should be done to get a correct representation of MachineGuid in C++.
RegQueryValueExA has been an ANSI wrapper around the Unicode version since Windows NT. It not only converts the lpValueName to a wide string for the internal call, it also converts the lpData retrieved from the registry from Unicode back to ANSI before returning.
MSDN has the following to say:
If the data has the REG_SZ, REG_MULTI_SZ or REG_EXPAND_SZ type, and the ANSI version of this function is used (either by explicitly calling RegQueryValueExA or by not defining UNICODE before including the Windows.h file), this function converts the stored Unicode string to an ANSI string before copying it to the buffer pointed to by lpData.
Your problem is that you are populating lpData, which holds TCHARs (WCHAR in a Unicode build), with an ANSI string.
The garbled string that you see is a result of 2 ANSI chars being used to populate a single wchar_t. That explains the Asian characters. The portion that looks like the end of the GUID is because the print function blew past the terminating null since it was only one byte and began printing what is probably a portion of the buffer that was used by RegQueryValueExA before converting to ANSI.
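To make that concrete, the first garbled character can be reproduced by hand (a small demonstration of mine, assuming a little-endian machine):

#include <stdio.h>
#include <string.h>

int main()
{
    // The GUID begins with "49": the ANSI bytes '4' (0x34) and '9' (0x39)
    // packed into one 16-bit unit give 0x3934 -- the CJK character '㤴'
    // that starts the garbled output above.
    const char ansi[2] = { '4', '9' };
    unsigned short wide;
    memcpy( &wide, ansi, 2 );
    printf( "U+%04X\n", wide );  // prints U+3934
    return 0;
}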
To solve the problem, either stick entirely to Unicode, or to ANSI (if you are brave enough to continue using ANSI in the year 2014), or be very careful about your conversions. I would change GetRegKey to accept wstrings and use RegQueryValueExW instead, but that is a matter of preference and what sort of code you plan on using this in.
(Also, I would recommend you have someone review this code since there are a number of oddities in the error checking, and a hard coded buffer size.)
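Along those lines, an all-Unicode variant could look roughly like this (my sketch of the suggested change, still minimally error-checked):

#include <string>
#include <windows.h>

std::wstring GetRegKeyW( const std::wstring& location, const std::wstring& name )
{
    HKEY key;
    if( RegOpenKeyExW( HKEY_LOCAL_MACHINE, location.c_str(), 0,
                       KEY_READ | KEY_WOW64_64KEY, &key ) != ERROR_SUCCESS )
        return std::wstring();

    WCHAR value[10240];
    DWORD bufLen = sizeof(value);  // size in bytes, not characters
    const LONG ret = RegQueryValueExW( key, name.c_str(), NULL, NULL,
                                       reinterpret_cast<LPBYTE>(value), &bufLen );
    RegCloseKey( key );
    if( ret != ERROR_SUCCESS )
        return std::wstring();

    // bufLen is in bytes and, for REG_SZ, includes the terminating null.
    std::wstring result( value, bufLen / sizeof(WCHAR) );
    while( !result.empty() && result.back() == L'\0' )
        result.pop_back();
    return result;
}

Called as: auto result = GetRegKeyW( L"SOFTWARE\\Microsoft\\Cryptography", L"MachineGuid" );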
A requirement for my software is that the encoding of a file which contains exported data shall be UTF8. But when I write the data to the file the encoding is always ANSI. (I use Notepad++ to check this.)
What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.
line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*
// ...
if( inputFile.is_open() )
{
    while( inputFile.good() )
    {
        getline(inputFile,line);
        //1
        DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
        wchar_t *pwcharText;
        pwcharText = new wchar_t[ dwCount ];
        //2
        MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );
        //3
        dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
        char *pText;
        pText = new char[ dwCount ];
        //4
        WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );
        fprintf(pOutputFile,pText);
        fprintf(pOutputFile,"\n");
        delete[] pwcharText;
        delete[] pText;
    }
}
// ...
Unfortunately the encoding is still ANSI. I searched for a while for a solution, but I always encounter the solution via MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?
I also looked here on SO for a solution but most UTF8 questions deal with C# and php stuff.
On Windows in VC++2010 this is possible (not yet implemented in GCC, as far as I know) using the localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code from cppreference.com has all the basic information you need to read/write a UTF-8 file.
std::wstring wFromFile = _T("𤭢teststring");
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut << wFromFile;
This writes the previously ANSI-encoded file as UTF-8 (checked in Notepad). Hope this is what you need.
On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
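A minimal sketch of writing such a byte-order mark ahead of the data (my example; it assumes the consuming application honors the BOM):

#include <stdio.h>

int main()
{
    FILE* f = fopen( "out.txt", "wb" );  // binary mode: bytes pass through untouched
    if( !f )
        return 1;
    fwrite( "\xEF\xBB\xBF", 1, 3, f );   // the UTF-8 BOM
    fwrite( "\xC3\xA4\n", 1, 3, f );     // UTF-8 bytes for "ä" plus a newline
    fclose( f );
    return 0;
}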
AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, eg:
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );
if (dwCount == 0) continue;
std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;
std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");
The type char has no clue about encoding; all it can do is store 8 bits. Therefore any text file is just a sequence of bytes, and the user must guess the underlying encoding. A file starting with a BOM indicates UTF-8, but using a BOM is not recommended any more. The type wchar_t, in contrast, is in Windows always interpreted as UTF-16.
So let's say you have a file encoded in UTF 8 with just one line: "Confucius says: Smile. 孔子说:微笑!😊." The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk because it assumes my local codepage 1252 for the char* string.
Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful: the CP_Whatever argument is optional, and if it is omitted the local codepage is used.
#include <iostream>
#include <fstream>
#include <filesystem>
#include <atlbase.h>
int main(int argc, char** argv)
{
    std::fstream afile;
    std::string line1A = u8"Confucius says: Smile. 孔子说:微笑! 😊";
    std::wstring line1W;

    afile.open("Test.txt", std::ios::out | std::ios::app);
    if (!afile.is_open())
        return 0;
    afile << "\n" << line1A;
    afile.close();

    afile.open("Test.txt", std::ios::in);
    std::getline(afile, line1A);
    line1W = CA2W(line1A.c_str(), CP_UTF8);
    MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
    MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
    afile.close();
    return 0;
}