How to set file encoding format to UTF8 in C++

A requirement for my software is that the encoding of a file which contains exported data shall be UTF8. But when I write the data to the file the encoding is always ANSI. (I use Notepad++ to check this.)
What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.
line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*
// ...
if( inputFile.is_open() )
{
while( inputFile.good() )
{
getline(inputFile,line);
//1
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
wchar_t *pwcharText;
pwcharText = new wchar_t[ dwCount];
//2
MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );
//3
dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
char *pText;
pText = new char[ dwCount ];
//4
WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );
fprintf(pOutputFile,pText);
fprintf(pOutputFile,"\n");
delete[] pwcharText;
delete[] pText;
}
}
// ...
Unfortunately the encoding is still ANSI. I searched a while for a solution but I always encounter the solution via MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?
I also looked here on SO for a solution but most UTF8 questions deal with C# and php stuff.

On Windows with VC++ 2010 this is possible (not yet implemented in GCC at the time of writing, as far as I know) using the C++11 conversion facet std::codecvt_utf8_utf16. The sample code on cppreference.com has all the basic information you need to read and write a UTF-8 file. Note that the <codecvt> facets were deprecated in C++17.
std::wstring wFromFile = L"𤭢teststring"; // wide literal; _T() is only wide when UNICODE is defined
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;
The resulting file is UTF-8 instead of ANSI (checked in Notepad). I hope this is what you need.

On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
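If you do want the BOM, a minimal sketch could look like the following. It assumes the text is already UTF-8 encoded, and write_utf8_with_bom is a made-up helper name, not a library function. The file is opened in binary mode so the bytes reach the disk unchanged.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes the UTF-8 BOM (EF BB BF) followed by text that
// the caller has already encoded as UTF-8. Returns false on any I/O failure.
bool write_utf8_with_bom(const std::string& path, const std::string& utf8_text)
{
    // Binary mode: the runtime must not translate '\n' or any other byte.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write("\xEF\xBB\xBF", 3);
    out.write(utf8_text.data(), static_cast<std::streamsize>(utf8_text.size()));
    out.flush();
    return static_cast<bool>(out);
}
```

Whether a given editor honours the BOM is still up to that editor.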

AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. fprintf() also treats its second argument as a format string, so any '%' in the data would be misinterpreted. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, e.g.:
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );
if (dwCount == 0) continue;
std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;
std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");

The type char knows nothing about encodings; all it can do is store 8 bits. A text file is therefore just a sequence of bytes, and the reader must guess the underlying encoding. A file starting with a BOM indicates UTF-8, but using a BOM is no longer recommended. The type wchar_t, in contrast, is always interpreted as UTF-16 on Windows.
So let's say you have a file encoded in UTF-8 with just one line: "Confucius says: Smile. 孔子说：微笑！😊." The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and a MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk, because it assumes my local codepage 1252 for the char* string.
Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful: the CP_Whatever argument is optional, and if it is omitted the local codepage is used.
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>
#include <atlbase.h>
int main(int argc, char** argv)
{
std::fstream afile;
std::string line1A = u8"Confucius says: Smile. 孔子说：微笑！ 😊";
std::wstring line1W;
afile.open("Test.txt", std::ios::out | std::ios::app);
if (!afile.is_open())
return 0;
afile << "\n" << line1A;
afile.close();
afile.open("Test.txt", std::ios::in);
std::getline(afile, line1A);
line1W = CA2W(line1A.c_str(), CP_UTF8);
MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
afile.close();
return 0;
}
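Since a BOM is just three ordinary bytes, std::getline returns it as part of the first line when the file has one. A small sketch for detecting and dropping it before converting the line (has_utf8_bom and strip_utf8_bom are hypothetical helper names):

```cpp
#include <string>

// Hypothetical helper: true if the byte string starts with the UTF-8 BOM.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Returns the text without a leading BOM; other input is returned unchanged.
std::string strip_utf8_bom(const std::string& bytes)
{
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```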

Related

Converting shift-jis encoded file to utf-8 in c++

I am trying to convert a Shift-JIS file to UTF-8 with the code below, but when we open the output file it has corrupted characters. It looks like something is missing here. Any thoughts?
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);
wstring str(pUTF16);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;
File.close();

There are multiple problems.
The first problem is that when you are writing the output file, you need to set it to binary for the same reason you need to do so when reading the input.
fstream File("filepath", std::ios::out | std::ios::binary);
The second problem is that when you are reading the input file, you are only reading the bytes of the input stream and treat them like a string. However, those bytes do not have a terminating null character. If you call MultiByteToWideChar with a -1 length, it infers the input string length from the terminating null character, which is missing in your case. That means both utf16size and the contents of pUTF16 are already wrong. Add it manually after reading the file:
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
The last problem is that you are using CP_ACP. That means "the current code page". In your question, you were specifically asking how to convert Shift-JIS. The code page Windows uses for its closest equivalent to what is commonly called "Shift-JIS" is 932 (you can look that up on Wikipedia, for example). So use 932 instead of CP_ACP:
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
Additionally, there is no reason to create wstring str(pUTF16). Just use pUTF16 directly in the WideCharToMultiByte calls.
Also, I'm not sure how kosher char *ptr = &result[0] is. I personally would not create a string specifically as a buffer for this.
Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out | std::ios::binary);
string result;
result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
char *ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
File << ptr;
File.close();
Also, you have a memory leak -- lpszBuf and pUTF16 are not cleaned up.
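One way to avoid both the leak and the missing terminator is to let a std::string own the file contents. This is only a sketch, not the asker's code: read_all_bytes is a hypothetical helper, and it uses std::ifstream instead of fread. The string frees itself, and its size() can be passed to MultiByteToWideChar as the explicit byte count instead of -1.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the raw bytes of a file, or an empty string if
// the file cannot be opened. No decoding is performed here.
std::string read_all_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);   // binary: no newline translation
    if (!in) return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```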

You should try using std::locale to perform this conversion:
#include <filesystem>
#include <fstream>
#include <locale>
#include <string>
namespace fs = std::filesystem;
void convert(const fs::path inName, const fs::path outName)
{
std::wifstream in{inName};
in.imbue(std::locale{".932"}); // or "ja_JP.SJIS"
if (in) {
std::wofstream out{outName};
out.imbue(std::locale{".utf-8"});
std::wstring line;
while (getline(in, line)) {
out << line << L'\n';
}
}
}
Note that locale names are platform specific; I think I used the proper ones for Windows.
Update: I've tested this on my Windows 10 machine with MSVC 19.29.30145 and it works perfectly. I used a wiki page to get some valid Japanese text and used Notepad++ to save this text in the proper encoding (Shift-JIS).
I also used Beyond Compare to verify the results.
Note that I used a similar method for Korean and it worked nicely.

wstring str(pUTF16); - pUTF16 there does not end with zero char. It should be wstring str(pUTF16, utf16size);

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it's in UCS-2 LE, when I want it to be in UTF-8 format. I believe ToUnicodeEx() is returning it in UCS-2 LE format, it also only works with wide chars. Is there any way to do this using either fstream or wfstream by somehow converting into UTF-8 first? Thanks!

You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);

The Windows API generally refers to UTF-16 as "Unicode", which is a little confusing. This means most Unicode Win32 function calls operate on or return UTF-16 strings.
So ToUnicodeEx returns a UTF-16 string.
If you need this as UTF-8, you'll need to convert it using WideCharToMultiByte.

Thank you for all the help, I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return std::string(); // empty result; a u8"" literal no longer converts to std::string in C++20
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, &newBuffer[0], nChars, NULL, NULL); // write into the string's own storage; no const_cast needed
return newBuffer;
}
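For comparison, here is roughly what that conversion does under the hood, written without the Windows API. This is a sketch and not a replacement for WideCharToMultiByte: utf16_to_utf8_portable is a hypothetical name, the input is assumed to be UTF-16 code units in a std::u16string (on Windows a wchar_t buffer holds the same 16-bit values), and unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable helper: converts UTF-16 code units to UTF-8 bytes.
std::string utf16_to_utf8_portable(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {              // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;                             // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                                 // stray low surrogate
        }
        if (cp < 0x80) {                                 // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                         // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                       // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                         // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string can be written with a plain (narrow) fstream opened in binary mode.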

Renaming a file with an en dash in the name in C++

In the project I'm working on, I work with files and I check if they exists before proceeding. Renaming or even working with files featuring that 'en dash' in the file path seems impossible.
std::string _old = "D:\\Folder\\This – by ABC.txt";
std::rename(_old.c_str(), "New.txt");
here the _old variable is interpreted as D:\Folder\This û by ABC.txt
I tried
setlocale(LC_ALL, "");
//and
setlocale(LC_ALL, "C");
//or
setlocale(LC_ALL, "en_US.UTF-8");
but none of them worked.. What should be done?

It depends on the operating system. On Linux, file names are simple byte arrays: forget about encoding and just rename the file.
But it seems you are using Windows, where a file name is actually a null-terminated string of 16-bit characters. In this case the best way is to use wstring instead of messing with encodings.
Don't try to write platform-independent code to solve platform-specific problems. Windows uses Unicode for file names so you have to write platform-specific code instead of using standard function rename.
Just write L"D:\\Folder\\This \u2013 by ABC.txt" and call _wrename.

The Windows ANSI Western encoding has the Unicode n-dash, U+2013, “–”, as code point 150 (decimal). When you output that to a console with active code page 437, the original IBM PC character set, or compatible, then it's interpreted as an “û”. So you have the right codepage 1252 character in your string literal, either because
you're using Visual C++, which defaults to the Windows ANSI codepage for encoding narrow string literals, or
you're using an old version of g++ that doesn't do the standard-mandated conversions and checking but just passes narrow character bytes directly through its machinery, and your source code is encoded as Windows ANSI Western (or compatible), or
something I didn't think of.
For either of the first two possibilities the rename call will work.
I tested that it does indeed work with Visual C++. I do not have an old version of g++ around, but I tested that it works with version 5.1. That is, I tested that the file is really renamed to New.txt.
// Source encoding: UTF-8
// Execution character set: Windows ANSI Western a.k.a. codepage 1252.
#include <stdio.h> // rename
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <string> // std::string
using namespace std;
auto main()
-> int
{
string const a = ".\\This – by ABC.txt"; // Literal encoded as CP 1252.
return rename( a.c_str(), "New.txt" ) == 0? EXIT_SUCCESS : EXIT_FAILURE;
}
Example:
[C:\my\forums\so\265]
> dir /b *.txt
File Not Found
[C:\my\forums\so\265]
> g++ r.cpp -fexec-charset=cp1252
[C:\my\forums\so\265]
> type nul >"This – by ABC.txt"
[C:\my\forums\so\265]
> run a
Exit code 0
[C:\my\forums\so\265]
> dir /b *.txt
New.txt
[C:\my\forums\so\265]
> _
… where run is just a batch file that reports the exit code.
If your Windows ANSI codepage is not codepage 1252, then you need to use your particular Windows ANSI codepage.
You can check the Windows ANSI codepage via the GetACP API function, or e.g. via this command:
[C:\my\forums\so\265]
> wmic os get codeset /value | find "="
CodeSet=1252
[C:\my\forums\so\265]
> _
The code will work if that codepage supports the n-dash character.
This model of coding is based on having one version of the executable for each relevant main locale (including character encoding).
An alternative is to do everything in Unicode. This can be done portably via Boost file system, which will be adopted into the standard library in C++17. Or you can use the Windows API, or de facto standard extensions to the standard library in Windows, i.e. _wrename.
Example of using the experimental file system module with Visual C++ 2015:
// Source encoding: UTF-8
// Execution character set: irrelevant (everything's done in Unicode).
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
#include <filesystem> // In C++17 and later, or Visual C++ 2015 and later.
using namespace std::tr2::sys;
auto main()
-> int
{
path const old_path = L".\\This – by ABC.txt"; // Literal encoded as wide string.
path const new_path = L"New.txt";
try
{
rename( old_path, new_path );
return EXIT_SUCCESS;
}
catch( ... )
{}
return EXIT_FAILURE;
}
To do this properly for portable code you can use Boost, or you can create a wrapper header that uses whatever implementation is available.
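With a C++17 compiler the same idea can be sketched against the standard std::filesystem instead of the experimental module. This assumes the file names are available as UTF-8 byte strings; rename_utf8 is a hypothetical helper name.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: renames a file or directory whose old and new names
// are given as UTF-8 byte strings. Rename failures are reported through the
// return value instead of an exception.
bool rename_utf8(const std::string& old_utf8, const std::string& new_utf8)
{
    // fs::u8path interprets its argument as UTF-8 regardless of the active
    // code page. It is deprecated in C++20, where fs::path can be built from
    // a char8_t string instead.
    const fs::path from = fs::u8path(old_utf8);
    const fs::path to   = fs::u8path(new_utf8);
    std::error_code ec;
    fs::rename(from, to, ec);
    return !ec;
}
```

A call such as rename_utf8("This \xE2\x80\x93 by ABC.txt", "New.txt") spells the en dash as its three UTF-8 bytes, so the result does not depend on the source file encoding.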

It is really platform dependent, and Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described on MSDN. This test example creates a file with the name you have a problem with, then renames it:
// #define _UNICODE // might be defined in project
#include <string>
#include <tchar.h>
#include <windows.h>
using namespace std;
// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
if( wstr.empty() ) return std::string();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
std::string strTo( size_needed, 0 );
WideCharToMultiByte (CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
if( str.empty() ) return std::wstring();
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo( size_needed, 0 );
MultiByteToWideChar (CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
int _tmain(int argc, _TCHAR* argv[] ) {
std::string pFileName = "C:\\This \xe2\x80\x93 by ABC.txt";
std::wstring pwsFileName = utf8_decode(pFileName);
// can use CreateFile id instead
HANDLE hf = CreateFileW( pwsFileName.c_str() ,
GENERIC_READ | GENERIC_WRITE,
0,
0,
CREATE_NEW,
FILE_ATTRIBUTE_NORMAL,
0);
CloseHandle(hf);
MoveFileW(utf8_decode("C:\\This \xe2\x80\x93 by ABC.txt").c_str(), utf8_decode("C:\\This \xe2\x80\x93 by ABC 2.txt").c_str());
}
There is still a problem with those helpers if you need a null-terminated string. The following variants null-terminate the result:
std::string utf8_encode(const std::wstring &wstr)
{
std::string strTo;
char *szTo = new char[wstr.length() + 1];
szTo[wstr.size()] = '\0';
WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, szTo, (int)wstr.length(), NULL, NULL);
strTo = szTo;
delete[] szTo;
return strTo;
}
// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
std::wstring wstrTo;
wchar_t *wszTo = new wchar_t[str.length() + 1];
wszTo[str.size()] = L'\0';
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, wszTo, (int)str.length());
wstrTo = wszTo;
delete[] wszTo;
return wstrTo;
}
These variants have a problem with the size of the buffer used for the conversion: the number of UTF-8 bytes can exceed the number of UTF-16 characters. Calling WideCharToMultiByte with 0 as the size of the target buffer returns the number of bytes needed for the target buffer. All this juggling with code explains why frameworks like Qt have such convoluted code to support Unicode-based file systems. Actually, the most cost-effective way for you to get rid of all possible bugs is to use such a framework.
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
The output contains the proper file name, but for the file API you still need to do a proper conversion.

C++: socket encoding (working with TeamSpeak)

As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users, which I'm doing with sockets. That's working fine so far. In my UI I'm displaying all clients in a ListBox, which is basically working. Nevertheless, I'm having problems with wrongly displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
i = 0;
queryString.str("");
queryString.clear();
queryString << clientlist << " \n";
send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
TeamSpeak::getAnswer(1);
while(p_1 != -1){
p_1 = lastLog.find(L"client_nickname=", sPos + 1);
if(p_1 != -1){
sPos = p_1;
p_2 = lastLog.find(L" ", p_1);
temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
users[i].assign(temporary.begin(), temporary.end());
SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
i++;
}
else{
sPos = 0;
p_1 = 0;
break;
}
}
TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), but none of them has an encoding problem with characters or symbols (for example Andrè). If I add a string directly: SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè"), it is displayed correctly in the ListBox. What might be the issue here? Is it a problem with my code or something else?
Update 1: I recently continued working on this problem and looked at the word Olè! as received from the socket. The result I got is the following: O (79) | l (108) | � (-61) | � (-88) | ! (33). How can I convert this char array to a wstring containing the correct characters?
Solution: As @isanae mentioned in his post, the std::wstring_convert template did the trick for me. Thank you very much!

Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
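If a debugger is not handy, dumping the bytes as hex gives the same information. A minimal sketch (hex_bytes is a hypothetical helper name): UTF-8 text shows lead and continuation bytes such as c3 a8 for è, while UTF-16 viewed as bytes shows a 00 after every ASCII character.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of a string as two lowercase hex
// digits, separated by single spaces.
std::string hex_bytes(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```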
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.to_bytes(s);
}
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
// getting the required size in characters (not bytes) of the
// output buffer
const int size = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
nullptr, 0);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);
// converting from utf8 to utf16
const int written = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
buffer.get(), size);
// error handling
assert(written != 0);
return std::wstring(buffer.get(), buffer.get() + written);
}
std::string utf16_to_utf8(const std::wstring& ws)
{
// getting the required size in bytes of the output buffer
const int size = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
nullptr, 0, nullptr, nullptr);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<char[]> buffer(new char[size]);
// converting from utf16 to utf8
const int written = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
buffer.get(), size, nullptr, nullptr);
// error handling
assert(written != 0);
return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);
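To see what the conversion actually does with the bytes 79, 108, -61, -88, 33, here is a hand-written decoder that needs neither <codecvt> nor the Windows API. It is a simplified sketch: utf8_to_utf16_portable is a hypothetical name, malformed or truncated sequences become U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable helper: decodes UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16_portable(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0xFFFD;          // default for an invalid lead byte
        std::size_t len = 1;
        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }

        bool ok = (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }           // resync on the next byte
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                                     // encode as surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

The two bytes -61 and -88 are 0xC3 and 0xA8 when read as unsigned values, and together they decode to U+00E8 (è).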

Wrong reading file in UNICODE (fread) on C++

I'm trying to load the content of a file saved on disk into a string. The file is .CS code created in Visual Studio, so I suppose it's saved in UTF-8 encoding. I'm doing this:
FILE *fConnect = _wfopen(connectFilePath, _T("r,ccs=UTF-8"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPTSTR lpContent = (LPTSTR)malloc(sizeof(TCHAR) * lSize + 1);
fread(lpContent, sizeof(TCHAR), lSize, fConnect);
But the result is strange: the first part (half of the string) is the content of the .CS file, then strange symbols like 췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍 appear.
So I think I'm reading the content in the wrong way. How do I do that properly?
Thank you so much, and I'm looking forward to hearing from you!

ftell(), fseek(), and fread() all operate on bytes, not on characters. In a Unicode environment, TCHAR is at least 2 bytes, so you are allocating and reading twice as much memory as you should be.
I have never seen fopen() or _wfopen() support a "ccs" attribute. You should use "rb" as the reading mode, read the raw bytes into memory, and then decode them once you have them all available, i.e.:
FILE *fConnect = _wfopen(connectFilePath, L"rb"); // _wfopen always takes a wide mode string
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPBYTE lpContent = (LPBYTE) malloc(lSize);
fread(lpContent, 1, lSize, fConnect);
fclose(fConnect); // close the file handle, not the buffer
// ... decode lpContent as needed ...
free(lpContent);
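The difference between bytes and characters is easy to check on a UTF-8 buffer. A small sketch, assuming well-formed UTF-8 input (count_code_points is a hypothetical helper name): it counts every byte that is not a continuation byte of the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a well-formed UTF-8 string.
// size() of the same string is the number of bytes, which is what ftell()
// and fread() deal in.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;   // skip continuation bytes
    }
    return n;
}
```

For the bytes of "Olè!" this gives 4 code points while size() reports 5 bytes.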

Does the string contain all the contents of the cs file and then additional funny characters? Probably it's just not correctly null-terminated since fread will not automatically do that. You need to set the character following the string content to zero:
lpContent[lSize] = 0;

To do the "decode lpContent as needed" step:
The s2ws function converts a string to a wstring:
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
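Add a null terminator at the end of the buffer (allocate lSize + 1 bytes so that the terminator does not overwrite the last byte of the file):
lpContent[lSize] = 0;
Initialize a wstring from the buffer content: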
std::wstring replyStr = (s2ws((char*)lpContent));
