VS2019 compiler misinterprets UTF-8 file without BOM as ANSI - C++

I used to compile my C++ wxWidgets-3.1.1 application (Win10 x64) with VS2015 Express. I wanted to upgrade my IDE to VS2019 Community, which seemed to work quite well.
My project files are partly from older projects, so their encodings differ (Windows-1252, UTF-8 without BOM, ANSI).
With VS2015 the compiled application emitted messages (hardcoded in my .cpp files) that displayed Unicode characters correctly.
The same app compiled with VS2019 Community shows, for example, the German word "übergabe" as "Ã¼bergabe", which is UTF-8 interpreted as ANSI.
Saving the .cpp file that contains the Unicode text explicitly as UTF-8 WITH BOM solves this issue. But I don't want to run through all files in all projects. Can I tell VS2019 to treat files without a BOM as UTF-8, to get the same behaviour VS2015 had?
[EDIT]
It seems there is no such option. As I said before, converting all .cpp/.h files to UTF-8-BOM is a solution.
Thus, so far the only suitable way is to loop through the directory and rewrite the files in UTF-8, prepending the BOM.
Using C++ wxWidgets, this is (part of) my attempt to automate the process:
//Read in the file, convert its content to UTF-8 if necessary
wxFileInputStream fis(fileFullPath);
wxFile file(fileFullPath);
size_t dataSize = file.Length();
void* data = malloc(dataSize);
if (!fis.ReadAll(data, dataSize))
{
    wxString sErr;
    sErr << "Couldn't read file: " << fileFullPath;
    wxLogError(sErr);
}
else
{
    // Decode the content: wxString::FromUTF8() returns an empty string
    // if the bytes are not valid UTF-8, so fall back to ANSI in that case
    wxString sData = wxString::FromUTF8((const char*)data, dataSize);
    if (sData.empty() && dataSize > 0)
    {
        sData = wxString((const char*)data, dataSize);
    }
    // Detect the BOM on the raw file bytes
    wxBOM bomType = wxConvAuto::DetectBOM((const char*)data, dataSize);
    if (wxBOM_UTF8 != bomType)
    {
        if (wxBOM_None == bomType)
        {
            // Open (and thereby truncate) the file only when actually
            // rewriting it: prepend the UTF-8 BOM, then write the
            // content back as UTF-8
            wxFFileOutputStream out(fileFullPath);
            unsigned char utf8bom[] = { 0xEF, 0xBB, 0xBF };
            out.Write((char*)utf8bom, sizeof(utf8bom));
            const wxScopedCharBuffer buf = sData.utf8_str();
            out.Write(buf.data(), buf.length());
        }
        else
        {
            wxLogError("File already contains a different BOM: " + fileFullPath);
        }
    }
}
free(data);
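For the surrounding loop mentioned above, wxDir::GetAllFiles() could collect the candidate files. A minimal sketch (the project path and file masks are assumptions, not from the original post):
#include <wx/dir.h>
#include <wx/arrstr.h>

// Hypothetical project root; recurse and gather all .cpp and .h files
wxArrayString files;
wxDir::GetAllFiles("C:\\projects\\myapp", &files, "*.cpp");
wxDir::GetAllFiles("C:\\projects\\myapp", &files, "*.h");
for (const wxString& fileFullPath : files)
{
    // ... run the read/convert/rewrite code shown above on fileFullPath ...
}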
Note that this cannot convert all encodings; as far as I know, it can only convert ANSI files to UTF-8 or add the BOM to UTF-8 files without one. For all other encodings, I open the project in VS2019, select the file, and go (freely translated into English; the names might differ):
-> File -> XXX.cpp save as... -> Use the little arrow in the "Save" button -> Save with encoding... -> Replace? Yes! -> "Unicode (UTF-8 with signature) - Codepage 65001"
(Don't take "UTF-8 without signature" which is also Codepage 65001, though!)

The option /utf-8 specifies both the source character set and the execution character set as UTF-8.
See the Microsoft docs on /utf-8.
The Visual C++ team blog explains the character set problem.
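As an illustration (a minimal sketch; the file name and console setup are assumptions, not from the original posts), the flag can be passed on the compiler command line or added under Project Properties -> C/C++ -> Command Line -> Additional Options:
// main.cpp - saved as UTF-8 *without* BOM
// Build with:  cl /utf-8 /EHsc main.cpp
// Without /utf-8, MSVC assumes the active ANSI code page for BOM-less
// files, and the literal below is reinterpreted ("übergabe" -> "Ã¼bergabe").
#include <cstdio>

int main()
{
    std::puts("übergabe"); // displays correctly if the console is UTF-8 (chcp 65001)
    return 0;
}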

Related

How to use UTF-8 encoding in VC++ MFC application

I'm trying to add a "static text component" (a simple label) with UTF-8 encoded text to my Windows application. If I use the designer tool in Visual Studio 2017 and put the text in through the properties, everything looks just fine; but after opening the .rc file, the text is different (bad encoding).
I read that I need to change the encoding of the file to UTF-8 with BOM, but I have nothing like that there... If I change the file encoding to CP1252, the program doesn't compile, so I'm using Unicode (UTF-8 with signature) - Codepage 65001 now.
SetWindowTextA("ěščřžýáíé");
GetDlgItem(IDC_SERIAL_NUMBER_TITLE)->SetWindowTextA("ěščřžýáíé");
This code produces garbled text (a screenshot was attached in the original post), and it renders differently in the title bar and in the label.
The following code works, but only for the title and message boxes:
CStringA utf8 = CW2A(L"ěščřžýáíé", CP_UTF8);  // UTF-16 -> UTF-8
CStringW utf16 = CA2W(utf8, CP_UTF8);         // UTF-8 -> UTF-16
MessageBoxW(0, utf16, 0, 0);
Why is it so complicated? Is it not possible to use UTF-8 text normally?
Can anyone help me solve this problem? Thanks!
You have to save your .rc file with the encoding Unicode - Codepage 1200.
In fact, it was very simple.
// from:
SetWindowTextA("ěščřžýáíé");
// to (wide-character API with a wide string literal):
::SetWindowTextW(m_hWnd, L"ěščřžýáíé");
// also
// std::string -> std::wstring
// CString -> CStringW
// etc.
and that's it :D
The article linked in the original answer was also very helpful and good for understanding what was going on there!

SDL2 loading files with special characters

My problem: a Windows application using SDL2 & SDL2_image opens image files, in order to save them later with modifications to the image data.
When it opens an image whose name contains no special characters (like áéíóúñ; say, "buenos aires.jpg"), it works as intended. But if the name contains any such character (say, "córdoba.jpg"), SDL2_image produces the error "Couldn't open file". Yet if I use a std::ifstream with the exact file name I got from the CSV file ("córdoba.jpg" or "misiónes.jpg"), the ifstream works fine... Is it an error to use the special characters? Do Unicode and UTF have something to do with it?
A little information about the environment: Windows 10 (Spanish, Latin American), SDL2 & SDL2_image (up-to-date versions), GCC (MinGW-w64 7.1.0).
About the software I'm trying to make: it uses a CSV file with the names of various states of Argentina (I already tried changing the encoding of the .CSV). It loads images based on the names found in the CSV, changes them, and saves them.
I know maybe I am missing something basic, but already depleted my resources.
IMG_Load() forwards its file argument directly to SDL_RWFromFile():
// http://hg.libsdl.org/SDL_image/file/8fee51506499/IMG.c#l125
SDL_Surface *IMG_Load(const char *file)
{
    SDL_RWops *src = SDL_RWFromFile(file, "rb");
    const char *ext = SDL_strrchr(file, '.');
    if(ext) {
        ext++;
    }
    if(!src) {
        /* The error message has been set in SDL_RWFromFile */
        return NULL;
    }
    return IMG_LoadTyped_RW(src, 1, ext);
}
And SDL_RWFromFile()'s file argument should be a UTF-8 string:
SDL_RWops* SDL_RWFromFile(const char* file,
                          const char* mode)
Function Parameters:
file: a UTF-8 string representing the filename to open
mode: an ASCII string representing the mode to be used for opening the file; see Remarks for details
So pass UTF-8 paths into IMG_Load().
C++11 has UTF-8 string literal support built-in via the u8 prefix:
IMG_Load( u8"córdoba.jpg" );
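Applied to the question's CSV scenario, a minimal sketch (the CSV name "states.csv" and its one-name-per-line layout are assumptions): as long as the CSV itself is saved as UTF-8, its raw bytes can be passed straight through to IMG_Load().
// Minimal sketch: read UTF-8 file names from a hypothetical "states.csv"
// (one file name per line) and hand them directly to IMG_Load().
#include <SDL.h>
#include <SDL_image.h>
#include <fstream>
#include <string>

int main(int, char**)
{
    SDL_Init(SDL_INIT_VIDEO);
    IMG_Init(IMG_INIT_JPG);

    std::ifstream csv("states.csv"); // the CSV itself must be saved as UTF-8
    std::string name;
    while (std::getline(csv, name))
    {
        if (!name.empty() && name.back() == '\r')
            name.pop_back(); // strip CR left over from CRLF line endings
        // Bytes read from a UTF-8 file are already UTF-8, which is
        // exactly what SDL_RWFromFile() expects
        SDL_Surface* img = IMG_Load(name.c_str());
        if (!img)
            SDL_Log("Couldn't load %s: %s", name.c_str(), IMG_GetError());
        else
            SDL_FreeSurface(img);
    }

    IMG_Quit();
    SDL_Quit();
    return 0;
}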

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú, and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it back in MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF-8 encoded file I get "Ã Ã¨Ã¬Ã²Ã¹Ã¡Ã©..." (each accented letter comes out as "Ã" plus another CP-1252 character, because the UTF-8 bytes are decoded as ANSI).
I would like my function to work with all files regardless of their encoding.
What do I need to implement?
[EDIT]
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly, based on the example provided here. Although it works, I need to pass in the file's encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings, including Unicode in its various flavours.
I have not had a problem with it.
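If pulling in a third-party class is not an option, here is a minimal sketch of the detection approach described in the question's edit (assuming, as stated there, that the input is either UTF-8 or CP-1252, and a Unicode MFC build):
// Minimal sketch: read the raw bytes, sniff a UTF-8 BOM or validate the
// bytes as UTF-8, then convert with the matching code page.
#include <afx.h>   // CFile, CString (Unicode build assumed)
#include <vector>

BOOL ReadAllFileContent(const CString& filePath, CString* fileContent)
{
    CFile input;
    if (!input.Open(filePath, CFile::modeRead))
        return FALSE;

    std::vector<char> bytes((size_t)input.GetLength());
    if (!bytes.empty())
        input.Read(bytes.data(), (UINT)bytes.size());
    input.Close();

    const char* data = bytes.data();
    int len = (int)bytes.size();
    if (len == 0)
        return TRUE;

    UINT codePage = 1252; // default: what Notepad calls ANSI
    if (len >= 3 && (unsigned char)data[0] == 0xEF &&
        (unsigned char)data[1] == 0xBB && (unsigned char)data[2] == 0xBF)
    {
        codePage = CP_UTF8; // UTF-8 BOM found; skip it
        data += 3;
        len -= 3;
    }
    else if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 data, len, NULL, 0) != 0)
    {
        codePage = CP_UTF8; // no BOM, but the bytes are valid UTF-8
    }

    int wideLen = MultiByteToWideChar(codePage, 0, data, len, NULL, 0);
    MultiByteToWideChar(codePage, 0, data, len,
                        fileContent->GetBuffer(wideLen), wideLen);
    fileContent->ReleaseBuffer(wideLen);
    return TRUE;
}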

Define unicode string as byte array

Let's say we have a file main.cpp in Windows-1251 encoding with this content:
int main()
{
    wchar_t* ws = L"котэ"; // cat in Russian
    return 0;
}
Everything is fine if we compile this in Visual Studio, BUT we are going to compile it with GCC, whose default encoding for source code is UTF-8. Of course we can convert the file's encoding or pass the option "-finput-charset=windows-1251" to the compiler, but what if not? There is a way to do it by replacing the raw text with hex UTF-32 bytes:
int main()
{
    wchar_t* ws = (wchar_t*)"\x3A\x04\x00\x00\x3E\x04\x00\x00\x42\x04\x00\x00\x4D\x04\x00\x00\x00\x00\x00\x00"; // cat in Russian
    return 0;
}
But it's kind of ugly: 4 letters become 20 bytes ((
How else can it be done?
What you need is to use a file encoding that is understood by both GCC and VS. It seems to me that saving the file in UTF-8 encoding is the way forward.
Also see: How can I make Visual Studio save all files as UTF-8 without signature on Project or Solution level?
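A minimal illustration of that approach (assuming the file is re-saved as UTF-8; MSVC needs a BOM or the /utf-8 switch, while GCC expects UTF-8 by default):
// main.cpp - saved as UTF-8; compiles identically under GCC and MSVC
// (for MSVC either keep a BOM or build with /utf-8)
int main()
{
    const wchar_t* ws = L"котэ"; // cat in Russian, no byte-array tricks needed
    return 0;
}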

C++ Unicode writing is not working

I am trying to write some Russian Unicode text to a file with wfstream. The following piece of code is used for it:
wfstream myfile;
locale AvailLocale("Russian");
myfile.imbue(AvailLocale);
myfile.open(L"d:\\example.txt", ios::out);
if (myfile.is_open())
{
    myfile << L"доброе утро" << endl;
}
myfile.flush();
myfile.close();
Something unrecognizable is written to the file when this code executes. I am using VS 2008.
I don't know much about imbuing locales in C++, but perhaps it isn't working because the "Russian" locale writes Cyrillic in a single-byte ANSI encoding rather than in Unicode?
If you use std::locale("Russian"), the file will be encoded as "Cyrillic (Windows)" (and not in some Unicode format) when you use VS2008. If you, for example, open it with Internet Explorer and change the encoding to Cyrillic (Windows), the characters become visible.
It is more common to store files in some Unicode format.
When I use:
const std::locale AvailLocale
    = std::locale(std::locale("Russian"), new std::codecvt_utf8<wchar_t>());
the file is stored as UTF-8, and opening it, for example, in Notepad as a UTF-8 file, I see доброе утро.
Similarly, you can use codecvt_utf16 or some other coding scheme to encode the Unicode characters.
codecvt_utf8 requires C++11.
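Put together, a complete version of the above might look like this (a sketch; note that std::codecvt_utf8 was deprecated in C++17 but still works, and the wide-path wofstream constructor is an MSVC extension):
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::wofstream myfile(L"d:\\example.txt"); // wide-path overload is MSVC-specific
    const std::locale utf8Locale(std::locale("Russian"),
                                 new std::codecvt_utf8<wchar_t>());
    myfile.imbue(utf8Locale); // imbue before writing any output
    myfile << L"доброе утро" << std::endl;
    return 0;
}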
Alternatively you can use boost:
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
Or implement something similar yourself (from looking at http://www.boost.org/doc/libs/1_40_0/boost/detail/utf8_codecvt_facet.hpp it doesn't seem that complicated).
Or this library: http://utfcpp.sourceforge.net/