Working with UTF-8 std::string objects in C++

I'm using Visual Studio and C++ on Windows to work with small caps text like ʜᴇʟʟᴏ ꜱᴛᴀᴄᴋᴏᴠᴇʀꜰʟᴏᴡ generated with e.g. this website. Whenever I read this text from a file or put it directly into my source code as a std::string, the text visualizer in Visual Studio shows it in the wrong encoding, presumably because the visualizer uses the Windows (ANSI) code page. How can I force Visual Studio to let me work with UTF-8 strings properly?
std::string message_or_file_path = "...";
auto message = message_or_file_path;
// If the file path is valid, read from that file
if (GetFileAttributes(message_or_file_path.c_str()) != INVALID_FILE_ATTRIBUTES
    && GetLastError() != ERROR_FILE_NOT_FOUND)
{
    std::ifstream file_stream(message_or_file_path);
    std::string text_file_contents((std::istreambuf_iterator<char>(file_stream)),
                                   std::istreambuf_iterator<char>());
    message = text_file_contents;                       // Displayed in wrong encoding
    message = "ʜᴇʟʟᴏ ꜱᴛᴀᴄᴋᴏᴠᴇʀꜰʟᴏᴡ";                    // Displayed in wrong encoding
    std::wstring wide_message = L"ʜᴇʟʟᴏ ꜱᴛᴀᴄᴋᴏᴠᴇʀꜰʟᴏᴡ"; // Displayed in correct encoding
}
I tried the additional /utf-8 command-line option for compiling, and setting the locale:
std::locale::global(std::locale(""));
std::cout.imbue(std::locale());
Neither of those fixed the encoding issue.

From "What's Wrong with My UTF-8 Strings in Visual Studio?", there are a couple of ways to see the contents of a std::string with UTF-8 encoding.
Let's say you have a variable with the following initialization:
std::string s2 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9f\x8d\x8c"; // UTF-8 bytes for "zß水🍌"
1. Use a Watch window. Add the variable to a Watch, then append ,s8 to the variable name to display its contents as UTF-8. (Screenshot from Visual Studio 2015 omitted.)
2. Use the Command Window. Type ? &s2[0],s8 to display the text as UTF-8. (Screenshot from Visual Studio 2015 omitted.)

A working solution was simply rewriting all std::strings as std::wstrings and adjusting the code logic to work with std::wstring, as the question itself already hints at (the std::wstring literal was the one displayed correctly). Now everything works as expected.
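For illustration, a minimal sketch of what the rewritten snippet might look like (the std::codecvt_utf8_utf16 facet for decoding the file's bytes as UTF-8, and the wrapping main(), are my additions, not part of the original question):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <windows.h>

int main()
{
    std::wstring message_or_file_path = L"...";
    auto message = message_or_file_path;
    // If the file path is valid, read from that file (note the explicit W API)
    if (GetFileAttributesW(message_or_file_path.c_str()) != INVALID_FILE_ATTRIBUTES
        && GetLastError() != ERROR_FILE_NOT_FOUND)
    {
        // MSVC's wifstream accepts wide paths directly
        std::wifstream file_stream(message_or_file_path);
        // Decode the file's UTF-8 bytes into UTF-16 wchar_t while reading
        file_stream.imbue(std::locale(file_stream.getloc(),
                                      new std::codecvt_utf8_utf16<wchar_t>));
        std::wstring text_file_contents((std::istreambuf_iterator<wchar_t>(file_stream)),
                                        std::istreambuf_iterator<wchar_t>());
        message = text_file_contents;
    }
    message = L"ʜᴇʟʟᴏ ꜱᴛᴀᴄᴋᴏᴠᴇʀꜰʟᴏᴡ"; // wide literal, shown correctly in the debugger
}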

Related

VS2019 compiler misinterprets UTF8 without BOM file as ANSI

I used to compile my C++ wxWidgets-3.1.1 application (Win10x64) with VS2015 Express. I wanted to upgrade my IDE to VS2019 community, which seemed to work quite well.
My project files are partly from older projects, so their encoding differs (Windows-1252, UTF-8 without BOM, ANSI).
With VS2015 I was able to compile and output messages (hardcoded in my .cpp files) that displayed Unicode characters correctly.
The same app compiled with VS2019 Community shows, for example, the German word "übergabe" as "Ã¼bergabe", i.e. uninterpreted UTF-8.
Saving the .cpp file that contains the Unicode explicitly as UTF-8 WITH BOM solves this issue. But I don't want to run through all files in all projects by hand. Can I change the expected encoding of a "without BOM" file to UTF-8, to get the same behaviour VS2015 had?
[EDIT]
It seems there is no such option. As I said before, converting all .cpp/.h files to UTF-8 with BOM is a solution.
Thus, so far the only suitable way is to loop through the directory and rewrite the files as UTF-8, prepending the BOM.
Using C++ wxWidgets, this is (part of) my attempt to automate the process:
//Read in the file, convert its content to UTF8 if necessary
wxFileInputStream fis(fileFullPath);
wxFile file(fileFullPath);
size_t dataSize = file.Length();
void* data = malloc(dataSize);
if (!fis.ReadAll(data, dataSize))
{
    wxString sErr;
    sErr << "Couldn't read file: " << fileFullPath;
    wxLogError(sErr);
}
else
{
    wxString sData((char*)data, dataSize);
    wxString sUTF8Data;
    // FromUTF8() yields an empty string if the data is not valid UTF-8;
    // in that case treat the file as ANSI and convert it.
    if (wxEmptyString == wxString::FromUTF8(sData))
    {
        sUTF8Data = sData.ToUTF8();
    }
    else
    {
        sUTF8Data = sData;
    }
    wxFFileOutputStream out(fileFullPath);
    wxBOM bomType = wxConvAuto::DetectBOM(sUTF8Data, sUTF8Data.size());
    if (wxBOM_UTF8 != bomType)
    {
        if (wxBOM_None == bomType)
        {
            // Prepend the UTF-8 BOM
            unsigned char utf8bom[] = { 0xEF, 0xBB, 0xBF };
            out.Write((char*)utf8bom, sizeof(utf8bom));
        }
        else
        {
            wxLogError("File already contains a different BOM: " + fileFullPath);
        }
    }
}
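The excerpt above only writes the BOM; presumably the omitted part writes the (possibly converted) text back right after it, roughly like this (my assumption, not part of the original code):

// Hypothetical continuation: write the converted text back after the BOM
const wxScopedCharBuffer utf8 = sUTF8Data.utf8_str();
out.Write(utf8.data(), utf8.length());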
Also note that this cannot convert all encodings; as far as I know it can basically only convert ANSI files or add the BOM to UTF-8 files without one. For all other encodings, I open the project in VS2019, select the file and go (freely translated into English, names might differ):
-> File -> XXX.cpp save as... -> use the little arrow on the "Save" button -> Save with encoding... -> Replace? Yes! -> "Unicode (UTF-8 with signature) - Codepage 65001"
(Don't take "UTF-8 without signature", which is also codepage 65001, though!)
The option /utf-8 specifies both the source character set and the execution character set as UTF-8.
See the Microsoft docs and the C++ team blog post that explains the charset problem.
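As a quick illustration (my own example, not from the linked docs), with /utf-8 a narrow string literal keeps its UTF-8 bytes, and SetConsoleOutputCP lets a console window decode them:

// utf8_demo.cpp - compile with: cl /utf-8 /EHsc utf8_demo.cpp
#include <cstdio>
#include <string>
#include <windows.h>

int main()
{
    // With /utf-8, these bytes are genuine UTF-8 rather than a
    // Windows-1252 re-interpretation of the source file's bytes.
    std::string s = "übergabe";
    SetConsoleOutputCP(CP_UTF8); // tell the console to decode UTF-8
    std::printf("%s\n", s.c_str());
}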

Convert utf8 wstring to string on windows in C++

I am representing folder paths with boost::filesystem::path, which is a wstring on Windows, and I would like to convert it to std::string with the following method:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());
but unfortunately the result is mojibake; the UTF-8 bytes are shown as if they were ANSI:
"c:\git\myproject\bin\árvíztűrőtükörfúrógép" ->
"c:\git\myproject\bin\Ã¡rvÃ­ztÅ±rÅ‘tÃ¼kÃ¶rfÃºrÃ³gÃ©p"
What am I doing wrong?
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // wide character data
    std::wstring wstr = L"árvíztűrőtükörfúrógép";

    // wide to UTF-8
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);
}
I was checking the value of the variable in Visual Studio debug mode.
The code is fine.
You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.
"I was checking the value of the variable in Visual Studio debug mode."
Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.
So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.
Nothing is wrong here.
If you were to output the string like std::cout << str, and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.
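If you want to check the bytes, here is a minimal sketch (reusing the names from the code above; I use std::codecvt_utf8_utf16 here, matching the first snippet in the question) that dumps the converted string in hex:

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    std::wstring wstr = L"árvíztűrőtükörfúrógép";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);

    // 'á' should show up as C3 A1, 'ű' as C5 B1, and so on
    for (unsigned char c : str)
        std::printf("%02X ", c);
    std::printf("\n");
}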
You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window type str,s8; or, in the Command window, type ? &str[0],s8. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".
* It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely the Windows-1252 code page given the platform. ASCII only defines the lower 7 bits; historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not considered at all by the component rendering the string to your screen, let alone its UTF-8-ness specifically.

How to use UTF-8 encoding in VC++ MFC application

I'm trying to add a "static text" control (a simple label) with UTF-8-encoded text to my Windows application. If I use the designer tool in Visual Studio 2017 and set the text through the properties, everything looks just fine; but after opening the .rc file, the text is different (bad encoding).
I read that I need to change the encoding of the file to UTF-8 with BOM, but I see no such option there... If I change the file encoding to CP1252, the program does not compile, so I'm using "Unicode (UTF-8 with signature) - Codepage 65001" now.
SetWindowTextA("ěščřžýáíé");
GetDlgItem(IDC_SERIAL_NUMBER_TITLE)->SetWindowTextA("ěščřžýáíé");
this code will do this (screenshot omitted): it renders two different results, one in the title and one in the label.
And this code works only for the title & message boxes:
CStringA utf8 = CW2A(L"ěščřžýáíé", CP_UTF8); // wide (UTF-16) -> UTF-8 bytes
CStringW utf16 = CA2W(utf8, CP_UTF8);        // UTF-8 bytes -> wide (UTF-16)
MessageBoxW(0, utf16, 0, 0);
Why is it so complicated? Is it not possible to use UTF-8 text normally?
Can anyone help me solve this problem? Thanks!
You have to save your .rc file with the encoding "Unicode - Codepage 1200" (UTF-16).
In fact, it was very simple.
// from:
SetWindowTextA("ěščřžýáíé");
// to:
::SetWindowTextW(m_hWnd, L"ěščřžýáíé");
// and likewise throughout:
// std::string -> std::wstring
// CString     -> CStringW
// etc.
and that's it :D
and also this link was very helpful and good for understanding what was going on there!
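Putting it together, a short sketch of the wide-character calls inside an MFC dialog (the control ID comes from the question; the explicit :: calls and GetSafeHwnd() are my phrasing):

// Wide-character Win32 calls work regardless of the project's character set
::SetWindowTextW(m_hWnd, L"ěščřžýáíé");                      // dialog title
::SetWindowTextW(GetDlgItem(IDC_SERIAL_NUMBER_TITLE)->GetSafeHwnd(),
                 L"ěščřžýáíé");                              // static label
::MessageBoxW(m_hWnd, L"ěščřžýáíé", L"Title", MB_OK);        // message box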

Using std::string in HDF5 creates unreadable output

I'm currently using HDF5 1.8.15 on Windows 7 64-bit.
The source code of my software is saved in files using UTF-8 encoding.
As soon as I call any HDF5 function accepting std::string, the output gets cryptic.
But if I use const char* instead of std::string, everything works fine. This also applies to the filename.
Here is a short sample:
std::string filename_ = "test.h5";
H5::H5File file(filename_.c_str(), H5F_ACC_TRUNC); // works
H5::H5File file(filename_, H5F_ACC_TRUNC);         // filename is not readable, or
                                                   // HDF5 throws an exception
I guess this problem is caused by different encodings used in my source files and in HDF5, but I'm not sure, and I have found no solution that allows the use of std::string. I would appreciate any idea that helps me with this issue.
I also had the same problem, and fixed it by changing all my std::string / H5std_string arguments to literals:
H5File file("myFile.h5", H5F_ACC_TRUNC);
Or use string.c_str() to turn the string into a const char*.
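In other words, a minimal sketch of both variants (the file names are placeholders of my own):

#include <string>
#include "H5Cpp.h"

int main()
{
    // Variant 1: pass a string literal (a const char*) directly
    H5::H5File f1("literal.h5", H5F_ACC_TRUNC);

    // Variant 2: keep the std::string but hand HDF5 a const char*
    std::string filename = "from_string.h5";
    H5::H5File f2(filename.c_str(), H5F_ACC_TRUNC);
}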
I had exactly the same problem. The cause was that I was building in Debug mode in Visual Studio, whereas the libraries I linked against were built in Release mode. When I switched Visual Studio to Release mode, the above error disappeared.

MSVC++ 2008 text file parsing difference between win7 <-> xp

Work environment
Windows 7 Ultimate X64 SP1
Microsoft Visual Studio 2008 SP1 (C++)
Text file
My program parses a plain text file while running. The following is that text file.
/*main.txt*/
_function main()
{
a = 3; //a is 3
b = "three";
c = 4.3; //automatic typecast
_outd(b+"3\n");
_outd(b+a+c);
d = a/c;
_outd("\n"+d+"\n"); ///test commantation
/*//
comment
*/
}
It is a kind of simple C-style script that can be edited in a text editor like Notepad (the contents of this file are not important).
Code
The file-opening code:
std::wstring wsFileName; // the filename comes from elsewhere
char* sz_FileName = new char[wsFileName.size()+1];
ZeroMemory(sz_FileName, sizeof(char)*(wsFileName.size()+1));

// For using 'fopen', convert wsFileName to a narrow (UTF-8) string
WideCharToMultiByte(CP_UTF8, 0, wsFileName.c_str(), -1,
                    sz_FileName, wsFileName.size(), NULL, NULL);

// Open the text file in text mode
FILE* fp = fopen(sz_FileName, "rt");
std::string ch;
while (!feof(fp))
{
    ch.push_back(fgetc(fp)); // read the file character by character
}
Problem
Watching std::string ch in the debugger (screenshot omitted): that is exactly the result I expected on Windows 7, and the released executable works fine too. But on Windows XP SP3 it triggers a runtime error.
I dug around for a whole night and finally found something. I installed Visual Studio 2008 SP1 on Windows XP SP3 and debugged there, watching std::string ch again (screenshot omitted).
I guessed the "newline mark" was the reason the program crashed on Windows XP, and that the problem stems from a text encoding difference between Windows 7 and Windows XP. So I made a new script text file using Windows XP's Notepad and tried to parse it, but it showed the same problem. So I know this: the problem occurs not only with the text file originating from Windows 7 but also with text newly generated on Windows XP. I don't know why it happens.
Question
How can my program read a plain text file (generated by Notepad, EditPlus, UltraEdit, or any other editor) on any Windows OS (XP, Vista, 7) with the same result?
I want each user to be able to edit the script that controls and customizes my program on their own OS, using their own simple text editor.
Why not test each char before adding it to the string, to check whether it is an unwanted character? E.g. something like:
int temp;
while ((temp = fgetc(fp)) != EOF)
{
    if (temp != '\r')           // skip carriage returns
        ch.push_back((char)temp);
}
I haven't tested the code, but it should work if the "wrong" char is indeed '\r'.
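Building on that, a slightly more defensive sketch (my variant, not the answerer's code): it opens the file in binary mode so every byte is visible, stops on EOF instead of relying on feof(), and strips '\r':

#include <cstdio>
#include <string>

std::string read_text_file(const char* path)
{
    std::string contents;
    if (FILE* fp = std::fopen(path, "rb"))     // binary mode: CR bytes stay visible
    {
        int c;
        while ((c = std::fgetc(fp)) != EOF)    // avoids the feof() off-by-one
        {
            if (c != '\r')                     // normalize CRLF to LF
                contents.push_back(static_cast<char>(c));
        }
        std::fclose(fp);
    }
    return contents;
}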