Define unicode string as byte array - c++

Let's say we have a file main.cpp in Windows-1251 encoding with the following content:
int main()
{
    wchar_t* ws = L"котэ"; // "cat" in Russian
    return 0;
}
Everything is fine if we compile this in Visual Studio, but we are going to compile it with GCC, whose default encoding for source code is UTF-8. Of course we can convert the file encoding or pass the "-finput-charset=windows-1251" option to the compiler, but what if we can't? There is a way to do it by replacing the raw text with hex UTF-32 bytes:
int main()
{
    // UTF-32LE bytes of "котэ" plus a four-byte terminator
    // (assumes a 4-byte wchar_t, as with GCC on Linux)
    wchar_t* ws = (wchar_t*)"\x3A\x04\x00\x00\x3E\x04\x00\x00\x42\x04\x00\x00\x4D\x04\x00\x00\x00\x00\x00\x00"; // "cat" in Russian
    return 0;
}
But it's kind of ugly: 4 letters become 20 bytes.
How else can it be done?

What you need is to use a file encoding that is understood by both GCC and VS. It seems to me that saving the file in UTF-8 encoding is the way forward.
Also see: How can I make Visual Studio save all files as UTF-8 without signature on Project or Solution level?
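If converting the files really isn't an option, another portable trick (my addition, not part of the answer above) is to spell the characters as universal character names, which keeps the source pure ASCII so it compiles identically under GCC and VS:

int main()
{
    // \uXXXX escapes name the code points of "котэ"
    // (к = U+043A, о = U+043E, т = U+0442, э = U+044D),
    // so the file itself contains only ASCII characters.
    const wchar_t* ws = L"\u043A\u043E\u0442\u044D"; // "cat" in Russian
    return 0;
}

This is still four escapes for four letters, but unlike the hex-byte cast it is well-defined regardless of the size of wchar_t.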

Related

How to use Unicode characters in x.dump() in nlohmann::json?

I'm trying to use Unicode characters (like משתמש לא רשום/קיים and other languages).
But when I try to do json_object.dump(), it throws an exception:
{_Data={_What=0x00c80b08 "[json.exception.type_error.316] invalid UTF-8 byte at index 1: 0xF9" _DoFree=...} }
Meanwhile, the following minimal example works:
#include <nlohmann/json.hpp>
#include <iostream>

using JSON = nlohmann::json;

int main()
{
    std::string str = "משתמש לא רשום/קיים";
    JSON json_obj;
    json_obj["message"] = str;
    std::cout << json_obj.dump();
}
For some reason it works in the "minimal reproducible example", but not in my project. Maybe the problem is somewhere else, but I don't know where...
In one sentence: how do I support other languages' characters in nlohmann::json?
nlohmann::json works with UTF-8 std::strings regardless of your platform encoding.
That means you should either encode your source files as UTF-8 or convert the content of str to UTF-8 at runtime.
The same applies to every string you pass to or get back from nlohmann.
On Unix, the platform encoding is generally UTF-8.
If your program runs on Windows, however, the debugger and system APIs probably expect std::strings to hold ANSI-encoded text, which means you'll have to do manual conversions.
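For the runtime-conversion route on Windows, here is a minimal sketch of my own (AnsiToUtf8 is a hypothetical helper, not part of nlohmann); the Win32 conversion APIs have no direct ANSI-to-UTF-8 call, so it hops through UTF-16:

#include <windows.h>
#include <string>

// Convert text from the active ANSI code page to UTF-8 via UTF-16.
std::string AnsiToUtf8(const std::string& ansi)
{
    // ANSI -> UTF-16; with -1 the returned length includes the final NUL.
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, &wide[0], wlen);

    // UTF-16 -> UTF-8.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                        &utf8[0], len, nullptr, nullptr);

    utf8.resize(len - 1); // drop the NUL the API wrote into the buffer
    return utf8;
}

With that in place, json_obj["message"] = AnsiToUtf8(str); would hand nlohmann valid UTF-8 even when str came from an ANSI source.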

VS2019 compiler misinterprets UTF8 without BOM file as ANSI

I used to compile my C++ wxWidgets-3.1.1 application (Win10x64) with VS2015 Express. I wanted to upgrade my IDE to VS2019 community, which seemed to work quite well.
My project files are partly from older projects, so their encoding differs (Windows-1252, UTF-8 without BOM, ANSI).
With VS2015 I was able to compile and output messages (hardcoded in my .cpp files) which displayed Unicode characters correctly.
The same app compiled with VS2019 Community shows, for example, the German word "übergabe" as "Ã¼bergabe", which is uninterpreted UTF-8.
Saving the .cpp file which contains the Unicode explicitly as UTF-8 WITH BOM solves this issue. But I don't want to run through all files in all projects. Can I change the expected input for "without BOM" files to UTF-8 to get the same behaviour that VS2015 had?
[EDIT]
It seems there is no such option. As I said before, converting all .cpp/.h files to UTF-8-BOM is a solution.
Thus, so far the only suitable way is to loop through the directory, rewrite the files in UTF-8 while prepending the BOM.
Using C++ wxWidgets, this is (part of) my attempt to automate the process:
//Read in the file, convert its content to UTF-8 if necessary
wxFileInputStream fis(fileFullPath);
wxFile file(fileFullPath);
size_t dataSize = file.Length();
void* data = malloc(dataSize);
if (!fis.ReadAll(data, dataSize))
{
    wxString sErr;
    sErr << "Couldn't read file: " << fileFullPath;
    wxLogError(sErr);
}
else
{
    wxString sData((char*)data, dataSize);
    wxString sUTF8Data;
    // If the raw bytes do not decode as UTF-8, treat them as ANSI and convert.
    if (wxEmptyString == wxString::FromUTF8(sData))
    {
        sUTF8Data = sData.ToUTF8();
    }
    else
    {
        sUTF8Data = sData;
    }
    wxFFileOutputStream out(fileFullPath);
    wxBOM bomType = wxConvAuto::DetectBOM(sUTF8Data, sUTF8Data.size());
    if (wxBOM_UTF8 != bomType)
    {
        if (wxBOM_None == bomType)
        {
            // Prepend the UTF-8 BOM before the content is rewritten.
            unsigned char utf8bom[] = { 0xEF, 0xBB, 0xBF };
            out.Write((char*)utf8bom, sizeof(utf8bom));
        }
        else
        {
            wxLogError("File already contains a different BOM: " + fileFullPath);
        }
    }
}
free(data); // the malloc'd buffer was leaked in the original excerpt
Note that this cannot convert all encodings; as far as I know it can only convert ANSI files or add the BOM to UTF-8 files without one. For all other encodings, I open the project in VS2019, select the file and go (freely translated into English, names might differ):
-> File -> XXX.cpp save as... -> Use the little arrow in the "Save" button -> Save with encoding... -> Replace? Yes! -> "Unicode (UTF-8 with signature) - Codepage 65001"
(Don't take "UTF-8 without signature" which is also Codepage 65001, though!)
The option /utf-8 specifies both the source character set and the execution character set as UTF-8.
Check the Microsoft docs and the C++ team blog post that explains the charset problem.
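For illustration, enabling it could look like this (the exact project layout is my assumption): on the command line,

cl /utf-8 /EHsc main.cpp

and in the IDE, under Project Properties -> C/C++ -> Command Line -> Additional Options, add /utf-8 for all configurations, so every translation unit is read and compiled as UTF-8.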

Convert utf8 wstring to string on windows in C++

I am representing folder paths with boost::filesystem::path, which is a wstring on Windows, and I would like to convert it to std::string with the following method:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv1;
shared_dir = conv1.to_bytes(temp.wstring());
but unfortunately the result for the following text is this:
"c:\git\myproject\bin\árvíztűrőtükörfúrógép" ->
"c:\git\myproject\bin\Ã¡rvÃ­ztÅ±rÅ‘tÃ¼kÃ¶rfÃºrÃ³gÃ©p"
What am I doing wrong?
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // wide character data
    std::wstring wstr = L"árvíztűrőtükörfúrógép";

    // wide to UTF-8
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string str = conv1.to_bytes(wstr);
}
I was checking the value of the variable in Visual Studio debug mode.
The code is fine.
You're taking a wstring that stores UTF-16 encoded data, and creating a string that stores UTF-8 encoded data.
I was checking the value of the variable in Visual Studio debug mode.
Visual Studio's debugger has no idea that your string stores UTF-8. A string just contains bytes. Only you (and people reading your documentation!) know that you put UTF-8 data inside it. You could have put something else inside it.
So, in the absence of anything more sensible to do, the debugger just renders the string as ASCII*. What you're seeing is the ASCII* representation of the bytes in your string.
Nothing is wrong here.
If you were to output the string like std::cout << str, and if you were running the program in a command line window set to UTF-8, you'd get your expected result. Furthermore, if you inspect the individual bytes in your string, you'll see that they are encoded correctly and hold your desired values.
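For instance, a quick self-contained check of both claims might look like this (my own sketch; the literal is a hypothetical sample, and SetConsoleOutputCP assumes a Windows console):

#include <windows.h>
#include <cstdio>
#include <string>

int main()
{
    std::string str = "\xC3\xA1rv\xC3\xADz"; // UTF-8 bytes of "árvíz"

    SetConsoleOutputCP(CP_UTF8);       // make the console decode UTF-8
    std::printf("%s\n", str.c_str());  // prints the accented characters

    for (unsigned char c : str)        // raw bytes: 'á' shows up as C3 A1
        std::printf("%02X ", c);
}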
You can push the IDE to decode the string as UTF-8, though, on an as-needed basis: in the Watch window type str,s8; or, in the Command window, type ? &str[0],s8. These techniques are explored by Giovanni Dicanio in his article "What's Wrong with My UTF-8 Strings in Visual Studio?".
* It's not even really ASCII; it'll be some 8-bit encoding decided by your system, most likely code page Windows-1252 given the platform. ASCII only defines the lower 7 bits. Historically, the various 8-bit code pages have been colloquially (if incorrectly) called "extended ASCII" in various settings. But the point is that the multi-byte nature of the data is not at all considered by the component rendering the string to your screen, let alone specifically its UTF-8-ness.

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = FALSE;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF-8 encoded file I get Ã Ã¨Ã¬Ã²Ã¹Ã¡Ã©Ã­Ã³Ãº.
I would like my function to work with all files regardless of their encoding.
Why do I need to handle the encoding myself?
EDIT:
Unfortunately, in the real app the files come from an external application, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings, including Unicode in its various flavours.
I have not had a problem with it.
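Since the files' encoding isn't known in advance, here is a minimal detection sketch of my own (plain Win32/STL rather than CTextFileDocument; ReadAllText is a hypothetical name): check for a UTF-8 BOM, otherwise validate the bytes as UTF-8, and fall back to CP-1252:

#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

// Decode a file of unknown encoding: UTF-8 if it carries a BOM or
// validates as UTF-8, otherwise assume Windows-1252 ("ANSI").
std::wstring ReadAllText(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    if (bytes.empty())
        return std::wstring();

    size_t skip = 0;
    UINT codePage = 1252; // default: Windows-1252
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        skip = 3;           // UTF-8 BOM present
        codePage = CP_UTF8;
    } else if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   bytes.data(), (int)bytes.size(),
                                   nullptr, 0) != 0) {
        codePage = CP_UTF8; // the bytes form valid UTF-8
    }

    int wlen = MultiByteToWideChar(codePage, 0, bytes.data() + skip,
                                   (int)(bytes.size() - skip), nullptr, 0);
    std::wstring text(wlen, L'\0');
    MultiByteToWideChar(codePage, 0, bytes.data() + skip,
                        (int)(bytes.size() - skip), &text[0], wlen);
    return text;
}

Pure-ASCII content decodes identically either way, so the heuristic only misfires on the rare CP-1252 text that also happens to be valid multi-byte UTF-8.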

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file starts with the BOM bytes 0xFF 0xFE and then contains ASCII characters with null bytes between them (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as the input format and UTF-8 as the output format... it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline – while it reads the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at boost and ICU as well as numerous Google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and GCC. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since C++11 the standard library provides codecvt_utf16, so one can just use that (see the sketch after this list)
However, on older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case: the output was missing lines, which were replaced by garbage characters.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting the task in Ruby and spawning a process (it's only used in a test, so I think that kind of complication is acceptable there).
Hope this spares others some time, happy to help.
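A minimal sketch of the C++11 route (my own illustration; "report.txt" and the substr offsets are placeholders, and note that <codecvt> was later deprecated in C++17):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream src("report.txt", std::ios::binary);
    // Decode the byte stream as little-endian UTF-16 and swallow the BOM,
    // so getline() yields whole characters instead of interleaved bytes.
    src.imbue(std::locale(src.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>));

    std::wstring line;
    while (std::getline(src, line))
    {
        std::wstring field1 = line.substr(12, 12); // offsets now count characters
        // ... parse the remaining fields ...
    }
}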
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>

using namespace std;

int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3, 5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.