C++ doesn't save UTF-8 encoded file - c++

I'm working on software that needs to save a UTF-8 file containing special characters (like 'çäüëé').
I receive the content to save (a regular string with the special characters encoded) from a web service (via gSOAP). When I try to save it using ofstream, the file contains a square and other strange characters instead of the special characters.
When I try to convert the regular string to a wide string, the special characters are lost (they are replaced by different ones). And when I use wofstream, the file is not written at all when there are special characters.
I also tried utf8-cpp, but the file wasn't written correctly either.

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

I have code that saves the log as a text file.
It usually works well, but I found a case where it doesn't work:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is simple logic that saves the log string as a text file.
It works well when the log is in English, but there is a problem when the log contains Korean.
After various experiments, I confirmed that Korean text would not be a problem if the file could be saved in UTF-8 format.
I think that if Korean is included in the log string, C++ saves the file in ANSI format by default.
This is my c++ code:
string logFilePath = {path};  // {path} stands for the real log file path
string log = "{\"Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);  // open in append mode
output << log << endl;
output.close();
Is there a way to save log files as UTF-8, or is there any other good approach?
Please give me some advice.
You could set UTF-8 in File->Advanced Save Options.
If you do not find it, you could add Advanced Save Options in Tools->Customize->Commands->Add Command..->File.
TL;DR: write 0xEF 0xBB 0xBF (the 3-byte UTF-8 BOM) at the beginning of the file before writing out your string.
One of the hints that text viewers use to determine whether a file should be shown as Unicode is the Byte Order Mark (BOM for short). It is a series of bytes at the beginning of a text stream that identifies the encoding and endianness of the text. For UTF-8 it is the three bytes 0xEF 0xBB 0xBF.
You can experiment with this by opening Notepad, writing a single character, and saving the file in ANSI format. Then look at the size of the file in bytes. It will be 1 byte. Now open the file, save it as UTF-8, and look at the size again. It will be 4 bytes: three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in a hex editor.
That being said, you may need to insert these bytes into your files before writing your string to them. Why UTF-8, you may ask? It depends on the encoding of the original string (your std::string log), which in this case is a string literal written in a source file whose encoding is (most likely) UTF-8. Therefore the bytes that make up the string follow this encoding and are put into your executable.
Note that std::string can contain a Unicode string; it just can't make sense of it. For example, it reports the length in bytes rather than in characters. But it can carry a Unicode string around just fine.
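The BOM suggestion above could be sketched roughly as follows. This is only a minimal illustration, assuming the log string already holds UTF-8 bytes; the helper name appendUtf8Line is hypothetical and not part of the original code.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (not from the original post): appends one line to a log
// file and writes the UTF-8 BOM first if the file is missing or empty.
// Assumes `line` already contains UTF-8 encoded bytes.
bool appendUtf8Line(const std::string& path, const std::string& line) {
    // Probe the current size so the BOM is written only once per file.
    bool needBom = true;
    {
        std::ifstream probe(path, std::ios::binary | std::ios::ate);
        if (probe && static_cast<std::streamoff>(probe.tellg()) > 0) {
            needBom = false;
        }
    }

    std::ofstream out(path, std::ios::app);
    if (!out) {
        return false;
    }
    if (needBom) {
        out.write("\xEF\xBB\xBF", 3);  // the three BOM bytes for UTF-8
    }
    out << line << '\n';
    return static_cast<bool>(out);
}
```

Calling appendUtf8Line(logFilePath, log) in place of the ofstream lines in the question would produce a file that starts with the BOM, which editors such as Notepad use as a hint to pick UTF-8.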

Reading input from file with Chinese Characters that got mangled

I'm stuck trying to convert an input string in a char* back to its Chinese character encoding. An application accepts a Chinese string as input, e.g. "啊说到", and when it is written to a file it turns into "°¡Ëµµ½". I'm able to take this input and feed it to _mbstowcs_s_l(), but the solution needs to be locale independent, so I'm forced to use either mbstowcs() or WideCharToMultiByte(). It looks like both would work for me if the input had already gone through an MBCS to UTF-8 conversion, which in our case it hasn't.
The project uses the Multi-Byte Character Set, and I'm struggling to understand what is going on. One other thing: the input comes from a different application, which stores it in a file.
The application that accepts the Chinese input is an MFC application set to the Multi-Byte Character Set, and the OS was set to the Chinese (Simplified) region. The UI accepts the input and places it in a CString, which is copied to a char*. This is the part where I don't know what is going on with the encoding. This application stores the string in a file, and then we read it with the other application. The string is read into a char*, and that is when the characters turn into "°¡Ëµµ½".
The question is: how can I turn this encoded char* "°¡Ëµµ½" back into its Chinese form "啊说到" without setting the locale in _mbstowcs_s_l()? The problem is that we could be reading strings from other regional settings, and the application wouldn't know which character map to use unless we tell it.

Read text-file in C++ with fopen without linefeed conversion

I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (Carriage return + Linefeed will automatically be converted into Linefeed; Short "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text files, which get corrupted when interpreted as ANSI characters). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware, that the reverse conversion would happen, if I write it through the same file, but the string is sent to another application that expects Windows-style line-endings.
The difference between opening files in text and binary mode is exactly the handling of line-end sequences: text mode translates them, binary mode leaves them untouched. Nothing more, nothing less. Since the ASCII characters use the same code points in Unicode and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file), whether you use binary or text mode won't affect the other bytes.
It may be worth having a look at James McNellis's "Unicode in C++" presentation at C++Now 2014.
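To illustrate the point about binary mode, here is a minimal sketch that reads a whole file with fopen in "rb" mode. The helper name readFileRaw is hypothetical; the sketch only assumes that the caller treats the resulting std::string as UTF-8 bytes.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original post): reads a whole file in
// binary mode, so "\r\n" stays "\r\n". UTF-8 bytes pass through unchanged
// because fopen/fread do no character decoding at all.
bool readFileRaw(const char* path, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }
    out.clear();
    char buf[4096];
    std::size_t n;
    // fread returns the number of bytes actually read; 0 means EOF or error.
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        out.append(buf, n);
    }
    const bool ok = (std::ferror(f) == 0);
    std::fclose(f);
    return ok;
}
```

The string filled in by readFileRaw keeps its Windows-style line endings and can be passed on to the other application as is.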

How to read the … character and french accents from a text file

I am given a text file that contains a couple of characters per line. I have to read it line by line and apply a lexical analyzer to each character. Then I write my analysis to another file.
With the following code, I have no problem reading French accents, but I realized that the character '…' (this is one character, not 3 dots) is turned into a '&'.
Note: My lexical analyzer must use strings, that's why I converted back the wstring to a string.
wfstream SourceFile;
ofstream ResultFile (ResultFileName);
locale utf8_locale(std::locale(), new codecvt_utf8<wchar_t>);
SourceFile.imbue(utf8_locale);
SourceFile.open(SourceFileName);
while(getline(SourceFile, wLineBuffer))
{
string LineBuffer( wLineBuffer.begin(), wLineBuffer.end() );
...
Edit: Raymond Chen figured that the character is lost because of my conversion from wstring to string.
So the new question is: how do I convert from a wstring to a string without transforming the characters?
Edit: file sample
"stringééé"
"ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"
Identificateur1
Identificateur2
// Commentaire22
/**/
/*
Autre commentaire
…
*/
You need a proper Unicode support library. Forget using the broken Standard functions. They were not designed to support Unicode, don't support Unicode, and cannot be extended to support it properly. Look into using ICU or Boost.Locale or something like that.
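If pulling in ICU or Boost.Locale is not an option, the wstring to string step can also be written by hand. The following is only a sketch, not a replacement for a real Unicode library: the function name toUtf8 is hypothetical, it assumes wchar_t holds either UTF-32 code points or UTF-16 code units, and it does not validate its input (an unpaired surrogate is encoded as is).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): encodes a wide string as
// UTF-8 bytes stored in a std::string. Surrogate pairs are combined when
// wchar_t is 16 bits wide; no further validation is performed.
std::string toUtf8(const std::wstring& ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(ws[i]);

        // Combine a UTF-16 high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            const unsigned long lo = static_cast<unsigned long>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80) {            // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Replacing the iterator-range construction in the question with string LineBuffer = toUtf8(wLineBuffer); would keep '…' as the three bytes E2 80 A6 instead of truncating each wchar_t to a single char. The lexical analyzer then has to cope with multi-byte sequences.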

URL encoding for multibyte character string in c++

I am trying to achieve URL encoding for some of my strings in C++. The strings can contain multibyte characters like ™, ®, ©, etc.
Input text: Something ™
Output should be: Something%20%E2%84%A2
I can achieve URL encode or decode in JS with encodeURIComponent and decodeURIComponent,
but I have some native code in c++ and hence need to encode some text via c++.
Any help here would be a great relief for me.
It's not too hard to do manually if you can't find a library. First encode the string as UTF-8 (there are other posts on SO about using the standard library to do that if the string is in another encoding), and then replace every byte with a value above 127, and every character that's restricted in URLs, with its percent encoding (a percent sign followed by the two hexadecimal digits representing the byte's value).
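The manual approach described above might look like the following sketch. It assumes the input is already UTF-8, and it treats only the RFC 3986 unreserved characters (letters, digits, '-', '_', '.', '~') as safe, which is slightly stricter than JavaScript's encodeURIComponent. The function name percentEncode is hypothetical.

```cpp
#include <string>

// Hypothetical helper (not from the original post): percent-encodes a UTF-8
// string byte by byte. Unreserved ASCII characters are copied through; every
// other byte, including all bytes above 127, becomes %XX.
std::string percentEncode(const std::string& utf8) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : utf8) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

For the input from the question, "Something ™" stored as UTF-8, the space becomes %20 and the three bytes of ™ become %E2%84%A2.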