How do I decode UTF-8? - c++

I have a UTF-8-encoded string.
This string is first saved to a file and then sent via Apache to a process written in C++, which receives it using Curl.
How can I decode the string in the C++ process?

There is a very good article on CodeProject that shows how to read UTF-8. Alternatively, the UTF8-CPP library at http://utfcpp.sourceforge.net/ also provides functions for this (see "C++ & Boost: encode/decode UTF-8").
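To make "decoding" concrete, here is a minimal hand-written sketch that turns a UTF-8 byte string into Unicode code points. It is not the API of UTF8-CPP or of the CodeProject article; the function name decode_utf8 and the choice to throw on malformed input are assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into Unicode code points (UTF-32).
// Hypothetical helper: throws std::runtime_error on malformed input.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and carries 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates, and values above U+10FFFF.
        static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid UTF-8 code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, decode_utf8("\xD0\xBD") yields the single code point U+043D. In production code a maintained library is probably the safer choice.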

Related

How to convert a Windows-1251 (ISO-8859-5) string to a UTF-8 string on Linux?

I have an ordinary string encoded as ISO-8859-5, and I want to convert it to UTF-8. I have a C# code example that works well; I need to do the same in C++:
result = mainString.Substring(nameStart + 3, symbols);
Encoding enc = Encoding.GetEncoding("ISO-8859-5");
byte[] bytes = enc.GetBytes(result);
result = Encoding.UTF8.GetString(bytes);
result is a string with text
The procedure to do this on Linux is as follows:
Use iconv_open() as described in its manual page to create a handle for a conversion from windows-1251 to UTF-8. I just double-checked and "windows-1251" is supported by the iconv library.
Use iconv() as described in its manual page.
Use iconv_close() as described in its manual page.
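The three steps above can be sketched as follows. The helper name convert_encoding, the buffer size, and the error handling are assumptions made for illustration; only iconv_open(), iconv(), and iconv_close() come from the iconv library. The sketch assumes a Linux/glibc-style iconv() that takes a char** input pointer.

```cpp
#include <iconv.h>

#include <cerrno>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `from` to encoding `to` using iconv.
// Hypothetical helper; throws std::runtime_error on failure.
std::string convert_encoding(const std::string& input, const char* from, const char* to)
{
    // Step 1: create the conversion handle. Note the order: target first.
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());  // iconv does not modify the input
    size_t in_left = input.size();

    // Step 2: convert chunk by chunk until all input is consumed.
    while (in_left > 0) {
        char* out_ptr = buffer;
        size_t out_left = sizeof buffer;
        const size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - out_left);
        if (rc == (size_t)-1 && err != E2BIG) {  // E2BIG only means "output buffer full"
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Step 3: release the handle.
    iconv_close(cd);
    return output;
}
```

A call such as convert_encoding(text, "WINDOWS-1251", "UTF-8") would then perform the conversion described above. Stateful target encodings would additionally need a final flushing call to iconv(); that is omitted here because UTF-8 has no shift state.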

How to turn a UTF-8 std::string into an NSString?

Hello, I have a project using both Objective-C and C++. I never set any encoding, and the right panel of the file page says “no specific encoding set”, but I’ve read that NSString is natively UTF-16. How would I convert a C++ string (UTF-8) to an NSString (UTF-16)?
You can use the std::string::data() method to get access to the raw bytes of the std::string, and std::string::size() to get their count. Once you have those, you can use the initWithBytes:length:encoding: initializer of NSString (init(bytes:length:encoding:) in Swift) to convert the raw bytes into an NSString. Specify NSUTF8StringEncoding as the encoding.

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú, and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result: àèìòùáéíóú
but when I read the UTF-8 encoded file, I get garbled text instead: à èìòùáéíóú
I would like my function to work with all files regardless of the encoding.
What do I need to implement?
EDIT:
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read both UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass in the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings including unicode in its various flavours.
I have not had a problem with it.
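When the encoding is not known in advance and the only candidates are UTF-8 and CP-1252, one common heuristic is to check whether the raw bytes form valid UTF-8 and to fall back to CP-1252 otherwise. The sketch below is not taken from CTextFileDocument; the names is_structurally_utf8 and guess_encoding are hypothetical, and the check only looks at lead and continuation byte patterns.

```cpp
#include <cstddef>
#include <string>

enum class GuessedEncoding { Utf8, Cp1252 };

// Returns true if every multi-byte sequence in `bytes` has a valid lead byte
// followed by the right number of continuation bytes (10xxxxxx).
bool is_structurally_utf8(const std::string& bytes)
{
    std::size_t pending = 0;  // continuation bytes still expected
    for (unsigned char c : bytes) {
        if (pending > 0) {
            if ((c & 0xC0) != 0x80) return false;
            --pending;
        } else if (c < 0x80) {
            // plain ASCII, nothing to track
        } else if ((c & 0xE0) == 0xC0) {
            pending = 1;
        } else if ((c & 0xF0) == 0xE0) {
            pending = 2;
        } else if ((c & 0xF8) == 0xF0) {
            pending = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
    }
    return pending == 0;
}

// Heuristic only: pure ASCII input is reported as UTF-8, which is harmless
// because ASCII bytes mean the same thing in both encodings.
GuessedEncoding guess_encoding(const std::string& bytes)
{
    return is_structurally_utf8(bytes) ? GuessedEncoding::Utf8
                                       : GuessedEncoding::Cp1252;
}
```

Accented CP-1252 text almost never happens to be valid UTF-8, so the guess tends to be reliable for files like the one in the question, but it remains a guess. The file would have to be read as raw bytes first and converted after the encoding has been chosen.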

Reading file with cyrillic

I have to open a file containing Cyrillic characters. I've encoded the file as UTF-8. Here is an example:
en: Couldn't your family afford a
costume for you
ru: Не ваша семья
позволить себе костюм для вас
This is how I open the file:
ifstream readFile(fileData.c_str());
// Loop on getline itself; testing eof() first processes a failed read after the last line.
while (std::getline(readFile, buffer))
{
    ...
}
The first problem is that there is some symbol before the text 'en' (I saw this in the debugger):
"ï»¿en: least"
The other problem is the Cyrillic characters:
" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"
What's wrong?
there is some symbol before text 'en'
That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.
Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) writes it nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is specifically listed.
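If re-saving the file is not possible, the three BOM bytes (EF BB BF) can also be dropped in code after reading the first line. A minimal sketch, where the helper name strip_utf8_bom is made up for illustration:

```cpp
#include <string>

// Remove a leading UTF-8 BOM (bytes EF BB BF) from `line`, if present.
// Returns true when a BOM was found and removed.
bool strip_utf8_bom(std::string& line)
{
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        line.erase(0, 3);
        return true;
    }
    return false;
}
```

It only needs to be called once, on the first line read from the file.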
ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹
That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_t characters and write them from there. I don't know what the final destination for your strings is though... if you're just going to write them to another file or something you could just keep them as bytes and not care about what encoding they're in.
I suppose your OS is Windows. There are several simple ways:
Use wchar_t, wstring, wifstream, etc.
Use the ICU library.
Use some other library (there are many).
Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (on my system the default Cyrillic Windows encoding is cp1251), because the Windows console supports only DOS encodings.
Note: for writing to a file you need to know which encoding the file uses.
Use libiconv to convert the text to a usable encoding after reading.
Use icu to convert the text.

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline. While they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible that my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11, see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since the <codecvt> header is now part of C++11, one can just use std::codecvt_utf16 (see the bullet below for possible code samples); note that <codecvt> was later deprecated in C++17
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in test so I think that kind of complications are ok there) to execute my task.
Hope this spares others some time, happy to help.
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3,5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale's encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.
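One way to sidestep the locale conversion entirely is to read the file as raw bytes and assemble the 16-bit code units by hand. The following is a minimal sketch under the assumption that the file is UTF-16LE (or UCS-2LE); the helper names utf16le_to_u16string and read_file_bytes are made up for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Combine pairs of little-endian bytes into UTF-16 code units and
// drop a leading BOM (U+FEFF) if there is one.
std::u16string utf16le_to_u16string(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd byte count: not UTF-16");
    std::u16string units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    if (!units.empty() && units[0] == 0xFEFF)
        units.erase(0, 1);
    return units;
}

// Read a whole file as bytes. Binary mode avoids any newline translation.
std::string read_file_bytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open file");
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With the text in a std::u16string, substr(start, length) counts 16-bit units, so for UCS-2 data each index corresponds to one character. Lines can then be split on u'\n', and ASCII-only fields can be narrowed to char one unit at a time.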