How can I convert a std::string to UTF-8? - c++

I need to put a stringstream as the value of a JSON field (using the rapidjson library), but std::stringstream::str is not working because it is not returning UTF-8 characters. How can I do that?
Example:
d["key"].SetString(tmp_stream.str());

rapidjson::Value::SetString accepts a pointer and a length. So you have to call it this way:
std::string stream_data = tmp_stream.str();
d["key"].SetString(tmp_stream.data(), tmp_string.size());
As others have mentioned in the comments, std::string is a container of char values with no encoding specified. It can contain UTF-8 encoded bytes or any other encoding.
I tested putting invalid UTF-8 data in an std::string and calling SetString. RapidJSON accepted the data and simply replaced the invalid characters with "?". If that's what you're seeing, then you need to:
Determine what encoding your string has
Re-encode the string as UTF-8
If your string is ASCII, then SetString will work fine as ASCII and UTF-8 are compatible.
If your string is UTF-16 or UTF-32 encoded, there are several lightweight, portable libraries that can re-encode it as UTF-8, such as utfcpp. C++11 added an API for this (std::wstring_convert with std::codecvt facets), but it was poorly supported and has been deprecated as of C++17.
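For example, with utfcpp the UTF-16 case is only a couple of lines. A minimal sketch, assuming the single-header utf8.h from utfcpp is on the include path:
#include <iterator>
#include <string>
#include "utf8.h" // utfcpp

std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    // Walk the UTF-16 code units and append the equivalent UTF-8 bytes.
    utf8::utf16to8(in.begin(), in.end(), std::back_inserter(out));
    return out;
}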
If your string is encoded with a more archaic encoding like Windows-1252, then you might need to use either an OS API like MultiByteToWideChar on Windows, or a heavyweight Unicode library like LibICU, to convert the data to a more standard encoding.
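On Windows, a sketch of the OS-API route (Windows-1252 to UTF-16, then UTF-16 to UTF-8) could look like this; error handling is omitted for brevity:
#include <windows.h>
#include <string>

std::string cp1252_to_utf8_winapi(const std::string& in)
{
    if (in.empty()) return {};
    // Windows-1252 (code page 1252) -> UTF-16
    int wlen = MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), &wide[0], wlen);
    // UTF-16 -> UTF-8
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &out[0], len, nullptr, nullptr);
    return out;
}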

Related

How to convert a std::string containing bytes encoded in windows-1252 to a string containing utf8 encoded data?

Using modern C++ and the std library, what is the easiest and cleanest way to convert a std::string containing windows-1252 encoded characters to utf-8?
My use case is that I'm parsing a CSV file which is Windows-1252 encoded, and then pushing some of its data to Node.js using Node-API (node-addon-api), which requires UTF-8 encoded strings.
Using just the standard library, the closest solution would probably be to use std::wstring_convert with a custom Windows-1252 facet to convert the std::string to a std::wstring, and then use std::wstring_convert with a standard UTF-8 facet to convert the std::wstring to a std::string.
However, std::wstring_convert is deprecated since C++17, with no replacement in sight. So you are better off using a 3rd-party Unicode library to handle the conversion, such as iconv, ICU, etc. Or platform-specific APIs, like MultiByteToWideChar() and WideCharToMultiByte() on Windows, etc.
Or, you could simply implement the conversion yourself, since Windows-1252 is a very simple encoding with only 251 characters defined. A trivial lookup table converting each 8-bit character to its UTF-8 equivalent would suffice.
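A sketch of that do-it-yourself approach: bytes 0x00-0x7F are ASCII, bytes 0xA0-0xFF map 1:1 to U+00A0-U+00FF, and only the 0x80-0x9F range needs a lookup table (the five undefined slots are mapped to U+FFFD here):
#include <string>

std::string cp1252_to_utf8(const std::string& in)
{
    // Unicode code points for Windows-1252 bytes 0x80-0x9F.
    static const char32_t table_80_9F[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };

    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        char32_t cp = (c < 0x80) ? c
                    : (c < 0xA0) ? table_80_9F[c - 0x80]
                    : c;                         // 0xA0-0xFF matches Latin-1
        // Encode one code point (all <= U+FFFF here) as 1-3 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}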

Convert from std::wstring to std::string

I'm converting a wstring to a string with std::codecvt_utf8 as described in this question, but when I try Greek or Chinese alphabet symbols they come out corrupted. I can see it in the debugger's Locals window; for example, 日本 became "æ—¥æœ¬"
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv; //also tried codecvt_utf8_utf16
std::string str = myconv.to_bytes(wstr);
What am I doing wrong?
std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.
Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.
But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF-8 but, I suppose, most likely code page 1252.
As verification, Python gives the following:
>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'
Your string does seem to be the UTF8 encoding of 日本 interpreted as if it was cp1252 encoded.
Therefore the conversion seems to have worked as intended.
As mentioned by @MarkTolonen in the comments, the encoding the Visual Studio debugger assumes for a string variable can be set to UTF-8 with the s8 format specifier, as explained in the documentation.
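If you want to verify this without relying on the debugger at all, you can dump the converted bytes in hex; for L"日本" the output should be the UTF-8 sequence e6 97 a5 e6 9c ac. A small sketch, keeping the (now deprecated) std::wstring_convert from the question, and assuming the source file is compiled with the correct source charset so the L"日本" literal is intact:
#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    std::wstring wstr = L"日本";
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    std::string str = myconv.to_bytes(wstr);
    for (unsigned char b : str)   // print the raw UTF-8 bytes
        std::printf("%02x ", b);
}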

convert std::string to jstring encoded using windows-1256

I am using a library (libcurl) to request a certain web page with some Arabic content. When I obtain the string response, it has Arabic characters and the whole response is encoded in Windows-1256.
The problem is that the Arabic characters don't show up properly.
Is there a way to convert an std::string to a jstring when the data is encoded in Windows-1256?
By the way, I tried env->NewStringUTF(str.c_str()); and the application crashed.
Java strings use UTF-16. JNI has no concept of charset encodings other than UTF-8 and UTF-16 (unless you use JNI calls to access Java's Charset class directly, but Java only implements a small subset of charsets, and Windows-1256 is not one of them unless the underlying Java JVM specifically implements it).
JNI's NewStringUTF() function requires UTF-8 input (and not just standard UTF-8 but Java's special modified UTF-8) and returns a UTF-16 encoded JString.
So you would have to first convert the original Arabic data from Windows-1256 to (modified) UTF-8 before calling NewStringUTF(). A better option would be to convert the data to UTF-16 directly and then use JNI's NewString() function. Either way, you can use libiconv, ICU4JNI, or any other Unicode library of your choosing for the actual conversion.
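A sketch of that second route using iconv, converting Windows-1256 straight to UTF-16 and handing the result to NewString(). Whether the converter name is "WINDOWS-1256" or "CP1256" depends on the iconv build, and the const handling of iconv()'s input pointer also varies between implementations, so treat both as assumptions:
#include <iconv.h>
#include <jni.h>
#include <string>
#include <vector>

jstring cp1256_to_jstring(JNIEnv* env, const std::string& in)
{
    if (in.empty()) return env->NewStringUTF("");

    iconv_t cd = iconv_open("UTF-16LE", "WINDOWS-1256");
    if (cd == (iconv_t)-1) return nullptr;        // encoding pair not supported

    // Windows-1256 is single-byte and maps into the BMP,
    // so one input byte never needs more than one UTF-16 code unit.
    std::vector<jchar> out(in.size());
    char* src = const_cast<char*>(in.data());
    size_t src_left = in.size();
    char* dst = reinterpret_cast<char*>(out.data());
    size_t dst_left = out.size() * sizeof(jchar);

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return nullptr;         // invalid input byte

    jsize len = static_cast<jsize>(out.size() - dst_left / sizeof(jchar));
    return env->NewString(out.data(), len);       // UTF-16 in, jstring out
}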

std::string conversion to char32_t (unicode characters)

I need to read a file in C++ using fstream and the getline function; the file has ASCII as well as Unicode characters.
But getline only works with std::string, and the characters of such a plain string cannot be converted into char32_t so that I can compare them with Unicode characters. Could anyone suggest a fix?
char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So in general, Unicode-encoded files are UTF-8. UTF-8 uses a variable number of bytes per character, from 1 (ASCII characters) to 4. This means you cannot directly test individual characters through a std::string.
For this, you need to convert the UTF-8 string to a wchar_t string, stored in a std::wstring. You can use a converter defined like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual Unicode characters. Don't forget to put an L prefix before each string literal to make it a wide string literal. For instance:
if (unicodeString[i] == L'仮')
{
    info("this is some japanese character");
}
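Putting it together with getline, a sketch (the file name and the character compared against are just placeholders) could look like this:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::ifstream file("input.txt");   // assumed to be UTF-8 encoded
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;

    std::string utf8Line;
    while (std::getline(file, utf8Line)) {
        // Convert each UTF-8 line to a wide string before inspecting characters.
        std::wstring line = converter.from_bytes(utf8Line);
        for (wchar_t ch : line) {
            if (ch == L'仮')
                std::cout << "found the Japanese character\n";
        }
    }
}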

streams with default utf8 handling

I have read that in some environments std::string internally uses UTF-8. Whereas, on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do however have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
In my UTF-8 facet for files, it simply returns an error, which in turn prevents any more of the stream from being read. I would have thought replacing the offending sequence with the Unicode "invalid character" value 0xFFFD would be a better option.
My question isn't limited to UTF-8, how about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or, say you take a std::wstring, this time containing a lone surrogate, and print it to std::cout.
I have read that in some environments std::string internally uses UTF-8.
A C++ program can choose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want, std::string is not limited to hold ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8-bits. So you can use std::string to hold ASCII, Latin1, UTF-8 or any character encoding that uses an 8-bit or less code unit. std::string::length returns the number of code units so held, and the std::string::operator[] will return the ith code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
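A small illustration of the code-unit point (the two bytes are simply the UTF-8 encoding of é):
#include <cassert>
#include <string>

int main()
{
    std::string    s8  = "\xC3\xA9";   // "é" stored as two UTF-8 code units
    std::u16string s16 = u"\u00E9";    // one UTF-16 code unit
    std::u32string s32 = U"\u00E9";    // one UTF-32 code unit

    assert(s8.length()  == 2);         // length() counts code units, not characters
    assert(s16.length() == 1);
    assert(s32.length() == 1);
}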
Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
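As a concrete illustration of "stop working": a sketch that reads a UTF-8 file through the (deprecated) codecvt_utf8 facet; an input file with an invalid byte (the name data.txt is just a placeholder) makes getline fail and set failbit, and nothing more is read until the error is cleared:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream in("data.txt");
    // Imbue the UTF-8 -> wchar_t facet before any read happens.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

    std::wstring line;
    while (std::getline(in, line))
        std::wcout << line << L'\n';

    if (in.fail() && !in.eof())
        std::cerr << "conversion error: the stream stopped at the invalid sequence\n";
}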
std::string should be encoding-agnostic: http://en.cppreference.com/w/cpp/string/basic_string - so it should not validate code points/data; you should be able to store any binary data in it.
The only places where encoding really makes a difference are calculating string length and iterating over the string character by character, and locale should have no effect in either of these cases.
Also, use of std::locale is probably not a good idea if it can be avoided at all: it is not thread-safe on all platforms or in all implementations of the standard library, so care must be taken when using it. Its effect is also very limited, and probably not at all what you expect it to be.