convert std::string to jstring encoded using windows-1256 - c++

I am using a library (libcurl) that requests a certain web page with some Arabic content. When I obtain the string response, it contains Arabic characters and the whole response is encoded in Windows-1256.
The problem is that the Arabic characters don't show up properly.
Is there a way to convert a std::string to a jstring encoded in Windows-1256?
By the way, I tried env->NewStringUTF(str.c_str()); and the application crashed.

Java strings use UTF-16. JNI has no concept of charset encodings other than UTF-8 and UTF-16 (unless you use JNI calls to access Java's Charset class directly, but Java only implements a small subset of charsets, and Windows-1256 is not one of them unless the underlying JVM specifically implements it).
JNI's NewStringUTF() function requires UTF-8 input (and not just standard UTF-8, but Java's special modified UTF-8) and returns a UTF-16 encoded jstring.
So you would first have to convert the original Arabic data from Windows-1256 to (modified) UTF-8 before calling NewStringUTF(). A better option is to convert the data to UTF-16 directly and then use JNI's NewString() function. Either way, you can use libiconv, ICU4JNI, or any other Unicode library of your choosing for the actual conversion.
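For example, here is a minimal sketch of that second route using libiconv; the function name and error handling are illustrative, and it assumes a little-endian platform so that "UTF-16LE" matches the jchar layout:

#include <iconv.h>
#include <jni.h>
#include <string>
#include <vector>

// Convert a Windows-1256 encoded std::string to a jstring by first
// transcoding it to UTF-16 and then calling NewString().
jstring windows1256ToJString(JNIEnv* env, const std::string& str)
{
    // "CP1256" is an accepted alias on most iconv implementations.
    iconv_t cd = iconv_open("UTF-16LE", "WINDOWS-1256");
    if (cd == (iconv_t)-1)
        return nullptr;

    std::vector<char> out(str.size() * 4 + 4);      // generous worst case
    char* inBuf = const_cast<char*>(str.data());    // iconv wants char**
    size_t inLeft = str.size();
    char* outBuf = out.data();
    size_t outLeft = out.size();

    size_t rc = iconv(cd, &inBuf, &inLeft, &outBuf, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return nullptr;

    size_t bytes = out.size() - outLeft;
    // NewString() takes a count of UTF-16 code units, not bytes.
    return env->NewString(reinterpret_cast<const jchar*>(out.data()),
                          static_cast<jsize>(bytes / sizeof(jchar)));
}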

Related

How can I convert a std::string to UTF-8?

I need to put the contents of a stringstream as a value in a JSON document (using the RapidJSON library), but std::stringstream::str is not working because it does not return UTF-8 characters. How can I do that?
Example:
d["key"].SetString(tmp_stream.str());
rapidjson::Value::SetString accepts a pointer and a length, so you have to call it this way:
std::string stream_data = tmp_stream.str();
d["key"].SetString(stream_data.data(), stream_data.size());
As others have mentioned in the comments, std::string is a container of char values with no encoding specified. It can contain UTF-8 encoded bytes or any other encoding.
I tested putting invalid UTF-8 data in an std::string and calling SetString. RapidJSON accepted the data and simply replaced the invalid characters with "?". If that's what you're seeing, then you need to:
Determine what encoding your string has
Re-encode the string as UTF-8
If your string is ASCII, then SetString will work fine as ASCII and UTF-8 are compatible.
If your string is UTF-16 or UTF-32 encoded, there are several lightweight, portable libraries for this, such as utfcpp. C++11 added an API for this (std::wstring_convert and the codecvt_utf8 family), but it was poorly supported and has been deprecated as of C++17.
If your string is encoded with a legacy encoding like Windows-1252, then you might need to use either an OS API like MultiByteToWideChar on Windows, or a heavyweight Unicode library like ICU, to convert the data to a more standard encoding.
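For illustration, a minimal sketch of the Windows route mentioned above (Windows-1252 to UTF-8 via MultiByteToWideChar and WideCharToMultiByte); the function name is made up and error handling is omitted:

#include <windows.h>
#include <string>

std::string Windows1252ToUtf8(const std::string& in)
{
    // Windows-1252 -> UTF-16 (code page 1252 is passed explicitly).
    int wlen = MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), &wide[0], wlen);

    // UTF-16 -> UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}

// The UTF-8 result can then be handed to RapidJSON; passing the allocator
// makes the Value own a copy of the data:
// d["key"].SetString(utf8.data(), (rapidjson::SizeType)utf8.size(), d.GetAllocator());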

Handling a UTF-8 encoded char* array

A file contains non-Latin content and is encoded in UTF-8.
Currently the existing code uses "fopen" to open the file, parses it, and calls my validate function with the non-Latin content, passing the data as a char*.
void validate(const char* str)
{
....
}
I have to do some validation on the passed char array.
The application uses Sun C++ 5.11, which I think doesn't support Unicode. (I googled for Unicode support in Sun C++ 5.11 and didn't find any proper pointers, so I wrote a simple program to check whether Sun C++ supports Unicode, and the program didn't compile.)
How do I do the validation on the input char*? Is it possible using wchar_t?
The application uses <compiler> which I think doesn't support Unicode
This isn't a problem. You only need compiler support for Unicode to embed Unicode string literals in the code, or for fixed-width character types to represent UTF-16 or UTF-32. Your Unicode is UTF-8 and comes from user input, so no Unicode compiler support should be needed.
How do I do the validation on the input char*?
The C++ standard library has very few tools for processing Unicode. The provided tools primarily consist of conversions between different Unicode formats, and even those were not available prior to C++11.
Input and output are mostly just copying of bytes, so no significant processing is required for that. For other processing (which you presumably need for "validation") you will need to implement the tools yourself, or use third-party tools. If you choose to implement it yourself, you will need to refer to the roughly 1000 pages of the Unicode standard: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
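As an illustration of what implementing it yourself can look like, here is a minimal validation sketch that checks only the byte-level structure of UTF-8 (lead and continuation bytes); it does not reject overlong encodings or surrogate code points, which the Unicode standard also requires:

#include <cstddef>

bool isValidUtf8(const char* str, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char c = static_cast<unsigned char>(str[i]);
        std::size_t extra;
        if      (c < 0x80)           extra = 0;   // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // invalid lead byte
        if (i + extra >= len) return false;       // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(str[i + j]) & 0xC0) != 0x80)
                return false;                     // not a continuation byte
        i += extra + 1;
    }
    return true;
}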
Is it possible using wchar_t?
wchar_t is the wide character type used for the system's native wide character encoding. UTF-8 does not use wide code units, so wchar_t by itself won't help here.

What is the difference between CW2A((LPCWSTR)str) and CW2A((LPCWSTR)str, CP_UTF8)?

I am trying to convert a few CStringW strings to CStringA strings. One of the strings (let's call it otherLangString) is in another language (Chinese, Arabic, etc.). All the other strings had no issue being converted when used like this:
CW2A((LPCWSTR)some_String);
But when used for the otherLangString, I was getting "?????".
So to fix that, I did this and it worked:
CW2A(some_String, CP_UTF8);
Now, in the code, all conversions looked like the first sample except one, which looked like the second sample.
For consistency, I mixed the two approaches above and did this for all of them:
CW2A((LPCWSTR)some_String, CP_UTF8);
My question is, what is the difference between the following?
- CW2A((LPCWSTR)some_String, CP_UTF8) and CW2A(some_String, CP_UTF8);
- CW2A((LPCWSTR)some_String) and CW2A(some_String, CP_UTF8);
CW2A is a typedef for CW2AEX<>, and its constructor is documented. The constructor taking two arguments allows you to explicitly specify the code page to use for the conversion:
nCodePage:
The code page used to perform the conversion. See the code page parameter discussion for the Windows SDK function MultiByteToWideChar for more details.
If you don't specify a code page, the current thread's ANSI code page is used for the conversion (you rarely want that). This is explained under ATL and MFC String Conversion Macros:
By default, the ATL conversion classes and macros will use the current thread's ANSI code page for the conversion. If you want to override that behavior for a specific conversion using macros based on the classes CA2WEX or CW2AEX, specify the code page as the second parameter to the constructor for the class.
In your case,
CW2A((LPCWSTR)some_String);
converts from UTF-16 to a narrow character string, using the thread's current ANSI code page. The result is only meaningful when interpreted using the same ANSI code page. To make matters worse, ANSI code page encoded strings cannot represent all Unicode characters.
The other piece of code
CW2A(some_String, CP_UTF8);
converts from UTF-16 to UTF-8. This is generally favorable, since the conversion is lossless and explicit. Both encodings can represent the same set of characters. The encoded string can be decoded by any reader capable of interpreting UTF-8.
Note: In general, you cannot directly use a UTF-8 encoded character string stored in a CStringA in Windows. It is safe to send the contents over a network, or write them to disk. But if you want to pass it to the Windows API (e.g. for display) you have to convert to UTF-16 first. The ANSI versions of the Windows API do not support UTF-8.
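Putting the pieces together, a minimal sketch (the function and string names are made up for illustration):

#include <atlstr.h>
#include <atlconv.h>

void Demo(const CStringW& some_String)   // hypothetical UTF-16 input string
{
    // Conversion using the thread's current ANSI code page: lossy for any
    // character that code page cannot represent (you get '?').
    CStringA ansi(CW2A(some_String));

    // Explicit, lossless conversion to UTF-8.
    CStringA utf8(CW2A(some_String, CP_UTF8));

    // To pass the UTF-8 data back to the Windows API (e.g. for display),
    // convert it to UTF-16 first; the ANSI APIs do not understand UTF-8.
    CStringW wide(CA2W(utf8, CP_UTF8));
    ::MessageBoxW(nullptr, wide, L"Demo", MB_OK);
}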

Choosing encoding for icu::UnicodeString

I found myself in need of a way to change a string to lower case that was safe to use for ASCII and for UTF-16LE (as found in some Windows registry strings) and came across this question: How to convert std::string to lower case?
The answer that seemed to be the "most correct" to me (I'm not using Boost) was one that demonstrated using the icu library.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
ISO-8859-1 has worked for the few unit tests I've run against ASCII encoded strings that used only Latin characters, but I don't like using it if I don't know why.
If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry can store things in UTF-16LE and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe code.
I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison
UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
Because the author is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString represents a UTF-16 encoded string. If you construct it using a char* as input, you have to specify the correct charset the input data is encoded with so UnicodeString can decode it to Unicode and then re-encode it as UTF-16.
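For example, a minimal sketch assuming the ICU headers and library are available; the literals and charset name are only illustrative:

#include <unicode/unistr.h>
#include <unicode/uchar.h>     // U_FOLD_CASE_DEFAULT
#include <iostream>

int main()
{
    // The second argument names the charset of the char* input, so that
    // UnicodeString can decode it into its internal UTF-16 representation.
    icu::UnicodeString a("Strasse", "ISO-8859-1");
    icu::UnicodeString b("STRASSE", "ISO-8859-1");

    // For UTF-16LE data (e.g. read from the registry), you can construct
    // directly from a UChar* buffer instead and skip the charset conversion.

    // caseCompare() returns 0 when the strings match under full case folding,
    // so no manual lower-casing is needed.
    if (a.caseCompare(b, U_FOLD_CASE_DEFAULT) == 0)
        std::cout << "equal (case-insensitive)\n";
    return 0;
}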

streams with default utf8 handling

I have read that in some environments std::string internally uses UTF-8. Whereas, on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do however have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
In my UTF-8 facet for files, it simply returns an error, which in turn prevents any more of the stream from being read. I would have thought that substituting the Unicode replacement character U+FFFD for the invalid sequence would be a better option.
My question isn't limited to UTF-8; what about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or, take a std::wstring, this time containing a lone surrogate, and print it to std::cout.
I have read that in some environments std::string internally uses UTF-8.
A C++ program can choose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want; std::string is not limited to holding ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8 bits. So you can use std::string to hold ASCII, Latin-1, UTF-8, or any character encoding that uses code units of 8 bits or fewer. std::string::length returns the number of code units so held, and std::string::operator[] returns the i-th code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
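A minimal sketch illustrating the code-unit counting described above:

#include <string>
#include <cassert>

int main()
{
    std::string utf8 = "\xC3\xA9";    // U+00E9 ("é") as two UTF-8 bytes
    assert(utf8.length() == 2);       // length() counts bytes (code units)

    std::u16string utf16 = u"\u00e9"; // the same character, one UTF-16 code unit
    assert(utf16.length() == 1);
    return 0;
}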
Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
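For illustration, a minimal sketch using the standard std::codecvt_utf8 facet (deprecated in C++17 but still widely available); the file name is hypothetical and the exact error flags set on an invalid sequence can vary between implementations:

#include <fstream>
#include <locale>
#include <codecvt>
#include <iostream>

int main()
{
    std::wifstream in("input.txt");   // hypothetical UTF-8 encoded file
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

    wchar_t ch;
    while (in.get(ch)) {
        // process decoded wide characters
    }

    // On an invalid UTF-8 sequence the conversion fails, the stream's failbit
    // is set, and nothing more is read until clear() is called.
    if (in.fail() && !in.eof())
        std::wcerr << L"invalid UTF-8 sequence encountered\n";
    return 0;
}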
std::string should be encoding-agnostic (http://en.cppreference.com/w/cpp/string/basic_string): it should not validate code points or data, so you should be able to store any binary data in it.
The only places where encoding really makes a difference are calculating the string's length and iterating over the string character by character, and the locale should have no effect in either of these cases.
Also, using std::locale is probably not a good idea if it can be avoided at all: it is not thread-safe on all platforms or in all implementations of the standard library, so care must be taken when using it. Its effect is also very limited, and probably not at all what you expect it to be.