Choosing an encoding for icu::UnicodeString - C++

I found myself in need of a way to change a string to lower case that was safe to use for ASCII and for UTF-16LE (as found in some Windows registry strings), and came across this question: How to convert std::string to lower case?
The answer that seemed the "most correct" to me (I'm not using Boost) was one that demonstrated using the ICU library.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value, and how do I know what to use?
ISO-8859-1 has worked for the few unit tests I've run against ASCII-encoded strings that used only Latin characters, but I don't like using it without knowing why.
If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry can store strings as UTF-16LE, and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe operations.

I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison
UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.
In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?
Because the author is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString represents a UTF-16 encoded string. If you construct it using a char* as input, you have to specify the correct charset the input data is encoded with so UnicodeString can decode it to Unicode and then re-encode it as UTF-16.
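For illustration, here is a minimal sketch of both points, assuming ICU is installed; the input byte strings are made up for the example and encoded in ISO-8859-1:

#include <unicode/unistr.h>
#include <unicode/uchar.h>
#include <iostream>

int main()
{
    // Hypothetical inputs: byte strings encoded in ISO-8859-1 (Latin-1).
    const char* a = "Stra\xDF" "e";   // "Straße": 0xDF is 'ß' in Latin-1
    const char* b = "STRASSE";

    // The second argument names the charset of the char* input so that
    // UnicodeString can decode it and store it as UTF-16 internally.
    icu::UnicodeString ua(a, "ISO-8859-1");
    icu::UnicodeString ub(b, "ISO-8859-1");

    // caseCompare() uses full case folding, under which 'ß' folds to "ss",
    // so no manual lower-casing is needed.
    if (ua.caseCompare(ub, U_FOLD_CASE_DEFAULT) == 0)
        std::cout << "strings are equal, ignoring case\n";
}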

Related

Can I read åäö from a wxWidgets wxTextCtrl?

In C++, I have a wxTextCtrl and want the user to input a Swedish translation of a word.
Everything works until the user types å, ä, or ö (UTF-8).
Converting wxString to UTF-8 is not the problem; the problem is that I cannot even get the text out of the field. For the other strings I just use the following (where ans is a pointer to the text control), and it works perfectly. Any ideas?
std::string ch = std::string((ans->GetValue()));
You can't convert an arbitrary Unicode string to std::string without specifying the encoding. By default, the encoding is that of the current locale which, especially under Windows, is not necessarily UTF-8 which is what you almost certainly want to use precisely because the characters not representable in this encoding will be simply lost during conversion.
So the correct thing to do is to explicitly use ans->GetValue().ToUTF8() and then your std::string will contain UTF-8-encoded representation of your characters. Of course, you need to realize that the string won't be of length 1, even for a single character, so perhaps you need to use std::wstring instead.
P.S. In wxWidgets 3.1.5+ you also have utf8_string() directly returning std::string, so you can also use this one if you have a new enough version.
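Putting that together, a minimal sketch (assuming, as in the question, that ans is a pointer to the text control):

// Convert the control's contents to a UTF-8 encoded std::string.
wxString value = ans->GetValue();
std::string ch(value.ToUTF8());

// With wxWidgets 3.1.5 or later you can write it more directly:
// std::string ch = value.utf8_string();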

Storing math symbols into strings - C++

Is there a way to store math symbols into strings in C++?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors, viewers, and IDEs I use will consistently understand it. (One caveat: since C++20, u8 string literals have type const char8_t[] and no longer convert implicitly to std::string, so the u8 examples above apply to C++11 through C++17.)
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
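As a hedged illustration of (2), here is a minimal sketch that converts a UTF-8 std::string to UTF-16 with MultiByteToWideChar and hands it to the wide MessageBoxW; the helper name and caption are made up for the example:

#include <windows.h>
#include <string>

// Hypothetical helper: display a UTF-8 string via the wide Windows API.
void ShowUtf8(const std::string& utf8)
{
    // First call asks how many UTF-16 code units are needed.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');

    // Second call performs the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        (int)utf8.size(), &wide[0], len);

    MessageBoxW(nullptr, wide.c_str(), L"Demo", MB_OK);
}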
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters, but don't expect this code to be portable. You could also use Unicode escape sequences, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

Convert std::string to jstring encoded using Windows-1256

I am using a library (libcurl) that requests a certain web page with some Arabic content. When I obtain the string response, it has Arabic characters and the whole response is encoded in Windows-1256.
The problem is the Arabic characters don't show up properly.
Is there a way to convert an std::string to a jstring encoded in Windows-1256?
By the way, I tried env->NewStringUTF(str.c_str()); and the application crashed.
Java strings use UTF-16. JNI has no concept of charset encodings other than UTF-8 and UTF-16 (unless you use JNI calls to access Java's Charset class directly, but Java only implements a small subset of charsets, and Windows-1256 is not one of them unless the underlying Java JVM specifically implements it).
JNI's NewStringUTF() function requires UTF-8 input (and not just standard UTF-8 but Java's special modified UTF-8) and returns a UTF-16 encoded JString.
So you would have to first convert the original Arabic data from Windows-1256 to (modified) UTF-8 before calling NewStringUTF(). A better option would be to convert the data to UTF-16 directly and then use JNI's NewString() function. Either way, you can use libiconv, ICU4JNI, or any other Unicode library of your choosing for the actual conversion.
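A minimal sketch of the second option using ICU's C converter API (the helper name is made up, and this assumes your ICU build ships the windows-1256 converter):

#include <jni.h>
#include <unicode/ucnv.h>
#include <string>
#include <vector>

// Hypothetical helper: decode a Windows-1256 std::string to UTF-16
// and hand the result to JNI's NewString().
jstring ToJString(JNIEnv* env, const std::string& cp1256)
{
    UErrorCode err = U_ZERO_ERROR;
    UConverter* conv = ucnv_open("windows-1256", &err);
    if (U_FAILURE(err)) return nullptr;

    // Windows-1256 is single-byte, so one UChar per input byte suffices.
    std::vector<UChar> buf(cp1256.size() + 1);
    int32_t len = ucnv_toUChars(conv, buf.data(), (int32_t)buf.size(),
                                cp1256.data(), (int32_t)cp1256.size(), &err);
    ucnv_close(conv);
    if (U_FAILURE(err)) return nullptr;

    // NewString() takes UTF-16 code units directly; no modified UTF-8 involved.
    return env->NewString(reinterpret_cast<const jchar*>(buf.data()), len);
}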

How can I convert a wchar_t* to char* without losing data?

I'm using a Japanese string as a wchar_t*, and I need to convert it to a char*. Is there any method or function to convert wchar_t* to char* without losing data?
It is not enough to say "I have a string as wchar_t". You must also know what encoding the characters of the string are in. This is probably UTF-16, but you need to know definitely.
It is also not enough to say "I want to convert to char". Again, you must make a decision on what encoding the characters will be represented in. JIS? Shift-JIS? EUC? UTF-8? Another encoding?
If you know the answers to the two questions above, you can do the conversion without any problem using WideCharToMultiByte.
First, choose the target string encoding, such as UTF-8 or UTF-16. Then encode your wchar_t[] strings in the encoding you chose via libiconv or another similar string encoding library.
You need to call WideCharToMultiByte and pass in the code page identifier for the Japanese multibyte encoding you want. See MSDN for that function. On Windows, the local Japanese multibyte code page is CP932, the MS variation on Shift-JIS. However, you might conceivably want UTF-8 to send to someone who expects it.
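A minimal sketch of such a call, assuming the wchar_t data is UTF-16 (as it normally is on Windows); the helper name is made up, and you can swap CP_UTF8 for 932 if you want CP932/Shift-JIS output instead:

#include <windows.h>
#include <string>

// Hypothetical helper: convert a UTF-16 wide string to a UTF-8 std::string.
std::string ToNarrow(const wchar_t* wide)
{
    // First call: ask how many bytes the result needs (-1 = include the NUL).
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1,
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return {};
    std::string out(len, '\0');

    // Second call: perform the conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, &out[0], len, nullptr, nullptr);

    out.resize(len - 1);   // drop the terminating NUL that -1 included
    return out;
}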

How to write a std::string to a UTF-8 text file

I just want to write some few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simple way to do so?
The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.
And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).
You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
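A minimal sketch of this; the file name and text are made up, and the BOM line is optional:

#include <fstream>
#include <string>

int main()
{
    // The escaped bytes below are the UTF-8 encoding of "Gödel".
    std::string line = "G\xC3\xB6" "del";

    std::ofstream out("hello.txt", std::ios::binary);
    out << "\xEF\xBB\xBF";      // optional UTF-8 byte order mark
    out << line << '\n';
}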
There is a nice tiny library for working with UTF-8 from C++: utfcpp
libiconv is a great library for all our encoding and decoding needs.
If you are using Windows you can use WideCharToMultiByte and specify that you want UTF8.
What is the easiest and simple way to do so?
The most intuitive and thus easiest way to handle UTF-8 in C++ is surely a drop-in replacement for std::string.
As the internet still lacks one, I went and implemented the functionality on my own:
tinyutf8 (EDIT: now on Github).
This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over codepoints rather than chars). It is implemented successfully as a middle ground between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).
Hope this helps!
If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
If you are using Qt, you can round-trip through QString:
std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);    // wide string -> QString
QByteArray byteArray(qstr.toUtf8());             // QString -> UTF-8 bytes
std::string str_std(byteArray.constData(), byteArray.length());
My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to utf8 when writing out to a file using these converting iterators I put on github.
#include <utf/utf.h>

int main()
{
    using namespace utf;

    u32string u32_text = U"ɦΈ˪˪ʘ";

    // do stuff with string

    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());
    u8string u8_text(pos, end);

    // write out utf8 to file.
    // ...
}
Use Glib::ustring from glibmm.
It is the only widespread UTF-8 string container (AFAIK). While glyph (not byte) based, it has the same method signatures as std::string, so the port should be a simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).
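A minimal sketch, assuming glibmm is installed (the string contents are made up for the example):

#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    // The escapes are the UTF-8 bytes for the union and intersection symbols.
    Glib::ustring s("union \xE2\x88\xAA and intersection \xE2\x88\xA9");

    std::cout << s.length() << '\n';   // counts characters (glyphs), not bytes
    std::cout << s.bytes()  << '\n';   // underlying byte count
}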
Since UTF-8 is a multibyte character encoding, you will run into problems working with it, and in my opinion it is a bad idea; use ordinary fixed-width Unicode instead.
So in my opinion the best approach is to use ordinary ASCII char text with an appropriate code page. You only need Unicode if you use more than two different sets of symbols (languages) in a single string.
That is a rather rare case; in most cases two symbol sets are enough, and for that common case ASCII chars suffice, not Unicode.
In effect, you only need multibyte chars like UTF-8 for Traditional Chinese, Arabic, or some hieroglyphic text. That's a very, very rare case!!!
I don't think many people need that. So never use UTF-8!!! You'll avoid the strong headache of manipulating such strings.