I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.
A short summary
First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API is, to put it mildly, less than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert functions can use these codecvts to convert strings and buffers directly, not relying on streams.]
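For concreteness, here is a minimal sketch of the std::wstring_convert route mentioned above (the helper name code_point_count is mine; note these facilities were later deprecated in C++17, and from_bytes throws std::range_error on invalid input):

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <cstddef>

// Count Unicode code points by converting UTF-8 -> UCS-4.
// This is essentially the only way C++11's standard library lets you do it.
std::size_t code_point_count(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8).size();  // one char32_t per code point
}
```

For example, "café" is five bytes in UTF-8 ("caf" plus the two-byte sequence 0xC3 0xA9) but four code points.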
C++11 also includes the C11 <uchar.h> header (exposed in C++ as <cuchar>), which contains functions to convert individual characters between the platform multibyte encoding (which may or may not be UTF-8) and UCS-2 or UCS-4.
However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.
Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
Observations
Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X* and Linux) and has largely become the de-facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.
* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.
Questions
If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.
I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well-specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programs which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well-received.
Is the above analysis correct
Let's see.
you can't validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, max_length) returns the number of valid bytes in the array.
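To make that concrete, here is a rough sketch of length() used as a validity probe (consumed_entirely is my own name; implementations of this now-deprecated facet differ in strictness about overlong forms and surrogate encodings, so treat this as illustrative rather than a conformance-grade validator):

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <cwchar>

// length() consumes bytes only as long as they form convertible UTF-8
// sequences, so a fully valid buffer is consumed all the way to the end.
bool consumed_entirely(const std::string& bytes) {
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};
    int n = cvt.length(st, bytes.data(), bytes.data() + bytes.size(),
                       bytes.size());  // max internal chars to consider
    return static_cast<std::size_t>(n) == bytes.size();
}
```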
you can't find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, 1) gives you a way to iterate over UTF-8 "characters" (Unicode code points), and of course to determine their number (that's not an "easy" way to count the number of characters, but it is a way).
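A sketch of that iteration idiom (code_points is my name for it; each call with max = 1 reports how many bytes make up the next code point):

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <vector>
#include <cwchar>

// Split a UTF-8 string into one chunk per code point by asking the facet
// for the byte length of a single internal character at a time.
std::vector<std::string> code_points(const std::string& s) {
    std::codecvt_utf8<char32_t> cvt;
    std::vector<std::string> out;
    const char* p = s.data();
    const char* end = p + s.size();
    while (p != end) {
        std::mbstate_t st{};
        int n = cvt.length(st, p, end, 1);  // bytes forming one code point
        if (n <= 0) break;                  // invalid input: stop (sketch only)
        out.emplace_back(p, p + n);
        p += n;
    }
    return out;
}
```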
doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
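For example, a non-BMP code point such as U+1F600 (UTF-8 bytes F0 9F 98 80) converts to a proper surrogate pair, which shows this is real UTF-16 rather than UCS-2 (to_utf16 is a name of my choosing; the facet is deprecated in C++17):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-16 via codecvt_utf8_utf16; code points above U+FFFF
// come out as surrogate pairs (two char16_t units).
std::u16string to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```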
Demo that illustrates these points.
If I have missed some other "you can't", please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. Everything enumerated in the original question is now, once again, impossible to do (safely) using only the standard library.
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
You're also missing the utter failure of UTF-8 literals. They don't have a type distinct from narrow-character literals, which may have a totally unrelated (e.g. codepage) encoding. So not only did C++11 not add any serious new facilities, it broke what little there was, because now you can't even assume that a char* is in the narrow-string encoding for your platform unless UTF-8 is the narrow-string encoding. So the new feature here is "we totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
The Committee simply doesn't seem to give a shit about Unicode.
Also, many of the Unicode support algorithms are just that: algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables work from Eric Niebler may have a shot.
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.
Firstly, it turns out that the Committee doesn't bother to give feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.
Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.
Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?
Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.
Related
Does the string returned from the GetStringUTFChars() end with a null terminated character? Or do I need to determine the length using GetStringUTFLength and null terminate it myself?
Yes, GetStringUTFChars returns a null-terminated string. However, I don't think you should take my word for it, instead you should find an authoritative online source that answers this question.
Let's start with the actual Java Native Interface Specification itself, where it says:
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Oh, surprisingly it doesn't say whether it's null-terminated or not. Boy, that seems like a huge oversight, and fortunately somebody was kind enough to log this bug on Sun's Java bug database back in 2008. The notes on the bug point you to a similar but different documentation bug (which was closed without action), which suggests that the readers buy a book, "The Java Native Interface: Programmer's Guide and Specification" as there's a suggestion that this become the new specification for JNI.
But we're looking for an authoritative online source, and this is neither authoritative (it's not yet the specification) nor online.
Fortunately, the reviews for said book on a certain popular online book retailer suggest that the book is freely available online from Sun, and that would at least satisfy the online portion. Sun's JNI web page has a link that looks tantalizingly close, but that link sadly doesn't go where it says it goes.
So I'm afraid I cannot point you to an authoritative online source for this, and you'll have to buy the book (it's actually a good book), where it will explain to you that:
UTF-8 strings are always terminated with the '\0' character, whereas Unicode strings are not. To find out how many bytes are needed to represent a jstring in the UTF-8 format, JNI programmers can either call the ANSI C function strlen on the result of GetStringUTFChars, or call the JNI function GetStringUTFLength on the jstring reference directly.
(Note that in the above sentence, "Unicode" means "UTF-16", or more accurately "the internal two-byte string representation used by Java", though finding proof of that is left as an exercise for the reader.)
All current answers to the question seem to be outdated (Edward Thomson's answer was last updated in 2015) or refer to the Android JNI documentation, which can be authoritative only in the Android world. The matter has been clarified in the recent (2017) official Oracle JNI documentation clean-up and updates, more specifically in this issue.
Now the JNI specification clearly states:
String Operations

This specification makes no assumptions on how a JVM represent Java strings internally. Strings returned from these operations:

GetStringChars()
GetStringUTFChars()
GetStringRegion()
GetStringUTFRegion()
GetStringCritical()

are therefore not required to be NULL terminated. Programmers are expected to determine buffer capacity requirements via GetStringLength() or GetStringUTFLength().
In the general case this means one should never assume JNI-returned strings are null-terminated, not even UTF-8 strings. In a pragmatic world one can test the specific behavior of a list of supported JVMs. In my experience, referring to the JVMs I have actually tested:
Oracle JVMs do null terminate both UTF-16 (with \u0000) and UTF-8 strings (with '\0');
Android JVMs do terminate UTF-8 strings but not UTF-16 ones.
https://developer.android.com/training/articles/perf-jni says:
The Java programming language uses UTF-16. For convenience, JNI provides methods that work with Modified UTF-8 as well. The modified encoding is useful for C code because it encodes \u0000 as 0xc0 0x80 instead of 0x00. The nice thing about this is that you can count on having C-style zero-terminated strings, suitable for use with standard libc string functions. The down side is that you cannot pass arbitrary UTF-8 data to JNI and expect it to work correctly.
If possible, it's usually faster to operate with UTF-16 strings. Android currently does not require a copy in GetStringChars, whereas GetStringUTFChars requires an allocation and a conversion to UTF-8. Note that UTF-16 strings are not zero-terminated, and \u0000 is allowed, so you need to hang on to the string length as well as the jchar pointer.
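Given all of the above, a defensive pattern is to copy exactly GetStringUTFLength() bytes and terminate the buffer yourself rather than trust the JVM. A minimal sketch (assumes <jni.h>; jstring_to_utf8 is a name of my choosing):

```cpp
#include <jni.h>
#include <string>

// Copy a jstring's modified-UTF-8 bytes into a std::string, which guarantees
// its own null terminator, instead of relying on GetStringUTFChars() to
// terminate the buffer (the spec does not require it to).
std::string jstring_to_utf8(JNIEnv* env, jstring js) {
    const jsize len = env->GetStringUTFLength(js);  // byte count, not UTF-16 units
    const char* chars = env->GetStringUTFChars(js, nullptr);
    std::string out(chars, static_cast<std::size_t>(len));
    env->ReleaseStringUTFChars(js, chars);
    return out;
}
```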
Yes, strings returned by GetStringUTFChars() are null-terminated. I use it in my application, so I've proven it experimentally, let's say. While Oracle's documentation sucks, alternative sources are more informative: Java Native Interface (JNI) Tutorial
What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?
Your code won't be able to deal correctly with characters outside the currently selected codepage when dealing with system APIs.
Typical problems include unsupported characters being translated to question marks, inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of the APIs are normally just thin wrappers around the "W" APIs that convert the parameters to UTF-16 on the fly; so, you have some overhead with respect to a "plain" W call.
Nothing stops you from working in a narrow-character Unicode encoding (i.e. UTF-8) inside your application, but the Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the W versions anyway.
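To illustrate what that UTF-8 to UTF-16 conversion involves, here is a hand-rolled minimal converter (on Windows you would normally call MultiByteToWideChar(CP_UTF8, ...) instead; this sketch does only basic structural checks and does not reject overlong forms):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 and re-encode as UTF-16, emitting surrogate pairs for
// code points above U+FFFF.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        int extra;  // number of continuation bytes expected
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xC0) { throw std::invalid_argument("stray continuation byte"); }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        if (i + extra >= in.size())
            throw std::invalid_argument("truncated sequence");
        for (int k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The per-call cost of exactly this kind of loop is the hidden overhead the "A" wrappers pay on every text-bearing API call.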
I believe the gist of the original question was "should I compile all my Windows apps with #define _UNICODE, and what's the downside if I don't?"
My original reply was "Yeah, you should. We've moved past 8-bit ASCII, and _UNICODE is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply. Because I didn't realize until I re-read my own links how much "UTF-16 is quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...
In my application I have to constantly convert strings between std::string and std::wstring due to different APIs (Boost, Win32, FFmpeg, etc.). Especially with FFmpeg the strings end up utf8->utf16->utf8->utf16, just to open a file.

Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.

This worked reasonably well; I implemented to_lower, to_upper and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all the corresponding algorithms (including regex).

Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.

However, my question is: why are the default std::string and "" literals not simply changed to use UTF-8, especially as UTF-8 is backwards compatible? Is there perhaps some compiler flag which can do this? Of course the STL implementation would need to be adapted accordingly.

I've looked at ICU, but it is not very compatible with APIs that assume basic_string, e.g. no begin()/end()/c_str(), etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings is really amenable to text processing. Users will in general care about graphemes (what's on the screen) while the encodings are defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks: what is the 5th character of "Hélène" (French first name) the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine) the response is completely different.
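The é ambiguity can be made concrete in a couple of lines (the variable names are mine; NFC and NFD are the Unicode composed and decomposed normalization forms):

```cpp
#include <string>

// "é" spelled two different ways: the same grapheme on screen, but
// different code point sequences in memory, with different UTF-8 lengths.
const std::string e_nfc = "\xC3\xA9";   // U+00E9: one code point, two bytes
const std::string e_nfd = "e\xCC\x81";  // U+0065 + U+0301 (combining acute):
                                        // two code points, three bytes
```

A byte-wise std::string comparison considers these unequal even though a user sees the same character, which is exactly why the grapheme view and the code point view must be kept distinct.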
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately, the std::string and std::wstring interfaces were inherited from a time when people thought ASCII was sufficient, and the progress made since didn't really solve the issue.
I don't even understand why the in-memory representation should be specified; it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.

Under Windows you cannot use UTF-8 as a codepage, but you can convert from or to UTF-8.
There are two snags to using UTF-8 on Windows.

You cannot tell how many bytes a string will occupy from its character count alone; it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.

The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses an 8-bit Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
The big snag with UTF16 is that the binary representation of a string depends on the byte order in a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.
So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF-8.

This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to the routines responsible for I/O.
Anyone know of a more permissive license (MIT / public domain) version of this:
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
(a 'drop-in' replacement for std::string that's UTF-8 aware)

Lightweight, does everything I need and even more (I doubt I'll even use the UTF-XX conversions).
I really don't want to be carrying ICU around with me.
std::string is fine for UTF-8 storage.
If you need to analyze the text itself, UTF-8 awareness will not help you much, as there are too many things in Unicode that do not work on a codepoint basis.
Take a look at the Boost.Locale library (it uses ICU under the hood):
Reference http://cppcms.sourceforge.net/boost_locale/html/
Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
Download https://sourceforge.net/projects/cppcms/files/
It is not lightweight, but it allows you to handle Unicode correctly, and it uses std::string as storage.

If you expect to find a lightweight Unicode-aware library for dealing with strings, you won't find such a thing, because Unicode is not lightweight. Even relatively "simple" stuff like upper-case/lower-case conversion or Unicode normalization requires complex algorithms and access to the Unicode character database.
If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/
Answer to comment:
1) Find file formats for files included by me
std::string::find is perfectly fine for this.
2) Line break detection
This is not a simple issue. Have you ever tried to find a line-break opportunity in Chinese or Japanese text? Probably not, as spaces do not separate words there, so line-break detection is a hard job. (I don't think even glib does this correctly; I think only Pango has something like that.)

And of course Boost.Locale does this, and does it correctly.

And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
3) Character (or now, code point) counting Looking at utfcpp thx
Characters are not code points. For example, the Hebrew word Shalom -- "שָלוֹם" -- consists of 4 characters but 6 code points, where two code points are used for vowels. The same goes for European languages, where a single character can be represented by two code points; for example, "ü" can be represented as "u" plus "¨" -- two code points.

So if you are aware of these issues, utfcpp will be fine; otherwise you will not find anything simpler.
I never used it myself, but I stumbled upon the UTF-8 CPP library a while ago and had enough good feelings about it to bookmark it. It is released under a BSD-like license, IIUC.

It still relies on std::string for strings and provides lots of utility functions to help check that a string is really UTF-8, count the number of characters, or move back or forward by one character... It is really small and lives only in header files: looks really good!
You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, but by no means is it a drop-in replacement for std::string.
The more I work with C++ locale facets, the more I understand that they are broken.
std::time_get is not symmetric with std::time_put (as strftime/strptime are in C) and does not allow easy parsing of times with AM/PM marks.
I discovered recently that simple number formatting may produce illegal UTF-8 under certain locales (like ru_RU.UTF-8).
std::ctype is very simplistic, assuming that to-upper/to-lower conversion can be done on a per-character basis (case conversion may change the number of characters and is context dependent).
std::collate does not support collation strength (case sensitive or insensitive).
There is no way to specify a timezone different from the global timezone in time formatting.
And much more...
Does anybody know whether any changes are expected in the standard facets in C++0x?
Is there any way to raise the importance of such changes?
Thanks.
EDIT: Clarifications in case the link is not accessible:
std::numpunct defines the thousands separator as a single char. So when the separator is U+2002 (a different kind of space), it cannot be represented as a single char in UTF-8, only as a multi-byte sequence.

In the C API, struct lconv defines the thousands separator as a string and does not suffer from this problem. So when you try to format numbers with separators outside of ASCII in a UTF-8 locale, invalid UTF-8 is produced.

To reproduce this bug, write 1234 to a std::ostream imbued with the ru_RU.UTF-8 locale.
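The interface limitation itself is easy to demonstrate with a custom facet (space_sep and format_grouped are my names): do_thousands_sep() can only ever return one char, so an ASCII space works, but a multi-byte UTF-8 separator such as U+2002 simply cannot be returned from it.

```cpp
#include <locale>
#include <sstream>
#include <string>

// A numpunct facet grouping digits in threes with a plain ASCII space.
// Returning the UTF-8 bytes of U+2002 is impossible here: the return
// type of do_thousands_sep() is a single char.
struct space_sep : std::numpunct<char> {
    char do_thousands_sep() const override { return ' '; }
    std::string do_grouping() const override { return "\3"; }  // groups of 3
};

std::string format_grouped(long v) {
    std::ostringstream os;
    os.imbue(std::locale(os.getloc(), new space_sep));  // locale owns the facet
    os << v;
    return os.str();
}
```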
EDIT2: I must admit that POSIX C localization API works much smoother:
There is an inverse of strftime, namely strptime (strftime does the same as std::time_put::put).
No problems with number formatting, because of the point I mentioned above.
However, it is still far from being perfect.
EDIT3: According to the latest notes on C++0x, I can see that std::time_get::get is similar to strptime and the opposite of std::time_put::put.
I agree with you, C++ is lacking proper i18n support.
Does anybody knows whether any changes are expected in standard facets in C++0x?
It is too late in the game, so probably not.
Is there any way to bring an importance of such changes?
I am very pessimistic about this.
When asked directly, Stroustrup claimed that he does not see any problems with the current status. And another one of the big C++ guys (book author and all) did not even realize that wchar_t can be one byte, if you read the standard.
And some threads in boost (which seems to drive the direction in the future) show so little understanding on how this works that is outright scary.
C++0x barely added some Unicode character data types, late in the game and after a lot of struggle. I am not holding my breath for more too soon.
I guess the only chance to see something better is if someone really good/respected in the i18n and C++ worlds gets directly involved with the next version of the standard. No clue who that might be though :-(
std::numpunct is a template. All specializations try to return the separator as a single character. Obviously, in any locale where that separator is a wide character, you should use std::numpunct<wchar_t>, as the <char> specialization can't represent it.
That said, C++0x is pretty much done. However, if good improvements continue, the C++ committee is likely to start C++1x. The ISO C++ committee is very likely to accept your help, if offered through your national ISO member organization. I see that Pavel Minaev suggested a Defect Report. That's technically possible, but the problems you describe are in general design limitations. In that case, the most reliable course of action is to design a Boost library for this, have it pass the Boost review, submit it for inclusion in the standard, and participate in the ISO C++ meetings to deal with any issues cropping up there.