Are there any updates of localization support in C++0x? - c++

The more I work with C++ locale facets, the more I understand that they are broken.
std::time_get is not symmetric with std::time_put (the way strptime is with strftime in C) and does not allow easy parsing of times with AM/PM marks.
I discovered recently that simple number formatting may produce illegal UTF-8 under certain locales (like ru_RU.UTF-8).
std::ctype is very simplistic, assuming that to-upper/to-lower conversion can be done on a per-character basis (case conversion may change the number of characters and is context dependent).
std::collate does not support collation strength (case-sensitive or case-insensitive).
There is no way to specify a timezone different from the global timezone in time formatting.
And much more...
Does anybody know whether any changes are expected in the standard facets in C++0x?
Is there any way to raise the importance of such changes?
Thanks.
EDIT: Clarifications in case the link is not accessible:
std::numpunct defines the thousands separator as a single char. So when the separator is U+2002 (a different kind of space), it cannot be represented as a single char in UTF-8, only as a multi-byte sequence.
In the C API, struct lconv defines the thousands separator as a string and does not suffer from this problem. So when you try to format numbers whose separator lies outside ASCII under a UTF-8 locale, invalid UTF-8 is produced.
To reproduce this bug, write 1234 to a std::ostream with an imbued ru_RU.UTF-8 locale.
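For reference, a minimal sketch of this reproduction (it assumes the ru_RU.UTF-8 locale is installed on the system; the exact separator byte sequence depends on the C++ library implementation):

#include <iostream>
#include <locale>
#include <sstream>

int main() {
    std::ostringstream out;
    out.imbue(std::locale("ru_RU.UTF-8"));  // throws std::runtime_error if the locale is not installed
    out << 1234;                            // numpunct<char> can only insert a single-byte separator
    for (unsigned char c : out.str())
        std::cout << std::hex << static_cast<int>(c) << ' ';  // dump raw bytes to spot invalid UTF-8
    std::cout << '\n';
}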
EDIT2: I must admit that the POSIX C localization API works much more smoothly:
There is an inverse of strftime, namely strptime (strftime does the same job as std::time_put::put).
There are no problems with number formatting, for the reason mentioned above.
However, it is still far from being perfect.
EDIT3: According to the latest notes about C++0x, std::time_get::get is similar to strptime and the opposite of std::time_put::put.
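A minimal sketch of what this looks like, using the std::get_time manipulator from <iomanip> (which drives std::time_get::get); note that how well %p and the AM/PM mark are honoured still depends on the implementation and locale:

#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

int main() {
    std::tm tm = {};
    std::istringstream in("2010-07-14 09:30 PM");
    in >> std::get_time(&tm, "%Y-%m-%d %I:%M %p");  // parse, including the AM/PM mark
    if (in.fail())
        std::cout << "parse failed\n";
    else
        std::cout << std::put_time(&tm, "%Y-%m-%d %H:%M") << '\n';  // round-trip back out
}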

I agree with you, C++ is lacking proper i18n support.
Does anybody know whether any changes are expected in the standard facets in C++0x?
It is too late in the game, so probably not.
Is there any way to raise the importance of such changes?
I am very pessimistic about this.
When asked directly, Stroustrup claimed that he does not see any problems with the current status. And another one of the big C++ guys (book author and all) did not even realize that wchar_t can be one byte, if you read the standard.
And some threads in Boost (which seems to drive the future direction) show so little understanding of how this works that it is outright scary.
C++0x barely added some Unicode character data types, late in the game and after a lot of struggle. I am not holding my breath for more too soon.
I guess the only chance to see something better is if someone really good/respected in the i18n and C++ worlds gets directly involved with the next version of the standard. No clue who that might be though :-(

std::numpunct is a template. All specializations try to return the decimal separator character. Obviously, in any locale where that is a wide character, you should use std::numpunct<wchar_t>, as the <char> specialization can't do that.
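To illustrate, here is a hedged sketch of a hand-written wide facet whose thousands separator is U+2002; only the wchar_t specialization can express this as a single character (the facet name and the grouping chosen here are made up for the example):

#include <iostream>
#include <locale>
#include <sstream>
#include <string>

struct spaced_numpunct : std::numpunct<wchar_t> {
    wchar_t do_thousands_sep() const override { return L'\u2002'; }  // EN SPACE
    std::string do_grouping() const override { return "\3"; }        // groups of three digits
};

int main() {
    std::wostringstream out;
    out.imbue(std::locale(out.getloc(), new spaced_numpunct));  // the locale takes ownership of the facet
    out << 1234567;                                             // formatted as 1<U+2002>234<U+2002>567
    std::wcout << out.str() << L'\n';
}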
That said, C++0x is pretty much done. However, if good improvements continue, the C++ committee is likely to start C++1x. The ISO C++ committee is very likely to accept your help, if offered through your national ISO member organization. I see that Pavel Minaev suggested a Defect Report. That's technically possible, but the problems you describe are in general design limitations. In that case, the most reliable course of action is to design a Boost library for this, have it pass the Boost review, submit it for inclusion in the standard, and participate in the ISO C++ meetings to deal with any issues cropping up there.

Related

Detecting whether US English or British English spelling is appropriate with C++

Using C++ is there a simple and reliable way to detect whether US English or British English spelling is a better match to the user's locale? I have a simple desktop C++ program and I want to perform this first minimal internationalization step. My user interface uses the word color/colour a lot and I am hoping to present the correct spelling to the user in 99%+ cases without a lot of rocket science or an explicit user option.
Edit: The accepted answer below is good. Just to provide some additional information and a complete usable solution, here is perhaps the simplest possible way to determine whether to use US or British English in your C++ application. This is a small but useful step towards proper internationalization. It's useful for US developers who wish to be culturally sensitive to English speakers outside the US, and for other English speaking developers who want their products to look a little more at home in the US (I am in the latter camp).
It's not a perfect solution, but clearly the standardisation process has failed to provide a perfect and simple solution. I expect that at the very least using this code will get color versus colour wrong for a smaller proportion of your users than simply arbitrarily choosing one or the other, assuming your product is widely used.
#include <string>
#include <clocale>

int main() {
    std::setlocale(LC_MONETARY, "");              // pick up the environment's locale
    const std::lconv *lc = std::localeconv();
    // int_curr_symbol may carry a trailing separator character (e.g. "USD "),
    // so compare only the 3-letter ISO 4217 code.
    std::string sym(lc->int_curr_symbol);
    bool us_english = (sym.compare(0, 3, "USD") == 0 || sym.compare(0, 3, "PHP") == 0);
    return us_english ? 0 : 1;
}
I am basically assuming US English is the preference in the USA and British English elsewhere, although I've made an exception for one country where I know that not to be the case (the Philippines).
On my Windows system this program immediately reflects a change in the default language as controlled by Windows Settings or the Control Panel. Doing it this way avoids parsing any strings that may vary from OS to OS and I expect it to be portable.
I am not going to use my own solution, since in my particular case I am using wxWidgets, and as it happens wxWidgets provides a portable wrapper tailor made for this issue (I should have thought of checking before).
#include "wx/intl.h"
int lang = wxLocale::GetSystemLanguage();
bool us_english = (lang == wxLANGUAGE_ENGLISH_US);
The limitations of standards
As NeomerArcana correctly pointed out, the standard C++ way to do this is:
Create a locale object that uses the system's default environment locale: std::locale("")
Query the name of that locale: std::locale("").name()
Unfortunately, the standard doesn't specify what this name should look like, and the MSVC implementation just returns the name you gave when you constructed the object, i.e. an empty string.
Workarounds
Fortunately, however, there's a C workaround that produces a better result with MSVC: std::setlocale(LC_ALL, NULL);. This function returns a pointer to a null-terminated C string which contains the full name of the environment's default locale.
However, the format of this string might depend on the OS and version, so you'll have to parse it in a flexible manner: it can be a proper locale name (e.g. "en-US" on Windows vs. "en_US.UTF-8" on POSIX) or a Windows-specific format (e.g. "English_United States.1252", based on a language table).
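A hedged sketch of that flexible parsing (the function name is made up; it treats any of the name styles above as US English if the name mentions "en-US", "en_US", or "English_United States"):

#include <clocale>
#include <string>

bool locale_is_us_english() {
    std::setlocale(LC_ALL, "");                          // adopt the environment's locale first,
    std::string name = std::setlocale(LC_ALL, nullptr);  // otherwise the query just returns "C"
    return name.find("en-US") != std::string::npos
        || name.find("en_US") != std::string::npos
        || name.find("English_United States") != std::string::npos;
}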
Alternatively, if you target only Windows, you could consider simplifying your life by using GetSystemDefaultLangID(void). This function returns a binary language identifier that you can then check against the language LANG_ENGLISH and regional variants such as SUBLANG_ENGLISH_UK.
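A Windows-only sketch of that approach (assuming <windows.h>; the macros used are the documented Win32 ones):

#include <windows.h>

bool system_language_is_british_english() {
    LANGID id = GetSystemDefaultLangID();
    return PRIMARYLANGID(id) == LANG_ENGLISH && SUBLANGID(id) == SUBLANG_ENGLISH_UK;
}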
There's a whole Localizations/Localisations library in C++ http://en.cppreference.com/w/cpp/locale
You should be able to use it by checking the string returned by std::locale("").name()

Unicode in C++11

I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.
A short summary
First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert functions can use these codecvts to convert strings and buffers directly, not relying on streams.]
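As a hedged sketch of that route (note these <codecvt> facilities were later deprecated in C++17), converting a UTF-8 std::string to UTF-32 without touching streams looks roughly like this:

#include <codecvt>
#include <locale>
#include <string>

std::u32string to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on invalid input
}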
C++11 also includes the C99/C11 <uchar.h> header which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.
However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.
Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
Observations
Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X* and Linux) and has largely become the de-facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.
* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.
Questions
If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.
I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well-specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programs which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well-received.
Is the above analysis correct
Let's see.
you can't validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_length) returns the number of valid bytes in the array.
you can't find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1) gives you a way to iterate over UTF-8 "characters" (Unicode code points), and of course to determine their number (that's not an "easy" way to count the number of characters, but it is a way).
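For concreteness, a sketch of that iteration/counting technique (it steps through one code point at a time; the helper name is made up):

#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

std::size_t count_code_points(const std::string& s) {
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    std::size_t n = 0;
    const char* p = s.data();
    const char* end = p + s.size();
    while (p < end) {
        int len = cvt.length(state, p, end, 1);  // bytes consumed to produce one char32_t
        if (len <= 0) break;                     // invalid or truncated sequence: stop
        p += len;
        ++n;
    }
    return n;
}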
doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
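A small sketch demonstrating the non-BMP case (U+1F600 written as explicit UTF-8 bytes so it compiles the same way before and after C++20's char8_t change):

#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string u16 = conv.from_bytes("\xF0\x9F\x98\x80");  // U+1F600 in UTF-8
    assert(u16.size() == 2);           // a surrogate pair, i.e. not restricted to the BMP
    std::string back = conv.to_bytes(u16);
    assert(back.size() == 4);          // round-trips to the same 4-byte sequence
}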
Demo that illustrates these points.
If I have missed some other "you can't", please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All of the things enumerated in the original question once again cannot (safely) be done using only the standard library.
Is the above analysis correct, or are there any other
Unicode-supporting facilities I'm missing?
You're also missing the utter failure of UTF-8 literals. They don't have a type distinct from narrow-character literals, which may have a totally unrelated (e.g. codepage) encoding. So not only did they not add any serious new facilities in C++11, they broke what little there was, because now you can't even assume that a char* is in the narrow-string encoding for your platform unless UTF-8 is the narrow-string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
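A short illustration of that point, valid for C++11/14/17 (before char8_t arrived in C++20):

#include <type_traits>

// A u8 literal and a plain narrow literal have exactly the same type,
// so nothing in the type system records which encoding the bytes carry.
static_assert(std::is_same<decltype(u8"x"), decltype("x")>::value,
              "both are const char(&)[2]");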
The standards committee has done a fantastic job in the last couple of
years moving C++ forward at a rapid pace. They're all smart people and
I assume they're well aware of the above shortcomings. Is there a
particular well-known reason that Unicode support remains so poor in
C++?
The Committee simply doesn't seem to give a shit about Unicode.
Also, many of the Unicode support algorithms are just that- algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.
Going forward, does anybody know of any proposals to rectify the
situation? A quick search on isocpp.org didn't seem to reveal
anything.
There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.
Firstly, it turns out that the Committee don't bother to give feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.
Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.
Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?
Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.

Explanation about deprecated enum types

I have tried to find some posts or articles, but I can't seem to find a good explanation of:
What are deprecated enum types,
What deprecated means,
How they are declared, or discovered,
How should they be handled, (or not)
What problems they can cause?
Redirecting me to a helpful article would also be great
Thank you very much!
What are deprecated enum types?
I've never heard that exact wording, but this is essentially an enum (type) marked as deprecated.
What does deprecated mean?
Deprecated means some value, function, or maybe even module is marked as now obsolete or replaced. It still exists for compatibility with older code, but you shouldn't use it in new code anymore unless you really have to. Keep in mind it might be removed in future versions.
How they are declared, or discovered?
I'm not aware of any true standard/cross-platform way to do this, unfortunately. The question linked in the comments has some examples regarding this. If your compiler supports some special markup (#pragma instruction or some kind of attribute), it should issue you a warning or similar, if marked properly.
How should they be handled, (or not)?
As mentioned above, try to avoid stuff marked as deprecated. There's typically some replacement or at least hints on what/how to do it in the future. For example, if you're trying to use some standard library function in MSVC that is marked as deprecated, the compiler will typically tell you which function to use instead.
What problems they can cause?
For now, they most likely won't cause any problems, but you might not be able to utilize all the features provided by some library. For example, the classic sprintf() in MSVC never checked the length of the buffer being written to. If you try using it, you'll get a warning asking you to use sprintf_s() instead, which does that security check. You don't have to switch yet (it's marked as deprecated but not removed), but you're essentially missing out. Not to forget that your code might break (and require major rewrites later on) if the deprecated stuff is finally removed.
What are deprecated enum types?
The terminology is ambiguous, but implies some specific enum types have been marked as deprecated using either:
a compiler-specific notation, such that there will be a warning or error if they're used, and/or
documentation without any technical enforcement (whether e.g. a sweeping corporate "coding standard" requiring that say only C++11 enum classes be used, or an API-specific note that specific enum types are deprecated)
What deprecated means?
That the functionality may be removed in a later version of the system, usually because it is considered inferior (whether to some existing alternative, or in terms of maintainability, performance, robustness etc.) or unimportant.
How they are declared, or discovered?
If the deprecation is being enforced by the compiler, then it will have to be visible in the same translation unit as the enum type: it may be in the same header, or in a general "deprecation.h" header etc. Here are a few examples for a common compiler (with a sketch of the standard attribute after the list):
GCC: enum X [ { ... } ] __attribute__ ((deprecated));
for individual enumerators: enum X { E1 __attribute__((deprecated)) [ = value ] [ , ... ] };
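As a hedged aside (these attributes may postdate the answer above): since C++14 there is also a standard spelling, [[deprecated]], and since C++17 it can be applied to individual enumerators as well. The type and enumerator names below are invented for the example.

enum class [[deprecated("use NewFormat instead")]] OldFormat { GIF, PNG };   // whole type (C++14)

enum class Color {
    RED,
    CRIMSON [[deprecated("use RED")]] = RED,   // single enumerator (C++17)
    GREEN
};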
How should they be handled, (or not)
When you can, you should investigate why they're deprecated and what the alternatives are, and change the code using them to avoid them.
What problems they can cause?
The immediate problem they cause is that your compiler may generate warnings or errors.
They may well be deprecated because it's not the best idea to use them even with the current software: related functionality may be inefficient, buggy, etc. For example, given enum Image_Format { GIF, PNG, JPEG, SVG };, GIF may be deprecated in a system because PNG has proven better for the system's users, perhaps because it supports better colour depth and preserves the colours more accurately. SVG might be deprecated because some clients have been found to be using web browsers that won't display it. JPEG might be deprecated because it's known that the images in the system aren't natural photographic images, and the format gives visually poor results despite larger compressed files, slower processing speed and higher memory usage. There are lots of possible motivations for making things deprecated.
A bigger but not immediate issue is that they could disappear with the next revision of the software "subsystem" providing them, so if you don't migrate old code off them and avoid creating new code using them, your software will have to be fixed before it can work with the update to that subsystem.
Deprecated means -
You can use the feature in the current stable release, but it will be removed in some future release; which one is not specified.
They are marked by the library or SDK that created them. Some languages use attributes to mark things as deprecated; for example, C# uses the [Obsolete] attribute. I am not sure about C++, so I don't know what it uses to mark things as deprecated.
They can be removed in any future release. So if you use them, your code or program might not work after a future update, as the feature might have been removed in that update.

Are STL headers written entirely by hand?

I am looking at various STL headers provided with compilers and I can't imagine the developers actually writing all this code by hand.
All the macros and the weird names of variables and classes -- they would have to remember all of them! It seems error prone to me.
Are parts of the headers result of some text preprocessing or generation?
I've maintained Visual Studio's implementation of the C++ Standard Library for 7 years (VC's STL was written by and licensed from P.J. Plauger of Dinkumware back in the mid-90s, and I work with PJP to pick up new features and maintenance bugfixes), and I can tell you that I do all of my editing "by hand" in a plain text editor. None of the STL's headers or sources are automatically generated (although Dinkumware's master sources, which I have never seen, go through automated filtering in order to produce customized drops for Microsoft), and the stuff that's checked into source control is shipped directly to users without any further modification (now, that is; previously we ran them through a filtering step that caused lots of headaches).

I am notorious for not using IDEs/autocomplete, although I do use Source Insight to browse the codebase (especially the underlying CRT whose guts I am less familiar with), and I extensively rely on grep. (And of course I use diff tools; my favorite is an internal tool named "odd".) I do engage in very very careful cut-and-paste editing, but for the opposite reason as novices; I do this when I understand the structure of code completely, and I wish to exactly replicate parts of it without accidentally leaving things out. (For example, different containers need very similar machinery to deal with allocators; it should probably be centralized, but in the meantime when I need to fix basic_string I'll verify that vector is correct and then copy its machinery.)

I've generated code perhaps twice - once when stamping out the C++14 transparent operator functors that I designed (plus<>, multiplies<>, greater<>, etc. are highly repetitive), and again when implementing/proposing variable templates for type traits (recently voted into the Library Fundamentals Technical Specification, probably destined for C++17). IIRC, I wrote an actual program for the operator functors, while I used sed for the variable templates. The plain text editor that I use (Metapad) has search-and-replace capabilities that are quite useful although weaker than outright regexes; I need stronger tools if I want to replicate chunks of text (e.g. is_same_v<T, U> = is_same<T, U>::value).
How do STL maintainers remember all this stuff? It's a full time job. And of course, we're constantly consulting the Standard/Working Paper for the required interfaces and behavior of code. (I recently discovered that I can, with great difficulty, enumerate all 50 US states from memory, but I would surely be unable to enumerate all STL algorithms from memory. However, I have memorized the longest name, as a useless bit of trivia. :->)
The look of it is weird by design, in some sense. The standard library code needs to avoid conflicts with names used in user programs, including macros, and there are almost no restrictions on what can appear in a user program.
They are most probably hand written, and as others have mentioned, if you spend some time looking at them you will figure out what the coding conventions are, how variables are named and so on. One of the few restrictions is that user code cannot use identifiers starting with an underscore followed by a capital letter, or containing two consecutive underscores, so you will find many names in the standard headers that look like _M_xxx or __yyy; it might surprise you at first, but after some time you just ignore the prefix...
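Purely as an illustration of that convention (the names below are invented; real user code must not declare such identifiers, which is exactly why the implementation can use them without fear of collision):

// Implementation-style naming: a leading underscore plus a capital letter, or a
// double underscore, is reserved, so no user macro or name can clash with it.
template <class _Tp>
struct __example_wrapper {
    _Tp __value;
};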

Java native code string ending

Does the string returned from the GetStringUTFChars() end with a null terminated character? Or do I need to determine the length using GetStringUTFLength and null terminate it myself?
Yes, GetStringUTFChars returns a null-terminated string. However, I don't think you should take my word for it, instead you should find an authoritative online source that answers this question.
Let's start with the actual Java Native Interface Specification itself, where it says:
Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().
Oh, surprisingly it doesn't say whether it's null-terminated or not. Boy, that seems like a huge oversight, and fortunately somebody was kind enough to log this bug on Sun's Java bug database back in 2008. The notes on the bug point you to a similar but different documentation bug (which was closed without action), which suggests that the readers buy a book, "The Java Native Interface: Programmer's Guide and Specification" as there's a suggestion that this become the new specification for JNI.
But we're looking for an authoritative online source, and this is neither authoritative (it's not yet the specification) nor online.
Fortunately, the reviews for said book on a certain popular online book retailer suggest that the book is freely available online from Sun, and that would at least satisfy the online portion. Sun's JNI web page has a link that looks tantalizingly close, but that link sadly doesn't go where it says it goes.
So I'm afraid I cannot point you to an authoritative online source for this, and you'll have to buy the book (it's actually a good book), where it will explain to you that:
UTF-8 strings are always terminated with the '\0' character, whereas Unicode strings are not. To find out how many bytes are needed to represent a jstring in the UTF-8 format, JNI programmers can either call the ANSI C function strlen on the result of GetStringUTFChars, or call the JNI function GetStringUTFLength on the jstring reference directly.
(Note that in the above sentence, "Unicode" means "UTF-16", or more accurately "the internal two-byte string representation used by Java", though finding proof of that is left as an exercise for the reader.)
All current answers to the question seem to be outdated (the last update to Edward Thomson's answer dates back to 2015), or refer to Android JNI documentation which can be authoritative only in the Android world. The matter has been clarified in the recent (2017) official Oracle JNI documentation clean-up and updates, more specifically in this issue.
Now the JNI specification clearly states:
String Operations
This specification makes no assumptions on how a JVM represent Java strings internally. Strings returned from these operations:
GetStringChars()
GetStringUTFChars()
GetStringRegion()
GetStringUTFRegion()
GetStringCritical()
are therefore not required to be NULL terminated. Programmers are expected to determine buffer capacity requirements via GetStringLength() or GetStringUTFLength().
In the general case this means one should never assume JNI-returned strings are null-terminated, not even UTF-8 strings. In a pragmatic world, one can test the specific behavior of each JVM on a list of supported JVM(s); a defensive copying sketch follows the list below. In my experience, referring to JVMs I actually tested:
Oracle JVMs do null terminate both UTF-16 (with \u0000) and UTF-8 strings (with '\0');
Android JVMs do terminate UTF-8 strings but not UTF-16 ones.
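A hedged sketch of that defensive style in C++ JNI code (the helper name is made up): copy the modified UTF-8 bytes into a std::string of the exact reported length rather than relying on a terminator being present.

#include <jni.h>
#include <string>

std::string copy_jstring_utf8(JNIEnv* env, jstring jstr) {
    const char* utf = env->GetStringUTFChars(jstr, nullptr);
    if (utf == nullptr) return {};               // out of memory / exception pending
    jsize len = env->GetStringUTFLength(jstr);   // length in bytes, excluding any terminator
    std::string result(utf, static_cast<std::size_t>(len));
    env->ReleaseStringUTFChars(jstr, utf);
    return result;
}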
https://developer.android.com/training/articles/perf-jni says:
The Java programming language uses UTF-16. For convenience, JNI provides methods that work with Modified UTF-8 as well. The modified encoding is useful for C code because it encodes \u0000 as 0xc0 0x80 instead of 0x00. The nice thing about this is that you can count on having C-style zero-terminated strings, suitable for use with standard libc string functions. The down side is that you cannot pass arbitrary UTF-8 data to JNI and expect it to work correctly.
If possible, it's usually faster to operate with UTF-16 strings. Android currently does not require a copy in GetStringChars, whereas GetStringUTFChars requires an allocation and a conversion to UTF-8. Note that UTF-16 strings are not zero-terminated, and \u0000 is allowed, so you need to hang on to the string length as well as the jchar pointer.
Yes, strings returned by GetStringUTFChars() are null-terminated. I use it in my application, so I have proved it experimentally, let's say. While Oracle's documentation sucks, alternative sources are more informative: Java Native Interface (JNI) Tutorial