Related
I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.
A short summary
First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert functions can use these codecvts to convert strings and buffers directly, not relying on streams.]
C++11 also includes the C99/C11 <uchar.h> header which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.
However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.
Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
Observations
Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X and* Linux) and has largely become the de-facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.
* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.
Questions
If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...
Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?
The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?
Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.
EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.
I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well-specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programmes which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well-received.
Is the above analysis correct
Let's see.
you can't validate an array of bytes as containing valid UTF-8
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_lenght) returns the number of valid bytes in the array.
you can't find out the length
Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that need to count characters (in any sense) arises rather infrequently.
you can't iterate over a std::string in any way other than byte-by-byte
Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1) gives you a possibility to iterate over UTF-8 "characters" (Unicode code units), and of course determine their number (that's not an "easy" way to count the number of characters, but it's a way).
doesn't really support UTF-16
Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
Demo that illustrates these points.
If I have missed some other "you can't", please point it out and I will address it.
Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All these things enumerated in original question now cannot (safely) be done again, using only the standard library.
Is the above analysis correct, or are there any other
Unicode-supporting facilities I'm missing?
You're also missing the utter failure of UTF-8 literals. They don't have a distinct type to narrow-character literals, that may have a totally unrelated (e.g. codepages) encoding. So not only did they not add any serious new facilities in C++11, they broke what little there was because now you can't even assume that a char* is in narrow-string-encoding for your platform unless UTF-8 is the narrow string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
The standards committee has done a fantastic job in the last couple of
years moving C++ forward at a rapid pace. They're all smart people and
I assume they're well aware of the above shortcomings. Is there a
particular well-known reason that Unicode support remains so poor in
C++?
The Committee simply doesn't seem to give a shit about Unicode.
Also, many of the Unicode support algorithms are just that- algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.
Going forward, does anybody know of any proposals to rectify the
situation? A quick search on isocpp.org didn't seem to reveal
anything.
There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.
Firstly, it turns out that the Committee don't bother to feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.
Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.
Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?
Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.
char someArray[n];
std::cin >> someArray; // potential buffer overrun
I've seen code like the above numerous times on the C++ forums I frequent. Is there a good reason for this not to be treated as a compile time error? or at the very least, a warning?
An underlying premise with C (and C++) is that the coder should know what they're doing. Otherwise they'd be coding in BASIC :-)
It's not permitted to be an error since it's allowed per the standard, just like gets and scanf("%s") are allowed in C, despite the fact they're a problem waiting to happen.
The code you've posted is bad and has no place in serious software, but it's fine for "toy" programs or testing things. You just need to be aware of its problems (and it sounds very much like you are aware of them).
If C++ had been all been invented in one fell swoop, it probably wouldn't exist at all -- if you wanted to read a string, you'd have to read it into a std::string, and that would be the end of it.
Unfortunately, C++ was used for quite a while before std::string was standardized (or invented at all). Both operator>> and istream::getline (not to be mistaken for std::getline) were invented during that time. When they were invented, there was little (or no) real alternative, so they worked with arrays of char.
Today, of course, there are alternatives, and it's best to just avoid these unless you get stuck writing code with some ancient compiler that doesn't support the superior alternatives.
For example printf instead of cout, scanf instead of cin, using #define macros, etc?
I wouldn't say bad as it will depend on the personal choice. My policy is when there is a type-safe alternatives is available in C++, use them as it will reduce the errors in the code.
It depends on which features. Using define macros in C++ is strongly frowned upon, and for a good reason. You can almost always replace a use of a define macro with something more maintainable and safe in C++ (templates, inline functions, etc.)
Streams, on the other hand, are rightly judged by some people to be very slow and I've seen a lot of valid and high-quality C++ code using C's FILE* with its host of functions instead.
And another thing: with all due respect to the plethora of stream formatting possibilities, for stuff like simple debug printouts, IMHO you just can't beat the succinctness of printf and its format string.
You should definitely use printf in place of cout. The latter does let you make most or all of the formatting controls printf allows, but it does so in a stateful way. I.e. the current formatting mode is stored as part of the (global) object. This means bad code can leave cout in a state where subsequent output gets misformatted unless you reset all the formatting every time you use it. It also wreaks havoc with threaded usage.
I would say the only ones that are truly harmful to mix are the pairings between malloc/free and new/delete.
Otherwise it's really a style thing...and while the C is compatible with the C++, why would you want to mix the two languages when C++ has everything you need without falling back?
There are better solutions for most cases, but not all.
For example, people quite often use memcpy. I would almost never do that (except in really low-level code). I always use std::copy, even on pointers.
The same counts for the input/output routines. But it’s true that sometimes, C-style printf is substantially easier to use than cout (especially in logging). If Boost.Format isn’t an option then sure, use C.
#define is a different beast entirely. It’s not really a C-only feature, and there are many legitimate uses for it in C++. (But many more that aren’t.)
Of course you’d never use it to define constants (that’s what const is for), nor to declare inline functions (use inline and templates!).
On the other hand, it is often useful to generate debugging assertions and generally as a code generation tool. For example, I’m unit-testing class templates and without extensive use of macros, this would be a real pain in the *ss. Using macros here isn’t nice but it saves literally thousands of lines of code.
For allocations, I would avoid using malloc/free altogether and just stick to new/delete.
Not really, printf() is quite faster than cout, and the c++ iostream library is quite large. It depends on the user preference or the program itself (is it needed? etc). Also, scanf() is not suitable to use anymore, I prefer fgets().
What can be used or not only depends on the compiler that will be used. Since you are programming in c++, in my opinion, to maximize compatibility it is better to use what c++ provides instead of c functions unless you do not have any other choices.
Coming from a slightly different angle, I'd say it's bad to use scanf in C, never mind C++. User input is just far to variable to be parsed reliably with scanf.
I'd just post a comment to another reply, but since I can't... C's printf() is better than C++'s iostream because of internationalization. Want to translate a string and put the embedded number in a different place? Can't do it with an ostream. printf()'s format specification is a whole little language unto itself, interpreted at runtime.
I'm trying to make a Fortran 77 wrapper for C++ code. I have not found information about it.
The idea is to use from functions from a lib that is written in C++ in a Fortran 77 progran.
Does anyone know how to do it?
Thanks!
Lawrence Livermore National Laboratory developed a tool called Babel for integrating software written in multiple languages into a single, cohesive application. If your needs are simple you can probably just put C wrapper on your C++ code and call that from Fortran. However, if your needs are more advanced, it might be worth giving Babel a look.
Calling Fortran from C is easy, C from Fortran potentially tricky, C++ from Fortran may potentially become ... challenging.
I have some notes elsewhere. Those are quite old, but nothing changes very rapidly in this sort of area, so there may still be some useful pointers there.
Unfortunately, there's no really standard way of doing this, and different compilers may do it slightly different ways. Having said that, it's only when passing strings that you're likely to run into major headaches. The resource above points to a library called CNF which aims to help here, mostly by providing C macros to sugar the bookkeeping.
The short version, however is this:
Floats and integers are generally easy -- an integer is an integer, more or less.
Strings are hard (because Fortrans quite often store these as structures, and very rarely as C-style null-terminated arrays).
C is call-by-value, Fortran call-by-reference, which means that Fortran functions are always pointer-to-value, from C's point of view.
You have to care about how your compiler generates symbols: compilers often turn C/Fortran symbol foo into _foo or foo_ or some other variant (see the compiler docs).
C tends not to have much of a runtime, C++ and Fortran do, and so you have to remember to link that in somehow, at link time.
That's the majority of what you need to know. The rest is annoying detail, and making friends with your compiler and linker docs. You'll end up knowing more about linkers than you probably wanted to.
For instance why does most members in STL implementation have _M_ or _ or __ prefix?
Why there is so much boilerplate code ?
What features C++ is lacking that would allow make vector (for instance) implementation clear and more concise?
Implementations use names starting with an underscore followed by an uppercase letter or two underscores to avoid conflicts with user-defined macros. Such names are reserved in C++.
For example, one could define a macro called Type and then #include <vector>. If vector implementations used Type as a template parameter name, it would break.
However, one is not allowed to define macros called _Type (or __type, type__ etc.). Therefore, vector can safely use such names.
Lots of STL implementations also include checking for debug builds, such as verifying that two iterators are from the same container when comparing them, and watching for iterators going out of bounds. This involves fairly complex code to track the container and validity of every iterator created, but is invaluable for finding bugs. This code is also all interwoven with the standard release code with #ifdefs - even in the STL algorithms. So it's never going to be as clear as their most basic operation. Sites like this one show the most basic functionality of STL algorithms, stating their functionality is "equivalent to" the code they show. You won't see that in your header files though.
In addition to the good reasons robson and AshleysBrain have already given, one reason that C++ standard library implementations have such terse names and compact code is that virtually every C++ program (compilation unit, really) includes a large number of the standard library headers, and they are thus repeatedly recompiled (remember that they're largely inlined and template-based, whereas the C standard library headers only contain a handful of function declarations). A standard library written to "industry standard" style guidelines would take longer to compile and thus lead to the perception that a particular compiler was "slow". By minimizing whitespace and using short identifier names, the lexer and parser have less work to do, and the whole compilation process completes a little bit faster.
Another reason worth mentioning is that many standard library implementations (e.g. Dinkumware, Rogue Wave (old), etc.) can be used with several different compilers with widely different standards compliance and quirks. There's frequently a lot of macro hackery aimed at satisfying each supported platform.