Does Standard ML support Unicode?
I believe it does not but cannot find any authoritative documentation for SML stating such.
A yes or no is all that is needed, but you must know for a fact. No guessing or I believe answers. An authoritative link would be better.
Not really. All there is in the standard for the time being is the ability to use \uXXXX escapes in character and string literals, and that it does at least allow Unicode as the underlying character encoding for char or the optional WideChar.char. But the standard basis library does not prescribe any support for additional Unicode-aware functionality.
Particular implementations may have additional support, and you may perhaps find some third-party unicode libraries, but that's about it (unfortunately, I have no pointers at hand).
It depends a lot what you mean by "Unicode", which is a collection of many standards for many things. I've not seen any language or system that supports Unicode fully, and I don't even know what that would mean in all details.
You can certainly work with UTF-8 in SML: that encoding was invented to make it easy for ASCII applications to support Unicode. This might result it better and more efficient representation of Unicode than e.g. UTF-16 seen in Java, which does "support Unicode" officially, but then there are many practical problems with it (like surrogate characters).
With UTF-8 in SML strings, one question is how to work with string literals. Systems like Poly/ML allow to redefine the ML toplevel pretty printer for type string, and it is also feasible to wrap up the compiler to process string literals in a Unicode friendly way. Both of this is done in Isabelle/ML, which is based on Poly/ML. So if you take that big theorem proving environment as ML development platform, you have some kind of Unicode support built in (via so-called "Isabelle symbols").
Related
How to arrange correct processing of Unicode strings using pure C++?
What I mean is, when you put your unicode string into std::string and count its length, sometimes you get like 10 characters for 5-chars-long string.
How do they do it in serious open-source programs? How do they do it in a cross-platform manner? How do you tie it to file i/o and stdin/stdout streams?
Thanks.
There's Boost.Locale, which is written in C++, wraps the ICU library, and provides a nice, non-alien interface to it.
For Unicode work, my first choice would be Boost.Locale, followed by ICU directly (if there is something that Boost.Locale doesn't wrap yet).
std::[w]string, contrary to popular belief, has no Unicode support whatsoever. They both operate only on [w]char[_t] units, in an encoding agnostic way.
If you only need basic Unicode support in the form of length and conversions and encoding verification, there is utfcpp, which provides a beautiful C++ interface for these operations.
Application frameworks like Qt and wxWdigets do provide their own string classes, which offer better Unicode support, but often tying you to use the whole framework throughout your code.
Aside from that, there is ICU, which is the standard Unicode implementation around today.
A work in progress by one of the C++ masters on this website is ogonek. you can surely contact the author through the Lounge<C++> StackOverflow chat room to ask for details on his progress.
This is how: http://www.utf8everywhere.org
Have you checked http://site.icu-project.org already?
ICU is currently the Unicode library. If you want cross-platform Unicode support, ICU is basically the only place to get it.
If only its interface wasn't more unfriendly than the wrong end of an automatic shotgun.
I've used wxWidgets to do this. It makes for easy conversion from std::string to their string type wxString. It's not ideal, but it works well, is simple and portable.
I hope there is any library out there that provides such functionality so that I do not need to dig too much on charset specification.
C++, and hopefully Chinese, and hopefully Windows.
Yes, ICU is a mature library providing Unicode and Globalization support. Among other things it provides easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.
I have not tried to program with it it myself, but in the Unix world the Gnu Library libiconv is very widely used. It is also available for Windows. It is probably a bit more slim then the ICU.
I'm currently developing a cross-platform C++ library which I intend to be Unicode aware. I currently have compile-time support for either std::string or std::wstring via typedefs and macros. The disadvantage with this approach is that it forces you to use macros like L("string") and to make heavy use of templates based on character type.
What are the arguments for and against to support std::wstring only?
Would using std::wstring exclusively hinder the GNU/Linux user base, where UTF-8 encoding is preferred?
A lot of people would want to use unicode with UTF-8 (std::string) and not UCS-2 (std::wstring). UTF-8 is the standard encoding on a lot of linux distributions and databases - so not supporting it would be a huge disadvantage. On Linux every call to a function in your library with a string as argument would require the user to convert a (native) UTF-8 string to std::wstring.
On gcc/linux each character of a std::wstring will have 4 bytes while it will have 2 bytes on Windows. This can lead to strange effects when reading or writing files (and copying them from/to different platforms). I would rather recomend UTF-8/std::string for a cross platform project.
What are the arguments for and against to support std::wstring only?
The argument in favor of using wide characters is that it can do everything narrow characters can and more.
The argument against it that I know are:
wide characters need more space (which is hardly relevant, the Chinese do not, in principle, have more headaches over memory than Americans have)
using wide characters gives headaches to some westerners who are used for all their characters to fit into 7bit (and are unwilling to learn to pay a bit of attention to not to intermingle uses of the character type for actual characters vs. other uses)
As for being flexible: I have maintained a library (several kLoC) that could deal with both narrow and wide characters. Most of it was through the character type being a template parameter, I don't remember any macros (other than UNICODE, that is). Not all of it was flexible, though, there was some code in there which ultimately required either char or wchar_t string. (No point in making internal key strings wide using wide characters.)
Users could decide whether they wanted only narrow character support (in which case "string" was fine) or only wide character support (which required them to use L"string") or whether they wanted to support both, too (which required something like T("string")).
For:
Joel Spolsky wrote The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. If you scroll to the bottom, you'll find that his crew uses wide character strings exclusively. If it's good enough for them, it's good enough for you. ;-)
Against:
You might have to interface with code that isn't i18n-aware. But like any good library writer, you'll just hide that mess behind an easy-to-use interface, right? Right?
I would say that using std::string or std::wstring is irrelevant.
None offer proper Unicode support anyway.
If you need internationalization, then you need proper Unicode support and should start investigating about libraries such as ICU.
After that, it's a matter of which encoding use, and this depends on the platform you're on: wrap the OS-dependent facilities behind an abstraction layer and convert in the implementation layer when applicable.
Don't worry about the encoding internally used by the Unicode library you use (or build ? hum), it's a matter of performance and should not impact the use of the library itself.
Disadvantage:
Since wstring is truly UCS-2 and not UTF-16. I will kick you in the shins one day. And it will kick hard.
Suppose we have an arbitrary string, s.
s has the property of being from just about anywhere in the world. People from USA, Japan, Korea, Russia, China and Greece all write into s from time to time. Fortunately we don't have time travellers using Linear A, however.
For the sake of discussion, let's presume we want to do string operations such as:
reverse
length
capitalize
lowercase
index into
and, just because this is for the sake of discussion, let's presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.
There are 3 standards for Unicode: utf-8, utf-16, and utf-32, each with pros and cons. But let's say I'm sorta dumb, and I want one Unicode to rule them all (because rolling a dynamically adapting library for 3 different kinds of string encodings that hides the difference from the API user sounds hard).
Which encoding is most general?
Which encoding is supported by wchar_t?
Which encoding is supported by the STL?
Are these encodings all(or not at all) null-terminated?
--
The point of this question is to educate myself and others in useful and usable information for Unicode: reading the RFCs is fine, but there's a 'stack' of information related to compilers, languages, and operating systems that the RFCs do not cover, but is vital to know to actually use Unicode in a real app.
Which encoding is most general
Probably UTF-32, though all three formats can store any character. UTF-32 has the property that every character can be encoded in a single codepoint.
Which encoding is supported by wchar_t
None. That's implementation defined. On most Windows platforms it's UTF-16, on most Unix platforms its UTF-32.
Which encoding is supported by the STL
None really. The STL can store any type of character you want. Just use the std::basic_string<t> template with a type large enough to hold your code point. Most operations (e.g. std::reverse) do not know about any sort of unicode encoding though.
Are these encodings all(or not at all) null-terminated?
No. Null is a legal value in any of those encodings. Technically, NULL is a legal character in plain ASCII too. NULL termination is a C thing -- not an encoding thing.
Choosing how to do this has a lot to do with your platform. If you're on Windows, use UTF-16 and wchar_t strings, because that's what the Windows API uses to support unicode. I'm not entirely sure what the best choice is for UNIX platforms but I do know that most of them use UTF-8.
Have a look at the open source library ICU, especially at the Docs & Papers section. It's an extensive library dealing with all sorts of unicode oddities.
In response to your final bullet, UTF-8 is guaranteed not to have NULL bytes in its encoding of any character (except NULL itself, of course). As a result, many functions that work with NULL-terminated strings also work with UTF-8 encoded strings.
Define "real app" :)
Seriously, the decision really depends a lot on the kind of software you are developing. If your target platform is Win32 API (with or without wrappers such as MFC, WTL, etc) you would probably want to use wstring types with the text encoded as UTF-16. That's simply because all Win32 API internally uses that encoding anyway.
On another hand, if your output is something like XML/HTML and/or needs to be delivered over the internet, UTF-8 is pretty much the standard - it is usually transmitted well via protocols that make assumptions about characters having 8 bits.
As for UTF-32, I can't think of a single reason to use it, unless you need 1:1 mapping between code units and code points (that still does not mean 1:1 mapping between code units and characters!).
For more information, be sure to look at Unicode.org. This FAQ may be a good starting point.
What is the best practice of Unicode processing in C++?
Use ICU for dealing with your data (or a similar library)
In your own data store, make sure everything is stored in the same encoding
Make sure you are always using your unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like is_alpha unless that is the definition you want.
I can't say it enough: never iterate over the indices of a string if you care about correctness, always use your unicode library for this.
If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
Our company (and others) use the open source Internation Components for Unicode (ICU) library originally developed by Taligent.
It handles strings, locales, conversions, date/times, collation, transformations, et. al.
Start with the ICU Userguide
Here is a checklist for Windows programming:
All strings enclosed in _T("my string")
strlen() etc. functions replaced with _tcslen() etc.
Use LPTSTR and LPCTSTR instead of char * and const char *
When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
For C++ strings, use std::wstring instead of std::string
Look at
Case insensitive string comparison in C++
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look on the left-hand navigation side on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)
It has the following subsections:
The Code-Page Model
Double-Byte Character Sets in Windows
Unicode
Compatibility Issues in Mixed Environments
Unicode Data Conversion
Migrating Windows-Based Programs to Unicode
Summary
Although this may not be best practice for everyone, you can write your own C++ UNICODE routines if you want!
I just finished doing it over a weekend. I learned a lot, though I don't guarantee it's 100% bug free, I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and Standard ASCII. If you throw away the main code, you've got a nice library for reading / writing UNICODE.
As has been said above a library is the best bet when using a large system. However some times you do want to handle things your self (maybe because the library would use to many resources like on a micro controller). In this case you want a simple library that you can copy the parts out of for the things you actually need.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has smaller code, but lacks full error checking and only handles UTF-8 but was simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
http://www.cprogramming.com/tutorial/unicode.html (UTF8)
http://utfcpp.sourceforge.net/ (Simple UTF8 library)
Use IBM's International Components for Unicode
Have a look at the recommendations of UTF-8 Everywhere