It has been mentioned in several sources that C++0x will include better language-level support for Unicode (including types and literals).
If the language is going to add these new features, it's only natural to assume that the standard library will as well.
However, I am currently unable to find any references to the new standard library. I expected to find answers to these questions:
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
Does the new library allow writing UTF-8 to files and to the console (or reading it from files and the console)? If so, can we use cout or will we need something else?
Does the new library include "basic" functionality such as discovering the byte count and character length of a UTF-8 string, or converting to upper/lower case (and does this take locales into account)?
Finally, are any of these functions available in popular compilers such as GCC or Visual Studio?
I have tried to look for information, but I can't seem to find anything. I am actually starting to think that maybe these things aren't even decided yet (I am aware that C++0x is a work in progress).
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
No. The new library does provide std::codecvt facets which do the conversion for you when dealing with iostream, however. ISO/IEC TR 19769:2004, the C Unicode Technical Report, is included almost verbatim in the new standard.
Does the new library allowing writing UTF-8 to files, to the console (or from files, from the console). If so, can we use cout or will we need something else?
Yes, you'd just imbue cout with the correct codecvt facet. Note, however, that the console is not required to display those characters correctly.
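A minimal sketch of the imbue approach, shown here with a wide file stream rather than the console (std::codecvt_utf8 is the C++11 facility, deprecated in C++17; the file name and string are arbitrary):

#include <fstream>
#include <locale>
#include <codecvt>   // std::codecvt_utf8 -- C++11, deprecated in C++17

int main()
{
    // Imbue a wide file stream with a UTF-8 codecvt facet so that wchar_t
    // characters are converted to UTF-8 bytes on output. The locale object
    // takes ownership of the facet pointer.
    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open("hello.txt");
    out << L"Gr\u00fc\u00dfe\n";   // written to the file as UTF-8
}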
Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case(does this consider the influence of locales?)
AFAIK that functionality exists in the existing C++03 standard. std::toupper and std::towupper of course function just as in previous versions of the standard. There aren't any new functions that specifically operate on Unicode for this.
If you need these kinds of things, you're still going to have to rely on an external library -- <iostream> is the primary piece that was retrofitted.
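For instance, the locale-aware std::toupper overload in <locale> already exists in C++03; a small sketch of why it is not genuinely Unicode-aware (assuming the user's default locale is available):

#include <iostream>
#include <locale>

int main()
{
    // The two-argument toupper is locale-sensitive, but it maps one code
    // unit at a time, so it cannot express multi-character case mappings
    // (e.g. German sharp s -> "SS") or handle combining sequences.
    std::locale loc("");                        // the user's preferred locale
    std::wcout.imbue(loc);
    std::wcout << std::toupper(L'\u00e9', loc)  // e-acute -> E-acute in most locales
               << L'\n';                        // whether it displays depends on the console
}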
What, specifically, is added for Unicode in the new standard?
Unicode literals, via u8"", u"", and U""
std::char_traits classes for UTF-8, UTF-16, and UTF-32
mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb from ISO/IEC TR 19769:2004
std::codecvt facets for the locale library
The std::wstring_convert class template (which uses the codecvt mechanism for code-set conversions; see the sketch after this list)
The std::wbuffer_convert class template, which does the same as wstring_convert but for stream buffers rather than strings.
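As a rough sketch of the wstring_convert machinery mentioned above (both std::wstring_convert and std::codecvt_utf8_utf16 were added in C++11 and later deprecated in C++17):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // Round-trip a string between UTF-8 and UTF-16.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    const char* utf8_in = "Gr\xC3\xBC\xC3\x9F" "e";   // "Grüße" as explicit UTF-8 bytes
    std::u16string utf16 = conv.from_bytes(utf8_in);  // UTF-8 -> UTF-16
    std::string utf8_out = conv.to_bytes(utf16);      // UTF-16 -> UTF-8
}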
C++11 introduced the c16rtomb()/c32rtomb() conversion functions, along with their inverses (mbrtoc16()/mbrtoc32()). The reference documentation for c16rtomb() here clearly states:
The multibyte encoding used by this function is specified by the currently active C locale
The documentation for c32rtomb() states the same. Both the C and C++ versions agree that these are locale-dependent conversions (as they should be, according to the naming convention of the functions themselves).
However, MSVC seems to have taken a different approach and made them locale-independent (not using the current C locale) according to this document. These conversion functions are specified under the heading Locale-independent multibyte routines.
C++20 adds to the confusion by including the c8rtomb()/mbrtoc8() functions, which if locale-independent would basically do nothing, converting UTF-8 input to UTF-8 output.
Two questions arise from this:
Do any other compilers actually follow the standard and implement locale-dependent Unicode multibyte conversion routines? I couldn't find any concrete information after extensive searching.
Is this a bug in MSVC's implementation?
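For reference, here is a minimal probe of the behavior in question. On a conforming implementation the byte count and byte values produced below should depend on the locale selected at run time, whereas MSVC's documentation implies the output is always UTF-8:

#include <climits>   // MB_LEN_MAX
#include <clocale>
#include <cstdio>
#include <cuchar>

int main()
{
    std::setlocale(LC_ALL, "");                  // select the user's preferred locale
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    // Convert U+00E9 (e-acute); the standard says the result is in the
    // current C locale's multibyte encoding.
    std::size_t n = std::c16rtomb(buf, u'\u00e9', &state);
    if (n != static_cast<std::size_t>(-1))
        std::printf("%zu byte(s) in the current locale's encoding\n", n);
}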
char8_t in C++20 fixes some problems of char, so I was considering using char8_t instead of char for UTF-8 text (e.g. text from the command line). But then I noticed that strlen is not specified in the standard to be used with char8_t; in fact, none of the functions in the cstring library are. Can I expect this to happen in a future standard update? Or is char8_t never intended to replace char in the way I had in mind?
I'm the author of the P0482 and P1423 char8_t proposals.
The intent of those proposals was to introduce the char8_t type with the same level of support present for char16_t and char32_t and then to follow up with additional functionality later. These proposals were adopted late in the C++20 development cycle (at the San Diego and Cologne meetings respectively), so there wasn't opportunity to deliver additional features for C++20.
One of the directives for SG16 as described in P1238 is to standardize new encoding aware text container and view types. Work is progressing in this area and we hope to deliver it for C++23. It is hoped that these new containers and views will supplant much raw string handling in C++.
With regard to strlen specifically, strlen is a C API. N2231 is a proposal to add char8_t support to C (again, at the same level as the existing support for char16_t and char32_t). That proposal has not yet been accepted by WG14. Assuming it is eventually accepted, then it would make sense to follow up with additional char8_t-based C string management functions (perhaps enhancing support for char16_t and char32_t as well).
At present, I'm working on completing an implementation of N2231 in gcc and glibc. Once that is complete, I intend to submit a revision of N2231 to WG14.
You can help! SG16 is an open group. Please feel free to subscribe to our mailing list, join us on Slack, share your ideas, needs, and wants, and write proposals for new functionality (we can help with how to do that).
The new character types are intended to be used with the C++ string template std::basic_string, which is how std::u8string is defined. So the best option in your case is to use C++ strings.
As for future support of char8_t in the cstring library, I suppose this question is better suited to a future C standard. I'm afraid it will not be an easy update and is unlikely to happen soon, since C does not have overloaded functions, so it would require new functions like c8slen in addition to strlen and wcslen.
char8_t is intended for UTF-8-encoded strings. As such, APIs that consume them will be assumed by users to be Unicode aware on some level. Quite a lot of the contents of the <cstring> header would be inappropriate for char8_t, as their behavior is very much not in line with Unicode (would strcmp do proper Unicode collation?).
If you want access to functions that work similarly to the <cstring> functions, then you'll find std::char_traits<char8_t> to contain some useful ones, in particular length (exactly like strlen) and compare (explicitly lexicographical). Most of the rest of <cstring> can be handled adequately through C++ algorithms.
0 can still act as a null terminator in UTF-8 strings, so technically nothing prevents you (except the lack of an appropriate overload) from using strlen to count the number of bytes(!) in a UTF-8 sequence. If you want the number of characters, you need a separate function.
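A small sketch tying the last two answers together; both calls below count char8_t code units (bytes) up to the null terminator, not characters (C++20):

#include <cstring>
#include <string>   // std::char_traits

int main()
{
    const char8_t* s = u8"Gr\u00fc\u00dfe";   // "Grüße": 5 characters, 7 bytes
    std::size_t a = std::char_traits<char8_t>::length(s);            // 7
    std::size_t b = std::strlen(reinterpret_cast<const char*>(s));   // also 7
    (void)a; (void)b;
}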
I'm trying to use the new Unicode characters in C++0x.
So I wrote sample code:
#include <fstream>
#include <string>
int main()
{
    std::u32string str = U"Hello World";
    std::basic_ofstream<char32_t> fout("output.txt");
    fout << str;
    return 0;
}
But after executing this program I'm getting an empty output.txt file. So why isn't it printing Hello World?
Also, is there something like cout and cin already defined for these types, or do stdin and stdout not support Unicode?
Edit: I'm using g++ and Linux.
EDIT: ATTENTION. I have discovered that the standard committee dropped Unicode streams from C++0x, so the previously accepted answer is no longer correct. For more information see my answer!
Unicode string literal support began in GCC 4.5. Maybe that's the problem.
[edit]
After some digging I've found that streams for these new Unicode literals are described in N2035 and were included in a draft of the standard. According to this document you need u32ofstream to output your string, but this class is absent from GCC 4.5's C++0x library.
As a workaround you can use ordinary fstream:
std::ofstream fout2("output2.txt", std::ios::out | std::ios::binary);
fout2.write((const char *)str.c_str(), str.size() * 4);
This way I've output your string in UTF-32LE on my Intel machine (which is little-endian).
[edit]
I was a little bit wrong about the status of u32ofstream: according to the latest draft on The C++ Standards Committee's web site, you have to use std::basic_ofstream<char32_t> as you did. This class would use a codecvt<char32_t,char,typename traits::state_type> facet (see the end of §27.9.1.1), which has to be implemented in the standard library (search for codecvt<char32_t in the document), but it's not available in GCC 4.5.
In the new C++ standard there will not be Unicode streams.
As @ssmir mentioned, the standard committee was going to add stream support for Unicode in C++0x. However, in later drafts the committee decided to remove that support. For more information see this link.
It seems like the only way to output a Unicode string is to convert it to a narrow-character string with codecvt.
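A minimal sketch of that workaround, using std::wstring_convert with std::codecvt_utf8 (both added in C++11, deprecated in C++17):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Convert the UTF-32 string to UTF-8 bytes, then write the bytes through
    // an ordinary narrow ofstream.
    std::u32string str = U"Hello World";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(str);
    std::ofstream("output.txt") << utf8;
}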
When the stream is created, it tries to obtain a codecvt facet from the global locale, but fails because the only standard codecvt specializations are for char and wchar_t.
As a result, the _M_codecvt member of the stream object is NULL.
Later, during the attempted output, your code throws an exception (not visible to the user) in the facet-checking function in basic_ios.h, because the facet is initialized from _M_codecvt.
The fix: add a facet to the locale associated with the stream to do the conversion from char32_t to the correct output, i.e. imbue the stream with a locale containing a codecvt of the right type.
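A sketch of what that looks like, assuming the implementation's basic_filebuf<char32_t> actually consults the imbued facet (which, as noted elsewhere in this thread, GCC 4.5 does not):

#include <codecvt>   // std::codecvt_utf8 -- C++11, deprecated in C++17
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Supply the missing codecvt<char32_t, char, mbstate_t> facet by imbuing
    // the stream before opening the file; codecvt_utf8<char32_t> derives from
    // that facet type and converts UTF-32 to UTF-8 bytes.
    std::basic_ofstream<char32_t> fout;
    fout.imbue(std::locale(fout.getloc(), new std::codecvt_utf8<char32_t>));
    fout.open("output.txt");
    fout << std::u32string(U"Hello World");
}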
The standard is pretty much silent on what constitutes a valid locale name; only that passing an invalid locale name results in std::runtime_error. What locale names are usable with common Windows compilers such as MSVC, MinGW, and ICC?
Ok, there is a difference between C and C++ locales.
Let's start:
MSVC C++ std::locale and C setlocale
Accepts locale names in the form "Language[_Country][.Codepage]", for example "English_United States.1252"; otherwise it throws. Note: the codepage can't be 65001/UTF-8 and should be consistent with the ANSI codepage for that locale (or just omitted).
MSVC C++ std::locale and C setlocale on Vista and 7 should also accept locale names of the form [Language][-Script][-Country], like "en-US", using ISO 639 language codes, ISO 3166 region codes, and script names.
I tested it with Visual Studio on Windows 7 - it does not work.
MinGW C++ std::locale accepts only "C" and "POSIX"; it does not support other locales. Actually, gcc supports locales only through the GNU C library, i.e. basically only under Linux. setlocale is a native Windows API call, so it should support everything I mentioned above. It may support a wider range of locales when used with alternative C++ standard libraries like Apache stdcxx or STLport.
ICC - I haven't tested it, but it depends on the standard C++ library it uses. For example, under Linux it uses GCC's libstdc++, so it supports all the locales gcc supports. I don't know which standard C++ library it uses under Windows.
If you want "compiler and platform" independent locale support (and actually much better support), take a look at Boost.Locale.
Artyom
I believe the information you need is here:
locale "lang[_country_region[.code_page]]"
| ".code_page"
| ""
| NULL
This page provides links to :
Language Strings
Country/Region String
Code Pages
Although my answer covers setlocale instead of std::locale, this MSDN page seems to imply that the format is indeed the same:
An object of class locale also stores a locale name as an object of class string. Using an invalid locale name to construct a locale facet or a locale object throws an object of class runtime_error. The stored locale name is "*" if the locale object cannot be certain that a C-style locale corresponds exactly to that represented by the object.
Otherwise, you can establish a matching locale within the Standard C Library, for the locale object loc, by calling setlocale(LC_ALL, loc.name().c_str()).
Also see this page and this thread which tend to show that std::locale internally uses setlocale.
Here's one locale name that's usable pretty much anywhere: "". That is, the empty string. This is in contrast to the "C" locale that you are probably getting by default. The empty string as an argument to std::setlocale() means something like "Use the preferred locale set by the user or environment." If you use this, the downside is that your program won't have the same output everywhere; the upside is that your users might think it works just the way they want.
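A small probe you can run to see which names your implementation accepts; the names below are examples, not a portable set:

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    for (const char* name : { "", "C", "en_US.UTF-8", "English_United States.1252" })
    {
        try {
            std::locale loc(name);   // throws std::runtime_error for invalid names
            std::cout << "ok:     \"" << name << "\" -> " << loc.name() << '\n';
        } catch (const std::runtime_error&) {
            std::cout << "failed: \"" << name << "\"\n";
        }
    }
}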
my question is simple:
Should I use an array of char, e.g.:
char *buf, buf2[MAX_STRING_LENGTH]
etc., or should I use std::string in a library that will be used by other programmers, who may use it on any OS and compiler of their choice?
Considering performance and portability...
From my point of view, std strings are easier, and performance is equal or the difference is far too small to justify not using std::string; about portability I don't know. I guess since it is standard, there shouldn't be any compiler that compiles C++ without it, at least no major compiler.
EDIT:
The library will be compiled on 3 major OSes and, theoretically, distributed as a lib.
Your thoughts?
ty,
Joe
Depends on how this library will be used in conjunction with client code. If it will be linked in dynamically and you have a set of APIs exposed for the client -- you are better off using null terminated byte strings (i.e. char *) and their wide-character counterparts. If you are talking about using them within your code, you certainly are free to use std::string. If it is going to be included in source form -- std::string works fine.
But if your library is shipped as DLL your users will have to use the same implementation of std::string. It won't be possible for them to use STLPort (or any other implementation) if your library was built using Microsoft STL.
As long as you are targeting pure C++ for your library, using std::string is fine and even desirable. However, doing that ties you to a particular implementation of C++ (the one used to build your library), and it can't be linked with other C++ implementations or other languages.
Often, it is highly desirable to give a library a C interface rather than a C++ one. That way it's usable by any other language that provides a C foreign function interface (which is most of them). For a C interface, you need to use char *.
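A rough sketch of that approach; the function name mylib_greet and its buffer-out convention are invented for illustration, not taken from any real library:

// Internally the library is free to use std::string; only the extern "C"
// boundary deals in char* and explicit buffer sizes.
#include <cstring>
#include <string>

namespace {
std::string make_greeting(const std::string& name)      // internal C++ code
{
    return "Hello, " + name + "!";
}
}

extern "C" int mylib_greet(const char* name, char* out, std::size_t out_size)
{
    const std::string result = make_greeting(name);
    if (result.size() + 1 > out_size)
        return -1;                                       // caller's buffer too small
    std::memcpy(out, result.c_str(), result.size() + 1); // copy including the '\0'
    return 0;
}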
I would recommend just using std::string. Besides, if you need compatibility with libraries requiring C-style strings (for example, anything exposing a C-compatible API), you can always use the c_str() method of std::string.
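For example, a hypothetical case of handing an internal std::string to a C API:

#include <cstdio>
#include <string>

int main()
{
    // c_str() yields a null-terminated const char* that C APIs expect.
    std::string path = "output.txt";
    if (std::FILE* f = std::fopen(path.c_str(), "w")) {
        std::fputs("hello\n", f);
        std::fclose(f);
    }
}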
In general, you will be better off using std::string, certainly for calls internal to your library.
For your API, it depends on what its purpose is. For internal use within your organization, an API that uses std::string will probably be fine. For external use you may wish to provide a C API, one which uses char *.