c16rtomb()/c32rtomb() locale-independent conversion?

C++11 introduced the c16rtomb()/c32rtomb() conversion functions, along with the inverse (mbrtoc16()/mbrtoc32()). c16rtomb() clearly states in the reference documentation here:
The multibyte encoding used by this function is specified by the currently active C locale
The documentation for c32rtomb() states the same. Both the C and C++ versions agree that these are locale-dependent conversions (as they should be, according to the naming convention of the functions themselves).
However, MSVC seems to have taken a different approach and made them locale-independent (not using the current C locale) according to this document. These conversion functions are specified under the heading Locale-independent multibyte routines.
C++20 adds to the confusion by including the c8rtomb()/mbrtoc8() functions, which if locale-independent would basically do nothing, converting UTF-8 input to UTF-8 output.
Two questions arise from this:
Do any other compilers actually follow the standard and implement locale-dependent Unicode multibyte conversion routines? I couldn't find any concrete information after extensive searching.
Is this a bug in MSVC's implementation?
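One way to probe this on a particular toolchain is to encode the same code point under different C locales and compare the bytes that come back. The following is only a minimal sketch, assuming the implementation ships <cuchar>; encode_c32 is a hypothetical helper name, not part of any library.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>   // std::size_t
#include <cuchar>    // std::c32rtomb
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper: encode one UTF-32 code point with c32rtomb().
// Returns an empty string if the implementation reports an encoding error.
std::string encode_c32(char32_t code_point) {
    std::mbstate_t state{};   // zero-initialised means "initial conversion state"
    char buffer[MB_LEN_MAX];  // large enough for any single multibyte character
    std::size_t length = std::c32rtomb(buffer, code_point, &state);
    if (length == static_cast<std::size_t>(-1)) {
        return std::string(); // not representable in the encoding that was used
    }
    return std::string(buffer, length);
}
```

Calling std::setlocale(LC_ALL, ...) with different locale names before encode_c32(U'\u00E9') and comparing the results should show whether a given implementation consults the locale. Which locale names are accepted is platform-specific, so no expected output is given here.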

Related

Status of tgmath.h in terms of portability across C, C++ compilers

I'm implementing a DSL with a compiler that translates it into C. Our code is self-contained: we provide users with just one main entry point (i.e., void step()), all inputs are given in global variables with simple types (bool, signed/unsigned ints, double, float, structs, and arrays; but no pointers, and no functions as inputs).
Our main target is C99, but some people are using the arduino compiler for the code we produce, so maximum portability across standards and compilers is preferred.
I see that C99's math.h defines multiple variants of functions based on the type of the arguments (e.g., sin, sinf, sinl).
To avoid losing precision, our translator can, based on the types of the arguments, pick one of those variants. However, all functions we support are also defined by tgmath.h, so a simpler way would be to use the standard name for each (e.g., sin) and let tgmath.h pick the appropriate variant based on the types of the arguments.
I've read online that opinions on tgmath.h, and support, are mixed. Some people say it is ugly, and that not all compilers support it.
Is relying on tgmath.h a bad idea? Is this going to give us headaches down the line?

tgmath.h has been a standard header since C99 (see pages 165 and 335 of the standard), so any C99-compliant compiler provides it. Relying on tgmath.h is therefore OK.

Will we have a size_t strlen(const char8_t*) in a future C++ version

char8_t in C++20 fixes some problems of char, so I was considering using char8_t instead of char for utf8 text (e.g. text from command line). But then I noticed that strlen was not specified in the standard to be used with char8_t, actually none of the functions in the cstring library are. Can I expect this to happen in a next standard update? Or is char8_t never intended to replace char in the way I had in mind?

I'm the author of the P0482 and P1423 char8_t proposals.
The intent of those proposals was to introduce the char8_t type with the same level of support present for char16_t and char32_t and then to follow up with additional functionality later. These proposals were adopted late in the C++20 development cycle (at the San Diego and Cologne meetings respectively), so there wasn't opportunity to deliver additional features for C++20.
One of the directives for SG16 as described in P1238 is to standardize new encoding aware text container and view types. Work is progressing in this area and we hope to deliver it for C++23. It is hoped that these new containers and views will supplant much raw string handling in C++.
With regard to strlen specifically, strlen is a C API. N2231 is a proposal to add char8_t support to C (again, at the same level as the existing support for char16_t and char32_t). That proposal has not yet been accepted by WG14. Assuming it is eventually accepted, then it would make sense to follow up with additional char8_t-based C string management functions (perhaps enhancing support for char16_t and char32_t as well).
At present, I'm working on completing an implementation of N2231 in gcc and glibc. Once that is complete, I intend to submit a revision of N2231 to WG14.
You can help! SG16 is an open group. Please feel free to subscribe to our mailing list, join us on Slack, share your ideas, needs, and wants, and write proposals for new functionality (we can help with how to do that).

These new char types are intended to be used with the C++ string template std::basic_string, namely through std::u8string. So the best option in your case is to use C++ strings.
As for future support of char8_t in the cstring library, I suppose this question is better directed at a future C standard. I'm afraid such an update would not be easy and is unlikely: C does not have overloaded functions, so it would require new functions like c8slen in addition to strlen and wcslen.

char8_t is intended for UTF-8-encoded strings. As such, APIs that consume them will be assumed by users to be Unicode aware on some level. Quite a lot of the contents of the <cstring> header would be inappropriate for char8_t, as their behavior is very much not in line with Unicode (would strcmp do proper Unicode collation?).
If you want access to functions that work similarly to the <cstring> functions, then you'll find that std::char_traits<char8_t> contains some useful ones, in particular length (exactly like strlen) and compare (explicitly lexicographical). Most of the rest of <cstring> can be handled adequately through C++ algorithms.

0 can still act as the null terminator in UTF-8 strings, so technically nothing (except the lack of an appropriate function) prevents you from using strlen to count the number of bytes(!) in a UTF-8 sequence. If you want to find the number of characters, you would need a separate function.
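The difference between the two counts can be sketched as follows. This is an illustration only: utf8_bytes and utf8_code_points are hypothetical helper names, the input is assumed to be well-formed UTF-8, and "characters" here means code points rather than user-perceived characters.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of bytes before the terminating 0, i.e. plain strlen.
std::size_t utf8_bytes(const char* text) {
    return std::strlen(text);
}

// Hypothetical helper: number of code points in a 0-terminated UTF-8 string.
// Assumes well-formed input; no validation is performed.
std::size_t utf8_code_points(const char* text) {
    std::size_t count = 0;
    for (; *text != '\0'; ++text) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((static_cast<unsigned char>(*text) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), the first function reports 6 bytes and the second reports 5 code points.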

"wcs" and "_w" and "_mbs" prefix in Visual Studio

I am a little confused with respect to the difference in the functions which are defined with/without the wcs/_w/_mbs prefix.
For Example:
fopen(),_wfopen()
On msdn it is given that:
The fopen function opens the file that is specified by filename.
_wfopen is a wide-character version of fopen; the arguments to _wfopen are wide-character strings. Otherwise, _wfopen and fopen behave
identically.
I just had a doubt whether there is any platform dependence to windows associated with the addition of the "_w" prefix.
strcpy(),wcscpy(),_mbscpy()
On msdn it is given that:
wcscpy and _mbscpy are, respectively, wide-character and multibyte-character versions of strcpy.
Again there is a doubt if the addition of "wcs" or "_mbs" is platform dependent.
EDIT:
Is WideCharToMultiByte function also platform dependent?
WideCharToMultiByte is not a C Runtime function; it's a Windows API, hence it is platform dependent.
Similarly is wcstombs_s function also platform dependent?
It was nonstandard but was standardized in C11 Annex K.

The wcs* functions like wcscpy are part of the C Standard Library. The _wfopen function and other _w* functions are extensions, as are the multibyte string functions like _mbscpy.
For the most part, Visual C++ C Runtime (CRT) functions that have a leading underscore are extensions; functions that do not have a leading underscore are part of the C Standard Library.
There are two main exceptions, where extensions may not have leading underscores:
There are several extension functions, declared with an underscore prefix, that have prefixless aliases for backwards source compatibility. These aliases are deprecated, and if you try to use them you'll get a suppressible deprecation warning (C4996).
There are _s-suffixed secure alternative functions to some C Standard Library functions, e.g. scanf_s. These are declared by default, but their declarations may be suppressed by defining the macro __STDC_WANT_SECURE_LIB__ to have the value 0.
(These functions were actually added to C11 in the optional Annex K, but note that there are a few differences between what is specified in the C Standard and what is implemented by Visual C++. The differences are due to a historical accident.)

wcscpy is standard. _mbscpy is specific to MS VC.
That's why there is an underscore at the start: names with leading underscore are reserved for implementation-specific things.

-Werror=format: how can the compiler know

I wrote this intentionally wrong code
printf("%d %d", 1);
compiling with g++ and -Werror=format.
The compiler gives this very impressive warning:
error: format '%d' expects a matching 'int' argument [-Werror=format]
As far as I can see, there's no way the compiler can tell that the code is wrong, because the format string isn't parsed until runtime.
My question: does the compiler have a special feature that kicks in for printf and similar libc functions, or is this a feature I could use for my own functions? String literals?

As far as I can see, there's no way the compiler can tell that the code is wrong, because the format string isn't parsed until runtime.
As long as the format string is a string literal, it can be parsed at compile time. If it isn't (which is usually a bad idea anyway), then you can get a warning about that from -Wformat-security.
does the compiler have a special feature that kicks in for printf and similar libc functions?
Yes.
or is this a feature I could use for my own functions?
Yes, as long as you're using the same style of format string as printf (or various other standard functions like scanf or strftime). In GCC, you can declare the function with the format attribute:
void my_printf(Something, char const * format, SomethingElse, ...)
    __attribute__ ((format (printf,2,4)));
This indicates that the second argument is a printf-style format string, and that the values to format begin with the fourth. See http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html.
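A fuller sketch of a user-defined function carrying this attribute might look like the following. The names PRINTF_LIKE and format_message are hypothetical and chosen for illustration; the attribute itself is a GCC/Clang extension, so the macro expands to nothing on other compilers.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// The format attribute is a GCC/Clang extension; expand to nothing elsewhere.
#if defined(__GNUC__)
#  define PRINTF_LIKE(format_index, first_arg_index) \
       __attribute__((format(printf, format_index, first_arg_index)))
#else
#  define PRINTF_LIKE(format_index, first_arg_index)
#endif

// Hypothetical function: argument 1 is the format string and the values
// to format start at argument 2.
std::string format_message(const char* format, ...) PRINTF_LIKE(1, 2);

std::string format_message(const char* format, ...) {
    char buffer[256];  // fixed size keeps the sketch short; longer output is truncated
    std::va_list args;
    va_start(args, format);
    int written = std::vsnprintf(buffer, sizeof buffer, format, args);
    va_end(args);
    if (written < 0) {
        return std::string();  // vsnprintf reported an encoding error
    }
    return std::string(buffer);
}
```

With this declaration in scope, a call such as format_message("%d %d", 1) should draw the same kind of -Wformat diagnostic as the printf call in the question.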

Well, printf definitely parses the format string at runtime in order to do its job. But nowhere is it written that the compiler may not choose to parse it itself if it wants to.
The documentation for -Wformat says that this is exactly what happens:
-Wformat
-Wformat=n
Check calls to printf and scanf, etc., to make sure that the arguments supplied have types appropriate to the format string
specified, and that the conversions specified in the format string
make sense. This includes standard functions, and others specified by
format attributes (see Function Attributes), in the printf, scanf,
strftime and strfmon (an X/Open extension, not in the C standard)
families (or other target-specific families). Which functions are
checked without format attributes having been specified depends on the
standard version selected, and such checks of functions without the
attribute specified are disabled by -ffreestanding or -fno-builtin.
The formats are checked against the format features supported by GNU libc version 2.2.
These include all ISO C90 and C99 features, as
well as features from the Single Unix Specification and some BSD and
GNU extensions. Other library implementations may not support all
these features; GCC does not support warning about features that go
beyond a particular library's limitations. However, if -Wpedantic is
used with -Wformat, warnings are given about format features not in
the selected standard version (but not for strfmon formats, since
those are not in any version of the C standard). See Options
Controlling C Dialect.
Update: Turns out you can use it on your own functions. Mike has the details.

What new Unicode functions are there in C++0x?

It has been mentioned in several sources that C++0x will include better language-level support for Unicode (including types and literals).
If the language is going to add these new features, it's only natural to assume that the standard library will as well.
However, I am currently unable to find any references to the new standard library. I expected to find answers to these questions:
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
Does the new library allow writing UTF-8 to files and to the console (or reading it from files and from the console)? If so, can we use cout or will we need something else?
Does the new library include "basic" functionality such as discovering the byte count and length of a UTF-8 string, or converting to upper-case/lower-case (does this consider the influence of locales?)
Finally, are any of these functions available in any popular compilers such as GCC or Visual Studio?
I have tried to look for information, but I can't seem to find anything. I am actually starting to think that maybe these things aren't even decided yet (I am aware that C++0x is a work in progress).

Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
No. The new library does provide std::codecvt facets which do the conversion for you when dealing with iostream, however. ISO/IEC TR 19769:2004, the C Unicode Technical Report, is included almost verbatim in the new standard.
Does the new library allow writing UTF-8 to files and to the console (or reading it from files and from the console)? If so, can we use cout or will we need something else?
Yes, you'd just imbue cout with the correct codecvt facet. Note, however, that the console is not required to display those characters correctly.
Does the new library include "basic" functionality such as discovering the byte count and length of a UTF-8 string, or converting to upper-case/lower-case (does this consider the influence of locales?)
AFAIK that functionality exists with the existing C++03 standard. std::toupper and std::towupper of course function just as in previous versions of the standard. There aren't any new functions which specifically operate on Unicode for this.
If you need these kinds of things, you're still going to have to rely on an external library; <iostream> is the primary piece that was retrofitted.
What, specifically, is added for Unicode in the new standard?
Unicode literals, via u8"", u"", and U""
std::char_traits classes for UTF-8, UTF-16, and UTF-32
mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb from ISO/IEC TR 19769:2004
std::codecvt facets for the locale library
The std::wstring_convert class template (which uses the codecvt mechanism for code set conversions)
The std::wbuffer_convert class template, which applies the same kind of codecvt conversion as wstring_convert but wraps a byte stream buffer (std::streambuf) instead of converting whole strings.
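As an illustration of the wstring_convert item in the list above, a UTF-8 to UTF-16 conversion could be sketched like this. The function name utf8_to_utf16 is hypothetical. Note that wstring_convert and the <codecvt> facets were deprecated in C++17, so newer compilers may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert a UTF-8 byte string to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& utf8) {
    // codecvt_utf8_utf16 converts between UTF-8 bytes and UTF-16 code units.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf8);  // throws std::range_error on invalid input
}
```

A two-byte UTF-8 sequence such as "\xC3\xA9" (U+00E9) becomes a single UTF-16 code unit, while a four-byte sequence such as "\xF0\x9F\x98\x80" (U+1F600) becomes a surrogate pair of two code units.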
