Question
Is constructing std::locale with an empty string to get the user-preferred native locale a part of the standard? If yes, could you point out a source which explicitly states that?
Problem description
The example in the documentation of std::locale has this line:
std::wcout << "User-preferred locale setting is " << std::locale("").name().c_str()
This hints that creating a locale with an empty string returns the user-preferred native locale. After some quick googling, I found that this article also mentions:
The empty string tells setlocale to use the locale specified by the
user in the environment.
However, the documentation for the std::locale constructors does not mention any special case for an empty string.
Here's the quote:
3-4) Constructs a copy of the system locale with specified std_name
(such as "C", or "POSIX", or "en_US.UTF-8", or "English_US.1251"), if
such locale is supported by the operating system. The locale
constructed in this manner has a name.
The draft standard says in [locale.cons]:
explicit locale(const char* std_name);
Effects:
Constructs a locale using standard C locale names, e.g., "POSIX". The resulting locale implements semantics defined to be associated with that name.
Throws:
runtime_error if the argument is not valid, or is null.
Remarks:
The set of valid string argument values is "C" , "" , and any implementation-defined values.
This says "" is a valid constructor argument, and arguments are standard C locale names.
Then in [c.locale] it explicitly refers to the standard C header <locale.h>.
Quoting from the C standard (C99), 7.11.1.1/3:
A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the locale-specific native environment. Other implementation-defined strings may be passed as the second argument to setlocale.
I think this means the answer to your question is "yes": A name of "" refers to the native locale.
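A minimal sketch of what this means in practice. The helper name native_locale_name is hypothetical, and the fallback is an assumption on my part: some implementations throw std::runtime_error from std::locale("") when the environment names a locale that is not installed, so the sketch falls back to the classic locale in that case.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the name of the user-preferred (native)
// locale, or the name of the classic "C" locale if the locale named by
// the environment cannot be constructed on this system.
std::string native_locale_name()
{
    try {
        return std::locale("").name();
    } catch (const std::runtime_error&) {
        return std::locale::classic().name();
    }
}
```

The returned string is implementation-defined, so it should be treated as informational only and not parsed.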
Related
C++11 introduced the c16rtomb()/c32rtomb() conversion functions, along with the inverse (mbrtoc16()/mbrtoc32()). c16rtomb() clearly states in the reference documentation here:
The multibyte encoding used by this function is specified by the currently active C locale
The documentation for c32rtomb() states the same. Both the C and C++ versions agree that these are locale-dependent conversions (as they should be, according to the naming convention of the functions themselves).
However, MSVC seems to have taken a different approach and made them locale-independent (not using the current C locale) according to this document. These conversion functions are specified under the heading Locale-independent multibyte routines.
C++20 adds to the confusion by including the c8rtomb()/mbrtoc8() functions, which if locale-independent would basically do nothing, converting UTF-8 input to UTF-8 output.
Two questions arise from this:
Do any other compilers actually follow the standard and implement locale-dependent Unicode multibyte conversion routines? I couldn't find any concrete information after extensive searching.
Is this a bug in MSVC's implementation?
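For reference, a small sketch of how the function is called. The helper name to_multibyte is hypothetical. On an implementation that follows the wording quoted above, the bytes produced for non-ASCII input should depend on the C locale selected with std::setlocale; on an implementation that treats the function as locale-independent, they would not.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: encodes one UTF-32 code unit as a multibyte
// sequence. Returns an empty string if c32rtomb reports that the
// character cannot be encoded.
std::string to_multibyte(char32_t c)
{
    std::mbstate_t state{};          // initial conversion state
    char buffer[MB_LEN_MAX];         // large enough for any one character
    const std::size_t written = std::c32rtomb(buffer, c, &state);
    if (written == static_cast<std::size_t>(-1)) {
        return {};
    }
    return std::string(buffer, written);
}
```

Calling this once under the default "C" locale and once after std::setlocale(LC_ALL, "") with a non-ASCII character is a quick way to see which behaviour a given library implements.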
#include <locale>
#include <iostream>
int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout << L"Həł£ō שøяļđ\n";
    return 0;
}
This works as expected with libstdc++ (both gcc and clang), but only prints the first character (which happens to be ASCII) with libc++. I'm using libcxx-0.0_p20140322 on Gentoo Linux.
Is this a known bug in libc++, or just me not knowing how to cook it?
Update 1. I have tried
std::locale::global(std::locale("en_US.utf8"));
std::locale::global(std::locale(""));
std::setlocale(LC_ALL, "en_US.utf8");
std::setlocale(LC_ALL, "");
which all do the same thing.
Update 2. The wide string literal is here for simplicity. The same thing happens when the string is obtained in any other way (converted from UTF-8, read from binary file, ...)
You have to explicitly imbue the output stream with a locale, like so:
std::wcout.imbue(std::locale());
This makes things work as expected. In fact, it is required by the standard:
27.5.3.3 ios_base functions
locale getloc() const;
4 If no locale has been imbued, a copy of the global C++ locale, locale(), in effect at the time of construction.
So when wcout is constructed, it gets a copy of the initial locale imbued in it. The initial locale is "C". My incorrect assumption was that streams which have no locale explicitly imbued use the current global locale always (and not just at the time of construction). This assumption is totally unreasonable if one thinks about it a little.
June 2021 edit: So in theory just imbue should work, however in practice it doesn't in libstdc++. One needs to set the global locale for this to work, which is probably a libstdc++ bug. Imbuing works with other wide-character streams, but not with std::wcout.
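The construction-time copy can be demonstrated without relying on any installed system locale. This is only a sketch: grouped_digits and construction_vs_imbue are hypothetical names, and std::ostringstream is used in place of std::wcout, so it says nothing about the libstdc++ behaviour mentioned in the edit above.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet used only to make the active locale visible:
// it groups digits in threes, separated by ','.
struct grouped_digits : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Both streams are constructed before the global locale changes.
// Only the second one is imbued with the new global locale afterwards.
std::pair<std::string, std::string> construction_vs_imbue()
{
    std::ostringstream untouched;
    std::ostringstream imbued;

    const std::locale previous = std::locale::global(
        std::locale(std::locale::classic(), new grouped_digits));

    imbued.imbue(std::locale());     // pick up the new global locale
    untouched << 1234567;            // still uses the locale copied at construction
    imbued << 1234567;

    std::locale::global(previous);   // restore the caller's global locale
    return {untouched.str(), imbued.str()};
}
```

Assuming the global locale is still the classic "C" locale when the function is called, the first string comes out without separators and the second with them.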
In my program I have a std::string that contains text encoded using the "execution character set" (which is not guaranteed to be UTF-8 or even US-ASCII), and I want to convert that to a std::string that contains the same text, but encoded using UTF-8. How can I do that?
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object? What function or constructor must I use?
I assume the standard library provides some means for doing this (somewhere, somehow), because the compiler itself must know about UTF-8 (to support UTF-8 string literals) and the execution character set.
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object?
You can construct a std::codecvt object only as a base class subobject (by inheriting from it), because its destructor is protected. That said, std::codecvt<char, char, std::mbstate_t> is not the facet you need, since it represents the identity conversion (i.e. no conversion at all).
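The identity claim can be checked by asking the facet that a locale already holds. A minimal sketch follows; the function name char_codecvt_is_identity is hypothetical.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>

// Hypothetical helper: reports whether the char-to-char codecvt facet
// of the given locale performs no conversion at all.
bool char_codecvt_is_identity(const std::locale& loc = std::locale::classic())
{
    using facet_type = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<facet_type>(loc).always_noconv();
}
```

std::use_facet only returns a reference to a facet owned by the locale, so the protected destructor is not an obstacle here.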
At the moment, the C++ standard library has no functionality for converting between the native (a.k.a. execution) character encoding (a.k.a. character set) and UTF-8. As such, you would have to implement the conversion yourself using the Unicode standard: https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that?
There is no standard library function for that either. On POSIX systems, for example, you can use nl_langinfo(CODESET).
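A sketch of the POSIX approach, which will not compile on platforms without <langinfo.h>. The helper name native_codeset is hypothetical, and temporarily switching LC_CTYPE to the environment's locale is my assumption about what "native" should mean here.

```cpp
#include <cassert>
#include <clocale>
#include <string>

#include <langinfo.h>  // POSIX only

// Hypothetical helper: returns the codeset name of the locale selected
// by the environment, leaving the program's LC_CTYPE setting unchanged.
std::string native_codeset()
{
    // Remember the current setting so it can be restored afterwards.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    // If the environment's locale is unavailable, the old one stays active.
    std::setlocale(LC_CTYPE, "");
    const std::string codeset = nl_langinfo(CODESET);

    std::setlocale(LC_CTYPE, saved.c_str());
    return codeset;
}
```

The returned name (its exact spelling is platform-specific) is what you would hand to an external conversion library as the source encoding.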
This is hacky but it worked for me in MS VS2019
#pragma execution_character_set( "utf-8" )
I'm seeing inconsistent behavior in a call to std::isblank between Visual C++ on Windows and gcc on Ubuntu and I'm wondering which one is correct.
On both compilers -- when the default locale is the "C" locale -- the following call returns false
std::isblank('\n');
This is what I expect. And it squares with what I see on cppreference.com
In the default C locale, only space (0x20) and horizontal tab (0x09)
are classified as blank characters.
However with C++, we also have the version that takes a std::locale argument
std::isblank('\n', std::locale::classic());
Here I am supplying std::locale::classic. Shouldn't that be the equivalent to the previous call? Because when I call this second version on Windows, it returns true. It considers a newline to be a blank character. Linux still says false.
Is my understanding (about std::locale::classic) correct? And if so, is the Visual C++ version wrong?
Yes, MSVS is wrong. [locale.statics] states:
static const locale& classic();
The "C" locale.
Returns: A locale that implements the classic "C" locale semantics, equivalent to the value locale("C").
Remarks: This locale, its facets, and their member functions, do not change with time.
Thus the following:
std::isblank('\n', std::locale::classic());
Is the same as:
std::isblank('\n');
provided the current C locale is "C".
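A small sketch for checking a given toolchain against this reasoning. The helper name blank_classifications_agree is hypothetical; it compares the <cctype> function, which uses the current C locale, with the <locale> overload applied to std::locale::classic().

```cpp
#include <cassert>
#include <cctype>
#include <locale>

// Hypothetical helper: true if the C function and the C++ overload
// with the classic locale classify the character the same way.
bool blank_classifications_agree(char c)
{
    // The <cctype> function needs a value representable as unsigned char.
    const bool c_result = std::isblank(static_cast<unsigned char>(c)) != 0;
    const bool cpp_result = std::isblank(c, std::locale::classic());
    return c_result == cpp_result;
}
```

As long as the program has not changed the C locale, the two results should agree for every character on a conforming implementation; a mismatch for '\n' would reproduce the behaviour described in the question.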
The standard is pretty much silent on what constitutes a valid locale name; only that passing an invalid locale name results in std::runtime_error. What locale names are usable on common windows compilers such as MSVC, MinGW, and ICC?
Ok, there is a difference between C and C++ locales.
Let's start:
MSVC C++ std::locale and C setlocale
Accepts locale names of the form "Language[_Country][.Codepage]", for example "English_United States.1251"; otherwise it throws. Note: the codepage can't be 65001/UTF-8 and should be consistent with the ANSI codepage for this locale (or simply omitted).
MSVC C++ std::locale and C setlocale in Vista and 7 should accept locale names of the form
[Language][-Script][-Country], like "en-US", using ISO 639 language codes and
ISO 3166 regions and script names.
I tested it with Visual Studio on Windows 7 - it does not work.
MinGW C++ std::locale accepts "C" and "POSIX"; it does not support other locales.
In fact, gcc supports locales only via the GNU C library, so basically only under Linux.
setlocale is a native Windows API call, so it should support everything I mentioned above.
MinGW may support a wider range of locales when used with alternative C++ libraries
like Apache stdcxx or STLport.
ICC - I haven't tested it, but it depends on the standard C++ library it uses. For
example, under Linux it uses GCC's libstdc++, so it supports all the locales gcc
supports. I don't know which standard C++ library it uses under Windows.
If you want "compiler and platform" independent locale support (and actually
much better support), take a look at Boost.Locale
Artyom
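Since the accepted names differ so much between compilers and libraries, it can be easier to probe at run time than to rely on a list. A minimal sketch follows; the function name locale_name_supported is hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>

// Hypothetical helper: true if std::locale can be constructed from the
// given name on this implementation, false if construction throws.
bool locale_name_supported(const char* name)
{
    try {
        std::locale probe(name);
        (void)probe;  // only the success of construction matters
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

For example, calling it with "English_United States.1251", "en-US" and "en_US.UTF-8" on each target toolchain shows which of the naming schemes described above that toolchain accepts.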
I believe the information you need is here:
locale "lang[_country_region[.code_page]]"
| ".code_page"
| ""
| NULL
This page provides links to:
Language Strings
Country/Region String
Code Pages
Although my answer covers setlocale rather than std::locale, this MSDN page seems to imply that the format is indeed the same:
An object of class locale also stores
a locale name as an object of class
string. Using an invalid locale name
to construct a locale facet or a
locale object throws an object of
class runtime_error. The stored
locale name is "*" if the locale
object cannot be certain that a
C-style locale corresponds exactly to
that represented by the object.
Otherwise, you can establish a matching locale within the Standard C
Library, for the locale object loc, by
calling setlocale(LC_ALL,
loc.name.c_str).
Also see this page and this thread which tend to show that std::locale internally uses setlocale.
Here's one locale name that's usable pretty much anywhere: "". That is, the empty string. This is in contrast to the "C" locale that you are probably getting by default. The empty string as an argument to std::setlocale() means something like "Use the preferred locale set by the user or environment." If you use this, the downside is that your program won't have the same output everywhere; the upside is that your users might think it works just the way they want.