How to detect and handle unsupported locales in algorithms? - c++

I have a function with the following signature:
template <typename Container>
void write_cards_as_xml(const Container& cards, std::ostream& os);
Internally it calls:
boost::property_tree::ptree root;
...
boost::property_tree::write_xml(os, root);
The write_xml function does not know anything about encodings. By default it assumes UTF-8 but does not perform any conversions; the actual encoding is up to the locale of os. I'm not sure how to handle unsupported non-UTF-8 locales. Can I detect whether the locale is not UTF-8? Should I throw if it isn't? Should I temporarily replace the locale with my preferred encoding? I'm using Boost.Locale.

The Standard library has no platform independent way to detect if a locale is UTF-8. There's only a name method which returns a platform dependent name. Even if it is a POSIX name there's no guarantee that the encoding is part of the locale's name.
Boost.Locale offers an additional facet called boost::locale::info holding detailed information about the current locale.
https://www.boost.org/doc/libs/1_70_0/libs/locale/doc/html/locale_information.html
You can obtain the info like this:
std::use_facet<boost::locale::info>(some_locale).utf8()
If there is no info facet, std::use_facet throws std::bad_cast. In that case it's not a Boost locale and you're out of luck, so throwing is reasonable behavior. You could catch the bad_cast and throw a more informative exception instead. If there is an info facet, inspect the return value of utf8(). If it returns false, the current locale is not compatible and you should throw, too. Otherwise your algorithm can run without problems.
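The check described above could be sketched roughly as follows. To keep the sketch buildable without Boost, it uses a hypothetical stand-in facet named encoding_info; the assumption is that boost::locale::info can be substituted as the template argument, since it also exposes utf8(). It uses std::has_facet to test for the facet up front instead of catching std::bad_cast, which is a swapped-in but equivalent approach.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>

// Hypothetical stand-in for boost::locale::info so the sketch builds without Boost.
class encoding_info : public std::locale::facet {
public:
    static std::locale::id id;
    explicit encoding_info(bool is_utf8, std::size_t refs = 0)
        : std::locale::facet(refs), utf8_(is_utf8) {}
    bool utf8() const { return utf8_; }
private:
    bool utf8_;
};
std::locale::id encoding_info::id;

// Throws unless the locale carries an InfoFacet that reports UTF-8.
// With Boost available, the intended call would be
// require_utf8<boost::locale::info>(os.getloc()); (assumption, not tested here).
template <typename InfoFacet = encoding_info>
void require_utf8(const std::locale& loc) {
    if (!std::has_facet<InfoFacet>(loc))
        throw std::runtime_error("locale carries no encoding information facet");
    if (!std::use_facet<InfoFacet>(loc).utf8())
        throw std::runtime_error("locale encoding is not UTF-8");
}
```

Calling such a helper at the top of write_cards_as_xml would turn both failure cases into one descriptive exception before any XML is written.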

Related

C++17 to_string() converts floats with comma

This is happening inside a big project so I cannot really post a minimal reproducible example, but I'll try asking anyway. I'm building a set of benchmark applications integrated with a framework we're working on, and in one of them the conversion we need (float -> string) with to_string produces a result with a comma as the decimal separator.
| Monitored values:
| [ my_time_monitor.average = 61720,000000 ]
This is the function responsible:
std::string operating_point_parser::operator()(const int32_t num_threads, const float exec_time_ms) const {
return "{\"compute\":[{\"knobs\":{\"num_threads\":" + std::to_string(num_threads) + "},\"metrics\":{\"exec_time_ms\":[" + std::to_string(exec_time_ms) + ",0]}}]}";
}
Since as I said the same exact function is being called by other applications which don't show this unexpected behavior, my guess is that some internal compilation flags are messing around.
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17 -DENABLE_THREADS -DHAVE_RANDOM -DHAVE_UNISTD_H -DHAVE_SYS_FILE_H -DHAVE_SYS_MMAN_H -DHAVE_CONFIG_H -DVIPSDATASET_PATH=\"\\\"${CMAKE_CURRENT_SOURCE_DIR}/dataset/orion_18000x18000.v\\\"\"" )
set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DHAVE_RANDOM -DHAVE_UNISTD_H -DIM_PREFIX=\"\\\"${CMAKE_CURRENT_SOURCE_DIR}/dataset\\\"\" -DIM_EXEEXT=\"\\\"\\\"\" -DIM_LIBDIR=\"\\\"${CMAKE_INSTALL_PREFIX}/lib\\\"\" -DGETTEXT_PACKAGE=\"\\\"vips7\"\\\" -DHAVE_SYS_FILE_H -DHAVE_SYS_MMAN_H -DHAVE_CONFIG_H")
If you want to take a look at the full application code here's the link. The operating_point_parser::operator() is called inside margot::compute::push_custom_monitor_values().
As stated in the Notes of https://en.cppreference.com/w/cpp/string/basic_string/to_string, to_string relies on the current locale for formatting purposes:
std::to_string relies on the current locale for formatting purposes,
and therefore concurrent calls to std::to_string from multiple threads
may result in partial serialization of calls. C++17 provides
std::to_chars as a higher-performance locale-independent alternative.
So, if you want to have dots instead of commas, you have to adjust the current locale.
Or, instead of changing the global locale with std::locale::global(...), you could use a stringstream and imbue() the locale on that stream only, e.g.:
std::ostringstream ss;
ss.imbue(std::locale::classic()); // or whichever locale you want
ss << value; // write what you need (value is a placeholder)
std::string result = ss.str(); // get the formatted string
std::to_string uses the currently active locale for formatting.
You can set the active locale using a C locale name using:
const char* locale = "C";
std::locale::global(std::locale(locale));
Meaning of locale name is specified in the C standard (quote from C11 draft):
7.11.1.1 The setlocale function
A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the locale-specific native environment. Other implementation-defined strings may be passed as the second argument to setlocale.
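To check which C locale is actually active when the comma appears, a small diagnostic along these lines may help. This is only a sketch; the helper names are hypothetical. One plausible cause, which the question does not confirm, is that some linked library calls setlocale(LC_ALL, "") in this application but not in the others.

```cpp
#include <clocale>
#include <string>

// Decimal separator that the printf family and std::to_string currently use.
std::string c_decimal_point() {
    return std::localeconv()->decimal_point;
}

// Name of the active C numeric locale; passing nullptr only queries it.
std::string c_numeric_locale_name() {
    const char* name = std::setlocale(LC_NUMERIC, nullptr);
    return name ? name : "";
}
```

Logging both values right before the std::to_string call shows whether the process has left the "C" locale.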
That value is going to be formatted inside a JSON string
In this case, and more generally whenever you wish to format in a style that shouldn't depend on the global locale, you should avoid std::to_string.
What would you recommend since you advised on avoiding it?
Anything that doesn't use locale, or lets you specify the locale to use instead of using the global locale. For example:
std::format("{}", 0.42); // never uses a locale (requires C++20)
std::format(std::locale("C"), "{:L}", 0.42); // the L option applies the locale passed in, not the global one
Another example is a stringstream with imbued locale as suggested in the other answer.
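Applied to the function from the question, the stringstream approach might look like the sketch below. The names to_string_c_locale and make_operating_point are hypothetical, and the fixed six-decimal format is chosen only to imitate what std::to_string prints for floats.

```cpp
#include <cstdint>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats like std::to_string(float) (six decimals) but always in the "C" locale.
std::string to_string_c_locale(float value) {
    std::ostringstream ss;
    ss.imbue(std::locale::classic());
    ss << std::fixed << std::setprecision(6) << value;
    return ss.str();
}

// Free-function version of the JSON builder from the question.
std::string make_operating_point(std::int32_t num_threads, float exec_time_ms) {
    return "{\"compute\":[{\"knobs\":{\"num_threads\":" + std::to_string(num_threads) +
           "},\"metrics\":{\"exec_time_ms\":[" + to_string_c_locale(exec_time_ms) +
           ",0]}}]}";
}
```

Because the locale is imbued on the local stream only, nothing else in the process is affected and no global state has to be restored.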

Change narrow string encoding or missing std::filesystem::path::imbue

I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?
My solution to this problem is to fully alias std::filesystem in a different namespace named std::u8filesystem, with classes and methods that treat std::string as UTF-8 encoded. The classes inherit from their counterparts of the same name in std::filesystem, without adding any fields or virtual methods, to offer full API/ABI interoperability. Full proof of concept code here, tested only on Windows so far and far from complete. The following snippet shows the core workings of the helper:
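std::wstring U8ToW(const std::string &string); // UTF-8 to UTF-16 helper, defined elsewhere
namespace std // note: adding declarations to namespace std is formally undefined behavior
{
namespace u8filesystem
{
#ifdef WIN32
class path : public filesystem::path
{
public:
    // Treat the narrow string as UTF-8 by converting it to a wide string first.
    path(const std::string &string)
        : filesystem::path(U8ToW(string))
    {
    }
    // Return UTF-8 instead of the native narrow (ANSI) encoding.
    inline std::string string() const
    {
        return filesystem::path::u8string();
    }
};
#else
using namespace filesystem;
#endif
}
}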
For the sake of performance, path does not have a global way to define locale conversions. Since C++ pre-20 does not have a specific type for UTF-8 strings, the system assumes any char strings are narrow character strings. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.
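A small helper can hide the difference between the two language versions, so that u8path does not have to be repeated at every call site. This is a sketch under the assumption that the standard library defines the __cpp_lib_char8_t feature-test macro when char8_t support is available; the name path_from_utf8 is hypothetical.

```cpp
#include <filesystem>
#include <string>

// Builds a path from a std::string that is known to hold UTF-8.
inline std::filesystem::path path_from_utf8(const std::string& utf8) {
#if defined(__cpp_lib_char8_t)
    // C++20: char8_t sources are always interpreted as UTF-8.
    return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path states the source encoding explicitly.
    return std::filesystem::u8path(utf8);
#endif
}
```

On POSIX systems both branches leave the bytes untouched; on Windows they convert from UTF-8 to the native wide encoding instead of going through the ANSI code page.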

Locale invariant guarantee of boost::lexical_cast<>

I'm using boost::lexical_cast<std::string>(double) for converting doubles to string, generating JSON serialized byte stream, that is (on remote side) parsed by .NET.
I'm able to force the .NET to use InvariantCulture for parsing, thereby returning predictable result on every possible language.
However, I was not able to find this guarantee in the boost::lexical_cast documentation. I tried it a little, and it works the same way with different locales set. But I cannot be sure from only a few tests. Am I missing something in the documentation, or can this not be guaranteed at all, so that I have to use something else?
EDIT:
I've found an issue.
std::locale::global(std::locale("Czech"));
std::cout << boost::lexical_cast<std::string>(0.15784465) << std::endl;
returns 0,15784465, and that is undesired. Can I force the boost::lexical_cast<> not to be aware of locales?
Can I force the boost::lexical_cast<> not to be aware of locales?
No, I don't think that is possible. The best you can do is call
std::locale::global(std::locale::classic());
to set the global locale to the "C" locale as boost::lexical_cast relies on the global locale. However, the problem is if somewhere else in the code the global locale is set to something else before calling boost::lexical_cast, then you still have the same problem.
Therefore, a robust solution would be to imbue a stringstream like so, and you can always be sure that this works:
std::ostringstream oss;
oss.imbue(std::locale::classic());
oss.precision(std::numeric_limits<double>::digits10);
oss << 0.15784465;
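Wrapped into a pair of helpers, the imbued-stream approach could look like the following sketch. The function names are hypothetical. It deliberately uses max_digits10 rather than digits10, which is a change from the snippet above, so that the text converts back to exactly the same double.

```cpp
#include <limits>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// double -> string in the "C" locale, with enough digits to round-trip.
std::string invariant_to_string(double value) {
    std::ostringstream oss;
    oss.imbue(std::locale::classic());
    oss.precision(std::numeric_limits<double>::max_digits10);
    oss << value;
    return oss.str();
}

// string -> double in the "C" locale; throws if no number can be read.
double invariant_to_double(const std::string& text) {
    std::istringstream iss(text);
    iss.imbue(std::locale::classic());
    double value = 0.0;
    if (!(iss >> value))
        throw std::invalid_argument("not a number: " + text);
    return value;
}
```

Both functions ignore the global locale entirely, so a later call to std::locale::global elsewhere in the program cannot change their output.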
A better solution to this problem is to use a boost::locale instead of a std::locale as the global locale. From the documentation:
Setting the global locale has bad side effects... it affects even printf and libraries like boost::lexical_cast giving incorrect or unexpected formatting. In fact many third-party libraries are broken in such a situation.
Unlike the standard localization library, Boost.Locale never changes the basic number formatting, even when it uses std based localization backends, so by default, numbers are always formatted using C-style locale. Localized number formatting requires specific flags.
Boost locale requires you to specify explicitly when you want numeric formatting to be locale aware, which is more consistent with recent library decisions like std::money_put.

What is the purpose of imbue in C++?

I'm working with some code today, and I saw:
extern std::locale g_classicLocale;
class StringStream : public virtual std::ostringstream
{
public:
StringStream() { imbue(g_classicLocale); }
virtual ~StringStream() {};
};
That is where I came across imbue. What is the purpose of the imbue function in C++? What does it do? Are there any potential problems in using imbue (thread safety, memory allocation)?
imbue is inherited by std::ostringstream from std::ios_base and it sets the locale of the stream to the specified locale.
This affects the way the stream prints (and reads) certain things; for instance, setting a French locale will cause the decimal point . to be replaced by ,.
C++ streams perform their conversions to and from (numeric) types according to a locale, which is an object that summarizes all the localization information needed (decimal separator, date format, ...).
By default a stream uses the global locale that is current when the stream is constructed, but you can give a stream its own locale with the imbue function, which is what your code does here. I suppose it's setting the classic "C" locale to produce text that does not depend on the current locale (this is useful e.g. for serialization purposes).
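Since imbue returns the locale the stream had before the call, the change can also be made temporary. The guard below is a hypothetical sketch of that idea (scoped_imbue is not a standard or Boost class): it imbues a stream on construction and puts the previous locale back on destruction.

```cpp
#include <ios>
#include <locale>
#include <sstream>

// Hypothetical RAII helper: switches a stream's locale for the guard's lifetime.
class scoped_imbue {
public:
    scoped_imbue(std::ios& stream, const std::locale& loc)
        : stream_(stream), previous_(stream.imbue(loc)) {}  // imbue returns the old locale
    ~scoped_imbue() { stream_.imbue(previous_); }            // restore on scope exit
    scoped_imbue(const scoped_imbue&) = delete;
    scoped_imbue& operator=(const scoped_imbue&) = delete;
private:
    std::ios& stream_;
    std::locale previous_;
};
```

A function that receives a std::ostream& from its caller can use such a guard to write locale-independent output without permanently changing the caller's stream.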

Boost.Locale - Unicode string in C++

Can I make all std::string in my application support Unicode with Boost.Locale? After reading the documentation I can say yes, but I don't understand how it works. The main question is: can I still use the Boost string algorithms library or Boost.Lexical_Cast? If yes, why do I need boost::locale::to_upper and similar methods if I already have them in the Boost string algorithms library?
Yes, you can make all strings in your application Unicode encoded with Boost.Locale.
To make it work you imbue the locale into the streams you use, or set the default global locale to your new Unicode-based locale (generated by Boost.Locale).
See here for how to do that: http://www.boost.org/libs/locale/doc/html/locale_gen.html
and http://www.boost.org/libs/locale/doc/html/faq.html
The string manipulation APIs in Boost.Locale are different to the ones provided in the Boost string algorithm library.
See here for why the Boost.Locale functions are better: http://www.boost.org/libs/locale/doc/html/conversions.html
You can still use boost::lexical_cast, provided you set the global locale correctly (as, if I recall correctly, you can't explicitly pass a locale object to Boost.LexicalCast).
Keep in mind however that this will 'break' some cases, for example, when converting an integer to a string, instead of using the C locale (as was probably your previous default), it will use a different one, which may insert separators etc. When doing conversions that are NOT displayed to the user, you may wish to use std::stringstream et al directly to avoid these unwanted formatting changes.
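The separator effect mentioned above can be reproduced without Boost. The sketch below uses a hypothetical numpunct facet (grouping_punct) that imitates a locale grouping digits in threes; it is only an illustration of how a non-"C" locale changes integer output, not of what Boost.Locale itself installs.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet imitating a locale that groups digits in threes.
class grouping_punct : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats an integer using exactly the locale passed in.
std::string format_int(long value, const std::locale& loc) {
    std::ostringstream oss;
    oss.imbue(loc);
    oss << value;
    return oss.str();
}
```

With std::locale(std::locale::classic(), new grouping_punct) the value 1234567 comes out as 1,234,567, while std::locale::classic() yields 1234567, which is the form machine-readable output usually needs.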
I highly suggest you read the Boost.Locale documentation in full, as it should address most of your concerns (especially the FAQ, generation backend information, etc.).