C++ : Which locale is considered by sprintf? - c++

I am using the two functions sprintf and snprintf to convert double values to strings.
In one case, the running application has a locale different from the Windows locale. In that scenario, the locale used by sprintf is always the application's, whereas snprintf sometimes starts using the Windows locale. As a consequence, the decimal characters returned by the two functions differ, which causes a problem.
To provide further details,
I have a library in my project that builds a string from a double; this library uses snprintf for the conversion. I then need to send this string to a server that understands only "." (dot) as the decimal symbol, so I must replace the local decimal character with a dot. To find the local decimal character (in order to replace it), I use another library in my project, which uses sprintf. I then replace that character with a dot to get the final output.
Also, please note: sprintf always uses the locale of the native application, while snprintf sometimes uses the Windows locale.
As the problem is inconsistent, sorry for not providing a clear example.
So, what are the circumstances under which snprintf might behave differently?
Why am I getting such different behavior from these two methods?
How can I avoid it?
P.S. - I have to use these 2 methods, so please suggest a solution which would not require me to use any different methods.
Thanks.

The locale used by both sprintf and snprintf is not the Windows locale but your application's locale. As this locale is global to your application, any line of code in your program can change it.
In your case, the (not thread-safe) solution may be to temporarily replace the locale for the snprintf call:
auto old = std::locale::global(std::locale::classic());
snprintf(...);
std::locale::global(old);
BTW, the "Windows locale" can be accessed via just std::locale("") , you don't need to know its exact name.

Related

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
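A quick sanity check of the equivalence between the two spellings (assuming a UTF-8 execution character set, the default for GCC and Clang; a \u escape also works in a plain narrow literal in any C++ version):

```cpp
#include <string>

// U+2282 (SUBSET OF) encoded by hand versus via a universal character
// name; under a UTF-8 execution charset both yield bytes E2 8A 82.
const std::string manual_bytes = "\xE2\x8A\x82";
const std::string escaped      = "\u2282";

bool same_bytes() {
    return manual_bytes == escaped && manual_bytes.size() == 3;
}
```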
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

Why doesn't fstream support an em-dash in the file name?

I ported some code from C to C++ and have just found a problem with paths that contain em-dash, e.g. "C:\temp\test—1.dgn". A call to fstream::open() will fail, even though the path displays correctly in the Visual Studio 2005 debugger.
The weird thing is that the old code that used the C library fopen() function works fine. I thought I'd try my luck with the wfstream class instead, and then found that converting my C string using mbstowcs() loses the em-dash altogether, meaning it also fails.
I suppose this is a locale issue, but why isn't em-dash supported in the default locale? And why can't fstream handle an em-dash? I would have thought any byte character supported by the Windows filesystem would be supported by the file stream classes.
Given these limitations, what is the correct way to handle opening a file stream that may contain valid Windows file names that doesn't just fall over on certain characters?
The em-dash character is coded as U+2014 in UTF-16 (bytes 0x14 0x20 in little endian), as 0xE2 0x80 0x94 in UTF-8, and with other codes, or no code at all, depending on the charset and code page used. The Windows-1252 code page (very common in western European languages) has a dash character at 0x97 that we can consider equivalent.
Windows internally manages paths as UTF-16, so every time a function is called through its ill-named "ANSI" interface (functions ending in A), the path is converted to UTF-16 using the code page currently configured for the user.
On the other hand, the C and C++ runtime libraries may be implemented on top of either the "ANSI" or the "Unicode" (functions ending in W) interface. In the first case, the code page used to represent the string must match the system's code page. In the second case, either we use UTF-16 strings directly from the beginning, or the functions used to convert to UTF-16 must be configured to use the source string's code page for the mapping.
Yes, it is a complex problem. And there are several wrong (or problematic) proposals to solve it:
Use wfstream instead of fstream: wfstream does nothing different from fstream with paths. Nothing. It just means "manage the stream of bytes as wchar_t". (And it does that differently than one might expect, making the class useless in most cases, but that is another story.) To use the Unicode interface in the Visual Studio implementation, there are overloaded constructors and open() functions that accept const wchar_t*. These are overloaded for both fstream and wfstream. Use fstream with the right open().
mbstowcs(): The problem here is which locale (which contains the code page used by the string) to use. If the default locale happens to match the system one, fine. If not, you can try mbstowcs_l(). But these are unsafe C functions, so you have to be careful with the buffer size. Anyway, this approach only makes sense if the path to convert is obtained at runtime. If it is a static string known at compile time, it is better to use it directly in your code.
L"C:\\temp\\test—1.dgn": The L prefix in the string doesn't means "converts this string to utf-16" (source code use to be in 8-bit characters), at least no in Visual Studio implementation. L prefix means "add a 0x00 byte after each character between the quotes". So —, equivalent to byte 0x97 in a narrow (ordinary) string, it become 0x97 0x00 when in a wide (prefixed with L) string, but not 0x14 0x20. Instead it is better to use its universal character name: L"C:\\temp\\test\\u20141.dgn"
One popular approach is to always use either UTF-8 or UTF-16 in your code and perform conversions only when strictly necessary. When converting a string in a specific code page to UTF-8 or UTF-16, first identify the correct code page. Which conversion function to use depends on where the string comes from: if it comes from an XML file, the code page is usually stated there (and is usually UTF-8); if it comes from a Windows control, use a Windows API function such as MultiByteToWideChar (CP_ACP or GetACP() usually works as the default code page).
Always use fstream (not wfstream) and its wide interfaces (open() and the constructor), not the narrow ones. (You can again use MultiByteToWideChar to convert from UTF-8 to UTF-16.)
There are several articles and posts with advice on this approach. One that I recommend: http://www.nubaria.com/en/blog/?p=289.
This should work, provided that everything you do uses wide-char notation and wide-char functions. That is, use wfstream, but instead of using mbstowcs, use wide string literals prefixed with L (note that the backslashes must be escaped, or \t would be read as a tab):
const wchar_t* filename = L"C:\\temp\\test—1.dgn";
Also, make sure your source file is saved as UTF-8 in Visual Studio. Otherwise the em-dash could run into encoding issues.
Posting this solution for others who run into this. The problem is that Windows assigns the "C" locale on startup by default, and em-dash (0x97) is defined in the "Windows-1252" codepage but is unmapped in the normal ASCII table used by the "C" locale. So the simple solution is to call:
setlocale ( LC_ALL, "" );
Prior to fstream::open. This sets the current codepage to the OS-defined codepage. In my program, the file I wanted to open with fstream was defined by the user, so it was in the system-defined codepage (Windows-1252).
So while fiddling with unicode and wide chars may be a solution to avoid unmapped characters, it wasn't the root of the problem. The actual problem was that the input string's codepage ("Windows-1252") didn't match the active codepage ("C") used by default in Windows programs.

How to parse numbers like "3.14" with scanf when locale expects "3,14"

Let's say I have to read a file, containing a bunch of floating-point numbers. The numbers can be like 1e+10, 5, -0.15 etc., i.e., any generic floating-point number, using decimal points (this is fixed!). However, my code is a plugin for another application, and I have no control over what's the current locale. It may be Russian, for example, and the LC_NUMERIC rules there call for a decimal comma to be used. Thus, Pi is expected to be spelled as "3,1415...", and
sscanf("3.14", "%f", &x);
returns "1", and x contains "3.0", since it refuses to parse past the '.' in the string.
I need to ignore the locale for such number-parsing tasks.
How does one do that?
I could write a parseFloat function, but this seems like a waste.
I could also save the current locale, reset it temporarily to "C", read the file, and restore to the saved one. What are the performance implications of this? Could setlocale() be very slow on some OS/libc combo, what does it really do under the hood?
Yet another way would be to use iostreams, but again their performance isn't stellar.
My personal preference is to never use LC_NUMERIC, i.e. just call setlocale with other categories, or, after calling setlocale with LC_ALL, use setlocale(LC_NUMERIC, "C");. Otherwise, you're completely out of luck if you want to use the standard library for printing or parsing numbers in a standard form for interchange.
If you're lucky enough to be on a POSIX 2008 conforming system, you can use the uselocale and *_l family of functions to make the situation somewhat better. There are at least 2 basic approaches:
Leave the default locale unset (at least the troublesome parts like LC_NUMERIC; LC_CTYPE should probably always be set), and pass a locale_t object for the user's locale to the appropriate *_l functions only when you want to present things to the user in a way that meets their own cultural expectations; otherwise use the default C locale.
Have your code that needs to work with data for interchange keep around a locale_t object for the C locale, and either switch back and forth using uselocale when you need to work with data in a standard form for interchange, or use the appropriate *_l functions (but there is no scanf_l).
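A sketch of the second approach (assuming a POSIX 2008 libc such as glibc; the function name is mine): keep a locale_t for the C locale and switch with uselocale only around the interchange parse — unlike setlocale, this affects only the calling thread:

```cpp
#include <locale.h>  // newlocale, uselocale, freelocale (POSIX 2008)
#include <stdlib.h>  // strtod

// Parse a number written in the standard interchange form ("3.14"),
// regardless of the thread's current locale.
double parse_interchange_double(const char* text) {
    locale_t c_loc = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    locale_t old = uselocale(c_loc);  // thread-local locale switch
    double value = strtod(text, NULL);
    uselocale(old);                   // restore the previous locale
    freelocale(c_loc);
    return value;
}
```

A production version would create c_loc once and cache it rather than building and freeing it on every call.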
Note that implementing your own floating point parser is not easy and is probably not the right solution to the problem unless you're an expert in numerical computing. Getting it right is very hard.
POSIX.1-2008 specifies isalnum_l(), isalpha_l(), isblank_l(), iscntrl_l(), isdigit_l(), isgraph_l(), islower_l(), isprint_l(), ispunct_l(), isspace_l(), isupper_l(), and isxdigit_l().
Here's what I've done with this stuff in the past.
The goal is to use locale-dependent numeric converters with a C-locale numeric representation. The ideal, of course, would be to use non-locale-dependent converters, or not change the locale, etc., etc., but sometimes you just have to live with what you've got. <rant>Locale support is seriously broken in several ways and this is one of them.</rant>
First, extract the number as a string using something like the C grammar's simple pattern for numeric preprocessing tokens. For use with scanf, I do an even simpler one:
" %1[-+0-9.]%[-+0-9A-Za-z.]"
This could be simplified even more, depending on what else you might expect in the input stream. The only thing you need to ensure is that you do not read beyond the end of the number; as long as you don't allow numbers to be followed immediately by letters, without intervening whitespace, the above will work fine.
Now, get the struct lconv (man 7 locale) representing the current locale using localeconv(3). The first entry in that struct is const char* decimal_point; replace all of the '.' characters in your string with that value. (You might also need to replace '+' and '-' characters, although most locales don't change them, and the sign fields in the lconv struct are documented as only applying to currency conversions.) Finally, feed the resulting string through strtod and see if it passes.
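A minimal sketch of that substitution step (the function name is mine), assuming a single-character decimal separator, which covers real-world locales:

```cpp
#include <clocale>  // localeconv, lconv
#include <cstdlib>  // strtod
#include <string>

// Rewrite a C-locale number string ("3.14") into the current locale's
// spelling, then parse it with the locale-dependent strtod.
double parse_c_number(const std::string& text) {
    std::string local = text;
    const char* dp = std::localeconv()->decimal_point;  // e.g. "." or ","
    for (char& ch : local)
        if (ch == '.') ch = dp[0];
    return std::strtod(local.c_str(), nullptr);
}
```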
This is not a perfect algorithm, particularly since it's not always easy to know how locale-compliant a given library actually is, so you might want to do some autoconf stuff to configure it for the library you're actually compiling with.
I am not sure how to solve it in C.
But C++ streams (can) have a unique locale object.
std::stringstream dataStream;
dataStream.imbue(std::locale("C"));
// Note: You must imbue the stream before you do anything with it.
// If any operations have been performed then an imbue() can
// be silently ignored by the stream (which is a pain to debug).
dataStream << "3.14";
float x;
dataStream >> x;

mfc program using wrong decimal separator/language

I have the comma as the decimal separator in my Windows regional settings (Portuguese language), and all the programs I develop use the comma when formatting strings or using atof.
However, this particular program that came into my hands insists on using the dot as the decimal separator, regardless of my regional settings.
I'm not calling setlocale anywhere in the program or any other language changing function AFAIK. In fact I put these lines of code at the very beginning of the InitInstance() function:
double var = atof("4,87");
TRACE("%f", fDecimal);
This yields 4.000000 in this program and 4,870000 in every other one.
I figure there must be some misplaced setting in the project's properties, but I don't know what it is. Anyone can help?
I'm not calling setlocale anywhere in the program or any other language changing function AFAIK.
That'd be why. C and C++ default to the "C" locale. Try setting the locale to "": setlocale(LC_ALL,"");
atof relies on the C locale to determine the expected decimal separator. Thus, as another member mentioned, setlocale(LC_NUMERIC, ""); will set the C locale to the user locale (regional settings) for number-related functions. See the MSDN page for more information on the available flags and the locale names.
For those who don't want to change their C locale, you can use _atof_l instead of the standard atof and provide it a locale structure created with _create_locale (what a name):
double _atof_l(const char *str, _locale_t locale);
There are a multitude of alternatives. For example, you could use strtod (and its Windows counterpart _strtod_l), which is IMHO a better option because it will let you know if something went wrong.
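For reference, the error reporting that strtod adds over atof looks like this (a portable sketch; the locale caveats discussed above still apply):

```cpp
#include <cstdlib>  // strtod

// Returns true and stores the value only if the entire string parsed.
// atof offers no way to distinguish "garbage" from a genuine 0.0.
bool parse_double_checked(const char* text, double* out) {
    char* end = nullptr;
    double value = std::strtod(text, &end);
    if (end == text || *end != '\0')  // nothing parsed, or trailing junk
        return false;
    *out = value;
    return true;
}
```

In the "C" locale, "4,87" stops parsing at the comma and is reported as a failure instead of silently yielding 4.0.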

sprintf, commas and dots in C(++) (and localization?)

I am working in a project using openframeworks and I've been having some problems lately when writing XMLs. I've traced down the problem to a sprintf:
It seems that under certain conditions an sprintf call may write commas instead of dots on float numbers (e.g. "2,56" instead of "2.56"). In my locale the floating numbers are represented with a ',' to separate the decimals from the units.
I am unable to reproduce this behaviour in a simple example, but I've solved the problem by stringifying the value using a stringstream.
I am curious about the circumstances in which sprintf uses a different localization. When does sprintf use ',' instead of '.', and how can I control it?
The decimal separator is controlled by the LC_NUMERIC locale category. See setlocale for details. Setting it to the "C" locale will give you a period. You can find out the characters and settings for the current locale by looking at the (read-only) struct returned by localeconv.
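You can inspect that setting directly; a tiny sketch (the helper name is mine):

```cpp
#include <clocale>  // setlocale, localeconv

// The decimal separator for the current LC_NUMERIC setting:
// "." in the "C" locale, "," in e.g. a Portuguese or Russian locale.
const char* current_decimal_point() {
    return std::localeconv()->decimal_point;
}
```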