Performance issues with u_snprintf_u from libicu - c++

I'm porting an application from wchar_t C strings to the char16_t type offered by C++11.
However, I have an issue. The only library I found that can handle snprintf for char16_t types is ICU, with its UChar type.
The performance of u_snprintf_u (equivalent to swprintf/snprintf, but taking UChar arguments) is abysmal.
Some testing shows u_snprintf_u to be about 25x slower than snprintf.
Example of what I get in valgrind:
As you can see, the underlying code is doing too much work and instantiating internal objects that I don't want.
Edit: The data I'm working with doesn't need to be interpreted by the underlying ICU code; it's ASCII-oriented. I haven't found a way to tell ICU not to apply locale handling and the like in these function calls.
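Since the data is plain ASCII, one possible workaround (my sketch, not something from the question or from ICU's documentation; the helper name ascii_snprintf_u16 is made up) is to format with the ordinary narrow vsnprintf and then widen the result to char16_t, bypassing ICU's locale machinery entirely:

#include <cstdarg>
#include <cstddef>
#include <cstdio>

// Sketch: format with the narrow vsnprintf, then widen the ASCII result to
// char16_t. Only valid when the formatted output is pure ASCII.
int ascii_snprintf_u16(char16_t* dst, std::size_t dst_size, const char* fmt, ...)
{
    char buf[256];                                   // assumes the result fits here
    va_list args;
    va_start(args, fmt);
    int len = std::vsnprintf(buf, sizeof buf, fmt, args);
    va_end(args);
    if (len < 0 || dst_size == 0)
        return len;
    std::size_t n = static_cast<std::size_t>(len);
    if (n >= sizeof buf)
        n = sizeof buf - 1;                          // vsnprintf truncated into the local buffer
    if (n >= dst_size)
        n = dst_size - 1;                            // truncate like snprintf would
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<char16_t>(static_cast<unsigned char>(buf[i]));
    dst[n] = u'\0';
    return len;
}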

Related

C++17 Purpose of std::from_chars and std::to_chars?

Before C++17, there existed a variety of methods to convert integers, floats, and doubles to and from strings. For example, std::stringstream, std::to_string, std::atoi, std::stoi, and others could be used to accomplish these tasks, and there are plenty of posts discussing the differences between those methods.
However, C++17 has now introduced std::from_chars and std::to_chars, and I'd like to know the reasons for introducing another means of converting to and from strings.
For one, what advantages and functionality do these new functions provide over the previous methods?
Not only that, but are there any notable disadvantages for this new method of string conversion?
std::stringstream is the heavyweight champion. It takes into consideration things like the stream's imbued locale, and its functionality involves things like constructing a sentry object for the duration of the formatted operation, in order to deal with exception-related issues. Formatted input and output operations in the C++ libraries have some reputation for being heavyweight and slow.
std::to_string is less intensive than std::ostringstream, but it still returns a std::string, whose construction likely involves dynamic allocation (less likely with modern short string optimization techniques, but still likely). And, in most cases the compiler still needs to generate all the verbiage, at the call site, to support a std::string object, including its destructor.
std::to_chars is designed to have as little footprint as possible. You provide the buffer, and std::to_chars does very little beyond actually formatting the numeric value into the buffer, in a specific format, without any locale-specific considerations; the only overhead on your side is making sure that the buffer is big enough. Code that uses std::to_chars does not need to do any dynamic allocation.
std::to_chars is also a bit more flexible in terms of formatting options, especially with floating point values. std::to_string has no formatting options.
std::from_chars is, similarly, a lightweight parser, that does not need to do any dynamic allocation, and does not need to sacrifice any electrons to deal with locale issues, or overhead of stream operations.
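For illustration, here is a minimal usage sketch of the two functions (mine, not part of the answer above), showing the caller-provided-buffer interface:

#include <charconv>
#include <cstdio>
#include <system_error>

int main()
{
    char buf[32];

    // Format an int directly into a caller-provided buffer; no allocation, no locale.
    auto [ptr, ec] = std::to_chars(buf, buf + sizeof buf, 42);
    if (ec == std::errc())
        std::printf("formatted: %.*s\n", static_cast<int>(ptr - buf), buf);

    // Parse it back; from_chars reports where parsing stopped and whether it failed.
    int value = 0;
    auto [end, ec2] = std::from_chars(buf, ptr, value);
    if (ec2 == std::errc() && end == ptr)
        std::printf("parsed: %d\n", value);
}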
to/from_chars are designed to be elementary string conversion functions. They have two basic advantages over the alternatives.
They are much lighter weight. They never allocate memory (you allocate memory for them). They never throw exceptions. They also never look at the locale, which also improves performance.
Basically, they are designed such that it is impossible to have faster conversion functions at an API level.
These functions could even be constexpr (they aren't, though I'm not sure why), while the more heavyweight allocating and/or throwing versions can't.
They have explicit round-trip guarantees. If you convert a float/double to a string (without a specified precision), the implementation is required to make it so that taking that exact sequence of characters and converting it back into a float/double will produce a binary-identical value. You won't get that guarantee from snprintf, stringstream or to_string/stof.
This guarantee is only good however if the to_chars and from_chars calls are using the same implementation. So you can't expect to send the string across the Internet to some other computer that may be compiled with a different standard library implementation and get the same float. But it does give you on-computer serialization guarantees.
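A minimal sketch of the round-trip guarantee for a double (my example; it requires a standard library with floating-point <charconv> support):

#include <cassert>
#include <charconv>
#include <system_error>

int main()
{
    const double original = 0.1 + 0.2;        // not exactly representable in binary
    char buf[64];

    // No precision specified: to_chars emits the shortest string that round-trips.
    auto [end, ec] = std::to_chars(buf, buf + sizeof buf, original);
    assert(ec == std::errc());

    double restored = 0.0;
    std::from_chars(buf, end, restored);
    assert(restored == original);             // bit-identical on the same implementation
}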
All these pre-existing methods were bound to work based on a so-called locale. A locale is basically a set of formatting options that specify, e.g., what characters count as digits, what symbol to use for the decimal point, what thousands separator to use, and so on. Very often, however, you don't really need that. If you're just, e.g., reading a JSON file, you know the data is formatted in a particular way; there is no reason to be looking up whether a '.' should be a decimal point or not every time you see one. The new functions introduced in <charconv> are basically hardcoded to read and write numbers based on the formatting laid out for the default C locale. There is no way to change the formatting, but since the formatting doesn't have to be flexible, they can be very fast…

C++ small vs all caps datatype

Why, in C++ (MSVS), are all-caps datatypes defined (when most of them are the same as the lowercase ones)?
These are exactly the same. Why are the all-caps versions defined?
double and typedef double DOUBLE
char and typedef char CHAR
bool and BOOL (typedef int BOOL): both represent Boolean states, so why is int used in the latter?
What extra ability was gained through such additional datatypes?
The ALLCAPS typedefs started in the very first days of Windows programming (1.0 and before). Back then, for example, there was no such thing as a bool type. The Windows APIs and headers were defined for old-school C. C++ didn't even exist back when they were being developed.
So to help document the APIs better, compiler macros like BOOL were introduced. Even though BOOL and INT were both macros for the same underlying type (int), this let you look at a function's type signature to see whether an argument or return value was intended as a boolean value (defined as "0 for false, any nonzero value for true") or an arbitrary integer.
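Just to illustrate the documentation value (the typedefs below are paraphrased from the Windows headers; the function itself is a made-up example):

typedef int BOOL;   // 0 = FALSE, any nonzero value = TRUE
typedef int INT;

// Hypothetical API: the signature alone now tells you which argument is a
// flag and which is an arbitrary integer, even though both are plain int.
BOOL EnableFeature(INT featureId, BOOL enable);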
As another example, consider LPCSTR. In 16-bit Windows, there were two kinds of pointers: near pointers were 16-bit pointers, and far pointers used both a 16-bit "segment" value and a 16-bit offset into that segment. The actual memory address was calculated in the hardware as ( segment << 4 ) + offset.
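For example, the 16-bit segment and offset combine into a 20-bit physical address like this (illustrative sketch only):

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint16_t segment = 0xB800;   // the classic text-mode video segment
    std::uint16_t offset  = 0x0010;

    // Real-mode address calculation: (segment << 4) + offset, a 20-bit address.
    std::uint32_t physical = (static_cast<std::uint32_t>(segment) << 4) + offset;
    std::printf("%04X:%04X -> %05X\n", segment, offset, physical);   // B800:0010 -> B8010
}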
There were macros or typedefs for each of these kinds of pointers. NPSTR was a near pointer to a character string, and LPSTR was a far pointer to a character string. If it was a const string, then a C would get added in: NPCSTR or LPCSTR.
You could compile your code in either "small" model (using near pointers by default) or "large" model (using far pointers by default). The various NPxxx and LPxxx "types" would explicitly specify the pointer size, but you could also omit the L or N and just use PSTR or PCSTR to declare a writable or const pointer that matched your current compilation mode.
Most Windows API functions used far pointers, so you would generally see LPxxx pointers there.
BOOL vs. INT was not the only case where two names were synonyms for the same underlying type. Consider a case where you had a pointer to a single character, not a zero-terminated string of characters. There was a name for that too. You would use PCH for a pointer to a character to distinguish it from PSTR which pointed to a zero-terminated string.
Even though the underlying pointer type was exactly the same, this helped document the intent of your code. Of course there were all the same variations: PCCH for a pointer to a constant character, NPCH and LPCH for the explicit near and far, and of course NPCCH and LPCCH for near and far pointers to a constant character. Yes, the use of C in these names to represent both "const" and "char" was confusing!
When Windows moved to 32 bits with a "flat" memory model, there were no more near or far pointers, just flat 32-bit pointers for everything. But all of these type names were preserved to make it possible for old code to continue compiling; they were just all collapsed into one. So NPSTR, LPSTR, plain PSTR, and all the other variations mentioned above became synonyms for the same pointer type (with or without a const modifier).
Unicode came along around that same time, and most unfortunately, UTF-8 had not been invented yet. So Unicode support in Windows took the form of 8-bit characters for ANSI and 16-bit characters (UCS-2, later UTF-16) for Unicode. Yes, at that time, people thought 16-bit characters ought to be enough for anyone. How could there possibly be more than 65,536 different characters in the world?! (Famous last words...)
You can guess what happened here. Windows applications could be compiled in either ANSI or Unicode ("wide character") mode, meaning that their default character pointers would be either 8-bit or 16-bit. You could use all of the type names above and they would match the mode your app was compiled in. Almost all Windows APIs that took string or character pointers came in both ANSI and Unicode versions, with an A or W suffix on the actual function name. For example, SetWindowText(HWND hwnd, LPCSTR lpString) became two functions: SetWindowTextA(HWND hwnd, LPCSTR lpString) and SetWindowTextW(HWND hwnd, LPCWSTR lpString). And SetWindowText itself became a macro defined as one or the other of those depending on whether you compiled for ANSI or Unicode.
Back then, you might have actually wanted to write your code so that it could be compiled either in ANSI or Unicode mode. So in addition to the macro-ized function name, there was also the question of whether to use "Howdy" or L"Howdy" for your window title. The TEXT() macro (more commonly known as _T() today) fixed this. You could write:
SetWindowText( hwnd, TEXT("Howdy") );
and it would compile to either of these depending on your compilation mode:
SetWindowTextA( hwnd, "Howdy" );
SetWindowTextW( hwnd, L"Howdy" );
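Roughly how the headers wire this up (a simplified sketch; the real definitions live in WinUser.h and winnt.h/tchar.h):

#ifdef UNICODE
    #define SetWindowText  SetWindowTextW
    #define TEXT(s)        L ## s
#else
    #define SetWindowText  SetWindowTextA
    #define TEXT(s)        s
#endif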
Of course, most of this is moot today. Nearly everyone compiles their Windows apps in Unicode mode. That is the native mode on all modern versions of Windows, and the ...A versions of the API functions are shims/wrappers around the native Unicode ...W versions. By compiling for Unicode you avoid going through all those shim calls. But you still can compile your app in ANSI (or "multi-byte character set") mode if you want, so all of these macros still exist.
Microsoft decided to create macros or type aliases for all of these types, in the Windows code. It's possible that they were going for "consistency" with the all-caps WinAPI type aliases, like LPCSTR.
But what real benefit does it serve? None.
The case of BOOL is particularly headache-inducing. Although some old-school C had a convention of doing this (before actual bool entered the language), nowadays it really just confuses… especially when using WinAPI with C++.
This convention goes back more than 30 years to the early days of the Windows operating system.
Current practice in the '70s and early '80s was still to use all caps for programming in various assembly languages and the higher languages of the day, Fortran, Cobol... as well as command line interpreters and file system defaults. A habit probably rooted in the encoding of punch cards that goes back way further, to the dawn of the 20th century.
When I started programming in 1975, the card punch we used did not even support lowercase letters; it did not even have a shift key.
MS-DOS was written in assembly language, as were most successful PC packages of the early '80s such as Lotus 1-2-3, MS Word, etc. C was invented at Bell Labs for the Unix system and took a long time to gain momentum in the PC world.
In the budding microprocessor world, there were literally two separate schools: the Intel little-endian world, with all-caps assembly documentation, and the big-endian Motorola alternative, with lowercase assembly, C and Unix operating systems and clones, and other weird languages such as Lisp.
Windows is the brainchild of the former and this proliferation of all caps types and modifiers did not seem ugly then, it looked consistent and reassuring. Microsoft tried various alternatives for the pointer modifiers: far, _far, __far, FAR and finally got rid of these completely but kept the original allcaps typedefs for compatibility purposes, leading to silly compromises such as 32-bit LONG even on 64-bit systems.
This answer is not unbiased, but it was fun reviving these memories.
Only MS knows.
The only benefit I can think of is for some types (e.g. int) whose size is OS-dependent (see the table here). This would allow using a 16-bit type on a 64-bit OS, with some more typedefs or #defines. The code would be easier to port to other OS versions.
Now, if this "portable" thing were true, then the rest of the types would follow the same convention, even if their sizes were the same on all machines.

Is there a proper formatter for boost::uint64_t to use with snprintf?

I am using boost/cstdint.hpp in a C++ project because I am compiling in C++03 mode (-std=c++03) and I want to have fixed-width integers (they are transmitted over the network and stored to files). I am also using snprintf because it is a simple and fast way to format strings.
Is there a proper formatter to use boost::uint64_t with snprintf(...) or should I switch to another solution (boost::format, std::ostringstream) ?
I am currently using %lu, but I am not fully happy with it as it may not work on another architecture (where boost::uint64_t is not defined as long unsigned), defeating the purpose of using fixed-width integers.
boost::uint64_t id;
id = get_file_id(...);
const char* ENCODED_FILENAME_FORMAT = "encoded%lu.dat";
//...
char encoded_filename[34];
snprintf(encoded_filename, 34, ENCODED_FILENAME_FORMAT, id);
snprintf isn't a Boost function. It knows how to print the fundamental types only. If none of those coincides with boost::uint64_t, then it isn't even possible to print that.
In general, as you note, the formatter has to match the underlying type. So even if it's possible, the formatter will be platform-dependent. There's no extension mechanism by which Boost can add new formatters to snprintf.
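One common workaround (my illustration, not part of the answer above) is to cast the value to a type whose formatter is fixed, so the format string no longer depends on how boost::uint64_t happens to be typedef'd; where <inttypes.h> provides PRIu64 for C++, that macro is another option:

#include <stdio.h>
#include <boost/cstdint.hpp>

int main()
{
    boost::uint64_t id = 42;
    char encoded_filename[34];

    // "%llu" needs a C99-aware runtime, but the cast makes the format string
    // independent of the platform's underlying typedef for boost::uint64_t.
    snprintf(encoded_filename, sizeof encoded_filename, "encoded%llu.dat",
             static_cast<unsigned long long>(id));
    puts(encoded_filename);
    return 0;
}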

Why are the standard datatypes not used in Win32 API? [duplicate]

This question already has answers here: Why does the Win32-API have so many custom types?
I have been learning Visual C++ Win32 programming for some time now.
Why are datatypes like DWORD, WCHAR, UINT, etc. used instead of, say, unsigned long, char, unsigned int, and so on?
I have to remember when to use WCHAR instead of const char *, and it is really annoying me.
Why aren't the standard datatypes used in the first place? Will it help if I memorize Win32 equivalents and use these for my own variables as well?
Yes, you should use the correct data type for the arguments of functions, or you are likely to find yourself in trouble.
And the reason that these types are defined the way they are, rather than using int, char and so on is that it removes the "whatever the compiler thinks an int should be sized as" from the interface of the OS. Which is a very good thing, because if you use compiler A, or compiler B, or compiler C, they will all use the same types - only the library interface header file needs to do the right thing defining the types.
By defining types that are not standard types, it's easy to change int from 16 to 32 bits, for example. The first C/C++ compilers for Windows used 16-bit integers. It was only in the mid-to-late 1990s that Windows got a 32-bit API, and up until that point you were using an int that was 16 bits. Imagine that you have a well-working program that uses several hundred int variables, and all of a sudden you have to change ALL of those variables to something else... Wouldn't be very nice, right? Especially as SOME of those variables DON'T need changing, because moving to a 32-bit int for some of your code won't make any difference, so there's no point in changing those bits.
It should be noted that WCHAR is NOT the same as const char: WCHAR is a "wide char", so wchar_t is the comparable type.
So, basically, the "define our own type" approach is a way to guarantee that it's possible to change the underlying compiler architecture without having to change (much of) the source code. All larger projects that do machine-dependent coding do this sort of thing.
The sizes and other characteristics of the built-in types such as int and long can vary from one compiler to another, usually depending on the underlying architecture of the system on which the code is running.
For example, on the 16-bit systems on which Windows was originally implemented, int was just 16 bits. On more modern systems, int is 32 bits.
Microsoft gets to define types like DWORD so that their sizes remain the same across different versions of their compiler, or of other compilers used to compile Windows code.
And the names are intended to reflect concepts on the underlying system, as defined by Microsoft. A DWORD is a "double word" (which, if I recall correctly, is 32 bits on Windows, even though a machine "word" is probably 32 or even 64 bits on modern systems).
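For reference, the aliases are pinned down roughly like this in the Windows headers (a simplified paraphrase, not the literal declarations):

typedef unsigned long  DWORD;   // always 32 bits on Windows (long stays 32-bit even under LLP64)
typedef unsigned short WORD;    // 16 bits
typedef unsigned char  BYTE;    // 8 bits
typedef int            BOOL;    // 0 = FALSE, nonzero = TRUE
typedef wchar_t        WCHAR;   // a 16-bit UTF-16 code unit on Windows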
It might have been better to use the fixed-width types defined in <stdint.h>, such as uint16_t and uint32_t -- but those were only introduced to the C language by the 1999 ISO C standard (which Microsoft's compiler doesn't fully support even today).
If you're writing code that interacts with the Win32 API, you should definitely use the types defined by that API. For code that doesn't interact with Win32, use whatever types you like, or whatever types are suggested by the interface you're using.
I think that it is a historical accident.
My theory is that the original Windows developers knew that the standard C type sizes depend on the compiler, that is, one compiler may have a 16-bit integer and another a 32-bit integer. So they decided to make the Windows API portable between different compilers using a series of typedefs: DWORD is a 32-bit unsigned integer, no matter what compiler/architecture you are using. Naturally, nowadays you would use uint32_t from <stdint.h>, but this wasn't available at that time.
Then, with the UNICODE thing, they got the TCHAR vs. CHAR vs. WCHAR issue, but that's another story.
And then it grew out of control and you get such nice things as typedef void VOID, *PVOID; which are utter nonsense.

Where is Unicode version of atof in Windows Mobile

I have a C++ application where I'm replacing a number of sscanf calls with atoi, atof, etc. for performance reasons. The code is TCHAR-based, so it's _stscanf getting replaced with _ttoi and _ttof. Except there isn't a _ttof on Windows Mobile 5, or even a _wtof for explicit wide-character support. I've ended up using _tcstod instead, but that takes an extra parameter that I'm not very interested in. So, any ideas why there is no _ttof, _tcstof(), or _wtof in Windows Mobile 5.0? It's there in VS2005. Am I missing something really obvious here?
One of the problems of Windows Mobile is the size of RAM and ROM on the device. Therefore a lot of the redundant routines are removed to make sure the OS is as small as possible.
If the data you want to convert is guaranteed to be only in the ASCII charset, you can always transform it to ASCII and call atof, atol, atoi & friends.
I mean, if you have something like this (pseudocode):
TCHAR buf_T[20]=_T("12345");
char buf_char[20];
from_TCHAR_to_ascii(buf_T,buf_char);
atoi(buf_char);
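A concrete sketch of that idea, assuming a UNICODE build where TCHAR is a wide character and the text is known to be plain ASCII (the helper name ascii_ttof and the narrowing loop are mine, not from the answer):

#include <cstdlib>
#include <tchar.h>

// Narrow an ASCII-only TCHAR string into a char buffer, then reuse plain atof.
double ascii_ttof(const TCHAR* src)
{
    char buf[32];
    std::size_t i = 0;
    while (i < sizeof buf - 1 && src[i] != 0)
    {
        buf[i] = static_cast<char>(src[i]);   // safe only for ASCII code points
        ++i;
    }
    buf[i] = '\0';
    return std::atof(buf);
}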