What is under the hood of std::tolower? - c++

I was just reading about std::tolower on cppreference.
Is std::tolower perhaps just a wrapper around a std::use_facet call?
Please see the following example:
#include <cctype>   // needed for std::tolower
#include <iostream>
#include <locale>

int main() {
    char c1{'A'}, c2{'B'};
    std::cout << std::use_facet<std::ctype<char>>(std::locale("C")).tolower(c1) << '\n';
    std::cout << (char)std::tolower(c2) << '\n';
}
Yes, std::tolower works with integers, but apart from that, is it calling std::use_facet or something similar under the hood?

What is under the hood of std::tolower?
Absolutely nothing useful.
Supposedly the library can use a locale to handle language concerns, but as it currently stands in C++ this has been a long, frustrating pipe dream.
What do I do, then?
Use IBM’s International Components for Unicode (ICU). It is a mature, stable library that exists on essentially every system where it makes sense to program with i18n. It is on Android and iOS phones (and all the knock-offs); it is on Windows, Linux, Unix, OS X, etc.
The tricky part is just interfacing with the installed system ICU. That is different for each system, but not particularly difficult. (It becomes part of the build script, as every system-dependent detail does.)
ICU works with both C and C++ (though the C++ capability is quite a bit leaner than the C capability).
(You can also use it with Java, and ported interfaces exist for quite a few other languages as well.)
Since you have C++ tagged, I recommend you just use the C capabilities of the library over a std::wstring (Windows, C++17 or earlier) or a std::u16string (Windows C++20+ and everything else).
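As a rough sketch of what that looks like in practice - the helper name to_lower_utf16 is mine, and this assumes ICU 59 or later, where UChar is char16_t when compiled as C++:
#include <unicode/ustring.h>  // ICU C API; link against icuuc
#include <string>

// Sketch: locale-aware lower-casing of UTF-16 text via u_strToLower.
std::u16string to_lower_utf16(const std::u16string& in, const char* locale) {
    UErrorCode status = U_ZERO_ERROR;
    // Preflight with a null buffer to learn the required length.
    int32_t len = u_strToLower(nullptr, 0, in.data(), (int32_t)in.size(),
                               locale, &status);
    std::u16string out(len, u'\0');
    status = U_ZERO_ERROR;  // clear the expected U_BUFFER_OVERFLOW_ERROR
    u_strToLower(&out[0], len, in.data(), (int32_t)in.size(), locale, &status);
    return out;
}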
Boost Libraries
Boost provides a very nice C++ library to do this kind of stuff.
You can configure Boost Locale to use ICU as a backend.
I haven’t messed with it for quite a long time, and configuring the compile (Boost Locale is one of the Boost Libraries that needs to be compiled) is tricky. Make your way through that and you are golden, though.
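Once built, Boost.Locale usage is pleasantly terse. A minimal sketch, assuming a UTF-8 source/execution character set and linking against boost_locale (with the ICU backend enabled at build time):
#include <boost/locale.hpp>
#include <iostream>

int main() {
    boost::locale::generator gen;          // produces ICU-backed locales
    std::locale loc = gen("en_US.UTF-8");
    std::cout << boost::locale::to_lower("Grüße", loc) << '\n';  // "grüße"
}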
Caveats
Managing your locale becomes important. Your program should default to using the user’s system-indicated locale. ICU makes this easy to access and use.
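For example, a tiny sketch of querying that default through ICU's C API (uloc_getDefault):
#include <unicode/uloc.h>  // ICU; link against icuuc
#include <cstdio>

int main() {
    // ICU reports the user's system-indicated default locale.
    std::printf("default locale: %s\n", uloc_getDefault());
}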
Letter casing is not a universal capability in all languages. Case-conversion and case-folding functions understand this, and behave correctly for those languages.
One particular point is that Turkish has a corner case you should be aware of: the letter I. Any reading you do on letter casing should mention this.
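For instance, reusing the hypothetical to_lower_utf16() helper sketched earlier:
// Under the "tr" locale, U+0049 'I' lower-cases to U+0131 'ı' (dotless i).
std::u16string t = to_lower_utf16(u"TITLE", "tr");  // u"tıtle", not u"title"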
Remember also that locale is context-sensitive. For example, you will likely wish to use a different locale for program code than for strings displayed to the user.

Related

What is preferred char type for cross-platform C++ exceptions and base logging facilities?

Preamble:
I've been programming primarily on Windows for the past 20 years. During that time, I've developed extensive string conversion, logging, and exception reporting mechanisms that rely on Windows API for utf-16le string conversion facilities and underlying I/O services to capture runtime exception context for debugging, logging, reporting, etc.
When we moved to UTF-16LE native Windows code a while back, I simply had to make the choice to re-express most of those fundamental services in wchar_t terms (or std::wstring, which is UTF-16LE on Windows). I say "had to" because it all hangs together quite well: one achieves a high degree of agreement between the WinAPI itself, file I/O, and console I/O; they can all be made to work well in UTF-16LE.
Windows provides narrow-string I/O and current-code-page support, but that never works as well as keeping everything in UTF-16LE, which is then portable across various systems and remains coherent - nothing "lost in transcription."
Goals:
I'm looking to generalize my libraries, but I'm flummoxed by a fundamental disconnect. Under Windows it was easy(er): using wstring for all of the logging and exception data guarantees that my trace captures and errors containing locale-specific strings (e.g. locale-specific filenames or user-named elements) remain in their native form and do not suffer degradation when loaded on my English system. Linux systems, by contrast, seem to prefer UTF-8 as their core string encoding, which is often more compact, but which is very hard to make work for a Windows-based application. (This is a complex topic, but suffice it to say that Windows doesn't adequately support UTF-8 for console I/O or any other file I/O, and you're going to have to convert constantly to UTF-16LE anyway for WinAPI calls - which for a GUI-based app is absolutely constantly - so it's just easier to settle on UTF-16LE as your 'standard' string type and build up from there.)
However, I'd love to have a great programmer-friendly set of exception classes that capture context in a way that is equally "natural" and convenient to the c++ programmer on Unix as well as on Windows.
So this creates some difficult-to-answer questions, especially since my knowledge of C++ programming on Linux is lacking:
Questions:
Do I provide base exception classes that always capture UTF-16LE on Windows, but UTF-8 on Unix?
How do I convert between APIs that allow for either string type? (Technically, I'd like to provide for both std::string and std::wstring - the latter being UTF-16LE on Windows vs. UTF-32 on Linux.)
Further Considerations:
For #2, I see that C++11 introduced std::wstring_convert<std::codecvt_utf8_utf16<char16_t>>, but C++17 deprecated it.
I could of course rely on Windows APIs when compiling for that platform, but I really want something as fundamental as an exception class hierarchy - whose whole purpose is to make capturing runtime error context easy for the programmer - to have to rely on native services. I really want that core part of my library to only rely upon C++ std library services - I'd prefer to avoid C runtime library services for that matter (note: I don't mind if the std lib relies on C runtime under the hood - that's irrelevant to my concerns, and would be a platform-specific issue, not a dependency created by my library).
So I'm curious what other C++ professionals with more experience on Linux- and macOS-based systems think would dovetail most easily with their own convenience when writing out an exception, while still making it easy to capture and manipulate trace and log files or other diagnostic I/O.
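For reference, a minimal sketch of that deprecated facility (it still works on current implementations, though compilers may warn):
#include <codecvt>  // deprecated as of C++17
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-16 code units via the deprecated converter.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}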

Is C++ std::string platform independent?

I am wondering whether std::string in C++ is platform independent, meaning that it is available in all compiler implementations.
Background: I am working with some pretty old legacy code that does not make use of std::string (or std:: anything), and that makes it troublesome to work with. It is compiled on Windows and Unix systems, including somewhat more exotic platforms like AIX, zLinux, Solaris, and HPUX.
It would make life much easier if I could just use the std library but I don't know if it will work on all of those platforms. Any experiences with things like this?
std::string is part of the C++ standard (cf. [string.classes]), so it is available with every standard-conforming C++ implementation.
Be aware, though, that details have changed from one major C++ version to another, e.g. std::string::front exists only since C++11. If you want your codebase to be portable and consistent, keep this in mind and check for the highest C++ version you can target everywhere (constrained by whatever matters to you: stability, policies, backward compatibility, etc.).
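For example, a small sketch of guarding a newer member function behind a version check (the fallback is equivalent here):
#include <iostream>
#include <string>

int main() {
    std::string s = "portable";
#if __cplusplus >= 201103L
    std::cout << s.front() << '\n';  // front() exists since C++11
#else
    std::cout << s[0] << '\n';       // pre-C++11 equivalent
#endif
}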
std::string is part of the standard library and should be available on any conforming implementation.

Cross-platform C++: Use the native string encoding or standardise across platforms?

We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type on Windows is UTF-16, and UTF-8 in Linux.
We can't decide which is the best approach:
Standardise on one of the two in all our application logic (and persistent data), and make the other platform do the appropriate conversions
Use the natural format of each OS for application logic (and thus for calls into the OS), and convert only at the point of IPC and persistence
To me they seem like they are both about as good as each other.
and UTF-8 in Linux.
That's mostly true for modern Linux. Actually, the encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but some read the LC_ALL, LC_CTYPE, or LANG environment variables to detect the encoding to use (the Qt library, for example). So be careful.
We can't decide which is the best approach
As usual, it depends.
If 90% of the code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. An example: a device driver or a native iOS application.
If 90% of the code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. An example: a chat client or a browser.
In the second case you have a choice:
Use a cross-platform library that provides string support (Qt or ICU, for example)
Use bare pointers (I consider std::string a "bare pointer" too)
If working with strings is a significant part of your application, choosing a nice strings library is a good move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.
When using a strings library, you need to care about encoding only when working with external libraries, platform APIs, or sending strings over the net (or to disk). For example, a lot of Cocoa, C#, or Qt programmers (all of which have solid string support) know very little about encoding details - and that is good, since they can focus on their main task.
My experience in working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can be easily reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (but one probably needs some experience and Unicode background to feel that).
I agree that the bare-pointers approach is not for everyone. It is good when:
You work with entire strings, and splitting, searching, and comparing are rare tasks
You can use the same encoding in all components and need a conversion only when using a platform API
All your supported platforms have APIs to:
convert from your encoding to the one used by the API
convert from the API's encoding to the one used in your code
Pointers are not a problem for your team
From my admittedly narrow experience, this is actually a very common case.
When working with bare pointers, it is good to choose one encoding that will be used across the entire project (or across all projects).
From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a strings library or the platform API for strings; it will save you a lot of time.
Advantages of UTF-8:
Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
C std library works great with UTF-8 strings. (*)
C++ std library works great with UTF-8 (std::string and friends). (*)
Legacy code works great with UTF-8.
Almost any platform supports UTF-8.
Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
No Little-Endian/Big-Endian mess.
You will not catch a classical bug "Oh, UTF-16 is not always 2 bytes?".
(*) Until you need to lexically compare them, transform case (toUpper/toLower), change normalization form, or something like that - if you do, use a strings library or the platform API.
The disadvantages are questionable:
Less compact for Chinese (and other symbols with large code point numbers) than UTF-16.
A little harder to iterate over symbols (see the sketch after this list).
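To give a feel for that: counting code points in UTF-8 only requires skipping continuation bytes, as in this minimal sketch:
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string. Continuation bytes
// always match the bit pattern 10xxxxxx, so we skip them.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}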
So, I recommend using UTF-8 as the common encoding for projects that don't use any strings library.
But encoding is not the only question you need to answer.
There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as one glyph or as a combination of different glyphs. The common problem is that most string-compare functions treat these representations as different symbols. If you are working on a cross-platform project, choosing one of the normalization forms as standard is the right move. It will save you time.
For example, if a user password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly likes Normalization Form C). So if the user registered under Windows with such a password, it will be a problem for them to log in under Mac.
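The effect is easy to see at the byte level; a minimal sketch, assuming a UTF-8 execution character set (the default for GCC and Clang):
#include <iostream>
#include <string>

int main() {
    // The same visible letter 'é' in two normalization forms:
    std::string nfc = "\u00E9";   // precomposed (NFC): 2 bytes in UTF-8
    std::string nfd = "e\u0301";  // 'e' + combining acute (NFD): 3 bytes
    std::cout << std::boolalpha << (nfc == nfd) << '\n';  // prints false
}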
In addition, I would not recommend using wchar_t (or I would use it only in Windows code, as a UCS-2/UTF-16 char type). The problem with wchar_t is that there is no encoding associated with it. It's just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
I'd use the same encoding internally, and normalize the data at entry point. This will involve less code, less gotchas, and will allow you to use the same cross platform library for string processing.
I'd use Unicode (UTF-16) internally because it's simpler to handle and should perform better because of the (mostly) constant length per character. UTF-8 is ideal for output and storage because it's backward compatible with Latin ASCII and only uses 8 bits for English characters. But inside the program, 16-bit is simpler to handle.
C++11 provides the new string types std::u16string and std::u32string. Depending on the support your compiler versions deliver, and the expected lifetime of the codebase, it might be an idea to stay forward-compatible with those.
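A minimal illustration of those types:
#include <string>

std::u16string a = u"UTF-16 text";  // char16_t elements, since C++11
std::u32string b = U"UTF-32 text";  // char32_t elements, since C++11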
Other than that, using the ICU library is probably your best shot at cross-platform compatibility.
This seems to be quite enlightening on the topic. http://www.utf8everywhere.org/
Programming with UTF-8 is difficult because lengths and offsets count bytes, not characters. For example:
std::string s = Something();   // Something() returns UTF-8 text
std::cout << s.substr(0, 4);   // takes the first 4 *bytes*
does not necessarily print the first 4 characters, and may even split a multi-byte character in half.
I would use whatever a wchar_t is. On Windows that will be UTF-16. On some *nix platforms it might be UTF-32.
When saving to a file, I would recommend converting to UTF-8. That often makes the file smaller, and removes any platform dependencies due to differences in sizeof(wchar_t) or to byte order.

C++ Stdlib IO Implementation details

Are there any guarantees that C++ std IO will be portable across all desktop and mobile OSes (I'm interested in iOS and Android)?
Is the implementation of std IO different across platforms, or is it rather uniform? If it is different, does that happen due to the SDK of the platform (in other words, do SDKs provide those different implementations)?
Who provides those implementations? Who is the author? Does anybody update them?
Where is the documentation?
Are there any guarantees that C++ std IO will be portable across all desktop and mobile OSes (I'm interested in iOS and Android)?
No, there are no guarantees that these platforms will implement the standard library correctly, or at all.
Is the implementation of std IO different across platforms, or is it rather uniform? If it is different, does that happen due to the SDK of the platform (in other words, do SDKs provide those different implementations)?
It's different. I/O is very different on different platforms.
Who provides those implementations? Who is the author? Does anybody update them? Where is the documentation?
Either the compiler implementor or the platform owner provides them. The C++ Standard describes what the library must do.
I think you are failing to see the power of the standard libraries. They are designed to provide a common set of functionality that is available across any standards compliant compiler. For example, if I write the following code:
#include <iostream>

int main()
{
    std::cout << "Hello World" << std::endl;
    return 0;
}
This will be compiled by any standards-compliant compiler. You're getting hung up on the fact that the way std::cout works is different on each platform - yes, of course it is. But this is the beauty of it: why do you have to care? On Windows, if you compile this with MS Visual C++, that compiler will have the correct implementation (whose details the standard doesn't care about) to support the above standard way of writing to standard out. Similarly, on Linux GCC will have the correct code to write to whatever the implementation is, and on Solaris, CC will do the same.
You don't have to worry or frankly care. The handling for your platform is provided by the compiler that you are using for that platform. You have a nice clean high-level interface to work with.
Do you care how the Java VM handles the details of each platform? You don't; it's not your concern. You know that when you do System.out.println() it will be written to the screen (or whatever is appropriate for that VM). So why are you getting hung up on this?
The thing you have to understand is whether the compiler that you are using on the specific platform provides all the functionality in the standard library (i.e. is it fully standards-compliant or not), and if not, what's missing and how to work around it. The rest is frankly irrelevant!
As for if you don't like it, well pay for something like Roguewave - which frankly is pissing money away, but it's your money to piss away...
Standard library is exactly that — standard. It's defined by standard. Every standard-compliant compiler must provide it. So the guarantee is that it will be portable across standard-compliant implementations (whether there's one for your target platform is a whole different question altogether). It has nothing to do with platform SDKs (whether it's implemented using one doesn't matter — the observable behaviour must be the same).
The idea of a standard (hence the std) is that it is respected and uniform no matter what platform you are on.
Some vendors ship devices with support for all or only some of the std library; it's really just up to them how it is implemented.
This is platform-specific and probably covered in each platform's SDK documentation, available with the SDK or on the vendor's website.

How do I write a C++ program that will easily compile in Linux and Windows?

I am making a C++ program.
One of my biggest annoyances with C++ is its supposed platform independence.
You all probably know that it is pretty much impossible to compile a Linux C++ program on Windows, or a Windows one on Linux, without a deluge of cryptic errors and platform-specific include files.
Of course you can always switch to some emulation layer like Cygwin or Wine, but I ask you, is there really no other way?
The language itself is cross-platform, but most libraries are not. There are three things that you should keep in mind if you want to go completely cross-platform when programming in C++.
Firstly, you need to start using some kind of cross-platform build system, like SCons. Secondly, you need to make sure that all of the libraries you are using are built to be cross-platform.
As a minor third point, I would recommend using a compiler that exists on all of your target platforms; gcc comes to mind here (C++ is a rather complex beast, and all compilers have their own specific quirks).
I have some further suggestions regarding graphical user interfaces for you. There are several of these available; the three most notable are:
GTK+
Qt
wxWidgets
GTK+ and Qt are two APIs that come with their own widget sets (buttons, lists, etc.), whilst wxWidgets is more of a wrapper API over the currently running platform's native widget set. This means that the former two might look a bit different compared to the rest of the system, whilst the latter will look just like a native program.
And if you're into games programming, there are equally many APIs to choose from, all of them cross-platform as well. The two most fully featured that I know of are:
SDL
SFML
Both contain everything from graphics to input and audio routines, either built in or through plugins.
Also, if you feel that the standard library in C++ is a bit lacking, check out Boost for some general purpose cross-platform sweetness.
Good Luck.
C++ is cross-platform. The problem you seem to have is that you are using platform-dependent libraries.
I assume you are really talking about UI components, in which case I suggest using something like GTK+, Qt, or wxWidgets, each of which has UI components that can be compiled for different systems.
The only solution is for you to find and use platform-independent libraries.
And, on a side note, neither Cygwin nor Wine is emulation; they are 100% native implementations of the same functionality found on their respective systems.
Once you're aware of the gotchas, it's actually not that hard. All of the code I am currently working on compiles on 32 and 64-bit Windows, all flavors of Linux, as well as Unix (Sun, HP and IBM). Obviously, these are not GUI products. Also, we don't use third-party libraries, unless we're compiling them ourselves.
I have one .h file that contains all of the compiler-specific code. For example, Microsoft and gcc disagree on how to specify an 8-bit integer. So in the .h, I have
#if defined(_MSC_VER)
    typedef __int8 int8_t;   // MSVC's built-in 8-bit integer type
#elif defined(__unix)
    typedef char int8_t;     // char is 8 bits on these platforms
#endif
There's also quite a bit of code that uniformizes certain lower-level function calls, for example:
#if defined(_MSC_VER)
    #define SplitPath(Path__,Drive__,Dir__,Name__,Ext__) _splitpath(Path__,Drive__,Dir__,Name__,Ext__)
#elif defined(__unix)
    #define SplitPath(Path__,Drive__,Dir__,Name__,Ext__) UnixSplitPath(Path__,Drive__,Dir__,Name__,Ext__)
#endif
Now in this case, I believe I had to write a UnixSplitPath() function - there will be times when you need to. But most of the time, you just have to find the correct replacement function. In my code, I'll call SplitPath(), even though it's not a native function on either platform; the #defines will sort it out for me. It takes a while to train yourself.
Believe it or not, my .h file is only 240 lines long. There's really not much to it. And that includes handling endian issues.
Some of the lower-level stuff will need conditional compilation. For example, on Windows I use critical sections, but on Linux I need to use pthread_mutexes. The critical section was encapsulated in a class, and this class has a good deal of conditional compilation. However, the upper-level program is totally unaware; the class functions exactly the same regardless of the platform.
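A minimal sketch of that kind of wrapper (the class shape is illustrative, not the poster's actual code):
#if defined(_MSC_VER)
    #include <windows.h>
#else
    #include <pthread.h>
#endif

// One portable interface; the implementation is chosen at compile time.
class Mutex {
public:
#if defined(_MSC_VER)
    Mutex()       { InitializeCriticalSection(&cs_); }
    ~Mutex()      { DeleteCriticalSection(&cs_); }
    void lock()   { EnterCriticalSection(&cs_); }
    void unlock() { LeaveCriticalSection(&cs_); }
private:
    CRITICAL_SECTION cs_;
#else
    Mutex()       { pthread_mutex_init(&m_, nullptr); }
    ~Mutex()      { pthread_mutex_destroy(&m_); }
    void lock()   { pthread_mutex_lock(&m_); }
    void unlock() { pthread_mutex_unlock(&m_); }
private:
    pthread_mutex_t m_;
#endif
};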
The other secret I can give you is: build your project on all platforms often (particularly at the beginning). It is a lot easier when you nip the compiler problems in the bud. Don't wait until you're done development before you attempt to go cross-platform.
Stick to ANSI C++ and libraries that are cross-platform and you should be fine.
Create a low-level layer that will contain all the platform-specific code in your project. Implement two versions of this layer - one for Windows and one for Linux - with the same interface, and build them into two libraries. Access all platform-specific functionality in your project through that interface.
This layer can contain general classes for file access, printing, GUI, etc.
All the (now non-platform-specific) code that uses that layer can now be compiled once on Windows and once on Linux.
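For instance, the shared interface might look like this (the names are purely illustrative), with platform_win.cpp and platform_linux.cpp each providing the definitions:
// platform.h - the one interface both implementations must satisfy
#ifndef PLATFORM_H
#define PLATFORM_H

#include <string>

namespace platform {
    // Each function is implemented once per OS, in its own .cpp file.
    std::string home_directory();
    void        message_box(const std::string& text);
}

#endif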
Compile it on Windows and again on Linux. Unless you used platform-specific libraries, it should work. It's not like Java, where you compile it once and it works everywhere. No one has made a virtual machine for C++, and probably never will. The code you write in C++ will work on any platform; you just have to compile it on every platform first.
Suggestions:
Use typedefs for your integer types, or #include <stdint.h> (see the sketch after this list). Some machines think an int is 8 bytes, some 4. (It used to be 2 vs. 4; how times have changed.)
Use encapsulation wherever possible. My last Windows compiler thought %lld should be %I64d, gave screwy return values for vsnprintf(), and had similar issues with close() and sockets, etc.
Watch out for stack size / buffer size limits. I've run into an 8k UDP buffer limit under Windows, amongst other problems.
My Windows C++ compiler wouldn't accept dynamically sized allocations on the stack, e.g. void foo(int a) { int b[a]; } - and with good reason: variable-length arrays are a C99/GCC extension, not standard C++. Be aware of those sorts of things, and plan how you will recode.
#ifdef can be your best friend. And your worst enemy! (At the same time!)
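A minimal sketch of the fixed-width typedef suggestion from the first bullet, using the standard header rather than hand-rolled typedefs:
#include <cstdint>  // standardized fixed-width integer types (C++11)

// These widths hold on every conforming platform, so no per-compiler
// typedef block is needed for new code.
std::int8_t  small_value  = 0;  // exactly 8 bits
std::int32_t medium_value = 0;  // exactly 32 bits
std::int64_t large_value  = 0;  // exactly 64 bits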
It can certainly be done. But compile and test early and often!
Also, Linux and Windows have different data models: 64-bit Linux is LP64 (long is 64 bits), while 64-bit Windows is LLP64 (long stays 32 bits).
See the article: The forgotten problems of 64-bit programs development
Standard C++ is code that compiles without errors on any platform.
Try using Bloodshed Dev-C++ on Windows (instead of VC++ / Borland C++).
Since Bloodshed Dev-C++ conforms to the C++ standard, programs compiled with it will compile on Linux without errors in most cases.