Unicode Basics on Windows - c++

I have a C++ library which I deliver to other developers. One of them needs i18n, so he asked me if I could add the L prefix to the strings in the API.
I don't know much about i18n, so I have some basic questions:
When I compile my lib with Unicode, can other developers use this build as usual? Or do they also have to change their Visual Studio settings to use Unicode?
When I compile my lib with Unicode, do I need to change all the strings in the headers and .cpp files? Or is it sufficient to add the L prefix to strings in the header files?
Thanks in advance!
Paul

Adding the L prefix changes the string literal from an array of char into an array of wchar_t (a 16-bit type on Windows). A better alternative is to wrap all your strings with the TEXT macro, i.e.
TEXT("My string")
If your build is a Unicode build, all your strings become arrays of wchar_t, but if not, they remain arrays of char. Windows also provides the following types:
LPWSTR = wchar_t *
LPTSTR = wchar_t *, or char * if UNICODE is not defined
LPSTR = char *
Don't forget, though: even if you've prefixed your strings with L or wrapped them in TEXT, you need to make sure you're calling the right functions. Standard Windows string APIs such as lstrlen automatically switch from char * to wchar_t * if UNICODE is defined, but you'll need to make sure you're not using functions that only accept char *.
Exported functions of your library that take strings will also break older applications built against it, since those applications will still be passing arrays of char rather than wchar_t, so you'll probably want to build in some sort of backwards compatibility.
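To make this concrete, here is a minimal sketch of how TEXT and the generic string APIs adapt to the build setting (assuming a Windows build with <windows.h> available):

#include <windows.h>

int main()
{
    // TEXT expands to L"..." when UNICODE is defined, and to a plain
    // "..." narrow literal otherwise.
    LPCTSTR msg = TEXT("My string");

    // lstrlen resolves to lstrlenW or lstrlenA to match the build.
    int len = lstrlen(msg);

    return len == 9 ? 0 : 1;
}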

There's a lot more than Unicode support to internationalization (i18n). Off the top of my head, there is:
Currency
Number representation
Text encodings (partially abstracted by the use of Unicode)
Right-to-left scripts
Text translation mechanisms
Most of this is available in some form or another through APIs on Windows, whether it be Win32 or .NET, etc. I suggest you take a look at:
Microsoft .Net Internationalization
The Microsoft Win32 Internationalization Checklist
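As a small taste of the non-Unicode side of that list, number representation alone can already be handled through the standard locale machinery; a minimal sketch:

#include <iostream>
#include <locale>

int main()
{
    // Imbue the stream with the user's preferred locale so integers are
    // printed with locale-specific digit grouping (e.g. "1,234,567" in
    // an en-US locale). Throws if the environment locale is unavailable.
    std::cout.imbue(std::locale(""));
    std::cout << 1234567 << '\n';
    return 0;
}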

Related

char vs wchar_t when to use which data type

I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example of when I would use char vs wchar_t?
Short answer:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).
Long answer:
The design of the standard C++ library implies there is only one way to handle Unicode: by storing UTF-8 encoded strings in char arrays, as almost all functions exist only in char variants (think of std::exception::what).
In a C++ program you have two locales:
Standard C library locale set by std::setlocale
Standard C++ library locale set by std::locale::global
Unfortunately, neither of them defines the behavior of the standard functions that open files (like std::fopen, std::fstream::open, etc.). Behavior differs between OSes:
Linux is encoding-agnostic, so those functions simply pass the char string to the underlying system call
On Windows, the char string is converted to a wide string using the user-specific locale before the system call is made
Everything usually works fine on Linux, as everyone uses UTF-8 based locales, so all user input and the arguments passed to main will be UTF-8 encoded. But you might still need to switch the current locales to UTF-8 variants explicitly, as by default a C++ program starts in the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assume they hold UTF-8 sequences, and everything "just works".
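A minimal sketch of switching both locales explicitly at startup (the empty locale name selects the user's environment locale, typically a UTF-8 one on Linux):

#include <clocale>
#include <locale>

int main()
{
    // Switch the C library locale away from the default "C" locale.
    std::setlocale(LC_ALL, "");
    // Switch the C++ library locale as well; with a named locale this
    // also updates the C locale, but being explicit does no harm.
    std::locale::global(std::locale(""));
    return 0;
}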
Problems appear when you want to support Windows, as there you always have an additional third locale: the one set for the current user, which can be configured somewhere in the "Control Panel". The main issue is that this locale is never a Unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you have to use custom wrappers built on non-standard Windows-specific functions like _wfopen and std::fstream::open(const wchar_t *). You can check Boost.Nowide (part of Boost since 1.73) to see how this can be done: http://cppcms.com/files/nowide/html/
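A sketch of what such a wrapper might look like (fopen_utf8 is a hypothetical name chosen here, not part of Boost.Nowide; assumes a Windows build):

#include <cstdio>
#include <string>
#include <windows.h>

// Open a file from a UTF-8 path on Windows by converting the path to
// UTF-16 and calling the non-standard _wfopen.
std::FILE* fopen_utf8(const char* utf8_path, const wchar_t* mode)
{
    // First call computes the required buffer size (in wchar_t units,
    // including the terminating null, because of the -1 length).
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8_path, -1, nullptr, 0);
    if (len <= 0)
        return nullptr;
    std::wstring wide(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8_path, -1, &wide[0], len);
    return _wfopen(wide.c_str(), mode);
}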
With C++17 you can use std::filesystem::path to store file paths in a portable way, but it is still broken on Windows:
The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string, but it is too easy to forget about this and use the implicit constructor instead.
std::error_category::message(int) for both error categories returns the error description using the user-specific encoding.
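A sketch contrasting the two constructions under C++17 ("\xC3\xA9.txt" is just the UTF-8 byte sequence for "é.txt"):

#include <filesystem>
#include <fstream>

int main()
{
    // Broken on MSVC: the char* constructor decodes the bytes using the
    // user's ANSI code page, not UTF-8.
    // std::filesystem::path bad("\xC3\xA9.txt");

    // Correct in C++17: explicitly decode the bytes as UTF-8.
    std::filesystem::path good = std::filesystem::u8path("\xC3\xA9.txt");

    // The fstream overloads taking a path avoid the broken char* ones.
    std::ofstream out(good);
    return out ? 0 : 1;
}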
So what we have on Windows is:
Standard library functions that open files are broken and should never be used.
Arguments passed to main(int, char**) are broken and should never be used.
WinAPI functions ending with *A and macros are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
Error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need a long-term solution for a non-trivial project, I would recommend:
Using Boost.Nowide or implementing similar functionality directly - this fixes broken standard library.
Re-implementing standard error categories returned by std::generic_category and std::system_category so that they would always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that the new class would always use UTF-8 when converting a path to a string and a string to a path (a sketch follows at the end of this answer).
Wrapping all required functions from std::filesystem so that they would use your path wrapper and your error categories.
Unfortunately, this won't fix issues with other libraries that work with files, but many are broken anyway (they do not support Unicode).
You can check this link for further explanation: http://utf8everywhere.org/
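For the path-wrapping recommendation above, a sketch of what such a wrapper could look like (u8path_t is a hypothetical name, not an existing class):

#include <filesystem>
#include <string>

// Minimal wrapper that always converts through UTF-8, regardless of the
// user's ANSI code page.
class u8path_t {
public:
    explicit u8path_t(const std::string& utf8)
        : p_(std::filesystem::u8path(utf8)) {}

    // Always returns UTF-8, unlike path::string() on MSVC.
    std::string string() const { return p_.u8string(); }

    // Expose the native path for passing to std::filesystem functions.
    const std::filesystem::path& native() const { return p_; }

private:
    std::filesystem::path p_;
};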
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character in the ASCII character set.
The issue is that many languages require more characters than ASCII accounts for. So, instead of the 128 possible ASCII values, more are needed. Some languages have more than 256 characters, and an 8-bit char can represent at most 256 distinct values. Thus a new data type is required.
The wchar_t type, a.k.a. wide character, provides more room for encodings.
Summary
Use the char data type when the range of characters is 256 or fewer, such as ASCII. Use wchar_t when you need capacity for more than 256.
Prefer Unicode to handle large character sets (such as emojis).
Never use wchar_t.
When possible, use (some kind of array of) char, such as std::string, and ensure that it is encoded in UTF-8.
When you must interface with APIs that don't speak UTF-8, use char16_t or char32_t. Never use them otherwise; they provide only illusory advantages and encourage faulty code.
Note that there are plenty of cases where more than one char32_t is required to represent a single user-visible character. OTOH, using UTF-8 with char forces you to handle variable width very early.
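A small illustration of that point: a decomposed "é" is two char32_t code points, yet one user-visible character.

int main()
{
    // U+0065 (e) followed by U+0301 (combining acute accent) renders as
    // a single glyph but occupies two char32_t values.
    const char32_t decomposed[] = U"e\u0301";
    static_assert(sizeof(decomposed) / sizeof(char32_t) - 1 == 2,
                  "two code points, one user-visible character");
    return 0;
}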

C++, using environment variables for paths

Is there any way to use environment variables in C++ for the path to a file?
The idea is to use them without expanding them, so I don't need to use wchar for languages with the Unicode standard when I want to save/read a file.
//EDIT
Little edit with more explanations.
So what I try to achieve is to read/write to a file without worrying about the characters in the path. So I don't want to use wchar as the path, but it should work if the path contains some wide chars.
There are the functions getenv and GetEnvironmentVariable, but they need the proper language to be set under "Language for non-Unicode programs" in the Windows settings (Control Panel -> Clock, Language, and Region -> Region and Language -> Administrative), which requires some action from users, and this is something that I try to avoid.
There are the functions getenv and GetEnvironmentVariable, but they need the proper language to be set under "Language for non-Unicode programs" in the Windows settings
This is specifically a Windows problem.
On other platforms such as Linux, filepaths and environment variables are natively byte-based; you can access them using the standard C library functions that take byte-string paths like fopen() and getenv(). The pathnames may represent Unicode strings to the user (decoded using some encoding, almost always UTF-8 which can encode any character), but to the code they're just byte strings.
Windows, on the other hand, has filenames and environment variables that are natively strings of 16-bit (UTF-16) code units (which are nearly the same thing as Unicode character code points, but not quite, because that would be too easy... but that's a sadness for another time). You can call Win32 file-handling APIs like CreateFileW() and GetEnvironmentVariableW() using UTF-16 code unit strings (wchar_t, when compiled on Windows) and access any file names directly.
There are also old-school legacy byte-based Win32 functions like GetEnvironmentVariableA() (which is what GetEnvironmentVariable() points to if you are compiling a non-Unicode project). If you call those functions, Windows has to convert from the char byte strings you give it to UTF-16 strings, using some encoding.
That encoding is the ‘ANSI’ (‘A’) locale-specific default code page, which is what “Language for non-Unicode programs” sets.
Although that encoding can be changed by the user, it can't be set to UTF-8 or any other encoding that supports all characters, so even if you ask the user to change it, that still doesn't let you access all files. Thus the Win32 A APIs are always to be avoided.
The problem comes when you want to access files in a manner that works on both Windows and the other platforms. If you call the C standard library with byte strings, the Microsoft C runtime library adapts those calls to call the Win32 A byte-based APIs, which as above are irritatingly limited.
So your unattractive choices are:
1. use wchar_t and std::wstring strings in your code, using only Win32 APIs for interacting with filenames and environment variables, and accept that your code will never run on other platforms; or
2. use char and UTF-8-encoded std::string strings, and give up on your code accessing filenames and environment variables containing non-ASCII characters on Windows; or
3. write a load of branching #ifdef code to switch between using C standard functions for filename and environment interaction, or using Win32 APIs with a bunch of UTF-8-char-to-wchar_t string conversions in between, so that the code works across multiple platforms; or
4. use a library that encapsulates (3) for you.
Notably there is boost::nowide (since Boost 1.73) which contains boost::nowide::getenv.
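A minimal usage sketch, assuming Boost 1.73 or later is available:

#include <boost/nowide/cstdlib.hpp>
#include <boost/nowide/iostream.hpp>

int main()
{
    // On Windows this reads the variable through the wide API and
    // returns UTF-8; elsewhere it falls back to plain std::getenv.
    if (const char* path = boost::nowide::getenv("PATH"))
        boost::nowide::cout << path << '\n'; // UTF-8-aware console output
    return 0;
}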
This isn't entirely Microsoft's fault: Windows NT was designed in the early days of Unicode, before UTF-8 or the astral planes were invented, when it was thought that 16-bit code unit strings were a totally sensible way to store text, and not the lamentable disaster we know it is now. It is, however, very sad that Windows has not been updated since then to treat UTF-8 as a first-class citizen and provide an easy way to write cross-platform applications.
The standard library gives you the function getenv. Here is an example:
#include <cstdlib>
#include <iostream>

int main()
{
    // std::getenv returns a null pointer if the variable is not set.
    const char* pPath = std::getenv("PATH");
    if (pPath)
        std::cout << "Path = " << pPath << std::endl;
    return 0;
}

In C++ when to use WCHAR and when to use CHAR

I have a question:
Some libraries use WCHAR as the text parameter and others use CHAR (as UTF-8): I need to know when to use WCHAR or CHAR when I write my own library.
Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (and better a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (instead with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
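A sketch of what such a wrapper might look like (Utf16ToUtf8 is a name chosen here, not taken from the linked example; error handling is reduced to returning an empty string):

#include <string>
#include <windows.h>

// Convert a UTF-16 string to UTF-8 using WideCharToMultiByte.
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty())
        return {};
    // First call computes the required size of the UTF-8 buffer.
    int size = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                   static_cast<int>(utf16.size()),
                                   nullptr, 0, nullptr, nullptr);
    if (size <= 0)
        return {};
    std::string utf8(static_cast<size_t>(size), '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                        static_cast<int>(utf16.size()),
                        &utf8[0], size, nullptr, nullptr);
    return utf8;
}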
The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer to me, is use char and consider everything utf-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you make correct use of strings, though. Some APIs, like fopen(), accept a char* string but treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
In Windows we stick to WCHARs and std::wstring, mainly because if you don't, you end up having to convert anyway when calling Windows functions.
I have a feeling that trying to use UTF-8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
It is recommended that, when developing a Windows application, you resort to TCHARs. The good thing about TCHARs is that they can be either regular chars or wchar_ts, depending on whether the Unicode setting is defined or not. Once you resort to TCHARs, make sure that all the string functions you use also carry the _t prefix (e.g. _tcslen for the length of a string). That way your code will work in both Unicode and ASCII environments.
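For reference, a minimal sketch of that TCHAR style (whether or not one agrees with the recommendation):

#include <tchar.h>
#include <windows.h>

int main()
{
    // _T (like _TEXT) expands to L"..." only when _UNICODE is defined.
    const TCHAR* greeting = _T("Hello");

    // _tcslen maps to wcslen or strlen to match the character type.
    size_t len = _tcslen(greeting);

    return len == 5 ? 0 : 1;
}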

Using Different Character Encodings

Recently, I have gotten interested in text encoding. As you know, there are many kinds of text encoding, such as CP949, UTF-8 and so on.
I am wondering how to express them properly (to the screen and to users). I mean, they are different from each other. I remember there was a particular way to express text according to its encoding in C#.
Is it possible to just use a simple printf() in C to express a string regardless of its encoding? Does the compiler automatically handle it?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

What is Microsoft using as the data type for Unicode Strings?

I am in the process of learning C++ and came across an article on the MSDN here:
http://msdn.microsoft.com/en-us/magazine/dd861344.aspx
In the first code example the one line of code which my question relates to is the following:
VERIFY(SetWindowText(L"Direct2D Sample"));
More specifically, that L prefix. I had a little read up, and correct me if I am wrong :-), but this is to allow for Unicode strings, i.e. to prep for a wide character set. Then during my read-up on this I came across another article on Advanced String Techniques in C here http://www.flipcode.com/archives/Advanced_String_Techniques_in_C-Part_I_Unicode.shtml
It says there are a few options, including adding the define:
#define UNICODE
OR
#define _UNICODE
in C; again, point out if I am wrong, I appreciate your feedback. Further, it shows the data type suitable for these Unicode strings as being:
wchar_t
It throws into the mix a macro and a kind of hybrid data type, the macro being:
_TEXT(t)
which simply prefixes the string with L, and the hybrid data type being:
TCHAR
which, it points out, will allow for Unicode if the define is there and ASCII if not. Now my question is, or rather an assumption which I would like to confirm: would Microsoft use this more flexible TCHAR data type, or is there any benefit to committing to wchar_t?
Also, when I say "does Microsoft use this", I mean more specifically, for example, in the ATL and WTL libraries; do any of you have a preference or some advice regarding this?
Cheers,
Andrew
For all new software you should define UNICODE and use wchar_t directly. Using ANSI strings will come back to haunt you.
You should just use wchar_t and the wide versions of all the CRT functions (e.g. wcscmp instead of strcmp). The TEXT macros, TCHAR, etc. exist only in case your code needs to work in both ANSI and UNICODE environments, which code rarely needs to do.
When you create a new windows application using Visual Studio UNICODE is automatically defined and wchar_t will work like a built-in.
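A minimal sketch of that approach, using only the standard wide-character CRT functions:

#include <cstddef>
#include <cwchar>

int main()
{
    const wchar_t* title = L"Direct2D Sample";

    // Wide equivalents of strlen and strcmp:
    std::size_t n = std::wcslen(title);                 // 15 characters
    int same = std::wcscmp(title, L"Direct2D Sample"); // 0 means equal

    return (n == 15 && same == 0) ? 0 : 1;
}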
Short answer: the hybrid infrastructure with the TCHAR type, the _TEXT() macro and the various _t* functions (_tcscpy comes to mind) is a throwback to the times when Microsoft had two platform lines coexisting:
The Windows NT line was based on the Unicode string representation
The Windows 95/98/ME line was based on the ANSI string representation.
String representation here means that all the Windows APIs that expected or returned strings to your app used one or the other representation for these strings. COM added even more confusion, as it was available on both platforms -- and expected Unicode strings on both!
In those old times it was encouraged that you write "portable" code: you were instructed to use the hybrid infrastructure for your strings so that you can compile for both models just by defining/undefining UNICODE and/or _UNICODE for your app.
As the Windows9x line is no more relevant (for the vast majority of the apps anyway) you can safely ignore the ANSI world and use the Unicode strings directly.
Beware though that Unicode has multiple representations today: the representation implied by wchar_t on Windows was originally UCS-2 (every character encoded in a single 16-bit word) and is nowadays UTF-16, where some characters take two 16-bit code units. There are other widely used representations (such as UTF-8) where this is not true at all.
On Windows it's wchar_t with the UTF-16 encoding (16-bit code units).
Source: http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm
TCHAR changes its type depending on whether UNICODE is defined, and should be used when you want code that you can compile for both UNICODE and non-UNICODE builds.
If you want to explicitly process UNICODE data only, then feel free to use wchar_t.