Length of a utf16 string as a utf8 string - c++

I have a utf16 wchar_t* that I need to convert and dump into a utf8 char*. I'm using std::wcstombs to do it and am using the length of the wchar_t* for the max length.
I'm a bit fuzzy on the way utf encoding works though, IIRC, a single character could take up multiple bytes in which case I could possibly lose some characters when doing it like that.
Currently the characters that could come up are pretty limited and would probably fit even in ASCII charset but later on, I'm planning to allow more, such as öäõü and the likes. Am I going to have a problem there? If so, how would I measure the length of the buffer I need to allocate?

Codepoints in the BMP ("Basic Multilingual Plane", i.e. those whose values are not greater than 0xFFFF) require one UTF-16 codeunit or up to three UTF-8 codeunits. Outside of the BMP, a codepoint requires two UTF-16 codeunits (a surrogate pair) or four UTF-8 codeunits.
If your wchar_t is two bytes (UTF-16), in the worst case, the UTF-8 string could require three bytes for an individual wchar_t (that is 50% more memory), and 4 bytes for a surrogate pair (that is the same amount of memory).
If your wchar_t is four bytes (UTF-32), though, non-BMP characters will only require one wchar_t, so the worst case is four bytes for every wchar_t, which is the same amount of memory.
Only allowing one byte for each wchar_t will definitely get you into trouble. That will only work if you have no characters outside of the basic ASCII character set.
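As a rough sketch of how the exact buffer size could be measured before converting: the helper below walks a UTF-16 string and sums the UTF-8 bytes each code point needs. The name utf8_length and the use of std::u16string (rather than wchar_t*) are assumptions made for illustration; on a platform with a 2-byte wchar_t the same loop would apply to wchar_t data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-8 bytes needed to encode a UTF-16 string
// (terminating null not included).
// Assumption: an unpaired surrogate is counted as 3 bytes, the size of
// the replacement character U+FFFD a converter would typically emit.
std::size_t utf8_length(const std::u16string& s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;  // ASCII
        } else if (u < 0x800) {
            bytes += 2;  // e.g. ö ä õ ü
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;  // surrogate pair: one code point outside the BMP
            ++i;         // skip the low surrogate
        } else {
            bytes += 3;  // rest of the BMP (or a lone surrogate)
        }
    }
    // Worst case from the answer above: three bytes per UTF-16 code unit.
    assert(bytes <= 3 * s.size());
    return bytes;
}
```

Allocating utf8_length(s) + 1 bytes leaves room for the terminating null. If an exact count is not needed, 3 * s.size() + 1 is a safe upper bound.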

Related

UTF vs character types

UTF-8 and UTF-16 are variable length - more than 2 bytes may be used. UTF-32 uses 4 bytes. Unicode and UTF are general concepts but I wonder how they relate to C/C++ character types. Windows (WinAPI) uses a 2-byte wchar_t. How do I handle a UTF-8 character which is longer than two bytes? Even on Linux, where wchar_t is 4 bytes long, I may get UTF-8 characters which require 6 bytes. Please explain how it works.
Take care not to confuse a Unicode code point and its representation in a specific encoding. All Unicode code points are in the range 0x0-0x10FFFF, which makes them directly storable as 32-bit numbers (that's what UTF-32 does).
UTF-8 can reach 6 bytes per code point [edit: it's actually 4 in its final version so the space issue is moot, but the rest of the paragraph holds] because it requires some overhead to manage its variable length - that's what permits a lot of other code points to be encoded in only 1 or 2 bytes. But when you receive a 6-byte UTF-8 character and want to store it in Linux's 32-bit wchar_t, you don't store it as-is: you convert it to UTF-32, dropping the overhead. Same story with Windows's 16-bit wchar_t, except you might end up with two 16-bit, UTF-16-encoded halves.
Note: a lot of Windows software is actually using UCS-2, which is essentially UTF-16 without the variable length. These won't be able to handle characters that would have required two UTF-16 wchar_t's.
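To illustrate what "dropping the overhead" means, here is a minimal sketch of decoding one UTF-8 sequence into a single 32-bit code point. The function name decode_utf8 is made up for this example, it only handles the final 1-to-4-byte form of UTF-8, and it assumes well-formed input instead of validating it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the UTF-8 sequence starting at s[pos] into a code point and
// advances pos past it. Assumption: s holds well-formed UTF-8; no
// validation of continuation bytes or overlong forms is done here.
char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    int extra;    // number of continuation bytes that follow
    char32_t cp;  // payload bits collected so far
    if (lead < 0x80)      { extra = 0; cp = lead; }         // 0xxxxxxx
    else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }  // 110xxxxx
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }  // 1110xxxx
    else                  { extra = 3; cp = lead & 0x07; }  // 11110xxx
    for (; extra > 0; --extra) {
        assert(pos < s.size());
        // Each continuation byte (10xxxxxx) contributes 6 payload bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}
```

The marker bits in each byte are the overhead; only the payload bits survive in the resulting UTF-32 value.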
First of all, the maximum Unicode character (UTF-8, UTF-16 and UTF-32 are encodings of Unicode to bytes) is U+10FFFF, which fits comfortably in a 4-byte wchar_t.
As for the 2 bytes wchar_t, Unicode addressed this problem in UTF-16 by adding in dummy "surrogate" characters in the range U+D800 to U+DFFF.
Quoting an example from the UTF-16 Wikipedia page:
To encode U+10437 (𐐷) to UTF-16:
Subtract 0x10000 from the code point, leaving 0x0437.
For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
For completeness' sake, here is this character encoded in different encodings:
UTF-8: 0xF0 0x90 0x90 0xB7
UTF-16: 0xD801 0xDC37
UTF-32: 0x00010437
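The surrogate arithmetic quoted above can be written down in a few lines. This is only a sketch: encode_utf16 is a name chosen for this example, not a standard function.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as one or two UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    // Valid scalar values only: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: stored directly
    } else {
        const char32_t v = cp - 0x10000;           // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Calling encode_utf16(0x10437) yields the two code units 0xD801 0xDC37 from the example.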

Why u8'A' can be a char type while UTF-8 can be up to 4 bytes and char is normally 1 byte?

I was reading What is the use of wchar_t in general programming? and found something confusing in the accepted answer:
It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
And I find this from my textbook:
Isn't Unicode encoding with UTF-8 is at most 4 bytes? char for most platforms is 1 byte. Do I misunderstand something?
Update:
After searching and reading, now I know that:
code points and code units are different stuff. Code point is unique while code units rely on encoding.
u8'a' (a char, not a string here) is only allowed for the basic character set (ASCII and its control characters), and its value is the corresponding code unit value of 'a'; for ASCII characters, code units have the same value as code points. (This is what #codekaizer's answer says.)
std::string::size() returns code units.
So the editors are all dealing with code units, right? And if I change my file encoding from UTF-8 to UTF-32, then the size of ə would be 4?
Isn't unicode encoding with utf8 is at most 4 bytes?
As per lex.ccon/3, emphasis mine:
A character literal that begins with u8, such as u8'w', is a character
literal of type char, known as a UTF-8 character literal. The value of
a UTF-8 character literal is equal to its ISO 10646 code point value,
provided that the code point value is representable with a single
UTF-8 code unit (that is, provided it is in the C0 Controls and Basic
Latin Unicode block). If the value is not representable with a single
UTF-8 code unit, the program is ill-formed. A UTF-8 character literal
containing multiple c-chars is ill-formed.
Single UTF-8 code unit is 1 byte.
You are confusing code points with code units.
In UTF-8 each code unit (≈ data type used by a particular encoding) is one byte (8 bit), so it can be represented in a C++ program by the char type (which the standard guarantees to be at least 8 bit).
Now, of course you cannot represent all Unicode code points (≈ character/glyph) in just a single code unit if it is so small - they are currently well over 1 million, while a byte can have only 256 distinct values. For this reason, UTF-8 uses more code units to represent a single code point (and, to save space and for compatibility, uses a variable length encoding). So, the 😀 code point (U+1F600) will be mapped to 4 code units (f0 9f 98 80).
Most importantly, C++ almost everywhere is concerned just with code units - strings are treated mostly as opaque binary blobs (with the exception of the 0 byte for C strings). For example, strlen and std::string::size() will all report you the number of code units, not of code points.
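A small sketch of the difference, assuming the std::string holds well-formed UTF-8 (count_code_points is a hypothetical helper, not a standard library function): every code point starts with exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes counts code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes.
// Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // not of the form 10xxxxxx
            ++n;
        }
    }
    assert(n <= utf8.size());  // never more code points than code units
    return n;
}
```

For the string "\xF0\x9F\x98\x80" (the 😀 from above), size() reports 4 while count_code_points reports 1.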
The u8 cited above is one of the rare exceptions. It's an indication to the compiler that the string enclosed in the literal must be mapped from whatever encoding the compiler is using to read the source file to a UTF-8 string.
The UTF-* family includes variable-length encodings (UTF-8 and UTF-16). In UTF-8, for instance, the minimal size is indeed 1 byte, but some characters require more. Those encodings have two advantages:
Compatibility with widespread characters types such as char
Minimal size when the text contains mostly English characters (which occupy 1 byte).
On the downside, variable-length encodings require more work for some operations, e.g. calculating the number of characters in a given string. Since each character can occupy a different number of bytes, you can't just look at the string size (in bytes).
Given that, if you're going to use a variable length encoding, it usually makes sense to use the most compressed one, which is UTF-8 (under the assumption your text indeed contains mostly English characters). OTOH, if your text contains a wide range of languages, which will make UTF-8 inefficient, you can opt for the fixed size Unicode representations. On such cases, you'll need wider character types - 2 or 4 bytes.
The character set is not restricted to the ASCII table, whose entries each fit in 1 byte. Characters from other languages, e.g. Japanese, are not in the ASCII table, so they need more than one byte: up to 4 bytes per character in UTF-8.
A C++ char is 1 byte in size, which is enough for one ASCII character or for one code unit of a longer UTF-8 sequence.

What is the use of wchar_t in general programming?

Today I was learning some C++ basics and came to know about wchar_t. I was not able to figure out, why do we actually need this datatype, and how do I use it?
wchar_t is intended for representing text in fixed-width, multi-byte encodings; since wchar_t is usually 2 bytes in size it can be used to represent text in any 2-byte encoding. It can also be used for representing text in variable-width multi-byte encodings of which the most common is UTF-16.
On platforms where wchar_t is 4 bytes in size it can be used to represent any text using UCS-4 (Unicode), but since on most platforms it's only 2 bytes it can only represent Unicode in a variable-width encoding (usually UTF-16). It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
About the only modern operating system to use wchar_t extensively is Windows; this is because Windows adopted Unicode before it was extended past U+FFFF and so a fixed-width 2-byte encoding (UCS-2) appeared sensible. Now UCS-2 is insufficient to represent the whole of Unicode and so Windows uses UTF-16, still with wchar_t 2-byte code units.
wchar_t is a wide character. It is used to represent characters which require more memory to represent them than a regular char. It is, for example, widely used in the Windows API.
However, the size of a wchar_t is implementation-dependent and not guaranteed to be larger than char. If you need to support a specific form of character format greater than 8 bits, you may want to turn to char32_t and char16_t, which are guaranteed to be at least 32 and 16 bits respectively.
wchar_t is used when you need to store characters with codes greater than 255 (it has a greater value than char can store).
char can take 256 different values which corresponds to entries in the ISO Latin tables. On the other hand, wide char can take more than 65536 values which corresponds to Unicode values. It is a recent international standard which allows the encoding of characters for virtually all languages and commonly used symbols.
I understand most of them have answered it but as I was learning C++ basics too and came to know about wchar_t, I would like to tell you what I understood after searching about it.
wchar_t is used when you need to store a character with a code above 255, because such characters do not fit in our character type 'char' and hence require more memory.
e.g.:
const wchar_t* var = L"Привет мир\n"; // hello world in Russian
It generally has a size greater than 8-bit character.
The windows operating system uses it substantially.
It is usually used when there is a foreign language involved.
The wchar_t data type is used to hold wide characters. It occupies 2 or 4 bytes (16 or 32 bits), depending on the platform.
Mostly the wchar_t datatype is used when international languages like japanese are used.
The wchar_t type is used for characters of extended character sets. It is among other uses used with wstring which is a string that can hold single characters of extended character sets, as opposed to the string which might hold single characters of size char, or use more than one character to represent a single sign (like utf8).
The wchar_t size is dependent on the locales, and is by the standard said to be able to represent all members of the largest extended character set supported by the locales.
wchar_t is specified in the C++ language in [basic.fundamental]/p5 as:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales ([locale]).
In other words, wchar_t is a data type which makes it possible to work with text containing characters from any language without worrying about character encoding.
On platforms that support Unicode above the basic multilingual plane, wchar_t is usually 4 bytes (Linux, BSD, macOS).
Only on Windows wchar_t is 2 bytes and encoded with UTF-16LE, due to historical reasons (Windows initially supported UCS2 only).
In practice, the "1 wchar_t = 1 character" concept becomes even more complicated, due to Unicode supporting combining characters and graphemes (characters represented by sequences of code points).
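To check which case applies on a given toolchain, something like the following sketch could be used. The names wchar_bits and wchar_units_for are invented for this example; the logic simply assumes that a wchar_t narrower than 21 bits holds UTF-16 code units and a wider one holds whole code points.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Number of wchar_t elements needed to store one code point, under the
// assumption stated above (UTF-16 when narrow, UTF-32 when wide).
std::size_t wchar_units_for(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (wchar_bits >= 21 || cp < 0x10000) {
        return 1;  // fits in a single wchar_t
    }
    return 2;      // needs a surrogate pair
}
```

With a 16-bit wchar_t, as on Windows, wchar_units_for(0x24B62) returns 2; with a 32-bit wchar_t it returns 1.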

How do I use 3 and 4-byte Unicode characters with standard C++ strings?

In standard C++ we have char and wchar_t for storing characters. char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF. std::string uses char, so it can store 1-byte characters only. std::wstring uses wchar_t, so it can store characters up to 2-byte width. This is what I know about strings in C++. Please correct me if I said anything wrong up to this point.
I read the article for UTF-8 in Wikipedia, and I learned that some Unicode characters consume up to 4-byte space. For example, the Chinese character 𤭢 has a Unicode code point 0x24B62, which consumes 3-byte space in the memory.
Is there an STL string container for dealing with these kind of characters? I'm looking for something like std::string32. Also, we had main() for ASCII entry point, wmain() for entry point with 16-bit character support; what entry point do we use for 3 and 4-byte Unicode supported code?
Can you please add a tiny example?
(My OS: Windows 7 x64)
First you need a better understanding of Unicode. Specific answers to your questions are at the bottom.
Concepts
You need a more nuanced set of concepts than are required for very simple text handling as taught in introductory programming courses.
byte
code unit
code point
abstract character
user perceived Character
A byte is the smallest addressable unit of memory. Usually 8 bits today, capable of storing up to 256 different values. By definition a char is one byte.
A code unit is the smallest fixed size unit of data used in storing text. When you don't really care about the content of text and you just want to copy it somewhere or calculate how much memory the text is using then you care about code units. Otherwise code units aren't much use.
A code point represents a distinct member of a character set. Whatever 'characters' are in a character set, they all are assigned a unique number, and whenever you see a particular number encoded then you know which member of the character set you're dealing with.
An abstract character is an entity with meaning in a linguistic system, and is distinct from its representation or any code points assigned to that meaning.
User perceived characters are what they sound like; what the user thinks of as a character in whatever linguistic system he's using.
In the old days, char represented all of these things: a char is by definition a byte, in char* strings the code units are chars, the character sets were small so the 256 values representable by char was plenty to represent every member, and the linguistic systems that were supported were simple, so the members of the character sets mostly represented the characters users wanted to use directly.
But this simple system with char representing pretty much everything wasn't enough to support more complex systems.
The first problem encountered was that some languages use far more than 256 characters. So 'wide' characters were introduced. Wide characters still used a single type to represent four of the above concepts, code units, code points, abstract characters, and user perceived characters. However wide characters are no longer single bytes. This was thought to be the simplest method of supporting large character sets.
Code could mostly be the same, except it would deal with wide characters instead of char.
However it turns out that many linguistic systems aren't that simple. In some systems it makes sense not to have every user-perceived character necessarily be represented by a single abstract character in the character set. As a result text using the Unicode character set sometimes represents user perceived characters using multiple abstract characters, or uses a single abstract character to represent multiple user-perceived characters.
Wide characters have another problem. Since they increase the size of the code unit they increase the space used for every character. If one wishes to deal with text that could adequately be represented by single byte code units, but must use a system of wide characters then the amount of memory used is higher than would be the case for single byte code units. As such, it was desired that wide characters not be too wide. At the same time wide characters need to be wide enough to provide a unique value for every member of the character set.
Unicode currently contains about 100,000 abstract characters. This turns out to require wide characters which are wider than most people care to use. As a result, a system of wide characters, where code units larger than one byte are used to directly store code point values, turns out to be undesirable.
So to summarize, originally there was no need to distinguish between bytes, code units, code points, abstract characters, and user perceived characters. Over time, however, it became necessary to distinguish between each of these concepts.
Encodings
Prior to the above, text data was simple to store. Every user perceived character corresponded to an abstract character, which had a code point value. There were few enough characters that 256 values was plenty. So one simply stored the code point numbers corresponding to the desired user-perceived characters directly as bytes. Later, with wide characters, the values corresponding to user-perceived characters were stored directly as integers of larger sizes, 16 bits, for example.
But since storing Unicode text this way would use more memory than people are willing to spend (three or four bytes for every character) Unicode 'encodings' store text not by storing the code point values directly, but by using a reversible function to compute some number of code unit values to store for each code point.
The UTF-8 encoding, for example, can take the most commonly used Unicode code points and represent them using a single, one byte code unit. Less common code points are stored using two one byte code units. Code points that are still less common are stored using three or four code units.
This means that common text can generally be stored with the UTF-8 encoding using less memory than 16-bit wide character schemes, but also that the numbers stored do not necessarily correspond directly to the code point values of abstract characters. Instead if you need to know what abstract characters are stored, you have to 'decode' the stored code units. And if you need to know the user perceived characters you have to further convert abstract characters into user perceived characters.
There are many different encodings, and in order to convert data using those encodings into abstract characters you must know the right method of decoding. The stored values are effectively meaningless if you don't know what encoding was used to convert the code point values into code units.
An important implication of encodings is that you need to know whether particular manipulations of encoded data are valid, or meaningful.
For example, if you want to get the 'size' of a string, are you counting bytes, code units, abstract characters, or user perceived characters? std::string::size() counts code units, and if you need a different count then you have to use another method.
As another example, if you split an encoded string you need to know if you're doing so in such a way that the result is still valid in that encoding and that the data's meaning hasn't unintentionally changed. For example you might split between code units that belong to the same code point, thus producing an invalid encoding. Or you might split between code points which must be combined to represent a user perceived character and thus produce data the user will see as incorrect.
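As an illustration of the first kind of split, here is a hedged sketch of truncating a UTF-8 string without cutting a code point in half. truncate_utf8 is a made-up helper; it assumes well-formed UTF-8 and only respects code point boundaries, so it can still separate a base character from a following combining mark.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the longest prefix of s that is at most max_bytes long and
// does not end in the middle of a UTF-8 sequence.
// Assumption: s is well-formed UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t cut = max_bytes;
    // s[cut] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a sequence,
    // so back up to the start of that sequence.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    assert(cut <= max_bytes);
    return s.substr(0, cut);
}
```

For "a\xE2\x82\xAC" (an 'a' followed by the three-byte euro sign), asking for at most 2 or 3 bytes gives back just "a" rather than a broken sequence.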
Answers
Today char and wchar_t can only be considered code units. The fact that char is only one byte doesn't prevent it from representing code points that take two, three, or four bytes. You simply have to use two, three, or four chars in sequence. This is how UTF-8 was intended to work. Likewise, platforms that use two byte wchar_t to represent UTF-16 simply use two wchar_t in a row when necessary. The actual values of char and wchar_t don't individually represent Unicode code points. They represent code unit values that result from encoding the code points. E.g. the Unicode code point U+0400 is encoded into two code units in UTF-8 -> 0xD0 0x80. The Unicode code point U+24B62 similarly gets encoded as four code units 0xF0 0xA4 0xAD 0xA2.
So you can use std::string to hold UTF-8 encoded data.
On Windows main() supports not just ASCII, but whatever the system char encoding is. Now even Windows supports UTF-8 as the system char encoding and is no longer limited to legacy encodings. You may have to configure Windows for this though; I'm not sure if it's the default yet, and if not hopefully it becomes the default.
You can also use a Win32 API call to directly access the UTF-16 command line parameters instead of using main()'s argc and argv parameters. See GetCommandLineW() and CommandLineToArgvW.
wmain()'s argv parameter fully supports Unicode. The 16-bit code units stored in wchar_t on Windows are UTF-16 code units. The Windows API uses UTF-16 natively, so it's quite easy to work with on Windows. wmain() is non-standard though, so relying on this won't be portable.
Example:
#include <iostream>
#include <string>
int main() {
    std::string s = "CJK UNIFIED IDEOGRAPH-24B62: \U00024B62";
    std::cout << s << '\n';
    auto space = s.rfind(' ');
    std::cout << "Encoded bytes: ";
    for (auto i = space + 1, end = s.size(); i != end; ++i) {
        std::cout << std::hex << static_cast<int>(static_cast<unsigned char>(s[i])) << " ";
    }
}
If the compiler uses UTF-8 as the narrow execution encoding then s will contain UTF-8 data. If the terminal you're using to run the compiled program supports UTF-8, is configured to use it, and uses a font that supports the character 𤭢 then you should see that character printed out by this program.
On Windows I built with the /utf-8 flag cl.exe /EHsc /utf-8 tmp.cpp, and ran the command to set the console to UTF-8 chcp 65001, resulting in this program printing the correct data. Although due to lack of support from the font the character displayed as a box with a question mark. Copying the text from the console and pasting it somewhere with proper support reveals that the correct character was written.
With /utf-8 you can also write utf-8 data directly in string literals instead of using the \Uxxxxxxxx syntax.
With GCC you can use the flag -fexec-charset=utf-8 to build this program, though it should be the default. -finput-charset=utf-8 allows you to directly write UTF-8 encoded data in your string literals.
Clang doesn't bother to support anything other than UTF-8.
Windows uses UTF-16. Any code point in the range of U+0000 to U+D7FF and U+E000 to U+FFFF will be stored directly; any outside of those ranges will be split into two 16-bit values according to the UTF-16 encoding rules.
For example 0x24B62 will be encoded as 0xd852,0xdf62.
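Going the other way, a surrogate pair can be turned back into its code point by undoing the split. A minimal sketch (decode_surrogates is a name picked for this example):

```cpp
#include <cassert>

// Combines a UTF-16 high/low surrogate pair into the code point it encodes.
char32_t decode_surrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);  // must be a high surrogate
    assert(low >= 0xDC00 && low <= 0xDFFF);    // must be a low surrogate
    // Each surrogate carries 10 bits; 0x10000 was subtracted when encoding.
    return 0x10000 +
           ((static_cast<char32_t>(high) - 0xD800) << 10) +
           (static_cast<char32_t>(low) - 0xDC00);
}
```

Working the arithmetic for U+24B62: 0x24B62 - 0x10000 = 0x14B62, whose top 10 bits are 0x52 and low 10 bits are 0x362, so the pair is 0xD852 0xDF62, and decode_surrogates(0xD852, 0xDF62) gives 0x24B62 again.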
You may convert the strings to work with them any way you'd like but the Windows API will still want and deliver UTF-16 so that's probably going to be the most convenient.
The size and meaning of wchar_t is implementation-defined. On Windows it's 16 bit as you say, on Unix-like systems it's often 32 bit but not always.
For that matter, a compiler is permitted to do its own thing and pick a different size for wchar_t than what the system says -- it just won't be ABI-compatible with the rest of the system.
C++11 provides std::u32string, which is for representing strings of unicode code points. I believe that sufficiently recent Microsoft compilers include it. It's of somewhat limited use since Microsoft's system functions expect 16-bit wide characters (a.k.a UTF-16le), not 32-bit unicode code points (a.k.a UTF-32, UCS-4).
You mention UTF-8, though: UTF-8 encoded data can be stored in a regular std::string. Of course since it's a variable-length encoding, you can't access unicode code points by index, you can only access the bytes by index. But you'd normally write your code not to need to access code points by index anyway, even if using u32string. Unicode code points don't correspond 1-1 with printable characters ("graphemes") because of the existence of combining marks in Unicode, so many of the little tricks you play with strings when learning to program (reversing them, searching for substrings) don't work so easily with Unicode data no matter what you store it in.
The character 𤭢 is, as you say, U+24B62. It is UTF-8 encoded as a series of four bytes, not three: F0 A4 AD A2. Translating between UTF-8 encoded data and unicode code points is effort (admittedly not a huge amount of effort and library functions will do it for you). It's best to regard "encoded data" and "unicode data" as separate things. You can use whatever representation you find most convenient right up to the point where you need to (for example) render the text to screen. At that point you need to (re-)encode it to an encoding that your output destination understands.
In standard C++ we have char and wchar_t for storing characters? char can store values between 0x00 and 0xFF. And wchar_t can store values between 0x0000 and 0xFFFF
Not quite:
sizeof(char) == 1 so 1 byte per character.
sizeof(wchar_t) == ? Depends on your system
(for unix usually 4 for Windows usually 2).
Unicode characters consume up to 4-byte space.
Not quite. Unicode is not an encoding. Unicode is a standard that defines what each code point is, and the code points are restricted to 21 bits. The low 16 bits define the character position on a code plane while the upper 5 bits define which plane the character is on.
There are several unicode encodings (UTF-8, UTF-16 and UTF-32 being the most common) this is how you store the characters in memory. There are practical differences between the three.
UTF-8: Great for storage and transport (as it is compact)
Bad because it is variable length
UTF-16: Horrible in nearly all regards
It is always large and it is variable length
(anything not on the BMP needs to be encoded as surrogate pairs)
UTF-32: Great for in memory representations as it is fixed size
Bad because it takes 4 bytes for each character which is usually overkill
Personally I use UTF-8 for transport and storage and UTF-32 for in memory representation of text.
char and wchar_t are not the only data types used for text strings. C++11 introduces new char16_t and char32_t data types and respective STL std::u16string and std::u32string typedefs of std::basic_string, to address the ambiguity of the wchar_t type, which has different sizes and encodings on different platforms. wchar_t is 16-bit on some platforms, suitable for UTF-16 encoding, but is 32-bit on other platforms, suitable for UTF-32 encoding instead. char16_t is specifically 16-bit and UTF-16, and char32_t is specifically 32-bit and UTF-32, on all platforms.
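A tiny example of these types, holding the same text ('a' followed by U+24B62) in both encoding forms. The variable names and the helper surrogate_pairs are assumptions for this sketch, and the helper expects well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const std::u16string text16 = u"a\U00024B62";  // UTF-16: 1 + 2 code units
const std::u32string text32 = U"a\U00024B62";  // UTF-32: 1 + 1 code units

// Counts surrogate pairs by counting high surrogates.
// Assumption: s is well-formed UTF-16 (every high surrogate is paired).
std::size_t surrogate_pairs(const std::u16string& s) {
    std::size_t pairs = 0;
    for (char16_t u : s) {
        if (u >= 0xD800 && u <= 0xDBFF) {
            ++pairs;
        }
    }
    assert(2 * pairs <= s.size());  // holds for well-formed input
    return pairs;
}
```

Here text16.size() is 3 and text32.size() is 2; the difference is exactly the number of surrogate pairs.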

Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?

Please clarify for me, how does UTF16 work?
I am a little confused, considering these points:
There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously) (UPDATE: as shown by the answers, this assumption was wrong).
Most of msdn and some other documentation seem to have the assumptions that the characters are always 2 bytes long. This can just be my imagination, I can't come up with any particular examples, but it just seems that way.
There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
To my uncertain knowledge, unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.
UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.
So if a UTF16 character not always 2 bytes long, how long else could it be? 3 bytes? or only multiples of 2?
And then for example if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?
Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)
UPDATE: Now I see that character-counting is not necessarily a standard-thing or a c++ thing even, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:
On Windows, specifically, in Winapi, in their wide functions (ending with W), how does one count the number of characters in a string that consists of 2 unicode codepoints, each consisting of 2 codeunits (total of 8 bytes)? Is such a string 2 characters long (the same as the number of codepoints) or 4 characters long (the same as the total number of codeunits)?
Or, being more generic: What does the windows definition of "number of characters in a wide string" mean, number of codepoints or number of codeunits?
Short answer: No.
The size of a wchar_t—the basic character unit—is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.
In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it’s also valid for a wchar_t to contain big-endian data.
Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.
A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as “decomposed” code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.
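The two spellings of é can be compared directly in code. The sketch below is deliberately simplistic: count_without_combining is a hypothetical helper that only knows the Combining Diacritical Marks block (U+0300 to U+036F), whereas real grapheme segmentation and normalization need a Unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const std::u32string precomposed = U"\u00E9";   // é as one code point
const std::u32string decomposed  = U"e\u0301";  // e + COMBINING ACUTE ACCENT

// Counts code points that are not in the Combining Diacritical Marks
// block. Assumption: this is a crude stand-in for counting
// user-perceived characters, good enough only for this example.
std::size_t count_without_combining(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!(cp >= 0x0300 && cp <= 0x036F)) {
            ++n;
        }
    }
    assert(n <= s.size());
    return n;
}
```

The two strings have sizes 1 and 2 and compare unequal with ==, even though both display as é; count_without_combining returns 1 for each.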
Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.
On Linux/Mac UTF-8 (with chars) is more common and APIs usually take UTF-8. wchar_t is seen to be wasteful because it uses 4 bytes per character.
For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on-the-fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.
Update
The question has been updated to ask what Windows APIs mean when they ask for the “number of characters” in a string.
If the API says “size of the string in characters” they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.
You seem to have several misconceptions.
There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously)
This is wrong. Assuming you refer to the c++ type wchar_t - It is not always 2 bytes long, 4 bytes is also a common value, and there's no restriction that it can be only those two values. If you don't refer to that, it isn't in C++ but is some platform-specific type.
There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.
UTF-8 and UTF-16 are different encodings for the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could encode more characters than the scheme used in UTF-16, but as UTF-8 and UTF-16 they encode the same set.
Don't use the term "character" lightly when it comes to Unicode. A codeunit in UTF-16 is 2 bytes wide, and a codepoint is represented by 1 or 2 codeunits. What humans usually understand as "characters" is different again and can be composed of one or more codepoints. If you as a programmer confuse codepoints with characters, bad things can happen, like http://ideone.com/qV2il
Windows' WCHAR is 16 bits (2 bytes) long.
A Unicode codepoint may be represented by one or two of these WCHAR – 16 or 32 bits (2 or 4 bytes).
wcslen returns the number of WCHAR units in a wide string, while wcslen_l returns the number of (locale-dependent) codepoints. Since every codepoint takes at least one WCHAR, wcslen >= wcslen_l.
A Unicode character may consist of multiple combining codepoints.
Short story: UTF-16 is a variable-length encoding. A single character may be one or two widechars long.
HOWEVER, you may very well get away with treating it as a fixed-length encoding where every character is one widechar (2 bytes). This is formally called UCS-2, and it was Win32's assumption up to and including Windows NT 4. UCS-2 covers the scripts of practically all living, dead and constructed human languages. And truth be told, working with variable-length encoded strings just sucks. Finding the n-th character becomes an O(n) operation, string length is not the same as string size, etc. Any sensible parsing becomes a pain.
As for the UTF-16 chars that are not in UCS-2... I only know two subsets that may theoretically come up in real life. First is emoji - the graphical smileys that are popular in the Japanese cell phone culture. On iPhone, there's a bunch of third-party apps that enable input of those. Except on mobile phones, they don't display properly. The other character class is VERY obscure Chinese characters. The ones even most Chinese don't know. All the popular Chinese characters are well inside UCS-2.
There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously)
Well, WCHAR is an MS thing, not a C++ thing.
But there is wchar_t for wide characters, though it is not always 2 bytes. On Linux systems it is usually 4 bytes.
Most of msdn and some other documentation seem to have the assumptions that the characters are always 2 bytes long. This can just be my imagination, I can't come up with any particular examples, but it just seems that way.
Do they? I can believe it.
There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
C/C++ make no assumption about character encoding, though the OS can. For example, Windows uses UTF-16 in its interfaces while a lot of Linux systems use UTF-32 for wchar_t. But you need to read the documentation for each interface to know explicitly.
To my uncertain knowledge, unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.
Two bytes are all you need for the numbers 0 to 65535.
But UCS (the character set that the UTF encodings are based on) needs up to 21 bits per code point, since the code space runs to 0x10FFFF. Thus some code points are encoded as two 16-bit code units in UTF-16 (these are referred to as surrogate pairs).
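The arithmetic behind a surrogate pair is short enough to sketch. Subtracting 0x10000 leaves a 20-bit value; the top ten bits go into the high surrogate and the bottom ten into the low one. The struct and function names here are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

struct SurrogatePair {
    char16_t high;  // lead surrogate, 0xD800..0xDBFF
    char16_t low;   // trail surrogate, 0xDC00..0xDFFF
};

// Splits a supplementary-plane code point (0x10000..0x10FFFF) into a pair.
inline SurrogatePair to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return SurrogatePair{static_cast<char16_t>(0xD800 + (v >> 10)),
                         static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Recombines a surrogate pair into the original code point.
inline std::uint32_t from_surrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, and feeding that pair back into from_surrogates returns 0x1F600.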
UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.
UTF-8, UTF-16 and UTF-32 all encode the same set of code points (which need up to 21 bits each). UTF-32 is the only one that has a fixed size. UTF-16 was supposed to be fixed size, but then lots of other characters turned up that needed encoding and plane 0 ran out of space, so 16 more planes were added; a surrogate pair carries the 20 bits needed to address them.
So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? or only multiples of 2?
It is either one 16-bit code unit (2 bytes) or two 16-bit code units (4 bytes), so only multiples of 2.
And then for example if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?
You have to step along and calculate each character one at a time.
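Stepping along could look roughly like this. It is a sketch with hypothetical names that works on char16_t so it is portable, and it counts an unpaired surrogate as one code point, which is a choice made for this example and not a rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points by walking the string one code unit at a time.
// A well-formed surrogate pair counts once; an unpaired surrogate is
// counted as one code point (an assumption of this sketch).
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the trail surrogate of the pair
        }
        ++count;
    }
    assert(count <= s.size());  // never more code points than code units
    return count;
}
```

For the string in the question (two characters of 4 bytes each), s.size() is 4 code units while count_code_points(s) is 2.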
Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)
It all depends on your system.
This Wikipedia article seems to be a good intro.
UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
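The figure of 1,112,064 follows from the size of the code space minus the 2,048 values reserved for surrogates, which are not code points that can be encoded on their own. A compile-time check of that arithmetic, with constant names invented for the example:

```cpp
#include <cstdint>

constexpr std::uint32_t kCodeSpaceSize = 0x10FFFF + 1;           // 1,114,112 values
constexpr std::uint32_t kSurrogateCount = 0xDFFF - 0xD800 + 1;   // 2,048 reserved values
constexpr std::uint32_t kEncodableCodePoints = kCodeSpaceSize - kSurrogateCount;

static_assert(kEncodableCodePoints == 1112064, "matches the figure quoted above");

// True for values that UTF-16 (and UTF-8, UTF-32) can actually encode.
constexpr bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```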
According to the Unicode FAQ it could be
one or two 16-bit code units
Windows uses 16-bit chars, probably because Unicode was originally 16-bit. So you don't have an exact map, but you might be able to get away with treating all strings you see as containing just 16-bit Unicode characters.
All characters in the Basic Multilingual Plane will be 2 bytes long.
Characters in other planes will be encoded into 4 bytes each, in the form of a surrogate pair.
Obviously, if a function does not try to detect surrogate pairs and blindly treats each pair of bytes as a character, it will bug out on strings that contain such pairs.
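Truncation is a typical place where this goes wrong. The sketch below contrasts a cut that ignores surrogate pairs with one that checks for them. Both function names are hypothetical, and the pair-aware version only guards the cut point; it does not validate the rest of the string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive truncation: keeps the first max_units code units, even if that
// cuts a surrogate pair in half and leaves an unpaired high surrogate.
inline std::u16string truncate_naive(const std::u16string& s, std::size_t max_units) {
    return s.substr(0, max_units);
}

// Pair-aware truncation: backs off by one unit when the cut would fall
// between a high surrogate and its low surrogate.
inline std::u16string truncate_safe(const std::u16string& s, std::size_t max_units) {
    if (max_units >= s.size()) return s;
    std::size_t n = max_units;
    const bool splits_pair = n > 0
        && s[n - 1] >= 0xD800 && s[n - 1] <= 0xDBFF   // last kept unit is a high surrogate
        && s[n] >= 0xDC00 && s[n] <= 0xDFFF;          // first dropped unit is its low surrogate
    if (splits_pair) --n;
    assert(n <= max_units);
    return s.substr(0, n);
}
```

With a string made of a, one non-BMP character and b, cutting to 2 units naively leaves a dangling 0xD83D at the end, while the pair-aware version returns just the a.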