What is the difference between UTF and UCS?
What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:
Internal representation inside the code
For string manipulation at run-time
For using the string for display purposes.
Best storage representation (i.e. in a file)
Best on wire transport format (Transfer between applications that may be on different architectures and have different standard locales)
What is the difference between UTF and UCS?
UCS encodings are fixed width, and are marked by how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.
UTF encodings are variable width, and marked by the minimum number of bits to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes for astral characters in UTF-16.
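To make the difference concrete, here is a minimal sketch (my own illustration, plain C++11) comparing how many code units one non-BMP character needs in UTF-16 versus UTF-32/UCS-4; UCS-2 cannot represent it at all:
#include <iostream>

int main() {
    // U+1F600 (an emoji) lies outside the BMP: UTF-16 needs a surrogate
    // pair (2 code units), UTF-32/UCS-4 stores it in a single code unit,
    // and UCS-2 cannot represent it at all.
    char16_t utf16[] = u"\U0001F600";
    char32_t utf32[] = U"\U0001F600";
    std::cout << (sizeof(utf16) / sizeof(utf16[0])) - 1 << " UTF-16 code units\n"; // 2
    std::cout << (sizeof(utf32) / sizeof(utf32[0])) - 1 << " UTF-32 code units\n"; // 1
}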
Internal representation inside the code
Best storage representation (i.e. in a file)
Best on wire transport format (Transfer between applications that may be on different architectures and have different standard locales)
For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.
Preferred internal representation will depend on your platform. In Windows, it is UTF-16. In UNIX, it is UCS-4. Each has its good points:
UTF-16 strings never use more memory than a UCS-4 string. If you store many large strings with characters primarily in the basic multi-lingual plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
UCS-4 is easier to reason about. Because UTF-16 characters might be split over multiple "surrogate pairs", it can be challenging to correctly split or render a string. UCS-4 text does not have this issue. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems because NULL bytes are not present in the middle of UTF-8 text -- they are in UTF-16 or UCS-4.
Have you read Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)?
I would suggest:
For representation in code, wchar_t or equivalent.
For storage representation, UTF-8.
For wire representation, UTF-8.
The advantage of UTF-8 in storage and wire situations is that machine endianness is not a factor. The advantage of using a fixed size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it.
UTC is Coordinated Universal Time, not a character set (I didn't find any charset called UTC).
For internal representation, you may want to use wchar_t for each character and std::wstring for strings. On Windows, wchar_t is 2 bytes (it is 4 bytes on most Unix-like systems), so seeking and random access within the BMP are fast.
For storage, if most of the data are not ASCII (i.e. code >= 128), you may want to use UTF-16 which is almost the same as serialized wstring and wchar_t.
Since UTF-16 can be little endian or big endian, for wire transport, try to convert it to UTF-8, which is architecture-independent.
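As a rough sketch of why UTF-8 sidesteps the endianness problem (my own illustration, not the answerer's code): every code point becomes a sequence of single bytes, so there is no byte order to worry about.
#include <string>

// Encode one code point as UTF-8 (surrogate/range validation omitted).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
// encode_utf8(U'\u20AC') (the euro sign) yields 0xE2 0x82 0xAC on every architecture.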
For the internal representation inside the code, it is best to write both European and non-European characters with universal character names:
\uNNNN
Characters in the range \u0020 to \u007E, and a little bit of whitespace (e.g. end of line), can be written as ordinary characters. Anything above \u0080, if written as an ordinary character, will compile only under your own code page (e.g. OK in France but breaking in Russia, OK in Russia but breaking in Japan, OK in China but breaking in the US, etc.).
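For example (a small sketch of my own; it assumes an execution character set that can represent the character, such as UTF-8):
#include <iostream>
#include <string>

int main() {
    // Portable: a universal character name compiles the same way
    // regardless of the compiler's source code page.
    std::string portable = "Stra\u00DFe";   // "Straße"
    // Fragile: the raw 'ß' byte(s) in the source file are interpreted
    // according to whatever code page the compiler assumes.
    std::string fragile = "Straße";
    std::cout << portable << '\n';
}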
My platform is a Mac. I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.
I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoid wchar_t as there's no char8_t right now for UTF-8.
However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.
Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?
Unicode Glossary
Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:
Code Points: Code Points are the basic building blocks of Unicode; a code point is just an integer mapped to a meaning. The integer fits into 32 bits (21 bits, really), and the meaning can be a letter, a diacritic, a whitespace character, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.
Those are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.
UTF Primer
Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.
In UTF-X, X is the size in bits of the Code Unit, each Code Point is represented as one or several Code Units, depending on its magnitude:
UTF-8: 1 to 4 Code Units,
UTF-16: 1 or 2 Code Units,
UTF-32: 1 Code Unit.
std::string and std::wstring.
Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.
If you are only reading or composing strings, you should have little to no trouble with std::string or std::wstring.
Troubles start when you start slicing and dicing, then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Clusters boundaries. The former can be handled easily enough on your own, the latter requires using a Unicode aware library.
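Handling Code Point boundaries yourself in UTF-8 can look like this (a hypothetical helper of mine, assuming the input is valid UTF-8): continuation bytes always have the bit pattern 10xxxxxx, so you simply skip over them.
#include <cstddef>
#include <string>

// Advance from index i to the start of the next code point.
std::size_t next_code_point(const std::string& s, std::size_t i) {
    ++i;
    while (i < s.size() &&
           (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)  // continuation byte
        ++i;
    return i;
}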
Picking std::string or std::u32string?
If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.
If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
UTF-8 in std::string.
UTF-8 actually works quite well in std::string.
Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
str.find('\n') works,
str.find("...") works for matching byte by byte1,
str.find_first_of("\r\n") works if searching for ASCII characters.
Similarly, regex should mostly work out of the box. Since a sequence of characters ("哈哈", i.e. "haha") is just a sequence of bytes, basic search patterns should work out of the box.
Be wary, however, of character classes (such as [:alphanum:]), as depending on the regex flavor and implementation it may or may not match Unicode characters.
Similarly, be wary of applying repeaters to non-ASCII "characters", "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
1 The key concepts to look-up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
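To make the repeater caveat above concrete, here is a small sketch (my own; it assumes a UTF-8 source and execution character set, e.g. macOS/Linux or MSVC with /utf-8):
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string one = "哈";                    // one character, three UTF-8 bytes

    std::regex last_byte_optional("哈哈?");     // '?' applies only to the last byte of the second 哈
    std::regex whole_char_optional("哈(哈)?");  // '?' applies to the whole 3-byte sequence

    std::cout << std::boolalpha
              << std::regex_match(one, last_byte_optional) << '\n'    // false
              << std::regex_match(one, whole_char_optional) << '\n';  // true
}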
std::string and friends are encoding-agnostic. The only difference between std::wstring and std::string are that std::wstring uses wchar_t as the individual element, not char. For most compilers the latter is 8-bit. The former is supposed to be large enough to hold any unicode character, but in practice on some systems it isn't (Microsoft's compiler, for example, uses a 16-bit type). You can't store UTF-8 in std::wstring; that's not what it's designed for. It's designed to be an equivalent of UTF-32 - a string where each element is a single Unicode codepoint.
If you want to index UTF-8 strings by Unicode codepoint or composed unicode glyph (or some other thing), count the length of a UTF-8 string in Unicode codepoints or some other unicode object, or find by Unicode codepoint, you're going to need to use something other than the standard library. ICU is one of the libraries in the field; there may be others.
Something that's probably worth noting is that if you're searching for ASCII characters, you can mostly treat a UTF-8 bytestream as if it were byte-by-byte. Each ASCII character encodes the same in UTF-8 as it does in ASCII, and every multi-byte unit in UTF-8 is guaranteed not to include any bytes in the ASCII range.
Consider upgrading to C++20 and std::u8string, which is the best thing we have as of 2019 for holding UTF-8. There are no standard library facilities to access individual code points or grapheme clusters, but at least the type is strong enough to say that it really holds UTF-8.
Both std::string and std::wstring must use UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.
For both, size tracks the number of code units instead of the number of code points, or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)
Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.
The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.
Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.
Should I go ahead with std::string or switch to std::wstring?
I would recommend using std::string because wchar_t is non-portable and C++20 char8_t is poorly supported in the standard and not supported by any system APIs at all (and likely never will be, for compatibility reasons). On most platforms, including the macOS you are using, normal char strings are already UTF-8.
Most of the standard string operations work with UTF-8 but operate on code units. If you want a higher-level API you'll have to use something else such as the text library proposed to Boost.
Specifically, given the following:
A pointer to a buffer containing string data in some encoding X
supported by ICU
The length of the data in the buffer, in bytes
The encoding of the buffer (i.e. X)
Can I compute the length of the string, minus the trailing space/tab characters, without actually converting it into ICU's internal encoding first, then converting back? (this itself could be problematic due to unicode normalizations).
For certain encodings, such as any ASCII-based encoding along with UTF-8/16/32, the solution is pretty simple: just iterate from the back of the string, comparing 1/2/4 bytes at a time against the two constants.
For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.
For a large subset of encodings, and for the limited set of U+0020 SPACE and U+0009 HORIZONTAL TAB, this is pretty simple.
In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these characters all have the same value. You can simply work backwards, byte-by-byte, lopping them off as long as the value is either 9 or 32.
This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see this value, you don't have to wonder if it's part of a surrogate pair, as those always have a distinct value.
The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with a value 9, you don't necessarily know whether it's a tab character or a random byte from a multibyte encoding. Even for some of these, however, it may be possible that the specific values you care about (9 and 32) are unique. For example, looking at Windows code page 950, it seems that lead bytes have the high value set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.
For the general problem of stripping out an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly if you're careful to use the K normalization forms.
I use the rather simplistic STL approach of:
std::string mystring;
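// erase from just past the last non-whitespace character to the end
// (if the string is all whitespace, find_last_not_of returns npos and npos + 1 == 0, so everything is erased)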
mystring.erase(mystring.find_last_not_of(" \n\r\t")+1);
It has worked for all my needs so far (your mileage may vary), and after years of using it, it still does the job :)
Let me know if you need more information:)
If you restrict the "arbitrary encoding" requirement to "any encoding that uses the same code values for space and tab as ASCII", which is still rather general, you don't even need ICU at all. boost::trim_right or boost::trim_right_if is all you need.
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/usage.html#idp206822440
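A minimal sketch of the trim_right_if variant (the header is boost/algorithm/string.hpp; here I only trim spaces and tabs):
#include <boost/algorithm/string.hpp>
#include <string>

int main() {
    std::string s = "hello world \t ";
    // Remove only trailing spaces and tabs; this is safe for UTF-8 and any
    // other encoding that keeps the ASCII values of ' ' and '\t'.
    boost::trim_right_if(s, boost::is_any_of(" \t"));
    // s == "hello world"
}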
I have to code in an application which is in Unicode UTF-8 in Windows, MSVC 10. I'm aware that UTF-8 encoded strings use 1 to 4 bytes per character. So, my question is: Is std::string suitable for this? If so, how do I decode the strings? As far as I understand, std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which help me to extract logical characters from the string?
e.g : If I have the string "olé" in std::string, I need to know that the length is 3, but not 4.
A commonly used library is ICU (International Components for Unicode).
Yes, std::string is appropriate, but as you've noticed it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn't necessarily bad (in fact, it does have some advantages, see the links below for information) but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters (it converts according to the current locale, so a UTF-8 locale needs to be in effect). Then, if you want the result to be 8 bits, you'll lose data whenever there are characters outside of the ISO-8859-1 range (also called Latin-1).
Note that the "Windows" encoding is not 1 to 1 equivalent to ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
Okay, if you just want the length in characters, the mblen() function can help. Be aware that mblen() returns the number of bytes making up a single multibyte character, so a lone call such as
len = mblen(str.c_str(), str.length());
only measures the first character; to count the whole string you need to call it in a loop, advancing by the returned length each time.
Additional note: an easy way to count the characters yourself is to count the number of bytes that are not between 0x80 and 0xBF, since those are the continuation bytes of multi-byte sequences. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.
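That counting trick looks roughly like this (a hypothetical helper, assuming valid UTF-8 input):
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (0x80-0xBF,
// i.e. bytes of the form 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
// utf8_length("ol\xC3\xA9") == 3 ("olé"), even though the string holds 4 bytes.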
Just curious about the encodings the system uses when storing strings (if it cares at all) and when printing them.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
Question 2: If so, std::string and std::wstring must have knowledge of the encoding themselves (although another guy told me this is NOT true)? Otherwise, how are they able to translate the chars to the correct integers and store them? How does the system know the encoding?
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")? I guess the same mechanism matters?
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
(I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory)
Not quite. Make sure you understand one important distinction.
A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.
Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.
The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days when everyone (except East Asians) used single-byte encodings. But nowadays, we have Unicode, and its encoding schemes which have up to 4 bytes per character.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?
Yes, it will. Suppose you have std::string euro = "€"; Then:
With the windows-1252 encoding, the string will be encoded as the byte 0x80.
With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
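A small way to see those bytes for yourself (my own sketch; the UTF-8 bytes are spelled out as escapes so the example does not depend on the compiler's code page):
#include <cstdio>

int main() {
    const char euro[] = "\xE2\x82\xAC";   // UTF-8 encoding of U+20AC EURO SIGN
    for (const char* p = euro; *p != '\0'; ++p)
        std::printf("0x%02X ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
    std::printf("\n");                    // prints: 0xE2 0x82 0xAC
}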
Question 3: What is the default encoding in one particular system, and how to change it (Is it so-called "locale")?
Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.
~$ echo $LANG
en_US.utf8
Windows has a GetACP function to get the "ANSI" code page number.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
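You can query both from a program (a minimal sketch using the documented GetACP and GetConsoleOutputCP calls; the example values in the comments are common defaults, not guarantees):
#include <windows.h>
#include <cstdio>

int main() {
    std::printf("ANSI code page: %u\n", GetACP());                       // e.g. 1252
    std::printf("Console (OEM) code page: %u\n", GetConsoleOutputCP());  // e.g. 437 or 850
}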
Encoding and decoding are inherently the same kind of process: both transform one integral sequence into another integral sequence.
The difference between encoding and decoding is on the conceptual level. When you "decode" a character, you transform an integral sequence encoded in a known encoding ("string") into a system-specific integral sequence ("text"). And when you "encode", you're transforming a system-specific integral sequence ("text") into an integral sequence encoded in a particular encoding ("string").
This difference is conceptual, and not physical, the memory still holds a decoded "text" as a "string"; however since a particular system always represent "text" in a particular encoding, text transformations would not need to deal with the specificities of the actual system encoding, and can safely assume to be able to work on a sequence of conceptual "characters" instead of "bytes".
Generally, however, the encoding used for "text" has properties that make it easy to work with (e.g. fixed-length characters, a simple one-to-one mapping between characters and byte sequences, etc.), while the encoded "string" uses a space-efficient encoding (e.g. variable-length characters, context-dependent encoding, etc.).
Joel On Software has a writeup on this: http://www.joelonsoftware.com/articles/Unicode.html
This one is a good one as well: http://www.jerf.org/programming/encoding.html
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
You're sort of thinking about this backwards. Different encodings interpret the underlying integers as different characters (or parts of characters, if we're talking about a multi-byte character set), depending on the encoding.
Question 2: If so, std::string and std::wstring must have knowledge of the encoding themselves (although another guy told me this is NOT true)? Otherwise, how are they able to translate the chars to the correct integers and store them? How does the system know the encoding?
Both std::string and std::wstring are completely encoding agnostic. As far as C++ is concerned, they simply store arrays of char objects and wchar_t objects respectively. The only requirement is that char is one-byte, and wchar_t is some implementation-defined width. (Usually 2 bytes on Windows and 4 on Linux/UNIX)
Question 3: What is the default encoding in one particular system, and how to change it (Is it so-called "locale")?
That depends on the platform. ISO C++ only talks about the global locale object, std::locale(), which generally refers to your current system-specific settings.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Generally, if you output to the screen through stdout, the characters you see displayed are interpreted and rendered according to your system's current locale settings.
Any one working with encodings should read this Joel on Software article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I found it useful when I started working with encodings.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use?
C/C++ programmers are used to thinking of characters as bytes, because almost everyone starts out with the ASCII character set, which maps the integers 0-127 to symbols such as the letters of the alphabet and Arabic numerals. The fact that the C char datatype is actually a byte doesn't help matters.
The std::string class stores data as 8-bit integers, and std::wstring stores data in 16-bit integers. Neither class contains any concept of encoding. You can use any 8-bit encoding such as ASCII, UTF-8, Latin-1, Windows-1252 with a std::string, and any 8-bit or 16-bit encoding, such as UTF-16, with a std::wstring.
Data stored in std::string and std::wstring must always be interpreted by some encoding. This generally comes into play when you interact with the operating system: reading or writing data from a file, a stream, or making OS API calls that interact with strings.
So to answer your question, if you store the same byte in a std::string and a std::wstring, the memory will contain the same value (except the wstring will contain a null byte), but the interpretation of that byte will depend on the encoding in use.
If you store the same character in each of the strings, then the bytes may be different, again depending on the encoding. For example, the Euro symbol (€) might be stored in the std::string using the UTF-8 encoding, which corresponds to the bytes 0xE2 0x82 0xAC. In the std::wstring, it might be stored using the UTF-16 encoding, which would be the single 16-bit code unit 0x20AC.
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")? I guess the same mechanism matters?
Yes, the locale determines how the OS interprets strings at its API boundaries. Locales define more than just the encoding; they also include information on how money, dates, times, and other things should be formatted. On Linux or OS X, you can use the locale command in the terminal to see what the current locale is:
mch#bohr:/$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
So in this case, my locale is Canadian English. Each locale defines an encoding used to interpret strings. In this case the locale name makes it clear that it is using a UTF-8 encoding, but you can run locale -ck LC_CTYPE to see more information about the current encoding:
mch#bohr:/$ locale -ck LC_CTYPE
LC_CTYPE
ctype-class-names="upper";"lower";"alpha";"digit";"xdigit";"space";"print";"graph";"blank";"cntrl";"punct";"alnum";"combining";"combining_level3"
ctype-map-names="toupper";"tolower";"totitle"
ctype-width=16
ctype-mb-cur-max=6
charmap="UTF-8"
... output snipped ...
If you want to test a program using encodings, you can set the LC_ALL environment variable to the locale you want to use. You can also change the locale using setlocale. Permanently changing the locale depends on your distribution.
On Windows, most API functions come in a narrow and a wide format. For example, GetCurrentDirectory comes in GetCurrentDirectoryW (Unicode) and GetCurrentDirectoryA (ANSI) variants. Unicode, in this context, means UTF-16.
I don't know enough about Windows to tell you how to set the locale, other than to try the languages control panel.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
When you print a string to std::cout, the OS will interpret that string in the encoding set by the locale. If your string is UTF-8 encoded and the OS is using Windows-1252, it will be necessary to convert it to that encoding. One way to do this is with the iconv library.
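A minimal sketch of that conversion with POSIX iconv (the helper name is mine; error handling is kept short, and characters with no Windows-1252 equivalent will simply make the call fail):
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

std::string utf8_to_cp1252(const std::string& in) {
    iconv_t cd = iconv_open("WINDOWS-1252", "UTF-8");     // to-encoding, from-encoding
    if (cd == (iconv_t)-1)
        throw std::runtime_error("conversion not supported");

    std::string out(in.size(), '\0');                     // CP1252 output is never longer than the UTF-8 input
    char* src = const_cast<char*>(in.data());             // some platforms declare this parameter const char**
    char* dst = &out[0];
    std::size_t src_left = in.size(), dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("conversion failed");

    out.resize(out.size() - dst_left);
    return out;
}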
I am confused about Windows BSTRs and WCHARs, etc. WCHAR is a 16-bit character intended to allow for Unicode characters. What about characters that take more than 16 bits to represent? Some UTF-8 characters require more than that. Is this a limitation of Windows?
Edit: Thanks for all the answers. I think I understand the Unicode aspect. I am still confused on the Windows/WCHAR aspect though. If WCHAR is a 16-bit char, does Windows really use 2 of them to represent code points bigger than 16 bits, or is the data truncated?
UTF-8 is not the encoding used in Windows' BSTR or WCHAR types. Instead, they use UTF-16, which defines each code point in the Unicode set using either 1 or 2 WCHARs. 2 WCHARs gives exactly the same amount of code points as 4 bytes of UTF-8.
So there is no limitation in Windows character set handling.
UTF-8 is an encoding of a Unicode character (code point). You may want to read this excellent FAQ on the subject. To answer your question though, BSTRs are always encoded as UTF-16. If you have UTF-32 encoded strings, you will have to transcode them first.
As others have mentioned, the FAQ has a lot of great information on unicode.
The short answer to your question, however, is that a single Unicode character may require more than one 16-bit code unit to represent it. This is also how UTF-8 works; any Unicode character that falls outside the range that a single byte is able to represent uses two (or more) bytes.
BSTR simply contains 16 bit code units that can contain any UTF-16 encoded data. As for the OS, Windows has supported surrogate pairs since XP. See the Dr International FAQ
The Unicode standard defines somewhere over a million unique code-points (each code-point represents an 'abstract' character or symbol - e.g. 'E', '=' or '~').
The standard also defines several methods of encoding those million code-points into commonly used fundamental data types, such as 8-bit characters or 16-bit wchars.
The two most widely used encodings are utf-8 and utf-16.
utf-8 defines how to encode unicode code points into 8-bit chars. Each unicode code-point will map to between 1 and 4 8-bit chars.
utf-16 defines how to encode unicode code points into 16-bit words (WCHAR in Windows). Most code-points will map onto a single 16-bit WCHAR, but there are some that require two WCHARs to represent.
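The mapping onto two WCHARs is simple arithmetic; here is a sketch (my own, not from the answer) for U+1F600:
#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t cp = 0x1F600;                       // a code point above U+FFFF
    std::uint32_t v = cp - 0x10000;                   // 20 bits remain
    std::uint16_t high = 0xD800 + (v >> 10);          // top 10 bits -> high surrogate
    std::uint16_t low  = 0xDC00 + (v & 0x3FF);        // bottom 10 bits -> low surrogate
    std::printf("0x%04X 0x%04X\n",
                static_cast<unsigned>(high),
                static_cast<unsigned>(low));          // prints: 0xD83D 0xDE00
}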
I recommend taking a look at the Unicode standard, and especially the FAQ (http://unicode.org/faq/utf_bom.html)
Windows has used UTF-16 as its native representation since Windows 2000; prior to that it used UCS-2. UTF-16 supports any Unicode character; UCS-2 only supports the BMP. i.e. it will do the right thing.
In general, though, it doesn't matter much, anyway. For most applications strings are pretty opaque, and just passed to some I/O mechanism (for storage in a file or database, or display on-screen, etc.) that will do the Right Thing. You just need to ensure you don't damage the strings at all.