C++: How to support surrogate characters in UTF-8

We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes). However, there is a requirement that it needs to support surrogate pairs.
I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?
If so, what are the steps to change my application's default encoding to UTF-16 rather than UTF-8?
I don't have a code snippet, as the entire application was written with UTF-8 in mind, not surrogate characters.
What would I need to change throughout the code to either support surrogate pairs in UTF-8 or change the default encoding to UTF-16?

We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes).
Why not the entire Unicode repertoire (4 bytes)? Why limit yourself to 3 bytes? Three bytes gets you code points only up to U+FFFF; four bytes gets you an additional 1,048,576 code points, all the way up to U+10FFFF.
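For reference, a minimal sketch (not from the original answer) of how a single code point up to U+10FFFF maps to 1 to 4 UTF-8 bytes; encode_utf8 is a hypothetical helper, not part of any library discussed here:

// Sketch only: encode one Unicode code point (up to U+10FFFF) as UTF-8.
#include <string>

std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                 // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {        // 3 bytes: the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // 4 bytes: U+10000..U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
// e.g. encode_utf8(0x1F600) yields the four bytes F0 9F 98 80.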
However, there is a requirement that it needs to support surrogate pairs.
Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).
I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?
The codepoints that are used for encoding surrogates can be physically encoded in UTF-8, however they are reserved by the Unicode standard and are illegal to use outside of UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints in it should be considered malformed.
If so, what are the steps to change my application's default encoding to UTF-16 rather than UTF-8?
We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.
However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you support surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many many more codepoints than that.
It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
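A hedged sketch of the surrogate-pair handling that conversion needs, assuming the 16-bit data arrives as char16_t units; the result would then be fed to a 4-byte-capable UTF-8 encoder such as the encode_utf8 sketch above (error handling for unpaired surrogates is omitted):

// Sketch only: combine a UTF-16 surrogate pair into a single code point.
// A high surrogate is 0xD800..0xDBFF, a low surrogate is 0xDC00..0xDFFF.
char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
// e.g. the pair D83D DE00 combines to U+1F600.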

We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes). However, there is a requirement that it needs to support surrogate pairs.
So convert the utf-16 encoded strings to utf-8. Documentation here: http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
If so, what are the steps to change my application's default encoding to UTF-16 rather than UTF-8?
Wrong question. Use UTF-8 internally.
What would I need to change throughout the code to either support surrogate pairs in UTF-8 or change the default encoding to UTF-16?
See above. Convert UTF-16 to UTF-8 for inbound data and convert back to UTF-16 outbound when necessary.
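A minimal sketch of that round trip using the codecvt facet linked above (note that <codecvt> is deprecated since C++17, though still widely available):

#include <codecvt>
#include <locale>
#include <string>

// Converts between UTF-16 (char16_t) and UTF-8 using the standard facet.
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

std::string    utf8  = conv.to_bytes(u"inbound UTF-16 text");  // UTF-16 -> UTF-8
std::u16string utf16 = conv.from_bytes(utf8);                  // UTF-8 -> UTF-16, when necessary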

Related

How to detect unicode file names in Linux

I have a Windows application written in C++. In it, we check whether a file name is Unicode by using the wcstombs() function; if the conversion fails, we assume the file name is Unicode. However, when I tried the same on Linux, the conversion doesn't fail. I know that on Windows the default charset is Latin, whereas the default charset on Linux is UTF-8. We run different code depending on whether the file name is Unicode or not, and since I can't figure this out on Linux, I can't make my application portable for Unicode characters. Is there any other workaround, or am I doing something wrong?
UTF-8 has the nice property that all ASCII characters are represented as in ASCII, and all non-ASCII characters are represented as sequences of two or more bytes >= 128. So all you have to check is the numerical magnitude of each unsigned byte: if it is >= 128, the character is non-ASCII, which with UTF-8 as the basic encoding means "Unicode" (even if it is within the range of Latin-1; note that Latin-1 is a proper subset of Unicode, constituting the first 256 code points).
However, note that while in Windows a filename is a sequence of characters, in *nix it is a sequence of bytes, so ideally you should ignore what those bytes might encode. That might be difficult to reconcile with a naïve user's view, though.
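A minimal sketch of that byte check (the helper name is made up for illustration):

#include <string>

// True if the file name contains any byte >= 128, i.e. anything outside
// plain ASCII. With UTF-8 as the system encoding, this is exactly the
// "contains non-ASCII characters" test described above.
bool has_non_ascii(const std::string& name) {
    for (unsigned char c : name)
        if (c >= 128) return true;
    return false;
}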

Canonical Unicode string form

I have a Unicode string encoded, say, as UTF-8. One Unicode string can have several byte representations. I wonder, is there, or can there be created, a canonical (normalized) form of a Unicode string, so that we can e.g. compare such strings with memcmp(3)? Can e.g. ICU or any other C/C++ library do that?
You might be looking for Unicode normalisation. There are essentially four different normal forms that each ensure that all equivalent strings have a common form afterwards. However, in many instances you need to take locale into account as well, so while this may be a cheap way of doing a byte-to-byte comparison (if you ensure the same Unicode transformation format, like UTF-8 or UTF-16 and the same normal form) it won't gain you much apart from that limited use case.
Comparing Unicode codepoint sequences:
UTF-8 is a canonical representation itself. Two Unicode strings that are composed of the same Unicode code points will always be encoded to exactly the same UTF-8 byte sequence and thus can be compared with memcmp. This is a necessary property of the UTF-8 encoding; otherwise it would not be easily decodable. But we can go further: this is true for all official Unicode encoding schemes, UTF-8, UTF-16 and UTF-32. They encode a string to different byte sequences, but each one always encodes the same string to the same sequence. If you consider endianness and platform independence, UTF-8 is the recommended encoding scheme because you don't have to deal with byte order when reading or writing 16-bit or 32-bit values.
So the answer is that if two strings are encoded with the same encoding scheme (e.g. UTF-8) and the same endianness (not an issue with UTF-8), the resulting byte sequences will be the same.
Comparing Unicode strings:
There is another issue that is more difficult to handle. In Unicode, some glyphs (the characters you see on screen or paper) can be represented by a single code point or by a combination of two consecutive code points (called combining characters). This is usually true for glyphs with accents, diacritic marks, etc. Because of the different code point representations, their corresponding byte sequences will differ. Comparing strings while taking these combining characters into consideration cannot be done with a simple byte comparison; first you have to normalize them.
The other answers mention some Unicode normalization techniques, canonical forms and libraries that you can use for converting Unicode strings to their normal form. Then you will be able to compare them byte-by-byte with any encoding scheme.
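A small illustration (not from the original answer): 'é' can be the single code point U+00E9 or the sequence U+0065 U+0301, and the two encode to different UTF-8 bytes even though they render identically. This assumes a UTF-8 execution character set (the default for GCC and Clang):

#include <string>

std::string precomposed = "\u00E9";   // é as one code point -> bytes C3 A9
std::string combining   = "e\u0301";  // 'e' + COMBINING ACUTE ACCENT -> bytes 65 CC 81

// A byte comparison sees them as different until they are normalized:
bool same_bytes = (precomposed == combining);   // false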
You're looking to normalize the string to one of the Unicode normalization forms. libicu can do this for you, but not on a UTF-8 string. You have to first convert it to UChar, using e.g. ucnv_toUChars, then normalize with unorm_normalize, then convert back using ucnv_fromUChars. I think there's also some specific version of ucnv_* for UTF-8 encoding.
If memcmp is your only goal you can of course do that directly on the UChar array after unorm_normalize.
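A hedged sketch of that flow using ICU's C++ API instead of the C functions named above (Normalizer2 and UnicodeString; error checks omitted):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

// Sketch: NFC-normalize a UTF-8 string with ICU, so the results can be
// compared byte-for-byte (e.g. with memcmp or operator==).
std::string normalize_nfc(const std::string& utf8) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString normalized = nfc->normalize(s, status);
    std::string out;
    normalized.toUTF8String(out);   // back to UTF-8
    return out;
}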

How to UTF-8 encode a character/string

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL encodes a standard string, which works perfectly for all special characters such as !##$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much only converts the character to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
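On Windows, a minimal sketch of the WideCharToMultiByte route (error handling omitted; assumes the input is already a UTF-16 wide string):

#include <windows.h>
#include <string>

// Sketch: UTF-16 wide string -> UTF-8 std::string via the Win32 API.
std::string to_utf8(const std::wstring& wide) {
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                        &utf8[0], len, nullptr, nullptr);
    utf8.pop_back();   // drop the trailing NUL the API writes
    return utf8;
}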
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
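Once the string is held as UTF-8 bytes, the URL escaping the question asks about is just percent-encoding each byte; a rough sketch follows (the unreserved-character set is simplified here):

#include <cctype>
#include <cstdio>
#include <string>

// Sketch: percent-encode every byte that is not an unreserved ASCII character.
// For "bar é" already stored as UTF-8, this yields "bar%20%C3%A9".
std::string url_encode(const std::string& utf8) {
    std::string out;
    for (unsigned char c : utf8) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);
            out += buf;
        }
    }
    return out;
}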

Read Unicode Files

I have a problem reading and using the content from unicode files.
I am working on a Unicode release build, and I am trying to read the content of a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or other legacy code page) you will run into the risk of corrupting/losing data.
Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).
So your file might be utf-16, or utf-8 (utf-32 is quite rare).
For utf-16 the endianness might also matter. If there is a BOM, that will help a lot.
Quick steps:
open the file with _wopen or _wfopen in binary mode
read the first bytes to identify encoding using the BOM
if the encoding is utf-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8
if the encoding is utf-16be (big endian) read in a wchar_t array and _swab
if the encoding is utf-16le (little endian) read in a wchar_t array and you are done
Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rw, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
Warning: being cross-platform is problematic; wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
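A condensed sketch of those steps, assuming Windows where wchar_t is 2 bytes (error handling and the no-BOM and UTF-16BE cases are only hinted at):

#include <windows.h>
#include <cstdio>
#include <string>
#include <vector>

// Sketch: open as binary, sniff the BOM, convert UTF-8 content to wchar_t.
std::wstring read_unicode_file(const wchar_t* path) {
    FILE* f = _wfopen(path, L"rb");
    if (!f) return L"";
    std::vector<unsigned char> bytes;
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        bytes.insert(bytes.end(), buf, buf + n);
    fclose(f);

    if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        // UTF-8 BOM: convert the payload with MultiByteToWideChar and CP_UTF8
        const char* src = reinterpret_cast<const char*>(bytes.data()) + 3;
        int srcLen = static_cast<int>(bytes.size()) - 3;
        int len = MultiByteToWideChar(CP_UTF8, 0, src, srcLen, nullptr, 0);
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, src, srcLen, &out[0], len);
        return out;
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        // UTF-16LE BOM: on Windows the payload is already wchar_t-shaped
        return std::wstring(reinterpret_cast<const wchar_t*>(bytes.data()) + 1,
                            (bytes.size() - 2) / sizeof(wchar_t));
    }
    // UTF-16BE (FE FF) would additionally need _swab; no BOM means guessing.
    return L"";
}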
Useful links:
BOM (http://unicode.org/faq/utf_bom.html)
wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
We'll need more information to answer the question (for example, are you trying to read the Unicode file into a char buffer or a wchar_t buffer? What encoding does the file use?), but for now you might want to make sure you're not running into this issue if your file is Unicode and you're using fgetws in text mode.
When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
Unicode is the mapping from numerical codes to characters. The step before Unicode is the file's encoding: how do you transform consecutive bytes into a numerical code? You have to check whether the file is stored as big-endian, little-endian or something else.
Often, the BOM (byte order mark) is written as the first two bytes of the file: either FF FE or FE FF.
The intended way of handling charsets is to let the locale system do it.
You have to have set the correct locale before opening your stream.
BTW, you tagged your question C++, but you wrote about fgets and fgetws, not IOStreams; is your problem C++ or C?
For C:
#include <locale.h>
setlocale(LC_ALL, ""); /* at least LC_CTYPE */
For C++
#include <locale>
std::locale::global(std::locale(""));
Then wide IO (wstream, fgetws) should work if your environment is correctly set up for Unicode. If not, you'll have to change your environment (I don't know how it works under Windows; for Unix, setting the LC_ALL variable is the way, see locale -a for the supported values). Alternatively, replacing the empty string with an explicit locale name would also work, but then you hardcode the locale in your program and your users perhaps won't appreciate that.
If your system doesn't support an adequate locale, in C++ you have the possibility of writing a facet for the conversion yourself. But that is outside the scope of this answer.
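For instance, a minimal sketch assuming the environment provides a UTF-8 locale:

#include <fstream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale(""));  // take the locale from the environment
    std::wifstream in("input.txt");
    in.imbue(std::locale());               // redundant here, but needed for streams created before the global call
    std::wstring line;
    while (std::getline(in, line)) {
        // 'line' now holds wide characters decoded by the locale's facet
    }
}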
You CANNOT reliably convert Unicode, even UTF-8, to ASCII. The character sets ('planes' in Unicode documentation) do not map back to ASCII - that's why Unicode exists in the first place.
First: I assume you are trying to read UTF-8-encoded Unicode (since you can read some characters). You can check this, for example, in Notepad++.
For your problem, I'd suggest using some sort of library. You could try Qt; QFile supports Unicode (as does the rest of the library).
If this is too much, use a dedicated Unicode library, for example: http://utfcpp.sourceforge.net/.
And learn about Unicode: http://en.wikipedia.org/wiki/Unicode. There you'll find references to the different Unicode encodings.

Can BSTR's hold characters that take more than 16 bits to represent?

I am confused about Windows BSTRs and WCHARs, etc. WCHAR is a 16-bit character intended to allow for Unicode characters. What about characters that take more than 16 bits to represent? Some UTF-8 characters require more than that. Is this a limitation of Windows?
Edit: Thanks for all the answers. I think I understand the Unicode aspect. I am still confused about the Windows/WCHAR aspect, though. If WCHAR is a 16-bit char, does Windows really use two of them to represent code points bigger than 16 bits, or is the data truncated?
UTF-8 is not the encoding used in Windows' BSTR or WCHAR types. Instead, they use UTF-16, which defines each code point in the Unicode set using either 1 or 2 WCHARs. 2 WCHARs gives exactly the same amount of code points as 4 bytes of UTF-8.
So there is no limitation in Windows character set handling.
UTF8 is an encoding of a Unicode character (codepoint). You may want to read this excellent faq on the subject. To answer your question though, BSTRs are always encoded as UTF-16. If you have UTF-32 encoded strings, you will have to transcode them first.
As others have mentioned, the FAQ has a lot of great information on unicode.
The short answer to your question, however, is that a single unicode character may require more than one 16bit character to represent it. This is also how UTF-8 works; any unicode character that falls outside the range that a single byte is able to represent uses two (or more) bytes.
BSTR simply contains 16 bit code units that can contain any UTF-16 encoded data. As for the OS, Windows has supported surrogate pairs since XP. See the Dr International FAQ
The Unicode standard defines somewhere over a million unique code-points (each code-point represents an 'abstract' character or symbol - e.g. 'E', '=' or '~').
The standard also defines several methods of encoding those million code-points into commonly used fundamental data types, such as 8-bit chars or 16-bit wchars.
The two most widely used encodings are utf-8 and utf-16.
utf-8 defines how to encode unicode code points into 8-bit chars. Each unicode code-point will map to between 1 and 4 8-bit chars.
utf-16 defines how to encode unicode code points into 16-bit words (WCHAR in Windows). Most code-points will map onto a single 16-bit WCHAR, but there are some that require two WCHARs to represent.
I recommend taking a look at the Unicode standard, and especially the FAQ (http://unicode.org/faq/utf_bom.html)
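To make the "some require two WCHARs" point concrete, here is a small sketch (not from the original answers) that counts code points in a UTF-16 buffer such as a BSTR, assuming well-formed input:

#include <cstddef>

// Sketch: count Unicode code points in a UTF-16 string of 'len' 16-bit units.
// A high surrogate (0xD800..0xDBFF) followed by a low surrogate counts as a
// single code point occupying two units.
size_t count_codepoints(const wchar_t* s, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; ++i, ++count) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len)
            ++i;   // skip the low surrogate of the pair
    }
    return count;
}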
Windows has used UTF-16 as its native representation since Windows 2000; prior to that it used UCS-2. UTF-16 supports any Unicode character; UCS-2 only supports the BMP. i.e. it will do the right thing.
In general, though, it doesn't matter much, anyway. For most applications strings are pretty opaque, and just passed to some I/O mechanism (for storage in a file or database, or display on-screen, etc.) that will do the Right Thing. You just need to ensure you don't damage the strings at all.