How to UTF-8 encode a character/string - c++

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL encodes a standard string, which works perfectly for all special characters such as !@#$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much only converts each byte to its hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks

If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.

To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two-byte sequence C3 A9, and UTF-16 says that same character is the single double-byte value 00E9: one 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+00E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
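The Latin-1 shortcut described above can be sketched as follows. This is only a sketch, assuming the input really is ISO-8859-1 (Windows-1252 differs in the 0x80-0x9F range and would need a lookup table); the helper names latin1_to_utf8 and percent_encode are made up for illustration and are not part of any Twitter library.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8. Every Latin-1 byte value is
// also its Unicode code point, so only 1- and 2-byte UTF-8 forms can occur.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}

// Percent-encode every byte that is not an RFC 3986 "unreserved" character.
std::string percent_encode(const std::string& bytes) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

With these two helpers, percent_encode(latin1_to_utf8("bar \xE9")) gives "bar%20%C3%A9"; the space is escaped too, because this sketch escapes everything outside the unreserved set.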

Related

c++: How to support surrogate characters in utf8

We have an application that is written utf-8 base encoding and this supports the utf-8 BMP (3-bytes). However, there is a requirement where it needs to support Surrogate pairs.
I have read somewhere that Surrogate characters are not supported in utf-8. Is it true?
If yes, what are the steps to make my application to have the default encoding of utf-16 rather than being utf-8?
I don't have code snippet as the entire application is written by keeping utf-8 in mind and not surrogate characters.
What are the items that I would need to change in the entire code to get either the support of surrogate pairs in utf-8. Or changing the default encoding to UTF-16.
We have an application that is written utf-8 base encoding and this supports the utf-8 BMP (3-bytes).
Why not the entire Unicode repertoire (4 bytes)? Why limited to only 3 bytes? 3 bytes gets you support for codepoints only up to U+FFFF. 4 bytes gets you support for an additional 1048576 codepoints, all the way up to U+10FFFF.
However, there is a requirement where it needs to support Surrogate pairs.
Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).
I have read somewhere that Surrogate characters are not supported in utf-8. Is it true?
The codepoints that are used for encoding surrogates can be physically encoded in UTF-8, however they are reserved by the Unicode standard and are illegal to use outside of UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints in it should be considered malformed.
If yes, what are the steps to make my application to have the default encoding of utf-16 rather than being utf-8?
We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.
However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you support surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many many more codepoints than that.
It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
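As a rough illustration of the UCS-2 to UTF-16 upgrade suggested above, the sketch below combines surrogate pairs into a single code point and emits the 4-byte UTF-8 form when needed. The function names are hypothetical, and replacing unpaired surrogates with U+FFFD is one possible policy, not something the question specifies.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point (1 to 4 bytes).
// Assumes cp is a valid scalar value no greater than U+10FFFF.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 to UTF-8. A high surrogate followed by a low surrogate is
// combined into one code point; an unpaired surrogate becomes U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {                       // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                                              // consumed the low half
            } else {
                cp = 0xFFFD;                                      // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                                          // unpaired low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For example, U+1F600 is the UTF-16 pair D83D DE00, and utf16_to_utf8(u"\U0001F600") produces the four bytes F0 9F 98 80.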
We have an application that is written utf-8 base encoding and this supports the utf-8 BMP (3-bytes). However, there is a requirement where it needs to support Surrogate pairs.
So convert the utf-16 encoded strings to utf-8. Documentation here: http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
If yes, what are the steps to make my application to have the default encoding of utf-16 rather than being utf-8?
Wrong question. Use UTF-8 internally.
What are the items that I would need to change in the entire code to get either the support of surrogate pairs in utf-8. Or changing the default encoding to UTF-16.
See above. Convert UTF-16 to UTF-8 for inbound data and convert back to UTF-16 outbound when necessary.

How to detect unicode file names in Linux

I have a Windows application written in C++. In it, we check whether a file name is Unicode by calling wcstombs(): if the conversion fails, we assume it is a Unicode file name. When I tried the same on Linux, the conversion doesn't fail. I know that on Windows the default charset is Latin, whereas on Linux it is UTF-8. We have different code paths depending on whether the file name is Unicode or not. Since I can't make this distinction on Linux, I can't make my application portable for Unicode characters. Is there a workaround for this, or am I doing something wrong?
utf-8 has the nice property that all ascii characters are represented as in ascii, and all non-ascii characters are represented as sequences of two or more bytes >=128. so all you have to check for ascii is the numerical magnitude of unsigned byte. if >=128, then non-ascii, which with utf-8 as the basic encoding means "unicode" (even if within range of latin-1, and note that latin-1 is a proper subset of unicode, constituting the first 256 code points).
however, note that while in Windows a filename is a sequence of characters, in *nix it is a sequence of bytes.
and so ideally you should really ignore what those bytes might encode.
might be difficult to reconcile with naïve user’s view, though
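The byte-magnitude check described above might look like this. It is a minimal sketch; is_ascii is a hypothetical helper name, and it assumes the file name is handed over as raw bytes in a std::string.

```cpp
#include <string>

// Return true if every byte is below 0x80, i.e. the name is pure ASCII.
// Under UTF-8, any byte >= 0x80 belongs to a multi-byte (non-ASCII) character.
bool is_ascii(const std::string& name) {
    for (unsigned char c : name) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}
```

Note that this only separates ASCII from non-ASCII; it does not check that the non-ASCII bytes form valid UTF-8.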

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF is variable length. When I use the at or substr string methods would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 has variable length, all kinds of indexing will do indexing in code units, not codepoints. It is not possible to do random access on codepoints in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ascii characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
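To make the code unit versus code point distinction concrete, here is a small sketch that counts code points by skipping UTF-8 continuation bytes. The name count_code_points is made up for illustration, the input is assumed to be valid UTF-8, and it counts code points only, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" in UTF-8), size() reports 6 code units while count_code_points reports 5 code points.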
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit FF FD inside a single byte, since, as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you will possibly get wrong answers since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at, you could end up with a single byte of a character sequence; with substr, you could break a sequence and end up with an invalid UTF-8 string (it would start or end with �, \uFFFD, the same one you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar_t to store Unicode strings. Since the type is at least 16 bits, many many more characters can fit in a single "unit".

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, which represent a character encoding. ASCII was developed by ANSI, but they're not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.
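The "run it through a UTF-8 decoder" check mentioned above could be sketched like this. It is a hand-rolled validator under the assumption that the bytes have already been copied into a std::string (for example from a null-terminated buffer, since the question has no length argument); is_valid_utf8 is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Return true if s is well-formed UTF-8: correct lead/continuation structure,
// no overlong forms, no surrogate code points, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 0, 1, 2, or 3 extra bytes.
    static const char32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80) { ++i; continue; }                      // ASCII
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead byte
        if (i + extra >= s.size()) return false;                 // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;             // not 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min_cp[extra]) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;          // surrogate
        if (cp > 0x10FFFF) return false;                         // out of range
        i += extra + 1;
    }
    return true;
}
```

As the answer warns, a "true" result only means the bytes could be UTF-8: the Latin-1 text "Ã©" is the byte pair C3 A9, which this function accepts because it is also the valid UTF-8 encoding of "é".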

When encoding actually matters? (e.g., string storing, printing?)

Just curious about the encodings that system is using when handling string storing(if it cares) and printing.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
Question 2: If positive, std::string and std::wstring must have the knowledge of the encoding themselves(although another guy told me this is NOT true)? Otherwise, how is it able to translate the char to correct integers and store them? How does the system know the encoding?
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")? I guess the same mechanism matters?
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
(I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory)
Not quite. Make sure you understand one important distinction.
A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.
Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.
The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days when everyone (except East Asians) used single-byte encodings. But nowadays, we have Unicode, and its encoding schemes which have up to 4 bytes per character.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?
Yes, it will. Suppose you have std::string euro = "€"; Then:
With the windows-1252 encoding, the string will be encoded as the byte 0x80.
With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
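To see the stored bytes for yourself, a small hex-dump helper is enough. This is a sketch; to_hex is a made-up name, and the byte strings in the usage note are written with explicit hex escapes so that they do not depend on how the compiler encodes the source file.

```cpp
#include <string>

// Render each byte of a string as two uppercase hex digits, space-separated.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
```

Feeding it the three encodings of "€" listed above, to_hex("\x80") gives "80", to_hex("\xA4") gives "A4", and to_hex("\xE2\x82\xAC") gives "E2 82 AC".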
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")?
Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.
~$ echo $LANG
en_US.utf8
Windows has a GetACP function to get the "ANSI" code page number.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
Encoding and Decoding is inherently the same process, i.e. they both transform one integral sequence to another integral sequence.
The difference between encoding and decoding is on the conceptual level. When you "decode" a character, you transform an integral sequence encoded in a known encoding ("string") into a system-specific integral sequence ("text"). And when you "encode", you're transforming a system-specific integral sequence ("text") into an integral sequence encoded in a particular encoding ("string").
This difference is conceptual, not physical: the memory still holds decoded "text" as a "string". However, since a particular system always represents "text" in one particular encoding, text transformations do not need to deal with the specifics of the actual system encoding, and can safely assume they are working on a sequence of conceptual "characters" instead of "bytes".
Generally, however, the encoding used for "text" has properties that make it easy to work with (e.g. fixed-length characters, a simple one-to-one mapping between characters and byte sequences, etc.), while the encoded "string" uses an efficient encoding (e.g. variable-length characters, context-dependent encoding, etc.).
Joel On Software has a writeup on this: http://www.joelonsoftware.com/articles/Unicode.html
This one is a good one as well: http://www.jerf.org/programming/encoding.html
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
You're sort of thinking about this backwards. Different encodings interpret the underlying integers as different characters (or parts of characters, if we're talking about a multi-byte character set), depending on the encoding.
Question 2: If positive, std::string and std::wstring must have the knowledge of the encoding themselves(although another guy told me this is NOT true)? Otherwise, how is it able to translate the char to correct integers and store them? How does the system know the encoding?
Both std::string and std::wstring are completely encoding agnostic. As far as C++ is concerned, they simply store arrays of char objects and wchar_t objects respectively. The only requirement is that char is one-byte, and wchar_t is some implementation-defined width. (Usually 2 bytes on Windows and 4 on Linux/UNIX)
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")?
That depends on the platform. ISO C++ only talks about the global locale object, std::locale(), which generally refers to your current system-specific settings.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Generally, if you output to the screen through stdout, the characters you see displayed are interpreted and rendered according to your system's current locale settings.
Any one working with encodings should read this Joel on Software article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I found it useful when I started working with encodings.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use?
C/C++ programmers are used to thinking of characters as bytes, because almost everyone starts out working with the ASCII character set, which maps the integers 0-127 to symbols such as the letters of the alphabet and Arabic numerals (its 8-bit extensions use 0-255). The fact that the C char datatype is actually a byte doesn't help matters.
The std::string class stores data as char objects (8-bit integers on practically every platform), and std::wstring stores data as wchar_t objects, whose width is implementation-defined (typically 16 bits on Windows and 32 bits on Linux). Neither class contains any concept of encoding. You can use any 8-bit encoding such as ASCII, UTF-8, Latin-1, or Windows-1252 with a std::string, and a wide encoding such as UTF-16 (with a 16-bit wchar_t) or UTF-32 (with a 32-bit wchar_t) with a std::wstring.
Data stored in std::string and std::wstring must always be interpreted by some encoding. This generally comes into play when you interact with the operating system: reading or writing data from a file, a stream, or making OS API calls that interact with strings.
So to answer your question, if you store the same byte in a std::string and a std::wstring, the memory will contain the same value (except the wstring will contain a null byte), but the interpretation of that byte will depend on the encoding in use.
If you store the same character in each of the strings, then the bytes may be different, again depending on the encoding. For example, the Euro symbol (€) might be stored in the std::string using a UTF-8 encoding, which corresponds to the bytes 0xE2 0x82 0xAC. In the std::wstring, it might be stored using the UTF-16 encoding, which would be the single 16-bit code unit 0x20AC.
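The Euro example can be checked with a few lines. This sketch uses std::u16string instead of std::wstring as a stand-in, purely so that the code unit is 16 bits on every platform; the variable names are arbitrary.

```cpp
#include <string>

// U+20AC EURO SIGN, written with escapes so the result does not depend on
// the source file encoding or the execution character set.
const std::string euro_utf8 = "\xE2\x82\xAC";   // UTF-8: three 8-bit code units
const std::u16string euro_utf16 = u"\u20AC";    // UTF-16: one 16-bit code unit
```

Here euro_utf8.size() is 3 and euro_utf16.size() is 1, with euro_utf16[0] equal to 0x20AC: the same character, but different stored integers depending on the encoding chosen.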
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")? I guess the same mechanism matters?
Yes, the locale determines how the OS interprets strings at its API boundaries. Locales define more than just the encoding. They also include information on how money, dates, times, and other things should be formatted. On Linux or OS X, you can use the locale command in the terminal to see what the current locale is:
mch@bohr:/$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
So in this case, my locale is Canadian English. Each locale defines an encoding used to interpret strings. In this case the locale name makes it clear that it is using a UTF-8 encoding, but you can run locale -ck LC_CTYPE to see more information about the current encoding:
mch@bohr:/$ locale -ck LC_CTYPE
LC_CTYPE
ctype-class-names="upper";"lower";"alpha";"digit";"xdigit";"space";"print";"graph";"blank";"cntrl";"punct";"alnum";"combining";"combining_level3"
ctype-map-names="toupper";"tolower";"totitle"
ctype-width=16
ctype-mb-cur-max=6
charmap="UTF-8"
... output snipped ...
If you want to test a program using encodings, you can set the LC_ALL environment variable to the locale you want to use. You can also change the locale using setlocale. Permanently changing the locale depends on your distribution.
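Changing the locale from inside a program with setlocale might look like the sketch below. The wrapper name use_locale is made up; std::setlocale itself is the standard call from <clocale>. Which locale names are available (for example "en_CA.UTF-8") depends on what is installed on the system, so only "C" and the empty string are safe to assume.

```cpp
#include <clocale>
#include <string>

// Switch the whole C locale (LC_ALL) to `name` and return the name of the
// locale now in effect, or an empty string if the request was rejected.
// Passing "" selects the locale from the environment (LANG, LC_ALL, ...).
std::string use_locale(const char* name) {
    const char* result = std::setlocale(LC_ALL, name);
    return result ? std::string(result) : std::string();
}
```

Calling use_locale("") at program start adopts the user's environment settings, while use_locale("C") returns to the minimal default locale that every C and C++ program starts in.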
On Windows, most API functions come in a narrow and a wide format. For example, GetCurrentDirectory comes in GetCurrentDirectoryW (Unicode) and GetCurrentDirectoryA (ANSI) variants. Unicode, in this context, means UTF-16.
I don't know enough about Windows to tell you how to set the locale, other than to try the languages control panel.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
When you print a string to std::cout, the OS will interpret that string in the encoding set by the locale. If your string is UTF-8 encoded and the OS is using Windows-1252, it will be necessary to convert it to that encoding. One way to do this is with the iconv library.