UTF-8 decoding library - c++

I have to code in an application which is in Unicode UTF-8 in Windows, MSVC 10. I'm aware that the UTF-8 encoded strings would use either 1 or 2 bytes per character. So, my question is : Is std::string suitable for this? If yes, how do I decode the strings? As far as I understand std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I find the logical length of the string? How can I extract logical characters from it? Are there any libraries which help me extract logical characters from the string?
e.g : If I have the string "olé" in std::string, I need to know that the length is 3, but not 4.

A commonly used library is ICU - International Components for Unicode

Yes, std::string is appropriate, but as you’ve noticed it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn’t necessarily bad (in fact, it does have some advantages, see the links below for information) but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.Nowide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.

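First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters. Be aware that mbstowcs() converts according to the current C locale, so it only decodes UTF-8 if that locale uses UTF-8; on Windows, MultiByteToWideChar() with CP_UTF8 does the same job without depending on the locale. Then if you want the result to be 8 bits, you'll lose data whenever the text has characters outside the ISO-8859-1 range (also called Latin-1).
Note that the "Windows" encoding (Windows-1252) is not 1 to 1 equivalent to ISO-8859-1: the two differ in the 0x80 to 0x9F range, although they agree on the accented letters.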
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
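Okay, if you just want the length in characters, note that a single mblen() call does not give it: mblen() returns the number of bytes in the next multibyte character, so it has to be called repeatedly while you advance through the string. It also follows the current C locale, so it only understands UTF-8 if that locale uses UTF-8:
n = mblen(str.c_str() + pos, str.length() - pos); // bytes in the character starting at pos
Additional note: an easy way to count characters yourself is to count the number of bytes that are not between 0x80 and 0xBF, since those are the continuation bytes of a multi-byte sequence. Because lead bytes and continuation bytes never overlap, this is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection: you can always find the start of the next character.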
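The byte-counting idea above can be sketched as follows. This is a minimal sketch, not a library API: the function name utf8_length is made up for illustration, and it assumes the input is already well-formed UTF-8 (it does no validation). It uses a plain index loop so it should also compile with older compilers such as MSVC 10.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx, i.e. 0x80..0xBF).
// Assumes well-formed UTF-8; malformed input gives a meaningless count.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

Note that this counts code points, not what a user perceives as characters: an "é" written as "e" followed by a combining accent counts as two.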

Related

How to compare/replace non-ASCII chars in array in C++?

I have a large char array, which contains Czech diacritical characters (e.g. "á"), encoded in UTF-8. I need to replace them with their ASCII equivalents (e.g. "a"), because the program must work on Windows (the Linux console accepts these chars perfectly).
I am reading the array char by char and writing the content into a string.
Here is the code I am using; it doesn't work:
int array_size = 50000; //size of file array
char * array = new char[array_size]; //array to store file contents
string ascicontent="";
if ('\u00E1'==array[zacatek]) { //check if char is "á"
ascicontent +='a'; //write ordinal "a" into string
}
I even tried replacing '\u00E1' with 'á', but that doesn't work either. I'm guessing the problem is that these chars are longer than ASCII ones.
How can I declare the non-ascii char, so it could be compared?
Each char is a single byte, however UTF-8 can use multiple bytes to encode a single character. In particular U+00E1 is encoded as two bytes: 0xC3 0xA1. So you can't do what you want with just comparing a single char.
There are multiple ways that you might be able to tackle your problem:
A) First, try googling for "windows console utf-8" and see if that gives anything which might make things just work without having to alter the characters at all. (I don't know if anything can work for you, I've never tried this.)
B) Convert the data to wide characters (wchar_t) using MultiByteToWideChar or mbstowcs and then google how to use wcout or such to output UTF-16 to the console.
C) Use MultiByteToWideChar to convert the data from UTF-8 to UTF-16. Then use WideCharToMultiByte to convert from UTF-16 to the console's code page, relying on the fact that it can automatically "best fit" common characters (such as "á" to "a").
D) If you really only care about a limited set of characters (such as only the accented characters in the Czech code page), then you could possibly write your own lookup table of UTF-8 byte sequences and your desired replacements. You just need to be doing comparisons on the UTF-8 by those multiple bytes rather than individual chars. Among various tools out there, I've found this page helpful for seeing how characters are encoded in various ways.
Which of these make the most sense for your program depends on various factors, such as how easy or hard it might be to keep the Windows-specific pieces from conflicting with the Linux-specific or cross-platform parts.
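Option D above could look roughly like this. It is only a sketch under stated assumptions: the names Replacement and strip_diacritics are invented for illustration, the table lists just a handful of Czech letters, and every entry is assumed to be a two-byte UTF-8 sequence. The sequences are written as hex escapes so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lookup entry: a two-byte UTF-8 sequence and its ASCII stand-in.
struct Replacement {
    const char* utf8;
    char ascii;
};

// Copies `in`, replacing the listed UTF-8 sequences with ASCII letters.
// Bytes that match no entry are copied through unchanged.
std::string strip_diacritics(const std::string& in) {
    static const Replacement table[] = {
        { "\xC3\xA1", 'a' },  // U+00E1 a with acute
        { "\xC4\x8D", 'c' },  // U+010D c with caron
        { "\xC3\xA9", 'e' },  // U+00E9 e with acute
        { "\xC5\x99", 'r' },  // U+0159 r with caron
        { "\xC5\xA1", 's' },  // U+0161 s with caron
        { "\xC5\xBE", 'z' },  // U+017E z with caron
    };
    const std::size_t entries = sizeof(table) / sizeof(table[0]);

    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool replaced = false;
        for (std::size_t k = 0; k < entries; ++k) {
            // Compare the two bytes at position i against the table entry.
            if (in.compare(i, 2, table[k].utf8) == 0) {
                out += table[k].ascii;
                i += 2;
                replaced = true;
                break;
            }
        }
        if (!replaced) {
            out += in[i];
            ++i;
        }
    }
    return out;
}
```

The same code runs unchanged on Linux and Windows, which is the main attraction of this option; the cost is that the table has to list every character you care about.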
char in C is not unicode, it is really a byte; it only gets converted to a glyph by the terminal console you happen to use. On some Linux implementations (like Debian) it defaults to UTF-8, so if your program outputs a sequence of bytes encoded in UTF-8, your terminal will display the proper glyph. If you know that array is UTF-8 encoded, you must check for the proper byte sequence.
Edit: take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Please take a look at this link http://en.wikipedia.org/wiki/Wide_character.
And I believe this code might help you:
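#include <algorithm>
#include <string>
std::wstring str(L"cccc\u00E1\u00E1ddddddd"); // \u00E1 is "á"; the escape avoids source-encoding issues
std::replace(str.begin(), str.end(), L'\u00E1', L'a'); // works per wchar_t, so the text must be converted from UTF-8 first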

Strings of 4 byte character in Windows?

I have a program that does various operations on char types in std::string, for example
if (my_string.front() == my_char) {
// do stuff with my_string
}
I'm looking for some practical advice on how to make my program support Unicode. I need the ability to compare characters to characters, and that means 4 byte characters are required so that even the largest Unicode characters can be processed without losses.
I'm on Windows with a GCC compiler and read that in this case, std::wstring is 2 bytes. C++11 has std::u32string with 4 bytes but it seems largely unsupported by the standard library.
What's the easiest solution in this case?
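Even if you had a string of uint32 values, you could not just compare these integers one by one: the same text can be written with different code point sequences (for example, "é" as one precomposed code point or as "e" followed by a combining accent), so you would have to normalize the strings first. As normalization is NOT simple, you will end up using a library like ICU, so you may as well use it directly :)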
http://site.icu-project.org/
Windows uses the UTF-16 encoding:
http://en.wikipedia.org/wiki/UTF-16
You don't need "four byte characters" to support all unicode symbols. UTF-16 is a variable length encoding.
Good reading material:
http://www.joelonsoftware.com/articles/Unicode.html
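To make the "variable length" point about UTF-16 concrete, here is a small sketch of how a single code point maps to one or two 16-bit units. The function name to_utf16 is made up for illustration, it relies on the C++11 types char16_t, char32_t and std::u16string, and it assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Code points below 0x10000 take one unit; the rest take a surrogate pair.
// Assumes cp is a valid scalar value; no error handling is done.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;                                        // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

So my_string.front() on a UTF-16 string may return only the first half of a character, just as it may return only the first byte of one in UTF-8.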

std::string and UTF-8 encoded unicode

If I understand well, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.
With wchar_t same thing, but the minimal amount of bytes used per characters is always 2 (instead of 1 for char), and a 3 or 4 byte wide character will take 2 wchar_t.
Right ?
So, what if I want to use string::find_first_of() or string::compare(), etc with such a weirdly encoded string ? Will it work ? Does the string class handle the fact that characters have a variable size ? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer.
If std::string doesn't handle that, second question: are there libraries providing string classes that could handle UTF-8 encoding so that str[3] actually points to the 4th character (which would be a byte array of length 1 to 4)?
You are talking about Unicode. Unicode assigns every character a number (a code point) in the range 0 to 0x10FFFF, which fits in 21 bits and is usually stored in 32. However, since that wastes memory there are more compact encodings. UTF-8 is one such encoding. It uses bytes as units and maps each code point to 1, 2, 3 or 4 bytes. UTF-16 is another; it uses 16-bit words as units and maps each code point to 1 or 2 words (2 or 4 bytes).
You can store either encoding in either string or wstring, although UTF-8 in string and UTF-16 in wstring (on Windows) is the usual pairing. UTF-8 tends to be more compact for English text and numbers.
Some things will work regardless of encoding and type used (compare). However, all functions that need to understand one character will be broken, i.e. the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break.
string::compare will work but do not expect to get alphabetical ordering. That is language dependent.
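string::find_first_of will work for some inputs but not all. It matches individual bytes, so it is safe only when every character in the search set is ASCII; a non-ASCII character in the set will also match any other character that happens to share one of its bytes, which produces very hard to find bugs. Searching for a whole UTF-8 substring with string::find is safe, because a valid UTF-8 sequence never matches starting in the middle of another character.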
Best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
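If you only need "the n-th character" occasionally and don't want a library yet, one option is to decode the bytes into a vector of code points and index that instead. The following is a minimal sketch, not a replacement for a real library: the name decode_utf8 is made up for illustration, and it assumes the input is well-formed UTF-8 (it does not detect overlong forms, stray continuation bytes or truncated sequences).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes well-formed UTF-8 into code points.
// No validation is performed; malformed input yields garbage values.
std::vector<unsigned long> decode_utf8(const std::string& s) {
    std::vector<unsigned long> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

With this, decode_utf8(str)[3] is the 4th code point regardless of how many bytes the earlier ones took. Decoding costs time and memory proportional to the string, which is one reason libraries tend to offer iterators over the original bytes instead.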
std::string and the other tools in the Standard Library only store the bytes; none of them understand Unicode characters. Use an external library such as: http://utfcpp.sourceforge.net/
You are correct for those:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...
std::string in C++ only knows about bytes, so it treats text correctly per character only for ASCII. This is different from String in Java, whose elements are 16-bit UTF-16 units. You can store the encoded bytes of Chinese characters in a string (char in C/C++ is just a byte), but string will treat each byte as a separate char, so string functions that work character by character will not see those characters as units.
wstring may be something you need.
There is something that should be clarified. UTF-8 is just an encoding method for Unicode characters (transforming characters from/to byte format).

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, which represent a character encoding. ASCII was developed by ANSI, but the two terms are not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.
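The "run it through a UTF-8 decoder" check could be sketched like this. It is an illustration under assumptions, not the OLE API the question asks about: the name is_valid_utf8 is made up, and since no length is passed it assumes the data is NUL-terminated. It rejects stray continuation bytes, truncated sequences, overlong forms, surrogates and values above 0x10FFFF.

```cpp
// Hypothetical helper: returns true if the NUL-terminated buffer is
// well-formed UTF-8. Pure ASCII input also returns true.
bool is_valid_utf8(const char* data) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    while (*p) {
        unsigned long cp;
        int extra;  // number of continuation bytes expected
        if (*p < 0x80)                { ++p; continue; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;
        for (int k = 0; k < extra; ++k, ++p) {
            // Also catches the terminating NUL, so we never read past it.
            if ((*p & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min_cp[extra]) return false;             // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
    }
    return true;
}
```

A true result only means the bytes could be UTF-8. As noted above, a Latin-1 string such as the two bytes 0xC3 0xA9 also passes, so this narrows the possibilities without settling them.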

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char * strings, as @quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 always encoded in a single byte as themselves), it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte values for code points >= 128. So if you see a byte == 60, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in # of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
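The sorting property is easy to try out. In this small sketch the function name sort_utf8 is made up for illustration; it relies on std::string comparing its elements as unsigned bytes, which is what the standard char_traits<char> does, and it makes no attempt at language-aware (alphabetical) ordering.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sorts UTF-8 strings by their raw bytes.
// For valid UTF-8 this is the same order as sorting by code points.
std::vector<std::string> sort_utf8(std::vector<std::string> strings) {
    std::sort(strings.begin(), strings.end());
    return strings;
}
```

For example, "z" (U+007A, one byte), "é" (U+00E9, two bytes), "中" (U+4E2D, three bytes) and U+1F600 (four bytes) come out in exactly that order, because their lead bytes 0x7A, 0xC3, 0xE4 and 0xF0 are increasing.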
UTF-8 is compatible with 7-bit ASCII. If the value of a byte is larger than 127, it is part of a multibyte character. From the value of the first byte you can see how many bytes the character will take; that can be 2-4 bytes including the first byte (technically 5 or 6 were also possible in the original design, but they are not valid UTF-8). Here is a good resource about UTF-8: UTF-8 and Unicode FAQ; the wiki page for UTF-8 is also very informative. Since UTF-8 is char based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
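Reading the sequence length off the first byte, as described above, might look like this. The function name utf8_sequence_length is made up for illustration; returning 0 for bytes that cannot start a character is a design choice of this sketch, not something the answer prescribes.

```cpp
// Hypothetical helper: given the first byte of a UTF-8 sequence, returns
// how many bytes the whole character occupies (1 to 4), or 0 if the byte
// cannot start a valid sequence.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;  // continuation byte (0x80..0xBF), overlong lead (0xC0, 0xC1),
               // or a lead for the no longer valid 5 and 6 byte forms
}
```

Stepping through a string by these lengths visits one character at a time, which is how the character count can be obtained where strlen() only gives the byte count.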
By using between 1 and 4 chars to encode one Unicode code point.