Convention for matching a Unicode code point while respecting the BOM? - c++

When searching for a codepoint in a Unicode string that carries a BOM (UTF-16/32), it seems to make sense to leave the encoding as-is and convert the codepoint I'm searching for to the byte order indicated by the string's BOM.
For example, I want to trim leading and trailing slash characters.
(pseudocode)
utf16 trim_slash(utf16 string) {
    bom = bom_from_string(string)
    utf16_slash = utf16_byte_order("/", bom)   // "/" in the string's byte order
    offset = 0
    for each codepoint[i] from the right:
        if codepoint[i] == utf16_slash
            offset++
        else
            break
    if offset > 0
        string = string.substr(0, len(string) - offset)
    return string
}
To do the same with leading codepoints, I would skip over the BOM, and if I wanted to extract a substring, I would simply add the BOM back on.
I'm using ConvertUTF.cpp from LLVM for UTF operations, which seems to respect the BOM when converting between encodings, but I still need to take the byte order into consideration when comparing against string literals and strings from other sources.
Am I going about this the right way, and is the effort justified? I want to handle Unicode as properly as I can.
I'm currently standardized on converting all incoming strings to UTF-32 where I need to walk along codepoints to compare search terms and then extract some substring. But I see that this is overkill when I only need to walk along the beginning and the end of a string, as in the example pseudocode. In that case it would be much faster to just return the same string if nothing changes; whereas with UTF-32 I have to convert to UTF-32, then back to the original width, and then pass the final copy as the result.
With UTF-32 the minimum is 3 copies per call versus one copy if I were to consider the BOM.
Additionally, converting between UTF formats may result in a string which does not align with the original representation (having a BOM or not, regardless of endianness).

Usually, BOMs are only relevant "on the wire", meaning that they signal the byte order of a file, network data, or some other protocol stream as it is transmitted between systems (see the Unicode FAQ).
When such a stream is read by a program (e.g. when your utf16 string is created), it should be converted to the platform's native byte order. That is, strings should always be in the native byte order, and the BOM becomes irrelevant. When the string is written back to a file/network/stream, it should be converted from the native byte order into whatever is appropriate for the protocol (with a BOM).
Code that works with strings (other than reading/writing byte streams) should never need to handle non-native byte orders.
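As an illustration, here is a minimal sketch under that assumption: the text has already been converted to a std::u16string in the platform's native byte order (the function name mirrors the question's pseudocode), so trimming slashes needs no BOM or byte-order handling at all.
#include <cstddef>
#include <string>

// Trim leading and trailing '/' from a native-byte-order UTF-16 string.
std::u16string trim_slash(const std::u16string &s)
{
    std::size_t begin = 0;
    while (begin < s.size() && s[begin] == u'/')
        ++begin;                              // skip leading slashes
    std::size_t end = s.size();
    while (end > begin && s[end - 1] == u'/')
        --end;                                // skip trailing slashes
    return s.substr(begin, end - begin);
}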

Related

How to convert only the next one character from a UTF-8 byte array efficiently?

I have this code which works:
QString qs = QString::fromUtf8(bp,ut).at(0);
QChar c(qs[0]);
Where bp is a QByteArray::const_pointer, and ut is the maximum expected length of the UTF-8 encoded Unicode code-point.
I then grab the first QChar c from the QString qs.
It seems that there should be a more efficient way to simply get only the next QChar from the UTF-8 byte array without having to convert an arbitrary amount of the QByteArray into a QString and then getting only the first QChar.
EDIT From the comments below, it is clear that no one yet understands my question. So I will start with some basics.
UTF-8 and UTF-16 are two different encodings of the world standard Unicode. The most common and encouraged Unicode encoding for transfer over the Internet and for Unicode text files is UTF-8, which encodes every Unicode code-point using 1 to 4 bytes. UTF-16, on the other hand, is more convenient for handling characters inside a program. Therefore the vast majority of software out there is making the conversion between these two encodings all the time.
A QChar is the more convenient UTF-16 encoding of the Unicode code-points from 0x00 to 0xffff, which covers the majority of the languages and symbols so far defined and in common use. Surrogate pairs are used for the higher Unicode code-point values. At present surrogate pairs seem to have limited support, and they are not of interest to me for the present question.
When you read a text file into a QPlainTextEdit the conversion is done automatically and behind the scenes. Reading a QString from a QByteArray can also be done automatically (provided your locale and codec settings are set for UTF-8), or they can be done explicitly using toUtf8() or fromUtf8() as in my code above.
The conversion in the other direction can efficiently be done implicitly (behind the scenes) or explicitly with the following code:
ba += *si; // Depends on the UTF-8 codec
or
ba += QString(*si).toUtf8(); // UTF-8 explicitly
where ba is a QByteArray and si is a QString::const_iterator. These do exactly the same thing (assuming the codec is set to UTF-8). They both convert the next (one) character from the QChar pointed to within a QString, appending one or more bytes to ba.
All I am trying to do is the inverse conversion for only one character at a time, efficiently. Internally this is being done for every character being converted, and I'm sure it is being done very efficiently.
The problem with QString::fromUtf8(p,n) is that n is the number of bytes to process rather than the number of characters to convert. Therefore you must allow for the largest number of bytes which could be 3 (or 4 if it actually handled surrogate pairs). So if all you want is the next character, you must be prepared to process several bytes, and they do get converted and then are discarded if the result is a QString with more than one character.
Q: Is there a conversion function that does this one character at a time?
You want to use QTextDecoder.
It is, according to the documentation:
The QTextDecoder class provides a state-based decoder.
A text decoder converts text from an encoded text format into Unicode using a specific codec.
The decoder converts text in this format into Unicode, remembering any state that is required between calls.
The important thing here is state. QString and QTextCodec are stateless, so they work on entire strings, start to end.
QTextDecoder, on the other hand, allows you to work on byte buffers one byte at a time, maintaining a state between calls so the caller knows if a UTF-8 sequence has been only partially decoded.
For example:
QTextDecoder decoder(QTextCodec::codecForName("UTF-8"));
QString result;
for (int i = 0; i < bytearray.size(); i++) {
    result = decoder.toUnicode(bytearray.constData() + i, 1);
    if (!result.isEmpty()) {
        break; // we got our character!
    }
}
The rationale behind this loop is that as long as the decoder is not able to decode a complete UTF-8 character, it will return an empty string.
As soon as it is able to, the result string will contain the one decoded unicode character.
This loop is as efficient as possible, and by remembering the loop index, the next characters can be obtained in the same way.
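If it helps, here is a hedged sketch of wrapping that loop into a reusable helper (assuming Qt 5's QTextCodec/QTextDecoder; the name nextQChar and the consumed out-parameter are made up for illustration). It decodes exactly one QChar and reports how many bytes it consumed:
#include <QByteArray>
#include <QChar>
#include <QString>
#include <QTextCodec>
#include <QTextDecoder>

// Decode the next single QChar from a UTF-8 byte array, starting at 'start'.
// '*consumed' receives the number of bytes used (0 if the input ends mid-sequence).
QChar nextQChar(const QByteArray &bytes, int start, int *consumed)
{
    QTextDecoder decoder(QTextCodec::codecForName("UTF-8"));
    for (int i = start; i < bytes.size(); ++i) {
        // Feed one byte at a time; the decoder keeps its state between calls.
        QString result = decoder.toUnicode(bytes.constData() + i, 1);
        if (!result.isEmpty()) {
            if (consumed)
                *consumed = i - start + 1;
            return result.at(0);   // first QChar only; surrogate pairs are ignored here
        }
    }
    if (consumed)
        *consumed = 0;
    return QChar();
}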

UTF-8 decoding library

I have to write code for an application that uses Unicode (UTF-8) on Windows, MSVC 10. I'm aware that UTF-8 encoded strings use between 1 and 4 bytes per character. So, my question is: Is std::string suitable for this? If yes, how do I decode the strings? As far as I understand, std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which helps me to extract logical characters from the string?
e.g.: If I have the string "olé" in a std::string, I need to know that its length is 3 (characters), not 4 (bytes).
A commonly used library is ICU - International Components for Unicode.
Yes, std::string is appropriate, but as you've noticed it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn't necessarily bad (in fact, it does have some advantages, see the links below for information) but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters. Then if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside of the ISO-8859-1 range, also called Latin-1).
Note that the "Windows" encoding is not a 1-to-1 equivalent of ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
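A rough sketch of that approach follows. The locale name "en_US.UTF-8" is an assumption; mbstowcs() only interprets the bytes as UTF-8 if the current C locale actually uses a UTF-8 multibyte encoding, which is not a given on older MSVC runtimes:
#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a UTF-8 std::string to a wide string via mbstowcs().
std::wstring widen_utf8(const std::string &s)
{
    std::setlocale(LC_ALL, "en_US.UTF-8");       // assumed UTF-8 locale name
    std::vector<wchar_t> buf(s.size() + 1);      // character count <= byte count
    std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();                   // invalid multibyte sequence
    return std::wstring(buf.data(), n);
}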
Okay, if you just want the length in characters, you can use the mblen() function in a loop; note that each call returns the byte length of only the next multibyte character (and it relies on the locale using UTF-8), so you advance by that many bytes and count:
len = mblen(str.c_str(), str.length()); // byte length of the first character; advance by len and repeat
Additional note: an easy way to count characters yourself is to count the number of bytes that are not between 0x80 and 0xBF, since those are the continuation bytes of multi-byte sequences. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.
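As a sketch of that byte-counting approach (the function name is arbitrary, and the input is assumed to be valid UTF-8):
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is NOT a
// continuation byte (continuation bytes are in the range 0x80..0xBF).
std::size_t utf8_length(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // lead byte or plain ASCII -> a new code point starts here
            ++count;
    return count;
}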

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but they're not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
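A sketch of that check (the buffer is assumed to be NUL-terminated, since fcn() receives no length; the function name is made up):
#include <cstddef>

// Given a NUL-terminated buffer known to be either ASCII or UTF-8, report
// whether any byte has its high bit set, i.e. whether real UTF-8 decoding
// logic is needed at all.
bool needs_utf8_decoding(const unsigned char *data)
{
    for (std::size_t i = 0; data[i] != '\0'; ++i)
        if (data[i] & 0x80)
            return true;    // at least one non-ASCII byte
    return false;           // pure 7-bit ASCII (and therefore also valid UTF-8)
}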
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
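For reference, a rough sketch of such a validation pass; it only checks the lead/continuation-byte structure, not overlong encodings or surrogate ranges, so it is weaker than a full validator, but it rejects most Latin-1 text misread as UTF-8:
#include <cstddef>

// Returns false as soon as an invalid UTF-8 code unit sequence is found.
bool looks_like_valid_utf8(const unsigned char *p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        std::size_t extra;                           // continuation bytes expected
        if      (b < 0x80)            extra = 0;     // ASCII
        else if ((b & 0xE0) == 0xC0)  extra = 1;     // 2-byte sequence
        else if ((b & 0xF0) == 0xE0)  extra = 2;     // 3-byte sequence
        else if ((b & 0xF8) == 0xF0)  extra = 3;     // 4-byte sequence
        else                          return false;  // stray continuation / invalid lead
        if (extra > 0 && i + extra >= n)
            return false;                            // truncated sequence at end of buffer
        for (std::size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80)
                return false;                        // expected a continuation byte
        i += extra + 1;
    }
    return true;
}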
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.

Distinguishing between string formats

Having an untyped pointer pointing to some buffer which can hold either an ANSI or a Unicode string, how do I tell whether the string it currently holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by providing an ANSI/Unicode flag or something. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF8 or UCS2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then force the work of deciding on to your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
In general you can't
You could check for the pattern of zeros: just one at the end probably means an ANSI C string, every other byte being a zero probably means ANSI text encoded as UTF-16, and three zeros per character might mean UTF-32.
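A sketch of that zero-pattern heuristic, assuming the buffer length is somehow known (which the question does not guarantee); the enum and function names are made up, and the result is only a guess:
#include <cstddef>

enum class GuessedWidth { EightBit, Utf16Like, Utf32Like, Unknown };

// Guess the code-unit width of mostly-ASCII text from its zero-byte pattern.
GuessedWidth guess_width(const unsigned char *p, std::size_t n)
{
    if (n == 0)
        return GuessedWidth::Unknown;
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0)
            ++zeros;
    if (zeros == 0)
        return GuessedWidth::EightBit;    // no embedded zeros: ANSI / UTF-8
    if (zeros * 4 >= n * 3)
        return GuessedWidth::Utf32Like;   // roughly three zero bytes per character
    if (zeros * 2 >= n - 2)
        return GuessedWidth::Utf16Like;   // roughly every other byte is zero
    return GuessedWidth::Unknown;
}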

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char * strings, as @quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves), it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte values for code points >= 128. So if you see a byte == 0x3C, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in the number of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
UTF-8 is compatible with 7-bit ASCII. If the value of a byte is larger than 127, a multibyte character starts. Depending on the value of the first byte you can see how many bytes the character will take; that can be 2-4 bytes including the first byte (technically 5 or 6 are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: UTF-8 and Unicode FAQ; the wiki page for UTF-8 is also very informative. Since UTF-8 is char-based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
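As a small sketch of that lead-byte rule (the function name is arbitrary):
// How many bytes does the UTF-8 sequence starting with this byte occupy?
// Returns 0 for a continuation byte or an invalid lead byte.
int utf8_sequence_length(unsigned char first)
{
    if (first < 0x80)            return 1;  // plain ASCII
    if ((first & 0xE0) == 0xC0)  return 2;
    if ((first & 0xF0) == 0xE0)  return 3;
    if ((first & 0xF8) == 0xF0)  return 4;
    return 0;
}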
By using between 1 and 4 chars to encode one Unicode code point.
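A sketch of exactly that: encoding one code point into 1 to 4 chars and appending them to a std::string (no validation of surrogate code points or of the 0x10FFFF upper bound is done here):
#include <cstdint>
#include <string>

void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {                                        // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}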