Determine if a byte array contains an ANSI or Unicode string? - c++

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142

First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but the two terms are not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily: either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1, in which case you must use UTF-8 decoding logic.
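A minimal sketch of that check, assuming the buffer is NUL-terminated (since, as the question says, no length is passed):
// Returns true if every byte is <= 0x7F, i.e. the buffer is plain ASCII
// (and therefore also valid UTF-8). Assumes a NUL-terminated buffer.
bool is_ascii(const unsigned char* data)
{
    for (; *data != 0; ++data)
        if (*data & 0x80)
            return false;   // high bit set: not 7-bit ASCII
    return true;
}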
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
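A rough sketch of such a validation pass (simplified: it only checks the lead/continuation-byte structure, not overlong encodings or out-of-range code points):
// Simplified structural check for a NUL-terminated UTF-8 string.
bool looks_like_utf8(const unsigned char* p)
{
    while (*p) {
        int extra;                                // continuation bytes expected
        if (*p < 0x80)                extra = 0;  // 0xxxxxxx: ASCII
        else if ((*p & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((*p & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((*p & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;                        // invalid lead byte
        ++p;
        for (int i = 0; i < extra; ++i, ++p)
            if ((*p & 0xC0) != 0x80)
                return false;                     // missing continuation byte
    }
    return true;
}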
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.

Related

UTF-8 decoding library

I have to work on an application that uses Unicode (UTF-8) on Windows, MSVC 10. I'm aware that UTF-8 encoded strings use a variable number of bytes (1 to 4) per character. So, my question is: is std::string suitable for this? If yes, how do I decode the strings? As far as I understand, std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which helps me to extract logical characters from the string?
e.g.: If I have the string "olé" in a std::string, I need to know that the length is 3, not 4.
A commonly used library is ICU - International Components for Unicode.
Yes, std::string is appropriate, but as you've noticed it only operates on bytes, not Unicode code points. In that respect, std::string is an opaque type; this isn't necessarily bad (in fact, it does have some advantages, see the links below for information), but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
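To see the byte-vs-character distinction concretely (the literal is written as explicit UTF-8 bytes so the result does not depend on the source file's encoding):
#include <iostream>
#include <string>

int main()
{
    std::string s = "ol\xC3\xA9";   // "olé" encoded as UTF-8
    std::cout << s.size() << '\n';  // prints 4: std::string counts bytes, not code points
}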
First you may want to call the mbstowcs() function to transform the UTF-8 characters into wide characters. Then, if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside the ISO-8859-1 range, also called Latin-1).
Note that the "Windows" encoding is not 1 to 1 equivalent to ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
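A hedged sketch of that approach; note that mbstowcs() only interprets the bytes as UTF-8 if the current locale uses UTF-8, which is why setlocale() is called first (and why this tends not to work on Windows):
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Assumes a UTF-8 locale is available; on Windows this is usually not the case.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    std::string utf8 = "ol\xC3\xA9";             // "olé" as explicit UTF-8 bytes
    std::vector<wchar_t> wide(utf8.size() + 1);  // worst case: one wide char per byte

    std::size_t n = std::mbstowcs(&wide[0], utf8.c_str(), wide.size());
    if (n == (std::size_t)-1) {
        std::cerr << "invalid multibyte sequence for the current locale\n";
        return 1;
    }
    std::cout << n << '\n';                      // prints 3: three wide characters
    return 0;
}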
Okay, if you just want the length in characters, you can step through the string with the mblen() function, which returns the number of bytes used by the next (possibly multi-byte) character:
len = mblen(str.c_str(), str.length());   // bytes used by the first character
Additional note: an easy way to count characters yourself is to count the number of bytes that are not between 0x80 and 0xBF, since those are continuation bytes of a multi-byte sequence. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.
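A small sketch of that counting idea; it assumes the input is already valid UTF-8 and does no validation:
#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (0x80..0xBF).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80 || c > 0xBF)   // not a 10xxxxxx continuation byte
            ++count;
    }
    return count;
}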

Does encoding affect the result of strstr() (and related functions)

Does character set encoding affect the result of the strstr() function?
For example, I have read a data to "buf" and do this:
char *p = strstr (buf, "UNB");
I wonder whether the data being encoded in ASCII or something else (e.g. EBCDIC) affects the result of this function?
(Since "UNB" is a different byte sequence under different encodings...)
If yes, what's the default that is used for these function? (ASCII?)
Thanks!
The C functions like strstr operate on the raw char data, independently of the encoding. In this case, you potentially have two different encodings: the one the compiler used for the string literal, and the one your program used when filling buf. If these aren't the same, then the function may not work as expected.
With regards to the "default" encoding, there isn't one, at least as far as the standard is concerned; the "basic execution character set" is implementation defined. In practice, systems which don't use an encoding derived from ASCII (ISO 8859-1 seems the most common, at least here in Europe) are exceedingly rare. As for the encoding you get in buf, that depends on where the characters come from; if you're reading from an istream, it depends on the locale imbued in the stream. In practice, however, again, almost all of these (UTF-8, ISO 8859-x, etc.) are derived from ASCII, and are identical with ASCII for all of the characters in the basic execution character set (which includes all of the characters legal in traditional C). So for "UNB", you're likely safe (but for something like "üéâ", you almost certainly aren't).
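A small illustration of why ASCII-only needles are safe; the non-ASCII byte values are written out explicitly so the example does not depend on the source encoding:
#include <cstdio>
#include <cstring>

int main()
{
    // A UTF-8 encoded buffer: "\xC3\xA9" is the two-byte encoding of 'é'.
    const char buf[] = "prefix UNB+\xC3\xA9 suffix";

    // "UNB" uses only characters from the basic execution character set,
    // so its byte values are the same in ASCII and in UTF-8.
    const char* p = std::strstr(buf, "UNB");
    std::printf("%s\n", p ? "found" : "not found");   // prints "found"
}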
Your string constant ("UNB") is encoded in the source file's encoding, so it must match the encoding of your buffer.
Both string parameters must be in the same encoding. For string literals that is the encoding of the C++ source file (the platform encoding). For Unicode/UTF-8 the function has another problem: Unicode has accented letters with diacritics, but these can also be encoded as a base letter plus a combining diacritic symbol; é can be one code point [é] or two: [e] + [combining ´]. Normalisation exists to deal with this.
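A sketch of that normalisation pitfall: both buffers below display as "café", but strstr() compares bytes, so the precomposed needle is not found in the decomposed haystack.
#include <cstdio>
#include <cstring>

int main()
{
    const char precomposed[] = "caf\xC3\xA9";    // é as one code point, U+00E9
    const char decomposed[]  = "cafe\xCC\x81";   // 'e' followed by U+0301 combining acute

    // Byte-wise search: the two spellings of é do not match.
    const char* p = std::strstr(decomposed, precomposed);
    std::printf("%s\n", p ? "found" : "not found");   // prints "not found"
}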
For Java it is becoming common practice (a very quiet development) to explicitly set the source encoding to UTF-8. For C++ projects I am not aware of such a convention becoming widespread.
strstr should work without a problem on UTF-8 encoded Unicode characters.
With this function, the data is simply treated as bytes in an ASCII-compatible encoding.

How to UTF-8 encode a character/string

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL-encodes a standard string, which works perfectly for all special characters such as !@#$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much just converts the character to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
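A minimal Windows-only sketch of that call, with error handling trimmed; on Windows wchar_t strings hold UTF-16, which is what CP_UTF8 converts from:
#include <windows.h>
#include <string>

// Converts a UTF-16 wide string to a UTF-8 std::string using the Win32 API.
std::string to_utf8(const std::wstring& wide)
{
    if (wide.empty())
        return std::string();

    // First call: ask for the required buffer size in bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                  NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');

    // Second call: perform the actual conversion.
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &utf8[0], len, NULL, NULL);
    return utf8;
}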
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. wcstombs(), as Martin's answer says, may be able to handle the second part of this conversion, but this is system-dependent. However, since Latin-1's code points coincide with the first 256 Unicode code points, conversion from that source encoding can skip the wide-character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
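Once you have UTF-8 bytes, the percent-escaping the question asks about is mechanical: escape every byte that isn't an unreserved URL character. A rough sketch (the exact set of characters left unescaped is an assumption; adjust it to the URL component you are encoding):
#include <cctype>
#include <cstdio>
#include <string>

// Percent-encodes a UTF-8 byte string, e.g. "bar \xC3\xA9" -> "bar%20%C3%A9".
std::string url_encode_utf8(const std::string& utf8)
{
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);                   // unreserved: copy as-is
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);   // e.g. 0xC3 -> "%C3"
            out += buf;
        }
    }
    return out;
}

int main()
{
    std::printf("%s\n", url_encode_utf8("bar \xC3\xA9").c_str());   // bar%20%C3%A9
}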

Distinguishing between string formats

Having an untyped pointer pointing to some buffer which can hold either ANSI or Unicode string, how do I tell whether the current string it holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark), there is no foolproof way to detect whether a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses whether a string is ANSI or Unicode, but then you run into the problem that you are forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by providing an ANSI/Unicode flag or something similar. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF8 or UCS2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then push the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
In general you can't.
You could check the pattern of zero bytes: just one at the end probably means an ANSI C string, a zero in every other byte probably means ANSI-range text stored as UTF-16, and three zeros out of every four bytes might be UTF-32.
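A rough sketch of that zero-pattern heuristic; it needs a length, the thresholds are arbitrary, and as the other answers stress it is only a guess, not a detection:
#include <cstddef>

enum GuessedEncoding { GUESS_ANSI, GUESS_UTF16, GUESS_UTF32, GUESS_UNKNOWN };

// Guesses an encoding from the proportion of zero bytes in the buffer.
GuessedEncoding guess_from_zeros(const unsigned char* data, std::size_t len)
{
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (data[i] == 0)
            ++zeros;

    if (zeros == 0)           return GUESS_ANSI;    // no embedded zeros at all
    if (zeros >= 3 * len / 4) return GUESS_UTF32;   // roughly three zero bytes per character
    if (zeros >= len / 3)     return GUESS_UTF16;   // roughly every other byte is zero
    return GUESS_UNKNOWN;
}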

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char * strings, as @quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves); it is furthermore careful that bytes with those values are never used as part of the encoding of the multi-byte sequences for code points >= 128. So if you see a byte == 60, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 strings lexicographically by bytes, you get the same answer as sorting them lexicographically by code points, despite the variation in the number of bytes used, because the lead bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
UTF-8 is compatible with 7-bit ASCII. If the value of a byte is larger than 127, it means a multi-byte character starts. Depending on the value of the first byte you can see how many bytes the character will take; that can be 2-4 bytes including the first byte (technically 5 or 6 are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: UTF-8 and Unicode FAQ; the Wikipedia page for UTF-8 is also very informative. Since UTF-8 is char-based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
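A small sketch of reading that length from the lead byte (it reports invalid lead bytes, including the no-longer-valid 5- and 6-byte forms, as 0):
// Returns how many bytes the UTF-8 sequence starting with this lead byte
// occupies, or 0 if it is not a valid lead byte.
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;   // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // continuation byte or invalid lead
}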
By using between 1 and 4 chars to encode one Unicode code point.