Detecting Multi-Byte Character Encodings - c++

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect when the matcher halted, that is, to detect prefix match ranges for a given set of possible encodings.

ICU does character set detection. You must note that, as the ICU documentation states:
This is, at best, an imprecise operation using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language.

If the input is only ASCII, there's no way to detect what encoding should have been assumed had there been any high-bit-set bytes in the stream. You may as well just pick UTF-8 in that case.
As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant is in use. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
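A structural validity check is enough for that UTF-8-then-fallback approach. This is only a sketch: it checks lead and continuation byte patterns and nothing else (it does not reject overlong forms or surrogate code points):

#include <cstddef>

// Returns true if the buffer looks like well-formed UTF-8 (structurally).
// If it fails, treating the data as ISO-8859-1 is a common fallback.
bool looksLikeUtf8(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        std::size_t len;
        if      (c < 0x80)           len = 1;   // ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;   // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;   // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;   // 11110xxx
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > n) return false;          // truncated sequence at end of buffer
        for (std::size_t k = 1; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;  // continuation must be 10xxxxxx
        i += len;
    }
    return true;
}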

In general, there is no way to detect the character encoding unless the text carries some special mark denoting the encoding. You could heuristically detect an encoding using dictionaries that contain words with characters that are only present in some encodings.
This can of course only be a heuristic and you need to scan the whole text.
Example: "an English text can be written in multiple encodings". This sentence can be written for example using a German codepage. It's indistinguishable from most "western" encodings (including UTF-8) unless you add some special characters (like ä) that are not present in ASCII.

Related

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case-insensitive comparison without worrying about how the string is encoded, or about the type of encoding involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without being concerned about whether the right locale is set, and so forth.
Use case: Suppose my application receives a string (let's say it's a file name) initially as "Zoë Saldaña.txt" and it stores it as is. Subsequently, it receives another string "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, using a suitable API. Same with the file names "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
If ICU provides a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
If the answer to 1. is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations which work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above isn't just true for all programming languages; it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode, it's even worse: there are number sequences of different lengths that are still equivalent (precomposed and decomposed characters, in particular).
Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: cp850, used as the OEM codepage for Windows in the Western European zone, and the well-known ISO-8859-1 (also known as Latin-1, not very different from the Windows-1252 ANSI character set).
In cp850, 0x96 is 'û', 0xCA is '╩', and 0xEA is 'Û'.
In Latin-1, 0x96 is non-printable(*), 0xCA is 'Ê', and 0xEA is 'ê'.
So if the string is cp850-encoded, 0xEA should compare equal (case-insensitively) to 0x96, and 0xCA is a different character;
but if the string is Latin-1-encoded, 0xEA should compare equal to 0xCA, and 0x96 is a control character.
You could find similar examples with other ISO-8859-x encodings, but I only speak of the languages I know.
(*) In Windows-1252, 0x96 is '–' (Unicode character U+2013), not related to 'ê'.
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, such as network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2,000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (and most likely vice versa as well, but I didn't try it).
At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are a large number where locale-neutral is actually preferable, especially now, when folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
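If you would rather not generate the fold table yourself, ICU already exposes the simple, locale-neutral fold as u_foldCase(). A minimal sketch of a caseless comparison built on it, assuming the input has already been decoded to UTF-32 and normalised (the function name is mine):

#include <unicode/uchar.h>    // u_foldCase: simple, locale-neutral case folding
#include <unicode/ustring.h>  // U_FOLD_CASE_DEFAULT
#include <cstddef>
#include <string>

bool caselessEqual(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;  // simple folding is 1:1, so lengths must match
    for (std::size_t i = 0; i < a.size(); ++i)
        if (u_foldCase(a[i], U_FOLD_CASE_DEFAULT) != u_foldCase(b[i], U_FOLD_CASE_DEFAULT))
            return false;
    return true;
}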
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt
Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII are not able to encode even a simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol, or even emoji (conceptually, they are similar to ideograms). The majority of the world's population does not use Latin characters to write text.
UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still does it wrong. (For example, NTFS has a long-standing reported bug that allows a directory to contain two files with names that look exactly the same but are encoded with different normal forms. I get this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated because, unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this in 2001 for Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard has the algorithm). Please do not reinvent the wheel: use ICU's normalisation and comparison facilities. They have been extensively tested and they work correctly. Use them; IBM has made them available gratis.
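As an illustration of that advice, here is a minimal sketch using ICU's C++ API: normalise both inputs to NFC, then compare with default (locale-neutral) case folding. The helper name is mine, not ICU's:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <unicode/ustring.h>  // U_FOLD_CASE_DEFAULT
#include <string>

// Returns true if the two UTF-8 strings are equal ignoring case,
// after bringing both to the same normal form (NFC).
bool equalsIgnoreCase(const std::string& a, const std::string& b) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;
    icu::UnicodeString ua = nfc->normalize(icu::UnicodeString::fromUTF8(a), status);
    icu::UnicodeString ub = nfc->normalize(icu::UnicodeString::fromUTF8(b), status);
    return U_SUCCESS(status) && ua.caseCompare(ub, U_FOLD_CASE_DEFAULT) == 0;
}

With this, "Zoë Saldaña.txt" and "zoë saLdañA.txt" compare equal even if one source used a precomposed ë and the other a combining diaeresis.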
A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent and highly influenced by the language and the context. For example, in a dictionary these Portuguese words would have this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical; or rather, its rules are not simple.
C'est la vie. これが私たちの世界です。
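For locale-aware ordering of that kind, ICU's Collator is the usual tool. A small sketch, assuming a Brazilian Portuguese locale; the helper name is mine:

#include <unicode/coll.h>
#include <unicode/unistr.h>
#include <memory>

// Compares two strings according to Portuguese collation rules.
bool lessThanPtBR(const icu::UnicodeString& a, const icu::UnicodeString& b) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale("pt", "BR"), status));
    if (U_FAILURE(status)) return false;
    return coll->compare(a, b, status) == UCOL_LESS;
}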

How to search a non-ASCII character in a c++ string?

string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
Comparing chars like this is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. If you don't have it, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls Unicode) and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with a BOM will also work. With any other encoding, your L"..." string literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as a single wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a single char representing the arrow character. You can, however, do it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
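Putting the two parts together, a small self-contained sketch; the literal spells out the question's expression in UTF-8 hex escapes so the compiler cannot mangle it:

#include <iostream>
#include <string>

int main() {
    // "x1→(y1⊕y2)∧z3" written as UTF-8 byte escapes
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";

    std::size_t pos = s.find("\xe2\x86\x92");   // search for the UTF-8 bytes of '→'
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << '\n';  // prints 2
    return 0;
}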
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But your problems will only begin there.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain the combination U+0065 U+0301, which is actually exactly the same character.
Yes, not just a character that looks the same, but exactly the same, so any software and even some programming libraries will freely convert from one to the other without even telling you.
So if you wish to make a search that is robust, you will need something that handles not just the different encodings of Unicode, but Unicode characters themselves, with equality between combining sequences and precomposed characters.
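One way to get that equality is to normalise both the text and the search pattern to the same form before searching. A rough sketch using ICU (the helper name is mine; the returned index is in UTF-16 code units):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Finds `needle` in `haystack` treating U+00E9 and U+0065 U+0301 as equal,
// by normalising both to NFC first. Returns -1 if not found or on error.
int32_t findNormalized(const icu::UnicodeString& haystack,
                       const icu::UnicodeString& needle) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return -1;
    icu::UnicodeString h = nfc->normalize(haystack, status);
    icu::UnicodeString n = nfc->normalize(needle, status);
    return U_FAILURE(status) ? -1 : h.indexOf(n);
}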

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF-8 is variable length. When I use the at or substr string methods, would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 is variable length, all indexing is done in code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need a fixed-length encoding, like UTF-32. For that you can use the U prefix on string literals.
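A small sketch of the two prefixes side by side (this assumes a pre-C++20 compiler, where u8 literals still have type char const[N]):

#include <cstddef>

const char     utf8[]  = u8"\uFFFD";  // UTF-8: bytes EF BF BD 00 -> char const[4]
const char32_t utf32[] = U"\uFFFD";   // UTF-32: one code point per element -> char32_t const[2]

static_assert(sizeof(utf8)  == 4, "three UTF-8 code units plus the terminator");
static_assert(sizeof(utf32) == 2 * sizeof(char32_t), "one code point plus the terminator");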
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, so UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals, the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally; instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that the character U+FFFD cannot be represented in your narrow execution character set (code page 1252); as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you may well get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at you could end up with a single byte out of a multi-byte sequence; with substr you could cut a sequence in half and end up with an invalid UTF-8 string (when decoded it would typically show up starting or ending with the replacement character �, U+FFFD, the same one you're apparently trying to use, and the broken character would be lost).
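A tiny sketch of that pitfall (again assuming pre-C++20 u8 literal types):

#include <cassert>
#include <string>

int main() {
    std::string s = u8"a\u00E9b";         // 'a', U+00E9 (two bytes in UTF-8), 'b'
    assert(s.size() == 4);                // size() counts bytes (code units), not characters
    assert(s.at(1) == '\xC3');            // at(1) is only the lead byte of the é sequence
    std::string broken = s.substr(0, 2);  // cuts the é in half: not valid UTF-8 on its own
    (void)broken;
    return 0;
}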
I would recommend that you use wchar_t to store Unicode strings. Since that type is at least 16 bits, many more characters can fit in a single "unit".

Distinguishing between string formats

Given an untyped pointer to a buffer that can hold either an ANSI or a Unicode string, how do I tell whether the string it currently holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
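For completeness, this is roughly what calling it looks like; passing NULL as the last argument asks IsTextUnicode to run all of its tests, and the return value is still only a guess:

#include <windows.h>

// Heuristically decides whether the buffer looks like UTF-16 ("Unicode" in Win32 terms).
bool probablyUtf16(const void* buf, int byteCount) {
    return IsTextUnicode(buf, byteCount, NULL) != 0;
}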
Why do you have an untyped pointer to a string in the first place? You must know exactly what your data represents and how, either by using a typed pointer in the first place or by providing an ANSI/Unicode flag or something similar. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF8 or UCS2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then push the work of deciding onto your client. At some point in the calling hierarchy, someone should know the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
In general, you can't.
You could check for the pattern of zeros: just one at the end probably means an ANSI C string; every other byte being zero probably means ANSI text stored as UTF-16; three zeros out of every four bytes might be UTF-32.
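That zero-byte heuristic is easy to sketch; it is purely a guess and assumes mostly-ASCII text (the names below are mine):

#include <cstddef>

enum class GuessedWidth { Ansi8, Utf16Like, Utf32Like, Unknown };

GuessedWidth guessWidth(const unsigned char* p, std::size_t n) {
    if (n == 0) return GuessedWidth::Unknown;
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0) ++zeros;
    if (zeros == 0) return GuessedWidth::Ansi8;              // plain 8-bit text
    if (zeros * 4 >= n * 3) return GuessedWidth::Utf32Like;  // ~3 of every 4 bytes are zero
    if (zeros * 2 >= n) return GuessedWidth::Utf16Like;      // ~every other byte is zero
    return GuessedWidth::Unknown;
}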

MFC: what would be the regex to check if a character is unicode or not?

I'm trying to use Windows' API IsTextUnicode to check if a character input is Unicode or not, but it is sort of buggy. I figured it might be better to use a regex. However, I'm new to constructing regular expressions. What would be the regex to check if a character is Unicode or not?
Thanks...
Well, that depends what you mean by ‘Unicode’. As the answers so far say, pretty much any character “is Unicode”.
Windows abuses the term ‘Unicode’ to mean the UTF-16LE encoding that the Win32 API uses internally. You can detect UTF-16 by looking for the Byte Order Mark at the front, bytes FF FE for UTF-16LE (or FE FF for UTF-16BE). It's possible to have UTF-16 text that is not marked with a BOM, but that's quite bad news as you can only detect it by pure guesswork.
Pure guesswork is what the IsTextUnicode function is all about. It looks at the input bytes and, by seeing how often common patterns turn up in it, guesses how likely it is that the bytes represent UTF-16LE or UTF-16BE-encoded characters. Since every sequence of bytes is potentially a valid encoding of characters(*), you might imagine this isn't very predictable or reliable. And you'd be right.
See Windows i18n guru Michael Kaplan's description of IsTextUnicode and why it's probably not a good idea.
In general you would want a more predictable way of guessing what encoding a set of bytes represents. You could try the following (sketched in code after the footnote below):
if it begins FF FE, it's UTF-16LE, what Windows thinks of as ‘Unicode’;
if it begins FE FF, it's UTF-16BE, what Windows equally-misleadingly calls ‘reverse’ Unicode;
otherwise check the whole string for invalid UTF-8 sequences. If there are none, it's probably UTF-8 (or just ASCII);
otherwise try the system default codepage.
(*: actually not quite true. Apart from the never-a-characters like U+FFFF, there are also many sequences of UTF-16 code units that aren't valid characters, thanks to the ‘surrogates’ approach to encoding characters outside the 16-bit range. However IsTextUnicode doesn't know about those anyway, as it predates the astral planes.)
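A rough sketch of that decision list. The UTF-8 check here is structural only (lead and continuation bytes); it does not reject overlong forms or surrogates:

#include <cstddef>
#include <string>

std::string guessEncoding(const unsigned char* p, std::size_t n) {
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";  // BOM FF FE
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";  // BOM FE FF

    bool validUtf8 = true;
    for (std::size_t i = 0; i < n && validUtf8; ) {
        unsigned char c = p[i];
        std::size_t len = c < 0x80 ? 1 : (c & 0xE0) == 0xC0 ? 2
                        : (c & 0xF0) == 0xE0 ? 3 : (c & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || i + len > n) { validUtf8 = false; break; }
        for (std::size_t k = 1; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) { validUtf8 = false; break; }
        i += len;
    }
    if (validUtf8) return "UTF-8";       // also covers plain ASCII
    return "system default codepage";    // last resort
}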
Every character you'll encounter is part of Unicode. For instance, Latin 'a' is U+0061. This is especially true on Windows, which natively uses Unicode and the UTF-16 encoding.
The Microsoft function IsTextUnicode is named rather unfortunately. It could more accurately be described as GuessTextEncodingFromRawBytes(). I suspect that your real problem is not the interpretation of raw bytes, since you already know it's one character.
I think you're mixing up two different concepts. A character and its encoding are not the same. Some characters (like A) are encoded identically in ASCII or latin-1 and UTF-8, some aren't, some can only be encoded in UTF-8 etc.
IsTextUnicode() tries to guess the encoding from a stream of raw bytes.
If, on the other hand, you already have a character representation, and you wish to find out whether it can be natively expressed as ASCII or latin-1 or some other encoding, then you could indeed look at the character range ([\u0000-\u007F] for ASCII).
Lastly, there are some invalid codes (like \uFFFE) which are possible byte representations that are not allowed as Unicode characters. But I don't think this is what you're looking for.