Iterating through Unicode codepoints character by character - C++

I've got a series of Unicode codepoints. What I really need is to iterate through these codepoints as a series of characters, not a series of codepoints, and determine properties of each individual character, e.g. whether it is a letter.
For example, imagine that I was writing a Unicode-aware textbox, and the user entered a Unicode character that takes more than one codepoint, for example "e with diacritic". I know that this specific character can also be represented as a single codepoint and can be normalized to that form, but I don't think that's possible in the general case. How could I implement backspace? It obviously can't just erase the last codepoint, because the character the user just entered might consist of more than one codepoint.
How can I iterate over a bunch of Unicode codepoints as characters?
Edit: The Break Iterators offered by ICU appear to be pretty much what I need. However, I'm not using ICU, so any references on how to implement my own equivalent functionality would be an accepted answer.
Another edit: It turns out that the Windows API does indeed offer this functionality. MSDN just isn't very good about putting all the string functions in one place. CharNext is the function I'm looking for.
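For reference, a minimal sketch of walking a string with it (assuming the wide variant, CharNextW):
#include <windows.h>

// Step through a string one "character" at a time. CharNextW advances past
// some multi-codepoint sequences that a raw pointer increment would split.
void walkCharacters(LPCWSTR s) {
    for (LPCWSTR p = s; *p != L'\0'; ) {
        LPCWSTR next = CharNextW(p);
        // The current character occupies the range [p, next).
        p = next;
    }
}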

Use the ICU library.
http://site.icu-project.org/
For example:
http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90
is the function that returns the character at a particular character offset in a string.
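If you are iterating rather than indexing, ICU's character BreakIterator splits a string into grapheme clusters (user-perceived characters). A minimal sketch, assuming ICU's C++ headers:
#include <memory>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

void forEachGrapheme(const icu::UnicodeString& text) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return;
    bi->setText(text);
    // Each [start, end) range is one user-perceived character (grapheme cluster);
    // backspace would erase one whole range rather than one codepoint.
    int32_t start = bi->first();
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
         start = end, end = bi->next()) {
        icu::UnicodeString grapheme = text.tempSubStringBetween(start, end);
        // inspect or classify the grapheme here
    }
}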

The UTF8-CPP project has a bunch of clean, easy to read, STL-like algorithms to iterate over Unicode strings codepoint by codepoint, character by character, etc. You can look into that for inspiration.
Note that the "character by character" approach is not obvious. One straightforward way is to iterate over a UTF-32 string in normalization form C: UTF-32 gives you fixed-length codepoints, and NFC composes many (though not all) combining sequences into single codepoints.
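For the codepoint-level part, a sketch with UTF8-CPP (assuming its utf8.h header); utf8::next decodes one codepoint and advances the iterator:
#include <cstdint>
#include <string>
#include "utf8.h"  // UTF8-CPP

void forEachCodepoint(const std::string& s) {
    auto it = s.begin();
    while (it != s.end()) {
        std::uint32_t cp = utf8::next(it, s.end());  // one codepoint per step
        // Classify cp here; note that combining marks still arrive as separate
        // codepoints, so grouping into characters remains a separate step.
        (void)cp;
    }
}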

Properly checking for palindromes using UTF-8 strings in C++

While trying to answer the question How to use enqueu, dequeue, push, and peek in a Palindrome?, I suggested that a palindrome can be detected using std::string like this:
bool isPalindrome(const std::string& str)
{
    return std::equal(str.begin(), str.end(), str.rbegin(), str.rend());
}
For a Unicode string, I suggested:
bool isPalindrome(const std::u8string& str)
{
    std::u8string rstr{str};
    std::reverse(rstr.begin(), rstr.end());
    return str == rstr;
}
I now think this will create problems when the string contains multibyte characters, because the byte order within each multibyte character is also reversed. Also, some characters are equivalent to each other in different locales. Therefore, in C++20:
how do you make the comparison robust to multibyte characters?
how do you make the comparison robust to different locales when multiple characters can be equivalent to each other?
Reversing a Unicode string becomes non-trivial. Converting from UTF-8 to UTF-32/UCS-4 is a good start, but not sufficient by itself: Unicode also has combining code points, so two (or more) consecutive code points can form a single resulting grapheme (the added code points apply diacritic marks to the base character), and for things to work correctly you need to keep these in their original order.
So, basically instead of code points, you need to divide the input up into a series of graphemes, and reverse the order of the graphemes, not just the code points.
To deal with multiple different sequences of code points that represent the same sequence of characters, you normally want to do normalization. There are four different normalization forms. In this case, you'd probably want to use NFC or NFD (should be equivalent for this purpose). The NFKC/NFKD forms are primarily for compatibility with other character sets, which it sounds like you probably don't care about.
This can also be non-trivial though. For one well-known example, consider the German character "ß". It is sort of equivalent to "ss", but traditionally only exists in lower case, since it never occurs at the beginning of a word. So there is probably room for argument about whether something like Ssaß is a palindrome or not (for the moment ignoring the minor detail that it's not actually a word). For palindromes, most people ignore letter case, so it would be; but the code in the question treats case as significant, which it probably shouldn't.
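Here is a sketch of the grapheme-aware comparison using ICU (my assumption; the question doesn't prescribe a library). It normalizes to NFC, collects grapheme-cluster boundaries, and compares clusters from both ends, still treating case as significant:
#include <memory>
#include <vector>
#include <unicode/brkiter.h>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

bool isUnicodePalindrome(const icu::UnicodeString& input) {
    UErrorCode status = U_ZERO_ERROR;
    // Normalize so composed and decomposed forms of a character compare equal.
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;
    icu::UnicodeString s = nfc->normalize(input, status);

    // Collect grapheme-cluster boundaries.
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return false;
    bi->setText(s);
    std::vector<int32_t> b;
    for (int32_t pos = bi->first(); pos != icu::BreakIterator::DONE; pos = bi->next())
        b.push_back(pos);

    // Compare grapheme i from the front with grapheme g-1-i from the back.
    std::size_t g = b.empty() ? 0 : b.size() - 1;  // number of graphemes
    for (std::size_t i = 0; i < g / 2; ++i) {
        if (s.tempSubStringBetween(b[i], b[i + 1]) !=
            s.tempSubStringBetween(b[g - 1 - i], b[g - i]))
            return false;
    }
    return true;
}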

Convert UTF-16 (wchar_t on Windows) to UTF-32

I have a string of characters given to me by a Windows API function (GetLocaleInfoEx with LOCALE_SLONGDATE) as wchar_t. Is it correct to say that the value returned from Windows will be UTF-16, and that therefore it may not be one wchar_t, one "printable character"?
To make writing my parser easier, is there a function I can use to convert from UTF-16 to UTF-32, where I'll be guaranteed (I assume) that one array element represents one character?
where I'll be guaranteed (I assume) that one array element represents one character?
That's not how Unicode works. One codepoint (an array element in UTF-32) does not necessarily map to a single visible character. Multiple codepoints can combine to form a character thanks to features like Unicode combining characters.
You have to do genuine Unicode analysis if you want to be able to know how many visible characters a Unicode string has.
Even with dates (particularly the long-form dates you asked for), you are not safe from such features. The locale can return arbitrary Unicode strings, so the number of codepoints alone tells you nothing about how many visible characters the string contains.
Looking at the documentation for LOCALE_SLONGDATE it is stated that any characters other than the format pictures must be enclosed in single quotes. So in this particular case converting to UTF-32 should indeed solve your problem (but see proviso below).
By the same token, though, you don't need to. The only UTF-16 code units that don't represent a whole codepoint on their own are the surrogates, and none of those can be mistaken for a single quote. So to separate the format pictures from the surrounding text, you just need to scan the UTF-16 string for single quotes. (The same is even true of UTF-8: the only byte that looks like a single quote is a single quote.)
Any surrogate pairs, combining characters, or other complications should always be safely tucked away inside the substrings thus delimited. Provided you never attempt to subdivide the substrings themselves, you should be safe.
Proviso: the documentation does not indicate whether it is permissible to combine a single quote mark with a combining character in a locale, and if so, how it will be interpreted. I interpret that as meaning that such a combination is not allowed. In any case, it seems unlikely that Windows itself would go to the trouble of dealing with such an unnecessary complication. So it should be safe enough to ignore this case too, but YMMV.
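For completeness, if you do want the conversion the title asks for, a hand-rolled decoder is short. This is a minimal sketch assuming Windows's 16-bit wchar_t; it passes unpaired surrogates through unchanged instead of reporting an error:
#include <cstddef>
#include <string>

std::u32string utf16ToUtf32(const std::wstring& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = static_cast<char32_t>(in[i]);
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
            char32_t low = static_cast<char32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {  // valid surrogate pair
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // consumed two UTF-16 code units
            }
        }
        out.push_back(c);
    }
    return out;
}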

How to search a non-ASCII character in a c++ string?

string s = "x1→(y1⊕y2)∧z3";
for (auto i = s.begin(); i != s.end(); i++) {
    if (*i == '→') {
        ...
    }
}
Comparing a single char like this is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode; if you don't have it yet, read up. I quite like this post on Joel on Software.
You actually have two problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls "Unicode") and use wchar_t and std::wstring (effectively encoding the expression as UTF-16), or save it as UTF-8 with a BOM, which will also work. With any other encoding, your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters, like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support using UTF-8. You can go through this question for an overview: How to create a UTF-8 string literal in Visual C++ 2008.
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since it is representable as a single wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can, however, do it with a string: "\xe2\x86\x92". (That's a string with 3 chars for the arrow, plus the \0 terminator.)
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
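Putting the pieces together, a complete example with the expression spelled out in \x escapes so it survives any source encoding:
#include <iostream>
#include <string>

int main() {
    // "x1→(y1⊕y2)∧z3" as explicit UTF-8 bytes
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)\xe2\x88\xa7z3";
    std::size_t pos = s.find("\xe2\x86\x92");  // byte offset of the arrow
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << '\n';
}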
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But that is only where your problems begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Say you are looking for the character 'é'. You find it in Unicode as U+00E9 and search for that, but this is not guaranteed to be the only way to represent the character: the document may also contain the combination U+0065 U+0301, which is exactly the same character.
Yes, not just a character that looks the same, but exactly the same, so any software, and even some programming libraries, will freely convert from one form to the other without telling you.
So if you want a robust search, you need something that handles not just the different Unicode encodings, but Unicode characters themselves, with equality between composed and precomposed forms.
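As a sketch of what that looks like in practice (using ICU here as one library that models this; my choice, not something the question requires), normalize both strings to the same form before searching:
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Returns the match offset within the NFC-normalized haystack, or -1 if absent.
int32_t findNormalized(const icu::UnicodeString& haystack,
                       const icu::UnicodeString& needle) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return -1;
    icu::UnicodeString h = nfc->normalize(haystack, status);
    icu::UnicodeString n = nfc->normalize(needle, status);
    // U+00E9 and U+0065 U+0301 now have identical representations.
    return h.indexOf(n);
}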

Get number of characters in string?

I have an application accepting a UTF-8 string of a maximum of 255 characters.
If the characters are all ASCII, the number of characters equals the size in bytes.
If the characters are not all ASCII and the string contains, for example, Japanese letters, how can I get the number of characters given the size in bytes?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen to count the length, or use mbstowcs.
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
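Both techniques from the quote are short to sketch. The byte-counting version assumes valid UTF-8; the mbstowcs version assumes a UTF-8 locale is active (the locale name below is an assumption and varies by platform):
#include <clocale>
#include <cstdlib>

// Hard-wired UTF-8: count every byte outside the continuation range 0x80-0xBF.
int countUtf8Chars(const char* data, int bytes_no) {
    int char_no = 0;
    for (int i = 0; i < bytes_no; ++i)
        if ((static_cast<unsigned char>(data[i]) & 0xC0) != 0x80)
            ++char_no;
    return char_no;
}

// Portable variant: let the C library count multibyte characters.
int countWithLocale(const char* data) {
    std::setlocale(LC_ALL, "en_US.UTF-8");  // assumed locale name
    return static_cast<int>(std::mbstowcs(nullptr, data, 0));  // -1 on invalid input
}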
You can save a Unicode character in a wide char (wchar_t), but note that on Windows wchar_t is only 16 bits, so it cannot hold codepoints above U+FFFF on its own.
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. As the smallest component of written language that has semantic value (the first meaning), á is a single character. If you take á and count the basic units of encoding for the Unicode character encoding (the third meaning) in it, you may get either one or two, depending on which exact representation (normalized or denormalized) is being used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters (meaning 3). mblen is one method of doing that, provided your current locale has a UTF-8 encoding. Modern C++ offers more C++-ish methods; however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU, which you may want to consider if your needs are much more complicated than counting characters.

Detecting Multi-Byte Character Encodings

What C/C++ libraries are there for detecting the multi-byte character encoding (UTF-8, UTF-16, etc.) of a character array (char*)? A bonus would be to also detect where the matcher halted, that is, to detect prefix match ranges for a given set of possible encodings.
ICU does character set detection. You must note that, as the ICU documentation states:
This is, at best, an imprecise operation using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that's mostly in a single language.
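A minimal sketch of that API (ICU's C charset-detection interface, ucsdet), with error handling pared down:
#include <cstdio>
#include <unicode/ucsdet.h>

void detectEncoding(const char* data, int32_t length) {
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* csd = ucsdet_open(&status);
    ucsdet_setText(csd, data, length, &status);
    const UCharsetMatch* match = ucsdet_detect(csd, &status);  // best guess, or NULL
    if (match != nullptr && U_SUCCESS(status))
        std::printf("%s (confidence %d)\n",
                    ucsdet_getName(match, &status),
                    ucsdet_getConfidence(match, &status));
    ucsdet_close(csd);
}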
If the input is pure ASCII, there's no way to tell what the encoding would have been had there been any high-bit-set bytes in the stream. You may as well just pick UTF-8 in that case.
As for UTF-8 vs. ISO-8859-x, you could try parsing the input as UTF-8 and fall back to ISO-8859 if the parse fails, but that's about it. There's not really a way to detect which ISO-8859 variant you have. I'd recommend looking at the way Firefox tries to auto-detect, but it's not foolproof and probably depends on knowing the input is HTML.
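The parse-as-UTF-8-first idea can be sketched with a structural validity check; this minimal version does not reject overlong encodings or surrogate values:
#include <cstddef>

bool looksLikeUtf8(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ) {
        std::size_t len;
        if (p[i] < 0x80)              len = 1;  // ASCII
        else if ((p[i] >> 5) == 0x6)  len = 2;  // 110xxxxx
        else if ((p[i] >> 4) == 0xE)  len = 3;  // 1110xxxx
        else if ((p[i] >> 3) == 0x1E) len = 4;  // 11110xxx
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j)
            if ((p[i + j] & 0xC0) != 0x80) return false;  // not a continuation byte
        i += len;
    }
    return true;  // if this returns false, fall back to an ISO-8859 guess
}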
In general, there is no way to detect the character encoding, except when the text carries some special mark denoting it. You could heuristically detect an encoding using dictionaries that contain words with characters that are present only in some encodings.
This can of course only be a heuristic, and you need to scan the whole text.
Example: the sentence "an English text can be written in multiple encodings" could be written using a German codepage, and it would be indistinguishable from most "Western" encodings (including UTF-8) unless it contained special characters (like ä) that are not present in ASCII.