In ICU UnicodeString, what is the difference between countChar32() and length()?

From the docs:
The length is the number of UChar code units in the UnicodeString. If you want the number of code points, please use countChar32().
and
Count Unicode code points in the length UChar code units of the string.
A code point may occupy either one or two UChar code units. Counting code points involves reading all code units.
From this I am inclined to think that a code point is an actual character and a code unit is just one possible part of a character.
For example, say you have a Unicode string like:
'foobar'
Both length() and countChar32() will be 6. Then say you have a string composed of 6 characters that each take the full 32 bits to encode: length() would be 12, but countChar32() would be 6.
Is this correct?

The two values will only differ if you use characters outside the Basic Multilingual Plane (BMP). These characters are represented in UTF-16 as surrogate pairs: two 16-bit code units make up one logical character. If you use any of these, each pair counts as one toward countChar32() but two toward length().
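For instance, a minimal sketch using ICU (assuming the ICU headers and library are available; U+1F600 below lies outside the BMP):

#include <iostream>
#include <unicode/unistr.h>  // icu::UnicodeString

int main() {
    // All-BMP text: one UChar code unit per code point.
    icu::UnicodeString ascii = icu::UnicodeString::fromUTF8("foobar");
    std::cout << ascii.length() << " " << ascii.countChar32() << "\n";  // 6 6

    // U+1F600 GRINNING FACE lies outside the BMP, so UTF-16 stores it
    // as a surrogate pair: two code units, one code point.
    icu::UnicodeString emoji = icu::UnicodeString::fromUTF8("\xF0\x9F\x98\x80");
    std::cout << emoji.length() << " " << emoji.countChar32() << "\n";  // 2 1
}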

Related

Count number of actual characters in a std::string (not chars)?

Can I count the number of 'characters' that a std::string contains, rather than the number of bytes? For instance, std::string::size and std::string::length return the number of bytes (chars):
std::string m_string1 {"a"};
// This is 1
m_string1.size();
std::string m_string2 {"їa"};
// This is 3 because of Unicode
m_string2.size();
Is there a way to get the number of characters? For instance, to determine that m_string2 has 2 characters.
It is not possible to count "characters" in a Unicode string with anything in the C++ standard library in general. It isn't clear what exactly you mean by "character" to begin with, and the closest you can get is counting code points by using UTF-32 literals and std::u32string. However, that isn't going to match what you want even for їa.
For example ї may be a single code point
ї CYRILLIC SMALL LETTER YI (U+0457)
or two consecutive code points
і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)
If you don't know that the string is normalized, then you can't distinguish the two with the standard library, and there is no way to force normalization either. Even for UTF-32 string literals it is up to the implementation which one is chosen. You will get 2 or 3 for the string їa when counting code points.
And that isn't even considering the encoding issue that you mention in your question. Each code point may itself be encoded into multiple code units depending on the chosen encoding, and .size() counts code units, not code points. With std::u32string these two will at least coincide, even if it doesn't help you, as demonstrated above.
You need a Unicode library like ICU if you want to do this properly.
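For illustration, a minimal sketch counting code points with std::u32string, using escape sequences to pin down which representation each literal holds:

#include <iostream>
#include <string>

int main() {
    // With char32_t, one element is one code point, so .size() counts
    // code points rather than bytes or UTF-16 code units.
    std::u32string precomposed = U"\u0457a";        // ї as one code point
    std::u32string decomposed  = U"\u0456\u0308a";  // і + combining diaeresis + a
    std::cout << precomposed.size() << "\n";  // 2
    std::cout << decomposed.size() << "\n";   // 3
}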

C++ test for UTF-8 validation

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}
How can I write incorrect UTF-8 cases in C++?
You can specify individual bytes in the string with the \x escape sequence in hexadecimal form or the \000 escape sequence in octal form.
For example:
std::string str = "\xD0";
which is an incomplete UTF-8 sequence.
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF8 test cases.
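For example, a few classically malformed sequences (assuming the same validate_utf8 as in the question):

TEST(validation, RejectsMalformedUtf8)
{
    EXPECT_FALSE(validate_utf8("\x80"));              // stray continuation byte
    EXPECT_FALSE(validate_utf8("\xD0"));              // truncated 2-byte sequence
    EXPECT_FALSE(validate_utf8("\xE0\x80"));          // truncated 3-byte sequence
    EXPECT_FALSE(validate_utf8("\xC0\xAF"));          // overlong encoding of '/'
    EXPECT_FALSE(validate_utf8("\xED\xA0\x80"));      // UTF-16 surrogate U+D800
    EXPECT_FALSE(validate_utf8("\xF5\x80\x80\x80"));  // above U+10FFFF
    EXPECT_FALSE(validate_utf8("\xFF"));              // 0xFF never appears in UTF-8
}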
In UTF-8, any byte with a most significant bit of 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also 1, then this is the first byte of an MBS; otherwise it is one of the follow-up (continuation) bytes.
In the first byte of an MBS, the number of consecutive high one-bits gives you the number of bytes in the entire sequence, e.g. 0b110xxxxx (with arbitrary values for x) is the start byte of a two-byte sequence.
Theoretically this scheme could produce sequences of up to seven bytes, but valid UTF-8 is limited to four bytes (RFC 3629), enough to cover every code point up to U+10FFFF.
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" would represent the sequence 0b11001000 0b10000101, which is a legal pattern and represents code point 0b01000000101 (note how the leading bits forming the UTF-8 headers are cut away), corresponding to a value of 0x205, or 517 decimal. Whether that is an assigned code point you would need to look up (it is: U+0205); the bit pattern here was formed arbitrarily as an example.
In the same way you can represent longer valid sequences by increasing the number of leading one-bits, joined with the appropriate number of follow-up bytes (note again: the number of initial one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences in which the total number of bytes does not match (too many or too few) the number of initial one-bits.
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might still need to look up which of these code points are actually assigned.
Finally, you might additionally consider composed characters (with diacritics): they can be represented either as a base character followed by combining code points or as a single precomposed (normalized) code point. If you want to go that far, you'd need to look up in the standard which combinations are legal and which normalized code points they correspond to.
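Putting those rules together, here is a minimal validator sketch matching the validate_utf8 used in the question; a sketch, not a hardened implementation:

#include <cstdint>
#include <string>

bool validate_utf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i];
        std::size_t len;
        std::uint32_t cp;
        if (b < 0x80)                 { len = 1; cp = b; }         // ASCII
        else if ((b >> 5) == 0b110)   { len = 2; cp = b & 0x1F; }  // 2-byte lead
        else if ((b >> 4) == 0b1110)  { len = 3; cp = b & 0x0F; }  // 3-byte lead
        else if ((b >> 3) == 0b11110) { len = 4; cp = b & 0x07; }  // 4-byte lead
        else return false;  // stray continuation byte or 5+-byte lead
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j) {
            unsigned char c = s[i + j];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF.
        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}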

Extract (first) UTF-8 character from a std::string

I need to use a C++ implementation of PHP's mb_strtoupper function to imitate Wikipedia's behavior.
My problem is that I want to feed only a single UTF-8 character to the function, namely the first of a std::string.
std::string s("äbcdefg");
mb_strtoupper(s[0]); // this obviously can't work with multi-byte characters
mb_strtoupper('ä'); // works
Is there an efficient way to detect/return only the first UTF-8 character of a string?
In UTF-8, the high bits of the first byte tell you how many subsequent bytes are part of the same code point.
0b0xxxxxxx: this byte is the entire code point
0b10xxxxxx: this byte is a continuation byte - this shouldn't occur at the start of a string
0b110xxxxx: this byte plus the next (which must be a continuation byte) form the code point
0b1110xxxx: this byte plus the next two form the code point
0b11110xxx: this byte plus the next three form the code point
The pattern could in principle continue, but valid UTF-8 never uses more than four bytes to represent a single code point (code points above U+10FFFF are not allowed).
If you write a function that counts the number of leading bits set to 1, then you can use it to figure out where to split the byte sequence in order to isolate the first logical code point, assuming the input is valid UTF-8. If you want to harden against invalid UTF-8, you'd have to write a bit more code.
Another way to do it is to take advantage of the fact that continuation bytes always match the pattern 0b10xxxxxx, so you take the first byte, and then keep taking bytes as long as the next byte matches that pattern.
std::size_t GetFirst(const std::string &text) {
    if (text.empty()) return 0;
    std::size_t length = 1;
    // Keep consuming continuation bytes (0b10xxxxxx) after the lead byte.
    while (length < text.size() &&
           (static_cast<unsigned char>(text[length]) & 0b11000000) == 0b10000000) {
        ++length;
    }
    return length;
}
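Usage might then look like this (with mb_strtoupper as declared in the question):

std::string s("äbcdefg");
std::string first = s.substr(0, GetFirst(s));  // "ä", the first two bytes
mb_strtoupper(first);  // the function now sees exactly one UTF-8 character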
For many languages, a single code point usually maps to a single character. But what people think of as single characters may be closer to what Unicode calls a grapheme cluster, which is one or more code points that combine to produce a glyph.
In your example, ä can be represented in different ways: it could be the single code point U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, or it could be a combination of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Fortunately, just picking the first code point should work for your goal of capitalizing the first letter.
If you really need the first grapheme cluster, you have to look beyond the first code point to see if the next one(s) combine with it. For many languages, it's enough to know which code points are "non-spacing" or "combining" or variant selectors. For some complex scripts (e.g., Hangul?), you might need to turn to this Unicode Consortium technical report.
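If you do need the first grapheme cluster rather than the first code point, ICU can do the segmentation; a minimal sketch, assuming ICU is available:

#include <iostream>
#include <memory>
#include <string>
#include <unicode/brkiter.h>  // icu::BreakIterator
#include <unicode/locid.h>    // icu::Locale
#include <unicode/unistr.h>   // icu::UnicodeString

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return 1;

    // "a" + U+0308 COMBINING DIAERESIS + "bc": the first grapheme cluster
    // is two code points that render as one character.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("a\xCC\x88" "bc");
    bi->setText(text);
    bi->first();               // position at the start of the text
    int32_t end = bi->next();  // boundary after the first grapheme cluster

    std::string first;
    text.tempSubString(0, end).toUTF8String(first);
    std::cout << first << "\n";  // prints "ä" (a + combining diaeresis)
}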
Library str.h
#include <iostream>
#include "str.h"

int main() {
    std::string text = "äbcdefg";
    std::string str = str::substr(text, 0, 1); // returns "ä"
    std::cout << str << std::endl;
}

Get number of characters in string?

I have an application accepting a UTF-8 string of at most 255 characters.
If the characters are all ASCII, the number of characters equals the size in bytes.
If the characters are not all ASCII and contain, for example, Japanese letters, how can I get the number of characters in the string, given its size in bytes?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen to walk the string character by character, or use mbstowcs.
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
You can also store a Unicode character in a wide character (wchar_t).
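Here is a sketch of both approaches from the quoted passage (the hard-wired continuation-byte count, and mbstowcs, which only works when the current locale uses UTF-8; the locale name below is an assumption about the host system):

#include <clocale>
#include <cstdlib>
#include <iostream>

// Count UTF-8 characters by skipping continuation bytes (0x80-0xBF),
// exactly as the quoted passage describes.
int count_utf8_chars(const char *data, int bytes_no) {
    int char_no = 0;
    for (int i = 0; i < bytes_no; ++i) {
        unsigned char b = data[i];
        if (b < 0x80 || b > 0xBF) ++char_no;  // not a continuation byte
    }
    return char_no;
}

int main() {
    const char *s = "ab\xE3\x81\x82";  // 'a', 'b', U+3042 HIRAGANA LETTER A
    std::cout << count_utf8_chars(s, 5) << "\n";  // 3

    // Portable variant, provided the locale is UTF-8:
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::cout << std::mbstowcs(nullptr, s, 0) << "\n";  // 3
}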
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. Taken as "the smallest component of written language that has semantic value" (the first meaning), á is a single character. But if you count the "basic units of encoding for the Unicode character encoding" (the third meaning) in á, you may get either one or two, depending on which exact representation (normalized or denormalized) is used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters in the third meaning. mblen is one way of doing that, provided your current locale uses a UTF-8 encoding. Modern C++ offers more C++-ish methods; however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU, which you may want to consider if your needs are much more complicated than counting characters.

std::string and UTF-8 encoded Unicode

If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.
With wchar_t it's the same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3- or 4-byte-wide character will take 2 wchar_t.
Right ?
So, what if I want to use string::find_first_of() or string::compare(), etc. with such a variable-width encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use strings as dumb, feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer?
If std::string doesn't handle that, a second question: are there libraries providing string classes that could handle the UTF-8 encoding, so that str[3] actually points to the 4th character (which would be a byte sequence of length 1 to 4)?
You are talking about Unicode. Any Unicode code point can be represented in 32 bits. However, since that wastes memory, there are more compact encodings. UTF-8 is one such encoding: it uses bytes as its units and maps Unicode characters to 1, 2, 3 or 4 bytes. UTF-16 is another; it uses 16-bit words as units and maps Unicode characters to 1 or 2 words (2 or 4 bytes).
You can use either encoding with both string and wchar_t-based strings. UTF-8 tends to be more compact for English text/numbers.
Some things will work regardless of the encoding and type used (e.g., compare). However, any function that needs to understand a single character will be broken, i.e. the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break.
string::compare will work, but do not expect to get alphabetical ordering. That is language dependent.
string::find_first_of will work for some inputs but not all. Long strings will likely work just because they are long, while shorter ones might get confused by character alignment and generate very hard-to-find bugs; see the sketch below.
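A concrete sketch of that failure mode, assuming the std::string holds UTF-8:

#include <iostream>
#include <string>

int main() {
    std::string s = "na\xC3\xAFve";  // "naïve" in UTF-8; 'ï' is the two bytes 0xC3 0xAF
    // find_first_of matches any single *byte* of its argument, so it can hit
    // the middle of a multi-byte sequence. "\xC3\xA9" is 'é', which does not
    // occur in the string, yet its lead byte 0xC3 matches the lead byte of 'ï'.
    std::cout << s.find_first_of("\xC3\xA9") << "\n";  // 2: a false positive
    // Whole-substring search is byte-exact and therefore safe for whole characters:
    std::cout << (s.find("\xC3\xA9") == std::string::npos) << "\n";  // 1: correctly absent
}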
The best thing is to find a library that handles it for you and to ignore the underlying type (unless you have strong reasons to pick one or the other).
You can't handle Unicode with std::string or any other tools from the Standard Library alone. Use an external library such as: http://utfcpp.sourceforge.net/
You are correct for those:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...
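For illustration, a sketch using the utfcpp library linked above (utf8::distance and utf8::next are part of its API):

#include <iostream>
#include <string>
#include "utf8.h"  // utfcpp

int main() {
    std::string s = "\xC3\xA4" "bc";  // "äbc" in UTF-8: 4 bytes, 3 code points
    std::cout << s.size() << "\n";                            // 4 (bytes)
    std::cout << utf8::distance(s.begin(), s.end()) << "\n";  // 3 (code points)

    // Step over one code point at a time instead of one byte at a time.
    std::string::iterator it = s.begin();
    utf8::next(it, s.end());                        // skips the 2-byte 'ä'
    std::cout << std::string(it, s.end()) << "\n";  // "bc"
}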
C++'s string treats its contents as plain bytes. This is different from Java's String, which handles Unicode characters. You can store the encoding result (bytes) of Chinese characters in a string (a char in C/C++ is just a byte), but this is of limited use, as string just treats the bytes as individual chars, so you cannot use string functions to process it as characters.
wstring may be something you need.
One thing should be clarified: UTF-8 is just an encoding method for Unicode characters (transforming characters to/from byte format).