C++ test for validation UTF-8 - c++

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}
How can I write incorrect UTF-8 cases in C++?

You can specify individual bytes in a string literal with the \xHH escape sequence (hexadecimal) or the \ooo escape sequence (octal).
For example:
std::string str = "\xD0";
This is an incomplete UTF-8 sequence: 0xD0 is the first byte of a two-byte sequence, and the second byte is missing.
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF-8 test cases.
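As a rough sketch of what such test data could look like, here are a few byte strings that a conforming validator should reject. The helper name malformed_utf8_samples is made up for illustration, and the list is nowhere near exhaustive.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the question): a few byte strings
// that are not well-formed UTF-8.
std::vector<std::string> malformed_utf8_samples()
{
    return {
        "\xD0",              // lead byte of a 2-byte sequence, continuation missing
        "\x80",              // continuation byte without a lead byte
        "\xE2\x82",          // 3-byte sequence cut off after two bytes
        "\xC0\x80",          // overlong encoding of U+0000
        "\xED\xA0\x80",      // encodes the surrogate U+D800
        "\xF4\x90\x80\x80",  // encodes a value above U+10FFFF
        "\xFF",              // byte value that never occurs in UTF-8
        "ab" "\xD0" "cd",    // literal is split so that \x does not swallow the 'c'
    };
}
```

In the test from the question you could then loop over the samples and call EXPECT_FALSE(validate_utf8(s)) for each one. Note the last entry: a \x escape consumes every following hex digit, so a literal has to be split when the next character happens to be one of 0-9 or a-f.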

In UTF-8, any byte whose most significant bit is 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also 1, the byte is the first byte of an MBS; otherwise it is one of the follow-up (continuation) bytes.
In the first byte of an MBS, the number of consecutive leading one-bits gives you the number of bytes in the entire sequence, e.g. 0b110xxxxx with arbitrary values for x is the start byte of a two-byte sequence.
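To make the bit counting concrete, here is a small sketch that counts the leading one-bits of a byte. The function name leading_ones is my own; it only looks at the bit pattern and says nothing about whether the byte is otherwise allowed.

```cpp
// Counts the consecutive one-bits at the top of a byte.
// 0 -> ASCII, 1 -> continuation byte, 2..4 -> lead byte of a sequence of that length.
int leading_ones(unsigned char byte)
{
    int n = 0;
    for (unsigned mask = 0x80; mask != 0 && (byte & mask) != 0; mask >>= 1) {
        ++n;
    }
    return n;
}
```

For example, leading_ones(0xC8) is 2 because 0xC8 is 0b11001000, so it starts a two-byte sequence.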
The bit pattern would in principle allow longer sequences (the original design went up to six bytes), but since RFC 3629 UTF-8 is limited to four bytes, which is enough for every code point up to U+10FFFF.
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" would represent the sequence 0b11001000 0b10000101, which is a legal pattern and represents code point 0b01000 000101 (note how the leading bits representing the UTF-8 headers are cut away), corresponding to a value of 0x205 or 517. Whether that is an assigned code point is something you need to look up; I just formed an arbitrary bit pattern as an example.
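If you want to double-check the arithmetic, a two-byte sequence can be decoded with a couple of masks and a shift. This is just a sketch; decode_two_byte is a name I made up, and it assumes the caller has already checked that the two bytes have the 110xxxxx and 10xxxxxx patterns.

```cpp
// Hypothetical helper: combines the 5 payload bits of a 110xxxxx lead byte
// with the 6 payload bits of a 10xxxxxx continuation byte.
// No validation is done here.
char32_t decode_two_byte(unsigned char lead, unsigned char cont)
{
    return (static_cast<char32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

decode_two_byte(0xC8, 0x85) yields 0x205, since the payload bits are 01000 and 000101.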
In the same way you can represent longer valid sequences by increasing the number of most significant one-bits and adding the appropriate number of follow-up bytes (note again: the number of initial one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences in which the total number of bytes does not match (too many or too few) the number of initial one-bits.
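A minimal sketch of a check for exactly this kind of mismatch could look as follows. The name has_consistent_structure is my own, and it is deliberately not a full validator: it only compares lead bytes against the number of continuation bytes that follow and does not look at the decoded code points.

```cpp
#include <cstddef>
#include <string>

// Checks only that every lead byte is followed by exactly the number of
// continuation bytes it announces. Overlong forms, surrogates and values
// above U+10FFFF are NOT detected here.
bool has_consistent_structure(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        if      ((lead & 0x80) == 0x00) len = 1;  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;  // stray continuation byte (too many) or unused pattern

        if (len > s.size() - i) return false;     // too few bytes left
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // expected 10xxxxxx
        }
        i += len;
    }
    return true;
}
```

With this, "\xC8\x85" passes, while "\xC8" (too few bytes) and "\xC8\x85\x85" (one continuation byte too many) fail.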
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You now might need to look up which of these code points are actually valid. For instance, the surrogate range U+D800 to U+DFFF and anything above U+10FFFF must be rejected, and so must overlong encodings, i.e. a code point encoded with more bytes than necessary.
Finally, you might additionally consider composed characters (with diacritics). They can be represented as a sequence of characters (not bytes!) or as a single normalised character. If you want to go that far, you'd need to look up in the standard which combinations are legal and which normalised code points they correspond to.

Related

Count number of actual characters in a std::string (not chars)?

Can I count the number of "characters" that a std::string contains, and not the number of bytes? For instance, std::string::size and std::string::length return the number of bytes (chars):
std::string m_string1 {"a"};
// This is 1
m_string1.size();
std::string m_string2 {"їa"};
// This is 3 because of Unicode
m_string2.size();
Is there a way to get the number of characters? For instance, to find out that m_string2 has 2 characters.
It is not possible, in general, to count "characters" in a Unicode string with anything in the C++ standard library. It isn't clear what exactly you mean by "character" to begin with, and the closest you can get is counting code points by using UTF-32 literals and std::u32string. However, that isn't going to match what you want even for їa.
For example ї may be a single code point
ї CYRILLIC SMALL LETTER YI (U+0457)
or two consecutive code points
і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)
If you don't know that the string is normalized, then you can't distinguish the two with the standard library and there is no way to force normalization either. Even for UTF-32 string literals it is up to the implementation which one is chosen. You will get 2 or 3 for a string їa when counting code points.
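To illustrate the 2-versus-3 difference, here is a small sketch that spells out both forms with universal character names, so the result does not depend on how the source file is encoded or normalized. The function names are made up for the example.

```cpp
#include <string>

// U+0457 followed by 'a': the precomposed form, 2 code points.
std::u32string precomposed_yi_a()
{
    return U"\u0457" U"a";
}

// U+0456, U+0308 (combining diaeresis), then 'a': 3 code points
// that render the same as the string above.
std::u32string decomposed_yi_a()
{
    return U"\u0456\u0308" U"a";
}
```

Calling .size() gives 2 for the first string and 3 for the second, although a reader sees the same two "characters" in both.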
And that isn't even considering the encoding issue that you mention in your question. Each code point itself may be encoded into multiple code units depending on the chosen encoding and .size() is counting code units, not code points. With std::u32string these two will at least coincide, even if it doesn't help you as I demonstrate above.
You need a Unicode library such as ICU if you want to do this properly.

How can I force the user/OS to input an ASCII string

This is a follow-up to this question: Is std::string supposed to have only ASCII characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0->9 and the letters a->z.
I am handling the input on the assumption that it is ASCII. For example, I am using something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I say that I want the input to always be ASCII? Or have I missed a lot of concepts, and static_cast<unsigned int>(my_char - '0') is just a bad way to do it?
P.S. In ASCII (at least) the digits are in sequential order. However, I do not know whether they are in other encodings. (I am pretty sure that they are, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation used to serve std::cin translates keystrokes like 0 to a specific number, and whether that number matches your toolchain's own value for '0'.
You simply shouldn't rely on explicit ASCII values (e.g. magic numbers); use char literals to write portable code. The assumption that my_char - '0' will result in the actual digit's value is true for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
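A minimal sketch of a conversion that relies only on this guarantee, and not on any particular ASCII value, might look like this (digit_value is just a name picked for the example):

```cpp
// Returns 0..9 for a decimal digit character, or -1 for anything else.
// Only char literals are used, so no specific encoding is assumed.
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? (c - '0') : -1;
}
```

Note that the guarantee covers the digits only. The standard makes no such promise for letters, so c - 'a' is not portable in the same way.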
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string happens to be valid ASCII as well.
Also, whatever platform-specific measure you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
You are mistaking string encoding for some magic property of an input stream, which it is not. It is, well, an encoding, like JPEG or AVI, and must be handled in exactly the same way: read the input, match it against the format, and report an error if parsing fails.
In your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte with a value outside the ASCII range.
However, if you later encounter a terminal that provides data in some incompatible encoding, like UTF-16LE, you have no choice but to write a detection routine (based on the byte order mark) and a conversion routine.
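A sketch of that check, assuming the input has already been read into a std::string (for example with std::getline); the name is_ascii is my own:

```cpp
#include <string>

// True if every byte is in the ASCII range 0x00..0x7F.
bool is_ascii(const std::string& s)
{
    for (char c : s) {
        // Cast first: plain char may be signed, so bytes >= 0x80 can be negative.
        if (static_cast<unsigned char>(c) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

The caller can then report an error and exit as soon as is_ascii(line) returns false.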

Get number of characters in string?

I have an application that accepts a UTF-8 string of at most 255 characters.
If all the characters are ASCII, the number of characters equals the size in bytes.
If the characters are not all ASCII and include, for example, Japanese letters, how can I get the number of characters in the string, given its size in bytes?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen or mbstowcs to count the characters.
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using
mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported
encoding, as long as the appropriate locale has been selected. A
hard-wired technique to count the number of characters in a UTF-8
string is to count all bytes except those in the range 0x80 – 0xBF,
because these are just continuation bytes and not characters of their
own. However, the need to count characters arises surprisingly rarely
in applications.
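The hard-wired technique from the quote can be sketched like this, using the signature from the question. It assumes the input is already valid UTF-8 and counts code points, which is not necessarily what a user perceives as characters.

```cpp
// Counts every byte that is not a continuation byte (0x80..0xBF).
// Assumes data points to bytes_no bytes of valid UTF-8.
int count_utf8_code_points(const char* data, int bytes_no)
{
    int char_no = 0;
    for (int i = 0; i < bytes_no; ++i) {
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        if ((byte & 0xC0) != 0x80) {  // not of the form 10xxxxxx
            ++char_no;
        }
    }
    return char_no;
}
```

For the six bytes E3 81 82 E3 81 84 (two hiragana letters) this returns 2.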
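You can store a Unicode character in a wide character (wchar_t). Be aware, though, that wchar_t is only 16 bits on some platforms such as Windows, where characters outside the Basic Multilingual Plane take two units.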
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. As the smallest component of written language that has semantic value (the first meaning), á is a single character. If you take á and count the basic units of encoding for the Unicode character encoding (the third meaning) in it, you may get either one or two, depending on which exact representation (normalized or denormalized) is being used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters (meaning 3). mblen is one method of doing that, provided your current locale has UTF-8 encoding. Modern C++ offers more C++-ish methods, however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU which you may want to consider if your needs are much more complicated than counting characters.

Encoding binary data using string class

I am going through one of the requirements for string implementations as part of a study project.
Let us assume that the standard library did not exist and we were
forced to design our own string class. What functionality would it
support and what limitations would we improve. Let us consider
following factors.
Does binary data need to be encoded?
Is multi-byte character encoding acceptable or is unicode necessary?
Can C-style functions be used to provide some of the needed functionality?
What kind of insertion and extraction operations are required?
My questions on the above text:
What does the author mean by "Does binary data need to be encoded?" Please explain with an example, and how we can implement this.
What does the author mean by point 2? Please explain with an example, and how we can implement this.
Thanks for your time and help.
Regarding point one, "Binary data" refers to sequences of bytes, where "bytes" almost always means eight-bit words. In the olden days, most systems were based on ASCII, which requires seven bits (or eight, depending on who you ask). There was, therefore, no need to distinguish between bytes and characters. These days, we're more friendly to non-English speakers, and so we have to deal with Unicode (among other codesets). This raises the problem that string types need to deal with the fact that bytes and characters are no longer the same thing.
This segues into point two, which is about how you represent strings of characters in a program. UTF-8 uses a variable-length encoding, which has the remarkable property that it encodes the entire ASCII character set using exactly the same bytes that ASCII encoding uses. However, it makes it more difficult to, e.g., count the number of characters in a string. For pure ASCII, the answer is simple: characters = bytes. But if your string might have non-ASCII characters, you now have to walk the string, decoding characters, in order to find out how many there are [1].
These are the kinds of issues you need to think about when designing your string class.
[1] This isn't as difficult as it might seem, since the first byte of each character is guaranteed not to have 10 in its two high bits. So you can simply count the bytes that satisfy (c & 0xc0) != 0x80. Nonetheless, it is still expensive relative to just treating the length of a string buffer as its character count.
The question here is "can we store ANY old data in the string, or do certain byte values need to be encoded in some special way?" An example of that would be in the standard C language: if you want to use a newline character, it is "encoded" as \n to make it more readable and clear. Of course, in this example I'm talking about the source code. In the case of binary data stored in the string, how would you deal with "strange" data, e.g. what about zero bytes? Will they need special treatment?
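To show why zero bytes matter, here is a small sketch using std::string as a stand-in for the string class being designed. The helper names are made up. A class that stores an explicit length keeps all the bytes, whereas anything that relies on C-style termination stops at the first zero.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Builds a 3-byte payload with a zero byte in the middle.
// The length is passed explicitly, so nothing is cut off.
std::string make_binary_payload()
{
    const char raw[] = {'a', '\0', 'b'};
    return std::string(raw, sizeof raw);
}

// What a C-style function sees: it stops at the first zero byte.
std::size_t c_style_length(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

Here make_binary_payload().size() is 3, while c_style_length reports 1, which is the kind of limitation the C-style functions in the list above would bring along.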
The values guaranteed to fit in a char are the ASCII characters and a few others (a total of 256 different characters in a typical implementation, but char is not GUARANTEED to be 8 bits by the standard). But if we take non-European languages, such as Chinese or Japanese, they consist of vastly more characters than can fit in a single char. Unicode allows for just over a million code points (U+0000 to U+10FFFF), so any character from any European, Chinese, Japanese, Thai, Arabic, Mayan, or ancient hieroglyphic writing system can be represented in one "unit". This is done by using a wider character: for the full range we need 21 bits, which in practice means a 32-bit unit. The drawback here is that most of the time we don't actually use that many different characters, so it is a bit wasteful to use 32 bits for each character, only to have zeros in the upper 24 bits nearly all the time.
A multibyte character encoding is a compromise, where "common" characters (common in the European languages) are stored as one char, but less common characters are encoded with multiple char values, using a special range of values to indicate "there is more data in the next char to combine into a single unit". (Or, one could decide to use 2, 3, or 4 chars each time to encode a single character.)
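UTF-8 is one such compromise. As a sketch, and not a production encoder, turning a single code point into its one to four bytes could look like this. The name encode_utf8 and the choice to return an empty string for values that cannot be encoded are my own assumptions.

```cpp
#include <string>

// Encodes one code point as UTF-8. Returns an empty string for
// surrogates (U+D800..U+DFFF) and for values above U+10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

An ASCII letter such as U+0041 still takes one char, while U+20AC (the euro sign) becomes the three bytes E2 82 AC.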

In ICU UnicodeString what is the difference between countChar32() and length()?

From the docs;
The length is the number of UChar code units are in the UnicodeString. If you want the number of code points, please use countChar32().
and
Count Unicode code points in the length UChar code units of the string.
A code point may occupy either one or two UChar code units. Counting code points involves reading all code units.
From this I am inclined to think that a code point is an actual character and a code unit is just one possible part of a character.
For example.
Say you have a unicode string like:
'foobar'
Both the length and countChar32 will be 6. Then say you have a string composed of 6 characters that each need the full 32 bits to encode: the length would be 12, but countChar32 would be 6.
Is this correct?
Yes. The two values will only differ if you use characters outside the Basic Multilingual Plane (BMP). These characters are represented in UTF-16 as surrogate pairs: two 16-bit code units make up one logical character. If you use any of these, each pair counts as one 32-bit character but as two elements of length.
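The same distinction can be reproduced without ICU using std::u16string. The sketch below is not ICU's implementation, and count_code_points is a made-up name. It treats a high surrogate followed by a low surrogate as one code point and every other code unit as one code point of its own.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string; s.size() is the number of code units.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_is_low = i + 1 < s.size()
                              && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_is_low) {
            ++i;  // skip the second half of the surrogate pair
        }
        ++n;
    }
    return n;
}
```

For u"foobar" both numbers are 6. For u"\U0001F600" (a character outside the BMP) size() is 2 and count_code_points is 1.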