Extract (first) UTF-8 character from a std::string - c++

I need to use a C++ implementation of PHP's mb_strtoupper function to imitate Wikipedia's behavior.
My problem is that I want to feed only a single UTF-8 character to the function, namely the first one of a std::string.
std::string s("äbcdefg");
mb_strtoupper(s[0]); // this obviously can't work with multi-byte characters
mb_strtoupper('ä'); // works
Is there an efficient way to detect/return only the first UTF-8 character of a string?

In UTF-8, the high bits of the first byte tell you how many subsequent bytes are part of the same code point.
0b0xxxxxxx: this byte is the entire code point
0b10xxxxxx: this byte is a continuation byte - this shouldn't occur at the start of a string
0b110xxxxx: this byte plus the next (which must be a continuation byte) form the code point
0b1110xxxx: this byte plus the next two form the code point
0b11110xxx: this byte plus the next three form the code point
The pattern could in principle continue, but valid UTF-8 never uses more than four bytes to represent a single code point (per RFC 3629).
If you write a function that counts the number of leading bits set to 1, then you can use it to figure out where to split the byte sequence in order to isolate the first logical code point, assuming the input is valid UTF-8. If you want to harden against invalid UTF-8, you'd have to write a bit more code.
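For illustration, a sketch of that idea might look like the following (LeadingOnes and SequenceLength are just example names, and the input is assumed to be valid UTF-8):

#include <cstddef>

// Count how many of the byte's most significant bits are 1.
std::size_t LeadingOnes(unsigned char b) {
    std::size_t n = 0;
    while (b & 0x80) { ++n; b <<= 1; }
    return n;
}

// Number of bytes in the UTF-8 sequence that starts with lead byte b:
// 0 leading ones -> ASCII (1 byte); 2, 3 or 4 leading ones -> that many bytes.
// A single leading one would be a continuation byte, i.e. invalid as a lead byte.
std::size_t SequenceLength(unsigned char b) {
    std::size_t ones = LeadingOnes(b);
    return ones == 0 ? 1 : ones;
}

The first code point of a valid string s is then s.substr(0, SequenceLength(static_cast<unsigned char>(s[0]))).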
Another way to do it is to take advantage of the fact that continuation bytes always match the pattern 0b10xxxxxx, so you take the first byte, and then keep taking bytes as long as the next byte matches that pattern.
#include <cstddef>
#include <string>

std::size_t GetFirst(const std::string &text) {
    if (text.empty()) return 0;
    // Length in bytes of the first code point: take the first byte, then
    // keep consuming continuation bytes (0b10xxxxxx).
    std::size_t length = 1;
    while (length < text.size() &&
           (static_cast<unsigned char>(text[length]) & 0b11000000) == 0b10000000) {
        ++length;
    }
    return length;
}
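With that in hand, you can pass just the first character to your function (assuming your mb_strtoupper port accepts a std::string holding one UTF-8 character):

std::string s("äbcdefg");
std::string first = s.substr(0, GetFirst(s)); // "ä" as a complete UTF-8 byte sequence
mb_strtoupper(first);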
For many languages, a single code point usually maps to a single character. But what people think of as single characters may be closer to what Unicode calls a grapheme cluster, which is one or more code points that combine to produce a glyph.
In your example, the ä can be represented in different ways: It could be the single code point U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or it could be a combination of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS. Fortunately, just picking the first code point should work for your goal to capitalize the first letter.
If you really need the first grapheme cluster, you have to look beyond the first code point to see if the next one(s) combine with it. For many languages, it's enough to know which code points are "non-spacing" or "combining" or variant selectors. For some complex scripts (e.g., Hangul?), you might need to turn to the Unicode Consortium's technical report on text segmentation.

Library str.h
#include <iostream>
#include "str.h"

int main() {
    std::string text = "äbcdefg";
    std::string str = str::substr(text, 0, 1); // Returns: ä
    std::cout << str << std::endl;
}

Related

Count number of actual characters in a std::string (not chars)?

Can I count the number of "characters" that a std::string contains, and not the number of bytes? For instance, std::string::size and std::string::length return the number of bytes (chars):
std::string m_string1 {"a"};
// This is 1
m_string1.size();
std::string m_string2 {"їa"};
// This is 3, because ї takes 2 bytes in UTF-8
m_string2.size();
Is there a way to get the number of characters? For instance, to establish that m_string2 has 2 characters.
It is not possible to count "characters" in a Unicode string with anything in the C++ standard library in general. It isn't clear what exactly you mean by "character" to begin with, and the closest you can get is counting code points by using UTF-32 literals and std::u32string. However, that isn't going to match what you want even for їa.
For example, ї may be a single code point
ї CYRILLIC SMALL LETTER YI (U+0457)
or two consecutive code points
і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)
If you don't know that the string is normalized, then you can't distinguish the two with the standard library and there is no way to force normalization either. Even for UTF-32 string literals it is up to the implementation which one is chosen. You will get 2 or 3 for a string їa when counting code points.
And that isn't even considering the encoding issue that you mention in your question. Each code point itself may be encoded into multiple code units depending on the chosen encoding and .size() is counting code units, not code points. With std::u32string these two will at least coincide, even if it doesn't help you as I demonstrate above.
You need some Unicode library like ICU if you want to do this properly.
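For illustration (not part of the original answer), here is a small sketch of the code unit/code point mismatch described above, assuming the source file is UTF-8 and the compiler uses UTF-8 as the narrow execution encoding; which of the two forms of ї the literal contains is still up to the source file and the implementation:

#include <iostream>
#include <string>

int main() {
    std::string    utf8  =  "їa"; // size() counts bytes (UTF-8 code units)
    std::u32string utf32 = U"їa"; // size() counts UTF-32 code units = code points

    std::cout << utf8.size()  << '\n'; // 3 if ї is precomposed, 5 if decomposed
    std::cout << utf32.size() << '\n'; // 2 if ї is precomposed, 3 if decomposed
}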

C++ test for UTF-8 validation

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}
How can I write incorrect UTF-8 cases in C++?
You can specify individual bytes in the string with the \x escape sequence in hexadecimal form or the \000 escape sequence in octal form.
For example:
std::string str = "\xD0";
which is an incomplete UTF-8 sequence: 0xD0 announces a two-byte sequence, but no continuation byte follows.
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF-8 test cases.
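For a few more cases, here is a sketch in the same style as your test (assuming validate_utf8 takes a std::string and returns false for invalid input); each literal below is malformed for a different reason:

TEST(validation, RejectsMalformedUtf8)
{
    EXPECT_FALSE(validate_utf8(std::string("\x80")));         // lone continuation byte
    EXPECT_FALSE(validate_utf8(std::string("\xD0")));         // truncated 2-byte sequence
    EXPECT_FALSE(validate_utf8(std::string("\xE2\x82")));     // truncated 3-byte sequence
    EXPECT_FALSE(validate_utf8(std::string("\xC0\xAF")));     // overlong encoding of '/'
    EXPECT_FALSE(validate_utf8(std::string("\xED\xA0\x80"))); // UTF-16 surrogate half U+D800
    EXPECT_FALSE(validate_utf8(std::string("\xFF")));         // 0xFF never occurs in UTF-8
}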
In UTF-8, any byte whose most significant bit is 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also 1, then this is the first byte of an MBS, otherwise it is one of the follow-up (continuation) bytes.
In the first byte of an MBS the number of leading one-bits gives you the number of bytes of the entire sequence, e.g. 0b110xxxxx with arbitrary values for x is the start byte of a two-byte sequence.
Theoretically this scheme could produce sequences of up to seven bytes, but valid UTF-8 is limited to four bytes per code point (RFC 3629).
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" would represent the sequence 0b11001000 0b10000101, which is a legal pattern and represents the code point 0b01000000101 (note how the leading bits forming the UTF-8 header are cut away), corresponding to a value of 0x205 or 517. Whether that is an assigned code point you would need to look up; the bit pattern was just formed as an example.
The same way you can now represent longer valid sequences by increasing the number of leading one-bits together with the appropriate number of follow-up bytes (note again: the number of leading one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences in which the total number of bytes does not match (too many or too few follow-up bytes) the number announced by the leading one-bits.
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might still need to look up which of these code points are actually assigned.
Finally, you might additionally consider composed characters (with diacritics): they can be represented either as a base code point followed by combining code points or as a single precomposed (normalised) code point. If you want to go that far, you need to look up in the standard which combinations are legal and which normalised code points they correspond to.
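To turn that description into test data, a hand-rolled encoder along these lines can help (EncodeUtf8 is just an illustrative helper; it assumes a code point below U+110000 and does not reject surrogates):

#include <string>

// Encode one code point into UTF-8 following the bit layout described above.
std::string EncodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                      // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

Truncating the output for a multi-byte code point, or appending a stray 0x80 continuation byte, then gives you invalid sequences of the kinds described above.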

How to ignore accents in a string so they do not alter its length?

I am determining the length of certain strings of characters in C++ with the function length(), but noticed something strange: say I define in the main function
string str;
str = "canción";
Then, when I calculate the length of str by str.length() I get as output 8. If instead I define str = "cancion" and calculate str's length again, the output is 7. In other words, the accent on the letter 'o' is altering the real length of the string. The same thing happens with other accents. For example, if str = "für" it will tell me its length is 4 instead of 3.
I would like to know how to ignore these accented characters when determining the length of a string; however, I wouldn't want to ignore isolated characters like '. For example, if str = "livin'", the length of str must be 6.
It is a difficult subject. Your string is likely UTF-8 encoded, and str.length() counts bytes. An ASCII character can be encoded in 1 byte, but characters with codes larger than 127 are encoded in more than 1 byte.
Counting Unicode code points may not give you the answer you need either. Instead, you need to take into account the width of the code point to handle separated (combining) accents and double-width code points (and maybe there are other cases as well). So it is difficult to do this properly without using a library.
You may want to check out ICU.
If you have a constrained case and you don't want to use a library for this, you may want to check out the UTF-8 encoding (it is not difficult) and create a simple UTF-8 code point counter; a simple algorithm is to count the bytes where (b & 0xc0) != 0x80, as sketched below.
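A minimal sketch of such a counter, assuming the input is valid UTF-8 (note that it counts code points, not glyphs, so a decomposed accent still adds one):

#include <cstddef>
#include <string>

// Count UTF-8 code points: every byte except the 0b10xxxxxx
// continuation bytes starts a new code point.
std::size_t CountCodePoints(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}

For "canción" this yields 7 when the ó is stored precomposed, but 8 when it is stored as o plus a combining accent, which is exactly the separated-accent caveat mentioned above.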
Sounds like UTF-8 encoding. Since the characters with the accents cannot be stored in a single byte, they are stored in 2 bytes. See https://en.wikipedia.org/wiki/UTF-8

standard function to count number of char in string C++

Is there a standard function like size() or length() to count the number of characters in a string? The following give 5 and 6 for the same word:
#include <iostream>
using namespace std;

int main() {
    string s = "Ecole";
    cout << s.size() << "\n";
}
and
#include <iostream>
using namespace std;

int main() {
    string s = "école";
    cout << s.size() << "\n";
}
Thank you.
Use:
wstring
Instead of:
string
The string école actually occupies 6 bytes, because the char é takes two bytes in memory when encoded as UTF-8.
The hex representation of é in UTF-8 is c3 a9.
The ASCII character set doesn't have a lot of "special" characters; the most exotic is probably ` (backquote). std::string can hold about 0.025% of all Unicode characters (usually, 8 bit char), hence if you want to store a string like école, use wstring instead of string.
Short answer: there is no good answer. Text is complicated.
First, you need to decide what "length" you're looking for to figure out what to call.
In your example, std::string::size() is providing the length in C chars (i.e. bytes). As Vishnu pointed out, the length of the character "é" is 2 bytes, not 1.
On the other hand, if you switch to std::wstring::size() as suggested by Duncan, it will start measuring the size in UTF-16 code points. In that case, the character "é" is 1 UTF-16 code point.
Switching to wstring might seem like the solution, but it depends on what you're doing. For example, if you're trying to get the size of the string to allocate a buffer -- measured in bytes -- then std::string::size() might be correct, but std::wstring::size() would be wrong, because each UTF-16 code point takes 2 bytes to store. (Technically, std::wstring is storing wchar_t characters, and is not necessarily even in UTF-16, and each code point takes sizeof(wchar_t) bytes to store...so it doesn't really work in general, anyways.)
Even if you just want the "number of characters a person would see" (the number of glyphs), switching to wstring won't work for more complicated data. For example, "é" (character U+00E9, http://www.fileformat.info/info/unicode/char/e9/index.htm) is 1 UTF-16 code point, but "é" can also be represented as "e" plus a combining acute accent (character U+0301, http://www.fileformat.info/info/unicode/char/0301/index.htm). You might need to read about Unicode normalization. There are also situations where a single "character" takes 2 UTF-16 code points, called surrogate pairs -- although a lot of software safely ignores these.
Honestly, with Unicode you either have to accept the fact that you won't handle all the edge cases, or you have to give up on processing things one "character" at a time and instead do things one "word" at a time (a string of code points separated by whitespace) to get them to work. Then you would ask the library you're using -- for example, a drawing library -- how wide each "word" is and hope that it has correctly handled all of the accents, combining characters, surrogate pairs, etc.
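To make the code unit counting concrete, here is a small sketch (not from the answer above) using std::u16string rather than std::wstring, so the code units are UTF-16 on every platform:

#include <iostream>
#include <string>

int main() {
    std::u16string precomposed = u"\u00E9";     // é as a single code point
    std::u16string decomposed  = u"e\u0301";    // e + combining acute accent
    std::u16string astral      = u"\U0001F600"; // a smiley outside the BMP

    std::cout << precomposed.size() << '\n'; // 1 UTF-16 code unit
    std::cout << decomposed.size()  << '\n'; // 2 code units, though it renders as one glyph
    std::cout << astral.size()      << '\n'; // 2 code units (a surrogate pair)
}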

In ICU UnicodeString what is the difference between countChar32() and length()?

From the docs;
The length is the number of UChar code units are in the UnicodeString. If you want the number of code points, please use countChar32().
and
Count Unicode code points in the length UChar code units of the string.
A code point may occupy either one or two UChar code units. Counting code points involves reading all code units.
From this I am inclined to think that a code point is an actual character and a code unit is just one possible part of a character.
For example.
Say you have a unicode string like:
'foobar'
Both the length and countChar32 will be 6. Then say you have a string composed of 6 characters that each take the full 32 bits to encode; the length would be 12, but countChar32 would be 6.
Is this correct?
The two values will only differ if you use characters outside the Basic Multilingual Plane (BMP). Those characters are represented in UTF-16 as surrogate pairs: two 16-bit code units make up one logical character. If you use any of these, each pair counts as one towards countChar32() but as two towards length().
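A minimal sketch of that difference, assuming ICU is installed and linked (commonly with -licuuc):

#include <iostream>
#include <unicode/stringpiece.h>
#include <unicode/unistr.h>

int main() {
    // U+1F600 (a smiley) lies outside the BMP, so UTF-16 stores it as a surrogate pair.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("a\xF0\x9F\x98\x80");
    std::cout << s.length()      << '\n'; // 3: UChar code units (1 for 'a' + 2 for the pair)
    std::cout << s.countChar32() << '\n'; // 2: code points
}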