Count number of actual characters in a std::string (not chars)? - c++

Can I count the number of characters that a std::string contains, and not the number of bytes? For instance, std::string::size and std::string::length return the number of bytes (chars):
std::string m_string1 {"a"};
// This is 1
m_string1.size();
std::string m_string2 {"їa"};
// This is 3 because of Unicode
m_string2.size();
Is there a way to get the number of characters? For instance, to find that m_string2 has 2 characters.

It is not possible to count "characters" in a Unicode string with anything in the C++ standard library in general. It isn't clear what exactly you mean with "character" to begin with and the closest you can get is counting code points by using UTF-32 literals and std::u32string. However, that isn't going to match what you want even for їa.
For example ї may be a single code point
ї CYRILLIC SMALL LETTER YI (U+0457)
or two consecutive code points
і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)
If you don't know that the string is normalized, then you can't distinguish the two with the standard library and there is no way to force normalization either. Even for UTF-32 string literals it is up to the implementation which one is chosen. You will get 2 or 3 for a string їa when counting code points.
And that isn't even considering the encoding issue that you mention in your question. Each code point itself may be encoded into multiple code units depending on the chosen encoding and .size() is counting code units, not code points. With std::u32string these two will at least coincide, even if it doesn't help you as I demonstrate above.
You need a Unicode library such as ICU if you want to do this properly.
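To make the 2-versus-3 point concrete, here is a minimal sketch that spells both forms out with explicit escapes, so the result does not depend on how the source file happens to be encoded or normalized. The variable names are made up for illustration.

```cpp
#include <string>

// ї as the single code point U+0457, followed by 'a'.
const std::u32string composed = U"\u0457" U"a";

// ї as U+0456 plus COMBINING DIAERESIS (U+0308), followed by 'a'.
const std::u32string decomposed = U"\u0456\u0308" U"a";

// The composed form again, this time as raw UTF-8 bytes in a std::string.
const std::string utf8_composed = "\xD1\x97" "a";
```

Here composed.size() is 2, decomposed.size() is 3, and utf8_composed.size() is 3, even though all three would normally render as the same two visible characters.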

Related

Is there a way to restrict string manipulation e.g substring?

The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string& but the user/developer can still create a substring by calling std::string::substr.
Another way would be create a wrapper around const std::string& and expose only the string through getters.
Is this even possible?
Another way would be create a wrapper around const std::string& and expose only the string through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as a storage, and can provide an interface as you see fit to operate on unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
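As a rough sketch of the "operate on code points yourself" option, something along these lines could work. Utf8Text and its members are hypothetical names, the input is assumed to be valid UTF-8, and nothing here handles grapheme clusters or normalization.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: owns the UTF-8 bytes and only exposes
// operations that are measured in code points, never in bytes.
class Utf8Text {
public:
    explicit Utf8Text(std::string utf8) : bytes_(std::move(utf8)) {}

    // Number of code points: every non-continuation byte starts one.
    std::size_t size() const {
        std::size_t n = 0;
        for (unsigned char c : bytes_) {
            if (!is_continuation(c)) ++n;
        }
        return n;
    }

    // Skip `pos` code points, then return up to `count` code points.
    Utf8Text substr(std::size_t pos, std::size_t count) const {
        const std::size_t first = byte_offset(0, pos);
        const std::size_t last = byte_offset(first, count);
        return Utf8Text(bytes_.substr(first, last - first));
    }

    // Read-only view of the storage, e.g. for output.
    const std::string& bytes() const { return bytes_; }

private:
    static bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

    // Byte index reached after advancing `n` code points from `start`.
    std::size_t byte_offset(std::size_t start, std::size_t n) const {
        std::size_t i = start;
        while (i < bytes_.size() && n > 0) {
            ++i;  // step over the lead byte
            while (i < bytes_.size() &&
                   is_continuation(static_cast<unsigned char>(bytes_[i]))) {
                ++i;  // step over its continuation bytes
            }
            --n;
        }
        return i;
    }

    std::string bytes_;
};
```

Because the only mutable access to the bytes is through the constructor, a caller cannot cut a multi-byte sequence in half. It can still separate a base letter from its combining marks, which is where a real Unicode library comes in.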
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to 0x10FFFF (a little over 2^20). UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
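For the curious, the 1-to-4-byte encoding can be sketched as below. encode_utf8 is a name I made up, and the function assumes it is handed a valid code point (no surrogates, nothing above 0x10FFFF), so treat it as an illustration rather than production code.

```cpp
#include <string>

// Encode a single code point as UTF-8.
// Assumption: cp is a valid Unicode scalar value; no checking is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(U'a') is one byte long, while encode_utf8(0x20AC) (the euro sign) is three bytes long.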
A character can consist of a (more or less) arbitrary number of code points. Typically it consists of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à can be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in two bytes (0xCC 0x80). So, it's one character composed of two code points encoded in 3 bytes.
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library, but in this case, I'm not aware of such a thing existing.

Get number of characters in string?

I have an application, accepting a UTF-8 string of a maximum 255 characters.
If the characters are ASCII, (characters number == size in bytes).
If the characters are not all ASCII and contains Japanese letters for example, given the size in bytes, how can I get the number of characters in the string?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen to count the length or use mbstowcs
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using
mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported
encoding, as long as the appropriate locale has been selected. A
hard-wired technique to count the number of characters in a UTF-8
string is to count all bytes except those in the range 0x80 – 0xBF,
because these are just continuation bytes and not characters of their
own. However, the need to count characters arises surprisingly rarely
in applications.
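The hard-wired technique from the quote might look like this with the signature from the question. count_code_points is a hypothetical name, and the sketch assumes the buffer holds valid UTF-8; it counts code points, not user-perceived characters.

```cpp
// Count code points in a UTF-8 buffer by skipping continuation bytes,
// i.e. bytes in the range 0x80..0xBF (bit pattern 10xxxxxx).
// Assumption: `data` is valid UTF-8 and `bytes_no` is its length in bytes.
int count_code_points(const char* data, int bytes_no) {
    int char_no = 0;
    for (int i = 0; i < bytes_no; ++i) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if ((c & 0xC0) != 0x80) {
            ++char_no;  // lead byte or plain ASCII byte
        }
    }
    return char_no;
}
```

For the nine bytes E6 97 A5 E6 9C AC E8 AA 9E (the UTF-8 encoding of 日本語) this returns 3.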
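You can store a Unicode character in a wide character (wchar_t), although where wchar_t is only 16 bits wide, as on Windows, code points outside the BMP need two of them.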
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. As a smallest component of written language that has semantic value (the first meaning), á is a single character. If you take á and count basic unit of encoding for the Unicode character encoding (the third meaning) in it, you may get either one or two, depending on what exact representation (normalized or denormalized) is being used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters (meaning 3). mblen is one method of doing that, provided your current locale has UTF-8 encoding. Modern C++ offers more C++-ish methods, however, they are not supported on some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU which you may want to consider if your needs are much more complicated than counting characters.

standard function to count number of char in string C++

Is there a standard function like size() or length() to count the number of characters in a string? The following give 5 and 6 for the same word:
#include <iostream>
#include <string>
using namespace std;
int main(){
    string s="Ecole";
    cout<<s.size()<<"\n";  // prints 5
}
and
#include <iostream>
#include <string>
using namespace std;
int main(){
    string s="école";
    cout<<s.size()<<"\n";  // prints 6 when the source file is saved as UTF-8
}
Thank you.
Use:
wstring
Instead of:
string
The string école actually occupies 6 bytes, as the character é takes two bytes in memory (assuming UTF-8).
The hex representation of é is c3 a9.
The ASCII character set doesn't have a lot of "special" characters; the most exotic is probably ` (backquote). A single 8-bit char can hold only 256 distinct values, roughly 0.02% of the 1,114,112 Unicode code points, so if you want each character of a string like école to occupy one element, use wstring instead of string.
Short answer: there is no good answer. Text is complicated.
First, you need to decide what "length" you're looking for to figure out what to call.
In your example, std::string::size() is providing the length in C chars (i.e. bytes). As Vishnu pointed out, the length of the character "é" is 2 bytes, not 1.
On the other hand, if you switch to std::wstring::size() as suggested by Duncan, it will start measuring the size in UTF-16 code units. In that case, the character "é" is 1 UTF-16 code unit.
Switching to wstring might seem like the solution, but it depends on what you're doing. For example, if you're trying to get the size of the string to allocate a buffer (measured in bytes) then std::string::size() might be correct, but std::wstring::size() would be wrong, because each UTF-16 code unit takes 2 bytes to store. (Technically, std::wstring stores wchar_t characters, is not necessarily even UTF-16, and each code unit takes sizeof(wchar_t) bytes to store, so it doesn't really work in general anyway.)
Even if you just want the "number of characters a person would see" (the number of glyphs), switching to wstring won't work for more complicated data. For example, "é" (character U+00E9, http://www.fileformat.info/info/unicode/char/e9/index.htm) is 1 UTF-16 code unit, but "é" can also be represented as "e" plus a combining acute accent (character U+0301, http://www.fileformat.info/info/unicode/char/0301/index.htm). You might need to read about Unicode normalization. There are also situations where a single "character" takes 2 UTF-16 code units, called a surrogate pair, although a lot of software safely ignores these.
Honestly, with Unicode you either have to accept the fact that you won't handle all the edge cases, or you have to give up on processing things one "character" at a time and instead do things one "word" (a string of code points separated by whitespace) to get things to work. Then you would ask the library you're using -- for example, a drawing library -- how wide each "word" is and hope that they have correctly handled all of the accents, combining characters, surrogate pairs, etc.
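To see the different "lengths" side by side, here is a small sketch. I use std::u16string rather than std::wstring so that the element width is fixed at 16 bits on every platform, and explicit escapes so the source file encoding does not matter. The variable names are made up.

```cpp
#include <string>

// "école" as UTF-8 bytes: é takes two chars (0xC3 0xA9).
const std::string utf8_ecole = "\xC3\xA9" "cole";

// "école" as UTF-16 with the precomposed é (U+00E9): one code unit for é.
const std::u16string utf16_ecole = u"\u00E9cole";

// The same visible word with é written as e + COMBINING ACUTE ACCENT (U+0301).
const std::u16string utf16_ecole_decomposed = u"e\u0301cole";
```

The sizes are 6, 5, and 6 respectively. The first and last agree only by coincidence, and none of them is "the number of characters" in any reliable sense.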

Encoding binary data using string class

I am going through one of the requirements for string implementations as part of a study project.
Let us assume that the standard library did not exist and we were
forced to design our own string class. What functionality would it
support and what limitations would we improve. Let us consider
following factors.
Does binary data need to be encoded?
Is multi-byte character encoding acceptable or is unicode necessary?
Can C-style functions be used to provide some of the needed functionality?
What kind of insertion and extraction operations are required?
My question on above text
What does author mean by "Does binary data need to be encoded?". Request to explain with example and how can we implement this.
What does the author mean by point 2? Request to explain with example and how can we implement this.
Thanks for your time and help.
Regarding point one, "Binary data" refers to sequences of bytes, where "bytes" almost always means eight-bit words. In the olden days, most systems were based on ASCII, which requires seven bits (or eight, depending on who you ask). There was, therefore, no need to distinguish between bytes and characters. These days, we're more friendly to non-English speakers, and so we have to deal with Unicode (among other codesets). This raises the problem that string types need to deal with the fact that bytes and characters are no longer the same thing.
This segues onto point two, which is about how you represent strings of characters in a program. UTF-8 uses a variable-length encoding, which has the remarkable property that it encodes the entire ASCII character set using exactly the same bytes that ASCII encoding uses. However, it makes it more difficult to, e.g., count the number of characters in a string. For pure ASCII, the answer is simple: characters = bytes. But if your string might have non-ASCII characters, you now have to walk the string, decoding characters, in order to find out how many there are [1].
These are the kinds of issues you need to think about when designing your string class.
[1] This isn't as difficult as it might seem, since the first byte of each character is guaranteed not to have 10 in its two high bits. So you can simply count the bytes that satisfy (c & 0xc0) != 0x80. Nonetheless, it is still expensive relative to just treating the length of a string buffer as its character count.
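If your string class needs the decoded values and not just a count, walking the string could look roughly like this. decode_utf8 is a hypothetical helper; it assumes well-formed UTF-8 and does no validation (overlong forms, stray continuation bytes, and truncated sequences are not rejected).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into code points, one multi-byte sequence at a time.
// Assumption: the input is well-formed UTF-8.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;   // sequence length implied by the lead byte
        char32_t cp = lead;    // ASCII bytes are their own code point
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The number of code points is then simply decode_utf8(s).size(), at the cost of a full pass over the bytes plus an allocation.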
The question here is "can we store ANY old data in the string, or do certain byte values need to be encoded in some special way?" An example of that would be in the standard C language: if you want to use a newline character, it is "encoded" as \n to make it more readable and clear. Of course, in this example I'm talking about the source code. In the case of binary data stored in the string, how would you deal with "strange" data, e.g. what about zero bytes? Will they need special treatment?
The values guaranteed to fit in a char are the ASCII characters and a few others (a total of 256 different characters in a typical implementation, but char is not GUARANTEED to be 8 bits by the standard). But if we take non-European languages, such as Chinese or Japanese, they consist of a vastly higher number of characters than can fit in a single char. Unicode allows for over a million different code points, so any character from any European, Chinese, Japanese, Thai, Arabic, Mayan, or ancient hieroglyphic language can be represented in one "unit". This is done by using a wider character; for the full range, we need a 32-bit type. The drawback here is that most of the time, we don't actually use that many different characters, so it is a bit wasteful to use 32 bits for each character, only to have zeros in the upper 24 bits nearly all the time.
A multibyte character encoding is a compromise, where "common" characters (common in the European languages) are stored as one char, but less common characters are encoded with multiple char values, using a special range of values to indicate "there is more data in the next char to combine into a single unit". (Or, one could decide to use 2, 3, or 4 chars each time to encode a single character.)
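On the zero-byte question, it may help to look at how std::string itself behaves as one possible design: it stores an explicit length, so a zero byte needs no special encoding, whereas anything that goes through C-style functions treats zero as the terminator. A small sketch, with made-up variable names:

```cpp
#include <string>

// Four bytes of "binary" data with a zero byte in the middle.
const char raw[] = {'\x01', '\x00', '\x02', '\x03'};

// Length-aware construction keeps all four bytes, zero included.
const std::string with_length(raw, sizeof raw);

// C-style construction stops at the first zero byte.
const std::string c_style(raw);
```

Here with_length.size() is 4 and c_style.size() is 1. A string class that relies on C-style functions for its functionality would need to escape or otherwise encode zero bytes; one that tracks its own length would not.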

In ICU UnicodeString what is the difference between countChar32() and length()?

From the docs;
The length is the number of UChar code units are in the UnicodeString. If you want the number of code points, please use countChar32().
and
Count Unicode code points in the length UChar code units of the string.
A code point may occupy either one or two UChar code units. Counting code points involves reading all code units.
From this I am inclined to think that a code point is an actual character and a code unit is just one possible part of a character.
For example.
Say you have a unicode string like:
'foobar'
Both the length and countChar32 will be 6. Then say you have a string composed of 6 chars that each take the full 32 bits to encode: the length would be 12 but the countChar32 would be 6.
Is this correct?
The two values will only differ if you use characters outside the Basic Multilingual Plane (BMP). These characters are represented in UTF-16 as surrogate pairs: two 16-bit code units make up one logical character. If you use any of these, each pair counts as one 32-bit character but two elements of length.
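The same distinction can be illustrated with the standard library alone. This is not how ICU implements it, just a sketch of the idea: size() on a std::u16string plays the role of length(), and the hypothetical count_code_points_utf16 below plays the role of countChar32(). It assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-16 by not counting low (trailing) surrogates,
// so each surrogate pair contributes exactly one.
// Assumption: every surrogate in `s` is part of a valid pair.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;  // BMP code unit or high (leading) surrogate
        }
    }
    return n;
}
```

For u"foobar" both numbers are 6. For a string of two characters outside the BMP, such as u"\U0001F600\U0001F601", size() is 4 while the code point count is 2.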