Counting Unicode characters in C++

How do you count Unicode characters in a UTF-8 file in C++? Perhaps someone would be so kind as to show me a "stand alone" method, or alternatively, a short example using http://icu-project.org/index.html.
EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.

In UTF-8, a non-leading byte always has the top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (to skip ahead across non-leading bytes based on the bit pattern of a leading byte) but in reality, it's unlikely to make much difference except for short strings (because you'll typically be close to the memory bandwidth anyway).
Edit: I originally misread your question as simply asking how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert to UTF-32/UCS-4 first; then you'll need some sort of sparse array to count the frequencies.
The hard part of this deals with counting code points vs. characters. For example, consider the character "À" -- the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use codepoint U+00C0, which encodes the whole thing in a single code point, or you can use codepoint U+0041 (Latin capital letter A) followed by codepoint U+0300 (Combining grave accent).
Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into a single code point, or separate them all into separate code points. For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical -- I'd use the normalizer API from the ICU project.
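Since the answer points at ICU's normalizer, here is a minimal sketch of what that frequency count could look like (the helper name and the choice of NFC normalization are my assumptions, not part of the answer above):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <map>
#include <string>

// Count occurrences of each code point in a UTF-8 string after NFC normalization.
std::map<UChar32, long> countCodePoints(const std::string &utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

    std::map<UChar32, long> counts;
    if (U_FAILURE(status))
        return counts;  // real code should report the error instead

    icu::UnicodeString normalized = nfc->normalize(icu::UnicodeString::fromUTF8(utf8), status);

    // moveIndex32 advances by whole code points, so surrogate pairs are handled.
    for (int32_t i = 0; i < normalized.length(); i = normalized.moveIndex32(i, 1))
        ++counts[normalized.char32At(i)];
    return counts;
}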

If you know the UTF-8 sequence is well formed, it's quite easy. Count up each byte that starts with a zero bit or two one bits. The first condition will catch every code point that is represented by a single byte, the second will catch the first byte of each multi-byte sequence.
while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}
Or alternatively as remarked in the comments, you can simply skip every byte that's a continuation:
while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}
Or if you want to be super clever and make it a 2-liner:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
The Wikipedia page for UTF-8 clearly shows the patterns.
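For reference, a small sketch of those leading-byte patterns (the helper function is mine, not from the answer):

// UTF-8 leading-byte patterns:
//   0xxxxxxx  -> 1-byte sequence (ASCII)
//   110xxxxx  -> start of a 2-byte sequence
//   1110xxxx  -> start of a 3-byte sequence
//   11110xxx  -> start of a 4-byte sequence
//   10xxxxxx  -> continuation byte (never starts a code point)
inline int sequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;  // continuation or invalid lead byte
}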

A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

I know it's late for this thread, but it could help. With ICU, I did it like this:
string theString = "blabla";
UnicodeString uStr = UnicodeString::fromUTF8(theString);  // fromUTF8 takes a StringPiece, so a std::string works directly
cout << "length = " << uStr.length() << endl;  // note: length() counts UTF-16 code units; countChar32() gives code points

I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.
open file
until eof
    if file.readchar & 0xC0 != 0x80
        increment count
close file
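A minimal C++ sketch of that pseudocode (the file name is just a placeholder):

#include <cstddef>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("input.txt", std::ios::binary);  // placeholder file name
    std::size_t count = 0;
    char c;
    // Count every byte that is not a UTF-8 continuation byte (top two bits 10).
    while (in.get(c))
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++count;
    std::cout << count << '\n';
}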

Related

Counting UTF-16 characters

Let's say that I have an array of UTF-16 bytes. How would I count how many characters are in that array? The array can also be cut off at a buffer boundary: for example, say there is a 4-byte UTF-16 character and only 3 of the 4 bytes are read into a buffer. I then try counting that 3-byte buffer. How would I detect that there are not enough bytes?
4-byte UTF-16 characters are encoded as surrogate pairs, and every surrogate pair starts with a code unit in the range 0xD800 to 0xDBFF inclusive. To count the number of characters (i.e. code points) in a UTF-16 string, you therefore want to do something like this (pseudo-code):
char_count = 0;
string_pos = 0;
while (!end_of_string)
{
    code_unit = input_string[string_pos];
    ++char_count;
    if (code_unit >= 0xd800 && code_unit <= 0xdbff)
        string_pos += 2;
    else
        ++string_pos;
}
To detect an incomplete surrogate pair, just check if there are any code units left in the string after detecting the lead-in value. You might also want to check for invalid surrogate pairs.
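In code, the pseudo-code above plus the incomplete-pair check might look like this (a sketch; the container type and the choice to throw are mine):

#include <cstddef>
#include <stdexcept>
#include <vector>

// Count code points in a buffer of UTF-16 code units; throw if the buffer
// ends in the middle of a surrogate pair.
std::size_t countUtf16CodePoints(const std::vector<char16_t> &units)
{
    std::size_t count = 0, pos = 0;
    while (pos < units.size())
    {
        ++count;
        if (units[pos] >= 0xD800 && units[pos] <= 0xDBFF)  // lead surrogate: one more unit expected
        {
            if (pos + 1 >= units.size())
                throw std::runtime_error("incomplete surrogate pair at end of buffer");
            pos += 2;
        }
        else
        {
            ++pos;
        }
    }
    return count;
}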
Wikipedia has a good write-up on UTF-16 here.
Examine the state of the decoder. If the state is READY then it is enough, otherwise it is not enough. Of course, you need to maintain the state with each incoming code point.

Finding and comparing a Unicode character in C++

I am writing a lexical analyzer that parses a given string in C++. I have a string
line = R"(if n = 4 # comment
return 34;
if n≤3 retur N1
FI)";
All I need to do is output all words, numbers and tokens in a vector.
My program works with regular tokens, words and numbers; but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.
So far, my code basically takes the string line by line, reads the first word, number or token, chops it off, and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≠ (of course), and I am also not clear on how much of the string I need to chop off in order to get rid of the Unicode char. In the case of "!=" I simply remove line[0] and line[1].
If your input file is UTF-8, just treat your Unicode characters ≤, ≠, etc. as strings, so you use the same logic to recognize "≤" as you would for "<=". The length of such a character in bytes is then given by strlen("≤").
All Unicode encodings are variable-length except UTF-32. Therefore the next character isn't necessarily a single char, and you must read it as a string. Since you're using a char* or std::string, the encoding is likely UTF-8, and the next character can be returned as a std::string.
The encoding of UTF-8 is very simple and you can read about it everywhere. In short, the first byte of a sequence will indicate how long that sequence is and you can get the next character like this:
std::string getNextChar(const std::string& str, size_t index)
{
    unsigned char lead = static_cast<unsigned char>(str[index]);  // avoid sign issues when masking
    if ((lead & 0x80) == 0)          // 1-byte sequence
        return std::string(1, str[index]);
    else if ((lead & 0xE0) == 0xC0)  // 2-byte sequence
        return std::string(&str[index], 2);
    else if ((lead & 0xF0) == 0xE0)  // 3-byte sequence
        return std::string(&str[index], 3);
    else if ((lead & 0xF8) == 0xF0)  // 4-byte sequence
        return std::string(&str[index], 4);
    throw "Invalid codepoint!";
}
It's a very simple decoder and doesn't handle invalid code points or a broken data stream. If you need better handling, you'll have to use a proper UTF-8 library.
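For example, a hypothetical usage sketch for the lexer (the tokens vector and the line string are assumed from the question, and the source file must itself be saved as UTF-8 so the literals match):

std::string tok = getNextChar(line, 0);
if (tok == "≤" || tok == "≠")
{
    tokens.push_back(tok);       // store the multi-byte token
    line.erase(0, tok.size());   // chop off exactly tok.size() bytes, just like removing "!=" but longer
}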

incorrect character recognition in string array C++

I have to write a program that asks the user to input characters, stores them in a string, then reads the last character of the string and determines whether it's an integer or a letter or neither of those.
#include<iostream>
using namespace std;
int main()
{
    string str;
    cout << "Enter a line of string: ";
    getline(cin, str);
    char i = str.length()-1;
    cout<<str[i];
    cout<<endl;
    if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z')) cout << "it's a letter";
    else if (str[i] >=0) cout << "it's a number";
    else cout << "it is something else";
    return 0;
}
I came up with this code; it recognizes both letters and integers, but I have 2 problems with it:
1) If the last character of the string is a symbol, for example '*', then the program says it's an integer, although it's not.
How could I correct this?
2) Also, if I type in an additional condition for the following statement in order to recognize integers from 0 to 9, the code fails:
else if ((str[i] >=0) && (str[i] <=9)) cout << "it's a number";
It works only if it's stated as (str[i] >= 0).
Why is that?
Because numbers in a string are not integer values, but character symbols. Think about what 'a' is: the computer operates only with 1 and 0 values, so it doesn't have any "a" symbol in hardware. The letter 'a' must therefore be encoded as some bit pattern such as 01100001 (0x61 is 'a' in the ASCII and UTF-8 encodings). In a similar way, the digits are encoded as particular bit patterns, for example 0x30 to 0x39 in ASCII and UTF-8 for the digits 0 to 9.
So your test has to be if ((str[i] >= '0') && ... (notice the digit zero is in apostrophes, telling the compiler to use the character encoding; it compiles to 0x30, or 00110000 in binary).
You can also write if ((str[i] >= 48) && ... and it will compile to identical machine code, because for the compiler there is no difference between 48, 0x30 and '0'. But for humans reading your source later and trying to understand what you were trying to achieve, '0' is probably the best way to express your intent.
Also notice that the encoding in use (as defined by your compiler and source) may use different values to encode a particular symbol, which can make string processing quite tricky. For example, in UTF-8 the common English characters are encoded identically to 7-bit ASCII (a trivial old encoding you should start by learning a bit about), and each such letter fits into a single byte (8 bits), but accented and other extra characters are encoded as a sequence of bytes (two to four). So even trivial things like "count how many characters are in a string" turn into a complex task requiring rich tables describing Unicode characters and their category/function to count characters the way a human would; a simple Arabic word consisting of 4 letters may require, say, 12 bytes of memory, so a plain strlen will return 12. If you need these things for a Unicode encoding, you are better off using a library than implementing it yourself.
Your code will work well for simple ASCII-like strings only, which is enough to practice stuff like this. Just remember that everything in the computer is encoded as 1s and 0s (sounds, characters, pixel/vector graphics... everything, at some point, below many layers of abstraction and utilities, is a series of bit values and nothing else). So when you are trying to process some data, it may help to learn how that data is actually encoded and how much of that is abstracted away by your programming language/library + OS API.
You can also use your debugger's memory view to check what values are used to encode your string and how it looks in terms of byte values.
Your condition in 'else if' doesn't seem to satisfy your requirement.
Try something like:
if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z'))
    cout << "it's a letter";
else if (str[i] >= '0' && str[i] <= '9')
    cout << "it's a number";
else
    cout << "it is something else";

C++: Explanation for a Competitive Programming Tip

On the DMOJ online judge, used for competitive programming, one of the tips for a faster execution time (C++) was to add this macro on top if the problem only requires unsigned integral data types to be read.
How does this work and what are the advantages and disadvantages of using this?
#define scan(x) do{while((x=getchar())<'0'); for(x-='0'; '0'<=(_=getchar()); x= (x<<3)+(x<<1)+_-'0');}while(0)
char _;
Source: https://dmoj.ca/tips/#cpp-io
First let's reformat this a bit:
#define scan(dest) \
do { \
    while((dest = getchar()) < '0'); \
    for(dest -= '0'; '0' <= (temp = getchar()); dest = (dest<<3) + (dest<<1) + temp - '0'); \
} while(0)
char temp;
First, the outer do{...}while(0) is just to ensure proper parsing of the macro. See here for more info.
Next, while((dest = getchar()) < '0'); - this might as well just be dest = getchar() but it does some additional work by discarding any characters below (but not above) the '0' character. This can be useful since whitespace characters are all "less than" the 0 character in ascii.
The meat of the macro is the for loop. First, the initialization expression dest -= '0', sets dest to the actual integer value represented by the character by taking advantage of the fact that the 0-9 characters in ascii encoding are adjacent and sequential. So if the first character were '5' (value 53), subtracting '0' (value 48) results in the integer value 5.
The condition statement, '0' <= (temp = getchar()), does several things - first, it gets the next character and assigns it to temp, then checks to see if it is greater than or equal to the '0' character (so will fail on whitespace).
As long as the character is a numeral (or at least equal to '0'), the increment expression is evaluated. dest = (dest<<3) + (dest<<1) + temp - '0' - the temp - '0' expression does the same adjustment as before from ascii to numeric value, and the shifts and adds are just an obscure way of multiplying by 10. In other words, it is equivalent to temp -= '0'; dest = dest * 10 + temp;. Multiplying by 10 and adding the next digit's value is what builds the final value.
Finally, char temp; declares the temporary character storage for use in subsequent macro invocations in the program.
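For comparison, here is a plain (hypothetical) function that mirrors what the macro does, without the shifts or the shared temporary:

#include <cstdio>

// Skip everything below '0', then accumulate digits into an unsigned value.
// Like the macro, it assumes the number is terminated by whitespace (or any
// character below '0') and does no EOF or overflow checking.
unsigned readUnsigned()
{
    int c;
    while ((c = std::getchar()) < '0')
        ;                               // discard whitespace and other low characters
    unsigned value = c - '0';
    while ((c = std::getchar()) >= '0')
        value = value * 10 + (c - '0'); // same as the (dest<<3)+(dest<<1) trick
    return value;
}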
As far as why you'd use it, I'm skeptical that it would provide any measurable benefit compared to something like scanf or atoi.
What this does is it reads a number character by character with a bunch of premature optimizations. See #MooseBoys' answer for more details.
About its advantages and disadvantages, I don't see any benefit to using this at all. Stuff like (x<<3)+(x<<1), which is equal to x * 10, is an optimization that should be done by the compiler, not by you.
As far as I know, cin and cout are fast enough for all competitive programming purposes, especially if you disable syncing with stdio. I've been using them since I started competitive programming and never had any problems.
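For reference, that setup is usually just two lines at the start of main (untying cin from cout is a common companion tweak, not something claimed above):

#include <iostream>

int main()
{
    std::ios_base::sync_with_stdio(false);  // stop syncing C++ streams with C stdio
    std::cin.tie(nullptr);                  // don't flush cout before every cin read

    long long n;
    std::cin >> n;
    std::cout << n << '\n';
}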
Also, my own testing shows cin and cout aren't slower than C I/O, despite the popular belief. You can try testing the performance of this yourself. Make sure you have optimizations enabled.
Apparently, some competitive programmers focus way too much on stuff like fast I/O when their algorithm is the thing that matters most.

std::string optimal way to truncate utf-8 at safe place

I have a valid UTF-8 encoded string in a std::string. I have a limit in bytes, MAX_SIZE. I would like to truncate the string and add "..." at MAX_SIZE - 3 - x, where x is the value that prevents a UTF-8 character from being cut in half.
Is there a function that could determine x based on MAX_SIZE without the need to start from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on its upper bits. If the topmost bit is 0, then the byte is an ASCII character, and the remaining 7 bits are the code point value itself. If the top two bits are 10, then the 6 bits that follow are extra bits for a multi-byte sequence. The beginning of a multi-byte sequence is coded with 11 in the top two bits.
So if the top two bits of a byte are not 10, then it's either an ASCII character or the start of a multi-byte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm will break the string at code point boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters they combine with; accents can be lost, for example. Doing proper grapheme cluster analysis would require access to the Unicode table that says whether a code point is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
auto FindCutPosition(const std::string &str, size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(str.size() > 3 && "Make sure stupidity hasn't happened.");

    max_size -= 3;
    for(size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
        if((byte & 0xC0) != 0x80)   //not a continuation byte, so a valid place to cut
            return pos;
    }

    unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
    if((byte & 0xC0) != 0x80)
        return size_t{0};           //size_t, so the deduced return type stays consistent

    //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...);
}
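A hypothetical usage sketch, assuming MAX_SIZE is the byte limit from the question (FindCutPosition already reserves 3 bytes for the "..."):

std::string truncated = str;
if (truncated.size() > MAX_SIZE)
{
    size_t cut = FindCutPosition(truncated, MAX_SIZE);
    truncated.erase(cut);   // cut at a code point boundary
    truncated += "...";     // total size stays within MAX_SIZE
}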