Extending 'isalnum' to recognize UTF-8 umlaut

I wrote a function which extends isalnum to recognize UTF-8 coded umlaut.
Is there maybe a more elegant way to solve this issue?
The code is as follows:
bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
        || cr == 195 // UTF-8
        || cr == 132 // Ä
        || cr == 164 // ä
        || cr == 150 // Ö
        || cr == 182 // ö
        || cr == 156 // Ü
        || cr == 188 // ü
        || cr == 159 // ß
    ) {
        return true;
    } else {
        return false;
    }
}
EDIT:
I have now tested my solution several times, and it seems to do the job for my purpose. Any strong objections?

Your code doesn't do what you're claiming.
The UTF-8 representation of Ä is two bytes: 0xC3, 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8.
Some general suggestions:
Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.
It doesn't often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).
Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.
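As a rough illustration of the two-byte point above, here is a minimal sketch that looks at a position in a string rather than at a single char. The name alnumUmlautLength is hypothetical, and the sketch assumes the input is UTF-8 and that only ÄäÖöÜüß should count in addition to ASCII letters and digits.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): returns the number of
// bytes (1 or 2) of the accepted character starting at s[pos], or 0 if the
// bytes at that position are not an ASCII alphanumeric or one of ÄäÖöÜüß.
std::size_t alnumUmlautLength(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (lead < 0x80) return std::isalnum(lead) ? 1 : 0;  // plain ASCII
    // All seven German letters share the lead byte 0xC3 in UTF-8.
    if (lead != 0xC3 || pos + 1 >= s.size()) return 0;
    switch (static_cast<unsigned char>(s[pos + 1])) {
        case 0x84: case 0xA4:  // Ä ä
        case 0x96: case 0xB6:  // Ö ö
        case 0x9C: case 0xBC:  // Ü ü
        case 0x9F:             // ß
            return 2;
        default:
            return 0;
    }
}
```

A caller would advance by the returned length, so the second byte of an umlaut is never examined on its own.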

I'm not 100% sure but the C++ std::isalnum in <locale> almost certainly recognizes locale specific additional characters: http://www.cplusplus.com/reference/std/locale/isalnum/

It's impossible with the interface you define, since UTF-8 is a
multibyte encoding; a single character may require multiple char to
represent it. (I've got code for determining whether a UTF-8 character
is a member of a specified set of characters in my library, but the
character is specified by a pair of iterators, and not a single char.)

Related

Detect german dialect in a string c++

I need to parse each character in a string for example:
"foo123 aaa [ ßü+öä"
And leave only a-z, A-Z, 0-9, whitespace and German dialect characters.
In this example the result would be:
"foo123 aaa ßüöä"
The string should be parsed one character at a time, since I need to know whether any of the characters is a German dialect character.
#include <string>
#include <iostream>
#include <algorithm>
int main()
{
    std::string s = "Abc123ü + ßöÄ;";
    s.erase(std::remove_if(s.begin(),s.end(),
        [&](unsigned char c) {
            if ((c >= 'a' and c<='z') or (c>='0' and c<='9') or (c>='A' and c<='Z')) return false;
            if (c==0xDF) {
                return false;
            }
            return true;
        }), s.end());
    std::cout<<"Stripped string: " << s;
    return 0;
}
The output here is Abc123 and I expect Abc123ß

Original ASCII is seven-bit (0-127). The 8-bit extensions of ASCII use code pages to cover other languages in the range 128-255.
The usual code page for German is defined in ISO/IEC 8859-1 (often called Latin-1, among other names). In practice ISO/IEC 8859-15 is often used instead, because it contains the € sign, which is missing from ISO/IEC 8859-1. For the umlauts äöüÄÖÜ and ß, the codes are the same in both.
// Latin1 Ä Ö Ü ä ö ü ß
vector<int> ascii_vals {196, 214, 220, 228, 246, 252, 223};
Unfortunately, Windows likes to use code page 437 or 850, where the codes are different again.
However, you really should use UTF-8 in modern applications. I guess you have to connect to a legacy system if you have to use a single-byte encoding. It is better to use UTF-8 internally and convert strings at the boundary. If you send the single-byte string to a different system that expects UTF-8, your nicely filtered German umlauts might be misinterpreted again.
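If the string is kept in UTF-8, the filter from the question has to look at byte pairs, because ß is then 0xC3 0x9F rather than the single Latin-1 byte 0xDF. The following is only a sketch under that assumption; keepGermanAlnum is a hypothetical name, and the test strings use hex escapes so the example does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: keeps ASCII letters, digits, spaces and the UTF-8
// encoded letters ÄäÖöÜüß; every other byte is dropped.
std::string keepGermanAlnum(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const bool asciiKeep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                               (c >= '0' && c <= '9') || c == ' ';
        if (asciiKeep) {
            out += in[i];
        } else if (c == 0xC3 && i + 1 < in.size()) {
            const unsigned char n = static_cast<unsigned char>(in[i + 1]);
            // Second bytes of Ä ä Ö ö Ü ü ß after the lead byte 0xC3.
            if (n == 0x84 || n == 0xA4 || n == 0x96 || n == 0xB6 ||
                n == 0x9C || n == 0xBC || n == 0x9F) {
                out.append(in, i, 2);  // copy both bytes of the letter
                ++i;                   // the second byte has been consumed
            }
        }
        // Any other byte (punctuation, other multi-byte sequences) is skipped.
    }
    return out;
}
```

With the input from the question, "Abc123ü + ßöÄ;" encoded as UTF-8, this keeps the letters, digits, both spaces and all four German letters, and drops '+' and ';'.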

Finding and comparing a Unicode character in C++

I am writing a Lexical analyzer that parses a given string in C++. I have a string
line = R"(if n = 4 # comment
return 34;
if n≤3 retur N1
FI)";
All I need to do is output all words, numbers and tokens in a vector.
My program works with regular tokens, words and numbers; but I cannot figure out how to parse Unicode characters. The only Unicode characters my program needs to save in a vector are ≤ and ≠.
So far my code basically takes the string line by line, reads the first word, number or token, chops it off and recursively continues to eat tokens until the string is empty. I am unable to compare line[0] with ≠ (of course), and I am also not clear on how much of the string I need to chop off to get rid of the Unicode char. In the case of "!=" I simply remove line[0] and line[1].

If your input file is UTF-8, just treat your Unicode characters ≤, ≠, etc. as strings. You then use the same logic to recognize "≤" as you would for "<=". The length of a Unicode char in bytes is given by strlen("≤"), which is 3 when the source and execution character sets are UTF-8.
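A minimal sketch of that idea, assuming UTF-8 input: the multi-byte operators sit in the same token table as the ASCII ones, and the lexer chops off however many bytes matched. The function name matchOperator and the token list are hypothetical; the hex escapes are the UTF-8 encodings of ≤ (U+2264) and ≠ (U+2260).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: if one of the known operator tokens starts at
// line[pos], returns its length in bytes; otherwise returns 0.
std::size_t matchOperator(const std::string& line, std::size_t pos) {
    // "\xE2\x89\xA4" is ≤ and "\xE2\x89\xA0" is ≠ in UTF-8 (3 bytes each).
    static const std::vector<std::string> tokens = {
        "<=", "!=", "\xE2\x89\xA4", "\xE2\x89\xA0"};
    if (pos > line.size()) return 0;
    for (const std::string& tok : tokens) {
        // compare() checks the tok.size() bytes starting at pos against tok.
        if (line.compare(pos, tok.size(), tok) == 0) return tok.size();
    }
    return 0;
}
```

The return value answers the "how much do I chop off" question directly: 2 for "!=" and 3 for "≠".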

All Unicode encodings are variable-length except UTF-32. Therefore the next character isn't necessarily a single char, and you must read it as a string. Since you're using a char* or std::string, the encoding is likely UTF-8, and the next character can be returned as a std::string.
The encoding of UTF-8 is very simple and you can read about it everywhere. In short, the first byte of a sequence will indicate how long that sequence is and you can get the next character like this:
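#include <cstddef>
#include <stdexcept>
#include <string>

std::string getNextChar(const std::string& str, std::size_t index)
{
    // Work on an unsigned byte; note that & binds more loosely than ==,
    // so each mask test needs its own parentheses.
    const unsigned char lead = static_cast<unsigned char>(str[index]);
    if ((lead & 0x80) == 0x00) // 1-byte sequence
        return str.substr(index, 1);
    else if ((lead & 0xE0) == 0xC0) // 2-byte sequence
        return str.substr(index, 2);
    else if ((lead & 0xF0) == 0xE0) // 3-byte sequence
        return str.substr(index, 3);
    else if ((lead & 0xF8) == 0xF0) // 4-byte sequence
        return str.substr(index, 4);
    throw std::runtime_error("Invalid lead byte!");
}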
It's a very simple decoder: it does not validate continuation bytes or cope with a truncated data stream. If you need better handling, you'll have to use a proper UTF-8 library.
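To show how the lead-byte rule can drive a whole pass over a string, here is a self-contained sketch that splits UTF-8 text into one std::string per code point. The names sequenceLength and splitCodePoints are hypothetical, and continuation bytes are still not validated.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: length in bytes of the sequence introduced by lead byte b.
std::size_t sequenceLength(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    throw std::runtime_error("invalid UTF-8 lead byte");
}

// Splits a UTF-8 string into one std::string per code point.
// Throws if a sequence is cut off by the end of the string.
std::vector<std::string> splitCodePoints(const std::string& str) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < str.size();) {
        const std::size_t len = sequenceLength(static_cast<unsigned char>(str[i]));
        if (i + len > str.size()) throw std::runtime_error("truncated UTF-8 sequence");
        out.push_back(str.substr(i, len));
        i += len;  // jump to the next lead byte
    }
    return out;
}
```

For the lexer in the question, each element can then be compared against "≤" or "≠" with ordinary string equality.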

incorrect character recognition in string array C++

I have to write a program that asks the user to input characters, stores them in a string array, then reads the last character of the string and determines whether it's an integer, a letter, or neither.
#include<iostream>
#include<string>
using namespace std;
int main()
{
    string str;
    cout << "Enter a line of string: ";
    getline(cin, str);
    size_t i = str.length()-1; // index of the last character
    cout<<str[i];
    cout<<endl;
    if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z')) cout << "it's a letter";
    else if (str[i] >=0) cout << "it's a number";
    else cout << "it is something else";
    return 0;
}
I came up with this code. It recognizes both letters and integers, but I have 2 problems with it:
1) If the last character of the string is a symbol, for example an '*', the program says it's an integer, although it's not.
How could I correct this?
2) Also, if I add a condition to the following statement in order to recognize integers from 0 to 9, the code fails:
else if ((str[i] >=0) && (str[i] <=9)) cout << "it's a number";
It works only if it's stated as (str[i] >=0).
Why is that?

Because digits in a string are not integer numbers but characters. Think about what 'a' is: the computer operates only on 1 and 0 values, so it has no "a" symbol in hardware. The letter "a" must therefore be encoded as some bit pattern such as 01100001 (0x61 is 'a' in both ASCII and UTF-8). In the same way the digits are encoded as particular bit patterns: 0x30 to 0x39 in ASCII and UTF-8 for the digits 0 to 9.
So your test has to be if ((str[i] >= '0') && ... (notice that the digit zero is in apostrophes, which tells the compiler to use the character encoding; it compiles to 0x30, or 00110000 in binary).
You can also write if ((str[i] >= 48) && ... and on an ASCII-based platform it will compile to identical machine code, because to the compiler there is no difference between 48, 0x30 and '0'. But for humans who read your source later and try to understand what you were trying to achieve, '0' probably explains your intent best.
Also notice that the encoding in use (as defined by your compiler and source) may use different values to encode a particular symbol, which sometimes makes string processing quite tricky. In UTF-8, for example, the common English characters are encoded identically to 7-bit ASCII (a trivial old encoding that you should start by learning a bit about), and each of them fits into a single byte (8 bits). Accented and other extra characters, however, are encoded as a series of two to four bytes. Trivial things like "count how many characters are in a string" therefore turn into a complex task that requires rich tables defining Unicode characters and their category/function in order to count characters the way a human would. A simple Arabic text consisting of 4 letters may, for example, require 8 bytes of memory, so a simple strlen will return 8. If you need these things for Unicode, you are better off using a library than implementing it yourself.
Your code will work well for simple ASCII-like strings only, which is enough to practice stuff like this. Just remember that everything in the computer is encoded as 1 and 0: sounds, characters, pixel and vector graphics, everything. Under many layers of abstraction and utilities there is some point where the information is encoded as a series of bit values 1 or 0, and nothing else. So when you are trying to process some data, it may be helpful to learn how the thing is actually encoded in the hardware and how much of that is abstracted away from you by your programming language, library and OS API.
You can also check in the debugger's memory view which values are used to encode your string, and how it looks in terms of byte values.

Your condition in 'else if' doesn't seem to satisfy your requirement.
Try something like:
if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z'))
    cout << "it's a letter";
else if (str[i] >= '0' && str[i] <= '9')
    cout << "it's a number";
else
    cout << "it is something else";
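An alternative to spelling out the ranges is to use the classification functions from <cctype>. The sketch below is only one possible shape; classifyLast is a hypothetical name, the default "C" locale is assumed, and the cast to unsigned char is there because passing a negative char to these functions is undefined behavior.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: classifies the last character of str.
std::string classifyLast(const std::string& str) {
    if (str.empty()) return "empty";  // nothing to look at
    const unsigned char c = static_cast<unsigned char>(str.back());
    if (std::isalpha(c)) return "letter";
    if (std::isdigit(c)) return "digit";
    return "other";
}
```

Like the range checks, this only covers single-byte characters; a trailing UTF-8 umlaut would be reported as "other".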

Code not identifying Swedish characters despite #include<clocale> in c++

I have to write code in C++ that identifies and counts English and non-English characters in a string.
The user writes an input and the program must count user's letters and notify when it finds non-English letters.
My problem is that I get a question mark instead of the non-English letter!
At the beginning of the code I wrote:
...
#include <clocale>
int main() {
std::setlocale(LC_ALL, "sv_SE.UTF-8");
...
(the locale is Swedish)
If I try to print out Swedish letters before the counting loops (as a test), it does work, so I guess the clocale is working fine.
But when I launch the counting loop below,
for (unsigned char c: rad) {
    if (c < 128) {
        if (isalpha(c) != 0)
            bokstaver++;
    }
    if (c >= 134 && c <= 165) {
        cout << "Your text contains a " << c << '\n';
        bokstaver++;
    }
}
my non-English letter is taken into account but not printed out with cout.
I used unsigned char since non-English letters are between ASCII 134 and 165, so I really don't know what to do.
test with the word blå:

"non-English letters are between ASCII 134 and 165"
No, they aren't. ASCII only covers 0 to 127, and in UTF-8 non-English characters do not fall in any single-byte range. A non-ASCII character consists of two or more code units, each with a value above 0x7F (so none of them is itself an ASCII character). å, for example, consists of 0xC3 followed by 0xA5.
The C and C++ library functions which only accept a single char (such as std::isalpha) are not useful when using UTF-8 because that single char can only represent a single code unit.
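As a sketch of what the counting loop could look like when it works on code unit pairs instead of single bytes: the version below assumes UTF-8 input and that only the Swedish letters ÅÄÖåäö count as non-English letters. LetterCount and countLetters are hypothetical names. Each non-English letter is stored as its complete two-byte sequence, which is what has to be sent to cout for it to display correctly.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct LetterCount {
    int letters = 0;                      // English plus Swedish letters
    std::vector<std::string> nonEnglish;  // each stored as its full UTF-8 sequence
};

// Hypothetical helper: counts letters in a UTF-8 string.
LetterCount countLetters(const std::string& rad) {
    LetterCount result;
    for (std::size_t i = 0; i < rad.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(rad[i]);
        if (c < 0x80) {
            if (std::isalpha(c)) ++result.letters;
        } else if (c == 0xC3 && i + 1 < rad.size()) {
            const unsigned char n = static_cast<unsigned char>(rad[i + 1]);
            // Second bytes of Å Ä Ö å ä ö after the lead byte 0xC3.
            if (n == 0x85 || n == 0x84 || n == 0x96 ||
                n == 0xA5 || n == 0xA4 || n == 0xB6) {
                ++result.letters;
                result.nonEnglish.push_back(rad.substr(i, 2));
                ++i;  // the second byte has been consumed
            }
        }
    }
    return result;
}
```

For the test word blå this reports three letters, one of which is non-English.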

counting unicode characters in c++

How do you count unicode characters in a UTF-8 file in C++? Perhaps if someone would be so kind to show me a "stand alone" method, or alternatively, a short example using http://icu-project.org/index.html.
EDIT: An important caveat is that I need to build counts of each character, so it's not like I'm counting the total number of characters, but the number of occurrences of a set of characters.

In UTF-8, a non-leading byte always has the top two bits set to 10, so just ignore all such bytes. If you don't mind extra complexity, you can do more than that (to skip ahead across non-leading bytes based on the bit pattern of a leading byte) but in reality, it's unlikely to make much difference except for short strings (because you'll typically be close to the memory bandwidth anyway).
Edit: I originally mis-read your question as simply asking about how to count the length of a string of characters encoded in UTF-8. If you want to count character frequencies, you probably want to convert those to UTF-32/UCS-4, then you'll need some sort of sparse array to count the frequencies.
The hard part of this deals with counting code points vs. characters. For example, consider the character "À" -- the "Latin capital letter A with grave". There are at least two different ways to produce this character. You can use codepoint U+00C0, which encodes the whole thing in a single code point, or you can use codepoint U+0041 (Latin capital letter A) followed by codepoint U+0300 (Combining grave accent).
Normalizing (with respect to Unicode) means turning all such characters into the same form. You can either combine them all into a single code point, or separate them all into separate code points. For your purposes, it's probably easier to combine them into a single code point whenever possible. Writing this on your own probably isn't very practical; I'd use the normalizer API from the ICU project.
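Leaving normalization aside, the frequency-counting step could look roughly like the sketch below. It assumes well-formed UTF-8, decodes each sequence to a char32_t, and uses std::map as the sparse array. The name countCodePoints is hypothetical, and combining sequences are counted as separate code points.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: counts occurrences of each code point in a
// well-formed UTF-8 string. No validation or normalization is done.
std::map<char32_t, std::size_t> countCodePoints(const std::string& s) {
    std::map<char32_t, std::size_t> freq;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;  // 1-byte sequence: the byte is the code point
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        ++freq[cp];
        i += len;
    }
    return freq;
}
```

Running the text through a normalizer first, as suggested above, would make the precomposed and decomposed forms of "À" land in the same bucket.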

If you know the UTF-8 sequence is well formed, it's quite easy. Count each byte that starts with a zero bit or with two one bits. The first condition will catch every code point that is represented by a single byte; the second will catch the first byte of each multi-byte sequence.
while (*p != 0)
{
    if ((*p & 0x80) == 0 || (*p & 0xc0) == 0xc0)
        ++count;
    ++p;
}
Or alternatively, as remarked in the comments, you can simply skip every byte that's a continuation:
while (*p != 0)
{
    if ((*p & 0xc0) != 0x80)
        ++count;
    ++p;
}
Or if you want to be super clever and make it a 2-liner:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
The Wikipedia page for UTF-8 clearly shows the patterns.

A discussion with a full routine written in C++ is at http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

I know it's late for this thread, but this could help.
With ICU, I did it like this:
string theString = "blabla";
UnicodeString uStr = UnicodeString::fromUTF8(theString.c_str());
cout << "length = " << uStr.length() << endl; // length() counts UTF-16 code units; countChar32() counts code points

I wouldn't consider this a language-centric question. The UTF-8 format is fairly simple; decoding it from a file should be only a few lines of code in any language.
open file
until eof
    if (file.readchar & 0xC0) != 0x80
        increment count
close file
