Detect German special characters in a string - C++

I need to parse each character in a string, for example:
"foo123 aaa [ ßü+öä"
and keep only a-z, A-Z, 0-9, whitespace, and the German special characters (umlauts and ß).
In this example the result would be:
"foo123 aaa ßüöä"
The string should be parsed one character at a time, since I also need to know whether any of the characters is a German special character.
#include <string>
#include <iostream>
#include <algorithm>

int main()
{
    std::string s = "Abc123ü + ßöÄ;";
    s.erase(std::remove_if(s.begin(), s.end(),
        [&](unsigned char c) {
            if ((c >= 'a' and c <= 'z') or (c >= '0' and c <= '9') or (c >= 'A' and c <= 'Z')) return false;
            if (c == 0xDF) {
                return false;
            }
            return true;
        }), s.end());
    std::cout << "Stripped string: " << s;
    return 0;
}
The output here is Abc123 and I expect Abc123ß

Original ASCII is seven-bit (0-127); 8-bit extensions use code pages to cover other languages in the range 128-255.
The usual code page for German is defined in ISO/IEC 8859-1 (often called Latin-1, among other names). In practice ISO/IEC 8859-15 is often used instead, because it contains the € sign, which is missing in ISO/IEC 8859-1; but for the umlauts you need (äöüÄÖÜ and ß) the codes are the same.
// Latin1 Ä Ö Ü ä ö ü ß
vector<int> ascii_vals {196, 214, 220, 228, 246, 252, 223};
Unfortunately, Windows likes to use code page 437 or 850, where the codes would be different again.
However, you really should use UTF-8 in modern applications. I guess you only have to deal with 8-bit code pages because you are connecting to a legacy system; in that case it is better to use UTF-8 internally and convert the strings at the boundary. If you send such an 8-bit string to another system that expects UTF-8, your nicely filtered German umlauts might get mangled again.
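If your strings are UTF-8 instead (which is what a modern toolchain normally gives you), each of these German characters is two bytes starting with 0xC3, so the filter has to look at byte pairs rather than single bytes. A rough sketch of that approach, assuming UTF-8 input and with a helper name (is_german_special) of my own choosing:
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

// In UTF-8 the German special characters are two bytes each, all
// starting with the lead byte 0xC3:
//   Ä = C3 84, Ö = C3 96, Ü = C3 9C, ä = C3 A4, ö = C3 B6, ü = C3 BC, ß = C3 9F
bool is_german_special(unsigned char lead, unsigned char next)
{
    if (lead != 0xC3) return false;
    switch (next) {
        case 0x84: case 0x96: case 0x9C:   // Ä Ö Ü
        case 0xA4: case 0xB6: case 0xBC:   // ä ö ü
        case 0x9F:                         // ß
            return true;
        default:
            return false;
    }
}

int main()
{
    std::string s = "foo123 aaa [ ßü+öä";  // assumes source and execution charset are UTF-8
    std::string out;

    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isalnum(c) || std::isspace(c)) {
            out += s[i];                   // plain ASCII letter, digit or whitespace
        } else if (i + 1 < s.size() &&
                   is_german_special(c, static_cast<unsigned char>(s[i + 1]))) {
            out += s[i];                   // keep both bytes of the umlaut/ß
            out += s[i + 1];
            ++i;
        }
        // everything else ('[', '+', ...) is dropped
    }

    std::cout << out << '\n';              // "foo123 aaa  ßüöä"
}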

Related

Code not identifying Swedish characters despite #include <clocale> in C++

I have to write a code in C++ that identifies and counts English and non-English characters in a string.
The user types some input, and the program must count the letters and notify the user when it finds non-English letters.
My problem is that I get a question mark instead of the non-English letter!
At the beginning of the code I wrote:
...
#include <clocale>
int main() {
    std::setlocale(LC_ALL, "sv_SE.UTF-8");
...
(the locale is Swedish)
If I try to print out Swedish letters before the counting loops (as a test), it does work, so I guess the clocale is working fine.
But when I launch the counting loop below,
for (unsigned char c : rad) {
    if (c < 128) {
        if (isalpha(c) != 0)
            bokstaver++;
    }
    if (c >= 134 && c <= 165) {
        cout << "Your text contains a " << c << '\n';
        bokstaver++;
    }
}
my non-English letter is taken into account but not printed out with cout.
I used unsigned char since non-English letters are between ASCII 134 and 165, so I really don't know what to do.
A test with the word blå shows the å being counted but not printed correctly (a question mark appears instead).
"non-English letters are between ASCII 134 and 165"
No, they aren't. Non-English characters do not sit between any ASCII values in UTF-8. A non-ASCII character consists of two or more code units, and none of those individual code units is itself in the ASCII range; å, for example, consists of 0xC3 followed by 0xA5.
The C and C++ library functions which only accept a single char (such as std::isalpha) are not useful when using UTF-8 because that single char can only represent a single code unit.
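One way around that limitation is to inspect the UTF-8 lead bytes yourself. A rough sketch, reusing the question's names (rad, bokstaver) and assuming the input is valid UTF-8 and that every non-ASCII code point should be counted as a letter:
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string rad = "bl\xC3\xA5";            // "blå" spelled out as UTF-8 bytes
    int bokstaver = 0;

    for (std::size_t i = 0; i < rad.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(rad[i]);
        if (c < 0x80) {                        // single-byte (ASCII) code point
            if (std::isalpha(c))
                ++bokstaver;
        } else if ((c & 0xC0) != 0x80) {       // lead byte of a multi-byte code point
            std::size_t len = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : 2;
            std::cout << "Your text contains a "
                      << rad.substr(i, len) << '\n';   // print the whole sequence
            ++bokstaver;
            i += len - 1;                      // skip its continuation bytes
        }
    }

    std::cout << bokstaver << " letters\n";    // 3 for "blå"
}
Printing rad.substr(i, len) sends all of the character's bytes to the terminal at once, which is why it shows up correctly instead of as a question mark.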

How to make std::regex match UTF-8

I would like a pattern like ".c" to match the "." against any single UTF-8 character followed by 'c', using std::regex.
I've tried under Microsoft C++ and g++ and get the same result: the "." only matches a single byte.
here's my test case:
#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main(int argc, char** argv)
{
// make a string with 3 UTF8 characters
const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
string tobesearched((char*)p);
// want to match the UTF8 character before c
string pattern(".c");
regex re(pattern);
std::smatch match;
bool r = std::regex_search(tobesearched, match, re);
if (r)
{
// m.size() will be bytes, and we expect 3
// expect 0xC2, 0x80, 'c'
string m = match[0];
cout << "match length " << m.size() << endl;
// but we only get 2, we get the 0x80 and the 'c'.
// so it's matching on single bytes and not utf8
// code here is just to dump out the byte values.
for (int i = 0; i < m.size(); ++i)
{
int c = m[i] & 0xff;
printf("%02X ", c);
}
printf("\n");
}
else
cout << "not matched\n";
return 0;
}
I wanted the pattern ".c" to match 3 bytes of my tobesearched string: the 2-byte UTF-8 character followed by 'c'.
Some regex flavours support \X which will match a single unicode character, which may consist of a number of bytes depending on the encoding. It is common practice for regex engines to get the bytes of the subject string in an encoding the engine is designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).
Another option is the \uFFFF where FFFF refers to the unicode character at that index in the unicode charset. With that, you could create a ranged match inside a character class i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. There is another variant of \u in \x{...} which does the same thing, except the unicode character index must be supplied inside curly braces, and need not be padded e.g. \x{65}.
Edit: This website is amazing for learning more about regex across various flavours https://www.regular-expressions.info
Edit 2: To match any Unicode-exclusive character, i.e. excluding the 1-byte characters in the ASCII table, you can try "[\x{80}-\x{10FFFF}]", i.e. any code point from the first one outside the ASCII range up to U+10FFFF, the highest code point Unicode currently defines (such characters take up to 4 bytes in UTF-8; the encoding was originally designed to allow up to 6).
A loop through the individual bytes would be more efficient, though (see the sketch after this list):
If the lead bit is 0, i.e. if the byte's signed value is > -1, it is a 1-byte character. Skip to the next byte and start again.
Else if the lead bits are 11110, i.e. if its signed value is > -17, n = 4.
Else if the lead bits are 1110, i.e. if its signed value is > -33, n = 3.
Else if the lead bits are 110, i.e. if its signed value is > -65, n = 2.
Optionally, check that the following n-1 bytes each start with 10, i.e. that each has a signed value < -64; any byte there with a signed value >= -64 means the input is not valid UTF-8.
You now know that these n bytes constitute a single multi-byte (non-ASCII) character. So, if the NEXT byte is 'c', i.e. == 99, you can say it matched and return true.
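A sketch of that loop, written with unsigned comparisons on the byte values (they test the same lead-bit patterns as the signed comparisons above); it reports whether the string contains a multi-byte UTF-8 character immediately followed by 'c':
#include <cstddef>
#include <iostream>
#include <string>

// Returns true if `s` contains a multi-byte UTF-8 character
// immediately followed by the byte 'c'.
bool contains_utf8_then_c(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t n;
        if (b < 0x80)       n = 1;              // lead bit 0: plain ASCII
        else if (b >= 0xF0) n = 4;              // lead bits 11110
        else if (b >= 0xE0) n = 3;              // lead bits 1110
        else if (b >= 0xC0) n = 2;              // lead bits 110
        else                return false;       // stray continuation byte: invalid UTF-8

        if (i + n > s.size()) return false;     // truncated sequence

        // optional validity check: the n-1 continuation bytes must start with 10
        for (std::size_t k = 1; k < n; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;

        if (n > 1 && i + n < s.size() && s[i + n] == 'c')
            return true;                        // multi-byte character followed by 'c'

        i += n;
    }
    return false;
}

int main()
{
    std::string tobesearched = "a\xC2\x80" "c"; // same test data as the question
    std::cout << (contains_utf8_then_c(tobesearched) ? "matched\n" : "not matched\n");
}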

Printing unicode literals in C

I am making an OpenVG application for Raspberry Pi that displays some text, and I need support for foreign characters (Polish in this case). I plan to prepare a function in some higher-level language that maps Unicode characters to C literals, but for now there's a problem with printing those literals in C.
Given the code below:
//both output the "ó" character, as expected
char A[] = "\xF3";
wchar_t B[] = L"\xF3";
//"ś" is expected as output but instead I get character with code 0x5B - "["
char A[] = "\x15B";
wchar_t B[] = L"\x15B";
Most of Polish characters have 3-digit hexadecimal codes. When I attempt to print "ś" (0x15B), it prints character "[" (0x5B) instead. It turns out I cannot print any unicode characters with more than 2-digit codes.
Is the data type I used the cause? I have considered using char16_t and char32_t, but the headers are nowhere to be found on the system.
It is a matter of encoding: in UTF-8, "ś" (U+015B) is the two bytes 0xC5 0x9B, so write
char A[] = {'\xc5', '\x9b', '\0'};   /* trailing '\0' added so it can be printed as a string */
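A minimal, self-contained version of that answer (assuming the terminal is set to UTF-8):
#include <cstdio>

int main()
{
    // "ś" (U+015B) encoded as UTF-8 is the two bytes 0xC5 0x9B;
    // a string literal gives us the terminating '\0' for free.
    const char s_acute[] = "\xC5\x9B";
    std::printf("%s\n", s_acute);   // shows ś on a UTF-8 terminal

    // The code point can also be written as a universal character name;
    // it produces the same two bytes when the execution charset is UTF-8.
    std::printf("\u015B\n");
}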

Including decimal equivalent of a char in a character array

How do I create a character array using the decimal/hexadecimal representation of characters instead of the actual characters?
The reason I ask is that I am writing C code and need to create a string that includes characters not used in the English language. That string would then be parsed and displayed on an LCD screen.
For example, '\0' decodes to 0 and '\n' to 10. Are there any more of these special characters that I can sacrifice to display custom characters? I would like to send "Temperature is 10\d C" and have a degree sign printed instead of '\d'. Something like this would be great.
Assuming your display has a character code for a degree sign at all (with a custom display, I wouldn't necessarily expect it to live at the common place in the extended IBM character set, or that the display supports a Unicode encoding), you can use the escapes \nnn or \xhh, where nnn is up to three octal (base 8) digits and hh is a hexadecimal code (note that a hex escape greedily consumes as many hex digits as follow it). Unfortunately, there is no decimal escape; Dennis Ritchie and/or Brian Kernighan were probably more used to octal, which was quite common at the time C was first developed.
E.g.
const char *str = "ABC\101\102\103";
cout << str << endl;
should print ABCABC (assuming ASCII encoding)
You can directly write
char myValues[] = {1,10,33,...};
Use \u00b0 to produce a degree sign (I simply looked up its Unicode code point).
This requires Unicode (UTF-8) support in the terminal.
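For example, a minimal sketch (assuming the execution character set and the terminal are both UTF-8):
#include <cstdio>

int main()
{
    // \u00b0 is the universal character name for the degree sign;
    // with a UTF-8 execution character set it becomes the bytes 0xC2 0xB0.
    std::printf("Temperature is 10\u00b0C\n");
}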
Simple: use std::ostringstream and cast the characters:
std::string s = "hello world";
std::ostringstream os;
for (auto const& c : s)
os << static_cast<unsigned>(c) << ' ';
std::cout << "\"" << s << "\" in ASCII is \"" << os.str() << "\"\n";
Prints:
"hello world" in ASCII is "104 101 108 108 111 32 119 111 114 108 100 "
A little more research and I found the answer to my own question.
Characters preceded by a '\' form an escape sequence.
You can put the octal equivalent of an ASCII code in your string by using an escape sequence from '\000' to '\377'.
The same goes for hex: '\x00' to '\xFF'.
I am printing my custom characters using '\xC1' to '\xC8', as I only had 8 custom characters.
Everything is done in a single line of code: lcd_putc("Degree \xC1");

Extending 'isalnum' to recognize UTF-8 umlaut

I wrote a function which extends isalnum to recognize UTF-8 coded umlaut.
Is there maybe a more elegant way to solve this issue?
The code is as follows:
bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
        || cr == 195 // UTF-8
        || cr == 132 // Ä
        || cr == 164 // ä
        || cr == 150 // Ö
        || cr == 182 // ö
        || cr == 156 // Ü
        || cr == 188 // ü
        || cr == 159 // ß
    ) {
        return true;
    } else {
        return false;
    }
}
EDIT:
I have now tested my solution several times, and it seems to do the job for my purpose. Any strong objections?
Your code doesn't do what you're claiming.
The UTF-8 representation of Ä is two bytes, 0xC3 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8.
Some general suggestions:
Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.
It doesn't often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).
Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.
I'm not 100% sure but the C++ std::isalnum in <locale> almost certainly recognizes locale specific additional characters: http://www.cplusplus.com/reference/std/locale/isalnum/
It's impossible with the interface you define, since UTF-8 is a multibyte encoding; a single character requires multiple chars to represent it. (I've got code for determining whether a UTF-8 character is a member of a specified set of characters in my library, but the character is specified by a pair of iterators, not a single char.)
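Putting those answers together: each UTF-8 umlaut is the lead byte 0xC3 plus one continuation byte, so the check has to look at two bytes at a time. A rough sketch with a different interface than the original (a string plus an index instead of a single char; the name isalnumlaut_at is my own), assuming only ASCII alphanumerics and the seven German special characters need to be recognised:
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

// Checks whether the (possibly multi-byte) character starting at position i
// of the UTF-8 string s is an ASCII alphanumeric or a German umlaut/ß.
// On return, `len` holds how many bytes that character occupied.
bool isalnumlaut_at(const std::string& s, std::size_t i, std::size_t& len)
{
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) {                       // single-byte ASCII character
        len = 1;
        return std::isalnum(c) != 0;
    }
    if (c == 0xC3 && i + 1 < s.size()) {  // possible umlaut: 0xC3 + second byte
        unsigned char d = static_cast<unsigned char>(s[i + 1]);
        len = 2;
        switch (d) {
            case 0x84: case 0xA4:         // Ä ä
            case 0x96: case 0xB6:         // Ö ö
            case 0x9C: case 0xBC:         // Ü ü
            case 0x9F:                    // ß
                return true;
        }
        return false;
    }
    len = 1;                              // some other byte: not alphanumeric here
    return false;
}

int main()
{
    std::string s = "Gr\xC3\xBC\xC3\x9F" "e!";   // "Grüße!" written as UTF-8 bytes
    std::size_t i = 0, len = 0, count = 0;
    while (i < s.size()) {
        if (isalnumlaut_at(s, i, len))
            ++count;
        i += len;
    }
    std::cout << count << " alphanumeric/umlaut characters\n";   // expect 5
}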