Code not identifying Swedish characters despite #include<clocale> in c++

I have to write a C++ program that identifies and counts English and non-English characters in a string.
The user types some input, and the program must count the letters and report any non-English letters it finds.
My problem is that I get a question mark instead of the non-English letter!
At the beginning of the code I wrote:
...
#include <clocale>
int main() {
std::setlocale(LC_ALL, "sv_SE.UTF-8");
...
(the locale is Swedish)
If I try to print out Swedish letters before the counting loops (as a test), it does work, so I guess the clocale is working fine.
But when I launch the counting loop below,
for (unsigned char c: rad) {
if (c < 128) {
if (isalpha(c) != 0)
bokstaver++;
}
if (c >= 134 && c <= 165) {
cout << "Your text contains a " << c << '\n';
bokstaver++;
}
}
my non-English letter is taken into account but not printed out with cout.
I used unsigned char since non-English letters are between ASCII 134 and 165, so I really don't know what to do.
test with the word blå:

non-English letters are between ASCII 134 and 165
No, they aren't. ASCII only defines the values 0 to 127, and in UTF-8 a non-ASCII character is not a single value above that range. It consists of two or more code units, each with a value of 0x80 or above (taken alone, such a code unit may look like a character from some 8-bit code page, but it is not a character in UTF-8). å, for example, consists of 0xC3 followed by 0xA5.
The C and C++ library functions which only accept a single char (such as std::isalpha) are not useful when using UTF-8 because that single char can only represent a single code unit.
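As a rough sketch of what the counting loop could look like when the input really is UTF-8: the lead byte of each sequence says how many bytes belong to the character, so the whole sequence can be kept together and printed as one unit. The names LetterScan, sequence_length and scan_letters are made up for this illustration, and the code does not validate the input beyond what it needs to keep advancing.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Result of scanning a UTF-8 string (hypothetical helper type).
struct LetterScan {
    std::size_t ascii_letters = 0;       // a-z and A-Z
    std::vector<std::string> non_ascii;  // each multi-byte character, kept whole
};

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and also for stray bytes, so a scan always advances.
std::size_t sequence_length(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

LetterScan scan_letters(const std::string& text) {
    LetterScan result;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char byte = static_cast<unsigned char>(text[i]);
        const std::size_t len = sequence_length(byte);
        assert(len >= 1);  // invariant: the scan always moves forward
        if (byte < 0x80) {
            if (std::isalpha(byte) != 0) ++result.ascii_letters;
        } else if (len > 1 && i + len <= text.size()) {
            // Keep all bytes of the character together so it prints correctly.
            result.non_ascii.push_back(text.substr(i, len));
        }
        i += len;
    }
    return result;
}
```

For the test word blå this reports two ASCII letters and one non-ASCII character whose two bytes (0xC3 0xA5) can be written to cout as a single string.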

Related

Detect german dialect in a string c++

I need to parse each character in a string for example:
"foo123 aaa [ ßü+öä"
And leave only a-z A-Z 0-9 whitespaces and german dialect characters.
In this example the result would be:
"foo123 aaa ßüöä"
The string should be parsed one character at a time since I need to know if any of the characters is from german dialect.
#include <string>
#include <iostream>
#include <algorithm>
int main()
{
std::string s = "Abc123ü + ßöÄ;";
s.erase(std::remove_if(s.begin(),s.end(),
[&](unsigned char c) {
if ((c >= 'a' and c<='z') or (c>='0' and c<='9') or (c>='A' and c<='Z')) return false;
if (c==0xDF) {
return false;
}
return true;
}), s.end());
std::cout<<"Stripped string: " << s;
return 0;
}
The output here is Abc123 and I expect Abc123ß

Original ASCII is seven-bit (0-127). The 8-bit extensions of ASCII use code pages to cover other languages in the range 128-255.
The usual code page for German is defined in ISO/IEC 8859-1 (often also called Latin1, amongst other names). Often ISO/IEC 8859-15 is actually used, because it contains the € sign, which is missing in ISO/IEC 8859-1. For the umlauts (äöüÄÖÜ) and ß, which are what you need, the codes are the same in both.
// Latin1 Ä Ö Ü ä ö ü ß
vector<int> ascii_vals {196, 214, 220, 228, 246, 252, 223};
Unfortunately, Windows likes to use code page 437 or 850, where the codes would be different again.
However, you really should use UTF-8 in modern applications. I guess you have to connect to a legacy system if you have to use an 8-bit code page. It is better to use UTF-8 internally and convert the strings at the boundary. If you send the 8-bit string to a different system that expects UTF-8, your nicely filtered German umlauts might get converted incorrectly again.
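If the source file and the string are UTF-8, the filter from the question has to look at two bytes for each umlaut rather than one. The following is only a sketch under that assumption; is_ascii_keeper, is_german_trail and keep_german are hypothetical names, and the byte values are the UTF-8 encodings of Ä Ö Ü ä ö ü ß.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for the ASCII characters the question wants to keep.
bool is_ascii_keeper(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == ' ';
}

// True if `second` completes a German umlaut or ß after the lead byte 0xC3.
// UTF-8: Ä C3 84, Ö C3 96, Ü C3 9C, ß C3 9F, ä C3 A4, ö C3 B6, ü C3 BC.
bool is_german_trail(unsigned char second) {
    switch (second) {
        case 0x84: case 0x96: case 0x9C: case 0x9F:
        case 0xA4: case 0xB6: case 0xBC:
            return true;
        default:
            return false;
    }
}

// Keeps a-z, A-Z, 0-9, spaces and äöüÄÖÜß; drops everything else.
std::string keep_german(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (is_ascii_keeper(c)) {
            out += utf8[i];
        } else if (c == 0xC3 && i + 1 < utf8.size() &&
                   is_german_trail(static_cast<unsigned char>(utf8[i + 1]))) {
            out.append(utf8, i, 2);  // copy both bytes of the character
            ++i;                     // skip the trail byte
        }
        // Any other byte, including parts of other multi-byte characters, is dropped.
    }
    assert(out.size() <= utf8.size());  // filtering never grows the string
    return out;
}
```

Because std::remove_if sees one char at a time, it cannot express the two-byte check, which is why this version builds a new string instead.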

How to print the word but with the first letter as a capital

Hello, I am trying to solve the Word Capitalization problem on Codeforces using the ASCII table (my professor said I can't use cmath or anything other than loops, arrays, the ASCII table and the basics). My program prints the number of the capital letter instead of the letter itself: let's say the ASCII code for small a is 96 and for capital A is 100, then it prints 100, not A. Here's my code.
#include <iostream>
using namespace std;
int main()
{
string s;
cin>>s;
char x=s[0];
cout<<x+22<<s;
}

Because 22 is an int, the type of x+22 is also int. You need to convert it back into a char for iostream to interpret it as a character:
cout << char(x+22) << s;
Note that I only addressed what you asked about: A number being printed instead of a character. There might be other small errors in there for you to find.
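For reference, here is one way the whole task could look, as a sketch that assumes an ASCII-compatible character set. In ASCII 'a' is 97 and 'A' is 65, so the distance between the two cases is 32, and writing it as 'a' - 'A' avoids hard-coding the number. The function name capitalize_first is made up for this example.

```cpp
#include <cassert>
#include <string>

// Returns `word` with its first letter capitalised.
// Assumes an ASCII-compatible encoding, where 'a' - 'A' == 32.
std::string capitalize_first(std::string word) {
    assert('a' - 'A' == 32);  // the encoding assumption this sketch relies on
    if (!word.empty() && word[0] >= 'a' && word[0] <= 'z') {
        // The subtraction is done in int, so convert back to char explicitly.
        word[0] = static_cast<char>(word[0] - ('a' - 'A'));
    }
    return word;
}
```

Changing the first character in place also avoids printing the original lower-case letter a second time, which is what happens when the converted character is printed followed by the whole of s.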

incorrect character recognition in string array C++

I have to write a program that asks the user to input characters, stores them in a string array, then reads the last character of the string and determines whether it's an integer, a letter, or neither of those.
#include<iostream>
using namespace std;
int main()
{
string str;
cout << "Enter a line of string: ";
getline(cin, str);
char i = str.length()-1;
cout<<str[i];
cout<<endl;
if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z')) cout << "it's a letter";
else if (str[i] >=0) cout << "it's a number";
else cout << "it is something else";
return 0;
}
I came up with this code. It recognizes both letters and integers, but I have 2 problems with it:
1) If the last character of the string is a symbol, for example a '*', the program says it's an integer, although it's not.
How could I correct this?
2) Also, if I add a condition to the following statement in order to recognize the integers from 0 to 9, the code fails:
else if ((str[i] >=0) && (str[i] <=9)) cout << "it's a number";
It works only if it's stated as (str[i] >=0).
Why is that?

Because the digits in a string are not integer values but encoded characters. Think about what 'a' is: the computer operates only on 1 and 0 values, so it has no "a" symbol in hardware. The letter "a" must therefore be encoded as some bit pattern such as 01100001 (0x61 is 'a' in both ASCII and UTF-8). In the same way the digits are encoded as particular bit patterns: 0x30 to 0x39 in ASCII and UTF-8 for the digits 0 to 9.
So your test has to be if ((str[i] >= '0') && ... (notice that the digit zero is in apostrophes, which tells the compiler to use the character's encoding; it compiles to 0x30, or 00110000 in binary).
You can also write if ((str[i] >= 48) && ... and, with an ASCII-compatible character set, it will compile to identical machine code, because 48, 0x30 and '0' are the same value to the compiler. But for humans reading your source later and trying to understand what you were trying to achieve, '0' is probably the best way to express your intent.
Also notice that the encoding in use (as defined by your compiler and source) may use different values to encode a particular symbol, which sometimes makes string processing quite tricky. In UTF-8, for example, the common English characters are encoded identically to 7-bit ASCII (a simple old encoding that you should start by learning a bit about), and each of them fits into a single byte (8 bits). Accented and other characters, however, are encoded as a series of two to four bytes. Trivial things like "count how many characters are in the string" then turn into a complex task requiring rich tables that define the Unicode characters and their category/function in order to count characters the way a human would (a simple Arabic text consisting of 4 letters takes at least 8 bytes of memory in UTF-8, so a simple strlen will return 8 or more). If you need these things for Unicode text, you are better off using a library than implementing it yourself.
Your code will work well only for simple ASCII-like strings, which is enough to practice things like this. Just remember that everything in the computer is encoded as 1s and 0s (sounds, characters, pixel/vector graphics, everything: at some point, under many layers of abstraction and utilities, the information is encoded as a series of bit values 1 or 0, and nothing else). So when you are trying to process some data, it may be helpful to learn how it is actually encoded in the hardware and how much of that is abstracted away from you by your programming language, libraries and OS API.
You can also check in the debugger's memory view which values are used to encode your string, and how it looks in terms of byte values.
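If no debugger is at hand, the byte values can also be printed from the program itself. This is a small sketch; hex_bytes is a hypothetical helper name, and the result depends on the encoding your source file and terminal actually use.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats every byte of `text` as two upper-case hex digits followed by a space.
std::string hex_bytes(const std::string& text) {
    std::string out;
    for (unsigned char byte : text) {
        char buffer[4];  // two hex digits, a space and the terminating NUL
        const int written = std::snprintf(buffer, sizeof buffer, "%02X ",
                                          static_cast<unsigned>(byte));
        assert(written == 3);
        out += buffer;
    }
    return out;
}
```

Passing "a0" gives 61 30, which shows that the character '0' is stored as 0x30 and not as the integer 0. A string containing å in UTF-8 shows the pair C3 A5 for that one character.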

Your condition in 'else if' doesn't seem to satisfy your requirement.
Try something like:
if ((str[i]>='a' && str[i]<='z') || (str[i]>='A' && str[i]<='Z'))
cout << "it's a letter";
else if (str[i] >= '0' && str[i] <= '9')
cout << "it's a number";
else
cout << "it is something else";
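A self-contained version of the same check might look like the sketch below. It assumes an ASCII-compatible encoding, uses std::size_t for the index instead of char so that long lines cannot overflow it, and treats an empty line as its own case. CharKind and classify_last are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class CharKind { Letter, Digit, Other, Empty };  // hypothetical names

// Classifies the last character of `line`.
CharKind classify_last(const std::string& line) {
    if (line.empty()) return CharKind::Empty;  // nothing to inspect
    const std::size_t last = line.size() - 1;  // size_t, not char
    assert(last < line.size());
    const char c = line[last];
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return CharKind::Letter;
    if (c >= '0' && c <= '9') return CharKind::Digit;  // character codes, not 0 to 9
    return CharKind::Other;
}
```

With this, a trailing '*' falls through both range checks and is reported as something else.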

Including decimal equivalent of a char in a character array

How do I create a character array using the decimal/hexadecimal representation of characters instead of the actual characters?
The reason I ask is that I am writing C code and I need to create a string that includes characters that are not used in the English language. That string would then be parsed and displayed on an LCD screen.
For example, '\0' decodes to 0 and '\n' to 10. Are there any more of these special characters that I can sacrifice to display custom characters? I could send "Temperature is 10\d C" and a degree sign would be printed instead of '\d'. Something like this would be great.

Assuming you have a character code that is a degree sign on your display (with a custom display, I wouldn't necessarily expect it to "live" at the common place in the extended IBM ASCII character set, or that the display supports Unicode character encoding), you can use the escape \nnn or \xhh, where nnn is up to three octal (base 8) digits and hh is the hex code. Note that a hex escape consumes every hex digit that follows it, so write two digits and make sure the next character in the string is not itself a hex digit. Unfortunately, there is no decimal encoding available - Dennis Ritchie and/or Brian Kernighan were probably more used to using octal, as that was quite common at the time when C was first developed.
E.g.
const char *str = "ABC\101\102\103";
cout << str << endl;
should print ABCABC (assuming ASCII encoding)

You can directly write
char myValues[] = {1,10,33,...};

Use \u00b0 to make a degree sign (I simply looked up the unicode code for it)
This requires unicode support in the terminal.

Simple, use std::ostringstream and casting of the characters:
std::string s = "hello world";
std::ostringstream os;
for (auto const& c : s)
os << static_cast<unsigned>(c) << ' ';
std::cout << "\"" << s << "\" in ASCII is \"" << os.str() << "\"\n";
Prints:
"hello world" in ASCII is "104 101 108 108 111 32 119 111 114 108 100 "

After a little more research I found the answer to my own question.
A character sequence introduced by a '\' is called an escape sequence.
You can put the octal equivalent of a character code in your string by using escape sequences from '\000' to '\377'.
The same goes for hex, '\x00' to '\xFF'.
I am printing my custom characters by using '\xC1' to '\xC8', as I only had 8 custom characters.
Everything is done in a single line of code: lcd_putc("Degree \xC1")
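One pitfall with hex escapes is worth illustrating: the escape does not stop after two digits, so a custom character followed directly by a letter from A to F (or a digit) needs the literal to be split. The sketch below uses made-up names (kDegreesC, kDegreesOctal, byte_at) and assumes the display treats 0xC1 as the degree sign, as in the answer above.

```cpp
#include <cassert>
#include <cstddef>

// Custom character 0xC1 followed by the letter C.
// "\xC1C" would be read as the single escape \xC1C, so the literal is split;
// adjacent string literals are joined by the compiler.
const char kDegreesC[] = "10 \xC1" "C";

// The same bytes with an octal escape (0xC1 == 0301 == 193).
// An octal escape takes at most three digits, so no split is needed.
const char kDegreesOctal[] = "10 \301C";

// Returns the byte at `index` as an unsigned value
// (plain char may be signed, so bytes above 0x7F can look negative).
unsigned byte_at(const char* text, std::size_t index) {
    assert(text != nullptr);
    return static_cast<unsigned char>(text[index]);
}
```

Both arrays hold the same six bytes, including the terminating zero, so either form can be passed to a function such as lcd_putc.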

Extending 'isalnum' to recognize UTF-8 umlaut

I wrote a function which extends isalnum to recognize UTF-8 coded umlaut.
Is there maybe a more elegant way to solve this issue?
The code is as follows:
bool isalnumlaut(const char character) {
int cr = (int) (unsigned char) character;
if (isalnum(character)
|| cr == 195 // UTF-8
|| cr == 132 // Ä
|| cr == 164 // ä
|| cr == 150 // Ö
|| cr == 182 // ö
|| cr == 156 // Ü
|| cr == 188 // ü
|| cr == 159 // ß
) {
return true;
} else {
return false;
}
}
EDIT:
I tested my solution now several times, and it seems to do the job for my purpose though. Any strong objections?

Your code doesn't do what you're claiming.
The utf-8 representation of Ä is two bytes - 0xC3,0x84. A lone byte with a value above 0x7F is meaningless in utf-8.
Some general suggestions:
Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.
It doesn't often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs (see here for definitions of those terms).
Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.

I'm not 100% sure but the C++ std::isalnum in <locale> almost certainly recognizes locale specific additional characters: http://www.cplusplus.com/reference/std/locale/isalnum/

It's impossible with the interface you define, since UTF-8 is a multibyte encoding; a single character requires multiple char to represent it. (I've got code for determining whether a UTF-8 character is a member of a specified set of characters in my library, but the character is specified by a pair of iterators, and not a single char.)
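To give an idea of what an iterator-pair interface could look like (this is not the library code mentioned above, only a sketch with invented names), the function below decodes one UTF-8 character into a code point, and a second function tests that code point against the set from the question. Overlong encodings and surrogates are not rejected, which a production decoder such as the one in ICU would handle.

```cpp
#include <cassert>
#include <string>

// Decodes one UTF-8 character from [it, end) and advances `it` past it.
// Returns 0xFFFD (the replacement character) for malformed input.
char32_t next_code_point(std::string::const_iterator& it,
                         std::string::const_iterator end) {
    assert(it != end);  // precondition: at least one byte is available
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // plain ASCII
    int trail_count = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) { trail_count = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trail_count = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trail_count = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (int i = 0; i < trail_count; ++i) {
        if (it == end) return 0xFFFD;  // truncated sequence
        const unsigned char trail = static_cast<unsigned char>(*it);
        if ((trail & 0xC0) != 0x80) return 0xFFFD;  // not of the form 10xxxxxx
        cp = (cp << 6) | (trail & 0x3F);
        ++it;
    }
    return cp;
}

// True for ASCII letters and digits and for the characters ÄÖÜäöüß.
bool is_alnum_or_umlaut(char32_t cp) {
    if ((cp >= U'a' && cp <= U'z') || (cp >= U'A' && cp <= U'Z') ||
        (cp >= U'0' && cp <= U'9')) {
        return true;
    }
    static const std::u32string umlauts =
        U"\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF";
    return umlauts.find(cp) != std::u32string::npos;
}
```

A caller would loop while the iterator has not reached the end, decoding one code point per step and testing it, instead of testing each char on its own.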
