How to detect non-ASCII characters in C++ on Windows?

I'm simply trying to detect non-ASCII characters in my C++ program on Windows.
Using something like isascii() or:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
(isprint(ch) || isspace(ch));
does not work because non-ascii characters are getting mapped to ascii characters before or while getchar() is doing its thing. For example, if I have some code like:
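#include <cstdio>   // getchar, printf
#include <ctype.h>  // isascii (non-standard, provided on Windows and POSIX)
#include <iostream>
using namespace std;
int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;
    cout << c << endl;
    printf("0x%x\n", c);
    cout << (char)c;
    return 0;
}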
and input a 😁 (because I am so happy right now), the output is
1
63
0x3f
?
Furthermore, if I feed the program something outside the extended ASCII range (code page 437), like 'Ĥ', the output is
1
72
0x48
H
The same happens with similar inputs such as Ĭ or ō (they become I and o). So this seems algorithmic and not just mojibake or something. A quick check in Python (via the same terminal) with a program like
i = input()
print(ord(i))
gives me the expected actual code instead of the ASCII-mapped one (so it's not the code page or the terminal (?)). This makes me believe that getchar() or the C++ compilers (tested with the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and cannot reproduce the issue, which makes me inclined to believe that it has something to do with Windows (10 Pro). Can anyone explain what is going on here?

Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.
I think the problem is that getchar() expects input as a char, which is 8 bits and cannot represent characters outside the active single-byte code page. getwchar() uses the wchar_t type, which allows for other text encodings. "😁" isn't ASCII, and from this page: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings , it seems like Windows encodes extended characters like this in UTF-16. The 0x3f you're seeing printed is the ASCII code for '?', which is most likely a substitute for a character that could not be converted to the narrow character set.

Okay, I have solved this. I was not aware of translation modes.
_setmode(_fileno(stdin), _O_WTEXT);
was the solution. The link below essentially explains that there are translation modes and I think phase 5 (character-set mapping) explains what happened.
https://en.cppreference.com/w/cpp/language/translation_phases
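For reference, here is a minimal sketch of how the _setmode call above might be combined with getwchar(), as the other answer suggests. The helper names enable_wide_stdin and is_ascii_code_unit are made up for illustration, and the #ifdef guard is an assumption so that the same file still compiles on non-Windows systems, where no mode switch is needed. Note that on Windows a wchar_t is a UTF-16 code unit, so a character like 😁 arrives as two values (a surrogate pair), and both fail the ASCII test.

```cpp
#include <cstdio>   // stdin
#include <cwchar>   // std::getwchar, std::wint_t, WEOF
#ifdef _WIN32
#include <fcntl.h>  // _O_WTEXT
#include <io.h>     // _setmode
#endif

// Hypothetical helper: put stdin into wide-text mode. Only needed on
// Windows; calling it once at program start is enough.
void enable_wide_stdin() {
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_WTEXT);
#endif
}

// Hypothetical helper: true only for values in the 7-bit ASCII range.
bool is_ascii_code_unit(std::wint_t c) {
    return c != WEOF && static_cast<unsigned long>(c) <= 0x7Ful;
}
```

A caller would run enable_wide_stdin() first and then pass each result of std::getwchar() to is_ascii_code_unit() until WEOF is returned.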

Related

Printing unicode Characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print it, I get something like this:
"ôöæËÈ"
interface.txt
unsigned char* tabuleiroImportado() {
    std::ifstream TABULEIRO;
    TABULEIRO.open("tabuleiro.txt");
    // static, so the returned pointer stays valid after the function returns
    static unsigned char tabu[36][256];
    for (unsigned char i = 0; i < 36; i++) {
        TABULEIRO >> tabu[i];
        std::cout << tabu[i] << std::endl;
    }
    return *tabu;
}
I'm using this function to import the interface.
Just like every other kind of data that lives in your computer, text must be represented by a sequence of bytes. Each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much more than 256.
A single byte by itself cannot, therefore, express all characters. The simplest way of handling this is to pick just 256 (or fewer) characters at a time, assign each of them one of the possible byte values, and call the result your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating systems and modern terminals employ multi-byte character sequences, where a single character can be represented by a specific sequence of more than one byte. It's fairly likely that your terminal screen is based on a multi-byte Unicode encoding.
In summary: you need to figure out two things, and then do a third:
1. Which character set your file uses
2. Which character set your terminal display uses
3. Then write the code to properly translate one to the other
It goes without saying that no one else could possibly tell you which character set your file uses, and which character set your terminal display uses. That's something you'll need to figure out. And without knowing both, you can't do step 3.
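To make the single-byte versus multi-byte point concrete: '╣' is code point U+2563. In code page 437 it is the single byte 0xB9, while in UTF-8 it takes three bytes, E2 95 A3. The sketch below shows how that UTF-8 sequence is derived. The function name encode_utf8_bmp is hypothetical, and it deliberately handles only code points up to U+FFFF (enough for the box-drawing characters), without validating surrogates.

```cpp
#include <string>

// Hypothetical helper: encode one code point in the range U+0000..U+FFFF
// as UTF-8. Larger code points are out of scope for this sketch.
std::string encode_utf8_bmp(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

If the file holds the byte 0xB9 but the terminal expects E2 95 A3 (or the other way around), the wrong glyph appears, which is why both character sets need to be known before translating.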
To print the Unicode characters, you can put the Unicode value with the prefix \u.
If the console does not support Unicode, then you cannot get the correct result.
Example:
#include <iostream>
int main() {
std::cout << "Character: \u2563" << std::endl;
std::cout << "Character: \u2551" << std::endl;
std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
The answer is to use an unsigned char and assign it the character's numeric code directly, the same way you would initialize a char, e.g. unsigned char a = <character code>.
I ran into the same kind of garbled output when I was making a game engine for cmd, and this worked for me in C++17 with GNU GCC. You can use any name in place of a.

C++ Visual Studio Unicode confusion

I've been looking at the Unicode chart, and know that the first 128 code points (0 to 127) are equivalent for almost all encoding schemes: ASCII (probably the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32 and anything else.
I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a couple more characters such as {, |, and }. After that it gets into no-man's land which is basically around 20 "control characters", and then the characters begin again at 161 with an inverted exclamation mark, 162 which is the cent sign with a stroke through it, and so on.
The problem is, my results don't correspond to the Unicode chart, UTF-8, or UCS-2 chart; the symbols seem random. By the way, the reason I made the "character" variable a four-byte int was that when I was using "char" (which is essentially a one-byte signed data type), after 127 it cycled back to -128, and I thought this might be messing it up.
I know I'm doing something wrong, can anyone figure out what's happening? This happens whether I set the character set to Unicode or Multibyte characters in the project settings. Here is the code you can run.
#include <iostream>
using namespace std;
int main()
{
unsigned int character = 122; // Starting at "z"
for (int i = 0; i < 100; i++)
{
cout << (char)character << endl;
cout << "decimal code point = " << (int)character << endl;
cout << "size of character = " << sizeof(character) << endl;
character++;
system("pause");
cout << endl;
}
return 0;
}
By the way, here is the Unicode chart
http://unicode-table.com/en/#control-character
Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as OEM), which may be different than the local single- or double-byte character set used by Windows applications (called ANSI).
For instance, on my English language Windows install ANSI means windows-1252, while a console by default uses code page 850.
There are a few ways to write arbitrary Unicode characters to the console; see "How to Output Unicode Strings on the Windows Console".
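As one possible illustration (not necessarily the approach the linked article takes), a program can ask the console to interpret its output bytes as UTF-8 by calling the Win32 function SetConsoleOutputCP. The wrapper name use_utf8_console_output is hypothetical, and the sketch assumes that the program then writes UTF-8 encoded bytes and that the console font contains the needed glyphs.

```cpp
#ifdef _WIN32
#include <windows.h>  // SetConsoleOutputCP, CP_UTF8
#endif

// Hypothetical helper: switch the Windows console output code page to UTF-8.
// Returns true on success. On other platforms it does nothing and returns
// true, on the assumption that the terminal already uses UTF-8.
bool use_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

With the output code page changed this way, a byte value above 127 is no longer looked up in the OEM code page; it is treated as part of a UTF-8 sequence instead.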

How can I read accented characters in C++ and use them with isalnum?

I am programming in French and, because of that, I need to use accented characters. I can output them by using
#include <locale> and setlocale(LC_ALL, ""), but there seems to be a problem when I read accented characters. Here is simple example I made to show the problem :
#include <locale>
#include <iostream>
using namespace std;
const string SymbolsAllowed = "+-*/%";
int main()
{
setlocale(LC_ALL, ""); // makes accents printable
// Translation: Please write a string with accented characters
// 'é' is shown correctly :
cout << "Veuillez écrire du texte accentué : ";
string accentedString;
getline(cin, accentedString);
// Accented char are not shown correctly :
cout << "Accented string written : " << accentedString << endl;
for (unsigned int i = 0; i < accentedString.length(); ++i)
{
char currentChar = accentedString.at(i);
// The program crashes while testing if currentChar is alphanumeric.
// (error image below) :
if (!isalnum(currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar))
{
cout << endl << "Character not allowed : " << currentChar << endl;
system("pause");
return 1;
}
}
cout << endl << "No unauthorized characters were written." << endl;
system("pause");
return 0;
}
Here is an output example before the program crashes :
Veuillez écrire du texte accentué : éèàìù
Accented string written : ʾS.?—
I noticed the debugger from Visual Studio shows that I have written something different than what it outputs :
[0] -126 '‚' char
[1] -118 'Š' char
[2] -123 '…' char
[3] -115 '' char
[4] -105 '—' char
The error seems to say that only values between -1 and 255 can be used but, according to the ASCII table, the values of the accented characters I used in the example above do not exceed this limit.
Here is a picture of the error dialog that pops up : Error message: Expression: c >= -1 && c <= 255
Can someone please tell me what I am doing wrong or give me a solution for this? Thank you in advance. :)
char is a signed type on your system (indeed, on many systems) so its range of values is -128 to 127. Characters whose codes are between 128 and 255 look like negative numbers if they are stored in a char, and that is actually what your debugger is telling you:
[0] -126 '‚' char
That's -126, not 126. In other words, 130 or 0x82.
isalnum and friends take an int as an argument, which (as the error message indicates) is constrained to the value EOF (-1 on your system) and the range 0-255. -126 is not in this range. Hence the error. You could cast to unsigned char, or (probably better, if it works on Windows) use the two-argument std::isalnum in <locale>.
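The conversion described above can be checked with a small sketch. Both function names are hypothetical; the point is only that converting to unsigned char first turns -126 into 130, which is inside the range the <cctype> functions accept.

```cpp
#include <cctype>

// Shows which integer a (possibly negative) char becomes once it is
// converted to unsigned char.
int as_unsigned_value(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical wrapper: the value handed to std::isalnum is always 0..255,
// so the debug assertion in the C runtime cannot fire.
bool is_alnum_safe(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}
```

Whether is_alnum_safe reports an accented letter as alphanumeric still depends on the active C locale; the cast only removes the out-of-range argument.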
For reasons which totally escape me, Windows seems to be providing console input in CP-437 but processing output in CP-1252. The high half of those two code pages is completely different. So when you type é, it gets sent to your program as 130 (0x82) from CP-437, but when you send that same character back to the console, it gets printed according to CP-1252 as a low single quotation mark ‚ (which looks a lot like a comma, but isn't). So that's not going to work. You need to get input and output to be on the same code page.
I don't know a lot about Windows, but you can probably find some useful information in the MS docs. That page includes links to Windows-specific functions which set the input and output code pages.
Intriguingly, the accented characters in the source code of your program appear to be CP-1252, since they print correctly. If you decide to move away from code page 1252 -- for example, by adopting Unicode -- you'll have to fix your source code as well.
With the is* and to* functions, you really need to cast the input to unsigned char before passing it to the function:
if (!isalnum((unsigned char)currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar)) {
While you're at it, I'd advise against using strchr as well, and switch to something like this:
std::string SymbolsAllowed = "+-*/%";
if (... && SymbolsAllowed.find(currentChar) == std::string::npos)
While you're at it, you should probably forget that you ever even heard of the exit function. You should never use it in C++. In the case here (exiting from main) you should just return. Otherwise, throw an exception (and if you want to exit the program, catch the exception in main and return from there).
If I were writing this, I'd do the job somewhat differently in general though. std::string already has a function to do most of what your loop is trying to accomplish, so I'd set up symbolsAllowed to include all the symbols you want to allow, then just do a search for anything it doesn't contain:
// Start from the operator symbols, then add every alphanumeric or space character:
std::string symbolsAllowed = "+-*/%";
for (int a = 0; a <= std::numeric_limits<unsigned char>::max(); a++)
    if (isalnum(a) || isspace(a)) // you probably want to allow spaces?
        symbolsAllowed += static_cast<char>(a);
// ...
auto pos = accentedString.find_first_not_of(symbolsAllowed);
if (pos != std::string::npos) {
    std::cout << "Character not allowed: " << accentedString[pos];
    return 1;
}

Decimal values of Extended ASCII characters

I wrote a function to test if a string consists only of letters, and it works well:
bool is_all_letters(const char* src) {
while (*src) {
// A-Z, a-z
if ((*src>64 && *src<91) || (*src>96 && *src<123)) {
++src;
}
else {
return false;
}
}
return true;
}
My next step was to include “Extended ASCII Codes”, I thought it was going to be really easy but that’s where I ran into trouble. For example:
std::cout << (unsigned int)'A'; // 65 <-- decimal ascii value
std::cout << (unsigned int)'ñ'; // 4294967281 <-- what?
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
My goal is to restrict user input to only letters in ISO 8859-1 (latin 1). I’ve only worked with single byte characters and would like to avoid multi-byte characters if possible.
I am guessing that I can compare the unsigned int values above, i.e.: 4294967281, but it does not feel right to me and besides, I don’t know if that large integer is VC 8.0 representation of 'ñ' and changes from compiler to compiler.
Please advise
UPDATE - Per some suggestions made by Christophe, I ran the following code:
locale loc("spanish") ;
cout<<loc.name() << endl; // Spanish_Spain.1252
for (int i = 0; i < 255; i++) {
cout << i << " " << isalpha(i, loc)<< " " << (isprint(i,loc) ? (char)(i):'?') << endl;
}
It does return Spanish_Spain.1252 but unfortunately, the loop iterations print the same data as the default C locale (using VC++ 8 / VS 2005).
Christophe shows different (desired) results as you can see in his screen shots below, but he uses a much newer version of VC++.
The code chart you found on the internet is actually Windows OEM code page 437, which was never endorsed as a standard. Although it is sometimes called "extended ASCII", that description is highly misleading. (See the Wikipedia article Extended ASCII: "The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.")
You can find the history of OEM437 on Wikipedia, in various versions.
What was endorsed as a standard 8-bit encoding is ISO-8859-1, which later became the first 256 code points in Unicode. (It's one of a series of 8-bit encodings designed for use in different parts of the world; ISO-8859-1 is intended for the Americas and Western Europe.) So that's what you will find in most computers produced in this century in those regions, although more recently more and more operating systems are converting to full Unicode support.
The value you see for (unsigned int)'ñ' is the result of casting the ISO-8859-1 code 0xF1 from a (signed) char (that is, -15) to an unsigned int. Had you cast it to an int, you would have seen -15.
"I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com."
Asciitable.com appears to give the code for the old IBM437 DOS character set (still used in the Windows command prompt), in which ñ is indeed 164. But that's just one of hundreds of “extended ASCII” variants.
The value 4294967281 = 0xFFFFFFF1 you got is a sign-extension of the (signed) char value 0xF1, which is how ñ is encoded in ISO-8859-1 and close variants like Windows-1252.
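The sign extension can be reproduced with a short sketch. The function names are hypothetical, and signed char is used explicitly so the result does not depend on whether plain char is signed on a given compiler.

```cpp
#include <cstdint>

// Widening a negative signed char straight to a 32-bit unsigned integer
// sign-extends it: -15 (0xF1) becomes 0xFFFFFFF1.
std::uint32_t widen_signed(signed char c) {
    return static_cast<std::uint32_t>(c);
}

// Going through unsigned char first keeps the value in 0..255:
// -15 (0xF1) becomes 241.
std::uint32_t widen_via_unsigned_char(signed char c) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(c));
}
```

So comparing against 4294967281 is unnecessary: casting to unsigned char before widening yields the code 241 (0xF1) that the ISO-8859-1 chart lists for ñ.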
To start with, you're trying to reinvent std::isalpha. But, IIRC, you'll need to pass it the ISO-8859-1 locale; by default it just checks ASCII.
The behavior you see is because char is signed (because you didn't compile with /J, which is the smart thing to do when you use more than just ASCII - VC++ defaults to signed char).
There is already plenty of information here. However, I'd like to propose some ideas to address your initial problem, namely the categorisation of characters in an extended character set.
For this, I suggest the use of <locale> (country specific topics), and especially the new locale-aware form of isalpha(), isspace(), isprint(), ... .
Here is a little piece of code to help you find out which chars could be a letter in your local alphabet:
std::locale::global(std::locale("")); // sets the environment default locale currently in place
std::cout << std::locale().name() << std::endl; // display name of current locale
std::locale loc; // use a copy of the active global locale (you could use another)
for (int i = 0; i < 256; i++) {
    char c = static_cast<char>(i); // the locale-aware overloads expect a character type
    std::cout << i << " " << std::isalpha(c, loc) << " " << (std::isprint(c, loc) ? c : '?') << std::endl;
}
This will print out the ascii code from 0 to 255, followed by an indicator if it is a letter according to the local settings, and the character itself if it's printable.
For example, on my PC, I get:
All the accented chars, as well as ñ and Greek letters, are considered alpha, whereas £ and mathematical symbols are considered non-alpha printable.
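As a locale-independent alternative for the original goal (accepting only ISO-8859-1 letters), the letter ranges can be hard-coded. This is only a sketch: the helper names are hypothetical, and it assumes that "letter" means A-Z, a-z and the bytes 0xC0-0xFF except the multiplication sign (0xD7) and division sign (0xF7). Whether characters such as ª, µ or º should also count is left out as a judgment call.

```cpp
// Hypothetical helper: true if the byte is a letter in ISO-8859-1,
// under the assumptions stated above.
bool is_latin1_letter(unsigned char c) {
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return true;
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

// Same contract as the original is_all_letters, extended to Latin-1.
bool is_all_latin1_letters(const char* src) {
    for (; *src; ++src) {
        // Convert to unsigned char so bytes above 127 are not negative.
        if (!is_latin1_letter(static_cast<unsigned char>(*src))) return false;
    }
    return true;
}
```

This only gives meaningful results if the input really is ISO-8859-1 (or Windows-1252); console input arriving in code page 437 would have to be converted first.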

Matching russian vowels in C++

I wanted to write a function which returns true if a given character is a russian vowel. But the results I get are strange to me. This is what I've got so far:
#include <iostream>
using namespace std;
bool is_vowel_p(char working_char)
// returns true if the character is a russian vowel
{
string matcher = "аяёоэеуюыи";
if (find(matcher.begin(), matcher.end(), working_char) != matcher.end())
return true;
else
return false;
}
void main()
{
cout << is_vowel_p('е') << endl; // russian vowel
cout << is_vowel_p('Ж') << endl; // russian consonant
cout << is_vowel_p('D') << endl; // latin letter
}
The result is:
1
1
0
which is strange to me. I expected the following result:
1
0
0
It seems that there is some kind of internal mechanism which I don't know about yet. First, I'm interested in how to fix this function so it works properly. Second, I'd like to know what is going on that produces this result.
string and char are only guaranteed to represent characters in the basic character set - which does not include the Cyrillic alphabet.
Using wstring and wchar_t, and adding L before the string and character literals to indicate that they use wide characters, should allow you to work with those letters.
Also, for portability you need to include <algorithm> for find, and give main a return type of int.
C++ source code is ASCII. You are entering unicode characters. The comparison is being done using 8 bit values. I bet one of the vowels fulfills the following:-
(vowel & 255) == ((code point for 'Ж') & 255)
You need to use unicode functions to do this, not ASCII functions, i.e. use functions that require wchar_t values. Also, make sure your compiler can parse the non-ASCII vowel string. Using MS VC, the compiler requires:-
L"аяёоэеуюыи" or TEXT("аяёоэеуюыи")
the latter is a macro that adds the L when compiling with unicode support.
Convert the code to use wchar_t and it should work.
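A minimal sketch of that conversion follows. It keeps the name is_vowel_p from the question but takes a wchar_t; the vowel list is written with \u escapes (an assumption made here so the result does not depend on how the source file is saved), and it covers only the ten lowercase vowels а, я, ё, о, э, е, у, ю, ы, и.

```cpp
#include <string>

// Returns true if the wide character is a lowercase Russian vowel.
// The escapes spell out: а я ё о э е у ю ы и
bool is_vowel_p(wchar_t working_char) {
    static const std::wstring matcher =
        L"\u0430\u044f\u0451\u043e\u044d\u0435\u0443\u044e\u044b\u0438";
    return matcher.find(working_char) != std::wstring::npos;
}
```

Each Cyrillic letter now occupies exactly one wchar_t, so 'Ж' (U+0416) can no longer collide with the low byte of a vowel.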
Very useful function in locale.h
setlocale(LC_ALL, "Russian");
Paste this at the beginning of the program.
Example:
#include <stdio.h>
#include <locale.h>
int main()
{
setlocale(LC_ALL, "Russian");
printf("Здравствуй, мир!\n");//Hello, world!
}
Make sure your system default locale is Russian, and make sure your file is saved as codepage 1251 (Cyrillic/Windows). If it's saved as Unicode, this won't ever work.
The system default locale is the one used by non-Unicode-compliant programs. It's in Control Panel, under Regional settings.
Alternatively, rewrite it to use wstring, wchar_t and L"" string/char literals.