Matching Russian vowels in C++

I wanted to write a function which returns true if a given character is a Russian vowel, but the results I get are strange to me. This is what I've got so far:
#include <iostream>
using namespace std;

bool is_vowel_p(char working_char)
// returns true if the character is a russian vowel
{
    string matcher = "аяёоэеуюыи";
    if (find(matcher.begin(), matcher.end(), working_char) != matcher.end())
        return true;
    else
        return false;
}

void main()
{
    cout << is_vowel_p('е') << endl; // russian vowel
    cout << is_vowel_p('Ж') << endl; // russian consonant
    cout << is_vowel_p('D') << endl; // latin letter
}
The result is:
1
1
0
which is strange to me. I expected the following result:
1
0
0
It seems that there is some kind of internal mechanism at work which I don't know about yet. First, I am interested in how to fix this function so it works properly. And second, what is going on there that produces this result?

string and char are only guaranteed to represent characters in the basic character set - which does not include the Cyrillic alphabet.
Using wstring and wchar_t, and adding L before the string and character literals to indicate that they use wide characters, should allow you to work with those letters.
Also, for portability you need to include <algorithm> for find, and give main a return type of int.
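For illustration, here is roughly what the function looks like after those changes. This is a sketch, not a tested fix for the asker's exact setup; whether the wide literals come out right still depends on the compiler knowing the source file's encoding.
#include <algorithm>
#include <iostream>
#include <string>

// returns true if the character is a Russian vowel
bool is_vowel_p(wchar_t working_char)
{
    std::wstring matcher = L"аяёоэеуюыи";
    return std::find(matcher.begin(), matcher.end(), working_char) != matcher.end();
}

int main()
{
    std::wcout << is_vowel_p(L'е') << std::endl; // Russian vowel -> 1
    std::wcout << is_vowel_p(L'Ж') << std::endl; // Russian consonant -> 0
    std::wcout << is_vowel_p(L'D') << std::endl; // Latin letter -> 0
    return 0;
}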

C++ source code is ASCII. You are entering unicode characters. The comparison is being done using 8 bit values. I bet one of the vowels fulfills the following:-
vowel & 255 == (code point for 'Ж') & 255
You need to use unicode functions to do this, not ASCII functions, i.e. use functions that require wchar_t values. Also, make sure your compiler can parse the non-ASCII vowel string. Using MS VC, the compiler requires:-
L"аяё×эеуюыи" or TEXT("аяё×эеуюыи")
the latter is a macro that adds the L when compiling with unicode support.
Convert the code to use wchar_t and it should work.
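A quick way to make the 8-bit problem visible (assuming the source file is saved as UTF-8, one common setup that produces this kind of mismatch) is to count how many chars a Cyrillic string actually occupies:
#include <iostream>
#include <string>

int main()
{
    std::string narrow = "аяёоэеуюыи";   // narrow string: raw bytes
    std::wstring wide  = L"аяёоэеуюыи";  // wide string: one unit per letter
    std::cout << narrow.size() << '\n';  // 20 under UTF-8: two bytes per vowel
    std::cout << wide.size()   << '\n';  // 10: one wchar_t per vowel
    return 0;
}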

There is a very useful function in locale.h:
setlocale(LC_ALL, "Russian");
Paste this at the beginning of the program.
Example:
#include <stdio.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "Russian");
    printf("Здравствуй, мир!\n"); // Hello, world!
    return 0;
}

Make sure your system default locale is Russian, and make sure your file is saved as codepage 1251 (Cyrillic/Windows). If it's saved as Unicode, this won't ever work.
The system default locale is the one used by non-Unicode-compliant programs. It's in Control Panel, under Regional settings.
Alternatively, rewrite the code to use wstring and wchar_t and L"" string/char literals.

Related

Printing Unicode characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print them, I get something like this:
"ôöæËÈ"
interface.txt
unsigned char* tabuleiroImportado() {
    std::ifstream TABULEIRO;
    TABULEIRO.open("tabuleiro.txt");
    unsigned char tabu[36][256];
    for (unsigned char i = 0; i < 36; i++) {
        TABULEIRO >> tabu[i];
        std::cout << tabu[i] << std::endl;
    }
    return *tabu;
}
I'm using this function to import the interface.
Just like every other possible kind of data that lives in your computer, it must be represented by a sequence of bytes. Each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much, much more than 256.
A single byte by itself cannot, therefore, express all characters. The simplest way of handling all possible permutations of characters is to pick just 256 (or fewer) of them at a time, assign each one to one of the possible byte values, and call that your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating system and modern terminals employ multi-byte character sequence, where a single character can be represented by specific sequences of more than just one byte. It's fairly likely that your terminal screen is based on multi-byte Unicode encoding.
In summary: you need to figure out two things:
Which character set your file uses
Which character set your terminal display uses
Then write the code to properly translate one to the other
It goes without saying that no one else can possibly tell you which character set your file uses and which character set your terminal display uses. That's something you'll need to figure out. And without knowing both, you can't do the translation step.
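As a sketch of what that translation step can look like, suppose the file turns out to be in CP-437 (a DOS code page that contains exactly these box-drawing characters; this is an assumption, not something anyone can verify for you) and the terminal expects UTF-8, on a compiler whose execution character set is UTF-8. Then a small lookup table does the job; the byte values and code points below cover only the characters from the question.
#include <iostream>
#include <map>
#include <string>

// Maps the CP-437 byte values of the question's box-drawing characters to
// UTF-8 (via universal character names). Anything not in the table is
// passed through unchanged.
std::string cp437_box_to_utf8(const std::string& in)
{
    static const std::map<unsigned char, std::string> table = {
        {0xB9, "\u2563"}, {0xBA, "\u2551"}, {0xBB, "\u2557"}, {0xBC, "\u255D"},
        {0xC8, "\u255A"}, {0xC9, "\u2554"}, {0xCA, "\u2569"}, {0xCB, "\u2566"},
        {0xCC, "\u2560"}, {0xCD, "\u2550"}, {0xCE, "\u256C"},
    };
    std::string out;
    for (unsigned char c : in) {
        auto it = table.find(c);
        if (it != table.end())
            out += it->second;
        else
            out += static_cast<char>(c);
    }
    return out;
}

int main()
{
    std::cout << cp437_box_to_utf8("\xC9\xCD\xBB") << '\n'; // prints a small "╔═╗"
    return 0;
}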
To print the Unicode characters, you can put the Unicode value with the prefix \u.
If the console does not support Unicode, then you cannot get the correct result.
Example:
#include <iostream>

int main() {
    std::cout << "Character: \u2563" << std::endl;
    std::cout << "Character: \u2551" << std::endl;
    std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
Another suggestion: use unsigned char the same way you would use char, and assign it the character's numeric code, e.g. unsigned char a = <unicode num>; (use any name in place of a). This is how I did it when I got garbled output like that while making a game engine for cmd; it works with GNU GCC in C++17.

How can I read accented characters in C++ and use them with isalnum?

I am programming in French and, because of that, I need to use accented characters. I can output them by using
#include <locale> and setlocale(LC_ALL, ""), but there seems to be a problem when I read accented characters. Here is a simple example I made to show the problem:
#include <locale>
#include <iostream>
using namespace std;

const string SymbolsAllowed = "+-*/%";

int main()
{
    setlocale(LC_ALL, ""); // makes accents printable

    // Traduction : Please write a string with accented characters
    // 'é' is shown correctly :
    cout << "Veuillez écrire du texte accentué : ";
    string accentedString;
    getline(cin, accentedString);

    // Accented char are not shown correctly :
    cout << "Accented string written : " << accentedString << endl;

    for (unsigned int i = 0; i < accentedString.length(); ++i)
    {
        char currentChar = accentedString.at(i);

        // The program crashes while testing if currentChar is alphanumeric.
        // (error image below) :
        if (!isalnum(currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar))
        {
            cout << endl << "Character not allowed : " << currentChar << endl;
            system("pause");
            return 1;
        }
    }

    cout << endl << "No unauthorized characters were written." << endl;
    system("pause");
    return 0;
}
Here is an output example before the program crashes :
Veuillez écrire du texte accentué : éèàìù
Accented string written : ʾS.?—
I noticed the debugger from Visual Studio shows that I have written something different than what it outputs :
[0] -126 '‚' char
[1] -118 'Š' char
[2] -123 '…' char
[3] -115 '' char
[4] -105 '—' char
The error shown seems to say that only characters between -1 and 255 can be used, but according to the ASCII table the values of the accented characters I used in the example above do not exceed this limit.
Here is a picture of the error dialog that pops up : Error message: Expression: c >= -1 && c <= 255
Can someone please tell me what I am doing wrong or give me a solution for this? Thank you in advance. :)
char is a signed type on your system (indeed, on many systems) so its range of values is -128 to 127. Characters whose codes are between 128 and 255 look like negative numbers if they are stored in a char, and that is actually what your debugger is telling you:
[0] -126 '‚' char
That's -126, not 126. In other words, 130 or 0x82.
isalnum and friends take an int as an argument, which (as the error message indicates) is constrained to the values EOF (-1 on your system) and the range 0-255. -126 is not in this range. Hence the error. You could cast to unsigned char, or (probably better, if it works on Windows) use the two-argument std::isalnum in <locale>.
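For reference, here is a minimal sketch of that overload; the helper name is just for illustration:
#include <locale>

// Uses the two-argument std::isalnum from <locale>: the character is passed
// by its own type, so negative char values never reach the int-taking C function.
bool is_alnum_in_locale(char c, const std::locale& loc = std::locale(""))
{
    return std::isalnum(c, loc);
}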
For reasons which totally escape me, Windows seems to be providing console input in CP-437 but processing output in CP-1252. The high half of those two code pages is completely different. So when you type é, it gets sent to your program as 130 (0x82) from CP-437, but when you send that same character back to the console, it gets printed according to CP-1252 as a (low) open single quote ‚ (which looks a lot like a comma, but isn't). So that's not going to work. You need to get input and output to be on the same code page.
I don't know a lot about Windows, but you can probably find some useful information in the MS docs. That page includes links to Windows-specific functions which set the input and output code pages.
Intriguingly, the accented characters in the source code of your program appear to be CP-1252, since they print correctly. If you decide to move away from code page 1252 -- for example, by adopting Unicode -- you'll have to fix your source code as well.
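A minimal, Windows-only sketch of forcing both directions onto one code page (1252 here, matching what the rest of the program appears to assume; whether that is the right page for your machine is something you would have to check):
#include <windows.h>

void force_cp1252_console()
{
    SetConsoleCP(1252);       // code page used for console input
    SetConsoleOutputCP(1252); // code page used for console output
}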
With the is* and to* functions, you really need to cast the input to unsigned char before passing it to the function:
if (!isalnum((unsigned char)currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar)) {
While you're at it, I'd advise against using strchr as well, and switch to something like this:
std::string SymbolsAllowed = "+-*/%";
if (... && SymbolsAllowed.find(currentChar) == std::string::npos)
While you're at it, you should probably forget that you ever even heard of the exit function. You should never use it in C++. In the case here (exiting from main) you should just return. Otherwise, throw an exception (and if you want to exit the program, catch the exception in main and return from there).
If I were writing this, I'd do the job somewhat differently in general though. std::string already has a function to do most of what your loop is trying to accomplish, so I'd set up symbolsAllowed to include all the symbols you want to allow, then just do a search for anything it doesn't contain:
// Add all the authorized characters to the string:
for (int a = 0; a <= std::numeric_limits<unsigned char>::max(); a++) // needs <limits>
    if (isalnum(a) || isspace(a)) // you probably want to allow spaces?
        symbolsAllowed += static_cast<char>(a);
// ...
auto pos = accentedString.find_first_not_of(symbolsAllowed);
if (pos != std::string::npos) {
    std::cout << "Character not allowed: " << accentedString[pos];
    return 1;
}

C++ Non ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}
But on Linux, if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
it tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!
In your Windows system's code page, á is a single-byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
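If you only need to step over the characters of a UTF-8 std::string yourself (rather than pull in a full library), the lead byte of each character tells you how many bytes it occupies; continuation bytes all look like 10xxxxxx. Here is a minimal sketch with no validation of malformed input, assuming the source file and the terminal are both UTF-8:
#include <iostream>
#include <string>

// Number of bytes in the UTF-8 character that starts with this lead byte.
std::size_t utf8_char_length(unsigned char lead)
{
    if (lead < 0x80)        return 1; // 0xxxxxxx: plain ASCII
    if ((lead >> 5) == 0x6) return 2; // 110xxxxx
    if ((lead >> 4) == 0xE) return 3; // 1110xxxx
    return 4;                         // 11110xxx
}

int main()
{
    std::string text = "áéz";
    for (std::size_t i = 0; i < text.size(); ) {
        std::size_t len = utf8_char_length(static_cast<unsigned char>(text[i]));
        std::cout << text.substr(i, len) << '\n'; // prints one character per line
        i += len;
    }
    return 0;
}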
Try using std::wstring. The encoding used isn't specified by the standard as far as I know, so I wouldn't save these contents to a file without a library that handles a specific format of some sort. It supports multi-byte characters, so you can use letters and symbols not supported by ASCII.
#include <iostream>
#include <string>

int main()
{
    std::wstring text = L"áéíóú";
    for (std::size_t i = 0; i < text.length(); i++)
        std::wcout << text[i];
    std::wcout << text.length() << std::endl;
}

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know C++ has the "w" variants (wchar_t, wstring, wfstream) for reading wide characters. I tried the following:
#include <fstream>
#include <iostream>

int main() {
    std::wfstream file("file"); // aaaàaaa
    std::wstring str;
    std::getline(file, str);
    std::wcout << str << std::endl; // aaa
}
But as you can see, it did not read a full line. It stops when it reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8; this is likely given that you are on Linux. I'll assume your terminal also supports it. If you directly read using a std::string, with chars, you will have everything working. Look:
// olá
#include <iostream>
#include <fstream>
int main() {
    std::fstream file("test.cpp");
    std::string str;
    std::getline(file, str);
    std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/ / o l á
47 47 32 111 108 195 161
Note that á is encoded with two bytes. If you ask the size of the string (str.size()), you will indeed get the wrong value: 7. This happens because the string thinks every byte is a char. When you send it to std::cout, the string will be given to the terminal to print. And the magical part: The terminal works with utf-8 by default. So it just assumes the string is utf-8 and correctly prints 6 chars.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
Let's go for wstrings. They store each letter with a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible unicode character. But it will not work directly because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last factor is important and the default "C" encoding says: "Assume everything is ASCII". When it is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
int main() {
    std::ios::sync_with_stdio(false);
    std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
    std::wcout.imbue(loc);          // Use it for output
    std::wfstream file("test.cpp");
    file.imbue(loc);                // Use it for file input
    std::wstring str;
    std::getline(file, str);        // str.size() will be 6
    std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because it enables you to use both cout and printf in the same program. We have to disable it here because the C streams will break the UTF-8 encoding and produce garbage on the output.

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German Umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:
#include <iostream>
using namespace std;
int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;

    wchar_t currentLetter;
    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF8 and the console's encoding is also set to UTF8.
Here's a solution provided by Sehe:
#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;
template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";

    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes you need Boost, but it seems that you're going to need an external library anyway.
1
C++ has no idea of Unicode. Use an external library such as ICU
(UnicodeString class) or Qt (QString class), both support Unicode,
including UTF-8.
2
Since UTF-8 has variable length, all kinds of indexing will do
indexing in code units, not codepoints. It is not possible to do
random access on codepoints in an UTF-8 sequence because of its
variable length nature. If you want random access you need to use a
fixed length encoding, like UTF-32. For that you can use the U prefix
on strings.
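A tiny illustration of that fixed-width alternative (requires C++11; the second code point of "für" is written as a universal character name so the example does not depend on the source file's encoding):
#include <iostream>
#include <string>

int main()
{
    std::u32string word = U"f\u00FCr";           // "für" as UTF-32
    std::cout << word.size() << '\n';            // 3: one char32_t per code point
    std::cout << (word[1] == U'\u00FC') << '\n'; // 1: indexing reaches the 'ü' directly
    return 0;
}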
3
The C++ language standard has no notion of explicit encodings. It only
contains an opaque notion of a "system encoding", for which wchar_t is
a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external
encoding, you must use an external library. The library of choice
would be iconv() (from WCHAR_T to UTF-8), which is part of Posix and
available on many platforms, although on Windows the
WideCharToMultiByte function is guaranteed to produce UTF8.
C++11 adds new UTF8 literals in the form of std::string s = u8"Hello
World: \U0010FFFF";. Those are already in UTF8, but they cannot
interface with the opaque wstring other than through the way I
described.
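For the Windows branch, a minimal sketch of such a conversion with WideCharToMultiByte (the helper name is made up, and error handling is stripped down):
#include <string>
#include <windows.h>

// Converts a wide string to UTF-8 using the Win32 API.
std::string wide_to_utf8(const std::wstring& in)
{
    if (in.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}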
4 (about source files but still sorta relevant)
Encoding in C++ is quite a bit complicated. Here is my understanding
of it.
Every implementation has to support characters from the basic source
character set. These include common characters listed in §2.2/1
(§2.3/1 in C++11). These characters should all fit into one char. In
addition implementations have to support a way to name other
characters using a way called universal character names and look like
\uffff or \Uffffffff and can be used to refer to unicode characters. A
subset of them are usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file, to
source characters (used at compile time) is implementation defined.
This constitutes the encoding used.