C++ non-ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}
But on Linux, if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
It tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!

In your Windows system's code page, á is a single-byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes, to be exact) UTF-8 sequence 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters, because the UTF-8 representation of every ASCII character fits in a single byte.
Unfortunately, UTF-8 is not really supported by the C++ standard facilities. As long as you only handle the string as a whole, and neither access individual chars from it nor assume the length of the string equals the number of actual characters in it, std::string will most likely do fine.
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
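To make the byte-versus-character distinction concrete, here is a minimal sketch (the helper name utf8_length is mine, and it assumes the string holds valid UTF-8) that counts code points by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx:
#include <iostream>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) // lead byte or plain ASCII, not a continuation
            ++count;
    return count;
}

int main()
{
    std::string text = "á";                      // 2 bytes in UTF-8: C3 A1
    std::cout << text.length() << std::endl;     // prints 2 (bytes)
    std::cout << utf8_length(text) << std::endl; // prints 1 (code point)
}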

Try using std::wstring. The encoding used isn't specified by the standard as far as I know, so I wouldn't save these contents to a file without a library that handles a specific format. It supports wide characters, so you can use letters and symbols not covered by ASCII.
#include <iostream>
#include <string>

int main()
{
    std::wstring text = L"áéíóú";
    for (int i = 0; i < text.length(); i++)
        std::wcout << text[i];
    std::wcout << text.length() << std::endl;
}

Related

Printing Unicode characters in C++

I'm trying to print an interface using these characters:
"╣║╗╝╚╔╩╦╠═╬"
but when I try to print them, I get something like this:
"ôöæËÈ"
unsigned char* tabuleiroImportado() {
    std::ifstream TABULEIRO;
    TABULEIRO.open("tabuleiro.txt");
    unsigned char tabu[36][256];
    for (unsigned char i = 0; i < 36; i++) {
        TABULEIRO >> tabu[i];
        std::cout << tabu[i] << std::endl;
    }
    return *tabu;
}
I'm using this function to import the interface.
Just like every other possible kind of data that lives in your computer, text must be represented by a sequence of bytes. Each byte can have just 256 possible values.
All the carbon-based life forms that live on the third planet from the sun use all sorts of different alphabets with all sorts of characters, whose total number is much, much more than 256.
A single byte by itself cannot, therefore, express all characters. The simplest way of handling this is to pick just 256 (or fewer) characters at a time, assign each of the (up to 256) byte values to one character in that small set, and call it your "character set".
Such is, apparently, your "tabuleiro.txt" file: its contents must be using some particular character set which includes the characters you expect to see there.
Your screen display, however, uses a different character set, hence the same values show different characters.
However, it's probably more complicated than that: modern operating systems and modern terminals employ multi-byte character sequences, where a single character can be represented by a specific sequence of more than just one byte. It's fairly likely that your terminal screen is based on a multi-byte Unicode encoding.
In summary: you need to figure out two things:
Which character set your file uses
Which character set your terminal display uses
Then write the code to properly translate one to the other
It goes without saying that no one else could possibly tell you which character set your file uses or which character set your terminal display uses. That's something you'll need to figure out; and without knowing both, you can't do step 3, the translation.
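As a hedged illustration of that last step, assuming the file uses the common DOS code page 437 for its box-drawing characters and the terminal expects UTF-8, the translation can be a simple lookup table (the function name cp437_to_utf8 is mine, the table covers only the characters from the question, and narrow \u literals produce UTF-8 only when the execution character set is UTF-8):
#include <iostream>
#include <map>
#include <string>

// Sketch: translate a few CP437 box-drawing bytes to UTF-8 strings.
// Assumes the input file is CP437 and the terminal displays UTF-8.
std::string cp437_to_utf8(unsigned char byte)
{
    static const std::map<unsigned char, std::string> table = {
        {0xB9, "\u2563"}, {0xBA, "\u2551"}, {0xBB, "\u2557"},
        {0xBC, "\u255D"}, {0xC8, "\u255A"}, {0xC9, "\u2554"},
        {0xCA, "\u2569"}, {0xCB, "\u2566"}, {0xCC, "\u2560"},
        {0xCD, "\u2550"}, {0xCE, "\u256C"},
    };
    auto it = table.find(byte);
    return it != table.end() ? it->second
                             : std::string(1, static_cast<char>(byte)); // pass ASCII through
}

int main()
{
    const unsigned char line[] = { 0xC9, 0xCD, 0xCB, 0xCD, 0xBB, 0 };
    for (const unsigned char* p = line; *p; ++p)
        std::cout << cp437_to_utf8(*p);
    std::cout << std::endl; // prints: ╔═╦═╗
}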
To print Unicode characters, you can write the Unicode code point with the \u prefix.
If the console does not support Unicode, then you cannot get the correct result.
Example:
#include <iostream>

int main() {
    std::cout << "Character: \u2563" << std::endl;
    std::cout << "Character: \u2551" << std::endl;
    std::cout << "Character: \u2560" << std::endl;
}
Output:
Character: ╣
Character: ║
Character: ╠
Another approach that worked for me: declare an unsigned char the same way you would a char, and assign it the character's code-page value (e.g. unsigned char a = 186;), then print that variable. I did this when making a game engine for cmd; it works with GNU GCC in C++17 (tested in 2021 and 2022), and you can use any name in place of a.

How to apply <cctype> functions to text files with different encodings in C++

I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly... However, the files are mostly in German and use several different encodings:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem I am facing is that I cannot find a correct way to apply character-conversion functions such as tolower(), and I also get some weird icons in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files, the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the "Unicode replacement character" � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case... I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check the encoding of the file that is about to be opened
Open the file according to its specific encoding
Convert the file input to UTF-8
Process the file and apply tolower() etc.
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled as a global variable in order to process the text (without bothering with how the console displays it)? (On Linux, for example, I do not have de_DE enabled when I run locale -a.)
2. Is this problem only visible due to the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote that doesn't work as I want at the moment.
void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        exit(1);
    }

    // calculate file size
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while (inFile.good() && std::getline(inFile, line)) {
        s.append(line + "\n");
    }
    inFile.close();
    std::cout << s << std::endl;

    // remove punctuation, numbers, tolower
    // TODO encoding detection and specific transformation (cannot catch Ö, Ä etc.) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");
    for (unsigned int i = 0; i < s.length(); ++i) {
        if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
        if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
    }
    //std::cout << s << std::endl;

    // tokenize string
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto& i : tokens)
        std::cout << i << std::endl;

    // PROCESS TOKENS
    return;
}
Unicode defines "code points" for characters. A code point is a value in the range 0 to 0x10FFFF, commonly stored in a 32-bit integer. There are several kinds of encodings. ASCII only uses 7 bits, which gives 128 different chars. Microsoft used the 8th bit to define another 128 chars, depending on the locale, and called these "code pages" (some of them resemble standard sets such as Latin-1 / ISO-8859-1). Nowadays MS uses the 2-byte-based UTF-16 encoding, in which code points beyond the 16-bit range are represented as surrogate pairs of two 16-bit units.
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 chars are exactly the same as the ASCII chars, with just one byte per character. To represent a character, UTF-8 can use up to 4 bytes. More info in the Wikipedia article.
While MS uses UTF-16 for both files and RAM, Linux typically uses UTF-32 (a 4-byte wchar_t) in RAM.
In order to read a file you need to know its encoding; trying to detect it automatically is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, as in this SO answer.
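For illustration, a minimal sketch of the imbue approach (the helper name readLatin1File is mine; the locale name "de_DE.iso88591" is an assumption and must actually be installed on the system, see locale -a):
#include <fstream>
#include <locale>
#include <string>

// Read an ISO-8859-1 encoded file into a wide string by imbuing the
// stream with a matching locale before opening it.
std::wstring readLatin1File(const std::string& filename)
{
    std::wifstream in;
    in.imbue(std::locale("de_DE.iso88591")); // assumed locale name
    in.open(filename);
    std::wstring contents, line;
    while (std::getline(in, line))
        contents += line + L'\n';
    return contents;
}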
tolower and such functions can work with a locale, e.g.
#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; // latin capital 'O' with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex 00F6, dec 246
    std::cout << "s = " << s << std::endl;
    std::cout << "sL= " << sL << std::endl;
    return 0;
}
outputs:
s = 214
sL= 246
In this other SO answer you can find good solutions, such as using the iconv library (on Linux) or iconv for Win32.
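As a rough sketch of the iconv approach (POSIX; the helper name latin1ToUtf8 is mine and error handling is trimmed to the minimum):
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert an ISO-8859-1 byte string to UTF-8 using POSIX iconv.
std::string latin1ToUtf8(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 2, '\0'); // ISO-8859-1 grows at most 2x in UTF-8
    char* inbuf = const_cast<char*>(in.data());
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - outleft);
    return out;
}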
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
//German
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
//English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

Decimal values of Extended ASCII characters

I wrote a function to test if a string consists only of letters, and it works well:
bool is_all_letters(const char* src) {
    while (*src) {
        // A-Z, a-z
        if ((*src > 64 && *src < 91) || (*src > 96 && *src < 123)) {
            ++src;
        } else {
            return false;
        }
    }
    return true;
}
My next step was to include "Extended ASCII codes", and I thought it was going to be really easy, but that's where I ran into trouble. For example:
std::cout << (unsigned int)'A'; // 65 <-- decimal ASCII value
std::cout << (unsigned int)'ñ'; // 4294967281 <-- what?
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
My goal is to restrict user input to only letters in ISO 8859-1 (latin 1). I’ve only worked with single byte characters and would like to avoid multi-byte characters if possible.
I am guessing that I could compare the unsigned int values above, i.e. 4294967281, but it does not feel right to me; besides, I don't know whether that large integer is VC 8.0's representation of 'ñ' or whether it changes from compiler to compiler.
Please advise
UPDATE - Per some suggestions made by Christophe, I ran the following code:
locale loc("spanish");
cout << loc.name() << endl; // Spanish_Spain.1252
for (int i = 0; i < 255; i++) {
    cout << i << " " << isalpha(i, loc) << " " << (isprint(i, loc) ? (char)(i) : '?') << endl;
}
It does return Spanish_Spain.1252, but unfortunately the loop iterations print the same data as the default C locale (using VC++ 8 / VS 2005).
Christophe gets different (desired) results, as shown in his screenshots (not reproduced here), but he uses a much newer version of VC++.
The code chart you found on the internet is actually Windows OEM code page 437, which was never endorsed as a standard. Although it is sometimes called "extended ASCII", that description is highly misleading. (See the Wikipedia article Extended ASCII: "The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.")
You can find the history of OEM437 on Wikipedia, in various versions.
What was endorsed as a standard 8-bit encoding is ISO-8859-1, which later became the first 256 code points in Unicode. (It's one of a series of 8-bit encodings designed for use in different parts of the world; ISO-8859-1 is designed for the Americas and Western Europe.) So that's what you will find in most computers produced in this century in those regions, although more recently more and more operating systems are converting to full Unicode support.
The value you see for (unsigned int)'ñ' is the result of casting the ISO-8859-1 code 0xF1 from a (signed) char (that is, -15) to an unsigned int. Had you cast it to an int, you would have seen -15.
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
Asciitable.com appears to give the code for the old IBM437 DOS character set (still used in the Windows command prompt), in which ñ is indeed 164. But that's just one of hundreds of “extended ASCII” variants.
The value 4294967281 = 0xFFFFFFF1 you got is a sign-extension of the (signed) char value 0xF1, which is how ñ is encoded in ISO-8859-1 and close variants like Windows-1252.
To start with, you're trying to reinvent std::isalpha, but you'll need to pass it the ISO-8859-1 locale; IIRC, by default it just checks ASCII.
The behavior you see is because char is signed in VC++ by default (you didn't compile with /J, which makes char unsigned and is the smart thing to do when you work with more than just ASCII).
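Regardless of compiler switches, the portable fix is to convert through unsigned char before calling the <cctype> classification functions, since they have undefined behaviour for negative char values; a minimal sketch (the wrapper name is_letter is mine):
#include <cctype>

// Safe wrapper: converting through unsigned char avoids undefined
// behaviour when char is signed and holds values above 127.
bool is_letter(char ch)
{
    return std::isalpha(static_cast<unsigned char>(ch)) != 0;
}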
There is already plenty of information here. However, I'd like to propose some ideas to address your initial problem, namely the categorisation of an extended character set.
For this, I suggest the use of <locale> (for country-specific topics), and especially the locale-aware forms of isalpha(), isspace(), isprint(), etc.
Here is a little piece of code to help you find out which chars could be a letter in your local alphabet:
std::locale::global(std::locale("")); // set the environment's default locale as the global locale
std::cout << std::locale().name() << std::endl; // display the name of the current locale
std::locale loc; // use a copy of the active global locale (you could use another)
for (int i = 0; i < 255; i++) {
    std::cout << i << " " << std::isalpha(i, loc) << " " << (std::isprint(i, loc) ? (char)(i) : '?') << std::endl;
}
This will print out the codes from 0 to 255, followed by an indicator of whether each one is a letter according to the locale settings, and the character itself if it's printable.
For example, on my PC (screenshot not reproduced here), all the accented chars, as well as ñ and the Greek letters, are considered alpha, whereas £ and mathematical symbols are considered non-alpha printables.

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me, at least): I have a wstring, and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that demonstrates the problem. This program loops over all letters and outputs them one by one:
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1, 1) << endl;
    wchar_t currentLetter;
    bool isLastLetter;
    do {
        isLastLetter = (word.length() == 1);
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF-8 and the console's encoding is also set to UTF-8.
Here's a solution provided by Sehe:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;

template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";
    bool isLastLetter;
    do {
        isLastLetter = (word.length() == 1);
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes, you need Boost, but it seems that you're going to need an external library anyway.
1
C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.
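For instance, a minimal sketch with ICU (assuming ICU is installed and the source file is UTF-8; link with -licuuc):
#include <iostream>
#include <string>
#include <unicode/unistr.h> // ICU

int main()
{
    icu::UnicodeString word = icu::UnicodeString::fromUTF8("für");
    UChar32 first = word.char32At(0); // a whole code point, here 'f'
    std::string rest;
    word.tempSubString(1).toUTF8String(rest);
    std::cout << "first: " << (char)first << ", rest: " << rest << std::endl; // first: f, rest: ür
}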
2
Since UTF-8 has variable length, all kinds of indexing will index code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access, you need a fixed-length encoding, like UTF-32. For that you can use the U prefix on string literals.
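For example, with a UTF-32 literal every element of the string is a whole code point, so indexing and taking the first letter behave as expected (a minimal sketch; it assumes a UTF-8 source file, and since printing a char32_t to a narrow stream would need a separate conversion, the checks use assert instead):
#include <cassert>
#include <string>

int main()
{
    std::u32string word = U"für"; // UTF-32: one element per code point
    char32_t first = word.at(0);
    assert(first == U'f');
    assert(word.at(1) == U'ü');   // works, unlike byte indexing into UTF-8
    word = word.substr(1);        // remove the first letter
    assert(word == U"ür");
}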
3
The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms, although on Windows the WideCharToMultiByte function is guaranteed to produce UTF-8.
C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than through the way I described.
4 (about source files, but still somewhat relevant)
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11), and these characters should all fit into one char. In addition, implementations have to support a way to name other characters using so-called universal character names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation defined. This constitutes the encoding used.

Reverse string with non-ASCII characters

I want to reverse the order of the characters in a string containing special characters, like this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ
to
ŃŹAJ ĄŁŚĘG ĆŁÓŻAZ
I tried to use std::reverse:
std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;
but the output shows me this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ <- reversed string
So I tried to do it "manually":
std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() / 2.f);
std::cout << count << " " << text1.size() << std::endl;
unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count; i++)
{
    char tmp = text1[i];
    text1[i] = text1[maxIndex];
    text1[maxIndex] = tmp;
    maxIndex--;
}
std::cout << text1 << std::endl;
But in this case I have a problem with text1.size(), because every special character is counted twice:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
13 27 <- second number is text1.size()
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ
What is the proper way to reverse a string with special characters?
Your code really does correctly reverse the bytes in your string; there's nothing wrong there. The problem, however, is that your compiler stores your literal string "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in the UTF-8 encoding.
UTF-8 stores all characters except those that match ASCII as variable-length sequences of bytes. This means that one char (one byte) is no longer one character, so reversing chars isn't the same as reversing characters.
To achieve your goal you have at least two options:
Use some utf-8 library that will let you iterate characters instead of bytes. One example is http://utfcpp.sourceforge.net/
Somehow (and that depends a lot on the compiler and OS you are using) switch to the UTF-32 encoding, which has a constant character length, and enjoy good old constant-character-size strings without all these crazy variable-character-size troubles.
UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html
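A sketch of the second option using std::wstring_convert from C++11 (deprecated in C++17 but still widely available; the function name reverseUtf8 is mine, and note that this reverses code points, not grapheme clusters, so combining characters can still end up on the wrong base character, as a later answer explains):
#include <algorithm>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// Reverse a UTF-8 string by converting to UTF-32, reversing the
// code points, and converting back to UTF-8.
std::string reverseUtf8(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes(s);
    std::reverse(u32.begin(), u32.end());
    return conv.to_bytes(u32);
}

int main()
{
    std::cout << reverseUtf8("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!") << std::endl; // !ŃŹAJ ĄLŚĘG ĆŁÓŻAZ
}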
You might code a reverse-UTF-8 function yourself:
std::string getMultiByteReversed(char ch1, char ch2)
{
    // ch1 is the byte seen first when walking backwards (the continuation
    // byte); ch2 is the one after it (the lead byte).
    if (ch2 == '\xc3') { // lead byte of most two-byte UTF-8 characters
        return std::string(1, ch2) + std::string(1, ch1);
    } else {
        return std::string(1, ch1);
    }
}

std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    for (std::string::const_reverse_iterator it = s.rbegin(); it != s.rend(); ++it) {
        std::string reversed;
        if ((it + 1) != s.rend()
            && (reversed = getMultiByteReversed(*it, *(it + 1))).size() == 2) {
            result += reversed;
            ++it; // consumed two bytes
        } else {
            result += *it;
        }
    }
    return result;
}
You can look up the utf8 codes at: http://www.utf8-chartable.de/
There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.
First is that (as other answers have stated) if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than bytes.
The next problem is that a single grapheme might consist of multiple Unicode code points. For example, an "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, that will break it, as the combining character U+0301 will now be associated with a different base character. So "Pokémon" reversed this way would become "noḿekoP", with the accent over the "m" instead of the "e".
Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.
Then you have languages where characters change form based on context, and I don't yet know much about dealing with those.
It may be that what you actually want is to reverse grapheme clusters. See UAX #29: Unicode Text Segmentation.
Have you tried swapping characters one by one?
For example, if the string length is odd, swap the first character with the last, the second with the second last, until only the middle character is left. If the string length is even, swap the 1st with the last, the 2nd with the 2nd last, until both middle characters are swapped. That way, the string will be reversed.