g++ unicode character ifstream - c++

This is a question about Unicode characters in a text input file. This discussion was close but not quite the answer. Compiled with VS2008 and executed on Windows, these characters are recognized on read (perhaps represented as a different symbol, but read); compiled with g++ and executed on Linux, they are displayed as blank.
‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
The rest of the Unicode symbols appear to work fine; I did not check them all, but found this set did not work.
Questions:
(1) Why?
(2) Is there a solution?
void Lexicon::buildMapFromFile(string filename) //map
{
    ifstream file;
    file.open(filename.c_str(), ifstream::binary);
    string wow, mem, key;
    unsigned int x = 0;
    while (true) {
        getline(file, wow);
        cout << wow << endl;
        if (file.fail()) break; //boilerplate check for error
        while (x < wow.length()) {
            if (wow[x] == ',') { //look for csv delimiter
                key = mem;
                mem.clear();
                x++; //step over ','
            } else
                mem += wow[x++];
        }
        //cout << mem << " code " << key << " is " << (key[0] - '€') << " from €" << endl;
        cout << "enter 1 to continue: ";
        while (true) {
            int choice = GetInteger();
            if (choice == 1) break;
        }
        list_map0[key] = mem; //char to string
        list_map1[mem] = key; //string to char
        mem.clear(); //reset memory
        x = 0;       //reset index
    }
    //printf("%d\n", list_map0.size());
    file.close();
}
The Unicode symbols are read from a CSV file and parsed into the Unicode symbol and an associated string. Initially I thought there was a bug in the code, but in this post the review found it is fine, and I traced the issue to how the characters are handled.
The test is cout << wow << endl;

The characters you show are all characters from Windows code page 1252 which do not exist in the ISO-8859-1 encoding. These two encodings are similar and so are often confused.
If the input is CP1252 and you are reading it as though it were ISO-8859-1, then those characters are read as control characters and will not behave as normal, visible characters.
There are many possible things you could be doing incorrectly to cause this, but you'll have to post more details in order to determine which. A more complete answer requires knowing how you are reading the data, how you are converting and storing it internally, how you are testing the read data, and the input data and/or encoding.
Your displayed code doesn't do any conversions while reading the data, and the commented-out code to print the data is the same; no conversions. This means to print the data you are simply relying on the input data to be correct for the platform you run the program on. That means that, for example, if you run your program in the console on Windows then your input file needs to be encoded using the console's codepage*.
To fix the problem you can either ensure the input file matches the encoding required by the particular console you run the program on, or specify the input encoding, convert to a known internal encoding when reading, and then convert to the console encoding when printing.
* and if it's not, for example if the console is cp437 and the file is cp1252 then the characters you listed will instead show up as: É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
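For the second approach, a rough sketch of the conversion step (not part of the original answer; it assumes the input really is CP1252, that the target console expects UTF-8, and that the POSIX iconv API is available, as it is on Linux with glibc) might look like this:

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a CP1252-encoded string to UTF-8 using POSIX iconv.
// Error handling is deliberately minimal in this sketch.
std::string cp1252_to_utf8(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "CP1252");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4, '\0');        // UTF-8 needs at most a few bytes per input byte
    char* inbuf = const_cast<char*>(in.data());  // glibc's iconv takes char**
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - outleft);            // trim unused space
    return out;
}

Each line returned by getline() could then be passed through cp1252_to_utf8() before printing; on Windows you would instead convert to whatever code page the console uses.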

Your problem statement does not spell out the platform for g++, but from your tags it appears that you are compiling the same code on Linux.
Windows and Linux are both Unicode-enabled, but wide strings do not behave the same way on both. If your Windows/VS2008 code used the wstring class, you will have to change it back to string on Linux/g++; if you use wstring on Linux, it will not work the same way.
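One concrete difference that is easy to verify (a trivial sketch, not part of the original answer): wchar_t is 2 bytes with VS2008 but typically 4 bytes with g++ on Linux, so a wstring holds UTF-16 code units on one platform and UTF-32 code units on the other.

#include <iostream>

int main()
{
    // Typically prints 2 with VS2008 on Windows (UTF-16 units)
    // and 4 with g++ on Linux (UTF-32 units).
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    return 0;
}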

Unicode handling in C++ code is not straightforward and is implementation-dependent (you have already seen that the output changes between VS2008 and g++). Furthermore, Unicode can be implemented with different character encodings (such as UTF-8 and UTF-16).
There is an enlightening article on this page. It talks about Unicode conversion for STL-based software. For text I/O the main weapon is codecvt, a library facility that can be used to translate strings between different character encodings.
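As a hedged illustration of that approach (this sketch uses std::wstring_convert with std::codecvt_utf8 from <codecvt>, available since C++11 but deprecated in C++17; the byte values are assumed UTF-8 input, standing in for a line read from the file):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // UTF-8 bytes for "é" (0xC3 0xA9); in a real program these would
    // come from getline() on the input file.
    std::string utf8 = "\xC3\xA9";

    // Deprecated in C++17, but still the simplest standard-only route.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8);

    std::wcout << L"code point: " << (unsigned long)wide[0] << L'\n'; // 233 = U+00E9
    return 0;
}

The reverse direction (conv.to_bytes) can be used to re-encode before printing.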

Related

How to detect non-ASCII characters in C++ on Windows?

I'm simply trying to detect non-ASCII characters in my C++ program on Windows.
Using something like isascii() or:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));
does not work because non-ASCII characters are getting mapped to ASCII characters before or while getchar() is doing its thing. For example, if I have some code like:
#include <cctype>   // isascii
#include <cstdio>   // getchar, printf
#include <iostream>

using namespace std;

int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;
    cout << c << endl;
    printf("0x%x\n", c);
    cout << (char)c;
    return 0;
}
and input a 😁 (because I am so happy right now), the output is
1
63
0x3f
?
Furthermore, if I feed the program something outside of the extended ASCII range (code page 437), such as 'Ĥ', I get the output
1
72
0x48
H
This works with similar inputs such as Ĭ or ō (they map to I and o). So this seems algorithmic and not just mojibake or something. A quick check in Python (via the same terminal) with a program like
i = input()
print(ord(i))
gives me the expected actual code point instead of the ASCII-mapped one (so it's not the code page or the terminal?). This makes me believe getchar() or the C++ compilers (tested with the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and cannot reproduce the issue, which makes me inclined to believe it is something to do with Windows (10 Pro). Can anyone explain what is going on here?
Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.
I think the problem is that getchar() expects input as a char type, which is 8 bits and only supports ASCII. getwchar() supports the wchar_t type, which allows for other text encodings. "😁" isn't ASCII, and from this page: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings , it seems that Windows encodes extended characters like this in UTF-16. I couldn't find a lookup table for UTF-16 emoji, but my guess is that the emoji can't be mapped to a single narrow char, so it gets replaced with '?' (0x3f), which is why you're seeing that printed out.
Okay, I have solved this. I was not aware of translation modes.
_setmode(_fileno(stdin), _O_WTEXT);
was the solution. The link below essentially explains that there are translation modes, and I think phase 5 (character-set mapping) explains what happened.
https://en.cppreference.com/w/cpp/language/translation_phases
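For reference, a minimal Windows-only sketch of that fix (it assumes the MSVC-specific headers <io.h> and <fcntl.h>; note that characters outside the BMP, such as the emoji, arrive as two UTF-16 code units, i.e. two getwchar() calls):

#include <fcntl.h>    // _O_WTEXT
#include <io.h>       // _setmode
#include <cstdio>     // _fileno, stdin
#include <cwchar>     // getwchar, wint_t
#include <iostream>

int main()
{
    // Put stdin into wide (UTF-16) translation mode before reading.
    _setmode(_fileno(stdin), _O_WTEXT);

    wint_t c = getwchar();   // one UTF-16 code unit; surrogate pairs need a second call
    std::cout << "0x" << std::hex << (unsigned)c << '\n';
    return 0;
}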

How to apply <cctype> functions on text files with different encodings in C++

I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly... However, the files are mostly in German and are encoded in different ways:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem I am facing is that I cannot find a correct way to apply character-conversion functions such as tolower(), and I also get some weird icons in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the Unicode replacement character � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case... I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess at how I should solve this is:
Check the encoding of the file that is about to be opened
Open the file according to its specific encoding
Convert the file input to UTF-8
Process the file and apply tolower() etc.
Is the above approach feasible, or will the complexity skyrocket?
What is the correct approach to this problem? How can I open the files with some sort of encoding option?
1. Should my OS have the corresponding locale enabled as a global variable in order to process the text (without bothering about how the console displays it)? (On Linux, for example, I do not have de_DE enabled when I run locale -a.)
2. Is this problem only visible due to the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
The available locales (locale -a) are:
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code I wrote that doesn't work as I want at the moment.
void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        exit(1);
    }

    //calculate file size
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while ((inFile.good()) && std::getline(inFile, line)) {
        s.append(line + "\n");
    }
    inFile.close();
    std::cout << s << std::endl;

    //remove punctuation, numbers, tolower,
    //TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");
    for (unsigned int i = 0; i < s.length(); ++i) {
        if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
        if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
    }
    //std::cout << s << std::endl;

    //tokenize string
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto & i : tokens)
        std::cout << i << std::endl;

    //PROCESS TOKENS
    return;
}
Unicode defines "code points" for characters; a code point is just a number (commonly stored in a 32-bit value). There are several encodings of those code points. ASCII only uses 7 bits, which gives 128 different chars. The 8th bit was used by Microsoft to define another 128 chars, depending on the locale, in the so-called "code pages". Nowadays MS uses UTF-16, a 2-byte encoding; since 2 bytes are not enough for the whole Unicode set, characters outside the Basic Multilingual Plane are encoded as surrogate pairs. The locale-dependent encodings are the older 8-bit code pages, with names similar to the standards' names such as "Latin-1" or "ISO-8859-1".
Most used on Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 chars are exactly the same as ASCII, with just one byte per character. To represent a character, UTF-8 can use up to 4 bytes. More info on Wikipedia.
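A quick way to see that variable length in action (a sketch; the string literals below are the raw UTF-8 byte sequences for the code points given in the comments, so the result does not depend on the encoding of the source file):

#include <cstdio>
#include <cstring>

int main()
{
    // UTF-8 uses 1 to 4 bytes per code point.
    std::printf("%zu\n", std::strlen("a"));                 // 1 byte  (U+0061)
    std::printf("%zu\n", std::strlen("\xC3\xA9"));          // 2 bytes (U+00E9, é)
    std::printf("%zu\n", std::strlen("\xE2\x82\xAC"));      // 3 bytes (U+20AC, €)
    std::printf("%zu\n", std::strlen("\xF0\x9F\x98\x81"));  // 4 bytes (U+1F601, 😁)
    return 0;
}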
While MS uses UTF-16 for both files and RAM, Linux likely uses UTF-32 for RAM.
In order to read a file you need to know its encoding. Trying to detect it automatically is a real nightmare and may not succeed. Using std::basic_ios::imbue allows you to set the desired locale for your stream, as in this SO answer.
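A sketch of the imbue approach (assumptions: the file is ISO-8859-1, a matching locale such as "de_DE.ISO-8859-1" is installed, whose exact spelling varies between systems, and the file name is the one from the question):

#include <fstream>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    std::wifstream in("17454-8.txt");    // file name taken from the question
    try {
        // Locale name is an assumption; on other systems it may be
        // spelled "de_DE.iso88591" or similar, and it must be installed.
        in.imbue(std::locale("de_DE.ISO-8859-1"));
    } catch (const std::runtime_error& e) {
        std::cerr << "locale not available: " << e.what() << '\n';
        return 1;
    }

    std::wstring line;
    while (std::getline(in, line)) {
        // 'line' now holds wide characters decoded from Latin-1, so the
        // locale-aware std::tolower(c, loc) shown below can be applied per character.
    }
    return 0;
}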
tolower and such functions can work with a locale, e.g.
#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; //latin capital 'O' with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); //hex= 00F6, dec= 246
    std::cout << "s = " << s << std::endl;
    std::cout << "sL= " << sL << std::endl;
    return 0;
}
outputs:
s = 214
sL= 246
In this other SO answer you can find good solutions, such as using the iconv library on Linux or iconv for Win32.
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
//Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
//English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

C++ Visual Studio Unicode confusion

I've been looking at the Unicode chart, and know that the first 128 code points are equivalent for almost all encoding schemes: ASCII (probably the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32 and anything else.
I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a couple more characters such as {, | and }. After that it gets into no-man's land, which is basically around 20 "control characters", and then the characters begin again at 161 with an inverted exclamation mark, 162 which is the cent sign with a stroke through it, and so on.
The problem is, my results don't correspond to the Unicode chart, the UTF-8 chart, or the UCS-2 chart; the symbols seem random. By the way, the reason I made the character variable a four-byte int was that when I was using char (which is essentially a one-byte signed data type), after 127 it cycled back to -128, and I thought this might be messing things up.
I know I'm doing something wrong, can anyone figure out what's happening? This happens whether I set the character set to Unicode or Multibyte characters in the project settings. Here is the code you can run.
#include <cstdlib>  // system("pause")
#include <iostream>

using namespace std;

int main()
{
    unsigned int character = 122; // Starting at "z"
    for (int i = 0; i < 100; i++)
    {
        cout << (char)character << endl;
        cout << "decimal code point = " << (int)character << endl;
        cout << "size of character = " << sizeof(character) << endl;
        character++;
        system("pause");
        cout << endl;
    }
    return 0;
}
By the way, here is the Unicode chart
http://unicode-table.com/en/#control-character
Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as OEM), which may be different from the local single- or double-byte character set used by Windows applications (called ANSI).
For instance, on my English-language Windows install ANSI means windows-1252, while a console by default uses code page 850.
There are a few ways to write arbitrary Unicode characters to the console; see How to Output Unicode Strings on the Windows Console.
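One of those ways, sketched here under the assumption that you want to emit UTF-8 from a narrow-character program (SetConsoleOutputCP is a Win32 API; how the character actually renders also depends on the console font):

#include <windows.h>
#include <cstdio>

int main()
{
    // Tell the console to interpret the bytes we write as UTF-8.
    SetConsoleOutputCP(CP_UTF8);

    // UTF-8 byte sequence for U+00A1 (inverted exclamation mark).
    std::fputs("\xC2\xA1" "Hola!\n", stdout);
    return 0;
}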

How can I read accented characters in C++ and use them with isalnum?

I am programming in French and, because of that, I need to use accented characters. I can output them by using #include <locale> and setlocale(LC_ALL, ""), but there seems to be a problem when I read accented characters. Here is a simple example I made to show the problem:
#include <cstdlib>   // system
#include <cstring>   // strchr
#include <iostream>
#include <locale>
#include <string>

using namespace std;

const string SymbolsAllowed = "+-*/%";

int main()
{
    setlocale(LC_ALL, ""); // makes accents printable

    // Traduction : Please write a string with accented characters
    // 'é' is shown correctly :
    cout << "Veuillez écrire du texte accentué : ";
    string accentedString;
    getline(cin, accentedString);

    // Accented chars are not shown correctly :
    cout << "Accented string written : " << accentedString << endl;

    for (unsigned int i = 0; i < accentedString.length(); ++i)
    {
        char currentChar = accentedString.at(i);

        // The program crashes while testing if currentChar is alphanumeric.
        // (error image below) :
        if (!isalnum(currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar))
        {
            cout << endl << "Character not allowed : " << currentChar << endl;
            system("pause");
            return 1;
        }
    }

    cout << endl << "No unauthorized characters were written." << endl;
    system("pause");
    return 0;
}
Here is an output example before the program crashes :
Veuillez écrire du texte accentué : éèàìù
Accented string written : ʾS.?—
I noticed the debugger from Visual Studio shows that I have written something different from what it outputs:
[0] -126 '‚' char
[1] -118 'Š' char
[2] -123 '…' char
[3] -115 '' char
[4] -105 '—' char
The error shown seems to say that only characters between -1 and 255 can be used, but according to the ASCII table, the values of the accented characters I used in the example above do not exceed this limit.
Here is a picture of the error dialog that pops up : Error message: Expression: c >= -1 && c <= 255
Can someone please tell me what I am doing wrong or give me a solution for this? Thank you in advance. :)
char is a signed type on your system (indeed, on many systems) so its range of values is -128 to 127. Characters whose codes are between 128 and 255 look like negative numbers if they are stored in a char, and that is actually what your debugger is telling you:
[0] -126 '‚' char
That's -126, not 126. In other words, 130 or 0x82.
isalnum and friends take an int as an argument, which (as the error message indicates) is constrained to the value EOF (-1 on your system) and the range 0-255. -126 is not in this range; hence the error. You could cast to unsigned char, or (probably better, if it works on Windows) use the two-argument std::isalnum in <locale>.
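A small sketch of that two-argument form (the locale name is an assumption; it might be spelled "fr_FR.ISO-8859-1" on Linux or "French_France.1252" on Windows, and it must be installed):

#include <iostream>
#include <locale>

int main()
{
    // Hypothetical locale name; use whatever Latin-1/CP1252 locale the system provides.
    std::locale loc("fr_FR.ISO-8859-1");

    char c = '\xE9';   // 'é' in ISO-8859-1 / CP1252
    std::cout << std::boolalpha
              << std::isalnum(c, loc) << '\n';   // true; no unsigned char cast needed here
    return 0;
}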
For reasons which totally escape me, Windows seems to be providing console input in CP-437 but processing output in CP-1252. The high half of those two code pages is completely different. So when you type é, it gets sent to your program as 130 (0x82) from CP-437, but when you send that same byte back to the console, it gets printed according to CP-1252 as a (low) open single quote ‚ (which looks a lot like a comma, but isn't). So that's not going to work. You need to get input and output onto the same code page.
I don't know a lot about Windows, but you can probably find some useful information in the MS docs. That page includes links to Windows-specific functions which set the input and output code pages.
Intriguingly, the accented characters in the source code of your program appear to be CP-1252, since they print correctly. If you decide to move away from code page 1252 -- for example, by adopting Unicode -- you'll have to fix your source code as well.
With the is* and to* functions, you really need to cast the input to unsigned char before passing it to the function:
if (!isalnum((unsigned char)currentChar) && !strchr(SymbolsAllowed.c_str(), currentChar)) {
While you're at it, I'd advise against using strchr as well, and switch to something like this:
std::string SymbolsAllowed = "+-*/%";
if (... && SymbolsAllowed.find(currentChar) == std::string::npos)
You should probably also forget that you ever even heard of the exit function. You should never use it in C++. In the case here (exiting from main) you should just return. Otherwise, throw an exception (and if you want to exit the program, catch the exception in main and return from there).
If I were writing this, I'd do the job somewhat differently in general though. std::string already has a function to do most of what your loop is trying to accomplish, so I'd set up symbolsAllowed to include all the symbols you want to allow, then just do a search for anything it doesn't contain:
// Add all the authorized characters to the string:
for (unsigned char a = 0; a < std::numeric_limits<unsigned char>::max(); a++)
    if (isalnum(a) || isspace(a)) // you probably want to allow spaces?
        symbolsAllowed += a;

// ...

auto pos = accentedString.find_first_not_of(symbolsAllowed);
if (pos != std::string::npos) {
    std::cout << "Character not allowed: " << accentedString[pos];
    return 1;
}

Decimal values of Extended ASCII characters

I wrote a function to test if a string consists only of letters, and it works well:
bool is_all_letters(const char* src) {
    while (*src) {
        // A-Z, a-z
        if ((*src > 64 && *src < 91) || (*src > 96 && *src < 123)) {
            *src++;
        }
        else {
            return false;
        }
    }
    return true;
}
My next step was to include “Extended ASCII Codes”, I thought it was going to be really easy but that’s where I ran into trouble. For example:
std::cout << (unsigned int)'A'; // 65 <-- decimal ascii value
std::cout << (unsigned int)'ñ'; // 4294967281 <-- what?
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
My goal is to restrict user input to only letters in ISO 8859-1 (latin 1). I’ve only worked with single byte characters and would like to avoid multi-byte characters if possible.
I am guessing that I can compare the unsigned int values above, i.e. 4294967281, but it does not feel right to me and, besides, I don't know whether that large integer is VC 8.0's representation of 'ñ' or whether it changes from compiler to compiler.
Please advise.
UPDATE - Per some suggestions made by Christophe, I ran the following code:
locale loc("spanish") ;
cout<<loc.name() << endl; // Spanish_Spain.1252
for (int i = 0; i < 255; i++) {
cout << i << " " << isalpha(i, loc)<< " " << (isprint(i,loc) ? (char)(i):'?') << endl;
}
It does return Spanish_Spain.1252 but unfortunately, the loop iterations print the same data as the default C locale (using VC++ 8 / VS 2005).
Christophe shows different (desired) results in his answer below, but he uses a much newer version of VC++.
The code chart you found on the internet is actually Windows OEM code page 437, which was never endorsed as a standard. Although it is sometimes called "extended ASCII", that description is highly misleading. (See the Wikipedia article Extended ASCII: "The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.")
You can find the history of OEM437 on Wikipedia, in various versions.
What was endorsed as a standard 8-bit encoding is ISO-8859-1, which later became the first 256 code points in Unicode. (It's one of a series of 8-bit encodings designed for use in different parts of the world; ISO-8859-1 is specified for the Americas and Western Europe.) So that's what you will find in most computers produced in this century in those regions, although more recently more and more operating systems are converting to full Unicode support.
The value you see for (unsigned int)'ñ' is the result of casting the ISO-8859-1 code 0xF1 from a (signed) char (that is, -15) to an unsigned int. Had you cast it to an int, you would have seen -15.
I thought that the decimal value for ‘ñ’ was going to be 164 as listed on the ASCII chart at www.asciitable.com.
Asciitable.com appears to give the code for the old IBM437 DOS character set (still used in the Windows command prompt), in which ñ is indeed 164. But that's just one of hundreds of “extended ASCII” variants.
The value 4294967281 = 0xFFFFFFF1 you got is a sign-extension of the (signed) char value 0xF1, which is how ñ is encoded in ISO-8859-1 and close variants like Windows-1252.
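That sign extension is easy to reproduce (a sketch assuming a platform where plain char is signed and the execution character set encodes 'ñ' as 0xF1, as ISO-8859-1 and Windows-1252 do):

#include <cstdio>

int main()
{
    char c = '\xF1';                        // 'ñ' in ISO-8859-1 / Windows-1252
    std::printf("%d\n", (int)c);            // -15 on a signed-char platform
    std::printf("%u\n", (unsigned int)c);   // 4294967281 (0xFFFFFFF1), the value from the question
    std::printf("%d\n", (unsigned char)c);  // 241 (0xF1), the intended code
    return 0;
}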
To start with, you're trying to reinvent std::isalpha. But you'll need to pass an ISO-8859-1 locale; IIRC, by default it just checks ASCII.
The behavior you see is because char is signed: you didn't compile with /J (which makes char unsigned and is the smart thing to do when you use more than just ASCII); VC++ defaults to signed char.
There is already plenty of information here. However, I'd like to propose some ideas to address your initial problem, namely the categorisation of the extended character set.
For this, I suggest the use of <locale> (country-specific matters), and especially the new locale-aware forms of isalpha(), isspace(), isprint(), ...
Here is a little piece of code to help you find out which chars could be a letter in your local alphabet:
std::locale::global(std::locale("")); // sets the environment default locale currently in place
std::cout << std::locale().name() << std::endl; // display name of current locale
std::locale loc; // use a copy of the active global locale (you could use another)
for (int i = 0; i < 255; i++) {
    cout << i << " " << isalpha(i, loc) << " " << (isprint(i, loc) ? (char)(i) : '?') << endl;
}
This will print out the character codes from 0 to 254, followed by an indicator of whether the code is a letter according to the locale settings, and the character itself if it is printable.
For example, on my PC, all the accented characters, as well as ñ and the Greek letters, are considered alpha, whereas £ and the mathematical symbols are considered printable but not alpha.