How to apply <cctype> functions to text files with different encodings in C++

I would like to split some files (around 1000) into words and remove numbers and punctuation, then process the tokenized words accordingly. However, the files are mostly in German and are encoded in different ways:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem I am facing is that I cannot find a correct way to apply character conversion functions such as tolower(), and I also get some weird symbols in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files, the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the "Unicode replacement character" � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case. I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check encoding of file that is about to be opened
open file according to its specific encoding
Convert file input to UTF-8
Process file and apply tolower() etc
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled globally in order to process text (regardless of how the console displays it)? On Linux, for example, de_DE is not listed when I run locale -a.
2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
Output of locale -a:
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote; it does not yet work the way I want.
void processFiles() {
std::string filename = "17454-8.txt";
std::ifstream inFile;
inFile.open(filename);
if (!inFile) {
std::cerr << "Failed to open file" << std::endl;
exit(1);
}
//calculate file size
std::string s = "";
s.reserve(filesize(filename) + std::ifstream::pos_type(1));
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + "\n");
}
inFile.close();
std::cout << s << std::endl;
//remove punctuation, numbers, tolower,
//TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
std::setlocale(LC_ALL, "de_DE.iso88591");
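for (std::size_t i = 0; i < s.length(); ++i) {
// <cctype> functions expect a value representable as unsigned char; passing a negative char is undefined behaviour
unsigned char c = static_cast<unsigned char>(s[i]);
if (std::ispunct(c) || std::isdigit(c))
s[i] = ' ';
else if (std::isupper(c))
s[i] = static_cast<char>(std::tolower(c));
}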
//std::cout << s << std::endl;
//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
for (auto & i : tokens)
std::cout << i << std::endl;
//PROCESS TOKENS
return;
}

Unicode defines "code points" for characters. A code point is a number in the range U+0000 to U+10FFFF, so it needs 21 bits and is usually stored in a 32-bit value. There are several encodings. ASCII uses only 7 bits, which gives 128 different characters. The 8th bit was used by various vendors and standards to define another 128 characters, depending on the locale: Microsoft called its variants "code pages", and the ISO-8859 family includes ISO-8859-1, also known as "Latin-1". Nowadays Windows uses UTF-16, which stores most characters in one 2-byte unit and covers the rest of the Unicode set with a pair of units (a surrogate pair). Unlike code pages, UTF-16 does not depend on the locale.
Most used on Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 characters are exactly the same as ASCII, with just one byte per character. UTF-8 can use up to 4 bytes to represent a character. More info on Wikipedia.
While Windows uses UTF-16 for wide strings in memory, wchar_t on Linux is 32 bits wide and typically holds UTF-32.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, like in this SO answer
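Since the files in the question are limited to ASCII, ISO-8859-1 and UTF-8, a much narrower heuristic than general detection may be enough. The following is a minimal sketch, not a general detector: it assumes any input that is not well-formed UTF-8 is ISO-8859-1, and the function names (is_valid_utf8, latin1_to_utf8, to_utf8_guessing) are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms,
// surrogates, and values above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;  // overlong form
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

// ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF, so every byte
// >= 0x80 becomes a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
std::string to_utf8_guessing(const std::string& raw) {
    return is_valid_utf8(raw) ? raw : latin1_to_utf8(raw);
}
```

Pure ASCII passes the UTF-8 check unchanged, so all three encodings end up as UTF-8. The guess can be wrong for Latin-1 text that happens to form valid UTF-8 sequences, which is rare in German prose but possible.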
tolower and such functions can work with a locale, e.g.
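#include <iostream>
#include <locale>
int main() {
wchar_t s = L'\u00D6'; // Latin capital letter O with diaeresis, decimal 214
// std::locale throws std::runtime_error if "en_US.UTF-8" is not installed
wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
// Cast to int to print the numeric code; streaming a wchar_t to std::cout is ill-formed since C++20
std::cout << "s = " << static_cast<int>(s) << std::endl;
std::cout << "sL= " << static_cast<int>(sL) << std::endl;
return 0;
}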
outputs:
s = 214
sL= 246
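The locale-based call above only works when the named locale is installed. If that cannot be relied on, a hand-written mapping is one possible fallback. The sketch below covers only ASCII and the Latin-1 Supplement block (enough for German umlauts); the names to_lower_latin1 and lower_text are illustrative, and this is not a substitute for full Unicode case mapping such as ICU provides.

```cpp
#include <string>

// Lower-cases a single code point, handling only A-Z and the capitals
// U+00C0..U+00DE. U+00D7 (multiplication sign) sits inside that range
// and is not a letter, so it is excluded. Everything else is returned
// unchanged, including U+00DF (sharp s), which is already lower case.
char32_t to_lower_latin1(char32_t cp) {
    const bool ascii_upper = (cp >= U'A' && cp <= U'Z');
    const bool latin1_upper =
        (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7);
    return (ascii_upper || latin1_upper)
               ? static_cast<char32_t>(cp + 0x20)
               : cp;
}

// Applies the mapping to every code point of a UTF-32 string.
std::u32string lower_text(std::u32string text) {
    for (char32_t& cp : text) cp = to_lower_latin1(cp);
    return text;
}
```

This works on UTF-32 code points, so UTF-8 input has to be decoded first. In both blocks the lower-case letter sits exactly 0x20 above its capital, which is why a single offset is enough.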
In this other SO answer you can find good solutions, such as using the iconv library (available on Linux and as a Win32 port).
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
# German
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

Related

C++ Non ASCII letters

How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
std::cout << text[i];
}
But on linux if i do:
std::string text = "á";
std::cout << text.length() << std::endl;
It tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!
In your windows system's code page, á is a single byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
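The difference between bytes and characters can be made concrete with a small helper. This is a minimal sketch that assumes the string holds well-formed UTF-8; utf8_length is a made-up name, and it counts code points, not user-perceived characters (a letter followed by a combining accent would count as two).

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the string "á" stored as the bytes C3 A1, size() reports 2 while utf8_length reports 1.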
If you need more utf-8 support, look for a good library that implements what you need.
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
Try using std::wstring. It stores wide characters (wchar_t), so it can hold letters and symbols not supported by ASCII. The encoding it uses isn't specified by the standard as far as I know, so I wouldn't save these contents to a file without a library that handles a specific format.
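#include <iostream>
#include <locale>
#include <string>
int main()
{
// Use the user's locale; in the default "C" locale std::wcout typically fails on non-ASCII characters
std::locale::global(std::locale(""));
std::wstring text = L"áéíóú";
for (std::size_t i = 0; i < text.length(); i++)
std::wcout << text[i];
std::wcout << text.length() << std::endl;
}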
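An alternative that does not depend on wchar_t or on the active locale is to decode the UTF-8 bytes into a std::u32string and loop over that. The decoder below is a simplified sketch: decode_utf8 is an illustrative name, malformed bytes are replaced with U+FFFD, and overlong forms are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32, one char32_t per code point.
// Invalid or truncated sequences yield U+FFFD and consume one byte.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // default for an invalid lead byte
        if (c < 0x80)                { cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }

        bool ok = (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        if (!ok) { cp = 0xFFFD; len = 1; }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, a plain index loop visits one code point per step, and length() matches the number of code points on both Windows and Linux. Printing the result still requires encoding it back to whatever the terminal expects.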

C++ - Replace special characters from file with non ascii characters

I'm having a difficult time replacing some characters from my file with some diacritics from my mother tongue ; such as:
character_to_replace replacement
º ș
ª Ș
þ ț
Þ Ț
I've found the Unicode code points for the characters to replace, but for some reason the file is not saved with the expected output. I figured it has something to do with the UTF-8 and Unicode conversion. I managed to print the characters to the console, but when I try to write them to the file it doesn't work. Here's my code:
void replace(string &source, string to_replace, string replacement)
{
int found = 0;
string auxiliar;
auxiliar = source;
while (found != string::npos)
{
found = auxiliar.find(to_replace);
if (found != -1)
{
source.replace(found, 1, replacement);
auxiliar = auxiliar.substr(found + to_replace.size());
}
}
}
int main()
{
cout << endl;
string line;
ifstream file;
ofstream send_line;
send_line.open("out.txt");
file.open("in.txt");
while (!file.eof())
{
getline(file, line);
replace(line, "\u00b0", "\u0219");
replace(line, "\u00aa", "\u0218");
replace(line, "\u00fe", "\u021b");
replace(line, "\u00de", "\u021a");
send_line << line << "\n";
}
file.close();
send_line.close();
}
Can you point me to the right direction where I may solve this ? Thank you.
What system are you using?
It looks like the file you're processing may be encoded in UTF8, but the ≤ character isn't in the codeset underlying the locale you're using.
Try running the command locale to see what locale you are using. If the LC_CTYPE entry does not end in something like UTF-8, you might try the command:
locale -a
to get a list of available locales and look for one that fits your language and location with a UTF-8 codeset. The locale names aren't standardized, but a common convention is a 2-letter code for your language, an underscore, a 2-letter country code, a period, and a codeset identifier. The locale I use much of the time is en_US.UTF-8 (English, United States of America, UTF-8) on OS X, and the above commands work without error in this locale.
You can use the environment variables LANG and LC_* to set the locale for standard utilities that you run. Good applications let the environment variables control the locale they use. Applications that don't set their locale based on what the user requests will probably run in the C or POSIX locale.
Please follow this link
http://www.unix.com/unix-for-dummies-questions-and-answers/220029-remove-replace-non-ascii-character-file.html
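Independently of the locale, the replace function in the question always replaces exactly one byte, while each of these characters occupies two bytes in UTF-8. A byte-wise replace-all that uses the full length of the search string is sketched below. It assumes both the file contents and the string literals are UTF-8 (if the input file is actually ISO-8859-1, the search strings would have to be single bytes instead); replace_all is an illustrative name.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` with `to`, working on raw bytes.
// This is safe for well-formed UTF-8 because the encoding of one
// character can never match in the middle of another.
void replace_all(std::string& text, const std::string& from,
                 const std::string& to) {
    if (from.empty()) return;  // avoid an endless loop
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);  // whole sequence, not 1 byte
        pos += to.size();                    // continue after the insert
    }
}
```

With UTF-8 data, º (U+00BA) is the byte pair C2 BA and ș (U+0219) is C8 99, so the call would be replace_all(line, "\xC2\xBA", "\xC8\x99"). Note that the code point of º is U+00BA; U+00B0 is the degree sign.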

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know C++ has wide-character facilities (the w-prefixed streams and strings) for reading wchar_t. I tried the following:
#include <fstream>
#include <iostream>
int main() {
std::wfstream file("file"); // aaaàaaa
std::wstring str;
std::getline(file, str);
std::wcout << str << std::endl; // aaa
}
But as you can see, it did not read the full line. It stops when it reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8, this is likely given that you are on Linux. I'll assume your terminal also supports it. If you directly read using a std::string, with chars, you will have everything working. Look:
// olá
#include <iostream>
#include <fstream>
int main() {
std::fstream file("test.cpp");
std::string str;
std::getline(file, str);
std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/    /    (space)  o    l    á
47   47   32       111  108  195 161
Note that á is encoded with two bytes. If you ask the size of the string (str.size()), you will indeed get the wrong value: 7. This happens because the string thinks every byte is a char. When you send it to std::cout, the string will be given to the terminal to print. And the magical part: The terminal works with utf-8 by default. So it just assumes the string is utf-8 and correctly prints 6 chars.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
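One example of an operation that can break the encoding is cutting the string at an arbitrary byte offset, since the cut may land between the two bytes of á. A possible way to avoid this is to back up to the start of a character before cutting. The sketch below assumes well-formed UTF-8 input; utf8_prefix is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Returns at most `max_bytes` bytes of `s` without splitting a
// multi-byte UTF-8 sequence: if the cut position points at a
// continuation byte (10xxxxxx), move it back to the lead byte.
std::string utf8_prefix(const std::string& s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) return s;
    std::size_t end = max_bytes;
    while (end > 0 &&
           (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

For the 7-byte line above, s.substr(0, 6) would keep the byte 195 without its partner 161, whereas utf8_prefix(s, 6) returns the 5 bytes of "// ol".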
Let's go for wstrings. They store each letter with a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible unicode character. But it will not work directly because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last factor is important and the default "C" encoding says: "Assume everything is ASCII". When it is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
int main() {
std::ios::sync_with_stdio(false);
std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
std::wcout.imbue(loc); // Use it for output
std::wfstream file("test.cpp");
file.imbue(loc); // Use it for file input
std::wstring str;
std::getline(file, str); // str.size() will be 6
std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because it enables you to use both cout and printf in the same program. We have to disable it because C streams will break the UTF-8 encoding and produce garbage in the output.

C++ How to get first letter of wstring

This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German Umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:
#include <iostream>
using namespace std;
int main() {
wstring word = L"für";
wcout << word << endl;
wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;
wchar_t currentLetter;
bool isLastLetter;
do {
isLastLetter = ( word.length() == 1 );
currentLetter = word.at(0);
wcout << L"Letter: " << currentLetter << endl;
word = word.substr(1, word.length()); // remove first letter
} while (word.length() > 0);
return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF8 and the console's encoding is also set to UTF8.
Here's a solution provided by Sehe:
#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;
template <typename C>
std::string to_utf8(C const& in)
{
std::string result;
auto out = std::back_inserter(result);
auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
std::copy(begin(in), end(in), utf8out);
return result;
}
int main() {
wstring word = L"für";
bool isLastLetter;
do {
isLastLetter = ( word.length() == 1 );
auto currentLetter = to_utf8(word.substr(0, 1));
cout << "Letter: " << currentLetter << endl;
word = word.substr(1, word.length()); // remove first letter
} while (word.length() > 0);
return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes you need Boost, but it seems that you're going to need an external library anyway.
1
C++ has no idea of Unicode. Use an external library such as ICU
(UnicodeString class) or Qt (QString class), both support Unicode,
including UTF-8.
2
Since UTF-8 has variable length, all kinds of indexing will do
indexing in code units, not code points. It is not possible to do
random access on code points in a UTF-8 sequence because of its
variable-length nature. If you want random access you need to use a
fixed-length encoding, like UTF-32. For that you can use the U prefix
on strings.
3
The C++ language standard has no notion of explicit encodings. It only
contains an opaque notion of a "system encoding", for which wchar_t is
a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external
encoding, you must use an external library. The library of choice
would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and
available on many platforms, although on Windows the
WideCharToMultiByte function is guaranteed to produce UTF-8.
C++11 adds new UTF-8 literals in the form of
std::string s = u8"Hello World: \U0010FFFF";. Those are already in
UTF-8, but they cannot interface with the opaque wstring other than
through the way I described.
4 (about source files but still sorta relevant)
Encoding in C++ is quite a bit complicated. Here is my understanding
of it.
Every implementation has to support characters from the basic source
character set. These include common characters listed in §2.2/1
(§2.3/1 in C++11). These characters should all fit into one char. In
addition implementations have to support a way to name other
characters using a way called universal character names and look like
\uffff or \Uffffffff and can be used to refer to unicode characters. A
subset of them are usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file, to
source characters (used at compile time) is implementation defined.
This constitutes the encoding used.
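The suggestion in quote 2 (a fixed-length encoding with the U prefix) can be applied directly to the original task of taking the first letter and removing it from the string. The following is a minimal sketch using std::u32string; take_first_letter is an illustrative name, and turning the result back into UTF-8 for output is not shown.

```cpp
#include <string>

// Removes the first code point from `word` and returns it.
// Throws std::out_of_range (from at()) if `word` is empty.
char32_t take_first_letter(std::u32string& word) {
    const char32_t first = word.at(0);
    word.erase(0, 1);
    return first;
}
```

With std::u32string word = U"für";, three successive calls return f, ü and r, and each element is a whole code point regardless of platform. A letter built from a base character plus a combining mark would still span two elements.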

g++ unicode character ifstream

This is a question about Unicode characters in a text input file. This discussion was close but not quite the answer. Compiled with VS2008 and executed on Windows, these characters are recognized on read (maybe represented as a different symbol, but read); compiled with g++ and executed on Linux, they are displayed as blank.
‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
The rest of the Unicode symbols appear to work fine, I did not check them all but found this set did not work.
Questions:
(1) Why?
(2) Is there a solution?
void Lexicon::buildMapFromFile(string filename ) //map
{
ifstream file;
file.open(filename.c_str(), ifstream::binary);
string wow, mem, key;
unsigned int x = 0;
while(true) {
getline(file, wow);
cout << wow << endl;
if (file.fail()) break; //boilerplate check for error
while (x < wow.length() ) {
if (wow[x] == ',') { //look for csv delimiter
key = mem;
mem.clear();
x++; //step over ','
} else
mem += wow[x++];
}
//cout << mem << " code " << key << " is " << (key[0] - '€') << " from €" << endl;
cout << "enter 1 to continue: ";
while (true) {
int choice = GetInteger();
if (choice == 1) break;
}
list_map0[key] = mem; //char to string
list_map1[mem] = key; //string to char
mem.clear(); //reset memory
x = 0;//reset index
}
//printf("%d\n", list_map0.size());
file.close();
}
The Unicode symbols are read from a CSV file and parsed for the Unicode symbol and an associated string. Initially I thought there was a bug in the code, but in this post the review found it is fine, and I traced the issue to how the characters are handled.
The test is cout << wow << endl;
The characters you show are all characters from Windows codepage 1252 which do not exist in the ISO-8859-1 encoding. These two encodings are similar and so are often confused.
If the input is CP1252 and you are reading it as though it were ISO-8859-1, then those characters are read as control characters and will not behave as normal, visible characters.
There are many possible things you could be doing incorrectly to cause this, but you'll have to post more details in order to determine which. A more complete answer requires knowing how you are reading the data, how you are converting and storing it internally, how you are testing the read data, and the input data and/or encoding.
Your displayed code doesn't do any conversions while reading the data, and the commented-out code to print the data is the same; no conversions. This means to print the data you are simply relying on the input data to be correct for the platform you run the program on. That means that, for example, if you run your program in the console on Windows then your input file needs to be encoded using the console's codepage*.
To fix the problem you can either ensure the input file matches the encoding required by the particular console you run the program on, or specify the input encoding, convert to a known internal encoding when reading, and then convert to the console encoding when printing.
* and if it's not, for example if the console is cp437 and the file is cp1252 then the characters you listed will instead show up as: É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
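The second option (specify the input encoding and convert on read) could look like the sketch below for CP1252 input and UTF-8 as the internal encoding. CP1252 differs from ISO-8859-1 only in the byte range 0x80 to 0x9F, so a 32-entry table is enough. The names kCp1252High, append_utf8 and cp1252_to_utf8 are illustrative, and mapping the five unassigned bytes to U+FFFD is a choice made here, not something the answer prescribes.

```cpp
#include <string>

// Unicode code points for the CP1252 bytes 0x80..0x9F.
// 0xFFFD marks the five bytes that CP1252 leaves unassigned.
static const char32_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178};

// Appends one code point below U+10000 as UTF-8 (1 to 3 bytes).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts CP1252 bytes to UTF-8. Outside 0x80..0x9F the byte value
// equals the Unicode code point, exactly as in ISO-8859-1.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        const char32_t cp =
            (c >= 0x80 && c <= 0x9F) ? kCp1252High[c - 0x80] : c;
        append_utf8(out, cp);
    }
    return out;
}
```

Each line returned by getline could be passed through cp1252_to_utf8 before printing, which should display the listed characters on a UTF-8 terminal.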
Your problem statement does not detail the platform for g++, but from your tags it appears that you are compiling that same code on Linux.
Windows and Linux are both Unicode enabled. So, if your code on Windows/VS2008 used the wstring class, you have to change it back to string on Linux/g++. If you are using wstring on Linux, it will not work the same way.
Unicode handling in C++ code is not straightforward and is implementation dependent (you have already seen that the output changes between VS2008 and g++). Furthermore, Unicode can be implemented by different character encodings (like UTF-8 and UTF-16).
There is an enlightening article in this page. It talks about Unicode conversion for STL-based software. For text i/o the main weapon is codecvt, a library function that can be used for translating strings between different character encoding systems.