This sounds like a simple problem, but C++ is making it difficult (for me at least): I have a wstring and I would like to get the first letter as a wchar_t object and then remove this first letter from the string.
This here does not work for non-ASCII characters:
wchar_t currentLetter = word.at(0);
Because it returns two characters (in a loop) for characters such as German Umlauts.
This here does not work, either:
wchar_t currentLetter = word.substr(0,1);
error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'
And neither does this:
wchar_t currentLetter = word.substr(0,1).c_str();
error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'
Any other ideas?
Cheers,
Martin
---- Update -----
Here is some executable code that should demonstrate the problem. This program will loop over all letters and output them one by one:
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
using namespace std;
int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;
    wchar_t currentLetter;
    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
However, the actual output I get is:
f?r
? ? ?
Letter: f
Letter: ?
Letter: r
The source file is encoded in UTF8 and the console's encoding is also set to UTF8.
Here's a solution provided by Sehe:
#include <algorithm>  // std::copy
#include <cstdlib>    // EXIT_SUCCESS
#include <iostream>
#include <iterator>   // std::back_inserter
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
using namespace std;
// Converts a sequence of code points (here: the wchar_t values of a wstring) to UTF-8.
template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);
    std::copy(begin(in), end(in), utf8out);
    return result;
}
int main() {
    wstring word = L"für";
    bool isLastLetter;
    do {
        isLastLetter = ( word.length() == 1 );
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;
        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);
    return EXIT_SUCCESS;
}
Output:
Letter: f
Letter: ü
Letter: r
Yes you need Boost, but it seems that you're going to need an external library anyway.
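For the extraction itself, the standard library is probably enough: on a platform where one wchar_t holds one whole character (true for ü everywhere, since it lies in the Basic Multilingual Plane), at(0) already returns the complete letter, and the ? in the output most likely comes from the output stream rather than from the string. A minimal sketch, where take_first is a hypothetical helper name and not part of any library:

```cpp
#include <string>

// Hypothetical helper: return the first wchar_t of word and remove it.
// Precondition: word is not empty.
wchar_t take_first(std::wstring& word) {
    wchar_t first = word.front();
    word.erase(0, 1);  // drop the first code unit in place
    return first;
}
```

To print the result with std::wcout, the program would typically also need to select a locale first, for example with std::setlocale(LC_ALL, "") before the first output, so the stream can convert wide characters to the console's encoding.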
1
C++ has no idea of Unicode. Use an external library such as ICU
(UnicodeString class) or Qt (QString class), both support Unicode,
including UTF-8.
2
Since UTF-8 has variable length, all kinds of indexing will do
indexing in code units, not codepoints. It is not possible to do
random access on codepoints in a UTF-8 sequence because of its
variable length nature. If you want random access you need to use a
fixed length encoding, like UTF-32. For that you can use the U prefix
on strings.
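To illustrate the difference between code units and code points, here is a small sketch. The names utf8_word, utf32_word and code_point_at are made up for this example, and the word "für" is spelled with escapes so the result does not depend on how the source file is encoded:

```cpp
#include <cstddef>
#include <string>

// "für" as UTF-8: ü takes two bytes (0xC3 0xBC), so the string has 4 code units.
const std::string utf8_word = "f\xC3\xBCr";

// "für" as UTF-32: one code unit per code point, so the string has 3 code units.
const std::u32string utf32_word = U"f\u00FCr";

// Random access by code point is only meaningful on the fixed-width string.
char32_t code_point_at(const std::u32string& s, std::size_t i) {
    return s.at(i);  // throws std::out_of_range for a bad index
}
```

Here utf8_word[1] is only the first half of ü, while code_point_at(utf32_word, 1) is the whole letter.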
3
The C++ language standard has no notion of explicit encodings. It only
contains an opaque notion of a "system encoding", for which wchar_t is
a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external
encoding, you must use an external library. The library of choice
would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and
available on many platforms, although on Windows the
WideCharToMultiByte function can produce UTF-8 (with the CP_UTF8 code page).
C++11 adds new UTF8 literals in the form of std::string s = u8"Hello
World: \U0010FFFF";. Those are already in UTF8, but they cannot
interface with the opaque wstring other than through the way I
described.
4 (about source files but still sorta relevant)
Encoding in C++ is quite complicated. Here is my understanding
of it.
Every implementation has to support characters from the basic source
character set. These include common characters listed in §2.2/1
(§2.3/1 in C++11). These characters should all fit into one char. In
addition, implementations have to support a way to name other
characters, called universal character names, which look like
\uffff or \Uffffffff and can be used to refer to Unicode characters. A
subset of them are usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file, to
source characters (used at compile time) is implementation defined.
This constitutes the encoding used.
Related
When trying to determine the length of a low-level character string with the strlen function from <cstring>, I have noticed that it does not work properly when the string contains Spanish characters that do not exist in English, such as the opening exclamation mark ¡, accents or the letter ñ. All these elements are counted as two characters, a situation that is not fixed by setting the locale.
#include <cstring>
#include <iostream>
int main() {
const char * s1 = "Hola!";
const char * s2 = "¡Hola!";
std::cout << s1 << " has " << strlen(s1) << " elements, but " << s2
<< " has " << strlen(s2) << " instead of 6" << std::endl;
}
This is university coursework on low-level strings, so using libraries such as string is not allowed.
strlen gives you the number of non-zero char objects in the buffer pointed to by its argument, up to the first zero char. Your system is apparently using a character encoding (most likely UTF-8) where these problematic characters take up more than one byte (that is, more than one char object).
How to solve this depends on what you're trying to do. For certain operations (such as determining the size of a buffer needed to store the string), the result from strlen is 100% correct, as it's exactly what you need. For most other purposes, welcome to the vast world of character/byte/code-point/whatever nuances. You might want to read up on text encodings, Unicode etc. http://utf8everywhere.org/ might be a good site to start.
You've mentioned this is a university assignment: based on what the teaching goal is, you might need to implement some form of UTF en/de-coding, or just steer clear of non-ASCII characters.
I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly. However, the files are mostly in German and use different encodings:
ISO-8859-1
ISO Latin-1
ASCII
UTF-8
The problem that I am facing is that I cannot find a correct way to apply character conversion functions such as tolower(), and I also get some weird symbols in the terminal when I use std::cout on Ubuntu Linux.
For example, in non UTF-8 files, the word französische is shown as franz�sische, für as
f�r etc... Also, words like Örebro or Österreich are ignored by tolower(). From what I know the "Unicode replacement character" � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case. I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check encoding of file that is about to be opened
open file according to its specific encoding
Convert file input to UTF-8
Process file and apply tolower() etc
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Does my OS need to have the corresponding locale installed in order to process the text (regardless of how the console displays it)? (On Linux, for example, de_DE is not listed when I run locale -a.)
2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
Output of locale -a:
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote, which doesn't work as I want at the moment.
void processFiles() {
std::string filename = "17454-8.txt";
std::ifstream inFile;
inFile.open(filename);
if (!inFile) {
std::cerr << "Failed to open file" << std::endl;
exit(1);
}
//calculate file size
std::string s = "";
s.reserve(filesize(filename) + std::ifstream::pos_type(1));
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + "\n");
}
inFile.close();
std::cout << s << std::endl;
//remove punctuation, numbers, tolower,
//TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
if (std::ispunct(s[i]) || std::isdigit(s[i]))
s[i] = ' ';
if (std::isupper(s[i]))
s[i]=std::tolower(s[i]);
}
//std::cout << s << std::endl;
//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
for (auto & i : tokens)
std::cout << i << std::endl;
//PROCESS TOKENS
return;
}
Unicode defines "code points" for characters. A code point is a number in the range 0 to 0x10FFFF, so it fits in a 32-bit value. There are several encodings. ASCII only uses 7 bits, which gives 128 different chars. The 8th bit was used by vendors such as Microsoft to define another 128 chars, depending on the locale, in so-called "code pages"; standardized 8-bit sets include ISO-8859-1, also known as Latin-1. Nowadays MS uses UTF-16, with 2-byte code units. Because 2 bytes are not enough for the whole Unicode set, UTF-16 encodes code points above U+FFFF as a pair of code units (a surrogate pair).
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 chars are exactly the same as the ASCII chars, with just one byte per character. To represent a character, UTF-8 can use up to 4 bytes. More info on Wikipedia.
While Windows uses UTF-16 for wchar_t and its native APIs, on Linux wchar_t is 32 bits wide and typically holds UTF-32.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, like in this SO answer
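For the specific mix in the question (UTF-8, Latin-1 and ASCII), a rough heuristic can help: text that is not well-formed UTF-8 is probably Latin-1, and pure ASCII is valid in both. The sketch below only checks lead and continuation byte patterns; looks_like_utf8 is a hypothetical name, and the check does not reject every invalid sequence (for example overlong three-byte forms), so treat it as a guess rather than real detection.

```cpp
#include <cstddef>
#include <string>

// Heuristic: returns true if every byte sequence has a plausible UTF-8 shape.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;                        // continuation bytes expected
        if (b < 0x80)                    extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;                        // stray continuation or invalid lead byte
        if (i + extra >= s.size()) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

A Latin-1 "für" (byte 0xFC for ü) fails the check, while the UTF-8 form (bytes 0xC3 0xBC) passes.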
tolower and such functions can work with a locale, e.g.
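#include <iostream>
#include <locale>
int main() {
    wchar_t s = L'\u00D6'; // Latin capital letter O with diaeresis, decimal 214
    // std::locale throws std::runtime_error if this locale is not installed
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
    // Cast to int: we want to print the numeric value, not the character
    std::cout << "s = " << static_cast<int>(s) << std::endl;
    std::cout << "sL= " << static_cast<int>(sL) << std::endl;
    return 0;
}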
outputs:
s = 214
sL= 246
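If the required locale is not installed (as in the question, where de_DE is missing), a locale-independent fallback for German text is possible, because in both ASCII and the Latin-1 range the lowercase letter sits exactly 0x20 above the uppercase one. This is a deliberately limited sketch, not a replacement for proper Unicode case mapping; to_lower_latin1 is a hypothetical name, and the text is assumed to be already decoded into code points.

```cpp
#include <string>

// Lowercase ASCII and Latin-1 letters only; everything else is left untouched.
std::u32string to_lower_latin1(std::u32string text) {
    for (char32_t& c : text) {
        bool ascii_upper  = (c >= U'A' && c <= U'Z');
        // U+00C0..U+00DE are uppercase letters, except U+00D7 (multiplication sign)
        bool latin1_upper = (c >= 0xC0 && c <= 0xDE && c != 0xD7);
        if (ascii_upper || latin1_upper) {
            c += 0x20;  // the lowercase counterpart is 0x20 higher in both ranges
        }
    }
    return text;
}
```

This turns Ö (U+00D6) into ö (U+00F6) and leaves ß, digits and punctuation unchanged.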
In this other SO answer you can find good solutions, such as using iconv on Linux or the iconv library port for Win32.
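For the ISO-8859-1 files specifically, the conversion to UTF-8 is simple enough to sketch by hand, because every Latin-1 byte value is equal to its Unicode code point. The function name latin1_to_utf8 is an assumption made for this example; for other encodings a library such as iconv remains the safer route.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Bytes below 0x80 are ASCII and are copied as-is; bytes 0x80..0xFF
// become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (char ch : latin1) {
        unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            utf8.push_back(static_cast<char>(b));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
        }
    }
    return utf8;
}
```

After this step all files can be processed as UTF-8, regardless of which encoding they started in.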
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
//Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
//English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"
How do I loop through the letters of a string when it has non-ASCII characters?
This works on Windows!
for (int i = 0; i < text.length(); i++)
{
    std::cout << text[i];
}
But on Linux if I do:
std::string text = "á";
std::cout << text.length() << std::endl;
It tells me the string "á" has a length of 2, while on Windows it's only 1.
But with ASCII letters it works fine!
In your windows system's code page, á is a single byte character, i.e. every char in the string is indeed a character. So you can just loop and print them.
On Linux, á is represented as the multibyte (2 bytes to be exact) utf-8 character 'C3 A1'. This means that in your string, the á actually consists of two chars, and printing those (or handling them in any way) separately yields nonsense. This will never happen with ASCII characters because the utf-8 representation of every ASCII character fits in a single byte.
Unfortunately, utf-8 is not really supported by C++ standard facilities. As long as you only handle the whole string and neither access individual chars from it nor assume the length of the string equals the number of actual characters in the string, std::string will most likely do fine.
If you need more utf-8 support, look for a good library that implements what you need.
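For the simple case of looping over the characters, a small hand-written splitter may be enough. The sketch below groups each lead byte with its continuation bytes; split_utf8 is a hypothetical name, the input is assumed to be valid UTF-8, and the units it produces are code points, so combining sequences would still be split.

```cpp
#include <string>
#include <vector>

// Split a UTF-8 string into one substring per code point.
// Assumes valid UTF-8; no validation is attempted.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    for (char ch : text) {
        bool continuation = (static_cast<unsigned char>(ch) & 0xC0) == 0x80;
        if (continuation && !chars.empty()) {
            chars.back().push_back(ch);          // belongs to the current character
        } else {
            chars.push_back(std::string(1, ch)); // ASCII or lead byte starts a new one
        }
    }
    return chars;
}
```

Looping over the returned vector and printing each element with std::cout then outputs whole characters, and its size() gives 1 for "á".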
You might also want to read this for a more detailed discussion on different character sets on different systems and advice regarding string vs. wstring.
Also have a look at this for information on how to handle different character encodings portably.
Try using std::wstring. The encoding used isn't specified by the standard as far as I know, so I wouldn't save these contents to a file without a library that handles a specific format. It supports wide characters, so you can use letters and symbols not supported by ASCII.
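#include <clocale>
#include <cstddef>
#include <iostream>
#include <string>
int main()
{
    std::setlocale(LC_ALL, ""); // use the environment's locale so wcout can convert the wide characters
    std::wstring text = L"áéíóú";
    for (std::size_t i = 0; i < text.length(); i++)
        std::wcout << text[i];
    std::wcout << text.length() << std::endl;
}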
How can I make an arbitrary number be interpreted as Unicode when outputted to the terminal?
So for example:
#include <iostream>
int main() {
int euro_dec = 0x20AC;
std::cout << "from int: " << euro_dec
<< "\nfrom \\u: \u20AC" << std::endl;
return 0;
}
This prints:
from int: 8364
from \u: €
What does the escape sequence \u do to make the number 0x20AC be interpreted as Unicode?
I tested using wcout and the output was:
from int: 8364
from \u:
A unicode escape sequence occurring in program text is converted to the equivalent Unicode character at the very first phase of translation (2.2p1b1 [lex.phases]). This occurs even before the program is tokenized or preprocessed.
To convert a Unicode codepoint expressed as an integer to your native narrow multibyte encoding, use c32rtomb:
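#include <climits>  // MB_LEN_MAX
#include <clocale>
#include <cuchar>
std::setlocale(LC_ALL, ""); // c32rtomb converts to the current locale's encoding
char buf[MB_LEN_MAX];       // MB_CUR_MAX is not a constant expression, so it cannot size an array
std::mbstate_t ps{};
std::size_t ret = std::c32rtomb(buf, static_cast<char32_t>(euro_dec), &ps);
if (ret != static_cast<std::size_t>(-1)) {
    std::cout << std::string(buf, ret); // outputs € if the locale's encoding can represent it
}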
Note that <cuchar> is poorly supported; if you know that your native narrow string encoding is UTF-8 you can use codecvt_utf8<char32_t> (deprecated since C++17), but otherwise you'll have to use platform-specific facilities.
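If the terminal is known to expect UTF-8, the conversion can also be written by hand, since the UTF-8 bit layout is fixed. The following is a sketch under that assumption; encode_utf8 is a hypothetical name and not a standard function.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. Returns an empty string for
// values that cannot be encoded (surrogates, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, std::cout << encode_utf8(0x20AC) writes the three bytes E2 82 AC, which a UTF-8 terminal displays as €.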
When you output the integer variable, the library converts the value to text, it doesn't actually output the value as an integer.
When using the "\u", it's the compiler which reads the number and converts it to the appropriate byte sequence which it inserts directly into the literal string.
I wanted to write a function which returns true if a given character is a Russian vowel. But the results I get are strange to me. This is what I've got so far:
#include <iostream>
using namespace std;
bool is_vowel_p(char working_char)
// returns true if the character is a russian vowel
{
string matcher = "аяёоэеуюыи";
if (find(matcher.begin(), matcher.end(), working_char) != matcher.end())
return true;
else
return false;
}
void main()
{
cout << is_vowel_p('е') << endl; // russian vowel
cout << is_vowel_p('Ж') << endl; // russian consonant
cout << is_vowel_p('D') << endl; // latin letter
}
The result is:
1
1
0
which is strange to me. I expected the following result:
1
0
0
It seems that there is some kind of internal mechanism which I don't know yet. I'm first of all interested in how to fix this function so that it works properly, and second, in what is going on there that gives me this result.
string and char are only guaranteed to represent characters in the basic character set - which does not include the Cyrillic alphabet.
Using wstring and wchar_t, and adding L before the string and character literals to indicate that they use wide characters, should allow you to work with those letters.
Also, for portability you need to include <algorithm> for find, and give main a return type of int.
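A wide-character version along those lines might look like the sketch below. The name is_vowel is made up for this example, the vowel list covers only the ten lowercase Russian vowel letters, and the letters are written as \u escapes so the result does not depend on the encoding of the source file.

```cpp
#include <string>

// Returns true if c is a lowercase Russian vowel.
bool is_vowel(wchar_t c) {
    // а я ё о э е у ю ы и, spelled as universal character names
    static const std::wstring vowels =
        L"\u0430\u044F\u0451\u043E\u044D\u0435\u0443\u044E\u044B\u0438";
    return vowels.find(c) != std::wstring::npos;
}
```

Called with L'е', L'Ж' and L'D', this gives true, false and false, which matches the result the question expected.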
The comparison is being done using 8-bit char values, but the characters you are entering are not ASCII and do not fit in a single char. I bet one of the vowels fulfills the following:
(vowel & 255) == ((code point for 'Ж') & 255)
You need to use Unicode functions to do this, not ASCII functions, i.e. use functions that take wchar_t values. Also, make sure your compiler can parse the non-ASCII vowel string. Using MS VC, the compiler requires:
L"аяёоэеуюыи" or TEXT("аяёоэеуюыи")
the latter is a macro that adds the L when compiling with unicode support.
Convert the code to use wchar_t and it should work.
A very useful function in locale.h:
setlocale(LC_ALL, "Russian");
Paste this at the beginning of the program.
Example:
#include <stdio.h>
#include <locale.h>
int main()
{
    setlocale(LC_ALL, "Russian");
    printf("Здравствуй, мир!\n"); // Hello, world!
    return 0;
}
Make sure your system default locale is Russian, and make sure your file is saved as codepage 1251 (Cyrillic/Windows). If it's saved as Unicode, this won't ever work.
The system default locale is the one used by non-Unicode-compliant programs. It's in Control Panel, under Regional settings.
Alternatively, rewrite it to use wstring and wchar_t and L"" string/char literals.