Problem with getline and "strange characters" - c++

I have a strange problem. I use:
wifstream a("a.txt");
wstring line;
while (a.good()) // !a.eof() not helping
{
    getline(a, line);
    //...
    wcout << line << endl;
}
and it works nicely for a txt file like this one:
http://www.speedyshare.com/files/29833132/a.txt
(sorry for the link, but it is just 80 bytes so it shouldn't be a problem to get it; if I copy/paste it on SO the newlines get lost)
BUT when I add, for example, 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples) to any line, that is the line where loading stops. I was under the wrong impression that the getline that takes a wstring as one input and a wifstream as the other could chew any txt input...
Is there any way to read every single line in the file even if it contains funky characters?

The not-very-satisfying answer is that you need to imbue the input stream with a locale which understands the particular character encoding in question. If you don't know which locale to choose, you can use the empty locale.
For example (untested):
std::wifstream a("a.txt");
std::locale loc("");
a.imbue(loc);
Unfortunately, there is no standard way to determine what locales are available for a given platform, let alone select one based on the character encoding.
The above code puts the locale selection in the hands of the user, and if they set it to something plausible (e.g. en_AU.UTF-8) it might all Just Work.
Failing this, you probably need to resort to third-party libraries such as iconv or ICU.
Also relevant is this blog entry (apologies for the self-promotion).
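A complete sketch of the same approach (also untested), with the read loop tightened to test the stream state directly instead of calling good():

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream a("a.txt");
    a.imbue(std::locale(""));       // user's preferred locale; must happen before reading

    std::wstring line;
    while (std::getline(a, line))   // stops cleanly at EOF or on a conversion error
        std::wcout << line << L'\n';
}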

The problem is with your call to the global function getline (a,line). This takes a std::string. Use the std::wistream::getline method instead of the getline function.

C++ fstreams delegate I/O to their filebufs. filebufs always read "raw bytes" from disk and then use the stream locale's codecvt facet to convert these raw bytes to their "internal encoding".
A wfstream is a basic_fstream<wchar_t> and thus has a basic_filebuf<wchar_t>, which uses the locale's codecvt<wchar_t, char> to convert the bytes read from disk into wchar_ts. If you read a UCS-2 encoded file, the conversion must thus be performed with a codecvt that "knows" that the external encoding is UCS-2. You therefore need a locale with such a codecvt (see, for example, this SO question).
By default, the stream's locale is the global locale at the time of the stream's construction. To use a specific locale, it should be imbue()-d on the stream.
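To illustrate that last point, a small (untested) sketch of the two ways to get a locale onto a stream; the file names are placeholders:

#include <fstream>
#include <locale>

int main()
{
    // Option 1: set the global locale before constructing the stream;
    // the stream picks it up at construction time.
    std::locale::global(std::locale(""));
    std::wifstream a("a.txt");

    // Option 2: imbue a specific locale on an already-constructed stream,
    // before any characters have been read from it.
    std::wifstream b("b.txt");
    b.imbue(std::locale(""));
}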

Related

Why do the read() and get() methods of std::wistream read byte-width characters?

I'm trying to find out whether a file has a Unicode BOM at its beginning. I prefer to use the standard iostream library. I tried to solve this task as follows:
std::wifstream str(filename);
wchar_t bom;
str.get(bom);
I assumed that because wchar_t characters are two bytes in size, this code would read the first two bytes from the file, but it reads only the first byte, 0xFF.
I understand this can be solved via an "ordinary" stream, but I have an academic interest: why does the given code return only one byte?
basic_istream::get tries to read one character from a stream and convert it to whatever type basic_istream is templated on.
What constitutes a character in a stream (the character encoding of the stream) is determined by the locale, not by the type basic_istream is templated on.
Thus, if you need to impose a 16-bit character encoding, you need to imbue a C++ locale with 16-bit character encoding in the stream, regardless of whether it is ifstream or wifstream. As far as I know, there are no 16-bit locales built into Windows. You may construct such C++ locale from a system-supplied locale by adding a codecvt facet, for example like this:
std::wifstream str(filename);
str.imbue(std::locale(str.getloc(),
    new std::codecvt_utf16<wchar_t, 0x10ffff,
        std::codecvt_mode::little_endian>));
Skip std::codecvt_mode::little_endian if your encoding is big endian. You can also skip the BOM by using std::codecvt_mode::consume_header.
std::codecvt_utf16 is deprecated since C++17, so you are on your own if you decide to use it. You can also build your own codecvt facet.
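Putting it together, an (untested) sketch of the BOM-reading experiment with the facet imbued; the file name is a placeholder:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    std::wifstream str("file.txt");
    str.imbue(std::locale(str.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode::little_endian>));

    wchar_t bom = 0;
    str.get(bom);  // now consumes two bytes per character
    std::wcout << std::hex << static_cast<unsigned long>(bom) << L'\n';  // feff for a little-endian BOM
}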

How to convert UTF-8 text from a file to a container that can be iterated, checking every symbol for being alphanumeric, in C++?

I read around 20 questions and checked the documentation with no success. I don't have any experience writing code that handles this stuff; I have always avoided it.
Let's say I have a file that I am sure will always be UTF-8:
á
Let's say I have code:
wifstream input{argv[1]};
wstring line;
getline(input, line);
When I debug it, I see it's stored as L"Ã¡", so it's not iterable the way I want; I want to have just one symbol so that I can call, say, iswalnum(line[0]).
I realized that there is some codecvt facet, but I am not sure how to use it or whether it's the best way, and I use cl.exe from VS2019, which gives me a lot of conversion and deprecation errors on the example provided:
https://en.cppreference.com/w/cpp/locale/codecvt_utf8
I realized that there is a from_bytes function, but I use cl.exe from VS2019, which gives me a lot of errors on the example provided, too:
https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
So how do I correctly read a line containing, say, that letter (symbol) á, so that I can iterate over it as a container of size 1 and simply call a function like iswalnum?
EDIT: When I fix the bugs in those examples (for /std:c++latest), I still get Ã¡ in UTF-8 and á in UTF-16.
L"á" means the file was read with a wrong encoding. You have to imbue a UTF-8 locale before reading the stream.
wifstream input{argv[1]};
input.imbue(std::locale("en_US.UTF-8"));
wstring line;
getline(input, line);
Now wstring line will contain Unicode code points (á in your case) and can be easily iterated.
Caveat: on Windows wchar_t is deficient (16-bit) and is good enough for iterating over the BMP only.
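A self-contained (untested) sketch of this approach; the POSIX-style locale name is an assumption, and on Windows the wchar_t caveat above applies:

#include <clocale>
#include <cwctype>
#include <fstream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    if (argc < 2)
        return 1;

    std::setlocale(LC_ALL, "en_US.UTF-8");   // iswalnum consults the C locale
    std::wifstream input{argv[1]};
    input.imbue(std::locale("en_US.UTF-8")); // decode the UTF-8 bytes into wide characters

    std::wstring line;
    std::getline(input, line);

    for (wchar_t ch : line)                  // one element per code point (BMP only on Windows)
    {
        if (std::iswalnum(ch))
        {
            // ch is an alphanumeric symbol such as á
        }
    }
}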

How to convert a UTF-8 string to the encoding of a stream

Imagine I have decided to use UTF-8 everywhere internally in my C++11 program, so I have a std::string that contains text encoded in UTF-8. I now want to do some IO with that text, writing it to std::cout, for example. Although I've used UTF-8 internally, I cannot assume that the program's user and operating environment are so obliging as to use UTF-8 too. For good or bad reasons, the character encoding of the text that I ought to send through std::cout might not be UTF-8. My program must perform a conversion, taking my UTF-8 encoded text and converting it to the encoding that std::cout expects. How can I find out the encoding of that output stream, and then do the character encoding conversion?
Looking at the declarations of the standard C++ streams, it looks like I can use std::ios_base::getloc to get the "locale" of the output stream, and then get a std::codecvt "code conversion facet" from it. But which facet should I get? And how do I actually use that facet to convert from UTF-8 to the output encoding?
And if those facilities of the standard library can not do the task, what other options do I have?
How can I find out the encoding on that output stream
You don't.
The expectations of the receiver of any output stream that is not yourself (whether cout, cerr, a file-stream, or whatever) are not something that you can determine. The concept of "standard output" does not come bundled with an associated concept of "encoding". Encoding expectations are implicit, not explicit.
Yes, streams have locale facets. But that is purely you saying "I want to encode output in this way". That says nothing about the needs of the consumer on the other end of the stream. It's simply a way for you to do conversions to what you believe the receiver wants.
C++ doesn't have a way to query what the receiver expects. And without that knowledge, ICU or iconv or whatever are not helpful to you.
The way this is generally done is with platform-specific code. On your Windows build, you can either output wchar_ts encoded in UTF-16, or set codepages and use facets to convert for that. On Linux, you can generally assume that the console will accept UTF-8. And so forth.
But there is no simple "do this and it will work" mechanism.
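As an illustration of the platform-specific route, an untested sketch using the C++11 std::wstring_convert and std::codecvt_utf8_utf16 (both deprecated since C++17); whether the console renders the output correctly still depends on its configuration:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::string utf8 = u8"\u6c34 is water";  // UTF-8 kept internally

#ifdef _WIN32
    // Windows route: convert UTF-8 to UTF-16 and write wide characters.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    std::wcout << conv.from_bytes(utf8) << L'\n';
#else
    // Elsewhere, assume the console accepts UTF-8 and write the bytes as-is.
    std::cout << utf8 << '\n';
#endif
}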

Convert UCS-2 inside character array to UTF-8 std::string

Well, this is a direct "follow-up" of this question; I decided to split the problem in two. Originally I posted the whole picture to prevent getting another close vote for an "XY problem". For now, consider that I already know the character encoding.
I read a string from a file using std::getline. This file is encoded in a format I know, say UTF-16 big-endian.
But not all files are UTF-16 (actually most are UTF-8), and I prefer to have as little code duplication as possible.
Now my first thought is to "just read the bytes" and "then do the conversion to UTF-8", skipping the conversion if the input is already UTF-8. So I read it first into a std::string (please ignore the "ugliness" of OpenFilestreams()[file_index]):
std::string retString;
if (isValidIndex(file_index) && OpenFilestreams()[file_index]->good()) {
    std::getline(*OpenFilestreams()[file_index], retString);
}
return retString;
After this I obviously have a nonsense string, as the bytes are ordered as if the string were UCS-2/UTF-16. So how can I convert this std::string to another std::string with the UTF-8 byte ordering? Or should I do this at the line-reading level (or even when opening the file stream)?
I prefer to keep to the C++11 standard, maybe Boost/ICU if it is really better (I already have Boost, but no ICU library on my PC).
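One way to do the post-read conversion this question describes, sketched (untested) with the C++11 <codecvt> facilities; note that codecvt_utf16 defaults to big-endian input, and that on Windows the 16-bit wchar_t limits the intermediate wide string to the BMP:

#include <codecvt>
#include <locale>
#include <string>

// raw holds the bytes of one UTF-16BE line, read verbatim into a std::string
std::string utf16be_to_utf8(const std::string& raw)
{
    // UTF-16BE bytes -> wide string
    std::wstring_convert<std::codecvt_utf16<wchar_t, 0x10ffff>> u16conv;
    std::wstring wide = u16conv.from_bytes(raw);

    // wide string -> UTF-8 bytes
    std::wstring_convert<std::codecvt_utf8<wchar_t, 0x10ffff>> u8conv;
    return u8conv.to_bytes(wide);
}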

Issue writing a wstring to a file for Hebrew/Arabic text

I want to read Hebrew (Unicode) using the Xerces parser. I am able to read the value into an XMLCh. However, while writing it to another file I get garbage values. I tried using ofstream and wofstream, but that didn't help.
Let me know your suggestions.
The problem with wofstream is that it accepts a wide string for the open() method but does not actually write wide characters to the file. You have to be explicit about that and imbue() it with a locale that has a codecvt for the encoding you want. Implementations of such a codecvt that produce a UTF encoding are still spotty; here's an example that uses Boost.
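The linked Boost example aside, with C++11 the standard <codecvt> header offers an equivalent (deprecated since C++17); an untested sketch, with a placeholder file name and sample text:

#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("hebrew.txt");
    // Encode the wide characters as UTF-8 bytes on the way out.
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf8<wchar_t>));
    out << L"\u05e9\u05dc\u05d5\u05dd";  // Hebrew letters (shalom)
}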
It's been a while since I've used Xerces, but I remember that XMLCh is their special character type, and you probably must convert it to wchar_t before writing. Alternatively you can try to save it byte by byte. Good luck!
As far as I know (about Arabic), you have to write it in the opposite order, since it is read from right to left, so write code to switch the letters around before writing them to the file.