Issue writing wstring to a file for Hebrew/Arabic text - C++

I want to read Hebrew (Unicode) using the Xerces parser. I am able to read the value into an XMLCh. However, while writing it to another file I get garbage values. I tried using ofstream and wofstream, but that didn't help.
Let me know your suggestions

The problem with wofstream is that it accepts a wide string for the open() method but does not actually write wide characters to the file. You have to be explicit about that and imbue() it with a locale that has a codecvt facet for the encoding you want. Implementations of such a codecvt that produce a UTF encoding are still spotty; here's an example that uses Boost.

It's been a while since I've used Xerces, but I remember that XMLCh is their special character type, and you probably have to convert it to wchar_t before writing. Alternatively, you can try to save it byte by byte. Good luck!

As far as I know (about Arabic), you have to write it in the opposite order, since it reads right to left, so write code to reverse the letters before writing them to a file.

Related

How to convert UTF-8 text from a file into some container that can be iterated, checking every symbol for being alphanumeric, in C++?

I read around 20 questions and checked the documentation with no success. I don't have any experience writing code that handles this stuff; I've always avoided it.
Let's say I have a file which I am sure always will be UTF-8:
á
Let's say I have code:
wifstream input{argv[1]};
wstring line;
getline(input, line);
When I debug it, I see it's stored as L"Ã¡", so it's not iterable the way I want; I want just one symbol so I can call, say, iswalnum(line[0]).
I realized that there is some codecvt facet, but I am not sure how to use it or whether it's the best way, and I use cl.exe from VS2019, which gives me a lot of conversion and deprecation errors on the example provided:
https://en.cppreference.com/w/cpp/locale/codecvt_utf8
I realized that there is a from_bytes function, but cl.exe from VS2019 gives me a lot of errors on that example too:
https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
So how do I correctly read a line containing, say, the letter (symbol) á and iterate it as a container of size 1 so that a function like iswalnum can simply be called?
EDIT: When I fix the bugs in those examples (with /std:c++latest), I still have Ã¡ in UTF-8 and á in UTF-16.
L"Ã¡" means the file was read with the wrong encoding. You have to imbue a UTF-8 locale before reading from the stream.
wifstream input{argv[1]};
input.imbue(std::locale("en_US.UTF-8"));
wstring line;
getline(input, line);
Now wstring line will contain Unicode code points (á in your case) and can be easily iterated.
Caveat: on Windows, wchar_t is deficient (16-bit) and is only good enough for iterating over the BMP.

std::string, std::wstring and UTF8

I want to use strings encoded in UTF-8 (I'm sorry if that's bad wording; please correct me so I learn the proper term). Also, I want my program to be cross-platform.
IIUC, the proper way to do so is to use std::wstring and then convert it to UTF-8. The trouble is that I think std::string is already UTF-8 encoded on Linux (I may be wrong about that).
So what is the best way to create a UTF8 representation of std::{w}string with the least possible conditional code?
The strings are constants, they are hard coded and they will be used in the SQLite queries.
P.S.: I am going to try with XCode 5, hoping that it is C++11 compliant.
they are hard coded.
If all of the strings in question are hard-coded string literals, then you don't need anything special.
Using the u8 prefix when declaring such strings ensures that they are encoded in UTF-8, on every platform that supports this C++11 feature. The type of such a string is const char [], just like a regular string literal:
const char my_utf8_literal[] = u8"Some String.";
Of course, these can be stored in std::string (not wstring) as well:
std::string my_utf8_string = u8"Some String.";
You said that your goal was to use them in SQLite queries and commands. In that case, it should be pretty easy to make everything work. You would be using SQLite's string formatting commands to build queries, and while they are blind to UTF-8, so long as all of your inputs are UTF-8, the outputs will also be valid UTF-8. So there shouldn't be any problems.
For UTF-8 processing there's a library called tiny-utf8. It provides a drop-in replacement for std::string, or more specifically std::u32string (its value_type is char32_t, but the data representation is UTF-8 with chars). That's more or less the easiest way to handle UTF-8 in C++11.
The strings are constants, they are hard coded and they will be used
in the SQLite queries.
If you have hardcoded strings, you would just have to change the encoding of your source file to UTF-8 and prepend the U prefix to your string literals, from which you can then construct a utf8_string object to work with.
So what is the best way to create a UTF8 representation of
std::{w}string with the least possible conditional code?
IMHO, if you can, don't work with wchar_t and wstring, since they are probably the most vaguely specified and platform-specific parts of the C++ string library.
I hope this helped at least a little.
Cheers, Jakob
The question was changed after this answer was posted, adding that the strings are hardcoded literals to be used in SQL queries. For that, simple u8 strings are a simple solution, and parts of what is answered here become irrelevant. I'm not going to chase the question through this or further changes.
Re
” I want to use string encoded in the UTF-8 (I'm sorry if its a bad wording, please correct me so I understand what is a proper one). Also, I want my program to be cross-platform.
Then you're plain out of luck.
Microsoft's documentation explicitly states that their setlocale does not support UTF-8:
MSDN docs on setlocale:
” The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
Heads-up: despite the fact that It Does Not Work™, and is explicitly documented as not working, there are numerous websites and blogs, probably even books, that recommend this approach in a sort of ostrich-like way. They often look authoritative. But the information is rubbish.
Re
” what is the best way to create a UTF8 representation of std::{w}string with the least possible conditional code?
That depends on what you have available. The standard library offers std::codecvt. It's been asked about and answered before, e.g. (Convert wstring to string encoded in UTF-8).

How to check whether text file is encoded in UTF-8?

How to check whether text file is encoded in UTF-8 in C++?
Try to decode it as UTF-8 and see whether the encoding is broken or not; if it isn't, check that it contains only valid Unicode code points.
But still there's no guarantee the file is in UTF-8 or ASCII or something else. How would you interpret a file containing a single byte, the letter A? ASCII? UTF-8? Other? Likewise, what if the file starts with the BOM by sheer luck but isn't really UTF-8 or isn't intended to be UTF-8?
This article may be of interest.
You can never know for sure that any piece of binary data was intended to represent UTF-8. However, you can always check if it can be interpreted as UTF-8. The simplest way would be to just try and convert it (say to UTF-32) and see if you get no errors. If all you need is the validation, then you can do the same thing without actually writing the output. (You'll need to write this yourself, but it's easy.)
Note that it is crucial for security reasons to abort the conversion entirely at the first error, and not try to "recover" somehow.
Try converting to UTF-16. If you get no errors, then it is very likely UTF-8.
But no matter what you do, it is still a best guess.

Problem with getline and "strange characters"

I have a strange problem,
I use
wifstream a("a.txt");
wstring line;
while (a.good()) //!a.eof() not helping
{
getline (a,line);
//...
wcout<<line<<endl;
}
and it works nicely for txt file like this
http://www.speedyshare.com/files/29833132/a.txt
(sorry for the link, but it's just 80 bytes, so it shouldn't be a problem to get; if I copy/paste it on SO, the newlines get lost)
BUT when I add, for example, 水 (from http://en.wikipedia.org/wiki/UTF-16/UCS-2#Examples) to any line, that is the line where loading stops. I was under the wrong impression that the getline that takes a wstring as one input and a wifstream as the other could chew any text input...
Is there any way to read every single line in the file even if it contains funky characters?
The not-very-satisfying answer is that you need to imbue the input stream with a locale which understands the particular character encoding in question. If you don't know which locale to choose, you can use the empty locale.
For example (untested):
std::wifstream a("a.txt");
std::locale loc("");
a.imbue(loc);
Unfortunately, there is no standard way to determine what locales are available for a given platform, let alone select one based on the character encoding.
The above code puts the locale selection in the hands of the user, and if they set it to something plausible (e.g. en_AU.UTF-8) it might all Just Work.
Failing this, you probably need to resort to third-party libraries such as iconv or ICU.
Also relevant this blog entry (apologies for the self-promotion).
The problem is with your call to the global function getline (a,line). This takes a std::string. Use the std::wistream::getline method instead of the getline function.
C++ fstreams delegate I/O to their filebufs. A filebuf always reads "raw bytes" from disk and then uses the stream locale's codecvt facet to convert those raw bytes into the stream's "internal encoding".
A wfstream is a basic_fstream<wchar_t> and thus has a basic_filebuf<wchar_t>, which uses the locale's codecvt<wchar_t, char> to convert the bytes read from disk into wchar_ts. If you read a UCS-2 encoded file, the conversion must thus be performed by a codecvt that "knows" the external encoding is UCS-2. You therefore need a locale with such a codecvt (see, for example, this SO question).
By default, the stream's locale is the global locale at stream construction. To use a specific locale, imbue() it on the stream.

How to write a std::string to a UTF-8 text file

I just want to write a few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest, simplest way to do so?
The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.
And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).
You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
There is a nice tiny library for working with UTF-8 in C++: utfcpp
libiconv is a great library for all our encoding and decoding needs.
If you are using Windows you can use WideCharToMultiByte and specify that you want UTF8.
What is the easiest and simple way to do so?
The most intuitive and thus easiest way to handle UTF-8 in C++ is surely to use a drop-in replacement for std::string.
As the internet still lacked one, I implemented the functionality myself:
tinyutf8 (EDIT: now on GitHub).
This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over code points rather than chars). It strikes a balance between fast access and small memory consumption while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).
Hope this helps!
If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
// Using Qt: convert a wide string to a UTF-8 encoded std::string.
std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());
std::string str_std(byteArray.constData(), byteArray.length());
My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to utf8 when writing out to a file using these converting iterators I put on github.
#include <utf/utf.h>

int main()
{
    using namespace utf;
    u32string u32_text = U"ɦΈ˪˪ʘ";
    // do stuff with string

    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());
    u8string u8_text(pos, end);

    // write out utf8 to file.
    // ...
}
Use Glib::ustring from glibmm.
It is the only widespread UTF-8 string container (AFAIK). While glyph-based (not byte-based), it has the same method signatures as std::string, so porting should be a simple search-and-replace (just make sure your data is valid UTF-8 before loading it into a ustring).
UTF-8 is a multibyte character encoding, so you will have problems working with it; it's a bad idea. Use fixed-width Unicode instead.
So, in my opinion, the best option is ordinary ASCII char text with some code page. You only need Unicode if you use more than two sets of different symbols (languages) in a single text.
That's a rather rare case; in most cases two symbol sets are enough, and for that common case use ASCII chars, not Unicode.
You only get a benefit from multibyte chars like UTF-8 with traditional Chinese, Arabic, or other hieroglyphic text. That's a very, very rare case!!!
I don't think many people need that. So never use UTF-8!!! You'll avoid the serious headache of manipulating such strings.