C++ unicode writing is not working

I am trying to write some Russian Unicode text to a file with wfstream. The following piece of code is what I have used:
wfstream myfile;
locale AvailLocale("Russian");
myfile.imbue(AvailLocale);
myfile.open(L"d:\\example.txt", ios::out);
if (myfile.is_open())
{
    myfile << L"доброе утро" << endl;
}
myfile.flush();
myfile.close();
Something unrecognizable is written to the file when this code is executed. I am using VS 2008.

I don't know much about imbuing locales in C++, but perhaps it isn't working because the locale you're imbuing doesn't match the Cyrillic characters you're trying to write?

If you use std::locale("Russian"), the file will be encoded as "Cyrillic (Windows)" (and not in some Unicode format) when you use VS2008. If you, for example, open it with Internet Explorer and change the encoding to Cyrillic (Windows), the characters become visible.
It is more common to store files in some Unicode format.
When I use:
const std::locale AvailLocale
= std::locale(std::locale("Russian"), new std::codecvt_utf8<wchar_t>());
to store it as UTF-8 and then open the file in, for example, Notepad as a UTF-8 file, I see доброе утро.
Similarly you can use codecvt_utf16 or some other coding scheme to encode the unicode characters.
codecvt_utf8 requires C++11.
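Putting that together, here is a minimal sketch of the whole program (assuming a C++11-capable compiler such as VS2010 or later; codecvt_utf8 lives in <codecvt> and was later deprecated in C++17):
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::wofstream myfile;
    // Imbue before opening so the facet is in place for all output;
    // the locale takes ownership of the facet and converts wide chars to UTF-8 bytes.
    myfile.imbue(std::locale(myfile.getloc(), new std::codecvt_utf8<wchar_t>()));
    myfile.open("d:\\example.txt");
    myfile << L"доброе утро" << std::endl;
}
Opened as UTF-8 (for example in Notepad), the file should then show the Cyrillic text.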
Alternatively you can use boost:
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
Or implement something similar yourself (from looking at http://www.boost.org/doc/libs/1_40_0/boost/detail/utf8_codecvt_facet.hpp it doesn't seem that complicated).
Or this library: http://utfcpp.sourceforge.net/

Related

CStdioFile problems with encoding on read file

I can't read a file correctly using CStdioFile.
I open Notepad, type àèìòùáéíóú, and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;
    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF-8 encoded file, I get à èìòùáéíóú
I would like my function to work with all files regardless of the encoding.
What do I need to implement?
EDIT:
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass the file encoding, which I don't know in advance.
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings including unicode in its various flavours.
I have not had a problem with it.
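If pulling in a third-party class isn't an option, one common way to deal with not knowing the encoding in advance is to sniff for a UTF-8 BOM and choose the code page from that. The sketch below is only an illustration of that idea, not what the linked example does; the function name is made up, and the BOM-only heuristic will misclassify UTF-8 files saved without a BOM (those would need an extra validity check on the bytes):
#include <afx.h>   // CFile, CString (MFC)

BOOL ReadAllFileContentGuessEncoding(const CString &filePath, CString *fileContent)
{
    CFile file;
    if (!file.Open(filePath, CFile::modeRead))
        return FALSE;

    // Read the raw bytes without any newline or code-page translation.
    const int len = (int)file.GetLength();
    CStringA raw;
    file.Read(raw.GetBufferSetLength(len), len);
    raw.ReleaseBuffer(len);
    file.Close();

    // The UTF-8 BOM is EF BB BF; if present assume UTF-8, otherwise assume CP-1252.
    UINT codePage = 1252;
    int offset = 0;
    if (len >= 3 && (BYTE)raw[0] == 0xEF && (BYTE)raw[1] == 0xBB && (BYTE)raw[2] == 0xBF)
    {
        codePage = CP_UTF8;
        offset = 3;
    }

    // Convert the bytes to UTF-16 and hand them back as a CString.
    int wlen = MultiByteToWideChar(codePage, 0, (LPCSTR)raw + offset, len - offset, NULL, 0);
    CStringW wide;
    MultiByteToWideChar(codePage, 0, (LPCSTR)raw + offset, len - offset,
                        wide.GetBufferSetLength(wlen), wlen);
    wide.ReleaseBuffer(wlen);

    *fileContent = CString(wide);
    return TRUE;
}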

fstream::open() Unicode or Non-Ascii characters don't work (with std::ios::out) on Windows

In a C++ project, I want to open a file with fstream::open() (which seems to be a major problem): the Windows build of my program fails miserably.
File "ä" (UTF-8 0xC3 0xA4)
std::string s = ...;
//Convert s
std::fstream f;
f.open(s.c_str(), std::ios::binary | std::ios::in); //Works (f.is_open() == true)
f.close();
f.open(s.c_str(), std::ios::binary | std::ios::in | std::ios::out); //Doesn't work
The string s is UTF-8 encoded, but then converted from UTF-8 to Latin1 (0xE4). I'm using Qt, so QString::fromUtf8(s.c_str()).toLocal8Bit().constData().
Why can I open the file for reading, but not for writing?
File "и" (UTF-8 0xD0 0xB8)
Same code, doesn't work at all.
It seems this character doesn't fit into the Windows-1252 charset. How can I open such an fstream (I'm not using MSVC, so there's no fstream::open(const wchar_t*, ios_base::openmode))?
In Microsoft's implementation of the STL, there's a non-standard extension (overload) that allows Unicode support via UTF-16 encoded strings.
Just pass UTF-16 encoded std::wstring to fstream::open(). This is the only way to make it work with fstream.
You can read more on what I find to be the easiest way to support unicode on windows here: http://utf8everywhere.org/
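For illustration, here is a sketch of that approach (MSVC-only; the MultiByteToWideChar conversion is my own choice here, since the question uses Qt and could just as well use QString::toStdWString()):
#include <fstream>
#include <string>
#include <windows.h>

// Convert a UTF-8 std::string to a UTF-16 std::wstring (Windows-specific).
std::wstring Utf8ToUtf16(const std::string &utf8)
{
    if (utf8.empty())
        return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// s is the UTF-8 encoded path from the question.
std::fstream f;
f.open(Utf8ToUtf16(s).c_str(),                      // MSVC's wchar_t* overload
       std::ios::binary | std::ios::in | std::ios::out);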
Using the standard APIs (such as std::fstream) on Windows you can only open a file if the filename can be encoded using the currently set "ANSI Codepage" (CP_ACP).
This means that there can be files which simply cannot be opened using these APIs on Windows. Unless Microsoft implements support for setting CP_ACP to CP_UTF8 then this cannot be done using Microsoft's CRT or C++ standard library implementation.
(Windows has had a feature called "short" filenames where, when enabled, every file on the drive had an ASCII filename that can be used via standard APIs. However this feature is going away so it does not represent a viable solution.)
Update: Windows 10 has added support for setting the codepage to UTF-8.

Opening Unicode text files in C++ and displaying their contents

Currently I am attempting to open a text file that was saved in Unicode format, copy its contents to a wstring, and then display it on the console. Because I am trying to understand more about working with strings and opening files, I'm experimenting with it in a simple program. Here is the source.
#include <fstream>
#include <iostream>
#include <string>
#include <cstdlib>

int main()
{
    std::wfstream myfile("C:\\Users\\Jacob\\Documents\\openfiletest.txt");
    if(!myfile.is_open())
    {
        std::cout << "error" << std::endl;
    }
    else
    {
        std::cout << "opened" << std::endl;
    }
    std::wstring mystring;
    myfile >> mystring;
    std::wcout << mystring << std::endl;
    system("PAUSE");
}
When I try to display it on the console it displays  ■W H Y when it should display WHY (really it's "WHY WONT YOU WORK", but I'll worry about why it's incomplete later, I guess).
In all honesty, using Unicode is not very important to me because this isn't a program that I will be selling (it's more just for myself). I do want to get familiar with it, though, because eventually I do plan on needing knowledge of Unicode in C++. I am also using Boost filesystem for working with directories and multithreading, while using C++/CLI for the GUI. My questions: should I really bother using Unicode if I don't need it at this point in time? If so, how do I fix this problem, and are there any cross-platform libraries for dealing with strings and files that use different Unicode encodings (Windows with UTF-16 and Linux with UTF-32)?
Also, any articles on Unicode in C++ or Unicode in general would be appreciated. Here is one that I found that helped a little: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Thanks.
EDIT: Here is another article I just found that was useful: Reading UTF-8 Strings with C++
That's a byte order mark. If you find one at the beginning of the file, just strip it.
And the spaces in between letters are probably because the console isn't very wide char friendly.
It displays just one word because myfile is a stream and operator>> extracts just one whitespace-separated token from the stream. You might want to try the getline function.
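A tiny sketch of those two suggestions together (this assumes the stream actually decodes the file's UTF-16 content, which, as the related questions below show, needs extra setup beyond a plain wfstream; the only new parts here are the BOM check and getline):
std::wstring line;
while (std::getline(myfile, line))
{
    // Strip a leading byte order mark (U+FEFF) if the decoder left one in.
    if (!line.empty() && line[0] == L'\xFEFF')
        line.erase(0, 1);
    std::wcout << line << std::endl;
}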

Can't read unicode (japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile, line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...).
So the first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise for the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little- or big-endian?) Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16, and even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream, not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
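As a concrete illustration of that last step, here is a rough sketch of my own (the code page 932 for Shift-JIS is only an example value, and displaying the result still needs a Unicode-capable UI as noted above):
#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

std::wstring ReadFileAsUtf16(const char *path, UINT codePage)   // e.g. 932 for Shift-JIS, CP_UTF8 for UTF-8
{
    // Read the raw bytes with a narrow stream; no wstream involved.
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    if (bytes.empty())
        return std::wstring();

    // Convert from the given code page to UTF-16.
    int len = MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), NULL, 0);
    std::wstring text(len, L'\0');
    MultiByteToWideChar(codePage, 0, bytes.data(), (int)bytes.size(), &text[0], len);
    return text;
}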
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, little-endian. If not, you will have trouble reading it.
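For clarity, a sketch with just those two fixes applied to the opening part of the program (this addresses the path and the output stream; decoding the UTF-16 content itself still needs the buffer or facet setup described in the other answers):
std::wifstream myfile(L"D:\\sample.txt");   // backslash escaped this time
if (!myfile)
    std::wcout << L"While opening a file an error is encountered" << std::endl;
else
    std::wcout << L"File is successfully opened" << std::endl;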
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE at the beginning and then ASCII character output with nulls between characters (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as the input format and UTF-8 as the output format... it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline – while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since conversion facets similar to those in the boost::locale library are now part of C++11, one can just use codecvt_utf16 (see the points below for code samples)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case; the output was missing lines, which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's only used in a test, so I think that kind of complication is OK there) to execute my task.
Hope this spares others some time, happy to help.
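A minimal sketch of the codecvt_utf16 route mentioned above (assuming a C++11 compiler; codecvt_utf16 lives in <codecvt> and was later deprecated in C++17, and the substr offsets are simply the ones from the question):
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    std::wifstream srcFile;
    // Imbue before opening: decode little-endian UTF-16 (UCS-2LE)
    // and swallow the BOM if one is present.
    srcFile.imbue(std::locale(srcFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    srcFile.open(argv[1], std::ios_base::in | std::ios_base::binary);

    std::wstring srcBuf;
    while (std::getline(srcFile, srcBuf))
    {
        // substr now works on whole characters rather than bytes.
        std::wstring field1 = srcBuf.substr(12, 12);
        std::wcout << field1 << std::endl;
    }
}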
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;

int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3,5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.