How to properly convert UCS-2 little endian into UTF-8? - c++

I have a file whose line endings are in the Windows style (\r\n); it is encoded in UCS-2 little endian.
Say this is my file fruit.txt (UCS-2 little endian):
So I open it in a std::wifstream and try to parse the contents:
// open the file
std::wifstream file("fruit.txt");
if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));
// create container for the lines
std::forward_list<std::string> lines;
// Add each line to the container
std::wstring line;
while(std::getline(file,line)) lines.emplace_front(wstring_to_string(line));
If I try to print to cout...
// Printing to cout
for( auto it = lines.cbegin(); it != lines.cend(); ++it )
std::cout << *it << std::endl;
...This is what it outputs:
Cherry
Banana
ÿþApple
Worse yet, if I open it in Notepad++, this is what it looks like
I can sort-of rectify this by forcibly converting the encoding back to UCS-2 which results in this:
My wstring_to_string function is defined as this:
std::string wstring_to_string( const std::wstring& wstr ) {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
return convert.to_bytes(wstr);
}
What in the world is going on here? How can I get a normal UTF-8 string? I have tried this method too: How to read utf-16 file into utf-8 std::string line by line, but imbuing the std::wifstream first results in no output at all. Can someone please point me to the best way to convert UCS-2 LE data to readable UTF-8 data?
Edit: I think there may be a bug with mingw64/mingw-w64-x86_64-gcc 6.3.0-2, which is provided by MSYS2. I have tried everyone's suggestions, and imbuing the locale into the streams just renders no output at all. I do know there are only two native locales provided, "C" and "POSIX". I was going to try Visual Studio but don't have sufficient internet speed for the 4GB download. I have used ICU like @Andrei R. suggested and it is working great.
I would have loved to use standard libraries but I am ok with this. Please take a look at my code if you need this solution: https://pastebin.com/qudy7yva

The code itself is fine as-is.
The real problem is that your input file is NOT valid UTF-16LE to begin with (your use of std::codecvt_utf8_utf16 requires UTF-16, not UCS-2). This is clearly shown in your Notepad++ screenshots.
Offhand, the file data looks like a UTF-16LE file with a BOM (ÿþ is the UTF-16LE BOM when viewed as 8-bit ANSI) that was appended as-is to the end of a UCS-2BE (or UTF-16BE) file that did not have a BOM.
You need to fix the input file so the entire file is valid UTF-16LE from beginning to end (with or without a BOM in front, not in the middle).
Then the code you already have will work.
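To check which byte order mark, if any, a file starts with, the first few bytes can be inspected directly. The following is only a minimal sketch: `Bom` and `detect_bom` are hypothetical names introduced for illustration, and UTF-32 BOMs are deliberately not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original answer): classify the
// leading bytes of a buffer by their Unicode byte order mark, if any.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

Bom detect_bom(const std::string& bytes) {
    // Read byte i as an unsigned value so comparisons with 0xEF etc. work.
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;  // displayed as "ÿþ" by an 8-bit ANSI viewer
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

Reading the file with std::ifstream in std::ios::binary mode and passing the first bytes to `detect_bom` shows whether the BOM sits at the start of the file. Finding FF FE anywhere else in the data is a hint that two differently encoded files were concatenated.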

Converting to/from Unicode is in general not so trivial. Have a look at the ICU libraries; I believe this is by far the most complete encoding conversion library for C/C++.
There are also platform-dependent ways like WideCharToMultiByte (Windows) or iconv (Linux). Or, with Qt, you can use QString::fromUtf16. You will probably have to reverse the endianness yourself.
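If pulling in a library is not an option, the conversion can also be written by hand, which makes the byte-order handling explicit. The following is a minimal sketch, not a replacement for ICU: `append_utf8` and `utf16le_to_utf8` are hypothetical names, the input is assumed to be raw UTF-16LE bytes read in binary mode, and unpaired surrogates are replaced with U+FFFD rather than reported.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert raw UTF-16LE bytes to UTF-8.
// A leading BOM (FF FE) is skipped; a trailing odd byte is ignored.
std::string utf16le_to_utf8(const std::string& bytes) {
    // Assemble the 16-bit code unit at offset i, low byte first.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[i]) |
               (static_cast<unsigned char>(bytes[i + 1]) << 8);
    };
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF) i = 2;  // skip the BOM
    for (; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            std::uint32_t lo = (i + 3 < bytes.size()) ? unit(i + 2) : 0;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;  // the low surrogate has been consumed too
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For big-endian input, swapping the two byte positions inside `unit` is the only change needed, which is what reversing the endianness amounts to here.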

For your case, the main issue is that you made the wifstream read the file in the wrong way. If you print the size of wstr in wstring_to_string, you will find that it's not what you expect.
https://stackoverflow.com/a/19698449/4005852
Setting the proper locale will fix this issue.
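#include <cerrno>
#include <codecvt>
#include <cstring>
#include <forward_list>
#include <fstream>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
std::string wstring_to_string( const std::wstring& wstr ) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    return convert.to_bytes(wstr);
}
int main()
{
    // open the file in binary mode so the UTF-16 bytes are not altered
    std::wifstream file("fruit.txt", std::ios::binary);
    // decode the bytes as UTF-16LE while reading
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));
    // create container for the lines
    std::forward_list<std::string> lines;
    // Add each line to the container
    std::wstring line;
    file.get(); // remove BOM
    while(std::getline(file,line)) {
        // binary mode keeps the '\r' of each "\r\n" line ending, so drop it
        if( ! line.empty() && line.back() == L'\r' ) line.pop_back();
        lines.emplace_front(wstring_to_string(line));
    }
    // Printing to cout
    for( auto it = lines.cbegin(); it != lines.cend(); ++it )
        std::cout << *it << std::endl;
    return 0;
}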

Related

Wrong filename when using chinese characters

I'm trying to create a file on Windows using a Chinese character. The entire path is inside the variable "std::string originalPath"; however, I have a charset problem that I simply cannot figure out how to overcome.
I have written the following code:
#include <iostream>
#include <boost/locale.hpp>
#include <boost/filesystem/fstream.hpp>
#include <windows.h>
int main( int argc, char *argv[] )
{
// Start the rand
srand( time( NULL ) );
// Create and install global locale
std::locale::global( boost::locale::generator().generate( "" ) );
// Make boost.filesystem use it
boost::filesystem::path::imbue( std::locale() );
// Check if set to utf-8
if( std::use_facet<boost::locale::info>( std::locale() ).encoding() != "utf-8" ){
std::cerr << "Wrong encoding" << std::endl;
return -1;
}
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
LPWSTR lp=(LPWSTR )newPath.c_str();
CreateFileW(lp,GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ |
FILE_SHARE_WRITE, NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL );
return 0;
}
Running it, however, I get inside the folder "C:\test\s" a file of name "¦ᄌタ.png", instead of "一.png", which I want. The only way I found to overcome this is to exchange the lines
std::string originalPath = "C:/test/s/一.png";
// Convert to wstring (**WRONG!**)
std::wstring newPath( originalPath.begin(), originalPath.end() );
to simply
std::wstring newPath = L"C:/test/s/一.png";
In this case the file "一.png" appears perfectly inside the folder "C:\test\s". Nonetheless, I cannot do that because the software gets its path from a std::string variable. I think the conversion from std::string to std::wstring is being performed the wrong way; however, as can be seen, I'm having serious trouble understanding this logic. I searched Google exhaustively and read many good articles, but all my attempts seem to be useless. I tried the MultiByteToWideChar function and also boost::filesystem, but neither helped: I simply cannot get the right filename written to the folder.
I'm still learning, so I'm very sorry if I'm making a dumb mistake. My IDE is Eclipse and it is set up to UTF-8.
You need to actually convert the UTF-8 string to UTF-16. For that you have to look up how to use boost::locale::conv or (on Windows only) the MultiByteToWideChar function.
std::wstring newPath( originalPath.begin(), originalPath.end() ); won't work: it simply copies the bytes one by one and casts each of them to a wchar_t, without decoding the UTF-8 sequences.
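To make the difference between copying bytes and actually converting concrete, here is a minimal hand-written UTF-8 to UTF-16 sketch. The name `utf8_to_utf16` is hypothetical, and the function is deliberately simple: it rejects truncated or malformed sequences but does not check for overlong encodings or values above U+10FFFF, which a real library such as boost::locale::conv or MultiByteToWideChar would handle.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode the result as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)              { cp = b;        extra = 0; }
        else if ((b >> 5) == 0x06) { cp = b & 0x1F; extra = 1; }
        else if ((b >> 4) == 0x0E) { cp = b & 0x0F; extra = 2; }
        else if ((b >> 3) == 0x1E) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

The three UTF-8 bytes of 一 (E4 B8 80) become the single UTF-16 unit U+4E00, whereas the byte-wise copy produces three unrelated wide characters. On Windows, where wchar_t is 16 bits wide, the result could be copied into a std::wstring (for example `std::wstring w(u.begin(), u.end())`) before being passed to CreateFileW.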
Thank you for your help, roeland. I finally managed to find a solution by using the following library: "http://utfcpp.sourceforge.net/". I used the function "utf8::utf8to16" to convert my original UTF-8 string to UTF-16, which allows Windows to display the Chinese characters correctly.

How to read non-ASCII lines from file with std::ifstream on Linux?

I was trying to read a plain text file. In my case, I need to read it line by line and process that information. I know C++ has the w* classes for reading wchars. I tried the following:
#include <fstream>
#include <iostream>
int main() {
std::wfstream file("file"); // aaaàaaa
std::wstring str;
std::getline(file, str);
std::wcout << str << std::endl; // aaa
}
But as you can see, it did not read the full line. It stops when it reads "à", which is non-ASCII. How can I fix it?
You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically you can't assume every byte is a letter and that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes you have on the file.
Let's assume your file is encoded in UTF-8, this is likely given that you are on Linux. I'll assume your terminal also supports it. If you directly read using a std::string, with chars, you will have everything working. Look:
// olá
#include <iostream>
#include <fstream>
int main() {
std::fstream file("test.cpp");
std::string str;
std::getline(file, str);
std::cout << str << std::endl;
}
The output is what you expect, but this is not really correct. Look at what is going on: The file is encoded in utf-8. This means the first line is this byte sequence:
/    /    (space)  o    l    á
47   47   32       111  108  195 161
Note that á is encoded with two bytes. If you ask the size of the string (str.size()), you will indeed get the wrong value: 7. This happens because the string thinks every byte is a char. When you send it to std::cout, the string will be given to the terminal to print. And the magical part: The terminal works with utf-8 by default. So it just assumes the string is utf-8 and correctly prints 6 chars.
You see that it works, but it is not really right. Try to make any string operation on the data and you may break the utf-8 encoding and will never be able to print it again!
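As a rough illustration of the mismatch between bytes and characters described above, the number of code points in a UTF-8 string can be counted by skipping continuation bytes (those of the form 10xxxxxx). This is only a sketch: `utf8_length` is a hypothetical name, and the function assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the line "// olá", str.size() reports 7 bytes while `utf8_length` reports 6 code points. The same mismatch is why an operation such as substr with a byte offset can cut a multi-byte character in half.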
Let's go for wstrings. They store each letter in a wchar_t that, on Linux, has 4 bytes. This is enough to hold any possible Unicode character. But it will not work directly because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last factor is important, and the default "C" locale says: "Assume everything is ASCII". When it is reading the file and tries to decode a non-ASCII byte, it just fails silently.
The correction is simple: Use a UTF-8 locale. Look:
// olá
#include <iostream>
#include <fstream>
#include <locale>
int main() {
std::ios::sync_with_stdio(false);
std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
std::wcout.imbue(loc); // Use it for output
std::wfstream file("test.cpp");
file.imbue(loc); // Use it for file input
std::wstring str;
std::getline(file, str); // str.size() will be 6
std::wcout << str << std::endl;
}
You may be asking what std::ios::sync_with_stdio(false); means. It is required because by default C++ streams are kept in sync with C streams. This is good because it enables you to use both cout and printf in the same program. We have to disable it because C streams will break the utf-8 encoding and will produce garbage on the output.

c++ read Arabic text from file

In C++, I have a text file that contains Arabic text like:
شكلك بتعرف تقرأ عربي يا ابن الذين
and I want to parse each line of this file into a string and use string functions on it (like substr, length, at...etc.) then print some parts of it to an output file.
I tried doing it but it prints some garbage characters like "\'c7\'e1\'de\'d1\"
Is there any library to support Arabic characters?
edit: just adding the code:
#include <iostream>
#include <fstream>
using namespace std;
int main(){
ifstream ip;
ip.open("d.rtf");
if(ip.is_open() != true){
cout<<"open failed"<<endl;
return 0;
}
string l;
while(!ip.eof()){
getline(ip, l);
cout<<l<<endl;
}
return 0;
}
Note: I still need to add some processing code like
if(l == "كلام بالعربي"){
string s = l.substr(0, 4);
cout<<s<<" is what you are looking for"<<endl;
}
You need to find out which text encoding the file is using. For example, to read a UTF-8 file as wchar_t you can (C++11):
std::wifstream fin("text.txt");
fin.imbue(std::locale("en_US.UTF-8"));
std::wstring line;
std::getline(fin, line);
std::wcout << line << std::endl;
The best way to deal with this, in my opinion, is to use some Unicode helper. The strings in C or even in C++ are just an array of bytes. When you call, for example, strlen() [C] or somestring.length() [C++], you only get the number of bytes in that string instead of the number of characters.
Some auxiliary functions can help you with this, like mbstowcs(). But my opinion is that they are kind of old and hard to use.
Another way is to use C++11, which, in theory, has support for many things related to UTF-8. But I never saw it working perfectly, at least if you need to be multi-platform.
The best solution I found is to use the ICU library. With this I can work on UTF-8 strings easily and with the same "charm" as working with a regular std::string. You have a string class with methods for length, substrings and so on, and it's very portable. I use it on Windows, Mac and Linux.
You can use Qt too.
Simple example :
#include <QDebug>
#include <QTextStream>
#include <QFile>
int main()
{
QFile file("test.txt");
file.open(QIODevice::ReadOnly | QIODevice::Text);
QTextStream stream(&file);
QString text=stream.readAll();
if(text == "شكلك بتعرف تقرأ عربي يا ابن الذين")
qDebug()<<",,,, ";
}
It is better to process an Arabic text line by line. To get all lines of Arabic text from file, try this
std::wifstream fin("arabictext.txt");
fin.imbue(std::locale("en_US.UTF-8"));
std::wstring line;
std::wstring text;
while ( std::getline(fin, line) )
{
text= text+ line + L"\n";
}

C++ - string.compare issues when output to text file is different to console output?

I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
However, the string.compare returns stating they are not equal. When outputting to a text file, the two strings appear as follows:
ThisIsATestStringOutputtedToAFile
T^#h^#i^#s^#I^#s^#A^#T^#e^#s^#t^#S^#t^#r^#i^#n^#g^#O^#u^#t^#p^#u^#t^#
t^#e^#d^#T^#o^#A^#F^#i^#l^#e
I'm guessing this is some kind of encoding problem, and if I was in my native language (good old C#), I wouldn't have too many problems. As it is I'm with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at maybe converting to/from ansi/unicode, and also removing the odd characters, but I'm not even sure if they really exist or not..
Thanks in advance for any suggestions.
EDIT
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
myInput.open(fileLocation.c_str());
myOutput.open("test.txt");
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
cin.get();
myInput.close();
myOutput.close();
TEST_ASSERT(compare1.compare(compare2) == 0);
How did you create the content of myInput? I would guess that this file is created in two-byte encoding. You can use hex-dump to verify this theory, or use a different editor to create this file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
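To check the two-byte-encoding theory from inside the program rather than with an external hex-dump tool, the bytes of the string read from the file can be printed in hex. This is a minimal sketch; `hex_dump` and its `limit` parameter are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render up to `limit` bytes of a buffer as
// space-separated two-digit hex values.
std::string hex_dump(const std::string& bytes, std::size_t limit = 16) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size() && i < limit; ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

If the dump of the substring shows a 00 byte after every letter (for example 54 00 68 00 69 00 for "Thi"), the file is stored in a two-byte encoding such as UTF-16LE, which would explain both the ^# characters in the output file and the failed comparison.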
UPDATE:
If you cannot change the encoding of the myInput file, you can try to use wide-chars in your program. I.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
The following works for me and writes the text pasted below into the file. Note the '\0' character embedded into the string.
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
std::ofstream myOutput("test.txt");
//std::ostringstream myOutput;
std::string str1 = "ThisIsATestStringOutputtedToAFile";
std::string fileBuffer;
std::getline(myInput, fileBuffer);
std::string str2 = fileBuffer.substr(10,100);
std::cout << str1 + "\n";
std::cout << str2 + "\n";
myOutput << str1 + "\n";
myOutput << str2 + "\n";
std::cout << str1.compare(str2) << '\n';
//std::cout << myOutput.str() << '\n';
return 0;
}
Output:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. The way to convert them with the OS limitations I had for this project (Linux, C/C++ code), was to use the iconv() functions. To keep the compatibility of the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
NOTE
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case though, the code was in some rarely used unit tests, and as such didn't need to be highly optimized, hence the creation, destruction and I/O operations of some text files weren't an issue.
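For comparison, the same conversion can be done in-process with the iconv(3) API instead of going through temporary files and system(). The following is a sketch under stated assumptions: `utf16_to_utf8_iconv` is a hypothetical name, and the code assumes the glibc/POSIX signature in which iconv() takes a `char**` input pointer (some platforms declare it differently or need linking with -liconv).

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-16 byte buffer to UTF-8 using iconv.
// With from = "UTF-16" the byte order is taken from a BOM if present;
// without a BOM the default is implementation-dependent, so pass
// "UTF-16LE" or "UTF-16BE" when the byte order is known.
std::string utf16_to_utf8_iconv(const std::string& in,
                                const char* from = "UTF-16") {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    // iconv does not modify the input, but its signature is not const.
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();
    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;  // save before any other call can change it
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means the output buffer filled up; loop again.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or truncated input");
        }
    }
    iconv_close(cd);
    return out;
}
```

With this, the substring taken from the UTF-16 file could be converted and compared directly, for example `utf16_to_utf8_iconv(compare2, "UTF-16LE") == compare1`, assuming the file is little-endian and the substring starts on an even byte offset.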

C++ (VC) Text output breaks lines with 0d 0d 0a instead of 0d 0a - how to fix?

EDIT: The solution to this problem was supplied by Ulrich Eckhardt in the comments below. Also: this problem had an entirely different cause and solution from the ones described in possible duplicates. Again, see Ulrich Eckhardt's comment for details.
With the help of the experts here, I managed to put together a program that writes the contents of the Windows clipboard to a text file in a specified code page. It now seems to work perfectly, except that the line breaks in the text file are three bytes - 0d 0d 0a - instead of 0d 0a - and this causes problems (additional lines) when I import the text into a word processor.
Is there an easy way to replace 0d 0d 0a with 0d 0a in the text stream, or is there something I should be doing differently in my code? I haven't found anything like this elsewhere. Here is the code:
#include <stdafx.h>
#include <windows.h>
#include <iostream>
#include <fstream>
#include <codecvt> // for wstring_convert
#include <locale> // for codecvt_byname
using namespace std;
void BailOut(char *msg)
{
fprintf(stderr, "Exiting: %s\n", msg);
exit(1);
}
string ExePath()
{
char buffer[MAX_PATH];
GetModuleFileNameA(NULL, buffer, MAX_PATH);
string::size_type pos = string(buffer).find_last_of("\\/");
return string(buffer).substr(0, pos);
}
// get output code page from command-line argument; use 1252 by default
int main(int argc, char *argv[])
{
string codepage = ".1252";
if (argc > 1) {
string cpnum = argv[1];
codepage = "." + cpnum;
}
// HANDLE clip;
string clip_text = "";
// exit if clipboard not available
if (!OpenClipboard(NULL))
{ BailOut("Can't open clipboard"); }
if (IsClipboardFormatAvailable(CF_TEXT)) {
HGLOBAL hglb = GetClipboardData(CF_TEXT);
if (hglb != NULL) {
LPSTR lptstr = (LPSTR)GlobalLock(hglb);
if (lptstr != NULL) {
// read the contents of lptstr which just a pointer to the string:
clip_text = (char *)hglb;
// release the lock after you're done:
GlobalUnlock(hglb);
}
}
}
CloseClipboard();
// create conversion routines
typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> codecvt;
std::wstring_convert<codecvt> cp1252(new codecvt(".1252"));
std::wstring_convert<codecvt> outpage(new codecvt(codepage));
std::string OutFile = ExePath() + "\\#clip.txt"; // output file name
ofstream OutStream; // open an output stream
OutStream.open(OutFile, ios::out | ios::trunc);
// make sure file is successfully opened
if (!OutStream) {
cout << "Error opening file " << OutFile << " for writing.\n";
return 1;
}
// convert to DOS/Win codepage number in "outpage"
OutStream << outpage.to_bytes(cp1252.from_bytes(clip_text)).c_str();
//OutStream << endl;
OutStream.close(); // close output stream
return 0;
}
The comments here are on the right track, but let me provide more context and point out a lingering problem.
There are various line-terminator/separator conventions. Many Unix-derived systems use a line feed character at the end of every line. In ASCII, that's '\x0A'. Other systems, like Windows and many networking protocols, use a carriage return followed by a line-feed between lines. In ASCII, that's '\x0D' '\x0A'. (There are other schemes as well, but they are much rarer.)
The C and C++ input/output libraries for reading and writing text can hide these conventions from you so that you can write code one way that does the "right thing" on whatever the underlying platform is.
The programming convention is to use '\n', which is almost certainly equivalent to a line feed if your underlying platform uses ASCII or Unicode (but not if it uses EBCDIC, which doesn't have a line feed character). When writing to a file, the library will intercept the '\n' and put whatever convention your platform requires. For example, if you're on a Linux machine, it'll output a line feed (and since '\n' has the same value as a line feed, this is basically a no-op). On Windows, the library will intercept the '\n' and output a carriage return and a line feed. The input side of things does the opposite.
When you get text from the clipboard on Windows, you don't really know which convention it uses. Since it's Windows, you'd probably expect CR+LF, but lots of programs that might put text on the clipboard might not behave properly on Windows.
In your case, it seems the text from the clipboard does indeed have both a carriage return and a line feed between lines. When you then output that in text mode, the i/o library outputs the carriage return, and then it sees the line feed (which it thinks is a '\n'), and so it outputs another carriage return followed by a line feed. That's why you see a doubling of the carriage returns.
Switching the output to binary mode tells the library "don't convert '\n'." So, that solves your immediate problem.
But there's still the problem that the clipboard text might sometimes have just line feeds between (or at the ends of) lines. If you output that in binary mode, you won't get the carriage returns, and the file technically won't be in the format your platform wants. Some programs will cope with this, but others, e.g., Notepad, will not.
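One way to cover both cases is to normalize the line breaks yourself before writing in binary mode. The following is a minimal sketch of that idea; `normalize_to_crlf` is a hypothetical name, and treating a lone carriage return as a line break is an assumption that may not suit every input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrite every line break (CRLF, lone LF, lone CR)
// as CRLF so the result can be written through a binary-mode stream.
std::string normalize_to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c == '\r') {
            // Treat "\r\n" as a single break by skipping the LF.
            if (i + 1 < text.size() && text[i + 1] == '\n') ++i;
            out += "\r\n";
        } else if (c == '\n') {
            out += "\r\n";
        } else {
            out += c;
        }
    }
    return out;
}
```

The output stream would then be opened with ios::out | ios::trunc | ios::binary so that the library does not expand the '\n' a second time. The opposite approach, collapsing every break to a single '\n' and keeping the stream in text mode, should give the same file on Windows.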