Why getline is reading my entire unicode file - c++

I have seen many threads, but none of the solutions given works for me, so if anyone can shed some light that would be great.
I am reading a Unicode file and using getline to scan it line by line, but it reads the entire file instead. Since the objects are wstring, getline only accepts a wchar_t as the delimiter, and I can't fit the delimiter I need into one (\0 does not work, as I am reading in binary mode). The code snippet is below.
Platform: Windows, Visual Studio 2010
Unicode encoding: UTF-16
wifstream fin("profiles1.prd", ios_base::binary); //open a file
wofstream fout("DXout.txt",ios_base::binary); // this dumps the parsing output
fin.imbue(std::locale(fin.getloc(),new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
fout.imbue(std::locale(fin.getloc(),new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
wstring stream;
getline(fin,stream);

I am hopeful this is what you're looking for:
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff,
    std::codecvt_mode(std::little_endian|std::consume_header)>));
Windows is little-endian, so to both skip the BOM and decode UTF-16 you have to combine the two flags into a single std::codecvt_mode value. The cast is needed because OR-ing two enumerators yields an int, which is not accepted as the template argument.
Hope it helps you out. I leave the writing side to you.
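As a rough sketch of the reading side, assuming a UTF-16 LE file whose lines end in LF or CR LF: the helper name read_utf16le_lines is made up for illustration, and std::codecvt_utf16 has been deprecated since C++17, so treat this as one possible approach rather than the definitive one. Because the stream is opened in binary mode, a CR before each LF is not translated away and has to be stripped by hand.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: reads a UTF-16 LE file (optionally starting with a BOM)
// and returns its lines as wide strings.
std::vector<std::wstring> read_utf16le_lines(const char* path) {
    std::wifstream fin(path, std::ios::binary);
    // OR-ing two codecvt_mode enumerators yields an int, hence the cast.
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(fin, line)) {
        // Binary mode does no newline translation, so drop a trailing CR.
        if (!line.empty() && line.back() == L'\r') {
            line.pop_back();
        }
        lines.push_back(line);
    }
    return lines;
}
```

Calling read_utf16le_lines("profiles1.prd") would then give one wstring per line, which can be processed and written to the output stream one at a time.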

Related

Why am I getting these invalid characters before my file data?

I am trying to read a file into a string, either with getline or with fileContents.assign( (istreambuf_iterator<char>(myFile)), (istreambuf_iterator<char>()));
Either way gives me the output shown in the image.
First way:
string fileContents;
ifstream myFile("textFile.txt");
while(getline(myFile,fileContents))
    cout<<fileContents<<endl;
Alternate way:
string fileContents;
ifstream myFile(fileName.c_str());
if (myFile.is_open())
{
    fileContents.assign( (istreambuf_iterator<char>(myFile) ),
                         (istreambuf_iterator<char>() ) );
    cout<<fileContents;
}
The file begins with those characters, most likely a BOM to tell you what the encoding of the file is.
You probably are not able to see them in Windows Notepad because Notepad hides the encoding bytes. Get a decent text editor that lets you see the binary of the file and you will see those characters.
Your file starts with a UTF-8 BOM (bytes 0xEF 0xBB 0xBF). You are reading the file's raw bytes as-is and outputting them to a display that is using an OEM font for codepage 437. To handle text files properly, especially Unicode-encoded text files, you need to read the first few bytes, check for a BOM (and there are several you can look for), and if detected then seek past the BOM and interpret the remaining bytes of the file in the specified encoding, in this case UTF-8.
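The BOM check described above can be sketched roughly as follows. The BomInfo struct and the detect_bom function are hypothetical names chosen for this example, and the function only looks at the leading bytes of a buffer that has already been read in binary mode; it does not decode anything.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: encoding name plus the number of BOM bytes to skip.
struct BomInfo {
    const char* encoding;
    std::size_t length;
};

// Inspects the first bytes of a buffer. UTF-32 LE has to be tested before
// UTF-16 LE because both signatures start with FF FE.
BomInfo detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3))     return {"UTF-8", 3};
    if (starts_with("\xFF\xFE\x00\x00", 4)) return {"UTF-32LE", 4};
    if (starts_with("\x00\x00\xFE\xFF", 4)) return {"UTF-32BE", 4};
    if (starts_with("\xFF\xFE", 2))         return {"UTF-16LE", 2};
    if (starts_with("\xFE\xFF", 2))         return {"UTF-16BE", 2};
    return {"unknown", 0};  // no BOM: the encoding has to be assumed or guessed
}
```

For the file in the question, detect_bom would report UTF-8 with a length of 3, so the text proper starts at offset 3 of the buffer.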

Writing wide string to a file in byte mode stopped

I am writing out unicode text (stored as wstring) into a file and I'm doing it in byte mode, but the string in the file ends prior to "™" character being printed. Is "™" not unicode or am I doing something wrong?
wofstream output;
output.open("output.txt", ofstream::binary);
wstring a =L"ABC™";
output << a;
™ is definitely Unicode. ofstream and wofstream do not write the text as UTF-8 by default. You have to encode the output buffer as UTF-8 to see the result you're expecting, so try using WideCharToMultiByte.
There is a common misconception about the iostream binary mode: that it is for reading and writing binary files. The iostream library works only with text files and only reads and writes text files. The only thing "binary" mode changes is how newline (NL) characters are handled. In binary mode, no transformation occurs. In non-binary mode, writing an LF character ('\n') to a stream converts it to the platform-specific newline sequence (Unix -> LF, Windows -> CR LF ("\r\n"), classic Mac OS -> CR), while on reading, the platform-specific newline sequence is converted to a single LF ('\n') character.
For everything else nothing changes, meaning a wofstream always converts the wide character string to a single-byte or multibyte character stream according to the locale used by your process. If you have a locale of "en_US.utf8" on Linux, for example, it will be converted to UTF-8. If the current locale has no representation for the ™ symbol, then either nothing or a '?' is written to the file, and the stream may stop writing at that point, which would match the truncated output in the question.
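One way to sidestep the locale entirely is to do the UTF-8 encoding yourself and write the resulting bytes through a narrow stream. The following is a minimal sketch under that assumption; to_utf8 is a hypothetical helper written by hand (it is not WideCharToMultiByte and not a standard library function), and it performs no validation of the input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 bytes.
std::string to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // Where wchar_t is 16 bits (Windows), join a surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, the string from the question could be written as to_utf8(a) to a plain std::ofstream opened with std::ios::binary, and the ™ character would end up in the file as the three bytes E2 84 A2.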

c++ fstreams open file with utf-16 name

At first I built my project on Linux and it was built around streams.
When I started moving to Windows I ran into some problems.
I have a name of the file that I want to open in UTF-16 encoding.
I try to do it using fstream:
QString source; // content of source is shown on image
char *op= (char *) source.data();
fstream stream(op, std::ios::in | std::ios::binary);
But file cannot be opened.
When I check it,
if(!stream.is_open())
{} // I always get that it's not opened. But file indeed exists.
I tried to do it with wfstream, but the result is the same, because wfstream accepts only char * too. As I understand it, this happens because the string, passed as char *, is truncated after the first zero byte, so only one symbol of the file's name is passed and the file is never found. I know wfstream in Visual Studio can accept a wchar_t * as the name, but my compiler of choice is MinGW and it doesn't have such a signature for the wfstream constructor.
Is there any way to do it with STL streams?
ADDITION
That string can contain not only ASCII symbols; it can contain Russian, German, and Chinese symbols simultaneously. I don't want to limit myself to ASCII or a local encoding.
NEXT ADDITION
Also data can be different, not only ASCII, otherwise I wouldn't bother myself with Unicode at all.
E.g.
Thanks in advance!
Boost::Filesystem especially the fstream.hpp header may help.
If you are using MSVC and its implementation of the C++ standard library, something like this should work:
QString source; // content of source is shown on image
// QString::data() returns QChar *, so take the UTF-16 buffer instead (wchar_t is 16-bit on Windows)
const wchar_t *op = reinterpret_cast<const wchar_t *>(source.utf16());
fstream stream(op, std::ios::in | std::ios::binary);
This works because the Microsoft C++ implementation has an extension that allows fstream to be opened with a wide character string.
Convert the UTF-16 string using WideCharToMultiByte with CP_ACP before passing the filename to fstream. Note that characters outside the active ANSI code page cannot be represented this way.

c++ getline reads entire file

I'm using std::getline() to read from a text file, line by line. However, the first call to getline is reading in the entire file! I've also tried specifying the delimiter as '\n' explicitly. Any ideas why this might be happening?
My code:
std::ifstream serialIn;
...
serialIn.open(argv[3]);
...
std::string tmpStr;
std::getline(serialIn, tmpStr, '\n');
// All 570 lines in the file are in tmpStr!
...
std::string serialLine;
std::getline(serialIn, serialLine);
// serialLine == "" here
I am using Visual Studio 2008. The text file has 570 lines (I'm viewing it in Notepad++ fwiw).
Edit: I worked around this problem by using Notepad++ to convert the line endings in my input text file to "Windows" line endings. The file was written with '\n' at the end of each line, using c++ code. Why would getline() require the Windows line endings (\r\n)?? Does this have to do with character width, or Microsoft implementation?
Just guessing, but could your file have Unix line-endings and you're running on Windows?
You're confusing the newline you see in code ('\n') with the actual line-ending representation for the platform (some combination of carriage-return (CR) and linefeed (LF) bytes).
The standard I/O library functions automatically convert line-endings for your platform to and from conceptual newlines for text-mode streams (the default). See "What's the difference between text and binary I/O?" from the comp.lang.c FAQ. (Although that's from the C FAQ, the concepts apply to C++ as well.) Since you're on Windows, the standard I/O functions by default write newlines as CR-LF and expect CR-LF for newlines when reading.
If you don't want these conversions done and would prefer to see the raw, unadulterated data, then you should set your streams to binary mode. In binary mode, \n corresponds to just LF, and \r corresponds to just CR.
In C, you can specify binary mode by passing "b" as one of the flags to fopen:
FILE* file = fopen(filename, "rb"); // Open a file for reading in binary mode.
In C++:
std::ifstream in;
in.open(filename, std::ios::binary);
or:
std::ifstream in(filename, std::ios::binary);
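If the line endings of the input are not known in advance, one option is to open the stream in binary mode as above and split the lines yourself. The sketch below assumes that LF, CR LF, and a lone CR should all count as line terminators; getline_any_eol and split_lines are hypothetical names, not standard library functions.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one line terminated by LF, CR LF, or a lone CR.
// Returns false once nothing more could be read.
bool getline_any_eol(std::istream& in, std::string& line) {
    line.clear();
    std::streambuf* sb = in.rdbuf();
    bool got_any = false;
    for (;;) {
        const int c = sb->sbumpc();
        if (c == std::char_traits<char>::eof()) {
            // A final line without a terminator still counts as a line.
            in.setstate(got_any ? std::ios::eofbit
                                : std::ios::eofbit | std::ios::failbit);
            return got_any;
        }
        got_any = true;
        if (c == '\n') return true;
        if (c == '\r') {
            if (sb->sgetc() == '\n') sb->sbumpc();  // swallow the LF of a CR LF pair
            return true;
        }
        line += static_cast<char>(c);
    }
}

// Convenience wrapper for text that is already in memory.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (getline_any_eol(in, line)) lines.push_back(line);
    return lines;
}
```

Used on an std::ifstream opened with std::ios::binary, getline_any_eol can replace std::getline in the reading loop without caring which platform wrote the file.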

Output data not the same as input data

I'm doing some file io and created the test below, but I thought testoutput2.txt would be the same as testinputdata.txt after running it?
testinputdata.txt:
some plain
text
data with
a number
42.0
testoutput2.txt (in some editors it's on separate lines, but in others it's all on one line):
some plain
਍ऀ琀攀砀琀ഀഀ
data with
਍ 愀  渀甀洀戀攀爀ഀഀ
42.0
#include <fstream>
#include <vector>

int main()
{
    //Read plain text data
    std::ifstream filein("testinputdata.txt");
    filein.seekg(0,std::ios::end);
    std::streampos length = filein.tellg();
    filein.seekg(0,std::ios::beg);
    std::vector<char> datain(length);
    filein.read(&datain[0], length);
    filein.close();
    //Write data
    std::ofstream fileoutBinary("testoutput.dat");
    fileoutBinary.write(&datain[0], datain.size());
    fileoutBinary.close();
    //Read file
    std::ifstream filein2("testoutput.dat");
    std::vector<char> datain2;
    filein2.seekg(0,std::ios::end);
    length = filein2.tellg();
    filein2.seekg(0,std::ios::beg);
    datain2.resize(length);
    filein2.read(&datain2[0], datain2.size());
    filein2.close();
    //Write data
    std::ofstream fileout("testoutput2.txt");
    fileout.write(&datain2[0], datain2.size());
    fileout.close();
}
It's working fine on my side. I ran your program on VC++ 6.0 and checked the output in Notepad and MS Word. Can you specify the name of the editor where you are facing the problem?
You can't read Unicode text into a std::vector<char>. The char data type only works with narrow strings, and my guess is that the text file you're reading in (testinputdata.txt) is saved with either UTF-8 or UTF-16 encoding.
Try using the wchar_t type for your characters, instead. It is specifically designed to work with "wide" (or Unicode) characters.
Thou shalt verify thy input was successful! Although this would sort you out, you should also note that the number of bytes in the file has no direct relationship to the number of characters being read: there can be fewer characters than bytes (think of a Unicode character that takes multiple bytes to encode in UTF-8) or vice versa (although the latter doesn't happen with any of the Unicode encodings). All you are experiencing is that read() couldn't read as many characters as you asked for, but write() happily wrote the junk you gave it.
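A minimal sketch of what checking the input could look like, assuming the goal is an exact byte-for-byte copy: the file is opened in binary mode and the buffer is trimmed to what read() reports through gcount(). The helper name read_file_bytes is made up for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the exact bytes of a file,
// or an empty vector if the file cannot be opened or is empty.
std::vector<char> read_file_bytes(const std::string& path) {
    // Binary mode: CR LF pairs are not collapsed, so the size reported by
    // tellg() matches what read() can deliver.
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return std::vector<char>();

    in.seekg(0, std::ios::end);
    const std::streamoff length = in.tellg();
    in.seekg(0, std::ios::beg);
    if (length <= 0) return std::vector<char>();

    std::vector<char> data(static_cast<std::size_t>(length));
    in.read(&data[0], length);
    // Keep only what was really read instead of trusting the requested size.
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}
```

Writing the result back out with an std::ofstream that is also opened with std::ios::binary should then reproduce the input file without the stray bytes.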