Output data not the same as input data - C++

I'm doing some file I/O and created the test below. I thought testoutput2.txt would be the same as testinputdata.txt after running it, but it isn't. Why?
testinputdata.txt:
some plain
text
data with
a number
42.0
testoutput2.txt (in some editors it's on separate lines, but in others it's all on one line):
some plain
਍ऀ琀攀砀琀ഀഀ
data with
਍ 愀  渀甀洀戀攀爀ഀഀ
42.0
#include <fstream>
#include <vector>

int main()
{
    //Read plain text data
    std::ifstream filein("testinputdata.txt");
    filein.seekg(0, std::ios::end);
    std::streampos length = filein.tellg();
    filein.seekg(0, std::ios::beg);
    std::vector<char> datain(length);
    filein.read(&datain[0], length);
    filein.close();

    //Write data
    std::ofstream fileoutBinary("testoutput.dat");
    fileoutBinary.write(&datain[0], datain.size());
    fileoutBinary.close();

    //Read file
    std::ifstream filein2("testoutput.dat");
    std::vector<char> datain2;
    filein2.seekg(0, std::ios::end);
    length = filein2.tellg();
    filein2.seekg(0, std::ios::beg);
    datain2.resize(length);
    filein2.read(&datain2[0], datain2.size());
    filein2.close();

    //Write data
    std::ofstream fileout("testoutput2.txt");
    fileout.write(&datain2[0], datain2.size());
    fileout.close();
}

It's working fine on my side. I have run your program on VC++ 6.0 and checked the output in Notepad and MS Word. Can you specify the name of the editor where you are facing the problem?

You can't read Unicode text into a std::vector<char>. The char data type only works with narrow strings, and my guess is that the text file you're reading in (testinputdata.txt) is saved with either UTF-8 or UTF-16 encoding.
Try using the wchar_t type for your characters, instead. It is specifically designed to work with "wide" (or Unicode) characters.
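A minimal sketch of that suggestion (my code, not the answerer's), under the assumption that the file really is wide text and that the platform's wchar_t matches it; note that on Windows a wifstream still converts the bytes through its locale, so a suitable codecvt facet may also be needed for UTF-8 or UTF-16 input:

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::wifstream filein("testinputdata.txt");   // wide-character stream, the asker's file name
    std::wstring line;
    std::vector<std::wstring> lines;
    while (std::getline(filein, line))            // read the file line by line as wide strings
        lines.push_back(line);
}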

Thou shalt verify thy input was successful! Although this would sort you out, you should also note that the number of bytes in the file has no direct relationship to the number of characters being read: there can be fewer characters than bytes (think of a Unicode character encoded as multiple bytes in UTF-8) or vice versa (although the latter doesn't happen with any of the Unicode encodings). All you are experiencing is that read() couldn't read as many characters as you asked it to, but write() happily wrote the junk you gave it.
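For what it's worth, a minimal sketch of that check against the asker's code (std::ios::binary is my addition, so the byte count reported by tellg() matches what read() can actually deliver on Windows):

#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream filein("testinputdata.txt", std::ios::binary);
    if (!filein)                                   // verify the open succeeded
        return 1;
    filein.seekg(0, std::ios::end);
    std::streamsize length = filein.tellg();
    filein.seekg(0, std::ios::beg);

    std::vector<char> datain(length);
    if (!filein.read(&datain[0], length))          // verify the read succeeded
    {
        // gcount() reports how many characters were actually extracted
        std::cerr << "read " << filein.gcount() << " of " << length << " bytes\n";
        datain.resize(filein.gcount());            // keep only what was really read
    }
}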

Related

Number getting stored as special character in C++

#include<fstream>
#include<string.h>
#include<iostream>
using namespace std;

class contact
{
    long long ph;
    unsigned char name[20], add[50], email[30];
public:
    void create_contact()
    {
        cout<<"Phone: ";
        cin>>ph;
        cout<<"Name: ";
        cin.ignore();
        cin>>name;
        cout<<"Address: ";
        cin.ignore();
        cin>>add;
        cout<<"Email address: ";
        cin.ignore();
        cin>>email;
        cout<<"\n";
    }
    void show_contact()
    {
        cout<<endl<<"Phone Number: "<<ph;
        cout<<endl<<"Name: "<<name;
        cout<<endl<<"Address: "<<add;
        cout<<endl<<"Email Address : "<<email;
    }
    long long getPhone()
    {
        return ph;
    }
    unsigned char* getName()
    {
        return name;
    }
    unsigned char* getAddress()
    {
        return add;
    }
    unsigned char* getEmail()
    {
        return email;
    }
};

fstream fp;
contact cont;

void save_contact()
{
    fp.open("contactBook.txt", ios::out|ios::app);
    cont.create_contact();
    fp.write((char*)&cont, sizeof(contact));
    fp.close();
    cout<<endl<<endl<<"Contact Has Been Sucessfully Created...";
    getchar();
}
Hey there, I am new to C++ as well as this community, and this is the code that I have been working on. The phone number of the contact is getting saved as random special characters. This is half of the code, where I think the problem occurs. Any ideas on how I could fix it? It would be of much help. Thanks!
I take it you expected to see the phone number written out in your text file as something like "15551234567." However, long long is not stored in this form in memory. It's actually stored as a 64-bit binary integer. The special characters you describe are likely the encoded version of that integer. If you read the data back in, you should find that it is still an integer.
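As an illustration (my sketch, not part of the original answer), this dumps the raw bytes of such a long long; those bytes are exactly the "special characters" that end up in the file:

#include <iostream>
#include <iomanip>

int main()
{
    long long ph = 15551234567LL;     // the kind of value the asker enters
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&ph);
    for (unsigned i = 0; i < sizeof ph; ++i)   // dump the 8 raw bytes of the integer
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<int>(bytes[i]) << ' ';
    std::cout << '\n';                // prints 07 02 ed 9e 03 00 00 00 on a little-endian machine
}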
However, there is one remaining issue. You are missing ios::binary on the fstream open command. Each of the ios flags imbues the stream with a particular behavior:
ios::out - indicates that this stream should be an output stream that you can write bytes to
ios::app - indicates that this stream should be opened in "append" mode. This means that it will not erase the contents of the file every time you open it, and any bytes outputted to the stream are appended to the end of the file.
ios::binary - opens the file in binary mode, which is needed when you want to input/output binary data, rather than just text.
You want to open the file with ios::out | ios::app | ios::binary. Forgetting binary is going to lead to very difficult to debug errors.
Now binary mode is a bit of a pest. Sorry for this being a long read, but it's a lot easier to come to grips with this flag if you understand the history behind it.
Way back in the early days of computing, there was a disagreement about how to write a newline into a file. This was back in the days of typewriters, where starting a new line was broken into two actions. There was a "carriage return", which moved the sliding part of the typewriter back to the start of the line (this was the loud part of the motion), and there was a "line feed", which moved the paper up one spot. Each of these was a separate action, so they were given separate characters in ASCII, one of the definitive ways to write text as a string of bytes. The 8-bit number 10 encoded a line feed (aka LF), and the 8-bit number 13 encoded a carriage return (aka CR). This would permit one to do things like overtyping, a trick where one types one character (like a letter) and then goes back to add another over the top (like an accent). You might write à by first typing a, doing a "carriage return", and then writing a `, just like you did on a typewriter.
Some operating systems (such as Windows) encoded the start of the next line as both of these characters, so you'd see CR LF in a text file. Other operating systems (such as Unix) decided that it wasn't worth wasting a precious byte at the end of every line, so they chose to represent the start of a new line just with a LF. Others (such as Macintosh), decided to represent the new line as CR. Nobody could agree.
To deal with this, many file reading/writing APIs treat these characters specially. fopen and fstream follow a pattern where if they see a CR LF or a CR in a text file, they silently turn it into a LF character when read. This lets you read every file type. Likewise, if it sees a LF character when writing, it expands it to whatever the platform specified a new line should look like. This lets you write cross-platform code which writes text files without having to pay attention to which new line character is used on each platform!
However, this causes huge problems for binary data. Consider the number 302,844,416 written as a 32 bit number. In hexadecimal, we would write that as 0x120D0A00 (hex is a popular way to write numbers in programming because every byte can be written as 2 characters in hex). The issue is the middle two bytes of the number, 0x0D and 0x0A. In decimal, these are 13 and 10, which you should recognize as the same bytes as CR and LF.
If the program tries to read that number in "text mode," it will see the CR LF pair and turn it into just a single LF, per the C rules. Now, instead of our number being 0x120D0A00, it's 0x120A00XX, where the XX is whatever the next byte was in the file. Very bad things! Not only is this data corrupted, but you probably needed the next byte for whatever came next in the file!
ios::binary and the "b" flag for fopen resolve this. They tell C/C++ that the data is going to be binary. There won't be any newlines to convert. If you write bytes to a binary stream, they get written directly to the file, without any clever attempts to handle newlines.
Your phone number is stored as a long long, which is a binary integer format. Without ios::binary, you run the risk of the number just happening to have a CR LF pair in it, and fstream will corrupt your data. ios::binary tells fstream to not mess with the data in that way.
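A hedged sketch of the fix, trimmed to the I/O calls and reusing the asker's globals fp and cont; the substantive change is the added ios::binary, and load_first_contact is a hypothetical helper added here only to show the round trip:

void save_contact()
{
    // Open for appending binary records: no newline translation can corrupt the bytes.
    fp.open("contactBook.txt", ios::out | ios::app | ios::binary);
    cont.create_contact();
    fp.write(reinterpret_cast<char*>(&cont), sizeof(contact));  // raw bytes of the object
    fp.close();
}

void load_first_contact()   // hypothetical helper, for illustration only
{
    fp.open("contactBook.txt", ios::in | ios::binary);
    fp.read(reinterpret_cast<char*>(&cont), sizeof(contact));   // read the same raw bytes back
    fp.close();
    cont.show_contact();    // the phone number comes back as the original long long
}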

Why am I getting these invalid characters before my file data?

I am trying to read a file into a string, either with the getline function or with fileContents.assign( (istreambuf_iterator<char>(myFile)), (istreambuf_iterator<char>()));
Either way gives me the output shown in the image: some invalid characters before the file data.
First way:
string fileContents;
ifstream myFile("textFile.txt");
while (getline(myFile, fileContents))
    cout << fileContents << endl;
Alternate way:
string fileContents;
ifstream myFile(fileName.c_str());
if (myFile.is_open())
{
    fileContents.assign( (istreambuf_iterator<char>(myFile)),
                         (istreambuf_iterator<char>()) );
    cout << fileContents;
}
The file begins with those characters, most likely a BOM to tell you what the encoding of the file is.
You probably are not able to see them in Windows Notepad because Notepad hides the encoding bytes. Get a decent text editor that lets you see the binary of the file and you will see those characters.
Your file starts with a UTF-8 BOM (bytes 0xEF 0xBB 0xBF). You are reading the file's raw bytes as-is and outputting them to a display that is using an OEM font for codepage 437. To handle text files properly, especially Unicode-encoded text files, you need to read the first few bytes, check for a BOM (and there are several you can look for), and if detected then seek past the BOM and interpret the remaining bytes of the file in the specified encoding, in this case UTF-8.
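A minimal sketch of that BOM check (my code, using the asker's file name); it looks for the UTF-8 BOM bytes 0xEF 0xBB 0xBF and skips them before reading the rest of the file:

#include <fstream>
#include <iterator>
#include <string>

int main()
{
    std::ifstream myFile("textFile.txt", std::ios::binary);
    char bom[3] = {0, 0, 0};
    myFile.read(bom, 3);                               // read the first three bytes
    bool hasUtf8Bom = myFile.gcount() == 3 &&
                      (unsigned char)bom[0] == 0xEF &&
                      (unsigned char)bom[1] == 0xBB &&
                      (unsigned char)bom[2] == 0xBF;
    if (!hasUtf8Bom)
    {
        myFile.clear();                                // in case the file was shorter than 3 bytes
        myFile.seekg(0);                               // no BOM: rewind and read from the start
    }
    // Otherwise the get position is already just past the BOM.
    std::string fileContents((std::istreambuf_iterator<char>(myFile)),
                             (std::istreambuf_iterator<char>()));
    // fileContents now holds the file's bytes without the BOM, still UTF-8 encoded.
}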

Reading From A File Which Contains Unicode Characters

I have this huge file which contains Unicode strings at the beginning (the first ~10,000 characters or so).
I don't care about the Unicode part; the parts I'm interested in aren't Unicode. But whenever I try to read those parts I get '=', and if I load the entire file into a char array and write it to a temporary file (without altering the data) with ofstream, I get incorrect data: all I get is a text file filled with Í. If I remove the Unicode part manually, everything works fine. So it seems ifstream cannot deal with streams which contain Unicode data, but if this assumption is true, is there any way to work on this file without introducing a new library to my project?
Thanks,
EDIT: Here's some sample code. The program reads from a file which contains characters (some, not all) that can't be represented in ASCII.
ifstream inFile("somefile");
inFile.seekg(0, ios_base::end);
size_t size = inFile.tellg();
inFile.seekg(0, ios_base::beg);
char *book = new char[size];
inFile.read(book, size);
for (size_t i = 0; i < size; i++) {
    cout << book[i] << " " << i << endl; //book[i] will always be '='
}
ofstream outFile("TEST.txt");
outFile.write(book, size);
outFile.close();
delete[] book;
Keith Thompson's question is very important. Depending on which Unicode encoding, writing a small C routine that reads (and discards) the Unicode characters can be trivial, or slightly more complex.
Supposing the encoding is UTF-8, you will have a problem determining when to stop discarding, because ASCII is a subset of UTF-8: any time you encounter an ASCII char you might be tempted to say "this is it, we're back in ASCII land", and yet the next char might still be outside the ASCII range.
So you need to read the file and determine where the last character > 127 is. Anything after that is plain ASCII -- hopefully.
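A minimal sketch of that scan (my code); it relies on the answer's own assumption that everything after the last byte above 127 is plain ASCII, and would be called as ascii_tail_start(book, size) with the asker's buffer:

#include <cstddef>

// Return the index just past the last byte outside the 7-bit ASCII range,
// so that buf[result..size) is the ASCII-only tail of the buffer.
std::size_t ascii_tail_start(const char* buf, std::size_t size)
{
    for (std::size_t i = size; i > 0; --i)
        if (static_cast<unsigned char>(buf[i - 1]) > 127)
            return i;          // byte at i - 1 is the last non-ASCII byte
    return 0;                  // the whole buffer is ASCII
}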
A text file is generally in just one encoding: UTF-8, UTF-16 (big or little endian), UTF-32 (big or little endian), ASCII, or another ANSI code page. Mixing encodings is only possible in some custom way.
That said, you will have to read both the data you need and the data you don't in the same encoding. If you know the format is UTF-8 you could, depending on what you are going to do with the data, read the file as a binary file into a char buffer piece by piece. Then you could use API(s) like strnextc (on Windows; an equivalent API should be available on other platforms) to move character by character over the buffer. Once you reach the end, you could move the remainder to the front of the buffer and load the rest of the buffer from the file.
In fact you could use the above approach in general for any encoding. But for UTF-16 you could try using wifstream, provided the endianness of the file and of the platform you are running on are the same. And you need to check whether the implementation of wifstream handles a change in endianness and takes care of the BOM (byte order mark), the 2-byte sequence ("FE FF" or "FF FE") that is generally present at the beginning of such a file, never mind surrogate pairs.

How to get a single character from a UTF-8 encoded Urdu string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to get a single character, one by one, from that file so that I can work on it to convert it into its equivalent Hindi character. When I try to get a single character from the input file and write that single character to the output file, I get some unknown ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;

int main()
{
    wchar_t arry[50];
    wifstream inputfile("input.dat", ios::in);
    wofstream outputfile("output.dat");
    if (!inputfile)
    {
        cerr<<"File not open"<<endl;
        exit(1);
    }
    while (!inputfile.eof()) // i am using this while just to
                             // make sure copy-paste operation of
                             // written urdu text from one file to
                             // another when i try to pick only one character
                             // from file, it does not work.
    {
        inputfile>>arry;
    }
    int i=0;
    while (arry[i] != '\0')  // i want to get urdu character placed at
                             // each index so that i can work on it to convert
                             // it into its equivalent hindi character
    {
        outputfile<<arry[i]<<endl;
        i++;
    }
    inputfile.close();
    outputfile.close();
    cout<<"Hello world"<<endl;
    return 0;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo" code page CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with codes above 65535) you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with 2-word characters.
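A hedged sketch of that approach (my code, assuming Windows and a UTF-8 input file named like the asker's); the first MultiByteToWideChar call asks for the required size, the second performs the conversion:

#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

std::wstring read_utf8_file_as_utf16(const char* path)
{
    std::ifstream in(path, std::ios::binary);          // read the raw UTF-8 bytes
    std::string utf8((std::istreambuf_iterator<char>(in)),
                     (std::istreambuf_iterator<char>()));

    int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      (int)utf8.size(), NULL, 0);
    std::wstring utf16(wideLen > 0 ? wideLen : 0, L'\0');
    if (wideLen > 0)
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                            &utf16[0], wideLen);
    return utf16;       // UTF-16 text, e.g. read_utf8_file_as_utf16("input.dat")
}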
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
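And a hedged sketch of the "decode it yourself, one character at a time" idea (my code); it trusts the input to be well-formed UTF-8 and does not validate continuation bytes or reject overlong sequences. The stream here would be a plain narrow std::ifstream opened with ios::binary, not the wifstream from the question:

#include <fstream>

// Read one UTF-8 encoded character from the stream and return its Unicode
// code point, or -1 at end of file.
long read_utf8_char(std::ifstream& in)
{
    int b0 = in.get();
    if (b0 < 0)                     // traits eof: no more bytes
        return -1;
    int extra;                      // number of continuation bytes that follow
    long cp;                        // code point being assembled
    if      (b0 < 0x80) { extra = 0; cp = b0; }         // 1-byte sequence (ASCII)
    else if (b0 < 0xE0) { extra = 1; cp = b0 & 0x1F; }  // 2-byte lead
    else if (b0 < 0xF0) { extra = 2; cp = b0 & 0x0F; }  // 3-byte lead
    else                { extra = 3; cp = b0 & 0x07; }  // 4-byte lead
    for (int i = 0; i < extra; ++i)
        cp = (cp << 6) | (in.get() & 0x3F);             // take 6 bits per continuation byte
    return cp;
}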
'w' classes do not read and write UTF-8. They read and write UTF-16. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or write it in UTF-16 in the first place.

Can't read Unicode (Japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if (!myfile)
        cout<<"While opening a file an error is encountered"<<endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while (myfile.good())
    {
        getline(myfile, line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...).
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream, not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly, but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, little-endian. If not, you will have trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx