Why am I getting these invalid characters before my file data? - c++

I am trying to read a file into a string, either with the getline function or with fileContents.assign( (istreambuf_iterator<char>(myFile)), (istreambuf_iterator<char>()));
Either way gives me the output shown in the image.
First way:
string fileContents;
ifstream myFile("textFile.txt");
while(getline(myFile,fileContents))
cout<<fileContents<<endl;
Alternate way:
string fileContents;
ifstream myFile(fileName.c_str());
if (myFile.is_open())
{
fileContents.assign( (istreambuf_iterator<char>(myFile) ),
(istreambuf_iterator<char>() ) );
cout<<fileContents;
}

The file begins with those characters, most likely a BOM to tell you what the encoding of the file is.
You probably are not able to see them in Windows Notepad because Notepad hides the encoding bytes. Use a text editor or hex viewer that shows the raw bytes of the file and you will see those characters.

Your file starts with a UTF-8 BOM (bytes 0xEF 0xBB 0xBF). You are reading the file's raw bytes as-is and outputting them to a display that is using an OEM font for codepage 437. To handle text files properly, especially Unicode-encoded text files, you need to read the first few bytes, check for a BOM (and there are several you can look for), and if detected then seek past the BOM and interpret the remaining bytes of the file in the specified encoding, in this case UTF-8.
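As a minimal sketch of the check described above, the following reads the raw bytes and drops a leading UTF-8 BOM if one is present. The function names stripUtf8Bom and readWithoutBom are made up for illustration, and only the UTF-8 BOM is handled here; the other BOMs (for example the UTF-16 ones) would need their own checks and a real conversion step.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) from `text`.
// Returns true when a BOM was found and removed.
bool stripUtf8Bom(std::string& text)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (text.compare(0, 3, bom) == 0) {
        text.erase(0, 3);
        return true;
    }
    return false;
}

// Hypothetical helper: reads the whole stream as raw bytes, then drops a
// UTF-8 BOM if there is one. The remaining bytes are still UTF-8 encoded.
std::string readWithoutBom(std::istream& in)
{
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    stripUtf8Bom(contents);
    return contents;
}
```

With a file, this could be called as readWithoutBom(myFile) on the ifstream from the question. Printing the result correctly on a console still depends on the console's code page.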

Related

Is there a way to ignore end of file characters while reading files in c++?

So I am trying to read a file into a program in C++, but there are random end of files thrown in throughout the file. When trying to read the file, ifstream stops reading when it hits an end of file character.
This is the code that I am using to try to read the file
size_t bytesAvailable = 1000;
std::ifstream file(directory, std::ifstream::in);
unsigned char headDataBuffer[1000];
file.read((char*)(&headDataBuffer[0]), bytesAvailable);
The read gets this far into the file but then stops when it reaches a certain character, which I later found out to be an end-of-file character. There is plenty of text afterwards, but I can't seem to get ifstream to read anything after that character. Is there a way to read the entire file without having to break it up into smaller chunks?
First few lines of the file
˜1È£….ƒÑäÄÕ!õÏ]ÀåM”Ú2jó8ÒQ;Fb#Ãë»Cé‚ 1³¸)æ¸)¼™Â¢¼mí¾J”ÜT’S·Õ}xÇ\'Ò¬Ëëk|&cõe´„[zÊN4äHH•Æpé€i‹,ɶ‰v%••¡ÁÎ:ïÂOÚåÀ‡É=wí7iÓOQ3Fg,‚¹ªGô“(stops right here) I9á¸"æ£/¼™Ù£«|¿¿FI€À^‚ ‚2 tÁ[;Åéúî2`9es¹Va°ÝNe-˜1È´’},••°ÛÙuòŸLÚቜÕ/9ñ7,Õ[uv/†í]¼CúŸ
Try opening the file in binary mode. On some platforms, text mode and binary mode behave differently: text mode may translate end-of-line sequences into LF, or treat a control character (possibly Ctrl+D or Ctrl+Z) as end-of-file.
size_t bytesAvailable = 1000;
std::ifstream file(directory, std::ifstream::in|std::ifstream::binary);
unsigned char headDataBuffer[1000];
file.read((char*)(&headDataBuffer[0]), bytesAvailable);
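If the goal is to read the entire file rather than a fixed 1000 bytes, a sketch along these lines may help. The name readAllBytes is hypothetical; the only important part is the std::ios::binary flag, which asks the stream not to translate line endings or react to control characters.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: returns every byte of the file at `path`, or an
// empty vector if the file cannot be opened.
std::vector<unsigned char> readAllBytes(const std::string& path)
{
    std::ifstream file(path.c_str(), std::ios::in | std::ios::binary);
    std::vector<unsigned char> bytes;
    if (!file)
        return bytes;

    // istreambuf_iterator reads unformatted bytes straight from the buffer.
    std::istreambuf_iterator<char> it(file), end;
    for (; it != end; ++it)
        bytes.push_back(static_cast<unsigned char>(*it));
    return bytes;
}
```

An empty result is ambiguous here (empty file or failed open); code that needs to tell the two apart would have to report the open failure separately.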

How do I remove the character "ï»¿" from the beginning of a text file in C++?

I'm trying to read a text file and put each word into a node of a binary search tree. However, the first word is always read as "ï»¿ + first word". For example, if my first word is "This", then the first word that is inserted into my node is "ï»¿This". I've been searching the forum for a solution. There was one post asking about the same problem in Java, but no one has addressed it in C++. Would anyone help me fix it? Thank you.
I came to a simple solution. I opened the file in Notepad and saved it as ANSI. After that, the file is read and passed correctly into the binary search tree.
That's UTF-8's BOM.
You need to read the file as UTF-8. If you don't need Unicode and only use the 128 ASCII code points, then save the file as ASCII or as UTF-8 without a BOM.
This is the byte order mark (BOM). What you see is the UTF-8 BOM as it appears when its bytes are displayed as ISO-8859-1. You have to tell your editor not to write BOMs, or use a different editor to strip them out.
In C++, you can use the following function to convert a UTF-8 BOM file to ANSI.
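#include <cstring>
#include <fstream>
#include <string>
using namespace std;

// Helper that is not shown in this answer; the signature is inferred from the call below.
char* change_encoding_from_UTF8_to_ANSI(char* text);

void change_encoding_from_UTF8BOM_to_ANSI(const char* filename)
{
    ifstream infile(filename);
    string strLine;
    string strResult;
    if (infile)
    {
        // The first 3 bytes (EF BB BF) are the UTF-8 BOM and are dropped from the output.
        // All the other bytes are assumed to be single-byte ASCII.
        if (getline(infile, strLine))
        {
            // Only strip when the BOM is really there; substr(3) would throw on a short line.
            if (strLine.compare(0, 3, "\xEF\xBB\xBF") == 0)
                strLine.erase(0, 3);
            strResult += strLine + "\n";
        }
        // Loop on getline itself; testing eof() first appends a spurious empty line.
        while (getline(infile, strLine))
            strResult += strLine + "\n";
    }
    infile.close();
    // length() + 1 leaves room for the terminating '\0' that strcpy writes.
    char* changeTemp = new char[strResult.length() + 1];
    strcpy(changeTemp, strResult.c_str());
    char* changeResult = change_encoding_from_UTF8_to_ANSI(changeTemp);
    strResult = changeResult;   // copied before the temporary buffer is released
    delete[] changeTemp;        // who owns changeResult depends on the helper
    ofstream outfile(filename);
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
}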
In debug mode, find out the symbol for the special character and then replace it. For example, in Java (std::string in C++ has no replaceAll member):
content.replaceAll("\uFEFF", "");

How to get a single character from a UTF-8 encoded Urdu string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to get the characters from that file one at a time so that I can convert each one into its equivalent Hindi character. When I try to get a single character from the input file and write it to the output file, I get some unknown, ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;
void main()
{
wchar_t arry[50];
wifstream inputfile("input.dat",ios::in);
wofstream outputfile("output.dat");
if(!inputfile)
{
cerr<<"File not open"<<endl;
exit(1);
}
while (!inputfile.eof()) // i am using this while just to
// make sure copy-paste operation of
// written urdu text from one file to
// another when i try to pick only one character
// from file, it does not work.
{ inputfile>>arry; }
int i=0;
while(arry[i] != '\0') // i want to get urdu character placed at
// each-index so that i can work on it to convert
// it into its equivalent hindi character
{ outputfile<<arry[i]<<endl;
i++; }
inputfile.close();
outputfile.close();
cout<<"Hello world"<<endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function, passing the "pseudo"-codepage CP_UTF8. In many cases, decoding the UTF-16 further isn't required, but I don't know about the languages you are referring to. If you expect non-BMP characters (with code points above 65535), you might want to decode the UTF-16 (or decode the UTF-8 yourself) to avoid having to deal with surrogate pairs, where one character takes two 16-bit words.
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
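The "read the first byte, figure out how many more bytes you need" idea can be sketched as a small hand-written decoder. This is only an illustration under stated assumptions: the names utf8SequenceLength and decodeUtf8 are made up, malformed input is replaced with U+FFFD, and the sketch does not reject overlong encodings or surrogate code points the way a strict decoder should.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` cannot start a sequence (for example a continuation byte).
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Decodes UTF-8 bytes into Unicode code points, one vector entry per
// character. Malformed or truncated sequences become U+FFFD.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes)
{
    // Payload bits carried by the lead byte, indexed by sequence length.
    static const unsigned char leadMask[] = { 0x00, 0x7F, 0x1F, 0x0F, 0x07 };

    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const std::size_t len = utf8SequenceLength(lead);
        bool ok = (len != 0 && i + len <= bytes.size());
        std::uint32_t cp = ok ? (lead & leadMask[len]) : 0;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                ok = false;                 // not a 10xxxxxx continuation byte
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(0xFFFD);          // replacement character
            ++i;                            // resynchronise on the next byte
        }
    }
    return out;
}
```

Each element of the result is one character of the input, so an Urdu letter stored as two bytes in the file comes out as a single value that can be looked up in a transliteration table.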
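The 'w' stream classes do not decode UTF-8 on their own. They convert the file's bytes to wchar_t through the stream's locale, and the default locale does not perform UTF-8 decoding. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or give the stream a locale that performs the UTF-8 conversion.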

Output data not the same as input data

I'm doing some file io and created the test below, but I thought testoutput2.txt would be the same as testinputdata.txt after running it?
testinputdata.txt:
some plain
text
data with
a number
42.0
testoutput2.txt (in some editors it's on separate lines, but in others it's all on one line)
some plain
਍ऀ琀攀砀琀ഀഀ
data with
਍ 愀  渀甀洀戀攀爀ഀഀ
42.0
int main()
{
//Read plain text data
std::ifstream filein("testinputdata.txt");
filein.seekg(0,std::ios::end);
std::streampos length = filein.tellg();
filein.seekg(0,std::ios::beg);
std::vector<char> datain(length);
filein.read(&datain[0], length);
filein.close();
//Write data
std::ofstream fileoutBinary("testoutput.dat");
fileoutBinary.write(&datain[0], datain.size());
fileoutBinary.close();
//Read file
std::ifstream filein2("testoutput.dat");
std::vector<char> datain2;
filein2.seekg(0,std::ios::end);
length = filein2.tellg();
filein2.seekg(0,std::ios::beg);
datain2.resize(length);
filein2.read(&datain2[0], datain2.size());
filein2.close();
//Write data
std::ofstream fileout("testoutput2.txt");
fileout.write(&datain2[0], datain2.size());
fileout.close();
}
It's working fine on my side. I ran your program with VC++ 6.0 and checked the output in Notepad and MS Word. Can you specify the name of the editor where you are seeing the problem?
You can't read Unicode text into a std::vector<char>. The char data type only works with narrow strings, and my guess is that the text file you're reading in (testinputdata.txt) is saved with either UTF-8 or UTF-16 encoding.
Try using the wchar_t type for your characters, instead. It is specifically designed to work with "wide" (or Unicode) characters.
Thou shalt verify thy input was successful! Although this would sort you out, you should also note that the number of bytes in the file has no direct relationship to the number of characters being read: there can be fewer characters than bytes (think of a Unicode character that needs multiple bytes when encoded as UTF-8) or vice versa (although the latter doesn't happen with any of the Unicode encodings). All you are experiencing is that read() couldn't read as many characters as you asked it to read, but write() happily wrote the junk you gave it.
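One way to verify the input, sketched here with a hypothetical helper named readChecked, is to ask the stream how many characters read() actually delivered via gcount() and shrink the buffer to that size before writing it anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads up to `maxBytes` from `in` and returns only
// the characters that were actually read, never the unused buffer tail.
std::vector<char> readChecked(std::istream& in, std::size_t maxBytes)
{
    std::vector<char> buffer(maxBytes);
    if (maxBytes == 0)
        return buffer;

    in.read(&buffer[0], static_cast<std::streamsize>(maxBytes));
    // gcount() reports how many characters the last read() extracted.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

In the program from the question, the vector returned by a call such as readChecked(filein, length) would then be written with its own size(), so a short read no longer leaks uninitialised bytes into the output file.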

Mismatch between characters put and read

I'm trying to write a Huffman encoder but I'm getting some compression errors. I identified the problem as mismatches between characters that were put() to the ofstream and the characters read() from the same file.
One specific instance of this problem :
The put() writes ASCII character 10 (Line feed)
The read() reads ASCII character 13 (Carriage return)
I thought read() and put() read and write raw data (no character translations), so I'm not sure why this is happening. Can someone help me out?
Here is the ofstream instance for writing the compressed file:
std::ofstream compressedFileStream(getCompressedFileName(),std::ios::binary||std::ios::ate);
and the ifstream instance for reading the same
std::ifstream fileInput(getFileName()+".huf",std::ios::binary);
The code is running on Windows 7 and all streams in the program are opened in binary mode.
Not opening in binary mode due to a typo:
std::ofstream compressedFileStream(getCompressedFileName(),std::ios::binary||std::ios::ate)
should be:
std::ofstream compressedFileStream(getCompressedFileName(),std::ios::binary|std::ios::ate)
// ^
|, not ||.
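To see why the two operators differ, here is a small sketch; the function names are made up for illustration. Bitwise | merges the bits of both flags, while logical || only asks whether either operand is non-zero and yields a plain bool, so the binary flag is lost. What a given standard library then does with that bool as an open mode is implementation-specific, and some may refuse to compile it.

```cpp
#include <cassert>
#include <ios>

// True when every bit of `flag` is set in `mode`.
bool hasFlag(std::ios_base::openmode mode, std::ios_base::openmode flag)
{
    return (mode & flag) == flag;
}

// Bitwise OR: the result carries both the binary and the ate flag.
std::ios_base::openmode combinedMode()
{
    return std::ios::binary | std::ios::ate;
}

// Logical OR: both operands are non-zero, so the result is simply true
// and no longer says anything about which flags were requested.
bool logicalOrResult()
{
    return std::ios::binary || std::ios::ate;
}
```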
The symptoms show that you are creating the ofstream in text mode, or creating it from a file descriptor that was opened in text mode.
You will want to pass ios::binary to it at construction time, or it may run in text mode on Windows.
Now that you have added the code, the cause turns out to be a typo:
std::ios::binary||std::ios::ate
should be
std::ios::binary|std::ios::ate
On Windows, if you are writing binary data, you need to open the file in binary mode (std::ios::binary).
Similarly, if you are reading binary data, you need to open the input file in binary mode as well.