What should binary file look like after conversion from text? - c++

Problem:
Split the binary I/O from the example code into two: one program that converts an ordinary text file into binary and one program that reads binary and converts into text. Test these programs by comparing a text file with what you get by converting it to binary and back.
Example code:
#include "std_lib_facilities.h"
int main(){
cout <<"Please enter input file name.\n";
string name;
cin >> name;
// open file to read, with no byte interpretation
ifstream ifs(name.c_str(), ios_base::binary);
if(!ifs) error("Can't open input file: ", name);
cout << "Please enter output file name.\n";
cin >> name;
// open file to write
ofstream ofs(name.c_str(), ios_base::binary);
if(!ofs) error("Can't open output file: ", name);
vector<int> v;
// read from binary file
int i;
while(ifs.read(as_bytes(i), sizeof(int))) v.push_back(i);
// do something with v
// write to binary file
for(int i = 0; i < v.size(); ++i) ofs.write(as_bytes(v[i]), sizeof(int));
return 0;
}
Here is my code, instead of reading and writing int values, I tried with strings:
#include "std_lib_facilities.h"
void textToBinary(string, string);
//--------------------------------------------------------------------------------
int main(){
const string info("This program converts text to binary files.\n");
cout << info;
const string testFile("test.txt");
const string binaryFile("binary.bin");
textToBinary(testFile, binaryFile);
getchar();
return 0;
}
//--------------------------------------------------------------------------------
void textToBinary(string ftest, string fbinary){
// open text file to read
ifstream ift(ftest);
if(!ift) error("Can't open input file: ", ftest);
// copy contents in vector
vector<string>textFile;
string line;
while (getline(ift,line)) textFile.push_back(line);
// open binary file to write
ofstream fb(fbinary, ios::binary);
if(!fb) error("Can't open output file: ", fbinary);
// convert text to binary, by writing the vector contents
for(size_t i = 0; i < textFile.size(); ++i){ fb.write(textFile[i].c_str(), textFile[i].length()); fb <<'\n';}
cout << "Conversion done!\n";
}
Note:
My text file contains Lorem Ipsum, no digits or special punctuation. After I write the text using binary mode, there is a perfect character interpretation and the source text file looks exactly like the destination. (My attention goes to the fact that when using binary mode and the function write(as_bytes(), sizeof()), the content of the text file is translated perfectly and there are not mistakes.)
Question:
What should the binary file look like after I use binary mode (no char interpretation) and the function write(as_bytes(), sizeof()) when writing?

In both Unix-land and Windows a file is primarily just a sequence of bytes.
With the Windows NTFS file system (which is default) you can have more than one sequence of bytes in the same file, but there is always one main sequence which is the one that ordinary tools see. To ordinary tools every file appears as just a single sequence of bytes.
Text mode and binary mode in C++ concern whether the basic i/o machinery should translate to and from an external convention. In Unix-land there is no difference. In Windows text mode translates newlines from internal single byte C convention (namely ASCII linefeed, '\n'), to external double byte Windows convention (namely ASCII carriage return '\r' + linefeed '\n'), and vice versa. Also, on input in Windows, encountering a single byte value 26, a "control Z", is or can be interpreted as end of file.
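To see the translation for yourself, here is a minimal sketch that writes the same text in either mode and then dumps the bytes that ended up in the file. The helper names stored_bytes and to_hex and the file path you pass in are made up for illustration; they are not from the question's code.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: write `text` to `path` in text or binary mode, then
// read the file back in binary mode so the bytes actually stored are returned.
std::string stored_bytes(const std::string& path, const std::string& text, bool binary)
{
    std::ios::openmode mode = std::ios::out;
    if (binary) mode |= std::ios::binary;
    {
        std::ofstream out(path, mode);
        out << text;
    }   // stream is closed here, so the data has been flushed

    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Hypothetical helper: render each byte as two hex digits, separated by spaces.
std::string to_hex(const std::string& bytes)
{
    std::ostringstream hex;
    hex << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) hex << ' ';
        hex << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(bytes[i]));
    }
    return hex.str();
}
```

Calling to_hex(stored_bytes("demo.bin", "a\nb\n", true)) gives 61 0a 62 0a on every platform. With false (text mode) you should get the same bytes in Unix-land and 61 0d 0a 62 0d 0a in Windows.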
Regarding the literal question,
“The question is in what format are they written in the binary file, shouldn't they be written in not-interpreted form, i.e. raw bytes?”
the text is written as raw bytes in both cases. The difference is only in how newlines are translated to the external convention for newlines. Since your text 1) doesn't contain any newlines, there's no difference. Edit: Not shown in your code except by scrolling it sideways, there's a fb <<'\n' that outputs a newline to the file opened in binary mode, and if this produces the same bytes as in the original text file, then there is no effective translation, which implies you're not doing this in Windows.
About the extra streams for Windows files, they're used e.g. for Windows (file) Explorer's custom file properties, and they're accessible e.g. via a bug in the Windows command interpreter, like this:
C:\my\forums\so\0306>echo This is the main stream >x.txt
C:\my\forums\so\0306>dir | find "x"
04-Jul-15 08:36 PM 26 x.txt
C:\my\forums\so\0306>echo This is a second byte stream, he he >x.txt:2nd
C:\my\forums\so\0306>dir | find "x"
04-Jul-15 08:37 PM 26 x.txt
C:\my\forums\so\0306>type x.txt
This is the main stream
C:\my\forums\so\0306>type x.txt:2nd
The filename, directory name, or volume label syntax is incorrect.
C:\my\forums\so\0306>find /v "" <x.txt:2nd
This is a second byte stream, he he
C:\my\forums\so\0306>_
I just couldn't resist posting an example. :)
1) You state that “My text file contains Lorem Ipsum, no digits or special punctuation”, which indicates no newlines.

Related

how to make a new line in binary file c++?

Can anybody help me with this simple thing in file handling?
This is my code:
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int main()
{
fstream f;
f.open("input05.bin", ios_base::binary | ios_base::out);
string str = "";
cout << "Input text:"<<endl;
while (1)
{
getline(cin, str);
if (str == "end")
break;
else {
f.write((char*)&str, sizeof(str));
}
}
f.close();
f.open("input05.bin", ios_base::in | ios_base::binary);
while (!f.eof())
{
string st;
f.read((char*)&st, sizeof(st));
cout << st << endl;
}
f.close();
}
It is running successfully now. I want to format the output of the text file according to my way.
I have:
hi this is first program i writer this is an experiment
How can I make my output file look like the following:
hi this is first program
I writer this is an experiment
What should I do to format the output in that way?
First of all,
string str;
....
f.write((char*)&str, sizeof(str));
is absolutely wrong, as you cast a pointer to an object of type std::string to a pointer to a character, i.e. char*. Note that a std::string is an object with data members such as the length of the string and a pointer to the memory where the string content is kept; it is not a C string of type char*. Further, sizeof(str) gives you the size of this "wrapper object" with the length member and the pointer, but it does not give you the length of the string.
So it should be something like this:
f.write(str.c_str(), str.length());
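The read side has the same problem, since f.read((char*)&st, sizeof(st)) overwrites the string object itself. One way to get the strings back, sketched here under my own assumptions, is to store a length in front of each string. The names write_string and read_string and the 4-byte length in host byte order are illustrative choices, not something from the question, and the resulting file is not portable between machines with different byte order.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical format: a 4-byte length (host byte order), then the characters.
void write_string(std::ostream& out, const std::string& s)
{
    const std::uint32_t length = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&length), sizeof length);
    out.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Read one string written by write_string; returns false when no complete
// record is left in the stream.
bool read_string(std::istream& in, std::string& s)
{
    std::uint32_t length = 0;
    if (!in.read(reinterpret_cast<char*>(&length), sizeof length)) return false;
    s.resize(length);
    if (length == 0) return true;
    return static_cast<bool>(in.read(&s[0], static_cast<std::streamsize>(length)));
}
```

Both functions take any stream, so they work with an fstream opened with ios_base::binary as in the question, or with a std::stringstream for a quick check. The read loop then becomes while (read_string(f, st)) cout << st << endl;, which also avoids the while (!f.eof()) pattern.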
Another thing is the OS-dependent handling of the newline character. Depending on the operating system, a new line is represented either by 0x0d 0x0a (Windows) or just by 0x0a (Unix-like systems). In memory, C++ always treats a new line as the single character '\n' (i.e. 0x0a). When writing to a file in text mode, C++ will expand a '\n' to 0x0d 0x0a or just keep it as 0x0a (depending on the platform). If you write to a file in binary mode, however, this replacement will not occur. So if you create a file in binary mode and insert only a 0x0a, then, depending on the platform, viewing the file may not show a line break.
Try to write ...
f.write(str.c_str(), str.length());
f.put('\r');
f.put('\n');
such that it will work on your platform (and will not work on other platforms then).
That's why you should write in text mode if you want to write text.

Reading file returns weird results

I am writing a compression program with a huffman tree.
It generates a file containing a bit of overhead needed to decompress, followed by a bunch of effectively random bits, which are split into groups of 8 and turned into the char corresponding to those 8 bits. So essentially random chars. These are then written into a file.
When reading this file two problems occur:
The chars shown when I cout the random chars are different from the ones in the file.
My loop that reads the file stops only a few lines in.
I'm using the following function to read the file:
void Convertor::HuffmanToFile(string outputLocation){
string fileInfo, fileDataPiece;
ifstream inputFile;
ofstream outputFile;
stringstream fileData;
outputFile.open(outputLocation, ofstream::out | ofstream::trunc);
inputFile.open(inputLocation);
if (inputFile.fail()) {
cerr << "Error opening text file" << endl;
exit(1);
}
while (inputFile >> fileDataPiece){
fileData << fileDataPiece;
}
inputFile.close();
Decoder decoder(fileInfo,fileData.str());
outputFile << decoder.decodeInfo();
outputFile.close();
}
If anyone could hand me a clue as to where I should look into that would be great!
Be careful when using operator>> from istream into string - it will skip over white space characters! My guess is that is causing the differences.
You are loading the whole file at once. Your way is unnecessarily complicated. One good way to do it in C++ is described here: Read whole ASCII file into C++ std::string
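A minimal sketch of the difference, assuming the compressed data is read from an ordinary file. The function names are mine. read_all_bytes keeps every byte, while read_with_extraction mimics the while (inputFile >> fileDataPiece) loop from the question and silently drops any byte that counts as white space. Opening in binary mode may also matter on Windows, where text mode translates line endings and can treat a byte with value 26 as end of file, which would fit a loop that stops early.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read every byte of the file, including spaces, tabs and newlines.
std::string read_all_bytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Mimics the loop from the question: operator>> skips all white space,
// so those bytes never reach the result.
std::string read_with_extraction(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string piece;
    std::string all;
    while (in >> piece) all += piece;
    return all;
}
```

For a file holding the seven bytes "ab c\nde", the first function returns all seven and the second returns only "abcde".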

visual studio c++ cin big string from command line

When I run the following program and paste 50000 symbols into the command line, the program gets only 4096 symbols. Could you please suggest what I can do in order to get the full list of symbols?
#include <iostream>
#include <string>
using namespace std;
int main()
{
char temp[50001];
while (cin.getline(temp, 50001, '\n'))
{
string s(temp);
cout << s.size() << endl;
}
return 0;
}
P.S.
When I read the symbols from file using fstream, it's OK
I'm taking a leap here, but since many PowerShell terminals have 4096 truncation limits (take a look at the Out-File documentation), this is likely a Windows command line limitation rather than a getline limitation.
The same problem has been encountered previously by others: https://github.com/Discordia/large-std-input/blob/master/LargeStdInput/Main.cpp
I don't understand why you are reading into a character array, then transferring it into a string.
In any case, your issue may be with repeated allocations.
Reading into std::string directly
Two simple lines:
std::string s;
getline(cin, s, '\n');
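As a quick check that getline into a std::string has no fixed limit of its own, here is a small sketch that reads from any istream. The function name line_lengths is made up, and a std::istringstream stands in for cin. It does not lift whatever limit the console imposes on pasted input; it only shows that the limit is not in getline.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Return the length of each line read from `in`. std::string grows as
// needed, so there is no buffer size to choose.
std::vector<std::size_t> line_lengths(std::istream& in)
{
    std::vector<std::size_t> lengths;
    std::string line;
    while (std::getline(in, line)) lengths.push_back(line.size());
    return lengths;
}
```

Fed a stream that holds one line of 50000 characters, the first reported length is 50000. Calling line_lengths(std::cin) gives the same loop as in the question without the char array.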
Reading into an array first
Yes, there is a simpler method:
#define BUFFER_SIZE 8196 // Very important, named constant
char temp[BUFFER_SIZE];
cin.getline(temp, BUFFER_SIZE, '\n');
// Get the number of characters actually read
unsigned int chars_read = cin.gcount();
std::string s(temp, chars_read); // Here's how to transfer the characters.
Using a debugger, you need to view the value in chars_read to verify that the quantity of characters read is valid.
Binary reading
Some platforms provide translations between the data read and your program. For example, Windows uses Ctrl-Z as an EOF character; Linux uses Ctrl-D.
The input data may use UTF encoding and contain values outside the range of ASCII printable set.
So, the preferred method is to read from a stream opened in binary mode. Unfortunately, cin cannot be opened easily in binary mode.
See Open cin in binary
The preferred method, if possible, is to put the text into a file and read from the file.

Converting between text files and binary files in C++

To convert an ordinary text file into binary and then convert that binary file back into a text file, so that the first text file equals the last one, I have written the code below.
But the bintex text file and the final text file aren't equal. I don't know which part of the code is incorrect.
Input sample ("bintex") contains this: 1983 1362
The result ("final") contains this: 959788084
which of course are not equal.
#include <iostream>
#include <fstream>
using namespace std;
int main() try
{
string name1 = "bintex", name2 = "texbin", name3 = "final";
ifstream ifs1(name1.c_str());
if(!ifs1) error("Can't open file for reading.");
vector<int>v1, v2;
int i;
while(ifs1.read(as_bytes(i), sizeof(int)));
v1.push_back(i);
ifs1.close();
ofstream ofs1(name2.c_str(), ios::binary);
if(!ofs1) error("Can't open file for writting.");
for(int i=0; i<v1.size(); i++)
ofs1 << v1[i];
ofs1.close();
ifstream ifs2(name2.c_str(), ios::binary);
if(!ifs2) error("Can't open file for reading.");
while(ifs2.read(as_bytes(i), sizeof(int)));
v2.push_back(i);
ifs2.close();
ofstream ofs2(name3.c_str());
if(!ofs2) error("Can't open file for writting.");
for(int i=0; i<v2.size(); i++)
ofs2 << v2[i];
ofs2.close();
keep_window_open();
return 0;
}
//********************************
catch(exception& e)
{
cerr << e.what() << endl;
keep_window_open();
return 0;
}
What is this?
while(ifs1.read(as_bytes(i), sizeof(int)));
It looks like a loop that reads all input and throws it away. The line afterward suggests that you should be using braces instead of a semicolon there, and doing the push_back inside the block.
Your read and write operations aren't symmetric.
ifs1.read(as_bytes(i), sizeof(int))
grabs 4 bytes, and dumps the values into the char* it's passed.
ofs1 << v1[i];
outputs the integer in v1[i] as text. Those are very, very different formats.
If you used >> to read you would have a lot more success.
To expound, the first read fills i with the bytes {'1','9','8','3'}. Punned to an int, those bytes are a large number with no relation to 1983, and the 959788084 you are seeing comes from the same kind of punning of text digits. Your second read would be {' ','1','3','6'}, likely not what you'd hoped for either.
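To make the asymmetry concrete, here is a small sketch with two made-up helpers. as_little_endian_int combines four text bytes the way a little-endian machine with 4-byte ints would see them after read(as_bytes(i), sizeof(int)); both of those properties are assumptions about your machine. parse_ints reads the same kind of text with >>.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Assumes a little-endian layout: the first byte in the file becomes the
// least significant byte of the integer.
std::uint32_t as_little_endian_int(const std::string& four_bytes)
{
    assert(four_bytes.size() == 4);
    std::uint32_t value = 0;
    for (int k = 3; k >= 0; --k)
        value = (value << 8) | static_cast<unsigned char>(four_bytes[k]);
    return value;
}

// Formatted input: parses whitespace-separated decimal integers.
std::vector<int> parse_ints(const std::string& text)
{
    std::istringstream in(text);
    std::vector<int> values;
    for (int v; in >> v; ) values.push_back(v);
    return values;
}
```

Under those assumptions, as_little_endian_int("1983") is 859322673, while parse_ints("1983 1362") yields 1983 and 1362. The digits "4459" give 959788084, the number reported in the question, which fits the final value of i being text digits punned to an int.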
It's not clear (to me, at least), what you are trying to do.
When you say that the original file contains 1983 1362, what do
you really mean? That it contains two four byte integers, in
some unspecified format, whose values are 1983 and 1362? If so,
the problem is probably due to your machine not using the same
format. You cannot, in general, just read bytes (using
istream::read) and expect them to mean anything in your
machine's internal format. You have to read the bytes into
a buffer, and unformat them, according to the format with which
they were written.
Of course, opening a stream in binary mode doesn't mean that
the actual data are in some binary format; it just affects
things like how (or more strictly speaking, whether) line
endings are encoded, and how end of file is recognized.
(Strictly speaking, a binary file is not divided into lines. It
is just a sequence of bytes. Of course, some of those bytes
might have values that you, in your program, interpret as new
line characters.) If your file actually contains nine bytes
with characters corresponding to "1983 1362", then you'll have
to parse them as a text format, even if the file is written in
binary. You can do this by reading the entire file into
a string, and using std::istringstream; or, on most common
systems (but not necessarily on all exotics) by using >> to
read, just as you would with a text file.
EDIT:
Just a simple reminder: you don't show the code for as_bytes,
but I'm willing to guess that there's a reinterpret_cast in
it. And any time you have to use a reinterpret cast, you can be
very sure that what you're doing isn't portable, and if it's
supposed to be portable, you're doing it wrong.
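As an illustration of "read the bytes into a buffer, and unformat them", here is a minimal sketch that fixes the external format explicitly instead of relying on the machine's internal one. The choice of 32 bits with the least significant byte first is arbitrary and only for the example, and the function names are made up.

```cpp
#include <array>
#include <cstdint>

// Format: four bytes, least significant byte first, whatever the byte
// order of the machine running the code.
std::array<unsigned char, 4> encode_u32_le(std::uint32_t value)
{
    std::array<unsigned char, 4> bytes{};
    for (int k = 0; k < 4; ++k)
        bytes[k] = static_cast<unsigned char>((value >> (8 * k)) & 0xFFu);
    return bytes;
}

// Unformat: rebuild the value from the same fixed layout.
std::uint32_t decode_u32_le(const std::array<unsigned char, 4>& bytes)
{
    std::uint32_t value = 0;
    for (int k = 0; k < 4; ++k)
        value |= static_cast<std::uint32_t>(bytes[k]) << (8 * k);
    return value;
}
```

With this layout 1983 is stored as the bytes bf 07 00 00, and decoding those bytes gives 1983 back on any machine. No reinterpret_cast is involved; the bytes can be written and read with ostream::write and istream::read on a plain char buffer.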

Binary file I/O issues

Edit: I'm trying to convert a text file into bytes. I'm not sure if the code is turning it into bytes or not. Here is the link to the header so you can see the as_bytes function.
link
#include "std_lib_facilities.h"
int main()
{
cout << "Enter input file name.\n";
string file;
cin >> file;
ifstream in(file.c_str(), ios::binary);
int i;
vector<int> bin;
while(in.read(as_bytes(i), sizeof(int)))
bin.push_back(i);
ofstream out(file.c_str(), ios::out);
for(int i = 0; i < bin.size(); ++i)
out << bin[i];
keep_window_open();
}
Note that now the out stream just outputs the contents of the vector. It doesn't use the write function or the binary mode. This converts the file to a large line of numbers - is this what I'm looking for?
Here is an example of the second code's file conversion:
that guy likes to eat lots of pie (not sure if this was exact text)
turns to
543518319544825700191924850016351970295432362115448292821701667182186922608417526375411952522351186935715718643976841768956006
The reason your first method didn't change the file is that all files are stored in the same way. The only "difference" between text files and binary files is that text files contain only bytes that can be shown as ASCII characters, while binary files* have a much more random variety and order of bytes. So you are reading bytes in as bytes and then outputting them as bytes again!
*I'm including Unicode text files as binary, since they can use multiple bytes to denote one code point, depending on the code point and the encoding used.
The second method is also fairly simple. You are reading in the bytes, as before, and storing them in integers (which are probably 4 bytes long). Then you are just printing out the integers as if they are integers, so you are seeing a string of numbers.
As for why your first method cut off some of the bytes, you're right in that it's probably some bug in your code. I thought it was more important to explain what the ideas are in this case, rather than debug some test code.
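One likely cause of the cut-off bytes, assuming the first method used the same while(in.read(as_bytes(i), sizeof(int))) loop, is that the last read fails whenever the file size is not a multiple of sizeof(int), so the trailing bytes never make it into the vector. The sketch below counts what such a loop sees; the struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

struct ChunkCount {
    std::size_t full_chunks;     // reads that delivered all `chunk` bytes
    std::size_t leftover_bytes;  // bytes delivered by the final, failed read
};

// Read `in` in fixed-size chunks, the way the loop in the question does.
ChunkCount count_chunks(std::istream& in, std::size_t chunk)
{
    assert(chunk > 0);
    ChunkCount result{0, 0};
    std::string buffer(chunk, '\0');
    while (in.read(&buffer[0], static_cast<std::streamsize>(chunk)))
        ++result.full_chunks;
    // gcount() reports how many bytes the failed read still extracted.
    result.leftover_bytes = static_cast<std::size_t>(in.gcount());
    return result;
}
```

For a 10-byte input and 4-byte chunks this reports 2 full chunks and 2 leftover bytes; those 2 bytes are the ones a loop that only keeps successful reads would drop.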