Change a text file in random access - c++

I have a text file and want to change it in some places, for example in the byte range 4030 to 4060, as follows:
if a character 'C' or 'c' is followed by 'G' or 'g', it must be changed to 'B'.
The input file is a text file and I want a changed text file as output. As far as I know there is no random access in text files, so I must open the file in binary mode to make the changes, but then the output file will be binary and I have no idea how to get text output. The code is below:
int main()
{
string str, cstr;
ReadTextFile("in", 4030, 4060);
return 0;
}
string ReadTextFile(string path, int from, int to)
{
fstream fp(path.c_str(), ios::in|ios::out|ios::binary);
char *target;
string res, str;
target = new char[to - from + 1];
if (!target)
{
cout << "Cannot allocate memory." << endl;
return "";
}
fp.seekg(from);
fp.read(target, to - from);
target[to - from] = 0;
res = target;
str = changestring(res);
fp.seekg(from);
fp.write((char *)&str, to-from);
return res;
}
string changestring(string str)
{
int l = str.length();
l = l-1;
for (int i = 0; i <= l; i++)
{
if (str[i] == 'C' || str[i] == 'c')
{
int j = i+1;
if (str[j] == 'G' || str[j] == 'g')
str[i] = 'B';
}
}
return str;
}

You misunderstand text and binary. It's not true that if you open a file in binary mode then the output will be binary. All binary mode does in practice is stop \n being translated to \r\n on Windows systems. In other words it affects the line endings in a text file, nothing else.
You are getting binary data in your output because you are writing the string object's internal representation (its pointers) to the file. Write its characters instead.
Change
fp.write((char *)&str, to-from);
to
fp.write(str.c_str(), to-from);
This would also work
fp << str;
To get text output, write text. It doesn't matter whether you do that with write or with <<; text is just text. Stop thinking in terms of two modes of output, binary and text. That's not accurate: it's what you write, not how you write it, that determines whether the output is text or binary.
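As a minimal sketch of the whole fix, assuming the goal is to patch a half-open byte range [from, to) in place: the helper names replace_cg and patch_range are made up for this illustration and are not from the question. The file is opened in binary mode only to get exact byte offsets; what gets written back is still plain text.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Replace 'C'/'c' with 'B' wherever it is directly followed by 'G'/'g'.
// (Hypothetical helper; mirrors the intent of changestring in the question.)
std::string replace_cg(std::string s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const bool is_c = s[i] == 'C' || s[i] == 'c';
        const bool next_is_g = s[i + 1] == 'G' || s[i + 1] == 'g';
        if (is_c && next_is_g) s[i] = 'B';
    }
    return s;
}

// Rewrite bytes [from, to) of the file in place. Returns false on I/O failure.
bool patch_range(const std::string& path, std::streamoff from, std::streamoff to) {
    assert(from >= 0 && from <= to);
    std::fstream fp(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!fp) return false;

    std::string chunk(static_cast<std::size_t>(to - from), '\0');
    fp.seekg(from);
    fp.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
    chunk.resize(static_cast<std::size_t>(fp.gcount()));  // file may be shorter than 'to'
    fp.clear();  // a short read sets eofbit/failbit; reset before writing

    const std::string patched = replace_cg(chunk);
    fp.seekp(from);
    // Write the characters themselves, never the address of the string object.
    fp.write(patched.data(), static_cast<std::streamsize>(patched.size()));
    return static_cast<bool>(fp);
}
```

Calling patch_range("in", 4030, 4060) would then correspond to the range in the question. Note that a 'C' sitting on the last byte of the range is left alone in this sketch, because its follower lies outside the chunk that was read.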

Related

Handling extra bytes in huffman compression/decompression

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also then encodes and compresses an output file and currently has some functionality in decompression and decoding.
In summary, my program takes an input file, compresses and encodes it into an output file, closes the output file, reopens the encoded file as input, and writes a new output file that is supposed to contain a decoded message identical to the original text input file.
My problem is that in my test run while compressing I notice that I have 3 extra bytes and in turn when I decompress and decode my encoded file, these 3 extra bytes are being decoded to my output file. Depending on the amount of text in the original input file, my other tests output these extra bytes.
My research has led me to a few suggestions, such as making the first 8 bytes of the encoded output file the 64 bits of an unsigned long long giving the number of bytes in the file, or using a pseudo-EOF. I am stuck on how to handle this and on which of the two is the smarter approach given the code I have already written, or whether either is a smart way at all.
Any guidance or solution to this problem is appreciated.
(For the encodeOutput function, fileName is the input file parameter and fileName2 is the output file parameter.)
(For the decodeOutput function, fileName2 is the input file parameter and fileName3 is the output file parameter.)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
freq[256] holds the frequency of each ascii character read or holds 0 if it is not in original input file.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile; //to read file
ifile.open(fileName, ios::binary);
if (!ifile)//to check if file is open or not
{
die("Can't read again"); // function that exits program if can't open
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get(); //read one char from file and store it in int
char buffer = 0, bit_count = 0;
while (read != -1) {//run this loop until reached to end of file(-1)
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << (buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
void decodeOutput(const string & fileName2, const string & fileName3, string code[256], const unsigned long long freq[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
priority_queue < node > q;
for (unsigned i = 0; i < 256; i++) {
if (freq[i] == 0) {
code[i] = "";
}
}
for (unsigned i = 0; i < 256; i++)
if (freq[i])
q.push(node(unsigned(i), freq[i]));
if (q.size() < 1) {
die("no data");
}
while (q.size() > 1) {
node *child0 = new node(q.top());
q.pop();
node *child1 = new node(q.top());
q.pop();
q.push(node(child0, child1));
} // created the tree
string answer = "";
const node * temp = &q.top(); // root
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) { //reading 8 bits at a time
if (((c >> p) & 1) == 0) { // if bit is a 0
temp = temp->child0; // go left
}
else { // if bit is a 1
temp = temp->child1; // go right
}
if (temp->child0 == NULL && temp->child1 == NULL) // leaf node
{
answer += temp->value;
temp = &q.top();
}
}
}
ofile << answer;
}
Because of integral promotion rules, (buffer << (8 - bit_count)) is an int expression, so operator<< writes it as formatted decimal text (several digit characters) instead of one raw byte. To write only one byte, you need to cast the result to char.
ofile << char(buffer << (8 - bit_count));
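The difference can be checked without touching a file. The following is a small sketch using std::ostringstream in place of the ofstream; the two function names are invented for this demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Bytes produced when the promoted (int) shift result is streamed directly.
std::size_t bytes_written_as_int(char buffer, int bit_count) {
    assert(bit_count > 0 && bit_count < 8);
    std::ostringstream out;
    out << (buffer << (8 - bit_count));  // int: formatted as decimal digits
    return out.str().size();
}

// Bytes produced when the result is cast back to char first.
std::size_t bytes_written_as_char(char buffer, int bit_count) {
    assert(bit_count > 0 && bit_count < 8);
    std::ostringstream out;
    out << static_cast<char>(buffer << (8 - bit_count));  // one raw byte
    return out.str().size();
}
```

With buffer = 5 and bit_count = 3 the shift yields 160, so the int version emits the three characters "160", while the char version emits a single byte.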

Handling last byte in huffman compression/decompression

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also then encodes and compresses an output file and then is able to take the compressed file as an input file and does decompression and decoding.
In summary, my program takes an input file, compresses and encodes it into an output file, closes the output file, reopens the encoded file as input, and writes a new output file that is supposed to contain a decoded message identical to the original text input file.
My current problem with this program: when decoding the compressed file I get an extra character or so that is not in the original input file. As far as I know, this is due to the trash bits that pad the last byte. From my research, one solution may be to use a pseudo-EOF character to stop decoding before the trash bits are read, but I am not sure how to implement this in my current encoding and decoding functions, so all guidance and help is much appreciated.
My end goal is to be able to use this program to also completely decode the encoded file without the trash bits sent to output file.
Below I have two functions, encodeOutput and decodeOutput, that handle the compression and decompression.
(For the encodeOutput function, fileName is the input file parameter and fileName2 is the output file parameter.)
(For the decodeOutput function, fileName2 is the input file parameter and fileName3 is the output file parameter.)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
freq[256] holds the frequency of each ascii character read or holds 0 if it is not in original input file.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile; //to read file
ifile.open(fileName, ios::binary);
if (!ifile)//to check if file is open or not
{
die("Can't read again"); // function that exits program if can't open
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get(); //read one char from file and store it in int
char buffer = 0, bit_count = 0;
while (read != -1) {//run this loop until reached to end of file(-1)
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << char(buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
void decodeOutput(const string & fileName2, const string & fileName3, string code[256], const unsigned long long freq[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
priority_queue < node > q;
for (unsigned i = 0; i < 256; i++) {
if (freq[i] == 0) {
code[i] = "";
}
}
for (unsigned i = 0; i < 256; i++)
if (freq[i])
q.push(node(unsigned(i), freq[i]));
if (q.size() < 1) {
die("no data");
}
while (q.size() > 1) {
node *child0 = new node(q.top());
q.pop();
node *child1 = new node(q.top());
q.pop();
q.push(node(child0, child1));
} // created the tree
string answer = "";
const node * temp = &q.top(); // root
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) { //reading 8 bits at a time
if (((c >> p) & 1) == 0) { // if bit is a 0
temp = temp->child0; // go left
}
else { // if bit is a 1
temp = temp->child1; // go right
}
if (temp->child0 == NULL && temp->child1 == NULL) // leaf node
{
answer += temp->value;
temp = &q.top();
}
}
}
ofile << answer;
}
Change the arrays to freq[257] and code[257], and set freq[256] to 1. Symbol 256 is your pseudo-EOF; it appears exactly once in the stream, at the end. When encoding, emit the code for symbol 256 after the last input byte. When decoding, stop as soon as you decode symbol 256.
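Here is a rough sketch of that idea, kept independent of the tree-building code. It assumes the codes are available as a std::map from symbol to bit string (a stand-in for the code[257] array), works on in-memory strings rather than files, and decodes by table lookup instead of walking the tree. The names kEof, encode and decode are illustrative only.

```cpp
#include <cassert>
#include <map>
#include <string>

const int kEof = 256;  // pseudo-EOF symbol, one past the largest byte value

// Pack the code of every input byte, then the pseudo-EOF code, into bytes.
std::string encode(const std::string& text, const std::map<int, std::string>& code) {
    assert(code.count(kEof) == 1);
    std::string out;
    unsigned char buffer = 0;
    int bit_count = 0;
    auto emit = [&](const std::string& bits) {
        for (char bit : bits) {
            buffer = static_cast<unsigned char>((buffer << 1) | (bit != '0'));
            if (++bit_count == 8) {
                out.push_back(static_cast<char>(buffer));
                buffer = 0;
                bit_count = 0;
            }
        }
    };
    for (unsigned char ch : text) emit(code.at(ch));
    emit(code.at(kEof));  // mark the end of the real data
    if (bit_count != 0)   // pad the last byte with zero bits
        out.push_back(static_cast<char>(buffer << (8 - bit_count)));
    return out;
}

// Decode until the pseudo-EOF symbol is seen; padding bits are never read.
std::string decode(const std::string& packed, const std::map<int, std::string>& code) {
    std::map<std::string, int> symbol_of;  // reverse table: bit string -> symbol
    for (const auto& entry : code) symbol_of[entry.second] = entry.first;
    std::string out, bits;
    for (unsigned char byte : packed) {
        for (int p = 8; p--;) {
            bits += ((byte >> p) & 1) ? '1' : '0';
            auto it = symbol_of.find(bits);
            if (it == symbol_of.end()) continue;  // not a complete code yet
            if (it->second == kEof) return out;   // stop before the padding
            out.push_back(static_cast<char>(it->second));
            bits.clear();
        }
    }
    return out;  // no pseudo-EOF found: the input was truncated
}
```

With the toy prefix code a = "0", b = "10", EOF = "11", the text "ab" packs into the single byte 01011000. The three trailing zero bits would otherwise decode as "aaa"; the decoder stops at "11" and never reaches them.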

Why does writing to binary file writes one byte from no where

I have a class for writing bits to a binary file:
class BITWRITER{
public:
ofstream OFD;
char var;
int x;
BITWRITER(char* pot){
OFD.open(pot);
x = 0;
var =0;
}
void WRITE(bool b){
var ^= (-b^var)&(1 << x);
x++;
if(x == 7){
OFD.write(&var, 1);
x = 0;
var = 0;
}
}
};
And my sample code:
string bitCode = "0001010";
bool BitIsOne = false;
BITWRITER *write= new BITWRITER("out.bin");
for(int i = bitCode.length()-1 ; i >= 0; i--){
if(bitCode[i] == '1')
BitIsOne=true;
else
BitIsOne=false;
write->WRITE(BitIsOne);
}
delete write;
What I don't get is why, when I run this exact code and then read the file back, the binary file contains two bytes instead of only one.
In this example, the output should be
"1010"
but one seemingly random byte ("1101") is somehow created before it.
Any ideas would be appreciated!
Binary 1010 is 0x0a, which is a newline. You're opening the file without specifying that it should be opened in binary mode.
On Windows, when you write a newline to a text-mode file, it is translated to a CR/LF sequence. A carriage return is 0x0d, which is binary 1101.
Specify that the file should be opened in binary mode:
OFD.open(pot, ios::binary);
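A quick way to convince yourself is to write the byte 0x0a in binary mode and check the resulting file size. The helper below is a hypothetical sketch for that check, not part of the BITWRITER class.

```cpp
#include <fstream>
#include <string>

// Write a single raw byte in binary mode and return the resulting file size
// in bytes, or -1 if the file could not be opened.
long write_byte_binary(const std::string& path, unsigned char value) {
    {
        std::ofstream out(path, std::ios::binary);
        if (!out) return -1;
        out.put(static_cast<char>(value));
    }  // stream goes out of scope here, so the file is flushed and closed

    // Reopen positioned at the end; tellg() then reports the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long>(in.tellg());
}
```

With std::ios::binary, write_byte_binary("out.bin", 0x0A) leaves a one-byte file on every platform. Dropping the flag from the ofstream is what lets Windows expand the byte to 0x0d 0x0a.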

C++ XOR encryption truncates file

I'm currently trying to implement file encryption via XOR. Simple as it is, though, I struggle with encryption of multiline files.
Actually, my first problem was that XOR can produce zero chars, which are interpreted as line-end by std::string, thus my solution was:
std::string Encryption::encrypt_string(const std::string& text)
{ //encrypting string
std::string result = text;
int j = 0;
for(int i = 0; i < result.length(); i++)
{
result[i] = 1 + (result[i] ^ code[j]);
assert(result[i] != 0);
j++;
if(j == code.length())
j = 0;
}
return result;
}
std::string Encryption::decrypt_string(const std::string& text)
{ // decrypting string
std::string result = text;
int j = 0;
for(int i = 0; i < result.length(); i++)
{
result[i] = (result[i] - 1) ^ code[j];
assert(result[i] != 0);
j++;
if(j == code.length())
j = 0;
}
return result;
}
Not neat, but fine for a first attempt. But when encrypting text files, I found that, depending on the encryption key, my output file gets truncated in random places. My best guess was that \n is handled incorrectly, because strings from the keyboard (even with \n) don't break the code.
bool Encryption::crypt(const std::string& input_filename, const std::string& output_filename, bool encrypt)
{ //My file function
std::fstream finput, foutput;
finput.open(input_filename, std::fstream::in);
foutput.open(output_filename, std::fstream::out);
if (finput.is_open() && foutput.is_open())
{
std::string str;
while (!finput.eof())
{
std::getline(finput, str);
if (encrypt)
str.append("\n");
std::string encrypted = encrypt ? encrypt_string(str) : decrypt_string(str);
foutput.write(encrypted.c_str(), str.length());
}
finput.close();
foutput.close();
return true;
}
return false;
}
What could be the problem, given that console input is XOR'ed fine?
Quoting the question: "XOR can produce zero chars, which are interpreted as line-end by std::string"
std::string provides overloads for most operations that let you specify the size of the input data, and it lets you query the size of the stored data. A 0-valued char inside a std::string is therefore perfectly valid.
So the problem isn't std::string treating nulls as end-of-line. The more likely culprit is std::getline(): once the data is encrypted it is arbitrary bytes, and any byte that happens to equal '\n' is treated as a line separator and dropped.
You're already using std::ostream::write(), so you're familiar with passing sizes as parameters. Why not use std::istream::read() instead of std::getline() as well?
That way you can simply read the file in "chunks" or "blocks" instead of treating line separators as a special case.
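A block-based version might look roughly like the sketch below. It is a free function rather than a member of the Encryption class, the name xor_file and the 4096-byte block size are arbitrary choices, and opening both streams with std::ios::binary is an extra assumption on top of the advice above, made so that no newline translation touches the encrypted bytes. Because sizes are passed explicitly, the "+1" workaround for zero bytes is no longer needed and the same call both encrypts and decrypts.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// XOR every byte of the input file with a repeating key and write the result.
// Running it twice with the same key restores the original file.
bool xor_file(const std::string& input_path, const std::string& output_path,
              const std::string& key) {
    assert(!key.empty());
    std::ifstream in(input_path, std::ios::binary);
    std::ofstream out(output_path, std::ios::binary);
    if (!in || !out) return false;

    char block[4096];
    std::size_t k = 0;  // key position carries over from block to block
    // read() fails on the final partial block, but gcount() still reports
    // how many bytes were actually stored, so that block is processed too.
    while (in.read(block, sizeof block) || in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            block[i] ^= key[k];
            k = (k + 1) % key.size();
        }
        out.write(block, n);  // size is explicit, so zero bytes are harmless
    }
    return static_cast<bool>(out);
}
```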

Reading in a file from C++

Let's assume that I have the following file: wow.txt which reads
4 a 1 c
and I would like to output the following:
d 1 a 3
change the integer to the corresponding alphabet (d is 4th letter, a is 1st letter
in the alphabet), and alphabet letter to the corresponding integer
(a is 1st letter in the alphabet, c is the 3rd letter in the alphabet)
I started with the following code in C++.
int main()
{
ifstream inFile;
inFile.open("wow.txt");
ofstream outFile;
outFile.open("sample.txt");
int k, g;
char a, b;
inFile>>k>>a>>g>>b;
outFile<<(char)(k+96)<<(int)a-96<<(char)(g+96)<<(int)b-96;
inFile.close();
outFile.close();
}
but then here, I could only do it because I knew that the text in wow.txt
goes integer, character, integer, character.
Also, even if I knew the pattern, if the text in wow.txt were very long,
there is no way I could solve the problem with this method, manually
typing in each input (defining k, g as integers and a, b as characters, and
then doing inFile>>k>>a>>g>>b;).
And if I didn't know the pattern at all, there is no way
I could solve it. I was wondering if there's a C++
function that reads the input from a given text file and determines its type, so
that I could attack this type of problem in the more general case.
I'm very new to C++ programming language (or programming in general),
so any help about this would be greatly appreciated.
The term you're searching for is parsing. It is the idea of taking in text and transforming it into something meaningful. Your C++ compiler, for example, does exactly that with your program's code -- it reads in text, parses it into a series of internal representations that it performs transformations on, then outputs binary code that, when run, carries out the intent of the code you wrote.
In your case, you want to turn the problem on its head -- instead of telling the input stream what to expect next from the file, you simply extract everything as text, and then figure it out yourself (you let the stream tell you what's there). If you think about it, it's text (or rather, binary data, but close enough) all the way down anyway, even when you're asking for, say, an integer to be read from the stream -- the stream does the integer parsing for you in that case, but it's still just text being parsed.
Here's some example code (untested) to get you started:
std::ifstream fin("wow.txt");
// Read everything in (works well for short files; longer
// ones could be read incrementally (streamed), but this
// adds complexity)
fin.seekg(0, fin.end);
std::size_t size = fin.tellg();
fin.seekg(0, fin.beg);
std::vector<char> text(size);
fin.read(&text[0], size);
fin.close();
// Now 'tokenize' the text (into words, in this case characters)
enum TokenType { Letter, Number };
struct Token {
const char* pos;
std::size_t length;
TokenType type;
};
std::vector<Token> tokens;
for (const char* pos = &text[0]; pos != &text[0] + text.size(); ++pos) {
if (*pos >= 'a' && *pos <= 'z') {
// Letter! (lowercase)
Token tok = { pos, 1, Letter };
tokens.push_back(tok);
// TODO: Validate that the next character is whitespace (or EOF)
}
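else if (*pos >= '0' && *pos <= '9') {
Token tok = { pos, 1, Number };
// Extend the token over any following digits without running past the
// end of the buffer; pos stops on the last digit so that the loop's
// ++pos moves on to the character after the number
const char* end = &text[0] + text.size();
while (pos + 1 != end && pos[1] >= '0' && pos[1] <= '9') {
++pos;
++tok.length;
}
tokens.push_back(tok);
// TODO: Validate that the next character is whitespace (or EOF)
}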
else if (*pos == ' ' || *pos == '\t' || *pos == '\r' || *pos == '\n') {
// Whitespace, skip
// Note that newlines are normally tracked in order to give
// the correct line number in error messages
}
else {
std::cerr << "Unexpected character "
<< *pos
<< " at position "
<< (pos - &text[0]) << std::endl;
}
}
// Now that we have tokens, we can transform them into the desired output
std::ofstream fout("sample.txt");
for (auto it = tokens.begin(); it != tokens.end(); ++it) {
if (it->type == Letter) {
fout << static_cast<int>(*(it->pos) - 'a') + 1;
}
else {
int num = 0;
for (std::size_t i = 0; i != it->length; ++i) {
num = num * 10 + (it->pos[i] - '0');
}
// TODO: Make sure number is within bounds
fout << static_cast<char>(num - 1 + 'a');
}
fout << ' ';
}
fout.close();
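For input that is strictly whitespace-separated, a shorter variant is to let the stream split the text into words and classify each word afterwards. This is only a sketch under that assumption: the function name swap_letters_and_numbers is invented here, it handles numbers 1 to 26 and single lowercase letters, and anything else is copied through unchanged.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Read whitespace-separated tokens, turn 1..26 into 'a'..'z' and single
// lowercase letters into their alphabet position, and join with spaces.
std::string swap_letters_and_numbers(std::istream& in) {
    std::ostringstream out;
    std::string token;
    bool first = true;
    while (in >> token) {  // extraction as text: the stream only splits words
        if (!first) out << ' ';
        first = false;

        bool all_digits = true;
        for (unsigned char ch : token)
            all_digits = all_digits && (std::isdigit(ch) != 0);

        if (all_digits && token.size() <= 2) {
            const int num = std::stoi(token);
            if (num >= 1 && num <= 26)
                out << static_cast<char>('a' + num - 1);
            else
                out << token;  // out of range: leave untouched
        } else if (token.size() == 1 && token[0] >= 'a' && token[0] <= 'z') {
            out << (token[0] - 'a' + 1);
        } else {
            out << token;  // unrecognized token: leave untouched
        }
    }
    return out.str();
}
```

Passing an std::ifstream opened on wow.txt (or an std::istringstream holding "4 a 1 c") yields "d 1 a 3". The hand-written tokenizer above remains the better starting point once error positions or tokens without separating whitespace matter.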