Handling extra bytes in huffman compression/decompression

Handling extra bytes in huffman compression/decompression - c++

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also then encodes and compresses an output file and currently has some functionality in decompression and decoding.
In summary, my program takes a input file compresses and encodes an output file, closes the output file and opens the encoding as an input file, and takes a new output file that is supposed to have a decoded message identical to the original text input file.
My problem is that in my test run while compressing I notice that I have 3 extra bytes and in turn when I decompress and decode my encoded file, these 3 extra bytes are being decoded to my output file. Depending on the amount of text in the original input file, my other tests output these extra bytes.
My research has let me to a few suggestions such as making the first 8 bytes of your encoded output file the 64 bits of an unsigned long long that give the number of bytes in the file or using a psuedo-EOF but I am stuck on how I would go about handling it and which of the two is a smart way to handle it given the code I have already written or if either is a smart way at all?
Any guidance or solution to this problem is appreciated.
(For encodedOutput function, fileName is the input file parameter, fileName2 is the output file parameter)
(For decodeOutput function, fileName2 is the input file parameter, fileName 3 is output file parameter)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
freq[256] holds the frequency of each ascii character read or holds 0 if it is not in original input file.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile; //to read file
ifile.open(fileName, ios::binary);
if (!ifile)//to check if file is open or not
{
die("Can't read again"); // function that exits program if can't open
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get(); //read one char from file and store it in int
char buffer = 0, bit_count = 0;
while (read != -1) {//run this loop until reached to end of file(-1)
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << (buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
void decodeOutput(const string & fileName2, const string & fileName3, string code[256], const unsigned long long freq[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
priority_queue < node > q;
for (unsigned i = 0; i < 256; i++) {
if (freq[i] == 0) {
code[i] = "";
}
}
for (unsigned i = 0; i < 256; i++)
if (freq[i])
q.push(node(unsigned(i), freq[i]));
if (q.size() < 1) {
die("no data");
}
while (q.size() > 1) {
node *child0 = new node(q.top());
q.pop();
node *child1 = new node(q.top());
q.pop();
q.push(node(child0, child1));
} // created the tree
string answer = "";
const node * temp = &q.top(); // root
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) { //reading 8 bits at a time
if ((c >> p & 1) == '0') { // if bit is a 0
temp = temp->child0; // go left
}
else { // if bit is a 1
temp = temp->child1; // go right
}
if (temp->child0 == NULL && temp->child1 == NULL) // leaf node
{
answer += temp->value;
temp = &q.top();
}
}
}
ofile << ans;
}

Because of integral promotion rules, (buffer << (8 - bit_count)) will be an integer expression, causing 4 bytes to be written. To only write one byte, you need to cast this to a char.
ofile << char(buffer << (8 - bit_count));

Related

How to find a string in a binary file?

I want to find a specific string "fileSize" in a binary file.
The purpose of finding that string is to get 4 bytes that next to the string because that 4 bytes contains the size of data that I want to read it.
The content of the binary file like the following:
The same string in another position:
Another position:
The following is the function that writes the data to a file:
void W_Data(char *readableFile, char *writableFile) {
ifstream RFile(readableFile, ios::binary);
ofstream WFile(writableFile, ios::binary | ios::app);
RFile.seekg(0, ios::end);
unsigned long size = (unsigned long)RFile.tellg();
RFile.seekg(0, ios::beg);
unsigned int bufferSize = 1024;
char *contentsBuffer = new char[bufferSize];
WFile.write("fileSize:", 9);
WFile.write((char*)&size, sizeof(unsigned long));
while (!RFile.eof()) {
RFile.read(contentsBuffer, bufferSize);
WFile.write(contentsBuffer, bufferSize);
}
RFile.close();
WFile.close();
delete contentsBuffer;
contentsBuffer = NULL;
}
Also, the function that searches for the string:
void R_Data(char *readableFile) {
ifstream RFile(readableFile, ios::binary);
const unsigned int bufferSize = 9;
char fileSize[bufferSize];
while (RFile.read(fileSize, bufferSize)) {
if (strcmp(fileSize, "fileSize:") == 0) {
cout << "Exists" << endl;
}
}
RFile.close();
}
How to find a specific string in a binary file?

I think of using find() is an easy way to search for patterns.
void R_Data(const std::string filename, const std::string pattern) {
std::ifstream(filename, std::ios::binary);
char buffer[1024];
while (file.read(buffer, 1024)) {
std::string temp(buffer, 1024);
std::size_t pos = 0, old = 0;
while (pos != std::string::npos) {
pos = temp.find(pattern, old);
old = pos + pattern.length();
if ( pos != std::string::npos )
std::cout << "Exists" << std::endl;
}
file.seekg(pattern.length()-1, std::ios::cur);
}
}

How to find a specific string in a binary file?
If you don't know the location of the string in the file, I suggest the following:
Find the size of the file.
Allocate memory for being able to read everything in the file.
Read everything from the file to the memory allocated.
Iterate over the contents of the file and use std::strcmp/std::strncmp to find the string.
Deallocate the memory once you are done using it.
There are couple of problems with using
const unsigned int bufferSize = 9;
char fileSize[bufferSize];
while (RFile.read(fileSize, bufferSize)) {
if (strcmp(fileSize, "filesize:") == 0) {
cout << "Exists" << endl;
}
}
Problem 1
The strcmp line will lead to undefined behavior when fileSize actually contains the string "fileSize:" since the variable has enough space only for 9 character. It needs an additional element to hold the terminating null character. You could use
const unsigned int bufferSize = 9;
char fileSize[bufferSize+1] = {0};
while (RFile.read(fileSize, bufferSize)) {
if (strcmp(fileSize, "filesize:") == 0) {
cout << "Exists" << endl;
}
}
to take care of that problem.
Problem 2
You are reading the contents of the file in blocks of 9.
First call to RFile.read reads the first block of 9 characters.
Second call to RFile.read reads the second block of 9 characters.
Third call to RFile.read reads the third block of 9 characters. etc.
Hence, unless the string "fileSize:" is at the boundary of one such blocks, the test
if (strcmp(fileSize, "filesize:") == 0)
will never pass.

Handling last byte in huffman compression/decompression

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also then encodes and compresses an output file and then is able to take the compressed file as an input file and does decompression and decoding.
In summary, my program takes a input file compresses and encodes an output file, closes the output file and opens the encoding as an input file, and takes a new output file that is supposed to have a decoded message identical to the original text input file.
My current problem with this program: When decoding the compressed file I get an extra character or so that is not in the original input file decoded. This is due to the trash bits from what I know. With research I found one solution may be to use a psuedo-EOF character to stop decoding before the trash bits are read but I am not sure how to implement this in my current functions that handle encoding and decoding so all guidance and help is much appreciated.
My end goal is to be able to use this program to also completely decode the encoded file without the trash bits sent to output file.
Below I have two functions, encodedOutput and decodeOutput that handle the compression and decompression.
(For encodedOutput function, fileName is the input file parameter, fileName2 is the output file parameter)
(For decodeOutput function, fileName2 is the input file parameter, fileName 3 is output file parameter)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
freq[256] holds the frequency of each ascii character read or holds 0 if it is not in original input file.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile; //to read file
ifile.open(fileName, ios::binary);
if (!ifile)//to check if file is open or not
{
die("Can't read again"); // function that exits program if can't open
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get(); //read one char from file and store it in int
char buffer = 0, bit_count = 0;
while (read != -1) {//run this loop until reached to end of file(-1)
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << char(buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
void decodeOutput(const string & fileName2, const string & fileName3, string code[256], const unsigned long long freq[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
priority_queue < node > q;
for (unsigned i = 0; i < 256; i++) {
if (freq[i] == 0) {
code[i] = "";
}
}
for (unsigned i = 0; i < 256; i++)
if (freq[i])
q.push(node(unsigned(i), freq[i]));
if (q.size() < 1) {
die("no data");
}
while (q.size() > 1) {
node *child0 = new node(q.top());
q.pop();
node *child1 = new node(q.top());
q.pop();
q.push(node(child0, child1));
} // created the tree
string answer = "";
const node * temp = &q.top(); // root
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) { //reading 8 bits at a time
if ((c >> p & 1) == '0') { // if bit is a 0
temp = temp->child0; // go left
}
else { // if bit is a 1
temp = temp->child1; // go right
}
if (temp->child0 == NULL && temp->child1 == NULL) // leaf node
{
answer += temp->value;
temp = &q.top();
}
}
}
ofile << ans;
}

Change it to freq[257] and code[257], and set freq[256] to one. Your EOF is symbol 256, and it will appear once in the stream, at the end. At the end of your encoding, send symbol 256. When you receive symbol 256 while decoding, stop.

Huffman Decoding Function Uncompressing One Character Repeatedly

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also encodes and compresses an output file.
I am now trying to decompress and decode my current output file which is opened as an input file and a new output file is to have the decoded message identical to the original text input file.
My thought process for this part of the assignment is to recreate a tree with huffman codes and then while reading 8 bits at a time, traverse through tree until I reach a leaf node where I will have updated an empty string(string answer) and then output it to my output file.
My problem: After writing this function I see that only one character in between all of the other characters of my original input file gets output repeatedly. I am confused as to why this is the case because I am expecting the output file to be identical to the original input file.
Any guidance or solution to this problem is appreciated.
(For encodedOutput function, fileName is the input file parameter, fileName2 is the output file parameter)
(For decodeOutput function, fileName2 is the input file parameter, fileName 3 is output file parameter)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
freq[256] holds the frequency of each ascii character read or holds 0 if it is not in original input file.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile;
ifile.open(fileName, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get();
char buffer = 0, bit_count = 0;
while (read != -1) {
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << (buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
// Work in progress
void decodeOutput(const string & fileName2, const string & fileName3, string code[256], const unsigned long long freq[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
priority_queue < node > q;
for (unsigned i = 0; i < 256; i++) {
if (freq[i] == 0) {
code[i] = "";
}
}
for (unsigned i = 0; i < 256; i++)
if (freq[i])
q.push(node(unsigned(i), freq[i]));
if (q.size() < 1) {
die("no data");
}
while (q.size() > 1) {
node *child0 = new node(q.top());
q.pop();
node *child1 = new node(q.top());
q.pop();
q.push(node(child0, child1));
} // created the tree
string answer = "";
const node * temp = &q.top(); // root
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) { //reading 8 bits at a time
if ((c >> p & 1) == '0') { // if bit is a 0
temp = temp->child0; // go left
}
else { // if bit is a 1
temp = temp->child1; // go right
}
if (temp->child0 == NULL && temp->child1 == NULL) // leaf node
{
ans += temp->value;
temp = &q.top();
}
ofile << ans;
}
}
}

(c >> p & 1) == '0'
Will only return true when (c >> p & 1) equals 48, so your if statement will always follow the else branch. The correct code is:
(c >> p & 1) == 0

Huffman Decoding Compressed File

I have a program that produces a Huffman tree based on ASCII character frequency read in a text input file. The Huffman codes are stored in a string array of 256 elements, empty string if the character is not read. This program also encodes and compresses an output file.
I am now trying to decompress and decode my current output file which is opened as an input file and a new output file is to have the decoded message identical to the original text input file.
My thought process for this part of my assignment is to work backwards from the encoding function I have made and read 8 bits at a time and somehow decode the message by updating a variable (string n) which is an empty string at first, through recursion of the Huffman tree until I get a code to output to output file.
I have currently started the function but I am stuck and I am looking for some guidance in writing my current decodeOutput function. All help is appreciated.
My completed encodedOutput function and decodeOutput function is down below:
(For encodedOutput function, fileName is the input file parameter, fileName2 is the output file parameter)
(For decodeOutput function, fileName2 is the input file parameter, fileName 3 is output file parameter)
code[256] is a parameter for both of these functions and holds the Huffman code for each unique character read in the original input file, for example, the character 'H' being read in the input file may have a code of "111" stored in the code array for code[72] at the time it is being passed to the functions.
void encodeOutput(const string & fileName, const string & fileName2, string code[256]) {
ifstream ifile;//to read file
ifile.open(fileName, ios::binary);
if (!ifile) //to check if file is open or not
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName2, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
int read;
read = ifile.get();//read one char from file and store it in int
char buffer = 0, bit_count = 0;
while (read != -1) {
for (unsigned b = 0; b < code[read].size(); b++) { // loop through bits (code[read] outputs huffman code)
buffer <<= 1;
buffer |= code[read][b] != '0';
bit_count++;
if (bit_count == 8) {
ofile << buffer;
buffer = 0;
bit_count = 0;
}
}
read = ifile.get();
}
if (bit_count != 0)
ofile << (buffer << (8 - bit_count));
ifile.close();
ofile.close();
}
//Work in progress
void decodeOutput(const string & fileName2, const string & fileName3, string code[256]) {
ifstream ifile;
ifile.open(fileName2, ios::binary);
if (!ifile)
{
die("Can't read again");
}
ofstream ofile;
ofile.open(fileName3, ios::binary);
if (!ofile) {
die("Can't open encoding output file");
}
string n = "";
for (int c; (c = ifile.get()) != EOF;) {
for (unsigned p = 8; p--;) {
if ((c >> p & 1) == '0') { // if bit is a 0
}
else if ((c >> p & 1) == '1') { // if bit is a 1
}
else { // Output string n (decoded character) to output file
ofile << n;
}
}
}
}

The decoding would be easier if you had the original Hoffman tree used to construct the codebook. But suppose you only have the codebook (i.e., the string code[256]) but not the original Hoffman tree. What you can do is the following:
Partition the codebook into groups of codewords with different lengths. Say the codebook consists of codewords with n different lengths: L0 < L1 < ... < Ln-1 .
Read (but do not consume yet) k bits from input file, with k increasing From L0 up to Ln-1, until you find a match between the input k bits and a codeword of length k = Li for some i.
Output the 8-bit character corresponding to the matching codeword, and consume the k bits from input file.
Repeat until all bits from input file are consumed.
If the codebook were constructed correctly, and you always look up the codewords in increasing length, you should never find a sequence of input bits which you cannot find a matching codeword.
Effectively, in terms of the Hoffman tree equivalence, every time you compare k input bits with a group of codewords of length k, you are checking whether a leaf at tree level-k contains an input-matching codeword; every time you increase k to the next longer group of codewords, you are walking down the tree to a higher level (say level-0 is the root).

Why does writing to binary file writes one byte from no where

I have a class for writing to bytes to binary file
class BITWRITER{
public:
ofstream OFD;
char var;
int x;
BITWRITER(char* pot){
OFD.open(pot);
x = 0;
var =0;
}
void WRITE(bool b){
var ^= (-b^var)&(1 << x);
x++;
if(x == 7){
OFD.write(&var, 1);
x = 0;
var = 0;
}
}
}
And my sample code:
string bitCode = "0001010";
bool BitIsOne = false;
BITWRITER *write= new BITWRITER("out.bin");
for(int i = bitCode.length()-1 ; i >= 0; i--){
if(bitCode[i] == '1')
BitIsOne=true;
else
BitIsOne=false;
write->WRITE(BitIsOne);
}
delete write;
What I don't get it is, why when i run this exact code, when I then next read this file instead of having in binary file only one byte, I have two bytes.
In this example, the output should be
"1010"
but before this one random byte is somehow created ("1101").
Any ideas would be appreciated!

Binary 1010 is 0x0a which is a newline. You're openign the file without specifying that it should be opened in binary mode.
On Windows when you write a newline to a text mode file it will translate it to a cr/lf sequence. A cr return is 0x0d which is a binary 1101.
Specify that the file should be opened in binary mode:
OFD.open(pot, ios::binary);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Handling extra bytes in huffman compression/decompression - c++

Because of integral promotion rules, (buffer << (8 - bit_count)) will be an integer expression, causing 4 bytes to be written. To only write one byte, you need to cast this to a char. ofile << char(buffer << (8 - bit_count));

Related

How to find a string in a binary file?

Handling last byte in huffman compression/decompression

Huffman Decoding Function Uncompressing One Character Repeatedly

Huffman Decoding Compressed File

Why does writing to binary file writes one byte from no where

Categories

Resources