Decoding problems with Lempel-Ziv-Welch algorithm

I have to implement the LZW algorithm, but I have run into trouble with the decoding part.
I think the code is right because it works with an example I found on the web: if I initialize my dictionary as follows
m_dictionary.push_back("a");
m_dictionary.push_back("b");
m_dictionary.push_back("d");
m_dictionary.push_back("n");
m_dictionary.push_back("_");
and my input file has the string banana_bandana, I get the following results:
compressed.txt: 1036045328
decompressed.txt:banana_bandana
But if I initialize the dictionary with all 256 single-byte character values, decoding fails completely. I think the problem lies in the number of bits used for the codes: when decoding, I always read the input file one char (8 bits) at a time instead of reading the correct number of bits per code.
Below is my implementation of the algorithm:
template <class T>
size_t toUnsigned(T t) {
std::stringstream stream;
stream << t;
size_t x;
stream >> x;
return x;
}
bool LempelZivWelch::isInDictionary(const std::string& entry) {
return (std::find(m_dictionary.begin(), m_dictionary.end(), entry) != m_dictionary.end());
}
void LempelZivWelch::initializeDictionary() {
m_dictionary.clear();
for (int i = 0; i < 256; ++i)
m_dictionary.push_back(std::string(1, char(i)));
}
void LempelZivWelch::addEntry(std::string entry) {
m_dictionary.push_back(entry);
}
size_t LempelZivWelch::encode(char *data, size_t dataSize) {
initializeDictionary();
std::string s;
char c;
std::ofstream file;
file.open("compressed.txt", std::ios::out | std::ios::binary);
for (size_t i = 0; i < dataSize; ++i) {
c = data[i];
if(isInDictionary(s + c))
s = s + c;
else {
for (size_t j = 0; j < m_dictionary.size(); ++j)
if (m_dictionary[j] == s) {
file << j;
break;
}
addEntry(s + c);
s = c;
}
}
for (size_t j = 0; j < m_dictionary.size(); ++j)
if (m_dictionary[j] == s) {
file << j;
break;
}
file.close();
return dataSize;
}
size_t LempelZivWelch::decode(char *data, size_t dataSize) {
initializeDictionary();
std::string entry;
char c;
size_t previousCode, currentCode;
std::ofstream file;
file.open("decompressed.txt", std::ios::out | std::ios::binary);
previousCode = toUnsigned(data[0]);
file << m_dictionary[previousCode];
for (size_t i = 1; i < dataSize; ++i) {
currentCode = toUnsigned(data[i]);
entry = m_dictionary[currentCode];
file << entry;
c = entry[0];
addEntry(m_dictionary[previousCode] + c);
previousCode = currentCode;
}
file.close();
return dataSize;
}
And this is the function that reads the input files:
void Compression::readFile(std::string filename) {
std::ifstream file;
file.open(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
if (!file.is_open())
exit(EXIT_FAILURE);
m_dataSize = file.tellg();
m_data = new char [m_dataSize];
file.seekg(0, std::ios::beg);
file.read(m_data, m_dataSize);
file.close();
}
My guess is that the decoding problem comes from reading the input file as an array of chars and/or from writing the codes to the compressed file as size_t values.
Thanks in advance!

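It looks like you are writing the dictionary indices as ASCII-encoded decimal numbers with no separator. How are you going to tell the sequence 1,2,3 from 12,3 or 1,23? With the five-entry dictionary, every code emitted for your test input happens to be a single digit, which is why that example round-trips.
You need to encode the codes unambiguously, using either fixed-width numbers (9, 10, 11 bits or whatever width fits your dictionary) or some sort of prefix-free code such as Huffman coding.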
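As a rough illustration of the fixed-width idea, here is a minimal sketch that stores every code as a 16-bit value (two bytes, high byte first). The names lzwEncode, lzwDecode, writeCodes and readCodes are made up for this example, and the 16-bit width, the byte order and the 65536-entry dictionary cap are assumptions, not part of the original code. The decoder also handles the case where a code refers to the entry that is about to be created, which the decode function in the question does not cover.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Encode bytes into 16-bit LZW codes (dictionary capped at 65536 entries).
std::vector<std::uint16_t> lzwEncode(const std::string& input) {
    std::unordered_map<std::string, std::uint16_t> dict;
    for (int i = 0; i < 256; ++i)
        dict.emplace(std::string(1, static_cast<char>(i)), static_cast<std::uint16_t>(i));

    std::vector<std::uint16_t> codes;
    std::string s;
    for (char c : input) {
        const std::string sc = s + c;
        if (dict.count(sc)) {
            s = sc;                          // keep extending the current match
        } else {
            codes.push_back(dict[s]);        // emit the code of the longest match
            if (dict.size() < 65536) {
                const std::uint16_t next = static_cast<std::uint16_t>(dict.size());
                dict.emplace(sc, next);
            }
            s = std::string(1, c);
        }
    }
    if (!s.empty()) codes.push_back(dict[s]);
    return codes;
}

// Decode a sequence of 16-bit LZW codes produced by lzwEncode.
std::string lzwDecode(const std::vector<std::uint16_t>& codes) {
    std::vector<std::string> dict;
    for (int i = 0; i < 256; ++i)
        dict.push_back(std::string(1, static_cast<char>(i)));

    std::string output;
    if (codes.empty()) return output;
    std::string previous = dict.at(codes[0]);
    output += previous;
    for (std::size_t i = 1; i < codes.size(); ++i) {
        std::string entry;
        if (codes[i] < dict.size())
            entry = dict[codes[i]];
        else
            entry = previous + previous[0];  // code for the entry being built right now
        output += entry;
        if (dict.size() < 65536)
            dict.push_back(previous + entry[0]);
        previous = entry;
    }
    return output;
}

// Write each code as exactly two bytes, high byte first, so codes cannot run together.
void writeCodes(std::ostream& out, const std::vector<std::uint16_t>& codes) {
    for (std::uint16_t code : codes) {
        out.put(static_cast<char>(code >> 8));
        out.put(static_cast<char>(code & 0xFF));
    }
}

// Read the two-byte codes back.
std::vector<std::uint16_t> readCodes(std::istream& in) {
    std::vector<std::uint16_t> codes;
    char hi = 0, lo = 0;
    while (in.get(hi) && in.get(lo))
        codes.push_back(static_cast<std::uint16_t>(
            (static_cast<unsigned char>(hi) << 8) | static_cast<unsigned char>(lo)));
    return codes;
}
```

Two bytes per code wastes space for small dictionaries; packing 9 to 12 bits per code is the usual refinement, at the cost of some bit-shifting in writeCodes and readCodes. The streams passed in should be opened with std::ios::binary.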

Related

Cant compare wide characters | C++

If I read special characters from a file and then try to compare them (for example in an if statement), the comparison never matches.
std::wstring c;
std::wifstream file;
file.open("test.txt");
while (file)
{
wchar_t tmp = file.get();
c += tmp;
}
file.close();
size_t l = c.length();
for (int i = 0; i < l; i++)
{
wchar_t a = c[i];
if (a == L'ä') {
std::cout << "if triggered.";
}
}
But when I create a wchar_t and assign the special character directly, it does work.
wchar_t a = L'ä';
if (a == L'ä') {
std::cout << "if triggered";
}
If I write the wstring that was loaded from the file back to a file, I get the original text back. Nothing weird happens there.

This depends on the file's encoding.
In this case I would assume it is UTF-8.
The code below may work:
std::string str;
{
    std::ifstream file;
    file.open("D:/test.txt");
    file >> str; // reads only the first whitespace-delimited word
}
// Convert the UTF-8 bytes to wide characters (needs <locale> and <codecvt>)
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
std::wstring wstr = myconv.from_bytes(str);
size_t l = wstr.length();
for (size_t i = 0; i < l; i++)
{
    auto a = wstr[i];
    if (a == L'ä') {
        std::cout << "if triggered.";
    }
}
However, std::codecvt_utf8 is deprecated since C++17.
For C++17 and later:
With MSVC++
I recommend using CString. It is easy to use and works with almost every version of C++. For example:
std::string str;
{
std::ifstream file;
file.open("D:/test.txt");
file >> str;
}
CString wstr = (CString)CA2W(str.c_str(), CP_UTF8);
size_t l = wstr.GetLength();
for (int i = 0; i < l; i++)
{
auto a = wstr[i];
if (a == L'ä') {
std::cout << "if triggered.";
}
}
For non-MFC projects, #include <atlstr.h>.
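If neither the deprecated converter nor CString is an option, the UTF-8 bytes can also be decoded by hand. The following is only a sketch under the assumption that the file really is UTF-8: decodeUtf8 is a hypothetical helper, it does not validate continuation bytes or reject overlong sequences, and it returns code points in a std::u32string rather than a std::wstring.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into Unicode code points (no validation of continuation bytes).
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;                       // continuation bytes to read
        if (lead < 0x80)              { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }  // 11110xxx
        else                          { cp = 0xFFFD;      extra = 0; }  // invalid lead byte
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < bytes.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

In UTF-8, 'ä' (U+00E4) is stored as the two bytes 0xC3 0xA4, so after decoding, each element can be compared against U'\u00E4' in the same way the loop above compares against L'ä'.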

C++ decoding LZ77-compressed data using std::fstream too slow

I have a function that decodes a file compressed with the LZ77 algorithm, but decompressing a 15 MB input file takes about 3 minutes, which is too slow. What is the reason for the poor performance? On every step of the loop I read two or three bytes and get the length, offset and next character. If the offset is not zero, I also have to move "offset" bytes back in the output stream and read "length" bytes. Then I append them to the end of the same stream before writing the next character.
void uncompressData(long block_size, unsigned char* data, fstream &file_out)
{
unsigned char* append;
append = new unsigned char[buf_length];
link myLink;
long cur_position = 0;
file_out.seekg(0, ios::beg);
cout << file_out.tellg() << endl;
int i=0;
myLink.length=-1;
while(i<(block_size-1))
{
if(myLink.length!=-1) file_out << myLink.next;
myLink.length = (short)(data[i] >> 4);
//cout << myLink.length << endl;
if(myLink.length!=0)
{
myLink.offset = (short)(data[i] & 0xF);
myLink.offset = myLink.offset << 8;
myLink.offset = myLink.offset | (short)data[i+1];
myLink.next = (unsigned char)data[i+2];
cur_position=file_out.tellg();
file_out.seekg(-myLink.offset,ios_base::cur);
if(myLink.length<=myLink.offset)
{
file_out.read((char*)append, myLink.length);
}
else
{
file_out.read((char*)append, myLink.offset);
int k=myLink.offset,j=0;
while(k<myLink.length)
{
append[k]=append[j];
j++;
if(j==myLink.offset) j=0;
k++;
}
}
file_out.seekg(cur_position);
file_out.write((char*)append, myLink.length);
i++;
}
else {
myLink.offset = 0;
myLink.next = (unsigned char)data[i+1];
}
i=i+2;
}
unsigned char hasOddSymbol = data[block_size-1];
if(hasOddSymbol==0x0) { file_out << myLink.next; }
delete[] append;
}

You could try decoding into an in-memory std::stringstream and writing the result to the file once at the end:
#include <sstream>
void uncompressData(long block_size, unsigned char* data, fstream& out)
{
    std::stringstream file_out; // first line in the function
    // the rest of your function goes here
    out << file_out.rdbuf(); // last line in the function
}
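The same idea can be taken one step further by decoding into a plain std::vector, which avoids seeking altogether. This is a sketch, not a drop-in replacement: decodeTokens is a hypothetical name, and it assumes a simplified token layout based on the question (a 4-bit length and 12-bit offset followed by the next byte, or a zero byte followed by a literal) without the trailing "odd symbol" flag byte.

```cpp
#include <cstddef>
#include <vector>

// Decode LZ77 tokens into an in-memory buffer.
// Assumed layout: [length:4 | offset_hi:4][offset_lo:8][next] when length != 0,
//                 [0][next] for a plain literal.
std::vector<unsigned char> decodeTokens(const unsigned char* data, std::size_t size) {
    std::vector<unsigned char> out;
    std::size_t i = 0;
    while (i < size) {
        const std::size_t length = data[i] >> 4;
        if (length != 0) {
            if (i + 2 >= size) break;                       // truncated token
            const std::size_t offset =
                (static_cast<std::size_t>(data[i] & 0x0F) << 8) | data[i + 1];
            if (offset == 0 || offset > out.size()) break;  // malformed back-reference
            const std::size_t start = out.size() - offset;
            // Copy byte by byte so that overlapping matches (length > offset) work.
            for (std::size_t k = 0; k < length; ++k) {
                const unsigned char b = out[start + k];
                out.push_back(b);
            }
            out.push_back(data[i + 2]);                     // the literal after the match
            i += 3;
        } else {
            if (i + 1 >= size) break;                       // truncated token
            out.push_back(data[i + 1]);
            i += 2;
        }
    }
    return out;
}
```

The finished buffer can then be written with a single call such as file_out.write(reinterpret_cast<const char*>(out.data()), out.size()), so the file is touched once instead of being read back and repositioned for every token.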

Why am i getting blank output after writing this filehandling code in c++?

I have made a tester class that takes questions from question-pool text files and writes randomly chosen questions to a docx file. I want to know why my code produces blank output in the docx file.
My random function works fine. I select two questions from each of the three question files.
Here is my code:
void test()
{
string line;
fstream question1("questiondesc.txt",ios::in | ios::out | ios::app);
fstream testgen("GeneratedTest.docx",ios::trunc | ios::in | ios::out);
testgen.open("GeneratedTest.docx");
if(!question1.is_open())
{
question1.open("questiondesc.txt");
}
int i,num;
for (i = 0; i < 2; i++) {
num = random(1,12);
for(int i =1;i<=num;i++)
{
getline(question1,line);
}
question1.clear();
question1.seekg(0, ios::beg);
testgen<<line<<endl;
}
question1.close();
ifstream question2("questionmcq.txt");
if(!question2.is_open())
{
question2.open("questionmcq.txt");
}
for (i = 0; i < 2; i++) {
num = random(1,26);
while(num%2==0)
{
num = random(1,26);
}
for(int i =1;i<=num;i++)
{
getline(question2,line);
}
testgen<<line<<endl;
getline(question2,line);
testgen<<line<<endl;
question2.clear();
question2.seekg(0, ios::beg);
}
question2.close();
ifstream question3("questionanalytical.txt");
if(!question3.is_open())
{
question3.open("questionanalytical.txt");
}
for (i = 0; i < 2; i++) {
num = random(1,12);
for(int i =1;i<=num;i++)
{
getline(question3,line);
}
question3.clear();
question3.seekg(0, ios::beg);
testgen<<line<<endl;
}
question3.close();
testgen.close();
}

There are errors in your code. I will point them out as comments in the listing below. I will also show one of many possible solutions (maybe not the best) to your problem.
You should break your problem down into smaller pieces and write more functions. Then life will be easier.
Additionally, you should write comments. If you write comments, you will often spot the problems yourself.
Your code with my remarks:
#include <iostream>
#include <fstream>
#include <string>
#include <random>
using namespace std; // NO NEVER USE
int random(int from, int to) {
std::random_device random_device;
std::mt19937 generator(random_device());
std::uniform_int_distribution<int> distribution(from, to);
return distribution(generator);
}
void test()
{
string line; // Declared far from its use; narrow its scope
fstream question1("questiondesc.txt", ios::in | ios::out | ios::app); // Only read from; a plain ifstream is enough
fstream testgen("GeneratedTest.docx", ios::trunc | ios::in | ios::out); // Only written to; a plain ofstream is enough
testgen.open("GeneratedTest.docx"); // The stream is already open: open() on an open stream fails and sets failbit, so every later write is dropped
if (!question1.is_open()) // Use "if (!question1)" instead. Other error bits could be set as well
{ // Always check the status of any IO operation
question1.open("questiondesc.txt"); // Retrying the same open will not help if it already failed
}
int i, num; // Not initialized and declared too early; declare at first use
for (i = 0; i < 2; i++) {
num = random(1, 12); // This function was not defined. I redefined it
for (int i = 1; i <= num; i++) // i = 1 and i <= num, really? Not i = 0 and i < num?
{
getline(question1, line); // Always check status of any IO function
}
question1.clear();
question1.seekg(0, ios::beg);
testgen << line << endl;
}
question1.close(); // The destructor of the fstream will close the file for you
ifstream question2("questionmcq.txt"); // Now you open the file as ifstream
if (!question2.is_open()) // Do check for all possible flags.: If (!question2)
{
question2.open("questionmcq.txt"); // Will not work, if it failed in the first time
}
for (i = 0; i < 2; i++) { // So 2 times
num = random(1, 26);
while (num % 2 == 0) // While the number is even
{
num = random(1, 26); // Get an odd number
}
for (int i = 1; i <= num; i++) // Usually from 0 to <num
{
getline(question2, line);
}
testgen << line << endl;
getline(question2, line);
testgen << line << endl;
question2.clear();
question2.seekg(0, ios::beg);
}
question2.close(); // No need to close. Destructor will do it for you
ifstream question3("questionanalytical.txt"); // Now you open the file as ifstream
if (!question3.is_open()) // Wrong check. Check for all flags
{
question3.open("questionanalytical.txt"); // Will not help in case of failure
}
// Now this is the 3rd time with the same code. So, put it into a function
for (i = 0; i < 2; i++) {
num = random(1, 12);
for (int i = 1; i <= num; i++)
{
getline(question3, line);
}
question3.clear();
question3.seekg(0, ios::beg);
testgen << line << endl;
}
question3.close();
testgen.close();
}
int main() {
test();
return 0;
}
And here is one possible solution, with functions to handle the similar parts of the code:
#include <iostream>
#include <string>
#include <fstream>
#include <random>
#include <vector>
#include <tuple>
// From the internet: https://en.cppreference.com/w/cpp/numeric/random/random_device
int random(int from, int to) {
std::random_device random_device;
std::mt19937 generator(random_device());
std::uniform_int_distribution<int> distribution(from, to);
return distribution(generator);
}
std::string readNthLineFromFile(std::ifstream& ifs, int n) {
// Reset file to the beginning
ifs.clear();
ifs.seekg(0, std::ios::beg);
// Read n lines; the last one read is line number n (1-based)
std::string result;
for (; (n > 0) && std::getline(ifs, result); --n);
// If the file had fewer than n lines, report an error instead
if (n != 0) result = "\n*** Error while reading a line from the source file\n";
// Give back the desired line
return result;
}
void generateQuestion(std::ifstream& sourceFileStream, std::ofstream& destinationFileStream, int n, const bool twoLines = false) {
// We want to prevent reading the same question again
int oldLineNumber = 0;
// If we want to read 2 consecutive lines, then we must not pick the last line in the file
if (twoLines && (n > 1)) --n;
// For whatever reason, do this 2 times.
for (size_t i = 0U; i < 2; ++i) {
// Get a random line number. But no duplicates in the 2 loops
int lineNumber{};
do {
lineNumber = random(1, n);
} while (lineNumber == oldLineNumber);
// For the next loop execution
oldLineNumber = lineNumber;
// Read the random line
std::string line{ readNthLineFromFile(sourceFileStream, lineNumber) };
// And write it to the destination file
destinationFileStream << line << "\n";
// If we want to read two lines in a row
if (twoLines) {
// Read next line
line = readNthLineFromFile(sourceFileStream, ++lineNumber);
// And write it to the destination file
destinationFileStream << line << "\n";
}
}
}
int main() {
const std::string destinationFilename{ "generatedTest.txt" };
const std::string questions1Filename{ "questiondesc.txt" };
const std::string questions2Filename{ "questionmcq.txt" };
const std::string questions3Filename{ "questionanalytical.txt" };
// Here we store the filenames and if one or 2 lines shall be read
std::vector<std::tuple<const std::string, const size_t, const bool>> source{
{ questions1Filename, 12U, false },
{ questions2Filename, 26U, true },
{ questions3Filename, 12U, false }
};
// Open the destination file and check, if it could be opened
if (std::ofstream destinationFileStream(destinationFilename); destinationFileStream) {
// Now open the first source file and generate the questions
for (const std::tuple<const std::string, const size_t, const bool>& t : source) {
// Open source file and check, if it could be opened
if (std::ifstream sourceFileStream(std::get<0>(t)); sourceFileStream) {
generateQuestion(sourceFileStream, destinationFileStream, std::get<1>(t), std::get<2>(t));
}
else {
std::cerr << "\n*** Error. Could not open source file '" << std::get<0>(t) << "'\n";
}
}
}
else {
std::cerr << "\n*** Error: Could not open destination file '" << destinationFilename << "'\n";
}
return 0;
}
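The blank output in the question can be reproduced in isolation. The sketch below uses two hypothetical helpers: goodAfterSecondOpen mimics the question's pattern of constructing a stream with a filename and then calling open() again, while writeLineChecked opens the file once and checks the stream state after writing. The file names passed in are up to the caller.

```cpp
#include <fstream>
#include <string>

// Opens `path` via the constructor, then calls open() again as the question's code does.
// Returns whether the stream is still usable afterwards.
bool goodAfterSecondOpen(const std::string& path) {
    std::ofstream out(path, std::ios::trunc);
    const bool openedFirst = out.is_open();
    out.open(path);  // open() on an already-open stream fails and sets failbit
    return openedFirst && static_cast<bool>(out);
}

// Opens `path` exactly once, writes one line, and reports whether the write succeeded.
bool writeLineChecked(const std::string& path, const std::string& line) {
    std::ofstream out(path);
    if (!out) return false;
    out << line << '\n';
    out.flush();
    return static_cast<bool>(out);
}
```

Once failbit is set, every subsequent operator<< on the stream is a no-op, which matches a file that was truncated by the first open and then never written to.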

Writing and reading a file

I'm trying to use ifstream/ofstream to read and write a file, but for some reason the data gets corrupted along the way. Here are the read/write functions and the test:
void FileWrite(const char* FilePath, std::vector<char> &data) {
std::ofstream os (FilePath);
int len = data.size();
os.write(reinterpret_cast<char*>(&len), 4);
os.write(&(data[0]), len);
os.close();
}
std::vector<char> FileRead(const char* FilePath) {
std::ifstream is(FilePath);
int len;
is.read(reinterpret_cast<char*>(&len), 4);
std::vector<char> ret(len);
is.read(&(ret[0]), len);
is.close();
return ret;
}
void test() {
std::vector<char> sample(1024 * 1024);
for (int i = 0; i < 1024 * 1024; i++) {
sample[i] = rand() % 256;
}
FileWrite("C:\\test\\sample", sample);
auto sample2 = FileRead("C:\\test\\sample");
int err = 0;
for (int i = 0; i < sample.size(); i++) {
if (sample[i] != sample2[i])
err++;
}
std::cout << err << "\n";
int a;
std::cin >> a;
}
It writes the length correctly, reads it back correctly, and starts reading the data correctly, but at some point (depending on the input, usually around the 1000th byte) it goes wrong, and everything after that is wrong. Why is that?

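For starters, you should open both file streams in binary mode:
std::ofstream os (FilePath,std::ios::binary);
std::ifstream is (FilePath,std::ios::binary);
Without std::ios::binary the stream is opened in text mode. On Windows, text mode translates line endings: every 0x0A byte ('\n') is written as 0x0D 0x0A, and the reverse translation happens on reading. Microsoft's runtime also treats a 0x1A byte (Ctrl-Z) as end of file when reading in text mode. With random bytes, one of those values turns up quickly, and from that point on the data read back no longer lines up with what was written.
(Separately: if char is signed on your platform, it can hold at most CHAR_MAX, which is 127, so rand() % 256 typically ends up as a negative value for results above that. That is harmless here, because write() and read() copy the bytes unchanged.)
Also, you don't need to close the stream yourself here; the destructor does it for you.
Two more minor points:
1) &(data[0]) can be just &data[0]; the parentheses are redundant.
2) Try to keep a consistent naming convention. You use upper camel case for the FilePath parameter but lower camel case for all the other variables.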
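Putting those points together, a length-prefixed round trip might look like the sketch below. The names writeBlob and readBlob are invented for this example, and the fixed 32-bit length prefix in native byte order is an assumption that mirrors the 4-byte int in the question.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write a 32-bit length followed by the raw bytes, in binary mode.
bool writeBlob(const std::string& path, const std::vector<char>& data) {
    std::ofstream os(path, std::ios::binary);
    if (!os) return false;
    const std::uint32_t len = static_cast<std::uint32_t>(data.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof len);
    os.write(data.data(), static_cast<std::streamsize>(data.size()));
    os.flush();
    return static_cast<bool>(os);  // the destructor closes the file
}

// Read the length prefix, then exactly that many bytes, in binary mode.
bool readBlob(const std::string& path, std::vector<char>& data) {
    std::ifstream is(path, std::ios::binary);
    std::uint32_t len = 0;
    if (!is.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
    data.resize(len);
    is.read(data.data(), static_cast<std::streamsize>(len));
    return static_cast<bool>(is);  // false if fewer than len bytes were available
}
```

Checking the stream state after read() is what reveals a short read; in the original test the truncated data went unnoticed until the byte-by-byte comparison.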

c++ writing and reading objects to binary files

I'm trying to read an Array object (Array is a class I've made) from a binary file, using read and write functions I wrote. So far the write function works, but the read function won't read from the file properly for some reason. This is the write function:
void writeToBinFile(const char* path) const
{
ofstream ofs(path, ios_base::out | ios_base::app | ios_base::binary);
if (ofs.is_open())
{
ostringstream oss;
for (unsigned int i = 0; i < m_size; i++)
{
oss << ' ';
oss << m_data[i];
}
ofs.write(oss.str().c_str(), oss.str().size());
}
}
This is the read function :
void readFromBinFile(const char* path)
{
ifstream ifs(path, ios_base::in | ios_base::binary || ios_base::ate);
if (ifs.is_open())
{
stringstream ss;
int charCount = 0, spaceCount = 0;
ifs.unget();
while (spaceCount != m_size)
{
charCount++;
if (ifs.peek() == ' ')
{
spaceCount++;
}
ifs.unget();
}
ifs.get();
char* ch = new char[sizeof(char) * charCount];
ifs.read(ch, sizeof(char) * charCount);
ss << ch;
delete[] ch;
for (unsigned int i = 0; i < m_size; i++)
{
ss >> m_data[i];
m_elementCount++;
}
}
}
those are the class fields :
T* m_data;
unsigned int m_size;
unsigned int m_elementCount;
I'm using the following code to write and then read (1 execution for reading another for writing):
Array<int> arr3(5);
//arr3[0] = 38;
//arr3[1] = 22;
//arr3[2] = 55;
//arr3[3] = 7;
//arr3[4] = 94;
//arr3.writeToBinFile("binfile.bin");
arr3.readFromBinFile("binfile.bin");
for (unsigned int i = 0; i < arr3.elementCount(); i++)
{
cout << "arr3[" << i << "] = " << arr3[i] << endl;
}
The problem is in the readFromBinFile function: it gets stuck in an infinite loop, and peek() returns -1 for some reason that I can't figure out.
Also note that I write a space before each element so that I can tell the elements of the array apart. The space at the start also separates the array's data from any binary data previously stored in the file.

The major problem, in my mind, is that you write fixed-size binary data in variable-size textual form. It would be much simpler to stick to pure binary form.
Instead of writing to a string stream and then writing that output to the actual file, just write the binary data directly to the file:
ofs.write(reinterpret_cast<char*>(m_data), sizeof(m_data[0]) * m_size);
Then do something similar when reading the data.
For this to work, you of course need to save the number of entries in the array/vector before writing the actual data.
So the actual write function could be as simple as
void writeToBinFile(const char* path) const
{
    ofstream ofs(path, ios_base::out | ios_base::binary);
    if (ofs)
    {
        ofs.write(reinterpret_cast<const char*>(&m_size), sizeof(m_size));
        ofs.write(reinterpret_cast<const char*>(&m_data[0]), sizeof(m_data[0]) * m_size);
    }
}
And the read function
void readFromBinFile(const char* path)
{
    ifstream ifs(path, ios_base::in | ios_base::binary);
    if (ifs)
    {
        // Read the size
        ifs.read(reinterpret_cast<char*>(&m_size), sizeof(m_size));
        // Read all the data
        ifs.read(reinterpret_cast<char*>(&m_data[0]), sizeof(m_data[0]) * m_size);
    }
}
Depending on how m_data is allocated, you might need to allocate memory for m_size elements after reading the size and before reading the actual data. Note also that copying raw bytes like this is only valid when T is a trivially copyable type such as int.
Oh, and if you want to append data at the end of the array (though in the code you show you rewrite the whole array anyway), you update the size at the beginning, seek to the end, and then write the new data.
