I am using the Huffman algorithm to develop a file compressor, and I am currently facing the following problem.
Applying the algorithm to the word
stackoverflow, I get the following result:
a,c,e,f,k,l,r,s,t,v,w = 1 time repeated
o = 2 times repeated
a,c,e,f,k,l,r,s,t,v,w = 7.69231%
and
o = 15.3846%
So I start inserting them into a binary tree, which gives me these codes:
o=00
a=010
e=0110
c=0111
t=1000
s=1001
w=1010
v=1011
k=1100
f=1101
r=1110
l=1111
Each code is the path to that character in the tree, with 0 meaning left and 1 meaning right.
Then the word "stackoverflow" becomes:
10011000010011111000010110110111011011111001010
I want to write that whole value into a binary file as bits. It is 47 bits, which should fit in 6 bytes, but instead I can only make it 47 bytes, because the minimum that fwrite or fprintf can put into a file is 1 byte, by using sizeof(something).
So my question is: how can I write a single bit to my file?
Just write a "header" to the file (the number of bits), then "pack" the bits into bytes, padding the last one. Here's a sample.
#include <stdio.h>
FILE* f;
/* how many bits are in the current byte */
int bit_counter;
/* current byte */
unsigned char cur_byte;
/* write a single bit (1 or 0) */
void write_bit(unsigned char bit)
{
    /* add the bit first, then flush once the byte is full */
    cur_byte <<= 1;
    cur_byte |= (bit & 1);
    if(++bit_counter == 8)
    {
        fwrite(&cur_byte, 1, 1, f);
        bit_counter = 0;
        cur_byte = 0;
    }
}
int main()
{
    /* "wb": binary mode, so the bytes are not altered on Windows */
    f = fopen("test.bits", "wb");
    if(f == NULL)
        return 1;
    cur_byte = 0;
    bit_counter = 0;
    /* write the number of bits here to decode the bitstream later (47 in your case) */
    /* int num = 47; */
    /* fwrite(&num, sizeof(num), 1, f); */
    write_bit(1);
    write_bit(0);
    write_bit(0);
    /* etc... - do this in a loop for each encoded character */
    /* 10011000010011111000010110110111011011111001010 */
    if(bit_counter > 0)
    {
        /* pad the last byte with zeroes */
        cur_byte <<= 8 - bit_counter;
        fwrite(&cur_byte, 1, 1, f);
    }
    fclose(f);
    return 0;
}
To do the full Huffman encoder you'll have to write the bit codes at the beginning, of course.
This is sort of an encoding issue. The problem is that files can only contain bytes - so 1 and 0 can only be '1' and '0' in a file - the characters for 1 and 0, which are bytes.
What you'll have to do is to pack the bits into bytes, creating a file that contains the bits in a set of bytes. You won't be able to open the file in a text editor - it doesn't know you want to display each bit as a 1 or 0 char, it will display whatever each packed byte turns out to be. You could open it with an editor that understands how to work with binary files, though. For instance, vim can do this.
As for extra trailing bytes or an end-of-file marker, you're going to have to create some sort of encoding convention. For example, you can pack and pad with extra zeros, like you mention in your comments, but then by convention have the first N bytes be metadata, e.g. the data length: how many bits in your file are meaningful. This sort of thing is very common.
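A minimal sketch of such a convention in C++. The layout (a 4-byte big-endian bit count followed by the packed bits) and the name `pack_bits` are illustrative assumptions, not a standard format:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, most significant bit first.
// Assumed layout (illustrative only): 4-byte big-endian bit count, then the
// packed bits, with the last byte padded with zero bits.
std::vector<std::uint8_t> pack_bits(const std::string& bits)
{
    std::vector<std::uint8_t> out;
    const std::uint32_t n = static_cast<std::uint32_t>(bits.size());
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((n >> shift) & 0xFF));

    std::uint8_t cur = 0;
    int used = 0; // number of bits currently held in cur
    for (char c : bits) {
        assert(c == '0' || c == '1');
        cur = static_cast<std::uint8_t>((cur << 1) | (c == '1' ? 1 : 0));
        if (++used == 8) {
            out.push_back(cur);
            cur = 0;
            used = 0;
        }
    }
    if (used > 0) // pad the final partial byte with zero bits
        out.push_back(static_cast<std::uint8_t>(cur << (8 - used)));
    return out;
}
```

For a 47-bit input this yields 10 bytes: 4 for the header and 6 for the data. The decoder reads the count first and then ignores the padding bits in the last byte.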
You'll need to manage this yourself, by buffering the bits to write and only actually writing data when you have a complete byte. Something like...
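// fp is the open output file; fputc needs the stream as its second argument
void writeBit(FILE *fp, bool b)
{
    static unsigned char buffer = 0;
    static int bitcount = 0;
    buffer = (buffer << 1) | (b ? 1 : 0);
    if (++bitcount == 8)
    {
        fputc(buffer, fp); // write out the byte
        bitcount = 0;
        buffer = 0;
    }
}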
The above isn't reentrant (and is likely to be pretty inefficient) - and you need to make sure you somehow flush any half-written byte at the end, (write an extra 7 zero bits, maybe) but you should get the general idea.
I was working on a Huffman project to compress text files. I was able to generate the required codes. I read the whole file and stored the codes accordingly in a vector<char> variable. I also padded the encoded vector.
vector<char> padding(vector<char> text)
{
int num = text.size();
unsigned int pad_value = 32-(num%32);
for(int i=0;i<pad_value;i++){
text.push_back('0');
}
string pad_info = bitset<32>(pad_value).to_string();
for(int i=pad_info.length()-1;i>=0;i--){
text.insert(text.begin(),pad_info[i]);
}
return text;
}
I padded to a multiple of 32 bits, as I was thinking of using an array of "unsigned int" to directly store the integers in a binary file so that they occupy 4 bytes for every 32 characters. I used this function for that:
vector<unsigned int> build_byte_array(vector<char> padded_text)
{
vector<unsigned int> byte_arr;
for(int i=0;i<padded_text.size();i+=32)
{
string byte="";
for(int j=i;j<i+32;j++){
byte += padded_text[j];
}
unsigned int b = stoul(byte,nullptr,2);
//cout<<b<<":"<<byte<<endl;
byte_arr.push_back(b);
}
return byte_arr;
}
Now the problem is when I write this byte array to binary file using
ofstream output("compressed.bin",ios::binary);
for(int i=0;i<byte_array.size();i++){
unsigned int a = byte_array[i];
output.write((char*)(&a),sizeof(a));
}
I get a binary file which is bigger than the original text file. How do I solve that, or what error am I making?
Edit: I tried to compress a file of about 2,493 KB (for testing purposes) and it generated a compressed.bin file of 3,431 KB. So, I don't think padding is the issue here.
I also tried with a 15 KB file, but the size always increases after using this algorithm.
I tried using:
for(int i=0;i<byte_array.size();i++){
unsigned int a = byte_array[i];
char b = (char)a;
output.write((char*)(&a),sizeof(b));
}
but after using this I am unable to recover the original byte array when decompressing the file.
unsigned int a = byte_array[i];
output.write((char*)(&a),sizeof(a));
The size of the write is sizeof(a) which is usually 4 bytes.
An unsigned int is not a byte. A more suitable type for a byte would be std::byte, uint8_t, or unsigned char.
You are expanding your data with padding, so if you're not getting much compression or there's not much data to begin with, the output could easily be larger.
You don't need to pad nearly as much as you do. First off, you are adding 32 bits when the data already ends on a word boundary (when num is a multiple of 32). Pad zero bits in that case. Second, you are inserting 32 bits at the start to record how many bits you padded, where five bits would suffice to encode 0..31. Third, you could write bytes instead of ints, so the padding on the end could be 0..7 bits, and you could prepend three bits instead of five. The padding overall could be reduced from your current 33..64 bits to 3..10 bits.
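The arithmetic can be sketched as follows. The function names are hypothetical, and the byte-based variant assumes the three header bits sit at the front of the stream, so they count toward the byte alignment:

```cpp
#include <cstddef>

// Zero bits needed to round num_bits up to a multiple of unit
// (0 when num_bits is already aligned).
constexpr std::size_t pad_bits(std::size_t num_bits, std::size_t unit)
{
    return (unit - num_bits % unit) % unit;
}

// Overhead of the scheme in the question: a 32-bit header plus 1..32 padding
// bits (a full 32 even when the data is already aligned).
constexpr std::size_t original_overhead(std::size_t num_bits)
{
    return 32 + (32 - num_bits % 32);
}

// Overhead of a byte-based scheme: a 3-bit header plus 0..7 padding bits.
constexpr std::size_t byte_scheme_overhead(std::size_t num_bits)
{
    return 3 + pad_bits(num_bits + 3, 8);
}
```

Under these assumptions the first scheme always costs between 33 and 64 bits, while the byte-based one costs between 3 and 10.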
I have implemented the Huffman coding algorithm in C++, and it's working fine. I want to create a text compression algorithm.
Behind every file or piece of data in the digital world, there are 0s and 1s.
I want to persist the sequence of bits (0/1) generated by the Huffman encoding algorithm to a file.
My goal is to reduce the number of bits used to store the file. I'm storing the metadata for decoding in a separate file. I want to write the data to the file bit by bit, and then read it back bit by bit in C++.
The problem I'm facing with binary mode is that it does not allow me to write data bit by bit.
I want to write "10101" to the file bit by bit, but it writes the ASCII value (8 bits) of each character instead.
code
#include "iostream"
#include "fstream"
using namespace std;
int main(){
ofstream f;
f.open("./one.bin", ios::out | ios::binary);
f<<"10101";
f.close();
return 0;
}
Any help or pointers would be appreciated. Thank you.
"Binary mode" means only that you have requested that the actual bytes you write are not corrupted by end-of-line conversions. (This is only a problem on Windows. No other system has the need to deliberately corrupt your data.)
You are still writing a byte at a time in binary mode.
To write bits, you accumulate them in an integer. For convenience, in an unsigned integer. This is your bit buffer. You need to decide whether to accumulate them from the least to most or from the most to least significant positions. Once you have eight or more bits accumulated, you write out one byte to your file, and remove those eight bits from the buffer.
When you're done, if there are bits left in your buffer, you write out those last one to seven bits to one byte. You need to carefully consider how exactly you do that, and how to know how many bits there were, so that you can properly decode the bits on the other end.
The accumulation and extraction are done using the bit operations in your language. In C++ (and many other languages), those are & (and), | (or), >> (right shift), and << (left shift).
For example, to insert one bit, x, into your buffer, and later three bits in y, ending up with the earliest bits in the most significant positions:
unsigned buf = 0, bits = 0;
...
// some loop
{
...
// write one bit (don't need the & if you know x is 0 or 1)
buf = (buf << 1) | (x & 1);
bits++;
...
// write three bits
buf = (buf << 3) | (y & 7);
bits += 3;
...
// write bytes from the buffer before it fills the integer length
if (bits >= 8) { // the if could be a while if you expect 16 or more bits
// out is an ostream -- must be in binary mode if on Windows
bits -= 8;
out.put(buf >> bits);
}
...
}
...
// write any leftover bits (it is assumed here that bits is in 0..7 --
// if not, first repeat if or while from above to clear out bytes)
if (bits) {
out.put(buf << (8 - bits));
bits = 0;
}
...
I'm reading a bunch of values from a text file which are in binary form because I stored them using fwrite. The problem is that the first value in the file is 5 bytes in size and the next 4800 values are 2 bytes in size. So when I try to cycle through the file and read the values, it gives me the wrong results, because my program does not know that it should take 5 bytes the first time and then 2 bytes the remaining 4800 times.
Here is how I'm cycling through the file:
long lSize;
unsigned short * buffer;
size_t result;
pFile = fOpen("dataValues.txt", "rb");
lSize = ftell(pFile);
buffer = (unsigned short *) malloc (sizeof(unsigned short)*lSize);
size_t count = lSize/sizeof(short);
for(size_t i = 0; i < count; ++i)
{
result = fread(buffer+i, sizeof(unsigned short), 1, pFile);
print("%u\n", buffer[i]);
}
I'm pretty sure I'm going to need to change my fread statement because the first value is of type time_t so I'll probably need a statement that looks like this:
result = fread(buffer+i, sizeof(time_t), 1, pFile);
However, this did not work when I tried it, and I think it's because I am not changing the starting position properly. I think that while I do read 5 bytes worth of data, I don't move the starting position enough.
Does anyone here have a good understanding of fread? Can you please let me know what I can change to make my program accomplish what I need?
EDIT:
This is how I'm writing to the file.
fwrite(&timer, sizeof(timer), 1, pFile);
fwrite(ptr, sizeof(unsigned short), rawData.size(), pFile);
EDIT2:
I tried to read the file using ifstream
int main()
{
time_t x;
ifstream infile;
infile.open("binaryValues.txt", ios::binary | ios::in);
infile.read((char *) &x, sizeof(x));
return 0;
}
However, now it doesn't compile and just gives me a bunch of "undefined reference to" errors for code that I haven't even written.
I don't see the problem:
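uint8_t five_byte_buffer[5];
uint8_t two_byte_buffer[2];
//...
ifstream my_file(/*...*/);
// istream::read expects a char*, so the uint8_t buffers need a cast
my_file.read(reinterpret_cast<char*>(five_byte_buffer), 5);
my_file.read(reinterpret_cast<char*>(two_byte_buffer), 2);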
So, what is your specific issue?
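Edit 1: Reading in a loop
// the 5-byte value appears once, followed by the 2-byte values
my_file.read(reinterpret_cast<char*>(five_byte_buffer), 5);
while (my_file.read(reinterpret_cast<char*>(two_byte_buffer), 2))
{
    Process_Data();
}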
You can't. Streams are byte, almost always octet (8 bit byte) oriented.
You can easily enough build a bit-oriented stream on top of that. You just keep a few bytes in a buffer and keep track of which bit is current. Watch out for getting the last few bits, and attempts to mix byte access with bit access.
Untested but this is the general idea.
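#include <stdio.h>
struct bitstream
{
    unsigned long long rack; // 64-bit rack of buffered bits
    FILE *fp;                // file opened for reading ("rb")
    int rackpos;             // 0 - 64, bits already consumed from the top; start at 64 (empty)
};
int getbits(struct bitstream *bs, int Nbits)
{
    unsigned long long mask = 0x8000000000000000ULL;
    int answer = 0;
    int i;
    // refill: shift in a whole byte for every 8 bits already consumed
    while(bs->rackpos >= 8)
    {
        int c = fgetc(bs->fp);
        if(c == EOF)
            c = 0; // past the end of the file: feed zero bits
        bs->rack = (bs->rack << 8) | (unsigned long long)c;
        bs->rackpos -= 8;
    }
    mask >>= bs->rackpos;
    for(i=0;i<Nbits;i++)
    {
        answer <<= 1;
        if(bs->rack & mask)
            answer |= 1;
        mask >>= 1;
    }
    bs->rackpos += Nbits;
    return answer;
}
You need to decide how you know when the stream is terminated. Here, bytes past the end of the file are read as zeros, because OR-ing the EOF value returned by fgetc() into the rack would corrupt the last few bits.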
I have some data coming in from a sensor. The data is in the range of a signed int, 16 bits or so. I need to send the data out via Bluetooth.
Problem:
The data is -1564, let's say. The Bluetooth transmits -, 1, 5, 6, then 4. This is inefficient. I can process the data on the PC later, I just need the frequency to go up.
My Idea/ Solution:
Have it convert to binary, then to ASCII for output. I can convert the ASCII later in processing. I have the binary part (found on StackOverflow) here:
inline void printbincharpad(char c)
{
for (int i = 7; i >= 0; --i)
{
putchar( (c & (1 << i)) ? '1' : '0' );
}
}
This outputs in binary very well. But getting the Bluetooth to transmit, say, 24 spits out 1, 1, 0, 0, then 0. In fact, that is slower than just 2, then 4.
Say I have 65062, 5 bytes to transmit, coming out of the sensor. That is 1111111000100110 in binary, 16 bytes. To ASCII, it's þ& (yes, the character set here is small, I know, but it's unique), just 2 bytes! In hex it's FE26, 4 bytes. A savings of 3 vs. decimal, 14 vs. binary and 2 vs. hex. OK, obviously, I want ASCII sent out here.
My Question:
So, how do I convert to ASCII if given a binary input?
I want to send that, the ASCII
Hedging:
Yes, I code in MATLAB more than C++. This is for a microcontroller. The baud rate is 115200. No, I don't know how the above code works, and I don't know where putchar's documentation is found. If you know of a library that I need to run this, please tell me, as I do not know.
Thank you for any and all help or advice, I do appreciate it.
EDIT: In response to some of the comments: it's two 16 bit registers I am reading from, so data loss is impossible.
putchar writes to the standard output, which is usually the console.
You may take a look at the other output functions in the cstdio (or stdio.h) library.
Anyways, using putchar(), here's one way to achieve what you're asking for:
void print_bytes (int n)
{
char *p = (char *) &n ;
for (size_t i = 0; i < sizeof (n); ++i) {
putchar (p [i]) ;
}
}
If you know for certain that you only want 16 bits from the integer, you can simplify it like this:
void print_bytes (int n)
{
char b = n & 0xff ;
char a = (n >> 8) & 0xff ;
putchar (a) ;
putchar (b) ;
}
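On the receiving side the two bytes have to be put back together in the same order. A small sketch of both directions, assuming the high byte is sent first as in the function above; `encode16` and `decode16` are hypothetical names:

```cpp
#include <cstdint>

// Split a 16-bit sample into two bytes, high byte first.
inline void encode16(std::int16_t value, unsigned char out[2])
{
    const std::uint16_t u = static_cast<std::uint16_t>(value);
    out[0] = static_cast<unsigned char>(u >> 8);
    out[1] = static_cast<unsigned char>(u & 0xFF);
}

// Rebuild the signed 16-bit sample from two received bytes (high byte first).
// The final conversion assumes a two's complement platform.
inline std::int16_t decode16(const unsigned char in[2])
{
    const std::uint16_t u = static_cast<std::uint16_t>((in[0] << 8) | in[1]);
    return static_cast<std::int16_t>(u);
}
```

With this scheme every sample costs exactly two bytes on the wire, regardless of its sign or how many decimal digits it has.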
Looks like when you say ASCII, you mean Base 256. You can search for solutions to converting from Base 10 to Base 256.
Here is a C program that converts a string containing 65062 (5 characters) to a string of 2 characters:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    const char* inputString = "65062";
    int input;
    char* tmpString;
    char* outString;
    size_t Counter;
    input = atoi(inputString);
    outString = malloc(sizeof(input) + 1);
    if (outString == NULL)
        return 1;
    /* view the int as raw bytes */
    tmpString = (char*) &input;
    for (Counter = 0; Counter < sizeof(input); Counter++) {
        outString[Counter] = tmpString[Counter];
    }
    outString[sizeof(input)] = '\0';
    /* %s stops at the first zero byte; on a little-endian machine the two low bytes (0x26, 0xFE) come first */
    printf("outString = %s\n", outString);
    free(outString);
    return 0;
}
I want to write a series of 0's to a binary file. As a char, this should be a space, however, I am receiving many other odd characters when I write to my file. I am not writing zeroes but something else it seems.
Am I doing this correctly?
Code:
int zero = 0;
myfile.write(reinterpret_cast<char *>(&zero),1790*sizeof(char));
Like this
for (int i = 0; i < 1790; ++i)
{
char zero = 0;
myfile.write(&zero, sizeof(char));
}
Your code writes 1790 bytes but zero is only four bytes big, so you end up writing random garbage.
Another way would be
char zero[1790] = { 0 };
myfile.write(zero, sizeof zero);
The point is that when you use write, the buffer pointed to by the first argument must be at least as large as the byte count given in the second argument.
You are writing 1790 bytes of random memory starting at address &zero. First 4 bytes of that memory will be zeroes (value of zero assuming sizeof(int)==4), the rest is probably not.