issues using stringstream to handle binary file - c++

I'm working with a binary file from which I need to extract the useful contents. The structure is:

Based on a quick look at the file, you don't have an "unknown amt of nulls" anywhere. The format appears to be:
N Bytes: number of animals, integer as text delimited by '\n'
24 Bytes per animal:
16 Bytes: name of animal padded with 0
4 Bytes: some 32 bit number (little endian)
4 Bytes: another 32 bit number (little endian)
You shouldn't be reading this as a text file, but instead as a raw binary file. There's absolutely no need for a stringstream, you can simply parse the number of animals by reading in one byte at a time and adding to the previous value * 10 until you reach '\n'.
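A minimal sketch of that approach, assuming the layout listed above is accurate. The names Animal, first, second, read_le32 and read_animals are placeholders chosen for illustration, since the meaning of the two numbers is not known.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical record type; field names are placeholders.
struct Animal {
    std::string name;
    std::uint32_t first;
    std::uint32_t second;
};

// Assemble a 32-bit little-endian value from 4 raw bytes.
static std::uint32_t read_le32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

std::vector<Animal> read_animals(const std::string& path) {
    std::ifstream in(path, std::ios::binary);  // raw bytes, no text translation

    // Header: decimal digits up to '\n', accumulated one byte at a time.
    std::size_t count = 0;
    char c;
    while (in.get(c) && c != '\n') {
        if (c < '0' || c > '9') return {};  // not the expected header
        count = count * 10 + static_cast<std::size_t>(c - '0');
    }

    std::vector<Animal> animals;
    unsigned char rec[24];
    for (std::size_t i = 0; i < count; ++i) {
        if (!in.read(reinterpret_cast<char*>(rec), sizeof rec)) break;  // truncated file
        // Name: at most 16 bytes, ending at the first padding zero.
        std::size_t len = 0;
        while (len < 16 && rec[len] != 0) ++len;
        animals.push_back({std::string(reinterpret_cast<const char*>(rec), len),
                           read_le32(rec + 16), read_le32(rec + 20)});
    }
    return animals;
}
```

Building the integers byte by byte keeps the code independent of the byte order of the machine it runs on.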


Why does base64-encoded data compress so poorly?

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:
Original file: 429,7 MiB
compress via xz -9:
13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
base64 it and compress via xz -9:
26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
base64 the original compressed xz file:
17,8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well, it is 2 times larger than the unencoded compressed file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible algorithm, so why does it affect compression so much? (I tried with gzip as well, with similar results.)
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files. I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and that it would thus be similarly compressible.
Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0 1 2 3 4 5 6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
6-bit repacking : 00010110 00000101 00100001 00011000
As decimal : 22 5 33 24
Base64 characters: 'W' 'F' 'h' 'Y'
Output bytes : 0x57 0x46 0x68 0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
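To reproduce the worked example, here is a small Base64 encoder sketch that applies the 6-bit repacking described above. The function name base64_encode and the constant kAlphabet are illustrative; this is a plain textbook encoder, not the code of any particular library.

```cpp
#include <cstddef>
#include <string>

// Standard Base64 alphabet, indexed by 6-bit value.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64_encode(const std::string& in) {
    std::string out;
    std::size_t i = 0;

    // Full groups: 3 input bytes (24 bits) become 4 output characters.
    for (; i + 3 <= in.size(); i += 3) {
        const unsigned a = static_cast<unsigned char>(in[i]);
        const unsigned b = static_cast<unsigned char>(in[i + 1]);
        const unsigned c = static_cast<unsigned char>(in[i + 2]);
        out += kAlphabet[a >> 2];                        // aaaaaa
        out += kAlphabet[((a & 0x03) << 4) | (b >> 4)];  // aabbbb
        out += kAlphabet[((b & 0x0F) << 2) | (c >> 6)];  // bbbbcc
        out += kAlphabet[c & 0x3F];                      // cccccc
    }

    // Tail: 1 or 2 leftover bytes, padded with '='.
    const std::size_t rest = in.size() - i;
    if (rest == 1) {
        const unsigned a = static_cast<unsigned char>(in[i]);
        out += kAlphabet[a >> 2];
        out += kAlphabet[(a & 0x03) << 4];
        out += "==";
    } else if (rest == 2) {
        const unsigned a = static_cast<unsigned char>(in[i]);
        const unsigned b = static_cast<unsigned char>(in[i + 1]);
        out += kAlphabet[a >> 2];
        out += kAlphabet[((a & 0x03) << 4) | (b >> 4)];
        out += kAlphabet[(b & 0x0F) << 2];
        out += '=';
    }
    return out;
}
```

Calling base64_encode("XXX") yields "WFhY", matching the table above, and the 16-character example string encodes to "WFhYWFlZWVlYWFhYWVlZWQ==".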
Whatever compression method is used, breaking this alignment usually has a severe impact on how well the algorithm performs. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz is using -- is working on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57 0x46 0x68 0x59
As binary : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what base-64 encoding does: It replaces a 3 byte (24 bit) word with a 4 byte word, using only 64 out of 256 possible values. This gives you the x1.33 growth.
Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. They are then no longer compressed as a single substring, but as two separate substrings.
As you have a lot of compression (97%), you apparently have the situation that very long input substrings are compressed as a whole. This means that you will also have many substrings being base-64 expanded past the maximum length the encoder can deal with.
It's not Base64. It's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html

Data not saved in binary form

I made a program as below
#include<iostream.h>
#include<string.h>
#include<stdio.h>
#include<fstream.h>
void main() {
char name[24];
cout << "enter string :";
gets(name);
ofstream fout;
fout.open("bin_data",ios::out|ios::binary);
fout.write((char*)&name,10);
fout.close();
}
But when I open the file bin_data in Notepad, I find that the string is saved in text format, not in binary form. Please help.
This code can save a word of 10 characters.
But when I compile this code with Turbo C++ v4.5, I find that when I input a 1- or 2-letter word it is saved in text format (ignoring the garbage values), when I input a word 3 to 7 letters long it is saved in binary format, and a 9- or 10-letter word is again saved in text format. Can anyone tell me the reason?
Please compile and run the program as mentioned above and answer.
Your data only contains text. It is represented by the very same bits in both text format and binary format.
Binary format means that your data is written to the file unchanged. If you were to use text format, some non-text characters would be modified. For example, byte 10 (which represents newline) could be changed to the operating system's specific newline sequence (two bytes, 13 and 10, on Windows).
For binary values of text characters, see http://www.asciitable.com/
Your second example writes uninitialized bytes from the buffer to the file.
char name[24];
fout.write((char*)&name,10);
You reserve 24 bytes of data, which initially hold whatever bytes happen to be at that point in memory. When you save a 2-character string to the buffer, it only overwrites the first three bytes. The third byte is set to the value 0, which tells you that the text ends at that point. If you were to call strlen(), it would tell you the number of characters before the first 0 byte.
If your input is a 2-character text, and you choose to write 10 bytes from your buffer, the last 7 bytes contain leftover, meaningless data. Note that this does not cause an access violation, because you have reserved 24 bytes.
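One way to avoid the stray bytes is to zero the buffer first and write only as many bytes as the text occupies. The sketch below assumes a standard C++ compiler rather than Turbo C++; the helper name save_name is hypothetical, and the input is passed in as a parameter instead of being read with gets().

```cpp
#include <cstring>
#include <fstream>

// Hypothetical helper: copies at most 23 characters of `input` into a
// zero-initialized buffer and writes only the bytes that hold text.
// Returns the number of bytes written (0 if the write failed).
std::size_t save_name(const char* path, const char* input) {
    char name[24] = {};                          // every byte starts as 0
    std::strncpy(name, input, sizeof name - 1);  // bounded copy, unlike gets()

    std::ofstream fout(path, std::ios::out | std::ios::binary);
    const std::size_t len = std::strlen(name);   // characters before the first 0
    fout.write(name, static_cast<std::streamsize>(len));
    return fout ? len : 0;
}
```

For the input "hi" this writes exactly two bytes, so Notepad shows only the text.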
See also: https://en.wikipedia.org/wiki/Null-terminated_string

Outputting Huffman codes to file

I have a program that reads a file and saves the frequency of each character. It then constructs a huffman tree based on each character's frequency and then outputs to a file the huffman codes for the tree.
So an input like "Hello World" would output this sequence to a file:
01010101 0010 010 010 01010 0101010 000 01010 00101 010 0001
This makes sense because the most frequent characters have the shortest codes. The issue is that this increases the file size ten-fold. I realized the reason is that each 1 and 0 is being represented in memory as its own character, so each one gets expanded out to a byte of data.
I was thinking what I could do is convert each code (E.G. "010") to a character and save that to file - but that still would pad the code to be a byte long (Or mess it up if the code is longer than a byte).
How do I go about this? I can give code snippets if needed - I'm basically saving each code into a string so that's why the file's coming out so big (It's outputting each "bit" as a byte). If I were to convert the code to a long for example, then a code like 00010 would be represented as 2 and a code like 010 would also be represented as 2.
You basically have to do it a byte (or a word) at a time. Maintain a byte which you fill with bits, and a record of how many bits have been filled in so far. When you get to 8, write the byte and start over with an empty one.
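A minimal sketch of that idea, assuming the codes are still available as strings of '0' and '1' characters as in the question. The class name BitWriter and its member names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

class BitWriter {
public:
    // Appends one Huffman code given as a string of '0'/'1' characters.
    void put(const std::string& code) {
        for (char ch : code) {
            assert(ch == '0' || ch == '1');
            current_ = static_cast<std::uint8_t>((current_ << 1) | (ch == '1' ? 1 : 0));
            if (++filled_ == 8) flush_byte();  // byte is full: store it, start over
        }
    }

    // Pads the last partial byte with zero bits and returns all bytes.
    std::vector<std::uint8_t> finish() {
        if (filled_ > 0) {
            current_ = static_cast<std::uint8_t>(current_ << (8 - filled_));
            flush_byte();
        }
        return bytes_;
    }

private:
    void flush_byte() {
        bytes_.push_back(current_);
        current_ = 0;
        filled_ = 0;
    }

    std::vector<std::uint8_t> bytes_;
    std::uint8_t current_ = 0;  // byte being filled, most significant bit first
    int filled_ = 0;            // number of bits already placed in current_
};
```

Because codes are packed back to back, "010" followed by "00010" becomes the single byte 0x42, and the leading zeros of each code are preserved. The decoder also needs some way to know where the real data ends (for example a stored bit count or symbol count), since the final byte may contain padding.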

How to modify a value in textual data file using C

/* stop at end of file or at a malformed record instead of testing feof() */
while(fscanf(fp,"%d %s %d %d",&res[i].id,res[i].title,&res[i].price,&res[i].qty)==4)
{
i++;
}
while(j<i)
{
printf("\nID:|%d|\tNAME:|%s|\tPRICE:|%d|\tQTY:|%d|",res[j].id,res[j].title,res[j].price,res[j].qty);
j++;
}
I have this piece of code, which collects data from the file. Now suppose I get an id as input from the user and want to decrease the quantity for that particular id. How do I do that?
If the file is in binary format it is easier to do what you want.
What is the difference between the text and the binary format? If the file is written in binary format, then a 32-bit integer will be represented as 32 consecutive bits in the file, while in text format the number will be represented as a sequence of digits, for instance the two characters "32".
So what's the big deal in that difference? Well, if you replace 32 with 1243, in binary format the number will still take the same 32 bits, so nothing else needs to be moved; all you change is these 4 bytes. In text format you add 2 more digits, which causes all the subsequent contents of the file to shift by two bytes.
In order to shift everything as needed, you will need to read the current contents of the file, change the value, and then write the contents back. I mean all the contents following the change you are making.
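For the binary case, the in-place update could look roughly like the sketch below. The record layout Item (including the title length of 32) and the function name decrease_qty are assumptions made for illustration; the file must have been written with this same fixed-size layout on the same kind of machine.

```cpp
#include <cstdint>
#include <fstream>

// Hypothetical fixed-size record mirroring the fields read in the question.
struct Item {
    std::int32_t id;
    char title[32];
    std::int32_t price;
    std::int32_t qty;
};

// Decreases qty of the record with the given id, overwriting only that
// record. Returns true if the record was found and rewritten.
bool decrease_qty(const char* path, std::int32_t id, std::int32_t amount) {
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    Item item;
    while (f.read(reinterpret_cast<char*>(&item), sizeof item)) {
        if (item.id != id) continue;
        item.qty -= amount;
        // Step back over the record just read and overwrite it in place.
        f.seekp(-static_cast<std::streamoff>(sizeof item), std::ios::cur);
        f.write(reinterpret_cast<const char*>(&item), sizeof item);
        return static_cast<bool>(f);
    }
    return false;  // id not found (or the file could not be opened)
}
```

Since every record has the same size, nothing after the modified record has to move.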

Huffman Coding in JPEG

This is the hex content of my JPEG picture (I marked the FFC4 marker on the picture). As you can see, after the byte 0x01 there is the value 0xA2! How can that be possible, given that the standard says the next 16 bytes after the 0x10 tell us how many codes of each length there are? It is impossible to have that many codes with 1 bit. Am I wrong?
What you are seeing is the length of the Huffman block in bytes, stored in big-endian order. The value counts the two bytes of the length field itself, so subtract 2 to get the size of the data that follows.
The Huffman block is 0x1a2 bytes long.
Following the length there is a single byte holding the Huffman table information (the table number and whether the table is for AC or DC coefficients).
Start reading the code counts after the information byte:
Information Byte = 0x00
Number of length 1 codes = 0
Number of length 2 codes = 0
Number of length 3 codes = 7
...
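The reading order described above can be sketched as follows. The names HuffmanTable and parse_dht are illustrative, and the function assumes it is handed the bytes that start right after the FFC4 marker (that is, beginning with the two length bytes).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct HuffmanTable {
    int table_class;  // high nibble of the information byte: 0 = DC, 1 = AC
    int table_id;     // low nibble of the information byte
    std::array<std::uint8_t, 16> counts;  // counts[i] = number of codes of length i + 1
    std::vector<std::uint8_t> symbols;    // symbol values, in order of code length
};

// `seg` points at the first byte after the FFC4 marker.
std::vector<HuffmanTable> parse_dht(const std::uint8_t* seg, std::size_t size) {
    std::vector<HuffmanTable> tables;
    if (size < 2) return tables;

    // Big-endian length; it includes the two length bytes themselves.
    const std::size_t length = (static_cast<std::size_t>(seg[0]) << 8) | seg[1];
    if (length < 2 || length > size) return tables;

    std::size_t pos = 2;
    while (pos + 17 <= length) {  // information byte + 16 count bytes
        HuffmanTable t;
        t.table_class = seg[pos] >> 4;
        t.table_id = seg[pos] & 0x0F;
        ++pos;

        std::size_t total = 0;
        for (std::size_t i = 0; i < 16; ++i) {
            t.counts[i] = seg[pos + i];
            total += t.counts[i];
        }
        pos += 16;

        if (pos + total > length) break;  // malformed segment
        t.symbols.assign(seg + pos, seg + pos + total);
        pos += total;
        tables.push_back(t);
    }
    return tables;
}
```

The loop allows for several tables inside one block, which would explain a block length as large as 0x1a2.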