Zlib compression enlarging file - c++

I'm trying to use zlib in an iPhone app to compress a text file into a gzip file as a test. I am using the following code
const char *s = [[Path stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@".%@", [Path pathExtension]] withString:@".gz"] UTF8String];
gzFile fi = gzopen(s, "wb");
const char *c = readFile(Path.UTF8String);
gzwrite(fi, c, strlen(c));
gzclose(fi);
where readFile() returns a const char* that was obtained from the file using fgets(). The problem is that this doesn't compress the file; instead, the gzip file is larger than the original. For example, I have a text file that is 90 bytes, and after using this method the gzip file is 98 bytes. Why isn't the gzip file smaller than the original?

The GZip format includes fixed-size header information. Because you are compressing so little data, the header information is larger than the space you are saving.
90 bytes is generally not worth compressing.
http://www.gzip.org/zlib/rfc-gzip.html#header-trailer

Regardless of the compression algorithm used, there is always a chance that the generated data will be larger than the input; otherwise it would not be possible to encode every combination of input bit patterns.
As already stated, in your specific case the very small file size relative to the header overhead is the problem.
Nevertheless, it is worth keeping in mind that there is never a guarantee the "compressed" file will be smaller.

The data you are trying to compress is too small and there is not a lot of redundancy, so there is nothing left to compress. Compression algorithms work, to put it very simply, by eliminating repeating sequences in data. In 90 bytes, you probably don't have much redundancy, unless it's text like "aaaaaaa....".
Fixed header overhead, as already mentioned.
Try a bigger data file.

Related

Is it possible to compress a compressed file by mixin and/or 'XOR'?

I want to compress an already-compressed file. But when I mix in data and compress the file again, the result is larger than before every time.
I have two methods for mixing. First, XOR the compressed file with another (generated) file; second, swap some bits in the compressed file.
Do you have any ideas for new methods, or is it impossible?
As mentioned in the comments, the first compression has already taken the extra "air" out so it cannot be "squeezed" smaller anymore. It is now close to random data, and compressing random data will always make it at least slightly larger due to compression headers.
The situation is different with lossy compression though, you can discard as much data as you wish.

save/write data structure into binary file and then restore from it in c++

I have a char** array that stores all sorts of things like structs, integers and characters.
char** disk = new char*[100];
I set every block of disk to 64 bytes and use memcpy to store different pieces of information.
Then I need to save this disk into a file, from which I can restore it again.
But I don't know how to save this data structure into a binary file; all I know is how to write some text and read it back from a text.txt file using ifstream. I don't think that is efficient, so my question is: what is the best way to write this disk to a file? How do I write to a binary file, and how can I restore it (read it back)? Could you please give me some examples?
Thanks!

Compressing data before sending it - control characters?

In a multiplayer game setting, I was going to use zlib to compress larger strings before sending them. I placed the resulting data back into strings, which are to be sent as byte streams using TCP.
My problem is that I need to place control characters into the string as well. For example, I need to add the original string length (in plain text) to the front of the compressed string, and separate it from the compressed data using some symbol, like "|".
But I can't find a way of knowing which bytes are actual content and which are control characters. Are there any characters that a zlib-compressed string will never contain (besides 0, which I can't use since it marks the end of a C string) which I could use to separate the "metadata" from the "compressed data"?
No, there are no byte values that cannot be contained in a zlib stream. However, a zlib stream is self-terminating: by simply using inflate() to decompress the stream, you will find the end when it returns Z_STREAM_END. The bytes not consumed by inflate() are the bytes immediately after the zlib stream. Upon completion of decompression you know both how many compressed bytes the stream contained and how many uncompressed bytes were generated.
If you are simply processing your entire stream, with the zlib stream embedded, sequentially, then there is no need to store either the compressed or uncompressed length anywhere. That information is inherent in the zlib data itself. You would only need to store such lengths if you had to process your stream non-sequentially, or wanted to access other data in the stream after the zlib data without having to decompress it first.
If you control both the creation of the compressed blobs and their decompression, you can prepend the size to the compressed data (probably by reserving some bytes at the beginning of the buffer), and then when decompressing skip the size and pass the pointer to the compressed data to the decompression routine. This way you don't have to worry about the size bytes corrupting your compressed data: the decompression code never sees them.
How are you doing the compression? If you're piping the output through an external program, there's not much you can do, but if you're using something internal, like a compressing streambuf, you should be able to output to the original streambuf when necessary. With the filtering streams from boost::iostreams, for example, you can write the length, then push the compression stream, write the data to be compressed, then pop the compression stream, and continue writing plain text. (You may have to insert a judicious flush here and there, before changing the filter stack.) Or you should be able to compress the parts you want to compress into a buffer, and use std::ostream::write to output them.

File Binary vs Text

Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store a large amount of text, is it better to use a text file or a binary file?
Edit
The file for the moment has no requirement to be readable by humans. Are there performance differences, security differences, and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are in ASCII encoding
The file must be encrypted (AES)
The machine can power off at any time, so I have to try to prevent errors
I have to know if the file changes outside the program; I think I'll use a SHA-1 digest of the file
As a general rule, define a text format, and use it. It's much easier to develop and debug, and it's much easier to see what is going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too much time to transfer over the wire, consider compressing them. A compressed text file is often smaller than what you can achieve with binary. Or consider a less verbose text format; it's possible to reliably transmit a text representation of your data with far fewer characters than XML uses.
And finally, if you do end up having to use binary, try to choose an existing format (e.g. Google's protocol buffers), or base your format on an existing one. Just remember that:
Binary is a lot more work than text, since you practically have to write all of the << operators again, including those in the standard library.
Binary is a lot more difficult to debug, because you can't easily see what you've actually done.
Concerning your last edit:
Once you've encrypted, the results will be binary. You can use a text representation of the binary (base64 or some such), but the results won't be any more readable than the binary, so it's not worth the bother. If you're encrypting in-process, before writing to disk, you automatically lose all of the advantages of text.
The issues concerning powering off mean that you cannot use ofstream directly. You must open or create the file with the necessary options for full transactional integrity (O_SYNC as a flag to open under Unix), and you must write each record as a single write request to the system.
It's always a good idea to have a checksum, just in case. If you're worried about security, SHA-1 is a good choice. But keep in mind that if someone has access to the file and wants to intentionally change it, they can recalculate the SHA-1 and insert the new value as well.
All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file would contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3 - you're saving the string "1\n2\n3\n". Note that this string is 6 characters long, and the byte values (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
Binary files store a sequence of bytes, like all other files. You can store numeric values like integers in 4 bytes each, characters in single bytes, or even serialized class objects - anything you want.
When you know how to read a binary file (i.e. you know what is stored in it), you can extract all the information from it. Text files, by contrast, use text encodings like UTF-8, ANSI, etc., and are intended to encode text characters to be processed by text editors.
Binary files are for machines to interpret, whereas a text file is something a human can also open and read directly.
So it depends on whether you want your file to be readable by a human or not.
It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compression a factor? A 10-digits number will take at least 10 bytes as text, but might take as little as four or two as binary.
All data is binary; you always need a machine to interpret it for you. Even if the data is encoded compactly, as with protocol buffers, Avro, Thrift, etc., it is binary, and if it is a plain dump it is still binary. If you want to read protocol buffers in Notepad, there is a two-step process: decode, then read. In the case of text, the decoding step is not needed. The same is true of encryption: first decrypt, then read. Humans cannot read raw binary (as some commenters are mentioning); we still need a program like Notepad to interpret and display the bytes as so-called text.
All data stored in a text file consists of human-readable characters, and each line of data ends with a newline character.
In a binary file, data is stored in the same format as it is held in memory. There are no lines or newline delimiters; the end of the data is simply the end of the file.
Moreover, binary files are usually more space-efficient, because values are stored in their compact native representation rather than spelled out as characters.

tar.Z file format, structure, header

I am trying to figure out the file layout of a tar.Z file (a so-called .taz file, a compressed tar file). This file can be produced with tar's -Z option or with the Unix compress utility (the results are the same).
I tried to google for documentation about this file structure, but there is none. I know that it is an LZW-compressed file and starts with the magic number 1F 9D, but that's all I can figure out.
Could someone please tell me more details about the file header, or anything else? I am not interested in how to uncompress this file, or in which Linux command can process it. What I want to know is its internal file structure/header/format/layout.
Thank you in advance
A .Z file is compressed using compress and can be uncompressed with uncompress (or on some machines this is called uncompress.real). This .Z file can hold any data. .tar.Z or .taz is just a .tar file that is compressed with compress.
The first 2 bytes (MAGIC_1 and MAGIC_2) are used to check if the .Z file really is a .Z file, and not something else with accidentally the same extension. These bytes are hardcoded in the sources.
The third byte is a settings byte and holds 2 values:
The most significant bit is the block mode.
The last 5 bits indicate the maximum size of the code table (the code table is used for LZW compression).
From the original code: BLOCK_MODE=0x80; byte3=(BIT|BLOCK_MODE); and BIT is in an if/else block where it is 12..16.
If block mode is turned on, an entry containing the CLEAR sign will be added to the code table at position 256 (remember, 0..255 are pre-filled with the values 0..255). So whenever the CLEAR sign appears in the data stream read from the file, the code table has to be reverted to its initial state (so it holds only entries 0..256).
The maximum code size indicates the number of bits a code can have. When the maximum is hit, no more entries are added to the code table. So if the maximum code size is 0b00001100 = 12, codes can be at most 12 bits wide, giving a maximum of 2^12 = 4096 table entries.
The highest value that compress actually uses is 16 bits, which means 2 bits in this settings byte are unused.
After these 3 bytes the raw LZW data starts. Because the LZW table starts at 9 bits, the 4th byte will be the same as the first byte of the input (in case of a .tar.Z file, or taz file, this byte will be the first byte of the uncompressed .tar file).
A tar.Z file is just a compressed tar file, so you will only find the 1F 9D magic number telling you to uncompress it.
When uncompressed you can read the tar file header:
http://www.fileformat.info/format/tar/corion.htm
Q: Can this file be produced with the tar -Z option or with the Unix compress utility (are the results the same)?
A: Yes. "tar -cvf myfile.tar myfiles; compress myfile.tar" is equivalent to using "-Z". An even better choice is often "j" (using bzip2 instead of compress).
Q: What is the layout of a tar file?
A: There are many references, and much freely available source. For example:
http://en.wikipedia.org/wiki/Tar_%28file_format%29
Q: What is the format of a Unix compressed file?
A: Again: many references; easy to find sample source code:
http://en.wikipedia.org/wiki/Compress
For a .tgz (compressed tar file) you'll need both formats: you must first uncompress it, then untar it. The "tar" utility will do both for you, automagically :)