Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically "lossless" compress every single file saved on a computer.
What are the pros? The cons? Is it feasible?

Speed.
When compressing a file of any kind you encode its contents in a more compact form, often using dictionaries and/or prefix codes (Huffman coding, for example). To access the data you have to decompress it, and this costs time and memory, because to reach a specific piece of the file you generally have to decompress it as a whole. While decompressing you have to store the result somewhere, and the most appropriate place is RAM.
Of course decompressing the whole file would not be a big problem if all of it needed to be read anyway, or if it were consumed as a stream, but as soon as a program wants to write to the compressed file, all of the data - or at least part of it - has to be compressed again.
As you can see, compressing files in the filesystem would greatly reduce the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and it would also require more RAM.
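To make the cost concrete, here is a minimal Java sketch (the file name and offset are hypothetical) of what "read a single byte" means on a gzip-style stream: even skip() has to inflate every byte it passes over, so reading a byte near the end of a file costs roughly a full decompression.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class ReadOneByte {
        // To read the byte at uncompressed offset `target` inside a .gz file,
        // everything before it has to be inflated first, because gzip offers
        // no way to jump into the middle of the stream.
        public static int byteAt(String gzPath, long target) throws IOException {
            try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(gzPath))) {
                long skipped = 0;
                while (skipped < target) {
                    long n = in.skip(target - skipped); // skip() still inflates the skipped bytes
                    if (n <= 0) throw new IOException("offset past end of stream");
                    skipped += n;
                }
                return in.read(); // the single byte we actually wanted
            }
        }
    }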

Related

Is InnoDB compression compatible with full-text search, and is the memory compressed too?

I want to know if full-text search can be used on compressed InnoDB tables, whether the compression reduces both memory and disk usage or only disk, and whether there is a performance impact from using compression.
"Compatibility" is easily answered by trying in a tiny table. I think it is compatible because the data is uncompressed whenever it comes into the buffer_pool.
"Compressed" is likely to save disk space, but the numbers I have heard are only 2x. Ordinary text usually compresses 3x, but InnoDB has headers, etc, that are not compressed. (JPG does not compress.)
As for reducing memory (buffer_pool) -- It is likely to consume extra memory because both the compressed and uncompressed copies of the data are in memory at least some of the time.
A reference: https://dev.mysql.com/doc/refman/8.0/en/innodb-compression-internals.html , plus pages around it.
My opinion is that InnoDB's compression is rarely useful. Instead, I recommend compressing and decompressing individual columns in the client, thereby offloading that CPU task from the server. But that would not work with FULLTEXT, so maybe the built-in compression would be useful for your application.
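If the client-side route fits your case (and you don't need FULLTEXT on that column), here is a rough sketch of the idea in Java using java.util.zip - the class name and the choice of storing the result in a BLOB/VARBINARY column are mine, nothing MySQL-specific:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public final class ColumnCodec {
        // Compress a text column's value in the client before INSERTing it into a
        // BLOB/VARBINARY column; the server never spends CPU on compression.
        public static byte[] compress(String text) {
            byte[] input = text.getBytes(StandardCharsets.UTF_8);
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        // Reverse it after SELECTing the column back.
        public static String decompress(byte[] blob) throws DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(blob);
            ByteArrayOutputStream out = new ByteArrayOutputStream(blob.length * 3);
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            inflater.end();
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

The server only ever sees opaque bytes, which is exactly why FULLTEXT cannot index such a column.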

Recompressing Compressed Files

Can you keep sending the output of BZip2 (or any compression software) back through the compression process over and over again to make the output files smaller and smaller? Can you compress a file using one software (BZip2) that was already compressed using another method (Snappy)?
No and no. (For lossless compression.)
If the original file was extremely redundant, like megabytes of nothing but zeros, then the first, and maybe the second, recompression will still shrink it. But at some point there will be no gain from recompression, only a small increase in file size. For normal files, even the first recompression results in no gain.
This is true regardless of how you might mix lossless compressors.
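If you want to convince yourself, here is a quick sketch using java.util.zip (the sizes in the comments are rough expectations, not guarantees):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.Random;
    import java.util.zip.GZIPOutputStream;

    public class RecompressDemo {
        // gzip a byte[] once and return the compressed bytes
        static byte[] gzip(byte[] data) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] zeros = new byte[1 << 20];   // highly redundant: 1 MiB of zeros
            byte[] normal = new byte[1 << 20];
            new Random(42).nextBytes(normal);   // stand-in for data with no redundancy left

            byte[] data = zeros;
            for (int pass = 1; pass <= 4; pass++) {
                data = gzip(data);
                System.out.println("zeros, pass " + pass + ": " + data.length + " bytes");
            }
            // Expect a huge win on pass 1, little or nothing after that, and
            // eventually a small increase once no redundancy is left.
            System.out.println("random, pass 1: " + gzip(normal).length
                    + " bytes (slightly larger than the 1 MiB input)");
        }
    }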

C++ Write Data to Random HDD Sector

I need to write a program using C++ which is able to perform data write/read to/from both random and sequential hard disk sectors.
However, I am confused about the term "sector" and its relation to a file.
What I want to know is, if I simply:
1. Create a string containing the words "Hello, world" and then
2. Save the string into "myfile.txt",
is the data written to sequential or random sectors? If it is sequential (my guess), then how can I write the string to a random hard disk sector and then read it back? And also vice versa.
What you are trying to do is pretty much impossible today because of file systems. If you want a file (which you seem to), you need a file system. The file system then lays the data out, in whatever format it wants, on the sectors it thinks are best. Advanced file systems such as btrfs and ZFS also do compression, checksumming and spreading data across multiple hard disks. So you can't just write to a sector: you would likely destroy existing data, and you couldn't read your own data back because the file system doesn't understand your format. It also wouldn't even know the data is there, because a file must be registered in the MFT/btrfs metadata/... tables.
TL;DR Don't try to do it, it will mess up your system and it won't work.

Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++, but when I want to get data out of the file I need to extract the whole thing. Does anyone know of a compression component (preferably .NET-compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original File        Compressed File
10 - 27           => 2 - 5
100 - 202         => 10 - 19
..............
10230 - 102020    => 217 - 298
Since I know the data I need only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?
I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem in a project I am working on (at least I think so), where I had to keep many text articles on a drive and access them in a fairly random manner, and because of the size of the data I had to compress them.
The problem with compressing all of this data at once is that most algorithms depend on the previous data when decompressing it. For example, the popular LZW method builds a dictionary (an instruction on how to decompress the data) on the fly, while doing the decompression, so decompressing the stream from the middle is not possible, although I believe those methods might be tuned for it.
The solution I found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project it was simple - each article was one chunk, and I compressed them one by one, then created an index file that recorded where each "chunk" starts. Decompressing was easy in that case - just decompress the whole stream, which was the one article I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such an elegant manner, you can always split the chunks artificially, for example by packing the data in 5 MB chunks. Then, when you need to read the data from 7 MB to 13 MB, you just decompress the chunks covering 5-10 MB and 10-15 MB.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives a slightly worse compression ratio. The smaller the chunks are, the worse the compression will be.
Also: having many chunks of data doesn't mean you need many different files on the hard drive - just pack them one after another in a single file and remember where each one starts.
Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files, written in C#, that you can use for the compression. It worked pretty well for me, and you can use its built-in support for storing many files in one zip archive to take care of splitting the data into chunks.
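For what it's worth, the chunk-plus-index idea is easy to sketch in Java with java.util.zip as well (the chunk size, class and method names below are my own choices, not from any library; in .NET, DotNetZip or System.IO.Compression would play the same role). It deflates fixed-size chunks one after another into a single file and records where each compressed chunk starts, so a read only has to inflate the chunk covering the requested range:

    import java.io.BufferedInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    public class ChunkedCompressor {
        static final int CHUNK = 5 * 1024 * 1024; // 5 MB of uncompressed data per chunk (arbitrary)

        // Compress `src` into `dst` chunk by chunk; the returned list is the index:
        // entry i is the byte offset in `dst` where compressed chunk i starts.
        static List<Long> compress(File src, File dst) throws IOException {
            List<Long> index = new ArrayList<>();
            try (InputStream in = new BufferedInputStream(new FileInputStream(src));
                 FileOutputStream out = new FileOutputStream(dst)) {
                byte[] buf = new byte[CHUNK];
                int n;
                while ((n = readFully(in, buf)) > 0) {
                    index.add(out.getChannel().position());  // where this chunk begins
                    DeflaterOutputStream def = new DeflaterOutputStream(out);
                    def.write(buf, 0, n);
                    def.finish();                             // end this deflate stream, keep `out` open
                }
            }
            return index;
        }

        // Decompress only the chunk containing uncompressed offset `pos`.
        static byte[] readChunkAt(File dst, List<Long> index, long pos) throws IOException {
            int chunkNo = (int) (pos / CHUNK);
            try (FileInputStream fis = new FileInputStream(dst)) {
                fis.getChannel().position(index.get(chunkNo)); // seek straight to the chunk
                InflaterInputStream inf = new InflaterInputStream(fis);
                ByteArrayOutputStream out = new ByteArrayOutputStream(CHUNK);
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = inf.read(buf)) > 0) {
                    out.write(buf, 0, n); // read() returns -1 at the end of this deflate stream
                }
                return out.toByteArray();
            }
        }

        // Fill `buf` as far as possible; returns the number of bytes actually read.
        static int readFully(InputStream in, byte[] buf) throws IOException {
            int total = 0;
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) break;
                total += n;
            }
            return total;
        }
    }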

Multi-part gzip file random access (in Java)

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.
I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix ARC files. (In case you aren't familiar with multi-part gzip files: the gzip spec allows multiple gzip streams to be concatenated into a single gzip file. They do not share any dictionary information; it is simply binary concatenation.)
I'm thinking it should be possible to do this by seeking to a certain offset within the file, then scanning for the gzip magic header bytes (i.e. 0x1f8b, as per the RFC) and attempting to read a gzip stream from the following bytes. The problem with this approach is that those same bytes can also appear inside the actual data, so scanning for them can land on an invalid position to start reading a gzip stream from. Is there a better way to handle random access, given that the record offsets aren't known a priori?
The BGZF file format, which is compatible with gzip, was developed by biologists.
(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java
The design of GZIP, as you have realized, is not friendly to random access.
You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
If the files are not so big, you might consider just de-compressing all of the entries in series, and retaining the offsets of the signatures so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in about 2 s on a modern CPU. This is what I mean by "pretty fast". But only you know the performance requirements of your application.
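For reference, here is a rough Java sketch of that scan-and-validate approach (class and method names are mine; the byte-at-a-time scan is deliberately naive and would want buffering and a bounded search window in practice):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.zip.GZIPInputStream;

    public class GzipMemberScanner {
        // Find the first offset >= start at which a complete gzip member really
        // begins, or -1 if none is found.
        static long findMember(String path, long start) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                for (long off = start; off < raf.length() - 1; off++) {
                    raf.seek(off);
                    if (raf.read() != 0x1f || raf.read() != 0x8b) continue; // not the gzip magic
                    if (canInflateFrom(path, off)) return off;              // a real member start
                    // otherwise: the magic bytes just happened to occur inside compressed data
                }
            }
            return -1;
        }

        // Try to decompress a gzip member starting at `off`, dumping the output to a
        // bit bucket. GZIPInputStream verifies the CRC32 in the member trailer, so a
        // false positive shows up as an IOException rather than as silently bad data.
        static boolean canInflateFrom(String path, long off) {
            try (FileInputStream fis = new FileInputStream(path)) {
                fis.getChannel().position(off);
                GZIPInputStream gz = new GZIPInputStream(new BufferedInputStream(fis));
                byte[] sink = new byte[64 * 1024];
                while (gz.read(sink) > 0) { /* discard */ }
                return true;
            } catch (IOException e) {
                return false; // bad header, corrupt data, or CRC mismatch
            }
        }
    }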
Do you have a GZipInputStream class? If so you are half-way there.