Compression performance related to chunk size in hdf5 files - compression

I would like to ask a question about the performance of compression
which is related to chunk size of hdf5 files.
I have 2 hdf5 files on hand, which have the following properties.
They both only contain one dataset, called "data".
File A's "data":
Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 5094125 x 6
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 10000 x 6
Compression: GZIP level = 7
File B's "data":
Type: HDF5 Scalar Dataset
No. of Dimensions: 2
Dimension Size: 6720 x 1000
Max. dimension size: Unlimited x Unlimited
Data type: 64-bit floating point
Chunking: 6000 x 1
Compression: GZIP level = 7
File A's size:
HDF5----19 MB
CSV-----165 MB
File B's size:
HDF5----60 MB
CSV-----165 MB
Both of them shows great compression on data stored when comparing to csv files.
However, the compression rate of file A is about 10% of original csv,
while that of file B is only about 30% of original csv.
I have tried different chunk size to make file B as small as possible, but it seems that 30% is the optimum compression rate. I would like to ask why file A can achieve a greater compression while file B cannot.
If file B can also achieve, what should the chunk size be?
Is that any rule to determine the optimum chunk size of HDF5 for compression purpose?
Thanks!

Chunking doesn't really affect the compression ratio per se, except in the manner #Ümit describes. What chunking does do is affect the I/O performance. When compression is applied to an HDF5 dataset, it is applied to whole chunks, individually. This means that when reading data from a single chunk in a dataset, the entire chunk must be decompressed - possibly involving a whole lot more I/O, depending on the size of the cache, shape of the chunk, etc.
What you should do is make sure that the chunk shape matches how you read/write your data. If you generally read a column at a time, make your chunks columns, for example. This is a good tutorial on chunking.

Related

Is double compression less effective?

Let's say we have multiple packages stored as .tar.gz files and we want to combine them into one bundle. Everything I know about lossless file compression is that it attempts to find patterns in the data. From that, my intuition is that it would be able to find more patterns and therefore produce smaller bundle if I first decompress the packages into .tar files and then combine them into one bundle.tar.gz. Is my intuition correct? Or is it not worth the hassle and creating the bundle from the .tar.gz files directly would produce similar results?
I tested it with a random collection of txts (RFC 1-500 from https://www.rfc-editor.org/retrieve/bulk/) and compressing each of them individually and then creating the final .tar.gz from the compressed files yields a 15% bigger result, which supports my intuition but maybe not to an extent I expected.
total size of txts: 5.6M
total size of individually compressed txts: 2.7M
size of .tar.gz from txts: 1.4M
size of .tar.gz from compressed txts: 1.6M
I would like to understand more how it behaves in general.
Compressing something with gzip that is already compressed will generally expand the data, but only by a very small amount, multiplying the size by about 1.0003.
The fact that you are getting a 15% benefit from decompressing the pieces and recompressing the bundle means that your pieces must be relatively small in order for gzip's 32K byte matching distance to find more matches and increase the compression by that much. (You did not say how many of these individually compressed texts there were.)
By the way, it is easy to combine several .tar files into a single .tar file. Each .tar file is terminated with 1024 zero bytes. Strip that from every .tar file except the last one, and concatenate them. Then you have one .tar file to compress.

Redshift table sizes & flavours of

Confused by the term 'table size' in Redshift.
We have :
svv_table_info.size
"Size of table in 1MB blocks"
svv_table_info.pct_used
"Percent of available space used"
... so I assume that a lot of the 'size' is empty space due to sort keys etc
Then we have this..
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
.. which uses the term 'minimum' table size.
But nowhere can I find an explanation of what they means in the real world ? Is this a theoretical minimum if optimally configured ?
Ultimately I need to find out the basic size of original tangible data without any overheads.
Then yes, how much disc space is it actually costing to store it in Redshift.
So if I took 1TB out of our on-prem database and shoved it into Redshift, I'd be looking to see something like 1TB (data) & 1.2TB (data + Redshift overheads).
Hope someone can help clarify 🤞
Redshift stores data in 1MB blocks and blocks are associated with a slice and a column. So if I have 2 slices in my cluster and a table with 4 columns (plus the 3 system columns to make 7) distributed as EVEN containing at least 2 rows, then my table will minimally take up 2 X 7 X 1MB of space (14MB on disk). This is all that article is saying.
Now if I insert 2 additional rows into this table, Redshift will makes new blocks for this data. So now my 4 rows of data take up 28MB of space. However, if I Vacuum the table the wasted space will be reclaimed and the table size will come back down to 14MB. (yes this is a bit of an oversimplification but trying to get the concepts across)
As a rule of thumb a single 1MB block will typically hold between 100,000 rows and 2,000,000 rows of compressed data. (yes this depends on the data not being monster varchars) So for our table above I can keep adding rows (and vacuuming) without increasing the table size on disk until I get a few hundred thousand rows (per slice) in the table. Redshift is very efficient at storing large chunks of data but very inefficient at storing small ones.
What Redshift knows about your data size is how many blocks it takes on disk (across all the nodes, slices, and columns). How big your data would be if it was stored differently (not in blocks, compressed or uncompressed) is not data that is tracked. As John noted, for big tables, Redshift stores data more efficiently than most other database (when compression is used).
You cannot translate from an existing database size to the size of a table in Redshift. This is because:
Columns are stored separately
Minimum block size is 1MB
Data in Redshift is compressed, so it can take considerably less space depending on the type of data and the compression type chosen
Given compression, your data is likely to be smaller in Redshift than an original (uncompressed) data source. However, you can't really calculate that in advance unless you have transferred similar data in the past and apply a similar ratio.

Compress .npy data to save space in disk

l have stored on my disk a huge dataset. Since my dataset is about 1.5 TB. l divide it into 32 samples to be able to use numpy.save('data_1.npy') in python 2.7 . Here is a sample of 9 sub-datasets. Each one is about 30 GB.
The shape of each .npy file is (number_of_examples,224,224,19) and values are float.
data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy
Using np.save(' *.npy'), my dataset occupy 1.5 Tera in my disk.
1)Is there an efficient way to compress my dataset in order to gain some free space disk ?
2) Is there an efficient way of saving files which take less space memory than np.save() ?
Thank you
You might want to check out xz compression mentioned in this answer. I've found it to be the best compression method while saving hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:
tar -vfcJ dataset.tar.xz dataset/
This is just to save disk space while storing and moving the dataset; it needs to be decompressed before loading into python.

fast encode bitmap buffer to png using libpng

My objective is to convert a 32-bit bitmap(BGRA) buffer into png image in real-time using C/C++. To achieve it, i used libpng library to convert bitmap buffer and then write into a png file. However it seemed to take huge time (~5 secs) to execute on target arm board (quad core processor) in single thread. On profiling, i found that libpng compression process (deflate algorithm) is taking more than 90% of time. So i tried to reduce it by using parallelization in some way. The end goal here is to get it done in less than 0.5 secs at least.
Now since a png can have multiple IDAT chunks, i thought of writing png with multiple IDATs in parallel. To write custom png file with multiple IDATs following methodology is adopted
1. Write PNG IHDR chunk
2. Write IDAT chunks in parallel
i. Split input buffer in 4 parts.
ii. compress each part in parallel using zlib "compress" function.
iii. compute CRC of chunk { "IDAT"+zlib compressed data }.
iv. create IDAT chunk i.e. { "IDAT"+zlib compressed data+ CRC}.
v. Write length of IDAT chunk created.
vi. Write complete chunk in sequence.
3. write IEND chunk
Now the problem is the png file created by this method is not valid or corrupted. Can somebody point out
What am I doing wrong?
Is there any fast implementation of zlib compress or multi-threaded png creation, preferably in C/C++?
Any other alternate way to achieve target goal?
Note: The PNG specification is followed in creating chunks
Update:
This method works for creating IDAT in parallel
1. add one filter byte before each row of input image.
2. split image in four equal parts. <-- may not be required passing pointer to buffer and their offsets
3. Compress Image Parts in parallel
(A)for first image part
--deflateinit(zstrm,Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateend(zstrm)
--store compressed buffer and its length
--store adler32 for current chunk, {a1=zstrm->adler} <--adler is of uncompressed data
(B)for second and third image part
--deflateinit(zstrm,Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateend(zstrm)
--store compressed buffer and its length
--strip first 2-bytes, reduce length by 2
--store adler32 for current chunk zstrm->adler,{a2,a3 similar to A} <--adler is of uncompressed data
(C) for last image part
--deflateinit(zstrm,Z_BEST_SPEED)
--deflate(zstrm, Z_FINISH)
--deflateend(zstrm)
--store compressed buffer and its length
--strip first 2-bytes and last 4-bytes of buffer, reduce length by 6
--here last 4 bytes should be equal to ztrm->adler,{a4=zstrm->adler} <--adler is of uncompressed data
4. adler32_combine() all four parts i.e. a1,a2,a3 & a4 <--last arg is length of uncompressed data used to calculate adler32 of 2nd arg
5. store total length of compressed buffers <--to be used in calculating CRC of complete IDAT & to be written before IDaT in file
6. Append "IDAT" to Final chunk
7. Append all four compressed parts in sequence to Final chunk
8. Append adler32 checksum computed in step 4 to Final chunk
9. Append CRC of Final chunk i.e.{"IDAT"+data+adler}
To be written in png file in this manner: [PNG_HEADER][PNG_DATA][PNG_END]
where [PNG_DATA] ->Length(4-bytes)+{"IDAT"(4-bytes)+data+adler(4-bytes)}+CRC(4-bytes)
Even when there are multiple IDAT chunks in a PNG datastream, they still contain a single zlib compressed datastream. The first two bytes of the first IDAT are the zlib header, and the final four bytes of the final IDAT are the zlib "adler32" checksum of the entire datastream (except for the 2-byte header), computed before compressing it.
There is a parallel gzip (pigz) under development at zlib.net/pigz. It will generate zlib datastreams instead of gzip datastreams when invoked as "pigz -z".
For that you won't need to split up your input file because the parallel compression happens internally to pigz.
In your step ii, you need to use deflate(), not compress(). Use Z_FULL_FLUSH on the first three parts, and Z_FINISH on the last part. Then you can concatenate them to a single stream, after pulling off the two-byte header from the last three (keep the header on the first one), and pulling the four-byte check values off of the last one. For all of them, you can get the check value from strm->adler. Save those.
Use adler32_combine() to combine the four check values you saved into a single check value for the complete input. You can then tack that on to the end of the stream.
And there you have it.

Hadoop input split (MapV1)

Problem : What is input split
How the input split is calculated in MapReduce v1?
Is the input Split is same as the HDFS block size?
Each input split size generally equals to HDFS block size. For example, for a file of 1GB size, there will be 16 input splits, if block size is 64MB. However, split size can be configured to be less/more than HDFS block size. For general case, calculation of input splits is done with FileInputFormat.
Calculation of input split size is done in InputFileFormat as:
Math.max("mapred.min.split.size", Math.min("mapred.max.split.size", blockSize));
Some examples:
mapred.min.split.size mapred.max.split.size dfs.block.size Split Size
1 (default) Long.MAX_VALUE(default) 64MB(Default) 64MB
1 (default) Long.MAX_VALUE(default) 128MB 128MB
128MB Long.MAX_VALUE(default) 64MB 128MB
1 (default) 32MB 64MB 32MB
For detailed explanation, you can view here.