Compress .npy data to save disk space - python-2.7

I have a huge dataset stored on my disk. Since the dataset is about 1.5 TB, I divided it into 32 samples so that each one can be written with numpy.save('data_1.npy') in Python 2.7. Here is a sample of 9 sub-datasets; each one is about 30 GB.
The shape of each .npy file is (number_of_examples, 224, 224, 19) and the values are floats.
data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy
Saved with np.save(), my dataset occupies 1.5 TB on my disk.
1) Is there an efficient way to compress my dataset in order to free up some disk space?
2) Is there an efficient way of saving the files so that they take up less disk space than np.save()?
Thank you

You might want to check out xz compression mentioned in this answer. I've found it to be the best compression method while saving hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:
tar -cvJf dataset.tar.xz dataset/
This is just to save disk space while storing and moving the dataset; it needs to be decompressed before loading into Python.
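If you would rather stay inside Python than shell out to tar, numpy itself can write zlib-compressed archives with np.savez_compressed (available in Python 2.7-era numpy). The ratio is usually worse than xz, but np.load reads the result directly, with no separate decompression step. A minimal sketch, with the array and file names as placeholders:

import numpy as np

# Placeholder standing in for one chunk; float32 already halves the raw size
# compared to float64, if that precision is acceptable.
data = np.zeros((10, 224, 224, 19), dtype=np.float32)

# Write a zlib-compressed .npz instead of a raw .npy.
np.savez_compressed('data_1.npz', data=data)

# np.load decompresses transparently; access the array by the keyword used above.
archive = np.load('data_1.npz')
restored = archive['data']
assert (data == restored).all()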

Related

Is double compression less effective?

Let's say we have multiple packages stored as .tar.gz files and we want to combine them into one bundle. Everything I know about lossless file compression is that it attempts to find patterns in the data. From that, my intuition is that it would be able to find more patterns, and therefore produce a smaller bundle, if I first decompress the packages into .tar files and then combine them into one bundle.tar.gz. Is my intuition correct? Or is it not worth the hassle, and would creating the bundle from the .tar.gz files directly produce similar results?
I tested it with a random collection of txts (RFC 1-500 from https://www.rfc-editor.org/retrieve/bulk/): compressing each of them individually and then creating the final .tar.gz from the compressed files yields a result about 15% bigger, which supports my intuition, though perhaps not to the extent I expected.
total size of txts: 5.6M
total size of individually compressed txts: 2.7M
size of .tar.gz from txts: 1.4M
size of .tar.gz from compressed txts: 1.6M
I would like to understand more how it behaves in general.
Compressing something with gzip that is already compressed will generally expand the data, but only by a very small amount, multiplying the size by about 1.0003.
The fact that you are getting a 15% benefit from decompressing the pieces and recompressing the bundle means that your pieces must be relatively small in order for gzip's 32K byte matching distance to find more matches and increase the compression by that much. (You did not say how many of these individually compressed texts there were.)
By the way, it is easy to combine several .tar files into a single .tar file. Each .tar file is terminated with 1024 zero bytes. Strip that from every .tar file except the last one, and concatenate them. Then you have one .tar file to compress.
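A minimal Python sketch of that concatenation trick, assuming plain (already decompressed) .tar files with no extra zero-block padding beyond the 1024-byte end-of-archive marker (GNU tar's --concatenate/-A option does the same job more robustly); it reads whole files into memory for brevity:

def concat_tars(tar_paths, out_path):
    # Join plain .tar files by stripping the 1024-byte end-of-archive
    # marker from every archive except the last one.
    with open(out_path, 'wb') as out:
        for i, path in enumerate(tar_paths):
            with open(path, 'rb') as f:
                data = f.read()
            if i < len(tar_paths) - 1 and data[-1024:] == b'\x00' * 1024:
                data = data[:-1024]   # drop the terminator, keep the entries
            out.write(data)

# concat_tars(['a.tar', 'b.tar', 'c.tar'], 'bundle.tar'), then gzip bundle.tar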

Are there any good tar compression algorithms for huge directories (1 - 10TB) of random content?

I want to back up and compress all of my data on my Linux PC. The size of all these files adds up to about 3.4 TB. I want to compress them with tar and some compression algorithm.
I have already tried some algorithms like xz, but they only yielded a marginal 10% compression (shaving off between 100 and 300 GB).
Are there any algorithms that yield 20% to 40% compression for such huge amounts of data? Neither RAM nor processing power is a concern with regard to the algorithm (32 GB of RAM and a 4790K).
xz is already a very good compressor. It seems that your data is simply not very compressible. There is no magic compressor that will do two to four times better. You might get a little gain from a PPM-based compressor, maybe, but not much.
Your only hope would be to perhaps recognize compression already applied to your data, decompress it, and then recompress it with something better than the original compressor.
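If you want to try that, a first step is simply spotting which files are already compressed. A rough sketch that checks a few well-known magic numbers (the signature list is illustrative, not exhaustive, and the /data path is a placeholder):

import os

MAGIC = {
    b'\x1f\x8b': 'gzip',
    b'\x1f\x9d': 'compress (.Z)',
    b'BZh': 'bzip2',
    b'\xfd7zXZ\x00': 'xz',
    b'PK\x03\x04': 'zip',
    b'\x89PNG': 'png',
    b'\xff\xd8\xff': 'jpeg',
}

def compressed_format(path):
    with open(path, 'rb') as f:
        head = f.read(8)
    for sig, name in MAGIC.items():
        if head.startswith(sig):
            return name
    return None

for root, _, files in os.walk('/data'):
    for name in files:
        full = os.path.join(root, name)
        fmt = compressed_format(full)
        if fmt:
            print('%s looks like %s' % (full, fmt))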

Apache Hadoop: Insert compressed data into HDFS

I need to upload 100 text files into HDFS to do some data transformation with Apache Pig.
In your opinion, what is the best option:
a) Compress all the text files and upload only one file,
b) Load all the text files individually?
It depends on your file sizes, cluster parameters, and processing methods.
If your text files are comparable in size to the HDFS block size (e.g. block size = 256 MB, file size = 200 MB), it makes sense to load them as is.
If your text files are very small, you would run into the typical HDFS small-files problem: each file will occupy one HDFS block (not physically), so the NameNode, which handles the metadata, will suffer some overhead from managing a lot of blocks. To solve this you could merge your files into a single one, use Hadoop Archives (HAR), or use some custom file format (SequenceFiles, for example).
If a custom format is used, you will have to do extra work when processing: you will need to use custom input formats.
In my opinion, 100 files is not that many to significantly affect NameNode performance, so both options seem viable.
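If you do decide to merge, a simple pre-upload step along these lines keeps each merged file close to the block size before running hdfs dfs -put (the paths and the 256 MB target are illustrative):

import glob
import os

BLOCK_SIZE = 256 * 1024 * 1024   # assumed HDFS block size

def merge_small_files(input_glob, out_prefix):
    # Concatenate many small text files into chunks near the block size.
    part, written = 0, 0
    out = open('%s_%04d.txt' % (out_prefix, part), 'wb')
    for path in sorted(glob.glob(input_glob)):
        size = os.path.getsize(path)
        if written and written + size > BLOCK_SIZE:
            out.close()
            part += 1
            written = 0
            out = open('%s_%04d.txt' % (out_prefix, part), 'wb')
        with open(path, 'rb') as f:
            out.write(f.read())
        written += size
    out.close()

# merge_small_files('logs/*.txt', 'merged/part')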

What's the smallest possible file size on disk?

I'm trying to find a way to store a binary file at its smallest size on disk. I'm reading a vehicle's VIN and plate number from a database, which comes to about 30 bytes, and when I put it in a txt file and save it, its size is 30 B but its size on disk is 4 KB. That means that if I save 100,000 files or more, it will kill my storage space.
So my question is: how can I write these 30 bytes to an individual binary file at its smallest size on disk, and what is the smallest possible on-disk size of 30 bytes, including other info such as the file name and permissions?
Note: I do not want to save this text in a database; I want to make separate binary files.
The smallest size of a file is always the cluster size of your disk, which is typically 4 KB. For data like this, having many records in a single file is really the only reasonable solution.
Although another possibility would be to store those files in an archive, a zip file for example (sketched below). Under Windows you can even access the zip contents much like ordinary files in Explorer.
Another creative possibility: store all the data in the filename only. A zero-byte file takes only 1024 bytes in the MFT (assuming NTFS).
Edit: reading up on resident files, I found that on the newer 4K-sector drives the MFT entry is actually 4 KB too, so it doesn't get smaller than this, whether the data size is 0 or not.
Another edit: huge directories, with tens or hundreds of thousands of entries, become quite unwieldy. Don't try to open one in Explorer, or be prepared to go drink a coffee while it loads.
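A quick sketch of the zip-archive idea with Python's zipfile module; note that each entry still carries some header overhead (on the order of a hundred bytes), so it removes the cluster slack but is not free per record (the records here are made up):

import zipfile

records = [('1HGCM82633A004352', 'ABC-1234'),
           ('JH4KA7561PC008269', 'XYZ-9876')]

with zipfile.ZipFile('vehicles.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    for vin, plate in records:
        # One tiny member per record, named after the VIN.
        archive.writestr(vin + '.txt', '%s;%s' % (vin, plate))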
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key. You won't spend a file on each association (only a single file for all of them).
You should reconsider your opposition to "databases". SQLite is a library that gives you SQL and database abilities, and there are NoSQL databases like MongoDB.
Of course, all this is horribly operating-system and filesystem specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.
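For a sense of what that looks like, here is a minimal sqlite3 sketch (the table and column names are made up); every record ends up in a single database file instead of costing a 4 KB cluster each:

import sqlite3

conn = sqlite3.connect('vehicles.db')   # one file holds all records
conn.execute('CREATE TABLE IF NOT EXISTS vehicles (vin TEXT PRIMARY KEY, plate TEXT)')

rows = [('1HGCM82633A004352', 'ABC-1234'),
        ('JH4KA7561PC008269', 'XYZ-9876')]
conn.executemany('INSERT OR REPLACE INTO vehicles VALUES (?, ?)', rows)
conn.commit()

plate = conn.execute('SELECT plate FROM vehicles WHERE vin = ?',
                     ('1HGCM82633A004352',)).fetchone()[0]
conn.close()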
On filesystems you have the same problem: the smallest allocation unit is one data block, plus an inode. For example, in IBM JFS2 the smallest block size is 4 KB, and you still have an inode to allocate. The second problem is that you will be writing many files in a short time; writing many inodes in a short time causes performance problems, because every write operation must be journaled and committed (unless you use an old, non-journaled filesystem).
One idea is to group many of your data records, put a separator between them, and write 200-1000 of them into one file, for example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them with the file name (sequence numbers or similar), as in the sketch below.
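A small sketch of that packing scheme, assuming fixed-format records and the ';;' separator from the example above; the sequence number in each filename serves as the index:

def write_packed(records, out_prefix, per_file=500):
    # Write records in batches of a few hundred, ';;'-separated, per file.
    for start in range(0, len(records), per_file):
        batch = records[start:start + per_file]
        with open('%s_%06d.dat' % (out_prefix, start // per_file), 'w') as f:
            f.write(';;'.join(batch))

# write_packed(['0102030400506070809101112131415'] * 1200, 'records')
# -> records_000000.dat, records_000001.dat, records_000002.dat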

tar.Z file format, structure, header

I am trying to figure out the file layout of a tar.Z file (the so-called .taz file, a compressed tar file). This file can be produced with tar's -Z option or by using the Unix compress utility (the results are the same).
I tried to google for documentation about this file structure, but I could not find any. I know that it is an LZW-compressed file and starts with the magic number "1F 9D", but that is all I can figure out.
Could someone please tell me more details about the file header or anything else?
I am not interested in how to uncompress this file, or in which Linux command can process it. What I want to know is the internal file structure/header/format/layout.
Thank you in advance.
A .Z file is compressed using compress and can be uncompressed with uncompress (or on some machines this is called uncompress.real). This .Z file can hold any data. .tar.Z or .taz is just a .tar file that is compressed with compress.
The first 2 bytes (MAGIC_1 and MAGIC_2) are used to check if the .Z file really is a .Z file, and not something else with accidentally the same extension. These bytes are hardcoded in the sources.
The third byte is a settings byte and holds 2 values:
The most significant bit is the block mode.
The last 5 bits indicate the maximum size of the code table (the code table is used for LZW compression).
From the original code: BLOCK_MODE=0x80; byte3=(BIT|BLOCK_MODE); and BIT is in an if/else block where it is 12..16.
If block mode is turned on, an entry will be added to the code table at position 256 (remember, 0..255 are filled with the values 0..255) and this will contain the CLEAR sign. So whenever the CLEAR sign is read from the file's data stream, the code table has to be reverted to its initial state (so that it only has 0..256 in it).
The maximum code size indicates the number of bits a code can have. When the maximum is hit, no more entries are added to the code table. So if the maximum code size is 0b00001100, the codes can be at most 12 bits, giving a maximum of 2^12 = 4096 entries.
The highest value used by compress is 16 bits. That means there are 2 bits in this settings byte that are unused.
After these 3 bytes the raw LZW data starts. Because the LZW table starts at 9 bits, the 4th byte will be the same as the first byte of the input (in case of a .tar.Z file, or taz file, this byte will be the first byte of the uncompressed .tar file).
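Putting that description into code, a small sketch that reads only the three header bytes of a .Z file and decodes the settings byte (it does not decode the LZW stream itself):

def read_z_header(path):
    with open(path, 'rb') as f:
        header = f.read(3)
    if header[:2] != b'\x1f\x9d':
        raise ValueError('not a compress(1) .Z file')
    flags = ord(header[2:3])          # settings byte
    block_mode = bool(flags & 0x80)   # most significant bit
    max_bits = flags & 0x1f           # last 5 bits: maximum code size, 9..16
    return block_mode, max_bits

# read_z_header('archive.tar.Z') -> e.g. (True, 16) for a default compress run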
A tar.Z file is just a compressed tar file, so you will only find the 1F 9D magic number telling you to uncompress it.
When uncompressed you can read the tar file header:
http://www.fileformat.info/format/tar/corion.htm
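Each tar member starts with a 512-byte header: the name occupies the first 100 bytes and the size is an octal ASCII field at offset 124 (see the links in this thread for the full layout). A minimal sketch reading the first header of an uncompressed .tar:

def first_tar_entry(path):
    # Return (name, size_in_bytes) of the first member in a plain .tar file.
    with open(path, 'rb') as f:
        header = f.read(512)
    name = header[0:100].rstrip(b'\x00').decode('ascii')
    size = int(header[124:136].rstrip(b'\x00 ') or b'0', 8)   # octal ASCII field
    return name, size

# first_tar_entry('archive.tar')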
Q: This file can be produced with tar's -Z option or using the Unix compress utility (the results are the same).
A: Yes. "tar -cvf myfile.tar myfiles; compress myfile.tar" is equivalent to using "-Z". An even better choice is often "-j" (using bzip2 instead of compress).
Q: What is the layout of a tar file?
A: There are many references, and much freely available source. For example:
http://en.wikipedia.org/wiki/Tar_%28file_format%29
Q: What is the format of a Unix compressed file?
A: Again: many references; easy to find sample source code:
http://en.wikipedia.org/wiki/Compress
For a .tgz (a compressed tar file) you'll need both formats: you must first uncompress it, then untar it. The "tar" utility will do both for you, automagically :)
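In Python, the tarfile module gives the same convenience for gzip/bzip2 (and, on Python 3, xz) tars; it does not handle the old LZW .Z format, which has to be uncompressed first:

import tarfile

# 'r:*' lets tarfile detect the compression automatically.
with tarfile.open('dataset.tar.gz', 'r:*') as archive:
    for member in archive.getmembers():
        print(member.name, member.size)
    archive.extractall('unpacked/')   # decompress and untar in one step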