I have downloaded files in Geotiff format with which I want to do some statistical
analysis. Therefore I have converted the Geotiffs with gdal_translate into NetCDF files.
The problem is that it leads to an enormous file size growth, from ~20 MB to ~1.6 GB.
Did anyone have the same problem and has any advice?
The data can be found here: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/fire/data/burned_area/MODIS/pixel/v5.1/compressed/
Please help
I'm afraid this is always the case when going from GeoTIFF to standard NetCDF with gdal_translate. Have you tried compressing the files in NetCDF4 format with e.g.
cdo -f nc4 -z zip9 copy in.nc out.nc
?
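The underlying issue is that classic NetCDF stores the raster uncompressed, while the source GeoTIFFs are DEFLATE/LZW-compressed, and burned-area pixel grids are mostly fill values. A small stdlib sketch (a synthetic sparse array standing in for a real band) shows how much DEFLATE gains on this kind of data:

```python
import zlib

# Synthetic stand-in for a sparse burned-area band:
# mostly zeros (unburned) with a few non-zero pixels.
raw = bytearray(1_000_000)          # ~1 MB of uncompressed raster data
for i in range(0, len(raw), 5000):  # sprinkle a few "burned" pixels
    raw[i] = 1

# This is roughly what DEFLATE inside a GeoTIFF or NetCDF-4 file does.
compressed = zlib.compress(bytes(raw), 9)
print(len(raw), len(compressed))
```

The same ratio working in reverse is what turns ~20 MB of compressed GeoTIFF into ~1.6 GB of uncompressed NetCDF.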
Related
I have a bunch of large HDF5 files (all around 1.7G), which share a lot of their content – I guess that more than 95% of the data of each file is found repeated in every other.
I would like to compress them in an archive.
My first attempt using GNU tar with the -z option (gzip) failed: the process was terminated when the archive reached 50G (probably a file size limitation imposed by the sysadmin). Apparently, gzip wasn't able to take advantage of the fact that the files are near-identical in this setting.
Compressing these particular files obviously doesn't require a very fancy compression algorithm, but a veeery patient one.
Is there a way to make gzip (or another tool) detect these large repeated blobs and avoid repeating them in the archive?
Sounds like what you need is a binary diff program. You can google for that, then try using a binary diff between two of the files and compressing one of them plus the resulting diff. You could get fancy and diff all combinations, pick the smallest ones to compress, and store only one original.
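For what it's worth, the reason gzip could not exploit the near-identical files is its 32 KB DEFLATE window: repeats further back than that are invisible to it, whereas xz/LZMA with a multi-megabyte dictionary sees them. A small stdlib sketch of the difference (random bytes standing in for two near-identical files):

```python
import lzma, os, zlib

block = os.urandom(256 * 1024)   # 256 KB of incompressible data
double = block + block           # "two identical files" concatenated

# gzip/DEFLATE: 32 KB window, cannot see a repeat 256 KB back
gz_one = len(zlib.compress(block, 9))
gz_two = len(zlib.compress(double, 9))

# LZMA (xz): default dictionary far larger than 256 KB, sees the repeat
xz_one = len(lzma.compress(block))
xz_two = len(lzma.compress(double))

print(gz_two / gz_one)  # ~2.0: no cross-file savings at all
print(xz_two / xz_one)  # ~1.0: the second copy is almost free
```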
I am wondering if anyone has tried using compression techniques for their LMDB files? Typically, LMDB files do not use any compression. Has anyone successfully stored JPEG-compressed data in an LMDB and then used it for Caffe? I need this because I am working on a developer board with very limited storage space. If so, can you please provide steps/code to do this?
thanks
Caffe also supports HDF5, which supports compression. If your dataset is something like MNIST, it may be a good choice.
I am trying to use python Wand library but any manipulation I do ends with resulting files being much larger (file size) than the original! Consider the following simple example I am testing on my Ubuntu machine:
with Image(filename='input.pdf', resolution=300) as test:
test.save(filename='output.pdf')
My input file is a scanned document of 10 pages at a resolution of 300 dpi. It takes 3 MB on disk. If I don't specify the resolution when opening the image, the output PDF is only 1 MB but is of very poor quality (unreadable). When specifying resolution 300 (same as the original), the resulting file is 30 MB, 10x larger than the original!
Any help on how to simply being able to save an image with the same compression/resolution as the original would be appreciated.
Thanks!
I have a pretty big folder (~10 GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.
How can I compress the folder to make it small enough?
I tried to use WinRAR in "Best" mode, but it didn't compress it at all. (Pretty strange)
Will zip/tar/cab/7z or any other compression tool do a better job?
I don't mind letting the tool work for a few hours - but not more.
I'd rather not do it programmatically myself.
The best option in your case is 7-Zip.
Here are the options:
7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files
a - add files to archive
-r - Recurse subdirectories
-t7z - Set type of archive (7z in your case)
-m0=lzma2 - Set compression method to LZMA2. LZMA is default and general compression method of 7z format. The main features of LZMA method:
High compression ratio
Variable dictionary size (up to 4 GB)
Compressing speed: about 1 MB/s on 2 GHz CPU
Decompressing speed: about 10-20 MB/s on 2 GHz CPU
Small memory requirements for decompressing (depends on dictionary size)
Small code size for decompressing: about 5 KB
Supporting multi-threading and P4's hyper-threading
-mx=9 - Sets level of compression. x=0 means Copy mode (no compression). x=9 - Ultra
-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.
-md=29 - Sets the dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum value for the dictionary size is 1536 MB, but the 32-bit version of 7-Zip allows specifying up to a 128 MB dictionary. Default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size will be calculated as DictionarySize = 2^Size bytes. For decompressing a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
I use md=29 because on my server there is only 16 GB of RAM available. With these settings, 7-Zip takes only about 5 GB when archiving, regardless of the directory size. If I use a bigger dictionary size, the system goes to swap.
-ms=8g - Enables or disables solid mode. The default mode is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block size as big as possible.
Limitation of the solid block size usually decreases compression ratio. The updating of solid .7z archives can be slow, since it can require some recompression.
-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-Zip thread in one solid block. The drawback is slower archiving, no matter how many CPUs or cores your system has.
-mmtf=off - Set multithreading mode for filters to OFF.
-myx=9 - Sets level of file analysis to maximum, analysis of all files (Delta and executable filters).
-mqs=on - Sort files by type in solid archives. To store identical files together.
-bt - show execution time statistics
-bb3 - set output log level
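As a quick sanity check of the -md=29 setting above, under the DictionarySize = 2^Size rule, and using the rough rule of thumb that LZMA compression needs about 10x the dictionary size in memory (an approximation, not an exact figure):

```python
# -md=29 with no b/k/m/g suffix means DictionarySize = 2**29 bytes.
dict_size = 2 ** 29
print(dict_size // 2 ** 20)  # 512 (MB)

# Rule of thumb: LZMA compression memory is roughly 10x the dictionary,
# which lines up with the ~5 GB usage reported above on the 16 GB server.
approx_mem_gb = dict_size * 10 / 2 ** 30
print(round(approx_mem_gb))  # 5 (GB)
```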
7-zip supports the 'WIM' file format which will detect and 'compress' duplicates. If you're using the 7-zip GUI then you simply select the 'wim' file format.
If you're using command-line 7-zip, see this answer:
https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
I suggest 3 options that I've tried (in Windows):
7zip LZMA2 compression with dictionary size of 1536Mb
WinRar "solid" file
7zip WIM file
I had 10 folders with different versions of a web site (with files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1Gb (100Mb average per folder). While standard 7zip or WinRar compression gave me a file of about 400/500Mb, these options gave me a file of (1) 80Mb, (2) 100Mb & (3) 170Mb respectively.
Update edit: Thanks to @Griffin's suggestion in the comments, I tried to use 7zip LZMA2 compression (dictionary size seems to make no difference) over the 7zip WIM file. Sadly it's not the same backup file I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try with your specific set of files and compare it against method 1.
New edit: My backups were growing and now have many image files. With 30 versions of the site, method 1 weighs 6 GB, while a 7zip WIM file inside a 7zip LZMA2 file weighs only 2 GB!
Do the duplicated files have the same names? Are they usually less than 64 MB in size? Then you should sort by file name (without the path), use tar to archive all of the files in that order into a .tar file, and then use xz to compress to make a .tar.xz compressed archive. Duplicated files that are adjacent in the .tar file and are less than the window size for the xz compression level being used should compress to almost nothing. You can see the dictionary sizes, "DictSize" for the compression levels in this xz man page. They range from 256 KB to 64 MB.
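The sort-then-tar-then-xz recipe can be sketched with Python's standard library alone (tarfile with xz compression); the directory layout and file names here are made up for the demonstration:

```python
import os, tarfile, tempfile

# Hypothetical tree: the same 200 KB file duplicated in two subdirectories.
root = tempfile.mkdtemp()
payload = os.urandom(200_000)
for sub in ("a", "b"):
    os.makedirs(os.path.join(root, sub))
    with open(os.path.join(root, sub, "data.bin"), "wb") as f:
        f.write(payload)

# Collect files and sort by basename (not full path) so duplicates
# sit side by side in the tar stream, then compress with xz.
paths = sorted(
    (os.path.join(dirpath, name)
     for dirpath, _, names in os.walk(root) for name in names),
    key=os.path.basename,
)
archive = os.path.join(root, "backup.tar.xz")
with tarfile.open(archive, "w:xz") as tar:
    for p in paths:
        tar.add(p, arcname=os.path.relpath(p, root))

# Two 200 KB duplicates compress to barely more than one copy,
# because the second copy falls inside xz's dictionary window.
print(os.path.getsize(archive))
```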
WinRAR by default compresses each file separately. So by default there is no real gain in compressing a folder structure with many similar or even identical files.
But there is also the option to create a solid archive. Open help of WinRAR and open on Contents tab the item Archive types and parameters and click on Solid archives. This help page explains what a solid archive is and which advantages and disadvantages this archive file format has.
A solid archive with a larger dictionary size in combination with best compression can make an archive file of a list of similar files very small. For example, I have a list of 327 binary files with file sizes from 22 KB to 453 KB, which total 47 MB, not counting the cluster size of the partition. I can compress those 327 similar, but not identical, files into a RAR archive with a dictionary size of 4 MB that is only 193 KB. That is of course a dramatic reduction in size.
Follow the link to help page about rarfiles.lst after reading help page about solid archive. It describes how you can control in which order the files are put into a solid archive. This file is located in program files folder of WinRAR and can be of course customized to your needs.
You also have to take care about the option Files to store without compression when using the GUI version of WinRAR. This option can be found after clicking on the symbol/command Add on the tab Files. It specifies file types which are just stored in the archive without any compression, like *.png, *.jpg, *.zip, *.rar, ... Those files usually contain the data in compressed format already, and therefore it does not make much sense to compress them once again. But if duplicate *.jpg files exist in a folder structure and a solid archive is created, it makes sense to remove all file extensions from this option.
A suitable command line with using the console version Rar.exe of WinRAR and with using RAR5 archive file format would be:
"%ProgramFiles%\WinRAR\Rar.exe a -# -cfg- -ep1 -idq -m5 -ma5 -md128 -mt1 -r -s -tl -y -- "%UserProfile%\ArchiveFileName.rar" "%UserProfile%\FolderToArchive\"
The switches used in this example are explained in the manual of Rar.exe, which is the text file Rar.txt in the program files directory of WinRAR. WinRAR.exe can also be used by replacing the switch -idq with -ibck, as explained in the help of WinRAR on the page Alphabetic switches list (menu Help, item Help topics, tab Contents, list item Command line mode, sublist item Switches, item Alphabetic switches list).
By the way: There are applications like Total Commander, UltraFinder or UltraCompare and many others which support searching for duplicate files by various, user configurable criteria like finding files with same name and same size, or most secure, finding files with same size and same content, and providing functions to delete the duplicates.
Try eXdupe from www.exdupe.com; it uses deduplication and is so fast that it's practically disk I/O bound.
I have written a Java program for compression. I have compressed some text files, and the file size after compression was reduced. But when I tried to compress a PDF file, I did not see any change in file size after compression.
So I want to know which other kinds of files will not reduce in size after compression.
Thanks
Sunil Kumar Sahoo
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
jpeg/gif/avi/mpeg/mp3 and already compressed files won't change much after compression. You may see a small decrease in file size.
Compressed files will not reduce their size after compression.
Five years later, I have at least some real statistics to show for this.
I've generated 17439 multi-page pdf-files with PrinceXML that totals 4858 Mb. A zip -r archive pdf_folder gives me an archive.zip that is 4542 Mb. That's 93.5% of the original size, so not worth it to save space.
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
PDF files are already compressed. They use the following compression algorithms:
LZW (Lempel-Ziv-Welch)
FLATE (ZIP, in PDF 1.2)
JPEG and JPEG2000 (PDF 1.5)
CCITT (the facsimile standard, Group 3 or 4)
JBIG2 (PDF 1.4)
RLE (Run Length Encoding)
Depending on which tool created the PDF and its version, different types of compression are used. You can compress it further using a more efficient algorithm, or lose some quality by converting images to low-quality JPEGs.
There is a great link on this here
http://www.verypdf.com/pdfinfoeditor/compression.htm
Files encrypted with a good algorithm like IDEA or DES in CBC mode don't compress at all, regardless of their original content. That's why encryption programs first compress and only then encrypt.
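The order dependence is easy to demonstrate with the standard library. The XOR keystream below is a toy stand-in for a real cipher (it is not secure), used only to show what encryption does to compressibility:

```python
import hashlib, zlib

def toy_stream_cipher(data: bytes, key: bytes) -> bytes:
    """XOR with a SHA-256-based keystream. A stand-in for AES/IDEA to
    illustrate the effect on compressibility; NOT real cryptography."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

text = b"the quick brown fox jumps over the lazy dog " * 2000

compress_then_encrypt = toy_stream_cipher(zlib.compress(text, 9), b"key")
encrypt_then_compress = zlib.compress(toy_stream_cipher(text, b"key"), 9)

print(len(compress_then_encrypt))  # small: compression saw the redundancy
print(len(encrypt_then_compress))  # ~len(text): ciphertext looks random
```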
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
Media files don't tend to compress well. JPEG and MPEG files won't compress, while you may be able to compress .png files.
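The "might even end up larger" case is easy to reproduce: already-compressed or encrypted bytes look random, and DEFLATE then has to fall back to stored blocks plus headers. Here random bytes stand in for such data:

```python
import os, zlib

# Random bytes as a stand-in for already-compressed/encrypted content.
already_random = os.urandom(100_000)
recompressed = zlib.compress(already_random, 9)

# The difference is positive: the "compressed" output is *larger*,
# by the zlib/DEFLATE header and stored-block overhead.
print(len(recompressed) - len(already_random))
```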
Files that are already compressed usually can't be compressed any further. For example mp3, jpg, flac, and so on.
You could even get files that are bigger because of the headers added by re-compression.
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and such that are already compressed, so they will not compress much further. Your algorithm is probably only able to eke out meagre savings, if any, based on the text strings contained in the PDF.
Simple answer: compressed files (or we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
You can add all Office 2007 file formats to the list (of @waqasahmed):
Since the Office 2007 .docx and .xlsx (etc) are actually zipped .xml files, you also might not see a lot of size reduction in them either.
Truly random
Approximations thereof, made by a cryptographically strong hash function or cipher, e.g.:
AES-CBC(any input)
"".join(map(b2a_hex, [md5(str(i)) for i in range(...)]))
Any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression must be lossless and reversible), which possibility the pigeonhole principle excludes.
So, there are infinitely many files which do NOT reduce in size after compression and, moreover, a file is not required to be a high-entropy file for that :)
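The pigeonhole argument can be spelled out as a simple count:

```python
# There are more bit strings of length exactly L than of all shorter
# lengths combined, so no lossless scheme can shrink every length-L input.
L = 16
exactly_L = 2 ** L                        # number of length-L inputs
shorter = sum(2 ** k for k in range(L))   # all possible shorter outputs: 2**L - 1
print(exactly_L, shorter)
assert exactly_L > shorter  # at least one input cannot get shorter
```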