I am trying to use the Python Wand library, but any manipulation I do ends with the resulting files being much larger (in file size) than the original! Consider the following simple example I am testing on my Ubuntu machine:
with Image(filename='input.pdf', resolution=300) as test:
test.save(filename='output.pdf')
My input file is a scanned document of 10 pages at a resolution of 300 dpi. It takes 3 MB on disk. If I don't specify the resolution when opening the image, the output PDF is only 1 MB but is of very poor quality (unreadable). When specifying the resolution of 300 (same as the original), the resulting file is 30 MB, 10x larger than the original!
Any help on how to simply save an image with the same compression/resolution as the original would be appreciated.
Thanks!
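One thing that might be worth trying (just a sketch, not a verified fix, and it assumes a Wand/ImageMagick build where these properties are settable): keep the 300 dpi rasterization but explicitly request JPEG compression of the output via compression and compression_quality; the quality value below is an arbitrary illustration.
from wand.image import Image

# Sketch: re-rasterize at 300 dpi but ask for JPEG-compressed output
# instead of leaving the pages uncompressed. 80 is an arbitrary quality value.
with Image(filename='input.pdf', resolution=300) as img:
    img.compression = 'jpeg'            # assumes your Wand version allows setting this
    img.compression_quality = 80
    img.save(filename='output.pdf')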
I have a Robot Operating System (ROS) .bag file containing .jpg compressed images in the form of sensor_msgs/CompressedImage messages. I have written a roscpp program that can access the raw data in the individual messages, but I'm having a hard time saving the array of raw JPEG-encoded data to a file.
Unfortunately, the bag files I have are very large and contain thousands of images, and I am working under a time constraint. I tried using rosbag play -i and image_view export to save off the images, but it's way too slow. I also tried using Python, but Python is slow, and I don't have a way to save the images (same problem as in C++).
Essentially, I need a way to prepend a valid jpg header to my data and save it in a file. Any suggestions are appreciated!
Creating an image header for a chunk of data that should already be an image is probably not the right approach. After all, JPEGs are complex, and the data stream should already contain all the information needed to decode them... how else would those tools be able to display them to you?
You can get a good idea of whether or not a binary blob contains an image by looking at the first and last bytes. JPEGs start with FF D8 and end with FF D9, for example. Some of the magic numbers for other file types can be found here: https://en.wikipedia.org/wiki/Magic_number_(programming)#Magic_numbers_in_files
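If it helps, here is a minimal sketch of that check in Python (the variable names are made up): verify the FF D8 ... FF D9 framing and, if it looks like a JPEG, write the bytes straight to a .jpg file with no extra header.
# 'blob' is assumed to hold the raw bytes of one CompressedImage message.
def save_if_jpeg(blob, path):
    # JPEG streams start with FF D8 and end with FF D9.
    if blob[:2] == b'\xff\xd8' and blob[-2:] == b'\xff\xd9':
        with open(path, 'wb') as f:
            f.write(blob)               # the data already is a complete JPEG file
        return True
    return False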
I have a pretty big folder (~10 GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.
How can I compress the folder to make it small enough?
I tried using WinRAR in "Best" mode, but it didn't compress it at all (which is pretty strange).
Will zip/tar/cab/7z or any other compression tool do a better job?
I don't mind letting the tool work for a few hours - but not more.
I'd rather not do it programmatically myself.
The best option in your case is 7-zip.
Here are the options:
7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -myx=9 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files
a - add files to archive
-r - Recurse subdirectories
-t7z - Set type of archive (7z in your case)
-m0=lzma2 - Sets the compression method to LZMA2. LZMA is the default, general-purpose compression method of the 7z format. The main features of the LZMA method are:
High compression ratio
Variable dictionary size (up to 4 GB)
Compressing speed: about 1 MB/s on 2 GHz CPU
Decompressing speed: about 10-20 MB/s on 2 GHz CPU
Small memory requirements for decompressing (depends on the dictionary size)
Small code size for decompressing: about 5 KB
Support for multi-threading and Pentium 4 hyper-threading
-mx=9 - Sets level of compression. x=0 means Copy mode (no compression). x=9 - Ultra
-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.
-md=29 - Sets the dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum dictionary size is 1536 MB, but the 32-bit version of 7-Zip only allows dictionaries up to 128 MB. The default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size is calculated as DictionarySize = 2^Size bytes, so -md=29 means 2^29 bytes = 512 MB. For decompressing a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
I use md=29 because my server has only 16 GB of RAM available. With these settings 7-Zip uses only about 5 GB when archiving a directory of any size. If I use a bigger dictionary size, the system goes to swap.
-ms=8g - Sets the solid block size to 8 GB (solid mode itself defaults to s=on). In solid mode, files are grouped together into one compressed stream. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block size as big as possible.
Limiting the solid block size usually decreases the compression ratio. Updating solid .7z archives can be slow, since it may require some recompression.
-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-Zip thread within one solid block. The drawback is slower archiving, no matter how many CPUs or cores your system has.
-mmtf=off - Set multithreading mode for filters to OFF.
-myx=9 - Sets the level of file analysis to maximum, i.e. analysis of all files (for the Delta and executable filters).
-mqs=on - Sorts files by type in solid archives, so that identical files are stored together.
-bt - Shows execution time statistics.
-bb3 - Sets the output log level.
7-zip supports the 'WIM' file format which will detect and 'compress' duplicates. If you're using the 7-zip GUI then you simply select the 'wim' file format.
If you're using command-line 7-zip instead, see this answer:
https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
I suggest 3 options that I've tried (in Windows):
7zip LZMA2 compression with dictionary size of 1536Mb
WinRar "solid" file
7zip WIM file
I had 10 folders with different versions of a web site (with files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1 GB (100 MB average per folder). While standard 7zip or WinRAR compression gave me a file of about 400-500 MB, these options gave me files of (1) 80 MB, (2) 100 MB and (3) 170 MB respectively.
Update edit: Thanks to @Griffin's suggestion in the comments, I tried using 7zip LZMA2 compression (dictionary size seems to make no difference) on top of the 7zip WIM file. Sadly it is not the same backup file I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try with your specific set of files and compare it against method 1.
New edit: My backups kept growing and now contain many image files. With 30 versions of the site, method 1 weighs 6 GB, while a 7zip WIM file inside a 7zip LZMA2 file weighs only 2 GB!
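For reference, a rough sketch of that two-step method driven from Python (the folder and archive names are made up, and it assumes 7z is on the PATH):
import subprocess

folder = r'C:\sites\backups'            # made-up input folder
# Step 1: store the tree in a WIM archive, which stores duplicate files only once.
subprocess.run(['7z', 'a', '-twim', 'sites.wim', folder], check=True)
# Step 2: compress the WIM container itself with LZMA2 at maximum level.
subprocess.run(['7z', 'a', '-t7z', '-m0=lzma2', '-mx=9', 'sites.7z', 'sites.wim'], check=True)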
Do the duplicated files have the same names? Are they usually less than 64 MB in size? Then you should sort by file name (without the path), use tar to archive all of the files in that order into a .tar file, and then use xz to compress it into a .tar.xz archive. Duplicated files that are adjacent in the .tar file and smaller than the window size of the chosen xz compression level should compress to almost nothing. You can see the dictionary sizes ("DictSize") for the compression levels in the xz man page; they range from 256 KB to 64 MB.
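A rough sketch of that idea in Python (the directory and output names are placeholders), using the standard tarfile module so the files go into the archive sorted by base name:
import os
import tarfile

src = 'big_folder'                      # placeholder input directory
paths = [os.path.join(root, name)
         for root, _, files in os.walk(src)
         for name in files]
paths.sort(key=os.path.basename)        # duplicates end up adjacent in the tar

# 'w:xz' compresses with xz; preset=9 selects the largest (64 MB) dictionary.
with tarfile.open('big_folder.tar.xz', 'w:xz', preset=9) as tar:
    for p in paths:
        tar.add(p)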
By default, WinRAR compresses each file separately, so there is no real gain when compressing a folder structure with many similar or even identical files.
But there is also the option to create a solid archive. Open the help of WinRAR, open the item Archive types and parameters on the Contents tab, and click on Solid archives. This help page explains what a solid archive is and which advantages and disadvantages this archive format has.
A solid archive with a larger dictionary size in combination with best compression can make an archive of a list of similar files very small. For example, I have a list of 327 binary files with sizes from 22 KB to 453 KB, totalling 47 MB (not counting the cluster size of the partition). I can compress those 327 similar, but not identical, files into a RAR archive with a dictionary size of 4 MB that is only 193 KB. That is of course a dramatic reduction in size.
After reading the help page about solid archives, follow the link to the help page about rarfiles.lst. It describes how you can control the order in which the files are put into a solid archive. This file is located in the program files folder of WinRAR and can of course be customized to your needs.
You also have to take care of the option Files to store without compression when using the GUI version of WinRAR. This option can be found after clicking on the symbol/command Add on the Files tab. It specifies file types that are just stored in the archive without any compression, like *.png, *.jpg, *.zip, *.rar, ... Those files usually already contain data in compressed form, so it does not make much sense to compress them once again. But if duplicate *.jpg files exist in a folder structure and a solid archive is created, it makes sense to remove all file extensions from this option.
A suitable command line with using the console version Rar.exe of WinRAR and with using RAR5 archive file format would be:
"%ProgramFiles%\WinRAR\Rar.exe a -# -cfg- -ep1 -idq -m5 -ma5 -md128 -mt1 -r -s -tl -y -- "%UserProfile%\ArchiveFileName.rar" "%UserProfile%\FolderToArchive\"
The switches used in this example are explained in the manual for Rar.exe, which is the text file Rar.txt in the program files directory of WinRAR. WinRAR.exe can also be used if the switch -idq is replaced by -ibck, as explained in the help of WinRAR on the page Alphabetic switches list (opened via the Help menu, item Help topics, then on the Contents tab expanding Command line mode, then Switches, and clicking on Alphabetic switches list).
By the way: there are applications like Total Commander, UltraFinder or UltraCompare, and many others, which support searching for duplicate files by various user-configurable criteria, such as finding files with the same name and size or, most reliably, finding files with the same size and content, and which provide functions to delete the duplicates.
Try eXdupe from www.exdupe.com; it uses deduplication and is so fast that it's practically disk-I/O bound.
I am working with a system that compresses large files (40 GB) and then stores them in an archive.
Currently I am using libz.a to compress the files with C++ but when I want to get data out of the file I need to extract the whole thing. Does anyone know a compression component (preferably .NET compatible) that can store an index of original file positions and then, instead of decompressing the entire file, seek to what is needed?
Example:
Original File Compressed File
10 - 27 => 2-5
100-202 => 10-19
..............
10230-102020 => 217-298
Since I know the data I need in the file only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.
Does anyone know of a compression library or similar readily available tool that can offer this functionality?
I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem (at least I think so) in a project I am working on, where I had to keep many text articles on a drive and access them in a fairly random manner, and because of the size of the data I had to compress them.
The problem with compressing all of this data at once is that most algorithms depend on previously seen data when decompressing. For example, the popular LZW method builds a dictionary (an instruction for how to decompress the data) on the fly while doing the decompression, so decompressing a stream from the middle is not possible, although I believe those methods might be tuned for it.
The solution I found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project it was simple - each article was one chunk, I compressed them one by one and then created an index file that kept track of where each "chunk" starts. Decompression was easy in that case: just decompress the whole stream, which was the one article that I wanted.
So, my file looked like this:
Index; compress(A1); compress(A2); compress(A3)
instead of
compress(A1;A2;A3).
If you can't split your data in such an elegant manner, you can always try to split the chunks artificially, for example packing the data in 5 MB chunks. So when you need to read data from 7 MB to 13 MB, you just decompress the chunks 5-10 and 10-15.
Your index file would then look like:
0 -> 0
5MB -> sizeof(compress 5MB)
10MB -> sizeof(compress 5MB) + sizeof(compress next 5MB)
The problem with this solution is that it gives a slightly worse compression ratio. The smaller the chunks are, the worse the compression will be.
Also: having many chunks of data doesn't mean you have to have many different files on the hard drive; just pack them one after another in a single file and remember where they start.
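A small sketch of that layout in Python (the names are invented): compress fixed-size chunks with zlib and keep a list of offsets so a single chunk can be located and inflated later.
import zlib

CHUNK = 5 * 1024 * 1024                     # 5 MB of uncompressed data per chunk

def compress_with_index(src_path, dst_path):
    index = []                              # (uncompressed_start, compressed_start) pairs
    pos = 0                                 # position in the uncompressed data
    offset = 0                              # position in the compressed file
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            packed = zlib.compress(data, 9) # each chunk is an independent zlib stream
            index.append((pos, offset))
            dst.write(packed)
            pos += len(data)
            offset += len(packed)
    return index                            # persist this next to the archive

To read a byte range back, look up the chunks covering it in the index, seek to their compressed offsets (the next entry tells you where each chunk ends), and zlib.decompress them individually.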
Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files that you can use for compression, and it is written in C#. It worked pretty nicely for me, and you can use its built-in ability to create many files inside one zip file to take care of splitting the data into chunks.
My line of work requires the use of DICOM files. Each DICOM dataset consists of many .dcm files in a single directory. I am required to send these files over the network, a process which is somewhat slow due to the massive size of the files.
I am also a programmer, and I was wondering what the ideal way to compress such files is. I'm talking about a compression that will be performed on the local computer and later decompressed on the destination computer (that is, the compression is solely for speeding up the over-the-network transfer of the files). Is there a simple way to crop the DICOM files? (The files contain imaging of an entire head, whereas I'm only interested in a small part of the head.)
Thanks!
In a medical context, lossy compression is somewhere between not encouraged and forbidden. If you insist on cropping existing datasets, the standard demands that you at least generate new image and series UIDs. The standard does allow lossless compression in the form of JPEG 2000, but it is quite rare - if I had to bet, I'd say your dataset is uncompressed altogether.
In my experience it is significantly better to compress a medical dataset as a solid archive - that is, to unify all the images into a single stream. This makes a lot of sense, as there is typically a lot of similarity between nearby images, and this is the way to take advantage of that similarity (a unified compression dictionary). This is available as a command-line option for both the rar and gzip compressors.
Solution:
gdcmconv --jpeg uncompressed.dcm compressed.dcm
or for better compression ratio:
gdcmconv --jpegls uncompressed.dcm compressed.dcm
See:
http://gdcm.sourceforge.net/html/gdcmconv.html
I would also recommend against lossy compression; you would need to be a DICOM wizard to do it properly (see the derivation mechanism in the DICOM standard). I would also recommend against cropping the image (you would need to regenerate UIDs, get the Frame of Reference updated, ...).
HTH
You could use something simple like LZMA compression on one end to pack up the files and send them over. This is the easiest solution, since you can grab something like gzip and pack/unpack the files easily programmatically. This may help considerably, because modern computers prefer transmitting/receiving one large file over many small files (a single 1 GB file will transfer much faster than 10,000 100 KB files).
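As a sketch of the pack-everything-into-one-file idea in Python (the paths are placeholders), the standard tarfile module can gzip a whole study directory on the sending side and unpack it on the receiving side:
import tarfile

# Sender: bundle the whole study directory into one compressed file.
with tarfile.open('study.tar.gz', 'w:gz') as tar:
    tar.add('dicom_study_dir')              # placeholder directory of .dcm files

# Receiver: restore the original directory tree.
with tarfile.open('study.tar.gz', 'r:gz') as tar:
    tar.extractall('received')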
As for actually reducing the aggregate size, each .dcm file is probably a slice (if you're looking at something like MRI or CT data), and the viewer you are using reconstructs the slices into the 3d image. Cropping them isn't impossible, but parsing the DICOM format is a bit tricky. I'm not aware of any free programs that will help you parse the DICOM files, but I haven't looked for some time.
Since DICOM is a container format, the image data you are after is usually stored in a common format (such as JPEG), so if you are able to grab the relevant part of the file to extract the image data, you can use any of the loads of image processing tools available to crop the image to whatever dimensions you choose.
We have a compression router called "DICOM Shrinkinator" that can do this as it transmits the study to PACS:
http://fluxinc.ca/medical/dicom-shrinkinator/
edit:
Sorry, I guess my question was vague. I'd like a way to check whether a file is not an image without wasting time loading the whole image, because then I can do the rest of the loading later. I don't want to just check the file extension.
The application just views the images. By 'checking the validity', I meant 'detecting and skipping the non-image files' that are also in the directory. If the pixel data is corrupt, I'd still like to treat it as an image.
I assign page numbers and pair up these images. Some images are the single left or right page. Some images are wide and are the "spread" of the left and right pages. For example, pagesAt(3) and pagesAt(4) could return the same std::pair of images or a std::pair of the same wide image.
Sometimes, there is an odd number of 'thin' images, and the first image is to be displayed on its own, similar to a wide image. An example would be a single cover page.
Not knowing which files in the directory are non-images means I can't confidently assign those page numbers and pair up the files for displaying. Also, the user may decide to jump to page X, and when I later discover and remove a non-image file and reassign page numbers accordingly, page X could appear to be a different image.
original:
In case it matters, I'm using c++ and QImage from the Qt library.
I'm iterating through a directory and using the QImage constructor on the paths to the images. This is, of course, pretty slow and makes the application feel unresponsive. However, it does allow me to detect invalid image files and ignore them early on.
I could just save only the paths to the images while going through the directory and actually load them only when they're needed, but then I wouldn't know if the image is invalid or not.
I'm considering doing a combination of these two. i.e. While iterating through the directory, reading only the headers of the images to check validity and then load image data when needed.
So,
Will just loading the image headers be much faster than loading the whole image? Or does doing a bit of I/O to read the header mean I might as well finish loading the image in full? Later on, I'll be uncompressing images from archives as well, so this also applies to uncompressing just the header vs. uncompressing the whole file.
Also, I don't know how to load/read just the image headers. Is there a library that can read just the headers of images? Otherwise, I'd have to open each file as a stream and write image-header readers for all the file types on my own.
The Unix file tool (which has been around since almost forever) does exactly this. It is a simple tool that uses a database of known file headers and binary signatures to identify the type of the file (and potentially extract some simple information).
The database is a simple text file (which gets compiled for efficiency) that describes a plethora of binary file formats using a simple structured format (documented in man magic). The source is in /usr/share/file/magic (on Ubuntu). For example, the entry for the PNG file format looks like this:
0 string \x89PNG\x0d\x0a\x1a\x0a PNG image
!:mime image/png
>16 belong x \b, %ld x
>20 belong x %ld,
>24 byte x %d-bit
>25 byte 0 grayscale,
>25 byte 2 \b/color RGB,
>25 byte 3 colormap,
>25 byte 4 gray+alpha,
>25 byte 6 \b/color RGBA,
>28 byte 0 non-interlaced
>28 byte 1 interlaced
You could extract the signatures for just the image file types, and build your own "sniffer", or even use the parser from the file tool (which seems to be BSD-licensed).
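For example, a tiny hand-rolled sniffer in Python (the signature list here is deliberately short and would need extending for your real set of file types):
# Leading byte signatures for a few common image formats.
SIGNATURES = {
    b'\x89PNG\r\n\x1a\n': 'png',
    b'\xff\xd8\xff': 'jpeg',
    b'GIF87a': 'gif',
    b'GIF89a': 'gif',
    b'BM': 'bmp',
}

def sniff_image_type(path):
    with open(path, 'rb') as f:
        head = f.read(16)                   # a few bytes are enough for these checks
    for magic, kind in SIGNATURES.items():
        if head.startswith(magic):
            return kind
    return None                             # not a recognized image file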
Just to add my 2 cents: you can use QImageReader to get information about image files without actually loading the files.
For example with the .format method you can check a file's image format.
From the official Qt doc ( http://qt-project.org/doc/qt-4.8/qimagereader.html#format ):
Returns the format QImageReader uses for reading images. You can call this function after assigning a device to the reader to determine the format of the device. For example:
QImageReader reader("image.png");
// reader.format() == "png"
If the reader cannot read any image from the device (e.g., there is no image there, or the image has already been read), or if the format is unsupported, this function returns an empty QByteArray().
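A minimal sketch of that approach (written here in Python with PyQt5 just to keep it short; the C++ calls have the same names), skipping anything QImageReader cannot identify - for most formats this only looks at the header, not the pixel data:
import os
from PyQt5.QtGui import QImageReader

def list_image_pages(directory):
    pages = []
    for name in sorted(os.listdir(directory)):
        reader = QImageReader(os.path.join(directory, name))
        if reader.canRead():                # for most formats this only inspects the header
            # reader.format() and reader.size() also come from the header.
            pages.append(name)
    return pages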
I don't know the answer about just loading the header, and it likely depends on the image type you are trying to load. You might consider using QtConcurrent to go through the images while allowing the rest of the program to continue, if that's possible. In this case, you would probably initially represent all of the entries as being in an unknown state, and then change each one to image or not-an-image when its verification is done.
If you're talking about image files in general, and not just a specific format, I'd be willing to bet there are cases where the image header is valid but the image data isn't. You haven't said anything about your application; is there no way you could add a background thread that keeps a few images in RAM and swaps them in and out depending on what the user may load next? E.g., a slide-show app would load one or two images ahead of and behind the current one. Or maybe display a question mark next to the image name until the background thread can verify the validity of the data.
While opening and reading the header of a file on a local filesystem should not be too expensive, it can be expensive if the file is on a remote (networked) file system. Even worse, if you are accessing files saved with hierarchical storage management, reading the file can be very expensive.
If this app is just for you, then you can decide not to worry about those issues. But if you are distributing your app to the public, reading the file before you absolutely have to will cause problems for some users.
Raymond Chen wrote an article about this for his blog The Old New Thing.