I received a sample of 1,000 fingerprints in WSQ format (100 people with 10 fingerprints each).
I know that WSQ is already a compressed format. My question is: is there a way to compress this set again?
The compression algorithm should exploit patterns shared across the fingerprint files.
Thanks a lot!
This is quite unlikely.
A compression algorithm applied directly to the WSQ files may find some repetition in the file structure, most likely in the headers. That might yield a few percent of gain, but no more.
In any case, there's no harm in testing this hypothesis with a simple compression program in "solid" mode, such as 7-Zip.
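To make the test concrete, here is a minimal sketch in Python of the solid-versus-individual comparison. The "WSQ" files are simulated: a shared header plus incompressible unique payloads (the header bytes and sizes are invented for illustration).

```python
import lzma
import os

# Simulated fingerprint files: a structure shared by all files,
# followed by unique, essentially incompressible image data.
header = b"WSQ-HEADER-FIELDS" * 12
files = [header + os.urandom(4000) for _ in range(10)]

# Compressing each file separately, as a non-solid archive would:
individual = sum(len(lzma.compress(f)) for f in files)

# Compressing the concatenation, as 7-Zip's solid mode effectively does,
# lets the compressor reuse the header it saw in earlier files:
solid = len(lzma.compress(b"".join(files)))

print(f"individual: {individual} bytes, solid: {solid} bytes")
```

Because the unique image data dominates, the overall gain should indeed be limited to a few percent, as estimated above.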
Yes, a reference implementation of a WSQ encoder (and decoder) is provided in the NBIS package:
http://www.nist.gov/itl/iad/ig/nbis.cfm
I'm currently reverse engineering a firmware image that seems to be compressed, but I'm having a really hard time identifying which algorithm it uses.
I have the original uncompressed data dumped from the flash chip; below is some of the human-readable data, uncompressed vs. (supposedly) compressed:
You can get the binary portion here, in case it helps: Link
From what I can tell, it might be using a Lempel-Ziv variant such as LZO, LZF, or LZ4.
gzip and zlib can be ruled out, because their output would contain very little to no human-readable data.
I did try compressing the dumped data with each of the Lempel-Ziv variants mentioned above using their respective Linux CLI tools, but none of them produced output matching the "compressed" data exactly.
Another idea is to try decompressing the data with each algorithm and see what comes out. But this is difficult due to the lack of headers in the compressed firmware. (Binwalk and signsrch both detected nothing.)
Any suggestion on how I can proceed?
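As a sketch of the brute-force decompression idea: headerless block formats can only be found by attempting decompression at every candidate offset and keeping the offsets that decode cleanly. The example below uses raw deflate via Python's stdlib as a stand-in (LZ4/LZO/LZF would need third-party bindings, but the same scan-every-offset approach applies to their raw block formats); the synthetic blob and payload are invented for the demo.

```python
import zlib

def scan_for_streams(blob: bytes, min_out: int = 32):
    """Attempt raw-deflate decompression at every offset; report hits."""
    hits = []
    for off in range(len(blob)):
        d = zlib.decompressobj(-15)  # negative wbits = headerless deflate
        try:
            out = d.decompress(blob[off:])
        except zlib.error:
            continue
        if len(out) >= min_out:
            hits.append((off, len(out)))
    return hits

# Demo on synthetic data: junk bytes surrounding an embedded raw-deflate
# stream (zlib.compress output with its 2-byte header and 4-byte
# checksum stripped off).
payload = b"Lorem ipsum dolor sit amet. " * 20
blob = b"\x00" * 64 + zlib.compress(payload)[2:-4] + b"\xff" * 64
found = scan_for_streams(blob)
print(found[:3])
```

Short decodes and false positives are expected; filtering hits by minimum output length (as `min_out` does here) and eyeballing the decoded bytes usually narrows it down quickly.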
I'm trying to work out if there is a compression algorithm that can be trained beforehand, where you can use the trained data to compress and decompress data.
I don't know exactly how compression algorithms work, but I have an inkling that this is possible.
For example, if I compress these lines independently, they wouldn't compress very well:
banana: 1, tree: 2, frog: 3
banana: 7, tree: 9, elephant: 10
If I train the compression algorithm with 100's of sample lines beforehand, it would compress very well because it already has a way of mapping "banana" into a code/lookup value.
Pseudocode to help explain my question:
# Compressing side
rip = Rip()
trained = rip.train(data) # once off
send_trained_data_to_clients(trained)
compressed = rip.compress(data)
# And on the other end
rip = Rip()
rip.load_train_data(trained)
data = rip.decompress(compressed)
Is there a common compression algorithm (i.e., one with libraries for popular languages) that lets me do this?
What you are describing, in the parlance of most compression algorithms, would be a preset dictionary for the compressor.
I can't speak for all compression libraries, but zlib definitely supports this -- in the exact way you're imagining -- via the deflateSetDictionary() and inflateSetDictionary() functions. See the zlib manual for details.
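For example, Python's zlib bindings expose those functions through the `zdict` parameter, so the pseudocode's workflow might look like this. The dictionary contents here are invented "training" data built from strings common to the sample lines:

```python
import zlib

# Preset dictionary built (by hand, for this sketch) from the strings
# that recur across the training samples.
dictionary = b"elephant: frog: tree: banana: "

def compress_with_dict(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    # The decompressor must be given the very same dictionary.
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()

line = b"banana: 7, tree: 9, elephant: 10"
plain = zlib.compress(line, 9)
with_dict = compress_with_dict(line)
print(len(plain), len(with_dict))
```

On short records like these, the dictionary version comes out noticeably smaller, because the recurring words are encoded as back-references into the preset dictionary instead of literals.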
It exists and it is called Lempel-Ziv coding; you can read more here:
http://en.wikipedia.org/wiki/LZ77_and_LZ78
It's one of several 'dictionary'-type lossless compression methods.
LZ is essentially what your zip archiver does.
I'm working on a library to work with Mobipocket-format ebook files, and I have LZ77-style PalmDoc decompression and compression working. However, PalmDoc is only one of the two types of text compression currently used on ebooks in the wild, the other being Dictionary Huffman, aka huffcdic.
I've found a couple of implementations of the huffcdic decoding algorithm, but I'd like to be able to compress to the same format, and so far I haven't been able to find any examples of how to do that yet. Has someone else already figured this out and published the code?
I have been trying to use http://bazaar.launchpad.net/~kovid/calibre/trunk/view/head:/src/calibre/ebooks/compression/palmdoc.c but compression doesn't produce identical results; there are 3-4 discrepancies. I also read one related thread: LZ77 compression of palmdoc.
I have written a Java program for compression. I have compressed some text files, and the file size was reduced after compression. But when I tried to compress a PDF file, I did not see any change in file size.
So I want to know which other kinds of files will not shrink when compressed.
Thanks
Sunil Kumar Sahoo
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
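A quick way to see this: compress highly redundant text and equally sized random bytes (a stand-in for already-compressed data, such as the JPEGs inside a PDF) with zlib:

```python
import os
import zlib

# Repetitive text is full of redundancy; random bytes have none.
text = b"the quick brown fox jumps over the lazy dog\n" * 1000
random_bytes = os.urandom(len(text))

# The text collapses to a tiny fraction of its size; the random data
# actually grows slightly, because of the container overhead.
print(len(zlib.compress(text)), "vs", len(text))
print(len(zlib.compress(random_bytes)), "vs", len(random_bytes))
```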
jpeg/gif/avi/mpeg/mp3 and other already-compressed files won't change much after compression. You may see a small decrease in file size.
Compressed files will not reduce their size after compression.
Five years later, I have at least some real statistics to back this up.
I've generated 17,439 multi-page PDF files with PrinceXML, totalling 4858 MB. A zip -r archive pdf_folder gives me an archive.zip that is 4542 MB. That's 93.5% of the original size, so it's not worth it to save space.
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
PDF files are already compressed. They use the following compression algorithms:
LZW (Lempel-Ziv-Welch)
FLATE (ZIP, in PDF 1.2)
JPEG and JPEG2000 (JPEG2000 since PDF 1.5)
CCITT (the facsimile standard, Group 3 or 4)
JBIG2 (since PDF 1.4)
RLE (Run-Length Encoding)
Depending on which tool created the PDF and its version, different types of compression are used. You can compress it further using a more efficient algorithm, or lose some quality by converting images to low-quality JPEGs.
There is a great link on this here
http://www.verypdf.com/pdfinfoeditor/compression.htm
Files encrypted with a good algorithm like IDEA or DES in CBC mode no longer compress, regardless of their original content. That's why encryption programs compress first and only then encrypt.
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
Media files don't tend to compress well. JPEG and MPEG files won't compress, though you may be able to squeeze a little out of .png files.
Files that are already compressed usually can't be compressed any further. For example mp3, jpg, flac, and so on.
You could even end up with files that are bigger because of the added header of the re-compressed file.
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and other data that are already compressed, so they will not compress much further. Your algorithm can probably eke out only meagre savings, if any, from the text strings contained in the PDF.
Simple answer: compressed files (otherwise we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression, and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
You can add all Office 2007 file formats to the list (of #waqasahmed):
Since the Office 2007 .docx and .xlsx (etc) are actually zipped .xml files, you also might not see a lot of size reduction in them either.
Truly random data
An approximation thereof, made by a cryptographically strong hash function or cipher, e.g.:
AES-CBC(any input)
"".join(map(b2a_hex, [md5(str(i)) for i in range(...)]))
Any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, without collisions (because the compression must be lossless and reversible), a possibility the pigeonhole principle excludes.
So there are infinitely many files that do NOT shrink after compression and, moreover, a file doesn't need to have high entropy for this to happen :)
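The counting behind the pigeonhole argument can be checked directly:

```python
# There are 2**L bitstrings of length exactly L, but only 2**L - 1
# bitstrings of all lengths 0 .. L-1 combined, so no lossless scheme
# can map every length-L input to a strictly shorter output without
# a collision.
L = 16
exactly_L = 2 ** L
strictly_shorter = sum(2 ** k for k in range(L))
print(exactly_L, strictly_shorter)
```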
How do you programmatically compress a WAV file to another format (PCM, 11,025 Hz sampling rate, etc.)?
I'd look into Audacity... I'm pretty sure it doesn't have a command-line utility that can do it, but it may have a library...
Update:
It looks like they use libsndfile, which is released under the LGPL. I, for one, would probably just try using that.
Use sox (Sound eXchange : universal sound sample translator) in Linux:
SoX is a command line program that can convert most popular audio files to most other popular audio file formats. It can optionally change the audio sample data type and apply one or more sound effects to the file during this translation.
If you mean how do you compress the PCM data to a different audio format then there are a variety of libraries you can use to do this, depending on the platform(s) that you want to support. If you just want to change the sample rate of the PCM data then you need a sample rate conversion algorithm instead, which is a completely different problem. Can you be more specific in your requirements?
You're asking about resampling, and more specifically downsampling, not compression. While both processes are lossy (meaning that you will suffer loss of information), downsampling works on raw samples instead of in the frequency domain.
If you are interested in doing compression, then you should look into lame or OGG vorbis libraries; you are no doubt familiar with MP3 and OGG technology, though I have a feeling from your question that you are interested in getting back a PCM file with a lower sampling rate.
In that case, you need a resampling library, of which there are a few possibilities. The most widely known is libsamplerate, which I honestly would not recommend due to quality issues not only within the generated audio files, but also of the stability of the code used in the library itself. The other non-commercial possibility is sox, as a few others have mentioned. Depending on the nature of your program, you can either exec sox as a separate process, or you can call it from your own code by using it as a library. I personally have not tried this approach, but I'm working on a product now where we use sox (for upsampling, actually), and we're quite happy with the results.
The other option is to write your own sample rate conversion library, which can be a significant undertaking, but, if you are only interested in converting by an integer factor (i.e., from 44.1 kHz to 22.05 kHz, or from 44.1 kHz to 11.025 kHz), then it is actually very easy, since you only need to keep every Nth sample.
In Windows, you can make use of the Audio Compression Manager to convert between files (the acm... functions). You will also need a working knowledge of the WAVEFORMAT structure, and WAV file formats. Unfortunately, to write all this yourself will take some time, which is why it may be a good idea to investigate some of the open source options suggested by others.
I have written my own open-source .NET audio library called NAudio that can convert WAV files from one format to another, making use of the ACM codecs that are installed on your machine. I know you have tagged this question with C++, but if .NET is acceptable then this may save you some time. Have a look at the NAudioDemo project for an example of converting files.