Use LZMA to encode a stream of information - compression

The professor gave me a research paper that shows a way to efficiently compress some kind of data.
It's not worth explaining the full algorithm, since the question is not about that; I'll just introduce a little example that should let you understand what the real question is about.
Our compression algorithm has its own dictionary, which is a table (no matter how it is calculated, just assume that both compressor and decompressor have it); each table row holds a string.
To compress a message, the compressor reads it from the beginning and searches for a match in the dictionary. If it finds one, it sends a MATCH message with the row id; if nothing is found, it sends a SET message carrying the data to set.
Note that a MATCH does not have to be a complete match: it can be followed by any number of MISMATCH messages, each containing the offset of a wrong byte and its correct value.
So for example the compressor might want to encode:
Now, in the paper they say that they entropy-encode this "stream" of messages using LZMA, and they treat it as a trivial step without giving further details.
I've searched online but I didn't come up with anything. Do you have any idea on how this last step could be done? Do you have any reference?

There is a stream compression algorithm with a preset dictionary, using LZMA, as part of the open-source project Zip-Ada. The preset dictionary is called "training data" there.
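To make that last step concrete: the usual approach is to serialize the MATCH/SET/MISMATCH messages into a flat byte stream and then hand the whole buffer to a general-purpose LZMA encoder (liblzma, the LZMA SDK, Zip-Ada's encoder) in a single call; LZMA then does the entropy coding for you, which is presumably why the paper treats it as trivial. Below is a minimal, untested C++ sketch of such a serialization - the opcode values and field layout are my own assumptions, not anything from the paper:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical opcodes; the paper does not specify a wire format.
enum : uint8_t { OP_MATCH = 0, OP_SET = 1, OP_MISMATCH = 2 };

// Append a 32-bit value in little-endian order.
static void put_u32(std::vector<uint8_t>& out, uint32_t v) {
    for (int i = 0; i < 4; ++i) out.push_back(uint8_t(v >> (8 * i)));
}

void emit_match(std::vector<uint8_t>& out, uint32_t row_id) {
    out.push_back(OP_MATCH);
    put_u32(out, row_id);                     // dictionary row that matched
}

void emit_mismatch(std::vector<uint8_t>& out, uint32_t offset, uint8_t correct) {
    out.push_back(OP_MISMATCH);
    put_u32(out, offset);                     // position of the wrong byte
    out.push_back(correct);                   // its correct value
}

void emit_set(std::vector<uint8_t>& out, const std::string& literal) {
    out.push_back(OP_SET);
    put_u32(out, uint32_t(literal.size()));
    out.insert(out.end(), literal.begin(), literal.end());
}

The resulting std::vector<uint8_t> is then compressed as one opaque block; with liblzma, for example, a single lzma_easy_buffer_encode call is enough. Using fixed-width fields keeps the stream byte-aligned and very regular, which LZMA tends to compress well.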

How to read/restore big data file (SEGY format) with C/C++?

I am working on a project which needs to deal with large seismic data of SEGY format (from several GB to TB). This data represents the 3D underground structure.
Data structure is like:
1st trace, 2,3,5,3,5,....,6
2nd trace, 5,6,5,3,2,....,3
3rd trace, 7,4,5,3,1,....,8
...
What I want to ask is: in order to read and deal with the data fast, do I have to convert the data into another form, or is it better to read from the original SEGY file? And is there any existing C package to do that?
If you need to access it multiple times and
if you need to access it randomly and
if you need to access it fast
then load it to a database once.
Do not reinvent the wheel.
When dealing with data of that size, you may not want to convert it into another form unless you have to - though some software does just that. I found a list of free geophysics software on Wikipedia that looks promising; many of the packages are open source and read/write SEGY files.
Since you are a newbie to programming, you may want to consider if the Python library segpy suits your needs rather than a C/C++ option.
Several GB is rather medium-sized, if we are talking about post-stack data.
You may use SEGY and convert on the fly, or you may invent your own format; it depends on what you need to do. Without changing the SEGY format, it's enough to create indexes to the traces. If the SEGY is saved as inlines, access through inlines is faster, although crossline access is not too bad either.
If it is 3D seismic, the best way to get equally quick access to all inlines/crosslines is to use your own format based on bricks, e.g. blocks of 8x8 traces - loading the relevant bricks and selecting traces, access time can be very quick, 2-3 seconds. Alternatively you may use an SSD, or have about 2.5x the size of your SEGY in RAM.
To quickly access time slices you have two options - 3D bricks, or a second copy of the file stored as time slices (the quickest way). I did something like this 10 years ago - access time into a 12 GB SEGY was acceptable, 2-3 seconds in all three directions.
SEGY in database? Wow ... ;)
The answer depends upon the type of data you need to extract from the SEG-Y file.
If you need to extract only the headers (textual header, binary header, extended textual file headers and trace headers), they can easily be extracted from the SEG-Y file by opening it as binary and reading the relevant information from the byte locations given in the data exchange format specification (rev 2). The extraction may depend on the type of data (post-stack or pre-stack). Also, some headers may require conversion from one format to another (e.g. the textual headers are usually encoded in EBCDIC). The complete details about byte locations and encoding formats can be found in the documentation mentioned above.
The extraction of trace data is a bit trickier and depends on various factors such as the encoding, whether the number of trace samples is given in the trace headers, etc. A careful reading of the documentation, and knowing what type of SEG-Y data you are working with, will make this task a lot easier.
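As a rough, untested illustration of the two steps above, here is what reading a few binary header fields and walking the traces might look like in C++, assuming a plain SEG-Y rev 1 layout with big-endian fields, 4-byte samples and no extended textual headers:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>

// Read a big-endian 16-bit integer from a raw byte buffer.
static uint16_t be16(const unsigned char* p) {
    return static_cast<uint16_t>(p[0] << 8 | p[1]);
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream f(argv[1], std::ios::binary);

    // Skip the 3200-byte EBCDIC textual header, then read the 400-byte binary header.
    unsigned char bin[400];
    f.seekg(3200);
    f.read(reinterpret_cast<char*>(bin), sizeof bin);

    // Byte positions follow the SEG-Y rev 1 layout (1-based within the file).
    uint16_t dt      = be16(bin + (3217 - 3201));  // sample interval, microseconds
    uint16_t ns_file = be16(bin + (3221 - 3201));  // samples per trace
    uint16_t format  = be16(bin + (3225 - 3201));  // 1 = IBM float, 5 = IEEE float
    std::cout << "dt=" << dt << " ns=" << ns_file << " format=" << format << "\n";

    // Walk the traces: each is a 240-byte header followed by ns samples.
    // Bytes 115-116 of the trace header hold the per-trace sample count.
    std::size_t traces = 0;
    unsigned char trc[240];
    while (f.read(reinterpret_cast<char*>(trc), sizeof trc)) {
        uint16_t ns = be16(trc + 114);
        if (ns == 0) ns = ns_file;
        // Skip the samples here; converting IBM floats is a separate step.
        f.seekg(std::streamoff(ns) * 4, std::ios::cur);
        ++traces;
    }
    std::cout << "traces=" << traces << "\n";
}

Handling IBM float conversion, variable trace lengths and extended headers is exactly the part where a library like segpy saves you work.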
Since you are working with the extracted data, I would recommend using already existing libraries (segpy is one of the best Python libraries I came across). There are also numerous freely available SEG-Y readers; a very nice list has already been mentioned by Daniel Waechter. You can choose whichever one suits your requirements and the file format it supports.
I recently tried to do something similar using C++ (although it has only been tested on post-stack data). The project can be found here.

Parsing deleted pdfs

I'm trying to do some file carving on a disk with C++. I can't find any resources on the web related to the on-disk structure of a PDF file. The thing is that I can find the %PDF-1.x token at the start of a cluster, but I can't find the size of the PDF file anywhere.
Let's say hypothetically that the file system entry for this particular document is lost. I find the start of the document and I keep reading until I run into the "startxref number %%EOF". The thing is that I don't know when to stop since there are multiple "%%EOF" markers in the content of a document.
I've tried stopping after reading, let's say, 10 clusters without finding any PDF-specific keyword like "obj", "stream", "trailer" or "xref" anywhere. But that's quite arbitrary and not a deterministic way of finding the end of the document so that I can determine its size.
I've also seen some "Length number" markers at the start of some "obj"s but the number doesn't really fit most of the time.
Any ideas on what I can try next? Is there a way to determine the exact size of the entire document? I'm interested in recovering documents programmatically.
Since PDFs are "free format" (pretty much like text files, but less obvious for humans when it comes to "reading" the content), it's probably hard to piece them together if the pieces aren't in order.
A stream does have a length, which is the key to where its endstream goes (with a line break before and after the stream data itself). Streams are used to embed bitmaps and similar things [fonts, line-art data in compressed form, etc.] into the document. But if you have several 4 KB segments that could all be the same block in the middle of a stream, there's no way to tell which order they go in, other than pasting them together and seeing which combinations look sane and which don't. Similarly, if there are several segments of streams and objects, you can't really tell which goes where.
Of course, this applies to almost all types of files with "variable content" - you can find the first few kilobytes of a JPG, but knowing what the rest of the file is won't be easy; only by visually inspecting the content can you determine which blocks of bytes belong where, and if you get it wrong, you'll probably just get some random garbage.
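Coming back to the stream length mentioned above, here is a rough, untested C++ sketch of how a carver could use a direct /Length value to jump over the stream data to its endstream; indirect /Length references ("12 0 R") and fragmented clusters will defeat it, so treat it purely as a heuristic:

#include <cctype>
#include <cstddef>
#include <string>

// Given a byte buffer and the position of an "obj" keyword, try to use the
// /Length entry of a stream dictionary to jump straight to its "endstream".
// Returns the offset just past "endstream", or std::string::npos on failure.
std::size_t skip_stream(const std::string& buf, std::size_t obj_pos) {
    std::size_t len_pos = buf.find("/Length", obj_pos);
    std::size_t stream_pos = buf.find("stream", obj_pos);
    if (len_pos == std::string::npos || stream_pos == std::string::npos ||
        len_pos > stream_pos)
        return std::string::npos;

    // Parse the integer after /Length; give up if it is an indirect reference.
    std::size_t p = len_pos + 7;
    while (p < buf.size() && std::isspace(static_cast<unsigned char>(buf[p]))) ++p;
    std::size_t q = p;
    while (q < buf.size() && std::isdigit(static_cast<unsigned char>(buf[q]))) ++q;
    if (q == p) return std::string::npos;
    std::size_t length = std::stoul(buf.substr(p, q - p));

    // Stream data begins after the end-of-line following the "stream" keyword.
    std::size_t data = stream_pos + 6;
    if (data < buf.size() && buf[data] == '\r') ++data;
    if (data < buf.size() && buf[data] == '\n') ++data;

    std::size_t end = buf.find("endstream", data + length);
    return end == std::string::npos ? end : end + 9;
}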
The open source tool bulk_extractor has a module called scan_pdf that does pretty much what you are describing here. It can recognize the individual parts of a PDF file on a drive, automatically decompresses the compressed regions, and extracts text using two strategies. It will recover data from fragments of PDFs even if the xref table cannot be found.

String Compression Algorithm

I've been wanting to compress a string in C++ and display its compressed state on the console. I've been looking around and can't find anything appropriate so far. The closest thing I've come to finding is this one:
How to simply compress a C++ string with LZMA
However, I can't find the lzma.h header that it works with anywhere.
Basically, I am looking for a function like this:
std::string compressString(std::string uncompressedString){
//Compression Code
return compressedString;
}
The compression algorithm choice doesn't really matter. Can anybody help me find something like this? Thank you in advance! :)
Based on the pointers in the article I'm fairly certain they are using XZ Utils, so download that project and make the produced library available in your project.
However, two caveats:
dumping a compressed string to the console isn't very useful, as that string will contain all possible byte values, most of which aren't displayable on a console
compressing short strings (actually any small quantity of data) isn't what most general-purpose compressors were designed for; in many cases the compressed result will be as big as or even bigger than the input. However, I have no experience with LZMA on small quantities of data, so an extensive test with data representative of your use case will tell you whether it works as expected.
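For what it's worth, here is a rough sketch of such a function using liblzma, the library behind XZ Utils (header lzma.h, link with -llzma); it is a starting point under the caveats above, not production code:

#include <lzma.h>
#include <stdexcept>
#include <string>

// Compress a string into an .xz container with liblzma's one-shot buffer API.
std::string compressString(const std::string& uncompressed) {
    std::string out;
    out.resize(lzma_stream_buffer_bound(uncompressed.size()));   // worst-case size

    size_t out_pos = 0;
    lzma_ret rc = lzma_easy_buffer_encode(
        6 /* preset */, LZMA_CHECK_CRC32, nullptr,
        reinterpret_cast<const uint8_t*>(uncompressed.data()), uncompressed.size(),
        reinterpret_cast<uint8_t*>(&out[0]), &out_pos, out.size());
    if (rc != LZMA_OK)
        throw std::runtime_error("lzma_easy_buffer_encode failed");

    out.resize(out_pos);   // shrink to the actual compressed size
    return out;
}

If you do want to show the result on the console, hex- or Base64-encode it first; printing the raw bytes will just produce garbage, as noted in the first caveat.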
One algorithm I've been playing with that gives good compression on small amounts of data (tested on data chunks sized 300-500 bytes) is range encoding.

Extract and analyse sound from mp3 files

I have a set of mp3 files, some of which have extended periods of silence or periodic intervals of silence. How can I programmatically detect this?
I am looking for a library in C++, or preferably C#, that will allow me to examine the sound content of these files for the silences.
EDIT: I should elaborate on what I am trying to achieve. I am capturing streaming sports commentary using VLC and saving it to MP3. When a game is delayed or cancelled, the streaming commentary is replaced by a repetitive message saying commentary is not available. By looking for these periodic silences (or total silence), I can detect that there is no commentary and stop the streaming recording.
For this reason I am reluctant to decompress the MP3, because it would mean my test for these silences would be very slow. Unless I can decode just the last 5 minutes of the file?
Thanks
Andrew
I'm not aware of a library that will detect silence directly in the MP3-encoded data, since it's not a trivial task to detect silence without first decompressing. Luckily, it's easy to find libraries that decode MP3 files and access them as PCM data, and it's trivial to detect silence in PCM data. Here is one such library for C# I found, but I'm sure there are tons: http://www.robburke.net/mle/mp3sharp/
Once you decode the data, you will have a list of PCM samples. In its most basic form, the algorithm you need to detect silence is simply to analyze small chunks (as little as 0.25 s or as much as several seconds) and make sure that the absolute value of each sample in the chunk is below a threshold. The threshold determines how 'quiet' the sound has to be to count as silence, and the chunk size determines how long the volume needs to stay below that threshold to count as silence. (If you go with very short chunks, you will get lots of false positives from samples near zero crossings, but 0.25 s or longer should be fine.) There are improvements to the basic approach, such as using hysteresis (essentially two thresholds: one for the transition into silence and one for the transition out of it) and filtering.
Unfortunately, I don't know offhand of a C++ or C# library that implements level detection, and nothing immediately springs up on Google, but at least the simple version is pretty easy to code.
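To sketch that simple version (in C++ here, but the logic ports directly to C#) - the 0.25 s chunk size and the threshold of 500 are arbitrary values you would tune for your streams, and the hysteresis/filtering refinements are left out:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Longest run of silence, in seconds, in a buffer of decoded 16-bit PCM samples.
// A 0.25 s chunk counts as silent when no sample exceeds |threshold|.
double longestSilence(const std::vector<int16_t>& samples, int sample_rate,
                      int16_t threshold = 500) {
    const std::size_t chunk = std::max<std::size_t>(1, sample_rate / 4);
    std::size_t run = 0, best = 0;
    for (std::size_t start = 0; start < samples.size(); start += chunk) {
        std::size_t end = std::min(samples.size(), start + chunk);
        bool quiet = true;
        for (std::size_t i = start; i < end; ++i)
            if (std::abs(samples[i]) > threshold) { quiet = false; break; }
        run = quiet ? run + 1 : 0;       // count consecutive silent chunks
        best = std::max(best, run);
    }
    return best * 0.25;                  // chunks are a quarter second each
}

For the commentary-outage use case you would then flag the recording when longestSilence(...) exceeds, say, 30 seconds.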
Edit: Also, this library seems interesting: http://naudio.codeplex.com/
Also, while not a true duplicate question, the answers here will be useful for you:
Detecting audio silence in WAV files using C#

How to concat two or more gzip files/streams

I want to concat two or more gzip streams without recompressing them.
I mean I have A compressed to A.gz and B to B.gz, I want to compress them to single gzip (A+B).gz without compressing once again, using C or C++.
Several notes:
Even though you can just concatenate two files and gunzip will know how to deal with the result, most programs are not able to deal with two chunks.
I once saw example code that does this just by decompressing the files and then manipulating the originals; this is significantly faster than normal re-compression, but it still requires O(n) CPU work.
Unfortunately I can't find that example again (concatenation using decompression only); if someone can point me to it I would be grateful.
Note: it is not a duplicate of this, because the proposed solution does not fit my needs.
Clarification edit:
I want to concatenate several compressed HTML pieces and send them to the browser as one page, in response to "Accept-Encoding: gzip", with the response header "Content-Encoding: gzip".
If the streams are concatenated as simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything, and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it does not decompress at all).
Only Opera handles this well.
So I need to create a single gzip stream of several chunks and send them without re-compressing.
Update: I have found gzjoin.c in the zlib examples; it does this using only decompression. The problem is that decompression is still slower than a simple memcpy.
It is still about 4 times faster than the fastest gzip compression, but that is not enough.
What I need is to figure out which data I would have to save alongside each gzip file in order to skip the decompression step entirely, and how to collect that data during the original compression.
Look at RFC 1951 and RFC 1952.
The format is simply a series of members, each composed of three parts: a header, the data and a trailer. The data part is itself a sequence of blocks, each block having its own header and data.
To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-block flag, for instance) and the trailer correctly, and copy the data parts.
There is a problem: the trailer contains a CRC32 of the uncompressed data, and I'm not sure whether this is easy to compute when you only know the CRCs of the parts.
Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things that still need the decompression.
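On the CRC32 point specifically: zlib can combine the CRCs of the parts without looking at the data at all, via crc32_combine(). A hedged sketch of building the trailer of the joined file from the two original trailers (CRC32 then ISIZE, both little-endian) is below; the part that still needs gzjoin-style bit surgery - clearing the BFINAL flag of the first member's last deflate block - is not attempted here:

#include <cstdint>
#include <zlib.h>   // link with -lz

// crc_a/len_a and crc_b/len_b come straight from the 8-byte trailers of the
// two original members (CRC32 then ISIZE, both stored little-endian).
void make_joined_trailer(uint32_t crc_a, uint32_t len_a,
                         uint32_t crc_b, uint32_t len_b,
                         unsigned char trailer[8]) {
    // Combine the two CRCs without decompressing anything.
    uint32_t crc = static_cast<uint32_t>(crc32_combine(crc_a, crc_b, len_b));
    uint32_t isize = len_a + len_b;          // total uncompressed size mod 2^32
    for (int i = 0; i < 4; ++i) {
        trailer[i]     = (crc   >> (8 * i)) & 0xff;
        trailer[4 + i] = (isize >> (8 * i)) & 0xff;
    }
}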
The gzip manual says that two gzip files can be concatenated as you attempted.
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
So it appears that the other tools may be broken, as seen in this bug report.
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263
Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.
As others have mentioned you may be able to perform surgery:
http://www.gzip.org/zlib/rfc-gzip.html
And this requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can be easily calculated by adding the lengths of the individual sub-files.
At the bottom of the last link, there is code for calculating a running CRC-32, named update_crc.
Calculating the CRC on the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.
Please note that (1) the gzjoin.c approach is very likely the best answer you could get to your question as stated, and (2) it is complicated micro-surgery performed by one of the gzip originators, and may not have been subjected to extensive stress testing.
Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them and compress the result. Note that the compression ratio may be better than the one obtained by gluing together small compressed pieces.
If tarring them is not out of the question (since the linked cat solution isn't viable for you):
tar cf A_B.gz.tar A.gz B.gz
Then, to get them back:
tar xf A_B.gz.tar