Any seekable compression library? - compression

I'm looking for a general compression library that supports random access during decompression. I want to compress Wikipedia into a single compressed format while still being able to decompress/extract individual articles from it.
Of course, I could compress each article individually, but that wouldn't give much of a compression ratio. I've heard an LZO-compressed file consists of many chunks which can be decompressed separately, but I haven't found API documentation for that. I can also use the Z_FULL_FLUSH mode in zlib, but is there a better alternative?
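For reference, here is a rough sketch (assuming zlib's streaming API) of what I mean by the Z_FULL_FLUSH approach: compress fixed-size chunks, emit a full flush at each chunk boundary, and record the compressed offset of each chunk so decompression can later restart there (at a full-flush point the next chunk begins a fresh, byte-aligned deflate block):
#include <zlib.h>
#include <cstdio>
#include <vector>

// Returns the compressed offset at which each uncompressed chunk starts.
std::vector<long> compress_with_seek_points(std::FILE* in, std::FILE* out, size_t chunk = 1 << 20)
{
    std::vector<long> seek_points;
    std::vector<unsigned char> inbuf(chunk), outbuf(compressBound(chunk));
    z_stream zs{};                                  // zalloc/zfree/opaque = Z_NULL
    deflateInit(&zs, Z_DEFAULT_COMPRESSION);
    size_t n;
    while ((n = std::fread(inbuf.data(), 1, chunk, in)) > 0) {
        seek_points.push_back(std::ftell(out));     // restart point for this chunk
        zs.next_in   = inbuf.data();
        zs.avail_in  = static_cast<uInt>(n);
        zs.next_out  = outbuf.data();
        zs.avail_out = static_cast<uInt>(outbuf.size());
        deflate(&zs, Z_FULL_FLUSH);                 // resets the dictionary at the boundary
        std::fwrite(outbuf.data(), 1, outbuf.size() - zs.avail_out, out);
    }
    zs.next_out  = outbuf.data();
    zs.avail_out = static_cast<uInt>(outbuf.size());
    deflate(&zs, Z_FINISH);
    std::fwrite(outbuf.data(), 1, outbuf.size() - zs.avail_out, out);
    deflateEnd(&zs);
    return seek_points;                             // store this index alongside the file
}
To extract one chunk I would seek to its recorded offset and inflate just that chunk (with raw-deflate windowBits for the non-initial chunks), which is the bookkeeping I'd prefer a library to manage for me.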

xz-format files support an index, though by default the index is not useful for seeking (the whole file is typically compressed as a single block). My compressor, pixz, creates files that do contain a useful index. You can use the functions in the liblzma library to find which block of xz data corresponds to which location in the uncompressed data.
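As a rough sketch of how a lookup against the index works with liblzma (assuming you have already decoded the file's index into a lzma_index*, e.g. via lzma_index_decoder after reading the stream footer), something like this locates the block that contains a given uncompressed offset:
#include <lzma.h>
#include <cstdint>
#include <cstdio>

void locate_block(const lzma_index* idx, uint64_t uncompressed_target)
{
    lzma_index_iter iter;
    lzma_index_iter_init(&iter, idx);
    if (lzma_index_iter_locate(&iter, uncompressed_target)) {   // true => past end of stream
        std::fprintf(stderr, "offset past end of stream\n");
        return;
    }
    // Seek to iter.block.compressed_file_offset in the .xz file, decode that one
    // block (lzma_block_header_decode + lzma_block_decoder), then skip forward by
    // (uncompressed_target - iter.block.uncompressed_file_offset) bytes.
    std::printf("block starts at compressed offset %llu, uncompressed offset %llu\n",
                (unsigned long long)iter.block.compressed_file_offset,
                (unsigned long long)iter.block.uncompressed_file_offset);
}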

For seekable compression built on gzip, there are dictzip from the dict server and sgzip from The Sleuth Kit.
Note that you can't write to either of these formats; seekable access applies to reading only anyway.

DotNetZip is a zip archive library for .NET.
Using DotNetZip, you can randomly reference particular entries in the zip, decompress them out of order, and get a stream that decompresses an entry as you read it.
With the benefit of those features, DotNetZip has been used within the implementation of a Virtual Path Provider for ASP.NET, which does exactly what you describe: it serves all the content for a particular website from a compressed ZIP file. You can also serve dynamic (ASP.NET) pages.
ASP.NET ZIP Virtual Path Provider, based on DotNetZip
The important code looks like this:
namespace Ionic.Zip.Web.VirtualPathProvider
{
    public class ZipFileVirtualPathProvider : System.Web.Hosting.VirtualPathProvider
    {
        ZipFile _zipFile;

        public ZipFileVirtualPathProvider (string zipFilename) : base ()
        {
            _zipFile = ZipFile.Read(zipFilename);
        }

        ~ZipFileVirtualPathProvider ()
        {
            _zipFile.Dispose ();
        }

        public override bool FileExists (string virtualPath)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualPath, true);
            ZipEntry zipEntry = _zipFile[zipPath];
            if (zipEntry == null)
                return false;
            return !zipEntry.IsDirectory;
        }

        public override bool DirectoryExists (string virtualDir)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualDir, false);
            ZipEntry zipEntry = _zipFile[zipPath];
            if (zipEntry == null)
                return false;
            return zipEntry.IsDirectory;
        }

        public override VirtualFile GetFile (string virtualPath)
        {
            return new ZipVirtualFile (virtualPath, _zipFile);
        }

        public override VirtualDirectory GetDirectory (string virtualDir)
        {
            return new ZipVirtualDirectory (virtualDir, _zipFile);
        }

        public override string GetFileHash(string virtualPath, System.Collections.IEnumerable virtualPathDependencies)
        {
            return null;
        }

        public override System.Web.Caching.CacheDependency GetCacheDependency(String virtualPath, System.Collections.IEnumerable virtualPathDependencies, DateTime utcStart)
        {
            return null;
        }
    }
}
And VirtualFile is defined like this:
namespace Ionic.Zip.Web.VirtualPathProvider
{
    class ZipVirtualFile : VirtualFile
    {
        ZipFile _zipFile;

        public ZipVirtualFile (String virtualPath, ZipFile zipFile) : base(virtualPath)
        {
            _zipFile = zipFile;
        }

        public override System.IO.Stream Open ()
        {
            ZipEntry entry = _zipFile[Util.ConvertVirtualPathToZipPath(base.VirtualPath, true)];
            return entry.OpenReader();
        }
    }
}

BGZF is the format used in genomics.
http://biopython.org/DIST/docs/api/Bio.bgzf-module.html
It is part of the samtools C library and is really just a simple hack around gzip. You can probably rewrite it yourself if you don't want to use the samtools C implementation or the Picard Java implementation. Biopython implements a Python variant.
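For a sense of how BGZF's random access works (as documented in the SAM/BAM specification), readers address data with a 64-bit "virtual offset" that packs the compressed file offset of a gzip block together with the position inside that block once decompressed; a small sketch of the arithmetic:
#include <cstdint>

// upper 48 bits: file offset of the start of the compressed BGZF block
// lower 16 bits: offset within that block after decompression
inline uint64_t bgzf_make_virtual_offset(uint64_t compressed_block_start, uint16_t within_block)
{
    return (compressed_block_start << 16) | within_block;
}

inline void bgzf_split_virtual_offset(uint64_t voffset, uint64_t& compressed_block_start, uint16_t& within_block)
{
    compressed_block_start = voffset >> 16;
    within_block = static_cast<uint16_t>(voffset & 0xFFFF);
}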

You haven't specified your OS. Would it be possible to store your file in a compressed directory managed by the OS? Then you would have the "seekable" portion as well as the compression. The CPU overhead will be handled for you with unpredictable access times.

I'm using MS Windows Vista, unfortunately, and Explorer can browse into zip files as if they were normal folders. Presumably that still works on 7 (which I'd like to be on). I think I've done the same with the corresponding utility on Ubuntu, but I'm not sure. I could also test it on Mac OSX, I suppose.

If individual articles are too short to get a decent compression ratio, the next-simplest approach is to tar up a batch of Wikipedia articles -- say, 12 articles at a time, or however many articles it takes to fill up a megabyte.
Then compress each batch independently.
In principle, that gives better compression than compressing each article individually, but worse compression than solid compression of all the articles together.
Extracting article #12 from a compressed batch requires decompressing the entire batch (and then throwing the first 11 articles away), but that's still much, much faster than decompressing half of Wikipedia.
Many compression programs break up the input stream into a sequence of "blocks", and compress each block from scratch, independently of the other blocks.
You might as well pick a batch size about the size of a block -- larger batches won't get any better compression ratio, and will take longer to decompress.
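A minimal sketch of the batch idea, assuming zlib's one-shot API: append articles to a batch until it reaches about a megabyte, compress each batch independently, and keep a small index of (compressed offset, compressed size) per batch so a single batch can be located and decompressed later. The names here are made up for illustration.
#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

struct BatchEntry { long offset; uLong compressed_size; };

std::vector<BatchEntry> write_batches(const std::vector<std::string>& articles, std::FILE* out)
{
    std::vector<BatchEntry> index;
    std::string batch;
    auto flush_batch = [&]() {
        if (batch.empty()) return;
        uLongf dest_len = compressBound(batch.size());
        std::vector<unsigned char> dest(dest_len);
        compress(dest.data(), &dest_len,
                 reinterpret_cast<const Bytef*>(batch.data()), batch.size());
        index.push_back({ std::ftell(out), dest_len });
        std::fwrite(dest.data(), 1, dest_len, out);
        batch.clear();
    };
    for (const std::string& article : articles) {
        batch += article;
        if (batch.size() >= (1 << 20))      // ~1 MB per batch, roughly one compressor block
            flush_batch();
    }
    flush_batch();
    return index;    // extracting an article means uncompress()-ing only its batch
}
You would also record which batch each article landed in, and its offset within the decompressed batch, so a lookup touches exactly one batch.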
I have experimented with several ways to make it easier to start decoding a compressed database in the middle.
Alas, so far the "clever" techniques I've applied still have worse compression ratio and take more operations to produce a decoded section than the much simpler "batch" approach.
For more sophisticated techniques, you might look at MG4J: Managing Gigabytes for Java, or
"Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell.

Related

How do I combine hundreds of binary files to a single output file in c++?

I have a folder filled with hundreds of .aac files, and I'm trying to "pack" them into one file in the most efficient way that I can.
I have tried the following, but I only end up with a file that's a few bytes long, or with audio that sounds heavily warbled and distorted.
// Now we want to get all of the files in the folder and then repack them into an aac file as fast as possible
void Repacker(string FileName)
{
    string data;
    boost::filesystem::path p = "./tmp/";
    boost::filesystem::ofstream aacwriter;
    aacwriter.open("./" + FileName + ".aac", ios::app);
    boost::filesystem::ifstream aacReader;
    boost::filesystem::directory_iterator it{ p };
    cout << "Repacking File!" << endl;
    while (it != boost::filesystem::directory_iterator{}) {
        aacReader.open(*it, std::ios::in | ios::binary);
        cout << "Writing " << *it << endl;
        aacReader >> data;
        ++it;
        aacwriter << data;
        aacReader.close();
    }
    aacwriter.close();
}
I have looked at the following questions to try and help solve this issue
Merging two files together
Merge multiple txt files into one
How do I read an entire file into a std::string in C++?
Read whole ASCII file into C++ std::string
Merging Two Files Into One
However unfortunately, none of these answer my question.
They all either have to do with text files, or functions that don't deal with hundreds of files at once.
I am trying to write Binary data, not text. The audio is either all warbled or the file is only a few bytes long.
If there's a memory-efficient method to do this, please let me know. I am using C++20 and Boost.
Thank you.
These files have an internal structure: header, blocks/frames, etc. and the simple presence of multiple headers within the concatenated file will mess up the expected result.
Take a look at the AAC file format structure; you'll see that it's not so simple.
Your best bet is to use FFMPEG, since it has a feature to concatenate media files without forcing you to re-encode the data. It's a bit involved because FFMPEG's command line is quite complex and not always intuitive, but it should work as long as all the AAC files use the same encoding and characteristics. Otherwise, you'll need to re-encode them, but that can be done automatically too.
Check this web search to get some basic information.
Otherwise, you may use the underlying libraries used by FFMPEG, for example libavcodec (available at ffmpeg.org), Fraunhofer FDK AAC, etc., but you'll have much more work to do and, in the end, you'll be doing exactly what FFMPEG already does, since it relies on these libraries. Other AAC libraries won't be much easier to use.
Obviously, you can also "embed" FFMPEG within your application, call tools like ffprobe to analyze files, and run the ffmpeg executable automatically as a child process.
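As a rough sketch of that last suggestion (calling ffmpeg as a child process), assuming an ffmpeg binary is available on the PATH and all inputs share the same encoding, you could write a list file for ffmpeg's concat demuxer and let it join the streams without re-encoding; paths and names here are illustrative only:
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
#include <cstdlib>
#include <string>

void RepackWithFfmpeg(const std::string& outputName)
{
    boost::filesystem::path p = "./tmp/";
    boost::filesystem::ofstream list("./filelist.txt");
    // one "file '<path>'" line per input; you may want to sort the entries first
    for (boost::filesystem::directory_iterator it{ p }; it != boost::filesystem::directory_iterator{}; ++it)
        list << "file '" << it->path().string() << "'\n";
    list.close();
    // -c copy avoids re-encoding; -safe 0 allows the relative paths used above
    std::string cmd = "ffmpeg -f concat -safe 0 -i filelist.txt -c copy \"" + outputName + ".aac\"";
    std::system(cmd.c_str());   // real code should check the return value / capture stderr
}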
CAUTION: Take great care about licensing if you plan to distribute your program. FFMPEG licensing is really not simple; most of the time it's distributed as source to avoid tricky licensing situations.

How to write content of an object into a file in c++

I have a code in this format:
srcSAXController control(input_filename.c_str());
std::string output_filename = input_filename;
output_filename = "c-" + output_filename.erase(input_filename.rfind(XML_STR));
std:: ofstream myfile(output_filename.c_str());
coverage_handler handler(i == MAIN_POS ? true : false, output_filename);
control.parse(&handler);
myfile.write((char *)&control, sizeof(control));
myfile.close();
I want the content of the 'control' object to be written into my file. How do I fix the code above so that the content of the control object is written to the file?
In general you need much more than just writing the bytes of the object to be able to save and reload it.
The problem is named "serialization" and depending on a lot of factors there are several strategies.
For example it's important to know if you need to save and reload the object on the same system or if you may need to reload it on a different system; it's also fundamental to know if the object contains links to other objects, if the link graph is a simple tree or if there are possibly loops, if you need to support versioning etc. etc.
Writing the bytes to disk like the code is doing is not going to work even for something as simple as an object containing an std::string.
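For the simplest case, here is a hand-rolled sketch of what field-by-field serialization looks like, using a hypothetical stand-in type (not the real srcSAXController/coverage_handler classes): variable-length members such as strings are written as a length prefix followed by their bytes, and the streams are opened in binary mode. Endianness and versioning are ignored here, which is exactly the kind of concern mentioned above.
#include <cstdint>
#include <fstream>
#include <string>

struct CoverageRecord {      // hypothetical stand-in for the real object
    std::string name;
    uint32_t    hits;
};

void save(const CoverageRecord& r, std::ofstream& out)   // out opened with std::ios::binary
{
    uint32_t len = static_cast<uint32_t>(r.name.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(r.name.data(), len);
    out.write(reinterpret_cast<const char*>(&r.hits), sizeof r.hits);
}

CoverageRecord load(std::ifstream& in)                    // in opened with std::ios::binary
{
    CoverageRecord r;
    uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    r.name.resize(len);
    in.read(&r.name[0], len);
    in.read(reinterpret_cast<char*>(&r.hits), sizeof r.hits);
    return r;
}
For anything non-trivial, a library such as Boost.Serialization or a format like Protocol Buffers saves a lot of this manual work.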

Pass Binary string/file content from c++ to node js

I'm trying to pass the content of a binary file from C++ to Node using a node-gyp addon. I have a process that creates a binary file in the .fit format, and I need to pass the content of the file to JS to process it. So my first approach was to read the content of the file into a string and try to pass it to Node like this:
char c;
std::string content="";
while (file.get(c)){
content+=c;
}
I'm using the following code to pass it to Node
v8::Local<v8::ArrayBuffer> ab = v8::ArrayBuffer::New(args.GetIsolate(), (void*)content.data(), content.size());
args.GetReturnValue().Set(ab);
In Node I get an ArrayBuffer, but when I print its content to a file it is different from what a C++ cout shows.
How can I pass the binary data successfully?
Thanks.
Probably the best approach is to write your data to a binary disk file. Write to disk in C++; read from disk in NodeJS.
Very importantly, make sure you specify BINARY MODE.
For example:
myFile.open ("data2.bin", ios::out | ios::binary);
Do not use "strings" (at least not unless you want to uuencode). Use buffers. Here is a good example:
How to read binary files byte by byte in Node.js
var fs = require('fs');
fs.open('file.txt', 'r', function(status, fd) {
    if (status) {
        console.log(status.message);
        return;
    }
    var buffer = Buffer.alloc(100);   // Buffer.alloc replaces the deprecated new Buffer(100)
    fs.read(fd, buffer, 0, 100, 0, function(err, num) {
        ...
    });
});
You might also find these links helpful:
https://nodejs.org/api/buffer.html <= has good examples for specific Node APIs
http://blog.paracode.com/2013/04/24/parsing-binary-data-with-node-dot-js/ <= good discussion of some of the issues you might face, including "endianness" and "interpreting numbers"
ADDENDUM:
The OP clarified that he's considering using C++ as a NodeJS add-on (not a standalone C++ program).
Consequently, using buffers is definitely an option. Here is a good tutorial:
https://community.risingstack.com/using-buffers-node-js-c-plus-plus/
If you choose to go this route, I would DEFINITELY download the example code and play with it first, before implementing buffers in your own application.
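If you do keep it inside the addon, a rough sketch of returning the file contents as a Node Buffer (which copies the bytes, so there is no dangling pointer into a temporary std::string) could look roughly like this; the function and file names are made up for illustration:
#include <node.h>
#include <node_buffer.h>
#include <fstream>
#include <iterator>
#include <string>

void ReadFitFile(const v8::FunctionCallbackInfo<v8::Value>& args)
{
    v8::Isolate* isolate = args.GetIsolate();
    std::ifstream file("activity.fit", std::ios::binary);   // hypothetical path
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    // node::Buffer::Copy duplicates the data, so 'content' may safely go out of scope.
    args.GetReturnValue().Set(
        node::Buffer::Copy(isolate, content.data(), content.size()).ToLocalChecked());
}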
It depends, but you could for example use Redis:
"Values can be strings (including binary data) of every kind, for instance you can store a jpeg image inside a value. A value can't be bigger than 512 MB."
If the file is bigger than 512 MB, you can store it in chunks. But I wouldn't suggest it, since Redis is an in-memory data store.
It's easy to implement in both C++ and Node.js.

Converting PDF to JPG like Photoshop quality - Commercial C++ / Delphi library

For the implementation of a Windows-based page-flip application I need to be able to convert a large number of PDF pages into good quality JPG, not just thumbnails.
The aim is to achieve the best quality / file size for that, similar to what Photoshop's Save for Web does.
Currently I'm using the Datalogics Adobe PDF Library SDK, which does not seem to be able to fulfil that task. I am thus looking for an alternative commercial C++ or Delphi library which provides good quality / size / speed.
After doing some search here, I noticed that most posts are about GS & ImageMagick, which I have also tested, but I am not satisfied with the output and the speed.
The target is to import the PDFs at 300 dpi and convert them with JPG quality 50, 1500 px height and an output size of 300-500 KB.
If anyone could point out a good library for that task, I would be most grateful.
The Gnostice PDFtoolKit VCL may be a candidate. Convert to JPEG is one of the options.
I always recommend Graphics32 for all your image manipulation needs; you have several resamplers to choose from. However, I don't think it can read PDF files as images. But if you can generate the big image yourself it may be a good choice.
Atalasoft DotImage (with the PDF rasterizer add-on) will do that (I work on PDF technologies there). You'd be working in C# (or another .NET language):
void ConvertToJpegs(string outFileStem, Stream pdf)
{
JpegEncoder encoder = new JpegEncoder();
encoder.Quality = 50;
int page = 1;
PdfImageSource source = new PdfImageSource(pdf);
source.Resolution = 300; // sets the rendering resolution to 300 dpi
// larger numbers mean better resolution in the image, but will cost in
// terms of output file size - as resolution increases, memory used increases
// as a function of the square of the resolution, whereas compression only
// saves maybe a flat 30% of the total image size, depending on the Quality
// setting on the encoder.
while (source.HasMoreImages()) {
AtalaImage image = source.AcquireNext();
// this image will be in either 8 bit gray or 24 bit rgb depending
// on the page contents.
try {
string path = String.Format("{0}{1}.jpg", outFileStem, page++);
// if you need to resample the image, this is the place to do it
image.Save(path, encoder, null);
}
finally {
source.Release(image);
}
}
}
There is also Quick PDF Library
Have a look at DynaPDF. I know it's pretty expensive, but you can try the starter pack.
P.S.: before buying a product, please make sure it meets your needs.

C++ : What's the easiest library to open video file

I would like to open a small video file and map every frame into memory (to apply a custom filter). I don't want to handle the video codec; I would rather let the library handle that for me.
I've tried to use DirectShow with the SampleGrabber filter (using this sample http://msdn.microsoft.com/en-us/library/ms787867(VS.85).aspx), but I only managed to grab some frames (not every frame!). I'm quite new to video software programming; maybe I'm not using the best library, or I'm doing it wrong.
I've pasted a part of my code (mainly a modified copy/paste from the MSDN example); unfortunately it doesn't grab the first 25 frames as expected...
[...]
hr = pGrabber->SetOneShot(TRUE);
hr = pGrabber->SetBufferSamples(TRUE);
pControl->Run(); // Run the graph.
pEvent->WaitForCompletion(INFINITE, &evCode); // Wait till it's done.
// Find the required buffer size.
long cbBuffer = 0;
hr = pGrabber->GetCurrentBuffer(&cbBuffer, NULL);
for( int i = 0 ; i < 25 ; ++i )
{
pControl->Run(); // Run the graph.
pEvent->WaitForCompletion(INFINITE, &evCode); // Wait till it's done.
char *pBuffer = new char[cbBuffer];
hr = pGrabber->GetCurrentBuffer(&cbBuffer, (long*)pBuffer);
AM_MEDIA_TYPE mt;
hr = pGrabber->GetConnectedMediaType(&mt);
VIDEOINFOHEADER *pVih;
pVih = (VIDEOINFOHEADER*)mt.pbFormat;
[...]
}
[...]
Is there somebody with video software experience who can advise me about the code, or about another, simpler library?
Thanks
Edit:
MSDN links seem not to work (see the bug).
Currently these are the most popular video frameworks available on Win32 platforms:
Video for Windows: old Windows framework dating from the Win95 era, but still widely used because it is very simple to use. Unfortunately it supports only AVI files for which the proper VFW codec has been installed.
DirectShow: the standard WinXP framework; it can basically load all formats you can play with Windows Media Player. Rather difficult to use.
Ffmpeg: more precisely libavcodec and libavformat, which come with the Ffmpeg open-source multimedia utility. It is extremely powerful and can read a lot of formats (almost everything you can play with VLC) even if you don't have the codec installed on the system. It's quite complicated to use, but you can always get inspired by the code of ffplay that ships with it, or by other implementations in open-source software. Anyway, I think it's still much easier to use than DS (and much faster). It needs to be compiled with MinGW on Windows, but all the steps are explained very well here (at the moment the link is down, hopefully not dead).
QuickTime: the Apple framework is not the best solution for the Windows platform, since it needs the QuickTime app to be installed along with the proper QuickTime codec for every format; it does not support many formats, but it's quite common in the professional field (so some codecs are actually QuickTime-only). It shouldn't be too difficult to implement.
Gstreamer: the latest open-source framework. I don't know much about it; I guess it wraps over some of the other systems (but I'm not sure).
All of these frameworks have been implemented as backends in OpenCV Highgui, except for DirectShow. The default framework for Win32 OpenCV is VFW (and thus it can only open some AVI files); if you want to use the others you must download the CVS version instead of the official release and do some hacking on the code, and it's not too complete anyway; for example, the FFMPEG backend doesn't allow seeking in the stream.
If you want to use QuickTime with OpenCV this can help you.
I have used OpenCV to load video files and process them. It's also handy for many types of video processing including those useful for computer vision.
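A minimal OpenCV sketch of that (the file name is illustrative): open the file with cv::VideoCapture, which delegates codec handling to its backend, and hand every decoded frame to your filter:
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("input.avi");     // hypothetical file name
    if (!cap.isOpened())
        return 1;
    cv::Mat frame;
    while (cap.read(frame)) {              // returns false after the last frame
        // apply your custom filter to 'frame' here, e.g.:
        cv::GaussianBlur(frame, frame, cv::Size(5, 5), 0);
    }
    return 0;
}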
Using the "Callback" model of SampleGrabber may give you better results. See the example in Samples\C++\DirectShow\Editing\GrabBitmaps.
There's also a lot of info in Samples\C++\DirectShow\Filters\Grabber2\grabber_text.txt and readme.txt.
I know it is very tempting in C++ to get a proper breakdown of the video files and just do it yourself. But although the information is out there, it is such a long-winded process building classes to handle each file format, and making them easily alterable to take future structure changes into account, that frankly it just is not worth the effort.
Instead I recommend ffmpeg. It got a mention above that says it is difficult; it isn't difficult. There are a lot more options than most people would need, which makes it look more difficult than it is. For the majority of operations you can just let ffmpeg work it out for itself.
For example a file conversion
ffmpeg -i inputFile.mp4 outputFile.avi
Decide right from the start that you will have ffmpeg operations run in a thread, or more precisely a thread library. But have your own thread class wrap it so that you can have your own EventArgs and ways of checking that the thread is finished. Something like:
ThreadLibManager()
{
List<MyThreads> listOfActiveThreads;
public AddThread(MyThreads);
}
Your thread class is something like:-
class MyThread
{
public Thread threadForThisInstance { get; set; }
public MyFFMpegTools mpegTools { get; set; }
}
MyFFMpegTools performs many different video operations, so you want your own event args to tell your parent code precisely what type of operation has just raised an event.
class MyFfmpegArgs
{
public int thisThreadID { get; set; } //Set as a new MyThread is added to the List<>
public MyFfmpegType operationType {get; set;}
//output paths etc that the parent handler will need to find output files
}
enum MyFfmpegType
{
FF_CONVERTFILE = 0, FF_CREATETHUMBNAIL, FF_EXTRACTFRAMES ...
}
Here is a small snippet of my ffmpeg tool class; this part collects information about a video.
I put FFmpeg in a particular location, and when the software starts it makes sure FFmpeg is there. For this version I have moved it to the Desktop; I am fairly sure I have written the path correctly for you (I really hate MS's special folders system, so I ignore it as much as I can).
Anyway, it is an example of using windowless ffmpeg.
public string GetVideoInfo(FileInfo fi)
{
outputBuilder.Clear();
string strCommand = string.Concat(" -i \"", fi.FullName, "\"");
string ffPath =
System.Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "\\ffmpeg.exe";
string oStr = "";
try
{
Process build = new Process();
//build.StartInfo.WorkingDirectory = #"dir";
build.StartInfo.Arguments = strCommand;
build.StartInfo.FileName = ffPath;
build.StartInfo.UseShellExecute = false;
build.StartInfo.RedirectStandardOutput = true;
build.StartInfo.RedirectStandardError = true;
build.StartInfo.CreateNoWindow = true;
build.ErrorDataReceived += build_ErrorDataReceived;
build.OutputDataReceived += build_ErrorDataReceived;
build.EnableRaisingEvents = true;
build.Start();
build.BeginOutputReadLine();
build.BeginErrorReadLine();
build.WaitForExit();
string findThis = "start";
int offset = 0;
foreach (string str in outputBuilder)
{
if (str.Contains("Duration"))
{
offset = str.IndexOf(findThis);
oStr = str.Substring(0, offset);
}
}
}
catch
{
oStr = "Error collecting file information";
}
return oStr;
}
private void build_ErrorDataReceived(object sender, DataReceivedEventArgs e)
{
string strMessage = e.Data;
if (outputBuilder != null && strMessage != null)
{
outputBuilder.Add(string.Concat(strMessage, "\n"));
}
}
Try using the OpenCV library. It definitely has the capabilities you require.
This guide has a section about accessing frames from a video file.
If it's for AVI files, I'd read the data from the AVI file myself and extract the frames, then use the Video Compression Manager to decompress them.
The AVI file format is very simple, see: http://msdn.microsoft.com/en-us/library/dd318187(VS.85).aspx (and use google).
Once you have the file open you just extract each frame and pass it to ICDecompress() to decompress it.
It seems like a lot of work but it's the most reliable way.
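As a rough sketch of that route, here is roughly what it looks like using the higher-level AVIFile/AVIStreamGetFrame API (which drives the VCM decompressor for you) instead of calling ICDecompress() by hand; treat it as an outline rather than production code:
#include <windows.h>
#include <vfw.h>
#pragma comment(lib, "vfw32.lib")

void WalkAviFrames(const wchar_t* path)
{
    AVIFileInit();
    PAVIFILE file = nullptr;
    if (AVIFileOpenW(&file, path, OF_READ, nullptr) == 0) {
        PAVISTREAM stream = nullptr;
        if (AVIFileGetStream(file, &stream, streamtypeVIDEO, 0) == 0) {
            PGETFRAME gf = AVIStreamGetFrameOpen(stream, nullptr);   // nullptr: default DIB format
            if (gf) {
                LONG start = AVIStreamStart(stream);
                LONG count = AVIStreamLength(stream);
                for (LONG i = start; i < start + count; ++i) {
                    // returns a packed DIB: BITMAPINFOHEADER followed by the pixel data
                    LPBITMAPINFOHEADER dib = (LPBITMAPINFOHEADER)AVIStreamGetFrame(gf, i);
                    if (dib) {
                        // process the decompressed frame here
                    }
                }
                AVIStreamGetFrameClose(gf);
            }
            AVIStreamRelease(stream);
        }
        AVIFileRelease(file);
    }
    AVIFileExit();
}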
If that's too much work, or if you want more than AVI files then use ffmpeg.
OpenCV is the best solution if, in your case, the video only needs to become a sequence of pictures. If you want to do real video processing, i.e. video together with audio, you should keep to the frameworks listed by "martjno". Newer Windows solutions (also for Win7) additionally include three new possibilities:
Windows Media Foundation: successor of DirectShow; cleaned-up interface
Windows Media Encoder 9: it does not only include the program, it also ships libraries for encoding
Windows Expression Encoder 4: successor of item 2
The last two are commercial-only solutions, but the first one is free. To code against WMF, you need to install the Windows SDK.
I would recommend FFMPEG or GStreamer. Try to stay away from OpenCV unless you plan to use functionality beyond just streaming video. The library is a beefy build, and it's a pain to build from source when configuring the FFMPEG/GStreamer options.