Compress a file in blocks

I have a file of X bytes, and I want to compress it in blocks of 32 KB, for example.
Is there any lib with which I can do this?
I used Zlib for Delphi, but with it I can only compress the full file into a new compressed file.
Thanks a lot,
Pedro

Why don't you use a simple header to determine block boundaries? Consider this:
1. Read a fixed amount of data from the input into a buffer (say 32 KiB).
2. Compress that buffer with a "freshly created" deflate stream (the underlying compression algorithm of zlib).
3. Write the compressed size to the output stream.
4. Write the compressed data to the output stream.
5. Go to step 1 until you reach end-of-file.
Pros:
You can decompress any block independently, even in a multi-threaded fashion.
Data corruption is limited to the corrupted block; the rest of the data can be restored.
Cons:
You lose most of the contextual information (similarities between blocks), so you will get a lower compression ratio.
It needs slightly more work (see the sketch below).
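For illustration, here is a minimal sketch of that scheme in C++ against zlib's C API (which is what Zlib for Delphi wraps); the function name, the 4-byte length prefix (written in host byte order here) and the error handling are just one possible choice, not a fixed format:

// Sketch: compress a file in independent 32 KiB blocks with zlib.
// Each output record is: [4-byte compressed size][compressed bytes].
// compress2() runs a fresh deflate stream per call, so every block stands alone.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <zlib.h>

bool compress_in_blocks(const char* inPath, const char* outPath,
                        size_t blockSize = 32 * 1024,
                        int level = Z_DEFAULT_COMPRESSION)
{
    std::FILE* in = std::fopen(inPath, "rb");
    std::FILE* out = std::fopen(outPath, "wb");
    if (!in || !out) { if (in) std::fclose(in); if (out) std::fclose(out); return false; }

    std::vector<unsigned char> raw(blockSize);
    std::vector<unsigned char> packed(compressBound(static_cast<uLong>(blockSize)));

    size_t got;
    bool ok = true;
    while ((got = std::fread(raw.data(), 1, blockSize, in)) > 0) {
        uLongf packedLen = static_cast<uLongf>(packed.size());
        if (compress2(packed.data(), &packedLen, raw.data(),
                      static_cast<uLong>(got), level) != Z_OK) {
            ok = false;
            break;
        }
        std::uint32_t len = static_cast<std::uint32_t>(packedLen);
        std::fwrite(&len, sizeof len, 1, out);          // block header: compressed size
        std::fwrite(packed.data(), 1, packedLen, out);  // compressed payload
    }
    std::fclose(in);
    std::fclose(out);
    return ok;
}

To decompress, read the 4-byte length, read that many bytes, and hand them to uncompress() with a 32 KiB output buffer; because every block is independent, different threads can work on different blocks.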

Related

Size of compressed Opus frame (bytes)

How do I calculate the size of the compressed Opus frame (number of bytes)? I have read the OggS page and the TOC header. The next bytes should belong to the compressed frame, but how do I get the number of bytes?
You're inside an Ogg file, I assume. Why can't you read it from the lacing table like any other data packet?
The first Ogg page is OpusHead, the second is OpusTags; every page after that should just be the Opus packets laced together, no special formatting or anything. It's in the spec here: https://wiki.xiph.org/OggOpus
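If it helps, here is a hedged sketch of reading packet sizes from one page's lacing table (offsets follow the Ogg page layout: 27-byte header, segment count at byte 26, then one lacing byte per segment; the struct and function names are made up for the example):

// Sketch: compute packet sizes from the lacing table of a single Ogg page.
// A lacing value < 255 terminates a packet; a trailing 255 means the packet
// continues on the next page.
#include <cstdint>
#include <vector>

struct OggPagePackets {
    std::vector<std::size_t> completed;  // sizes of packets that end in this page
    std::size_t trailingPartial;         // bytes of a packet continuing on the next page
    std::size_t headerSize;              // 27 + number of lacing values
};

OggPagePackets parse_lacing_table(const std::uint8_t* page)  // page starts at "OggS"
{
    OggPagePackets out{{}, 0, 0};
    std::uint8_t nSegments = page[26];
    out.headerSize = 27u + nSegments;

    std::size_t current = 0;
    for (std::uint8_t i = 0; i < nSegments; ++i) {
        std::uint8_t lace = page[27 + i];
        current += lace;
        if (lace < 255) {                // this lacing value closes a packet
            out.completed.push_back(current);
            current = 0;
        }
    }
    out.trailingPartial = current;       // non-zero if the last value was 255
    return out;
}

The packet data then starts at page + headerSize, and for an Opus stream the packet size from the lacing table is the size of the whole compressed Opus packet (the TOC byte plus the frame data).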

C++/C Multiple threads to read gz file simultaneously

I am attempting to read a gzip-compressed file from multiple threads.
I was thinking this would significantly speed up the decompression process, as my gzread calls in multiple threads start at different file offsets (using gzseek), and hence read different parts of the file.
The simplified code looks like this:
// in each thread (simplified)
auto gf = gzopen("file.gz", "rb");
gzseek(gf, offset, SEEK_SET);     // offset differs per thread
gzread(gf, buffer, length);       // buffer/length: this thread's chunk
gzclose(gf);
To my surprise, my multi-threaded version does not speed things up at all. The 20-thread version takes exactly as long as the single-thread version. I am pretty sure this is far from being disk-bound.
I guess the zlib inflate functionality may need to decompress the entire file to read even a small part of it, but I failed to find any clue in the manual.
Does anyone have an idea how to speed this up in my case?
Short answer: due to the serial nature of a deflate stream, gzseek() must decode all of the compressed data from the start up to the requested seek point. So you can't get any gain with what you are trying to do. In fact, the total cycles spent will increase with the square of the length of the compressed data! So don't do that.
tl;dr: zlib isn't designed for random access. It seems possible to implement, though requiring a complete read-through to build an index, so it might not be helpful in your case.
Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:
/* if within raw area while reading, just go there */
if (state->mode == GZ_READ && state->how == COPY &&
        state->x.pos + offset >= 0) {
"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:
int how; /* 0: get header, 1: copy, 2: decompress */
Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:
state->seek = 1;
state->skip = offset;
gzread, when called, processes this with a call to gz_skip:
if (state->seek) {
    state->seek = 0;
    if (gz_skip(state, state->skip) == -1)
        return -1;
}
Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.
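If you want to see this cost directly, a small timing sketch (file name and offsets are placeholders) shows that a gzseek-then-read grows with the target offset, because everything before the offset has to be inflated:

// Sketch: time gzseek()+gzread() to increasing uncompressed offsets.
// The seek itself is deferred; reading one byte forces the pending gz_skip.
#include <chrono>
#include <cstdio>
#include <zlib.h>

int main()
{
    const char* path = "file.gz";                     // placeholder
    for (z_off_t offset = 1 << 20; offset <= static_cast<z_off_t>(64) << 20; offset *= 4) {
        gzFile gf = gzopen(path, "rb");
        if (!gf) return 1;

        auto t0 = std::chrono::steady_clock::now();
        gzseek(gf, offset, SEEK_SET);                 // records the skip
        char byte;
        gzread(gf, &byte, 1);                         // inflates everything up to offset
        auto t1 = std::chrono::steady_clock::now();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("seek to %lld: %lld ms\n",
                    static_cast<long long>(offset), static_cast<long long>(ms));
        gzclose(gf);
    }
    return 0;
}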
The zlib implementation has no multithreading (http://www.zlib.net/zlib_faq.html#faq21 - "Is zlib thread-safe? - Yes. ... Of course, you should only operate on any given zlib or gzip stream from a single thread at a time.") and will decompress the "entire file" up to the seeked position.
And the zlib format is poorly suited to parallel decompression or seeking: deflate streams are bit-aligned and contain no offset fields.
You may try other implementations of deflate/inflate, for example pigz (http://zlib.net/pigz/), or switch from this ancient single-core-era compression to modern, non-zlib parallel formats such as xz/lzma or something from Google.
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. To compile and use pigz, please read the README file in the source code distribution. You can read the pigz manual page here.
The manual page is http://zlib.net/pigz/pigz.pdf and it has useful information.
It uses a format compatible with zlib, but adapted for parallel compression:
Each partial raw deflate stream is terminated by an empty stored block ... in order to end that partial bit stream at a byte boundary.
Still, the DEFLATE format is bad for parallel decompression:
Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.
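Typical usage looks like this (flags as documented in the pigz manual; adjust the thread count to your machine):
pigz -9 -p 8 bigfile        # parallel compression with 8 threads, produces bigfile.gz
pigz -d bigfile.gz          # decompression; the inflate itself still runs on one thread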

Convert VERY large ppm files to JPEG/JPG/PNG?

So I wrote a C++ program that produces very high resolution pictures (fractals).
I use fstream to save all the data in a .ppm file.
Everything works fine, but when I go to really high resolutions (38400x21600) the ppm file is ~8 gigabytes.
With my 16 gigabytes of RAM, however, I am still not able to convert that picture. I downloaded a couple of converters, but they couldn't handle it. Even GIMP crashed when I tried to "export as...".
So, does anyone know a good converter that can handle really large ppm files? In fact, I eventually want to go above 100 gigabytes. I don't care if it's slow; it just has to work.
If there is no such converter: is there a better way to write the data than std::ofstream? Like, maybe, a library that automatically produces a PNG file?
Thanks for your help!
Edit: I also asked myself what might be the best format for saving these large images. I did some research and JPEG looks quite good (small size, still good quality). But maybe there's a better format? Let me know. Thanks.
A few thoughts...
An 8-bit PPM file of 38400x21600 should take around 2.3GB (38400 x 21600 x 3 bytes). A 16-bit PPM file of the same dimensions requires twice as much, i.e. about 4.6GB, so I am not sure where you got 8GB from.
VIPS is excellent for processing large images. If I take a 38400x21600 PPM file and use the following command in the Terminal (i.e. at the command line), I can see it peaks at 58MB of RAM to do the conversion from PPM to JPEG:
vips jpegsave fractal.ppm fractal.jpg --vips-leak
memory: high-water mark 58.13 MB
That takes 31 seconds on a reasonably specced iMac and produces a 480MB file from my (random) data, so you would expect your result to be much smaller, since mine is pretty incompressible.
ImageMagick, on the other hand, peaks at around 11.6GB of memory (the maximum resident set size below) and does the same conversion in 74 seconds:
/usr/bin/time -l convert fractal.ppm fractal.jpg
73.81 real 69.46 user 4.16 sys
11616595968 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
4051124 page reclaims
4 page faults
0 swaps
0 block input operations
106 block output operations
0 messages sent
0 messages received
0 signals received
9 voluntary context switches
11791 involuntary context switches
Go to the Baby X resource compiler and download the JPEG encoder, savejpeg.c. It takes an RGB buffer which has to be flat in memory. Hack into it and replace that with a version that accepts a stream of 16x16 blocks. Then write your own PPM loader that loads a 16-pixel-high strip at a time (a sketch follows below).
Now the system will scale up to huge images which don't fit in memory. How you're going to display them I don't know. But the JPEG will be to specification.
https://github.com/MalcolmMcLean/babyxrc
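For the PPM side, a simplified strip reader might look like the sketch below (names are illustrative; binary P6 only, 8-bit maxval, no comment lines in the header; the callback is where a block-based JPEG encoder such as a modified savejpeg.c would consume the pixels):

// Sketch: read a binary (P6) PPM in strips of 16 rows so the whole image never
// has to fit in memory. Simplified header parsing: assumes 8-bit maxval and no
// comment lines.
#include <cstdio>
#include <functional>
#include <vector>

bool for_each_ppm_strip(const char* path, int stripRows,
                        const std::function<void(const unsigned char* rgb,
                                                 int width, int rows, int yOffset)>& onStrip)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;

    char magic[3] = {0};
    int width = 0, height = 0, maxval = 0;
    if (std::fscanf(f, "%2s %d %d %d", magic, &width, &height, &maxval) != 4 ||
        magic[0] != 'P' || magic[1] != '6' || maxval != 255) {
        std::fclose(f);
        return false;
    }
    std::fgetc(f); // consume the single whitespace byte after the header

    std::vector<unsigned char> strip(static_cast<size_t>(width) * stripRows * 3);
    for (int y = 0; y < height; y += stripRows) {
        int rows = (height - y < stripRows) ? (height - y) : stripRows;
        size_t need = static_cast<size_t>(width) * rows * 3;
        if (std::fread(strip.data(), 1, need, f) != need) {
            std::fclose(f);
            return false;
        }
        onStrip(strip.data(), width, rows, y); // hand the strip to the encoder
    }
    std::fclose(f);
    return true;
}

The important point is that only width x 16 x 3 bytes are resident at any time, so reading a 100GB image is no problem.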
I'd suggest that a more efficient and faster solution would be to simply get more RAM - 128GB is not prohibitively expensive these days (or add swap space).

Read data from wav file before applying FFT

It's the first time I'm working with wave files.
The problem is that I don't exactly understand how to properly read the stored data. My code for reading:
uint8_t* buffer = new uint8_t[BUFFER_SIZE];
std::cout << "Buffering data... " << std::endl;
while ((bytesRead = fread(buffer, sizeof buffer[0], BUFFER_SIZE / (sizeof buffer[0]), wavFile)) > 0)
{
    // do something with the buffer data
}
Sample file header gives me information that data is PCM (1 channel) with 8 bits per sample and sampling rate is 11025Hz.
The output data gives me (after updates) values from 0 to 255, so the values are proper PCM values for 8-bit modulation. But, any idea what BUFFER_SIZE would be preferable to correctly read those values?
WAV file I'm using: http://www.wavsource.com/movies/2001.htm (daisy.wav)
TXT output: https://paste.ee/p/pXGvm
You've got two common situations. The first is where the WAV file represents a short audio sample and you want to read the whole thing into memory and manipulate it. In that case BUFFER_SIZE is a variable: basically you seek to the end of the file to get its size, then load it.
The second common situation is that the WAV file represents a fairly long audio recording, and you want to process it piecewise, often by writing to an output device in real time. So BUFFER_SIZE needs to be large enough to hold a bite-sized chunk, but not so large that you require excessive memory. Often the size of a "frame" of audio is dictated by the output device itself; it expects, say, 25 frames per second to synchronise with video, or something similar. You generally need a double buffer to ensure that you can always meet the demand for more samples when the DAC (digital-to-analogue converter) runs out; then, on handing out one chunk, you load the next chunk of data from disk. Sometimes there isn't a "right" value for the chunk size; you've just got to go with something fairly sensible that balances memory footprint against the number of read calls.
If you need to do an FFT, it's normal to use a buffer size that is a power of two, to make the fast transform simpler. The size you need depends on the lowest frequency you are interested in.
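As a concrete (hedged) example for your 8-bit mono file: assuming the reader is already positioned at the start of the data chunk, one window for the FFT could be filled like this (the function name is just for the example), with the unsigned samples recentred around zero:

// Sketch: read one power-of-two window of 8-bit unsigned PCM and convert it to
// floats in roughly [-1, 1) for an FFT. Assumes `wavFile` is already positioned
// at the start of the data chunk (i.e. the header has been parsed and skipped).
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<float> read_fft_window(std::FILE* wavFile, size_t windowSize /* e.g. 4096 */)
{
    std::vector<std::uint8_t> raw(windowSize);
    size_t got = std::fread(raw.data(), sizeof raw[0], windowSize, wavFile);

    std::vector<float> samples(windowSize, 0.0f);   // zero-pad a short final read
    for (size_t i = 0; i < got; ++i)
        samples[i] = (raw[i] - 128) / 128.0f;       // 8-bit PCM is unsigned, midpoint 128
    return samples;
}

For instance, a 4096-sample window at 11025Hz covers about 0.37 seconds and gives roughly 2.7Hz per FFT bin, which is what the "lowest frequency you are interested in" point above is about.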

Reading binary files, Linux Buffer Cache

I am busy writing something to test the read speeds for disk IO on Linux.
At the moment I have something like this to read the files:
Edited to change code to this:
const int segsize = 1048576;
char buffer[segsize];
ifstream file;
file.open(sFile.c_str());
while(file.readsome(buffer,segsize)) {}
For foo.dat, which is 150GB, the first time I read it in, it takes around 2 minutes.
However, if I run it again within 60 seconds of the first run, it then takes around 3 seconds. How is that possible? Surely the only place it could be read from that fast is the buffer cache in RAM, and the file is too big to fit in RAM.
The machine has 50GB of RAM, and the drive is an NFS mount with all the default settings. Please let me know where I could look to confirm that this file is actually being read at this speed. Is my code wrong? It appears to take a correct amount of time the first time the file is read.
Edited to Add:
I found out that my files were only being read up to a random point. I've managed to fix this by changing segsize down to 1024 from 1048576. I have no idea why this change allows the ifstream to read the whole file instead of stopping at a random point.
Thanks for the answers.
On Linux, you can do this for a quick throughput test:
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.863904 s, 243 MB/s
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.0748273 s, 2.8 GB/s
$ sync && echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/md0 of=/dev/null bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.919688 s, 228 MB/s
echo 3 > /proc/sys/vm/drop_caches will flush the cache properly
in_avail doesn't give the length of the file, but a lower bound on what is available (in particular, if the buffer has already been filled, it returns the size available in the buffer). Its goal is to know what can be read without blocking.
An unsigned int is most probably unable to hold a length of more than 4GB, so what is actually read can very well fit in the cache.
C++0x Stream Positioning may be interesting to you if you are using large files
in_avail returns a lower bound on how much is available to read in the stream's read buffer, not the size of the file. To read the whole file via the stream, just keep calling the stream's readsome() method and checking how much was read with the gcount() method - when that returns zero, you have read everything.
It appears to take a correct amount of time the first time the file is read.
On that first read, you're reading 150GB in about 2 minutes. That works out to about 10 gigabits per second. Is that what you're expecting (based on the network to your NFS mount)?
One possibility is that the file could be at least in part sparse. A sparse file has regions that are truly empty - they don't even have disk space allocated to them. These sparse regions also don't consume much cache space, and so reading the sparse regions will essentially only require time to zero out the userspace pages they're being read into.
You can check with ls -lsh. The first column will be the on-disk size - if it's less than the file size, the file is indeed sparse. To de-sparse the file, just write to every page of it.
If you would like to test for true disk speeds, one option would be to use the O_DIRECT flag to open(2) to bypass the cache. Note that all IO using O_DIRECT must be page-aligned, and some filesystems do not support it (in particular, it won't work over NFS). Also, it's a bad idea for anything other than benchmarking. See some of Linus's rants in this thread.
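A minimal sketch of such an O_DIRECT read loop (Linux-specific, benchmarking only; the 4096-byte alignment is a common requirement but varies by filesystem and kernel):

// Sketch: read a file with O_DIRECT to bypass the page cache.
// Both the buffer address and the read size must be suitably aligned.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        // O_DIRECT is a Linux/GNU extension
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    const char* path = argc > 1 ? argv[1] : "foo.dat";
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { std::perror("open"); return 1; }

    const size_t align = 4096, bufSize = 1 << 20;   // 1 MiB aligned buffer
    void* buf = nullptr;
    if (posix_memalign(&buf, align, bufSize) != 0) { close(fd); return 1; }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufSize)) > 0)        // uncached reads
        total += n;

    std::printf("read %lld bytes uncached\n", total);
    free(buf);
    close(fd);
    return 0;
}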
Finally, to drop all caches on a linux system for testing, you can do:
echo 3 > /proc/sys/vm/drop_caches
If you do this on both client and server, you will force the file out of memory. Of course, this will have a negative performance impact on anything else running at the time.