C++/C Multiple threads to read gz file simultaneously - c++

I am attempting to read a gzip-compressed file from multiple threads.
I was thinking this would significantly speed up decompression process as my gzread functions in multiple threads start from different file offset (using gseek), hence they read different parts of the file.
The simplified code is like
// in threads
auto gf = gzopen("file.gz",xxx);
gzseek(gf,offset);
gzread(xx);
gzclose(gf);
To my surprise, my multi-thread version program does not speed up at all. The 20-thread version uses exactly the same time as the single-thread version. I am pretty sure this is far away from the disk bottleneck.
I guess the zlib inflation functionality may need to decompress the entire file for reading even a small part, but I failed to get any clue from their manual.
Anyone have an idea how to speed up in my case?

Short answer: due to the serial nature of a deflate stream, gzseek() must decode all of the compressed data from the start up to the requested seek point. So you can't get any gain with what you are trying to do. In fact, the total cycles spent will increase with the square of the length of the compressed data! So don't do that.

tl;dr: zlib isn't designed for random access. It seems possible to implement, though requiring a complete read-through to build an index, so it might not be helpful in your case.
Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:
/* if within raw area while reading, just go there */
if (state->mode == GZ_READ && state->how == COPY &&
state->x.pos + offset >= 0) {
"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:
int how; /* 0: get header, 1: copy, 2: decompress */
Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:
state->seek = 1;
state->skip = offset;
gzread, when called, processes this with a call to gz_skip:
if (state->seek) {
state->seek = 0;
if (gz_skip(state, state->skip) == -1)
return -1;
}
Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.

zlib implementation have no multithreading (http://www.zlib.net/zlib_faq.html#faq21 - "Is zlib thread-safe? - Yes. ... Of course, you should only operate on any given zlib or gzip stream from a single thread at a time.") and will decompress "entire file" up to seeked position.
And zlib format has bad alignment (bit alignment) / no offset fields (deflate format) to enable parallel decompression/seeking.
You may try another implementations of z (deflate/inflate), for example, http://zlib.net/pigz/ (or switch from ancient compression from the era of single core to non-zlib modern parallel formats, xz/lzma/something from google)
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. To compile and use pigz, please read the README file in the source code distribution. You can read the pigz manual page here.
The manual page is http://zlib.net/pigz/pigz.pdf and it has useful information.
It uses format compatible to zlib, but adopted to parallel compress:
Each partial raw deflate stream is terminated by an empty stored block ... in order to end that partial bit stream at a byte boundary.
Still, DEFLATE format is bad for parallel decompression:
Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. Asaresult, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.

Related

Simple Fast read-process

I want to decompress data from a file on a very slow device (read speed=1Mo/s). The decompression algorithm can at least perform three times this speed.
What is the fastest way to parallelize those tasks in C/C++ so that reading process cannot be slow down by decompression and so use maximum bandwidth.
I already tried two threads with regular pipes. But I don't know if it is the best solution. At least it is not a zero-copy algo.
My current algo is wrong because I cannot success to perform blocking IO on pipes. (tried fcntl or fread/fdopen)
My unparallelized program is very simple. Somethings like
while(remainingToRead > 0){
int nb = fread(buffer, 1, bufferSize);
decompress(buffer, nb, bufferOut);
nb -= remainingToRead
}
One solution here is to use separate threads to read and decompress and to use two buffers so that these operations can be overlapped.
Pseudo code:
Read thread:
while (not finished)
{
while (no buffers free)
wait on condition variable
fill next buffer
mark buffer in use
set condition variable for decompression thread
}
Decompression thread:
while (not finished)
{
while (no buffers full)
wait on condition variable
decompress next buffer
mark buffer free
set condition variable for read thread
}
Note that getting this to work involves getting quite a bit of detail right - multi-threaded programming is always tricky.

Parse very large CSV files with C++

My goal is to parse large csv files with C++ in a QT project in OSX environment.
(When I say csv I mean tsv and other variants 1GB ~ 5GB ).
It seems like a simple task , but things get complicated when file sizes get bigger. I don't want to write my own parser because of the many edge cases related to parsing csv files.
I have found various csv processing libraries to handle this job, but parsing 1GB file takes about 90 ~ 120 seconds on my machine which is not acceptable. I am not doing anything with the data right now, I just process and discard the data for testing purposes.
cccsvparser is one of the libraries I have tried . But the the only fast enough library was fast-cpp-csv-parser which gives acceptable results: 15 secs on my machine, but it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically in runtime .
The other approach I have tried is to read csv file line by line with fast-cpp-csv-parser LineReader class which is really fast (about 7 secs to read whole file), and then parse each line with cccsvparser lib that can process strings, but this takes about 40 seconds until done, it is an improvement compared to the first attempts but still unacceptable.
I have seen various Stack Overflow questions related to csv file parsing none of them takes large file processing in to account.
Also I spent a lot of time googling to find a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out of the box solutions.
I will appreciate any suggestion about how to handle this problem.
Edit:
When using #fbucek's approach, processing time reduced to 25 seconds, which is a great improvement.
can we optimize this even more?
I am assuming you are using only one thread.
Multithreading can speedup your process.
Best accomplishment so far is 40 sec. Let's stick to that.
I have assumed that first you read then you process -> ( about 7 secs to read whole file)
7 sec for reading
33 sec for processing
First of all you can divide your file into chunks, let's say 50MB.
That means that you can start processing after reading 50MB of file. You do not need to wait till whole file is finished.
That's 0.35 sec for reading ( now it is 0.35 + 33 second for processing = cca 34sec )
When you use Multithreading, you can process multiple chunks at a time. That can speedup process theoretically up to number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think you can speed up you processing with 4 cores up to 9 s. in total.
Look at QThreadPool and QRunnable or QtConcurrent
I would prefer QThreadPool
Divide task into parts:
First try to loop over file and divide it into chunks. And do nothing with it.
Then create "ChunkProcessor" class which can process that chunk
Make "ChunkProcessor" a subclass of QRunnable and in reimplemented run() function execute your process
When you have chunks, you have class which can process them and that class is QThreadPool compatible, you can pass it into
It could look like this
loopoverfile {
whenever chunk is ready {
ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
}
}
You can use std::share_ptr to pass processed data in order not to use QMutex or something else and avoid serialization problems with multiple thread access to some resource.
Note: in order to use custom signal you have to register it before use
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
Edit: (based on discussion, my answer was not clear about that)
It does not matter what disk you use or how fast is it. Reading is single thread operation.
This solution was suggested only because it took 7 sec to read and again does not matter what disk it is. 7 sec is what's count. And only purpose is to start processing as soon as possible and not to wait till reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use principal idea: ( I do not know why it take 7 sec to read, what is behind it )
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as read by line "without" interruption.
What you do with data is another problem, but has nothing to do with I/O. It is already in memory.
So only concern would be 5GB file and ammout of RAM on the machine.
It is very simple solution all you need is subclass QRunnable, reimplement run function, emit signal when it is finished, pass processed data using shared pointer and in main thread joint that data into one structure or whatever. Simple thread safe solution.
I would propose a multi-thread suggestion with a slight variation is that one thread is dedicated to reading file in predefined (configurable) size of chunks and keeps on feeding data to a set of threads (more than one based cpu cores). Let us say that the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from file. In this class, it holds a data structure which is used to communicate with process threads. For example this structure would contain starting offset, ending offset of the read buffer for each process thread. For reading file data, reader class holds 2 buffers each of chunk size (50 MB in this case)
Create a process class which holds a pointers (shared) for the read buffers and offsets data structure.
Now create driver (probably main thread), creates all the threads and waiting on their completion and handles the signals.
Reader thread is invoked with reader class, reads 50 MB of the data and based on number of threads creates offsets data structure object. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB and so on. Once ready, it notifies processor threads. It then immediately reads next chunk from disk and waits for processor thread to completion notification from process threads.
Processor threads on the notification, reads data from buffer and processes it. Once done, it notifies reader thread about completion and waits for next chunk.
This process completes till the whole data is read and processed. Then reader thread notifies back to the main thread about completion which sends PROCESS_COMPLETION, upon all threads exits. or main thread chooses to process next file in the queue.
Note that offsets are taken for easy explanation, offsets to line delimiter mapping needs to be handled programmatically.
If the parser you have used is not distributed obviously the approach is not scalable.
I would vote for a technique like this below
chunk up the file into a size that can be handled by a machine / time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing at block sizes for a given chunk
Avoid threads on same resource (i.e given block) to save yourself from all thread related issues.
Use threads to achieve non competing (on a resource) operations - such as reading on one thread and writing on a different thread to a different file.
do your parsing (now for this small chunk you can invoke your parser).
do your operations.
merge the results back / if can distribute them as they are..
Now, having said that, why can't you use Hadoop like frameworks?

Irregular file writing performance in c++

I am writing an app which receives a binary data stream wtih a simple function call like put(DataBLock, dateTime); where each data package is 4 MB
I have to write these datablocks to seperate files for future use with some additional data like id, insertion time, tag etc...
So I both tried these two methods:
first with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
second with ostream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first methods looks faster, but my main problem is in both ways at first everythings goes fine, for every file writing operation it tooks almost the same time (like 20 milliseconds), but after the 250 - 300th package it starts to make some peaks like 150 to 300 milliseconds and then goes down to 20 milliseconds and then again 150 ms and so on... So it becomes very unpredictable.
When I put some timers to the code I figured out that the main reason for these peaks are because of the fout.open(...) and pfile = fopen(...) lines. I have no idea if this is because of the operating system, hard drive, any kind of cache or buffer mechanism etc...
So the question is; why these file opening lines become problematic after some time, and is there a way to make file writing operation stable, I mean fixed time?
Thanks.
NOTE: I'm using Visual studio 2008 vc++, Windows 7 x64. (I tried also for 32 bit configuration but the result is same)
EDIT: After some point writing speed slows down as well even if the opening file time is minimum. I tried with different package sizes so here are the results:
For 2 MB packages it takes double time to slow down, I mean after ~ 600th item slowing down begins
For 4 MB packages almost 300th item
For 8 MB packages almost 150th item
So it seems to me it is some sort of caching problem or something? (in hard drive or OS). But I also tried with disabling hard drive cache but nothing changed...
Any idea?
This is all perfectly normal, you are observing the behavior of the file system cache. Which is a chunk of RAM that's is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, can be much more if your machine has lots of RAM. Sounds like you've got 4 GB installed, not that much for a 64-bit operating system. Depends however on the RAM needs of other processes that run on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, it in turns make operating system calls to flush full buffers. The OS writes normally completely very quickly, it is a simple memory-to-memory copy going from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte/second.
The file system driver lazily writes the file system cache data to the disk. Optimized to minimize the seek time on the write head, by far the most expensive operation on the disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes/second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here. You are writing to the file cache a lot faster than it can be emptied. This does hit the wall eventually, you'll manage to fill the cache to capacity and suddenly see the perf of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete, effective write speed is now throttled by disk write speeds.
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file. That is a time that's completely dominated by disk head seek times, it needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec, you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. What CRT functions you use certainly don't make any difference, as you found out. At best you could increase the size of the files you write, that reduces the overhead spent on creating the file.
Buying more RAM is always a good idea. But it of course merely delays the moment where the firehose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice, so is a striped raid array. Best thing to do is to simply not wait for your program to complete :)
So the question is; why these file opening lines become problematic
after some time, and is there a way to make file writing operation
stable, I mean fixed time?
This observation(.i.e. varying time taken in write operation) does not mean that there is problem in OS or File System.There could be various reason behind your observation. One possible reason could be the delayed write may be used by kernel to write the data to disk. Sometime kernel cache it(buffer) in case another process should read or write it soon so that extra disk operation can be avoided.
This situation may lead to inconsistency in the time taken in different write call for same size of data/buffer.
File I/O is bit complex and complicated topic and depends on various other factors. For complete information on internal algorithm on File System, you may want to refer the great great classic book "The Design Of UNIX Operating System" By Maurice J Bach which describes these concepts and the implementation in detailed way.
Having said that, you may want to use the flush call immediately after your write call in both version of your program(.i.e. C and C++). This way you may get the consistent time in your file I/O write time. Otherwise your programs behaviour look correct to me.
//C program
fwrite(data,fp);
fflush(fp);
//C++ Program
fout.write(data);
fout.flush();
It's possible that the spikes are not related to I/O itself, but NTFS metadata: when your file count reach some limit, some NTFS AVL-like data structure needs some refactoring and... bump!
To check it you should preallocate the file entries, for example creating all the files with zero size, and then opening them when writing, just for testing: if my theory is correct you shouldn't see your spikes anymore.
UHH - and you must disable file indexing (Windows search service) there! Just remembered of it... see here.

Can a false sync word be found in the payload of an MPEG-1/MPEG-2 frame?

I know I can find other answers about this on SO, but I want clarifications from somebody who really knows MPEG-1/MPEG-2 (or MP3, obviously).
The start of an MPEG-1/2 frame is 12 set bits starting at a byte boundary, so bytes ff f*, where * is any nibble. Those 12 bits are called a sync word. This is a useful characteristic to find the start of a frame in any MPEG-1/2 stream.
My first question is: formally, can a false sync word be found or not in the payload of an MPEG-1/2 frame, outside its header?
If so, here's my second question: why does the sync word mechanism even exist then? If we cannot make sure that we found a new frame when reading fff, what is the purpose of this sync word?
Please do not even consider ID3 in your answer; I already know about sync words that can be found in ID3v2 payloads, but that's well documented.
I worked on MPEG-2 streams, more precisely Transport Streams (TS): I guess we can find similarities.
A TS is composed of Transport Packets, which have a header, starting with a sync byte 0x47.
We also can found 0x47 within the payload of the TP, but we know that it is not a sync byte because it is not aligned (TP have a fixed size of 188 bytes).
The sync word gives an entry point to someone that looks at the stream, and allows a program to synchronize his process with the stream, hence the name.
It also allows a fast browsing and parsing of the stream: in a TS you can jump from a packet to another (inspect header, check sync byte, skip 188 bytes and so on)
Finally it is a safety measure that helps you to spot errors (in the stream during transmission for example or in the process if a bug caused a bad alignment)
These argument are about TS but I think the same goes with your case : finding a sync word within a payload should not be an issue because you should always able to distinguish payload and header, most of the time with a length information (either because the size is fixed like in TP or because you have a TLV format).
can a false sync word be found or not in the payload of an MPEG-1/2
frame, outside its header?
According to this, "frame sync can be easily (and very frequently) found in any binary file." See the section titled "MPEG Audio Frame Header"
I confirmed this with an .mp3 song that I chose at random (stripped of ID3 tags). It had 5193 sync words, of which only 4898 were found to be valid (using code too long to be included here).
>>> f = open('notag.mp3', 'rb')
>>> r=f.read()
>>> r.count(b'\xff\xfb')
5193
why does the sync word mechanism even exist then? If we cannot make
sure that we found a new frame when reading fff, what is the purpose
of this sync word?
We can be (relatively) sure if we are checking the rest of the frame header, and not just the sync word. There are bits following the sync which can be used to:
identify a false positive or
give you useful info
With .mp3, you have to use those useful bits to calculate the size of the frame. By skipping ahead <frame-size> bytes before looking for the next sync word, you avoid any false syncs that may be present in the payload. See the section titled "How to calculate frame length" in that same link.

Compression File

I have a X bytes file. And I want to compress it in block of size 32Kb, for example.
Is there any lib that Can I do this?
I used Zlib for Delphi but I just can compress a full file in new compressed file.
Tranks a lot,
Pedro
Why don't you use a simple header to determine block boundaries? Consider this:
Read fixed amount of data from input into a buffer (say 32 KiB)
Compress that buffer with a "freshly created" deflate stream (underlying compression algorithm of ZLIB).
Write compressed size to output stream
Write compressed data to output stream
Go to step 1 until you reach end-of-file.
Pros:
You can decompress any block even in multi-threaded fashion.
Data corruption only limited to corrupted block. Rest of data can be restored.
Cons:
You loss most of contextual information (similarities between data). So, you will have lower compression ratio.
You need slightly more work.