Simple Fast read-process - c++

I want to decompress data from a file on a very slow device (read speed=1Mo/s). The decompression algorithm can at least perform three times this speed.
What is the fastest way to parallelize those tasks in C/C++ so that reading process cannot be slow down by decompression and so use maximum bandwidth.
I already tried two threads with regular pipes. But I don't know if it is the best solution. At least it is not a zero-copy algo.
My current algo is wrong because I cannot success to perform blocking IO on pipes. (tried fcntl or fread/fdopen)
My unparallelized program is very simple. Somethings like
while(remainingToRead > 0){
int nb = fread(buffer, 1, bufferSize);
decompress(buffer, nb, bufferOut);
nb -= remainingToRead
}

One solution here is to use separate threads to read and decompress and to use two buffers so that these operations can be overlapped.
Pseudo code:
Read thread:
while (not finished)
{
while (no buffers free)
wait on condition variable
fill next buffer
mark buffer in use
set condition variable for decompression thread
}
Decompression thread:
while (not finished)
{
while (no buffers full)
wait on condition variable
decompress next buffer
mark buffer free
set condition variable for read thread
}
Note that getting this to work involves getting quite a bit of detail right - multi-threaded programming is always tricky.

Related

AT command response parser

I am working on my own implementation to read AT commands from a Modem using a microcontroller and c/c++
but!! always a BUT!! after I have two "threads" on my program, the first one were I am comparing the possible reply from the Moden using strcmp which I believe is terrible slow
comparing function
if (strcmp(reply, m_buffer) == 0)
{
memset(buffer, 0, buffer_size);
buffer_size = 0;
memset(m_buffer, 0, m_buffer_size);
m_buffer_size = 0;
return 0;
}
else
return 1;
this one works fine for me with AT commands like AT or AT+CPIN? where the last response from the Modem is "OK" and nothing in the middle, but it is not working with commands like AT+CREG?, wheres it responses:
+REG: n,n
OK
and I am specting for "+REG: n,n" but I believe strncpy is very slow and my buffer data is replaced for "OK"
2nd "thread" where it enables a UART RX interruption and replaces my buffer data every time it receives new data
Interruption handle:
m_buffer_size = buffer_size;
strncpy(m_buffer, buffer, buffer_size + m_buffer_size);
Do you know any out there faster than strcmp? or something to improve the AT command responses reading?
This has the scent of an XY Problem
If you have seen the buffer contents being over written, you might want to look into a thread safe queue to deliver messages from the RX thread to the parsing thread. That way even if a second message arrives while you're processing the first, you won't run into "buffer overwrite" problems.
Move the data out of the receive buffer and place it in another buffer. Two buffers is rarely enough, so create a pool of buffers. In the past I have used linked lists of pre-allocated buffers to keep fragmentation down, but depending on the memory management and caching smarts in your microcontroller, and the language you elect to use, something along the lines of std::deque may be a better choice.
So
Make a list of free buffers.
When a the UART handling thread loop looks something like,
Get a buffer from the free list
Read into the buffer until full or timeout
Pass buffer to parser.
Parser puts buffer in its own receive list
Parsing sends a signal to wake up its thread.
Repeat until terminated. If the free list is emptied, your program is probably still too slow to keep up. Perhaps adding more buffers will allow the program to get through a busy period, but if the data flow is relatively constant and the free list empties out... Well, you have a problem.
Parser loop also repeats until terminated looks like:
If receive list not empty,
Get buffer from receive list
Process buffer
Return buffer to free list
Otherwise
Sleep
Remember to protect the lists from concurrent access by the different threads. C11 and C++11 have a number of useful tools to assist you here.

c++ streaming udp data into a queue?

I am streaming data as a string over UDP, into a Socket class inside Unreal engine. This is threaded, and runs in the background.
My read function is:
float translate;
void FdataThread::ReceiveUDP()
{
uint32 Size;
TArray<uint8> ReceivedData;
if (ReceiverSocket->HasPendingData(Size))
{
int32 Read = 0;
ReceivedData.SetNumUninitialized(FMath::Min(Size, 65507u));
ReceiverSocket->RecvFrom(ReceivedData.GetData(), ReceivedData.Num(), Read, *targetAddr);
}
FString str = FString(bytesRead, UTF8_TO_TCHAR((const UTF8CHAR *)ReceivedData));
translate = FCString::Atof(*str);
}
I then call the translate variable from another class, on a Tick, or timer.
My test case sends an incrementing number from another application.
If I print this number from inside the above Read function, it looks as expected, counting up incrementally.
When i print it from the other thread, it is missing some of the numbers.
I believe this is because I call it on the Tick, so it misses out some data due to processing time.
My question is:
Is there a way to queue the incoming data, so that when i pull the value, it is the next incremental value and not the current one? What is the best way to go about this?
Thank you, please let me know if I have not been clear.
Is this the complete code? ReceivedData isn't used after it's filled with data from the socket. Instead, an (in this code) undefined variable 'buffer' is being used.
Also, it seems that the while loop could run multiple times, overwriting old data in the ReceivedData buffer. Add some debugging messages to see whether RecvFrom actually reads all bytes from the socket. I believe it reads only one 'packet'.
Finally, especially when you're using UDP sockets over the network, note that the UDP protocol isn't guaranteed to actually deliver its packets. However, I doubt this is causing your problems if you're using it on a single computer or a local network.
Your read loop doesn't make sense. You are reading and throwing away all datagrams but the last in any given sequence that happen to be in the socket receive buffer at the same time. The translate call should be inside the loop, and the loop should be while(true), or while (running), or similar.

C++/C Multiple threads to read gz file simultaneously

I am attempting to read a gzip-compressed file from multiple threads.
I was thinking this would significantly speed up decompression process as my gzread functions in multiple threads start from different file offset (using gseek), hence they read different parts of the file.
The simplified code is like
// in threads
auto gf = gzopen("file.gz",xxx);
gzseek(gf,offset);
gzread(xx);
gzclose(gf);
To my surprise, my multi-thread version program does not speed up at all. The 20-thread version uses exactly the same time as the single-thread version. I am pretty sure this is far away from the disk bottleneck.
I guess the zlib inflation functionality may need to decompress the entire file for reading even a small part, but I failed to get any clue from their manual.
Anyone have an idea how to speed up in my case?
Short answer: due to the serial nature of a deflate stream, gzseek() must decode all of the compressed data from the start up to the requested seek point. So you can't get any gain with what you are trying to do. In fact, the total cycles spent will increase with the square of the length of the compressed data! So don't do that.
tl;dr: zlib isn't designed for random access. It seems possible to implement, though requiring a complete read-through to build an index, so it might not be helpful in your case.
Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:
/* if within raw area while reading, just go there */
if (state->mode == GZ_READ && state->how == COPY &&
state->x.pos + offset >= 0) {
"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:
int how; /* 0: get header, 1: copy, 2: decompress */
Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:
state->seek = 1;
state->skip = offset;
gzread, when called, processes this with a call to gz_skip:
if (state->seek) {
state->seek = 0;
if (gz_skip(state, state->skip) == -1)
return -1;
}
Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.
zlib implementation have no multithreading (http://www.zlib.net/zlib_faq.html#faq21 - "Is zlib thread-safe? - Yes. ... Of course, you should only operate on any given zlib or gzip stream from a single thread at a time.") and will decompress "entire file" up to seeked position.
And zlib format has bad alignment (bit alignment) / no offset fields (deflate format) to enable parallel decompression/seeking.
You may try another implementations of z (deflate/inflate), for example, http://zlib.net/pigz/ (or switch from ancient compression from the era of single core to non-zlib modern parallel formats, xz/lzma/something from google)
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. To compile and use pigz, please read the README file in the source code distribution. You can read the pigz manual page here.
The manual page is http://zlib.net/pigz/pigz.pdf and it has useful information.
It uses format compatible to zlib, but adopted to parallel compress:
Each partial raw deflate stream is terminated by an empty stored block ... in order to end that partial bit stream at a byte boundary.
Still, DEFLATE format is bad for parallel decompression:
Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. Asaresult, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.

Parse very large CSV files with C++

My goal is to parse large csv files with C++ in a QT project in OSX environment.
(When I say csv I mean tsv and other variants 1GB ~ 5GB ).
It seems like a simple task , but things get complicated when file sizes get bigger. I don't want to write my own parser because of the many edge cases related to parsing csv files.
I have found various csv processing libraries to handle this job, but parsing 1GB file takes about 90 ~ 120 seconds on my machine which is not acceptable. I am not doing anything with the data right now, I just process and discard the data for testing purposes.
cccsvparser is one of the libraries I have tried . But the the only fast enough library was fast-cpp-csv-parser which gives acceptable results: 15 secs on my machine, but it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically in runtime .
The other approach I have tried is to read csv file line by line with fast-cpp-csv-parser LineReader class which is really fast (about 7 secs to read whole file), and then parse each line with cccsvparser lib that can process strings, but this takes about 40 seconds until done, it is an improvement compared to the first attempts but still unacceptable.
I have seen various Stack Overflow questions related to csv file parsing none of them takes large file processing in to account.
Also I spent a lot of time googling to find a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out of the box solutions.
I will appreciate any suggestion about how to handle this problem.
Edit:
When using #fbucek's approach, processing time reduced to 25 seconds, which is a great improvement.
can we optimize this even more?
I am assuming you are using only one thread.
Multithreading can speedup your process.
Best accomplishment so far is 40 sec. Let's stick to that.
I have assumed that first you read then you process -> ( about 7 secs to read whole file)
7 sec for reading
33 sec for processing
First of all you can divide your file into chunks, let's say 50MB.
That means that you can start processing after reading 50MB of file. You do not need to wait till whole file is finished.
That's 0.35 sec for reading ( now it is 0.35 + 33 second for processing = cca 34sec )
When you use Multithreading, you can process multiple chunks at a time. That can speedup process theoretically up to number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think you can speed up you processing with 4 cores up to 9 s. in total.
Look at QThreadPool and QRunnable or QtConcurrent
I would prefer QThreadPool
Divide task into parts:
First try to loop over file and divide it into chunks. And do nothing with it.
Then create "ChunkProcessor" class which can process that chunk
Make "ChunkProcessor" a subclass of QRunnable and in reimplemented run() function execute your process
When you have chunks, you have class which can process them and that class is QThreadPool compatible, you can pass it into
It could look like this
loopoverfile {
whenever chunk is ready {
ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
}
}
You can use std::share_ptr to pass processed data in order not to use QMutex or something else and avoid serialization problems with multiple thread access to some resource.
Note: in order to use custom signal you have to register it before use
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
Edit: (based on discussion, my answer was not clear about that)
It does not matter what disk you use or how fast is it. Reading is single thread operation.
This solution was suggested only because it took 7 sec to read and again does not matter what disk it is. 7 sec is what's count. And only purpose is to start processing as soon as possible and not to wait till reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use principal idea: ( I do not know why it take 7 sec to read, what is behind it )
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as read by line "without" interruption.
What you do with data is another problem, but has nothing to do with I/O. It is already in memory.
So only concern would be 5GB file and ammout of RAM on the machine.
It is very simple solution all you need is subclass QRunnable, reimplement run function, emit signal when it is finished, pass processed data using shared pointer and in main thread joint that data into one structure or whatever. Simple thread safe solution.
I would propose a multi-thread suggestion with a slight variation is that one thread is dedicated to reading file in predefined (configurable) size of chunks and keeps on feeding data to a set of threads (more than one based cpu cores). Let us say that the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from file. In this class, it holds a data structure which is used to communicate with process threads. For example this structure would contain starting offset, ending offset of the read buffer for each process thread. For reading file data, reader class holds 2 buffers each of chunk size (50 MB in this case)
Create a process class which holds a pointers (shared) for the read buffers and offsets data structure.
Now create driver (probably main thread), creates all the threads and waiting on their completion and handles the signals.
Reader thread is invoked with reader class, reads 50 MB of the data and based on number of threads creates offsets data structure object. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB and so on. Once ready, it notifies processor threads. It then immediately reads next chunk from disk and waits for processor thread to completion notification from process threads.
Processor threads on the notification, reads data from buffer and processes it. Once done, it notifies reader thread about completion and waits for next chunk.
This process completes till the whole data is read and processed. Then reader thread notifies back to the main thread about completion which sends PROCESS_COMPLETION, upon all threads exits. or main thread chooses to process next file in the queue.
Note that offsets are taken for easy explanation, offsets to line delimiter mapping needs to be handled programmatically.
If the parser you have used is not distributed obviously the approach is not scalable.
I would vote for a technique like this below
chunk up the file into a size that can be handled by a machine / time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing at block sizes for a given chunk
Avoid threads on same resource (i.e given block) to save yourself from all thread related issues.
Use threads to achieve non competing (on a resource) operations - such as reading on one thread and writing on a different thread to a different file.
do your parsing (now for this small chunk you can invoke your parser).
do your operations.
merge the results back / if can distribute them as they are..
Now, having said that, why can't you use Hadoop like frameworks?

Restarting Streaming OpenAL Source?

Why does my streaming OpenAL source somtimes go to AL_STOPPED state, forcing me to call alSourcePlay? This usually happens when I do not call send fast enough, i.e. in debug mode. Does the oal source automatically stop when it doesn't have enough queue buffers? How do I avoid that?
void send(audio_buffer audio) override
{
ALenum state;
alGetSourcei(source_, AL_SOURCE_STATE,&state);
if(state != AL_PLAYING)
alSourcePlay(source_); // This happens sometimes, usually when "send" is not called fast enough.
ALuint buffer = 0;
alSourceUnqueueBuffers(source_, 1, &buffer);
if(buffer)
{
alBufferData(buffer, AL_FORMAT_STEREO16, audio.data(), static_cast<ALsizei>(audio.size()*sizeof(int16_t)), 48000);
alSourceQueueBuffers(source_, 1, &buffer);
}
else
LOG << "Dropped audio.";
}
It sounds like your basic problem is that your audio stream is starved. There are a few options you can use to mitigate this, but they all have their own side effects:
(1) You can configure it to play from a looping buffer, to which you are supplying the relevant data. The downside to this is that it will audibly repeat itself if you starve the buffer too long, but it will have some better performance characteristics (fragmentation, etc).
(2) You can increase the send buffer size. This will only cover up small problems, and potentially increases the latency in dynamic content.
(3) Finally, you can thread the audio send operation, that way so long as the audio thread isn't starved, it can continue to send data in the background.
The high production / quality solution probably involes all three of these. Sorry for the lack of OpenAL specific terminology, but every audio system I've seen has these capabilities.