FAST DMA benefit from FPGA using threads in C++

I am transferring data from an FPGA over PCIe through DMA, which is very fast. I have 500 data blocks, each comprising 80000 bytes. Receiving all 500 blocks and saving them to a .bin file takes 0.5 seconds. If I do the same with a .txt file (which is my final goal) it takes 15 seconds.
What I now want is to use threads in C++, where one thread (I call it the master thread) takes the DMA data (one block at a time) while 500 other threads (one per file) each wait for some trigger event before saving their block. (I don't have much of an idea here, since a CPU inherently runs sequentially, which causes problems for an FPGA designer used to working in the parallel domain.)
The case I have explained below could be the solution, but I need to know how to implement it in C++ if it is correct.
Case:
The first data block arrives (through DMA) in the master thread (where global memory is assigned using malloc()). The thread for file 1 is waiting for some trigger, and as soon as it gets this trigger, it copies the memory contents to its own allocated memory and starts saving them to its file. Meanwhile it also triggers the master thread to increment its counter and receive the next block, and the process continues for all 500 blocks.
I am mostly an FPGA guy, and C++ at this level is new to me. I am determined but stuck; I have really been in a mess for two days reading loads of material on threads (in C++), mainly starting from CreateThread() and going on and on. I thought WaitForSingleObject might be the solution, but I cannot understand how to implement it...
Any idea would be appreciated. I am not asking for code, just for the way to implement this. For example, those familiar with VHDL will know that in VHDL we can write
Code: wait until abc'event and abc = '1';
but what to do here?
Thanks
sraza
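For reference, the closest C++ analogue to that VHDL wait is a condition variable: one thread blocks until another thread changes a flag and notifies it. A minimal sketch (the names abc and worker are illustrative, not from the question):

// Sketch: a C++ equivalent of "wait until abc'event and abc = '1';"
// using std::condition_variable. Names are illustrative only.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool abc = false;                        // the "signal" being waited on

void worker() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return abc; });   // blocks until abc becomes true
    std::cout << "triggered, start saving data\n";
}

int main() {
    std::thread t(worker);
    {
        std::lock_guard<std::mutex> lock(m);
        abc = true;                      // drive the "signal"
    }
    cv.notify_one();                     // the 'event'
    t.join();
}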

The performance measurements you give show that the problem has nothing to do with DMA or threads. What's slow is converting the binary data to text.
Not surprising, since C++ iostreams are miserably slow and even the C stdio functions are significantly suboptimal.
Use an optimized function for number-to-string conversion, and your 15-second time for writing a text file will get a lot closer to the 0.5-second time you have for binary. I'd expect 1.0 second or less from this single change.
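To give a concrete idea of what an optimized conversion can look like, here is a minimal sketch using C++17 std::to_chars with one bulk fwrite per block; the uint32_t sample type and the block layout are assumptions, since the question does not describe the data format:

// Sketch only: convert one block (here assumed to be uint32_t samples) to
// text with std::to_chars instead of iostream formatting, then write the
// whole text buffer with a single fwrite call.
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

void write_block_as_text(const uint32_t* samples, std::size_t count, FILE* out) {
    std::vector<char> text(16 * count);            // worst-case text size
    char* p = text.data();
    for (std::size_t i = 0; i < count; ++i) {
        auto res = std::to_chars(p, p + 16, samples[i]);
        p = res.ptr;
        *p++ = '\n';
    }
    fwrite(text.data(), 1, static_cast<std::size_t>(p - text.data()), out);
}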

Related

How to write data into a buffer and write the buffer into a binary file with a second thread?

I am getting data from a sensor (a camera) and writing the data into a binary file. The problem is that it takes a lot of space on the disk.
So I used the compression from Boost (zlib) and the space required dropped a lot! The problem is that the compression process is slow and lots of data goes missing.
So I want to implement two threads: one gets the data from the camera and writes it into a buffer, and the second thread takes data from the front of the buffer and writes it into the binary file. In that case, all the data will be present.
How do I implement this buffer? It needs to expand dynamically and pop_front. Shall I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compression algorithm produces (the ratio of the size of its output to the size of its input). This is obviously somewhere between 0 and 1.
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor in real time, which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the rate at which compressed data is produced), you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing it, and again you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
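As a quick sanity check, those two conditions are easy to evaluate once you have measured the four numbers; the values below are placeholders, not measurements:

// Toy feasibility check for the pipeline; plug in your own measured SP, SC, RC, SW.
#include <cstdio>

int main() {
    double SP = 50e6;   // bytes/s produced by the sensor (placeholder)
    double SC = 80e6;   // bytes/s the compressor can consume (placeholder)
    double RC = 0.4;    // compressed size / uncompressed size (placeholder)
    double SW = 30e6;   // bytes/s the disk can sustain (placeholder)

    if (SC < SP)      std::puts("Trouble: cannot compress in real time.");
    if (SW < SP * RC) std::puts("Trouble: cannot write compressed data fast enough.");
    if (SC >= SP && SW >= SP * RC) std::puts("Speed-wise, the pipeline is feasible.");
}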
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
The most CPU-intensive part of the work is in the second thread, and the other two probably don't consume much CPU time and therefore probably can share a CPU core. This means that the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows or epoll on Linux), you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for various (usually constant time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run and issuing a write request into a file become negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk.) This usually means that K needs to be as large as possible. But if K is very large (many megabytes or hundreds of megabytes) then if your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10KiB to 1MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A plain std::deque or std::list or std::anything won't be usable by itself, but can be used as a good basis for writing a thread-safe queue (see the sketch after this list).
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that each call to the compression routine be matched with one call to the decompression routine later on. These constraints might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you can actually decompress and read the data later.
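Here is a minimal sketch of such a thread-safe queue, built on std::deque with a mutex and a condition variable; the compress() call in the usage comment is hypothetical:

// Sketch of a thread-safe queue, as suggested above. A minimal illustration,
// not a production implementation.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

template <typename T>
class ThreadSafeQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(item));
        }
        cv_.notify_one();                // wake one waiting consumer
    }

    T pop() {                            // blocks until an item is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }

private:
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Usage in the compression thread (Q1/Q2 hold whole buffers, not single bytes):
// std::vector<char> raw = Q1.pop();
// std::vector<char> compressed = compress(raw);   // compress() is hypothetical
// Q2.push(std::move(compressed));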

Multithreaded Realtime audio programming - To block or Not to block

When writing audio software, many people on the internet say it is paramount not to use memory allocation or blocking code, i.e. no locks, because these are non-deterministic and could cause the output buffer to underflow, making the audio glitch.
Real Time Audio Programming
When I write video software, I generally use both, i.e. allocating video frames on the heap and passing them between threads using locks and condition variables (bounded buffers). I love the power this provides, as a separate thread can be used for each operation, allowing the software to max out each of the cores and giving the best performance.
With audio I'd like to do something similar, passing frames of maybe 100 samples between threads, however, there are two issues.
How do I generate the frames without using memory allocation? I suppose I could use a pool of frames that have been pre-allocated, but this seems messy.
I'm aware you can use a lock-free queue, and that Boost has a nice library for this. This would be a great way to share data between threads, but constantly polling the queue to see if data is available seems like a massive waste of CPU time.
In my experience using mutexes doesn't actually take much time at all, provided that the section where the mutex is locked is short.
What is the best way to achieve passing audio frames between threads, whilst keeping latency to a minimum, not wasting resources and using relatively little non-deterministic behaviour?
Seems like you did your research! You've already identified the two main problems that could be the root cause of audio glitches. The question is: how much of this was important 10 years ago, and how much is only folklore and cargo-cult programming these days?
My two cents:
1. Heap allocations in the rendering loop:
These can have quite a lot of overhead depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so every time you mess with the heap your performance depends on what the other threads in your process are doing. If, for example, a GUI thread is currently deleting thousands of objects and you access the heap from the audio rendering thread at the same time, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it's just two functions that you can hide somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is a lot of opportunity to fine-tune and optimize your memory management. You can store your segments as a simple linked list, for example. Done right, this has the benefit that you hand out the most recently used buffer again, and that buffer has a very high probability of still being in the cache.
If fixed-size allocators don't work for you, have a look at ring buffers. They fit the use cases of streaming audio very well.
2. To lock, or not to lock:
I'd say that these days mutex and semaphore locks are fine if you can estimate that you do fewer than 1000 to 5000 of them per second (on a PC; things are different on something like a Raspberry Pi, etc.). If you stay below that range, it is unlikely that the overhead will show up in a performance profile.
Translated to your use case: if you work with 48kHz audio and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread producer/consumer pattern. That is well within the range. If you completely max out the rendering thread, the locking will not show up in a profile. If, on the other hand, you only use around 5% of the available processing power, the locks may show up, but then you won't have a performance problem either :-)
Going lock-less is also an option, but so are hybrid solutions that first do some lock-less tries and then fall back to hard locking. You'll get the best of both worlds that way. There is a lot of good stuff to read about this topic on the net.
In any case:
You should raise the thread priority of your non-GUI threads gently to make sure that if they run into a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion
'I suppose I could use a pool of frames that have been pre-allocated but this seems messy' - not really. Either allocate an array of frames, or new up frames in a loop, and then shove the indices/pointers onto a blocking queue. Now you have an auto-managed pool of frames. Pop one off when you need a frame, push it back on when you are done with it. No continual malloc/free/new/delete, no chance of memory runaway, simpler debugging, and frame flow control (if the pool runs out, threads asking for frames will wait until frames are released back into the pool), all built in.
Using an array may seem easier/safer/faster than a new loop, but newing individual frames does have an advantage - you can easily change the number of frames in the pool at runtime.
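A minimal sketch of such a pre-allocated pool built on a blocking queue; the frame size, pool depth and float sample type are assumptions:

// Sketch of a pre-allocated frame pool. acquire() blocks when the pool is
// empty, which gives the flow control described above for free.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Frame { std::vector<float> samples; };

class FramePool {
public:
    FramePool(std::size_t count, std::size_t samplesPerFrame) {
        frames_.resize(count);
        for (auto& f : frames_) {
            f.samples.resize(samplesPerFrame);
            free_.push_back(&f);            // all frames start out free
        }
    }

    Frame* acquire() {                      // blocks if no frame is available
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        Frame* f = free_.front();
        free_.pop_front();
        return f;
    }

    void release(Frame* f) {                // hand the frame back to the pool
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(f);
        }
        cv_.notify_one();
    }

private:
    std::vector<Frame> frames_;             // storage allocated once, up front
    std::deque<Frame*> free_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Usage: FramePool pool(32, 128); Frame* f = pool.acquire(); ... pool.release(f);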
Um, why are you passing frames of 100 samples between threads?
Assuming that you are working at a nominal sample rate of 44.1 kHz and passing 100 samples at a time between threads: 100 samples correspond to 100 / 44100 s ≈ 2.27 ms, and shared between the producer and the consumer that leaves a time slice of ~1.13 ms for every 100 samples you send. Nearly all operating systems schedule with time slices greater than 10 ms, so it is impossible to build an audio engine that hands off only 100 samples at a time between threads at 44.1 kHz on a modern OS.
The solution is to buffer more samples per time slice, either via a queue or by using larger frames. Most modern real time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).
Ultimately, the answer to your question is mostly the answer you would expect... Pass around uniquely owned queues of pointers to buffers, not the buffers themselves; manage ALL audio buffers in a fixed pool allocated at program start; and lock all queues for as little time as necessary.
Interestingly, this is one of the few good situations in audio programming where there is a distinct performance advantage to busting out the assembly code. You definitely don't want a malloc and free occurring with every queue lock. Operating-system provided atomic locking functions can ALWAYS be improved upon, if you know your CPU.
One last thing: there's no such thing as a lockfree queue. All multithread "lockfree" queue implementations rely on a CPU barrier intrinsic or a hard compare-and-swap somewhere to make sure that exclusive access to memory is guaranteed per thread.

Multi-threaded reading from a file in C++?

My application uses a text file to store its data.
I was testing for the fastest way of reading it by multithreading the operation.
I used the following 2 techniques:
Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates, each stream on a different thread. Divide the total number of lines in the file equally between the streams. Parse the text.
Only one stream parses the entire file and loads the data into memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse the data from memory.
The test was run on various file sizes 100kB - 800MB.
Data in file:
100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...
The data is stored in a 2D array of double.
Result: Both methods take approximately the same time for parsing the file.
Question: Which method should I choose?
Method 1 is bad for the hard disk, as multiple read accesses are performed at random locations simultaneously.
Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content and filling it again from the reader, but that increases the processing time.
Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). According to Amdahl's Law this will not scale indefinitely. It is a very fair and reliable method, though.
Method 1 has no bottleneck and will scale. Be sure not to cause random IO on the disk: I'd use a mutex so that only one thread reads at a time, in big sequential blocks of maybe 4-16MB. In the time the disk takes for a single head seek it could have read about 1MB of data.
If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.
To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to hand out lines as quickly as they are demanded; you would not get 1e6 times the throughput. This would not scale. But if 1e6 threads read independently from a very fast IO device, you would get 1e6 times the throughput, because there is no bottleneck. (I have used extreme numbers to make the point; the same idea applies in the small.)
I'd prefer a slightly modified method 2: read the data sequentially in a single thread in big chunks, and pass each ready chunk to a thread pool where the data is processed. That way you have concurrent reading and processing.
With enough RAM you can do it without a single-thread bottleneck. For Linux:
1) mmap your whole file into RAM with MAP_LOCKED; this requires root or a system-wide permissions tweak. Or use it without MAP_LOCKED on an SSD, since SSDs handle random access well.
2) Give each thread a start position. Each thread processes the data from the first newline after its own start position up to the first newline after the next thread's start position.
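A minimal POSIX sketch of that approach (error handling trimmed, MAP_LOCKED omitted, and parse_range() is a placeholder for the actual number parsing):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <thread>
#include <vector>

// Placeholder for the real work (e.g. strtod over every line in the range).
void parse_range(const char* begin, const char* end) { (void)begin; (void)end; }

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* file_end = data + st.st_size;

    unsigned n = std::max(1u, std::thread::hardware_concurrency());

    // Compute split points; each one is moved forward to just past the next
    // '\n' so that no line is cut in half between two threads.
    std::vector<const char*> split(n + 1);
    split[0] = data;
    split[n] = file_end;
    for (unsigned i = 1; i < n; ++i) {
        const char* nominal = data + st.st_size * i / n;
        const char* nl = static_cast<const char*>(
            std::memchr(nominal, '\n', file_end - nominal));
        split[i] = nl ? nl + 1 : file_end;
    }

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(parse_range, split[i], split[i + 1]);
    for (auto& t : workers) t.join();

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}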
PS: What is your program's CPU load? Probably the HDD is the bottleneck.

CPU Cores not Utilized properly using QThreads

Using: C++ (MinGW), Qt 4.7.4, Vista (OS), Intel Core 2 vPro
I need to process 2 huge files in exactly the same way, so I would like to call the processing routine from 2 separate threads for the 2 separate files. The GUI thread does nothing heavy; it just displays a label and runs an event loop to check for the emission of thread-termination signals and quits the main application accordingly. I expected this to utilize the two cores (Intel Core 2) roughly equally, but on the contrary I see from Task Manager that one of the cores is highly utilized and the other is not (though not every time I run the code). Also, the time taken to process the 2 files is much more than the time taken to process one file (I thought it should be equal or a little more, but it is almost the same as processing the 2 files one after the other in a non-threaded application). Can I somehow force the threads to use the cores that I specify?
QThread* ptrThread1=new QThread;
QThread* ptrThread2=new QThread;
ProcessTimeConsuming* ptrPTC1=new ProcessTimeConsuming();
ProcessTimeConsuming* ptrPTC2=new ProcessTimeConsuming();
ptrPTC1->moveToThread(ptrThread1);
ptrPTC2->moveToThread(ptrThread2);
//make connections to specify what to do when processing ends, threads terminate etc
//display some label to give an idea that the code is in execution
ptrThread1->start();
ptrThread2->start(); //i want this thread to be executed in the core other than the one used above
ptrQApplication->exec(); //GUI event loop for label display and signal-slot monitoring
Reading in parallel from a single mechanical disk will often (and probably in your case) not yield any performance gain, since the mechanical head of the disk needs to move every time to seek the next reading location, effectively making your reads sequential. Worse, if a lot of threads are trying to read, performance may even degrade with respect to the sequential version, because the disk head is bounced to different locations on the disk and thus needs to travel back to where it left off every time.
Generally, you cannot do better than reading the files in a sequence and then processing them in parallel using perhaps a producer-consumer model.
With mechanical hard drives, you need to explicitly control the ratio of time spent doing sequential reads vs. time spent seeking. The canonical way of doing it is with n+m objects running on m+min(n, QThread::idealThreadCount()) threads. Here, m is the number of hard drives that the files are on, and n is the number of files.
Each of the m objects reads files from its given hard drive in a round-robin fashion. Each read must be sufficiently large. On modern hard drives, let's budget 70 MB/s of bandwidth (you can benchmark the real value) and 5 ms per seek. To waste at most 10% of the bandwidth, you only have 100 ms of seek time per second, or 100 ms / (5 ms per seek) = 20 seeks per second. Thus you must read at least 70 MB / (20 seeks + 1) ≈ 3.3 MB from each file before reading from the next file. This thread fills a buffer with file data, and the buffer then signals the relevant computation object attached to the other side of the buffer. When a buffer is busy, you simply skip reading from the given file until the buffer becomes available again.
The other n objects are computation objects, they perform a computation upon a signal from a buffer that indicates the buffer is full. As soon as the buffer data is not needed anymore, the buffer is "reset" so that the file reader can refill it.
All reader objects need their own threads. The computation objects can be distributed among their own threads in a round-robin fashion, so that the threads all have within +1, -0 objects of each other.
I thought my empirical data might be of some use to this discussion. I have a directory with 980 txt files that I would like to read. In the Qt/C++ framework, running on an Intel i5 quad core, I created a GUI application and added a worker class to read a file given its path. I pushed the worker into a thread, then repeated the run adding an additional thread each time. I timed roughly 13 minutes with 1 thread, 9 minutes with 2, and 8 minutes with 3. So in my case there was some benefit, but it degraded quickly.

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations and write the calculated data to a file; repeat 500,000 times (overall, 500,000 files are written one after the other); repeat 2 more times (overall, 1.5 million files are written).
Read the data from a file and make some intense calculations with it; repeat for 1,500,000 iterations (iterating over all the files written in step 1).
Repeat step 2 for 200 iterations.
Each file is ~212kB, so overall I have ~300GB of data. It looks like the entire process takes ~40 days on a 2.8 GHz Core 2 Duo CPU.
My problem (as you can probably guess) is the time it takes to complete the entire process. All the calculations are serial (each calculation depends on the one before), so I can't parallelize this process across different CPUs or PCs. I'm trying to think how to make the process more efficient, and I'm pretty sure most of the overhead goes to file system access (duh...). Every time I access a file I open a handle to it and then close it once I finish reading the data.
One of my ideas to improve the run time was to use one big file of 300GB (or several big files of 50GB each); then I would only use one open file handle and simply seek to each relevant piece of data and read it. But I'm not sure what the overhead of opening and closing file handles is. Can someone shed some light on this?
Another idea I had was to group the files into bigger ~100MB files and then read 100MB each time instead of doing many 212kB reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or has any idea how to improve the run time, I would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process; it looks like the calculations take 62% of the runtime and the file reads take 34%. Meaning that even if I miraculously eliminated the file I/O cost, I'd still be left with about 24 days, which is quite an improvement, but still a long time :)
Opening a file handle probably isn't the bottleneck; actual disk IO is. If you can parallelize disk access (e.g. by using multiple disks, faster disks, a RAM disk, ...) you may benefit far more. Also, make sure IO doesn't block the application: read from disk and process while waiting for the IO, e.g. with a reader thread and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
"Each file is ~212k, so over all I have ~300Gb of data. It looks like the entire process takes ~40 days ... all the calculations are serial (each calculation is dependent on the one before), so I can't parallel this process to different CPUs or PCs. ... pretty sure the most of the overhead goes to file system access ... Every time I access a file I open a handle to it and then close it once I finish reading the data."
Writing 300GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is that the fastest implementation of this application will use a memory-mapped file; all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, but you should not need 300GB of RAM. Map the whole file into your address space at one time and just read and write your data with pointers.
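For illustration, a minimal sketch of that approach on Windows (error handling trimmed; the file name, the fixed 212 kB record size and the record index are assumptions based on the question, and the same idea exists as mmap() on POSIX systems):

#include <windows.h>
#include <cstdint>

int main() {
    // Open and map the one big data file; on a 64-bit build the whole ~300GB
    // can be mapped into the address space at once.
    HANDLE file = CreateFileW(L"all_data.bin", GENERIC_READ | GENERIC_WRITE, 0,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE, 0, 0, nullptr);
    auto* base = static_cast<uint8_t*>(
        MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0));   // map the whole file

    const uint64_t recordSize = 212 * 1024;     // assumed fixed record size
    uint64_t i = 123456;                        // whichever record is needed next
    uint8_t* record = base + i * recordSize;    // no open/seek/read/close per record
    volatile uint8_t probe = record[0];         // touching it pages it in on demand
    (void)probe;
    // ... run the calculation directly on 'record' and write results back in place

    UnmapViewOfFile(base);
    CloseHandle(mapping);
    CloseHandle(file);
}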
From your brief explanation it sounds like xtofl's suggestion of threads is the correct way to go. I would recommend you profile your application first, though, to see how the time is divided between IO and CPU.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into RAM, then places the data/pointers in the queue. If the queue goes over a certain size the thread sleeps; if it goes below a certain size it starts again.
Thread 2 reads the data off the queue and does the calculations, then writes the results to the second queue.
Thread 3 reads the second queue and writes the data to disk.
You could consider merging threads 1 and 3; this might reduce contention on the disk, as your app would then only do one disk operation at a time.
Also, how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (GUI file manager / dir / ls)? If that performance is bad, you might be working outside your file system's comfort zone. Some file systems are optimised for particular types of file usage, e.g. large files, lots of small files, etc., although you can generally only change the file system on Unix. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
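For what that might look like, a minimal sketch using the SQLite C API with a single table of BLOBs, one row per chunk of calculated data (the table layout, file name and chunk size are illustrative, not from the answer):

// Sketch of the single-table SQLite idea: each "file" becomes one BLOB row
// keyed by its index, so there is only one file handle for the whole data set.
#include <sqlite3.h>
#include <vector>

bool store_chunk(sqlite3* db, int id, const std::vector<char>& data) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO chunks(id, data) VALUES(?, ?)",
                       -1, &stmt, nullptr);
    sqlite3_bind_int(stmt, 1, id);
    sqlite3_bind_blob(stmt, 2, data.data(), static_cast<int>(data.size()),
                      SQLITE_TRANSIENT);
    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("calc_data.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS chunks(id INTEGER PRIMARY KEY, data BLOB)",
                 nullptr, nullptr, nullptr);
    store_chunk(db, 0, std::vector<char>(212 * 1024, 0));   // placeholder payload
    sqlite3_close(db);
}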
Using memory-mapped files should also be investigated, as it will reduce the number of system calls.