Multithreaded real-time audio programming - to block or not to block - C++

When writing audio software, many people on the internet say it is paramount not to use memory allocation or blocking code, i.e. no locks, because these are non-deterministic and could cause the output buffer to underflow so that the audio glitches.
Real Time Audio Programming
When I write video software, I generally use both, i.e. allocating video frames on the heap and passing them between threads using locks and condition variables (bounded buffers). I love the power this provides, as a separate thread can be used for each operation, allowing the software to max out all of the cores for the best performance.
With audio I'd like to do something similar, passing frames of maybe 100 samples between threads; however, there are two issues.
How do I generate the frames without using memory allocation? I suppose I could use a pool of frames that have been pre-allocated but this seems messy.
I'm aware you can use a lock-free queue, and that Boost has a nice library for this. This would be a great way to share data between threads, but constantly polling the queue to see if data is available seems like a massive waste of CPU time.
In my experience using mutexes doesn't actually take much time at all, provided that the section where the mutex is locked is short.
What is the best way to achieve passing audio frames between threads, whilst keeping latency to a minimum, not wasting resources and using relatively little non-deterministic behaviour?

Seems like you did your research! You've already identified the two main problems that could be the root cause of audio glitches. The question is: how much of this was important 10 years ago, and how much is just folklore and cargo-cult programming these days?
My two cents:
1. Heap allocations in the rendering loop:
These can have quite a lot of overhead depending on how small your processing chunks are. The main culprit is that very few run-times have a per-thread heap, so each time you mess with the heap your performance depends on what other threads in your process are doing. If, for example, a GUI thread is currently deleting thousands of objects and you access the heap from the audio rendering thread at the same time, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it's just two functions that you can hide somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is a lot of opportunity to fine-tune and optimize your memory management. You can store your free segments as a simple linked list, for example. Done right, this has the benefit that you hand out the most recently used buffer again, which has a very high probability of being in the cache.
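A minimal sketch of such a fixed-size pool (the names, the float-sample type and the sizes are illustrative, not from the answer): everything is allocated once up front, and the free list is used LIFO so the most recently released frame, which is most likely still in the cache, is handed out first. Sharing it between threads would still need the synchronisation discussed in point 2.

#include <cstddef>
#include <vector>

// Hypothetical fixed-size frame pool: all storage is allocated once in the
// constructor, so acquire()/release() never touch the heap. The free list
// is used as a LIFO stack, so the most recently released frame (the one
// most likely to be cache-hot) is reused first. Not thread-safe by itself.
class FramePool {
public:
    FramePool(std::size_t frameCount, std::size_t samplesPerFrame)
        : storage_(frameCount * samplesPerFrame),
          samplesPerFrame_(samplesPerFrame)
    {
        freeList_.reserve(frameCount);
        for (std::size_t i = 0; i < frameCount; ++i)
            freeList_.push_back(&storage_[i * samplesPerFrame]);
    }

    std::size_t frameSize() const { return samplesPerFrame_; }

    // Returns nullptr when the pool is exhausted; the caller decides
    // whether that means dropping a frame or sizing the pool bigger.
    float* acquire() {
        if (freeList_.empty()) return nullptr;
        float* frame = freeList_.back();   // last released = likely cached
        freeList_.pop_back();
        return frame;
    }

    // Never reallocates: capacity was reserved for frameCount entries.
    void release(float* frame) { freeList_.push_back(frame); }

private:
    std::vector<float> storage_;     // one big pre-allocated block
    std::vector<float*> freeList_;   // LIFO stack of free frame pointers
    std::size_t samplesPerFrame_;
};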
If fixed size allocators don't work for you have a look at ring-buffers. They fit the use-cases of streaming audio very well.
2. To lock, or not to lock:
I'd say that these days using mutex and semaphore locks is fine if you can estimate that you do fewer than 1000 to 5000 of them per second (on a PC; things are different on something like a Raspberry Pi). If you stay below that range it is unlikely that the overhead shows up in a performance profile.
Translated to your use-case: if you, for example, work with 48kHz audio and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread consumer/producer pattern. That is well within the range. Even if you completely max out the rendering thread, the locking will not show up in a profile. If, on the other hand, you only use around 5% of the available processing power, the locks may show up, but you won't have a performance problem either :-)
Going lock-less is also an option, but so are hybrid solutions that first do some lock-less tries and then fall back to hard locking. You'll get the best of both worlds that way. There is a lot of good stuff to read about this topic on the net.
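A sketch of the hybrid idea with a plain std::mutex (the retry count is an arbitrary placeholder to tune, not a recommendation from the answer): spin on try_lock a few times, and only fall back to a blocking lock if that fails.

#include <mutex>

// Hybrid locking sketch: a handful of optimistic try_lock attempts first,
// then fall back to a normal blocking lock. 64 retries is arbitrary and
// should be tuned on the target machine.
template <class Mutex>
void hybrid_lock(Mutex& m) {
    for (int i = 0; i < 64; ++i) {
        if (m.try_lock())
            return;        // acquired without blocking
    }
    m.lock();              // give up spinning and block properly
}

This would be used in place of a plain lock() around a short critical section, e.g. the one that swaps buffer pointers between producer and consumer.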
In any case:
You should gently raise the thread priority of your non-GUI threads to make sure that if they run into a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is, and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion

'I suppose I could use a pool of frames that have been pre-allocated but this seems messy' - not really. Either allocate an array of frames, or new up frames in a loop, and then shove the indices/pointers onto a blocking queue. Now you have an auto-managed pool of frames. Pop one off when you need a frame, push it back on when you are done with it. No continual malloc/free/new/delete, no chance of memory runaway, simpler debugging, and frame flow control (if the pool runs out, threads asking for frames will wait until frames are released back into the pool), all built in.
Using an array may seem easier/safer/faster than a new loop, but newing individual frames does have an advantage - you can easily change the number of frames in the pool at runtime.
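A minimal sketch of such an auto-managed pool built on a blocking queue (all names and the Frame type are illustrative): acquire() blocks when the pool is empty, which is exactly the flow control described above. A real implementation might swap std::queue for a fixed-capacity ring to avoid its occasional internal allocations.

#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

// Blocking pool of pre-allocated frames. acquire() waits while the pool
// is empty, so producers automatically stall until consumers release
// frames back into the pool.
struct Frame { std::vector<float> samples; };

class BlockingFramePool {
public:
    BlockingFramePool(std::size_t count, std::size_t samplesPerFrame) {
        for (std::size_t i = 0; i < count; ++i) {
            frames_.emplace_back(new Frame{std::vector<float>(samplesPerFrame, 0.0f)});
            free_.push(frames_.back().get());
        }
    }

    Frame* acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !free_.empty(); });
        Frame* f = free_.front();
        free_.pop();
        return f;
    }

    void release(Frame* f) {
        { std::lock_guard<std::mutex> lk(m_); free_.push(f); }
        cv_.notify_one();
    }

private:
    std::vector<std::unique_ptr<Frame>> frames_;  // owns the storage
    std::queue<Frame*> free_;                     // pointers currently free
    std::mutex m_;
    std::condition_variable cv_;
};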

Um, why are you passing frames of 100 samples between threads?
Assuming that you are working at a nominal sample rate of 44.1kHz and passing 100 samples at a time between threads, your thread-switching interval has to be at most 100 samples / (44100 samples/s * 2); the 2 accounts for both the producer and the consumer. That means you have a time slice of ~1.13 ms for every 100 samples you send. Nearly all operating systems run at time slices greater than 10 ms. So it is impossible to build an audio engine where you are sharing only 100 samples at a time between threads at 44.1kHz on a modern OS.
The solution is to buffer more samples per time slice, either via a queue or by using larger frames. Most modern real time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).
Ultimately, the answer to your question is mostly the answer you would expect... Pass around uniquely owned queues of pointers to buffers, not the buffers themselves; manage ALL audio buffers in a fixed pool allocated at program start; and lock all queues for as little time as necessary.
Interestingly, this is one of the few good situations in audio programming where there is a distinct performance advantage to busting out the assembly code. You definitely don't want a malloc and free occurring with every queue lock. Operating-system provided atomic locking functions can ALWAYS be improved upon, if you know your CPU.
One last thing: there's no such thing as a lockfree queue. All multithread "lockfree" queue implementations rely on a CPU barrier intrinsic or a hard compare-and-swap somewhere to make sure that exclusive access to memory is guaranteed per thread.
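To make that concrete, here is a minimal single-producer/single-consumer ring of pointers (a simplified sketch, not a production queue, and not the MPMC case): even this "lock-free" structure depends on atomic loads and stores with acquire/release ordering, which is the hardware-level synchronisation referred to above.

#include <atomic>
#include <cstddef>

// Minimal SPSC ring of pointers. "Lock-free" still means leaning on
// atomics with acquire/release ordering (i.e. hardware barriers) to hand
// buffers from the producer thread to the consumer thread.
template <typename T, std::size_t N>   // N must be a power of two
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(T* item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;                              // full
        slots_[head] = item;
        head_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    bool pop(T*& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                              // empty
        item = slots_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }

private:
    T* slots_[N] = {};
    std::atomic<std::size_t> head_{0};   // written only by the producer
    std::atomic<std::size_t> tail_{0};   // written only by the consumer
};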

Related

Multiple Producer Multiple Consumer Lockfree Non Blocking Ring Buffer With Variable Length Write

I want to pass variable-length messages from multiple producers to multiple consumers, with a low-latency queue, on multi-socket Xeon E5 systems. (400 bytes with a latency of 300 ns would be nice, for example.)
I've looked for existing implementations of lockless multiple-producer/multiple-consumer (MPMC) queues using a non-blocking ring buffer. But most implementations/algorithms online are node based (i.e. the node is fixed-length), such as boost::lockfree::queue, midishare, etc.
Of course, one can argue that the node type could be set to uint8_t or the like, but then writes will be clumsy and the performance will be horrible.
I'd also like the algorithm to offer overwrite detection on the readers' side, so that readers can detect when the data they are reading has been overwritten.
How can I implement a queue (or something else) that does this?
Sorry for a slightly late answer, but have a look at DPDK's Ring library. It is free (BSD license), blazingly fast (I doubt you will find a faster free solution) and supports all major architectures. There are lots of examples as well.
to pass variable-length messages
The solution is to pass a pointer to a message, not the whole message. DPDK also offers a memory pool library to allocate/deallocate buffers between multiple threads or processes. The memory pool is also fast and lock-free, and supports many architectures.
So the overall solution would be (a rough code sketch follows the steps below):
Create mempool(s) to share buffers among threads/processes. Each mempool supports just a fixed-size buffer, so you might want to create a few mempools to match your needs.
Create one MPMC ring or a set of SPSC ring pairs between your threads/processes. The SPSC solution might be faster, but it might not fit your design.
Producer allocates a buffer, fills it and passes a pointer to that buffer via the ring.
Consumer receives the pointer, reads the message and deallocates the buffer.
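A rough sketch of that flow using DPDK's mempool and ring APIs (EAL initialisation, creating the pool and ring with rte_mempool_create / rte_ring_create, and error handling are all left out; the produce/consume wrapper functions here are just illustrative):

#include <stddef.h>
#include <string.h>
#include <rte_mempool.h>
#include <rte_ring.h>

/* Producer: take a fixed-size buffer from the pool, fill it, and pass
 * only the pointer through the ring. */
static int produce(struct rte_mempool *pool, struct rte_ring *ring,
                   const void *msg, size_t len)
{
    void *buf;
    if (rte_mempool_get(pool, &buf) != 0)
        return -1;                       /* pool exhausted */
    memcpy(buf, msg, len);               /* len must fit the element size */
    if (rte_ring_enqueue(ring, buf) != 0) {
        rte_mempool_put(pool, buf);      /* ring full: give the buffer back */
        return -1;
    }
    return 0;
}

/* Consumer: receive the pointer, read the message, return the buffer. */
static int consume(struct rte_mempool *pool, struct rte_ring *ring)
{
    void *buf;
    if (rte_ring_dequeue(ring, &buf) != 0)
        return -1;                       /* ring empty */
    /* ... process the message stored in buf ... */
    rte_mempool_put(pool, buf);
    return 0;
}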
Sounds like a lot of work, but there are lots of optimizations inside DPDK mempools and rings. But will it fit 300ns?
Have a look at the official DPDK performance reports. While there is no official report for ring performance, there are vhost/virtio test results. Basically, packets travel like this:
Traffic gen. -- Host -- Virtual Machine -- Host -- Traffic gen.
Host runs as one process, virtual machine as another.
The test result is ~4M packets per second for 512 byte packets. It does not fit your budget, but you need to do much, much less work...
You probably want to put pointers in your queue, rather than actually copying data into / out of the shared ring itself. i.e. the ring buffer payload is just a pointer.
Release/acquire semantics takes care of making sure that the data is there when you dereference a pointer you get from the queue. But then you have a deallocation problem: how does a producer know when a consumer is done using a buffer so it can reuse it?
If it's ok to hand over ownership of the buffer, then that's fine. Maybe the consumer can use the buffer for something else, like add it to a local free-list or maybe use it for something it produces.
For the following, see the ring-buffer based lockless MPMC queue analyzed in Lock-free Progress Guarantees. I'm imagining modifications to it that would make it suit your purposes.
It has a read-index and a write-index, and each ring-buffer node has a sequence counter that lets it detect writers catching up with readers (queue full) vs. readers catching up with writers (queue empty), without creating contention between readers and writers. (IIRC, readers read the write-index or vice versa, but there's no shared data that's modified by both readers and writers.)
If there's a reasonable upper bound on buffer sizes, you could have shared fixed-size buffers associated with each node in the ring buffer. Like maybe 1kiB or 4kiB. Then you wouldn't need a payload in the ring buffer; the index would be the interesting thing.
If memory allocation footprint isn't a big deal (only cache footprint) even 64k or 1M buffers would be mostly fine even if you normally only use the low 400 bytes of each. Parts of the buffer that don't get used will just stay cold in cache. If you're using 2MiB hugepages, buffers smaller than that are a good idea to reduce TLB pressure: you want multiple buffers to be covered by the same TLB entry.
But you'd need to claim a buffer before writing to it, and finish writing to it before finishing the second step of adding an entry to the queue. You probably don't want to do more than just memcpy, because a partially-complete write blocks readers if it becomes the oldest entry in the queue before it finishes. Maybe you could write-prefetch the buffer (with prefetchw on Broadwell or newer)
before trying to claim it, to reduce the time during which you're (potentially) blocking the queue. But if there's low contention among writers, that might not matter. And if there's high contention, so you don't (almost) always succeed at claiming the first buffer you try, a write-prefetch on the wrong buffer will slow down the reader or writer that does own it. Maybe a normal prefetch would be good.
If buffers are tied directly to queue entries, maybe you should just put them in the queue, as long as the MPMC library allows you to use custom reader code that reads a length and copies out that many bytes, instead of always copying a whole giant array.
Then every queue control entry that producers / consumers look at will be in a separate cache line, so there's no contention between two producers claiming adjacent entries.
If you need really big buffers because your upper bound is like 1MiB or something, retries because of contention will lead to touching more TLB entries, so a more compact ring buffer with the large buffers separate might be a better idea.
A reader half-way through claiming a buffer doesn't block other readers. It only blocks the queue if it wraps around and a producer is stuck waiting for it. So you can definitely have your readers use the data in-place in the queue, if it's big enough and readers are quick. But the more you do during a partially-complete read, the higher chance that you sleep and eventually block the queue.
This is a much bigger deal for producers, especially if the queue is normally (nearly) empty: consumers are coming up on newly-written entries almost as soon as they're produced. This is why you might want to make sure to prefetch the data you're going to copy in, and/or the shared buffer itself, before running a producer.
400 bytes is only 12.5 cycles of committing 32 bytes per clock to L1d cache (e.g. Intel Haswell / Skylake), so it's really short compared to inter-core latencies or the time you have to wait for an RFO on a cache write-miss. So the minimum time between a producer making the claim of a node globally visible to when you complete that claim so readers can read it (and later entries) is still very short. Blocking the queue for a long time is hopefully avoidable.
That much data even fits in 13 YMM registers, so a compiler could in theory actually load the data into registers before claiming a buffer entry, and just do stores. You could maybe do this by hand with intrinsics, with a fully-unrolled loop. (You can't index the register file, so it has to be fully unrolled, or always store 408 bytes, or whatever.)
Or 7 ZMM registers with AVX512, but you probably don't want to use 512-bit loads/stores if you aren't using other 512-bit instructions, because of the effects on max-turbo clock speed and shutting down port 1 for vector ALU uops. (I assume that still happens with vector load/store, but if we're lucky some of those effects only happen with 512-bit ALU uops...)

Hard disk contention using multiple threads

I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant an SSD drive, not a magnetic drive. Both are HDs to me, but I am more interested in the case of a system with a single SSD.
As pointed out in the comments, one advantage of using multiple threads is that a large file load will not delay the delivery of a smaller one to the receiver of the loaded data. In my case this is a big advantage, so even if it costs a little performance, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is, what kind of performance % penalty would there be for making the parallel disk writes sequential (in the OS layer) as opposed to allowing only 1 resource loader thread? And what are the factors that drive this? I don't mean like platform, manufacturer etc. I mean technically, what aspects of the OS/HD interaction influence this penalty? (in theory).
FURTHER EDIT:
My exact use case is texture-loading threads which exist only to load from the HD and then "pass" the textures on to OpenGL, so there is minimal computation in the threads (maybe some type conversion etc.). In this case the threads would spend most of their time waiting for the HD (I would have thought), and therefore understanding how the OS-HD interaction is managed is important. My OS is Windows 10.
Note. I am not talking about the main thread.
Main vs non-main thread makes zero difference to the speed of reading a disk.
I would have thought that using more than one "other" thread to do the loading to be pointless because the HD cannot do 2 things at once, and therefore would surely only cause disk contention.
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they will also make the disk's access pattern random instead of sequential, which is much, much slower due to disk head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved such as firmware, file system, operating system, speed of the drive relative to some other bottle neck, etc.
In either case, RAID might invalidate assumptions made here.
It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the underlying capabilities of the machine, and vice versa. There are profiling tools like FTRACE/Kernelshark and Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general, I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-threadify an algorithm just right so that the right amount of work is done on each data item whilst it is in L1 cache, such that the L2 and L3 caches, DDR4 and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, or SSD, or memory SIMMs. Intel designs for good general-purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are a very big help in doing this.
General Guidance
In general, one should look at it in terms of the data bandwidth required by any particular arrangement of threads and the work those threads are doing.
This means benchmarking your program's inner processing loop, measuring how much data it processed and how quickly it managed to do so, choosing a number of data items that makes sense but is much more than the size of the L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used in processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating - use the CPU's SSE/AVX instructions where appropriate; don't forego them by writing plain C or skipping something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kinda does all this for you to the best of its ability.]
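A minimal sketch of that kind of measurement (the buffer size and the trivial "+= 1.0f" kernel are placeholders; a real test would substitute your actual inner loop and item size):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Stream through a buffer much larger than L3 and report the effective
// bandwidth. Replace the placeholder kernel with your real per-item work
// to see at what point you stop being memory bound.
int main() {
    const std::size_t n = 256 * 1024 * 1024 / sizeof(float); // ~256 MiB
    std::vector<float> data(n, 1.0f);

    const auto t0 = std::chrono::steady_clock::now();
    const int passes = 4;
    for (int pass = 0; pass < passes; ++pass)
        for (std::size_t i = 0; i < n; ++i)
            data[i] += 1.0f;                    // placeholder workload
    const auto t1 = std::chrono::steady_clock::now();

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = double(passes) * n * sizeof(float) * 2; // read + write
    std::printf("%.2f GB/s\n", bytes / secs / 1e9);
    return 0;
}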
These days DDR4 memory is good for anything between 20 and 100 GByte/second (depending on the CPU, number of SIMMs, etc.), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself to be bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second-guess the cache, there was none. One had complete control over what data and code was where, and when it was there.
A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true, because modern I/O systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the I/O reading library has a non-trivial amount of computation; think about reading compressed videos, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. It's not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the I/O, because that would lead to system unresponsiveness. In one case I got a 3% read-speed improvement because I came closer to saturating the I/O. This is doubly true if some system policy exists to stripe the data across different drives, as might be set up on a Lustre drive in an HPC cluster. For my application, the optimal number of threads was two.
More complicated I/O, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the disk is spinning, the head is constantly reading/writing and not just moving. The only way to do this, in practice, is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
Not sure which way to go architecturally, appreciate any advice.
Determining the amount of work per thread is the most common architectural optimization. Write code so that it's easy to increase the I/O worker count. You're going to need to benchmark.

Writer/Reader buffer mechanism for large size - high freq data c++

I need a single-writer, multiple-reader (up to 5) mechanism, written in C++, where the writer continuously pushes data packages of almost 1 MB each, 15 packages per second. What I'm trying to do is have one thread keep writing the data while 5 readers simultaneously run search operations on it according to the timestamps of the data. I have to keep each data package for 60 minutes, after which it can be removed from the container.
Since the data can grow to 15 MB/s * 60 s * 60 min = 54000 MB/h, I need almost 50 GB of space to keep the data and still make the operations fast enough for both the writer and the readers. But the thing is, we cannot keep that much data in cache or RAM, so it must live on a hard drive, i.e. an SSD (an HDD would be too slow for that kind of operation).
Up to now what I've been thinking is either to make a circular buffer (since I can calculate the max size) implemented directly on the SSD - I couldn't find a suitable example so far, and I don't know whether it is even possible - or to implement some kind of mapping mechanism, where a circular array kept in RAM holds just the timestamps of the data and the physical addresses of the data on the hard drive, so at least the search operations would be faster, I guess.
Since any kind of lock, mutex or semaphore will slow down the operations (the write especially is critical; we cannot lose data because of any read operation), I don't want to use them. I know there are shared locks available, but I think they have drawbacks as well. Is there any way/idea to implement such a system that is lock-free, wait-free and thread-safe as well? Any data structure (container), pattern, example code/project or other kind of suggestion will be highly appreciated, thank you…
EDIT: Is there any other idea, rather than just a bigger amount of RAM?
This can be done on a commodity PC (and can scale to a server without code changes).
Locks are not a problem. With a single writer and few consumers that do time-consuming tasks on big data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) to, preferably, a pthread_mutex (which mostly behaves like a spinlock anyway) will do fine. Nothing fancy.
Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long running tasks, short lock times, contention = zero.
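A sketch of that hand-over (types, names and the single fixed offset are illustrative; wrap-around and retiring old blocks are left out): the bulk copy happens outside the lock, and the lock is held only while the valid-range metadata is updated.

#include <cstddef>
#include <cstring>
#include <mutex>

// The lock guards only the metadata describing which part of the mapped
// ring is valid, never the payload. The producer copies into a region
// only it touches, then publishes it with a very short locked update.
struct ValidRange {
    std::size_t begin = 0;   // offset of the oldest valid byte
    std::size_t end   = 0;   // offset one past the newest valid byte
};

void publish_block(unsigned char* mapped_ring, ValidRange& range,
                   std::mutex& meta_lock,
                   const unsigned char* block, std::size_t block_size)
{
    // 1. Long-running part, no lock held. Only the producer ever writes
    //    'end', so it may read it here without taking the lock.
    std::memcpy(mapped_ring + range.end, block, block_size);

    // 2. Short locked section: moving the bound transfers ownership of
    //    the freshly written block to the consumers. Advancing 'begin'
    //    to retire the oldest block would also happen here.
    std::lock_guard<std::mutex> lk(meta_lock);
    range.end += block_size;
}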
As for the actual data, writing out 15 MiB/s is absolutely no challenge; a normal hard disk will do 5-6 times as much, and an SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself - it's something you can leave to the operating system to manage.
I would create a 54.1 GB¹ file on disk and memory-map it (assuming it's a 64-bit system, a reasonable assumption when talking of multi-gigabyte-RAM servers, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region, which you use as a circular buffer².
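Creating and mapping such a file is only a few lines of POSIX (a sketch with error handling trimmed; the path and size are placeholders):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Create (or open) a large backing file and map it read/write. The OS
// pages it in and out as needed; the returned region is used as the
// circular buffer described above.
void* map_ring_file(const char* path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, size) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, static_cast<std::size_t>(size),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     // the mapping keeps the file referenced
    return (p == MAP_FAILED) ? nullptr : p;
}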
What was most recently written will be more or less guaranteed³ to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).
You are not very concrete on how you intend to process data, other than there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block in 5 smaller chunks, or whatever -- depending on what exactly you want to do.
What you need to account for in any case is the region (either as pointer, or better as offset into the mapping) of data in your mapped ringbuffer that is "valid" and the region that is "unused".
The producer is the owner of the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata.
Anyone (including the producer) accessing this metadata needs to acquire a lock.
It is probably even possible to do this with atomic operations, but seeing how you only lock rarely, I wouldn't even bother. It's a no-brainer using a lock, and there are no subtle mistakes that you can make.
Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside those bounds (the area known to be "empty") without locking. It only needs to lock to change the bounds afterwards.
As 54.1 GiB > 54 GiB, you have a hundred spare 1 MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by 1), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (over 1 1/2 minutes in this case). If you want to be absolutely sure, you'll need another lock, which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrived case, unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so practically, it's a non-issue.
If your boss decides, at some point in the future, that you should keep more than 1 hour of backlog, you can enlarge the file and remap, and when the "empty" region is next at the end of the old buffer's size, simply extend the "known" file size, and adjust your max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the mean time, but I deem that an inferior solution. It is probably not necessary for a size increase to be immediately visible, but on the other hand it is highly desirable that it is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.
¹ Intentionally bigger than what you need, so there are a few "extra" 1 MiB blocks.
² Preferably, you can madvise the operating system (if you use a system that has a destructive DONT_NEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But even if you don't do that, it will work either way, only slightly less efficiently, because the OS will possibly do a read-modify-write operation where a write operation would have been enough.
³ There is of course never really a guarantee, but it's what will be the case anyway.
54 GB/hour = 15 MB/s. A good SSD these days can write 300+ MB/s. If you keep 1 hour in RAM and then occasionally flush older data to disk, you should be able to handle 10x more than 15 MB/s (provided your search algorithm is fast enough to keep up).
Regarding a fast locking mechanism between your threads, I would suggest looking into RCU - Read-Copy-Update. The Linux kernel currently uses it to achieve very efficient locking.
Do you have some minimum hardware requirements? 54GB in memory is perfectly possible these days (many motherboards can take 4x16GB these days, and that's not even server hardware). So if you want to require an SSD, you could maybe just as well require a lot of RAM and have an in-memory circular buffer as you suggest.
Also, if there's sufficient redundancy in the data, it may be viable to use some cheap compression algorithm (one that is easy on the CPU, i.e. some sort of "level 0" compression). I.e. you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
Many good recommendations around. I'd just like to add that for the circular buffer implementation you can have a look at Boost Circular Buffer.

Loading batch of images - Thread allocation

So, I have a lot of images to be loaded from disk, and I was wondering how many threads I should allocate to the task to obtain maximum performance.
I am not specifying the OS because my project is cross-platform.
I think I will work mainly with PNG, i.e. it is not slow to decompress but there is some decompression involved.
Also, if I end up creating one thread for each image, is the thread overhead big enough to considerably slow down my process?
Sometimes a producer-consumer architecture is good enough.
Other times what you describe could also work, given that you don't have more threads than the available CPUs can handle (i.e. more threads than #CPUs*2 usually, though not always, leads to thrashing).
You need to perform some tests in order to see which model works best for you. Think about where these images come from (disk?). Are they in consecutive locations on disk or not? Does it make sense to spawn multiple threads and just wait for disk I/O to load a small chunk of one photo from disk, then context-switch to another thread and do another seek on disk to get a small chunk of another file, and so on?
I suggest trying a single-threaded application first.
One thread per disk seems like a reasonable start. You could make it a runtime tuning parameter to see what works best, especially if there are, or might be, non-local network disks (i.e. high latency), or, as others have suggested, any decompression or video processing to be done.
One thread per image is not a good idea, again, as posted by others. You will need some producer-consumer queues to feed the thread(s) with objects that contain an image buffer plus a file spec, and also to return the same objects after the load is done - continually creating/terminating/destroying threads is wasteful, difficult and prone to disaster.

Multithreaded image processing in C++

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.
Such image manipulations are very CPU-intensive. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.
I have several requirements:
Executable size must be minimized. In other words, I can't use massive libraries. What's the most light-weight, portable threading library for C/C++?
I was thinking of having a function forEachRow(fp*) which runs a thread for each row, or even a forEachPixel(fp*) where fp operates on a single pixel in its own thread. Which is best?
Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
Would I need to lock my read-only and write-only arrays?
The input is only read from, but many operations require input from more than one pixel in the array.
The output is only written to once per pixel.
Speed is also important (of course), but optimizing executable size takes precedence.
Thanks.
More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks
Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)
Do you know what deadlock is? How about Livelock?
That said...
As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.
In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.
But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)
The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.
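A minimal sketch of that work queue in standard C++ (the answer recommends pthreads and semaphores; this shows the same pattern with std::thread, and the Chunk type and processChunk callback are placeholders):

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker per core, each repeatedly pulling a chunk of rows off a
// shared queue until it is drained. processChunk stands in for whatever
// per-chunk image work you actually do.
struct Chunk { int firstRow; int lastRow; };

void run_work_queue(std::queue<Chunk> work,
                    const std::function<void(const Chunk&)>& processChunk)
{
    std::mutex m;
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                Chunk c;
                {   // short critical section: grab the next chunk, if any
                    std::lock_guard<std::mutex> lk(m);
                    if (work.empty()) return;
                    c = work.front();
                    work.pop();
                }
                processChunk(c);   // the long-running part, no lock held
            }
        });
    }
    for (auto& th : pool) th.join();
}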
More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.
One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)
Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under Windows with a Microsoft compatibility layer. But, ugh, I don't really trust it! Cygwin might be a better choice there.)
Under Unix/Linux, man:
* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.
Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.
Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!
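For example, a couple of tiny RAII wrappers around the calls listed above (a sketch; error checking omitted) already remove most of the bookkeeping:

#include <pthread.h>

// Thin RAII wrappers around pthread mutexes so a lock can never be
// forgotten on an early return.
class Mutex {
public:
    Mutex()  { pthread_mutex_init(&m_, nullptr); }
    ~Mutex() { pthread_mutex_destroy(&m_); }
    void lock()   { pthread_mutex_lock(&m_); }
    void unlock() { pthread_mutex_unlock(&m_); }
    Mutex(const Mutex&) = delete;
    Mutex& operator=(const Mutex&) = delete;
private:
    pthread_mutex_t m_;
};

class ScopedLock {
public:
    explicit ScopedLock(Mutex& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }
private:
    Mutex& m_;
};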
Would I need to lock my read-only and write-only arrays?
It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.
Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!
Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!
I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.
Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.
If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!
Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.
Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!
Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!
Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.
If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)
One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.
Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)
/*ThreadA:*/ while(1){ mutex.lock(); printf("a\n"); usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){ mutex.lock(); printf("b\n"); usleep(100000); mutex.unlock(); }
What will happen?
Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().
Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.
If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.
If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.
You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.
OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting in a few #pragma omps in existing code, and letting the compiler handle creating threads and synchronization.
In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.
For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):
#pragma omp parallel for
for (int i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned char)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
}
This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.
I would recommend boost::thread and boost::gil (generic image library). Because there are quite a lot of templates involved, I'm not sure whether the code size will still be acceptable for you. But it's part of Boost, so it is probably worth a look.
As a bit of a left-field idea...
What systems are you running this on? Have you thought of using the GPU in your PCs?
Nvidia have the CUDA APIs for this sort of thing
I don't think you want one thread per row. There can be a lot of rows, and you will spend a lot of memory/CPU resources just launching/destroying the threads and having the CPU switch between them. Moreover, if you have P processors with C cores each, you probably won't gain much with more than C*P threads.
I would advise you to use a defined number of client threads, for example N threads, and use the main thread of your application to distribute the rows to each thread, or have them simply get instructions from a "job queue". When a thread has finished with a row, it can check this queue for another row to do.
As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.
Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targeting a desktop machine. In which case, does the platform have multiple cores or hyperthreading? If not, then adding threads to your application could have the opposite effect and slow it down...
To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.
If your compiler doesn't support OpenMP, another option is to use a library approach; both Intel's Threading Building Blocks and Microsoft's Concurrency Runtime are available (VS 2010).
There is also a set of interfaces called the Parallel Pattern Library which is supported by both libraries, and these have a templated parallel_for library call.
So instead of:
#pragma omp parallel for
for (int i = 0; i < numPixels; i++)
{ ... }
you would write:
parallel_for(0,numPixels,1,ToGrayScale());
where ToGrayScale is a functor or a pointer to a function. (Note: if your compiler supports lambda expressions, which it likely doesn't, you can inline the functor as a lambda expression.)
parallel_for(0, numPixels, 1, [&](int i)
{
    pGrayScaleBitmap[i] = (unsigned char)
        (pRGBBitmap[i].red   * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue  * 0.114);
});
-Rick
Check the Creating an Image-Processing Network walkthrough on MSDN, which explains how to use Parallel Patterns Library to compose a concurrent image processing pipeline.
I'd also suggest Boost.GIL, which generates highly efficient code. For a simple multi-threaded example, check gil_threaded by Victor Bogado. The article "An image processing network using Dataflow.Signals and Boost.GIL" explains an interesting dataflow model too.
One thread per pixel row is insane; it's best to have around n-1 to 2n threads (for n CPUs), and make each one loop, fetching one job unit (maybe one row, or some other kind of partition).
On Unix-likes, use pthreads; it's simple and lightweight.
Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.
Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.
Regardless of the threading model you choose (Boost, pthreads, native threads, etc.), I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned; it's just a matter of giving them something to do.
Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.
This is by far the simplest way IMHO to add threading to a SIMD task.
I think a map/reduce framework would be the ideal thing to use in this situation. You can use Hadoop streaming to use your existing C++ application.
Just implement the map and reduce jobs.
As you said, you can use row-level manipulations as the map task and combine the row-level manipulations into the final image in the reduce task.
Hope this is useful.
It is very possible that the bottleneck is not the CPU but memory bandwidth, so multi-threading WON'T help a lot. Try to minimize memory access and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago and decided to optimize my code to use SSE instructions. The speed increase was almost 4x per single thread!
You could also use libraries like IPP or the Cassandra Vision C++ API that are mostly much more optimized than your own code.
There's another option of using assembly for optimization. Now, one exciting project for dynamic code generation is softwire (which dates back a while - here is the original project's site). It was developed by Nick Capens and grew into the now commercially available SwiftShader. But the spin-off of the original softwire is still available on gna.org.
This could serve as an introduction to his solution.
Personally, I don't believe you can gain significant performance by utilizing multiple threads for your problem.