Buffer underrun logic problem, threading tutorial? - c++

Ok, I tried all sorts of titles and they all failed (so if someone comes up with a better title, feel free to edit it :P)
I have the following problem: I am using an API to access hardware, and I didn't write that API. To add libraries to it, I need to inherit from the API interface, and the API does everything else.
I plugged a music generator library into that API. The problem is that the API only calls the music library when the buffer is empty, and asks for a hardcoded amount of data (exactly 1024*16 samples... dunno why).
This means that the music generator library cannot use the CPU's full potential while playing music. Even when the music library is not keeping up, CPU usage remains low (around 3%), so in parts of the music where there is too much complex stuff, the buffer underruns (i.e. the soundcard plays the empty region of the buffer, because the music library function hasn't returned yet).
Tweaking the hardcoded number would only make the software work on some machines and not on others, depending on several factors...
So I came up with two solutions. One is to hack the API with some new buffer logic, but I haven't figured out anything in that area.
The other, which I did work out the logic for, is to give the music library its own thread. It would have its own separate buffer that it fills all the time; when the API calls the music library for more data, instead of generating it, the library would simply copy the data from that separate buffer to the soundcard buffer, and then resume generating music.
My problem is that although I have several years of programming experience, I have always avoided multi-threading, and I don't even know where to start...
The question is: can someone find another solution, OR point me to a place with info on how to implement my threaded solution?
EDIT:
I am not READING files, I am GENERATING, or CALCULATING, the music, got it? This is NOT a .wav or .ogg library. This is why I mentioned CPU time: if I could use 100% of the CPU, I would never get an underrun, but I can only use the CPU in the short window between the program realizing that the buffer is reaching its end and the actual end of the buffer, and that window is sometimes shorter than the time the program takes to calculate the music.

I believe the solution with a separate thread that prepares data for the library, so that it is ready when requested, is the best way to reduce latency and solve this problem. One thread generates music data and stores it in a buffer, and the API's thread gets data from that buffer when it needs it. In this case you need to synchronize access to the buffer, whether you are reading or writing, and make sure the buffer doesn't grow too big in those cases where the API is too slow to consume it. To implement this, you need a thread, mutex and condition primitives from a threading library, and two flags: one to indicate that a stop has been requested, and another to ask the thread to pause filling the buffer when the API cannot keep up and the buffer is getting too big (see the sketch after the links below). I'd recommend the Boost Thread library for C++; here are some useful articles with examples that come to mind:
Threading with Boost - Part I: Creating Threads
Threading with Boost - Part II: Threading Challenges
The Boost.Threads Library
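To make the threaded approach concrete, here is a minimal sketch of that shared buffer using the standard std::mutex and std::condition_variable primitives (the Boost.Thread equivalents are nearly identical). The Sample type, the kChunk size, and the kMaxChunks cap are assumptions to tune for the real API:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

using Sample = short;                       // placeholder sample type
constexpr std::size_t kChunk = 1024 * 16;   // the API's hardcoded request size

class MusicBuffer {
public:
    // Called by the generator thread; blocks while the buffer is full
    // so the generator cannot run arbitrarily far ahead.
    void push(std::vector<Sample> chunk) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [this] { return queue_.size() < kMaxChunks || stop_; });
        if (stop_) return;
        queue_.push_back(std::move(chunk));
    }

    // Called from the API's callback; hands over one pre-generated chunk.
    // Returns false on underrun so the caller can output silence instead.
    bool pop(std::vector<Sample>& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        notFull_.notify_one();
        return true;
    }

    // The "stop requested" flag; wakes the generator so it can exit.
    void stop() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        notFull_.notify_all();
    }

private:
    static constexpr std::size_t kMaxChunks = 8;  // the "pause filling" threshold
    std::mutex m_;
    std::condition_variable notFull_;
    std::deque<std::vector<Sample>> queue_;
    bool stop_ = false;
};

The generator thread then just loops, producing kChunk samples at a time and calling push(); the API callback calls pop() and copies the chunk into the soundcard buffer.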

You don't necessarily need a new thread to solve this problem. Your operating system may provide an asynchronous read operation; for example, on Windows, you would open the file with the FILE_FLAG_OVERLAPPED flag to make any operations on it asynchronous.
If your operating system does support this functionality, you could make a large buffer that can hold a few calls worth of data. When the application starts, you fill the buffer, then once it's filled you can pass off the first section of the buffer to the API. When the API returns, you can read in more data to overwrite the section of the buffer that your last API call consumed. Because the read is asynchronous, it will fill the buffer while the API is playing music.
The implementation could be more complex than this, e.g. using a circular buffer, or waiting until a few of the sections have been consumed and then reading in multiple sections at once, instead of reading in one section at a time.
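For reference, a minimal sketch of the Windows overlapped-read pattern described above; the file name and section size are placeholders, and error handling is trimmed:

#include <windows.h>

int main() {
    HANDLE file = CreateFileA("music.dat", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED, nullptr);

    char section[1024 * 16];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
    ov.Offset = 0;  // file offset of the section to refill

    // Returns at once; the OS fills `section` in the background
    // (FALSE with ERROR_IO_PENDING is the normal asynchronous case).
    ReadFile(file, section, sizeof(section), nullptr, &ov);

    // ... hand the previously filled section of the large buffer to the API ...

    // Before reusing this section, wait for (or poll) completion.
    DWORD bytesRead = 0;
    GetOverlappedResult(file, &ov, &bytesRead, TRUE /* wait */);

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}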

Related

In Vulkan, is it beneficial for the graphics queue family to be separate from the present queue family?

As far as I can tell it is possible for a queue family to support presenting to the screen but not support graphics. Say I have a queue family that supports both graphics and presenting, and another queue family that only supports presenting. Should I use the first queue family for both processes or should I delegate the first to graphics and the latter to presenting? Or would there be no noticeable difference between these two approaches?
No such HW exists, so the best approach is no approach. If you want to be really nice, you can handle the separate-present-queue-family case while expending minimal brain-power on it, though you have no way to test it on real HW that needs it. So I would say aborting with a nice error message would be just as adequate, until you can get your hands on actual HW that works this way.
I think there is a bit of a design error here on Khronos's part. A separate present queue does look like the more explicit way. But then, the present op itself is not a queue operation, so the driver can use whatever it wants anyway. Also, a separate present queue requires an extra semaphore, and a Queue Family Ownership Transfer (or a VK_SHARING_MODE_CONCURRENT resource). History went the way that no driver is extremist enough to report a separate present queue, so I made KhronosGroup/Vulkan-Docs#1234.
For a rough notion of what happens at vkQueuePresentKHR, you can inspect the Mesa code: https://github.com/mesa3d/mesa/blob/bf3c9d27706dc2362b81aad12eec1f7e48e53ddd/src/vulkan/wsi/wsi_common.c#L1120-L1232. There's probably no monkey business there with the queue you provided, beyond waiting on your semaphore, or at most making a blit of the image. If you (voluntarily) want to use a separate present queue, you need to measure, and whitelist it only for the drivers (and probably other influences) it actually helps on (if any such exist, and if it is even worth your time).
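For completeness, a minimal sketch of the queue-family selection this implies: prefer a single family that supports both graphics and present, and fall back to separate families only when forced to. pickQueueFamilies and QueueChoice are made-up names, and physDev/surface are assumed to already exist:

#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

struct QueueChoice { uint32_t graphics; uint32_t present; };

QueueChoice pickQueueFamilies(VkPhysicalDevice physDev, VkSurfaceKHR surface) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physDev, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physDev, &count, families.data());

    QueueChoice choice{UINT32_MAX, UINT32_MAX};
    for (uint32_t i = 0; i < count; ++i) {
        VkBool32 canPresent = VK_FALSE;
        vkGetPhysicalDeviceSurfaceSupportKHR(physDev, i, surface, &canPresent);
        bool hasGraphics = (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) != 0;

        if (hasGraphics && canPresent)
            return {i, i};  // combined family: use it for both
        if (hasGraphics && choice.graphics == UINT32_MAX) choice.graphics = i;
        if (canPresent && choice.present == UINT32_MAX)   choice.present = i;
    }
    return choice;  // separate families: the case no known HW actually requires
}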
First off, I assume you mean "beneficial" in terms of performance, and whenever it comes to questions like that you can never have a definite answer except by profiling the different strategies. If your application needs to run on a variety of hardware, you can have it profile the different strategies the first time it's run and save the results locally for repeated use, provide the user with a benchmarking utility they can run if they see poor performance, etc. etc. Trying to reason about it in the abstract can only get you so far.
That aside, I think the easiest way to think about questions like this is to remember that when it comes to graphics programming, you want to both maximize the amount of work that can be done in parallel and minimize the amount of work overall. If you want to present an image from a non-graphics queue and you need to perform graphics operations on it, you'll need to transfer ownership of it to the non-graphics queue when graphics operations on it have finished. Presumably, that will take a bit of time in the driver if nothing else, so it's only worth doing if it will save you time elsewhere somehow.
A common situation where this would probably save you time is if the device supports async compute and also lets you present from the compute queue. For example, a 3D game might use the compute queue for things like lighting, blur, UI, etc. that make the most sense to do after geometry processing is finished. In this case, the game engine would transfer ownership of the image to be presented to the compute queue first anyway, or even have the compute queue own the swapchain image from beginning to end, so presenting from the compute queue once its work for the frame is done would allow the graphics queue to stay busy with the next frame. AMD and NVIDIA recommend this sort of approach where it's possible.
If your application wouldn't otherwise use the compute queue, though, I'm not sure how much sense it makes to present on it when you have the option. The advantage of that approach is that once graphics operations for a given frame are over, the graphics queue can immediately release ownership of the image and acquire the next one without having to pause to present, which would allow presentation to be done in parallel with rendering the next frame. On the other hand, you'd have to transfer ownership of the image to the compute queue first and set up presentation there, which would add some complexity and overhead. I'm not sure which approach would be faster, and I wouldn't be surprised if it varies with the application and environment. Of course, I'm not sure how many realtime Vulkan applications of any significant complexity fit this scenario today; I'd guess not very many, as "per-pixel" things tend to be easier and faster to do with a compute shader.

C++ Boost Object Serialization - Periodic Saving to Protect Data

I have a program that uses boost serialization that loads on program start up and saves on shutdown.
Every once in a while, the program will crash due to this or that, and I expect that to be fairly normal. The problem is that when the program crashes, often the objects are not saved at all. Other times, some will be missing or the data will be corrupted. This could be disastrous if a user loses months and months of data. In a perfect world, everyone would back up their data and could just roll back the data file.
My first solution is to periodically save the objects to a different temporary data file during run time. That way, if the program crashes, they can revert to the temporary data file with minimal data loss. My concern is the effect on performance. As far as I understand (correct me if I am wrong), once you save an object, it can't be used anymore? If that is the case, then the periodic save routine would involve saving and deleting my pointers, then loading them up again.
My second solution is to simply make a copy of the data file during program start up. The user's loss of data would be limited to that session. However, this may not be sufficient as some users may run the program for days and days.
Any input would be appreciated.
Thanks in advance.
If you save an object graph with boost serialization, that object graph is still available and can be saved again without necessarily reading anything from disk.
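So a periodic save can reuse the live objects directly. A minimal sketch, assuming a serializable root object (here called doc; the name and the write-to-temp-then-rename scheme are illustrative, not part of Boost):

#include <boost/archive/text_oarchive.hpp>
#include <cstdio>
#include <fstream>
#include <string>

// Snapshot a live object graph without disturbing it. Writing to a temp
// file and renaming keeps the previous snapshot intact if the program
// crashes mid-save.
template <class Document>
void snapshot(const Document& doc, const std::string& path) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream os(tmp, std::ios::trunc);
        boost::archive::text_oarchive ar(os);
        ar << doc;               // doc remains fully usable afterwards
    }
    std::remove(path.c_str());   // std::rename fails on Windows if the target exists
    std::rename(tmp.c_str(), path.c_str());
}

Call snapshot(doc, "data.save") from a timer or after every N mutations; on a crash, the user loses at most one interval's worth of work.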
If you want to go high-tech and introduce a lot more complexity, you can use the Boost Interprocess library with a managed_mapped_file segment. This enables you to transparently work directly on a disk file (actually, on memory pages backed by file blocks). This introduces another issue, actually: how to prevent changes from frequently hitting the disk.
Gratuitous advice:
I think the best of all worlds would be if your object graph is (e.g.) a Composite pattern where all nodes are shared immutables. Now serialization is "free" (with Boost), you can easily handle multiple versions of the program state (often a "document" or "database", logically) and efficiently save/load them with Boost Serialization. This pattern facilitates undo/redo, concurrent operations, transactional commit ¹ etc.
¹ (! not without extra work, but in principle)

Reading many files in parallel

I have a cross-platform project in C++ where I am mixing audio in real time. I have several independent tracks as input, which I read from separate files on disk. I then mix these, apply some processing, and spit out a buffer with the resulting audio.

The problem I am having is disk IO speed. For the current test I am performing, I have about 10 tracks that are read simultaneously from disk. Each track is raw PCM, 48000 Hz 16-bit stereo, which means a significant amount of data needs to be read as quickly as possible. I have tried both simple fread calls and memory-mapped files through Boost, but the issue is the same: when a file is first opened, it usually causes the audio to break up (presumably while the file is read into cache by the OS). After that, everything runs smoothly without a glitch.

For the time being I use one thread per file in the common case, sometimes two files per thread. It is usually when I have two files per thread that the stalling/breakup of the stream occurs. Note that I do not know in advance which input files need to be played, as this is controlled by the user. So my problem is how to read these initial blocks in such a way that I don't get stalling/breakup. Also, when a new file is loaded, reading does not necessarily have to start at the beginning.
I have a few thoughts:
Can we prefetch the files into cache by reading all of them once at startup but disregarding the data? I cannot store all of it in memory. But it seems bad to rely on the internal behavior of the OS's read cache, especially since this is cross-platform.
Can we use a format such as Ogg Vorbis for compression, load the compressed data fully into memory, and then decode on the fly? I am thinking that decoding 10 or more Vorbis streams might be too CPU intensive, but I have no benchmarks yet. At least this way we turn it from an I/O-bound task into a CPU-bound one.
Can we do any other kind of clever buffering approach to make it so that the large reads are more equally distributed? I know very little about how I might accomplish that.
I am stuck at this point, and would appreciate any suggestions that might improve throughput.
Try doing the file loading using event processing.
This is where you open a bunch of file descriptors and let the operating system notify your program when data is available.
The most broadly available API to do this is "select" (http://linux.die.net/man/2/select), but there are better methods (poll, epoll, kqueue); those are not available everywhere.
There are libraries that abstract this for you (libev and libevent).
So the way you do it is: one thread opens all the files you need and sets a 'watcher' on them. When data is available, the watcher triggers and calls a callback.
The advantage is that you don't have a ton of threads waiting and sleeping while checking all your open file descriptors. If that doesn't work, then you are likely over-saturating the hardware's IO bandwidth, in which case you just have to wait; and if that is the case, you need to do some buffering to avoid stutters.
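A bare-bones sketch of that watcher pattern with select(); one caveat worth knowing is that POSIX select() reports regular files as always ready, so the pattern really pays off with pipes, sockets, or the true async-I/O facilities named above. eventLoop and the buffer handling are illustrative:

#include <sys/select.h>
#include <unistd.h>
#include <vector>

void eventLoop(const std::vector<int>& fds) {
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        int maxFd = -1;
        for (int fd : fds) { FD_SET(fd, &readable); if (fd > maxFd) maxFd = fd; }

        // Block until at least one descriptor has data.
        if (select(maxFd + 1, &readable, nullptr, nullptr, nullptr) <= 0)
            break;  // error or interrupted

        for (int fd : fds)
            if (FD_ISSET(fd, &readable)) {
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));  // the "callback" work
                if (n <= 0) return;                      // EOF or error
                // ... hand buf/n to the mixing thread here ...
            }
    }
}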
As a rule of thumb, you need to perform file IO operations in a separate thread for real-time operations. When the user wants to mix in a second audio file, you can just spawn a new thread, read the first N bytes of that second audio file, and return the data read to the main thread. This will still cause a lag, but it will not break the audio flow.

Loading batch of images - Thread allocation

So, I have a lot of images to load from disk, and I was wondering how many threads I should allocate to the task to obtain maximum performance.
I am not specifying an OS because my project is cross-platform.
I think I will work mainly with PNG, i.e. it is not slow to decompress, but there is some decompression involved.
Also, if I end up creating one thread for each image, is the thread overhead big enough to slow my process down considerably?
Sometimes a producer-consumer architecture is good enough.
Other times what you describe could also work, given that you don't have more threads than the available CPUs can handle (more than #CPUs*2 threads usually, though not always, leads to thrashing).
You need to perform some tests in order to see which model works best for you. Think about where these images come from (disk?), whether they are in consecutive locations on disk or not, and whether it makes sense to spawn multiple threads that just wait for disk IO to load a small chunk of one photo, then context-switch to another thread and do another seek on disk to get a small chunk of another file, and so on.
I suggest trying a single-threaded application first.
One thread per disk seems like a reasonable start. You could make it a runtime tuning parameter to see what works best, especially if there are, or might be, non-local network disks (i.e. high latency), or, as others have suggested, there is any decompression or video processing to be done.
One thread per image is not a good idea, again, as posted by others. You will need some P-C queues to feed the thread(s) with objects that contain an image buffer plus a file spec, and also to return the same objects after the load is done; continually creating/terminating/destroying threads is wasteful, difficult, and prone to disaster.
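To illustrate, here is a minimal sketch of such a pool fed by a work queue; ImageLoaderPool is a made-up name and decodePng is a stub standing in for your real PNG decoder:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Placeholder: call your real PNG library here.
std::vector<unsigned char> decodePng(const std::string&) { return {}; }

class ImageLoaderPool {
public:
    explicit ImageLoaderPool(unsigned nThreads) {
        for (unsigned i = 0; i < nThreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ImageLoaderPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void enqueue(std::string path) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(path)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::string path;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // done_ set and queue drained
                path = std::move(jobs_.front());
                jobs_.pop();
            }
            auto pixels = decodePng(path);  // decompression happens off-queue
            // ... push {path, pixels} onto a results queue the same way ...
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> jobs_;
    bool done_ = false;
    std::vector<std::thread> workers_;
};

The pool size (nThreads) is exactly the tuning parameter discussed above: start at one per disk and measure.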

Using pthreads with CUDA - design questions

I am writing some code that does some disk I/O, invokes a library that I wrote that does some computation and GPU work, and then does more disk I/O to write the results back to a file.
I would like to write this as multi-threaded code, because the files are quite large. I want to be able to read in a portion of the file, send it to the GPU library, and write a portion back to a file. The disk I/O involved is quite large (around 10 GB), and the computation is fairly quick on the GPU.
My question is more of a design question. Should I use separate threads to pre-load data that goes to the GPU library, have only the main thread actually execute the calls to the GPU library, and then send the resulting data to other threads to be written back out to disk? Or should I have each of the separate threads do its own part: grab a chunk of data, execute on the GPU, write to disk, and then go get the next chunk of data?
I am using CUDA for my GPU library. Is CUDA smart enough not to try to run two kernels on the GPU at once? I guess I will have to do the management manually to ensure that two threads don't try to add more data to the GPU than it has space for?
Any good resources on the subject of using multithreading and CUDA in combination are appreciated.
Threads will not help with disk I/O. Generally, people tend to solve blocking problems by creating tons of threads; in fact, that only makes things worse. What you have to do is use asynchronous I/O and not block on writes (and reads). You can use generic solutions like libevent or Asio for this, or work with the lower-level API available on your platform. On Linux, AIO seems to be the best for files, but I haven't tried that yet. Hope it helps.
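A minimal sketch of the POSIX AIO interface referred to for Linux; the file name and chunk size are placeholders and error checks are trimmed (link with -lrt on older glibc):

#include <aio.h>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("input.dat", O_RDONLY);
    if (fd < 0) return 1;

    static char buf[1 << 20];

    aiocb cb{};
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);  // returns immediately; the read proceeds in the background

    // ... launch GPU work on the previously read chunk here ...

    while (aio_error(&cb) == EINPROGRESS)
        ;                             // or aio_suspend() to sleep instead of spinning
    ssize_t n = aio_return(&cb);      // bytes actually read
    close(fd);
    return n >= 0 ? 0 : 1;
}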
I encountered this situation with large files in my research work.
As far as I remember, there is not much gain in threading the disk I/O work, because the disk is very slow compared to the GPU I/O.
The strategy I used was to read synchronously from disk and to load data and execute asynchronously on the GPU.
Something like:
read from disk
loop:
    async_load_to_gpu
    async_execute
    push_event
    read from disk
    check event complete, or read more data from disk
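Translated into CUDA runtime calls, one iteration of that loop might look like the sketch below; readChunk() is a hypothetical stand-in for the synchronous disk read, and a real version would double-buffer the host buffer so the next read never races the in-flight copy:

#include <cstddef>
#include <cuda_runtime.h>

__global__ void process(float* data, size_t n) {
    // Placeholder kernel body.
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20;
    float *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, n * sizeof(float));  // pinned memory, needed for true async copies
    cudaMalloc(&dev, n * sizeof(float));

    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // while (readChunk(host, n)) {  // hypothetical synchronous disk read
        cudaMemcpyAsync(dev, host, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);       // async_load_to_gpu
        process<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);  // async_execute
        cudaEventRecord(done, stream);                         // push_event

        // ... read the next chunk from disk here, overlapping the GPU work ...

        cudaEventSynchronize(done);                            // check event complete
    // }

    cudaStreamDestroy(stream);
    cudaEventDestroy(done);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}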