Libav multi-threaded decoding - c++

According to the documentation here, Libav provides the "infrastructure" for multithreaded decoding, but the docs are vague and confusing about how multithreaded decoding is implemented. Is it supported internally, requiring only a flag to be set in the struct, or does the user have to provide their own implementation using the provided functions? I searched a lot but could not find even one example of multithreaded video decoding with libav.

The link you have referred to looks like documentation aimed at codec developers rather than at end users of the FFmpeg libraries who use existing codecs.
Multi-threaded support is indeed implemented by the framework itself. It requires FFmpeg to be built with thread support (e.g. the --enable-pthreads or --enable-w32threads configure options), it varies across specific codecs (one codec may support multiple threads while another doesn't), and codecs implement different approaches (decoding multiple frames in parallel, or multiple slices within a single frame).
An end-user application may configure the number of threads to utilize (via the AVCodecContext::thread_count property, set before avcodec_open2()) and the threading mode (AVCodecContext::thread_type set to FF_THREAD_FRAME or FF_THREAD_SLICE). The thread pool is managed by FFmpeg itself, although some answers say it is also possible to use an application-provided pool.
Some documents state that leaving AVCodecContext::thread_count at its default value of 0 lets FFmpeg decide automatically how many threads to use (based on the number of logical CPUs in the system), but I've never tried this (I always set the parameter manually). So it probably already does multi-threaded decoding on your system - check the CPU load in your task manager.
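For illustration, here is a minimal sketch of the setup described above (the codec ID and thread count are arbitrary choices; error handling is omitted):

extern "C" {
#include <libavcodec/avcodec.h>
}

// Configure threaded decoding before opening the codec.
AVCodecContext *open_threaded_decoder() {
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    ctx->thread_count = 4;               // or 0 to let FFmpeg decide automatically
    ctx->thread_type = FF_THREAD_FRAME;  // or FF_THREAD_SLICE
    avcodec_open2(ctx, codec, nullptr);
    return ctx;
}

After avcodec_open2() succeeds, the decoding calls look exactly the same as in the single-threaded case; the thread pool is internal to the codec context.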
What FFmpeg doesn't do is manage multiple threads for reading packets from a file, decoding different streams in different threads, and other similar things that a video player normally does - these are normally implemented by the application itself. Although I recall some features have been integrated into FFmpeg to simplify implementing these routines (like a packet queue).


What is the exact code to get the count of last-level cache misses on the Intel Kaby Lake architecture?

I read an interesting paper, entitled "A High-Resolution Side-Channel Attack on Last-Level Cache", and wanted to find out the index hash function for my own machine—i.e., Intel Core i7-7500U (Kaby Lake architecture)—following the leads from this work.
To reverse-engineer the hash function, the paper mentions the first step as:
for (n = 16; ; n++)
{
    // ignore any miss on first run
    for (fill = 0; !fill; fill++)
    {
        // set pmc to count LLC misses
        reset_pmc();
        for (a = 0; a < n; a++)
            // set_count * line_size = 2^19
            load(a * 2^19);
    }
    // get the LLC miss count
    if (read_pmc() > 0)
    {
        min = n;
        break;
    }
}
How can I code reset_pmc() and read_pmc() in C++? From all that I have read online so far, I think it requires inline assembly, but I have no clue what instructions to use to get the LLC miss count. I would be obliged if someone could specify the code for these two steps.
I am running Ubuntu 16.04.1 (64-bit) on VMware workstation.
P.S.: I found mention of the LONGEST_LAT_CACHE.REFERENCES and LONGEST_LAT_CACHE.MISSES events in Chapter 18 of Volume 3B of the Intel Architectures Software Developer's Manual, but I do not know how to use them.
You can use perf as Cody suggested to measure the events from outside the code, but I suspect from your code sample that you need fine-grained, programmatic access to the performance counters.
To do that, you need to enable user-mode reading of the counters, and also have a way to program them. Since those are restricted operations, you need at least some help from the OS kernel. Rolling your own solution is going to be pretty difficult, but luckily there are several existing solutions for Ubuntu 16.04:
Andi Kleen's jevents library, which among other things lets you read PMU events from user space. I haven't personally used this part of pmu-tools, but the stuff I have used has been high quality. It seems to use the existing perf_events syscalls for counter programming, so it doesn't need a custom kernel module.
The libpfc library is a from-scratch implementation of a kernel module and userland code that allows userland reading of the performance counters. I've used this and it works well. You install the kernel module which allows you to program the PMU, and then use the API exposed by libpfc to read the counters from userspace (the calls boil down to rdpmc instructions). It is the most accurate and precise way to read the counters, and it includes "overhead subtraction" functionality which can give you the true PMU counts for the measured region by subtracting out the events caused by the PMU read code itself. You need to pin to a single core for the counts to make sense, and you will get bogus results if your process is interrupted.
Intel's open-sourced Processor Counter Monitor library. I haven't tried this on Linux, but I used its predecessor library, the very similarly named1 Performance Counter Monitor, on Windows, and it worked. On Windows it needs a kernel driver, but on Linux it seems you can either use a driver or have it go through perf_events.
Use the likwid library's Marker API functionality. Likwid has been around for a while and seems well supported. I have used likwid in the past, but only to measure whole processes in a manner similar to perf stat, not with the marker API. To use the marker API you still need to run your process as a child of the likwid measurement process, but you can programmatically read the counter values within your process, which is what you need (as I understand it). I'm not sure how likwid sets up and reads the counters when the marker API is used.
So you've got a lot of options! I think all of them could work, but I can personally vouch for libpfc since I've used it myself for the same purpose on Ubuntu 16.04. The project is actively developed and probably the most accurate (least overhead) of the above. So I'd probably start with that one.
All of the solutions above should be able to work for Kaby Lake, since the functionality of each successive "Performance Monitoring Architecture" seems to generally be a superset of the prior one, and the API is generally preserved. In the case of libpfc, however, the author has restricted it to only support Haswell's architecture (PMA v3), but you just need to change one line of code locally to fix that.
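For a rough illustration of the perf_events route (which jevents builds on, and PCM can use), here is a minimal sketch that counts LLC read misses around a code region using the raw perf_event_open(2) syscall; the event selection is illustrative:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open a counter for last-level-cache read misses on the calling thread.
static int open_llc_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;       // start disabled; enable explicitly below
    attr.exclude_kernel = 1; // count user-space events only
    return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/,
                        -1 /*any cpu*/, -1 /*no group*/, 0);
}

int main() {
    int fd = open_llc_miss_counter();
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);    // roughly the reset_pmc() step
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... code under measurement goes here ...

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));     // roughly the read_pmc() step
    std::printf("LLC misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}

Note that each reset/read here goes through a syscall, so the overhead is higher than the rdpmc-based approach libpfc uses, but no extra kernel module is required.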
1 Indeed, they are both commonly called by their acronym, PCM, and I suspect that the new project is simply the officially open sourced continuation of the old PCM project (which was also available in source form, but without a mechanism for community contribution).
I would use PAPI; see http://icl.cs.utk.edu/PAPI/
It is a cross-platform solution with a lot of support, especially from the HPC community.

Why does Apache Kafka Streams use RocksDB, and how is it possible to change it?

While investigating the new features in Apache Kafka 0.9 and 0.10,
we used KStreams and KTables. There is an interesting fact: Kafka Streams uses RocksDB internally.
See Introducing Kafka Streams: Stream Processing Made Simple.
RocksDB is not written in a JVM-compatible language, so deployment needs careful handling, as it requires an extra (OS-dependent) shared library.
This raises two simple questions:
Why Apache Kafka Streams uses RocksDB?
How is it possible to change it?
I tried to search for the answer, but I only found the implicit reason that RocksDB is very fast, handling on the order of millions of operations per second.
On the other hand, there are some DBs coded in Java, and perhaps end to end they could perform just as well, since they do not go through JNI.
RocksDB is used for several (internal) reasons (as you already mentioned, for example its performance). Conceptually, Kafka Streams does not need RocksDB - it is used as an internal key-value cache, and any other store offering similar functionality would work, too.
Comment from @miguno below (rephrased):
One important advantage of RocksDB in contrast to pure in-memory key-value stores is its ability to write to disk. Thus, a state larger than the available main memory can be supported by Kafka Streams.
Comment from @miguno above:
FYI: "RocksDB is not written in JVM compatible language, so it needs careful handling of the deployment, as it needs extra shared library (OS dependent)." As a user of Kafka Streams you don't need to install anything.
Using the Kafka Streams DSL, as of the 0.10.2 release (KAFKA-3825) it is possible to plug in custom state stores and thus to use a different key-value store.
Using the Kafka Streams Processor API, you can implement your own store via the StateStore interface and connect it to a processor node in your topology.

Let nvidia K20c use old stream management way?

Starting with the K20, different streams become fully concurrent (they used to be concurrent only at the edges).
However, my program needs the old behavior, or else I need to do a lot of synchronization to solve the dependency problem.
Is it possible to switch stream management back to the old way?
From the CUDA C Programming Guide section on Asynchronous Concurrent Execution:
A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).
If the application relied on the Compute Capability 2.x and 3.0 implementation of streams, then the program violates the definition of streams, and any change to the CUDA driver (e.g. the queuing of per-stream requests) or new hardware will break the program.
If you need a temporary workaround, I would suggest moving all work to a single user-defined stream. This may impact performance, but it is likely the only temporary workaround.
Can you express the kernel dependencies with cudaEvent_t objects?
The Streams and Concurrency Webinar shows some quick code snippets on how to use events. Some of the details of that presentation are only applicable to pre-Kepler hardware, but I'm assuming from the original question that you're familiar with how things have changed since Fermi now that there are multiple command queues.
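As a rough CUDA C++ sketch of the event-based approach (kernel names, sizes, and the buffer are made up for illustration), one stream can be made to wait on work submitted to another:

#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaEvent_t done;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    kernelA<<<1, 256, 0, streamA>>>(d);
    cudaEventRecord(done, streamA);        // marks completion of kernelA's work
    cudaStreamWaitEvent(streamB, done, 0); // streamB blocks until 'done' fires
    kernelB<<<1, 256, 0, streamB>>>(d);    // now guaranteed to run after kernelA

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

This expresses only the dependencies you actually need, so independent work in the two streams can still overlap.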

Interaction of two c/c++ programs

I'm in complete lack of understanding in this. Maybe this is too broad for stack, but here it goes:
Suppose I have two programs (written in C/C++) running simultaneously, say A and B, with different PIDs.
What are the options for making them interact with each other? For instance, how do I pass information from one to another, like having one wait for a signal from the other and respond accordingly?
I know MPI, but MPI normally works for programs that are compiled from the same source (so it is more for parallel computing than for interaction between completely different programs built to talk to each other).
Thanks
You should look for "IPC" (inter-process communication). There are several types:
pipes
signals
shared memory
message queues
semaphores
files (per the suggestion of @JonathanLeffler :-)
RPC (suggested by @sftrabbit)
Which is usually more geared towards Client/Server
CORBA
D-Bus
You use one of the many interprocess communication mechanisms, like pipes (one application writes bytes into a pipe, the other reads from it; imagine stdin/stdout) or shared memory (a region of memory is mapped into both programs' virtual address spaces and they can communicate through it).
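As an illustration of the pipe option, here is a minimal sketch of two independent programs communicating through a named pipe on a POSIX system (the path /tmp/demo_fifo is arbitrary; error handling is omitted):

// writer.cpp - built and run as one process
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    mkfifo("/tmp/demo_fifo", 0666);            // create the pipe if it doesn't exist
    int fd = open("/tmp/demo_fifo", O_WRONLY); // blocks until a reader opens it
    const char msg[] = "hello from A";
    write(fd, msg, sizeof(msg));
    close(fd);
    return 0;
}

// reader.cpp - built and run as a completely separate process
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/tmp/demo_fifo", O_RDONLY); // blocks until a writer opens it
    char buf[64] = {0};
    read(fd, buf, sizeof(buf) - 1);
    std::printf("B received: %s\n", buf);
    close(fd);
    return 0;
}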
The same source doesn't matter - once your programs are compiled the system doesn't know or care where they came from.
There are different ways to communicate between them depending on how much data, how fast, one-way or bidirectional, predictable rate, etc.
The simplest is possibly just to use the network - note that if you are on the same machine, the network stack will automatically use a higher-performance mechanism to actually send the data (i.e. shared memory).
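For example, a minimal sketch of two processes talking over loopback TCP (port 5000 is arbitrary; error handling is omitted):

// server.cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(ls, (sockaddr *)&addr, sizeof(addr));
    listen(ls, 1);
    int conn = accept(ls, nullptr, nullptr);  // blocks until the client connects
    char buf[64] = {0};
    read(conn, buf, sizeof(buf) - 1);
    std::printf("server received: %s\n", buf);
    close(conn);
    close(ls);
    return 0;
}

// client.cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>

int main() {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    connect(s, (sockaddr *)&addr, sizeof(addr));
    const char msg[] = "hello from the other process";
    write(s, msg, std::strlen(msg));
    close(s);
    return 0;
}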

Buffer underrun logic problem, threading tutorial?

Ok, I tried all sorts of titles and they all failed (so if someone comes up with a better title, feel free to edit it :P)
I have the following problem: I am using an API to access hardware, which I didn't write; to add libraries to that API I need to inherit from the API interface, and the API does everything.
I put a music generator library into that API. The problem is that the mentioned API only calls the music library when the buffer is empty, and asks for a hardcoded amount of data (exactly 1024*16 samples... dunno why).
This means that the music generator library cannot use the CPU's full potential while playing music. Even if the music library is not keeping up, the CPU usage remains low (like 3%), so in parts of the music where there is too much complex stuff, the buffer underruns (i.e. the sound card plays the empty area of the buffer, because the music library function hasn't returned yet).
Tweaking the hardcoded number would only make the software work on some machines and not on others, depending on several factors...
So I came up with two solutions: hack the API with some new buffer logic, but I haven't figured out anything in that area.
Or the one whose logic I actually figured out: give the music library its own thread. It will have its own separate buffer that it fills all the time; when the API calls the music library for more data, instead of generating it, it will simply copy the data from that separate buffer to the sound card buffer, and then resume generating music.
My problem is that although I have several years of programming experience, I have always avoided multi-threading, and I don't even know where to start...
The question is: Can someone find another solution, OR point me into a place that will give me info on how to implement my threaded solution?
EDIT:
I am not READING files, I am GENERATING, or CALCULATING, the music, got it? This is NOT a .wav or .ogg library. This is why I mentioned CPU time: if I could use 100% of the CPU, I would never get an underrun, but I can only use the CPU in the short time between the program realizing that the buffer is reaching its end and the actual end of the buffer, and this time is sometimes less than the time the program takes to calculate the music.
I believe that the solution with a separate thread that prepares data for the library, so that it is ready when requested, is the best way to reduce latency and solve this problem. One thread generates music data and stores it in the buffer, and the API's thread gets data from that buffer when it needs it. In this case you need to synchronize access to the buffer, whether you are reading or writing, and make sure the buffer doesn't grow too big when the API is too slow to consume it. To implement this, you need a thread, mutex and condition primitives from a threading library, and two flags: one to indicate when a stop is requested, and another to ask the thread to pause filling the buffer when it is getting too big. I'd recommend using the Boost Thread library for C++; here are some useful articles with examples that come to mind (see also the sketch after the links):
Threading with Boost - Part I: Creating Threads
Threading with Boost - Part II: Threading Challenges
The Boost.Threads Library
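For illustration, here is a minimal sketch of such a buffer using C++11 standard primitives (the same structure maps directly onto Boost.Thread; the sample type, sizes, and names are made up):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

class SampleBuffer {
    std::deque<short> buf_;
    std::mutex m_;
    std::condition_variable space_, data_;
    bool stop_ = false;
    static const size_t kMax = 1024 * 64; // pause the producer above this size
public:
    // Called by the generator thread: blocks while the buffer is full.
    void push(const std::vector<short>& chunk) {
        std::unique_lock<std::mutex> lk(m_);
        space_.wait(lk, [&]{ return buf_.size() < kMax || stop_; });
        buf_.insert(buf_.end(), chunk.begin(), chunk.end());
        data_.notify_one();
    }
    // Called from the API callback: copies out exactly n samples,
    // blocking until enough data has been generated.
    void pop(short* out, size_t n) {
        std::unique_lock<std::mutex> lk(m_);
        data_.wait(lk, [&]{ return buf_.size() >= n || stop_; });
        for (size_t i = 0; i < n && !buf_.empty(); ++i) {
            out[i] = buf_.front();
            buf_.pop_front();
        }
        space_.notify_one();
    }
    // Wakes both sides so the threads can shut down cleanly.
    void stop() {
        std::lock_guard<std::mutex> lk(m_);
        stop_ = true;
        space_.notify_all();
        data_.notify_all();
    }
};

The two condition variables implement the two flags discussed above: the producer sleeps when the buffer is full, and the API callback sleeps only if the buffer has not yet been filled far enough ahead.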
You don't necessarily need a new thread to solve this problem. Your operating system may provide an asynchronous read operation; for example, on Windows, you would open the file with the FILE_FLAG_OVERLAPPED flag to make any operations on it asynchronous.
If your operating system does support this functionality, you could make a large buffer that can hold a few calls' worth of data. When the application starts, you fill the buffer; then, once it's filled, you can pass off the first section of the buffer to the API. When the API returns, you read in more data to overwrite the section of the buffer that your last API call consumed. Because the read is asynchronous, it fills the buffer while the API is playing music.
The implementation could be more complex than this, e.g. using a circular buffer, or waiting until a few of the sections have been consumed and then reading in multiple sections at once instead of one section at a time.
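For reference, a minimal sketch of the overlapped-read approach described above (the file name and buffer size are illustrative; error handling is mostly omitted):

#include <windows.h>
#include <cstdio>

int main() {
    // FILE_FLAG_OVERLAPPED makes reads on this handle asynchronous.
    HANDLE h = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buf[1024 * 16];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);

    // Returns immediately with ERROR_IO_PENDING; the read proceeds in the background.
    if (!ReadFile(h, buf, sizeof(buf), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) return 1;

    // ... do other work here, e.g. hand the previous chunk to the API ...

    DWORD got = 0;
    GetOverlappedResult(h, &ov, &got, TRUE); // wait for the read to complete
    std::printf("read %lu bytes\n", (unsigned long)got);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}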