Matching images based on timestamps acquired from two different threads - c++

For a project I am working on I am using two different threads to acquire images from different hardware on a Windows system using Opencv and C++. For the purposes of this post we will call them camera1 and camera2. Due to varying frame rates I cannot capture both from both camera1 and camera2 on the same thread or try to capture one image at a time. Instead, I have buffers set up for both cameras. Camera1 runs at over double the frame rate of camera2.
A third thread gets a frame from camera2 and tries to match that to an image taken by camera1 at the nearest point in time. The problem I am having is that I can't find a good way to match an image from one source to an image that was acquired at roughly the same time in the other source.
Is there a method of assigning time stamps to these images that is accurate when used in separate threads that are not high priority? If not, is there another design method that would be better suited to a system like this.
I have had a hard time finding any information as to how far off methods of keeping track of time in Windows like clock() and queryperformancecounter are when used in separate threads on different CPUS. I have read that they are not accurate when used in separate threads, but I am acquiring frames at roughly 20 frames per second so even if they are off, it may be close enough to still work well.
Thank you in advance! Let me know if I should clear anything up or explain better.
Edit:
Unfortunately I don't think there is any way for me to get timestamps from the drivers or the hardware. I think I have figured out a way to pretty accurately determine when an image is actually captured relative to each thread though. The part I don't really know how to deal with is if I use clock() or something similar in two threads that are executing on separate cores, how far off the time stamps could be. If I could count on the time stamps on two different cores being within about 25ms from each other, that would be ideal. I just haven't really been able to figure out a good way to measure what the difference is between the two. I have done some more research and it appears that timeGetTime in the Windows API is not effected too much by being called by separate cpu cores. I am going to try that out.

To get a satisfying answer, you need to answer the following questions first:
How closely do the images need to match?
How much jitter / error in timestamping can you live with?
How much latency you expect between the moment the image is captured and the moment your thread receives it and can timestamp it?
Is the latency expected to be about the same for both cameras?
If you have a mostly static scene, then you can live with higher timestamping jitter and latency than if you have a very dynamic scene.
Precision of the clock() is not going to be the main factor in your timestamping error -- after all, you should care about relative error (how close the timestamps are between two threads), not absolute error (how close the timestamp is to the true atomic time).
On the other hand latencies, and especially jitter in latencies, between the hardware and your thread will drive most of the error in your timestamping. If you can find a way to get a timestamp in hardware or at least in the driver, it will go a long way toward improving accuracy of your timestamps.

Related

How to detect camera frame loss using Windows media API like Media Foundation or DirectShow?

I am writing an application for Windows that runs a CUDA accelerated HDR algorithm. I've set up an external image signal processor device that presents as a UVC device, and delivers 60 frames per second to the Windows machine over USB 3.0.
Every "even" frame is a more underexposed frame, and every "odd" frame is a more overexposed frame, which allows my CUDA code perform a modified Mertens exposure fusion algorithm to generate a high quality, high dynamic range image.
Very abstract example of Mertens exposure fusion algorithm here
My only problem is that I don't know how to know when I'm missing frames, since the only camera API I have interfaced with on Windows (Media Foundation) doesn't make it obvious that a frame I grab with IMFSourceReader::ReadSample isn't the frame that was received after the last one I grabbed.
Is there any way that I can guarantee that I am not missing frames, or at least easily and reliably detect when I have, using a Windows available API like Media Foundation or DirectShow?
It wouldn't be such a big deal to miss a frame and then have to purposefully "skip" the next frame in order to grab the next oversampled or undersampled frame to pair with the last frame we grabbed, but I would need to know how many frames were actually missed since a frame was last grabbed.
Thanks!
There is IAMDroppedFrames::GetNumDropped method in DirectShow and chances are that it can be retrieved through Media Foundation as well (never tried - they are possibly obtainable with a method similar to this).
The GetNumDropped method retrieves the total number of frames that the filter has dropped since it started streaming.
However I would question its reliability. The reason is that with these both APIs, the attribute which is more or less reliable is a time stamp of a frame. Capture devices can flexibly reduce frame rate for a few reasons, including both external like low light conditions and internal like slow blocking processing downstream in the pipeline. This makes it hard to distinguish between odd and even frames, but time stamp remains accurate and you can apply frame rate math to convert to frame indices.
In your scenario I would however rather detect large gaps in frame times to identify possible gap and continuity loss, and from there run algorithm that compares exposure on next a few consecutive frames to get back to sync with under-/overexposition. Sounds like a more reliable way out.
After all this exposure problem is highly likely to be pretty much specific to the hardware you are using.
Normally MFSampleExtension_Discontinuity is here for this. When you use IMFSourceReader::ReadSample, check this.

How to improve performance of an OpenCV algorithm written in C++

I am executing a computer vision program written in C++ and OpenCV on Raspberry Pi 3B and my camera module is picamera. Roughly I am calculating the deviation amount from the road and sending it to another platform.
Currently, my main methodology can not become simpler i.e, i can't remove any matrix operations. However, I need to increase my throughput further. For the time being, I receive 19-20 results per second. My camera FPS is set to 30.
I was wondering are there any way to increase throughput? For example, I have tried to use optimization levels on g++ (-O2, -O3) and didn't observe any increase.
Another option is to use multithreading since my throuhput is lower than camera FPS maybe i can capture another frame while other thread processing the already captured frame. However, I don't have any experience in multithreading therefore I wanted to ask whether anyone can approve this strategy or not since I have limited time I want to use my effort in the most fruitful way.
Any other suggestions are welcome. Thank you for your help.

Pre-loading audio buffers - what is reasonable and reliable?

I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, which provides it in 10ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10ms chunk of audio, then relies on regular 10ms events to load subsequent chunks.
Hopefully this means there will always be 10-20ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18months old)?
Previously, one readily could pre-load at least half a second worth of audio into via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500ms seems like a lot safer margin than 10-20ms... but if you want both event-driven and exclusive-mode, the IAudioRenderClient documentation doesn't exactly make it clear if it is or is not possible to pre-load more than a single IAudioRenderClient buffer worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non Real-Time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
Like with everything else in computing, the wider you go you:
increase throughput
increase latency
I'd say 512 samples buffer is the minimum typically used for non-demanding latency wise applications. I've seen up to 4k buffers. Memory use wise that's still pretty much nothing for contemporary devices - a mere 8 kilobytes of memory per channel for 16 bit audio. You have better playback stability and lower waste of CPU cycles. For audio applications that means you can process more tracks with more DSP before audio begins skipping.
On the other end - I've seen only a few professional audio interfaces, which could handle 32 sample buffers. Most are able to achieve 128 samples, naturally you are still limited to lower channel and effect count, even with professional hardware you increase buffering as your project gets larger, lower it back and disable tracks or effects when you need "real time" to capture a performance. In terms of lowest possible latency actually the same box is capable of achieving lower latency with Linux and a custom real time kernel than on Windows where you are not allowed to do such things. Keep in mind a 64 sample buffer might sound like 8 msec of latency in theory, but in reality it is more like double - because you have both input and output latency plus the processing latency.
For a music player where latency is not an issue you are perfectly fine with a larger buffer, for stuff like games you need to keep it lower for the sake of still having a degree of synchronization between what's going on visually and the sound - you simply cannot have your sound lag half a second behind the action, for music performance capturing together with already recorded material you need to have latency low. You should never go above what your use case requires, because a small buffer will needlessly add to CPU use and the odds of getting audio drop outs. 4k buffering for an audio player is just fine if you can live with half a second of latency between the moment you hit play and the moment you hear the song starting.
I've done a "kind of a hybrid" solution in my DAW project - since I wanted to employ GPGPU for its tremendous performance relative to the CPU I've split the work internally with two processing paths - 64 samples buffer for real time audio which is processed on the CPU, and another considerably wider buffer size for the data which is processed by the GPU. Naturally, they both come out through the "CPU buffer" for the sake of being synchronized perfectly, but the GPU path is "processed in advance" thus allowing higher throughput for already recorded data, and keeping CPU use lower so the real time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet, but not too much, knowing how much money the big fishes of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too much with GPUs" ever since Cuda and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data which is already recorded, and increases the size of a project which the DAW can handle tremendously.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (in nano seconds) and then passes that value along to Initialize. You can pass a larger value if you wish.
The down side to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly then this is not a problem. But if you had a sine generator for example, then the increased latency means that it would take longer for you to hear a change in frequency or amplitude.
Whether or not you have drop outs depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using preparing your samples? In general though, a modern CPU can handle a pretty low-latency. For comparison, ASIO audio devices run perfectly fine at 96kHz with a 2048 sample buffer (20 milliseconds) with multiple channels - no problem. ASIO uses a similar double buffering scheme.
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of final form of the question I submitted, what I had intended by "more extensive pre-loading" was not about the size of buffers used, so much as the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each with 512 samples played at 48kHz, creating over 600ms of buffering in a system with a cycle of 10.66ms.
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioCient::Start() in the WASAPI world, it appears that the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.

My multithreaded game is at 100% CPU all the time. How can I manage thread activity to reduce the CPU load?

I have a DirectX game which spawns 2 boost threads on a dual-core system: 1 for gameplay/rendering (normally split into their own threads on a quad-core CPU), and 1 other thread which procedurally-generates the gameworld. I believe that my audio middleware also spawns its own threads for playing SFX and music.
The game is always running at 100% on the CPU, which in turn can cause some sputtering from the audio system. I'm hoping that I can reduce the CPU load by better managing the activity of that Generation Thread. While sometimes I need it running full-speed, there are other times (when the player isn't moving much) when it is just updating constantly while not really doing a whole lot.
Is it possible / advisable to manually manage how active a thread is? If so what strategies can I use to do that? I keep seeing people say that sleep() functions aren't really recommended, but I don't really know what else to do yet.
Alternatively, am I barking up the wrong tree by trying to squeeze cycles out of thread management, and would I be better served by traditional profiling/optimization?
Getting to 100% processor utilization means that you don't have a game clock. You are probably rendering frames as fast as the machine allows. Still pretty hard to get 100% exactly if you use multiple threads, that indicates that you don't synchronize threads either.
This is likely to require a pretty drastic rewrite. The pace ought to be set by the main render loop, the one that copies the back-buffer to the video adapter. It sets your target FPS, frames-per-second. Not infrequently, you use the vertical blanking interval of the monitor for that, it solves tearing problems by ensuring that the monitor gets updated at the exact right time. Which automatically gets the render loop paced to the monitor refresh rate. Typically 60 times per second on LCD monitors. A timer is an alternative. This prevents the main thread from burning 100% core, assuming it can keep up with the FPS.
You now have a steady game clock tick, discrete moments in time at which things need to happen and jobs need to be completed in order to update the game state. Like checking for player input. Inside the render loop, check for mouse/keyboard/controller input and use anything you get to update the game world objects.
And in turn determines what worker threads need to do. They'll have the duration of one pass through the render loop to get the job done they need to do. You use a synchronization object to wake them up. And another one, each, to let them signal that they are done with the current game loop tick. Which stops them from burning core, they should constantly be waiting on the signal to start working on the next frame. Note that there is a balancing requirement. If a worker thread needs more than one game tick to get the job done then the render loop will fall behind and miss a video adapter frame update. Your video starts to stutter. This is in general impossible to eliminate completely, do make sure that it doesn't affect the absolute game clock.
Audio should be the easier problem to solve, you just need to keep the sound card buffers filled with sufficient data to survive a couple of frames worth of sound.
Falling behind on the target FPS is very easy to determine. You automatically compensate for that by lowering the target FPS. So the program still runs acceptably on a slow machine, just not as smoothly. Net effect is that you'll stop burning 100% core on all threads.
I don't know anything about your system so this answer may be way off. I assume that there is such a thing as audio output buffer of some sort that you can track the size of. When the size of the buffer is so small that there is a danger that audio may stop you should do something to refill the buffer.
That "something" may be as easy as temporarily setting the priority of audio thread to higher value. Come to think of why not set it higher from the start? That would solve all the problems, right? Even better, just lower the priority of world generator thread.
When I worked on a voice-comm app for gamers many years ago, we hit this problem a lot. Many games are written to use every ounce of CPU. As such, some gamest would starve our app (that ran in the background) from functioning - causing audio drops and lost network connections. Many of those games would also call SetThreadPriority and SetPriorityClass with the REALTIME flags to basically consume all the CPU quantums with disregard of anything else running on the system.
The typical fix we asked of game developers we partnered with was to simply insert a "Sleep(0)" call between each frame of their main game loop so that our threads wouldn't get stalled. I think we later added a switch in a software update to make our process run at a higher priority mode. Since then, Windows has gotten better about multitasking and thread priority with respect to these issues.

What is the correct way to calculate the FPS given that GPUs have a task queue and are asynchronous?

I always assumed that the correct way to calculate the FPS was to simply time how long it took to do an iteration of your draw loop. And much of the internet seems to be in accordance.
But!
Modern graphics card are treated as asynchronous servers, so the draw loop sends out drawing instructions for vertex/texture/etc data already on the GPU. These calls do not block the calling thread until the request on the GPU completes, they are simply added to the GPU's task queue. So the surely the 'traditional' (and rather ubiquitous) method is just measuring the call dispatch time?
What prompted me to ask was I had implemented the traditional method and it gave consistently absurdly high framerates, even if what was being rendered caused the animation to become choppy. Re-reading my OpenGL SuperBible brought me to glGenQueries which allow me to time sections of the rendering pipeline.
To summarise, is the 'traditional' way of calculating FPS totally defunct with (barely) modern graphics cards? If so why are the GPU profiling techniques relatively unknown?
Measuring fps is hard. It's made harder by the fact that various people who want to measure fps don't necessarily want to measure the same thing. So ask yourself this. Why do you want an fps number?
Before I go on and dive into all the pitfalls and potential solutions, I do want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be way worse, with SGI-type machines where the rendering actually happened on a graphics susbsystem that could be remote to the client (as in, physically remote). GL1.0 was actually defined in terms of client-server.
Anyways. Back to the problem at hand.
fps, meaning frames per second, really is trying to convey, in a single number, a rough idea of the performance of your application, in a number that can be directly related to things like the screen refresh rate. for a 1st level approximation of performance, it does an ok job. It breaks completely as soon as you want to delve into more fine-grained analysis.
The problem is really that the thing that matters most as far as "feeling of smoothness" of an application, is when the picture you drew ends up on the screen. The secondary thing that matters quite a bit too is how long it took between the time you triggered an action and when its effect shows up on screen (the total latency).
As an application draws a series of frames, it submits them at times s0, s1, s2, s3,... and they end up showing on screen at t0, t1, t2, t3,...
To feel smooth you need all the following things:
tn-sn is not too high (latency)
t(n+1)-t(n) is small (under 30ms)
there is also a hard constraint on the simulation delta time, which I'll talk about later.
When you measure the CPU time for your rendering, you end up measuring s1-s0 to approximate t1-t0. As it turns out, this, on average, is not far from the truth, as client code will never go "too far ahead" (this is assuming you're rendering frames all the time though. See below for other cases). What does happen in fact is that the GL will end up blocking the CPU (typically at SwapBuffer time) when it tries to go too far ahead. That blocking time is roughly the extra time taken by the GPU compared to the CPU on a single frame.
If you really want to measure t1-t0, as you mentioned in your own post, Queries are closer to it. But... Things are never really that simple. The first problem is that if you're CPU bound (meaning your CPU is not quick enough to always provide work to the GPU), then a part of the time t1-t0 is actually idle GPU time. That won't get captured by a Query. The next problem you hit is that depending on your environment (display compositing environment, vsync), queries may actually only measure the time your application spends on rendering to a back buffer, which is not the full rendering time (as the display has not been updated at that time). It does get you a rough idea of how long your rendering will take, but will not be precise either. Further note that Queries are also subject to the asynchronicity of the graphics part. So if your GPU is idle part of the time, the query may miss that part. (e.g. say your CPU is taking very long (100ms) to submit your frame. The the GPU executes the full frame in 10ms. Your query will likely report 10ms, even though the total processing time was closer to 100ms...).
Now, with respect to "event-based rendering" as opposed to continuous one I've discussed so far. fps for those types of workloads doesn't make much sense, as the goal is not to draw as many f per s as possible. There the natural metric for GPU performance is ms/f. That said, it is only a small part of the picture. What really matters there is the time it took from the time you decided you wanted to update the screen and the time it happened. Unfortunately, that number is hard to find: It typically starts when you receive an event that triggers the process and ends when the screen is updated (something that you can only measure with a camera capturing the screen output...).
The problem is that between the 2, you have potential overlap between the CPU and GPU processing, or not (or even, some delay between the time the CPU stops submitting commands and the GPU starts executing them). And that is completely up to the implementation to decide. The best you can do is to call glFinish at the end of the rendering to know for sure the GPU is done processing the commands you sent, and measure the time on the CPU. That solution does reduce the overall performance of the CPU side, and potentially the GPU side as well if you were going to submit the next event right after...
Last the discussion about the "hard constraint on simulation delta time":
A typical animation uses a delta time between frames to move the animation forward. The major problem is that for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1-t0 (so that when t1 shows, the time that actually was spent from the previous frame was indeed t1-t0). The problem of course is that you have no idea what t1-t0 is at the time you submit s1... So you typically use an approximation. Many just use s1-s0, but that can break down - e.g. SLI-type systems can have some delays in AFR rendering between the various GPUs). You could also try to use an approximation of t1-t0 (or more likely t0-t(-1)) through queries. The result of getting this wrong is mostly likely micro-stuttering on SLI systems.
The most robust solution is to say "lock to 30fps, and always use 1/30s". It's also the one that allows the least leeway on content and hardware, as you have to ensure your rendering can indeed be done in those 33ms... But is what some console developers choose to do (fixed hardware makes it somewhat simpler).
"And much of the internet seems to be in accordance." doesn't seem totally correct for me:
Most of the publications would measure how long it takes to MANY iterations, then normalize. This way you can reasonably assume, that filling (and aemptying) the pipe are only a small part of the overall time.