IDirectXVideoDecoder performance - c++

I am trying to understand some of the nuances of IDirectXVideoDecoder. CAVEAT: The conclusions stated below are not based on the DirectX docs or any other official source, but are my own observations and understandings. That said...
In normal use, IDirectXVideoDecoder is easily fast enough to process frames at any sensible frame rate. However, if you aren't rendering the frames based on timecodes and instead are going "as fast as possible," then you eventually run into a bottleneck in the decoder and IDirectXVideoDecoder::BeginFrame starts returning E_PENDING.
Apparently at any given time, a system can only have X frames active in the decoder. Attempting to submit X + 1 gives you this error until one of the previous frames completes. On my (somewhat older) box, X == 4. On my newer box, X == 8.
Which brings us to my first question:
Q1: How do I find out how many simultaneous decoding operations a system supports? What property/attribute describes this?
Then there's the question of what to do when you hit this error. I can think of 3 different approaches, but they all have drawbacks:
1) Just do a loop waiting for a decoder to free up:
do {
hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
} while(hr == E_PENDING);
On the plus side, this approach gives the fastest throughput. On the minus side, this causes a massive amount of CPU time to get burned waiting for a decoder to free up (>93% of my execution time gets spent here).
2) Do a loop, and add a Sleep:
do {
hr = m_pVideoDecoder->BeginFrame(pSurface9Video[y], NULL);
if (hr == E_PENDING)
Sleep(1);
} while(hr == E_PENDING);
On the plus side, this significantly drops the CPU utilization. But on the minus side, it ends up slowing down the total throughput.
In trying to figure out why it's slowing things down, I made a few observations:
Normal time to process a frame on my system is ~4 milliseconds.
Sleep(1) can Sleep for as much as 8 milliseconds, even when there are CPUs available to run on.
Frames sent to the decoders aren't being added to a queue and decoded one at a time. It actually performs X decodings at the same time.
The result of all this is that if you try to Sleep, one of the decoders frequently ends up sitting idle.
3) Before submitting the next frame for decoding, wait for one of the previous frames to complete:
// LockRect doesn't return until the surface is ready.
D3DLOCKED_RECT lr;
// I don't think this matters. It may always return the whole frame.
RECT r = {0, 0, 2, 2};
hr = pSurface9Video[old]->LockRect(&lr, &r, D3DLOCK_READONLY);
if (SUCCEEDED(hr))
pSurface9Video[old]->UnlockRect();
This also drops the CPU usage, but it also has a throughput penalty. Maybe due to the 'surface' being in use longer than the 'decoder,' but more likely because the amount of time it takes to (pointlessly) transfer the frame back to memory.
Which brings us to the second question:
Q2: Is there some way here to maximize throughput without pointlessly pounding on the CPU?
Final thoughts:
It appears that LockRect must be doing a WaitForSingleObject. If I had access to that handle, waiting on it (without also copying the frame back) seems like it would be the best solution. But I can't figure out where to get it. I've tried GetDC, GetPrivateData, even looking at the debug data members for IDirect3DSurface9. I'm not finding it.
IDirectXVideoDecoder::EndFrame outputs a handle in a parameter named pHandleComplete. This sounds like exactly what I need. Unfortunately it is marked as "reserved" and doesn't seem to work. Unless there is a trick?
I'm pretty new to DirectX, so maybe I've got this all wrong?
Update 1:
Re Q1: Turns out both my machines only support 4 decoders (oops). This will make it harder to determine which property I'm looking for. While very few properties (none actually) return 8 on one machine and 4 on the other, there are several that return 4.
Re Q2: Since the (4) decoders are (presumably) shared between apps, the idea of finding out if the decoding is complete by (somehow) querying to see if the decoder is idle is a non-starter.
The call to create surfaces doesn't create handles (handle count stays the same across the call). So the idea of waiting on the "surface's handle" doesn't seem like it's going to pan out either.
The only idea I have left is to see if the surface is available by making some other call (besides LockRect) using it. So far I've tried calling StretchRect and ColorFill on a surface that the decoder is "still using," but they complete without error instead of blocking like LockRect.
There may not be a better answer here. So far it appears that for best performance, I should use #1. If CPU utilization is an issue, #2 is better than #1. If I'm going to be reading the surfaces back to memory anyway, then #3 makes sense, otherwise, stick with 1 or 2.

Related

How to fix execution speed inconsistencies in C++

This is most noticeable on graphic files. Let's take as an example the OpenGL base program (a spinning triangle).
Whenever I run one normally, with no other apps open in the background, it will spin slowly, but when I run a game in the background, it starts spinning like mad. It seems as if the computer doesn't allocate enough memory for the programs to run at maximum speed, and paradoxically, doing resource-consuming stuff will accelerate it because it gets more memory.
The only way I found to fix this partially is to put a higher value in the Sleep function, however this doesn't fix it completely nor is a consistent solution, as other problems may arise from it. Is there any good way to fix this and make the program run consistently?
This mostly happens because you are not capping your FPS so there's nothing preventing your render loop from being called as much as possible and your logic (that is controlling rotation is executing in same loop).
What happens is that most GPUs have power management so they keep their frequencies low when there's no demand, opening an expensive game makes your GPU bump up its power thus rendering a lot faster, thus calling your rendering loop more times.
To prevent this (and to separate logic from rendering time in general) you must control the frame rate and use the time as an input for your rotation, something like:
auto elapsed = ..
while (!exit) {
render();
auto delta = now - elapsed;
if (delta < time_per_frame)
delay(TIME_PER_FRAME - delta);
updateLogic(delta);
}
For starters, you need to understand what's going on with your program. It has nothing to do with memory, and I don't see a reason to think about memory.
Opening other programs could make your CPU go faster because of the load (doubtfull, but clearly more likely than memory allocation).
The other programs could be messing with some setting.
If you're using sleep(), signals can interrupt the call (no one ever looks at the return code of the function; there's a reason for it to be uint sleep(uint) and not void sleep(uint)).
If you can, don't use sleep. And if you're going to, check. sleep doesn't grant you that the whole time has passed (IMHO bad design, but I'm not a POSIX fan).
The usual behaviour I think would be to have your function called periodically as a callback. If you're going to do some sort of delay or sleep, you should check that the time has passed.
Taken that you want your logic tied to the render and you use some function like sleep that can be interrupted (based on the other answer):
while (!exit) {
auto startofFrame= now();
render();
auto toDelay= startOfFrame + TIME_PER_FRAME - now();
while (toDelay > 0) {
delay(toDelay);
toDelay= startOfFrame + TIME_PER_FRAME - now();
}
updateLogic();
}

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc. I'm using Objective-C.
There are a couple of problems with this method:
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) if you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either way for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime is now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

OpenGL, measuring rendering time on gpu

I have some big performance issues here
So I would like to take some measurements on the gpu side.
By reading this thread I wrote this code around my draw functions, including the gl error check and the swapBuffers() (auto swapping is indeed disabled)
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
draw(gl4);
checkGlError(gl4);
glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous ( the driver can buffer up to X commands before sending them all together in one batch), my question regards essentially if:
the code above is correct
I am right assuming that at the begin of a new frame all the previous GL commands (from the previous frame) have been sent, executed and terminated on the gpu
I am right assuming that when I get query result with glGetQueryObjectiv and GL_QUERY_RESULT all the GL commands so far have been terminated? That is OpenGL will wait until the result become available (from the thread)?
Yes, when you query the timer it will block until the data is available, ie until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check if the results are already available and only then read them then. That might require less straightforward code to keep tabs on open queries and periodically checking them, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
Edit: To address your second question, swapping the buffer doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. Which is also the more wanted behaviour because ideally you want to start with your next frame right away and keep the CPUs command buffer filled. Check the implementations documentation for more info though, as that is implementation defined.
Edit 2: Checking for errors might end up being an implicit synchronization by the way, so you will probably see the command buffer emptying when you wait for error checking in the command stream.

Effective frame limiting

I have a simulation that I am trying to convert to "real time". I say "real time" because its okay for performance to dip if needed (slowing down time for the observers/clients too). However, if there is a small number of objects, I want to limit the performance so that it runs at a steady frame rate (~100 FPS in this case).
I tried sleep() and Sleep() for linux and windows respectively but it doesn't seem to be accurate enough as the FPS really dips to a fraction of what I was aiming for. I suppose this scenario is common for games, especially online games but I was not able to find any helpful material on the subject. What is the preferable way of frame limiting? Is there a sleep method that can guarantee that it won't give up more time than what was specified?
Note: I'm running this on 2 different clusters (linux and windows) and all nodes only have built-in video. So I have to implement limiting on both platforms and it shouldn't be video card based (if there is even such a thing). I also need to implement the limiting on just one thread/node because there is already synchronization between nodes and the others would automatically be limited if one thread is properly limited.
Edit: some pseudo code that shows how I implemented the current limiter:
while (ProcessControlMessages())
{
uint64 tStart;
SimulateFrame();
uint64 newT =_context.GetTimeMs64();
if (newT - tStart < DESIRED_FRAME_RATE_DURATION)
this_thread::sleep_for(chrono::milliseconds(DESIRED_FRAME_RATE_DURATION - (newT - tStart)));
}
I was also thinking if I could do the limiting every N frames, where N is a fraction of the desired frame rate. I'll give it a try and report back.
For games a frame limiter is usually inadequate. Instead, the methods that update the game state (in your case SimulateFrame()) are kept frame rate independent. E.g. if you want to move an object, then the actual offset is the object's speed multiplied with the last frame's duration. Similarly, you can do this for all kind of calculations.
This approach has the advantage that the user gets maximum frame rate while maintaining the real-timeness. However, you should watch out that the frame durations don't get too small ( < 1 ms). This could result in inaccurate calculations. In this case a small sleep with a fixed duration could help.
This is how games usually handle this problem. You have to check if your simulation is appropriate for this technique, too.
Instead of having each frame try to sleep for long enough to be a full frame, have them sleep to try to average out. Keep a global/thread owned time count. for each frame have a "desired earliest end time," calculated from the previous desired earliest end time, rather than from the current time
tGoalEndTime = _context.GetTimeMS64() + DESIRED_FRAME_RATE_DURATION;
while (ProcessControlMessages())
{
SimulateFrame();
uint64 end =_context.GetTimeMs64();
if (end < tGoalEndTime) {
this_thread::sleep_for(chrono::milliseconds(tGoalEndTime - end)));
tGoalEndTime += DESIRED_FRAME_RATE_DURATION;
} else {
tGoalEndTime = end; // we ran over, pretend we didn't and keep going
}
Note: this uses your example's sleep_for because I wanted to show the minimum number of changes to enact it. sleep_until works better here.
The trick is that any frame that sleeps too long immediately causes the next few frames to rush to catch up.
Note: You cannot get any timing within 2ms (20% jitter on 100fps) on modern consumer OSs. The quantum for threads on most consumer OSs is around 100ms, so the instant you sleep, you may sleep for multiple quantums before it is your turn. sleep_until may use a OS specific technique to have less jitter, but you can't rely on it.

Achieving game engine determinism with threading

I would like to achieve determinism in my game engine, in order to be able to save and replay input sequences and to make networking easier.
My engine currently uses a variable timestep: every frame I calculate the time it took to update/draw the last one and pass it to my entities' update method. This makes 1000FPS games seem as fast ad 30FPS games, but introduces undeterministic behavior.
A solution could be fixing the game to 60FPS, but it would make input more delayed and wouldn't get the benefits of higher framerates.
So I've tried using a thread (which constantly calls update(1) then sleeps for 16ms) and draw as fast as possible in the game loop. It kind of works, but it crashes often and my games become unplayable.
Is there a way to implement threading in my game loop to achieve determinism without having to rewrite all games that depend on the engine?
You should separate game frames from graphical frames. The graphical frames should only display the graphics, nothing else. For the replay it won't matter how many graphical frames your computer was able to execute, be it 30 per second or 1000 per second, the replaying computer will likely replay it with a different graphical frame rate.
But you should indeed fix the gameframes. E.g. to 100 gameframes per second. In the gameframe the game logic is executed: stuff that is relevant for your game (and the replay).
Your gameloop should execute graphical frames whenever there is no game frame necessary, so if you fix your game to 100 gameframes per second that's 0.01 seconds per gameframe. If your computer only needed 0.001 to execute that logic in the gameframe, the other 0.009 seconds are left for repeating graphical frames.
This is a small but incomplete and not 100% accurate example:
uint16_t const GAME_FRAMERATE = 100;
uint16_t const SKIP_TICKS = 1000 / GAME_FRAMERATE;
uint16_t next_game_tick;
Timer sinceLoopStarted = Timer(); // Millisecond timer starting at 0
unsigned long next_game_tick = sinceLoopStarted.getMilliseconds();
while (gameIsRunning)
{
//! Game Frames
while (sinceLoopStarted.getMilliseconds() > next_game_tick)
{
executeGamelogic();
next_game_tick += SKIP_TICKS;
}
//! Graphical Frames
render();
}
The following link contains very good and complete information about creating an accurate gameloop:
http://www.koonsolo.com/news/dewitters-gameloop/
To be deterministic across a network, you need a single point of truth, commonly called "the server". There is a saying in the game community that goes "the client is in the hands of the enemy". That's true. You cannot trust anything that is calculated on the client for a fair game.
If for example your game gets easier if for some reasons your thread only updates 59 times a second instead of 60, people will find out. Maybe at the start they won't even be malicious. They just had their machines under full load at the time and your process didn't get to 60 times a second.
Once you have a server (maybe even in-process as a thread in single player) that does not care for graphics or update cycles and runs at it's own speed, it's deterministic enough to at least get the same results for all players. It might still not be 100% deterministic based on the fact that the computer is not real time. Even if you tell it to update every $frequence, it might not, due to other processes on the computer taking too much load.
The server and clients need to communicate, so the server needs to send a copy of it's state (for performance maybe a delta from the last copy) to each client. The client can draw this copy at the best speed available.
If your game is crashing with the thread, maybe it's an option to actually put "the server" out of process and communicate via network, this way you will find out pretty fast, which variables would have needed locks because if you just move them to another project, your client will no longer compile.
Separate game logic and graphics into different threads . The game logic thread should run at a constant speed (say, it updates 60 times per second, or even higher if your logic isn't too complicated, to achieve smoother game play ). Then, your graphics thread should always draw the latest info provided by the logic thread as fast as possible to achieve high framerates.
In order to prevent partial data from being drawn, you should probably use some sort of double buffering, where the logic thread writes to one buffer, and the graphics thread reads from the other. Then switch the buffers every time the logic thread has done one update.
This should make sure you're always using the computer's graphics hardware to its fullest. Of course, this does mean you're putting constraints on the minimum cpu speed.
I don't know if this will help but, if I remember correctly, Doom stored your input sequences and used them to generate the AI behaviour and some other things. A demo lump in Doom would be a series of numbers representing not the state of the game, but your input. From that input the game would be able to reconstruct what happened and, thus, achieve some kind of determinism ... Though I remember it going out of sync sometimes.