How exactly are you meant to implement an IReferenceClock that can be set via IMediaFilter::SetSyncSource?
I have a system that implements GetTime and AdviseTime, UnadviseTime. When a stream starts playing it sets a base time via AdviseTime and then increases Stream Time for each subsequent advise.
However how am I supposed to know when a new graph has run? I need to set a zero point for a given reference clock. Otherwise if I create a reference clock and then, 10 seconds later, I start the graph I am now in the position that I don't know whether I should be 10 seconds down the playback or whether I should be starting from 0. Obviously the base time will say that I am starting from 0 but have I just stalled for 10 seconds and do I need to drop a bunch of frames?
I really can't seem to figure out how to write a proper IReferenceClock so any hints or ideas would be hugely appreciated.
Edit: One example of a problem I am having is that I have 2 graphs and 2 videos. The audio from both videos goes to a null renderer, the video to a standard CLSID_VideoRenderer. Now if I set the same reference clock on both and then run graph 1, all seems to be fine. However, if I run graph 2 10 seconds down the line, it runs as though the sync source were NULL for the first 10 seconds or so, until it has caught up with the other video.
Obviously, if the graphs called GetTime to get their "base time", this would solve the problem, but that is not what I'm seeing. Both videos end up with a base time of 0 because that's the point I run them from.
It's worth noting that if I set no clock at all (or call SetDefaultSyncSource), both graphs run as fast as they can. I assume this is due to the lack of an audio renderer ...
However how am I supposed to know when a new graph has run?
The clock runs on its own; it is the graph that aligns its operation against the clock, not the other way round. The graph receives the outer Run call, checks the current clock time, and assigns a base time of "current clock time + some time for things to take off", which it distributes among the filters. The clock itself doesn't need to have the faintest idea about any of this; its task is to keep running and keep incrementing time.
In particular, clock time does not have to reset to zero at any time.
From documentation:
The clock's baseline—the time from which it starts counting—depends on the implementation, so the value returned by GetTime is not inherently meaningful. What matters is the delta from when the graph started running.
When an application calls IMediaControl::Run to run the filter graph, the Filter Graph Manager calls IMediaFilter::Run on each filter. To compensate for the slight amount of time it takes for the filters to start running, the Filter Graph Manager specifies a start time slightly in the future.
The BaseClasses offer the CBaseReferenceClock class (in refclock.*), which you can use as a reference implementation.
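A minimal, portable sketch of the relationship described above (not a COM implementation): the clock just counts up from an arbitrary baseline, and the "graph" captures a base time when it is run and derives stream time as clock time minus base time. FreeRunningClock, GraphTiming and the start-up margin parameter are hypothetical names for illustration only; a real clock would implement IReferenceClock, for instance on top of CBaseReferenceClock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// DirectShow expresses reference time in 100-nanosecond units.
using RefTime = std::int64_t;

// Hypothetical stand-in for a reference clock: it only ever counts up,
// and its baseline is arbitrary (here: the steady_clock epoch).
class FreeRunningClock {
public:
    RefTime getTime() const {
        using namespace std::chrono;
        auto ns = duration_cast<nanoseconds>(steady_clock::now().time_since_epoch());
        return ns.count() / 100;
    }
};

// Hypothetical stand-in for what the graph does on Run: pick a base time
// slightly in the future, then measure stream time against it.
class GraphTiming {
public:
    void run(RefTime clockNow, RefTime startupMargin) {
        assert(startupMargin >= 0);
        baseTime_ = clockNow + startupMargin;
    }
    RefTime streamTime(RefTime clockNow) const { return clockNow - baseTime_; }
    RefTime baseTime() const { return baseTime_; }

private:
    RefTime baseTime_ = 0;
};
```

Under these assumptions, two GraphTiming objects run 10 seconds apart against the same clock each get their own base time, so each one starts at stream time zero without the clock ever being reset.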
Comment to your edit:
You are obviously not describing the case in full and are omitting important details. There is a simple test: you can instantiate the standard clock (CLSID_SystemClock) and use it on two regular graphs. They WILL run fine, even when Run is called at different times.
I suspect that you are doing some syncing or matching between the graphs and that you are time stamping the samples, also using the clock. Presumably you are doing something wrong at that point, and then you have a hard time fixing it through the clock.
I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments etc. I'm using Objective-C.
There are a couple of problems with this method:
1) You really want to know what is the GPU side latency within a command buffer most of the time, not round trip to CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
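The differencing idea from point (1) comes down to simple arithmetic, sketched below. The function name and parameters are hypothetical, and the two timings are assumed to come from whatever measurement you already have; the sketch does nothing about the noise and clock-down issues described in the list.

```cpp
#include <cassert>

// Estimate per-instance GPU cost from two timed runs that differ only in
// how many instances of the shader they execute. Fixed overhead that is
// common to both runs cancels in the subtraction.
double perInstanceSeconds(double secondsSmallRun, int instancesSmallRun,
                          double secondsLargeRun, int instancesLargeRun) {
    assert(instancesLargeRun > instancesSmallRun);
    return (secondsLargeRun - secondsSmallRun) /
           static_cast<double>(instancesLargeRun - instancesSmallRun);
}
```

For the 10 versus 20 instance case mentioned above, you would call perInstanceSeconds(t10, 10, t20, 20).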
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc
I am creating a program where I show some graphical content, and I record the face of the viewer with the webcam using DirectShow. It is very important that I know the time difference between what's on the screen to when the webcam records a frame.
I don't care at all about reducing latency or anything like that, it can be whatever it's going to be, but I need to know the capture latency as accurately as possible.
When frames come in, I can get the stream times of the frames, but all those times are relative to some particular stream start time. How can I access the stream start time for a capture device? That value is obviously somewhere in the bowels of DirectShow, because the filter graph computes it for every frame, but how can I get at it? I've searched through the docs but haven't found its secret yet.
I've created my own classes implementing IBaseFilter and IReferenceClock, which do little more than report tons of debugging info. Those seem to be doing what they need to be doing, but they don't provide enough information.
For what it is worth, I have tried to investigate this by inspecting the DirectShow Event Queue, but no events concerning the starting of the filter graph seem to be triggered, even when I start the graph.
The following image recorded using the test app might help understand what I'm doing. The graphical content right now is just a timer counting seconds.
The webcam is recording the screen. At the particular moment that frame was captured, the system time was about 1.35 seconds or so. The time of the sample recorded in DirectShow was 1.1862 seconds (ignore the caption in the picture). How can I account for the difference of .1637 seconds in this example? The stream start time is key to deriving that value.
The system clock and the reference clock both use the QueryPerformanceCounter() function, so I would not expect the difference to come from timer wonkiness.
Thank you.
Filters in the graph share a reference clock (unless you remove it, which is not what you want anyway), and stream times are relative to a certain base start time on this reference clock. The start time corresponds to a stream time of zero.
Normally, the controlling application does not have access to this start time, as the filter graph manager chooses the value itself internally and passes it to every filter in the graph as a parameter of the IBaseFilter::Run call. If you have at least one filter of your own, you can get the value.
Getting absolute capture time in this case is a matter of simple math: frame time is base time + stream time, and you can always do IReferenceClock::GetTime to check current effective time.
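The arithmetic can be sketched as follows, assuming you have obtained the base time (for example in your own filter's Run) and a sample's stream time. The helper names are hypothetical; only the relation "frame time = base time + stream time" and the 100-nanosecond unit of reference time come from DirectShow.

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;                  // 100-nanosecond units
constexpr RefTime kUnitsPerSecond = 10000000;

// Absolute reference-clock time at which a sample was captured.
RefTime absoluteSampleTime(RefTime baseTime, RefTime sampleStreamTime) {
    return baseTime + sampleStreamTime;
}

// How long ago the sample was captured, given the clock reading "now"
// (the value you would read via IReferenceClock::GetTime).
RefTime captureAge(RefTime clockNow, RefTime baseTime, RefTime sampleStreamTime) {
    RefTime captured = absoluteSampleTime(baseTime, sampleStreamTime);
    assert(clockNow >= captured);  // this sketch assumes the sample lies in the past
    return clockNow - captured;
}

// Convert reference time units to seconds for display.
double toSeconds(RefTime t) {
    return static_cast<double>(t) / kUnitsPerSecond;
}
```

The clock reading and the on-screen timer must of course be expressed against the same clock for the resulting difference to mean anything.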
If you don't have access to start time anyway and you don't want to add your own filter to the graph, there is a trick you can employ to define base start time yourself. This is what filter graph manager is doing anyway.
Starting the graphs in sync means using IMediaFilter::Run instead of IMediaControl::Run... Call IMediaFilter::Run on all graphs, passing this time... as the parameter.
try IReferenceClock::GetTime
Reference Clocks: https://msdn.microsoft.com/en-us/library/dd377506(v=vs.85).aspx
For more information here:
https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/1dc4123a-05cf-4036-a17e-a7648ca5db4e/how-do-i-know-current-time-stamp-referencetime-on-directshow-source-filter?forum=windowsdirectshowdevelopment
I have written a simulation module. For measuring latency, I am using this:
simTime().dbl() - tempLinkLayerFrame->getCreationTime().dbl();
Is this the proper way? If not, please suggest one; sample code would be very helpful.
Also, is the simTime() latency the actual latency in microseconds, which I can report in my research paper, or do I need to scale it?
Also, I found that the channel data rate and channel delay have no impact on the link latency; instead, the latency varies when I vary the trigger duration. For example:
timer = new cMessage("SelfTimer");
scheduleAt(simTime() + 0.000000000249, timer);
If this is not the proper way to trigger a simple module recursively, then please suggest one.
Assuming both simTime and getCreationTime use the OMNeT++ class for representing time, you can operate on them directly, because that class overloads the relevant operators. Going with what the manual says, I'd recommend using a signal for the measurements (e.g., emit(latencySignal, simTime() - tempLinkLayerFrame->getCreationTime());).
simTime() is in seconds, not microseconds.
Regarding your last question, this code will have problems if you use it for all nodes, and you start all those nodes at the same time in the simulation. In that case you'll have perfect synchronization of all nodes, meaning you'll only see collisions in the first transmission. Therefore, it's probably a good idea to add a random jitter to every newly scheduled message at the start of your simulation.
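A minimal sketch of the jitter idea, written in plain C++ rather than against the OMNeT++ API: the function name and parameters are hypothetical, and inside a real module you would presumably draw the jitter from the simulator's own random number generator instead of std::mt19937 so that runs stay reproducible.

```cpp
#include <cassert>
#include <random>

// Returns the simulation time (in seconds) of a node's first self-message:
// the nominal start plus a uniformly distributed random jitter between
// 0 and maxJitter, so that nodes do not all fire at the same instant.
double firstTriggerTime(double nominalStart, double maxJitter, std::mt19937& rng) {
    assert(maxJitter > 0.0);
    std::uniform_real_distribution<double> jitter(0.0, maxJitter);
    return nominalStart + jitter(rng);
}
```

The returned value would then be what you pass to scheduleAt for the first timer message; later messages can keep the fixed period.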
I'd like to profile my (multi-threaded) application in terms of timing. Certain threads are supposed to be re-activated frequently, i.e. a thread executes its main job once every fixed time interval. In other words, there's a fixed time slice in which all the threads are getting re-activated.
More precisely, I expect certain threads to get activated every 2ms (since this is the cycle period). I made some simplified measurements which confirmed the 2ms to be indeed effective.
For the purpose of profiling my app more accurately it seemed suitable to use Momentics' tool "Application Profiler".
However, when I do so, I fail to interpret the timing figures that I selected. I would be interested in the average as well as the min and max time it takes before a certain thread is re-activated. So far it seems that the tool is only meant to monitor the time certain functions occupy. However, even that does not really seem to be the case. E.g. I've got 2 lines of code that are literally next to each other:
if (var1 && var2 && var3) var5=1; takes 1ms (avg)
if (var4) var5=0; takes 5ms (avg)
What is that supposed to tell me?
Another thing confuses me: the parent thread "takes" up 33ms on avg, 2ms on max and 1ms on min. Aside from the fact that the avg shouldn't be bigger than the max (I would actually expect the avg to be no bigger than 2ms, since this is the cycle time), it keeps increasing the longer I run the profiling tool. So, if I ran the tool for half an hour, the 33ms would actually be something like 120s. So it seems that avg is actually the total amount of time the thread occupies the CPU.
If that is the case, I would assume to be able to offset against the total time using the count figure which doesn't work either. Mostly due to the figure being almost never available - i.e. there is only as a separate list entry (for every parent thread) called which does not represent a specific process scope.
So, I read the QNX community wiki about the "Application Profiler", incl. the manual about "New IDE Application Profiler Enhancements", as well as the official manual articles about how to use the profiler tool, but I couldn't figure out how to use the tool for my purpose.
Bottom line: I'm pretty sure I'm misinterpreting and misusing the tool for what it was intended to be used. Thus my question - how would I interpret the numbers or use the tool's feedback properly to get my 2ms cycle time confirmed?
Additional information
CPU: single core
QNX SDP 6.5 / Momentics 4.7.0
Profiling Method: Sampling and Call Count Instrumentation
Profiling Scope: Single Application
I enabled "Build for Profiling (Sampling and Call Count Instrumentation)" in the Build Options
The System Profiler should give you what you are looking for. It hooks into the microkernel and lets you see the state of all threads on the system. I used it in a similar setup to find out why our system was getting unexpected time-outs. (The cause turned out to be Page Waits on critical threads.)
I would like to achieve determinism in my game engine, in order to be able to save and replay input sequences and to make networking easier.
My engine currently uses a variable timestep: every frame I calculate the time it took to update/draw the last one and pass it to my entities' update method. This makes 1000FPS games seem as fast as 30FPS games, but introduces nondeterministic behavior.
A solution could be fixing the game to 60FPS, but it would make input more delayed and wouldn't get the benefits of higher framerates.
So I've tried using a thread (which constantly calls update(1) then sleeps for 16ms) and draw as fast as possible in the game loop. It kind of works, but it crashes often and my games become unplayable.
Is there a way to implement threading in my game loop to achieve determinism without having to rewrite all games that depend on the engine?
You should separate game frames from graphical frames. The graphical frames should only display the graphics, nothing else. For the replay it won't matter how many graphical frames your computer was able to execute, be it 30 per second or 1000 per second, the replaying computer will likely replay it with a different graphical frame rate.
But you should indeed fix the gameframes. E.g. to 100 gameframes per second. In the gameframe the game logic is executed: stuff that is relevant for your game (and the replay).
Your gameloop should execute graphical frames whenever there is no game frame necessary, so if you fix your game to 100 gameframes per second that's 0.01 seconds per gameframe. If your computer only needed 0.001 to execute that logic in the gameframe, the other 0.009 seconds are left for repeating graphical frames.
This is a small but incomplete and not 100% accurate example:
uint16_t const GAME_FRAMERATE = 100;                // game frames per second
uint16_t const SKIP_TICKS = 1000 / GAME_FRAMERATE;  // milliseconds per game frame
Timer sinceLoopStarted = Timer(); // Millisecond timer starting at 0
unsigned long next_game_tick = sinceLoopStarted.getMilliseconds();
while (gameIsRunning)
{
    //! Game Frames: catch up on every game frame that is due
    while (sinceLoopStarted.getMilliseconds() > next_game_tick)
    {
        executeGamelogic();
        next_game_tick += SKIP_TICKS;
    }
    //! Graphical Frames: render with whatever time is left
    render();
}
The following link contains very good and complete information about creating an accurate gameloop:
http://www.koonsolo.com/news/dewitters-gameloop/
To be deterministic across a network, you need a single point of truth, commonly called "the server". There is a saying in the game community that goes "the client is in the hands of the enemy". That's true. You cannot trust anything that is calculated on the client for a fair game.
If for example your game gets easier if for some reasons your thread only updates 59 times a second instead of 60, people will find out. Maybe at the start they won't even be malicious. They just had their machines under full load at the time and your process didn't get to 60 times a second.
Once you have a server (maybe even in-process as a thread in single player) that does not care about graphics or update cycles and runs at its own speed, it's deterministic enough to at least get the same results for all players. It might still not be 100% deterministic, because the computer is not a real-time system. Even if you tell it to update at a given frequency, it might not, due to other processes on the computer taking too much load.
The server and clients need to communicate, so the server needs to send a copy of its state (for performance, maybe a delta from the last copy) to each client. The client can draw this copy at the best speed available.
If your game is crashing with the thread, maybe it's an option to actually put "the server" out of process and communicate via the network. This way you will find out pretty fast which variables would have needed locks, because if you just move them to another project, your client will no longer compile.
Separate game logic and graphics into different threads. The game logic thread should run at a constant speed (say, it updates 60 times per second, or even more often if your logic isn't too complicated, to achieve smoother gameplay). Then, your graphics thread should always draw the latest info provided by the logic thread as fast as possible to achieve high framerates.
In order to prevent partial data from being drawn, you should probably use some sort of double buffering, where the logic thread writes to one buffer, and the graphics thread reads from the other. Then switch the buffers every time the logic thread has done one update.
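One possible shape for such a double buffer, as a rough sketch: GameState and DoubleBuffer are hypothetical names, and the sketch assumes exactly one logic thread and that the state is small enough to copy under a lock. The logic thread writes to the back buffer and calls publish() after each update; the render thread calls snapshot() and only ever sees complete updates.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical game state; a real engine would hold its entities here.
struct GameState {
    long tick = 0;
    double playerX = 0.0;
};

class DoubleBuffer {
public:
    // Logic thread only: the buffer currently being written.
    GameState& back() { return buffers_[1 - front_]; }

    // Logic thread only: call once an update is complete.
    void publish() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(front_ == 0 || front_ == 1);
        front_ = 1 - front_;                      // finished update becomes visible
        buffers_[1 - front_] = buffers_[front_];  // next update continues from it
    }

    // Render thread: copy of the most recently published state.
    GameState snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return buffers_[front_];
    }

private:
    mutable std::mutex mutex_;
    GameState buffers_[2];
    int front_ = 0;
};
```

Copying on publish keeps the sketch simple; an engine with large state might swap pointers and hand out read-only views instead, at the cost of more careful lifetime handling.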
This should make sure you're always using the computer's graphics hardware to its fullest. Of course, this does mean you're putting constraints on the minimum cpu speed.
I don't know if this will help but, if I remember correctly, Doom stored your input sequences and used them to generate the AI behaviour and some other things. A demo lump in Doom would be a series of numbers representing not the state of the game, but your input. From that input the game would be able to reconstruct what happened and, thus, achieve some kind of determinism ... Though I remember it going out of sync sometimes.
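The input-recording idea can be sketched like this. It is not how Doom is actually implemented; Input, World, step and replay are hypothetical names. The sketch assumes one recorded input per fixed game frame and an update that depends only on the previous state and that input, using integer math so results do not vary between machines.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-tick input: one entry is recorded for every game frame.
struct Input {
    std::int8_t moveX;  // -1, 0 or +1
    bool jump;
};

struct World {
    std::int32_t x = 0;
    std::int32_t jumps = 0;
};

// One fixed game frame. It depends only on the previous state and the
// input, never on wall-clock time or frame rate.
void step(World& w, const Input& in) {
    assert(in.moveX >= -1 && in.moveX <= 1);
    w.x += in.moveX * 3;
    if (in.jump) ++w.jumps;
}

// Replays a recorded input sequence from the initial state.
World replay(const std::vector<Input>& recorded) {
    World w;
    for (const Input& in : recorded) step(w, in);
    return w;
}
```

Replaying the same vector twice yields the same World, which is the property a replay or lockstep network mode relies on. Anything else that feeds the update (random numbers, timers) would have to be recorded or seeded the same way, which may be why such demos can drift out of sync.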