Assume a CPU has only a single core. Is context switching / concurrent processing of any real value?
E.g.: if I have 2 videos being played, each 5 mins long, then with concurrency each video should take 10 mins, not five, since the CPU flipped between video 1 and video 2 in concurrent processing?
Another way to describe concurrent processing is timesharing. The single core works on one program for a very short amount of time, then switches to the other process. So, if one process uses up all the power of that single core, and you start two of them, they will run at half the speed of just running one.
The slowdown is only visible if a single instance of the process requires more than half of the CPU power. And your video example doesn't. The speed of the video playing is usually dependent on the video framerate. That framerate is usually about 30-60 FPS. That slow framerate gives the CPU plenty of time to show the frame on the other video. The problem of concurrency would be much more visible if the process was CPU intensive, like Minecraft.
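As a rough illustration of the argument above, here is a minimal sketch of the arithmetic. The per-frame CPU cost is a made-up number and the function names are hypothetical; the model only assumes that the core is shared fairly between the players.

```python
def cpu_share_needed(frame_cost_ms: float, fps: float) -> float:
    """Fraction of one core a player needs to produce `fps` frames per second."""
    return frame_cost_ms * fps / 1000.0


def playback_slowdown(num_videos: int, frame_cost_ms: float, fps: float) -> float:
    """Factor by which playback stretches when the players share one core.

    1.0 means every video still plays in real time.
    """
    total_share = num_videos * cpu_share_needed(frame_cost_ms, fps)
    return max(1.0, total_share)


# Hypothetical cost: 5 ms of CPU work per frame at 30 FPS.
print(playback_slowdown(2, 5.0, 30))    # 1.0, both videos still finish in 5 minutes
# Two processes that each keep the core busy nearly all the time.
print(playback_slowdown(2, 33.3, 30))   # roughly 2x slower
```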
Related
I am converting an audio signal processing application from Win XP to Win 7 (at least). You can imagine it is a sonar application - a signal is generated and sent out, and a related/modified signal is read back in. The application wants exclusive use of the audio hardware, and cannot afford glitches - we don't want to read headlines like "Windows beep causes missile launch".
Looking at the Windows SDK audio samples, the most relevant one to my case is the RenderExclusiveEventDriven example. Outside the audio engine, it prepares 10 seconds of audio to play, which it provides in 10ms chunks to the rendering engine via an IAudioRenderClient object's GetBuffer() and ReleaseBuffer(). It first uses these functions to pre-load a single 10ms chunk of audio, then relies on regular 10ms events to load subsequent chunks.
Hopefully this means there will always be 10-20ms of audio data buffered. How reliable (i.e. glitch-free) should we expect this to be on reasonably modern hardware (less than 18 months old)?
Previously, one could readily pre-load at least half a second worth of audio via the waveXXX() API, so that if Windows got busy elsewhere, audio continuity was less likely to be affected. 500ms seems like a much safer margin than 10-20ms... but if you want both event-driven and exclusive-mode, the IAudioRenderClient documentation doesn't exactly make it clear whether it is possible to pre-load more than a single IAudioRenderClient buffer worth.
Can anyone confirm if more extensive pre-loading is still possible? Is it recommended, discouraged or neither?
If you are worried about launching missiles, I don't think you should be using Windows or any other non Real-Time operating system.
That said, we are working on another application that consumes a much higher bandwidth of data (400 MB/s continuously for hours or more). We have seen glitches where the operating system becomes unresponsive for up to 5 seconds, so we have large buffers on the data acquisition hardware.
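To put a number on "large buffers", here is a small sketch of the sizing arithmetic. It assumes 1 MB means 10^6 bytes and that the buffer only has to cover a single worst-case stall; the function name is hypothetical.

```python
def buffer_bytes_needed(rate_mb_per_s: float, stall_s: float) -> int:
    """Bytes that arrive while the consumer is stalled for `stall_s` seconds."""
    return int(rate_mb_per_s * 1_000_000 * stall_s)


# 400 MB/s with the operating system unresponsive for 5 seconds.
print(buffer_bytes_needed(400, 5))  # 2000000000, i.e. about 2 GB
```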
As with everything else in computing, the wider you go, the more you:
increase throughput
increase latency
I'd say a 512 sample buffer is the minimum typically used for applications that are not demanding latency-wise. I've seen up to 4k buffers. Memory-wise that's still pretty much nothing for contemporary devices: a mere 8 kilobytes of memory per channel for 16-bit audio at 4k samples. With a larger buffer you get better playback stability and waste fewer CPU cycles. For audio applications that means you can process more tracks with more DSP before audio begins skipping.
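The trade-off can be made concrete with a short calculation. This is only a sketch: the 48 kHz sample rate is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def buffer_latency_ms(samples: int, sample_rate_hz: int) -> float:
    """Time it takes to play one buffer of `samples` frames, in milliseconds."""
    return 1000.0 * samples / sample_rate_hz


def buffer_bytes(samples: int, bits_per_sample: int = 16) -> int:
    """Memory used by one channel of a buffer."""
    return samples * bits_per_sample // 8


for size in (64, 512, 4096):
    # 48 kHz is an assumed sample rate, not something stated above.
    print(size, round(buffer_latency_ms(size, 48_000), 2), "ms,",
          buffer_bytes(size), "bytes per channel")
```

The latency scales inversely with the sample rate, so the same buffer size gives very different delays at 8 kHz, 44.1 kHz and 96 kHz.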
On the other end, I've seen only a few professional audio interfaces that could handle 32 sample buffers. Most are able to achieve 128 samples. Naturally, you are still limited to a lower channel and effect count: even with professional hardware you increase buffering as your project gets larger, then lower it back and disable tracks or effects when you need "real time" to capture a performance. In terms of lowest possible latency, the same box is actually capable of achieving lower latency with Linux and a custom real-time kernel than on Windows, where you are not allowed to do such things. Keep in mind a 64 sample buffer might sound like 8 msec of latency in theory (a figure that corresponds to an 8 kHz sample rate; at 48 kHz it is about 1.3 msec), but in reality it is more like double, because you have both input and output latency plus the processing latency.
For a music player, where latency is not an issue, you are perfectly fine with a larger buffer. For stuff like games you need to keep it lower for the sake of still having a degree of synchronization between what's going on visually and the sound: you simply cannot have your sound lag half a second behind the action. For capturing a music performance together with already recorded material you need to keep latency low. You should never go below what your use case requires, because a small buffer will needlessly add to CPU use and the odds of getting audio dropouts. 4k buffering for an audio player is just fine if you can live with the resulting latency between the moment you hit play and the moment you hear the song starting (half a second at an 8 kHz sample rate, roughly 93 msec at 44.1 kHz).
I've done a "kind of a hybrid" solution in my DAW project - since I wanted to employ GPGPU for its tremendous performance relative to the CPU I've split the work internally with two processing paths - 64 samples buffer for real time audio which is processed on the CPU, and another considerably wider buffer size for the data which is processed by the GPU. Naturally, they both come out through the "CPU buffer" for the sake of being synchronized perfectly, but the GPU path is "processed in advance" thus allowing higher throughput for already recorded data, and keeping CPU use lower so the real time audio is more reliable. I am honestly surprised professional DAW software hasn't taken this path yet, but not too much, knowing how much money the big fishes of the industry make on hardware that is much less powerful than a modern midrange GPU. They've been claiming that "latency is too much with GPUs" ever since Cuda and OpenCL came out, but with pre-buffering and pre-processing that is really not an issue for data which is already recorded, and increases the size of a project which the DAW can handle tremendously.
The short answer is yes, you can preload a larger amount of data.
This example uses a call to GetDevicePeriod to return the minimum service interval for the device (as a REFERENCE_TIME value, in 100-nanosecond units) and then passes that value along to Initialize. You can pass a larger value if you wish.
The down side to increasing the period is that you're increasing the latency. If you are just playing a waveform back and aren't planning on making changes on the fly then this is not a problem. But if you had a sine generator for example, then the increased latency means that it would take longer for you to hear a change in frequency or amplitude.
Whether or not you have dropouts depends on a number of things. Are you setting thread priorities appropriately? How small is the buffer? How much CPU are you using preparing your samples? In general though, a modern CPU can handle pretty low latency. For comparison, ASIO audio devices run perfectly fine at 96kHz with a 2048 sample buffer (about 20 milliseconds) with multiple channels - no problem. ASIO uses a similar double buffering scheme.
This is too long to be a comment, so it may as well be an answer (with qualifications).
Although it was edited out of the final form of the question I submitted, what I had intended by "more extensive pre-loading" was not so much about the size of buffers used as about the number of buffers used. The (somewhat unexpected) answers that resulted all helped widen my understanding.
But I was curious. In the old waveXXX() world, it was possible to "pre-load" multiple buffers via waveOutPrepareHeader() and waveOutWrite() calls, the first waveOutWrite() of which would start playback. My old app "pre-loaded" 60 buffers out of a set of 64 in one burst, each with 512 samples played at 48kHz, creating over 600ms of buffering in a system with a cycle of 10.66ms.
Using multiple IAudioRenderClient::GetBuffer() and IAudioRenderClient::ReleaseBuffer() calls prior to IAudioClient::Start() in the WASAPI world, it appears that the same is still possible... at least on my (very ordinary) hardware, and without extensive testing (yet). This is despite the documentation strongly suggesting that exclusive, event-driven audio is strictly a double-buffering system.
I don't know that anyone should set out to exploit this by design, but I thought I'd point out that it may be supported.
I have a DirectX game which spawns 2 boost threads on a dual-core system: 1 for gameplay/rendering (normally split into their own threads on a quad-core CPU), and 1 other thread which procedurally-generates the gameworld. I believe that my audio middleware also spawns its own threads for playing SFX and music.
The game is always running at 100% on the CPU, which in turn can cause some sputtering from the audio system. I'm hoping that I can reduce the CPU load by better managing the activity of that Generation Thread. While sometimes I need it running full-speed, there are other times (when the player isn't moving much) when it is just updating constantly while not really doing a whole lot.
Is it possible / advisable to manually manage how active a thread is? If so what strategies can I use to do that? I keep seeing people say that sleep() functions aren't really recommended, but I don't really know what else to do yet.
Alternatively, am I barking up the wrong tree by trying to squeeze cycles out of thread management, and would I be better served by traditional profiling/optimization?
Getting to 100% processor utilization means that you don't have a game clock. You are probably rendering frames as fast as the machine allows. It is still pretty hard to get exactly 100% if you use multiple threads, which indicates that you don't synchronize threads either.
This is likely to require a pretty drastic rewrite. The pace ought to be set by the main render loop, the one that copies the back-buffer to the video adapter. It sets your target FPS, frames-per-second. Not infrequently, you use the vertical blanking interval of the monitor for that, it solves tearing problems by ensuring that the monitor gets updated at the exact right time. Which automatically gets the render loop paced to the monitor refresh rate. Typically 60 times per second on LCD monitors. A timer is an alternative. This prevents the main thread from burning 100% core, assuming it can keep up with the FPS.
You now have a steady game clock tick, discrete moments in time at which things need to happen and jobs need to be completed in order to update the game state. Like checking for player input. Inside the render loop, check for mouse/keyboard/controller input and use anything you get to update the game world objects.
The game clock in turn determines what worker threads need to do. They'll have the duration of one pass through the render loop to get the job done they need to do. You use a synchronization object to wake them up. And another one, each, to let them signal that they are done with the current game loop tick. Which stops them from burning core: they should constantly be waiting on the signal to start working on the next frame. Note that there is a balancing requirement. If a worker thread needs more than one game tick to get the job done then the render loop will fall behind and miss a video adapter frame update. Your video starts to stutter. This is in general impossible to eliminate completely, but do make sure that it doesn't affect the absolute game clock.
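The wake-up and completion signals described above can be sketched with two event objects per worker. This is a minimal illustration in Python rather than the C++/boost setup from the question; the class name `TickWorker` and its methods are hypothetical, and the render step is left as a comment.

```python
import threading


class TickWorker:
    """Worker that blocks until the render loop signals a new game tick."""

    def __init__(self, job):
        self._job = job
        self._start = threading.Event()  # render loop -> worker: begin this tick
        self._done = threading.Event()   # worker -> render loop: tick finished
        self._running = True
        self._tick = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            self._start.wait()           # sleeps here instead of burning a core
            self._start.clear()
            if not self._running:
                break
            self._job(self._tick)
            self._done.set()

    def begin(self, tick):
        """Called by the render loop at the start of a tick."""
        self._tick = tick
        self._done.clear()
        self._start.set()

    def wait_done(self, timeout=None):
        """Called by the render loop before it starts the next tick."""
        return self._done.wait(timeout)

    def stop(self):
        self._running = False
        self._start.set()
        self._thread.join()


results = []
worker = TickWorker(lambda tick: results.append(tick * tick))
for tick in range(3):
    worker.begin(tick)
    # ... render the frame here while the worker runs ...
    worker.wait_done()
worker.stop()
print(results)  # [0, 1, 4]
```

Passing a timeout to `wait_done` is one way to notice a worker that needs more than one tick, so the render loop can decide what to do instead of stalling.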
Audio should be the easier problem to solve, you just need to keep the sound card buffers filled with sufficient data to survive a couple of frames worth of sound.
Falling behind on the target FPS is very easy to determine. You automatically compensate for that by lowering the target FPS. So the program still runs acceptably on a slow machine, just not as smoothly. Net effect is that you'll stop burning 100% core on all threads.
I don't know anything about your system so this answer may be way off. I assume that there is such a thing as audio output buffer of some sort that you can track the size of. When the size of the buffer is so small that there is a danger that audio may stop you should do something to refill the buffer.
That "something" may be as easy as temporarily setting the priority of the audio thread to a higher value. Come to think of it, why not set it higher from the start? That would solve all the problems, right? Even better, just lower the priority of the world generator thread.
When I worked on a voice-comm app for gamers many years ago, we hit this problem a lot. Many games are written to use every ounce of CPU. As such, some games would starve our app (that ran in the background), causing audio drops and lost network connections. Many of those games would also call SetThreadPriority and SetPriorityClass with the REALTIME flags to basically consume all the CPU quanta with disregard for anything else running on the system.
The typical fix we asked of game developers we partnered with was to simply insert a "Sleep(0)" call between each frame of their main game loop so that our threads wouldn't get stalled. I think we later added a switch in a software update to make our process run at a higher priority mode. Since then, Windows has gotten better about multitasking and thread priority with respect to these issues.
I always assumed that the correct way to calculate the FPS was to simply time how long it took to do an iteration of your draw loop. And much of the internet seems to be in accordance.
But!
Modern graphics cards are treated as asynchronous servers, so the draw loop sends out drawing instructions for vertex/texture/etc data already on the GPU. These calls do not block the calling thread until the request on the GPU completes; they are simply added to the GPU's task queue. So surely the 'traditional' (and rather ubiquitous) method is just measuring the call dispatch time?
What prompted me to ask was I had implemented the traditional method and it gave consistently absurdly high framerates, even if what was being rendered caused the animation to become choppy. Re-reading my OpenGL SuperBible brought me to glGenQueries which allow me to time sections of the rendering pipeline.
To summarise, is the 'traditional' way of calculating FPS totally defunct with (barely) modern graphics cards? If so why are the GPU profiling techniques relatively unknown?
Measuring fps is hard. It's made harder by the fact that various people who want to measure fps don't necessarily want to measure the same thing. So ask yourself this. Why do you want an fps number?
Before I go on and dive into all the pitfalls and potential solutions, I do want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be way worse, with SGI-type machines where the rendering actually happened on a graphics subsystem that could be remote to the client (as in, physically remote). GL1.0 was actually defined in terms of client-server.
Anyways. Back to the problem at hand.
fps, meaning frames per second, really is trying to convey, in a single number, a rough idea of the performance of your application, in a number that can be directly related to things like the screen refresh rate. For a first-level approximation of performance, it does an ok job. It breaks completely as soon as you want to delve into more fine-grained analysis.
The problem is really that the thing that matters most as far as "feeling of smoothness" of an application, is when the picture you drew ends up on the screen. The secondary thing that matters quite a bit too is how long it took between the time you triggered an action and when its effect shows up on screen (the total latency).
As an application draws a series of frames, it submits them at times s0, s1, s2, s3,... and they end up showing on screen at t0, t1, t2, t3,...
To feel smooth you need all the following things:
tn-sn is not too high (latency)
t(n+1)-t(n) is small (under 30ms)
there is also a hard constraint on the simulation delta time, which I'll talk about later.
When you measure the CPU time for your rendering, you end up measuring s1-s0 to approximate t1-t0. As it turns out, this, on average, is not far from the truth, as client code will never go "too far ahead" (this is assuming you're rendering frames all the time though. See below for other cases). What does happen in fact is that the GL will end up blocking the CPU (typically at SwapBuffer time) when it tries to go too far ahead. That blocking time is roughly the extra time taken by the GPU compared to the CPU on a single frame.
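A toy model can show why s1-s0 ends up tracking t1-t0 once the driver starts blocking. This is only a sketch under simplifying assumptions that are mine, not the GL's: a fixed CPU cost per frame, a fixed GPU cost per frame, and a driver that lets the CPU run at most `max_in_flight` frames ahead.

```python
def simulate(cpu_ms, gpu_ms, frames, max_in_flight=2):
    """Return (submit_times, shown_times) for a simple CPU/GPU pipeline model."""
    submit, shown = [], []
    for n in range(frames):
        prev_submit = submit[-1] if submit else 0.0
        # The CPU may not start frame n until frame n - max_in_flight is on screen.
        gate = shown[n - max_in_flight] if n >= max_in_flight else 0.0
        s = max(prev_submit, gate) + cpu_ms
        # The GPU starts a frame once it is submitted and the previous one is done.
        prev_shown = shown[-1] if shown else 0.0
        t = max(s, prev_shown) + gpu_ms
        submit.append(s)
        shown.append(t)
    return submit, shown


s, t = simulate(cpu_ms=2.0, gpu_ms=10.0, frames=5)
print(s)  # [2.0, 4.0, 14.0, 24.0, 34.0]
print(t)  # [12.0, 22.0, 32.0, 42.0, 52.0]
```

In this model the first submit interval is only 2 ms, but as soon as the CPU is gated the submit intervals settle at 10 ms, the same as the display intervals.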
If you really want to measure t1-t0, as you mentioned in your own post, Queries are closer to it. But... Things are never really that simple. The first problem is that if you're CPU bound (meaning your CPU is not quick enough to always provide work to the GPU), then a part of the time t1-t0 is actually idle GPU time. That won't get captured by a Query. The next problem you hit is that depending on your environment (display compositing environment, vsync), queries may actually only measure the time your application spends on rendering to a back buffer, which is not the full rendering time (as the display has not been updated at that time). It does get you a rough idea of how long your rendering will take, but will not be precise either. Further note that Queries are also subject to the asynchronicity of the graphics part. So if your GPU is idle part of the time, the query may miss that part (e.g. say your CPU is taking very long (100ms) to submit your frame. Then the GPU executes the full frame in 10ms. Your query will likely report 10ms, even though the total processing time was closer to 100ms...).
Now, with respect to "event-based rendering" as opposed to continuous one I've discussed so far. fps for those types of workloads doesn't make much sense, as the goal is not to draw as many f per s as possible. There the natural metric for GPU performance is ms/f. That said, it is only a small part of the picture. What really matters there is the time it took from the time you decided you wanted to update the screen and the time it happened. Unfortunately, that number is hard to find: It typically starts when you receive an event that triggers the process and ends when the screen is updated (something that you can only measure with a camera capturing the screen output...).
The problem is that between the 2, you have potential overlap between the CPU and GPU processing, or not (or even, some delay between the time the CPU stops submitting commands and the GPU starts executing them). And that is completely up to the implementation to decide. The best you can do is to call glFinish at the end of the rendering to know for sure the GPU is done processing the commands you sent, and measure the time on the CPU. That solution does reduce the overall performance of the CPU side, and potentially the GPU side as well if you were going to submit the next event right after...
Last the discussion about the "hard constraint on simulation delta time":
A typical animation uses a delta time between frames to move the animation forward. The major problem is that for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1-t0 (so that when t1 shows, the time that actually was spent from the previous frame was indeed t1-t0). The problem of course is that you have no idea what t1-t0 is at the time you submit s1... So you typically use an approximation. Many just use s1-s0, but that can break down (e.g. SLI-type systems can have some delays in AFR rendering between the various GPUs). You could also try to use an approximation of t1-t0 (or more likely t0-t(-1)) through queries. The result of getting this wrong is most likely micro-stuttering on SLI systems.
The most robust solution is to say "lock to 30fps, and always use 1/30s". It's also the one that allows the least leeway on content and hardware, as you have to ensure your rendering can indeed be done in those 33ms... But it is what some console developers choose to do (fixed hardware makes it somewhat simpler).
"And much of the internet seems to be in accordance." doesn't seem totally correct for me:
Most of the publications would measure how long it takes to do MANY iterations, then normalize. This way you can reasonably assume that filling (and emptying) the pipe are only a small part of the overall time.
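Measuring over many iterations and normalizing could look like the sketch below. The function name is hypothetical, and the timestamps are assumed to be taken on the CPU at the end of each frame (for example with `time.perf_counter()`), so the caveats about dispatch time versus display time still apply.

```python
def average_fps(frame_end_times):
    """Mean frames per second over a run, from per-frame end timestamps in seconds."""
    if len(frame_end_times) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = frame_end_times[-1] - frame_end_times[0]
    # N timestamps bound N - 1 complete frames.
    return (len(frame_end_times) - 1) / elapsed


# Five timestamps, half a second apart: four frames in two seconds.
print(average_fps([0.0, 0.5, 1.0, 1.5, 2.0]))  # 2.0
```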
My application is rendering about 100 display lists / second. While I do expect this to be intensive for the GPU, I don't see why it brings my CPU up to 80 - 90 %. Aren't display lists stored in the graphics card and not in system memory? What would I have to do to reduce this crazy CPU usage? My objects never change, so that's why I'm using DLs instead of VBOs. But really my main concern is CPU usage and how I could reduce it. I'm rendering ~60 (or trying to) frames per second.
Thanks
If you are referring to these, then I suspect the bottleneck is going to be CPU related. All the decoding of such files is done on the CPU. Sure, each individual command might result in several commands to your graphics card, which will execute quickly, but the CPU is stuck doing decoding duty.
You probably have VSYNC disabled. In which case your CPU will generate as many frames per second as possible. Of course most of them will be wasted because your monitor can't update 100s of times per second.
Enable VSYNC and check your CPU usage (and frame rate) again.
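VSYNC itself is switched on through the graphics API or driver, which is not shown here. As a stand-in that illustrates the same effect on CPU usage, here is a minimal sketch of a timer-based frame cap that sleeps away the unused part of each frame period. The function name and its `clock`/`sleep` parameters are hypothetical and exist only so the loop can be exercised without real waiting.

```python
import time


def run_capped(render, frames, target_fps=60.0,
               clock=time.perf_counter, sleep=time.sleep):
    """Call `render` `frames` times, sleeping so that at most `target_fps` run per second."""
    period = 1.0 / target_fps
    deadline = clock()
    for _ in range(frames):
        render()
        deadline += period
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)      # the CPU is idle here instead of drawing wasted frames
        else:
            deadline = clock()    # fell behind: restart the schedule, do not try to catch up


# Example: three empty frames capped at 100 FPS take about 30 ms in total.
start = time.perf_counter()
run_capped(lambda: None, 3, target_fps=100.0)
print(time.perf_counter() - start)
```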
While display lists are compiled on the GPU, it does not mean there isn't some work required on the cpu (if not directly in your code then possibly in the driver) to actually specify the display list to call on the gpu.
If you want to find out where the cpu time is being spent, grab a profiler and fire up a call graph sampling test. You'll find out in no time where the cpu time is being spent.
I'm just looking for resources that break down how frames per second work. I know it has something to do with keeping track of ticks and figuring out how many ticks occurred between each frame. But I never ran into any resources on why exactly you have to use the methods you use in order to get a smooth frame work. I am trying to get a thorough understanding of this. Can anyone explain or provide any good resources? Thanks
There are basically two approaches.
In ActionScript (and many other engines), you request the player to call a certain function at a certain framerate. For Flash games, you'll set the framerate to be 30 FPS, and then you'll implement a function that listens for ENTER_FRAME events to do what you need to do. This means you get roughly 33 ms per frame (1000ms/30FPS=33.33ms/frame). If your code that responds to ENTER_FRAME takes more than 33 ms, you'll get some stuttering.
In home-rolled main loops (like you'd generally do in C++/SDL, for example), you run the main loop as fast as possible. This means the time between each frame will be variable. You still need to keep the "guts" of your frame code less than 33 ms to make sure you'll get at least 30 FPS, but your game will run faster than 30 FPS if not a lot's going on. To account for this, you need to program all your logic in terms of elapsed time since last frame, and abandon using frames themselves as a unit of time.
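Programming logic in terms of elapsed time rather than frames can be sketched as follows. The function name and the speed value are hypothetical; the point is only that the same amount of wall-clock time moves an object the same distance regardless of how many frames it was split into.

```python
def move(position, speed_per_second, frame_times):
    """Advance a position using the elapsed time of each frame, in seconds."""
    for dt in frame_times:
        position += speed_per_second * dt
    return position


# One second of game time at 30 FPS and at 60 FPS covers the same distance.
at_30 = move(0.0, 100.0, [1 / 30] * 30)
at_60 = move(0.0, 100.0, [1 / 60] * 60)
print(round(at_30, 6), round(at_60, 6))
```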
http://forums.xna.com/forums/t/42624.aspx
How do you separate game logic from display?
For a continuously variable frame rate you can measure the time the last frame took and assume this frame will take the same length of time. This has the benefit of meaning that time runs more or less constantly. Your biggest issue with this approach is that it is entirely possible for a VSync'd game to change from 60 fps to 30 fps and back again on subsequent frames. From experience, a good way to solve this is to average the last few frame times. This smooths the result out. In the 60 to 30 fps switch, each frame will progress assuming 1/40 seconds (the mean of 1/60 and 1/30 seconds), and hence the 60fps frames run slow and the 30fps frames run fast, and the perceived speed remains at 40fps.
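Averaging the last few frame times can be sketched with a small sliding window. The class name and window size are hypothetical choices for illustration.

```python
from collections import deque


class FrameTimeSmoother:
    """Moving average over the most recent frame times, in seconds."""

    def __init__(self, window=4):
        self._times = deque(maxlen=window)  # old entries fall off automatically

    def add(self, frame_time):
        """Record a measured frame time and return the smoothed time step to use."""
        self._times.append(frame_time)
        return sum(self._times) / len(self._times)


smoother = FrameTimeSmoother(window=2)
smoother.add(1 / 60)
# Mean of 1/60 s and 1/30 s is 0.025 s, i.e. a 1/40 s step.
print(smoother.add(1 / 30))
```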
Better still is to not use this sort of time step in your calculations. Set yourself a minimum fps ... say 10fps. You then calculate all your game logic at some multiple of these 1/10 second intervals. The render engine then knows where the object is and where it is heading towards and so can inter/extrapolate the object position until its next "decision" frame shows up. This has numerous advantages. It decouples your logic entirely from rendering. It allows you to spread the logic calculations over a number of frames. For example, at 60Hz, you can test at what point a logic object will intersect with the world if it maintains its path every 6th frame. This gives the bonus of allowing you to process some logic objects on different frames to spread calculation load across the time. Its biggest disadvantage is that if the frame rate drops below your target rate everything slows down.
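One common way to realize this decoupling is a fixed logic step driven by a time accumulator, with the renderer interpolating between the last two logic states. The sketch below assumes that approach; the names `Body`, `frame` and `LOGIC_DT` are hypothetical, and it interpolates (renders slightly in the past) rather than extrapolates.

```python
LOGIC_DT = 0.1  # logic runs at a fixed 10 updates per second


class Body:
    """Object with a previous and a current logic state, moving along one axis."""

    def __init__(self, x=0.0, vx=0.0):
        self.prev_x = x
        self.x = x
        self.vx = vx

    def logic_step(self):
        self.prev_x = self.x
        self.x += self.vx * LOGIC_DT

    def render_x(self, alpha):
        """Position to draw, `alpha` in [0, 1) between the last two logic states."""
        return self.prev_x + (self.x - self.prev_x) * alpha


def frame(body, accumulator, frame_dt):
    """Run as many fixed logic steps as the elapsed time allows, then interpolate."""
    accumulator += frame_dt
    while accumulator >= LOGIC_DT:
        body.logic_step()
        accumulator -= LOGIC_DT
    alpha = accumulator / LOGIC_DT
    return accumulator, body.render_x(alpha)


body = Body(x=0.0, vx=10.0)
acc, drawn_x = frame(body, 0.0, 0.15)  # one logic step plus half a step left over
print(round(drawn_x, 6))               # 0.5
```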