Making GL texture uploads asynchronous and figuring out when the upload is done - opengl

I have run into an issue where my application needs to load images dynamically at runtime. Loading everything up front isn't an option, since I can't know which images will be used; I'd have to upload all of them. The problem is that some people do not have good PCs, and they have been complaining that uploading all the images to the GPU takes a long time on their hardware.
My workaround for that group was to upload textures only as they are needed, and this worked for the most part. The remaining problem is that at certain points in the application a series of images has to be uploaded at once, and the uploading causes a noticeable delay.
While researching how to get around this, I had an idea: users want a smooth experience and are okay if a texture is absent for a moment rather than immediately loaded. That makes things easy, since I can upload in the background, draw nothing where the object should be, and then bring it into existence once the upload is done. This is acceptable because the uploads are usually pretty fast anyway, but they're slow enough to dip under 60 fps for some people, which causes some stutter. On average it's anywhere from 1-3 frames of stutter, so the uploads do resolve quickly, typically in less than 50 ms.
My plan was to attempt something with a PBO to get async-like uploading. The problem is that I cannot find anything online about how to tell when the upload is done. Is there a way to do this?
I figure there are four options:
1. There's a way to do what I want with OpenGL 3.1 onwards, and that will be that.
2. (1) is not possible, but I could place a fence and then check whether the fence has been signaled; however, I've never had to do this before, so I'm not sure it would work in this case.
3. It's not possible, but I could assume that everything uploads in under 50 ms and use some kind of timestamp to tell whether a texture is drawable, and just hope that holds (if less than 50 ms have passed since issuing an upload, draw nothing).
4. It's not possible to do this for texture uploads and I'm stuck.
This leads me to my question: Can I tell when an asynchronous upload of pixels to a texture is done?

Fence sync objects tell you when all previously issued commands have completed execution, and that includes asynchronous pixel transfer operations. So you can issue a fence after your transfers and use the sync object API to check whether it has been signaled.
The annoying issue you'll have here is that this is very coarse-grained. Testing the fence also tests whether every non-transfer command issued before it has completed, even though the two kinds of work are probably handled by independent hardware. So if the transfer finishes before the frame you started rendering beforehand has finished rendering, the fence still won't be signaled. However, if you fire off a lot of texture uploads all at once, the transfer operations will dominate the result.
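For illustration, here is a minimal sketch of that idea under some assumptions: a GL 3.2+ context (sync objects), an already initialised function loader, and RGBA8 pixel data already in system memory. The function names are made up for the example.

    // Kick off the upload through a PBO and fence it so we can poll for completion.
    GLuint pbo = 0, tex = 0;
    GLsync uploadFence = 0;

    void startUpload(const void* pixels, int width, int height, size_t sizeBytes)
    {
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, sizeBytes, pixels, GL_STREAM_DRAW);

        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        // With a PBO bound to GL_PIXEL_UNPACK_BUFFER the last argument is an offset.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

        // Fence everything issued so far, including the transfer above.
        uploadFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush(); // make sure the commands and the fence actually reach the GPU
    }

    bool uploadFinished()
    {
        if (!uploadFence)
            return false;
        GLint status = 0;
        glGetSynciv(uploadFence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
        if (status == GL_SIGNALED) {
            glDeleteSync(uploadFence);
            uploadFence = 0;
            return true;  // safe to draw with the texture now
        }
        return false;     // still in flight: draw nothing for this object this frame
    }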

Related

Is it less computer intensive to load and unload an image every time a new image is needed or to have all of the images loaded at once? (SDL)

I've created a small animation for a game that uses a set of images as frames, and the current image to render changes over a certain amount of time to create the animation illusion. I've done this in two different ways, and I'm wondering which one is more efficient to use.
Method 1:
A single image is loaded and rendered. When a different image needs to be rendered, a function is called that unloads the current image, and loads and renders the new one.
Method 2:
All of the images needed for the animation are loaded once, and then rendered as needed.
In simpler terms: Method 1 unloads the current image and loads the new one every time a different image is needed, and Method 2 keeps all the images needed loaded at once.
So basically, the question is: is it better to constantly load and unload images to keep as few loaded as possible, or to keep many loaded at all times and never load/unload anything during the program? Does the computer have a harder time loading and unloading images, or keeping many images loaded at once?
I looked at the task manager while running each method. CPU usage of Method 1 (loading and unloading) fluctuated between 29% and 30%, while that of Method 2 fluctuated between 28% and 29%.
It appears that keeping all the images loaded is better according to these statistics, but the reason I don't really trust them is because the program only loads seven images.
As the game gets bigger, there could be hundreds of images loaded at once (Method 2) or an image loaded and unloaded nearly every frame (Method 1). Which method is less intensive? Thank you for your time.
Firstly, you don't define 'intensive'. Having a gaming machine run at 100% is good, or at least not necessarily bad; it depends on what else is going on on your system.
Second, performance analysis is done by measuring, not by thinking. You have measured and found an answer. Believe the answer.
In general it is faster to keep things in memory than to load them off disk over and over again (as your measurement shows). However, you will end up using a lot of memory (no free lunches here). You don't say how much memory you have or how big the images are. Assuming each image is 10 MB, then 100 of them take 1 GB. On a modern desktop that's not a lot; on an embedded system running on an Arduino it's a disaster.
Why not try with 100 images and see what happens?
Measurement only confirms that doing less work results in faster execution, and that doing something once is faster in the long term than repeating it multiple times (if you restart your animation from the first frame, you'll need to reload the same images again). The bigger problem here is latency: what if an image is fairly big and takes a noticeable time to load, will you just stop rendering until it loads? What if the disk is under heavy load (especially an HDD; SSDs are not as vulnerable to random access) and the image suddenly takes 100 times longer to load? (Yes, this can still happen with the CPU, but not as often or as severely.)
If you can predict which part of the data will be needed soon, then you only need to load that part before a given deadline. This is generally referred to as 'streaming' and is especially helpful for very big but predictable data such as audio or video files (since you always know the playback speed and direction). With animations it may be more difficult: sometimes you know which animation will be played (e.g. cutscenes or triggers), sometimes they are too general and any of them could be played. In the latter case you'll probably need to keep the animation data ready to prevent unwanted pauses.
If the data is too big, you can try compression, e.g. keeping the data in memory in compressed form with a fast compression algorithm and decompressing it when needed, which is much faster than reading from a slow disk; for special cases like images there may be even better solutions.
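As a concrete illustration of Method 2 with lazy loading, here is a small sketch assuming SDL2 with the SDL_image extension; the class and its names are invented for the example, and error handling is omitted.

    // Load each image at most once and keep it for the rest of the program.
    #include <SDL.h>
    #include <SDL_image.h>
    #include <string>
    #include <unordered_map>

    class TextureCache {
    public:
        explicit TextureCache(SDL_Renderer* renderer) : renderer_(renderer) {}

        ~TextureCache() {
            for (auto& entry : cache_)
                SDL_DestroyTexture(entry.second);
        }

        // Returns the cached texture, loading it from disk only on first use.
        SDL_Texture* get(const std::string& path) {
            auto it = cache_.find(path);
            if (it != cache_.end())
                return it->second;
            SDL_Texture* tex = IMG_LoadTexture(renderer_, path.c_str());
            cache_[path] = tex; // may be null if loading failed
            return tex;
        }

    private:
        SDL_Renderer* renderer_;
        std::unordered_map<std::string, SDL_Texture*> cache_;
    };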

How do you know what you've displayed is completely drawn on screen?

Displaying images on a computer monitor involves a graphics API, which dispatches a series of asynchronous calls... and at some point puts the wanted content on the computer screen.
But, what if you are interested in knowing the exact CPU time at the point where the required image is fully drawn (and visible to the user)?
I really need to grab a CPU timestamp when everything is displayed to relate this point in time to other measurements I take.
Without even taking the asynchronous behaviour of the graphics stack into account, many things can cause the duration of the graphics calls to jitter:
multi-threading;
Sync to V-BLANK (unfortunately required to avoid some tearing);
what else have I forgotten? :P
I'm targeting a solution on Linux, but I'm open to any other OS. I've already studied parts of the XVideo extension for the X.org server and the OpenGL API, but I haven't found an effective solution yet.
I only hope the solution doesn't involve hacking into video drivers / hardware!
Note: I won't be able to use the recent Nvidia G-SYNC technology on the required hardware. Although it would get rid of some of the unpredictable jitter, I don't think it would completely solve this issue.
OpenGL Wiki suggests the following: "If GPU<->CPU synchronization is desired, you should use a high-precision/multimedia timer rather than glFinish after a buffer swap."
Does somebody know how to properly grab such a high-precision/multimedia timer value right after the swapBuffer call has completed in the GPU queue?
Recent OpenGL provides sync/fence objects. You can place sync objects in the OpenGL command stream and later wait for them to get passed. See http://www.opengl.org/wiki/Sync_Object
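Here is a minimal sketch of that approach under some assumptions: a GL 3.2+ context with a function loader already set up, Linux with clock_gettime as the high-precision timer, and a hypothetical swapBuffers() wrapper around glXSwapBuffers. Note that this captures the moment the GPU has processed the swap command, which is still not necessarily the instant the pixels become visible on the panel.

    // Place a fence right after the buffer swap, wait for the GPU to pass it,
    // then grab a CPU timestamp.
    #include <time.h>

    void swapBuffers(); // hypothetical wrapper around glXSwapBuffers(display, drawable)

    double timestampAfterSwap()
    {
        swapBuffers();

        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        // Wait up to one second for the GPU to pass the fence (timeout is in nanoseconds).
        glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull);
        glDeleteSync(fence);

        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9; // seconds on a monotonic clock
    }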

What is the accepted timing strategy when using Vertical Synchronisation?

Coming from a basic understanding of OpenGL programming, my picture is that all required drawing operations are performed in sequence, once per frame redraw, and the performance of the hardware essentially dictates how fast this happens. As I understand it, a game will attempt to draw as quickly as possible, so the redraw operations are essentially wrapped in a while loop, and the graphics operations (graphics engine) are then optimised to ensure the frame rate is acceptable for the application.
Graphics hardware supporting Vertical Synchronisation however locks frame rates to the display rate. A first question would be how should a graphics engine interact with the hardware synchronisation? Is this even possible or does the renderer work at maximum speed and the hardware selectively calls up the latest frame, discarding all unused previous frames..?
The motivation for this question is not that I intend to write a graphics engine right away; rather, I'm debugging an issue with an existing system where the graphics of a moving scene appear to stutter on screen. Symptomatically, the stutter is slight when VSync is turned off; when it is turned on, there is either a significant and periodic stutter or the stutter is resolved entirely. I'm somewhat clutching at straws as to what is happening or why, and I want to understand more of the background on how graphics systems work.
In summary, the question is how one is expected to interact with hardware redraw events, and whether that is even possible. Any additional information would be welcome, however.
A first question would be how should a graphics engine interact with the hardware synchronisation?
To avoid flicker, modern rendering systems use double buffering, i.e. there are two color plane buffers, and after drawing to one has finished, the display readout pointer is switched to the finished buffer plane. This buffer swap can happen synchronized or unsynchronized. With V-Sync enabled the buffer swap is synchronized, and the rendering thread blocks until the swap has happened.
Since double buffering mandates buffer swaps, it implicitly introduces a synchronization mechanism. This is how interactive rendering systems lock onto the display refresh.
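Below is a small sketch of that blocking behaviour, assuming GLFW as the windowing layer (the question doesn't name one); drawFrame() is a placeholder for whatever issues the frame's GL commands.

    #include <GLFW/glfw3.h>

    void drawFrame(); // hypothetical: issue the GL commands for one frame

    void renderLoop(GLFWwindow* window)
    {
        glfwSwapInterval(1); // request V-Sync: swaps wait for the vertical retrace

        while (!glfwWindowShouldClose(window)) {
            drawFrame();
            glfwSwapBuffers(window); // with V-Sync on, this (or a later GL call)
                                     // blocks until the swap lines up with the retrace
            glfwPollEvents();
        }
    }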
Symptomatically, the stutter is slight when VSync is turned off; when it is turned on, there is either a significant and periodic stutter or the stutter is resolved entirely.
This sounds like a badly written animation loop that assumes constant framerate locked onto the display refresh rate, based on the assumption that frames render faster than a display refresh interval and the buffer swap can be issued in time for the next retrace to happen.
The only robust way to deal with vertical synchronization is to actually measure the time between frame renderings and advance the rendering loop by that amount of time.
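A minimal sketch of that measured-delta approach, using std::chrono; running, update() and render() are hypothetical hooks into the rest of the engine.

    #include <chrono>

    bool running = true;      // hypothetical
    void update(double dt);   // hypothetical: advance animation/physics
    void render();            // hypothetical: draw and swap buffers

    void runLoop()
    {
        using clock = std::chrono::steady_clock;
        auto previous = clock::now();

        while (running) {
            auto now = clock::now();
            // Real time elapsed since the last iteration, in seconds.
            double dt = std::chrono::duration<double>(now - previous).count();
            previous = now;

            update(dt);   // advance by the measured time, not an assumed 1/60 s
            render();     // with V-Sync on, the buffer swap in here may block
        }
    }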
This is a guess, but:
The Problem Isn't Vertical Synchronization
I don't know what OS you're working with, but there are various ways to get information about the monitor and how fast the screen is refreshing (for the purposes of this answer, we'll assume your monitor is somewhat recent and redraws at a rate of 60 Hz, or 60 times every second, or once every 16.66666... milliseconds).
Renderers are usually paired with a "Logic" side of the application: input, UI calculations, running the simulation, etc. It seems like the logic side of your application is running fast enough, but the rendering side - i.e. the Draw Call, as it's commonly summed up - is bounding the speed of your application.
Vertical Synchronization can exacerbate this: if your Draw Call is meant to happen every 16.66666 milliseconds but takes much longer than that, then you perceive a frame rate drop (frames "stutter" because each one takes too long to produce). VSync - and enabling or disabling it - is not something that bottlenecks your code: it just says "hey, since the hardware is only going to take one frame from us every 16.66666 milliseconds, why make more draw calls than one in that interval? As long as we do one draw call per interval, our application will look as fluid as possible, and we don't have to waste time making more calls than that!"
The problem with that is that it assumes your code is going to run fast enough to make it in those 16.6666 milliseconds. If it does not, stuttering, lagging, visual artifacts, frozen frames, and other things manifest themselves on screen.
When you turn off VSync, you're telling your Render Call to be called as often as possible, as fast as possible. This may give it some extra wiggle room alongside the Logic call to get a frame rendered, so that when the Hardware Says "I'm gonna take a picture and put it on the screen now!" it's all prettied up, just in time, to get into posture and say cheese! (though by what you say, it barely makes it).
What To Do:
Start by profiling your code. Find out which functions are taking the most time. Judging by the stutter, something in your code is taking longer than expected and giving you undesirable performance. Profile first to find the critical sections where you're burning away time, then figure out how to make them faster while keeping them correct. You may want to figure out what's being called in the Render Call and profile how long one cycle of that takes specifically. Then time the Logic call(s) and see how long those take to execute as well. Then, chop away.
Good luck!

What is the correct way to calculate the FPS given that GPUs have a task queue and are asynchronous?

I always assumed that the correct way to calculate the FPS was to simply time how long it took to do an iteration of your draw loop. And much of the internet seems to be in accordance.
But!
Modern graphics cards are treated as asynchronous servers, so the draw loop sends out drawing instructions for vertex/texture/etc. data already on the GPU. These calls do not block the calling thread until the request completes on the GPU; they are simply added to the GPU's task queue. So surely the 'traditional' (and rather ubiquitous) method is just measuring the call dispatch time?
What prompted me to ask is that I had implemented the traditional method and it gave consistently, absurdly high framerates, even when what was being rendered caused the animation to become choppy. Re-reading my OpenGL SuperBible brought me to glGenQueries, which lets me time sections of the rendering pipeline.
To summarise, is the 'traditional' way of calculating FPS totally defunct with (barely) modern graphics cards? If so why are the GPU profiling techniques relatively unknown?
Measuring fps is hard. It's made harder by the fact that various people who want to measure fps don't necessarily want to measure the same thing. So ask yourself this. Why do you want an fps number?
Before I go on and dive into all the pitfalls and potential solutions, I do want to point out that this is by no means a problem specific to "modern graphics cards". If anything, it used to be way worse, with SGI-type machines where the rendering actually happened on a graphics subsystem that could be remote from the client (as in, physically remote). GL 1.0 was actually defined in terms of client-server.
Anyways. Back to the problem at hand.
fps, meaning frames per second, really tries to convey, in a single number, a rough idea of the performance of your application, in a form that can be directly related to things like the screen refresh rate. For a first-level approximation of performance it does an OK job. It breaks down completely as soon as you want to delve into more fine-grained analysis.
The problem is really that the thing that matters most as far as "feeling of smoothness" of an application, is when the picture you drew ends up on the screen. The secondary thing that matters quite a bit too is how long it took between the time you triggered an action and when its effect shows up on screen (the total latency).
As an application draws a series of frames, it submits them at times s0, s1, s2, s3,... and they end up showing on screen at t0, t1, t2, t3,...
To feel smooth you need all the following things:
t(n) - s(n) is not too high (latency)
t(n+1) - t(n) is small (under 30 ms)
there is also a hard constraint on the simulation delta time, which I'll talk about later.
When you measure the CPU time for your rendering, you end up measuring s1-s0 to approximate t1-t0. As it turns out, this, on average, is not far from the truth, as client code will never go "too far ahead" (this is assuming you're rendering frames all the time though. See below for other cases). What does happen in fact is that the GL will end up blocking the CPU (typically at SwapBuffer time) when it tries to go too far ahead. That blocking time is roughly the extra time taken by the GPU compared to the CPU on a single frame.
If you really want to measure t1-t0, as you mentioned in your own post, queries are closer to it. But... things are never really that simple. The first problem is that if you're CPU bound (meaning your CPU is not quick enough to always provide work to the GPU), then part of the time t1-t0 is actually idle GPU time. That won't get captured by a query. The next problem is that depending on your environment (display compositing, vsync), queries may only measure the time your application spends rendering to a back buffer, which is not the full rendering time (as the display has not been updated at that point). It gives you a rough idea of how long your rendering will take, but it won't be precise either. Further note that queries are also subject to the asynchronicity of the graphics part, so if your GPU is idle part of the time, the query may miss that part (e.g. say your CPU takes a very long time (100 ms) to submit the frame, and then the GPU executes the full frame in 10 ms; the query will likely report 10 ms, even though the total processing time was closer to 100 ms...).
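For reference, here is a small sketch of the timer-query approach the question mentions (glGenQueries with GL_TIME_ELAPSED, GL 3.3 / ARB_timer_query); drawScene() is a placeholder, and as noted above the result only covers GPU execution time.

    GLuint query = 0;
    void drawScene(); // hypothetical: submit the frame's draw calls

    void timeGpuFrame()
    {
        if (!query)
            glGenQueries(1, &query);

        glBeginQuery(GL_TIME_ELAPSED, query);
        drawScene();
        glEndQuery(GL_TIME_ELAPSED);
    }

    void readGpuTimeIfReady()
    {
        // Poll a frame or so later, so we don't stall waiting for the result.
        GLint available = 0;
        glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
        if (available) {
            GLuint64 gpuTimeNs = 0;
            glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuTimeNs);
            double gpuTimeMs = gpuTimeNs / 1.0e6; // GPU time spent on the frame's commands
            (void)gpuTimeMs; // e.g. feed into an ms-per-frame display
        }
    }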
Now, with respect to "event-based rendering" as opposed to the continuous rendering I've discussed so far: fps for those types of workloads doesn't make much sense, as the goal is not to draw as many frames per second as possible. There the natural metric for GPU performance is ms per frame. That said, it is only a small part of the picture. What really matters there is the time between the moment you decided you wanted to update the screen and the moment it happened. Unfortunately, that number is hard to find: it typically starts when you receive the event that triggers the process and ends when the screen is updated (something you can only really measure with a camera capturing the screen output...).
The problem is that between the two, you have potential overlap between the CPU and GPU processing, or not (or even some delay between the time the CPU stops submitting commands and the GPU starts executing them). And that is completely up to the implementation. The best you can do is call glFinish at the end of the rendering to know for sure the GPU is done processing the commands you sent, and measure the time on the CPU. That solution reduces the overall performance of the CPU side, and potentially of the GPU side as well if you were going to submit the next event right after...
Last, the discussion about the "hard constraint on simulation delta time":
A typical animation uses a delta time between frames to move the animation forward. The major problem is that for a fully smooth animation, you really want the delta time you use when submitting your frame at s1 to be t1-t0 (so that when t1 shows, the time that actually elapsed since the previous frame really is t1-t0). The problem, of course, is that you have no idea what t1-t0 is at the time you submit s1... so you typically use an approximation. Many just use s1-s0, but that can break down (e.g. SLI-type systems can have some delays in AFR rendering between the various GPUs). You could also try to use an approximation of t1-t0 (or more likely t0-t(-1)) through queries. The result of getting this wrong is most likely micro-stuttering on SLI systems.
The most robust solution is to say "lock to 30 fps, and always use 1/30 s". It's also the one that allows the least leeway on content and hardware, as you have to ensure your rendering can indeed be done in those 33 ms... But that is what some console developers choose to do (fixed hardware makes it somewhat simpler).
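A minimal sketch of that "lock to 30 fps, always use 1/30 s" strategy; running, simulate() and render() are hypothetical placeholders, and the throttling could just as well come from V-Sync on a 30 Hz swap.

    #include <chrono>
    #include <thread>

    bool running = true;       // hypothetical
    void simulate(double dt);  // hypothetical: advance the game state
    void render();             // hypothetical: draw and swap

    void runFixedStepLoop()
    {
        using clock = std::chrono::steady_clock;
        const auto step = std::chrono::duration_cast<clock::duration>(
            std::chrono::duration<double>(1.0 / 30.0));

        auto nextFrame = clock::now() + step;
        while (running) {
            simulate(1.0 / 30.0);                     // always the same fixed delta
            render();
            std::this_thread::sleep_until(nextFrame); // throttle to roughly 30 Hz
            nextFrame += step;
        }
    }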
"And much of the internet seems to be in accordance." doesn't seem totally correct for me:
Most of the publications would measure how long it takes to MANY iterations, then normalize. This way you can reasonably assume, that filling (and aemptying) the pipe are only a small part of the overall time.
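A small sketch of that "time many iterations, then normalize" approach; renderFrame() is a placeholder for one iteration of the draw loop, and a current GL context is assumed for the glFinish call.

    #include <chrono>

    void renderFrame(); // hypothetical: one full iteration of the draw loop

    double averageFrameTimeMs(int frameCount)
    {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();

        for (int i = 0; i < frameCount; ++i)
            renderFrame();

        glFinish(); // make sure the GPU has drained all the queued work
        std::chrono::duration<double, std::milli> total = clock::now() - start;
        return total.count() / frameCount; // pipeline fill/drain cost is amortized
    }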

Threading Model for a Game Engine

I'm interested in adding threading to the small engine I'm working on in my spare time, but I'm curious what the best approach is. In particular, I'm curious about the recommended way to sync the physics thread with the rest of the engine, similar to ThisGuy. I'm working with the Bullet Physics SDK, which already uses the data-copy method he was describing, but I was wondering: once Bullet goes through one simulation step and then syncs the data back to the other threads, won't that result in something like a vertical-sync problem, where the rendering thread, halfway through processing its data, suddenly starts using a newer and different set of information?
Is this something the viewer will be able to notice? What if an explosion of some sort appears along with the object that was meant to be destroyed?
If this is an issue, what then is the best way to solve it?
Lock the physics thread so it can't do anything until the rendering thread (and basically every other thread) has gone through its frame? That seems like it would waste some CPU time. Or is the preferable method to triple buffer: copy the physics data to a second location, continue the physics simulation, then copy that data to the rendering thread once it's ready?
What approaches do you guys recommend?
The easiest and probably most used variant is to run the physics, render, AI, ... threads in parallel and synchronise them after each of them has finished a frame/timestep.
This is not the fastest solution, but it is the one with the fewest problems.
Writing the data back to the rendering thread while it is running leads to massive synchronisation problems (e.g. you have to lock each vector/matrix while updating it).
To make the parallelisation efficient, you have to minimise the amount of data to synchronise, e.g. only write data to the render thread that can possibly be rendered.
If you don't synchronise after each frame, you can get the effect that the physics/AI uses all the CPU power to produce 60 updates per second while the renderer only renders 10 fps, which in most cases is not what you want.
Double buffering would also increase performance, but you still need to synchronise your threads. A problem is the AI and physics (or similar) threads, because they may want to modify the same data.
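A minimal sketch of a double-buffered hand-off between a physics thread and the render thread, so the renderer never sees a half-written update; RenderState and the class are invented for the example.

    #include <mutex>
    #include <vector>

    struct RenderState {                  // hypothetical: whatever the renderer needs
        std::vector<float> transforms;
    };

    class StateExchange {
    public:
        // Physics thread: fill its own copy, then publish it under a short lock.
        void publish(const RenderState& finished) {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_ = finished;
            hasNew_ = true;
        }

        // Render thread: pick up the latest complete snapshot once per frame.
        // Between calls it keeps drawing the previous snapshot, so the data it
        // reads never changes mid-frame.
        const RenderState& acquire() {
            std::lock_guard<std::mutex> lock(mutex_);
            if (hasNew_) {
                current_ = pending_;
                hasNew_ = false;
            }
            return current_;
        }

    private:
        std::mutex mutex_;
        RenderState pending_;
        RenderState current_;
        bool hasNew_ = false;
    };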