Is there anything easy and simple for profiling functions in C++/OpenGL? All I could find was gDEBugger. Looking through the documentation I can't find a way to do what I want. Let me explain...
As I've said in other questions, I have a game with defense towers. Currently there are just 3, but this number is configurable. I have a single draw function for all the towers (this function may call other functions, it doesn't matter) and I would like to profile this single function (for 3 towers, and then increase the number and profile again). Then I would like to implement display lists for the towers, do the same profiling, and see if there was any benefit to using display lists for this specific situation.
What profiling tool do you recommend for such a task? If it matters, I'm coding OpenGL on Windows with Visual Studio 10. Or can this be done with gDEBugger? Any pointers?
P.S: I'm aware that display lists were removed on OpenGL 3.1, but the above is just an example.
NVIDIA has one, and so do AMD and Intel.
For coarse-grained monitoring you can measure the time it takes to execute a frame from the beginning to after your buffer swap or glFlush()/glFinish():
while( running )
{
    start_time = GetTimeInMS();
    RenderFrame();
    SwapGLBuffers();     // or glFinish(), so the GPU work for the frame is included
    end_time = GetTimeInMS();
    cout << "Frame time (ms): " << (end_time - start_time) << endl;
}
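If you want something concrete, here is a minimal sketch of the same idea using std::chrono (C++11; RenderFrame() and SwapGLBuffers() are placeholders for your own rendering and swap calls, and on an older toolchain you could substitute a platform timer such as QueryPerformanceCounter):

#include <chrono>
#include <iostream>

void RenderFrame();      // your rendering code
void SwapGLBuffers();    // your SwapBuffers()/glFinish() call

void MainLoop(bool& running)
{
    using Clock = std::chrono::steady_clock;
    while (running)
    {
        auto frame_start = Clock::now();
        RenderFrame();
        SwapGLBuffers();                    // frame is presented here
        auto frame_end = Clock::now();
        double ms = std::chrono::duration<double, std::milli>(frame_end - frame_start).count();
        std::cout << "Frame time (ms): " << ms << '\n';
    }
}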
Firstly: Context is VR with HP Reverb G2, WMR runtime, DX12.
We're seeing some unexplained behaviour across developer machines when working with OpenXR. It looks as though the OpenXR runtime changes the way it presents depending on the machine's setting for the preferred GPU.
More specifically, we noticed that depending on the machine's preferred-GPU setting, a different method is used when xrEndFrame is called. This is a big deal, because the different method results in a blank screen being drawn into our current render target!
The difference is that when the preferred device is an NVIDIA GPU, xrEndFrame looks like this in PIX (in a graphics queue that is separate from our main render):
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
2 8063 Signal(pFence:obj#20, Value:62)
3 8064 Wait(pFence:obj#36, Value:31)
5 8065 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4084, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
6 8066 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:1}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4085, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
8 8067 Signal(pFence:obj#20, Value:63)
9 8068 Signal(pFence:obj#21, Value:31)
and when it isn't (i.e. when it is somehow picking up the onboard Intel GPU?), it looks like this:
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
0 8064 Wait(pFence:obj#45, Value:21)
2 8065 ClearRenderTargetView(RenderTargetView:res#4008, ColorRGBA:{Element:0, Element:0, Element:0, Element:0}, NumRects:0, pRects:nullptr)
15 8066 DrawIndexedInstanced(IndexCountPerInstance:4, InstanceCount:2, StartIndexLocation:0, BaseVertexLocation:0, StartInstanceLocation:0)
17 8067 Signal(pFence:obj#22, Value:23)
18 8068 Signal(pFence:obj#23, Value:21)
The latter is clearing the current renderTargetView and drawing a quad over the top that is the dimensions of the headset display.
Yet we've checked the rendering code and it is definitely not selecting the Intel graphics device. However, the second behaviour goes away if we set 'preferred graphics processor' to the NVIDIA GPU in the NVIDIA Control Panel.
We can also see that the above behaviour is the result of a call to xrEndFrame, and that our rendering code is identical otherwise.
Any clue as to what part of the runtime might be looking at or influenced by this setting?
Unfortunately (or fortuitously?) we found we would need to rework the rendering code in order to swap runtimes to, say, SteamVR, so right now we can't swap out the runtime.
Obviously we have a workaround, which is to set the preferred device. But understanding how/why this issue is occurring would be great.
So this was finally tracked down to an error on our part.
In our case we were using xrGetD3D12GraphicsRequirementsKHR to get the minimum graphics requirements for OpenXR.
This returns an adapterLuid identifier in the XrGraphicsRequirementsD3D12KHR structure, which we should have been using to select the GPU in the graphics API, but weren't.
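For anyone hitting the same thing, here is a rough sketch of the adapter selection we should have been doing (assuming XR_USE_GRAPHICS_API_D3D12 and the XR_KHR_D3D12_enable extension are in use; error handling omitted):

#define XR_USE_GRAPHICS_API_D3D12
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <openxr/openxr.h>
#include <openxr/openxr_platform.h>

using Microsoft::WRL::ComPtr;

// Create the D3D12 device on the adapter the OpenXR runtime expects.
ComPtr<ID3D12Device> CreateDeviceForXr(XrInstance instance, XrSystemId systemId)
{
    // xrGetD3D12GraphicsRequirementsKHR is an extension function, loaded by address.
    PFN_xrGetD3D12GraphicsRequirementsKHR pfnGetReqs = nullptr;
    xrGetInstanceProcAddr(instance, "xrGetD3D12GraphicsRequirementsKHR",
                          reinterpret_cast<PFN_xrVoidFunction*>(&pfnGetReqs));

    XrGraphicsRequirementsD3D12KHR reqs{XR_TYPE_GRAPHICS_REQUIREMENTS_D3D12_KHR};
    pfnGetReqs(instance, systemId, &reqs);

    // Find the DXGI adapter whose LUID matches the one the runtime reported.
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));
    ComPtr<IDXGIAdapter1> adapter;
    factory->EnumAdapterByLuid(reqs.adapterLuid, IID_PPV_ARGS(&adapter));

    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(adapter.Get(), reqs.minFeatureLevel, IID_PPV_ARGS(&device));
    return device;
}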
Question summary: all four cores used when running a single threaded programme. Why?
Details: I have written a non-parallelised programme in Xcode (C++). I was in the process of parallelising it, and wanted to see whether what I was doing was actually resulting in more cores being used. To that end I used Instruments to look at the core usage. To my surprise, while my application is single threaded, all four cores were being utilised.
To test whether it changed the performance, I dialled down the number of cores available to 1 (you can do it in Instruments, preferences) and the speed wasn't reduced at all. So (as I knew) the programme isn't parallelised in any way.
I can't find any information on what it means to use multiple cores to perform single threaded tasks. Am I reading the Instruments output wrong? Or is the single-threaded process being shunted between different cores for some reason (like changing lanes on a road instead of driving in two lanes at once - i.e. actual parallelisation)?
Thanks for any insight anyone can give on this.
EDIT with MWE (apologies for not doing this initially).
The following is C++ code that finds primes under 500,000, compiled in Xcode.
#include <iostream>
#include <ctime>    // for clock(), clock_t and CLOCKS_PER_SEC

int main(int argc, const char * argv[]) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;
    int num_max = 500000;
    while (num <= num_max) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num) {
            primes++;
            std::cout << "Prime: " << num << std::endl;
        }
        num++;
    }
    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    std::cout << "This machine calculated all " << primes << " primes under " << num_max << " in " << runTime << " seconds." << std::endl;
    return 0;
}
This runs in 36s or thereabouts on my machine, as shown by the final output and by my phone's stopwatch. When I profile it (using Instruments launched from within Xcode) it reports a run-time of around 28s. The following image shows the core usage.
instruments showing core usage with all 4 cores (with hyper threading)
Now I reduce the number of available cores to 1. Re-running from within the profiler (pressing the record button), it reports a run-time of 29s; a picture is shown below.
instruments output with only 1 core available
That would accord with my theory that more cores don't improve performance for a single-threaded programme! Unfortunately, when I actually time the programme with my phone, the above took about 1 minute 30s, so there is a meaningful performance gain from having all cores switched on.
One thing that is really puzzling me is that, if you leave the number of cores at 1, go back to Xcode and run the programme, it again says it takes about 33s, but my phone says it takes 1 minute 50s. So changing the number of cores is doing something to the internal clock (perhaps).
Hopefully that describes the problem fully. I'm running on a 2015 15-inch MBP with a 2.2 GHz quad-core i7 processor, Xcode 7.3.1.
I want to preface this by saying that your question lacks a lot of the information needed for an accurate diagnosis. Anyway, I'll try to explain the most common reason IMHO, assuming your application doesn't use third-party components that work in a multi-threaded way.
I think this could be an effect of the scheduler. Let me explain what I mean.
Each core of the processor takes a process in the system and executes it for a "short" amount of time. This is the most common approach in desktop operating systems.
Your process is executed on a single core for this amount of time and is then paused so that other processes can continue. When your process is resumed, it may be executed on another core (always one core at a time, but possibly a different one). So a task manager with low time resolution can report utilisation on all cores, even though the process only ever runs on one at a time.
To verify whether that is the cause, I suggest you look at the CPU % used while your application is running. For a single-threaded application the CPU usage should be about 1/numberOfCores, in your case 25%.
If it's a release build, your compiler may be vectorising or parallelising your code. Also, libraries you link against, say the standard library for example, may be threaded or vectorised.
Problem
I originally posted this question, but the approach apparently did not meet my customer's spec. Hence I am redefining the problem:
To understand the problem a bit more, the timing diagram in the original post can be used. The delayer needs to be platform independent. To be precise, I run a job scheduler, and apparently my current delayer is not going to be compatible with it. What I am stuck on is the "Independent" bit of the delayer. I have already knocked out a delayer in SIMULINK using Probe (probing for Sampling Time) and Variable Integer Delay blocks. However, during our acceptance phase we realised that the scheduler does not work with such a configuration and needs something more intrinsic and basic, something like a while loop running in a C/C++ application.
Initial Solution
What I can think of a solution is the following:
Define a global and static time-slice variable called tslc. Basically, this is how often the scheduler runs. The unit could be seconds.
Define a function that has the following body:
void hold_for_secs(float* tslc, float* _delay, float* _tmr, char* _flag) {
    /* Subtract one scheduler time slice from the remaining delay each call.
       Note: _tmr is currently unused. */
    _delay[0] -= tslc[0];
    /* When the remaining delay has (numerically) reached zero, raise the flag. */
    if (_delay[0] < (float)(1e-5)) {
        _flag[0] = '1';
    } else {
        _flag[0] = '0';
    }
}
Users please forgive my poor function-coding skilss, but I merely tried to come up with a solution. I would really appreciate if people help me out a little bit with suggestions here!
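To show how I imagine the scheduler driving this function, here is a rough usage sketch (the 10 ms time slice, the 50 ms delay and the loop itself are made-up values purely for illustration):

#include <stdio.h>

/* hold_for_secs() as defined above */
void hold_for_secs(float* tslc, float* _delay, float* _tmr, char* _flag);

int main(void) {
    float tslc  = 0.010f;   /* scheduler time slice: 10 ms (made-up value) */
    float delay = 0.050f;   /* hold for 50 ms, i.e. five scheduler ticks   */
    float tmr   = 0.0f;     /* currently unused by hold_for_secs()         */
    char  flag  = '0';

    /* The scheduler would call this once per time slice. */
    while (flag != '1') {
        hold_for_secs(&tslc, &delay, &tmr, &flag);
        printf("remaining delay = %f, flag = %c\n", delay, flag);
    }
    return 0;
}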
Computing Platform
Windows 2000 server
Target computing platform
An embedded system card - something similar to a modern graphics card or sound card that goes along one of the PCI slot. We do testing on a testbed and finally implement the solution on that embedded system card.
I'm analyzing some high-resolution MIDI data. I'm writing it to stdout but, since there is so much data coming in, it takes seconds for it all to display after I perform the actual action.
Currently this line writes to the commandline:
std::vector<unsigned char> message;
...
printf("W 1 = %03d, W 2 = %03d, W 3 = %03d \n",(int)message[2],(int)message2[1],(int)message2[2]);
There's a good chance that this is a video driver issue - video card manufacturers probably don't always pay a lot of attention to console window performance. I've had rigs with painfully slow - I mean tooth-extraction painful - console windows that saw probably a 100x improvement in that area after updating the video driver.
Why don't you use a string builder class like this one here, append all your output strings, and write to the output at the end?
What do you think?
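For example, here is a minimal sketch of that idea using std::ostringstream instead of a dedicated string-builder class (the sample MIDI bytes are made up, and message/message2 stand in for your own buffers):

#include <cstdio>
#include <sstream>
#include <vector>

int main() {
    std::ostringstream out;   // accumulate all output in memory first

    // ... for each incoming MIDI message (sample data made up):
    std::vector<unsigned char> message  = {0x90, 0x3C, 0x40};
    std::vector<unsigned char> message2 = {0xB0, 0x01, 0x7F};

    char line[64];
    std::snprintf(line, sizeof(line), "W 1 = %03d, W 2 = %03d, W 3 = %03d\n",
                  (int)message[2], (int)message2[1], (int)message2[2]);
    out << line;

    // ... once the capture is finished, write everything in one go:
    std::fputs(out.str().c_str(), stdout);
    return 0;
}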
I'm writing an old school ASCII DOS-Prompt game. Honestly I'm trying to emulate ZZT to learn more about this brand of game design (Even if it is antiquated)
I'm doing well, got my full-screen text mode to work and I can create worlds and move around without problems BUT I cannot find a decent timing method for my renders.
I know my rendering and pre-rendering code is fast because if I don't add any delay()s or (clock()-renderBegin)/CLK_TCK checks from time.h the renders are blazingly fast.
I don't want to use delay() because it is, to my knowledge, platform-specific, and on top of that I can't run any code while it delays (like user input and processing). So I decided to do something like this:
do {
    if (kbhit()) {
        input = getch();
        processInput(input);
    }
    if (clock()/CLOCKS_PER_SEC - renderTimer/CLOCKS_PER_SEC > RenderInterval) {
        renderTimer = clock();
        render();
        ballLogic();
    }
} while (input != 'p');
Which should in "theory" work just fine. The problem is that when I run this code (setting the RenderInterval to 0.0333 or 30fps) I don't get ANYWHERE close to 30fps, I get more like 18 at max.
I thought maybe I'd try setting the RenderInterval to 0.0 to see if the performance kicked up... it did not. I was (with a RenderInterval of 0.0) getting at max ~18-20fps.
I thought maybe, since I'm continuously calling all these clock() and "divide this by that" operations, I was slowing the CPU down something scary, but when I took the render and ballLogic calls out of the if statement's brackets and set RenderInterval to 0.0 I got, again, blazingly fast renders.
This doesn't make sense to me: if I leave the if check in, shouldn't it run just as slowly? It still has to do all the calculations.
BTW I'm compiling with Borland's Turbo C++ V1.01
The best gaming experience is usually achieved by synchronizing with the vertical retrace of the monitor. In addition to providing timing, this will also make the game run smoother on the screen, at least if you have a CRT monitor connected to the computer.
In 80x25 text mode, the vertical retrace (on VGA) occurs 70 times/second. I don't remember if the frequency was the same on EGA/CGA, but am pretty sure that it was 50 Hz on Hercules and MDA. By measuring the duration of, say, 20 frames, you should have a sufficiently good estimate of what frequency you are dealing with.
Let the main loop be something like:
while (playing) {
    /* do whatever needs to be done for this particular frame */
    VSync();
}
... /* snip */
/* Wait for vertical retrace */
void VSync() {
    while ( (inp(0x3DA) & 0x08));   /* if we are already in a retrace, wait for it to end */
    while (!(inp(0x3DA) & 0x08));   /* then wait for the next retrace to begin */
}
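If you want to measure the refresh rate rather than assume it, here is a rough sketch that counts retraces over roughly one second of clock() time (EstimateRefreshRate is a made-up helper; accuracy is limited by the coarse ~55 ms clock() resolution, so treat the result as a ballpark figure):

#include <time.h>

/* Estimate the vertical refresh rate in Hz using VSync() above. */
int EstimateRefreshRate(void) {
    clock_t start, elapsed;
    int frames = 0;
    VSync();                                 /* align to a retrace boundary first */
    start = clock();
    do {
        VSync();
        frames++;
        elapsed = clock() - start;
    } while (elapsed < CLOCKS_PER_SEC);      /* count retraces for about one second */
    return (int)((long)frames * CLOCKS_PER_SEC / elapsed);
}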
clock()-renderTimer > RenderInterval * CLOCKS_PER_SEC
would compute a bit faster, possibly even faster if you pre-compute the RenderInterval * CLOCKS_PER_SEC part.
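For example (a small sketch; renderIntervalTicks is a made-up name):

/* Pre-compute the interval once, in clock ticks, outside the loop. */
const clock_t renderIntervalTicks = (clock_t)(RenderInterval * CLOCKS_PER_SEC);

/* ... inside the main loop ... */
if (clock() - renderTimer > renderIntervalTicks) {
    renderTimer = clock();
    render();
    ballLogic();
}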
I figured out why it wasn't rendering right away. The timer that I created is fine; the problem is that clock_t is only accurate to .054547XXX seconds or so, and so I could only render at 18fps. The way I would fix this is by using a more accurate clock... which is a whole other story.
What about this: you are subtracting y (= renderTimer) from x (= clock()), and both x and y are being divided by CLOCKS_PER_SEC:
clock()/CLOCKS_PER_SEC - renderTimer/CLOCKS_PER_SEC > RenderInterval
Wouldn't it be more efficient to write:
(clock() - renderTimer) > RenderInterval * CLOCKS_PER_SEC
The first problem I saw with the division is that you're not going to get a real number from it, since it happens between two long ints. The second problem is that it is more efficient to multiply RenderInterval by CLOCKS_PER_SEC (which can be pre-computed) and this way get rid of the divisions, simplifying the operation.
Adding the brackets gives more legibility to it. And maybe by simplifying this formula you will see more easily what's going wrong.
As you've spotted with your most recent question, you're limited by CLOCKS_PER_SEC, which is only about 18. You get at most one frame per discrete value of clock(), which is why you're limited to 18fps.
You could use the screen's vertical blanking interval for timing; it's traditional for games, as it avoids "tearing" (where half the screen shows one frame and half shows another).