DirectX application "hiccups" every 3 seconds - c++

I've been investigating an issue in my DirectX 11 C++ application for over a week now, and so I'm turning to the good people on StackOverflow for any insight that may help me track this one down.
My application will run mostly at 60-90 frames per second, but every few seconds I'll get a frame that takes around a third of a second to finish. After much investigation, debugging, and using various code profilers, I have narrowed it down to calls to the DirectX API. However, from one slow frame to the next, it is not always the same API call that causes the slowdown. In my latest run, the calls that stall (always for about a fifth of a second) are
ID3D11DeviceContext::UpdateSubresource
ID3D11DeviceContext::DrawIndexed
IDXGISwapChain::Present
Not only is it not always the same function that stalls, but for each of these functions (mainly the first two) the slow call may come from a different place in my code from one slow frame to the next.
According to multiple profiling tools and my own high resolution timers I placed in my code to help measure things, I found that this "hiccup" would occur at consistent intervals of just under 3 seconds (~2.95).
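For reference, here is a minimal sketch of the kind of per-call timing described above, using std::chrono; the helper name, the 50 ms threshold, and the variable names in the usage comments are illustrative assumptions, not the actual code.

    #include <chrono>
    #include <cstdio>

    // Hypothetical helper: time a single API call and report it if it stalls.
    // The 50 ms threshold is an arbitrary value for illustration.
    template <typename Fn>
    void TimedCall(const char* name, Fn&& fn)
    {
        using clock = std::chrono::steady_clock;
        const auto t0 = clock::now();
        fn();                                   // e.g. swapChain->Present(1, 0);
        const auto t1 = clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > 50.0)
            std::printf("%s stalled for %.1f ms\n", name, ms);
    }

    // Usage inside the render loop (swapChain, context, etc. are placeholders):
    //   TimedCall("Present", [&] { swapChain->Present(1, 0); });
    //   TimedCall("UpdateSubresource", [&] {
    //       context->UpdateSubresource(buffer, 0, nullptr, data, 0, 0);
    //   });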
This application collects data from external hardware and uses DirectX to visualize that data in real time. While the application is running, the hardware may be idle or running at various speeds. The faster the hardware goes the more data is collected and must be visualized. I point this out because it may be useful when considering some of the characteristics of this bug:
The long frames don't occur while the hardware is idle. This makes sense to me because the software just has to redraw data it already has and doesn't have to transfer new data over to the GPU.
However, the long frames occur at these consistent 3 second intervals regardless of the speed the hardware is running. So even if my application is collecting twice the amount of data per second, the frequency of the long frames doesn't change.
The duration of these long frames is very consistent. Always between 0.25 and 0.3 seconds (I believe it is the slow call to the DirectX API that is consistent, so any variation on the overall frame duration is external to that call).
While field testing last week (when I first discovered the issue), I noticed that on a couple of runs of the application, after a long time (probably 20 minutes or more) of continuous testing without interacting much with the program aside from watching it, the hiccup would go away. The hiccup would come back if we interacted with some features of the application or restarted the program. It doesn't make sense to me - it's almost as if the GPU "figured out" and fixed the issue, but then reverted when we changed the pattern of work it had been doing previously. Unfortunately the nature of our hardware makes it difficult for me to replicate this in a lab environment.
This bug is occurring consistently on two different machines with very similar hardware (dual GTX 580 cards). However, in earlier versions of the application this issue did not occur. Unfortunately the code has undergone many changes since then, so it would be difficult to pinpoint what specific change is causing the issue.
I considered the graphics driver, and so updated to the latest version, but that didn't make a difference. I also considered the possibility that some other change made to both computers, or an update to software running on both of them, could be causing issues with the GPU. But I can't think of anything other than Microsoft Security Essentials that is running on both machines while the application runs, and I've already tried disabling its Real-Time Protection feature to no avail.
While I would love for the cause to be an external program that I can just turn off, ultimately I worry that I must be doing something incorrect or improper with the DirectX API that is causing the GPU to have to make adjustments every few seconds. Maybe I am doing something wrong in the way I update data on the GPU (since the lag only happens when I'm collecting data to display). Then the GPU stalls every few seconds, and whatever API function happens to get called during a stall can't return as fast as it normally would?
Any suggestions would be greatly appreciated!
Thanks,
Tim
UPDATE (2013.01.21):
I finally gave in and went searching back through previous revisions of my application until I found a point where this bug wasn't occurring. Then I went revision by revision until I found exactly when the bug started happening and managed to pinpoint the source of my issue. The problem started occurring after I added an 'unsigned integer' field to a vertex type of which I allocate a large vertex buffer. Because of the size of the vertex buffer, this change increased its size by 184.65 MB (from 1107.87 MB to 1292.52 MB). Because I do in fact need this extra field in my vertex structure, I found other ways to cut back on the overall vertex buffer size and got it down to 704.26 MB.
My best guess is that the addition of that field and the extra memory it required caused me to exceed some threshold/limit on the GPU. I'm not sure if it was an excess of total memory allocation, or an excess of some limit on a single vertex buffer. Either way, it seems that this excess caused the GPU to have to do some extra work every few seconds (maybe communicating with the CPU), and so my calls to the API had to wait on it. If anyone has any information that would clarify the implications of large vertex buffers, I'd love to hear it!
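To make the arithmetic concrete, here is a sketch of how one extra 32-bit field scales with a large buffer; the vertex layout and vertex count below are made-up illustrations, not the application's actual structure.

    #include <cstdint>
    #include <cstdio>
    #include <DirectXMath.h>

    // Hypothetical vertex layouts before and after the change.
    struct VertexOld {
        DirectX::XMFLOAT3 position;   // 12 bytes
        DirectX::XMFLOAT4 color;      // 16 bytes
    };
    struct VertexNew {
        DirectX::XMFLOAT3 position;   // 12 bytes
        DirectX::XMFLOAT4 color;      // 16 bytes
        std::uint32_t     id;         //  4 bytes: the added field
    };

    int main()
    {
        const std::uint64_t vertexCount = 40000000ull;   // made-up vertex count
        const double mb = 1024.0 * 1024.0;
        std::printf("old buffer: %.2f MB\n", vertexCount * sizeof(VertexOld) / mb);
        std::printf("new buffer: %.2f MB\n", vertexCount * sizeof(VertexNew) / mb);
        // Every extra field is paid once per vertex, so with tens of millions of
        // vertices a single unsigned int adds well over a hundred megabytes.
    }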
Thanks to everyone who gave me their time and suggestions.

1) Try turning off VSYNC
2) Are you allocating/deallocating large chunks of memory? Try to allocate memory at the beginning of the program and don't deallocate it; simply overwrite it (which is probably what you're doing with UpdateSubresource)
3) Put the interaction with the hardware device on a separate thread. After the device has completely finished passing data to your application, load it into the GPU. Do not let the device take control of the main thread. I suspect the device is blocking the main thread every so often - I'm completely speculating, but if you are copying data from the device to the GPU directly, the device may be blocking occasionally and causing the slowdown.
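A minimal sketch of the threading split in point 3, assuming the hardware exposes some blocking read call; every name here (ReadChunkFromHardware, UploadToGpu, etc.) is a placeholder, not part of any real API.

    #include <atomic>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <utility>
    #include <vector>

    using Chunk = std::vector<std::uint8_t>;     // placeholder for one block of device data

    Chunk ReadChunkFromHardware();               // hypothetical blocking device call
    void  UploadToGpu(const Chunk& c);           // hypothetical wrapper around UpdateSubresource

    std::queue<Chunk> g_pending;
    std::mutex        g_mutex;
    std::atomic<bool> g_running{true};

    // Runs on its own std::thread: blocks on the device, never touches D3D.
    void AcquisitionThread()
    {
        while (g_running) {
            Chunk c = ReadChunkFromHardware();
            std::lock_guard<std::mutex> lock(g_mutex);
            g_pending.push(std::move(c));
        }
    }

    // Called once per frame on the render thread: drain whatever has arrived,
    // and do the GPU uploads only from here.
    void DrainAndUpload()
    {
        std::queue<Chunk> local;
        {
            std::lock_guard<std::mutex> lock(g_mutex);
            std::swap(local, g_pending);
        }
        while (!local.empty()) {
            UploadToGpu(local.front());
            local.pop();
        }
    }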

Related

Extremely slow ffmpeg/sws_scale() - only on heavy duty

I am writing a video player using ffmpeg (Windows only, Visual Studio 2015, 64 bit compile).
With common videos (up to 4K @ 30 FPS), it works pretty well. But with my maximum target - 4K @ 60 FPS - it fails. Decoding is still fast enough, but when it comes to YUV/BGRA conversion it is simply not fast enough, even though it's done in 16 threads (one thread per frame on a 16-core/32-thread machine).
So as a first countermeasure I skipped the conversion of some frames and got a stable frame rate of ~40 that way. Comparing the two versions in Concurrency Visualizer, I found a strange issue whose cause I don't know.
Here's an image of the frameskip version:
You can see that the conversion is pretty quick (averaging roughly ~35ms).
Thus, as multiple threads are used, it should also be quick enough for 60 FPS, but it isn't!
The image of the non-frameskip version shows why:
The conversion of a single frame has become ten times slower than before (averaging roughly ~350ms). Now, a heavy workload on many cores would of course cause a minor slowdown per core due to reduced turbo - let's say 10 or 20% - but never an extreme slowdown of ~1000%.
An interesting detail is that the stack trace of the non-frameskip version shows some system activity I don't really understand, beginning with ntoskrnl.exe!KiPageFault+0x373. There are no exceptions, error messages, or the like - it just becomes extremely slow.
Edit: A colleague just told me that at first glance this looks like a memory problem with paged-out memory - but my memory utilization is low (below 1GB, with more than 20GB free)
Can anyone tell me what could be causing this?
This is probably too old to be useful, but just for the record:
What's probably happening is that you're allocating 4K frames over and over again in multiple threads. The Windows allocator really doesn't like that access pattern.
The malloc itself will not show up in the profiler, since the OS only fetches the pages when the memory is actually accessed. This shows up as ntoskrnl.exe!KiPageFault and gets attributed to the function that first accesses the new memory.
Solutions include:
Using a different allocator (e.g. tbb_malloc, mimalloc, etc.)
Using your own per-thread or per-process frame pool. ffmpeg does something similar internally; maybe you can just use that.
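A sketch of the frame-pool idea, not tied to ffmpeg's own pooling facilities: buffers are recycled instead of freed, so the conversion threads stop taking fresh page faults on every frame. Sizes and names are assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <mutex>
    #include <vector>

    // A minimal reusable-buffer pool: frames go back to the pool instead of the
    // heap, so the hot path stops hitting the allocator (and the soft page
    // faults that come with brand-new 4K-sized allocations).
    class FramePool {
    public:
        explicit FramePool(std::size_t frameBytes) : frameBytes_(frameBytes) {}

        std::unique_ptr<std::vector<std::uint8_t>> Acquire()
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!free_.empty()) {
                auto buf = std::move(free_.back());
                free_.pop_back();
                return buf;
            }
            return std::make_unique<std::vector<std::uint8_t>>(frameBytes_);
        }

        void Release(std::unique_ptr<std::vector<std::uint8_t>> buf)
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(buf));
        }

    private:
        std::size_t frameBytes_;
        std::mutex  mutex_;
        std::vector<std::unique_ptr<std::vector<std::uint8_t>>> free_;
    };

    // Usage (hypothetical 4K BGRA size): one pool shared by the conversion threads.
    //   FramePool pool(3840ull * 2160ull * 4ull);
    //   auto frame = pool.Acquire();          // use frame->data() as the conversion destination
    //   ...                                   // display / hand off
    //   pool.Release(std::move(frame));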

Is there any way to maximize my application's timeslice on Windows?

I have a 3D application that needs to generate a new frame roughly every 6ms or so. This frame rate needs to be constant in order not to result in stuttering. To make matters worse, the application has to perform several moderately heavy calculations (mostly preparing the 3D scene and copying data to the VRAM), so it consumes a fairly large amount of that ~6ms doing its own stuff.
This has been a problem because Windows causes my application to stutter a bit when it tries to use the CPU for other things. Is there any way I could make Windows not "give away" timeslices to other processes? I'm not concerned about it negatively impacting background processes.
Windows will allow you to raise your application's priority. A process will normally only lose CPU time to other processes with the same or higher priority, so raising your priority can prevent CPU time from being "stolen".
Be aware, however, that if you go too far, you can render the system unstable, so if you're going to do this, you generally only want to raise the priority a little bit, so it's higher than other "normal" applications.
Also note that this won't make a huge difference. If you're running into a small problem once in a while, increasing the priority may take care of it. If it's a constant problem, chances are that a priority boost won't be sufficient to fix it.
If you decide to try this, see SetPriorityClass and SetThreadPriority.
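A minimal example of the two calls mentioned, raising the priority modestly (ABOVE_NORMAL rather than REALTIME, for the stability reasons given above):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Raise the whole process one step above normal; REALTIME_PRIORITY_CLASS
        // can starve system threads and make the machine unstable.
        if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
            std::printf("SetPriorityClass failed: %lu\n", GetLastError());

        // Optionally bump just the render thread a little further.
        if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL))
            std::printf("SetThreadPriority failed: %lu\n", GetLastError());

        // ... frame loop ...
        return 0;
    }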
It normally depends on the scheduling algorithm used by your OS. Windows XP, Vista, 7, and 8 use multilevel queue scheduling with round robin, so increasing the priority of your thread or process will do what you want. Which version of Windows are you currently using? I can help you accordingly once I know that.
You can raise your process priority, but I don’t think it will help much. Instead, you should optimize your code.
For a start, use the VS built-in profiler (Debug / Performance Profiler menu) to find out where your app spends most of its time, and optimize that.
Also, all modern CPUs have at least two cores (the last single-core Celeron is from 2013). Therefore, "it consumes a fairly large amount of that ~6ms doing its own stuff" shouldn't be the case: your own stuff should run on a separate thread, not on the thread you use to render. See this article for an idea of how to achieve that. You probably don't need that level of complexity; just 2 threads + 2 tasks (compute and render) will probably be enough, but that article should give you some ideas for how to re-design your app. This approach will, however, add one extra frame of input latency (for one frame the background thread computes, and only on the next frame does the renderer thread show the result), but with your ~165Hz rendering you can probably live with that.
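As a rough sketch of the "2 threads + 2 tasks" idea (a single-slot hand-off between a compute thread and a render thread, which is where the one frame of extra latency comes from), with BuildScene and RenderScene as stand-ins for the real work:

    #include <condition_variable>
    #include <mutex>
    #include <optional>

    struct SceneData { /* output of the per-frame computation */ };

    SceneData BuildScene();                  // hypothetical heavy CPU prep
    void RenderScene(const SceneData& s);    // hypothetical draw calls + present

    std::mutex               g_mutex;
    std::condition_variable  g_cv;
    std::optional<SceneData> g_mailbox;      // one-frame-deep hand-off slot
    bool                     g_quit = false;

    void ComputeThread()
    {
        while (true) {
            SceneData next = BuildScene();   // overlaps with rendering of the previous frame
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return !g_mailbox.has_value() || g_quit; });
            if (g_quit) return;
            g_mailbox = std::move(next);
            g_cv.notify_one();
        }
    }

    void RenderThread()
    {
        while (true) {
            SceneData frame;
            {
                std::unique_lock<std::mutex> lock(g_mutex);
                g_cv.wait(lock, [] { return g_mailbox.has_value() || g_quit; });
                if (g_quit) return;
                frame = std::move(*g_mailbox);
                g_mailbox.reset();
                g_cv.notify_one();           // let the compute thread publish the next frame
            }
            RenderScene(frame);              // always one frame behind the computation
        }
    }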

Why does my logging library cause performance tests to run faster?

I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many avenues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem, or that the cause will be obvious to someone who has more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lockless queue (Boost.Lockless) and a back-end thread that performs string formatting and writes the log entries to disk.
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Run it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret them differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but results were still weird and anyway I couldn't do the change in any of the multithreaded variants.
Compare the generated assembly code. There were differences but the logging build was clearly doing more things. Nothing stood out as being an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.
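In code, that amounts to something like the following; the buffer size here is a stand-in for the benchmark's actual image buffer.

    #include <chrono>
    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    int main()
    {
        const std::size_t width = 1024, height = 1024, bytes_per_pixel = 4;
        const std::size_t bytes = width * height * bytes_per_pixel;

        // A raw allocation like the benchmark's image buffer: the pages behind it
        // are not mapped until they are first touched.
        unsigned char* sample_buffer = static_cast<unsigned char*>(std::malloc(bytes));

        // Touch every page once so the soft page faults are taken here, outside
        // the timed region, identically for the logging and non-logging runs.
        std::memset(sample_buffer, 0, bytes);

        const auto start = std::chrono::steady_clock::now();
        // ... run the Mandelbrot computation / logging benchmark ...
        const auto stop = std::chrono::steady_clock::now();
        (void)start; (void)stop;

        std::free(sample_buffer);
    }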

Clueless on how to execute big tasks on C++ AMP

I have a task to see if an algorithm I developed can run faster using computing on a GPU rather than the CPU. I'm new to computing on accelerators; I was given the book "C++ AMP", which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past, but nowadays it's mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have a task to compute some complicated function that takes a huge matrix input (like 50000 x 50000) and some other data and outputs matrix of same size. Total calculation for the whole matrix takes several hours.
On the CPU, I'd just cut the task into several pieces (the number of pieces being something like 100 or so) and execute them using Parallel.For or just a simple task-managing loop I wrote myself. Basically, keep several threads running (number of threads = number of cores), start a new part when a thread finishes, until all parts are done. And it worked well!
However, on the GPU I cannot use the same approach, not only because of memory constraints (that's OK, I can partition into several parts) but because anything that runs for over 2 seconds is considered a "timeout" and the GPU gets reset! So I must ensure that every part of my calculation takes less than 2 seconds to run.
But it isn't just every single task (like partitioning an hour-long job into 60 tasks of 1 second each), which would be easy enough; it's every batch of tasks, because no matter what queue mode I choose (immediate or automatic), if I run anything via parallel_for_each that takes more than 2 seconds in total to execute, the GPU will get reset.
Not only that, but while my CPU program can hog all CPU resources and, as long as it is kept at lower priority, the UI stays interactive and the system responsive, when executing code on the GPU the screen seems to be frozen until execution is finished!
So, what do I do? In the demonstrations in the book (the N-body problem), it is shown to be something like 100x as effective (the multicore calculation gives 2 GFLOPS, or whatever amount of FLOPS it was, while AMP gives 200 GFLOPS), but in a real application I just don't see how to do it!
Do I have to partition my big task into billions of pieces, each taking something like 10ms to execute, and run 100 of them at a time in parallel_for_each?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!
TDRs (the 2s timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects your application from completely locking up the display by enforcing a timeout. This will also impact applications that try to render to the screen. Moving your AMP code to a separate CPU thread will not help; this will free up your UI thread on the CPU, but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low power system. The maximum value of N is actually limited in the application to prevent you running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break up your work into chunks that fit into sub-2s slices, or smaller ones if you want to hit a particular frame rate. You should also consider how your work is being queued. Remember that all AMP work is queued, and in automatic mode you have no control over when it runs. Using immediate mode is the way to get better control over how commands are batched.
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.
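A sketch of the chunking approach with C++ AMP, splitting a large 2-D domain into row bands and waiting between submissions; the chunk size and the per-element work are placeholders to be tuned so that each band finishes well under the TDR limit.

    #include <algorithm>
    #include <amp.h>

    using namespace concurrency;

    void ProcessInChunks(array_view<float, 2> data)
    {
        const int rowsPerChunk = 1024;                 // tune so one chunk stays well under ~2 s
        accelerator_view av = accelerator().default_view;

        for (int start = 0; start < data.extent[0]; start += rowsPerChunk) {
            const int rows = std::min(rowsPerChunk, data.extent[0] - start);
            array_view<float, 2> band =
                data.section(index<2>(start, 0), extent<2>(rows, data.extent[1]));

            parallel_for_each(av, band.extent, [=](index<2> idx) restrict(amp) {
                band[idx] = band[idx] * 2.0f;          // placeholder for the real per-element work
            });
            av.wait();                                 // ensure this band finishes before queuing the next
        }
        data.synchronize();                            // copy results back to the host
    }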

Random Complete System Unresponsiveness Running Mathematical Functions

I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from the STL map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Caching all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations - they are coalesced into 4MB blocks before being written to disk.
Run extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it - I'm torn between CPU and memory as the culprit. CPU, because without the sleep and under different thread priorities the system performance changes noticeably. Memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)..... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of ram (i.e. remove the disk IO part) and see if any particular step is problematic. If you can get a profiler then use this as well to see if there any slow spots in your code.
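For the stage-by-stage test, here is a sketch of the read-only pass (4 MB ReadFile chunks, timed on their own; the file name is a placeholder):

    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        // Stage 1 only: read the file in 4 MB chunks, no hashing, no writing.
        HANDLE file = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        std::vector<char> buffer(4 * 1024 * 1024);
        DWORD read = 0;
        unsigned long long total = 0;

        const auto t0 = std::chrono::steady_clock::now();
        while (ReadFile(file, buffer.data(), static_cast<DWORD>(buffer.size()), &read, nullptr) && read > 0)
            total += read;
        const auto t1 = std::chrono::steady_clock::now();

        CloseHandle(file);
        std::printf("read %llu bytes in %.2f s\n", total,
                    std::chrono::duration<double>(t1 - t0).count());
        // If this alone stalls the machine, the problem is in the I/O path (or what
        // the OS does with it), not in the hashing or the map.
    }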
EDIT
More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and can hang the system quite easily, i.e. mismatched Enters and Leaves. Are you multi-threading at all? Is the slow down due to race conditions between threads?
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks a limited resource (ProcExp can show you what handles are being acquired ... of course you might have difficulty with ProcExp not responding:-[)
Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS? (See the sketch after this list.)
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
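On the spin-count point, a minimal example; the 4000 value is just a commonly used starting point, not a recommendation tuned for this workload.

    #include <windows.h>

    CRITICAL_SECTION g_cs;

    void InitLocks()
    {
        // With a spin count, a thread spins briefly before falling back to a kernel
        // wait, which avoids a context switch when the lock is only held for a few
        // instructions on another core.
        InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    }

    void TouchSharedState()
    {
        EnterCriticalSection(&g_cs);
        // ... short critical region: update the chunk map, etc. ...
        LeaveCriticalSection(&g_cs);
    }

    void ShutdownLocks()
    {
        DeleteCriticalSection(&g_cs);
    }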
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler and even with all optimizations disabled I did not see the fully-system hang that I was experiencing w/ the Visual Studio 2005 - 2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.