The Problem
I'm developing some procedural terrain generation in C++ with OpenGL. As an IDE I'm using Microsoft VS2017. I can run the "experiment" without problems. But after about two hours of development, the program slows down: within ten minutes or so, the framerate drops from over 100 FPS to 20, and shortly after that my GPU doesn't even manage one frame per second. When launching the program, it also takes an eternity to load the shaders and link the programs.
Possible Causes
After some debugging and profiling within VS2017, it turns out that over 98% of the time the CPU is waiting for the GPU to complete shader uniform operations. This includes finding the locations of uniform variables and loading three matrices into uniform variables.
Troubleshooting Steps
I've tried various things to improve the situation, including the following, but I could not fix the problem without restarting my computer:
Copy .exe and assets to another folder
Copy .exe and assets to another physical device
Relaunch VS2017
Decrease GPU and memory clock in MSI Afterburner
Check graphics card VRAM usage
Close background applications
My Computer
In case this information helps someone, here it is:
Intel® Core™ i5-6600K @ 3.5 GHz
EVGA GeForce GTX 1060 6GB GDDR5
MSI Z170-A PRO
2x8GB DDR4-2133
Thermaltake 530W PSU
2x1TB HDD in RAID1 (Has the project on it)
128GB SSD
512GB HDD
Thanks in advance,
Elias
All of your "troubleshooting" steps are voodoo. It doesn't matter which IDE you use (it's just a glorified editor anyway). It doesn't matter where in the filesystem your executable resides (it's just a block of a storage device with a page mapping to the OS). Decreasing the GPU and/or memory clock helps with stability if you're running into thermal problems, but it will not influence such creeping performance problems (also, if there were a thermal problem, you'd notice it within minutes, not hours).
Sudden drops in performance after a system runs for some time can almost always be attributed to resource exhaustion, forcing the system to swap data around. The cause for resource exhaustion is improper allocation management, i.e. an imbalance between allocating something and freeing it again.
This is what you have to debug. For OpenGL, every glGen…/glCreate… must be balanced by a matching glDelete…. For every use of new in your code, there must be a balancing delete (and for every new …[] there must be a delete[] …).
If you push objects into a container (like std::vector, std::list, std::map and so on), make sure you also carry out the garbage, i.e. dispose of objects you no longer use.
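As a sketch of that balancing discipline in C++ (not from the original answer; the GLEW header and class name are illustrative), RAII makes the matching glDelete… automatic:

    // Minimal RAII wrapper: the destructor is guaranteed to issue the
    // glDeleteBuffers that balances the glGenBuffers in the constructor.
    #include <GL/glew.h>  // assumes a loader such as GLEW is already in use

    class GlBuffer {
    public:
        GlBuffer()  { glGenBuffers(1, &id_); }
        ~GlBuffer() { glDeleteBuffers(1, &id_); }

        // Non-copyable: a copy would lead to a double glDeleteBuffers.
        GlBuffer(const GlBuffer&) = delete;
        GlBuffer& operator=(const GlBuffer&) = delete;

        GLuint id() const { return id_; }

    private:
        GLuint id_ = 0;
    };

The same pattern applies to textures, shaders, and programs.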
Related
I have a multi-threaded process. Each thread is CPU-bound (performs calculations) and also uses a lot of memory. The process starts at 100% CPU utilization according to Resource Monitor, but after several hours, CPU utilization starts to degrade, slowly. After 24 hours, it's at 90-95% and falling.
The question is: what should I look for, and which best-known methods can I use to debug this?
Additional info:
I have enough RAM - most of it is unused at any given moment.
According to perfmon - memory doesn't grow (so I don't think it's leaking).
The code is a mix of .NET and native C++, with some data marshaling back and forth.
I saw this on several different machines (servers with 24 logical cores).
One thing I saw in perfmon: the Modified Page List Bytes indicator increases over time as CPU utilization degrades.
Edit 1
One of the third-party libraries that is used is openfst. It looks like it's closely related to some misuse of that library.
Specifically, I noticed that I have the following warnings:
warning LNK4087: CONSTANT keyword is obsolete; use DATA
Edit 2
Since the question is closed, and wasn't reopened, I will write my findings and how the issue was solved in the body of the question (sorry) for future users.
It turns out there is an openfst.def file that defines all the openfst FLAGS_* symbols to be used by consuming applications/DLLs. I had to fix those to use the keyword DATA instead of CONSTANT (CONSTANT is obsolete because it's risky; more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx).
After that, no more degradation in CPU utilization was observed, and no more rise in the "Modified Page List Bytes" indicator. I suspect it was related to the default values of the FLAGS (specifically the garbage collection flag FLAGS_fst_default_cache_gc), which were nondeterministic because of the misuse of the CONSTANT keyword in the openfst.def file.
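For illustration, the fix amounts to changing entries like these in openfst.def (a sketch; the exact export list may differ between openfst versions):

    EXPORTS
        ; obsolete form that triggers LNK4087:
        ;   FLAGS_fst_default_cache_gc       CONSTANT
        ; corrected form, exporting the symbols as data:
        FLAGS_fst_default_cache_gc           DATA
        FLAGS_fst_default_cache_gc_limit     DATA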
Conclusion: understand your warnings! Eliminate as many of them as you can!
Thanks.
For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I'm familiar with use kernel-supplied statistics, not the underlying HW counters. This is especially true on Windows. (The reason is in part legacy, and in part that Windows wants its kernel statistics to be independent of hardware. The PAPI APIs attempt to address this but are still relatively new.)
One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.
You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.
As to what might be the issue, a degradation that slow is strange.
How often do you use new memory or access old? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages.
What are your memory access patterns? Does it change over time? How rapidly? Perhaps your memory access patterns over time are spreading, resulting in more cache misses.
Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
Look at whether there are periodic maintenance activities that take place over a longer interval, though these would result in a periodic degradation, say every 24 hours. This doesn't sound like your situation, since what you are experiencing is a gradual degradation.
If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").
Let us know what you ultimately find out.
The program I am working on at the moment processes a large amount of data (>32 GB). Due to "pipelining", however, a maximum of around 600 MB is present in main memory at any given time (I checked that; it works as planned).
If the program has finished, however, and I switch back to the workspace with Firefox open, for example (but also other programs), it takes a while until I can use it again (the HDD is also highly active for a while). This makes me wonder whether Linux (the operating system I use) swaps out other programs while my program is running, and why.
I have 4 GB of RAM installed on my machine, and while my program is active it never goes above 2 GB of utilization.
My program only allocates/deallocates dynamic memory in two different sizes: 32 and 64 MB chunks. It is written in C++, and I use new and delete. Should Linux not be smart enough to reuse these blocks once I free them and leave my other memory untouched?
Why does Linux kick my stuff out of memory?
Is this some other effect I have not considered?
Can I work around this problem without writing a custom memory management system?
The most likely culprit is file caching. The good news is that you can disable file caching. Without caching, your software will run more quickly, but only if you don't need to reload the same data later.
You can do this directly with the Linux APIs, but I suggest you use a library such as Boost ASIO. If your software is I/O bound, you should additionally make use of asynchronous I/O to improve performance.
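For illustration, the direct route might look like this (a rough sketch using O_DIRECT; with that flag the buffer, file offset, and read size must all be aligned to the storage block size, and error handling is omitted):

    #define _GNU_SOURCE   // O_DIRECT is a GNU extension on Linux
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void) {
        // O_DIRECT bypasses the page cache entirely.
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) return 1;

        void* buf = NULL;
        // The buffer must be aligned; 4096 covers common block sizes.
        if (posix_memalign(&buf, 4096, 4 * 1024 * 1024) != 0) return 1;

        ssize_t n;
        while ((n = read(fd, buf, 4 * 1024 * 1024)) > 0) {
            /* process n bytes ... */
        }

        free(buf);
        close(fd);
        return 0;
    }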
All the recently-used pages are causing older pages to get squeezed out of the disk cache. As a result, when some other program runs, it has to be paged back in.
What you want to do is use posix_fadvise (or posix_madvise if you're memory mapping the file) to eject pages you've forced the OS to cache so that your program doesn't have a huge cache footprint. This will let older pages from other programs remain in cache.
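A sketch of that approach (error handling omitted; the function and parameter names are placeholders):

    #include <fcntl.h>
    #include <unistd.h>

    // Read the file normally, but tell the kernel to drop each chunk from the
    // page cache once it has been processed, so other programs' pages survive.
    void process_file(int fd, char* buf, size_t chunk) {
        off_t offset = 0;
        ssize_t n;
        while ((n = read(fd, buf, chunk)) > 0) {
            /* ... process the chunk ... */
            posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);  // evict these pages
            offset += n;
        }
    }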
I've been investigating an issue in my DirectX 11 C++ application for over a week now, and so I'm turning to the good people on StackOverflow for any insight that may help me track this one down.
My application will run mostly at 60-90 frames per second, but every few seconds I'll get a frame that takes around a third of a second to finish. After much investigation, debugging, and using various code profilers, I have narrowed it down to calls to the DirectX API. However, from one slow frame to the next, it is not always the same API call that causes the slowdown. In my latest run, the calls that stall (always for about a fifth of a second) are
ID3D11DeviceContext::UpdateSubresource
ID3D11DeviceContext::DrawIndexed
IDXGISwapChain::Present
Not only is it not always the same function that stalls, but for each of these functions (mainly the first two) the slow call may come from a different place in my code from one frame to the next.
According to multiple profiling tools and my own high resolution timers I placed in my code to help measure things, I found that this "hiccup" would occur at consistent intervals of just under 3 seconds (~2.95).
This application collects data from external hardware and uses DirectX to visualize that data in real time. While the application is running, the hardware may be idle or running at various speeds. The faster the hardware goes the more data is collected and must be visualized. I point this out because it may be useful when considering some of the characteristics of this bug:
The long frames don't occur while the hardware is idle. This makes sense to me because the software just has to redraw data it already has and doesn't have to transfer new data over to the GPU.
However, the long frames occur at these consistent 3 second intervals regardless of the speed the hardware is running. So even if my application is collecting twice the amount of data per second, the frequency of the long frames doesn't change.
The duration of these long frames is very consistent. Always between 0.25 and 0.3 seconds (I believe it is the slow call to the DirectX API that is consistent, so any variation on the overall frame duration is external to that call).
While field testing last week (when I first discovered the issue), I noticed that on a couple of runs of the application, after a long time (probably 20 minutes or more) of continuous testing without interacting much with the program aside from watching it, the hiccup would go away. It would come back if we interacted with some features of the application or restarted the program. That doesn't make sense to me; it's almost as if the GPU "figured out" and fixed the issue, but then reverted when we changed up the pattern of work it had been doing. Unfortunately, the nature of our hardware makes it difficult for me to replicate this in a lab environment.
This bug is occurring consistently on two different machines with very similar hardware (dual GTX580 cards). However, in recent versions of the application, this issue did not occur. Unfortunately the code has undergone many changes since then so it would be difficult to pinpoint what specific change is causing the issue.
I considered the graphics driver, and so updated to the latest version, but that didn't make a difference. I also considered that some other change made to both computers, or an update to software running on both of them, could be causing issues with the GPU. But I can't think of anything other than Microsoft Security Essentials that is running on both machines while the application runs, and I've already tried disabling its Real-Time Protection feature to no avail.
While I would love for the cause to be an external program that I can just turn off, ultimately I worry that I must be doing something incorrectly/improperly with the DirectX API that is causing the GPU to have to make adjustments every few seconds. Maybe I am doing something wrong in the way I update data on the GPU (since the lag only happens when I'm collecting data to display). Then the GPU stalls every few seconds, and whatever API function happens to get called during a stall can't return as quickly as it normally would?
Any suggestions would be greatly appreciated!
Thanks,
Tim
UPDATE (2013.01.21):
I finally gave in and went searching back through previous revisions of my application until I found a point where this bug wasn't occurring. Then I went revision by revision until I found exactly when the bug started happening and managed to pinpoint the source of my issue. The problem started occurring after I added an 'unsigned integer' field to a vertex type from which I allocate a large vertex buffer. Because of the size of the vertex buffer, this change increased its size by 184.65 MB (from 1107.87 MB to 1292.52 MB). Because I do in fact need this extra field in my vertex structure, I found other ways to cut back on overall vertex buffer size and got it down to 704.26 MB.
My best guess is that the addition of that field and the extra memory it required caused me to exceed some threshold/limit on the GPU. I'm not sure if it was an excess of total memory allocation or an excess of some limit for a single vertex buffer. Either way, it seems that this excess caused the GPU to have to do some extra work every few seconds (maybe communicating with the CPU), and so my calls to the API had to wait on it. If anyone has any information that would clarify the implications of large vertex buffers, I'd love to hear it!
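For a sense of scale (back-of-the-envelope; the real vertex layout isn't shown, so the structs below are hypothetical): a 184.65 MB increase from one extra 4-byte field implies a buffer of roughly 48 million vertices.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical layouts; only the 4-byte growth per vertex matters here.
    struct VertexOld { float pos[3]; float normal[3]; float uv[2]; };                  // 32 bytes
    struct VertexNew { float pos[3]; float normal[3]; float uv[2]; std::uint32_t id; }; // 36 bytes

    int main() {
        const std::size_t vertexCount = 48000000; // ~184.65 MB delta / 4 bytes per vertex
        std::printf("old buffer: %zu MB\n", vertexCount * sizeof(VertexOld) / (1024 * 1024));
        std::printf("new buffer: %zu MB\n", vertexCount * sizeof(VertexNew) / (1024 * 1024));
    }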
Thanks to everyone who gave me their time and suggestions.
1) Try turning off VSYNC.
2) Are you allocating/deallocating large chunks of memory? Try to allocate memory at the beginning of the program and don't deallocate it; simply overwrite it (which is probably what you're doing with UpdateSubresource). See the sketch after this list.
3) Put the interaction with the hardware device on a separate thread. After the device has completely finished passing data to your application, load it into the GPU. Do not let the device take control of the main thread. I suspect the device is blocking the main thread every so often, and (I'm completely speculating here) if you are copying data from the device to the GPU directly, the device blocks occasionally and that causes the slowdown.
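A sketch of suggestion 2 (names are illustrative; this assumes vertex data, but the same applies to other buffer types): create one dynamic buffer up front, then overwrite it each frame with Map/WRITE_DISCARD instead of calling UpdateSubresource.

    #include <d3d11.h>
    #include <cstring>

    // Create once, at startup: a dynamic vertex buffer sized for the worst case.
    ID3D11Buffer* CreateDynamicVB(ID3D11Device* dev, UINT maxBytes) {
        D3D11_BUFFER_DESC desc = {};
        desc.ByteWidth      = maxBytes;
        desc.Usage          = D3D11_USAGE_DYNAMIC;
        desc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
        desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
        ID3D11Buffer* vb = nullptr;
        dev->CreateBuffer(&desc, nullptr, &vb);
        return vb;
    }

    // Each frame: overwrite the contents; DISCARD hands back fresh memory
    // instead of stalling until the GPU is done with the old contents.
    void Upload(ID3D11DeviceContext* ctx, ID3D11Buffer* vb,
                const void* data, UINT bytes) {
        D3D11_MAPPED_SUBRESOURCE mapped;
        if (SUCCEEDED(ctx->Map(vb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped))) {
            std::memcpy(mapped.pData, data, bytes);
            ctx->Unmap(vb, 0);
        }
    }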
I am creating a GUI program that will run 24/7. I couldn't find much online on the subject, but is OpenGL stable enough to run 24/7 for weeks on end without leaks, crashes, etc?
Should I have any concerns or anything to look into before delving too deep into using OpenGL?
I know that OpenGL and DirectX are primarily used for games or other programs that aren't run for very long stretches at a time. Hopefully someone here has some experience with this or knowledge on the subject. Thanks.
EDIT: Sorry for the lack of detail. This will only be doing 2D rendering, and nothing too heavy; what I have now (which will be similar to production) already runs at a stable 900-1000 FPS on my i5 laptop with a Radeon 6850M.
Going into OpenGL just for making a GUI sounds insane. You should be worried more about what language you use if you are concerned about stuff like memory leaks. Remember that in C/C++ you manage memory on your own.
Furthermore, do you really need the GUI to be running 24/7? If you are making a service sort of application, you might as well leave it in the background and make a second application which provides the GUI. These two applications would communicate via some IPC (sockets?); see the sketch below. That's how this sort of thing usually works, not having a window open all the time.
In the end, memory leaks are not caused by some graphical library, but by the programmer writing bad code. The library should be the last item in your list of possible reasons for memory leaks/crashes.
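A minimal sketch of that service/GUI split (hypothetical; POSIX sockets on the loopback interface as the IPC channel, service side only):

    // The service does the 24/7 work headless and publishes status over a
    // loopback socket; a separate GUI process connects and renders it.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstring>

    int main() {
        signal(SIGPIPE, SIG_IGN);  // a vanished GUI must not kill the service

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5555);                    // illustrative port
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // local clients only
        bind(srv, (sockaddr*)&addr, sizeof(addr));
        listen(srv, 1);

        for (;;) {
            int gui = accept(srv, nullptr, nullptr);    // the GUI may come and go
            const char* status = "status: ok\n";
            while (gui >= 0 && write(gui, status, strlen(status)) > 0) {
                sleep(1);                               // periodic status updates
            }
            close(gui);
        }
    }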
I work for a company that makes (Windows-based) quality assurance software (machine vision) using Delphi.
The main operator screen shows the camera images at up to 20 fps (2 x 10 fps) with an OpenGL overlay, and has essentially unbounded uptime (longest uptimes close to a year; longer is hard due to power-downs for maintenance). Higher-speed cameras have their display rates throttled.
I would avoid integrated video from Intel for a while longer, though. Since the i5 it meets our minimal requirements (non-power-of-2 textures, mostly), but the initial drivers were bad, and while they have improved, there are still occasional stability and regularity problems.
I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from std::map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Cached all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4 MB into a buffer and read 16k chunks from there instead. The same goes for all write operations: they are coalesced into 4 MB blocks before being written to disk.
Ran extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it; I'm torn between CPU and memory as the culprit. CPU, because without the sleep and under different thread priorities the system's performance changes noticeably. Memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)..... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
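One cheap way to probe that hypothesis (a sketch, assuming the file is opened via the Win32 API; the filename is illustrative):

    #include <windows.h>

    // FILE_FLAG_SEQUENTIAL_SCAN tells the cache manager to aggressively reuse
    // pages behind the read position; FILE_FLAG_NO_BUFFERING (which requires
    // sector-aligned buffers, offsets, and sizes) bypasses the cache entirely.
    HANDLE h = CreateFileA(
        "input.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
        OPEN_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN,  // or FILE_FLAG_NO_BUFFERING
        nullptr);

If the system-wide stalls disappear with either flag, the page cache was the culprit.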
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of RAM (i.e. remove the disk I/O part) and see if any particular step is problematic. If you can get a profiler, then use it as well to see if there are any slow spots in your code.
EDIT
More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and can hang the system quite easily if Enters and Leaves are mismatched. Are you multi-threading at all? Is the slowdown due to race conditions between threads?
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem still occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks limited resources? (ProcExp can show you what handles are being acquired ... of course you might have difficulty with ProcExp not responding:-[)
Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS? See the sketch after this list.
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
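On the spin-count point above, a sketch (4000 is the value the MSDN documentation cites for the heap manager's own critical section):

    #include <windows.h>

    CRITICAL_SECTION g_cs;

    void InitLocks() {
        // On multiprocessor systems, a contended EnterCriticalSection can skip
        // the kernel transition by spinning briefly before blocking.
        InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    }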
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler, and even with all optimizations disabled I did not see the full-system hang that I was experiencing with the Visual Studio 2005-2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.