Using pixel shader to perform fast computation? [closed] - c++

Closed 8 years ago as lacking sufficient information to diagnose the problem; it is not accepting answers.
I wish to run a very simple function a lot of times.
At first I thought about inlining the function (it's only four lines long), so I figured that placing it in the header would do that automatically. gprof said that was a good idea. However, I heard that pixel shaders are optimized for that purpose, and I was wondering whether this is true. I have a simple function that takes 6 numbers and I wish to run it N times. Would a pixel shader speed things up?
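For reference, the setup being described (a small function defined in a header so the compiler can inline it at every call site) might look like the sketch below; the body is a made-up stand-in, since the real function wasn't posted:

    // compute.h - hypothetical stand-in for the real four-line function.
    // Defining it inline in a header lets every translation unit that
    // includes it see the body, so the optimizer can inline the calls.
    #pragma once

    inline float compute(float a, float b, float c, float d, float e, float f) {
        return a * b + c * d + e * f;   // made-up arithmetic
    }

Any translation unit can then call compute() in a tight loop and an optimized build (-O2) will normally inline it; whether a GPU beats this depends mostly on N and on how much data has to be shipped back and forth, as the answer below explains.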

Maybe a GPU could speed up your function, maybe not. It depends heavily on the function. GPUs are good at parallel execution: while a consumer-grade x86 CPU has 8 cores at most, a graphics card can execute far more calculations in parallel. But the bottleneck is often the transfer of data between GPU RAM and system RAM; if your function isn't actually that computationally expensive, that overhead can overshadow any gain.
In the end you can just try it, measure, and see for yourself which is faster.
You might want to take a look at OpenCL, the most widely supported standard for moving computation to the graphics card.
If you are living in Windows-land, there is also DirectCompute (part of DirectX) and C++ AMP (Accelerated Massive Parallelism). There is also CUDA, but it only supports NVIDIA GPUs.
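To make the GPU option concrete, here is a rough, untested sketch of what offloading such a function with OpenCL might look like. The kernel body is a made-up stand-in for the four-line function (which wasn't posted), and all error checking and clRelease* cleanup is omitted:

    // Rough sketch: offload a simple per-item function to the GPU with OpenCL.
    #include <CL/cl.h>
    #include <vector>

    static const char* kSource = R"CLC(
    __kernel void compute(__global const float* in, __global float* out) {
        size_t i = get_global_id(0);
        __global const float* p = in + i * 6;              // six inputs per work item
        out[i] = p[0] * p[1] + p[2] * p[3] + p[4] * p[5];  // placeholder math
    }
    )CLC";

    // 'in' holds 6 floats per item, 'out' receives one float per item.
    void run_on_gpu(const std::vector<float>& in, std::vector<float>& out) {
        cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        cl_context       ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel kernel = clCreateKernel(prog, "compute", nullptr);

        size_t n = out.size();
        cl_mem dIn  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     in.size() * sizeof(float), (void*)in.data(), nullptr);
        cl_mem dOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dIn);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &dOut);
        clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dOut, CL_TRUE, 0, n * sizeof(float), out.data(), 0, nullptr, nullptr);
    }

Link with -lOpenCL. Note that the two buffer transfers are exactly the overhead mentioned above; for a cheap function they can easily cost more than the computation itself.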

Related

OpenGL big 3D texture (>2GB) is very slow [closed]

Closed 3 years ago as not being about a specific programming problem, a software algorithm, or software tools primarily used by programmers; it is not accepting answers.
My graphics card is a GTX 1080 Ti. I want to use an OpenGL 3D texture. The pixel (voxel) format is GL_R32F. OpenGL did not report any errors when I initialized the texture and rendered with it.
When the 3D texture was small (512x512x512), my program ran fast (~500FPS).
However, when I increased the size to 1024x1024x1024 (4 GB), the FPS dropped dramatically to less than 1 FPS. When I monitored GPU memory usage, it did not exceed 3 GB, even though the texture alone is 4 GB and I have 11 GB in total.
When I changed the pixel format to GL_R16F, it worked again: the FPS went back to ~500 and GPU memory consumption was about 6.2 GB.
My hypothesis is that the 4 GB 3D texture is not really in GPU memory but in CPU memory instead, and that every frame the driver transfers this data from CPU memory to GPU memory again and again, which slows down performance.
My first question is: is my hypothesis correct? If it is, why does this happen even though I have plenty of GPU memory? And how do I force OpenGL data to reside in GPU memory?
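The allocation being described presumably looks something like this (a sketch; the question doesn't include the actual code, so the parameters are guesses):

    // Sketch: immutable-storage allocation of a 1024^3 GL_R32F 3D texture
    // (1024*1024*1024 texels * 4 bytes = 4 GiB; GL_R16F would need 2 GiB).
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_3D, tex);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexStorage3D(GL_TEXTURE_3D, 1, GL_R32F, 1024, 1024, 1024);   // GL 4.2+
    // voxel data would then be uploaded with glTexSubImage3D(..., GL_RED, GL_FLOAT, data)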
Is my hypothesis correct?
It is not implausible, at least.
If it is, why does this happen even though I have plenty of GPU memory?
That's something for your OpenGL implementation to decide. It might also be a driver bug, or some internal limit.
How do I force OpenGL data to reside in GPU memory?
You can't. OpenGL has no concept of video RAM, system RAM, or even a GPU. You specify your buffers, textures, and other objects and make the draw calls, and it is the GL implementation's job to map this onto the actual hardware. There are no performance guarantees whatsoever - you might hit a slow path or even a fallback to software rendering when you do certain things (the latter is really uncommon these days, but conceptually it is entirely possible).
If you want control over where data is placed, when it is actually transferred, and so on, you have to use a lower-level API such as Vulkan.
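That said, if you only want to observe (not control) what is happening on NVIDIA hardware, there is a vendor extension for it. A minimal sketch, assuming the driver exposes GL_NVX_gpu_memory_info (values are reported in KiB):

    // NVIDIA-specific query; check for "GL_NVX_gpu_memory_info" in the
    // extension string before relying on it.
    #ifndef GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX
    #define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
    #endif

    GLint availableKiB = 0;
    glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &availableKiB);
    // Querying this before and after creating the texture shows how much of it
    // actually ended up in dedicated video memory.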

CPU utilization degradation over time [closed]

Closed 7 years ago as needing to be more focused; it is not accepting answers.
I have a multi-threaded process. Each thread is CPU-bound (it performs calculations) and also uses a lot of memory. The process starts at 100% CPU utilization according to Resource Monitor, but after several hours CPU utilization slowly starts to degrade. After 24 hours it's at 90-95% and falling.
The question is: what should I look for, and which best-known methods can I use to debug this?
Additional info:
I have enough RAM - most of it is unused at any given moment.
According to perfmon - memory doesn't grow (so I don't think it's leaking).
The code is a mix of .NET and native C++, with some data marshaling back and forth.
I saw this on several different machines (servers with 24 logical cores).
One thing I saw in perfmon - Modified Page List Bytes indicator increases over time as CPU utilization degrades.
Edit 1
One of the third-party libraries in use is OpenFst. It looks like the problem is closely related to a misuse of that library.
Specifically, I noticed that I have the following warnings:
warning LNK4087: CONSTANT keyword is obsolete; use DATA
Edit 2
Since the question is closed and wasn't reopened, I will write up my findings and how the issue was solved in the body of the question (sorry) for future readers.
It turns out there is an openfst.def file that defines all of the OpenFst FLAGS_* symbols exported to consuming applications/DLLs. I had to change those exports to use the keyword "DATA" instead of "CONSTANT" (CONSTANT is obsolete because it's risky - more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx).
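Roughly speaking, the change in openfst.def is of this form (only the one flag mentioned below is shown; the real file exports many FLAGS_* symbols):

    EXPORTS
        ; before (obsolete, risky): FLAGS_fst_default_cache_gc CONSTANT
        FLAGS_fst_default_cache_gc DATA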
After that, no more degradation in CPU utilization was observed, and no more rise in the "Modified Page List Bytes" indicator. I suspect the problem was related to the default values of the FLAGS (specifically the garbage-collection flag FLAGS_fst_default_cache_gc), which were non-deterministic because of the misuse of the CONSTANT keyword in the openfst.def file.
Conclusion: understand your warnings! Eliminate as many of them as you can!
Thanks.
For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I'm familiar with use kernel-supplied statistics rather than the underlying hardware counters. This is especially true on Windows. (The reason is partly legacy and partly that Windows wants its kernel statistics to be independent of hardware. The PAPI APIs attempt to address this but are still relatively new.)
One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.
You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.
As to what might be the issue, a degradation that slow is strange.
How often do you allocate new memory or access old memory? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages.
What are your memory access patterns? Do they change over time? How rapidly? Perhaps your memory access patterns are spreading out over time, resulting in more cache misses.
Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
Look at whether there are periodic maintenance activities that take place over a longer interval, though these would result in a periodic degradation, say every 24 hours. That doesn't sound like your situation, since what you are experiencing is a gradual degradation.
If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").
Let us know what you ultimately find out.

C/C++ timer, like Swiss watches [closed]

Closed 9 years ago as a request to recommend a tool, library, or off-site resource; it is not accepting answers.
Can someone provide a list of timers for C/C++ that provide god-level accuracy?
If, for example, I take 100 computers and start the program at the same microsecond on all of them, I want the timers to display the same time on all computers (with different CPUs and different CPU loads) after a year of continuous running.
Platform: Linux
Accuracy: 1 second, but the timer must be EXACTLY 1 second, not 1 second plus 1/1000000. That extra 1/1000000 is not acceptable. In other words, over one year of running, not even a second of lost accuracy is acceptable.
The timer must not need extra hardware.
Q1: What's the best timer mankind has made that is also free (chrono, setitimer, one of the many Boost timers, or something else)?
Q2: Using this best timer, what kind of accuracy can I expect when using an Ivy Bridge CPU?
The best timing accuracy you can get is by synchronizing your program with an atomic clock device, like the USNO Master Clock.
/sarcasm off
To give a few hints:
The C++ standard doesn't guarantee anything beyond millisecond accuracy, and even that might end up with tenths of a millisecond of jitter (depending on the OS).
Your hardware timers might provide better accuracy, but your drivers/applications can still introduce unwanted latencies.
If you really want to get near the precision of your requirements, don't forget to compensate for relativistic effects such as the altitude and speed of the measurement equipment.
Q2: Using this best timer, what kind of accuracy can I expect when using an Ivy Bridge CPU?
Multiples of nanoseconds, I'd guess, if done right (forget about the atomic clock joke when going in this direction).
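To make the hints concrete: on Linux, without extra hardware, a periodic tick is usually built on a monotonic timer, and keeping 100 machines within a second per year is then a clock-synchronization problem (NTP/PTP), not a timer problem - no free-running PC oscillator holds that on its own. A rough sketch of the timer part using timerfd, assuming a 1-second period:

    // Rough sketch: a 1 Hz tick on Linux using timerfd and CLOCK_MONOTONIC.
    #include <sys/timerfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        int fd = timerfd_create(CLOCK_MONOTONIC, 0);  // immune to wall-clock jumps
        itimerspec spec{};
        spec.it_value.tv_sec    = 1;   // first expiration after 1 s
        spec.it_interval.tv_sec = 1;   // then every 1 s
        timerfd_settime(fd, 0, &spec, nullptr);

        for (int tick = 0; tick < 10; ++tick) {
            uint64_t expirations = 0;
            // read() blocks until the timer fires and returns the number of
            // expirations since the last read, so slow iterations don't drift.
            read(fd, &expirations, sizeof(expirations));
            std::printf("tick %d (expirations = %llu)\n", tick,
                        (unsigned long long)expirations);
        }
        close(fd);
    }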

Making C++ run at full speed [closed]

Closed 9 years ago for not describing the specific problem, with valid code to reproduce it, in the question itself; it is not accepting answers.
To compare C++ and Java on certain tasks, I made two similar programs, one in Java and one in C++. When I run the Java one, it takes 25% CPU without fluctuation, which is what you would expect since I'm using a quad core. However, the C++ version only uses about 8% and fluctuates heavily. I run both programs on the same computer, on the same OS, with the same programs active in the background. How do I make the C++ program use one full core? Both programs run uninterrupted: they ask for some info and then enter an infinite loop until you exit, giving feedback on how many calculations per second they perform.
The code:
http://pastebin.com/5rNuR9wA
http://pastebin.com/gzSwgBC1
http://pastebin.com/60wpcqtn
To answer some questions:
I'm basically looping a bunch of code and seeing how often per second it loops. The problem is that it doesn't use all the CPU it could. The whole point is to have the same processor do the same task in Java and C++ and compare the number of loops per second. But if one uses irregular amounts of CPU time and the other loops stably at a certain percentage, they are hard to compare. By the way, if I ask it to execute this:
while(true){}
it takes 25%, so why doesn't it do that with my code?
----edit:----
After some experimenting, it seems that my code starts to use less than 25% when I use a cout statement. It isn't clear to me why a cout would cause the program to use less CPU (I guess it pauses until the statement is written, which apparently takes a while?).
With this knowledge I will reprogram both programs (to keep them comparable) and just let them report the results after 60 seconds instead of every time a loop completes.
Thanks for all the help; some of the tips were really helpful. After I discovered the answer, someone also gave it as an answer, so even if I hadn't found it myself I would have gotten there. Thanks!
(Though I would still like to know why a std::cout takes so much time.)
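For what it's worth, the usual pattern for this kind of benchmark is to keep the hot loop free of I/O and report only once per interval. A rough sketch (the actual work is a placeholder; adjust the mask to how fast the loop body is):

    #include <chrono>
    #include <cstdio>

    int main() {
        using Clock = std::chrono::steady_clock;
        auto last = Clock::now();
        unsigned long long loops = 0;

        while (true) {
            // ... the calculation being benchmarked goes here ...
            ++loops;

            if ((loops & 0xFFFFF) == 0) {           // peek at the clock only occasionally
                auto now = Clock::now();
                if (now - last >= std::chrono::seconds(60)) {
                    std::printf("%llu loops in the last interval\n", loops);
                    loops = 0;
                    last  = now;
                }
            }
        }
    }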
Your main loop has a cout in it, which will call out to the OS to write the accumulated output at some point. Either OS time is not counted against your app, or it causes some disk IO or other activity that forces your program to wait.
It's probably not accurate to compare the two while they run at the same time without considering the fact that they will compete for CPU time. The OS will automatically choose the scheduling for these two tasks, which can be affected by which one started first and a multitude of other criteria.
Running them both at the same time would require configuring the scheduling so that each one is confined to one (or two) CPUs and the two applications use different CPUs. This can be done by having each main function execute a separate thread that performs all the work, and then setting the CPU where this thread will run. In C++11 this can be done using a std::thread and then setting the underlying CPU affinity by getting the native_handle and setting the affinity on it.
I'm not sure how to do this in Java but I'm sure the process is similar.
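A sketch of the native_handle approach on Linux (on Windows the equivalent would be SetThreadAffinityMask on the native handle); the worker lambda is a placeholder:

    // Pin a std::thread to one core on Linux (compile with -pthread).
    #include <thread>
    #include <pthread.h>
    #include <sched.h>

    void pin_to_cpu(std::thread& t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // native_handle() is a pthread_t when the standard library sits on pthreads.
        pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
    }

    int main() {
        std::thread worker([] { /* benchmark loop goes here */ });
        pin_to_cpu(worker, 0);   // confine the benchmark thread to core 0
        worker.join();
    }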

System not reaching 100% CPU, how to troubleshoot [closed]

Closed 6 years ago as not being about a specific programming problem, a software algorithm, or software tools primarily used by programmers; it is not accepting answers.
I have an application (basically a C++ application) with the following properties:
Multi-threaded.
Each thread has its own thread attributes (stack size, etc.).
Multi-process (i.e. it runs as multiple processes).
Runs on an 8-core processor.
Uses shared memory, IPCs, extensive heap management (allocation/deallocation), system sleep, etc.
So now I am supposed to find the system's CAPS at maximum CPU. The ideal way is to load the system to 100% CPU and then check the CAPS (successful calls) the system supports.
I know that in complex systems, some CPU will be "dead" time spent on context switches, page swaps, I/O, etc.
But my system can only reach 95% CPU at most (no more than that, irrespective of the load). So the idea here is to find out which of these points is really contributing to the "CPU eating" and then see if I can engineer them to reduce/eliminate the unused CPU time.
Question
How do we find out which I/O, context switching, etc. is the cause of the unconquerable 5% of CPU? Is there any tool for this? I am aware of OProfile/Quantify and vmstat reports, but none of them gives this information.
There may also be operations I am not aware of that restrict maximum CPU utilization. Any link/document that could help me understand the set of operations that reduce my CPU usage would be very helpful.
Edit 1:
Added some more information
a. The OS in question is SUSE 10 Linux Server.
b. CAPS - the average number of CALLS you can run on your system per second. It is basically a telecommunications term, but it can be considered generic: assume your application provides a protocol implementation; how many protocol calls can you make per second?
"100% CPU" is a convenient engineering concept, not a mathematical absolute. There's no objective definition of what it means. For instance, time spent waiting on DRAM is often counted as CPU time, but time spent waiting on Flash is counted as I/O time. With my hardware hat on, I'd say that both Flash and DRAM are solid-state cell-organized memories, and could be treated the same.
So, in this case, your system is running at "100% CPU" for engineering purposes. The load is CPU-limited, and you can measure the Calls Per Second in this state.
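If you still want to see where the last few percent go, one low-tech complement to vmstat/OProfile is to have the process report its own counters via getrusage(2); a sketch:

    // Sketch: report the process's own context-switch and page-fault counts.
    #include <sys/resource.h>
    #include <cstdio>

    void report_rusage() {
        rusage ru{};
        getrusage(RUSAGE_SELF, &ru);
        // Voluntary switches usually mean blocking (I/O, locks, sleeps);
        // involuntary ones mean the scheduler preempted a runnable thread.
        std::printf("voluntary ctx switches:    %ld\n", ru.ru_nvcsw);
        std::printf("involuntary ctx switches:  %ld\n", ru.ru_nivcsw);
        std::printf("minor / major page faults: %ld / %ld\n", ru.ru_minflt, ru.ru_majflt);
    }

Calling this periodically (say once a minute) and watching which counters grow while the CPU sits at 95% points toward blocking I/O, lock contention, or paging as the likely culprit.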