Profiling OpenCL kernels - profiling

I am trying to optimize my OpenCL kernels and all I have right now is NVidia Visual Profiler, which seems rather limited. I would like to see a line-by-line profile of my kernels to better understand issues with coalescing, etc. Is there a way to get more thorough profiling data than what Visual Profiler provides?

I think that AMD CodeXL is what you are looking for. It's a free set of tools that contains an OpenCL debugger and a GPU profiler.
The OpenCL debugger allows you to do line-by-line debugging of your OpenCL kernels and host code, view all variables across different workgroups, view special events and errors that occur, etc.
The GPU profiler has a nice feature that generates a timeline displaying how long your program spends on tasks like data transfer and kernel execution.
For more info and download links, check out http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

No, there is no such tool, but you can profile your code changes. Measure the speed of your code, change something, and then measure it again. clEnqueueNDRangeKernel has an event argument that can be passed to clGetEventProfilingInfo afterwards; the timer is fine-grained, with accuracy on the order of microseconds. This is the only way to measure the performance of a separate part of the code...
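As a rough sketch of the measure, change, measure again loop described above: the two timestamps that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are device counters in nanoseconds, and they are only recorded if the command queue was created with CL_QUEUE_PROFILING_ENABLE. The Python helpers below (hypothetical names, with made-up timestamp values standing in for real readings) only do the arithmetic on such a pair so that two kernel variants can be compared.

```python
def elapsed_ms(start_ns, end_ns):
    """Convert a (start, end) pair of OpenCL profiling timestamps to milliseconds."""
    if end_ns < start_ns:
        raise ValueError("end timestamp precedes start timestamp")
    return (end_ns - start_ns) / 1e6


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline (>1 means faster)."""
    return baseline_ms / candidate_ms


# Illustrative values only, not real measurements
before = elapsed_ms(1_000_000, 5_000_000)  # kernel before the change
after = elapsed_ms(2_000_000, 4_000_000)   # kernel after the change
print(f"before: {before:.3f} ms, after: {after:.3f} ms, "
      f"speedup: {speedup(before, after):.2f}x")
```

In practice it is probably worth timing each variant several times and comparing typical values, since a single run can be noisy.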

I haven't tested it, but I just found this program: http://www.gremedy.com/gDEBuggerCL.php
The description is: "This new product brings gDEBugger's advanced Debugging, Profiling and Memory Analysis abilities to the OpenCL developer's world..."

LTPV is an open-source OpenCL profiler that may fit your requirements. For now, it only works under Linux.
(disclosure: I am the developer of this tool)

Related

gdi32.dll calls from nvoglv64.dll

I'm trying to profile a 64-bit OpenGL application using the MSVS 2013 profiler (CPU sampling). According to Sysinternals Process Explorer, my application seems to use only 60% of GPU resources but 100% of a CPU core (it's single-threaded for the time being), so the CPU code seems to be the bottleneck. I tried to figure out what the hotspots are, in order to optimize/parallelize my code.
However, the profiling results show that 98% of the time is spent in nvoglv64.dll, most notably 75% within gdi32.dll and 6% in KernelBase.dll.
I have no clue what to do with this information or which optimizations in my code could help. What conclusion can I draw from that? I'm using freeglut for windowing, and the profiler reports that a negligible 2% is spent in freeglut.dll, and thus in my idle and display functions, so I'm not sure whether any changes in my update and draw loops would have any effect.
Any hints?
EDIT:
I have now figured out how to load the corresponding debugging symbols from the MS symbol servers, so I can go one step deeper into the call stack. It turns out that the gdi32.dll portion is spent mainly in NtGdiDdDDIEscape (55%) and NtGdiDdDDIGetDeviceState (17%), while the KernelBase.dll portion is due to SwitchToThread.

Method to get GPU information for OS or OpenGL API

I have written a benchmark app (http://www.headline-benchmark.com) that rates graphics cards, but my problem is that I get the graphics card name from the OpenGL API using GL_STRING. For NVidia cards this works fine but for AMD cards I get useless naming like "R9 200 Series" which maps (currently) to four totally different graphics cards.
I've tried using the OpenCL API to get more card info (such as total number of compute units) as I can use this to disambiguate the AMD cards, but OpenCL is prone to crashing on older systems so I would rather avoid it. Is there any feature of the OpenGL API I can use that will give me more detail about the cards? Or indeed does AMD provide any kind of diagnostic command-line utilities that I could exploit?
What kind of details are you looking for?
Since you are using a benchmark utility, I would suggest using AMD's ADL API. This is roughly the same as NV's NVML API, and both will let you get the memory and GPU clocks as well as GPU load %. Be aware that if you want to use this information, you should query it while your benchmark is in full swing, because modern GPUs scale their clock rates back when idle.
AMD has also recently released a new API called AGS that is much more sophisticated than ADL and likely to give you the information you are looking for. Unfortunately I have not had a chance to work with it yet; ADL is mostly for power state management but still useful (particularly since it is cross-platform, unlike AGS).

How can I run a code directly into a processor with a File System?

I have a simple anisotropic filter written in C/C++ that processes a .pgm image, which is a text file with greyscale information for each pixel. When processing is done, it generates an output image with the filter applied.
This program takes up to a few seconds to do about 10 iterations on an x86 CPU running Windows.
An academic finishing his master's degree in applied computing and I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running the code directly on the processor (Nios II).
We have successfully booted the uClinux OS on the FPGA, but there are some errors with the device hardware, so we can't access the SD card or even Ethernet, and therefore we can't get the code and image onto the FPGA to test its performance.
So I am asking for an alternative way to test our code's performance directly on a CPU with a file system, so the code can read the image and generate another one.
The alternative can be either a low-cost, easy-to-use product (I was thinking of a Raspberry Pi) or somewhere I could upload the code that runs it automatically and gives me the reports.
Thanks in advance.
So what you're trying to do is benchmark some software on a multi-GHz x86 processor vs. a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux), it also has threading overhead and whatnot. This cannot be considered running it "directly" on the CPU (whatever you mean by this).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board, but Xilinx has some libraries for accessing data from an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must have a memory controller in it. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II build tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP project.
Once the project is created, go to the BSP properties, type the offset of the external memory in the linker tab, and generate the BSP.
Compile the project and run it as Nios II hardware.
This will run your application from external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed on the console.
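A minimal sketch of that last step, assuming the image is a plain-text PGM (the "P2" variant, which matches the text format described in the question): dump the 2-D array as PGM text on the console, capture that output on the host, and parse it back into an array to check the result. It is written in Python for brevity rather than the C that would run on the Nios II, and the function names are made up for illustration.

```python
def to_pgm(pixels, maxval=255):
    """Format a 2-D list of grey values as plain (P2) PGM text."""
    height, width = len(pixels), len(pixels[0])
    lines = ["P2", f"{width} {height}", str(maxval)]
    lines += [" ".join(str(v) for v in row) for row in pixels]
    return "\n".join(lines) + "\n"


def parse_pgm(text):
    """Parse plain (P2) PGM text back into (pixels, maxval)."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())  # drop '#' comments
    if not tokens or tokens[0] != "P2":
        raise ValueError("not a plain-text (P2) PGM")
    width, height, maxval = (int(t) for t in tokens[1:4])
    values = [int(t) for t in tokens[4:]]
    if len(values) != width * height:
        raise ValueError("pixel count does not match the header")
    pixels = [values[r * width:(r + 1) * width] for r in range(height)]
    return pixels, maxval


# Tiny made-up 3x2 image, printed the way it could be captured from a console
image = [[0, 128, 255], [64, 32, 16]]
print(to_pgm(image), end="")
```

Saving the captured console text to a file with a .pgm extension should give something an ordinary image viewer can open, which sidesteps the missing SD card and Ethernet access for small test images.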

Get GPU temperature

I am really puzzled here. I want to create an application that triggers different events at different temperatures of my graphics card, which is an AMD one.
I want to make such an application because I haven't found one for a GPU, and also to ensure I never fry my card by reaching enormous temps.
However, I have no idea how people (not connected to AMD/Intel/NVIDIA) write applications to monitor temperatures of any kind.
So how is it done? Are there some APIs I don't know about?
After a little bit of googling, I found this:
I think this is really vendor specific, it will probably involve interfacing directly with the motherboard or video driver and knowing which IOCTL represents the code for requesting the temperature. I reverse engineered a motherboard driver once for this purpose. It's not as hard as it sounds, download a manufacturer motherboard/BIOS utility and try to hook the function that gets called when that app needs to display the temperature to the user. Then watch for a call to DeviceIoControl() in Windows, or ioctl() in linux and see what the inputs / outputs are.
This may be your best bet. I found this information here:
http://www.gamedev.net/topic/557599-get-gpucpu-temperature/
Edit:
Also found this:
http://msdn.microsoft.com/en-us/library/aa389762%28v=VS.85%29.aspx
http://msdn.microsoft.com/en-us/library/aa394493%28VS.85%29.aspx
Hope it helps.
You could use one of the existing GPU temperature monitoring programs, such as GPU-Z, configure it for continuous monitoring, and read the log entries.
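A rough sketch of the log-reading approach, assuming the monitoring tool writes a comma-separated log with a header row and one column whose name contains "Temperature". The exact column names and layout vary by tool and version, so the sample text, the thresholds, and the action names below are all made up for illustration.

```python
import csv
import io

# Hypothetical thresholds in degrees Celsius, highest first
THRESHOLDS = [(90.0, "shutdown"), (80.0, "warn"), (0.0, "ok")]


def latest_temperature(log_text, column_hint="Temperature"):
    """Return the temperature from the last data row of a CSV-style sensor log."""
    rows = [r for r in csv.reader(io.StringIO(log_text)) if any(c.strip() for c in r)]
    if len(rows) < 2:
        raise ValueError("log has no data rows yet")
    header = [c.strip() for c in rows[0]]
    # First column whose name mentions the hint (raises StopIteration if absent)
    index = next(i for i, name in enumerate(header) if column_hint in name)
    return float(rows[-1][index].strip())


def action_for(temp_c):
    """Pick the action for the first (highest) threshold the temperature reaches."""
    for limit, action in THRESHOLDS:
        if temp_c >= limit:
            return action
    return "ok"


# Made-up log excerpt; a real log will have different columns
sample = (
    "Date , GPU Clock [MHz] , GPU Temperature [C] ,\n"
    "2014-01-01 12:00:00 , 300.0 , 45.0 ,\n"
    "2014-01-01 12:00:01 , 1000.0 , 83.0 ,\n"
)
temp = latest_temperature(sample)
print(temp, action_for(temp))
```

A real monitor would re-read the log file on a timer and act on each new reading; the parsing would need adjusting to whatever the chosen tool actually writes.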
RivaTuner is another GPU monitoring program, which has a shared memory interface allowing other programs to get the data in real-time, but is nVidia focused. As long as your action isn't "reduce the GPU clock speed" it'll probably work well enough with ATI cards.

How to profile memory usage and performances of an openMPI program in C

I'm looking for a way to profile my OpenMPI program in C. I'm using OpenMPI 1.3 with Ubuntu Linux 9.10, and my programs run on an Intel Duo T1600.
What I want to profile is cache misses, memory usage, and execution time in any part of the program.
Thanks for any reply.
For Linux I recommend Zoom for this kind of profiling. You can get a free 30 day evaluation in order to try it out.
I finally found graphical tools for MPI profiling:
Vampir: www.vampir.eu and
ParaProf at http://www.cs.uoregon.edu/research/tau/docs/paraprof/index.html
Enjoy.
Have a look at gprof and at Intel's VTune. Valgrind with the cachegrind tool could be useful, too.
Allinea MAP is ideal for this. It will highlight poor cache performance, memory usage and execution time right down to the source lines in your code. There is no need to recompile or instrument the application in order to profile it with Allinea MAP - which makes it unusually easy to get started with. On most HPC systems and with most MPIs it takes your binary, runs it, and loads up the source code automatically to display the recorded performance data.
Take a look at MPI profiling. Some tools for this are mpiP and pgprof.