How to collect hardware events of ArangoDB with a profiling tool

On an Ubuntu 14.04 server (kernel 4.4.0-62-generic) with an Intel Xeon E5-2698 v4 CPU, I am trying to collect hardware event counts for ArangoDB with Intel VTune, but as soon as I start collecting, the server dies.
I suspect the reason is that ArangoDB is collecting hardware events internally itself, so I tried to turn off ArangoDB's statistics gathering:
--server.statistics value
But the result is still the same.
How can I collect hardware events of ArangoDB with a profiling tool?

Since you mention VTune and PMU (hardware) events, I assume you are using the General Exploration collection or similar.
VTune has several different collectors, and some of them (Hotspots, Advanced Hotspots) do not use the PMU. If you run those, does the server still crash? VTune can also open results collected by Linux perf record, which uses a slightly different PMU sampling method. Can you try running perf record -a -e <events list> -- sleep 30 and see whether the perf collector causes the same crash? If it does not, you can rename the result file from perf.data to perf.data.perf and then import it into VTune.
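A sketch of that workflow (the events below are only examples; substitute the PMU events you actually need):

perf record -a -e cycles,instructions,cache-misses -- sleep 30
mv perf.data perf.data.perf

The first command samples system-wide for 30 seconds; the rename lets VTune recognize the file when you import it.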

Related

How to distinguish between a CPU and a memory bottleneck?

I want to filter some data (100s of MB to a few GBs). The user can change the filters so I have to re-filter the data frequently.
The code is rather simple:
std::vector<Event*> filteredEvents;
for (size_t i = 0; i < events.size(); i++) {
    const auto ev = events[i];
    // keep the event as soon as any filter accepts it
    for (const auto& filter : filters) {
        if (filter->evaluate(ev)) {
            filteredEvents.push_back(ev);
            break;
        }
    }
    // report progress every 512 events
    if (i % 512 == 0) {
        updateProgress(i);
    }
}
I now want to add another filter. I can do that either in a way that uses more CPU or in a way that uses more memory. To decide between the two, I would like to know what the bottleneck of the above loop is.
How do I profile the code to decide whether the bottleneck is the CPU or the memory?
In case it matters, the project is written in Qt and I use Qt Creator as the IDE. The platform is Windows. I am currently using Very Sleepy to profile my code.
Intel VTune is a free graphical tool available for Windows, Linux, and macOS. If you want something that is easy to set up and straightforward to read, and you are using an Intel CPU, I'd say VTune is a good choice. It can automatically give you suggestions on where your bottleneck is (core vs. memory).
Under the hood, I believe Intel VTune collects a bunch of PMU (performance monitoring unit) counter values, LBR records, stack information, etc. On Linux, you are more than welcome to use the Linux perf tool and collect performance stats yourself. For example, using perf record and perf report in tandem tells you the hotspots of your application. But if you are interested in other metrics, for example cache-miss behavior, you have to explicitly tell perf which performance counters to collect; perf mem can address some of that need. That said, Linux perf is a lot more "hard core" than the graphical Intel VTune: you had better know which counter values to look for if you want to make good use of it. Sometimes one counter directly gives you the metric you want; other times you have to do some computation on several counter values to get your desired metric. Run perf list to appreciate in how much detail it can profile your machine and system's performance.
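For example, a minimal sketch (these are common perf event aliases, and ./myapp stands in for your binary):

perf record -g ./myapp && perf report
perf stat -e cache-references,cache-misses ./myapp
perf mem record ./myapp && perf mem report

The first pair finds hotspots with call graphs, the second reports an overall cache-miss ratio, and the third samples individual memory accesses.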

How can I detect the presence of Intel Quick Sync from c++ code?

I want to detect whether Intel Quick Sync is present and enabled on the processor. FYI, it may be disabled (powered down) if no video cable is plugged into the motherboard, or it can be disabled in the BIOS.
Thanks
Ron
There is no general solution for something like this; the code would have to be specific to the OS you are running. It would likely boil down to a system call to determine the feature set of the processor you are running on, e.g. something like cat /proc/cpuinfo if you were running Linux.
There are multiple ways to execute a system call in C++. Take a look at some of the previous answers here: How to execute a command and get output of command within C++ using POSIX?
You can port the Intel System Analyzer Utility for Linux from Python to C++. That's what I ended up doing.
That tool uses the output of cat /proc/cpuinfo and lspci to collect some information about the CPU, GPU, and software installation.
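As a rough sketch of that approach on Linux (the check below is a hypothetical heuristic: an Intel display controller in the lspci output only suggests the integrated GPU is visible, not that Quick Sync is actually enabled):

#include <cstdio>
#include <iostream>
#include <string>

// Run a command and capture its standard output via POSIX popen.
std::string runCommand(const char* cmd) {
    std::string output;
    if (FILE* pipe = popen(cmd, "r")) {
        char buf[256];
        while (fgets(buf, sizeof(buf), pipe))
            output += buf;
        pclose(pipe);
    }
    return output;
}

int main() {
    // Heuristic: look for an Intel VGA/display controller in the PCI list.
    const std::string pci = runCommand("lspci");
    const bool intelGpu = pci.find("VGA") != std::string::npos &&
                          pci.find("Intel") != std::string::npos;
    std::cout << (intelGpu ? "Intel GPU visible, Quick Sync may be available\n"
                           : "no Intel GPU found\n");
    return 0;
}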

Limiting processor count for multi-threaded applications

I am developing a multi-threaded application which ran fine on my development system, which has 8 cores. When I ran it on a PC with 2 cores, I encountered some synchronization issues.
Apart from turning off hyper-threading, is there any way of limiting the number of cores an application can use, so that I can emulate single- and dual-core environments for testing and debugging?
My application is written in C++ using Visual Studio 2010.
We always test in virtual machines nowadays since it's so easy to set up specific environments with given limitations.
For example, VMWare easily allows you to limit the number of processors in use, how much memory there is, hard disk sizes, the presence of USB or floppies or printers and all sorts of other wondrous things.
In fact, we have scripts which do all the work at the push of a button, from restoring the VM to a known initial state, then booting it up, installing the code over the network, running a test cycle then moving the results to an analysis machine on the network as well.
It greatly speeds up and simplifies the testing regime.
You want the SetProcessAffinityMask function or the SetThreadAffinityMask function.
The former works on the whole process and the latter on a specific thread.
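A minimal sketch of the process-wide variant (Windows only; the mask 0x3 pins the process to the first two logical processors):

#include <windows.h>
#include <iostream>

int main() {
    // Each bit selects a logical processor: 0x3 = CPUs 0 and 1 only.
    const DWORD_PTR mask = 0x3;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        std::cerr << "SetProcessAffinityMask failed: " << GetLastError() << "\n";
        return 1;
    }
    // From here on, all threads of this process run on at most two cores.
    // ... start the multi-threaded workload under test ...
    return 0;
}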
You can also limit the active cores via the Windows Task Manager: right-click the process name and select "Set Affinity".

How can I run code directly on a processor with a file system?

I have simple anisotropic filter C/C++ code that processes a .pgm image, which is a text file with greyscale information for each pixel, and after processing it generates an output image with the filter applied.
This program takes up to a few seconds to do about 10 iterations on an x86 CPU running Windows.
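For reference, a minimal sketch of reading such a plain-text (P2) PGM, assuming the header contains no comment lines:

#include <fstream>
#include <string>
#include <vector>

// Read a plain-text (P2) PGM: magic number, width, height, maxval, pixels.
std::vector<int> readPgm(const std::string& path, int& width, int& height) {
    std::ifstream in(path);
    std::string magic;
    int maxval = 0;
    in >> magic >> width >> height >> maxval;  // assumes no '#' comments
    std::vector<int> pixels(width * height);
    for (int& p : pixels)
        in >> p;                               // one greyscale value per pixel
    return pixels;
}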
An academic finishing his master's degree in applied computing and I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running the code directly on the soft processor (Nios II).
We have successfully booted the OS uClinux on the FPGA, but there are some errors with the device hardware, so we can access neither the SD card nor Ethernet, which means we cannot get the code and image onto the FPGA to test its performance.
So I am asking here for an alternative way to test our code's performance directly on a CPU with a file system, so the code can read the image and generate another one.
The alternative can be either a product that is low-cost and easy to use (I was thinking of a Raspberry Pi), or some place where I could upload the code so that it runs automatically and gives me the reports.
Thanks in advance.
So what you are trying to do is benchmark some software on a multi-GHz x86 processor vs. a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux), it also has threading overhead and so on. This cannot be considered running it "directly" on the CPU (whatever you mean by that).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board, but Xilinx has libraries for accessing data on an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must include a memory controller. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II Build Tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP.
Once the project is created, go to the BSP properties, enter the offset of the external memory in the Linker tab, and generate the BSP.
Compile the project and run it as Nios II hardware.
This will run your application out of the external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed on the console, as in the sketch below.
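For example, a trivial sketch of dumping the array (width, height, and the flattened image buffer are assumed to already exist in your application):

#include <cstdio>

// Print a width x height greyscale image as rows of values on the console.
void printImage(const int* image, int width, int height) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            std::printf("%3d ", image[y * width + x]);
        std::printf("\n");
    }
}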

Profiling OpenCL kernels

I am trying to optimize my OpenCL kernels, and all I have right now is the NVIDIA Visual Profiler, which seems rather constrained. I would like to see a line-by-line profile of my kernels to better understand issues with coalescing, etc. Is there a way to get more thorough profiling data than what the Visual Profiler provides?
I think that AMD CodeXL is what you are looking for. It's a free set of tools that contains an OpenCL debugger and a GPU profiler.
The OpenCL debugger allows you to do line-by-line debugging of your OpenCL kernels and host code, view all variables across different workgroups, view special events and errors that occur, and so on.
The GPU profiler has a nice feature that generates a timeline displaying how long your program spends on tasks like data transfer and kernel execution.
For more info and download links, check out http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
No, there is no such tool, but you can profile your code changes: measure the speed of your code, change something, and then measure it again. clEnqueueNDRangeKernel takes an event argument which can be used with clGetEventProfilingInfo afterwards; the timer is very sharp, with timestamps reported in nanoseconds. This is the only way to measure the performance of a separate part of the code...
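A minimal sketch of that approach (hypothetical helper; the command queue must have been created with CL_QUEUE_PROFILING_ENABLE for the timestamps to be valid):

#include <CL/cl.h>

// Time one kernel launch using the event's start/end timestamps.
double timeKernelMs(cl_command_queue queue, cl_kernel kernel,
                    size_t global, size_t local) {
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    clReleaseEvent(ev);
    return (end - start) * 1e-6;  // timestamps are in nanoseconds
}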
I haven't tested it, but I just found this program: http://www.gremedy.com/gDEBuggerCL.php
The description is: "This new product brings gDEBugger's advanced Debugging, Profiling and Memory Analysis abilities to the OpenCL developer's world..."
LTPV is an open-source OpenCL profiler that may fit your requirements. For now, it works only under Linux.
(disclosure: I am the developer of this tool)