How to distinguish between a CPU and a memory bottleneck? - c++

I want to filter some data (100s of MB to a few GBs). The user can change the filters so I have to re-filter the data frequently.
The code is rather simple:
std::vector<Event*> filteredEvents;
for (size_t i = 0; i < events.size(); i++) {
    const auto ev = events[i];
    // Keep the event as soon as any one filter accepts it.
    for (const auto& filter : filters) {
        if (filter->evaluate(ev)) {
            filteredEvents.push_back(ev);
            break;
        }
    }
    // Report progress every 512 events.
    if (i % 512 == 0) {
        updateProgress(i);
    }
}
I now want to add another filter. I can do that either in a way that uses more CPU or that uses more memory. To decide between the two I would like to know what the bottleneck of the above loop is.
How do I profile the code to decide whether the bottleneck is the CPU or the memory?
In case it matters, the project is written in Qt and I use Qt Creator as the IDE. The platform is Windows. I am currently using Very Sleepy to profile my code.

Intel VTune is a free graphical tool available for Windows, Linux, and macOS. If you want something that is easy to set up and straightforward to read, and you are using an Intel CPU, I'd say VTune is a good choice. It can automatically suggest where your bottleneck is (core vs. memory).
Under the hood, I believe Intel VTune collects a number of PMU (performance monitoring unit) counter values, LBR (last branch record) data, stack information, and so on. On Linux, you can use the Linux perf tool to collect performance statistics yourself. For example, using perf record and perf report in tandem shows you the hotspots of your application. If you are concerned with other metrics, such as cache-miss behavior, you have to tell perf explicitly which performance counters to collect; perf mem addresses some of that need. After all, though, Linux perf is a lot more "hard core" than the graphical Intel VTune: you had better know which counter values to look for if you want to make good use of it. Sometimes a single counter directly gives you the metric you want; other times you have to compute the metric from several counter values. Run perf list to see how detailed a profile of your machine and system it can produce.
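If no profiler with hardware counters is at hand, a rough first check is to compare the filter pass against a pass that only streams through the same data. The following is a minimal sketch of that idea, not a substitute for VTune or perf. The Event layout and the names secondsFor, streamOnly, filterPass, and filterToStreamRatio are hypothetical, and the events are stored by value here rather than as the question's Event* pointers.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the question's Event type.
struct Event {
    std::uint32_t type;
    std::uint32_t value;
};

// Run a callable once and return the elapsed wall-clock time in seconds.
template <typename F>
double secondsFor(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Baseline: touch every event while doing almost no computation.
std::uint64_t streamOnly(const std::vector<Event>& events) {
    std::uint64_t sum = 0;
    for (const Event& ev : events) {
        sum += ev.value;
    }
    return sum;
}

// Stand-in for the real filter pass: count the events a predicate accepts.
template <typename Pred>
std::size_t filterPass(const std::vector<Event>& events, Pred accept) {
    std::size_t kept = 0;
    for (const Event& ev : events) {
        if (accept(ev)) {
            ++kept;
        }
    }
    return kept;
}

// Time of the filter pass divided by the time of the streaming pass.
// Returns 0.0 if the streaming pass was too short for the clock to measure.
template <typename Pred>
double filterToStreamRatio(const std::vector<Event>& events, Pred accept) {
    assert(!events.empty() && "need some data to measure");
    volatile std::uint64_t sink = 0;  // keeps the compiler from discarding the work
    const double streamSec = secondsFor([&] { sink = streamOnly(events); });
    const double filterSec = secondsFor([&] { sink = filterPass(events, accept); });
    return streamSec > 0.0 ? filterSec / streamSec : 0.0;
}
```

A ratio close to 1 hints that the loop spends most of its time waiting for memory, so a filter that costs extra CPU may be nearly free. A ratio well above 1 hints that the evaluation itself dominates. The first pass also warms the caches, so the comparison is only meaningful for data much larger than the last-level cache, and it is worth repeating the measurement a few times.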

Related

How to collect hardware events of ArangoDB with profiling tool

On an Ubuntu 14.04 server (4.4.0-62-generic) with an Intel Xeon CPU E5-2698 v4,
I am trying to collect hardware event counts for ArangoDB with Intel VTune.
But as soon as I start collecting, the server dies.
I think the reason is that ArangoDB collects hardware events internally.
So I tried to turn off ArangoDB's statistics gathering:
--server.statistics value
But the result is still the same.
How can I collect hardware events of ArangoDB with a profiling tool?
Since you mention VTune and the PMU (hardware events), I assume you are using the General Exploration collection or the like.
VTune has several different collectors, and some of them (hotspots, advanced hotspots) do not use the PMU. If you run those, does it still crash? Also, VTune can now open results collected by Linux perf record, which uses a slightly different PMU sampling method. Can you try running perf record -a -e ..(events list)... sleep 30; and see whether the perf collector causes the same crash? If it does not, you can rename the result file from perf.data to perf.data.perf and then import it into VTune.

How can I run a code directly into a processor with a File System?

I have simple anisotropic filter C/C++ code that processes a .pgm image, which is a text file with greyscale information for each pixel, and, once processing is done, generates an output image with the filter applied.
The program takes up to a few seconds to do about 10 iterations on an x86 CPU running Windows.
An academic who is finishing his master's degree in applied computing and I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running the code directly on the processor (NIOS 2).
We have successfully booted the uClinux OS on the FPGA, but there are some errors with the device hardware, so we can access neither the SD card nor Ethernet and cannot get the code and image onto the FPGA to test its performance.
So I am asking for an alternative way to test our code's performance directly on a CPU with a file system, so that the code can read the image and generate another one.
The alternative could be either a low-cost, easy-to-use product (I was thinking of a Raspberry Pi) or a service where I could upload the code and have it run automatically and give me the reports.
Thanks in advance.
So what you're trying to do is benchmark some software on a multi-GHz x86 processor against a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux), it also has threading overhead and whatnot. This cannot be considered running it "directly" on the CPU (whatever you mean by this).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board, but Xilinx has some libraries for accessing data on an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must include a memory controller. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II build tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP project.
Once the project is created, go to the BSP properties, type the offset of the external memory in the linker tab, and generate the BSP.
Compile the project and run it as Nios II hardware.
This will run your application from external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed on the console.
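Printing the array in the same plain-text PGM layout the question describes makes the console output easy to copy back into a file and view on a PC. This is a minimal sketch under stated assumptions: writePgmText and pgmTextOf are hypothetical helper names, the image is assumed to be a row-major buffer of 8-bit pixels, and nothing here is specific to the Nios II toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write a greyscale image as plain-text PGM ("P2") to any output stream.
// Assumes `pixels` holds `width * height` 8-bit values in row-major order.
void writePgmText(std::ostream& out, const std::vector<std::uint8_t>& pixels,
                  std::size_t width, std::size_t height) {
    assert(pixels.size() == width * height && "pixel buffer must match dimensions");
    // Header: magic number, dimensions, maximum grey value.
    out << "P2\n" << width << ' ' << height << "\n255\n";
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            if (col != 0) {
                out << ' ';
            }
            // Cast so the value prints as a number, not as a character.
            out << static_cast<unsigned>(pixels[row * width + col]);
        }
        out << '\n';
    }
}

// Convenience wrapper that returns the PGM text as a string.
std::string pgmTextOf(const std::vector<std::uint8_t>& pixels,
                      std::size_t width, std::size_t height) {
    std::ostringstream text;
    writePgmText(text, pixels, width, height);
    return text.str();
}
```

Calling writePgmText(std::cout, pixels, width, height) (with <iostream> included) sends the image to the console; the captured text can then be saved with a .pgm extension.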

How can I get the number of logical CPUs on WinRT?

I'm trying to compile Boost 1.49.0 for WinRT. I've got it down to one method: GetSystemInfo(), which is used in boost::thread::hardware_concurrency() to obtain the number of logical processors on the system.
I haven't found any replacement in WinRT yet.
Is there an alternative method I could use?
You can call the Windows API function GetNativeSystemInfo, which is permitted in Metro style apps.
There doesn't seem to be a simple way to get this information in WinRT. If you just want to know the processor architecture then you can use Windows.System.ProcessorArchitecture but this will not tell you how many logical CPUs are available. Windows.System.Threading doesn't tell you this information either.
To get information about the physical CPU I've found this question on the MSDN forum which suggests that we can use DeviceEnumeration to get to this information. By using the GUID for GUID_DEVICE_PROCESSOR ({97FADB10-4E33-40AE-359C-8BEF029DBDD0}) you can enumerate over all processors.
In JavaScript this should look something like this (for a C++ example, see the Device Enumeration example on MSDN):
Windows.Devices.Enumeration.DeviceInformation.findAllAsync('"System.Devices.InterfaceClassGuid:="{97FADB10-4E33-40AE-359C-8BEF029DBDD0}""')
    .then(function (info) {
        for (var i = 0; i < info.length; i++) {
            var device = info[i];
        }
    });
On my machine this gives me all sorts of devices (sound card, PCI and USB processors), so I'm not sure whether there is a better way to get just the CPU, but I did get the information about which CPU I have:
"Intel(R) Core(TM) i7 CPU Q 740 @ 1.73GHz"
Unfortunately this info doesn't seem to include a simple flag that tells you the number of CPUs and therefore I think it would be difficult to get to a number of logical CPUs. I suggest you ask on the MSDN forum. They are usually quite responsive.
System.Environment.ProcessorCount should give you the number of cores.

Profiling OpenCL kernels

I am trying to optimize my OpenCL kernels, and all I have right now is the NVIDIA Visual Profiler, which seems rather constrained. I would like to see a line-by-line profile of my kernels to better understand issues with coalescing, etc. Is there a way to get more thorough profiling data than what Visual Profiler provides?
I think that AMD CodeXL is what you are looking for. It's a free set of tools that contains an OpenCL debugger and a GPU profiler.
The OpenCL debugger allows you to do line-by-line debugging of your OpenCL kernels and host code, view all variables across different workgroups, view special events and errors that occur, etc.
The GPU profiler has a nice feature that generates a timeline displaying how long your program spends on tasks like data transfer and kernel execution.
For more info and download links, check out http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
No, there is no such tool, but you can profile your code changes. Try measuring the speed of your code, change something, and then measure it once again. clEnqueueNDRangeKernel has an event argument that can be used with clGetEventProfilingInfo afterwards; the timer is quite precise, with accuracy on the order of microseconds. This is the only way to measure the performance of a separate part of the code...
I haven't tested it, but I just found this program: http://www.gremedy.com/gDEBuggerCL.php.
The description is: "This new product brings gDEBugger's advanced Debugging, Profiling and Memory Analysis abilities to the OpenCL developer's world..."
LTPV is an open-source, OpenCL profiler that may fit your requirements. It is, for now, only working under Linux.
(disclosure: I am the developer of this tool)

Determining CPU usage in WinCE

I want to be able to get the current % CPU usage in a C++ program running under WinCE.
I found this link that states where the source code is but I cannot find it in my platform builder installation - I expect this is because it isn't the Windows Automotive platform.
Does anyone know where I can find this source code or (even better) know how I can get this information directly? i.e. what DLL / function calls to make etc.
Since GetProcessTimes doesn't exist in CE, you have to calculate this.
You have to start with the toolhelp APIs to enumerate the processes and the threads in the processes. You then call GetThreadTimes for each thread and add all that up.
Bear in mind that the act of calculating this info will affect the CPU utilization.
I have found that GetIdleTime (or CeGetIdleTimeEx on WEC7 or newer) works well for calculating system-wide processor usage. Sample code for calculating the processor idle time percentage is shown on the GetIdleTime MSDN page. The processor utilization percentage can then be calculated by subtracting the idle time percentage from 100.
The MSDN page does warn that support for GetIdleTime depends on the OAL implementation.
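The arithmetic behind the idle-time approach can be sketched as a small portable function. This is an illustration under stated assumptions, not the MSDN sample: IdleSample and cpuUsagePercent are hypothetical names, and on Windows CE the two fields would be filled from GetTickCount() and GetIdleTime(), both of which report milliseconds as 32-bit values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One reading of the two counters, both in milliseconds.
// On Windows CE these would come from GetTickCount() and GetIdleTime().
struct IdleSample {
    std::uint32_t tickMs;  // time elapsed since boot
    std::uint32_t idleMs;  // time the processor has spent idle
};

// Percentage of time the processor was busy between two samples.
double cpuUsagePercent(const IdleSample& before, const IdleSample& after) {
    // Unsigned subtraction stays correct across a single 32-bit wrap-around.
    const std::uint32_t elapsedMs = after.tickMs - before.tickMs;
    const std::uint32_t idleMs = after.idleMs - before.idleMs;
    assert(elapsedMs > 0 && "the two samples must be taken at different times");
    const double idlePercent = 100.0 * idleMs / elapsedMs;
    // Clamp, because the two counters are not read at exactly the same instant.
    return std::min(100.0, std::max(0.0, 100.0 - idlePercent));
}
```

Taking the two samples about a second apart, for example around a Sleep(1000) call, gives a reading that is reasonably stable without adding much load of its own.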
Note that when using the toolhelp APIs to calculate the CPU usage, you need to take two measurements and then calculate the difference. When doing so, you won't know how much CPU time was used by any threads that terminated before the second sample.
So applications that often create short-lived threads will not be represented properly in your result.
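The two-measurement calculation, and the blind spot it has for exited threads, can be illustrated with plain data. This is a sketch under stated assumptions: ThreadTimes and cpuPercentFromThreadTimes are hypothetical names, the per-thread totals are assumed to have been gathered already (for example from GetThreadTimes while walking a toolhelp snapshot) and converted to milliseconds, and a single processor is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Thread ID -> cumulative CPU time in milliseconds (kernel + user),
// e.g. converted from the values that GetThreadTimes reports.
using ThreadTimes = std::map<std::uint32_t, std::uint64_t>;

// CPU usage in percent between two snapshots taken `elapsedMs` apart.
// Assumes a single processor.
double cpuPercentFromThreadTimes(const ThreadTimes& first,
                                 const ThreadTimes& second,
                                 std::uint64_t elapsedMs) {
    assert(elapsedMs > 0 && "the two snapshots must be taken at different times");
    std::uint64_t busyMs = 0;
    for (const auto& entry : second) {
        const auto earlier = first.find(entry.first);
        // A thread that is new in the second snapshot started from zero.
        const std::uint64_t before =
            (earlier != first.end()) ? earlier->second : 0;
        // If a thread ID was reused, the counter may appear to go backwards;
        // treat that case as a new thread.
        busyMs += (entry.second >= before) ? entry.second - before : entry.second;
    }
    // Threads present only in `first` exited between the snapshots. Their
    // final CPU time is unknown, so their work is missing from the result.
    return 100.0 * static_cast<double>(busyMs) / static_cast<double>(elapsedMs);
}
```

In the function above, any thread that appears only in the first snapshot contributes nothing, which is exactly why workloads with many short-lived threads are under-reported.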
You can look into Remote Task Monitor. It lets you get the current % CPU usage of your process (or thread), which is exactly what you are looking for. It is also very lightweight and does not affect your device much.