C++ project performance boost by running another project in the background?

C++ project performance boost by running another project in the background? - c++

I'm trying to boost the performance of my C++ facial tracking algorithm by optimizing the most time-consuming sections of my code. Interestingly, 86% of the entire tracking processing time is consumed by its feature extraction section. That feature extractor takes an image as input and returns the feature vector as output. I isolated the feature extractor from the tracker into a separate project to write an optimized version from scratch and made sure the input and output format and containers are the same as the ones used in the original tracker. Here's the breakdown of the resulting process time.
Original one 3 ms/frame
Optimized version 1 ms/frame
As can be seen it's supposed to be 3 times faster. But when I insert the new feature extractor into my tracker, inside the tracker it runs at about 2.5 ms/frame. I can't understand why it gets slower inside the tracker as opposed to when it runs as a standalone project?
But here's the catch that I discovered by accident. If I run the tracker and at the same time I run another project in the background, all of a sudden, the feature extractor inside the tracker would converge to 1 ms/frame. But as soon as I stop the background project it goes back to 2.5 ms/frame. This happens both on my laptop with Ubuntu 16.04 and on my Desktop PC with Windows 10.
In an attempt to understand this sort of behavior I used my Ubuntu System Monitor and noticed the following regarding all the 8 CPU cores.
When I run the feature extractor alone 7 cores are engaged by about 5% and one of the cores is engaged by 100%.
When I insert the feature extractor in the tracker and run the tracker, all the 8 cores are engaged by about 30%.
Now when I run the tracker and along with the tracker I run any other program (let's say the standalone feature extractor), out of all the cores that are engaged by 30%, one jumps up to 100% engagement which may explain the performance boost I get in the tracker as a result.
The evidence suggests that in order for the feature extractor to run at its maximum potential of 1 ms/frame, at least one of the cores should be engaged by 100%. My question is how can I make this happen?

Related

Gauss Blur 3d image in cuda, sometimes it works sometimes it does not [duplicate]

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?

I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys

On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.

Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.

The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.

This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.

The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.

The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.

It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.

Time Consuming Tensorflow C++ Session->Run - Images for Real-time Inference

[Tensorflow (TF) on CPU]
I am using the skeleton code provided for C++ TF inference from GitHub [label_image/main.cc] in order to run a frozen model I have created in Python. This model is an FC NN with two hidden layers.
In my current project's C++ code, I run the NN's frozen classifier for each single image (8x8 pixels). For each sample, a Session->Run call takes about 0.02 seconds, which is expensive in my application, since I can have 64000 samples that I have to run.
When I send a batch of 1560 samples, the Session->Run call takes about 0.03 seconds.
Are these time measurements normal for the Session->Run Call? From the C++ end, should I send my frozen model batches of images and not single samples? From the Python end, are there optimisation tricks to alleviate that bottleneck? Is there a way to concurrently do Session-Run calls in C++?
Environment info
Operating System: Linux
Installed version of CUDA and cuDNN: N/A
What other attempted solutions have you tried?
I installed TF using the optimised instruction set for the CPU, but it does not seem to give me the huge time saving mentioned in StackOverflow
Unified the session for the Graph I created.
EDIT
It seems that MatMul is the bottleneck -- Any suggestions how to improve that?
Should I use 'optimize_for_inference.py' script for my frozen graph?
How can you measure the time in Python with high precision?
Timeline for feeding an 8x8 sample and getting the result in Python
Timeline for feeding an 8x8 batch and getting the result in Python

For the record, I have done two things that significantly increased the speed of my application:
Compiled TF to work on the optimised ISA of my machine.
Applied batching to my data samples.
Please feel free to comment here if you have questions about my answer.

Eclipse not closing previous build CPU usage 100%

I am using Eclipse for C/C++ development. I am trying to compile and run a project. When I compile and run the project after a while my CPU gets to 100% usage . I checked "Task Manager" and there I found that Eclipse isn't closing any of the previous build and it's running in the background which uses my CPU heavily. How do I solve this problem. When at 100% usage my PC becomes very very slow.

If you don't want the build to use up all your CPU time (maybe because you want to do other stuff while building) then you could decrease the parallelism of the build to a point where it leaves one or more cores unused. For example, if you have 8 cores you could configure your build to only use 6 of them.
Your build will take longer, but your machine will be more responsive for other tasks while the build runs.

Adding More RAM seems to have solved my problem. Disk usage is also low now. Maybe Since there wasnt enough RAM in my laptop the CPU was fetching data from the Disk directly which made the disk usage to go up.

Batch processing mode in Caffe - no performance gains

Following on this thread I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times bigger than originally. Which means that I did not get any performance increase.
Is that reasonable or did I make something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.

Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Though it seems relatively simple to implement base on the current implementation of GPU parallelism, at the moment there's no similar support for parallel processing on multiple CPUs.
Considering that the main problem with implementing parallelism is the syncing you need during training If you just want to process your images in parallel (as opposed to training the model), then you could load several copies of the same network to memory (whether through python with multiprocessing or c++ with multi-threading), and process each image on a different network. It would be simple and quite effective, especially if you load the networks once and then just process a large amount of images. Nevertheless, GPUs are much faster :)
Caffe doesn't process multiple images in parallel, the only saving you get by batch processing several images is in the time it takes to transfer the image data back and forth between Caffe's framework, which could be significant when dealing with the GPU.
IIRC there are several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (CUDNN, CUDA Streams etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithread mode, or simply running on multiple threads). Of those I believe only the CUDNN option is currently part of the stable version of Caffe, but obviously requires a GPU. You can try to look at one of the pull requests about this matter on Caffe's github page and see if it works for you, but note that it might cause compatibilities issue with your current version.
This is one such version that in the past I've used, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the above issue that there's some speed up to the CPU code on this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610

How can I run a code directly into a processor with a File System?

I have a simple anisotropic filter c/c++ code that will process an .pgm image which is an text file with greyscale information for each pixel, and after done processing, it will generate an output image with the filter applied.
This program takes up to some seconds in order for it to do about 10 iterations on a x86 CPU running windows.
Me and an academic finishing his master degree on applied computing, we need to run the code under FPGA (Altera DE2-115) to see if there is considerable results of performance gain when running the code directly on the processor (NIOS 2).
We have successfully booted up the S.O uClinux under the FPGA, but there are some errors with device hardware, and by that we can't access SD-Card not even Ethernet, so we can't get the code and image into the FPGA in order to test its performance.
So I am here asking to an alternative way to test our code performance directly into an CPU with a file system so the code can read the image and generate another one.
The alternative can be either with an product that has low cost and easy to use (I was thinking raspberry PI), or either if I could upload the code somewhere that runs automatically for me and give me the reports.
Thanks in advance.

what you're trying to do is benchmarking some software on a multi GHz x86 Processor vs. a soft-core processor running 50MHz? (as much as I can tell from Altera docs)
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux) it also has threading overhead and what not. This can not be considered running it "directly" on CPU (whatever you mean by this)
If you really want to leverage the performance of an FPGA you should "convert" your C-Code into a HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board but Xilinx has some libraries accessing data from a SD card with FAT.

You can use on board SRAM or DDR2 RAM to run OS and your application.
Hardware design in your FPGA must have memory controller in it. In SOPC or Qsys select external memory as reset vector and compile design.
Then open NioSII build tools for Eclipse.
In Eclipse create new project by selecting NiosII Application and BSP project.
Once the project is created, go to BSP properties and type offset of external memory in the linker tab and generate BSP.
Compile project and Run as Nios II hardware.
This will run you application on through external memory.
You wont be able to see the image but 2-D array representing image in memory can be
printed on console.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js