Batch processing mode in Caffe - no performance gains - c++

Following on this thread I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 100 instead of 10).
However, the time required to process this batch is 10 times longer than before, which means that I did not get any performance increase.
Is that reasonable, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.

Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Although it seems relatively simple to implement based on the current GPU parallelism, at the moment there is no similar support for parallel processing on multiple CPUs.
The main problem with implementing parallelism is the syncing you need during training. If you just want to process your images in parallel (as opposed to training the model), you could load several copies of the same network into memory (whether through Python with multiprocessing or C++ with multi-threading) and process each image on a different network. It would be simple and quite effective, especially if you load the networks once and then process a large number of images. Nevertheless, GPUs are much faster :)
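A minimal C++ sketch of the "one network copy per thread" idea follows. It does not use the Caffe API: `DummyNet` and `ProcessInParallel` are hypothetical names, and `DummyNet::Forward` is only a placeholder for a real forward pass. In real code each worker would load its own network object from the same model files.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a loaded network. It is NOT a Caffe class; a real
// worker would own its own framework network loaded from the same model files.
struct DummyNet {
    // Placeholder "inference": returns the sum of the pixel values.
    float Forward(const std::vector<float>& image) const {
        float sum = 0.0f;
        for (float v : image) sum += v;
        return sum;
    }
};

// Processes all images with num_workers threads, one private net per thread.
std::vector<float> ProcessInParallel(
        const std::vector<std::vector<float>>& images, std::size_t num_workers) {
    std::vector<float> results(images.size());
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&images, &results, w, num_workers] {
            DummyNet net;  // created once per thread and never shared
            // Strided split: worker w handles images w, w + N, w + 2N, ...
            for (std::size_t i = w; i < images.size(); i += num_workers) {
                results[i] = net.Forward(images[i]);  // each index written by one thread only
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return results;
}
```

Because no network object is shared between threads and each thread writes to its own result slots, no locking is needed. Whether this actually scales depends on how many cores you have and on whether the underlying BLAS library is already multi-threaded.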
Caffe doesn't process multiple images in parallel. The only saving you get from batching several images is in the time it takes to transfer the image data into and out of Caffe's framework, which can be significant when dealing with the GPU.
IIRC there have been several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (cuDNN, CUDA streams, etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithread mode, or simply running on multiple threads). Of those, I believe only the cuDNN option is currently part of the stable version of Caffe, and it obviously requires a GPU. You can look at one of the pull requests on this matter on Caffe's GitHub page and see if it works for you, but note that it might cause compatibility issues with your current version.
This is one such version that I've used in the past, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the above pull request that this pull request also speeds up the CPU code somewhat, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610

Related

Why isn't my colab notebook using the GPU?

When I run code on my colab notebook after having selected the GPU, I get a message saying "You are connected to a GPU runtime, but not utilizing the GPU". Now I understand similar questions have been asked before, but I still don't understand why. I am running PCA on a dataset over hundreds of iterations, for multiple trials. Without a GPU it takes about as long as it does on my laptop, which can be >12 hours, resulting in a time out on colab. Is colab's GPU restricted to machine learning libraries like tensorflow only? Is there a way around this so I can take advantage of the GPU to speed up my analysis?
Colab is not restricted to Tensorflow only.
Colab offers three kinds of runtimes: a standard runtime (with a CPU), a GPU runtime (which includes a GPU) and a TPU runtime (which includes a TPU).
"You are connected to a GPU runtime, but not utilizing the GPU" means exactly that: you are connected to a GPU runtime, but your code is not using the GPU, so a less costly CPU runtime would be more suitable.
Therefore, you have to use a package that utilizes the GPU, such as Tensorflow or Jax. GPU runtimes also have a CPU, and unless you are specifically using packages that exercise the GPU, it will sit idle.

C++ project performance boost by running another project in the background?

I'm trying to boost the performance of my C++ facial tracking algorithm by optimizing the most time-consuming sections of my code. Interestingly, 86% of the entire tracking processing time is consumed by its feature extraction section. That feature extractor takes an image as input and returns the feature vector as output. I isolated the feature extractor from the tracker into a separate project to write an optimized version from scratch and made sure the input and output format and containers are the same as the ones used in the original tracker. Here's the breakdown of the resulting process time.
Original version: 3 ms/frame
Optimized version: 1 ms/frame
As can be seen it's supposed to be 3 times faster. But when I insert the new feature extractor into my tracker, inside the tracker it runs at about 2.5 ms/frame. I can't understand why it gets slower inside the tracker as opposed to when it runs as a standalone project?
But here's the catch that I discovered by accident. If I run the tracker and at the same time I run another project in the background, all of a sudden, the feature extractor inside the tracker would converge to 1 ms/frame. But as soon as I stop the background project it goes back to 2.5 ms/frame. This happens both on my laptop with Ubuntu 16.04 and on my Desktop PC with Windows 10.
In an attempt to understand this sort of behavior I used my Ubuntu System Monitor and noticed the following regarding all the 8 CPU cores.
When I run the feature extractor alone 7 cores are engaged by about 5% and one of the cores is engaged by 100%.
When I insert the feature extractor in the tracker and run the tracker, all the 8 cores are engaged by about 30%.
Now when I run the tracker and along with the tracker I run any other program (let's say the standalone feature extractor), out of all the cores that are engaged by 30%, one jumps up to 100% engagement which may explain the performance boost I get in the tracker as a result.
The evidence suggests that in order for the feature extractor to run at its maximum potential of 1 ms/frame, at least one of the cores should be engaged by 100%. My question is how can I make this happen?

Time Consuming Tensorflow C++ Session->Run - Images for Real-time Inference

[Tensorflow (TF) on CPU]
I am using the skeleton code provided for C++ TF inference from GitHub [label_image/main.cc] in order to run a frozen model I have created in Python. This model is an FC NN with two hidden layers.
In my current project's C++ code, I run the NN's frozen classifier for each single image (8x8 pixels). For each sample, a Session->Run call takes about 0.02 seconds, which is expensive in my application, since I can have 64000 samples that I have to run.
When I send a batch of 1560 samples, the Session->Run call takes about 0.03 seconds.
Are these time measurements normal for the Session->Run Call? From the C++ end, should I send my frozen model batches of images and not single samples? From the Python end, are there optimisation tricks to alleviate that bottleneck? Is there a way to concurrently do Session-Run calls in C++?
Environment info
Operating System: Linux
Installed version of CUDA and cuDNN: N/A
What other attempted solutions have you tried?
I installed TF using the optimised instruction set for the CPU, but it does not seem to give me the huge time saving mentioned in StackOverflow
Unified the session for the Graph I created.
EDIT
It seems that MatMul is the bottleneck -- Any suggestions how to improve that?
Should I use 'optimize_for_inference.py' script for my frozen graph?
How can you measure the time in Python with high precision?
Timeline for feeding an 8x8 sample and getting the result in Python
Timeline for feeding an 8x8 batch and getting the result in Python
For the record, I have done two things that significantly increased the speed of my application:
Compiled TF to work on the optimised ISA of my machine.
Applied batching to my data samples.
Please feel free to comment here if you have questions about my answer.

Low priority job in OpenCV

I am trying to call feature detectors from OpenCV in my C++ application written in Visual Studio. I would like to run this operation in the background and do not care about its timing. Actually, I prefer if it is not interfering with the main performance of the application. To this end, I perform the feature detection in a separate thread and tried to lower the priority of the thread with the command SetThreadPriority(). This is however not working, and while the OpenCV function is running, all the CPU cores are maxed out. Is there any way to control the priority of the tasks in OpenCV or even limit the CPU cores involved in its process?
Although I didn't find any way to reduce the priority of an OpenCV job, I could reduce the CPU usage by using setNumThreads(int numThreads).

How can I run a code directly into a processor with a File System?

I have a simple anisotropic filter C/C++ program that processes a .pgm image, which is a text file with greyscale information for each pixel, and when it is done it generates an output image with the filter applied.
The program takes up to a few seconds to do about 10 iterations on an x86 CPU running Windows.
An academic finishing his master's degree in applied computing and I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running the code directly on the processor (Nios II).
We have successfully booted the uClinux OS on the FPGA, but there are some errors with the device hardware, so we can't access the SD card or even Ethernet, and therefore we can't get the code and image onto the FPGA to test its performance.
So I am asking for an alternative way to test our code's performance directly on a CPU with a file system, so that the code can read the image and generate another one.
The alternative could be either a low-cost, easy-to-use product (I was thinking of a Raspberry Pi) or somewhere I could upload the code so that it runs automatically and gives me the reports.
Thanks in advance.
So what you're trying to do is benchmark some software on a multi-GHz x86 processor vs. a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux), it also has threading overhead and whatnot. This cannot be considered running it "directly" on the CPU (whatever you mean by this).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board, but Xilinx has some libraries for accessing data from an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must have a memory controller in it. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II build tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP project.
Once the project is created, go to the BSP properties, type the offset of the external memory in the linker tab, and generate the BSP.
Compile the project and run it as Nios II hardware.
This will run your application from external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed on the console.