Caffe batch processing no speedup - C++

I would like to speed up the forward pass of CNN classification in Caffe.
I have tried batch classification in Caffe using the code provided here:
Modifying the Caffe C++ prediction code for multiple inputs
This solution lets me pass a vector of cv::Mat, but it does not speed anything up, even though the input layer is modified.
I am processing fairly small images (3x64x64) on a powerful PC with two GTX 1080s, and there is no issue in terms of memory.
I also tried changing the deploy.prototxt, but I get the same result.
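For concreteness, here is a minimal sketch of the batched forward pass I am after (assuming the standard Caffe C++ API; the helper name and the flat per-image buffers are illustrative):

#include <caffe/caffe.hpp>
#include <algorithm>
#include <vector>

// Reshape the input blob to hold the whole batch (equivalent to raising the
// first input dim in deploy.prototxt), copy the preprocessed images in, and
// run a single forward pass over all of them.
void ClassifyBatch(caffe::Net<float>& net,
                   const std::vector<std::vector<float> >& images,  // CHW floats
                   int channels, int height, int width) {
  caffe::Blob<float>* input = net.input_blobs()[0];
  const int n = static_cast<int>(images.size());
  input->Reshape(n, channels, height, width);
  net.Reshape();  // propagate the new batch size through all layers
  float* dst = input->mutable_cpu_data();
  for (int i = 0; i < n; ++i) {
    std::copy(images[i].begin(), images[i].end(), dst + input->offset(i));
  }
  net.Forward();  // one forward pass for the whole batch
}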
Even with this setup, it seems that at some point the forward pass of the CNN becomes sequential.
I have seen someone point this out here as well:
Batch processing mode in Caffe - no performance gains
Another similar thread, for Python: batch size does not work for caffe with deploy.prototxt
I have seen some mentions of MemoryDataLayer, but I am not sure it will solve my problem.
So I am somewhat lost on what to do exactly... does anyone have any information on how to speed up classification time?
Thanks for any help!

Related

Code for cross-validation likelihood in kernel smoothing

I have watched many videos on the proper code for computing the cross-validation likelihood when kernel-smoothing a curve, and none of the packages works well. I need simple code, let's say using "mtcars". Any help with that, please? And if I want to change the bandwidth (h), what code should I use?
I tried caret but it did not work. I hope you can give me proper code using mtcars so I can reuse it on any data I may have.
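For reference, the quantity usually meant by "cross-validation likelihood" for a kernel density estimate with bandwidth h is the leave-one-out log-likelihood (this is the standard criterion, independent of any particular package):

\mathrm{CV}(h) = \sum_{i=1}^{n} \log \hat{f}_{-i}(x_i), \qquad \hat{f}_{-i}(x_i) = \frac{1}{(n-1)\,h} \sum_{j \neq i} K\!\left(\frac{x_i - x_j}{h}\right)

where K is the kernel; one then picks the h that maximizes CV(h).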

Text Detection with YOLO on Challenging Images

I have images that look as follows:
My goal is to detect and recognize the number 31197394. I have already fine-tuned a deep neural network on text recognition. It can successfully identify the correct number if the number is provided in the following format:
The only task that remains is detecting the corresponding bounding box. For this purpose, I have tried darknet. Unfortunately, it does not recognize anything. Does anyone have an idea of a network that performs better on this kind of image? I know that Amazon Rekognition is able to solve this task, but I need a solution that works offline. So my hopes are still high that there exists a pre-trained network that works. Thanks a lot for your help!
Don't say darknet doesn't work; it depends on how you labeled your dataset. It is true that the numbers you want to recognize are very small, so if you don't make any changes to the images during the pre-processing phase, it will be hard for a neural network to recognize them well. What you can do that will surely work is (a sketch of the corresponding configuration excerpt follows below):
1---> before labeling, increase the size of all images to twice their current size (like 1000 x 1000)
2---> use this size (1000 x 1000) for the darknet trainer instead of the default size proposed by darknet, which is 416 x 416; you would then have to change the configuration file
3---> use the latest darknet version (YOLOv4)
4---> in the configuration file, always keep the number of subdivisions at 1.
Note also that this method is very memory-hungry, so you need a machine with more than 16 GB of RAM. The advantage is that it works...
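A hypothetical excerpt of the [net] section of the darknet configuration file with these changes (note that darknet requires width and height to be multiples of 32, so 1024 is the closest usable value to the suggested 1000):

[net]
batch=64
# step 4: keep subdivisions at 1 (this is what makes the method memory-hungry)
subdivisions=1
# step 2: larger input size; must be a multiple of 32, so 1024 rather than 1000
width=1024
height=1024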
Thanks for your answers, guys! You were right, I had to fine-tune YOLO to make it work. So I created a dataset and fine-tuned YOLOv5. I am surprised how good the results are. Despite having only about 300 images in total, I get an accuracy of 97% at predicting the correct number. This is mainly due to strong augmentations. And indeed the memory requirements are large, but I could train on a machine with 32 GB of RAM. I can really encourage anyone who faces similar problems to give YOLO a shot!
Maybe use an R-CNN to identify the region where the number is and then pass that region to your fine-tuned neural network for the digit classification

Classify images with caffe directly from the GPU [duplicate]

I've read the Caffe2 tutorials and tried the pre-trained models. I know Caffe2 will leverage the GPU to run the model/net, but the input data always seems to be given from CPU (i.e. host) memory. For example, in Loading Pre-Trained Models, after the model is loaded, we can predict an image with
result = p.run([img])
However, the image "img" is read in CPU scope. What I am looking for is a framework that can pipeline images (decoded from a video and still residing in GPU memory) directly into the prediction model, instead of copying each one from GPU to CPU scope and then transferring it to the GPU again to predict the result. Does Caffe or Caffe2 provide such functions or interfaces for Python or C++? Or would I need to patch Caffe to do so? Thanks, all.
Here is my solution:
I found that in tensor.h, the function ShareExternalPointer() can do exactly what I want.
Feed GPU data this way:
pInputTensor->ShareExternalPointer(pGpuInput, InputSize);
then run the predict net with
pPredictNet->Run();
where pInputTensor is the input tensor for the predict net pPredictNet.
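Putting the pieces together, a rough sketch of the whole call (this assumes the 2017-era Caffe2 C++ API, so exact signatures may differ between versions; the blob name "data" and the helper are illustrative):

#include <caffe2/core/workspace.h>
#include <caffe2/core/context_gpu.h>
#include <vector>

// Run the predict net on data that already lives in GPU memory,
// wrapping the device pointer instead of copying host -> device.
void RunOnGpuData(caffe2::Workspace& ws, caffe2::NetBase* pPredictNet,
                  float* pGpuInput, const std::vector<caffe2::TIndex>& dims) {
  auto* pInputTensor = ws.GetBlob("data")->GetMutable<caffe2::TensorCUDA>();
  pInputTensor->Resize(dims);
  pInputTensor->ShareExternalPointer(pGpuInput);  // no copy happens here
  pPredictNet->Run();  // the forward pass reads straight from GPU memory
}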
I don't think you can do it in Caffe with the Python interface.
But I think it can be accomplished using C++: in C++ you have access to the Blob's mutable_gpu_data(). You can write code that runs on the device and "fills" the input Blob's mutable_gpu_data() directly from the GPU. Once you have made this update, Caffe should be able to continue its net->forward() from there.
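A minimal sketch of that idea, assuming Caffe was built with GPU support and the image batch already sits in device memory (the helper name is mine, and a plain device-to-device copy stands in for whatever kernel produces the data):

#include <caffe/caffe.hpp>
#include <cuda_runtime.h>

// Forward a batch whose pixels already live on the GPU, without
// bouncing through host memory.
void ForwardFromGpu(caffe::Net<float>& net, const float* d_images) {
  caffe::Blob<float>* input = net.input_blobs()[0];
  float* d_input = input->mutable_gpu_data();  // device pointer into the blob
  cudaMemcpy(d_input, d_images, input->count() * sizeof(float),
             cudaMemcpyDeviceToDevice);        // device-to-device only
  net.Forward();
}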
UPDATE
On Sep 19th, 2017, PR #5904 was merged into master. This PR exposes the GPU pointers of blobs via the Python interface.
You may access blob._gpu_data_ptr and blob._gpu_diff_ptr directly from Python, at your own risk.
As you've noted, using a Python layer forces data in and out of the GPU, and this can cause a huge hit to performance. This is true not just for Caffe, but for other frameworks too. To elaborate on Shai's answer, you could look at this step-by-step tutorial on adding C++ layers to Caffe. The example given should touch on most issues dealing with layer implementation. Disclosure: I am the author.

Multi-Modal Image Alignment Issue

I am trying to align two multi-spectral images using multi-modal image registration techniques.
I built a prototype in MATLAB by first creating the optimizer and metric objects as follows:
[optimizer, metric] = imregconfig('Multimodal');
This creates an optimizer object of type OnePlusOneEvolutionaryOptimizer and a metric of type MattesMutualInformation. The images are aligned as follows:
tform = imregtform(movingImage, fixedImage, 'rigid', optimizer, metric);
aligned = imwarp(movingImage,tform,'OutputView',imref2d(size(fixedImage)));
Then I moved to a C++ implementation of the same algorithm, which is offered by one of the examples in the ITK v4 library.
This example also gives correct results, but here is the problem: the ITK version is much slower than the MATLAB version. I played around with the optimizer parameters and was able to speed it up a bit, but nowhere near the MATLAB version.
The MATLAB documentation of OnePlusOneEvolutionaryOptimizer states that the value of the InitialRadius property is directly proportional to the algorithm's execution speed (at a cost in robustness). The confusion here is that in ITK, the value of InitialRadius is inversely proportional to the execution speed, as far as I have tested.
I couldn't find literature or documentation describing how optimizer parameters like InitialRadius and GrowthFactor are interpreted in ITK. Please help by explaining these parameters and how to speed up the algorithm.
The first thing to check is making sure you are compiling your program in Release mode, not Debug mode.
Documentation and source code for 1+1 optimizer in ITK are available online.
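For reference, a sketch of how these parameters are typically set on ITK's 1+1 optimizer (the values are illustrative, not tuned; this assumes the v4 registration framework):

#include "itkOnePlusOneEvolutionaryOptimizerv4.h"
#include "itkNormalVariateGenerator.h"

int main() {
  using OptimizerType = itk::OnePlusOneEvolutionaryOptimizerv4<double>;
  auto optimizer = OptimizerType::New();

  // The 1+1 optimizer samples parameter perturbations from a normal distribution.
  auto generator = itk::Statistics::NormalVariateGenerator::New();
  generator->Initialize(12345);  // RNG seed
  optimizer->SetNormalVariateGenerator(generator);

  // Initialize(initialRadius, growthFactor, shrinkFactor): the radius is the
  // size of the search sphere; it is multiplied by growthFactor after an
  // improving iteration and by shrinkFactor otherwise.
  optimizer->Initialize(0.1, 1.05, 0.98);
  optimizer->SetEpsilon(1.5e-4);        // stop once the radius shrinks below this
  optimizer->SetMaximumIteration(200);  // hard iteration cap
  return 0;
}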

Reduce a Caffe network model

I'd like to use Caffe to extract image features. However, it takes too long to process an image, so I'm looking for ways to optimize for speed.
One thing I noticed is that the network definition I'm using has four extra layers on top of the one from which I'm reading the result (and there are no feedback signals, so they should be safe to delete).
I tried deleting them from the definition file, but it had no effect at all. I guess I might also need to remove the corresponding parts from the file that contains the pre-trained weights. That file, however, is binary (a protobuf), so editing it is not that easy.
Do you think removing the four layers might have a noticeable effect on the net's performance?
If so, how do I get familiar with the file's contents so that I can edit it, and how do I know which parts to remove?
First, I don't think removing the weights from the binary file will have any effect.
Second, you can remove the layers easily using the Python interface: see this tutorial.
Last but not least, have you tried running caffe time to measure the performance of your net? This may help you identify the bottlenecks in your computation.
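For example (the model path and iteration count are illustrative):

caffe time -model deploy.prototxt -iterations 50 -gpu 0

This prints per-layer forward/backward timings, so you can see directly how much the four extra layers actually cost.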
PS,
You might find this thread relevant as well.
A caffemodel stores data as key-value pairs. Caffe only copies weights for those layers (in train.prototxt) having exactly the same name as in the caffemodel, so I don't think removing the binary weights is necessary. If you want to change the network structure, just modify train.prototxt and deploy.prototxt.
If you insist on removing weights from the binary file, follow this caffe example.
And to make sure you delete the right part, this visualization tool should help.
I would retrain on a smaller input size, change strides, etc. However, if you want to reduce the file size, I'd suggest quantizing the weights (https://github.com/yuanyuanli85/CaffeModelCompression) and then using something like LZMA compression (xz on Unix). We do this so we can deploy to mobile devices. 8-bit weights compress nicely.
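For example (the filename is illustrative):

xz -9 -k quantized.caffemodel

This writes quantized.caffemodel.xz at maximum compression and keeps the original file.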