Extracting Features from VGG - computer-vision

I want to extract features from images in the MS COCO dataset using a fine-tuned VGG-19 network.
However, it takes about 6-7 seconds per image, or roughly 2 hours per 1k images (even longer for other fine-tuned models).
There are 120k images in the MS COCO dataset, so it will take at least 10 days.
Is there any way I can speed up the feature extraction process?

This is not just a matter of running a single command. First, check whether your GPU is powerful enough to wrestle with deep CNNs; knowing your GPU model answers that question.
Second, you have to compile and build the Caffe framework with CUDA and GPU support enabled (CPU_ONLY disabled) in Makefile.config (or CMakeLists.txt).
Once you have completed the required steps (installing the NVIDIA driver, installing CUDA, etc.), you can build Caffe for GPU use. Then, by passing the GPU device ID on the command line, you can benefit from the speedup the GPU provides.
Follow this link for building Caffe with GPU support.
Hope it helps.

This IPython notebook example explains the steps for extracting features from any Caffe model really well: https://github.com/BVLC/caffe/blob/master/examples/00-classification.ipynb
In pycaffe, you can enable GPU mode simply by calling caffe.set_mode_gpu().
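Putting the two answers together, here is a minimal pycaffe sketch of GPU-mode, batched extraction. The prototxt/caffemodel paths, the image directory, the 'fc7' layer name, and the batch size are assumptions; adapt them to your fine-tuned VGG-19.

```python
import glob
import numpy as np
import caffe

caffe.set_device(0)      # GPU device id
caffe.set_mode_gpu()

# Hypothetical paths to the fine-tuned VGG-19 definition and weights
net = caffe.Net('vgg19_deploy.prototxt', 'vgg19_finetuned.caffemodel', caffe.TEST)

batch_size = 32
net.blobs['data'].reshape(batch_size, 3, 224, 224)

# Standard Caffe preprocessing: HxWxC RGB float in [0,1] -> CxHxW BGR in [0,255], mean-subtracted
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_raw_scale('data', 255)
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))  # approximate ImageNet BGR mean

image_paths = sorted(glob.glob('coco/train2014/*.jpg'))  # hypothetical image location
features = []
for start in range(0, len(image_paths), batch_size):
    batch = image_paths[start:start + batch_size]
    for i, path in enumerate(batch):
        net.blobs['data'].data[i] = transformer.preprocess('data', caffe.io.load_image(path))
    net.forward()
    features.append(net.blobs['fc7'].data[:len(batch)].copy())  # one fc7 vector per image

features = np.vstack(features)
```

Besides GPU mode, forwarding images in batches rather than one at a time amortizes the per-call overhead and is usually a large part of the speedup.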

Related

What parameters in an EC2 virtual machine should I use to optimize H2O's XGBoost performance?

I'm trying to run H2O XGBoost on an r4.8xlarge instance, but it's taking too long to run (15+ hours, as opposed to 4 hours for GBM with the same hyperparameter grid size).
Knowing that XGBoost uses cache optimization, is there any particular instance type that works best for H2O's XGBoost implementation?
My training data has 28K rows and 150 binary columns, and I'm running a grid search.
Changing your EC2 instance won't necessarily make it faster. You need to understand where the bottleneck is. Review the logs and see what takes time in GBM vs. XGBoost. Is XGBoost building deeper trees or more trees? Your settings may differ between the two algorithms, so check that all the hyperparameters are as close as possible.
Also, XGBoost uses memory external to H2O's JVM. As mentioned in the FAQ of H2O's XGBoost docs, try adding -extramempercent 120 and lowering your H2O memory allocation.
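To make that comparison concrete, here is a minimal sketch using the H2O Python API that runs the same grid on both GBM and XGBoost so timings and tree settings can be compared like for like. The file name, response column, grid values, and the assumption of a classification target are all placeholders.

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.xgboost import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch

# Leave headroom outside the JVM: H2O's XGBoost allocates native (off-heap) memory
h2o.init(max_mem_size="24G")

train = h2o.import_file("train.csv")          # hypothetical file
target = "target"                             # hypothetical response column
train[target] = train[target].asfactor()      # assumed classification target
predictors = [c for c in train.columns if c != target]

# Identical grid for both algorithms, so tree depth and count are directly comparable
hyper_params = {"max_depth": [4, 6, 8], "ntrees": [50, 100]}

for estimator in (H2OGradientBoostingEstimator, H2OXGBoostEstimator):
    grid = H2OGridSearch(model=estimator, hyper_params=hyper_params)
    grid.train(x=predictors, y=target, training_frame=train)
    print(estimator.__name__)
    print(grid.get_grid(sort_by="logloss", decreasing=False))
```

If the XGBoost trees end up deeper or more numerous than the GBM ones under the same grid, that points to a settings difference rather than an instance-type problem.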

Speeding up model training using MITIE with Rasa

I'm training a model to recognize short, one- to three-sentence strings of text using the MITIE back-end in Rasa. The model trains and works using spaCy, but it isn't quite as accurate as I'd like. Training on spaCy takes no more than five minutes, but training with MITIE ran for several days non-stop on my computer with 16GB of RAM. So I started training it on an Amazon EC2 r4.8xlarge instance with 255GB RAM and 32 threads, but it doesn't seem to be using all the resources available to it.
In the Rasa config file, I have num_threads: 32 and set max_training_processes: 1, which I thought would help use all the memory and computing power available. But now that it has been running for a few hours, CPU usage is sitting at 3% (100% usage but only on one thread), and memory usage stays around 25GB, one tenth of what it could be.
Do any of you have any experience with trying to accelerate MITIE training? My model has 175 intents and a total of 6000 intent examples. Is there something to tweak in the Rasa config files?
So I am going to try to address this from several angles. First, specifically from the Rasa NLU angle, the docs say:
Training MITIE can be quite slow on datasets with more than a few intents.
and provide two alternatives:
Use the mitie_sklearn pipeline, which trains using sklearn.
Use the MITIE fork, in which Tom B from Rasa has modified the code to run faster in most cases.
Given that you're only getting a single core used, I doubt this will have much impact, but it has been suggested by Alan from Rasa that num_threads should be set to 2-3x your number of cores.
If you haven't evaluated both of those possibilities, you probably should.
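If you do switch pipelines, a minimal training sketch using the rasa_nlu Python API (0.x era) might look like the following; the file names and the contents of the config file are assumptions, and the exact config keys depend on your Rasa NLU version.

```python
from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Hypothetical file names; config_mitie_sklearn.yml is assumed to set
#   language: "en"
#   pipeline: "mitie_sklearn"
# plus the path to MITIE's total_word_feature_extractor.dat
training_data = load_data("data/nlu_examples.json")
trainer = Trainer(config.load("config_mitie_sklearn.yml"))
trainer.train(training_data)
model_directory = trainer.persist("./models/")
print("model saved to", model_directory)
```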
Not all aspects of MITIE are multi-threaded. See this issue opened by someone else using Rasa on the MITIE GitHub page and quoted here:
Some parts of MITIE aren't threaded. How much you benefit from the threading varies from task to task and dataset to dataset. Sometimes only 100% CPU utilization happens and that's normal.
Specifically regarding the training data, I would recommend that you look at the evaluate tool recently introduced into the Rasa repo. It includes a confusion matrix that could help you identify trouble areas.
This may allow you to switch to spaCy, use a portion of your 6000 examples as an evaluation set, and add examples back to the intents that aren't performing well.
I have more questions about where the 6000 examples came from, whether they're balanced, how different the intents are from one another, whether you've verified that words from the training examples appear in the corpus you're using, etc., but I think the above is enough to get started.
It will be no surprise to the Rasa team that MITIE is taking forever to train; the bigger surprise would be that you can't get good accuracy out of another pipeline.
As a last resort, I would encourage you to open an issue on the Rasa NLU GitHub page and engage the team there for further support, or join the Gitter conversation.

OpenCV training output

So I am creating my own classifiers using the OpenCV Machine Learning module for age estimation. I can train my classifiers, but the training takes a long time, so I would like to see some output (classifier status, iterations done, etc.). Is this possible? I'm using ml::Boost, ml::LogisticRegression and ml::RTrees, all inheriting from cv::StatModel. Just to be clear, I'm not using the bundled applications for recognizing objects in images (opencv_createsamples and opencv_traincascade). The documentation is very limited, so it's very hard to find anything in it.
Thanks
Looks like there's an open feature request for a "progress bar" to provide some rudimentary feedback... See https://github.com/Itseez/opencv/issues/4881. Personally, I gave up on using the OpenCV ML module a while back. There are several high-quality tools available for building machine learning models. I've personally used Google's TensorFlow, but I've heard good things about Theano and Caffe as well.

OpenCV where is tracking.hpp

I want to use OpenCV's implementation of the TLD tracker. The internet says that I have to include this file: opencv2/tracking.hpp (e.g. see https://github.com/Itseez/opencv_contrib/blob/master/modules/tracking/samples/tracker.cpp).
But there is no such file.
So what must I do to use TrackerTLD in my C++ project?
(OpenCV 3.0.0 beta for Windows, installed from the .exe package from opencv.org)
As Floyd mentioned, to use TrackerTLD you need to download the OpenCV contrib repo. The instructions are in the link, so explaining them here shouldn't be necessary.
However, in my opinion, using TrackerTLD from the OpenCV contrib repo is a bad option: I tested it (about a week or two ago) and it was terribly slow. If you are thinking about real-time image processing, consider using another implementation of TLD or some other tracker. Right now I'm using this implementation and it's working really well. Note that tracking an object is quite a time-consuming task, so to perform real-time tracking I have to downscale every frame from 640x480 to 320x240 (it would probably work well, and definitely faster, at an even lower resolution). On the web page of the author of that implementation you can find some information about the TLD algorithm (and implementation), as well as another tracker created by the same author, CMT (Consensus-based Matching and Tracking of Keypoints). Unfortunately I haven't tested it yet, so I can't say anything about it.
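For completeness, here is a rough sketch of the contrib tracker in use, shown with the Python bindings for brevity (the C++ tracking API is analogous once opencv_contrib is built). The video path, initial bounding box, and factory name are assumptions that depend on your OpenCV 3.x version.

```python
import cv2

# Assumes an OpenCV 3.x build that includes the opencv_contrib tracking module.
# In the 3.1+ Python bindings the factory is cv2.TrackerTLD_create();
# in 3.0 it was cv2.Tracker_create("TLD").
tracker = cv2.TrackerTLD_create()

cap = cv2.VideoCapture("video.avi")        # hypothetical input video
ok, frame = cap.read()
frame = cv2.resize(frame, (320, 240))      # downscale 640x480 -> 320x240 for speed
bbox = (50, 50, 80, 80)                    # hypothetical initial box (x, y, w, h)
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.resize(frame, (320, 240))
    found, bbox = tracker.update(frame)
    if found:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("TLD", frame)
    if cv2.waitKey(1) & 0xFF == 27:        # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
```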

batchedgemm source code?

I have a special sort of problem.
I have some research code that I developed on my MacBook using CUDA 4.1, in particular using the batched GEMM routines. I now have to run it on a cluster of GPUs that I have on loan from another institution.
My problem is that the cluster only has CUDA 4.0 installed, and the administrators are reluctant to upgrade quickly.
Does anyone know if I can get the source for batched GEMM somewhere and compile it to work under CUDA 4.0?
I've written my own kernel for doing batched multiplications, but it performs roughly 10x slower than the library one. I would like to stand on the shoulders of great men instead of on their toes.
I understand the reluctance to upgrade a production cluster quickly. Many clusters use a module system, which means multiple versions of the CUDA toolkit can coexist. The driver, however, needs to be upgraded to a version that supports the newest CUDA toolkit in use; this is why the administrators would be reluctant, since they would need to test their users' production codes and applications to avoid regressions or failures.
Since CUBLAS is not open source, I recommend you continue developing your code on a separate machine, and if you get a large speedup from batching, present that to the administrators as a reason to upgrade.