Understanding Dlib Kernel implementation - c++

I'm starting to use dlib, and I'm having a hard time understanding the way kernels are implemented. I started with the kernel k-means (kkmeans) algorithm since I already know this clustering method. However, I cannot figure out where the kernel is computed: the input data is a matrix (not a kernel), and the algorithm never seems to transform the data into a kernel matrix.
I would expect a kernel class returning a square matrix, but I have not seen anything like this!
I want to use dlib to implement a clustering algorithm based on kernels, and dlib sounds like a good way to do so. Does anyone have documentation on how the kernels are implemented, or can someone explain to me how it works?
Thanks for your help!

A kernel is basically just a function that takes two input samples and outputs a single number. So yes, sometimes you will see code that then computes an N by N matrix of all the possible kernel function outputs for N samples. However, this is a somewhat naive implementation strategy since it requires O(N^2) RAM. So most real world kernel method software uses some kind of delayed evaluation or caching strategy to avoid this problem.
In the kernel K-means implementation in dlib this is done with the kcentroid object. Inside the kcentroid you can see that it's invoking the kernel function in a number of places and doing all the "kernel stuff" (a short usage sketch follows the book list below). You can read over the documentation for the kcentroid to understand what it does. Although, if you are just getting started with kernel methods then you will really need to get a book on the subject. I highly recommend picking one of these:
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond by Bernhard Schölkopf and Alexander J. Smola
Kernel Methods for Pattern Analysis by John Shawe-Taylor and Nello Cristianini
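For orientation, here is a minimal usage sketch along the lines of dlib's own kkmeans example (the kernel parameter, tolerance, and dictionary size below are placeholder values). Note that the kernel is just a small function object; the kcentroid calls it internally instead of ever building an N by N matrix.

    #include <dlib/clustering.h>
    #include <vector>

    int main()
    {
        typedef dlib::matrix<double, 2, 1> sample_type;
        typedef dlib::radial_basis_kernel<sample_type> kernel_type;

        // The kcentroid does the "kernel stuff"; 0.1 is the RBF gamma,
        // 0.01 the tolerance, and 8 the dictionary size (all placeholders).
        dlib::kcentroid<kernel_type> kc(kernel_type(0.1), 0.01, 8);
        dlib::kkmeans<kernel_type> clusterer(kc);

        // Toy data: two clumps of 2D points (purely illustrative).
        std::vector<sample_type> samples;
        for (int i = 0; i < 50; ++i)
        {
            sample_type s;
            s(0) = (i % 2 == 0) ? 0.0 + 0.01 * i : 10.0 + 0.01 * i;
            s(1) = s(0);
            samples.push_back(s);
        }

        std::vector<sample_type> initial_centers;
        clusterer.set_number_of_centers(2);
        dlib::pick_initial_centers(2, initial_centers, samples, clusterer.get_kernel());
        clusterer.train(samples, initial_centers);

        // clusterer(samples[i]) now returns the cluster index of sample i.
        return 0;
    }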

For a set of N data points, the kernel is usually represented by an NxN matrix whose (i,j)-th entry gives the value of the kernel between data point i and data point j. This works for kernel methods as long as the matrix is symmetric and positive semi-definite, which is guaranteed to be true for a valid (Mercer) kernel.
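To make both views concrete, here is a minimal plain-C++ sketch (Gaussian/RBF kernel assumed) of a kernel as a function of two samples, plus the explicit NxN Gram matrix built from it; the dlib code above avoids materialising this matrix, but it is conceptually what the kernel defines.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Sample = std::vector<double>;

    // A kernel is just a function of two samples returning one number.
    double rbf_kernel(const Sample& a, const Sample& b, double gamma)
    {
        double d2 = 0.0;
        for (std::size_t k = 0; k < a.size(); ++k)
            d2 += (a[k] - b[k]) * (a[k] - b[k]);
        return std::exp(-gamma * d2);
    }

    // The explicit N x N Gram matrix; note the O(N^2) memory the first
    // answer warns about.
    std::vector<std::vector<double>> gram_matrix(const std::vector<Sample>& X, double gamma)
    {
        const std::size_t n = X.size();
        std::vector<std::vector<double>> K(n, std::vector<double>(n));
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                K[i][j] = rbf_kernel(X[i], X[j], gamma);
        return K;
    }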


Working with many fixed-size matrices in CUDA kernels

I am looking to work with about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this support is in its infancy: it doesn't seem to work well yet and some things are not implemented. Moreover, I am not sure whether it is optimized for CUDA at all. There is almost no documentation and the only code example is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to be parallelized even for small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Batch processing kernels using CUDA streams. In Section 2.1.7 "Batching Kernels" of the cuBLAS documentation for the CUDA Toolkit v7.0, this is suggested. But """in practice it is not possible to have more than 16 concurrent kernels executing at the same time""" and consequently it would be terrible for processing 4000 small matrices. In an aforementioned link to the CULA blog post, I quote, """One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."""
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cuBLAS functions getrfBatched and getriBatched are designed for batched LU factorization and inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your 2nd and 3rd approaches). Also, a batch solver is available in source code form that can do matrix inversions; you will need to log in as a registered developer at developer.nvidia.com to access this link.
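As a rough host-side sketch of how the batched calls fit together for 4000 3x3 matrices (matrix contents and error checking are placeholders, and the exact argument order is worth checking against the cuBLAS documentation; the pointer-array setup is the part that usually trips people up):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main()
    {
        const int n = 3;          // matrix dimension
        const int batch = 4000;   // number of matrices

        // One contiguous block for all inputs and one for all inverses (column-major).
        float *dA, *dC;
        cudaMalloc(&dA, sizeof(float) * n * n * batch);
        cudaMalloc(&dC, sizeof(float) * n * n * batch);
        // ... copy your 4000 matrices into dA here ...

        // The batched routines take a device array of pointers, one per matrix.
        std::vector<float*> hA(batch), hC(batch);
        for (int i = 0; i < batch; ++i) {
            hA[i] = dA + i * n * n;
            hC[i] = dC + i * n * n;
        }
        float **dAarray, **dCarray;
        cudaMalloc(&dAarray, sizeof(float*) * batch);
        cudaMalloc(&dCarray, sizeof(float*) * batch);
        cudaMemcpy(dAarray, hA.data(), sizeof(float*) * batch, cudaMemcpyHostToDevice);
        cudaMemcpy(dCarray, hC.data(), sizeof(float*) * batch, cudaMemcpyHostToDevice);

        int *dPivots, *dInfo;
        cudaMalloc(&dPivots, sizeof(int) * n * batch);
        cudaMalloc(&dInfo, sizeof(int) * batch);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // LU-factorize all matrices in one call, then invert them in one call.
        cublasSgetrfBatched(handle, n, dAarray, n, dPivots, dInfo, batch);
        cublasSgetriBatched(handle, n, (const float* const*)dAarray, n,
                            dPivots, dCarray, n, dInfo, batch);

        // dC now holds the 4000 inverses; dInfo flags any singular matrices.
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dC); cudaFree(dAarray); cudaFree(dCarray);
        cudaFree(dPivots); cudaFree(dInfo);
        return 0;
    }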
Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
cuSOLVER provides some eigensolver functions. However, they are neither batched nor callable from device code, so beyond that you are faced with streams as the only option.

Neural Networks training on multiple cores

Straight to the facts.
My neural network is a classic feedforward network trained with backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on the historical data.
This dataset is about 10 MB large, therefore training it on one core takes ages. I want to go multicore with the training, but I can't understand what happens with the training data for each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of the threads. Each thread executes the forward and backward propagations. The weight and threshold deltas are summed for each of the threads. At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network.
'Each thread executes the forward and backward propagations' - does this mean each thread just trains itself with its part of the dataset? How many iterations of the training per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - What exactly does that mean? When the cores finish training with their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
On this background, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use some common parameters for the learning rate, etc.). You end up with $N$ trained networks $f_i(x)$.
Next, you need a scheme to combine the results. Choose $F(x) = \sum_{i=1}^{N} \alpha_i f_i(x)$, then use least squares to adapt the parameters $\alpha_i$ such that $\sum_{j=1}^{M} \big(F(x_j) - y_j\big)^2$ is minimized. This involves a singular value decomposition, which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights; see below.
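A minimal sketch of that recombination step, assuming Eigen for the linear algebra (the layout of P is my own convention, not from any library):

    #include <Eigen/Dense>

    // P is M x N with P(j, i) = f_i(x_j), the prediction of network i on
    // sample j; y holds the M target values. Returns the alpha_i.
    Eigen::VectorXd fit_combination(const Eigen::MatrixXd& P, const Eigen::VectorXd& y)
    {
        // Least squares via SVD: minimizes sum_j (F(x_j) - y_j)^2
        // with F(x) = sum_i alpha_i f_i(x).
        return P.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(y);
    }

The combined predictor is then evaluated by running all $N$ networks on a new input and taking the $\alpha$-weighted sum of their outputs.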
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number chosen by you. I would play around with choices roughly between 1 and 20 here. Note that the suggestion above uses infinity, so to speak, but then replaces the recombination step with something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you aim for is a single trained network in the end, and one uses the split data to estimate it.
To combine the results, one often does the following (a minimal code sketch of this loop appears after the list of variations below):
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
This is a form of iterative optimization. Variations of this algorithm:
Instead of always using the same split, use random splits at each iteration step (... or at each n-th iteration). Or, in the spirit of random forests, use only a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
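As a rough sketch of the (i)/(ii) loop described above, assuming OpenMP for the threading and a placeholder backprop routine (not any particular library's API):

    #include <cstddef>
    #include <vector>

    using Weights = std::vector<double>;

    // Placeholder: run one (or a few) backprop passes over shard s using the
    // current global weights and return the accumulated weight deltas.
    Weights backprop_deltas(const Weights& w, int s)
    {
        return Weights(w.size(), 0.0);   // dummy body for the sketch
    }

    void train_parallel(Weights& global, int num_shards, int iterations)
    {
        for (int it = 0; it < iterations; ++it)
        {
            std::vector<Weights> deltas(num_shards);

            // (i) each thread works on its own data shard, starting from the
            //     same global weights
            #pragma omp parallel for
            for (int s = 0; s < num_shards; ++s)
                deltas[s] = backprop_deltas(global, s);

            // (ii) average the per-shard deltas, apply them, and resynchronize
            for (std::size_t k = 0; k < global.size(); ++k)
            {
                double sum = 0.0;
                for (int s = 0; s < num_shards; ++s)
                    sum += deltas[s][k];
                global[k] += sum / num_shards;
            }
        }
    }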
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
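As a minimal illustration of that idea: write the forward pass for a whole mini-batch as matrix products, and let the library's multithreaded backend do the parallel work (Eigen assumed here, built with a parallel backend such as OpenMP; layer sizes and the tanh activation are arbitrary choices for the sketch):

    #include <Eigen/Dense>

    // X is batch_size x n_inputs, W1 is n_inputs x n_hidden,
    // W2 is n_hidden x n_outputs.
    Eigen::MatrixXd forward(const Eigen::MatrixXd& X,
                            const Eigen::MatrixXd& W1,
                            const Eigen::MatrixXd& W2)
    {
        // The two large matrix products are where the parallel backend does
        // its work; the whole mini-batch is propagated at once.
        Eigen::MatrixXd H = (X * W1).array().tanh().matrix();
        return H * W2;
    }

Backpropagation can be written the same way, as a handful of matrix products and element-wise operations over the mini-batch.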
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.

How one would implement a 2-for particle-interaction loop using CUDA, and what is the resulting complexity?

This algorithm receives a world (a list) of particles (3-dimensional vectors) and calls an interaction function on every pair of them. In pseudocode:
function tick(world)
    for i in range(world)
        for j in range(world)
            world[i] = interact(world[i], world[j])
Where interact is a function that takes 2 particles and returns another one, and could be anything, for example:
function interact(a,b) = (a + b)*0.5
You can easily determine this algorithm is O(N^2) on the CPU. In my attempt to learn CUDA, I'm not sure how that could be implemented on the GPU. What would be the general structure of such algorithm, and what would be the resulting complexity? What if we knew the interact function didn't do anything if 2 particles were distant enough? Could we optimize it for locality?
What would be the general structure of such algorithm, and what would be the resulting complexity?
This is essentially the n-body problem, solved using a direct particle-particle approach. It has been written about a lot. The order of the algorithm is O(N^2) on the GPU, just as it is on the CPU.
The core algorithm as implemented in CUDA doesn't change much, except that it takes advantage of block-local shared memory and is optimized around it. Essentially the implementation still comes down to two loops.
A good place to start is GPU Gems 3, Chapter 31: Fast N-Body Simulation with CUDA.
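A minimal sketch of that tiled approach, using the toy interact() from the question (note that it reads all inputs from the previous tick and writes to out[], i.e. a double-buffered variant of the pseudocode rather than updating world[i] in place):

    #include <cuda_runtime.h>

    __device__ float3 interact(float3 a, float3 b)
    {
        // the toy rule from the question: the average of the two particles
        return make_float3(0.5f * (a.x + b.x), 0.5f * (a.y + b.y), 0.5f * (a.z + b.z));
    }

    __global__ void tick(const float3* in, float3* out, int n)
    {
        extern __shared__ float3 tile[];   // one tile of particles, staged per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float3 p = (i < n) ? in[i] : make_float3(0.f, 0.f, 0.f);

        // outer loop: march over the particle list one tile at a time
        for (int base = 0; base < n; base += blockDim.x)
        {
            int j = base + threadIdx.x;
            tile[threadIdx.x] = (j < n) ? in[j] : make_float3(0.f, 0.f, 0.f);
            __syncthreads();                              // tile fully loaded

            // inner loop: interact this thread's particle with the whole tile
            int limit = min((int)blockDim.x, n - base);
            for (int k = 0; k < limit; ++k)
                p = interact(p, tile[k]);
            __syncthreads();                              // done reading the tile
        }
        if (i < n)
            out[i] = p;
    }

    // Host-side launch, e.g.:
    //   int threads = 256, blocks = (n + threads - 1) / threads;
    //   tick<<<blocks, threads, threads * sizeof(float3)>>>(d_in, d_out, n);

Each thread still touches all N particles, so the work is O(N^2); the tiling only reduces global-memory traffic.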
Could we optimize it for locality?
Yes. Many n-body algorithms attempt to optimize for locality as gravitational and E-M forces decrease as a power of the distance between particles so distant particles can be ignored or their contribution can be approximated. Which of these approximation approaches to take largely depends on the type of system you are trying to simulate.
The following is a good overview of some of the more popular approaches:
Seminar presentation, N-body algorithms

Poor performance for calculating eigenvalues and eigenvectors on GPU

In some code we need to compute eigenvectors and eigenvalues for the generalized eigenvalue problem with symmetric real matrices (Ax = lambda Bx). This code uses DSPGVX from LAPACK. We wanted to speed it up on the GPU using a MAGMA function. We asked on this forum and were pointed to this:
http://icl.cs.utk.edu/magma/docs/zhegvx_8cpp.html
The size of our matrices (N) goes from 100 to 50000 and even more, related to the number of atoms in a molecule. We observe:
a) for N bigger than about 2500, MAGMA just does not work; we get a segmentation fault.
b) MAGMA always runs slower than sequential LAPACK, around 10 times slower.
Is this behavior normal, and can we overcome it? Can anybody point to a reference where someone working on similar problems gets a decent speedup?
Thanks
In my experience you may be able to gain greater performance benefits by switching to a better eigensolver. The best solver that I know of is ARPACK. You will gain the most benefit if your matrices have some structure, for example if they are sparse. This solver is also most efficient if you only need to extract a small fraction of the total number of eigenpairs.
I would start off by trying this solver on your problems running just on the CPU. You may find that this alone gives sufficient performance for your needs. If not then it is relatively easy to move the calculation core for ARPACK to the GPU. Or, there are parallel versions of ARPACK available.
Have you tried CULA (http://www.culatools.com/)? CULA is LAPACK ported to CUDA, so at least in theory it should have one of the best implementations of the generalized eigenvalue problem. I think the single-precision version is free, so you could give it a try.

mapreduce vs other parallel processing solutions

So, the questions are:
1. Is the MapReduce overhead too high for the following problem? Does anyone have an idea of how long each map/reduce cycle (in Disco, for example) takes for a very light job?
2. Is there a better alternative to MapReduce for this problem?
In MapReduce terms, my program consists of 60 map phases and 60 reduce phases, all of which together need to be completed in 1 second. One of the problems I need to solve this way is a minimum search with about 64000 variables. The Hessian matrix for the search is a block matrix: 1000 blocks of size 64x64 along the diagonal, and one row of blocks on the extreme right and bottom. The last section of the block matrix inversion algorithm article shows how this is done: each of the Schur complements S_A and S_D can be computed in one MapReduce step, and the computation of the inverse takes one more step.
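For reference, this is the identity I have in mind, written with $A$ as the block-diagonal part and $D$ as the border block (the naming of $S_A$ and $S_D$ in the article may differ; treat this as a hedged restatement):

$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad S = D - C A^{-1} B, \qquad M^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} C A^{-1} & S^{-1} \end{pmatrix}.$

Since $A$ is block diagonal, $A^{-1}$ splits into 1000 independent 64x64 inversions, which is the part that maps naturally onto a map step; forming $S$ and assembling the pieces are the reduce-like steps.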
From my research so far, mpi4py seems like a good bet. Each process can do a compute step and report back to the client after each step, and the client can report back with new state variables for the cycle to continue. This way the process state is not lost and the computation can be continued with any updates.
http://mpi4py.scipy.org/docs/usrman/index.html
This wiki holds some suggestions, but does anyone have a recommendation on the most developed solution:
http://wiki.python.org/moin/ParallelProcessing
Thanks!
MPI is a communication protocol that allows for the implementation of parallel processing by passing messages between cluster nodes. The parallel processing model that is implemented with MPI depends upon the programmer.
I haven't had any experience with MapReduce but it seems to me that it is a specific parallel processing model and is designed to be simple to implement. This kind of abstraction should save you programming time and may or may not provide a suitable solution to your problem. It all depends on the nature of what you are trying to do.
The trick with parallel processing is that the most suitable solution is often problem specific and without knowing more specifics about your problem it is hard to make recommendations.
If you can tell us more about the environment that you are running your job on and where your program fits into Flynn's taxonomy, I might be able to provide some more helpful suggestions.