I'm trying to build a neural network in C++.
It is being trained on the MNIST handwritten digits data set to classify a digit from a 28*28 black-and-white image. I have solved the same problem in Python with a decent success rate, but I am trying to do it in C++ for fun. The network has 784 inputs (one per pixel), 100 hidden nodes and 10 output nodes; it has no biases and a learning rate of 0.3, the same as in the Python version.
The network takes in a vector of 784 greyscale pixel values, normalised to the range 0.01 to 0.99, from an image of a handwritten digit, and outputs a 10-dimensional vector identifying the digit.
However, in C++ it always outputs roughly 0.5 for all of the output nodes after training. I have tested it with more hidden nodes, different learning rates and more training data, but the more training data I trained it on, the closer the outputs got to 0.5. I have also tested that all the functions and operator overloads work as intended. None of this has helped.
I understand that a CNN would probably work better in this situation, but I do not fully understand how to program one.
Here is the training function:
void train(std::vector<double> inputs, std::vector<double> targets)
{
    // Convert the input and target vectors into column matrices.
    std::vector<std::vector<double>> inputs2D = {inputs};
    std::vector<std::vector<double>> targets2D = {targets};
    inputs2D = utils::transpose(inputs2D);
    targets2D = utils::transpose(targets2D);

    // Forward pass: input -> hidden -> output, with sigmoid activations.
    std::vector<std::vector<double>> hidden_in = utils::dot(ih_weight, inputs2D);
    std::vector<std::vector<double>> hidden_out = utils::activation(hidden_in);
    std::vector<std::vector<double>> final_in = utils::dot(ho_weight, hidden_out);
    std::vector<std::vector<double>> final_out = utils::activation(final_in);

    // Backward pass: output error, then error propagated back through ho_weight.
    std::vector<std::vector<double>> output_error = targets2D + (-1 * final_out);
    std::vector<std::vector<double>> hidden_error = utils::dot(utils::transpose(ho_weight), output_error);

    // Gradient-descent weight updates (sigmoid derivative is out * (1 - out)).
    ho_weight = (learningRate * utils::dot((output_error * final_out * (1.0 + (-1.0 * final_out))), utils::transpose(hidden_out))) + ho_weight;
    ih_weight = (learningRate * utils::dot((hidden_error * hidden_out * (1.0 + (-1.0 * hidden_out))), utils::transpose(inputs2D))) + ih_weight;
}
Here is the full code and training data: here
It also includes the Python file; the C++ was compiled with g++ using C++11.
I'm quite new to C++, so feel free to recommend any changes that would make my code better.
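For what it's worth, one thing worth double-checking in a setup like this is the weight initialization: if the weights all start at the same value, or at values large enough to saturate the sigmoid, the outputs tend to collapse toward a single value. Below is a minimal sketch of small random initialization, assuming the weights are stored as std::vector<std::vector<double>> as in the code above (the init_weights helper is hypothetical, not part of the posted code):

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a rows x cols weight matrix with small random
// values drawn from a normal distribution with standard deviation
// 1/sqrt(cols), a common heuristic that keeps sigmoid units out of
// saturation at the start of training.
std::vector<std::vector<double>> init_weights(std::size_t rows, std::size_t cols)
{
    std::mt19937 gen(std::random_device{}());
    std::normal_distribution<double> dist(0.0, 1.0 / std::sqrt(static_cast<double>(cols)));

    std::vector<std::vector<double>> w(rows, std::vector<double>(cols));
    for (auto& row : w)
        for (auto& v : row)
            v = dist(gen);
    return w;
}

// e.g. ih_weight = init_weights(100, 784); ho_weight = init_weights(10, 100);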
Preferably I would use C++ to execute the code, but I would be open to any suggestion of a better language for the situation. I essentially want to use Strassen's algorithm to multiply matrices, and I want to know how I would multiply two matrices this way and measure the runtime.
# Version 3.6
import numpy as np
def split(matrix):
    """
    Splits a given matrix into quarters.
    Input: nxn matrix
    Output: tuple containing 4 n/2 x n/2 matrices corresponding to a, b, c, d
    """
    row, col = matrix.shape
    row2, col2 = row // 2, col // 2
    return (matrix[:row2, :col2], matrix[:row2, col2:],
            matrix[row2:, :col2], matrix[row2:, col2:])
def strassen(x, y):
    """
    Computes matrix product by divide and conquer approach, recursively.
    Input: nxn matrices x and y
    Output: nxn matrix, product of x and y
    """
    # Base case when size of matrices is 1x1
    if len(x) == 1:
        return x * y

    # Splitting the matrices into quadrants. This will be done recursively
    # until the base case is reached.
    a, b, c, d = split(x)
    e, f, g, h = split(y)

    # Computing the 7 products, recursively (p1, p2...p7)
    p1 = strassen(a, f - h)
    p2 = strassen(a + b, h)
    p3 = strassen(c + d, e)
    p4 = strassen(d, g - e)
    p5 = strassen(a + d, e + h)
    p6 = strassen(b - d, g + h)
    p7 = strassen(a - c, e + f)

    # Computing the values of the 4 quadrants of the final matrix c
    c11 = p5 + p4 - p2 + p6
    c12 = p1 + p2
    c21 = p3 + p4
    c22 = p1 + p5 - p3 - p7

    # Combining the 4 quadrants into a single matrix by stacking horizontally and vertically.
    c = np.vstack((np.hstack((c11, c12)), np.hstack((c21, c22))))
    return c
I found the code above for the algorithm.
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t tStart = clock();
    /* Do your stuff here */
    printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
    return 0;
}
I found this code for measuring the runtime. However, I saw that I could also use /usr/bin/time ./MyProgram if I have Cygwin installed.
In short, how would I use my code to multiply actual matrices using Strassen's algorithm and other matrix multiplication algorithms? Also, how would I run and time the code? Thank you for the help; I am a novice to coding, and I am doing this in order to test the algorithmic efficiency of different matrix multiplication algorithms in different scenarios.
Time measurement
Measuring time depends on the platform, so which OS? On Windows I would use Performance Counters. If you have access to x86 assembly you could also use the RDTSC instruction, but that needs some knowledge to use properly, like setting affinity to a single CPU, obtaining and stabilizing the CPU frequency, etc.
The OS time granularity is an issue too, so if your measured process is too short you might need to filter multiple measurements in order to get correct values.
You can avoid some of these problems by measuring multiple repetitions of the process, so the total time gets above 100 ms, and then dividing the resulting time by the number of repetitions.
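As a rough illustration, here is a minimal sketch of that repetition approach in portable C++ (the multiply_matrices stub is a hypothetical stand-in for whatever routine is being measured):

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the routine under test.
void multiply_matrices() { /* ... the code you want to time ... */ }

int main() {
    const int repetitions = 1000; // chosen so the total time exceeds ~100 ms

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        multiply_matrices();
    auto end = std::chrono::steady_clock::now();

    double total_s = std::chrono::duration<double>(end - start).count();
    std::printf("Average time per call: %.6f s\n", total_s / repetitions);
    return 0;
}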
Also, when you reuse the same code/data across measurements, the cache can mess up your results too.
running the code
The code of yours looks like Python, so you cannot use it in C/C++ directly; instead you need to invoke it somehow with the Python interpreter, for example by creating a Python process with parameters telling it to open and run your source code. However, in that case you need to wait until the code finishes, e.g. by checking whether its handle is still valid... Of course, you need to write and execute your Python code in a way that it exits when finished. I am afraid this will add big overhead, as starting/stopping the Python process alone could be much, much slower than the matrix multiplication you measure...
The other option is to have it in DLL or OBJ form and import that into your C/C++ app instead (however, I am not sure that is possible for Python code). That way you just call a function within your C/C++ app, so there are no problems there...
For some inspiration see:
Builder C++ calling VC++ class
In case the code is not too complex and does not require other libs and such, you can try to port it to C/C++ and use it directly.
I am struggling to apply a computed Gaussian kernel so that I get back a blurred image.
My current code that calculates the kernel is below:
const int m = 5;
const int n = 5;
double sigma = std;
Mat Gauss;
double kernel[m][n];

for (int x = 0; x < m; ++x)
    for (int y = 0; y < n; ++y)
    {
        kernel[x][y] = (1 / (sigma * (sqrt(2 * M_PI))))
            * exp(-0.5 * (std::pow((x - avg) / sigma, 2.0)
                + pow((y - avg) / sigma, 2.0)) / (2 * M_PI * sigma * sigma));
    }
However, I can't figure out how to apply this kernel to the image so that I am returned a blurred image.
I would appreciate it if anyone could give me some pointers on how to apply it to an image.
I was thinking of using a for loop to replace the pixels of the original image, but I could not implement this idea properly.
Thank you for your time.
It sounds like you want to compute a convolution of the original image with a Gaussian kernel, something like this:
blurred[x][y] = Integral (kernel[s][t] * original[x-s][y-t]) ds dt
There are a number of techniques for that:
Direct convolution: go through the grid and compute the above integral at each point. This works well for kernels with very small support, on the order of 5 grid points in each direction, but for kernels with larger support it becomes too slow. For Gaussian kernels a rule of thumb for truncating the support is about 3*sigma, so it's not unreasonable to do direct convolution with sigma under 2 grid points.
Fast Fourier Transform (FFT): this works reasonably fast for any kernel, which is why FFT has become the standard way to compute the convolution of nearly anything with nearly anything. Direct convolution beats FFT only for kernels with very small support.
Analytical: the integrals of some kernels have analytical expressions. In particular, the integral of a Gaussian is the Erf function, and, at least on Unix systems, it's available as a function call. Moreover, on some hardware (such as GPUs) Erf is implemented in hardware. In some rare (but important) cases of coarse bi-level images one can replace the convolution with a Gaussian by a loop of Erf function calls.
For most computational systems your best bet is FFT: it's fast and flexible enough to handle any kernels and images correctly.
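For illustration, here is a minimal sketch of the direct-convolution option on a plain greyscale buffer, assuming a normalized (2r+1) x (2r+1) kernel like the one computed in the question (with OpenCV one would normally reach for cv::filter2D or cv::GaussianBlur instead):

#include <algorithm>
#include <vector>

// Direct convolution of a width x height greyscale image with a normalized
// (2r+1) x (2r+1) kernel. Pixels outside the image are clamped to the edge.
std::vector<double> convolve(const std::vector<double>& img, int width, int height,
                             const std::vector<double>& kernel, int r)
{
    std::vector<double> out(img.size(), 0.0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            double acc = 0.0;
            for (int t = -r; t <= r; ++t)
                for (int s = -r; s <= r; ++s) {
                    int xx = std::min(std::max(x + s, 0), width - 1);
                    int yy = std::min(std::max(y + t, 0), height - 1);
                    acc += kernel[(t + r) * (2 * r + 1) + (s + r)] * img[yy * width + xx];
                }
            out[y * width + x] = acc;
        }
    return out;
}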
In order to develop my implementation of the particle filter algorithm, I need to generate hypotheses about the movements relating to the object to be tracked: if I set N samples and if I use a 2-by-1 state vector, then at each step I have to generate N pairs of random values (a 2-by-N matrix). Moreover, if I know the statistics of movements (mean and standard deviation), then I could use the mean and standard deviation to generate all N values. Finally, to model the uncertainty of the movement, I could generate a noise matrix (a 2-by-N matrix) and add it to the matrix of movements.
Based on these premises, I implemented the algorithm in MATLAB, and I used the following code to generate the hypotheses of movement.
ds_mean = [dx_mean dy_mean];
ds_stddev = [dx_stddev dy_stddev];
d = 5;
V = zeros(2,N);
V(1,:) = normrnd(ds_mean(1),ds_stddev(1),1,N); % hypotheses of movement on x axis
V(2,:) = normrnd(ds_mean(2),ds_stddev(2),1,N); % hypotheses of movement on y axis
E = d*randn(2,N); % weighted noise
M = V + E; % hypotheses of movement
A problem occurred when I implemented the same algorithm in C++ with OpenCV: while the MATLAB code above generates good predictions (it works great), the equivalent C++ code (see below) generates poor predictions (i.e. far away from the object). Why?
RNG m_rng;
x_mean   = // ...
y_mean   = // ...
x_stddev = // ...
y_stddev = // ...

Mat velocity(STATE_DIM, NUM_PARTICLES, DataType<double>::type);
m_rng.fill(velocity.row(0), RNG::NORMAL, x_mean, x_stddev);
m_rng.fill(velocity.row(1), RNG::NORMAL, y_mean, y_stddev);

Mat noise(STATE_DIM, NUM_PARTICLES, DataType<double>::type);
m_rng.fill(noise, RNG::NORMAL, 0, 1);
noise *= d; // weighted noise

movements = velocity + noise;
How to make sure that the C++ algorithm works as well as the algorithm implemented in matlab?
I think I just serendipitously answered your question here, or at least provided an alternative solution.
https://stackoverflow.com/a/13897938/1899861
I believe this will generate proper random numbers, and has been tested to death when compiled using Microsoft C on Intel processors (386, 486, Pentium).
FYI, 4.0 * atan(1.0) yields a much better value of PI than the constant in the above environment.
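For reference, a common way to roll such a Gaussian generator by hand is the Box-Muller transform; here is a minimal sketch (my own, not the linked code) that turns two uniform variates into one normally distributed variate:

#include <cmath>
#include <cstdlib>

// Box-Muller transform: convert two independent uniform (0,1] variates into
// one normally distributed variate with the given mean and standard deviation.
double gaussian(double mean, double stddev)
{
    double u1 = (std::rand() + 1.0) / (RAND_MAX + 1.0); // shift avoids log(0)
    double u2 = (std::rand() + 1.0) / (RAND_MAX + 1.0);
    double pi = 4.0 * std::atan(1.0); // as suggested above
    double z = std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * pi * u2);
    return mean + stddev * z;
}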
I'm using OpenCV for an application in computer vision. I'd like to accelerate some matrix operations (matrices are fairly large) on GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g. Thrust) instead?
EDIT
Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.
Matlab implementation:
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);
[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);
pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);
% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';
end
UPDATE: Here's another Matlab implementation that accomplishes the same thing (thanks to https://stackoverflow.com/a/7774323/1121420). But it runs only on the CPU because bsxfun is not supported by PCT. I am still looking for a C++ alternative, though.
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
% Runs on CPU only.
K = bsxfun(@plus, sum(P_cpu.^2,2), sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');
end
I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (formerly in a different interface called LibJacket) to OpenCV, and it has been true in my benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I'm only using one GPU, I can use ArrayFire for free.
Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I get that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. HW specs are:
Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
NVIDIA Tesla M2090
Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). CPU is faster than PCT gpuArray:
>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.
Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.
Here is the modified code that lets you run all of that easily:
function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';
end
Jacket does support BSXFUN on the GPU, and it does improve the speeds somewhat:
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.
Note that the sizes used here are pretty small, so most CUDA code that attempts to run on these small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, OpenCV, each of which I've tried in the past.
Here is the ArrayFire Free C++ results:
Time: 0.0003577 seconds
Speedups: 19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster than Jacket in the original MATLAB version, and 4X faster than Jacket in MATLAB using BSXFUN.
Here is the ArrayFire code I wrote for this:
#include <arrayfire.h>
#include <cstdio>
using namespace af;

static array SqEuclideanDist(array P, array Q)
{
    // 0 based indexing
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);

    int np = P.dims(0);
    int nq = Q.dims(0);

    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
    double *P_cpu = new double[1581 * 3];
    double *Q_cpu = new double[189 * 3];

    array P = array(1581, 3, P_cpu);
    array Q = array(189,  3, Q_cpu);
    af::sync();

    int iter = 1000;
    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);
    }
    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

    delete[] P_cpu;
    delete[] Q_cpu;
    return 0;
}
They were contributed by NVIDIA, so they do have good performance on CUDA-compatible cards.
The real performance depends on the card itself and the function you are using.
In my experience, only cvRotate and cvResize performed better than on a normal Intel CPU.
(Note: I was only interested in image-related functions.)
I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2, at around 25:00). Below is the implementation I used at first, and I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.
I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment I'm only trying to get it to converge to y=ax+b where a=1=theta[1] and b=0=theta[0].
The training set is the array x and y, where (x[i],y[i]) is a point.
void train()
{
    double delta;
    for (int i = 0; i < x.size(); i++)
    {
        delta = y[i] - hypothesis(x[i]);
        theta[1] += alpha * delta * x[i];
        theta[0] += alpha * delta * 1;
    }
}

void C_Approx::display()
{
    std::cout << theta[1] << "x + " << theta[0] << " \t " << "f(x)=" << hypothesis(1) << std::endl;
}
some of the results I'm getting:
I input a number, it runs train() a few times, then display():
1
0.33616x + 0.33616 f(x)=0.67232
1
0.482408x + 0.482408 f(x)=0.964816
1
0.499381x + 0.499381 f(x)=0.998762
1
0.499993x + 0.499993 f(x)=0.999986
1
0.5x + 0.5 f(x)=1
An example of it diverging after it passed 8:
1
0.33616x + 0.33616 f(x)=0.67232
2
0.705508x + 0.509914 f(x)=1.21542
3
0.850024x + 0.449928 f(x)=1.29995
4
0.936062x + 0.330346 f(x)=1.26641
5
0.951346x + 0.231295 f(x)=1.18264
6
0.992876x + 0.137739 f(x)=1.13062
7
0.932206x + 0.127372 f(x)=1.05958
8
1.00077x + 0.000493063 f(x)=1.00126
9
-0.689325x + -0.0714712 f(x)=-0.760797
10
4.10321e+08x + 4.365e+07 f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21 f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40 f(x)=-4.27216e+41
I tried the solution proposed here of scaling the step and ended up with similar results.
What am I doing wrong?
Your implementation is good. Generally, stochastic gradient descent might diverge when α is too large. What you would do with a large dataset is take a reasonably sized random sample, find α that gives you the best results, and then use it for the rest.
I have experienced the same problem (albeit in Java) because my learning rate was too big.
In short, I was using α = 0.001 and I had to push it to 0.000001 to see actual convergence.
Of course these values are linked to your dataset.
When your cost function increases or cycles up and down, you usually have too large a value for alpha. What alpha are you using?
Start out with alpha = 0.001 and see if it converges. If not, try various alphas (0.003, 0.01, 0.03, 0.1, 0.3, 1) and find one that converges quickly.
Scaling the data (normalization) won't help you with only one feature (your theta[1]), since normalization only applies to two or more features (multivariate linear regression).
Also bear in mind that for a small number of features you can use the Normal Equation to get the correct answer.
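For the single-feature case in the question, that closed-form solution reduces to ordinary least squares for y = a*x + b; here is a minimal sketch, assuming the same x, y and theta members as in the question's code:

#include <cstddef>
#include <vector>

// Closed-form least squares for y = theta[1]*x + theta[0],
// i.e. the Normal Equation specialized to one feature plus an intercept.
void solve_normal_equation(const std::vector<double>& x,
                           const std::vector<double>& y,
                           double theta[2])
{
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    theta[1] = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope a
    theta[0] = (sy - theta[1] * sx) / n;                  // intercept b
}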
Use backtracking line search to guarantee convergence. It is very simple to implement. See Stephen Boyd, Convex Optimization, for reference. You can choose the standard alpha, beta values for backtracking line search, for example 0.3 and 0.8.
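For illustration, here is a minimal sketch of one gradient step with backtracking line search on a one-dimensional objective (f and grad are whatever objective and derivative you plug in; alpha and beta as suggested above):

#include <functional>

// One gradient step with backtracking line search (Boyd & Vandenberghe, ch. 9).
// The step t is shrunk until the Armijo sufficient-decrease condition holds,
// so the update cannot overshoot far enough to diverge.
double backtracking_step(const std::function<double(double)>& f,
                         const std::function<double(double)>& grad,
                         double x, double alpha = 0.3, double beta = 0.8)
{
    double g = grad(x);
    double t = 1.0;
    while (f(x - t * g) > f(x) - alpha * t * g * g)
        t *= beta;
    return x - t * g; // updated iterate
}

// e.g. with the quadratic demo further below (k = 20):
// x = backtracking_step([](double v) { return 20.0 * v * v; },
//                       [](double v) { return 40.0 * v; }, x);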
If I understand you correctly, your training set only has a non-zero gradient at the edge of a line? Unless you start at the line (actually start exactly at one of your training points) you won't find the line. You are always at a local minimum.
It's not clear from your description what problem you're solving.
Also, be careful posting links to external resources - you can be blocked on Stack Overflow.
In any case, the gradient descent method (and the subgradient descent method too) with a fixed step size (the ML community calls it a learning rate) does not necessarily converge.
p.s.
The machine learning community is not interested in "convergence conditions" and "convergence to what" - they are interested in creating "something" that passes cross-validation with a good result.
If you're curious about optimization, start looking into convex optimization. Unfortunately it's hard to find a job in it, but it gives a clear view of what happens inside various mathematical optimization methods.
Here is source code which demonstrates this for a simple quadratic objective:
#!/usr/bin/env python
# Gradient descent with a fixed step (no line search) does not necessarily
# converge, even for a convex objective.

alpha = 0.1

#k = 10.0   # jumps around the minimum
k = 20.0    # diverges
#k = 0.001  # converges, but the gap to the optimum is big

def f(x): return k * x * x   # objective
def g(x): return 2 * k * x   # gradient

x0 = 12
xNext = x0
i = 0
threshold = 0.01

while True:
    i += 1
    xNext = xNext + alpha * (-1) * g(xNext)
    obj = f(xNext)
    print("Iteration: %i, Iterate: %f, Objective: %f, Optimality Gap: %f" % (i, xNext, obj, obj - f(0.0)))
    if abs(g(xNext)) < threshold:
        break
    if i > 50:
        break

print("\nYou launched application with x0=%f, threshold=%f" % (x0, threshold))