Performance of SpatialConvolution implementation in tensorflow - c++

I implemented a SpatialConvolution function by referring to the TensorFlow implementation (which uses Eigen).
The TensorFlow implementation is located in SpatialConvolution, and I also found one related answer about it: https://stackoverflow.com/a/58955289/7587433
My implementation is as follows (since my data is row-major, I only kept half of the code):
// Description: Convolution
// Input:
// - name: input0 type: float shape: Shape{7680, 15, 200, 1}
// - name: input1 type: float shape: Shape{5, 200, 1, 200}
// Output:
// - name: output0 type: float shape: Shape{7680, 11, 1, 200}
void Convolution_float_float_float_cpu_Convolution_270(float* input0, float* input1, float* output0)
{
    Eigen::array<Eigen::IndexPair<Eigen::Index>, 1> contract_dims;
    contract_dims[0] = Eigen::IndexPair<Eigen::Index>(1, 0);

    Eigen::array<Eigen::Index, 4> in_dims({7680, 15, 200, 1});
    Eigen::array<Eigen::Index, 4> out_dims({7680, 11, 1, 200});
    Eigen::array<Eigen::Index, 4> kernel_dims({5, 200, 1, 200});

    Eigen::DSizes<Eigen::Index, 2> pre_contract_dims;
    pre_contract_dims[1] = kernel_dims[2] * kernel_dims[1] * kernel_dims[0];
    pre_contract_dims[0] = out_dims[1] * out_dims[2];
    for (int i = 0; i < 1; ++i) {
        pre_contract_dims[0] *= in_dims[i];
    }

    Eigen::DSizes<Eigen::Index, 4> post_contract_dims;
    post_contract_dims[3] = kernel_dims[3];
    post_contract_dims[2] = out_dims[2];
    post_contract_dims[1] = out_dims[1];
    for (int i = 0; i < 1; ++i) {
        post_contract_dims[i] = in_dims[i];
    }

    Eigen::DSizes<Eigen::Index, 2> new_kernel_dims;
    new_kernel_dims[0] = kernel_dims[2] * kernel_dims[1] * kernel_dims[0];
    new_kernel_dims[1] = kernel_dims[3];

    Eigen::TensorMap<Eigen::Tensor<float, 4, Eigen::RowMajor>>
        in(static_cast<float *>(input0), in_dims),
        out(static_cast<float *>(output0), out_dims),
        kernel(static_cast<float *>(input1), kernel_dims);

    out.device(*global_thread_pool_device) = in
        .extract_image_patches(kernel_dims[1], kernel_dims[0], 1,
                               1, 1, 1,
                               Eigen::PADDING_VALID)
        .reshape(pre_contract_dims)
        .contract(kernel.reshape(new_kernel_dims), contract_dims)
        .reshape(post_contract_dims);
}
Processing the same data, with the number of threads in the thread pool set to 1 (intra_op_parallelism_threads in TensorFlow), my implementation is about 30% slower than TensorFlow. My compiler options are "-std=gnu++11 -O3 -march=native", and TensorFlow's XLA is not enabled. I have no idea what causes the performance gap; any hints would be a great help.

After digging into the code, we found that TensorFlow provides custom Eigen contraction kernels backed by MKL. By using the implementations in eigen_contraction_kernel.h/.cpp and eigen_spatial_convolutions.h, we get the same performance as TensorFlow.
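For context: the extract_image_patches / reshape / contract pipeline above is the im2col-plus-GEMM formulation of convolution, so nearly all of the runtime goes into the contraction (a plain matrix multiply), which is exactly the part the MKL-backed kernels accelerate. A scaled-down sketch of the equivalent matrix product (illustrative shapes, not the real ones):

#include <Eigen/Dense>
#include <iostream>

int main() {
    // Scaled-down illustrative shapes: "patches" is
    // [batch * out_rows * out_cols, kernel_h * kernel_w * in_channels],
    // "kernel" is [kernel_h * kernel_w * in_channels, out_channels].
    const int patch_rows   = 64 * 11 * 1;
    const int patch_cols   = 5 * 200 * 1;
    const int out_channels = 200;

    Eigen::MatrixXf patches = Eigen::MatrixXf::Random(patch_rows, patch_cols);
    Eigen::MatrixXf kernel  = Eigen::MatrixXf::Random(patch_cols, out_channels);

    // The contraction over IndexPair(1, 0) in the code above is exactly this
    // product; its speed is determined by the underlying GEMM kernel.
    Eigen::MatrixXf output = patches * kernel;
    std::cout << output.rows() << " x " << output.cols() << std::endl;
    return 0;
}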

How to pass [[Int]] into MTLBuffer in Swift and receive it in Metal?

What I'm trying to do
I have an [[Int]] in Swift and I'm trying to pass it into Metal through a buffer. Basically, I'm trying to use Metal to add two matrices in parallel and hand the result back to Swift. So far, this has been more difficult than expected.
The problem
Metal tells me that my graph isn't a pointer, and I suspect I'm passing it into Metal incorrectly.
import MetalKit

let graph: [[Int]] = [
    [0, 1, 2, 999, 999, 999],
    [1, 0, 999, 5, 1, 999],
    [2, 999, 0, 2, 3, 999],
    [999, 5, 2, 0, 2, 2],
    [999, 1, 3, 2, 0, 1],
    [999, 999, 999, 2, 1, 0]]

func fooFunc(gra: [[Int]]) {
    let count = gra.count
    let device = MTLCreateSystemDefaultDevice()
    let commandQueue = device?.makeCommandQueue()
    let gpuFunctionLibrary = device?.makeDefaultLibrary()
    let funcGPUFunction = gpuFunctionLibrary?.makeFunction(name: "MetalFunc")

    var funcPipelineState: MTLComputePipelineState!
    do {
        funcPipelineState = try device?.makeComputePipelineState(function: funcGPUFunction!)
    } catch {
        print(error)
    }

    let graphBuff = device?.makeBuffer(bytes: gra,
                                       length: MemoryLayout<Int>.size * count * count,
                                       options: .storageModeShared)
    let resultBuff = device?.makeBuffer(length: MemoryLayout<Int>.size * count,
                                        options: .storageModeShared)

    let commandBuffer = commandQueue?.makeCommandBuffer()
    let commandEncoder = commandBuffer?.makeComputeCommandEncoder()
    commandEncoder?.setComputePipelineState(funcPipelineState)
    commandEncoder?.setBuffer(graphBuff, offset: 0, index: 0)
    commandEncoder?.setBuffer(resultBuff, offset: 0, index: 1)

    let threadsPerGrid = MTLSize(width: count, height: 1, depth: 1)
    let maxThreadsPerThreadgroup = funcPipelineState.maxTotalThreadsPerThreadgroup // 1024
    let threadsPerThreadgroup = MTLSize(width: maxThreadsPerThreadgroup, height: 1, depth: 1)
    commandEncoder?.dispatchThreads(threadsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)

    commandEncoder?.endEncoding()
    commandBuffer?.commit()
    commandBuffer?.waitUntilCompleted()

    let resultBufferPointer = resultBuff?.contents().bindMemory(to: Int.self,
                                                                capacity: MemoryLayout<Int>.size * count)
    print("Result: \(Int(resultBufferPointer!.pointee) as Any)")
}

fooFunc(gra: graph)
#include <metal_stdlib>
using namespace metal;

kernel void MetalFunc(constant int *graph [[ buffer(0) ]],
                      constant int *result [[ buffer(1) ]],
                      uint index [[ thread_position_in_grid ]])
{
    const int size = sizeof(*graph);
    int result[size][size];
    for (int k = 0; k < size; k++) {
        result[index][k] = graph[index][k] + graph[index][k]; //ERROR: Subscripted value is not an array, pointer, or vector
    }
}
I don't know about the Swift implementation, but your kernel function is wrong.
If you want to write to and read from the same buffer, you should use the device address space attribute.
The device address space name refers to buffer memory objects
allocated from the device memory pool that are both readable and
writeable.
device int *result /* write and read */
constant int *graph /* read only */
This is the correct implementation:
#include <metal_stdlib>
using namespace metal;

kernel void MetalFunc(constant int *graph [[ buffer(0) ]],
                      device int *result [[ buffer(1) ]],
                      uint index [[ thread_position_in_grid ]])
{
    /* Point to an array (graph buffer) */
    constant int(*r)[6][6] = (constant int(*)[6][6])graph;

    /* Point to an array (result buffer) */
    device int(*w)[6][6] = (device int(*)[6][6])result;

    /* Read the first element */
    int i = (*r)[0][0];

    (*w)[0][0] = 777; /* Write to the first element */
    (*w)[5][5] = 999; /* Write to the last element */
}

AVX calculation precision

I wrote a program to display the Mandelbrot set. To speed it up, I used AVX (really AVX2) instructions through the <immintrin.h> header.
The problem is: the result of the AVX computation (with double precision) has artifacts, and it differs from the result computed using "normal" doubles.
In detail, there is a function getIterationCount which calculates the number of iterations until the Mandelbrot sequence exceeds 4, or assumes the point is included in the set if the sequence does not exceed 4 during the first N steps.
The code looks like this:
#include "stdafx.h"
#include <iostream>
#include <complex>
#include <immintrin.h>
class MandelbrotSet {
public:
    int getIterationCount(const std::complex<double>, const int) const noexcept;
    __m256i getIterationCount(__m256d cReal, __m256d cIm, unsigned maxIterations) const noexcept;
};

inline int MandelbrotSet::getIterationCount(const std::complex<double> c, const int maxIterations) const noexcept
{
    double currentReal = 0;
    double currentIm = 0;
    double realSquare;
    double imSquare;
    for (int i = 0; i < maxIterations; ++i) {
        realSquare = currentReal * currentReal;
        imSquare = currentIm * currentIm;
        currentIm = 2 * currentReal * currentIm + c.imag();
        currentReal = realSquare - imSquare + c.real();
        if (realSquare + imSquare >= 4) {
            return i;
        }
    }
    return -1;
}

const __m256i negone = _mm256_set_epi64x(-1, -1, -1, -1);
const __m256i one = _mm256_set_epi64x(1, 1, 1, 1);
const __m256d two = _mm256_set_pd(2, 2, 2, 2);
const __m256d four = _mm256_set_pd(4, 4, 4, 4);

//calculates for i = 0,1,2,3
//output[i] = if ctrl[i] == 0b11...1 then onTrue[i] else onFalse[i]
inline __m256i _mm256_select_si256(__m256i onTrue, __m256i onFalse, __m256i ctrl) {
    return _mm256_or_si256(_mm256_and_si256(onTrue, ctrl), _mm256_and_si256(onFalse, _mm256_xor_si256(negone, ctrl)));
}

inline __m256i MandelbrotSet::getIterationCount(__m256d cReal, __m256d cIm, unsigned maxIterations) const noexcept {
    __m256i result = _mm256_set_epi64x(0, 0, 0, 0);
    __m256d currentReal = _mm256_set_pd(0, 0, 0, 0);
    __m256d currentIm = _mm256_set_pd(0, 0, 0, 0);
    __m256d realSquare;
    __m256d imSquare;
    for (unsigned i = 0; i <= maxIterations; ++i)
    {
        realSquare = _mm256_mul_pd(currentReal, currentReal);
        imSquare = _mm256_mul_pd(currentIm, currentIm);
        currentIm = _mm256_mul_pd(currentIm, two);
        currentIm = _mm256_fmadd_pd(currentIm, currentReal, cIm);
        currentReal = _mm256_sub_pd(realSquare, imSquare);
        currentReal = _mm256_add_pd(currentReal, cReal);
        __m256i isSmaller = _mm256_castpd_si256(_mm256_cmp_pd(_mm256_add_pd(realSquare, imSquare), four, _CMP_LE_OS));
        result = _mm256_select_si256(_mm256_add_epi64(one, result), result, isSmaller);
        //if (i % 10 == 0 && !isSmaller.m256i_i64[0] && !isSmaller.m256i_i64[1] && !isSmaller.m256i_i64[2] && !isSmaller.m256i_i64[3]) return result;
    }
    return result;
}

using namespace std;

int main() {
    MandelbrotSet m;
    std::complex<double> point(-0.14203954214360026, 1);
    __m256i result_avx = m.getIterationCount(_mm256_set_pd(-0.14203954214360026, -0.13995837669094691, -0.13787721123829355, -0.13579604578563975),
                                             _mm256_set_pd(1, 1, 1, 1), 2681);
    int result_normal = m.getIterationCount(point, 2681);
    cout << "Normal: " << result_normal << ", AVX: " << result_avx.m256i_i64[0] << ", at point " << point << endl;
    return 0;
}
When I run this code, I get the following result:
(The point -0.14203954214360026 + i was chosen intentionally, because both methods return the same or almost the same value at most points.)
Normal: 13, AVX: 20, at point (-0.14204,1)
A difference of 1 might be acceptable, but a difference of 7 seems quite big, since both methods use double precision.
Do AVX instructions have lower precision than "normal" instructions? If not, why do the two results differ so much?
I use MS Visual Studio 2017, MS Visual C++ 2017 15.6 v14.13 141, and my computer has an i7-7700K processor. The project is compiled for x64. The result is the same whether it is compiled with no optimization or full optimization.
The rendered results (screenshots of the AVX version and the normal version) are omitted here.
The values of realSquare and imSquare during the loop are as follows (iteration, realSquare, imSquare):
0, 0, 0
1, 0.0201752, 1
2, 1.25858, 0.512543
3, 0.364813, 0.367639
4, 0.0209861, 0.0715851
5, 0.0371096, 0.850972
6, 0.913748, 0.415495
7, 0.126888, 0.0539759
8, 0.00477863, 0.696364
9, 0.69493, 0.782567
10, 0.0527514, 0.225526
11, 0.0991077, 1.48388
12, 2.33115, 0.0542994
13, 4.5574, 0.0831971
In the AVX loop the values are:
0, 0, 0
1, 0.0184406, 1
2, 1.24848, 0.530578
3, 0.338851, 0.394109
4, 0.0365017, 0.0724287
5, 0.0294888, 0.804905
6, 0.830307, 0.478687
7, 0.04658, 0.0680608
8, 0.024736, 0.78746
9, 0.807339, 0.519651
10, 0.0230712, 0.0872787
11, 0.0400014, 0.828561
12, 0.854433, 0.404359
13, 0.0987707, 0.0308286
14, 0.00460416, 0.791455
15, 0.851277, 0.773114
16, 0.00332154, 0.387519
17, 0.270393, 1.14866
18, 1.02832, 0.0131355
19, 0.773319, 1.51892
20, 0.776852, 10.0336
Reversing the order of the arguments passed to _mm256_set_pd solves the problem.
If you inspect the value of cReal in the debugger you'll see that the first element is set to -0.13579604578563975 not -0.14203954214360026.
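As a side note, _mm256_set_pd takes its arguments from the highest lane down to lane 0, while _mm256_setr_pd takes them in natural (lane 0 first) order, which avoids this kind of mix-up. A minimal sketch:

#include <immintrin.h>
#include <cstdio>

int main() {
    // Both vectors put 0.0 into lane 0: set_pd lists lanes high-to-low,
    // setr_pd lists them low-to-high.
    __m256d a = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);
    __m256d b = _mm256_setr_pd(0.0, 1.0, 2.0, 3.0);

    double av[4], bv[4];
    _mm256_storeu_pd(av, a);
    _mm256_storeu_pd(bv, b);
    std::printf("lane 0: set_pd=%f setr_pd=%f\n", av[0], bv[0]);
    return 0;
}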

GEOS OverlayOp intersection operation

I am using GEOS 3.6.2 to compute an intersection between two polygons. I was able to construct my polygons, but when I try to compute the intersection it won't work.
Compiling my program in Debug mode, I get the error message:
The inferior stopped because it received a signal from the operating
system.
Signal name : SIGSEG
Signal meaning : Segmentation fault
Any idea where I'm wrong?
Here is my code:
#include <geos/geom/Polygon.h>
#include <geos/geom/LinearRing.h>
#include <geos/geom/CoordinateSequenceFactory.h>
#include <geos/geom/GeometryFactory.h>
#include <geos/geom/Geometry.h>
#include <geos/operation/overlay/OverlayOp.h>
#include <iostream>
#include <array>
////////////////////////////////////////////////////////////////////////////////
geos::geom::Polygon* MakePoly(std::vector<std::vector<int>> const& polyCoords)
{
    geos::geom::GeometryFactory* factory = geos::geom::GeometryFactory::create().get();
    geos::geom::CoordinateSequence* temp = factory->getCoordinateSequenceFactory()->create((std::size_t) 0, 0);
    std::vector<std::vector<int>>::const_iterator it_x = polyCoords.begin();
    int size = it_x->size();
    for (int i = 0; i < size; i++)
    {
        temp->add(geos::geom::Coordinate(polyCoords[0][i], polyCoords[1][i]));
    }
    geos::geom::LinearRing* shell = factory->createLinearRing(temp);
    //NULL in this case could instead be a collection of one or more holes
    //in the interior of the polygon
    return factory->createPolygon(shell, NULL);
}

////////////////////////////////////////////////////////////////////////////////
int main()
{
    // Create geometry.
    std::vector<std::vector<int>> polyCoords1 = {
        {1, 1, 2, 2, 1, 1, 4, 5, 4, 1},
        {1, 2, 2, 4, 4, 5, 5, 3, 1, 1}
    };
    geos::geom::Polygon* poly1 = MakePoly(polyCoords1);
    std::vector<std::vector<int>> polyCoords2 = {
        {4, 4, 6, 6, 4},
        {1, 5, 5, 1, 1}
    };
    geos::geom::Polygon* poly2 = MakePoly(polyCoords2);

    // Actually perform the operation.
    geos::operation::overlay::OverlayOp intersection(poly1, poly2);

    // Extracting the geometry of the intersection (position of the error).
    geos::geom::Geometry* intersectionGeo = intersection.getResultGeometry(geos::operation::overlay::OverlayOp::OpCode::opINTERSECTION);
    std::cout << intersectionGeo->getArea() << std::endl;
}
The problem in your code is getting the GeometryFactory pointer.
geos::geom::GeometryFactory::create() returns a smart pointer (std::unique_ptr), so after this line:
geos::geom::GeometryFactory* factory = geos::geom::GeometryFactory::create().get();
the temporary unique_ptr returned by create() is destroyed, and factory is left dangling.
Change that line to:
geos::geom::GeometryFactory::Ptr factory = geos::geom::GeometryFactory::create();
and the code works.
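The same pitfall can be reproduced without GEOS. Here is a minimal sketch (with a hypothetical makeString() helper) of why calling .get() on a temporary unique_ptr leaves you with a dangling pointer:

#include <iostream>
#include <memory>
#include <string>

std::unique_ptr<std::string> makeString() {
    return std::make_unique<std::string>("hello");
}

int main() {
    // The temporary unique_ptr is destroyed at the end of this statement,
    // so 'dangling' points at an already-destroyed object.
    std::string* dangling = makeString().get();

    // Keeping the unique_ptr alive (as the fix above does) is the cure.
    std::unique_ptr<std::string> owned = makeString();
    std::cout << *owned << std::endl;       // OK
    // std::cout << *dangling << std::endl; // undefined behavior
    return 0;
}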

Store pointer to Eigen Vector 'segment' without copy?

I have an Eigen vector, and I would like to refer to a segment of it at a later time (e.g. pass it between functions) instead of modifying it immediately.
Eigen::Matrix<float, Eigen::Dynamic, 1> vec(10);
// initialize
vec << 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
I would like to create a pointer to a segment that I can refer to later. The following works, but it creates a copy, so any changes made to the segment are not reflected in the original vector.
const int start = 2;
const int end = 8;
Eigen::Matrix<float, Eigen::Dynamic, 1> *block = new Eigen::Matrix<float, Eigen::Dynamic, 1>(end - start + 1, 1);
*block = vec.segment(start-1,end-1);
How can I keep a reference to the segment without copying?
You can use an Eigen::Map to wrap an existing segment of memory without copying. (I'm not sure why you're allocating *block with new instead of just using a plain block object.) Using a Map it would look like:
Eigen::Map<Eigen::VectorXf> block(&vec(start - 1), end - start + 1);
You then use the Map as you would a normal VectorXf, minus resizing and the like. Simpler yet (at least according to @ggael), you can use an Eigen::Ref to refer to part of an Eigen object without inducing a copy. For example:
#include <Eigen/Dense>
#include <iostream>

void times2(Eigen::Ref<Eigen::VectorXf> rf)
{
    rf *= 2.f;
}

int main()
{
    Eigen::Matrix<float, Eigen::Dynamic, 1> vec(10);

    // initialize
    vec << 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;

    const int start = 2;
    const int end = 8;

    // This would work as well
    //Eigen::Map<Eigen::VectorXf> block(&vec(start - 1), end - start + 1);
    Eigen::Ref<Eigen::VectorXf> block = vec.segment(start, end - start + 1);

    std::cout << block << "\n\n";
    times2(block);
    std::cout << vec << "\n";
    return 0;
}
P.S. I think you're misusing the segment function. It takes a beginning position and the number of elements, i.e. (start, end - start + 1).
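To make the no-copy behavior concrete, here is a small extra sketch showing that writes through a Map are visible in the original vector (the same holds for a Ref):

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::VectorXf vec(10);
    vec << 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;

    // A Map is just a view over vec's memory: 7 elements starting at index 1.
    Eigen::Map<Eigen::VectorXf> view(vec.data() + 1, 7);
    view.setZero();

    // Prints: 1 0 0 0 0 0 0 0 9 10 -- no copy was ever made.
    std::cout << vec.transpose() << std::endl;
    return 0;
}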

Neural Network not learning - MNIST data - Handwriting recognition

I have written a neural network program. It works for logic gates, but when I try to use it to recognize handwritten digits, it simply does not learn.
Please find the code below:
// This is a single neuron; this might be necessary in order to understand remaining code
typedef struct SingleNeuron
{
    double outputValue;
    std::vector<double> weight;
    std::vector<double> deltaWeight;
    double gradient;
    double sum;
} SingleNeuron;
Then I initialize the net. I set the weights to random values between -0.5 and +0.5, sum to 0, and deltaWeight to 0.
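A sketch of what that initialization could look like (the actual initialization code is not shown, so this is an assumption built on the SingleNeuron struct above):

#include <cstdlib>
#include <vector>

// Assumed initializer: one weight per neuron of the next layer, drawn
// uniformly from [-0.5, +0.5]; sum, gradient and deltaWeight start at zero.
// Uses the SingleNeuron struct defined above.
void InitializeLayer(std::vector<SingleNeuron>& layer, unsigned nextLayerSize)
{
    for (SingleNeuron& neuron : layer)
    {
        neuron.outputValue = 0.0;
        neuron.sum = 0.0;
        neuron.gradient = 0.0;
        neuron.weight.resize(nextLayerSize);
        neuron.deltaWeight.assign(nextLayerSize, 0.0);
        for (double& w : neuron.weight)
            w = (std::rand() / (double)RAND_MAX) - 0.5;   // uniform in [-0.5, +0.5]
    }
}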
Then comes the FeedForward:
for (unsigned i = 0; i < inputValues.size(); ++i)
{
    neuralNet[0][i].outputValue = inputValues[i];
    neuralNet[0][i].sum = 0.0;
    //  std::cout << "o/p Val = " << neuralNet[0][i].outputValue << std::endl;
}

for (unsigned i = 1; i < neuralNet.size(); ++i)
{
    std::vector<SingleNeuron> prevLayerNeurons = neuralNet[i - 1];
    unsigned j = 0;
    double thisNeuronOPVal = 0;
    //  std::cout << std::endl;
    for (j = 0; j < neuralNet[i].size() - 1; ++j)
    {
        double sum = 0;
        for (unsigned k = 0; k < prevLayerNeurons.size(); ++k)
        {
            sum += prevLayerNeurons[k].outputValue * prevLayerNeurons[k].weight[j];
        }
        neuralNet[i][j].sum = sum;
        neuralNet[i][j].outputValue = TransferFunction(sum);
        //  std::cout << neuralNet[i][j].outputValue << "\t";
    }
    //  std::cout << std::endl;
}
My transfer function and its derivative are given at the end.
After this I try to back-propagate using:
// calculate output layer gradients
for (unsigned i = 0; i < outputLayer.size() - 1; ++i)
{
    double delta = actualOutput[i] - outputLayer[i].outputValue;
    outputLayer[i].gradient = delta * TransferFunctionDerivative(outputLayer[i].sum);
}
//  std::cout << "Found Output gradients "<< std::endl;

// calculate hidden layer gradients
for (unsigned i = neuralNet.size() - 2; i > 0; --i)
{
    std::vector<SingleNeuron>& hiddenLayer = neuralNet[i];
    std::vector<SingleNeuron>& nextLayer = neuralNet[i + 1];
    for (unsigned j = 0; j < hiddenLayer.size(); ++j)
    {
        double dow = 0.0;
        for (unsigned k = 0; k < nextLayer.size() - 1; ++k)
        {
            dow += nextLayer[k].gradient * hiddenLayer[j].weight[k];
        }
        hiddenLayer[j].gradient = dow * TransferFunctionDerivative(hiddenLayer[j].sum);
    }
}
//  std::cout << "Found hidden layer gradients "<< std::endl;

// from output to 1st hidden layer, update all weights
for (unsigned i = neuralNet.size() - 1; i > 0; --i)
{
    std::vector<SingleNeuron>& currentLayer = neuralNet[i];
    std::vector<SingleNeuron>& prevLayer = neuralNet[i - 1];
    for (unsigned j = 0; j < currentLayer.size() - 1; ++j)
    {
        for (unsigned k = 0; k < prevLayer.size(); ++k)
        {
            SingleNeuron& thisNeuron = prevLayer[k];
            double oldDeltaWeight = thisNeuron.deltaWeight[j];
            double newDeltaWeight = ETA * thisNeuron.outputValue * currentLayer[j].gradient + (ALPHA * oldDeltaWeight);
            thisNeuron.deltaWeight[j] = newDeltaWeight;
            thisNeuron.weight[j] += newDeltaWeight;
        }
    }
}
These are the TransferFunction and its derivative:
double TransferFunction(double x)
{
    double val;
    //val = tanh(x);
    val = 1 / (1 + exp(x * -1));
    return val;
}

double TransferFunctionDerivative(double x)
{
    //return 1 - x * x;
    double val = exp(x * -1) / pow((exp(x * -1) + 1), 2);
    return val;
}
One thing I observed: if I use the standard sigmoid function as my transfer function and pass the output of the neuron to it, the result is INFINITY, but tanh(x) works fine with that value.
So if I am using 1/(1 + e^(-x)) as the transfer function I have to pass the sum of the net inputs, and with tanh as my transfer function I have to pass the output of the current neuron.
I do not completely understand why this is the way it is; maybe that calls for a different question.
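The likely reason is that the two derivative formulas expect different arguments: 1 - x*x is the tanh derivative expressed in terms of the neuron's output, while e^(-x) / (1 + e^(-x))^2 is the sigmoid derivative expressed in terms of the pre-activation sum. A small sketch of the distinction (my own illustration, not the code above):

#include <cmath>
#include <cstdio>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Sigmoid derivative, evaluated at the pre-activation sum x.
double sigmoidPrimeFromSum(double x) { double s = sigmoid(x); return s * (1.0 - s); }

// tanh derivative, expressed in terms of the *output* y = tanh(x).
double tanhPrimeFromOutput(double y) { return 1.0 - y * y; }

int main() {
    double sum = 0.8;             // the "sum" stored in SingleNeuron
    double out = std::tanh(sum);  // the neuron's output for tanh
    std::printf("sigmoid'(sum) = %f, tanh'(output) = %f\n",
                sigmoidPrimeFromSum(sum), tanhPrimeFromOutput(out));
    return 0;
}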
But this question is really about something else: NETWORK IS WORKING FOR LOGIC GATES BUT NOT FOR CHARACTER RECOGNITION
I have tried many variations/combinations of the learning rate, the acceleration term, the number of hidden layers, and their sizes. Please find the results below:
AvgErr: 0.299399 #Pass799
AvgErr : 0.305071 #Pass809
AvgErr : 0.303046 #Pass819
AvgErr : 0.299569 #Pass829
AvgErr : 0.30413 #Pass839
AvgErr : 0.304165 #Pass849
AvgErr : 0.300529 #Pass859
AvgErr : 0.302973 #Pass869
AvgErr : 0.299238 #Pass879
AvgErr : 0.304708 #Pass889
AvgErr : 0.30068 #Pass899
AvgErr : 0.302582 #Pass909
AvgErr : 0.301767 #Pass919
AvgErr : 0.303167 #Pass929
AvgErr : 0.299551 #Pass939
AvgErr : 0.301295 #Pass949
AvgErr : 0.300651 #Pass959
AvgErr : 0.297867 #Pass969
AvgErr : 0.304221 #Pass979
AvgErr : 0.303702 #Pass989
After looking at the results you might feel this guy is simply stuck in a local minimum, but please wait and read through:
Input = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Output = 0.0910903, 0.105674, 0.064575, 0.0864824, 0.128682, 0.0878434, 0.0946296, 0.154405, 0.0678767, 0.0666924
Input = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Output = 0.0916106, 0.105958, 0.0655508, 0.086579, 0.126461, 0.0884082, 0.110953, 0.163343, 0.0689315, 0.0675822
Input = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Output = 0.105344, 0.105021, 0.0659517, 0.0858077, 0.123104, 0.0884107, 0.116917, 0.161911, 0.0693426, 0.0675156
Input = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Output = 0.107113, 0.101838, 0.0641632, 0.0967766, 0.117149, 0.085271, 0.11469, 0.153649, 0.0672772, 0.0652416
Above is the output of epochs #996, #997, #998, and #999.
So the network is simply not learning. For this example I have used ALPHA = 0.4, ETA = 0.7, 10 hidden layers each of 100 neurons, and the average is over 10 epochs. If you are worried about the learning rate being 0.4 or about having so many hidden layers, I have already tried variations of them. For example, with a learning rate of 0.1 and 4 hidden layers, each of 16 neurons:
Input = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Output = 0.0883238, 0.0983253, 0.0613749, 0.0809751, 0.124972, 0.0897194, 0.0911235, 0.179984, 0.0681346, 0.0660039
Input = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Output = 0.0868767, 0.0966924, 0.0612488, 0.0798343, 0.120353, 0.0882381, 0.111925, 0.169309, 0.0676711, 0.0656819
Input = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Output = 0.105252, 0.0943837, 0.0604416, 0.0781779, 0.116231, 0.0858496, 0.108437, 0.1588, 0.0663156, 0.0645477
Input = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Output = 0.102023, 0.0914957, 0.059178, 0.09339, 0.111851, 0.0842454, 0.104834, 0.149892, 0.0651799, 0.063558
I am so damn sure that I have missed something. I am not able to figure it out. I have read Tom Mitchell's algorithm so many times, but I don't know what is wrong. Whatever example I solve by hand works! (Please don't ask me to solve MNIST data images by hand ;) ) I do not know where to change the code or what to do... please help out.
EDIT -- Uploading more data as per suggestions in the comments:
1 hidden layer of 32 neurons -- still no learning.
Expected output -- the input is an image of a digit 0-9, so the target is a simple vector describing which digit the current image is: that bit is 1 and all others are 0. So I would want the output to be as close to 1 as possible for that particular bit and close to 0 for the others. For example, if the input is Input = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0] I would want the output to be something like Output = 0.002023, 0.0914957, 0.059178, 0.09339, 0.011851, 0.0842454, 0.924834, 0.049892, 0.0651799, 0.063558 (this is vague, hand-generated).
Here are links to other researchers' work:
Stanford
SourceForge -- this is rather a library
Not only these two; there are many more sites showing demos.
Things work quite fine for them. If I set my network parameters (Alpha, ETA) like theirs, I do not get results like theirs, so this is reassurance that something is wrong with my code.
EDIT 2
Adding more failure cases
Acceleration 0.7, learning rate 0.1
Acceleration 0.7, learning rate 0.6
In both of the above cases there were 3 hidden layers, each of 32 neurons.
This answer is copied from the OP's comment on the question.
I solved the puzzle. I had made the worst possible mistake: I was giving the wrong input. I used OpenCV to scan the images, and instead of using reshape I was using resize, so the input was a linear interpolation of the images. So my input was wrong; there was nothing wrong with the code. My network is 784 - 65 - 10, giving 96.43% accuracy.
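For reference, a sketch of the difference described here, using OpenCV (hypothetical file name, 28x28 grayscale input assumed): cv::Mat::reshape() only reinterprets the existing pixels as a new shape, while cv::resize() resamples and interpolates them.

#include <opencv2/opencv.hpp>

int main() {
    // Hypothetical 28x28 grayscale digit image.
    cv::Mat img = cv::imread("digit.png", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;

    // reshape: same 784 pixels, just viewed as a 1x784 row vector.
    cv::Mat flattened = img.reshape(1, 1);

    // resize: resamples to 1x784, interpolating pixel values and
    // destroying the spatial structure of the digit.
    cv::Mat resampled;
    cv::resize(img, resampled, cv::Size(784, 1), 0, 0, cv::INTER_LINEAR);

    // Only 'flattened' is a faithful 784-element input for the network.
    return flattened.cols == 784 ? 0 : 1;
}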