How should we set the number of neurons in the hidden layer of a neural network? [closed] - python-2.7

In neural network theory, setting the size of the hidden layers seems to be a really important issue. Are there any criteria for choosing the number of neurons in a hidden layer?

Yes - this is a really important issue. Basically there are two ways to do that:
Try different topologies and choose the best: because the number of neurons and layers are discrete parameters, you cannot differentiate your loss function with respect to them and use gradient descent. So the easiest way is to simply set up different topologies and compare them, using either cross-validation or a split of your data into training / validation / test parts. You can also use grid or random search schemes to do that; libraries like scikit-learn have appropriate modules for this (see the sketch after this list).
Dropout: the training technique called dropout can also help. In this case you set up a relatively large number of nodes in your layers and adjust a dropout parameter for each layer. For example, in a two-layer network with 100 nodes in the hidden layer and dropout_parameter = 0.6, you are learning a mixture of models, where each model is a neural network of size roughly 40 (about 60 nodes are turned off on each pass). This can also be seen as a way of figuring out the best topology for your task.
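A minimal sketch of the grid-search approach, assuming scikit-learn's MLPClassifier and a toy dataset (the candidate layer sizes here are illustrative, not recommendations):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
# candidate hidden-layer topologies, compared by 5-fold cross-validation
param_grid = {"hidden_layer_sizes": [(25,), (50,), (100,), (50, 50)]}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)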

There are also a few other algorithms for creating and pruning hidden layer neurons on the fly. The one I'm most familiar with is Cascade Correlation, which gets pretty good performance for many applications, despite the fact that the hidden layer starts with a single neuron and adds others as needed.
For further reading see:
• The original paper by Scott E. Fahlman and Christian Lebiere, The Cascade-Correlation Learning Architecture.
• Gábor Balázs' Cascade Correlation Neural Networks: A Survey.
• Lutz Prechelt's Investigation of the CasCor Family of Learning Algorithms.
There are many other such algorithms for dynamically constructing the hidden layer, which can be found scattered across the Internet in various .pdf research papers. Some sleuthing may be worthwhile to avoid reinventing the wheel and may turn up just the right method for the problem you're trying to solve. Neural net research is spread out across many varied disciplines so there's no telling what else is out there; keeping track of all the new algorithms is a daunting prospect. I hope that helps though.

You should set the number of neurons in the hidden layer in such a way that the number of weights does not exceed the number of your training examples. There is no rule of thumb for the exact number of neurons.
Ex: if you are using the MNIST dataset, you have about 60K training examples. So make sure your topology keeps the weight count below that: a 784-30-10 network has 784*30 + 30*10 = 23,820 weights, which is fewer than the number of training examples; but a 784-100-10 network has 784*100 + 100*10 = 79,400 weights, which exceeds the number of training examples and is highly likely to overfit.
In short, make sure you are not overfitting, and then you have a good chance of getting a good result.
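A quick way to check that rule, as a hedged sketch (the helper name is purely illustrative, and biases are ignored):

def n_weights(layer_sizes):
    # total weights in a fully connected net with the given layer sizes
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(n_weights([784, 30, 10]))   # 23820 -- below ~60K MNIST training examples
print(n_weights([784, 100, 10]))  # 79400 -- above it, so likely to overfit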


What is considered a good accuracy for trained Word2Vec on an analogy test?

After training Word2Vec, how high should the accuracy be during testing on analogies? What level of accuracy should be expected if it is trained well?
The analogy test is just an interesting, automated way to evaluate models or compare algorithms.
It might not be the best indicator of how well word-vectors will work for your own project-specific goals. (That is, a model which does better on word-analogies might be worse for whatever other info-retrieval, or classification, or other goal you're really pursuing.) So if at all possible, create an automated evaluation that's tuned to your own needs.
Note that the absolute analogy scores can also be quite sensitive to how you trim the vocabulary before training, or how you treat analogy-questions with out-of-vocabulary words, or whether you trim results at the end to just higher-frequency words. Certain choices for each of these may boost the supposed "correctness" of the simple analogy questions, but not improve the overall model for more realistic applications.
So there's no absolute accuracy rate on these simplistic questions that should be the target. Only relative rates are somewhat indicative - helping to show when more data, or tweaked training parameters, seem to improve the vectors. But even vectors with small apparent accuracies on generic analogies might be useful elsewhere.
All that said, you can review a demo notebook like the gensim "Comparison of FastText and Word2Vec" to see what sorts of accuracies on the Google word2vec.c questions-words.txt analogy set (40-60%) are achieved under some simple defaults and relatively small training sets (100MB-1GB).
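A minimal sketch of running that evaluation with gensim (assuming gensim 4.x; the bundled common_texts toy corpus is only there to make the sketch self-contained, so expect near-zero accuracy without a real corpus):

from gensim.models import Word2Vec
from gensim.test.utils import common_texts, datapath

# train on a tiny toy corpus just so the sketch runs end to end
model = Word2Vec(common_texts, vector_size=100, min_count=1)
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print("analogy accuracy: %.1f%%" % (100 * score))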

Open source CRF implementations for computer vision problems? [closed]

There are several open source implementations of conditional random fields (CRFs) in C++, such as CRF++ and FlexCRF. But from the manuals I can only understand how to use them for 1-D problems such as text tagging; it's not clear how to apply them to 2-D vision problems, supposing I have computed the association potentials at each node and the interaction potentials at each edge.
Did anyone use these packages for vision problems, e.g., segmentation? Or can they simply not be used this way?
All in all, are there any open source CRF packages for vision problems?
Thanks a lot!
The newest version of dlib has support for learning pairwise Markov random field models over arbitrary graph structures (including 2-D grids). It estimates the parameters in a max-margin sense (i.e. using a structural SVM) rather than in a maximum likelihood sense (i.e. CRF), but if all you want to do is predict a graph labeling then either method is just as good.
There is an example program that shows how to use this stuff on a simple example graph. The example puts feature vectors at each node and the structured SVM uses them to learn how to correctly label the nodes in the graph. Note that you can change the dimensionality of the feature vectors by modifying the typedefs at the top of the file. Also, if you already have a complete model and just want to find the most probable labeling then you can call the underlying min-cut based inference routine directly.
In general, I would say that the best way to approach these problems is to define the graphical model you want to use and then select a parameter learning method that works with it. So in this case I imagine you are interested in some kind of pairwise Markov random field model. In particular, the kind of model where the most probable assignment can be found with a min-cut/max-flow algorithm. Then in this case, it turns out that a structural SVM is a natural way to find the parameters of the model since a structural SVM only requires the ability to find maximum probability assignments. Finding the parameters via maximum likelihood (i.e. treating this as a CRF) would require you to additionally have some way to compute sums over the graph variables, but this is pretty hard with these kinds of models. For this kind of model, all the CRF methods I know about are approximations, while the SVM method in dlib uses an exact solver. By that I mean, one of the parameters of the algorithm is an epsilon value that says "run until you find the optimal parameters to within epsilon accuracy", and the algorithm can do this efficiently every time.
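To make the inference step concrete, here is a hedged sketch (plain networkx, not dlib's actual API) of MAP labeling for a binary pairwise Potts MRF on a 2-D grid via min-cut, the construction alluded to above:

import networkx as nx

def map_labeling(unary, w):
    """MAP labels for a binary Potts MRF on a grid via s-t min-cut.
    unary[i][j] = (cost of label 0, cost of label 1); w = smoothness weight."""
    rows, cols = len(unary), len(unary[0])
    g = nx.DiGraph()
    for i in range(rows):
        for j in range(cols):
            c0, c1 = unary[i][j]
            g.add_edge("s", (i, j), capacity=c1)  # cut => node takes label 1
            g.add_edge((i, j), "t", capacity=c0)  # cut => node takes label 0
            for ni, nj in ((i, j + 1), (i + 1, j)):  # 4-connected grid
                if ni < rows and nj < cols:
                    g.add_edge((i, j), (ni, nj), capacity=w)
                    g.add_edge((ni, nj), (i, j), capacity=w)
    _, (source_side, _) = nx.minimum_cut(g, "s", "t")
    # nodes on the source side pay their label-0 cost, so they get label 0
    return [[0 if (i, j) in source_side else 1 for j in range(cols)]
            for i in range(rows)]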
There was a good tutorial on this topic at this year's computer vision and pattern recognition conference. There is also a good book on Structured Prediction and Learning in Computer Vision written by the presenters.

Online (as opposed to bulk processed) data mining packages [closed]

By "bulk processed" I mean a static data set of facts (as in a CSV) processed all at once to extract knowledge. While "online", it uses a live backing store: facts are added as they happen ("X buys Y") and queries happen on this live data ("what would you reccomend to a person who is looking at y right now?").
I have (mis)used the term real-time, but I dont mean that results must come within a fixed time. ('''Edit: replaced real-time with online above''')
I have in mind a recommendation engine which uses live data. However, all the online resources (such as SO questions) I encountered make no distinction between online and bulk-processing data mining packages. I had to search individually:
Carrot2 which reads from Lucene/Solr and other live datasets (online)
Knime which does scheduled execution on static files (bulk)
Mahout which runs on Hadoop (and Pregel-based Giraph in future) (online?)
a commercial package that integrates with Cassandra (online?)
What are the online data-mining packages?
Is there a reason why the literature makes no distinction between online and bulk processing packages? Or is all practical data-mining actually bulk operation in nature?
For some algorithms, there are online versions available. For example, for LOF, the local outlier factor, there is an online variant. I believe there are also online variants of k-means (in fact, the original MacQueen version can be seen as "online", although most people turn it into an offline version by iterating it until convergence), but see below for the problem with the k parameter.
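As an illustration, a minimal sketch of the MacQueen-style online k-means update (a single pass over the stream; the function and parameter names are just for illustration):

import random

def online_kmeans(stream, k, dim, seed=0):
    random.seed(seed)
    centroids = [[random.random() for _ in range(dim)] for _ in range(k)]
    counts = [0] * k
    for x in stream:
        # assign the new point to its nearest centroid
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))
        counts[j] += 1
        step = 1.0 / counts[j]  # decreasing per-cluster step size
        centroids[j] = [c + step * (a - c) for a, c in zip(x, centroids[j])]
    return centroids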
However, online operation often comes at a significant performance cost, up to the point where it is faster to rerun the full algorithm on a snapshot every hour instead of continuously updating the results. Think of internet search engines: most large-scale search engines still do not allow "online" queries; instead you query the last index that was built, probably a day or more ago.
Plus, online operation needs a significant amount of additional work. It's easy to compute a distance matrix; it is much harder to update it online by adding and removing columns while keeping all dependent results synchronized. In general, most data-mining results are just too complex for this. It's easy to compute the mean of a data stream, for example, but often there is just no known way to update the results without rerunning the (expensive) process. In other situations you would even need to change the algorithm parameters. At some point a new cluster may form, but k-means is not meant to have new clusters appear, so essentially you can't just write an online version of k-means: it would be a different algorithm, as it needs to dynamically modify the input parameter k.
So usually, an algorithm will already be either online or offline, and a software package cannot turn an offline algorithm into an online one.
Online data-mining algorithms compute results in real time, and this usually implies that the algorithms are incremental: the model is updated each time it sees a new training instance, and no periodic retraining with a batch algorithm is needed. Many machine learning libraries, like Weka, provide incremental versions of batch algorithms. Also check the MOA project and Spark Streaming. The literature does make a distinction between the two, although most of the "traditional" ML algorithms do not work in an online mode without infrastructure and computation optimizations.

Outlier detection in data mining [closed]

I have a few sets of questions regarding outlier detection:
Can we find outliers using k-means and is this a good approach?
Is there any clustering algorithm which does not accept any input from the user?
Can we use support vector machine or any other supervised learning algorithm for outlier detection?
What are the pros and cons of each approach?
I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they would probably be better addressed in separate questions.
I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:
set.seed(123)
sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
                                      rnorm(n, mean[2], sd[2]))
# generate three clouds of points, well separated in the 2D plane
xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
            sim.xy(100, c(2.5,0), c(.4,.2)),
            sim.xy(100, c(1.25,.5), c(.3,.2)))
xy[1,] <- c(0,2)     # convert 1st obs. to an outlying value
km3 <- kmeans(xy, 3) # ask for three clusters
km4 <- kmeans(xy, 4) # ask for four clusters
As can be seen by plotting these solutions, the outlying value is never recovered as such: it will always belong to one of the other clusters.
One possibility, however, would be to use a two-stage approach where one removes extreme points (defined as vectors far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki et al.).
This bears some resemblance to what is done in genetic studies to detect and remove individuals that exhibit genotyping errors, or individuals that are siblings or twins (or when we want to identify population substructure), when we only want to keep unrelated individuals. In this case we use multidimensional scaling (which is equivalent to PCA, up to a constant, for the first two axes) and remove observations above or below 6 SD on any one of, say, the top 10 or 20 axes (see, for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).
A common alternative is to use ordered robust Mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a chi-squared distribution, as discussed in the following paper:
R.G. Garrett (1989). The chi-square plot: a tool for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.
(It is available in the mvoutlier R package.)
It depends on what you call user input. I interpret your question as asking whether some algorithm can automatically process a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, then for any distance-based partitioning algorithm you can use any of the available validity indices for cluster analysis; a good overview is given in
Handl, J., Knowles, J., and Kell, D.B.
(2005). Computational cluster validation in post-genomic data analysis.
Bioinformatics 21(15): 3201-3212.
which I discussed on Cross Validated. You can, for instance, run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k = 1 to 20), and select k according to the criterion being optimized (average silhouette width, cophenetic correlation, etc.); it can be fully automated, with no need for user input.
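A minimal sketch of that kind of automated selection, assuming scikit-learn and the average silhouette width (make_blobs is just a stand-in for your data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# the silhouette needs at least two clusters, so scan k = 2..10
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
          for k in range(2, 11)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))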
There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as implemented in Mclust for example, allows one to identify clusters in a multivariate dataset by spanning a range of shapes for the variance-covariance matrix over a varying number of clusters, and to choose the best model according to the BIC criterion.
This is a hot topic in classification, and some studies have focused on SVMs for detecting outliers, especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes a comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., Gaussian) distribution to it; objects that lie on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well, since defining what an outlier really is becomes more straightforward given an expected distribution.)
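A minimal sketch of that idea with scikit-learn's one-class SVM (the nu value and the planted outlier are illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [6.0, 6.0]  # plant one obvious outlier

clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X)
labels = clf.predict(X)  # +1 = inlier, -1 = potential outlier
print(np.where(labels == -1)[0])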
Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.
Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.
A related topic is anomaly detection, about which you will find a lot of papers.
That really deserves a new (and probably more focused) question :-)
1) Can we find outliers using k-means, is it a good approach?
Cluster-based approaches are designed to find clusters, and can be used to detect outliers only as by-products. In the clustering process, outliers can affect the locations of the cluster centers, or even aggregate into a micro-cluster. These characteristics make cluster-based approaches infeasible for complicated databases.
2) Is there any clustering algorithm which does not accept any input from the user?
Maybe you can achieve some valuable knowledge on this topic:
Dirichlet Process Clustering
A Dirichlet-process clustering algorithm can adaptively determine the number of clusters according to the distribution of the observed data.
3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?
Any supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real-world problems such as intrusion detection or medical diagnostics. According to Hawkins' definition of an outlier (Identification of Outliers, Chapman and Hall, London, 1980), the amount of normal data is much larger than the number of outliers, and most supervised learning algorithms can't build an effective classifier on such an imbalanced dataset.
4) What are the pros and cons of each approach?
Over the past several decades, research on outlier detection has moved from global computation to local analysis, and the descriptions of outliers have moved from binary interpretations to probabilistic representations. According to the hypotheses of the underlying models, outlier detection algorithms can be divided into four kinds: statistics-based, cluster-based, nearest-neighborhood-based, and classifier-based algorithms. There are several valuable surveys on outlier detection:
Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.
Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.
k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.
No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.
One-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting previously unseen objects. Plus, they can overfit when the data already contains outliers.
Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...
You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.
Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only a few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instances belong to clusters! So for some algorithms, outliers will actually prevent a good clustering!
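For example, a minimal sketch of DBSCAN's noise labeling with scikit-learn (eps and min_samples here are illustrative defaults, not tuned values):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = X[labels == -1]  # points DBSCAN refused to assign to any cluster
print(len(outliers))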

Tips for an AI for a 2D racing game [closed]

I have a school project to build an AI for a 2D racing game in which it will compete with several other AIs.
We are given a black-and-white bitmap image of the racing track, and we are allowed to choose basic stats for our car (handling, acceleration, max speed and brakes) after we receive the map. The AI connects to the game's server and sends it, several times a second, numbers for the current acceleration and steering. The language I chose is C++, by the way.
The question is
What is the best strategy or algorithm (since I want to try to win)? I currently have in mind some ideas found on the net and one or two of my own, but before I start to code I would like to make sure my approach is one of the best.
What good books are there on that matter?
What sites should I refer to?
There's no "right answer" for this problem - it's pretty open-ended and many different options might work out.
You may want to look into reinforcement learning as a way of getting the AI to determine how best to control the car once it has picked its control statistics. Reinforcement learning can train the computer to work toward a good policy for particular maneuvers with the underlying control system.
To determine what controls you'll want to use, you could use some flavor of reinforcement learning, or you may want to investigate supervised learning algorithms that can play around with different combinations of controls and see how good of a "fit" they give for the particular map. For example, you might break the map apart into small blocks, then try seeing what controls do well in the greatest number of blocks.
In terms of plotting out the path you'll want to take, A* is a well-known algorithm for finding shortest paths. In your case, I'm not sure how useful it will be, but it's the textbook informed search algorithm.
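For reference, a minimal sketch of A* on a bitmap-style grid (plain Python with the standard heapq module; the grid encoding and Manhattan heuristic are illustrative assumptions):

import heapq

def astar(grid, start, goal):
    """grid: 2-D list where 1 = drivable, 0 = wall; start/goal: (row, col)."""
    def h(p):  # Manhattan-distance heuristic, admissible on a 4-connected grid
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc]:
                heapq.heappush(frontier, (cost + 1 + h((nr, nc)), cost + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None  # no route between start and goal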
For avoiding opponent racers and trying to drive them into trickier situations, you may need to develop some sort of opponent modeling system. Universal portfolios are one way to do this, though I'm not sure how useful they'll be in this instance. One option might be to develop a potential field around the track and opponent cars to help your car try to avoid obstacles; this may actually be a better choice than A* for pathfinding. If you're interested in tactical maneuvers, a straightforward minimax search may be a good way to avoid getting trapped or to find ways to trap opponents.
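To make the potential-field idea above concrete, a hedged sketch (in Python for brevity, though the poster is using C++; the function name and constants are illustrative):

import math

def steering_heading(car, goal, obstacles, attract=1.0, repel=50.0):
    # attractive pull toward the goal point
    fx = attract * (goal[0] - car[0])
    fy = attract * (goal[1] - car[1])
    for ox, oy in obstacles:
        dx, dy = car[0] - ox, car[1] - oy
        d2 = dx * dx + dy * dy + 1e-9  # avoid division by zero
        # repulsion pushes away from each obstacle, falling off with distance
        fx += repel * dx / d2
        fy += repel * dy / d2
    return math.atan2(fy, fx)  # heading to steer toward

# e.g. heading when the goal is ahead and two opponents are nearby
print(steering_heading((0, 0), (10, 0), [(3, 1), (5, -2)]))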
I am no AI expert, but I think the above links might be a good starting point. Best of luck with the competition!
What good books are there on that matter?
The best book I have read on this subject is "Programming Game AI by Example" by Mat Buckland. It has chapters on both path planning and steering behaviors, and much more (state machines, graph theory, the list goes on).
All the solutions above are good, and people have gone to great lengths to test them out. Look up "Togelius and Lucas" or "Loiacono and Lanzi". They have tried things like neuroevolution, imitation (done via reinforcement learning), force fields, etc. From my point of view the best way to go is the center line: that will take an hour to implement. In contrast, neuroevolution (for example) is neither easy nor fast. I did my dissertation on that, and it can easily take several months full time, if you have the right hardware.