Python K-Means Z-Transform - python-2.7

I want to use the k-means to cluster my results, but I have a lot of question.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
My input data looks like this:
ID ABC XYZ UVW MSE
10 A X U 102000
12 B Y V 9000
Is it possible to cluster different types of input data with the K-Means? Like in my case charakters and numbers?
K-means choose a random centre for the clustering process. If i run the clustering often will my results are changing or is the output a stable result?
I want to know, which ID is in which cluster. How i get this information out of software?
EDIT:
If I only Cluster my MSE and afterwards I check, which attributes are effected, is this solution, which makes sense?

K-means tries to minimize variance (=squared errors).
What would the squared error of abc and def be?
Only use it for continuous data. And don't expect it to do magic, what you get is usually only a very bad approximation of what you'd be looking for. Running it multiple times will usually give you different results because there exists no 'good' result.

Related

proper activation function at output and loss function to optimize for OCR?

I am trying to make a CNN model on IAM handwritten words data(which has images of words handwritten by multiple people and targets are text in the images). So, I can encode words to numbers(A=0, B=1 and so on for capital, small and punctuation). But, I couldn't quite decide on what loss to use for this problem, in tensforflow.keras.
I have found out about something called CTC loss on some repositories on GitHub. But, ctc loss was used with softmax as activation in final dense layer. And since softmax outputs probabilities for each class that sum up to 1; So, I presume softmax can't output probabilties for two "o"s at the same time.
So, what are the proper activation and loss functions that I can use for a problem like this one?

Classify K-means in Text Mining

The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world:
Taking a look at the centroid table results I want to Understand the following:
https://ibb.co/n1mvnbk
I used K=5
and I am using TF-IDF
Explain what those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table by each cluster at a descending order, I find some words or attributes that have a higher value with this cluster while zero values in other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model
Do all the clusters make sense and why?
Do you think k=5 is a good choice for this dataset? or I need to choose 3? How can I classify that?
I believe K=5 denotes number of cluster you are looking into current Dataset. On the basis 5 centroid will be placed in data will be around them.
Do you think k=5 is a good choice for this dataset? Its hard to predict this way. It is all done by mathematical combination and permutation.
You might use Elbow Method to identify correct number of cluster needed for any given dataset. This methodology is based on WCSS(Within Cluster Sums of Squares) which find distance between points and provide centroid points.
Those numbers are the average tf-idf of the cluster. So a 0 means that the word is not in the cluster, and the highest valued words are most characteristic words for the cluster.
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method, it never works except for you examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document only contains one topic. Instead, topics will be a complex structure itself. For example there is Trump, but there also is the Trump Erdogan meeting, and there is the impeachment. These are not disjoint. But you will also have articles that don't fit into any of these topics. This leads to the effect that the true best k would likely be very very large, as large as the number of articles (and hence not useful).

Imbalance between errors in data summary and tree visualization in Weka

I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not wrong - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this the second number in the tree visualization is the number of the wrongly classified instances in each leaf, but then why their sum isn't 6 but 3?
EDIT: running the algorithm with different test options I obtain different results in terms of accuracy (and therefore number of errors), but when I visualize the tree I get always the same tree with the same 3 errors. I still can't explain why.
The second number in the tree visualization is not the number of the wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
Did you, by any chance, weigh some of those instances with 0.5 instead of 1?
Another option is that you are actually executing two different models. One where you use the full training set to build the classifier (classifier.buildClassifier(instances)) and another one where you run Cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model will produce the visualised tree with less errors (larger trainingset) while the second model from CV produces the output statistics with more errors. This would explain why you get different stats when changing the test set but still the same tree that is built on the full set.
For the record: if you train (and visualise) the tree with the full dataset, you will appear to have less errors, but your model will actually be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful and you should visualise the tree from that model.

SegNet results of train set (test via test_segmentation.py)

I run SegNet on my own dataset (by Segnet tutorial). I see great results via test_segmentation.py.
my problem is that I want to see the real net results and not test_segmentation own colorisation (via classes).
for example, if I have trained net with 2 classes, so after the train I will see not only 2 colors (as we see with the classes), but we will see the real net color segmentation ([0.22,0.19,0.3....) lighter and darker as the net see it]
I hope that I explained myself well. thanks for helping.
You could use a python script to achieve what you want. Take a look at this script.
The command out = out['argmax'], extracts the raw output, so you can get a segmentation map with 'lighter and darker' values as you wanted.
When you say the 'real' net color segmentation I will assume that you mean the probability maps. Effectively the last layer will have one map for every class; and if you check the function predict in inference.py, they take the argmax; that is the channel (which represents the class) with the highest probability. If you want to get these maps, you just have to get the data without computing the argmax; something like:
predicted = net.blobs['prob'].data
I solve it. the solution is to range cmin and cmax from 0 to 1 in the scipy saving method. for example: scipy.misc.toimage(output, cmin=0.0, amax=1).save(/path/.../image.png)

Neural Network seems to work fine until used for processing data (all of the results are practically the same)

I have recently implemented a typical 3 layer neural network (input -> hidden -> output) and I'm using the sigmoid function for activation. So far, the host program has 3 modes:
Creation, which seems to work fine. It creates a network with a specified number of input, hidden and output neurons, initializes the weights to either random values or zero.
Training, which loads a dataset, computes the output of the network then backpropagates the error and updates the weights. As far as I can tell, this works ok. The weights change, but not extremely, after training on the dataset.
Processing, which seems to work ok. However, the data output for the dataset which was used for training, or any other dataset for that matter is very bad. It's usually either just a continuuous stream of 1's, with an occasional 0.999999 or every output value for every input is 0.9999 with the last digits being different between inputs. As far as I could tell there was no correlation between those last 2 digits and what was supposed to be outputed.
How should I go about figuring out what's not working right?
You need to find a set of parameters (number of neurons, learning rate, number of iterations for training) that works well for classifying previously unseen data. People often achieve this by separating their data into three groups: training, validation and testing.
Whatever you decide to do, just remember that it really doesn't make sense to be testing on the same data with which you trained, because any classifcation method close to reasonable should be getting everything 100% right under such a setup.