Weka Question: Which cluster do these Iris attributes belong to?

I'm totally new to Weka and data science. I got an assignment to detect which cluster the following Iris attributes (SW, SL, PW, PL) belong to. Can you assist me? Thanks!

The iris dataset that comes with Weka has three classes (Iris-setosa, Iris-versicolor, Iris-virginica).
If you want to see how well the clusters determined by your clustering algorithm align with the class labels, you need to select Classes to clusters evaluation in the Weka Explorer, or use the -c <class_att_index> option on the command line.
The following command runs SimpleKMeans with three clusters on the iris dataset that comes with Weka (-c last uses the last attribute as the class and performs classes-to-clusters evaluation):
java -cp weka.jar weka.clusterers.SimpleKMeans -N 3 -c last -t data/iris.arff
This results in the following output:
=== Clustering stats for training data ===

kMeans
======

Number of iterations: 6
Within cluster sum of squared errors: 6.998114004826762

Initial starting points (random):

Cluster 0: 6.1,2.9,4.7,1.4
Cluster 1: 6.2,2.9,4.3,1.3
Cluster 2: 6.9,3.1,5.1,2.3

Missing values globally replaced with mean/mode

Final cluster centroids:
                          Cluster#
Attribute     Full Data         0         1         2
                (150.0)    (61.0)    (50.0)    (39.0)
======================================================
sepallength      5.8433    5.8885     5.006    6.8462
sepalwidth        3.054    2.7377     3.418    3.0821
petallength      3.7587    4.3967     1.464    5.7026
petalwidth       1.1987     1.418     0.244    2.0795


Clustered Instances

0      61 ( 41%)
1      50 ( 33%)
2      39 ( 26%)


Class attribute: class
Classes to Clusters:

  0  1  2  <-- assigned to cluster
  0 50  0 | Iris-setosa
 47  0  3 | Iris-versicolor
 14  0 36 | Iris-virginica

Cluster 0 <-- Iris-versicolor
Cluster 1 <-- Iris-setosa
Cluster 2 <-- Iris-virginica

Incorrectly clustered instances :  17.0   11.3333 %
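
If you prefer to do the same thing from Java code rather than the command line, a minimal sketch against the Weka API looks roughly like this (the data path and class name are placeholders; the recipe is: remove the class attribute before building the clusterer, then evaluate against the data that still carries the class):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisClassesToClusters {
  public static void main(String[] args) throws Exception {
    // Load the iris data and declare the last attribute (the species) as the class.
    Instances data = DataSource.read("data/iris.arff");  // path is an assumption
    data.setClassIndex(data.numAttributes() - 1);

    // The clusterer must not see the class attribute, so remove it for training.
    Remove remove = new Remove();
    remove.setAttributeIndices("" + (data.classIndex() + 1));
    remove.setInputFormat(data);
    Instances dataNoClass = Filter.useFilter(data, remove);

    // Build SimpleKMeans with three clusters, one per expected class.
    SimpleKMeans kMeans = new SimpleKMeans();
    kMeans.setNumClusters(3);
    kMeans.buildClusterer(dataNoClass);

    // Evaluating against the data that still has the class attribute
    // performs the classes-to-clusters evaluation shown above.
    ClusterEvaluation eval = new ClusterEvaluation();
    eval.setClusterer(kMeans);
    eval.evaluateClusterer(data);
    System.out.println(eval.clusterResultsToString());
  }
}

The confusion matrix and the "Incorrectly clustered instances" line printed by clusterResultsToString() are the same classes-to-clusters summary as in the command-line output above.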

Related

Amazon QuickSight - Working out size of network

I have a database table with a record for each connected IoT device; each device has a unique device id and a network id associated with it.
For example:
device_id  network_id
1          1
2          1
3          1
4          2
5          2
6          3
7          3
8          3
9          3
10         4
I would like to be able to visualise the size of each network based on its id, so I would have an output like this based on the above data:
network_id  size
1           3
2           2
3           4
4           1
I'm not currently sure how to do this.
I found that using the countOver function worked for this.
I made a calculated field called NetworkSize, defined as:
countOver({device_id}, [{network_id}])
This gives the output I was looking for.
However, I have to include device_id in the visual, which is a bit inconvenient.

WEKA: Print the indexes of test data instances w.r.t. the original data at the time of cross-validation

I have a query about the indexes of the test instances chosen by Weka at the time of cross-validation. How can I print the indexes of the test instances that are being evaluated?
==================================
I have chosen:
Dataset: iris.arff
Total instances: 150
Classifier: J48
Cross-validation: 10-fold
I have also set the output predictions format to "PlainText".
=============
In the output window I can see something like this:
inst#  actual            predicted         error  prediction
    1  3:Iris-virginica  3:Iris-virginica         0.976
    2  3:Iris-virginica  3:Iris-virginica         0.976
    3  3:Iris-virginica  3:Iris-virginica         0.976
    4  3:Iris-virginica  3:Iris-virginica         0.976
    5  3:Iris-virginica  3:Iris-virginica         0.976
    6  1:Iris-setosa     1:Iris-setosa            1
    7  1:Iris-setosa     1:Iris-setosa            1
  ...
There are 10 test sets in total (15 instances in each).
======================
As Weka uses stratified cross-validation, the instances in the test sets are randomly chosen.
So, how can I print the indexes of the test instances with respect to the data in the original file?
That is, for a line like
inst# actual predicted error prediction
1 3:Iris-virginica 3:Iris-virginica 0.976
which instance in the main data (among the 50 Iris-virginica instances in total) does this result refer to?
===============
After a lot of searching, I found that the YouTube video below is helpful for the above problem.
Hope this will be helpful for any future visitor with the same question.
Weka Tutorial 34: Generating Stratified Folds (Data Preprocessing)
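
Another option is to attach an explicit ID to every instance and print it next to each prediction. The sketch below (untested, written against the Weka Java API; the file path and random seed are placeholders) uses the AddID filter to number the instances in file order, hides the ID from the learner with FilteredClassifier + Remove, and asks the PlainText prediction output to print attribute 1 (the ID) for every test instance:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.output.prediction.PlainText;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddID;
import weka.filters.unsupervised.attribute.Remove;

public class CrossValidationWithIDs {
  public static void main(String[] args) throws Exception {
    // Load the data and add a numeric ID as the first attribute;
    // the ID equals the instance's position in the original file.
    Instances data = DataSource.read("data/iris.arff");  // path is an assumption
    AddID addId = new AddID();
    addId.setInputFormat(data);
    Instances dataWithId = Filter.useFilter(data, addId);
    dataWithId.setClassIndex(dataWithId.numAttributes() - 1);

    // Remove the ID inside the classifier so it is not used for learning,
    // but remains available in the evaluation data for printing.
    Remove remove = new Remove();
    remove.setAttributeIndices("1");
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(remove);
    fc.setClassifier(new J48());

    // Collect the predictions as plain text and print attribute 1 (the ID)
    // next to each prediction.
    StringBuffer predictions = new StringBuffer();
    PlainText output = new PlainText();
    output.setBuffer(predictions);
    output.setHeader(dataWithId);
    output.setAttributes("1");

    Evaluation eval = new Evaluation(dataWithId);
    eval.crossValidateModel(fc, dataWithId, 10, new Random(1), output);
    System.out.println(predictions);
  }
}

In the Explorer the equivalent is to apply AddID in the Preprocess panel, train a FilteredClassifier that removes the ID attribute, and configure the PlainText output predictions to include the ID attribute in its attribute range.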

Scikit-learn LabelEncoder: how to preserve mappings between batches?

I have 185 million samples, each of which will be about 3.8 MB. To prepare my dataset, I will need to one-hot encode many of the features, after which I end up with over 15,000 features.
But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.
The question is how to preserve the encodings/mappings/labels between batches?
The batches are not necessarily going to contain all the levels of a category. That is, batch #1 may have: Paris, Tokyo, Rome.
Batch #2 may have Paris, London.
But in the end I need Paris, Tokyo, Rome and London all mapped into one consistent encoding.
Assuming that I cannot determine the levels of my Cities column across all 185 million samples at once, since they won't fit in RAM, what should I do?
If I apply the same LabelEncoder instance to different batches, will the mappings remain the same?
I will also need to use one-hot encoding, either with scikit-learn or Keras' np_utils.to_categorical, in batches after this. So, same question: how do I use those three methods in batches, or apply them at once to a file format stored on disk?
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values when you call .fit(); otherwise it will throw an error when it encounters a new one during .transform().
import pandas as pd

# Create toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert the categorical feature column to a matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)

# Final wrap-up: replace NaNs with 0 and convert the flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
Output:

   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0

AWS Elasticsearch free storage drops abruptly

We have a social-style app, and we started using the AWS Elasticsearch Service in production, but we ran into a problem with ES. The ES version is 2.3.
The cluster configuration is:
Data nodes: 2
Data node type: m3.medium.elasticsearch
Dedicated master instance count: 3
Dedicated master instance type: t2.small.elasticsearch
Capacity of each data node: 50 GB
The problem is that in less than thirty minutes the free storage of one of the nodes went from 9 GB to 0 GB, and we do not know how this happened.
We have 4 document types, one of which is dynamic; let's call it the Group type. It is dynamic because every Group document can have N fields that represent the friends of a Group.
Something like:
{
  "13": [1, 2, 3, 4],
  "5": [1, 3, 4],
  "user_ids": [1, 2, 3, 4, 6, 7],
  "id": 1
}
This means that the users with IDs 13 and 5 are friends with some of the users of the Group with ID 1.
So this document can grow with the number of users.
If anyone has had the same problem, or just fully understands the Elasticsearch architecture, their help would be awesome.
Indices info:

curl -XGET 'http://host/_cat/indices?v'

health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana-4   1   1          5            0      1.9mb       1017.3kb
green  open   X           1   1    2259502        29575     57.5gb         28.7gb
green  open   Y           1   1     113156            0     21.7mb         10.8mb

curl -XGET 'http://host/_cat/nodes?v&h=host,id,ip,rp,hp,d,cpu,v,r,m,n'

host    id   ip      rp hp d      cpu v     r m n
x.x.x.x tIgm x.x.x.x 95  5 5.7gb    0 2.3.2 - m Shatter
x.x.x.x puUF x.x.x.x 95  6 5.7gb    0 2.3.2 - m Justice
x.x.x.x 1qZi x.x.x.x 97 54 17.7gb   7 2.3.2 d - Allatou
x.x.x.x lcty x.x.x.x 97 60 17.7gb   8 2.3.2 d - Amergin
x.x.x.x Nq1H x.x.x.x  5 15 5.7gb    0 2.3.2 - * Arkus
Thanks a lot!
I have managed to resolve the problem.
My problem is known as mapping explosion: having variable keys in the mapping, like I had in the Group document type, results in an ever-growing index, because every new key adds another field to the mapping.
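
As an illustration only (the field names below are hypothetical, not the original schema), the same information can be stored under fixed field names, so the mapping stays constant no matter how many users a Group accumulates:

{
  "id": 1,
  "user_ids": [1, 2, 3, 4, 6, 7],
  "friendships": [
    { "user_id": 13, "friend_ids": [1, 2, 3, 4] },
    { "user_id": 5,  "friend_ids": [1, 3, 4] }
  ]
}

With a fixed structure like this, only the documents grow with the data; the mapping does not.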

Way to get SCSI disk names in Linux C++ application

In my Linux C++ application I want to get the names of all SCSI disks present on the system, e.g. /dev/sda, /dev/sdb, and so on.
Currently I am deriving them from the contents of /proc/scsi/sg/devices using the code below:
host  chan  SCSI id  lun  type  opens  qdepth  busy  online
0     0     0        0    0     1      128     0     1
1     0     0        0    0     1      128     0     1
1     0     0        1    0     1      128     0     1
1     0     0        2    0     1      128     0     1
// If the SCSI device id is >= 26 then the corresponding device name is
// of the form /dev/sdaa, /dev/sdab, etc.
if (scsiId >= MAX_ENG_ALPHABETS)
{
    // Device name order is: aa, ab, ..., az, ba, bb, ..., bz, ..., zy, zz.
    deviceName.append(1, 'a' + (char)(scsiId / MAX_ENG_ALPHABETS) - 1);
    deviceName.append(1, 'a' + (char)(scsiId % MAX_ENG_ALPHABETS));
}
// If the SCSI device id is < 26 then the corresponding device name is
// of the form /dev/sda, /dev/sdb, etc.
else
{
    deviceName.append(1, 'a' + (char)scsiId);
}
But the file /proc/scsi/sg/devices also contains information about disks that were previously present on the system. E.g. if I detach the disk (LUN) /dev/sdc from the system,
/proc/scsi/sg/devices still contains the entry for /dev/sdc, which is no longer valid.
Is there a different way to get the SCSI disk names, such as a system call?
Thanks
You can simply read the list of all files matching /dev/sd* (in C, you would need to use opendir/readdir/closedir) and filter it by sdX (where X is one or two letters); a short sketch of this approach follows below.
Also, you can get the list of all partitions by reading the single file /proc/partitions, and then filter the 4th field by sdX:
$ cat /proc/partitions
major minor  #blocks  name

   8     0  52428799  sda
   8     1    265041  sda1
   8     2         1  sda2
   8     5   2096451  sda5
   8     6  50066541  sda6
This would give you the list of all physical disks together with their capacity (3rd field, in blocks).
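
As a rough sketch of the /dev/sd* approach mentioned above (assumptions: whole disks are exactly the /dev entries named "sd" followed by lowercase letters, which skips partition names such as sda1; error handling kept to a minimum):

#include <dirent.h>

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Collect whole-disk SCSI device names (/dev/sda, /dev/sdb, ..., /dev/sdaa, ...)
// by scanning /dev and keeping entries of the form "sd" followed by letters only
// (names ending in digits are partitions such as sda1 and are skipped).
std::vector<std::string> scsiDiskNames()
{
    std::vector<std::string> disks;
    DIR* dir = opendir("/dev");
    if (dir == nullptr)
        return disks;

    for (struct dirent* entry = readdir(dir); entry != nullptr; entry = readdir(dir))
    {
        std::string name = entry->d_name;
        if (name.size() <= 2 || name.compare(0, 2, "sd") != 0)
            continue;

        bool lettersOnly = true;
        for (size_t i = 2; i < name.size(); ++i)
        {
            if (!std::islower(static_cast<unsigned char>(name[i])))
            {
                lettersOnly = false;
                break;
            }
        }
        if (lettersOnly)
            disks.push_back("/dev/" + name);
    }
    closedir(dir);
    return disks;
}

int main()
{
    for (const std::string& disk : scsiDiskNames())
        std::cout << disk << '\n';
    return 0;
}

Because /dev only contains nodes for devices that are currently present, a detached disk no longer shows up in this list, unlike the stale entries in /proc/scsi/sg/devices.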
Alternatively, after getting the disk name list from /proc/scsi/sg/devices, you can verify each device's existence in code. For example, install sg3-utils and use sg_inq to query whether the disk is active.