How can we use clustering results in Weka?

I am using Weka for my internship, but I have little knowledge of data mining. Does anyone know how I can apply the following results to my datasets so that every instance is assigned to a cluster? My current method is to compute the distance between each instance's attribute values and the mean values of each cluster, then assign the instance to the nearest cluster. But this method is too rough for my needs.
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: wcet_cluster6 - Copie-weka.filters.unsupervised.attribute.Remove-R1-3,5-weka.filters.unsupervised.attribute.Remove-R5-12
Instances: 467
Attributes: 4
max
alt
stmt
bb
Test mode:evaluate on training data
=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 6
                 Cluster
Attribute              0         1         2         3         4         5
                  (0.28)    (0.11)    (0.25)    (0.16)    (0.04)    (0.17)
==========================================================================
max
  mean            9.0148   10.9112   11.2826   10.4329   11.2039   10.0546
  std. dev.       1.8418    2.7775    3.0263    2.5743    2.2014    2.4614
alt
  mean            0.0003   19.6467    0.4867    2.4565    44.191    8.0635
  std. dev.       0.0175    5.7685    0.5034    1.3647   10.4761    3.3021
stmt
  mean            0.7295   77.0348    3.2439   12.3971  140.9367   33.9686
  std. dev.       1.0174   21.5897    2.3642    5.1584   34.8366   11.5868
bb
  mean            0.4362   53.9947    1.4895    7.2547  114.7113   22.2687
  std. dev.       0.5153   13.1614    0.9276    3.5122   28.0919    7.6968
Time taken to build model (full training data) : 4.24 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 163 ( 35%)
1 50 ( 11%)
2 85 ( 18%)
3 73 ( 16%)
4 18 ( 4%)
5 78 ( 17%)
Log likelihood: -9.09081
Thanks for your help!!

I think no-one can really answer this. Some tips off the top of my head.
You have used the EM clustering algorithm (see the animated GIF on its Wikipedia page). From the synopsis in Weka's documentation:
"EM assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters."
Is this complex output really what you want?
It also selects a number of clusters for you (unless you constrain that number).
In Weka 3.7 you can use the unsupervised attribute filter "ClusterMembership" in the Preprocess dialog to replace your dataset with the result of the cluster assignments. You need to select one reference attribute, though; by default it selects the last one. This creates hard-to-interpret output.
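One way to turn the printed model into per-instance cluster assignments, instead of comparing against the cluster means alone, is to compute the membership probabilities that the synopsis describes. The following is a minimal Python sketch, not Weka's own code. It assumes each cluster is a product of independent normal distributions with the means and standard deviations printed in the output, weighted by the prior shown in parentheses under each cluster number. The function names are made up for illustration.

```python
import math

# Values copied from the EM output in the question.
# PRIORS are the numbers in parentheses under each cluster index.
PRIORS = [0.28, 0.11, 0.25, 0.16, 0.04, 0.17]
MEANS = {
    "max":  [9.0148, 10.9112, 11.2826, 10.4329, 11.2039, 10.0546],
    "alt":  [0.0003, 19.6467, 0.4867, 2.4565, 44.191, 8.0635],
    "stmt": [0.7295, 77.0348, 3.2439, 12.3971, 140.9367, 33.9686],
    "bb":   [0.4362, 53.9947, 1.4895, 7.2547, 114.7113, 22.2687],
}
STDS = {
    "max":  [1.8418, 2.7775, 3.0263, 2.5743, 2.2014, 2.4614],
    "alt":  [0.0175, 5.7685, 0.5034, 1.3647, 10.4761, 3.3021],
    "stmt": [1.0174, 21.5897, 2.3642, 5.1584, 34.8366, 11.5868],
    "bb":   [0.5153, 13.1614, 0.9276, 3.5122, 28.0919, 7.6968],
}


def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def cluster_posteriors(instance):
    """Return P(cluster | instance) for every cluster.

    Assumption: attributes are independent within a cluster, so the
    log densities of the individual attributes are simply added.
    """
    log_joint = []
    for k, prior in enumerate(PRIORS):
        total = math.log(prior)
        for attr, value in instance.items():
            total += log_normal_pdf(value, MEANS[attr][k], STDS[attr][k])
        log_joint.append(total)
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    norm = sum(weights)
    return [w / norm for w in weights]


def assign_cluster(instance):
    """Hard assignment: index of the most probable cluster."""
    post = cluster_posteriors(instance)
    return max(range(len(post)), key=post.__getitem__)


example = {"max": 11.0, "alt": 44.0, "stmt": 140.0, "bb": 115.0}
print(assign_cluster(example), cluster_posteriors(example))
```

Compared with the nearest-mean approach from the question, this weighting takes each cluster's spread and relative size into account, so a tight cluster such as cluster 0 does not capture points that are merely close to its mean in absolute terms.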

Related

Weka j48 output

I am confused about the numbers at the end of the branches of a J48 tree. For example, using the weather.nominal data, the tree looks the same whether the Test options are set to Use training set, Cross-validation, or Percentage split.
This is the output:
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
According to the textbook by the authors of this software, in an example using this exact data they say, "In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf, expressed as a decimal number because of the way the algorithm uses fractional instances to handle missing values. If there were incorrectly classified instances (there aren’t in this example) their number would appear, too: thus 2.0/1.0 means that two instances reached that leaf, of which one is classified incorrectly"
So this means that no instances were incorrectly classified in the above tree with the weather.nominal dataset.
On the other hand, when the test options are set to either 'Cross-validation' or 'Percentage split' (with the default random seed), there are incorrectly classified instances. For example, with a 60 percent split, it shows the following:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 2 40 %
Incorrectly Classified Instances 3 60 %
There seems to be a contradiction here but I must be missing something. Is the tree shown initially not the tree that is built with the 60 percentage split?
That is not stated anywhere as far as I have seen but I can't think of any other explanation.
Just for completeness, the data is here:
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
If you take a closer look at the output, you will see the following:
=== Classifier model (full training set) ===
The model that is being depicted there is the model that was trained on the full dataset, not your split.
The next section has the following heading:
=== Evaluation on test split ===
The statistics that you are referring to are based on a model trained and evaluated on your dataset split.
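To see that the numbers at the leaves describe the full training set, the printed tree can be applied by hand to the 14 rows listed in the question. The snippet below is a small Python sketch of that check; the helper names are hypothetical and it hard-codes the tree exactly as printed, so it is not a J48 implementation.

```python
from collections import Counter

# The 14 rows of weather.nominal, as listed in the question
# (columns: outlook, temperature, humidity, windy, play).
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def leaf_for(outlook, humidity, windy):
    """Follow the printed tree; return (leaf description, predicted class)."""
    if outlook == "sunny":
        return f"outlook=sunny, humidity={humidity}", "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "outlook=overcast", "yes"
    return f"outlook=rainy, windy={windy}", "no" if windy == "TRUE" else "yes"


def leaf_counts(text):
    """Count how many rows reach each leaf and how many of them are misclassified."""
    reached, errors = Counter(), Counter()
    for line in text.strip().splitlines():
        outlook, _temperature, humidity, windy, play = line.split(",")
        leaf, predicted = leaf_for(outlook, humidity, windy)
        reached[leaf] += 1
        if predicted != play:
            errors[leaf] += 1
    return reached, errors


reached, errors = leaf_counts(DATA)
for leaf, n in reached.items():
    print(f"{leaf}: reached={n}, misclassified={errors[leaf]}")
```

On all 14 rows the leaves are reached by 3, 2, 4, 2 and 3 instances with no misclassifications, which matches the (3.0), (2.0), (4.0), (2.0), (3.0) in the printed tree. A tree built on only 60 percent of the rows could not show those counts.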

Trajectory Analysis (SAS): Incorrect number of start values

I am attempting a trajectory analysis in SAS (proc traj).
Following instructions found online, I began by testing quadratic models with two groups, then three, four, and five (i.e., order 2 2, order 2 2 2, order 2 2 2 2, order 2 2 2 2 2).
I determined that a three-group linear model is the best fit (order 1 1 1;).
I then wish to add time stable covariates with the risk command. As found online, I did this by adding the start parameters provided in the Log.
At this point, I receive a notice: "Incorrect number of start values. There should be 10 start values based on the model specifications."
I understand that it's possible to delete some of the 12 parameter estimates provided, but how do I select which ones to remove?
Thank you.
Code:
proc traj data=followupyes outplot=op outstat=os out=of outest=oe itdetail;
id youthid;
title3 'linear 3-gp model ';
var pronoun_allpar1-pronoun_allpar3;
indep time1-time3;
model logit;
ngroups 3;
order 1 1 1;
weight wgt_00;
start 0.031547 0.499724 1.969017 0.859566 -1.236747 0.007471
0.771878 0.495458 0.000000 0.000000 0.000000 0.000000;
risk P00_45_1;
run;
%trajplot (OP, OS, "linear 3-gp model ", "Traj of Pronoun Support", "Pron Support", "Time");
Because you are estimating a model with 3 linear trajectories, you will need 2 start values for each of your 3 groups.
See here for more info: https://www.andrew.cmu.edu/user/bjones/example.htm
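The count in the error message can be reproduced with a little arithmetic. The sketch below is in Python rather than SAS and rests on an assumption about how proc traj lays out its parameters for a logit model with a risk statement: each group contributes order + 1 trajectory parameters, and every group except the reference group contributes one constant plus one coefficient per risk covariate. The function name is hypothetical; check the layout against the linked documentation before relying on it.

```python
def expected_start_values(orders, n_risk_covariates):
    """Number of start values under the parameter layout assumed above.

    orders: polynomial order per group, e.g. [1, 1, 1] for "order 1 1 1".
    n_risk_covariates: number of variables on the risk statement.
    """
    # Intercept plus one coefficient per polynomial term, for each group.
    trajectory_params = sum(order + 1 for order in orders)
    # Assumed: a constant and one coefficient per risk covariate for each
    # non-reference group.
    membership_params = (len(orders) - 1) * (1 + n_risk_covariates)
    return trajectory_params + membership_params


# Three linear groups with the single risk covariate P00_45_1.
print(expected_start_values([1, 1, 1], 1))
```

Under that assumption the model in the question needs 6 trajectory values (2 per group, as the answer says) plus 4 group-membership values, which agrees with the 10 reported in the log, whereas the start statement supplies 12.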

Simple Neptune Gremlin query to perform date comparison degrades due to large join

We have a graph that contains both customer and product vertices. For a given product, we want to find out how many customers who signed up before DATE have purchased this product. My query looks something like:
g.V('PRODUCT_GUID') // get product vertex
.out('product-customer') // get all customers who ever bought this product
.has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))) // see if the customer was created after a given date
.count() // count the results
This query is incredibly slow, so I looked at the Neptune profiler and saw something odd. Below is the full profiler output. Ignore the elapsed time in the profiler: it was captured after many attempts at the same query, so the cache is warm. In the wild, it can take 45 seconds or more.
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V('PRODUCT_GUID').out('product-customer').has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))).count()
Original Traversal
==================
[GraphStep(vertex,[PRODUCT_GUID]), VertexStep(OUT,[product-customer],vertex), HasStep([created_on.gte(Sat Nov 28 00:33:44 UTC 2020)]), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneCountGlobalStep {
JoinGroupNode {
PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586, indexTime=0, joinTime=14, numSearches=1, actualTotalOutput=13424}
PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574, indexTime=10, joinTime=140, numSearches=13424}
}, annotations={path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep], joinStats=true, optimizationTime=0, maxVarId=8, executionTime=165}
}
]
Physical Pipeline
=================
NeptuneCountGlobalStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
Runtime (ms)
============
Query Execution: 164.996
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneCountGlobalStep 1 1 164.919 100.00
>TOTAL - - 164.919 -
Predicates
==========
# of predicates: 131
Results
=======
Count: 1
Output: [22]
Index Operations
================
Query execution:
# of statement index ops: 13425
# of unique statement index ops: 13425
Duplication ratio: 1.0
# of terms materialized: 0
In particular
DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
This line surprises me. The way I read it, Neptune is ignoring the vertices coming from ".out('product-customer')" when it satisfies the ".has('created_on'...)" requirement, and is instead joining on every single customer vertex that has the created_on attribute.
I would have expected that the cardinality is only the number of customers with an edge from the product, not every single customer.
I'm wondering if there's a way to only run this comparison on the customers coming from the "out('product-customer')" step.
Neptune actually must solve the first pattern,
(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6)
before it can solve the second,
(?3, <created_on>, ?7, ?)
Each quad pattern is an indexed lookup bound by at least two fields. So the first lookup uses the SPOG index in Neptune bound by the Subject (the ID) and the Predicate (the edge label). This will return a set of Objects (the vertex IDs for the vertices at the other end of the product-customer edges) and references them via the ?3 variable for the next pattern.
In the next pattern, those vertex IDs (?3) are bound together with the Predicate (the property key created_on) to evaluate the date condition. Because this is a conditional evaluation, every vertex in the ?3 set has to be evaluated (the created_on property on each of those vertices has to be read).
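The two-step evaluation described above can be mimicked with plain Python dictionaries. This is only a toy model of the lookup pattern, not Neptune's implementation: the dictionaries stand in for the indexes, and the function name, customer IDs and dates are invented for illustration.

```python
from datetime import datetime, timezone


def count_recent_customers(edges, created_on, product_id, cutoff):
    """Count customers of product_id created on or after cutoff.

    Returns (matches, index_ops), where index_ops counts the simulated
    index lookups: one for the edge pattern, then one per bound customer.
    """
    # First pattern: a single lookup bound by subject (product) and
    # predicate (edge label) yields every ?3 binding.
    index_ops = 1
    customers = edges.get(product_id, [])

    matches = 0
    for customer in customers:
        # Second pattern: one lookup per ?3 binding to read created_on.
        index_ops += 1
        value = created_on.get(customer)
        if value is not None and value >= cutoff:
            matches += 1
    return matches, index_ops


# Hypothetical toy data.
edges = {"PRODUCT_GUID": ["c1", "c2", "c3"]}
created_on = {
    "c1": datetime(2020, 11, 1, tzinfo=timezone.utc),
    "c2": datetime(2020, 12, 1, tzinfo=timezone.utc),
    "c3": datetime(2021, 1, 15, tzinfo=timezone.utc),
    # c4 never bought the product, so its property is never read.
    "c4": datetime(2021, 2, 1, tzinfo=timezone.utc),
}
cutoff = datetime(2020, 11, 28, 0, 33, 44, 536000, tzinfo=timezone.utc)

matches, index_ops = count_recent_customers(edges, created_on, "PRODUCT_GUID", cutoff)
print(matches, index_ops)
```

With 13424 customers attached to the product, this pattern performs 1 + 13424 lookups, which is consistent with the numSearches values and the 13425 statement index ops in the profile. The estimatedCardinality of 1285574 on the second pattern appears to be an estimate for the pattern on its own, not the number of vertices actually read.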

Graph evolution of quantile non-linear coefficient: can it be done with grqreg? Other options?

I have the following model:
Y_{it} = alpha_i + B1*weight_{it} + B2*Dummy_Foreign_{i} + B3*(weight*Dummy_Foreign)_{it} + e_{it}
and I am interested in the effect of weight on Y for foreign cars, and in graphing the evolution of the relevant coefficient across quantiles with its standard errors. That is, I need to see the evolution of the sum of coefficients (B1 + B3). I know this is a non-linear effect, and would require some sort of delta method to obtain the variance-covariance matrix and, from it, the standard error of (B1 + B3).
Before I delve into writing a program that attempts to do this, I thought I would try and ask if there is a way of doing it with grqreg. If this is not possible with grqreg, would someone please guide me into how they would start writing a code that computes the proper standard errors, and graphs the quantile coefficient.
For a cross section example of what I am trying to do, please see code below.
I use grqreg to generate the evolution of the separate coefficients, but I need the joint one: a single graph for the evolution of (B1+B3) with its respective standard errors.
Thanks.
(I am using Stata 14.1 on Windows 10):
clear
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
gen weight_foreign= weight*foreign
label var weight_foreign "Interaction weight and foreign car"
qreg gptm weight foreign weight_foreign , q(.5)
grqreg weight weight_foreign , ci ols olsci reps(40)
*** Question 1: How to construct the plot of the coefficient of interest?
Your second question is off-topic here since it is statistical. Try the CV SE site or Statalist.
Here's how you might do (1) in a cross section, using margins and marginsplot:
clear
set more off
sysuse auto
set scheme s1color
gen gptm = 1000/mpg
label var gptm "gallons / 1000 miles"
sqreg gptm c.weight##i.foreign, q(10 25 50 75 95) reps(500) coefl
margins, dydx(weight) predict(outcome(q10)) predict(outcome(q25)) predict(outcome(q50)) predict(outcome(q75)) predict(outcome(q95)) at(foreign=(0 1))
marginsplot, xdimension(_predict) xtitle("Quantile") ///
legend(label(1 "Domestic") label(2 "Foreign")) ///
xlabel(none) xlabel(1 "Q10" 2 "Q25" 3 "Q50" 4 "Q75" 5 "Q95", add) ///
title("Marginal Effect of Weight By Origin") ///
ytitle("GPTM")
This produces a graph of the marginal effect of weight at each of the five quantiles, by origin (image not preserved).
I didn't recast the CI here since it would look cluttered, but that would make it look more like your graph. Just add recastci(rarea) to the options.
Unfortunately, none of the panel quantile regression commands play nicely with factor variables and margins. But we can hack something together. First, you can calculate the sums of coefficients with nlcom (instead of the more natural lincom, which lacks the post option), store them, and use Ben Jann's coefplot to graph them. Here's a toy example to give you the main idea, where we look at the effect of tenure for union members:
set more off
estimates clear
webuse nlswork, clear
gen tXu = tenure*union
local quantiles 1 5 10 25 50 75 90 95 99 // K quantiles that you care about
local models "" // names of K quantile models for coefplot to graph
local xlabel "" // for x-axis labels
local j=1 // counter for quantiles
foreach q of numlist `quantiles' {
qregpd ln_wage tenure union tXu, id(idcode) fix(year) quantile(`q')
nlcom (me_tu:_b[tenure]+_b[tXu]), post
estimates store me_tu`q'
local models `"`models' me_tu`q' || "'
local xlabel `"`xlabel' `j++' "Q{sub:`q'}""'
}
di "`models'"
di `"`xlabel'"'
coefplot `models' ///
, vertical bycoefs rescale(100) ///
xlab(none) xlabel(`xlabel', add) ///
title("Marginal Effect of Tenure for Union Members On Each Conditional Quantile Q{sub:{&tau}}", size(medsmall)) ///
ytitle("Wage Change in Percent" "") yline(0) ciopts(recast(rcap))
This makes a dromedary curve, which suggests that the effect of tenure is larger in the middle of the wage distribution than at the tails.
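On the standard error raised in the question: B1 + B3 is a linear combination of the coefficients, so its variance is Var(B1) + Var(B3) + 2*Cov(B1, B3), and that is what the delta method reduces to for a plain sum. The Python sketch below shows the arithmetic outside Stata. The function names are illustrative and the numbers are made up, not estimates from the auto or nlswork data.

```python
import math


def se_of_sum(var_b1, var_b3, cov_b1_b3):
    """Standard error of b1 + b3 from their variances and covariance."""
    variance = var_b1 + var_b3 + 2.0 * cov_b1_b3
    if variance < 0:
        raise ValueError("inputs do not form a valid covariance matrix")
    return math.sqrt(variance)


def normal_ci(b1, b3, se, z=1.96):
    """Approximate 95% interval for b1 + b3 using a normal critical value."""
    total = b1 + b3
    return total - z * se, total + z * se


# Hypothetical values for one quantile.
se = se_of_sum(var_b1=0.04, var_b3=0.09, cov_b1_b3=-0.02)
print(se, normal_ci(0.5, 0.2, se))
```

Repeating this for each quantile, with the variances and covariance taken from that quantile's estimated covariance matrix, gives the points and intervals to plot.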

Weka Document Clustering: Doc ID not visible in the output

I have to crawl Wikipedia to get the HTML pages of countries, and I have crawled them successfully. Now, to build clusters, I have to run KMeans, and I am using Weka for that.
I have used this code to convert my directory into arff format:
https://weka.wikispaces.com/file/view/TextDirectoryToArff.java
Here is its output: [screenshot of the generated ARFF file, not preserved]
Then I opened that file in Weka and applied the StringToWordVector filter with these parameters: [screenshot not preserved; the filter options also appear in the Relation line of the output below]
Then I performed Kmeans. The output I am getting is:
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
Relation: text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 28
Attributes: 1040
[list of attributes omitted]
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
Number of iterations: 2
Within cluster sum of squared errors: 1915.0448503841326
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(28) (22) (6)
====================================================================================
.
.
.
.
.
bolsheviks 0.3652 0.3044 0.5878
book 0.3229 0.3051 0.3883
border 0.4329 0.5509 0
border-left-style 0.4329 0.5509 0
border-left-width 0.3375 0.4295 0
border-spacing 0.3124 0.3304 0.2461
border-width 0.5128 0.2785 1.372
boundary 0.309 0.3007 0.3392
brazil 0.381 0.3744 0.4048
british 0.4387 0.2232 1.2288
brown 0.2645 0.2945 0.1545
cache-control=max-age=87840 0.4913 0.4866 0.5083
california 0.5383 0.5085 0.6478
called 0.4853 0.6177 0
camp 0.4591 0.5451 0.1437
canada 0.3176 0.3358 0.251
canadian 0.2976 0.1691 0.7688
capable 0.2475 0.315 0
capita 0.388 0.1188 1.375
carbon 0.3889 0.445 0.1834
caribbean 0.4275 0.5441 0
carlsbad 0.548 0.5339 0.5998
caspian 0.4737 0.5345 0.2507
category 0.2216 0.2821 0
censorship 0.2225 0.0761 0.7596
center 0.4829 0.4074 0.7598
central 0.211 0.0805 0.6898
century 0.2645 0.2041 0.4862
chad 0.3636 0.0979 1.3382
challenger 0.5008 0.6374 0
championship 0.6834 0.8697 0
championships 0.2891 0.1171 0.9197
characteristics 0.237 0 1.1062
charon 0.5643 0.4745 0.8934
china
.
.
.
.
.
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 22 ( 79%)
1 6 ( 21%)
How can I check which DocId is in which cluster? I have searched a lot but didn't find anything.
Also, is there any other good Java library for KMeans and agglomerative clustering?