Apriori in WEKA - weka

I'm new to data mining and the WEKA tool.
For my academic project I have to work with bug reports, which are stored in SQL Server. I took the bug summary attribute and applied tokenization, stop-word removal, and stemming.
The stemmed words of each summary are stored in the database, separated by semicolons. Now I have to apply a frequent pattern mining algorithm and find frequent itemsets using WEKA. My ARFF file looks like this:
@relation ItemSets
@attribute bugid integer
@attribute summary string
@data
755113,enhanc;keep;log;recommend;share
759414,access;review;social
763806,allow;intrus;less;provid;shrunken;sidebar;social;specifi
767221,datacloneerror;deeper;dig;framework;jsm
771353,document;integr;provid;secur;social
785540,avail;determin;featur;method;provid;social;whether
785591,chat;dock;horizont;nest;overlap;scrollbar
787767,abus;api;implement;perform;runtim;warn;worker
After opening it in WEKA, I cannot start the process under the Associate tab of the WEKA Explorer: with Apriori selected, the Start button is disabled.
How can I find frequent itemsets on the summary attribute using WEKA? Any help will be appreciated. Thanks in advance!

The reason why Apriori is not available using your file in Weka is that Apriori only allows nominal attribute values. What sort of rules are you trying to find? Could you give an example of rules you want to obtain?
values_you_want_to_be_the_antecedent_part_of_your_rule ==> values_you_want_to_be_the_consequent_part_of_your_rule
Changing your attributes to nominal like this
@relation ItemSets
@attribute bugid {755113, 759414, 763806}
@attribute summary {'enhanc;keep;log;recommend;share', 'access;review;social', 'allow;intrus;less;provid;shrunken;sidebar;social;specifi'}
@data
755113,'enhanc;keep;log;recommend;share'
759414,'access;review;social'
763806,'allow;intrus;less;provid;shrunken;sidebar;social;specifi'
will only give you rules like
bugid=755113 1 ==> summary=enhanc;keep;log;recommend;share 1 <conf:(1)> lift:(3) lev:(0.22)
If you're looking for frequent itemsets among the summary words, the bugid is irrelevant and you can remove it from your file. Apriori is used to obtain association rules e.g. enhanc, keep gives log with support X and confidence Y. To find frequent itemsets, you need to restructure your data so that each summary word is an attribute with values true/false or true/missing, see this question.
Try the following file in Weka. Select Associate, choose Apriori, and double-click the white input field next to the Choose button. There, set outputItemSets to true. In the console output, you will see all frequent itemsets and all obtained rules with sufficient support.
@relation ItemSets
@attribute enhanc {true}
@attribute keep {true}
@attribute log {true}
@attribute recommend {true}
@attribute share {true}
@attribute access {true}
@attribute review {true}
@attribute social {true}
@attribute allow {true}
@attribute intrus {true}
@attribute less {true}
@attribute provid {true}
@attribute shrunken {true}
@attribute sidebar {true}
@attribute specifi {true}
@data
true,true,true,true,true,?,?,?,?,?,?,?,?,?,?
?,?,?,?,?,true,true,true,?,?,?,?,?,?,?
?,?,?,?,?,?,?,true,true,true,true,true,true,true,true
Each question mark (?) represents a missing value.
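Writing such a file by hand does not scale to a whole bug database, so here is a minimal Python sketch that generates it from the semicolon-separated summaries. The function name summaries_to_arff is hypothetical, and the sketch assumes every stem is a plain token that needs no ARFF quoting (no spaces, commas, or quotes).

```python
def summaries_to_arff(summaries, relation="ItemSets"):
    """Build ARFF text with one {true} attribute per distinct word."""
    # Split each ';'-separated summary into its stemmed words.
    rows = [[w for w in s.split(";") if w] for s in summaries]

    # Collect distinct words in first-seen order so the columns are stable.
    vocab, seen = [], set()
    for words in rows:
        for w in words:
            if w not in seen:
                seen.add(w)
                vocab.append(w)

    lines = ["@relation " + relation]
    lines += ["@attribute " + w + " {true}" for w in vocab]
    lines.append("@data")
    for words in rows:
        present = set(words)
        # 'true' where the word occurs, '?' (missing) everywhere else.
        lines.append(",".join("true" if w in present else "?" for w in vocab))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(summaries_to_arff([
        "enhanc;keep;log;recommend;share",
        "access;review;social",
        "allow;intrus;less;provid;shrunken;sidebar;social;specifi",
    ]))
```

For the three summaries above, this reproduces the file shown in this answer. With many distinct words the dense format gets wide, so treat this as a starting point only.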

Related

How do I convert an attribute to the "date" type when opening a CSV data set in Weka?

How to use the ChangeDateFormat filter in Weka (Waikato Environment for Knowledge Analysis) properly on preprocessing files with a date attribute?
I have the following CSV file:
Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361
2004-08-20,50.316402,54.336334,50.062355,53.952770,53.952770,22942874
2004-08-23,55.168217,56.528118,54.321388,54.495735,54.495735,18342897
2004-08-24,55.412300,55.591629,51.591621,52.239197,52.239197,15319808
... and so on.
When I open it with WEKA, it recognizes the first attribute as "Nominal", not "Date". Then, when I try to apply the ChangeDateFormat filter from filters → unsupervised → attributes on the "Date" attribute and click "Apply", Weka gives me an error:
Problem filtering instances: Chosen attribute not date.
However, there are no filters like "NominalToDate", only the "NominalToBinary" and "NominalToString", and there are no filters like "StringToDate".
Therefore, I had to rename this file to .arff and add the @attribute headers the following way:
@relation GOOG
@attribute Date date 'yyyy-MM-dd'
@attribute Open numeric
@attribute High numeric
@attribute Low numeric
@attribute Close numeric
@attribute AdjClose numeric
@attribute Volume numeric
@data
However, I don't like the idea of manually tinkering with the files, so I want to know how to use the ChangeDateFormat filter in this case.
How can I use the ChangeDateFormat filter to specify the date format of the imported files, and, if it is not possible with the ChangeDateFormat, what are the use cases for this filter?
As you see, the datetime in my dataset does not have the time part.
To import a CSV with a date attribute that has no time part, tick the "Invoke options dialog" checkbox in the "Open file..." dialog.
Then, in the "Options" dialog, specify the index of the date attribute and adjust the format accordingly by deleting the time part.
As a result, the date attribute in the imported dataset will have the "Timestamp" type.
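If you would rather script the manual ARFF conversion described in the question than go through the dialog, a minimal Python sketch could look like the following. The function name csv_to_arff is hypothetical, and the sketch assumes exactly one date column, that every other column is numeric, and that spaces can simply be dropped from column names (as in AdjClose).

```python
import csv
import io


def csv_to_arff(csv_text, relation, date_column="Date", date_format="yyyy-MM-dd"):
    """Turn CSV text into ARFF text, declaring one column as a date attribute."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    lines = ["@relation " + relation]
    for name in header:
        attr = name.replace(" ", "")  # e.g. 'Adj Close' -> 'AdjClose'
        if name == date_column:
            # Weka date attributes carry their format as a quoted pattern.
            lines.append("@attribute %s date '%s'" % (attr, date_format))
        else:
            # Assumption: every non-date column is numeric.
            lines.append("@attribute %s numeric" % attr)

    lines.append("@data")
    for row in reader:
        if row:  # skip blank lines
            lines.append(",".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = (
        "Date,Open,High,Low,Close,Adj Close,Volume\n"
        "2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361\n"
    )
    print(csv_to_arff(sample, "GOOG"))
```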

Is it possible to merge two different set of dataset which consist of different attributes in weka?

I'm quite new in Weka.
I was wondering whether Weka can classify two different datasets that consist of different attributes.
Example:
Dataset A : @attributes {UserID, Tags, Descriptions}
@data
a,#user, writing books
Dataset B : @attributes {UserID, Longitude, Latitude, Dates}
@data
xyz ,7895231, 453221.1, 28.10.2012
Is it possible to merge Dataset A and Dataset B, which have different attributes, into one dataset in Weka? I was told that I can merge them manually in Excel before Weka classifies them, but I was wondering how Weka reads the data. Is it row by row? Does it make sense to lay the data out like this (in Excel) and fill the gaps with the value 0?
Dataset AB : UserID, Tags, Descriptions, UserID, Longitude, Latitude, Dates
a, #user, writing books, 0, 0,0
xyz, 0, 0 , 7895231, 453221.1, 28.10.2012
Yes. This is covered in this posting:
https://list.waikato.ac.nz/pipermail/wekalist/2009-April/043232.html
This also covers the situation in which you want to append two files (add instances).
This is done in the Weka Command Line Interface (CLI).
One trick to this is that there seems to be a line length limit, so move your files to the default directory (which seems to be Program Files/Weka-3-8), so you don't have a problem with long paths.
Suppose we have the file "merge A.arff" consisting of
@relation 'merge A'
@attribute UserID numeric
@attribute A1 {Joe,Bill,Larry}
@attribute A2 numeric
@attribute Aclass {pos,neg}
@data
1,Joe,17,pos
3,Joe,42,neg
5,Bill,8,neg
7,Larry,4,neg
and the file "merge B.arff" consisting of
@relation 'merge B'
@attribute BUserID numeric
@attribute Blong numeric
@attribute Blat numeric
@data
1,-180,42
3,-182,45
5,-179,36
7,-184,38
then if you open the CLI and type the following after the > prompt
java weka.core.Instances merge "merge A.arff" "merge B.arff"
the following will be dumped to the console:
@relation 'merge A_merge B'
@attribute UserID numeric
@attribute A1 {Joe,Bill,Larry}
@attribute A2 numeric
@attribute Aclass {pos,neg}
@attribute BUserID numeric
@attribute Blong numeric
@attribute Blat numeric
@data
1,Joe,17,pos,1,-180,42
3,Joe,42,neg,3,-182,45
5,Bill,8,neg,5,-179,36
7,Larry,4,neg,7,-184,38
For some reason, I'm having trouble piping this directly to another file, e.g.
java weka.core.Instances merge "merge A.arff" "merge B.arff" > "output.arff"
Either it's not creating the file, or I can't find where it's creating it. But one problem at a time!
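If the redirect keeps failing, the same positional merge is easy to reproduce outside Weka. The following is a minimal Python sketch, not Weka's own implementation: the function name merge_arff is hypothetical, it assumes both files have the same number of data rows in the same order, and it assumes the attribute names are already unique across the two files (as with UserID and BUserID above).

```python
def merge_arff(text_a, text_b, relation="merged"):
    """Concatenate the attributes and data rows of two ARFF texts, row by row."""

    def split(text):
        # Separate '@attribute' declarations from the rows after '@data'.
        attrs, rows, in_data = [], [], False
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("%"):
                continue  # skip blank lines and ARFF comments
            if in_data:
                rows.append(line)
            elif line.lower().startswith("@attribute"):
                attrs.append(line)
            elif line.lower().startswith("@data"):
                in_data = True
        return attrs, rows

    attrs_a, rows_a = split(text_a)
    attrs_b, rows_b = split(text_b)
    if len(rows_a) != len(rows_b):
        raise ValueError("both files must contain the same number of instances")

    lines = ["@relation '%s'" % relation] + attrs_a + attrs_b + ["@data"]
    # Row i of A is joined with row i of B; no key matching is performed.
    lines += [a + "," + b for a, b in zip(rows_a, rows_b)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("merge A.arff") as fa, open("merge B.arff") as fb:
        merged = merge_arff(fa.read(), fb.read(), "merge A_merge B")
    with open("output.arff", "w") as out:
        out.write(merged)
```

Because the rows are paired purely by position, sort both files by user ID first if they might be in different orders.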

Converting nominal attribute to numeric value using Weka

Suppose the nominal attribute is Outlook, which has three values: Sunny, Overcast, and Rainy. I want to convert these values to numeric form, i.e. 1, 2, 3 (the order can change). I saw the NominalToBinary filter in Weka, but it creates three columns, and I don't want a separate column for each value. How can I do this using Weka?
In the ARFF, if you are using it, you can have a comment which specifies what the values of the "Outlook" attribute are.
For example, your ARFF can contain this comment at the top:
%% Numeric values for the "Outlook" Attribute
%% Sunny = 1
%% Overcast = 2
%% Rainy = 3
%% Windy = 4
Then you can define the attribute as -
@attribute Outlook {1,2,3,4}
I don't think there is a way to do this in the UI, but you can use a text editor to edit the ARFF itself.
For this you can use "RenameNominalValues" filter under unsupervised ---> attributes.
Then under "selectedAttribute" type the attribute and
under "valueReplacements" type as Sunny:1,Overcast:2,Rainy:3,Windy:4
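For attributes with many labels, typing that string by hand is error-prone. Here is a small Python sketch that builds it from a list of labels; the function name value_replacements is hypothetical, and the label:number format is simply the one shown above.

```python
def value_replacements(labels, start=1):
    """Build a 'label:number' list, e.g. 'Sunny:1,Overcast:2,Rainy:3'."""
    # Number the labels consecutively in the order given.
    return ",".join("%s:%d" % (label, i) for i, label in enumerate(labels, start))


if __name__ == "__main__":
    print(value_replacements(["Sunny", "Overcast", "Rainy"]))
```

Note that, as far as I can tell, renaming only changes the labels: the attribute is still nominal afterwards, with values named 1, 2, and 3.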

Unable to determine structure as arff

I am trying to load an ARFF file in Weka, but it fails with this error:
Unable to determine structure as arff (Reason:
java.io.IOException: } expected at end of enumeration, read Token[EOL],
line 4)
@RELATION data1
@ATTRIBUTE attribute_0 {"T,"N,"A,"C,"V}
@ATTRIBUTE attribute_1 REAL
@ATTRIBUTE attribute_2 {""VRoot"",""0""",""1""",""Hide1"",1,10001",1",10002",10003",10004",10005",10006",10007",10008",10009",10010",10011",10012",10013",10014",10015",10016",10017",10018",10019",10020",10021",10022",10023",10024",10025",10026",10027",100
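According to the ARFF format documentation, REAL is treated the same as NUMERIC, so attribute_1 is not the cause of this error (you can still write NUMERIC for consistency).
The quotes are the problem. The parser treats " as a string delimiter, and the quotes in your nominal value lists do not match, so it reaches the end of the line without finding the closing }.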
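To find the offending lines in a large header, a rough check for odd quote counts and missing braces can help. This is a heuristic Python sketch, not Weka's parser: the function name check_nominal_line is hypothetical, and it ignores escaped quotes inside values.

```python
def check_nominal_line(line):
    """Return a list of likely problems in one '@attribute ... {...}' line."""
    problems = []
    # A quoted value needs an opening and a closing quote, so counts must be even.
    if line.count('"') % 2 != 0:
        problems.append("unbalanced double quotes")
    if line.count("'") % 2 != 0:
        problems.append("unbalanced single quotes")
    # A nominal value list must be closed on the same line.
    if "{" in line and not line.rstrip().endswith("}"):
        problems.append("missing closing brace")
    return problems


if __name__ == "__main__":
    print(check_nominal_line('@ATTRIBUTE attribute_0 {"T,"N,"A,"C,"V}'))
    print(check_nominal_line('@ATTRIBUTE attribute_0 {T,N,A,C,V}'))
```

An even quote count does not prove the line is valid, so use this only to narrow down where to look.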

Weka GUI - TF-IDF is not calculated - Please Help For My Academic Work

I want to use the KNN algorithm with TF-IDF in the WEKA GUI. First, I ran the algorithm with the default settings. Then I set "IDFTransform" and "TFTransform" to "true" in the StringToWordVector filter and ran it again.
There is no difference between the two results.
Result1:
Correctly Classified Instances 1346 91.3781 %
Result2:
Correctly Classified Instances 1346 91.3781 %
My ".arff" file is as follows:
@relation et9
@attribute 'alis' real
@attribute 'banka' real
...
@attribute 'urun' real
@attribute 'class' {yes, no}
@data
70,0,0,0,3,0,40,0,3,1,0,0,20,0,717,2,4,0,0,0,2,5,0,0,0,717,0,1,0,30,yes
22,0,0,63,158,0,1,0,7,0,10,0,4,0,57,0,0,0,0,204,0,0,2,2,0,530,0,0,6,0,yes
0,0,1,0,0,0,0,0,2,1,3,0,0,0,0,0,5,0,0,0,0,0,2,1,0,0,0,0,0,0,no
...
I know that StringToWordVector is used for strings. But I want to calculate TF-IDF for this ".arff" file. How can I use my current ".arff" file and have KNN algorithm result with TF-IDF?
(This is my academic work. Please help...)
According to Weka's documentation, the StringToWordVector filter "Converts String attributes into a set of attributes representing word occurrences [...]". Therefore, applying this filter to an arff file that does not contain any String attributes will have no effect on the dataset.
To make use of this filter, you will need to prepare an ARFF file that contains a String attribute whose value is the text of the given instance. For example, if each instance represents one tweet, then the text of the tweet would be the value of this String attribute. More information on working with text in Weka is documented here.
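If the original text is no longer available and you only have the word counts, you could apply the weighting yourself before loading the data into Weka. The Python sketch below follows the formulas given in the StringToWordVector option descriptions, log(1 + fij) for TFTransform and fij * log(number of documents / number of documents containing word i) for IDFTransform. The function name tf_idf is hypothetical, and applying the TF step before the IDF step is my assumption about the order.

```python
import math


def tf_idf(counts, tf_transform=True, idf_transform=True):
    """Weight a documents-by-words count matrix (list of lists of numbers)."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Number of documents in which each word occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    result = []
    for row in counts:
        new_row = []
        for j, f in enumerate(row):
            value = float(f)
            if tf_transform:
                value = math.log(1.0 + value)  # dampen large raw counts
            if idf_transform and doc_freq[j] > 0:
                value *= math.log(n_docs / doc_freq[j])  # down-weight common words
            new_row.append(value)
        result.append(new_row)
    return result


if __name__ == "__main__":
    for row in tf_idf([[2, 0], [1, 1]]):
        print(row)
```

Pass only the word-count columns to this function and keep the class column aside, then write the weighted values back into the ARFF data section.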