How do I convert an attribute to the "date" type when opening a CSV data set in Weka? - weka

How do I properly use the ChangeDateFormat filter in Weka (Waikato Environment for Knowledge Analysis) when preprocessing files with a date attribute?
I have the following CSV file:
Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361
2004-08-20,50.316402,54.336334,50.062355,53.952770,53.952770,22942874
2004-08-23,55.168217,56.528118,54.321388,54.495735,54.495735,18342897
2004-08-24,55.412300,55.591629,51.591621,52.239197,52.239197,15319808
... and so on.
When I open it with Weka, it recognizes the first attribute as "Nominal", not "Date". Then, when I try to apply the ChangeDateFormat filter from filters → unsupervised → attribute to the "Date" attribute and click "Apply", Weka gives me an error:
Problem filtering instances: Chosen attribute not date.
However, there are no filters like "NominalToDate", only the "NominalToBinary" and "NominalToString", and there are no filters like "StringToDate".
Therefore, I had to rename this file to .arff and add the @attribute headers the following way:
@relation GOOG
@attribute Date date 'yyyy-MM-dd'
@attribute Open numeric
@attribute High numeric
@attribute Low numeric
@attribute Close numeric
@attribute AdjClose numeric
@attribute Volume numeric
@data
However, I didn't like the idea of tinkering with the files by hand, so I want to know how to use the ChangeDateFormat filter in this case.
How can I use the ChangeDateFormat filter to specify the date format of the imported files, and, if it is not possible with the ChangeDateFormat, what are the use cases for this filter?
As you see, the datetime in my dataset does not have the time part.

To import a CSV with a date attribute without time, turn on the "Invoke options dialog" checkbox in the "Open file..." dialog.
 
Then, in the "Options" dialog, specify the index of the date attribute and adjust the format accordingly by deleting the time part.
As a result, the date attribute in the imported dataset will have the "Timestamp" type.
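If you would rather script the manual ARFF conversion from the question than click through the GUI, here is a minimal Python sketch. It is not part of Weka: `csv_to_arff` and its parameters are names made up for illustration, and it assumes that every column other than the date column is numeric and that no value contains a comma.

```python
import csv
import io


def csv_to_arff(csv_text, relation, date_column="Date", date_format="yyyy-MM-dd"):
    """Convert CSV text to ARFF text, declaring one column as a date attribute.

    Assumption: all other columns are numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for name in header:
        # ARFF attribute names that contain spaces have to be quoted.
        quoted = f"'{name}'" if " " in name else name
        if name == date_column:
            lines.append(f"@attribute {quoted} date '{date_format}'")
        else:
            lines.append(f"@attribute {quoted} numeric")
    lines += ["", "@data"]
    # Skip blank rows; the data values are copied through unchanged.
    lines += [",".join(row) for row in data if row]
    return "\n".join(lines) + "\n"


sample = """Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361
2004-08-20,50.316402,54.336334,50.062355,53.952770,53.952770,22942874
"""
arff = csv_to_arff(sample, "GOOG")
print(arff)
```

The date pattern is passed through as-is, so it should match whatever pattern you would otherwise type into the header by hand.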

Related

Is it possible to merge two different set of dataset which consist of different attributes in weka?

I'm quite new in Weka.
I was wondering, is it possible for Weka to classify two different datasets that consist of different attributes?
Example:
Dataset A : @attributes {UserID, Tags, Descriptions}
@data
a,#user, writing books
Dataset B : @attributes {UserID, Longitude, Latitude, Dates}
@data
xyz ,7895231, 453221.1, 28.10.2012
Is it possible to merge Dataset A and B, which have different attributes, into one dataset in Weka? I was told that I can merge them manually in Excel before Weka classifies them, but I was wondering how Weka reads the data. Is it row by row? Is it logical to put them in this form (in Excel) and fill the gaps with the value 0?
Dataset AB : UserID, Tags, Descriptions, UserID, Longitude,
Latitude, Dates
a, #user, writing books, 0, 0,0
xyz, 0, 0 , 7895231, 453221.1, 28.10.2012
Yes. This is covered in this posting:
https://list.waikato.ac.nz/pipermail/wekalist/2009-April/043232.html
This also covers the situation in which you want to append two files (add instances).
This is done in the Weka Command Line Interface (CLI).
One trick to this is that there seems to be a line length limit, so move your files to the default directory (which seems to be Program Files/Weka-3-8), so you don't have a problem with long paths.
Suppose we have the file "merge A.arff" consisting of
@relation 'merge A'
@attribute UserID numeric
@attribute A1 {Joe,Bill,Larry}
@attribute A2 numeric
@attribute Aclass {pos,neg}
@data
1,Joe,17,pos
3,Joe,42,neg
5,Bill,8,neg
7,Larry,4,neg
and the file "merge B.arff" consisting of
@relation 'merge B'
@attribute BUserID numeric
@attribute Blong numeric
@attribute Blat numeric
@data
1,-180,42
3,-182,45
5,-179,36
7,-184,38
then if you open the CLI and type the following after the > prompt
java weka.core.Instances merge "merge A.arff" "merge B.arff"
the following will be dumped to the console:
@relation 'merge A_merge B'
@attribute UserID numeric
@attribute A1 {Joe,Bill,Larry}
@attribute A2 numeric
@attribute Aclass {pos,neg}
@attribute BUserID numeric
@attribute Blong numeric
@attribute Blat numeric
@data
1,Joe,17,pos,1,-180,42
3,Joe,42,neg,3,-182,45
5,Bill,8,neg,5,-179,36
7,Larry,4,neg,7,-184,38
For some reason, I'm having trouble piping this directly to another file, e.g.
java weka.core.Instances merge "merge A.arff" "merge B.arff" > "output.arff"
Either it's not creating the file, or I can't find where it's creating it. But one problem at a time!
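If the redirect keeps failing, the same side-by-side merge can be sketched outside Weka. The Python below is an illustration, not Weka's implementation: `split_arff` and `merge_side_by_side` are hypothetical helpers, rows are paired purely by position (as the console output above suggests the CLI does), and the parser only handles simple ARFF files without sparse or relational data.

```python
def split_arff(text):
    """Split ARFF text into (relation name, attribute lines, data rows)."""
    relation, attributes, rows, in_data = "", [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append(line)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            attributes.append(line)
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


def merge_side_by_side(a_text, b_text):
    """Append the attributes of B to A, pairing instances by row position."""
    rel_a, attrs_a, rows_a = split_arff(a_text)
    rel_b, attrs_b, rows_b = split_arff(b_text)
    if len(rows_a) != len(rows_b):
        raise ValueError("both files need the same number of instances")
    lines = [f"@relation '{rel_a}_{rel_b}'"] + attrs_a + attrs_b + ["@data"]
    lines += [f"{a},{b}" for a, b in zip(rows_a, rows_b)]
    return "\n".join(lines) + "\n"


merge_a = """@relation 'merge A'
@attribute UserID numeric
@attribute Aclass {pos,neg}
@data
1,pos
3,neg
"""
merge_b = """@relation 'merge B'
@attribute BUserID numeric
@attribute Blong numeric
@data
1,-180
3,-182
"""
print(merge_side_by_side(merge_a, merge_b))
```

Writing the returned text to a file with an ordinary `open(..., "w")` call sidesteps the CLI redirection question entirely.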

Converting nominal attribute to numeric value using Weka

Suppose a nominal attribute, Outlook, has three values: Sunny, Overcast, and Rainy. I want to convert the values of this attribute to numeric form, i.e. 1, 2, 3 (the order can change). I saw the NominalToBinary filter in Weka, but it creates three columns, and I don't want a separate column for each value. How can I do this using Weka?
In the ARFF, if you are using it, you can have a comment which specifies what the values of the "Outlook" attribute are.
For example, your ARFF can contain this comment at the top:
%% Numeric values for the "Outlook" Attribute
%% Sunny = 1
%% Overcast = 2
%% Rainy = 3
%% Windy = 4
Then you can define the attribute as:
@attribute Outlook {1,2,3,4}
I don't think there is a way to do this in the UI, but you can use a text editor to edit the ARFF itself.
For this you can use the "RenameNominalValues" filter under unsupervised → attribute.
Then under "selectedAttribute" type the attribute, and
under "valueReplacements" type Sunny:1,Overcast:2,Rainy:3,Windy:4
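If you preprocess the data outside Weka instead, the same replacement spec can be applied with a few lines of Python. This is only a sketch: `parse_replacements` and `recode_column` are made-up helper names, and the rows are assumed to be already split into lists of strings.

```python
def parse_replacements(spec):
    """Turn 'Sunny:1,Overcast:2' into {'Sunny': '1', 'Overcast': '2'}."""
    return dict(pair.split(":", 1) for pair in spec.split(","))


def recode_column(rows, column, mapping):
    """Return copies of the rows with the value at index `column` replaced."""
    return [row[:column] + [mapping[row[column]]] + row[column + 1:] for row in rows]


mapping = parse_replacements("Sunny:1,Overcast:2,Rainy:3,Windy:4")
rows = [["Sunny", "85"], ["Overcast", "83"], ["Rainy", "70"]]
for row in recode_column(rows, 0, mapping):
    print(",".join(row))
```

A value that is missing from the mapping raises a `KeyError`, which is deliberate here: it flags labels you forgot to list instead of silently passing them through.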

weka- replace null value in a nominal attribute with a string

I am cleaning a data set with Google OpenRefine and then trying to use it in Weka to do some cluster analysis. I am dealing with a nominal column that stores salary ranges.
I've specified the attribute as below:
@ATTRIBUTE Income {'0-30000','30000-50000','50000-75000','75000-150000','>150000'}
In the data set there are rows in which the 'Income' column is null and I suppose that is the reason why I get the error:
'nominal value not declared in header, read Token line 13'
Is there a way I can replace null values with a string (and then specify the string in the attribute)? If so, how do I specify it in the @ATTRIBUTE row?
Or would it be possible to include the null in the set of attributes?
Thanks
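For what it's worth, ARFF marks a missing value with a bare ?, which does not need to be listed among the nominal values. A minimal Python sketch for rewriting the data rows that way, assuming the nulls show up as empty fields or as the literal text null and that no value contains a comma (`mark_missing` is a hypothetical helper, not a Weka or OpenRefine function):

```python
def mark_missing(data_line, null_tokens=("", "null")):
    """Replace empty or 'null' fields in one ARFF data row with '?'."""
    fields = [field.strip() for field in data_line.split(",")]
    return ",".join("?" if field.lower() in null_tokens else field for field in fields)


print(mark_missing("34,,'0-30000'"))
print(mark_missing("51,NULL,'>150000'"))
```

Run it over the lines after @data only, so the header declarations are left untouched.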

Unable to determine structure as arff

I am trying to upload an arff file on weka but it is creating this problem:
Unable to determine structure as arff (Reason: java.io.IOException: } expected at end of enumeration, read Token[EOL], line 4)
@RELATION data1
@ATTRIBUTE attribute_0 {"T,"N,"A,"C,"V}
@ATTRIBUTE attribute_1 REAL
@ATTRIBUTE attribute_2 {""VRoot"",""0""",""1""",""Hide1"",1,10001",1",10002",10003",10004",10005",10006",10007",10008",10009",10010",10011",10012",10013",10014",10015",10016",10017",10018",10019",10020",10021",10022",10023",10024",10025",10026",10027",100
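Weka's ARFF reader treats REAL as a synonym for NUMERIC, so attribute_1 is probably not the problem; the error also points at line 4, not at the REAL line.
You can still try NUMERIC to rule it out.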
Also be careful with quotes. The parser may assume that " is used to quote strings, and your quotes do not match.
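A quick way to spot this kind of problem before loading the file is to check each declaration for unbalanced quotes. The Python below is a rough sketch with a made-up helper name (`check_declaration`); it simply counts quote characters, so it will misjudge lines that contain escaped quotes.

```python
def check_declaration(line):
    """Return a list of likely problems in one @attribute line with a {...} list."""
    problems = []
    if line.count('"') % 2 == 1:
        problems.append("odd number of double quotes")
    if line.count("'") % 2 == 1:
        problems.append("odd number of single quotes")
    if "{" in line and not line.rstrip().endswith("}"):
        problems.append("no closing } at end of line")
    return problems


print(check_declaration('@ATTRIBUTE attribute_0 {"T,"N,"A,"C,"V}'))
print(check_declaration("@ATTRIBUTE attribute_0 {T,N,A,C,V}"))
```

The first call reports the odd quote count; the second declaration, with the stray quotes removed, comes back clean.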

Weka GUI - TF-IDF is not calculated - Please Help For My Academic Work

I want to use the KNN algorithm with TF-IDF in the Weka GUI. First I ran the algorithm with the default settings. Then I set "IDFTransform" and "TFTransform" to "true" in the StringToWordVector filter and ran it again.
There is no difference between the two results.
Result1:
Correctly Classified Instances 1346 91.3781 %
Result2:
Correctly Classified Instances 1346 91.3781 %
My ".arff" file is as follows:
@relation et9
@attribute 'alis' real
@attribute 'banka' real
...
@attribute 'urun' real
@attribute 'class' {yes, no}
@data
70,0,0,0,3,0,40,0,3,1,0,0,20,0,717,2,4,0,0,0,2,5,0,0,0,717,0,1,0,30,yes
22,0,0,63,158,0,1,0,7,0,10,0,4,0,57,0,0,0,0,204,0,0,2,2,0,530,0,0,6,0,yes
0,0,1,0,0,0,0,0,2,1,3,0,0,0,0,0,5,0,0,0,0,0,2,1,0,0,0,0,0,0,no
...
I know that StringToWordVector is used for strings. But I want to calculate TF-IDF for this ".arff" file. How can I use my current ".arff" file and get a KNN result with TF-IDF?
(This is my academic work. Please help...)
According to Weka's documentation, the StringToWordVector filter "Converts String attributes into a set of attributes representing word occurrences [...]". Therefore, applying this filter to an arff file that does not contain any String attributes will have no effect on the dataset.
In order to make use of this filter, you will need to prepare an ARFF file that contains a String attribute, where the value of this attribute is the text for the given instance. For example, if each instance represents one tweet, then the text from the tweet would be the value for this String attribute. More information on working with text in Weka is documented here.
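If all you have is the count matrix, another option is to apply the weighting yourself before loading the data. The Python sketch below follows the formulas as I read them in the option descriptions of StringToWordVector: log(1 + fij) for the TF transform and fij * log(number of documents / number of documents containing word i) for the IDF transform. It is not Weka's code, and `tf_idf` and its parameters are hypothetical names.

```python
import math


def tf_idf(counts, tf_transform=True, idf_transform=True):
    """Weight a documents x terms count matrix (a list of equal-length rows)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Number of documents in which each term occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in counts:
        new_row = []
        for j, frequency in enumerate(row):
            value = float(frequency)
            if tf_transform:
                value = math.log(1 + value)
            if idf_transform and doc_freq[j] > 0:
                value *= math.log(n_docs / doc_freq[j])
            new_row.append(value)
        weighted.append(new_row)
    return weighted


counts = [[2, 0], [1, 3]]
for row in tf_idf(counts):
    print(row)
```

Keep the class column out of `counts`; only the word-count attributes should be weighted. Note that a term occurring in every document gets weight zero, because log(1) is 0.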