How to use the ChangeDateFormat filter in Weka (Waikato Environment for Knowledge Analysis) properly on preprocessing files with a date attribute?
I have the following CSV file:
Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361
2004-08-20,50.316402,54.336334,50.062355,53.952770,53.952770,22942874
2004-08-23,55.168217,56.528118,54.321388,54.495735,54.495735,18342897
2004-08-24,55.412300,55.591629,51.591621,52.239197,52.239197,15319808
... and so on.
When I open it in Weka, it recognizes the first attribute as "Nominal", not "Date". Then, when I try to apply the ChangeDateFormat filter from filters → unsupervised → attribute to the "Date" attribute and click "Apply", Weka gives me an error:
Problem filtering instances: Chosen attribute not date.
However, there are no filters like "NominalToDate", only the "NominalToBinary" and "NominalToString", and there are no filters like "StringToDate".
Therefore, I had to rename this file to .arff and add the @attribute headers in the following way:
@relation GOOG
@attribute Date date 'yyyy-MM-dd'
@attribute Open numeric
@attribute High numeric
@attribute Low numeric
@attribute Close numeric
@attribute AdjClose numeric
@attribute Volume numeric
@data
However, I don't like the idea of manually tinkering with the files, so I want to know how to use the ChangeDateFormat filter in this case.
How can I use the ChangeDateFormat filter to specify the date format of the imported files, and, if it is not possible with the ChangeDateFormat, what are the use cases for this filter?
As you see, the datetime in my dataset does not have the time part.
To import a CSV with a date attribute without time, turn on the "Invoke options dialog" checkbox in the "Open file..." dialog.
Then, in the "Options" dialog, specify the index of the date attribute and adjust the format accordingly by deleting the time part.
As a result, the date attribute in the imported dataset will have the "Timestamp" type.
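If the GUI route is not convenient, the manual ARFF conversion described in the question can also be scripted. The following is only a minimal sketch using the plain JDK, not a Weka API: the class name CsvToArff is hypothetical, and it assumes the first column is a date in yyyy-MM-dd form, every other column is numeric, and no field contains a quoted comma.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Weka): builds ARFF text from CSV lines.
// Assumes column 0 is a yyyy-MM-dd date and all other columns are numeric.
final class CsvToArff {

    static String convert(String relation, List<String> csvLines) {
        String[] names = csvLines.get(0).split(",");
        StringBuilder arff = new StringBuilder();
        arff.append("@relation ").append(relation).append('\n');
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim();
            // Attribute names that contain spaces have to be quoted in ARFF.
            if (name.contains(" ")) {
                name = "'" + name + "'";
            }
            String type = (i == 0) ? "date 'yyyy-MM-dd'" : "numeric";
            arff.append("@attribute ").append(name).append(' ').append(type).append('\n');
        }
        arff.append("@data\n");
        // Copy the data rows unchanged, skipping blank lines.
        for (String row : csvLines.subList(1, csvLines.size())) {
            if (!row.trim().isEmpty()) {
                arff.append(row.trim()).append('\n');
            }
        }
        return arff.toString();
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
            "Date,Open,High,Low,Close,Adj Close,Volume",
            "2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361");
        System.out.print(convert("GOOG", csv));
    }
}
```

Unlike the hand-written header, this keeps the original "Adj Close" column name by quoting it instead of renaming it.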
I want to create a custom column in Power-Query-Editor, and I have two different column types:
Number RequestedDate
110 03.12.2019
100 30.04.2020
The new column should look like:
Code
10003.12.2019
10030.04.2020
How can I do this? The code [Number]&[RequestedDate] gives an error.
One possible solution is to use Text.From, which converts both the numeric value and the date value to text before they are concatenated:
=Text.From([Number])&Text.From([RequestedDate])
That should work in "create custom column".
I am cleaning a data set with Google OpenRefine and then trying to use it in Weka to do some cluster analysis. I am dealing with a nominal column that stores salary ranges.
I've specified the attribute as below:
@ATTRIBUTE Income {'0-30000','30000-50000','50000-75000','75000-150000','>150000'}
In the data set there are rows in which the 'Income' column is null and I suppose that is the reason why I get the error:
'nominal value not declared in header, read Token line 13'
Is there a way I can replace null values with a string (and then specify the string in the attribute)? If so, how do I specify it in the @ATTRIBUTE row?
Or would it be possible to include the null in the set of attributes?
Thanks
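One way to approach this: ARFF marks a missing value with a single question mark (?), so the null rows do not need an extra label in the @ATTRIBUTE declaration. The sketch below is only an illustration in plain Java, not a Weka or OpenRefine feature. The class name ArffMissingValues is hypothetical, and it assumes the nulls were exported as empty fields or as the literal text null, and that no field contains a quoted comma.

```java
// Hypothetical helper: rewrites one ARFF data row so that missing fields become '?'.
final class ArffMissingValues {

    static String markMissing(String dataRow) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = dataRow.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            String value = fields[i].trim();
            if (value.isEmpty() || value.equalsIgnoreCase("null")) {
                fields[i] = "?";
            }
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // The second field is empty, so it is replaced with '?'.
        System.out.println(markMissing("34,,yes"));
    }
}
```

Only the rows after the data marker should be passed through markMissing; the header lines stay as they are.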
I am currently writing some code to analyse the mushrooms data from UCI using Weka. I am trying to get the values (i.e. coefficients) of the attributes, but the attribute name is truncated (indicated by the "..."), so I am unable to get the full set of coefficients from the attributes.
e.g.
@attribute -0.251a=e+0.242m=k+0.241n=k-0.224t=p+0.213f=f... numeric
Any help would be greatly appreciated.
I believe your attribute names are being truncated because of an option in the PCA filter.
-A
Maximum number of attributes to include in
transformed attribute names.
(-1 = include all, default: 5)
Using the following code, I change the value of this option to -1 and print an attribute name from the transformed data.
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

Instances originalTrain = ... // load the training data
PrincipalComponents pca = new PrincipalComponents(); // new PCA filter
pca.setMaximumAttributeNames(-1); // -1 = include all attributes in the transformed names
pca.setInputFormat(originalTrain); // inform the filter about the dataset structure
Instances newData = Filter.useFilter(originalTrain, pca); // apply the filter
System.out.println(newData.attribute(0).name()); // look at the new, untruncated name
An example of the obviously untruncated attribute name is (scroll to view):
0.257stalksurfacebelowring=k+0.256stalksurfaceabovering=k+0.234ringtype=l+0.231odor=f-0.215ringtype=p-0.212stalksurfaceabovering=s+0.206sporeprintcolor=h-0.195stalksurfacebelowring=s+0.185bruises+0.18 stalkroot=b-0.176stalkcolorbelowring=w-0.175stalkcolorabovering=w-0.173odor=n-0.139sporeprintcolor=n-0.134sporeprintcolor=k+0.133habitat=p+0.133gillcolor=b+0.13 stalkcolorbelowring=b+0.13 stalkcolorabovering=b+0.129population=v+0.128stalkcolorabovering=n-0.125population=s-0.124stalkroot=e+0.121stalkcolorbelowring=n-0.119capcolor=w+0.119stalkcolorbelowring=p+0.119stalkcolorabovering=p-0.11gillspacing-0.105stalkroot=c-0.101gillcolor=n+0.094sporeprintcolor=w-0.087capshape=b-0.085gillcolor=k-0.082odor=l-0.082odor=a-0.082habitat=m+0.08 capcolor=y-0.08gillcolor=w+0.078gillcolor=h-0.076population=n-0.073habitat=g-0.072gillsize+0.068odor=y+0.068odor=s-0.067population=a-0.065capsurface=s-0.064odor=p+0.063gillcolor=g-0.059stalksurfaceabovering=f+0.057capsurface=y-0.057ringnumber=t-0.057stalksurfacebelowring=f+0.055ringnumber=o+0.051population=y-0.05habitat=u-0.048stalkcolorabovering=o-0.048stalkcolorbelowring=o+0.047veilcolor=w-0.046population=c+0.046capshape=k+0.046ringtype=e-0.046gillattachment-0.045stalkcolorabovering=g-0.045stalkcolorbelowring=g+0.043capcolor=e-0.041stalkroot=r-0.039gillcolor=u+0.039capcolor=g+0.034habitat=l-0.034veilcolor=n-0.034veilcolor=o-0.033habitat=w-0.031capcolor=p-0.031odor=c-0.031stalksurfacebelowring=y-0.031sporeprintcolor=r+0.03 
capshape=f-0.029capcolor=n-0.028gillcolor=o-0.024stalkshape-0.024sporeprintcolor=o-0.024sporeprintcolor=y-0.024sporeprintcolor=b-0.024gillcolor=y-0.023gillcolor=e-0.023capcolor=b-0.023stalkcolorabovering=e-0.023stalkcolorbelowring=e-0.019gillcolor=r-0.018capshape=s-0.018sporeprintcolor=u-0.015capshape=x+0.012habitat=d+0.009gillcolor=p-0.006capsurface=g+0.005capsurface=f-0.004capshape=c+0.003stalkcolorbelowring=y-0.003stalkcolorabovering=y-0.003veilcolor=y+0.001stalksurfaceabovering=y+0.001capcolor=u+0.001capcolor=r-0.001capcolor=c+0 stalkcolorabovering=c+0 odor=m+0 ringtype=n+0 stalkcolorbelowring=c+0 ringnumber=n+0 ringtype=f
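Once the full name is available, the coefficients can be pulled out of the string. The following is a rough sketch in plain Java rather than anything provided by Weka. The class name PcaNameParser is hypothetical, and it assumes that every term is a signed decimal followed by a name that starts with a letter and does not itself contain a plus or minus sign directly followed by a digit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a PCA attribute name into (term name, coefficient) pairs.
final class PcaNameParser {

    // A signed number, optional whitespace, then a name that runs up to
    // the next signed number or to the end of the string.
    private static final Pattern TERM =
        Pattern.compile("([+-]?\\d+(?:\\.\\d+)?)\\s*([A-Za-z].*?)(?=[+-]\\d|$)");

    static Map<String, Double> parse(String attributeName) {
        // LinkedHashMap keeps the terms in the order they appear in the name.
        Map<String, Double> coefficients = new LinkedHashMap<>();
        Matcher m = TERM.matcher(attributeName);
        while (m.find()) {
            coefficients.put(m.group(2).trim(), Double.parseDouble(m.group(1)));
        }
        return coefficients;
    }

    public static void main(String[] args) {
        Map<String, Double> c = parse("0.257stalksurfacebelowring=k+0.185bruises+0.18 stalkroot=b");
        for (Map.Entry<String, Double> e : c.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

This should be run on the untruncated name; on a truncated one the trailing "..." ends up attached to the last term name.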
I want to use the KNN algorithm with TF-IDF in the Weka GUI. First, I ran the algorithm with the default settings. Then I set "IDFTransform" and "TFTransform" to "true" in the StringToWordVector filter and ran it again.
There is no difference between the two results.
Result1:
Correctly Classified Instances 1346 91.3781 %
Result2:
Correctly Classified Instances 1346 91.3781 %
My ".arff" file is as follows:
@relation et9
@attribute 'alis' real
@attribute 'banka' real
...
@attribute 'urun' real
@attribute 'class' {yes, no}
@data
70,0,0,0,3,0,40,0,3,1,0,0,20,0,717,2,4,0,0,0,2,5,0,0,0,717,0,1,0,30,yes
22,0,0,63,158,0,1,0,7,0,10,0,4,0,57,0,0,0,0,204,0,0,2,2,0,530,0,0,6,0,yes
0,0,1,0,0,0,0,0,2,1,3,0,0,0,0,0,5,0,0,0,0,0,2,1,0,0,0,0,0,0,no
...
I know that StringToWordVector is used for strings, but I want to calculate TF-IDF for this ".arff" file. How can I use my current ".arff" file and get a KNN result with TF-IDF?
(This is my academic work. Please help...)
According to Weka's documentation, the StringToWordVector filter "Converts String attributes into a set of attributes representing word occurrences [...]". Therefore, applying this filter to an arff file that does not contain any String attributes will have no effect on the dataset.
To make use of this filter, you will need to prepare an arff file that contains a String attribute, where the value of this attribute is the text for the given instance. For example, if each instance represents one tweet, then the text of the tweet would be the value of this String attribute. More information on working with text in Weka is documented here.
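If the original text is no longer available and only the count matrix from the question exists, the weighting can also be computed directly from the counts. The sketch below is plain Java, not Weka code. The class name TfIdf is hypothetical, and the weighting log(1 + count) * log(N / df) is one common TF-IDF variant chosen here as an assumption; it is not guaranteed to match what StringToWordVector produces.

```java
// Hypothetical helper: applies a log-scaled TF-IDF weighting to a matrix of raw term counts.
final class TfIdf {

    // counts[d][t] is the raw count of term t in document (instance) d.
    static double[][] transform(double[][] counts) {
        int docs = counts.length;
        int terms = (docs == 0) ? 0 : counts[0].length;

        // idf[t] = log(N / number of documents that contain term t)
        double[] idf = new double[terms];
        for (int t = 0; t < terms; t++) {
            int docsWithTerm = 0;
            for (int d = 0; d < docs; d++) {
                if (counts[d][t] > 0) {
                    docsWithTerm++;
                }
            }
            idf[t] = (docsWithTerm == 0) ? 0.0 : Math.log((double) docs / docsWithTerm);
        }

        // weighted[d][t] = log(1 + count) * idf[t]
        double[][] weighted = new double[docs][terms];
        for (int d = 0; d < docs; d++) {
            for (int t = 0; t < terms; t++) {
                weighted[d][t] = Math.log(1.0 + counts[d][t]) * idf[t];
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        double[][] weighted = transform(new double[][] {{70, 0, 3}, {22, 63, 0}, {0, 0, 1}});
        System.out.println(java.util.Arrays.deepToString(weighted));
    }
}
```

The weighted values would then replace the raw counts in the data rows (the class column is left untouched) before running KNN on the file.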