Converting nominal attribute to numeric value using Weka - weka

Suppose a nominal attribute Outlook has three values: Sunny, Overcast and Rainy. I want to convert the values of the Outlook attribute to numeric form, i.e. 1, 2, 3 (the order can change). I saw the NominalToBinary filter in Weka, but it creates three columns, and I don't want a separate column for each value. How can I do this using Weka?

If you are using ARFF, you can add a comment that specifies what the values of the "Outlook" attribute mean.
For example, your ARFF can contain this comment at the top -
%% Numeric values for the "Outlook" Attribute
%% Sunny = 1
%% Overcast = 2
%% Rainy = 3
%% Windy = 4
Then you can define the attribute as -
@attribute Outlook {1,2,3,4}
I don't think there is a way to do this in the UI, but you can use a text editor to edit the ARFF itself.

For this you can use the "RenameNominalValues" filter under unsupervised → attribute.
Under "selectedAttributes", type the attribute, and
under "valueReplacements", type Sunny:1,Overcast:2,Rainy:3,Windy:4
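If the attribute has many labels, typing the replacement string by hand gets tedious. Here is a minimal sketch in plain Java (no Weka dependency) that builds a string in the same label:code shape as the one above. The class and method names are made up for illustration, and numbering labels from 1 in list order is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

class ValueReplacements {
    // Hypothetical helper: joins "label:code" pairs with commas,
    // numbering the labels from 1 in the order they are given.
    static String build(List<String> labels) {
        StringJoiner joiner = new StringJoiner(",");
        for (int i = 0; i < labels.size(); i++) {
            joiner.add(labels.get(i) + ":" + (i + 1));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // prints Sunny:1,Overcast:2,Rainy:3
        System.out.println(build(Arrays.asList("Sunny", "Overcast", "Rainy")));
    }
}
```

The resulting string can then be pasted into the "valueReplacements" field.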

Related

How do I convert an attribute to the "date" type when opening a CSV data set in Weka?

How to use the ChangeDateFormat filter in Weka (Waikato Environment for Knowledge Analysis) properly on preprocessing files with a date attribute?
I have the following CSV file:
Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.813290,51.835709,47.800831,49.982655,49.982655,44871361
2004-08-20,50.316402,54.336334,50.062355,53.952770,53.952770,22942874
2004-08-23,55.168217,56.528118,54.321388,54.495735,54.495735,18342897
2004-08-24,55.412300,55.591629,51.591621,52.239197,52.239197,15319808
... and so on.
When I open it with WEKA, it recognizes the first attribute as "Nominal", not "Date". Then, when I try to apply the ChangeDateFormat filter from filters → unsupervised → attributes on the "Date" attribute and click "Apply", Weka gives me an error:
Problem filtering instances: Chosen attribute not date.
However, there are no filters like "NominalToDate", only the "NominalToBinary" and "NominalToString", and there are no filters like "StringToDate".
Therefore, I had to rename this file to .arff and add the @attribute headers in the following way:
@relation GOOG
@attribute Date date 'yyyy-MM-dd'
@attribute Open numeric
@attribute High numeric
@attribute Low numeric
@attribute Close numeric
@attribute AdjClose numeric
@attribute Volume numeric
@data
However, I don't like the idea of manually tinkering with the files, so I want to know how to use the ChangeDateFormat filter in this case.
How can I use the ChangeDateFormat filter to specify the date format of the imported files, and, if it is not possible with the ChangeDateFormat, what are the use cases for this filter?
As you see, the datetime in my dataset does not have the time part.
To import a CSV with a date attribute that has no time part, tick the "Invoke options dialog" checkbox in the "Open file..." dialog.
Then, in the "Options" dialog, specify the index of the date attribute and adjust the format accordingly by deleting the time part.
As a result, the date attribute in the imported dataset will have the "Timestamp" type.
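To see why the time part has to be deleted: as far as I know, the loader's date format uses Java's SimpleDateFormat pattern syntax, and its default is yyyy-MM-dd'T'HH:mm:ss (treat both as assumptions about your Weka version). The plain-Java sketch below, with a made-up helper name, checks one value from the question's CSV against both patterns.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class DateFormatCheck {
    // Hypothetical helper: true if the value parses under the given pattern.
    static boolean parses(String pattern, String value) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject out-of-range fields such as month 13
        try {
            format.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "2004-08-19"; // first date in the question's CSV
        // Date-only pattern: matches the data.
        System.out.println(parses("yyyy-MM-dd", value));
        // Pattern with a time part: the value ends before the literal 'T'.
        System.out.println(parses("yyyy-MM-dd'T'HH:mm:ss", value));
    }
}
```

The first check succeeds and the second fails, which matches the advice to drop the time part from the format.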

Custom column in Power-Query with mixed data types

I want to create a custom column in Power-Query-Editor, and I have two different column types:
Number RequestedDate
110 03.12.2019
100 30.04.2020
The new column should look like:
Code
10003.12.2019
10030.04.2020
How can I do this? The code [Number]&[RequestedDate] gives an error.
One possible solution is to use Text.From, which converts both the number and the date to text:
=Text.From([Number])&Text.From([RequestedDate])
That should work in "create custom column".

weka- replace null value in a nominal attribute with a string

I am cleaning a data set with Google OpenRefine and then trying to use it in Weka to do some cluster analysis. I am dealing with a nominal column that stores salary ranges.
I've specified the attribute as below
@ATTRIBUTE Income {'0-30000','30000-50000','50000-75000','75000-150000','>150000'}
In the data set there are rows in which the 'Income' column is null and I suppose that is the reason why I get the error:
'nominal value not declared in header, read Token line 13'
Is there a way I can replace null values with a string (and then specify the string in the attribute)? If so, how do I specify it in the @ATTRIBUTE row?
Or would it be possible to include the null in the set of attributes?
Thanks
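One option, not taken from this thread: ARFF marks a missing value with a single question mark, so the nulls can be rewritten as ? before loading instead of being declared as an extra nominal value. The sketch below is plain Java with made-up names. It assumes the empty cells show up as empty fields or as the literal text null, and that no field contains a comma inside quotes.

```java
class MissingValueMarker {
    // Hypothetical helper: replaces empty or "null" fields in one ARFF data
    // line with "?", the ARFF marker for a missing value.
    // Assumption: fields never contain commas inside quotes.
    static String markMissing(String dataLine) {
        String[] fields = dataLine.split(",", -1); // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            String trimmed = fields[i].trim();
            if (trimmed.isEmpty() || trimmed.equalsIgnoreCase("null")) {
                fields[i] = "?";
            }
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // prints 34,?,yes
        System.out.println(markMissing("34,,yes"));
    }
}
```

Only the lines after @data should be passed through such a helper. The header stays as declared.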

Extracting full Attribute Name from Weka PCA

I am currently writing some code to analyse the mushrooms data from UCI using Weka. I am trying to get the values (i.e. coefficients) of the attributes, but the attribute name is truncated (indicated by the "..."), so I am unable to get the full set of coefficients from the attributes.
e.g.
@attribute -0.251a=e+0.242m=k+0.241n=k-0.224t=p+0.213f=f... numeric
Any help would be greatly appreciated.
I believe your attribute names are being truncated because of an option in the PCA filter.
-A
Maximum number of attributes to include in
transformed attribute names.
(-1 = include all, default: 5)
Using the following code I change the value of this option to -1 and print an attribute name from the transformed data.
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

Instances originalTrain = ...; // load the training data
PrincipalComponents pca = new PrincipalComponents(); // new PCA filter
pca.setMaximumAttributeNames(-1); // -1 = include all attributes in the names
pca.setInputFormat(originalTrain); // inform the filter about the dataset
Instances newData = Filter.useFilter(originalTrain, pca); // apply the filter
System.out.println(newData.attribute(0).name()); // look at the new name
An example of the untruncated attribute name is:
0.257stalksurfacebelowring=k+0.256stalksurfaceabovering=k+0.234ringtype=l+0.231odor=f-0.215ringtype=p-0.212stalksurfaceabovering=s+0.206sporeprintcolor=h-0.195stalksurfacebelowring=s+0.185bruises+0.18 stalkroot=b-0.176stalkcolorbelowring=w-0.175stalkcolorabovering=w-0.173odor=n-0.139sporeprintcolor=n-0.134sporeprintcolor=k+0.133habitat=p+0.133gillcolor=b+0.13 stalkcolorbelowring=b+0.13 stalkcolorabovering=b+0.129population=v+0.128stalkcolorabovering=n-0.125population=s-0.124stalkroot=e+0.121stalkcolorbelowring=n-0.119capcolor=w+0.119stalkcolorbelowring=p+0.119stalkcolorabovering=p-0.11gillspacing-0.105stalkroot=c-0.101gillcolor=n+0.094sporeprintcolor=w-0.087capshape=b-0.085gillcolor=k-0.082odor=l-0.082odor=a-0.082habitat=m+0.08 capcolor=y-0.08gillcolor=w+0.078gillcolor=h-0.076population=n-0.073habitat=g-0.072gillsize+0.068odor=y+0.068odor=s-0.067population=a-0.065capsurface=s-0.064odor=p+0.063gillcolor=g-0.059stalksurfaceabovering=f+0.057capsurface=y-0.057ringnumber=t-0.057stalksurfacebelowring=f+0.055ringnumber=o+0.051population=y-0.05habitat=u-0.048stalkcolorabovering=o-0.048stalkcolorbelowring=o+0.047veilcolor=w-0.046population=c+0.046capshape=k+0.046ringtype=e-0.046gillattachment-0.045stalkcolorabovering=g-0.045stalkcolorbelowring=g+0.043capcolor=e-0.041stalkroot=r-0.039gillcolor=u+0.039capcolor=g+0.034habitat=l-0.034veilcolor=n-0.034veilcolor=o-0.033habitat=w-0.031capcolor=p-0.031odor=c-0.031stalksurfacebelowring=y-0.031sporeprintcolor=r+0.03 
capshape=f-0.029capcolor=n-0.028gillcolor=o-0.024stalkshape-0.024sporeprintcolor=o-0.024sporeprintcolor=y-0.024sporeprintcolor=b-0.024gillcolor=y-0.023gillcolor=e-0.023capcolor=b-0.023stalkcolorabovering=e-0.023stalkcolorbelowring=e-0.019gillcolor=r-0.018capshape=s-0.018sporeprintcolor=u-0.015capshape=x+0.012habitat=d+0.009gillcolor=p-0.006capsurface=g+0.005capsurface=f-0.004capshape=c+0.003stalkcolorbelowring=y-0.003stalkcolorabovering=y-0.003veilcolor=y+0.001stalksurfaceabovering=y+0.001capcolor=u+0.001capcolor=r-0.001capcolor=c+0 stalkcolorabovering=c+0 odor=m+0 ringtype=n+0 stalkcolorbelowring=c+0 ringnumber=n+0 ringtype=f
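Once the full name is available, the coefficients can be pulled out of the string. Here is a minimal sketch in plain Java with made-up names. It assumes every term is a signed number followed by a label that starts with a letter and contains no + or - characters, which holds for the names shown here but would break on hyphenated attribute names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PcaNameParser {
    // A signed number, optional whitespace, then a label running up to the
    // next '+' or '-'. Assumption: labels contain neither character.
    private static final Pattern TERM =
            Pattern.compile("([+-]?\\d+(?:\\.\\d+)?)\\s*([A-Za-z][^+\\-]*)");

    // Hypothetical helper: maps each label (e.g. "odor=f") to its coefficient,
    // keeping the order in which the terms appear in the name.
    static Map<String, Double> parse(String attributeName) {
        Map<String, Double> coefficients = new LinkedHashMap<>();
        Matcher matcher = TERM.matcher(attributeName);
        while (matcher.find()) {
            coefficients.put(matcher.group(2).trim(),
                    Double.parseDouble(matcher.group(1)));
        }
        return coefficients;
    }

    public static void main(String[] args) {
        Map<String, Double> parsed = parse("-0.251a=e+0.242m=k+0.18 n=k");
        parsed.forEach((label, value) -> System.out.println(label + " -> " + value));
    }
}
```

In practice the input would be newData.attribute(0).name() from the snippet above.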

Weka GUI - TF-IDF is not calculated - Please Help For My Academic Work

I want to use the KNN algorithm with TF-IDF in the WEKA GUI. First I ran the algorithm with default settings. Then I set "IDFTransform" and "TFTransform" to "true" in the StringToWordVector filter and ran it again.
There is no difference between the two results.
Result1:
Correctly Classified Instances 1346 91.3781 %
Result2:
Correctly Classified Instances 1346 91.3781 %
My ".arff" file is as follows:
@relation et9
@attribute 'alis' real
@attribute 'banka' real
...
@attribute 'urun' real
@attribute 'class' {yes, no}
@data
70,0,0,0,3,0,40,0,3,1,0,0,20,0,717,2,4,0,0,0,2,5,0,0,0,717,0,1,0,30,yes
22,0,0,63,158,0,1,0,7,0,10,0,4,0,57,0,0,0,0,204,0,0,2,2,0,530,0,0,6,0,yes
0,0,1,0,0,0,0,0,2,1,3,0,0,0,0,0,5,0,0,0,0,0,2,1,0,0,0,0,0,0,no
...
I know that StringToWordVector is used for strings, but I want to calculate TF-IDF for this ".arff" file. How can I use my current ".arff" file and get a KNN result with TF-IDF?
(This is my academic work. Please help...)
According to Weka's documentation, the StringToWordVector filter "Converts String attributes into a set of attributes representing word occurrences [...]". Therefore, applying this filter to an arff file that does not contain any String attributes will have no effect on the dataset.
In order to make use of this filter, you will need to prepare an arff file that contains a String attribute, where the value of this attribute is the text for the given instance. For example, if each instance represents one tweet, then the text from the tweet would be the value for this String attribute. More information on working with text in weka is documented here.
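If the original text is not available and only the term counts are, one alternative (not part of the answer above) is to compute the weights outside the filter and write a new ARFF. The plain-Java sketch below uses log(1 + count) × log(N / df). That is one common TF-IDF variant, and whether it matches what StringToWordVector computes is an assumption. The class and method names are made up.

```java
class TfIdfSketch {
    // Hypothetical helper: counts[i][j] is the count of term j in document i.
    // Returns log(1 + count) * log(N / df) per cell, where N is the number of
    // documents and df is the number of documents containing the term.
    static double[][] weight(double[][] counts) {
        int docs = counts.length;
        int terms = docs == 0 ? 0 : counts[0].length;

        // Document frequency of each term.
        int[] docFreq = new int[terms];
        for (double[] row : counts) {
            for (int j = 0; j < terms; j++) {
                if (row[j] > 0) {
                    docFreq[j]++;
                }
            }
        }

        double[][] weighted = new double[docs][terms];
        for (int i = 0; i < docs; i++) {
            for (int j = 0; j < terms; j++) {
                if (counts[i][j] > 0) { // zero counts stay zero
                    weighted[i][j] = Math.log(1 + counts[i][j])
                            * Math.log((double) docs / docFreq[j]);
                }
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        double[][] weighted = weight(new double[][] {{2, 0}, {1, 1}});
        System.out.println(java.util.Arrays.deepToString(weighted));
    }
}
```

The class column must be left out of the matrix and re-attached afterwards. A term that occurs in every document gets weight 0, because log(N / N) = 0.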