Weka: Convert Nominal to Numeric - weka

When I import a CSV file into Weka, it reads some numeric variables as nominal type. I would like to convert them to numeric, but I'm not seeing any option for that in Weka.
I tried opening the .arff file in Notepad and Notepad++. I removed the list of values and changed the type to numeric.
example:
@attribute thours {' ',18,4,48,42,56,35,40,30,14,54,24,36,20,77,25,70,0,16,34,60,64,21,32,6,84,23,31,52,28,50,66,45,12,10,33,11,22,98,8,3,65,72,9,26,15,63,5,27,51,39,105,7,2,58,43,90,68,46,44,47,112,49,91,37,1,41,104,78,96,75,74,62,71,76,89,13,38,19,29,59,92,81,55,57,53,67,80,102,100,17}
to
@attribute thours numeric
and saved the file. When I imported the file again, I got an error:
"...not recognized as an 'Arff data files' file. reason: number expected, read Token ], line 78"
Any help is greatly appreciated. Thanks.
Dixi

I believe the reason for your error is that one or more entries of the variable "thours" are missing. Such entries show up in the attribute description as the blank value in single quotes (' '), and they are still present in the data rows, where Weka now expects a number. If those values are indeed supposed to be missing, change them to the marker Weka expects in an ".arff" file, which is a question mark "?".
This link provides a very detailed description of ".arff" files, and what is expected in them.
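As a rough sketch of that fix, assuming the blank entries really are missing values and that no field legitimately consists of a quoted blank, a short Python script could rewrite the data rows. The function name `blank_fields_to_missing` is made up for illustration and is not part of Weka.

```python
import re

# Matches a whole field that is only a quoted blank (' ' or " " or ''),
# i.e. one bounded by commas or by the start/end of the line.
BLANK_FIELD = re.compile(r"""(?<![^,])\s*(['"])\s*\1\s*(?![^,])""")

def blank_fields_to_missing(arff_text):
    """Replace quoted-blank fields after @data with Weka's '?' marker."""
    out = []
    in_data = False
    for line in arff_text.splitlines():
        if in_data and not line.lstrip().startswith("%"):
            # Only data rows are rewritten; header lines are left alone.
            line = BLANK_FIELD.sub("?", line)
        elif line.strip().lower() == "@data":
            in_data = True
        out.append(line)
    return "\n".join(out)
```

The attribute declaration itself still has to be changed to numeric by hand, as in the question.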

Related

Weka continually throws error "Unable to determine structure as arff"

I'm trying to import an arff file into weka and I am continually getting the following error
Unable to determine structure as arff (Reason: java.io.IOException: } expected at end of enumeration, read Token[EOL], line 20
The closing bracket } is present and I can't find any other errors with line 20. In fact the error reappears after I've deleted line 20. I've attached a link to the arff with a couple lines of data: Link
I see three issues here. FIRST: in the file you linked to, the last attribute is los_category, which is NUMERIC:
@ATTRIBUTE los_category NUMERIC
But the last value in your data line is clearly not numeric (Low).
21,22,165315,4/9/2196__12:26:00_PM,4/10/2196__3:54:00_PM,?,EMERGENCY,EMERGENCY_ROOM_ADMIT,DISC-TRAN_CANCER/CHLDRN_H,Private,?,UNOBTAINABLE,MARRIED,WHITE,4/9/2196__10:06:00_AM,4/9/2196__1:24:00_PM,BENZODIAZEPINE_OVERDOSE,0,1,1.144444444,Low
You've defined 20 variables with @ATTRIBUTE statements (lines 3-22), but your data lines actually have 21 values.
SECOND, you have declared time variables (e.g. admittime) as numeric, but their values clearly contain non-numeric characters. I know there's a specific format that ARFF files want date/time in, but I'm not an expert in that and can't be definitive about a fix. This is definitely a problem, though. When I create a file with just your first 3 variables, it loads fine. When I add the fourth (@ATTRIBUTE admittime NUMERIC), I get the same error you report.
THIRD, line 19 (@ATTRIBUTE diagnosis) is hundreds of characters long. You might want to treat that as a STRING variable type for now, just to be sure you aren't overloading the read buffer with that huge line.
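The first two checks can be automated. Below is a minimal Python sketch, not a full ARFF parser: it assumes attribute names contain no spaces and that no value contains a quoted comma. The function name `check_arff` is hypothetical.

```python
def check_arff(arff_text):
    """Return a list of human-readable problems found in the data rows."""
    attributes = []  # (name, type) pairs in declaration order
    problems = []
    in_data = False
    for number, raw in enumerate(arff_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            parts = line.split(None, 2)
            keyword = parts[0].lower()
            if keyword == "@attribute" and len(parts) == 3:
                attributes.append((parts[1], parts[2].strip().lower()))
            elif keyword == "@data":
                in_data = True
            continue
        # Naive split: assumes no commas inside quoted values.
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(attributes):
            problems.append(
                f"line {number}: {len(values)} values, "
                f"{len(attributes)} attributes declared"
            )
            continue
        for (name, kind), value in zip(attributes, values):
            if kind in ("numeric", "real", "integer") and value != "?":
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {number}: {name} is {kind} but value is {value!r}"
                    )
    return problems
```

Running it over the file before loading it into Weka should point at the offending line and attribute more directly than the loader's error message does.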

SAS format is loaded but cannot be used

I loaded a format and my log says:
NOTE: Format $DEPOSIT is already on the library WORK.FORMATS.
NOTE: Format $DEPOSIT has been output.
But when I use it:
D_SYS = PUT(SOURCE,$DEPOSIT.);
I get:
ERROR 48-59: The format DEPOSIT was not found or could not be loaded.
If you try to apply a character format to a numeric value (or the reverse), SAS will silently convert the format specification to match the type of the data you are applying it to.
You created the character format $DEPOSIT and are trying to apply it to the numeric variable SOURCE, so SAS looks for the numeric format DEPOSIT instead. The error message is saying that the numeric format DEPOSIT does not exist.
Check that the variable SOURCE actually exists. SAS will create a numeric variable if you reference a variable that does not exist. If your variable really is numeric then you might get it to work if you convert SOURCE to character, but make sure to transform the numbers into character strings that match what the format expects.
D_SYS = PUT(cats(SOURCE),$DEPOSIT.);

Convert Scientific notation to text or integer using regex in Notepad++

I am looking to convert a scientific notation number into the full integer number.
E.g:
8.18234E+11 => 818234011668
Excel reformatted all my UPC codes within a CSV, and this solution is not working for me.
I have my csv open in Notepad++ and would love to do this using a regex find and replace.
Thanks.
The damage is already done and cannot be recovered from the CSV file. 8.18234E+11 could be anything* from 818233500000 to 818234499999.
To prevent Excel from rounding large numbers, you need to store them as text. If you set the cell format to text, any value inserted from then on should be automatically interpreted as text. In OpenOffice Calc (I don't have MS Excel), you can also prefix a numeric value with ' to get it interpreted as text no matter the cell format.
There is a chance that the correct value is stored in the original XLS (or XLSX or ODS or the live Excel session or ...) file. If not, then you'll have to enter the data again. If the data is there, you need to store it as text or increase the number of significant digits in the exported CSV. If you only have the exported data, then you're out of luck.
*UPC has a single check digit, so only 100 000 out of the 1 000 000 codes are actually valid UPC codes.
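To make the footnote concrete, here is a small Python sketch that enumerates the valid codes in that range. It assumes the codes are 12-digit UPC-A, whose check digit weights the odd positions by 3; the function names are made up for illustration.

```python
def upc_check_digit(first_eleven):
    """Compute the UPC-A check digit for an 11-digit string."""
    digits = [int(ch) for ch in first_eleven]
    odd = sum(digits[0::2])   # positions 1, 3, ..., 11
    even = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10

def is_valid_upc(code):
    """True if code is 12 digits and its last digit is the check digit."""
    return (
        len(code) == 12
        and code.isdigit()
        and upc_check_digit(code[:11]) == int(code[11])
    )

def candidates(low, high):
    """Yield every valid 12-digit UPC-A code between low and high inclusive."""
    for prefix in range(low // 10, high // 10 + 1):
        body = f"{prefix:011d}"
        code = body + str(upc_check_digit(body))
        if low <= int(code) <= high:
            yield code

# Every code that Excel would display as 8.18234E+11.
possible = list(candidates(818233500000, 818234499999))
```

Even with the check digit narrowing things down, `possible` still holds 100,000 codes, so the original value cannot be picked out of the rounded one.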

Weka with Missing Values

I have the same question about Weka as this person:
Hi all:
I felt really strange about WEKA on this.
I have prepared a CSV file which has lots of missing values. One
missing value in this file is basic just no any value between pair of
commas i.e. ,random_value1,,random_value2. This is an example of the
format. You can see there is a pair of commas, between them is just
nothing not even a white_space, and it should indicates a missing
value of the data.
The weird thing is when I read this CSV into WEKA, WEKA assigns all
missing values to a question mark, i.e. '?'. This is exactly how WEKA
expresses it.
And then when I run testing analysis, WEKA started working on these
'?' as some sort useful information. It just missing values, could
WEKA please just jump over it?
These problem became really wasting. Analysis results read like if
missing then value missing, missing assocciates with missing, missing
correlates missing.
Can WEKA reads missing value as missing value, not some sort question
marks? Or can I tell WEKA that for all '?', treat them as missing
values?
Thanks guys
He solved his problem using this solution:
I found a way to tell WEKA about the missings. Just use the find-and-replace function of an ASCII editor, replace all '?' to ?.
But I don't know how to download an ASCII editor and use it. Can anyone tell me how?
I suggest using Notepad2 or Notepad++ on Windows.
You don't have to do anything special about missing values. Different algorithms handle missing values differently, so don't worry: each one will deal with them in its own way.
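If installing an editor is a hurdle, the find-and-replace step quoted in the question can also be scripted. This is a minimal Python sketch that mirrors that blind replacement, so it would also touch a quoted '?' in the header if one exists. The function names and the file paths passed to `fix_file` are placeholders.

```python
from pathlib import Path

def unquote_missing(text):
    """Turn the quoted string '?' into Weka's bare missing-value marker ?."""
    return text.replace("'?'", "?")

def fix_file(source, target):
    """Read source, apply the replacement, and write the result to target."""
    Path(target).write_text(unquote_missing(Path(source).read_text()))
```

Calling `fix_file("data.arff", "data_fixed.arff")` would write a corrected copy and leave the original file untouched.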

How to convert a text file into ARFF format?

I'm using WEKA tool for text classification, and I have to convert plain text files into ARFF format. However, I don't know how to do that. Can anyone please help me to convert a text file into ARFF format?
Thank you Renklauf for your response.
I didn't understand this point: "Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line." Can you please explain it briefly?
Suppose the text data is like a simple sport article like
" Basketball is a team sport, the objective being to shoot a ball through a basket horizontally positioned to score points while following a set of rules. Usually, two teams of five players play on a marked rectangular court with a basket at each width end. Basketball is one of the world's most popular and widely viewed sports" ...
This is my text document, and I want to convert it to ARFF format. After that I need to use the ARFF file for SVM text classification.
For a document classification task, each document is the value of a string attribute and must be enclosed in quotes. Suppose you have a corpus of 10 sports articles tagged as either pro-Yankees or pro-Red Sox, for a classifier that automatically classifies sports articles as one or the other. You need to take each document, enclose it in quotes, place it on a single line, and then place your {yankees, red_sox} attribute value after the quoted string.
@relation yankeesOrRedSox
@attribute article string
@attribute yankeesOrSox { yankees, red_sox }
@data
"text of article 1 here", yankees
.
.
.
"text of article 10 here", red_sox
It's key that the article is placed on a single line. When I began using Weka for text classification, this is a point that caused me a lot of frustration at first. Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line. Notepad++ has a Join Lines function that allows you to place a lot of text on a single line.
Hope this helps.
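The manual steps above (join lines, quote, append the label) can also be scripted. This is only a sketch in Python: the function names are hypothetical, and escaping embedded quotes with a backslash is an assumption about what Weka's ARFF reader accepts, so check the output on a small file first.

```python
def arff_quote(text):
    """Collapse a document onto one line and wrap it in double quotes."""
    one_line = " ".join(text.split())  # removes newlines and repeated spaces
    # Assumption: backslash-escaped quotes are accepted by the ARFF reader.
    escaped = one_line.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def build_arff(relation, labelled_docs, labels):
    """Build ARFF text from (document_text, label) pairs."""
    lines = [
        "@relation " + relation,
        "@attribute article string",
        "@attribute label {" + ",".join(labels) + "}",
        "@data",
    ]
    for text, label in labelled_docs:
        lines.append(arff_quote(text) + "," + label)
    return "\n".join(lines) + "\n"

docs = [
    ("Basketball is a team sport,\nthe objective being to shoot a ball.", "yankees"),
    ("text of article 2 here", "red_sox"),
]
arff = build_arff("yankeesOrRedSox", docs, ["yankees", "red_sox"])
```

Writing `arff` to a file with an .arff extension should give something Weka can open, with each article on its own line.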