Weka sparse ARFF file

I am making a sparse ARFF file, but it will not load into Weka. I get an error saying I have the wrong number of values on the @attribute class line: it expects 1 and rejects the 12 it receives. What am I doing wrong? My file looks like this:
%ARFF file for questions data
%
@relation brazilquestions
@attribute att0 numeric
@attribute att1 numeric
@attribute att2 numeric
@attribute att3 numeric
%there are 469 attributes which represent my bag of words
@attribute class {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia,
Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica,
Terapeutica_medicamentosa, Odontopediatria, Cirurgia}
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1 , Estomatologia}
{155 1, 76 1, 126 1, 78 1, 341 1, 148 1, Odontopediatria}
%and then 81 more instances of data
Any ideas about what is wrong with my syntax? I followed the example exactly from the book Data Mining by Witten/Frank/Hall. Thanks in advance!

The problem is in the @data section: in sparse format you must also give the index of the class attribute.
For example:
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1 , Estomatologia}
should be corrected to:
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, 469 Estomatologia}
(Sparse ARFF indices are 0-based, so with att0 through att468 as the 469 bag-of-words attributes, the class attribute sits at index 469.)
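One extra check, offered from my recollection of the Weka sparse ARFF documentation rather than from your error message: sparse entries are also expected in ascending index order. With that and the class index, the two example instances would read:
@data
{56 1, 126 1, 147 1, 199 1, 367 1, 400 1, 469 Estomatologia}
{76 1, 78 1, 126 1, 148 1, 155 1, 341 1, 469 Odontopediatria}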

In the snippet you posted you declared 5 attributes, but in @data you are supplying 7 values; the declarations have to cover everything that appears in @data, so you should complete the missing ones. You can see this in the manual.

The index of the class attribute and the class value need to be listed, too. (See the Sparse ARFF file description.)
Your file:
@attribute class {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia,
Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica,
Terapeutica_medicamentosa, Odontopediatria, Cirurgia}
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1 , Estomatologia}
Should be:
@data
{126 1, 147 1, 199 1, 56 1, 367 1, 400 1, 469 Estomatologia}

Try using
@ATTRIBUTE class string
instead of
@attribute class {Odontologia_coletiva, Periodontia, Pediatria, Estomatologia, Dentistica, Ortodontia, Endodontia, Cardiologia, Terapeutica, Terapeutica_medicamentosa, Odontopediatria, Cirurgia}

How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: how was 82.46721 calculated as the lsmean for the first type?
Second question: why is the standard error exactly the same (8.24003) for both types?
Third question: why are the degrees of freedom 3 for both types?
library(lsmeans)

data = data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  # pain = c(85, 95, 69, 73, 29, 30)
  pain = c(85, 95, 69, 85, 95, 69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor, so the lsmean for each type is simply the model's prediction at the average of hours (63.5), which the predict() call above reproduces. That also answers the other two questions: the 3 degrees of freedom are the residual degrees of freedom (6 observations minus 3 estimated coefficients), and the standard errors coincide because the two groups are the same size and their mean hours (64.33 and 62.67) sit symmetrically around 63.5, giving both predictions the same leverage.
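To check those numbers directly, predict.lm can also return standard errors; a small sketch extending the code above (expected values taken from the lsmeans table in the question):
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
p <- predict(model, newdata = newdat, se.fit = TRUE)
p$fit     # 82.46721 83.53279 -- the two lsmeans
p$se.fit  # 8.24003 8.24003   -- the shared standard error
p$df      # 3                 -- residual degrees of freedom (6 obs - 3 coefficients)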

Find sum of the column values based on some other column

I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of column 1, I need to find the sum of column 7.
Similarly, for every unique value of column 2, I need to find the sum of column 7.
Expected output for column 1:
j,686
u,308
t,98
c,83
Expected output for column 2:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv libraries as follows:
from collections import Counter
import csv

c1 = Counter()
c2 = Counter()

with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print("Column 1")
for value, count in c1.items():
    print('{},{}'.format(value, count))

print("\nColumn 2")
for value, count in c2.items():
    print('{},{}'.format(value, count))
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting (and here, summing) items automatically: c1 holds the totals keyed by the column 1 values and c2 the totals keyed by the column 2 values. Note that Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are read in as strings. It is then added to the relevant counter using either the column 1 or column 2 value as the key. The result is two dictionaries holding the totalled counts for each key.
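As a tiny illustration of why no initialisation is needed, here is the same accumulation done by hand with the two j rows from the sample input:
from collections import Counter

c = Counter()
c['j'] += 343  # a missing key implicitly starts at 0
c['j'] += 343
print(c['j'])  # 686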
You can use pandas:
import pandas as pd

df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ... c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column c1 or c2 and sum up column c7:
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
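If you need the exact comma-separated format shown in the question, you can iterate over the summed Series; a small sketch using the named columns from above:
for key, total in df.groupby('c1')['c7'].sum().items():
    print('{},{}'.format(key, total))
# c,85
# j,686
# t,1080
# u,308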
SO isn't supposed to be a code-writing service, but I had a few minutes. :) Without pandas you can do it with the csv module:
import csv

def sum_to(results, key, add_value):
    # Accumulate add_value into results[key], creating the key on first sight.
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print(column1_results)
print(column2_results)
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec: the t row, for instance, carries 540 in column 7 and appears twice, which sums to 1080, not 98. I'd double-check that.

Model building and reading Sparse ARFF File in WEKA

I have the following sparse ARFF file in Weka. I want to build a classifier from this sparse ARFF file (the training dataset) using the Weka Java API. The program reads the file without throwing any exception, but it is not able to read the instances: when I print the number of instances, the program prints 0. Thanks in advance for any inputs.
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 6 1}
{3 3, 4 2, 6 0}
ArffLoader loader = new ArffLoader();
loader.setFile(new File("C:\\SAMPLE-01.arff"));
Instances data = loader.getStructure();
data.setClassIndex(data.numAttributes() - 1);
System.out.println("Number of Attributes : " + data.numAttributes());
System.out.println("Number of Instances : " + data.numInstances());
I believe that your sparse data is not properly formatted in your ARFF file: with six declared attributes, the valid sparse indices are 0 through 5, yet your data refers to index 6. It should be something like this:
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 5 1}
{3 3, 4 2, 5 0}
And then you can use a class that looks somewhat like mine:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SO_Test {
    DataSource source = null;
    Instances data = null;

    public void setDataset(String trainingFile) {
        try {
            source = new DataSource(trainingFile);
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            data = source.getDataSet();
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);
    }

    public static void main(String[] args) throws Exception {
        SO_Test s = new SO_Test();
        s.setDataset("1.arff");
        System.out.println("Number of Attributes : " + s.data.numAttributes());
        System.out.println("Number of Instances : " + s.data.numInstances());
    }
}
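Incidentally, the question's own snippet has a second problem beyond the file format: ArffLoader.getStructure() reads only the header, so it always reports 0 instances. A minimal sketch of the original code with getDataSet() swapped in (same file path as the question, to be placed in a main that throws Exception):
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

ArffLoader loader = new ArffLoader();
loader.setFile(new File("C:\\SAMPLE-01.arff"));
Instances data = loader.getDataSet(); // getStructure() returns the header only
data.setClassIndex(data.numAttributes() - 1);
System.out.println("Number of Attributes : " + data.numAttributes());
System.out.println("Number of Instances : " + data.numInstances());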

How to extract character component in values and replace the values with -99

My data looks like:
VAR_A: 134, 15M3, 2004, 301ME, 201E, 41, 53, 22
I'd like to change this vector like below:
VAR_A: 134, -99, 2004, -99, -99, 41, 53, 22
If a value contains characters (e.g., M, E), I want to replace that value with -99.
How can I do this in R? I've heard that regular expressions would be a possible way, but I'm not good at them.
It seems to me you want to replace the values that are not purely digits. If that is the case:
x <- c('134', '15M3', '2004', '301ME', '201E', '41', '53', '22')
sub('.*\\D.*', '-99', x)
# [1] "134" "-99" "2004" "-99" "-99" "41" "53" "22"
Alternatively, you could do:
x[grepl('\\D', x)] <- -99
as.numeric(x)
# [1] 134 -99 2004 -99 -99 41 53 22
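Note that sub() returns a character vector, so the first approach can be wrapped in as.numeric() too if numbers are wanted (reusing x from above):
as.numeric(sub('.*\\D.*', '-99', x))
# [1]  134  -99 2004  -99  -99   41   53   22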

R + converting an integer to an hh:mm format using regex + gsub

interval is a subset of 5-minute intervals for a 25-hour period
> interval
[1] 45 50 55 100 105 110 115 120 125 130 135 2035 2040 2045 2050 2055 2100 2105 2110 2115 2120 2125
I want to insert a : to put it into an hh:mm format that I can then convert to a time.
> gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', interval)
[1] "45" "50" "55" "1:00" "1:05" "1:10" "1:15" "1:20" "1:25" "1:30" "1:35" "20:35" "20:40" "20:45"
[15] "20:50" "20:55" "21:00" "21:05" "21:10" "21:15" "21:20" "21:25"
I have got it working for nearly all my examples.
How do I get it to also work on the short values "5" ... "45", "50", "55"?
I found this duplicate here, but it does not use gsub.
An easy way to do this would be to make sure all the inputs have at least 4 characters:
gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', sprintf('%04d',interval))
# "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# "21:10" "21:15" "21:20" "21:25"
Using sub:
sub('..\\K', ':', sprintf('%04d', interval), perl = TRUE)
# [1] "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# [10] "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# [19] "21:10" "21:15" "21:20" "21:25"
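From here, if an actual time class is wanted (an assumption, since the question only says it should be convertible to a time), base R can parse the padded strings; a small sketch:
times <- sub('..\\K', ':', sprintf('%04d', interval), perl = TRUE)
strptime(times, format = "%H:%M")  # POSIXlt times, anchored to today's date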