I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721, the lsmean for the first type, calculated?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
data = data.frame(
type = c("A", "A", "A", "B", "B", "B"),
hours = c(60,72,61, 54,68,66),
# pain = c(85,95,69, 73, 29, 30)
pain = c(85,95,69, 85,95,69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor, so each lsmean is simply the model's prediction for that type at the mean of hours (63.5). For type A (the reference level) that is just the intercept plus 63.5 times the hours coefficient, which is where 82.46721 comes from. The degrees of freedom are the model's residual degrees of freedom, 6 observations minus 3 estimated coefficients = 3, for both types. And because both lsmeans are predictions at the same hours value, use the same residual variance estimate, and each type has the same number of observations, their standard errors come out identical.
I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of Column 1, I need to find the sum of Column 7.
Similarly, for every unique value of Column 2, I need to find the sum of Column 7.
The output for Column 1 should look like this:
j,686
u,308
t,98
c,83
The output for Column 2 should look like this:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv
c1 = Counter()
c2 = Counter()
with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print "Column 1"
for value, count in c1.iteritems():
    print '{},{}'.format(value, count)

print "\nColumn 2"
for value, count in c2.iteritems():
    print '{},{}'.format(value, count)
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds the totals keyed by the column 1 values and c2 holds the totals keyed by the column 2 values. Note that Python indexes lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
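To see the Counter behaviour in isolation, here is a minimal sketch (the (key, value) pairs below are just illustrative stand-ins for cols[0] and int(cols[6])):

from collections import Counter

totals = Counter()
for key, value in [('j', 343), ('u', 154), ('j', 343)]:
    totals[key] += value  # a missing key starts at 0, so no initialisation is needed

print(totals)  # Counter({'j': 686, 'u': 154})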
You can use pandas:
import pandas as pd

df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ..., c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column 1 (c1) or column 2 (c2) and sum up column 7 (c7):
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
SO isn't supposed to be a code writing service, but I had a few minutes. :) Without pandas you can do it with the csv module:
import csv
def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print column1_results
print column2_results
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec; I'd double-check that.
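For example, a quick way to double-check is to print the column 7 values that get grouped under each key before summing; a minimal sketch along the same lines as the answers above:

import csv
from collections import defaultdict

values = defaultdict(list)
with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        values[cols[0]].append(int(cols[6]))

# e.g. 't' maps to [540, 540], which sums to 1080 rather than the 98 in the expected output
print(dict(values))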
I have the following sparse ARFF file, and I want to build a classifier from it (the training dataset) using the Weka Java API. The program reads the file without throwing any exception, but it is not able to read the instances: when I print the number of instances, the program prints 0. Thanks in advance for any input.
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 6 1}
{3 3, 4 2, 6 0}
ArffLoader loader = new ArffLoader();
loader.setFile(new File("C:\\SAMPLE-01.arff"));
Instances data = loader.getStructure();
data.setClassIndex(data.numAttributes() - 1);
System.out.println("Number of Attributes : " + data.numAttributes());
System.out.println("Number of Instances : " + data.numInstances());
I believe that your sparse data is not properly formatted in your ARFF file: the data lines refer to attribute index 6, but with six attributes the valid indices are 0 to 5. It should be something like this:
@RELATION ample
@ATTRIBUTE T1 numeric
@ATTRIBUTE T2 numeric
@ATTRIBUTE T3 numeric
@ATTRIBUTE T4 numeric
@ATTRIBUTE T5 numeric
@ATTRIBUTE C1 {0, 1}
@DATA
{0 3, 1 2, 2 1, 5 1}
{3 3, 4 2, 5 0}
You can then use a class that looks somewhat like mine. Note that it loads the instances with DataSource.getDataSet(), whereas your code only calls loader.getStructure(), which returns the header without any instances, which is why it reports 0:
public class SO_Test {
    DataSource source = null;
    Instances data = null;

    public void setDataset(String trainingFile) {
        try {
            source = new DataSource(trainingFile);
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            data = source.getDataSet();
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);
    }

    public static void main(String[] args) throws Exception {
        SO_Test s = new SO_Test();
        s.setDataset("1.arff");
        System.out.println("Number of Attributes : " + s.data.numAttributes());
        System.out.println("Number of Instances : " + s.data.numInstances());
    }
}
My data looks like:
VAR_A: 134, 15M3, 2004, 301ME, 201E, 41, 53, 22
I'd like to change this vector so it looks like the one below:
VAR_A: 134, -99, 2004, -99, -99, 41, 53, 22
If a value contains characters (e.g., M, E), I want to replace it with -99.
How could I do this in R? I've heard that a regular expression would be a possible way, but I'm not good at them.
It seems to me you want to replace the values that contain non-digit characters, if that is the case ...
x <- c('134', '15M3', '2004', '301ME', '201E', '41', '53', '22')
sub('.*\\D.*', '-99', x)
# [1] "134" "-99" "2004" "-99" "-99" "41" "53" "22"
Or essentially you could do:
x[grepl('\\D', x)] <- -99
as.numeric(x)
# [1] 134 -99 2004 -99 -99 41 53 22
interval is a subset of 5-minute intervals for a 25-hour period:
> interval
[1] 45 50 55 100 105 110 115 120 125 130 135 2035 2040 2045 2050 2055 2100 2105 2110 2115 2120 2125
I want to insert a : to put it in a format that I can then convert to a time format.
> gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', interval)
[1] "45" "50" "55" "1:00" "1:05" "1:10" "1:15" "1:20" "1:25" "1:30" "1:35" "20:35" "20:40" "20:45"
[15] "20:50" "20:55" "21:00" "21:05" "21:10" "21:15" "21:20" "21:25"
I have got it working for nearly all my examples.
How do I get it so that it also works on the numbers "5" ... "45", "50", "55"?
I found this duplicate here, but it does not use gsub.
An easy way to do this would be to make sure all the inputs have at least 4 characters:
gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', sprintf('%04d',interval))
# "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# "21:10" "21:15" "21:20" "21:25"
Using sub with a Perl-compatible regex: \K drops the first two digits from the match that gets replaced, so the : is simply inserted after them:
> sub('..\\K', ':', sprintf('%04d',interval), perl=T)
# [1] "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# [10] "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# [19] "21:10" "21:15" "21:20" "21:25"