Why do I get floating point values in my Weka output?

I'm running the J48 algorithm on a dataset, and in the output I get something like this:
J48 pruned tree
------------------
attribute1 = n: class1 (253.41/3.75)
attribute1 = y
| attribute2 = n: class2 (145.71/4.0)
| attribute2 = y: class1 (40.68/3.0)
I'm wondering what the numbers in the parentheses mean. I read somewhere that the first value is the number of instances correctly classified by that choice, and that the second is the number of errors. But how can these be decimal numbers? How do you classify 0.41 of an instance correctly?

I found the answers here:
http://weka.wikispaces.com/What+do+those+numbers+mean+in+a+J48+tree%3F
Basically, J48 splits instances with missing attribute values fractionally across the branches, so an instance can count as a fraction of an instance in each subtree.

For a sample dataset, see vote.arff in Weka: https://www.cs.vassar.edu/~cs366/data/weka_files/vote.arff
Decision tree result: physician-fee-freeze = n: democrat (253.41/3.75).
The first number indicates the number of correctly classified instances that reach that node (in this case democrats), and the second number, after the “/”, shows the number of incorrectly classified instances that reach that node (in this case republicans).
Total number of instances: 435
Total number of “n” instances (the integer part of the correct count): 253
Probability of “n”: 253/435 ≈ 0.58
Total number of instances with a missing value for this attribute: 11
Of those, the number that come out as “n”: 8
Probability: 8/11 ≈ 0.72
Total probability that a missing-value instance is a “no”: 0.58 × 0.72 ≈ 0.42
Total number of correct instances: 253 + 0.42 = 253.42 ≈ 253.41
The number after the “/” is the count of incorrectly classified instances that reach that node. If you look at the data, there are five incorrect instances where the class is “republican” while “physician fee freeze” is “n” (or “?”).
Those five can be split as follows:
Total number of incorrect instances with “n”: 2
Total number of incorrect instances with “?”: 3
Applying the same formula:
2 + (253/435) × 3 ≈ 3.75
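Putting those steps into one formula each (a compact restatement of the arithmetic above):
$$\text{correct} = 253 + \frac{253}{435}\cdot\frac{8}{11} \approx 253.42 \qquad\qquad \text{errors} = 2 + \frac{253}{435}\cdot 3 \approx 3.75$$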

Related

Calculate cumulative % based on sum of next row

I want to calculate a % based on the logic below. It's a little bit tricky and I am stuck, so thanks to anyone who can help.
I have a few records in the table below, grouped by Range:
Range Count
0-10 50
10-20 12
20-30 9
30-40 0
40-50 0
50-60 1
60-70 4
70-80 45
80-90 16
90-100 7
Other 1
I want one more column holding the cumulative %: the running sum of Count up to and including the current row, divided by the total count (145), something like below:
Range Count Cumulative % of Range
0-10 50 34.5% (which is 50/145)
10-20 12 42.7% (which is 62/145)
20-30 9 48.9% (which is 71/145)
30-40 0 48.9% (which is 71/145)
40-50 0 48.9% (which is 71/145)
50-60 1 49.6% (which is 72/145)
60-70 4 52.4% (which is 76/145)
70-80 45 83.4% (which is 121/145)
80-90 16 94.5% (which is 137/145)
90-100 7 99.3% (which is 144/145)
Other 1 100.0% (which is 145/145)
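In formula terms, the requested value for row $i$ is the running total of Count up to and including that row, divided by the grand total:
$$\text{Cumulative \%}_i = \frac{\sum_{j \le i} \text{Count}_j}{145} \times 100$$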
Follow the steps below to get your answer. Please vote for and accept the answer if you find the solution helpful.
1st step - Create a sort-index column from your Range column. Here the value "Other" is replaced by 9999; you can use any bigger number that is unlikely to appear in your dataset. Convert this new column into a whole number:
Sort Column =
IF(
    Sickness[Range] = "Other",
    9999,
    CONVERT(
        LEFT(Sickness[Range], SEARCH("-", Sickness[Range], 1, LEN(Sickness[Range]) + 1) - 1),
        INTEGER
    )
)
2nd step - Use the measure below to get the value:
Measure =
VAR RunningTotal =
    CALCULATE(
        SUM(Sickness[Count]),
        FILTER(ALL(Sickness), Sickness[Sort Column] <= MAX(Sickness[Sort Column]))
    )
VAR totalSum =
    CALCULATE(SUM(Sickness[Count]), ALL())
RETURN
    RunningTotal / totalSum
Below is the output that exactly matches your requirement.
A cumulative calculation always requires an ordering. If the values in your "Range" column are really as shown, an ascending sort on that field happens to keep the data in the expected order, so the following will also work for this purpose. Do the following to get your desired output.
Create the following measure-
count_percentage =
VAR total_of_count =
    CALCULATE(
        SUM(your_table_name[Count]),
        ALL(your_table_name)
    )
VAR cumulative_count =
    CALCULATE(
        SUM(your_table_name[Count]),
        FILTER(
            ALL(your_table_name),
            your_table_name[Range] <= MIN(your_table_name[Range])
        )
    )
RETURN
    cumulative_count / total_of_count
Here is the final output-

How do I generate predicted counts from a negative binomial regression with a logged independent variable in Stata?

I have a set of data with a dependent variable that is a count and several independent variables. My primary independent variable is large dollar values. If I divide the dollar values by 10,000 (to keep the coefficients manageable), the models (negative binomial and zero-inflated negative binomial) run in Stata and I can generate predicted counts with confidence intervals. However, theoretically it is more logical to take the natural log of this variable. When I do that, the models still run, but now the predicted counts range between about 0.22 and 0.77. How do I fix this so that the predicted counts generate correctly?
Your question does not show any code or data. It's nearly impossible to know what is going wrong without these two ingredients. Your question reads as "I did some stuff to this other stuff with surprising results." In order to ask a good question, you should replicate your coding approach with a dataset that everyone has access to, like rod93.
Here's my attempt at that, which shows reasonably similar predictions with nbreg from both models:
webuse rod93, clear
replace exposure = exposure/10000
nbreg deaths exposure age_mos, nolog
margins
predictnl d1 = predict(n), ci(lb1 ub1)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons] + _b[age_mos]*age_mos[1] + _b[exposure]*exposure[1])
di d1[1]
gen ln_exp = ln(exposure)
nbreg deaths ln_exp age_mos, nolog
margins
predictnl d2 = predict(n), ci(lb2 ub2)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons] + _b[age_mos]*age_mos[1] + _b[ln_exp]*ln(exposure[1]))
di d2[1]
sum d? lb* ub*, sep(2)
This produces very similar predictions and confidence intervals:
. sum d? lb* ub*, sep(2)

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+-----------------------------------------------------------
          d1 |         21    84.82903    25.44322    12.95853   104.1868
          d2 |         21     85.0432    25.24095    32.87827   105.1733
-------------+-----------------------------------------------------------
         lb1 |         21    64.17752    23.19418    1.895858   80.72885
         lb2 |         21    59.80346    22.01917     10.9009   79.71531
-------------+-----------------------------------------------------------
         ub1 |         21    105.4805    29.39726    24.02121   152.7676
         ub2 |         21    110.2829    29.16468    51.76427    143.856
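For reference, both by-hand checks above use the negative binomial conditional mean; taking the log of exposure changes only how it enters the linear index, not the scale of the predicted counts:
$$E[\text{deaths}\mid x] = \exp(\beta_0 + \beta_1\,\text{exposure} + \beta_2\,\text{age\_mos}) \quad\text{vs.}\quad \exp(\beta_0 + \beta_1\ln(\text{exposure}) + \beta_2\,\text{age\_mos})$$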

What is this program doing exactly? (SAS)

I was confused by the following SAS code. The SAS data set named WORK.SALARY contains 10 observations for each department and is currently ordered by Department. The following program is submitted:
data WORK.TOTAL;
   set WORK.SALARY(keep=Department MonthlyWageRate);
   by Department;
   if First.Department=1 then Payroll=0;
   Payroll+(MonthlyWageRate*12);
   if Last.Department=1;
run;
So, what exactly are First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true (1) when the current observation is the first (or last) observation in the BY group; see “How the DATA Step Identifies BY Groups” in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.department takes the value 1 on the first record for a particular department; last.department takes the value 1 when the last record for that department is read.
So if you have:
Department Wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
With the first. and last. assigned, it will look like this:
Department Wage first.department last.department
1 100 1 0
1 200 0 0
1 300 0 1
2 1000 1 0
2 2000 0 0
2 3000 0 1
Now you can follow your logic as to what happens when first.department = 1.
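And to close the loop on the original question's code: with this sample data, the submitted step would leave WORK.TOTAL with one observation per department (a sketch, assuming Wage above stands in for MonthlyWageRate):
Department Payroll
1 7200 (= (100+200+300)*12)
2 72000 (= (1000+2000+3000)*12)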
By the way, in your code, the statement if Last.Department=1; has no THEN clause: it is a subsetting IF, so it does nothing except keep (output) the observations where the condition is true.

Creating indicator variable in Stata

In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that flags the initial weeks in which income is 0. So if, say, person X has 0 income in the first 13 weeks, the indicator takes the value 1 in those first 13 weeks and is 0 otherwise. The same procedure applies to person Y, and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?
One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields the cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code is based on the presumption that the cumulative sum cannot return to zero later, which could happen if income were negative in some week. To exclude that possibility, use an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
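A minimal sketch with made-up data, illustrating that the indicator switches off for good once income turns positive:
clear
input str1 name week income
"X" 1 0
"X" 2 0
"X" 3 5
"X" 4 0
end
bysort name (week) : gen no_income = sum(income) == 0
list, sepby(name)
* no_income is 1 in weeks 1 and 2 only; it stays 0 in week 4,
* even though income is 0 again in that week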
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.

Stata: Uniquely sorting points within groups

I'm conducting a household survey with a random sample of 200 villages. Using QGIS, I picked a random point 5-10 km from each of my original villages. I then obtained, from the national statistical office, the village codes for those 200 "neighbor" villages, as well as a buffer of 10 additional neighbor villages. So my total sample is:
200 original villages + 210 neighbor villages = 410 villages, total
We're going to begin fieldwork soon, and I want to give each survey team a map for 1 original village + the nearest neighbor village. Because I'm surveying in some dense urban areas as well, sometimes a neighbor village is actually quite close to more than one original village.
My problem is this: if I run Distance Matrix in QGIS, matching an old village to its nearest neighbor village, I get duplicates in the latter. To get around this, I've matched each old village to the nearest 5 neighbor villages. My main idea/goal is to pick the nearest neighbor that hasn't already been picked.
I end up with a .csv matching each original village to its five nearest neighbors. Picking the five nearest villages, I'm getting repeats - neighbor village 79 shows up as nearby to original villages 1, 2, 3, and 4. This is fine, as long as I can assign neighbor village 79 to one (and only one) original village, and then have the rest uniquely match as well.
What I want to do, then, is to uniquely match each original village to one neighbor village. I've tried a bunch of stuff, none of which has worked. My sense is that I need to loop over original village groups, assign a variable (e.g. taken==1) to one of the neighbor villages, and then - somehow - have each instance of that taken==1 apply to all instances of, say, neighbor village 79.
Here's some sample code of what I was thinking. Note: this uniquely matches 163 of my neighbors.
gen taken = 0
sort ea distance
by ea: replace taken=1 if _n==1
keep if taken==1
codebook FID ea
This also doesn't work; it just sets taken to 1 for all obs:
foreach i in 5 4 3 2 1 {
    by ea: replace taken=1 if _n==`i' & taken==0
}
What I need to do, I think, is loop over both _N and _n, and maybe use an if/else. But I'm not sure how to put it all together.
(Tangentially, is there a better way to loop over decreasing values in Stata? Similar to i-- in other programming languages?)
This should work, but the setup is a little different from what you say you need. By comparing with only five neighbors, you have an ill-posed problem. Imagine that the geography is such that you end up with six (or more) original villages that all have the same list of five neighbors. What do you assign to the sixth original village?
Given this, I compare each original village with all other villages, not only five. The strategy is then to assign original village 1 its closest neighbor; original village 2 its closest neighbor after discarding the one previously assigned; and so on. This assumes an equal number of original and neighbor villages, but you have ten additional ones, so you need to give that some thought.
clear
set more off

*----- example data -----
local numvilla = 4 // change to test
local numobs = `numvilla'^2
set obs `numobs'

egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n

set seed 1956
gen dist = runiform()*10

*----- what you want ? -----
sort origv dist
list, sepby(origv)

quietly forvalues villa = 1/`numvilla' {
    drop if origv == `villa' & _n > `villa'
    drop if neigh == neigh[`villa'] & _n > `villa'
}

list
The other issue is that the results will depend on which original village is placed first, second, and so on, because the order of the assignments will change accordingly. That is, the order in which available options are discarded changes with the order in which you set up the original villages. You may want to randomize the order of the original villages before you start the assignments.
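A hedged sketch of one way to do that randomization, reusing the example data above (u and rand_id are made-up names): draw one random number per original village, relabel the villages by its rank, and run the assignment over the new labels in place of the sort and loop above.
set seed 2468
gen double u = runiform()
bysort origv : replace u = u[1]   // one random draw per original village
egen rand_id = group(u)           // random relabelling 1, 2, ..., `numvilla'
sort rand_id dist
quietly forvalues villa = 1/`numvilla' {
    drop if rand_id == `villa' & _n > `villa'
    drop if neigh == neigh[`villa'] & _n > `villa'
}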
You can increase efficiency by replacing & _n > `villa' with in `=`villa'+1'/L, but you won't notice much of a difference with your sample size.
I'm not qualified to say anything about your sample design, so take this answer to address only the programming issue you pose.
By the way, to loop over decreasing values:
forvalues obs = 5(-1)1 {
    display "`obs'"
}
See help numlist.