Extracting rpart rules to segment a dataset

I've used rpart to build a decision tree, and with predict I can apply the rules to see the predicted values. But I want to segment the dataset by the rules that have been generated. Essentially, I want to label each row in the dataset with its rule or rule number. How does one do this?

It has been almost a year since the question was posted, but this could help others. The observations' node assignments in the rpart tree are saved in tree$where, which holds, for each observation, the row of tree$frame for the leaf it falls into:
library("rpart")
airq <- airquality[complete.cases(airquality),]
tree <- rpart(Ozone ~ ., data = airq)
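tree$where  # row of tree$frame (the leaf) that each observation falls into
# Label each row with its leaf's node number, as shown by print(tree)
airq$node <- rownames(tree$frame)[tree$where]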

Related

How to access by() output?

I have a large data.frame containing different forest sites, tree species and their dimensions. For some trees I have height and dbh data, for some I only have dbh. I need to calculate missing heights for additional evaluation. Height is site and species specific which is why I used the by() function on a with_height subset:
tmp <- with(with_height,
by(with_height, with_height[,1:2], #with_height[,1:2] are site and species
function(x) lm(height~log(dbh), data = x)))
This works out and creates a large list (1144 unnamed elements, 9.8Mb).
How do I access this list? I need either the lm() or the coefficients for each real combination of site and species (without NULL/ZERO responses if a species did not occur).
I found that
tmp[[1]]$coefficients
returns
tmp[[1]]$coefficients
(Intercept) log(dbh)
-16.36298 11.18222
But how do I know which site-species combination this belongs to? And is there a way to do this for all real site-species combinations simultaneously?
I have already spent hours on this question and would be very thankful for any help and advice!

How can one create a single table from multiple datasets?

I'm trying to create a descriptive table by treatment group. For my analysis, I have 3 different partitions of the data (because I'm running 3 separate analyses) from a complete data set, but I only have one statistic from each subset that I am trying to describe, so I think it'd look better in one complete table. At the end, I'd like an output that can convert to latex (as I'm using bookdown).
I've been using the compareGroups package to easily create each table individually. I know that there is an rbind function that creates a stacked table, but it won't let me combine them because the n of each separate data frame is different (due to missingness). For instance, I'm trying to study marriage in one of my analyses, and later divorce (which is a separate analysis), so the n's of these two data frames differ, but the definition of treatment group is the same.
Ideally, I'd have two columns, one for the treatment group and one for the control group. There would be two rows, one that has age of first marriage, and the second row which would have length of that first marriage, and then the respective ns of the cells.
library(compareGroups)
d1 <- compareGroups(treat ~ time1mar,
data = nlsy.mar,
simplify=TRUE,
na.action=na.omit) %>% createTable(.,
type=1,
show.p.overall = FALSE)
d2 <- compareGroups(treat ~ time1div,
data = nlsy.div,
simplify=TRUE,
na.action=na.omit) %>% createTable(.,
type=1,
show.p.overall = FALSE)
d.tot <- rbind(`First Age at Marriage` = d1, `Length of First Marriage` = d2)
This is the error that I get:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 6626, 5057
Any suggestions?
The problem might be that you're using na.omit, which deletes the cases/rows with NAs from both of your datasets. Probably a different number of cases gets removed from each data set. Different numbers of rows should really only be a problem with cbind, though. You might still try changing the na.action option.
I'm just guessing. As joshpk said, without sample data it is difficult to reproduce your problem.

regex to find sentences not terminated by a period

In a book ms., I have figure captions that take one of the following forms:
cap=['"]Figure caption['"] (with matching ' or ")
\caption{Figure caption} (a LaTeX caption)
where the style calls for all captions to be terminated by a ., i.e.,
\caption{Figure caption.}
Unfortunately, I wasn't consistent when writing, so only some captions obey the style, and I have ~300 figures in a file tree of ~100 files, so I'd like to find a perl solution for finding the problem cases and making corrections rather than editing manually.
Can someone help?
Let me try to make this more precise with some test cases from my files. For the \caption{} problem, here are a few example lines from my files. The first three are properly terminated with a .. The rest need a . appended before the caption-closing }. Note there can be several sentences in a caption, and other LaTeX material on the same line.
\caption{CA plot and mosaic display for the TV viewing data. The days of the week in the mosaic plot were permuted according to their order in the CA solution.}
\caption{Stacking approach for a three-way table. Two of the table variables are combined interactively to form the rows of a two-way table.}\label{fig:stacking}
\caption{Overview of fitting and graphing for model-based methods in \R.}
\caption{Each way of stacking a three-way table corresponds to a loglinear model}\label{tab:stacking}
\caption{CA biplot of the suicide data, showing calibrated axes for the suicide methods}
\caption{Arthritis treatment data, for the relationship of the binary response ``Better'' to Age}
\caption{Space shuttle data, with fitted logistic regression model}
\caption{Observed (points) and fitted (lines) log odds of admissions in the logit models for \data{UCB}}
(\\caption{[^}]*[^\.]}|cap=(['"]).+[^\.]\2)
https://regex101.com/r/kW9yZ3/2
This one works for the cases you provided for the \caption{} format, and for some examples I made for the cap="..." format.
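The question asks for Perl, but the same pattern can be tried out quickly in Python. What follows is only a minimal sketch: the function name find_unterminated is made up for illustration, and it reports candidate lines instead of editing them. Like the regex itself, it stops at the first closing brace, so a caption containing nested braces such as \data{UCB} may be flagged even when it ends with a period, and the hits should be reviewed by hand.

```python
import re

# The pattern from the answer above, written as a Python raw string.
# Group 2 captures the quote character used by the cap= form.
UNTERMINATED = re.compile(r"""(\\caption\{[^}]*[^.]\}|cap=(['"]).+[^.]\2)""")


def find_unterminated(lines):
    """Return (line_number, line) pairs whose caption lacks a final period."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if UNTERMINATED.search(line)
    ]


if __name__ == "__main__":
    sample = [
        r"\caption{Overview of fitting and graphing for model-based methods in \R.}",
        r"\caption{Space shuttle data, with fitted logistic regression model}",
        "cap='Figure caption'",
        'cap="Figure caption."',
    ]
    for number, line in find_unterminated(sample):
        print(number, line)
```

To scan a file tree, the same function could be called on the lines of each file and the reported line numbers used to make the corrections.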

How to plot the different graphs by stcurve in one chart in Stata?

I am using stcurve in Stata to plot survival probability. I need to plot the graph for all data and then for specific variables. I can generate the graphs in two different charts, but I need to have all three lines together in one chart.
I have tried the addplot() option but I get the error that stcurve is not a twoway graph. Do you have any idea how to do this?
This is the code that I have used which generates the graphs in two different charts separately:
stcurve, survival graphregion(lcolor(white) ilcolor(white) ifcolor(white) ) plotregion( lcolor(black)) title("Survival Function", size(vlarge)) ytitle("Survival probabilities", size(large)) xtitle("Time", size(large)) xlabel(,labsize(medium)) ylabel(,labsize(medium))
stcurve, survival at1( def=0) at2( def=1) graphregion(lcolor(white) ilcolor(white) ifcolor(white) ) plotregion( lcolor(black)) legend(label(1 "X Firms") label(2 "Y Firms")) legend(size(large)) lwidth(thin thick) title("Survival Function", size(vlarge)) ytitle("Survival probabilities", size(large)) xtitle("Time", size(large)) xlabel(,labsize(medium)) ylabel(,labsize(medium))
I am not sure if I understood correctly what you want. It would have been useful if you had added the stset and stcox code necessary before running stcurve.
If the Kaplan-Meier hazard graph is identical to your first stcurve, survival you can try a dirty fix by generating a variable e.g.
sts gen s2=s after running stset
then plotting it as a line against your time variable. i.e. adding this to the end of the second graph:
addplot(line s2 your_timevar, sort c(J) title("Survival probabilities"))
The equality of KM hazard and Cox hazard only holds if the first graph does not have any more predictors than failvar in the stset. So if you ran stcox, estimate after stset timevar, failure(failvar) id(idvar) it works, but if you have more variables in the stcox call this will not give you the correct plot.
edit:
As the above quick solution does not work, there is another dirty workaround: save the results from stcurve in a file (option outfile), then plot the "new" data as twoway graphs. Something like this:
stcurve, survival name("surv1") outfile(stcurve1.dta, replace)
stcurve, survival name("surv2") at1( def=0) at2( def=1) outfile(stcurve2.dta, replace)
use stcurve1.dta, clear
rename surv1 surv1_A
rename _t _tA
append using stcurve2.dta
twoway line surv1 _t, sort || line surv1_A _tA, sort
I do not know if this will work with your data: it may be that you need to manipulate the new variables in the outfiles in some way to get the desired results, and you need to add the options you want to the twoway graphs. There surely are many better and easier ways of plotting this when you have the data for the graphs in separate datafiles, but this is the first solution that sprang to mind.

How to map a function to a triple nested list and keep the triple nested list intact?

I have been building an analysis workflow for my PhD and have been using a triple nested list to represent my data structure, because I want it to be able to expand to an arbitrary amount of data in its second and third levels. The first level is the whole dataset, the second level is each subject in the dataset, and the third level is a row for each measure that each subject has.
[dataset]
|
[subject]
|
[measure1, measure2, measure3]
I am trying to map a function to each measure - for instance convert all the points into floats or replace anomalous values with None - and wish to return the whole dataset according to its nesting but my current code:
for subject in dataset:
    for measure in subject:
        map(float, measure)
...gives the correct result, exactly what I want, but I can't think how to assign the result back to the dataset efficiently or without losing a level of the nesting. Ideally, I would like it to change the measure in place, but I can't think how to do it.
Could you suggest an efficient and pythonic way of doing that? Is a triple nested list a silly way to organize my data in the program?
Rather than doing it in place, make a new list:
dataset = [[[float(value) for value in measure]
            for measure in subject]
           for subject in dataset]
return [[list(map(float, measure)) for measure in subject] for subject in dataset]
You can return a list instead of altering it in place -- this is still remarkably efficient and preserves all the information you want. (aside: In fact, it's often faster than assigning to list indexes [citation needed], which is what others have suggested here!)
A straightforward way to do that in place would be:
for subject in dataset:
    for measure in subject:
        for i, elem in enumerate(measure):
            measure[i] = float(elem)
Alternatively, use slice assignment to update the list in place with the results of map:
for subject in dataset:
    for measure in subject:
        measure[:] = map(float, measure)
This should do the job:
for subject in dataset:
    for measure in subject:
        for i, m in enumerate(measure):
            measure[i] = float(m)
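The question also mentions replacing anomalous values with None. A minimal sketch of how the new-list approach could be generalised to any per-value function is shown here. The names map_measures and to_float_or_none are hypothetical, and "anomalous" is assumed to mean "cannot be converted to a float".

```python
def to_float_or_none(value):
    """Convert value to float, or return None if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def map_measures(func, dataset):
    """Apply func to every value, keeping the dataset/subject/measure nesting."""
    return [
        [[func(value) for value in measure] for measure in subject]
        for subject in dataset
    ]


if __name__ == "__main__":
    dataset = [[["1", "2.5"], ["n/a"]], [["3"]]]
    print(map_measures(to_float_or_none, dataset))
```

Because the function builds a new nested list, the original dataset is left untouched. Assign the result back to dataset if the old values are no longer needed.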