Is there a function to retrieve the role of a variable in a 'recipe'? - r-recipes

In a recipe, you can specify the roles variables will have by passing a character string such as "outcome", "predictor", "case_weight", "ID", etc. For example, the iris dataset variables have the following roles:
> recipe(Species ~ ., data = iris) %>% summary()
# A tibble: 5 × 4
  variable     type    role      source
  <chr>        <chr>   <chr>     <chr>
1 Sepal.Length numeric predictor original
2 Sepal.Width  numeric predictor original
3 Petal.Length numeric predictor original
4 Petal.Width  numeric predictor original
5 Species      nominal outcome   original
How do I retrieve the role of a specific variable? I haven't found a function for this. Since add_role(), update_role(), remove_role() and even has_role() exist, I would expect something like the following to be possible:
> recipe(Species ~ ., data = iris) %>% get_role(Species)
outcome
Does a function like this exist?

Related

Summarise/ descriptive statistics for categorical data

I have a dataset in Stata and would like to create a descriptive statistics table. The problem is that my variables are both numerical and categorical. For the numerical variables, I know I can easily create a table with the mean, standard deviation and so on, but I am stuck on the categorical variables. For example, education has 5 levels, and I would like to show the proportion of observations at each level. This is just part of it: I want an overall table with descriptive statistics for other variables too, such as gender, age, income and level of education.
I like to use the user-contributed command table1 for this purpose. Type ssc install table1 to access the package.
sysuse auto
table1, vars(price contn \ rep78 cat)
+------------------------------------------------+
| Factor               Level   Value             |
|------------------------------------------------|
| N                            74                |
|------------------------------------------------|
| Price, mean (SD)             6,165.3 (2,949.5) |
|------------------------------------------------|
| Repair record 1978   1       2 (3%)            |
|                      2       8 (12%)           |
|                      3       30 (43%)          |
|                      4       18 (26%)          |
|                      5       11 (16%)          |
+------------------------------------------------+
Type help table1 for additional options.
asdocx has a comprehensive template for creating table1. The template can summarize different types of variables such as continuous and binary / categorical variables in a single table. Table1 template allows different statistics with categorical / factor variables, continuous variables, and binary variables. The allowed statistics are given below:
mean         Mean of the variable
sd           Standard deviation of the variable
ci           95% confidence interval
n            Counts
N            Counts
frequency    Counts
percentage   Count as percentage of total *
%            Count as percentage of total
The statistics presented in the above table can be selectively used with categorical, binary, and continuous variables. The default statistics for each type of variables are given below:
(1) Binary variables : Count (Percentages)
(2) Categorical variables : Count (Percentages)
(3) Continuous variables : Mean (95% confidence interval)
The Table1 template also supports survey weights. I have posted several examples on this page

Collapsing across all variables in stata

I am trying to collapse all variables in my dataset, which is as follows.
date number_of_patients health_center vaccinations
6/25/21 1 healthcentername 1
6/18/21 2 healthcentername 2
10/9/20 2 healthcentername 1
10/2/20 2 healthcentername 1
10/16/20 1 healthcentername 1
I am trying to collapse by date through count into:
number_of_patients healthcentername vaccinations
8 healthcentername 6
I am trying to do this across all health centers but I can't seem to do it without identifying the specific variables I want to collapse. Unfortunately, this isn't entirely feasible because I have 3500 variables in the dataframe.
Somehow you need to tell Stata which variables you want to sum by health center, but that doesn't mean you need to type them all. You can use ds to create a list of variable names. With the not option, ds lists every variable except the ones you name and stores the list in r(varlist). Like this:
* Example generated by -dataex-. For more info, type help dataex
clear
input str8 date byte number_of_patients str16 health_center byte vaccinations
"6/25/21" 1 "healthcentername" 1
"6/18/21" 2 "healthcentername" 2
"10/9/20" 2 "healthcentername" 1
"10/2/20" 2 "healthcentername" 1
"10/16/20" 1 "healthcentername" 1
end
*List all variables but the one mentioned and store list in r(varlist)
ds date health_center, not
*Sum by health center all but the variables explicitly excluded above
collapse (sum) `r(varlist)' , by(health_center)

Pivoting/reshaping a dataframe to have dates as columns

Here is my dataframe:
ID AMT DATE
0 1496846 54.76 2015-02-11
1 1496846 195.00 2015-01-09
2 1571558 11350.00 2015-04-30
3 1498812 135.00 2014-07-11
4 1498812 157.00 2014-08-04
5 1498812 110.00 2014-09-23
6 1498812 1428.00 2015-01-28
7 1558450 4355.00 2015-01-26
8 1858606 321.52 2015-03-27
9 1849431 1046.81 2015-03-19
I would like to make this a dataframe consisting of time series data for each ID. That is, each column name is a date (sorted), and it is indexed by ID, and the values are the AMT values corresponding to each date. I can get so far as doing something like
df.set_index("DATE").T
but from here I'm stuck.
I also tried
df.pivot(index='ID', columns='DATE', values='AMT')
but this gave me an error on having duplicate entries (the IDs).
I envision it as transposing DATE, and then grouping by unique ID and melting AMT underneath.
You want to use pivot_table, which has an aggfunc parameter that handles duplicate index/column pairs.
df.pivot_table(values='AMT', index='ID', columns='DATE', aggfunc='sum')
You'll need to choose how to handle the duplicates. I used 'sum' here; the default is 'mean'.
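A minimal sketch of the pivot_table approach on the sample data from the question, laid out with IDs as rows and dates as columns. The extra duplicate row is hypothetical and is only there so that aggfunc has something to combine; converting DATE with pd.to_datetime is also an assumption about the column's type.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1496846, 1496846, 1571558, 1498812, 1498812,
           1498812, 1498812, 1558450, 1858606, 1849431],
    "AMT": [54.76, 195.00, 11350.00, 135.00, 157.00,
            110.00, 1428.00, 4355.00, 321.52, 1046.81],
    "DATE": ["2015-02-11", "2015-01-09", "2015-04-30", "2014-07-11",
             "2014-08-04", "2014-09-23", "2015-01-28", "2015-01-26",
             "2015-03-27", "2015-03-19"],
})

# Hypothetical duplicate (same ID and DATE as the first row), added only
# to show how aggfunc resolves repeated ID/DATE pairs.
dup = pd.DataFrame({"ID": [1496846], "AMT": [10.00], "DATE": ["2015-02-11"]})
df = pd.concat([df, dup], ignore_index=True)

# Real dates sort chronologically when they become column labels.
df["DATE"] = pd.to_datetime(df["DATE"])

# One row per ID, one column per date; repeated ID/DATE pairs are summed.
wide = df.pivot_table(values="AMT", index="ID", columns="DATE", aggfunc="sum")
print(wide)
```

ID/DATE combinations that never occur come out as NaN; passing fill_value=0 to pivot_table is one way to replace them if zeros make sense for your data.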

sas covariates in a linear regressions

I am running a simple linear regression in SAS. The regression has three different groups of participants as the predictors (with group 1 as the reference), the outcome a continuous social support variable, and five covariates. Three of the covariates are dichotomized (age, sex, & education), one is a three-level nominal variable (marital status), and the last is continuous (it's a chronic disease index).
My question is: Do I need to specify the different types of covariates in the SAS coding somehow?
Would this coding example be correct?:
proc glm data=work.example;
class group age sex education marital education chronic_diseases;
model social_support = group age sex education marital education chronic_diseases;
estimate 'group 1 vs group 2' group -1 1 0;
estimate 'group 1 vs group 3' group -1 0 1;
run;
The class statement tells SAS to treat a variable as non-continuous, that is, categorical or binary. It doesn't differentiate between the two: by default PROC GLM sorts the levels and uses the last one as the reference, unless you specify a reference level yourself (for example with the ref= option on the class statement).
For example, if you're comparing Apples and Oranges, SAS will use Oranges as the reference value by default. Hey, they're fruit - you can compare fruit to fruit! :)
All model covariates are treated as continuous unless they are listed in a class statement. Since chronic_diseases is continuous, simply remove it from the class statement; otherwise, SAS will treat every distinct value of chronic_diseases as a level and compare each one to the reference level.
proc glm data=work.example;
class group age sex education marital;
model social_support = group age sex education marital chronic_diseases;
estimate 'group 1 vs group 2' group -1 1 0;
estimate 'group 1 vs group 3' group -1 0 1;
run;

Stata: DFL Decomposition and Bootstrapping with Complex Survey design

Hello I am trying to do a DFL style reweighting with bootstrap weights and SEs. I have a 2 stage stratified sample over 5 rounds (repeated cross section).
The idea is to create counterfactual weights for the reference population and then find the difference in mean outcomes for the two groups. This difference can be divided into three parts
Total difference (group 1 - group 2 , both using survey weights)
Explained difference (group 2 using counterfactual weights- group 2 using survey weights)
Unexplained difference (group 1 using survey weights- group 2 using counterfactual weights)
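The arithmetic of this three-part split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy outcomes, survey weights and fitted probabilities are hypothetical, and the counterfactual weight is taken to be the survey weight multiplied by the reweighting factor (p/(1-p))*(n2/n1), where p is the fitted probability of belonging to group 1.

```python
import numpy as np

# Hypothetical toy data: outcomes and survey weights for the two groups.
y1 = np.array([10.0, 12.0, 15.0, 11.0])   # group 1 outcomes
s1 = np.array([1.0, 2.0, 1.0, 1.5])       # group 1 survey weights
y2 = np.array([8.0, 9.0, 13.0])           # group 2 outcomes
s2 = np.array([1.0, 1.0, 2.0])            # group 2 survey weights

# Hypothetical fitted P(group == 1 | x) for the group 2 observations;
# in practice these would come from a logit on the covariates.
p2 = np.array([0.3, 0.5, 0.7])


def wmean(y, w):
    """Weighted mean of y with weights w."""
    return float(np.sum(w * y) / np.sum(w))


# Reweighting factor for group 2: odds of being in group 1, scaled by n2/n1.
psi = (p2 / (1.0 - p2)) * (len(y2) / len(y1))

# Assumption: counterfactual weight = survey weight x reweighting factor.
cf = s2 * psi

total = wmean(y1, s1) - wmean(y2, s2)        # group 1 - group 2
explained = wmean(y2, cf) - wmean(y2, s2)    # counterfactual - actual, group 2
unexplained = wmean(y1, s1) - wmean(y2, cf)  # group 1 - counterfactual group 2

print(total, explained, unexplained)
```

By construction the explained and unexplained parts add up to the total difference, which is a useful sanity check on any implementation.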
I have written the following program for the same
Code:
///to make sure there is no singleton strata
egen cluster_id= group(sector state region strat fsu)
egen stratum_id= group(sector state region strat)
foreach r in 1 2 3 4 5 {
preserve
qui keep if round==`r'
qui svyset cluster_id [pw=hhwt] , strata (stratum_id)
qui unique cluster_id, by (stratum_id) gen (dup)
qui by stratum_id, sort: egen temp= total(dup)
count if temp==1
drop if temp==1
drop temp dup
save "C:\Users\Round 2 Data\bs_round`r'", replace
restore
}
Code:
///final data we will use
use "C:\Users\Round 2 Data\bs_round1"
foreach r in 2 3 4 5 {
qui merge m:m round using "C:\Users\Round 2 Data\bs_round`r'"
drop _merge
sort round
tab round
}
save "C:\Users\Round 2 Data\bs_all"
Code:
///constructing bootstrap weights
egen pooled_cid= group (cluster_id round)
egen pooled_sid= group (stratum_id round)
svyset pooled_cid [pw=hhwt], strata( pooled_sid)
bsweights bsw, reps(100) n(-1)
svyset pooled_cid [pw=hhwt], strata( pooled_sid) bsrweight(bsw*) vce(bootstrap)
Code:
///writing the program
#delimit ;
capture program drop mydfl;
program define mydfl, eclass properties (svyb);
version 13;
args wgtname xvars outcome;
gen groupref=(group==1);
egen countg1=sum(group==1);
egen countg2=sum(group==2);
logit groupref `xvars';
predict phatref;
gen `wgtname'2=(phatref/(1-phatref))*(countg2/countg1) if group==2;
replace `wgtname'2=1 if group==1;
gen `wgtname'1=((1-phatref)/phatref)*(countg1/countg2) if group==1;
replace `wgtname'1=1 if group==2;
drop phatref groupref countg*;
forvalues i=1/2 {;
sum `wgtname'`i' if group==`i';
replace `wgtname'`i' = `wgtname'`i' / r(mean) if group==`i';
};
mean `outcome' if group==1 ;
mat diff_1=e(b) ;
mean `outcome' if group==2 ;
mat diff_2=e(b) ;
mean `outcome' if group==2 [pw=`wgtname'2] ;
mat diff_3=e(b) ;
mat dd_t = diff_1-diff_2 ;
mat dd_e= diff_3-diff_2 ;
mat dd_u= diff_1-diff_3 ;
ereturn scalar dd_tot=el(dd_t,1,1) ;
ereturn scalar dd_exp=el(dd_e,1,1) ;
ereturn scalar dd_unex=el(dd_u,1,1) ;
end;
Code:
///running the program
local xvars age i.state fhead yrs_ed marital rural
local outcome wage
svy bootstrap e(dd_tot) e(dd_exp) e(dd_unex): mydfl wtid "`xvars'" `outcome'
I want to find the standard error for the mean gap, mean explained gap and mean unexplained gap in outcome-in this case wage of the two groups.
I keep getting the following error (after the program creates wtid1 and wtid2)
Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 100
insufficient observations to compute bootstrap standard errors
no results will be saved
What am I doing wrong?
Also posted on http://www.statalist.org/forums/forum/general-stata-discussion/general/1309830-dfl-decomposition-and-bootstrapping-with-complex-survey-design
A certain cause of bootstrap failure is that the program creates permanent variables.
Here is the first generate statement:
gen groupref=(group==1);
bootstrap first runs the program on the entire data set, and the variable groupref is added. Next, the first bootstrap replicate is drawn, and the program is run on that replicate. The generate statement now fails because the variable already exists, so the entire program fails, and the only indication is the "x" in the replication output.
The solution is to designate all variables created by generate, egen, or predict as temporary variables. These will be dropped after each replicate is analyzed. Here is the usage:
tempvar groupref;
gen `groupref' = (group==1);
tempvar creates local macros that hold temporary variable names, and it can take a list of names. Similar commands are tempname and tempfile.