-esttab- dropping time dummies - compare

I am trying to export my regression results from Stata to Excel.
xtreg index_bhar adsue1 i.date, fe cluster(company_id) dfadj
estimates store FTFE1,title("Stock FTFE DSUE1")
xtreg index_bhar adsue2 i.date, fe cluster(company_id) dfadj
estimates store FTFE2,title("Stock FTFE DSUE2")
xtreg index_bhar adsue3 i.date, fe cluster(company_id) dfadj
estimates store FTFE3,title("Stock FTFE DSUE3")
esttab using BHAR_regression_csv.csv, cells(b(star fmt(%9.5f)) se(par fmt(%9.5f))) drop(_Idate*) stats(N ar2)
The last step does not work, no matter "how" I drop the i.date variable. I tried:
_Idate*
*date*
i.date
*_date_*
Moreover, I would like to add text (like in outreg2, http://dss.princeton.edu/training/Outreg2.pdf) because I am going to compare FE vs RE vs OLS.
Edit
I run
xtreg index_bhar adsue3 i.date, fe cluster(company_id) dfadj
matrix list r(table)
which esttab
and get the following result:
r(table)[9,37]
177b. 178. 179. 180. 181. 182. 183. 184. 185. 186.
adsue3 date date date date date date date date date date
b .06636876 0 -.00551558 -.01558847 -.01223474 -.02825648 -.01797169 -.02353377 -.02351679 -.03266691 -.0270249
se .00547756 . .00742083 .00708063 .00731967 .0070688 .0064812 .00733153 .00801699 .00725142 .00718932
t 12.116494 . -.743256 -2.2015648 -1.6714891 -3.9973525 -2.7728929 -3.2099395 -2.9333704 -4.5048967 -3.7590343
pvalue 2.123e-29 . .45771891 .0282086 .09532812 .00007497 .00578939 .00142372 .00352624 8.497e-06 .00019331
ll .05560367 . -.02009981 -.0295041 -.02662015 -.04214886 -.03070926 -.0379425 -.03927265 -.04691819 -.04115414
ul .07713385 . .00906865 -.00167284 .00215067 -.01436411 -.00523411 -.00912505 -.00776093 -.01841562 -.01289566
df 445 445 445 445 445 445 445 445 445 445 445
crit 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092
eform 0 0 0 0 0 0 0 0 0 0 0
187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197.
date date date date date date date date date date date
b -.01494385 -.04620886 -.0555639 -.03572873 -.02022551 -.04343248 -.02897542 .00823335 .00360491 -.03832982 .01898826
se .00767353 .00702574 .00719699 .00910337 .00841343 .00953967 .01449665 .01373734 .01801534 .00976289 .01180829
t -1.9474554 -6.5770847 -7.7204359 -3.9247818 -2.403956 -4.5528285 -1.9987665 .59934072 .20010199 -3.9260751 1.6080448
pvalue .05210873 1.349e-10 7.722e-14 .00010055 .01662665 6.841e-06 .04624143 .54925071 .84149223 .00010003 .10853459
ll -.03002471 -.0600166 -.06970821 -.05361967 -.0367605 -.06218089 -.05746582 -.01876477 -.03180081 -.05751691 -.00421868
ul .000137 -.03240112 -.04141959 -.0178378 -.00369052 -.02468408 -.00048502 .03523146 .03901062 -.01914273 .04219519
df 445 445 445 445 445 445 445 445 445 445 445
crit 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092
eform 0 0 0 0 0 0 0 0 0 0 0
198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208.
date date date date date date date date date date date
b -.00349209 .01009475 -.05471275 -.02974371 -.02135717 -.02611022 -.03897832 -.05436064 -.01511528 -.04263671 -.03508992
se .00784112 .01076697 .00854308 .00684135 .00770194 .00762225 .00711017 .01004461 .00849691 .00786062 .00874839
t -.44535604 .93756631 -6.4043313 -4.3476349 -2.7729599 -3.4255286 -5.4820548 -5.4119214 -1.7789158 -5.4240873 -4.0110133
pvalue .65627905 .34897589 3.845e-10 .00001707 .00578822 .00067052 7.057e-08 1.021e-07 .07593611 9.579e-08 .00007091
ll -.01890232 -.01106568 -.07150255 -.04318909 -.03649387 -.04109029 -.05295199 -.0741014 -.03181434 -.05808527 -.05228321
ul .01191814 .03125518 -.03792294 -.01629833 -.00622048 -.01113015 -.02500464 -.03461988 .00158377 -.02718815 -.01789662
df 445 445 445 445 445 445 445 445 445 445 445
crit 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092 1.9653092
eform 0 0 0 0 0 0 0 0 0 0 0
209. 210. 211.
date date date _cons
b -.0238768 -.00742588 -.02083166 -.00432197
se .00831877 .00967837 .0094757 .00546817
t -2.8702321 -.76726574 -2.198429 -.79038681
pvalue .00429722 .44333042 .02843267 .4297229
ll -.04022575 -.02644688 -.03945434 -.01506861
ul -.00752784 .01159511 -.00220897 .00642467
df 445 445 445 445
crit 1.9653092 1.9653092 1.9653092 1.9653092
eform 0 0 0 0
. which esttab
c:\ado\plus\e\esttab.ado
*! version 2.0.6 02jun2014 Ben Jann
*! wrapper for estout
I am using Stata/SE 12.0.
I have run the code but the result is the same, Stata cannot find the coefficient *.date.

You don't specify your exact error. Presumably:
coefficient not found
r(111);
but I can't really understand what you say you've tried.
With factor-variable notation, virtual variables of the form num.varname are created, so dropping *.varname is enough. A nonsensical example:
clear all
set more off
*----- example data -----
sysuse auto
*----- what you want -----
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg i.foreign
esttab, ar2 title(Model Comparison for Price) ///
mtitles("first model" "second model") drop(*.foreign)
I'm not able to open the link you give, but consider using the title() and mtitles() options to introduce text in the table (also exemplified). Add using ... to export to .csv.
Edit
One issue with your code is that you're not using eststo to store your results, but rather estimates store. This is valid, but then you have to explicitly list the names of the stored results when you call esttab. If you don't, then esttab will use only the results of the model that was last fit (whether they were saved or not using estimates store). For example, this works as expected:
clear all
set more off
*----- example data -----
sysuse auto
*----- what you want -----
estimates clear
quietly regress price weight mpg
estimates store first
quietly regress price weight mpg i.foreign
estimates store second
esttab first second, ar2 title(Model Comparison for Price) ///
mtitles("first model" "second model") drop(*.foreign)
The implication, then, is that your code uses only the last available result:
xtreg index_bhar adsue3 i.date, fe cluster(company_id) dfadj
But here, I suspect the option drop(*.date) should work.
Edit 2
Try running this and report back exact results in your original post:
xtreg index_bhar adsue3 i.date, fe cluster(company_id) dfadj
matrix list r(table)
esttab, ar2 title(Model Comparison) ///
mtitles("first model") drop(*.date)

Related

How to compose new string variable using bysort

I want to write Stata code that generates a new string variable, suspect, listing the urid values within each ssuid whose corresponding score is less than or equal to 60 or missing.
Input:
primkey ssuid sup urid score
10312551 1255 601 122 60
10312552 1255 601 122 80
10312553 1255 601 123 90
10312554 1255 601 124 66
10312561 1256 601 122 40
10312562 1256 601 123 30
10312563 1256 601 124 .
10312564 1256 601 125 66
10312581 1258 602 126 80
10312582 1258 602 127 95
10312583 1258 602 127 100
10312584 1258 602 128 .
Output:
ssuid sup suspect
1255 601 122
1256 601 122,123,124
1258 602 128
The variables primkey and score fields are not required in output.
The following is the code that I have already tried:
sort state ssuid urid sup
gen x=_n
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[x-1]
drop x
sort state ssuid suspect
gen x=_n
tostring suspect, replace
replace suspect=suspect[x-1]+","+suspect if suspect!="." & ssuid==ssuid[x-1]
drop x
gen x=strlen(ssususpect_sup)
gsort state ssuid -x
drop x
gen x=_n
replace suspect=suspect_sup[x-1] if ssuid==ssuid[x-1]
drop x
bys ssuid:gen x=_n
keep if x==1
drop x
However, this is not producing the desired result.
Your example doesn't make complete sense.
For ssuid 1255 the lowest score is 60. So, no score is less than 60.
For ssuid 1258 the lowest score is 80.
Stata's rule is that missing is arbitrarily large positive, so never less than 60.
It seems that your rule is less than or equal to 60 or missing.
With those fixes this works for me:
clear
input primkey ssuid sup urid score
10312551 1255 601 122 60
10312552 1255 601 122 80
10312553 1255 601 123 90
10312554 1255 601 124 66
10312561 1256 601 122 40
10312562 1256 601 123 30
10312563 1256 601 124 .
10312564 1256 601 125 66
10312581 1258 602 126 80
10312582 1258 602 127 95
10312583 1258 602 127 100
10312584 1258 602 128 .
end
egen tokeep = max(score <= 60 | missing(score)), by(ssuid)
keep if tokeep
drop tokeep
drop if inrange(score, 61, .)
bysort ssuid (primkey) : gen suspect = string(urid) if _n == 1
by ssuid: replace suspect = suspect[_n-1] + ///
"," + string(urid) if _n > 1 & urid != urid[_n-1]
by ssuid: keep if _n == _N
keep ssuid sup suspect
list
+---------------------------+
| ssuid sup suspect |
|---------------------------|
1. | 1255 601 122 |
2. | 1256 601 122,123,124 |
3. | 1258 602 128 |
+---------------------------+
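For readers who want to cross-check the rule outside Stata, here is a minimal pandas sketch of the same logic, assuming the rule is "score less than or equal to 60, or missing". The data are copied from the question; the names `flagged` and `suspects` are illustrative only.

```python
import pandas as pd

# Data from the question; None stands for a missing score.
rows = [
    (10312551, 1255, 601, 122, 60), (10312552, 1255, 601, 122, 80),
    (10312553, 1255, 601, 123, 90), (10312554, 1255, 601, 124, 66),
    (10312561, 1256, 601, 122, 40), (10312562, 1256, 601, 123, 30),
    (10312563, 1256, 601, 124, None), (10312564, 1256, 601, 125, 66),
    (10312581, 1258, 602, 126, 80), (10312582, 1258, 602, 127, 95),
    (10312583, 1258, 602, 127, 100), (10312584, 1258, 602, 128, None),
]
df = pd.DataFrame(rows, columns=["primkey", "ssuid", "sup", "urid", "score"])

# Assumed rule: a row is suspect if score <= 60 or score is missing.
flagged = df[(df["score"] <= 60) | df["score"].isna()]

# Join the distinct urid values per ssuid, in primkey order.
suspects = (
    flagged.sort_values("primkey")
    .groupby(["ssuid", "sup"])["urid"]
    .agg(lambda s: ",".join(str(u) for u in s.drop_duplicates()))
    .reset_index(name="suspect")
)
print(suspects)
```

Unlike Stata, pandas treats a missing value as neither smaller nor larger than 60, so the missing case has to be tested explicitly with `isna()`.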
EDIT Let's look at the original code. I have made some comments and found some code that can be slimmed and one clear bug, but I bailed out after that. Problems must be reproducible!
sort state ssuid urid sup
gen x=_n
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[x-1]
drop x
sort state ssuid suspect
gen x=_n
tostring suspect, replace
replace suspect=suspect[x-1]+","+suspect if suspect!="." & ssuid==ssuid[x-1]
drop x
gen x=strlen(ssususpect_sup)
gsort state ssuid -x
drop x
gen x=_n
replace suspect=suspect_sup[x-1] if ssuid==ssuid[x-1]
drop x
bys ssuid:gen x=_n
keep if x==1
drop x
*1 There is nothing in the data example about a variable -state-.
* What you don't show, we need not try to replicate
*2 Your use of x for _n isn't needed. You can use _n directly.
* That saves 8 lines
sort ssuid urid sup
gen suspect=.
replace suspect=urid if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
tostring suspect, replace
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*3 Your first three lines can be slimmed to one
bysort ssuid urid (sup) : gen suspect = urid if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
tostring suspect, replace
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*4 It's simpler to create -suspect- as string in the first place
* so cut the -tostring- line
bysort ssuid urid (sup) : gen suspect = string(urid) if (score <=60 | score==.) & urid=urid[_n-1]
sort ssuid suspect
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*5 bug: testing for equality needs == not =
bysort ssuid urid (sup) : gen suspect = string(urid) if (score <=60 | score==.) & urid==urid[_n-1]
sort ssuid suspect
replace suspect=suspect[_n-1]+","+suspect if suspect!="." & ssuid==ssuid[_n-1]
gen x=strlen(ssususpect_sup)
gsort ssuid -x
drop x
replace suspect=suspect_sup[_n-1] if ssuid==ssuid[_n-1]
bys ssuid: keep if _n == 1
*6 you refer to a variable -ssususpect_sup- which you haven't supplied as
* data, or created in your code. At this point I bail out.

Pandas Nested DataFrame assignment

I have the following DataFrame:
prefix operator_name country_name mno_subscribers
0 267.0 Airtel Botswana 490
1 373.0 Orange Moldova 207
2 248.0 Airtel Seychelles 490
3 91.0 Reliance Bostwana 92
4 233.0 Vodafone Bostwana 516
I am trying to achieve this:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 0.045
1 373.0 Orange Moldova 207 0.004
2 248.0 Airtel Seychelles 490 0.135
3 91.0 Reliance India 92 0.945
4 233.0 Vodafone Ghana 516 0.002
With this:
countries = df["country_name"].unique()
df["operator_proba"] = 0
for country in countries:
    country_name = df[df["country_name"] == country]
    for operator in country:
        mno_sum = country_name["mno_subscribers"].sum()
        df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
Where am I going wrong in assigning the operator_proba to the original DataFrame?
This line
df["operator_proba"]["country_name"] = country_name["mno_subscribers"] / mno_sum
can't really work, since df["operator_proba"] is a column (or Series); you can't use ["country_name"] indexing on that.
That is probably why things don't work for you.
It's not entirely clear what you want to achieve, but I guess this may work:
df['operator_proba'] = df.groupby('country_name')['mno_subscribers'].apply(lambda x : x/x.sum())
This saves you a double loop, and is more Pandas-style (there are probably even nicer ways to compute this). The result is:
prefix operator_name country_name mno_subscribers operator_proba
0 267.0 Airtel Botswana 490 1.000000
1 373.0 Orange Moldova 207 1.000000
2 248.0 Airtel Seychelles 490 1.000000
3 91.0 Reliance Bostwana 92 0.151316
4 233.0 Vodafone Bostwana 516 0.848684
With the limited data set (and the Botswana/Bostwana spelling difference), most "probabilities" are 1.
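One of the "nicer ways" could be `transform`, which returns a result aligned to the original row index, so the assignment does not depend on how a given pandas version indexes the output of `apply`. This is a sketch on the five rows from the question, not a tested drop-in for the full data set; `country_total` is an illustrative name.

```python
import pandas as pd

# The five rows from the question (including the "Bostwana" spelling).
df = pd.DataFrame({
    "prefix": [267.0, 373.0, 248.0, 91.0, 233.0],
    "operator_name": ["Airtel", "Orange", "Airtel", "Reliance", "Vodafone"],
    "country_name": ["Botswana", "Moldova", "Seychelles", "Bostwana", "Bostwana"],
    "mno_subscribers": [490, 207, 490, 92, 516],
})

# Per-row total of subscribers within the row's country.
country_total = df.groupby("country_name")["mno_subscribers"].transform("sum")

# Share of each operator within its country.
df["operator_proba"] = df["mno_subscribers"] / country_total
print(df)
```

Because `transform("sum")` yields one value per original row, the division is a plain element-wise operation and no loop over countries is needed.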

Create table with frequency buckets in Base SAS

Below is a sample of my dataset:
City Days
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
I would like to break down "Days" into buckets of 0-30, 30-90, 90-180, 180+, such that the "buckets" are along the x-axis of the table, and the cities are along the y-axis.
I tried using PROC FREQ, but I don't have SAS/STAT. Is there any way to do this in base SAS?
I believe this is what you want. This is most certainly a "brute force" approach, but I think it outlines the concept correctly.
data have;
length city $9;
input city dayscount;
cards;
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
;
run;
options validvarname=any; /* name literals such as '0-30'n need this setting */
data want;
set have;
/* lower bounds are exclusive so that 30, 90 and 180 fall into one bucket only */
if 0 <= dayscount <= 30 then '0-30'n = dayscount;
if 30 < dayscount <= 90 then '30-90'n = dayscount;
if 90 < dayscount <= 180 then '90-180'n = dayscount;
if dayscount > 180 then '180+'n = dayscount;
drop dayscount;
run;
One of the ways for solving this problem is by using Proc Format for assigning the value bucket and then using Proc Transpose for the desired result:
data city_day_split;
length city $12.;
input city dayscount;
cards;
atlanta 10
tampa 95
atlanta 100
charlotte 20
charlotte 31
tampa 185
;
run;
/****Assigning the buckets****/
proc format;
value buckets
0 - <30 = '0-30'
30 - <90 = '30-90'
90 - <180 = '90-180'
180 - high = 'gte180'
;
run;
data city_day_split;
set city_day_split;
day_bucket = put(dayscount,buckets.);
run;
proc sort data=city_day_split out=city_day_split;
by city;
run;
/****Making the Buckets as columns, City as rows and daycount as Value****/
proc transpose data=city_day_split out=city_day_split_1(drop=_name_);
by city;
id day_bucket;
var dayscount;
run;
My Output:
city      | 0-30 | 90-180 | 30-90 | GTE180
Atlanta   | 10   | 100    | .     | .
Charlotte | 20   | .      | 31    | .
Tampa     | .    | 95     | .     | 185
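If the goal is a frequency table (a count per city and bucket) rather than the raw day values, the bucket logic can be cross-checked outside SAS with a short pandas sketch. It assumes the half-open intervals of the PROC FORMAT above (0 to under 30, 30 to under 90, 90 to under 180, 180 and over); the column names are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "city": ["Atlanta", "Tampa", "Atlanta", "Charlotte", "Charlotte", "Tampa"],
    "days": [10, 95, 100, 20, 31, 185],
})

# Half-open buckets [0,30), [30,90), [90,180), [180,inf), as in the format.
bins = [0, 30, 90, 180, float("inf")]
labels = ["0-30", "30-90", "90-180", "180+"]
df["bucket"] = pd.cut(df["days"], bins=bins, labels=labels, right=False)

# Cities down the rows, buckets across the columns, counts in the cells.
counts = pd.crosstab(df["city"], df["bucket"])
print(counts)
```

With `right=False` each boundary value belongs to exactly one bucket, which avoids the ambiguity of a description such as "0-30, 30-90".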

Percentage calculation based on multiple columns SAS

I have data with a patient ID column and a second column with illness, which can hold more than one category separated by commas. I need to find the total number of patients per illness category and the percentage of patients per category. I tried the usual way; it gives the correct frequency but not the correct percent.
The data looks like this.
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
where the empty spaces represent no illness. I first separated the illnesses into separate columns, but then got stuck there not knowing how to process to find out the percent.
The code I wrote is as below:
Data new;
set old;
array P(3) $ L1 L2 L3;
do i = 1 to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
Then I created a new column to copy all the illnesses to it so I thought it would give me the correct frequency, but it did not give me the correct percent.
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
I have to create a table something like
Total number of patients Percent
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
Please help.
You're trying to output percentages on non-mutually exclusive groups (each illness). It isn't obvious in SAS how to do this.
The following takes Joe's input code but takes an alternative route in determining percentages from event data (a 'long' dataset, if you will). I prefer this to creating a binary variable for each illness at the patient level (a 'wide' dataset) as, for me, this soon gets unwieldy. That said, if you then go on to do some modelling then a 'wide' dataset is usually more useful.
The following code produces output as follows:-
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
Note that patient 5 is repeated in your data for illness lf5. My code only counts this record once. This is fine if a chronic illness but not if acute. Also, my code includes patients in the denominator who do not have an event.
Finally, you can see another example of this code using dates - with test data - here at the mycodestock.com code sharing site => https://mycodestock.com/public/snippet/11251
Here's the code for the table above:-
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT removes duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort data = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
You're going about this the right way, but you need to compute the percent differently. Normally percent is 'percent of whole dataset', which means that it is going to triplicate your base. You want the percent based on each illness. This means you need a 1/0 indicator for each illness.
The one downside is you have the 0's in your automatic tables; you would have to output the table to a dataset and remove them, then proc print/report the resulting dataset to get the 1's only - or use PROC SQL to generate the table.
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
The multilabel format solution is interesting, so I present it separately.
Using the same have, we create a format that takes every combination of illnesses and outputs a row for each illness in it, i.e., if you have "1,2,3", it outputs rows
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
Enabling multilabel formats and using a class-enabled proc like proc tabulate, you can then use this to allow each respondent to count in each of the label values, but not be counted more than once against the total.
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;
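The counting rule behind the answers above (each patient counted at most once per illness, and every patient, with or without an illness, in the denominator) can be cross-checked with a small pandas sketch. The data are copied from the question; the names `long`, `counts` and `summary` are illustrative, not part of any answer's code.

```python
import pandas as pd

# Data from the question: patient ID, then an optional comma-separated list.
raw = """4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11"""

rows = []
for line in raw.splitlines():
    parts = line.split()
    rows.append((int(parts[0]), parts[1] if len(parts) > 1 else ""))
df = pd.DataFrame(rows, columns=["id", "illness"])

# Denominator: distinct patients, including those with no illness.
n_patients = df["id"].nunique()

# One row per patient and illness; drop blanks and repeated pairs.
long = df.assign(illness=df["illness"].str.split(",")).explode("illness")
long = long.reset_index(drop=True)
long = long[long["illness"] != ""].drop_duplicates(["id", "illness"])

counts = long.groupby("illness")["id"].nunique()
summary = pd.DataFrame({"patients": counts, "percent": 100 * counts / n_patients})
print(summary)
```

Because the illness groups are not mutually exclusive, the percentages need not add up to 100.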

How to output R square and RMSE in SAS

I regress Y on X. The regression is grouped by DATE.
proc reg data=A noprint;
model Y=X;
by DATE;
run;
In the regression result for each DATE, there is an R-square and an RMSE value.
How do I output a table of R square and RMSE values as below. Thank you!
|------------|-----------|------------|
| Date | R Square | RMSE |
|------------|-----------|------------|
| 1/1/2009 | R1 | RMSE1 |
| 2/1/2009 | R2 | RMSE2 |
| 3/1/2009 | R3 | RMSE3 |
| .... | .... | .... |
| 10/1/2010 | R22 | RMSE22 |
| 11/1/2010 | R23 | RMSE23 |
| 12/1/2010 | R24 | RMSE24 |
|------------|-----------|------------|
If you look at one of the examples of the SAS PROC REG, this is pretty easy to do.
The example data:
data htwt;
input sex $ age :3.1 height weight @@;
datalines;
f 143 56.3 85.0 f 155 62.3 105.0 f 153 63.3 108.0 f 161 59.0 92.0
f 191 62.5 112.5 f 171 62.5 112.0 f 185 59.0 104.0 f 142 56.5 69.0
f 160 62.0 94.5 f 140 53.8 68.5 f 139 61.5 104.0 f 178 61.5 103.5
f 157 64.5 123.5 f 149 58.3 93.0 f 143 51.3 50.5 f 145 58.8 89.0
m 164 66.5 112.0 m 189 65.0 114.0 m 164 61.5 140.0 m 167 62.0 107.5
m 151 59.3 87.0
;
run;
Now, the PROC REG. Pay particular attention to the OUTEST= option; this outputs a dataset est1 that will contain what you want.
proc reg outest=est1 outsscp=sscp1 rsquare;
by sex;
eq1: model weight=height;
eq2: model weight=height age;
proc print data=sscp1;
title2 'SSCP type data set';
proc print data=est1;
title2 'EST type data set';
run;
quit;
So now, the last PROC PRINT of est1 can be easily routed to contain only what you want:
proc print data=est1;
var sex _RMSE_ _RSQ_;
run;
You can use this code (NOPRINT is omitted here because it also suppresses the ODS tables, so FitStatistics would not be created):
proc reg data=A;
model Y=X;
by DATE;
ods output FitStatistics=final_table;
run;
data final_table; set final_table;
rename cValue1=RMSE;
rename cValue2=RSquared;
proc print data=final_table;
where Label1='Root MSE';
var DATE RSquared RMSE;
run;
I simply turned the trace on and took the name of the table I was interested in and saved it as the output of proc reg.
ODS Trace on;
proc reg data=A;
model Y=X;
by DATE;
run;
ODS Trace off;
This prints the names of all the output tables to the log.
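To sanity-check the two numbers taken from the SAS tables, here is a minimal NumPy sketch of R-square and Root MSE for a simple regression fitted separately per group. It assumes Root MSE is the square root of SSE divided by n - 2 (the error degrees of freedom with an intercept and one regressor). The dates and values below are made-up toy data, and `fit_stats` is a hypothetical helper, not a SAS or NumPy function.

```python
import numpy as np

def fit_stats(x, y):
    """Return (R-square, Root MSE) for the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    sse = float(resid @ resid)                 # error sum of squares
    sst = float(((y - y.mean()) ** 2).sum())   # corrected total sum of squares
    r_square = 1.0 - sse / sst
    root_mse = float(np.sqrt(sse / (len(x) - 2)))  # assumes 2 estimated parameters
    return r_square, root_mse

# Hypothetical toy data: one (x, y) pair of lists per date.
groups = {
    "1/1/2009": ([0, 1, 2, 3], [0, 1, 1, 2]),
    "2/1/2009": ([0, 1, 2, 3], [1, 3, 5, 7]),
}

results = {date: fit_stats(x, y) for date, (x, y) in groups.items()}
for date, (r_square, root_mse) in results.items():
    print(date, round(r_square, 4), round(root_mse, 4))
```

The second toy group lies exactly on a line, so its R-square is 1 and its Root MSE is 0 up to floating-point error.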