I am trying to combine the cumpct results from estpost tabulate with the summary statistics I obtain from estpost tabstat in a single esttab output. However, the code below produces a blank cumpct column. I believe the problem stems from the way I store the cumpct matrix, but unfortunately I couldn't find a solution.
clear
input float A wage
1 100
3 450
2 180
2 190
1 70
4 880
3 65
5 40
1 144
4 28
5 110
end
* tabulation
estpost tabulate A
matrix cumpct=e(cumpct)
* Summary Stats
estpost tabstat wage, ///
statistics(mean sd p25 p50 p75) ///
columns(statistics) by(A)
* Esttab
esttab ., replace ///
cells("cumpct mean sd p25 p50 p75")
The result I get is the following:
------------------------------------------------------------------------------------------
(1)
cumpct mean sd p25 p50 p75
------------------------------------------------------------------------------------------
1 104.6667 37.22007 70 100 144
2 185 7.071068 180 185 190
3 257.5 272.2361 65 257.5 450
4 454 602.455 28 454 880
5 75 49.49747 40 75 110
Total 205.1818 252.3192 65 110 190
------------------------------------------------------------------------------------------
N 11
------------------------------------------------------------------------------------------
In your example after running:
estpost tabstat wage, ///
statistics(mean sd p25 p50 p75) ///
columns(statistics) by(A)
What is stored in e() is:
. ereturn list
scalars:
e(N) = 11
macros:
e(cmd) : "estpost"
e(subcmd) : "tabstat"
e(stats) : "mean sd p25 p50 p75"
e(vars) : "wage"
e(byvar) : "A"
matrices:
e(mean) : 1 x 6
e(sd) : 1 x 6
e(p25) : 1 x 6
e(p50) : 1 x 6
e(p75) : 1 x 6
So when you run
esttab ., replace ///
cells("cumpct mean sd p25 p50 p75")
cumpct is not found in e() and the column is therefore left empty.
It is possible to manually add the cumulative matrix to e() with a small helper program.
clear
input float A wage
1 100
3 450
2 180
2 190
1 70
4 880
3 65
5 40
1 144
4 28
5 110
end
// Helper program
cap program drop add_e
program add_e, eclass
args name matrix
ereturn matrix `name' = `matrix'
end
* tabulation
estpost tabulate A
matrix cumpct=e(cumpct)
* Summary Stats
estpost tabstat wage, ///
statistics(mean sd p25 p50 p75) ///
columns(statistics) by(A)
add_e "cumpct" cumpct
* Esttab
esttab ., replace ///
cells("cumpct mean sd p25 p50 p75")
Result:
------------------------------------------------------------------------------------------
(1)
cumpct mean sd p25 p50 p75
------------------------------------------------------------------------------------------
1 27.27273 104.6667 37.22007 70 100 144
2 45.45455 185 7.071068 180 185 190
3 63.63636 257.5 272.2361 65 257.5 450
4 81.81818 454 602.455 28 454 880
5 100 75 49.49747 40 75 110
Total 205.1818 252.3192 65 110 190
------------------------------------------------------------------------------------------
N 11
------------------------------------------------------------------------------------------
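An alternative, if the estadd command from the same estout package is installed, is to skip the helper program and add the stored matrix with estadd directly. A sketch, assuming estadd's matrix subcommand behaves as documented for estout:

* Sketch: estadd (part of the estout package) can attach a stored
* matrix to e() directly, replacing the add_e helper above.
estpost tabulate A
matrix cumpct = e(cumpct)
estpost tabstat wage, statistics(mean sd p25 p50 p75) ///
    columns(statistics) by(A)
estadd matrix cumpct = cumpct
esttab ., replace cells("cumpct mean sd p25 p50 p75")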
Related
I have data for sales in 3 months (sale1, sale2, and sale3), and I need to show the different summations with different filters.
data sales;
input area load $ prod :$8. sale1 sale2 sale3;
diff=sale3-sale2;
datalines;
1 Y p1 109 117 138
1 N p1 23 29 20
1 Y p2 78 70 68
1 N p2 63 19 22
2 Y p1 49 36 32
2 N p1 50 39 44
2 Y p3 138 157 158
2 N p3 110 126 107
3 Y p2 251 267 259
3 N p2 182 184 160
;
run;
ods excel close;
ods excel file="/C:/data/t1.xlsx"
options (sheet_name="tab1" frozen_headers='3' frozen_rowheaders='2'
embedded_footnotes='yes' autofilter='1-8');
proc report data=sales nocenter;
column area load prod sale1 sale2 sale3 diff change;
define area -- diff/ display;
define sale1-- diff / analysis sum format=comma12. style(column)=[cellwidth=.5in];
define change / computed format=percent8.2 '% change' style(column)=[cellwidth=.8in];
compute change;
change = diff.sum/sale2.sum;
if change >= 0.1 then call define ("change",'STYLE','STYLE=[color=red
fontweight=bold]');
if change <= -0.1 then call define ("change",'STYLE','STYLE=[color=blue
fontweight=bold]');
endcomp;
rbreak after / summarize style=[background=lightblue font_weight=bold];
run;
ods excel close;
This report with no filtering looks like the first screenshot (original report),
but if I filter on column load='Y' in the .xlsx file, I want to see a result like the second screenshot (output with filter).
I wonder if anyone can help. Thanks!
I have a dataset that looks like the following for multiple patients. I am trying to subtract the baseline value of each variable from the corresponding visit values (baseline values are sometimes missing).
Data Have:
Patient Variable Value Visit
A Height 100 Baseline
A Weight 50 Baseline
A HDCIRC 30 Baseline
A BMI 50 Baseline
A Height 120 a
A Weight 50 a
A HDCIRC 30 a
A BMI 34.7 a
A Height 150 b
A Weight 51 b
Data Want:
Patient Variable Value Visit BASELINE Change
A Height 100 Baseline 100 0
A Weight 50 Baseline 50 0
A HDCIRC 30 Baseline 30 0
A BMI 50 Baseline 50 0
A Height 120 a 100 20
A Weight 50 a 50 0
A HDCIRC 30 a 30 0
A BMI 34.7 a 50 -15.3
A Height 150 b 100 50
A Weight 51 b 50 1
My attempt would be to first create BASELINE and then calculate the change.
In order to get BASELINE, I've seen some people use a lag or a dif function. How can I correctly create the BASELINE variable?
proc sort data=have;
by patient visit;
run;
data want;
set have;
by patient visit;
difstamp = dif(visit);
if first.patient then do;
dif=0;
end;
else dif=difstamp;
drop difstamp;
run;
proc sort data=want;
by timestamp;
run;
As an alternative, you could simply merge have with itself:
data have;
input Patient $ Variable $ Value Visit $;
cards;
A Height 100 Baseline
A Weight 50 Baseline
A HDCIRC 30 Baseline
A BMI 50 Baseline
A Height 120 a
A Weight 50 a
A HDCIRC 30 a
A BMI 34.7 a
A Height 150 b
A Weight 51 b
;
proc sort;
by patient variable;
run;
data want;
merge have have(where=(__visit='Baseline') keep=patient variable value visit rename=(visit=__visit value=BASELINE))
;
by patient variable;
Change=Value-BASELINE;
drop __:;
run;
It probably helps to sort by PATIENT VARIABLE so that you can get the baseline.
If your VISIT variable doesn't sort BASELINE to the first observation within each group, you can use WHERE= dataset options to make sure the baseline rows appear first.
data have;
input Patient $ Variable $ Value Visit $;
cards;
A Height 100 Baseline
A Weight 50 Baseline
A HDCIRC 30 Baseline
A BMI 50 Baseline
A Height 120 a
A Weight 50 a
A HDCIRC 30 a
A BMI 34.7 a
A Height 150 b
A Weight 51 b
;
proc sort;
by patient variable visit;
run;
data want;
set have(in=in1 where=(visit='Baseline'))
have(in=in2 where=(visit^='Baseline'))
;
by patient variable ;
if first.variable then do;
if in1 then baseline=Value;
else baseline=.;
retain baseline;
end;
if n(value,baseline)=2 then change=value-baseline;
run;
Result:
Obs PATIENT VARIABLE VALUE VISIT BASELINE CHANGE
1 A BMI 50.0 Baseline 50 0.0
2 A BMI 34.7 a 50 -15.3
3 A HDCIRC 30.0 Baseline 30 0.0
4 A HDCIRC 30.0 a 30 0.0
5 A Height 100.0 Baseline 100 0.0
6 A Height 120.0 a 100 20.0
7 A Height 150.0 b 100 50.0
8 A Weight 50.0 Baseline 50 0.0
9 A Weight 50.0 a 50 0.0
10 A Weight 51.0 b 50 1.0
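The same self-merge idea can also be written in PROC SQL. A sketch, assuming the have dataset from above, with the Baseline rows joined back onto every visit row:

/* Sketch: left-join the Baseline rows onto every visit row.
   Same idea as the self-merge above, expressed in SQL. */
proc sql;
  create table want_sql as
  select h.*,
         b.value as baseline,
         h.value - b.value as change
  from have as h
  left join have(where=(visit='Baseline')) as b
    on h.patient = b.patient and h.variable = b.variable
  order by h.patient, h.variable, h.visit;
quit;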
I'm looking to transform a set of ordered values into a new dataset containing all ordered combinations.
For example, if I have a dataset that looks like this:
Code Rank Value Pctile
1250 1 25 0
1250 2 32 0.25
1250 3 37 0.5
1250 4 51 0.75
1250 5 59 1
I'd like to transform it to something like this, with values for rank 1 and 2 in a single row, values for 2 and 3 in the next, and so forth:
Code Min_value Min_pctile Max_value Max_pctile
1250 25 0 32 0.25
1250 32 0.25 37 0.5
1250 37 0.5 51 0.75
1250 51 0.75 59 1
It's simple enough to do with a handful of values, but when the number of "Code" families is large (as is mine), I'm looking for a more efficient approach. I imagine there's a straightforward way to do this with a data step, but it escapes me.
Looks like you just want to use the lag() function.
data want ;
set have ;
by code rank ;
min_value = lag(value) ;
min_pctile = lag(pctile) ;
rename value=max_value pctile=max_pctile ;
if not first.code ;
run;
Results
max_ max_ min_ min_
Obs Code Rank value pctile value pctile
1 1250 2 32 0.25 25 0.00
2 1250 3 37 0.50 32 0.25
3 1250 4 51 0.75 37 0.50
4 1250 5 59 1.00 51 0.75
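One design note on the code above: the lag() calls execute on every observation, and the subsetting IF comes last. That order matters, because lag() maintains a queue that is updated only when the function actually executes, so guarding it with a condition returns stale values. A sketch of the pitfall (do not use):

/* Pitfall sketch: lag() updates its queue only when it executes,
   so a conditional call skips queue updates and returns values
   from the wrong observation. */
data wrong;
  set have;
  by code rank;
  if not first.code then min_value = lag(value); /* queue never sees the first row of each code */
run;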
Our university is forcing us to perform the old school chi square test using PROC FREQ (I am aware of the options with proc univariate).
I have generated one theoretical exponential distribution with Beta=15 (and written down the values laboriously), and I've generated 10000 random variables which have an exponential distribution, with beta=15.
I first try to enter the frequencies of my random variables (in each interval) via a DATALINES statement:
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
This seems to work.
I then try to compare these values to the theoretical values using the chi-square test in PROC FREQ (the one we are supposed to use), as follows:
proc freq data=expofaktiska;
weight count;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I get the following error:
ERROR: The number of TESTP values does not equal the number of levels. For the table of number,
there are 24 levels and 26 TESTP values.
This may be because two of the intervals contain 0 observations. I don't really see a way around this.
Also, I don't get the chi-square test in the results viewer, nor the test probabilities; I only get the frequency/cumulative frequency of the random variables.
What am I doing wrong? Do both the theoretical and actual distributions need to have the same form (probabilities vs. frequencies)?
We are using SAS 9.4
Thanks in advance!
/Magnus
You need the ZEROS option on the WEIGHT statement.
data expofaktiska;
input number count;
datalines;
1 2910
2 2040
3 1400
4 1020
5 732
6 531
7 377
8 305
9 210
10 144
11 106
12 66
13 40
14 45
15 29
16 16
17 12
18 8
19 8
20 3
21 2
22 0
23 1
24 2
25 0
26 2
;
run;
proc freq data=expofaktiska;
weight count / zeros;
tables number / testp=(0.28347 0.20311 0.14554 0.10428 0.07472 0.05354 0.03837 0.02749 0.01969 0.01412 0.01011 0.00724 0.0052 0.00372 0.00266 0.00191 0.00137 0.00098 0.00070 0.00051 0.00036 0.00026 0.00018 0.00013 0.00010 0.00007) chisq;
run;
I want to use Stata's collapse like summarize. Say I have data (the 1's correspond to the same person, so do the 2's and the 3's) that, when summarized, looks like this:
Obs Mean Std. Dev. Min Max
Score1 54 17 3 11 22
Score2 32 13 2 5 28
Score3 43 22 4 17 33
Value1 54 9 3 2 12
Value2 32 31 7 22 44
Value3 43 38 4 31 45
Speed1 54 3 1 1 11
Speed2 32 6 3 2 12
Speed3 43 8 2 2 15
How would I create a new dataset (using collapse or something else) that looks somewhat like what summarize gives, but looks like the following? Note that the numbers after the variables correspond to observations in my data. So Score1, Value1, and Speed1 all correspond to _n==1.
_n ScoreMean ValueMean SpeedMean ScoreMax ValueMax SpeedMax
1 17 9 3 22 12 11
2 13 31 6 28 44 12
3 22 38 8 33 45 15
(I have omitted Std. Dev. and Min for brevity.)
When I run collapse (mean) Score1 Score2 Score3 Value1 Value2 Value3 Speed1 Speed2 Speed3, I get the following, which is not very helpful:
Score1 Score2 Score3 Value1 Value2 Value3 Speed1 Speed2 Speed3
1 17 13 22 9 31 38 3 6 8
This is on the right track. It only gives me the mean, though. I am not sure how to have it give me more than one statistic at once. I think I need to somehow use reshape at some point.
One way, following your lead:
*clear all
set more off
input ///
score1 score2 value1 value2 speed1 speed2
5 8 346 235 80 89
2 10 642 973 65 78
end
list
summarize
*-----
collapse (mean) score1m=score1 score2m=score2 ///
value1m=value1 value2m=value2 ///
speed1m=speed1 speed2m=speed2 ///
(max) score1max=score1 score2max=score2 ///
value1max=value1 value2max=value2 ///
speed1max=speed1 speed2max=speed2
gen obs = _n
reshape long score@m score@max value@m value@max speed@m speed@max, i(obs) j(n)
drop obs
list
Asking for several statistics is easy. Use the [(stat)] target_var=varname syntax so you don't get conflicting names when asking for several statistics. Then, reshape.
If there are many variables/subjects, this will become very tedious. There are other ways. I will revise the answer later if no one posts an alternative by then.
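The tedium of typing every target_var=varname pair can be avoided by building the lists in locals first. A sketch, using the score/value/speed naming from the example (the local names are my own):

* Sketch: build the (mean) and (max) lists in locals instead of
* typing every target_var=varname pair by hand.
local mlist
local maxlist
foreach v in score value speed {
    forvalues i = 1/2 {
        local mlist   `mlist' `v'`i'm=`v'`i'
        local maxlist `maxlist' `v'`i'max=`v'`i'
    }
}
collapse (mean) `mlist' (max) `maxlist'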
This starts with Roberto's example toy dataset. I think it generalises more easily to 800 objects. (By the way, in Stata _n always and only means observation number in current dataset or group defined by by:, so your usage is mild abuse of syntax.)
clear
input score1 score2 value1 value2 speed1 speed2
5 8 346 235 80 89
2 10 642 973 65 78
end
gen j = _n
reshape long score value speed, i(j) j(i)
rename score yscore
rename value yvalue
rename speed yspeed
reshape long y, i(i j) j(what) string
collapse (mean) mean=y (min) min=y (max) max=y, by(what i)
reshape wide mean min max, j(what) i(i) string