I have a dataset in Stata with variables age and carrier, an indicator for carrier of a particular disease.
Using univar age I am able to get some descriptive statistics of age for the dataset, but now I want to compare the mean, median, and interquartile range between carriers and non-carriers. Is there some way to do this?
I have tried one line so far:
univar age if carrier = 1
which resulted in an invalid syntax error, r(198).
I had expected descriptive statistics of age when carrier is 1.
Sample Data
clear
set obs 100
gen age = runiformint(18,70)
gen carrier = runiformint(0,1)
Summary Stats
There are several ways to get summary statistics in Stata, but one way is to use the tabstat command:
tabstat age, by(carrier) statistics(n mean sd min p25 median p75 max iqr)
Summary for variables: age
Group variable: carrier
 carrier |         N      Mean        SD       Min       p25       p50       p75       Max       IQR
---------+------------------------------------------------------------------------------------------
       0 |        52  43.96154  16.45667        19        30      39.5        59        70        29
       1 |        48   48.4375  14.24692        20        39        49      60.5        69      21.5
---------+------------------------------------------------------------------------------------------
   Total |       100     46.11  15.52183        19        33        44      59.5        70      26.5
----------------------------------------------------------------------------------------------------
See help tabstat for additional statistics options.
Edited to mimic output of univar.
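As a side note on the error in the question: Stata's if qualifier tests equality with ==, not =, which is most likely what triggered the error there. Below is a minimal sketch using only official commands on the simulated data above; the seed value is my own assumption, added only so the run is repeatable.

```stata
clear
set seed 12345          // assumption: any seed works, it only makes the run repeatable
set obs 100
gen age = runiformint(18, 70)
gen carrier = runiformint(0, 1)

* equality is tested with ==, not =
summarize age if carrier == 1, detail

* the same detailed summary for every level of carrier
bysort carrier: summarize age, detail

* a compact side-by-side comparison of mean, median and IQR
tabstat age, by(carrier) statistics(n mean median iqr)
```

After summarize with the detail option, the quartiles are also left behind in r(p25), r(p50) and r(p75) if you need them for further calculations.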
You'd have to search quite hard for univar if you had not heard of it already. It's community-contributed and dates from 1997 and 1999:
STB-51 sg67.1 . . . . . . . . . . . . . . . . . . . . . . . Update to univar
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
9/99 pp.27--28; STB Reprints Vol 9, pp.159--161
improvements and new options to univar
STB-36 sg67 . . . . . . . . . . . . . . . Univariate summaries with boxplots
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
3/97 pp.23--25; STB Reprints Vol 6, pp.179--183
command that offers a streamlined display of univariate summaries,
including, optionally, text-mode boxplots
The help for univar shows that you need its by() option. Here's a reproducible example:
. sysuse auto, clear
(1978 automobile data)
. univar mpg, by(foreign)
-> foreign=Domestic
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     mpg      52    19.83     4.74    12.00    16.50    19.00    22.00    34.00
-------------------------------------------------------------------------------
-> foreign=Foreign
                                        -------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
     mpg      22    24.77     6.61    14.00    21.00    24.50    28.00    41.00
-------------------------------------------------------------------------------
Like @JR96, I recommend tabstat here.
Related
For example, if I have 10 variables, some continuous and some categorical, I would like to see the number of missing values in each variable, along with the proportion of that variable's values that are missing. Something like...
            no of missing values   proportion
Sex                           42          33%
Age                            8          12%
Ethnicity                     17           3%
Etc.
tab x, mi can give me the results I want for categorical variables but not for continuous.
There are a few different ways to get the number of missing values and the proportion of missingness. I prefer mdesc because it gives you the missing count, the total, and the missing percentage in a simple table. The code below installs mdesc and then runs it on your dataset to give you the information you are seeking.
ssc install mdesc
mdesc
missings from the Stata Journal will do what you wish.
. webuse nlswork, clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. missings report
Checking missings in all variables:
15082 observations with missing values
-------------------
| #
----------+--------
age | 24
msp | 16
nev_mar | 16
grade | 2
not_smsa | 8
c_city | 8
south | 8
ind_code | 341
occ_code | 121
union | 9296
wks_ue | 5704
tenure | 433
hours | 67
wks_work | 703
-------------------
. missings report, percent sort
Checking missings in all variables:
15082 observations with missing values
----------------------------
| # %
----------+-----------------
union | 9296 32.58
wks_ue | 5704 19.99
wks_work | 703 2.46
tenure | 433 1.52
ind_code | 341 1.20
occ_code | 121 0.42
hours | 67 0.23
age | 24 0.08
msp | 16 0.06
nev_mar | 16 0.06
south | 8 0.03
c_city | 8 0.03
not_smsa | 8 0.03
grade | 2 0.01
----------------------------
See the help for other subcommands and options.
To identify download availability and documentation,
. search dm0085, entry
Search of official help files, FAQs, Examples, and Stata Journals
SJ-20-4 dm0085_2 . . . . . . . . . . . . . . . . Software update for missings
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/20 SJ 20(4):1028--1030
sorting has been extended for missings report
SJ-17-3 dm0085_1 . . . . . . . . . . . . . . . . Software update for missings
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q3/17 SJ 17(3):779
identify() and sort options have been added
SJ-15-4 dm0085 Speaking Stata: A set of utilities for managing missing values
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/15 SJ 15(4):1174--1185
provides command, missings, as a replacement for, and extension
of, previous commands nmissing and dropmiss
The 2015 paper is the fullest write-up, but other functionality has been added since then.
You could also use inspect to get the total number of observations and the number of missing values for each variable. It does not show the proportion, but you could calculate it manually.
sysuse nlsw88.dta
inspect
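The manual calculation could look something like the following. This is only a sketch using official commands: it loops over every variable, counts the missing values, and prints the count next to the percentage of all observations. The column positions and display formats are arbitrary choices of mine.

```stata
sysuse nlsw88, clear

* count missing values per variable and show them as a percentage of _N
foreach v of varlist _all {
    quietly count if missing(`v')
    display "`v'" _col(16) %8.0f r(N) _col(28) %6.2f (100 * r(N) / _N)
}
```

missing() accepts both numeric and string variables, so the same loop handles continuous and categorical variables alike.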
I am trying to calculate attack rates in some populations using GEE, but at this time I only have the total number of non-cases. My dataset has individual-level data for each case, and a population-wide count of non-cases.
In order to do the GEE, I am trying to create non-case observations so that I have one observation for each non-case.
For example:
In this starting data...
PopID CaseNum N_cases N_noncases
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
I would need to create 3 new observations with PopID 14 and 2 new observations with PopID 15, so it would look like this:
PopID CaseNum No_case N_cases N_noncases
14 . . 2 3
14 1 . . .
14 2 . . .
14 . 1 . 3
14 . 2 . 3
14 . 3 . 3
15 . . 5 2
15 1 . . .
15 2 . . .
15 3 . . .
15 4 . . .
15 5 . . .
15 . 1 . 2
15 . 2 . 2
Once I have the non-case observations, I'm planning to separate into case-level and population-level datasets before doing my GEE in the case-level dataset.
I have tried a DO-UNTIL loop set to end when no_case=n_noncases, but it just continues forever and never stops.
data test1;
set test;
do until (no_case=n_noncases) ;
no_case +1;
by Popid;
output;
end; run;
I am open to any and all other ways of doing this :) (I also attempted a proc sql, but that went downhill quickly because I have only ever used them to go from case level to population level data, and not vice versa)
Weird but doable.
data have;
input PopID CaseNum N_cases N_noncases;
cards;
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
;
run;
data want;
    set have;
    by popID;
    retain nrecs;
    if first.popID then
        nrecs=N_noncases;
    output;
    if last.popid then
        do;
            N_noncases=nrecs;
            call missing(caseNum);
            do No_case=1 to nrecs;
                output;
            end;
        end;
    keep popID Casenum No_case n_cases N_noncases;
run;
Probably clearer if you do it in steps.
Generate the "non-cases":
data controls;
    set have(keep=popid n_noncases);
    where n_noncases > 0 ;
    do controlnum=1 to n_noncases;
        output;
    end;
run;
Then combine with the "cases":
data want;
    set have(where=(not missing(casenum))) controls;
    by popid;
run;
If performance is an issue then make the first one a data step view instead.
data controls / view=controls;
...
I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.
The COALESCE and COALESCEC functions return the first non-missing value from a list of values.
Building the list of variables is simple in your data set because similar variables share a common prefix (with suffixes 1, 2, 3). The syntax for specifying all variables with a common prefix is <prefix>:
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;
Rather than cleaning up a stacked dataset that comes out diagonal because the column names are misaligned, fix the inputs by renaming their columns first. Specifically, strip the trailing underscore and integer from each name with prxchange (a plain scan(name, 1, '_') would also cut study_id_1 down to study), and build a macro variable of oldname=newname pairs with proc sql. Then pass this macro variable to a rename statement.
The code below assumes all datasets reside in the WORK library. Adjust the SQL WHERE clause accordingly.
%macro rename_cols(dset);
    proc sql noprint;
        select cats(name, '=', prxchange('s/_\d+$//', 1, strip(name)))
            into :suffix_clean separated by ' '
        from dictionary.columns
        where libname = 'WORK' and memname = "&dset.";
    quit;
    data &dset;
        set &dset;
        rename &suffix_clean;
    run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;
I'm using R Markdown to create a SAS proc sql tutorial but I have found several instances where code chunks with proc sql are not working as expected.
First, the outobs=N and inobs=N options will not evaluate and result in this error:
Error in engine(options) : Calls: <Anonymous> ... process_group.block -> call_block -> block_exec -> in_dir -> engine
In addition: Warning message: In system2(cmd, code, stdout=TRUE, stderr= TRUE, eng = options$engine.env): running command '"C:/Program Files/SASHome/SASFoundation/9.4/sas.exe" sas3afc64d1332b.sas -nosplash -ls 75' had status 1
Next, matching does not appear to work in the proc sql code chunks. I have tried a subquery and a full join without success in R Markdown, while the same code works in base SAS.
Reproducible code:
require(knitr)
require(SASmarkdown)
saspath <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe"
sasopts <- "-nosplash -ls 75"
/*filter with a subquery*/
data roster;
input Person $;
datalines;
Tiffany
Tiffany
Barbara
Henry
Carol
Jane
Tiffany
Ronald
Thomas
William
William
Henry
Carol
Carol
Ronald
Thomas
;
proc sql;
title "Subquery";
select *
from sashelp.class
where Name in
(select distinct Person
from roster);
quit;
proc sql;
title "Full join";
select *
from sashelp.class a full join roster b
on a.Name = b.Person;
quit;
For the subquery above, the log results in R Markdown say that no rows were selected. The results from the full join are:
## Name Sex Age Height Weight Person
## -----------------------------------------------------
## . . . Barbara
## . . . Carol
## . . . Carol
## . . . Carol
## . . . Henry
## . . . Henry
## . . . Jane
## . . . Ronald
## . . . Ronald
## . . . Thomas
## . . . Thomas
## . . . Tiffany
## . . . Tiffany
## . . . Tiffany
## . . . William
## . . . William
## Alfred M 14 69 112.5
## Alice F 13 56.5 84
## Barbara F 13 65.3 98
## Carol F 14 62.8 102.5
## Henry M 14 63.5 102.5
## James M 12 57.3 83
## Jane F 12 59.8 84.5
## Janet F 15 62.5 112.5
## Jeffrey M 13 62.5 84
## John M 12 59 99.5
## Joyce F 11 51.3 50.5
## Judy F 14 64.3 90
## Louise F 12 56.3 77
## Mary F 15 66.5 112
## Philip M 16 72 150
## Robert M 12 64.8 128
## Ronald M 15 67 133
## Thomas M 11 57.5 85
## William M 15 66.5 112
So it looks like proc sql is just not matching the Name/Person columns? I've tried the strip() function to be sure spaces are not the issue.
Again, if I run these code snippets in SAS, the subquery and join work without issue. Thanks in advance for any insight into why this code is not working as expected!
I have a dataset like
year CNMubiBeijing CNMubiTianjing CNMubiShanghai ··· ··· Wulumuqi
1998 . . . .
1999 . . . .
····
2013 . . . .
As you can see, the first row is a list of city names in China, like Beijing, Shanghai and so on, combined with a prefix "CNMubi" (which is redundant). The first column corresponds to the year, and the observations are of another variable (like local government tax revenue). It's similar to "wide" data, and I want to convert it to long panel data like
city year tax_rev
Beijing 1998
···
Beijing 2013
Shanghai 1998
···
Shanghai 2013
Two immediate solutions come to mind. One is to use the reshape command directly, like reshape long CNMubi, i(year) j(city_eng), but it turns out to give me a column of missing values (the city_eng column).
The second possible solution is to use a loop, like
foreach var of varlist _all {
replace city_eng="`var'"
}
It also doesn't work (in fact, the newly generated city_eng equals the name of the last variable in the varlist). I need to "expand" the data from a mn to a mnm matrix. So how can I achieve my goal? Thank you.
This works:
clear
set more off
*----- example data -----
input ///
year CNMubiBeijing CNMubiTianjing
1998 . .
1999 . .
2000
2001
2002
2003
end
set seed 259376
replace CNMubiBeijing = runiform()
replace CNMubiTianjing = runiform()
*----- what you want -----
reshape long CNMubi, i(year) j(city) string
sort city year
list, sepby(city)
Notice the string option, since j() contains string values.
The result is:
. sort city year
. list, sepby(city)
     +----------------------------+
     | year       city     CNMubi |
     |----------------------------|
  1. | 1998    Beijing    .658855 |
  2. | 1999    Beijing    .494634 |
     |----------------------------|
  3. | 1998   Tianjing   .0204465 |
  4. | 1999   Tianjing   .0454614 |
     +----------------------------+
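To get all the way to the layout asked for in the question (city, year, tax_rev), the reshaped variable can be renamed and reordered afterwards. A minimal sketch follows; the numbers are made-up placeholders, and tax_rev is simply the name used in the question.

```stata
clear
input year CNMubiBeijing CNMubiTianjing
1998 10 20
1999 11 21
end

* j() picks up whatever follows the stub CNMubi in each variable name
reshape long CNMubi, i(year) j(city) string

* tax_rev is the name the question uses for the measured variable
rename CNMubi tax_rev
order city year tax_rev
sort city year
list, sepby(city)
```

Because the prefix CNMubi is used as the reshape stub, it disappears from the city names without any extra string handling.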