How do I find the proportion of missing values in each variable (continuous and categorical) in Stata?

For example, if I have 10 variables, some continuous and some categorical, I would like to see the number of missing values in each variable, along with the proportion of that variable's values that are missing. Something like:
Variable     No. of missing values   Proportion
Sex          42                      33%
Age          8                       12%
Ethnicity    17                      3%
Etc.
tab x, mi gives me the results I want for categorical variables but not for continuous ones.

There are a few different ways to get the number of missing values and the proportion of missingness. I prefer mdesc because it shows the number missing, the total, and the percent missing in a simple table. The code below installs mdesc from SSC and then runs it on your dataset to give you the information you are seeking.
ssc install mdesc
mdesc

missings from the Stata Journal will do what you wish.
. webuse nlswork, clear
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. missings report
Checking missings in all variables:
15082 observations with missing values
-------------------
          |      #
----------+--------
      age |     24
      msp |     16
  nev_mar |     16
    grade |      2
 not_smsa |      8
   c_city |      8
    south |      8
 ind_code |    341
 occ_code |    121
    union |   9296
   wks_ue |   5704
   tenure |    433
    hours |     67
 wks_work |    703
-------------------
. missings report, percent sort
Checking missings in all variables:
15082 observations with missing values
----------------------------
          |      #        %
----------+-----------------
    union |   9296    32.58
   wks_ue |   5704    19.99
 wks_work |    703     2.46
   tenure |    433     1.52
 ind_code |    341     1.20
 occ_code |    121     0.42
    hours |     67     0.23
      age |     24     0.08
      msp |     16     0.06
  nev_mar |     16     0.06
    south |      8     0.03
   c_city |      8     0.03
 not_smsa |      8     0.03
    grade |      2     0.01
----------------------------
See the help for other subcommands and options.
To identify download availability and documentation,
. search dm0085, entry
Search of official help files, FAQs, Examples, and Stata Journals
SJ-20-4 dm0085_2 . . . . . . . . . . . . . . . . Software update for missings
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/20 SJ 20(4):1028--1030
sorting has been extended for missings report
SJ-17-3 dm0085_1 . . . . . . . . . . . . . . . . Software update for missings
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q3/17 SJ 17(3):779
identify() and sort options have been added
SJ-15-4 dm0085 Speaking Stata: A set of utilities for managing missing values
(help missings if installed) . . . . . . . . . . . . . . . N. J. Cox
Q4/15 SJ 15(4):1174--1185
provides command, missings, as a replacement for, and extension
of, previous commands nmissing and dropmiss
The 2015 paper is the fullest write-up, but other functionality has been added since then.

You could also use inspect to get the total number of observations and the number of missing values for each variable. It does not show the proportion, but you could calculate it manually.
sysuse nlsw88.dta
inspect
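If you would rather not install anything, the manual calculation can be sketched with a loop over all variables. This is only a minimal sketch using official commands: missing() handles both numeric and string variables, and the column widths in the display formats are arbitrary choices of mine.

```stata
sysuse nlsw88, clear

* for every variable, count the missing values and express them as a percent of _N
foreach v of varlist _all {
    quietly count if missing(`v')
    display as text %-20s "`v'" as result %9.0f r(N) %9.2f 100 * r(N) / _N
}
```

Each line shows the variable name, the number of missing values, and the percent missing.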

Related

Determine mean/median/IQ range of age for two separate groups

I have a dataset in Stata with variables age and carrier, an indicator for carrier of a particular disease.
Using univar age I am able to get some descriptive statistics of age for the dataset, but now I want to compare the mean, median, and IQR between carriers and non-carriers. Is there some way to do this?
I have tried one line so far:
univar age if carrier = 1
which resulted in invalid syntax error, r(198)
I had expected descriptive statistics of age when carrier is 1.
Sample Data
clear
set obs 100
gen age = runiformint(18,70)
gen carrier = runiformint(0,1)
Summary Stats
There are several ways to get summary statistics in Stata, but one way is to use the tabstat command:
tabstat age, by(carrier) statistics(n mean sd min p25 median p75 max iqr)
Summary for variables: age
Group variable: carrier
 carrier |    N      Mean        SD  Min  p25   p50   p75  Max   IQR
---------+----------------------------------------------------------
       0 |   52  43.96154  16.45667   19   30  39.5    59   70    29
       1 |   48   48.4375  14.24692   20   39    49  60.5   69  21.5
---------+----------------------------------------------------------
   Total |  100     46.11  15.52183   19   33    44  59.5   70  26.5
--------------------------------------------------------------------
See help tabstat for additional statistics options.
Edited to mimic output of univar.
You'd have to search quite hard for univar if you had not heard of it already. It's community-contributed and dates from 1997 and 1999:
STB-51 sg67.1 . . . . . . . . . . . . . . . . . . . . . . . Update to univar
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
9/99 pp.27--28; STB Reprints Vol 9, pp.159--161
improvements and new options to univar
STB-36 sg67 . . . . . . . . . . . . . . . Univariate summaries with boxplots
(help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
3/97 pp.23--25; STB Reprints Vol 6, pp.179--183
command that offers a streamlined display of univariate summaries,
including, optionally, text-mode boxplots
Looking at its help indicates that you need its by() option. Here's a reproducible
example:
. sysuse auto, clear
(1978 automobile data)
. univar mpg, by(foreign)
-> foreign=Domestic
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
mpg 52 19.83 4.74 12.00 16.50 19.00 22.00 34.00
-------------------------------------------------------------------------------
-> foreign=Foreign
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
mpg 22 24.77 6.61 14.00 21.00 24.50 28.00 41.00
-------------------------------------------------------------------------------
Like @JR96, I recommend tabstat here.
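As a side note on the original attempt: in an if qualifier Stata tests equality with ==, and a single = is the likely cause of the r(198) error. A minimal sketch with an official command, using the auto data in place of the asker's dataset (mpg and foreign stand in for age and carrier):

```stata
sysuse auto, clear

* == tests equality; a single = is not valid in an if qualifier
summarize mpg if foreign == 1, detail

* summarize, detail leaves the quartiles behind in r()
display "median = " r(p50) "  IQR = " r(p75) - r(p25)
```

This gives results for one group at a time; by() in tabstat or univar remains the more direct way to compare groups side by side.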

How to align data that are on a diagonal in SAS

I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.
The first non-missing from a list of values is returned by the COALESCE and COALESCEC functions.
Building the list of variables is simple in your data set because related variables share a common prefix (and have suffixes 1, 2, 3). The syntax for specifying all variables with a given prefix is <prefix>:
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;
Rather than cleaning up the stacked output, which is diagonal because the column names do not match, rename the columns in the inputs first. Specifically, strip the trailing underscore and integer from each name, using scan to pick out the last underscore-delimited piece, and build a macro variable of oldname=newname pairs with proc sql. Then pass that macro variable to a rename statement.
The code below assumes all datasets reside in the WORK library. Adjust the SQL WHERE clause accordingly.
%macro rename_cols(dset);
proc sql noprint;
/* drop only the last _<integer> piece, so study_id_1 becomes study_id */
select cats(name, '=', substr(name, 1, length(name) - length(scan(name, -1, '_')) - 1))
into :suffix_clean separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = "&dset.";
quit;
data &dset;
set &dset;
rename &suffix_clean;
run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;

Create 3-way percentages table

I would like to have a 3-way table displaying column or row percentages using three categorical variables. The command below gives the counts but I cannot find how to get percentages instead.
sysuse nlsw88
table married race collgrad, col
--------------------------------------------------------------------
          |                college graduate and race
          |  ---- not college grad ----   ------ college grad ------
  married |  white  black  other  Total   white  black  other  Total
----------+---------------------------------------------------------
   single |    355    256      5    616     132     53      3    188
  married |    862    224     12  1,098     288     50      6    344
--------------------------------------------------------------------
How can I get percentages?
This answer will show a miscellany of tricks. The downside is that I don't know an easy way to get exactly what you ask. The upside is that all these tricks are easy to understand and often useful.
Let's use your example, which is excellent for the purpose.
. sysuse nlsw88, clear
(NLSW, 1988 extract)
Tip #1 You can calculate a percent variable for yourself. I focus on % single. In this data set married is binary, so I won't show the complementary percent.
Once you have calculated it, you can (a) rely on the fact that it is constant within the groups you used to define it (b) tabulate it directly. I find that tabdisp is underrated by users. It's billed as a programmer's command, but it is not difficult to use at all. tabdisp lets you set a display format on the fly; it does no harm and might be useful for other commands to assign one directly using format.
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. tabdisp collgrad race, c(pcsingle) format(%2.1f)
--------------------------------------
| race
college graduate | white black other
-----------------+--------------------
not college grad | 29.2 53.3 29.4
college grad | 31.4 51.5 33.3
--------------------------------------
. format pcsingle %2.1f
Tip #2 A user-written command groups offers different flexibility. groups can be installed from SSC (strictly, must be installed before you can use it). It's a wrapper for various kinds of tables, but using list as a display engine.
. * do this installation just once
. ssc inst groups
. groups collgrad race pcsingle
+-------------------------------------------------------+
| collgrad race pcsingle Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
|-------------------------------------------------------|
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
We can improve on that. We can set up better header text using characteristics. (In practice, these can be less constrained than variable names but often need to be shorter than variable labels.) We can use separators by calling up standard list options.
. char pcsingle[varname] "% single"
. char collgrad[varname] "college?"
. groups collgrad race pcsingle , subvarname sepby(collgrad)
+-------------------------------------------------------+
| college? race % single Freq. Percent |
|-------------------------------------------------------|
| not college grad white 29.2 1217 54.19 |
| not college grad black 53.3 480 21.37 |
| not college grad other 29.4 17 0.76 |
|-------------------------------------------------------|
| college grad white 31.4 420 18.70 |
| college grad black 51.5 103 4.59 |
| college grad other 33.3 9 0.40 |
+-------------------------------------------------------+
Tip #3 Wire display formats into a variable by making a string equivalent. I don't illustrate this fully, but I often use it when I want to combine a display of counts with numerical results with decimal places in tabdisp. format(%2.1f) and format(%3.2f) might do fine for most variables (and incidentally the important detail is the number of decimal places) but they would lead to a display of a count of 42 as 42.0 or 42.00, which would look pretty silly. The format() option of tabdisp does not reach into the string and change the contents; it doesn't even know what the string variable contains or where it came from. So, strings just get shown by tabdisp as they come, which is what you want.
. gen s_pcsingle = string(pcsingle, "%2.1f")
. char s_pcsingle[varname] "% single"
groups has an option to save what is tabulated as a fresh dataset.
Tip #4 To have a total category, temporarily double up the data. The clone of the original is relabelled as a Total category. You may need to do some extra calculations, but nothing there amounts to rocket science: a smart high school student could figure it out. Here a concrete example for line-by-line study beats lengthy explanations.
. preserve
. local Np1 = _N + 1
. expand 2
(2,246 observations created)
. replace race = 4 in `Np1'/L
(2,246 real changes made)
. label def racelbl 4 "Total", modify
. drop pcsingle
. egen pcsingle = mean(100 * (1 - married)), by(collgrad race)
. char pcsingle[varname] "% single"
. format pcsingle %2.1f
. gen istotal = race == 4
. bysort collgrad istotal: gen total = _N
. * for percents of the global total, we need to correct for doubling up
. scalar alltotal = _N/2
. * the table shows percents for college & race | collgrad and for collgrad | total
. bysort collgrad race : gen pc = 100 * cond(istotal, total/alltotal, _N/total)
. format pc %2.1f
. char pc[varname] "Percent"
. groups collgrad race pcsingle pc , show(f) subvarname sepby(collgrad istotal)
+-------------------------------------------------------+
| college? race % single Percent Freq. |
|-------------------------------------------------------|
| not college grad white 29.2 71.0 1217 |
| not college grad black 53.3 28.0 480 |
| not college grad other 29.4 1.0 17 |
|-------------------------------------------------------|
| not college grad Total 35.9 76.3 1714 |
|-------------------------------------------------------|
| college grad white 31.4 78.9 420 |
| college grad black 51.5 19.4 103 |
| college grad other 33.3 1.7 9 |
|-------------------------------------------------------|
| college grad Total 35.3 23.7 532 |
+-------------------------------------------------------+
Note the extra trick of using a variable not shown explicitly to add separator lines.
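If separate two-way tables are acceptable instead of a single three-way display, a simpler route is to run tabulate by groups. A minimal sketch with official commands only; the choice of collgrad as the grouping variable is mine:

```stata
sysuse nlsw88, clear

* column percentages of married by race, one table per level of collgrad
bysort collgrad: tabulate married race, column nofreq
```

Swap column for row to get row percentages, or drop nofreq to show the counts alongside.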

Stata: egen rowpctile a range of values instead of single percentile value

I have a variable var with many missing values for which I want to calculate the 95th percentile then use this value to drop observations that lie above the 95th percentile (for those observations that are not missing the variable).
Because of the many missing values, I use egen with rowpctile which is supposed to calculate the p(#) percentile, ignoring missing values. When I look at the p95 values, however, they're a range of different values rather than a single 95th percentile value as seen below:
. egen p95 = rowpctile(var), p(95)
. list p95
+-----------+
| p95 |
|-----------|
1. | . |
2. | 65.71429 |
3. | 14.28571 |
4. | . |
5. | . |
...
Am I using the function incorrectly or is there a better way to go about this?
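The rowpctile function of the egen command calculates the percentile of the values of a list of variables separately for each observation. With a single variable, that is just each observation's own value (or missing), which is why you see a range of values. Here is a technique that should set you on the right path.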
. sysuse auto, clear
(1978 Automobile Data)
. replace price = . in 1/5
(5 real changes made, 5 to missing)
. summarize price, detail
Price
-------------------------------------------------------------
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 69
25% 4296 3748 Sum of Wgt. 69
50% 5104 Mean 6245.493
Largest Std. Dev. 3015.072
75% 6342 13466
90% 11497 13594 Variance 9090661
95% 13466 14500 Skewness 1.594391
99% 15906 15906 Kurtosis 4.555704
. display r(p95)
13466
. generate toobig = price>r(p95)
. list make price if toobig | price==.
+---------------------------+
| make price |
|---------------------------|
1. | AMC Concord . |
2. | AMC Pacer . |
3. | AMC Spirit . |
4. | Buick Century . |
5. | Buick Electra . |
|---------------------------|
12. | Cad. Eldorado 14,500 |
13. | Cad. Seville 15,906 |
27. | Linc. Mark V 13,594 |
+---------------------------+
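Note that Stata treats numeric missing values as larger than any number, so price > r(p95) is also true where price is missing. To drop only the non-missing observations above the 95th percentile, as the question asks, a sketch along these lines should work (same auto data as above):

```stata
sysuse auto, clear
replace price = . in 1/5

* summarize ignores missing values when computing percentiles
summarize price, detail

* guard against missings explicitly, since missing > r(p95) evaluates as true
drop if price > r(p95) & !missing(price)
```

If you want the percentile stored as a variable, egen's pctile() function with the p(95) option gives a single constant value, unlike rowpctile().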

Quintiles with different quantity of observations

I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct the quintiles of this variable and use the following command--as you can see I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth
5 |
quantiles |
of dn3001 |
-----------+-----------+
1 |1519.4221
|43114.959
| 154
-----------+-----------+
2 | 135506.67
| 74360.816
| 179
-----------+-----------+
3 | 396712.16
| 69715.49
| 161
-----------+-----------+
4 | 669065.69
| 111102.02
| 182
-----------+-----------+
5 | 2552620.5
| 3872350.9
| 274
-----------+-----------+
Total | 957419.29
| 2323329.8
| 950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that present a replicate value of NetWealth. I have had this impression also while consulting the Stata material.
What do you think?
Your problem is not fully reproducible in so far as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not a multiple (meaning, exact multiple) of the number of quantile-based bins they want, but in your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also some discussion within Section 4 of this paper
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly here on use of pweights except that using pweights might be a complicating factor here.
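To check whether ties explain unequal bin sizes, one rough sketch is to count how many observations sit exactly on each cutpoint. This uses the auto data without weights; I am assuming here that the cutpoints returned by _pctile correspond to those xtile uses by default.

```stata
sysuse auto, clear
xtile bin5 = mpg, nq(5)

* get the four cutpoints and copy them to locals,
* because count below overwrites the r() results
_pctile mpg, nq(5)
forvalues i = 1/4 {
    local cut`i' = r(r`i')
}

* observations tied at a cutpoint must all go to the same bin
forvalues i = 1/4 {
    quietly count if mpg == `cut`i''
    display "cutpoint `i' = `cut`i'': " r(N) " observations tied"
}

* cross-check: each distinct value of mpg appears in only one bin
tabulate mpg bin5
```

Large tie counts at the cutpoints would be consistent with the uneven frequencies seen in the question.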