How to create "non-standard" descriptive statistics more efficiently in Stata - stata

Say I want to create some scalar value like median price/median income mean downpayment/house price. I know I can first use su command and then extract denominators and numerators separately from the r-class and then create the desired scalars.
However, when I have a dozen such scalars and by different household type, such approach is tedious in practice. So I wonder if there's any way to accomplish above work more efficiently? If I can create a table containing such scalars within Stata, it's even more amusing.

Executive summary: So, don't use scalars; use variables instead.
There is a prior statistical issue, which is that (say) summary(y) / summary(x) is not necessarily equal to summary(y/x); in general, the two will differ. It seems to me that the latter usually makes more sense, but set that aside otherwise.
Here is one not too crazy example. How much do you have to pay (in US dollars circa 1978) per pound weight (physicists: mass, really) for various cars in the Stata auto dataset?
. sysuse auto
(1978 Automobile Data)
. gen pricePERlb = price/weight
. egen mean = mean(pricePERlb), by(rep78)
. tabstat mean, s(n mean) by(rep78)
Summary for variables: mean
by categories of: rep78 (Repair Record 1978)
rep78 | N mean
---------+--------------------
1 | 2 1.479266
2 | 8 1.731407
3 | 30 1.895855
4 | 18 2.25233
5 | 11 2.472519
---------+--------------------
Total | 69 2.049639
------------------------------
Now here's a small twist. The generate wasn't needed here. We could have gone
egen mean = mean(price/weight), by(rep78).
The tools are all trivial: generate to create new variables, egen to create new variables that here can be summary statistics calculated for groups, and tabstat, among many other tabulation commands, to show results. Since the statistics here are by construction constant within groups, asking for their mean is just one of several ways of getting at them. Similarly, graph dot, graph hbar, etc. are immediate for display.

Related

How do I create a non mutually exclusive categorical variable?

In Stata I am analyzing a study looking at pre-existing conditions that participants may have had that affect whether they experience side effects after vaccination.
For each participant, there are three binary variables that denote whether the participant had that condition (0: does not have, 1: does have), namely hypertension: 0/1, asthma: 0/1, diabetes: 0/1.
However, these categories are not mutually exclusive as the participant can have any combination of conditions: (no pre-existing conditions, only hypertension, only asthma, only diabetes, hypertension and asthma, hypertension and diabetes, asthma and diabetes, hypertension and asthma and diabetes).
I would like to perform a regression analysis to determine the risk of developing side effects given exposure to pre-existing conditions and to create a variable denoting the different combinations.
I would like to get the risk ratios for the following table:
Type of pre-existing condition
With side effects
no side effects
risk ratio
None
455
316
ref
Hypertension
51
28
Asthma
42
26
Diabetes
17
7
Does anyone havecode that would help in creating a new categorical variable to help with this regression analysis?
I've tried using the following code, but because the categories are not mutually exclusive, the values assigned overwrite each other. new_var denotes the new variable created denoting the pre-existing conditions.
generate new_var = 0
replace new_var = 1 if hypertension == 1
replace new_var = 2 if asthma == 1
replace new_var = 3 if diabetes == 1
This is as much statistical as Stata-oriented, but there is a Stata dimension, so here goes.
#Stuart has indicated some ways of getting composite variables in Stata, but as no doubt he would emphasise too, watch out that the numeric coding is arbitrary and not to be taken literally.
Other methods of creating composite variables were discussed in this paper and that advice remains valid.
That said, I suspect most researchers would not use a composite variable here at all, but would use as predictors the three indicators you already have and their interactions. That is the only serious and supported method to get estimates of effect size together with appropriate tests.
There are 8 possible combinations of preexisting conditions, and one approach is to add the variables like this, then manually label them:
generate new_var = hypertension * 4 + asthma * 2 + diabetes
label define preexisting 0 none 1 diabetes 2 asthma 3 "asthma and diabetes" 4 hypertension 5 "hypertension and asthma" 6 "hypertension and diabetes" 7 "hypertension, asthma and diabetes"
label values new_var preexisting
If you have additional preexisting condition variables, multiply them by 8, 16, 32 and so on to get unique values for every combination.
Another approach is to use interactions in the regression.
regress outcome hypertension##asthma##diabetes

Variables creation before or after Data cleaning?

I am trying to perform k-means cluster analysis on a dataset with 20 variables and 9000 observations. I want to create 2 new variables (Usage and Payment Ratio) using 4 (Balance, Due date, Payment, Minimum Payment) of the 20 variables in the dataset. Ex: Usage = (Balance/Due date) and Payment_ratio = (payment/minimum payment). Now, should I make these variables at the start itself, or should I first cap the outliers, remove/impute missing values and then create these 2 new variables?
I tried to first clean the data and then created these two variables. Then I also capped the outliers and treated missing values again for these 2 variables. After doing this, these 2 new variables are skewed and I tried taking log, sqrt etc but the variables are still not normal i.e there is still skewness in these variables. After this step, I need to do the Factor Analysis but I cannot proceed.
Can anyone please suggest a way to go about this? Thanks.

Calculating p-value by hand from Stata table

I want to ask a question on how to compute the p-value without a t-stat table, just by looking at the table, like on the first page of the pdf in the following link http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . Like if I don't know the value 0.062, how can I know it is 0.062 by looking at other information from the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see the test statistic between .06 and .07 and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write optimal code to estimate monthly, weighted mean for portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio weighted by market cap. (mcap) of each firm.
I have written following code which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
Data looks as shown in the image:![Data for value weighted portfolio return][1]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year month : replace wanted = sum(mcap)
by port1 year month : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.

Stata. How to use if statement with sum()?

I am trying to execute the following code:
forval i = 1/51 {
// number of households
by hhid, sort: gen nvals = _n==1
count if (nvals & stateID == `i')
local stateTotalHH = r(N)
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
drop nvals
}
Everything works fine except if is not allowed with sum(). How can I estimate the total or the sum of all values in numper variable for each state and at household level?
ps:
I cannot use collapse numper, by(stateID) because I have other estimations
also, I cannot do the following: duplicates drop hhid, force
Your problem does not even call for sum() with if, so it is best to start at the beginning.
Reconstructing your problem, which is not well explained,
You have observations for individuals within households (identifier hhid) within 50 states of the USA and the District of Columbia (identifier stateID).
You have a variable numper, the number of persons per household, and you want the average per state.
Observations are repeated for each individual in a household, so it is necessary to use just one observation per household.
You can tag each household once by
egen tag = tag(hhid)
The average as a new variable would be
egen avPersonHH = mean(numper/tag), by(stateID)
Stata is going to average numper/tag which variously will be numper/1 and numper/0; the missings from the latter division will just be ignored, which is what is wanted.
That variable is repeated for each household. To see just one value for each stateID,
tabdisp stateID, cell(avPersonHH)
What is wrong with your code? Here is a partial list:
a. No loop is required.
b. If it were, the statement by hhid, sort: gen nvals = _n==1 should not be repeated.
c. sum() is a function for cumulative sums across observations, not what you want here.
d. The line
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
would at best calculate one number, but the if condition is misplaced. if whatever local ... often makes sense in Stata, but putting if on the right of a local definition only makes sense for manipulating text containing commands.
Your comment on this line misses these basic misconceptions, c. and d.
e. You were aiming to have collected 51 values of averages in as many local macros, but still need to put them somewhere useful.
f. Separate calculation of totals and numbers is not required, as you can get Stata to calculate the mean for you.
(LATER) This code plays along step by step with your aversion to using collapse and duplicates, the grounds for which are not stated. But most experienced Stata users would be happy to use brute force:
duplicates drop hhid, force
collapse numper, by(stateID)
and then merge back. That solution is not only direct, but also uses fewer idiosyncratic Stata details, which can take time to figure out.