Stata: How to use an if statement with sum()?

I am trying to execute the following code:
forval i = 1/51 {
// number of households
by hhid, sort: gen nvals = _n==1
count if (nvals & stateID == `i')
local stateTotalHH = r(N)
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
drop nvals
}
Everything works fine except if is not allowed with sum(). How can I estimate the total or the sum of all values in numper variable for each state and at household level?
ps:
I cannot use collapse numper, by(stateID) because I have other estimations
also, I cannot do the following: duplicates drop hhid, force

Your problem does not even call for sum() with if, so it is best to start at the beginning.
Reconstructing your problem, which is not well explained,
You have observations for individuals within households (identifier hhid) within 50 states of the USA and the District of Columbia (identifier stateID).
You have a variable numper, the number of persons per household, and you want the average per state.
Observations are repeated for each individual in a household, so it is necessary to use just one observation per household.
You can tag each household once by
egen tag = tag(hhid)
The average as a new variable would be
egen avPersonHH = mean(numper/tag), by(stateID)
Stata is going to average numper/tag which variously will be numper/1 and numper/0; the missings from the latter division will just be ignored, which is what is wanted.
That variable takes the same value in every observation for a given state. To see just one value for each stateID,
tabdisp stateID, cell(avPersonHH)
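As a check on the logic, here is a minimal sketch with made-up data (the values are invented for illustration, not taken from the question). State 1 has households of size 2 and 4, and state 2 has households of size 1 and 3, so the household-level averages should be 3 and 2, not the larger values you would get by averaging over persons.

```stata
clear
* toy data: one row per person, numper repeated within each household
input stateID hhid numper
1 101 2
1 101 2
1 102 4
1 102 4
1 102 4
1 102 4
2 201 1
2 202 3
2 202 3
2 202 3
end

* tag is 1 for exactly one observation per household, 0 otherwise
egen tag = tag(hhid)

* numper/0 is missing, so only the tagged observations enter the mean
egen avPersonHH = mean(numper/tag), by(stateID)

tabdisp stateID, cell(avPersonHH)
```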
What is wrong with your code? Here is a partial list:
a. No loop is required.
b. If it were, the statement by hhid, sort: gen nvals = _n==1 should not be repeated.
c. sum() is a function for cumulative sums across observations, not what you want here.
d. The line
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
would at best calculate one number, but the if condition is misplaced. A statement of the form if condition local ... often makes sense in Stata (the if command then executes the local definition only when the condition is true), but putting if on the right of a local definition only makes sense for manipulating text containing commands.
Your comment on this line misses these basic misconceptions, c. and d.
e. You were aiming to have collected 51 values of averages in as many local macros, but still need to put them somewhere useful.
f. Separate calculation of totals and numbers is not required, as you can get Stata to calculate the mean for you.
(LATER) This code plays along step by step with your aversion to using collapse and duplicates, the grounds for which are not stated. But most experienced Stata users would be happy to use brute force:
duplicates drop hhid, force
collapse numper, by(stateID)
and then merge back. That solution is not only direct, but also uses fewer idiosyncratic Stata details, which can take time to figure out.
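A sketch of that brute-force route, assuming you want the state means back alongside the individual-level data. The toy data and the names means and avPersonHH are my own choices for illustration; preserve and restore keep the original observations intact while the collapsed copy is saved to a temporary file. Run it as one block, because the temporary file name lives in a local macro.

```stata
clear
* toy data: one row per person (invented values)
input stateID hhid numper
1 101 2
1 101 2
1 102 4
1 102 4
2 201 1
2 202 3
2 202 3
end

preserve
* keep one observation per household, then average within state
duplicates drop hhid, force
collapse (mean) avPersonHH = numper, by(stateID)
tempfile means
save `means'
restore

* attach the state means to every original observation
merge m:1 stateID using `means', nogenerate
list, sepby(stateID)
```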

Related

Remove an entity in recursive regressions with a for loop

I am running a panel data regression. I want to run a for loop to obtain regression results where every time one of the entities has been dropped, so that I can see, by comparing the different coefficients, whether my results are driven by a particular entity or they are consistent across the sample. I am currently using this for loop
forvalues i = 1/19{
use "sample_seven.dta", clear
drop if countryid == i
xtscc ln_gdp tech population inflation tradebalance i.year, fe
}
However, when I run the above code what I get is 19 regressions where only two observations have been dropped in each of them.
use "sample_seven.dta", clear
forvalues i = 1/19 {
di _n "Omitting `i'"
xtscc ln_gdp tech population inflation tradebalance i.year if countryid != `i', fe
}
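The error in your code was omitting the single quotation marks needed to refer to the local macro i: write `i', not i. Your code still ran, so the bare i must have been interpreted as an abbreviation of some variable name, possibly inflation. That would explain why so few observations were dropped: only those in which countryid happened to equal that variable.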
Beyond that, there is no need to read in the dataset repeatedly. But it would help to spell out which country is being omitted.

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write optimal code to estimate monthly, weighted mean for portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio weighted by market cap. (mcap) of each firm.
I have written following code which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
Data looks as shown in the image: [image: Data for value weighted portfolio return]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year1 month1 : replace wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
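To make the distinction concrete, here is a small sketch using the auto dataset that ships with Stata. The names total_price, cum_price and group_total are mine, chosen only for illustration.

```stata
sysuse auto, clear

* one overall sum, held in a scalar rather than in a whole variable
summarize price, meanonly
scalar total_price = r(sum)
display scalar(total_price)

* group sums held in a variable: sum() is cumulative within each group,
* so the last value in each group, cum_price[_N], is the group total
bysort foreign (make): generate double cum_price = sum(price)
by foreign: generate double group_total = cum_price[_N]

tabdisp foreign, cell(group_total)
```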
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
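Here is a minimal sketch of the by: approach on made-up data (the four rows are invented for illustration). The code given earlier leaves each firm's weighted contribution in wanted; if what you want is one value-weighted return per portfolio and month, a second cumulative sum within the same groups gives it. The names wsum, contrib, cum and port_ret are my own. With these numbers the portfolio returns should come out at about 0.04 for portfolio 11 and about 0.05 for portfolio 12.

```stata
clear
* toy data: two firms in each of two portfolios, one month
input port1 year1 month1 mcap ret
11 1980 1 100  0.10
11 1980 1 300  0.02
12 1980 1  50 -0.04
12 1980 1 150  0.08
end

* cumulative market cap within each portfolio-month; last value is the total
bysort port1 year1 month1: generate double wsum = sum(mcap)

* each firm's weighted contribution to the portfolio return
by port1 year1 month1: generate double contrib = (mcap * ret) / wsum[_N]

* add up the contributions to get the value-weighted return
by port1 year1 month1: generate double cum = sum(contrib)
by port1 year1 month1: generate double port_ret = cum[_N]

list port1 year1 month1 mcap ret port_ret, sepby(port1)
```

Note that sum() treats missing values as zero, so if ret can be missing while mcap is not, the weights would need to be restricted to observations with nonmissing returns.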

Dummy and Heckman

I'm using the Heckman selection model, which consists of two equations. I'm using a probit as the selection equation and a multiple regression as the outcome equation.
How can I put dummy variables in those equations?
Do we have to transform the variables into logarithmic form?
How can I create logarithmic variables in Stata?
Thank you.
Here's an example of how you might do what you ask. The example looks at the effect of being a union member on log wages:
webuse union3
gen log_wage = ln(wage)
etregress log_wage age grade i.smsa i.black tenure, treat(union = i.south i.black tenure) twostep
etregress estimates an average treatment effect of an endogenous binary-treatment variable. In plain English, that means the "first-stage" is a probit. Estimation is by either full maximum likelihood or a two-step consistent estimator, as above.
The dummies are created on the fly by putting an i. in front of the covariates. This is called factor variable notation, and it also makes interactions a breeze. You can also do tab race, gen(d_) to create d_1, d_2, and d_3 (3 race dummies, one of which you can drop).
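If you want the selection model itself rather than a treatment-effects model, the heckman command accepts factor-variable notation in the same way. The following is only a sketch: it assumes the womenwk example dataset from the Stata manuals, with variables wage, education, age, married and children (check with describe after loading), and it needs an internet connection for webuse. Nothing in the estimator requires logs; taking the log of the wage is a modelling choice.

```stata
webuse womenwk, clear
describe

* wage is missing for women who do not work, so ln_wage is missing too;
* heckman treats a missing outcome as "not selected"
generate ln_wage = ln(wage)

* i.married enters the selection (probit) equation as a dummy
heckman ln_wage education age, select(i.married children education age) twostep
```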

Code observation as belonging to quantile in Stata

In Stata, I wanted to be able to put observations in buckets based on a specific variable, or equivalently code observations as belonging to a certain quantile. I looked around for some existing code that would accomplish this task but didn't quite find what I wanted. I wrote the following simple ado:
program toquantiles
version 13
syntax varname [, n(integer 4)]
quietly{
local interval = 100/`n'
local binVarName = "`varlist'_quantile"
gen `binVarName' = `n'
local upper = `n'-1
forvalues i=1/`upper'{
local y = `i'*`interval'
//Abuse the egen cmd to calculate the yth percentile.
tempvar x
egen `x' = pctile(`varlist'), p(`y')
//Label this observation as belonging to the ith bin if the value of the
//var in question is greater than x.
replace `binVarName' = `n'-`i' if `varlist' > `x'
drop `x'
}
}
end
The output is that each observation has a new variable, varname_quantile that is coded as 1,2,3, etc. based on the quantile in which it fits. My code seems like a pretty naive approach to this problem.
Is there any built-in functionality that does what I do above? If not, are there any improvements to this ado that would speed up execution? Currently, it runs quite slowly. (Slowly as in, it is faster to summ all 100+ variables than to calculate the quintiles for 1 variable.) Thanks so much.
There is a terminology problem here, most simply illustrated by quartiles. On the one hand, there are three particular summary statistics: the lower quartile, the median and the upper quartile. On the other hand, there are the first, second, third and fourth quarters (some say quartiles here too), which are intervals defined by falling below or above particular quartiles. (What happens when values equal particular quartiles is a matter of convention.)
In other words, quartiles and more generally quantiles can be particular levels (which I take to be the standard statistical use of the term) or intervals (a common (mis?)use of the term in some applied fields, e.g. applied economics).
It seems that you want the second sense.
Turning to Stata, doesn't xtile do this?
See also http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/index.html
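A quick sketch of xtile with the auto dataset that ships with Stata; the new variable name price_q4 is my own. Note that xtile codes the lowest group as 1, so check whether that matches the direction of coding you want.

```stata
sysuse auto, clear

* code each observation by the quarter of the price distribution it falls in
xtile price_q4 = price, nquantiles(4)

* roughly equal group sizes, with the price range covered by each group
tabulate price_q4
tabstat price, by(price_q4) statistics(min max n)
```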

How do I store regression results from loops in Stata?

I have built a model which basically does the following:
run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portfolio and quantile 10 return for the last period
The pair of variables are just the final entries in the timeframe. However, I intend to extend the single time period to rolling through a large timeframe, in essence:
for i in timeperiod {
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store quantile 1 portfolio and quantile 10 return for the last period
}
The data I'm after is the portfolio 1 and 10 returns for the final day of each timeframe (built using the previous 3 years of data). This should result in a time series of returns (my data cover 60 years, minus the 3 years needed to build the first result, so 57 years) which I can then regress against each other.
regress portfolio 1 against portfolio 10
I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.
In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.
EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to more represent what I want.
Sounds like a case for either rolling or statsby, depending on what you exactly want to do. These are prefix commands, that you prefix to your regression model. rolling or statsby will take care of both the looping and storing of results for you.
If you want maximum control, you can do the loop yourself with forvalues or foreach and store the results in a separate file using post. In fact, if you look inside rolling and statsby (using viewsource) you will see that this is what these commands do internally.
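For orientation, here is a rough sketch of both prefix commands on example datasets rather than on your data. It assumes the auto dataset shipped with Stata and the lutkepohl2 dataset from the Stata manuals (the latter needs an internet connection for webuse); the window length of 30 is arbitrary. The clear option replaces the data in memory with the collected results; both commands also have a saving() option if you would rather write the results to a file.

```stata
* statsby: one regression per group, results collected in a new dataset
sysuse auto, clear
statsby _b _se, by(foreign) clear: regress price weight
list

* rolling: the same idea over moving windows of a time series
webuse lutkepohl2, clear
tsset qtr
rolling _b, window(30) clear: regress dln_inv dln_inc dln_consump
list in 1/5
```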
Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.
Solution 1: create a separate data set using post (see help).
use my_data, clear
postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
using my_derived_data, replace
* 1. topost is a placeholder name
* 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
* 3. This will create the file my_derived_data.dta,
* which of course you can name as you wish
* 4. The triple slash is a continuation comment: the code is continued on next line
levelsof time_period, local( allyears )
* 5. This will create a local macro allyears
* that contains all the values of time_period
foreach t of local allyears {
regress outcome x1 x2 x3 if time_period == `t', robust
* 6. the opening and closing single quotes are references to Stata local macros
* Here, I am referring to the cycle index t
organise_stocks_into_quantiles_based_on_coefficient_from_linear_regression
* this isn't making huge sense for me, so you'll have to put your code here
* don't forget inserting if time_period == `t' as needed
* something like this:
predict yhat`t' if time_period == `t', xb
xtile decile`t' = yhat`t' if time_period == `t', n(10)
calculate_portfolio_returns_for_stocks_based_on_quantile
forvalues q=1/10 {
* do whatever if time_period == `t' & decile`t' == `q'
}
* store quantile 1 portfolio and quantile 10 return for the last period
* again I am not sure what you mean and how to do that exactly
* so I'll pretend it is something like
ratio change / price if time_period == `t' , over( decile`t' )
post topost (`t') ("whatever text describes the time `t' portfolio") ///
(_b[_ratio_1:1]) (_b[_ratio_1:10])
* the last two sets of parentheses may contain whatever numeric answer you are producing
}
postclose topost
* 7. close the file you are creating
use my_derived_data, clear
tsset time_period, year
newey return_q10 return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
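Since the template above contains placeholders, here is a smaller version of the same post pattern that should run as is. It is only a sketch on the auto dataset shipped with Stata, with rep78 standing in for the time period; the names results, out, rep, b_weight and se_weight are my own. Run it as one block, because the handle and file names live in local macros.

```stata
sysuse auto, clear

* temporary handle and temporary file, so nothing is left behind on disk
tempname results
tempfile out
postfile `results' int rep double(b_weight se_weight) using `out', replace

* one regression per value of rep78; each post adds one row to the results file
forvalues r = 3/5 {
    quietly regress price weight if rep78 == `r'
    post `results' (`r') (_b[weight]) (_se[weight])
}
postclose `results'

* the collected results are now an ordinary dataset
use `out', clear
list
```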
Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.
use my_data, clear
gen int collapsed_time_period = .
gen double collapsed_return_q1 = .
gen double collapsed_return_q10 = .
* 1. set up placeholders for your results
levelsof time_period, local( allyears )
* 2. This will create a local macro allyears
* that contains all the values of time_period
local T : word count `allyears'
* 3. I now use the local macro allyears as is
* and count how many distinct values there are of time_period variable
forvalues n=1/`T' {
* 4. my cycle now only runs for the numbers from 1 to `T'
local t : word `n' of `allyears'
* 5. I pull the `n'-th value of time_period
** computations as in the previous solution
replace collapsed_time_period = `t' in `n'
replace collapsed_return_q1 = (compute) in `n'
replace collapsed_return_q10 = (compute) in `n'
* 6. I am filling the pre-arranged variables with the relevant values
}
tsset collapsed_time_period, year
* 7. this will likely complain about missing values, so you may have to fix it
newey collapsed_return_q10 collapsed_return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata can only remember one data set at a time, so my preference is to avoid excessive I/O operations as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
I think you're looking for the estout command to store the results of the regressions.