My dataset looks like this:
Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
Here is what I need to do (this is essentially calculating a Wald estimator in econometrics, if you are familiar, if not, no biggie)
I need to create new categories so that if the observation is type 1 then it is 'first' and if it is 2, 3, or 4, it is 'other'
calculate the averages of pizzas and hamburgers by first and other
subtract the means between first and other
divide the differences
There must be more structure than this to the problem; otherwise it's school arithmetic. This may get you started, but I think you need to show more substance about your data structure and larger goals. In a larger dataset, collapse may be a good idea, depending on what you want to do with the results.
clear
input Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
end
gen First = Type == 1
egen MeanPizzas = mean(Pizzas), by(First)
egen MeanHamb = mean(Hamburgers), by(First)
sort First
gen DiffMeanPizzas = MeanPizzas[1] - MeanPizzas[_N]
gen DiffMeanHamb = MeanHamb[1] - MeanHamb[_N]
tabdisp First, c(Mean* Diff*)
--------------------------------------------------------------------------
First | MeanPizzas MeanHamb DiffMeanPizzas DiffMeanHamb
----------+---------------------------------------------------------------
0 | 10.06667 4.833333 -.6333332 -.7666669
1 | 10.7 5.6 -.6333332 -.7666669
--------------------------------------------------------------------------
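The last step in the question, dividing the differences, is not shown above. A minimal sketch of one way to finish, assuming the ratio wanted is the Pizzas difference over the Hamburgers difference (the question does not say which goes on top) and taking differences as first minus other. The scalar names are made up for illustration.

```stata
clear
input Pizzas Hamburgers Type
10.7 5.6 1
9.6 6.7 2
13.4 4.1 3
7.2 3.7 4
end

* indicator: 1 if Type is 1 ("first"), 0 otherwise ("other")
gen First = Type == 1

* store the group means in scalars (names are illustrative)
foreach v in Pizzas Hamburgers {
    summarize `v' if First, meanonly
    scalar first_`v' = r(mean)
    summarize `v' if !First, meanonly
    scalar other_`v' = r(mean)
}

* ratio of the two differences in means (assumed order: Pizzas / Hamburgers)
scalar wald = (first_Pizzas - other_Pizzas) / (first_Hamburgers - other_Hamburgers)
display "Ratio of differences = " wald
```

With the four observations above this comes out at roughly 0.826, which matches dividing the two Diff columns in the table.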
I am trying to plot a bar graph for both the Sept and Oct waves. As you can see in the image, id identifies the individuals who are surveyed across time. On one graph I need to plot sept in-house, oct in-house, sept out-house and oct out-house, showing only the proportion of people who said yes in each. The other response categories do not have to be taken into account.
Also I have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, as graph bar is a dead end here: it does not allow whiskers.
The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end. So, we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results then is to reshape to a different structure and then apply ci proportion under statsby.
As a detail, the option jeffreys is explicit as a signal that there are different methods for the confidence interval calculation. You should choose one knowingly.
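To see why the choice of method matters, here is a small sketch that runs three of the interval types ci proportions offers on the same 0/1 variable. The data are just the sept_outhouse answers from the example recoded to yes/not yes (4 yes out of 10), and the ci proportions syntax assumes Stata 14.1 or later.

```stata
clear
input byte yes
1
0
0
0
0
0
0
1
1
1
end

* exact (Clopper-Pearson) interval, which is the default
ci proportions yes

* Wilson interval
ci proportions yes, wilson

* Jeffreys interval, as used in the answer
ci proportions yes, jeffreys
```

With samples this small the limits can differ noticeably between methods, which is the reason for choosing one knowingly.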
I am having trouble using the L1 operator in Stata 14 to create lag variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance
There is hardly enough information in the question to know for certain, but as @Dimitriy V. Masterov suggested by asking how your data are tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
And notice now how you have 10 missing values generated. In this example, it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.
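For contrast, a sketch of what the declaration would look like for the two-country example, assuming id is the panel identifier and year is the time variable: the panel variable goes first and the time variable second.

```stata
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end

* panel variable first, then time variable
tsset id year
gen lag_gdp = L1.gdp

* only the first year of each panel should lack a lag
count if missing(lag_gdp)
list, sepby(id)
```

Here the count should be 2, one missing lag per country, rather than 10.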
I'm using Stata, and I'm trying to compute the average price of firms' rivals in a market. I have data that looks like:
Market Firm Price
----------------------
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
And I'm trying to compute the average price of each firm's rivals, so I want to generate a new field that is the average values of the other firms in a market. It would look like:
Market Firm Price AvRivalPrice
------------------------------------
1 1 100 137.5
1 2 150 112.5
1 3 125 125
2 1 50 87.5
2 2 100 62.5
2 3 75 75
3 1 100 200
3 2 200 150
3 3 200 150
To do the average by group, I could use the egen command:
egen AvPrice = mean(Price), by(Market)
But that wouldn't exclude the firm's own price in the average, and to the best of my knowledge, using the if qualifier would only change the observations it operated on, not the groups it averaged over. Is there a simple way to do this, or do I need to create loops and generate each average manually?
This is an old thread still of interest, so materials and techniques overlooked first time round still apply.
The more general technique is to work with totals. At its simplest, total of others = total of all - this value. In an egen framework that is going to look like
egen total = total(Price), by(Market)
egen n = total(!missing(Price)), by(Market)
gen avprice = (total - cond(missing(Price), 0, Price)) / cond(missing(Price), n, n - 1)
The total() function of egen ignores missing values in its argument. If there are missing values, we don't want to include them in the count, but we can use !missing() which yields 1 if not missing and 0 if missing. egen's count() is another way to do this.
Code that divides by _N - 1 gives the wrong answer if missings are present, as they are included in the count _N.
Even if a value is missing, the average of the other values still makes sense.
If no value is missing, the last line above simplifies to
gen avprice = (total - Price) / (n - 1)
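As a check on the missing-value handling, here is a small sketch using two of the markets from the question, with one price set to missing purely for illustration (that missing value is not in the original data). The new variable name AvRivalPrice follows the question.

```stata
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
3 1 100
3 2 200
3 3 .
end

* group total; egen's total() ignores missing values
egen double total = total(Price), by(Market)

* number of nonmissing prices in each market
egen n = total(!missing(Price)), by(Market)

* subtract own price only if it is nonmissing, and adjust the count to match
gen double AvRivalPrice = (total - cond(missing(Price), 0, Price)) / cond(missing(Price), n, n - 1)

list, sepby(Market)
```

In market 3 the firm with the missing price gets (100 + 200) / 2 = 150, while the other two firms each get the single remaining rival's price.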
So far, this possibly looks like no more than a small variant on previous code, but it does extend easily to using weights. Presumably we want a weighted average of others' prices given some weight. We can exploit the fact that total() works on expressions, which can be more complicated than just variable names. Indeed the code above did that already, but it is often overlooked.
egen wttotal = total(weight * Price), by(Market)
egen sumwt = total(weight), by(Market)
gen avprice = (wttotal - Price * weight) / (sumwt - weight)
As before, if price or weight is ever missing, you need more complicated code, or just to ensure that you exclude such observations from the calculations.
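A quick worked case of the weighted version, assuming a weight variable exists. The weights below are invented for illustration and nothing is missing.

```stata
clear
input Market Firm Price weight
1 1 100 1
1 2 150 2
1 3 125 1
end

* weighted total of prices and total of weights, by market
egen double wttotal = total(weight * Price), by(Market)
egen double sumwt = total(weight), by(Market)

* weighted mean of the other firms' prices
gen double AvRivalPrice = (wttotal - Price * weight) / (sumwt - weight)

list
```

For firm 2 this is (525 - 300) / (4 - 2) = 112.5, the same as the unweighted answer here because its two rivals carry equal weight.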
See also the Stata FAQ "How do I create variables summarizing for each individual properties of the other members of a group?" (http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/) for a wider-ranging discussion.
(If the numbers get big, work with doubles.)
EDIT 2 March 2018 That was a newer post in an old thread, which in turn needs updating. rangestat (SSC) can be used here and gives one-line solutions. Not surprisingly, the option excludeself was explicitly added for these kinds of problem. But while the solution for means is easy using an identity
mean for others = (total - value for self) / (count - 1)
many other summary measures don't yield to a similar, simple trick and in that sense rangestat includes much more general coding.
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
end
rangestat (mean) Price, interval(Firm . .) by(Market) excludeself
list, sepby(Market)
+----------------------------------+
| Market Firm Price Price_~n |
|----------------------------------|
1. | 1 1 100 137.5 |
2. | 1 2 150 112.5 |
3. | 1 3 125 125 |
|----------------------------------|
4. | 2 1 50 87.5 |
5. | 2 2 100 62.5 |
6. | 2 3 75 75 |
|----------------------------------|
7. | 3 1 100 200 |
8. | 3 2 200 150 |
9. | 3 3 200 150 |
+----------------------------------+
This is a way that avoids explicit loops, though it takes several lines of code:
bysort Market: egen Total = total(Price)
replace Total = Total - Price
by Market: gen AvRivalPrice = Total / (_N-1)
drop Total
Here's a shorter solution with fewer lines that kind of combines your original thought and @onestop's solution:
egen AvPrice = mean(Price), by(Market)
bysort Market: replace AvPrice = (AvPrice*_N - Price)/(_N-1)
This is all good for a census of firms. If you have a sample of the firms, and you need to apply the weights, I am not sure what a good solution would be. We can brainstorm it if needed.
I want to calculate growth rates in Stata for observations having the same ID. My data looks like this in a simplified way:
ID year a b c d e f
10 2010 2 4 9 8 4 2
10 2011 3 5 4 6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5 4 6 1 4
334 2011 5 5 4 4 3 2
Now I want to calculate for each ID growth rates from variables a-f from 2010 to 2011:
For example, for ID 10 and variable a it would be (3-2)/2, for variable b (5-4)/4, etc., with the results stored in new variables (e.g. growth_a, growth_b, etc.).
Since I have over 120k observations and around 300 variables, is there an efficient way to do so (loop)?
My code looks like the following (simplified):
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
FYI: variables a-f are numeric.
But Stata says: 'local not found' and I am not sure whether the code is correct. Do I also have to sort for year first?
The specific error in
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
is an error in the syntax of foreach, which here expects syntax like foreach x of local variables, given your prior use of a local macro. With the keyword in, foreach takes the word local literally and here looks for a variable with that name: hence the error message. This is basic foreach syntax: see its help.
This code is problematic for further reasons.
Sorting on ID does not guarantee the correct sort order, here time order by year, for each distinct ID. If observations are jumbled within ID, results will be garbage.
The code assumes that all time values are present; otherwise the time gap between observations might be unequal.
A cleaner way to get growth rates is
tsset ID year
foreach x in a b c d e f {
gen `x'_gr = D.`x'/L.`x'
}
Once you have tsset (or xtset) the time series operators can be used without fear: correct sorting is automatic and the operators are smart about gaps in the data (e.g. jumps from 1982 to 1984 in yearly data).
For more variables the loop could be
foreach x of var <whatever> {
gen `x'_gr = D.`x'/L.`x'
}
where <whatever> could be a general (numeric) varlist.
EDIT: The question has changed since first posting and interest is declared in calculating growth rates only from 2010 to 2011, with the implication in the example that only those years are present. The more general code above will naturally still work for calculating those growth rates.
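Putting the pieces together on the example data from the question, as a sketch: it keeps the local macro from the original attempt but uses foreach ... of local, and lets tsset handle the sorting.

```stata
clear
input ID year a b c d e f
10 2010 2 4 9 8 4 2
10 2011 3 5 4 6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5 4 6 1 4
334 2011 5 5 4 4 3 2
end

* declare the panel structure: panel variable first, time variable second
tsset ID year

* "of local" tells foreach to read the list from the local macro
local variables "a b c d e f"
foreach x of local variables {
    gen `x'_gr = D.`x' / L.`x'
}

list ID year a a_gr b b_gr, sepby(ID)
```

For ID 10 this gives a_gr = 0.5 and b_gr = 0.25 in 2011, matching the hand calculation in the question, and the 2010 rows are missing because there is no previous year.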