Stata: Deleting Multiple Observations

I have the following data matrix containing ideology scores in a customized dataset:
year state cdnum party name dwnom1
1946 23 10 200 WOODRUFF 0.43
1946 23 11 200 BRADLEY F. 0.534
1946 23 11 200 POTTER C. 0.278
1946 23 12 200 BENNETT J. 0.189
My unit of analysis is a given congressional district in a given year. As one can see, state #23, cdnum #11 has two observations in 1946.
What I would like to do is delete the earlier observation, in this case the observation corresponding to name BRADLEY F. This happens when a congressional district has two members in a given Congress. The code I have tried is as follows:
drop if year==[_n+1] & statenum==[_n+1] & cdnum==[_n+1]
My attempt is a conditional argument: drop the observation if the year is the same as the next observation, the statenum is the same as the next observation, and the cdnum is the same as the next observation. In this way, I can ensure each district has only one observation for a given year. When I attempt to run the code I get:
drop if year==[_n-1] & statenum==[_n-1] & cdnum==[_n-1]
(0 observations deleted)

Brief alternative: You should check out the duplicates command.
Detailed explanation of error:
You don't mean what you say to Stata.
Your conditions such as
if year == [_n-1]
should be
if year == year[_n-1]
and so forth.
[_n-1]
by itself is treated as if you typed
_n-1
which is the observation number, minus 1.
Here is a dopey example. Read in the auto data.
. sysuse auto
(1978 Automobile Data)
. list foreign if foreign == [_n-1], nola
+---------+
| foreign |
|---------|
1. | 0 |
+---------+
The variable foreign is equal to _n - 1 precisely once, in observation 1 when foreign is 0 and _n is 1.
In short, [_n-1] is not to be interpreted as the previous value (of the variable I just mentioned).
help subscripting gives very basic help.
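Putting the correction together, here is a minimal sketch, assuming the variable names from the question and that the data are sorted so the two members of a duplicated district-year are adjacent, with the earlier member listed first:

```stata
* Sort so duplicate district-years are adjacent
sort year statenum cdnum

* Corrected subscripting: compare each variable with its own value
* in the next observation, and drop the first-listed of the pair
* (BRADLEY F. in the example, if the earlier member sorts first)
drop if year == year[_n+1] & statenum == statenum[_n+1] & cdnum == cdnum[_n+1]
```

Alternatively, `duplicates drop year statenum cdnum, force` keeps exactly one observation per district-year, but which one it keeps depends on the current sort order, so check that the retained member is the intended one.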

Related

How can I replace observations for multiple variables with the same prefix?

I am attempting to identify which observations are below 15 seconds for my timer variables. First I generated the variable speed, then attempted to replace speed with 1 wherever a variable starting with timer is below 15.
gen speed = 0
replace speed = 1 if timer* <15
However, Stata is telling me,
timer* invalid name
r(198);
What might be going on? I'm not sure how to attach my data from Stata here, any insight into that would be appreciated too.
The Stata tag wiki has massively detailed information on how to post a data example.
What is going on is simply that Stata doesn't support your syntax. Indeed, it is not even clear what it might mean.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen timer1 = 10
. gen timer2 = 20
. list
+-----------------+
| timer1 timer2 |
|-----------------|
1. | 10 20 |
+-----------------+
. gen wanted1 = min(timer1, timer2) < 15
. gen wanted2 = max(timer1, timer2) < 15
. l
+-------------------------------------+
| timer1 timer2 wanted1 wanted2 |
|-------------------------------------|
1. | 10 20 1 0 |
+-------------------------------------+
.
One guess is that you want an indicator which is 1 if any of the variables timer* is less than 15, in which case you need to compute the minimum over those variables in an observation and compare it with 15. Another guess is that you want an indicator which is 1 if all of the variables timer* are less than 15, in which case you first need to compute the maximum in an observation. For the simple example above the functions min() and max() serve well. For a dataset with many more variables in timer*, you will find it more convenient to reach for the egen functions rowmin() and rowmax().
There are other ways to do it, and other wilder guesses at what you want, but I will stop there.
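As a sketch of the rowmin()/rowmax() route for a wider dataset (the new variable names here are illustrative):

```stata
* Row-wise minimum and maximum across all timer* variables
egen mintimer = rowmin(timer*)
egen maxtimer = rowmax(timer*)

gen speed_any = mintimer < 15   // 1 if any timer* is below 15
gen speed_all = maxtimer < 15   // 1 if every timer* is below 15
```

Note that rowmin() and rowmax() ignore missing values, so decide explicitly how observations with some timers missing should be treated.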

tabstat: How to sort/order the output by a certain variable?

I gathered some NBA players' data of their triple-double games, and would like to find out who got the most explosive data on average.
The source is "Basketball Reference - Player Game Finder - Triple Doubles".(Sorry that I can't post the direct url because of the lack of reputation)
So I generated a table summarizing descriptive statistics (e.g. count and mean) for several variables (pts trb ast stl blk) using:
tabstat pts trb ast stl blk, statistics(count mean) format(%9.1f) by(player)
What I get is the following table:
tabstat result:
How can I tell Stata to keep only players with count >= 10 (who ever recorded 10 or more triple-doubles) and then sort the table by pts, to get:
Ideal result:
Like above, I would say Michael Jordan and James Harden are the Top 2 most explosive triple-double players and Darrell Walker is the most economic one.
Do study https://stackoverflow.com/help/mcve on how to present an example other people can work with straight away. Also, avoiding sports-specific jargon that won't be universally comprehensible and focusing more on the general programming problem would help. Fortunately, what you want seems clear nevertheless.
To do this you need to create a variable defining the order desired in advance of your tabstat call. To get it (value) labelled as you wish, use labmask (search labmask then download from the Stata Journal location given).
Here is some technique.
sysuse auto, clear
egen mean = mean(weight), by(rep78)
egen count = count(weight), by(rep78)
egen group = group(mean rep78) if count >= 5
replace group = -group
labmask group, values(rep78)
label var group "`: var label rep78'"
tabstat mpg weight , by(group) s(count mean) format(%1.0f)
Summary statistics: N, mean
by categories of: group (Repair Record 1978)
group | mpg weight
-------+--------------------
2 | 8 8
| 19 3354
-------+--------------------
3 | 30 30
| 19 3299
-------+--------------------
4 | 18 18
| 22 2870
-------+--------------------
5 | 11 11
| 27 2323
-------+--------------------
Total | 67 67
| 21 3030
----------------------------
Key details:
The grouping variable is based not only on the means you want to sort on but also on the original grouping variable, just in case there are ties on the means.
To get ordering from highest mean downwards, the grouping variable must be negated.
tabstat doesn't show variable labels in the body of the table. (Usually there wouldn't be enough space for them.)

Stata Collapsing by first observation date when there are multiple date observations per ID

I am working with a dataset that has purchases per date (called ItemNum) on multiple dates across 2800 individuals. Each Item is given its own line, so if an individual has purchased two items on a date, that date will appear twice. I don't care how many items were purchased on a date (with each date representing one trip), but rather the mean number of trips made across the 2800 individuals (For about 18230 lines of data). My data looks like this:
+----+------------+---------+---------------------------+
| ID |    Date    | ItemNum | ItemDescript              |
|----+------------+---------+---------------------------|
| 1  | 01/22/2010 |    1    | Description of the item   |
| 1  | 01/22/2010 |    2    | Description of other item |
| 1  | 07/19/2013 |    1    |                           |
| 2  | 06/04/2012 |    1    |                           |
| 2  | 02/02/2013 |    1    |                           |
| 2  | 11/13/2013 |    1    |                           |
+----+------------+---------+---------------------------+
In the above table, person 1 made two trips and three item purchases (because two dates are shown), person 2 made three trips. I am interested in the average number of trips across all people, but first I need to collapse it down to unique dates. So I know I need to collapse on the date, but when I do
collapse (mean) ItemNum (first) Date, by(ID)
it just takes the first date that the ID shows up, not the first occurrence of each unique date.
The next issue is that once it's collapsed, I need to take the mean of the count of the dates, not the date itself, which is also where I seem to be getting tripped up.
Or perhaps something like
clear
input ID str16 dt ItemNum
1 "01/22/2010" 1
1 "01/22/2010" 2
1 "07/19/2013" 1
end
generate Date = daily(dt,"MDY")
egen trip = tag(ID Date)
collapse (sum) trip, by(ID)
summarize trip
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
trip | 1 2 . 2 2
if what you are looking for is found in "Mean" - a single number giving the average number of trips made by the 2800 individuals (1 individual with the limited sample data given).
Are you trying to do the following?
collapse (mean) ItemNum, by(ID Date) fast
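That alternative collapses to one row per person-date, but a trip count and its mean still need a further step. One possible completion, as a sketch not tested against the real data:

```stata
collapse (mean) ItemNum, by(ID Date) fast   // one row per person per date
bysort ID: gen trips = _N                   // trips = number of distinct dates
bysort ID: keep if _n == 1                  // one row per person
summarize trips                             // Mean = average trips per person
```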

Stata: calculating growth rates for observations with same ID

I want to calculate growth rates in Stata for observations having the same ID. My data looks like this in a simplified way:
ID year a b c d e f
10 2010 2 4 9 8 4 2
10 2011 3 5 4 6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5 4 6 1 4
334 2011 5 5 4 4 3 2
Now I want to calculate for each ID growth rates from variables a-f from 2010 to 2011:
For example, for ID 10 and variable a it would be (3-2)/2, and for variable b, (5-4)/4, etc. I want to store the results in new variables (e.g. growth_a, growth_b).
Since I have over 120k observations and around 300 variables, is there an efficient way to do so (loop)?
My code looks like the following (simplified):
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
FYI: variables a-f are numeric.
But Stata says: 'local not found' and I am not sure whether the code is correct. Do I also have to sort for year first?
The specific error in
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
is an error in the syntax of foreach, which here expects syntax like foreach x of local variables, given your prior use of a local macro. With the keyword in, foreach takes the word local literally and here looks for a variable with that name: hence the error message. This is basic foreach syntax: see its help.
This code is problematic for further reasons.
Sorting on ID does not guarantee the correct sort order, here time order by year, for each distinct ID. If observations are jumbled within ID, results will be garbage.
The code assumes that all time values are present; otherwise the time gap between observations might be unequal.
A cleaner way to get growth rates is
tsset ID year
foreach x in a b c d e f {
gen `x'_gr = D.`x'/L.`x'
}
Once you have tsset (or xtset) the time series operators can be used without fear: correct sorting is automatic and the operators are smart about gaps in the data (e.g. jumps from 1982 to 1984 in yearly data).
For more variables the loop could be
foreach x of var <whatever> {
gen `x'_gr = D.`x'/L.`x'
}
where <whatever> could be a general (numeric) varlist.
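For instance, assuming ID and year are the only variables not needing growth rates, the varlist could be built like this (a sketch):

```stata
* List every variable except the panel identifiers
ds ID year, not
foreach x of varlist `r(varlist)' {
    gen `x'_gr = D.`x'/L.`x'
}
```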
EDIT: The question has changed since first posting and interest is declared in calculating growth rates only from 2010 to 2011, with the implication in the example that only those years are present. The more general code above will naturally still work for calculating those growth rates.

To create highest & lowest quartiles of a variable in Stata

This is the Stata code I used to divide a Winsorised & centred variable (num_exp, denoting number of experienced managers) based on 4 quartiles & thereafter to generate the highest & lowest quartile dummies thereof:
egen quartile_num_exp = xtile(WC_num_exp), n(4)
gen high_quartile_numexp = 1 if quartile_num_exp==4
(1433 missing values generated)
gen low_quartile_num_exp = 1 if quartile_num_exp==1
(1062 missing values generated)
Thanks everybody - here's the link
https://dl.dropboxusercontent.com/u/64545449/No%20of%20expeienced%20managers.dta
I did try both Aspen Chen's & Roberto's suggestions - Chen's way of creating high quartile dummy gives the same results as I had earlier & Roberto's - both quartiles show 1 for the same rows - how's that possible?
I forgot to mention here that there are indeed many ties: the range of the original variable W_num_exp is from 0 to 7, and the mean is 2.126618; I subtracted that from each observation of W_num_exp to get WC_num_exp.
tab high_quartile_numexp shows the same problem I originally had
le_numexp | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,433 80.64 80.64
1 | 344 19.36 100.00
------------+-----------------------------------
Total | 1,777 100.00
Also, I checked egenmore is already installed in my Stata version 13.1
What I fail to understand is why the dummy variable based on the highest quartile doesn't have 75% of observations below it (I have 1,777 observations in total): to my understanding, this dummy should mark the cut-off above which exactly 25% of the observations lie, yet it contains only 19.36% of them.
Am I doing anything wrong in the Stata code for the high-quartile and low-quartile dummy variables?
Consider the following code:
clear
set more off
sysuse auto
keep make mpg
*-----
// your way (kind of)
egen mpg4 = xtile(mpg), nq(4)
gen lowq = mpg4 == 1
gen highq = mpg4 == 4
*-----
// what you want
summarize mpg, detail
gen lowq2 = mpg < r(p25)
gen highq2 = mpg > r(p75)
*-----
summarize high* low*
list
Now check the listing to see what's going on.
See help stored results.
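A related sketch: _pctile stores only the requested percentiles, which avoids picking them out of summarize's full set of results (variable names as in the example above):

```stata
* Store the 25th and 75th percentiles in r(r1) and r(r2)
_pctile mpg, percentiles(25 75)
gen lowq3  = mpg < r(r1)    // below the 25th percentile
gen highq3 = mpg > r(r2)    // above the 75th percentile
```

The same caveat about ties applies: with heavily tied values, neither indicator will contain exactly 25% of observations.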
The dataset provided answers the question. Consider the tabulation:
. tab W_num_exp
num_execs_i |
ntl_exp, |
Winsorized |
fraction |
.01 | Freq. Percent Cum.
------------+-----------------------------------
0 | 297 16.71 16.71
1 | 418 23.52 40.24
2 | 436 24.54 64.77
3 | 282 15.87 80.64
4 | 171 9.62 90.26
5 | 109 6.13 96.40
6 | 34 1.91 98.31
7 | 30 1.69 100.00
------------+-----------------------------------
Total | 1,777 100.00
Exactly equal numbers in each of 4 quartile-based bins can be provided if, and only if, there are values with cumulative percents 25, 50, 75. No such values exist. You have to make do with approximations. The approximations can be lousy, but the only alternative, of arbitrarily assigning observations with the same value to different bins to even up frequencies, is statistically indefensible.
(The number of observations needing to be a multiple of 4 for 4 bins, etc., for exactly equal frequencies is also a complication, which bites hard for small datasets, but that is not the major issue here.)