Stata: break ties in rank of a variable using a second variable

I would like to rank observations in Stata by score1, while breaking ties using score2, as below:
score1 score2 desired_rank
____________________________
99 5 1
99 4 2
89 8 3
80 9 4
80 9 4
78 6 6
I've tried using egen rank, but can't find an option for specifying another variable for tiebreaking.
I've also read this post, but I haven't been able to adapt its solution to my problem very elegantly.
Any recommendations on how to create desired_rank?

One way could be:
clear
set more off
*----- example data -----
input ///
score1 score2 desired_rank
99 4 2
99 5 1
89 8 3
80 9 4
78 6 6
80 9 4
end
list, sep(0)
*----- what you want -----
egen scoreg = group(score1 score2)
egen myrank = rank(scoreg), field
// check
assert desired_rank == myrank
sort myrank
list, sep(0)
The key here is that egen, group() assigns group numbers according to the sort order of the varlist: score1 score2. Then use egen, rank() with the field option, which ranks the highest value as 1 and does not correct for ties, so tied observations share the same rank.

Let's flag here that the question asks for a twist on Stata's default ranking conventions. By default, Stata ranks the lowest value as 1, as is the more common practice in statistics, but here the question asks for the opposite convention, which Stata calls field ranks. That term is intended to evoke field events in athletics such as throwing and jumping in which the highest or longest score is ranked 1.
@Roberto Ferrer's solution is good, but let's work from first principles as an alternative. If we get the observations into the desired sort order, the rank desired is just the observation number, except that if the values in one observation are the same as those in the preceding observation, that rank is used, an exception we apply in cascade.
Here is some code:
clear
input score1 score2 desired_rank
99 5 1
99 4 2
89 8 3
80 9 4
80 9 4
78 6 6
end
gsort -score1 -score2
gen Desired_Rank = _n
replace Desired_Rank = Desired_Rank[_n-1] if score1 == score1[_n-1] & score2 == score2[_n-1]
assert desired_rank == Desired_Rank
Had we wanted lowest values to rank 1, the sorting command would have been
sort score1 score2
This solution gets messier if we want to rank only some observations using if or in; or if there are missing values; or if there are more scores to be used. In all those cases a solution based on egen is cleaner.
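As a hedged sketch of why the egen route scales more easily, the example below adds a hypothetical third score (score3, not in the question) and one missing value. It relies on egen's group() leaving observations with any missing value out of the grouping by default, so those observations simply receive no rank.

```stata
clear
input score1 score2 score3
99 5 1
99 5 2
 . 4 3
80 9 1
end
* group() numbers the distinct (score1, score2, score3) combinations in
* sort order; an observation with any missing score gets a missing group
egen scoreg = group(score1 score2 score3)
* field: the highest group is ranked 1 and tied observations share a rank
egen myrank = rank(scoreg), field
list, sep(0)
```

An if or in qualifier could be added to the first egen call in the same way to rank only some observations.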
This is a good point to emphasise a trick obvious when it's explained:
egen rank1 = rank(mpg)
egen rank2 = rank(-mpg)
Negating a variable flips the ranking order round. The ranks of 2.71828, 3.14159 and 42 are 1, 2, 3; the ranks of -2.71828, -3.14159, -42 are 3, 2, 1. People often miss that the rank() function of egen can be fed an expression, which can easily be more complicated than a single variable name.
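A small check of the negation trick, using the auto data shipped with Stata (the new variable names are illustrative only). With the track option, ranking the negated variable should reproduce the field ranks of the original variable, because both count how many values are higher.

```stata
sysuse auto, clear
* field: highest mpg is ranked 1, ties are not averaged
egen rank_field = rank(mpg), field
* track: lowest value is ranked 1, ties are not averaged;
* applied to -mpg, the lowest value is the highest mpg
egen rank_negtrack = rank(-mpg), track
assert rank_field == rank_negtrack
```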
Personal note: When writing some ranking code for Stata in 1999, I was surprised to find no hint in the statistical or computing literature of names for different kinds of ranks, so I introduced the terms field and track to the Stata literature. Some years on, the only other term I have noticed is "schoolmaster's rank" for field rank, but that does not seem a better term, for several quite different reasons.

Related

Two sided test: proc ttest?

I need to calculate the two-sided ttest in SAS.
I generally use the proc ttest adding side=2 but I am not sure if this test works fine or if another way should be preferred to it.
An example of data is the following:
Score Segment Obs Class_obs
1 0 500 15
1 1 500 34
2 0 234 23
2 1 766 65
The p-value is calculated for each score. Segment indicates whether a condition is met (e.g. score higher than 60): 0 means ‘lower than 60’ while 1 means ‘higher than 60’.
Obs is the number of observations in each segment by score. Class_obs is the number of observations that satisfy a specific condition on the overall population.
Happy to share more info if needed.

Splitsample in Stata 16: How to create samples based on varying proportions saved in a variable?

Data structure: I use panel data in which an observation represents a certain individual in a given year (2015-2021). Only observations of individuals who are between 15 and 25 years old are included. There are 2857 observations of 1373 individuals in total.
Goal: The goal is to investigate the effect of a policy change in 2018. To do so, I set up a quasi-experimental design with two control groups and a treatment group defined by age:
Control group A: individuals 15-17 years old
Treatment group: individuals 18-22 years old
Control group B: individuals 23-25 years old
Dividing individuals into treatment and control groups based on varying chances:
For methodological reasons, individuals selected into a control group may not become part of the treatment group (due to aging over time) and vice versa. I am therefore confronted with the question of how to select the right individuals (given their age and the year) for the treatment and control groups.
To ensure that every year has observations of individuals of all ages, I came up with the following design (see picture).
There are 17 theoretically possible individuals in my data (vertical in the picture) who age over 7 years (2015-2021). I would like to sample the individuals into the treatment and control groups based on the chances mentioned in the table beneath, to ensure all ages are represented in all years.
Programming
I constructed a variable (1-17) indicating what number an individual represents (like the vertical numbers in the table above)
gen individualnumber=(age-year)+2007
I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:
gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9
gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9
gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)
After that I wanted to construct the samples...
splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show
...but I received an error, since only a numlist may be used in the split(numlist) option.
Question: How to construct the samples or overcome this error in an efficient way?
Example: An individual (number 7 in the table) who is 15 years old in 2015 (control group A age) will be 18 years old in 2018 (the treatment age). This individual may not be part of both the treatment and the control group, and should therefore belong to only one of the two. I therefore want to draw three random samples among all number 7 individuals.
Suppose there are 100 individuals like individual 7 in the table.
Sample 1 is control group A, and individual 7 will occur 43 times in this sample.
Sample 2 is the treatment group, so individual 7 occurs 57 times in this sample.
Individual 7 will not occur in sample 3, since this person is never older than 22 during 2015-2021.
What all people who were 9 in 2015, 10 in 2016 and 11 in 2017 have in common is that they were born in 2006. And all who were 10 in 2015 were born in 2005. So instead of a variable individualnumber, which can be hard to understand for someone reading your code, why not create a variable called birthyear? That will make it easier to explain your design to your peers.
Regardless of what you call the variable or what its values represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end
* Create the var that will hold the birth year
gen birthyear = year-age
preserve
* Collapse year-person level data to person level so
* that each individual only gets one treatment status.
* You must have an individual id number for this.
* Get the standard deviation to test that the data are good and that birthyear
* is identical for each individual across the panel data set
collapse (mean) birthyear (sd) bysd=birthyear, by(id)
* Test that birthyear is constant within each individual - this is not needed,
* but is a good data quality assurance test. Then drop the var as it is not needed.
* The sd is missing for individuals observed only once, so allow that too.
assert bysd == 0 | missing(bysd)
drop bysd
* Set seed to make replicable. Replace this seed when you have tested this
* script using a new random seed. For example from here:
* https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
set seed 123456
*Generate a random number based on the seed
gen random_draw = runiform()
* For each birthyear, get the rank of the random number divided by the number
* of individuals in each birthyear
sort birthyear random_draw
by birthyear : gen percent_rank = _n/_N
*Initiate treatment variable
gen tmt_status = .
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
*Attach the value label so tabulations show the group names
label values tmt_status tmt_status
*Assign birthyear 2006-2004 that are all the same
replace tmt_status = 1 if birthyear == 2006
replace tmt_status = 1 if birthyear == 2005
replace tmt_status = 1 if birthyear == 2004
*Assign birthyear 2003
replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
replace tmt_status = 1 if birthyear == 2003 & percent_rank > .25
*Assign birthyear 2002
replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
replace tmt_status = 1 if birthyear == 2002 & percent_rank > .40
*Fill in birthyear 2001-1999
*Assign year 1998
replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
replace tmt_status = 1 if birthyear == 1998 & percent_rank > .72 & percent_rank <= .86
replace tmt_status = 2 if birthyear == 1998 & percent_rank > .86
*Fill in birthyear 1997-1990
* Do some tabulates etc to convince yourself the randomization is as expected
* Save tempfile of data to be merged to later
* (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
tempfile assignment_results
save `assignment_results'
restore
merge m:1 id using `assignment_results'
This code can be made more concise using a loop, but random assignment is so important that I personally always go for clarity over conciseness when doing this.
This does not answer the question about splitsample specifically, but it addresses what you are trying to do. You will have to decide how to handle groups whose size cannot be split into the exact ratio you prefer.
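One possible way to shorten the assignment step, offered only as a sketch, is to store the shares as variables and compare percent_rank against cumulative cut-offs in a single cond() call. The data below are simulated, and the names p_ctrlA, p_ctrlB and p_treat are hypothetical stand-ins for the chance variables in the question; everyone is given birthyear 1998 with the 0.14/0.72/0.14 split used above.

```stata
clear
set seed 123456
set obs 1000
gen id = _n
* Hypothetical person-level data: everyone is born in 1998
gen birthyear = 1998
* Assignment shares, stored as doubles to limit rounding surprises
gen double p_ctrlA = 0.14
gen double p_ctrlB = 0.14
gen double p_treat = 1 - (p_ctrlA + p_ctrlB)
gen double random_draw = runiform()
sort birthyear random_draw
by birthyear : gen double percent_rank = _n/_N
* 0 = Treated, 1 = ControlA, 2 = ControlB, same coding as the value label
gen byte tmt_status = cond(percent_rank <= p_treat, 0, ///
    cond(percent_rank <= p_treat + p_ctrlA, 1, 2))
tabulate tmt_status
```

Because the cut-offs are compared with floating-point ranks, an observation sitting exactly on a boundary may fall on either side, so the group sizes should be checked with tabulate rather than assumed.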

Lag in Stata generates only missing

I have trouble using the L1 operator in Stata 14 to create lag variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance.
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by questioning how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
Notice that you now have 10 missing values. In this example, it happens because the panel and time variables are given in the wrong order (tsset expects the panel variable first, then the time variable), so the incorrectly specified time variable, id, has gaps (period 1 and period 3, but no period 2).
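For contrast, here is the same example data declared with the panel variable first and the time variable second. With that declaration, only the first year within each country should lack a lag.

```stata
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
* Panel variable (id) first, time variable (year) second
tsset id year
gen lag_gdp = L1.gdp
* Count the observations without a lagged value
count if missing(lag_gdp)
list, sepby(id)
```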
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset there is no possible way to diagnose this, but it is very likely an issue with how the data is tsset.

Stata Deleting Multiple Observations

I have the following data matrix containing ideology scores in a customized dataset:
year state cdnum party name dwnom1
1946 23 10 200 WOODRUFF 0.43
1946 23 11 200 BRADLEY F. 0.534
1946 23 11 200 POTTER C. 0.278
1946 23 12 200 BENNETT J. 0.189
My unit of analysis is a given congressional district, in a given year. As one can see state #23, cdnum #11, has two observations in 1946.
What I would like to do is delete the earlier observation, in this case the one corresponding to name BRADLEY F. This happens when a congressional district has two members in a given Congress. The code I have tried is as follows:
drop if year==[_n+1] & statenum==[_n+1] & cdnum==[_n+1]
My attempt is a conditional statement: drop the observation if the year is the same as in the next observation, the statenum is the same as in the next observation, and the cdnum is the same as in the next observation. In this way, I can ensure each district has only one corresponding observation for a given year. When I run the code I get:
drop if year==[_n-1] & statenum==[_n-1] & cdnum==[_n-1]
(0 observations deleted)
Brief alternative: You should check out the duplicates command.
Detailed explanation of error:
You don't mean what you say to Stata.
Your conditions such as
if year == [_n-1]
should be
if year == year[_n-1]
and so forth.
[_n-1]
by itself is treated as if you typed
_n-1
which is the observation number, minus 1.
Here is a dopey example. Read in the auto data.
. sysuse auto
(1978 Automobile Data)
. list foreign if foreign == [_n-1], nola
+---------+
| foreign |
|---------|
1. | 0 |
+---------+
The variable foreign is equal to _n - 1 precisely once, in observation 1 when foreign is 0 and _n is 1.
In short, [_n-1] is not to be interpreted as the previous value (of the variable I just mentioned).
help subscripting gives very basic help.
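Putting the correction together, a minimal sketch on the four observations from the question might look like this. It uses the variable names shown in the data listing (state rather than statenum) and assumes the data are already ordered so that the observation to be dropped comes immediately before the one to be kept.

```stata
clear
input int year byte state byte cdnum int party str12 name float dwnom1
1946 23 10 200 "WOODRUFF"   0.43
1946 23 11 200 "BRADLEY F." 0.534
1946 23 11 200 "POTTER C."  0.278
1946 23 12 200 "BENNETT J." 0.189
end
* Compare each variable with its own value in the next observation;
* in the last observation [_n+1] is missing, so nothing is dropped there
drop if year == year[_n+1] & state == state[_n+1] & cdnum == cdnum[_n+1]
list, sep(0)
```

Note that duplicates drop with the force option keeps the first observation of each group, which would retain the earlier member here, so the subscript approach matches the stated goal more directly.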

Expanding observations from a range to form a panel

I currently have a data set that appears as follows
mnbr uact_id hiredate termdate
9 3709 19510101 20000915
20 9409 20001001 20080601
33 25646 19990201 20000731
mnbr represents the member number of a given worker in a labor union. uact_id is the shop they were working for and hiredate and termdate (given yyyymmdd) represent the given dates they were at the shop/uact_id. I am currently trying to use the expand command in Stata to create a panel such that there is one observation per year for each member number (mnbr) between the indicators of hiredate and termdate.
i.e. it should ideally look like
mnbr uact_id year
9 3709 1951
9 3709 1952
9 3709 1953
9 3709 1954
etc. for each member number for each year.
Assuming arbitrarily that the dates are strings, we can go
gen year = real(substr(hiredate, 1, 4))
gen duration = real(substr(termdate, 1, 4)) - year + 1
expand duration
bysort mnbr : replace year = year[_n-1] + 1 if _n > 1
If the dates are numeric, specifically integers, then the first two lines could be
gen year = floor(hiredate/10000)
gen duration = floor(termdate/10000) - year + 1
The replace step is discussed within
How can I replace missing values with previous or following nonmissing values or within sequences?
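The steps above can be tried end to end on the three rows from the question. This sketch assumes the dates are stored as numeric integers (long, since eight-digit values are not held exactly in a float) and that each member has a single spell, as in the sample.

```stata
clear
input int mnbr long uact_id long hiredate long termdate
9 3709 19510101 20000915
20 9409 20001001 20080601
33 25646 19990201 20000731
end
* Extract the calendar year from yyyymmdd
gen year = floor(hiredate/10000)
* Number of calendar years covered, counting both end points
gen duration = floor(termdate/10000) - year + 1
* One copy of each observation per year of the spell
expand duration
* Within each member, count up from the hire year in cascade
bysort mnbr : replace year = year[_n-1] + 1 if _n > 1
drop hiredate termdate duration
list in 1/5
```

If a member can have several spells, the by-group would need to identify the spell (for example mnbr together with uact_id and hiredate) rather than mnbr alone.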