How to copy value from previous group into the next group in Stata

I think this might be simple but haven't found a way to do it in Stata so I'd really appreciate if someone could help me out. I'm trying to copy values from one group to the other, with the restriction that the values must be from the previous year. I think the example and the results I'd like to have may make it clearer:
I have data that looks like this:
year group_id educ place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
I have different groups, identified by "group_id", and different years ("year"). I want each group to get the educ value of the group from the previous year, but I don't want 1990 to get the value from 1994. Is this possible? My data would hopefully end up looking like this:
year group_id educ place prev_educ
1990 1 6 a .
1990 1 6 b .
1992 2 2 c 6
1992 2 2 d 6
1994 3 11 e 2
1994 3 11 f 2
1990 4 10 g .
1990 4 10 h .
1992 5 5 i 10
1992 5 5 j 10
1994 6 7 k 5
1994 6 7 l 5
I tried variations of:
gen prev_educ=.
bysort group_id: replace prev_educ=educ[_N -1] if group_id[_n]!=group_id[_n-1]
which is clearly not what I want.

This is a very dangerous data structure, as sorting without extreme care could destroy its integrity. It appears that only the order of year tells you that 1, 2, 3 belong together and 4, 5, 6 belong together.
This reproduces your example:
clear
input year group_id educ str1 place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
end
gen prev_educ = .
replace prev_educ = educ[_n-1] if year > year[_n-1]
replace prev_educ = prev_educ[_n-1] if group_id == group_id[_n-1] & missing(prev_educ)
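Given the warning about sort order, one option is to write the implicit structure into variables before anything else touches the data. This is only a sketch, assuming the data are still in their original order and that every fall in year marks the start of a new block; the names obs and block are made up for illustration.

```stata
clear
input year group_id educ str1 place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
end
* assumption: the data are still in their original order
gen long obs = _n
* a new block starts whenever year falls; the first observation also counts,
* because year[0] is missing and missing is larger than any number
gen block = sum(year < year[_n-1])
* first observation of each later year takes educ from the previous year
bysort block (year obs) : gen prev_educ = educ[_n-1] if year > year[_n-1]
* remaining observations of the same year copy the value just above them
by block : replace prev_educ = prev_educ[_n-1] if year == year[_n-1]
sort obs
list, sepby(group_id)
```

Once obs and block exist, the data can be sorted freely and put back with sort obs.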

Related

Stata alternatives for lookup

I have a large Stata dataset that contains the following variables: year, state, household_id, individual_id, partner_id, and race. Here is an example of my data:
year state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
Note that, in the above table, the people in rows 1 and 2 are married to each other.
I want to create a variable that is one if the person is in an interracial marriage.
As a first step, I used the following code
by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.
This code worked well but gave the wrong result in a few cases. As an alternative, I created a string variable identifying each user and its partner, using
gen id_user=string(household_id)+"."+string(individual_id)+string(year)
gen id_partner=string(household_id)+"."+string(partner_id)+string(year)
What I want to do now is to create something like what vlookup does in Excel: for each row, take the id_partner, find it in id_user, look up that person's race, and compare it with the race of the original user.
I guess it should be something like this?
gen inter2==1 if (find race[idpartner]) == (race[iduser])
The expected output should be like this
year state household_id individual_id partner_id race inter2
1980 CA 23 2 1 3 1
1980 CA 23 1 2 1 1
1990 NY 43 4 2 1 0
1990 NY 43 2 4 1 0
I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race for the people in each couple.
In the code below _N == 2 is meant to catch data errors, such as one partner but not the other being an observation in the dataset or repetitions of one partner or both.
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
" " + string(partner_id), string(partner_id) + ///
" " + string(individual_id))
bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2
list, sepby(household_id) abbreviate(15)
     +-------------------------------------------------------------------------------------+
     | year   state   household_id   individual_id   partner_id   race   couple_id   mixed |
     |-------------------------------------------------------------------------------------|
  1. | 1980      CA             23               2            1      3         1 2       1 |
  2. | 1980      CA             23               1            2      1         1 2       1 |
     |-------------------------------------------------------------------------------------|
  3. | 1990      NY             43               4            2      1         2 4       0 |
  4. | 1990      NY             43               2            4      1         2 4       0 |
     +-------------------------------------------------------------------------------------+
This idea is documented in this article. The link gives free access to a pdf file.
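The same pairing idea also works without building a string key. The following is a sketch rather than the method from the article: it stores the smaller and the larger identifier of each couple in two numeric variables (id_lo and id_hi are made-up names) and groups on those.

```stata
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
* order-free numeric key for each couple: smaller id first, larger id second
generate id_lo = min(individual_id, partner_id)
generate id_hi = max(individual_id, partner_id)
* as before, _N == 2 leaves incomplete or duplicated couples as missing
bysort state year household_id id_lo id_hi : generate mixed = race[1] != race[2] if _N == 2
* anything listed here is a data problem worth inspecting
list if missing(mixed)
```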

Power Bi - Display Trend line for a Year - Quarter Graph (Clustered Column Chart)

I am able to get the Trend line when there is only Year on the axis. But, I want the trend line when there is a quarter as well.
I am not sure whether this is possible in Power BI; I don't think it is at the moment. I am posting it here in case someone has a workaround.
This is my data for this example:
Date Category Value
1/1/2018 A 4
2/1/2018 A 7
3/1/2018 A 6
4/1/2018 A 1
5/1/2018 A 8
6/1/2018 A 1
7/1/2018 A 7
8/1/2018 A 1
9/1/2018 A 9
10/1/2018 A 10
11/1/2018 A 2
12/1/2018 A 1
1/1/2019 A 7
2/1/2019 A 1
3/1/2019 A 4
1/1/2018 B 10
2/1/2018 B 1
3/1/2018 B 7
4/1/2018 B 4
5/1/2018 B 8
6/1/2018 B 4
7/1/2018 B 7
8/1/2018 B 9
9/1/2018 B 10
10/1/2018 B 10
11/1/2018 B 7
12/1/2018 B 5
1/1/2019 B 4
2/1/2019 B 4
3/1/2019 B 1
You can do this using a Line and Clustered Column chart type.
For the lines, you'll need to define measures like this:
A Line = CALCULATE(SUM(Table2[Value]), Table2[Category] = "A")
B Line = CALCULATE(SUM(Table2[Value]), Table2[Category] = "B")

Find Lagged Average of Group

I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (those in the same market) in each market the firm operates in in a given year.
At the moment, I have some code that does the job, but I am hoping that I am missing something and can do this in a more clear or efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
    // find all years for that firm
    levelsof year if firm == `f', local(years)
    foreach y in `years' {
        // skip first year (because there is no lagged data)
        if `y' == 2000 {
            continue
        }
        // find all markets in that year
        levelsof market if firm == `f' & year == `y', local(mkts)
        local L1 = `y'-1
        foreach m in `mkts' {
            // get average of all competitors in that market in the year prior
            gen temp = firm != `f' & year == `L1' & market == `m'
            su price if temp
            replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
            drop temp
        }
    }
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
     +------------------------------------------------------------------+
     | market   year   firm   price   Lprice   mean_o~s   total   count |
     |------------------------------------------------------------------|
  1. |     10   2000      1       1        .          .       0       0 |
  2. |     10   2000      3       2        .          .       0       0 |
     |------------------------------------------------------------------|
  3. |     10   2001      3       3        2          .       2       1 |
     |------------------------------------------------------------------|
  4. |     10   2002      1       4        .          3       3       1 |
  5. |     10   2002      3       5        3          .       3       1 |
     |------------------------------------------------------------------|
  6. |     20   2000      3       7        .          .       0       0 |
  7. |     20   2000      1       6        .          .       0       0 |
     |------------------------------------------------------------------|
  8. |     20   2001      2       9        .        6.5      13       2 |
  9. |     20   2001      3      10        7          6      13       2 |
 10. |     20   2001      1       8        6          7      13       2 |
     |------------------------------------------------------------------|
 11. |     20   2002      1      20        8        9.5      27       3 |
 12. |     20   2002      3      40       10        8.5      27       3 |
 13. |     20   2002      2      30        9          9      27       3 |
     |------------------------------------------------------------------|
 14. |     30   2000      2      50        .          .       0       0 |
     |------------------------------------------------------------------|
 15. |     30   2001      2      70       50          .      50       1 |
 16. |     30   2001      1      60        .         50      50       1 |
     |------------------------------------------------------------------|
 17. |     30   2002      2      90       70         60     130       2 |
 18. |     30   2002      1      80       60         70     130       2 |
     +------------------------------------------------------------------+
My approach breaks it down:
1. Calculate the previous price for the same firm and market. (Step 1 could also be done by declaring each (firm, market) pair a panel.)
2. The mean of the other values (here previous prices) in the same market and year is (sum of all values MINUS this value) divided by (number of values MINUS 1).
3. Step 2 needs a modification: if this firm's previous price is missing, you need to subtract 0 from both numerator and denominator. Stata's normal rules would render sum MINUS missing as missing, but this firm's previous price might be unknown while others in the same market have known prices.
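The remark that the first step (previous price for the same firm and market) could be done by declaring each (firm, market) pair a panel can be sketched as follows. The variable name panel is made up, and the sketch assumes that firm, market and year jointly identify observations, as they do in the example.

```stata
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
* one panel per (firm, market) pair
egen long panel = group(firm market)
xtset panel year
* L.price is the price one year earlier; it is missing if that year is absent
generate Lprice = L.price
```

The lag operator respects gaps in year, so it plays the same role as the condition year - year[_n-1] == 1 in the bysort version.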
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
EDIT: Another solution (2 lines) using rangestat (must be installed using ssc inst rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat Lprice, interval(year 0 0) by(market) excludeself

Save list of distinct values of a variable in another variable

I have data at the country-year-z level, where z is a categorical variable that can take (say) 10 different values for each country-year. Each combination of country-year-z is unique in the dataset.
I would like to obtain a dataset at the country-year level, with a new (string) variable containing all distinct values of z.
For instance let's say I have the following data:
country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
I would like to get the following data:
country year z_distinct
A 2000 1
A 2001 1 2
A 2002 2 5
B 2001 7 8
B 2002 4 5 9
B 2003 3 4
B 2003 4
Here's another way to do it, perhaps more direct. If z is already a string variable the string() calls should both be omitted.
clear
input str1 country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
end
bysort country year (z) : gen values = string(z[1])
by country year : replace values = values[_n-1] + " " + string(z) if z != z[_n-1] & _n > 1
by country year : keep if _n == _N
drop z
list , sepby(country)
     +-------------------------+
     | country   year   values |
     |-------------------------|
  1. |       A   2000        1 |
  2. |       A   2001    1 2 4 |
  3. |       A   2002      2 5 |
     |-------------------------|
  4. |       B   2001      7 8 |
  5. |       B   2002    4 5 9 |
  6. |       B   2003      3 4 |
  7. |       B   2005        1 |
     +-------------------------+
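The replace statement above only updates an observation whose z differs from the previous one, which is fine here because each country-year-z combination is unique. If repeated values of z could occur (an assumption that goes beyond the question), a variant along these lines carries the running list forward on repeats as well. The small dataset is invented purely to show duplicates.

```stata
clear
input str1 country year z
A 2001 1
A 2001 2
A 2001 2
B 2002 4
B 2002 4
B 2002 9
end
* start the list with the smallest z in each country-year
bysort country year (z) : gen values = string(z) if _n == 1
* append z only when it is new; otherwise just copy the list so far
by country year : replace values = values[_n-1] + cond(z != z[_n-1], " " + string(z), "") if _n > 1
by country year : keep if _n == _N
drop z
list, sepby(country)
```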
I think there may be some problems with your desired output given your input, but otherwise something like this should do it:
clear
input str1 country year z
"A" 2000 1
"A" 2001 1
"A" 2001 2
"A" 2001 4
"A" 2002 2
"A" 2002 5
"B" 2001 7
"B" 2001 8
"B" 2002 4
"B" 2002 5
"B" 2002 9
"B" 2003 3
"B" 2003 4
"B" 2005 1
end
gen z_distinct = ""
egen c_x_y = group(country year)
levelsof c_x_y, local(pairs)
foreach p of local pairs {
    qui levelsof z if c_x_y == `p', clean separate(" ")
    qui replace z_distinct = "`r(levels)'" if c_x_y==`p'
}
collapse (first) z_distinct, by(country year)
sort country year
The code loops over country-years, calculating the observed values of z using levelsof, and then collapses to get one row for each country-year.

Looking up data within a file versus merging

I have a file that records the ratings that teacher X gives to teacher Y and the date on which each rating occurs:
clear
rating_id RatingTeacher RatedTeacher Rating Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they make the rating and the cumulative score. The result would look like this.
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist, but there is no reason for lower standards to apply. See the Statalist FAQ.
I rarely see even programmers be precise about what they mean by "efficient": whether it means fewer lines of code, less use of memory, more speed, something else, or is just some all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper.
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
    count if Date < Date[`i'] & RatedT == RatingT[`i']
    replace Wanted = r(N) in `i'
}
list, sep(0)
     +---------------------------------------------------------------------+
     | rating~d   Rating~r   RatedT~r   Rating      SDate    Date   Wanted |
     |---------------------------------------------------------------------|
  1. |        1         15         12        1   1/1/2010   18263        0 |
  2. |        2         12         11        2   1/2/2010   18264        1 |
  3. |        3         14         11        3   1/2/2010   18264        0 |
  4. |        4         14         13        2   1/5/2010   18267        0 |
  5. |        5         19         11        4   1/6/2010   18268        0 |
  6. |        6         11         13        1   1/7/2010   18269        3 |
     +---------------------------------------------------------------------+
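The loop above returns only the number of earlier ratings. The cumulative score asked for in the question can be picked up in the same pass by swapping count for summarize, which leaves both the count and the sum behind. This is a sketch along the same lines, using the question's variable names TimesRated and CumulativeRating and, like the loop above, assuming that only ratings on strictly earlier dates count.

```stata
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
format Date %td
sort Date rating_id
gen TimesRated = .
gen CumulativeRating = .
quietly forval i = 1/`=_N' {
    * earlier ratings received by the teacher who gives rating i
    summarize Rating if Date < Date[`i'] & RatedTeacher == RatingTeacher[`i'], meanonly
    replace TimesRated = r(N) in `i'
    * guard against an empty selection, so that no earlier ratings gives 0
    replace CumulativeRating = cond(r(N) > 0, r(sum), 0) in `i'
}
list rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating, sep(0)
```

It is still a loop over observations, so the caveat about speed on large datasets applies here too.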
The building block is that the rater and ratee form a pair. You can use egen's group() function to give a unique ID to each rater-ratee pair (here rater, ratee and date stand for the corresponding variables in your data).
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n