Save list of distinct values of a variable in another variable - stata

I have data at the country-year-z level, where z is a categorical variable that can take (say) 10 different values for each country-year. Each combination of country-year-z is unique in the dataset.
I would like to obtain a dataset at the country-year level, with a new (string) variable containing all distinct values of z.
For instance let's say I have the following data:
country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
I would like to get the following data:
country year z_distinct
A 2000 1
A 2001 1 2
A 2002 2 5
B 2001 7 8
B 2002 4 5 9
B 2003 3 4
B 2003 4

Here's another way to do it, perhaps more direct. If z is already a string variable the string() calls should both be omitted.
clear
input str1 country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
end
bysort country year (z) : gen values = string(z[1])
by country year : replace values = values[_n-1] + " " + string(z) if z != z[_n-1] & _n > 1
by country year : keep if _n == _N
drop z
list , sepby(country)
     +-------------------------+
     | country   year   values |
     |-------------------------|
  1. |       A   2000        1 |
  2. |       A   2001    1 2 4 |
  3. |       A   2002      2 5 |
     |-------------------------|
  4. |       B   2001      7 8 |
  5. |       B   2002    4 5 9 |
  6. |       B   2003      3 4 |
  7. |       B   2005        1 |
     +-------------------------+
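For the case mentioned above where z is already a string variable, the same technique might look like the sketch below. The small dataset is made up for illustration; the only change to the commands is that both string() calls are dropped.

```stata
* Sketch: same technique when z is already a string variable (toy data)
clear
input str1 country year str1 z
A 2001 1
A 2001 2
A 2001 4
B 2002 4
B 2002 5
B 2002 9
end

* Sort z within country-year and start each group with its first value
bysort country year (z) : gen values = z[1]
* Append each new value to the running list; no string() needed
by country year : replace values = values[_n-1] + " " + z if z != z[_n-1] & _n > 1
* The last observation in each group holds the complete list
by country year : keep if _n == _N
drop z
list, sepby(country)
```

One caveat: string values sort alphabetically, so "10" would come before "2" in the resulting list.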

I think there may be some problems with your desired output given your input, but otherwise something like this should do it:
clear
input str1 country year z
"A" 2000 1
"A" 2001 1
"A" 2001 2
"A" 2001 4
"A" 2002 2
"A" 2002 5
"B" 2001 7
"B" 2001 8
"B" 2002 4
"B" 2002 5
"B" 2002 9
"B" 2003 3
"B" 2003 4
"B" 2005 1
end
gen z_distinct = ""
egen c_x_y = group(country year)
levelsof c_x_y, local(pairs)
foreach p of local pairs {
qui levelsof z if c_x_y == `p', clean separate(" ")
qui replace z_distinct = "`r(levels)'" if c_x_y==`p'
}
collapse (first) z_distinct, by(country year)
sort country year
The code loops over country-years, calculating the observed values of z using levelsof, and then collapses to get one row for each country-year.
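To see what levelsof contributes inside the loop, here is a minimal sketch for a single country-year. It uses a cut-down copy of the example data and selects the group with an explicit if condition rather than the c_x_y group variable.

```stata
* Sketch: what levelsof returns for one country-year (toy data)
clear
input str1 country year z
A 2001 1
A 2001 2
A 2001 4
B 2001 7
B 2001 8
end

* Distinct values of z for country A in 2001, separated by spaces
levelsof z if country == "A" & year == 2001, clean separate(" ")
* levelsof leaves the list in r(levels); here it should hold 1 2 4
display "`r(levels)'"
```

The loop in the answer does this once per country-year and copies r(levels) into z_distinct for the matching observations.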

Related

Counting number of prior observation excluding those belonging to a certain group

I have loan level data which has the following structure and want to create the variable Number
Loan Borrower Lender Date Crop Country Number
1 A X 01/01/20 Coffee USA 0
2 B X 01/02/20 Coffee USA 0
3 C X 01/03/20 Coffee USA 0
4 D X 01/04/20 Coffee USA 0
5 E X 01/05/20 Banana USA 4
6 F X 01/06/20 Banana USA 4
7 G X 01/07/20 Coffee USA 2
8 H X 01/08/20 Orange USA 7
9 I X 01/09/20 Coffee USA 3
. . . . . . .
. . . . . . .
I want to number my loan based on this set of rules
How many loans has the lender issued up to this point (including this loan)
This number should only include loans in the same country as my loan
This number should exclude all loans given out in the same crop
Hence I am left with a number for each observation which states the number of loans given out by the lender in the same country as said loan but excluding those observations in the country which also occur in the same crop.
So far I tried running:
bysort Lender Country (Date): gen var = _n
The problem with this is that I don't subtract the observations which occur in the same crop.
* Example generated by -dataex-.
clear
input byte Loan str8 Borrower str6 Lender float Date str6 Crop str7 Country byte Number
1 "A" "X" 21915 "Coffee" "USA" 0
2 "B" "X" 21916 "Coffee" "USA" 0
3 "C" "X" 21917 "Coffee" "USA" 0
4 "D" "X" 21918 "Coffee" "USA" 0
5 "E" "X" 21919 "Banana" "USA" 4
6 "F" "X" 21920 "Banana" "USA" 4
7 "G" "X" 21921 "Coffee" "USA" 2
8 "H" "X" 21922 "Orange" "USA" 7
9 "I" "X" 21923 "Coffee" "USA" 3
end
format %td Date
bysort Crop (Date) : gen this = _n
bysort Crop Date (this): replace this = this[_N]
sort Loan
gen wanted1 = _n - this
bysort Country (Date) : replace this = _n
bysort Country Date (this): replace this = this[_N]
sort Loan
gen wanted2 = _n - this
list
+---------------------------------------------------------------------------------------------+
| Loan Borrower Lender Date Crop Country Number this wanted1 wanted2 |
|---------------------------------------------------------------------------------------------|
1. | 1 A X 01jan2020 Coffee USA 0 1 0 0 |
2. | 2 B X 02jan2020 Coffee USA 0 2 0 0 |
3. | 3 C X 03jan2020 Coffee USA 0 3 0 0 |
4. | 4 D X 04jan2020 Coffee USA 0 4 0 0 |
5. | 5 E X 05jan2020 Banana USA 4 5 4 0 |
|---------------------------------------------------------------------------------------------|
6. | 6 F X 06jan2020 Banana USA 4 6 4 0 |
7. | 7 G X 07jan2020 Coffee USA 2 7 2 0 |
8. | 8 H X 08jan2020 Orange USA 7 8 7 0 |
9. | 9 I X 09jan2020 Coffee USA 3 9 3 0 |
+---------------------------------------------------------------------------------------------+
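The code above counts by Crop and by Country only. As a cross-check, the three rules in the question can also be applied literally, one loan at a time. This is a brute-force sketch, not the method above: the variable name check is made up, and the loop is quadratic in the number of observations, so it is only sensible for testing on a small sample.

```stata
* Sketch: brute-force check of the three rules (hypothetical variable: check)
clear
input byte Loan str6 Lender float Date str6 Crop str7 Country byte Number
1 "X" 21915 "Coffee" "USA" 0
2 "X" 21916 "Coffee" "USA" 0
3 "X" 21917 "Coffee" "USA" 0
4 "X" 21918 "Coffee" "USA" 0
5 "X" 21919 "Banana" "USA" 4
6 "X" 21920 "Banana" "USA" 4
7 "X" 21921 "Coffee" "USA" 2
8 "X" 21922 "Orange" "USA" 7
9 "X" 21923 "Coffee" "USA" 3
end
format %td Date

gen check = .
forvalues i = 1/`=_N' {
    * Same lender, same country, issued up to this date, different crop
    quietly count if Lender == Lender[`i'] & Country == Country[`i'] ///
        & Date <= Date[`i'] & Crop != Crop[`i']
    quietly replace check = r(N) in `i'
}

* Stops with an error if any loan disagrees with the wanted Number
assert check == Number
list Loan Crop Number check
```

On the example data this agrees with wanted1, which is expected because there is only one lender and one country.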

Generate a variable equal to 1 if 1 is ever observed in panel data

I have the following data with person ID and whether they have insurance in each year:
ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
What I want is to add another column, which equals 1 if a person is ever insured. For example, Person 2 only had insurance in 2002 but it means he has had insurance at some point, so Ever_Ins should equal 1 in all years:
ID Year Insured Ever_Ins
1 2001 1 1
2 2001 0 1
3 2001 0 0
1 2002 1 1
2 2002 1 1
3 2002 0 0
1 2003 1 1
2 2003 0 1
3 2003 0 0
I cannot use egen Ever_Ins = max(Insured), by (ID) because Insured is not a dummy in the true data. It has values such as 9 for unknown.
The technique for "any" and "all" problems is documented in this FAQ. See also this paper for a more detailed discussion. Here is one way to do it.
clear
input ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
end
egen Ever_Ins = max(Insured == 1), by(ID)
sort ID Year
list , sepby(ID)
     +--------------------------------+
     | ID   Year   Insured   Ever_Ins |
     |--------------------------------|
  1. |  1   2001         1          1 |
  2. |  1   2002         1          1 |
  3. |  1   2003         1          1 |
     |--------------------------------|
  4. |  2   2001         0          1 |
  5. |  2   2002         1          1 |
  6. |  2   2003         0          1 |
     |--------------------------------|
  7. |  3   2001         0          0 |
  8. |  3   2002         0          0 |
  9. |  3   2003         0          0 |
     +--------------------------------+
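The "all" counterpart follows the same pattern with min() in place of max(). A minimal sketch on the same data; the variable name Always_Ins is made up for illustration.

```stata
* Sketch: "any" versus "all" (hypothetical variable: Always_Ins)
clear
input ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
end

* "any": at least one year with Insured == 1
egen Ever_Ins = max(Insured == 1), by(ID)
* "all": every year has Insured == 1
egen Always_Ins = min(Insured == 1), by(ID)
sort ID Year
list, sepby(ID)
```

Because Insured == 1 evaluates to 0 for any other value, codes such as 9 for unknown (and missing values) count as "not insured" in both variables.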

Find Lagged Average of Group

I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (those in the same market), for each market in which the firm operates in a given year.
At the moment, I have some code that does the job, but I am hoping that I am missing something and can do this in a more clear or efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
// find all years for that firm
levelsof year if firm == `f', local(years)
foreach y in `years' {
// skip first year (because there is no lagged data)
if `y' == 2000 {
continue
}
// find all markets in that year
levelsof market if firm == `f' & year == `y', local(mkts)
local L1 = `y'-1
foreach m in `mkts' {
// get average of all competitors in that market in the year prior
gen temp = firm != `f' & year == `L1' & market == `m'
su price if temp
replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
drop temp
}
}
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
+--------------------------------------------------------------------------+
| market year firm price Lprice price mean_o~s total count |
|--------------------------------------------------------------------------|
1. | 10 2000 1 1 . 1 . 0 0 |
2. | 10 2000 3 2 . 2 . 0 0 |
|--------------------------------------------------------------------------|
3. | 10 2001 3 3 2 3 . 2 1 |
|--------------------------------------------------------------------------|
4. | 10 2002 1 4 . 4 3 3 1 |
5. | 10 2002 3 5 3 5 . 3 1 |
|--------------------------------------------------------------------------|
6. | 20 2000 3 7 . 7 . 0 0 |
7. | 20 2000 1 6 . 6 . 0 0 |
|--------------------------------------------------------------------------|
8. | 20 2001 2 9 . 9 6.5 13 2 |
9. | 20 2001 3 10 7 10 6 13 2 |
10. | 20 2001 1 8 6 8 7 13 2 |
|--------------------------------------------------------------------------|
11. | 20 2002 1 20 8 20 9.5 27 3 |
12. | 20 2002 3 40 10 40 8.5 27 3 |
13. | 20 2002 2 30 9 30 9 27 3 |
|--------------------------------------------------------------------------|
14. | 30 2000 2 50 . 50 . 0 0 |
|--------------------------------------------------------------------------|
15. | 30 2001 2 70 50 70 . 50 1 |
16. | 30 2001 1 60 . 60 50 50 1 |
|--------------------------------------------------------------------------|
17. | 30 2002 2 90 70 90 60 130 2 |
18. | 30 2002 1 80 60 80 70 130 2 |
+--------------------------------------------------------------------------+
My approach breaks it down:
1. Calculate the previous price for the same firm and market. (#1 could also be done by declaring a (firm, market) pair a panel.)
2. The mean of other values (here previous prices) in the same market and year is (sum of all values MINUS this value) divided by (number of values MINUS 1).
3. #2 needs a modification: if this price is missing, you need to subtract 0 from both numerator and denominator. Stata's normal rules would render sum MINUS missing as missing, but this firm's previous price might be unknown, yet others in the same market might have known prices.
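The panel route mentioned for #1 could be sketched as follows, using only market 20 from the example. The variable name panel is made up; the remaining lines repeat the calculation from the code above so that the sketch runs on its own.

```stata
* Sketch: lagged price via a declared panel (hypothetical variable: panel)
clear
input firm year market price
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
end

* Treat each (firm, market) pair as one panel
egen panel = group(firm market)
xtset panel year
* L.price is the previous year's price, missing when that year is absent
gen Lprice = L.price

* Mean of the other firms' lagged prices within market and year
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
    / (count - cond(missing(Lprice), 0, 1))
sort market year firm
list market year firm Lprice mean_others, sepby(market year)
```

For firm 1 in 2002 the lagged prices in the market are 8, 9 and 10, so the result is (27 - 8) / (3 - 1) = 9.5, matching the listing above.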
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
EDIT: Another solution (2 lines) using rangestat (must be installed using ssc inst rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat (mean) Lprice, interval(year 0 0) by(market) excludeself

How to backfill data

I have data that looks something like:
n year y
1 2000
1 2000
1 2001
1 2002 6
1 2002 6
1 2003 9
2 2000
2 2000
2 2001
2 2002 1
2 2002 9
2 2003 4
3 2000
3 2001
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
How can I fill in the y value for all years before 2002 with the y value corresponding to the first observation of 2002 - and do this by n?
For example, for n==2, the first y value of year==2002 is 1. Thus, I would want to fill in the three y values of years 2000 (2) and 2001 (1) with 1. The new dataset would be:
n year y
1 2000 6
1 2000 6
1 2001 6
1 2002 6
1 2002 6
1 2003 9
2 2000 1
2 2000 1
2 2001 1
2 2002 1
2 2002 9
2 2003 4
3 2000 3
3 2001 3
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
Note that the years before 2002 for n==4 did not get filled in because the first observation where year==2002 is blank.
I think that a solution may be along the lines of:
bysort n: gen temp = y[1] if year==2002
replace y = temp if year<2002
drop temp
But I am not sure about the first line.
One (perhaps inelegant) solution:
sort n year, stable // [1]
gen y2 = y
by n year: gen _y = y2[1] if year == 2002 // [2]
egen _y2 = max(_y), by(n) // [3]
replace y2 = _y2 if year < 2002 // [4]
drop _*
li, sepby(n) noobs
yielding:
+-------------------+
| n year y y2 |
|-------------------|
| 1 2000 . 6 |
| 1 2000 . 6 |
| 1 2001 . 6 |
| 1 2002 6 6 |
| 1 2002 6 6 |
| 1 2003 9 9 |
|-------------------|
| 2 2000 . 1 |
| 2 2000 . 1 |
| 2 2001 . 1 |
| 2 2002 1 1 |
| 2 2002 9 9 |
| 2 2003 4 4 |
|-------------------|
| 3 2000 . 3 |
| 3 2001 . 3 |
| 3 2002 3 3 |
| 3 2002 3 3 |
| 3 2003 5 5 |
| 3 2003 5 5 |
|-------------------|
| 4 1999 . . |
| 4 2000 . . |
| 4 2001 . . |
| 4 2002 . . |
| 4 2002 4 4 |
+-------------------+
Notes:
[1] The stable option preserves the ordering of y.
[2] Sets _y to the first value of y2 within each n-year group, but only for observations where year == 2002. Note that you need by n year: with by n alone, y2[1] would be the first observation of the whole n group, whatever its year.
[3] Fills in _y across the n group.
[4] Replaces y2 for years earlier than 2002.
mipolate from SSC offers "backward" interpolation, as follows:
. ssc inst mipolate
. bysort n: mipolate y year, gen(y2) backward
. l
+-------------------+
| n year y y2 |
|-------------------|
1. | 1 2000 . 6 |
2. | 1 2000 . 6 |
3. | 1 2001 . 6 |
4. | 1 2002 6 6 |
5. | 1 2002 6 6 |
|-------------------|
6. | 1 2003 9 9 |
7. | 2 2000 . 5 |
8. | 2 2000 . 5 |
9. | 2 2001 . 5 |
10. | 2 2002 1 5 |
|-------------------|
11. | 2 2002 9 5 |
12. | 2 2003 4 4 |
13. | 3 2000 . 3 |
14. | 3 2001 . 3 |
15. | 3 2002 3 3 |
|-------------------|
16. | 3 2002 3 3 |
17. | 3 2003 5 5 |
18. | 3 2003 5 5 |
19. | 4 1999 . 4 |
20. | 4 2000 . 4 |
|-------------------|
21. | 4 2001 . 4 |
22. | 4 2002 . 4 |
23. | 4 2002 4 4 |
+-------------------+
I mention this because it may be of interest to others interested in the question. A key here is that multiple observations for the same identifier and year are averaged first, which is not what you want.
Your particular version of the question is highly fragile because somehow you know that the first value of several is the one to use, but nothing in the data you show us flags which or why. Sort the data on n year and which of various duplicates comes first may well change! This is a dangerous situation for data management.
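One way to reduce that fragility, assuming the order of the data as first read in really is the meaningful one, is to record it in a variable before any sorting. A minimal sketch on a cut-down copy of the example; the names obsno, first2002 and fill are made up for illustration.

```stata
* Sketch: backfill with an explicit order variable (hypothetical names)
clear
input n year y
2 2000 .
2 2000 .
2 2001 .
2 2002 1
2 2002 9
2 2003 4
4 2001 .
4 2002 .
4 2002 4
end

* Record the current order so that "first" survives any later sort
gen long obsno = _n
* First y within each n-year group, kept only for 2002
bysort n year (obsno) : gen first2002 = y[1] if year == 2002
* Spread that value to every observation of the same n
egen fill = max(first2002), by(n)
* Backfill earlier years; leave 2002 and later untouched
gen y2 = cond(year < 2002, fill, y)
sort obsno
list, sepby(n)
```

For n == 4 the first 2002 value is missing, so fill is missing and the earlier years stay blank, as the question requires.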

How to copy value from previous group into the next group in Stata

I think this might be simple but haven't found a way to do it in Stata so I'd really appreciate if someone could help me out. I'm trying to copy values from one group to the other, with the restriction that the values must be from the previous year. I think the example and the results I'd like to have may make it clearer:
I have data that looks like this:
year group_id educ place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
I have different groups, identified by "group_id" and different years ("year"), and I want, for example, for groups to get the educ value from the previous year. But I don't want 1990 to get the value from 1994. Is this possible? My data would hopefully end up looking like this:
year group_id educ place prev_educ
1990 1 6 a .
1990 1 6 b .
1992 2 2 c 6
1992 2 2 d 6
1994 3 11 e 2
1994 3 11 f 2
1990 4 10 g .
1990 4 10 h .
1992 5 5 i 10
1992 5 5 j 10
1994 6 7 k 5
1994 6 7 l 5
I tried variations of:
gen prev_educ=.
bysort group_id: replace prev_educ=educ[_N -1] if group_id[_n]!=group_id[_n-1]
which is clearly not what I want.
This is a very dangerous data structure, as sorting without extreme care could destroy its integrity. It appears that only the order of year tells you that 1, 2, 3 belong together and 4, 5, 6 belong together.
This reproduces your example:
clear
input year group_id educ str1 place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
end
gen prev_educ = .
replace prev_educ = educ[_n-1] if year > year[_n-1]
replace prev_educ = prev_educ[_n-1] if group_id == group_id[_n-1] & missing(prev_educ)
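A sketch of one way to make the structure safer: while the data are still in their original order, store which groups belong together in a variable, then work within it. The names obsno, chain and prev_educ2 are made up, and the sketch assumes that a fall in year always marks the start of a new set of groups.

```stata
* Sketch: make the implicit grouping explicit (hypothetical names)
clear
input year group_id educ str1 place
1990 1 6 a
1990 1 6 b
1992 2 2 c
1992 2 2 d
1994 3 11 e
1994 3 11 f
1990 4 10 g
1990 4 10 h
1992 5 5 i
1992 5 5 j
1994 6 7 k
1994 6 7 l
end

* Keep the original order recoverable
gen long obsno = _n
* A new chain starts whenever year falls; in observation 1, year[_n-1]
* is missing, which Stata treats as larger than any number, so chain starts at 1
gen chain = sum(year < year[_n-1])

* Within a chain, take educ from the last observation of the previous year
bysort chain (year obsno) : gen prev_educ2 = educ[_n-1] if year > year[_n-1]
* Copy that value to the remaining observations of the same year
by chain : replace prev_educ2 = prev_educ2[_n-1] if year == year[_n-1]
sort obsno
list, sepby(chain)
```

Once chain exists, the data can be sorted freely without losing track of which groups follow one another.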