In my data, I have the following variables: household ID, ID of each person within the household, father ID (who the father is), and years of education. So person 3 in house 23, for example, might say that person 1 is his or her father, while persons 6, 7 and 8, also in house 23, say that person 9 is their father. This is likely a joint family.
So I can't make a new column eduF in the usual way: for person 3 and persons 6/7/8 in the same household the father is different, so eduF varies even within a household. However, I need this new column eduF to give, for each member of the family, the education level of the person they list as their father.
I think this requires forvalues or foreach loops, but I am not sure what the code would be!
In the image of the sample, 'father i' and 'father n' mean that the father is dead or info not available.
key pid fathID yearsEDU
282 10 fath n 13
282 9 1 10
282 8 4
282 7 4 12
282 6 4 14
282 5 fath n 10
282 4 1 9
282 3 1 8
282 2 fath i
282 1 fath i 4
283 4 1 4
283 3 1 6
283 2 fath i 14
283 1 fath i 17
In the example given, the values of xpers run from 1 upwards in each household. (If that's not true, it can be arranged.)
There is a singular lack of information here about which variables are numeric, which are numeric with value labels and which are string.
But assuming that q0111 is string, we can get the numeric only values for fathers' identifiers by
gen fatherid = real(q0111)
Then it is
bysort xhhkey (xpers) : gen father_educ = q0407_a[fatherid]
The key idea here is that under the aegis of by: subscripts are interpreted within groups, and therefore the values of fatherid are precisely the subscripts we need.
As @Metrics asserted, no loop is necessary.
list xhhkey xpers q0111 q0407_a fatherid father_educ, sep(0)
+-----------------------------------------------------------+
| xhhkey xpers q0111 q0407_a fatherid father~c |
|-----------------------------------------------------------|
1. | 282 1 father i 13 . . |
2. | 282 2 father i 10 . . |
3. | 282 3 1 . 1 13 |
4. | 282 4 1 12 1 13 |
5. | 282 5 father n 14 . . |
6. | 282 6 4 10 4 12 |
7. | 282 7 4 9 4 12 |
8. | 282 8 4 8 4 12 |
9. | 282 9 1 . 1 13 |
10. | 282 10 father n 4 . . |
11. | 283 1 father i 4 . . |
12. | 283 2 father i 6 . . |
13. | 283 3 1 14 1 4 |
14. | 283 4 1 17 1 4 |
15. | 284 1 father i 5 . . |
16. | 284 2 father n . . . |
17. | 284 3 1 1 1 5 |
18. | 284 4 father i 4 . . |
19. | 284 5 father n 8 . . |
20. | 284 6 father i 7 . . |
21. | 284 7 father n 18 . . |
22. | 284 8 6 2 6 7 |
23. | 284 9 6 . 6 7 |
24. | 284 10 father i 9 . . |
+-----------------------------------------------------------+
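The subscript trick depends on the person identifier matching the observation number within each household. A minimal sketch of checking that assumption on made-up data follows; the variable names hh, pid, fath and educ are placeholders for this example, not the names in the real dataset.

```stata
clear
input hh pid str8 fath educ
1 1 "father i" 12
1 2 "1"         8
1 3 "2"         5
2 1 "father n" 16
2 2 "1"        10
end

* non-numeric strings such as "father i" become missing
gen fatherid = real(fath)

* stop with an error if pid is not 1, 2, ... within a household
bysort hh (pid) : assert pid == _n

* under by:, educ[fatherid] points at the father's row in the same household
bysort hh (pid) : gen father_educ = educ[fatherid]

list, sepby(hh)
```

Here person 3 in household 1 gets 8, the education of person 2, and anyone whose father is not identified gets missing because a missing subscript returns missing.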
By the way, the terminology of columns is alien to Stata outside the context of matrices: they are variables.
There is a moderately detailed tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004. Even experienced Stata users often underestimate what you can do with by:.
I have a dataset with only the variable value (new_var below shows the result I want):
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
How can I generate a new variable new_var containing a repeating sequence of the first three observations in value?
Many ways to do it: here are two:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
egen index = seq(), to(3)
generate wanted = value[index]
generate direct = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 3, 5))
list, sep(3)
+-------------------------------------------+
| value new_var index wanted direct |
|-------------------------------------------|
1. | 1 1 1 1 1 |
2. | 3 3 2 3 3 |
3. | 5 5 3 5 5 |
|-------------------------------------------|
4. | 30 1 1 1 1 |
5. | 40 3 2 3 3 |
6. | 50 5 3 5 5 |
|-------------------------------------------|
7. | 11 1 1 1 1 |
8. | 12 3 2 3 3 |
9. | 13 5 3 5 5 |
+-------------------------------------------+
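A third variant, as a sketch: compute the subscript directly with mod(), so that no separate index variable is needed. The name wanted2 is just illustrative.

```stata
clear
input value
1
3
5
30
40
50
11
12
13
end

* mod(_n - 1, 3) + 1 cycles through 1, 2, 3, 1, 2, 3, ...
generate wanted2 = value[mod(_n - 1, 3) + 1]

list, sep(3)
```

Unlike the cond() version, this does not hard-code the values 1, 3 and 5, so it keeps working if the first three observations change.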
Consider the following data example:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 22342 64
3 23452 23
1 23432 22
end
How can I keep all the records for the IDs that contain the code 15342 in any row?
This is a follow-up question to a previous one of mine: Keeping all the records for specific IDs
The following works for me:
clear
input id code cost
1 15342 18
2 15366 12
1 16786 32
2 15342 12
3 12345 45
4 23453 345
1 34234 23
2 22223 12
4 15342 64
3 23452 23
1 23432 22
end
bysort id (code): egen tag = total(inlist(code, 15342))
keep if tag
Results:
list, sepby(id)
+-------------------------+
| id code cost tag |
|-------------------------|
1. | 1 15342 18 1 |
2. | 1 16786 32 1 |
3. | 1 23432 22 1 |
4. | 1 34234 23 1 |
|-------------------------|
5. | 2 15342 12 1 |
6. | 2 15366 12 1 |
7. | 2 22223 12 1 |
|-------------------------|
8. | 4 15342 64 1 |
9. | 4 23453 345 1 |
+-------------------------+
Note that I changed the data example slightly for better illustration.
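An alternative sketch of the same idea uses egen's max() on the true-or-false expression, which gives an indicator equal to 1 for every record of an ID with at least one match. The variable name anymatch and the small dataset are made up for this example.

```stata
clear
input id code cost
1 15342 18
1 16786 32
2 15366 12
3 12345 45
3 15342 23
end

* (code == 15342) is 1 or 0; its maximum within id is 1 if any row matches
bysort id : egen anymatch = max(code == 15342)
keep if anymatch

list, sepby(id)
```

This keeps all records for IDs 1 and 3 and drops ID 2. Whether a 0/1 indicator or a count of matches is more useful depends on what happens next.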
I am trying to create instruments from a three-dimensional panel dataset, as included below:
input firm year market price comp_avg
1 2000 10 1 .
3 2000 10 2 .
3 2001 10 3 .
1 2002 10 4 .
3 2002 10 5 .
1 2000 20 6 .
3 2000 20 7 .
1 2001 20 8 .
2 2001 20 9 .
3 2001 20 10 .
1 2002 20 20 .
2 2002 20 30 .
3 2002 20 40 .
2 2000 30 50 .
1 2001 30 60 .
2 2001 30 70 .
1 2002 30 80 .
2 2002 30 90 .
end
The instrument I am trying to create is the lagged (year-1) average price of a firm's competitors (those in the same market) in each market the firm operates in in a given year.
At the moment, I have some code that does the job, but I am hoping that I am missing something and can do this in a more clear or efficient way.
Here is the code:
// for each firm
qui levelsof firm, local(firms)
qui foreach f in `firms' {
    // find all years for that firm
    levelsof year if firm == `f', local(years)
    foreach y in `years' {
        // skip first year (because there is no lagged data)
        if `y' == 2000 {
            continue
        }
        // find all markets in that year
        levelsof market if firm == `f' & year == `y', local(mkts)
        local L1 = `y'-1
        foreach m in `mkts' {
            // get average of all competitors in that market in the year prior
            gen temp = firm != `f' & year == `L1' & market == `m'
            su price if temp
            replace comp_avg = r(mean) if firm == `f' & market == `m' & year == `y'
            drop temp
        }
    }
}
The data I am working with are reasonably large (~1 million obs) so the faster the better.
clear
input firm year market price
1 2000 10 1
3 2000 10 2
3 2001 10 3
1 2002 10 4
3 2002 10 5
1 2000 20 6
3 2000 20 7
1 2001 20 8
2 2001 20 9
3 2001 20 10
1 2002 20 20
2 2002 20 30
3 2002 20 40
2 2000 30 50
1 2001 30 60
2 2001 30 70
1 2002 30 80
2 2002 30 90
end
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
bysort market year : egen total = total(Lprice)
bysort market year : egen count = count(Lprice)
gen mean_others = (total - cond(missing(Lprice), 0, Lprice)) ///
/ (count - cond(missing(Lprice), 0, 1))
sort market year
list market year firm price Lprice mean_others total count, sepby(market year)
+------------------------------------------------------------------+
| market year firm price Lprice mean_o~s total count |
|------------------------------------------------------------------|
1. | 10 2000 1 1 . . 0 0 |
2. | 10 2000 3 2 . . 0 0 |
|------------------------------------------------------------------|
3. | 10 2001 3 3 2 . 2 1 |
|------------------------------------------------------------------|
4. | 10 2002 1 4 . 3 3 1 |
5. | 10 2002 3 5 3 . 3 1 |
|------------------------------------------------------------------|
6. | 20 2000 3 7 . . 0 0 |
7. | 20 2000 1 6 . . 0 0 |
|------------------------------------------------------------------|
8. | 20 2001 2 9 . 6.5 13 2 |
9. | 20 2001 3 10 7 6 13 2 |
10. | 20 2001 1 8 6 7 13 2 |
|------------------------------------------------------------------|
11. | 20 2002 1 20 8 9.5 27 3 |
12. | 20 2002 3 40 10 8.5 27 3 |
13. | 20 2002 2 30 9 9 27 3 |
|------------------------------------------------------------------|
14. | 30 2000 2 50 . . 0 0 |
|------------------------------------------------------------------|
15. | 30 2001 2 70 50 . 50 1 |
16. | 30 2001 1 60 . 50 50 1 |
|------------------------------------------------------------------|
17. | 30 2002 2 90 70 60 130 2 |
18. | 30 2002 1 80 60 70 130 2 |
+------------------------------------------------------------------+
My approach breaks it down:
1. Calculate the previous price for the same firm and market. (#1 could also be done by declaring each (firm, market) pair a panel.)
2. The mean of the other values (here previous prices) in the same market and year is (sum of all values MINUS this firm's value) divided by (number of values MINUS 1).
3. #2 needs a modification: if this firm's previous price is missing, you need to subtract 0 from both numerator and denominator. Stata's normal rules would render sum MINUS missing as missing, but this firm's previous price might be unknown while others in the same market have known prices.
Note: There are small ways of speeding up your code, but this should be faster (so long as it is correct).
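The panel route mentioned for the first step can be sketched as follows. The identifier name panel and the tiny dataset are assumptions made for illustration; the assert at the end only checks that the lag operator and the by: subscript agree on this example.

```stata
clear
input firm year market price
1 2000 10 1
1 2001 10 2
1 2003 10 4
2 2000 10 5
2 2001 10 6
end

* one panel per (firm, market) pair
egen panel = group(firm market)
xtset panel year

* L.price is missing whenever the previous year is absent from the panel
gen Lprice = L.price

* same calculation with by: and an explicit gap check
bysort firm market (year) : gen Lprice_by = price[_n-1] if year - year[_n-1] == 1

assert Lprice == Lprice_by
```

The lag operator handles gaps in year automatically, which is why the by: version needs the explicit condition on year - year[_n-1].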
EDIT: Another solution (2 lines) using rangestat (must be installed using ssc inst rangestat):
bysort firm market (year) : gen Lprice = price[_n-1] if year - year[_n-1] == 1
rangestat Lprice, interval(year 0 0) by(market) excludeself
I have a file that records the ratings that teacher X gives to teacher Y and the date each rating occurs:
clear
input rating_id RatingTeacher RatedTeacher Rating str8 Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they make the rating and the cumulative score. The result would look like this.
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist, but there is no reason for lower standards to apply. See the Statalist FAQ.
I rarely see even programmers be precise about what they mean by "efficient": whether it means fewer lines of code, less use of memory, more speed, something else, or is just some all-purpose term of praise. The code below loops over observations, which can certainly be slow for large datasets. More in this paper.
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
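The loop above returns only the number of earlier ratings. A sketch of extending it to the cumulative score as well, using the sum that summarize leaves behind, could look like this; the variable names TimesRated and CumulativeRating are taken from the question, and the cond() guard is a cautious choice to force 0 when there are no earlier ratings.

```stata
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end

gen Date = daily(SDate, "MDY")
sort Date rating_id

gen TimesRated = .
gen CumulativeRating = .

quietly forval i = 1/`=_N' {
    * earlier ratings received by the teacher who is doing the rating now
    summarize Rating if Date < Date[`i'] & RatedTeacher == RatingTeacher[`i'], meanonly
    replace TimesRated = r(N) in `i'
    replace CumulativeRating = cond(r(N) > 0, r(sum), 0) in `i'
}

list, sep(0)
```

On this example the last observation gets 3 and 9, matching the result asked for in the question. It is still a loop over observations, so the same caution about speed on large datasets applies.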
The building block is that the rater and ratee are a pair. You can use egen's group() to give a unique ID to each rater ratee pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n
I have data that looks something like:
n year y
1 2000
1 2000
1 2001
1 2002 6
1 2002 6
1 2003 9
2 2000
2 2000
2 2001
2 2002 1
2 2002 9
2 2003 4
3 2000
3 2001
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
How can I fill in the y value for all years before 2002 with the y value corresponding to the first observation of 2002 - and do this by n?
For example, for n==2, the first y value of year==2002 is 1. Thus, I would want to fill in the three y values of years 2000 (2) and 2001 (1) with 1. The new dataset would be:
n year y
1 2000 6
1 2000 6
1 2001 6
1 2002 6
1 2002 6
1 2003 9
2 2000 1
2 2000 1
2 2001 1
2 2002 1
2 2002 9
2 2003 4
3 2000 3
3 2001 3
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
Note that the years before 2002 for n==4 did not get filled in because the first observation where year==2002 is blank.
I think that a solution may be along the lines of:
bysort n: gen temp = y[1] if year==2002
replace y = temp if year<2002
drop temp
But I am not sure about the first line.
One (perhaps inelegant) solution:
sort n year, stable // [1]
gen y2 = y
by n year: gen _y = y2[1] if year == 2002 // [2]
egen _y2 = max(_y), by(n) // [3]
replace y2 = _y2 if year < 2002 // [4]
drop _*
li, sepby(n) noobs
yielding:
+-------------------+
| n year y y2 |
|-------------------|
| 1 2000 . 6 |
| 1 2000 . 6 |
| 1 2001 . 6 |
| 1 2002 6 6 |
| 1 2002 6 6 |
| 1 2003 9 9 |
|-------------------|
| 2 2000 . 1 |
| 2 2000 . 1 |
| 2 2001 . 1 |
| 2 2002 1 1 |
| 2 2002 9 9 |
| 2 2003 4 4 |
|-------------------|
| 3 2000 . 3 |
| 3 2001 . 3 |
| 3 2002 3 3 |
| 3 2002 3 3 |
| 3 2003 5 5 |
| 3 2003 5 5 |
|-------------------|
| 4 1999 . . |
| 4 2000 . . |
| 4 2001 . . |
| 4 2002 . . |
| 4 2002 4 4 |
+-------------------+
Notes:
[1] The stable option keeps observations with the same n and year in their existing order, so the first 2002 value stays first.
[2] Generates _y equal to the first value within each n and year group, for year == 2002 only. Note that you need by n year: with by n alone, y2[1] would be the first observation of the whole n group, whatever its year, even though it would be assigned only to observations where year == 2002.
[3] Fills in _y across the n group.
[4] Replaces y2 for years earlier than 2002.
mipolate from SSC offers "backward" interpolation, as follows:
. ssc inst mipolate
. bysort n: mipolate y year, gen(y2) backward
. l
+-------------------+
| n year y y2 |
|-------------------|
1. | 1 2000 . 6 |
2. | 1 2000 . 6 |
3. | 1 2001 . 6 |
4. | 1 2002 6 6 |
5. | 1 2002 6 6 |
|-------------------|
6. | 1 2003 9 9 |
7. | 2 2000 . 5 |
8. | 2 2000 . 5 |
9. | 2 2001 . 5 |
10. | 2 2002 1 5 |
|-------------------|
11. | 2 2002 9 5 |
12. | 2 2003 4 4 |
13. | 3 2000 . 3 |
14. | 3 2001 . 3 |
15. | 3 2002 3 3 |
|-------------------|
16. | 3 2002 3 3 |
17. | 3 2003 5 5 |
18. | 3 2003 5 5 |
19. | 4 1999 . 4 |
20. | 4 2000 . 4 |
|-------------------|
21. | 4 2001 . 4 |
22. | 4 2002 . 4 |
23. | 4 2002 4 4 |
+-------------------+
I mention this because it may be of interest to others interested in the question. A key here is that multiple observations for the same identifier and year are averaged first, which is not what you want.
Your particular version of the question is highly fragile because somehow you know that the first value of several is the one to use, but nothing in the data you show us flags which or why. Sort the data on n year and which of various duplicates comes first may well change! This is a dangerous situation for data management.
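One way to reduce that fragility, sketched here under the assumption that the current order of observations is the one you trust, is to record it in a variable before any sorting. The names order, first2002 and fill are made up for this example.

```stata
clear
input n year y
2 2000 .
2 2000 .
2 2001 .
2 2002 1
2 2002 9
2 2003 4
4 2001 .
4 2002 .
4 2002 4
end

* freeze the current order so that "first" is reproducible after any sort
gen long order = _n

* first 2002 value within each n, as defined by the recorded order
bysort n year (order) : gen first2002 = y[1] if year == 2002

* spread that value to all observations with the same n
egen fill = max(first2002), by(n)

replace y = fill if year < 2002 & missing(y)
drop first2002 fill

sort n order
list, sepby(n)
```

With order stored in the dataset, sorting on n year (order) always puts the same duplicate first. For n == 4 nothing is filled in, because the first 2002 value is itself missing.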