How to backfill data (Stata)

I have data that looks something like:
n year y
1 2000
1 2000
1 2001
1 2002 6
1 2002 6
1 2003 9
2 2000
2 2000
2 2001
2 2002 1
2 2002 9
2 2003 4
3 2000
3 2001
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
How can I fill in the y value for all years before 2002 with the y value of the first observation of 2002, separately for each n?
For example, for n==2, the first y value of year==2002 is 1. Thus, I would want to fill in the three missing y values (two for 2000 and one for 2001) with 1. The new dataset would be:
n year y
1 2000 6
1 2000 6
1 2001 6
1 2002 6
1 2002 6
1 2003 9
2 2000 1
2 2000 1
2 2001 1
2 2002 1
2 2002 9
2 2003 4
3 2000 3
3 2001 3
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
Note that the years before 2002 for n==4 did not get filled in because the first observation where year==2002 is blank.
I think that a solution may be along the lines of:
bysort n: gen temp = y[1] if year==2002
replace y = temp if year<2002
drop temp
But I am not sure about the first line.

One (perhaps inelegant) solution:
sort n year, stable // [1]
gen y2 = y
by n year: gen _y = y2[1] if year == 2002 // [2]
egen _y2 = max(_y), by(n) // [3]
replace y2 = _y2 if year < 2002 // [4]
drop _*
li, sepby(n) noobs
yielding:
+-------------------+
| n year y y2 |
|-------------------|
| 1 2000 . 6 |
| 1 2000 . 6 |
| 1 2001 . 6 |
| 1 2002 6 6 |
| 1 2002 6 6 |
| 1 2003 9 9 |
|-------------------|
| 2 2000 . 1 |
| 2 2000 . 1 |
| 2 2001 . 1 |
| 2 2002 1 1 |
| 2 2002 9 9 |
| 2 2003 4 4 |
|-------------------|
| 3 2000 . 3 |
| 3 2001 . 3 |
| 3 2002 3 3 |
| 3 2002 3 3 |
| 3 2003 5 5 |
| 3 2003 5 5 |
|-------------------|
| 4 1999 . . |
| 4 2000 . . |
| 4 2001 . . |
| 4 2002 . . |
| 4 2002 4 4 |
+-------------------+
Notes:
[1] The stable option preserves the ordering of y.
[2] Generates _y equal to y2 in the first observation of each n-year group, but only where year == 2002. You need by n year here: with by n alone, y2[1] would be the first observation of the whole n group, whatever its year, merely assigned to the observations where year == 2002.
[3] Fills in _y across the n group.
[4] Replaces y2 for years earlier than 2002.

mipolate from SSC offers "backward" interpolation, as follows:
. ssc inst mipolate
. bysort n: mipolate y year, gen(y2) backward
. l
+-------------------+
| n year y y2 |
|-------------------|
1. | 1 2000 . 6 |
2. | 1 2000 . 6 |
3. | 1 2001 . 6 |
4. | 1 2002 6 6 |
5. | 1 2002 6 6 |
|-------------------|
6. | 1 2003 9 9 |
7. | 2 2000 . 5 |
8. | 2 2000 . 5 |
9. | 2 2001 . 5 |
10. | 2 2002 1 5 |
|-------------------|
11. | 2 2002 9 5 |
12. | 2 2003 4 4 |
13. | 3 2000 . 3 |
14. | 3 2001 . 3 |
15. | 3 2002 3 3 |
|-------------------|
16. | 3 2002 3 3 |
17. | 3 2003 5 5 |
18. | 3 2003 5 5 |
19. | 4 1999 . 4 |
20. | 4 2000 . 4 |
|-------------------|
21. | 4 2001 . 4 |
22. | 4 2002 . 4 |
23. | 4 2002 4 4 |
+-------------------+
I mention this because it may interest others who come to this question. The key point is that multiple observations for the same identifier and year are averaged first, which is not what you want: for n==2 the 2002 values 1 and 9 become 5, and for n==4 the earlier years are filled with 4.
Your particular version of the question is highly fragile because somehow you know that the first value of several is the one to use, but nothing in the data you show us flags which or why. If you sort the data on n year, which of the duplicates comes first may well change! This is a dangerous situation for data management.
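One way to make "first" reproducible is to record the current order in a variable before any sorting. This is only a sketch, and it assumes that the order in which the data sit in memory right now is the order you trust. The names obsno, first2002, fill and y_filled are illustrative and do not come from the question.

```stata
* Assumption: the current order in memory is the intended one
gen long obsno = _n

* With obsno as tie-breaker, "first 2002 observation" is well defined
bysort n year (obsno): gen first2002 = y[1] if year == 2002

* Spread that value over the whole n group, then backfill earlier years
egen fill = max(first2002), by(n)
gen y_filled = cond(year < 2002, fill, y)
drop first2002 fill
```

Because obsno is stored in the dataset, later sorts cannot silently change which duplicate counts as first. For n==4 the first 2002 value is missing, so the earlier years stay missing, as in the desired output.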

Related

How to rank observations in panel data?

I have a panel dataset in Stata with several countries and each country containing groups. I would like to rank the groups by country, according to the variable var1.
The structure of my dataset is as follows (the rank column is what I would like to achieve). Note that var1 is indeed constant within groups (it is just the within group average of another variable).
--country--|--groupId--|---time----|---var1----|---rank---
1 | 1 | 1 | 50 | 3
1 | 1 | 2 | 50 | 3
1 | 1 | 3 | 50 | 3
1 | 2 | 1 | 90 | 1
1 | 2 | 2 | 90 | 1
1 | 2 | 3 | 90 | 1
1 | 3 | 1 | 60 | 2
1 | 3 | 2 | 60 | 2
1 | 3 | 3 | 60 | 2
2 | 4 | 1 | 15 | 2
2 | 4 | 2 | 15 | 2
2 | 4 | 3 | 15 | 2
2 | 5 | 1 | 10 | 3
2 | 5 | 2 | 10 | 3
2 | 5 | 3 | 10 | 3
2 | 6 | 1 | 80 | 1
2 | 6 | 2 | 80 | 1
2 | 6 | 3 | 80 | 1
Among the options I have tried is:
sort country groupId
by country (groupId): egen rank = rank(var1)
However, I cannot achieve the desired result.

Thanks for the data example. There are two problems with your code. One is that as you want to rank from highest to lowest, you need to negate the argument to rank(). The second is that given the repetitions, you need to rank on one time only and then copy those ranks to other times.
This works with your data example, here edited to be input code. (See also the Stata tag wiki for that principle.)
clear
input country groupId time var1 rank
1 1 1 50 3
1 1 2 50 3
1 1 3 50 3
1 2 1 90 1
1 2 2 90 1
1 2 3 90 1
1 3 1 60 2
1 3 2 60 2
1 3 3 60 2
2 4 1 15 2
2 4 2 15 2
2 4 3 15 2
2 5 1 10 3
2 5 2 10 3
2 5 3 10 3
2 6 1 80 1
2 6 2 80 1
2 6 3 80 1
end
bysort country : egen wanted = rank(-var1) if time == 1
bysort country groupId (time) : replace wanted = wanted[1]
assert rank == wanted
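If not every group has an observation with time == 1, the same idea can be sketched with egen's tag() function, which flags exactly one observation per group. This variant assumes the example data above are in memory; tag and wanted2 are illustrative names.

```stata
* Flag exactly one observation per country-group combination
egen tag = tag(country groupId)

* Rank the flagged observations within country, highest var1 first
egen wanted2 = rank(-var1) if tag, by(country)

* Missing values sort last, so the rank comes first in each group; copy it down
bysort country groupId (wanted2) : replace wanted2 = wanted2[1]

assert wanted2 == rank
```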

Generating variable by groups taking values of certain observations

I have a dataset containing only the variable value (new_var below shows the result I want):
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
How can I generate a new variable new_var containing a repeating sequence of the first three observations in value?

Many ways to do it: here are two:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
egen index = seq(), to(3)
generate wanted = value[index]
generate direct = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 3, 5))
list, sep(3)
+-------------------------------------------+
| value new_var index wanted direct |
|-------------------------------------------|
1. | 1 1 1 1 1 |
2. | 3 3 2 3 3 |
3. | 5 5 3 5 5 |
|-------------------------------------------|
4. | 30 1 1 1 1 |
5. | 40 3 2 3 3 |
6. | 50 5 3 5 5 |
|-------------------------------------------|
7. | 11 1 1 1 1 |
8. | 12 3 2 3 3 |
9. | 13 5 3 5 5 |
+-------------------------------------------+
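A third possibility, sketched here on the assumption that the example data above are in memory, skips the index variable and computes the subscript directly. The name wanted2 is illustrative.

```stata
* mod(_n - 1, 3) + 1 cycles 1, 2, 3, 1, 2, 3, ... down the dataset
generate wanted2 = value[mod(_n - 1, 3) + 1]

assert wanted2 == new_var
```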

Generate a variable equal to 1 if 1 is ever observed in panel data

I have the following data with person ID and whether they have insurance in each year:
ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
What I want is to add another column, which equals 1 if a person is ever insured. For example, Person 2 only had insurance in 2002 but it means he has had insurance at some point, so Ever_Ins should equal 1 in all years:
ID Year Insured Ever_Ins
1 2001 1 1
2 2001 0 1
3 2001 0 0
1 2002 1 1
2 2002 1 1
3 2002 0 0
1 2003 1 1
2 2003 0 1
3 2003 0 0
I cannot use egen Ever_Ins = max(Insured), by (ID) because Insured is not a dummy in the true data. It has values such as 9 for unknown.

Technique for "any" and "all" problems is documented in this FAQ. See also this paper for a more detailed discussion. Here is one way to do it.
clear
input ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
end
egen Ever_Ins = max(Insured == 1), by(ID)
sort ID Year
list , sepby(ID)
+--------------------------------+
| ID Year Insured Ever_Ins |
|--------------------------------|
1. | 1 2001 1 1 |
2. | 1 2002 1 1 |
3. | 1 2003 1 1 |
|--------------------------------|
4. | 2 2001 0 1 |
5. | 2 2002 1 1 |
6. | 2 2003 0 1 |
|--------------------------------|
7. | 3 2001 0 0 |
8. | 3 2002 0 0 |
9. | 3 2003 0 0 |
+--------------------------------+
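The matching "all" problem uses min() in place of max(). A minimal sketch, assuming the example data above are in memory; All_Ins is an illustrative name, not something the question asked for.

```stata
* 1 only if Insured == 1 in every year for that ID, otherwise 0
egen All_Ins = min(Insured == 1), by(ID)

* In the example data only ID 1 is insured in all three years
assert All_Ins == (ID == 1)
```

Note that Insured == 1 evaluates to 0 for any other value, including missing and codes such as 9, so those years count as "not insured" under this definition.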

Adding observations between rows

I would like to create new observations as follows:
A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
*3* 1 .
*3* 2 .
*3* 3 .
*3* 4 .
*3* 5 .
4 1 4
4 2 3
4 3 1
The new lines are indicated by asterisks.
How can I create new observations for variable A and B?

This is a simple expand:
clear
input A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
4 1 4
4 2 3
4 3 1
end
generate id = _n
expand 6 if id == 10
replace id = 11 if _n == _N
replace A = 3 if id == 10
replace C = . if id == 10
bysort id: replace B = cond(_n == 1, 1, B[_n-1]+1) if id == 10
Which will produce the desired output:
list, sepby(A)
+----------------+
| A B C id |
|----------------|
1. | 1 1 1 1 |
2. | 1 2 2 2 |
3. | 1 3 4 3 |
4. | 1 4 5 4 |
5. | 1 5 2 5 |
|----------------|
6. | 2 1 1 6 |
7. | 2 2 5 7 |
8. | 2 3 3 8 |
9. | 2 4 3 9 |
|----------------|
10. | 3 1 . 10 |
11. | 3 2 . 10 |
12. | 3 3 . 10 |
13. | 3 4 . 10 |
14. | 3 5 . 10 |
|----------------|
15. | 4 1 4 11 |
16. | 4 2 3 11 |
17. | 4 3 1 12 |
+----------------+

The code could be shorter.
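expand 2 if _n < 6 // copy the five A == 1 rows; the copies are added at the end
replace A = 3 if _n > _N - 5 // relabel the copies as A == 3
* B already runs from 1 to 5 in the copies, so this line is not needed:
*replace B = _n + 5 - _N if A == 3
replace C = . if A == 3
sort A B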

Save list of distinct values of a variable in another variable

I have data at the country-year-z level, where z is a categorical variable that can take (say) 10 different values (for each country-year). Each combination of country-year-z is unique in the dataset.
I would like to obtain a dataset at the country-year level, with a new (string) variable containing all distinct values of z.
For instance let's say I have the following data:
country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
I would like to get the following data:
country year z_distinct
A 2000 1
A 2001 1 2
A 2002 2 5
B 2001 7 8
B 2002 4 5 9
B 2003 3 4
B 2003 4

Here's another way to do it, perhaps more direct. If z is already a string variable the string() calls should both be omitted.
clear
input str1 country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
end
bysort country year (z) : gen values = string(z[1])
by country year : replace values = values[_n-1] + " " + string(z) if z != z[_n-1] & _n > 1
by country year : keep if _n == _N
drop z
list , sepby(country)
+-------------------------+
| country year values |
|-------------------------|
1. | A 2000 1 |
2. | A 2001 1 2 4 |
3. | A 2002 2 5 |
|-------------------------|
4. | B 2001 7 8 |
5. | B 2002 4 5 9 |
6. | B 2003 3 4 |
7. | B 2005 1 |
+-------------------------+

I think there may be some problems with your desired output given your input, but otherwise something like this should do it:
clear
input str1 country year z
"A" 2000 1
"A" 2001 1
"A" 2001 2
"A" 2001 4
"A" 2002 2
"A" 2002 5
"B" 2001 7
"B" 2001 8
"B" 2002 4
"B" 2002 5
"B" 2002 9
"B" 2003 3
"B" 2003 4
"B" 2005 1
end
gen z_distinct = ""
egen c_x_y = group(country year)
levelsof c_x_y, local(pairs)
foreach p of local pairs {
qui levelsof z if c_x_y == `p', clean separate(" ")
qui replace z_distinct = "`r(levels)'" if c_x_y==`p'
}
collapse (first) z_distinct, by(country year)
sort country year
The code loops over country-years, calculating the observed values of z using levelsof, and then collapses to get one row for each country-year.
