Adding observations between rows - stata

I would like to create new observations as follows:
A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
*3* 1 .
*3* 2 .
*3* 3 .
*3* 4 .
*3* 5 .
4 1 4
4 2 3
4 3 1
The new lines are indicated by asterisks.
How can I create new observations for variable A and B?

This is a simple expand:
clear
input A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
4 1 4
4 2 3
4 3 1
end
generate id = _n
expand 6 if id == 10
replace id = 11 if _n == _N
replace A = 3 if id == 10
replace C = . if id == 10
bysort id: replace B = cond(_n == 1, 1, B[_n-1]+1) if id == 10
Which will produce the desired output:
list, sepby(A)
+----------------+
| A B C id |
|----------------|
1. | 1 1 1 1 |
2. | 1 2 2 2 |
3. | 1 3 4 3 |
4. | 1 4 5 4 |
5. | 1 5 2 5 |
|----------------|
6. | 2 1 1 6 |
7. | 2 2 5 7 |
8. | 2 3 3 8 |
9. | 2 4 3 9 |
|----------------|
10. | 3 1 . 10 |
11. | 3 2 . 10 |
12. | 3 3 . 10 |
13. | 3 4 . 10 |
14. | 3 5 . 10 |
|----------------|
15. | 4 1 4 11 |
16. | 4 2 3 11 |
17. | 4 3 1 12 |
+----------------+

The code could be shorter.
expand 2 if _n < 6
replace A = 3 if _n > _N - 5
*replace B = _n + 5 - _N if A == 3
replace C = . if A == 3
sort A B

Related

How to rank observations in panel data?

I have a panel dataset in Stata with several countries and each country containing groups. I would like to rank the groups by country, according to the variable var1.
The structure of my dataset is as follows (the rank column is what I would like to achieve). Note that var1 is indeed constant within groups (it is just the within group average of another variable).
--country--|--groupId--|---time----|---var1----|---rank---
1 | 1 | 1 | 50 | 3
1 | 1 | 2 | 50 | 3
1 | 1 | 3 | 50 | 3
1 | 2 | 1 | 90 | 1
1 | 2 | 2 | 90 | 1
1 | 2 | 3 | 90 | 1
1 | 3 | 1 | 60 | 2
1 | 3 | 2 | 60 | 2
1 | 3 | 3 | 60 | 2
2 | 4 | 1 | 15 | 2
2 | 4 | 2 | 15 | 2
2 | 4 | 3 | 15 | 2
2 | 5 | 1 | 10 | 3
2 | 5 | 2 | 10 | 3
2 | 5 | 3 | 10 | 3
2 | 6 | 1 | 80 | 1
2 | 6 | 2 | 80 | 1
2 | 6 | 3 | 80 | 1
Among the options I have tried is:
sort country groupId
by country (groupId): egen rank = rank(var1)
However, I cannot achieve the desired result.
Thanks for the data example. There are two problems with your code. One is that as you want to rank from highest to lowest, you need to negate the argument to rank(). The second is that given the repetitions, you need to rank on one time only and then copy those ranks to other times.
This works with your data example, here edited to be input code. (See also the Stata tag wiki for that principle.)
clear
input country groupId time var1 rank
1 1 1 50 3
1 1 2 50 3
1 1 3 50 3
1 2 1 90 1
1 2 2 90 1
1 2 3 90 1
1 3 1 60 2
1 3 2 60 2
1 3 3 60 2
2 4 1 15 2
2 4 2 15 2
2 4 3 15 2
2 5 1 10 3
2 5 2 10 3
2 5 3 10 3
2 6 1 80 1
2 6 2 80 1
2 6 3 80 1
end
bysort country : egen wanted = rank(-var) if time == 1
bysort country groupId (time) : replace wanted = wanted[1]
assert rank == wanted

Generating variable by groups taking values of certain observations

I have a dataset with only variable values:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
How can I generate a new variable new_var containing a repeating sequence of the first three observations in value?
Many ways to do it: here are two:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
egen index = seq(), to(3)
generate wanted = value[index]
generate direct = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 3, 5))
list, sep(3)
+-------------------------------------------+
| value new_var index wanted direct |
|-------------------------------------------|
1. | 1 1 1 1 1 |
2. | 3 3 2 3 3 |
3. | 5 5 3 5 5 |
|-------------------------------------------|
4. | 30 1 1 1 1 |
5. | 40 3 2 3 3 |
6. | 50 5 3 5 5 |
|-------------------------------------------|
7. | 11 1 1 1 1 |
8. | 12 3 2 3 3 |
9. | 13 5 3 5 5 |
+-------------------------------------------+

Check whether a range contains specific values

Var1 is given. Var2 should take value 1 if the Observation or one of the previous 5 observations is a missing value or 0. What is the Syntax for Var2?
I know how to do it with a lot of if Statements. But when I need to do it for the previous 50 observations that gets too inconvenient.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
The question is similar to your previous --Finding the second smallest value -- which you should quote. So is this answer. rangestat is from SSC.
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+

Creating complete data in stata

I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot that has not work. This is my latest.
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}
The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+

How to backfill data

I have data that looks something like:
n year y
1 2000
1 2000
1 2001
1 2002 6
1 2002 6
1 2003 9
2 2000
2 2000
2 2001
2 2002 1
2 2002 9
2 2003 4
3 2000
3 2001
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
How can I fill in the y value for all years before 2002 with the y value corresponding to the ~first~ observation of 2002 - and do this by n?
For example, for n==2, the first y value of year==2002 is 1. Thus, I would want to fill in the three y values of years 2000 (2) and 2001 (1) with 1. The new dataset would be:
n year y
1 2000 6
1 2000 6
1 2001 6
1 2002 6
1 2002 6
1 2003 9
2 2000 1
2 2000 1
2 2001 1
2 2002 1
2 2002 9
2 2003 4
3 2000 3
3 2001 3
3 2002 3
3 2002 3
3 2003 5
3 2003 5
4 1999
4 2000
4 2001
4 2002
4 2002 4
Note that the years before 2002 for n==4 did not get filled in because the first observation where year==2002 is blank.
I think that a solution may be along the lines of:
bysort n: gen temp = y[1] if year==2002
replace y = temp if year<2002
drop temp
But I am not sure about the first line.
One (perhaps inelegant) solution:
sort n year, stable // [1]
gen y2 = y
by n year: gen _y = y2[1] if year == 2002 // [2]
egen _y2 = max(_y), by(n) // [3]
replace y2 = _y2 if year < 2002 // [4]
drop _*
li, sepby(n) noobs
yielding:
+-------------------+
| n year y y2 |
|-------------------|
| 1 2000 . 6 |
| 1 2000 . 6 |
| 1 2001 . 6 |
| 1 2002 6 6 |
| 1 2002 6 6 |
| 1 2003 9 9 |
|-------------------|
| 2 2000 . 1 |
| 2 2000 . 1 |
| 2 2001 . 1 |
| 2 2002 1 1 |
| 2 2002 9 9 |
| 2 2003 4 4 |
|-------------------|
| 3 2000 . 3 |
| 3 2001 . 3 |
| 3 2002 3 3 |
| 3 2002 3 3 |
| 3 2003 5 5 |
| 3 2003 5 5 |
|-------------------|
| 4 1999 . . |
| 4 2000 . . |
| 4 2001 . . |
| 4 2002 . . |
| 4 2002 4 4 |
+-------------------+
Notes:
[1] The stable option preserves the ordering of y.
[2] Generates _y equal to the first observation where year == 2002 only. Note that you need by n year or else y[1] is the first observation of the n group even when year != 2002 (but present only for observations where year == 2002).
[3] Fills in _y across the n group.
[4] Replaces y2 for years earlier than 2002.
mipolate from SSC offers "backward" interpolation, as follows:
. ssc inst mipolate
. bysort n: mipolate y year, gen(y2) backward
. l
+-------------------+
| n year y y2 |
|-------------------|
1. | 1 2000 . 6 |
2. | 1 2000 . 6 |
3. | 1 2001 . 6 |
4. | 1 2002 6 6 |
5. | 1 2002 6 6 |
|-------------------|
6. | 1 2003 9 9 |
7. | 2 2000 . 5 |
8. | 2 2000 . 5 |
9. | 2 2001 . 5 |
10. | 2 2002 1 5 |
|-------------------|
11. | 2 2002 9 5 |
12. | 2 2003 4 4 |
13. | 3 2000 . 3 |
14. | 3 2001 . 3 |
15. | 3 2002 3 3 |
|-------------------|
16. | 3 2002 3 3 |
17. | 3 2003 5 5 |
18. | 3 2003 5 5 |
19. | 4 1999 . 4 |
20. | 4 2000 . 4 |
|-------------------|
21. | 4 2001 . 4 |
22. | 4 2002 . 4 |
23. | 4 2002 4 4 |
+-------------------+
I mention this because it may be of interest to others interested in the question. A key here is that multiple observations for the same identifier and year are averaged first, which is not what you want.
Your particular version of the question is highly fragile because somehow you know that the first value of several is the one to use, but nothing in the data you show us flags which or why. Sort the data on n year and which of various duplicates comes first may well change! This is a dangerous situation for data management.