Making unbalanced panel balanced with missing observations - stata

I am attempting to make the data balanced for my sample. My data currently looks like:
id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
....
And I would like it to look like:
id year y
1 2000 2
1 2001 .
1 2002 4
1 2003 5
2 2000 .
2 2001 2
2 2002 3
....
I have tried creating a .dta of just the year and merging it to the data; however, I can't get it to work. Essentially I would like to add rows of missing data to the panel. I realize I could just drop ids with unbalanced data, but this is not an option for my methodology.

You need to skim the Data-Management Reference Manual [D] when looking for basic data management functionality. In this case fillin does what you seem to be asking.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
fillin id year
list, sepby(id)
+-------------------------+
| id year y _fillin |
|-------------------------|
1. | 1 2000 2 0 |
2. | 1 2001 . 1 |
3. | 1 2002 4 0 |
4. | 1 2003 5 0 |
|-------------------------|
5. | 2 2000 . 1 |
6. | 2 2001 2 0 |
7. | 2 2002 3 0 |
8. | 2 2003 . 1 |
+-------------------------+
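As a possible alternative, offered only as a sketch: if the data are declared as a panel with xtset, tsfill with the full option should give the same balanced layout without creating the _fillin indicator. This assumes year is an integer-valued time variable and that id and year together uniquely identify observations.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
* declare the panel structure; this fails if any id-year pair is repeated
xtset id year
* add an observation for every id in every year spanned by the data
tsfill, full
list, sepby(id)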

Related

Variable showing the highest value attained of another variable, recorded so far, over time

I have a dataset of patients and their alcohol-related patient data over time (in years) like below
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
I would like to create a variable that shows the highest level of alcohol code that has been used so far at any (year) point in a patient's record, such that the dataset would be like below:
clear
input long patid float(year cohort highestsofar)
1051 1994 1 1
2051 1972 1 1
2051 1989 2 2
2051 1990 2 2
2051 2000 2 2
2051 2001 3 3
2051 2002 1 3
2051 2003 2 3
8051 1995 1 1
8051 1996 1 1
8051 2003 1 1
end
label values cohort cohortlab
label values highestsofar cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "lifetime_abstainer" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
Thanks for the clear example and question.
The problem is already covered by an FAQ on the StataCorp website. Here's a one-line solution using rangestat from SSC.
clear
input long patid float(year cohort)
1051 1994 1
2051 1972 1
2051 1989 2
2051 1990 2
2051 2000 2
2051 2001 3
2051 2002 1
2051 2003 2
8051 1995 1
8051 1996 1
8051 2003 1
end
label values cohort cohortlab
label define cohortlab 0 "general population" 1 "no alcohol data" 2 "indeterminate" 3 "non-drinker" 4 "low_risk" 5 "hazardous" 6 "AUD" , replace
rangestat (max) highestsofar = cohort, interval(year . 0) by(patid)
list, sepby(patid)
+-------------------------------------------+
| patid year cohort highes~r |
|-------------------------------------------|
1. | 1051 1994 no alcohol data 1 |
|-------------------------------------------|
2. | 2051 1972 no alcohol data 1 |
3. | 2051 1989 indeterminate 2 |
4. | 2051 1990 indeterminate 2 |
5. | 2051 2000 indeterminate 2 |
6. | 2051 2001 non-drinker 3 |
7. | 2051 2002 no alcohol data 3 |
8. | 2051 2003 indeterminate 3 |
|-------------------------------------------|
9. | 8051 1995 no alcohol data 1 |
10. | 8051 1996 no alcohol data 1 |
11. | 8051 2003 no alcohol data 1 |
+-------------------------------------------+
I would like to offer an answer:
by patid: g highestsofar=cohort if cohort>cohort[_n-1]|_n==1
by patid: replace highestsofar=highestsofar[_n-1] if cohort<=cohort[_n-1]&_n>1
by patid: replace highestsofar=highestsofar[_n-1] if (highestsofar<highestsofar[_n-1]) & ((cohort>cohort[_n-1])&_n>1)
label values highestsofar cohortlab
I would be happy if a more compact syntax could be discussed.
Thanks
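On the request for more compact syntax, here is one sketch (not the only way) that carries a running maximum forward with max(). It assumes that observations within each patid should be taken in year order and that a larger cohort code means a higher level. Unlike the three-step version above, it does not rely on comparisons with only the immediately preceding cohort, which can go wrong for a sequence such as 3, 1, 2, 1.
* start each patient's running maximum at the first year's value
bysort patid (year): gen highestsofar = cohort if _n == 1
* replace works through observations in order, so [_n-1] is already updated
by patid: replace highestsofar = max(highestsofar[_n-1], cohort) if _n > 1
label values highestsofar cohortlab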

Merge databases in Stata and create new vars based on identity and value of merged data

I have two databases, DB1 and DB2, that I would like to merge, but I am having difficulties. I would like help in determining what Stata calls what I am trying to do.
DB1 has about 1000 observations and looks like:
+----------+
| date b |
|----------|
1. | 1 7 |
2. | 2 6 |
3. | 3 7 |
+----------+
DB2 consists of 65 IDs each with about 1000 observations. It looks something like:
+--------------+
| date id b |
|--------------|
1. | 1 1 4 |
2. | 2 1 4 |
3. | 3 1 5 |
4. | 1 2 9 |
5. | 2 2 8 |
6. | 3 2 7 |
7. | 1 3 1 |
8. | 2 3 2 |
9. | 3 3 1 |
+--------------+
I would like to merge DB2 with DB1 so that the ultimate database looks like:
+------------------------------+
| date b id1b id2b id3b ...|
|------------------------------|
1. | 1 7 4 9 1 ...|
2. | 2 6 4 8 2 ...|
3. | 3 7 5 7 1 ...|
+------------------------------+
I have been reading about the merge command but that alone will not create my ultimate database.
Can you direct me to materials that will help me with this? What do you call what I am trying to do? I feel like I need to command Stata to generate new variables.
@William Lisowski is right. This gets you what you ask for, short of an easy rename. Whether it is the best structure for your analyses is unclear: most work with similar data would be easier with a further reshape long.
clear
input date b
1 7
2 6
3 7
end
save DB1, replace
clear
input date id b
1 1 4
2 1 4
3 1 5
1 2 9
2 2 8
3 2 7
1 3 1
2 3 2
3 3 1
end
reshape wide b, j(id) i(date)
merge 1:1 date using DB1
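The rename mentioned above might look like the following sketch, run at any point after the reshape wide. The upper limit 3 matches this toy example; with the real data it would be the number of IDs, assuming they are numbered consecutively from 1.
forvalues i = 1/3 {
    * b1 becomes id1b, b2 becomes id2b, and so on
    rename b`i' id`i'b
}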
Indeed, I would much more usually do something like this to get a long structure directly:
clear
input date b
1 7
2 6
3 7
end
rename b B
save DB1 , replace
clear
input date id b
1 1 4
2 1 4
3 1 5
1 2 9
2 2 8
3 2 7
1 3 1
2 3 2
3 3 1
end
merge m:1 date using DB1

How to delete variables which occur in column x but not in column y?

How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and I would love to know a good source for all possible commands, if one exists, so I can learn better on my own. Currently I have trouble finding good sources.
In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
This may help. I didn't understand the role of year in your problem.
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
+---------------------------+
| obs which year Data |
|---------------------------|
1. | 1 x 2001 1 |
2. | 4 x 2001 4 |
3. | 7 x 2001 4 |
4. | 6 x 2001 4 |
5. | 5 x 2001 5 |
+---------------------------+
All possible commands aren't documented in a single place. Someone could write new commands all the time and they would not be documented anywhere except their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written and most of those are just documented by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. Literally, there are also undocumented commands (I am not inventing this: see help undocumented) and there are also nondocumented commands that exist, known about because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.

Conditioning Stata dataset on past values of variables

I have a problem conditioning a dataset in Stata. Basically, within a certain group, I want the presence in the dataset of an observation for which a certain action is performed (as indicated by a variable) to depend on the past values of another variable. So let's suppose I have the following
obs | id | action1 | action2 | year
1 | 1 | 1 | 0 | 2000
2 | 1 | 0 | 1 | 2001
3 | 1 | 0 | 1 | 2002
4 | 1 | 0 | 1 | 2002
5 | 1 | 0 | 1 | 2003
6 | 2 | 1 | 0 | 2000
7 | 2 | 1 | 0 | 2001
8 | 2 | 0 | 1 | 2002
9 | 2 | 0 | 1 | 2002
10 | 2 | 0 | 1 | 2003
For each group identified by 'id', I want to keep an observation only if action1 is performed, or if action1 has been performed no earlier than 2 years before action2 is performed. In this simplified example only observation 4 should be deleted. Please note that the 2 actions are not mutually exclusive and can be performed more than once within the same year, so looking at 2 observations in the past does not necessarily mean looking at 2 years in the past.
A solution which I am not able to implement by code would be:
gen act1year= action1 * year
then by(id) store the value of act1year when they're different from 0 somewhere (I am not able to implement this)
and then by(id) keep if action1=1 or if action2[_n]=1 and the range year[_n] to year[_n]-2 contains at least one of the values in the previously stored variable.
I know my suggestion is probably not the easiest way to go, and I am still not able to implement it. Unfortunately, I cannot find code that helps me do this. I hope you can help. Thanks
Francesco
The following assumes certain things.
clear
set more off
input ///
obs id action1 action2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2003
5 2 1 0 2000
6 2 0 1 2001
7 2 1 0 2002
8 2 0 1 2003
end
list, sepby(id)
*-----
bysort id (year) : keep if action1 | (action1[_n-1] + action1[_n-2] > 0)
list, sepby(id)
What is between parentheses evaluates to one or zero depending on whether the inequality is true or false, respectively. This fragment indicates whether action 1 was taken in either of the previous two observations.
You need to decide what to do with the first two observations in each group, as they can't be compared with exactly two previous observations (which don't exist). In the example above they are always kept, because referring to a nonexistent observation here means adding missing values, which results in missing. Stata treats missing as a very large number, so the inequality is true.
You can also work with time-series operators (help tsvarlist, help xtset) and really respect the time variable. Here, I work with the previous two observations. That may or may not coincide with the previous two time points.
I think your two actions are mutually exclusive, but you are not explicit about it.
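To respect the year variable rather than the observation order, one possible sketch uses rangestat (community-contributed, from SSC) on the example data above, in place of the keep line shown there. It assumes "within 2 years" means year - 2 through the current year; recent1 is a hypothetical variable name.
* ssc install rangestat   // if not already installed
* recent1 is 1 if the same id has action1 == 1 in the window [year - 2, year]
rangestat (max) recent1 = action1, interval(year -2 0) by(id)
keep if action1 | (action2 & recent1 == 1)
list, sepby(id)
Because the window is defined on year, repeated years within an id do not matter here, whereas [_n-1] and [_n-2] count observations.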

How to obtain the order of multiple responses?

I'm working on a survey dataset that contains a question with multiple responses. The data are not well cleaned: the order of responses depends on the order in which an interviewee chose the options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several complementary follow-up questions (like the year a certain event happened) that share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in this case. But how can I retain the order in which I encounter the certain value and use it in my analysis regarding q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular order in which a certain value occurs.
I may not have expressed my question clearly enough.
What I want is to detect whether a certain event (an option of the multiple responses, represented by a certain value like 3) happens. If it does happen, then set a newly created dummy, say eventhappens, to 1: so in my example, we shall set eventhappens to 1 for the first and third id.
If that's all my desire, then anymatch() suffices.
However, I also need to retain the order in which the particular value 3 occurs, like 2 for first observation, to ease the analysis of the following questions. So for the first id, 1999 is the year when the certain event happened, not 1998. Then what should I do?
Update
I apologize for my earlier unclear description. The real data are like this (I don't have the authority to post a picture of the real data in the Stata browse window)
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
The ce101_s_* variables record the options an interviewee chose in answer to question ce101, in the order in which the interviewee chose them. Each value (in the real data, Chinese characters with value labels) represents an event that has occurred: for example, 1 means a village built its own hospital, 13 means a village has mobile signal, and so on. Take id_1 for example: this village built a hospital (represented by 1) in 1999, built a preliminary school (represented by 2) in 1998, and so on. In fact, all the listed events happened in the id_1 village, but for id_2 only events 2 and 13 happened. The difficulty for me is retaining the order in which a certain event was chosen in each village. Take 13 (mobile signal), for instance: it occurred in 2005 for the id_1 village, because the interviewee chose it 13th when answering question ce101, and the value of ce102_s_13 is 2005. But for id_2, the interviewee chose it second, and the corresponding value in ce102 is 2007. So if I want to create a dummy for whether a household lived in a certain village before a certain event occurred in that village, I need the order in ce102_s_*.
I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+
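If the goal is only the position and year attached to one particular value, a loop over the positions may be enough, without reshaping. This is a sketch to be run on the toy data as first input (before the reshape), assuming three positions and the value 3 as the event of interest. eventhappens comes from the question; order3 and year3 are hypothetical names.
gen byte eventhappens = 0
gen byte order3 = .
gen int year3 = .
forvalues j = 1/3 {
    * record the position and the matching year wherever the value 3 appears
    replace order3 = `j' if q1_`j' == 3
    replace year3 = q2_`j' if q1_`j' == 3
    replace eventhappens = 1 if q1_`j' == 3
}
list id eventhappens order3 year3
On the toy data this gives order3 = 2 and year3 = 1999 for id 1, and order3 = 1 and year3 = 2001 for id 3. If a value could appear more than once in a row, the last position would win, so that case would need a separate decision.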