How to obtain the order of multiple reponses? - stata

I'm working on a survey dataset which contains a question with multiple responses. The data is not well cleaned for the order of responses depends on the order in which an interviewee chose the multiple options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several following complementary questions (like the year a certain event happened) which share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in this case. But how can I retain the order in which I encounter the certain value and use it in my analysis regarding q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular order in which a certain value occurs.
I may not have expressed my question clearly enough.
What I desire is to detect if a certain event (a option of the multiple responses represented by certain value like 3) happens. If it does happen, then set a new-created dummy, say eventhappens, to 1: so in my example, we shall set eventhappens to 1 for the first and third id.
If that's all my desire, then anymatch() suffices.
However, I also need to retain the order in which the particular value 3 occurs, like 2 for first observation, to ease the analysis of the following questions. So for the first id, 1999 is the year when the certain event happened, not 1998. Then what should I do?
Update
Appologize for my former unclear description. The real data is like (I don't have the authority to post a picture of the real data in Stata browse window)
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
the ce101_s_* is a list of variable,they represent the options interviewee choose with regarding to question ce101 and their orders are the orders in which interviewee make the choice.Certain value(in the real data is chinese character with value labels)represents certain event had occured, for example 1 represents a villiage build its own hospital,13 represent a villiage has mobile signal and so on.Take id_1 for example, this village build a hospital (represented by 1) in 1999, build a preliminary school(represented by 2) in 1998 and so on, in fact , all event listed actually happened in id_1 village,but for id_2 only 2 and 13 event happens. The difficulty for me is to retain the order certain event happened in each villiage, take 13(mobile signal for instance),it occured in 2005 for id_1 village, because interviwee choose it at 13th order when answering question ce101, and the value of ce102_s_13 is 2005.But for id_2, interviewee choose it at the second order and the correponding value in ce102 is 2007.So if a want to create a dummy to represent if household live in certain villiage before certain event occur in this village, I need the order in ce102_s_*
.

I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+

Related

Which function I can use in Stata to replicate a quantitative variable?

I'm using a sample survey by persons of a country. Every person has an ID that represents the home whom he/she belongs. I'm doing a probit model to analyze the effect of household head's education on poverty, but I need to replicate the level of education of the head of household to all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of householdenter image description here to all the members of the household, if they share the same household ID?
I need to do something like the image. I need "schooling of the head of household" variable.
Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
+-----------------------------------+
| id relati~p school~g wanted |
|-----------------------------------|
1. | 1 1 4 4 |
2. | 1 2 4 4 |
3. | 1 3 2 4 |
|-----------------------------------|
4. | 2 1 5 5 |
5. | 2 2 4 5 |
|-----------------------------------|
6. | 3 1 5 5 |
7. | 3 3 1 5 |
+-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
For explanation and discussion, see Section 9 of this paper.

Reshaping dataset wide

I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error code I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| ID gender date1 Injury_11 Injury_21 Injury_31 date2 Injury_12 Injury_22 Injury_32 date3 Injury_13 Injury_23 Injury_33 |
|----------------------------------------------------------------------------------------------------------------------------------------------------|
1. | 1 0 12-Mar 1 2 3 23-Jun 1 2 . 30-Aug 8 9 10 |
2. | 2 1 2-Apr 4 . . . . . . . . |
3. | 3 1 1-Feb 5 6 . . . . . . . |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
The command provided is not valid Stata syntax.

How to delete variables which occur in column x but not in column y?

How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and I would love to know a good source for all possible commands, if this exists? So I can learn better on my own. Currently I have trouble to find good sources.
In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
This may help. I didn't understand the role of year in your problem.
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
+---------------------------+
| obs which year Data |
|---------------------------|
1. | 1 x 2001 1 |
2. | 4 x 2001 4 |
3. | 7 x 2001 4 |
4. | 6 x 2001 4 |
5. | 5 x 2001 5 |
+---------------------------+
All possible commands aren't documented in a single place. Someone could write new commands all the time and they would not be documented anywhere except their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written and most of those are just documented by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. Literally, there are also undocumented commands (I am not inventing this: see help undocumented) and there are also nondocumented commands that exist, known about because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.

Making unbalanced panel balanced with missing observations

I am attempting to make the data balanced for my sample. My data currently looks like:
id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
....
And I would like it to look like:
id year y
1 2000 2
1 2001 .
1 2002 4
1 2003 5
2 2000 .
2 2001 2
2 2002 3
....
I have tried creating a .dta of just the year and merging it to the data; however, I can't get it to work. Essentially I would like to add rows of missing data to the panel. I realize I could just drop ids with unbalanced data, but this is not an option for my methodology.
You need to skim the Data-Management Reference Manual [D] when looking for basic data management functionality. In this case fillin does what you seem to be asking.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
fillin id year
list, sepby(id)
+-------------------------+
| id year y _fillin |
|-------------------------|
1. | 1 2000 2 0 |
2. | 1 2001 . 1 |
3. | 1 2002 4 0 |
4. | 1 2003 5 0 |
|-------------------------|
5. | 2 2000 . 1 |
6. | 2 2001 2 0 |
7. | 2 2002 3 0 |
8. | 2 2003 . 1 |
+-------------------------+

Conditioning Stata dataset on past values of variables

I have a problem in conditioning the dataset I have on Stata. Basically I want to condition the presence in the dataset -within a certain group- of an observation for which a certain action is performed (as indicated by a variable) on the past values of another variable. So let's suppose I have the following
obs | id | action1 | action2 | year
1 | 1 | 1 | 0 | 2000
2 | 1 | 0 | 1 | 2001
3 | 1 | 0 | 1 | 2002
4 | 1 | 0 | 1 | 2002
5 | 1 | 0 | 1 | 2003
6 | 2 | 1 | 0 | 2000
7 | 2 | 1 | 0 | 2001
8 | 2 | 0 | 1 | 2002
9 | 2 | 0 | 1 | 2002
10 | 2 | 0 | 1 | 2003
And for each group identified by 'id' I want to keep the observation only if action 1 is performed or if action1 has been performed no earlier than 2 years before action2 has been performed. In this simplified example only observation 4 should be deleted. Please note that the 2 actions are not mutually exclusive and they can be performed more than once within the same year therefore looking at 2 observations in the past does not necessarily means to look at 2 years in the past.
A solution which I am not able to implement by code would be:
gen act1year= action1 * year
then by(id) store the value of act1year when they're different from 0 somewhere (I am not able to implement this)
and then by(id) keep if action1=1 or if action2[_n]=1 and the range year[_n] to year[_n]-2 contains at least one of the values in the previously stored variable.
I know probably my suggestion is not the easiest way to go and still I am not able to implement it, unfortunately I cannot manage to find a code that help me doing this. Hope you can help me. Thanks
Francesco
The following assumes certain things.
clear
set more off
input ///
obs id action1 action2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2003
5 2 1 0 2000
6 2 0 1 2001
7 2 1 0 2002
8 2 0 1 2003
end
list, sepby(id)
*-----
bysort id (year) : keep if action1 | (action1[_n-1] + action1[_n-2] > 0)
list, sepby(id)
What is between parenthesis evaluates to one or zero depending on whether the inequality is true or false, respectively. This fragment indicates if action 1 was taken in either of the previous two observations.
You need to decide what to do with the first two observations, as they can't be compared with exactly two previous observations (they don't exist). In the following example they are always kept, because comparing with a non-existant observation in this case implies adding missing values, which results in missing. A missing is considered a very large number in Stata.
You can also work with time-series operators (help tsvarlist, help xtset) and really respect the time variable. Here, I work with the previous two observations. That may or may not coincide with the previous two time points.
I think your two actions are mutually exclusive, but you are not explicit about it.