How to delete variables which occur in column x but not in column y? - stata

How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and would love to know a good source for all possible commands, if one exists, so that I can learn better on my own. Currently I have trouble finding good sources.

In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
This may help. I didn't understand the role of year in your problem.
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
+---------------------------+
| obs which year Data |
|---------------------------|
1. | 1 x 2001 1 |
2. | 4 x 2001 4 |
3. | 7 x 2001 4 |
4. | 6 x 2001 4 |
5. | 5 x 2001 5 |
+---------------------------+
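The key step is the bysort line: within each value of Data, the observations are sorted by which, so any "y" row comes last and which[_N] shows whether that value ever occurs in y. As a minimal sketch of the same idea, assuming you want only the distinct values (1, 4, 5) rather than the surviving observations, you could flag values instead of dropping them and then collect the unflagged ones. The names in_y and only_x are made up for illustration.

```stata
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string

* Within each value of Data, "x" rows sort before "y" rows,
* so the last row is "y" whenever the value occurs in y at all
bysort Data (which) : gen byte in_y = which[_N] == "y"

* Distinct values that occur in x but never in y
levelsof Data if !in_y, local(only_x)
display "`only_x'"
```

With the example data the local macro should hold 1 4 5, matching the result asked for in the question.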
All possible commands aren't documented in a single place. Someone could write new commands all the time and they would not be documented anywhere except their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written and most of those are just documented by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. Literally, there are also undocumented commands (I am not inventing this: see help undocumented) and there are also nondocumented commands that exist, known about because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.

Related

Create Custom Definition of Week

I have daily data and want to convert them to weekly, using the following definition. Every Monday denotes the beginning of week i, and Sunday denotes the end of week i.
My date variable is called day and already has a %td format. I have a feeling that I should use the dow() function, combined with egen, group(), but I struggle to get it quite right.
If your data are once a week and you have data for Mondays only, then your date variable is fine and all you need to do is declare delta(7) if you use tsset or xtset.
If your data are for two or more days a week and you wish to collapse or contract to weekly data, then you can convert to a suitable time basis like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date
22067
22068
22069
22070
22071
22072
22073
22074
22075
22076
22077
22078
22079
22080
end
format %td date
gen wdate = cond(dow(date) == 1, date, cond(dow(date) == 0, date - 6, date - dow(date) + 1))
format wdate %td
gen dow = dow(date)
list, sepby(wdate)
+-----------------------------+
| date dow wdate |
|-----------------------------|
1. | 01jun2020 1 01jun2020 |
2. | 02jun2020 2 01jun2020 |
3. | 03jun2020 3 01jun2020 |
4. | 04jun2020 4 01jun2020 |
5. | 05jun2020 5 01jun2020 |
6. | 06jun2020 6 01jun2020 |
7. | 07jun2020 0 01jun2020 |
|-----------------------------|
8. | 08jun2020 1 08jun2020 |
9. | 09jun2020 2 08jun2020 |
10. | 10jun2020 3 08jun2020 |
11. | 11jun2020 4 08jun2020 |
12. | 12jun2020 5 08jun2020 |
13. | 13jun2020 6 08jun2020 |
14. | 14jun2020 0 08jun2020 |
+-----------------------------+
In short, index weeks by the Mondays that start them. Now collapse or contract your dataset. Naturally if you have panel or longitudinal data some identifier may be involved too. delta(7) remains essential for anything depending on tsset or xtset.
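A minimal sketch of that last step, assuming a single series with one made-up measurement called value (not part of the original example) that is averaged within weeks. Note that the case dow(date) == 1 needs no separate branch, because date - dow(date) + 1 already returns the date itself for Mondays.

```stata
clear
set obs 14
gen date = mdy(6, 1, 2020) + _n - 1
format date %td

* Hypothetical daily measurement, for illustration only
gen value = _n

* Monday that starts the week containing each date (Sunday has dow() == 0)
gen wdate = cond(dow(date) == 0, date - 6, date - dow(date) + 1)
format wdate %td

* One observation per week, indexed by its Monday
collapse (mean) value, by(wdate)

* Successive weekly dates are 7 days apart
tsset wdate, delta(7)
list
```

For panel data the same pattern would apply with an identifier added to by() and xtset used in place of tsset.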
There is no harm in using egen to map to successive integers, but no advantage in that either.
A theme underlying this is that Stata's own weeks are idiosyncratic, always starting week 1 on 1 January and always having 8 or 9 days in week 52. For more on weeks in Stata, see the papers here and here, which include the advice given in this answer, and much more.
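To see that behaviour for yourself, a small sketch like the following counts the days that the week() function assigns to the last weeks of a year. The year 2020 is an arbitrary choice; it is a leap year, so week 52 should come out with 9 days.

```stata
clear
set obs 366
gen date = mdy(1, 1, 2020) + _n - 1
format date %td

* Stata's own week of the year: week 1 always starts on 1 January
gen week = week(date)

* Number of days falling in weeks 51 and 52
tabulate week if week >= 51
```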

Conditioning Stata dataset on past values of variables

I have a problem in conditioning the dataset I have on Stata. Basically I want to condition the presence in the dataset -within a certain group- of an observation for which a certain action is performed (as indicated by a variable) on the past values of another variable. So let's suppose I have the following
obs | id | action1 | action2 | year
1 | 1 | 1 | 0 | 2000
2 | 1 | 0 | 1 | 2001
3 | 1 | 0 | 1 | 2002
4 | 1 | 0 | 1 | 2002
5 | 1 | 0 | 1 | 2003
6 | 2 | 1 | 0 | 2000
7 | 2 | 1 | 0 | 2001
8 | 2 | 0 | 1 | 2002
9 | 2 | 0 | 1 | 2002
10 | 2 | 0 | 1 | 2003
For each group identified by 'id', I want to keep an observation only if action1 is performed, or if action1 was performed no earlier than 2 years before action2 is performed. In this simplified example only observation 4 should be deleted. Please note that the 2 actions are not mutually exclusive and can be performed more than once within the same year, so looking at 2 observations in the past does not necessarily mean looking at 2 years in the past.
A solution which I am not able to implement by code would be:
gen act1year= action1 * year
then by(id) store the value of act1year when they're different from 0 somewhere (I am not able to implement this)
and then by(id) keep if action1=1 or if action2[_n]=1 and the range year[_n] to year[_n]-2 contains at least one of the values in the previously stored variable.
I know my suggestion is probably not the easiest way to go, and I am still not able to implement it. Unfortunately I cannot find code that helps me do this. I hope you can help me. Thanks
Francesco
The following assumes certain things.
clear
set more off
input ///
obs id action1 action2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2003
5 2 1 0 2000
6 2 0 1 2001
7 2 1 0 2002
8 2 0 1 2003
end
list, sepby(id)
*-----
bysort id (year) : keep if action1 | (action1[_n-1] + action1[_n-2] > 0)
list, sepby(id)
What is between parentheses evaluates to one or zero depending on whether the inequality is true or false, respectively. This fragment indicates whether action 1 was taken in either of the previous two observations.
You need to decide what to do with the first two observations in each group, as they can't be compared with exactly two previous observations (which don't exist). In this example they are always kept: referring to a nonexistent observation returns missing, so the sum is missing, and missing is treated as a very large number in Stata, which makes the inequality true.
You can also work with time-series operators (help tsvarlist, help xtset) and really respect the time variable. Here, I work with the previous two observations. That may or may not coincide with the previous two time points.
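If the rule has to respect years rather than observation positions, one possible sketch (not the approach above, and not using time-series operators) is to carry forward the year of the most recent action1 within each id and compare it with the current year. It assumes that "within 2 years" means year minus the last action1 year is at most 2, and that observations before any action1 are dropped unless they are action1 themselves. The variables neg1 and last1 are made up for illustration.

```stata
clear
input obs id action1 action2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2003
5 2 1 0 2000
6 2 0 1 2001
7 2 1 0 2002
8 2 0 1 2003
end

* Sort action1 rows first within a year, so that same-year ties are seen
gen byte neg1 = -action1
bysort id (year neg1) : gen last1 = year if action1

* Carry the year of the most recent action1 forward within each id
by id : replace last1 = last1[_n-1] if missing(last1)

* Assumed rule: keep action1 rows, or rows at most 2 years after an action1.
* If last1 is missing the comparison is false, so such rows are dropped.
keep if action1 | (year - last1 <= 2)
sort obs
list, sepby(id)
```

Unlike the version based on [_n-1] and [_n-2], this does not depend on how many observations fall within a year, but the choice of cutoff and the treatment of early observations remain yours to make.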
I think your two actions are mutually exclusive, but you are not explicit about it.

How to obtain the order of multiple responses?

I'm working on a survey dataset which contains a question with multiple responses. The data is not well cleaned for the order of responses depends on the order in which an interviewee chose the multiple options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several following complementary questions (like the year a certain event happened) which share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in this case. But how can I retain the order in which I encounter the certain value and use it in my analysis regarding q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular order in which a certain value occurs.
I may not have expressed my question clearly enough.
What I want is to detect whether a certain event (an option of the multiple responses, represented by a certain value like 3) happens. If it does happen, then set a newly created dummy, say eventhappens, to 1: so in my example, we would set eventhappens to 1 for the first and third id.
If that were all I wanted, then anymatch() would suffice.
However, I also need to retain the order in which the particular value 3 occurs, like 2 for first observation, to ease the analysis of the following questions. So for the first id, 1999 is the year when the certain event happened, not 1998. Then what should I do?
Update
Apologies for my earlier unclear description. The real data look like this (I don't have the authority to post a picture of the real data in the Stata browse window):
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
The ce101_s_* variables record the options an interviewee chose in answer to question ce101, in the order in which they were chosen. Each value (in the real data, Chinese text with value labels) represents an event that occurred: for example, 1 means a village built its own hospital, 13 means a village has mobile signal, and so on. Take id 1: this village built a hospital (1) in 1999, built a primary school (2) in 1998, and so on. In fact all the listed events happened in village 1, but in village 2 only events 2 and 13 happened. The difficulty for me is to retain the position at which a given event was chosen in each village. Take 13 (mobile signal): it occurred in 2005 in village 1, because the interviewee chose it 13th when answering ce101 and the value of ce102_s_13 is 2005. For id 2, the interviewee chose it second, and the corresponding value in ce102 is 2007. So if I want to create a dummy for whether a household lived in a village before a given event occurred there, I need the matching position in ce102_s_*.
I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+
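Building on the reshape idea, here is a hedged sketch of how the year attached to one particular event could be pulled out. It reshapes both stubs at once, so that each row pairs an answer with its year, and it uses the simplified q1_*/q2_* example with event value 3. The names event, year, year3 and eventhappens are chosen for illustration.

```stata
clear
input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 1 3 . 1998 1999 .
2 2 . . 2000 . .
3 3 2 . 2001 1997 .
end

* One row per id and response position, pairing each event with its year
reshape long q1_ q2_, i(id) j(order)
rename q1_ event
rename q2_ year

* Year in which event 3 happened, spread to every row of the same id
egen year3 = max(cond(event == 3, year, .)), by(id)

* Dummy for whether event 3 happened at all
gen byte eventhappens = !missing(year3)
list, sepby(id)
```

For the first id this picks up 1999 rather than 1998, because the year is taken from the row where the event value is 3, whatever its position.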

How can I match two columns of data by name?

I have two sets of data that look something like this:
Bill | 7
Sam | 13
Chuck | 9
and
Bill | 6
Sam | 3
Beth | 6
and I want:
Beth | 0 | 6
Bill | 7 | 6
Chuck| 9 | 0
Sam | 13 | 3
I don't even care if the data ends up looking like this:
Bill | 7 | Bill | 6
| | Beth | 6
Sam | 13 | Sam | 3
Chuck| 9 | Chuck| 0
I just would like to match up the names.
Your desired outcome - I've never seen such an order in "real life practice".
To use the data, I would go with an operating system tool to combine the source files
(like: copy file1 + file2 newfile.csv; new file extension for easily recognizing by OOo Calc).
In Calc you can then sort or filter to show a person's data together, or sum or otherwise calculate with it.
If you want standard operations, like SUM per person, check out the pivot table feature.
HTH

Generating a new variable by selection from multiple variables

I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There are a set of variables disease1-28 with a numeric code for each disease and another set age1-28 with the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age of diagnosis is placed in the appropriate age variable.
I would like to generate a new variable for each of several diseases giving the age of diagnosis of that disease: e.g. asthma_age_at_diagnosis
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+
I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
+--------------------------------+
| Patient Order Disease Age |
|--------------------------------|
1. | 1 1 123 30 |
2. | 1 2 3 2 |
3. | 2 1 122 23 |
4. | 2 2 123 51 |
5. | 2 3 5 44 |
6. | 3 1 5 50 |
+--------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
Hint: If you still want separate variables for some purposes, look at separate.
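If you do want to stay with the wide layout, the 28 replace statements can be written as a single loop. This is only a sketch: it uses the three-variable example, and it assumes purely for illustration that code 123 stands for asthma. With the real data the loop would run over 1/28.

```stata
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end

* Hypothetical: 123 is taken to be the code for asthma
gen asthma_age_at_diagnosis = .
forvalues j = 1/3 {
    * Take the age from whichever position holds the asthma code
    replace asthma_age_at_diagnosis = Age`j' ///
        if Disease`j' == 123 & missing(asthma_age_at_diagnosis)
}
list Patient asthma_age_at_diagnosis
```

The same loop can be wrapped in an outer loop over disease codes if several such variables are needed.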