I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error message I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
     +----------------------------------------------------------------------------------------------------------------------------------------------------+
     | ID   gender    date1   Injury_11   Injury_21   Injury_31    date2   Injury_12   Injury_22   Injury_32    date3   Injury_13   Injury_23   Injury_33 |
     |----------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |  1        0   12-Mar           1           2           3   23-Jun           1           2           .   30-Aug           8           9          10 |
  2. |  2        1    2-Apr           4           .           .                    .           .           .                    .           .           . |
  3. |  3        1    1-Feb           5           6           .                    .           .           .                    .           .           . |
     +----------------------------------------------------------------------------------------------------------------------------------------------------+
The command as posted in the question is not valid Stata syntax: reshape has to be told the direction, wide or long, as in the example above.
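To make the constancy requirement in the error message concrete, here is a minimal sketch of the same long-to-wide step written in plain Python rather than Stata. The function name reshape_wide and the use of None for missing values are assumptions made for illustration; the data are the rows from the answer above.

```python
# Long-format rows from the example:
# (ID, VisitNum, date, Injury_1, Injury_2, Injury_3, gender).
# None stands in for Stata's missing value (an assumption of this sketch).
long_rows = [
    (1, 1, "12-Mar", 1, 2, 3, 0),
    (2, 1, "2-Apr", 4, None, None, 1),
    (1, 2, "23-Jun", 1, 2, None, 0),
    (3, 1, "1-Feb", 5, 6, None, 1),
    (1, 3, "30-Aug", 8, 9, 10, 0),
]
XIJ = ["date", "Injury_1", "Injury_2", "Injury_3"]  # variables that vary by visit


def reshape_wide(rows):
    """Spread the visit-varying variables into one column per visit."""
    wide = {}
    for id_, visit, *values, gender in rows:
        rec = wide.setdefault(id_, {"ID": id_, "gender": gender})
        # Anything that is not reshaped must be constant within ID,
        # which is the condition the error message complains about.
        if rec["gender"] != gender:
            raise ValueError(f"gender is not constant within ID {id_}")
        for name, value in zip(XIJ, values):
            rec[f"{name}{visit}"] = value
    return [wide[key] for key in sorted(wide)]


wide_rows = reshape_wide(long_rows)
for row in wide_rows:
    print(row)
```

In this sketch, visits that an ID never had are simply absent from that ID's record, whereas Stata fills them with missing values. A variable that is left out of the reshape list and changes between visits triggers the ValueError, which mirrors why such variables must either be listed or dropped.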
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 householdID byte(childID HHmemberID)
"0940041260" 1 3
"0940041030" 1 .
"0940041030" 2 .
"0940041030" 3 3
"0940041030" 4 .
"0940041030" 5 .
"0940041110" 1 3
"0940041100" 2 3
"0940041100" 3 4
"0940041100" 4 .
"0940041080" 1 .
"0940041080" 2 .
"0940041080" 3 .
"0940041060" 1 3
"0940041140" 1 .
"0940041180" 1 .
"0940041010" 1 .
"0940041010" 2 .
"0940041040" 1 .
"0940041040" 2 .
"0940041190" 1 .
"0940041190" 2 .
"0940041220" 1 3
"0940041160" 1 3
"0940041170" 1 .
"0940041170" 2 .
end
I am trying to compute, for each household, its size and the number of children it has, but I don't know how to do that in Stata. Is there a way to do this? Within a household, the largest value of HHmemberID and the largest value of childID give these two numbers, but I don't know how to extract them.
If you want this info in your original data, you can use extended generate:
bysort householdID: egen N_members = max(HHmemberID)
bysort householdID: egen N_kids = max(childID)
If you want a new dataset with only that data, you should collapse:
collapse (max) N_members = HHmemberID N_kids = childID, by(householdID)
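For readers who want to see what the group-wise maximum does, the following is a small sketch in plain Python, not Stata. It uses a handful of rows from the example; the helper names max_ignoring_missing and household_summary are hypothetical, and None stands in for a missing value.

```python
# A few (householdID, childID, HHmemberID) rows from the example; None = missing.
rows = [
    ("0940041260", 1, 3),
    ("0940041030", 1, None),
    ("0940041030", 2, None),
    ("0940041030", 3, 3),
    ("0940041100", 2, 3),
    ("0940041100", 3, 4),
    ("0940041100", 4, None),
    ("0940041080", 1, None),
]


def max_ignoring_missing(current, new):
    """Running maximum that skips missing values."""
    if new is None:
        return current
    if current is None:
        return new
    return max(current, new)


def household_summary(rows):
    """Map householdID -> (N_kids, N_members), one entry per household."""
    summary = {}
    for household, child, member in rows:
        n_kids, n_members = summary.get(household, (None, None))
        summary[household] = (
            max_ignoring_missing(n_kids, child),
            max_ignoring_missing(n_members, member),
        )
    return summary


summary = household_summary(rows)
for household, (n_kids, n_members) in summary.items():
    print(household, n_kids, n_members)
```

A household whose HHmemberID is missing on every row ends up with a missing N_members here, so such households need separate handling if a household size is required for them.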
I need to create a new variable that repeats the earliest date for an ID's visits. If the date is missing, the new variable should be missing too; after a missing value, it should keep the earliest date since that missing value (as in the example). I've tried the LAG function and it didn't work; I also tried the keep function, but it just repeats 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the RETAIN statement. RETAIN means the variable is not reset to missing at the start of each iteration of the data step, so in the next iteration it still remembers its value from the previous observation.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:

Obs    id    value    new_value
  1     1      .          .
  2     1      2          2
  3     1      3          2
  4     1      .          .
  5     1      5          5
  6     1      6          5
  7     1      .          .
  8     1      8          8
  9     2      1          1
 10     2      2          1
 11     2      3          1
 12     2      .          .
 13     2      5          5
 14     2      6          5
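The retained-variable logic can also be traced outside SAS. The sketch below restates the rule from the data step above in plain Python, using the same made-up data; the function name carry_first_since_missing is hypothetical and None plays the role of a missing value.

```python
# (id, value) pairs matching the HAVE dataset above; None = missing.
have = [
    (1, None), (1, 2), (1, 3), (1, None), (1, 5), (1, 6), (1, None), (1, 8),
    (2, 1), (2, 2), (2, 3), (2, None), (2, 5), (2, 6),
]


def carry_first_since_missing(rows):
    """Return (id, value, new_value) rows using the retain-style rule."""
    out = []
    prev_id = object()  # sentinel that never equals a real id
    new_value = None    # "retained" across iterations, like RETAIN in SAS
    for id_, value in rows:
        first_in_group = id_ != prev_id
        # Store the current value when a new group starts, when nothing is
        # stored yet, or when the current value is missing.
        if first_in_group or new_value is None or value is None:
            new_value = value
        out.append((id_, value, new_value))
        prev_id = id_
    return out


want = carry_first_since_missing(have)
for row in want:
    print(row)
```

Because new_value lives outside the loop body, it keeps its value from one row to the next unless one of the three conditions resets it, which is the behaviour RETAIN provides in the data step.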
I'm familiar with merges in SAS and haven't had trouble with them before, but today I ran into a problem I have never seen.
The observations themselves are matched correctly, even in more complex data sets, but the result only shows the variable values from one of the data sets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
data dataset2;
input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
data dataset3;
merge dataset1 dataset2;
by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge is merging the observations and then just displaying the second data set's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've just looked over), but I've no idea what's happening, since I've never come across the issue before.
Anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one that is listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally when you are merging two datasets the only variables they have in common are the key variables. So the fact that the values read from the first dataset are overwritten by the values read from the second dataset didn't matter.
To get what you want you can use the UPDATE statement instead. UPDATE will not replace a value with a missing value. Basically it is designed to apply transactions to a master dataset.
Since it looks like each ID has only one observation in DATASET1, you could just use DATASET1 as your master dataset.
data want ;
update dataset1 dataset2 ;
by id ;
run;
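The difference between the two behaviours can be sketched in plain Python. This is only a simplified model that assumes one observation per id in each dataset, not a description of everything MERGE and UPDATE do; the function names merge_like and update_like are hypothetical, and only three of the ids from the question are used.

```python
# id -> [var1, var2, var3, var4]; None = missing. A subset of the question's data.
dataset1 = {1: [None, None, 2, 2], 9: [None, 2, None, 2], 10: [1, None, None, None]}
dataset2 = {1: [2, 2, None, None], 10: [None, 1, None, None]}


def merge_like(first, last):
    """Shared variables take the values of the dataset listed last, missing or not."""
    result = {key: list(values) for key, values in first.items()}
    for key, values in last.items():
        result[key] = list(values)
    return dict(sorted(result.items()))


def update_like(master, transactions):
    """Only non-missing transaction values replace values in the master."""
    result = {key: list(values) for key, values in master.items()}
    for key, values in transactions.items():
        current = result.setdefault(key, [None] * len(values))
        for i, value in enumerate(values):
            if value is not None:
                current[i] = value
    return dict(sorted(result.items()))


merged = merge_like(dataset1, dataset2)
updated = update_like(dataset1, dataset2)
print(merged)
print(updated)
```

For id 10 the merge-style result loses the 1 in var1 because the second dataset's missing value replaces it, while the update-style result keeps it and adds the 1 in var2.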
I'm working on a survey dataset which contains a question with multiple responses. The data are not well cleaned, because the order of the responses depends on the order in which an interviewee chose the options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several complementary follow-up questions (like the year a certain event happened) which share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in that case. But how can I keep track of the position at which I encounter that value and use it in my analysis of q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular position at which a certain value occurs.
I may not have expressed my question clearly enough.
What I want is to detect whether a certain event (an option among the multiple responses, represented by a value such as 3) happened. If it did, a newly created dummy, say eventhappens, should be set to 1; in my example, eventhappens would be 1 for the first and third id.
If that were all I wanted, anymatch() would suffice.
However, I also need to keep the position at which the particular value 3 occurs (2 for the first observation) to simplify the analysis of the follow-up questions. For the first id, 1999 is the year when the event happened, not 1998. What should I do?
Update
Apologies for my earlier unclear description. The real data look like this (I don't have permission to post a picture of the real data in the Stata Browse window):
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
The ce101_s_* variables hold the options the interviewee chose in response to question ce101, in the order in which the interviewee chose them. Each value (in the real data, Chinese text with value labels) indicates that a certain event occurred: for example, 1 means a village built its own hospital, 13 means a village has mobile signal, and so on. Take the id 1 village: it built a hospital (1) in 1999, built a primary school (2) in 1998, and so on. In fact, all the listed events happened in the id 1 village, whereas for id 2 only events 2 and 13 happened.
The difficulty for me is to keep track of the position at which a given event was chosen in each village. Take 13 (mobile signal): it occurred in 2005 for the id 1 village, because the interviewee chose it 13th when answering ce101 and the value of ce102_s_13 is 2005. For id 2, the interviewee chose it second, and the corresponding value in ce102 is 2007. So if I want to create a dummy for whether a household lived in a village before a certain event occurred there, I need the position of that event so that I can look up the matching ce102_s_* variable.
I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+
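Once the data are long, looking up the year that belongs to a given response becomes a simple filter. The sketch below shows that idea in plain Python rather than Stata, using the three example observations. It pairs each answer with its year by position, which is a slightly different long layout from the listing above; the names to_long and year_of_event are hypothetical and None stands for missing.

```python
# Wide data from the question; list position = order of choice, None = missing.
wide = [
    {"id": 1, "q1": [1, 3, None], "q2": [1998, 1999, None]},
    {"id": 2, "q1": [2, None, None], "q2": [2000, None, None]},
    {"id": 3, "q1": [3, 2, None], "q2": [2001, 1997, None]},
]


def to_long(records):
    """One row per id and order, with the answer and its year side by side."""
    long_rows = []
    for rec in records:
        pairs = zip(rec["q1"], rec["q2"])
        for order, (answer, year) in enumerate(pairs, start=1):
            long_rows.append(
                {"id": rec["id"], "order": order, "answer": answer, "year": year}
            )
    return long_rows


def year_of_event(long_rows, event):
    """Map id -> year for the rows whose answer equals the event code."""
    return {row["id"]: row["year"] for row in long_rows if row["answer"] == event}


long_rows = to_long(wide)
years = year_of_event(long_rows, 3)
print(years)
```

Ids that never chose the event are absent from the result, which corresponds to eventhappens being 0 for them. The sketch assumes each event code appears at most once per id.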
In Stata, if you have data such as this:
Location Person 1 Gifts Person 2 Gifts Person 3 Gifts Person 4 Gifts
1 2 7 1
2 4 1 12 2
3 5 5 5 5
4 4 1
What is the easiest way to create a new variable, 'over_three_less_than_six', that counts how many people per location gave 3 or more gifts but fewer than 6? I want it to ignore missing values. So in the above example the new column would output:
over_three_less_than_six
0
1
4
1
I beg to differ on style in variable naming! I assume variables named gift1 ... gift4:
gen count = 0
quietly forval j = 1/4 {
replace count = count + inrange(gift`j', 3, 5)
}
See also for a detailed review of technique
SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced
.pdf freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0046
inlist(gift`j', 3, 4, 5)
would also work instead of the inrange() call.
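The same row-wise count can be checked by hand with a short sketch in plain Python. The values are those from the question's table, but the positions of the missing entries (None) within each row are assumptions, since the layout in the question does not show which person's count is blank; count_in_range is a hypothetical helper name.

```python
# Gifts per location for persons 1-4; None = missing.
# Where exactly the missing entries sit within a row is assumed.
gifts = [
    [2, 7, 1, None],
    [4, 1, 12, 2],
    [5, 5, 5, 5],
    [4, 1, None, None],
]


def count_in_range(row, low, high):
    """Count non-missing values with low <= value <= high."""
    return sum(1 for g in row if g is not None and low <= g <= high)


over_three_less_than_six = [count_in_range(row, 3, 5) for row in gifts]
print(over_three_less_than_six)
```

Missing entries contribute nothing to the count, so the result does not depend on where they sit in a row.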