In Stata, if you have data such as this:
Location Person 1 Gifts Person 2 Gifts Person 3 Gifts Person 4 Gifts
1 2 7 1
2 4 1 12 2
3 5 5 5 5
4 4 1
What is the easiest way to create a new variable, 'over_three_less_than_six', that counts how many people per location gave 3 or more gifts but fewer than 6? It should ignore missing values. In the above example the new column would be:
over_three_less_than_six
0
1
4
1
I beg to differ on style in variable naming! I assume variables such as gift1 ... gift4
gen count = 0
quietly forval j = 1/4 {
replace count = count + inrange(gift`j', 3, 5)
}
See also, for a detailed review of technique:
Cox, N. J. 2009. Speaking Stata: Rowwise. Stata Journal 9(1): 137-157 (article pr0046).
The article shows how to exploit functions, egen functions, and Mata for working rowwise; rowsort and rowranks are introduced.
A .pdf is freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0046
inlist(gift`j', 3, 4, 5)
would also work instead of the inrange() call.
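For a self-contained check, the loop can be run on the example data. This is only a sketch: the names gift1 ... gift4 follow the assumption above, and because the table in the question does not show which cells are empty, the positions of the missing values below are a guess. The second approach uses egen's anycount() function, which matches against a list of integer values, so it assumes gift counts are whole numbers; count2 is a hypothetical variable name.

```stata
clear
* Example data; where the missing values sit is assumed, not known
input location gift1 gift2 gift3 gift4
1 2 7 1 .
2 4 1 12 2
3 5 5 5 5
4 4 1 . .
end

* Loop version: inrange() returns 0 for missing values, so they are ignored
gen count = 0
quietly forval j = 1/4 {
    replace count = count + inrange(gift`j', 3, 5)
}

* egen version: count how many of gift1-gift4 equal 3, 4, or 5
egen count2 = anycount(gift1-gift4), values(3/5)

list location count count2
```

Both variables should agree with the column requested in the question.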
I am trying to calculate attack rate in some populations using GEE, but at this time only have total number of non-cases. My dataset has individual level data for each case, and a population-wide count for number of non-cases.
In order to do the GEE, I am trying to create non-case observations so that I have one observation for each non-case.
For example:
In this starting data...
PopID CaseNum N_cases N_noncases
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
I would need to create 3 new observations with PopID 14 and 2 new observations with PopID 15,
so it would look like this:
PopID CaseNum No_case N_cases N_noncases
14 . . 2 3
14 1 . . .
14 2 . . .
14 . 1 . 3
14 . 2 . 3
14 . 3 . 3
15 . . 5 2
15 1 . . .
15 2 . . .
15 3 . . .
15 4 . . .
15 5 . . .
15 . 1 . 2
15 . 2 . 2
Once I have the non-case observations, I'm planning to separate into case-level and population-level datasets before doing my GEE in the case-level dataset.
I have tried a DO-UNTIL loop set to end when no_case=n_noncases, but it just continues forever and never stops.
data test1;
set test;
do until (no_case=n_noncases) ;
no_case +1;
by Popid;
output;
end; run;
I am open to any and all other ways of doing this :) (I also attempted a proc sql, but that went downhill quickly because I have only ever used them to go from case level to population level data, and not vice versa)
Weird but doable.
data have;
input PopID CaseNum N_cases N_noncases;
cards;
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
;
run;
data want;
set have;
by popID;
retain nrecs;
if first.popID then
nrecs=N_noncases;
output;
if last.popid then
do;
N_noncases=nrecs;
call missing(caseNum);
do No_case=1 to nrecs;
output;
end;
end;
keep popID Casenum No_case n_cases N_noncases;
run;
Probably clearer if you do it in steps.
Generate the "non-cases":
data controls;
set have(keep=popid n_noncases);
where n_noncases > 0 ;
do controlnum=1 to n_noncases;
output;
end;
run;
Then combine with the "cases":
data want;
set have(where=(not missing(casenum))) controls;
by popid;
run;
If performance is an issue then make the first one a data step view instead.
data controls / view=controls;
...
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 householdID byte(childID HHmemberID)
"0940041260" 1 3
"0940041030" 1 .
"0940041030" 2 .
"0940041030" 3 3
"0940041030" 4 .
"0940041030" 5 .
"0940041110" 1 3
"0940041100" 2 3
"0940041100" 3 4
"0940041100" 4 .
"0940041080" 1 .
"0940041080" 2 .
"0940041080" 3 .
"0940041060" 1 3
"0940041140" 1 .
"0940041180" 1 .
"0940041010" 1 .
"0940041010" 2 .
"0940041040" 1 .
"0940041040" 2 .
"0940041190" 1 .
"0940041190" 2 .
"0940041220" 1 3
"0940041160" 1 3
"0940041170" 1 .
"0940041170" 2 .
end
I am trying to work out each household's size and how many children it has, but I don't know how to do that in Stata. Is there a way to do this? The largest value of HHmemberID and of childID within a household gives those numbers, but I don't know how to extract that information.
If you want this info in your original data, you can use extended generate:
bysort householdID: egen N_members = max(HHmemberID)
bysort householdID: egen N_kids = max(childID)
If you want a new dataset with only that data, you should collapse:
collapse (max) N_members = HHmemberID N_kids = childID, by(householdID)
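To look at one row per household without collapsing, the egen approach can be combined with egen's tag() function. The following is a minimal sketch on three households taken from the example data; the variable name first is hypothetical. Note that max() returns missing when every value in the group is missing, as happens for HHmemberID in household 0940041080.

```stata
clear
input str10 householdID byte(childID HHmemberID)
"0940041260" 1 3
"0940041100" 2 3
"0940041100" 3 4
"0940041100" 4 .
"0940041080" 1 .
"0940041080" 2 .
"0940041080" 3 .
end

* max() ignores missing values within each household
bysort householdID: egen N_members = max(HHmemberID)
bysort householdID: egen N_kids = max(childID)

* tag() marks exactly one observation per household
egen first = tag(householdID)
list householdID N_members N_kids if first, noobs
```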
I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error code I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
     +---------------------------------------------------------------------------------------------------------------------------------------+
     | ID  gender   date1  Injury_11  Injury_21  Injury_31   date2  Injury_12  Injury_22  Injury_32   date3  Injury_13  Injury_23  Injury_33 |
     |---------------------------------------------------------------------------------------------------------------------------------------|
  1. |  1       0  12-Mar          1          2          3  23-Jun          1          2          .  30-Aug          8          9         10 |
  2. |  2       1   2-Apr          4          .          .                    .          .          .                    .          .          . |
  3. |  3       1   1-Feb          5          6          .                    .          .          .                    .          .          . |
     +---------------------------------------------------------------------------------------------------------------------------------------+
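The command as posted in the question (reform ...) is not valid Stata syntax. The command is reshape wide, and every variable that varies within ID must be listed before the comma.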
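The error message quoted in the question appears when a variable left out of the reshape list varies within ID. One way to check for that before reshaping is sketched below on the same example data; gender is the only variable omitted from the list here, so it is the only one that has to be constant within ID.

```stata
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end

* Sort by gender within ID; if the first and last values agree,
* gender is constant within ID and can safely be left out of the list
bysort ID (gender): assert gender[1] == gender[_N]

reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
```

If the assert fails, the variable either belongs in the list of variables to reshape or needs to be fixed or dropped first.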
So, I'm familiar with merges in SAS, and haven't had issues before, but I noticed an issue today that has never been an issue before.
For some reason the actual merging of observations is working properly in more complex data sets, however it only lists the variable values from one of the data sets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
data dataset2;
input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
data dataset3;
merge dataset1 dataset2;
by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge is merging the observations and then just displaying the second data set's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've just looked over), but I've no idea what's happening, since I've never come across the issue before.
Anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one that is listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally when you are merging two datasets the only variables they have in common are the key variables. So the fact that the values read from the first dataset are overwritten by the values read from the second dataset didn't matter.
To get what you want you can use the UPDATE statement instead. UPDATE will not replace a value with a missing value. Basically it is designed to apply transactions to a master dataset.
Since it looks like each ID has only one observation in DATASET1, you could just use DATASET1 as your master dataset.
data want ;
update dataset1 dataset2 ;
by id ;
run;
In Stata, I have a dataset like this:
obs v2 v3 v4 v5 v6
1 . 3 . . 1
2 2 . . 4 5
3 . 7 . . .
4 1 . 1 . 4
How can I find all of the columns that have a non "." value in them, by row?
For example, I want to find that:
obs 1 has non-empty values for v3 and v6.
obs 2 has non-empty values for v2, v5, and v6.
obs 3 has non-empty values for v3.
obs 4 has non-empty values for v2, v4, and v6.
Here is pseudocode of one way that is not efficient at all (I want to find a better, faster way):
Create new variables, v2a ... v6a. v2a will take string value "v2" if there is a non-empty value in the row and 0 otherwise. Do this for all 'a' variables.
Concatenate all the a variables.
I don't need a new variable per se. If it just outputted onto the screen, that would be great too.
This code is not very elegant, but it does the job.
clear
input obs v2 v3 v4 v5 v6
1 . 3 . . 1
2 2 . . 4 5
3 . 7 . . .
4 1 . 1 . 4
end
gen strL nonmiss=""
foreach var of varlist v2-v6 {
replace nonmiss=nonmiss+" "+"`var'" if !missing(`var')
}
list nonmiss
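If the leading space in nonmiss is unwanted, or only the number of non-missing values per row is needed, a small extension along these lines may help. This is a sketch: n_nonmiss is a hypothetical variable name, and strtrim() assumes Stata 14 or later (older versions use trim()).

```stata
clear
input obs v2 v3 v4 v5 v6
1 . 3 . . 1
2 2 . . 4 5
3 . 7 . . .
4 1 . 1 . 4
end

gen strL nonmiss = ""
foreach var of varlist v2-v6 {
    replace nonmiss = nonmiss + " " + "`var'" if !missing(`var')
}

* Remove the leading space left by the concatenation
replace nonmiss = strtrim(nonmiss)

* Number of non-missing values among v2-v6 in each row
egen n_nonmiss = rownonmiss(v2-v6)

list obs nonmiss n_nonmiss
```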