I am trying to calculate attack rates in some populations using GEE, but at this time I only have the total number of non-cases. My dataset has individual-level data for each case, and a population-wide count of non-cases.
In order to do the GEE, I am trying to create non-case observations so that I have one observation for each non-case.
For example:
In this starting data...
PopID CaseNum N_cases N_noncases
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
I would need to create 3 new observations in PopID 14 and 2 new observations in PopID 15,
so it would look like this:
PopID CaseNum No_case N_cases N_noncases
14 . . 2 3
14 1 . . .
14 2 . . .
14 . 1 . 3
14 . 2 . 3
14 . 3 . 3
15 . . 5 2
15 1 . . .
15 2 . . .
15 3 . . .
15 4 . . .
15 5 . . .
15 . 1 . 2
15 . 2 . 2
Once I have the non-case observations, I'm planning to separate into case-level and population-level datasets before doing my GEE in the case-level dataset.
I have tried a DO-UNTIL loop set to end when no_case=n_noncases, but it just continues forever and never stops.
data test1;
set test;
do until (no_case=n_noncases) ;
no_case +1;
by Popid;
output;
end; run;
I am open to any and all other ways of doing this :) (I also attempted a proc sql, but that went downhill quickly because I have only ever used them to go from case level to population level data, and not vice versa)
Weird but doable.
data have;
input PopID CaseNum N_cases N_noncases;
cards;
14 . 2 3
14 1 . .
14 2 . .
15 . 5 2
15 1 . .
15 2 . .
15 3 . .
15 4 . .
15 5 . .
;
run;
data want;
set have;
by popID;
retain nrecs;
if first.popID then
nrecs=N_noncases;
output;
if last.popid then
do;
N_noncases=nrecs;
call missing(caseNum);
do No_case=1 to nrecs;
output;
end;
end;
keep popID Casenum No_case n_cases N_noncases;
run;
Probably clearer if you do it in steps.
Generate the "non-cases" .
data controls;
set have(keep=popid n_noncases);
where n_noncases > 0 ;
do controlnum=1 to n_noncases;
output;
end;
run;
Then combine with the "cases".
data want;
set have(where=(not missing(casenum))) controls;
by popid;
run;
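For the GEE step, it may help to carry an explicit 0/1 outcome instead of inferring case status from which counter is missing. A minimal sketch, assuming the have and controls datasets above; is_case and in_case are names made up for illustration:

```sas
data want;
    /* IN= sets in_case to 1 when the current row came from HAVE */
    set have(where=(not missing(casenum)) in=in_case)
        controls;
    by popid;
    /* hypothetical outcome variable: 1 for a case row, 0 for a generated non-case row */
    is_case = in_case;
run;
```

Both inputs need to be sorted by popid for the interleave to work, which holds for the example data because controls is built from have.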
If performance is an issue then make the first one a data step view instead.
data controls / view=controls;
...
I'm not sure of the best way to describe this, and I'll admit that the code I wrote to recreate the problem in a smaller format isn't quite accurate.
I have 7 data sets that have the same number of columns (122) but a different number of rows. The labels for these columns are identical except for an underscore and an integer. Example: first column of each data set is "study_id_1" "study_id_2" ... "study_id_7"
I am trying to stack each of these data sets, in numerical order, on top of each other AND drop the underscore and integer.
However, if I use this code, all of the values are in chunks but along a diagonal.
data all;
set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
run;
The following code (written in SAS Studio) pretty much illustrates the problem and the "diagonal." However, in my actual data (working in SAS EG), all of the missing values are periods, regardless of variable type. In the example below, I could only get periods to appear for missing values for the numerical variables.
data have;
input study_id_1 $ variable1_1 $ variable2_1 variable3_1 study_id_2 $ variable1_2 $ variable2_2 variable3_2 study_id_3 $ variable1_3 $ variable2_3 variable3_3;
cards;
A treatment 35 24 . . . . . . . .
B placebo 24 44 . . . . . . . .
C treatment 66 77 . . . . . . . .
D placebo 73 45 . . . . . . . .
. . . . A treatment 23 34 . . . .
. . . . B placebo 43 56 . . . .
. . . . C treatment 34 34 . . . .
. . . . D placebo 54 67 . . . .
. . . . . . . . A treatment 22 66
. . . . . . . . B placebo 33 67
. . . . . . . . C treatment 23 48
. . . . . . . . D placebo 69 70
;
run;
proc print data=have;
run;
data want;
input study_id $ variable1 $ variable2 variable3;
cards;
A treatment 35 24
B placebo 24 44
C treatment 66 77
D placebo 73 45
A treatment 23 34
B placebo 43 56
C treatment 34 34
D placebo 54 67
A treatment 22 66
B placebo 33 67
C treatment 23 48
D placebo 69 70
;
run;
proc print data=want;
run;
I hope I've described the problem sufficiently and thanks for any help.
The first non-missing from a list of values is returned by the COALESCE and COALESCEC functions.
Specifying the variable lists is simple in your data set because like variables share a common prefix (with suffixes 1, 2, 3). The syntax for such a prefix list is <prefix>: (the prefix followed by a colon).
Example:
data want;
set have;
* coalesce during stacking;
* set PT_BS1_all PT_BS2_all PT_BS3_all PT_BS4_all PT_BS5_all PT_BS6_all PT_BS7_all;
length study_id $8 variable1 $9;
study_id = coalesceC(of study_id_:);
variable1 = coalesceC(of variable1_:);
variable2 = coalesce (of variable2_:);
variable3 = coalesce (of variable3_:);
drop study_id_: variable1_: variable2_: variable3_:;
run;
Rather than cleaning up the stacked output after the fact (the diagonal comes from the misaligned column names), fix the inputs by renaming their columns first. Specifically, use scan to strip the suffix after the underscore, build a macro variable of oldname=newname pairs with proc sql, and pass that macro variable to a subsequent rename statement.
The code below assumes all datasets reside in the WORK library. Adjust the SQL WHERE clause accordingly.
%macro rename_cols(dset);
proc sql noprint;
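/* keep everything before the LAST underscore, so study_id_1 becomes study_id rather than study */
select cats(name, '=', substr(name, 1, length(name) - length(scan(name, -1, '_')) - 1))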
into :suffix_clean separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = "&dset.";
quit;
data &dset;
set &dset;
rename &suffix_clean;
run;
%mend rename_cols;
%rename_cols(PT_BS1_ALL);
%rename_cols(PT_BS2_ALL);
%rename_cols(PT_BS3_ALL);
%rename_cols(PT_BS4_ALL);
%rename_cols(PT_BS5_ALL);
%rename_cols(PT_BS6_ALL);
%rename_cols(PT_BS7_ALL);
data all;
set PT_BS1_all
PT_BS2_all
PT_BS3_all
PT_BS4_all
PT_BS5_all
PT_BS6_all
PT_BS7_all;
run;
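If rewriting each dataset just to rename its columns is a concern, PROC DATASETS can apply the same rename list to the dataset header without copying the data. A sketch under the same assumptions (datasets in WORK, every column name ending in an underscore and an integer); rename_cols_inplace and rename_pairs are hypothetical names:

```sas
%macro rename_cols_inplace(dset);
    %local rename_pairs;
    proc sql noprint;
        /* build oldname=newname pairs, keeping everything before the last underscore */
        select cats(name, '=',
                    substr(name, 1, length(name) - length(scan(name, -1, '_')) - 1))
            into :rename_pairs separated by ' '
            from dictionary.columns
            where libname = 'WORK' and memname = upcase("&dset");
    quit;

    /* MODIFY ... RENAME changes the variable names in place */
    proc datasets library=work nolist;
        modify &dset;
        rename &rename_pairs;
    quit;
%mend rename_cols_inplace;

%rename_cols_inplace(PT_BS1_all);
```

The stacking data step then stays the same as above.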
I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error code I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?
Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| ID gender date1 Injury_11 Injury_21 Injury_31 date2 Injury_12 Injury_22 Injury_32 date3 Injury_13 Injury_23 Injury_33 |
|----------------------------------------------------------------------------------------------------------------------------------------------------|
1. | 1 0 12-Mar 1 2 3 23-Jun 1 2 . 30-Aug 8 9 10 |
2. | 2 1 2-Apr 4 . . . . . . . . |
3. | 3 1 1-Feb 5 6 . . . . . . . |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
The command provided is not valid Stata syntax.
I need to create a new variable that repeats the earliest date for an ID's visits. If the date is missing, the new variable should be missing, and after a missing value it should carry forward the first date that follows it (as in the example). I've tried the LAG function and it didn't work; I also tried KEEP, but it just repeated 25NOV2015 for all records. The final result ("what I need") is in the last column.
Thanks
Example
You need to use the RETAIN statement. RETAIN means the variable's value is not reset to missing at the start of each observation, so in the next iteration of the data step the variable remembers its value.
Sample data
data a;
input date;
format date ddmmyy10.;
datalines;
.
5
6
7
.
1
2
.
9
;
run;
Solution
data b;
set a;
retain new_date;
format new_date ddmmyy10.;
if date = . then
new_date = .;
if new_date = . then
new_date = date;
run;
Since you didn't post any data I will make up some. Also since the fact that your variable is a date doesn't really impact the answer I will just use some integers as they are easier to type.
data have ;
input id value @@ ;
cards;
1 . 1 2 1 3 1 . 1 5 1 6 1 . 1 8
2 1 2 2 2 3 2 . 2 5 2 6
;;;;
Basically your algorithm says that you want to store the value when either the current value is missing or stored value is missing. With multiple BY groups you would also want to set it when you start a new group.
data want ;
set have ;
by id ;
retain new_value ;
if first.id or missing(new_value) or missing(value)
then new_value=value;
run;
Results:
new_
Obs id value value
1 1 . .
2 1 2 2
3 1 3 2
4 1 . .
5 1 5 5
6 1 6 5
7 1 . .
8 1 8 8
9 2 1 1
10 2 2 1
11 2 3 1
12 2 . .
13 2 5 5
14 2 6 5
So, I'm familiar with merges in SAS, and haven't had issues before, but I noticed an issue today that has never been an issue before.
For some reason the actual merging of observations is working properly in more complex data sets, however it only lists the variable values from one of the data sets (e.g. it doesn't overwrite missing values).
For instance, I wrote up this simple program:
data dataset1;
input id var1 var2 var3 var4;
cards;
1 . . 2 2
2 . . 2 2
3 . . 2 2
4 . . 2 2
5 . . 2 2
6 . . 2 2
7 . . 2 2
8 . . 2 2
9 . 2 . 2
10 1 . . .
;
data dataset2;
input id var1 var2 var3 var4;
cards;
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
10 . 1 . .
;
data dataset3;
merge dataset1 dataset2;
by id;
run;
This should yield the following:
id var1 var2 var3 var4
1 2 2 2 2
2 2 2 2 2
3 2 2 2 2
4 2 2 2 2
5 2 2 2 2
6 2 2 2 2
7 2 2 2 2
8 2 2 2 2
9 . 2 . 2
10 1 1 . .
but instead, I get this:
id var1 var2 var3 var4
1 2 2 . .
2 2 2 . .
3 2 2 . .
4 2 2 . .
5 2 2 . .
6 2 2 . .
7 2 2 . .
8 2 2 . .
9 . 2 . 2
10 . 1 . .
So, it's as if the merge is merging the observations and then just displaying the second data set's values.
I've tried to figure out the issue (I have a feeling it's something very basic I've just looked over), but I've no idea what's happening, since I've never come across the issue before.
Anyone know what's going wrong?
Thanks for any help.
Your problem is that you are merging the datasets by ID but both datasets have the variables VAR1-VAR4. So when both datasets contribute to an observation, the one that is listed last in the MERGE statement will "win".
The reason you probably never saw this before is that normally when you are merging two datasets the only variables they have in common are the key variables. So the fact that the values read from the first datasets are overwritten by the values read from the second dataset didn't matter.
To get what you want you can use the UPDATE statement instead. Update will not replace a value with a missing value. Basically it is designed to apply transactions to a master dataset.
Since it looks like each ID only has one observation in DATASET1, you could just use DATASET1 as your master dataset.
data want ;
update dataset1 dataset2 ;
by id ;
run;
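If DATASET1 could have more than one observation per ID, UPDATE is a poorer fit, since it expects unique BY values in the master dataset. One alternative sketch is to keep MERGE, rename the second dataset's variables, and combine the pairs with COALESCE. The t1-t4 names and the arrays are assumptions for illustration, and DATASET2's value takes priority when both are non-missing:

```sas
data want;
    merge dataset1
          dataset2(rename=(var1=t1 var2=t2 var3=t3 var4=t4));
    by id;
    array v[4] var1-var4;   /* values from DATASET1 */
    array t[4] t1-t4;       /* renamed values from DATASET2 */
    do i = 1 to 4;
        /* take DATASET2's value unless it is missing */
        v[i] = coalesce(t[i], v[i]);
    end;
    drop i t1-t4;
run;
```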
In Stata, if you have data such as this:
Location Person 1 Gifts Person 2 Gifts Person 3 Gifts Person 4 Gifts
1 2 7 1
2 4 1 12 2
3 5 5 5 5
4 4 1
What is the easiest way to create a new variable, 'over_three_less_than_six', that counts how many people per location gave 3 or more gifts but fewer than 6? I want it to ignore missing values. So in the above example the new column would output:
over_three_less_than_six
0
1
4
1
I beg to differ on style in variable naming! I assume variables such as gift1 ... gift4
gen count = 0
quietly forval j = 1/4 {
replace count = count + inrange(gift`j', 3, 5)
}
See also for a detailed review of technique
SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced
.pdf freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0046
inlist(gift`j', 3, 4, 5)
would also work instead of the inrange() call.