Propensity scoring with 4 levels of treatment exposure in SAS

I am aware of a similar post from a few years back: 3 Level propensity score matching SAS. I am still unsure how to accomplish propensity score matching when there is a control group (0) and 3 levels of exposure (1, 2, 3). I have attempted to build 3 separate data sets comparing 0 vs 1, 0 vs 2, and 0 vs 3. The first produces a matched table with the expected number of observations. However, the second and third do not produce a matched data set. What am I missing?
/*propensity matching CMT 0 vs 1*/
proc psmatch data=SMTgroup01;
class CMT_group Gender;
psmodel CMT_group(Treated="1")=claim_count allowed gender age R_Risk;
match method=optimal(k=1) stat=lps exact=gender caliper=0.25;
output out(obs=match)=propscore01 matchid=_MatchID;
run;
/*propensity matching CMT 0 vs 2*/
proc psmatch data=SMTgroup02;
class CMT_group Gender;
psmodel CMT_group(Treated="2")=claim_count allowed gender age R_Risk;
match method=optimal(k=1) stat=lps exact=gender caliper=0.25;
output out(obs=match)=propscore02 matchid=_MatchID;
run;
/*propensity matching CMT 0 vs 3*/
proc psmatch data=SMTgroup03;
class CMT_group Gender;
psmodel CMT_group(Treated="3")=claim_count allowed gender age R_Risk;
match method=optimal(k=1) stat=lps exact=gender caliper=0.25;
output out(obs=match)=propscore03 matchid=_MatchID;
run;

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This works for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can do this easily in SQL, because MAX() also works on character variables there.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;
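The asker's own delete-then-merge approach can also be kept for arbitrary id values if KEY is derived from HAVE itself instead of being counted from 1 to n. The following is only a sketch under that assumption; the dataset names key, want_pre and want follow the question, and nothing here has been timed on a large file.

```sas
/* Build KEY from HAVE: one row per distinct id, whatever its type or format */
proc sort data=have(keep=id) out=key nodupkey;
  by id;
run;

/* Keep only the rows that actually carry a names value, sorted for the merge */
proc sort data=have out=want_pre;
  where not missing(names);
  by id;
run;

/* Every id from KEY survives; ids with no names value get a missing names */
data want;
  merge key want_pre;
  by id;
run;
```

This needs two sorts and a merge, so on a few million rows the single-pass SQL, UPDATE, or DOW-loop answers above are likely the better choice; the sketch only shows that part (B) of the question does not require knowing the id values in advance.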

How to convert a SAS data set to a data step

How can I convert my SAS data set into a DATA step that I can easily paste into the forum or hand over to someone so they can replicate my data? Ideally, I'd also like to control the number of records that are included.
For example, I have sashelp.class in the SASHELP library, but I want to provide it here so others can use it as the starting point for my question.
To do this, you can use a macro written by Mark Jordan at SAS; the code is also stored on GitHub.
You need to provide the data set name, including the library, and the number of observations you want to output. Observations are taken in order. The code will then appear in your SAS log.
*data set you want to create demo data for;
%let dataSetName = sashelp.Class;
*number of observations you want to keep;
%let obsKeep = 5;
******************************************************
DO NOT CHANGE ANYTHING BELOW THIS LINE
******************************************************;
%let source_path = https://gist.githubusercontent.com/statgeek/bcc55940dd825a13b9c8ca40a904cba9/raw/865d2cf18f5150b8e887218dde0fc3951d0ff15b/data2datastep.sas;
filename reprex url "&source_path";
%include reprex;
filename reprex;
option linesize=max;
%data2datastep(dsn=&dataSetName, obs=&obsKeep);
This may not work if you do not have access to the GitHub page. In that case, you can manually navigate to the page (same link) and copy and paste the macro into SAS. Then run the program, and afterwards run only the last step, the %data2datastep(dsn=, obs=); call.
This topic came up recently on SAS Communities, and I created a somewhat more robust macro than the one Reeza linked. You can see it on GitHub: ds2post.sas
* Pull macro definition from GITHUB ;
filename ds2post url
'https://raw.githubusercontent.com/sasutils/macros/master/ds2post.sas'
;
%include ds2post ;
For example if you wanted to share the first 5 observations of SASHELP.CARS you would run this macro call:
%ds2post(sashelp.cars,obs=5)
Which would generate this code to the SAS log:
data work.cars (label='2004 Car Data');
infile datalines dsd dlm='|' truncover;
input Make :$13. Model :$40. Type :$8. Origin :$6. DriveTrain :$5.
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
Weight Wheelbase Length
;
format MSRP dollar8. Invoice dollar8. ;
label EngineSize='Engine Size (L)' MPG_City='MPG (City)'
MPG_Highway='MPG (Highway)' Weight='Weight (LBS)'
Wheelbase='Wheelbase (IN)' Length='Length (IN)'
;
datalines4;
Acura|MDX|SUV|Asia|All|36945|33337|3.5|6|265|17|23|4451|106|189
Acura|RSX Type S 2dr|Sedan|Asia|Front|23820|21761|2|4|200|24|31|2778|101|172
Acura|TSX 4dr|Sedan|Asia|Front|26990|24647|2.4|4|200|22|29|3230|105|183
Acura|TL 4dr|Sedan|Asia|Front|33195|30299|3.2|6|270|20|28|3575|108|186
Acura|3.5 RL 4dr|Sedan|Asia|Front|43755|39014|3.5|6|225|18|24|3880|115|197
;;;;
Try this little test to compare the two macros.
First make a sample dataset with a couple of issues.
data testit;
set sashelp.class (obs=5);
if _n_=1 then name='Le Bron';
if _n_=2 then age=.;
if _n_=3 then wt=.;
if _n_=4 then name='12;34';
run;
Then run both macros to dump code to the SAS log.
%ds2post(testit);
%data2datastep(dsn=testit,obs=20);
Copy the code from the log, changing the dataset names in the DATA statements so that they do not overwrite the original dataset or each other. Run both programs and compare the results to the original.
proc compare data=testit compare=testit1; run;
proc compare data=testit compare=testit2; run;
Result using %DS2POST:
The COMPARE Procedure
Comparison of WORK.TESTIT with WORK.TESTIT1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT1 02NOV18:17:10:29 02NOV18:17:10:29 6 5
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
Last Obs 5 5
Number of Observations in Common: 5.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
Summary of results using %Data2DataStep:
Comparison of WORK.TESTIT with WORK.TESTIT2
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT2 02NOV18:17:10:29 02NOV18:17:10:29 6 3
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
First Unequal 1 1
Last Unequal 3 3
Last Match 3 3
Last Obs 5 .
Number of Observations in Common: 3.
Number of Observations in WORK.TESTIT but not in WORK.TESTIT2: 2.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT2: 3.
Number of Observations with Some Compared Variables Unequal: 3.
Number of Observations with All Compared Variables Equal: 0.
Variable Values Summary
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 4.
Total Number of Values which Compare Unequal: 12.
Maximum Difference: 0.
Variables with Unequal Values
Variable Type Len Ndif MaxDif MissDif
Name CHAR 8 1 0
Sex CHAR 1 3 3
Age NUM 8 2 0 2
Height NUM 8 3 0 3
Weight NUM 8 3 0 3
Note that I am sure there are values that will cause trouble for my macro also. But hopefully they are caused by data that is less likely to occur than spaces or semi-colons.

SAS - Survey Select - Selecting Different Sample Size per Stratum

I have a list of financial advisors and I need to pull 4 samples per advisor. The catch is that within those 4 samples I need to force, say, 2 mortgages, 1 loan, and 1 credit card.
Is there a way in PROC SURVEYSELECT to set a specific number of samples to pull per stratum? I know you can stratify on one category and use an equal number for every stratum. I was hoping I could use a mapping of employee names plus the number of samples left to pull for each category, and have SURVEYSELECT use that to pull the sample dynamically.
I'm using the code below as an example, but it only stratifies on employee and gives me 4 per employee. I would need to further stratify on product type and set a specific sample size per product.
proc surveyselect data=work.Emp_Table_Final
method=srs n=4 out=work.testsample SELECTALL;
strata Employee_No;
run;
Thanks. I know it might sound complicated, but if I know it's possible then I can google the rest.
Yes, the n= option can point to a dataset instead of a number. That dataset must:
Contain the strata variables as well as a variable SampleSize or _NSIZE_ holding the number to select
Use the same type and length for the strata variables as the input dataset
Be sorted by the strata variables
Have an entry for every stratum (every combination of strata variable values) in the input dataset
See the documentation for more details.
data sample_counts;
length sex $1;
input sex $ _NSIZE_;
datalines;
F 5
M 3
;;;;
run;
proc sort data=sashelp.class out=class;
by sex;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex;
run;
With two strata variables it works the same way; sample_counts just needs both variables. That makes the dataset a lot more tedious to type by hand, so you may want to generate it automatically.
proc sort data=sashelp.class out=class;
by sex age;
run;
data sample_counts;
length sex $1;
input sex $ age _NSIZE_;
datalines;
F 11 1
F 12 1
F 13 1
F 14 1
F 15 1
M 11 1
M 12 1
M 13 1
M 14 1
M 15 1
M 16 0
;;;;
run;
/* or do it in an automated way*/
data sample_counts;
set class;
by sex age; *your strata;
if first.age then do; *do this once per stratum level;
if age le 15 then _NSIZE_ = 1; *whatever your logic is for defining _NSIZE_;
else _NSIZE_=0;
output;
end;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex age;
run;
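Applied to the question's data, the same pattern might look like the sketch below. The variable name Product_Type and its values 'Mortgage', 'Loan' and 'Credit Card' are assumptions, since the question does not show them; only work.Emp_Table_Final, Employee_No and work.testsample come from the original code.

```sas
/* Product_Type and its three values are hypothetical; adjust to the real data. */
/* Keep only the product types that should be sampled, sorted by both strata.   */
proc sort data=work.Emp_Table_Final out=work.emp_sorted;
  where Product_Type in ('Mortgage', 'Loan', 'Credit Card');
  by Employee_No Product_Type;
run;

/* One row per employee and product type, with the number to select */
data work.sample_counts;
  set work.emp_sorted(keep=Employee_No Product_Type);
  by Employee_No Product_Type;
  if first.Product_Type;                       /* once per stratum */
  if Product_Type = 'Mortgage' then _NSIZE_ = 2;
  else _NSIZE_ = 1;                            /* loan and credit card */
run;

/* SELECTALL takes every unit when a stratum has fewer units than requested */
proc surveyselect data=work.emp_sorted n=work.sample_counts
                  method=srs selectall out=work.testsample;
  strata Employee_No Product_Type;
run;
```

Because sample_counts is built from the sorted input itself, it automatically has an entry for every stratum that occurs, with matching types and lengths. An advisor with no mortgages at all simply has no mortgage stratum, so that advisor ends up with fewer than 4 samples.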

Summing values by character in SAS

I created this fakedata as an example:
data fakedata;
length name $5;
infile datalines;
input name count percent;
return;
datalines;
Ania 1 17
Basia 1 3
Ola 1 10
Basia 1 52
Basia 1 2
Basia 1 16
;
run;
The result I want to have is the same table, but with the counts and percents summed for Basia.
I would like a single row for Basia, as if she appeared only once in the table, with count 4 and percent 83. I tried recoding name as a number so I could use GROUP BY in proc sql, but SAS turned the GROUP BY into an ORDER BY (I got a message to that effect). I suppose it isn't difficult, but I can't find the solution. I also tried some arrays without any success. Any help appreciated!
It sounds like proc sql does what you want:
proc sql;
select name, count(*) as cnt, sum(percent) as sum_percent
from fakedata
group by name;
quit;
You can add a where clause to get the results just for one name.
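If count is not always 1, summing the count column rather than counting rows is presumably closer to what the question asks for. A minimal variant, where the output table name summed is an arbitrary choice:

```sas
proc sql;
  create table summed as
  select name,
         sum(count)   as count,    /* total of the count column per name */
         sum(percent) as percent   /* total of the percent column per name */
  from fakedata
  group by name;
quit;
```

Because the SELECT list uses summary functions, the GROUP BY is honoured here instead of being converted to an ORDER BY.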
Hm, actually I got an answer.
proc summary data=fakedata nway;
class name; /* CLASS works on unsorted input; BY name would need a PROC SORT first */
var count percent;
output out=wynik (drop = _FREQ_ _TYPE_) sum(count)=count sum(percent)=percent;
run;
You can probably go back a step and use PROC FREQ to generate this output in a single step. Based on the counts, the percents are not correct, but I'm not sure they're intended to be; right now they add up to over 100%. If you already have some summaries, use the WEIGHT statement to account for the counts.
proc freq data=fakedata;
table name;
weight count;
run;

SAS - Proc Compare - show ALL duplicates

While using PROC COMPARE in SAS, is it possible to list all duplicates found? By default, a message is displayed stating the first duplicate found and the total number of duplicates.
For example:
data x1;
input x $ y $ z $ ;
datalines;
222 test abc
qqq test abc
aaa test abc
222 test abc
222 test abc
;
run;
data y1;
input x $ y $ z $ ;
datalines;
222 test abc
qqq test abc
aaa test abc
222 test abc
222 test abc
;
run;
***********************************;
*** sort data;
***********************************;
proc sort data=x1;
by x y;
run;
proc sort data=y1;
by x y;
run;
***********************************;
*** compare data;
***********************************;
proc compare listvar
base=x1
compare = y1;
id x y;
run;
************** END *****************;
output
The SAS System
The COMPARE Procedure
Comparison of WORK.X1 with WORK.Y1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.X1 23OCT14:16:03:38 23OCT14:16:03:38 3 5
WORK.Y1 23OCT14:16:03:38 23OCT14:16:03:38 3 5
Variables Summary
Number of Variables in Common: 3.
Number of ID Variables: 2.
WARNING: The data set WORK.X1 contains a duplicate observation at observation
number 2.
NOTE: At observation 2 the current and previous ID values are:
x=222 y=test.
NOTE: Further warnings for duplicate observations in this data set will not be
printed.
WARNING: The data set WORK.Y1 contains a duplicate observation at observation
number 2.
NOTE: At observation 2 the current and previous ID values are:
x=222 y=test.
NOTE: Further warnings for duplicate observations in this data set will not be
printed.
Observation Summary
Observation Base Compare ID
First Obs 1 1 x=222 y=test
Last Obs 5 5 x=qqq y=test
Number of Observations in Common: 5.
Number of Duplicate Observations found in WORK.X1: 2.
Number of Duplicate Observations found in WORK.Y1: 2.
Total Number of Observations Read from WORK.X1: 5.
Total Number of Observations Read from WORK.Y1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
NOTE: No unequal values were found. All values compared are exactly equal.
@Joe - thanks for the comment!
Proc Freq might be a good approach to find duplicates. Then just print them out with a Proc Print.
PROC FREQ;
TABLES keyvar / noprint out=keylist;
RUN;
PROC PRINT data=keylist;
WHERE count ge 2;
RUN;
I don't think there's a way to get the log or listing to list more than just the first duplicate, if that's what you're going after, using the ID statement.
What you are likely best off doing is using the OUTALL option, and outputting the results to a dataset (if you're not already). Then it would be fairly easy to see the duplicates.
For example:
data class2 class3;
set sashelp.class;
output;
output;
output class3;
run;
proc compare base=class2 compare=class3 out=outclass outall;
id name;
run;
You could also use the BY statement along with the ID statement, if it's sorted; then you'll still have duplicates, but each BY Group has a separate report, so you'd see the duplicates there.
proc compare base=class2 compare=class3 out=outclass outall;
by name;
id name;
run;
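Finding the exact number of duplicates for each key may be better suited to proc sql.
Something like:
proc sql;
create table x2 as select
*,
count(*) as n_rows /* number of rows sharing this x, y, z combination */
from x1
group by x,y,z;
quit;
Rows with n_rows greater than 1 are duplicates; running the same query against y1 reveals the duplicate rows in the other dataset.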
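Another way to list every duplicate, sketched here for the question's x1 dataset, is a DATA step that uses the first. and last. flags on the ID variables. It assumes x1 is already sorted by x y, as in the question; the output name x1_dups is made up for the example.

```sas
/* Keep every row whose (x, y) key occurs more than once */
data x1_dups;
  set x1;
  by x y;
  /* A key is unique only when its row is both first and last in the BY group */
  if not (first.y and last.y);
run;

proc print data=x1_dups;
run;
```

Unlike the log warning from PROC COMPARE, this lists all rows involved in a duplicate key, including the first occurrence. Repeating it with y1 covers the compare dataset.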