creating a stratified sample in SAS with known stratas - sas

I have a target population with some characteristics and I have been asked to select an appropriate control based on these characteristics. I am trying to do a stratified sample using SAS base but I need to be able to define my 4 starta %s from my target and apply these to my sample. Is there any way I can do that? Thank you!

To do stratified sampling you can use PROC SURVEYSELECT
Here is an example:-
/*Dataset creation*/
data data_dummy;
input revenue revenue_tag Premiership_level;
datalines;
1000 High 1
90 Low 2
500 Medium 3
1200 High 4
;
run;
/*Now you need to Sort by rev_tag, Premiership_level (say these are the
variables you need to do stratified sampling on)*/
proc sort data = data_dummy;
by rev_tag Premiership_level;
run;
/*Now use SURVEYSELECT to do stratified sampling using 10% samprate (You can
change this 10% as per your requirement)*/
/*Surveyselect is used to pick entries for groups such that , both the
groups created are similar in terms of variables specified under strata*/
proc surveyselect data=data_dummy method = srs samprate=0.10
seed=12345 out=data_control;
strata rev_tag Premiership_level;
run;
/*Finally tag (if you want for more clarity) your 10% data as control
group*/
Data data_control;
Set data_control;
Group = "Control";
Run;
Hope this helps:-)

Related

Create unique label for repeated units with PROC SURVEYSELECT in SAS

I need to resample from a real (cluster) trial data set. So far, I have used the following PROC SURVEYSELECT procedure in SAS to sample 10 clusters from the trial with replacement, with 50% of clusters coming from the control arm and 50% coming from the treatment arm. I repeat this 100 times to get 100 replicates with 10 clusters each and equal allocation.
proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
cluster site;
strata rx / alloc=(0.5 0.5);
run;
Since I am using unrestricted random sampling (method=urs) to sample with replacement, I specified outhits so that SAS will inform me when a cluster is sampled more than once in each replication.
However, within each replicate in the output resamples dataset, I have not found a way to easily assign a unique identifier to clusters that appear more than once. If a cluster is sampled m times within a replicate, the observations within that cluster are simply repeated m times.
I attempted to use PROC SQL to identify distinct cluster ids and their occurrences within each replication, thinking I could use that to duplicate IDs as appropriate before joining additional data as necessary.
proc sql;
create table clusterselect as
select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;
However, I cannot figure out how to simply replicate rows in SAS.
Any help is appreciate, whether it be modifying PROC SURVEYSELECT to yield a unique cluster id within each replication or repeating cluster IDs as appropriate based on numberhits.
Thank you!
Here's what I've done:
/* 100 resamples with replacement */
proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
cluster site;
strata rx / alloc=(0.5 0.5);
run;
/* identify unique sites per replicate and their num of appearances (numberhits) */
proc sql;
create table clusterSelect as
select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;
/* for site, repeat according to numberhits */
/* create unique clusterId */
data uniqueIds;
set clusterSelect;
do i = 1 to numberhits;
clusterId = cat(site, i);
output;
end;
drop i numberhits;
run;
/* append data to cluster, retaining unique id */
proc sql;
create table resDat as
select
uid.replicate,
uid.clusterId,
uid.site,
mydata.*
from uniqueIds as uid
left join mydata
on uid.site = mydata.site
quit;
Are you just asking how to convert one observation into the number of observations indicated in the NUMBERHITS variable?
data want;
set resamples;
do _n_=1 to numberhits;
output;
end;
run;

Renaming date variable to perform an intck to calculate day difference

I have this dataset and need to calculate the days' difference between each dose date per period. How do I label each period study date so I can carry out an intck to calculate the days' difference per subject (ptno)
Just use the DIF() function to calculate the change in value for your date variable. SAS stores dates as number of days so the difference will be the number of days between the two observations. You could then test if the difference is 7 days or not.
data want;
set have;
by ptno period;
interval = dif(ex_stadt);
if first.ptno then interval=0;
seven_days = (interval = 7) ;
run;
The code of Tom works very well. I simulated the data set with a few rows based in
the sample showed above and it's OK.
Only thing absent is PROC SORT. If the data set is huge the log will exhibit an error.
proc sort data=have;
by ptno period;
run;

I need to find the confidence intervals for proportions using stratified data

I'm trying to report estimates of proportions of subjects of a stratified random sample
I've tried every website I can find for SAS proc surveymeans, and I don't understand what I'm doing wrong.
data b;
set Data;
keep id texting section;
run;
proc surveyselect data=b out=samp_b method=srs n=(15,12,10,8)
seed=123;
strata section;
run;
proc surveymeans data=samp_b;
strata section;
weight SamplingWeight;
var texting;
run;
I should get confidence intervals for the strata, but they are not showing up. Also I need confidence intervals for the proportions!
I don't know what version of SAS/STAT you are using, but per SAS/STAT 9.2 Proc Surveymeans documentation pages, you can do one or both of the following:
1) Add the relevant statistics keywords to the proc surveymeans statement
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveymeans_sect007.htm
In the PROC SURVEYMEANS statement, you also can use statistic-keywords to specify statistics for the procedure to compute. Available statistics include the population mean and population total, together with their variance estimates and confidence limits. You can also request data set summary information and sample design information.
The available statistics keywords are listed and described on these pages:
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveymeans_a0000000238.htm
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_surveymeans_sect007.htm#statug.surveymeans.smeanskeys
So, to print the 95% two-sided confidence interval for the mean, you would add CLM to the end of your Proc Surveymeans statement.
2) Save the Statistics table with confidence intervals to a separate SAS dataset with an additional ods output Statistics=MyStat; statement, per these instructions.

Compare datasets and plot in sas

I have two variables in 2 separate datasets in sas. Both have a primary key of Customer_Id and another column say LVR . One dataset has old values for the LVR Column. The other one has values from the new calculation for the same column.
I need to show the differences between both on a graph.
I tried to merge them and then tried proc gplot to plot the two LVRs.
Merged dataset looks something like this :
Cust_id LVR_new LVR_old
111 1 2
222 2 .
333 5 4
The dataset containing LVR_new is almost twice in size (number of rows) than the one containing LVR_old.We got more customers qualifying post the new calculations.
The merged dataset has 3046778 observations and 3 variables.
I tried to use proc gplot using the code below:
proc plot data=djia;
plot LVR_old*LVR_new = Cust_id;
run;
This has been running since long so i don't expect the results are going to be very useful.
Can anyone please suggest how can I achieve this. I need to showcase the differences between the two datasets on a graph to be able to show the shift in the results.
Thanks!
Why not use PROC TTEST? There are some ODS GRAPHICS plots that PROC TTEST makes.
Your problem looks exactly liked the paired comparisons example in the documentation.
http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_ttest_examples03.htm
proc ttest;
paired LVR_old*LVR_new;
run;

Renaming Coefficients that Result from Proc Logistic/Problems Surrounding Variable Names Common to Multiple Datasets

I am estimating a model for firm bankruptcy that involves 11 factors. I have data from 1900 to 2000 and my goal is to estimate my model using proc logistic for the period 1900-1950 and then test its performance on the 1951 through 2000 data. Proc logistic runs fine but the problem I have is that the estimated coefficients have the same name as my factors that I was using in my model. Suppose the dataset that contains all my observations is called myData and the dataset that contains the estimated coefficients which I obtain using an outtest statement (in proc logistic) is called factorEstimates. Now both of these data sets have the variables factor1, factor2, ..., factorN. Now I want to form the dataset outOfSampleResults that does something like the following:
data outOfSampleResults;
set myData factorEstimates;
newVar=factor1*factor1;
run;
Where the first mention of factor1 refers to that contained in myData and the second refers to that contained in factorEstimates. How can I inform sas which dataset it should read for this variable that is common to both of the datasets in the set statement? Alternatively, how could I quickly rename factor1, factor2, ..., factorN as factor1Estimate, factor2Estimate, ..., factorNEstimate in the factorEstimates dataset so as to circumvent this common variable name issue altogether?
Two quick ways to get estimates for a model already developed:
1. Proc logistic score statement
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_logistic_sect066.htm
Include the data in your original proc logistic but use a new variable and ensure that the dependent variable is missing for the observations you want to predict.
data stacked;
set all;
if year >1950 then predicted=.;
else predicted=y;
run;
proc logistic data=stacked;
model predicted = factor1 - factor12;
output out=out_predicted predicted=p;
run;