Create unique label for repeated units with PROC SURVEYSELECT in SAS - sas

I need to resample from a real (cluster) trial data set. So far, I have used the following PROC SURVEYSELECT procedure in SAS to sample 10 clusters from the trial with replacement, with 50% of clusters coming from the control arm and 50% coming from the treatment arm. I repeat this 100 times to get 100 replicates with 10 clusters each and equal allocation.
proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
cluster site;
strata rx / alloc=(0.5 0.5);
run;
Since I am using unrestricted random sampling (method=urs) to sample with replacement, I specified outhits so that SAS will inform me when a cluster is sampled more than once in each replication.
However, within each replicate in the output resamples dataset, I have not found a way to easily assign a unique identifier to clusters that appear more than once. If a cluster is sampled m times within a replicate, the observations within that cluster are simply repeated m times.
I attempted to use PROC SQL to identify distinct cluster ids and their occurrences within each replication, thinking I could use that to duplicate IDs as appropriate before joining additional data as necessary.
proc sql;
create table clusterselect as
select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;
However, I cannot figure out how to simply replicate rows in SAS.
Any help is appreciated, whether it be modifying PROC SURVEYSELECT to yield a unique cluster id within each replication or repeating cluster IDs as appropriate based on numberhits.
Thank you!
Here's what I've done:
/* 100 resamples with replacement */
proc surveyselect data=mydata out=resamples reps=100 sampsize=10 method=urs outhits;
cluster site;
strata rx / alloc=(0.5 0.5);
run;
/* identify unique sites per replicate and their num of appearances (numberhits) */
proc sql;
create table clusterSelect as
select distinct r.replicate, r.site, r.numberhits from resamples as r;
quit;
/* for site, repeat according to numberhits */
/* create unique clusterId */
data uniqueIds;
set clusterSelect;
do i = 1 to numberhits;
clusterId = catx('_', site, i); /* delimiter avoids collisions, e.g. site 1 hit 11 vs site 11 hit 1 */
output;
end;
drop i numberhits;
run;
/* append data to cluster, retaining unique id */
proc sql;
create table resDat as
select
uid.replicate,
uid.clusterId,
uid.site,
mydata.*
from uniqueIds as uid
left join mydata
on uid.site = mydata.site;
quit;

Are you just asking how to convert one observation into the number of observations indicated in the NUMBERHITS variable?
data want;
set resamples;
do _n_=1 to numberhits;
output;
end;
run;
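If the goal is also to label each copy, the same loop can build the identifier in one pass. This is only a sketch, and it assumes RESAMPLES was created without the OUTHITS option, so that each selected observation appears once with its NUMBERHITS count (with OUTHITS the rows are already repeated and the loop would duplicate them again). The names hit and clusterId are made up for illustration.

```sas
/* Assumption: resamples was produced WITHOUT outhits, so every selected */
/* observation appears once and numberhits holds its selection count.    */
data want;
  set resamples;
  length clusterId $40;          /* hypothetical variable name */
  do hit = 1 to numberhits;      /* one output row per hit */
    clusterId = catx('_', site, hit);
    output;
  end;
run;
```

Within a replicate, every observation of a cluster selected m times then carries one of m distinct clusterId values (site_1 through site_m).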

Related

creating a stratified sample in SAS with known stratas

I have a target population with some characteristics and I have been asked to select an appropriate control based on these characteristics. I am trying to do a stratified sample using Base SAS, but I need to be able to define my 4 strata percentages from my target and apply these to my sample. Is there any way I can do that? Thank you!
To do stratified sampling you can use PROC SURVEYSELECT
Here is an example:-
/*Dataset creation*/
data data_dummy;
input revenue rev_tag $ Premiership_level; /* rev_tag is character, hence the $ */
datalines;
1000 High 1
90 Low 2
500 Medium 3
1200 High 4
;
run;
/*Now you need to Sort by rev_tag, Premiership_level (say these are the
variables you need to do stratified sampling on)*/
proc sort data = data_dummy;
by rev_tag Premiership_level;
run;
/*Now use SURVEYSELECT to do stratified sampling using 10% samprate (You can
change this 10% as per your requirement)*/
/*Surveyselect is used to pick entries for groups such that , both the
groups created are similar in terms of variables specified under strata*/
proc surveyselect data=data_dummy method = srs samprate=0.10
seed=12345 out=data_control;
strata rev_tag Premiership_level;
run;
/*Finally tag (if you want for more clarity) your 10% data as control
group*/
Data data_control;
Set data_control;
Group = "Control";
Run;
Hope this helps:-)

Automating IF and then statement in sas using macro in SAS

I have a data where I have various types of loan descriptions, there are at least 100 of them.
I have to categorise them into various buckets using if and then function. Please have a look at the data for reference
data des;
set desc;
if loan_desc in ('home_loan','auto_loan')then product_summary ='Loan';
if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown it for just two loan descriptions, but I have around 1000 different loan_desc values that I need to categorise into different buckets.
How can I categorise these loan descriptions into different buckets without writing the product summary and the loan_desc again and again in the code, which makes it very lengthy and time consuming?
Please help!
Another option for categorizing is using a format. This example uses a manual VALUE statement, but you can also create a format from a dataset if you have the to/from values in a dataset. As indicated by @Tom, this allows you to change only the table while the code stays the same for future changes.
One note regarding your current code: you're using separate IF/THEN statements rather than IF/THEN followed by ELSE IF. You should use ELSE IF because then evaluation stops as soon as one condition is met, rather than running through all options.
proc format;
value $loan_fmt
'home_loan', 'auto_loan' = 'Loan'
'Multi' = 'Multi options';
run;
data want;
set have;
product_summary = put(loan_desc, $loan_fmt.);
run;
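To make the "format from a dataset" idea concrete, here is a rough sketch using the CNTLIN= option of PROC FORMAT. The control dataset name loan_map is hypothetical; START, LABEL, FMTNAME and TYPE are the variable names PROC FORMAT expects in a control dataset.

```sas
/* Hypothetical control dataset: one row per source value */
data loan_map;
  length start $9 label $13;
  retain fmtname 'loan_fmt' type 'C';   /* type 'C' = character format */
  start='home_loan'; label='Loan';          output;
  start='auto_loan'; label='Loan';          output;
  start='Multi';     label='Multi options'; output;
run;

/* Build $loan_fmt from the dataset instead of a VALUE statement */
proc format cntlin=loan_map;
run;
```

Once the format exists, the PUT call is the same as with a hand-written VALUE statement, and new mappings only require new rows in the dataset.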
For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
do x=1 to 3;
loan_desc='home_loan'; output;
loan_desc='auto_loan'; output;
loan_desc='Multi'; output;
loan_desc=''; output;
end;
run;
data map;
length product_summary $13; /* set the length up front so 'Multi options' is not truncated */
loan_desc='home_loan'; product_summary='Loan'; output;
loan_desc='auto_loan'; product_summary='Loan'; output;
loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
create table des as
select a.*
,coalescec(b.product_summary,'UNMAPPED') as product_summary
from desc a
left join map b
on a.loan_desc=b.loan_desc;
quit;
There is no need to use the macro language for this task (I have updated the question tag accordingly).
Good solutions have already been proposed (I like @Reeza's proc format solution), but here's another route which also minimizes coding.
Generate sample data
data have;
loan_desc="home_loan"; output;
loan_desc="auto_loan"; output;
loan_desc="Multi"; output;
loan_desc=""; output;
run;
Using PROC SQL's case expression
This way doesn't allow, to my knowledge, having several criteria on a single when line, but it really simplifies coding since the resulting variable's name needs to be written down only once.
proc sql;
create table want as
select
loan_desc,
case loan_desc
when "home_loan" then "Loan"
when "auto_loan" then "Loan"
when "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
create table want as
select
loan_desc,
case
when loan_desc in ("home_loan", "auto_loan") then "Loan"
when loan_desc = "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;

How do I simply name a row heading in sas

All I would like to do is name each row in the output of this simple cluster algorithm. For example instead of row 1, 2, 3, and 4 have best, good, bad, and worst. Thanks!
proc fastclus data=tdriv.nfl2015 maxclus=4 out=clus;
var OffptsPerG DefPtsPerG;
run;
SAS doesn't have the concept of 'row header'. However, if you have a variable with values 1,2,3,4 (which you will - the cluster value!), you can use a format to do so.
proc format;
value clusf
1='Best'
2='Good'
3='Bad'
4='Worst'
;
run;
proc datasets lib=work;
modify clus;
format cluster CLUSF.;
quit;
This assumes that you can reliably link 1,2,3,4 to those four values; I'm not sure FASTCLUS is reliable in that way. If it's not, you may have to code this afterwards by hand and/or use code to determine which cluster is which.
Joe's approach seems reasonable... Here's another one. Haven't tested it having no data to test with, but here it goes:
After running your proc fastclus, modify the output dataset, adding a variable which will serve as ID in a future proc print:
data clus;
format position $8.;
set clus;
if cluster=1 then position="Best";
else if cluster=2 then position="Good";
/* ... and so on ... */
run;
And then when printing:
proc print data=clus;
id position;
run;

SAS Proc SQL how to perform procedure only on N rows of a big table

I need to perform a procedure on a small subset (e.g. 100 rows) of a very big table just to test the syntax and output. I have been running the following code for a while and it's still running. I wonder if it is doing something else. What is the right way to do this?
Proc sql inobs = 100;
select
Var1,
sum(Var2) as VarSum
from BigTable
Group by
Var1;
Quit;
What you're doing is fine (limiting the maximum number of records taken from any table to 100), but there are a few alternatives. To avoid any execution at all, use the noexec option:
proc sql noexec;
select * from sashelp.class;
quit;
To restrict the obs from a specific dataset, you can use the data set obs option, e.g.
proc sql;
select * from sashelp.class(obs = 5);
quit;
To get a better idea of what SAS is doing behind the scenes in terms of index usage and query planning, use the _method and _tree options (and optionally combine with inobs as above):
proc sql _method _tree inobs = 5;
create table test as select * from sashelp.class
group by sex
having age = max(age);
quit;
These produce quite verbose output which is beyond the scope of this answer to explain fully, but you can easily search for more details if you want.
For further details on debugging SQL in SAS, refer to
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001360938.htm

PROC FREQ - How to find out the dataset name

I have a program where a few datasets are created dynamically through a macro. The datasets are also named dynamically.
For example:
STAT_FILE_1,
STAT_FILE_2
Sometimes my macro may create two datasets,
STAT_FILE_1,
STAT_FILE_2
and sometimes, three
STAT_FILE_1,
STAT_FILE_2,
STAT_FILE_3
and the number may vary depending on the source data.
I am using PROC FREQ to get summary data for the most recently created dataset.
Hence I am using the code below:
PROC FREQ;
tables YEAR;
run;
I am getting the result, but I am not able to find out which dataset was used. Can someone please help me find the name of the dataset used by PROC FREQ?
This will show you the last table created:
%put &syslast;
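If you want the name to show up in the output rather than only in the log, one possible approach is to capture &syslast before the PROC FREQ and reuse it. This is just a sketch: the macro variable name last_ds is made up, and YEAR is the variable from the question.

```sas
/* Capture the most recently created dataset name (e.g. WORK.STAT_FILE_3) */
%let last_ds = &syslast;

title "Frequencies for &last_ds";
proc freq data=&last_ds;   /* name the dataset explicitly */
  tables year;
run;
title;                     /* clear the title afterwards */
```

Naming the dataset explicitly in DATA= also guards against a later step silently changing which table counts as "last".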