SAS - How to select random samples based on condition

I have a SAS data set that contains a column of numbers ranging from -2000 to 4000.
I want to select 37 random samples based on the following conditions:
If num is between -2000 and -1000, randomly select 10 samples from this range;
if num is between -1000 and 0, randomly select 15 samples from this range;
if num is between 0 and 1000, randomly select 12 samples from this range.
I've tried the following:
proc surveyselect data=save.table
method=srs n=37 out=save.table_sample seed=1953;
run;
But this gives me 37 random samples from the whole population. I want to select randomly within each data range.
Please help with SAS code, thanks so much in advance!

Create a grouping variable in your data set that you can use to define the strata.
data output;
set save.table;
if number < -1000 then group=1;
else if number < 0 then group=2;
else if number < 1000 then group=3;
run;
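Then use PROC SURVEYSELECT with a STRATA statement. Supply the per-stratum sample sizes either in a data set that has the same variable, GROUP, plus the sample size, or as a list in the PROC SURVEYSELECT statement. The input must be sorted by GROUP, and rows that fall outside the three ranges (GROUP missing) are excluded so that the three sizes line up with the three strata.
proc sort data=output; by group; run;
proc surveyselect data=output(where=(not missing(group)))
method=srs out=save.table_sample seed=1953 sampsize=(10 15 12);
strata group;
run;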
Couldn't test because no sample data was provided, so here's an example using SASHELP.HEART
proc sort data=sashelp.heart out=heart; by chol_status; run;
proc surveyselect data=heart (where=(not missing(chol_status))) method=srs sampsize=(5 10 15) out=want;
strata chol_status;
run;
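The other option mentioned above, passing the sample sizes in a data set, might look like the sketch below. It is untested against the asker's data; the data set names sizes and output_sorted are made up for illustration, and _NSIZE_ is the variable name PROC SURVEYSELECT looks for in a sample-size data set.

```sas
/* One row per stratum: GROUP value plus its sample size (assumed names) */
data sizes;
input group _NSIZE_;
datalines;
1 10
2 15
3 12
;
run;

/* STRATA requires the input sorted by the stratum variable */
proc sort data=output out=output_sorted;
by group;
run;

/* Rows with GROUP missing (values of 1000 or more) are left out */
proc surveyselect data=output_sorted(where=(not missing(group)))
method=srs n=sizes out=save.table_sample seed=1953;
strata group;
run;
```

This keeps the sizes next to the group definitions, which is easier to maintain than a positional list if the ranges change.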

If you want to continue to use proc surveyselect, then a simple way to do this is:
data set1 set2 set3;
set save.table;
if number < -1000 then output set1;
else if number < 0 then output set2;
else if number < 1000 then output set3;
run;
Then call proc surveyselect thrice with different n values on these 3 datasets.
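A minimal sketch of those three calls, assuming the sizes from the question (10, 15 and 12). The output names sample1 to sample3 are placeholders, and the final step that stacks them back into one data set is an addition for convenience.

```sas
/* One simple random sample per range */
proc surveyselect data=set1 method=srs n=10 out=sample1 seed=1953;
run;
proc surveyselect data=set2 method=srs n=15 out=sample2 seed=1953;
run;
proc surveyselect data=set3 method=srs n=12 out=sample3 seed=1953;
run;

/* Stack the three samples into a single 37-row data set */
data save.table_sample;
set sample1 sample2 sample3;
run;
```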

Related

SAS select random samples from a dataset

I understand that to select a random sample, I can use
proc surveyselect data = raw_data method = srs n=200000 out=sample_data;
run;
However, sometimes my raw_data has fewer than 200000 records. If raw_data is small, I would like to just keep it as is; if it's larger than a million records, I would like to randomly select 200k records from it. How should I do this?
Thank you!
Just create a macro variable for n. You can do it as shown below, or you can use dictionary.tables or proc contents to get the count without actually counting all of the rows, provided you have no reason to distrust those values.
proc sql;
select
case when count(1) < 1000000 then count(1) else 200000 end
into :sampcount
from yourdataset
;
quit;
proc surveyselect n=&sampcount. .... ;
run;
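The dictionary.tables variant mentioned above could be sketched like this. It assumes the data set is WORK.YOURDATASET (the library is a guess; dictionary.tables stores libname and memname in upper case) and that the stored row count is trustworthy. NOBS can include rows marked for deletion, so NLOBS may be the better column if that matters.

```sas
proc sql noprint;
select
case when nobs < 1000000 then nobs else 200000 end
into :sampcount trimmed
from dictionary.tables
where libname='WORK' and memname='YOURDATASET'
;
quit;

/* Check the resolved value in the log before using it as n= */
%put NOTE: sampcount=&sampcount;
```

This reads the count from the data set header instead of scanning every row, which is the point of the suggestion for large tables.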

SAS: in the DATA step, how to calculate the grand mean of a subset of observations, skipping over missing values

I'm trying to calculate the grand mean of a subset of observations (e.g., observation 20 to observation 50) in the data step. In this calculation, I also want to skip over (ignore) any missing values.
I've tried to play around with the mean function using various if … then statements, but I can't seem to fit all of it together.
Any help would be much appreciated.
For reference, here's the basic outline of my data steps:
data sas1;
infile '[file path]';
input v1 $ 1-9 v2 $ 11 v3 13-17 [redacted] RPQ 50-53 [redacted] v23 101-106;
v1=translate(v1,"0"," ");
format [redacted];
label [redacted];
run;
data gmean;
set sas1;
id=_N_;
if id = 10-40 then do;
avg = mean(RPQ);
end;
/*Here, I am trying to calculate the grand mean of the RPQ variable*/
/*but only observations 10 to 40, and skipping over missing values*/
run;
Use the automatic variable _N_ to identify the rows. Keep a running sum and a running count that are retained from row to row, and divide the sum by the count once the last row of the range has been read. Use the missing() function to decide whether a row adds to the running sum and count.
data stocks;
set sashelp.stocks;
retain sum_total n_count 0;
if 10<=_n_<=40 and not missing(open) then do;
n_count=n_count+1;
sum_total=sum_total+open;
end;
if _n_ = 40 then average=sum_total/n_count;
run;
proc print data=stocks(obs=40 firstobs=40);
var average;
run;
*check with proc means that the value is correct;
proc means data=sashelp.stocks (firstobs=10 obs=40) mean;
var open;
run;

How to perform oversampling in SAS?

I have a data set with 1100 samples, target class isReturn, there are
800 isReturn='True'
300 isReturn='False'
How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' so that I will have 800 isReturn='False' to make the data set balanced?
Thanks in advance.
I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.
The data step would give you granular control. This gives you your 300 twice, plus another 200 picked randomly (possibly 1 or 0 times) from the 300 a third time.
data have;
length isReturn $5;
do _n_=1 to 800;
isReturn='True';
output;
if _n_ le 300 then do;
isReturn='False';
output;
end;
end;
run;
data want;
set have;
retain k 200 n 300;
if isReturn='True' then output;
else do;
output;
output;
if ranuni(7) le k/n then do;
output;
k+-1;
end;
n+-1;
end;
run;
You could tweak that pretty easily to get any distribution you want (you could take 500 out of '600' (double 300) for example by setting k and n to 500 and 600 and doing the if bit twice, each time decrementing n once).
You could also use proc surveyselect to do this.
proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;
That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
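Adding them back can be a plain concatenation, sketched here with the data set names used above. Any selection variables the procedure adds to want_add will simply be missing on the original rows.

```sas
/* 800 True + 300 False originals, plus the 500 resampled False rows */
data want;
set have want_add;
run;

/* Quick check of the class balance */
proc freq data=want;
tables isReturn;
run;
```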
Alternatively, you could do this in one step. However, because both classes are then sampled with replacement, there is no guarantee that every original record, True or False, ends up in the result. So this likely doesn't do exactly what you ask for; it is presented for completeness.
data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;
All of this assumes you're trying to get 100% of the original dataset plus some. If you're trying to oversample in the sense of picking False records with the same probability as True records, but you are ultimately picking a smaller sample than the total (and picking each record only once, i.e. without replacement), then the strata statement is what you should be looking at.

Is there a way to name proc rank groups based on values within the group?

So I have multiple continuous variables that I have used proc rank to divide into 10 groups, ie for each observation there is now a "GPA" and a "GRP_GPA" value, ditto for Hmwrk_Hrs and GRP_Hmwrk_Hrs. But for each of the new group columns the values are between 1 - 10. Is there a way to change that value so that rather than 1 for instance it would be 1.2-2.8 if those were the min and max values within the group? I know I can do it by hand using proc format or if then or case in sql but since I have something like 40 different columns that would be very time intensive.
It's not clear from your question if you want to store the min-max values or just format the rank columns with them. My solution below formats the rank column and utilises the ability of SAS to create formats from a dataset. I've obviously only used 1 variable to rank, for your data it will be a simple matter to wrap a macro around the code and run for each of your 40 or so variables. Hope this helps.
/* create ranked dataset */
proc rank data=sashelp.steel groups=10 out=want;
var steel;
ranks steel_rank;
run;
/* calculate minimum and maximum values per rank */
proc summary data=want nway;
class steel_rank;
var steel;
output out=want_min_max (drop=_:) min= max= / autoname;
run;
/* create dataset with formatted values */
data steel_rank_fmt;
set want_min_max (rename=(steel_rank=start));
retain fmtname 'stl_fmt' type 'N';
label=catx('-',steel_min,steel_max);
run;
/* create format from previous dataset */
proc format cntlin=steel_rank_fmt;
run;
/* apply formatted value to rank column */
proc datasets lib=work nodetails nolist;
modify want;
format steel_rank stl_fmt10.;
quit;
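One possible shape for the macro wrapper mentioned above is sketched below. It assumes the ranked columns already exist in a WORK data set; the macro name range_fmt, its parameters, and the work data sets _minmax and _cntl are all invented for illustration. Format names must not end in a digit, so they are passed in explicitly instead of being derived from the variable name.

```sas
%macro range_fmt(ds=, var=, grp=, fmt=);
/* minimum and maximum of &var within each rank group */
proc summary data=&ds nway;
class &grp;
var &var;
output out=_minmax (drop=_type_ _freq_) min=lo max=hi;
run;
/* control data set: one format entry per rank value */
data _cntl;
set _minmax (rename=(&grp=start));
retain fmtname "&fmt" type 'N';
label=catx('-',lo,hi);
run;
proc format cntlin=_cntl;
run;
/* attach the format to the rank column */
proc datasets lib=work nodetails nolist;
modify &ds;
format &grp &fmt..;
quit;
%mend range_fmt;

/* Example calls using the column names from the question */
%range_fmt(ds=want, var=GPA, grp=GRP_GPA, fmt=gpa_fmt)
%range_fmt(ds=want, var=Hmwrk_Hrs, grp=GRP_Hmwrk_Hrs, fmt=hw_fmt)
```

The underlying rank values stay 0 to 9 (or 1 to 10) in the data; only their displayed form changes, so sorting and grouping on the rank still work.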
In addition to Keith's good answer, you can also do the following:
proc rank data = sashelp.cars groups = 10 out = test;
var enginesize;
ranks es;
run;
proc sql ;
select *, catx('-',min(enginesize), max(enginesize)) as esrange from test
group by es
order by make, model
;
quit;

SAS calculating percentages of multiple variables

I would like to calculate percentages for multiple variables, that is, calculate the sum of each variable (column) and divide each sum by the frequency. I tried using proc summary to get those two stats and then an array to compute the percentages, but the resulting values do not seem to be right, even though there are no errors in the log. I saw that proc sql can do percentage calculations, but I do not know how to do that for multiple variables. It must not be difficult, but I am just not sure how to list them.
If you can either 1) point out what I did wrong in the proc summary approach, or 2) show me how to write the proc SQL variable list, that would be great. Thank you.
Here is code I wrote:
1)
proc summary data=m3resp;
var Lma1-Lma69;
output out=sumofLma sum=sumLma1-sumLma69;
run;
data sumLmaout;
set sumoflma;
array pctma[69];
do i=1 to 69;
pctma[i]=input(cat(sumLma,i),5.)/_freq_;
end;
drop i;
run;
2)
PROC SQL;
SELECT Lma1
(Lma1/SUM(Lma1)) AS PCTLma1
FROM m3resp;
QUIT;
PROC MEANS is a great tool for summarizing numeric variables. (To suppress the printed report, add NOPRINT to the PROC MEANS statement.)
data numbers;
input num1 num2;
datalines;
10 5
10 5
10 5
10 5
10 5
10 5
10 5
10 5
10 5
10 5
;
proc means n sum mean median data=work.numbers maxdec=2;
var num1 num2;
output out=work.totals(drop=_freq_ _type_)
sum(num1 num2)=NUM1_TOTAL NUM2_TOTAL
n(num1 num2)=NUM1_COUNT NUM2_COUNT;
run;
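Applied to the question's own data, the array step most likely goes wrong because cat(sumLma,i) refers to a variable named sumLma, which does not exist, instead of sumLma1, sumLma2 and so on. A sketch of a fix, assuming the sumofLma data set from the question, is to put the sums in an array of their own:

```sas
data sumLmaout;
set sumofLma;
array s[69] sumLma1-sumLma69; /* the sums written by proc summary */
array pctma[69];              /* creates pctma1-pctma69 */
do i=1 to 69;
pctma[i]=s[i]/_freq_;
end;
drop i;
run;

/* Alternative: let proc summary do the division itself. Note that MEAN
   divides by the non-missing count of each variable, which differs from
   _FREQ_ when a column has missing values. */
proc summary data=m3resp;
var Lma1-Lma69;
output out=pctLma (drop=_type_ _freq_) mean=pctLma1-pctLma69;
run;
```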