Random ordered sampling with replacement in SAS

I have a data set from which I'd like to draw a sample with replacement. When I use proc surveyselect, the samples drawn are in the exact same order as in the original data set, and multiple draws of the same row are written one below the other.
proc surveyselect data=sashelp.baseball outhits method=urs n=1000 out=mydata;
However, it's important to me that the position in the output table is sampled as well. Is there an option in proc surveyselect for this, or am I better off sampling the row number myself and outputting it, as outlined in this paper (p. 4)?
As a toy example (not in SAS notation), suppose I have a list of values [a, b, c, d] and I draw five times with repetition (and keeping the order of draws):
First a, then c, then a, then b, then c. The result I want is [a, c, a, b, c], but sas only gives output of the type
[a,a,b,c,c] (with outhits)
[a 2, b 1, c 2, d 0] (with outall) or
[a 2, b 1, c 2] (without an additional option).

So here is a solution which only requires BASE SAS. Minor changes would be needed to allow inclusion of additional columns such as an ID or a DATE, for instance. I don't claim it's the most efficient way to do this. It relies heavily on PROC SQL which is my preference. Having said that, it should produce the results you wish in quite reasonable time.
The length of the generated SQL code justifies the need for a separate sas program. If you don't want to show the whole %included file in the log, just leave out the /source2 option.
Generate Sample Data
data mymatrix;
input c1 c2 c3 c4 c5;
datalines;
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
;
Declare Macro %DrawSample
Parameters:
lib = library in which ds is found
ds = table to sample from
out = table to generate
outfile = path/name of the sas program containing the insert strings
n = number of repetitions
%macro DrawSample(lib, ds, out, outfile, n);
%local nrows ncols cols;
proc sql;
/* get number of rows in source table */
select count(*)
into :nrows
from &lib..&ds;
/* get variable names */
select name, count(name)
into :cols separated by " ",
:ncols
from dictionary.columns
where libname = upcase("&lib")
and memname = upcase("&ds");
quit;
data _null_;
file "&outfile";
length query $ 256;
array column(&ncols) $32;
put "PROC SQL;";
put " /* create an empty table with same structure */";
put " create table &out as";
put " select *";
put " from &lib..&ds";
put " where 1 = 2;";
put " ";
do i = 1 to &n;
%* Randomize column order;
do j = 1 to &ncols;
column(j) = scan("&cols", 1 + floor((&ncols)*rand("uniform")));
end;
%* Build the query;
query = cat(" INSERT INTO &out SELECT ", column(1));
do j = 2 to &ncols;
query = catx(", ", query, column(j));
end;
rownumber = 1 + floor(&nrows * rand("uniform"));
query = catx(" ", query, "FROM &lib..&ds(firstobs=", rownumber,
"obs=", rownumber, ");");
put query;
end;
put "QUIT;";
run;
%include "&outfile" / source2;
%mend;
Calling the Macro
%DrawSample(lib=work, ds=mymatrix, out=matrixSample, outfile=myRandomSample.sas, n=1000);
Et voilà!

Not sure exactly what you're after, but something that may help is to use the option OUTALL instead of OUTHITS. This will create an output dataset the same size as the original, with a selected column to show if the record has been sampled and a numberhits column to show how many times that record has been selected. It won't create a row for each time a record is selected.
You can then select the observation number for all records in the sample.
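The row-number idea raised in the question can also be sketched in a single DATA step. This is only a minimal sketch, not an option of proc surveyselect: the variable names draw, pick and n_rows and the seed are assumptions chosen for illustration. Each iteration picks a random row number and reads that row directly with POINT=, so the output keeps the order in which the rows were drawn.

```sas
/* Sketch: ordered sampling with replacement via random row numbers. */
/* draw, pick, n_rows and the seed value are illustrative names.     */
data mydata;
  call streaminit(2024);                    /* arbitrary seed for reproducibility */
  do draw = 1 to 1000;                      /* number of draws */
    pick = ceil(n_rows * rand("uniform"));  /* random row number in 1..n_rows */
    set sashelp.baseball point=pick nobs=n_rows;
    output;                                 /* rows are written in draw order */
  end;
  stop;                                     /* required when reading with POINT= */
run;
```

The draw variable records the position of each draw; pick and n_rows are not written to the output data set because they are used in POINT= and NOBS=.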

Related

Extracting row with highest value in a column while also calculating averages by group

I have been tasked with taking the following data and creating two permanent data sets from it. One of these permanent data sets is supposed to contain the average of the "value" column for each group (meaning there should only be four rows in the end, with a new column that represents the average of respective values for A, B, C, and D). Averages should exclude missing values, meaning that if category A has a missing value, it should be divided by 3, not 4. The second permanent data set needs to be the one row with the highest overall value in the "value" column (in this case, the row with D 09JUL2021 951 should be the only row exported). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!
Example data:
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
Here is what I tried:
data mylib.data1(keep=type date value value_avg) mylib.data2;
set work.have;
by type;
if value ne . then NotMissing=1; else NotMissing=0;
if first.type then call missing(of value_avg);
value_avg+value;
if first.type then call missing(of num_per_cat);
num_per_cat+NotMissing;
Avg=divide((value_avg+value),(num_per_cat+NotMissing));
if last.type then output mylib.data1;
run;
This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
proc summary data = have nway;
class type;
var value;
output out = want_mean(drop = _:) mean = ;
run;
proc summary data = have nway;
class type;
var value;
output out = want_max(drop = _:) max = ;
run;
Both data sets are easily created with proc sql.
First one:
proc sql;
create table want1 as
select distinct type, max(value) as Max_value, mean(value) as Average_value
from have
group by type
;
quit;
Second one:
proc sql;
create table want2 as
select *
from have
having value = max(value)
;
quit;
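Since the question also asks about doing both operations simultaneously, here is a rough single-pass DATA step sketch. It assumes work.have is sorted by type (as in the example data); the output names want_avg and want_top and the helper variables total, n_valid and top_* are made up for illustration.

```sas
/* Sketch: group averages and the overall top row in one pass.        */
/* want_avg, want_top, total, n_valid and top_* are illustrative names. */
data want_avg(keep=type value_avg)
     want_top(keep=top_type top_date top_value
              rename=(top_type=type top_date=date top_value=value));
  length top_type $8;
  retain top_type top_date top_value;
  set work.have end=last_row;
  by type;
  if first.type then do;                /* reset the per-group accumulators */
    total = 0;
    n_valid = 0;
  end;
  total + value;                        /* sum statement ignores missing values */
  n_valid + (not missing(value));       /* count only non-missing values */
  if value > top_value then do;         /* a missing value never wins the comparison */
    top_type = type;
    top_date = date;
    top_value = value;
  end;
  if last.type then do;
    value_avg = divide(total, n_valid); /* average over non-missing values only */
    output want_avg;
  end;
  if last_row then output want_top;     /* single row with the overall maximum */
  format top_date date9.;
run;
```

For permanent data sets, the two output names would be replaced by names in a permanent library such as mylib.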

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
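The RETAIN variant mentioned in the first note could look roughly like this. It is a sketch under the same assumptions as the step above: HAVE is sorted by ID and TYPE is a character variable. The output name want_retain is made up so that it does not overwrite WANT.

```sas
/* Sketch: same logic without the DoW loop, using RETAIN instead. */
data want_retain;
  set have;
  by id;
  retain last_home;                         /* keep the value across observations */
  if first.id then last_home = 0;           /* reset at the start of each ID */
  if type = 'H' then last_home = record;    /* remember the latest home contact */
  if type = 'B' and intck('hour', last_home, record, 'c') gt 24 then output;
  format last_home datetime.;
run;
```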
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL's.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You can perform a selection to get the final result set without creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping detects the "to_home" calls prior to the "to_business" calls that did NOT occur in the last 24 hours (86,400 seconds)
proc sql;
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.record < business.record /* RECORD is the datetime column in HAVE */
group by
business.id, business.record
having
max(home.record) < business.record - 86400
;
Second way
Perform a NOT existential check, for a to_home call in prior 24hr, for every to_business call.
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.record < business.record /* RECORD is the datetime column in HAVE */
and home.record >= business.record - 86400
)
;
quit;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for use in the comparison */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* This will keep the result of the most recent, in datetime,
HOME/BUS record comparison */
If intck('Hour',hrecord,record,'c') > 24 Then Good_For_Output=1;
Else Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

Naming columns using values from data table

Let's say I have two columns A and B.
A B
12 "randstr"
39 "randstr"
2 "randstr"
This random string is repeated in each row.
I'm interested in how I can get the table below:
randstr B
12 "randstr"
39 "randstr"
2 "randstr"
The value in the column B was used to rename column A. I have tried using rename and all sorts of macro magic but failed. I have no idea how to proceed.
I've tried the answers below and they just don't allow for reading the value from the data and then using the value as a column name:
https://communities.sas.com/t5/General-SAS-Programming/dates-used-as-column-names/td-p/168803
https://stats.idre.ucla.edu/sas/code/a-few-sas-macro-programs-for-renaming-variables-dynamically/
SAS - Dynamically create column names using the values from another column
Renaming Column with Dynamic Name
The transformation could also be seen as a row-wise transposition.
data have;
attrib A length=8 B length=$32;
row+1;
input
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
run;
proc transpose data=have out=want(drop=row _name_);
by row;
var A;
id B;
copy B;
run;
In non-toy scenarios the B column is often not a single value. Try the same transpose with data having variation in B. The procedure will create two new columns from the values of B.
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
3141 xyz-456-def
Using this macro, it's fairly straightforward:
/* get first value in the dataset */
%let new_col=%mf_getvalue(work.YOURDATA,B);
/* rename variable A */
proc datasets library=work nolist;
modify YOURDATA;
rename A=%sysfunc(dequote(&new_col));
quit;
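If that macro is not available, a base SAS sketch of the same idea is to copy the first value of B into a macro variable and use it in the rename. This assumes the table is WORK.HAVE with columns A and B as in the question; the macro variable name new_col simply mirrors the snippet above. NLITERAL turns values that are not valid SAS names (such as xyz-123-abc) into name literals, which is why VALIDVARNAME=ANY is set here.

```sas
/* Sketch: rename column A to the first value found in column B.   */
/* WORK.HAVE and the macro variable new_col are assumptions.        */
options validvarname=any;                      /* allow names like "xyz-123-abc"n */

data _null_;
  set work.have(obs=1);                        /* only the first row is needed */
  call symputx('new_col', nliteral(strip(B))); /* store the value as a usable name */
run;

proc datasets library=work nolist;
  modify have;
  rename A=&new_col;
quit;
```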

SAS SCAN Function and Missing Values

I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable had three possible values and one observation was missing, the missing observation would have a 33% chance of being replaced with any one of those values).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not create an F or M observation within gender. It also appears to create new M and F variables. Additionally, the DO loop creates additional entries for each CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/* List input, with a dot (.) marking each missing value */
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns not having a missing value because no replacement would occur in those. For columns having a missing the list of candidate REPLACEMENTS excludes the missing and the REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil (random)

How do i perform calculation about the last n observations

how can i perform calculation for the last n observation in a data set
For example if I have 10 observations I would like to create a variable that would sum the last 5 values of another variable. Please do not suggest that I lag 5 times or use module ( N ). I need a bit more elegant solution than that.
with the code below alpha is the data set that i have and bravo is the one i need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However there is a procedure that can do all kind of cumulative, moving window, etc calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it will expand your series by 0's on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; here the sum of the last five values, current one included */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
output;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step and add an increasing number to each group.
2. Run an sql step and add a column of max(group).
3. Run another data step and check if the value from (2)-(1) is less than 5. If so, assign to the _num_to_sum_ variable (for example) the value that you want to sum, otherwise leave it blank or assign 0.
4. Finally, do an sql step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
In case anyone reads this :)
I solved it the way I needed it to be solved. Although now I am more curious which of the two(the retain and my solution) is more optimal in terms of computing/processing time.
Here is my solution:
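data bravo(keep = var1 summ);
set alpha;
/* only sum once five observations are available, so point= never goes below 1 */
if _n_ >= 5 then do i=_n_ to _n_-4 by -1;
set alpha(rename=(var1=var2)) point=i;
summ=sum(summ,var2);
end;
run;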