SAS: Coffee anyone? - sas

I tried this in C# but have not had much success. So I am now trying in SAS. Using an EG session and my SAS code, we work with the list of students in SASHELP.CLASS.
These people want to get to know each other and have a monthly random pairing to go on a Coffee Date.
Rules:
A random Coffee Date List is Generated monthly;
I store each months pairing into a Historical Dataset, which I append monthly.
One person cannot have coffee with the same person within a 6 month period. So we keep a separate dataset for historical purposes with 3 Vars:
LastDate,InviterID,InvitedID
We check each pairing against the Historical list of which we only load the most recent 6 months data into a temp dataset for checking purposes.
If no recent matched pair is found, a new matched pair is added to a new Paired Dataset, and the 2 names (Rows) are removed from the original Participants dataset until the dataset has less than 2 rows. (a single person cannot be paired with another)
Unfortunately we have 19 people in this list so one person will be left out until we can add a new participant. Is anyone interested in joining our coffee club? :-)
So I start by deriving and ID (n) from the dataset, and I only keep the Name
Data Participants(Keep=ID Name);
FORMAT ID 8.;
set SASHelp.class;
ID=_n_;
run;
These 19 People will be my Participants in the Coffee Club.
I more or less follow the line of thought:
data _null_;
randvar = ceil(rand('UNIFORM') * 100000);
call symput('RANDSEED', randvar);
run;
data CR.names2(keep=MEMID randid);
set CR.MasterNames;
randid = rand('UNIFORM');
run;
proc sort data=CR.names2 ; by randid; run;
data CR.pairs(keep=pairgrp MEMID);
set CR.names2 nobs=num_peeps;
pairgrp+1;
if pairgrp > floor(num_peeps/2) then pairgrp=1;
run;
proc sort data=CR.pairs; by pairgrp;run;
proc transpose data=CR.pairs
out=CR.pairs2 (drop=_NAME_);
var memid;
by pairgrp;
run;
Data CR.Pairs3;
set CR.pairs2;
rename COL1=InviterID COL2=InvitedID;
run;
But I get stuck :-(
I need help with the rest please...
Has anyone else done this type of random pairing successfully before? I am grasping straws here...
Any help much appreciated.
Len

Here is my idea. This is far from efficient. Esp. when NOBS is getting big, as there is a cartesian product involved. Also I cheated on the odd number by adding another row in that case.
Prepare data and generate empty result table.
Create a list of all possible pairings (combinations) excluding recent pairings.
Random sort and descend through the list until every element has been picked once.
Append to result table.
There is a drawback as there might be members who will not get pairings as all possible partners are already picked. To avoid that we could iterate until we get a maximum of pairings.
EDIT: Added iteration. Now the program makes draws randomly until everyone is matched or a threshold is reached.
This problem should probably be implemented in a matrix orientated language like IML or R.
data Participants(Keep=ID Name) ;
set SASHelp.class nobs = num_peeps ;
ID=_n_ ;
output ;
if _n_ = 1 and mod(num_peeps,2) then do ; /* get even number of members: empty ID to pair with last participant*/
name = 'empty' ;
id = 0 ;
output ;
end ;
run ;
data list_of_meetings ;
length iteration InviterID InvitedID 8. ;
run ;
/****
iter = number of club meetings
hist = length of memory for pairings
tries = number of iterations to pair everyone
****/
%macro loop_coffee (iter=, hist=6, tries= 10) ;
proc sql noprint ;
select max(0,max(iteration)) + 1 into :base
from list_of_meetings ;
quit ;
%do i = &base. %to &iter. ; /* loop through number of meetings */
proc sort data = list_of_meetings (where=(iteration >= &i - &hist )) out = lookup nodupkey ; by InviterID InvitedID ; run ; /* get memory of pairings */
proc sql ; /* list all acceptable pairs */
create table all_pairs as
select a.ID as InviterID, b.ID as InvitedID
from Participants a
inner join Participants b
on a.ID lt b.ID
left join lookup c /* exclude the memory */
on a.ID eq c.InviterID and b.ID eq c.InvitedID
where c.InviterID is NULL ;
quit ;
%let j = 0 ;
%let all_pairs = 0 ;
%do %until (&all_pairs | &j > &tries) ; /* iterate and random sort until all members are paired */
%let j = %eval( &j + 1 ) ;
data all_pairs;
set all_pairs;
randnum = ranuni(12345 + &i + &j);
run;
proc sort data = all_pairs ; by randnum ; run ; /* random sort */
data out_pairs ; /* select the pairs: no. of IDs/2 */
declare hash h() ;
h.defineKey("ID") ;
h.defineDone() ;
do until ( eof1 ) ;
set Participants (keep= ID) end = eof1 ;
rc = h.add () ; /* populate list of members */
end ;
do until ( eof2 ) ;
set all_pairs (keep= InviterID InvitedID) end = eof2 ;
rc1 = h.check (key:InviterID) ;
rc2 = h.check (key:InvitedID) ;
if rc1 = 0 and rc2 = 0 then do ;
rc = h.remove (key:InviterID) ; /* delete member from list if paired */
rc = h.remove (key:InvitedID) ;
output ;
end ;
if h.num_items = 0 then do ;
call symput('all_pairs', 1 ) ;
stop ;
end;
end ;
stop ;
keep InviterID InvitedID ;
run ;
%end ;
data list_of_meetings ;
set list_of_meetings (where=(iteration ne .))
Out_pairs (in=pairs) ;
if pairs then iteration = &i. ;
run ;
%end ;
%mend ;
%loop_coffee (iter=10,hist=6,tries=10) ;

Related

Automate check for number of distinct values SAS

Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
This would break if you have to specify libnames with the datasets due to the .. And, you can use proc printto log to print it to a file.
For your other part as to just print the %put use the above as a call -
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any erroneous text to SAS log other than your custom warnings.
#samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance. You can get them without reading the entire data set if your data set has millions of obs.
One method is to use nobs= option in the set statement like you did in your code, but you unnecessarily read the data sets. The following will get the counts and compare them without reading all of the observations.
62 data _null_;
63 if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
64 stop;
65 set sashelp.cars nobs=nobs1;
66 set sashelp.class nobs=nobs2;
67 run;
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.

SAS SCAN Function and Missing Values

I am trying to develop a recursive program to in missing string values using flat probabilities (for instance if a variable had three possible values and one observation was missing, the missing observation would have a 33% of being replace with any value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
I the SCAN function does not create a F or M observation within gender. It also appears to create a new M and F variable. Additionally the DO Loop creates addition entry under within CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, addded noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to deteremine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns not having a missing value because no replacement would occur in those. For columns having a missing the list of candidate REPLACEMENTS excludes the missing and the REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil (random)

SAS: Insert multiple columns or rows in a specific position of an existing dataset

Goal: Add one or more empty rows after a specific position in the dataset
Here's the work I've done so far:
data test_data;
set sashelp.class;
output;
/* Add a blank line after row 5*/
if _n_ = 5 then do;
call missing(of _all_);
output;
end;
/* Add 4 blank rows after row 7*/
if _n_ = 7 then do;
/* inserts a blank row after row 8 (7 original rows + 1 blank row) */
call missing(of _all_);
/*repeats the newly created blank row: inserts 3 blank rows*/
do i = 1 to 3;
output;
end;
end;
run;
I'm still learning how to use SAS, but I "feel" like there's a better way to get to the same result, chiefly not having to use a for loop to insert multiple empty rows. Doing this several times, I lose track of which values n should be as well. I was wondering:
Is there a better way to do this for rows?
Is there an equivalent for columns? (A similar question probably)
These rows/columns are being added more to fit a report format. The dataset doesn't need these blank rows/columns for its own sake. Is some PROC or REPORT function that achieves the same thing?
Do you just want to add blanks to your report after each group of observations?
For example lets make a dummy dataset with a group variable.
data test_data ;
set sashelp.class ;
if _n_ <= 5 then group=1;
else if _n_ <=7 then group=2 ;
else group=3 ;
run;
Now we can make a report that adds two blank lines after each group.
proc report data=test_data ;
column group name age ;
define group / group noprint ;
define name / display ;
define age / display;
compute after group ;
line #1 ' ';
line #1 ' ';
endcomp;
run;
Or is it the case that your report data does not have all possible values and you want your output to have all possible values?
So we wanted a report that counts how many kids in our class by age and we need to report for ages 10 to 18. But there are no 10 year olds in our class. Make a dummy table and merge with that.
proc freq data=sashelp.class ;
tables age / noprint out=counts;
run;
proc print data=counts;
run;
data shell;
do age=10 to 18;
retain count 0 percent 0;
output;
end;
run;
data want ;
merge shell counts ;
by age;
run;
proc print data=want ;
run;

Polynomials in character form to numeric (SAS)

I have a SAS dataset which contains one column of polynomials. For example, X1**(-2)+X1**(2).
Is there a function to transform this into a numeric expression?
Many thanks,
If I understand you correctly, I don't think there is a specific function that will easily let you do this. You have two options - write your own logic to interpret the polynomial expressions, or use call execute to have SAS write out a (potentially very long) data step for you, assuming that the polynomials are all entered as valid data step code. Here's a call execute approach:
data have;
input x1 polynomial $255.;
infile datalines truncover;
datalines;
1 X1**(-2)+X1**(2)
2 X1**(-1)+X1**(1)
3 X1**(1)+X1**(-1)
;
run;
data _null_;
set have end = eof;
if _n_ = 1 then call execute('data want; set have; select(_n_);');
call execute(catx(' ','when(',_N_,') y =',polynomial,';'));
if eof then call execute('end; run;');
run;
Convert them to macro variables, and then resolve them into a calculation...
Using the dataset example in user667489's answer :
/* Create numbered macro variables, 1 per row of data */
data _null_ ;
set have end=eof ;
call symputx(cats('POLY',_n_),polynomial) ;
if eof then call symputx('POLYN',_n_) ;
run ;
%MACRO ROWLOOPER ;
%DO N = 1 %TO &POLYN ;
if _n_ = &N then result = &&POLY&N ;
%END ;
%MEND ;
data want ;
set have ;
/* Not very efficient, looping over all polynomials on each row of data */
/* So for 3 rows, you'll perform 9 iterations here */
%ROWLOOPER ;
run ;
Or, alternatively, write your dataset out into a SAS program, and %inc that program :
data _null_ ;
file "polynomials.sas" ;
set have end=eof ;
if _n_ = 1 then do ;
put "data poly;" ;
put " set have;" ;
end ;
put " result = " polynomial ";" ;
if eof then put "run;" ;
run ;
%inc "polynomials.sas" ;

Summing a table with an unknown number of variables?

I'm fairly new with SAS. I've used it a bit in the past but am really rusty.
I've got a table that looks like this:
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 .
1 . . t 3
For several unique keys.
I want everything to appear on one row so it looks like.
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 t 3
Another wrinkle is I don't know how many group and metric columns I'll have (although I'll always have the same number).
I'm not sure how to approach this. I'm able to get a list of column names and use them in a macro, I'm just not sure what proc or datastep function I need to use to collapse everything down. I would be extremely greatful for any suggestions.
There's a very simple way to do this using a nice trick. I've answered similar questions on this before, see here for one of them. This should achieve exactly what you're after.
You can use 2 temporary arrays (one for the character variables, and another for the numeric), and fill them with the non-blank values accordingly. When you reach last.key, you can load the temporary arrays back into the source variables.
If you know the maximum length of the character variables in advance, you can hard code it, but if not you can determine it dynamically.
This assumes that for each key, each variable is only populated once. Otherwise it will take the last value it sees for a particular variable within each key.
%LET LIB = work ;
%LET DSN = mydata ;
%LET KEYVAR = key ;
/* Get column name/type/max length */
proc sql ;
/* Numerics */
select name, count(name) into :NVARNAMES separated by ' ', :NVARNUM
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and name ^= upcase("&KEYVAR")
and type = 'num' ;
/* Characters */
select name, count(name), max(length) into :CVARNAMES separated by ' ', :CVARNUM, :CVARLEN
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and name ^= upcase("&KEYVAR")
and type = 'char' ;
quit ;
data flatten ;
set &LIB..&DSN ;
by &KEYVAR ;
array n{&NVARNUM} &NVARNAMES ;
array nt{&NVARNUM} _TEMPORARY_ ;
array c{&CVARNUM} &CVARNAMES ;
array ct{&CVARNUM} $&CVARLEN.. _TEMPORARY_ ;
retain nt ct ;
if first.&KEYVAR then do ;
call missing(of nt{*}, of ct{*}) ;
end ;
/* Load non-missing numeric values into temporary array */
do i = 1 to dim(n) ;
if not missing(n{i}) then nt{i} = n{i} ;
end ;
/* Load non-missing character values into temporary array */
do i = 1 to dim(c) ;
if not missing(c{i}) then ct{i} = c{i} ;
end ;
if last.&KEYVAR then do ;
/* Load numeric back into original variables */
call missing(of n{*}) ;
do i = 1 to dim(n) ;
n{i} = nt{i} ;
end ;
/* Load character back into original variables */
call missing(of c{*}) ;
do i = 1 to dim(c) ;
c{i} = ct{i} ;
end ;
output ;
end ;
drop i ;
run ;