Summing a table with an unknown number of variables? - sas

I'm fairly new with SAS. I've used it a bit in the past but am really rusty.
I've got a table that looks like this:
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 .
1 . . t 3
For several unique keys.
I want everything to appear on one row, so it looks like this:
Key Group1 Metric1 Group2 Metric2 Group3 Metric3
1 . r 20 t 3
Another wrinkle is I don't know how many group and metric columns I'll have (although I'll always have the same number).
I'm not sure how to approach this. I'm able to get a list of column names and use them in a macro; I'm just not sure what proc or DATA step function I need to use to collapse everything down. I would be extremely grateful for any suggestions.

There's a very simple way to do this using a nice trick. I've answered similar questions on this before, see here for one of them. This should achieve exactly what you're after.
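The answer above does not spell the trick out. One well-known candidate (an assumption about what was meant, not confirmed by the post) is to UPDATE an empty copy of the table with itself. By default the UPDATE statement does not let missing values in the transaction data set overwrite values already held, so the rows for each key collapse into one. A minimal sketch, assuming the table is work.mydata and the key variable is key:

```sas
/* UPDATE needs BY-group order, so sort first if the data is not already sorted */
proc sort data=work.mydata ;
  by key ;
run ;

data flatten ;
  /* master = empty copy (obs=0), transaction = the full table;    */
  /* missing transaction values leave the accumulated row alone    */
  update work.mydata(obs=0) work.mydata ;
  by key ;
run ;
```

No column list is needed, so the number of group and metric columns does not matter. If a variable is populated more than once within a key, the last non-missing value is the one kept.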

You can use 2 temporary arrays (one for the character variables, and another for the numeric), and fill them with the non-blank values accordingly. When you reach last.key, you can load the temporary arrays back into the source variables.
If you know the maximum length of the character variables in advance, you can hard code it, but if not you can determine it dynamically.
This assumes that for each key, each variable is only populated once. Otherwise it will take the last value it sees for a particular variable within each key.
%LET LIB = work ;
%LET DSN = mydata ;
%LET KEYVAR = key ;
/* Get column name/type/max length */
proc sql ;
/* Numerics */
select name, count(name) into :NVARNAMES separated by ' ', :NVARNUM
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and upcase(name) ^= upcase("&KEYVAR") /* NAME keeps its original case, so upcase both sides */
and type = 'num' ;
/* Characters */
select name, count(name), max(length) into :CVARNAMES separated by ' ', :CVARNUM, :CVARLEN
from dictionary.columns
where libname = upcase("&LIB")
and memname = upcase("&DSN")
and upcase(name) ^= upcase("&KEYVAR") /* NAME keeps its original case, so upcase both sides */
and type = 'char' ;
quit ;
data flatten ;
set &LIB..&DSN ;
by &KEYVAR ;
array n{&NVARNUM} &NVARNAMES ;
array nt{&NVARNUM} _TEMPORARY_ ;
array c{&CVARNUM} &CVARNAMES ;
array ct{&CVARNUM} $&CVARLEN.. _TEMPORARY_ ;
retain nt ct ;
if first.&KEYVAR then do ;
call missing(of nt{*}, of ct{*}) ;
end ;
/* Load non-missing numeric values into temporary array */
do i = 1 to dim(n) ;
if not missing(n{i}) then nt{i} = n{i} ;
end ;
/* Load non-missing character values into temporary array */
do i = 1 to dim(c) ;
if not missing(c{i}) then ct{i} = c{i} ;
end ;
if last.&KEYVAR then do ;
/* Load numeric back into original variables */
call missing(of n{*}) ;
do i = 1 to dim(n) ;
n{i} = nt{i} ;
end ;
/* Load character back into original variables */
call missing(of c{*}) ;
do i = 1 to dim(c) ;
c{i} = ct{i} ;
end ;
output ;
end ;
drop i ;
run ;

Related

SAS SCAN Function and Missing Values

I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable had three possible values and one observation was missing, the missing observation would have a 33% chance of being replaced with any given value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not create an F or M value within gender. It also appears to create new M and F variables. Additionally, the DO loop creates additional entries within CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=','; /* INFILE goes before INPUT so the delimiter is in effect when the first record is read */
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
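/* Sets RandGender to an integer from 1 to &rescats, each equally likely */
/* (ROUND(u*(&rescats - 1)) + 1 would under-weight the first and last categories when there are more than two) */
RandGender = CEIL(u*&rescats);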
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=','; /* INFILE goes before INPUT so the delimiter is in effect when the first record is read */
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*List input with DATALINES; a dot marks a missing value, including for character variables*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can do only two passes but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns that have no missing values, because no replacement occurs in them. For a column that does have missing values, the list of candidate REPLACEMENTS excludes the missing value, so REPLACEMENT_COUNT is the correct count for a uniform replacement probability of 1/COUNT, coded as rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234)).

SAS: Coffee anyone?

I tried this in C# but have not had much success. So I am now trying in SAS. Using an EG session and my SAS code, we work with the list of students in SASHELP.CLASS.
These people want to get to know each other and have a monthly random pairing to go on a Coffee Date.
Rules:
A random Coffee Date List is Generated monthly;
I store each months pairing into a Historical Dataset, which I append monthly.
One person cannot have coffee with the same person within a 6 month period. So we keep a separate dataset for historical purposes with 3 Vars:
LastDate,InviterID,InvitedID
We check each pairing against the Historical list of which we only load the most recent 6 months data into a temp dataset for checking purposes.
If no recent matched pair is found, a new matched pair is added to a new Paired Dataset, and the 2 names (Rows) are removed from the original Participants dataset until the dataset has less than 2 rows. (a single person cannot be paired with another)
Unfortunately we have 19 people in this list so one person will be left out until we can add a new participant. Is anyone interested in joining our coffee club? :-)
So I start by deriving an ID (_n_) from the dataset, and I only keep the Name:
Data Participants(Keep=ID Name);
FORMAT ID 8.;
set SASHelp.class;
ID=_n_;
run;
These 19 People will be my Participants in the Coffee Club.
I more or less follow the line of thought:
data _null_;
randvar = ceil(rand('UNIFORM') * 100000);
call symput('RANDSEED', randvar);
run;
data CR.names2(keep=MEMID randid);
set CR.MasterNames;
randid = rand('UNIFORM');
run;
proc sort data=CR.names2 ; by randid; run;
data CR.pairs(keep=pairgrp MEMID);
set CR.names2 nobs=num_peeps;
pairgrp+1;
if pairgrp > floor(num_peeps/2) then pairgrp=1;
run;
proc sort data=CR.pairs; by pairgrp;run;
proc transpose data=CR.pairs
out=CR.pairs2 (drop=_NAME_);
var memid;
by pairgrp;
run;
Data CR.Pairs3;
set CR.pairs2;
rename COL1=InviterID COL2=InvitedID;
run;
But I get stuck :-(
I need help with the rest please...
Has anyone else done this type of random pairing successfully before? I am grasping at straws here...
Any help much appreciated.
Len
Here is my idea. It is far from efficient, especially when NOBS gets large, because a Cartesian product is involved. I also cheated on the odd number of members by adding an extra row in that case.
Prepare data and generate empty result table.
Create a list of all possible pairings (combinations) excluding recent pairings.
Random sort and descend through the list until every element has been picked once.
Append to result table.
There is a drawback as there might be members who will not get pairings as all possible partners are already picked. To avoid that we could iterate until we get a maximum of pairings.
EDIT: Added iteration. Now the program makes draws randomly until everyone is matched or a threshold is reached.
This problem should probably be implemented in a matrix orientated language like IML or R.
data Participants(Keep=ID Name) ;
set SASHelp.class nobs = num_peeps ;
ID=_n_ ;
output ;
if _n_ = 1 and mod(num_peeps,2) then do ; /* get even number of members: empty ID to pair with last participant*/
name = 'empty' ;
id = 0 ;
output ;
end ;
run ;
data list_of_meetings ;
length iteration InviterID InvitedID 8. ;
run ;
/****
iter = number of club meetings
hist = length of memory for pairings
tries = number of iterations to pair everyone
****/
%macro loop_coffee (iter=, hist=6, tries= 10) ;
proc sql noprint ;
select max(0,max(iteration)) + 1 into :base
from list_of_meetings ;
quit ;
%do i = &base. %to &iter. ; /* loop through number of meetings */
proc sort data = list_of_meetings (where=(iteration >= &i - &hist )) out = lookup nodupkey ; by InviterID InvitedID ; run ; /* get memory of pairings */
proc sql ; /* list all acceptable pairs */
create table all_pairs as
select a.ID as InviterID, b.ID as InvitedID
from Participants a
inner join Participants b
on a.ID lt b.ID
left join lookup c /* exclude the memory */
on a.ID eq c.InviterID and b.ID eq c.InvitedID
where c.InviterID is NULL ;
quit ;
%let j = 0 ;
%let all_pairs = 0 ;
%do %until (&all_pairs | &j > &tries) ; /* iterate and random sort until all members are paired */
%let j = %eval( &j + 1 ) ;
data all_pairs;
set all_pairs;
randnum = ranuni(12345 + &i + &j);
run;
proc sort data = all_pairs ; by randnum ; run ; /* random sort */
data out_pairs ; /* select the pairs: no. of IDs/2 */
declare hash h() ;
h.defineKey("ID") ;
h.defineDone() ;
do until ( eof1 ) ;
set Participants (keep= ID) end = eof1 ;
rc = h.add () ; /* populate list of members */
end ;
do until ( eof2 ) ;
set all_pairs (keep= InviterID InvitedID) end = eof2 ;
rc1 = h.check (key:InviterID) ;
rc2 = h.check (key:InvitedID) ;
if rc1 = 0 and rc2 = 0 then do ;
rc = h.remove (key:InviterID) ; /* delete member from list if paired */
rc = h.remove (key:InvitedID) ;
output ;
end ;
if h.num_items = 0 then do ;
call symput('all_pairs', 1 ) ;
stop ;
end;
end ;
stop ;
keep InviterID InvitedID ;
run ;
%end ;
data list_of_meetings ;
set list_of_meetings (where=(iteration ne .))
Out_pairs (in=pairs) ;
if pairs then iteration = &i. ;
run ;
%end ;
%mend ;
%loop_coffee (iter=10,hist=6,tries=10) ;

split string to columns with content fill

I have data that looks like this:
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
I need to add:
A new column with column heading for each sequence, add prefix 'RT', drop the letters following the numeric part of the sequence
Fill the new column with the letters that follow the numeric part of the sequence
I need to create this:
ID   Sequence               RTE6  RTK11  RTQ174  RTD177  RTV245
----------------------------------------------------------------
101  E6S,K11T,Q174K,D177E   S     T      K       E
102  K11T,V245EKQ                 T                      EKQ
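I assume you want a SAS data set and not a report. ANYDIGIT with a negative start position searches from right to left, which makes it pretty easy to find the last digit and so split off the trailing non-digit sub-string.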
data seq;
infile cards firstobs=3;
input id:$3. sequence :$50.;
cards;
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
;;;;
run;
proc print;
run;
data seq2V / View=seq2V;
set seq;
length w name sub $32 subl 8;
do i = 1 by 1;
w = scan(sequence,i,',');
if missing(w) then leave;
subl = anydigit(w,-99);
name = substrn(w,1,subl);
sub = substrn(w,subl+1);
output;
end;
run;
proc transpose data=seq2V out=seq3(drop=_name_) prefix=RT;
by id sequence;
var sub;
id name;
run;
proc print;
run;
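To see what ANYDIGIT is doing in the view above: a negative start position makes it search from right to left, so it returns the position of the last digit in the token. A small check on a single token, as a sketch (the variable names mirror the view; the PUT statement is only there to inspect the result in the log):

```sas
data _null_ ;
  length w name sub $32 ;
  w = 'V245EKQ' ;
  p    = anydigit(w, -99) ;   /* position of the last digit, searching right to left */
  name = substrn(w, 1, p) ;   /* everything up to and including that digit */
  sub  = substrn(w, p + 1) ;  /* the letters that follow */
  put p= name= sub= ;         /* log: p=4 name=V245 sub=EKQ */
run ;
```

PROC TRANSPOSE then uses name (with the RT prefix) as the new column name and sub as its value.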
I had a similar problem a while ago. The code is adapted to your problem.
I found this solution to work faster than anything I tried with proc transpose.
Still, overall performance on huge datasets (especially with many different sequences) is not great, as we loop over all the strings more than once and also over the final variables.
Can anyone offer a faster solution?
(Caution: MacroVar is limited to 65534 Characters.)
data var_name ;
set in_data;
length var string $30.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
var = substr(string,1,anydigit(string,-99));
output;
keep var;
end;
run;
proc sql noprint;
select distinct compress("RT"||var) into :var_list separated by ' '
from var_name;
quit;
%put &var_list.;
data out_data;
set in_data;
length string &var_list. $30. n 8. ;
array a_var [*] &var_list.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
do j = 1 to dim(a_var);
n = anydigit(string,-99) ;
if substr(vname(a_var[j]),3) eq substr(string,1,n) then a_var[j] = substr(string,n+1);
end;
end;
drop string i j n;
run;

Dynamic summing of n columns in SAS using arrays

I have a problem in SAS where I have to sum n columns(Time(1) to time(N)) where the N is defined as a variable in another column(Min_Remain_wthdrw_Prd).
I am writing the below code but it is not working:
data certain;set certain;
array t(*) t1-t60;
do while(i<=Min_Remain_wthdrw_Prd);
S_Disc=sum(t(1)-t(i));
end;
end;
run;
Kindly help
You have too many end statements, and you can just use a regular do loop...
data certain ;
set certain ;
array t(*) t1-t60 ;
S_Disc = 0 ;
do i = 1 to Min_Remain_wthdrw_Prd ;
S_Disc+t{i} ;
end ;
run;
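The loop above reads past the end of the array (and stops with a subscript error) if Min_Remain_wthdrw_Prd is ever greater than 60. A slightly more defensive variant, offered as a sketch: it assumes the same variable names as the question, writes to a hypothetical output data set certain_sum, and uses a helper variable n_terms that is not in the original.

```sas
data certain_sum ;                 /* hypothetical output name */
  set certain ;
  array t(*) t1-t60 ;
  /* never index past the end of the array */
  n_terms = min(Min_Remain_wthdrw_Prd, dim(t)) ;
  /* MIN() ignores missing arguments, so handle a missing N explicitly */
  if missing(Min_Remain_wthdrw_Prd) then n_terms = 0 ;
  S_Disc = 0 ;
  do i = 1 to n_terms ;
    S_Disc = sum(S_Disc, t{i}) ;   /* SUM() skips missing t values */
  end ;
  drop i n_terms ;
run ;
```

Dropping i keeps the loop counter out of the output table, which the shorter version leaves in.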

How to output sas format as proc format syntax?

I have created a format based on a dataset. Now I want to store this format as a value-list as part of the proc format syntax in my sas program. Is there a way to accomplish this?
The reason for doing this is that I often need to make tables which group the country background of people into groups similar to continents. Until now this has been done by joining the data using country code as key variable with another dataset which contain a continents variable, and then applying a format $continents on the continents variable.
I want to be able to skip this join operation by making a format for continents that takes country codes as input values. I also want this format to be stored in the syntax file which produces the tables and not in a format catalog. Since the world has a lot of countries, writing this format manually seems prone to error.
This is just a guide; it hasn't been tested with every scenario, e.g. numeric, character and informat, or multi-label/picture formats.
/* Create a dummy format */
data dummyfmt ;
retain fmtname 'DUMMY' type 'N' ;
do i = 1 to 10 ;
start = i ;
label = repeat(byte(round(ranuni(0) * (122 - 97 + 1),1) + 96),10) ;
if i = 10 then hlo = 'O' ;
output ;
end ;
run ;
proc format cntlin=dummyfmt ; run ;
/* Dump the format back out to a dataset */
proc format cntlout=dump library=work ;
select dummy ;
run ;
proc print heading=H ; run ;
/* Write out to log... */
data _null_ ;
set dump end=eof ;
if _n_ = 1 then do ;
put "proc format ;" ;
if type = 'N' then put " value " fmtname ;
if type = 'C' then put " value $" fmtname ;
if type = 'I' then put " invalue " fmtname ;
end ;
if hlo = 'O' then do ;
if type in('N' 'C') then put " other = '" label +(-1) "'" ;
if type = 'I' then put " other = " label ;
end ;
else do ;
if type in('N' 'C') then put " " start " = '" label +(-1) "'" ;
if type = 'I' then put " " start " = " label ;
end ;
if eof then do ;
put " ;" ;
put "run ;" ;
end ;
run ;
You may need to modify the above depending on your format, especially if there are ranges involved. The SEXCL and EEXCL columns would then be relevant.
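As a rough illustration of how SEXCL and EEXCL could be used, the sketch below writes range entries for a numeric format to the log. It assumes the dump data set created by CNTLOUT= above, handles only ordinary ranges (no LOW, HIGH or OTHER rows, and no labels containing quotes), and the variables op and line are helpers introduced here, not part of the original code.

```sas
/* Sketch only: numeric formats, ordinary ranges */
data _null_ ;
  set dump ;                        /* CNTLOUT= data set created above */
  where type = 'N' and hlo = ' ' ;  /* skip LOW/HIGH/OTHER rows */
  length op $3 line $200 ;
  /* '<' before the dash excludes the start value, '<' after it excludes the end value */
  op = cats(ifc(sexcl = 'Y', '<', ' '), '-', ifc(eexcl = 'Y', '<', ' ')) ;
  if strip(start) = strip(end) then
    line = catx(' ', start, '=', cats("'", label, "'")) ;          /* single value */
  else
    line = catx(' ', start, op, end, '=', cats("'", label, "'")) ; /* range */
  put line ;
run ;
```

For the dummy format every START equals END, so this prints the same single-value lines as before; the range branch only matters for formats defined with intervals.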
/* Example output (from Log Window) */
proc format ;
value DUMMY
1 = 'bbbbbbbbbbb'
2 = 'hhhhhhhhhhh'
3 = 'ttttttttttt'
4 = 'fffffffffff'
5 = 'sssssssssss'
6 = 'bbbbbbbbbbb'
7 = 'aaaaaaaaaaa'
8 = 'ppppppppppp'
9 = 'eeeeeeeeeee'
other = 'wwwwwwwwwww'
;
run ;