SAS logic to populate one row based on another row

I have a SAS dataset like this:
Name MgrName Dept.
A B
B C
C D
X Y
I need to fill in the Dept. using a recursive logic. I know D is the head of 'Payroll', so I fill in:
Name MgrName Dept.
A B
B C
C D Payroll
X Y
But using some kind of recursion, everybody in D's reporting chain (A, B, C) also needs to be assigned 'Payroll'. How can I do that in SAS?

Here is a hash-based approach in the context of a PROC DS2 program.
Each node (name) has only one parent (mgrname), so a hash can be keyed on name with mgrname as the data. Looping over the FIND method walks up the chain until the sought ancestor is reached or the chain runs out.
Example (name is id and mgrname is pid):
data have;
input id $ pid $;
datalines;
A B
B C
C D
F D
G F
H F
P H
Q H
R H
X Y
run;
proc ds2;
data want / overwrite=yes;
  declare package hash links();
  declare char _seek_pid;
  declare char _seek_id;
  declare char dept;
  keep id pid dept;

  /* populate the hash */
  method init();
    links.definekey('id');
    links.definedata('pid');
    links.dataset('{select pid, id from have {options locktable=share}}');
    links.multidata('yes');
    links.definedone();
  end;

  /* seek the ancestor from which the value should be applied */
  method apply(char rootid, char value, char id);
    declare int limit;
    limit = 0;
    _seek_id = id;
    do while (
      links.find([_seek_id], [_seek_pid]) = 0 and
      limit < 100 and
      rootid ne _seek_pid
    );
      limit = limit + 1;
      _seek_id = _seek_pid;
    end;
    if rootid = _seek_pid then dept = value;
  end;

  /* apply some values to some nodes and their descendants */
  method run();
    set have (locktable=share);
    apply ('D','payroll', id);
    apply ('F','shadow$', id);
  end;
enddata;
run;
quit;
%let syslast = want;
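If DS2 is not available, the same walk-up idea can be sketched with an ordinary DATA step hash object. This is a minimal sketch, assuming the HAVE dataset above and only the single 'D' → 'payroll' assignment; the dataset name want_ds and the helper variables _walk, _savepid and _limit are made up for illustration.
data want_ds (drop=_walk _savepid _limit);
   if _n_ = 1 then do;
      declare hash links (dataset:'have');
      links.definekey('id');
      links.definedata('pid');
      links.definedone();
   end;
   set have;
   length dept $8 _walk _savepid $8;
   _savepid = pid;           /* find() overwrites PID, so keep this row's own value */
   dept = ' ';
   _walk = id;
   do _limit = 1 to 100;     /* hard stop in case the data contains a cycle */
      if links.find(key:_walk) ne 0 then leave;   /* reached the top of the chain */
      if pid = 'D' then do;
         dept = 'payroll';
         leave;
      end;
      _walk = pid;
   end;
   pid = _savepid;
   output;
run;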

There is probably a smarter way, but here is a hash object approach:
data have;
input Name $ MgrName $;
datalines;
A B
B C
C D
X Y
;
data want(drop=rc);
declare hash h1(dataset:'have');
h1.definekey('MgrName');
h1.definedata('Name');
h1.definedone();
declare hash h2();
h2.definekey('Name');
h2.definedone();
length Name $ 100 MgrName $ 100;
do rc=h1.find(key:'D') by 0 while (rc=0);
h2.replace();
rc=h1.find(key:Name);
end;
do until (lr);
set have end=lr;
Dept=ifc(h2.check()=0, 'Payroll', '');
output;
end;
run;
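For reference, with the four-row HAVE above this step should produce something like the following (X stays blank because it is not in D's reporting chain):
Name MgrName Dept
A B Payroll
B C Payroll
C D Payroll
X Y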

Related

Counting values per participant

I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (a character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing the same or a different value for KEYVAR. What I want to do is create a new variable called NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant does not have an observation with that specific value, I want NEWVAR to be zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participant as NEWVAR):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use PROC SQL to compute an aggregate result over a group that meets a criterion, and have that aggregate value automatically remerged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on Boolean expressions evaluating to 1 (true) or 0 (false) */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
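If only the single NEWVAR from the question is wanted (the count of 'Y' per group, repeated on every row), the same remerge trick reduces to the sketch below; want_newvar is just an illustrative name.
proc sql;
create table want_newvar as
select ID, value
, sum(value='Y') as NEWVAR /* the group total is remerged onto every row */
from have
group by ID
;
quit;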
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-';
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
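Likewise, if only the 'Y' count is needed from the MEANS/TRANSPOSE route, the merge can keep just that transposed column and rename it. This is a sketch that assumes the Y_count variable name produced by the SUFFIX=_count option above; want_newvar_2 is an illustrative name.
data want_newvar_2;
merge have counts_across(keep=id Y_count rename=(Y_count=NEWVAR));
by id;
run;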

SAS for the following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution, which is very clumsy: I simply sort copies of the data set D1 three times (by ID and ATR1, for example) and remove duplicates, then merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;
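If the tie behaviour mentioned in the sort comment matters, one possible tweak is to add VALUE to the sort key (run this in place of the PROC SORT above), so that among equally frequent values the alphabetically last one is kept deterministically:
proc sort data=freqs;
by id attr_id count value;
run;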

Conditionally retain

I need to create a new variable, called new_id, which takes the same value for tasks that share an id or a location. In this example:
Table 1
id Task location
a Task1 lat1
b Task2 lat2
b Task3 lat1
c Task4 lat3
c Task5 lat4
d Task6 lat5
e Task7 lat5
Table want
id Task Location New_id
a Task1 lat1 a
b Task2 lat2 a
b Task3 lat1 a
c Task4 lat3 c
c Task5 lat4 c
d Task6 lat5 d
e Task7 lat5 d
Task1 and Task3 must have the same new_id because they have the same location.
Task2 and Task3 must have the same new_id because they have the same id.
I tried to use a retain data step: first I sort by location and assign the retained value at first.location, then I sort by id and assign the retained value at first.id.
proc sort data=table1;
by location;
data table1_1;
set table1;
by location;
retain new_id_temp;
if first.location then new_id_temp =id;
new_id=new_id_temp;
run;
proc sort data=table1_1;
by id;
data table1_2;
set table1_1;
by id;
retain id_temp;
if first.id then id_temp=id;
new_id=id_temp;
run;
Based on the above code I still get two different new_id values, and PROC SORT takes a lot of time when the datasets are large.
Can anyone help?
Your issue here is that you didn't update the second data step to use new_id as the source for the retained ID, so it uses b rather than a.
data table1_2;
set table1_1;
by id;
retain id_temp;
if first.id then
id_temp=new_id;
new_id_fin=id_temp;
run;
I'm not sure this is really an effective way to solve your problem generally, but it should give you the results you want. You might want to search around the site (or the web) for other ways to solve this problem, as it's a well understood but complex issue.
/* To help you understand the algorithm, I print intermediate results */
%let print_diagnostics = 1; * 0 : no diagnostics 1 : diagnostics *;
/* Read in the example, extended with extra data */
options mprint;
title read input data;
data table1;
input id $ Task $ location $;
datalines;
a Task01 lat1
b Task02 lat2
b Task03 lat1
b Task04 lat0
c Task05 lat3
c Task06 lat4
d Task07 lat5
e Task08 lat5
f Task09 lat4
f Task10 lat6
g Task11 lat6
g Task12 lat7
h Task13 lat7
;
proc print;
run;
/* The solution needs some iteration, so we need a macro */
%macro re_identify (got, want);
* Initially, we assign id to new_id *;
data &want.;
set &got.;
new_id = id;
run;
* proceed re-assigning ids until stabilised *;
%let pass = 0;
%let proceed = 1;
%do %while (&proceed);
/* To look up the smallest new_id already used for an id or location, I use hash tables. For more information, read Data Step Hash Objects as Programming Tools */
* We will construct two hash tables
* one with the smallest new_id for each id and *
* one with the smallest new_id for each location *
* To achieve this, the smallest new_id should come first *;
%let pass = %eval(&pass + 1);
title pass &pass;
proc sort data=&want.;
by new_id;
run;
data
%if &print_diagnostics %then %do;
hash_id(keep=id id_id)
hash_loc(keep=location loc_id)
%end;
&want. (drop=rc loc_id id_id proceed);
/* The hash tables have to be loaded only once, of course. Mind the declaration of the data variables! */
* Create hash tables with for each id and location
* the smallest new_id used up to now *;
length loc_id id_id $ 1;
if _N_ eq 1 then do;
dcl hash h_id (dataset: "&want.(rename=(new_id=id_id))");
h_id.defineKey('id');
h_id.definedata('id_id','id');
h_id.defineDone();
dcl hash h_loc (dataset: "&want.(rename=(new_id=loc_id))");
h_loc.defineKey('location');
h_loc.definedata('loc_id','location');
h_loc.defineDone();
* Unless we have to lower the new id for any id or location,
* we can stop after this pass *;
proceed = 0;
end;
retain proceed;
* Read in the data *;
set &want. end=last;
* If there is a task with the same id or location
* with a smaller new_id, lower the new_id for this task *;
rc = h_id.find() + h_loc.find();
if rc then put 'WARNING: id or location not found ' _all_;
if id_id lt new_id then new_id = id_id;
if loc_id lt new_id then new_id = loc_id;
output &want.;
* If we lowered the new_id,
* adapt the hash table
* and proceed after this pass *;
if id_id gt new_id then do;
id_id = new_id;
h_id.replace();
proceed = 1;
end;
if loc_id gt new_id then do;
loc_id = new_id;
h_loc.replace();
proceed = 1;
end;
/* Adapting the hash tables with the replace method is optional but can drastically reduce the number of passes. */
* transfer the decision to proceed
* from a data step variable to a macro variable *;
if last then call symput ('proceed', proceed);
%if &print_diagnostics %then %do;
if last then do;
dcl hiter i_id ('h_id') ;
dcl hiter i_loc ('h_loc') ;
do rc = i_id.first () by 0 while ( rc = 0 ) ;
output hash_id;
rc = i_id.next () ;
end;
do rc = i_loc.first () by 0 while ( rc = 0 ) ;
output hash_loc;
rc = i_loc.next () ;
end;
put "NOTE: after pass &pass." proceed=;
end;
%end;
run;
%if &print_diagnostics %then %do;
* Print intermediate results *;
title2 new id assigned to task; proc print data=&want.; run;
title2 new id assigned to id; proc print data=hash_id; run;
title2 new id assigned to location; proc print data=hash_loc; run;
%end;
%end;
%mend;
%re_identify(table1, table_want);
/* And finally write out the report. */
* sort in task order and print the final results *;
title final result;
proc sort data=table_want;
by Task;
proc print;
run;
This should give you the results you expect, with a caveat: if you were to add a Task8 having id f and location lat1, you would need a more refined algorithm making two or more passes. But this solution will work fine as long as your ids and locations progress in a way that rows sharing common ids and/or locations follow one another.
Generate Sample Dataset
data tasks;
input id $ Task $ location $;
datalines;
a Task1 lat1
b Task2 lat2
b Task3 lat1
c Task4 lat3
c Task5 lat4
d Task6 lat5
e Task7 lat5
;
Generate all Possible Combinations using PROC FREQ
proc freq data=tasks;
table id * task * location / out=combinations (drop=percent count);
run;
Calculate New IDs Based on Your Criteria
data newIDs;
set combinations;
length prev_id $ 1
newID $ 1
prev_location $ 4;
retain newID prev_id prev_location;
* First scenario - first row;
if _N_ = 1 then do;
put _N_= "First scenario - first row";
newID = id;
output;
prev_id = id;
prev_location = location;
end;
* Second scenario - some redundancy between 2 rows;
else if id = prev_id or prev_location=location then do;
put _N_= "Second Scenario - some redundancy";
output;
prev_id = id;
prev_location = location;
end;
* Third scenario - no redundancy;
else do;
put _N_= "Third scenario - no redundancy";
newID = id;
output;
prev_id = id;
prev_location = location;
end;
keep id task location newID;
run;
Merge the Tasks Dataset to the newIDs Dataset
proc sql;
create table tasks_update as
select t.id
,i.newID
,t.Task
,t.location
from tasks as t
left join newIDs as i
on t.id = i.id
and t.task = i.task
and t.location = i.location
order by id;
quit;
Results
id newID Task location
a a Task1 lat1
b a Task2 lat2
b a Task3 lat1
c c Task4 lat3
c c Task5 lat4
d d Task6 lat5
e d Task7 lat5
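To see the caveat in action, here is a hypothetical input (the name tasks_tricky is made up) in which Task8 reuses location lat1 but does not immediately follow another lat1 row. The row-by-row comparison then assigns it newID = f instead of a; an iterative approach such as the %re_identify macro in the earlier answer is built for that case.
data tasks_tricky;
input id $ Task $ location $;
datalines;
a Task1 lat1
b Task2 lat2
b Task3 lat1
c Task4 lat3
f Task8 lat1
;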

SAS - Dynamically create column names using the values from another column

I have a column with many flags that came out of an XML parser. The data looks like this:
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
I have to create a table with all these column names to capture the flags. Like:
USERKEYED VALMATCH DEVICEVERIFIED EXCEPTION USERREGISTRD ASSOCIATE EXTERNAL GROSSGIVEN UMAPPED
Y N N N N Y N Y N
Y N N N N Y Y Y N
Y N N Y N Y N Y N
How can I capture values dynamically in SAS? Either in a DATA step or a PROC step?
Thanks in advance.
Let's start with your example output data.
data expect ;
id+1;
length USERKEYED VALMATCH DEVICEVERIFIED EXCEPTION
USERREGISTRD ASSOCIATE EXTERNAL GROSSGIVEN UMAPPED $1 ;
input USERKEYED -- UMAPPED;
cards4;
Y N N N N Y N Y N
Y N N N N Y Y Y N
Y N N Y N Y N Y N
;;;;
Now we can recreate your example input data:
data have ;
do until (last.id);
set expect ;
by id ;
array flag _character_;
length string $200 ;
do _n_=1 to dim(flag);
string=catx(';',string,catx('=',vname(flag(_n_)),flag(_n_)));
end;
end;
keep id string;
run;
Which will look like this:
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=Y;GROSSGIVEN=Y;UMAPPED=N
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=Y;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N
So to process this we need to parse the pairs out of the variable STRING into multiple observations, with each pair split into NAME and VALUE variables.
data middle ;
set have ;
do _n_=1 by 1 while(_n_=1 or scan(string,_n_,';')^=' ');
length name $32 ;
name = scan(scan(string,_n_,';'),1,'=');
value = scan(scan(string,_n_,';'),2,'=');
output;
end;
keep id name value ;
run;
Then we can use PROC TRANSPOSE to convert those observations into variables.
proc transpose data=middle out=want (drop=_name_) ;
by id;
id name ;
var value ;
run;
The data that you have is a series of name/value pairs, using a ; as a delimiter. We can extract each name/value pair one at a time, and then parse those into values:
data tmp;
length my_string next_pair name value $200;
my_string = "USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;";
cnt = 1;
next_pair = scan(my_string,cnt,";");
do while (next_pair ne "");
name = scan(next_pair,1,"=");
value = scan(next_pair,2,"=");
output;
cnt = cnt + 1;
next_pair = scan(my_string,cnt,";");
end;
keep name value;
run;
Gives us:
name value
=================== =====
USERKEYED Y
VALMATCH N
DEVICEVERIFIED N
EXCEPTION N
USERREGISTRD N
ASSOCIATE Y
EXTERNAL N
GROSSGIVEN Y
UMAPPED N
We can then transpose the data so that the name is used for the column names:
proc transpose data=tmp out=want(drop=_name_);
id name;
var value;
run;
Which gives you the desired table.
DATA <MY_DATASET>;
SET INPUT_DATASET;
/* take the single character that follows each NAME= tag in the flag string */
USERKEYED = substr(input_column, find(input_column, 'USERKEYED=')+10,1);
VALMATCH = substr(input_column, find(input_column, 'VALMATCH=')+9,1);
DEVICEVERIFIED = substr(input_column, find(input_column, 'DEVICEVERIFIED=')+15,1);
EXCEPTION = substr(input_column, find(input_column, 'EXCEPTION=')+10,1);
USERREGISTRD = substr(input_column, find(input_column, 'USERREGISTRD=')+13,1);
ASSOCIATE = substr(input_column, find(input_column, 'ASSOCIATE=')+10,1);
EXTERNAL = substr(input_column, find(input_column, 'EXTERNAL=')+9,1);
GROSSGIVEN = substr(input_column, find(input_column, 'GROSSGIVEN=')+11,1);
UMAPPED = substr(input_column, find(input_column, 'UMAPPED=')+8,1);
run;
My answer is essentially in the first block of code; the rest is explanation, one alternative, and a nice tip.
Based on the answer you gave, the input data is already in a SAS data set, so it can be read to create a file of SAS code which is then run using %include, and so PROC TRANSPOSE is not required:
filename tempcode '<path><file-name.txt>'; /* set this up yourself */
/* write out SAS code to the fileref tempcode */
data _null_;
file tempcode;
set have;
if _n_=1 then
put 'Y="Y"; N="N"; drop Y N;';
put input_column;
put 'output;';
run;
/* %include the code to create the desired output */
data want;
%include tempcode;
run;
As the input data already looks almost like SAS assignment statements, we have taken advantage of that, so the SAS code run from fileref tempcode using %include should look like:
Y="Y"; N="N"; drop Y N;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
output;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=N;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=Y;GROSSGIVEN=Y;UMAPPED=N;
output;
USERKEYED=Y;VALMATCH=N;DEVICEVERIFIED=N;EXCEPTION=Y;USERREGISTRD=N;ASSOCIATE=Y;EXTERNAL=N;GROSSGIVEN=Y;UMAPPED=N;
output;
As an alternative, fileref tempcode could contain all of the code for data step "data want;":
/* write out entire SAS data step code to the fileref tempcode */
data _null_;
file tempcode;
set have end=lastrec;
if _n_=1 then
put 'data want;'
/'Y="Y"; N="N"; drop Y N;';
put input_column;
put 'output;';
if lastrec then
put 'run;';
run;
%include tempcode; /* no need for surrounding SAS code */
As a tip, to see the code being processed by %include in the log you can use the following variation:
%include tempcode / source2;
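As a small variation, if you would rather not manage a physical file yourself, a temporary fileref should work just as well here; SAS picks the location and removes the file at the end of the session:
filename tempcode temp;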

Conditionally replace column values with column name in SAS dataset

I have a SAS dataset as follows:
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Other than keeping the existing variables, I want to replace each variable's value with the variable name: if variable A has value 1, then the new variable should have value 'A', else blank.
Currently I am hardcoding the values; does anyone have a better solution?
The following should do the trick (the first data step sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will have length $8 rather than $1, which can be a waste of space. If either point is important (they seldom are), both can be remedied by following up with a LENGTH statement in a subsequent data step:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names, which means you might as well have stuck with the original array approach.