Replace string with previous/forward other string in SAS

I have a dataset in which -1 stands for a missing value. I would like to replace each -1 first with the previous known value and, if none is found, with the next known value, by group V1 and V2.
Example:
data test;
input V1 $ V2 $ V3 $;
infile datalines delimiter=',';
datalines;
A,X,AA0
A,X,-1
A,X,AA0
A,Y,-1
A,Y,BB2
B,X,-1
B,X,-1
B,X,CC1
B,Y,-1
B,Y,-1
;
After first run (filling down)
V1 V2 V3
1 A X AA0
2 A X AA0
3 A X AA0
4 A Y -1
5 A Y BB2
6 B X -1
7 B X -1
8 B X CC1
9 B Y -1
10 B Y -1
After second run (filling up):
V1 V2 V3
1 A X AA0
2 A X AA0
3 A X AA0
4 A Y BB2
5 A Y BB2
6 B X CC1
7 B X CC1
8 B X CC1
9 B Y NA
10 B Y NA
I found a similar question here.
However, I don't get the desired result when I substitute '-1' for '.': the replaced values are truncated, for example AA0 becomes AA.
This is my try:
proc sort data=test;
by V1
V2;
run;
data want;
set test;
by V1 V2;
retain new_var ('-1');
if not last.V1 and V3 ne '-1' then new_var=V3;
else if V3 = '-1' then V3 = new_var;
if last.V1 then new_var = '-1';
/* drop year_tmp; */
run;

Use a DOW loop to find the first non-missing V3 value in each V1/V2 group. Then run a second loop over the same group, reading through a separate SET buffer, to track the most recent prior V3 value and fill in each -1 with either the prior value or, failing that, the first forward value. Prior values take precedence over forward values.
data want;
length firstv3 priorv3 $8;
do _n_ = 1 by 1 until (last.v2);
set test;
by v1 v2;
if missing(firstv3) and v3 ne '-1' then firstv3 = v3;
end;
do _n_ = 1 to _n_;
set test;
if v3 ne '-1' then
priorv3 = v3;
else do;
if not missing(priorv3) then v3 = priorv3;
else
if not missing(firstv3) then v3 = firstv3;
end;
output;
end;
run;
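The expected output in the question shows NA for a group with no known value at all (B, Y), whereas the step above leaves -1 there. A minimal follow-up sketch, assuming the want dataset produced above and that the literal 'NA' is what is wanted; the dataset name want_na is hypothetical.

```sas
data want_na;
  set want;
  /* any -1 still present belongs to a group with no known value */
  if v3 = '-1' then v3 = 'NA';
  /* helper variables created by the fill step are no longer needed */
  drop firstv3 priorv3;
run;
```

V3 was read with a length of 8, so 'NA' fits without truncation.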

Related

Identifying groups/networks of customers

I am trying to create unique customer groups which are determined by customer interactivity across transactions.
Here is an example of the data:
Transaction #  Primary Customer  Cosigner  WANT: Customer Group
1              1                 2         A
2              1                 3         A
3              1                 4         A
4              1                 2         A
5              2                 5         A
6              3                 6         A
7              2                 1         A
8              3                 1         A
9              7                 8         B
10             9                           C
In this example, customer 1 is connected to customers 2-6 either directly or indirectly, so all transactions associated with customers 1-6 would be part of an "A" group. Customers 7 and 8 are directly connected and would be labeled as a "B" group. Customer 9 has no connections and is the sole member of the "C" group.
Any suggestions are appreciated!
Your data can be considered the edges of a graph, so your request is to find the connected subgraphs of that graph. That question has an answer on Stack Overflow and on SAS Communities, but this question is more on topic than the older Stack Overflow question. So let's post the subnet SAS macro from the SAS Communities answer here, where it will be easier to find.
This simple macro uses repeated PROC SQL queries to build the list of connected subgraphs until all of the original records have been assigned to a subgraph.
The macro is set up to let you pass in the name of the source dataset and the names of the two variables that hold the ids of the nodes.
So first let's convert your printout into an actual SAS dataset.
data have;
input id primary cosign want $;
cards;
1 1 2 A
2 1 3 A
3 1 4 A
4 1 2 A
5 2 5 A
6 3 6 A
7 2 1 A
8 3 1 A
9 7 8 B
10 9 . C
;
Now we can call the macro and tell it that PRIMARY and COSIGN are the variables with the node ids and that SUBNET is the name for the new variable to hold the ids of the connected subgraphs. NOTE: This version treats the graph as directed by default.
%subnet(in=have,out=want,from=primary,to=cosign,subnet=subnet);
Results:
Obs id primary cosign want subnet
1 1 1 2 A 1
2 2 1 3 A 1
3 3 1 4 A 1
4 4 1 2 A 1
5 5 2 5 A 1
6 6 3 6 A 1
7 7 2 1 A 1
8 8 3 1 A 1
9 9 7 8 B 2
10 10 9 . C 3
Here is the code of the %SUBNET() macro.
%macro subnet(in=,out=,from=from,to=to,subnet=subnet,directed=1);
/*----------------------------------------------------------------------
SUBNET - Build connected subnets from pairs of nodes.
Input Table :FROM TO pairs of rows
Output Table:input data with &subnet added
Work Tables:
NODES - List of all nodes in input.
NEW - List of new nodes to assign to current subnet.
Algorithm:
Pick next unassigned node and grow the subnet by adding all connected
nodes. Repeat until all unassigned nodes are put into a subnet.
To treat the graph as undirected set the DIRECTED parameter to 0.
----------------------------------------------------------------------*/
%local subnetid next getnext ;
%*----------------------------------------------------------------------
Put code to get next unassigned node into a macro variable. This query
is used in two places in the program.
-----------------------------------------------------------------------;
%let getnext= select node into :next from nodes where subnet=.;
%*----------------------------------------------------------------------
Initialize subnet id counter.
-----------------------------------------------------------------------;
%let subnetid=0;
proc sql noprint;
*----------------------------------------------------------------------;
* Get list of all nodes ;
*----------------------------------------------------------------------;
create table nodes as
select . as subnet, &from as node from &in where &from is not null
union
select . as subnet, &to as node from &in where &to is not null
;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Set subnet to next id ;
*----------------------------------------------------------------------;
%let subnetid=%eval(&subnetid+1);
update nodes set subnet=&subnetid where node=&next;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Get list of connected nodes for this subnet ;
*----------------------------------------------------------------------;
create table new as
select distinct a.&to as node
from &in a, nodes b, nodes c
where a.&from= b.node
and a.&to= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%if "&directed" ne "1" %then %do;
insert into new
select distinct a.&from as node
from &in a, nodes b, nodes c
where a.&to= b.node
and a.&from= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%end;
*----------------------------------------------------------------------;
* Update subnet for these nodes ;
*----------------------------------------------------------------------;
update nodes set subnet=&subnetid
where node in (select node from new )
;
%end;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%end;
*----------------------------------------------------------------------;
* Create output dataset by adding subnet number. ;
*----------------------------------------------------------------------;
create table &out as
select distinct a.*,b.subnet as &subnet
from &in a , nodes b
where a.&from = b.node
;
quit;
%mend subnet ;
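With the sample data, the directed search happens to reach every member of the first group from the first node it picks. When that cannot be guaranteed, the DIRECTED parameter documented in the macro header can be set to 0 so that edges are followed in both directions. A minimal sketch, assuming the HAVE dataset and the %subnet macro above have already been submitted; the output name want_undirected is hypothetical.

```sas
/* Treat PRIMARY-COSIGN pairs as undirected edges */
%subnet(in=have, out=want_undirected, from=primary, to=cosign,
        subnet=subnet, directed=0);

/* Inspect the assigned subnet ids next to the hand-labeled WANT column */
proc print data=want_undirected;
run;
```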
You can use Hashes to compute your group identities and their members:
Example:
PROC DS2 is used because its hash declarations are succinct and its methods keep the code clear. The final pair, Q H, bridges two groups that were independent up to that point, so the two groups must be merged.
data customer;
length id1-id2 $8;
input id1-id2 ##; output;
datalines;
A B A C B A B D C A C D D C D .
E F E . F E F .
H J H K K L K M
P Q Q R R S S T
Q H
;
run;
%if %sysfunc(exist(vs)) %then %do;
proc delete data=vs;
proc delete data=gs;
%end;
options nosource;
proc ds2 ;
data _null_ ;
declare char(8) v1 v2 v;
declare double g gnew;
declare package hash vs([v], [v g], 0, '', 'ascending');
declare package hash gs([g], [g v], 0, '', 'ascending', '', '', 'multidata');
method add11(char(8) x1, char(8) x2); /* neither vertex has been seen before */
g + 1;
v = x1; vs.add(); gs.add();
v = x2; vs.add(); gs.add();
* put 'add11' x1 $char1. x2 $char1. ' ' g;
end;
method add10(char(8) x1, char(8) x2); /* x1 is not in a group, x2 is */
v = x2; vs.find(); * get group;
v = x1; vs.add(); * apply group to x1;
gs.add();
* put 'add10' x1 $char1. x2 $char1. ' ' g;
end;
method add01(char(8) x1, char(8) x2); /* x1 is in a group, x2 is not */
v = x1; vs.find(); * get group;
v = x2; vs.add(); * apply group to x2;
gs.add();
* put 'add01' x1 $char1. x2 $char1. ' ' g;
end;
method add00(char(8) x1, char(8) x2); /* both x1 and x2 are in a group */
declare double g1 g2;
v = x1; vs.find(); g1 = g; * get group of x1;
v = x2; vs.find(); g2 = g; * get group of x2;
if g1 ^= g2 then do;
* merge groups, v of higher group moved to lower group;
gnew = min(g1,g2);
g = max(g1,g2);
gs.find();
vs.replace([v], [v gnew]);
do while (gs.has_next() = 0);
gs.find_next();
vs.replace([v], [v gnew]);
end;
gs.removeall();
end;
* put 'add00' x1 $char1. x2 $char1. ' ' g g1 g2;
end;
method run();
declare int e1 e2;
declare char(2) f;
set customer;
if not missing(id1) and not missing(id2);
e1 = vs.check([id1]);
e2 = vs.check([id2]);
select (cats(e1^=0,e2^=0));
when ('11') add11(id1,id2);
when ('10') add10(id1,id2);
when ('01') add01(id1,id2);
when ('00') add00(id1,id2);
otherwise stop;
end;
end;
method term();
vs.output('vs');
gs.output('gs');
end;
run;
quit;

Proc report - Call Define to change format of the second row under GROUPING variable?

I want to apply a pre-defined format to several columns, but only for one value of VARIABLE. The problem is that this value has two subgroups, Left and Right, and my code only changes the format for the first subgroup (Left), not the second one (Right). I want to apply the same format to the second subgroup as well.
Here is my code:
DATA have;
INPUT subject $ variable $ parameter $ V1-V6;
DATALINES;
A-001 qAF Left 1 2 3 4 5 6
A-001 qAF Right 1 2 3 4 5 6
A-001 Cortical Left 1 1 1 1 1 1
A-001 Cortical Right 1 2 1 1 1 1
A-001 Posterial Left 1 1 1 2 1 1
A-001 Posterial Right 1 1 1 1 1 3
;
RUN;
PROC FORMAT;
VALUE cort
1 = 'C1'
2 = 'C2';
RUN;
PROC REPORT DATA = have;
COLUMNS subject variable parameter V1 V2 V3 V4 V5 V6 dummy;
DEFINE subject / ORDER;
DEFINE variable / ORDER;
DEFINE dummy / COMPUTED NOPRINT;
COMPUTE dummy;
IF variable = 'Cortical' THEN DO;
DO i = 4 TO 9;
CALL DEFINE (i, 'format', 'cort.');
END;
END;
ENDCOMP;
COMPUTE AFTER variable;
LINE ' ';
ENDCOMP;
OPTIONS missing = '';
RUN;
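You need to hold the value of VARIABLE. For an ORDER variable, PROC REPORT supplies the value to compute blocks only on the first row of each group; on the following rows it is blank, so the test variable = 'Cortical' fails for the Right row. Save the value in a temporary variable in a COMPUTE BEFORE block and test that variable instead: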
PROC REPORT DATA = have;
COLUMNS subject variable parameter V1 V2 V3 V4 V5 V6 dummy;
DEFINE subject / ORDER;
DEFINE variable / ORDER;
DEFINE dummy / COMPUTED NOPRINT;
compute before variable;
hold=variable;
endcomp;
COMPUTE dummy;
IF hold = 'Cortical' THEN DO;
DO i = 4 TO 9;
CALL DEFINE (i, 'format', 'cort.');
END;
END;
ENDCOMP;
COMPUTE AFTER variable;
LINE ' ';
ENDCOMP;
OPTIONS missing = '';
RUN;

SAS concatenate in SAS Data Step

I don't know how to describe this question but here is an example. I have an initial dataset looks like this:
input first second $3.;
cards;
1 A
1 B
1 C
1 D
2 E
2 F
3 S
3 A
4 C
5 Y
6 II
6 UU
6 OO
6 N
7 G
7 H
...
;
I want an output dataset like this:
input first second $;
cards;
1 "A,B,C,D"
2 "E,F"
3 "S,A"
4 "C"
5 "Y"
6 "II,UU,OO,N"
7 "G,H"
...
;
Both tables will have two columns. The values in the column "first" can range from 1 to any number.
Can someone help me?
Something like the following should work:
proc sort data=have;
by first second;
run;
data want(rename=(b=second));
length new_second $50.;
do until(last.first);
set have;
by first second ;
new_second =catx(',', new_second, second);
b=quote(strip(new_second));
end;
drop second new_second;
run;
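The output is shown below. Because the data were sorted by first and second, the values within each group come out in sorted order (for example "A,S" rather than "S,A"):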
first second
1 "A,B,C,D"
2 "E,F"
3 "A,S"
4 "C"
5 "Y"
6 "II,N,OO,UU"
7 "G,H"
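If the original order within each group matters, one option is to sort by first only. A minimal sketch, assuming the have dataset with variables first and second from the question, and relying on PROC SORT's default EQUALS behavior, which keeps the relative order of observations within a BY group; the names have_sorted and want_ordered are hypothetical.

```sas
/* Sort by the group variable only, so rows keep their original order within a group */
proc sort data=have out=have_sorted;
  by first;
run;

data want_ordered(rename=(b=second));
  length new_second $50;
  /* one pass of this loop reads a whole group */
  do until (last.first);
    set have_sorted;
    by first;
    new_second = catx(',', new_second, second);
  end;
  /* quote the finished list once per group */
  b = quote(strip(new_second));
  drop second new_second;
run;
```

With this variant, group 3 should come out with S before A, as in the requested output.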
You can use by-group processing and the retain statement to achieve this.
Create a sample dataset:
data have;
input id value $3.;
cards;
1 A
1 B
1 C
1 D
2 E
2 F
3 S
3 A
4 C
5 Y
6 II
6 UU
6 OO
6 N
7 G
7 H
;
run;
First ensure that your dataset is sorted by your id variable:
proc sort data=have;
by id;
run;
Then use the first. and last. automatic variables to identify when the id variable has just changed or is about to change. The retain statement tells the data step to keep the value of concatenated_value across observations rather than resetting it to blank. The catx() function performs the actual concatenation, separating the values with a comma. On the last record of each group, cats() strips the padding and quote() wraps the result in " characters before the record is output.
data want;
length concatenated_value $500.;
set have;
by id;
retain concatenated_value ;
if first.id then do;
concatenated_value = '';
end;
concatenated_value = catx(',', concatenated_value, value);
if last.id then do;
concatenated_value = quote(cats(concatenated_value));
output;
end;
drop value;
run;
Output:
concatenated_
value id
"A,B,C,D" 1
"E,F" 2
"S,A" 3
"C" 4
"Y" 5
"II,UU,OO,N" 6
"G,H" 7

SAS for following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution, which is very clumsy. I simply sort copies of the data set D1 three times (e.g. by ID and ATR1) and remove duplicates, then merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;

How to get only last 4 WORKING days data in SAS?

I'm trying to pull only the last 4 working days of data in SAS. I tried the following code, but I'm not getting the result I intended.
data input;
Input id $ id1 $ id2 $ num date date9.;
Format Date Date9.;
datalines;
x y z 3 19JUL2015
x y z 2 18JUL2015
x y z 3 17JUL2015
x y z 2 16JUL2015
x y z 3 15JUL2015
x y z 2 14JUL2015
x y z 3 13JUL2015
a b c 1 12JUL2015
a b c 1 11JUL2015
a b c 1 10JUL2015
a b c 1 09JUL2015
a b c 1 08JUL2015
a b c 2 07JUL2015
x y z 1 06JUL2015
;
Run;
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
*if intck('weekday',Date,today()) >4;
if 1<Weekday(Date)<7 and Date>=today()-4;
Run;
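I think you need to reverse the comparison in your commented-out INTCK test (le 4 rather than >4) and add a condition that keeps only weekdays. The example hard-codes '20JUL2015'd in place of today() so that it matches the sample data: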
Data test;
Set input;
Weekday=Weekday(Date);
intck=intck('weekday',Date,today());
if intck('weekday',Date,'20JUL2015'd) le 4 and 1<weekday(Date)<7;
*if 1<Weekday(Date)<7 and Date>='20JUL2015'd-5;
Run;
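An alternative sketch computes the cutoff date with INTNX instead of counting intervals with INTCK. It assumes the input dataset from the question and, like the answer above, uses the fixed reference date '20JUL2015'd where real code would use today(); the names test_cutoff, refdate and cutoff are hypothetical.

```sas
data test_cutoff;
  set input;
  refdate = '20JUL2015'd;                  /* replace with today() in real use */
  cutoff  = intnx('weekday', refdate, -4); /* start of the weekday interval four back */
  format refdate cutoff date9.;
  /* keep Monday-Friday rows on or after the cutoff */
  if date >= cutoff and 1 < weekday(date) < 7;
run;
```

Whether the boundary day should be included depends on how "last 4 working days" is meant, so it is worth checking the result against a few known dates.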