Get the unique combinations per variable in SAS - sas

I am attempting to group by a variable that is not unique with a discrete variable to get the unique combinations per non-unique variable. For example:
A B
1 a
1 b
2 a
2 a
3 a
4 b
4 d
5 c
5 e
I want:
A Unique_combos
1 a, b
2 a
3 a
4 b, d
5 e
My current attempt is something along the lines of:
proc sql outobs=50;
title 'Unique Combinations of b per a';
select a, b
from mylib.mydata
group by distinct a;
run;

If you are happy to use a data step instead of proc sql you can use the retain keyword combined with first/last processing:
Example data:
data have;
attrib b length=$1 format=$1. informat=$1.;
input a
b $
;
datalines;
1 a
1 b
2 a
2 a
3 a
4 b
4 d
5 c
5 e
;
run;
Eliminate duplicates and make sure the data is sorted for first/last processing:
proc sql noprint;
create table tmp as select distinct a,b from have order by a,b;
quit;
Iterate over the distinct list and concatenate the values of b together:
data want;
length combinations $200; * ADJUST TO BE BIG ENOUGH TO STORE ALL THE COMBINATIONS;
set tmp;
by a;
retain combinations '';
if first.a then do;
combinations = '';
end;
combinations = catx(', ',combinations, b);
if last.a then do;
output;
end;
drop b;
run;
Result:
combinations a
a, b 1
a 2
a 3
b, d 4
c, e 5

You just need to put a distinct keyword in the select clause, eg:
title 'Unique Combinations of b per a';
proc sql outobs=50;
select distinct a, b
from mylib.mydata;
The run statement is unnecessary, the sql procedure is normally ended with a quit - although I personally never use it, as the statement will execute upon hitting the semicolon and the procedure quits anyway upon hitting the next step boundary.

Related

values of commun column in A replaced by that in B with function merge in SAS

I want merge two tables, but they have 2 columns in commun, and i do not want value of var1 in A replaced by that in B, if we don't use drop or rename, does anyone know it?
I can fix it with sql but just curious with Merge!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Marge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
but what i like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order such that the datasets whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned it does it in the order that they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two dataset and the BY variables match values then the record from the first is loaded and the record from the second is loaded, overwriting the values read from the first.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A then add a SET statement that the compiler will process but that can never execute before your MERGE statement.
if 0 then set a b ;
If you are doing a 1 to many matching then you might have other trouble since when a dataset stops contributing values to the current BY group then SAS does not re-read the last observation. In that case you will have to use some combination of RENAME=, DROP= or KEEP= dataset options.
In PROC SQL when you have duplicate names for selected columns (and are trying to create an output dataset instead of report) then SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL's.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You perform a selection to get the final result set with out creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping detects the "to_home" calls prior to the "to_business" calls that did NOT occur in the last 24 hours (86,400 seconds)
proc sql;
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.CALL_DT < business.CALL_DT
group by
business.call_dt
having
max(home.call_dt) < business.call_dt - 86400
;
Second way
Perform a NOT existential check, for a to_home call in prior 24hr, for every to_business call.
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.call_dt < business.call_dt
and home.call_dt >= business.call_dt - 86400
)
;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for using in comparision */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* This will keep the result of the most recent, in datetime,
HOME/BUS record comparision */
If intck('Hour',hrecord,record,'c') > 24 Then Good_For_Output=1;
Else Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

SAS merge datasets to overwrite values

I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why the 10s in row 2&4 didn't get replaced by 20 from table B? While var2 works as expected?
I know I can do this simply using proc SQL, and that's what I did to solve the problem. But I still quite curious if there is a way to do what I wanted using merge? And why this wasn't working? I prefer merge over SQL in this circumstance because the logic is easier to implement (util I found this not working properly).
I use SAS 9.4.
This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each of A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record. VAR1 gets its value from A (because there is no B).
IF there is record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).
Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function pick the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;
The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criteria for which 'var1' to choose is as simple as 'If b exists, use it' then this is identical to the the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria, plus its always nice to know the source of the data (which doesn't happen with coalesce()).

How to easly reformat dataset in SAS

Suppose a data are as follows:
A B C
1 3 2
1 4 9
2 6 0
2 7 3
where A B and C are the variable names.
Is there a way to transform the table to
A 1
A 1
A 2
A 2
B 3
B 4
B 6
B 7
C 2
C 9
C 0
C 3
Expanding on the advice from #donPablo, here's how you would code it. Create an array to read across the data, then output each iteration of that array so you end up with the number of rows being the rows * columns from the original dataset. The VNAME function enables you to store the variable name (A, B, C) as a value in a separate variable.
data have;
input A B C;
datalines;
1 3 2
1 4 9
2 6 0
2 7 3
;
run;
data want;
set have;
length var1 $10;
array vars{*} _numeric_;
do i=1 to dim(vars);
var1=vname(vars{i});
var2=vars{i};
keep var1 var2;
output;
end;
run;
proc sort data=want;
by var1;
run;
The least amount of (expensive) development time might be --
Read and store the first row
For each subsequent row
Read the row
Create three records
Until end
Sort
How many times will this be run? Per day/ per year?
What number of rows are there?
Might we save 1 hr / month? 1 min / year? Something will need to read the entire file. Optomize last. Make it work first.
tkx
It should work correctly:
DATA A(keep A);
new_var = 'A';
SET your_data;
RUN;
DATA B(keep B);
new_var = 'B';
SET your_data;
RUN;
DATA C(keep C);
new_var = 'C';
SET your_data;
RUN;
PROC APPEND base=A data=B FORCE;
RUN;
PROC APPEND base=A data=C FORCE;
RUN;
Data A is a result data set.

Update one dataset with another without using PROC SQL

I have the below two datasets
Dataset A
id age mark
1 . .
2 . .
1 . .
Dataset B
id age mark
2 20 200
1 10 100
I need the below dataset as output
Output Dataset
id age mark
1 10 100
2 20 200
1 10 100
How to carry out this without using PROC SQL i.e. using DATA STEP?
There are many ways to do this. The easiest is to sort the two data sets and then use MERGE. For example:
proc sort data=A;
by id;
run;
proc sort data=B;
by id;
run;
data WANT;
merge A(drop=age mark) B;
by ID;
run;
The trick is to drop the variables you are adding from the first data set A; the new variables will come from the second data set B.
Of course, this solution does not preserve the original order of the observations in your data set AND only works because your second data set contains unique values of id.
I tried this and it worked for me, even if you have data you would like to preserve in that column. Just for completeness sake I added an SQL variant too.
data a;
input id a;
datalines;
1 10
2 20
;
data b;
input id a;
datalines;
1 .
1 5
1 .
2 .
3 4
;
data c (drop=b);
merge a (rename = (a=b) in=ina) b (in = inb);
by id;
if b ne . then a = b;
run;
proc sql;
create table d as
select a.id, a.a from a right join b on a.id=b.id where a.id is not null
union all
select b.id, b.a from a right join b on a.id = b.id where a.id is null
;
quit;