I am merging two SAS datasets by ID number and would like to remove all instances of duplicate IDs, i.e., if an ID number occurs more than once in the merged dataset, then all observations with that ID will be deleted.
Web searches have suggested some SQL methods and NODUPKEY, but these are not working because they are designed for typical duplicate cleansing, where one instance is kept and only the extra copies are deleted.
Assuming you are using a DATA step with a BY id; statement, then adding:
if NOT (first.id and last.id) then delete;
should do it. If that doesn't work, please show your code.
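For context, here is a minimal sketch of the full step (dataset names a and b are placeholders for your two inputs, both of which must be sorted by id):
data merged;
  merge a b;
  by id;
  /* first.id and last.id are both 1 only when an id has exactly one record */
  if not (first.id and last.id) then delete;
run;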
I'm actually a fan of writing dropped records to a separate dataset so you can track how many records were dropped at different points. So I would code this something like:
data want
     drop_dups
     ;
  merge a b ;
  by id ;
  if first.id and last.id then output want ;
  else output drop_dups ;
run ;
Here is an SQL way to do it. You can use a left, right, or inner join, whichever best suits your needs. Note that this works on a single dataset just as well.
proc sql;
create table singles as
select * from dataset1 a inner join dataset2 b
on a.ID = b.ID
group by a.ID
having count(*) = 1;
quit;
For example, from
ID x
5 2
5 4
1 6
2 7
3 6
you will select
ID x
1 6
2 7
3 6
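Since the HAVING clause is what does the filtering, the single-dataset form (a sketch, with have as a placeholder table name) simply drops the join:
proc sql;
  create table singles as
  select *
  from have
  group by ID
  having count(*) = 1;
quit;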
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns: account#, sales, transactions, store#) and table b (columns: account#, account zipcode), plus an Excel file (columns: store# and store zipcode). I need to first join the two tables on column account#.
The next step is to join their resulting values with the Excel file on column store#, and also to add a column called 'distance', which calculates the distance between account zipcode and store zipcode with the help of the zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next, I want to use a CASE expression to create a distance-bucket column based on the distance from the above query, e.g.:
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea left join xyzstore.fileb
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15 - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
) as e
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
The Google search has been difficult for this. I have two categorical variables, age and month, with 7 levels each. For a few levels, say age = 7 and month = 7, there is no value, and when I use PROC SQL the intersections that have no entries do not show, e.g.:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
(the age=7, month=7 row does not show)
What I want:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
This happens a few times in the data, where the last groups don't have values and so don't show, but I'd like them to appear for later purposes.
You have a few options available; most work on the premise of creating the master set of combinations and then merging it in.
One is to use PRELOADFMT with FORMATs, or the CLASSDATA= option.
The last, but possibly the easiest: if all months and all ages appear somewhere in the data set, use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
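If you prefer the CLASSDATA-style route, here is a sketch with PROC MEANS instead (COMPLETETYPES forces every combination of the CLASS levels found in the data; note that absent cells get a missing sum rather than 0, so this approximates rather than replicates the SPARSE behaviour):
proc means data=have noprint completetypes;
  class age month;
  var value;
  output out=want_means(drop=_type_ _freq_) sum=value;
run;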
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
The group-independent crossing of the classification variables requires a distinct selection of each level variable, cross joined with the others -- this forms a hull that can be left joined to the original data. For the case of an age*month combination having more than one row, you need to determine whether you want:
rows with repeated age and month and the original value, or
rows with distinct age and month, with either
an aggregate function to summarize the values, or
an indication of too many values
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
;
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
run;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.
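A sketch of that TABULATE route, where combos is a hypothetical dataset holding every age*month pair you want forced into the output:
proc tabulate data=have classdata=combos;
  class age month;
  var value;
  table age*month, value*sum;
run;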
I have consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations if more than five zeros occur in spending, that is, if the household did not make any purchase for five weeks. Once identified, I will delete all observations associated with the household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think PROC SQL would be good here.
This could be done in a single step with a more complex subquery, but it is probably better to break it down into two steps:
1. Count how many zeroes each household ID has.
2. Filter to only include household IDs that have five or fewer zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
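For reference, the single-step version mentioned above would look something like this (same logic, just nested):
proc sql;
  create table wanted as
  select *
  from original_data
  where household_id in
    (select household_id
     from original_data
     group by household_id
     having sum(case when spending = 0 then 1 else 0 end) <= 5);
quit;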
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data exclude;
set sorted_data;
by household_id;
array prev{*} _L1-_L4;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
if spend_cnt >= 5 then do;
if spending + sum(of prev) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'exclude' dataset.
proc sql;
create table wanted as
select *
from original_data
where household_id not in (select distinct household_id from exclude);
quit;
Looking for ways of counting distinct entries across multiple columns / variables with PROC SQL, all I am coming across is how to count combinations of values.
However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in any of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want a table containing the number of unique values within C_ID1 and C_ID2 in rows where C = 1.
The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I only have been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?
It is hard to extend this to two or more variables with your method. Try stacking the variables first, then count the distinct values, like this:
proc sql;
create table want as
select count(ID) as nr_distinct_C_IDs from
(select C_ID1 as ID from have
union
select C_ID2 as ID from have)
where not missing(ID);
quit;
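Note that this relies on C_ID1 and C_ID2 being missing whenever C = 0, which holds in the sample data. If that assumption could fail, filter explicitly inside each branch (sketch):
proc sql;
  create table want as
  select count(ID) as nr_distinct_C_IDs from
    (select C_ID1 as ID from have where C = 1
     union
     select C_ID2 as ID from have where C = 1)
  where not missing(ID);
quit;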
I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.
data _null_;
  length ID $3;
  /* hash object holds each distinct ID exactly once */
  declare hash h();
  rc = h.definekey('ID');
  rc = h.definedone();
  array IDs $ C_ID1-C_ID2;
  do until(eof);
    set have(where = (C = 1)) end = eof;
    do i = 1 to dim(IDs);
      if not(missing(IDs[i])) then do;
        ID = IDs[i];
        rc = h.add();              /* rc = 0 only on first occurrence */
        if rc = 0 then COUNT + 1;
      end;
    end;
  end;
  put "Total distinct values found: " COUNT;
run;
All that needs to be done here to accommodate a further variable is to add it to the array.
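For instance, with a hypothetical third column C_ID3, the only change would be:
array IDs $ C_ID1-C_ID3;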
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.
How do I group people who are related, even indirectly? Concretely, using the first two columns of the data set like below, how do I in SAS (maybe using a DATA step or PROC SQL) programmatically derive the third column? Is there a non-iterative algorithm?
Background: Each person has multiple addresses. Through each address, each person is connected to zero or more persons. If two people are connected, they get the same group ID. If person A is directly connected to B and B is connected to C, then persons A, B, and C share a group.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
The general methods for finding all connected components of a graph are breadth-first search (BFS) and depth-first search (DFS). SAS is not the best tool for implementing such algorithms, since they require data structures such as queues.
Still, it can be done with hash objects. Here's the code for BFS.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
run;
Create the adjacency list - all pairs of people with a common address - and an empty variable, cluster, which will be populated later with group IDs:
proc sql;
create table connections as
select distinct a.person_id as person_id_a, b.person_id as person_id_b, . as cluster
from people a
inner join people b
on a.address_id=b.address_id
;
quit;
Here goes the BF-search itself:
data _null_;
Declare hash object and its iterator for all unique people (vertices of the graph):
if 0 then set Connections;
dcl hash V(dataset:'Connections', ordered:'y');
V.defineKey('person_id_a');
V.defineData('person_id_a','cluster');
dcl hiter Vi('V');
V.defineDone();
Declare hash object for all connections (edges of the graph):
dcl hash E(dataset:'Connections', multidata:'y');
E.defineKey('person_id_a');
E.defineData('person_id_a','person_id_b');
E.defineDone();
Declare hash object and its iterator for the queue:
dcl hash Q(ordered:'y');
Q.defineKey('qnum','person_id_a');
Q.defineData('qnum','person_id_a');
dcl hiter Qi('Q');
Q.defineDone();
The outermost loop - for taking a new person without assigned cluster to be a root of the next cluster, when the queue is empty:
rc1=Vi.first();
do while(rc1=0);
if missing(cluster) then do;
qnum=1; Q.add(); *qnum-number of the person in the queue, to ensure that new people are added to the end of the queue.;
n+1; cluster=n;
V.replace();*assign cluster number to a person;
In the following two nested loops, we de-queue the first person in the queue and look for all people connected to that person in the adjacency list. Every connection found is added to the end of the queue. When done with the first person, we delete him/her and de-queue the next one (who has become the first). All of them will be in the same cluster. And so on, until the queue is empty; then we take a new root person for a new cluster.
rc2=Qi.first();
do while(rc2=0);
qnum=qnum+Q.num_items-1;
rc3=E.find();
do while(rc3=0);
person_id_a=person_id_b;
rc4=V.find();
if rc4=0 and missing(cluster) then do;
qnum+1; Q.add();
cluster=n;
V.replace();
end;
rc3=E.find_next();
end;
Qi.first();
Qi.delete();
Q.remove();
Qi=_new_ hiter ('Q');
rc2=Qi.first();
end;
end;
rc1=Vi.next();
end;
Output the list of people with assigned clusters:
V.output(dataset:'clusters');
run;
proc sort data=clusters; by cluster; run;
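For the sample data, this yields three clusters: person 1 alone (address A), persons 2 and 3 (shared address B), and persons 4, 5, and 6 (linked through addresses C and D), matching the household_id column in the question.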
This is a common problem with complex solutions. How complex you need to get depends primarily on the complexity of your data. How often are linkages more than single linkages? I.e., in your example above, C and D are linked by 5; can you have an E that is linked to D by 6? If so, this requires either a different approach or a resolution step.
I show one simple method here. It is a very simplistic solution, but it is sometimes easier to understand and implement. Record linkage is a well-covered subject with a lot of papers available to explore; much better solutions exist that are more able to handle multiple linkage than the one below (which handles 2-level linkage but not further, and has some weaknesses in handling data cross-linkages).
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
6 E 3
7 E 3
8 B 2
;
run;
data links(keep=link:);
set people;
by person_id address_id;
retain link_start;
if first.person_id and not last.person_id then do;
link_start = address_id;
end;
if first.address_id and not first.person_id then do;
link_end = address_id;
output;
end;
run;
data for_fmt;
set links;
start=link_end;
label=link_Start;
retain fmtname '$linkf';
output;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data people_linked;
set people;
new_addressid = put(address_id,$linkf.);
new_addressid = put(new_addressid, $linkf.);
run;
proc sort data=people_linked;
by new_addressid;
run;
data people_final;
set people_linked;
by new_addressid;
if first.new_addressID then
new_householdID+1;
run;
I have been working on a problem that requires a similar thing. I was able to solve it using SAS/OR's PROC OPTNET (the CONCOMP statement). The documentation even brings an example that illustrates the concept very well.
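A rough sketch of that route, reusing the pairwise connections table built in the BFS answer above (requires a SAS/OR license; the option and statement names here are from my reading of the documentation, so treat this as an untested outline):
proc optnet data_links=connections out_nodes=clusters;
  data_links_var from=person_id_a to=person_id_b;
  concomp;
run;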
Thanks,
Murilo