Complex data restructure issue in SAS

I have a data set of card history as below. Each customer may have applied for one or multiple cards on the same day. However, for various reasons, their cards get replaced. Card Issue Date is when a card is issued. New Card ID is the ID of the replacement card. For example, customer A's card was first issued on 2/1/2017 with card ID 1234. He later lost his card, and a new card (1235) was issued on 5/2/2017.
Customer ID First Issue Date Card Issue Date Card ID New Card ID
A 2/1/2017 2/1/2017 1234 1235
A 2/1/2017 5/2/2017 1235  
B 5/2/2017 5/2/2017 1245 1248
B 5/2/2017 5/2/2017 1236 1249
B 5/2/2017 10/3/2017 1248 1250
B 5/2/2017 5/3/2017 1249 1251
B 5/2/2017 10/4/2017 1250  
B 5/2/2017 5/4/2017 1251
 
What I want is to group each original card and all of its replacements together. For example, customer B applied for two cards on 5/2/2017. Card IDs 1245, 1248 and 1250 are in one group (Seq No 1) and Card IDs 1236, 1249 and 1251 are in another group (Seq No 2).
Customer ID Open Date Card Issue Date Card ID Seq No
A 2/1/2017 2/1/2017 1234 1
A 2/1/2017 5/2/2017 1235 1
B 5/2/2017 5/2/2017 1245 1
B 5/2/2017 10/3/2017 1248 1
B 5/2/2017 10/4/2017 1250 1
B 5/2/2017 5/2/2017 1236 2
B 5/2/2017 5/3/2017 1249 2
B 5/2/2017 5/4/2017 1251 2
Please help me with this data transformation.
Here is the data step for the input file:
data test;
    infile datalines dsd truncover;
    input Customer :$1.
          First_Issue_Date :ddmmyy10.
          Card_Issue_Date :ddmmyy10.
          Card_ID :$4.
          New_Card_ID :$4.;
    format First_Issue_Date Card_Issue_Date ddmmyy10.;
    datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;

The DATA step hash object is very effective for traversing paths in identity-tracked data. Presuming every Card_ID is unique over all customers, and each New_Card_ID value has a corresponding Card_ID value in the data set, the following code will assign a unique path ID to each chain of reissues.
data paths(keep=card_id path_id);
    if 0 then set test;  * prep PDV;
    call missing(Path_ID);

    * for tracking the tip of each card_id trail;
    declare hash currentCard(hashexp:9);
    currentCard.defineKey('Card_ID');
    currentCard.defineData('Card_ID', 'Path_ID');
    currentCard.defineDone();

    * for tracking everything up to the tip (the replaced segments);
    declare hash replacedCard(hashexp:10);
    replacedCard.defineKey('New_Card_ID');
    replacedCard.defineData('Card_ID');
    replacedCard.defineDone();

    * fill the two hashes;
    do until (lastrow);
        set test(keep=Card_ID New_Card_ID) end=lastrow;
        if missing(New_Card_ID) then Path_ID + 1;
        if missing(New_Card_ID)
            then currentCard.add();
            else replacedCard.add();
    end;

    * for each tip of a path, output the tip and all its segments;
    declare hiter tipIter('currentCard');
    do while (tipIter.next() = 0);
        output;  * tip;
        do while (replacedCard.find(key:Card_ID) = 0);
            output;  * segment;
        end;
    end;
    stop;
run;
If you really need Seq = 1..N within each customer, you will have to do some additional sorting and merging.
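A minimal sketch of that extra step, assuming the test input and the paths output above: attach path_id to the detail rows, then renumber the paths within each customer.
proc sql;
    create table detail as
    select t.*, p.path_id
    from test as t
         inner join paths as p
         on t.card_id = p.card_id
    order by customer, path_id, card_issue_date;
quit;

data want(drop=path_id);
    set detail;
    by customer path_id;
    if first.customer then Seq_No = 0;  * restart numbering per customer;
    if first.path_id then Seq_No + 1;   * bump at each new path;
run;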
My NESUG 2009 paper "Using HASH to find a sum over a transactional path" has a similar discussion about linked transactions.

What you are looking for is a connected-component analysis. If you have it licensed, PROC OPTNET can give you what you want.
Unfortunately, it doesn't support a BY statement, so you will have to generate the sequence number after you use it to group the cards.
First, create the "from"/"to" link data from your card data.
data nodes;
    set test;
    /* PROC OPTNET link data just needs from/to variables */
    from = card_id;
    to = new_card_id;
    if missing(to) then to = from;  * self-link so a never-replaced card still appears;
run;
Then run the analysis.
proc optnet data_links=nodes out_nodes=nodes_out;
    concomp;
run;
This generates a list of cards and their group (variable concomp).
Join that group back to the original data and sort it.
proc sql noprint;
    create table want as
    select a.customer,
           a.First_Issue_Date,
           a.Card_Issue_Date,
           a.Card_ID,
           b.concomp
    from test as a
         left join
         nodes_out as b
         on a.card_id = b.node
    order by customer, concomp, Card_Issue_Date;
quit;
Now the groups are just ordered 1, 2, ..., N. You can use a data step to take that information and create seq_no.
data want(drop=concomp);
    set want;
    by customer concomp;
    retain seq_no;
    if first.customer then seq_no = 0;
    if first.concomp then seq_no = seq_no + 1;
run;

Related

How to make a table (with PROC REPORT or a data step) of a grouped variable where different columns contain counts of different variables?

Could you please give some advice on how to calculate different counts in different columns when grouping by a certain variable with PROC REPORT (if it is possible)?
I have copied an example and the solution below to better illustrate what I want to achieve. I can build this table in SQL by grouping each subset individually (with WHERE statements, for example WHERE Building_code = 'A') and then joining them into one table, but it gets a little bit long, especially when I want to add more columns. Is there a way to define this in PROC REPORT or some shorter data step query? If yes, can you give a short example please?
Example:
Solution:
Thank you for your time.
This should work. There is absolutely no need to do this by joining multiple tables.
data have;
    input Person_id Country :$15. Building_code $ Flat_code $ age_category $;
    datalines;
1000 England A G 0-14
1001 England A G 15-64
1002 England A H 15-64
1003 England B H 15-64
1004 England B J 15-64
1005 Norway A G 15-64
1006 Norway A H 65+
1007 Slovakia A G 65+
1008 Slovakia B H 65+
;
run;
This is a solution in PROC SQL. It's not really long or complicated. I don't think you could do it any shorter using a data step.
proc sql;
    create table want as
    select distinct country,
           sum(Building_code = 'A') as A_buildings,
           sum(Flat_code = 'G') as G_flats,
           sum(age_category = '15-64') as adults
    from have
    group by country;
quit;
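Since the question also asked about PROC REPORT, here is one possible sketch: build 0/1 indicator variables in a short data step (a logical comparison in SAS evaluates to 1 or 0) and let PROC REPORT sum them per country. The names have_flags, A_buildings, G_flats and adults are illustrative, not from the answer above.
data have_flags;
    set have;
    A_buildings = (Building_code = 'A');    * 1 when the building code is A, else 0;
    G_flats     = (Flat_code = 'G');
    adults      = (age_category = '15-64');
run;

proc report data=have_flags nowd;
    column country A_buildings G_flats adults;
    define country     / group;
    define A_buildings / analysis sum;
    define G_flats     / analysis sum;
    define adults      / analysis sum;
run;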

SAS proc sql inner join without duplicates

I am struggling to join two tables without creating duplicate rows using PROC SQL (not sure if any other method is more efficient).
The inner join is on: datepart(table1.date) = datepart(table2.date) AND tag = tag AND ID = ID.
I think the problem is the date and the different names in table 1. Just by looking at the tables, it is clear that table1's row 1 should be joined with table2's row 1, because the transaction started at 00:04 in table 1 and finished at 00:06 in table 2. The issue I am having is that I can't join on dates with the timestamp, so I am removing the timestamps, and because of that duplicates are created.
Table1:
id tag date amount name_x
1 23 01JUL2018:00:04 12 smith ltd
1 23 01JUL2018:00:09 12 anna smith
table 2:
id tag ref amount date
1 23 19 12 01JUL2018:00:06:00
1 23 20 12 01JUL2018:00:10:00
Desired output:
id tag date amount name_x ref
1 23 01JUL2018 12 smith ltd 19
1 23 01JUL2018 12 anna smith 20
Appreciate your help.
Thanks!
You need to set a boundary for that datetime join. You are correct about why you are getting duplicates. I would guess the lower bound is the previous datetime, if it exists, and the upper bound is this record's datetime.
As an aside, this is poor database design on someone's part...
Let's first sort table2 by id, tag, and date:
proc sort data=table2 out=temp;
    by id tag date;
run;
Now write a data step to add the previous date for unique id/tag combinations.
data temp;
    set temp;
    by id tag;
    format low_date datetime20.;
    retain p_date;
    if first.tag then p_date = 0;
    low_date = p_date;
    p_date = date;
run;
Now update your join to use the date range.
proc sql noprint;
    create table want as
    select a.id, a.tag, a.date, a.amount, a.name_x, b.ref
    from table1 as a
         inner join
         temp as b
         on a.id = b.id
        and a.tag = b.tag
        and b.low_date < a.date <= b.date;
quit;
If my understanding is correct, you want to merge by ID, tag and the closest pair of dates; that is, 01JUL2018:00:04 in table1 is closest to 01JUL2018:00:06:00 in table2, and 01JUL2018:00:09 to 01JUL2018:00:10:00. You could try this:
data table1;
    input id tag date :datetime21. amount name_x $15.;
    format date datetime21.;
    cards;
1 23 01JUL2018:00:04 12 smith ltd
1 23 01JUL2018:00:09 12 anna smith
;

data table2;
    input id tag ref amount date :datetime21.;
    format date datetime21.;
    cards;
1 23 19 12 01JUL2018:00:06:00
1 23 20 12 01JUL2018:00:10:00
;
proc sql;
    select a.*, b.ref
    from table1 as a
         inner join table2 as b
         on a.id = b.id and a.tag = b.tag
    group by a.id, a.tag, a.date
    having abs(a.date - b.date) = min(abs(a.date - b.date));
quit;
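If you also want the date displayed without its time component, as in the desired output, one option (an addition to the answer above, not part of it) is to wrap the selected column in DATEPART; trans_date is an illustrative alias chosen to avoid clashing with the datetime column date.
proc sql;
    create table want as
    select a.id, a.tag,
           datepart(a.date) as trans_date format=date9., /* keep only the date part */
           a.amount, a.name_x, b.ref
    from table1 as a
         inner join table2 as b
         on a.id = b.id and a.tag = b.tag
    group by a.id, a.tag, a.date
    having abs(a.date - b.date) = min(abs(a.date - b.date));
quit;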

Aggregating using proc means SAS

For a project, I have a large dataset of 1.5m car loan entries that I am looking to aggregate by some constraint variables such as:
Country, Currency, ID, Fixed or Floating, Performing, Initial Loan Value, Car Type, Car Make
I am wondering if it is possible to aggregate the data by summing the initial loan value and condensing the rows that share the same values of the other variables into one row, so that I turn the first dataset into the second.
Country Currency ID Fixed_or_Floating Performing Initial_Value Current_Value
data have;
    input country $ currency $ ID Fixed $ performing $ initial current;
    datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
data want;
    input country $ currency $ ID Fixed $ performing $ initial current;
    datalines;
UK GBP 1 Fixed Performing 410 150
UK GBP 1 Floating Performing 275 170
UK GBP 1 Fixed Non-performing 220 80
;
run;
Essentially I am looking for a way to sum the numeric values while concatenating the character variables.
I've tried this code:
proc means data=have sum;
    var initial current;
    by country currency id fixed performing;
run;
Unsure if I'll have to use PROC SQL (which might be too slow for such a large dataset) or possibly a data step.
Any help in concatenating would be appreciated.
Create an output data set from PROC MEANS and concatenate the variables in the result. MEANS with a BY statement requires sorted data; your have is not.
Concatenating the aggregation key (those lovely categorical variables) into a single space-separated key (not sure why you need to do that) can be done with the CATX function.
data have_unsorted;
    length country $2 currency $3 id 8 type $8 evaluation $20 initial current 8;
    input country currency ID type evaluation initial current;
    datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
Way 1 - MEANS with CLASS/WAYS/OUTPUT, post process with data step
The cardinality of the class variables may cause problems.
proc means data=have_unsorted noprint;
    class country currency ID type evaluation;
    ways 5;
    output out=sums sum(initial current)= / autoname;
run;

data want;
    set sums;
    key = catx(' ', country, currency, ID, type, evaluation);
    keep key initial_sum current_sum;
run;
Way 2 - SORT followed by MEANS with BY/OUTPUT, post process with data step
BY statement requires sorted data.
proc sort data=have_unsorted out=have;
    by country currency ID type evaluation;
run;

proc means data=have noprint;
    by country currency ID type evaluation;
    output out=sums sum(initial current)= / autoname;
run;

data want;
    set sums;
    key = catx(' ', country, currency, ID, type, evaluation);
    keep key initial_sum current_sum;
run;
Way 3 - MEANS, given data that is grouped but unsorted, with BY NOTSORTED/OUTPUT, post process with data step
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
proc means data=have_unsorted noprint;
    by country currency ID type evaluation NOTSORTED;
    output out=sums sum(initial current)= / autoname;
run;

data want;
    set sums;
    key = catx(' ', country, currency, ID, type, evaluation);
    keep key initial_sum current_sum;
run;
Way 4 - DATA Step, DOW loop, BY NOTSORTED and key construction
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
data want_way4;
    do until (last.evaluation);
        set have;
        by country currency ID type evaluation NOTSORTED;
        initial_sum = sum(initial_sum, initial);
        current_sum = sum(current_sum, current);
    end;
    key = catx(' ', country, currency, ID, type, evaluation);
    keep key initial_sum current_sum;
run;
Way 5 - DATA step hash
Data can be processed without a presort or clumping. In other words, the data can be totally disordered. (Add ordered:'a' to the hash declaration if you want the output data set sorted by key.)
data _null_;
    length key $50 initial_sum current_sum 8;
    if _n_ = 1 then do;
        call missing(key, initial_sum, current_sum);
        declare hash sums();
        sums.defineKey('key');
        sums.defineData('key', 'initial_sum', 'current_sum');
        sums.defineDone();
    end;
    set have_unsorted end=end;
    key = catx(' ', country, currency, ID, type, evaluation);
    rc = sums.find();
    initial_sum = sum(initial_sum, initial);
    current_sum = sum(current_sum, current);
    sums.replace();
    if end then sums.output(dataset:'want_way5');
run;
1.5m entries is not a very big dataset. Sort the dataset first:
proc sort data=have;
    by country currency id fixed performing;
run;

proc means data=have sum;
    var initial current;
    by country currency id fixed performing;
    output out=sum(drop=_:) sum(initial)=Initial sum(current)=Current;
run;
Props to Paige Miller:
proc summary data=testa nway;
    var net_balance;
    class ID fixed_or_floating performing_status initial country currency;
    output out=sumtest sum=sum_initial;
run;

SAS - Calculate Top Percent of Population

I am trying to seek some validation; this may be trivial for most, but I am by no means an expert at statistics. I am trying to select the patients in the top 1% by score within each drug and place. The data would look something like this (on a much larger scale):
Patient drug place score
John a TX 12
Steven a TX 10
Jim B TX 9
Sara B TX 4
Tony B TX 2
Megan a OK 20
Tom a OK 10
Phil B OK 9
Karen B OK 2
The code snippet I have written to calculate those top 1% patients is as follows:
proc sql;
    create table example as
    select *,
           score / avg(score) as test_measure
    from prior_table
    group by drug, place
    having test_measure > .99;
quit;
Does this achieve what I am trying to do, or am I going about it all wrong? Sorry if this is really trivial to most.
Thanks
There are multiple ways to calculate and estimate a percentile. A simple way is to use PROC SUMMARY
proc summary data=have;
    var score;
    output out=pct p99=p99;
run;
This will create a data set named pct with a variable p99 containing the 99th percentile.
Then filter your table for values >=p99
proc sql noprint;
    create table want as
    select a.*
    from have as a
    where a.score >= (select p99 from pct);
quit;
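Note that this computes a single overall 99th percentile, while the question asks for the top 1% within each drug and place. A sketch of the grouped version, assuming the same have data: add a CLASS statement so PROC SUMMARY produces one p99 per drug/place combination, then join on those keys.
proc summary data=have nway;
    class drug place;  * one output row per drug/place combination;
    var score;
    output out=pct p99=p99;
run;

proc sql noprint;
    create table want as
    select a.*
    from have as a
         inner join pct as b
         on a.drug = b.drug and a.place = b.place
    where a.score >= b.p99;
quit;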

How to calculate quantile data for table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to calculate which states make up the top third of all enrollments. So the top third should include at least a total cumulative percentage of .33333...
I have tried various ways of specifying cumulative percentages between .33333 and .40000, but with no success in the general case. PROC RANK also can't be used, because the data is organized as a frequency table.
I have included some dummy (but representative) data below.
data state_counts;
    input state $20. enrollment;  * state name occupies columns 1-20;
    cards;
CALIFORNIA          440233
TEXAS               318921
NEW YORK            224867
FLORIDA             181517
ILLINOIS            162664
PENNSYLVANIA        155958
OHIO                141083
MICHIGAN            124051
NEW JERSEY          117131
GEORGIA             104351
NORTH CAROLINA      102466
VIRGINIA            93154
MASSACHUSETTS       80688
INDIANA             75784
WASHINGTON          73764
MISSOURI            73083
MARYLAND            73029
WISCONSIN           72443
TENNESSEE           71702
ARIZONA             69662
MINNESOTA           66470
COLORADO            58274
ALABAMA             54453
LOUISIANA           50344
KENTUCKY            49595
CONNECTICUT         47113
SOUTH CAROLINA      46155
OKLAHOMA            43428
OREGON              42039
IOWA                38229
UTAH                36476
KANSAS              36469
MISSISSIPPI         33085
ARKANSAS            32533
NEVADA              27545
NEBRASKA            24571
NEW MEXICO          22485
WEST VIRGINIA       21149
IDAHO               20596
NEW HAMPSHIRE       19121
MAINE               18213
HAWAII              16304
RHODE ISLAND        13802
DELAWARE            12025
MONTANA             11661
SOUTH DAKOTA        11111
VERMONT             10082
ALASKA              9770
NORTH DAKOTA        9614
WYOMING             7457
DIST OF COLUMBIA    6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
    create table dummy_3 as
    select state,
           enrollment,
           sum(enrollment) as total_enroll,
           enrollment / calculated total_enroll as percent_total
    from state_counts
    order by percent_total desc;
quit;
data dummy_4;
    set dummy_3;
    if first.percent_total then cum_percent = 0;
    cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily compute cumulative percentages using PROC FREQ with a WEIGHT statement and then select the states in the top third using the LAG function (LAG looks at the previous row's cumulative percent, so the state that pushes the total past one third is still flagged):
proc freq data=state_counts noprint order=data;
    tables state / out=state_counts2;
    weight enrollment;
run;

data top3rd;
    set state_counts2;
    cum_percent + percent;
    if lag(cum_percent) < 100/3 then top_third = 1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
    value pctile
        low - 0.33333  = 'top third'
        0.33333 <- 0.4 = 'next bit'
        0.4 <- high    = 'the rest'
    ;
run;

options fmtsearch=(work);
And add a statement at the end of your data step:
pctile_flag = put(cum_percent,pctile.);
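In context, the whole step might look like this sketch (assuming the dummy_3 table created earlier in the question):
data dummy_4;
    set dummy_3;
    cum_percent + percent_total;              * running cumulative share;
    pctile_flag = put(cum_percent, pctile.);  * bucket label from the format;
run;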
Rewrite your last data step like this:
data dummy_4(drop=found);
    set dummy_3;
    retain cum_percent 0 found 0;
    cum_percent + percent_total;
    if cum_percent < (1/3) then do;
        top_third = 1;
    end;
    else if ^found then do;
        top_third = 1;
        found = 1;
    end;
    else top_third = 0;
run;
Note: your first. syntax is incorrect; first. and last. flags exist only for variables listed in a BY statement. You still get the right values in cum_percent because of the cum_percent + percent_total; sum statement.
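For illustration only, here is a minimal sketch with hypothetical grouped data showing how first./last. behave once a BY statement is present:
data regions;
    input region $ state $ n;
    datalines;
East NY 3
East NJ 2
West CA 5
West OR 1
;

proc sort data=regions;
    by region;
run;

data region_totals;
    set regions;
    by region;                      * enables first.region and last.region;
    if first.region then total = 0; * reset at the start of each region;
    total + n;
    if last.region then output;     * one row per region with its total;
run;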
I am not aware of a PROC that will do this for you.