Lookup to match from one dataset to another

Lookup to match from one dataset to another - sas

I need help multiplying the values from data set 2 to data set 1.
Data#1:
policy#
Risk#
Premium
KOK1
002
150
KOK2
003
130
Data#2:
Source
policy#
Risk#
Item1
ageofbuild
KOK1
002
3
yearofbuild
KOK1
002
5
Discount1
KOK1
002
10%
Discount2
KOK1
002
5%
ageofbuild
KOK2
003
4
yearofbuild
KOK2
003
6
Discount1
KOK2
003
15%
Discount2
KOK2
003
5%
How to set up a formula to match the policy# and risk# from data#1 to data#2 and multiply the "premium" in data#1 with discount1 and discount2 from data#2 if they match the policy# and risk#?

There are two ways to do this. The first way is a simple merge where you merge the dataset by policy and risk #, then perform your calculations. For example:
data want;
merge data2(in=d2)
data1(in=d1);
by policy risk_nbr;
/* If policy and risk_nbr match from data2 and data1, then calculate
a premium */
if(d2 AND d1 AND find(source, 'Discount') ) then value = Premium*Item1;
run;
This is similar to a full join on policy, risk_nbr in SQL, but only multiplying if the two key values match. Note that both datasets must be sorted by policy and risk_nbr for this to work.
The second way is through a hash table lookup, which is one of my favorite ways of doing these small lookup tables. They're really fast.
Think of a hash table as an independent table that floats out in memory. We're going to talk to it using special methods that look up a value in the hash table by a key in our dataset and pull that value down so we can use it. Here's what that looks like.
data want;
/* Only load the hash table once */
if(_N_ = 1) then do;
dcl hash h(dataset: 'data2'); *Add a dataset to a hash table called 'h';
h.defineKey('policy', 'risk'); *Define our lookup key;
h.defineData('premium'); *The value we want to pull;
h.defineDone(); *Load the dataset into `h`;
/* Initialize the numeric variable 'premium' with a missing value
since it does not exist yet. This prevents data step warnings. */
call missing(premium);
end;
/* Look up the value of policy and risk in the set dataset and compare it
with the hash table's value of policy and risk.
If there is a match, rc = 0
*/
rc = h.Find();
if(rc = 0 AND find(source, 'Discount') ) then value = Premium*Item1;
drop rc;
run;
Hash tables are extremely powerful and very fast, especially if you are joining a small table with a large table. You don't need to do any pre-sorting, either.
If you want to learn more about hash tables, check out the paper I cut my processing time by 90% using hash tables - you can too!

Related

PROC SQL -- Attempting to create dummy variable based on value in one column and group in another column

I have three columns indicating the name of a center, the center's state, and whether or not the state associated with the center. I am trying to see if a center is part of a chain that operates only in intervention states, never in intervention states, or in intervention and non-intervention states.
data df;
input $ center state intervention chain;
Center1 CA 1 chain1
Center2 AZ 0 chain1
Center3 PA 0 chain2
Center4 HI 0 chain2
Center5 CA 1 chain3
Center6 CA 1 chain3;
run;
What is the best way to do this? I have tried creating three separate tables for intervention, nonintervention, and both that list the chains by state of operation. I then created dummy variables in each that indicate whether they operate in intervention, nonintervention, or both states. I then merged them back into the original table but there was overlap between the intervention/both tables, as well as overlap between the nonintervention/both tables.

I like a non-sql solution for this; it is doable in SQL, but it's so easy outside of it, comparatively.
data df;
input center $ state $ intervention chain $;
datalines;
Center1 CA 1 chain1
Center2 AZ 0 chain1
Center3 PA 0 chain2
Center4 HI 0 chain2
Center5 CA 1 chain3
Center6 CA 1 chain3
;
run;
proc sort data=df;
by chain intervention;
run;
data want;
do _n_ = 1 by 1 until (last.chain); *first, check to see if it is mixed or int or nonint;
set df;
by chain intervention; *by intervention lets us easily see mixed;
length int_type $16;
if last.intervention and not last.chain then mixed=1; *if we see a row that is the last row for a by-group of intervention, but not chain, then we know this chain is mixed;
if last.chain and mixed=1 then int_type='Mixed';
else if intervention then int_type='Intervention';
else int_type='Non-Intervention';
end;
do _n_ = 1 by 1 until (last.chain); *now run through the data set a second time to put the int_type variable on the dataset;
set df;
by chain;
output;
end;
run;
This is the DoW loop, and iterates twice through the dataset, first checking if the conditions for mixed are met, if so then it marks it mixed, if not it marks it based on the int_type of the last record.
If your dataset is so large that a sort is not desirable, this can be done in other ways without sorting, but this should be relatively fast even with the sort. If it's already sorted by CHAIN but not by CHAIN INTERVENTION, it could be done solely sorted by CHAIN, just with a bit more logic.

Summing a Column By Group In a Dataset With Macros

I have a dataset that looks like:
Month Cost_Center Account Actual Annual_Budget
June 53410 Postage 13 234
June 53420 Postage 0 432
June 53430 Postage 48 643
June 53440 Postage 0 917
June 53710 Postage 92 662
June 53410 Phone 73 267
June 53420 Phone 103 669
June 53430 Phone 90 763
...
I would like to first sum the Actual and Annual columns, respectively and then create a variable where it flags if the Actual extrapolated for the entire year is greater than than Annual column.
I have the following code:
Data Test;
set Combined;
%All_CC; /*MACRO TO INCLUDE ALL COST CENTERS*/
%Total_Other_Expenses;/*MACRO TO INCLUDE SPECIFIC Account Descriptions*/
Sum_Actual = sum(Actual);
Sum_Annual = sum(Annual_Budget);
Run_Rate = Sum_Actual*12;
if Run_Rate > Sum_Annual then Over_Budget_Alarm = 1;
run;
However, when I run this code, it does not sum by group, for example, this is the output I get:
Account_Description Sum_Actual Sum_Annual Run_Rate Over_Budget_Alarm
Postage 13 234 146
Postage 0 432 0
Postage 48 643 963 1
Postage 0 917 0
Postage 92 662 634 1
I'm looking for output where all the 'postage' are summed for Actual and Annual, leaving just one row of data.

Use PROC MEANS to summarize the data
Use a data step and IF/THEN statement to create your flags.
proc means data=have N SUM NWAY STACKODS;
class account;
var amount annual_budget;
ods output summary = summary_stats1;
output out = summary_stats2 N = SUM= / AUTONAME;
run;
data want;
set summary_stats;
if sum_actual > sum_annual_budget then flag=1;
else flag=0;
run;

SAS DATA step behavior is quite complex ("About DATA Step Execution" in SAS Language Reference: Concepts). The default behavior, that you're seeing, is: at the end of each iteration (i.e. for each input row) the row is written to the output data set, and the PDV - all data step variables - is reset.
You can't expect to write Base SAS "intuitively" without spending a few days learning it first, so I recommend using PROC SQL, unless you have a reason not to.
If you really want to aggregate in data step, you have to use something called BY groups processing: after ensuring the input data set is sorted by the BY vars, you can use something like the following:
data Test (keep = Month Account Sum_Actual Sum_Annual /*...your Run_Rate and Over_Budget_Alarm...*/);
set Combined; /* the input table */
by Month Account; /* must be sorted by these */
retain Sum_Actual Sum_Annual; /* don't clobber for each input row */
if first.account then do; /* instead do it manually for each group */
Sum_Actual = 0;
Sum_Annual = 0;
end;
/* accumulate the values from each row */
Sum_Actual = sum(Sum_Actual, Actual);
Sum_Annual = sum(Sum_Annual, Annual_Budget);
/* Note that Sum_Actual = Sum_Actual+Actual; will not work if any of the input values is 'missing'. */
if last.account then do;
/* The group has been processed.
Do any additional processing for the group as a whole, e.g.
calculate Over_Budget_Alarm. */
output; /* write one output row per group */
end;
run;

Proc SQL can be very effective for understanding aggregate data examination. With out seeing what the macros do, I would say perform the run rate checks after outputting data set test.
You don't show rows for other months, but I must presume the annual_budget values are constant across all months -- if so, I don't see a reason to ever sum annual_budget; comparing anything to sum(annual_budget) is probably at the incorrect time scale and not useful.
From the show data its hard to tell if you want to know any of these
which (or if some) months had a run_rate that exceeded the annual_budget
which (or if some) months run_rate exceeded the balance of annual_budget (i.e. the annual_budget less the prior months expenditure)
Presume each row in test is for a single year/month/costCenter/account -- if not the underlying data would have to be aggregated to that level.
Proc SQL;
* retrieve presumed constant annual_budget values from data;
* this information might (should) already exist in another table;
* presume constant annual budget value at each cost center | account combination;
* distinct because there are multiple months with the same info;
create table annual_budgets as
select distinct Cost_Center, Account, Annual_Budget
from test;
create table account_budgets as
select account, sum(annual_budget) as annual_budget
from annual_budgets
group by account;
* flag for some run rate condition;
create table annual_budget_mon_runrate_check as
select
2019 as year,
account,
sum(actual) as yr_actual, /* across all month/cost center */
min (
select annual_budget from account_budgets as inner
where inner.account = outer.account
) as account_budget,
max (
case when actual * 12 > annual_budget then 1 else 0 end
) as
excessive_runrate_flag label="At least one month had a cost center run rate that would exceed its annual_budget")
from
test as outer
group by
year, account;
You can add a where clause to restrict the accounts processed.
Changing the max to sum in the flag computation would return the number of cost center months with excessive run rates.

Extreme values using proc sql

I have a dataset of customers of a grocery store with information including name, customer id, gender, date of birth, customer type, etc.
DOB is in the format of DDMMMYYYY such as 25JAN1990.
Customer types include online shopper, gold member, silver member, etc.
I need to print a table that identifies the youngest and oldest customers for each customer type using PROC SQL. I'm not sure where to even start with this. I have a rough idea of grouping by customer type and using max and min function on dates but I am not sure if that will work or if I can implement it.
PROC SQL;
quit;

I think the important bit of this task is the SQL command having, which allows you to do comparisons withing the select statement.
Since there is no clear beginning data, I've taken liberty to produce some.
data begin;
input ID Test $ date mmddyy10.;
cards;
001 A 09/01/2011
001 A 10/02/2011
001 A 09/12/2012
001 A 10/10/2013
001 B 10/01/2011
001 B 01/01/2012
002 A 10/12/2014
002 A 10/13/2014
002 A 02/02/2015
002 A 11/15/2015
;
run;
proc sql;
select * from begin
group by ID, test /*These are the grouping variables*/
having date=max(date) or date=min(date); /*Conditions that must be fulfilled*/
quit;
More about having and group by: http://www.dofactory.com/sql/having

Defining variables in one table based on values in another table

I'm fairly new with SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002;
Datalines;
000 050
063 052
015 017
997 035;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999;
run;
(In actuality, this table in adjusted by the user in excel and then imported, but this will work for the purposes of troubleshooting).
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
the "Range" variable specifics the upper bound of a grouping. So looking at ranges in the var_grouping table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
the final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.

That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. A soon as we find new number
we generate with CALL EXECUTE three steps: PROC FORMAT, DATA-step to apply this format
to a specific variable, and then PROC CATALOG to delete format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run;");
end;
run;
UPDATE. I've changed the first DATA-step (for creating formatset) so that now end and start for each range is taken from variable range, not Group_Desc. And PROC SORT moved to the beginning of the code.

Dynamic Where in SAS - Conditional Operation based on another set

To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?

I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.

You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for ever line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. They keep two options because there where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;

I don't suggest below solution (would like #Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think below approach would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Lookup to match from one dataset to another - sas

Related

PROC SQL -- Attempting to create dummy variable based on value in one column and group in another column

Summing a Column By Group In a Dataset With Macros

Extreme values using proc sql

Defining variables in one table based on values in another table

Dynamic Where in SAS - Conditional Operation based on another set

Categories

Resources