Apply if and do in SAS to merge a dataset - if-statement

I'm trying to merge a dataset to another table (hist_dataset) by applying one condition.
The dataset that I'm trying to merge looks like this:
Label  week_start  date       Value1  Value2
Ac     09Jan2023   13Jan2023  45      43
The logic I want is as follows:
If the value of week_start in the first record equals the start of the week 14 days from today, merge the dataset into the one I want to append to.
If it does not, do nothing; don't merge the data.
The code I'm using is:
libname out /"path"
data dataset;
set dataset;
by week_start;
if first.week_start = intnx('week.2', today() + 14, 0, 'b') then do;
data dataset;
merge out.hist_dataset dataset;
by label, week_start, date;
end;
run;
But I'm getting 2 Errors:
117 - 185: There was 1 unclosed DO block.
161 - 185: No matching DO/SELECT statement.
Do you know how I can make the program run correctly, or do you know another way to do it?
Thanks,

I cannot make heads or tails of what you are asking, so let me guess at what you are trying to do and answer that.
Let's first make up some dataset and variable names. You have an existing dataset named OLD that has key variables LABEL, WEEK_START and DATE.
Now you have received a NEW dataset that has those same variables.
You want to first subset the NEW dataset to just those observations where the value of DATE is within 14 days of the first value of WEEK_START in the NEW dataset.
data subset ;
  set new;
  retain first_week;
  if _n_=1 then first_week=week_start;
  if date <= first_week+14 ;
  drop first_week; * helper variable, not needed downstream;
run;
You then want to merge that into the OLD dataset.
data want;
merge old subset;
by label week_start date ;
run;
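If the intent was instead to skip the merge entirely unless the first WEEK_START in NEW matches the target week, one option is to set a macro flag in a DATA _NULL_ step and wrap the merge in a macro. A DATA statement inside a DO block starts a new step, which is why the original code reports an unclosed DO block. This is only a sketch under that reading of the question: the macro name merge_if_due and the flag do_merge are made up for illustration, and both datasets are assumed to be sorted by LABEL WEEK_START DATE.

```sas
%macro merge_if_due;
  %local do_merge;
  %let do_merge = 0;   /* stays 0 if NEW is empty */

  /* Look only at the first observation of NEW */
  data _null_;
    set new(obs=1);
    /* Flag is 1 when WEEK_START is the Monday of the week 14 days from today */
    call symputx('do_merge',
                 week_start = intnx('week.2', today() + 14, 0, 'b'));
  run;

  /* Run the merge step only when the flag is set */
  %if &do_merge %then %do;
    data want;
      merge old new;
      by label week_start date;
    run;
  %end;
%mend merge_if_due;

%merge_if_due
```

When the condition is false, no WANT dataset is created and OLD is left untouched.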


Create new variable for a group of records in SAS

I have a dataset with the first 4 columns and I want to create the last column. My dataset has millions of records.
ID  Date      Code  Event of Interest  Want to Create
1   1/1/2022  101   *                  201
1   1/1/2022  201   yes                201
1   1/1/2022  301   *                  201
1   1/1/2022  401   *                  201
2   1/5/2022  101   *                  301
2   1/5/2022  201   *                  301
2   1/5/2022  301   yes                301
I want to group records by ID and date. If one of the records in the grouping has a 'yes' in the event of interest variable, I want to assign that code to the entire grouping. I am using base SAS.
Any ideas?
Assuming that you will only have one yes value for each id and date, you can use a lookup table and merge them together. Here are a few ways to do it.
1. Self-merge
Simply merge the data onto itself where event = yes.
data want;
merge have
have(rename=(code = new_code
event = _event_)
where =(upcase(_event_) = 'YES')
)
;
by id date;
drop _event_;
run;
2. SQL Self-join
Same as above, but using a SQL inner join.
proc sql;
create table want as
select t1.*
, t2.code as new_code
from have as t1
INNER JOIN
have as t2
ON t1.id = t2.id
AND t1.date = t2.date
where upcase(t2.event) = 'YES'
;
quit;
3. Hash lookup table
This is more advanced but can be quite performant if you have the memory. Notice that it looks very similar to our merge statement in Option 1. We're creating a lookup table, loading it to memory, and using a hash join to pull values from that in-memory table. h.Find() will check the unique combination of (id, date) in the value read from the set statement against the hash table in memory. If a match is found, it will pull the value of new_code.
data want;
set have;
if(_N_ = 1) then do;
dcl hash h(dataset: "have(rename=(code= new_code)
where =(upcase(event) = 'YES')
)"
, hashexp:20);
h.defineKey('id', 'date');
h.defineData('new_code');
h.defineDone();
call missing(new_code);
end;
rc = h.Find();
drop rc;
run;
You could just remember the last value of CODE you want for the group by using a double DOW loop.
In the first loop copy the code value to the new variable. The second loop can re-read the observations and write them out with the extra variable filled in.
data want;
do until (last.date);
set have;
by id date ;
if 'Event of Interest'n='yes' then 'Want to Create'n=code;
end;
do until (last.date);
set have;
by id date;
output;
end;
run;

Produce custom table in SAS with a subsetted data set

I want to use SAS and eg. proc report to produce a custom table within my workflow.
Why: Prior, I used proc export (dbms=excel) and did some very basic stats by hand and copied pasted to an excel sheet to complete the report. Recently, I've started to use ODS excel to print all the relevant data to excel sheets but since ODS excel would always overwrite the whole excel workbook (and hence also the handcrafted stats) I now want to streamline the process.
The task itself is actually very straightforward. We have some information about IDs, age, and registration, so something like this:
data test;
input ID $ AGE CENTER $;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
The goal is to produce a table report which should look like this structure-wise:
                  ID    NO-ID  Total
Count             3     2      5
Age (mean)        27    45.5   34.4
Count by Center:
  A               2     1      3
  B               0     1      1
  C               1     0      1
It seems, proc report only takes variables as columns but not a subsetted data set (ID NE .; ID =''). Of course I could just produce three reports with three subsetted data sets and print them all separately but I hope there is a way to put this in one table.
Is proc report the right tool for this and if so how should I proceed? Or is it better to use proc tabulate or proc template or...?
I found a way to get close to what I wanted. First of all, I had to introduce a new variable vID (valid ID: 0 = not valid, 1 = valid) in the data set, like so:
data test;
input ID $ AGE CENTER $;
if ID = '' then vID = 0;
else vID = 1;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
After this I was able to use proc tabulate, as suggested by @Reeza in the comments, to build a table that closely resembles what I initially aimed for:
proc tabulate data = test;
class vID Center;
var age;
keylabel N = 'Count';
table N age*mean Center*N, vID ALL;
run;
Still, I wonder if there is a way to do this without introducing the new variable at all, using only SAS's own counts of missing and non-missing observations.
UPDATE:
@Reeza pointed out that proc format can assign a label to missing and non-missing ID values. Combined with the missing option in proc tabulate (which keeps missing values as a class level), this delivers the output without introducing a new variable:
proc format;
value $id_fmt
' ' = 'No-ID'
other = 'ID'
;
run;
proc tabulate data = test missing;
format ID $id_fmt.;
class ID Center;
var age;
keylabel N = 'Count';
table N age*(mean median) Center*N, (ID=' ') ALL;
run;

SAS Macro help to loop monthly sas datasets

I have monthly datasets in a SAS library for customers from Jan 2013 onwards, named CUST_JAN2013, CUST_FEB2013, ..., CUST_OCT2017. Each monthly dataset is large, with about 2 million members, and has two columns (customer number and customer monthly expenses).
I have one input dataset, Cust_Expense, with customer number and month as columns. It has only 250,000 members, and I want to pull the expense data for each member from the SPECIFIC monthly dataset by joining on customer number.
Cust_Expense
------------
Customer_Number Month
111 FEB2014
987 APR2017
784 FEB2014
768 APR2017
.....
145 AUG2017
345 AUG2014
I have tried CALL EXECUTE, but it loops through each of the 250,000 records of the input dataset (Cust_Expense) and joins each one with the corresponding monthly customer table, which takes far too long.
Is there a way to read the input table (Cust_Expense) by month, so that all customers for a given month are read together and the matching monthly table is read only ONCE, instead of looping 250,000 times?
Depending on what you want the result to be, you can create one output per month by filtering cust_expense by month and joining with the corresponding monthly dataset:
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
create table want_&month. as
select *
from cust_expense(where=(month="&month.")) t1
inner join cust_&month. t2
on t1.customer_number=t2.customer_number
;
%end;
quit;
%mend;
%want;
Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
create table want as
select *
from cust_expense t1
inner join (
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
%if &i>1 %then union;
select *, "&month." as month
from cust_&month
%end;
) t2
on t1.customer_number=t2.customer_number
and t1.month=t2.month
;
quit;
%mend;
%want;
In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.
Your first, best answer is to get rid of these monthly separate tables and make them into one large table with ID and month as key. Then you can simply join on this and go on your way. Having many separate tables like this where a data element determines what table they're in is never a good idea. Then index on month to make it faster.
If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
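A DATA step view is one way to sketch that stacked view. The view name cust_all and the helper variable src are made up for illustration, and the sketch assumes that the monthly tables live in WORK, that they carry a customer_number column, and that month in cust_expense is a character value such as FEB2014.

```sas
/* Stack the monthly tables without copying them; MONTH comes from the table name */
data cust_all / view=cust_all;
  length month $7;
  set cust_jan2013 cust_feb2013 /* list the remaining monthly tables here */
      indsname=src;
  /* src holds LIBREF.MEMBER, e.g. WORK.CUST_JAN2013; keep the part after the last underscore */
  month = scan(src, -1, '_');
run;

/* A single join against the view replaces one join per monthly table */
proc sql;
  create table want as
  select t2.*
  from cust_expense as t1
       inner join cust_all as t2
         on  t1.customer_number = t2.customer_number
         and t1.month = t2.month;
quit;
```

The INDSNAME= variable itself is not written to the output, so the view exposes only the original columns plus month.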
Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.
data want;
set jan feb mar apr ... ;
where put(id,CUSTEXPF1.) = '1';
run;
That only makes one pass through the 250k table and one pass through the monthly tables, plus the very very fast format lookup which is undoubtedly zero cost in this data step (as the disk i/o will be slower).
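Building the format could look roughly like the following. This is a sketch, not the answerer's exact code: the intermediate dataset names cust_ids and cntlin are made up, customer_number is assumed to be numeric, and like the data step above it matches on customer number only, not on month.

```sas
/* PROC FORMAT rejects duplicate ranges, so reduce the small table to distinct IDs */
proc sort data=cust_expense(keep=customer_number) out=cust_ids nodupkey;
  by customer_number;
run;

/* Control dataset: every known ID maps to '1', everything else to '0' */
data cntlin;
  set cust_ids end=last;
  retain fmtname 'CUSTEXPF' type 'N';
  length label $1 hlo $1;
  start = customer_number;
  label = '1';
  hlo   = ' ';
  output;
  if last then do;
    start = .;
    hlo   = 'O';   /* O marks the OTHER range */
    label = '0';
    output;
  end;
  keep fmtname type start label hlo;
run;

/* Create the CUSTEXPF format from the control dataset */
proc format cntlin=cntlin;
run;
```

With the format in place, a filter such as where put(customer_number, custexpf1.) = '1' keeps only the customers that appear in the small table.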
You could split your data into one dataset per month, as in this example:
data test;
infile datalines dsd;
input ID : $2. MONTH $3. ;
datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
;
run;
data JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;
This avoids looping through all 250,000 IDs one at a time and relies on the ordinary OUTPUT statement of the data step.
At the end you get one dataset per month (12 for a full year), each containing the related IDs.
In your case the month values look like FEB2014, so use the SUBSTR function and the condition in the data step becomes:
...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...
Regards

Extract month and year for new variable

I have a dataset with a variable that has the date of orders (MMDDYY10.). I need to extract the month and year to look at orders by month--year because there are multiple years. I am vaguely aware of the month and year functions, but how can I use them together to create a new month bin?
Ideally I would create a bin for each month/year so my output can look something like:
                 Date
Item   2011OCT  2011NOV  2011DEC  2012JAN  ...
a      50       40       30       20
b      15       20       25       30
c      1        2        3        4
total
Here is a sample code I created:
data dsstabl;
set dsstabl;
order_month = month(AC_DATE_OF_IMAGE_ORDER);
order_year = year(AC_DATE_OF_IMAGE_ORDER);
order = compress(order_month||order_year);
run;
proc freq data=dsstabl;
table item * order;
run;
Thanks in advance!
When you are doing your analysis, use an appropriate format. MONYY. sounds like the right one. That will sort properly and will group values accordingly.
Something like:
proc means data=yourdata;
class datevar;
format datevar MONYY7.;
var whatever;
run;
So your table:
proc tabulate data=dsstabl;
class item datevar;
format datevar MONYY7.;
tables item,datevar*n;
run;
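Alternatively, create a new variable that holds the first day of each date's month. INTNX with the 'month' interval, an increment of 0 and alignment 'B' (beginning) does that, so every date in the same month and year gets the same value.
data my_data_with_months;
  set my_data;
  MONTH = INTNX('month', NORMAL_DATE, 0, 'B');
  format MONTH monyy7.; * display the date value as, for example, OCT2011;
run;
You can then group on MONTH directly.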

SAS-How to format arrays dynamically based on information in one column

I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category      Name1  Name2  ......... (Changes invariably)
#ofpeople     20     30
#ofproviders  10     5
#ofclaims     40     25
AmountBilled  50     100
AmountPaid    11     35
AmountDed     5      6
I would like to apply the dollar10.2 format to the values under Name1 through however many Name# columns there are, but only in rows where Category is 'AmountBilled', 'AmountPaid' or 'AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$12. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically determine the upper bound of the newname array (arbitrarily 10 above) using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternatively, you could write out the original dataset as a three-column dataset with many rows for each category (was it this way to begin with, prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
  set have;
  array names name:;
  format nameval $10.;
  do namenum = 1 to dim(names);
    if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
    else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
    output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names);
  end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
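As a rough illustration of the compute-block approach, the sketch below applies dollar10.2 only on the Amount rows of the HAVE dataset from above. It hard-codes the two example columns name1 and name2, which is an assumption; a real report with a varying number of Name columns would need the column list generated first.

```sas
proc report data=have;
  column category name1 name2;
  define category / display;
  define name1    / display;
  define name2    / display;

  /* Attached to the last column so that category and name1 are already available */
  compute name2;
    if substr(category, 1, 6) = 'Amount' then do;
      call define('name1', 'format', 'dollar10.2');
      call define('name2', 'format', 'dollar10.2');
    end;
  endcomp;
run;
```

The underlying variables stay numeric; only the displayed value in the report changes.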