Suppose I have the following:
data have;
input ID :$20. Start :date9. End :date9.;
format start end ddmmyy9.;
cards;
0001 01JAN2015 30JUN2015
0001 01JUL2015 01FEB2016
0001 02FEB2016 11DEC2016
0001 12DEC2016 06FEB2017
0001 07FEB2017 31DEC2017
0002 01JAN2016 31DEC2017
0002 01JAN2018 01MAR2018
0002 01APR2018 31NOV2018
......................
;
and a list of dates:
data dates;
input dates :date9.;
format dates ddmmyy9.;
cards;
01JAN2015
31DEC2015
01JAN2016
31DEC2016
01JAN2017
31DEC2017
01JAN2018
31DEC2018
;
Is there a way to know if, for each ID, each date is in the range? For example: the ID 0001 contains all dates except 01JAN2018 and 31DEC2018.
Moreover, for each year I need to count how many IDs start at 01/01 and end at 31/12 so they appear for the entire year. For example, ID 0002 will not be counted for 2018 because it ends before 31/12. Desired output:
ID    01JAN2015  31DEC2015  01JAN2016  31DEC2016  01JAN2017  31DEC2017  01JAN2018  31DEC2018
0001  yes        yes        yes        yes        yes        yes        no         no
0002  no         no         yes        yes        yes        yes        yes        no
Final table:
Year Count
2015 1
2016 2
2017 2
2018 0
To match the dates in the range I tried:
proc sql;
create table want as;
select dates as t1;
join have as t2;
t2.dates between t1.start and t1.end
order by 1,2;
quit;
Unfortunately I lose the ID correspondence.
Can anyone help me please?
Thank you in advance
The first output can be achieved using proc sql followed by proc transpose and then a data step.
(The variables are created in a different order than in your desired output - hopefully that is not a problem for you.)
The second can be done with a data step followed by proc summary.
NB - I changed the invalid date 31NOV2018 to 30NOV2018 before running my code.
Output 1
* First, join the HAVE and DATES data sets to check which dates fall within a range;
proc sql;
create table want1 as
select h.ID, d.dates format=date9., 'yes' as flag
from dates d left join have h on (d.dates between h.start and h.end)
order by ID, dates
;
quit;
* transpose this to the wider format required. Cells where there is no match will be blank;
proc transpose data=want1 out=want1 (where=(ID ne '') drop=_name_);
by ID;
id dates;
var flag;
run;
* Now populate the blank cells with "no";
data want1;
set want1;
array datevars (*) _all_;
do i = 1 to dim(datevars);
if datevars(i) = '' then datevars(i)='no';
end;
drop i;
run;
Output 2
* This assumes the data are in order by ID, start and end - if not then sort the data before this step;
* First read through the HAVE data set and set count = 1 for each year that is fully covered by each ID;
data want2;
set have;
by ID;
retain first last count; * FIRST and LAST are the earliest START and latest END respectively for each ID;
if first.ID then do;
first=start;
count=0;
end;
if last.ID then do;
last=end;
do yr=year(first) to year(last); * for each ID we want to cycle through all the years covered;
if first <= mdy(1,1,yr) and last >= mdy(12,31,yr) then count = 1; * sets count=1 if the year is fully contained within the range;
output;
count=0;
end;
end;
drop start end first last;
format first last date9.;
run;
* The step above produces a row for each combination of year * ID, we just want totals by year which is done by PROC SUMMARY;
proc summary data=want2 nway;
class yr;
var count;
output out=want2 (drop=_type_ _freq_) sum=;
run;
Related
I have a data set with ID and date. I want a data set with two rows per duplicated ID, on the condition that one row is on or before 8th June and the other is after 8th June.
To take the first date and the first date after 2021-06-08 you can sort by ID and DATE and use LAG() to detect when you cross the date boundary.
data have ;
input id date :date. ;
format date date9.;
cards;
1 01jun2021
1 07jun2021
1 08jun2021
1 09jun2021
;
data want;
set have ;
by id date;
if first.id or ( (date<='08JUN2021'd) ne lag(date<='08JUN2021'd));
run;
Results:
Obs id date
1 1 01JUN2021
2 1 09JUN2021
For a project, I have a large dataset of 1.5m entries. I am looking to aggregate some car loan data by constraint variables such as:
Country, Currency, ID, Fixed or Floating, Performing, Initial Loan Value, Car Type, Car Make
Is it possible to aggregate the data by summing the numeric loan values and condensing rows that share the same categorical values into one row, so that the first dataset below becomes the second?
Country Currency ID Fixed_or_Floating Performing Initial_Value Current_Value
data have;
* PERFORMING is read with :$14. so that Non-Performing is not truncated to 8 characters;
input country $ currency $ ID Fixed $ performing :$14. initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
data want;
input country $ currency $ ID Fixed $ performing :$14. initial current;
datalines;
UK GBP 1 Fixed Performing 410 150
UK GBP 1 Floating Performing 375 170
UK GBP 1 Fixed Non-Performing 220 80
;
run;
Essentially looking for a way to sum the numeric values while concatenating the character variables.
I've tried this code
proc means data=have sum;
var initial current;
by country currency id fixed performing;
run;
I am unsure whether I'll have to use PROC SQL (which might be too slow for such a large dataset) or possibly a data step.
Any help with the concatenation would be appreciated.
Create an output data set from PROC MEANS and concatenate the variables in the result. MEANS with a BY statement requires sorted data, and your HAVE is not sorted.
Concatenation of the aggregations key (those lovely categorical variables) into a single space separated key (not sure why you need to do that) can be done with CATX function.
data have_unsorted;
length country $2 currency $3 id 8 type $8 evaluation $20 initial current 8;
input country currency ID type evaluation initial current;
datalines;
UK GBP 1 Fixed Performing 100 50
UK GBP 1 Fixed Performing 150 30
UK GBP 1 Fixed Performing 160 70
UK GBP 1 Floating Performing 150 30
UK GBP 1 Floating Performing 115 80
UK GBP 1 Floating Performing 110 60
UK GBP 1 Fixed Non-Performing 100 50
UK GBP 1 Fixed Non-Performing 120 30
;
run;
Way 1 - MEANS with CLASS/WAYS/OUTPUT, post process with data step
If the class variables have many distinct value combinations (high cardinality), CLASS processing may need a lot of memory.
proc means data=have_unsorted noprint;
class country currency ID type evaluation ;
ways 5;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 2 - SORT followed by MEANS with BY/OUTPUT, post process with data step
BY statement requires sorted data.
proc sort data=have_unsorted out=have;
by country currency ID type evaluation ;
run;
proc means data=have noprint;
by country currency ID type evaluation ;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 3 - MEANS, given data that is grouped but unsorted, with BY NOTSORTED/OUTPUT, post process with data step
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
proc means data=have_unsorted noprint;
by country currency ID type evaluation NOTSORTED;
output out=sums sum(initial current)= / autoname;
run;
data want;
set sums;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 4 - DATA Step, DOW loop, BY NOTSORTED and key construction
The have rows will be processed in clumps of the BY variables. A clump is a sequence of contiguous rows that have the same by group.
data want_way4;
do until (last.evaluation);
set have_unsorted;
by country currency ID type evaluation NOTSORTED;
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
end;
key = catx(' ',country,currency,ID,type,evaluation);
keep key initial_sum current_sum;
run;
Way 5 - Data Step hash
The data can be processed without a presort or clumping. In other words, the data can be totally disordered.
data _null_;
length key $50 initial_sum current_sum 8;
if _n_ = 1 then do;
call missing (key, initial_sum, current_sum);
declare hash sums();
sums.defineKey('key');
sums.defineData('key','initial_sum','current_sum');
sums.defineDone();
end;
set have_unsorted end=end;
key = catx(' ',country,currency,ID,type,evaluation);
rc = sums.find();
initial_sum = SUM(initial_sum, initial);
current_sum = SUM(current_sum, current);
sums.replace();
if end then
sums.output(dataset:'have_way5');
run;
1.5m entries is not a very big dataset. Sort the dataset first, then run PROC MEANS with a BY statement.
proc sort data=have;
by country currency id fixed performing;
run;
proc means data=have sum;
var initial current;
by country currency id fixed performing;
output out=sum(drop=_:) sum(initial)=Initial sum(current)=Current;
run;
Props to paige miller
proc summary data=testa nway;
var net_balance;
class ID fixed_or_floating performing_status initial country currency ;
output out=sumtest sum=sum_initial;
run;
I am trying to extract all the Time occurrences for only the recent visit. Can someone help me with the code please.
Here is my data:
Obs Name Date Time
1 Bob 2017090 1305
2 Bob 2017090 1015
3 Bob 2017081 0810
4 Bob 2017072 0602
5 Tom 2017090 1300
6 Tom 2017090 1010
7 Tom 2017090 0805
8 Tom 2017072 0607
9 Joe 2017085 1309
10 Joe 2017081 0815
I need the output as:
Obs Name Date Time
1 Bob 2017090 1305,1015
2 Tom 2017090 1300,1010,0805
3 Joe 2017085 1309
Right now my code is designed to give me only one recent entry:
DATA OUT2;
SET INP1;
BY DATE;
IF FIRST.DATE THEN OUTPUT OUT2;
RETURN;
I would first sort the data by name and date. Then I would transpose and process the results.
proc sort data=have;
by name date;
run;
proc transpose data=have out=temp1;
by name date;
var time;
run;
data want;
set temp1;
by name date;
if last.name;
length value $2000;
value = catx(',',of col:);
drop col: _name_;
run;
You may want to further process the new VALUE to remove excess commas (,) and missing value .'s.
Very similar to the question yesterday from another user, you can use quite a few solutions here.
SQL again is the easiest; this is not valid ANSI SQL and pretty much only SAS supports this, but it does work in SAS:
proc sql;
select name, date, time
from have
group by name
having date=max(date);
quit;
Even though date and time are not on the group by it's legal in SAS to put them on the select, and then SAS automatically merges (inner joins) the result of select name, max(date) from have group by name having date=max(date) to the original have dataset, returning multiple rows as needed. Then you'd want to collapse the rows, which I leave as an exercise for the reader.
You could also simply generate a table of maximum dates using any method you choose and then merge yourself. This is probably the easiest in practice to use, in particular including troubleshooting.
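As a rough sketch of that "build the maximum dates, then merge" approach, assuming HAVE has the variables NAME, DATE and TIME from the question (MAXDATES, HAVE_SORTED and RECENT are made-up names for illustration):

```sas
* One row per name holding that name's latest date;
proc sql;
  create table maxdates as
  select name, max(date) as date
  from have
  group by name
  order by name;
quit;

* MERGE with a BY statement needs both inputs sorted by the BY variables;
proc sort data=have out=have_sorted;
  by name date;
run;

* Keep only the rows of HAVE whose name/date pair appears in MAXDATES;
data recent;
  merge have_sorted maxdates (in=inmax);
  by name date;
  if inmax;
run;
```

Because MAXDATES has exactly one row per name, this is a one-to-many merge, and every TIME row for the latest date is kept. The intermediate table can also be inspected directly, which helps when troubleshooting.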
The DoW loop also appeals here. This is basically the precise SAS data step implementation of the SQL above. First iterate over that name, figure out the max, then iterate again and output the ones with that max.
proc sort data=have;
by name date;
run;
data want;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then output;
end;
run;
Of course, here you can more easily collapse the rows, too:
data want;
length timelist $1024;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
max_Date = max(max_date,date);
end;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if date=max_date then timelist=catx(',',timelist,time);
if last.name then output;
end;
run;
If the data is sorted then just retain the first date so you know which records to combine and output.
proc sort data=have ;
by name descending date time;
run;
data want ;
set have ;
by name descending date ;
length timex $200 ;
retain start timex;
if first.name then do;
start=date;
timex=' ';
end;
if date=start then do;
timex=catx(',',timex,time);
if last.date then do;
output;
call missing(start,timex);
end;
end;
drop start time ;
rename timex=time ;
run;
I have 4 columns in my SAS dataset as shown in the first image below. I need to compare the dates of consecutive rows by ID. For each ID, if Date2 occurs before the next row's Date1 for the same ID, then keep the Bill amount. If Date2 occurs after the Date1 of the next row, delete the Bill amount. So for each ID, only keep the bill where Date2 is less than the next row's Date1. I have placed what the result set should look like at the bottom.
Result set should look like
You'll want to create a new variable that moves the next row's DATE1 up one row to make the comparison. Assuming your date variables are in a date format, use PROC EXPAND and make the comparison ensuring that you're not comparing the last value which will have a missing LEAD value:
DATA TEST;
INPUT ID: $3. DATE1: MMDDYY10. DATE2: MMDDYY10. BILL: 8.;
FORMAT DATE1 DATE2 MMDDYY10.;
DATALINES;
AA 07/23/2015 07/31/2015 34
AA 07/30/2015 08/10/2015 50
AA 08/12/2015 08/15/2015 18
BB 07/23/2015 07/24/2015 20
BB 07/30/2015 08/08/2015 20
BB 08/06/2015 08/08/2015 20
;
RUN;
PROC EXPAND DATA = TEST OUT=TEST1 METHOD=NONE;
BY ID;
CONVERT DATE1 = DATE1_LEAD / TRANSFORMOUT=(LEAD 1);
RUN;
DATA TEST2; SET TEST1;
IF DATE1_LEAD NE . AND DATE2 GT DATE1_LEAD THEN BILL=.;
RUN;
If you sort your data so that you are looking back to the previous observation to compare your dates, you can use the LAG function in a DATA step.
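A minimal sketch of that LAG idea, assuming the TEST data set created above and that DATE1 is ascending within each ID (TEST_DESC, TEST2_LAG and DATE1_NEXT are hypothetical names):

```sas
* Reverse the order within each ID so that the next row becomes the previous row;
proc sort data=test out=test_desc;
  by id descending date1;
run;

data test2_lag;
  set test_desc;
  by id descending date1;
  * LAG is called on every row so that its queue stays in step with the data;
  date1_next = lag(date1);
  * The first row of each ID in reversed order has no following row in the original order;
  if first.id then date1_next = .;
  if date1_next ne . and date2 gt date1_next then bill = .;
  format date1_next mmddyy10.;
run;

* Restore the original ordering;
proc sort data=test2_lag;
  by id date1;
run;
```

This avoids PROC EXPAND, which requires SAS/ETS, at the cost of two extra sorts.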
I have the following data:
acct date
11111 01/01/2014
11111 01/01/2014
11111 02/02/2014
22222 01/01/2014
22222 01/01/2014
33333 01/01/2013
33333 03/03/2014
44444 01/01/2014
44444 01/01/2014
44444 01/01/2014
What would be the best way to accomplish the following in SAS? I want to compare the dates for each acct number and return all the records for the accts where there is at least one date that doesn't match.
So for the dataset above, I want to end up with the following:
acct date
11111 01/01/2014
11111 01/01/2014
11111 02/02/2014
33333 01/01/2013
33333 03/03/2014
A single PROC SQL will do the trick. Use count(distinct date) to count the number of different dates. Group that by acct to do the count by acct and when the result is greater than 1 filter it using a having clause. Next select acct and date as output columns.
This is SAS-specific handling of SQL. Most other implementations will not allow this construct, where non-aggregate columns in the select are not all listed in the group by clause.
proc sql noprint;
create table _output as
select acct, date format=ddmmyys10.
from _input
group by acct
having count(distinct date) > 1
order by acct, date;
quit;
Something like this would work. Sort your data by acct/date if not already sorted, then check each last.date row. If the first last.date row is not also last.acct, then the account has more than one date and needs to be output. Here I only output one row per date/acct combination:
data want;
set have;
by acct date;
retain flg; * FLG must be retained, otherwise it is reset to missing on every row;
if first.acct then flg=0; * reset the flag at the start of each account;
if last.date and not last.acct then flg=1; * a date group ends before the account does, so there are at least two dates;
if last.date and flg=1 then output;
run;
If you need all rows, then you need to either take the above and merge it back to the original, or you could do a DoW loop:
data want;
do _n_=1 by 1 until (last.acct);
set have;
by acct date;
if (last.date) and not (last.acct) then do;
flg=1;
end;
end;
do _t_ = 1 by 1 until (last.acct);
set have;
by acct date;
if flg=1 then output;
end;
run;