Aggregating Over Actual Year in SAS

Let's suppose we have the following table ("Purchases"):
Date        Units_Sold  Brand  Year
18/03/2010  5           A      2010
12/04/2010  2           A      2010
22/05/2010  1           A      2010
25/05/2010  7           A      2010
11/08/2011  5           A      2011
12/07/2010  2           B      2010
22/10/2010  1           B      2010
05/05/2011  7           B      2011
And the same logic continues until the end of 2014, for different brands.
What I want to do is calculate the number of Units_Sold for every Brand in each year. However, I don't want to do it per calendar year, but per actual year, meaning each 12-month period counted from the brand's first sale date.
So an example of what I don't want:
proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;
The above logic is OK if we know that, e.g., Brand "A" exists throughout the whole of 2010. But if Brand "A" appeared for the first time on 18/03/2010 and exists until now, then comparing the years 2010 and 2011 would be misleading, because for 2010 we are "lacking" 3 months.
So what I want to do is calculate:
for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.
for B: the sum from 12/07/2010 until 11/07/2011, etc.
and so on for all Brands.
Is there a smart way of doing this?

Step 1: Make sure your dataset is sorted or indexed by Brand and Date
proc sort data=have;
    by brand date;
run;
Step 2: Calculate the start/end dates for each product
The idea behind the below code:
We know that the first occurrence of the brand in the sorted dataset is the day in which the brand was introduced. We'll call this Product_Year_Start.
The intnx function with the 'S' (same-day) alignment can be used to advance that date by one year; subtracting 1 day then gives the last day of the product year. Let's call this date Product_Year_End.
Since we now know the product's year end date, we know that if the date on any given row exceeds the product's year end date, we have started the next product year. We'll just take the calculated Product_Year_End and Product_Year_Start for that brand and bump them up by one year.
This is all achieved using by-group processing and the retain statement.
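data Comparison_Dates;
    set have;
    by brand date;
    retain Product_Year_Start Product_Year_End;

    /* First row of a brand: its first sale date opens product year 1 */
    if first.brand then do;
        Product_Year_Start = date;
        Product_Year_End = intnx('year', date, 1, 'S') - 1;
    end;

    /* Roll the window forward one year at a time until it contains this row.
       A loop (rather than a single IF) also covers gaps longer than one year;
       the missing() guard avoids looping forever on a missing start date. */
    do while (not missing(Product_Year_End) and Date > Product_Year_End);
        Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
        Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
    end;

    format Product_Year_Start Product_Year_End date9.;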
run;
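To see how the 'S' (same-day) alignment differs from adding a fixed 365 days, here is a minimal sketch. Only 18MAR2010 comes from the sample data; the other start dates and all variable names are hypothetical and chosen purely for illustration.

```sas
data _null_;
    /* Brand A's first sale date from the sample data */
    start = '18MAR2010'd;
    end_same  = intnx('year', start, 1, 'S') - 1;   /* day before the first anniversary */
    end_fixed = start + 365 - 1;                    /* last day of a fixed 365-day window */
    put end_same= date9. end_fixed= date9.;

    /* Hypothetical start whose first year contains 29FEB2012 */
    start2 = '18MAR2011'd;
    end_same2  = intnx('year', start2, 1, 'S') - 1;
    end_fixed2 = start2 + 365 - 1;
    put end_same2= date9. end_fixed2= date9.;

    /* Hypothetical leap-day start: 'S' falls back to the last day of February */
    anniversary = intnx('year', '29FEB2012'd, 1, 'S');
    put anniversary= date9.;
run;
```

For the first pair both approaches end on 17MAR2011. For the second, the fixed window ends on 16MAR2012, one day before the anniversary-based 17MAR2012, and the leap-day start moves to 28FEB2013.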
Step 3: Using the original SQL code, group instead by the new product start/end dates
proc sql;
create table want as
select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
, Brand
, sum(Units_Sold) as Unit_per_Year
from Comparison_Dates
group by Brand, calculated Product_Year
order by Brand, calculated Product_Year;
quit;

The following code does what you ask in a literal sense: starting from the earliest date of each brand, it accumulates units_sold. When a row falls past the 365-day mark, it outputs the running total, resets the count, and starts another cycle from that row's date.
data have;
informat date ddmmyy10.;
input date units_sold brand $ year;
format date date9.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;
proc sort data=have;
by brand date;
run;
data want;
    do until (last.brand);
        set have;
        by brand date;
        if first.brand then do;
            Sales_Over_365 = 0;
            _end = intnx('day', date, 365);
        end;
        if date <= _end then
            Sales_Over_365 + units_sold;
        else do;
            output;
            Sales_Over_365 = units_sold;
            _end = intnx('day', date, 365);
        end;
    end;
    output;
    drop _end;
run;

You need to have a start date for each brand. For now we can use the first sale date, but that might not be what you want. Then you can classify each sales date into which year it is for that brand.
Let's start by creating a dataset from your sample data. The YEAR variable is not needed.
data have ;
input Date Units_Sold Brand $ Year ;
informat date ddmmyy10.;
format date yymmdd10.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;;;;
Now we can get the answer you want with an SQL query.
proc sql ;
create table want as
select brand
, start_date
, 1+floor((date - start_date)/365) as sales_year
, intnx('year',start_date,calculated sales_year -1,'same')
as start_sales_year format=yymmdd10.
, sum(units_sold) as total_units_sold
from
( select brand
, min(date) as start_date format=yymmdd10.
, date
, units_sold
from have
group by 1
)
group by 1,2,3,4
;
quit;
This will produce this result:
Brand  start_date  sales_year  start_sales_year  total_units_sold
A      2010-03-18  1           2010-03-18        15
A      2010-03-18  2           2011-03-18        5
B      2010-07-12  1           2010-07-12        10
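Dividing by 365 drifts by a day whenever a leap day falls inside the period. One possible alternative, sketched here and not part of the answer above, is to count completed anniversaries with intck() and its 'C' (continuous) method. The table name want_anniv is hypothetical, and the sketch assumes the dataset have created earlier in this answer and a SAS release that supports the method argument.

```sas
proc sql;
    create table want_anniv as
    select brand
         , start_date
           /* 'C' counts full years elapsed since start_date, so the
              anniversary date itself is the first day of the next sales year */
         , 1 + intck('year', start_date, date, 'C') as sales_year
         , sum(units_sold) as total_units_sold
    from
        ( select brand
               , min(date) as start_date format=yymmdd10.
               , date
               , units_sold
          from have
          group by brand
        )
    group by brand, start_date, calculated sales_year
    ;
quit;
```

On the sample data this should give the same three groups as the query above (15 and 5 units for brand A, 10 for brand B), because no sale falls close enough to an anniversary for the two methods to disagree.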

There is no straightforward way of doing it. You can do something like this.
To test the code, I saved your table into a text file.
Then I created a class called Sale.
public class Sale
{
public DateTime Date { get; set; }
public int UnitsSold { get; set; }
public string Brand { get; set; }
public int Year { get; set; }
}
Then I populated a List<Sale> using the saved text file.
var lines = File.ReadAllLines(@"C:\Users\kosala\Documents\data.text");
var validLines = lines.Where(l => !l.Contains("Date")).ToList();//remove the first line.
List<Sale> sales = validLines.Select(l => new Sale()
{
Date = DateTime.Parse(l.Substring(0,10)),
UnitsSold = int.Parse(l.Substring(26,5)),
Brand = l.Substring(46,1),
Year = int.Parse(l.Substring(56,4)),
}).ToList();
//All the above code is for testing purposes. The actual code starts from here.
var totalUnitsSold = sales.OrderBy(s => s.Date).GroupBy(s => s.Brand);
foreach (var soldUnit in totalUnitsSold)
{
DateTime? minDate = null;
DateTime? maxDate = null;
int total = 0;
string brand = "";
foreach (var sale in soldUnit)
{
brand = sale.Brand;
if (minDate == null)
{
minDate = sale.Date;
}
if ((sale.Date - minDate).Value.Days <= 365)
{
maxDate = sale.Date;
total += sale.UnitsSold;
}
else
{
break;
}
}
Console.WriteLine("Brand : {0} UnitsSold Between {1} - {2} is {3}",brand, minDate.Value, maxDate.Value, total);
}

Related

Do loop not pulling in all data- creating dates

We have a code that we use to create quarterly reports of projects. There is a piece of code, a do loop, that takes the startdate and enddate of each project in our dataset and creates an observation for each month and year that the project took place in. For example if we have a project called "Employment Help" with a startdate value of 01JAN2022 and an enddate value of 01APR2022, the do loop will create 4 observations for this project with the month and year values of 1 2022, 2 2022, 3 2022, and 4 2022. We use this to count how many projects happened during our quarters. We are running into an issue where the do loop is dropping projects and not giving them a month or year value and we are losing projects in our count because of this. The dates are all in the same format.
Here is an example of some data that is pulled in: EXAMPLE 2 is processed properly by the do loop, while EXAMPLE 1 is not.
Here is the code:
data test2;
    set users3;
    do i = 0 to (year(enddate)-year(startdate));
        year = year(startdate)+i;
    end;
    do i = 0 to (month(enddate)-month(startdate));
        month = month(startdate)+i;
        drop i;
        output;
    end;
run;
Consider the following example:
data have;
input project$ startdate:date9. enddate:date9.;
format startdate enddate date9.;
datalines;
A 01JAN2022 01APR2022
B 01MAR2022 01JUN2022
C 01NOV2022 01JAN2023
;
run;
The third row will produce no observations because the difference between the end month number and the start month number is negative (1 - 11 = -10), so the loop body never executes. Instead of doing two loops, one for year and one for month, do a single loop over all of the months from the start date. Use intnx() to generate your months using startdate as the reference month. The loop index i offsets each month from the start date. For example:
code                               output
intnx('month', '01JAN2022'd, 0)    01JAN2022
intnx('month', '01JAN2022'd, 1)    01FEB2022
intnx('month', '01JAN2022'd, 2)    01MAR2022
Since you're incrementing by exactly one month for each date, you can get the year and month number in a single loop.
data want;
set have;
do i = 0 to intck('month', startdate, enddate);
month = month(intnx('month', startdate, i) );
year = year(intnx('month', startdate, i) );
output;
end;
drop i;
run;
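As a quick check of why intck() is the right loop bound here, the sketch below compares it with the month-number difference for project C. The variable names are illustrative only, and the last pair of dates is a hypothetical example.

```sas
data _null_;
    /* Project C from the example data */
    startdate = '01NOV2022'd;
    enddate   = '01JAN2023'd;

    month_diff   = month(enddate) - month(startdate);     /* 1 - 11 = -10 */
    n_boundaries = intck('month', startdate, enddate);    /* month boundaries crossed */
    put month_diff= n_boundaries=;

    /* Hypothetical dates one day apart: intck counts boundaries, not full months */
    short_span = intck('month', '31JAN2022'd, '01FEB2022'd);
    put short_span=;
run;
```

The log should show month_diff=-10, n_boundaries=2 and short_span=1, so the loop from 0 to intck(...) yields one row per calendar month touched by the project, including the start and end months.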
Your code doesn't seem to handle projects that cross a year boundary, i.e. a project that started in 2021 and ended in 2022.
This should get you closer.
data have;
input startdate : date9. enddate : date9.;
format startdate enddate date9.;
cards;
01Jan2022 01Apr2022
01Sep2021 01Apr2022
;;;
run;
data want;
    set have;
    nmonths = intck('month', startdate, enddate) + 1;
    date = startdate;
    do i = 1 to nmonths;
        month = month(date);
        year = year(date);
        /* write the row before advancing, so date matches month and year */
        output;
        date = intnx('month', startdate, i, 'b');
    end;
run;

Deleting zip files that are 2 years old from an external folder using SAS

I am trying the code below and I am able to delete the files that are in zip format, but I need to know how to filter so that only the 2-year-old files are deleted. The file name will look like SFRE_BIL_SIT_20160812_134317_PAM_FILES1.zip
When filtering for the 2-year-old files, I need to look only at the year and month part of the file name, because the day part keeps changing. Please find the code below.
options mlogic;
%macro delete_all_zip_files_in_folder(folder);
    filename filelist "&folder";
    data _null_;
        dir_id = dopen('filelist');
        total_members = dnum(dir_id);
        do i = 1 to total_members;
            member_name = dread(dir_id,i);
            if scan(lowcase(member_name),2,'.')='zip' then do;
                file_id = mopen(dir_id,member_name,'i',0);
                if file_id > 0 then do;
                    freadrc = fread(file_id);
                    rc = fclose(file_id);
                    rc = filename('delete',member_name,,,'filelist');
                    rc = fdelete('delete');
                end;
                rc = fclose(file_id);
            end;
        end;
        rc = dclose(dir_id);
    run;
%mend;
%delete_all_zip_files_in_folder(C:\Users\UCS1MKP\Desktop\test)
If your files are all named in the same manner (a 3-part code, then the date, all separated by _), you can use scan(). You already used scan() to check for files with the zip extension. You can do the same to extract the date part from the file name, then parse it and convert it to numbers.
Solution 1
The fourth word of member_name is the YYYYMMDD part:
datestring = scan(member_name,4,'_');
The first 4 digits are the year:
year = input(substr(datestring,1,4),best.);
Then 2 digits for the month:
month = input(substr(datestring,5,2),best.);
Finally 2 digits for the day (just in case you need the day):
day = input(substr(datestring,7,2),best.);
At the end, just compare with the current month and year:
if month = month(today()) and year = year(today()) then do;
*your code for deleting file;
end;
But this will work only for the current month. To delete files older than 2 years, you can do something like the following.
Solution 2
Having all the date parts already, construct a date:
date = mdy(month, day, year);
Then compare it with the current date minus 2 years. To shift the date exactly 2 years back from today, use the intnx() function:
if intnx('year', today(), -2, 'S') > date then do;
*your code for deleting file;
end;
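As a compact variant of the two solutions, the date part can also be read in one step with the yymmdd8. informat and compared at month granularity, since the question says only the year and month should matter. This is a sketch under assumptions: the names file_date, cutoff and is_old are illustrative, and treating "2 years old" as "the file's month is earlier than the month 24 months before today" is only one possible reading.

```sas
data _null_;
    /* Example file name taken from the question */
    member_name = 'SFRE_BIL_SIT_20160812_134317_PAM_FILES1.zip';

    /* Fourth underscore-delimited word is YYYYMMDD;
       ?? suppresses log errors for names that do not parse */
    file_date = input(scan(member_name, 4, '_'), ?? yymmdd8.);

    /* First day of the month 24 months before today */
    cutoff = intnx('month', today(), -24, 'B');

    /* Compare month starts so the day part of the file name is ignored */
    is_old = (not missing(file_date))
             and (intnx('month', file_date, 0, 'B') < cutoff);

    put file_date= date9. cutoff= date9. is_old=;
run;
```

Inside the macro's loop, the same expressions could replace the separate year, month and day variables, with the delete calls placed under if is_old then do; ... end;.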

How can I select the first and last week of each month in SAS?

I have monthly data with several observations per day. I have day, month and year variables. How can I retain data from only the first and the last 5 days of each month? I have only weekdays in my data so the first and last five days of the month changes from month to month, ie for Jan 2008 the first five days can be 2nd, 3rd, 4th, 7th and 8th of the month.
Below is an example of the data file. I wasn't sure how to share this so I just copied some lines below. This is from Jan 2, 2008.
Would a variation of first.variable and last.variable work? How can I retain observations from the first 5 days and last 5 days of each month?
Thanks.
1 AA 500 B 36.9800 NH 2 1 2008 9:10:21
2 AA 500 S 36.4500 NN 2 1 2008 9:30:41
3 AA 100 B 36.4700 NH 2 1 2008 9:30:43
4 AA 100 B 36.4700 NH 2 1 2008 9:30:48
5 AA 50 S 36.4500 NN 2 1 2008 9:30:49
If you want to examine the data and determine the minimum 5 and maximum 5 values then you can use PROC SUMMARY. You could then merge the result back with the data to select the records.
So if your data has variables YEAR, MONTH and DAY you can make a new data set that has the top and bottom five days per month using simple steps.
proc sort data=HAVE (keep=year month day) nodupkey
out=ALLDAYS;
by year month day;
run;
proc summary data=ALLDAYS nway;
class year month;
output out=MIDDLE
idgroup(min(day) out[5](day)=min_day)
idgroup(max(day) out[5](day)=max_day)
/ autoname ;
run;
proc transpose data=MIDDLE out=DAYS (rename=(col1=day));
by year month;
var min_day: max_day: ;
run;
proc sql ;
create table WANT as
select a.*
from HAVE a
inner join DAYS b
on a.year=b.year and a.month=b.month and a.day = b.day
;
quit;
/****
get some dates to play with
****/
data dates(keep=i thisdate);
offset = input('01Jan2015',DATE9.);
do i=1 to 100;
thisdate = offset + round(599*ranuni(1)+1); *** within 600 days from offset;
output;
end;
format thisdate date9.;
run;
/****
BTW: intnx('month',thisdate,1) = first day of next month. Deduct 1 to get the last day
of the current month.
intnx('month',thisdate,0,"BEGINNING") = first day of the current month
****/
proc sql;
create table first5_last5 AS
SELECT
*
FROM
dates /* replace with name of your data set */
WHERE
/* replace all occurences of 'thisdate' with name of your date variable */
( intnx('month',thisdate,1)-5 <= thisdate <= intnx('month',thisdate,1)-1 )
OR
( intnx('month',thisdate,0,"BEGINNING") <= thisdate <= intnx('month',thisdate,0,"BEGINNING")+4 )
ORDER BY
thisdate;
quit;
Create some data with the desired structure;
Data inData (drop=_:); * forget all variables starting with an underscore*;
format date yymmdd10. time time8.;
_instant = datetime();
do _i = 1 to 1E5;
date = datepart(_instant);
time = timepart(_instant);
yy = year(date);
mm = month(date);
dd = day(date);
*just some more random data*;
letter = byte(rank('a') +floor(rand('uniform', 0, 26)));
*select week days*;
if weekday(date) in (2,3,4,5,6) then output;
_instant = _instant + 1E5*rand('exponential');
end;
run;
Count the days per month;
proc sql;
create view dayCounts as
select yy, mm, count(distinct dd) as _countInMonth
from inData
group by yy, mm;
quit;
Select the days;
data first_5(drop=_:) last_5(drop=_:);
merge inData dayCounts;
by yy mm;
_newDay = dif(date) ne 0;
retain _nrInMonth;
if first.mm then _nrInMonth = 1;
else if _newDay then _nrInMonth + 1;
if _nrInMonth le 5 then output first_5;
if _nrInMonth gt _countInMonth - 5 then output last_5;
run;
Use the INTNX() function. You can use INTNX('month',...) to find the beginning and ending days of the month and then use INTNX('weekday',...) to find the first 5 week days and last five week days.
You can convert your month, day, year values into a date using the MDY() function. Let's assume that you do that and create a variable called TODAY. Then to test whether it is within the first 5 weekdays or the last 5 weekdays of the month, you could do something like this:
first5 = intnx('weekday',intnx('month',today,0,'B'),0) <= today
<= intnx('weekday',intnx('month',today,0,'B'),4) ;
last5 = intnx('weekday',intnx('month',today,0,'E'),-4) <= today
<= intnx('weekday',intnx('month',today,0,'E'),0) ;
Note that those ranges will include the weekends, but that shouldn't matter if your data doesn't have those dates.
But you might have issues if your data skips holidays.

lag daily data by 1 month in sas

I have a data set with daily data in SAS. I would like to convert this to monthly form by taking differences from the previous month's value by id. For example:
thedate, id, val
2012-01-01, 1, 10
2012-01-01, 2, 14
2012-01-02, 1, 11
2012-01-02, 2, 12
...
2012-02-01, 1, 20
2012-02-01, 2, 15
I would like to output:
thedate, id, val
2012-02-01, 1, 10
2012-02-01, 2, 1
Here is one way. If you license SAS-ETS, there might be a better way to do it with PROC EXPAND.
*Setting up the dataset initially;
data have;
informat thedate YYMMDD10.;
input thedate id val;
datalines;
2012-01-01 1 10
2012-01-01 2 14
2012-01-02 1 11
2012-01-02 2 12
2012-02-01 1 20
2012-02-01 2 15
;;;;
run;
*Sorting by ID and DATE so it is in the right order;
proc sort data=have;
by id thedate;
run;
data want;
set have;
retain lastval; *This is retained from record to record, so the value carries down;
by id thedate;
if (first.id) or (last.id) or (day(thedate)=1); *The only records of interest - the first record, the last record, and any record that is the first of a month.;
* To do END: if (first.id) or (last.id) or (thedate=intnx('MONTH',thedate,0,'E'));
if first.id then call missing(lastval); *Each time ID changes, reset lastval to missing;
if missing(lastval) then output; *This will be true for the first record of each ID only - put that record out without changes;
else do;
val = val-lastval; *set val to the new value (current value minus retained value);
output; *put the record out;
end;
lastval=sum(val,lastval); *this value is for the next record;
run;
You could achieve this using PROC SQL and the intnx function to bring last month's date forward a month:
proc sql ;
    create table lag as
    select b.thedate, b.id, (b.val - a.val) as val
    from mydata b
    left join
    mydata a on b.thedate = intnx('month',a.thedate,1,'s')
    and b.id = a.id
    order by b.thedate, b.id ;
quit ;
This may need tweaking to handle scenarios where the previous month doesn't exist or months which have a different number of days to the previous month.
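To illustrate the month-length issue, the following sketch shifts the last four days of January 2012 forward by one month with the same 's' alignment used in the join. The dates are chosen only for illustration.

```sas
data _null_;
    /* 2012 is a leap year, so February has 29 days */
    do d = '28JAN2012'd to '31JAN2012'd;
        shifted = intnx('month', d, 1, 's');
        put d= date9. shifted= date9.;
    end;
run;
```

28JAN2012 maps to 28FEB2012, but 29, 30 and 31 January all map to 29FEB2012, so a 29FEB2012 row would match three rows from the previous month in the join and appear three times in the output.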

Computing Compounded Return in SAS

I have a dataset with date (monthly), person and return (monthly). I need to compound the monthly returns from April of year t to March of year t+1 for each person.
For example,
Annual return of Person A = April Year 1 * May Year 1 * ... * March Year 2.
How can I do that in SAS? Do I need an array?
Do you mean your dataset looks like
Date , Person , Return
"01Apr2009"d , A , 1.1
etc
etc
?
/* Then you can do */
data ur_input_data_view / view = ur_input_data_view;
/* assuming your dataset is stored in ur_input_data */
set ur_input_data;
/* obtain the year and month from the date variable */
year = year(date);
mth = month(date);
run;
/* you can skip the sort if your data is already sorted by person then date */
proc sort data = ur_input_data;
by person date;
run;
data output_data;
    set ur_input_data_view ;
    by person date;
    retain annual_return;
    /* reset the running product to 1 when looking at a new person or when the month is april */
    if first.person or mth = 4 then do;
        annual_return = 1;
    end;
    annual_return = annual_return * return;
    /* this will output nothing for the persons that don't have a record at march */
    if mth = 3 then output;
run;
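The same April-to-March product can also be sketched in PROC SQL by summing logs, which needs neither an array nor retain. This is an alternative under assumptions, not the method above: it assumes return holds gross monthly returns that are strictly positive (e.g. 1.1), and the names annual, fiscal_year and n_months are hypothetical.

```sas
proc sql;
    create table annual as
    select person
           /* shifting the date back 3 months maps April..March onto a
              single calendar year, labelled by the year of its April */
         , year(intnx('month', date, -3)) as fiscal_year
           /* product of returns = exp of the sum of their logs */
         , exp(sum(log(return))) as annual_return
         , count(*) as n_months
    from ur_input_data
    group by person, calculated fiscal_year
    ;
quit;
```

Unlike the data step, this also produces rows for incomplete years; n_months can be used to keep only the periods that have all 12 months.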