Data missing for the selected date range in SAS

I need to find the missing months in a data set in SAS. I am new to SAS and would appreciate some help. My data set is shown below. In this example the date range runs from 201810 to 201906 (a 9-month sample; the real check needs to cover 15 months).
+----+------------+
| ID | Elig Month |
+----+------------+
| 1 | 201810 |
| 1 | 201811 |
| 1 | 201901 |
| 1 | 201902 |
| 1 | 201903 |
| 1 | 201904 |
| 1 | 201905 |
| 1 | 201906 |
| 2 | 201811 |
| 2 | 201901 |
| 2 | 201903 |
| 2 | 201904 |
| 2 | 201905 |
| 2 | 201906 |
| 3 | 201901 |
| 3 | 201902 |
| 3 | 201903 |
| 3 | 201904 |
| 3 | 201905 |
| 3 | 201906 |
| 4 | 201810 |
| 4 | 201903 |
| 4 | 201904 |
| 4 | 201905 |
| 4 | 201906 |
| 5 | 201906 |
| 6 | 201810 |
+----+------------+
I want to check whether data is present for every month in the 15-month date range. Dates are stored in yearmonth form, e.g. 201901. I want to find the missing months and assign each ID to a group based on how many months are missing:
1. if only one month is missing, group as "1 month missing"
2. if two consecutive months are missing, group as "2 month missing"
3. if 3 months are missing, "3 month missing"
4. if 4-6 months are missing, "4-6 months missing"
5. if months are missing alternately (available in one month, not available in the next, then available in the next two), group as "Chaos"
6. if 7-12 months are missing, "7-12 months missing"
7. if more than 12 months are missing, "12+ months missing"
8. if the ID is seen only once, at the end of the period, "Reborn"
9. if the ID is seen at the start of the period and then never again in the 15 months, "Dead"
The expected result is shown below:
+----+-------+--------------------+
| ID | Group | Group description |
+----+-------+--------------------+
| 1 | 1 | 1 months missing |
| 2 | 5 | Chaos |
| 3 | 2 | 2 months missing |
| 4 | 4 | 4-6 months missing |
| 5 | 8 | Reborn |
| 6 | 9 | Dead |
+----+-------+--------------------+

First, to replicate your data:
/***********************************************************************/
/* ORIGINAL DATA */
/***********************************************************************/
Data have;
Input ID date yymmn6.;
datalines;
1 201810
1 201811
1 201901
1 201902
1 201903
1 201904
1 201905
1 201906
2 201811
2 201901
2 201903
2 201904
2 201905
2 201906
3 201901
3 201902
3 201903
3 201904
3 201905
3 201906
4 201810
4 201903
4 201904
4 201905
4 201906
5 201906
6 201810
;
Run;
Data Have;
Set Have;
format date yymmn6.;
Run;
Then you can use the following macro variables and data step to create a master list of every year-month between the dates you want:
/***********************************************************************/
/* How to Find out IF Your Data is Missing a Date in sequence */
/***********************************************************************/
%let start_date=01OCT2018; /*Change this to your starting date*/
%let end_date=01jun2019; /*Change this to your Ending date*/
data month;
want=1;
date="&start_date"d;
do while (date<="&end_date"d);
output;
date=intnx('month', date, 1, 's');
end;
format date yymmn6.;
run;
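Lastly, merge the two by date. Any row with a null/missing ID marks a month with no data, and that is what you group by in your categorization logic. Note that because the merge is by date only, it flags months that are absent for every ID, not the months missing for each individual ID.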
Proc sort data= have;
by date;
Proc sort data=month;
by date;
Run;
Data Want;
merge Have month;
by date;
Run;
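To see the missing months per ID rather than overall, one option is to build every ID and month combination first and then look for the pairs that are absent from the data. This is a minimal sketch, assuming the HAVE and MONTH datasets created above; the table names ids_months and missing are illustrative and not part of the original answer.

```sas
/* Sketch: list the months that are missing for each ID.            */
/* Assumes HAVE (id, date) and MONTH (date) exist as created above. */
/* ids_months and missing are hypothetical table names.             */
proc sql;
  /* Every ID paired with every month in the range */
  create table ids_months as
  select i.id, m.date
  from (select distinct id from have) as i,
       month as m;

  /* Keep only the ID/month pairs that have no match in HAVE */
  create table missing as
  select g.id, g.date format=yymmn6.
  from ids_months as g
       left join have as h
       on g.id = h.id and g.date = h.date
  where h.id is missing
  order by g.id, g.date;
quit;
```

The resulting table has one row per ID and missing month, which can then feed whatever grouping rules you choose.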

The dataset you have provided does not let you create all the categories, so I am assuming 9 months instead of 15 here. You will have to make some changes to make it work for 15 months. Here is one way of doing this:
I am reading the months as just numbers:
data have;
input id month;
datalines;
1 201810
1 201811
1 201901
;
run;
If you read month as a date field, then you need to do so when creating the allmonths dataset below also.
/* Create a dataset that contains all months for all IDs */
proc sort data=have(keep=id) nodupkey out=ids;
by id;
run;
/* Very lazy way of populating the months. There are elegant ways to do this */
data allmonths;
set ids;
do month = 201810 to 201812;
output;
end;
do month = 201901 to 201906;
output;
end;
run;
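One way to avoid hard-coding the year break is to step through real SAS dates with INTNX and convert back to a yyyymm number. This is a sketch only, assuming the ids dataset created above; start_ym, end_ym, allmonths2 and d are hypothetical names.

```sas
/* Sketch: one row per ID per month for any start/end yyyymm. */
%let start_ym = 201810;   /* assumed first month of the window */
%let end_ym   = 201906;   /* assumed last month of the window  */

data allmonths2;
  set ids;
  /* Convert the yyyymm number to a SAS date (first day of the month) */
  d = input(put(&start_ym, 6.), yymmn6.);
  do while (d <= input(put(&end_ym, 6.), yymmn6.));
    month = year(d)*100 + month(d);   /* back to a yyyymm number */
    output;
    d = intnx('month', d, 1);         /* advance one month */
  end;
  drop d;
run;
```

Changing the two macro variables is then enough to cover a 15-month window.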
/* Merge the full combination with what you have and put a marker to indicate if a particular month is present for the ID or not */
data merged;
merge allmonths have (in=a);
by id month;
if a then present=1;
else present =0;
run;
/* Form a bit pattern and use that to categorize your cases */
data want;
set merged;
by id;
retain pattern counter cnt0;
if first.id then do;
pattern = repeat('0',8);
counter = 1;
cnt0 = 0;
end;
if present then substr(pattern,counter,1) = '1';
else cnt0 + 1;
counter + 1;
/* You could use a macro to auto generate these combinations if you expect to have very many categories */
length desc $50;
if last.id then do;
if pattern = "000000001" then desc = "Reborn";
else if pattern = "100000000" then desc = "Dead";
else if cnt0 = 1 then desc = "1 months missing";
else if cnt0 = 2 and index(pattern, '00') then desc = "2 months missing";
else if cnt0 = 2 then desc = "chaos";
else if cnt0 = 3 and index(pattern, '000') then desc = "3 months missing";
else if cnt0 = 3 then desc = "chaos";
else if cnt0 = 4 and index(pattern, '0000') then desc = "4 - 6 months missing";
else if cnt0 = 4 then desc = "chaos";
else if cnt0 = 5 and index(pattern, '00000') then desc = "4 - 6 months missing";
else if cnt0 = 5 then desc = "chaos";
else if cnt0 = 6 and index(pattern, '000000') then desc = "4 - 6 months missing";
else if cnt0 = 6 then desc = "chaos";
output;
end;
keep id desc;
run;
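For a longer window such as 15 months, the chain of index() checks can be replaced by measuring the longest run of zeros directly. This is a minimal sketch, assuming the bit patterns have already been built; the three sample patterns correspond to IDs 1, 2 and 4 in the question, and the names patterns, runs, longest, cur, total0 and consecutive are hypothetical.

```sas
/* Sketch: longest run of consecutive missing months in a 0/1 pattern. */
data patterns;
  input id pattern :$15.;
  datalines;
1 110111111
2 010101111
4 100001111
;

data runs;
  set patterns;
  longest = 0;   /* longest run of '0' found so far */
  cur     = 0;   /* length of the current run       */
  do i = 1 to length(pattern);
    if substr(pattern, i, 1) = '0' then do;
      cur = cur + 1;
      longest = max(longest, cur);
    end;
    else cur = 0;
  end;
  total0 = countc(pattern, '0');      /* total number of missing months */
  /* If the longest run equals the total, all missing months are adjacent */
  consecutive = (longest = total0);
  drop i cur;
run;
```

With these three patterns, IDs 1 and 4 each have a single gap (consecutive = 1), while ID 2 has three separate one-month gaps (consecutive = 0), which corresponds to the "Chaos" case. The description can then be assigned from total0 and consecutive instead of one branch per count.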

Related

Grouping child items and displaying parent sum

I have the following table
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
I would like to group the table by group, insert the grouped sum into value, and then ungroup:
+-------+--------+
| item | value |
+-------+--------+
| 1 | 30 |
| a | 10 |
| b | 20 |
| 2 | 70 |
| b | 30 |
| c | 40 |
+-------+--------+
The purpose of the result is to interpret the first column as items a and b belonging to group 1 with sum 30 and items b and c belonging to group 2 with sum 70.
Such a data transformation can be indicative of a reporting requirement more than a useful data structure for downstream processing. Proc REPORT can create output in the form desired.
data have;
infile datalines;
input group $ item $ value @@;
datalines;
1 a 10 1 b 20 2 b 30 2 c 40
;
proc report data=have;
column group item value;
define group / order order=data noprint;
break before group / summarize;
compute item;
if missing(item) then item=group;
endcomp;
run;
I assume that both group and item are character variables:
data have;
infile datalines firstobs=4 dlm='|';
input group $ item $ value;
datalines;
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
;
data want (keep=group value);
do _N_=1 by 1 until (last.group);
set have;
by group;
v + value;
end;
value = v;
output;
v = 0;
do _N_=1 to _N_;
set have;
group = item;
output;
end;
run;

How can I add observations to the existing dataset based on dates?

I have a dataset like this:
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
run;
I would like to add observations with dates from the maximum date (hence from 30JUN2019) until 31DEC2019 (by months) with the value of index being the last available value: 14. How can I achieve this in SAS? I want the code to be flexible, thus for every such dataset, take the maximum of date and add monthly observations from that maximum until DEC2019 with the value of index being equal to the last available value (here in the example the value in JUN2019).
An explicit DO loop over the SET provides the foundation for a concise solution with no extraneous worker variables. The END= variable last is temporary, so it is never written to the output data set.
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
data want;
do until (last);
set have end=last;
output;
end;
do last = month(date) to 11; %* repurpose automatic variable last as a loop index;
date = intnx ('month',date,1,'e');
output;
end;
run;
It is always helpful to refresh understanding. From the SET statement options documentation:
END=variable
creates and names a temporary variable that contains an end-of-file indicator. The variable, which is initialized to zero, is set to 1 when SET reads the last observation of the last data set listed. This variable is not added to any new data set.
You can do it using the END= option in the SET statement together with a RETAIN statement.
data want(drop=i tIndex tDate);
set have end=eof;
retain tIndex tDate;
if eof then do;
tIndex=Index;
tDate=Date;
end;
output;
if eof then do;
do i=1 to 12-month(tDate);
index=tIndex;
date = intnx('month',tDate,i,'e');
output;
end;
end;
run;
INPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
+-----------+-------+
OUTPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
| 31JUL2019 | 14 |
| 31AUG2019 | 14 |
| 30SEP2019 | 14 |
| 31OCT2019 | 14 |
| 30NOV2019 | 14 |
| 31DEC2019 | 14 |
+-----------+-------+

adding rows given a certain condition

I have a database with 3 columns. ID, Date and amount. It is ordered by ID and Date. All I want to do is to add a row after the latest occurrence of every ID with the same ID, Date = Date + 1 Month and Amount = 0.
As an Illustration I want to go from this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
B | 01FEB| 0 |
B | 01MAR| 1 |
to this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
A | 01MAR| 0 | <- ADD THIS ROW
B | 01FEB| 0 |
B | 01MAR| 1 |
B | 01APR| 0 |<- ADD THIS ROW
I know I should use intnx, but beyond that I don't really know what to do. I appreciate any input.
Assuming that the DATE variable has actual date values in it you just need to output twice on the last observation in each group.
data want;
set have;
by id;
output;
if last.id then do;
date=intnx('month',date,1,'b');
amount=0;
output;
end;
run;
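To try the step above, the illustration needs real SAS date values. This is a minimal sketch of the input, assuming the year 2019, since the question shows only day and month.

```sas
/* Sketch: the question's illustration with real date values. */
/* The year 2019 is an assumption, not given in the question.  */
data have;
  input id $ date :date9. amount;
  format date date9.;
  datalines;
A 01JAN2019 1
A 01FEB2019 1
B 01FEB2019 0
B 01MAR2019 1
;
run;
```

The data are already in id order, as the BY statement requires. If the dates do not always fall on the first of the month, the 's' alignment in intnx keeps the same day of the month where possible, whereas 'b' always moves to the first day.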

Removing observations before 'beginning' and after 'ending' - SAS code

My table has some leading and trailing observations that I am trying to remove. I want to remove the rows that come before every 'begin' event and after every 'end' event for every single group. The table resembles the below:
| Time | Group | Event | Value |
| 1 | 1 | NA | 0 |
| 2 | 1 | NA | 0 |
| 3 | 1 | Begin | 1.1 |
| 4 | 1 | NA | 1.2 |
| 5 | 1 | NA | 1.3 |
| 6 | 1 | End | 1.4 |
| 7 | 1 | NA | 0 |
| 1 | 2 | NA | 0 |
| 2 | 2 | Begin | 1.1 |
| 3 | 2 | NA | 1.2 |
| 4 | 2 | End | 1.3 |
| 5 | 2 | NA | 1.4 |
On the presumption that the incoming data is already sorted and that there are zero or more serially bounded ranges of Begin to End within each group:
data want;
do until (last.group);
set have;
by group time;
if event = 'Begin' then _keeprow = 1;
if _keeprow then output;
if event = 'End' then _keeprow = 0;
end;
drop _keeprow;
run;
I came up with an easy way, but it is limited by the actual data size.
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
run;
proc sort data = have;
by group time;
run;
data have1;
set have;
count + 1;
by group;
if first.group then count = -100;
if event = 'Begin' then count = 0;
if event = 'End' then count = 100;
if count < 0 or count >100 then delete;
run;
This code works for small data sets, provided each group has fewer than 100 observations between 'Begin' and 'End' and fewer than 100 observations before 'Begin'. You can adjust the initial count value to suit the actual data size.
One way to do it is:
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
data have2(keep= Group min_var max_var);
set have;
by group;
retain min_var max_var;
if trim(Event)= "Begin" then min_var =_n_ ;
if trim(Event)= "End" then max_var =_n_;
if last.group;
run;
data want;
merge have have2;
by group;
if _n_ ge min_var and _n_ le max_var ;
drop min_var max_var;
run;

PROC TABULATE WITH TOTAL

I am building reports with proc tabulate, but I am unable to add a total to a report.
Example
+--------+------+----------+--------+---+---+---+
| Shop | Year | Month | Family | A | B | C |
+--------+------+----------+--------+---+---+---+
| raoas | 2006 | january | TA12 | 5 | 6 | 0 |
| taba | 2008 | january | TS01 | 0 | 1 | 1 |
| suptop | 2008 | april | TZ05 | 0 | 0 | 1 |
| taba | 2006 | December | TA12 | 5 | 6 | 0 |
| raoas | 2008 | january | TA15 | 0 | 2 | 0 |
| sup | 2008 | april | TQ05 | 0 | 1 | 1 |
+--------+------+----------+--------+---+---+---+
code
proc tabulate data=REPORTDATA_T6 format=12.;
CLASS YEAR;
var A C;
table (A C)*SUM='',YEAR=''
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;
output
YEAR 2006 2008 2009
A 800 766 813
C 854 832 812
I tried table (A C)*sum, year all; this sums across all the years, but I want a total by year.
I tried every way I could think of, including table (A C)*sum all, year; this gives the number of observations (N) instead. Thanks, Jon Clements, but I don't want to add a TOTAL variable to the table. This is sample data; the real table has more than 10 variables and I sometimes need to change them, so I don't want to add a new total variable every time.
I'm not sure it's possible to do what you want in one step using only the original data. The keyword ALL only sums across the categories of CLASS variables, but you want to sum two different variables.
It is easy enough with an interim step, though: create a dataset in which the A, B and C variables become categories of a single variable:
data REPORTDATA_T6;
input Shop $ Year Month $ Family $ A B C;
datalines;
raoas 2006 january TA12 5 6 0
taba 2008 january TS01 0 1 1
suptop 2008 april TZ05 0 0 1
taba 2006 December TA12 5 6 0
raoas 2008 january TA15 0 2 0
sup 2008 april TQ05 0 1 1
;
run;
proc sort data=REPORTDATA_T6; by Shop Year Month Family; run;
proc transpose data=REPORTDATA_T6 out=REPORTDATA_T6_long;
var A B C;
by Shop Year Month Family;
run;
proc tabulate data=REPORTDATA_T6_long;
class _NAME_ YEAR;
var COL1;
table (_NAME_ all)*COL1=' '*SUM=' ', YEAR=' '
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;