Inserting subtotals on demand in PROC REPORT - sas

I have this dataset (a made-up example, but in the same structure):
data have;
infile datalines delimiter=',';
length country city measure $50.;
input country $ city $ level measure $ mdate total;
informat mdate date9.;
format mdate date9.;
datalines;
England,London,1,Red doors opened,24MAR2014,4
England,London,1,Green doors opened,24MAR2014,6
England,London,2,Doors closed,24MAR2014,7
England,London,1,Red doors opened,25MAR2014,5
England,London,1,Blue doors opened,25MAR2014,4
England,London,1,Green doors opened,25MAR2014,3
England,London,2,Doors closed,25MAR2014,6
England,Manchester,1,Red doors opened,24MAR2014,3
England,Manchester,2,Doors closed,24MAR2014,1
England,Manchester,2,Doors closed,25MAR2014,4
Scotland,Glasgow,1,Red doors opened,24MAR2014,4
Scotland,Glasgow,1,Red doors opened,25MAR2014,3
Scotland,Glasgow,1,Green doors opened,25MAR2014,2
Scotland,Glasgow,2,Doors closed,25MAR2014,4
;;;;
run;
I want to output the 'doors opened' per country/city per day, then subtotal the doors opened, then output the doors closed, then subtract the doors opened from the doors closed to give a 'balance' (per country/city). At the end of each country, I want one line summing the balance (per day) for each country.
So the above would give something like:
Country  + City       + Measure            + 24MAR2014 + 25MAR2014
---------+------------+--------------------+-----------+----------
England  + London     + Red doors opened   +         4 +         5
         +            + Green doors opened +         6 +         3
         +            + Blue doors opened  +         . +         4
         +            + TOTAL DOORS OPENED +        10 +        12
         +            + Doors closed       +         7 +         6
         +            + BALANCE            +        -3 +        -6
         + Manchester + Red doors opened   +         3 +         .
         +            + TOTAL DOORS OPENED +         3 +         .
         +            + Doors closed       +         1 +         4
         +            + BALANCE            +        -2 +         4
         + ALL        + BALANCE            +        -5 +        -2
Scotland + Glasgow    + Red doors opened   +         4 +         3
         +            + Green doors opened +         . +         2
         +            + TOTAL DOORS OPENED +         4 +         5
         +            + Doors closed       +         . +         4
         +            + BALANCE            +        -4 +        -1
         + ALL        + BALANCE            +        -4 +        -1
I've deliberately left it so not every measure appears for each instance and the Doors Closed total is sometimes missing. The rows in CAPS are those I want to add with PROC REPORT, i.e. not in the original data.
I've got the basic layout using PROC REPORT, but don't really have an idea where to go to start inserting subtotals on demand. I've added a 'level' variable, to try and give me something to order/group on.
I need one country per output page and the rows kept in that order per grouping, i.e. XXX Doors Opened, TOTAL DOORS OPENED, Doors Closed, BALANCE, so I think maybe the extra columns are needed.
So far, this is what I have done:
proc report data=have out=proc;
by country;
columns city level measure mdate,total;
define city / group;
define level / group noprint;
define measure / group;
define mdate / across;
define total / analysis sum;
compute before level;
endcomp;
compute after level;
if level = 2 and break = '_level_' then do;
measure = 'TOTAL DOORS OPENED';
end;
endcomp;
run;
I know I should be able to do something using the level variable, so I've added some compute blocks before and after it and examined the output dataset. I've tried to add a value of 'TOTAL DOORS OPENED', but this isn't working.
To be honest, I've only just started using PROC REPORT, so this is a bit out of my comfort zone.
Thanks for any help. Please let me know if the question isn't clear.

Sometimes (often, in my field of work) it is better to treat PROC REPORT as a fancy PROC PRINT and do your calculations in the dataset.
I would add a variable like TYPE denoting whether the entry is about opened or closed doors, then calculate the sums by country/city/level/type/day. I would also duplicate all observations with level=3 (meaning BALANCE in your table), negate the totals where TYPE=opened so that the sum comes out as closed minus opened, as in your table, and then calculate the sums by country/city/day. Then stack all the results together in one dataset with proper keys and transpose with ID=day. PROC REPORT can take it from there. Do not trust COMPUTE blocks too much; they are often useful but hell to debug. Just build a dataset that looks like your desired table and throw it at REPORT.
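A rough sketch of that idea, not a tested solution: it uses the existing level variable in place of a separate TYPE variable, and lets the ACROSS usage in PROC REPORT pivot the dates instead of a transpose step. The names long, ord and city_ord are hypothetical helpers introduced only for this illustration.

```sas
/* Build every row of the desired table, including the subtotal rows.   */
/* ord and city_ord are made-up sort keys: ord places the added rows    */
/* between the detail rows, city_ord pushes the 'ALL' row to the end.   */
proc sql;
  create table long as
  select country, city, 1 as city_ord, level as ord, measure, mdate, total
    from have
  union all
  select country, city, 1 as city_ord, 1.5 as ord,
         'TOTAL DOORS OPENED' as measure, mdate, sum(total) as total
    from have
    where level = 1
    group by country, city, mdate
  union all
  /* balance = doors closed minus doors opened, per city and day */
  select country, city, 1 as city_ord, 3 as ord,
         'BALANCE' as measure, mdate,
         sum(case when level = 2 then total else -total end) as total
    from have
    group by country, city, mdate
  union all
  /* the same balance summed over all cities of a country */
  select country, 'ALL' as city, 2 as city_ord, 3 as ord,
         'BALANCE' as measure, mdate,
         sum(case when level = 2 then total else -total end) as total
    from have
    group by country, mdate
  order by country, city_ord, city, ord;
quit;

/* PROC REPORT now only lays the rows out; no COMPUTE blocks needed. */
proc report data=long;
  by country;
  columns city_ord city ord measure mdate,total;
  define city_ord / group noprint;
  define city     / group;
  define ord      / group noprint;
  define measure  / group;
  define mdate    / across;
  define total    / analysis sum;
run;
```

Within each ord value the measures come out in the default formatted (alphabetical) order, so a further sort key would be needed if the Red/Green/Blue order matters.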

SAS-How to count the number of observation over the 10 years prior to certain month

I have a sample that includes two variables: ID and ym. ID is the specific ID for each trader and ym is the year-month variable. I want to create a variable that shows the number of observations in the 10-year period prior to month t, as shown in the Want column below.
ID ym Want
1 200101 0
1 200301 1
1 200401 2
1 200501 3
1 200601 4
1 200801 5
1 201201 5
1 201501 4
2 200001 0
2 200203 1
2 200401 2
2 200506 3
I attempted to use a BY statement and first.id to count the number.
data want;
set have;
want+1;
by id;
if first.id then want=1;
run;
However, the years in ym are not continuous, and when the time gap is more than 10 years this method does not work. I assume I need to count the observations in a rolling 10-year window, but I am not sure how to achieve it. Please give me some suggestions. Thanks.
Just do a self join in SQL. With your coding of YM it is easy to handle an interval that is a multiple of a year, but harder to handle other intervals.
proc sql;
create table want as
select a.id,a.ym,count(b.ym) as want
from have a
left join have b
on a.id = b.id
and (a.ym - 1000) <= b.ym < a.ym
group by a.id,a.ym
order by a.id,a.ym
;
quit;
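For an interval that is not a whole number of years, one possible variation (a sketch, assuming ym is always a numeric YYYYMM value) is to convert both sides to a running month count inside the join condition. The table name want_months and the 120-month window are only illustrative; with 120 it should select the same rows as the query above.

```sas
proc sql;
  create table want_months as
  select a.id, a.ym, count(b.ym) as want
  from have a
  left join have b
    on a.id = b.id
   and b.ym < a.ym
   /* months elapsed between b.ym and a.ym, using year*12 + month */
   and (12 * floor(a.ym / 100) + mod(a.ym, 100))
     - (12 * floor(b.ym / 100) + mod(b.ym, 100)) <= 120
  group by a.id, a.ym
  order by a.id, a.ym
  ;
quit;
```

Changing 120 to, say, 18 would count observations in the prior 18 months, which the simple `a.ym - 1000` arithmetic cannot express.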
An alternative is a data step that retains the previous values for each ID and directly checks how many are within 120 months of the current value. It is not optimized but it works. You can size the array m() to the maximum number of values you have per ID if you care about efficiency.
The variable d is a quick shorthand I often use which converts years/months into an integer value - so
200012 -> (2000*12) + 12 = 24012
200101 -> (2001*12) + 1 = 24013
time from 200012 to 200101 = 24013 - 24012 = 1 month
data have;
input id ym;
datalines;
1 200101
1 200301
1 200401
1 200501
1 200601
1 200801
1 201201
1 201501
2 200001
2 200203
2 200401
2 200506
;
proc sort data=have;
by id ym;
run;
data want (keep=id ym want);
set have;
by id;
retain seq m1-m100;
array m(100) m1-m100;
** Convert date to comparable value **;
d = 12 * floor(ym/100) + mod(ym,100);
** Initialize number of previous records **;
want = 0;
** If first record, set retained values to missing and leave want=0 **;
if first.id then call missing(seq,of m1-m100);
** Otherwise loop through previous months and count how many were within 120 months **;
else do;
do i = 1 to seq;
if d <= (m(i) + 120) then want = want + 1;
end;
end;
** Increment variables for next iteration **;
seq + 1;
m(seq) = d;
run;
proc print data=want noobs;
run;

Multiple operations on a single value in SAS?

I'm trying to create a column that applies two different interest rates based on each customer's cumulative purchases. I was thinking that I'd need to use a do while statement, but I'm not entirely sure.
This is what I have so far, but I don't know how to get it to perform two operations on one value, such that it applies one interest rate up to, say, 4000, and then applies the other interest rate to the amount above 4000.
data cards;
set sortedccards;
by Cust_ID;
if first.Cust_ID then cp=0;
cp+Purchase;
if cp<=4000 then cb=(cp*.2);
if cp>4000 then cb=(cp*.2)+(cp*.1);
format cp cb dollar10.2;
run;
What I'd like my output to look like.
You will also want to track the prior cumulative purchase in order to detect when a purchase causes the cumulative total to cross the threshold (or breakpoint) of $4,000. Purchases that cross the breakpoint are split into pre and post portions, which earn different bonus rates.
Example:
Program flow causes retained variable pcp to act like a LAGged variable.
data have;
input id $ p;
datalines;
C001 1000
C001 2300
C001 2000
C001 1500
C001 800
C002 6200
C002 800
C002 300
C003 2200
C003 1700
C003 2500
C003 600
;
data want;
set have;
by id;
if first.id then do;
cp = 0;
pcp = 0; retain pcp; /* prior cumulative purchase */
end;
cp + p; /* sum statement causes cp to be implicitly retained */
* break point is 4,000;
if (cp > 4000 and pcp > 4000) then do;
* entire purchase in post breakpoint territory;
b = 0.01 * p;
end;
else
if (cp > 4000) then do;
* split purchase into pre and post breakpoint portions;
b = 0.10 * (4000 - pcp) + 0.01 * (p - (4000 - pcp));
end;
else do;
* entire purchase in pre breakpoint territory;
b = 0.10 * p;
end;
* update prior for next implicit iteration;
pcp = cp;
run;
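The three branches can also be collapsed into one expression, shown here only as a sketch. It assumes purchases are never negative, and the variable names pre and post and the dataset name want2 are made up for the illustration; the rates and the 4,000 breakpoint are the ones used above.

```sas
data want2;
  set have;
  by id;
  retain pcp;                    /* prior cumulative purchase */
  if first.id then do;
    cp = 0;
    pcp = 0;
  end;
  cp + p;                        /* sum statement, cp is implicitly retained */
  /* part of this purchase that falls at or below the 4,000 breakpoint */
  pre  = max(0, min(cp, 4000) - min(pcp, 4000));
  /* remainder of this purchase, above the breakpoint */
  post = p - pre;
  b = 0.10 * pre + 0.01 * post;
  pcp = cp;                      /* update prior for next iteration */
run;
```

For a purchase of 2000 arriving when pcp is 3300, pre is 700 and post is 1300, which gives the same bonus as the split branch above.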
Here is a fairly straightforward solution which is not optimized but works. We calculate the cumulative purchases and cumulative bonus at each step (which can be done quite simply), and then calculate the current period bonus as cumulative bonus minus previous cumulative bonus.
This is assuming that the percentage is 20% up to $4000 and 30% over $4000.
data have;
input id $ period MMDDYY10. purchase;
datalines;
C001 01/25/2019 1000
C001 02/25/2019 2300
C001 03/25/2019 2000
C001 04/25/2019 1500
C001 05/25/2019 800
C002 03/25/2019 6200
C002 04/25/2019 800
C002 05/25/2019 300
C003 02/25/2019 2200
C003 03/25/2019 1700
C003 04/25/2019 2500
C003 05/25/2019 600
;
run;
data want (drop=cumul_bonus);
set have;
by id;
retain cumul_purchase cumul_bonus;
if first.id then call missing(cumul_purchase,cumul_bonus);
** Calculate total cumulative purchase including current purchase **;
cumul_purchase + purchase;
** Calculate total cumulative bonus including current purchase **;
cumul_bonus = (0.2 * cumul_purchase) + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
** Bonus for current purchase = total cumulative bonus - previous cumulative bonus **;
bonus = ifn(first.id,cumul_bonus,dif(cumul_bonus));
format period MMDDYY10.
purchase cumul_purchase bonus DOLLAR10.2
;
run;
proc print data=want;
run;

SAS for subsetting data

Consider the following example SAS dataset:
Price Num_items
100 10
120 15
130 20
140 25
150 30
I want to group them into 4 categories by defining a new variable called cat such that the new dataset looks as follows:
Price Num_items Cat
100 10 1
120 15 1
130 20 2
140 25 3
150 30 4
I also want to group them so that each group has about the same number of items (for example, in the grouping above, Group 1 has 25, Group 2 has 20, Group 3 has 25 and Group 4 has 30 observations). Note that the price column is sorted in ascending order (that is required).
I am struggling to start with SAS for above. So any help would be appreciated. I am not looking for a complete solution but pointers towards preparing a solution would help.
Cool problem, subtly complex. I agree with @J_Lard that a data step with some retained variables would likely be the quickest way to accomplish this. If I understand your problem correctly, I think the code below should give you some ideas about how to solve it. Note that, depending on num_items and group_target, your mileage will vary.
Generate a similar, but larger, data set.
data have;
do price=50 to 250 by 10;
/*Seed is `_N_` so we'll see the same random item count.*/
num_items = ceil(ranuni(_N_)*10)*5;
output;
end;
run;
Categorize.
/*Desired group size specification.*/
%let group_target = 50;
data want;
set have;
/*On the first record, initialize `cat` to 1 and `cat_num_items` to that record's `num_items`, with implicit retainment*/
if _N_=1 then do;
cat + 1;
cat_num_items + num_items;
end;
else do;
/*If the item count for a new price puts the category count above the target, apply logic.*/
if cat_num_items + num_items > &group_target. then do;
/*If placing the item into a new category puts the current cat count closer to the `group_target` than would keeping it, then put into new category.*/
if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
cat+1;
cat_num_items = num_items;
end;
/*Otherwise keep it in the current category and increment category count.*/
else cat_num_items + num_items;
end;
/*Otherwise keep the item count in the current category and increment category count.*/
else cat_num_items + num_items;
end;
drop cat_num_items;
run;
Check.
proc sql;
create table check_want as
select cat,
sum(num_items) as cat_count
from want
group by cat;
quit;

Generating panel data in Stata

How can I generate panel data in Stata?
I would like each individual to be affected by unobserved heterogeneity.
For example, I want the DGP (data generating process) to be something like:
Wages_{it}= \beta (Labor market experience_{it}) + \alpha_{i} + \epsilon_{it},
where \alpha_{i} is the unobserved heterogeneity and where \epsilon_{it} is the error term which is normally distributed.
Finally, I would like (Labor market experience_{it}) to be an AR(1) process, e.g.:
Labor market experience_{it}= 0.8 * (Labor market experience_{i,t-1}) + v_{it},
where v_{it} is the error term which is normally distributed.
You can do something like this by using subscripting combined with bysort:
clear
set seed 10011979
set obs 4 // Set the number of panels (N)
gen id = _n
gen alpha = rnormal(0,1)
expand 3 // Set the number of periods (T)
bys id: gen t=_n
xtset id t
bysort id (t): gen lme = rnormal(0,1) + rnormal(0,1) if _n==1
bysort id (t): replace lme = .8 * lme[_n-1] + rnormal(0,1) if _n!=1
gen w = 3 * lme + alpha + rnormal(0,1)
drop alpha

Which months are included in a date range?

I have a dataset with from and to dates of registration for a group of users. I would like to programmatically find which months lie in between those dates for each user, without having to hard code in any months, etc. I only want a summary of numbers registered in each month, so if that makes it quicker, so much the better.
E.g. I have something like
User-+-From-------+-To-----------------
A + 11JAN2011 + 15MAR2011
A + 16JUN2011 + 17AUG2011
B + 10FEB2011 + 12FEB2011
C + 01AUG2011 + 05AUG2011
And I want something like
Month---+-Registrations
JAN2011 + 1 (A)
FEB2011 + 2 (AB)
MAR2011 + 1 (A)
APR2011 + 0
MAY2011 + 0
JUN2011 + 1 (A)
JUL2011 + 1 (A)
AUG2011 + 2 (AC)
Note I don't need the bit in brackets; that was just to try and clarify my point.
Thanks for any help.
One easy way is to construct an intermediate dataset and then run PROC FREQ on it.
data have;
informat from to DATE9.;
format from to DATE9.;
input user $ from to;
datalines;
A 11JAN2011 15MAR2011
A 16JUN2011 17AUG2011
B 10FEB2011 12FEB2011
C 01AUG2011 05AUG2011
;;;;
run;
data int;
set have;
_mths=intck('month',from,to,'d'); *number of months after the current one (0=current one). 'd'=discrete=count 1st of month as new month;
do _i = 0 to _mths; *start with current month, iterate over months;
month = intnx('month',from,_i,'b');
output;
end;
format month MONYY7.;
run;
proc freq data=int;
tables month/out=want(keep=month count rename=count=registrations);
run;
You can eliminate the _mths step by doing that in the do loop.
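A sketch of that shortcut, using the same from and to variables: the stop value of an iterative DO loop is evaluated once when the loop starts, so the INTCK call can sit directly in the DO statement. The `drop _i;` line is an addition that was not in the step above.

```sas
data int;
  set have;
  /* 0 = the month of FROM; the upper bound is computed once per row */
  do _i = 0 to intck('month', from, to, 'd');
    month = intnx('month', from, _i, 'b');
    output;
  end;
  format month MONYY7.;
  drop _i;
run;
```

The PROC FREQ step then runs unchanged on this version of int.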