I have customer-level data with each customer's pre-Covid, in-Covid and post-Covid balances. The data looks something like this:
Accountid  Covid Flag  Balance
123        Pre-Covid   100
123        In-Covid    200
123        Post-Covid  400
I need to create a new column with the % difference between these covid flags: from the pre-covid balance to the in-covid balance (row 1 to 2), from in-covid to post-covid (row 2 to 3), and finally from pre-covid to post-covid (row 1 to 3).
The final data should look something like this:
Accountid  COVID FLAG         % Difference
123        pre to in Covid    100%
123        In to Post Covid   100%
123        pre to Post Covid  300%
How do I create the % difference column and the new covid flag?
I can only think of the LAG function to do this. I can use LAG for rows 1 to 2 and 2 to 3, but how do I do this for rows 1 to 3?
Since there are only three values, we can use some simple data step logic to store all of our values of interest into temporary variables as we find them, then output them one at a time at the last row of each account ID. To illustrate this, here is what the background calculations look like as we read row-by-row:
accountid covid_flag balance pre_covid in_covid post_covid pct_diff
123 Pre-Covid 100 100 . . .
123 In-Covid 200 100 200 . .
123 Post-Covid 400 100 200 400 .
----------------------------------------------------------------------------------------
Point where we output and calculate % diff
----------------------------------------------------------------------------------------
123 pre to in Covid 400 100 200 400 100%
123 In to Post Covid 400 100 200 400 100%
123 pre to Post Covid 400 100 200 400 300%
Here's how this code looks:
data want;
set have;
by accountid;
/* Temporary variables to hold the balance found in each row */
retain pre_covid in_covid post_covid;
/* Reset temporary variables at the start of each account ID */
if(first.accountid) then call missing(pre_covid, in_covid, post_covid);
/* Save each covid flag balance to temporary variables */
select(upcase(covid_flag) );
when('PRE-COVID') pre_covid = balance;
when('IN-COVID') in_covid = balance;
when('POST-COVID') post_covid = balance;
end;
/* Uncomment to view intermediate steps */
/* output;*/
/* At the last row of each account ID, calculate the differences and output one row for each */
if(last.accountid) then do;
covid_flag = 'pre to in Covid';
pct_diff = (in_covid - pre_covid)/pre_covid;
output;
covid_flag = 'In to Post Covid';
pct_diff = (post_covid - in_covid)/in_covid;
output;
covid_flag = 'pre to Post Covid';
pct_diff = (post_covid - pre_covid)/pre_covid;
output;
end;
format pct_diff percent8.;
run;
I have a column for dollar-amount that I need to break apart into $1000 segments - so $0-$999, $1,000-$1,999, etc.
I could use Case/When, but there are an awful lot of groups I would have to make.
Is there a more efficient way to do this?
Thanks!
You could just use arithmetic. For example you could convert them to upper limit of the $1,000 range.
up_to = 1000*ceil(dollar/1000);
Let's make up some example data:
data test;
do dollar=0 to 5000 by 500 ;
up_to = 1000*ceil(dollar/1000);
output;
end;
run;
Results:
Obs dollar up_to
1 0 0
2 500 1000
3 1000 1000
4 1500 2000
5 2000 2000
6 2500 3000
7 3000 3000
8 3500 4000
9 4000 4000
10 4500 5000
11 5000 5000
Absolutely. This is a great use case for user-defined formats.
proc format;
value segment
0-<1000 = '0-1000'
1000-<2000 = '1000s'
2000-<3000 = '2000s'
;
quit;
If there are too many ranges to write out by hand, generate them with code!
data segments;
retain
fmtname 'segment'
type 'n' /* numeric format */
eexcl 'Y' /* exclude the "end" match, so 0-1000 excluding 1000 itself */
;
do start = 0 to 1e6 by 1000;
end = start + 1000;
label = catx('- <',start,end); * what you want this to show up as;
output;
end;
run;
proc format cntlin=segments;
quit;
Then you can use segment = put(dollaramt,segment.); to assign the value of segment, or just apply the format with format dollaramt segment.; if you're only using it in PROC SUMMARY or the like.
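As a rough illustration of those two options, here is a minimal sketch. It assumes the segment format has already been created by the PROC FORMAT step above; the data set name sales and its values are made up for the example.

```sas
/* Hypothetical input: a data set SALES with a numeric column DOLLARAMT */
data sales;
  input dollaramt;
  datalines;
250
1000
1999
2500
;
run;

/* Option 1: store the bin label in a new character column */
data sales_binned;
  set sales;
  segment = put(dollaramt, segment.);
run;

/* Option 2: leave the data alone and let the format do the grouping */
proc freq data=sales;
  tables dollaramt;
  format dollaramt segment.;
run;
```

Option 2 avoids creating another copy of the data, because procedures that build groups use the formatted values.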
And you can combine the two approaches above to generate a User Defined Format that will bin the amounts for you.
1. Create bins to set up a user defined format. One drawback of this method is that it requires you to know the range of the data ahead of time.
2. Use a user defined function via PROC FCMP.
3. Use a manual calculation.
I illustrate versions of solutions 1 and 3 below. #2 requires PROC FCMP, but I think a plain data step can be simpler.
data thousands_format;
fmtname = 'thousands_fmt';
type = 'N';
do Start = 0 to 10000 by 1000;
END = Start + 1000 - 1;
label = catx(" - ", put(start, dollar12.0), put(end, dollar12.0));
output;
end;
run;
proc format cntlin=thousands_format;
run;
data demo;
do i=100 to 10000 by 50;
custom_format = put(i, thousands_fmt.);
manual_format = catx(" - ", put(floor(i/1000)*1000, dollar12.0), put(floor(i/1000)*1000+999, dollar12.0)); /* upper bound from floor, so exact multiples of 1000 are labelled correctly */
output;
end;
run;
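For completeness, option #2 could look roughly like the sketch below. The function name thousands_bin, the storage location work.funcs.bins and the data set demo_fcmp are illustrative choices, not something from the question.

```sas
/* Compile the function into a hypothetical location WORK.FUNCS.BINS */
proc fcmp outlib=work.funcs.bins;
  /* thousands_bin is an illustrative name; it returns a label of up to 40 characters */
  function thousands_bin(amount) $ 40;
    length bin_label $ 40;
    lower = floor(amount / 1000) * 1000;   /* start of the $1,000 range */
    upper = lower + 999;                   /* end of the $1,000 range   */
    bin_label = catx(' - ', put(lower, dollar12.0), put(upper, dollar12.0));
    return (bin_label);
  endsub;
run;

/* Tell the data step where to look for compiled functions */
options cmplib=work.funcs;

data demo_fcmp;
  length bin $ 40;
  do i = 100 to 10000 by 50;
    bin = thousands_bin(i);
    output;
  end;
run;
```

Unlike the format, the function does not need the range of the data in advance, at the cost of an extra compile step.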
I'm trying to create a column that applies two different interest rates based on each customer's cumulative purchases. I was thinking that I'd need to use a do while statement, but I'm not entirely sure. :S
This is what I have so far, but I don't know how to get it to perform two operations on one value, so that it applies one interest rate up to, say, 4000, and then applies the other interest rate to the rest above 4000.
data cards;
set sortedccards;
by Cust_ID;
if first.Cust_ID then cp=0;
cp+Purchase;
if cp<=4000 then cb=(cp*.2);
if cp>4000 then cb=(cp*.2)+(cp*.1);
format cp dollar10.2 cp dollar10.2;
run;
What I'd like my output to look like.
You will want to also track the prior cumulative purchase in order to detect when a purchase causes the cumulative to cross the threshold (or breakpoint) $4,000. Breakpoint crossing purchases would be split into pre and post portions for different bonus rates.
Example:
Program flow causes retained variable pcp to act like a LAGged variable.
data have;
input id $ p;
datalines;
C001 1000
C001 2300
C001 2000
C001 1500
C001 800
C002 6200
C002 800
C002 300
C003 2200
C003 1700
C003 2500
C003 600
;
data want;
set have;
by id;
if first.id then do;
cp = 0;
pcp = 0; retain pcp; /* prior cumulative purchase */
end;
cp + p; /* sum statement causes cp to be implicitly retained */
* break point is 4,000;
if (cp > 4000 and pcp > 4000) then do;
* entire purchase in post breakpoint territory;
b = 0.01 * p;
end;
else
if (cp > 4000) then do;
* split purchase into pre and post breakpoint portions;
b = 0.10 * (4000 - pcp) + 0.01 * (p - (4000 - pcp));
end;
else do;
* entire purchase in pre breakpoint territory;
b = 0.10 * p;
end;
* update prior for next implicit iteration;
pcp = cp;
run;
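To make the split case concrete, here is a small worked check using C001's third purchase from the example data (prior cumulative 3300, purchase 2000). It uses the same 10% and 1% rates as the step above; the variable names pre and post are only for illustration.

```sas
data _null_;
  pcp = 3300;              /* cumulative purchases before this row (1000 + 2300) */
  p   = 2000;              /* the purchase that crosses the 4,000 breakpoint     */
  pre  = 4000 - pcp;       /* portion below the breakpoint                       */
  post = p - pre;          /* portion above the breakpoint                       */
  b = 0.10 * pre + 0.01 * post;
  put pre= post= b=;       /* writes pre=700 post=1300 b=83 to the log */
run;
```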
Here is a fairly straightforward solution which is not optimized but works. We calculate the cumulative purchases and cumulative bonus at each step (which can be done quite simply), and then calculate the current period bonus as cumulative bonus minus previous cumulative bonus.
This is assuming that the percentage is 20% up to $4000 and 30% over $4000.
data have;
input id $ period MMDDYY10. purchase;
datalines;
C001 01/25/2019 1000
C001 02/25/2019 2300
C001 03/25/2019 2000
C001 04/25/2019 1500
C001 05/25/2019 800
C002 03/25/2019 6200
C002 04/25/2019 800
C002 05/25/2019 300
C003 02/25/2019 2200
C003 03/25/2019 1700
C003 04/25/2019 2500
C003 05/25/2019 600
;
run;
data want (drop=cumul_bonus);
set have;
by id;
retain cumul_purchase cumul_bonus;
if first.id then call missing(cumul_purchase,cumul_bonus);
** Calculate total cumulative purchase including current purchase **;
cumul_purchase + purchase;
** Calculate total cumulative bonus including current purchase **;
cumul_bonus = (0.2 * cumul_purchase) + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
** Bonus for current purchase = total cumulative bonus - previous cumulative bonus **;
bonus = ifn(first.id,cumul_bonus,dif(cumul_bonus));
format period MMDDYY10.
purchase cumul_purchase bonus DOLLAR10.2
;
run;
proc print data=want;
run;
I have a dataset that has weekly values stored by location. I want to determine how many times the value has changed. Initially I thought I could just count distinct values, but the issue is that sometimes the values are repeated. Consider the example below:
data have;
input location $ week value;
cards;
NC 1 100
NC 2 200
NC 3 200
NC 4 200
NC 5 100
NC 6 200
SC 1 500
SC 2 500
SC 3 500
SC 4 500
SC 5 500
SC 6 500
;
run;
Notice that the value at location NC changes three times, at weeks 2,5,6. The value at location SC changes 0 times.
I would like an output of the change frequency...something like:
NC 3
SC 0
Any help would be greatly appreciated. Thank you.
Use the NOTSORTED keyword on a BY statement and you can then count the number of FIRST. occurrences.
proc sort data=have;
by location week;
run;
data want;
set have;
by location value notsorted ;
if first.location then nchange=0;
else nchange + first.value;
if last.location;
keep location nchange ;
run;
Make sure the data is sorted. Your example is, but if not, sort it first:
proc sort data=have;
by location week;
run;
After that, use the BY statement inside the data step. This creates indicators (FIRST. and LAST.) that tell you when you are at the start and end of a BY group.
RETAIN keeps values from one row to the next.
data want;
set have;
by location;
retain last count;
if first.location then do;
count = 0;
last = value;
end;
if last ^= value then
count = count + 1;
last = value;
if last.location then
output;
run;
I need your help on developing a de-hoc query for hoc(range) data, below is an example of Shares Outstanding HOC:
ID StartDT EndDT SharesOutstanding
ABC 01-Jan-2010 03-Feb-2014 100
ABC 04-Feb-2014 03-Sep-2014 160
XYZ 01-Jan-2011 03-Mar-2012 52
XYZ 04-Mar-2012 09-Aug-2013 108
XYZ 10-Aug-2013 03-Sep-2014 120
Now I want to dehoc or break the above range data to per day. Below is the desired output:
ID Date Shares
ABC 01-Jan-2010 100
ABC 02-Jan-2010 100
ABC 03-Jan-2010 100
ABC 04-Jan-2010 100
ABC 05-Jan-2010 100
.......
ABC 03-Feb-2014 100
ABC 04-Feb-2014 160
....till 03-Sep-2014
I am using SAS code with PROC SQL, but that is very time consuming.
Need your help on this query at earliest
Thanks
Hitesh
This should be fairly easy with a data step and some do-loops.
data want(drop = StartDT EndDT i);
set have;
format date date9.;
do i = 0 to (EndDT-StartDT);
date = StartDT + i;
output;
end;
run;
Do you really want lots of repeated rows, though, or are you just interested in getting the difference of dates?
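If only the length of each range is needed, a sketch like the following would do, assuming StartDT and EndDT are numeric SAS date values as in the step above. The names span and n_days are made up for the example.

```sas
data span;
  set have;
  /* SAS dates are stored as day counts, so subtracting them gives days */
  n_days = EndDT - StartDT + 1;   /* +1 counts both endpoints, matching the rows the loop would output */
run;
```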
Wonder if you can help me
I’ve got a dataset where the value in a column is also the field name of a column. I want to be able to use the value of the column to call the applicable field in a formula.
For instance … I have columns…
MERCH_NO
V01
M02
V08
M08
AMOUNT
PLAN
A record would look like this …and what I want the calc field to do…
MERCH_NO V01 M02 V08 M08 AMOUNT PLAN CALC
123456 2 2 1 1 100.00 V01 value of V01 * AMOUNT
456789 4 4 4 4 250.00 M08 value of M08 * AMOUNT
If the PLAN field for a record says V01, then the value of the V01 column must be used in the CALC field. If the PLAN field says, M08, then the M08 value should be used. There are about 40 plans.
A static example of how to use the VVALUEX() function for that:
data result;
V01 = 2;
AMOUNT=100;
CALC = 'value of V01 * AMOUNT';
length arg1 arg2 $32;
arg1 = scan(compress(CALC, 'value of'), 1);
arg2 = scan(compress(CALC, 'value of'), 2);
put arg1 arg2;
result = input(VVALUEX(arg1), 16.) * input(VVALUEX(arg2), 16.);
run;
For your situation, you'd have to create logic to recognize all known patterns of CALC, as well as the types and formats of the variables (since VVALUEX() returns formatted values).
A dynamic approach but probably not suitable for lots of data is to generate the code for each row (see below).
Currently assumes a simple expression usable in IF .. THEN.
data input;
length CALC $50;
input V01 M08 AMOUNT CALC 9-58;
cards;
2 1 100 value of V01 * AMOUNT
2 4 100 value of M08 * AMOUNT
;
run;
/* code generation */
data _null_;
file 'mycalc.sas';
set input end=last;
length line $150;
if _N_=1 then do;
put 'data result;';
put ' set input;';
end;
line = 'if _N_ = ' || put(_N_, 8. -L) ||
' then RESULT = ' || compress(CALC, 'value of') || ';';
put line;
if last then put 'run;';
run;
%include 'mycalc.sas'; /* run the code */
OK, now I see that I didn't notice your note about the PLAN field. Please adjust as you need.
Vasja's approach is the correct one - here is that approach using the PLAN variable as you describe.
data have;
input MERCH_NO V01 M02 V08 M08 AMOUNT PLAN $;
calc = input(vvaluex(plan),best12.) * amount;
put calc=;
datalines;
123456 2 2 1 1 100.00 V01
456789 4 4 4 4 250.00 M08
;;;;
run;
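Because VVALUEX() returns the formatted value as text, another option is to look the column up through an array and VNAME(), which keeps the value numeric. This is only a sketch of an alternative, not the approach from the answers above; the data set name want is made up, and the array would need to list all of the roughly 40 plan columns.

```sas
data have;
  input MERCH_NO V01 M02 V08 M08 AMOUNT PLAN $;
  datalines;
123456 2 2 1 1 100.00 V01
456789 4 4 4 4 250.00 M08
;
run;

data want;
  set have;
  /* List every plan column here; only the four from the question are shown */
  array plans {*} V01 M02 V08 M08;
  do i = 1 to dim(plans);
    /* VNAME returns the name of the variable behind the array element */
    if upcase(vname(plans{i})) = upcase(PLAN) then calc = plans{i} * AMOUNT;
  end;
  drop i;
run;
```

With the two sample rows, calc comes out as 200 and 1000. If PLAN names a column that is not in the array, calc stays missing.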