I am not familiar with SAS base and macro language syntax ,my codes keep going wrong..can someone offer a piece of SAS macro code of my pseudocode.
1.create a macro array to store all the distinct variable in table Map_num;
select distinct variable:into numVarList separated by ' ' from Map_num;
quit;
2.for loop the macro array numVarList and for loop each value of each element
(1)pick up the ith element
(2)for loop all the value of the ith element,
(3)if the value of the customer (from customerScore table)is within the scale of "start" and "end",then update score=score+woe*beta
for example:
the customerScore table is:
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
| cst_id | A | B | C | D | E | F | G | H | I | J | K | score |
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
| 1 | 688567 | 873 | 134878 | 546546 | 3123 | 6 | 5345 | 768678 | 348957 | -921839 | -8217 | 0 |
| 2 | 3198 | 54667 | 9789867 | 53456756 | 78978 | 6456 | 645 | 534 | -219 | 13312 | 4543 | 0 |
| 3 | 35324 | 6456568 | 43 | 56756 | -8217 | 688567 | 873 | 134878 | 12 | 89173 | 213142 | 0 |
| 4 | 348957 | -921839 | -8217 | 5345 | 434534 | 3198 | 54667 | 9789867 | -8217 | -8217 | 8908102 | 0 |
| 5 | -219 | 13312 | 4543 | 4234 | 54667 | 35324 | 6456568 | 43 | 213142 | 213142 | 213 | 0 |
| 6 | 12 | 89173 | 213142 | 23234 | 348957 | -921839 | -8217 | 688567 | 873 | 134878 | 23424 | 0 |
| 7 | 688567 | 89173 | 213142 | -8217 | -219 | 13312 | 4543 | 3198 | 54667 | 9789867 | 3434 | 0 |
| 8 | 3198 | -8217 | 21313 | -8217 | 12 | 89173 | 213142 | 35324 | 6456568 | 43 | 3123 | 0 |
| 9 | 35324 | -8217 | 688567 | 688567 | 873 | 134878 | 688567 | 873 | 134878 | -8217 | 11 | 0 |
| 10 | 348957 | 89173 | 213142 | 3198 | 54667 | 9789867 | 3198 | 54667 | 9789867 | -8217 | 3198 | 0 |
| 11 | -219 | -921839 | -8217 | 35324 | 6456568 | 43 | 35324 | 6456568 | 43 | -921839 | -8217 | 0 |
| 12 | 12 | 13312 | 4543 | 89173 | 4234 | 3198 | 688567 | 873 | 134878 | 13312 | 4543 | 0 |
| 13 | 12 | 89173 | 213142 | 348957 | -921839 | -8217 | 3198 | 54667 | 9789867 | 89173 | 213142 | 0 |
| 14 | 2 | 89173 | 213142 | -219 | 13312 | 4543 | 35324 | 6456568 | 43 | 54667 | 4543 | 0 |
| 15 | 348957 | -921839 | -8217 | 12 | 89173 | 213142 | 13312 | 4543 | 89173 | 4234 | 4543 | 0 |
| 16 | -219 | 13312 | 35324 | 6456568 | 43 | 213142 | 89173 | 213142 | 348957 | -921839 | -8217 | 0 |
| 17 | 12 | 89173 | -921839 | -8217 | 688567 | 873 | 89173 | 213142 | -219 | 13312 | 4543 | 0 |
| 18 | 688567 | 873 | 13312 | 4543 | 3198 | 54667 | -921839 | -8217 | 12 | 89173 | 213142 | 0 |
| 19 | 3198 | 54667 | 9789867 | 688567 | 873 | 134878 | 43 | 213142 | 213142 | 213 | 9789867 | 0 |
| 20 | 35324 | 6456568 | 43 | 43 | 213142 | 213142 | 213 | 89173 | 4234 | 3198 | 688567 | 0 |
+--------+--------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+-------+
if table Map_num is below,then cst_id score is update:score=0+(-1.2)*3 + 2*3 + (0.1)*3 + 7*3
+----------+------------+------------+------+------+
| variable | start | end | woe | beta |
+----------+------------+------------+------+------+
| A | -999999999 | 57853 | -1 | 3 |
| A | 57853 | 89756 | -1.1 | 3 |
| A | 89756 | 897452 | -1.2 | 3 |
| A | 897452 | 9999999999 | -1.3 | 3 |
| B | -999999999 | 4235 | 2 | 3 |
| B | 4235 | 65785 | 3 | 3 |
| B | 65785 | 9999999999 | 4 | 3 |
| C | -999999999 | 9673 | 3.1 | 3 |
| C | 9673 | 75341 | 2.1 | 3 |
| C | 75341 | 98543 | 1.1 | 3 |
| C | 98543 | 567864 | 0.1 | 3 |
| C | 567864 | 9999999999 | -1 | 3 |
| D | -999999999 | 8376 | 5 | 3 |
| D | 8376 | 93847 | 6 | 3 |
| D | 93847 | 9999999999 | 7 | 3 |
+----------+------------+------------+------+------+
if table Map_num is below,then cst_id score is update:score=0+3*2 + 5*2 + 0*2 + 7*2 +3*2
+----------+------------+------------+-----+------+
| variable | start | end | woe | beta |
+----------+------------+------------+-----+------+
| E | -999999999 | 3 | 1 | 2 |
| E | 3 | 500000 | 3 | 2 |
| E | 500000 | 800000 | 2 | 2 |
| E | 800000 | 9999999999 | 4 | 2 |
| A | -999999999 | 6700 | 6 | 2 |
| A | 590000 | 680000 | 4 | 2 |
| A | 680000 | 9999999999 | 5 | 2 |
| C | -999999999 | 89678 | 9 | 2 |
| C | 89678 | 566757 | 0 | 2 |
| C | 566757 | 986785 | 2.8 | 2 |
| C | 986785 | 9999999999 | 1.1 | 2 |
| K | -999999999 | 7865 | 7 | 2 |
| K | 7865 | 25637 | 9 | 2 |
| K | 25637 | 65742 | 8 | 2 |
| K | 65742 | 9999999999 | 0.2 | 2 |
| B | -999999999 | 56753 | 3 | 2 |
| B | 56753 | 5465624 | 4 | 2 |
| B | 5465624 | 9999999999 | 1 | 2 |
+----------+------------+------------+-----+------+
thanks in advance!
table customerScore and Map_num are changing everyday for each rows and their column name:variable,start,end,woe,beta are not changed.I need to update the column score in table customerScore and the score is according to table Map_num.If the column A value in table customerScore is 688567 ,so it is 89756 <688567<897452,so the socre will be update:score=score+(-1.2 )* 3...is that clear for you?!
it is a nested loop using SAS macro as I comprehended.
Unfortunately the customerScore is not in a form that is readily aligned for a really simple SQL computation.
SQL way
One important aspect is to recognize the selection of map and woe for each score part from map_num can be done relatively easily in SQL, but processing the individual variables has to be 'coaxed' through macro
Consider only the variable A from the first map_num as a example case.
select (
map_num.woe * map_num.beta
from map_num
where map_num.variable="A"
and map_num.start < customerScore.A <= mapnum.end
) as A_contribution_to_score
from
customerScore
Now consider the B contribution that is added to the overall expression
select (
map_num.woe * map_num.beta
from map_num
where map_num.variable="A"
and map_num.start < customerScore.A <= mapnum.end
)
+
select (
map_num.woe * map_num.beta
from map_num
where map_num.variable="B"
and map_num.start < customerScore.B <= mapnum.end
)
from
customerScore
You should see that a macro could determine the distinct map_num values of variable to be used to construct a rather lengthy SQL expression that searches for the appropriate woe and beta product to apply to each row in customerScore.
The macro and SQL update statement could be something like
%macro updateScore (data=, map=);
%local i n_var;
proc sql noprint;
select distinct variable into :variable1- from ↦
%let N_var = &sqlobs;
update &data as OUTER
set score = score
%do I = 1 %to &N_var;
%let variable = &&variable&i;
+
( select
INNER.woe * INNER.beta
from &map as INNER
where INNER.variable="&variable"
and INNER.start < OUTER.&variable <= INNER.end
)
%end;
; /* end of update statement */
quit;
%mend;
%updateScore(data=customerScore, map=map_num)
Your data structure needs a bit of work if you want the score update made via a map_num to be reversible (i.e. capable of having an undo action applied).
If tracking the map selections is important you would want an additional similar query in the macro that creates a table recording the important aspects of the map data selection
create table mapplication as
select cst_id
%do I = 1 %to &N_var;
%let variable = &&variable&i;
%let innerness = from &map as INNER where INNER.variable="&variable" and INNER.start < OUTER.&variable <= INNER.end;
, &variable
, ( select INNER.woe &innerness ) as &variable._woe
, ( select INNER.beta &innerness ) as &variable._beta
, ( select INNER.start &innerness ) as &variable._start
, ( select INNER.end &innerness ) as &variable._end
%end;
from &data as OUTER;
Examining the 'mapplication' data can possibly help diagnose bad map_num data.
First let's start with a working set of data so we have something that SAS code can work with.
data cust ;
input cst_id A B ;
cards;
1 688567 873
2 3198 54667
;
data map_data ;
input variable :$32. start end woe beta ;
cards;
A -999999999 57853 -1 3
A 57853 89756 -1.1 3
A 89756 897452 -1.2 3
A 897452 9999999999 -1.3 3
B -999999999 4235 2 3
B 4235 65785 3 3
B 65785 9999999999 4 3
;
If you want to combine the first table with the second then you need to transpose it.
proc transpose data=cust out=cust_data(rename=(col1=value)) name=variable ;
by cst_id ;
run;
The result for our small example looks like this.
Obs cst_id variable value
1 1 A 688567
2 1 B 873
3 2 A 3198
4 2 B 54667
Since the transpose has moved the variable names into data values instead of metadata values we can now easily join the customer data with the map data.
I will assume that you only want the cases where the value of the variable falls between the START and END variables.
proc sql ;
create table want as
select *
from cust_data a
inner join map_data b
on a.variable = b.variable
and a.value between b.start and b.end
order by 1,2
;
quit;
For this little sample it would be this data.
Obs cst_id variable value start end woe beta
1 1 A 688567 89756 897452 -1.2 3
2 1 B 873 -999999999 4235 2.0 3
3 2 A 3198 -999999999 57853 -1.0 3
4 2 B 54667 4235 65785 3.0 3
At this point you now have something that might make it possible to calculate a score, if you could explain what the formula is.
So assuming that you want to take the sum of WOE*BETA then your SQL query should probably look like this.
proc sql ;
create table scores as
select a.cst_id,sum(woe*beta) as score
from cust_data a
inner join map_data b
on a.variable = b.variable
and a.value between b.start and b.end
group by 1
order by 1
;
quit;
Which has this result.
Obs cst_id score
1 1 2.4
2 2 6.0
Not sure where macro code or looping would help with this problem. If the names of the input dataset vary then you could use macro variables to hold the names, but the input dataset names are used only once each in this code.
For example you could make macro variables CUST, MAP and OUT.
%let cust=work.cust;
%let map=work.map_data;
%let out=work.scores;
Then replace the dataset names in the code with the macro variable references.
proc transpose data=&cust. out=cust_data(rename=(col1=value)) name=variable ;
by cst_id ;
run;
proc sql ;
create table &out. as
select a.cst_id,sum(woe*beta) as score
from cust_data a
inner join &map. b
on a.variable = b.variable
and a.value between b.start and b.end
group by 1
order by 1
;
quit;
Related
I have two table looks like and I want to add column score to tableA from tableB, then get tableC, how to do in SAS?
the only rule is to add a column in tableA name "score " and its value is same as column "score" in tableB (which are all the same in tableB)
+----+---+---+---+
| id | b | c | d |
+----+---+---+---+
| 1 | 5 | 7 | 2 |
| 2 | 6 | 8 | 3 |
| 3 | 7 | 8 | 1 |
| 4 | 5 | 7 | 2 |
| 5 | 6 | 8 | 3 |
| 6 | 7 | 8 | 1 |
+----+---+---+---+
tableA
+---+---+-------+
| e | f | score |
+---+---+-------+
| 3 | 7 | 11 |
| 4 | 6 | 11 |
| 5 | 5 | 11 |
+---+---+-------+
tableB
+----+---+---+---+-------+
| id | b | c | d | score |
+----+---+---+---+-------+
| 1 | 5 | 7 | 2 | 11 |
| 2 | 6 | 8 | 3 | 11 |
| 3 | 7 | 8 | 1 | 11 |
| 4 | 5 | 7 | 2 | 11 |
| 5 | 6 | 8 | 3 | 11 |
| 6 | 7 | 8 | 1 | 11 |
+----+---+---+---+-------+
tableC
If the "id" is present in both tables, you can use the following to create Table C:
PROC SQL;
CREATE TABLE tableC AS
SELECT a.*, b.score
FROM tableA a JOIN tableB b
ON a.id = b.id;
QUIT;
Please confirm that this is what you need?
First time post so hopefully someone can kindly assist on this problem I'm facing within SAS EG (still learning SAS coding so please be kind!)
If you see a snippet of the dataset below what I'm trying to do is tally up the scores (pts) by Ref based on consecutive occurrences that flag has showed for that Ref.
For Example:
If you take Ref 505 for A_Flag there is 2 different sets of consecutive occurrences of that flag then scoring will be as follows:
1st ID > 1st instance = 25 points
2nd ID > 2nd instance but 1st consecutive instance = double to 50 points
3rd ID > 0 instance = 0 points
4th ID > 1st instance = 25 points
5th ID > 2nd instance but 1st consecutive instance = double to 50 points
6th ID > 0 instance = 0 points
Therefore for this Ref A_Pts will be 150 points.
Another example:
If you take Ref 527 for B_Flag there is 4 consecutive occurrences of that flag so coring per ID:
1st ID > 0 instance = 0 points
2nd ID > 1st instance = 10 points
3rd ID > 2nd instance but 1st consecutive instance = double to 20 points
4th ID > 3rd instance but 2nd consecutive instance = double to 40 points
5th ID > 4th instance but 3rd consecutive instance = double to 80 points
Therefore for this Ref B_Pts will be 150 points
I have to say the data is in the necessary order for what I'm trying to achieve.
I'd tried using LAG function but that will only work based on the 1st consecutive instance.
I also tried calculate a count - an enumeration variable based on cats(Ref,A_Flag) - but it then orders the data incorrectly and doesnt count up accordingly
Hopefully this makes sense to someone out there!
The dataset in question:
+-----------+-----+--------+--------+--------+-------+-------+
| date | Ref | FormID | A_Flag | B_Flag | A_Pts | B_Pts |
+-----------+-----+--------+--------+--------+-------+-------+
| 01-Feb-17 | 505 | 74549 | A | | 25 | 0 |
| 01-Feb-17 | 505 | 74550 | A | | 25 | 0 |
| 10-Jan-17 | 505 | 82900 | | B | 0 | 10 |
| 13-Jan-17 | 505 | 82906 | A | | 25 | 0 |
| 09-Jan-17 | 505 | 82907 | A | | 25 | 0 |
| 11-Jan-17 | 505 | 82909 | | B | 0 | 10 |
| 03-Jan-17 | 527 | 62549 | A | | 25 | 0 |
| 04-Jan-17 | 527 | 62550 | | B | 0 | 10 |
| 04-Jan-17 | 527 | 76151 | | B | 0 | 10 |
| 04-Jan-17 | 527 | 76152 | A | B | 25 | 10 |
| 04-Jan-17 | 527 | 76153 | A | B | 25 | 10 |
+-----------+-----+--------+--------+--------+-------+-------+
Desired output (unless there is a better suggestion):
+-----------+-----+--------+--------+--------+-----------+-----------+
| date | Ref | FormID | A_Flag | B_Flag | A_Pts_Agg | B_Pts_Agg |
+-----------+-----+--------+--------+--------+-----------+-----------+
| 01-Feb-17 | 505 | 74549 | A | | 25 | 0 |
| 01-Feb-17 | 505 | 74550 | A | | 50 | 0 |
| 10-Jan-17 | 505 | 82900 | | B | 0 | 10 |
| 13-Jan-17 | 505 | 82906 | A | | 25 | 0 |
| 09-Jan-17 | 505 | 82907 | A | | 50 | 0 |
| 11-Jan-17 | 505 | 82909 | | B | 0 | 10 |
| 03-Jan-17 | 527 | 62549 | A | | 25 | 0 |
| 04-Jan-17 | 527 | 62550 | | B | 0 | 10 |
| 04-Jan-17 | 527 | 76151 | | B | 0 | 20 |
| 04-Jan-17 | 527 | 76152 | A | B | 25 | 40 |
| 04-Jan-17 | 527 | 76153 | A | B | 50 | 80 |
+-----------+-----+--------+--------+--------+-----------+-----------+
So when totalled up it'll be
+-----+-----------+-----------+
| Ref | A_Pts_Agg | B_Pts_Agg |
+-----+-----------+-----------+
| 505 | 150 | 20 |
| 527 | 100 | 150 |
+-----+-----------+-----------+
Try this:
data have;
infile cards dlm='|';
input date :date7. Ref :8. FormID :8. A_Flag :$1. B_Flag :$1. A_Pts :8. B_Pts :8.;
format date date7.;
cards;
| 01-Feb-17 | 505 | 74549 | A | | 25 | 0 |
| 01-Feb-17 | 505 | 74550 | A | | 25 | 0 |
| 10-Jan-17 | 505 | 82900 | | B | 0 | 10 |
| 13-Jan-17 | 505 | 82906 | A | | 25 | 0 |
| 09-Jan-17 | 505 | 82907 | A | | 25 | 0 |
| 11-Jan-17 | 505 | 82909 | | B | 0 | 10 |
| 03-Jan-17 | 527 | 62549 | A | | 25 | 0 |
| 04-Jan-17 | 527 | 62550 | | B | 0 | 10 |
| 04-Jan-17 | 527 | 76151 | | B | 0 | 10 |
| 04-Jan-17 | 527 | 76152 | A | B | 25 | 10 |
| 04-Jan-17 | 527 | 76153 | A | B | 25 | 10 |
;
run;
data want;
set have;
by Ref;
retain A_pts_agg B_pts_agg;
if first.Ref then do;
A_pts_agg = A_pts;
B_pts_agg = B_pts;
end;
if lag(A_flag) ne (A_flag) then A_pts_agg = A_pts;
else if A_flag = 'A' then A_pts_agg = A_pts_agg * 2;
if lag(B_flag) ne (B_flag) then B_pts_agg = B_pts;
else if B_flag = 'B' then B_pts_agg = B_pts_agg * 2;
run;
I am using the following code to delete some rows from a dataset on a certain condition:
data MK_RETURN;
/*delete some data to solve the beta zero problem*/
if CUM_RETURN<RMIN then delete;
run;
However, I found out that the dataset MK_RETURN became not only empty, but also missing all the variables but CUM_RETURN and return.
Before the delete operation, the dataset contains six ~ seven variables. But after the delete operation, the dataset only contains two (empty variables), i.e. CUM_RETURN, RMIN.
What is wrong here?
The input data is something like
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
| SYMBOL | DATE | time | CUM_RETURN | return_sec | RMIN | one_M | MK_RETURN_RATE |
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
| A | 20130108 | 1 | 0 | | 0.00023571 | 1.90E-11 | 3.130243764 |
| A | 20130108 | 2 | | -0.00117855 | 0.000235988 | 1.90E-11 | 0.000274509 |
| A | 20130108 | 3 | 0.000471976 | 0.000471976 | 0.000235877 | 1.90E-11 | 6.86083E-05 |
| A | 20130108 | 4 | | -0.000471754 | 0.000235988 | 1.90E-11 | 6.86036E-05 |
| A | 20130108 | 5 | -0.000471976 | -0.000943953 | 0.000236211 | 1.90E-11 | 6.85989E-05 |
| A | 20130108 | 6 | | -0.002362112 | 0.000236771 | 1.90E-11 | 0 |
| A | 20130108 | 7 | 0.000711876 | 0.001183852 | 0.000236491 | 1.90E-11 | -0.000137188 |
| A | 20130108 | 8 | | 0.001300698 | 0.000236183 | 1.90E-11 | 0 |
| A | 20130108 | 9 | 0.000711876 | 0 | 0.000236183 | 1.90E-11 | 0 |
| A | 20130108 | 10 | | 0 | 0.000236183 | 1.90E-11 | 0.000137207 |
| A | 20130108 | 11 | 0.000711876 | 0 | 0.000236183 | 1.90E-11 | 0.000137188 |
| A | 20130108 | 12 | | 0.000590458 | 0.000236044 | 1.90E-11 | 6.85848E-05 |
| A | 20130108 | 13 | 0.000711876 | 0 | 0.000236044 | 1.90E-11 | 0 |
| A | 20130108 | 14 | | -0.000118022 | 0.000236072 | 1.90E-11 | -0.0003429 |
| A | 20130108 | 15 | 0.000711876 | 0 | 0.000236072 | 1.90E-11 | -0.000068604 |
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
You didn't declare an input dataset (no set statement) - so you have created a new, empty dataset called MK_RETURN with two variables that were assigned as missing numerics given the absence of a definition.
Try the following (if not too late):
data MK_RETURN;
set INPUTDATASET; /* THIS is the line you need */
/*delete some data to solve the beta zero problem*/
if CUM_RETURN<RMIN then delete;
run;
I was trying to use the following code to do calculate cumulative return ():
retain MIDPRICE CUM_RETURN;
LAG_MIDPRICE = lag(MIDPRICE);
LAG_CUMRETURN = lag(CUM_RETURN);
return_sec = (MIDPRICE - LAG_MIDPRICE) / LAG_MIDPRICE;
if first.symbol then CUM_RETURN = 0;
else CUM_RETURN = return_sec + LAG_CUMRETURN;
However, with the if and else statement, the SAS is skipping a row:
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
| SYMBOL | DATE | time | CUM_RETURN | return_sec | RMIN | one_M | MK_RETURN_RATE |
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
| A | 20130108 | 1 | 0 | | 0.00023571 | 1.90E-11 | 3.130243764 |
| A | 20130108 | 2 | | -0.00117855 | 0.000235988 | 1.90E-11 | 0.000274509 |
| A | 20130108 | 3 | 0.000471976 | 0.000471976 | 0.000235877 | 1.90E-11 | 6.86083E-05 |
| A | 20130108 | 4 | | -0.000471754 | 0.000235988 | 1.90E-11 | 6.86036E-05 |
| A | 20130108 | 5 | -0.000471976 | -0.000943953 | 0.000236211 | 1.90E-11 | 6.85989E-05 |
| A | 20130108 | 6 | | -0.002362112 | 0.000236771 | 1.90E-11 | 0 |
| A | 20130108 | 7 | 0.000711876 | 0.001183852 | 0.000236491 | 1.90E-11 | -0.000137188 |
| A | 20130108 | 8 | | 0.001300698 | 0.000236183 | 1.90E-11 | 0 |
| A | 20130108 | 9 | 0.000711876 | 0 | 0.000236183 | 1.90E-11 | 0 |
| A | 20130108 | 10 | | 0 | 0.000236183 | 1.90E-11 | 0.000137207 |
| A | 20130108 | 11 | 0.000711876 | 0 | 0.000236183 | 1.90E-11 | 0.000137188 |
| A | 20130108 | 12 | | 0.000590458 | 0.000236044 | 1.90E-11 | 6.85848E-05 |
| A | 20130108 | 13 | 0.000711876 | 0 | 0.000236044 | 1.90E-11 | 0 |
| A | 20130108 | 14 | | -0.000118022 | 0.000236072 | 1.90E-11 | -0.0003429 |
| A | 20130108 | 15 | 0.000711876 | 0 | 0.000236072 | 1.90E-11 | -0.000068604 |
+--------+----------+------+--------------+--------------+-------------+----------+----------------+
As you can see, I want CUM_RETURN = return_sec + lag(CUM_RETURN), but it seems that now it is doing CUM_RETURN = return_sec + lag(lag(CUM_RETURN)).
I am aware of the issue that you cannot write lag directly in if and else conditions, that is why i used a LAG variable before the if else condition. But it seems that it is still working in a weird way ...
Besides, if i delete the if statement and do
if first.symbol then CUM_RETURN = 0;
CUM_RETURN = return_sec + LAG_CUMRETURN;
The entire column of CUM_RETURN just becomes empty ...
I don't think you need LAG_CUMRETURN, you have CUMRETURN in a retain.
(throwing out the other noise)
retain cumreturn;
if first.symbol then cumreturn = 0;
cumreturn = sum(cumreturn,return_sec);
That should get you what you want. The SUM() function treats a missing value as a 0, so for the first record of each security, CUMRETURN will stay 0.
I want to sum the volume variable for each name (TRD_STCK_CD) and date (TRD_EVENT_TM) variables.
Here is a sample of my data:
+--------------+--------------+-------------+--------+------------+---------
| TRD_EVENT_DT | TRD_EVENT_TM | TRD_STCK_CD | TRD_EVENT_ROUFOR | VOLUME |
+--------------+--------------+-------------+--------+------------+---------
| 3/24/2008 | 12:28:01 | ALBZ1 | 12:30 | 15370000 |
| 3/24/2008 | 13:13:44 | ALBZ1 | 13:00 | 15670 |
| 3/24/2008 | 12:20:38 | AZAB1 | 12:30 | 6830000 |
| 3/24/2008 | 13:13:44 | AZAB1 | 13:00 | 6950 |
| 3/24/2008 | 9:14:57 | BALI1 | 9:00 | 7871000 |
| 3/24/2008 | 9:15:06 | BALI1 | 9:30 | 1700000 |
| 3/24/2008 | 9:15:14 | BALI1 | 9:30 | 8500000 |
| 3/24/2008 | 9:15:24 | BALI1 | 9:30 | 5100000 |
| 3/24/2008 | 9:29:27 | BALI1 | 9:30 | 8500000 |
| 3/24/2008 | 12:28:00 | BALIl | 12:30 | 8500000 |
| 3/24/2008 | 12:28:07 | BALIl | 12:30 | 8500000 |
| 3/24/2008 | 13:13:44 | BALI1 | 13:00 | 8650 |
+--------------+--------------+-------------+--------+------------+---------
I have deleted some col. for simplicity. In next step, I want a table such as below:
+--------------+--------------+-------------+--------+------------+---------
| TRD_EVENT_DT | TRD_EVENT_TM | TRD_STCK_CD | TRD_EVENT_ROUFOR | VOLUME | volume_Sum |
+--------------+--------------+-------------+--------+------------+---------
| 3/24/2008 | 12:28:01 | ALBZ1 | 12:30 | 15370000 | |
| 3/24/2008 | 13:13:44 | ALBZ1 | 13:00 | 15670 | 15385670 |
| 3/24/2008 | 12:20:38 | AZAB1 | 12:30 | 6830000 | |
| 3/24/2008 | 13:13:44 | AZAB1 | 13:00 | 6950 | 6836950 |
| 3/24/2008 | 9:14:57 | BALI1 | 9:00 | 7871000 | |
| 3/24/2008 | 9:15:06 | BALI1 | 9:30 | 1700000 | |
| 3/24/2008 | 9:15:14 | BALI1 | 9:30 | 8500000 | |
| 3/24/2008 | 9:15:24 | BALI1 | 9:30 | 5100000 | |
| 3/24/2008 | 9:29:27 | BALI1 | 9:30 | 8500000 | |
| 3/24/2008 | 12:28:00 | BALIl | 12:30 | 8500000 | |
| 3/24/2008 | 12:28:07 | BALIl | 12:30 | 8500000 | |
| 3/24/2008 | 13:13:44 | BALI1 | 13:00 | 8650 | 48679650 |
+--------------+--------------+-------------+--------+------------+---------
Please pay attention to last col. It has been generated by summing volumes that have same TRD_STCK_CD var. So each TRD_STCK_CD obs. has just one Volume_Sum data.
Slightly different implementation of the same idea:
/*Sort by TRD_STCK_CD and temporal variables.*/
proc sort data=have out=have_sorted;
by TRD_STCK_CD
TRD_EVENT_DT
TRD_EVENT_TM;
run;
/*Sum VOLUME until the last of each TRD_STCK_CD is reached.*/
data want;
set have_sorted;
by TRD_STCK_CD
TRD_EVENT_DT
TRD_EVENT_TM;
retain tmp_volume_sum;
tmp_volume_sum + VOLUME;
if last.TRD_STCK_CD then do;
Volume_Sum = tmp_volume_sum;
call missing(tmp_volume_sum);
end;
drop tmp_:;
run;
I simplified this even more to something with just 2 columns. The code and the volume.
Here is the sample table creation:
data have;
do code = 'a','b','c';
do i=1 to floor(5*ranuni(1))+1;
volume = floor(500*ranuni(1));
output;
end;
end;
drop i;
run;
First use PROC SQL to sum the volume grouped by code. Save that in a table and put an index on code.
proc sql noprint;
create table sums as
select code, sum(volume) as volume_sum
from have
group by code;
create index code on sums;
quit;
I assume you have sorted your table by code. If not, do so.
Now we run through the data we have. Set the volume_sum to null. If we are on the last record for that code, then look up the value from the SUMS table.
data want;
set have;
by code;
volume_sum = .;
if last.code then
set sums key=code;
run;
Printed I get:
code volume volume_sum
a 485 485
b 129 .
b 460 589
c 271 .
c 265 .
c 24 .
c 33 .
c 409 1002