I would like to plot this dataset and obtain the desired output with the right setup.
Plot the scatter such that the points are shaded red, from light red to dark red, according to the ratio on a 0-1 scale (0 = light red, 1 = dark red).
Also show a legend with the red colour scale mapped to the 0-1 ratio (point 1).
Data explanation:
area - city (shortcut)
id - user id
var - variable
time - datetime
exit - consumer left
ratio - proportion (between 0-1)
Data sample and my plotting attempt (obviously not correct):
data data;
input area $ id $ var $ time $ exit $ ratio $;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
data attrs;
input id $ risk $ fillcolor $;
datalines;
ratio 0.05 Verylightred
ratio 0.15 Lightred
ratio 0.20 Red
ratio 0.25 Darkred
ratio 0.30 Verydarkred
ratio 0.35 Verydarkstrongred
;
run;
proc sgpanel data=data dattrmap=attrs;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled) group=ratio attrid=ratio;
run;
This will get you closer.
Ratio should be numeric to be graphed.
Ratio is continuous, so how should it be used as a group?
For the colours in the data attribute map, the variable holding the colour names is not long enough, and risk should be numeric.
I don't know exactly which ranges you want for each colour, but this gets you closer using the automatic legend.
One way to get at this is to add a grouping variable to the data set so you can control the colour of each group with the data attribute map. This would mean adding a column called ratio_group to the 'data' data set that maps to the values in the data attribute map table, and then using that variable as the group, as sketched below.
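A sketch of that data-attribute-map route (assuming ratio is read as numeric, as in the corrected step below; the cut points for ratio_group are purely illustrative, and the LENGTH statements keep the colour names from being truncated):
data data_grouped;
    set data;
    length ratio_group $20;
    /* illustrative cut points - adjust to the ranges you want */
    if      ratio < 0.15 then ratio_group = 'verylightred';
    else if ratio < 0.20 then ratio_group = 'lightred';
    else if ratio < 0.25 then ratio_group = 'red';
    else if ratio < 0.30 then ratio_group = 'darkred';
    else if ratio < 0.35 then ratio_group = 'verydarkred';
    else                      ratio_group = 'verydarkstrongred';
run;
data attrs;
    length id $8 value $20 markercolor $20;
    input id $ value $ markercolor $;
    datalines;
ratio verylightred verylightred
ratio lightred lightred
ratio red red
ratio darkred darkred
ratio verydarkred verydarkred
ratio verydarkstrongred verydarkstrongred
;
run;
proc sgpanel data=data_grouped dattrmap=attrs;
    panelby area exit;
    scatter y=id x=var / markerattrs=(symbol=squarefilled size=10)
                         group=ratio_group attrid=ratio;
run;
The corrected code below uses COLORRESPONSE instead, which avoids the binning altogether.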
data data;
input area $ id $ var $ time $ exit $ ratio ;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
proc sgpanel data=data ;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled size=10)
colorresponse=ratio
colormodel=(verylightred lightred red darkred verydarkred verydarkstrongred);
colaxis grid minorgrid;
rowaxis grid minorgrid;
run;
For marker size look at the SIZE option under the MARKERATTRS option.
For grids, look at the GRID/MINORGRID options under the COLAXIS and ROWAXIS statements.
COLAXIS documentation
I have the following values
Time Value
3 0.03
6 0.04
9 0.05
12 0.06
As you can see, they move in steps of three and 0.01, respectively.
If I want to apply the following formula for each of them
X=sum(X,2/value)
What should I do?
I have tried as follows:
data want;
array my_array{0-10} $ _temporary_;
X=0;
Do i=1 to 5;
My_array(i)=sum(x,2/value)*i;
X=sum(x,2/value)*i;
End;
Total=X;
Run;
However, I am not looping through value, only through time (i goes from 1 to 4).
I would like to calculate X for each time by applying the formula above, so that I have one extra column in the table above, and then get the sum of these values.
In the example provided by Kermit in the answer below, the expected output (values under x should satisfy the formula mentioned above) would be the following:
time value x sum_x
3 0.03 200
6 0.04 300
9 0.05 360
12 0.06 400
Your expected results do not seem to match your explanation of the formula. You could use two arrays to allow you to pair the TIME and VALUE amounts.
data want;
array t [4] _temporary_ (3 6 9 12);
array v [4] _temporary_ (0.03 0.04 0.05 0.06);
do index=1 to dim(t);
time=t[index];
value=v[index];
x=sum(x,2/value)*index;
output;
end;
run;
Results
Obs index time value x
1 1 3 0.03 66.67
2 2 6 0.04 233.33
3 3 9 0.05 820.00
4 4 12 0.06 3413.33
If I understand correctly, time would be set on a quarterly basis and you would like to get the sum of X for each time. Next time, consider giving the expected output in your question.
data stage1;
do time=3 to 12 by 3;
do value = 0.03, 0.04, 0.05, 0.06;
x=(2/value)*time;
output;
end;
end;
run;
proc sort data=stage1;
by time value;
run;
data want;
do _n_=1 by 1 until(last.time);
set stage1;
by time;
sum_x=sum(sum_x, x);
output;
end;
run;
time value x sum_x
3 0.03 200 200
3 0.04 150 350
3 0.05 120 470
3 0.06 100 570
6 0.03 400 400
6 0.04 300 700
6 0.05 240 940
6 0.06 200 1140
9 0.03 600 600
9 0.04 450 1050
9 0.05 360 1410
9 0.06 300 1710
12 0.03 800 800
12 0.04 600 1400
12 0.05 480 1880
12 0.06 400 2280
EDIT after comments
Why would you use a do loop? Just perform the calculation element-wise (row by row) within the table.
data want;
set have;
x=(2/value)*time;
retain sum_x 0;
sum_x=sum(sum_x, x);
output;
run;
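For reference, a minimal have data set matching the table in the question (the name have is simply what the step above assumes):
data have;
    input time value;
    datalines;
3 0.03
6 0.04
9 0.05
12 0.06
;
run;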
I'm using proc genmod to predict an outcome measured at 4 time points. The outcome is a total score on a mood inventory, which can range from 0 to 82. A lot of participants have a score of 0, so the negative binomial distribution in proc genmod seemed like a good fit for the data.
Now, I'm struggling with how to write/interpret the estimates statements. The primary predictors are TBI status at baseline (0=no/1=yes), and visit (0=baseline, 1=second visit, 2=third visit, 4=fourth visit), and an interaction of TBI status and visit.
How do I write my estimates, such that I'm getting out:
1. the average difference in mood inventory score for person with TBI versus a person without, at baseline.
and
2. the average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits?
Below is what I have thus far, but I'm not sure how to interpret the output (also below), or whether my code is even correct:
proc genmod data = analyze_long_3 ;
class id screen_tbi (param = ref ref = first) ;
model nsi_total = visit_cent screen_tbi screen_tbi*visit_cent /dist=negbin ;
output predicted = predstats;
repeated subject=id /type=cs;
estimate "tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 1 0 /exp;
estimate "no tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 0 1 /exp;
estimate 'longitudinal TBI' intercept 1
visit_cent -1 1 1 1
screen_tbi 1 0
screen_tbi*visit_cent 1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0 / exp;
estimate 'longitudinal no TBI ' intercept 1
visit_cent -1 1 1 1
screen_tbi 0 1
screen_tbi*visit_cent 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1 / exp;
where sample = 1 ;
run;
The first research question is to have the average difference score, at baseline, for person with TBI versus a person without. It can be achieved by the following steps:
1) Get the estimated average log (score) when TBI = yes, and Visit = baseline;
2) Get the estimated average log (score) when TBI = no, and Visit =baseline;
3) Subtract 2) from 1) to get the difference in log(score) values;
4) Exponentiate the result of 3) to express the difference as a ratio (percentage change) in scores.
To simplify, let T = TBI levels and V = visit levels. One thing to clarify: your post has 4 visit points, with the first as reference, so there should be 3 parameters for V, not four.
Taking the example of step 1), let’s try to write the ESTIMATE statement. It is a bit tricky. At first it sounds like this (T=0 and V =0 as reference):
ESTIMATE ‘Overall average’ intercept T 1 V 0 0 0;
But it is wrong. In the above statement, all arguments for V are set to 0. When all arguments are 0, it is the same as taking out V from the statement:
ESTIMATE ‘Overall average’ intercept T 1;
This is not the estimate of the average for T=1 at the baseline level. Rather, it produces an average for T=1 regardless of visit point, i.e., an average over all visit levels.
The problem is that the reference is coded as V=0. In that case, SAS cannot distinguish the estimate for the reference level from the estimate over all levels; it always estimates the average over all levels. To solve this, the reference has to be coded as -1, i.e., T=-1 and V=-1 as reference, so that the statement looks like:
ESTIMATE ‘Average of T=1 V=baseline’ intercept T 1 V -1 -1 -1;
Now SAS understands: fine, the job is to get the average at the baseline level, not across all levels.
To make the reference value -1 instead of 0, the CLASS statement option should be PARAM=EFFECT rather than PARAM=REF. That brings another problem: once PARAM is no longer REF, SAS ignores user-defined references. For example:
CLASS id T (ref=’…’) V (ref=’…’) / PARAM=EFFECT;
The (ref=’…’) is ignored when PARAM=EFFECT. How to let SAS make TBI=No and Visit=baseline as references? Well, SAS automatically takes the last level as the reference. For example, if the variable T is ordered ascendingly, the value -1 comes as the first level, while the value 1 comes as the last level; therefore 1 will be the reference. Conversely, if T is ordered in descending order, the value -1 comes at the end and will be used as the ref. This is achieved by the option ‘DESCENDING’ in the CLASS statement.
CLASS id T V / PARAM=EFFECT DESCENDING;
That way, the parameters are ordered as:
T 1 (TBI =1)
T -1 (ref level of TBI, i.e., TBI=no)
V 1 0 0 (for visit =4)
V 0 1 0 (visit = 3)
V 0 0 1 (visit =2)
V -1 -1 -1 (this is the ref level, visit=baseline)
The above information is reported in the ODS table ‘Class Level Information’. It is always good to check that table after running PROC GENMOD. Note that the level (visit=4) comes before (visit=3), and (visit=3) comes before (visit=2).
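To make this concrete, here is a sketch of how that CLASS statement would sit in the PROC GENMOD call, using the simplified T and V names as stand-ins for the poster's TBI and visit variables (the data set and outcome names are kept from the question; the rest is illustrative):
proc genmod data=analyze_long_3;
    where sample = 1;
    /* effect coding; DESCENDING makes TBI=no and visit=baseline the reference (-1) levels */
    class id T V / param=effect descending;
    model nsi_total = T V T*V / dist=negbin;
    repeated subject=id / type=cs;
    /* the ESTIMATE statements discussed below go here */
run;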
Now, let’s talk a bit about the parameters and the model equation. As you might know, in SAS, the V for multi-levels is indeed broken down into dummy Vs. If baseline is set as ref level, the dummies will be like:
V4 = the fourth visit or baseline
V3= the third visit, or baseline
V2 = the second visit or baseline
Accordingly, the equation can be written as:
LOG(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
where:
s = the total score on a mood inventory
T = 1 for TBI status of yes, = -1 for TBI status of no
V4 = 1 for the fourth visit, = -1 for baseline
V3 = 1 for the third visit, =-1 for baseline
V2 = 1 for the second visit, = -1 for the baseline
b0 to b4 are beta estimates for the parameters
Of note, the order in the model is the same as the order defined in the statement CLASS, and the same as the order in the ODS table ‘Class Level Information’. The V4, V3, V2 have to appear in the model, all or none, i.e., if the VISIT term is to be included, V4 V3 V2 should be all introduced into the model equation. If the VISIT term is not included, none of V4, V3, and V2 should be in the equation.
With interaction terms, 3 more dummy terms must be created:
T_V4 = T*V4
T_V3 = T*V3
T_V2 = T*V2
Hence the equation with interaction terms:
Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6* T_V3 + b7* T_V2
The SAS ESTIMATE statement corresponds directly to the model equation.
For example, to estimate an overall average for all parameters and all levels, the equation is:
[Log(S)] = b0 ;
where [LOG(S)] stands for the expected LOG(score). Accordingly, the statement is:
ESTIMATE ‘overall (all levels of T and V)’ INTERCEPT;
In the above statement, ‘INTERCEPT’ corresponds to ‘b0’ in the equation.
To estimate an average of log (score) for T =1, and for all levels of visit points, the equation is
[LOG(S)] = b0 + b1 * T = b0 + b1 * 1
And the statement is
ESTIMATE ‘T=Yes, V= all levels’ INTERCEPT T 1;
In the above case, ‘T 1’ in the statement corresponds to the “*1” part of the equation (i.e., let T=1).
To estimate an average of log (score) for T =1, and for visit = baseline, the equation is:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
= b0 + b1*(1) + b2*(-1)+ b3*(-1) + b4*(-1)
The statement is:
ESTIMATE ‘T=Yes, V=Baseline’ INTERCEPT T 1 V -1 -1 -1;
‘V -1 -1 -1’ in the statement corresponds to the values of V4, V3, and V2 in the equation. We’ve mentioned above that the dummies V4, V3, and V2 must all be introduced into the model. That is why the V term always takes three numbers, such as ‘V -1 -1 -1’ or ‘V 1 1 1’. SAS will give a warning in the log if you write ‘V -1 -1 -1 -1’, because there are four ‘-1’s, one more than required; the excess ‘-1’ is ignored. On the contrary, ‘V 1 1’ is fine: it is the same as ‘V 1 1 0’. But what does ‘V 1 1 0’ mean? To figure that out, you have to read Allison’s book (see reference).
For now, let’s carry on, and add the interaction terms. The equation:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6*T_V3 + b7*T_V2
As T_V4 = T*V4 = 1 * (-1) = -1, similarly T_V3 = -1, T_V2=-1, substitute into the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(-1) + b6*(-1) + b7*(-1)
The statement is:
ESTIMATE ‘(1) T=Yes, V=Baseline, with interaction’ INTERCEPT T 1 V -1 -1 -1 T*V -1 -1 -1;
The ‘T*V -1 -1 -1’ values correspond to T_V4, T_V3, and T_V2 in the equation.
And that is the statement for step 1)!
Step 2 follows the same logic: get the estimated average log(score) when TBI = no and Visit = baseline.
T = -1, V4=-1, V3=-1, V2=-1.
T_V4 = T * V4 = (-1) * (-1) = 1
T_V3 = T * V3 = (-1) * (-1) = 1
T_V2 = T * V2 = (-1) * (-1) = 1
Substituting the values in the equation:
[Log(s)] = b0 + b1*(-1) + b2*(-1) + b3*(-1) + b4*(-1) + b5*(1) + b6*(1) + b7*(1)
Note the numbers: for T: -1; for V: -1 -1 -1; for the interaction terms: 1 1 1.
And the SAS statement:
ESTIMATE ‘(2) T=No, V=Baseline, with interaction’ INTERCEPT T -1 V -1 -1 -1 T*V 1 1 1;
The estimate results can be found in the ODS table ‘Contrast Estimate Results’.
For step 3), subtract the estimate (1) – (2), to have the difference of log(score); and for step(4), have the exponent of the diff in step 3).
For the second research question:
The average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits.
Over the 4 study visits means across all visit levels. By now you have probably realized that the statements are simpler:
ESTIMATE ‘(1) T=Yes, V=all levels’ INTERCEPT T 1;
ESTIMATE ‘(2) T=No, V=all levels’ INTERCEPT T -1;
Why are there no interaction terms? Because all visit levels are considered, and when all levels are considered you do not have to put any visit-related terms into the statement.
Finally, the above approach requires some manual calculation. It is indeed possible to write a single ESTIMATE statement that is equivalent to the approach above, but the method discussed here is easier to understand. For more sophisticated methods, please read Allison’s book.
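For reference, a sketch of such single-statement equivalents in the same T/V notation (the coefficients come from subtracting statement (2) from statement (1) in each case, so the intercept and V terms cancel; check the results against the step-by-step approach):
/* research question 1: TBI vs no TBI at baseline, i.e. 2*b1 - 2*b5 - 2*b6 - 2*b7 */
ESTIMATE 'diff at baseline, TBI vs no TBI' T 2 T*V -2 -2 -2 / exp;
/* research question 2: TBI vs no TBI over all visits, i.e. 2*b1 */
ESTIMATE 'diff over all visits, TBI vs no TBI' T 2 / exp;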
Reference:
1. Allison, Paul D. Logistic Regression Using SAS®: Theory and Application, Second Edition. SAS Institute Inc., Cary, North Carolina, 2012.
I'm trying to solve a problem in SAS where I have quantities of customers across a range of groups, and the quantities I select need to be as even across the different categories as possible. This will be easier to explain with a small table, which is a simplification of a much larger problem I'm trying to solve.
Here is the table:
Customer Category | Revenue band | Churn Band | # Customers
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
Suppose I need to select 3000 customers from category A and 3000 customers from category B. For the second column (revenue band), within each of A and B, I need to select an equal amount from bands 1 and 2. If possible, I also need to select a proportional amount across each of the 1, 2, and 3 churn-band subcategories. Is there an elegant solution to this problem? I'm relatively new to SAS, and so far I've investigated OPTMODEL, but the examples are either too simple or too advanced to be of much use to me yet.
Edit: I've thought about using PROC SURVEYSELECT. I can use it to select equal sizes across Revenue Bands 1, 2, and 3. However, where I'm lacking customers in the individual churn bands, SURVEYSELECT may not select the maximum number of customers available where those numbers are low, and I'm back to manually selecting customers.
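For reference, that SURVEYSELECT idea might look something like the sketch below (customers is a hypothetical one-row-per-customer table with cat, rev, and churn columns; SELECTALL keeps undersized strata whole, but the shortfall is not redistributed to the other strata, which is the limitation described above):
proc surveyselect data=customers out=picked
                  sampsize=500   /* per-stratum target, purely illustrative */
                  selectall      /* take every customer in strata smaller than the target */
                  seed=1234;
    strata cat rev churn;
run;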
There are still some ambiguities in the problem statement, but I hope that the PROC OPTMODEL code below is a good start for you. I tried to add examples of many different features, so that you can toy around with the model and hopefully get closer to what you actually need.
Of the many things you could optimize, I am minimizing the maximum violation from your "If possible" goal, e.g.:
min MaxMismatch = MaxChurnMismatch;
I was able to model your constraints as a linear program, which means it should scale very well. You probably have other constraints you did not mention, but those would probably be beyond the scope of this site.
With the data you posted, you can see from the output of the PRINT statements that the optimal penalty corresponds to choosing 1500 customers from A,1,1, where the ideal would be about 1736; that mismatch is larger than the one from leaving several of the smaller groups out entirely:
[1] ChooseByCat
A 3000
B 3000
[1] [2] [3] Choose IdealProportion
A 1 1 1500 1736.670
A 1 2 0 135.882
A 1 3 0 78.762
A 2 1 28 9.934
A 2 2 1240 1003.330
A 2 3 232 82.310
B 1 1 1500 1580.210
B 1 2 0 193.358
B 1 3 0 161.072
B 2 1 1500 1608.593
B 2 2 0 153.976
B 2 3 0 161.072
Proportion MaxChurnMisMatch
0.35478 236.67
That is probably not the ideal solution, but figuring out how to model your exact requirements would not be as useful for this site. You can contact me offline if that is relevant.
I've added quotes from your problem statement as comments in the code below.
Have fun!
data custCounts;
input cat $ rev churn n;
datalines;
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
;
proc optmodel printlevel = 0;
set CATxREVxCHURN init {} inter {<'A',1,1>}; /* starts empty; the tuple literal only pins down the element type <str,num,num> */
set CAT = setof{<c,r,ch> in CATxREVxCHURN} c;
num n{CATxREVxCHURN};
read data custCounts into CATxREVxCHURN=[cat rev churn] n;
put n[*]=;
var Choose{<c,r,ch> in CATxREVxCHURN} >= 0 <= n[c,r,ch]
, MaxChurnMisMatch >= 0, Proportion >= 0 <= 1
;
/* From OP:
Suppose I need to select 3000 customers from category A,
and 3000 customers from category B. */
num goal = 3000;
/* See "implicit slice" for the parenthesis notation, i.e. (c) below. */
impvar ChooseByCat{c in CAT} =
sum{<(c),r,ch> in CATxREVxCHURN} Choose[c,r,ch];
con MatchCatGoal{c in CAT}:
ChooseByCat[c] = goal;
/* From OP:
From the second category, within each A and B,
I need to select an equal amount from 1 and 2 */
con MatchRevenueGroupsWithinCat{c in CAT}:
sum{<(c),(1),ch> in CATxREVxCHURN} Choose[c,1,ch]
= sum{<(c),(2),ch> in CATxREVxCHURN} Choose[c,2,ch]
;
/* From OP:
If possible, I need to select a proportional amount
across each 1, 2, and 3 subcategories. */
con MatchBandProportion{<c,r,ch> in CATxREVxCHURN, sign in / 1 -1 /}:
MaxChurnMismatch >= sign * ( Choose[c,r,ch] - Proportion * n[c,r,ch] );
min MaxMismatch = MaxChurnMismatch;
solve;
print ChooseByCat;
impvar IdealProportion{<c,r,ch> in CATxREVxCHURN} = Proportion * n[c,r,ch];
print Choose IdealProportion;
print Proportion MaxChurnMismatch;
quit;