Need help writing ESTIMATE statements in PROC GENMOD - SAS

I'm using proc genmod to predict an outcome measured at 4 time points. The outcome is a total score on a mood inventory, which can range from 0 to 82. A lot of participants have a score of 0, so the negative binomial distribution in proc genmod seemed like a good fit for the data.
Now, I'm struggling with how to write and interpret the ESTIMATE statements. The primary predictors are TBI status at baseline (0=no/1=yes), visit (0=baseline, 1=second visit, 2=third visit, 4=fourth visit), and the interaction of TBI status and visit.
How do I write my estimates, such that I'm getting out:
1. the average difference in mood inventory score for a person with TBI versus a person without, at baseline;
and
2. the average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits?
Below is what I have thus far, but I'm not sure how to interpret the output (also below), or indeed whether my code is correct:
proc genmod data = analyze_long_3;
    class id screen_tbi (param = ref ref = first);
    model nsi_total = visit_cent screen_tbi screen_tbi*visit_cent / dist = negbin;
    output predicted = predstats;
    repeated subject = id / type = cs;
    estimate "tbi"    intercept 1 visit_cent 0 0 0 0 screen_tbi 1 0 / exp;
    estimate "no tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 0 1 / exp;
    estimate 'longitudinal TBI' intercept 1
        visit_cent -1 1 1 1
        screen_tbi 1 0
        screen_tbi*visit_cent 1 0 0 0
                              0 1 0 0
                              0 0 1 0
                              0 0 0 1
                              0 0 0 0
                              0 0 0 0
                              0 0 0 0
                              0 0 0 0 / exp;
    estimate 'longitudinal no TBI' intercept 1
        visit_cent -1 1 1 1
        screen_tbi 0 1
        screen_tbi*visit_cent 0 0 0 0
                              0 0 0 0
                              0 0 0 0
                              0 0 0 0
                              1 0 0 0
                              0 1 0 0
                              0 0 1 0
                              0 0 0 1 / exp;
    where sample = 1;
run;

The first research question asks for the average difference in score, at baseline, for a person with TBI versus a person without. It can be obtained with the following steps:
1) Get the estimated average log(score) when TBI = yes and Visit = baseline;
2) Get the estimated average log(score) when TBI = no and Visit = baseline;
3) Subtract 2) from 1) to get the difference in log(score) values;
4) Exponentiate 3) to express the difference as a ratio of scores (a multiplicative change).
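Because the model is fit on the log scale, exponentiating in step 4) turns a log difference into a ratio of expected scores rather than an absolute difference. A minimal Python sketch with made-up log-scale estimates (both numbers are hypothetical, not from any real fit):

```python
import math

# Hypothetical log-scale estimates from steps 1) and 2)
log_tbi = 2.30     # estimated log(score), TBI = yes, at baseline
log_no_tbi = 1.90  # estimated log(score), TBI = no, at baseline

diff = log_tbi - log_no_tbi  # step 3): difference on the log scale
ratio = math.exp(diff)       # step 4): multiplicative change in scores

print(round(diff, 2))   # 0.4
print(round(ratio, 3))  # 1.492, i.e. the TBI group scores roughly 49% higher
```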
To simplify, let T = TBI level and V = Visit level. One thing to clarify: in your post there are 4 visit points, with the first as the reference; therefore there should be 3 parameters for V, not four.
Taking step 1) as the example, let's try to write the ESTIMATE statement. It is a bit tricky. At first it seems it should be (with T=0 and V=0 as the reference):
ESTIMATE 'Overall average' INTERCEPT 1 T 1 V 0 0 0;
But this is wrong. In the statement above, all the coefficients for V are set to 0; when all coefficients are 0, that is the same as leaving V out of the statement entirely:
ESTIMATE 'Overall average' INTERCEPT 1 T 1;
This is not the estimate of the average for T=1 at the baseline visit. Rather, it produces an average for T=1 regardless of visit point, i.e., an average over all visit levels.
The problem is that the reference is coded as V=0. With that coding, SAS cannot distinguish an estimate at the reference level from an estimate averaged over all levels; it always estimates the average over all levels. To solve this, the reference has to be coded as -1 (i.e., T=-1 and V=-1 for the reference), so that the statement looks like:
ESTIMATE 'Average of T=1, V=baseline' INTERCEPT 1 T 1 V -1 -1 -1;
Now SAS understands: the job is to get the average at the baseline level, not over all levels.
To make the reference value -1 instead of 0, the option in the CLASS statement should be PARAM=EFFECT, not PARAM=REF. That brings another problem: once PARAM is not REF, SAS ignores user-defined references. For example:
CLASS id T (ref='…') V (ref='…') / PARAM=EFFECT;
The (ref='…') is ignored when PARAM=EFFECT. So how do we make TBI=no and Visit=baseline the references? SAS automatically takes the last level as the reference. For example, if the variable T is sorted in ascending order, the value -1 comes first and the value 1 comes last, so 1 becomes the reference. Conversely, if T is sorted in descending order, the value -1 comes last and is used as the reference. This is achieved with the DESCENDING option in the CLASS statement.
CLASS id T V / PARAM=EFFECT DESCENDING;
That way, the parameters are ordered as:
T 1 (TBI =1)
T -1 (ref level of TBI, i.e., TBI=no)
V 1 0 0 (for visit =4)
V 0 1 0 (visit = 3)
V 0 0 1 (visit =2)
V -1 -1 -1 (this is the ref level, visit=baseline)
The above information is reported in the ODS table 'Class Level Information'. It is always good to check that table after running PROC GENMOD. Note that the level visit=4 comes before visit=3, and visit=3 before visit=2.
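A practical consequence of this effect (1/0/−1) coding is that the V columns sum to zero across the four levels; that is exactly why setting all V coefficients to 0 in an ESTIMATE yields the average over all visits, as discussed earlier. A quick check of the rows above in plain Python (no SAS involved):

```python
# Effect-coded rows for the four visit levels, as in 'Class Level Information'
rows = [
    [1, 0, 0],     # visit = 4
    [0, 1, 0],     # visit = 3
    [0, 0, 1],     # visit = 2
    [-1, -1, -1],  # visit = baseline (reference level)
]

# Averaging the codes over all levels cancels every V column to 0
avg = [sum(col) / len(rows) for col in zip(*rows)]
print(avg)  # [0.0, 0.0, 0.0]
```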
Now, let's talk a bit about the parameters and the model equation. As you might know, SAS internally breaks a multi-level V into dummy variables. With baseline as the reference level, the dummies are:
V4 = fourth visit vs. baseline
V3 = third visit vs. baseline
V2 = second visit vs. baseline
Accordingly, the equation can be written as:
LOG(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
where:
s = the total score on a mood inventory
T = 1 for TBI status of yes, = -1 for TBI status of no
V4 = 1 for the fourth visit, = -1 for baseline
V3 = 1 for the third visit, =-1 for baseline
V2 = 1 for the second visit, = -1 for the baseline
b0 to b4 are beta estimates for the parameters
Of note, the order of terms in the model matches the order defined in the CLASS statement, which is also the order in the ODS table 'Class Level Information'. The dummies V4, V3, V2 must appear in the model all together or not at all: if the VISIT term is included, all of V4, V3, and V2 enter the equation; if it is not, none of them do.
With interaction terms, 3 more dummy terms must be created:
T_V4 = T*V4
T_V3 = T*V3
T_V2 = T*V2
Hence the equation with interaction terms:
Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6* T_V3 + b7* T_V2
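To make the bookkeeping concrete, the equation above can be written as a small Python function and evaluated at the effect codes; the beta values here are arbitrary placeholders, not estimates from any real fit:

```python
def log_score(b, T, V):
    """Linear predictor Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
                                + b5*T*V4 + b6*T*V3 + b7*T*V2.
    b : list of 8 coefficients [b0..b7]
    T : 1 (TBI = yes) or -1 (TBI = no)
    V : effect codes [V4, V3, V2]; baseline is [-1, -1, -1]
    """
    b0, b1, b2, b3, b4, b5, b6, b7 = b
    V4, V3, V2 = V
    return (b0 + b1*T + b2*V4 + b3*V3 + b4*V2
            + b5*T*V4 + b6*T*V3 + b7*T*V2)

# Arbitrary placeholder coefficients, just to exercise the formula
b = [1.0, 0.5, 0.2, 0.1, 0.05, 0.3, 0.2, 0.1]
baseline = [-1, -1, -1]

# T = yes at baseline: b0 + b1 - (b2 + b3 + b4) - (b5 + b6 + b7)
print(round(log_score(b, 1, baseline), 2))  # 0.55
```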
The ESTIMATE statement in SAS corresponds directly to the model equation.
For example, to estimate an overall average for all parameters and all levels, the equation is:
[Log(S)] = b0 ;
whereas [LOG(S)] stands for the expected LOG(score). Accordingly, the statement is:
ESTIMATE 'overall (all levels of T and V)' INTERCEPT 1;
In the above statement, INTERCEPT corresponds to b0 in the equation.
To estimate an average of log (score) for T =1, and for all levels of visit points, the equation is
[LOG(S)] = b0 + b1*T = b0 + b1*1
And the statement is:
ESTIMATE 'T=Yes, V=all levels' INTERCEPT 1 T 1;
In this case, 'T 1' in the statement corresponds to setting T = 1 in the equation.
To estimate an average of log (score) for T =1, and for visit = baseline, the equation is:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
= b0 + b1*(1) + b2*(-1)+ b3*(-1) + b4*(-1)
The statement is:
ESTIMATE 'T=Yes, V=Baseline' INTERCEPT 1 T 1 V -1 -1 -1;
'V -1 -1 -1' in the statement corresponds to the values of V4, V3, and V2 in the equation. As mentioned above, the dummies V4, V3, and V2 must all be introduced into the model; that is why the V term always takes three numbers, e.g. 'V -1 -1 -1' or 'V 1 1 1'. SAS will issue a warning in the log if you write 'V -1 -1 -1 -1', because there are four -1s, one more than required; the extra -1 is ignored. On the other hand, 'V 1 1' is fine: it is the same as 'V 1 1 0'. But what does 'V 1 1 0' mean? To figure that out, read Allison's book (see reference).
For now, let’s carry on, and add the interaction terms. The equation:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6*T_V3 + b7*T_V2
As T_V4 = T*V4 = 1 * (-1) = -1, similarly T_V3 = -1, T_V2=-1, substitute into the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(-1) + b6*(-1) + b7*(-1)
The statement is:
ESTIMATE '(1) T=Yes, V=Baseline, with interaction' INTERCEPT 1 T 1 V -1 -1 -1 T*V -1 -1 -1;
The 'T*V -1 -1 -1' corresponds to the values of T_V4, T_V3, and T_V2 in the equation.
And that is the statement for step 1)!
Step 2) follows the same logic: get the estimated average log(score) when TBI = no and Visit = baseline. Then:
T = -1, V4 = -1, V3 = -1, V2 = -1.
T_V4 = T * V4 = (-1) * (-1) = 1
T_V3 = T * V3 = (-1) * (-1) = 1
T_V2 = T * V2 = (-1) * (-1) = 1
Substituting the values in the equation:
[Log(s)] = b0 + b1*(-1) + b2*(-1) + b3*(-1) + b4*(-1) + b5*(1) + b6*(1) + b7*(1)
Note the coefficients: for T: -1; for V: -1 -1 -1; for the interaction terms: 1 1 1.
And the SAS statement:
ESTIMATE '(2) T=No, V=Baseline, with interaction' INTERCEPT 1 T -1 V -1 -1 -1 T*V 1 1 1;
The estimate results can be found in the ODS table ‘Contrast Estimate Results’.
For step 3), subtract estimate (2) from estimate (1) to get the difference in log(score); for step 4), exponentiate the difference from step 3).
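The subtraction in step 3) can also be carried out directly on the coefficient vectors of the two ESTIMATE statements: term by term, the intercept and V coefficients cancel, leaving nonzero weights only on T and the interaction. A small Python sketch of that bookkeeping (using T = 1 for yes and T = -1 for no, per the effect coding):

```python
# Coefficient order: [intercept, T, V4, V3, V2, T_V4, T_V3, T_V2]
# Effect codes at baseline: V = (-1, -1, -1); T = 1 (yes) or -1 (no)
def coeffs(T):
    V = [-1, -1, -1]
    return [1, T] + V + [T * v for v in V]

est_yes = coeffs(1)   # [1,  1, -1, -1, -1, -1, -1, -1]
est_no = coeffs(-1)   # [1, -1, -1, -1, -1,  1,  1,  1]

# Statement (1) minus statement (2): intercept and V terms cancel
diff = [a - b for a, b in zip(est_yes, est_no)]
print(diff)  # [0, 2, 0, 0, 0, -2, -2, -2]
```

This suggests a single statement along the lines of ESTIMATE 'diff at baseline' T 2 T*V -2 -2 -2 / exp; could produce the ratio in one step, but treat that exact line as an assumption to be verified against the manual two-statement calculation.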
For the second research question:
The average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits.
'Over the 4 study visits' means across all visit levels. By now you can see that the statements are simpler:
ESTIMATE '(1) T=Yes, V=all levels' INTERCEPT 1 T 1;
ESTIMATE '(2) T=No, V=all levels' INTERCEPT 1 T -1;
Why are there no interaction terms? Because all visit levels are considered, and when averaging over all levels, no visit-related terms need to appear in the statement.
Finally, the approach above requires some manual calculation. It is in fact possible to write a single ESTIMATE statement that is equivalent, but the method discussed here is easier to understand. For more sophisticated methods, please read Allison's book.
Reference:
1. Allison, Paul D. Logistic Regression Using SAS: Theory and Application, Second Edition. Cary, NC: SAS Institute Inc., 2012.

Related

Power BI - Subtract rows based on an id

I have two tables as follows:
id  N1  N2    N3         N4  N5
1   UP  REIT
2   UP  REIT  UPDigital  DI
3   UP  REIT  UPDigital  DI  SI
4   UP  REIT  UPdigital  DI  IT
5   UP  FCUP

id_entity  id_person  exit  join
2          1          1     0
5          1          0     1
3          10         1     0
4          10         0     1
4          25         1     0
4          12         0     1
I need to calculate people's joins and exits, so to calculate the exits I created the following measure:
N exits = IF(CALCULATE(sum(Folha2[exit])-sum(Folha2[join])) < 0,0, sum(Folha2[exit])-sum(Folha2[join]))
And this one for the joins:
N joins = IF(CALCULATE(sum(Folha2[join])-sum(Folha2[exit])) < 0,0, sum(Folha2[join])-sum(Folha2[exit]))
This is the result, but it is not correct.
My problem is that this approach is not based on id_person.
For example, in the last two rows of the second table, the person with id_person=25 left entity 4 and the person with id_person=12 joined entity 4.
The measure subtracts the two rows without taking into account that they are two different people.
The correct number of exits would be the following:
UP - 1
FCUP - 0
REIT - 2
UPDigital -2
DI - 2
IT - 1
SI - 1
Is it possible to calculate this in Power BI?

openoffice calc sumproduct with a twist

This is my first attempt at (OpenOffice) Basic beyond using simple functions, so I'm asking for a kick start here.
assume this (part of a) sheet
factor b-count c-count d-count
A2 b2 c2 d2 ...
A3 b3 c3 d3 ...
Assume that these are the first columns and rows A1 to D3, holding numeric values each.
If factor is 1, I want A(N) (column 'A', row N >= 2) to hold the sumproduct of row 1 and row N.
The twist comes when factor is not 1. In that case I want a sumproduct of
count*round(value * factor).
Example:
1.5 2 1 0 4
=myfunc(2) 4 8 11 15
=myfunc(3) 11 20 28 36
=myfunc(4) 29 53 74 94
where myfunc(2) should result in
round(4*1.5)*2 + round(8*1.5)*1 + round(15*1.5)*4 = 6*2 + 12*1 + 23*4 = 12 + 12 + 92 = 116,
myfunc(3) = 17*2 + 30 + 54*4 = 34 + 30 + 216 = 280, and
myfunc(4) = 44*2 + 80 + 141*4 = 88 + 80 + 564 = 732, etc.
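The worked figures can be reproduced in a few lines of Python. Note one assumption: these figures round halves upward (22.5 → 23), whereas CLng in the Basic answer below may round halves to even, so exact parity with the spreadsheet macro is worth double-checking:

```python
import math

def round_half_up(x):
    # Round halves upward, matching the worked figures (22.5 -> 23)
    return math.floor(x + 0.5)

def myfunc(counts, values, factor):
    """Sum of count * round(value * factor) across a row."""
    return sum(c * round_half_up(v * factor) for c, v in zip(counts, values))

counts = [2, 1, 0, 4]  # the counts in row 1, after the factor cell
factor = 1.5

print(myfunc(counts, [4, 8, 11, 15], factor))    # 116
print(myfunc(counts, [11, 20, 28, 36], factor))  # 280
print(myfunc(counts, [29, 53, 74, 94], factor))  # 732
```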
I could just insert a row below each one, multiplying every value with the factor; but I would love something fancier.
basically thought (pun not intended):
col='B'
sum=0
do while (col)(N)>0
sum=sum+(col)(1)*round((col)(N)*A1;0)
col=col+1
loop
A(n)=sum
where (col)(N) refers to the cell in column col and row N.
Not important enough to study the manual; but it would be great if someone can do this off the cuff.
Another point: I have read that custom functions must be stored in the "Standard Library";
but I could not find any mention on HOW to do that. Who will point me to the right manual page?
Go to Tools -> Macros -> Organize Macros -> OpenOffice Basic. Select My Macros -> Standard -> Module 1 (that is what is meant by the Standard library), and press Edit.
Paste the following code.
Function SumProductOfTwoRows(firstColumn As Long, row As Long, firstRow As Long)
    'For example: =SUMPRODUCTOFTWOROWS(COLUMN(); ROW(); ROW($A$1))
    firstColumn = firstColumn - 1  'column A is index 0
    row = row - 1                  'row 1 is index 0
    firstRow = firstRow - 1        'row 1 is index 0
    oSheet = ThisComponent.CurrentController.ActiveSheet
    sum = 0
    column = firstColumn + 1
    factor = oSheet.getCellByPosition(firstColumn, firstRow).getValue()
    Do
        value = oSheet.getCellByPosition(column, row).getValue()
        count = oSheet.getCellByPosition(column, firstRow).getValue()
        If value = 0 Then Exit Do
        sum = sum + count * CLng(value * factor)
        column = column + 1
    Loop
    SumProductOfTwoRows = sum
End Function
Enter this formula in A2 and drag to fill down to A4.
=SUMPRODUCTOFTWOROWS(COLUMN(); ROW(); ROW($A$1))
The result:
This kind of user-defined function produces an error when re-opening the file. To avoid the error, see my answer at https://stackoverflow.com/a/39254907/5100564.

dramatic error in lp_solve?

I have a simple problem that I passed to lp_solve via the IDE (version 5.5.2.0):
/* Objective function */
max: +r1 +r2;
/* Constraints */
R1: +r1 +r2 <= 4;
R2: +r1 -2 b1 = 0;
R3: +r2 -3 b2 = 0;
/* Variable bounds */
b1 <= 1;
b2 <= 1;
/* Integer definitions */
int b1,b2;
The obvious solution to this problem is 3. SCIP as well as CBC give 3 as the answer, but not lp_solve: here I get 2. Is there a major bug in the solver?
Thanks in advance.
I had contact with the developer group that maintains the lpsolve software. The error will be fixed in the next version of lpsolve.
When I tried it, I got 3 as the optimal value for the objective function.
Model name: 'LPSolver' - run #1
Objective: Maximize(R0)
SUBMITTED
Model size: 3 constraints, 4 variables, 6 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
Relaxed solution 4 after 4 iter is B&B base.
Feasible solution 2 after 6 iter, 3 nodes (gap 40.0%)
Optimal solution 2 after 7 iter, 4 nodes (gap 40.0%).
Excellent numeric accuracy ||*|| = 0
MEMO: lp_solve version 5.5.2.0 for 32 bit OS, with 64 bit REAL variables.
In the total iteration count 7, 1 (14.3%) were bound flips.
There were 2 refactorizations, 0 triggered by time and 0 by density.
... on average 3.0 major pivots per refactorization.
The largest [LUSOL v2.2.1.0] fact(B) had 8 NZ entries, 1.0x largest basis.
The maximum B&B level was 3, 0.8x MIP order, 3 at the optimal solution.
The constraint matrix inf-norm is 3, with a dynamic range of 3.
Time to load data was 0.001 seconds, presolve used 0.017 seconds,
... 0.007 seconds in simplex solver, in total 0.025 seconds.
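Because b1 and b2 are binary and the equality constraints R2/R3 pin down r1 = 2·b1 and r2 = 3·b2, the whole model can be checked by exhaustive enumeration; a quick Python sketch confirming the optimum of 3:

```python
from itertools import product

best = None
for b1, b2 in product([0, 1], repeat=2):
    r1, r2 = 2 * b1, 3 * b2    # R2 and R3 force these values
    if r1 + r2 <= 4:           # constraint R1
        obj = r1 + r2          # objective: max r1 + r2
        if best is None or obj > best[0]:
            best = (obj, b1, b2)

print(best)  # (3, 0, 1): objective 3 at r1 = 0, r2 = 3
```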

SAS - Selecting optimal quantities

I'm trying to solve a problem in SAS where I have quantities of customers across a range of groups, and the quantities I select need to be as even across the different categories as possible. This will be easier to explain with a small table, which is a simplification of a much larger problem I'm trying to solve.
Here is the table:
Customer Category  Revenue band  Churn band  # Customers
A                  1             1           4895
A                  1             2            383
A                  1             3            222
A                  2             1             28
A                  2             2           2828
A                  2             3            232
B                  1             1           4454
B                  1             2            545
B                  1             3            454
B                  2             1           4534
B                  2             2            434
B                  2             3            454
Suppose I need to select 3000 customers from category A, and 3000 customers from category B. From the second category, within each A and B, I need to select an equal amount from 1 and 2. If possible, I need to select a proportional amount across each 1, 2, and 3 subcategories. Is there an elegant solution to this problem? I'm relatively new to SAS and so far I've investigated OPTMODEL, but the examples are either too simple or too advanced to be much use to me yet.
Edit: I've thought about using PROC SURVEYSELECT. I can use it to select equal sizes across bands 1, 2, and 3. However, where I'm lacking customers in the individual churn bands, SURVEYSELECT may not select the maximum number of customers available where those numbers are low, and I'm back to manually selecting customers.
There are still some ambiguities in the problem statement, but I hope that the PROC OPTMODEL code below is a good start for you. I tried to add examples of many different features, so that you can toy around with the model and hopefully get closer to what you actually need.
Of the many things you could optimize, I am minimizing the maximum violation of your "If possible" goal, e.g.:
min MaxMismatch = MaxChurnMismatch;
I was able to model your constraints as a Linear Program, which means it should scale very well. You probably have other constraints you did not mention, but those would probably be beyond the scope of this site.
With the data you posted, you can see from the output of the print statements that the optimal penalty corresponds to choosing 1500 customers from A,1,1, where the ideal would be 1736. This is more expensive than ignoring the customers from several groups:
[1] ChooseByCat
A 3000
B 3000
[1] [2] [3] Choose IdealProportion
A 1 1 1500 1736.670
A 1 2 0 135.882
A 1 3 0 78.762
A 2 1 28 9.934
A 2 2 1240 1003.330
A 2 3 232 82.310
B 1 1 1500 1580.210
B 1 2 0 193.358
B 1 3 0 161.072
B 2 1 1500 1608.593
B 2 2 0 153.976
B 2 3 0 161.072
Proportion MaxChurnMisMatch
0.35478 236.67
That is probably not the ideal solution, but figuring how to model exactly your requirements would not be as useful for this site. You can contact me offline if that is relevant.
I've added quotes from your problem statement as comments in the code below.
Have fun!
data custCounts;
input cat $ rev churn n;
datalines;
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
;
proc optmodel printlevel = 0;
set CATxREVxCHURN init {} inter {<'A',1,1>};
set CAT = setof{<c,r,ch> in CATxREVxCHURN} c;
num n{CATxREVxCHURN};
read data custCounts into CATxREVxCHURN=[cat rev churn] n;
put n[*]=;
var Choose{<c,r,ch> in CATxREVxCHURN} >= 0 <= n[c,r,ch]
, MaxChurnMisMatch >= 0, Proportion >= 0 <= 1
;
/* From OP:
Suppose I need to select 3000 customers from category A,
and 3000 customers from category B. */
num goal = 3000;
/* See "implicit slice" for the parenthesis notation, i.e. (c) below. */
impvar ChooseByCat{c in CAT} =
sum{<(c),r,ch> in CATxREVxCHURN} Choose[c,r,ch];
con MatchCatGoal{c in CAT}:
ChooseByCat[c] = goal;
/* From OP:
From the second category, within each A and B,
I need to select an equal amount from 1 and 2 */
con MatchRevenueGroupsWithinCat{c in CAT}:
sum{<(c),(1),ch> in CATxREVxCHURN} Choose[c,1,ch]
= sum{<(c),(2),ch> in CATxREVxCHURN} Choose[c,2,ch]
;
/* From OP:
If possible, I need to select a proportional amount
across each 1, 2, and 3 subcategories. */
con MatchBandProportion{<c,r,ch> in CATxREVxCHURN, sign in / 1 -1 /}:
MaxChurnMismatch >= sign * ( Choose[c,r,ch] - Proportion * n[c,r,ch] );
min MaxMismatch = MaxChurnMismatch;
solve;
print ChooseByCat;
impvar IdealProportion{<c,r,ch> in CATxREVxCHURN} = Proportion * n[c,r,ch];
print Choose IdealProportion;
print Proportion MaxChurnMismatch;
quit;
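As a sanity check, the Choose column printed above can be verified against the two hard constraints (3000 per category, and equal revenue-band totals within each category); a small Python sketch using the numbers copied from the output:

```python
# (cat, rev, churn) -> chosen count, copied from the solver output above
choose = {
    ('A', 1, 1): 1500, ('A', 1, 2): 0,    ('A', 1, 3): 0,
    ('A', 2, 1): 28,   ('A', 2, 2): 1240, ('A', 2, 3): 232,
    ('B', 1, 1): 1500, ('B', 1, 2): 0,    ('B', 1, 3): 0,
    ('B', 2, 1): 1500, ('B', 2, 2): 0,    ('B', 2, 3): 0,
}

for cat in ('A', 'B'):
    total = sum(v for (c, r, ch), v in choose.items() if c == cat)
    rev1 = sum(v for (c, r, ch), v in choose.items() if c == cat and r == 1)
    rev2 = sum(v for (c, r, ch), v in choose.items() if c == cat and r == 2)
    assert total == 3000  # MatchCatGoal
    assert rev1 == rev2   # MatchRevenueGroupsWithinCat
    print(cat, total, rev1, rev2)  # A 3000 1500 1500 / B 3000 1500 1500
```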

Formula that uses previous value

In Stata I want to create a variable calculated by a formula that involves multiplying by the previous value, within blocks defined by the variable ID. I tried using a lag but that did not work for me.
In the formula below, Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2
Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
Sorting before by and the subsequent by command can be condensed into one. Note that I spelled out that within blocks of ID, count should remain sorted.
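The same groupwise recurrence, Y = 1 at the first observation of each ID and Y = (1 + X) * (previous Y) afterwards, can be mirrored in plain Python to confirm the expected values in the question (a sketch, not Stata):

```python
def build_y(xs):
    """xs: X values for one ID in count order; xs[0] is unused (missing).
    Returns Y with Y[0] = 1 and Y[i] = (1 + xs[i]) * Y[i-1]."""
    ys = [1]
    for x in xs[1:]:
        ys.append((1 + x) * ys[-1])
    return ys

print(build_y([None, 2, 1, 3, 2]))  # [1, 3, 6, 24, 72]  (ID 1)
print(build_y([None, 1, 7]))        # [1, 2, 16]         (ID 2)
```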
This is really a comment, not another answer, but it would be less clear if presented as such.
Y-1, the lag in the formula, would be translated as shown below:
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y[_n-1] if count != 1