Let's say I have a table that looks like this:
Human Theme Score
----------------------
J1 Surfing 2
J2 Eating 3
J3 Sleeping 4
J2 Eating 5
J2 Surfing 6
Now I want to add columns holding, for each human, the total score per theme. To give you an idea:
Human Theme Score EatingTotal SurfingTotal SleepingTotal
------------------------------------------------------------------
J1 Surfing 2 0 2 0
J2 Eating 3 8 6 0
J3 Sleeping 4 0 0 4
J2 Eating 5 8 6 0
J2 Surfing 6 8 6 0
How do I get there? Is there a way in DAX to say something like
SurfingTotal = SUM(if table[Theme] = Surfing)
I'm still new to Power BI.
There are probably several ways to do it, but this one seems to work:
Total Eating Measure =
CALCULATE(
SUM(Jury[Score]);
ALLEXCEPT(
Jury;
Jury[Human]
);
Jury[Theme]="Eating"
) + 0
I've added the zero at the end to prevent blank values from showing.
I'm using proc genmod to predict an outcome measured at 4 time points. The outcome is a total score on a mood inventory, which can range from 0 to 82. A lot of participants have a score of 0, so the negative binomial distribution in proc genmod seemed like a good fit for the data.
Now, I'm struggling with how to write/interpret the estimates statements. The primary predictors are TBI status at baseline (0=no/1=yes), and visit (0=baseline, 1=second visit, 2=third visit, 4=fourth visit), and an interaction of TBI status and visit.
How do I write my estimates, such that I'm getting out:
1. the average difference in mood inventory score for a person with TBI versus a person without, at baseline;
and
2. the average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits?
Below is what I have thus far, but I'm not sure how to interpret the output, or indeed whether my code is correct:
proc genmod data = analyze_long_3 ;
class id screen_tbi (param = ref ref = first) ;
model nsi_total = visit_cent screen_tbi screen_tbi*visit_cent /dist=negbin ;
output predicted = predstats;
repeated subject=id /type=cs;
estimate "tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 1 0 /exp;
estimate "no tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 0 1 /exp;
estimate 'longitudinal TBI' intercept 1
visit_cent -1 1 1 1
screen_tbi 1 0
screen_tbi*visit_cent 1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0 / exp;
estimate 'longitudinal no TBI ' intercept 1
visit_cent -1 1 1 1
screen_tbi 0 1
screen_tbi*visit_cent 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1 / exp;
where sample = 1 ;
run;
The first research question asks for the average difference in score, at baseline, for a person with TBI versus a person without. It can be achieved by the following steps:
1) Get the estimated average log (score) when TBI = yes, and Visit = baseline;
2) Get the estimated average log (score) when TBI = no, and Visit =baseline;
3) Subtract 2) from 1) to get the difference in log(score) values
4) Exponentiate 3) to turn the log difference into a ratio of scores (a relative change)
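Step 4) works because exponentiating a difference of logs yields a ratio: exp(log(a) - log(b)) = a/b. A quick illustration with made-up scores (Python, numbers purely hypothetical):

```python
import math

# Two hypothetical mean scores on the log scale.
a, b = 12.0, 8.0
log_diff = math.log(a) - math.log(b)   # step 3): difference of log(score) values
ratio = math.exp(log_diff)             # step 4): back-transform to a ratio
print(ratio)  # ~1.5, i.e. a 50% higher score
```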
To simplify, let T = the TBI levels and V = the visit levels. One thing to clarify: in your post there are 4 visit points, with the first as reference; therefore there should be 3 parameters for V, not four.
Taking step 1) as an example, let's try to write the ESTIMATE statement. It is a bit tricky. At first it seems it should be (with T=0 and V=0 as reference):
ESTIMATE 'Overall average' intercept 1 T 1 V 0 0 0;
But that is wrong. In the above statement, all arguments for V are set to 0, and when all arguments are 0 it is the same as leaving V out of the statement:
ESTIMATE 'Overall average' intercept 1 T 1;
This is not the estimate of the average for T=1 at the baseline visit. Rather, it produces an average for T=1 regardless of visit point, i.e., an average over all visit levels.
The problem is that the reference is coded as V=0. In that case SAS cannot tell the difference between an estimate at the reference level and an estimate over all levels; indeed it always estimates the average over all levels. To solve this, the reference has to be coded as -1, i.e., T=-1 and V=-1 as reference, so that the statement looks like:
ESTIMATE 'Average of T=1 V=baseline' intercept 1 T 1 V -1 -1 -1;
Now SAS understands: the job is to get the average at the baseline level, not over all levels.
To make the reference value -1 instead of 0, the option in the CLASS statement should be PARAM=EFFECT, not PARAM=REF. That brings another problem: once PARAM is not REF, SAS ignores user-defined references. For example:
CLASS id T (ref='…') V (ref='…') / PARAM=EFFECT;
The (ref='…') is ignored when PARAM=EFFECT. So how do we make SAS take TBI=no and Visit=baseline as the references? SAS automatically takes the last level as the reference. For example, if the variable T is sorted in ascending order, the value -1 comes first and the value 1 comes last, so 1 will be the reference. Conversely, if T is sorted in descending order, the value -1 comes last and will be used as the reference. This is achieved with the DESCENDING option in the CLASS statement.
CLASS id T V / PARAM=EFFECT DESCENDING;
That way, the parameters are ordered as:
T 1 (TBI =1)
T -1 (ref level of TBI, i.e., TBI=no)
V 1 0 0 (for visit =4)
V 0 1 0 (visit = 3)
V 0 0 1 (visit =2)
V -1 -1 -1 (this is the ref level, visit=baseline)
The above information is reported in the ODS table 'Class Level Information'. It is always good to check that table each time after running PROC GENMOD. Note that the level (visit = 4) comes before (visit = 3), and (visit = 3) before (visit = 2).
Now, let's talk a bit about the parameters and the model equation. As you may know, in SAS a multi-level variable V is broken down into dummy variables. With baseline as the reference level, the dummies are:
V4 = 1 for the fourth visit, -1 for baseline, 0 otherwise
V3 = 1 for the third visit, -1 for baseline, 0 otherwise
V2 = 1 for the second visit, -1 for baseline, 0 otherwise
Accordingly, the equation can be written as:
LOG(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
where:
s = the total score on the mood inventory
T = 1 for TBI status of yes, = -1 for TBI status of no
V4 = 1 for the fourth visit, = -1 for baseline, = 0 otherwise
V3 = 1 for the third visit, = -1 for baseline, = 0 otherwise
V2 = 1 for the second visit, = -1 for baseline, = 0 otherwise
b0 to b4 are the beta estimates for the parameters
Of note, the order in the model is the same as the order defined in the CLASS statement, and the same as in the ODS table 'Class Level Information'. The dummies V4, V3, V2 must appear in the model all together or not at all: if the VISIT term is included, all of V4, V3, and V2 go into the model equation; if the VISIT term is not included, none of them does.
With interaction terms, 3 more dummy terms must be created:
T_V4 = T*V4
T_V3 = T*V3
T_V2 = T*V2
Hence the equation with interaction terms:
Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6* T_V3 + b7* T_V2
The SAS ESTIMATE statement corresponds directly to the model equation.
For example, to estimate an overall average over all parameters and all levels, the equation is:
[Log(S)] = b0
where [Log(S)] stands for the expected LOG(score). Accordingly, the statement is:
ESTIMATE 'overall (all levels of T and V)' INTERCEPT 1;
In the above statement, 'INTERCEPT' corresponds to 'b0' in the equation.
To estimate an average of log (score) for T =1, and for all levels of visit points, the equation is
[LOG(S)] = b0 + b1 * T = b0 + b1 * 1
And the statement is
ESTIMATE 'T=Yes, V=all levels' INTERCEPT 1 T 1;
In the above case, 'T 1' in the statement corresponds to the '* 1' part of the equation (i.e., setting T = 1).
To estimate an average of log (score) for T =1, and for visit = baseline, the equation is:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
= b0 + b1*(1) + b2*(-1)+ b3*(-1) + b4*(-1)
The statement is:
ESTIMATE 'T=Yes, V=Baseline' INTERCEPT 1 T 1 V -1 -1 -1;
'V -1 -1 -1' in the statement corresponds to the values of V4, V3, and V2 in the equation. As mentioned above, the dummies V4, V3, and V2 must all be introduced into the model; that is why the V term always carries three numbers, such as 'V -1 -1 -1' or 'V 1 1 1'. SAS will issue a warning in the log if you write 'V -1 -1 -1 -1', because there are four '-1's, one more than required; the excess '-1' is simply ignored. By contrast, 'V 1 1' is accepted and is the same as 'V 1 1 0'. But what does 'V 1 1 0' mean? To figure that out, you'll have to read Allison's book (see reference).
For now, let’s carry on, and add the interaction terms. The equation:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6*T_V3 + b7*T_V2
As T_V4 = T*V4 = 1 * (-1) = -1, similarly T_V3 = -1, T_V2=-1, substitute into the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(-1) + b6*(-1) + b7*(-1)
The statement is:
ESTIMATE '(1) T=Yes, V=Baseline, with interaction' INTERCEPT 1 T 1 V -1 -1 -1 T*V -1 -1 -1;
The 'T*V -1 -1 -1' corresponds to the values of T_V4, T_V3, and T_V2 in the equation.
And that is the statement for step 1)!
Step 2 follows the same logic: get the estimated average log(score) when TBI = no and Visit = baseline.
T = -1, V4=-1, V3=-1, V2=-1.
T_V4 = T * V4 = (-1) * (-1) = 1
T_V3 = T * V3 = (-1) * (-1) = 1
T_V2 = T * V2 = (-1) * (-1) = 1
Substituting the values in the equation:
[Log(s)] = b0 + b1*(-1) + b2*(-1) + b3*(-1) + b4*(-1) + b5*(1) + b6*(1) + b7*(1)
Note the coefficients: for T: -1; for V: -1 -1 -1; for the interaction terms: 1 1 1
And the SAS statement:
ESTIMATE '(2) T=No, V=Baseline, with interaction' INTERCEPT 1 T -1 V -1 -1 -1 T*V 1 1 1;
The estimate results can be found in the ODS table ‘Contrast Estimate Results’.
For step 3), subtract: estimate (1) minus estimate (2) gives the difference in log(score); for step 4), take the exponent of the difference from step 3).
For the second research question:
The average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits.
Over the 4 study visits means over all visit levels. By now you can probably see that the statements are simpler:
ESTIMATE '(1) T=Yes, V=all levels' INTERCEPT 1 T 1;
ESTIMATE '(2) T=No, V=all levels' INTERCEPT 1 T -1;
Why are there no interaction terms? Because all visit levels are considered, and when averaging over all levels you do not need to put any visit-related terms into the statement.
Finally, the above approach requires some manual calculation. It is in fact possible to write a single ESTIMATE statement equivalent to the approach above, but the method discussed here is much easier to understand. For more sophisticated methods, please read Allison's book.
Reference:
1. Allison, Paul D. Logistic Regression Using SAS: Theory and Application, Second Edition. SAS Institute Inc., Cary, North Carolina, 2012.
I have a df like so:
Year ID Count
1997 1 0
1998 2 0
1999 3 1
2000 4 0
2001 5 1
and I want to remove all rows before the first occurrence of 1 in Count which would give me:
Year ID Count
1999 3 1
2000 4 0
2001 5 1
I can remove all rows AFTER the first occurrence like this:
df=df.loc[: df[(df['Count'] == 1)].index[0], :]
but I can't seem to follow the slicing logic to make it do the opposite.
I'd do:
df[(df.Count == 1).idxmax():]
df.Count == 1 returns a boolean array. idxmax() identifies the index of the maximum value. The maximum here is True, and when there is more than one True it returns the position of the first one found, which is exactly what you want. By the way, that value is 2. Finally, I slice the dataframe for everything from 2 onward with df[2:]. I put all of that into the one-liner above.
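Put together as a runnable snippet, rebuilding the example frame from the question:

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({'Year': [1997, 1998, 1999, 2000, 2001],
                   'ID': [1, 2, 3, 4, 5],
                   'Count': [0, 0, 1, 0, 1]})

# idxmax() on the boolean mask gives the index of the first Count == 1 row;
# slicing from there keeps that row and everything after it.
result = df[(df.Count == 1).idxmax():]
print(result['Year'].tolist())  # [1999, 2000, 2001]
```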
you can use cumsum() method:
In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
Year ID Count
2 1999 3 1
3 2000 4 0
4 2001 5 1
Explanation:
In [14]: (df.Count == 1).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
Name: Count, dtype: int32
Timing against 500K rows DF:
In [18]: df = pd.concat([df] * 10**5, ignore_index=True)
In [19]: df.shape
Out[19]: (500000, 3)
In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop
In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop
In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop
Conclusion: @piRSquared's idxmax() solution is a clear winner...
Using np.where:
df[np.where(df['Count']==1)[0][0]:]
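As a self-contained snippet (same example frame as the question): np.where returns the positional indices of all True values, so taking the first one and slicing positionally does the job.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [1997, 1998, 1999, 2000, 2001],
                   'ID': [1, 2, 3, 4, 5],
                   'Count': [0, 0, 1, 0, 1]})

# np.where gives the positions of all matches; [0][0] picks the first.
first = np.where(df['Count'] == 1)[0][0]
result = df[first:]
print(first)  # 2
```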
Timings
Timings were performed on a larger version of the DataFrame:
df = pd.concat([df]*10**5, ignore_index=True)
Results:
%timeit df[np.where(df['Count']==1)[0][0]:]
100 loops, best of 3: 2.74 ms per loop
%timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 6.18 ms per loop
%timeit df[(df.Count == 1).cumsum() > 0]
10 loops, best of 3: 26.6 ms per loop
%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
Just slice the other way: if idx is your index, do
df.loc[idx:]
instead of
df.loc[:idx]
That means:
df.loc[df[(df['Count'] == 1)].index[0]:, :]
Trying to calculate the real delay between executions of Function.InvokeAfter.
The function is supposed to fire five times a second, so I expected something like this:
index delay now
0 0 18:47:33
1 0 18:47:33
2 0 18:47:33
3 0 18:47:33
4 0 18:47:33
5 1 18:47:34
6 1 18:47:34
7 1 18:47:34
8 1 18:47:34
9 1 18:47:34
10 2 18:47:35
11 2 18:47:35
12 2 18:47:35
13 2 18:47:35
14 2 18:47:35
...
But I get something different. The column real_delay is the difference between this row and the previous one.
CODE
let
t = Table.FromList({0..19}, Splitter.SplitByNothing()),
delay = Table.AddColumn(t, "delay", each Number.IntegerDivide([Column1], 5)),
InvokeAfter = Table.AddColumn(delay, "InvokeTimeNow", each Function.InvokeAfter(
()=>DateTime.Time(DateTime.LocalNow()), #duration(0,0,0,[delay]))
),
real_delay = Table.AddColumn(InvokeAfter, "real_delay", each try InvokeAfter{[Column1=[Column1]-1]}[InvokeTimeNow]-[InvokeTimeNow] otherwise "-")
in
real_delay
What's wrong with the code? Or maybe with Function.InvokeAfter?
5 times a second means you should be waiting (1 second / 5) = 0.2 seconds between invocations.
If you run this code:
let
t = Table.FromList({0..19}, Splitter.SplitByNothing()),
delay = Table.AddColumn(t, "delay", each 0.2),
InvokeAfter = Table.AddColumn(delay, "InvokeTimeNow", each Function.InvokeAfter(
()=>DateTime.Time(DateTime.LocalNow()), #duration(0,0,0,[delay]))
),
real_delay = Table.AddColumn(InvokeAfter, "real_delay", each try InvokeAfter{[Column1=[Column1]-1]}[InvokeTimeNow]-[InvokeTimeNow] otherwise "-")
in
real_delay
you'll see the function was invoked about 5 times per second.
== SOLUTION ==
Here is my own solution. A surprising one, I think...
NEW CODE
let
threads=5,
t = Table.FromList({0..19}, Splitter.SplitByNothing()),
delay = Table.AddColumn(t, "delay", each if Number.Mod([Column1], threads)=0 and [Column1]>0 then 1 else 0),
InvokeAfter = Table.AddColumn(delay, "InvokeTimeNow", each Function.InvokeAfter(()=>DateTime.Time(DateTime.LocalNow()), #duration(0,0,0,[delay]))),
real_delay = Table.AddColumn(InvokeAfter, "real_delay", each try InvokeAfter{[Column1=[Column1]-1]}[InvokeTimeNow]-[InvokeTimeNow] otherwise "-")
in
real_delay
The original idea was multithreaded parsing, and since there were limits on simultaneous connections, I had to adapt.
I thought there was a "zero-start" moment after which the function is invoked (the moment when the cell is calculated, all cells at almost the same time), and that the second parameter meant a delay after that start point. But it appears that all the delays accumulate instead. Very strange behaviour, imho...
So I solved the problem, but I still do not understand why =)
In Stata I want to have a variable calculated by a formula, which includes multiplying by the previous value, within blocks defined by a variable ID. I tried using a lag but that did not work for me.
In the formula below the Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2
Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
The sort before by and the subsequent by command can be condensed into one. Note that I spelled out that within blocks of ID, observations should be sorted by count.
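Outside Stata, a useful way to see what the recurrence does: Y = (1 + X) * Y[_n-1] with Y = 1 on the first observation of each block is just a running product of (1 + X) within ID. As a cross-check in Python/pandas (rebuilding the question's data, already sorted by count within ID):

```python
import pandas as pd

# Re-creation of the example data from the question (missing X on row 1 of each ID).
df = pd.DataFrame({
    'X':     [None, 2, 1, 3, 2, None, 1, 7],
    'count': [1, 2, 3, 4, 5, 1, 2, 3],
    'ID':    [1, 1, 1, 1, 1, 2, 2, 2],
})

# (1 + X) is NaN on each block's first row; filling with 1 makes that row the
# neutral starting factor, and the within-ID cumulative product gives Y.
df['Y'] = (1 + df['X']).fillna(1).groupby(df['ID']).cumprod()
print(df['Y'].tolist())  # [1.0, 3.0, 6.0, 24.0, 72.0, 1.0, 2.0, 16.0]
```

The result matches the Y column shown in the question for both ID blocks.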
This is really a comment, not another answer, but it would be less clear if presented as such.
Y-1, the lag in the formula, would be translated as shown below:
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y[_n-1] if count != 1