Upper limit on duration for Survival Analysis - lifelines

I have a lifelines model which I fit using the following:
from lifelines import WeibullAFTFitter

model = WeibullAFTFitter()
model.fit(train, 'duration', event_col='y', show_progress=True)
However, the durations it predicts for my test set (via predicted_time = model.predict_expectation(test)) are extremely large. In fact, in the uncensored case the average error between the test duration and the predicted duration is 2289.3773 +/- 7584.9916.
The problem is that the maximum possible duration is 1500 (assume the machines are replaced every 5 years). So my questions are:
Is there a way to set an upper limit on time?
If I normalise the duration to have 0 mean and standard deviation of 1, would the duration estimates improve?
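Concretely, what I would like is something like the sketch below. The np.clip call is just a post-hoc cap on the predictions, not a lifelines option that I know of, and the median may be a less tail-sensitive summary than the expectation:

import numpy as np

# The expectation integrates over the Weibull's long right tail, which
# may be what produces the huge predictions; the median is often more stable.
predicted_time = model.predict_expectation(test)
predicted_median = model.predict_median(test)

# Post-hoc cap at the known replacement horizon of 1500; this clips the
# predictions but does not change the fitted distribution.
MAX_DURATION = 1500
capped_time = np.clip(predicted_time, None, MAX_DURATION)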

Convert IF-ELSE statement in Linear Programming (ORTools)

I am trying to create an Optimization for Gas Storage using Linear Programming (OR Tools).
I need to write a case like this:
if current_balance > 70% of Total Volume:
    set a limit for gas injection as 10
else:
    set a limit for gas injection as 30
Current balance is the Total amount of gas that is available today in a gas storage.
I tried looking at Big M notation.
Is there any other way besides Big M? And if I have to use Big M, how can I use it in the above problem?
Edited:
How can I build the equations for the following case:
if current_balance > 70% of Total Volume and current_balance < 80% of Total Volume:
    set a limit for gas injection as 10
else if current_balance > 80% of Total Volume:
    set a limit for gas injection as 30
I don't think there is another way besides Big M, although Big M behaves much better when you put some thought into it and choose M wisely, as small as possible rather than too big. If the current balance is never allowed to exceed the total volume, the following formulation is the tightest one for your case. Here exceed is a boolean variable indicating whether you are exceeding 70% of the total volume.
current_balance - (0.3 * TotalVolume) * exceed <= 0.7 * TotalVolume
gas_injection <= 30 - 20 * exceed
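In OR-Tools' Python wrapper (pywraplp) this formulation looks roughly like the sketch below; TOTAL_VOLUME, the bounds, and the objective are placeholders for whatever your storage model actually uses:

from ortools.linear_solver import pywraplp

TOTAL_VOLUME = 100.0  # placeholder capacity

solver = pywraplp.Solver.CreateSolver("CBC")  # MIP solver (the boolean needs one)

current_balance = solver.NumVar(0.0, TOTAL_VOLUME, "current_balance")
gas_injection = solver.NumVar(0.0, 30.0, "gas_injection")
exceed = solver.BoolVar("exceed")  # 1 when current_balance exceeds 70% of volume

# Forces exceed = 1 whenever current_balance > 70% of TOTAL_VOLUME.
# The "big M" here is only 30% of TOTAL_VOLUME, the tightest possible value.
solver.Add(current_balance - 0.3 * TOTAL_VOLUME * exceed <= 0.7 * TOTAL_VOLUME)

# Injection limit: 30 when exceed = 0, 10 when exceed = 1.
solver.Add(gas_injection <= 30 - 20 * exceed)

solver.Maximize(gas_injection)  # placeholder objective
if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print(gas_injection.solution_value(), exceed.solution_value())

Note that the constraint only enforces one direction (balance above 70% forces exceed = 1); when the balance is at or below 70%, the solver is free to set exceed = 0 and use the looser injection limit.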

LP Duals and Reduced Costs with CPLEX

I am working on a Column Generation algorithm using CPLEX to solve the Reduced Master Problem.
After adding the new variables to the RMP, I set their upper bounds to 0, solve the RMP again and retrieve their reduced costs (to check if the value I calculated matches the one provided by CPLEX).
In the first iterations, the reduced costs match.
However, after some iterations, I start getting different reduced costs.
When I run the CPLEX Interactive Optimizer, read in the LP model (or MPS) and compare the duals of the constraints, I get some different values.
Does it make any sense?
I've tried using different methods for solving my LP, and I also tried changing tolerances.
Problem stats
Objective sense : Minimize
Variables : 453308 [Fix: 8, Box: 453300]
Objective nonzeros : 6545
Linear constraints : 578166 [Less: 70814, Greater: 503886, Equal: 3466]
Nonzeros : 2710194
RHS nonzeros : 7986
Variables : Min LB: 0.0000000 Max UB: 74868.86
Objective nonzeros : Min : 0.01000000 Max : 10000.00
Linear constraints :
Nonzeros : Min : 0.004000000 Max : 396.8800
RHS nonzeros : Min : 0.01250000 Max : 74868.86
Displaying the solution quality I get these info:
Max. unscaled (scaled) bound infeas. = 8.52651e-014 (3.33067e-015)
Max. unscaled (scaled) reduced-cost infeas. = 2.24935e-010 (5.62339e-011)
Max. unscaled (scaled) Ax-b resid. = 5.90461e-011 (3.69038e-012)
Max. unscaled (scaled) c-B'pi resid. = 2.6489e-011 (7.27596e-012)
Max. unscaled (scaled) |x| = 45433 (2839.56)
Max. unscaled (scaled) |slack| = 4970.49 (80.1926)
Max. unscaled (scaled) |pi| = 295000 (206312)
Max. unscaled (scaled) |red-cost| = 411845 (330962)
Condition number of scaled basis = 1.1e+008
As mentioned in the comment by Erwin, what you are experiencing is probably degeneracy.
Both the primal and dual solutions are often not unique in problems larger than a toy model.
If you fix a set of primal variables at their optimal levels, then, assuming the solution was otherwise primal-dual optimal and is still stored in CPLEX, it should take zero iterations to reoptimize the model after applying the fixes, and hence it should return the same solution. But if no solution is stored in CPLEX and you reoptimize from scratch, then CPLEX may return a different (but also optimal) primal and/or dual solution.
Do you see iterations in the log?
To debug, try writing out the model before the fixing and after, then do a diff on these two files to make sure there's not a modeling/programming mistake on your side.
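For example, with the legacy CPLEX Python API the two snapshots could be written like this (new_var_indices is a hypothetical list of the columns you just added):

import cplex

cpx = cplex.Cplex()
# ... build or read the current RMP here ...

cpx.write("before_fix.lp")   # snapshot before touching the bounds

new_var_indices = []         # hypothetical: indices of the newly added columns
for j in new_var_indices:
    cpx.variables.set_upper_bounds(j, 0.0)

cpx.write("after_fix.lp")    # diff against before_fix.lp to catch any
                             # unintended model changes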
You are also welcome to contact me at bo.jensen (at) dk (dot) ibm (dot) com and I will try to help you as I don't follow stack overflow closely.
My guess would be that when you are setting up the subproblem, you fail to account for the reduced costs of variables that are out of the basis at their upper bound. Those reduced costs are essentially the dual values of the upper bound constraints and hence must be taken into account when setting up the subproblem.
This sort of accidental omission typically happens when the generated variables are created with an upper bound.
If this really is your problem, then your easiest solution may be simply not to specify upper bounds for the new variables, which you can do if the upper bound is implied (e.g., by the new variable being part of a clique constraint).
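One way to check this is to recompute the reduced costs yourself from the constraint duals and compare them with CPLEX's values. A sketch using the legacy CPLEX Python API, assuming the RMP was written out to rmp.lp: recall that in a minimization, a variable that is nonbasic at a finite upper bound has reduced cost d_j = c_j - pi'A_j <= 0, and -d_j is exactly the dual of the upper bound that has to enter the pricing problem.

import cplex

cpx = cplex.Cplex("rmp.lp")  # hypothetical: the exported RMP
cpx.solve()

pi = cpx.solution.get_dual_values()    # duals of the linear constraints
dj = cpx.solution.get_reduced_costs()  # CPLEX's reduced costs
c = cpx.objective.get_linear()

for j in range(cpx.variables.get_num()):
    col = cpx.variables.get_cols(j)    # the column of variable j (SparsePair)
    my_dj = c[j] - sum(pi[i] * a for i, a in zip(col.ind, col.val))
    if abs(my_dj - dj[j]) > 1e-6:
        print(f"var {j}: mine = {my_dj:.6g}, cplex = {dj[j]:.6g}")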

AWS CloudWatch metric math with a cumulative metric's value 30 minutes ago to show rate of change

I have an AWS CloudWatch custom metric that represents a cumulative value which continues to increase over time. I will add that metric to a dashboard, but I also want to show the rate of change of this metric over the last 30 minutes. Ideally I would like a function that returns the metric's value from 30 minutes ago so I can subtract that from the current value. The RATE() function does not seem to help.
I could submit the metric's value a second time with a timestamp that is 30 minutes in the future and subtract the two metrics, but I am hoping for a solution that uses metric math and does not force me to submit another metric. I can think of other use cases where I might want to do math with metrics from different time periods.
Hope I am just missing something here!
You can use some arithmetic to obtain the previous value and then you're able to calculate the percentage of change as you want.
The value you want is: (value_now - value_before) / value_before
Breaking this into 2 parts:
Obtain value_now - value_before. This is the absolute delta of the values.
Obtain value_before. This is the value of the metric at the previous datapoint.
Assume that your metric in CloudWatch is m.
Step 1: The absolute delta
The absolute_delta can be obtained with: absolute_delta = RATE(m) * PERIOD(m).
Step 2: The previous value
With some arithmetic it is possible to obtain previous_value. Given the definition of absolute delta:
absolute_delta = value_now - value_before
Since we have value_now = m and we already have absolute_delta, it's just a matter of rearranging the equation:
value_before = value_now - absolute_delta
Final equation
Just plug everything together and you have your final metric:
change_percentage = 100 * absolute_delta / value_before
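Written as the three expressions you would add in the console (the ids e1, e2, e3 are arbitrary):

e1: RATE(m) * PERIOD(m)    (absolute_delta)
e2: m - e1                 (value_before)
e3: 100 * e1 / e2          (change_percentage)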
In CloudWatch terms:
Metric math function RATE() calculates the rate of change per second.
Returns the rate of change of the metric, per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values.
From https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
So to get the rate of change for your period you could do this:
RATE(m1)*PERIOD(m1)
and set the period of the dashboard to the wanted value.
The problem in your case is that you need it for a period of 30 minutes, and I don't think you can set 30 minutes as the period on the CloudWatch dashboard. The closest values would be 15 minutes or 1 hour.
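If the dashboard's fixed period choices are the blocker, the same expressions accept an explicit 30-minute period (1800 seconds) through the GetMetricData API. A sketch with boto3, where the namespace and metric name are made-up placeholders:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {   # the raw cumulative metric, sampled on a 30-minute period
            "Id": "m",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "CumulativeValue"},
                "Period": 1800,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {"Id": "e1", "Expression": "RATE(m) * PERIOD(m)", "ReturnData": False},
        {"Id": "e2", "Expression": "m - e1", "ReturnData": False},
        {"Id": "e3", "Expression": "100 * e1 / e2", "Label": "pct change / 30 min"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
)
print(response["MetricDataResults"])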

Counting riemann events in rate function

Hi, I have a use case where I have to aggregate my application response time over a time interval of 10, i.e. (rate 10 ...), and then calculate the average. The real problem is that there is no way to get the number of events in Riemann's rate function for the time interval of 10. Is there any way to do that other than using (fixed-time-window ...)?
Rate is unlikely to be the function you want for this. If I understand it you would like your code to:
gather all the events that happen in ten minutes
once the events are all available, calculate the average of the :metric key
emit one event with the service name, and that average as the value of its metric.
If I'm not understanding then this answer won't fit, so let me know.
Rate takes in a reporting interval and any number of streams to forward the computed rate to. That first parameter to rate only determines how often it reports the current rate and has no effect on the period over which the rate is aggregated. The built-in rate function has only one aggregation interval: it always reports in "metric per second". So it accumulates events for one second, averages the metric over that second, and properly handles edge cases like reporting intervals with no events (emitting a zero metric a reasonable number of times, though not forever). You should use rate where it fits, and not use it where you need explicit control over the aggregation period.
I often want events per minute, so I set the reporting period to 60 seconds, then multiply the output by 60 before forwarding it. This saves me handling all the edge cases in custom aggregation functions. Keep in mind that this loses some accuracy in the rounding.
(rate 60
  (adjust [:metric * 60]
    index datadog))
You may also want to do something like:
(fixed-time-window 10
  (smap folds/median
    ... your stuff ...))

Running a GEE model for a rate, and adjusting with a covariate that is also a rate (GENMOD SAS)

I want to run a GEE for clustered data. I am trying to get incidence rate ratios (IRR) for antibiotic reactions between two drugs. I have searched for information on constructing GEE models (GENMOD in SAS, xtgee in Stata) but I can't find criteria on what types of variables can be included as covariates. My model is this:
proc genmod data=mydata;
  class Pt fev1_cat;
  model rate_pip = cumulative_dose_before fev1_cat Average_Dose_Admis mero_rate /
        type3 dist=poisson link=log;
  repeated subject=Pt;
run;
rate_pip is the rate of adverse events (AE) for the antibiotic in question; mero_rate is the rate of AE for a different antibiotic. The other variables are either categorical or continuous.
If I adjust the GEE with a covariate that is a rate, is it 1) a correct use of the GEE model, and 2) would the interpretation of exp(coef) be the IRR between the two rates of AE, or is it interpreted as: for each unit increase in mero_rate, the rate of AE (rate_pip) is x times higher/lower?
I can't say whether this is a correct use of a GEE model without knowing a little more about the data structure, but I don't know of anything special about GEE models that would preclude the use of rate variables as predictors (as compared to, say, an ordinary least squares regression model). If the model is okay without the mero_rate predictor, it would probably be okay with it too. Maybe the caveat is that it can't be too correlated with the other predictors.
As far as interpretation goes, I think you've pretty much got it. The log of the incidence rate increases by beta units for a mero_rate value of x+1 events per unit time, compared to a mero_rate value of x events per unit time, all other things equal.
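For what it's worth, the interpretation is easy to sanity-check outside SAS as well. A sketch of the analogous model in Python with statsmodels' GEE, assuming a DataFrame mydata with the same columns as the GENMOD call (the independence working correlation mirrors GENMOD's default):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# mydata: one row per observation, clustered by patient id Pt
mydata = pd.read_csv("mydata.csv")  # hypothetical source

model = smf.gee(
    "rate_pip ~ cumulative_dose_before + C(fev1_cat) + Average_Dose_Admis + mero_rate",
    groups="Pt",
    data=mydata,
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Independence(),
)
result = model.fit()

# exp(coef): multiplicative change in the predicted rate per one-unit
# increase in each predictor, all else equal
print(np.exp(result.params))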