I'm putting together some tables with results from a series of Cox Proportional Hazards Models. I'd like to exponentiate the coefficients so the tables display the Hazard Ratios rather than the raw beta values. Does anyone know of a way to do this with huxtable? It's my preferred package for building regression tables. I've done some googling and can't find a solution.
You can use the tidy_args argument to huxreg:
library(huxtable)
library(survival)
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
mod <- coxph(Surv(time, status) ~ x + strata(sex), test1)
huxreg(mod)
─────────────────────────────────────────────────
(1)
─────────────────────────
x 0.802
(0.822)
─────────────────────────
N 5.000
R2 0.144
logLik -3.328
AIC 8.655
─────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1
huxreg(mod, tidy_args = list(exponentiate = TRUE))
─────────────────────────────────────────────────
(1)
─────────────────────────
x 2.231
(0.822)
─────────────────────────
N 5.000
R2 0.144
logLik -3.328
AIC 8.655
─────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1
tidy(mod, exponentiate = TRUE) appears to exponentiate the coefficients but not the standard errors, which is presumably a bug in broom and worth reporting? Confidence intervals appear correct, though, so you can do:
huxreg(mod, tidy_args = list(exponentiate = TRUE),
error_format = "[{conf.low}-{conf.high}]", ci_level = 0.95)
─────────────────────────────────────────────────
(1)
─────────────────────────
x 2.231
[0.445-11.180]
─────────────────────────
N 5.000
R2 0.144
logLik -3.328
AIC 8.655
─────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1
I am trying to write a multivariate Singular Spectrum Analysis with a Monte Carlo test. To this end I am working on a code piece that can reconstruct the input series using the lagged trajectory matrix and projection base (ST-PCs) that result from the PCA/SSA decomposition of the input series. The attached code piece works for a lagged univariate (that is, single) time series, but I am struggling to make this reconstruction work for a lagged multivariate time series. I don't quite get the procedure mathematically and, not surprisingly, I also did not manage to program it. Useful links are attached to the function descriptions of the accompanying code. Input data should be of the form (time x number of series), so say 288x3, implying 3 time series of 288 time levels.
I hope you can help me out!
import numpy as np
def lagged_covariance_matrix(data, M):
""" Computes the lagged covariance matrix using the Broomhead & King method
Background: Plaut, G., & Vautard, R. (1994). Spells of low-frequency oscillations and
weather regimes in the Northern Hemisphere. Journal of the atmospheric sciences, 51(2), 210-236.
Arguments:
data : pxn time series, where p denotes the length of the time series and n the number of channels
M : window length """
    # explicitly 'add' a spatial dimension if the input is a single time series
if np.ndim(data) == 1:
data = np.reshape(data,(len(data),1))
T = data.shape[0]
L = data.shape[1]
N = T - M + 1
X = np.zeros((T, L, M))
for i in range(M):
X[:,:,i] = np.roll(data, -i, axis = 0)
X = X[:N]
# X constitutes the trajectory matrix and is a stacked hankel matrix
X = np.reshape(X, (N, M*L), order = 'C') # https://www.jstatsoft.org/article/viewFile/v067i02/v67i02.pdf
# choose the smallest projection basis for computation of the covariance matrix
if M*L >= N:
return 1/(M*L) * X.dot(X.T), X
else:
return 1/N * X.T.dot(X), X
def sort_by_eigenvalues(eigenvalues, PCs):
""" Sorts the PCs and eigenvalues by descending size of the eigenvalues """
desc = np.argsort(-eigenvalues)
return eigenvalues[desc], PCs[:,desc]
def Reconstruction(M, E, X):
""" Reconstructs the series as the sum of M subseries.
See: https://en.wikipedia.org/wiki/Singular_spectrum_analysis, 'Basic SSA' &
the work of Vivien Sainte Fare Garnot on univariate time series (https://github.com/VSainteuf/mcssa)
Arguments:
M : window length
E : eigenvector basis
X : trajectory matrix """
time = len(X) + M - 1
RC = np.zeros((time, M))
# step 3: grouping
for i in range(M):
d = np.zeros(M)
d[i] = 1
I = np.diag(d)
        Q = np.flipud(X @ E @ I @ E.T)
# step 4: diagonal averaging
for k in range(time):
RC[k, i] = np.diagonal(Q, offset = -(time - M - k)).mean()
return RC
#=====================================================================================================
#=====================================================================================================
#=====================================================================================================
# input data
data = None
# number of lags a.k.a. window length
M = 45 # M = 1 means no lag
covmat, X = lagged_covariance_matrix(data, M)
# get the eigenvalues and vectors of the covariance matrix
vals, vecs = np.linalg.eig(covmat)
eig_data, eigvec_data = sort_by_eigenvalues(vals, vecs)
# component reconstruction
recons_data = Reconstruction(M, eigvec_data, X)
The following works but does not make direct use of the projection base (ST-PCs). Hence the original question still stands, but this already helps a great deal and solves the problem for me. This code piece makes use of the similarity between the ST-PCs projection base and the u & vt matrices obtained from the singular value decomposition of the lagged trajectory matrix. I think it gives back the same answer as one would obtain using the ST-PCs projection base?
def lag_reconstruction(data, X, M, pairs = None):
""" Reconstructs the series as the sum of M subseries using the lagged trajectory matrix.
Based on equation 2.9 of Plaut, G., & Vautard, R. (1994). Spells of low-frequency oscillations and weather regimes in the Northern Hemisphere. Journal of Atmospheric Sciences, 51(2), 210-236.
Inspired by work of R. van Westen and C. Wieners """
time = data.shape[0] # number of time levels of the original series
L = data.shape[1] # number of input series
N = time - M + 1
u, s, vt = np.linalg.svd(X, full_matrices = False)
rc = np.zeros((time, L, M))
for t in range(time):
counter = 0
for i in range(M):
if t-i >= 0 and t-i < N:
counter += 1
if pairs:
for k in pairs:
rc[t,:,i] += u[t-i, k] * s[k] * vt[k, i*L : i*L + L]
else:
for k in range(len(s)):
rc[t,:,i] += u[t-i, k] * s[k] * vt[k, i*L : i*L + L]
rc[t] = rc[t]/counter
return rc
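One way to sanity-check the reconstruction (my own addition, using random toy data of the 288x3 shape mentioned in the question rather than the real input) is that, when all components are kept, summing the reconstructed components over the lag axis should give back the original series:

# toy data in place of the real input: 3 series of 288 time levels
rng = np.random.default_rng(0)
toy = rng.standard_normal((288, 3))
M_toy = 45

_, X_toy = lagged_covariance_matrix(toy, M_toy)
rc_toy = lag_reconstruction(toy, X_toy, M_toy)

# should be ~0 if the reshape in lagged_covariance_matrix and the
# vt[k, i*L : i*L + L] slicing above use the same column ordering
print(np.abs(rc_toy.sum(axis=2) - toy).max())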
I have this function to reach a certain 1-dimensional value with acceleration and damping, with overshoot. That is: given an initial value, a velocity and an acceleration (force/mass), the target value is reached by accelerating towards it, with the motion increasingly damped as it gets closer to the target value.
This all works fine; however, if I want to know what the TotalAngle is after time 't', I have to run this function for, say, N steps with a 'small' dt to find the 'limit'.
I was wondering if I can (and how to) integrate over dt so that the TotalAngle can be determined directly for a given time 't'.
Thanks for any help.
dt = delta time step per frame
input = 1
TotalAngle = 0 at t=0
Velocity = 0 at t=0
void FAccelDampedWithOvershoot::Update(float dt, float input, float& Velocity, float& TotalAngle)
{
const float Force = 500000.f;
const float DampForce = 5000.f;
const float MaxAngle = 45.f;
const float InvMass = 1.f / 162400.f;
float target = MaxAngle * input;
float ratio = (target - TotalAngle) / MaxAngle;
float fMove = Force * ratio;
float fDamp = -Velocity * DampForce;
    Velocity += (fMove + fDamp) * InvMass * dt;
TotalAngle += Velocity * dt;
}
Updated with fixed bugs in the math.
Originally I lost the mass and MaxAngle terms a few times; this is why you should first solve it on paper and then post to SO, rather than trying to solve it in the text editor. Anyway, I've fixed the math and now it seems to work reasonably well. I put the fixed solution right above the previous one.
Well, this looks like Newtonian mechanics, which means differential equations. Let's try to solve them.
SO is not very friendly to math formulas and typing them out is tedious, so here is the notation I use:
F = Force
Fd = DampForce
MA = MaxAngle
A = TotalAngle
v = Velocity
m = 1 / InvMass
' denotes a derivative, i.e. something' is the 1st derivative of something with respect to t and something'' is the 2nd derivative
If I divide your last two lines of code by dt and merge in all the other lines, I get (I also assume that input = 1, as the other case is obviously symmetrical)
v' = ([F * (1 - A / MA)] - v * Fd) / m
and applying A' = v we get
m * A'' = F(1 - A/MA) - Fd * A'
or moving to one side we get a simple 2-nd order differential equation
m * A'' + Fd * A' + F/MA * A = F
IIRC, the way to solve it is to first solve the characteristic equation, which here is
m * x^2 + Fd * x + F/MA = 0
x[1,2] = (-Fd +/- sqrt(Fd^2 - 4*F*m/MA))/ (2*m)
I expect the part under the sqrt, i.e. (Fd^2 - 4*F*m/MA), to be negative, thus the solution should be of the following form. Let
Dm = Fd/(2*m)
K = sqrt(F/MA/m - Dm^2)
(note the negated value under the sqrt so that it is now positive) then
A(t) = e^(-Dm*t) * [P * sin(K*t) + Q * cos(K*t)] + C
where P, Q and C are some constants.
The solution is easier to find as a sum of two solutions: some specific solution for
m * A'' + Fd * A' + F/MA * A = F
and a general solution of the homogeneous equation
m * A'' + Fd * A' + F/MA * A = 0
that makes the original conditions fit. Obviously the specific solution A(t) = MA works and thus C = MA. So now we need to fit P and Q of the general solution to match the starting conditions A(0) = 0 and A'(0) = V(0) = 0, which means the e^(-Dm*t) * [P * sin(K*t) + Q * cos(K*t)] part must equal -MA and have zero derivative at t = 0. Given that e^0 = 1, sin(0) = 0 and cos(0) = 1, you get (approximately, since Dm is tiny compared to K; the exact value is P = -Dm*MA/K)
P = 0
Q = -MA
C = MA
thus
A(t) = MA * [1 - e^(-Dm*t) * cos(K*t)]
where
Dm = Fd/(2*m)
K = sqrt(F/MA/m - Dm^2)
which kind of makes sense given your task.
Note also that this equation assumes that everything happens in radians rather than degrees (i.e. the derivative [sin(t)]' is just cos(t)), so you should transform all your constants accordingly or transform the solution.
const float Force = 500000.f * M_PI / 180;
const float DampForce = 5000.f * M_PI / 180;
const float MaxAngle = M_PI_4;
which on my machine produces
Dm = 0.000268677541
K = 0.261568546
This seems to be similar to the original function if I step it with dt = 0.01f, and the main obstacle seems to be precision loss because of float.
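If you want to check the closed form against the stepwise integration numerically, here is a rough sketch (in Python rather than the original C++, using the radian-converted constants above; the helper names are mine and purely illustrative):

import math

# constants from the question, converted to radians as suggested above
F  = 500000.0 * math.pi / 180.0   # Force
Fd = 5000.0 * math.pi / 180.0     # DampForce
MA = math.pi / 4.0                # MaxAngle
m  = 162400.0                     # mass = 1 / InvMass

Dm = Fd / (2.0 * m)
K  = math.sqrt(F / MA / m - Dm ** 2)

def angle_closed_form(t):
    """A(t) = MA * (1 - e^(-Dm*t) * cos(K*t))"""
    return MA * (1.0 - math.exp(-Dm * t) * math.cos(K * t))

def angle_stepped(t, dt=0.01):
    """Explicit stepping of the original Update() logic, with input = 1."""
    A, v = 0.0, 0.0
    for _ in range(int(t / dt)):
        f_move = F * (MA - A) / MA      # Force * ratio
        f_damp = -v * Fd                # -Velocity * DampForce
        v += (f_move + f_damp) * dt / m
        A += v * dt
    return A

for t in (1.0, 10.0, 60.0):
    print(t, angle_closed_form(t), angle_stepped(t))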
Hope this helps!
This is not a full answer and I am sure someone else can work it out, but there is no room in the comments and it may help you find a better solution.
The image below shows the velocity (blue) as your function integrates with time steps of 1. The red curve shows the function below, which calculates the value directly for time t.
The function F(t)
F(t) = sin((t / f) * pi * 2) * (1 / (((t / f) + a) ^ c)) * b
With f = 23.7, a = 1.4, c = 2, and b = 50, that gives the red plot in the image above.
All the values are just approximations.
f determines the frequency and is close to a match;
a, b, c control the falloff in amplitude and are a by-eye guesstimate.
If a perfect match does not matter, then this will work for you. totalAngle uses the same function but with 0.25 added to t. Unfortunately I did not get any values for a, b, c for totalAngle, and I did notice that it was offset, so you will have to add an offset value d (I normalised everything, so I have no idea what the range of totalAngle was).
Function F(t) for totalAngle
F(t) = sin(((t+0.25) / f) * pi * 2) * (1 / ((((t+0.25) / f) + a) ^ c)) * b + d
Sorry, I only have f = 23.7, c = 2, a ≈ 1.4, and nothing for b or d.
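For completeness, a minimal sketch (Python, purely illustrative; the helper name is mine) of evaluating the velocity approximation with the constants given above:

import math

def velocity_approx(t, f=23.7, a=1.4, c=2.0, b=50.0):
    # F(t) = sin((t / f) * pi * 2) * (1 / ((t / f + a) ^ c)) * b
    return math.sin((t / f) * math.pi * 2.0) * (1.0 / ((t / f + a) ** c)) * b

# sample a few points of the decaying oscillation
print([round(velocity_approx(t), 2) for t in range(0, 120, 10)])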
I haven't been able to find particular solutions to this differential equation.
from sympy import *
m = float(raw_input('Mass:\n> '))
g = 9.8
k = float(raw_input('Drag Coefficient:\n> '))
v = Function('v')
f1 = g * m
t = Symbol('t')
equation = dsolve(f1 - k * v(t) - m * Derivative(v(t), t), v(t))
print equation
for m = 1000 and k = .2 it returns
Eq(v(t), C1*exp(-0.0002*t) + 49000.0)
which is correct, but I want the equation solved for the initial condition v(0) = 0, which should return
Eq(v(t), 49000.0*(1 - exp(-0.0002*t)))
I believe SymPy is not yet able to take initial conditions into account automatically. Although dsolve has the option ics for entering initial conditions (see the documentation), it appears to be of limited use.
Therefore, you need to apply the initial conditions manually. For example:
C1 = Symbol('C1')
C1_ic = solve(equation.rhs.subs({t:0}),C1)[0]
print equation.subs({C1:C1_ic})
Eq(v(t), 49000.0 - 49000.0*exp(-0.0002*t))
I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work.
The Julia code to calculate the Gradient:
sigmoid(z) = 1 / (1 + e ^ -z)
hypotesis(theta, x) = sigmoid(scalar(theta' * x))
function gradient(theta, x, y)
(m, n) = size(x)
h = [hypotesis(theta, x[i,:]') for i in 1:m]
g = Array(Float64, n, 1)
for j in 1:n
g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
end
g
end
If this gradient is used, it produces wrong results. I can't figure out why; the code seems right.
The full Julia script: in this script the optimal theta is calculated both with my gradient descent implementation and with the Optim package, and the results are different.
The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.
If you look at the values of the cost function during your gradient descent, you will see a lot of NaN, which probably come from the exponential: lowering the step size (e.g., to 1e-5) will avoid the overflow, but you will have to increase the number of iterations a lot (perhaps to 10_000_000).
A better (faster) solution would be to let the step size vary. For instance, one could multiply the step size by 1.1 if the cost function improves after a step (the optimum still looks far away in this direction: we can go faster), and divide it by 2 if it does not (we went too fast and ended up past the minimum). One could also do a line search in the direction of the gradient to find the best step size (but this is time-consuming and can be replaced by approximations, e.g., Armijo's rule).
Rescaling the predictive variables also helps.
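To illustrate the variable step-size idea, here is a minimal sketch (in Python on a toy quadratic, not the OP's Julia code; the function name and toy objective are mine, and whether to also reject the overshooting step is a design choice):

import numpy as np

def adaptive_gradient_descent(cost, grad, theta0, alpha=1e-3, n_iter=100000):
    """Gradient descent that speeds up while the cost keeps improving
    and slows down after an overshoot, as described above."""
    theta = np.asarray(theta0, dtype=float)
    c_prev = cost(theta)
    for _ in range(n_iter):
        theta = theta - alpha * grad(theta)
        c = cost(theta)
        if c < c_prev:       # improvement: increase the step size
            alpha *= 1.1
        else:                # we went past the minimum: decrease it
            alpha /= 2.0
        c_prev = c
    return theta

# toy usage: minimise a simple quadratic with minimum at 3
cost = lambda th: float(np.sum((th - 3.0) ** 2))
grad = lambda th: 2.0 * (th - 3.0)
print(adaptive_gradient_descent(cost, grad, np.zeros(3)))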
I tried comparing gradient() in the OP's code with a numerical derivative of cost_j() (which is the objective function of the minimization), using the following routine:
function grad_num( theta, x, y )
g = zeros( 3 )
eps = 1.0e-6
disp = zeros( 3 )
for k = 1:3
disp[:] = theta[:]
disp[ k ]= theta[ k ] + eps
plus = cost_j( disp, x, y )
disp[ k ]= theta[ k ] - eps
minus = cost_j( disp, x, y )
g[ k ] = ( plus - minus ) / ( 2.0 * eps )
end
return g
end
But the gradient values obtained from the two routines do not seem to agree very well (at least for the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y ), from which it seems that the division by m is missing:
#/ OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )
#/ modified code
g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
Because I am not very sure that the above code and expression are really correct, please check them for yourself.
But in fact, regardless of whether I use the original or the corrected gradient, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the step size alpha adaptively, following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):
function gradient_descent(x, y, theta, alpha, n_iterations)
...
c = cost_j( theta, x, y )
for i = 1:n_iterations
c_prev = c
c = cost_j( theta, x, y )
if c - c_prev < 0.0
alpha *= 1.01
else
alpha /= 1.05
end
theta[:] = theta - alpha * gradient(theta, x, y)
end
...
end
and called this routine as
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
The variation of cost_j versus iteration steps is plotted below.
I need to compute a similarity measure called the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C / Rcpp. The function runs great, but as I am not a computer scientist by background I would like to know if it could run faster. This code is suitable for parallelisation, but I have no experience parallelising C code.
The Dice coefficient is a simple measure of similarity / dissimilarity (depending on how you take it). It is intended to compare asymmetric binary vectors, meaning one of the combinations (usually 0-0) is not important and agreements (1-1 pairs) carry more weight than disagreements (1-0 or 0-1 pairs). Imagine the following contingency table:
      1   0
  1   a   b
  0   c   d
The Dice coefficient is: (2*a) / (2*a + b + c)
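Just to make the counts concrete, here is a tiny sketch (Python here purely for illustration; a, b, c are the cells of the table above, and the helper name is mine) computing the coefficient for two short binary vectors:

import numpy as np

def dice_pair(u, v):
    """Dice coefficient for two binary vectors: 2a / (2a + b + c)."""
    u, v = np.asarray(u), np.asarray(v)
    a = np.sum((u == 1) & (v == 1))   # 1-1 agreements
    b = np.sum((u == 1) & (v == 0))   # 1-0 disagreements
    c = np.sum((u == 0) & (v == 1))   # 0-1 disagreements
    return 2 * a / (2 * a + b + c)

print(dice_pair([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # a=2, b=1, c=1 -> 0.666...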
Here is my Rcpp implementation:
library(Rcpp)
cppFunction('
NumericMatrix dice(NumericMatrix binaryMat){
int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
NumericMatrix results(ncols, ncols);
for(int i=0; i < ncols-1; i++){ // columns fixed
for(int j=i+1; j < ncols; j++){ // columns moving
double a = 0;
double d = 0;
for (int l = 0; l < nrows; l++) {
if(binaryMat(l, i)>0){
if(binaryMat(l, j)>0){
a++;
}
}else{
if(binaryMat(l, j)<1){
d++;
}
}
}
// compute Dice coefficient
double abc = nrows - d;
double bc = abc - a;
results(j,i) = (2*a) / (2*a + bc);
}
}
return wrap(results);
}
')
And here is a running example:
x <- rbinom(1:200000, 1, 0.5)
X <- matrix(x, nrow = 200, ncol = 1000)
system.time(dice(X))
user system elapsed
0.814 0.000 0.814
The solution proposed by Roland was not entirely satisfying for my use case, so based on the source code of the arules package I implemented a much faster version. The code in arules relies on an algorithm from Leisch (2005) using the tcrossprod() function in R.
First, I wrote an Rcpp / RcppEigen version of crossprod that is 2-3 times faster. This is based on the example code in the RcppEigen vignette.
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;
const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));
const int m(A.rows()), n(A.cols());
MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));
return wrap(AtA);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
Then I wrote a small R function to compute the Dice coefficient.
diceR <- function(X){
a <- fcprd(X)
nx <- ncol(X)
rsx <- colSums(X)
c <- matrix(rsx, nrow = nx, ncol = nx) - a
# b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
b <- t(c)
m <- (2 * a) / (2*a + b + c)
return(m)
}
This new function is ~8 times faster than the old one and ~3 times faster than the one in arules.
library(microbenchmark)
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method = "dice"), times = 100)
m
# Unit: milliseconds
# expr min lq median uq max neval
# dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635 100
# diceR(X) 62.98642 76.5510 92.02528 159.2557 507.1662 100
# dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492 100
I cannot run your function at work, but is the result the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))
system.time(dissimilarity(X,method="dice"))
#user system elapsed
#0.04 0.00 0.04