Predict zero-inflated negative binomial to raster with raster::predict - r-raster

I am trying to predict a ZINB model to a raster. My code looks like this:
m1 <- pscl::zeroinfl(Herbround ~ LandcoverCode | LandcoverCode, data = data, dist = 'negbin')
r <- raster("data\\final.tif") # read in raster
names(r) <- "LandcoverCode" # match terminology in the model
#predict model to raster
predict(r, m1)
Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]) :
contrasts apply only to factors
In addition: Warning message:
In model.frame.default(delete.response(object$terms$full), newdata, :
variable 'LandcoverCode' is not a factor
I then get the error above. If I use an lm or glm model it works without issues and I know that my LandcoverCode is in fact a factor with multiple levels (6 to be exact). I have tried adding in:
predict(r, m1, type = "count")
But that doesn't work either. Any help with predicting a ZINB model to a raster would be greatly appreciated, thanks.

Related

Combine multiple (>2) survival curves (null models) in same plot

I am trying to combine multiple survfit objects on the same plot, using function ggsurvplot_combine from package survminer. When I made a list of 2 survfit objects, it perfectly works. But when I combine 3 survfit objects in different ways, I receive the error:
error in levels - ( tmp value = as.character(levels)): factor level 3 is duplicated
I've read similar posts on combining survivl plots (https://cran.r-project.org/web/packages/survminer/survminer.pdf, https://github.com/kassambara/survminer/issues/195, R plotting multiple survival curves in the same plot, https://rpkgs.datanovia.com/survminer/reference/ggsurvplot_combine.html) and on this specific error, for which solutions are been provided with using 'unique'. However, I do not even understand for which factor variable this error accounts. I do not have the right to share my data or figures, so I'll try to replicate it:
Data:
time: follow-up between untill event or end of follow-up
endpoints: 1= event, 0=no event or censor
Null models:
KM1 <- survfit(Surv(data$time1,data$endpoint1)~1,
type="kaplan-meier", conf.type="log", data=data)
KM2 <- survfit(Surv(data$time2,data$endpoint2)~1, type="kaplan-meier",
conf.type="log", data=data)
KM3 <- survfit(Surv(data$time3,data$endpoint3)~1, type="kaplan-meier",
conf.type="log", data=data)
List null models:
list_that_works <- list(KM1,KM3)
list_that_fails <- list(KM1,KM2,KM3)
It seems as if the list contains of just two arguments: list(PFS=, OS=)
Combine >2 null models in one plot:
ggsurvplot_combine(list_that_works, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the plot I'm looking for, but with 2 cumulative incidence curves.
ggsurvplot_combine(list_that_fails, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives error 'error in levels - ( tmp value = as.character(levels)): factor level 3 is duplicated'.
When I try combining 3 plots with using
ggsurvplot(c(KM1,KM2,KM3), data=data, conf.int=TRUE, fun="event", combine=TRUE), it gives the error:
Error: Problem with mutate() 'column 'survsummary'
survsummary = purrr::map2(grouped.d$fit, grouped.d$name, .surv_summary, data=data'. x $ operator is invlid for atomic vectors.
Any help is highly appreciated!
Also another way to combine surv fits is very welcome!
My best bet is that it has something to do with the 'list' function that only contains of two arguments: list(PFS=, OS=)
I fixed it! Instead of removing the post, I'll share my solution, it may be of help for others:
I made a list of the formulas instead of the null models, so:
formulas <- list(
KM1 = Surv(time1, endpoint1)~1,
KM2 = Surv(time2, endpoint2)~1,
KM3 = Surv(time3, endpoint3)~1)
I made a null model of the 3 formulas at once:
fit <- surv_fit(formulas, data=data)
Then I made a plot with this survival fit:
ggsurvplot_combine(fit, data=data)

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
So a few details about the simple version of the model of interest:
I created a data example to show what the problem is
Dependent variable (will call it DV1) with 3 categories of -1, 0, 1 (unordered and 0 as reference)
Independent variables: 2 continuous variables, 3 binary variables, interaction of 2 of the 3 binary variables
Years: 1995-2003
Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
N = 900,
id = rep(1:900, 1),
IV1 = draw_binary(0.5, N = N),
IV2 = draw_binary(0.5, N = N),
IV3 = draw_binary(0.5, N = N),
IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))
library(AlgDesign)
DV = gen.factorial(c(3), 1, center=TRUE, varNames=c("DV"))
year = gen.factorial(c(9), 1, center=TRUE, varNames=c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year==-4]= 1995
year[year==-3]= 1996
year[year==-2]= 1997
year[year==-1]= 1998
year[year==0]= 1999
year[year==1]= 2000
year[year==2]= 2001
year[year==3]= 2002
year[year==4]= 2003
data1=cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")
## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names=FALSE)
## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit
factor-variable and time-series operators not allowed
depvar and indepvars may not contain factor variables or time-series operators
mlogit
fe option: fe not allowed
used i.year: omits certain variables and/or does not give me standard errors and only shows coefficients (example in code below)
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit
error - does not run
error message: total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
xtset year
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions (1=1 and 0/-1 set to zero, then -1=1 and 0/1 set to zero). However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).

Dimensions problem linear regression Python scikit learn

I'm implementing a function in which I have to perform a linear regression using scikit learn.
What I have when running it with an example:
X_train.shape=(34,3)
X_test.shape=(12,3)
Y_train.shape=(34,1)
Y_test.shape=(12,1)
Then
lm.fit(X_train,Y_train)
Y_pred = lm.predict(X_test)
However Python tells me there is a mistake at this line
dico['R2 value']=lm.score(Y_test, Y_pred)
What Python tells me:
ValueError: shapes (12,1) and (3,1) not aligned: 1 (dim 1) != 3 (dim 0)
Thanks in advance for the help anyone could bring me :)
Alex
For using lm.score() you need to pass X_test, y_test.
dico['R2 value']=lm.score(X_test, Y_test)
See the documentation here:
score(X, y, sample_weight=None)
X : array-like, shape = (n_samples, n_features) Test samples.
For some estimators this may be a precomputed kernel matrix instead,
shape = (n_samples, n_samples_fitted], where n_samples_fitted is the
number of samples used in the fitting for the estimator.
y : array-like, shape = (n_samples) or (n_samples, n_outputs) True values for X.
sample_weight : array-like, shape = [n_samples], optional Sample weights.
You are trying to use the score method as a metric method, which is wrong. A score() method on any estimator will itself calculate the predictions and then send them to appropriate metric scorer.
If you want to use Y_test and Y_pred yourself, then you can do this:
from sklearn.metrics import r2_score
dico['R2 value'] = r2_score(Y_test, Y_pred)

AssertError when fitting a model

I have a small data set, it has less than 2000 rows. I am trying to fit a LinearRegressionModel using ML, well the data set has only one feature (which I already normalized), after the model was fitted, I evaluated it using a RegressionEvaluator and measuring metrics R2 and RMSE. Then I noticed the error was high, and hence decided to create more artificial features, in order to describe better the phenomena. To achieve this I created the following UDF (notice I check it works).
numberFeatures = 12
def addFeatures(value):
v = value.toArray()[0]
return Vectors.dense([v ** (1.0 / x) for x in xrange(2, 10)] +
[v ** x for x in xrange(1, numberFeatures)])
addFeaturesUDF = udf(addFeatures, VectorUDT())
# Here I test it
print(addFeatures(Vectors.dense(2)))
# [1.0,0.666666666667,0.5,0.4,0.333333333333,0.285714285714,0.25,0.222222222222,2.0,4.0,8.0,16.0,32.0,64.0,128.0,256.0,512.0,1024.0,2048.0]
After this I modify my DataFrame to add more features, using addFeaturesUDF, I can show it bellow.
dtBoosted = dt.withColumn("features", addFeaturesUDF(col("features")))
dtBoosted.show(5)
#+--------+-----+----------+--------------------+
#| date|price| feature| features|
#+--------+-----+----------+--------------------+
#|733946.0| 9.92|[733946.0]|[0.0,0.0,0.0,0.0,...|
#|733948.0| 8.05|[733948.0]|[4.88997555012224...|
#|733949.0| 8.05|[733949.0]|[7.33496332518337...|
#|733950.0| 7.91|[733950.0]|[9.77995110024449...|
#|733951.0| 7.91|[733951.0]|[0.00122249388753...|
#+--------+-----+----------+--------------------+
# only showing top 5 rows
And works, but when I attempt to fit the model it shows an AssertError.
dtTrain, dtValidation = dtBoosted.randomSplit([0.75, 0.25], seed=107)
lr = LinearRegression(maxIter=100, labelCol="price", featuresCol="features")
lrm = lr.fit(dtTrain)
What is the problem? What am I doing wrong? It worked with one feature and some other features!

PCA for dimensionality reduction before Random Forest

I am working on binary class random forest with approximately 4500 variables. Many of these variables are highly correlated and some of them are just quantiles of an original variable. I am not quite sure if it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?
I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
Many thanks in advance.
My experience is that PCA before RF is not an great advantage if any. Principal component regression(PCR) is e.g. when, PCA assists to regularize training features before OLS linear regression and that is very needed for sparse data-sets. As RF itself already performs a good/fair regularization without assuming linearity, it is not necessarily an advantage. That said, I found my self writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of a data set of 100 features comprising only 5 true linear components. Under such cercumstances it is infact a small advantage to pre-filter with PCA
The code is a seamless implementation, such that every RF parameters are simply passed on to RF. Loading vector are saved in model_fit to use during prediction.
#I would like to be able to know which variables are more significant to my model, but if I use PCA, I would be only able to tell what PCs are more important.
The easy way is to run without PCA and obtain variable importances and expect to find something similar for PCA-RF.
The tedious way, wrap the PCA-RF in a new bagging scheme with your own variable importance code. Could be done in 50-100 lines or so.
The souce-code suggestion for PCA-RF:
#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
f.args=as.list(match.call()[-1])
pca_obj = princomp(x)
rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
out=mget(ls())
class(out) = "PCA_RF"
return(out)
}
#print method
print.PCA_RF = function(object) print(object$rf_obj)
#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
print("predicting PCA_RF")
f.args=as.list(match.call()[-1])
if(is.null(f.args$Xtest)) stop("cannot predict without newdata parameter")
sXtest = predict(object$pca_obj,Xtest) #scale Xtest as Xtrain was scaled before
return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
newdata = sXtest), #newdata input, see help(predict.randomForest)
f.args[-1:-2]))) #any other parameters are passed to predict.randomForest
}
#testTrain predict #
make.component.data = function(
inter.component.variance = .9,
n.real.components = 5,
nVar.per.component = 20,
nObs=600,
noise.factor=.2,
hidden.function = function(x) apply(x,1,mean),
plot_PCA =T
){
Sigma=matrix(inter.component.variance,
ncol=nVar.per.component,
nrow=nVar.per.component)
diag(Sigma) = 1
x = do.call(cbind,replicate(n = n.real.components,
expr = {mvrnorm(n=nObs,
mu=rep(0,nVar.per.component),
Sigma=Sigma)},
simplify = FALSE)
)
if(plot_PCA) plot(prcomp(x,center=T,.scale=T))
y = hidden.function(x)
ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
out = list(x=x,y=ynoised)
pars = ls()[!ls() %in% c("x","y","Sigma")]
attr(out,"pars") = mget(pars) #attach all pars as attributes
return(out)
}
A run code example:
#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)
Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[ 1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])
rf = randomForest (train$x, train$y,ntree =50) #regular RF
rf2 = train_PCA_RF(train$x, train$y,ntree= 50,ncomp=12)
rf
rf2
pred_rf = predict(rf ,test$x)
pred_rf2 = predict(rf2,test$x)
cat("rf, R^2:",cor(test$y,pred_rf )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)
cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2
pairs(list(trueY = test$y,
native_rf = pred_rf,
PCA_RF = pred_rf2)
)
You can have a look here to get a better idea. The link says use PCA for smaller datasets!! Some of my colleagues have used Random Forests for the same purpose when working with Genomes. They had ~30000 variables and large amount of RAM.
Another thing I found is that Random Forests use up a lot of Memory and you have 4500 variables. So, may be you could apply PCA to the individual Trees.