Combine multiple (>2) survival curves (null models) in same plot - list

I am trying to combine multiple survfit objects on the same plot, using the function ggsurvplot_combine from the package survminer. When I make a list of 2 survfit objects, it works perfectly. But when I combine 3 survfit objects in different ways, I receive the error:
Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated
I've read similar posts on combining survival plots (https://cran.r-project.org/web/packages/survminer/survminer.pdf, https://github.com/kassambara/survminer/issues/195, R plotting multiple survival curves in the same plot, https://rpkgs.datanovia.com/survminer/reference/ggsurvplot_combine.html) and on this specific error, for which solutions using 'unique' have been provided. However, I do not even understand which factor variable this error refers to. I am not allowed to share my data or figures, so I'll try to replicate it:
Data:
time: follow-up time until event or end of follow-up
endpoints: 1= event, 0=no event or censor
Null models:
KM1 <- survfit(Surv(data$time1,data$endpoint1)~1,
type="kaplan-meier", conf.type="log", data=data)
KM2 <- survfit(Surv(data$time2,data$endpoint2)~1, type="kaplan-meier",
conf.type="log", data=data)
KM3 <- survfit(Surv(data$time3,data$endpoint3)~1, type="kaplan-meier",
conf.type="log", data=data)
List null models:
list_that_works <- list(KM1,KM3)
list_that_fails <- list(KM1,KM2,KM3)
It seems as if the list consists of just two arguments: list(PFS=, OS=)
Combine >2 null models in one plot:
ggsurvplot_combine(list_that_works, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the plot I'm looking for, but with 2 cumulative incidence curves.
ggsurvplot_combine(list_that_fails, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the error 'Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [3] is duplicated'.
When I try combining the 3 fits using
ggsurvplot(c(KM1,KM2,KM3), data=data, conf.int=TRUE, fun="event", combine=TRUE)
it gives the error:
Error: Problem with `mutate()` column `survsummary`.
`survsummary = purrr::map2(grouped.d$fit, grouped.d$name, .surv_summary, data = data)`.
x $ operator is invalid for atomic vectors
Any help is highly appreciated!
Also another way to combine surv fits is very welcome!
My best guess is that it has something to do with the 'list' function, which seems to accept only two arguments: list(PFS=, OS=)

I fixed it! Instead of removing the post, I'll share my solution, it may be of help for others:
I made a list of the formulas instead of the null models, so:
formulas <- list(
KM1 = Surv(time1, endpoint1)~1,
KM2 = Surv(time2, endpoint2)~1,
KM3 = Surv(time3, endpoint3)~1)
I made a null model of the 3 formulas at once:
fit <- surv_fit(formulas, data=data)
Then I made a plot with this survival fit:
ggsurvplot_combine(fit, data=data)

Extracting output before the softmax layer, then manually calculating softmax gives a different result

I have a model trained to classify rgb values into 1000 categories.
#Model architecture
model = Sequential()
model.add(Dense(512,input_shape=(3,),activation="relu"))
model.add(BatchNormalization())
model.add(Dense(512,activation="relu"))
model.add(BatchNormalization())
model.add(Dense(1000,activation="relu"))
model.add(Dense(1000,activation="softmax"))
I want to be able to extract the output before the softmax layer so I can conduct analyses on different samples of categories within the model. I want to execute softmax for each sample, and conduct analyses using a function named getinfo().
Model
Initially, I enter X_train data into model.predict, to get a vector of 1000 probabilities for each input. I execute getinfo() on this array to get the desired result.
Pop1
I then use model.pop() to remove the softmax layer. I get new predictions for the popped model, and execute scipy.special.softmax. However, getinfo() produces an entirely different result on this array.
Pop2
I write my own softmax function to validate the 2nd result, and I receive an almost identical answer to Pop1.
Pop3
However, when I simply calculate getinfo() on the output of model.pop() with no softmax function, I get the same result as the initial Model.
import numpy as np
import scipy.special
import scipy.stats
from keras.models import load_model

data = np.loadtxt("allData.csv",delimiter=",")
model = load_model("model.h5")

def getinfo(data):
    # entropy (in bits) of the mean prediction over all samples
    objects = scipy.stats.entropy(np.mean(data, axis=0), base=2)
    print(('objects_mean',objects))
    # mean of the per-sample entropies
    colours_entropy = []
    for i in data:
        e = scipy.stats.entropy(i, base=2)
        colours_entropy.append(e)
    colours = np.mean(np.array(colours_entropy))
    print(('colours_mean',colours))
    info = objects - colours
    print(('objects-colours',info))
    return info
def softmax_max(data):
    # calculate softmax row by row, subtracting each row's max value (axis=1)
    # first for numerical stability
    sm = []
    for row in data:
        e = np.exp(row - np.max(row))
        s = np.sum(e)
        sm.append(e/s)
    sm = np.asarray(sm)
    return sm
#model
preds = model.predict(X_train)
getinfo(preds)
#pop1
model.pop()
preds1 = model.predict(X_train)
sm1 = scipy.special.softmax(preds1,axis=1)
getinfo(sm1)
#pop2
sm2 = softmax_max(preds1)
getinfo(sm2)
#pop3
getinfo(preds1)
I expect to get the same output from Model, Pop1 and Pop2, but a different answer to Pop3, as I did not compute softmax here. I wonder if the issue is with computing softmax after model.predict? And whether I am getting the same result in Model and Pop3 because softmax is constraining the values between 0-1, so for the purpose of the getinfo() function, the result is mathematically equivalent?
If this is the case, then how do I execute softmax before model.predict?
I've gone around in circles with this, so any help or insight would be much appreciated. Please let me know if anything is unclear. Thank you!
model.pop() does not immediately have an effect. You need to run model.compile() again to recompile the new model that doesn't include the last layer.
Without the recompile, you're essentially running model.predict() twice in a row on the exact same model, which explains why Model and Pop3 give the same result. Pop1 and Pop2 give weird results because they are calculating the softmax of a softmax.
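To see why a softmax of a softmax gives "weird" results, here is a minimal sketch in plain Python. The logits are made-up toy values and the helper names (softmax, entropy_bits) are hypothetical, not part of the asker's code. Applying softmax to values that already lie between 0 and 1 flattens the distribution, so its entropy moves towards the maximum of log2(n) bits:

```python
import math

def softmax(row):
    # Subtract the row maximum first; this does not change the result
    # but avoids overflow in exp().
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(p):
    # Shannon entropy in bits of a probability vector.
    return -sum(q * math.log2(q) for q in p if q > 0)

logits = [2.0, 1.0, 0.1, -1.0]               # toy pre-softmax outputs (assumed)
probs = softmax(logits)                       # what the softmax layer returns
twice = softmax(probs)                        # softmax applied a second time
shifted = softmax([v + 100 for v in logits])  # same logits plus a constant

print(entropy_bits(probs), entropy_bits(twice))
```

The second application pushes the probabilities towards uniform, which is why an entropy-based measure such as getinfo() changes so much. Adding a constant to every logit leaves the softmax unchanged, so a hand-written softmax that subtracts the maximum should agree with scipy.special.softmax.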
In addition, your model does not have the softmax as a separate layer, so pop takes off the entire last Dense layer. To fix this, add the softmax as a separate layer like so:
model.add(Dense(1000)) # softmax removed from this layer...
model.add(Activation('softmax')) # ...and added to its own layer

Divide the testing set into subgroup, then make prediction on each subgroup separately

I have a dataset similar to the following table:
The prediction target is going to be the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new=y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred=model.predict(X_test)
    print(metrics.r2_score(y_new, y_new_pred))
However, my code did not work and this is the error that I get:
Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but for the full score range (0-5) the r^2 is 0.67, while for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are significantly lower than that of the full range. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.
When you are computing the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which raises an error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)
# compute the predictions only once
y_pred = model.predict(X_test)
for i in (0,1,2,3,4):
    # boolean mask selecting the test samples in this score range
    subset = (y_test>=i) & (y_test<=i+1)
    # apply the same mask to the true and the predicted values
    print(metrics.r2_score(y_test[subset], y_pred[subset]))
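On the follow-up in the question about the sub-range r^2 values all being lower than the full-range value: this is expected rather than a sign of a mistake. r^2 compares the prediction error with the spread of the true values, and inside a narrow score range that spread is much smaller while the error stays about the same. The following is only a sketch with synthetic data (uniform scores and Gaussian noise are assumptions, not the asker's data) and a hand-written r2_score that follows the usual definition 1 - SS_res / SS_tot:

```python
import random

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

rng = random.Random(0)
# Synthetic "true" scores in [0, 5] and predictions with constant noise.
y_test = [rng.uniform(0, 5) for _ in range(5000)]
y_pred = [y + rng.gauss(0, 0.8) for y in y_test]

overall = r2_score(y_test, y_pred)

per_bin = {}
for i in range(5):
    # Select the same indices from the true and the predicted values.
    idx = [k for k, y in enumerate(y_test) if i <= y < i + 1]
    per_bin[i] = r2_score([y_test[k] for k in idx],
                          [y_pred[k] for k in idx])

print(overall, per_bin)
```

In this toy setup every per-bin value comes out below the overall one, and can even be negative. An absolute error measure such as mean absolute error may be easier to compare across subgroups.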

Viz LDA model with Bokeh and T-sne

I have tried to follow this tutorial (https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html) on visualizing LDA with t-SNE and Bokeh.
But I ran into a bit of a problem.
When I tried to run the following code:
plot_lda.scatter(x=tsne_lda[:, 0], y=tsne_lda[:, 1],
color=colormap[_lda_keys][:num_example],
source=bp.ColumnDataSource({
"content": text[:num_example],
"topic_key": _lda_keys[:num_example]
}))
NB: In the tutorial the content is called news; in mine it is called text.
I get this error:
Supplying a user-defined data source AND iterable values to glyph methods is
not possibe. Either:
Pass all data directly as literals:
p.circe(x=a_list, y=an_array, ...)
Or, put all data in a ColumnDataSource and pass column names:
source = ColumnDataSource(data=dict(x=a_list, y=an_array))
p.circe(x='x', y='x', source=source, ...)
To me this does not make much sense, and I have not succeeded in finding any answer to it either here, on GitHub or elsewhere. I hope that someone can help. Best, Niels
I've also been battling with that piece of code and I've found two problems with it.
First, when you pass a source to the scatter function, as the error states, you must include all data in the dictionary, i.e., the x and y values, colors, labels, and any other information that you want to include in the tooltip.
Second, the x and y arrays have a different length than the information passed to the tooltip, so you also have to slice both of them with the num_example variable.
The following code got me running:
# create the dictionary with all the information
plot_dict = {
'x': tsne_lda[:num_example, 0],
'y': tsne_lda[:num_example, 1],
'colors': colormap[_lda_keys][:num_example],
'content': text[:num_example],
'topic_key': _lda_keys[:num_example]
}
# create the dataframe from the dictionary
plot_df = pd.DataFrame.from_dict(plot_dict)
# declare the source
source = bp.ColumnDataSource(data=plot_df)
title = 'LDA viz'
# initialize bokeh plot
plot_lda = bp.figure(plot_width=1400, plot_height=1100,
title=title,
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
# build scatter function from the columns of the dataframe
plot_lda.scatter('x', 'y', color='colors', source=source)
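The two points from the answer above (put every column in the source, and slice every column to the same length) can be checked before any plotting happens. This is a minimal sketch in plain Python: build_columns is a hypothetical helper, not part of Bokeh, and the lists are toy stand-ins for tsne_lda, text and _lda_keys:

```python
def build_columns(num_example, **columns):
    """Slice every column to num_example rows and check that they line up."""
    sliced = {name: list(values[:num_example])
              for name, values in columns.items()}
    lengths = {name: len(values) for name, values in sliced.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return sliced

# Toy stand-ins (assumed) for tsne_lda[:, 0], tsne_lda[:, 1], text, _lda_keys
xs = [0.1, 0.4, 0.9, 1.3]
ys = [1.0, 0.7, 0.2, 0.5]
text = ["doc a", "doc b", "doc c", "doc d"]
topic_keys = [2, 0, 1, 2]

plot_dict = build_columns(3, x=xs, y=ys, content=text, topic_key=topic_keys)
print({name: len(values) for name, values in plot_dict.items()})
```

A dictionary built this way has equally long columns, so it can be handed to a ColumnDataSource and the glyph call can refer to the columns by name only.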

How to rename a column of a data frame with part of the data frame identifier in R?

I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ends with "expression.txt"). I would like to load all of these files into R as data frames, for each data frame rename the column "RPKM" with the identifier of the original file and then join the data frames by "Gene_symbol" into one large data frame with one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {var_name = paste("Data", strsplit(i, "_")[[1]][1], sep = "_"); assign(var_name, read.table(i, header=TRUE)[,c("Gene_symbol", "RPKM")])}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R. You should always try to keep all your data frames in a list and operate on them using functions such as lapply. Thus, instead of using assign, just create an empty list with the same length as your files vector and fill it in the for loop.
For your current situation, we can fix it using a combination of ls and mget to pull these data frames from the global environment into a list and then rename the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env to get them back into the global environment, but you should use it with caution.
Thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like and works fine!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {temp[[i]]=read.table(files[i], header=TRUE)[,c("Gene_symbol", "RPKM")]}
for (i in 1:length(temp)) {temp[[i]]=rename(temp[[i]], c("RPKM"=strsplit(files[i], "_")[[1]][1]))}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")

How do you combine multiple boxplots from a List of data-frames?

This is a repost from the Statistics portion of the Stack Exchange. I had asked the question there, I was advised to ask this question here. So here it is.
I have a list of data-frames. Each data-frame has a similar structure. There is only one column in each data-frame that is numeric. Because of my data-requirements it is essential that each data-frame has different lengths. I want to create a boxplot of the numerical values, categorized over the attributes in another column. But the boxplot should include information from all the data-frames.
I hope it is a clear question. I will post sample data soon.
Sam,
I'm assuming this is a follow up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")
a<-data.frame(c(1,2),c("x","y"))
b<-data.frame(c(3,4,5),c("a","b","c"))
boxplot(c(a[1],b[1]))
With the "1"s I select the column I want out of each data frame.
A data frame cannot have columns of different lengths (every column has to have the same number of rows), but you can tell boxplot to plot multiple datasets in parallel.
Using the melt() function (from reshape2) and the base R boxplot:
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)