Rpart - accuracy of bigrams - rpart

Good evening, everyone!
I am facing a problem in R. I have a dataset containing Amazon reviews of the PlayStation 4, and I would like to build a prediction model with rpart and measure the accuracy of this model.
The reviews have been successfully loaded to R, a corpus has been created and some preprocessing tasks have been applied:
library(RWeka)
library(tm)
library(rpart)
corpus <- Corpus(VectorSource(tr.review.ps4$reviewText))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
The bigrams and a term document matrix are created with the following code:
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
txtTdmBi <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, bounds = list(global=c(10, Inf))))
Then sparse-terms are deleted and a matrix is created:
dtm <- removeSparseTerms(txtTdmBi, 0.999)
dtmsparse <- as.data.frame(as.matrix(txtTdmBi))
The original dataset consists of 7561 objects. Therefore a training and test set is created as follows:
train <- dtmsparse[1:6500,]
test <- dtmsparse[6501:7561,]
Then the training is done. $overall refers to the star rating from one to five.
train$overall <- tr.review.ps4[1:6500,]$overall
When using unigrams the prediction model is created as follows:
model <- rpart(overall ~., data = train, method= 'class')
However, this is not working in my case because, I guess, the connection to the original review dataset has to be established. But how? I have no idea.
When I enter this code I get the following error output:
Error in terms.formula(formula, data = data) :
Can anyone help me? Thanks a lot.
Best regards
Paul

Today I kept searching for a solution to my problem, and luckily I found the mistake.
The error message occurred because the TermDocumentMatrix had the wrong orientation: it stores terms as rows and documents as columns, while rpart expects one row per document.
I had to transpose the matrix with the following code:
txtTdmBi.t=t(txtTdmBi)
Finally it worked.
Best regards
Paul
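For readers who also want the accuracy asked about in the question, here is a minimal sketch of the whole path from a transposed matrix to a test-set accuracy. It does not use the original reviews: the small count matrix tdm, the labels in overall, and the 150/50 split are made-up stand-ins for as.matrix(txtTdmBi) and tr.review.ps4$overall. The make.names() step is an assumption on my side, added because bigram column names contain spaces.

```r
library(rpart)

set.seed(1)
n_docs <- 200

# Hypothetical stand-in for as.matrix(txtTdmBi): terms in rows, documents in columns
tdm <- rbind(
  "great game" = rpois(n_docs, 1),
  "stop work"  = rpois(n_docs, 1)
)

# Toy labels derived from the counts, so the tree has something to learn
overall <- factor(ifelse(tdm["great game", ] > tdm["stop work", ], 5, 1))

# Transpose so that each row is one document, then convert to a data frame
dtm_df <- as.data.frame(t(tdm))
names(dtm_df) <- make.names(names(dtm_df))  # "great game" becomes "great.game"

# Split into training and test rows
train_idx <- 1:150
train <- dtm_df[train_idx, ]
test  <- dtm_df[-train_idx, ]
train$overall <- overall[train_idx]

# Fit the classification tree and predict the held-out documents
model <- rpart(overall ~ ., data = train, method = "class")
pred  <- predict(model, newdata = test, type = "class")

# Accuracy: share of test documents whose predicted rating matches the true one
accuracy <- mean(pred == overall[-train_idx])
print(accuracy)
```

With the real data, the same idea would be applied to as.data.frame(as.matrix(t(txtTdmBi))), with the true ratings of rows 6501:7561 used for the comparison.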

Related

How to update fillColor palette to selected input in shiny map?

I am having trouble transitioning my map from static to reactive so a user can select what data they want to look at. Somehow I'm not successfully connecting the input to the dataframe. My data is from a shapefile and looks roughly like this:
NAME Average Rate geometry
1 Alcona 119.7504 0.1421498 MULTIPOLYGON (((-83.88711 4...
2 Alger 120.9212 0.1204398 MULTIPOLYGON (((-87.11602 4...
3 Allegan 128.4523 0.1167062 MULTIPOLYGON (((-85.54342 4...
4 Alpena 114.1528 0.1410852 MULTIPOLYGON (((-83.3434 44...
5 Antrim 124.8554 0.1350004 MULTIPOLYGON (((-84.84877 4...
6 Arenac 127.8809 0.1413534 MULTIPOLYGON (((-83.7555 43...
In the server section below, you can see that I tried to use reactive to get the selected variable and when I write print(select) it does print the correct variable name, but when I try to put it into the colorNumeric() function it's clearly not being recognized. The map I get is all just the same shade of blue instead of different shades based on the value of the variable in that county.
ui <- fluidPage(
fluidRow(
selectInput(inputId="var",
label="Select variable",
choices=list("Average"="Average",
"Rate"="Rate"),
selected=1)
),
fluidRow(
leafletOutput("map")
)
)
server <- function(input, output, session) {
# Data sources
counties <- st_read("EITC_counties.shp") %>%
st_transform(crs="+init=epsg:4326")
counties_clean <- select(counties, NAME, X2020_Avg., X2020_Takeu)
counties_clean <- counties_clean %>%
rename("Average"="X2020_Avg.",
"Rate"="X2020_Takeu")
# Map
variable <- reactive({
input$var
})
output$map <- renderLeaflet({
select <- variable()
print(select)
pal <- colorNumeric(palette = "Blues", domain = counties_clean$select, na.color = "black")
color_pal <- counties_clean$select
leaflet()%>%
setView( -84.51, 44.18, zoom=5) %>%
addPolygons(data=counties_clean, layerId=~NAME,
weight = 1, smoothFactor=.5,
fillOpacity=.7,
fillColor=~pal(color_pal()),
highlightOptions = highlightOptions(color = "white",
weight = 2,
bringToFront = TRUE)) %>%
addProviderTiles(providers$CartoDB.Positron)
})
}
shinyApp(ui, server)
I've tried making the reaction into an event and also using the observe function using a leaflet proxy but it only produced errors. I also tried to skip the reactive definition and just put input$var directly into the palette (counties_clean$input$var), but it similarly did not show any color variation.
When I previously created a static map setting the palette using counties_clean$Average it came out correctly, but replacing Average with a user input is where I appear to be going wrong. Thanks in advance for any guidance you can provide and please let me know if I can share any additional clarification.
Unfortunately, your code is not reproducible without the data, but the mistake is most likely in this line
color_pal <- counties_clean$select
This line extracts a column literally named select from your data. No such column exists, so it returns NULL.
What you want instead is the column whose name is stored in select, so try:
color_pal <- counties_clean[[select]]
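The difference between the two lookups can be checked on a small plain data frame. This is only a sketch: df is a made-up stand-in for counties_clean (values rounded from the question), without the geometry column.

```r
# Hypothetical stand-in for counties_clean
df <- data.frame(
  NAME    = c("Alcona", "Alger", "Allegan"),
  Average = c(119.75, 120.92, 128.45),
  Rate    = c(0.142, 0.120, 0.117)
)

select <- "Rate"  # what input$var would hold after the user picks "Rate"

by_dollar   <- df$select      # looks for a column literally called "select"
by_brackets <- df[[select]]   # looks for the column named by the value of select

print(is.null(by_dollar))
print(by_brackets)
```

In the posted server code the same lookup is also used for the domain argument of colorNumeric(), so it presumably needs the same change. And since color_pal is a plain vector rather than a function, the formula would likely have to read ~pal(color_pal), without the parentheses.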

Combine multiple (>2) survival curves (null models) in same plot

I am trying to combine multiple survfit objects on the same plot, using the function ggsurvplot_combine from the package survminer. When I make a list of 2 survfit objects, it works perfectly. But when I combine 3 survfit objects in different ways, I receive the error:
error in levels - ( tmp value = as.character(levels)): factor level 3 is duplicated
I've read similar posts on combining survival plots (https://cran.r-project.org/web/packages/survminer/survminer.pdf, https://github.com/kassambara/survminer/issues/195, R plotting multiple survival curves in the same plot, https://rpkgs.datanovia.com/survminer/reference/ggsurvplot_combine.html) and on this specific error, for which solutions using 'unique' have been provided. However, I do not even understand which factor variable this error refers to. I do not have the right to share my data or figures, so I'll try to replicate it:
Data:
time: follow-up time until event or end of follow-up
endpoints: 1= event, 0=no event or censor
Null models:
KM1 <- survfit(Surv(data$time1,data$endpoint1)~1,
type="kaplan-meier", conf.type="log", data=data)
KM2 <- survfit(Surv(data$time2,data$endpoint2)~1, type="kaplan-meier",
conf.type="log", data=data)
KM3 <- survfit(Surv(data$time3,data$endpoint3)~1, type="kaplan-meier",
conf.type="log", data=data)
List null models:
list_that_works <- list(KM1,KM3)
list_that_fails <- list(KM1,KM2,KM3)
It seems as if the list accepts just two arguments: list(PFS=, OS=)
Combine >2 null models in one plot:
ggsurvplot_combine(list_that_works, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives the plot I'm looking for, but with 2 cumulative incidence curves.
ggsurvplot_combine(list_that_fails, data=data, conf.int=TRUE, fun="event", combine=TRUE)
This gives error 'error in levels - ( tmp value = as.character(levels)): factor level 3 is duplicated'.
When I try combining 3 plots using
ggsurvplot(c(KM1,KM2,KM3), data=data, conf.int=TRUE, fun="event", combine=TRUE), it gives the error:
Error: Problem with mutate() column 'survsummary'.
survsummary = purrr::map2(grouped.d$fit, grouped.d$name, .surv_summary, data = data). x $ operator is invalid for atomic vectors.
Any help is highly appreciated!
Also another way to combine surv fits is very welcome!
My best guess is that it has something to do with the 'list' argument, which seems to accept only two elements: list(PFS=, OS=)
I fixed it! Instead of removing the post, I'll share my solution, it may be of help for others:
I made a list of the formulas instead of the null models, so:
formulas <- list(
KM1 = Surv(time1, endpoint1)~1,
KM2 = Surv(time2, endpoint2)~1,
KM3 = Surv(time3, endpoint3)~1)
I made a null model of the 3 formulas at once:
fit <- surv_fit(formulas, data=data)
Then I made a plot with this survival fit:
ggsurvplot_combine(fit, data=data)
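Since the question also asked for another way to combine the fits, here is a sketch that uses only the survival package and base graphics instead of survminer. It is a different technique from the surv_fit() solution above, and the data frame is simulated: the column names follow the question, while the rates and event probabilities are invented.

```r
library(survival)

set.seed(42)
n <- 100

# Hypothetical data with three time/endpoint pairs, named as in the question
data <- data.frame(
  time1 = rexp(n, 0.10), endpoint1 = rbinom(n, 1, 0.7),
  time2 = rexp(n, 0.05), endpoint2 = rbinom(n, 1, 0.6),
  time3 = rexp(n, 0.02), endpoint3 = rbinom(n, 1, 0.5)
)

# One null (intercept-only) Kaplan-Meier fit per endpoint
fits <- list(
  KM1 = survfit(Surv(time1, endpoint1) ~ 1, data = data),
  KM2 = survfit(Surv(time2, endpoint2) ~ 1, data = data),
  KM3 = survfit(Surv(time3, endpoint3) ~ 1, data = data)
)

cols <- c("black", "red", "blue")

# Draw the first curve as cumulative incidence, then overlay the others
plot(fits[[1]], fun = "event", conf.int = FALSE, col = cols[1],
     xlim = c(0, max(data$time1, data$time2, data$time3)),
     xlab = "Time", ylab = "Cumulative incidence")
for (i in 2:3) {
  lines(fits[[i]], fun = "event", conf.int = FALSE, col = cols[i])
}
legend("bottomright", legend = names(fits), col = cols, lty = 1)
```

Setting conf.int = TRUE in plot() and lines() should add the confidence bands, at the cost of a busier figure.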

How to plot PCA with paired data?

I am currently working with genetic data from different patients. Until now I have always used PCA to compare independent groups (for example Sick vs Control, Treatment vs Control, etc.).
But now I have paired data, meaning that there is a relationship between the samples of the different groups. The typical example is having a group of subjects and measuring each of them before and after treatment.
I did this PCA with the ThermoFisher program, but I would like to do it in R. This is the output of the ThermoFisher program. B (Before treatment) P (Post-treatment)
I tried looking for an example on Google, but I didn't find one.
An example would be something like this:
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("P_BT", 1:5, sep=""),
paste("P_AT", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)
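The thread contains no answer to this question, so the following is only one possible approach, not a confirmed solution: run prcomp() on the samples and join each patient's before and after point with a line segment. It reuses the simulated matrix from the question and assumes that P_BT1 and P_AT1 belong to the same patient, P_BT2 and P_AT2 to the next, and so on. The colours and the variance filter are my own choices.

```r
set.seed(1)

# Simulated data as in the question: 100 genes, 5 patients before (BT) and after (AT)
data.matrix <- matrix(nrow = 100, ncol = 10)
colnames(data.matrix) <- c(paste0("P_BT", 1:5), paste0("P_AT", 1:5))
rownames(data.matrix) <- paste0("gene", 1:100)
for (i in 1:100) {
  wt.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  ko.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  data.matrix[i, ] <- c(wt.values, ko.values)
}

# PCA on the samples: transpose so samples are rows, and drop constant genes
keep   <- apply(data.matrix, 1, var) > 0
pca    <- prcomp(t(data.matrix[keep, ]), scale. = TRUE)
scores <- pca$x[, 1:2]

group <- rep(c("B", "P"), each = 5)

plot(scores, col = ifelse(group == "B", "blue", "red"), pch = 19,
     xlab = "PC1", ylab = "PC2")

# Join the before and after sample of each patient (rows 1-5 pair with rows 6-10)
segments(scores[1:5, 1], scores[1:5, 2],
         scores[6:10, 1], scores[6:10, 2], col = "grey50")

text(scores, labels = rownames(scores), pos = 3, cex = 0.7)
legend("topright", legend = c("B (before)", "P (post)"),
       col = c("blue", "red"), pch = 19)
```

The segments only visualise the pairing. Whether the paired structure should also enter the analysis itself, for example by working on within-patient differences, is a separate modelling decision.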

Qgis or Python: converting a CSV file of simple locations to raster?

I have a CSV file as follows:
Diversity,Longitude,Latitude
7,114.99638889,-33.85333333
6,114.99790583,-33.85214594
10,115,-33.85416667
2,115.0252075,-33.84447519
I would like to convert it to a raster file with a set 'no data' value over most of the area and the values in cells at the long/lat locations.
Is there an easy way to do that in QGIS or Python?
Cheers,
Steve
Not what you asked for, but here is how you can approach it in R
get the data:
d <- read.csv('file.csv')
d <- cbind(d[,2:3], d[,1])
load the raster package:
library(raster)
If your data are regularly spaced:
r <- rasterFromXYZ(d)
writeRaster(r, 'file.tif')
else create an empty raster and rasterize:
r <- raster(extent(d[,1:2]))
res(r) <- 1 # adjust this and other parameters as you see fit
r <- rasterize(d[,1:2], r, d[,3], fun=mean) # points, target raster, values
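To make clearer what rasterizing with fun=mean does, here is a base R sketch that bins the four sample points from the question into a grid by hand. It does not use the raster package, ignores the coordinate reference system, and writes no file. The cell size of 0.005 degrees is an arbitrary assumption, and NA plays the role of the 'no data' value.

```r
# The four sample rows from the question
d <- data.frame(
  Diversity = c(7, 6, 10, 2),
  Longitude = c(114.99638889, 114.99790583, 115, 115.0252075),
  Latitude  = c(-33.85333333, -33.85214594, -33.85416667, -33.84447519)
)

cell <- 0.005  # hypothetical cell size in degrees; adjust as needed

# Column and row index of the cell each point falls into (row 1 = northernmost)
col_idx <- floor((d$Longitude - min(d$Longitude)) / cell) + 1
row_idx <- floor((max(d$Latitude) - d$Latitude) / cell) + 1

# Empty grid: every cell starts as NA, i.e. 'no data'
grid <- matrix(NA_real_, nrow = max(row_idx), ncol = max(col_idx))

# Mean value per occupied cell, written into the grid
agg <- aggregate(Diversity ~ row_idx + col_idx,
                 data = cbind(d, row_idx, col_idx), FUN = mean)
grid[cbind(agg$row_idx, agg$col_idx)] <- agg$Diversity

print(grid)
```

With this cell size the first three points share a cell and are averaged, while the fourth point gets a cell of its own.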

How do you combine multiple boxplots from a List of data-frames?

This is a repost from the Statistics portion of Stack Exchange. I asked the question there and was advised to ask it here instead. So here it is.
I have a list of data-frames. Each data-frame has a similar structure. There is only one column in each data-frame that is numeric. Because of my data-requirements it is essential that each data-frame has different lengths. I want to create a boxplot of the numerical values, categorized over the attributes in another column. But the boxplot should include information from all the data-frames.
I hope it is a clear question. I will post sample data soon.
Sam,
I'm assuming this is a follow up to this question? Maybe your sample data will illustrate the nuances of your needs better (the "categorized over attributes in another column" part), but the same melting approach should work here.
library(ggplot2)
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(1000))
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
#Separate boxplots for each data.frame
qplot(factor(variable), value, data = df, geom = "boxplot")
#All values plotted together as one boxplot
qplot(factor(1), value, data = df, geom = "boxplot")
a<-data.frame(c(1,2),c("x","y"))
b<-data.frame(c(3,4,5),c("a","b","c"))
boxplot(c(a[1],b[1]))
With the "1"s I select the column I want out of each data frame.
A data frame cannot have columns of different lengths (every column has to have the same number of rows), but you can tell boxplot to plot multiple datasets in parallel.
Using the melt() function (from reshape2) and base R boxplot:
library(reshape2)
#Fake data
a <- data.frame(a = rnorm(10))
b <- data.frame(b = rnorm(100))
c <- data.frame(c = rnorm(100) + 5)
#In a list
myList <- list(a,b,c)
#In a melting pot
df <- melt(myList)
# plot using base R boxplot function
boxplot(value ~ variable, data = df)
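The question also mentions a second column holding the categories. A base R sketch for that case, without reshape2, could stack the data frames with rbind and group by that column. The column names value and category and the helper make_df() are hypothetical, since no sample data was posted.

```r
set.seed(1)

# Hypothetical helper: a data frame with one numeric and one categorical column
make_df <- function(n) {
  data.frame(
    value    = rnorm(n),
    category = sample(c("x", "y", "z"), n, replace = TRUE)
  )
}

# Data frames of different lengths, collected in a list
myList <- list(make_df(10), make_df(100), make_df(1000))

# Stack all data frames on top of each other (they share the same columns)
combined <- do.call(rbind, myList)

# One box per category, using the values from all data frames together
boxplot(value ~ category, data = combined)
```

If the boxes should also be split by source data frame, an index column could be added to each element before stacking and used as a second grouping term in the formula.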