R: deleting duplicates that differ by only 1 or 2 characters - regex

I am collecting tweets with the twitteR package and get many duplicates. This code works fine:
tweets <- searchTwitter(keyword, n=500, lang="en", since=NULL, until=NULL, retryOnRateLimit=100)
mydata <- sapply(tweets, function(x) x$getText())
mydata <- unique(mydata, incomparables = F, nmax = NA)
The problem is that it actually doesn't delete any duplicates, because it doesn't recognise them as such. The duplicate tweets typically contain shortened URLs that differ by 1 or 2 characters. So I tried to clean the tweets of URLs with this code:
tweets <- searchTwitter(keyword, n=500, lang="en", since=NULL, until=NULL,
retryOnRateLimit=100)
mydata <- sapply(tweets, function(x) x$getText())
mydata <- data.frame(mydata, stringsAsFactors = FALSE)
names(mydata) <- c('words')
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
mydata$words <- removeURL(mydata$words)
removeURL <- function(x) gsub("https[[:alnum:]]*", "", x)
mydata$words <- removeURL(mydata$words)
mydata$words <- unique(mydata$words, incomparables = F, nmax = NA)
Now I get the error message:
Error in `$<-.data.frame`(`*tmp*`, "words", value = c("Tripping around #DisneySprings.....) : replacement has 295 rows, data has 300
Advice? Thanks!

Your error is easily reproducible:
mydata <- data.frame(list(w = c(0, 1, 0, 1)))
mydata$words <- c(0, 1, 1)
# Error in `$<-.data.frame`(`*tmp*`, "words", value = c(0, 1, 1)) :
# replacement has 3 rows, data has 4
This just means that you need to assign a vector whose length equals the number of rows of the data frame. unique() drops the duplicated elements, so its result is shorter than the column you assign it back to.
To filter out the duplicate rows instead, change the last line of your code to:
res <- mydata[!duplicated(mydata$words), ]
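Putting it together, a minimal sketch of the whole pipeline (the URL pattern is an assumption about what t.co-style shortened links look like; adjust it to your data):
mydata <- sapply(tweets, function(x) x$getText())
mydata <- data.frame(words = mydata, stringsAsFactors = FALSE)
# one pattern suffices: it also consumes the "s" of https and the
# "://..." tail of the shortened link (assumed URL shape)
mydata$words <- gsub("http[[:alnum:][:punct:]]*", "", mydata$words)
# keep only the first occurrence of each cleaned text
res <- mydata[!duplicated(mydata$words), , drop = FALSE]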


How to manipulate a data frame in an R Shiny app

Please, I need assistance with some Shiny code. I want to manipulate an uploaded data frame by separating it into column vectors for computation, but I keep getting this error:
Warning in <reactive>(...): NAs introduced by coercion
The code is as follows:
library(shiny)
ui <- fluidPage(
  # dataset
  data <- data.frame(e1 = c(3, 7, 2, 14, 66),
                     e2 = c(2, 16, 15, 66, 30),
                     n1 = c(18, 25, 45, 62, 81),
                     n2 = c(20, 30, 79, 64, 89)),
  # Application title
  titlePanel("Demo"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      # Input: Upload file
      fileInput(inputId = 'file', label = 'upload the file')
    ),
    # Display Output
    mainPanel(
      uiOutput("final")
    )
  )
)
# Define server logic required to draw a histogram
server <- function(input, output) {
  # separating the dataframe into 4 column vectors
  e1 <- reactive(as.numeric(input$file[, 1]))
  e2 <- reactive(as.numeric(input$file[, 2]))
  n1 <- reactive(as.numeric(input$file[, 3]))
  n2 <- reactive(as.numeric(input$file[, 4]))
  # File Upload function
  data <- reactive({
    file1 <- input$file
    if (is.null(file1)) { return() }
    read.table(file = file1$datapath, sep = ',', header = TRUE)
  })
  output$result <- renderUI({
    y <- (e1()/n1()) - (e2()/n2())
    lg_y <- log(y)
    v2 <- ((n1() - e1())/e1() * n1()) + ((n2() - e2())/e2() * n2())
    w <- 1/v2
    w1 <- sum(w)
    w2 <- sum(w^2)
    c <- w1 - (w2/w1)
    s2 <- w * lg_y
    ybar <- sum(s2)/sum(w)
    Q <- sum(w * ((lg_y - ybar)^2))  # Cochrane homogeneity test statistic
    Q.pval <- reactive(pchisq(Q, k() - 1, lower.tail = FALSE))
    Isqd <- max(100 * ((Q - (k() - 1))/Q), 0)
  })
}
# Run the application
shinyApp(ui = ui, server = server)
I have searched almost every related question on this forum and haven't found one where this is answered. I look forward to your help.
I still can't run the code above because you don't define the function k(). Also, FYI: your renderUI is assigned to "result" but your uiOutput expects "final".
You get Warning in <reactive>(...): NAs introduced by coercion because your real data set probably includes a non-numeric value. I did not get any issues with the data set you provided above.
There are a few ways forward:
1) Write a function to remove all non-numerics before you process the data. See here for a few examples, and the sketch after this list.
2) Just keep the warning; it is a warning, after all, so it won't stop your code from running. Currently it turns your non-numeric values into NA.
3) Use suppressWarnings(), but that is usually not recommended.
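For option 1, a minimal sketch (strip_nonnumeric is a hypothetical helper, not part of your app; it assumes the offending characters are stray symbols mixed into otherwise numeric fields):
# hypothetical helper: drop everything except digits, sign and decimal point
strip_nonnumeric <- function(x) as.numeric(gsub("[^0-9.+-]", "", x))
e1 <- reactive(strip_nonnumeric(data()[, 1]))  # instead of plain as.numeric()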
I do have a suggestion to clean up your code though:
# File Upload function
data <- reactive({
  file1 <- input$file
  if (is.null(file1)) { return() }
  read.table(file = file1$datapath, sep = ',', header = TRUE, stringsAsFactors = FALSE)
})
# separating the dataframe into 4 column vectors
e1 <- reactive(as.numeric(data()[, 1]))
e2 <- reactive(as.numeric(data()[, 2]))
n1 <- reactive(as.numeric(data()[, 3]))
n2 <- reactive(as.numeric(data()[, 4]))
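Beyond that, the two blockers mentioned above need fixing before anything renders. A minimal sketch (assuming k() was meant to be the number of rows of the uploaded file, which is a guess on my part):
# assumption: k() = number of rows in the uploaded data
k <- reactive(nrow(data()))
# the output id must match the ui: uiOutput("final") pairs with output$final
output$final <- renderUI({
  req(data())  # wait until a file is uploaded
  tags$p(paste("rows read:", k()))  # placeholder; put your real output here
})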

Outputting the results from a bife object to LaTeX in R Markdown?

I'm estimating a fixed-effects probit model using the bife package in R. I'm trying to extract the output into something I can use with either stargazer or texreg so I can output them into a paper using Rmarkdown to create a LaTeX object. I'm aware I can manually extract the coefficients and standard errors, etc., but I'm wondering if there isn't a more efficient way to coerce this object into something that'd work with either package.
Here's a reproducible example:
install.packages("bife")
library(bife)
data("iris")
iris$big <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length),1,0)
output <- bife(big ~ Sepal.Width + Petal.Length | Species, data=iris, "logit")
I think I found an alternative solution for this one, even if it probably comes too late.
Basically, I first went to the repository of the "texreg" package and found this function:
extract.bife <- function(model,
                         include.loglik = TRUE,
                         include.deviance = TRUE,
                         include.nobs = TRUE,
                         ...) {
  s <- summary(model)
  coefficient.names <- rownames(s$cm)
  co <- s$cm[, 1]
  se <- s$cm[, 2]
  pval <- s$cm[, 4]
  gof <- numeric()
  gof.names <- character()
  gof.decimal <- logical()
  if (include.loglik == TRUE) {
    lik <- logLik(model)
    gof <- c(gof, lik)
    gof.names <- c(gof.names, "Log Likelihood")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.deviance == TRUE) {
    gof <- c(gof, deviance(model))
    gof.names <- c(gof.names, "Deviance")
    gof.decimal <- c(gof.decimal, TRUE)
  }
  if (include.nobs == TRUE) {
    n <- s$nobs["nobs"]
    gof <- c(gof, n)
    gof.names <- c(gof.names, "Num. obs.")
    gof.decimal <- c(gof.decimal, FALSE)
  }
  tr <- createTexreg(
    coef.names = coefficient.names,
    coef = co,
    se = se,
    pvalues = pval,
    gof.names = gof.names,
    gof = gof,
    gof.decimal = gof.decimal
  )
  return(tr)
}
So for your example, just apply it to your model and call texreg() to get LaTeX output (both createTexreg and texreg come from the texreg package, so load it first):
library(texreg)
tr <- extract.bife(output)
texreg(tr)
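Optionally, you can register the extractor with texreg's extract generic, following the same pattern texreg uses for its built-in models, so that texreg() dispatches on bife objects directly:
# register extract.bife for objects of class "bife" from package "bife"
setMethod("extract", signature = className("bife", "bife"),
          definition = extract.bife)
texreg(output)  # no manual extract.bife() call needed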
I hope it will help!
Best

R - extract all strings matching pattern and create relational table

I am looking for a shorter and prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":
id string
1 A 1.001 xxx 123.123
2 B 23,45 lorem ipsum
3 C donald trump
4 D ssss 134, 1,45
What I want to do is extract all the numbers (no matter whether the decimal delimiter is "." or ","; in this case I assume that the string "134, 1,45" can be split into two numbers, 134 and 1.45) and create a data.frame "output" looking similar to this:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
I managed to do this (code below), but the solution is pretty ugly and not very efficient (two for loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for (i in 1:dim(data)[1]) {
  tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
  for (j in (start+1):(start+tmp_len)) {
    output[j, 1] <- data$id[i]
    output[j, 2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
  }
  start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                   substring(output$string, 1, nchar(output$string) - 1),
                                   output$string))
output
1) Base R. This uses relatively simple regular expressions and no packages.
In the first 2 lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these two lines s will be c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45").
In the next 4 lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word, followed by zero or more digits or dots, followed by the end of the word, so only words containing just those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these 4 lines are run, the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).
In the last line of code, convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
2) magrittr. This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
  transform(string = gsub(", ", " ", string)) %>%
  transform(string = gsub(",", ".", string)) %>%
  transform(string = trimws(string)) %>%
  with(setNames(strsplit(string, "\\s+"), id)) %>%
  lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
  replace(lengths(.) == 0, NA) %>%
  stack() %>%
  with(data.frame(id = ind, string = values))
3) dplyr/tidyr. This is an alternative pipeline solution using dplyr and tidyr. unnest() converts to long form, id is made a factor so that we can later use complete() to recover ids that are removed by the subsequent filtering, filter() removes the junk rows, and complete() inserts an NA row for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
  mutate(string = gsub(", ", " ", string)) %>%
  mutate(string = gsub(",", ".", string)) %>%
  mutate(string = trimws(string)) %>%
  mutate(string = strsplit(string, "\\s+")) %>%
  unnest() %>%
  mutate(id = factor(id)) %>%
  filter(grepl("^[0-9.]*$", string)) %>%
  complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
     string := gsub(",", ".", string)][,
     string := trimws(string)][,
     string := setNames(strsplit(string, "\\s+"), id)][,
     list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
     list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update: Removed the assumption that junk words do not contain a digit or dot. Also added (2), (3) and (4) and some improvements.
We can replace the "," between digits with "." (using gsub with lookarounds), extract the numbers with str_extract_all (from stringr, giving a list), replace the list elements that have length zero with NA, set the names of the list from the 'id' column, stack the list into a data.frame, and rename the columns.
library(stringr)
setNames(stack(setNames(lapply(str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".",
data$string, perl = TRUE), "[0-9.]+"), function(x)
if(length(x)==0) NA else as.numeric(x)), data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
x = x[grep("[^0-9.,]$", x, invert=TRUE)]
if (!length(x))
NA_real_
else
mapply(type.convert, x, dec=gsub("[^.,]", "", x), USE.NAMES=FALSE)
})
setNames(rev(stack(spc)), names(data))
id string
1 A 1.001
2 A 123.123
3 B 23.45
4 C <NA>
5 D 134
6 D 1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
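A quick illustration of that limitation (dec takes a single character, so one call cannot handle both delimiters):
type.convert("23,45", dec = ",", as.is = TRUE)  # 23.45
type.convert("1.001", dec = ".", as.is = TRUE)  # 1.001
# with a single dec, the mixed vector cannot all be parsed as numeric
# and stays character:
type.convert(c("23,45", "1.001"), dec = ",", as.is = TRUE)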

Splitting string columns FAST in R

I have a data frame with 107 columns and 745000 rows (much bigger than in my example).
The thing is, I have character columns which I want to separate, because each value seems to contain a type-like suffix at the end.
I want to separate these type-suffix parts into new columns.
I have made my own solution, but it seems to be far too slow for iterating through all 745000 rows 53 times.
So I embed my solution in the following code, with some arbitrary data:
set.seed(1)
code_1 <- paste0(round(runif(5000, 100000, 999999)), "_", round(runif(1000, 1, 15)))
code_2 <- sample(c(paste0(round(runif(10, 100000, 999999)), "_", round(runif(10, 1, 15))), NA), 5000, replace = TRUE)
code_3 <- sample(c(paste0(round(runif(3, 100000, 999999)), "_", round(runif(3, 1, 15))), NA), 5000, replace = TRUE)
code_4 <- sample(c(paste0(round(runif(1, 100000, 999999)), "_", round(runif(1, 1, 15))), NA), 5000, replace = TRUE)
code_type_1 <- rep(NA, 5000)
code_type_2 <- rep(NA, 5000)
code_type_3 <- rep(NA, 5000)
code_type_4 <- rep(NA, 5000)
df <- data.frame(cbind(code_1,
                       code_2,
                       code_3,
                       code_4,
                       code_type_1,
                       code_type_2,
                       code_type_3,
                       code_type_4),
                 stringsAsFactors = FALSE)
df_new <- data.frame(code_1 = character(),
                     code_2 = character(),
                     code_3 = character(),
                     code_4 = character(),
                     code_type_1 = character(),
                     code_type_2 = character(),
                     code_type_3 = character(),
                     code_type_4 = character(),
                     stringsAsFactors = FALSE)
for (i in 1:4) {
  i_t <- i + 4
  temp <- strsplit(df[, c(i)], "[_]")
  for (j in 1:nrow(df)) {
    df_new[c(j), c(i)] <- unlist(temp[j])[1]
    df_new[c(j), c(i_t)] <- ifelse(is.na(unlist(temp[j])[1]), NA, unlist(temp[j])[2])
  }
  print(i)
}
for (i in 1:8) {
  df_new[, c(i)] <- factor(df_new[, c(i)])
}
Do anyone have some ideas how to speed things up here?
First we pre-allocate the results data.frame to the desired final length. This is very important; see The R Inferno, Circle 2. Then we vectorize the inner loop. We also use fixed = TRUE and avoid the regex in strsplit.
system.time({
  df_new1 <- data.frame(code_1 = character(nrow(df)),
                        code_2 = character(nrow(df)),
                        code_3 = character(nrow(df)),
                        code_4 = character(nrow(df)),
                        code_type_1 = character(nrow(df)),
                        code_type_2 = character(nrow(df)),
                        code_type_3 = character(nrow(df)),
                        code_type_4 = character(nrow(df)),
                        stringsAsFactors = FALSE)
  for (i in 1:4) {
    i_t <- i + 4
    temp <- do.call(rbind, strsplit(df[, c(i)], "_", fixed = TRUE))
    df_new1[, i] <- temp[, 1]
    df_new1[, i_t] <- ifelse(is.na(temp[, 1]), NA, temp[, 2])
  }
  df_new1[] <- lapply(df_new1, factor)
})
# user system elapsed
# 0.029 0.000 0.029
all.equal(df_new, df_new1)
#[1] TRUE
Of course, there are ways to make this even faster, but this is close to your original approach and should be sufficient.
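For instance, here is a sketch with data.table::tstrsplit, which splits all rows of a column in one vectorized call (an alternative approach, not a benchmark claim):
library(data.table)
dt <- as.data.table(df[, 1:4])  # just the code_* columns
for (i in 1:4) {
  cols <- paste0(c("code_", "code_type_"), i)
  # tstrsplit returns a list of two vectors: the parts before and after "_"
  dt[, (cols) := tstrsplit(get(cols[1]), "_", fixed = TRUE)]
}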
Here's another way, using gsub inside a custom function in combination with purrr::dmap() - which is equivalent to lapply, but outputs a data.frame instead of a list.
library(purrr)
# Define a function which gets rid of everything after and including "_"
replace01 <- function(df, ptrn = "_.*")
  dmap(df[, 1:4], gsub, pattern = ptrn, replacement = "")
# Because "pattern" is an argument, we can change it to get the 2nd part, then cbind()
test <- cbind(replace01(df),
              replace01(df, ptrn = ".*_"))
Note that the output here has character columns; you can always convert them to factor if you like, as in the one-liner below.
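A minimal sketch of that conversion (assuming test from above):
test[] <- lapply(test, factor)  # convert every column to factor in place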
Another possibility, using stri_split_fixed from stringi:
library(stringi)
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
  x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify = TRUE)
  y <- c(x[, 1], x[, 2])
  y[y == ""] <- NA
  y
})), colnames(df)) -> df_new
or
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
  x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify = TRUE)
  c(x[, 1], x[, 2])
})), colnames(df)) -> df_new
df_new[df_new == ""] <- NA
df_new
which is marginally faster:
Unit: milliseconds
expr min lq mean median uq max neval cld
na_after 669.8357 718.1301 724.8803 723.5521 732.9998 790.1405 10 a
na_inner 719.3362 738.1569 766.4267 762.1594 791.6198 825.0269 10 b

Text file to list in R

I have a large text file with a variable number of fields in each row. The first entry in each row corresponds to a biological pathway, and each subsequent entry corresponds to a gene in that pathway. The first few lines might look like this
path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
I need to read this file into R as a list, with each element being a character vector, and the name of each element in the list being the first element on the line, for example:
> pathways <- list(
+ path1=c("gene1","gene2"),
+ path2=c("gene3","gene4","gene5","gene6"),
+ path3=c("gene7","gene8","gene9")
+ )
>
> str(pathways)
List of 3
$ path1: chr [1:2] "gene1" "gene2"
$ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
$ path3: chr [1:3] "gene7" "gene8" "gene9"
>
> str(pathways$path1)
chr [1:2] "gene1" "gene2"
>
> print(pathways)
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
...but I need to do this automatically for thousands of lines. I saw a similar question posted here previously, but I couldn't figure out how to do this from that thread.
Thanks in advance.
Here's one way to do it:
# Read in the data
x <- scan("data.txt", what="", sep="\n")
# Separate elements by one or more whitepace
y <- strsplit(x, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(y) <- sapply(y, `[[`, 1)
#names(y) <- sapply(y, function(x) x[[1]]) # same as above
# Remove the first vector element from each list element
y <- lapply(y, `[`, -1)
#y <- lapply(y, function(x) x[-1]) # same as above
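With the three sample lines above saved as data.txt, y matches the structure asked for:
str(y)
# List of 3
#  $ path1: chr [1:2] "gene1" "gene2"
#  $ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
#  $ path3: chr [1:3] "gene7" "gene8" "gene9"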
One solution is to read the data in via read.table(), but use the fill = TRUE argument to pad the rows with fewer "entries", convert the resulting data frame to a list and then clean up the "empty" elements.
First, read your snippet of data in:
con <- textConnection("path1 gene1 gene2
path2 gene3 gene4 gene5 gene6
path3 gene7 gene8 gene9
")
dat <- read.table(con, fill = TRUE, stringsAsFactors = FALSE)
close(con)
Next we drop the first column, saving it first to use later as the names of the list:
nams <- dat[, 1]
dat <- dat[, -1]
Convert the data frame to a list. Here I just split the data frame on the indices 1,2,...,n where n is the number of rows:
ldat <- split(dat, seq_len(nrow(dat)))
Clean up the empty cells:
ldat <- lapply(ldat, function(x) x[x != ""])
Finally, apply the names
names(ldat) <- nams
Giving:
> ldat
$path1
[1] "gene1" "gene2"
$path2
[1] "gene3" "gene4" "gene5" "gene6"
$path3
[1] "gene7" "gene8" "gene9"
A quick solution based on the linked page...
inlist <- strsplit(readLines("file.txt"), "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- lapply(inlist, head, n = 1)
One more solution:
sl <- c("path1 gene1 gene2", "path2 gene1 gene2 gene3") # created by readLines
f <- function(l, s) {
  v <- strsplit(s, " ")[[1]]
  l[[v[1]]] <- v[2:length(v)]
  return(l)
}
res <- Reduce(f, sl, list())
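For the two sample strings above, the result is the desired named list:
str(res)
# List of 2
#  $ path1: chr [1:2] "gene1" "gene2"
#  $ path2: chr [1:3] "gene1" "gene2" "gene3"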