Modifying elements in a vector that do not match a pattern - regex

I have a vector with entries like these:
[5484] "QUERY___05nirs_WM_WATMOP079"
[5485] "QUERY___05nirs_WM_WATMAP075"
[5486] "QUERY___05nirs_WM_WATMAP037"
[5487] "QUERY___05nirs_WM_WATMOP071"
[5488] "QUERY___03nirs_WM_WATMAP168"
[5489] "2022819637_Scalindua_MAnamSca741_C384"
[5490] "237637177_clone_PeruG11"
[5491] "237637158_clone_PeruD2"
[5492] "237637172_clone_PeruD12"
[5493] "237637168_clone_PeruE11"
I would like to prepend "QUERY___" to those elements that do not already start with it. I figured out how to get a logical vector with grepl that tells me which elements do not have "QUERY", but I have no idea how to use that vector to change the original vector.

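Try a negative lookahead, so the prefix is added only when the string does not already start with "QUERY___". (A character class such as [^QUERY] excludes the single characters Q, U, E, R and Y, not the word, so it would also skip an entry like "Yeast_1".)
sub("^(?!QUERY___)", "QUERY___", string, perl = TRUE)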
#[1] "QUERY___05nirs_WM_WATMOP079"
#[2] "QUERY___05nirs_WM_WATMAP075"
#[3] "QUERY___05nirs_WM_WATMAP037"
#[4] "QUERY___05nirs_WM_WATMOP071"
#[5] "QUERY___03nirs_WM_WATMAP168"
#[6] "QUERY___2022819637_Scalindua_MAnamSca741_C384"
#[7] "QUERY___237637177_clone_PeruG11"
#[8] "QUERY___237637158_clone_PeruD2"
#[9] "QUERY___237637172_clone_PeruD12"
#[10] "QUERY___237637168_clone_PeruE11"
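To illustrate why a bracket expression like [^QUERY] is not the same as "does not start with QUERY", here is a small sketch. The vector `ids` and the entry "Yeast_clone_1" are made up for illustration and are not part of the original data.

```r
# Hypothetical sample: one entry already prefixed, one starting with "Y",
# one starting with a digit.
ids <- c("QUERY___05nirs", "Yeast_clone_1", "237637177_clone_PeruG11")

# Character class: [^QUERY] means "one character that is not Q, U, E, R or Y",
# so the entry starting with "Y" is silently left without a prefix.
with_class <- sub("(^[^QUERY].*)", "QUERY___\\1", ids)

# Negative lookahead: matches the start of the string only when it is not
# followed by the literal text QUERY___ (needs perl = TRUE).
with_lookahead <- sub("^(?!QUERY___)", "QUERY___", ids, perl = TRUE)

with_class
with_lookahead
```

With the lookahead, every entry that lacks the prefix gets one, whatever its first letter is.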

Here's a simple approach that uses the logical vector from grepl directly:
string <- c("QUERY___05nirs_WM_WATMOP079",
"QUERY___05nirs_WM_WATMAP075",
"QUERY___05nirs_WM_WATMAP037",
"QUERY___05nirs_WM_WATMOP071",
"QUERY___03nirs_WM_WATMAP168",
"2022819637_Scalindua_MAnamSca741_C384",
"237637177_clone_PeruG11",
"237637158_clone_PeruD2",
"237637172_clone_PeruD12",
"237637168_clone_PeruE11")
ind <- grepl("^QUERY___", string)
string2 <- string
# modify only the elements that lack the prefix; the original order is preserved
string2[!ind] <- paste0("QUERY___", string2[!ind])
string2
#[1] "QUERY___05nirs_WM_WATMOP079" "QUERY___05nirs_WM_WATMAP075"
#[3] "QUERY___05nirs_WM_WATMAP037" "QUERY___05nirs_WM_WATMOP071"
#[5] "QUERY___03nirs_WM_WATMAP168" "QUERY___2022819637_Scalindua_MAnamSca741_C384"
#[7] "QUERY___237637177_clone_PeruG11" "QUERY___237637158_clone_PeruD2"
#[9] "QUERY___237637172_clone_PeruD12" "QUERY___237637168_clone_PeruE11"
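If the prefixed and unprefixed entries are interleaved rather than grouped, the result should keep the original order. A minimal sketch of one way to do that in a single expression, using ifelse() with base R's startsWith() (available since R 3.3.0); the vector `ids` is made up for illustration:

```r
# Hypothetical sample with the already-prefixed entry in the middle.
ids <- c("237637177_clone_PeruG11",
         "QUERY___05nirs_WM_WATMOP079",
         "237637158_clone_PeruD2")

# Keep entries that already start with the prefix, prepend it to the rest.
prefixed <- ifelse(startsWith(ids, "QUERY___"), ids, paste0("QUERY___", ids))
prefixed
```

Each element stays at its original position, which matters when the vector is used as row names or labels for other data.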

Related

Filter/grep functions behaving oddly

Take the following code to select only alphanumeric strings from a list of strings:
isValid = function(string){
return(grep("^[A-z0-9]+$", string))
}
strings = c("aaa", "test#test.com", "", "valid")
print(Filter(isValid, strings))
The output is [1] "aaa" "test#test.com".
Why is "valid" not outputted, and why is "test#test.com" outputted?
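Filter expects a predicate function that returns a single TRUE or FALSE for each element; grep returns integer indices instead. Use grepl. (Separately, [A-z] also matches the punctuation characters that sit between Z and a in ASCII, such as _ and ^, so [A-Za-z0-9] is the safer class.)
isValid = function(string){
return(grepl("^[A-Za-z0-9]+$", string))
}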
strings = c("aaa", "test#test.com", "", "valid")
print(Filter(isValid, strings))
[1] "aaa" "valid"
Why didn't grep work? It comes down to how Filter coerces the predicate's result to logical.
Called on the whole vector, grep("^[A-z0-9]+$", strings) correctly returns 1 4, the indices of the matches (the first and fourth elements).
But that is not how Filter works. It calls the predicate on each element separately and then evaluates as.logical(unlist(lapply(x, f))).
So it ran isValid(strings[1]), then isValid(strings[2]), and so on, which produced this list:
[[1]]
[1] 1
[[2]]
integer(0)
[[3]]
integer(0)
[[4]]
[1] 1
It then called unlist on that list, which drops the empty integer(0) entries and leaves 1 1, and turned that into the logical vector TRUE TRUE. So in the end you got:
strings[which(c(TRUE, TRUE))]
which turned into
strings[c(1,2)]
[1] "aaa" "test#test.com"
Moral of the story, don't use Filter :)
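The steps described above can be reproduced by hand. This is a sketch that mimics what happens per element rather than a copy of Filter's internals; the names `per_element` and `keep` are made up for illustration.

```r
strings <- c("aaa", "test#test.com", "", "valid")

# What the grep-based predicate returns when called on one element at a time:
# 1 for a match, integer(0) for no match.
per_element <- lapply(strings, function(s) grep("^[A-Za-z0-9]+$", s))

# unlist() drops the empty results, so only two values survive and their
# positions no longer line up with the four input strings.
flattened <- unlist(per_element)

# Indexing with a logical vector from grepl avoids the problem entirely.
keep <- grepl("^[A-Za-z0-9]+$", strings)
strings[keep]
```

Because `flattened` has length 2 rather than 4, the first two strings are selected regardless of which ones actually matched.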
You could go the opposite direction with this and exclude any strings with punctuation, i.e.
isValid <- function(string){
v1 <- string[!string %in% grep('[[:punct:]]', string, value = TRUE)]
return(v1[v1 != ''])
}
isValid(strings)
#[1] "aaa" "valid"

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
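You don't need a loop to do that. (The error in your loop comes from the last iteration: content[i+1] is then NA, so the condition of the if is NA rather than TRUE or FALSE.)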
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
# identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars == "")
# split the vector (and filter out "" using the select vector)
select <- chars != ""
splitted <- split(chars[select], ids[select])
# paste the groups together
res <- sapply(splitted, paste, collapse = " ")
# remove names (if necessary, probably not)
res <- unname(res) # thanks @Roland
res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace each "" with a marker you can later split on, collapse everything into one string, and then use strsplit(). Here the marker is the newline character, so that printing the collapsed string, e.g. with cat(txt3), already shows each phrase on its own line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
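Another way to add to the mix, using a vector with three groups:
x <- c("hello,", "world", "", "how", "are", "you", "", "more", "text", "")
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
# 1 2 3
#"hello, world" "how are you" "more text"
This numbers the groups by counting the empty strings seen so far, drops the empty strings, and pastes the remaining elements together by group.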
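For the EDIT in the question (applying this to every document in a collection), one option is to wrap the cumsum/split idea in a function and call it with lapply. This is a sketch only: `combine_lines` and `docs` are hypothetical names, and it assumes each document's content is available as a plain character vector.

```r
# Collapse runs of non-empty lines into one string per run.
combine_lines <- function(content) {
  keep   <- content != ""                  # drop the separator lines
  groups <- cumsum(content == "")[keep]    # group id grows at each ""
  unname(vapply(split(content[keep], groups), paste,
                character(1), collapse = " "))
}

# Hypothetical stand-in for the documents' contents.
docs <- list(
  doc1 = c("hello,", "world", "", "how", "are", "you", ""),
  doc2 = c("more", "text", "")
)

combined <- lapply(docs, combine_lines)
combined
```

vapply is used instead of sapply so the result is always a character vector, even for a document with no non-empty lines.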

Why does strsplit return a list

Consider
text <- "who let the dogs out"
fooo <- strsplit(text, " ")
fooo
[[1]]
[1] "who" "let" "the" "dogs" "out"
The output of strsplit is a list whose first element is a vector containing the words above.
Why does the function behave that way? Is there any case in which it would return a list with more than one element?
I can access the words using
fooo[[1]][1]
[1] "who"
but is there no simpler way?
To your first question, one reason that comes to mind is so that it can keep different length result vectors in the same object, since it is vectorized over x:
text <- "who let the dogs out"
vtext <- c(text, "who let the")
##
> strsplit(text, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
> strsplit(vtext, " ")
[[1]]
[1] "who" "let" "the" "dogs" "out"
[[2]]
[1] "who" "let" "the"
If this were to be returned as a data.frame, matrix, etc... instead of a list, it would have to be padded with additional elements.
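On the second part of the question (a simpler way to get at the words), a short sketch of the usual options: take the first list element once when there is a single input string, or pull one position out of every element when there are several. The variable names are made up for illustration.

```r
text  <- "who let the dogs out"
vtext <- c(text, "who let the")

# Single string: extract the one list element up front, then index normally.
words <- strsplit(text, " ")[[1]]
first_word <- words[1]

# Several strings: apply `[` to each list element to get, e.g., every first word.
first_words <- vapply(strsplit(vtext, " "), `[`, character(1), 1)

# lengths() shows why a list is needed: the pieces differ in length.
piece_counts <- lengths(strsplit(vtext, " "))
```

unlist(strsplit(text, " ")) gives the same vector as `words`, but with several input strings it would merge all pieces into one vector and lose track of which string each piece came from.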

Replace parts of string using package stringi (regex)

I have some string
string <- "abbccc"
I want to replace each run of the same letter with that letter followed by the number of times it occurs. So I want to have something like this:
"ab2c3"
I use stringi package to do this, but it doesn't work exactly like I want. Let's say I already have vector with parts for replacement:
vector <- c("b2", "c3")
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector)
The output:
[1] "ab2b2" "ac3c3"
The output I want: [1] "ab2c3"
I also tried this way
stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all=FALSE)
but I get an error:
Error in stri_replace_all_regex(string, "([a-z])\\1{1,8}", vector, vectorize_all = FALSE) :
vector length not consistent with other arguments
Not regex, but a strsplit and rle with some paste magic:
string <- c("abbccc", "bbaccc", "uffff", "aaabccccddd")
sapply(lapply(strsplit(string, ""), rle), function(x) {
paste(x[[2]], ifelse(x[[1]] == 1, "", x[[1]]), sep="", collapse="")
})
## [1] "ab2c3" "b2ac3" "uf4" "a3bc4d3"
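Not a stringi solution and not a regex either, but you can do it by splitting the string and using rle:
string <- "abbccc"
res <- paste(collapse = "", do.call(paste0, rle(strsplit(string, "", fixed = TRUE)[[1]])[2:1]))
# drops the run lengths of 1; note that this would also strip the digit 1 from counts such as 10 or 11
gsub("1", "", res)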
#[1] "ab2c3"
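For a regex-based route closer to what the question attempted, the runs can be matched with a backreference and the counts computed from the matches. This is a base R sketch, not a stringi solution; `compress_runs` is a hypothetical helper that handles one string at a time and, like the question's pattern, only lowercase letters.

```r
compress_runs <- function(s) {
  # ([a-z])\\1* matches one letter followed by any repeats of that same letter
  m    <- gregexpr("([a-z])\\1*", s, perl = TRUE)
  runs <- regmatches(s, m)[[1]]            # e.g. "a" "bb" "ccc"
  n    <- nchar(runs)
  # letter, then the run length unless the run has length 1
  paste0(substr(runs, 1, 1), ifelse(n == 1, "", n), collapse = "")
}

compress_runs("abbccc")

# Apply to several strings at once.
vapply(c("abbccc", "uffff"), compress_runs, character(1), USE.NAMES = FALSE)
```

Because the count is only omitted when the run length is exactly 1, runs of 10 or more keep all their digits.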

Remove all characters before a period in a string

This keeps everything before a period:
gsub("\\..*","", data$column )
How do I keep everything after the period?
To remove all the characters before a period in a string (including the period). Because .* is greedy, this removes everything up to the last period:
gsub("^.*\\.","", data$column )
Example:
> data <- 'foobar.barfoo'
> gsub("^.*\\.","", data)
[1] "barfoo"
To remove all the characters before the first period (including the period), make the quantifier lazy:
> data <- 'foo.bar.barfoo'
> gsub("^.*?\\.","", data)
[1] "bar.barfoo"
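The two cases can be compared side by side. The sketch below uses a made-up vector `x`, sub() instead of gsub() (one replacement is enough because the pattern is anchored), and a negated character class [^.]* in place of the lazy quantifier, which is an alternative way to stop at the first period.

```r
x <- c("foobar.barfoo", "foo.bar.barfoo", "foobar")

# Greedy .* runs to the LAST period, so only the final segment is kept.
after_last <- sub("^.*\\.", "", x)

# [^.]* cannot cross a period, so the match ends at the FIRST one.
after_first <- sub("^[^.]*\\.", "", x)

after_last
after_first
```

Strings without any period do not match either pattern and are returned unchanged.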
You could use stringi with lookbehind regex
library(stringi)
stri_extract_first_regex(data1, "(?<=\\.).*")
#[1] "bar.barfoo"
stri_extract_first_regex(data, "(?<=\\.).*")
#[1] "barfoo"
If the string doesn't have a ., this returns NA (the question does not say how this case should be handled)
stri_extract_first_regex(data2, "(?<=\\.).*")
#[1] NA
###data
data <- 'foobar.barfoo'
data1 <- 'foo.bar.barfoo'
data2 <- "foobar"
If you don't want to think about the regex for this, the qdap package has the char2end function, which grabs everything from a particular character to the end of the string.
data <- c("foo.bar", "foo.bar.barfoo")
library(qdap)
char2end(data, ".")
## [1] "bar" "bar.barfoo"
Use this:
gsub(".*\\.","", data$column )
This keeps everything after the last period.
I run a course on Data Analysis and the students came up with this solution:
library(stringr)
get_after_period <- function(my_vector) {
# Return the part of each string that follows its first period
# (the period itself is not included)
# my_vector, a string vector
str_sub(my_vector, str_locate(my_vector, "\\.")[,1]+1)
}
Now, just call the function:
my_vector <- c('foobar.barfoo', 'amazing.point')
get_after_period(my_vector)
[1] "barfoo" "point"
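The same idea (find the position of the first period, then take the rest) can be sketched in base R without extra packages. `after_first_period` is a hypothetical name, and returning NA for strings without a period is an assumption about the desired behaviour, since the question does not specify it.

```r
after_first_period <- function(x) {
  # position of the first literal "." in each string, -1 if there is none
  pos <- as.integer(regexpr(".", x, fixed = TRUE))
  # take everything after that position; NA when no period was found
  ifelse(pos > 0, substring(x, pos + 1), NA_character_)
}

result <- after_first_period(c("foobar.barfoo", "foo.bar.barfoo", "foobar"))
result
```

Using fixed = TRUE treats the period as a literal character, so no escaping is needed.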