I have two lists (more exactly, atomic character vectors) that I want to compare using regular expressions, to produce a subset of one of the lists. I can use a 'for' loop for this, but is there simpler code? The following exemplifies my case:
# list of unique cities
city <- c('Berlin', 'Perth', 'Oslo')
# list of city-months, like 'New York-Dec'
temp <- c('Berlin-Jan', 'Delhi-Jan', 'Lima-Feb', 'Perth-Feb', 'Oslo-Jan')
# need sub-set of 'temp' for only 'Jan' month for only the items in 'city' list:
# 'Berlin-Jan', 'Oslo-Jan'
Added clarification: In the actual case I am seeking code for, the values of the 'month' equivalent are more complex: fairly random alphanumeric values in which only the first two characters carry information of interest to me (they have to be '01').
Added actual case example:
# equivalent of 'city' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
# equivalent of 'temp' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-[\d]{2}[0-9A-Z]+
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')
# sub-set wanted (must have '01' after the 'patient' ID part)
# 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76'
Something like this?
temp <- temp[grepl("Jan", temp)]
temp[sapply(strsplit(temp, "-"), "[[", 1) %in% city]
# [1] "Berlin-Jan" "Oslo-Jan"
Even better, borrowing the idea from @agstudy:
> temp[temp %in% paste0(city, "-Jan")]
# [1] "Berlin-Jan" "Oslo-Jan"
Edit: How about this?
> sample[gsub("(.*-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"
Here's a solution, coming after the others, that handles your new requirements:
sample[na.omit(pmatch(paste0(patient, '-01'), sample))]
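With the example patient and sample vectors above, this returns:
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"
Note that pmatch only accepts unique partial matches, so this relies on each 'patient-01' prefix matching at most one element of sample.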
You can use gsub
x <- gsub(paste(paste(city,collapse='-Jan|'),'-Jan',sep=''),1,temp)
> temp[x==1]
[1] "Berlin-Jan" "Oslo-Jan"
The pattern here is:
"Berlin-Jan|Perth-Jan|Oslo-Jan"
Here's a solution with two partial string matches...
temp[agrep("Jan",temp)[which(agrep("Jan",temp) %in% sapply(city, agrep, x=temp))]]
# [1] "Berlin-Jan" "Oslo-Jan"
As a function just for fun...
fun <- function(x,y,pattern) y[agrep(pattern,y)[which(agrep(pattern,y) %in% sapply(x, agrep, x=y))]]
# x is a vector containing the values to filter on (here, the cities)
# y is a vector containing the data to be filtered (here, the city-months)
# pattern is the quoted pattern you're filtering on
fun(city, temp, "Jan")
# [1] "Berlin-Jan" "Oslo-Jan"
I am trying to get only those words from a list that are present in a given sentence. The words can include bigrams as well. For example,
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
My result should be:
"really good" "better"
I have 1000 sentences like this on which I need to compare the words, and the word list is also bigger. I tried a brute-force method using grep, but it took a lot of time (as expected). I am looking for a way to get the matching words with better performance.
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams) - 1), function(i) { paste(unigrams[i], unigrams[i+1]) }))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
                      grams,
                      by = c('wordList' = 'grams'))
matches
wordList
1 really good
2 better
I was able to use @epi99's answer with a slight modification.
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams) - 1), function(i) { paste(unigrams[i], unigrams[i+1]) }))
# .. and combine into a single vector
grams=c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
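If only the matching words are needed, the match/na.omit pair can be collapsed into a single intersect call; a minimal sketch, assuming grams is built as above (ordering follows wordList rather than grams):
matchingwords <- intersect(wordList, grams)
# [1] "really good" "better"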
What about
unlist(sapply(wordList, function(x) grep(x, sentence)))
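Note that this returns the indices of the matching sentences, named by word, rather than the words themselves; the matched words can be read off the names. A sketch:
matches <- unlist(sapply(wordList, function(x) grep(x, sentence)))
names(matches)
# [1] "really good" "better"
It is still one grep per word, though, so it shares the performance problem the question mentions.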
Given a set of regular expressions, is there a simple way to match multiple patterns, and replace the matched text according to the pattern that was matched?
For example, for the following data x, each element begins with either a number or a letter, and ends with either a number or a letter. Let's call these patterns num_num (for begins with number, ends with number), num_let (begins with number, ends with letter), let_num, and let_let.
x <- c('123abc', '78fdsaq', 'aq12111', '1p33', '123', 'pzv')
type <- list(
num_let='^\\d.*[[:alpha:]]$',
num_num='^\\d(.*\\d)?$',
let_num='^[[:alpha:]].*\\d$',
let_let='^[[:alpha:]](.*[[:alpha:]])$'
)
To replace each string with the name of the pattern it follows, we could do:
m <- lapply(type, grep, x)
rep(names(type), sapply(m, length))[order(unlist(m))]
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
Is there a more efficient approach?
gsubfn?
I know that with gsubfn we can simultaneously replace different matches, e.g.:
library(gsubfn)
gsubfn('.*', list('1p33'='foo', '123abc'='bar'), x)
## [1] "bar" "78fdsaq" "aq12111" "foo" "123" "pzv"
but I'm not sure whether the replacements can be made dependent on the pattern that was matched rather than on the match itself.
stringr?
str_replace_all doesn't play nicely with this example, since matches are replaced for patterns iteratively, and we end up with everything being overwritten with let_let:
library(stringr)
str_replace_all(x, setNames(names(type), unlist(type)))
## [1] "let_let" "let_let" "let_let" "let_let" "let_let" "let_let"
Reordering type so the pattern corresponding to let_let appears first solves the problem, but needing to do this makes me nervous.
type2 <- rev(type)
str_replace_all(x, setNames(names(type2), unlist(type2)))
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
Perhaps one of these.
# base R method
mm2 <- character(length(x))
for (n in seq_along(type)) mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])
# purrr 0.2.0 method
library(purrr)
mm3 <- map(type, grep, x = x) %>% (function(z) replace(x, flatten_int(z), rep(names(type), lengths(z))))
The base R method is somewhat faster than the posted code for both small and larger data sets. The purrr method is slower than the posted code for small data sets but about the same as the base R method for larger data sets.
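One way to check such timings yourself; a rough sketch using the microbenchmark package on the example data (results will vary with data size):
library(microbenchmark)
microbenchmark(
  posted = {
    m <- lapply(type, grep, x)
    rep(names(type), sapply(m, length))[order(unlist(m))]
  },
  base_r = {
    mm2 <- character(length(x))
    for (n in seq_along(type)) mm2 <- replace(mm2, grep(type[[n]], x), names(type)[n])
  }
)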
stringr
We can use str_replace_all if we alter the replacements so they are no longer matched by any of the regular expressions and then add an additional replacement to return them to their original form. For example
library(stringr)
type2 <- setNames(c(str_replace(names(type), "(.*)", "__\\1__"), "\\1"),
                  c(unlist(type), "^__(.*)__$"))
str_replace_all(x, type2)
## [1] "num_let" "num_let" "let_num" "num_num" "num_num" "let_let"
grepl and tidyr
Another approach is to match first and then replace; one way to do this is with grepl and tidyr.
library(plyr)
library(dplyr)
library(tidyr)
out <- data.frame(t(1*aaply(type, 1, grepl, x)))
out[out == 0] <- NA
out <- out %>%
  mutate(id = 1:nrow(.)) %>%
  gather(name, value, -id, na.rm = TRUE) %>%
  select(name)
as.character(out[, 1])
## [1] "num_let" "num_let" "num_num" "num_num" "let_num" "let_let"
While this approach doesn't look as efficient, it makes it easy to find rows with more or fewer than one match.
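For example, counting matches per string; a minimal sketch that uses sapply on the same grepl idea:
hits <- sapply(type, grepl, x)  # logical matrix: one row per string, one column per pattern
which(rowSums(hits) != 1)       # strings with zero or multiple matches (none in this example)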
From what I understand, substitution matching is implemented in PCRE2, and I believe it allows this type of problem to be solved directly in the regex. Unfortunately, it seems that no one has built a PCRE2 package for R yet.
I want to apply a regex operation to a subset of list elements (which are character strings) using lapply, and return a list of the same length as the original. The list elements are long strings (derived from reading in long text files and collapsing paragraphs into a single string). The regex operation is valid only for the subset of list elements/strings. I want the non-subsetted list elements (character strings) to be returned in their original state.
The regex operation is str_extract from the stringr package, i.e. I want to extract a substring from a longer string. I subset the list elements based on a regex pattern in the filename.
An example with simplified data:
library(stringr)
texts <- as.list(c("abcdefghijkl", "mnopqrstuvwxyz", "ghijklmnopqrs", "uvwxyzabcdef"))
filenames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(texts) <- filenames
regexp <- "abcdef"
I know in advance to which strings I want to apply the regex operation, and hence I want to subset these strings. That is, I don't want to run the regex over all elements in the list, as doing so will return some invalid results (which is not apparent in this simplified example).
I've made a few naive efforts, e.g.:
x <- lapply(texts[str_detect(names(texts), "1997")], str_extract, regexp)
> x
$AB1997R.txt
[1] "abcdef"
$DC1997S.txt
[1] "abcdef"
which returns a reduced-length list containing just the substrings found.
But the results I want to get are:
> x
$AB1997R.txt
[1] "abcdef"
$BG2000S.txt
[1] "mnopqrstuvwxyz"
$MN1999R.txt
[1] "ghijklmnopqrs"
$DC1997S.txt
[1] "abcdef"
where the strings not containing the regex pattern are returned in their original state.
I have informed myself about stringr, lapply and llply (in the plyr package), but many operations are illustrated using data frames as examples, not lists, and don't involve regex operations on character strings. I can achieve my goal using a for loop, but I'm trying to get away from that, as is generally advised, and get better at using the apply family of functions.
You can use the subset replacement operator [<-:
x <- texts
is1997 <- str_detect(names(texts), "1997")
x[is1997] <- lapply(texts[is1997], str_extract, regexp)
x
# $AB1997R.txt
# [1] "abcdef"
#
# $BG2000S.txt
# [1] "mnopqrstuvwxyz"
#
# $MN1999R.txt
# [1] "ghijklmnopqrs"
#
# $DC1997S.txt
# [1] "abcdef"
#
You can try sub
sub(paste0('.*(', regexp, ').*'), '\\1', texts)
# AB1997R.txt BG2000S.txt MN1999R.txt DC1997S.txt
# "abcdef" "mnopqrstuvwxyz" "ghijklmnopqrs" "abcdef"
Also, if you need to restrict this to the elements of 'texts' whose names match 1997, we can use grep:
indx <- grep('1997', names(texts))
texts[indx] <- sub(paste0('.*(', regexp, ').*'), '\\1', texts[indx])
as.list(texts)
I have some incorrectly formatted dates mixed in between well formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the leading dashes, but it is also necessary to remove the trailing -01 or -1, so that the corrected values are:
desired <- c("1.1.11","1.11.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05")
What I'm struggling with is the -01 part, since removing these would also remove part of the correctly formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters present at the start, or the -01 or -1 present at the end when it is not preceded by a hyphen and two digits (that lookbehind is what protects the correctly formatted dates).
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of a data.frame.
The solution itself just matches one pattern or the other and drops the rest by replacing the whole string with the captured subpattern.
I observe here that an illegal suffix (-01 or -1) exists only when the date has a -1 or --1 prefix.
You could first collect all such values in an array, i.e. "--1.1.11-01", "--1.11.12-1", "--1.1.13-01", "--1.1.14-01", "--1.10.10-01", "-1.10.11-01".
Now you can check each value for the -1 or --1 prefix; wherever it is present, mark that value so the -01 or -1 suffix is removed as well (see the sketch below).
Based on the input pattern above, I believe this strategy would work.
Please let me know if it does.
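A direct translation of that strategy into R might look like this; a minimal sketch, keying off the leading dashes exactly as described:
x <- as.character(df$col)
bad <- grepl('^-+1', x)                             # values with the -1/--1 prefix
x[bad] <- sub('-0?1$', '', sub('^-+', '', x[bad]))  # drop the dashes, then the -01/-1 suffix
x
# [1] "1.1.11"     "1.11.12"    "1.1.13"     "1.1.14"     "1.10.10"
# [6] "1.10.11"    "1.10.12"    "2010-03-31" "2010-04-01" "2010-04-05"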
I can only use stringr/regular expressions; I am working in R.
I have a CSV I downloaded called mpg2, and a subset of it containing only Mercedes-Benz makes. What I am trying to do is split the model into alpha and numeric parts so I can plot them. For example, a Mercedes C300 would need to be split into C and 300, or GLS500 into GLS and 550.
So now I have all of the model numbers; next I want to split between letters and numbers.
I have tried
mercedes<- subset(mpg2, make=="Mercedes-Benz")
str_split(mercedes$model, "[0:9]")
but this doesn't do what I want it to, and I have played with n= and that doesn't work either.
then I have
mercedes$modelnumber <- as.numeric(gsub("([0-9]+).*$", "\\1", mercedes$model))
which makes a column of only the numbers; I can't get the letters to work.
If I need to upload my specific dataset let me know, I just have to figure out how to do that.
But I need to basically split "XYZ123" into its alpha and numeric parts and put them in 2 separate columns.
Something like this:
x <- "XYZ123"
x <- gsub("([0-9]+)",",\\1",x)
strsplit(x,",")
I've replaced the original group of numbers with a comma followed by that group, so that I can split on it easily.
You can use something like this:
SplitMe <- function(string, alphaFirst = TRUE) {
  Pattern <- ifelse(isTRUE(alphaFirst),
                    "(?<=[a-zA-Z])(?=[0-9])",
                    "(?<=[0-9])(?=[a-zA-Z])")
  strsplit(string, split = Pattern, perl = TRUE)
}
String <- c("C300", "GLS500", "XYZ123")
SplitMe(String)
# [[1]]
# [1] "C" "300"
#
# [[2]]
# [1] "GLS" "500"
#
# [[3]]
# [1] "XYZ" "123"
To get the output as a two column matrix, just use do.call(rbind, ...):
do.call(rbind, SplitMe(String))
# [,1] [,2]
# [1,] "C" "300"
# [2,] "GLS" "500"
# [3,] "XYZ" "123"
The above is just a convenience function that I have saved for the following scenarios:
strsplit(String, split = "(?<=[a-zA-Z])(?=[0-9])", perl = T)
and
strsplit(String, split = "(?<=[0-9])(?=[a-zA-Z])", perl = T)
This function won't change a GLS500 into a GLS550 though.
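Since the question is restricted to stringr, the same split can also be done there; a sketch using str_match with anchored capture groups:
library(stringr)
String <- c("C300", "GLS500", "XYZ123")
str_match(String, "^([A-Za-z]+)([0-9]+)$")[, 2:3]
#      [,1]  [,2]
# [1,] "C"   "300"
# [2,] "GLS" "500"
# [3,] "XYZ" "123"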