R - Multiple conditional matching in a string - regex

Is there a better way to do a conditional string match? For example, the word farm is conditionally matched with rose, floral and tree. Ideally I would like to do the matching without repeating farm.
str = c('rose','farm','rose farm','floral', 'farm floral', 'tree farm')
grep("((?=.*farm)(?=.*rose)|(?=.*farm)(?=.*floral)|(?=.*farm)(?=.*tree))", str, value = TRUE, perl = TRUE)
This returns:
[1] "rose farm" "farm floral" "tree farm"

One way — use a grouping construct to combine the set of words:
grep('(?=.*farm)(?=.*(?:rose|floral|tree))', str, value = TRUE, perl = TRUE)
# [1] "rose farm" "farm floral" "tree farm"
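For readers who want to check the grouped-lookahead idea outside R, here is a minimal sketch in Python. The variable names are illustrative and not from the original post; the pattern is the same as in the answer above, and Python's `re` module supports lookaheads without any extra flag.

```python
import re

words = ["rose", "farm", "rose farm", "floral", "farm floral", "tree farm"]

# Two lookaheads anchored at the same position:
# the string must contain "farm" AND at least one of rose/floral/tree.
pattern = re.compile(r"(?=.*farm)(?=.*(?:rose|floral|tree))")

matches = [w for w in words if pattern.search(w)]
print(matches)
```

Because both lookaheads start from the same position, the order of the words in the string does not matter.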

Related

Find index locations by regex pattern and replace them with a list of indexes in Scala

I have strings in this format:
object[i].base.base_x[i] and I get lists like List(0,1).
I want to use regular expressions in Scala to find each match of [i] in the given string and replace the first occurrence with 0 and the second with 1, giving something like object[0].base.base_x[1].
I have the following code:
val stringWithoutIndex = "object[i].base.base_x[i]" // basically this string is generated dynamically
val indexReplacePattern = raw"\[i\]".r
val indexValues = List(0,1) // list generated dynamically
if (indexValues.nonEmpty) {
  indexValues.map(row => {
    indexReplacePattern.replaceFirstIn(stringWithoutIndex, "[" + row + "]")
  })
} else stringWithoutIndex
Since String is immutable, I cannot update stringWithoutIndex in place, so the output is a list like List("object[0].base.base_x[i]", "object[1].base.base_x[i]").
I tried looking into StringBuilder but I am not sure how to update it. Also, is there a better way to do this? Suggestions other than regex are also welcome.
You could loop through the integers in indexValues using foldLeft and pass the string stringWithoutIndex as the start value.
Then use replaceFirst to replace the first match with the current value of indexValues.
If you want to use a regex, you might use a positive lookahead (?=]) and a positive lookbehind (?<=\[) to assert that the i is between an opening and a closing square bracket.
(?<=\[)i(?=])
For example:
val strRegex = """(?<=\[)i(?=])"""
val res = indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
s.replaceFirst(strRegex, row.toString)
}
How about this:
scala> val str = "object[i].base.base_x[i]"
str: String = object[i].base.base_x[i]
scala> str.replace('i', '0').replace("base_x[0]", "base_x[1]")
res0: String = object[0].base.base_x[1]
This sounds like a job for foldLeft. No need for the if (indexValues.nonEmpty) check.
indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
indexReplacePattern.replaceFirstIn(s, "[" + row + "]")
}
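The fold used in the answers above is not specific to Scala. As a rough sketch of the same idea in Python (names are illustrative; `functools.reduce` stands in for foldLeft), each step receives the string produced by the previous step and replaces only the first remaining match:

```python
import re
from functools import reduce

string_without_index = "object[i].base.base_x[i]"
index_values = [0, 1]

# Match only an "i" that sits directly between "[" and "]".
index_pattern = re.compile(r"(?<=\[)i(?=\])")

# Thread the string through the fold; count=1 replaces the first match only.
result = reduce(
    lambda s, value: index_pattern.sub(str(value), s, count=1),
    index_values,
    string_without_index,
)
print(result)
```

With an empty list the fold returns the start value unchanged, which is why no separate emptiness check is needed.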

Getting a substring from a list of strings

So, as part of learning the language, I wanted to check three strings for a certain pattern and return the first match of that pattern only.
My attempt was to use a combination of find and regular expressions to traverse the list:
def date = [
"some string",
"some other string 11.11.2000",
"another one 20.10.1990"
].find { title ->
title =~ /\d{2}\.\d{2}\.\d{4}/
}
This kind of works, leaving the whole string in date.
My goal, however, would be to end up with "11.11.2000" in date; I assume somehow I should be able to access the capture group, but how?
If you want to return a specific value when finding a matching element in a collection (which as in your case might be part of that element), you need to use findResult.
Your code might then look like this
def date = [
"some string",
"some other string 11.11.2000",
"another one 20.10.1990"
].findResult { title ->
def res = title =~ /\d{2}\.\d{2}\.\d{4}/
if (res) {
return res[0]
}
}
Extending UnholySheep's answer, you can also do this:
assert [
"some string",
"some other string 11.11.2000",
"another one 20.10.1990"
].findResult { title ->
def matcher = title =~ /\d{2}\.\d{2}\.\d{4}/
matcher.find() ? matcher.group() : null
} == '11.11.2000'
For all matches, just use findResults instead of findResult, like this:
assert [
"some string",
"some other string 11.11.2000",
"another one 20.10.1990"
].findResults { title ->
def matcher = title =~ /\d{2}\.\d{2}\.\d{4}/
matcher.find() ? matcher.group() : null
} == ['11.11.2000', '20.10.1990']
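For comparison, the "first non-null result" behaviour of findResult can be sketched in Python with a generator and `next`. This is only an illustration under the assumption that the same date format is wanted; the names are hypothetical.

```python
import re

titles = ["some string", "some other string 11.11.2000", "another one 20.10.1990"]
date_pattern = re.compile(r"\d{2}\.\d{2}\.\d{4}")

# Lazily search each title and yield the matched text for successful searches.
found = (m.group() for m in map(date_pattern.search, titles) if m)

# Like findResult: stop at the first hit, or give None if nothing matches.
first_date = next(found, None)

# Like findResults: collect the matched text from every title that has one.
all_dates = [m.group() for m in map(date_pattern.search, titles) if m]
```

The generator stops searching as soon as the first date is found, so later titles are not scanned for `first_date`.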

R: Can grep() include more than one pattern?

For instance, in this example, I would like to remove the elements in text that contain http or america.
> text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
Hence, I would use the logical operator, |.
> pattern <- "http|america"
Which works because this is considered to be one pattern.
> grep(pattern, text, invert = TRUE, value = TRUE)
[1] "One word#" "three two one"
What if I have a long list of words that I would like to use in the pattern? How can I do it? I don't think I can keep on using the logical operators a lot of times.
Thank you in advance!
Generally, as @akrun said:
text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
pattern = c("http", "america")
grep(paste(pattern, collapse = "|"), text, invert = TRUE, value = TRUE)
# [1] "One word#" "three two one"
You wrote that your list of words is "long." This solution doesn't scale indefinitely, unsurprisingly:
long_pattern = paste(rep(pattern, 1300), collapse = "|")
nchar(long_pattern)
# [1] 16899
grep(long_pattern, text, invert = TRUE, value = TRUE)
# Error in grep(long_pattern, text, invert = TRUE, value = TRUE) :
But if necessary, you could MapReduce, starting with something along the lines of:
# Map over the individual patterns, not the single collapsed long_pattern string
text[Reduce(`&`, Map(function(p) !grepl(p, text), pattern))]
# [1] "One word#" "three two one"
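The two approaches above (one collapsed alternation versus one test per word) can be sketched in Python as follows. The names are illustrative, and the second option assumes the words are literal substrings rather than regular expressions.

```python
import re

text = ["One word#", "112a httpSentenceamerica", "you and meamerica", "three two one"]
words = ["http", "america"]

# Option 1: build one alternation pattern from the word list.
# re.escape guards against regex metacharacters inside the words.
combined = re.compile("|".join(re.escape(w) for w in words))
kept = [t for t in text if not combined.search(t)]

# Option 2: test each word separately, so no single long pattern is built.
# Assumes literal words, hence plain substring checks.
kept_separately = [t for t in text if not any(w in t for w in words)]

print(kept)
```

Option 2 corresponds to the Map/Reduce idea: an element is kept only if every individual test fails.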

Fuzzy, but not too fuzzy string matching with agrep

I have a string like this:
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
I would like to match a pattern so that it is only matched once and with at most one substitution/insertion. The result should look like this:
> "Car"
I tried the following to match my pattern only once with max. substitution/insertion etc and get the following:
> agrep("ca?", text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
[1] "Car" "Ca-R" "My Car" "I drive cars" "CanCan"
Is there a way to exclude the strings which are n-characters longer than my pattern?
An alternative which replaces agrep with adist:
text[which(adist("ca?", text, ignore.case=TRUE) <= 1)]
adist gives the number of insertions/deletions/substitutions required to convert one string to another, so keeping only elements with an adist less than or equal to one should give you what you want, I think.
This answer is probably less appropriate if you really want to exclude things "n-characters longer" than the pattern (with n being variable), rather than just match whole words (where n is always 1 in your example).
You can use nchar to limit the strings based on their length:
pattern <- "ca?"
matches <- agrep(pattern, text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
n <- 4
matches[nchar(matches) < n+nchar(pattern)]
# [1] "Car" "Ca-R" "My Car" "CanCan"
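To make the edit-distance filter concrete, here is a small self-contained Python sketch of Levenshtein distance, which is the quantity the adist answer relies on. It is an illustration, not R's implementation; as in that answer, "ca?" is compared as a literal three-character string, and the helper name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (free if equal)
            ))
        previous = current
    return previous[-1]

text = ["Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan"]
pattern = "ca?"

# Keep strings whose whole-string distance to the pattern is at most 1.
close = [t for t in text if edit_distance(pattern, t.lower()) <= 1]
print(close)
```

Because the distance is computed against the whole string, longer strings such as "My Car" are excluded without a separate length check.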

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get:
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to explicitly use the values of your vector vec:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...) construct, which matches # at the start of a string as long as it is not followed by whatever is in .... In this case that is two digits (or, equivalently, more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"
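Both regex approaches in this thread, the anchored capture group and the negative lookahead, carry over to other regex engines. A minimal Python sketch, with illustrative names, shows them side by side:

```python
import re

original = ["#1", "#2", "#10", "#11"]

# Anchored match: "#" followed by exactly one digit, 1 or 2, and nothing else.
# The captured digit is written back via the \1 backreference.
anchored = [re.sub(r"^#([12])$", r"\1", s) for s in original]

# Negative lookahead: drop a leading "#" unless two digits follow it.
lookahead = [re.sub(r"^#(?![0-9]{2})", "", s) for s in original]

print(anchored)
print(lookahead)
```

The two patterns agree on this input but are not equivalent in general: the lookahead version would also strip the # from "#3" or "#x", while the anchored version only touches "#1" and "#2".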