R: lookaround within lookaround - regex

I need to match any 'r' that is preceded by two different vowels. For example, 'our' or 'pear' should match but 'bar' or 'aar' shouldn't. I did manage to match the two different vowels, but I still can't make that the condition (...) of a lookbehind for the ensuing 'r'. Neither (?<=...)r nor ...\\Kr yields any results. Any ideas?
x <- c('([aeiou])(?!\\1)(?=(?1))')
y <- c('our','pear','bar','aar')
y[grepl(paste0(x,collapse=''),y,perl=T)]
## [1] "our" "pear"

These two solutions seem to work:
the why not way:
x <- '(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r'
y[grepl(x, y, perl=T)]
the \K way:
x <- '([aeiou])(?!\\1)[aeiou]\\Kr'
y[grepl(x, y, perl=T)]
The why not way variant (may be more efficient because it searches for the "r" first):
x <- 'r(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
or, to quickly exclude an "r" not preceded by two vowels (without testing the whole alternation):
x <- 'r(?<=[aeiou][aeiou]r)(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
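The alternation in the "why not way" can also be generated rather than typed by hand. A minimal sketch, assuming base R only; the names vowels, alts and pattern are illustrative and not part of the original answer:

```r
vowels <- c("a", "e", "i", "o", "u")

# For each vowel, build "<vowel>[<the other four vowels>]", e.g. "a[eiou]".
alts <- vapply(
  vowels,
  function(v) paste0(v, "[", paste(setdiff(vowels, v), collapse = ""), "]"),
  character(1)
)

# Wrap the alternatives in a lookbehind so that only the "r" itself is matched.
pattern <- paste0("(?<=", paste(alts, collapse = "|"), ")r")

y <- c("our", "pear", "bar", "aar")
result <- y[grepl(pattern, y, perl = TRUE)]
print(result)
```

Every alternative has the same fixed length (two characters), which is what allows it to sit inside a PCRE lookbehind.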

As HamZa points out in the comments, using the skip and fail verbs is one way to do what we want. Basically, we tell the engine to ignore cases where two identical vowels are followed by "r".
# The following is only the beginning of the regex, not runnable R code
# the ([aeiou]) captures the first vowel, the \\1 references what we captured
# so this gives us the same vowel two times in a row
# which we then follow with an "r"
# Then we tell it to skip/fail for this
([aeiou])\\1r(*SKIP)(*FAIL)
Having told it to skip those cases, we add the alternative "or two vowels followed by an 'r'". Since the cases where the two vowels are identical have already been eliminated, this gets us what we want.
|[aeiou]{2}r
Putting it together we end up with
y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
grep("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", y, perl = TRUE, value = TRUE)
#[1] "our" "pear" "ssseiras"
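To see which substring the skip/fail pattern actually matches (rather than which elements contain a match), one option is regexpr() with regmatches(). This is a small sketch, not part of the original answer; the names pat, m and matched are illustrative:

```r
y <- c("our", "pear", "bar", "aar", "aa", "ae", "are", "aeer", "ssseiras")
pat <- "([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r"

# regexpr() finds the first match in each element;
# regmatches() returns the matched text and drops elements without a match.
m <- regexpr(pat, y, perl = TRUE)
matched <- regmatches(y, m)
print(matched)
```

Note that this pattern matches the two vowels together with the "r", unlike the lookbehind and \K variants, which match the "r" alone.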

Here is a less than elegant solution:
y[grepl("[aeiou]{2}r", y, perl=T) & !grepl("(.)\\1r", y, perl=T)]
This probably has some corner-case failures where the first pattern matches at a different location than the second (will have to think about that), but it is something to get you started.
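One such corner case can be sketched as follows. The test string "aar our" is made up for illustration: it contains a rejected "aar" and a valid "our", so the two-step test discards it even though a qualifying "r" is present, while the skip/fail pattern from the other answer still finds it:

```r
z <- "aar our"  # hypothetical input: a doubled vowel + r, then a valid "our"

# Two-step test: any two vowels + r, and no doubled character + r anywhere.
two_step <- grepl("[aeiou]{2}r", z, perl = TRUE) &
  !grepl("(.)\\1r", z, perl = TRUE)

# Skip/fail pattern: skips over "aar" and keeps searching the rest of the string.
skip_fail <- grepl("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", z, perl = TRUE)

print(c(two_step = two_step, skip_fail = skip_fail))
```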

Another one, using a negative lookahead assertion:
> y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
> grep("(?!(?:aa|ee|ii|oo|uu)r)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
> grep("(?!aa|ee|ii|oo|uu)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
(?!aa|ee|ii|oo|uu) asserts that the first two characters of the match are not aa, ee, ii, oo or uu. With that condition in place, [aeiou][aeiou] matches any two vowels except the same vowel repeated, which is why the lookahead comes first. The final r matches the r that follows the vowels.

Related

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...),... and so) would do it. Cannot get it to work though!
Sorry if this is similar to other SO questions (e.g. here or here).
This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and any characters afterwards .*
From those you want to keep the second "set" \\2.
Or another, maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters ending in class-, and the string _sample followed by any number of characters, and replace both with the empty string "".
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that are between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not a _ (based on the example), preceded by class- and succeeded by _sample.
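The same lookaround pattern should also work without stringr. A minimal base R sketch, assuming one match per string is enough (regexpr() only returns the first match); the names m and out are illustrative:

```r
x <- c("kjsdf_class-X1(z)20_sample-318TT1X.3",
       "kjjwer_class-Z3(z)29_sample-318TT2X.4")

# perl = TRUE is needed for the lookbehind and lookahead.
m <- regexpr("(?<=class-)[^_]+(?=_sample)", x, perl = TRUE)
out <- regmatches(x, m)
print(out)
```

For several matches per string, gregexpr() could be used in place of regexpr(), in which case regmatches() returns a list.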
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches that immediately follow a specific string, e.g.:
only match the regex when it is preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but selecting only one of the 6 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
There should be something more intuitive, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
Here's another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"
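A base R variant of the same idea, sketched with \K instead of a lookbehind: \K discards everything matched so far, so the literal prefix is required but not returned. The string s below is a shortened, already cleaned prefix of the question's input, used only for illustration:

```r
# Shortened sample (assumption): the start of the input after removing "-1-".
s <- "FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA"

# Literal prefix, optional hyphen, then \K resets the start of the reported match.
pat <- "FA-I2-I2-I2-EX-?\\K(?:I\\d-?)*I3(?:-?I\\d)*"

hit <- regmatches(s, regexpr(pat, s, perl = TRUE))
print(hit)
```

Unlike a lookbehind, \K places no restriction on the length of the prefix, which may help if the literal part is built dynamically.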

removing look ahead in regex

I am trying to find a substring that repeats at least three times consecutively within a string, and to remove it, using gregexpr. However, in my attempt to find such substrings I need to remove the lookahead. For example, in the string kajaaaaaaaaaaaa, aaaa is output along with aaa, aa and a. Since the last three are contained in aaaa, how can I get rid of them? I have tried a lot but have been unable to do it. I want to capture a substring that repeats itself consecutively at least three times in a string.
s <- 'kajaaaaaaaaaaaa'
m <- gregexpr('(?=(.{2,})\\1{2,})', s, perl=TRUE)
unique(mapply(function(x, y) substr(s, x, x+y-1),
attr(m[[1]], 'capture.start'),
attr(m[[1]], 'capture.length')))
If I understand your regex correctly:
m <- gregexpr('(.)(?=(\\1{3}))', s, perl=TRUE)
This matches any character that is repeated three more times after the original.
The result has two capture groups, one for "a" and one for "aaa". Use the latter; the first group is still needed so that the backreference can find the repeats.
I need to remove lookahead.
Just omit it, lookahead is not needed here:
> gregexpr('(..+)\\1{2,}', s, perl=TRUE) -> m
> mapply(function(x, y) substr(s, x, x+y-1), attr(m[[1]], 'capture.start')
+ , attr(m[[1]], 'capture.length'))
[1] "aaaa"
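The lookahead-free pattern can be wrapped in a small helper. This is only a sketch: repeated_units is a hypothetical name, and the first test string is built with strrep() as "kaj" followed by twelve "a" characters to mirror the question's example:

```r
# Return the distinct units (2+ characters) that occur at least
# three times in a row in a single string s.
repeated_units <- function(s) {
  m <- gregexpr("(..+)\\1{2,}", s, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))  # no match at all
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  unique(substring(s, starts, starts + lens - 1))
}

s <- paste0("kaj", strrep("a", 12))
print(repeated_units(s))                # "aaaa"
print(repeated_units("xyzabcabcabcq"))  # "abc"
```

Because the match consumes the whole repeated run, the search resumes after it, so the shorter overlapping candidates that the lookahead version reported are no longer returned.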

Convert character to lowerCamelCase in R

I have character vector which looks like this:
x <- c("cult", "brother sister relationship", "word title")
And I want to convert it to the lowerCamelCase style looking like this:
c("cult", "brotherSisterRelationship", "wordTitle")
I played around with gsub, gregexpr, strsplit, regmatches and many other functions, but couldn't get a grip on it.
Strings containing two spaces seem to be especially difficult to handle.
Maybe someone here has an idea how to do this.
> x <- c("cult", "brother sister relationship", "word title")
> gsub(" ([^ ])", "\\U\\1", x, perl=TRUE)
[1] "cult" "brotherSisterRelationship"
[3] "wordTitle"
Quoting from pattern matching and replacement:
For perl = TRUE only, it can also contain "\U" or "\L" to convert the
rest of the replacement to upper or lower case and "\E" to end case
conversion.
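As a small illustration of these case-conversion escapes, here is a sketch that wraps the gsub() call in a helper and also tolerates runs of whitespace. The function name to_lower_camel and the extra-space input are assumptions made for the example; the second call follows the \U ... \L idiom for title-casing:

```r
# Remove each run of whitespace and upper-case the character that follows it.
to_lower_camel <- function(x) {
  gsub("\\s+(\\S)", "\\U\\1", x, perl = TRUE)
}

camel <- to_lower_camel(c("cult", "brother  sister relationship", "word title"))
print(camel)

# \U upper-cases the first letter, \L lower-cases the rest of each word.
title <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "word TITLE", perl = TRUE)
print(title)
```

Leading or trailing whitespace is not handled here; trimws() could be applied first if that matters.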
A non-base alternative:
library(R.utils)
toCamelCase(x, capitalize = FALSE)
# [1] "cult" "brotherSisterRelationship" "wordTitle"

Separate a sentence into words and endmarks

I want to break a sentence apart into words and end marks (assume all other punctuation has been removed). I've written a working function to break string(s) apart as described but I think the part:
unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = T), substring(x, nchar(x), nchar(x))))
is a cob job. It could probably be done better without substring, by splitting on spaces and just before the end mark with an or (|) statement of sorts, but I don't know how I'd achieve this. Any direction with this would be appreciated.
breaker <- function(string) {
FUN <- function(x) {
unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = T), substring(x,
nchar(x), nchar(x))))
}
lapply(string, FUN)
}
#EXAMPLES
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
Here is a regex pattern that'll do the whole job on its own. It will match (and thus allow strsplit() to split the string) either at a space or right before one of the sentence-ending punctuation marks.
pat <- "[[:space:]]|(?=[.!?])"
The first half of the pattern matches space characters, and any match will cause strsplit() to 'eat up' the matched characters when it splits the string. The second half of the pattern (the part inside of the (?=...)) matches sentence-ending punctuation. It is an example of a 'zero-width positive lookahead assertion' (see ?regexp for details), and as such, will not lead strsplit() to 'eat up' the matching punctuation.
For your example vectors, you don't even need the call to lapply():
breaker <- function(X) {
strsplit(X, "[[:space:]]|(?=[.!?])", perl=TRUE)
}
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
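A different technique, shown here only for comparison, is to match the tokens instead of splitting between them. This sketch uses gregexpr() with regmatches(); the name tokenize is hypothetical, and the character class is borrowed from the question's own code:

```r
# A token is either a run of letters/digits/quotes or a single end mark.
tokenize <- function(x) {
  regmatches(x, gregexpr("[[:alnum:]'\"]+|[.!?]", x))
}

tokens <- tokenize(c("I'm liking it!", "How much do you like it?"))
print(tokens)
```

This sidesteps the question of how strsplit() treats zero-width matches, at the cost of having to spell out what counts as a word.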
You can also use scan_tokenizer() and MC_tokenizer() from the tm package (note that MC_tokenizer() drops the end mark, as the output below shows):
> library(tm)
> ?MC_tokenizer
> MC_tokenizer("what are the number of words in this sentence?")
[1] "what" "are" "the" "number" "of" "words" "in"
[8] "this" "sentence"