Lookahead Behaviour - regex

How can you make the lookahead non-greedy? I would like the first case not to match anything (like the second case), but it returns "winnie". I guess because it is greedily matching after the "the"?
str <- "winnie the pooh bear"
## Unexpected
regmatches(str, gregexpr("winnie|bear(?= bear|pooh)", str, perl=T))
# [1] "winnie"
## Expected
regmatches(str, gregexpr("winnie(?= bear|pooh)", str, perl=T))
# character(0)

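The lookahead applies only to bear in winnie|bear(?= bear|pooh), not to winnie, because alternation has the lowest precedence: the pattern reads as "winnie" or "bear(?= bear|pooh)". If you want it to apply to both, group the alternation:
(?:winnie|bear)(?= bear|pooh)
Now the lookahead applies to both alternatives.
In the first case winnie matched through the left alternative, so the bear alternative and its lookahead never came into play.
In the second case the lookahead is applied to winnie, so the match fails.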
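A quick way to see the difference is to run both patterns side by side. This is only a sketch; the helper name extract_all is made up for the illustration and is not part of any package.

```r
str <- "winnie the pooh bear"

# Hypothetical helper: return every match of `pattern` in `str`
extract_all <- function(pattern) {
  regmatches(str, gregexpr(pattern, str, perl = TRUE))[[1]]
}

# Alternation binds loosest, so the lookahead belongs to "bear" only
ungrouped <- extract_all("winnie|bear(?= bear|pooh)")

# Grouping makes the lookahead apply to both alternatives
grouped <- extract_all("(?:winnie|bear)(?= bear|pooh)")

print(ungrouped)  # "winnie"
print(grouped)    # character(0)
```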

Related

R: (*SKIP)(*FAIL) for multiple patterns

Given test <- c('met','meet','eel','elm'), I need a single line of code that matches any 'e' that is not in 'me' or 'ee'. I wrote (ee|me)(*SKIP)(*F)|e, which does exclude 'met' and 'eel', but not 'meet'. Is this because | is exclusive or? At any rate, is there a solution that just returns 'elm'?
For the record, I know I can also do (?<![me])e(?!e), but I would like to know what the solution is for (*SKIP)(*F) and why my line is wrong.
This is the correct solution with (*SKIP)(*F):
(?:me+|ee+)(*SKIP)(*FAIL)|e
Demo on regex101, using the following test cases:
met
meet
eel
elm
degree
zookeeper
meee
Only e in elm, first e in degree and last e in zookeeper are matched.
Since any e in ee is forbidden, every e directly after m is forbidden, and every e in a run of consecutive e is forbidden. This explains the sub-pattern (?:me+|ee+).
While I am aware that this method is not extensible, it is at least logically correct.
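To check the claim about the test cases in R rather than on regex101, something like the following should do. It is a sketch that assumes perl=TRUE is available (needed for the backtracking verbs); the variable names are mine.

```r
words <- c("met", "meet", "eel", "elm", "degree", "zookeeper", "meee")
pattern <- "(?:me+|ee+)(*SKIP)(*FAIL)|e"

# 1-based start positions of every matched "e" (empty vector if none)
match_positions <- lapply(gregexpr(pattern, words, perl = TRUE), function(m) {
  pos <- as.integer(m)
  pos[pos > 0]  # gregexpr reports -1 when nothing matched
})
names(match_positions) <- words

print(match_positions[c("elm", "degree", "zookeeper")])
# elm: 1, degree: 2 (first e), zookeeper: 8 (last e)
```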
Analysis of other solutions
Solution 0
(ee|me)(*SKIP)(*F)|e
Let's use meet as an example:
meet    # (ee|me)(*SKIP)(*F)|e
^       # ^
meet    # (ee|me)(*SKIP)(*F)|e
  ^     #        ^
meet    # (ee|me)(*SKIP)(*F)|e
  ^     #               ^
        # Forbid backtracking to pattern to the left
        # Set index of bump along advance to current position
meet    # (ee|me)(*SKIP)(*F)|e
  ^     #                   ^
        # Pattern failed. No choice left. Bump along.
        # Note that backtracking to before (*SKIP) is forbidden,
        # so e in second branch is not tried
meet    # (ee|me)(*SKIP)(*F)|e
  ^     # ^
        # Can't match ee or me. Try the other branch
meet    # (ee|me)(*SKIP)(*F)|e
  ^     #                    ^
        # Found a match `e`
The problem is due to the fact that me consumes the first e, so ee fails to match, leaving the second e available for matching.
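The same effect can be reproduced in R by marking each matched e with an underscore. A minimal sketch (the names flawed and fixed are just for this example):

```r
# Solution 0: "me" consumes the first e, so the second e is matched later
flawed <- gsub("(ee|me)(*SKIP)(*F)|e", "_", "meet", perl = TRUE)

# With quantifiers, the whole run "mee" is consumed and skipped at once
fixed <- gsub("(?:me+|ee+)(*SKIP)(*FAIL)|e", "_", "meet", perl = TRUE)

print(flawed)  # "me_t"
print(fixed)   # "meet"
```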
Solution 1
\w*(ee|me)\w*(*SKIP)(*FAIL)|e
This just skips every word containing ee or me, which means it fails to match anything in degree and zookeeper.
Solution 2
(?:ee|mee?)(*SKIP)(?!)|e
Similar problem as solution 0. When there are 3 e in a row, the first 2 e are matched by mee?, leaving the third e available for matching.
Solution 3
(?:^.*[me]e)(*SKIP)(*FAIL)|e
This throws away the input up to the last me or ee, which means that any valid e before the last me or ee will not be matched, like first e in degree.
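For anyone who wants to compare the candidates directly, here is a small sketch that marks every matched e with an underscore. The helper mark and the results vector are names I made up for the comparison.

```r
correct   <- "(?:me+|ee+)(*SKIP)(*FAIL)|e"
solution1 <- "\\w*(ee|me)\\w*(*SKIP)(*FAIL)|e"
solution2 <- "(?:ee|mee?)(*SKIP)(?!)|e"
solution3 <- "(?:^.*[me]e)(*SKIP)(*FAIL)|e"

# Hypothetical helper: replace each matched e with "_"
mark <- function(pattern, word) gsub(pattern, "_", word, perl = TRUE)

results <- c(
  correct_degree   = mark(correct, "degree"),    # first e is matched
  solution1_degree = mark(solution1, "degree"),  # whole word skipped
  solution3_degree = mark(solution3, "degree"),  # everything up to "ee" skipped
  correct_meee     = mark(correct, "meee"),      # nothing matched
  solution2_meee   = mark(solution2, "meee")     # third e slips through
)
print(results)
```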
You need a preceding/following boundary forcing the regex engine to not retry the substring.
gsub('\\w*[em]e\\w*(*SKIP)(?!)|e', '', test, perl=T)
Or, as @CasimiretHippolyte pointed out, with an optional extra "e" after "me":
gsub('(?:ee|mee?)(*SKIP)(?!)|e', '', test, perl=T)
Updated per the comments (use a quantifier to cover other cases):
gsub('[em]e+(*SKIP)(?!)|e', '', test, perl=T)
Note: I decided to use (?!) instead of (*F) which is also used to force a regex to fail.
(?!) # equivalent to ( (*FAIL) or (*F) - both synonyms for (?!) ),
# causes matching failure, forcing backtracking to occur
Overall, the syntax can be written as (*SKIP)(*FAIL), (*SKIP)(*F) or (*SKIP)(?!)
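As a sanity check that the three spellings behave the same, one could run them against the question's data. This is a sketch; the variable names are mine.

```r
test <- c("met", "meet", "eel", "elm")

# Same pattern, three ways of forcing the failure after (*SKIP)
with_fail  <- gsub("[em]e+(*SKIP)(*FAIL)|e", "", test, perl = TRUE)
with_f     <- gsub("[em]e+(*SKIP)(*F)|e",    "", test, perl = TRUE)
with_empty <- gsub("[em]e+(*SKIP)(?!)|e",    "", test, perl = TRUE)

print(with_fail)  # "met" "meet" "eel" "lm"
print(identical(with_fail, with_f) && identical(with_f, with_empty))  # TRUE
```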
You can add \w* to your first pattern to tell the engine that ee or me can appear at the beginning, middle, or end of a word.
You can use a regex like this:
\w*(ee|me)\w*(*SKIP)(*FAIL)|e
R regex would be,
> test <- c('met','meet','eel','elm')
> gsub("\\w*(?:ee|me)\\w*(*SKIP)(*FAIL)|e", "fi", perl=TRUE, test)
[1] "met" "meet" "eel" "film"
OR
> gsub('(?:^.*[me]e)(*SKIP)(*FAIL)|e', 'fi', test, perl=T)
[1] "met" "meet" "eel" "film"

R: lookaround within lookaround

I need to match any 'r' that is preceded by two different vowels. For example, 'our' or 'pear' would be matching but 'bar' or 'aar' wouldn't. I did manage to match for the two different vowels, but I still can't make that the condition (...) of lookbehind for the ensuing 'r'. Neither (?<=...)r nor ...\\Kr yields any results. Any ideas?
x <- c('([aeiou])(?!\\1)(?=(?1))')
y <- c('our','pear','bar','aar')
y[grepl(paste0(x,collapse=''),y,perl=T)]
## [1] "our" "pear"
These two solutions seem to work:
the why not way:
x <- '(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r'
y[grepl(x, y, perl=T)]
the \K way:
x <- '([aeiou])(?!\\1)[aeiou]\\Kr'
y[grepl(x, y, perl=T)]
The why not way variant (may be more efficient because it searches the "r" before):
x <- 'r(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
or, to quickly exclude an "r" not preceded by two vowels (without testing the whole alternation):
x <- 'r(?<=[aeiou][aeiou]r)(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
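To confirm that all four variants select the same words, they can be run in a loop. A sketch only: the labels in patterns and the alternation helper string are my own naming.

```r
y <- c("our", "pear", "bar", "aar")

# The ten "two different vowels + r" alternatives, written once
alternation <- "a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r"

patterns <- c(
  lookbehind = "(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r",
  keep_out   = "([aeiou])(?!\\1)[aeiou]\\Kr",
  r_first    = paste0("r(?<=", alternation, ")"),
  prefilter  = paste0("r(?<=[aeiou][aeiou]r)(?<=", alternation, ")")
)

# Words selected by each variant
selected <- lapply(patterns, function(p) y[grepl(p, y, perl = TRUE)])
print(selected)  # every element should be "our" "pear"
```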
As HamZa points out in the comments, using the skip and fail verbs is one way to do what we want. Basically we tell the engine to ignore cases where two identical vowels are followed by "r".
# The following is the beginning of the regex and isn't just R code
# the ([aeiou]) captures the first vowel, the \\1 references what we captured
# so this gives us the same vowel two times in a row
# which we then follow with an "r"
# Then we tell it to skip/fail for this
([aeiou])\\1r(*SKIP)(*FAIL)
Now we told it to skip those cases so now we tell it "or cases where we have two vowels followed by an 'r'" and since we already eliminated the cases where those two vowels are the same this will get us what we want.
|[aeiou]{2}r
Putting it together we end up with
y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
grep("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", y, perl = TRUE, value = TRUE)
#[1] "our" "pear" "ssseiras"
Here is a less than elegant solution:
y[grepl("[aeiou]{2}r", y, perl=T) & !grepl("(.)\\1r", y, perl=T)]
Probably has some corner case failures where the first set matches at different location than the second set (will have to think about that), but something to get you started.
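One such corner case, as far as I can tell, is a string that contains both a valid match and a doubled character followed by "r" somewhere else. The sketch below uses "sour beer" as a made-up test string and compares the two-step filter with the skip/fail pattern from the other answer.

```r
y <- c("our", "aar", "sour beer")

# Two independent tests: they do not have to match at the same location
two_step <- grepl("[aeiou]{2}r", y, perl = TRUE) &
  !grepl("(.)\\1r", y, perl = TRUE)

# Single pattern: only the offending "aar"-style substrings are skipped
skip_fail <- grepl("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", y, perl = TRUE)

print(two_step)   # TRUE FALSE FALSE  ("eer" vetoes the valid "our")
print(skip_fail)  # TRUE FALSE TRUE
```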
Another one through negative lookahead assertion.
> y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
> grep("(?!(?:aa|ee|ii|oo|uu)r)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
> grep("(?!aa|ee|ii|oo|uu)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
(?!aa|ee|ii|oo|uu) asserts that the first two characters of the match are not aa, ee, ii, oo or uu. So the [aeiou][aeiou] that follows matches any two vowels, as long as they are not the same vowel repeated. That's why the condition comes first. The final r matches the r that follows the vowels.

R regex to remove all except letters, apostrophes and specified multi-character strings

Is there an R regex to remove all except letters, apostrophes and specified multi-character strings? The "specified multi-character strings" are arbitrary and of arbitrary length. Let's say "~~" and "&&" in this case (so a single "~" or "&" should be removed, but not "~~" or "&&").
Here I have:
gsub("[^ a-zA-Z']", "", "I like~~cake~too&&much&now.")
Which gives:
## [1] "I likecaketoomuchnow"
And...
gsub("[^ a-zA-Z'~&]", "", "I like~~cake~too&&much&now.")
gives...
## [1] "I like~~cake~too&&much&now"
How can I write an R regex to give:
"I like~~caketoo&&muchnow"
EDIT Corner cases from Casimir and BrodieG...
I'd expect this behavior:
x <- c("I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a")
## [1] "I like~~caketoo&&muchnow" "a~~b"
## [3] "a~~~~b" "a~~~~b"
## [5] "aa"
Neither of the current approaches gives this.
One way, match/capture the "specified multi-character strings" while replacing the others.
gsub("(~~|&&)|[^a-zA-Z' ]", "\\1", x)
# [1] "I like~~caketoo&&muchnow" "a~~b"
# [3] "a~~~~b" "a~~~~b"
# [5] "aa"
(?<![&~])[^ a-zA-Z'](?![&~])
Try this with the perl=TRUE option. See the demo:
https://regex101.com/r/wU7sQ0/25
You can use this pattern:
gsub("[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z)", "", x, perl=TRUE)
The idea is to build an always true pattern that is the translation of this sentence:
substrings I want to keep are always followed by a character I want to remove or the end of the string
So, all you need to do is to describe the substring you want to keep:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*
Note that, since this subpattern is optional (it matches the empty string) and greedy, the whole pattern never fails at any position in the string, so all matches are contiguous from the beginning to the end (no need to add a \G anchor).
For the same reason, there is no need for possessive quantifiers or atomic groups to prevent catastrophic backtracking, because (?:[^A-Za-z ']|\\z) can't fail.
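Here is a sketch that runs the \K pattern against the corner cases from the question, next to the capture-and-restore approach. The pieces keep_part and drop_part are just my names for the two halves of the pattern, and expected is the output the question asks for.

```r
x <- c("I like~~cake~too&&much&now.", "a~~~b", "a~~~~b", "a~~~~~b", "a~&a")
expected <- c("I like~~caketoo&&muchnow", "a~~b", "a~~~~b", "a~~~~b", "aa")

# Substring to keep, then \K, then the single character to drop (or the end)
keep_part <- "[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*"
drop_part <- "(?:[^A-Za-z ']|\\z)"
cleaned_k <- gsub(paste0(keep_part, "\\K", drop_part), "", x, perl = TRUE)

# Capture approach: put back ~~ or && when that is what matched
cleaned_capture <- gsub("(~~|&&)|[^a-zA-Z' ]", "\\1", x)

print(cleaned_k)
print(identical(cleaned_k, cleaned_capture))
```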
This pattern processes a string in few steps, but you can improve it further.
You can avoid the last match (which is useless, since it only covers characters you want to keep, or the empty string, before the end) with the backtracking control verb (*COMMIT).
It forces the regex engine to stop the search once the end of the string is reached:
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z ']|\\z(*COMMIT).)
You can also make the pattern match several special characters in one match
(except when they are ~ or &):
[A-Za-z ']*(?:(?:~~|&&)[A-Za-z ']*)*\\K(?:[^A-Za-z '][^A-Za-z '~&]*|\\z(*COMMIT).)

Creating a regular expression in R statistics

I am trying to create a regular expression in "R" to capture two groups of characters for me and I seem not to be able to figure out why it does not work.
Here is what I am trying to achieve ...
From this string:
"air.BattleofZombies 0.0008 0.0006 -0.0027"
I would like to return:
"air.BattleofZombies=0.0008 0.0006 -0.0027"
Instead, here is what I get:
"air.BattleofZombie= 0.0008 0.0006 -0.0027="
My regular expression query is:
gsub("([^\\s]*)[\\s]*([-?\\d*\\.?\\d*\\s*]*)","\\1=\\2", "air.BattleofZombies 0.0008 0.0006 -0.0027")
Any help is welcome.
I find character classes easier to use. (I think @Simon is wrong about what "\s" will match.)
> tst <- "air.BattleofZombies   0.0008 0.0006 -0.0027"
> sub("[ ]{2,}", "=", tst)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
See the ?regex page and notice this sentence: "Symbols \d, \s, \D and \S denote the digit and space classes and their negations." Nonetheless, I have found that a literal space, " ", often works even without the character-class mechanism. (I'm unable to comment on a deleted post, but I see now that this is the same answer posted earlier by @KaraWoo, and the only reason it didn't deliver the desired result was that gsub was used.)
Another short solution:
vec <- "air.BattleofZombies 0.0008 0.0006 -0.0027"
sub("\\s+", "=", vec)
# [1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
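The reason sub works here while gsub does not is that sub replaces only the first match. A small sketch of the difference (the variable names are mine):

```r
vec <- "air.BattleofZombies 0.0008 0.0006 -0.0027"

# sub() stops after the first run of whitespace
first_only <- sub("\\s+", "=", vec)

# gsub() replaces every run of whitespace
every_run <- gsub("\\s+", "=", vec)

print(first_only)  # "air.BattleofZombies=0.0008 0.0006 -0.0027"
print(every_run)   # "air.BattleofZombies=0.0008=0.0006=-0.0027"
```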
Just change the leading ([^\\s]*) to ([^\\s]+), because the regex you used can also match empty strings, and remove all the *'s inside the character class, because * loses its special meaning inside a character class and matches only a literal *. So turn [\\d*\\s*\\.] into [\\d\\s.]
> x <- "air.BattleofZombies 0.0008 0.0006 -0.0027"
> gsub("([^\\s]+)\\s*([-\\d.\\d\\s]*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
OR
> gsub("(\\S+)\\s*((-?\\d+(?:\\.\\d+)?)(?:\\s+(?3))*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
(?3) recurses into the pattern inside the third capturing group. An easier-to-read form of this regex is given below.
OR
> gsub("(\\S+)\\s+(-?\\d+(?:\\.\\d+)?(?:\\s+-?\\d+(?:\\.\\d+)?)*)", "\\1=\\2", x, perl=T)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"
There are a couple of problems to solve, I think. First, \\s in a character class (i.e. inside []) matches an s rather than a space unless one uses perl=T (so I've replaced it with just a space). Second, gsub() replaces multiple times so I've replaced it with sub(). Also, the character class in the second set of parentheses would be better as parentheses instead. The following regexp solves the problem:
sub("([^ ]*) +((-?\\d*\\.?\\d* *)*)","\\1=\\2", "air.BattleofZombies 0.0008 0.0006 -0.0027",1)
[1] "air.BattleofZombies=0.0008 0.0006 -0.0027"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
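To show why the pattern in the question grabs the whole string, here is a sketch comparing the greedy group with a lazy one. Both are anchored at the end so that sub returns just the captured part; the names greedy and lazy are mine.

```r
x <- c("abc", "ab")

# Greedy: [a-z]+ takes everything, the empty alternative then succeeds
greedy <- sub("^([a-z]+)(?:c|)$", "\\1", x, perl = TRUE)

# Lazy: [a-z]+? takes as little as possible, so the trailing c is left over
lazy <- sub("^([a-z]+?)c?$", "\\1", x, perl = TRUE)

print(greedy)  # "abc" "ab"
print(lazy)    # "ab"  "ab"
```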
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols (first capture group)
# [1] "ZN" "CL"
Use a non-greedy match for the first part of the regex, followed by the definition of your optional allowed suffixes, anchored at the end of the string.
The regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches:
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured group of the match (however that may be referenced in R) :-)
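In R the first group can be pulled out with sub and a backreference. A minimal sketch, assuming perl=TRUE for the lazy quantifier and with a leading ^ added so the whole symbol is replaced; the sample symbols are taken from the question and the other answers.

```r
symbols <- c("ZNH3", "ZN", "CLZ4", "ZZZ33")

# Lazy root, optional expiration code, end of string
pattern <- "^(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$"

# Replace the whole symbol by its first capture group (the root)
roots <- sub(pattern, "\\1", symbols, perl = TRUE)

print(roots)  # "ZN" "ZN" "CL" "ZZ"
```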