What's a prefix regular expression? - regex

I'm reading something that mentions prefix regular expressions and cites /^joey/ as an example.
What's a prefix regular expression? Does that mean it starts with a caret?

In a regex, ^ at the start of the pattern means "starts with".
/^joey/
would therefore match any string that starts with "joey", such as "joeyjoey" or "joey and jane".

A prefixed regular expression (PRE) is defined recursively:
The empty set ∅ and the empty string ε are PREs.
For each symbol a in the alphabet Σ, "a" is a PRE.
If p and q are PREs denoting the regular sets P and Q, respectively, r is a regular expression denoting the regular set R such that ε belongs to R, and x belongs to Σ, then the following expressions are also PREs:
p + q (union)
xp (concatenation with a symbol x on the left)
pr (concatenation with an ε-regular expression on the right)
p* (star)
This definition is taken from "Fast Text Searching for Regular Expressions or Automaton Searching on Tries" by Ricardo A. Baeza-Yates and Gaston H. Gonnet.
In other words, a PRE is a regular expression whose language L contains only strings with some fixed prefix.
abc.* is a PRE
(A|B)cd is not a PRE

The caret means that you match the start of a string. For example, /^joey/ will match "joey is there", since the string starts with "joey", but not "Is joey around?", since joey is in the middle of the sentence.

It's not a standard term. Whoever wrote that obviously means a regex that matches only at the beginning of the target text, as the other responders have said. The caret is usually used for that purpose, but it can also mean the beginning of a logical line, if the match is being performed in multiline mode. Many regex flavors support an additional construct that matches the very beginning of the text regardless of the matching mode, \A being its usual form.
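As a minimal sketch of the difference between ^ and \A, here is how the two anchors behave in Python's re module (Python is an assumption here; the original text does not name a language or flavor, and the sample strings are made up for illustration):

```python
import re

# The caret anchors the match at the start of the string.
print(bool(re.search(r"^joey", "joey and jane")))    # True
print(bool(re.search(r"^joey", "Is joey around?")))  # False

# In multiline mode the caret also matches right after a newline...
two_lines = "hello\njoey is there"
print(bool(re.search(r"^joey", two_lines, re.MULTILINE)))   # True
# ...whereas \A still matches only at the very beginning of the text.
print(bool(re.search(r"\Ajoey", two_lines, re.MULTILINE)))  # False
```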

Related

Regex | Containing "bbb"

I'm trying to create a Regex with chars 'a' and 'b'.
The only rule is that the regex must contain the word 'bbb' somewhere.
These are possible: aabbbaaaaaababa, abbba, bbb, aabbbaa, abbabbba, ...
These are not possible: abba, a, abb, bba, abbaaaabbaaaabba, ...
I have no idea how I can express that.
Any ideas? Thanks in Advance!
Based on the tag "automata", I am guessing you are after the formal regular expression for this formal language. In that case, a regular expression is (a+b)*bbb(a+b)*. The anatomy of this regular expression is the following:
(a+b) gives either "a" or "b"
(a+b)* gives any string of "a"s and "b"s whatever
bbb gives the string bbb only
the whole regular expression describes any string that begins with anything, then has bbb, then ends with anything
To prove this regular expression is correct, note that:
This regular expression only generates strings that contain the substring bbb. This is due to the middle part.
This regular expression generates all strings that contain the substring bbb. Suppose there were some string containing the substring bbb that this regular expression didn't generate. The string either starts with bbb or it doesn't. If it does, then the string is generated by our regular expression by repeating the first (a+b) zero times and the second (a+b) n - 3 times, where n is the length of the string. Otherwise, if it doesn't start with bbb, consider the suffix of length n - 1 as a recursive case. Continue thusly until the subcase does begin with bbb (it eventually must). Because this suffix is describable by our regular expression, the original case must be too since we can just repeat the first (a+b) an additional number of times equal to the depth of recursion.
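The proof above can also be spot-checked mechanically. The sketch below is not part of the original answer: it translates the formal expression (a+b)*bbb(a+b)* into Python syntax (where the formal + becomes a character class), tries the examples from the question, and then compares the regex against a plain substring test for every string of a's and b's up to length 10. The helper name `generated` is made up for illustration.

```python
import re
from itertools import product

# Formal (a+b)*bbb(a+b)* in Python syntax: the formal + becomes a character class.
CONTAINS_BBB = re.compile(r"[ab]*bbb[ab]*")

def generated(s):
    """True if the whole string s is generated by the regular expression."""
    return CONTAINS_BBB.fullmatch(s) is not None

accepted = ["aabbbaaaaaababa", "abbba", "bbb", "aabbbaa", "abbabbba"]
rejected = ["abba", "a", "abb", "bba", "abbaaaabbaaaabba"]
examples_ok = all(generated(s) for s in accepted) and not any(generated(s) for s in rejected)

# Exhaustive check over every string of a's and b's up to length 10.
exhaustive_ok = all(
    generated("".join(chars)) == ("bbb" in "".join(chars))
    for n in range(11)
    for chars in product("ab", repeat=n)
)
print(examples_ok, exhaustive_ok)
```

An exhaustive check over short strings is not a proof, but it catches the usual slips, such as a missing star.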
The pattern is fairly simple:
/b{3}/g
If you need it to match a run of exactly three 'b's and no more, you can use lookarounds (in a flavor that supports lookbehind):
/(?<!b)b{3}(?!b)/g
You can use this expression:
(a+b)*(bbb)(a+b)*
The (bbb) in the middle guarantees that every generated string contains at least three consecutive b's, and taking the closure of (a+b) on either side lets you generate any string that contains bbb.

Nongreedy regex with alternation and repetition [duplicate]

I am trying to match the contents between AB and BA using extended regex, for instance using awk.
Consider the two example strings AB12BABA and AB123BABA, I tried the following regex
AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA
But it matches the whole string (greedy) for both examples.
Can anyone explain how the regex engine works for this case, and how I should change my regex so that it would work.
The BRE and ERE engines match according to the Leftmost Longest Rule, which differs from how Perl and other backtracking (NFA-based) regex engines match a regex.
The documentation from Boost library is more detailed in regards to the technical aspect, so I quote it here:
The Leftmost Longest Rule
Often there is more than one way of matching a regular expression at a particular location, for POSIX basic and extended regular expressions, the "best" match is determined as follows:
1. Find the leftmost match, if there is only one match possible at this location then return it.
2. Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it.
3. If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found.
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only one such match possible then return it.
5. Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it.
6. Repeat steps 4 and 5 for each additional marked sub-expression.
7. If there is still more than one possible match remaining, then they are indistinguishable; return the first one found.
"Marked sub-expression", as mentioned in the text, refers to () capturing groups. Note that they only capture; back-references are not supported.
Therefore, to get lazy matching, you need to construct a regular expression that matches the repeated part while avoiding the tail part until the very end. Since ERE and BRE are equivalent to theoretical regular expressions, as long as you can construct a DFA there exists an equivalent regex that does the trick (though constructing it is not a trivial task in some cases).
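To see the effect of the Leftmost Longest Rule on the regex from the question, here is a rough sketch in Python. Python's re module is a backtracking engine, not a POSIX one, so the leftmost-longest result is simulated by brute force with a hypothetical helper, `leftmost_longest`, that tries the earliest start first and the longest end first. It only models steps 1 and 2 of the rule (overall match position and length), not the sub-expression tie-breaking.

```python
import re

def leftmost_longest(pattern, text):
    """Brute-force leftmost-longest match: earliest start, then longest end."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

question_regex = r"AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA"
for s in ["AB12BABA", "AB123BABA"]:
    backtracking = re.search(question_regex, s).group()  # first match the backtracker finds
    posix_like = leftmost_longest(question_regex, s)     # what a leftmost-longest engine reports
    print(s, backtracking, posix_like)
```

The backtracking engine stops at the first BA it can reach, while the leftmost-longest search finds a way to spell the whole string with the same regex, which is what awk reports.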
For your requirement, this regex shall work:
AB([^B]|B+[^AB])*B*BA
The part ([^B]|B+[^AB])*B* matches any string that does not contain the string "BA".
Derivation
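This is the DFA for matching a string that does not contain the string "BA". Its transitions are: q0 to q0 on [^B], q0 to q1 on B, q1 to q1 on B, q1 to q0 on [^AB], q1 to q2 on A, and q2 to q2 on *.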
The notation here is not standard, so I will explain a bit:
State q1/B means that the state is named q1 (just like how you name a variable), B is the current progress towards matching BA.
* means any character in the alphabet. [^B] means any character in the alphabet except for B.
In the DFA, q0 and q1 are final states, q0 is the initial state. Note that q2 is a trap state, since it is a non-final state, and there is no transition out of this state.
Use the steps here, or just use JFLAP to derive the regular expression. (In JFLAP, you should use some character, such as C to represent [^AB]).
Since q2 is a trap state, we can exclude it from the formula:
R0 = [^B]R0 + BR1 + λ
R1 = [^AB]R0 + BR1 + λ
Apply Arden's theorem to R1:
R1 = B*([^AB]R0 + λ)
Substitute R1 to R0:
R0 = [^B]R0 + BB*([^AB]R0 + λ) + λ
Distribute BB* over ([^AB]R0 + λ):
R0 = [^B]R0 + BB*[^AB]R0 + BB*λ + λ
Group together:
R0 = ([^B] + BB*[^AB])R0 + (BB* + λ)
Apply Arden's theorem to R0:
R0 = ([^B] + BB*[^AB])*(BB* + λ)
(BB* OR λ (empty string)) is equivalent to B*:
R0 = ([^B] + BB*[^AB])*B*
Let us rewrite it in awk's syntax: ([^B]|B+[^AB])*B*, which is what is shown above.
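The derived expression can be sanity-checked by brute force. The sketch below is an addition, not part of the original answer, and uses Python rather than awk; full-string matching is used, so the difference between backtracking and leftmost-longest semantics does not matter for the first check. As with JFLAP above, the character C stands in for "any character other than A or B".

```python
import re
from itertools import product

# The part of the regex that should match exactly the strings without "BA".
NO_BA = re.compile(r"([^B]|B+[^AB])*B*")

# Every string over {A, B, C} up to length 7 on which the regex and a plain
# substring test disagree; an empty list means they agree everywhere.
mismatches = [
    "".join(chars)
    for n in range(8)
    for chars in product("ABC", repeat=n)
    if (NO_BA.fullmatch("".join(chars)) is not None) != ("BA" not in "".join(chars))
]
print(mismatches)  # []

# The complete regex on the two strings from the question.
full = re.compile(r"AB([^B]|B+[^AB])*B*BA")
print(full.search("AB12BABA").group())   # AB12BA
print(full.search("AB123BABA").group())  # AB123BA
```

Because the middle part can never contain "BA", there is only one possible match starting at the first AB, so a leftmost-longest engine such as awk's reports the same result.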
In a regex flavor that supports them (PCRE-style engines, not awk's ERE), use lookarounds and a non-greedy quantifier:
(?<=AB).*?(?=BA)
If you want to match the delimiters too, simply:
AB.*?BA
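For comparison, this is how the two patterns behave in an engine that does support lookarounds and lazy quantifiers. Python's re module is used here as an assumption, since the question itself is about awk:

```python
import re

for s in ["AB12BABA", "AB123BABA"]:
    inner = re.search(r"(?<=AB).*?(?=BA)", s).group()  # contents only
    with_delims = re.search(r"AB.*?BA", s).group()     # contents plus delimiters
    print(inner, with_delims)
# 12 AB12BA
# 123 AB123BA
```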

Is there an R function to escape a string for regex characters

I want to build a regex by substituting in some strings to search for, so these strings need to be escaped before I put them in the regex; that way, the regex still works if a search string contains regex metacharacters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  # Prefix every non-word character with a backslash
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
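For readers who want to try the same idiom outside R, here is a sketch of that substitution in Python (the language of the re.escape function mentioned in the question). The function name `quotemeta` is borrowed from the R code above; the Python version is an illustration, not part of the original answer.

```python
import re

def quotemeta(text):
    # Same substitution as Perl's s/(\W)/\\$1/g: backslash every non-word character.
    return re.sub(r"(\W)", r"\\\1", text)

x = "foo[bar]"
y = quotemeta(x)
print(y)                         # foo\[bar\]
print(y == re.escape(x))         # True for this input
print(bool(re.fullmatch(y, x)))  # True: the escaped pattern matches the literal text
```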
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
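re.escape <- function(strings){
  # Regexes that each match one metacharacter (backslash must come first)
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$","\\*",
            "\\+", "\\?", "\\.", "\\|")
  # Replacement for each: the same character preceded by a backslash
  replace.vals <- paste0("\\\\", vals)
  for(i in seq_along(vals)){
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}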
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
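The same idea, escaping a literal and then wrapping it in anchors, can be sketched in Python with re.escape (the function the question mentions). This is only an analogy to what rex produces, not a description of rex's internals:

```python
import re

x = "foo[bar]"
anchored = "^" + re.escape(x) + "$"  # escape the literal, then add the anchors
print(anchored)  # ^foo\[bar\]$
print(bool(re.search(anchored, "foo[bar]")))      # True
print(bool(re.search(anchored, "a foo[bar] b")))  # False
```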
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect each occurrence of a non-word character and escape it with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, substituting "([^[:alnum:]_])" for "(\\W)".

How to do REGEX (Groovy) to select words "She","Shell" with REGEX = "She"?

I am a newbie at regex. I am trying to get only the words "She" and "Shell", not "ashes", with this Groovy program. I have been working on this for some time.
saying = 'She wishes for Shells not ashes'
println saying
def pattern = ~/\bShe*\b/
def matcher = pattern.matcher(saying)
def count = matcher.getCount()
println "Matches = ${count}"
for (i in 0..<count) {
    print matcher[i] + " "
}
Output:
She wishes for Shells not ashes
Matches = 1
She
Regex does not work like the Windows CMD wildcard, e.g. dir W* to list folders or files beginning with W. What did I do wrong?
Many thanks in advance for any answers.
In regular expressions the * is not the same as a wildcard (match any characters).
It is a quantifier that modifies whatever is immediately before it and means "zero or more". Your regular expression matches Sh followed by zero or more e. So it will match these strings:
Sh
She
Shee
Sheee
etc...
What you probably mean is \w* to match any word characters.
/\bShe\w*\b/
Also note that in regular expressions "word characters" are considered to be letters, numbers or the underscore. So a sequence of word characters is different from what is regarded as a "word" in human languages. It is in fact not easy to correctly identify words using regular expressions alone, so if you need to match words in a specific language you should use a natural language processing library and/or a dictionary instead of a regular expression.
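The difference between the two patterns is easy to reproduce. The sketch below uses Python instead of Groovy (an assumption made only so the example is self-contained; \b, * and \w behave the same way in both flavors for this input):

```python
import re

saying = "She wishes for Shells not ashes"

# e* means "zero or more e", so only the standalone word She fits between the \b's.
print(re.findall(r"\bShe*\b", saying))    # ['She']

# \w* means "zero or more word characters", so Shells is matched as well.
print(re.findall(r"\bShe\w*\b", saying))  # ['She', 'Shells']
```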

R: Find the last dot in a string

In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
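The same pattern can be tried in other flavors. Here is a sketch in Python (not R, so indices are 0-based and 1 is added to line up with R's 1-based positions); the plain string method rfind is shown alongside it as a non-regex cross-check:

```python
import re

x = "hello.world.123.456"

# A dot followed only by non-dot characters up to the end of the string.
match = re.search(r"\.[^.]*$", x)
print(match.start() + 1)  # 16 (1-based, as in R)

# Cross-check without a regex.
print(x.rfind(".") + 1)   # 16
```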
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another tidier option to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
See an R demo online:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]