Regular expression properties (theory) [closed]

I'm reading Aho and Ullman's book The Theory of Parsing, Translation, and Compiling. In the section that introduces regular expressions in chapter 2, there is a list of properties of regular expressions. I do not understand properties 2 and 8. Here is the list of properties:
(1) 𝛼 + 𝛽 = 𝛽 + 𝛼
(2) ∅* = 𝑒
(3) 𝛼 + (𝛽 + 𝛾) = (𝛼 + 𝛽) + 𝛾
(4) 𝛼(𝛽𝛾) = (𝛼𝛽)𝛾
(5) 𝛼(𝛽 + 𝛾) = 𝛼𝛽 + 𝛼𝛾
(6) (𝛼 + 𝛽)𝛾 = 𝛼𝛾 + 𝛽𝛾
(7) 𝛼𝑒 = 𝑒𝛼 = 𝛼
(8) ∅𝛼 = 𝛼∅ = ∅
(9) 𝛼* = 𝛼 + 𝛼*
(10) (𝛼* )* = 𝛼*
(11) 𝛼 + 𝛼 = 𝛼
(12) 𝛼 + ∅ = 𝛼
where ∅ is the regular expression denoting the regular set ∅, 𝛼, 𝛽, 𝛾 are arbitrary regular expressions, and 𝑒 is the empty string.
How are properties (2) and (8) justified?
Edit: To explain the notation of +, *, etc, here are some definitions given in the book (quoted):
DEFINITION Let 𝚺 be a finite alphabet. We define a regular set over
𝚺 recursively in the following manner:
(1) ∅ (the empty set) is a regular set over 𝚺.
(2) {𝑒} is a regular set over 𝚺.
(3) {𝑎} is a regular set over 𝚺 for all 𝑎 in 𝚺.
(4) If 𝑃 and 𝑄 are regular sets over 𝚺, then so are
(a) 𝑃 ∪ 𝑄.
(b) 𝑃𝑄.
(c) 𝑃*.
(5) Nothing else is a regular set.
Thus a subset of 𝚺* is regular if and only if it is ∅, {𝑒}, or {𝑎},
for some 𝑎 in 𝚺, or can be obtained from these by a finite number of
applications of the operations union, concatenation, and closure.
DEFINITION Regular expressions over 𝚺 and the regular sets
they denote are defined recursively, as follows:
(1) ∅ is a regular expression denoting the regular set ∅.
(2) 𝑒 is a regular expression denoting the regular set {𝑒}.
(3) 𝑎 in 𝚺 is a regular expression denoting the regular set {𝑎}.
(4) If 𝑝 and 𝑞 are regular expressions denoting the regular sets 𝑃
and 𝑄, respectively, then
(a) (𝑝+𝑞) is a regular expression denoting 𝑃 ∪ 𝑄.
(b) (𝑝𝑞) is a regular expression denoting 𝑃𝑄.
(c) (𝑝)* is a regular expression denoting 𝑃*.
(5) Nothing else is a regular expression.

My guess is that properties 2 and 8 follow directly from the definitions:
Property 2
∅ is the empty set of strings. By definition, P* is the union of P⁰, P¹, P², ..., where P⁰ = {𝑒} for every set P, including ∅. For n ≥ 1, ∅ⁿ = ∅, so ∅* = {𝑒} ∪ ∅ ∪ ∅ ∪ ... = {𝑒}, which is the set denoted by 𝑒. Note that the positive closure behaves differently: ∅+ = ∅, because it omits the zero-repetition term.
The key point is that 𝑒 (the empty string) and ∅ (the empty set) are different things: {𝑒} contains exactly one string, while ∅ contains none.
Reference:
Why is the Kleene star of a null set is an empty string?
Property 8
∅𝛼 = 𝛼∅ = ∅ is true, and so is ∅𝛼∅𝛼∅𝛼 = 𝛼∅𝛼∅𝛼∅ = ∅. The concatenation PQ is the set of all strings xy with x in P and y in Q; if either set is empty, there is no such pair, so the result is the empty set.
Reference:
Regular expressions with empty set/empty string
What is the difference between language of empty string and empty set language?
How can concatenating empty sets (languages) result in a set containing empty string?
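To make the two properties concrete, here is a minimal Python sketch that models regular sets as finite sets of strings. The helper names concat and star_up_to are made up for illustration, and since a true Kleene star is infinite in general, the sketch only unions the powers up to a chosen bound.

```python
def concat(p, q):
    """Concatenation PQ: every string xy with x in P and y in Q."""
    return {x + y for x in p for y in q}


def star_up_to(p, max_reps):
    """Union of P^0, P^1, ..., P^max_reps (a bounded stand-in for P*)."""
    result = {""}      # P^0 is always {e}; "" plays the role of e
    power = {""}
    for _ in range(max_reps):
        power = concat(power, p)   # P^(n+1) = P^n P
        result |= power
    return result


empty = set()
print(star_up_to(empty, 5))        # only the empty string survives
print(concat(empty, {"a", "ab"}))  # no pairs to concatenate
print(concat({"a", "ab"}, empty))
```

With ∅ as input, every power after P⁰ is empty, so only the empty string remains (property 2), and concatenating with ∅ on either side yields no strings at all (property 8).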

Related

Removing everything between nested parentheses

For removing everything between parentheses, I currently use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
This is incorrect: it gives bbb (ccc / eee), because it removes only the innermost parentheses.
How can I remove everything between nested parentheses? The expected result for this example is bbb.

In the case of Google BigQuery, a single regex can do this only if you know the maximum nesting depth, because BigQuery uses the RE2 library, which doesn't support recursive patterns. For one level of nesting (shown here in JavaScript):
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))
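The same idea generalizes to any known maximum depth by building the pattern level by level. The following is a rough Python sketch, not BigQuery code; nested_paren_pattern is a hypothetical helper name, and the pattern it builds uses only non-recursive syntax.

```python
import re


def nested_paren_pattern(max_depth):
    """Build a pattern matching a parenthesized group that contains
    at most max_depth further levels of nested parentheses."""
    pattern = r"\([^()]*\)"          # depth 0: no parentheses inside
    for _ in range(max_depth):
        # one more level: non-paren characters or a group of the previous depth
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


s = "(aaa) bbb (ccc (ddd) / eee)"
print(repr(re.sub(nested_paren_pattern(1), "", s)))
```

The generated pattern grows with the depth, so this is only practical when the nesting bound is small.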

If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
old_string := string
string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
on the first iteration we remove (b) and (c): sequences which begin with (, end with ) and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
When we try removing more parentheses, nothing changes, so the algorithm terminates with x).
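The loop above could be sketched in Python roughly as follows. The function name strip_parens_fixed_point is an assumption for illustration; the regex is the \([^()]*\) pattern from the walkthrough.

```python
import re

# Matches "(", then any run of non-parenthesis characters, then ")".
INNERMOST = re.compile(r"\([^()]*\)")


def strip_parens_fixed_point(text):
    """Repeatedly delete innermost parenthesized groups until nothing changes."""
    while True:
        reduced = INNERMOST.sub("", text)
        if reduced == text:        # fixed point reached
            return text
        text = reduced


print(strip_parens_fixed_point("((a(b)) (c))x)"))
print(repr(strip_parens_fixed_point("(aaa) bbb (ccc (ddd) / eee)")))
```

Each pass strips one nesting level, so the number of iterations is bounded by the maximum nesting depth plus one.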

Scala multiline string extraction using regular expression [duplicate]

Given a collection of strings which all start with the prefix _abc_ABC( and end with the suffix ), for instance,
val a = """_abc_ABC(
{
x = 1
y = 2
})"""
how can I define a regular expression that strips the prefix and suffix above?

This would work:
val a = """_abc_ABC(
{
x = 1
y = 2
})"""
val Re = """(?s)_abc_ABC\((.*)\)""".r
a match { case Re(content) => println(content) }
The (?s) flag makes . match newline characters as well, so .* can span multiple lines.
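For comparison, the effect of the flag can be seen in a short Python sketch (Python rather than Scala, purely as an illustration), where re.DOTALL plays the role of (?s):

```python
import re

text = """_abc_ABC(
{
x = 1
y = 2
})"""

# With DOTALL, "." also matches "\n", so the group captures all inner lines.
match = re.fullmatch(r"_abc_ABC\((.*)\)", text, flags=re.DOTALL)
content = match.group(1) if match else None
print(content)

# Without the flag, "." stops at the first newline and the match fails.
print(re.fullmatch(r"_abc_ABC\((.*)\)", text))
```

Checking for a failed match before reading the group avoids an exception on strings that lack the prefix or suffix.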

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW, I guess. Is there a faster way to do this? Regex, maybe?
Thanks!
PS: Thanks, everyone, for the great answers. My objective was to build a 29 x 29 matrix, where the row names and column names are the allowed characters. I then iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation: I was appending instead of pre-allocating the final vector, so the code took 30 minutes to execute instead of 20 seconds!
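As an illustration of the 'letter precedence' idea with the storage allocated up front, here is a small Python sketch (not the original R code). The names ALLOWED, INDEX and precedence_matrix are hypothetical, and the sketch assumes every word has already been validated against the allowed characters.

```python
import string

# 26 lowercase letters plus '!', '.', '?' gives the 29 allowed characters.
ALLOWED = list(string.ascii_lowercase) + ["!", ".", "?"]
INDEX = {ch: i for i, ch in enumerate(ALLOWED)}


def precedence_matrix(words):
    """Count how often each allowed character is immediately followed by another."""
    size = len(ALLOWED)
    counts = [[0] * size for _ in range(size)]   # pre-allocated 29 x 29 grid
    for word in words:
        # zip pairs each character with its successor: (s,p), (p,e), ...
        for first, second in zip(word, word[1:]):
            counts[INDEX[first]][INDEX[second]] += 1
    return counts


counts = precedence_matrix(["special"])
print(counts[INDEX["s"]][INDEX["p"]])
```

A word containing a character outside ALLOWED would raise a KeyError here, which is why validation is assumed to happen first.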

There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing characters one by one is inefficient when a regular expression can do the job. The pattern here uses a bracket expression to list the characters you want. * means any number of repeats (including zero), while ^ and $ anchor the match to the beginning and end of the string so that nothing else is allowed. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized, so you can pass a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
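For readers more at home outside R, the last version translates to Python along these lines. This is only a sketch of the same regex idea; check_pattern is a made-up name, and re.fullmatch stands in for the ^ and $ anchors.

```python
import re


def check_pattern(word, min_size=2):
    """True if, ignoring surrounding whitespace, word consists of at least
    min_size characters drawn from a-z, '!', '.', '?'."""
    pattern = r"\s*[a-z!.?]{%d,}\s*" % min_size
    return re.fullmatch(pattern, word) is not None


words = [" d ", "!hello ", "nA!", " asdf.!", " d d "]
print([check_pattern(w) for w in words])
```

Inside a bracket expression, . and ? are literal characters, so they need no escaping.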

If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, MAX can be omitted, but the comma must stay ({MIN,}), since {MIN} alone means exactly MIN repetitions. Likewise, if there is neither a maximum nor a minimum, everything within the {}, including the braces themselves, can be replaced with a *, which means the preceding item is matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character between any leading and trailing whitespace is one of:
* a lowercase letter
* a bang (exclamation point)
* a period
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that the split approach always does work proportional to n, the length of the string, because the split generates n one-character strings. This means that even FAILING strings require at least n operations before they are rejected, whereas a regex or a manual scan can stop at the first invalid character.
Hopefully this clarifies why you were having performance issues.

Regular Expression - All words that begin and end in different letters

I'm having trouble with this regular expression:
Construct a regular expression defining the following language over alphabet
Σ = { a,b }
L6 = {All words that begin and end in different letters}
Here are some examples of regular expressions I was able to solve:
1. L1 = {all words of even length ending in ab}
(aa + ab + ba + bb)*(ab)
2. L2 = {all words that DO NOT have the substring ab}
b*a*

Would this work:
(a.*b)|(b.*a)
Or, in Kleene notation:
a(a+b)*b+b(a+b)*a
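One way to gain confidence in an expression like this is to compare it against the definition of the language for every short word. Below is a minimal Python sketch of such a brute-force check; the bound of 8 and the helper names are arbitrary choices, and [ab] is used in place of (a+b) because Python's + is a quantifier, not alternation.

```python
import re
from itertools import product

# a(a+b)*b + b(a+b)*a written in Python regex syntax.
PATTERN = re.compile(r"a[ab]*b|b[ab]*a")


def in_language(word):
    """Definition of the language: first and last letters differ."""
    return len(word) >= 2 and word[0] != word[-1]


def all_words(max_len):
    """Every word over {a, b} of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


mismatches = [w for w in all_words(8)
              if (PATTERN.fullmatch(w) is not None) != in_language(w)]
print(mismatches)  # → []
```

An empty list means the regex and the definition agree on all words up to the bound; that is evidence rather than a proof, but it catches most mistakes quickly.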

This should do it:
"^((a.*b)|(b.*a))$"

1- Write a regular expression for each of the following languages: (a) the language of all strings which end with the substring 'ab' and have odd length. (b) the language of all strings which do not contain the substring 'abb'.
2- Construct a deterministic FSA for each of the following languages: (a) the language of all strings in which the second-to-last symbol is 'b'. (b) the language of all strings whose length is odd but which contain an even number of b's.

(aa+ab+ba+bb)∗(a+b)ab
It takes any even-length prefix over a and b, then one more character (a or b), and then ends with the string ab, so the total length is odd.

Given an RE, derive the largest substring match

I'm looking for a bit of code that will:
Given regular expression E, derive the longest string X
such that for every S, X is a substring of S iff S will match E
examples:
E = "a", X = "a"
E = "^a$", X = "a"
E = "a(b|c)", X = "a"
E = "[ab]", X = ""
context: I want to match some regular expressions against a data
store that only supports substring searching. It would be nice
to optimize the regular expression searching by applying a substring
search to the data store to reduce the amount of data transferred
as much as possible.
example 2:
If I want to catch "error: foo", "error: bar", "error: baz", I might specify
error: (foo|bar|baz)
and send
search "error: "
to the data store, and then regexping the returned items.
Thanks!

In general terms, you could try to split the regex at every construct that can match more than one string (alternations such as (a|b), character classes such as [ab]), and then look for the longest string in the resulting array. Something like
$foo = longest(regex_split($regex, '(\(.*?\|.*?\))|(\[.*?\])'));
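A rough Python rendering of that splitting heuristic might look like the following. The name longest_literal is made up, anchors are also stripped so that "^a$" yields "a", and the sketch deliberately ignores quantifiers, escapes, nested groups and groups without alternation, so treat it as a starting point rather than a correct solution for arbitrary regexes.

```python
import re

# Variable parts: "( ... | ... )" alternations, "[ ... ]" classes, and anchors.
VARIABLE_PART = re.compile(r"\(.*?\|.*?\)|\[.*?\]|[\^$]")


def longest_literal(regex):
    """Naive heuristic: split at variable parts, keep the longest literal piece."""
    pieces = VARIABLE_PART.split(regex)
    return max(pieces, key=len)


for expr in ["a", "^a$", "a(b|c)", "[ab]", "error: (foo|bar|baz)"]:
    print(repr(expr), "->", repr(longest_literal(expr)))
```

The result is only a safe pre-filter when the literal piece is not followed by a quantifier such as ? or *, which this sketch does not check.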

Maybe convert the RE to a finite-state automaton and look for the longest part that must be present on every path between the start and accepting states... Thinking geometrically with a graph may be easier for you; at least it is in my case.
