Will a regular expression applied in reverse produce the same match?

Suppose we have some text and a regular expression that matches it. Question: if I apply the same expression to text backwards (starting from the last letter to the first one), will it still match?
regex -----> text
xereg --?--> txet
In practice that seems to work; the question is rather what the theory says about the general case.

Not if you use the Kleene star - if you reverse the regex, you will end up with an invalid regex or one that matches a different pattern:
ab* -> *ba (invalid syntax)
a*b -> b*a (the first one matches aaab but not abbb, while the second one matches bbba but not baaa)
On the other hand, I'm quite sure that it would be possible to design an algorithm that, given a regex, produces a regex that matches the reverse strings. The following recursive algorithm should work (if r is a regex, rev(r) means the regex that matches the reversed strings):
If r is a single symbol x, then rev(r) = x.
If r is a union A|B, then rev(r) = rev(A)|rev(B).
If r is a concatenation AB, then rev(r) = rev(B)rev(A).
If r is a Kleene star A*, then rev(r) = rev(A)*.
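The four rules above can be sketched in Python. This is only an illustration: the `Sym`, `Union`, `Concat` and `Star` classes and the `to_pattern` helper are hypothetical names introduced here, and the regex is assumed to be available as a small syntax tree rather than as a string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    char: str


@dataclass(frozen=True)
class Union:
    left: object
    right: object


@dataclass(frozen=True)
class Concat:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    inner: object


def rev(r):
    """Return a tree matching exactly the reversed strings of r."""
    if isinstance(r, Sym):
        return r
    if isinstance(r, Union):
        return Union(rev(r.left), rev(r.right))
    if isinstance(r, Concat):
        # Only concatenation swaps its operands.
        return Concat(rev(r.right), rev(r.left))
    if isinstance(r, Star):
        return Star(rev(r.inner))
    raise TypeError(f"unknown node: {r!r}")


def to_pattern(r):
    """Render a tree as a Python regex string (fully parenthesised)."""
    if isinstance(r, Sym):
        return re.escape(r.char)
    if isinstance(r, Union):
        return f"(?:{to_pattern(r.left)}|{to_pattern(r.right)})"
    if isinstance(r, Concat):
        return f"(?:{to_pattern(r.left)}{to_pattern(r.right)})"
    return f"(?:{to_pattern(r.inner)})*"


# a*b reversed becomes ba*
original = Concat(Star(Sym("a")), Sym("b"))
reversed_regex = rev(original)
```

With this sketch, a string s fully matches `to_pattern(original)` exactly when `s[::-1]` fully matches `to_pattern(reversed_regex)`, at least for the pure regular operators covered by the four rules.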

The general case is that it will not.
for example, the regex
ab
will match
ab
but not
ba
How come you think that the general case is that it should?
There are regexes that match the reversed string as well, like
(a|b)*
Will match
ab
and
ba

The cases where regex and xeger would both produce the same match on a text are:
regex is a simple (atomic) pattern that is a palindrome. e.g., abcba
regex is composed of several atomic patterns using commutative functions (e.g., or) and you do not reverse those individual atomic patterns. If you do reverse them, each one should be a palindrome too, e.g., adef|bd881|cdavr if you do not reverse the atomic components, or aba|defed if you do.

In general I would definitely say "no", but it really just depends on the complexity of the expressions.
Because not only would one need to reverse any simple (sub-)expressions, but if applicable one would also need to take into account more complex stuff which is not so easily "reversed" in just any regex: what about repetition operators, laziness vs. greediness, or back-references and look-arounds, quantifiers and modifiers… – items explained in e.g. this tutorial?
Perhaps if you have more specific examples or issues regarding such a "reversal", a more appropriate answer can be thought of.

Related

Conditional regular expression with one section dependent on the result of another section of the regex

Is it possible to design a regular expression in a way that a part of it is dependent on another section of the same regular expression?
Consider the following example:
(ABCHEHG)[HGE]{5,1230}(EEJOPK)[DM]{5}
I want to continue this regex, and at some point I will have a section where the result of that section should depend on the result of [DM]{5}.
For example, D will be complemented by C, and M will be complemented by N.
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5}[D'M']{5}
By D' I mean C, and by M' I mean N.
So a resulting string that matches the above regex, if it has DDDMM matching to the section [DM]{5}, it should necessarily have CCCNN matching to [D'M']{5}. Therefore, the result of [D'M']{5} always depends on [DM]{5}, or in other words, what matches to [DM]{5} always dictates what will match to [D'M']{5}.
Is it possible to do such a thing with regex?
Please note that, in this example I have extremely over-simplified the problem. The regex pattern I currently have is really much more complex and longer and my actual pattern includes about 5-6 of such dependent sections.
I cannot think of a way you can do this in pure regex. I would run 2 regex expressions. The first regex to extract the [DM]{5} string, such as
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}
And take the last 5 characters. Now replace the characters, for example in C# it would be result = result.Substring(result.Length - 5, 5).Replace('D', 'C').Replace('M', 'N'), and then concatenate like
(ABCHEHG)[HGHE]{5,1230}(EEJOPK)[DM]{5}[ACF]{1,1000}(BBBA)[CU]{2,5} + result
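The same two-step idea can be sketched in Python. The function name `matches_with_complement` is made up for illustration, and the sketch assumes the prefix pattern can only match one way, so that the captured [DM]{5} text is the one the full pattern would also use.

```python
import re

# Step 1 pattern: everything up to and including the [DM]{5} section,
# with that section captured as group 3.
PREFIX = re.compile(r"(ABCHEHG)[HGHE]{5,1230}(EEJOPK)([DM]{5})")
REST = r"[ACF]{1,1000}(BBBA)[CU]{2,5}"

# D is complemented by C, M by N.
COMPLEMENT = str.maketrans("DM", "CN")


def matches_with_complement(text):
    """Return True if text matches the full pattern with a complemented tail."""
    first = PREFIX.match(text)
    if first is None:
        return False
    # Step 2: build the dependent section from what [DM]{5} actually matched.
    tail = first.group(3).translate(COMPLEMENT)
    full = re.compile(PREFIX.pattern + REST + re.escape(tail))
    return full.fullmatch(text) is not None
```

Compiling a second pattern per input has a cost, so this is probably only reasonable when the number of inputs is modest.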
This is pretty easy to do in Perl:
m{
ABCHEHG
[HGHE]{5,1230}
EEJOPK
( [DM]{5} )
[ACF]{1,1000}
BBBA
[CU]{2,5}
(??{ $1 =~ tr/DM/CN/r })
}x
I've added the x modifier and whitespace for better readability. I've also removed the capturing groups around the fixed strings (they're fixed strings; you already know what they're going to capture).
The crucial part is that we capture the string that was actually matched by [DM]{5} (in $1), which we then use at the end to dynamically generate a subpattern by replacing all D by C and M by N in $1.
This sounds like bioinformatics in python. Do 2-stage filtering, at regex level and at app level.
Wildcard the DM portions, so the regex is permissive in what it accepts. Bury the regex in a token generator that yields several matching sections. Have your app iterate through the generator's results, discarding any result rejected by your business logic, such as finding that one token is not the complement of another token.
Alternatively, you might push some of that work down into a complex generated regex, which likely will perform worse and will be harder to debug. Your DDDMM example might be summarized as D+M+, or [DM]+, not sure if sequence matters. The complement might be C+N+, or [CN]+. Apparently there's two cases here. So start assembling a regex: stuff1 [DM]+ stuff2 [CN]+ stuff3. Then tack on '|' for alternation, and tack on the other case: stuff1 [CN]+ stuff2 [DM]+ stuff3 (or factor out suffix and prefix so alternation starts after stuff1). I can't imagine you'll be happy with such an approach, as the combinatorics get ugly, and the regex engine is forced to do lots of scanning and backtracking. And recompiling additional regexes on the fly doesn't come for free. Instead you should use the regex engine for the simple things that it's good at, and delegate complex business logic decisions to your app.
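A minimal sketch of the two-stage filtering described above, in Python. The names `candidates` and `valid_matches` are hypothetical, and the pattern is the simplified one from the question with the dependent section widened to [CN]{5}.

```python
import re

# Stage 1: a permissive regex that accepts any [CN]{5} in the dependent slot.
PERMISSIVE = re.compile(
    r"ABCHEHG[HGHE]{5,1230}EEJOPK([DM]{5})[ACF]{1,1000}BBBA[CU]{2,5}([CN]{5})"
)
COMPLEMENT = str.maketrans("DM", "CN")


def candidates(text):
    """Yield every regex-level match; no business logic is applied here."""
    yield from PERMISSIVE.finditer(text)


def valid_matches(text):
    """Stage 2: keep only matches whose second token complements the first."""
    return [
        m.group(0)
        for m in candidates(text)
        if m.group(2) == m.group(1).translate(COMPLEMENT)
    ]
```

One caveat with this sketch: a backtracking engine reports a single way of splitting the text per match, so when adjacent classes overlap (here [CU] and [CN] share C) a valid alternative split could go unreported.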

Regular expressions: search (instead of match) with DFA

I have been wondering about the theory behind the search mode of a regex matcher. Say I have a regex that matches aab. What if, instead of matching at the beginning of the string, I want to be able to perform the match starting from any position in the string? In match mode I can only verify that the string aab is consistent with the regex, whereas search should also work on, say, aaab and produce the corresponding span.
More specifically: is there any way to build a DFA searcher, or is this fundamentally impossible because it would require additional memory, which a DFSM doesn't have? Obviously you can build a searcher from a matcher by reapplying the matcher at each position of the input string in a for loop, but the complexity of that approach is something like O(len_of_pattern * len_of_input).
A regex searcher is basically the same thing as a regex matcher with .* tacked onto the front of the expression.
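To illustrate the point, here is a small Python sketch of a DFA for `.*aab`-style search, restricted to a literal pattern for brevity. The helper names `build_search_dfa` and `search` are made up here. Each state records how many leading characters of the pattern are currently matched, so the input is read once, left to right, with no backtracking.

```python
def build_search_dfa(pattern, alphabet):
    """Build a DFA equivalent to '.*' followed by the literal pattern.

    State k means: the longest prefix of pattern that is a suffix of the
    input read so far has length k.
    """
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(m, len(seen))
            # Shrink k until pattern[:k] is a suffix of what was seen.
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            dfa[state][ch] = k
    return dfa


def search(pattern, text):
    """Return the (start, end) span of the first occurrence, or None."""
    dfa = build_search_dfa(pattern, set(pattern) | set(text))
    state = 0
    for i, ch in enumerate(text):
        state = dfa[state][ch]
        if state == len(pattern):
            # The start is recoverable because a literal has a fixed length.
            return (i + 1 - len(pattern), i + 1)
    return None
```

For example, `search("aab", "aaab")` gives the span `(1, 4)`. For a general regex the DFA still finds where a match ends in one pass; recovering where it starts needs extra work, and one commonly described approach is to run an automaton for the reversed regex backwards from the end position.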
I think you basically answered your own question. Search, like many other features in modern regex implementations, leverages the wonder of memory to do things unfeasible with finite automata. DFAs traditionally cannot loop over strings or backtrack along input, as this would require memory. Search requires the ability to find a match and then understand how that match fits in the string.

Can regex match intersection between two regular expressions?

Given several regular expressions, can we write a regular expressions which is equal to their intersection?
For example, given two regular expressions c[a-z][a-z] and [a-z][aeiou]t, their intersection contains cat and cut and possibly more. How can we write a regular expression for their intersection?
Thanks.
A logical AND in regex is represented by
(?=...)(?=...)
So,
(?=[a-z][aeiou]t)(?=c[a-z][a-z])
The lookahead examples are easy to use, but technically are no longer regular languages. However, it is possible to take the intersection of two regular languages, and that intersection is regular.
First note that Regular Expressions can be converted to and from NFAs; they both are ways of expressing regular languages.
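Second, by De Morgan's law, the intersection can be written using only union and complement: L1 ∩ L2 = complement(complement(L1) ∪ complement(L2)).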
Thus these are the steps to compute the intersection of two RegExs:
Convert both RegExs to NFAs.
Compute the complement of both NFAs.
Compute the union of the two complements.
Compute the complement of that union.
Convert the resulting NFA to a RegEx.
Some sources:
Union and RegEx to NFA: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_06.pdf
NFA to RegEx: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_08.pdf
Complement of NFA: https://cs.stackexchange.com/questions/13282/complement-of-non-deterministic-finite-automata
Mathematically speaking, an intersection of two regular languages is regular, so there has to be a regular expression that accepts it.
Building it via corresponding NFAs is probably the easiest. Consider the two NFAs that correspond to the two regexes. The new states Q are pairs (Q1,Q2) from the two NFAs. If there is a transition (P1,x,Q1) in the first NFA and (P2,x,Q2) in the second NFA, then and only then there is a transition ((P1,P2),x,(Q1,Q2)) in the new NFA. A new state (Q1,Q2) is initial/final iff both Q1 and Q2 are initial/final.
If you use NFAs with ε-moves, then also for each transition (P1,ε,Q1) there will be a transition ((P1,P2),ε,(Q1,P2)) for all states P2. Likewise for ε-moves in the second NFA.
Now convert the new NFA to a regular expression with any known algorithm, and that's it.
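The product construction can be sketched in Python for NFAs without ε-moves. The representation (a tuple of transitions, initial states and final states) and the helpers `chain`, `product` and `accepts` are assumptions made for this illustration, and the final conversion back to a regular expression is left out.

```python
import string

ALPHABET = string.ascii_lowercase


def chain(classes):
    """NFA for a fixed-length sequence of character classes.

    Transitions map (state, symbol) to a set of target states.
    """
    trans = {(i, ch): {i + 1} for i, cls in enumerate(classes) for ch in cls}
    return trans, {0}, {len(classes)}


def product(nfa1, nfa2):
    """Pair up states; a move exists only if both NFAs can make it."""
    t1, init1, fin1 = nfa1
    t2, init2, fin2 = nfa2
    trans = {}
    for (p1, x), targets1 in t1.items():
        for (p2, y), targets2 in t2.items():
            if x == y:
                trans[((p1, p2), x)] = {
                    (q1, q2) for q1 in targets1 for q2 in targets2
                }
    init = {(a, b) for a in init1 for b in init2}
    fin = {(a, b) for a in fin1 for b in fin2}
    return trans, init, fin


def accepts(nfa, word):
    """Simulate the NFA on word by tracking the set of reachable states."""
    trans, current, final = nfa
    current = set(current)
    for ch in word:
        current = {q for p in current for q in trans.get((p, ch), ())}
    return bool(current & final)


first = chain(["c", ALPHABET, ALPHABET])     # c[a-z][a-z]
second = chain([ALPHABET, "aeiou", "t"])     # [a-z][aeiou]t
both = product(first, second)
```

Here `accepts(both, "cat")` and `accepts(both, "cut")` are true, while "cab" and "bat" are rejected because one of the two component automata rejects them.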
As for PCRE, they are not, strictly speaking, regular expressions. There is no way to do it in the general case. Sometimes you can use lookaheads, like ^(?=regex1$)(?=regex2$) but this is only good for matching the entire string and is no good for either searching or embedding in other regexps. Without anchoring, the two lookaheads may end up matching strings of different lengths. This is not intersection.
First, let's agree on terms. My syntactical assumption will be that
The intersection of several regexes is one regex that matches strings
that each of the component regexes also match.
The General Option
To check for the intersection of two patterns, the general method is (pseudo-code):
if match(regex1) && match(regex2) { champagne for everyone! }
The Regex Option
In some cases, you can do the same with lookaheads, but for a complex regex there is little benefit of doing so, apart from making your regex more obscure to your enemies. Why little benefit? Because the engine will have to parse the whole string multiple times anyway.
Boolean AND
The general pattern for an AND checking that a string exactly meets regex1 and regex2 would be:
^(?=regex1$)(?=regex2$)
The $ in each lookahead ensures that each string matches the pattern and nothing more.
Matching when AND
Of course, if you don't want to just check the boolean value of the AND but also do some actual matching, after the lookaheads, you can add a dot-star to consume the string:
^(?=regex1$)(?=regex2$).*
Or... After checking the first condition, just match the second:
^(?=regex1$)regex2$
This is a technique used for instance in password validation. For more details on this, see Mastering Lookahead and Lookbehind.
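As a rough Python illustration of the anchored-lookahead pattern, using the two regexes from the question. The helper name `both_match` is hypothetical; each sub-regex is wrapped in a non-capturing group so that a top-level alternation inside it cannot escape the `$` anchor.

```python
import re


def both_match(regex1, regex2):
    """Compile a pattern that matches strings fully matched by both regexes."""
    return re.compile(rf"^(?=(?:{regex1})$)(?=(?:{regex2})$).*")


pattern = both_match("c[a-z][a-z]", "[a-z][aeiou]t")

for word in ["cat", "cut", "cab", "bat"]:
    print(word, bool(pattern.match(word)))
```

Only "cat" and "cut" pass, since "cab" fails the second regex and "bat" fails the first.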
Bonus section: Union of Regexes
Instead of working on an intersection, let's say you are interested in the union of the following regexes, i.e., a regex that matches either of those regexes:
catch
cat1
cat2
cat3
cat5
This is accomplished with the alternation | operator:
catch|cat1|cat2|cat3|cat5
Furthermore, such a regex can often be compressed, as in:
cat(?:ch|[1-35])
For And operation, we have something like this in RegEx
(REGEX)(REGEX)
Taking your example
'Cat'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["Cat", "C", "a", "t"]
'Ca'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Cat123'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
where
([A-Za-z]+) //Match All characters
and
([aeiouAEIOU]+) //Match all vowels
Combine them both will match
([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)
eg:
'Hmmmmmm'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Stckvrflw'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'StackOverflow'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["StackOverflow", "StackOverfl", "o", "w"]

Can regexes containing ordered alternation be rewritten to use only unordered alternation?

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?, *, and +, and the nongreedy quantifiers ??, *?, and +?. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?
(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa would match the strings a and aa if the alternation is unordered, but only a if the alternation is ordered.)
Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?
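The a|aa example can be observed in an engine with ordered alternation. The snippet below uses Python's `re` module, a backtracking engine that tries alternatives left to right; it is only a demonstration of the ordering effect on an unanchored match, not an answer to the expressiveness question.

```python
import re

# With ordered alternation the first alternative that succeeds wins,
# even though a longer match would be possible.
first = re.match(r"a|aa", "aa").group()

# Swapping the alternatives changes which text is matched.
second = re.match(r"aa|a", "aa").group()

print(first, second)  # prints: a aa
```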
If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).
In http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" to handle ordered alternations (you need to read the entire series to understand, I suspect, but the "DFA" in question is expanded from the NFA graph "on the fly"). While this is only an appeal to authority, and not a proof, I think it's fair to say that if Russ Cox can't do it (express ordered alternations as a pure DFA), then no one knows how to.
I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:
Let's say we have the regex x||y, where x and y are regexen and || means the ordered alternation. We can construct DFAs accepting x and y; we will call them DFA_x and DFA_y.
We will construct the DFA for x||y in phases by connecting DFA_x and DFA_y.
For every path in DFA_x corresponding to some string a (by path I mean a path in the graph sense, without traversing an edge twice, so a is a path in DFA_"a*" but aa is not)...
For every symbol in the alphabet s
If DFA_y consumes as (that is if run on as DFA_y will not stop early but it may not necessarily accept) and DFA_x does not and DFA_x doesn't accept any prefix of as create a transition from the state DFA_x ends in after consuming a to the state DFA_y ends in after consuming as
The accepting states of the final DFA are all the accepting states of both the input DFA's. The starting state is the starting state of DFA_x.
Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.

Is the greedy option of regex really needed?

Let's say I have the following text, and I'd like to extract the text inside the [Optionx] and [/Optionx] blocks:
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with the regex greedy option, it gives me:
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Does anybody need it to behave like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most applications we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
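A short Python sketch of both points: the greedy and lazy versions of the repeat pattern on abbbbbc, and a toy run-length encoder built on the same back-reference. The `rle_encode` function and its count-then-character output format are illustrative choices, not a standard.

```python
import re

text = "abbbbbc"

greedy = re.search(r"(.)\1+", text).group()   # whole run of b's
lazy = re.search(r"(.)\1+?", text).group()    # stops after two b's


def rle_encode(s):
    """Replace each run of equal characters with <count><char>."""
    # (.)\1* matches one character plus as many repeats as possible.
    return re.sub(r"(.)\1*", lambda m: f"{len(m.group(0))}{m.group(1)}", s)


print(greedy, lazy, rle_encode(text))  # prints: bbbbb bb 1a5b1c
```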
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portions of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abccabab(c)cc
....
Some regular expression implementations are actually able to return the entire set of matches, but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is typical of search and replace (a la grep), where with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
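The difference on the question's own data can be reproduced in Python. The patterns below are one possible way to write the extraction, assumed for this sketch; `re.DOTALL` lets `.` cross line breaks.

```python
import re

text = """[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]"""

# Greedy: .* runs to the last closing tag, swallowing the tags in between.
greedy = re.findall(r"\[Option\d+\]\n(.*)\n\[/Option\d+\]", text, re.DOTALL)

# Non-greedy: .*? stops at the first closing tag, giving one item per block.
lazy = re.findall(r"\[Option\d+\]\n(.*?)\n\[/Option\d+\]", text, re.DOTALL)

print(len(greedy), len(lazy))  # prints: 1 2
```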
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[/".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
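The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[/" (strictly, the character class excludes each of the characters (, [, / and ), so the capture stops at the first [). That way you don't have to care about greediness, just tell it what you don't want to see.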
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
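Just use this algorithm, which you can adapt to your favourite language; no regex needed. In Python (the file name is a placeholder):
in_block = False
with open("options.txt") as f:  # placeholder path
    for line in f:
        if "[/Option" in line:
            in_block = False  # closing tag: stop collecting
        if "[Option" in line:
            in_block = True  # opening tag: start collecting
            continue
        if in_block:
            print(line.strip())
            # you can store the values of each option in this part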