Automata - Regular Expression - regex

I've been trying to make a regular expression from the below:
L = {01, 0011, 000111, 00001111, 0000011111, 000000111111, ...}
but I just could not figure it out. The first thing that came to my mind was
0(0)^* 1(1)^*
Is there an app where I could test it out?
If this can't be done through Regular Expression, can an NFA or DFA be done?
but I'm not sure if that is the answer to the language. Could some good Samaritan kindly help me with this? Appreciate it.

A subroutine may suit your needs:
(?<!0)(0(?1)?1)(?!1)
Debuggex Demo
(?1) means recall the pattern captured in the first group, i.e. between the parens. This isn't available in all regex engines though - neither is the (negative) lookbehind (?<!...) by the way.
The difference between (?1) and \1 is that (?1) recalls the captured pattern while \1 recalls the captured data.

I don't know about what you meant when you said that it should be regex, because it is mentioned automaton/regular expression too.
As per the automata theory :-
If you are talking about the regular expression for this formal language (having equal number of 0's and 1's and all 0's must be followed by 1's), it is not a regular language. It can be proved using the pumping lemma that this language is not regular.
But, this language can be expressed as {0i1i | i>0}; i belongs to set of positive integers.

Related

Unclear Complex Regular expression

I'm new to Regular Expressions and has stumbled upon an expression I do not really understand.
The expression is:
.*[^0-9](?P<ref>[0-9]{3})[^0-9].*
and I think I understand the first part and the last part, but the part within parenthesis eludes me. I would be most grateful for an explanation or some links where I could find help with this.
Thanks.
The part within parentheses is a named capturing group that matches exactly three digits (and lets that group be referenced by the name ref). This feature was added because in very long, complex expressions, it's much clearer to used named groups than the usual numbered groups (which requires counting parentheses to see which group is which).
Exactly how the named capture referencing is done depends on the regular expression library and/or language being used. For example, in Python:
>>> import re
>>> match = re.search(r'.*[^0-9](?P<ref>[0-9]{3})[^0-9].*', 'a234b')
>>> match.group('ref')
'234'
here's a short graphical explanation for your expression:
.*[^0-9](?P<ref>[0-9]{3})[^0-9].*
Debuggex Demo
In words:
.* matches any number of any chars
[^0-9] matches one char, that is not a number
(?P<ref>[0-9]{3}) matches a group of three numbers and gives the group in the result the name ref
[^0-9] ... obvious
.* ... obvious

Can regex match intersection between two regular expressions?

Given several regular expressions, can we write a regular expressions which is equal to their intersection?
For example, given two regular expressions c[a-z][a-z] and [a-z][aeiou]t, their intersection contains cat and cut and possibly more. How can we write a regular expression for their intersection?
Thanks.
A logical AND in regex is represented by
(?=...)(?=...)
So,
(?=[a-z][aeiou]t)(?=c[a-z][a-z])
The lookahead examples are easy to use, but technically are no longer regular languages. However it is possible to take the intersection of two regular languages, and that complement is regular.
First note that Regular Expressions can be converted to and from NFAs; they both are ways of expressing regular languages.
Second, by DeMorgan's law,
Thus these are the steps to compute the intersection of two RegExs:
Convert both RegExs to NFAs.
Compute the complement of both NFAs.
Compute the union of the two complements.
Compute the complement of that union.
Convert the resulting NFA to a RegEx.
Some sources:
Union and RegEx to NFA: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_06.pdf
NFA to RegEx: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_08.pdf
Complement of NFA: https://cs.stackexchange.com/questions/13282/complement-of-non-deterministic-finite-automata
Mathematically speaking, an intersection of two regular languages is regular, so there has to be a regular expression that accepts it.
Building it via corresponding NFAs is probably the easiest. Consider the two NFAs that correspond to the two regexes. The new states Q are pairs (Q1,Q2) from the two NFAs. If there is a transition (P1,x,Q1) in the first NFA and (P2,x,Q2) in the second NFA, then and only then there is a transition ((P1,P2),x,(Q1,Q2)) in the new NFA. A new state (Q1,Q2) is initial/final iff both Q1 and Q2 are initial/final.
If you use NFAs with ε-moves, then also for each transition (P1,ε,Q1) there will be a transition ((P1,P2),ε,(Q1,P2)) for all states P2. Likewise for ε-moves in the second NFA.
Now convert the new NFA to a regular expression with any known algorithm, and that's it.
As for PCRE, they are not, strictly speaking, regular expressions. There is no way to do it in the general case. Sometimes you can use lookaheads, like ^(?=regex1$)(?=regex2$) but this is only good for matching the entire string and is no good for either searching or embedding in other regexps. Without anchoring, the two lookaheads may end up matching strings of different lengths. This is not intersection.
First, let's agree on terms. My syntactical assumption will be that
The intersection of several regexes is one regex that matches strings
that each of the component regexes also match.
The General Option
To check for the intersection of two patterns, the general method is (pseudo-code):
if match(regex1) && match(regex2) { champagne for everyone! }
The Regex Option
In some cases, you can do the same with lookaheads, but for a complex regex there is little benefit of doing so, apart from making your regex more obscure to your enemies. Why little benefit? Because the engine will have to parse the whole string multiple times anyway.
Boolean AND
The general pattern for an AND checking that a string exactly meets regex1 and regex2 would be:
^(?=regex1$)(?=regex2$)
The $ in each lookahead ensures that each string matches the pattern and nothing more.
Matching when AND
Of course, if you don't want to just check the boolean value of the AND but also do some actual matching, after the lookaheads, you can add a dot-star to consume the string:
^(?=regex1$)(?=regex2$).*
Or... After checking the first condition, just match the second:
^(?=regex1$)regex2$
This is a technique used for instance in password validation. For more details on this, see Mastering Lookahead and Lookbehind.
Bonus section: Union of Regexes
Instead of working on an intersection, let's say you are interested in the union of the following regexes, i.e., a regex that matches either of those regexes:
catch
cat1
cat2
cat3
cat5
This is accomplished with the alternation | operator:
catch|cat1|cat2|cat3|cat5
Furthermore, such a regex can often be compressed, as in:
cat(?:ch|[1-35])
For And operation, we have something like this in RegEx
(REGEX)(REGEX)
Taking your example
'Cat'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["Cat", "C", "a", "t"]
'Ca'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Cat123'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
where
([A-Za-z]+) //Match All characters
and
([aeiouAEIOU]+) //Match all vowels
Combine them both will match
([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)
eg:
'Hmmmmmm'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Stckvrflw'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
null
'StackOverflow'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["StackOverflow", "StackOverfl", "o", "w"]

Which regular expression requires backtracking?

There are three different solutions to implement regular expression matching: DFA, NFA and Backtracking. I am looking for examples:
a regexp, which can be solved with a DFA and the reason, why a DFA is sufficient.
a regexp, which requires a NFA and the reason why a NFA is necessary.
a regexp, which requires backtracking and the reason why backtracking is necessary.
A recommendation for some good literature about this topic would be nice, too.
i guess there is more than 1 meaning to the word backtracking - even '.*a' has to backtrack to match the string "lalaiiiiiii" (because .* will first match the whole string - so then a won't match anything - and only then it will give up one character at a time, so the final match would be "lala")
i highly recommend http://www.regular-expressions.info/
What I found out so far is:
Every regular expression, which can be implemented with a NFA, can also be implemented with a DFA. Every NFA can be transformed into a DFA.
Regular expressions, which require backtracking, are regular expressions, which contain back references like /(a)\1/.

What are the zero width elements in a regular expression?

Recently, I have been seeing "zero width elements" in regular expressions. What are they? Can they be treated as ghost data, so that for replacement, they won't be replaced, and for ( ) matching, they won't go into the matches[1], matches[2], etc?
Is there a good tutorial for all its various uses? Have they been here for a long time? Which version of O'Reilly's Regular Expression book was the first to discuss them?
The point of zero-width lookaround assertions is that they check if a certain regex can or cannot be matched looking forward or backwards from the current position, without actually adding them to the match. So, yes, they won't count towards the capturing groups, and yes, their matches won't be replaced (because they aren't matched in the first place).
However, you can have a capturing group inside a lookaround assertion that will go into matches[1] etc.
For example, in C#:
Regex.Replace("ab", "(a)(?=(b))", "$1$2");
will return abb.
A very good online tutorial about regular expressions in general can be found at http://www.regular-expressions.info (even though it's a little out of date in some areas).
It contains a specific section about zero-width lookaround assertions (and Part II).
And of course they are covered in-depth in both Mastering Regular Expressions and the Regular Expressions Cookbook.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part