What are the zero width elements in a regular expression? - regex

Recently, I have been seeing "zero width elements" in regular expressions. What are they? Can they be treated as ghost data, so that for replacement, they won't be replaced, and for ( ) matching, they won't go into the matches[1], matches[2], etc?
Is there a good tutorial for all its various uses? Have they been here for a long time? Which version of O'Reilly's Regular Expression book was the first to discuss them?

The point of zero-width lookaround assertions is that they check if a certain regex can or cannot be matched looking forward or backwards from the current position, without actually adding them to the match. So, yes, they won't count towards the capturing groups, and yes, their matches won't be replaced (because they aren't matched in the first place).
However, you can have a capturing group inside a lookaround assertion that will go into matches[1] etc.
For example, in C#:
Regex.Replace("ab", "(a)(?=(b))", "$1$2");
will return abb.
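The same behaviour can be reproduced in Python as a minimal sketch (Python's re module is used here only as an illustration; the replacement syntax is \1\2 instead of $1$2):

```python
import re

# (a) is consumed by the match; (?=(b)) only peeks ahead,
# yet the group inside the lookahead still captures "b".
pattern = r"(a)(?=(b))"

result = re.sub(pattern, r"\1\2", "ab")
match = re.search(pattern, "ab")

print(result)          # abb
print(match.group(0))  # a  (the lookahead added nothing to the match itself)
print(match.groups())  # ('a', 'b')
```

Only "a" is replaced (by "ab"); the "b" that follows was never part of the match, so it is left in place.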
A very good online tutorial about regular expressions in general can be found at http://www.regular-expressions.info (even though it's a little out of date in some areas).
It contains a specific section about zero-width lookaround assertions (and Part II).
And of course they are covered in-depth in both Mastering Regular Expressions and the Regular Expressions Cookbook.

Related

How can I simulate a negative lookup in a regular expression

I have the following regular expression that includes a negative lookahead. Unfortunately the tool that I'm using does not seem to support it. So I'm wondering if it's possible to achieve negative lookahead behaviour without actually using one.
Here is my regular expression:
(?<![ABCDEQ]|\[|\]|\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
Here it is working with sample data on Regex101.com:
see expression on regex101.com
I'm using a tool called Alteryx. The documentation indicates that it uses Perl, however, for whatever reason the look ahead does not work.
Alteryx appears to use the Boost library for its regex support, and the Boost documentation says lookbehind expressions must have a fixed length. It's more restrictive than PHP (PCRE), which allows you to use alternation in a lookbehind, as long as each branch is fixed-length. But that's easy enough to get around: just use multiple lookbehinds:
(?<![ABCDEQ])(?<!\[)(?<!\])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
That regex works for me in a Boost-powered regex tester, where yours doesn't. I would compress it a little more by putting square brackets inside the character set:
(?<![][ABCDEQ])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)
The right bracket is treated as a literal when it's the first character listed, and the left bracket is never special (though some other flavors have different rules).
Here's the updated demo.
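As a rough illustration of the same workaround, here is a sketch in Python, whose re module is not Boost but has a similar restriction: every alternative in a lookbehind must have the same width. The sample strings are made up for the example and are not taken from the question.

```python
import re

# Lookbehind with alternatives of different widths (1 and 3 characters).
ORIGINAL = r'(?<![ABCDEQ]|\[|\]|\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)'
# Same conditions split into one fixed-width lookbehind each.
SPLIT = r'(?<![ABCDEQ])(?<!\[)(?<!\])(?<!\w\w\d)(\d+["+-]?)(?!BE|AQ|N)(?:.*)'


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


split_re = re.compile(SPLIT)


def first_number(text):
    """Return the text captured by group 1, or None if nothing matches."""
    m = split_re.search(text)
    return m.group(1) if m else None


print(compiles(ORIGINAL))        # False: look-behind requires fixed-width pattern
print(compiles(SPLIT))           # True
print(first_number("12+ rest"))  # 12+
print(first_number("Q5"))        # None (digit is preceded by Q)
```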

Automata - Regular Expression

I've been trying to make a regular expression from the below:
L = {01, 0011, 000111, 00001111, 0000011111, 000000111111, ...}
but I just could not figure it out. The first thing that came to my mind was
0(0)^* 1(1)^*
but I'm not sure if that is the answer to the language. Is there an app where I could test it out?
If this can't be done through a regular expression, can an NFA or DFA be done?
Could some good Samaritan kindly help me with this? Appreciate it.
A subroutine may suit your needs:
(?<!0)(0(?1)?1)(?!1)
Debuggex Demo
(?1) means recall the pattern captured in the first group, i.e. between the parens. This isn't available in all regex engines though - neither is the (negative) lookbehind (?<!...) by the way.
The difference between (?1) and \1 is that (?1) recalls the captured pattern while \1 recalls the captured data.
I'm not sure what you meant when you said that it should be a regex, because the question mentions both automata and regular expressions.
As per automata theory:
If you are talking about a regular expression for this formal language (an equal number of 0's and 1's, with all the 0's before all the 1's), there is none, because this is not a regular language. It can be proved using the pumping lemma that this language is not regular, so no NFA or DFA accepts it either.
But this language can be expressed as {0^i 1^i | i > 0}, where i belongs to the set of positive integers.
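If the goal is simply to test strings against the language, a small sketch in Python could let a regular expression check the shape (the asker's 0(0)* 1(1)*, written as 0+1+) and do the counting outside the regex, since counting is exactly the part a finite automaton cannot do. The function name in_language is made up for this example.

```python
import re


def in_language(s: str) -> bool:
    """Return True if s is n zeros followed by exactly n ones, n >= 1."""
    # Regular part: one or more 0s followed by one or more 1s.
    m = re.fullmatch(r"(0+)(1+)", s)
    # Non-regular part: both runs must have the same length.
    return m is not None and len(m.group(1)) == len(m.group(2))


print(in_language("000111"))  # True
print(in_language("0011"))    # True
print(in_language("001"))     # False (shape is fine, counts differ)
print(in_language("0101"))    # False (wrong shape)
```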

Regular expression for finding swear words unless they are football teams

In AS3, I have created a nice swear filter routine that imports a list of regular expressions for swear words and combines them into a single regular expression. However, one bit I'm having problem over are football teams, namely ARSENAL and SCUNTHORPE.
Is there a way in a regular expression to block the swearwords unless they complete the words to be the above? I tried the following with ARSENAL but it didn't work properly:
/arse[^(nal)]/gi
The problem is that I cannot parenthesise the letters "nal" because it sees the parentheses as characters rather than a block. It appears to expect at least one extra character after "arse" in order to work. Can I make it so that it will allow one but not the other? How can I group letters together and say "not"?
EDIT: I found elsewhere on Stack some talk of "negative lookahead"s but didn't quite get how I could do that for these two use cases... Any ideas?
Just use the word anchor \b: \bswearwordhere\b.
Of course, you'd have to do with whatever s---ty workaround those ba**ar*s will invent to circumvent your f-*"-ng rules, heh.
I don't know about ActionScript specifically, but in most regex engines you can use
negative lookahead: (?!...)
negative lookbehind: (?<!...)
So for Arsenal:
/arse(?!nal)/gi
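And for Scunthorpe, with HAPPY standing in for the swear word (sHAPPYhorp):
/HAPPY(?!(?<=sHAPPY)horp)/gi
The lookbehind is nested inside the lookahead so that the word is let through only when it is both preceded by s and followed by horp. Writing them as two separate conditions, HAPPY(?<!sHAPPY)(?!horp), would also let through any text where just one of them holds, such as sHAPPY on its own.
The real Scunthorpe pattern is similar to sHAPPYhorp and is left as an assignment for the reader.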
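A quick way to try these ideas out is a sketch in Python (not AS3; lookaround syntax is the same in Python's re module). HAPPY again stands in for the swear word, and the second pattern nests the lookbehind inside the lookahead so that the word is spared only when s precedes it and horp follows it. The function name is_flagged is made up for this example.

```python
import re

# "arse" is flagged unless it is immediately followed by "nal".
ARSE = re.compile(r"arse(?!nal)", re.IGNORECASE)
# "HAPPY" is flagged unless it sits inside "sHAPPYhorp".
HAPPY = re.compile(r"HAPPY(?!(?<=sHAPPY)horp)", re.IGNORECASE)


def is_flagged(text: str) -> bool:
    """Return True if the text contains a filtered word outside a team name."""
    return bool(ARSE.search(text) or HAPPY.search(text))


print(is_flagged("Arsenal won again"))   # False
print(is_flagged("what an arse"))        # True
print(is_flagged("sHAPPYhorpe United"))  # False
print(is_flagged("so HAPPY"))            # True
```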

The Greedy Option of Regex is really needed?

Let's say I have the following text, and I'd like to extract the text inside the [Optionx] and [/Optionx] blocks:
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with the regex greedy option, it gives me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Does anybody need behaviour like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most applications we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
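To make the example concrete, here is a small sketch in Python comparing the greedy and lazy forms of the pattern, followed by a toy run-length encoder built on the same idea. The encoder uses (.)\1* rather than (.)\1+ so that single characters are encoded too; the function name rle_encode is made up for this example.

```python
import re

text = "abbbbbc"

# Greedy: \1+ takes as many repeats as it can.
greedy = re.search(r"(.)\1+", text).group(0)
# Lazy: \1+? stops as soon as the minimum (one repeat) is satisfied.
lazy = re.search(r"(.)\1+?", text).group(0)

print(greedy)  # bbbbb
print(lazy)    # bb


def rle_encode(s: str) -> str:
    """Toy run-length encoding: each run becomes <count><character>."""
    # (.)\1* matches one character plus, greedily, all its immediate repeats.
    return re.sub(r"(.)\1*", lambda m: f"{len(m.group(0))}{m.group(1)}", s)


print(rle_encode(text))  # 1a5b1c
```

This toy encoder is ambiguous for input that itself contains digits, and . does not match newlines unless re.DOTALL is set.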
Regular expressions can potentially match multiple portions of a text.
For example, consider the expression (ab)*c+ and the string "abccababccc". Many portions of the string match the regular expression:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abccabab(c)cc
....
Some regular expression implementations are actually able to return the entire set of matches, but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match", which results in the greedy behaviour you observed.
This is typical of search and replace (a la grep), where with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
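Applied to the text from the question, the difference can be sketched in Python as follows. The pattern is an assumption (the question does not show the one actually used), and re.DOTALL is needed so that . also matches newlines.

```python
import re

TEXT = """[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]"""

# Greedy .* runs to the LAST closing tag, swallowing everything in between.
greedy = re.findall(r"\[Option\d+\](.*)\[/Option\d+\]", TEXT, re.DOTALL)
# Lazy .*? stops at the FIRST closing tag after each opening tag.
lazy = re.findall(r"\[Option\d+\](.*?)\[/Option\d+\]", TEXT, re.DOTALL)

greedy_blocks = [block.strip() for block in greedy]
lazy_blocks = [block.strip() for block in lazy]

print(len(greedy_blocks))  # 1
print(lazy_blocks)         # ['Start=1\nEnd=10', 'Start=11\nEnd=20']
```

The single greedy result is exactly the unwanted output shown in the question.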
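If you're looking for text between the Optionx blocks, instead of searching for .+, search for anything that's not "[/".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit matches anything in square brackets, then the second bit captures everything up to the next "[" or "/". Note that a character class excludes individual characters rather than the sequence "[/", so the parentheses inside it are literal and the capture also stops at any (, ), [ or / in the values. That way you don't have to care about greediness, just tell it what you don't want to see.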
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
Just use this algorithm, which you can write in your favourite language. No regex needed. In Python, for example:
    in_block = False
    with open("options.txt") as f:  # hypothetical file name
        for line in f:
            if "[/Option" in line:
                in_block = False
            if "[Option" in line:
                in_block = True
                continue
            if in_block:
                print(line.strip())
                # you can store the values of each option in this part

Regular expression listing all possibilities

Given a regular expression, how can I list all possible matches?
For example: AB[CD]1234, I want it to return a list like:
ABC1234
ABD1234
I searched the web, but couldn't find anything.
Exrex can do this:
$ python exrex.py 'AB[CD]1234'
ABC1234
ABD1234
The reason you haven't found anything is probably that this is a problem of serious complexity, given the number of combinations certain expressions allow. Some regular expressions even allow infinitely many matches:
Consider following expressions:
AB[A-Z0-9]{1,10}1234
AB.*1234
I think your best bet would be to create an algorithm yourself based on a small subset of allowed patterns. In your specific case, I would suggest to use a more naive approach than a regular expression.
For some simple regular expressions like the one you provided (AB[CD]1234), there is a limited set of matches. But for other expressions (AB[CD]*1234) the number of possible matches is not limited.
One method for locating all the possibilities is to detect where in the regular expression there are choices. For each possible choice, generate a new regular expression based on the original regular expression and the current choice. This new regular expression is now a bit simpler than the original one.
For an expression like "A[BC][DE]F", the method will proceed as follows
getAllMatches("A[BC][DE]F")
= getAllMatches("AB[DE]F") + getAllMatches("AC[DE]F")
= getAllMatches("ABDF") + getAllMatches("ABEF")
+ getAllMatches("ACDF")+ getAllMatches("ACEF")
= "ABDF" + "ABEF" + "ACDF" + "ACEF"
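The expansion above can be sketched in Python for a deliberately tiny subset of patterns: literal characters plus simple bracket sets with optional ranges. Negated sets, escapes, quantifiers, and alternation are not handled, and the helper names expand_set and get_all_matches are made up for this example.

```python
def expand_set(body: str) -> list[str]:
    """Expand the inside of a bracket set such as 'CD' or '1a-c' into characters."""
    chars = []
    i = 0
    while i < len(body):
        # A range looks like x-y: take every character from x to y inclusive.
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def get_all_matches(pattern: str) -> list[str]:
    """Replace the first bracket set by each of its choices, then recurse."""
    start = pattern.find("[")
    if start == -1:
        return [pattern]  # no choices left: the pattern is a literal string
    end = pattern.index("]", start)
    results = []
    for ch in expand_set(pattern[start + 1:end]):
        results.extend(get_all_matches(pattern[:start] + ch + pattern[end + 1:]))
    return results


print(get_all_matches("A[BC][DE]F"))  # ['ABDF', 'ABEF', 'ACDF', 'ACEF']
print(get_all_matches("AB[CD]1234"))  # ['ABC1234', 'ABD1234']
```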
It's possible to write an algorithm to do this but it will only work for regular expressions that have a finite set of possible matches. Your regexes would be limited to using:
Optional: ?
Characters: . \d \D
Sets: like [1a-c]
Negated sets: [^2-9d-z]
Alternations: |
Positive lookarounds
So your regexes could NOT use:
Repeaters: * +
Word patterns: \w \W
Negative lookarounds
Some zero-width assertions: ^ $
And there are some others (word boundaries, lazy & greedy quantifiers) I'm not sure about yet.
As for the algorithm itself, another user posted a link to this answer which describes how to create it.
Well, you could convert the regular expression into an equivalent finite state machine (this is relatively simple and can be done algorithmically) and then recursively follow every possible path through that FSM, outputting the paths followed through the machine. It's neither very hard nor computationally intensive per output (you will normally get a HUGE amount of output, however). You should, however, take care to disallow potentially infinite passes (like .*). This can be done by having a maximum allowed path length, after which the tracing is aborted.
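A minimal sketch of the path-following part in Python is shown below. It assumes the regex has already been converted into a state machine; here the machine for AB[CD]*1234 is written out by hand as a dictionary, and the names TRANSITIONS and enumerate_paths are made up for this example.

```python
# Hand-built state machine for AB[CD]*1234 (assumed, not generated from the regex).
# Each state maps an input symbol to the next state.
TRANSITIONS = {
    0: {"A": 1},
    1: {"B": 2},
    2: {"C": 2, "D": 2, "1": 3},  # the loop on C/D is the [CD]* part
    3: {"2": 4},
    4: {"3": 5},
    5: {"4": 6},
}
START = 0
ACCEPTING = {6}


def enumerate_paths(transitions, start, accepting, max_len):
    """Depth-first walk over the machine, collecting every accepted string
    of at most max_len characters. The length cap stops infinite loops."""
    results = []

    def walk(state, prefix):
        if state in accepting:
            results.append(prefix)
        if len(prefix) == max_len:
            return  # abort this path: maximum allowed length reached
        for symbol, nxt in sorted(transitions.get(state, {}).items()):
            walk(nxt, prefix + symbol)

    walk(start, "")
    return results


print(enumerate_paths(TRANSITIONS, START, ACCEPTING, 7))
# ['AB1234', 'ABC1234', 'ABD1234']
```

Raising max_len makes the output grow quickly, which matches the warning above about the amount of output.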
A regular expression is intended to do nothing more than match a pattern. That being said, a regular expression will never 'list' anything, only match. If you want a list of all possible matches, I believe you will need to generate it on your own.
Impossible.
Really.
Consider look ahead assertions. And what about .*, how will you generate all possible strings that match that regex?
It may be possible to find some code that lists all possible matches for something as simple as what you are doing. But for most regular expressions you would not even want to attempt listing all possible matches.
For example AB.*1234 would be AB followed by absolutely anything and then 1234.
I'm not entirely sure this is even possible, but if it were, it would be so cpu/time intensive for many situations that it would not be useful.
For instance, try to make a list of all matches for A.*Z
There are sites that help with building a good regular expression though:
http://www.fileformat.info/tool/regex.htm
http://www.regular-expressions.info/javascriptexample.html
http://www.regextester.com/