Regular expression - Strange behavior - regex

I'm writing a compiler. I'm just starting, so I'm creating the scanner (or lexer). Currently, I'm writing some regular definitions which will be processed by my scanner. While trying to create one of them, I ran into the following problem:
I was testing, in RegExr, the following (incredibly simple) regular expression:
r = /(a|ab)/
Where "r" is a regular definition; I mean, the regular expression is just (a|ab).
I thought the language L(r) would be (according to the book Compilers: Principles, Techniques and Tools):
L(r) = {a, ab}
Surprisingly, the tool matches {a}!
So my question is, why this behavior?

The regex a|ab matches "a" or "ab" (obviously), but some tools/languages (e.g. Java) consider the input to match only when the entire input matches the regex, while others (e.g. JavaScript) consider the input to match when some part of the input matches.
Your tool must be of the "some" variety to report only "a" as the match.
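As a rough illustration of the two matching styles, here is a minimal Python sketch. Python's re module happens to offer both: re.search looks for a match anywhere in the input, while re.fullmatch requires the whole input to match. The variable names are only for illustration.

```python
import re

pattern = re.compile(r"(a|ab)")

# "Some of the input" style: the first alternative that works is reported.
partial = pattern.search("ab")
partial_text = partial.group(0)  # 'a'

# "Entire input" style: the engine has to consume everything,
# so it backtracks into the second alternative.
full = pattern.fullmatch("ab")
full_text = full.group(0)  # 'ab'

print(partial_text, full_text)
```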

A regex engine processes the text from left to right, and in the case of an alternation (|) it will first try to match the first alternative.
If you use:
(ab|a)
It will match both ab and a's.
The point is that once a match is found, a global matcher will start the next match attempt after the end of the first match.
You can easily verify that the matched language is {a,ab}: use the regex ^c(a|ab)d and use cabd. In that case, the regex has no choice but selecting the second option.
So say the regex reads: (a|ab) and the text is ab. It will match with a, next it will start after a, so it will attempt to match with b, but fail.
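The behaviour described so far can be checked with a short Python sketch (Python's re module is a backtracking engine of the kind discussed here; the variable names are just for illustration):

```python
import re

text = "ab"

# Alternatives are tried left to right, so "a" wins before "ab" is considered.
first_a = re.findall(r"(a|ab)", text)    # ['a']

# Putting the longer alternative first lets it be tried first.
first_ab = re.findall(r"(ab|a)", text)   # ['ab']

# With surrounding context, the engine has to backtrack into the
# second alternative for the overall match to succeed.
forced = re.search(r"^c(a|ab)d", "cabd").group(1)  # 'ab'

print(first_a, first_ab, forced)
```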
Most lexer tools, however, use a different rule to determine the match. For lexer tools, the "longest match" counts: the match with the greatest number of characters wins.
Now if you enter (a|ba) as the regex, it will match ba first. Why? Because the engine also prefers the earliest match. In the text cbad, starting at index 1 (b) is seen as better than starting at index 2 (a).
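A minimal sketch of these two rules, assuming a hypothetical helper longest_match that imitates the "longest match" rule by trying each alternative separately at one position and keeping the longest hit. Real lexer generators typically do this with automata rather than by looping over patterns, so treat this only as an illustration.

```python
import re

def longest_match(alternatives, text, pos=0):
    """Return the longest match among the alternatives at position pos, or None."""
    best = None
    for alt in alternatives:
        m = re.compile(alt).match(text, pos)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

# Lexer-style rule: the longer alternative wins regardless of its order.
lexer_style = longest_match(["a", "ab"], "ab")  # 'ab'

# A backtracking engine still prefers the earliest starting position.
leftmost = re.search(r"(a|ba)", "cbad")
leftmost_text = leftmost.group(0)   # 'ba'
leftmost_start = leftmost.start()   # 1

print(lexer_style, leftmost_text, leftmost_start)
```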

As @bohemian said, some regex tools evaluate just a part of the string. If you want to match the whole string, you can use a regex like this:
/^(a|ab)$/
which will only accept a or ab.

Related

Why "ab(cd|c)*d" matches "abcdcdd" completely but "ab(c|cd)*d" does not match that? Whereas they're like each other

I tried this regex:
ab(cd|c)*d
in the regex101 and RegExr websites. It matched this text completely:
abcdcdd
Now let's swap "cd" and "c" in the regex:
ab(c|cd)*d
When I try this regex in the websites, I see this regex does not completely match the same text.
Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?
REGEX: ab(cd|c)*d
Complete text matched in 13 steps: abcdcdd
REGEX: ab(c|cd)*d
Partial text matched in 9 steps: abcdcdd
@MurrayW's answer is excellent, but I would like to add some background information.
Regex as Finite State Automata
When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:
Both
echo abcdcdd | grep --color -E 'ab(c|cd)*d'
and
echo abcdcdd | grep --color -E 'ab(cd|c)*d'
color the whole string in red.
Patterns we call "regular expressions"
True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups or to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember what they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.
At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.
So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.
Speed
While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.
Because the | character performs an or operation by testing the left-most condition first. If that matches, nothing further is tested in the or. If that fails, then the next or element is tested, and so on.
Using regex pattern ab(cd|c)*d, you can see that the cd part of (cd|c)* matches in your string and is then repeated, covering the cdcd in abcdcdd.
However, in pattern ab(c|cd)*d, the c alternative matches the first c in abcdcdd, and so cd isn't tested at all. Then the d at the end of the pattern matches the d after that first c and the pattern stops, having only matched abcd.
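The difference can be reproduced with Python's backtracking re module. The sketch below also shows one way to persuade the second pattern to match the whole string: require the match to reach the end of the input (here via re.fullmatch), which forces the engine to backtrack into the cd alternative.

```python
import re

text = "abcdcdd"

cd_first = re.search(r"ab(cd|c)*d", text).group(0)  # 'abcdcdd'
c_first = re.search(r"ab(c|cd)*d", text).group(0)   # 'abcd'

# Requiring the entire input to match makes the shorter attempt fail,
# so the engine backtracks and tries "cd" instead of "c".
c_first_full = re.fullmatch(r"ab(c|cd)*d", text)

print(cd_first, c_first, c_first_full.group(0))
```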
As previously answered in the comments, they are not the same patterns. The alternation in the first one tries to match cd first, the second one c first.
First pattern
abcdcdd
^^^^
||
||
ab(cd|c)*d
Second pattern
abcdcdd
^^____
| |
| |
ab(c|cd)*d
If you treat the d after each c as optional, you can omit the alternation altogether and make the d optional:
ab(cd?)*d
Note that this way you repeat the capturing group which will hold the value of the last iteration.
If you are not interested in the value of the group and non-capturing groups are supported, you could use ab(?:cd?)*d.
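A short Python check of both variants, including the point that a repeated capturing group only keeps the value of its last iteration:

```python
import re

text = "abcdcdd"

m = re.search(r"ab(cd?)*d", text)
whole = m.group(0)      # 'abcdcdd'
last_iter = m.group(1)  # 'cd', the value from the final repetition only

# Non-capturing variant: same overall match, no group is stored.
nc = re.search(r"ab(?:cd?)*d", text)
nc_whole = nc.group(0)   # 'abcdcdd'
nc_groups = nc.groups()  # ()

print(whole, last_iter, nc_whole, nc_groups)
```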
Regex is always a left to right proposition.
The only way a regex engine will ignore a previous alternation construct
is if it has to satisfy a term on the right side of the alternation group
that cannot be satisfied otherwise.
The regex rule is that the pattern is traversed from left to right,
but is controlled by the target string being traversed from left to right.
The symbiosis ..
Given the target string was matched like so "abcdcdd"
it's easy to assume that the regex subset of the full regex
ab
( c | cd )* # (1)
d
is clearly
ab
c*
d
where the cd term of the alternation to the right was never needed
for a successful match.
This shows that regex engines are biased from left to right.

Regex not separating n't (not)

I am trying to write a complex regex for a large corpus. However, due to the many ORs, I am not able to capture the "not" in: weren't, don't, wasn't, didn't, shouldn't, doesn't.
I would like it to match base verb and n't separately: E.g. were and n't
I have added it in the first line on: https://www.regexpal.com/?fam=106183 with the regex.
Any clue why it is not being picked up, despite being the first alternative in the expression: [a-z]{1}'\w
Edit:
The regex is long because it is part of a large corpus. My problem is that the n't is not getting separated out, even though I placed in first order of preference for OR.
Thanks in advance
Trying to parse natural language perfectly with a regular expression is never going to be "perfect". Language contains too many quirks and exceptions.
However, with that said, trying to cover all scenarios explicitly like you have done ("a 2 letter lower case word", "a 4 letter capitalised word", "a word with a multiple of 3 letters" (??!), ...) is a doomed approach.
Keep the pattern as simple as you possibly can, and only add exceptions if you really need to.
Here's a basic approach:
/n't|\b\w+(?!'t)/
This matches "n't", or 'any word, excluding the last letter if it's followed by "'t"'.
You may wish to build upon that slightly, but it solves the use case you've provided.
In order to understand why your original pattern doesn't work, let's consider a Minimal, Complete, Verifiable Example:
Cutting your pattern down to:
/[a-z]?'[a-z]{1,}|[\w-]+/
Consider how it matches the string:
"weren't"
First, the characters weren are matched by the [\w-]+ portion of the pattern.
Then, the 't characters are matched by the [a-z]?'[a-z]{1,} portion of the pattern.
Fundamentally, having the greedy [\w-]+ section in this pattern will mean it cannot work. This will always match up-to-and-including the "n" in "n't", which means the overall match fails for non-3-letter words.
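Both patterns from this answer can be tried in Python. This is only a sketch on a few sample words, not a test against the full corpus:

```python
import re

text = "weren't don't wasn't"

# The simple pattern: "n't", or a word minus its last letter when "'t" follows.
simple = re.findall(r"n't|\b\w+(?!'t)", text)
# ['were', "n't", 'do', "n't", 'was', "n't"]

# The cut-down original: the greedy [\w-]+ swallows the "n" first,
# leaving only "'t" for the other alternative.
cut_down = re.findall(r"[a-z]?'[a-z]{1,}|[\w-]+", "weren't")
# ['weren', "'t"]

print(simple)
print(cut_down)
```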

Regex last word starting at end of string

I have the following regex \b(\w+)$ that works to find the last word in a string. However the longer the string, the more steps it takes.
How can I make it start the search from the end of the line?
Answer
Brief
Using the regex you specified \b(\w+)$ you will get an increasing number of steps depending on the string's length (it will match each \b, then each \b\w, then each \b\w\w until it finds a proper match of \b\w$), but it still has to do that check on each item in the string until it's satisfied.
What you can do to get the last item of a string using regex explicitly is to flip the string and then use the ^ anchor. This will cause regex to immediately be satisfied upon the first match and stop processing.
You can search how to flip a string in multiple languages. Some examples for languages include the following:
Java
C#
PHP
Code
You can see the regex in use here
Your programming language
// flip string code goes here
Regex
^(\w+)
Your programming language
// flip regex capture code goes here
Input
This is my string
Output
Converted to the following by flipping the string in your language
gnirts ym si sihT
Regex returns the following result
gnirts
Flip the string back in your language
string
Explanation
Since the anchor ^ is used, it will check from the beginning of the string (as per usual regex behaviour). If this is satisfied it will return the match, otherwise, it will return no matches. Testing in regex101 (provided through the link in the Code section) shows that it takes exactly 6 steps to ensure that a match is made. It also takes exactly 3 steps to ensure no match is made. These values do not change with string length.
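A minimal Python sketch of the flip, match, flip-back approach. The function name is hypothetical. Note that reversing the string itself touches every character, so this mainly reduces the number of regex steps rather than the total work.

```python
import re

def last_word_by_reversal(text):
    """Reverse the text, match a leading word with ^, then reverse it back."""
    flipped = text[::-1]
    m = re.match(r"^(\w+)", flipped)
    return m.group(1)[::-1] if m else None

result = last_word_by_reversal("This is my string")  # 'string'
print(result)
```

As with \b(\w+)$, there is no match (None here) when the text does not end in a word character.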
This only works in .NET, via the RightToLeft option:
Regex rx = new Regex(Pattern, RegexOptions.RightToLeft);
Match match = rx.Match(Source);
In most regex engines, you can't.
Regex engines work by consuming input from the start of the input.
You can programmatically do it with a simple decrementing loop over the characters starting from the last character. If you need more performance, using code over regex is the only way.
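The decrementing loop could look roughly like this in Python. The function name is hypothetical, and the character test (letters, digits, underscore) is only an approximation of \w:

```python
def last_word(text):
    """Scan backwards from the end and return the trailing run of word characters."""
    end = len(text)
    start = end
    # Walk left while the previous character is a letter, digit, or underscore.
    while start > 0 and (text[start - 1].isalnum() or text[start - 1] == "_"):
        start -= 1
    return text[start:end]

result = last_word("This is my string")  # 'string'
print(result)
```

The loop only visits the characters of the last word, so its cost does not grow with the length of the rest of the string. It returns an empty string when the text does not end in a word character.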
This can be faster.
^.*\b(\w+)
• add ^.* before and capture \w+
• drop the $ if possible
Good luck!
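For reference, a quick Python check of this pattern. The greedy .* first runs to the end of the line and then backtracks until \b(\w+) can match, so the captured group is the last word. No claim is made here about how the step count compares in any particular engine.

```python
import re

m = re.match(r"^.*\b(\w+)", "This is my string")
last = m.group(1)  # 'string'

# Without the $, trailing punctuation no longer prevents a match.
with_punct = re.match(r"^.*\b(\w+)", "This is my string.").group(1)  # 'string'

print(last, with_punct)
```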

A regular expression that replaces a group with hard coded text

First of all, I'm not sure if this is something you can even do in regular expressions. If you can, I have no idea on how to search for how to do it.
Let's say I have text:
Click <a href="...">this link</a> for more information.
And a regular expression:
<a[^>]*>([^<]*)</a>
The application of the regular expression would yield this for group 1:
this link
Let's say I wanted to write the regular expression to instead return hard coded text for group 1
<a[^>]*>(${{replacement text}}[^<]*)</a>
(this is made up syntax by the way)
So that the application of the regular expression to the text would yield this for group 1:
replacement text
Is this possible?
Here's another example just to solidify my objective:
Examples of text:
serverNode1/appPortal
serverNode1/appPortal2
serverNode1/appPortal3
My regular expression
appPortal((?:?{{"1"}}\b)|(?:\d))
(using the same made up syntax)
The expected output for the first character group should be
1
2
3
(The point of the expression is to match the word break and replace it with "1", or otherwise use the digit character class to match a digit. The sub-groups are made non-capturing with the ?: so the outside group is still group 1).
What is the point of this you may ask? I am using Splunk to do field extractions, and I'd like for the field to be extracted as 1, 2, or 3, like in my above example, and I can only rely on the regular expression groups to give me the fields (as in, I don't have anywhere to put code to say if group 1 == "" then change to "1").
Basically, as regular expressions are defined, it is not possible. By definition, regular expressions match patterns in the text. To be clear, a regex engine returns matches that are always part of the original string, nothing more. Some regex extensions allow you to name a capturing group, but that does not transform the match.
The behaviour you described can easily be achieved by processing the regex match in any programming language, but it can also be achieved by combining regex substitution and parsing.
For example, s/appPortal(?!\d)/appPortal1/ will replace "appPortal" without the digit after it with "appPortal1" and then you can apply another regex to build the match you want.
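A sketch of that substitute-then-extract idea in Python, for the case where a programming language is available. The function name is hypothetical, and whether Splunk offers an equivalent substitution step is not assumed here.

```python
import re

def portal_number(path):
    """Normalise a bare 'appPortal' to 'appPortal1', then extract the digit."""
    normalized = re.sub(r"appPortal(?!\d)", "appPortal1", path)
    m = re.search(r"appPortal(\d)", normalized)
    return m.group(1) if m else None

numbers = [portal_number(p) for p in (
    "serverNode1/appPortal",
    "serverNode1/appPortal2",
    "serverNode1/appPortal3",
)]
# ['1', '2', '3']
print(numbers)
```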

REGEX: Select everything NOT equal to a certain string

This should be straightforward. I need a regular expression that selects everything that does not specifically contain a certain word.
So if I have this sentence: "There is a word in the middle of this sentence."
And the regular expression gets everything but "middle", I should select everything in that sentence but "middle".
Is there any easy way to do this?
Thanks.
It is not possible for a single regex match operation to be discontinuous.
You could use two capturing groups:
(.*)middle(.*)
Then concatenate the contents of capturing groups 1 and 2 after the match.
You may wish to enable the "dot also matches newline" option in your parser.
See for example Java's DOTALL, .NET's Singleline, Perl's s, etc.
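In Python, for instance, the two-group approach might look like this (re.DOTALL is Python's "dot also matches newline" option):

```python
import re

sentence = "There is a word in the middle of this sentence."

m = re.search(r"(.*)middle(.*)", sentence, re.DOTALL)
# Concatenate what came before and after the word.
without_middle = m.group(1) + m.group(2)
# 'There is a word in the  of this sentence.'

print(without_middle)
```

Note that the spaces on both sides of the removed word are kept, so the result contains a double space.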
Lookarounds are the way to go:
/^(.+)(?=middle)/ -- gets everything before middle, not including middle
and
/(?<=middle)(.+)$/ -- gets everything after middle, not including middle
Then you just merge the results of both
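A minimal Python sketch of the lookaround approach. Note that it uses a lookbehind, (?<=middle), for the part after the word, because a negative lookahead such as (?!middle) is already satisfied at the start of the sentence and lets .+ match everything.

```python
import re

sentence = "There is a word in the middle of this sentence."

before = re.search(r"^(.+)(?=middle)", sentence).group(1)
after = re.search(r"(?<=middle)(.+)$", sentence).group(1)
merged = before + after

# For comparison: the negative lookahead succeeds at position 0,
# so the group captures the entire sentence, "middle" included.
negative_lookahead = re.search(r"(?!middle)(.+)$", sentence).group(1)

print(repr(before), repr(after))
```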