Alternation usage creates strange behavior - regex

I am using this regex to catch the "e"s at the end of a string.
e\b|e[!?.:;]
It works, but there is one thing I don't understand. When it encounters an input like
"space."
it matches only the "e", without the ".", even though the regex has [!?.:;], which suggests it should capture the dot too.
If I remove the e\b| at the beginning, it captures the dot too. This is no problem for me because I was already trying to capture the letter only. However, I need this behavior to be explained.

The regex engine stops searching as soon as it finds a valid match.
The order of the alternatives matters: since e\b on the left side matches first, the engine never tries the right side of the alternation.
In your case, the regex engine starts at the first character of "space.", which doesn't match. Then it moves to the second one, the "p", which still doesn't match. It keeps trying each position until it finally reaches the "e" and matches the left side of the alternation. When this happens, it doesn't proceed, since a match was found.
I highly advise you to go through this tutorial, it gives a very good explanation on that.

If you need to make sure the . is returned in the match, just swap the alternatives:
e[!?.:;]|e\b
In an NFA (backtracking) regex engine, the first alternative that matches wins. There are other aspects to consider as well, but they are out of scope here.
More details can be found here:
Why regex engine choose to match pattern ..X from .X|..X|X.?
Lazy quantifier {,}? not working as I would expect
In this case, here is what is going on: \b after e requires that the next character is not a word character (or that the string ends there). Since . is a non-word character, the condition is satisfied. Both branches are able to match at that location, so e\b, being the first alternative branch, wins over e[!?.:;].
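The ordering effect can be checked by running both alternations against the question's input. This is only a sketch: the question does not name a regex flavor, so Python's re module is an assumption here (it is a backtracking engine where the first alternative that succeeds wins).

```python
import re

text = "space."

# Original order: e\b succeeds at the "e", so the second branch is never tried.
first = re.search(r"e\b|e[!?.:;]", text).group()

# Swapped order: e[!?.:;] is tried first and consumes the dot as well.
swapped = re.search(r"e[!?.:;]|e\b", text).group()

print(repr(first))    # 'e'
print(repr(swapped))  # 'e.'
```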

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" does not occur anywhere later on the line.
.*? is a non-greedy match of all characters up to the next "END". It's needed because the negative lookahead only looks ahead and does not consume anything (it is a zero-length assertion).
Update:
I thought about it a bit more: the solution above matches up to the first "END". If this is not wanted (because you are excluding START from the content), then use the greedy version
START(?!.*START).*END
This will match up to the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks exceeding the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this by adding S to [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
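The difference between the greedy and the lazy variant shows up on the input mentioned above. The commenter uses .NET; the sketch below assumes Python's re module instead, which behaves the same way for these two patterns.

```python
import re

text = "START END END"

# Greedy: (?:(?!START).)* runs as far as it can, so the match ends at the last END.
greedy = re.findall(r"START(?:(?!START).)*END", text)

# Lazy: (?:(?!START).)*? stops at the first END it can reach.
lazy = re.findall(r"START(?:(?!START).)*?END", text)

print(greedy)  # ['START END END']
print(lazy)    # ['START END']
```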
[EDIT: I have left this post for the information on capture groups, but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
As pointed out in the comments, this would not work. I was forgetting that the ignored characters could not be dropped, and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END; as written, it fails on something such as STARTSTAEND. So the lookahead approach is clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way, if you are using it to do search and replace, you can use something like BEGIN$1FINISH as the replacement. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group. I have used (?:x), which marks a non-capturing group, for every group except the middle one. However, you could also capture the START/END tokens if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.
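The search-and-replace described above can be sketched as follows. The answer refers to Java; this version assumes Python's re module, where the replacement reference is written \1 instead of $1.

```python
import re

text = "abcSTARTdefSTARTghiENDjkl"

# Only the middle group captures; START and END are non-capturing.
pattern = re.compile(r"(?:START)((?!.*START).*)(?:END)")

inner = pattern.search(text).group(1)
replaced = pattern.sub(r"BEGIN\1FINISH", text)

print(inner)     # ghi
print(replaced)  # abcSTARTdefBEGINghiFINISHjkl
```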

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches zero or more of any character, as many as possible. (?s) and (?s:.*) can be used with regex engines that support these constructs so that . matches any character, including line breaks.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
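Both the general scheme and the shorter pattern can be tried on the question's string. This is a minimal sketch assuming Python's re module; last_occurrence is a hypothetical helper name introduced only for this illustration.

```python
import re


def last_occurrence(pattern: str, text: str):
    """Return the last match of `pattern` in `text`, or None (hypothetical helper)."""
    # pattern(?![\s\S]*pattern): a match that is not followed by any later match.
    return re.search(rf"(?:{pattern})(?![\s\S]*(?:{pattern}))", text)


s = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

general = last_occurrence(r"\s+", s)     # general scheme
specific = re.search(r"\s+(?!\S*\s)", s)  # shorter pattern for whitespace

# Both find the whitespace just before ".COM".
print(general.start(), specific.start(), s.rfind(" "))
```

The helper wraps the pattern in (?:...) so that alternations inside it keep their intended scope.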
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
As we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here.
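The two replacements above can be reproduced with a short script. The answer does not say which tool was used, so Python's re.sub is an assumption here; the patterns and the >\1< replacement are taken from the examples.

```python
import re

s = ("as.asd.sd ffindMyLastOccurrencedsfs. "
     "dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM")

# Single needle: the last run of whitespace that follows a non-whitespace char.
whitespace_only = re.sub(r".*(?=((?<=\S)\s+)).*", r">\1<", s)

# Several needles: the greedy .* backs off to the last position where
# any one of the alternatives matches.
any_needle = re.sub(
    r".*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*",
    r">\1<",
    s,
)

print(whitespace_only)  # > <
print(any_needle)       # >..<
```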
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json), or a top level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as McDonald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards; that is, we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters. In other words, the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
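For readers who want to try the look-ahead outside the browser, here is a rough Python counterpart of the JavaScript regex above. Using Python is an assumption of this sketch: re.finditer stands in for the g flag and re.IGNORECASE for the i flag.

```python
import re

s = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# \s+ is matched; (?=\.com) only asserts that ".com" follows, without consuming it.
matches = [
    (m.start(), m.group())
    for m in re.finditer(r"\s+(?=\.com)", s, re.IGNORECASE)
]

# A single match: the one space directly before ".COM".
print(matches)
```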
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex is that, even though it matches .txt, .org, .json, and .unclepetespurplebeet, it isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though only one match was returned in the end, and it was the correct match, when implemented in a program or website that's running in production, you can pretty much guarantee that the regex is not only going to fail, but it's going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

How does \w+ select whole word?

In a simple regular expression, I understand that
\w
gives a single word character; however, I do not get how adding a plus (+) like so:
\w+
selects the whole word. In my mind, the plus just means one or more of the word character, so I do not understand how it would expand out to whole words.
Similar to how [0-9]+ means one or more digits, where each digit may be different, \w+ means one or more word characters, again where each character may be different. In this normal "greedy" mode it keeps on going until it can't find any more. (You can also make it non-greedy, finding as few as possible while still allowing the regex to match, via \w+? in some regex flavors.)
If you wanted what you expect, to require the same character repeatedly, you'd need to use back-references:
(\w)\1* - Find any word character, capture it, and then find zero or more of that same character.
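The difference between \w+ and the back-reference version is easy to see on a string with repeated letters. This is a small sketch assuming Python's re module; the input string is made up for the illustration.

```python
import re

text = "aaabccdd"

# \w+ : any mix of word characters, so the whole string is one match.
whole = re.findall(r"\w+", text)

# (\w)\1* : one word character followed by repeats of that same character.
# finditer is used because findall would return only the captured group.
runs = [m.group() for m in re.finditer(r"(\w)\1*", text)]

print(whole)  # ['aaabccdd']
print(runs)   # ['aaa', 'b', 'cc', 'dd']
```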
One character at a time example
With the regex \w+ and the input string Hello World the regex will start at the beginning and say to itself, "Is the next character matched by \w? Yes, so we add it to the result and then move forward one character." Because of the + modifier, after doing this once it keeps on doing it, one step at a time, until it cannot find any more. At this point it moves on to the next part of the regular expression (if there is one) or it stops. With just \w+ this captures all of Hello but not the space or World.
A note on Backtracking
The default + modifier enables "backtracking". This is a (sometimes-expensive) feature of regex that allows you to express your desire simply while giving the best chance of succeeding. For example, if your regex was \w+l and your input string was Hello World, the regex engine would capture all of Hello, and then say "Oh dear, now I need to find an l. There isn't one after the o...maybe I went too far?" It will back up until it has captured Hell and see if there is an l next (there isn't), and then back up again to just Hel and see if there is an l next (there is). The end result will be capturing just the string Hell.
Even more interesting is the case of the regex \w+r and the input string "Hello World". In this case the engine will capture all of Hello and see if there is an r following it. Since it does not find one, it backtracks one character at a time, until it finds out that H isn't followed by an r. At this point it says "Maybe starting with the H wasn't a good idea" and goes forward in the string. Eventually it finds World, then backtracks to capture just Wo and finds that there is, finally, the r it needs. At this point it returns Wor as the match.
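The two backtracking walk-throughs above can be confirmed with a few lines. The sketch assumes Python's re module, which is a backtracking engine of the kind described.

```python
import re

text = "Hello World"

# \w+ first grabs "Hello", then gives characters back until an "l" can follow.
with_l = re.search(r"\w+l", text)

# No "r" can follow any prefix of "Hello", so the engine moves on to "World".
with_r = re.search(r"\w+r", text)

print(with_l.group(), with_l.span())  # Hell (0, 4)
print(with_r.group(), with_r.span())  # Wor (6, 9)
```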
When adding the +, it matches 1 or more of the preceding token.
It's called a greedy match, and it will match as many characters as possible before moving on to satisfy the next token.
http://regexr.com is a great tool for using regex and it also explains what the operators do.
The + is a greedy quantifier. It means that it will match as many characters as possible, even if there are "lesser" matches that would be valid.
In the string Hello world, \w+ matches Hello and world.
Appending a ? to it makes it non-greedy and it will be satisfied with the minimal valid match.
In the string Hello world, \w+? matches every letter separately.
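A short check of the greedy and non-greedy behaviour described above, assuming Python's re module:

```python
import re

text = "Hello world"

# Greedy: each match takes as many word characters as it can.
greedy = re.findall(r"\w+", text)

# Non-greedy: each match is satisfied with a single word character.
lazy = re.findall(r"\w+?", text)

print(greedy)  # ['Hello', 'world']
print(lazy)    # ['H', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```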

How can I capture all nonempty sequences of letters other than cat, dog, fish using a regular expression?

Please explain why the expression makes sense if it is complicated.
If you are actually using grep, you could use the -v option to select only the lines that don't match:
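grep -v '\(cat\|dog\|fish\|^$\)'
The pattern matches empty lines and lines containing "cat", "dog" or "fish"; with -v, grep prints only the remaining lines. The quotes keep the shell from stripping the backslashes.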
Okay, you're not using grep. According to http://www.regular-expressions.info/refadv.html , if your regex engine supports it, you want ?!:
`(?!regex)`
Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.
`t(?!s)` matches the first `t` in `streets`.
Let's explore how we can build up a pattern which excludes specific phrases.
We'll start with a simple .*, which matches any character (using the dot), zero or more times (star). This pattern will match any string, including an empty string [1].
However, since there are specific phrases we don't want to match, we can try to use a negative lookaround to stop it from matching what we don't want. A lookaround is a zero-width assertion, which means that the regex engine needs to satisfy the assertion for there to be a match, but the assertion does not consume any characters (or in other words, it doesn't advance the position in the string). In this specific case, we will use a lookahead, which tells the regex engine to look ahead of the current position to match the assertion (there are also lookbehinds, which, naturally, look behind the current position). So we'll try (?!cat|dog|fish).*.
When we try this pattern against catdogfish, though, it matches atdogfish! What's going on here? Let's take a look at what happens when the engine tries to use our pattern on catdogfish.
The engine works from left to right, starting from before the first character in our string. On its first attempt, the lookahead asserts that the next characters from that point are not cat, dog, or fish, but since they actually are cat, the engine cannot match from this point, and advances to before the second character. Here the assertion succeeds, because the characters that follow match none of the alternatives (atd does not match cat or dog and atdo does not match fish). Now that the assertion succeeds, the engine can match .*, and since by default regular expressions are greedy (which means that they will capture as much of your string as possible), the dot-star will consume the rest of the string.
You might be wondering why the lookaround isn't checked again after the first assertion succeeds. That is because the dot-star is taken as one single token, with the lookaround working on it as a whole. Let's change that so that the lookaround asserts once per repetition: (?:(?!cat|dog|fish).)*.
The (?:…) is called a non-capturing group. In general, things in regular expressions are grouped by parentheses, but these parentheses are capturing, which means that the contents are saved into a backreference (or submatch). Since we don't need a submatch here, we can use a non-capturing group, which works the same as a normal group, but without the overhead of keeping track of a backreference.
When we run our new pattern against catdogfish, we now get three matches [2]: at, og and ish! Let's take a look at what's going on this time inside the regex engine.
Again the engine starts before the first character. It enters the group that will be repeated ((?!cat|dog|fish).) and sees that the assertion fails, so moves onto the next position (a). The assertion succeeds, and the engine moves forwards to t. Again the assertion succeeds, and the engine moves forwards again. At this point, the assertion fails (because the next three characters are dog), and the engine returns at as a match, because that is the biggest string (so far, and the engine works from left to right), that matches the pattern.
Next, even though we've already got a match, the engine will continue. It will move forwards to the next character (o), and again pick up two characters that match the pattern (og). Finally, the same thing will happen for the ish at the end of the string. Once the engine hits the end of the string, there is nothing more for it to do, and it returns the three matches it picked up.
So this pattern still isn't perfect, because it will match parts of a string that contain our disallowed phrases. In order to prevent this, we need to introduce anchors into our pattern: ^(?:(?!cat|dog|fish).)*$
Anchors are also zero-width assertions, that assert that the position the engine is in must be a specific location in the string. In our case, ^ matches the beginning of the string, and $ matches the end of the string. Now when we match our pattern against catdogfish, none of those small matches can be picked up anymore, because none of them match the anchor positions.
So the final expression would be ^(?:(?!cat|dog|fish).)*$.
[1] However, the dot doesn't match newline characters by default, unless the /s (or "single line") modifier is enabled on the regex.
[2] I'm making the assumption here that the pattern is working in "global" mode, which makes the pattern match as many times as possible. Without global mode, the pattern would only return the first match, at.
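The three stages of the pattern built up above can be replayed in a few lines. The answer is flavor-neutral, so Python's re module is an assumption of this sketch. One Python-specific detail: findall also reports the empty matches found at positions where the lookahead fails, so they are filtered out to get the three matches described. The strings "bird" and "hotdog" are made-up test inputs.

```python
import re

text = "catdogfish"

# Stage 1: the lookahead is checked only once, at the start of the match.
stage1 = re.search(r"(?!cat|dog|fish).*", text).group()

# Stage 2: the lookahead is checked before every single character.
# Empty matches (where the lookahead fails immediately) are dropped.
stage2 = [m for m in re.findall(r"(?:(?!cat|dog|fish).)*", text) if m]

# Stage 3: anchors force the whole string to satisfy the pattern.
final = re.compile(r"^(?:(?!cat|dog|fish).)*$")

print(stage1)                             # atdogfish
print(stage2)                             # ['at', 'og', 'ish']
print(final.search(text) is None)         # True: contains a disallowed word
print(final.search("bird") is not None)   # True: no disallowed word
print(final.search("hotdog") is None)     # True: contains "dog"
```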
Normally it's better to leave the negation to the code "around" the regexp, such as the -v switch in grep or !~ in Perl. Is there a particular problem you're trying to solve, or is it just an exercise?