REGEX for search and exclude combined - regex

Overview:
I am trying to combine two REGEX queries into one:
\d+\.\d+\.\d+\.\d+
^(?!(10\.|169\.)).*$
I wrote this as a two-part query. The first part isolates IPs in a block of text; after I copy and paste the result, the second part selects everything that does not begin with 10 or 169.
Questions:
It seems like I am over complicating this:
Can anybody see a better way to do this?
Is there a way to combine these two queries?

Sure. Just put the anchored negative look ahead at the start:
^(?!10\.|169\.)\d+\.\d+\.\d+\.\d+$
Note: Unnecessary brackets have been removed.
To match within a line, remove the anchors and use a "word boundary" \b as the anchor instead:
\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+
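As a quick check of the combined pattern, here is a minimal Python sketch. The sample text and its addresses are made up for illustration; only the regex comes from the answer above.

```python
import re

# Hypothetical sample text; the addresses are invented for this example.
text = "hosts: 10.0.0.1, 192.168.1.5, 169.254.0.7 and 8.8.8.8"

# Word boundary, then a negative lookahead that rejects a first octet of 10 or 169.
pattern = re.compile(r"\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+")

matches = pattern.findall(text)
print(matches)  # ['192.168.1.5', '8.8.8.8']
```

Note that this pattern only checks the first octet; it does not validate that each chunk is in the 0-255 range.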

A quick-and-gimme-regex style answer
Basic one (whole string looks like an IP): ^\d+\.\d+\.\d+\.\d+$
Lite (four period-separated digit chunks, a whole word): \b\d+\.\d+\.\d+\.\d+\b
Medium (excluding junk like 1.2.4.6.7.9.0): (?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 1 (not starting with 10 or 169): (?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 2 (not ending with 8 or 10): (?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)
Details for the curious
The \b is a word boundary that makes it possible to match exact "words" (entities consisting of [a-zA-Z0-9_] characters) inside a longer text. So, if we do not want to match 12.12.23.56 inside g12.12.23.56g, we use the Lite version.
The lookarounds, together with the word boundary, make it possible to further restrict the matches. (?<!\d\.) - a negative lookbehind - and (?!\.\d+) - a negative lookahead - will fail a match if the IP-resembling substring is preceded by a digit and a dot, or followed by a dot and a digit. So, we do not match 12.12.34.56.78.90899-like entities with this regex. Choose the Medium regex for that case.
Now, you need to restrict the matches to those that do not start with some numeric value. You can use either a lookbehind or a lookahead. When choosing between the two, prefer the lookahead, because 1) it is less resource consuming, and 2) more flavors support it. Thus, to fail all matches where the first number of the IP is 10 or 169, we can use a negative lookahead placed right after the leading word boundary: (?!(?:1(?:0|69))\.). The syntax is (?!...), and inside it we match 1 followed by either 0 or 69, and then a literal dot. Note that we could write (?!10\.|169\.), but that adds some redundant backtracking overhead, because the leading 1 is repeated in both branches. Best practice is to "contract" alternations so that the branches do not start with the same characters, which makes the alternation group more linear. So, use the Advanced 1 regex to get those IPs.
A similar case is the Advanced 2 regex for getting some IPs that do not end with some value.
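To see how the Lite, Medium and Advanced 1 versions differ, here is a small Python sketch. The input string is a made-up example that mixes a normal address, a too-long dotted number, a 10.x address and an address glued between letters.

```python
import re

LITE = r"\b\d+\.\d+\.\d+\.\d+\b"
MEDIUM = r"(?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)"
ADVANCED_1 = r"(?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)"

# Hypothetical input, invented for this comparison.
text = "ok 12.12.23.56, junk 1.2.4.6.7.9.0, private 10.1.2.3, glued g12.12.23.56g"

lite = re.findall(LITE, text)
medium = re.findall(MEDIUM, text)
advanced = re.findall(ADVANCED_1, text)

print(lite)      # ['12.12.23.56', '1.2.4.6', '10.1.2.3']
print(medium)    # ['12.12.23.56', '10.1.2.3']
print(advanced)  # ['12.12.23.56']
```

Lite still picks 1.2.4.6 out of the longer dotted number; Medium drops it, and Advanced 1 additionally drops the address starting with 10.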

Related

Why are sequential regular expressions more efficient than a combined expression?

In answering a Splunk question on SO, the following sample text was given:
msg: abc.asia - [2021-08-23T00:27:08.152+0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257B%2522stringMatchFilters%2522:%255B%257B%2522key%2522:%2522BFEESCE((json_data-%253E%253E'isNotSearchable')::boolean,%2520false)%2522,%2522value%2522:%2522false%2522,%2522operator%2522:%2522EQ%2522%257D%255D,%2522multiStringMatchFilters%2522:%255B%257B%2522key%2522:%2522json_data-%253E%253E'id'%2522,%2522values%2522:%255B%25224970111%2522%255D%257D%255D,%2522containmentFilters%2522:%255B%255D,%2522nestedMultiStringMatchFilter%2522:%255B%255D,%2522nestedStringMatchFilters%2522:%255B%255D%257D&sorts=%257B%2522sortOrders%2522:%255B%257B%2522key%2522:%2522id%2522,%2522order%2522:%2522DESC%2522%257D%255D%257D&pagination=null
The person wanted to extract everything in the "filters" portion of the URL if "factType" was "COMMERCIAL"
The following all-in-one regex pulls it out neatly (presuming the URL is always in the right order, i.e. factType coming before filters):
factType=(?<facttype>\w+).+filters=(?<filters>[^\&]+)
According to regex101, it finds its expected matches with 670 steps
But if I break it up to
factType=(?<facttype>\w+)
followed by
filters=(?<filters>[^\&]+)
regex101 reports the matches being found with 26 and 16 steps, respectively
What about breaking up the regex into two makes it so much more (~15x) efficient to match?
The main problem with the regexp is the presence of .+, where . matches (nearly) anything and + is greedy by default. A greedy quantifier first consumes as many characters as it can and then backtracks until the rest of the pattern matches, while a lazy quantifier consumes one more character only when the following pattern fails to match at the current position. Quantifiers are greedy by default in most regex flavors, Java included. Fortunately, you can request a lazy quantifier with .+?. This means the engine will try to find the shortest possible match for .+ instead of the longest one. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 steps (10x better).
Note that lazy quantifiers do not always help in such a case. It is often better to make the regexp more precise (i.e. deterministic) to improve performance, since that reduces the number of backtracks caused by taking wrong paths in the non-deterministic automaton.
Still, note that regexp engines are generally not very optimized compared to manual searches (as long as you use efficient search algorithms). They are great to make the code short, flexible and maintainable. For high-performance, a basic loop in native languages is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).
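The greedy/lazy distinction described above changes which text is matched, not only how many steps are taken. A minimal Python sketch with a made-up input string (Python's re module does not report step counts, so only the captures are shown):

```python
import re

# Hypothetical input with two "y=" parameters.
text = "x=1&y=2&y=3"

# Greedy: .+ runs to the end of the string and backtracks to the LAST "y=".
greedy = re.search(r"x=.+y=(\d)", text).group(1)

# Lazy: .+? grows one character at a time and stops at the FIRST "y=".
lazy = re.search(r"x=.+?y=(\d)", text).group(1)

print(greedy, lazy)  # 3 2
```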
Here is a regex that would be inherently more efficient than .+ or .+? irrespective of the positions of those matches in input text.
factType=(?<facttype>\w+)(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&]+)
This regex may look a bit longer, but it will be more efficient because we are using a negative lookahead (?!filters=) after matching & to stop the match just before the filters query parameter.
Q. What is backtracking?
A. In simple words: If a match isn't complete, the engine will backtrack the string to try to find a whole match until it succeeds or fails. In the above example if you use .+ it matches longest possible match till the end of input then starts backtracking one position backward at a time to find the whole match of second pattern. When you use .+? it just does lazy match and moves forward one position at a time to get the full match.
This suggested approach is far more efficient than .* or .+ or .+? approaches because it avoids expensive backtracking while trying to find match of the second pattern.
RegEx Details:
factType=: Match factType=
(?<facttype>\w+): Match 1+ word characters and capture in named group facttype
(?:: Start non-capture group
&: Match a &
(?!filters=): Stop matching when we have filters= at next position
[^&\s]*: Match 0 or more of non-space non-& chars
)*: End non-capture group. Repeat this group 0 or more times
&: Match a &
filters=: Match filters=
(?<filters>[^\&]+): Match 1 or more non-& chars and capture in named group filters
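The breakdown above can be tried out in Python. This is a sketch under two assumptions: the query string is a shortened, hypothetical version of the sample, and the named groups are written as (?P<name>...) because that is the spelling Python's re module uses.

```python
import re

# Shortened, hypothetical request line modelled on the sample text.
url = ("GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC"
       "&filters=%257B%2522key%2522%257D&sorts=x&pagination=null")

pattern = re.compile(
    r"factType=(?P<facttype>\w+)"   # capture the fact type
    r"(?:&(?!filters=)[^&\s]*)*"    # skip parameters that are not "filters"
    r"&filters=(?P<filters>[^&]+)"  # capture the filters value
)

m = pattern.search(url)
facttype = m.group("facttype")
filters = m.group("filters")
print(facttype)  # COMMERCIAL
print(filters)   # %257B%2522key%2522%257D
```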
Related article on catastrophic backtracking

Regex Match 6 Letter String With Chars and Number and No positive look around

I know there are several similar answers, but I am struggling to find one that fits my use case.
I need a regex to extract IDs that are 6 characters long and have a mix of numbers and characters.
The IDs will start with one of the following chars [eEdDwWaA]
I have had some solutions that have nearly worked, but the tool I want to plug this regex into does NOT support positive look around and every answer seems to use this.
The string I need to find can be anywhere in text and will either be preceded by a whitespace or a backslash.
Example of what I would want to match is eh3geh (case insensitive)
Here is what I have so far [eEdDwWaA](?:[0-9]+[a-z]|[a-z]+[0-9],{5})[a-z0-9]*
This works for the most part but it is not consistently matching and I'm not sure why.
If you can't use a lookahead an idea is to capture using The Trick.
The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side....
[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b
.[a-z]{5} we don't want only letters (left side)
|(...) but capture what we need in group one (right side)
Here is the demo at regex101
Get the captures of group 1 on program-side (where group not null/empty).
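A minimal Python sketch of the program-side filtering. The input text is invented for illustration, and the case-insensitive flag is assumed because the question asks for case-insensitive matching.

```python
import re

pattern = re.compile(
    r"[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b",
    re.IGNORECASE,
)

# Hypothetical input: two wanted IDs, an all-letter word, a word with the
# wrong first letter, and a 7-character token.
text = r"ids: eh3geh \d12345 abcdef zzz999 a1b2c3x"

# Keep only the matches where the right-hand branch (group 1) took part.
ids = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(ids)  # ['eh3geh', 'd12345']
```

The all-letter word abcdef is still matched, but by the left-hand branch, so group 1 is empty and the program discards it.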

Why put a lookahead at the beginning, or a lookbehind at the end? [duplicate]

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.
When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:
^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$
The character class [A-Za-z0-9]{6,} could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.
For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.
Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:
[Tt]here\b(?<!"[Tt]here)
So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
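Both examples can be checked with a short Python sketch. The candidate passwords and the sample sentence are made up for illustration.

```python
import re

# Lookaheads first: each one scans the whole string from the start.
password_re = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$")
results = {p: bool(password_re.match(p))
           for p in ("abc123", "abcdef", "123456", "ab12")}
print(results)

# Lookbehind first vs. lookbehind last: same matches, different order of work.
lookbehind_first = re.compile(r'(?<!")[Tt]here\b')
lookbehind_last = re.compile(r'[Tt]here\b(?<!"[Tt]here)')

text = 'There it is. "There" he said, over there.'
spans_first = [m.span() for m in lookbehind_first.finditer(text)]
spans_last = [m.span() for m in lookbehind_last.finditer(text)]
print(spans_first, spans_last)  # both: [(0, 5), (35, 40)]
```

The quoted "There" is skipped by both patterns; only the position of the lookbehind check differs.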
It's easier to show in an example than explain, I think. Let's take this regex:
(?<=\d)(?=(.)\1)(?!p)\w(?<!q)
What this means is:
(?<=\d) - make sure what comes before the match position is a digit.
(?=(.)\1) - make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).
(?!p) - make sure what follows is not a p.
\w - match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.
(?<!q) - make sure what we matched so far doesn't end with a q.
All this will match strings like abc5ddx or 9xx but not 5d or 6qq or asd6pp or add. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.
Note also that in many implementations, lookbehinds are limited to fixed-length patterns: you can't use repetition/optionality operators like ?, *, and + in them (.NET is a notable exception). This is because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string.
A sample run of this regex on the string a3b5ddx is as follows:
Text cursor position: 0.
Try to match the first lookbehind at position -1 (since \d always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
Text cursor position: 1.
Try to match the first lookbehind at position 0. a does not match \d so fail and advance the cursor again.
Text cursor position: 2.
Try to match the first lookbehind at position 1. 3 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 2. b matches (.) and is captured. 5 does not match \1 (which is the captured b). Therefore, fail and advance the cursor.
Text cursor position: 3.
Try to match the first lookbehind at position 2. b does not match \d so fail and advance the cursor again.
Text cursor position: 4.
Try to match the first lookbehind at position 3. 5 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 4. d matches (.) and is captured. The second d does match \1 (which is the first captured d). Allow the matching to continue from where we left off.
Try to match the second lookahead. d at position 4 does not match p, and since this is a negative lookahead, that's what we want; allow the matching to continue.
Try to match \w at position 4. d matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
Text cursor position: 5.
Try to match the second lookbehind at position 4 (since q always matches 1 character). d does not match q which is what we want from a negative lookbehind.
Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is d.
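The outcome of the sample run, and the match/no-match examples listed earlier, can be confirmed with a few lines of Python:

```python
import re

pattern = re.compile(r"(?<=\d)(?=(.)\1)(?!p)\w(?<!q)")

# The traced string: the match is the first d, from position 4 to 5.
m = pattern.search("a3b5ddx")
print(m.group(), m.span())  # d (4, 5)

# Strings from the explanation that should and should not match.
should_match = [bool(pattern.search(s)) for s in ("abc5ddx", "9xx")]
should_fail = [bool(pattern.search(s)) for s in ("5d", "6qq", "asd6pp", "add")]
print(should_match, should_fail)
```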
1(?=ABC) means - look for 1, and match (but don't capture) ABC after it.
(?<=ABC)1 means - match (but don't capture) ABC before the current location, and continue to match 1.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.
When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think about it as the AND of regexes). For example, take a look at this recent answer by Daniel Brückner:
.&.(?<! & )
First, you capture an ampersand between two characters. Next, you check that they were not both spaces (\S&\S would not work here, the OP wanted to capture 1&_).
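A short Python sketch of the trailing lookbehind, using an invented input string, shows how it differs from \S&\S:

```python
import re

# Hypothetical input: no spaces, spaces on both sides, a space on one side.
text = "a&b, x & y, 1& z"

# Match any char, &, any char, then reject the match if it was exactly " & ".
with_lookbehind = re.findall(r".&.(?<! & )", text)

# The stricter alternative rejects a space on either side.
non_space_only = re.findall(r"\S&\S", text)

print(with_lookbehind)  # ['a&b', '1& ']
print(non_space_only)   # ['a&b']
```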

Refer to match big looping data

I have one regex which works fine for up to 1000 records. But when there are more than that, it throws a stack overflow error. I'm using this regex in Java code (Eclipse).
Here is my regex:
X00(X01((P00){1}(T00){1,99}){1,9999}H00)V99
The data is going to come in single line record with specific number of occurrences.
Here is an example of data
X00X01P00T00T00T00P00T00T00T00T00T00H00V99
The maximum count of each word is given in the regex, as is the count for the whole group: {1,9999}. Within one group, P00 occurs once and T00 up to 99 times, and the group itself can be repeated up to 9999 times. I hope it's clear now.
What can be done in this regex to match long data coming in the pattern above, up to 10000 records?
As you want a record to match the pattern as a whole, you should add the start-of-string (^) and end-of-string ($) anchors in your regular expression. This way you avoid that matching attempts are made at other positions in a line, which you would not want to match that way anyway.
Depending on whether you parse all records in one go, you may need to specify the multiline modifier: m.
Secondly, the large number in {1,9999} may cause difficulty for some regex parsers. There is not much you can do about that, if indeed that poses a problem, as it has to do with the size of the compiled regular expression. In that case you could try a negative look-ahead to check that you don't have more than 9999 occurrences, and when that passes, just use a + (i.e. "one or more"), but it suffers from the same problem on regex101.com (too large expression):
^X00((?!(?:.*?P){10000})X01((P00)(?!(?:T00){100})(T00)+)+H00)V99$
If the numbers 99 and 9999 were not really strict limits, then you could take out those negative look-aheads ( (?! .... ) ): that would really be a time-saver.
Explanation of the (?! ... )
The meaning of (?!(?:.*?P){10000})
(?!: start a negative look-ahead: this does not "eat" any characters, but looks ahead.
(?:: start of a non-capturing group. Similar to normal parentheses, but you cannot back-reference them.
.*?P: any characters up to a "P".
{10000}: count that many occurrences, which practically means: see if you can find 10000 "P" in the string: if so, the look-ahead succeeds, but since it is a negative look-ahead, the match will fail, which is the purpose.
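The look-ahead limit is easier to follow with small numbers. The sketch below is a scaled-down illustration in Python, not the original Java setup: the limits (at most 3 P00 groups, at most 2 T00 per group) and the sample records are assumptions chosen to keep the example short.

```python
import re

# Scaled-down limits: {4} rejects a 4th P, {3} rejects a 3rd consecutive T00.
pattern = re.compile(
    r"^X00(?!(?:.*?P){4})X01(?:P00(?!(?:T00){3})(?:T00)+)+H00V99$"
)

ok = "X00X01P00T00T00P00T00H00V99"              # 2 groups, at most 2 T00 each
too_many_t = "X00X01P00T00T00T00H00V99"         # 3 T00 in one group
too_many_p = "X00X01" + "P00T00" * 4 + "H00V99"  # 4 groups

results = [bool(pattern.match(r)) for r in (ok, too_many_t, too_many_p)]
print(results)  # [True, False, False]
```

The counted quantifiers only appear inside the look-aheads; the consuming part of the pattern uses plain + repetition.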
