About this regular expression (?<=\d)\d{4}


I use (?<=\d)\d{4} to match against 1234567890, and the result is 2345 and 6789.
Why isn't it 2345 and 7890?
In the second match, it starts from 6, and 6 is matched by (?<=\d), so I think the result should be 7890 rather than 6789.
Besides, what happens when ((?<=\d)\d{3})+ is matched against 1234567890?

Look-behinds are non-consuming, so the 5 is being "reused" in the second match (even though the first match consumed it).
If you want the second attempt to start at 6 and yield 7890, consume the leading digit but don't capture it:
\d(\d{4})
and use group 1. Or, if your regex engine supports it, use a negative look-ahead for \G, which matches at the end of the previous match:
(?!\G)(?<=\d)\d{4}
See a live demo.
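To see the difference concretely, here is a minimal sketch in Python (an assumption on my part, since the question does not name a regex flavour). It compares the original pattern with the consume-but-don't-capture variant. The \G variant is left out because Python's built-in re module does not support \G.

```python
import re

text = "1234567890"

# The look-behind only asserts that a digit precedes the match; it consumes
# nothing, so the second match can start right where the first ended (at 6).
lookbehind_matches = re.findall(r"(?<=\d)\d{4}", text)

# Consuming the leading digit forces the second capture to start at 7.
# With exactly one capture group, findall returns the contents of group 1.
consumed_matches = re.findall(r"\d(\d{4})", text)

print(lookbehind_matches)  # ['2345', '6789']
print(consumed_matches)    # ['2345', '7890']
```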

(?<=\d) is a zero-length assertion. Assertions do not consume characters in the string; they only assert whether a match is possible or not.

It matches this way because the first match finishes at 5, so the next match can start from 6. (?<=\d) is satisfied by the 5 in this case, and the match is 6789, starting with 6.
(?<=\d) doesn't belong to the match. It doesn't consume a character; it just asserts what immediately precedes the match.

(?<=\d)\d{4}
(?<=...) is a look-behind. It makes sure a digit precedes the text to be matched.
What text are we matching? \d{4}. So the meaning is: match four digits that are preceded by a digit.
In 1234567890 the first such match is 2345, as it is preceded by 1. The engine then resumes the search right after that match, at 6, and again looks for a group of four digits preceded by a digit. Since 2345 has already been matched, the next successful match is 6789, which is preceded by 5 and so satisfies the regex.
(?<=\d)\d{3} does the same thing as before, only it matches groups of 3. To get the regex you mentioned, we wrap the whole thing in a capture group, ((?<=\d)\d{3}), and ask for one or more of it: ((?<=\d)\d{3})+. A repeated capturing group only keeps its last iteration.
So the overall match is 234567890, and 890 is what group 1 returns.
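The repeated-group behaviour can be checked with a short Python sketch (Python's re module is my choice here, not something the question specifies). It separates the overall match from what the capture group holds after the last iteration.

```python
import re

m = re.search(r"((?<=\d)\d{3})+", "1234567890")

# group(0) is everything the repeated group consumed across all iterations.
whole_match = m.group(0)

# group(1) keeps only the text captured by the final iteration.
last_iteration = m.group(1)

print(whole_match)     # 234567890
print(last_iteration)  # 890
```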

Related

Match all elements with n occurrences

I want to match runs of the same character with exactly n occurrences.
For example, match letters that repeat exactly 3 times in this string: "aaaaabbbcccccccccdddee"
This should return "bbb" and "ddd".
If I spelled out what to match, like "b{3}" or "d{3}", this would be easier, but I want to match all characters.
The closest I have come up with is this regex: (.)\1{2}(?!\1)
which returns "aaa", "bbb", "ccc", "ddd".
And I can't add a negative lookbehind (?<!\1), because it is "non-fixed width".

One possibility is to use a regex that looks for a character which is not followed by itself (or beginning of line), followed by three identical characters, followed by another character which is not the same as the second three i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
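A minimal Python sketch of the lookahead-wrapped pattern follows. The helper name exact_triples is hypothetical, and Python's re module is an assumption; the pattern itself is copied from the answer. Because the whole expression is a lookahead, every match is empty and the run has to be read from group 2.

```python
import re

# The entire pattern sits inside a lookahead, so nothing is consumed and
# adjacent runs can each be found from their own starting position.
PATTERN = re.compile(r"(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))")

def exact_triples(s):
    """Return every run of exactly three identical characters in s."""
    return [m.group(2) for m in PATTERN.finditer(s)]

print(exact_triples("aaaaabbbcccccccccdddee"))  # ['bbb', 'ddd']
print(exact_triples("aaabbbcccdddeee"))         # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```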

You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
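As a rough illustration of this "match what you don't want, capture what you do" approach, here is a Python sketch (Python and the helper name triples_via_alternation are my assumptions). Matches from the first alternative leave group 2 unset, so they are filtered out afterwards.

```python
import re

# Alternative 1 swallows runs of four or more; alternative 2 captures
# a run of exactly three in group 2.
PATTERN = re.compile(r"(.)\1{3,}|((.)\3\3)")

def triples_via_alternation(s):
    """Keep only the matches where group 2 took part in the match."""
    return [m.group(2) for m in PATTERN.finditer(s) if m.group(2) is not None]

print(triples_via_alternation("aaaaabbbcccccccccdddee"))  # ['bbb', 'ddd']
```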

This gets sticky because you cannot put a back reference inside a negated character class, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2)
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.
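Here is a small Python sketch of the compact version (Python's re and the helper name exact_triples are assumptions, not part of the answer). It relies on the engine treating a backreference to an unset group as a failed match, so that (?!\1) succeeds at the start of the string; Python's re behaves that way, but other flavours may not.

```python
import re

# Either a preceding character is captured by the lookbehind, or we are at
# the start of the string and group 1 stays unset.
PATTERN = re.compile(r"(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)")

def exact_triples(s):
    """The whole match is the run itself, so group(0) is all we need."""
    return [m.group(0) for m in PATTERN.finditer(s)]

print(exact_triples("aaaaabbbcccccccccdddee"))  # ['bbb', 'ddd']
print(exact_triples("aaabbbcccdddeee"))         # ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
```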

If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. In the other branch of the alternation, match the desired 3 consecutive duplicate characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
    preg_match_all(
        '/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
        $string,
        $m
    )
    ? $m[0]
    : 'no matches'
);
Output:
array (
  0 => 'bbb',
  1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and by consuming the 4 or more consecutive duplicates, it can rule-out long substrings quickly.

How to select with regex this character?

For example, I have these four IP addresses:
10.100.0.11; wrong
10.100.1.12; good
10.100.11.4; good
10.100.44.1; wrong
The task has simple rules: the 3rd octet can't be 0, and the 4th octet can't be a lone 1.
I need to select them from IP tables on different routers, and these rules are all I know.
My solution:
^(10.100.[1-9]{1,3}.[023456789]{1,3})$
but this excludes every number containing a 1, like 10, 100, etc., so this solution is wrong.
^(10.100.[1-9]{1,3}.[1-9]{2,3})$
This solves the problem of the single 1, but creates another one.

From the rules you have given, this regex should work:
10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})
It also matches a 3rd-position number that contains a 0, as long as it is not a lone 0,
so 10.100.10.4 will match, as well as 10.100.02.4.
I don't know if that's the intended behavior, since I'm not familiar with IP addresses.
The last part \.([^1]$|\d{2,}) reads like this:
"after the 3rd dot is either
a character which is not 1 followed by the end of the line
or two or more digits"
If you want to avoid matching malformed strings containing a non-digit character, like 10.100.12.a, you should replace [^1] with [023456789] or, lazier (and therefore better ;), with [02-9].
I use https://regex101.com to debug regex. It's just awesome.
Here is your regex if you want to play with it
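To make the behaviour above easy to check, here is a Python sketch (Python is an assumption; the question does not say which tool runs the regex). LOOSE is the pattern as written in this answer and STRICT is the same pattern with [^1] replaced by [02-9]; both names are hypothetical.

```python
import re

LOOSE = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})")
STRICT = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([02-9]$|\d{2,})")

addresses = [
    "10.100.0.11",  # wrong: 3rd position is 0
    "10.100.1.12",  # good
    "10.100.11.4",  # good
    "10.100.44.1",  # wrong: 4th position is a lone 1
    "10.100.10.4",  # matches: the 3rd position contains a 0 but is not 0
    "10.100.02.4",  # matches too, via the \d{2,} branch
    "10.100.12.a",  # malformed: accepted by LOOSE, rejected by STRICT
]

loose_hits = [a for a in addresses if LOOSE.search(a)]
strict_hits = [a for a in addresses if STRICT.search(a)]

print(loose_hits)
print(strict_hits)
```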

You might use
^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$
The pattern matches
^ Start of string
10\.100\. Match 10.100. (note to escape the dot to match it literally)
[1-9]{1,3} Match 1 to 3 times a digit 1-9
\. Match a dot
(?: Non capture group
[02-9] Match a digit 0 or 2-9
| Or
\d{2,3} Match 2 or 3 digits 0-9
) Close the group
$ End of string
Regex demo
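A quick Python check of this pattern against the four sample addresses from the question (Python is an assumption here). Note that [1-9]{1,3} also rejects a 3rd position such as 10, because it contains a 0; whether that is wanted depends on how the rules are read.

```python
import re

PATTERN = re.compile(r"^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$")

samples = ["10.100.0.11", "10.100.1.12", "10.100.11.4", "10.100.44.1", "10.100.10.4"]

# Map each address to True/False depending on whether the pattern accepts it.
results = {ip: bool(PATTERN.match(ip)) for ip in samples}

for ip, ok in results.items():
    print(ip, "good" if ok else "wrong")
```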

Elastic search regex to get last 7 digits from right

I have data indexed in this format: 676767 2343423 2344444 32494444. I need a regular expression for a pattern analyzer that extracts the last 7 digits from the right. Example output: 2494444. We have tried the pattern [0-9]{7}, which is not working.

In ElasticSearch, the pattern is anchored by default. That means, you cannot rely on partial matches, you need to match the entire string and capture the last consecutive 7 digits.
Use
.*([0-9]{7})
where
.* - will match any 0+ chars other than newline (as many as possible) and then will backtrack to match...
([0-9]{7}) - 7 digits placed into Capture group 1.
The Sense plug-in returns the captured value if a capturing group is defined in the regular expression pattern, so, no additional extraction work (or group accessing work) needs to be done.
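The effect of anchoring can be imitated outside Elasticsearch. The sketch below uses Python's re.fullmatch as a stand-in for an anchored match; that is an assumption for illustration only and not how Elasticsearch itself is invoked.

```python
import re

indexed = "676767 2343423 2344444 32494444"

# An anchored [0-9]{7} must cover the whole string, so it finds nothing.
anchored_attempt = re.fullmatch(r"[0-9]{7}", indexed)

# .* greedily takes everything, then backtracks just far enough to leave
# the final seven digits for the capture group.
m = re.fullmatch(r".*([0-9]{7})", indexed)
last_seven = m.group(1)

print(anchored_attempt)  # None
print(last_seven)        # 2494444
```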

Regular Expression find space delimited numbers

I have a string that comes from user input through a messaging system, this can contain a series of 4 digit numbers, but as users are likely to type things in wrong it needs to be a little bit flexible.
Therefore I want to allow them to type in the numbers, or pepper their message with any string of characters and then just take the numbers that match the formats
=nnnn or nnnn
For this I have the Regular Expression:
(^|=|\s)\d{4}(\s|$)
Which almost works, however as it says that each group of 4 digits must start with an =, a space, or the start of the string it misses every other set of numbers
I tried this:
(^|=|\s*)\d{4}(\s|$)
But that means that any four digits followed by a space get matched - which is incorrect.
How can I match groups of numbers, but include a single space at the end of one group, and the beginning of the next, to clarify this string:
Ack 9876 3456 3467 4578 4567
Should produce the matches:
9876
3456
3467
4578
4567

Here you need to use lookarounds which won't consume any characters.
(?:^|[=\s])\K\d{4}(?=\s|$)
OR
(?:^|[=\s])(\d{4})(?=\s|$)
DEMO
Your regex (^|=|\s)\d{4}(\s|$) fails because it first matches <space>9876<space> and then looks for another space, equals sign, or start of the line. So the next match it finds is <space>3467<space>. It won't match 3456 because the space before 3456 was already consumed by the first match. In order to find such overlapping matches, you need to put the delimiter inside a lookaround. When you put the last part (\s|$) inside a lookahead, it no longer consumes the space; it just asserts that the match must be followed by a space or the end of the line.
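The difference can be reproduced with a short Python sketch (Python is an assumption; its re module does not support \K, so only the capture-group form of the fix is shown). The original pattern is run with finditer so that the whole match, delimiters included, is visible before stripping.

```python
import re

text = "Ack 9876 3456 3467 4578 4567"

# Original pattern: the trailing (\s|$) consumes the space that the next
# number would need as its leading delimiter, so every other number is lost.
original = [m.group(0).strip()
            for m in re.finditer(r"(^|=|\s)\d{4}(\s|$)", text)]

# Fixed pattern: the trailing delimiter is only asserted, not consumed.
fixed = re.findall(r"(?:^|[=\s])(\d{4})(?=\s|$)", text)

print(original)  # ['9876', '3467', '4567']
print(fixed)     # ['9876', '3456', '3467', '4578', '4567']
```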

\b\d+\b
\b asserts position at a word boundary (^\w|\w$|\W\w|\w\W). It is a 0-width anchor, much like ^ and $. It doesn't consume any characters.
Demo
or
(?:^|(?<=[=\s]))\d{4}\b
Demo
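For comparison, a Python sketch of both suggestions (Python and the extra test strings are my assumptions). \b\d+\b accepts digit runs of any length, while the second pattern insists on exactly four digits after the start of the string, an = or whitespace.

```python
import re

text = "Ack 9876 3456 3467 4578 4567"
four_digit = r"(?:^|(?<=[=\s]))\d{4}\b"

# Word boundaries alone do not restrict the length of the number.
any_length = re.findall(r"\b\d+\b", "Ack 9876 345 12")

# Lookbehind for the delimiter, word boundary after the fourth digit.
four_digits = re.findall(four_digit, text)

# '=1234' qualifies; 'x5678' has the wrong prefix; '98765' is too long.
with_equals = re.findall(four_digit, "=1234 x5678 98765")

print(any_length)   # ['9876', '345', '12']
print(four_digits)  # ['9876', '3456', '3467', '4578', '4567']
print(with_equals)  # ['1234']
```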

Why does this regex not match

I'm wondering why the following regex works for some strings and does not work for some others:
/^([0-3]+)(?!4|.*5)[0-9]+$/
1151 -> this does not match
1141 -> this does match, but why? since I can consider .* as empty and the regex becomes /^([0-3]+)(?!4|5)[0-9]+$/
I think that I'm misunderstanding the way the look-ahead works...

Let's look at how the regular expression would parse your string, step by step.
^([0-3]+)(?!4|.*5)[0-9]+$
First, some clarification. (?!4|.*5) is a negative look-ahead that checks if either 4 or .*5 follow the last consumed character. If it does, the current match fails and steps back. It could also be written as (?!(4|.*5)) if you wanted it to be slightly more clear about how exactly | affects it.
Let's start by looking at 1141
First, [0-3]+ consumes as many characters as possible, so it will consume up to and including the 11 in 1141. What's leftover is 41. The regular expression now checks to see if 4 is after the current characters, and since ?! is a negative look-ahead, the match will fail if it is found. Since 4 follows 11, the match fails and the regular expression steps backwards and tries again.
Instead of matching two 1s, it now attempts a single match and matches 1, with 141 left over. (?!4 checks to make sure 4 is not the next character, and what do you know, it's not there. There is no 5 anywhere in 141 either, so .*5 cannot match. The regex leaves the negative look-ahead since neither alternative matched, and continues on to the rest of the regular expression. 141 is matched by the final [0-9]+, and thus the entire 1141 string is matched. Remember that look-arounds do not consume characters.
Now let's look at 1151
The same thing happens as last time: 11 is consumed and we have 51 left over. Now we look at the negative look-ahead and evaluate the rest of the string against it. Obviously, 4 is nowhere in this string, so we can ignore that alternative and look at .*5.
So the look-ahead .*5 tries to match 51. If it does end up matching, just as before the match will fail and the regular expression will step back. Now if you know any regex at all, it is obvious that .*5 will match the beginning of 51 since .* can evaluate to empty.
So we step back, and now we've matched a single 1 instead of both, and we're at the negative look-ahead again.
We have currently consumed 1, still have 151 left to match, and are on the (?!4|.*5) portion of the regex. Here, 4 is obviously nonexistent in our string so it is not going to match, so let's look at .*5 again.
.*5 will match a portion of 151 since .* will consume the first 1, and the 5 will finish off by matching 5. This should also be obvious if you know regex.
So we've made a match in a negative look-ahead again, which is bad... so we step back again. We have no more digits left for [0-3]+ to give up, and since you can't match 0 digits with a +, the entire string fails to match the regular expression.
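The walkthrough can be reproduced with a few lines of Python (an assumption, since the question uses /.../ delimiters without naming the flavour). Group 1 shows how far [0-3]+ had to backtrack, and a second pattern without the .* shows why "treating .* as empty" is not equivalent.

```python
import re

with_dot_star = re.compile(r"^([0-3]+)(?!4|.*5)[0-9]+$")
without_dot_star = re.compile(r"^([0-3]+)(?!4|5)[0-9]+$")

m1141 = with_dot_star.match("1141")
m1151 = with_dot_star.match("1151")
m1151_simplified = without_dot_star.match("1151")

# [0-3]+ first takes '11', hits the 4 in the look-ahead, and backs off to '1'.
print(m1141.group(1))             # 1

# A 5 lies somewhere ahead of every possible position, so .*5 always
# succeeds inside the look-ahead and the overall match fails.
print(m1151)                      # None

# Without .*, only the very next character is inspected, so backing off
# to a single 1 is enough for the match to succeed.
print(m1151_simplified.group(1))  # 1
```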

1141 matches because the regular expression engine can backtrack from matching 11 with the [0-3]+ to just matching the first 1, leaving the remaining digits to be matched by the [0-9]+.
As the next character after the first 1 is 1 and not 4, the negative look-ahead, which only looks at the next character, does not prevent the match.
The 1151 does not match because the negative look-ahead with the added .* prevents it.
With the added .* put before the 5 the look-ahead now means 'don't match if the next character is 4, or after any number of any characters the next character is 5' (ignoring newlines).
So even if the engine backtracks to make [0-3]+ match just the first 1 of 1151, there is still a 5 ahead in the string, so a match is prevented.
Remember that look-aheads and look-behinds are zero-width.

If you want it to match 4 or 5 the best option would be
/^[0-3]+[45][0-9]+$/
but without a better explanation of what it's supposed to do it's hard to suggest anything more than that...

What regex flavour is that?
/^([0-3]+)(?!4|.*5)[0-9]+$/
Honestly the only way I would see it match 1141 and not 1151 is if the highlighted part of the regex would be evaluated as NOT 4 or .* followed by 5. If it was that case then the regex engine would fail to find a match for 1141 as it would match the 4 but would miss the 5 to make the inner match complete.
However, usually the alternation would be understood as 4 or .*5 - which is still not equivalent to 4 or 5, because the expression .* can prove quite powerful in case when the engine wants to make a match work.
What are you testing the expression in?
