Regular expression to match all 4XX numbers except 400 and 411 - regex

I have a bunch of 4XX status codes and I want to match all except 400 and 411.
What I currently have so far:
/4\d{2}(?!400|411)/g
Which according to regexer:
Matches a 4XX
Should specify a group that cannot match after the main expression (but it is here where my expression is failing).

The expression 4\d{2}(?!400|411) First matches a 4 and then 2 digits. After the matching, it asserts not 400 and 411 directly to the right.
Instead, you can match 4 and first assert not 00 or 11 directly after it.
4(?!00|11)\d{2}
Or without partial word matches using word boundaries \b:
\b4(?!00|11)\d{2}\b
See a regex demo

If this is for a RegEx embedded in a programming language, I'd recommend using string comparisons or converting the status code to a number if it isn't already to then do number comparisons.
If you're forced to use RegEx, another, arguably less readable option (that does however make use of fewer RegEx features and is thus more portable): 4(0[1-9]|[1-9][02-9])
If the second character is a zero, the third character must be a nonzero digit;
If the second character is a one, the third character must be any digit other than one
If you don't want/need a capturing group, change the RegEx to 4(?:...).

Related

Numbers between 99 and 9999999 regular expression

I am trying to generate a regular expression that will match any numbers within the range of 99 and 9999999. I have trouble understanding how generating number ranges generally works. I managed to find a range generator online that does the job for me, but I want to understand how it actually works.
My attempt to do this range is as follows:
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
This is supposed to match 99, any 3 digit number or any 4 digit number, but it does not work as expected. When tested it matches only numbers 99 and 3 digit numbers. Four digit numbers are not matched at all. If I only write the part for 4 digit numbers on its own as
[1-9][0-9][0-9][0-9]
It matches 4 digit numbers, but when I construct it as in the first example it does not work. Can someone give me some clarification how this actually works and how successfully to generate a regular expression for the range of 99 to 9999999.
Link to demo - Here
So you want to know how this works...
Regexs have no real understanding of the values of numbers in your string, it only cares how they are represented, which is why looking for numbers in a range seems more awkward than it should be. The only reason your regex engine can understand a range in a character class like [0-9] at all is because of the characters' positions in a list (a character range like [&-~] is just as valid, and equally understandable to it.)
So, to match a range like 99-9999999, ya gotta spell out what that looks like: literal "99", or three digits without a leading zero, or four digits without a leading zero, and so on.
But this is what your demo did, right? And it didn't work. Of your test string "9293" your regex only matched "929". What happened here is the regex engine is eager to return a complete match - as soon as it found one it returned it, even though a better/longer match might have occurred later.
Here's how that match happened. (I'll skip some details like grouping, as they're not super relevant here.)
Step 1.
The engine compares the first token in the regex with the first character in the string
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 2.
The engine then advances both to the next token in the regex and the next character in the string and compares them.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ❌
Failure, no match. The engine would stop and return the failure here, but you're using alternation via |, so it knows there's an alternate expression to try.
Step 3.
The engine advances to the first token of the next alternate expression in the regex, and rewinds the position in the string.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success, they match.
Step 4.
Continuing on.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Match.
Step 5.
And again.
(99|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
9293 ✅
Success. The complete expression matches. There's no need to try the remaining alternate. The match here returned is:
929
As you've probably figured out, if your input string was instead "9923" then step 2 would've matched and the engine there would've stopped and returned "99".
As you've also probably figured out, if you rearrange your alternate expressions from longest to shortest
([1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|99)
the longest would be attempted first, which would match and return your expected "9293".
Simplifying
It's still pretty wordy though, especially as you crank up the number of digits in your range. There are a couple things you can do to simplify it.
The character class [0-9] can be represented by the shorthand character class \d.
([1-9]\d\d\d|[1-9]\d\d|99)
And instead of repeating them use a quantifier in curly brackets like so:
([1-9]\d{3}|[1-9]\d{2}|99)
As it happens, quantifiers can also take the form of {min, max}, so you can combine the two similar alternates:
([1-9]\d{2,3}|99)
You might expect this to land you back returning "929" again, the engine being eager and all, but quantifiers are by default greedy so they'll try to pick up as much as they can. This lends itself well to your larger desired range:
([1-9]\d{2,6}|99)
Finishing up
What you do with it from here depends on what you need the regex to do. As it stands the parentheses are superfluous, there's no point in creating a capturing group of the entire regex itself. However a decision comes when you've got an input string like:
You will likely be eaten by 1000 grue.
If you're trying to pluck out how many grue are about to eat you, you might use
[1-9]\d{2,6}|99
which will return 1000.
However that sorta runs back into the original problem with your demo. If it's "12345678 grue", which is out of range, this'll match "1234567" which might not be what you want. You can make sure the number you've matched isn't immediately followed by (or preceded by) another digit by using negative lookarounds.
(?<!\d)([1-9]\d{2,6}|99)(?!\d)
(?<!\d) means "from this position, the prior character is not a digit" while (?!\d) means "from this position, the next character is not a digit."
The parentheses around the alternates are back as they're necessary for grouping here, otherwise the lookbehind would only be part of and apply in the first alternate expression and the lookahead would only be part of and apply in the second alternate.
On the other hand if you're trying to make sure the entire string only consists of a number in your range you'll want to instead use the anchors ^ and $ (start of string and end of string, respectively):
^([1-9]\d{2,6}|99)$
And finally you can trade the capturing group out for a non-capturing group (?:...), so:
^(?:[1-9]\d{2,6}|99)$
or
(?<!\d)(?:[1-9]\d{2,6}|99)(?!\d)
You'll still grab the number as the match, it just won't be repeated in a group capture. (Lookarounds are already non-capturing, no need to worry about those.)
First of all you need some string boundaries for you regex (anything except digit, in my example I use ^ and $ -- begging and end of line or string)
Try this one:
^([1-9][0-9]{2,6}|99)$

Putting a group within a group [123[a-u]]

I'm having a lot more difficulty than I anticipated in creating a simple regex to match any specific characters, including a range of characters from the alphabet.
I've been playing with regex101 for a while now, but every combination seems to result in no matches.
Example expression:
[\n\r\t\s\(\)-]
Preferred expression:
[[a-z][a-Z]\n\r\t\s\(\)-]
Example input:
(123) 241()-127()()() abc ((((((((
Ideally the expression will capture every character except the digits
I know I could always manually input "abcdefgh".... but there has to be an easier way. I also know there are easier ways to capture numbers only, but there are some special characters and letters which I may eventually need to include as well.
With regex you can set the regex expression to trigger on a range of characters like in your above example [a-z] that will capture any letter in the alphabet that is between a and z. To trigger on more than one character you can add a "+" to it or, if you want to limit the number of characters captured you can use {n} where n is the number of characters you want to capture. So, [a-z]+ is one or more and [a-z]{4} would match on the first four characters between a and z.
You can use partial intervals. For example, [a-j] will match all characters from a to j. So, [a-j]{2} for string a6b7cd will match only cd. Also you can use these intervals several times within same group like this: [a-j4-6]{4}. This regex will match ab44 but not ab47
Overlooked a pretty small character. The term I was looking for was "Alternative" apparently.
[\r\t\n]|[a-z] with the missing element being the | character. This will allow it to match anything from the first group, and then continue on to match the second group.
At least that's my conclusion when testing this specific example.

Do regular expression engine parse characters one by one?

It is kind of absurd to ask. Anyway, let me proceed.
While exploring the regular expressions, I come across a scenario, where the expression is
[A-Z0-9]+(\d\d\.\d+)
The input string is 123.456 and the pattern being matched is as follows,
The pattern [A-Z0-9]+ could have matched upto 135, but it is not followed by the 2 more digits (\d\d) and a literal dot character. So, engine went with having characters 23.456 in the first subgroup.
Whether the regular expression engine check for the match by parsing one character at a time ? I was in that assumption.
By looking at this, it seems not. The engine should be parsing characters as well moving the window of matching back and forth, so that it can help us matching the result.
Correct me if I am wrong.
A regex engine parse the string according to the pattern it is given.
Your pattern is [A-Z0-9]+(\d\d\.\d+). Given the 123.456 string, the [A-Z0-9]+ is first tried from the beginning of the string. 123 is grabbed first (since + is a greedy quantifier). Then the regex engine tries to match the rest of the string with (\d\d\.\d+) - and fails. Backtracking occurs because the regex engine knows that [A-Z0-9]+ can match a different (smaller) portion of the string, and thus, the 3 is dropped from the currently consumed chars, and (\d\d\.\d+) is retried to match 3.456, but there must be 2 digits before a dot. Backtracking happens again.
Thus, only 1 remains outside the capturing group 1 value.
Also, have a look at the steps generated at regex101.com (backtracking is marked with ):

Email-similar regex catastrophic backtracing

I'd like to match something which may be called the beginning of the e-mail, ie.
1 character (whichever letter from alphabet and digits)
0 or 1 dot
1 or more character
The repetition of {2nd and 3rd point} zero or more times
# character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?
It is usual for catastrophic backtracking to appear in cases of nested quantifiers when the group inside contains at least one optional subpattern, making the quantified subpattern match the same pattern as the subpattern before the outer group and the outer group is not at the end of the pattern.
Your regex causes the issue right because the (\.?\w+)* is not at the end, there is an optional \.? and the expression is reduced to \w(\w+)*#.
For example aaa.aaaaaa.a.aa.aa but now aaa..aaaa.a
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
The problem
"Catastrophic backtracing" occurs when a part of the string could match a part of the regex in many different ways, so it needs to repeatedly retry to determine whether or not the string actually matches. A simple case: The regex a+a+b to match two or more a followed by one b. If you were to run that on aaaaaaaaaaa, the problem arises: First, the first a+ matches everything, and it fails at the second a+. Then, it tries with the first a+ matching all but one a, and the second a+ matches one a (this is "backtracing"), and then it fails on the b. But regexes aren't "smart" enough to know that it could stop there - so it has to keep going in that pattern until it's tried every split of giving some as to the first and some to the second. Some regex engines will realize they're getting stuck like this, and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by an #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: The * applied to a section containing a + acts like multiple duplicates of that +-ed section.
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrace into it". After all, if you've found some amount of /w, it's not going to contain a /. and there's no need to keep rechecking that - dots are not letters or digits, and neither of those is an #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the string is only a match, nothing will change - if it's not a match, or contains non-matches, the process will take fewer steps. Using #eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
Possessive quantifiers
Another option would be a possessive quantifier - \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrace inside it". In the above case, it matches in 558 steps - but it's slightly different meaning, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, not supported by all regex engines.

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
Patterns are also build to fail faster:
I have chosen to start them with a digit \d, this way each beginning of a line or word-boundary not followed by a digit is immediately discarded. Other thing, using only one digit, I know the remaining string length.
Then I test the upper limit of the string length with a negative lookahead that test if there is one more character than the maximum length (if there are 12 characters at this position, there are 13 characters at least in the string). No need to use more descriptive that the dot meta-character here, the goal is to quickly test the length.
finally, I describe the end of string without doing something particular. That is probably the slower part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.