Improve Regex with Groups - regex

I'm just getting into regex and I am trying to write a regex for a UK National Insurance number, for example ab123456c.
I've currently got this, which works:
^[jJ]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[a-zA-Z]$
but I was wondering if there is a shorter version, for example
^[jJ]{2} [ [\-\s]{0,1}[0-9]{2} ]{3} [\-\s]{0,1}[a-zA-Z]$
So repeat the [-\s]{0,1}[0-9]{2} part three times by wrapping it in some sort of group, [ * ]{3}.

If I understand you correctly, your insurance numbers are always two letters, six digits, and a final letter A, B, C or D? Wouldn't the easiest way be to try something like this:
/\w{2}\d{6}[A-D]/
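This catches the two leading characters with \w{2} (note that \w also matches digits and the underscore, so use [A-Za-z]{2} if only letters are allowed), then six digits with \d{6}, and ends with a letter from A to D via [A-D].
Or, if the spaces are important, try this: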
/\w{2}\d\d \d\d \d\d [A-D]/
This can be shortened a little further with a quantified group: (\d\d ){3} repeats the sub-pattern three times, not the text it matched, so it accepts 12 34 56 as well as 23 23 23.
If you really want to learn regex, I suggest this tutorial; it helped me a lot when I was starting out with regular expressions.

A simple search for a regex tutorial in your favorite search engine (DuckDuckGo, of course) would give you the answer faster than asking in a forum!
So what you are looking for is a non-capturing group (?:...). You can rewrite your pattern like this:
^[jJ]{2}(?:[-\s]?[0-9]{2}){3}[-\s]?[a-zA-Z]$
or like this if you use a case insensitive flag/option:
^J{2}(?:[-\s]?[0-9]{2}){3}[-\s]?[A-Z]$
Another possible approach is to remove everything that is not a letter or a digit beforehand (and optionally apply an uppercase function). Then you only need:
^J{2}[0-9]{6}[A-Z]$
As an aside, I don't understand why you start your pattern with J for the first two letters, since many other letters are allowed according to this article: https://en.wikipedia.org/wiki/National_Insurance_number
Another thing: short and efficient are two different things in computing.
For example, this pattern is also efficient and more restrictive:
^(?!N[KT]|BG|GB|[KT]N|ZZ)[ABCEGHJ-PRSTW-Z][ABCEGHJ-NPRSTW-Z][0-9][0-9][-\s]?[0-9][0-9][-\s]?[0-9][0-9][-\s]?[A-D]$
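The normalise-first approach described above could be sketched roughly as follows. This is only an illustration in Python's standard `re` module (the question does not name a language); `normalise` and `is_valid` are hypothetical helper names, and the pattern keeps the question's JJ prefix.

```python
import re

# Simplified pattern from the answer above; it keeps the question's "JJ" prefix.
NINO = re.compile(r"^J{2}[0-9]{6}[A-Z]$")

def normalise(raw):
    """Drop everything that is not an ASCII letter or digit, then uppercase."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

def is_valid(raw):
    """Hypothetical helper: normalise first, then apply the simple pattern."""
    return NINO.match(normalise(raw)) is not None

print(normalise("jj 12-34 56 c"))
print(is_valid("jj 12-34 56 c"), is_valid("JJ 12 34 5 C"))
```

Note that this is more lenient than the original pattern, because separators are stripped wherever they occur rather than only between the digit pairs.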

A shorter version:
/^j{2}(?:[-\s]?\d{2}){3}[-\s]?[a-zA-Z]$/i
Note that
you do not need to escape - inside the character class if it is at the beginning or end of the class (see Metacharacters Inside Character Classes)
you can use a \d as a shorthand character class for a digit (see Shorthand Character Classes)
the {0,1} limiting quantifier can be written as the ? quantifier (one or zero occurrences) (see Limiting Repetition)
The /i (or inline modifier version (?i) - depending on the engine) can be used to turn [jJ] to just j or J (see Specifying Modes Inside The Regular Expression)
A limiting quantifier can be applied to a whole (better non-capturing) group: (?:[-\s]?\d{2}){3} (see Limiting Repetition)
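As a quick sanity check of the points above, the following sketch compares the question's long pattern with the shortened one. It assumes Python's `re` module, and the sample strings are made up for illustration.

```python
import re

# Original pattern from the question.
LONG = re.compile(
    r"^[jJ]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[a-zA-Z]$"
)
# Shortened form: quantified non-capturing group plus the case-insensitive flag.
# In Python, \d also matches non-ASCII digits unless re.ASCII is set.
SHORT = re.compile(r"^j{2}(?:[-\s]?\d{2}){3}[-\s]?[a-zA-Z]$", re.IGNORECASE)

# Hypothetical sample inputs chosen for illustration.
samples = ["JJ123456C", "jj 12 34 56 c", "JJ-12-34-56-C",
           "JJ1234567", "AB123456C", "JJ12345C"]

def agree(text):
    """Return True when both patterns give the same verdict for text."""
    return bool(LONG.match(text)) == bool(SHORT.match(text))

results = {s: bool(SHORT.match(s)) for s in samples}
for s in samples:
    print(s, results[s], agree(s))
```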

Related

Multiple {n} quantifiers regex

Is it possible to have multiple quantifiers in a regex?
Say I have the following regex:
[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30}
I want to match any string which is either 30, 36 or 44 chars long. Is it possible to write this shorter in any way? Something like the following:
[A-Z0-9<]{30|36|44}?
Edit: Seeing the answers I assume there is not really a way in which you can write the above shorter. The best solution would be to solve it programmatically I guess. Thanks for the input.
Brief
Note that your regex performs much better than any other answers you'll get on your question, but since your question is actually about simplifying/shortening your regex, you can use this.
Your original regex (38 characters):
[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30}
Your original regex with modifications so that we can use it to test against multiline input (44 characters):
^(?:[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30})$
Code
My original regex (32 characters):
([A-Z0-9]){44}|(?1){36}|(?1){30}
My original regex with modifications so that we can use it to test against multiline input (38 characters):
^(?:([A-Z0-9]){44}|(?1){36}|(?1){30})$
Explanation
([A-Z0-9]){44}|(?1){36}|(?1){30} Match either of the following
([A-Z0-9]){44} Match any character in the set (A-Z or 0-9) exactly 44 times. This also captures a single character in the set into capture group 1. We will later reuse this group's pattern (not its captured text) through recursion with (?1), a feature of PCRE-style flavors.
(?1){36} Recurse the first subpattern exactly 36 times
(?1){30} Recurse the first subpattern exactly 30 times
Looks like you want
[A-Z0-9]{30}([A-Z0-9]{6}([A-Z0-9]{8})?)?
This isn't actually simpler, mind you.
If you only need to check the length of the string, you don't need to check that it contains only uppercase letters [A-Z] and digits [0-9], so you can drop the [A-Z0-9] part. You can then specify the alternative lengths as follows:
^(?:.{30}|.{36}|.{44})$
If you do need that check, you can use this regex without typing [A-Z0-9] three times:
^(?=[A-Z0-9]*$)(?:.{30}|.{36}|.{44})$
You have the [A-Z0-9] part only once and a generic . to check the length of string.
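The variants discussed in this thread can be compared side by side. The sketch below assumes Python's `re` module, anchors every pattern with ^ and $ (the nested-optional pattern was posted without anchors), and adds a hypothetical `programmatic` helper for the "solve it programmatically" route mentioned in the question's edit. The (?1) version is left out because Python's `re` does not support that syntax.

```python
import re

ALTERNATION = re.compile(r"^(?:[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30})$")
NESTED = re.compile(r"^[A-Z0-9]{30}([A-Z0-9]{6}([A-Z0-9]{8})?)?$")
LOOKAHEAD = re.compile(r"^(?=[A-Z0-9]*$)(?:.{30}|.{36}|.{44})$")

def verdicts(text):
    """Return one boolean per pattern, in the order defined above."""
    return tuple(p.match(text) is not None
                 for p in (ALTERNATION, NESTED, LOOKAHEAD))

def programmatic(text):
    """Hypothetical helper: one length test plus one character-set test."""
    return len(text) in (30, 36, 44) and re.fullmatch(r"[A-Z0-9]+", text) is not None

for n in (29, 30, 31, 36, 40, 44, 45):
    print(n, verdicts("A" * n), programmatic("A" * n))
```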

REGEX for search and exclude combined

Overview:
I am trying to combine two REGEX queries into one:
\d+\.\d+\.\d+\.\d+
^(?!(10\.|169\.)).*$
I wrote this as a two-part query. The first part isolates IPs in a block of text; after I copy and paste the results, the second part selects everything that does not begin with 10 or 169.
Questions:
It seems like I am over complicating this:
Can anybody see a better way to do this?
Is there a way to combine these two queries?
Sure. Just put the anchored negative look ahead at the start:
^(?!10\.|169\.)\d+\.\d+\.\d+\.\d+$
Note: Unnecessary brackets have been removed.
To match within a line, remove the anchors and use a "word boundary" \b at the start instead:
\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+
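A small sketch of the combined, unanchored pattern in use, assuming Python's `re` module; the sample text is invented for illustration.

```python
import re

# Combined pattern from the answer above: word boundary, negative lookahead,
# then four dot-separated digit groups.
IP_NOT_10_OR_169 = re.compile(r"\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+")

# Hypothetical block of text containing several addresses.
text = "hosts: 10.0.0.1, 192.168.1.20, 169.254.0.5 and 8.8.8.8"

found = IP_NOT_10_OR_169.findall(text)
print(found)
```

With this sample text, only 192.168.1.20 and 8.8.8.8 are returned.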
A quick-and-gimme-regex style answer
Basic one (whole string looks like an IP): ^\d+\.\d+\.\d+\.\d+$
Lite (four period-separated digit chunks, as a whole word): \b\d+\.\d+\.\d+\.\d+\b
Medium (excluding junk like 1.2.4.6.7.9.0): (?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 1 (not starting with 10 or 169): (?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 2 (not ending with 8 or 10): (?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)
Details for the curious
The \b is a word boundary that makes it possible to match exact "words" (entities consisting of [a-zA-Z0-9_] characters) inside a longer text. So, if we do not want to match 12.12.23.56 inside g12.12.23.56g, we use the Lite version.
The lookarounds together with the word boundary, make it possible to further restrict the matches. (?<!\d\.) - a negative lookbehind - and a (?!\.\d+) - a negative lookahead - will fail a match if the IP-resembling substring is preceded with a digit+. or followed with a .+digit. So, we do not match 12.12.34.56.78.90899-like entities with this regex. Choose Medium regex for that case.
Now, you need to restrict the matches to those that do not start with some numeric value. You need to make use of either a lookbehind or a lookahead. When choosing between a lookbehind and a lookahead solution, prefer the lookahead, because 1) it is less resource consuming, and 2) more flavors support it. Thus, to fail all matches where the first number of the IP is equal to 10 or 169, we can use a negative lookahead anchored after the leading word boundary: (?!(?:1(?:0|69))\.). The syntax is (?!...) and inside, we match either 1 followed with 0 and then a ., or 1 followed with 69 and then a . character. Note that we could write (?!10\.|169\.) but there is some redundant backtracking overhead then, as the 1 part is repeated. Best practice is to "contract" alternations so that the beginning of each branch does not repeat, which makes the alternation group more linear. So, use the Advanced 1 regex version to get those IPs.
A similar case is the Advanced 2 regex for getting some IPs that do not end with some value.
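To see what the lookarounds in the Advanced 1 pattern reject, here is a minimal sketch assuming Python's `re` module. The sample text is made up and deliberately includes the junk cases discussed above.

```python
import re

# "Advanced 1" from the list above: no digit+dot before, no dot+digit after,
# and the first number must not be 10 or 169.
ADVANCED_1 = re.compile(
    r"(?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)"
)

# Hypothetical input mixing valid candidates with excluded and junk values.
text = "10.1.2.3 172.16.0.9 1.2.4.6.7.9.0 g12.12.23.56g 169.254.1.1 100.64.0.1"

found = ADVANCED_1.findall(text)
print(found)
```

Only 172.16.0.9 and 100.64.0.1 survive: the addresses starting with 10 or 169 fail the lookahead, the seven-part sequence fails the surrounding lookarounds, and the one glued to letters fails the word boundary.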

Regex for two digits in any order

I need a regex that will match a string as long as it includes 2 or more digits.
What I have:
/(?=.*\d)(?=.*\d)/
and
/\d{2,}/
The first one will match even if there is one digit, and the second requires that there are 2 consecutive digits. I have tried to combine them in different ways to no avail.
You can do it much more simply:
/\d\D*\d/
You can use the following expression:
.*\d.*\d.*
This will match anything that has two digits in it, regardless of where the digits are.
You can also do it like this, using ranges:
.*[0-9].*[0-9].*
You may also consider using this:
\D*\d\D*\d
The \D will match anything that is not a digit character
It depends on your application's language, but this regex is the most general:
^(?=.*\d.*\d)
Not all application languages consider partial matches as "matching"; this regex will match no matter where in the input the two digits lie.
grep -E ".*[0-9].*[0-9].*" filename
You can use the following depending on the use case:
^(?=(?:\D*\d){2}).* - The restriction is implemented with a positive lookahead (anchored at the start of the string) that requires at least two digits anywhere inside the string; the regex flavor must support lookaheads.
^([^0-9]*[0-9]){2}.* - The regex matches a string that starts with two sequences of any non-digit chars followed with a digit char and then contains any text (this pattern is POSIX ERE compliant; to make it POSIX BRE compliant, use ^\([^0-9]*[0-9]\)\{2\}.*).
\d\D*\d - Use this if you simply want to make sure there is a digit, then zero or more chars other than digits, then another digit, and the method you are using allows partial matches.
The first approach is best when you already have a complex pattern and you need to add an extra constraint.
The second one is good for POSIX regex engines.
The third one is best when you implement complex if-else logic for password and other validations with separate error messages per issue.
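The difference between the three options shows up mainly in which matching call they need. The sketch below assumes Python's `re` module, where `search` allows partial matches and `fullmatch` does not; `has_two_digits` is a hypothetical helper and the sample strings are invented.

```python
import re

TWO_DIGITS_ANYWHERE = re.compile(r"\d\D*\d")        # relies on partial matching
LOOKAHEAD_FORM = re.compile(r"^(?=(?:\D*\d){2}).*")  # constraint plus any text
POSIX_STYLE = re.compile(r"^([^0-9]*[0-9]){2}.*")    # no lookahead needed

def has_two_digits(text):
    """Hypothetical helper using the partial-match pattern with search()."""
    return TWO_DIGITS_ANYWHERE.search(text) is not None

samples = ["a1b2c", "abc1", "12", "no digits", "x9y8z7"]
for s in samples:
    print(s, has_two_digits(s),
          bool(LOOKAHEAD_FORM.match(s)), bool(POSIX_STYLE.match(s)))

# With a whole-string API the short pattern is not enough on its own:
print(TWO_DIGITS_ANYWHERE.fullmatch("a1b2c"))
```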
Try this:
[0-9].{2}
This should help you.

regex negative look-ahead for exactly 3 capital letters around a char

I'm trying to write a regex that finds all the characters that have
exactly 3 capital letters on both their sides.
The following regex finds all the characters that have exactly 3 capital letters on the left side of the char, and 3 (or more) on the right:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})'
When trying to limit the right side to no more than 3 capitals using the regex:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])'
I get no results; adding (?![A-Z]) to the first regex seems to make it fail.
Can someone explain the problem and suggest a way to solve it?
Thanks.
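In your version both lookaheads are tested at the same position, so (?=[A-Z]{3}) requires the next character to be a capital letter while (?![A-Z]) forbids it, and the match can never succeed. You need to put the negative lookahead inside the positive one: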
(?<![A-Z])[A-Z]{3}.(?=[A-Z]{3}(?![A-Z]))
You can do that with the lookbehind, too:
(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))
It doesn't violate the "fixed-length lookbehind" rule because lookarounds themselves don't consume any characters.
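The behaviour of the question's pattern and of the two corrected forms can be checked with a short sketch. It assumes Python's `re` module (the thread suggests the asker uses Python), and the test strings are chosen for illustration.

```python
import re

# Pattern from the question: the two lookaheads contradict each other.
FAILING = re.compile(r"(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])")
# Negative lookahead moved inside the positive one.
FIXED = re.compile(r"(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}(?![A-Z]))")
# Same idea with the left-hand side in a lookbehind; only the char is matched.
LOOKBEHIND = re.compile(r"(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))")

for text in ["ABCdDEF", "AAAaAAAbAAA", "ABCdDEFG"]:
    print(text, FAILING.findall(text), FIXED.findall(text), LOOKBEHIND.findall(text))
```

Because the lookbehind form consumes only the middle character, it can also find neighbouring matches that share capitals, as in AAAaAAAbAAA.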
EDIT (about fixed-length lookbehind): Of all the flavors that support lookbehind, Python is the most inflexible. In most flavors (e.g. Perl, PHP, Ruby 1.9+) you could use:
(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).
...to match a character preceded by exactly three uppercase ASCII letters. The first alternative - ^[A-Z]{3} - starts looking three positions back, while the second - [^A-Z][A-Z]{3} - goes back exactly four positions. In Java, you can reduce that to:
(?<=(^|[^A-Z])[A-Z]{3}).
...because it does a little extra work at compile time to figure out that the maximum lookbehind length will be four positions. And in .NET and JGSoft, anything goes; if it's legal anywhere, it's legal in a lookbehind.
But in Python, a lookbehind subexpression has to match a single, fixed number of characters. If you've butted your head against that limitation a few times, you might not expect something like this to work:
(?<=(?<![A-Z])[A-Z]{3}).
At least I didn't. It's even more concise than the Java version; how can it work in Python? But it does work, in Python and in every other flavor that supports lookbehind.
And no, there are no similar restrictions on lookaheads, in any flavor.
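The fixed-width restriction described above can be observed directly in Python's standard `re` module. In this sketch `compiles` is a hypothetical helper, and the behaviour shown applies to the built-in `re` module only.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

VARIABLE_WIDTH = r"(?<=^[A-Z]{3}|[^A-Z][A-Z]{3})."  # alternatives of width 3 and 4
JAVA_STYLE = r"(?<=(^|[^A-Z])[A-Z]{3})."            # width 3 or 4
NESTED = r"(?<=(?<![A-Z])[A-Z]{3})."                # always width 3

for name, pat in [("variable", VARIABLE_WIDTH), ("java", JAVA_STYLE), ("nested", NESTED)]:
    print(name, compiles(pat))

# The nested form finds characters preceded by exactly three capitals.
print(re.findall(NESTED, "xABCdDEFe"))
```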
Taking out the positive lookahead worked for me.
(?<![A-Z])[A-Z]{3}(.)([A-Z]{3})(?![A-Z])
'ABCdDEF' 'ABCfDEF' 'HHHhhhHHHH' 'jjJJjjJJJ' JJJjJJJ
matches
ABCdDEF
ABCfDEF
JJJjJJJ
I'm not sure how the regexp engines should work with multiple lookahead assertions, but the one you're using may have its own opinion on that.
You could as well use a single assertion as follows:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}[^A-Z])'
The same with lookbehind:
'(?<=[^A-Z][A-Z]{3})(.)(?=[A-Z]{3}[^A-Z])'
This will have a problem matching the pattern in the beginning and in the end of the line.
I can't think of a proper solution, but there can be a dirty trick: for instance, add a space (or something else) in the beginning and the end of the whole line, then perform the matching.
$ echo 'ABCdDEF ABCfDEF HHHhhhHHHH AAAaAAAbAAA jjJJJJjJJJ JJJjJJJ' | sed 's/.*/ & /' | grep -oP '(?<=[^A-Z][A-Z]{3})(\S)(?=[A-Z]{3}[^A-Z])'
d
f
a
b
j
Note that I changed (.) to (\S) in the middle, change it back if you want the space to match.
P.S. Are you solving The Python Challenge? :)
Since the look ahead pattern is the same as the look behind pattern, you could also use the continue anchor \G:
/(?:[A-Z]{3}|\G[A-Z]*)(.)[A-Z]{3}/
A match is returned if three capitals precede a single character or where the last match left off (optionally followed by other capitals).

What's wrong with this number extracting Regex?

I have a string like the following:
<br><b>224h / 15.45 verbuchte Stunden</b>
I want to extract the numbers and have created the following Regex:
([0-9]\.?[0-9]{0,2})h\s\/\s([0-9]\.?[0-9]{0,2})
But for the preceding string this gives me the numbers 224 and 15 instead of 15.45.
What's wrong with this Regex?
Because you allow only one digit before the dot.
Try this. I used {1,2} as the quantifier before the dot; change it to suit your needs. + would probably be a better choice, since it allows one or more digits.
([0-9]\.?[0-9]{0,2})h\s\/\s([0-9]{1,2}\.?[0-9]{0,2})
A better regex could be this
([0-9]+(?:\.[0-9]{1,2})?)h\s*\/\s*([0-9]+(?:\.[0-9]{1,2})?)
Here I made the whole fractional part optional, requiring at least one and at most two digits after the dot and at least one digit before it.
The answer is given by stema.
If your regex engine supports shorthand character classes, it could be a little more compact, like this:
(\d{1,2}\.?\d{0,2})h\s/\s(\d{1,2}\.?\d{0,2})
\d is a shorthand character class for [0-9]
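For comparison, here is a minimal sketch that runs the question's pattern and the improved one against the sample string. It assumes Python's `re` module; converting the captures to `float` is an extra step added for illustration.

```python
import re

# Pattern from the question: only one digit is allowed before the optional dot.
ORIGINAL = re.compile(r"([0-9]\.?[0-9]{0,2})h\s\/\s([0-9]\.?[0-9]{0,2})")
# Improved pattern from the answer: one or more digits, optional fraction.
IMPROVED = re.compile(r"([0-9]+(?:\.[0-9]{1,2})?)h\s*\/\s*([0-9]+(?:\.[0-9]{1,2})?)")

text = "<br><b>224h / 15.45 verbuchte Stunden</b>"

before = ORIGINAL.search(text).groups()
after = IMPROVED.search(text).groups()
hours = tuple(float(g) for g in after)  # illustrative conversion to numbers

print(before)
print(after, hours)
```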