I have quite a straightforward question. Where I work I see a lot of regular expressions come by. They are used in Perl to get replace and/or get rid of some strings in text, e.g.:
$string=~s/^.+\///;
$string=~s/\.shtml//;
$string=~s/^ph//;
I understand that you cannot concatenate the first and last replacement, because you may only want to replace ph at the beginning of the string after you did the first replacement. However, I would put the first and second regex together with alternation: $string=~s/(^.+\/|\.shtml)//; Because we're processing thousands of files (+500,000) I was wondering which method is the most efficient.
Your expressions are not equivalent
This:
$string=~s/^.+\///;
$string=~s/\.shtml//;
replaces the text .shtml and everything up to and including the last slash.
This:
$string=~s/(^.+\/|\.shtml)//;
replaces either the text .shtml or everything up to and including the last slash.
This is one problem with combining regexes: a single complex regex is harder to write, harder to understand, and harder to debug than several simple ones.
It probably doesn't matter which is faster
Even if your expressions were equivalent, using one or the other probably wouldn't have a significant impact on your program's speed. In-memory operations like s/// are significantly faster than file I/O, and you've indicated that you're doing a lot of file I/O.
You should profile your application with something like Devel::NYTProf to see if these particular substitutions are actually a bottleneck (I doubt they are). Don't waste your time optimizing things that are already fast.
Alternations hinder the optimizer
Keep in mind that you're comparing apples and oranges, but if you're still curious about performance, you can see how perl evaluates a particular regex using the re pragma:
$ perl -Mre=debug -e'$_ = "foobar"; s/^.+\///; s/\.shtml//;'
...
Guessing start of match in sv for REx "^.+/" against "foobar"
Did not find floating substr "/"...
Match rejected by optimizer
Guessing start of match in sv for REx "\.shtml" against "foobar"
Did not find anchored substr ".shtml"...
Match rejected by optimizer
Freeing REx: "^.+/"
Freeing REx: "\.shtml"
The regex engine has an optimizer. The optimizer searches for substrings that must appear in the target string; if these substrings can't be found, the match fails immediately, without checking the other parts of the regex.
With /^.+\//, the optimizer knows that $string must contain at least one slash in order to match; when it finds no slashes, it rejects the match immediately without invoking the full regex engine. A similar optimization occurs with /\.shtml/.
Here's what perl does with the combined regex:
$ perl -Mre=debug -e'$_ = "foobar"; s/(?:^.+\/|\.shtml)//;'
...
Matching REx "(?:^.+/|\.shtml)" against "foobar"
0 <> <foobar> | 1:BRANCH(7)
0 <> <foobar> | 2: BOL(3)
0 <> <foobar> | 3: PLUS(5)
REG_ANY can match 6 times out of 2147483647...
failed...
0 <> <foobar> | 7:BRANCH(11)
0 <> <foobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
1 <f> <oobar> | 1:BRANCH(7)
1 <f> <oobar> | 2: BOL(3)
failed...
1 <f> <oobar> | 7:BRANCH(11)
1 <f> <oobar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
2 <fo> <obar> | 1:BRANCH(7)
2 <fo> <obar> | 2: BOL(3)
failed...
2 <fo> <obar> | 7:BRANCH(11)
2 <fo> <obar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
3 <foo> <bar> | 1:BRANCH(7)
3 <foo> <bar> | 2: BOL(3)
failed...
3 <foo> <bar> | 7:BRANCH(11)
3 <foo> <bar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
4 <foob> <ar> | 1:BRANCH(7)
4 <foob> <ar> | 2: BOL(3)
failed...
4 <foob> <ar> | 7:BRANCH(11)
4 <foob> <ar> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
5 <fooba> <r> | 1:BRANCH(7)
5 <fooba> <r> | 2: BOL(3)
failed...
5 <fooba> <r> | 7:BRANCH(11)
5 <fooba> <r> | 8: EXACT <.shtml>(12)
failed...
BRANCH failed...
Match failed
Freeing REx: "(?:^.+/|\.shtml)"
Notice how much longer the output is. Because of the alternation, the optimizer doesn't kick in and the full regex engine is executed. In the worst case (no matches), each part of the alternation is tested against each character in the string. This is not very efficient.
So, alternations are slower, right? No, because...
It depends on your data
Again, we're comparing apples and oranges, but with:
$string = 'a/really_long_string';
the combined regex may actually be faster because with s/\.shtml//, the optimizer has to scan most of the string before rejecting the match, while the combined regex matches quickly.
You can benchmark this for fun, but it's essentially meaningless since you're comparing different things.
How regex alternation is implemented in Perl is fairly well explained in perldoc perlre
Matching this or that
We can match different character strings with the alternation
metacharacter '|' . To match dog or cat , we form the regex dog|cat .
As before, Perl will try to match the regex at the earliest possible
point in the string. At each character position, Perl will first try
to match the first alternative, dog . If dog doesn't match, Perl will
then try the next alternative, cat . If cat doesn't match either, then
the match fails and Perl moves to the next position in the string.
Some examples:
"cats and dogs" =~ /cat|dog|bird/; # matches "cat"
"cats and dogs" =~ /dog|cat|bird/; # matches "cat"
Even though dog is the first alternative in the second regex, cat is able to match
earlier in the string.
"cats" =~ /c|ca|cat|cats/; # matches "c"
"cats" =~ /cats|cat|ca|c/; # matches "cats"
Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the alternatives
are truncations of the others, put the longest ones first to give them
a chance to match.
"cab" =~ /a|b|c/ # matches "c"
# /a|b|c/ == /[abc]/
The last example points out
that character classes are like alternations of characters. At a given
character position, the first alternative that allows the regexp match
to succeed will be the one that matches.
So this should explain the price you pay when using alternations in regex.
When putting simple regex together, you don't pay such a price. It's well explained in another related question in SO. When directly searching for a constant string, or a set of characters as in the question, optimizations can be done and no backtracking is needed which means potentially faster code.
When defining the regex alternations, just choosing a good order (putting the most common findings first) can influence the performance. It is not the same either to choose between two options, or twenty. As always, premature optimization is the root of all evil and you should instrumentiate you code (Devel::NYTProf) if there are problems or you want improvements. But as a general rule alternations should be kept to a minimum and avoided if possible since:
They easily make the regex too big an complex. We like simple, easy to understand / debug / maintain regex.
Variability and input dependant. They could be an unexpected source of problems since they backtrack and can lead to unexpected lack of performance depending on your input. As I understand, there's no case when they will be faster.
Conceptually you are trying to match two different things, so we could argue that two different statements are more correct and clear than just one.
Hope this answer gets closer to what you were expecting.
First, measure the various options on your real data, because no amount of theory will beat an experiment (if one can be done). There are many timing modules on CPAN that will help you.
Second, if you decide to optimize the regexes, do not squish them into one giant monster by hand, try to assemble the “master” regex with code. Otherwise no-one will be able to decipher the code.
Combination is not your best option
If you have three regexes that work perfectly well, there's no benefit to combine them. Not only does rewriting them open the door for bugs, it makes it harder for both programmer and engine to read the regex.
This page suggests that instead alteration like:
while (<FILE>) {
next if (/^(?:select|update|drop|insert|alter)\b/);
...
}
You should use:
while (<FILE>) {
next if (/^select/);
next if (/^update/);
...
}
You can further optimize your regexes
You can use regex objects, which will ensure that your regex is not being recompiled in a loop:
my $query = qr/foo$bar/; #compiles here
#matches = ( );
...
while (<FILE>) {
push #matches, $_ if /$query/i;
}
You may also be able to optimize the .+. It will eat up the entire file, and then has to backtrack character by character until it finds a / so it can match. If there is only one / per file, try a negated character class like: [^/] (escaped: [^\/]). Where do you expect to find the / in your file? Knowing that will allow your regex to become faster.
The speed of replacement depends on other factors
If you have performance issues (currently, with the 3 regexes), it could be a different part of your program. While the processing speed of computers has grown exponentially, read and write speed has experienced little growth.
There may be faster engines for search and replacing text in a file
Perl uses NFA, which is slower yet more powerful than the DFA engine sed has. NFA backtracks (especially with alterations) and has a worst-case exponential run time. DFA has linear execution time. Your patterns do not need an NFA engine, so you can use your regexes in a DFA engine, like sed, very easily.
According to here sed can do search and replace at a speed of 82.1 million characters processed per second (note that this test was writing to /dev/null, so hard disk write speed wasn't really a factor).
A bit off topic maybe, but if the actual replacements are rare, relative to number of comapares (10%-20%?), you may gain some speed with using an index match first
$string=~s/\.shtml//
if index($string, ".shtml");
Second method is best in which you put first and second regex together with alternation.
Because in that method the perl do traverse once, and check both the expressions.
If you use first method in which perl have to traverse separate for both expressions.
Hence Number of loops decreased in Second method.
Related
I have a huge file aab.txt whose contents are aaa...aab.
To my great surprise
perl -ne '/a*bb/' < aab.txt
runs (match failure) faster than
perl -ne '/a*b/' < aab.txt
(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.
Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.
There is an optimization that first looks for a constant part of the string, in this case, a “floating” b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.
Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):
Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get such output with the debug option for the re pragma.
Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.
/a*bb/
is basically
/^(?s:.*?)a*bb/
Note the two *. Optimizations aside, it's quadratic. In the worst case scenario, (a string of all a), for a string of length N, it will check if the current character is an a N*(N-1)/2 times. We call this O(N2).
It's worth doing a scan of the string (O(N)) to see if it can possibly match before starting the match. It will take a little longer to match, but it will fail to match much faster. This is what Perl does.
When you run the following
perl -Mre=debug -e"'aaaaab' =~ /a*bb/"
You get information about the compilation of the pattern:
Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <bb> (6)
6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
The last line indicates it will search for bb in the input before starting to match.
You get information about the evaluation of the pattern:
Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer
Here you see that check in action.
I have a text with some lines (200+) in this format:
10684 - The jackpot ? discuss Lev 3 --- ? ---
10755 - Garbage Heap ? discuss Lev 5 --- ? ---
I hant to retrieve the first number (10684 or 10755) only if number after "Lev" is greater than 3.
I'm able to get the first number with this regex: ([0-9]+) - but without the 'level' restrictions.
How this could be made?
Thanks in advance.
(\d+) - .*?Lev (?:[4-9]|[1-9]\d+)
The first \d+ matches line number as you have done.
The next .*? is a lazy quantifier, which will not consume too many characters. And the following expression will guide it to the right place. (lazy quantifier is usually more efficient)
The second parenthesis, (?:[4-9]|[1-9]\d+), matches either single digital numbers greater than 3 or two digital numbers without leading zero.
Alright stackoverflow doesn't properly show my image. Take this link : http://regexr.com?36n5l
Example Output:
Regular expressions doesn't recognize numbers as numbers (only strings). You can do this though:
([0-9]+) - .*Lev (?:[4-9][^0-9]|[1-9][0-9]+)
Basically, we use the alternation operator (|) to accept only a single digit greater than 3 (enforced by checking that the following character is not a digit) or a multi-digit number not beginning with a zero.
In case that level number might be the end of the line, though, you might have to do this:
([0-9]+) - .*Lev (?:[4-9](?:[^0-9]|$)|[1-9][0-9]+)
(I'm assuming whatever regex engine you're using can't handle lookaround assertions. In the future, try to always include what language you're using when you're asking a regex question.)
Ah, I just read your edit that the number is always less than 10. Well, that's much easier then:
([0-9]+) - .*Lev [4-9]
A lookahead is really the best thing because it will leave just the number:
/\d+(?=.*Lev (0*[4-9]|[1-9]\d))/
A bit of Awk trickery:
awk -F '\? +discuss +Lev' '$2>3 { split($1,a,/ */); print a[1] }' file
In bash use this:
var=">3"
perl -lne '/(\d+) - .*Lev (\d+)/; print $1 if $2'"$var"
This is a good solution to be able to pass the condition by parameter.
I keep getting into situations where I end up making two regular expressions to find subtle changes (such as one script for 0-9 and another for 10-99 because of the extra number)
I usually use [0-9] to find strings with just one digit and then [0-9][0-9] to find strings with multiple digits, is there a better wildcard for this?
ex. what expression would I use to simultaneously find the strings
6:45 AM and 10:52 PM
You can specify repetition with curly braces. [0-9]{2,5} matches two to five digits. So you could use [0-9]{1,2} to match one or two.
[0-9]{1,2}:[0-9]{2} (AM|PM)
I personally prefer to use \d for digits, thus
\d{1,2}:\d{2} (AM|PM)
[0-9] 1 or 2 times followed by : followed by 2 [0-9]:
[0-9]{1,2}:[0-9]{2}\s(AM|PM)
or to be valid time:
(?:[1-9]|1[0-2]):[0-9]{2}\s(?:AM|PM)
If you are looking for a time patten, you'd do something like:
\d{1,2}:\d{1,2} (AM|PM)
Or for more specific time regex
[0-1]{0,1}[0-9]{1,2}:[0-5][0-9] (AM|PM)
Much like the other answers, except the AM/PM is not captured, which should be more efficient
\d{1,2}:\d{1,2}\s(?:AM|PM)
if I have a file containing:
1 ABC
2 123XYZ
3 6:45 AM
4 123DHD
5 ABC
6 10:52 PM
7 CDE
and run the following
$>grep -P '6:45\sAM|10:52\sPM' temp
6:45 AM
10:52 PM
$>.
should do the trick (-P is a perl regx)
EDIT:
Perhaps I misunderstood, the other answers are very good if I were looking to just find a time, but you seem to be after specific times. the others would match ANY time in HH:MM format.
overall, I believe the items you are after would be the | pipe character which is used in this case to allow alternative phrases and the {n,m} match n-m times {1,2} would match 1-2 times, etc.
It can be able to check all type of time formats :
e.g. 12:05PM, 3:19AM, 04:25PM, 23:52PM
my $time = "12:52AM";
if ($time =~ /^[01]?[0-9]\:[0-5][0-9](AM|PM)/) {
print "Right Time Dude...";
}
else { print "Wrong time Dude"; }
This is the regex you want.
/^[01]?[0-9]\:[0-5][0-9](AM|PM)/
Having this string as input:
Sat, 6 May 2017 02:08:08 +0000
I did this regEx to get combinations of one or two digits:
[0-9]*:[0-9]*:[0-9]*
I want to create a regex that will match any of these values
7-5
6-6 ((0-99) - (0-99))
6-4
6-3
6-2
6-1
6-0
0-6
1-6
2-6
3-6
4-6
the 6-6 example is a special case, here are some examples of values:
6-6 (23-8)
6-6 (4-25)
6-6 (56-34)
Is it possible to make one regex that can do this?
If so, is it possible to further extend that regex for the 6-6 special case such that the the difference between the two numbers within the parentheses is equal to 2 or -2?
I could easily write this with procedural code, but i'm really curious if someone can devise a regex for this.
Lastly, if it could be further extended such that the individual digits were in their own match groups I'd be amazed. An example would be for 7-5, i could have a match group that just had the value 7, and another that had the value 5. However for 6-6 (24-26) I'd like a match group that had the first six, a match group for the second 6, a match group for the 24 and a match group for the 26.
This may be impossible, but some of you can probably get this part of the way there.
Good luck, and thanks for the help.
NO. The answer is "We can't," and the reason is because you're trying to use a hammer to dig a hole.
The problem with writing one long "clever" (this word causes a knee-jerk reaction in many people who are far more anti-regex than I) regex is that, six months from now, you'll have forgotten those clever regex features that you used so heavily, and you'll have written six months worth of code related to something else, and you'll get back to your impressive regex and have to tweak one detail, and you'll say, "WTF?"
This is what (I understand) you want, in Perl:
# data is in $_
if(/7-5|6-[0-4]|[0-4]-6|6-6 \((\d{1,2})-(\d{1,2})\)/) {
if($1 and $2 and abs($1 - $2) == 2) {
# we have the right difference
}
}
Some might say that the given regex is a bit much, but I don't think it's too bad. If the \d{1,2} bit is a little too obscure you could use \d\d? (which is what I used at first, but didn't like the repetition).
You can do it like this:
7-5|6-[0-4]|[0-5]-6|6-6 \(\d\d?-\d\d?\)
Just add parens to get your match groups.
Off the top of my head (there may be some errors but the principle should be good):
\d-\d|6-6 (\d+-\d+)
And like with any regexp, you can surround what you want to extract with parentheses for match groups:
(\d)-(\d)|(6)-(6) ((\d)+-(\d+))
In the 6-6 case, the first two parentheses should get the sixes, and the second two should get the multi-digit values that come afterwards.
Here is one that will match only the numbers you want and let you get each digit by name:
p = r'(?P<a>[0-4]|6|7)-(?P<b>[0-4]|6|5) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?'
To get each digit you could use:
values = re.search(p, string).group('a', 'b', 'c', 'd')
Which will return a four element tuple with the values you are looking for (or None if no match was found).
One problem with this pattern is that it will patch the stuff in the parenthesis whether or not there was a match to '6-6'. This one will only match the final parenthesis if 6-6 is matched:
p = r'(?P<a>[0-4]|(?P<tmp_a>6)|7)-(?P<b>(?(tmp_a)(?P<tmp_b>6)|([0-4]|5)))(?(tmp_b) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?)'
I don't know of any way to look for a difference between the numbers in the parenthesis; regex only knows about strings, not numerical values . . .
(I am assuming python syntax here; the perl syntax is slightly different, though perl supports the python way of doing things.)
Ok, so I have this regex:
( |^|>)(((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6}))|((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8})))( |$|<)
It formats Dutch and Belgian phone numbers (I only want those hence the 31 and 32 as country code).
Its not much fun to decipher but as you can see it also has a lot duplicated. but now it does handles it very accurately
All the following European formatted phone numbers are accepted
0031201234567
0031223234567
0031612345678
+31(0)20-1234567
+31(0)223-234567
+31(0)6-12345678
020-1234567
0223-234567
06-12345678
0201234567
0223234567
0612345678
and the following false formatted ones are not
06-1234567 (mobile phone number in the Netherlands should have 8 numbers after 06 )
0223-1234567 (area code with home phone)
as opposed to this which is good.
020-1234567 (area code with 3 numbers has 7 numbers for the phone as opposed to a 4 number area code which can only have 6 numbers for phone number)
As you can see it's the '-' character that makes it a little difficult but I need it in there because it's a part of the formatting usually used by people, and I want to be able to parse them all.
Now is my question... do you see a way to simplify this regex (or even improve it if you see a fault in it), while keeping the same rules?
You can test it at regextester.com
(The '( |^|>)' is to check if it is at the start of a word with the possibility it being preceded by either a new line or a '>'. I search for the phone numbers in HTML pages.)
First observation: reading the regex is a nightmare. It cries out for Perl's /x mode.
Second observation: there are lots, and lots, and lots of capturing parentheses in the expression (42 if I count correctly; and 42 is, of course, "The Answer to Life, the Universe, and Everything" -- see Douglas Adams "Hitchiker's Guide to the Galaxy" if you need that explained).
Bill the Lizard notes that you use '(-)?( )?' several times. There's no obvious advantage to that compared with '-? ?' or possibly '[- ]?', unless you are really intent on capturing the actual punctuation separately (but there are so many capturing parentheses working out which '$n' items to use would be hard).
So, let's try editing a copy of your one-liner:
( |^|>)
(
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7})) |
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6})) |
((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8}))
)
( |$|<)
OK - now we can see the regular structure of your regular expression.
There's much more analysis possible from here. Yes, there can be vast improvements to the regular expression. The first, obvious, one is to extract the international prefix part, and apply that once (optionally, or require the leading zero) and then apply the national rules.
( |^|>)
(
(((\+|00)(31|32)( )?(\(0\))?)|0)
(((([0-9]{2})(-)?( )?)?)([0-9]{7})) |
(((([0-9]{3})(-)?( )?)?)([0-9]{6})) |
(((([0-9]{1})(-)?( )?)?)([0-9]{8}))
)
( |$|<)
Then we can simplify the punctuation as noted before, and remove some plausibly redundant parentheses, and improve the country code recognizer:
( |^|>)
(
(((\+|00)3[12] ?(\(0\))?)|0)
(((([0-9]{2})-? ?)?)[0-9]{7}) |
(((([0-9]{3})-? ?)?)[0-9]{6}) |
(((([0-9]{1})-? ?)?)[0-9]{8})
)
( |$|<)
We can observe that the regex does not enforce the rules on mobile phone codes (so it does not insist that '06' is followed by 8 digits, for example). It also seems to allow the 1, 2 or 3 digit 'exchange' code to be optional, even with an international prefix - probably not what you had in mind, and fixing that removes some more parentheses. We can remove still more parentheses after that, leading to:
( |^|>)
(
(((\+|00)3[12] ?(\(0\))?)|0) # International prefix or leading zero
([0-9]{2}-? ?[0-9]{7}) | # xx-xxxxxxx
([0-9]{3}-? ?[0-9]{6}) | # xxx-xxxxxx
([0-9]{1}-? ?[0-9]{8}) # x-xxxxxxxx
)
( |$|<)
And you can work out further optimizations from here, I'd hope.
Good Lord Almighty, what a mess! :) If you have high-level semantic or business rules (such as the ones you describe talking about European numbers, numbers in the Netherlands, etc.) you'd probably be better served breaking that single regexp test into several individual regexp tests, one for each of your high level rules.
if number =~ /...../ # Dutch mobiles
# ...
elsif number =~ /..../ # Belgian landlines
# ...
# etc.
end
It'll be quite a bit easier to read and maintain and change that way.
Split it into multiple expressions. For example (pseudo-code)...
phone_no_patterns = [
/[0-9]{13}/, # 0031201234567
/+(31|32)\(0\)\d{2}-\d{7}/ # +31(0)20-1234567
# ..etc..
]
def check_number(num):
for pattern in phone_no_patterns:
if num matches pattern:
return match.groups
Then you just loop over each pattern, checking if each one matches..
Splitting the patterns up makes its easy to fix specific numbers that are causing problems (which would be horrible with the single monolithic regex)
(31|32) looks bad. When matching 32, the regex engine will first try to match 31 (2 chars), fail, and backtrack two characters to match 31. It's more efficient to first match 3 (one character), try 1 (fail), backtrack one character and match 2.
Of course, your regex fails on 0800- numbers; they're not 10 digits.
It's not an optimization, but you use
(-)?( )?
three times in your regex. This will cause you to match on phone numbers like these
+31(0)6-12345678
+31(0)6 12345678
but will also match numbers containing a dash followed by a space, like
+31(0)6- 12345678
You can replace
(-)?( )?
with
(-| )?
to match either a dash or a space.