Trying to Extract Numeric from a text field - regex

I have a field containing free text that includes a 13- or 17-digit ID, and I need to extract that ID from the field. My current attempt is:
regexp_substr(TXT,'CTRL ACDV\\s+(\\d+)',1,1,'ie')
TXT can look like this:
SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345
SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH
SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569
I need to get:
36608732875895776
9809770899005
987878829039161097

If you can assume the digits are a minimum length, this works for your 3 examples:
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345',
'(\\d{13,})', 1,1, 'e');
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH',
'(\\d{13,})', 1,1, 'e');
SELECT regexp_substr('SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569',
'(\\d{13,})', 1,1, 'e');

You might use a capturing group together with the e parameter, which (per the docs) returns only the part of the string that matches the first sub-expression in the pattern.
Note that the last number is 18 digits long rather than 17.
\bCTRL\D+(\d{13,18})
Explanation
\bCTRL Match word boundary and CTRL
\D+ Match 1+ times not a digit
(\d{13,18}) Capture group 1, matching 13 to 18 digits
Another option is to match 13 or more digits using \d{13,}
The docs state that patterns are implicitly anchored at both ends; in that case you could use:
.*\bCTRL\D+(\d{13,18})\b.*
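If you want to sanity-check the capture-group idea outside the database, here is a minimal JavaScript sketch. The helper name extractCtrlId is made up for illustration, and it assumes the JavaScript regex flavor behaves like the SQL engine for this simple pattern (no implicit anchoring, so the unanchored form is used).

```javascript
// Hypothetical helper: return the first 13-18 digit run that follows "CTRL",
// or null when the text contains no such ID.
function extractCtrlId(text) {
  // \bCTRL   the literal word CTRL
  // \D+      one or more non-digits (covers " ", "{ACDV ", " ACDV ")
  // (\d{13,18}) the ID itself, captured as group 1
  const match = /\bCTRL\D+(\d{13,18})/.exec(text);
  return match ? match[1] : null;
}

const samples = [
  "SUPPRESSED AND FORWARDING CTRL{ACDV 36608732875895776 } {DRID 12345",
  "SUPPRESSED AND FORWARDING CTRL 9809770899005 TO FRAUD DUE TO ID TH",
  "SUPPRESSED AND FORWARDING CTRL ACDV 987878829039161097 .DRID 87569",
];
for (const s of samples) {
  console.log(extractCtrlId(s));
}
```

The short DRID numbers are ignored because they have fewer than 13 digits and do not follow CTRL.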

If the only big numbers are the IDs, then this is the shortest and fastest:
\d{13,17}
Be aware that the third ID (987878829039161097) is actually 18 digits long.
Therefore, if the minimum length is 13, you may want to use:
\d{13,}
Alternatively, if you want to delete everything except the long IDs, you can search for the regex:
([^\d]+|\d{,12})
and replace it with \n (= new line) or whatever you want (e.g. a space).
You may get better results if you do the replace in two steps. First for:
[^\d]+
(for non-digits)
and then for:
\s\d{1,12}(\s|$)
(for numbers with fewer than 13 digits)


Regex to enter a decimal number digit by digit

I have a requirement where the user can only enter a value between 0.01 and 100.00 in a textbox. I am using a regex to limit the data entered. However, I cannot type a decimal point (for example, when entering 95.83) with the regex below. Can someone help me fix it?
(^100([.]0{1,2})?)$|(^\d{1,2}([.]\d{1,2})?)$
If I copy and paste the value, it passes, but I am unable to type a decimal point.
Please advise.
Link to regex tester: https://regex101.com/r/b2BF6A/1
Link to demo: https://stackblitz.com/edit/react-9h2xsy
The regex
You can use the following regex:
^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$
How it works
^(?:...|...|...)$ this anchors the pattern to ensure it matches the entire string
^ assert position at the start of the line
(?:...|...|...) non-capture group - used to group multiple alternations
$ assert position at the end of the line
(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})? first option
(?:\d?[1-9]|[1-9]0) match either of the following
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
(?:\.\d{0,2})? optionally match the following
\. this character . literally
\d{0,2} match any digit between 0 and 2 times
0{0,2}\.(?:\d?[1-9]|[1-9]0) second option
0{0,2} match 0 between 0 and 2 times
\. match this character . literally
(?:\d?[1-9]|[1-9]0) match either of the following options
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
10{2}(?:\.0{0,2})? third option
10{2} match 100
(?:\.0{0,2})? optionally match ., followed by 0 between 0 and 2 times
How it works (in simpler terms)
With the above descriptions for each alternation, this is what they will match:
Any two-digit number other than 0 or 00, optionally followed by any two-digit decimal.
In terms of a range, it's 1.00-99.99 with:
Optional leading zero: 01.00-99.99
Optional decimal: 01-99, or 01.-99, or 01.0-01.99
Any two-digit decimal other than 0 or 00
In terms of a range, it's .01-.99 with:
Optional leading zeroes: 00.01-00.99 or 0.01-0.99
Literally 100, followed by optional decimals: 100, or 100., or 100.0, or 100.00
The code
RegExp vs /pattern/
In your code, you can use either of the following options (replacing pattern with the pattern above):
new RegExp('pattern')
/pattern/
The first option above uses a string literal. This means that you must escape the backslash characters in the string in order for the pattern to be properly read:
^(?:(?:\\d?[1-9]|[1-9]0)(?:\\.\\d{0,2})?|0{0,2}\\.(?:\\d?[1-9]|[1-9]0)|10{2}(?:\\.0{0,2})?)$
The second option above allows you to avoid this and use the regex as is.
Here's a fork of your code using the second option.
Usability Issues
Please note that you'll run into a couple of usability issues with your current method of tackling this:
The user cannot erase all the digits they've entered. So if the user enters 100, they can only erase 00 and the 1 will remain. One option for resolving this is to make the entire non-capture group (with the alternations) optional by adding a ? after it. Whilst this does solve that issue, you now need to keep two regular expression patterns: one for user input and the other for validation. Alternatively, you could just test if the input is an empty string to allow it (but not validate the form until the field is filled).
The user cannot enter a number beginning with a dot. This is because we don't allow the input of . to go through your validation steps. The same rule applies here as in the previous point. You can allow it, though, if the value is explicitly . or by adding a new alternation of |\.
Similarly to my last point, you'll run into the same issue with .0 when a user is trying to write something like .01. Again, you can run the same test here.
Similarly again, 0 is not valid input; the same applies here.
A change to the regex that covers these states (0, ., .0, 0., 0.0, 00.0 - but not .00 alternatives) is:
^(?:(?:\d?[1-9]?|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]?|[1-9]0)|10{2}(?:\.0{0,2})?)$
Better would be to create logic for these cases to match them with a separate regex:
^0{0,2}\.?0?$
Usability Fixes
With the changes above in mind, your function would become:
handleChange(e) {
  console.log(e.target.value);
  const r1 = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
  const r2 = /^0{0,2}\.?0?$/;
  if (r1.test(e.target.value)) {
    this.setState({
      [e.target.name]: e.target.value
    });
  } else if (r2.test(e.target.value)) {
    // Value is invalid, but permitted for usability purposes
    this.setState({
      [e.target.name]: e.target.value
    });
  }
}
This now allows the user to input those values, but also allows us to invalidate them if the user tries to submit it.
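To make the split between "valid" and "allowed while typing" easier to test outside React, here is a small standalone sketch. The function name classifyInput and the three labels are my own; the two patterns are copied unchanged from the handler above.

```javascript
// Final validation pattern (0.01 to 100.00, with the padding rules described above).
const VALID = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
// Intermediate states such as "", "0", ".", ".0", "0.", "0.0" and "00.0".
const TYPING = /^0{0,2}\.?0?$/;

// Hypothetical helper: decide what to do with the current textbox value.
//   "valid"    -> store it, and the form may be submitted
//   "partial"  -> store it so typing can continue, but block submission
//   "rejected" -> ignore the keystroke
function classifyInput(value) {
  if (VALID.test(value)) return "valid";
  if (TYPING.test(value)) return "partial";
  return "rejected";
}

for (const v of ["95.83", "100", "", ".", "0.0", "100.01", "abc"]) {
  console.log(JSON.stringify(v), classifyInput(v));
}
```

On submit you would only accept the "valid" state; "partial" exists purely so the user can pass through it while typing.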
Using the range 0.01 to 100.00 without padding is this (non-factored):
0\.(?:0[1-9]|[1-9]\d)|[1-9]\d?\.\d{2}|100\.00
Expanded
# 0.01 to 0.99
0 \.
(?:
0 [1-9]
| [1-9] \d
)
|
# 1.00 to 99.99
[1-9] \d? \.
\d{2}
|
# 100.00
100 \.
00
It can be turned into an optional cascade if incremental, partially typed input should be allowed.
That partial form is shown here for the regex range above:
^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
The code line with the regex as a string:
const newRegExp = new RegExp("^(?:0(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|[1-9]\\d?(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$");
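One way to convince yourself that the partial pattern really permits digit-by-digit entry is to test every prefix of a target value. The sketch below does that; the helper name everyPrefixAccepted is mine, and the pattern is the partial regex from above written as a literal.

```javascript
// The "partial" pattern from above: every state a user passes through while typing.
const PARTIAL = /^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$/;

// Hypothetical helper: true when each prefix of finalValue (including the
// empty string and the full value) is accepted by PARTIAL.
function everyPrefixAccepted(finalValue) {
  for (let len = 0; len <= finalValue.length; len++) {
    if (!PARTIAL.test(finalValue.slice(0, len))) return false;
  }
  return true;
}

console.log(everyPrefixAccepted("95.83"));   // typed as "", "9", "95", "95.", "95.8", "95.83"
console.log(everyPrefixAccepted("100.00"));
console.log(everyPrefixAccepted("100.01"));  // the last keystroke is refused
```

Because the pattern also accepts incomplete values such as 0. or an empty box, a stricter check is still needed when the form is submitted.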
_________________________
The regex 'partial' above requires the input to be blank or to start
with a digit. It also doesn't allow 1-9 with a preceding 0.
If that is all to be allowed, a simple mod is this :
^(?:0{0,2}(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|(?:[1-9]\d?|0[1-9])(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
which allows input like the following:
(It should be noted that doing this requires allowing the dot . as
a valid input but could be converted to 0. on the fly to be put
inside the input box.)
.1
00.01
09.90
01.
01.11
00.1
00
.
Stringed version :
"^(?:0{0,2}(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|(?:[1-9]\\d?|0[1-9])(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$"

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended. You could split the string and find the distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
  SELECT DISTINCT TO_NUMBER(column_value) AS n
  FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
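The same split-then-distinct idea is easy to try outside Oracle. This is only an illustrative JavaScript sketch (the function name dedupeDelimited is made up), assuming every token is a plain non-negative integer as in the sample data.

```javascript
// Hypothetical helper: split on the delimiter, keep distinct numeric values,
// sort them numerically and join them back together.
function dedupeDelimited(value, delimiter = ".") {
  const unique = new Set(value.split(delimiter).map(Number));
  return [...unique].sort((a, b) => a - b).join(delimiter);
}

console.log(dedupeDelimited("1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"));
// prints 1.2.4.5.9.11.15.16.19
```

Working on whole tokens sidesteps the partial-match problem entirely, because 11 and 15 are never compared digit by digit.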
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape the dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follows:
SELECT
  '11.15' val,
  regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
  dual;

VAL    DEDUPLICATED
11.15  115
Here is a similar approach to address those problems:
Matching pattern composition:
- Look for a non-period matching list of length 0 to N (this subexpression is referenced by \1).
'19' which matches ([^.]*)
- Look for the repeats, which form our second matching list, associated with subexpression 2 and referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string. This matching list is referenced by \3. It prevents '11.15' from being turned into '115'.
([.]|$)
Replacement string:
I replace the matched pattern with a replacement string composed of the first instance of the non-period matching list, followed by whatever \3 matched.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val FROM dual
)
SELECT
  val,
  regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
  tst;

VAL                                               DEDUPLICATE
------------------------------------------------  ---------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
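For anyone who wants to reproduce the difference without an Oracle instance, the two patterns can be compared in JavaScript. This is a sketch under the assumption that JavaScript's backreference and alternation behavior matches Oracle's for these patterns; the function names are mine, and \1\3 becomes $1$3 in JavaScript's replacement syntax.

```javascript
// The pattern from the question: the unescaped "." and the missing right-hand
// boundary let "1.1" match inside "11.15".
function naiveDedupe(value) {
  return value.replace(/([^.]+)(.[ ]*\1)+/g, "$1");
}

// The pattern from this answer: group 3 insists that the run of repeats is
// followed by a period or the end of the string.
function boundedDedupe(value) {
  return value.replace(/([^.]*)([.]\1)+([.]|$)/g, "$1$3");
}

const raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19";
console.log(naiveDedupe(raw));    // 1.2.4.5.9.115.16.19
console.log(boundedDedupe(raw));  // 1.2.4.5.9.11.15.16.19
```

Like the SQL version, this relies on the values being sorted so that duplicates are adjacent.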

Trying to create a regex that allows the following format: yyyy[: -][VW]Week number

My regex currently looks like this
\b(19|20)\d{2}\b[- :][VW][0-5]{1}(?(?=[5])[0-2]{1}|[0-9]{1})
It doesn't quite do what I want as I'm trying to get this part
(?(?=[5])[0-2]{1}|[0-9]{1})
to say "If the previous number was 5, then you may only choose between 0-2, and if it's another number (0-4), then choosing between 0-9 is allowed."
Currently it allows 00-59, excluding 05, 15, 25, 35, etc.
Essentially I want it to match, for example, 2016-W25.
You need to replace [5] with a positive lookbehind (?<=5) in order to check a char to the left of the current location:
\b(19|20)\d{2}[- :][VW][0-5](?(?=(?<=5))[0-2]|[0-9])
                                 ^^^^^^
See the regex demo
Also, you may get rid of the conditional pattern altogether by using a simple alternation group:
\b(19|20)\d{2}[- :][VW](?:[0-4][0-9]|5[0-2])
                       ^^^^^^^^^^^^^^^^^^^^^
See this regex demo
The (?:[0-4][0-9]|5[0-2]) matches either a digit from 0 to 4 and then any digit (see [0-4][0-9]), or (see |) a 5 followed by 0, 1 or 2 (see 5[0-2]).
NOTE: Since the number of weeks can amount to 53, the [0-2] at the end might be replaced with [0-3] to also match 53 values.
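The alternation form is also the more portable one, since not every engine supports conditionals (JavaScript's does not). Here is a minimal JavaScript sketch of it; the ^ and $ anchors and the helper name isYearWeek are my additions, on the assumption that you are validating a whole string rather than searching inside a longer text.

```javascript
// Alternation version of the pattern above, anchored to the whole string.
// Weeks 00-49 come from [0-4][0-9]; weeks 50-52 come from 5[0-2].
const YEAR_WEEK = /^(19|20)\d{2}[- :][VW](?:[0-4][0-9]|5[0-2])$/;

// Hypothetical helper name, used only for this illustration.
function isYearWeek(text) {
  return YEAR_WEEK.test(text);
}

for (const s of ["2016-W25", "2016 V52", "2016:W53", "2016-W60", "1816-W10"]) {
  console.log(s, isYearWeek(s));
}
```

Changing 5[0-2] to 5[0-3] is the only edit needed if week 53 should be accepted.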

Regex to add leading zero in date record

Question: what is the shortest regex that adds a leading zero to a single-digit value in a date record?
So I want to convert 8/8/2014 8:04:34 to 08/08/2014 8:04:34, adding a leading zero wherever only one digit is present.
A record can have two single-digit entries, one single-digit entry, or none. Some records look like 25/06/2014 19:50:18 or 9/06/2014 8:27:35. In other words, some of them may already be normalized, and the regex needs to fix only the single-digit entries.
Not a regex user by any means. Your help is appreciated.
How about:
Ctrl+H
Find what: \b(\d)(?=/)
Replace with: 0$1
Replace all
This will change 8/8/2014 8:04:34 into 08/08/2014 8:04:34
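The same find/replace pair can be tried in JavaScript, where the replacement 0$1 carries over unchanged. This is just a sketch for experimenting; the function name padDayAndMonth is made up.

```javascript
// \b(\d)(?=\/) : a single digit that starts at a word boundary and is
// immediately followed by a slash, i.e. a one-digit day or month.
function padDayAndMonth(record) {
  return record.replace(/\b(\d)(?=\/)/g, "0$1");
}

console.log(padDayAndMonth("8/8/2014 8:04:34"));    // 08/08/2014 8:04:34
console.log(padDayAndMonth("9/06/2014 8:27:35"));   // 09/06/2014 8:27:35
console.log(padDayAndMonth("25/06/2014 19:50:18")); // unchanged
```

The time part is left alone because its digits are followed by a colon, not a slash.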
Use the following regex to find:
(\d)(\d)?/(\d)(\d)?/(.*)
Then use the following to replace:
(?{2}\1\2:0\1)/(?{4}\3\4:0\3)/\5
What we are using here are called conditionals in regex terms. Refer to this answer for an explanation.
Make sure you have unselected the checkbox which says ". matches newline".
First of all, let's do some test-driven development and write the test cases. We can ignore the time and concentrate on the date alone. Also, the year is not important. We have to find all the possible cases for the day and the month. For each of them, we can have:
A single digit
Two digits, the first of which is already a 0
Two digits, the first of which is not a 0
Two digits, the second of which is a 0 (probably not needed, but just in case).
The case where we have to do something is only the first one, and the last 3 could be joined into a single one, but I prefer to keep them separated. We need to test 16 combinations:
8/8/2014
8/08/2014
8/12/2014
8/10/2014
08/8/2014
08/08/2014
08/12/2014
08/10/2014
12/8/2014
12/08/2014
12/12/2014
12/10/2014
10/8/2014
10/08/2014
10/12/2014
10/10/2014
Of all of these, only 1, 2, 3, 4, 5, 9, 13 must be changed. I don't know how to do it with a single regex, but with 2 regexes it's easy:
First regex, for the day:
(?<!\d)(\d/\d{1,2}/\d+)
replace with:
0\1
It matches a date where the day has only one digit, followed by a month with either 1 or 2 digits, followed by a year with any number of digits, and it simply adds a 0 at the beginning.
Second regex, for the month:
(\d{2}/)(\d/\d+)
replace with:
\10\2
This one assumes that the first one has already been run, and thus the day has 2 digits. It finds dates where the month has a single digit, and adds a 0 before it. Please note that \10\2 means: the first group that matched, followed by a 0, followed by the second group. It doesn't mean: the tenth group, followed by the second. So the digits 1 and 0 are logically separated.
Run the first one, then the second one, and it gives the correct result:
08/08/2014
08/08/2014
08/12/2014
08/10/2014
08/08/2014
08/08/2014
08/12/2014
08/10/2014
12/08/2014
12/08/2014
12/12/2014
12/10/2014
10/08/2014
10/08/2014
10/12/2014
10/10/2014
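The two-pass approach can be checked against all 16 test cases with a short script. The sketch below is JavaScript rather than an editor's find/replace, so it uses replacer functions instead of the \10\2 replacement string to avoid any ambiguity between "group 1 then 0" and "group 10". The function name is hypothetical, and lookbehind support (any recent Node.js or browser) is assumed.

```javascript
// Hypothetical helper applying the two regexes above, one after the other.
function padDateTwoPass(text) {
  return text
    // Pass 1: a one-digit day (not preceded by another digit) gets a leading 0.
    .replace(/(?<!\d)(\d\/\d{1,2}\/\d+)/g, (_, date) => "0" + date)
    // Pass 2: the day now has two digits, so pad a one-digit month.
    .replace(/(\d{2}\/)(\d\/\d+)/g, (_, day, rest) => day + "0" + rest);
}

const days = ["8", "08", "12", "10"];
const months = ["8", "08", "12", "10"];
for (const d of days) {
  for (const m of months) {
    const input = `${d}/${m}/2014`;
    console.log(input, "->", padDateTwoPass(input));
  }
}
```

The order of the passes matters: the second pattern only recognises a month that follows a two-digit day.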
Thanks to this recent answer I can finally give you a (hopefully) correct answer ;)
Replace
\b(?:(\d\d)|(\d))/(?:(\d\d)|(\d))/(\d\d)
with
(?{1}\1:0$2)/(?{3}\3:0\4)/\5
It uses Notepad++ conditionals (which I didn't know of until I stumbled over the mentioned question) to handle the cases where only one or the other is a single digit.
The regex matches a word boundary \b followed by two digits, captured in group 1, or one digit, captured in group 2, followed by a /. Then the same logic is repeated for day, which is captured in group 3 (2 digit) or 4 (1 digit). Then finally it checks that a year follows (at least two digits).
The conditional replace is explained in the linked answer. Simply put, (?{1} tests whether group 1 took part in the match; if it did, the expression before the : is used as the replacement, otherwise the one after it.
Hope this helps.
Regards
If you had a date like (ISO format)
2017-9-5
This
replace(/(\D)(\d)(?!\d)/g, '$10$2')
will turn it into
2017-09-05
and will preserve two digits in dates like
2017-11-11 or 2017-9-05
A general approach is to search for (in this case, 5-digit numbers):
(\d)??(\d)??(\d)??(\d)??(\d)
Replace with
(?1\1:0)(?2\2:0)(?3\3:0)(?4\4:0)\5
You can use /^\d\/|(?<=\/)\d\/\d/g to select the text and then add a 0 before each selection; it should work for all your conditions.

RegEx: Uk Landlines, Mobile phone numbers

I've been struggling to find a suitable solution.
I need a regex that will match all UK phone numbers, both landlines and mobiles.
So far this one appears to cover most of the UK numbers:
^0\d{2,4}[ -]{1}[\d]{3}[\d -]{1}[\d -]{1}[\d]{1,4}$
However mobile numbers do not work with this regex expression or phone-numbers written in a single solid block such as 01234567890.
Could anyone help me create the required regex expression?
[\d -]{1}
is blatantly incorrect: a digit OR a space OR a hyphen.
01000 123456
01000 is not a valid UK area code. 123456 is not a valid local number.
It is important that test data be real area codes and real number ranges.
^\s*(?(020[7,8]{1})?[ ]?[1-9]{1}[0-9{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3})?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*|[0-9]+[ ]?[0-9]+$
The above pattern is garbage for many different reasons.
[7,8] matches 7 or comma or 8. You don't need to match a comma.
London numbers also begin with 3 not just 7 or 8.
London 020 numbers aren't the only 2+8 format numbers; see also 023, 024, 028 and 029.
[1-9]{1} simplifies to [1-9]
[ ]? simplifies to \s?
Having found the initial 0 once, why keep searching for it again and again?
^(0....|0....|0....|0....)$ simplifies to ^0(....|....|....|....)$
Seriously. ([1]|[2]|[3]|[7]){1} simplifies to [1237] here.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3})(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers
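To show how steps 2 to 5 fit together, here is a rough JavaScript sketch. The three patterns are copied from this answer (the step 2 extraction pattern, the mobile pattern and the 02 formatting rule); the function names and the shape of the returned object are made up for illustration, and only the mobile and 02 rules are wired in, not the full list.

```javascript
// Step 2 pattern: $1 = "44" for international format, $2 = NSN with
// punctuation, $3 = optional extension.
const EXTRACT = /^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$/;
// Step 3 pattern for mobile numbers.
const MOBILE = /^7([45789]\d{2}|624)\d{6}$/;
// Step 5 rule for stored numbers beginning 442.
const FORMAT_02 = /^44(2\d)(\d{4})(\d{4})$/;

// Hypothetical helper covering steps 2-4 for a single input string.
function parseGbNumber(input) {
  const m = EXTRACT.exec(input.trim());
  if (!m) return null;                       // does not look like a GB number
  const nsn = m[2].replace(/\D/g, "");       // drop spaces, hyphens, parentheses
  return {
    international: m[1] === "44",
    nsn,
    extension: m[3] || null,
    isMobile: MOBILE.test(nsn),
    stored: "44" + nsn,                      // step 4: digits only, with country code
  };
}

// Hypothetical helper for step 5, national format, 02 numbers only.
function formatLondonStyle(stored) {
  return stored.replace(FORMAT_02, "(0$1) $2 $3");
}

const parsed = parseGbNumber("+44 20 3555 7788");
console.log(parsed.stored, formatLondonStyle(parsed.stored));
```

A real implementation would run the NSN through every number-type pattern and pick the matching display rule; this sketch only demonstrates the flow.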
Given that people sometimes write their numbers with spaces in random places, you might be better off ignoring the spaces altogether. You could then use a regex as simple as this:
^0(\d ?){10}$
This matches:
01234567890
01234 234567
0121 3423 456
01213 423456
01000 123456
But it would also match:
01 2 3 4 5 6 7 8 9 0
So you may not like it, but it's certainly simpler.
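A quick way to confirm the behavior described above is to run the pattern over the listed examples. This is only a JavaScript check of that one pattern; the helper name is made up.

```javascript
// A leading 0 followed by exactly ten digits, each optionally followed by one space.
const SIMPLE_UK = /^0(\d ?){10}$/;

// Hypothetical helper name, used only for this illustration.
function looksLikeUkNumber(text) {
  return SIMPLE_UK.test(text);
}

const examples = [
  "01234567890",
  "01234 234567",
  "0121 3423 456",
  "01213 423456",
  "01000 123456",
  "01 2 3 4 5 6 7 8 9 0",
  "0123456789",      // one digit short
];
for (const e of examples) {
  console.log(e, looksLikeUkNumber(e));
}
```

It only counts digits, so it says nothing about whether the area code or number range actually exists.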
Would this regex do?
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Wed, Sep 8, 2010, 06:38:28
/// Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [1]: A numbered capture group. [\+44], zero or one repetitions
/// \+44
/// Literal +
/// 44
/// [2]: A numbered capture group. [\s+], zero or one repetitions
/// Whitespace, one or more repetitions
/// [3]: A numbered capture group. [\(?]
/// Literal (, zero or one repetitions
/// [area_code]: A named capture group. [(\d{1,5}|\d{4}\s+?\d{1,2})]
/// [4]: A numbered capture group. [\d{1,5}|\d{4}\s+?\d{1,2}]
/// Select from 2 alternatives
/// Any digit, between 1 and 5 repetitions
/// \d{4}\s+?\d{1,2}
/// Any digit, exactly 4 repetitions
/// Whitespace, one or more repetitions, as few as possible
/// Any digit, between 1 and 2 repetitions
/// [5]: A numbered capture group. [\)?]
/// Literal ), zero or one repetitions
/// [6]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// [tel_no]: A named capture group. [(\d{1,4}(\s+|-)?\d{1,4}|(\d{6}))]
/// [7]: A numbered capture group. [\d{1,4}(\s+|-)?\d{1,4}|(\d{6})]
/// Select from 2 alternatives
/// \d{1,4}(\s+|-)?\d{1,4}
/// Any digit, between 1 and 4 repetitions
/// [8]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// Any digit, between 1 and 4 repetitions
/// [9]: A numbered capture group. [\d{6}]
/// Any digit, exactly 6 repetitions
///
///
/// </summary>
public Regex MyRegex = new Regex(
"(\\+44)?\r\n(\\s+)?\r\n(\\(?)\r\n(?<area_code>(\\d{1,5}|\\d{4}\\s+"+
"?\\d{1,2}))(\\)?)\r\n(\\s+|-)?\r\n(?<tel_no>\r\n(\\d{1,4}\r\n(\\s+|-"+
")?\\d{1,4}\r\n|(\\d{6})\r\n))",
RegexOptions.IgnoreCase
| RegexOptions.Singleline
| RegexOptions.ExplicitCapture
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
//// Replace the matched text in the InputText using the replacement pattern
// string result = MyRegex.Replace(InputText,MyRegexReplace);
//// Split the InputText wherever the regex matches
// string[] results = MyRegex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = MyRegex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = MyRegex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = MyRegex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = MyRegex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = MyRegex.GetGroupNumbers();
Notice how the spaces and dashes are optional and can be part of the number. The pattern is also now divided into two named capture groups, area_code and tel_no, to break it down and make the parts easier to extract.
Strip all whitespace and non-numeric characters and then do the test. It'll be much, much easier than trying to account for all the possible options around brackets, spaces, etc.
Try the following:
#"^(([0]{1})|([\+][4]{2}))([1]|[2]|[3]|[7]){1}\d{8,9}$"
Starts with 0 or +44 (for international). I'm sure you could add 0044 if you wanted.
It then has a 1, 2, 3 or 7.
It then has either 8 or 9 digits.
If you want to be even smarter, the following may be a useful reference: http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom
It's not a single regex, but there's sample code from Braemoor Software that is simple to follow and fairly thorough.
The JS version is probably easiest to read. It strips out spaces and hyphens (which I realise you said you can't do) then applies a number of positive and negative regexp checks.
Start by stripping the non-numerics, except for a + as the first character.
(JavaScript)
var tel = document.getElementById("tel").value;
// Keep a leading + (or digit), then drop every non-digit from the rest.
tel = tel.substr(0,1).replace(/[^+0-9]/g,'') + tel.substr(1).replace(/[^0-9]/g,'');
The regex below allows, after the international indicator +, any combination of between 7 and 15 digits (the ITU maximum) UNLESS the code is +44 (UK). Otherwise if the string either begins with +44, +440 or 0, it is followed by 2 or 7 and then by nine of any digit, or it is followed by 1, then any digit except 0, then either seven or eight of any digit. (So 0203 is valid, 0703 is valid but 0103 is not valid). There is currently no such code as 025 (or in London 0205), but those could one day be allocated.
/(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/
Its primary purpose is to identify a correct starting digit for a non-corporate number, followed by the correct number of digits to follow. It doesn't deduce if the subscriber's local number is 5, 6, 7 or 8 digits. It does not enforce the prohibition on initial '1' or '0' in the subscriber number, about which I can't find any information as to whether those old rules are still enforced. UK phone rules are not enforced on properly formatted international phone numbers from outside the UK.
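Putting the stripping step and the regex together, a DOM-free version that is easy to test might look like the sketch below. The function names normalise and looksLikePhone are mine; the regex is the one given above, unchanged.

```javascript
// Keep a leading "+" (or digit) and remove every other non-digit character.
function normalise(raw) {
  return raw.slice(0, 1).replace(/[^+0-9]/g, "") + raw.slice(1).replace(/[^0-9]/g, "");
}

// The pattern from above: non-UK international numbers, or UK numbers
// starting +44, +440 or 0 followed by an allowed leading digit.
const PHONE = /(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/;

// Hypothetical helper: normalise first, then test.
function looksLikePhone(raw) {
  return PHONE.test(normalise(raw));
}

for (const s of ["(020) 3555 7788", "+44 7700 900123", "01632 960123", "0103 555 1234", "+1 212 555 0100"]) {
  console.log(s, "->", normalise(s), looksLikePhone(s));
}
```

Normalising first is what keeps the regex this short: it never has to consider brackets, spaces or hyphens.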
After a long search for valid regexen to cover UK cases, I found that the best way (if you're using client side javascript) to validate UK phone numbers is to use libphonenumber-js along with custom config to reduce bundle size:
If you're using NodeJS, generate UK metadata by running:
npx libphonenumber-metadata-generator metadata.custom.json --countries GB --extended
then import and use the metadata with libphonenumber-js/core:
import { isValidPhoneNumber } from "libphonenumber-js/core";
import data from "./metadata.custom.json";
isValidPhoneNumber("01234567890", "GB", data);