Extract phone number from complicated string - regex

Assume we have an address book with some unformatted data, like:
+1 (4542) 114214 111#111.org d#ghhg.com,,,,
+1 (2342) 114234 ert#nhy.sdfr.domain.org; 1#kjk.eiu.1
+7 (101) 111-222-11 abc#ert.com, def#sdf.org
+1 (102) 123532-2 some#mail.ru
+44 (301) 123 23 45 7zip#site.edu; ret#ghjj.org
I tried to write a regex for this:
/+\d+\s(\d+)\s\d+[\d+\s | \d+ -]+/g
But I have no idea how to exclude the numbers that come before alphabetical characters. This is probably not even a partial solution.
Edit #1: I'm overwhelmed by all the working solutions provided; many thanks to everyone. If possible, I would be grateful if you added at least some reference or explanation of how to write such a complicated regex.

Without knowing where you are using this regex, I'd recommend a negative lookahead.
^[+\d() -]+(?![\w#])
Demo: https://regex101.com/r/rQ6fK4/1
If you want to capture the phone number use:
^([+\d() -]+)(?![\w#])
It will be in $1 or \1, depending on where you are using this.
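As a minimal sketch of how the capturing version behaves, here it is in Python, run against the sample lines from the question. The names ADDRESS_BOOK, PHONE_RE and extract_phones are made up for illustration, and the multi-line flag is an assumption so that ^ matches at the start of every line.

```python
import re

# Sample lines copied from the question.
ADDRESS_BOOK = """\
+1 (4542) 114214 111#111.org d#ghhg.com,,,,
+1 (2342) 114234 ert#nhy.sdfr.domain.org; 1#kjk.eiu.1
+7 (101) 111-222-11 abc#ert.com, def#sdf.org
+1 (102) 123532-2 some#mail.ru
+44 (301) 123 23 45 7zip#site.edu; ret#ghjj.org
"""

# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the whole string.
PHONE_RE = re.compile(r"^([+\d() -]+)(?![\w#])", re.MULTILINE)


def extract_phones(text):
    """Return the captured phone number (group 1) for every matching line."""
    return PHONE_RE.findall(text)


if __name__ == "__main__":
    for phone in extract_phones(ADDRESS_BOOK):
        print(phone)
```

The greedy character class first runs into the start of the e-mail address, then backtracks until the next character is neither a word character nor a #, which is the space right after the phone number.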

This may be one of the few cases where you need a possessive quantifier.
My attempt:
\s*(\+?(\d+)\s*\(\d+\)\s+([- \d+]++(?!\#)|\d+))
The part [- \d+]++(?!\#) will stop the matching if a "#" is following. Therefore it excludes the e-mail addresses.
The phone numbers are now stored in group 1.
Edit:
Yes, the last input line doesn't match correctly. Maybe it's easier to match the e-mail addresses with the following regex and remove them, so that only the phone numbers are left (along with some commas, but those shouldn't be a problem to strip as well):
\s[^\# ]+\#[-\w.]+\.\w+
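A rough Python sketch of that removal idea, processing one line at a time. The helper name phone_by_removing_emails is hypothetical, and stripping leftover commas and semicolons with str.strip is my assumption about how to clean up the remainder.

```python
import re

# The e-mail pattern from above, used to delete addresses from a line.
EMAIL_RE = re.compile(r"\s[^\# ]+\#[-\w.]+\.\w+")


def phone_by_removing_emails(line):
    """Delete every e-mail-like token, then trim leftover separators."""
    without_emails = EMAIL_RE.sub("", line)
    return without_emails.strip(" ,;")


if __name__ == "__main__":
    print(phone_by_removing_emails("+44 (301) 123 23 45 7zip#site.edu; ret#ghjj.org"))
    # prints: +44 (301) 123 23 45
```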

Here's my solution (on regex101):
\+\d+\s+\(\d+\)\s+[- \d]+(?= )
It ensures the last group of spaces, digits, and/or dashes ([- \d]+) is followed by a space ((?= )).
It cleanly captures all of your examples, without trailing spaces, and without including any part of the email addresses.

I wouldn't even try parsing the phone number.
You have a phone number, separated with a space character from one or more email addresses, which are separated by commas or semicolons. An email address always contains a #.
Find the first #. If there is none, then the phone number is the trimmed string. If there is a #, then find the last space before it. The phone number is everything up to that space, trimmed. If there is no space before the #, then you have no phone number.
With the phone number removed, you can find the emails by splitting the rest of the string around "," or ";", trimming the pieces, and throwing away whatever doesn't contain a #.
Then find a decent library to handle phone numbers, if you need to do anything beyond recording what the phone number is.
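The steps above could be sketched in Python roughly as follows. The function name split_phone_and_emails and the marker parameter are invented for this example. Note that it splits only on "," and ";", as described, so two addresses separated by just a space (as in the first sample line) would stay joined.

```python
import re


def split_phone_and_emails(line, marker="#"):
    """Return (phone, emails) for one address-book line.

    The phone number is everything before the last space that precedes
    the first marker; the emails are the comma/semicolon-separated
    pieces of the remainder that contain the marker.
    """
    at = line.find(marker)
    if at == -1:
        # No e-mail at all: the whole line is the phone number.
        phone, rest = line.strip(), ""
    else:
        space = line.rfind(" ", 0, at)
        if space == -1:
            # No space before the first marker: no phone number.
            phone, rest = "", line
        else:
            phone, rest = line[:space].strip(), line[space + 1:]
    pieces = (piece.strip() for piece in re.split(r"[,;]", rest))
    emails = [piece for piece in pieces if marker in piece]
    return phone, emails


if __name__ == "__main__":
    print(split_phone_and_emails("+7 (101) 111-222-11 abc#ert.com, def#sdf.org"))
    # prints: ('+7 (101) 111-222-11', ['abc#ert.com', 'def#sdf.org'])
```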

You can use the following (see demo):
(?<phone>\+\d{1,2}\s\(\d{3,4}\)\s(?:[\d- ]+\d)(?=\s))
\s+(?<email>.*?#.*?)(?=[\s;,]|$).*?
\s+(?<email2>[\w]*?#.*?)?(?=[\s;,]|$)
Which produces:
MATCH 1
phone [4-20] `+1 (4542) 114214`
email [21-32] `111#111.org`
email2 [33-43] `d#ghhg.com`
MATCH 2
phone [52-68] `+1 (2342) 114234`
email [69-92] `ert#nhy.sdfr.domain.org`
email2 [94-105] `1#kjk.eiu.1`
MATCH 3
phone [110-129] `+7 (101) 111-222-11`
email [130-141] `abc#ert.com`
email2 [143-154] `def#sdf.org`
MATCH 4
phone [159-176] `+1 (102) 123532-2`
email [177-189] `some#mail.ru`
MATCH 5
phone [194-213] `+44 (301) 123 23 45`
email [214-227] `7zip#site.edu`
email2 [229-241] `ret#ghjj.org`
Explanation:
(?<phone>\+\d{1,2}\s\(\d{3,4}\)\s(?:[\d- ]+\d)(?=\s)) Named capturing group phone
\+ matches the character + literally
\d{1,2} match a digit [0-9]
Quantifier: {1,2} Between 1 and 2 times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ]
\( matches the character ( literally
\d{3,4} match a digit [0-9]
Quantifier: {3,4} Between 3 and 4 times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
\s match any white space character [\r\n\t\f ]
(?:[\d- ]+\d) Non-capturing group
[\d- ]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\d match a digit [0-9]
- a single character in the list - literally
\d match a digit [0-9]
(?=\s) Positive Lookahead - Assert that the regex below can be matched
\s match any white space character [\r\n\t\f ]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?<email>.*?#.*?) Named capturing group email
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
# matches the character # literally
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?=[\s;,]|$) Positive Lookahead - Assert that the regex below can be matched
1st Alternative: [\s;,]
[\s;,] match a single character present in the list below
\s match any white space character [\r\n\t\f ]
;, a single character in the list ;, literally
2nd Alternative: $
$ assert position at end of a line
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?<email2>[\w]*?#.*?)? Named capturing group email2
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[\w]*? match a single character present in the list below
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\w match any word character [a-zA-Z0-9_]
# matches the character # literally
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?=[\s;,]|$) Positive Lookahead - Assert that the regex below can be matched
1st Alternative: [\s;,]
[\s;,] match a single character present in the list below
\s match any white space character [\r\n\t\f ]
;, a single character in the list ;, literally
2nd Alternative: $
$ assert position at end of a line
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
x modifier: extended. Spaces and text after a # in the pattern are ignored
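For anyone on Python, here is a hedged adaptation of the named-group idea rather than a literal port. Python's re module spells named groups (?P<name>...) and rejects [\d- ] as a bad character range, so the class is written [\d -] here. Instead of the fixed email/email2 groups, this sketch collects any number of addresses with a second, simplified pattern of my own; parse_line, PHONE_RE and EMAIL_RE are hypothetical names.

```python
import re

# Phone part of the pattern above, rewritten in Python's syntax.
PHONE_RE = re.compile(r"(?P<phone>\+\d{1,2}\s\(\d{3,4}\)\s[\d -]+\d)(?=\s)")

# Simplified assumption: an address is a run of characters containing "#"
# and no whitespace, comma or semicolon.
EMAIL_RE = re.compile(r"[^\s;,]+#[^\s;,]+")


def parse_line(line):
    """Return {'phone': ..., 'emails': [...]} or None if no phone is found."""
    m = PHONE_RE.match(line)
    if m is None:
        return None
    # Search for addresses only in the part after the phone number.
    return {"phone": m.group("phone"), "emails": EMAIL_RE.findall(line, m.end())}


if __name__ == "__main__":
    print(parse_line("+1 (4542) 114214 111#111.org d#ghhg.com,,,,"))
```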

How to match a pattern which is repeated more than nth times non consecutively?

I am trying to match a pattern "city, state" (e.g., "Austin, TX") on this sample vector
> s <- c("Austin, TX", "Forth Worth, TX", "Ft. Worth, TX",
"Austin TX", "Austin, TX, USA", "Ft. Worth, TX, USA")
> grepl('[[:alnum:]], [[:alnum:]$]', s)
[1] TRUE TRUE TRUE FALSE TRUE TRUE
However, there are two cases where I would like to get FALSE:
- when there is more than one comma (e.g., "Austin, TX, USA")
- when there is another punctuation mark before the comma (e.g., "Ft. Worth, TX")
You can use the following regex pattern:
grepl("^[a-z ]+, [a-z]+$", subject, perl=TRUE, ignore.case=TRUE);
Regex101 Demo
Regex Explanation:
/^[a-z ]+, [a-z]+$/gmi
^ assert position at start of a line
[a-z ]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
(space) the literal space character
, matches the characters , literally
[a-z]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
$ assert position at end of a line
ignore.case: insensitive. Case insensitive match (ignores case of [a-zA-Z])
Here is the RegEx: ^([^.,])+,\s([^.,])+$
^ assert position at start of the string
1st Capturing group ([^\.,])+
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\.,] match a single character not present in the list below
\. matches the character . literally
, the literal character ,
, matches the character , literally
\s match any white space character [\r\n\t\f ]
2nd Capturing group ([^\.,])+
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\.,] match a single character not present in the list below
\. matches the character . literally
, the literal character ,
$ assert position at end of the string
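To compare the two patterns above on the sample vector from the question, here is a small Python sketch (the question itself uses R; the names CITIES, LETTERS_ONLY, NO_DOT_OR_COMMA and is_city_state are invented here). The second pattern is written without its capturing groups, which does not change what it matches.

```python
import re

# Sample vector from the question.
CITIES = ["Austin, TX", "Forth Worth, TX", "Ft. Worth, TX",
          "Austin TX", "Austin, TX, USA", "Ft. Worth, TX, USA"]

# First pattern: letters and spaces, a comma and a space, then letters.
LETTERS_ONLY = re.compile(r"^[a-z ]+, [a-z]+$", re.IGNORECASE)

# Second pattern: anything except "." and "," on both sides of ", ".
NO_DOT_OR_COMMA = re.compile(r"^[^.,]+,\s[^.,]+$")


def is_city_state(s, pattern=LETTERS_ONLY):
    """True if s looks like 'city, state' according to the given pattern."""
    return pattern.search(s) is not None


if __name__ == "__main__":
    print([is_city_state(s) for s in CITIES])
    # prints: [True, True, False, False, False, False]
```

Both patterns give the same result on this sample. They differ on input such as "St-Louis, MO": the first rejects the hyphen, the second accepts it.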

Regex for 12 Hour Timestamp along with support for time-end

Been trying to get this regex to work properly but can't seem to do it.
I need the regex to basically pick up these combinations of times:
12am
12:30am
12am - 12pm
12:30am - 1:30am
12:30 - 1:30am
12 - 1:30am
If I add a ? behind my ([a|p]m) section, the regex will match numbers which I don't want it to do.
Here is my regex code:
(?:(1[012]|[1-9]):([0-5][0-9])|(1[012]|[1-9])) ?([a|p]m)(?:\s-\s(?:(1[012]|[1-9]):([0-5][0-9])|(1[012]|[1-9])) ?([a|p]m))?
Any help is appreciated, thank you.
This does the work:
((?:1[0-2]|\d)(?:\:[0-5]\d)?(?:[ap]m)?)[\s-]+((?:1[0-2]|\d)(?:\:[0-5]\d)?(?:[ap]m)?)
Live Demo.
Explanation (second group is the same as the first one):
1st Capturing group ((?:1[0-2]|\d)(?:\:[0-5]\d)?(?:[ap]m))
(?:1[0-2]|\d) Non-capturing group
1st Alternative: 1[0-2]
1 matches the character 1 literally
[0-2] match a single character present in the list below
0-2 a single character in the range between 0 and 2
2nd Alternative: \d
\d match a digit [0-9]
(?:\:[0-5]\d)? Non-capturing group
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
\: matches the character : literally
[0-5] match a single character present in the list below
0-5 a single character in the range between 0 and 5
\d match a digit [0-9]
(?:[ap]m) Non-capturing group
[ap] match a single character present in the list below
ap a single character in the list ap literally (case insensitive)
m matches the character m literally (case insensitive)
[\s-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ]
- the literal character -
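One caveat with the pattern above: it always requires a separator and a second time, so a lone "12am" does not match, and since am/pm is optional on both sides it also accepts bare numbers such as "12 1". As a hedged alternative, not the answerer's regex, the sketch below makes am/pm mandatory on the last time and optional only on the start of a range. The names _TIME, TIME_OR_RANGE and is_time_or_range are made up, and it assumes the whole string is the timestamp.

```python
import re

# One clock time: hour 1-12, optionally followed by :00-:59.
_TIME = r"(?:1[0-2]|[1-9])(?::[0-5]\d)?"

# Optional "start - " prefix (am/pm optional there), then a time that
# must carry am/pm.
TIME_OR_RANGE = re.compile(
    rf"(?:{_TIME}(?:\s?[ap]m)?\s-\s)?{_TIME}\s?[ap]m",
    re.IGNORECASE,
)


def is_time_or_range(s):
    """True if the whole string is a 12-hour time or a 'start - end' range."""
    return TIME_OR_RANGE.fullmatch(s) is not None


if __name__ == "__main__":
    for sample in ["12am", "12:30 - 1:30am", "12 - 1:30am", "12", "12 - 1"]:
        print(sample, is_time_or_range(sample))
```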

Use RegEx to match URL [closed]

I'm writing a regex to pull a URL out of an auto-generated email from my monitoring system. For example:
https://mon.contoso.com/mon/call.py?fn=edit&num=1389896156
I need a regex to match:
https://mon.contoso.com/mon/call.py?fn=edit&num=XXXXXXXXX
whereby the "x"'s always change. I run into an issue with the "?". The point of this is to append the URL to a field in JIRA.
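// Escape the regex metacharacters "." and "?" so they match literally.
Pattern p = Pattern.compile("https://mon\\.contoso\\.com/mon/call\\.py\\?fn=edit&num=(\\d+)");
Matcher m = p.matcher(inputEmail);
// find() searches inside the email body; matches() would require the whole input to be the URL.
return m.find() ? m.group(1) : "";
This returns the value of num if it is numeric; otherwise you might want to use \w instead of \d. If you want the whole URL, call group() without an argument.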
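The trouble with the "?" is that an unescaped ? is a quantifier: it makes the preceding character optional instead of matching a literal question mark. A minimal Python sketch of the same idea, with hypothetical names URL_RE and find_monitoring_url, and assuming the URL can appear anywhere in the email body:

```python
import re

# "." and "?" are escaped so they match literally; (\d+) captures num.
URL_RE = re.compile(r"https://mon\.contoso\.com/mon/call\.py\?fn=edit&num=(\d+)")


def find_monitoring_url(email_body):
    """Return the first matching URL in the text, or '' if there is none."""
    m = URL_RE.search(email_body)
    return m.group(0) if m else ""


if __name__ == "__main__":
    body = "Alert raised, see https://mon.contoso.com/mon/call.py?fn=edit&num=1389896156 for details"
    print(find_monitoring_url(body))
    # Without the backslash, "y?" means "an optional y", so the literal
    # "?" in the URL is never matched:
    print(re.search(r"call.py?fn=edit", body))  # prints: None
```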
You don't indicate what language you're working in.
In Python and JavaScript, this regex will identify a variety of URLs:
/\[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])|(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*/gi
You can refer to this regex101 test for examples of the regex in use.
Explanation:
/\[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])|(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*/gi
1st Alternative: \[[^\]\n]+\](?:\([^\)\n]+\)|\[[^\]\n]+\])
\[ matches the character [ literally
[^\]\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\] matches the character ] literally
\n matches a line-feed (newline) character (ASCII 10)
\] matches the character ] literally
(?:\([^\)\n]+\)|\[[^\]\n]+\]) Non-capturing group
1st Alternative: \([^\)\n]+\)
\( matches the character ( literally
[^\)\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
\n matches a line-feed (newline) character (ASCII 10)
\) matches the character ) literally
2nd Alternative: \[[^\]\n]+\]
\[ matches the character [ literally
[^\]\n]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\] matches the character ] literally
\n matches a line-feed (newline) character (ASCII 10)
\] matches the character ] literally
2nd Alternative: (?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,})[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]*
(?:\/\w+\/|.:\\|\w*:\/\/|\.+\/[./\w\d]+|(?:\w+\.\w+){2,}) Non-capturing group
1st Alternative: \/\w+\/
\/ matches the character / literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\/ matches the character / literally
2nd Alternative: .:\\
. matches any character (except newline)
: matches the character : literally
\\ matches the character \ literally
3rd Alternative: \w*:\/\/
\w* match any word character [a-zA-Z0-9_]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
: matches the character : literally
\/ matches the character / literally
\/ matches the character / literally
4th Alternative: \.+\/[./\w\d]+
\.+ matches the character . literally
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\/ matches the character / literally
[./\w\d]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
./ a single character in the list ./ literally
\w match any word character [a-zA-Z0-9_]
\d match a digit [0-9]
5th Alternative: (?:\w+\.\w+){2,}
(?:\w+\.\w+){2,} Non-capturing group
Quantifier: {2,} Between 2 and unlimited times, as many times as possible, giving back as needed [greedy]
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\. matches the character . literally
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
[./\w\d:/?#\[\]#!$&'()*+,;=\-~%]* match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
./ a single character in the list ./ literally
\w match any word character [a-zA-Z0-9_]
\d match a digit [0-9]
:/?# a single character in the list :/?# literally
\[ matches the character [ literally
\] matches the character ] literally
#!$&'()*+,;= a single character in the list #!$&'()*+,;= literally (case insensitive)
\- matches the character - literally
~% a single character in the list ~% literally
g modifier: global. All matches (don't return on first match)
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

Alternative regex for complex match. Current one takes too long (space separated content but must match space included strings)

We are building a parser for complex phone bills. The problem is that we have to match some strings which include spaces, so I'm keen to understand how to optimise the following.
We need to match this string:
0499 799 099 First last The plan 20 Nov 28 Nov $138.23
to extract the cell/mobile number, the first and last name, and the plan name ("The plan").
The regex we have is:
/ *([0-9]{4} [0-9]{3} [0-9]{3}) +(([a-zA-Z0-9\.\$\'\(\)]+ ?)+) +(([a-zA-Z0-9\.\$\'\(\)]+ ?)+) +([0-9][0-9] [A-Z][a-z][a-z]) ([0-9][0-9] [A-Z][a-z][a-z]) +\$([0-9]+\.[0-9][0-9]) */
I know the " ?" forward matching etc. is costing us, but how else can we do it if we need to match strings which include a single space?
Thoughts welcome, thanks.
The following regex works as intended:
/^\s+([\d\s]{12})\s+(.*?)\s+(.*?)\s+(.*?)[\s]{2,}/
DEMO
Explanation:
^ assert position at start of the string
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group ([\d\s]{12})
[\d\s]{12} match a single character present in the list below
Quantifier: {12} Exactly 12 times
\d match a digit [0-9]
\s match any white space character [\r\n\t\f ]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
3rd Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\s+ match any white space character [\r\n\t\f ]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
4th Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
[\s]{2,} match a single character present in the list below
Quantifier: {2,} Between 2 and unlimited times, as many times as possible, giving back as needed [greedy]
\s match any white space character [\r\n\t\f ]
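The slowdown in the question's regex most likely comes from the nested quantifiers in (([...]+ ?)+), which let the engine split the same text in many different ways before it gives up. If the bill really is laid out in columns, one way to avoid that is to not use a single big regex at all. The sketch below relies on assumptions that the question does not confirm: that columns are separated by runs of two or more spaces, and that the two dates sit in one column separated by a single space (as the question's regex suggests). The names COLUMN_GAP, DATES and parse_bill_line are hypothetical.

```python
import re

# ASSUMPTION: columns are separated by two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")

# ASSUMPTION: the date column holds two "DD Mon" dates with one space between.
DATES = re.compile(r"(\d\d [A-Z][a-z]{2}) (\d\d [A-Z][a-z]{2})")


def parse_bill_line(line):
    """Split one bill line into its columns, or return None if it doesn't fit."""
    fields = COLUMN_GAP.split(line.strip())
    if len(fields) != 5:
        return None
    number, name, plan, dates, amount = fields
    m = DATES.fullmatch(dates)
    if m is None or not amount.startswith("$"):
        return None
    return {
        "number": number,
        "name": name,
        "plan": plan,
        "start": m.group(1),
        "end": m.group(2),
        "amount": amount[1:],
    }


if __name__ == "__main__":
    # Spacing in this sample is invented to fit the assumptions above.
    print(parse_bill_line("  0499 799 099  First last  The plan     20 Nov 28 Nov  $138.23"))
```

Splitting on the column gaps is a single linear pass, and each field is then checked with a small pattern that has no nested repetition.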

Need a regular expression to match a pattern

I need a regular expression to match below pattern
Word1 OR Word2 OR Word3 OR......
Basically, this is a string which contains words separated by OR.
You can do:
(\w+)(?=(?:\s+OR)|(?:\s*$))
Demo
The following will match based on what you've given:
^\w+(?: OR \w+)*$
^ assert position at start of the string
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?: OR \w+)* Non-capturing group
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
_OR_ matches the characters _OR_ literally (case sensitive)
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ assert position at end of the string
NOTE: I used _OR_ to show the spaces around OR in the quotation as the whitespace was ignored.
Link to Regex101
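A short Python sketch of how the validating pattern above might be used: check the whole string first, then pull the words out by splitting on " OR ". The names OR_LIST and split_or_list are invented for illustration. fullmatch takes the place of the ^ and $ anchors.

```python
import re

# Same pattern as above; fullmatch() anchors it at both ends.
OR_LIST = re.compile(r"\w+(?: OR \w+)*")


def split_or_list(s):
    """Return the words of 'Word1 OR Word2 OR ...', or None if s doesn't fit."""
    if OR_LIST.fullmatch(s) is None:
        return None
    return s.split(" OR ")


if __name__ == "__main__":
    print(split_or_list("Word1 OR Word2 OR Word3"))
    # prints: ['Word1', 'Word2', 'Word3']
    print(split_or_list("Word1 AND Word2"))
    # prints: None
```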
Do something like this and access the second group using \2:
((\w+)\sOR)*
Try playing with the link below, read the explanation and you'll understand how it works:
https://regex101.com/r/vY6mA7/1