Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
I tried the following, but it also grabbed the first line of the address:
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy, even though it works on RegexStorm.
I am looking for an expression that matches the phone and fax numbers in this text. I will be using the expression in WebHarvy:
https://www.webharvy.com/articles/regex.html
Thanks

Your second pattern is almost there. With P\s(\(\d{3})\)\s\d+-\d+, Group 1 captures only the (\(\d{3}) part, i.e. the opening parenthesis and the area code, while you need to capture the whole number.
I also suggest restricting the context: match P either as a whole word or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
In the first pattern, \b matches a word boundary, so P is only matched as a whole word. In the second pattern, (?m) makes ^ match at the start of every line, and \s* then matches 0+ whitespace chars. Since \s also matches line breaks, you may replace \s* with [\p{Zs}\t]* to match horizontal whitespace only. The same patterns work for the fax line if you replace P with F.
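For a quick check outside WebHarvy, the line-anchored pattern can be tried in Python. This is only a sketch: Python's re module stands in for the .NET engine, [ \t] replaces [\p{Zs}\t] (which re does not support), and the [PF] class and the extract_numbers helper are assumptions added here so that the fax line is picked up as well.

```python
import re

SAMPLE = """5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions"""

# Group 1 is the label (P or F), group 2 is the whole number.
# [PF] is an assumption; the answer above only handles P.
NUMBER_LINE = re.compile(r"(?m)^[ \t]*([PF])[ \t]*(\(\d{3}\)[ \t]*\d+-\d+)")


def extract_numbers(text):
    """Return a dict mapping each label to the number found on its line."""
    return dict(NUMBER_LINE.findall(text))


numbers = extract_numbers(SAMPLE)
print(numbers)
```

Whether WebHarvy returns group 1 or group 2 when there are two groups is not something this sketch can tell you, so in WebHarvy itself it is probably safer to keep a single capturing group around the number.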

Related

I need to extract all words prior to 4th Space in a line

Good Day
I need to extract all words prior to the 4th space in a line.
Sample Data
Article Number Crt.DI No. Date
6ZZ 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
I have written an expression to extract what is between Date and Article, and the result is this:
(?<=Date)(.|\n)*(?=Article)
6RU 999 123 S 000000093 19.01.2021
However, I need to retrieve only the characters before the 4th space:
6ZZ 999 123 S
This is a material number, and it can be 13 or 14 characters long before the 4th space.
Appreciate your support.
Sample Data
Article Number Crt.DI No. Date
6RU 999 123 S 000000093 19.01.2021
Article description Replace DI No. Date
(Please note: there are newlines in between; these are three consecutive lines, and each line is followed by an Enter key.)
Regards,
Manjesh
You can use a capture group, and use \s to match a whitespace character, which also matches a newline.
The capture group approach is more flexible if you want to match more than one whitespace char or newline after Date and your regex flavor does not support a quantifier in a lookbehind assertion.
\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b
See a regex demo.
Or use lookarounds to get the value as the full match, without a capture group:
(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)
The pattern matches:
(?<=\bDate\s) Positive lookbehind to assert Date to the left followed by a whitespace char that can also match a newline
\S+ Match 1 or more non whitespace chars
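(?:\s+\S+){3} Repeat 3 times: 1 or more whitespace chars followed by 1 or more non whitespace chars
(?= Positive lookahead to assert that what is to the right is
[\s\S]*? Match any character including newlines, as few times as possible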
\bArticle\b Match the word Article
) Close the lookahead
See another regex demo.
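As a rough way to compare the two variants, here is a minimal Python sketch run against the sample data from the question. Python's re is used only as a stand-in for whatever engine you have; the names TEXT, CAPTURE and LOOKAROUND are made up for this example.

```python
import re

# Three consecutive lines, each ending with a newline, as in the question.
TEXT = (
    "Article Number Crt.DI No. Date\n"
    "6RU 999 123 S 000000093 19.01.2021\n"
    "Article description Replace DI No. Date\n"
)

# Variant 1: the value is in capture group 1.
CAPTURE = re.compile(r"\bDate\s+(\S+(?:\s+\S+){3})[\s\S]*?\bArticle\b")

# Variant 2: the value is the whole match; the lookbehind has a fixed width,
# which is what Python's re requires.
LOOKAROUND = re.compile(r"(?<=\bDate\s)\S+(?:\s+\S+){3}(?=[\s\S]*?\bArticle\b)")

from_group = CAPTURE.search(TEXT).group(1)
from_match = LOOKAROUND.search(TEXT).group(0)
print(from_group)
print(from_match)
```

Both variants take the first four whitespace-separated tokens after Date, so they return the same material number here.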

RegEx for extracting US address not working when address is separated with newline

I have the following RegEx to extract US address from a string.
(\d+)[ \n]+((\w+[ ,])+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
This is not working when the address is in the below format.
2933 Glen Crow Court
San Jose
CA 95148
and is working for the below data.
2933 Glen Crow Court,
San Jose, CA 95148
.
2933 Glen Crow Court, San Jose, CA 95148
Any help on this would be much appreciated.
You can simplify your pattern to something like this, which matches the address whether it is on one line or spread over multiple lines:
\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b
Regex Explanation:
\b\d+ - Start matching at a word boundary, followed by one or more digits
(?:\s+[\w,]+)+? - A non-capturing group that matches one or more whitespace chars followed by one or more word characters or commas; the group repeats one or more times, but lazily
\s+[a-zA-Z]{2} - Match one or more whitespace chars, then two letters, to cover state codes like CA or NY
\s+\d{5}\b - Match one or more whitespace chars, then five digits and a word boundary, to avoid a partial match inside a longer number
Demo
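If you want to check the simplified pattern against all three layouts from the question, a small Python sketch like the following should do. The helper find_address and the names ADDRESS and SAMPLES are only illustrative.

```python
import re

ADDRESS = re.compile(r"\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b")

# The three layouts from the question: fully multi-line, mixed, single line.
SAMPLES = [
    "2933 Glen Crow Court\nSan Jose\nCA 95148",
    "2933 Glen Crow Court,\nSan Jose, CA 95148",
    "2933 Glen Crow Court, San Jose, CA 95148",
]


def find_address(text):
    """Return the first address-like match in text, or None."""
    match = ADDRESS.search(text)
    return match.group(0) if match else None


results = [find_address(sample) for sample in SAMPLES]
for result in results:
    print(repr(result))
```

Because \s matches both spaces and newlines, the same pattern covers every layout without listing the separators explicitly.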
Add ? to the [ ,] check:
(\d+)[ \n]+((\w+[ ,]?)+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
Try this pattern \d+\s+[\w ]+[\s,]+[\w ]+[\s,]+\w+ \d+
Explanation:
\d+\s+ - match one or more digits, then one or more whitespace chars
[\w ]+[\s,]+ - match one or more word characters or spaces, then one or more whitespace chars or commas
\w+ \d+ - match one or more word characters, a space, and one or more digits
Demo
Not drake but you can thank me later...
r"(?:(\d+ [A-Za-z][A-Za-z ]+)[\s,]*([A-Za-z#0-9][A-Za-z#0-9 ]+)?[\s,]*)?(?:([A-Za-z][A-Za-z ]+)[\s,]+)?((?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2})(?:[,\s]+(\d{5}(?:-\d{4})?))?"
you can test it out here... demo
Note: this only works for US addresses.

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I try this on RegexStorm it seems to work fine, but when I use it in WebHarvy, it only captures the 'n' from the city name New Houston and the 'n' from Austin.
Where am I going wrong?
In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so after a match is found, Group 1 only contains the last iteration, n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
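To see the difference between the two patterns, here is a small Python sketch. Python's re is only a stand-in for .NET here, but both engines keep just the last iteration of a repeated capturing group. The names CITY, REPEATED and city_name are invented for the example.

```python
import re

# One capturing group around the whole city name, with balanced parentheses.
CITY = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")

# The pattern from the question: the group is repeated, so it is overwritten
# on every iteration and ends up holding only the last one.
REPEATED = re.compile(r"(\w|\w\s\w)+(?=,\s\w{2})")


def city_name(address):
    """Return the city name captured in group 1, or None."""
    match = CITY.search(address)
    return match.group(1) if match else None


two_words = city_name("1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320")
one_word = city_name("1234 Savoy Dr Ste 510\nTexas, TX 77036-3320")
last_iteration = REPEATED.search("New Houston, TX 77036-3320").group(1)
print(two_words, one_word, last_iteration)
```

The last value shows what a tool that returns Group 1 would see with the original pattern: only the final character of the city name.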

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output (matches shown in square brackets):
JOHN went to [LONDON] one fine day.
JOHN had lunch in a [PUB].
JOHN then moved to [CHICAGO].
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But [THIS1] should match the pattern.
Also the other [70] times that the pattern should match.
Observed output (matches shown in square brackets):
[JOHN] went to [LONDON] one fine day.
[JOHN] had lunch in a [PUB].
[JOHN] then moved to [CHICAGO].
[I] don't want [JOHN] to be highlighted.
John does not want this to match the pattern.
Neither this.
But [THIS1] should match the pattern.
Also the other [70] times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
  [A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
  | - or
  [0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
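A short Python sketch of the lookahead in action on the sample text from the question. The variable names are made up for this example, and ORIGINAL is the asker's pattern, included only for comparison.

```python
import re

TEXT = """JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match."""

# The asker's pattern, which also matches JOHN and I.
ORIGINAL = re.compile(r"\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b")

# Same alternatives, with a negative lookahead that rejects the two constants.
FILTERED = re.compile(r"\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b")

matches = FILTERED.findall(TEXT)
excluded = set(ORIGINAL.findall(TEXT)) - set(matches)
print(matches)
print(excluded)
```

The \b inside the lookahead matters: without it, any word that merely starts with I or JOHN would be rejected too.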

Capturing repeated word sequence

In Perl, the regex below works fine for matching text patterns like a11a, g22g, x33x:
([a-z])(\d)\g2\g1
Now I want to match repeated groups similar to the above, but with a space between the words, like
abcd 101 abcd 101 (catch this entire string with a single regex pattern, in a single line of text or in a paragraph)
How can I do this? I tried the pattern below, but it doesn't work:
([a-zA-Z]*\s)([0-9]*\s)\g1\g2
#logic is : words followed by space in 1 group and
#numbers followed by space in 2nd group
Regex101 Demo
Also, please explain why the above regex fails to capture the desired text pattern.
EDIT
One more complication :
assume that pattern is something like
[words][space][numbers][space][words][space][numbers]
#assume all [numbers] and [word] are same
So in the last [numbers] case no [space] follows. How do I handle that? A capture group like
([0-9]*\s) certainly fails to match the last part when it is repeated, and
([0-9]*) would fail to match the middle part when it is repeated.
Regex 101
Your problem is that your regex expects a space at the end, because you have included the space in the captures.
Try instead:
([a-zA-Z]+)\s([0-9]+)\s\g1\s\g2
([0-9]*\s) captures 101 together with the trailing space,
so \g2 doesn't match the final 101, which has no space after it.
Update: a working regex for the input abcd 101 abcd 101 is ([a-zA-Z]*\s)([0-9]*)\s\g1\g2
Online Demo
More example:
([a-zA-Z]*\s)  ([0-9]*)  \s     \g1         \g2
abcd+space     101       space  abcd+space  101
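The same behaviour can be reproduced in Python, as a rough sketch. Note the assumption: Python's re spells backreferences \1 and \2 rather than Perl's \g1 and \g2; otherwise the patterns are the ones discussed above, and the variable names are invented for the example.

```python
import re

# Spaces are kept outside the groups, so the last number needs no trailing space.
FIXED = re.compile(r"([a-zA-Z]+)\s([0-9]+)\s\1\s\2")

# The pattern from the question: each group includes its trailing space.
ORIGINAL = re.compile(r"([a-zA-Z]*\s)([0-9]*\s)\1\2")

text = "abcd 101 abcd 101"

fixed_match = FIXED.search(text)
original_match = ORIGINAL.search(text)            # fails: no space after the last 101
with_trailing_space = ORIGINAL.search(text + " ")  # succeeds once a space is appended

print(fixed_match.group(0))
print(original_match)
print(with_trailing_space is not None)
```

A backreference repeats exactly what the group captured, whitespace included, which is why moving \s out of the groups fixes the problem.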