Extracting phone number issue in R - regex

Having numbers like this:
ll <- readLines(textConnection("(412) 573-7777 opt 1
563.785.1655 x1797
(567) 523-1534 x7753
(567) 483-2119 x 477
(451) 897-MALL
(342) 668-6255 ext 7
(317) 737-3377 Opt 4
(239) 572-8878 x 3
233.785.1655 x1776
(138) 761-6877 x 4
(411) 446-6626 x 14
(412) 337-3332x19
412.393.3177 x24
327.961.1757 ext.4"))
What is the regex I should write to get:
xxx-xxx-xxxx
I tried this one:
gsub('[(]([0-9]{3})[)] ([0-9]{3})[-]([0-9]{4}).*','\\1-\\2-\\3',ll)
It doesn't cover all the possibilities. I think I can do it using several regex patterns, but I think it can be done using a single regex.

If you also want to extract numbers that are represented with letters, you can use the following regex in gsub:
gsub('[(]?([0-9]{3})[)]?[. -]([A-Z0-9]{3})[. -]([A-Z0-9]{4}).*','\\1-\\2-\\3',ll)
See IDEONE demo
You can remove all A-Z from character classes to just match numbers with no letters.
REGEX:
[(]? - An optional (
([0-9]{3}) - 3 digits
[)]? - An optional )
[. -] - Either a dot, or a space, or a hyphen
([A-Z0-9]{3}) - 3 digit or letter sequence
[. -] - Either a dot, or a space, or a hyphen
([A-Z0-9]{4}) - 4 digit or letter sequence
.* - Any number of characters to the end
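For readers outside R, the same substitution can be sketched in Python. This is only an illustrative port of the gsub call above: the names PHONE, samples and normalized are made up for the example, and the sample lines are a subset of the data in the question.

```python
import re

# Same pattern as in the gsub call: optional "(", 3 digits, optional ")",
# a separator, 3 digits/letters, a separator, 4 digits/letters, then the rest.
PHONE = re.compile(r'[(]?([0-9]{3})[)]?[. -]([A-Z0-9]{3})[. -]([A-Z0-9]{4}).*')

samples = [
    "(412) 573-7777 opt 1",
    "563.785.1655 x1797",
    "(451) 897-MALL",
    "(412) 337-3332x19",
    "327.961.1757 ext.4",
]

# Keep the three captured groups and drop everything else.
normalized = [PHONE.sub(r'\1-\2-\3', s) for s in samples]
print(normalized)
```

The trailing .* is what swallows the extension text, so each line is replaced by just the three groups joined with hyphens.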

Related

regex - test for exactly 1 number and exactly 2 letters

Please assist me with creating this regex.
Exactly 3 alphanumeric characters in total
Exactly 1 digit, excluding 0 and 1 (so 2-9 only)
Exactly 2 letters, excluding O and L (upper and lower case)
Number can be in any position
Valid codes:
A2M
HH9
3AM
Invalid Codes
10M (too many digits and invalid digits),
22A (too many digits),
MAB (missing digit)
MA2M (too long, not length of 3)
Thank you for all the help. Here is the regex I will use; its letter classes exclude L and O in both upper and lower case:
/([2-9]{1}[A-KMNP-Za-kmnp-z]{2}|[A-KMNP-Za-kmnp-z]{1}[2-9]{1}[A-KMNP-Za-kmnp-z]{1}|[A-KMNP-Za-kmnp-z]{2}[2-9]{1})/g
You may be able to shorten your regex with use of a lookahead:
^(?=\D*\d\D*$)[A-KMNP-Za-kmnp-z2-9]{3}$
RegEx Demo
RegEx Details:
^: Start
(?=\D*\d\D*$): Lookahead to ensure that we have exactly one digit
[A-KMNP-Za-kmnp-z2-9]{3}: Match any letter or digit except [01LlOo] exactly 3 times
$: End
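A quick way to check the lookahead version against the codes from the question, sketched in Python. The helper name is_valid_code is invented for this example; the pattern is the one shown above.

```python
import re

# The lookahead requires exactly one digit in the whole string;
# the character class then restricts which letters and digits may appear.
CODE = re.compile(r'^(?=\D*\d\D*$)[A-KMNP-Za-kmnp-z2-9]{3}$')

def is_valid_code(code):
    """Return True if code has 3 allowed characters with exactly one digit."""
    return CODE.match(code) is not None

valid = ["A2M", "HH9", "3AM"]
invalid = ["10M", "22A", "MAB", "MA2M", "A2L", "O2A"]

print([is_valid_code(c) for c in valid])
print([is_valid_code(c) for c in invalid])
```

The last two invalid samples (A2L and O2A) are not from the question; they are added here only to exercise the excluded letters.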

combining regex partly using | not working c++

so this is going to be a regex for email address.
username: capital and small letters, digits, and underscore and dot
^[A-Za-z0-9._]+
then there's # and domain: capital and small letters and digits
#[A-Za-z0-9]+
then there's dot and tld: at least 2 characters (letters and digits) and can have maximum one dot.
I used | to have both at least 2 characters and maximum one dot:
([A-Za-z0-9]{2,}|[.]{0,1})
so the complete regex is this:
regex reg ("^[A-Za-z0-9._]+#[A-Za-z0-9]+\\.([A-Za-z0-9]{2,}|[.]{0,1})$");
but the maximum one dot rule isn't working. when I input zohal#gmail.df.g (not real of course) it gives false. it does work in other cases like zohal#gmail.com though.
You may use
regex reg(R"(^[A-Za-z0-9._]+#[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+$)");
See the regex demo
Details
^ - start of string
[A-Za-z0-9._]+ - 1 or more letters, digits, . or _
# - a # char
[A-Za-z0-9]+ - 1 or more letters or digits
(?:\.[A-Za-z0-9]+)+ - 1 or more occurrences of a dot and then 1 or more letters or digits
$ - an end of string position.
Since there are two + quantified patterns after #, you do not need an explicit (?=(?:[^A-Za-z0-9]*[A-Za-z0-9]){2}) lookahead to require at least two letters or digits.
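The pattern itself is not C++ specific, so its behaviour can be checked quickly in Python. This is just a sketch: the names ADDRESS and is_address are invented for the example, and the inputs other than the two from the question are assumptions chosen to show the failing cases.

```python
import re

# Same pattern as above: user part, "#", domain, then one or more ".label" parts.
ADDRESS = re.compile(r'^[A-Za-z0-9._]+#[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+$')

def is_address(text):
    """Return True if text has the user#domain.tld shape described above."""
    return ADDRESS.match(text) is not None

print(is_address("zohal#gmail.df.g"))   # input from the question
print(is_address("zohal#gmail.com"))    # input from the question
print(is_address("zohal#gmail"))        # no dot part at all
print(is_address("zohal#gmail..com"))   # empty label between the dots
```

Note that the repeated group accepts any number of dot-separated parts after the domain, not just one or two.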

How to make an If Then Else Regex conditional statement

I am trying to make an If-Then-Else conditional statement in regular expressions.
The regex takes as input a string representing a filename.
Here are my test strings...
The Edge Of Seventeen 2016 720p.mp4
20180511 2314 - Film4 - Northern Soul.ts
20150526 2059 - BBC Four - We Need to Talk About Kevin.ts
In the first string, 2016 represents a year but in the other two strings 2314 and 2059 represent times in 24 hour clock format.
The filename should be retained unchanged if it matches this regex:
\d{8} \d{4} -.*?- .*?\.ts
Which I have tested and it works. It can match these test strings:
20180511 2314 - Film4 - Northern Soul.ts
20150526 2059 - BBC Four - We Need to Talk About Kevin.ts
If the filename does not match that first regex then this regex should be applied to it:
(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?
This is a cleandatetime regexp that is used by Kodi to remove everything from a string AFTER a four digit number, if it exists, representing a date between 1900 and 2099. I have also tested this and it works.
Here is what I have tried to make the If-Then-Else Regex but it doesn't work:
I use this format --> (?(A)X|Y)
(?(\d{8} \d{4} -.*?- .*?\.ts)^.*$|(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?)
This is A
\d{8} \d{4} -.*?- .*?\.ts
This is X
^.*$
This is Y
(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?
This is the expected output...
Test string:
The Edge Of Seventeen 2016 720p.mp4
Expected output:
"The Edge Of Seventeen 2016 " (quotes only included to show that a trailing space can be left at the end)
Test String:
20180511 2314 - Film4 - Northern Soul.ts
Expected output:
20180511 2314 - Film4 - Northern Soul.ts
Test String:
20150526 2059 - BBC Four - We Need to Talk About Kevin.ts
Expected output:
20150526 2059 - BBC Four - We Need to Talk About Kevin.ts
I am looking for a solution entirely in regular expression syntax. Can someone help me to make it work please?
Cheers,
Flex
You may use a PCRE pattern like
^(?!\d{8} \d{4} -.*?- .*?\.ts$)(.*[^ _,.()\[\]-][ _.()\[\]-]+(?:19|20)[0-9]{2})(?:[ _,.()\[\]-]|[^0-9]$)?.*
Replace with $1, see the regex demo.
It matches
^ - start of string
(?!\d{8} \d{4} -.*?- .*?\.ts$) - the negative lookahead fails the match if the whole string matches
\d{8} \d{4} - 8 digits, space, 4 digits, space
-.*?- .*? - -, then any 0 or more chars other than line break chars, as few as possible, - and a space and then again 0 or more chars other than line break chars, as few as possible
\.ts$ - .ts at the end of string
(.*[^ _,.()\[\]-][ _.()\[\]-]+(?:19|20)[0-9]{2})(?:[ _,.()\[\]-]|[^0-9]$)?.*: Group 1 and then the rest of the string:
.* - any 0+ chars other than line break chars as many as possible
[^ _,.()\[\]-] - a char other than a space, _, ,, ., (, ), [, ] or -
[ _.()\[\]-]+ - 1+ spaces, _, ., (, ), [, ] or -
(?:19|20) - 19 or 20
[0-9]{2} - two digits
(?:[ _,.()\[\]-]|[^0-9]$)? - an optional non-capturing group matching a space, _, ,, ., (, ), [, ] or - or any char other than digit at the end of the string.
.* - any 0+ chars other than line break chars as many as possible.
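The pattern is given as PCRE, but it uses no PCRE-only syntax, so as far as I can tell it also works with Python's re module. A minimal sketch of the replacement, with the helper name strip_after_year made up for the example:

```python
import re

# Negative lookahead rejects recorder-style names; otherwise Group 1 keeps
# everything up to and including the four-digit year.
CLEAN = re.compile(
    r'^(?!\d{8} \d{4} -.*?- .*?\.ts$)'
    r'(.*[^ _,.()\[\]-][ _.()\[\]-]+(?:19|20)[0-9]{2})'
    r'(?:[ _,.()\[\]-]|[^0-9]$)?.*'
)

def strip_after_year(filename):
    """Replace the whole match with Group 1; non-matching names are unchanged."""
    return CLEAN.sub(r'\1', filename)

for name in [
    "The Edge Of Seventeen 2016 720p.mp4",
    "20180511 2314 - Film4 - Northern Soul.ts",
    "20150526 2059 - BBC Four - We Need to Talk About Kevin.ts",
]:
    print(repr(strip_after_year(name)))
```

Because the separator after the year sits outside Group 1, the first name comes back without a trailing space, which differs slightly from the output shown in the question.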
Since you have mentioned that A, X and Y are tested and found working, and since there are only 2 patterns, I think this pattern will work (Python style):
pattern = "(.?(?=" + A + ")" + X + ")|(" + Y + ")"
which means:
(.?(?=A)X)|(Y)
Explanation:
1. There are two groups - one for X and one for Y.
2. The group for capturing X starts with .? just to make the engine start moving and check if there is a part matching X ahead (a lookahead). If yes, it continues with matching X since it will encounter it after the lookahead block.
3. If in (2), the lookahead doesn't match, then the | (or) part, which is Y will take over. If that matches, you get a result. Else, no output.
(Sadly, the patterns for A and Y you posted were not working for me on Python, so I replaced them with my own for testing. Please do confirm if the pattern is working with the original ones.)
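For what it's worth, here is a sketch of that concatenation using the A, X and Y strings exactly as posted in the question. The helper name clean and the choice to return group(0) are assumptions of this example, not part of the answer above, and behaviour may differ in other regex engines.

```python
import re

# A, X and Y copied from the question.
A = r"\d{8} \d{4} -.*?- .*?\.ts"
X = r"^.*$"
Y = (r"(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
     r"(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?")

# (.?(?=A)X)|(Y): take X when A is found ahead, otherwise fall back to Y.
pattern = re.compile("(.?(?=" + A + ")" + X + ")|(" + Y + ")")

def clean(name):
    """Return the matched text, or the name unchanged if nothing matches."""
    m = pattern.search(name)
    return m.group(0) if m else name

print(repr(clean("The Edge Of Seventeen 2016 720p.mp4")))
print(repr(clean("20180511 2314 - Film4 - Northern Soul.ts")))
```

Here the whole match is used rather than a single group, so the separator that Y consumes after the year stays in the result.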

Match all type of numbers

I need regular expression which extracts all numbers with different delimiters (single whitespace, comma, dot). Each number can use none or all of them.
Example:
text: 'numbers: 3.14 2 544 345,345.55 506 test 120 100 100'
output: '3.14', '2 544', '345,345.55', '506', '120 100 100'
I created this regex: \d+[(.|,|\s)\d+]+, but it does not work properly.
I assume the numbers you need to extract are separated with 2 or more whitespaces, else it would be impossible to differentiate between the end of the previous number and the start of a new one.
If you need to extract the numbers in the formats as shown above, XXX XXX.XXX or XXX,XXX,XXX.XX or XXX or XXX XXX XXX, you may use
\b\d{1,3}(?:[, ]\d{3})*(?:\.\d+)?\b
See the regex demo
Details:
\b - leading word boundary
\d{1,3} - 1 to 3 digits
(?:[, ]\d{3})* - 0+ sequences of a comma or space ([, ]) and 3 digits (\d{3})
(?:\.\d+)? - an optional sequence of a dot followed with 1+ digits
\b - trailing word boundary
A less restrictive pattern would be the same as above, but with limiting quantifiers replaced with a +:
\b\d+(?:[, ]\d+)*(?:\.\d+)?\b
See this regex demo
It will also match numbers like 1234566 and 124354354.343344.
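A small Python sketch comparing the two patterns. It assumes, as stated above, that separate numbers are divided by two spaces; the sample text is the one from the question rewritten with that spacing, and the names STRICT, LOOSE and text are illustrative.

```python
import re

# Strict: groups of exactly three digits after the first 1-3 digits.
STRICT = re.compile(r'\b\d{1,3}(?:[, ]\d{3})*(?:\.\d+)?\b')
# Loose: any run of digits in each group.
LOOSE = re.compile(r'\b\d+(?:[, ]\d+)*(?:\.\d+)?\b')

# Assumed input: numbers separated by TWO spaces, single space inside a number.
text = 'numbers: 3.14  2 544  345,345.55  506  test  120 100 100'

strict_hits = STRICT.findall(text)
print(strict_hits)

# The strict pattern rejects a long ungrouped digit run; the loose one accepts it.
print(STRICT.findall('1234566'))
print(LOOSE.findall('1234566'))
```

With only single spaces between numbers, neighbouring values such as 2 544 and 345,345.55 would merge into one match, which is the ambiguity described at the start of the answer.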

RegEx with counting digits and allow special chars

I've done some searching but can't find the right regex.
I would like a regex for a text that only contains digits, whitespaces and plus signs.
like: [0-9 +]
But with a min/max limit for only the digits in that text.
My suggestions ended up with something like this:
^[0-9 \+](?=(.*[0-9]){5,8})$
Should be OK:
"123 456 7"
"12345"
"+ 123 456 78"
Should not be ok:
"123456789"
"+ 124 578a"
"+123456789"
Anyone got a solution that might do the trick?
Edit:
I can see that I was too brief in explaining what I'm aiming for.
My regex conditions should be:
Must include between 5-8 digits
Allow whitespaces and plus signs
I'm guessing from your own regex that between 5 and 8 digits in a row without a whitespace in between are allowed. If that's true, then the following regex might do the trick (example written in Python). It allows a single group of between 5 and 8 digits. If there is more than one group, each group must have exactly 3 digits except for the last group, which can be between 1 and 3 digits long. One single plus sign on the left is optional.
Are you parsing phone numbers? :)
In [176]: regex = re.compile(r"""
^ # start of string
(?: \+\s )? # optional plus sign followed by whitespace
(?:
(?: \d{3}\s )+ # one or more groups of three digits followed by whitespace
\d{1,3} # one group of between one and three digits
| # ALTERNATIVE
\d{5,8} # one group of between five and eight digits
)
$ # end of string
""", flags=re.X)
# --- MATCHES ---
In [177]: regex.findall('123 456 7')
Out[177]: ['123 456 7']
In [178]: regex.findall('12345')
Out[178]: ['12345']
In [179]: regex.findall('+ 123 456 78')
Out[179]: ['+ 123 456 78']
In [200]: regex.findall('12345678')
Out[200]: ['12345678']
# --- NON-MATCHES ---
In [180]: regex.findall('123456789')
Out[180]: []
In [181]: regex.findall('+ 124 578a')
Out[181]: []
In [182]: regex.findall('+123456789')
Out[182]: []
In [198]: regex.findall('123')
Out[198]: []
In [24]: regex.findall('1234 556')
Out[24]: []
You can do something like this:
^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$
See it here on Regexr
The first group (?:[ +]*[0-9]){5} are the 5 minimum digits, with any amount of spaces and plus before, the second part (?:(?:[ +]*[0-9])?){3} matches the optional digits, with any amount of spaces and plus before.
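To see how this behaves on the samples from the question, here is a short Python check. The name COUNTED and the helper ok are made up for the example.

```python
import re

# Five mandatory "filler + digit" units, then up to three optional ones.
COUNTED = re.compile(r'^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$')

def ok(text):
    """Return True if text has 5 to 8 digits mixed only with spaces and plus signs."""
    return COUNTED.match(text) is not None

should_pass = ["123 456 7", "12345", "+ 123 456 78"]
should_fail = ["123456789", "+ 124 578a", "+123456789"]

print([ok(s) for s in should_pass])
print([ok(s) for s in should_fail])
```

One thing to keep in mind: spaces and plus signs are only allowed before a digit in this pattern, so a string ending in a space or plus sign is rejected.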
You were very close - you need to anchor the lookahead to the start of input, and add a second negative lookahead for the upper bound of the quantity of digits:
^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$
Also, FYI, you don't need to escape the plus sign within a character class, and [0-9] can be written as \d.
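The two-lookahead version can be checked the same way in Python. This is a sketch with invented names (BOUNDED, in_range); the sample strings are the ones from the question plus one too-short input added for illustration.

```python
import re

# The first lookahead needs at least 5 digits, the second forbids 9 or more,
# and the body allows only digits, spaces and plus signs.
BOUNDED = re.compile(r'^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$')

def in_range(text):
    """Return True if text holds 5-8 digits and otherwise only spaces or '+'."""
    return BOUNDED.match(text) is not None

should_pass = ["123 456 7", "12345", "+ 123 456 78"]
should_fail = ["123456789", "+ 124 578a", "+123456789", "1234"]

print([in_range(s) for s in should_pass])
print([in_range(s) for s in should_fail])
```

Unlike the grouped patterns earlier in the thread, this one places no constraint on where the spaces and plus signs appear, only on how many digits there are.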