I would like to clean up the phone number column in my pandas DataFrame. I'm using the code below, but it leaves a bracket at the end. What regex would exclude any extra trailing characters, such as ( or anything else that is not part of the phone number? I've looked through old posts but can't find an exact solution.
Sample code below:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.extract('([\(\)\s\d\-]+)',expand= True)
expected output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
Current output :
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567(
You may use
((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})
Details
(?:\(\d{3}\)|\d{3})? - an optional sequence of
  \(\d{3}\) - (, three digits, )
  | - or
  \d{3} - three digits
(?:\s|\s?-\s?)? - an optional separator: either a single whitespace char, or a - with an optional whitespace char on each side
\d{3} - three digits
(?:\s|\s?-\s?)? - the same optional separator
\d{4} - four digits.
Pandas test:
>>> df1['x'].str.extract(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})',expand= True)
0
0 1234567890
1 202-456-3456
2 (202)-456-3456
3 (202)-456- 4567
4 1234564567
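For a quick check outside pandas, the same pattern can be run with Python's standard re module. This is only a sketch: PHONE and extract_phone are names made up for illustration, and re.search stands in for what str.extract does per row (return the first match).

```python
import re

# Same pattern as in the answer above; PHONE and extract_phone are illustrative names.
PHONE = re.compile(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})')

def extract_phone(text):
    """Return the first phone-like substring, or None if nothing matches."""
    m = PHONE.search(text)
    return m.group(1) if m else None

samples = ['1234567890', '202-456-3456', '(202)-456-3456adsd',
           '(202)-456- 4567', '1234564567(dads)']
for s in samples:
    print(repr(s), '->', repr(extract_phone(s)))
```

Unlike the original character-class pattern, this one stops after the last four digits, so the stray ( in the last sample is not captured.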
How about a different approach? Instead of trying to match the phone numbers, remove the bits you don't want:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.replace(r'\([^0-9]+\)|\D*$', '', regex=True)  # regex=True is required in pandas 2.0+, where the default is a literal replace
Output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
It means using str.replace instead of str.extract but I think the code is simpler as a result.
Explanation:
\([^0-9]+\) matches a pair of parentheses that contains one or more non-digit characters, such as (dads). It does not match (202), because that contains digits.
| means logical OR.
\D*$ matches zero or more non-digit characters at the end of the string.
Used with replace, every match of the above pattern is replaced with an empty string.
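The same removal can be tried without pandas. The sketch below uses re.sub from the standard library; JUNK and strip_junk are hypothetical names chosen for this example.

```python
import re

# Either a parenthesized run of non-digits, or any non-digits at the end of the string.
JUNK = re.compile(r'\([^0-9]+\)|\D*$')

def strip_junk(text):
    """Remove the unwanted pieces and keep everything else as is."""
    return JUNK.sub('', text)

for s in ['1234567890', '202-456-3456', '(202)-456-3456adsd',
          '(202)-456- 4567', '1234564567(dads)']:
    print(repr(s), '->', repr(strip_junk(s)))
```

Because this approach deletes rather than matches, anything unexpected in the middle of a value (for example letters between the digit groups) is left in place.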
I would use replace.
df1['x1'] = df1['x'].str.replace(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*', '', regex=True)
df1
Simply put: replace Y if it is immediately to the right of X, that is, (?<=X)Y.
Y = a run of lowercase letters - [a-z]*
X =
three digits between () followed by a dash, \(\d{3}\)[-], followed by
another three digits and a dash, \d{3}[-], and finally followed by
four digits, \d{4}
Note that this only strips the suffix from numbers written in the (ddd)-ddd-dddd form; a row such as 1234564567(dads) is left unchanged.
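The lookbehind idea can be sketched with plain re.sub as well. Python's re module only accepts fixed-width lookbehinds; this one qualifies because it always spans exactly 14 characters. SUFFIX and strip_suffix are illustrative names, not part of the answer.

```python
import re

# (?<=X)Y with X = "(ddd)-ddd-dddd" and Y = a run of lowercase letters.
SUFFIX = re.compile(r'(?<=\(\d{3}\)-\d{3}-\d{4})[a-z]*')

def strip_suffix(text):
    """Drop lowercase letters that directly follow a (ddd)-ddd-dddd number."""
    return SUFFIX.sub('', text)

print(strip_suffix('(202)-456-3456adsd'))  # (202)-456-3456
print(strip_suffix('1234564567(dads)'))    # unchanged: the lookbehind never matches
```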
How I can write a regex which accepts 10 or 14 digits separated by a single space in groups of 1,2 or 3 digits?
examples:
123 45 6 789 1 is valid
1234 567 8 9 1 is not valid (group of 4 digits)
123 45 6 789 109 123 8374 is not valid (not 10 or 14 digits)
EDIT
This is what I have tried so far
[0-9 ]{10,14}+
But it also accepts 11, 12 and 13 characters, and it doesn't check how the digits are grouped.
You may use this regex with lookahead assertion:
^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$
Here (?=...) is a lookahead assertion that enforces the presence of exactly 10 or 14 digits in the input.
\d{1,3}(?: \d{1,3})+ matches groups of 1 to 3 digits separated by single spaces, with no space allowed at the start or end.
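To see how the lookahead and the grouping part work together, here is a small Python sketch run against the examples from the question. GROUPED and is_valid are hypothetical names, and the 14-digit sample is made up for this example.

```python
import re

# Lookahead: exactly 10 or 14 digits overall. Body: groups of 1-3 digits, single spaces.
GROUPED = re.compile(r'^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$')

def is_valid(text):
    return GROUPED.match(text) is not None

print(is_valid('123 45 6 789 1'))             # True: 10 digits, groups of 1-3
print(is_valid('1234 567 8 9 1'))             # False: contains a group of 4 digits
print(is_valid('123 45 6 789 109 123 8374'))  # False: neither 10 nor 14 digits
print(is_valid('123 456 789 012 34'))         # True: 14 digits (made-up sample)
```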
aggtr,
You can match your use case with the following:
^(?:\d\s?){10}$|^(?:\d\s?){14}$
^ means the beginning of the string and $ means the end of the string.
(?:...) is a non-capturing group. The part before the | therefore matches a string that consists, from start to end, of exactly 10 repetitions of a digit followed by an optional whitespace character. The alternative after the | allows exactly 14 repetitions instead.
Edit: I missed the part of your requirement about grouping the digits in runs of 1, 2, or 3, so this pattern does not enforce it.
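The limitation mentioned in the edit is easy to confirm. The following Python sketch (COUNT_ONLY and counts_ok are illustrative names) shows that this pattern checks the digit count but accepts a group of four digits.

```python
import re

# Counts digits only: 10 or 14 digits, each optionally followed by one whitespace char.
COUNT_ONLY = re.compile(r'^(?:\d\s?){10}$|^(?:\d\s?){14}$')

def counts_ok(text):
    return COUNT_ONLY.match(text) is not None

print(counts_ok('123 45 6 789 1'))             # True
print(counts_ok('1234 567 8 9 1'))             # True, although the question calls this invalid
print(counts_ok('123 45 6 789 109 123 8374'))  # False
```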
I am trying to write a regex to validate an 11-digit mobile number such as 03025398448, where the first 3 digits are the constant 030 and the remaining 8 digits can be any digit from 0 to 9. The leading 0 may also be written in +92 format. Please help me with the regex for this.
If the number should start with 030, +92 is optional, and the leading zero is dropped when +92 is used, you could use:
^(?:\+9230|030)\d{8}$
Explanation
^ # From the beginning of the string
(?: # Non-capturing group
\+9230|030 # Match +9230 or 030
) # Close the non-capturing group
\d{8} # Match 8 digits
$ # The end of the string
In C# you could use this as string pattern = @"^(?:\+9230|030)\d{8}$";
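A minimal Python check of this idea, under the assumption that the prefix is mandatory (so the prefix group is not followed by ?, since otherwise a bare 8-digit string would also pass). MOBILE and is_mobile are hypothetical names.

```python
import re

# Prefix is either +9230 or 030, followed by exactly 8 digits.
MOBILE = re.compile(r'^(?:\+9230|030)\d{8}$')

def is_mobile(text):
    return MOBILE.match(text) is not None

print(is_mobile('03025398448'))    # True
print(is_mobile('+923025398448'))  # True
print(is_mobile('25398448'))       # False: prefix missing
print(is_mobile('0302539844'))     # False: only 7 digits after 030
```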
You can use this regular expression:
^((\+?92)30[0-9]{8}|030[0-9]{8})$
Explanation
BeginOfLine
CapturingGroup
  GroupNumber:1
  OR: match either of the followings
    Sequence: match all of the followings in order
      CapturingGroup
        GroupNumber:2
        Sequence: match all of the followings in order
          Repeat
            +
            optional
          9 2
      3 0
      Repeat
        AnyCharIn[ 0 to 9]
        8 times
    Sequence: match all of the followings in order
      0 3 0
      Repeat
        AnyCharIn[ 0 to 9]
        8 times
EndOfLine
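The expression in this answer can be exercised the same way. In this Python sketch (ALT_MOBILE and is_mobile_alt are made-up names), note that the + itself is optional, so 92 without a plus sign is also accepted.

```python
import re

# Either (+)92 30 followed by 8 digits, or 030 followed by 8 digits.
ALT_MOBILE = re.compile(r'^((\+?92)30[0-9]{8}|030[0-9]{8})$')

def is_mobile_alt(text):
    return ALT_MOBILE.match(text) is not None

print(is_mobile_alt('03025398448'))    # True
print(is_mobile_alt('+923025398448'))  # True
print(is_mobile_alt('923025398448'))   # True: the + is optional in this pattern
print(is_mobile_alt('03125398448'))    # False: does not start with 030
```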
I'm trying to grab the date (without time) from the following OCR'd strings:
04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
With the following expression I can catch the dates, but I can't skip the "12" hours:
([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})
How can I make it work? In plain English, how can I make the last part stop once it has found 4 digits in a mix of digits and spaces/tabs?
By catching the first 8 digits on a line, you will get your date.
\D is any non-digit character
\d is a digit character
(?:...) is a non-capturing group
^\D* skips the beginning of the line until the first digit
We then match, eight times, a digit followed by as few non-digit characters as possible, starting with the first digit found.
import re
# Capture the first 8 digits of each line, together with any separators between them
p = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)
test_str = """04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
"""
print(re.findall(p, test_str))
Have a test over here: https://regex101.com/r/eQ8zJ9/4
You can then filter out any non-digits to get the date:
from datetime import datetime
for s in re.findall(p, test_str):
    digits = re.sub(r'\D', '', s)  # keep only the 8 digits, e.g. '04102015'
    print(datetime.strptime(digits, '%d%m%Y'))
You can also try with:
((?:\d\s*){2})[,.-]((?:\s*\d\s*){2})[,.-]((?:\s*\d){4})
which is not restricted to the beginning of a line. It also only matches when one of the chosen delimiters (, . or -) sits between the numbers, since text formatted like this could contain other chaotic 8-digit sequences.
The other answer is nice and short, but if the delimiters are of importance:
((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})
The key being:
(?:\d\s*?){𝑛}
To capture 𝑛 digits with optional, but non-greedy, whitespace in-between.
I also took the liberty to shorten (?:\.|,) to [.,].
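As a rough Python illustration of the delimiter-aware pattern, the sketch below extracts the date from each OCR'd line of the question and strips everything except the digits. DATE and find_date_digits are hypothetical names; processing line by line is a choice made here so that \s cannot run across a newline.

```python
import re

# Two digits, delimiter, two digits, delimiter, four digits, with stray whitespace allowed.
DATE = re.compile(r'((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})')

def find_date_digits(line):
    """Return the 8 date digits (ddmmyyyy) found in the line, or None."""
    m = DATE.search(line)
    return re.sub(r'\D', '', m.group(1)) if m else None

lines = ['04.10.2015, in USD', '04.10.20 15, in EUR', '04,1 0.2015, in XYZ',
         '1 1. 10.2 01 5, in XYZ', '0 1.11.201 5 12:30', '1 1,0 3, 2 0 1 5 1 2:3 0']
for line in lines:
    print(find_date_digits(line))
```

The non-greedy \s*? in the last group is what keeps the match from running on into the hours.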
Having numbers like this:
ll <- readLines(textConnection("(412) 573-7777 opt 1
563.785.1655 x1797
(567) 523-1534 x7753
(567) 483-2119 x 477
(451) 897-MALL
(342) 668-6255 ext 7
(317) 737-3377 Opt 4
(239) 572-8878 x 3
233.785.1655 x1776
(138) 761-6877 x 4
(411) 446-6626 x 14
(412) 337-3332x19
412.393.3177 x24
327.961.1757 ext.4"))
What is the regex I should write to get:
xxx-xxx-xxxx
I tried this one:
gsub('[(]([0-9]{3})[)] ([0-9]{3})[-]([0-9]{4}).*','\\1-\\2-\\3',ll)
It doesn't cover all the possibilities. I think I can do it using several regex patterns, but I think it can be done using a single regex.
If you also want to extract numbers that are represented with letters, you can use the following regex in gsub:
gsub('[(]?([0-9]{3})[)]?[. -]([A-Z0-9]{3})[. -]([A-Z0-9]{4}).*','\\1-\\2-\\3',ll)
You can remove all A-Z from character classes to just match numbers with no letters.
REGEX:
[(]? - An optional (
([0-9]{3}) - 3 digits
[)]? - An optional )
[. -] - Either a dot, or a space, or a hyphen
([A-Z0-9]{3}) - 3 digit or letter sequence
[. -] - Either a dot, or a space, or a hyphen
([A-Z0-9]{4}) - 4 digit or letter sequence
.* - Any number of characters to the end
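For readers who do not use R, the same substitution can be sketched in Python with re.sub. This is a port made for illustration, not part of the original answer; NUMBER and normalize are made-up names, and only a few of the sample lines are shown.

```python
import re

# Same pattern as the gsub call; the trailing .* swallows extensions such as "x1797".
NUMBER = re.compile(r'[(]?([0-9]{3})[)]?[. -]([A-Z0-9]{3})[. -]([A-Z0-9]{4}).*')

def normalize(text):
    """Rewrite a phone number as xxx-xxx-xxxx and drop whatever follows it."""
    return NUMBER.sub(r'\1-\2-\3', text)

for s in ['(412) 573-7777 opt 1', '563.785.1655 x1797',
          '(451) 897-MALL', '(412) 337-3332x19', '327.961.1757 ext.4']:
    print(normalize(s))
```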
I've done some searching but can't find the right regex.
I would like a regex for a text that only contains digits, whitespace and plus signs,
like: [0-9 +]
but with a min/max limit that applies only to the digits in that text.
My attempts ended up with something like this:
^[0-9 \+](?=(.*[0-9]){5,8})$
Should be OK:
"123 456 7"
"12345"
"+ 123 456 78"
Should not be ok:
"123456789"
"+ 124 578a"
"+123456789"
Anyone got a solution that might do the trick?
Edit:
I can see that my explanation of what I'm aiming for was too short.
My regex conditions should be:
Must include between 5-8 digits
Allow whitespaces and plus signs
I'm guessing from your own regex that between 5 and 8 digits in a row, without whitespace in between, are allowed. If that's true, then the following regex might do the trick (example written in Python). It allows a single group of between 5 and 8 digits. If there is more than one group, each group must have exactly 3 digits, except for the last group, which can be between 1 and 3 digits long. A single plus sign on the left is optional.
Are you parsing phone numbers? :)
In [176]: regex = re.compile(r"""
^ # start of string
(?: \+\s )? # optional plus sign followed by whitespace
(?:
(?: \d{3}\s )+ # one or more groups of three digits followed by whitespace
\d{1,3} # one group of between one and three digits
| # ALTERNATIVE
\d{5,8} # one group of between five and eight digits
)
$ # end of string
""", flags=re.X)
# --- MATCHES ---
In [177]: regex.findall('123 456 7')
Out[177]: ['123 456 7']
In [178]: regex.findall('12345')
Out[178]: ['12345']
In [179]: regex.findall('+ 123 456 78')
Out[179]: ['+ 123 456 78']
In [200]: regex.findall('12345678')
Out[200]: ['12345678']
# --- NON-MATCHES ---
In [180]: regex.findall('123456789')
Out[180]: []
In [181]: regex.findall('+ 124 578a')
Out[181]: []
In [182]: regex.findall('+123456789')
Out[182]: []
In [198]: regex.findall('123')
Out[198]: []
In [24]: regex.findall('1234 556')
Out[24]: []
You can do something like this:
^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$
The first group (?:[ +]*[0-9]){5} matches the 5 required digits, each preceded by any number of spaces and plus signs. The second part (?:(?:[ +]*[0-9])?){3} matches up to 3 further optional digits, again each preceded by any number of spaces and plus signs.
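Here is a short Python sketch that runs this pattern against the examples from the question. FIVE_TO_EIGHT and has_5_to_8_digits are illustrative names.

```python
import re

# 5 mandatory digits plus up to 3 optional ones; spaces and + may precede each digit.
FIVE_TO_EIGHT = re.compile(r'^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$')

def has_5_to_8_digits(text):
    return FIVE_TO_EIGHT.match(text) is not None

for s in ['123 456 7', '12345', '+ 123 456 78']:
    print(repr(s), has_5_to_8_digits(s))   # all True
for s in ['123456789', '+ 124 578a', '+123456789']:
    print(repr(s), has_5_to_8_digits(s))   # all False
```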
You were very close - you need to anchor the lookahead to the start of input, and add a second negative lookahead for the upper bound of the quantity of digits:
^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$
Also, FYI, you don't need to escape the plus sign within a character class, and [0-9] can be written as \d.
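The two-lookahead version can be checked in Python as follows. BOUNDED and digits_in_range are hypothetical names; the positive lookahead requires at least 5 digits, the negative one rejects 9 or more, and the body restricts the allowed characters.

```python
import re

# At least 5 digits, not 9 or more, and only digits, spaces and plus signs overall.
BOUNDED = re.compile(r'^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$')

def digits_in_range(text):
    return BOUNDED.match(text) is not None

for s in ['123 456 7', '12345', '+ 123 456 78']:
    print(repr(s), digits_in_range(s))   # all True
for s in ['123456789', '+ 124 578a', '+123456789', '1234']:
    print(repr(s), digits_in_range(s))   # all False
```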