I'm trying to grab the date (without time) from the following OCR'd strings:
04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
With the following expression I can catch the dates, but I can't skip the "12" hours:
([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})
How can I make it work? In plain English, how can I make the last part stop once it has found 4 digits in a mix of digits and spaces/tabs?
By catching the first 8 digits on a line, you will get your date.
\D is any non-digit character
\d is a digit character
(?:...) is a non-capturing group
^\D* skips everything at the beginning of the line up to the first digit
Starting at the first digit found, we match a digit followed by any non-digit characters, eight times over.
import re
p = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)
test_str = """04.10.2015, in USD
04.10.20 15, in EUR
04,1 0.2015, in XYZ
1 1. 10.2 01 5, in XYZ
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
"""
print(p.findall(test_str))
Have a test over here: https://regex101.com/r/eQ8zJ9/4
You can then filter out any non digits to get the date:
from datetime import datetime
for s in re.findall(p, test_str):
    digits = re.sub(r'\D', '', s)
    print(datetime.strptime(digits, '%d%m%Y'))
You can also try with:
((?:\d\s*){2})[,.-]((?:\s*\d\s*){2})[,.-]((?:\s*\d){4})
This one is not restricted to the beginning of a line. It also only matches when one of the chosen delimiters (, . or -) sits between the numbers, which helps because text formatted like this could contain other chaotic 8-digit sequences.
The other answer is nice and short, but if the delimiters are of importance:
((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})
The key being:
(?:\d\s*?){𝑛}
To capture 𝑛 digits with optional, but non-greedy, whitespace in-between.
I also took the liberty to shorten (?:\.|,) to [.,].
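As a rough illustration, the delimiter-aware pattern above can be run against the sample lines from the question in Python 3. The helper name extract_date is made up for this sketch, and it assumes day-month-year order, as the other answer does.

```python
import re
from datetime import datetime

# Delimiter-aware pattern from this answer: two digit pairs, each followed
# by '.' or ',', then four digits; whitespace may sit between the digits.
DATE_RE = re.compile(r'((?:(?:\d\s*){2}[.,]\s*){2}(?:\d\s*?){4})')

SAMPLES = [
    "04.10.2015, in USD",
    "04.10.20 15, in EUR",
    "04,1 0.2015, in XYZ",
    "1 1. 10.2 01 5, in XYZ",
    "0 1.11.201 5 12:30",
    "1 1,0 3, 2 0 1 5 1 2:3 0",
]


def extract_date(line):
    """Hypothetical helper: return the first date found in `line`, or None."""
    match = DATE_RE.search(line)
    if match is None:
        return None
    digits = re.sub(r'\D', '', match.group(1))  # keep only the 8 digits
    return datetime.strptime(digits, '%d%m%Y').date()  # assumes DDMMYYYY


for line in SAMPLES:
    print(repr(line), '->', extract_date(line))
```

Because the last quantifier is lazy, the match stops right after the fourth year digit, so the trailing "12:30" is never pulled in.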
I would like to clean up the phone number column in my pandas DataFrame. I'm using the code below, but it leaves a bracket at the end. What is the right regex to exclude any extra characters at the end, such as ( or anything else that is not part of the phone number? I've looked through old posts but can't find an exact solution.
Sample code below:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.extract('([\(\)\s\d\-]+)',expand= True)
expected output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
Current output :
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)      1234564567(
You may use
((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})
See the regex demo
Details
(?:\(\d{3}\)|\d{3})? - an optional sequence of
\(\d{3}\) - (, three digits, )
| - or
\d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{3} - three digits
(?:\s|\s?-\s?)? - an optional sequence of a whitespace char or an - enclosed with single optional whitespaces
\d{4} - four digits.
Pandas test:
>>> df1['x'].str.extract(r'((?:\(\d{3}\)|\d{3})?(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})',expand= True)
                 0
0       1234567890
1     202-456-3456
2   (202)-456-3456
3  (202)-456- 4567
4       1234564567
How about a different approach? Instead of trying to match the phone numbers, remove the bits you don't want:
import pandas as pd
df1 = pd.DataFrame({'x': ['1234567890', '202-456-3456', '(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']})
df1['x1'] = df1['x'].str.replace(r'\([^0-9]+\)|\D*$', '', regex=True)
Output:
                    x               x1
0          1234567890       1234567890
1        202-456-3456     202-456-3456
2  (202)-456-3456adsd   (202)-456-3456
3     (202)-456- 4567  (202)-456- 4567
4    1234564567(dads)       1234564567
It means using str.replace instead of str.extract but I think the code is simpler as a result.
Explanation:
\([^0-9]+\) matches an opening parenthesis, one or more non-digit characters, and a closing parenthesis.
| means logical OR.
\D*$ matches zero or more non-digit characters at the end of the string.
Used with replace, everything that matches the pattern is replaced with an empty string.
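The same substitution can be tried without pandas, using only the standard re module. This is a minimal sketch; the helper name clean_phone is hypothetical.

```python
import re

# Same pattern as above: a parenthesized run of non-digits,
# or any non-digits at the very end of the string.
CLEANUP_RE = re.compile(r'\([^0-9]+\)|\D*$')


def clean_phone(raw):
    """Hypothetical helper: remove the unwanted parts from one phone string."""
    return CLEANUP_RE.sub('', raw)


for raw in ['1234567890', '202-456-3456', '(202)-456-3456adsd',
            '(202)-456- 4567', '1234564567(dads)']:
    print(repr(raw), '->', repr(clean_phone(raw)))
```

Note that (202) survives because [^0-9]+ cannot match digits, so only parenthesized text such as (dads) is dropped.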
I would use replace.
df1['x1'] = df1['x'].str.replace(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*', '', regex=True)
df1
Simply put, replace Y if it is immediately to the right of X, that is, (?<=X)Y
Y = a run of lowercase letters: [a-z]*
X =
three digits between () followed by a dash, \(\d{3}\)[-], followed by
another three digits and a dash, \d{3}[-], and finally followed by
four digits, \d{4}
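A small standalone check of the lookbehind, using plain re instead of pandas. The helper name strip_trailing_letters is made up for this sketch. Python's re module only accepts fixed-width lookbehinds, which this one is (14 characters).

```python
import re

# Lowercase letters are removed only when they directly follow
# a number shaped like (ddd)-ddd-dddd.
TRAILING_LETTERS_RE = re.compile(r'(?<=\(\d{3}\)[-]\d{3}[-]\d{4})[a-z]*')


def strip_trailing_letters(raw):
    """Hypothetical helper: apply the lookbehind substitution to one string."""
    return TRAILING_LETTERS_RE.sub('', raw)


for raw in ['(202)-456-3456adsd', '(202)-456- 4567', '1234564567(dads)']:
    print(repr(raw), '->', repr(strip_trailing_letters(raw)))
```

As the last sample shows, this pattern only covers the (ddd)-ddd-dddd layout, so 1234564567(dads) is left unchanged and would need a separate rule.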
How can I write a regex that accepts 10 or 14 digits, separated by single spaces into groups of 1, 2 or 3 digits?
examples:
123 45 6 789 1 is valid
1234 567 8 9 1 is not valid (group of 4 digits)
123 45 6 789 109 123 8374 is not valid (not 10 or 14 digits)
EDIT
This is what I have tried so far
[0-9 ]{10,14}+
But it validates also 11,12,13 numbers, and doesn't check for group of numbers
You may use this regex with lookahead assertion:
^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$
Here (?=...) is lookahead assertion that enforces presence of 10 or 14 digits in input.
\d{1,3}(?: \d{1,3})+ matches input with 1 to 3 digits separated by space with no space allowed at start or end.
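To see both conditions at work, here is a short Python sketch that runs the pattern over the examples from the question. The function name is_valid and the two extra inputs (14 and 11 digits) are assumptions added for illustration.

```python
import re

# Lookahead: exactly 10 or 14 digits in total.
# Main part: groups of 1-3 digits separated by single spaces.
GROUPS_RE = re.compile(
    r'^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$'
)


def is_valid(text):
    """Hypothetical helper: True if `text` satisfies both conditions."""
    return GROUPS_RE.match(text) is not None


for text in ['123 45 6 789 1',             # 10 digits, groups of 1-3
             '1234 567 8 9 1',             # group of 4 digits
             '123 45 6 789 109 123 8374',  # 19 digits
             '123 456 789 012 34',         # 14 digits (added example)
             '123 456 789 01']:            # 11 digits (added example)
    print(repr(text), '->', is_valid(text))
```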
aggtr,
You can match your use case with the following:
^(?:\d\s?){10}$|^(?:\d\s?){14}$
^ means the beginning of the string and $ means the end of the string.
(?:...) means a non-capturing group. Thus, the part before the | matches a string that consists of exactly 10 digits, each optionally followed by a whitespace character, from the beginning to the end of the string. The part after the | does the same for 14 digits, so the pattern accepts either 10 or 14 digits.
Edit: I missed the part of your requirement to have the digits grouped by 1, 2, or 3 digits.
I am trying to write a regex that validates an 11-digit mobile number such as 03025398448. The first 3 digits are always 030 and the remaining 8 digits can be any digits from 0 to 9. The leading 0 may also be written in the +92 format. Please help me with a regex for this.
If the number should start with 030 and +92 is optional and when using +92 you should omit the leading zero, you could use:
^(?:\+9230|030)?\d{8}$
Explanation
^ # From the beginning of the string
(?: # Non capturing group
\+9230|030 # Match +9230 or 030
)? # Close the non-capturing group and make it optional
\d{8} # Match 8 digits
$ # The end of the string
In C# you could use this as string pattern = @"^(?:\+9230|030)?\d{8}$";
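The pattern itself is not specific to C#, so its behaviour can be checked quickly in Python. One thing the sketch below makes visible: because the prefix group is optional, a bare 8-digit string also passes. The second pattern, without the ?, is a variant of my own (not part of the answer above) for the case where the prefix must be present.

```python
import re

# Pattern from the answer: the +9230 / 030 prefix is optional.
PREFIX_OPTIONAL_RE = re.compile(r'^(?:\+9230|030)?\d{8}$')

# Assumed variant (not from the answer): the prefix is required.
PREFIX_REQUIRED_RE = re.compile(r'^(?:\+9230|030)\d{8}$')

for number in ['03025398448', '+923025398448', '25398448', '03125398448']:
    print(number,
          'optional:', bool(PREFIX_OPTIONAL_RE.match(number)),
          'required:', bool(PREFIX_REQUIRED_RE.match(number)))
```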
You can use this regular expression:
^((\+?92)30[0-9]{8}|030[0-9]{8})$
Explanation
BeginOfLine
CapturingGroup
GroupNumber:1
OR: match either of the followings
Sequence: match all of the followings in order
CapturingGroup
GroupNumber:2
Sequence: match all of the followings in order
Repeat
+
optional
9 2
3 0
Repeat
AnyCharIn[ 0 to 9]
8 times
Sequence: match all of the followings in order
0 3 0
Repeat
AnyCharIn[ 0 to 9]
8 times
EndOfLine
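The breakdown above can be checked with a few inputs. This is a sketch in Python rather than C#; the name is_valid_mobile and the test numbers other than 03025398448 are assumptions for illustration.

```python
import re

# Either (+)92 followed by 30 and 8 digits, or 030 followed by 8 digits.
MOBILE_RE = re.compile(r'^((\+?92)30[0-9]{8}|030[0-9]{8})$')


def is_valid_mobile(number):
    """Hypothetical helper: True if `number` matches the pattern."""
    return MOBILE_RE.match(number) is not None


for number in ['03025398448', '+923025398448', '923025398448',
               '+9203025398448', '03125398448']:
    print(number, '->', is_valid_mobile(number))
```

Here the + is optional on its own, so 923025398448 is accepted as well; whether that is wanted depends on the requirement.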
I need to match fail counts greater than 5.
import re
string = """fail_count 7
fail_count 8
fail_count 9
fail count 7
fail_count 71
fail_count 23
"""
match = re.search(r'fail(\s|\_)count\s[5-9]', string)
if match:
    print(match.group())
I am able to match up to 9, but if I increase the range to 999 it doesn't work.
5-9 or at least 2 digits:
'([5-9]|\d{2,})'
Or, to match the whole number when it starts with 5-9:
5-9 followed by any number of digits, or at least 2 digits:
'([5-9]\d*|\d{2,})'
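Plugged into the expression from the question, the second alternation could be used as follows. This is a sketch only: the two extra lines with counts 3 and 4 are added here to show non-matches, and the variable names are made up.

```python
import re

# 'fail', a space or underscore, 'count', whitespace, then either a number
# starting with 5-9 or any number with at least two digits.
FAIL_RE = re.compile(r'fail[\s_]count\s([5-9]\d*|\d{2,})')

log = """fail_count 7
fail_count 8
fail_count 9
fail count 7
fail_count 71
fail_count 23
fail_count 3
fail_count 4
"""

# findall returns the captured group, i.e. just the number, for every match.
counts = [int(n) for n in FAIL_RE.findall(log)]
print(counts)
```

Two caveats: the pattern accepts 5 itself, so use [6-9] if the count must be strictly greater than 5, and a zero-padded value such as 03 would pass the two-digit branch.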
Maybe this regex solution can help
fail(\s|\_)count\s([0-9]{2,}|[5-9]{1})
see on regex101
I am using preg_match (PHP version 5.5.*) and want to ignore all alphabetic letters [a-zA-Z] and special symbols such as $ and -, only to match numbers, commas, dots. Whitespaces between numbers such as 6 000 should be matched. Commas after a number that is not followed by another number should be ignored, such as 6, would only match 6
Note that this is used in a single string and never in a list, like the sample below. I use the list to show what input and desired output is, "per line".
Sample input:
1
1,99
1.99
10
100
5999 dollars
2 USD
$2,99
Our price 2.99
Price: $ 20
200 $
20,-
6 999 USD
Desired output:
1
1,99
1.99
10
100
5999
2
2,99
2.99
20
200
20
6 999
I have tried /([0-9.,\s]+)/ but the output of 6 999 USD becomes 6.
Edit
The code we are using looks like this:
preg_match($regex, $value, $extractions);
array_shift($extractions);
$this->persist($extractions);
Update:
If you have &nbsp; instead of spaces, you can do two things. My recommendation is to just do a str_replace() first:
str_replace('&nbsp;', ' ', $number);
The other option is to also check for &nbsp; alongside the [\s,] group:
[\d.](?:[\d.]|(?:[\s,]|&nbsp;)(?=\d))*
Example:
preg_match('/[\d.](?:[\d.]|[\s,](?=\d))*/', $number, $matches);
$number = reset($matches);
Explanation:
So I classified the valid characters (digits, spaces, commas, and periods) into two groups: [\d.] and [\s,]. A number must start with a digit or a period ($.99 == .99 != 99). Then we use a repeated non-capturing group (?:...)* to take care of our alternation and lookahead assertions. Any time there is a [\d.] we match it with no questions asked. Otherwise (|), if it is a [\s,], we assert that it is followed by a digit using a lookahead ((?=...)).
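The pattern is not PHP-specific. As a quick cross-check, here is a Python sketch that applies it to some of the sample inputs from the question; the helper name extract_number is hypothetical.

```python
import re

# Start on a digit or period, then keep taking digits/periods; a space or
# comma is only taken when a digit follows it.
NUMBER_RE = re.compile(r'[\d.](?:[\d.]|[\s,](?=\d))*')


def extract_number(text):
    """Hypothetical helper: return the first number-like run, or None."""
    match = NUMBER_RE.search(text)
    return match.group(0) if match else None


for text in ['1,99', '5999 dollars', '$2,99', 'Our price 2.99',
             'Price: $ 20', '200 $', '20,-', '6 999 USD']:
    print(repr(text), '->', repr(extract_number(text)))
```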
Example:
preg_replace('/\s*[^\d\s,.]+\s*|,(?!\d)/', '', $number);
Explanation:
[^\d\s,.]+ will match 1+ characters that are not either a digit, whitespace, a comma, or a period. We put \s* on either side to grab any extra whitespace around these unwanted characters (like in "Our price "). The only unwanted character this doesn't match is a trailing comma. We use an alternation (|), then look for a comma, and then make sure that it is not followed by a digit using a negative lookahead ((?!...)).
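The replace-based approach can be cross-checked the same way. This Python sketch uses re.sub in place of preg_replace; the helper name strip_non_numeric is made up.

```python
import re

# Remove runs of unwanted characters (plus surrounding whitespace),
# and any comma that is not followed by a digit.
STRIP_RE = re.compile(r'\s*[^\d\s,.]+\s*|,(?!\d)')


def strip_non_numeric(text):
    """Hypothetical helper: delete everything the pattern matches."""
    return STRIP_RE.sub('', text)


for text in ['1,99', '5999 dollars', '$2,99', 'Our price 2.99',
             'Price: $ 20', '200 $', '20,-', '6 999 USD']:
    print(repr(text), '->', repr(strip_non_numeric(text)))
```

The space inside 6 999 is kept because the first alternative needs at least one unwanted character next to the whitespace before it will consume it.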