Reg Ex is getting more digits than expected - regex

Please don't suggest any links; I have seen them all a million times.
I looked at many suggestions, such as Regex credit card number tests. However, I'm not primarily concerned with verifying potential credit card numbers.
I want to locate (potential) credit card numbers in a document by identifying sequences of 12 to 19 digits (plus a few common separator characters between them). This is discussed in, e.g., Finding or Verifying Credit Card Numbers, at which #TimBiegeleisen points. But the suggested solution results in a few false negatives. (See section "Problems..." below.)
Sample input:
[ '232625427',
'please stop check 220 2000000 that was sent 6/10 reg mail and reissu fedex. Please
charge to credit card 4610 0000 0000 0000 exp 05/99...thanks, Sxxx' ]
[ '232653042',
'MARKET PLACE: Exxxx or Bxxxx-Please set husband and wife up on monthly credit card
payments. Name on the credit card is Hxxxx-Jxxxx Lxxxx (Maiden name, name on policy is different) Master card number 5424 0000 0000 0000 Exp 11-30-00. Thanks so much.' ]
Much more sample input at my RegEx101.com attempt.
My regex is
[1-9](\d[ ]?[ ]*?[-]?[-]*?[:]*?[:]?){11,18}\b
Problems with my RegEx
The 12-19 digit numbers are not matched when immediately followed by a string. It fails, e.g., on 4554-4545-4545-4545Visa.
Longer runs of digits are matched at the end rather than at the beginning: for 999999999999994190000000000000 I get 9994190000000000000 instead of 9999999999999941900
I am testing it at RegEx101.com.

To address the problem in your title "Reg Ex is getting more digits than expected" (reading "digits" as "characters", though), try:
[1-9]([- :]*\d){11,18}\b
This way, you no longer match trailing blanks in your sample input. See it in action at RegEx101.com.
Closer to what you pointed out under "Problems..." should be:
[1-9]([- :]*\d){11,18}
With the word boundary removed from the end, strings immediately following the sequence of numbers are no longer causing false negatives. And the match is no longer biased towards the end of a potential match, either. This, however, handles 001 111111111111 differently from your approach:
RegEx101.com.
This could be accounted for with
[1-9][0-9]([- :]*\d){10,17}
at the cost of allowing a few more zeros from "5452 0000 0000 0000000": RegEx101.com.
All suggestions were checked against your sample input, only. Different input might require further tweaking.
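The effect of the trailing word boundary can also be reproduced outside RegEx101.com. The following is a minimal sketch in Python (the answer does not name a language, so Python's re module and all variable names here are assumptions made for illustration), run against the two problem inputs from the question:

```python
import re

# The two suggested patterns: with and without the trailing word boundary.
with_boundary = re.compile(r"[1-9]([- :]*\d){11,18}\b")
without_boundary = re.compile(r"[1-9]([- :]*\d){11,18}")

followed_by_text = "4554-4545-4545-4545Visa"
long_run = "999999999999994190000000000000"

# With \b the engine backtracks until a digit is followed by a non-word
# character, so only the first three groups of the number are reported.
print(with_boundary.search(followed_by_text).group())     # 4554-4545-4545
print(without_boundary.search(followed_by_text).group())  # 4554-4545-4545-4545

# With \b the match has to end where the digit run ends, so it is taken
# from the tail; without \b the first 19 digits are matched.
print(with_boundary.search(long_run).group())
print(without_boundary.search(long_run).group())
```

Note that with this pattern the word boundary produces a truncated match for the first input rather than no match at all, which is arguably just as undesirable.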
Please comment, if and as this requires adjustment / further detail.

Related

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to get it to match just the string No before the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in it, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will also split after Sgt., for the simple reason that a lookbehind assertion has to be fixed-width in Python (what a limitation!), so the three-character Sgt cannot be added next to the two-character alternatives No and \.\w.
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
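The fixed-width restriction in Python's re module applies to each lookbehind separately. One possible workaround (a sketch that is not part of the answer above; the choice of abbreviations and the variable names are assumptions) is to chain one negative lookbehind per abbreviation, each with its own width:

```python
import re

text = ("I am from New York, N.Y. and I would like to say hello! "
        "How are you today? I am well. I owe you $6. 00 because you "
        "bought me a No. 3 burger. -Sgt. Smith")

# Split on whitespace that follows ., ? or !, unless the punctuation belongs
# to "No.", "Sgt." or a dotted abbreviation such as "N.Y.", or the whitespace
# is followed by a digit.
splitter = re.compile(
    r"(?<=[.?!])"    # sentence punctuation right before the whitespace
    r"(?<!No\.)"     # ... but not "No."  (lookbehind of width 3)
    r"(?<!Sgt\.)"    # ... and not "Sgt." (lookbehind of width 4)
    r"(?<!\.\w\.)"   # ... and not ".Y." as in "N.Y."
    r"\s+(?!\d)"     # the whitespace itself, not followed by a digit
)

sentences = splitter.split(text)
for sentence in sentences:
    print(sentence)
```

This still treats every "No." as an abbreviation, so a sentence that really ends in "No." would not be split.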
You may consider a matching approach: it gives you better control over the entities you want to count as single words rather than as sentence break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - matches a dot that is not followed by 1 or more whitespaces, an optional -, and then an uppercase letter or a digit
| - or
[^.!?] - any character but a ., ! or ?
(?:[.?!]|$) - a ., ! or ?, or the end of the string.
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you are not able to distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, replace the string with a mask that is guaranteed not to occur in the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
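The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name split_sentences is hypothetical, "No." followed by a digit is the only edge case handled, and the split uses a lookbehind instead of the plain [\.\!\?]\s so that the punctuation stays attached to each sentence.

```python
import re

# Placeholder assumed never to occur in the original text.
MASK = "======NUMBER======"

def split_sentences(text):
    # Steps 1 and 2: identify "No." followed by a number and mask it.
    masked = re.sub(r"\bNo\.(?=\s*\d)", MASK, text)
    # Step 3: split on whitespace that follows ., ! or ?.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: turn the mask back into "No.".
    return [part.replace(MASK, "No.") for part in parts]

result = split_sentences("I am well. I bought a No. 3 burger. Bye!")
print(result)  # ['I am well.', 'I bought a No. 3 burger.', 'Bye!']
```

Further edge cases such as "Sgt." would each need their own mask and their own substitution before the split.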
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
I would do it in three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']

Combinations of word `ticket` according to different rules

I am trying to find a single regex pattern that would cover all cases of a string 'ticket' combined with a numeric string coming after it. Rules are:
ticket#digit
ticket #digit
ticket digit
ticketdigit
Sample input:
ticket500
ticket 500
ticket#500
ticket #500
Ticket500
Ticket 500
Ticket#500
Ticket #500
so far I have /ticket([\d]+)/i that correctly reacts for 'ticket500'.
Edit: I am working on a large database which has lots of different variations. I've discovered some other cases not covered by the suggested solution. I really need a single regex for PHP to cover all the above cases plus the following ones:
Ticket # 786
Ticket: # 786
Ticket: #786
Ticket:# 786
Ticket #: 786
Ticket#: 786
Ticket #:786
The following pattern does not accept every combination with "# " in the middle; it captures the number as group(1):
[Tt]icket ?#?([0-9]+)
If you don't expect your users to input ticket-numbers in octal notation, you might also want to exclude the zero at the beginning of the number:
[Tt]icket ?#?([1-9][0-9]*)
UPDATE: Another version that matches the updated requirements 2018-02-07, i.e. is more tolerant to spaces and flipped order of colon and sharp:
[Tt]icket(?:(?: *:? *#? *)|(?: *#? *:? *))([\d]+)
Here it can be seen in action: https://regex101.com/r/itmr0G/2
For case-(in)sensitivity, the same remarks apply: if your environment compiles the pattern as case-insensitive when you append /i to it, feel free to do so, provided this is the desired behavior.
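The updated pattern can be checked against every variation listed in the question. The question targets PHP; the sketch below uses Python's re module purely for illustration, and it swaps [Tt] for the re.IGNORECASE flag, which also accepts spellings such as TICKET (an assumption about the desired behavior):

```python
import re

# Pattern from the update above, compiled case-insensitively.
ticket_re = re.compile(
    r"ticket(?:(?: *:? *#? *)|(?: *#? *:? *))(\d+)", re.IGNORECASE
)

samples = [
    "ticket500", "ticket 500", "ticket#500", "ticket #500",
    "Ticket500", "Ticket 500", "Ticket#500", "Ticket #500",
    "Ticket # 786", "Ticket: # 786", "Ticket: #786", "Ticket:# 786",
    "Ticket #: 786", "Ticket#: 786", "Ticket #:786",
]

# Collect the captured ticket number for each sample (None if no match).
numbers = []
for sample in samples:
    match = ticket_re.search(sample)
    numbers.append(match.group(1) if match else None)
    print(repr(sample), "->", numbers[-1])
```

Each of the 15 samples yields its number in group 1, so no sample is left unmatched.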

Regex for getting just name of street and number from messy address

I have this list of messy addresses, some are clean some aren't:
Av. Chorrillos # 1759 Local 1082 Exterior Jumbo
Av. Balmaceda N° 2355 Local BS - 121 / Subterráneo sector servicios
Tarapaca N° 729
The structure is usually name of street + N°|#|nothing + number + extra stuff
I'd like to erase this extra stuff so that the expected output from the above list is:
Av. Chorrillos # 1759
Av. Balmaceda N° 2355
Tarapaca N° 729
I tried using a combination of letters and lookback:
([a-zA-Z\s]+\d+)
But the # and N° gave me trouble, so I tried also including them
([(\w|°|#)\s]+\d+)
but still no luck.
I know regex on addresses is a nightmare, but any regex that fits those three cases above would fit 95% of my list, which is good enough for me!
I'm using this with python regex in case that matters.
You can find the list of addresses and my regex attempt on regex101
(Some addresses have extra info BEFORE the relevant information of street + number, but I'm fine with screwing up those)
Based on your specifications, I came up with this regex.
Regex: ^.*?(?:[N°#Nº]\s*)?\d+
Explanation:
^.*? lazily consumes characters from the beginning of the string, stopping as soon as the next part, (?:[N°#Nº]\s*)?\d+, can match.
(?:[N°#Nº]\s*)? optionally matches a single N, °, # or º followed by zero or more whitespace characters.
\d+ matches one or more digits.
Regex101 Demo
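Since the question mentions Python, here is a minimal sketch applying the regex to the three sample addresses. The helper name street_and_number and the fallback of returning the address unchanged when it contains no digits are assumptions, not part of the answer:

```python
import re

# Regex from the answer: everything up to and including the first number.
address_re = re.compile(r"^.*?(?:[N°#Nº]\s*)?\d+")

addresses = [
    "Av. Chorrillos # 1759 Local 1082 Exterior Jumbo",
    "Av. Balmaceda N° 2355 Local BS - 121 / Subterráneo sector servicios",
    "Tarapaca N° 729",
]

def street_and_number(address):
    # Keep the matched prefix; leave the address untouched if nothing matches.
    match = address_re.match(address)
    return match.group() if match else address

cleaned = [street_and_number(address) for address in addresses]
for line in cleaned:
    print(line)
```

Because the prefix is matched lazily, the result always ends at the first run of digits, which is why addresses with extra information before the street name are not handled.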

Regular expression prices

I'm trying to find a price validation regex that fits my needs.
Valid input format (xxx means no maximum length - 0000 means 4 decimal places at maximum):
15,0000
15.0000
150.0000
150,0000
xxxxxxxxxxxx.0000
xxxxxxxxxxxx,0000
15,00
15,1
15.00
15.1
Invalid input format (basically everything that starts by 0):
01.0000
01.00
01
My regular expression so far: ^\$?[1-9][1-9,]*[0-9]\.?[0-9]{0,2}$
Edit 1: Changed my regex for this one: ^\$?[1-9]*[1-9]((\,)|(\.))?[0-9]{0,4}$ but now I need to be able to add 150000000 and it only allows me 150000
EDIT: just saw that you updated the question and added 0 as a valid input. I'll see if I can add that.
How about:
^([1-9].*[,\.][0-9]*)$
This will work on the examples above.
But be careful with input like 15x,001
See it in action
This one seems okay to me:
^[^0]\d+(\.|\,)?[0-9]{0,4}$
checked here http://rubular.com/r/97Ra9VS9h4
One more thing: if you also want to accept one-digit numbers like 1, 2, etc.,
then you can just replace the + with *, like this: ^[^0]\d*(\.|\,)?[0-9]{0,4}$
What about this one:
^\$?[1-9][0-9]*(,|\.)[0-9]{1,4}$
The first part makes sure the price doesn't start with a zero.
Then zero or more digits are allowed.
Then there must be a comma or a point.
Finally, between one and four digits are allowed.
^[1-9][0-9]*([.,][0-9]{1,4})?$
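The last pattern is posted without explanation: it makes the decimal part optional and limits it to one to four digits. A quick check against the inputs listed in the question could look like the sketch below (Python is an assumption, since the question does not name a language; 123456789012.0000 stands in for the "xxxxxxxxxxxx.0000" placeholder). It also shows that the [^0] variant suggested earlier accepts a non-digit first character, because [^0] means "anything but 0":

```python
import re

price_re = re.compile(r"^[1-9][0-9]*([.,][0-9]{1,4})?$")

valid = ["15,0000", "15.0000", "150.0000", "150,0000",
         "123456789012.0000", "15,00", "15,1", "15.00", "15.1",
         "150000000"]
invalid = ["01.0000", "01.00", "01"]

accepted = [price for price in valid if price_re.match(price)]
rejected = [price for price in invalid if not price_re.match(price)]
print(len(accepted), "of", len(valid), "valid inputs accepted")
print(len(rejected), "of", len(invalid), "invalid inputs rejected")

# The [^0] variant from an earlier answer lets a letter through.
loose_re = re.compile(r"^[^0]\d+(\.|\,)?[0-9]{0,4}$")
print(bool(loose_re.match("x15")))   # True
print(bool(price_re.match("x15")))   # False
```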

Custom RegEx expression for validating different possibilities of phone number entries?

I'm looking for a custom RegEx expression (that works!) that will validate common phone number entries with area code (no country code), such as:
111-111-1111
(111) 111-1111
(111)111-1111
111 111 1111
111.111.1111
1111111111
And combinations of these / anything else I may have forgotten.
Also, is it possible to have the RegEx expression itself reformat the entry? So take the 1111111111 and put it in 111-111-1111 format. The regex will most likely be entered in a Joomla / some type of CMS module, so I can't really add code to it aside from the expression itself.
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
will match all your examples; after a match, backreference 1 will contain the area code, and backreferences 2 and 3 will contain the phone number.
I hope you don't need to handle international phone numbers, too.
If the phone number is in a string by itself, you could also use
^\s*\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})\s*$
allowing for leading/trailing whitespace and nothing else.
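On the reformatting part of the question: a regex on its own only matches, but the three backreferences can be recombined wherever a replacement template is available. Whether a given CMS module offers that is not known here. A minimal Python sketch, with the hypothetical helper name normalize_phone:

```python
import re

# Anchored pattern from the answer above.
phone_re = re.compile(r"^\s*\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})\s*$")

def normalize_phone(entry):
    """Return the entry as 111-111-1111, or None if it does not match."""
    match = phone_re.match(entry)
    if match is None:
        return None
    # Recombine the three captured groups with dashes.
    return match.expand(r"\1-\2-\3")

entries = ["111-111-1111", "(111) 111-1111", "(111)111-1111",
           "111 111 1111", "111.111.1111", "1111111111"]
normalized = [normalize_phone(entry) for entry in entries]
print(normalized)
```

All six sample formats from the question come out as 111-111-1111.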
Why not just remove spaces, parentheses, dashes, and periods, then check that what remains is a 10-digit number?
Depending on the language in question, you might be better off using a replace-like statement to replace non-numeric characters: ()-/. with nothing, and then just check if what is left is a 10-digit number.
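A sketch of this strip-then-check approach, assuming Python is available (the function name is_ten_digit_phone is hypothetical, and whitespace is added to the stripped set because the question's examples contain spaces):

```python
import re

def is_ten_digit_phone(entry):
    # Drop the separators ( ) - / . and any whitespace.
    digits = re.sub(r"[()\-/.\s]", "", entry)
    # What is left must be exactly ten ASCII digits.
    return re.fullmatch(r"[0-9]{10}", digits) is not None

print(is_ten_digit_phone("(111) 111-1111"))  # True
print(is_ten_digit_phone("111-1111"))        # False
```

Unlike the single-regex answers, this accepts separators in any position, for example 11-11-11-11-11, which may or may not be acceptable.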