Regex improvements for international, common and RFC 3966 phone number validation? - regex

Context
Hi, earlier I was browsing the web for a quick answer about validating telephone numbers with a single regex: emergency, short, international, French, Spanish and North American numbers (normal, fancy and extended versions).
Strangely, I couldn't find anything better than "A comprehensive regex for phone number formula", which seems to be the best topic about this, unless I missed one, which is totally possible.
I'm new to the site and this is my very first question (yeah!), since that other thread is currently on hold of some sort: it seems the author didn't get what he and I were seeking.
That makes at least three of us who would like a good solution, counting my pal, who first asked me to find one for simple integrations like his Google Forms.
Hence my current question(s) and my own answer to begin with: I spent part of a night building my own regex based on the advice and test patterns from the best replies on the other thread. If you're interested in the topic, there are some interesting elements.
Questions
What is the best way to optimize and improve this regex (without resorting to coding), which is dedicated to validating international and most national phone numbers (following the recommendations of RFC 3966 at least)?
Not sure if I can add a related question as well (since its purpose is still to improve the usefulness of the regex pattern), but no harm in asking, I guess.
Are there other commonly used formats that this regex should match (or reject)?
If you can add them (or a link) here for me to update my test bundles, I would be thankful. Equally useful would be phone numbers that should definitely not validate (the unwanted).
My initial solution
My current regex solution (version 4) on Regular Expressions 101
An earlier version matched despite leading and trailing whitespace, which was not that useful (a bit too fancy for the execution time).
The latest version at the time of writing takes into account the other posts on the subject, RFC 3966 (from the IETF standards) and the Wikipedia article "National conventions for writing telephone numbers".
Another potential side dish is isolating matching groups for the country code, area code and extension... and things work relatively dandy up to a point: it only works well when there are separators (or parentheses) to distinguish those groups of digits.
Matching goals
Emergency and short numbers : 112 or 911
Spanish international : +34 987 654 321
French extended : +33 (0)1 23 45 67 89
French national : 01 23 45 67 89
American extended : 001-(123)-456-7890 ext-4321
German (Microsoft style) : +49 (1234) 567890
Mexican national : (01 55) 1234 5678
Hypothetical international number (max length?) : 00321-(4321)-567.89 ext-4321
Another matching goal is to have a regex that does not under-perform too much; I'm not really picky, since it is not to be used in critical parts of code.
Still, how could we optimize the best regex(es) people will find/propose without changing their results?
Goals from the main thread
+1(234)/567.8901 x1234 and the like (with different permutations of separators : ., /, - and horizontal whitespace).
2345678901 : the same US number as dialed in the States, I guess.
Not sure how it should work, since I thought that + (or its equivalent, the double zero 00) was required in front of any international number... I have always done it that way. The other thread had a list of positive matches without it.
Could someone confirm that + or 00 is not mandatory for US numbers? Thank you again.
Best of unwanted formats
12(34567890 and 123)456789012345 : unmatched parenthesis.
)123(34567890 : parentheses are wrongly matched.
++34123456789 : double + is a typo.
+9-123/456.7890 x12345 : the extension has 4 digits at most.
1-234-567-8901 : missing 00 or + at the beginning of an international number.
1234 to 12345678 : not a short number, yet not a normal one (between 9 and 12 digits? as far as I know).
1234567890123 : over max length (since without international features).
0012312345678901 : over max length (as international number).
Regex101.com was a big plus for rewriting and testing the regex up to this point; I couldn't have progressed so far without its help. Yet I'm no expert, so I can only scratch the surface here and I need your help to improve this.
Thank you for reading. It was very educational to write the question (but not something I would do every day, as it is very time-consuming at my pace), and I hope it will find its answers as well. Have a nice day (or night... ;) ).

Before I forget, here's the latest version of the regex I put together:
^(?=(?:\+|0{2})?(?:(?:[\(\-\)\.\/ \t\f]*\d){7,10})?(?:[\-\.\/ \t\f]?\d{2,3})(?:[\-\s]?[ext]{1,3}[\-\.\/ \t\f]?\d{1,4})?$)((?:\+|0{2})\d{0,3})?(?:[\-\.\/ \t\f]?)(\(0\d[ ]?\d{0,4}\)|\(\d{0,4}\)|\d{0,4})(?:[\-\.\/ \t\f]{0,2}\d){3,8}(?:[\-\s]?(?:x|ext)[\-\t\f ]?(\d{1,4}))?$
As far as I know, it passes the tests I put in the question and some more that I added on that Regex101.com page. You can even fork it, a very useful feature indeed; I'm a new fan. :)
The code seems to work, as is, with PHP (PCRE), Python and JavaScript (but not Golang), with varying performance that is not awesome but good enough for our purpose.
For instance, I wanted to use \h for horizontal whitespace (instead of \t, \f and space), but it is less compatible across platforms.
It still needs a lot of improvements, and I'm eager to see what you will be cooking to answer this little problem of ours, but I'm spent... already a sunny morning here. Good night folks.
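To re-run the test bundles outside Regex101, the pattern can be dropped into a small Python harness. This is only a sketch: the names PHONE_RE and check are made up for illustration, the sample list is a small subset of the numbers in the question, and the script prints what the regex does rather than claiming what it should do.

```python
import re

# The version 4 pattern from this post, split over several lines for readability.
PHONE_RE = re.compile(
    r"^(?=(?:\+|0{2})?(?:(?:[\(\-\)\.\/ \t\f]*\d){7,10})?"
    r"(?:[\-\.\/ \t\f]?\d{2,3})"
    r"(?:[\-\s]?[ext]{1,3}[\-\.\/ \t\f]?\d{1,4})?$)"
    r"((?:\+|0{2})\d{0,3})?(?:[\-\.\/ \t\f]?)"
    r"(\(0\d[ ]?\d{0,4}\)|\(\d{0,4}\)|\d{0,4})"
    r"(?:[\-\.\/ \t\f]{0,2}\d){3,8}"
    r"(?:[\-\s]?(?:x|ext)[\-\t\f ]?(\d{1,4}))?$"
)


def check(number):
    """Return the three captured groups (country, area, extension)
    when the number validates, or None when it does not."""
    match = PHONE_RE.match(number)
    return match.groups() if match else None


if __name__ == "__main__":
    # Hypothetical sample bundle: two wanted and two unwanted numbers.
    samples = ["112", "+34 987 654 321", "++34123456789", "12(34567890"]
    for sample in samples:
        print(repr(sample), "->", check(sample))
```

Groups that did not take part in the match come back as None, and the area-code group can be an empty string when the digits have no separators to delimit it.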

Related

Can you limit the words between two capturing groups in Regex

I have been trying to create a parser for Law texts.
I need to find a way to find "external links" like : art. 45 alin. (1) din Lege nr. 54/2000
But the problem is that my country's law writing style is so, soooo lacking in uniformity that sometimes the links might look like this : articolul 45 alineatul (1) din Legeea nr. 30/2000
On top of that, my language has word forms for days (articol, articolului, articolelor...).
That means that I need to generalize that first thing (art.) to catch as many forms as possible, and pray that the last thing is a law number & year (54/2000).
Now here comes the hard part... The problem is that every section that starts with Articol N triggers the regex, which goes on and on until it finds a law number & year that has absolutely no relation to that article.
This is how it looks \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question: is there a way to limit the number of words between the two capturing groups?
Link to a Docs to determine what should pass and what not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}.
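A minimal Python sketch of the bounded-gap idea, for illustration only. The pattern below is a simplified stand-in for the original regex, not the one actually used, and the names link_pattern and find_links are hypothetical. It uses \S+\s+ for each token in the gap so that abbreviations with punctuation still count as one word.

```python
import re


def link_pattern(max_words=15):
    """Build a pattern that allows at most max_words tokens between groups."""
    return re.compile(
        r"\b(art(?:\.|\w*)\s+\d+)"   # group 1: 'art.', 'articolul', ... plus a number
        r"\s+(?:\S+\s+){0,%d}?"      # bounded gap: at most max_words tokens
        r"nr\.?\s+(\d+/\d{4})\b"     # group 2: law number / year
        % max_words,
        re.IGNORECASE,
    )


def find_links(text, max_words=15):
    """Return (article, law) pairs found in text."""
    return link_pattern(max_words).findall(text)


if __name__ == "__main__":
    print(find_links("art. 45 alin. (1) din Lege nr. 54/2000"))
    # A section heading followed by 20 unrelated words is rejected at the default limit.
    print(find_links("Articol 5 " + "cuvant " * 20 + "nr. 10/1999"))
```

Because each repetition of the gap consumes exactly one whitespace-delimited token, the {0,15} bound is a real word limit rather than a character limit.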

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame, but I would like the search to exclude three kinds of strings: a game/a shame, A game/A shame, and a/an/A/An WORD game or a/an/A/An WORD shame, where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or shame is most commonly great or real. So even excluding just these two would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I tried using ?\w in the lookbehind, but it just wouldn't work. The pattern below, without ?, runs and still excludes examples such as a shame, but it also returns undesired examples such as a great shame or a crying shame; see concordance lines (3) and (4) in the sample text below:
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)
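For readers outside a PCRE tool: Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, but the same "match and discard" idea can be sketched with an alternation and a capture group. This is an illustrative assumption about how one might port the pattern, not something from the answer itself; the names SHAME_RE and find_shame are made up.

```python
import re

# Left branch consumes the unwanted "a/an (WORD) shame" forms without capturing.
# Right branch captures a bare "shame"; only matches with group 1 set are kept.
SHAME_RE = re.compile(r"\b[Aa]n?(?: \w+)? shame\b|\b(shame)\b")


def find_shame(text):
    """Return (start, end) spans of 'shame' not preceded by a/an (+ one word)."""
    return [m.span(1) for m in SHAME_RE.finditer(text) if m.group(1)]


if __name__ == "__main__":
    sample = "which is a great shame. Hang your heads in shame, we have no shame."
    for start, end in find_shame(sample):
        print(start, sample[start:end])
```

The unwanted forms are still matched, which is what stops the scanner from finding the inner "shame" on its own; they are simply filtered out afterwards.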

Excluding % from a Regex number search

I'm attempting to create a Regex that finds only integers of up to 2 digits or numbers with 2 decimal places.
In the example string at the bottom, I want to find only the following:
21 and 10.50
Using this expression, 100% is getting captured, in addition to the strings I desire to capture:
(\d){1,2}(\.?)([0-9]?[0-9]?){1,2}
I know I need to use ^% somewhere, but I can't figure out where it goes. Any suggestions are greatly appreciated.
Here's my sample string:
Earn Up to $21 Per Hour - Deliver Food with !!
Delivery Drivers work when they want and make great money when they do.
All orders are prepaid, just pick them up and deliver them to hungry diners. No waiting in line or fumbling with receipts and prepaid cards.
It's fast and easy to start working. Get started today.
Apply Now
Why choose ?
More orders than any other takeout platform
100% of our restaurants are official partners
Competitive pay: Per order fee + mileage + tips
We guarantee an hourly minimum of $10.50/hour*
Create your own schedule & work the hours you want
Word boundaries in your regular expression will grant you a bit more control.
Since word boundaries are a bit strict, we need to introduce an OR condition to address both cases which will satisfy your regex.
(\b[\d]{2}\.[\d]{2}\b)|(\b[\d]{2}\b)
Edit: Try this one,
\b[\d]{2}\b(\.[\d]{2})?
The first example has a chance to fail as it is order dependent due to the way it short-circuits. This I believe should address multiple cases properly.
I think this should work:
(?<!\d)((\d+\.\d\d)|(\d\d))(?!%|\d)
Demo (and explanation)
EDIT:
Improved version:
(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)
Demo (and explanation)
You can try this variant: (\d{1,}|[\d.])\b(?!%)
It uses a negative lookahead (?!%) to exclude digits followed by a % sign.
Details at regex101
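As a quick check, the "improved version" with the lookbehind and lookahead can be run against a few lines of the sample string in Python. This is just a sketch; the names NUMBER_RE and find_numbers are hypothetical.

```python
import re

# The improved pattern from the answer above: one or two digits, optional
# one- or two-digit fraction, not glued to other digits and not followed by %.
NUMBER_RE = re.compile(r"(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)")


def find_numbers(text):
    """Return every number in text that satisfies NUMBER_RE."""
    return NUMBER_RE.findall(text)


if __name__ == "__main__":
    sample = (
        "Earn Up to $21 Per Hour\n"
        "100% of our restaurants are official partners\n"
        "We guarantee an hourly minimum of $10.50/hour*"
    )
    print(find_numbers(sample))
```

The (?<!\d) and (?!\d) guards are what keep the pattern from picking two digits out of the middle of 100.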

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expressions. But now that it is being migrated to Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 and 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
date (month/day/year, with leap years)
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE @s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE @s + ' ' LIKE '[3-7]'+ REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(@s) = 15 AND @s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(@s) = 16 AND @s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE @s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - it seems overkill to do invasive pattern matching for a set of 7 possible values. Similarly, with a list of 1000 numbers or years, it will be much easier (and probably more efficient) to check whether the numeric value is in a table than to convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or make complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.
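If the validation moves out of the database, as suggested above, the two LIKE-based format checks can be mirrored in application code before the insert. A hedged Python sketch follows; the function names are hypothetical, and like the T-SQL versions these only check the general shape, not whether a card number or phone number is actually valid.

```python
import re


def looks_like_card(value):
    """Format check only: 15 digits starting with 3, or 16 starting with 4-7."""
    digits = re.sub(r"[\s-]", "", value)  # strip dashes and spaces first
    return re.fullmatch(r"3[0-9]{14}|[4-7][0-9]{15}", digits) is not None


def looks_like_us_phone(value):
    """Format check only: 10 digits, the first of which is 2-9."""
    digits = re.sub(r"[\s()-]", "", value)  # strip parentheses, dashes, spaces
    return re.fullmatch(r"[2-9][0-9]{9}", digits) is not None


if __name__ == "__main__":
    print(looks_like_card("4111-1111-1111-1111"))
    print(looks_like_us_phone("(234) 567-8901"))
```

Using [0-9] rather than \d keeps the behaviour close to the LIKE patterns, since \d in Python also accepts non-ASCII digits.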

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".
A few that I can think of off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)
There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info - everything from simple literal matching to backreferences and lookahead.
To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsyth-Edwards Notation, which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
Additionally, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4
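To give an idea of what a model answer to the FEN exercise might look like, here is one possible sketch in Python using named groups for the six space-separated fields. The names FEN_RE and parse_fen are made up, and the pattern is deliberately loose: it does not check that each rank adds up to eight squares, which is a good follow-up question for students.

```python
import re

FEN_RE = re.compile(
    r"^(?P<placement>(?:[pnbrqkPNBRQK1-8]{1,8}/){7}[pnbrqkPNBRQK1-8]{1,8})"
    r" (?P<active>[wb])"              # side to move
    r" (?P<castling>-|[KQkq]{1,4})"   # castling availability
    r" (?P<en_passant>-|[a-h][36])"   # en passant target square
    r" (?P<halfmove>[0-9]+)"          # halfmove clock
    r" (?P<fullmove>[0-9]+)$"         # fullmove number
)


def parse_fen(fen):
    """Return a dict of the six FEN fields, or None if the shape is wrong."""
    match = FEN_RE.match(fen)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_fen("rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2"))
```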
Validate phone numbers (extract area code + rest of number with grouping) (assuming US phone numbers; otherwise generalize for your style)
Play around with validating email addresses (you probably want to tell the students that a complete regular expression for this is hugely complicated, but for simple addresses it is pretty straightforward)
regexplib.com has a good library you can search through for examples.
How about extracting first name, middle name, last name and personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about a regex to remove line breaks and tabs from the input?
I would start with the common ones:
validate email
validate phone number
separate the parts of a URL
Be cruel. Tell them to parse HTML.
RegEx match open tags except XHTML self-contained tags
Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm
I'm a fan of parsing date strings. Define a few common date formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.
Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in visual studio uses regex to randomly generate data).
Rather than teaching examples based on the data set, I would do examples from the perspective of the rule set to get the basics across. Give them simple examples to solve that lead them to use ONE of several basic constructs in each solution. Then have a couple of "compound" regexes at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Character classes:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/rubble$/and betty/
Modifiers:
s/Abcd/def/gi
After this, I would give a few examples illustrating the pitfalls of trying to match html tags or other strings that shouldn't be done with regex's to show the limitations.
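For a class that works in Python rather than sed or Perl, the substitution rules above could be rendered with re.sub roughly as follows. This is a sketch with made-up input strings; the helper name demo is hypothetical, and the end anchor is written as rubble$ since $ anchors the end of the string.

```python
import re


def demo():
    """Return one re.sub result per rule, in the order listed above."""
    return [
        re.sub(r"abc", "def", "xabcx"),                         # simple literal
        re.sub(r"a\s*b", "abc", "a   b"),                       # quantifier + \s
        re.sub(r"[abc]", "def", "cat"),                         # character class
        re.sub(r"ab(c)", r"def\1", "abc"),                      # backreference
        re.sub(r"^fred", "wilma", "fred and fred"),             # start anchor
        re.sub(r"rubble$", "and betty", "barney rubble"),       # end anchor
        re.sub(r"Abcd", "def", "abcd ABCD", flags=re.IGNORECASE),  # modifier
    ]


if __name__ == "__main__":
    for result in demo():
        print(result)
```

re.sub replaces every occurrence by default, so the g modifier has no separate equivalent; only the i modifier needs a flag.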
Try to think of some tests whose answers can't simply be found with Google.
An email validator, for example, is trivially easy to find online.
Try something like a 5 proof test.
Input: 5 digits. The sum of the digits must be divisible by five: 12345 = 1+2+3+4+5 = 15 / 5 = 3(.0)
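A reference checker for grading this exercise could look like the sketch below. It is an assumption about how one might verify student answers, not a regex solution: the regex only confirms the five-digit shape, and plain arithmetic does the divisibility check. The name is_five_proof is made up.

```python
import re


def is_five_proof(value):
    """True when value is exactly five digits whose sum is divisible by five."""
    if re.fullmatch(r"[0-9]{5}", value) is None:
        return False
    return sum(int(ch) for ch in value) % 5 == 0


if __name__ == "__main__":
    for candidate in ["12345", "12346", "1234"]:
        print(candidate, is_five_proof(candidate))
```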