How to verify form input using HTML5 input verification - regex

I have tried finding a full list of patterns to use for verifying input via HTML5 form verification for various types, specifically url, email, tel and such, but I couldn't find any. Currently, the built-in versions of these input verifications are far from perfect (and tel doesn't even check if the thing you're entering is a phone number). So I was wondering, which patterns could I use for verifying the user is entering the right format in the inputs?
Here are a few examples of cases where the default verification allows input that is not supposed to be allowed:
type="email"
This field allows emails that have incorrect domains after the @, and it allows addresses to start or end with a dash or period, which isn't allowed either. So, .example-@x is allowed.
type="url"
This input basically allows any input that starts with http:// (Chrome) and is followed by anything other than a few special characters such as those that have a function in URLs (\, #, #, ~, etc). In FF, all that's checked is if it starts with http:, followed by anything other than : (even just http: is allowed in FF). IE does the same as FF, except that it doesn't disallow http::.
For example: http://. is allowed in all three. And so is http://,.
type="tel"
There currently is no built-in verification for phone numbers in any of the major browsers (it functions 100% the same as a type="text", other than telling mobile browsers which kind of keyboard to display).
So, since the browsers don't show a consistent behaviour in each of these cases, and since the behaviour they do show is very basic with many false positives, what can I do to verify my HTML forms (still using HTML5 input verification)?
PS: I'm posting this because I would find it useful to have a complete list of form verification patterns myself, so I figured it might be useful for others too (and of course others can post their solutions too).

These patterns aren't necessarily simple, but here's what I think works best in every situation. Keep in mind that (quite recently) Internationalized Domain Names (IDNs) are available too. With that, an un-testable amount of characters are allowed in URLs (there still exist lots of characters that aren't allowed in domain names, but the list of allowed characters is so big, and will change so often for different Top-Level Domains, that it's not practical to keep up with them). If you want to support the internationalized domain names, you should use the second URL pattern, otherwise, use the first.
##TL;DR:
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
##URLs, without IDN support
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
Explanation:
DNSes
URLs should always start with http:// or https://, since we don't want links to other protocols.
Domain names should not start or end with -
Domain names can be a maximum of 63 characters each (so a maximum of 63 characters between each dot), and the total length (including dots) cannot exceed 253 (or 255? be safe and bet on 253.) characters [1].
Non-IDNs can only support the letters of the Latin alphabet, the numbers 0 through 9, and a dash.
Top-level domains of non-IDNs contain only letters of the Latin alphabet, and at least two of them [2].
I've set an arbitrary limit of 15 letters, since there are currently no domains that exceed 13 characters (".international"), which most likely won't change any time soon.
IPs
Special cases such as 0.0.0.0, 127.0.0.1, etc. are not checked for
IPs that have padded zeroes in them are not allowed (for example 01.1.1.1) [4].
IP numbers can only go from 0 through 255. 256 is not allowed.
Note that the default http:.* pattern built into modern browsers will always be enforced, so even if you remove the https?:// at the start in this pattern, it will still be enforced. Use type="text" to avoid it.
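As a rough way to try this pattern outside a browser, the sketch below runs it through Python's `re.fullmatch`, which approximates the way a `pattern` attribute has to match the whole value. The helper name `is_valid_url` and the sample inputs are made up for illustration, and it assumes Python's regex engine treats this particular pattern the same way the browser's does.

```python
import re

# The non-IDN URL pattern from above, split over several lines for readability.
URL_PATTERN = re.compile(
    r"https?:\/\/(?![^\/]{253}[^\/])"
    r"((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}"
    r"|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}"
    r"(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))"
    r"(\/.*)?"
)


def is_valid_url(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return URL_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["http://example.com", "http://127.0.0.1/path",
                   "http://.", "http://256.1.1.1", "http://01.1.1.1"]:
        print(sample, is_valid_url(sample))
```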
##URLs, with IDN support
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Explanation:
Since there is a huge amount of characters that are allowed in IDNs, it's not practically possible to list every possible combination in a HTML attribute (you'd get a huge pattern, so in that case it's much better to test it by some other method than regex) [5].
Disallowed characters in domain names are: !"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~ with the exception of a period as domain separator.
These are matched in the ranges [!-,] [\.\/] [:-@] [\[-`] [{-~].
All other characters are allowed in this input field
TLDs are allowed to have the same letters in them, up to an arbitrary limit of 15 characters (like with the non-IDN URLs).
Alternatively, TLDs can be of the format xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
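A quick sketch for experimenting with the IDN-aware pattern, again using Python's `re.fullmatch` as a stand-in for the browser's whole-value matching. The character ranges are written as `:-@` (colon through at sign), the helper name `is_valid_idn_url` is hypothetical, and the sample hosts are only illustrations.

```python
import re

# The IDN-aware URL pattern, split over several lines for readability.
IDN_URL_PATTERN = re.compile(
    r"https?:\/\/(?!.{253}.+$)"
    r"((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+"
    r"([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})"
    r"|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}"
    r"([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))"
    r"(\/.*)?"
)


def is_valid_idn_url(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return IDN_URL_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["http://m\u00fcnchen.de", "http://example.xn--p1ai",
                   "http://bad_name.com", "http://exa mple.com"]:
        print(sample, is_valid_idn_url(sample))
```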
##Email addresses
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Explanation:
A 100% foolproof email pattern requires a whole lot more than this, but this pattern covers nearly all addresses in practice. A 100% complete pattern does exist, but it relies on PCRE (PHP)-only regex features, so it won't work in HTML forms.
Email addresses can only contain letters of the Latin alphabet, the numbers 0-9, and the characters in !#$%&'*+\/=?^_`{|}~.- [6].
Accents are not universally supported [7], but if needed, post a comment, and I could perhaps write a version that meets the RFC 6530 standard.
The local part (before the @) can only be 64 characters long, and the total address can only be 254 characters long [8]. The pattern above enforces only the total-length limit.
Addresses may not start or end with a - or ., and no two dots may appear consecutively [8].
The domain may not be an IP address [9].
Other than that, I only included the non-IDN part of the pattern. IDNs are allowed too though, so those will result in false negatives.
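The rules above can be checked against a few sample addresses with the sketch below. It is only an approximation of what the browser does: Python's `re.fullmatch` stands in for the `pattern` attribute, `@` is used as the separator, and `is_valid_email` is a made-up helper name.

```python
import re

# The email pattern from above, split over several lines for readability.
EMAIL_PATTERN = re.compile(
    r"(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)"
    r"([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)"
    r"(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}"
)


def is_valid_email(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["user@example.com", ".example-@x", "a..b@example.com",
                   "user-@example.com", "user@-example.com", "user@127.0.0.1"]:
        print(sample, is_valid_email(sample))
```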
##Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
Phone numbers must start with one of the following, where [CTRY] stands for the country code, and X stands for the first non-zero digit (such as 6 in mobile numbers),
00[CTRY]X
+[CTRY]X
0X
[CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
Spaces are allowed between the digits (see the second pattern for the space-less version), except before the non-zero X as defined above.
Phone numbers must be exactly 9 digits long, other than the part before the first non-zero X as defined above.
This regex is just for 10-digit phone numbers. Since phone number lengths may vary between countries, it's best to use a less strict version of this pattern, or modify it to work for the desired countries. So, this pattern should generally be used as a kind of template pattern.
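To see how the two template patterns treat spaces and prefixes, here is a small sketch. The sample numbers are invented, and `re.fullmatch` is assumed to be a fair stand-in for the whole-value matching of the `pattern` attribute.

```python
import re

# The two phone patterns from above: with and without optional spaces.
PHONE_WITH_SPACES = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}")
PHONE_NO_SPACES = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}")


def check(value: str) -> tuple[bool, bool]:
    """Return (matches the spaced pattern, matches the space-less pattern)."""
    return (
        PHONE_WITH_SPACES.fullmatch(value) is not None,
        PHONE_NO_SPACES.fullmatch(value) is not None,
    )


if __name__ == "__main__":
    for sample in ["0612345678", "06 12 34 56 78", "+31612345678",
                   "0031612345678", "061234567", "+31 612345678"]:
        print(sample, check(sample))
```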
##Extra: Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, this is very western-centric, but it may be useful as well, since a pattern like this can be tricky to write. If you're making a site for a western audience, it should work in most cases (Asian names have a romanized representation in this format too).
All names must start with an uppercase letter
Uppercase letters may occur in the middle of names (such as John McDoe)
Names must be at least 2 letters long
I've set an arbitrary maximum of 10 names (these people probably won't mind), each of which can be at most 20 letters long (the length of "Werbenjagermanjensen", who happens to be #1).
Latin and Greek letters are allowed, including all accented Latin and Greek letters (list) and Icelandic letters (ÐÞ ðþ):
A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ά-Ϋ matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊ΋Ό΍ΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩ ΪΫ.
À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
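The name pattern can be tried out with the sketch below. The character ranges are written as `\u` escapes so the code points are explicit (for example `\u0386-\u03AB` is Ά-Ϋ); the helper name `is_valid_name` and the sample names are assumptions for illustration only.

```python
import re

# Same ranges as the pattern above, spelled out as code points:
# A-Z, U+0386-U+03AB (Greek upper), U+00C0-U+00D6 and U+00D8-U+00DE (Latin upper),
# a-z, U+03AC-U+03CE (Greek lower), U+00DF-U+00F6 and U+00F8-U+00FF (Latin lower).
UPPER = "A-Z\u0386-\u03AB\u00C0-\u00D6\u00D8-\u00DE"
LOWER = "a-z\u03AC-\u03CE\u00DF-\u00F6\u00F8-\u00FF"
NAME_PATTERN = re.compile("([" + UPPER + "][" + UPPER + LOWER + "]{1,19} ?){1,10}")


def is_valid_name(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return NAME_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["John McDoe", "\u00d3lafur \u00de\u00f3r", "john", "J"]:
        print(sample, is_valid_name(sample))
```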
##References
[1] https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax → https://www.rfc-editor.org/rfc/rfc1034#section-3.1
[2] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains / https://www.icann.org/resources/pages/tlds-2012-02-25-en
[3] https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process / What are the allowed characters in a subdomain?
[4] Based on the fact neither browsers nor the Windows cmd line allow the padded format.
[5] What are the allowed characters in a subdomain? → http://www.domainnameshop.com/faq.cgi?id=8&session=106ee5e67d523298
[6] https://en.wikipedia.org/wiki/Email_address#Local_part / What characters are allowed in an email address?
[7] https://en.wikipedia.org/wiki/Email_address#Internationalization
[8] https://en.wikipedia.org/wiki/Email_address#Syntax → https://www.rfc-editor.org/rfc/rfc5321#section-4.5.3.1
[9] Sending Email using IP Address instead of Domain Name

Related

Expanding List of Special Characters Allowed In Regex

I found the regex below that I'm using to validate password complexity. How can I modify it to include these characters -_+=#^~ ?
current regex
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$@$!%*?&])[A-Za-z\d$@$!%*?&]{8,}
conditions
Minimum eight characters, at least one uppercase letter, one lowercase letter, one number and one special character
You can include those special characters in the character classes:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[-_+=#^~$@$!%*?&])[\w+=#^$@$!%*?&~-]{8,}$
RegEx Demo
Just remember to keep unescaped hyphen either at start or at the end of the character class and keep ^ in the middle to avoid interpreting it as negation.
Brief
I see these types of questions get posted here all the time, especially with the javascript tag.
The way you're validating passwords is actually very wrong. Don't limit the passwords to a specific set of characters. You're making hackers' jobs extremely easy. How many iterations of the characters abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_+=#^~$@!%*?& are there? Yes, a lot, but limiting the characters to that set reduces the number of iterations possible. Your character set includes 76 characters.
Now let's do some math. 76 characters, password of length 8 (let's be honest, even though we don't like to admit it, most users use a password that's as short as possible, so 8 characters in your case). That means there are 760,269,225,744,000 possible permutations of those characters (counting arrangements in which no character is used twice).
Great! Now what? Adding one more character to the set (77 characters instead of 76) we now get 848,416,382,352,000 permutations (+88,147,156,608,000 permutations). One more (78) yields 945,378,254,620,800 (+96,961,872,268,800 permutations) etc. As you can see, each character added to the set increases the number of permutations by roughly 11%, and the increase compounds.
Whilst adding additional characters to your set may not actually increase password strength (users may still use e in the password instead of è), it at least gives users the option to try to make their passwords stronger.
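The figures quoted above can be reproduced with a few lines of Python. They correspond to 8-character arrangements drawn from the set with no character repeated; since a real password may repeat characters, the sketch also prints the larger count of 76^8 for comparison. The variable names are arbitrary.

```python
import math

LENGTH = 8

for alphabet_size in (76, 77, 78):
    # Arrangements of LENGTH distinct characters: n * (n-1) * ... * (n-LENGTH+1).
    no_repeats = math.perm(alphabet_size, LENGTH)
    # Arrangements when any character may be reused: n ** LENGTH.
    with_repeats = alphabet_size ** LENGTH
    print(f"{alphabet_size} chars: {no_repeats:,} without repeats, "
          f"{with_repeats:,} with repeats")
```

Either way, the count grows with every character the policy allows, which is the point being made here.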
According to OWASP (the Open Web Application Security Project) - a worldwide not-for-profit organization focused on improving the security of software (from their article on Password Storage Cheat Sheet):
Do not limit the character set and set long max lengths for credentials
Some organizations restrict the 1) types of special characters and 2) length of credentials accepted by systems because of their inability to prevent SQL Injection, Cross-site scripting,
command-injection and other forms of injection attacks. These
restrictions, while well-intentioned, facilitate certain simple
attacks such as brute force.
Do not allow short or no-length passwords and do not apply character
set, or encoding restrictions on the entry or storage of credentials.
Continue applying encoding, escaping, masking, outright omission, and
other best practices to eliminate injection risks.
A reasonable long password length is 160. Very long password policies
can lead to DOS in certain circumstances.
An interesting read: Think you have a strong password? Hackers crack 16-character passwords in less than an HOUR.
Code
All that being said, I understand trying to help users in the creation of a strong password. For that you can use the following regex (note that not all regex flavours support this, but most languages will support some form of Unicode support, this will need to be adapted for those languages). Also note that this should be run server-side only as doing so client-side exposes information about your password requirements in plain-sight to any hackers (yes, it's still possible for them to figure it out by creating an account and trying to use easy passwords, but it still means they have to put a little bit of effort into figuring out what is and is not allowed):
^(?=.*\p{Ll})(?=.*\p{Lu})(?=.*\p{N})(?=.*[^\p{L}\p{N}\p{C}]).{8,}$
Explanation
^ Assert position at the start of the line
(?=.*\p{Ll}) Positive lookahead ensuring at least one lowercase letter (in any language/script) exists
(?=.*\p{Lu}) Positive lookahead ensuring at least one uppercase letter (in any language/script) exists
(?=.*\p{N}) Positive lookahead ensuring at least one number (in any language/script) exists
(?=.*[^\p{L}\p{N}\p{C}]) Positive lookahead ensuring at least one character that isn't a letter, number or control character (in any language/script) exists
.{8,} Match any character 8 or more times
$ Assert position at the end of the line
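For a language whose built-in regex engine lacks `\p{...}` classes (Python's standard `re` module is one), roughly the same checks can be written against Unicode general categories instead. This is a swapped-in technique rather than the regex itself, and the function name `meets_policy` and its `min_length` parameter are invented for the sketch.

```python
import unicodedata


def meets_policy(password: str, min_length: int = 8) -> bool:
    """Approximate the lookaheads above using Unicode general categories."""
    categories = [unicodedata.category(ch) for ch in password]
    has_lower = "Ll" in categories                               # \p{Ll}
    has_upper = "Lu" in categories                               # \p{Lu}
    has_number = any(cat.startswith("N") for cat in categories)  # \p{N}
    # At least one character that is not a letter (L*), number (N*) or "other" (C*).
    has_symbol = any(cat[0] not in "LNC" for cat in categories)
    return (
        len(password) >= min_length
        and has_lower
        and has_upper
        and has_number
        and has_symbol
    )


if __name__ == "__main__":
    for sample in ["Passw0rd!", "Passw0rd", "password1!", "Pw0!"]:
        print(sample, meets_policy(sample))
```

One difference to keep in mind: the length here is counted in code points and includes newlines, whereas `.` in the regex normally does not match a newline.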

Allow only letters and digits in strings but without confusables

Say I want usernames to only consist of letters and digits regardless of language.
I think I might accomplish this with the following regex parts
(?>\p{L}[\p{Mn}\p{Mc}]*) //match any letter, including those consisting of two code points
\p{Nd} //match any digit
Now I have the problem that users may pretend to be other users by choosing a username that looks the same as another user's (a homograph attack). admin vs admin, where one of the two contains a look-alike letter from another script, would be an example.
I guess it's not possible to easily exclude characters that are both letters and confusables using a regex but how about outside the context of the regexes. Do the unicode ids of confusables lie in certain ranges that we could filter or something like that?
Confusables... Then it comes to mind that you are talking about Cyrillic characters. If that's right, you can easily exclude them from your RegEx. Consider following ranges:
Cyrillic: U+0400–U+04FF, 256 characters
Cyrillic Supplement: U+0500–U+052F, 48 characters
Cyrillic Extended-A: U+2DE0–U+2DFF, 32 characters
Cyrillic Extended-B: U+A640–U+A69F, 96 characters
Phonetic Extensions: U+1D2B, U+1D78, 2 Cyrillic characters
Then:
/[^\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{2DE0}-\x{2DFF}\x{A640}-\x{A69F}\x{1D2B}\x{1D78}]/u
Or simply by using [^\p{Cyrillic}]
The Unicode standard includes a list of confusable characters at http://www.unicode.org/Public/security/revision-02/confusables.txt
This list is incomplete according to some, and too aggressive according to others, but take a look at it in order to understand how difficult the problem is to solve generically.
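Outside of a regex, the Cyrillic ranges listed above can be checked directly against code points. The sketch below does only that; it covers one script, not the full confusables list, and the name `contains_cyrillic` is made up.

```python
# Code point ranges taken from the list above (inclusive on both ends).
CYRILLIC_RANGES = [
    (0x0400, 0x04FF),  # Cyrillic
    (0x0500, 0x052F),  # Cyrillic Supplement
    (0x2DE0, 0x2DFF),  # Cyrillic Extended-A
    (0xA640, 0xA69F),  # Cyrillic Extended-B
    (0x1D2B, 0x1D2B),  # Phonetic Extensions, single Cyrillic code point
    (0x1D78, 0x1D78),  # Phonetic Extensions, single Cyrillic code point
]


def contains_cyrillic(text: str) -> bool:
    """Hypothetical helper: True if any character falls in a listed range."""
    return any(
        low <= ord(ch) <= high
        for ch in text
        for low, high in CYRILLIC_RANGES
    )


if __name__ == "__main__":
    print(contains_cyrillic("admin"))        # all Latin letters
    print(contains_cyrillic("\u0430dmin"))   # first letter is Cyrillic U+0430
```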

Looking to build some regex to validate domain names (RFC 952/ RFC 1123)

One of our clients validates email addresses in their own software prior to firing it via an API call to our system. The issue, however, is that their validation rules do not match those of our system, so they are parsing and accepting addresses which break our rules. This is causing lots of failed calls.
They are accepting things like "dave@-whatever.com", which goes against RFC 952/RFC 1123 rules because the domain begins with a hyphen. They have asked that we provide them with our regex list so they can update validation on their platform to match ours.
So, I need to find/build a regex that accepts only RFC 952/RFC 1123 compliant domain names. I found this in another SO thread (I'm a lurker :)); would it be suitable and prevent these illegal domains from being sent?
"^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$";
A domain part has a max length of 255 characters and can only consist of digits, ASCII letters and hyphens; a hyphen cannot come first.
Checking the validity of one domain component can be done using this regex, case insensitive, length notwithstanding:
[a-z0-9]+(-[a-z0-9]+)*
This is the normal* (special normal*)* pattern again, with normal being [a-z0-9] and special being -.
Then you take all this in another normal* (special normal*)* pattern as the normal part, and the special being ., and anchor it at the beginning and end:
^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$
If you cannot afford case insensitive matching, add A-Z to the character class.
But please note that it won't check for the max length of 255. It may be done using a positive lookahead, but the regex will become very complicated, and it is shorter to be using a string length function ;)
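Combining the regex with a plain length check could look like the sketch below. It is written in Python purely for illustration; `is_valid_domain` is a hypothetical name, and the 255 limit is the figure quoted above.

```python
import re

# The anchored pattern from above, matched case-insensitively.
DOMAIN_PATTERN = re.compile(
    r"^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$",
    re.IGNORECASE,
)
MAX_DOMAIN_LENGTH = 255


def is_valid_domain(domain: str) -> bool:
    """Check the length with len() and the structure with the regex."""
    if len(domain) > MAX_DOMAIN_LENGTH:
        return False
    # fullmatch avoids "$" accepting a trailing newline in Python.
    return DOMAIN_PATTERN.fullmatch(domain) is not None


if __name__ == "__main__":
    for sample in ["example.com", "-whatever.com", "a-.com", "localhost"]:
        print(sample, is_valid_domain(sample))
```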

A typical regex for phone number validation in Java

I am struggling with getting the right regex for a phone number validation in my application. I have got a regex that will accept only numbers and some special symbols like ()- etc, however, the problem is that it accepts only symbols as well. So for example, it would accept something like ()()()(). I want to modify the regex or get a whole new regex that accepts these symbols but it should have at least one number before and after each symbol.
My requirements are:
Only numbers
Number with combination of special symbols
Each symbol should be followed by a number (before and after) but white spaces are okay
Max length should be 15
In my experience, the parenthesis only appear around the first group of digits and there are never fewer than 3 digits in a group. This regex does that, and prevents multiple consecutive separators with the exception of a space following a paren "(123) 456-7890". I also added support for periods as separators. It allows for 1, 2, or 3 groups of numbers and attempts to enforce an overall range of 7-15 digits but it errs on the permissive side.
^\\s*((\\d{7,15})|(\\d{3,12}[\\-.]?\\s?\\d{3,12}[\\-.\\s]?)|([(]?\\d{3,9}[)\\-.]?\\s?\\d{3,9}[\\-.\\s]?\\d{3,9}))\\s*$
In my environment I have to escape the backslashes; you may not have to, in which case replace each \\ with \. The hyphen is escaped so that it is read literally rather than as a range inside the character class.
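The same idea, ported to Python as a sketch so it can be tried quickly. The three alternatives are joined with a single `|` inside one group and matched against the whole string; the extra digit count is an addition of this sketch (not part of the regex) to hold the total to 7-15 digits, and `is_plausible_phone` is a made-up name.

```python
import re

# One, two, or three groups of digits with optional separators.
PHONE_PATTERN = re.compile(
    r"\s*("
    r"\d{7,15}"
    r"|\d{3,12}[-.]?\s?\d{3,12}[-.\s]?"
    r"|[(]?\d{3,9}[)\-.]?\s?\d{3,9}[-.\s]?\d{3,9}"
    r")\s*"
)


def is_plausible_phone(text: str) -> bool:
    """Structure check via the regex, then an explicit 7-15 digit count."""
    if PHONE_PATTERN.fullmatch(text) is None:
        return False
    digits = sum(ch.isdigit() for ch in text)
    return 7 <= digits <= 15


if __name__ == "__main__":
    for sample in ["(123) 456-7890", "123-4567", "1234567", "()()()()", ""]:
        print(repr(sample), is_plausible_phone(sample))
```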

Regex match, quite simple:

I'm looking to match Twitter syntax with a regex.
How can I match anything that is "@______", that is, begins with an @ symbol and is followed by no spaces, just letters and numbers until the end of the word? (To tweeters, I want to match someone's name in a reply)
Go for
/@(\w+)/
to get the matching name extracted as well.
@\w+
That simple?
It should be noted that Twitter no longer allows usernames longer than 15 characters, so you can also match with:
@\w{1,15}
There are still apparently a few people with usernames longer than 15 characters, but testing on 15 would be better if you want to exclude likely false positives.
There are apparently no rules regarding whether underscores can be used at the beginning or end of usernames, multiple underscores, etc., and there are accounts with single-letter names, as well as someone with the username "_".
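A small sketch of extracting mentions with the 15-character limit, written in Python for illustration. The trailing `(?!\w)` is an addition of this sketch: without it, a longer name would still match on its first 15 characters instead of being skipped. `re.ASCII` is assumed so that `\w` means only letters, digits and underscore, and `find_mentions` is a hypothetical name.

```python
import re

# "@" followed by 1-15 word characters, and no further word character after that.
MENTION_PATTERN = re.compile(r"@(\w{1,15})(?!\w)", re.ASCII)


def find_mentions(text: str) -> list[str]:
    """Return the usernames mentioned in the text, without the leading @."""
    return MENTION_PATTERN.findall(text)


if __name__ == "__main__":
    print(find_mentions("thanks @jack and @_ for the tip"))
    print(find_mentions("@" + "a" * 16))
```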
@[\d\w]+
\d for a digit character
\w for a word character
[] to denote a character class
+ to represent one or more instances of the character class
Note that these specifiers for word and digit characters are language dependent. Check the language specification to be sure.
There is a very extensive API for how to get valid twitter names, mentions, etc. The Java version of the API provided by Twitter can be found on github twitter-text-java. You may want to take a look at it to see if this is something you can use.
I have used it to validate Twitter names and it works very well.