I looked up couple of questions on SO, which seem to suggest that two continuous hyphens (e.g. my--website.com) are not allowed but when I search for same domain name on http://www.register.com/index.rcmx, it gladly accepts the name while rejects non valid domain names like my#website.com.
Validation for URL/Domain using Regex? (Rails)
It's legal in a domain name, and required for internationalised domain names (IDNs) which when converted from Unicode to ASCII end up prefixed with xn--
In general double hyphens are allowed. However, in your specific example, it should not be possible, because they may not occur on the third and fourth position except when writing IDN labels in their xn-- notation. See the following section from RFC 5891:
4.2.3.1. Hyphen Restrictions
The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
the third and fourth character positions and MUST NOT start or end with a "-" (hyphen).
Some TLDs allow as many consecutive dashes as you want, others seem to have specific rules about their positioning.
This is a working website: l-------------------------------------------------------------l.tk (mirror)
The universal rules are:
Not in the 3rd and 4th position, except as part of an IDN
Not at the beginning or end
Nothing else can be counted on unless you thoroughly test each individual TLD.
It must be "allowed", regardless of the answers here, because
https://hp--community.force.com exists. However, perhaps that's only okay because it's a subdomain of a registered domain, and not a registered domain in itself.
Related
I have this regex (not mine, taken from here)
^[^\.]+\.example\.org$
The regex will match *.example.org (e.g. sub.example.org), but will leaves out sub-subdomain (e.g. sub.sub.example.org), that's great and it is what I want.
But I have other requirement, I want to match subdomain that contain specific string, in this case press. So the regex will match following (literally any subdomain that has word press in it).
free-press.example.org
press.example.org
press23.example.org
I have trouble finding the right syntax, have looked on the internet and mostly they works only for standalone text and not domain like this.
Ok, let's break down what the "subdomain" part of your regex does:
[^\.]+ means "any character except for ., at least once".
You can break your "desired subdomain" up into three parts: "the part before press", "press itself", and "the part after press".
For the parts before and after press, the pattern is basically the same as before, except that you want to change the + (one or more) to a * (zero or more), because there might not be anything before or after press.
So your subdomain pattern will look like [^\.]*press[^\.]*.
Putting it all together, we have ^[^\.]*press[^\.]*\.example\.org$. If we put that into Regex101 we see that it works for your examples.
Note that this isn't a very strict check for valid domains. It might be worth thinking about whether regexes are actually the best tool for the "subdomain checking" part of this task. You might instead want to do something like this:
Use a generic, more thorough, domain-validation regex to check that the domain name is valid.
Split the domain name into parts using String.split('.').
Check that the number of parts is correct (i.e. 3), and that the parts meet your requirements (i.e. the first contains the substring press, the second is example, and the third is org).
If you're looking for a regex that matches URLs whose subdomains contain the word press then use
^[^\.]*press[^\.]*\.example\.org$
See the demo
I'am having some problems creating a regex to allow the only valid domain names.
The rules are:
It has 3 characters minimum.
Can have dots but can't have two in a row. Can't have another special characters
Can have lower and upper case letters and numbers
Between points, it needs to have at least one character
For example:
Valid domain name -> bruno.cCm.pt3
Invalid domain name -> bruno..com (or) bruno.
What I have right now is this: ^.{2,253}([A-Za-z\d](-*[A-Za-z\d])*)(\.([A-Za-z\d](-*[A-Za-z\d])*))*$
Try with this one, i made some test and i think it solves your problem:
(?:a-z0-9?.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]
Should also help you in extracting domain names from more complex strings.
This one should also manage upper/lower case:
(?:a-zA-Z0-9?.)+[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9]
Since
In October 2009, the Internet Corporation for Assigned Names and Numbers (ICANN) approved the creation of country code top-level domains (ccTLDs) in the Internet that use the IDNA standard for native language scripts.
We just validate the a-zA-Z at in the past. But now, I want to validate a unicode email such as a Chinese email 我#在.中国 or other languages. How to validate them by RegExp?
Here is a validation regex that I wrote for maximum Unicode support and reasonably good overall adherence to RFC standards.
JS:
/^(?!\.)((?!.*\.{2})[a-zA-Z0-9\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF\.!#$%&'*+-/=?^_`{|}~\-\d]+)#(?!\.)([a-zA-Z0-9\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF\-\.\d]+)((\.([a-zA-Z\u0080-\u00FF\u0100-\u017F\u0180-\u024F\u0250-\u02AF\u0300-\u036F\u0370-\u03FF\u0400-\u04FF\u0500-\u052F\u0530-\u058F\u0590-\u05FF\u0600-\u06FF\u0700-\u074F\u0750-\u077F\u0780-\u07BF\u07C0-\u07FF\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u10A0-\u10FF\u1100-\u11FF\u1200-\u137F\u1380-\u139F\u13A0-\u13FF\u1400-\u167F\u1680-\u169F\u16A0-\u16FF\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1800-\u18AF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1B00-\u1B7F\u1D00-\u1D7F\u1D80-\u1DBF\u1DC0-\u1DFF\u1E00-\u1EFF\u1F00-\u1FFF\u20D0-\u20FF\u2100-\u214F\u2C00-\u2C5F\u2C60-\u2C7F\u2C80-\u2CFF\u2D00-\u2D2F\u2D30-\u2D7F\u2D80-\u2DDF\u2F00-\u2FDF\u2FF0-\u2FFF\u3040-\u309F\u30A0-\u30FF\u3100-\u312F\u3130-\u318F\u3190-\u319F\u31C0-\u31EF\u31F0-\u31FF\u3200-\u32FF\u3300-\u33FF\u3400-\u4DBF\u4DC0-\u4DFF\u4E00-\u9FFF\uA000-\uA48F\uA490-\uA4CF\uA700-\uA71F\uA800-\uA82F\uA840-\uA87F\uAC00-\uD7AF\uF900-\uFAFF]){2,63})+)$/i
PHP:
/^(?!\.)((?!.*\.{2})[a-zA-Z0-9\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}\.!#$%&'*+-\/=?^_`{|}~\-\d]+)#(?!\.)([a-zA-Z0-9\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}\-\.\d]+)((\.([a-zA-Z\x{0080}-\x{00FF}\x{0100}-\x{017F}\x{0180}-\x{024F}\x{0250}-\x{02AF}\x{0300}-\x{036F}\x{0370}-\x{03FF}\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{0530}-\x{058F}\x{0590}-\x{05FF}\x{0600}-\x{06FF}\x{0700}-\x{074F}\x{0750}-\x{077F}\x{0780}-\x{07BF}\x{07C0}-\x{07FF}\x{0900}-\x{097F}\x{0980}-\x{09FF}\x{0A00}-\x{0A7F}\x{0A80}-\x{0AFF}\x{0B00}-\x{0B7F}\x{0B80}-\x{0BFF}\x{0C00}-\x{0C7F}\x{0C80}-\x{0CFF}\x{0D00}-\x{0D7F}\x{0D80}-\x{0DFF}\x{0E00}-\x{0E7F}\x{0E80}-\x{0EFF}\x{0F00}-\x{0FFF}\x{1000}-\x{109F}\x{10A0}-\x{10FF}\x{1100}-\x{11FF}\x{1200}-\x{137F}\x{1380}-\x{139F}\x{13A0}-\x{13FF}\x{1400}-\x{167F}\x{1680}-\x{169F}\x{16A0}-\x{16FF}\x{1700}-\x{171F}\x{1720}-\x{173F}\x{1740}-\x{175F}\x{1760}-\x{177F}\x{1780}-\x{17FF}\x{1800}-\x{18AF}\x{1900}-\x{194F}\x{1950}-\x{197F}\x{1980}-\x{19DF}\x{19E0}-\x{19FF}\x{1A00}-\x{1A1F}\x{1B00}-\x{1B7F}\x{1D00}-\x{1D7F}\x{1D80}-\x{1DBF}\x{1DC0}-\x{1DFF}\x{1E00}-\x{1EFF}\x{1F00}-\x{1FFF}\x{20D0}-\x{20FF}\x{2100}-\x{214F}\x{2C00}-\x{2C5F}\x{2C60}-\x{2C7F}\x{2C80}-\x{2CFF}\x{2D00}-\x{2D2F}\x{2D30}-\x{2D7F}\x{2D80}-\x{2DDF}\x{2F00}-\x{2FDF}\x{2FF0}-\x{2FFF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}\x{3100}-\x{312F}\x{3130}-\x{318F}\x{3190}-\x{319F}\x{31C0}-\x{31EF}\x{31F0}-\x{31FF}\x{3200}-\x{32FF}\x{3300}-\x{33FF}\x{3400}-\x{4DBF}\x{4DC0}-\x{4DFF}\x{4E00}-\x{9FFF}\x{A000}-\x{A48F}\x{A490}-\x{A4CF}\x{A700}-\x{A71F}\x{A800}-\x{A82F}\x{A840}-\x{A87F}\x{AC00}-\x{D7AF}\x{F900}-\x{FAFF}]){2,63})+)$/u
Besides RFC rules, the Unicode part above consists of a series of ranges of character subsets. This is done so that only real letters and numbers enter the validation, while non-Latin punctuation and miscellaneous Unicode characters are rejected.
The main things that don't validate correctly at this time are IP addresses in place of domain names, comments inside the "local" part, apostrophes, and forward slashes. I've never seen anyone use the latter two so didn't bother bloating the regex to support them.
You can find a live demo here: http://jsfiddle.net/aossikine/qCLVH/3/
Here is a breakdown of characters allowed by RFC standards and whether they are supported by this regex:
a-zA-Z0-9
!#$%&'*+-/=?^_`{|}~
(),:;<>#[] (must be between quotation marks)
this one is not yet implemented
. (period cannot be the first or last character, shouldn't appear consecutively)
\(must be preceded by a backslash)
this one is not yet implemented
" (must be preceded by a backslash)
this one is not yet implemented
strip out anything surrounded by parenthesis from the start or end of the local part
this one is not yet implemented
domain name can only contain letters, numbers, and dashes (and dashes may be consecutive)
For details on RFC standards, see http://en.wikipedia.org/wiki/E-mail_address#Syntax. Most of the logic detailed there is supported. The JSFiddle link above includes some additional documentation and additional links to handy sites.
Validating an IDNA domain name properly is impossible with regexes. It basically involves the ToASCII operation defined in RFC3490. For every domain label the following steps must be executed:
NAMEPREP processing as defined in RFC3491 which requires:
Mapping of code points using tables from RFC 3454 (STRINGPREP).
Conversion to Unicode Normalization Form NFKC.
Check for prohibited code points using tables from RFC 3454.
Check bidirectional characters as defined in RFC 3454.
Check ASCII characters.
Encode using Punycode.
Check that the result doesn't exceed 63 characters.
You should use an IDNA library for your language of choice.
EDIT: The answer above refers to the old IDNA standard which has been superseded by RFC 5890 ff. The latter is inclusion-based but it's still too complicated to verify a domain name with a regex.
Try this:
\u0900-\u097F\u0980-\u09FF\u0A00-\u0A7F\u0A80-\u0AFF\u0B00-\u0B7F\u0B80-\u0BFF\u0C00-\u0C7F\u0C80-\u0CFF\u0D00-\u0D7F\u0D80-\u0DFF\u0E00-\u0E7F\u0E80-\u0EFF\u0F00-\u0FFF\u1000-\u109F\u1700-\u171F\u1720-\u173F\u1740-\u175F\u1760-\u177F\u1780-\u17FF\u1900-\u194F\u1950-\u197F\u1980-\u19DF\u19E0-\u19FF\u1A00-\u1A1F\u1A20-\u1AAF\u1B00-\u1B7F\u1B80-\u1BBF\u1BC0-\u1BFF\u1C00-\u1C4F\u1CC0-\u1CCF\uA800-\uA82F\uA840-\uA87F\uA880-\uA8DF\uA8E0-\uA8FF\uA930-\uA95F\uA980-\uA9DF\uA9E0-\uA9FF\uAA00-\uAA5F\uAA60-\uAA7F\uAA80-\uAADF\uAAE0-\uAAFF\uABC0-\uABFF\u0600-\u06FF\u0750–\u077F\u08A0–\u08FF\uFB50–\uFDFF\uFE70–\uFEFF\u4e00-\u9faf\u0D80-\u0DFF\u0E80-\u0EFF]/)!=null}e.exports=a}),null);
I am still figuring my way around regex and have come across a problem that I am trying to solve. How do I validate for multiple specific email addresses?
For example, I want to only allow testdomain.com, realdomain.com, gooddomain.com to be validated. All other email addresses are not allowed.
annie#testdomain.com OK
aaron1#realdomain.com OK
amber#gooddomain.com OK
annie#otherdomain.com NOT OK
But I'm stil unclear on how to add multiple specific email addresses for the regex.
Any and all help would be appreciated.
Thank you,
Do you mean to include various ligitimate domains in one regex?
\b[A-Z0-9._%-]+#(testdomain|gooddomain|realdomain)\.com\b
You didn't specify which language you're using, but most regex implementations have a notion of logical operators, so the domain part of your pattern would have something like:
(domain1|domain2|domain3)
\b[A-Z0-9._%-]+#(testdomain|realdomain|gooddomain)\.com\b
Assuming the above works for testdomain:
\b[A-Z0-9._%-]+#(?:testdomain|realdomain|gooddomain)\.com\b
Also, please note that you will have to add a case insensitive i modifier for this to work with your test cases, or use [A-Za-z0-9._%-] instead of [A-Z0-9._%-]
See here
To make this expandable to many domains, I would probably capture the domain name and then compare that captured domain name with your whitelist in code.
.+#(.+)
First, ".+" will match any number (more than 0) of any characters up until the last "#" symobol in the string.
Second, "#" will match the "#" symbol.
Third, "(.+)" will match and capture (capture because of the parenthesis) any character string after the "#" symbol.
Then, depending on the language you are using, you can get the captured string. Then you can see if that captured string is in your domain whitelist. Note, you'll want to do a case insensitive comparison in this last step.
The official standard is known as RFC 2822.
Use OR operator | for all domain names you want to allow. Do not forget to escape . in the domain.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:testdomain\.com|realdomain\.com|gooddomain\.com)
Also use case-insensitivity modifier/flag to allow capital letters in the address.
I am using a regular expression to convert plain text URL to clickable links.
#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)#
However, sometimes in the body of the text, URL are enumerated one per line with a semi-colon at the end. The real URL does not contain any ";".
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124
Is it permitted to have a semicolon (;) in a URL or can the semicolon be considered a marker of the end of an URL? How would that fit in my regular expression?
A semicolon is reserved and should only for its special purpose (which depends on the scheme).
Section 2.2:
Many URL schemes reserve certain
characters for a special meaning:
their appearance in the
scheme-specific part of the URL has a
designated semantics. If the character
corresponding to an octet is
reserved in a scheme, the octet must
be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are
the characters which may be
reserved for special meaning within a
scheme. No other characters may be
reserved within a scheme.
The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as & in HTML whereas ; doesn't.
The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt
However, the specification states that whether the semi-colon is legitimate for a specific URI or not depends on the scheme or producer of that URI. So, if site using those links doesn't allow semi-colons, then they're not valid for that particular case.
Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.
And some do use it for legitimate purposes though it's use is likely site-specific (ie, only for use with that site) because it's usage has to be defined by the site using it.
In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.
For example, sending someone an email with this link:
http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/
will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because even though it is legitimate (ie, properly formed) no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page; whereupon one's corporate IT manager will get a report and one will likely get a pink slip.
And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.
*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.
http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.
Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc..
If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:
https?://[\w!#$%&'()*+,./:;=?#\[\]-]+(?<![!,.?;:"'()-])
After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.
Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.
I replaced the Regex we were using with the following pattern (and have tested that it works):
string regex = #"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-zA-Z0-9-]+\.[a-zA-Z0-9\/_:#=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)