Need regex to get domain + subdomain

So I'm using this function here:
function get_domain($url)
{
    $pieces = parse_url($url);
    $domain = isset($pieces['host']) ? $pieces['host'] : '';
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return false;
}
$referer = get_domain($_SERVER['HTTP_REFERER']);
What I need is another regex for it, if someone would be so kind as to help.
Specifically, I need it to return the whole domain, including subdomains.
Here is the real problem I have now: when bloggers link from, for example, myblog.blogger.com,
the referer URL comes back as just blogger.com, which is not ideal.
So if someone could help me get a regex for the function above that includes the subdomain, I'd appreciate it a lot!
Thanks!

This regex should match a domain in a string, including any subdomains:
/([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+/
Translated to rough English, it functions like this: "match the first part of the string that has 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' that might precede it."
See it in action here: http://regexr.com?2vppk
Note that the multiline and global flags in that link are only there so the regex can match the entire blob of test text; you don't need them if you're passing only one line to the regex.
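A minimal Python sketch of the same idea, assuming the input is a full URL: a URL parser pulls out the host, and an anchored variant of the pattern above checks it. The function name get_domain mirrors the question's PHP function, and the sample URL is made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Based on the pattern above, anchored so the whole host must match.
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)*[a-z0-9-]+\.[a-z]+$", re.IGNORECASE)

def get_domain(url):
    """Return the host of url including subdomains, or None if it does not look like a domain."""
    host = urlsplit(url).hostname or ""
    return host if HOST_RE.match(host) else None

print(get_domain("http://myblog.blogger.com/2012/01/post.html"))
```

Since the parser already returns the full host, the regex here only acts as a sanity check and does not strip any subdomains.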

Good luck with the above, as domain names can now contain non-Roman characters. These would have to be converted into an equivalent but unique ASCII form before a regex could work reliably. See RFC 3490, Internationalizing Domain Names in Applications (IDNA):
https://www.rfc-editor.org/rfc/rfc3490
which says:
Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
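As a rough illustration of that conversion step, here is a small Python sketch using the built-in "idna" codec, which follows RFC 3490. The helper name to_ascii_host and the sample host are assumptions for the example.

```python
def to_ascii_host(host):
    """Convert a possibly non-ASCII host name to its ASCII-compatible form."""
    # Python's built-in "idna" codec applies the RFC 3490 ToASCII operation label by label.
    return host.encode("idna").decode("ascii")

print(to_ascii_host("bücher.example"))  # xn--bcher-kva.example
```

An ASCII-only regex can then be run against the converted name.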

I guess this is an improvement on the first suggestion.
The main improvements:
does not match the invalid pattern sub..domain.xyz
captures more than one subdomain as a group
captures the port if given
https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})
Test it: https://regex101.com/r/njFIil/1
This regex does not handle any unicode symbols, which could be a problem as mentioned above.
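To show what the three groups hold, here is a small Python sketch built on the pattern above. The function name split_host and the sample URLs are hypothetical. Note that the pattern expects the host to be followed by the end of the string, whitespace, or a port, so a URL with a path directly after the host is not matched.

```python
import re

URL_RE = re.compile(
    r"https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|:\d{1,5})",
    re.IGNORECASE,
)

def split_host(url):
    """Return (subdomains, domain, port) or None if the pattern does not match."""
    m = URL_RE.search(url)
    if m is None:
        return None
    subdomains, domain, tail = m.groups()
    # The third group is either empty, whitespace, or ":<port>".
    port = tail[1:] if tail.startswith(":") else None
    return subdomains.rstrip("."), domain, port

print(split_host("https://a.b.example.com:8080"))
print(split_host("https://sub..domain.xyz"))
```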

Better solution:
/^([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/
Regex sample:
https://regexr.com/4k71a
And for email address:
/^[a-z0-9.-]+[a-z0-9]{1,}@([a-z0-9-]+[a-z0-9]{1,}\.)*[a-z0-9-]+[a-z0-9]{1,}\.[a-z]{2,}$/

Related

Regex for multiple specific email addresses

I am still figuring my way around regex and have come across a problem that I am trying to solve. How do I validate for multiple specific email addresses?
For example, I want to only allow testdomain.com, realdomain.com, gooddomain.com to be validated. All other email addresses are not allowed.
annie@testdomain.com OK
aaron1@realdomain.com OK
amber@gooddomain.com OK
annie@otherdomain.com NOT OK
But I'm still unclear on how to add multiple specific email addresses to the regex.
Any and all help would be appreciated.
Thank you,
Do you mean to include various legitimate domains in one regex?
\b[A-Z0-9._%-]+@(testdomain|gooddomain|realdomain)\.com\b
You didn't specify which language you're using, but most regex implementations have a notion of logical operators, so the domain part of your pattern would have something like:
(domain1|domain2|domain3)
\b[A-Z0-9._%-]+@(testdomain|realdomain|gooddomain)\.com\b
Assuming the above works for testdomain:
\b[A-Z0-9._%-]+@(?:testdomain|realdomain|gooddomain)\.com\b
Also, please note that you will have to add a case-insensitive i modifier for this to work with your test cases, or use [A-Za-z0-9._%-] instead of [A-Z0-9._%-].
To make this expandable to many domains, I would probably capture the domain name and then compare that captured domain name with your whitelist in code.
.+@(.+)
First, ".+" will match one or more of any character, up until the last "@" symbol in the string.
Second, "@" will match the "@" symbol.
Third, "(.+)" will match and capture (capture because of the parentheses) any character string after the "@" symbol.
Then, depending on the language you are using, you can get the captured string. Then you can see if that captured string is in your domain whitelist. Note, you'll want to do a case insensitive comparison in this last step.
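The capture-then-compare approach could be sketched in Python like this, assuming addresses use the usual @ separator. The names ALLOWED_DOMAINS and is_allowed are made up for the example; the domains come from the question.

```python
import re

# Whitelist taken from the question; stored in lowercase for comparison.
ALLOWED_DOMAINS = {"testdomain.com", "realdomain.com", "gooddomain.com"}

def is_allowed(address):
    """Capture everything after the last @ and look it up in the whitelist."""
    m = re.fullmatch(r".+@(.+)", address)
    if m is None:
        return False
    # Case-insensitive comparison, as recommended above.
    return m.group(1).lower() in ALLOWED_DOMAINS

print(is_allowed("annie@testdomain.com"))
print(is_allowed("annie@otherdomain.com"))
```

Adding a domain then means adding one entry to the set rather than editing the regex.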
The official standard is known as RFC 2822.
Use OR operator | for all domain names you want to allow. Do not forget to escape . in the domain.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:testdomain\.com|realdomain\.com|gooddomain\.com)
Also use case-insensitivity modifier/flag to allow capital letters in the address.

Regular expression to add base domain to directory

Ten websites need to be cached. When reading the cached pages, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. Examples below.
Base domain: http://www.example.com
The problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
They would display correctly if it were img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
Regex something like: if src=" isn't followed by the base domain, then add the base domain.
Without knowing the language, you can use the (maybe most portable) substitute modifier:
s/^(src=")([^"]+")$/$1http:\/\/www.example.com\/$2/
This should do the following:
1. Matches the string 'src="' (and captures it in variable $1).
2. Matches one or more non-double-quote characters followed by " (and captures them in variable $2).
3. Substitutes 'http://www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes only if it isn't found.
To check for the domain, /www\.example\.com/i should do.
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
// Match src="..." or href="..." values that do not already start with the base domain.
$regex_match = '/(src="|href=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/i';
$regex_substitute = '$1http://www.example.com/$2';
$html = preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it, it's just off the cuff. I would be concerned about three things. First, the / character is the pattern delimiter here, so every / inside the pattern has to be escaped. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http:\/\/www\.example\.com\/) bit. This negative lookahead restricts the match to any src or href that doesn't already have the base domain there, but it depends on the type of regex being used: lookaheads work in PCRE, not in POSIX.
The rest of the changes should be fine: I added href=" and made the pattern case-insensitive (the i modifier). preg_replace replaces every match by default, so no global modifier is needed, and PHP does not accept a g modifier.
I hope that helps.
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+
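For comparison, a minimal Python sketch of the "add the base domain unless it is already there" idea, using a negative lookahead. BASE and add_base are names chosen for the example, and the HTML snippets come from the question. This sketch also prefixes absolute URLs that point to other hosts, so it only fits pages whose links are relative or already on the base domain.

```python
import re

BASE = "http://www.example.com/"

# Group 1: the attribute opener; group 2: the value plus closing quote.
# The lookahead skips values that already start with BASE; "/?" drops a leading slash.
ATTR_RE = re.compile(
    r'((?:src|href)=")(?!' + re.escape(BASE) + r')/?([^"]+")',
    re.IGNORECASE,
)

def add_base(html):
    """Prefix src/href values with BASE unless they already start with it."""
    return ATTR_RE.sub(lambda m: m.group(1) + BASE + m.group(2), html)

print(add_base('<img src="thumb/123.jpg"> <script src="/inc/123.js"></script>'))
```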

The Hostname Regex

I'm looking for the regex to validate hostnames. It must completely conform to the standard. Right now, I have
^[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?(\.[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?)*$
but it allows successive hyphens and hostnames longer than 255 characters. If the perfect regex is impossible, say so.
Edit/Clarification: a Google search didn't reveal that this is a solved (or proven unsolvable) problem. I want to create the definitive regex so that nobody has to write their own ever again. If dialects matter, I want a version for each one in which this can be done.
^(?=.{1,255}$)[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?)*\.?$
The approved answer validates invalid hostnames containing multiple dots (example..com). Here is a regex I came up with that I think exactly matches what is allowable under RFC requirements (minus an ending "." supported by some resolvers to short-circuit relative naming and force FQDN resolution).
Spec:
<hname> ::= <name>*["."<name>]
<name> ::= <letter-or-digit>[*[<letter-or-digit-or-hyphen>]<letter-or-digit>]
Regex:
^([a-zA-Z0-9](?:(?:[a-zA-Z0-9-]*|(?<!-)\.(?![-.]))*[a-zA-Z0-9]+)?)$
I've tested quite a few permutations myself, and I think it is accurate.
This regex also does not do length validation. Length constraints on labels between dots and on names are required by the RFC, but lengths can easily be checked as second and third passes after validating against this regex: check the full string length, then split on "." and validate the length of each substring. E.g., in JavaScript, label length validation might look like: "example.com".split(".").reduce(function (prev, curr) { return prev && curr.length <= 63; }, true).
Alternative Regex (without negative lookbehind, courtesy of the HTML Living Standard):
^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
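The multi-pass idea (syntax by regex, lengths in code) might look like this in Python. It is only a sketch: it uses a simple per-label pattern rather than either regex above, the 255-character limit is the figure mentioned in the question, and the names LABEL_RE, MAX_NAME_LENGTH, and is_valid_hostname are invented for the example.

```python
import re

# One label: letters/digits at both ends, hyphens allowed in between.
LABEL_RE = re.compile(r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?")

MAX_NAME_LENGTH = 255   # overall limit as stated in the question (assumption)
MAX_LABEL_LENGTH = 63   # per-label limit

def is_valid_hostname(name):
    """Check overall length, then each dot-separated label's length and syntax."""
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        return False
    return all(
        len(label) <= MAX_LABEL_LENGTH and LABEL_RE.fullmatch(label) is not None
        for label in name.split(".")
    )

print(is_valid_hostname("example.com"))
print(is_valid_hostname("example..com"))
```

Splitting first means an empty label (as in example..com) fails the per-label check without any lookaround.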
Your answer was relatively close.
But see
RFC 2396 Section 3.2.2
JaredPar's reference to this answer is referring to Regexp/Common/URI/RFC2396.pm source.
For a hostname RE, that perl module produces
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)
I would modify to be more accurate as:
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)
Optionally, anchor the ends with ^ and $ to match ONLY hostnames.
I don't think a single RE can accomplish full validation because, according to Wikipedia, there is a 255-character length restriction, which I don't think can be included within that same RE, at least not without a ton of changes. But it's easy enough to just check that the length is <= 255 before running the RE.
Take a look at the following question. A few of the answers have regex expressions for host names
Regular expression to match DNS hostname or IP Address?
Could you specify what language you want to use this regex in? Most languages / systems have slightly different regex implementations that will affect people's answers.
I tried all the answers with the examples below, and unfortunately none of them passed the test.
ec2-11-111-222-333.cd-blahblah-1.compute.amazonaws.com
domaine.com
subdomain.domain.com
12533d5.dkkkd.com
2dotsextension.co
1dotextension.c
ekkej_dhh.com
12552.2225
112.25.25
12345.com
12345.123.com
domaine.123
whatever
9999-ee.99
email@domain.com
.jjdj.kkd
-subdomain.domain.com
@subdomain.domain.com
112.25.25
Here is a better solution.
^[A-Za-z0-9][A-Za-z0-9-.]*\.\D{2,4}$
Please post any other case that has not been considered, if one exists, at https://regex101.com/r/89zZkW/1
What about:
^(?=.{1,255})([0-9A-Za-z]|_{1}|\*{1}$)(?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?)*\.?$
for matching only one '_' (for some SRV records) at the beginning and only one '*' (in the case of a label for a DNS wildcard)
According to the relevant internet RFCs and assuming you have lookahead and lookbehind positive and negative assertions:
If you want to validate a local/leaf hostname for use in an internet hostname (e.g. - FQDN), then:
^(?!-)[-a-zA-Z0-9]{1,63}(?<!-)$
That ^^^ is also the general check that a label component inside an internet hostname is valid.
If you want to validate an internet hostname (e.g. - FQDN), then:
^(?=.{1,253}\.?$)(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$
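Here is a short Python sketch that exercises the FQDN pattern above against a few of the test strings listed earlier in this thread. The function name is_fqdn is an assumption; the pattern itself is copied unchanged. One Python-specific caveat: $ also matches just before a trailing newline, so strip input first if that matters.

```python
import re

FQDN_RE = re.compile(
    r"^(?=.{1,253}\.?$)"                      # overall length, optional trailing dot
    r"(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*"   # zero or more "label." parts
    r"(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$"      # final label, optional trailing dot
)

def is_fqdn(name):
    """Return True if name matches the FQDN pattern."""
    return FQDN_RE.match(name) is not None

for candidate in ["subdomain.domain.com", "-subdomain.domain.com", "ekkej_dhh.com"]:
    print(candidate, is_fqdn(candidate))
```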

Can a URL contain a semicolon and still be valid?

I am using a regular expression to convert plain text URL to clickable links.
#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)#
However, sometimes in the body of the text, URL are enumerated one per line with a semi-colon at the end. The real URL does not contain any ";".
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124
Is it permitted to have a semicolon (;) in a URL or can the semicolon be considered a marker of the end of an URL? How would that fit in my regular expression?
A semicolon is reserved and should only be used for its special purpose (which depends on the scheme).
RFC 1738, Section 2.2:
Many URL schemes reserve certain
characters for a special meaning:
their appearance in the
scheme-specific part of the URL has a
designated semantics. If the character
corresponding to an octet is
reserved in a scheme, the octet must
be encoded. The characters ";",
"/", "?", ":", "#", "=" and "&" are
the characters which may be
reserved for special meaning within a
scheme. No other characters may be
reserved within a scheme.
The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as &amp; in HTML whereas ; doesn't.
The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt
However, the specification states that whether the semi-colon is legitimate for a specific URI or not depends on the scheme or producer of that URI. So, if site using those links doesn't allow semi-colons, then they're not valid for that particular case.
Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.
And some do use it for legitimate purposes, though its use is likely site-specific (i.e., only for use with that site) because its usage has to be defined by the site using it.
In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.
For example, sending someone an email with this link:
http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/
will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because even though it is legitimate (ie, properly formed) no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page; whereupon one's corporate IT manager will get a report and one will likely get a pink slip.
And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.
*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.
http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.
Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc.
If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:
https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])
After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.
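The backing-off behaviour can be seen in a small Python sketch run on the question's sample lines. The function name find_urls is invented; the pattern is the one above, with @ in the character class.

```python
import re

# Greedy run of URL characters, then a lookbehind that rejects a match
# ending in something that is probably sentence punctuation.
URL_RE = re.compile(r'''https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])''')

def find_urls(text):
    """Return all http(s) URLs in text, without trailing punctuation."""
    return URL_RE.findall(text)

sample = (
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124"
)
print(find_urls(sample))
```

A semicolon in the middle of a URL is kept; only trailing punctuation is dropped.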
Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.
I replaced the Regex we were using with the following pattern (and have tested that it works):
string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-zA-Z0-9-]+\.[a-zA-Z0-9\/_:@=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)

How do I write a regular expression for a URL without the scheme?

How can I write a RE which validates the URLs without the scheme:
Pass:
www.example.com
example.com
Fail:
http://www.example.com
^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
string must start with an ASCII letter or number
ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
optional: a port is allowed (":8080")
optional: anything after a slash may follow (since you said "URL")
then the end of the string
Thoughts:
no line breaks allowed
no validity or sanity checking
no support for "internationalized domain names" (IDNs)
leave off the "optional:" parts if you like, but be sure to include the final "$"
If your regex flavor supports it, you can shorten the above to:
^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$
Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
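A quick Python sketch of the explicit pattern applied to the question's pass/fail cases. The function name is_schemeless_url is hypothetical.

```python
import re

# Host characters, optional ":port", optional "/anything", then end of string.
HOST_URL_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$")

def is_schemeless_url(text):
    """Return True if text looks like host[:port][/path] with no scheme."""
    return HOST_URL_RE.match(text) is not None

for candidate in ["www.example.com", "example.com", "http://www.example.com"]:
    print(candidate, is_schemeless_url(candidate))
```

The scheme case fails because a colon is only accepted when digits follow it directly, and "://" does not fit that.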
If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.
The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:
[2.1] ... In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed
by a colon and then a string (the <scheme-specific-part>) whose
interpretation depends on the scheme.
And, later in the BNF,
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
So, if a scheme is there, you can match it with:
/^[a-z0-9+.-]+:/i
If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
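The scheme check, and the port-number messiness mentioned above, can be sketched in Python like this. The function name has_scheme is an assumption.

```python
import re

# Anything matching the RFC 1738 scheme grammar followed by a colon.
SCHEME_RE = re.compile(r"^[a-z0-9+.-]+:", re.IGNORECASE)

def has_scheme(text):
    """Return True if text starts with something the URL syntax treats as a scheme."""
    return SCHEME_RE.match(text) is not None

print(has_scheme("http://www.example.com"))
print(has_scheme("www.example.com"))
# The messy case: a host with a port also looks like "<scheme>:".
print(has_scheme("www.example.com:80"))
```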
URL syntax is quite complex, you need to narrow it down a bit. You can match anything.ext, if that is enough:
^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$
My guess is
/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/
In more primitive RE syntax that would be
/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/
Or even more primitive still:
/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
Thanks guys, I think I have a Python and a PHP solution. Here they are:
Python Solution:
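import re
url = 'http://www.foo.com'
# Reject anything that starts with http:// or https://, then expect host, optional port, optional path
p = re.compile(r'^(?!https?://)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)  # m is a match object if the URL is valid, otherwise None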
PHP Solution:
$url = 'http://www.foo.com';
// preg_match returns 1 if the URL is valid (no scheme), otherwise 0
$valid = preg_match('/^(?!https?:\/\/)[A-Za-z][A-Za-z0-9.\-]+(:\d+)?(\/.*)?$/', $url);