How to only match before the first dot? - regex

I have the following regex.
^((?!example).)*$#Subdomain is reserved (example).
I would like to validate <subdomain>.example.org. However, since the domain name contains example, a match is occurring.
The validation should not match when the address is www.example.org
The validation should match when the address is example.example.org

Looks like you're missing the escape character from the period
^(example)\..*$
should work

It seems that a simple
^example\.
is enough. Or use string methods, depending on your language:
url.indexOf('example.') === 0
If input such as example.org is also possible, you can use
^example\..+\.
to force the appearance of two dots. But this would still fail for example.co.uk. It depends on your input.
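For illustration, here is a minimal Python sketch of the anchored check. The function name uses_reserved_label and the sample hostnames are my own assumptions, not part of the original answer.

```python
import re

# Anchored at the start, so "example" is only tested against the first label.
RESERVED_FIRST_LABEL = re.compile(r'^example\.')

def uses_reserved_label(host):
    """Return True when the text before the first dot is exactly 'example'."""
    return RESERVED_FIRST_LABEL.search(host) is not None

print(uses_reserved_label('example.example.org'))  # True
print(uses_reserved_label('www.example.org'))      # False
print(uses_reserved_label('example.org'))          # True (the bare-domain caveat)
```

The last call shows why the input format matters: a bare example.org also starts with "example.".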

A simple way might be to break it up into two:
^.+\.example\.org$
^(www)?\.example\.org$
If 1) matches and 2) does not, it's a subdomain of example.org; otherwise, it's not. (Technically www is a subdomain too, but you get the idea.)
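A rough Python sketch of the two-pattern idea, assuming the input is a bare hostname. The names ANY_SUBDOMAIN, WWW_OR_BARE and is_custom_subdomain are hypothetical.

```python
import re

# Pattern 1: something, then ".example.org".
ANY_SUBDOMAIN = re.compile(r'^.+\.example\.org$')
# Pattern 2: the "www" case that should not count.
WWW_OR_BARE = re.compile(r'^(www)?\.example\.org$')

def is_custom_subdomain(host):
    """True when pattern 1 matches and pattern 2 does not."""
    return bool(ANY_SUBDOMAIN.match(host)) and not WWW_OR_BARE.match(host)

print(is_custom_subdomain('example.example.org'))  # True
print(is_custom_subdomain('www.example.org'))      # False
print(is_custom_subdomain('example.org'))          # False
```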

Related

Validating URL using regex

I am trying to validate a URL with just a scheme and domain name (something like http://www.domainname.com). I am using this regex:
/^(http|https):\/\/[\w.\-]+\.[A-Za-z]{2,6}/
When I type http://www.ab, it returns true for up to 6 characters; after that length it returns false. How can I tackle this situation?
You can use a regex like this: https?:\/\/www\..*?\.(com|uk|in) (you have to list everything you want to match at the end).
demo here
Try this one:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
Test it here: https://regex101.com/r/xR0oV9/1
Let me correct your pattern a bit, just for information.
Instead of (http|https), (https?) is much shorter, because the http part is present in both cases and the s is optional.
Instead of [A-Za-z] you can use just lowercase letters, [a-z], and add the i modifier to the end of your pattern (after the last slash /), which makes the match case-insensitive.
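Here is a small Python sketch of both suggestions applied to the pattern from the question. re.IGNORECASE plays the role of the i modifier. The trailing $ anchor is my own addition (an assumption) so that the whole string has to match; the name is_scheme_and_domain is hypothetical.

```python
import re

# https? replaces (http|https); [a-z] plus IGNORECASE replaces [A-Za-z].
# The trailing $ is an assumption added here to require a full-string match.
URL_PATTERN = re.compile(r'^https?://[\w.\-]+\.[a-z]{2,6}$', re.IGNORECASE)

def is_scheme_and_domain(url):
    return URL_PATTERN.match(url) is not None

print(is_scheme_and_domain('http://www.domainname.com'))   # True
print(is_scheme_and_domain('HTTPS://WWW.DOMAINNAME.COM'))  # True
print(is_scheme_and_domain('ftp://www.domainname.com'))    # False
```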
This one from diegoperini is maybe a little longer, but it's nearly perfect (at least in my eyes).
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
If you want to use it in C# you have to slightly change it. I've done this already some time ago.
^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$

Regex for multiple specific email addresses

I am still figuring my way around regex and have come across a problem that I am trying to solve. How do I validate for multiple specific email addresses?
For example, I want to only allow testdomain.com, realdomain.com, gooddomain.com to be validated. All other email addresses are not allowed.
annie@testdomain.com OK
aaron1@realdomain.com OK
amber@gooddomain.com OK
annie@otherdomain.com NOT OK
But I'm still unclear on how to add multiple specific email addresses to the regex.
Any and all help would be appreciated.
Thank you,
Do you mean to include various legitimate domains in one regex?
\b[A-Z0-9._%-]+@(testdomain|gooddomain|realdomain)\.com\b
You didn't specify which language you're using, but most regex implementations have a notion of logical operators, so the domain part of your pattern would have something like:
(domain1|domain2|domain3)
\b[A-Z0-9._%-]+@(testdomain|realdomain|gooddomain)\.com\b
Assuming the above works for testdomain:
\b[A-Z0-9._%-]+@(?:testdomain|realdomain|gooddomain)\.com\b
Also, please note that you will have to add a case insensitive i modifier for this to work with your test cases, or use [A-Za-z0-9._%-] instead of [A-Z0-9._%-]
See here
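As a sketch of the alternation approach in Python, assuming addresses of the usual local@domain form: re.IGNORECASE stands in for the i modifier, and fullmatch() is my own choice so that nothing may trail the address. The function name is_allowed_address is hypothetical.

```python
import re

# Non-capturing alternation of the three allowed domains, case-insensitive.
ALLOWED_ADDRESS = re.compile(
    r'\b[A-Z0-9._%-]+@(?:testdomain|realdomain|gooddomain)\.com\b',
    re.IGNORECASE,
)

def is_allowed_address(address):
    # fullmatch() requires the whole string to be the address.
    return ALLOWED_ADDRESS.fullmatch(address) is not None

print(is_allowed_address('annie@testdomain.com'))   # True
print(is_allowed_address('Amber@GoodDomain.com'))   # True
print(is_allowed_address('annie@otherdomain.com'))  # False
```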
To make this expandable to many domains, I would probably capture the domain name and then compare that captured domain name with your whitelist in code.
.+@(.+)
First, ".+" will match any number (more than 0) of any characters up until the last "@" symbol in the string.
Second, "@" will match the "@" symbol.
Third, "(.+)" will match and capture (capture because of the parentheses) any character string after the "@" symbol.
Then, depending on the language you are using, you can get the captured string. Then you can see if that captured string is in your domain whitelist. Note, you'll want to do a case insensitive comparison in this last step.
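In Python, the capture-then-compare approach could look roughly like this. The set name ALLOWED_DOMAINS and the function has_allowed_domain are hypothetical, and the sketch assumes the usual @ separator between local part and domain.

```python
import re

# Whitelist kept in code, so adding a domain does not touch the regex.
ALLOWED_DOMAINS = {'testdomain.com', 'realdomain.com', 'gooddomain.com'}

# Greedy ".+" runs up to the last "@"; the group captures what follows it.
DOMAIN_CAPTURE = re.compile(r'.+@(.+)')

def has_allowed_domain(address):
    match = DOMAIN_CAPTURE.fullmatch(address)
    if match is None:
        return False
    # Case-insensitive comparison against the whitelist.
    return match.group(1).lower() in ALLOWED_DOMAINS

print(has_allowed_domain('annie@TestDomain.com'))   # True
print(has_allowed_domain('annie@otherdomain.com'))  # False
```

Note that this only checks the domain; it does not validate the local part at all.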
The official standard is known as RFC 2822.
Use OR operator | for all domain names you want to allow. Do not forget to escape . in the domain.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:testdomain\.com|realdomain\.com|gooddomain\.com)
Also use case-insensitivity modifier/flag to allow capital letters in the address.

Regex with URLs - syntax

We're using a proprietary tracking system that requires the use of regular expressions to load third party scripts on the URLs we specify.
I wanted to check the syntax of the regex we're using to see if it looks right.
To match the following URL
/products/18/indoor-posters
We are using this rule:
.*\/products\/18\/indoor-posters.*
Does this look right? Also, if there was a query parameter on the URL, would it still work? e.g.
/products/18/indoor-posters?someParam=someValue
There's another URL to match:
/products
The rule for this is:
.*\/products
Would this match correctly?
Well, "right" is a relative term. Usually, .* is not a good idea because it matches anything, even nothing. So while these regexes will all match your example strings, they'll also match much more. The question is: What are you using the regexes for?
If you only want to check whether those substrings are present anywhere in the string, then they are fine (but then you don't need regex anyway, just check for substrings).
If you want to somehow check whether it's a valid URL, then no, the regexes are not fine because they'd also match foo-bar!$%(§$§$/products/18/indoor-postersssssss)(/$%/§($/.
If you can be sure that you'll always get a correct URL as your input and just want to check whether it matches your pattern, then I'd suggest
^.*\/products$
to match any URL that ends in /products, and
^.*\/products\/18\/indoor-posters(?:\?[\w-]+=[\w-]+)?$
to match a URL that ends in /products/18/indoor-posters with an optional ?name=value bit at the end, assuming only alphanumeric characters are legal for name and value.
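A quick Python check of the two anchored patterns against the paths from the question. The variable names are mine, and the tracking system may use a different regex flavor, so treat this as a sketch only.

```python
import re

PRODUCTS = re.compile(r'^.*/products$')
INDOOR_POSTERS = re.compile(
    r'^.*/products/18/indoor-posters(?:\?[\w-]+=[\w-]+)?$'
)

print(bool(PRODUCTS.match('/products')))                    # True
print(bool(PRODUCTS.match('/products/18/indoor-posters')))  # False
print(bool(INDOOR_POSTERS.match(
    '/products/18/indoor-posters?someParam=someValue')))    # True
print(bool(INDOOR_POSTERS.match(
    '/products/18/indoor-postersssssss')))                  # False
```

As written, the second pattern accepts only a single name=value pair; a query string with several parameters would not match.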

The Hostname Regex

I'm looking for the regex to validate hostnames. It must completely conform to the standard. Right now, I have
^[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?(\.[0-9a-z]([0-9a-z\-]{0,61}[0-9a-z])?)*$
but it allows successive hyphens and hostnames longer than 255 characters. If the perfect regex is impossible, say so.
Edit/Clarification: a Google search didn't reveal that this is a solved (or proven unsolvable) problem. I want to create the definitive regex so that nobody has to write their own ever again. If dialects matter, I want a version for each one in which this can be done.
^(?=.{1,255}$)[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|-){0,61}[0-9A-Za-z])?)*\.?$
The approved answer validates invalid hostnames containing multiple dots (example..com). Here is a regex I came up with that I think exactly matches what is allowable under RFC requirements (minus an ending "." supported by some resolvers to short-circuit relative naming and force FQDN resolution).
Spec:
<hname> ::= <name>*["."<name>]
<name> ::= <letter-or-digit>[*[<letter-or-digit-or-hyphen>]<letter-or-digit>]
Regex:
^([a-zA-Z0-9](?:(?:[a-zA-Z0-9-]*|(?<!-)\.(?![-.]))*[a-zA-Z0-9]+)?)$
I've tested quite a few permutations myself, I think it is accurate.
This regex also does not do length validation. Length constraints on the labels between dots and on whole names are required by the RFC, but lengths can easily be checked in second and third passes after validating against this regex: check the full string length, then split on "." and validate the length of each substring. E.g., in JavaScript, label length validation might look like: "example.com".split(".").reduce(function (prev, curr) { return prev && curr.length <= 63; }, true).
Alternative Regex (without negative lookbehind, courtesy of the HTML Living Standard):
^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
Your answer was relatively close.
But see
RFC 2396 Section 3.2.2
JaredPar's reference to this answer is referring to Regexp/Common/URI/RFC2396.pm source.
For a hostname RE, that perl module produces
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)
I would modify to be more accurate as:
(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)
Optionally anchoring the ends with ^$ to ONLY match hostnames.
I don't think a single RE can accomplish a full validation because, according to Wikipedia, there is a 255-character length restriction, which I don't think can be included within that same RE, at least not without a ton of changes. But it's easy enough to just check that the length is <= 255 before running the RE.
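A minimal Python sketch of that two-step check, using the modified pattern from this answer. fullmatch() is used in place of explicit ^ and $ anchors; the function name is_hostname is hypothetical.

```python
import re

# Modified pattern from this answer; fullmatch() supplies the anchoring.
HOSTNAME = re.compile(
    r'(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]{0,61})?[a-zA-Z0-9])[.])*'
    r'(?:[a-zA-Z][-a-zA-Z0-9]{0,61}[a-zA-Z0-9]|[a-zA-Z])[.]?)'
)

def is_hostname(name):
    # Step 1: cheap overall length check before running the regex.
    if len(name) > 255:
        return False
    # Step 2: per-label syntax check.
    return HOSTNAME.fullmatch(name) is not None

print(is_hostname('example.com'))               # True
print(is_hostname('example..com'))              # False
print(is_hostname('.'.join(['a' * 60] * 5)))    # False (304 characters)
```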
Take a look at the following question. A few of the answers have regex expressions for host names
Regular expression to match DNS hostname or IP Address?
Could you specify what language you want to use this regex in? Most languages / systems have slightly different regex implementations that will affect people's answers.
I tried all the answers with the examples below and unfortunately none of them passed the test.
ec2-11-111-222-333.cd-blahblah-1.compute.amazonaws.com
domaine.com
subdomain.domain.com
12533d5.dkkkd.com
2dotsextension.co
1dotextension.c
ekkej_dhh.com
12552.2225
112.25.25
12345.com
12345.123.com
domaine.123
whatever
9999-ee.99
email@domain.com
.jjdj.kkd
-subdomain.domain.com
@subdomain.domain.com
112.25.25
Here is a better solution.
^[A-Za-z0-9][A-Za-z0-9-.]*\.\D{2,4}$
Please post any other case I have not considered at https://regex101.com/r/89zZkW/1
What about:
^(?=.{1,255})([0-9A-Za-z]|_{1}|\*{1}$)(?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?(?:\.[0-9A-Za-z](?:(?:[0-9A-Za-z]|\b-){0,61}[0-9A-Za-z])?)*\.?$
for matching only one '_' (for some SRV records) at the beginning and only one * (in case of a label for a DNS wildcard)
According to the relevant internet RFCs and assuming you have lookahead and lookbehind positive and negative assertions:
If you want to validate a local/leaf hostname for use in an internet hostname (e.g. - FQDN), then:
^(?!-)[-a-zA-Z0-9]{1,63}(?<!-)$
That ^^^ is also the general check that a label component inside an internet hostname is valid.
If you want to validate an internet hostname (e.g. - FQDN), then:
^(?=.{1,253}\.?$)(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$
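To try the FQDN pattern out, here is a short Python sketch. Python's re module supports the fixed-width lookbehind (?<!-) used here. The name is_fqdn and the sample inputs are my own assumptions.

```python
import re

# FQDN pattern from this answer, unchanged.
FQDN = re.compile(
    r'^(?=.{1,253}\.?$)'
    r'(?:(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.)*'
    r'(?!-)[-a-zA-Z0-9]{1,63}(?<!-)\.?$'
)

def is_fqdn(name):
    return FQDN.match(name) is not None

print(is_fqdn('subdomain.domain.com'))   # True
print(is_fqdn('example.com.'))           # True (trailing dot allowed)
print(is_fqdn('-subdomain.domain.com'))  # False (leading hyphen)
print(is_fqdn('ekkej_dhh.com'))          # False (underscore)
print(is_fqdn('a' * 64 + '.com'))        # False (label over 63 characters)
```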

How do I write a regular expression for a URL without the scheme?

How can I write a RE which validates the URLs without the scheme:
Pass:
www.example.com
example.com
Fail:
http://www.example.com
^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$
string must start with an ASCII letter or number
ASCII letters, numbers, dots and dashes follow (no slashes or colons allowed)
optional: a port is allowed (":8080")
optional: anything after a slash may follow (since you said "URL")
then the end of the string
Thoughts:
no line breaks allowed
no validity or sanity checking
no support for "internationalized domain names" (IDNs)
leave off the "optional:" parts if you like, but be sure to include the final "$"
If your regex flavor supports it, you can shorten the above to:
^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$
Be aware that \w may include Unicode characters in some regex flavors. Also, \w includes the underscore, which is invalid in host names. An explicit approach like the first one would be safer.
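A small Python comparison of the explicit and the shortened pattern, to show the underscore difference. The variable names and the sample host my_host.example.com are assumptions for illustration.

```python
import re

EXPLICIT = re.compile(r'^[A-Za-z0-9][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
SHORTENED = re.compile(r'^[A-Za-z\d][\w.-]+(:\d+)?(/.*)?$')

samples = [
    'www.example.com',
    'example.com:8080/path',
    'http://www.example.com',
    'my_host.example.com',
]
for text in samples:
    # Print whether each pattern accepts the sample.
    print(text, bool(EXPLICIT.match(text)), bool(SHORTENED.match(text)))
```

Both patterns reject the URL with a scheme, but only the shortened one accepts the host containing an underscore.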
If you're trying to do this for some real code, find the URL parsing library for your language and use that. If you don't want to use it, look inside to see what it does.
The thing that you are calling "resource" is known as a "scheme". It's documented in RFC 1738 which says:
[2.1] ... In general, URLs are written as follows:
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used (<scheme>) followed
by a colon and then a string (the <scheme-specific-part>) whose
interpretation depends on the scheme.
And, later in the BNF,
scheme = 1*[ lowalpha | digit | "+" | "-" | "." ]
So, if a scheme is there, you can match it with:
/^[a-z0-9+.-]+:/i
If that matches, you have what the URL syntax considers a scheme and your validation fails. If you have strings with port numbers, like www.example.com:80, then things get messy. In practice, I haven't dealt with schemes with - or ., so you might add a real world fudge to get around that until you decide to use a proper library.
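Here is a rough Python sketch of that scheme test, with re.IGNORECASE standing in for the /i flag. The function name has_scheme is hypothetical. The last call shows the port-number problem mentioned above.

```python
import re

# One or more scheme characters followed by a colon, at the start of the string.
SCHEME = re.compile(r'^[a-z0-9+.-]+:', re.IGNORECASE)

def has_scheme(text):
    return SCHEME.match(text) is not None

print(has_scheme('http://www.example.com'))  # True
print(has_scheme('www.example.com'))         # False
print(has_scheme('www.example.com:80'))      # True (host:port looks like a scheme)
```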
Anything beyond that, like checking for existing and reachable domains and so on, is better left to a library that's already figured it all out.
URL syntax is quite complex, so you need to narrow it down a bit. You can match anything.ext, if that is enough:
^[a-zA-Z0-9.]+\.[a-zA-Z]{2,4}$
My guess is
/^[\p{Alnum}-]+(\.[\p{Alnum}-]+)+$/
In more primitive RE syntax that would be
/^[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)+$/
Or even more primitive still:
/^[0-9A-Za-z-][0-9A-Za-z-]*\.[0-9A-Za-z-][0-9A-Za-z-]*(\.[0-9A-Za-z-][0-9A-Za-z-]*)*$/
Thanks guys, I think I have a Python and a PHP solution. Here they are:
Python Solution:
import re
url = 'http://www.foo.com'
p = re.compile(r'^(?!http(s)?://$)[A-Za-z][A-Za-z0-9.-]+(:\d+)?(/.*)?$')
m = p.search(url)
print(m)  # m is a match object if url is valid, otherwise None
PHP Solution:
$url = 'http://www.foo.com';
preg_match('/^(?!http(s)?:\/\/$)[A-Za-z][A-Za-z0-9\.\-]+(:\d+)?(\/.*)?$/', $url); // returns 1 if $url is valid, otherwise 0