I'm having some problems creating a regex that allows only valid domain names.
The rules are:
It has a minimum of 3 characters.
It can have dots but not two in a row. It can't have any other special characters.
It can have lowercase and uppercase letters and numbers.
Between dots, it needs to have at least one character.
For example:
Valid domain name -> bruno.cCm.pt3
Invalid domain name -> bruno..com (or) bruno.
What I have right now is this: ^.{2,253}([A-Za-z\d](-*[A-Za-z\d])*)(\.([A-Za-z\d](-*[A-Za-z\d])*))*$
Try this one; I ran some tests and I think it solves your problem:
(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]
Should also help you in extracting domain names from more complex strings.
This one should also manage upper/lower case:
(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9]
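The rules listed in the question can also be checked directly. The following is a minimal Python sketch, assuming those rules are the whole specification (letters, digits, and single dots only, at least three characters overall). The name is_valid_name is hypothetical, and this is not a full hostname check.

```python
import re

# One or more alphanumeric labels separated by single dots.
# The pattern starts and ends with an alphanumeric run, and every dot
# sits between two such runs, so leading, trailing, and doubled dots fail.
LABELS = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def is_valid_name(name: str) -> bool:
    """Return True if name satisfies the rules from the question."""
    return len(name) >= 3 and LABELS.fullmatch(name) is not None


for candidate in ["bruno.cCm.pt3", "bruno..com", "bruno.", "ab"]:
    print(candidate, is_valid_name(candidate))
```

Checking the length in code rather than inside the pattern keeps the regex short; a lookahead such as (?=.{3,}) would be an alternative.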
I have this regex (not mine, taken from here)
^[^\.]+\.example\.org$
The regex will match *.example.org (e.g. sub.example.org) but will leave out sub-subdomains (e.g. sub.sub.example.org). That's great, and it is what I want.
But I have another requirement: I want to match subdomains that contain a specific string, in this case press. So the regex should match the following (literally any subdomain that has the word press in it).
free-press.example.org
press.example.org
press23.example.org
I have trouble finding the right syntax. I have looked on the internet, and most of what I found works only for standalone text and not for domains like this.
Ok, let's break down what the "subdomain" part of your regex does:
[^\.]+ means "any character except for ., at least once".
You can break your "desired subdomain" up into three parts: "the part before press", "press itself", and "the part after press".
For the parts before and after press, the pattern is basically the same as before, except that you want to change the + (one or more) to a * (zero or more), because there might not be anything before or after press.
So your subdomain pattern will look like [^\.]*press[^\.]*.
Putting it all together, we have ^[^\.]*press[^\.]*\.example\.org$. If we put that into Regex101 we see that it works for your examples.
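As a quick check, the combined pattern can be run against the examples from the question. This is a small Python sketch; the helper name matches_press and the two extra negative hosts are assumptions added for illustration.

```python
import re

# The pattern from the answer, unchanged.
PRESS_SUBDOMAIN = re.compile(r"^[^\.]*press[^\.]*\.example\.org$")


def matches_press(host: str) -> bool:
    """True if host is a direct subdomain of example.org containing 'press'."""
    return PRESS_SUBDOMAIN.match(host) is not None


hosts = [
    "free-press.example.org",
    "press.example.org",
    "press23.example.org",
    "sub.example.org",        # no "press" in the subdomain
    "press.sub.example.org",  # sub-subdomain, so it should be rejected
]
for host in hosts:
    print(host, matches_press(host))
```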
Note that this isn't a very strict check for valid domains. It might be worth thinking about whether regexes are actually the best tool for the "subdomain checking" part of this task. You might instead want to do something like this:
Use a generic, more thorough, domain-validation regex to check that the domain name is valid.
Split the domain name into parts using String.split('.').
Check that the number of parts is correct (i.e. 3), and that the parts meet your requirements (i.e. the first contains the substring press, the second is example, and the third is org).
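The split-based steps could look roughly like this in Python. It is only a sketch: it assumes the generic validation from the first step has already been done elsewhere, the function name is_press_subdomain is hypothetical, and lowercasing the input is an added assumption based on domain names being case-insensitive.

```python
def is_press_subdomain(host: str) -> bool:
    """Check the parts of an already-validated host name one by one."""
    parts = host.lower().split(".")
    # Exactly three parts: subdomain, domain, TLD.
    if len(parts) != 3:
        return False
    subdomain, domain, tld = parts
    return "press" in subdomain and domain == "example" and tld == "org"


for host in ["free-press.example.org", "sub.example.org", "press.sub.example.org"]:
    print(host, is_press_subdomain(host))
```

Compared with a single regex, each requirement is a separate, readable condition, which makes it easier to change one of them later.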
If you're looking for a regex that matches URLs whose subdomains contain the word press then use
^[^\.]*press[^\.]*\.example\.org$
See the demo
I am trying to have Google Analytics record a goal when a user lands on a specific page. My client's sales funnel uses dynamic parameters within the URL as you can see below. Unfortunately, I cannot figure out how to fire a goal when a user lands on /2eR54 and not when they land on /confirm.
Goal: /booking/SDFG/2eR54
Not Goal: /booking/SDFG/confirm
Quick notes: The first dynamic field is always four capitalized letters; the second dynamic parameter is a combination of digits and capitalized and non-capitalized letters.
I've tried using the following but /confirm still fires a conversion: .*/booking/[A-Z]{4}/[0-9A-Za-z]{5}
I appreciate your help with this matter.
The regex \/booking\/[A-Z]{4}\/\w{5}$ may fit your needs.
It matches /booking/, four capital letters, /, and then 5 word characters, followed by the end of the string. (If this URL won't be at the end of the string, you may also be able to use \b, the word boundary.)
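To see the effect of the end-of-string anchor, the pattern can be tried on the two paths from the question. This Python sketch is for illustration only; the helper name is_goal is hypothetical, and the regex dialect used by Google Analytics may differ in details from Python's re module.

```python
import re

# Four capital letters, then exactly five word characters at the end.
GOAL_PATH = re.compile(r"/booking/[A-Z]{4}/\w{5}$")


def is_goal(path: str) -> bool:
    """True if the path ends with a five-character booking code."""
    return GOAL_PATH.search(path) is not None


print(is_goal("/booking/SDFG/2eR54"))    # True
print(is_goal("/booking/SDFG/confirm"))  # False: "confirm" has 7 characters
```

Without the trailing $, the first five characters of confirm would satisfy the pattern, which is why the original attempt still fired on that page.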
Demo
I am attempting to split domains into different parts (subdomain, domain, TLD) and am having trouble.
I can't figure out a way to match any number of subdomains without it overtaking my domain or TLD matching. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: clevername
tld: co.uk
If I add another fruit to the subdomain (strawberry.apple.orange.banana.clevername.co.uk), the match will fail. If I change the {0,3} in the subdomain part to a higher number or an unlimited number of matches, it gets too greedy and I no longer end up with a correct match for the domain/TLD. Example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!
I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
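For readers who want to try the first pattern outside Splunk, here is a Python sketch of it. Python's re module requires the (?P<name>...) spelling for named groups, so the group syntax is adapted; the helper name split_domain is hypothetical, and the input strings keep the surrounding spaces described in the question.

```python
import re

# Same pattern as in the answer, with named groups in Python's spelling.
SPLIT = re.compile(
    r"\s((?P<subdomain>[a-z0-9\.\-]*)\.)?"
    r"(?P<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))"
    r"\.(?P<tld>[a-z\.]{3,6})\s"
)


def split_domain(text: str):
    """Return a dict with subdomain, domain and tld, or None if no match."""
    match = SPLIT.search(text)
    if match is None:
        return None
    return {name: match.group(name) for name in ("subdomain", "domain", "tld")}


for line in [
    " strawberry.apple.orange.banana.clevername.co.uk ",
    " tangerine.com.au ",
    " simple.com ",
]:
    print(split_domain(line))
```

When there is no subdomain, the subdomain entry is None because the optional group did not take part in the match.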
I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: The regex specifically states that the domain can't be com, this needs to be updated if other {3 characters}.xyz tlds need to be supported
You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.
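The longest-suffix idea can be sketched in Python as follows. The SUFFIXES set is a tiny hypothetical stand-in, not the real Public Suffix List, and the wildcard and exception rules mentioned above are not handled; split_host is likewise a made-up name.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for the Public Suffix List (assumption: the real
# list is far larger and must be downloaded and parsed separately).
SUFFIXES = {"com", "uk", "co.uk", "au", "com.au", "ly", "blogspot.com"}


def split_host(host: str) -> Optional[Tuple[Optional[str], str, str]]:
    """Return (subdomain, domain, public_suffix), or None if no suffix fits."""
    labels = host.lower().split(".")
    # Start at index 1 so at least one label is left for the domain;
    # the first hit is the longest listed suffix.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES:
            subdomain = ".".join(labels[: i - 1]) or None
            return subdomain, labels[i - 1], suffix
    return None


print(split_host("apple.orange.banana.clevername.co.uk"))
print(split_host("something.bit.ly"))
print(split_host("example.blogspot.com"))
```

Because the decision comes from the list rather than from label lengths, short domains such as bit.ly and long TLDs need no special cases.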
I looked up a couple of questions on SO which seem to suggest that two consecutive hyphens (e.g. my--website.com) are not allowed, but when I search for the same domain name on http://www.register.com/index.rcmx, it gladly accepts the name while rejecting invalid domain names like my#website.com.
Validation for URL/Domain using Regex? (Rails)
It's legal in a domain name, and required for internationalised domain names (IDNs), which when converted from Unicode to ASCII end up prefixed with xn--.
In general double hyphens are allowed. However, in your specific example, it should not be possible, because they may not occur on the third and fourth position except when writing IDN labels in their xn-- notation. See the following section from RFC 5891:
4.2.3.1. Hyphen Restrictions
The Unicode string MUST NOT contain "--" (two consecutive hyphens) in
the third and fourth character positions and MUST NOT start or end with a "-" (hyphen).
Some TLDs allow as many consecutive dashes as you want, others seem to have specific rules about their positioning.
This is a working website: l-------------------------------------------------------------l.tk (mirror)
The universal rules are:
Not in the 3rd and 4th position, except as part of an IDN
Not at the beginning or end
Nothing else can be counted on unless you thoroughly test each individual TLD.
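The two universal rules can be written as a small per-label check. This Python sketch encodes only those two rules as listed; the function name hyphens_ok is hypothetical, and it says nothing about what an individual TLD registry will actually accept.

```python
def hyphens_ok(label: str) -> bool:
    """Check one dot-free label against the two hyphen rules above."""
    # Rule: no hyphen at the beginning or end.
    if label.startswith("-") or label.endswith("-"):
        return False
    # Rule: no "--" in the 3rd and 4th position, except the IDN prefix "xn--".
    if label[2:4] == "--" and not label.lower().startswith("xn--"):
        return False
    return True


for label in ["my-website", "my--website", "xn--bcher-kva", "-website", "a--b"]:
    print(label, hyphens_ok(label))
```

A doubled hyphen elsewhere in the label, as in a--b, passes this check, since only the third and fourth positions are restricted.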
It must be "allowed", regardless of the answers here, because
https://hp--community.force.com exists. However, perhaps that's only okay because it's a subdomain of a registered domain, and not a registered domain in itself.
I am still figuring my way around regex and have come across a problem that I am trying to solve. How do I validate for multiple specific email addresses?
For example, I want to only allow testdomain.com, realdomain.com, gooddomain.com to be validated. All other email addresses are not allowed.
annie@testdomain.com OK
aaron1@realdomain.com OK
amber@gooddomain.com OK
annie@otherdomain.com NOT OK
But I'm still unclear on how to add multiple specific email domains to the regex.
Any and all help would be appreciated.
Thank you,
Do you mean to include various legitimate domains in one regex?
\b[A-Z0-9._%-]+@(testdomain|gooddomain|realdomain)\.com\b
You didn't specify which language you're using, but most regex implementations have a notion of logical operators, so the domain part of your pattern would have something like:
(domain1|domain2|domain3)
\b[A-Z0-9._%-]+@(testdomain|realdomain|gooddomain)\.com\b
Assuming the above works for testdomain:
\b[A-Z0-9._%-]+@(?:testdomain|realdomain|gooddomain)\.com\b
Also, please note that you will have to add a case insensitive i modifier for this to work with your test cases, or use [A-Za-z0-9._%-] instead of [A-Z0-9._%-]
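A Python sketch of the alternation approach with the case-insensitive flag, using ordinary @-delimited addresses. The helper name is_allowed is hypothetical, and using fullmatch so that the whole string must be an allowed address is a design choice made here, not something the answers above require.

```python
import re

# Alternation over the allowed domains; IGNORECASE lets [A-Z] match lowercase.
ALLOWED = re.compile(
    r"\b[A-Z0-9._%-]+@(?:testdomain|realdomain|gooddomain)\.com\b",
    re.IGNORECASE,
)


def is_allowed(address: str) -> bool:
    """True if the entire string is an address at an allowed domain."""
    return ALLOWED.fullmatch(address) is not None


for address in [
    "annie@testdomain.com",
    "aaron1@realdomain.com",
    "amber@gooddomain.com",
    "annie@otherdomain.com",
]:
    print(address, is_allowed(address))
```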
See here
To make this expandable to many domains, I would probably capture the domain name and then compare that captured domain name with your whitelist in code.
.+@(.+)
First, ".+" will match one or more of any character, up to the last "@" symbol in the string.
Second, "@" will match the "@" symbol.
Third, "(.+)" will match and capture (capture because of the parentheses) the string of characters after the "@" symbol.
Then, depending on the language you are using, you can get the captured string. Then you can see if that captured string is in your domain whitelist. Note, you'll want to do a case insensitive comparison in this last step.
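In Python, the capture-then-compare approach might look like the sketch below, using ordinary @-delimited addresses. The names WHITELIST and domain_allowed are hypothetical, and lowercasing the captured domain is how the case-insensitive comparison is done here.

```python
import re

# Hypothetical whitelist; extend it without touching the regex.
WHITELIST = {"testdomain.com", "realdomain.com", "gooddomain.com"}

# Greedy ".+" runs to the last "@"; the group captures what follows it.
CAPTURE = re.compile(r".+@(.+)")


def domain_allowed(address: str) -> bool:
    """True if the part after the last '@' is in the whitelist."""
    match = CAPTURE.fullmatch(address)
    if match is None:
        return False
    return match.group(1).lower() in WHITELIST


for address in ["annie@testdomain.com", "Amber@GoodDomain.COM", "annie@otherdomain.com"]:
    print(address, domain_allowed(address))
```

Keeping the domains in a set means the list can grow or be loaded from configuration while the pattern stays the same.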
The official standard is known as RFC 2822.
Use OR operator | for all domain names you want to allow. Do not forget to escape . in the domain.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:testdomain\.com|realdomain\.com|gooddomain\.com)
Also use case-insensitivity modifier/flag to allow capital letters in the address.