Regex for matching different parts of a domain

I am attempting to split domains into different parts (subdomain, domain, TLD) and am having trouble.
I can't figure out a way to match any number of subdomains without them overtaking my domain or TLD matching. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: clevername
tld: co.uk
If I add another fruit to the subdomain (strawberry.apple.orange.banana.clevername.co.uk), the match will fail. If I raise the {0,3} quantifier on the subdomain group to a higher number or make it unbounded, it gets too greedy and I no longer end up with a correct match for the domain/TLD. Example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!

I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
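To make that limitation concrete, here is a small Python sketch of the pattern above. Python's re module spells named groups (?P<name>...) rather than PCRE's (?<name>...); otherwise the pattern is unchanged. The function name split_host is only illustrative.

```python
import re

# The answer's pattern, with named groups in Python's (?P<name>...) spelling.
PATTERN = re.compile(
    r"\s((?P<subdomain>[a-z0-9\.\-]*)\.)?"
    r"(?P<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))"
    r"\.(?P<tld>[a-z\.]{3,6})\s"
)


def split_host(text):
    """Return (subdomain, domain, tld) for a space-delimited host, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group("subdomain"), m.group("domain"), m.group("tld")


for host in ["strawberry.apple.orange.banana.clevername.co.uk",
             "tangerine.com.au", "simple.com", "something.bit.ly"]:
    # The data set has a space before and after each host.
    print(host, "->", split_host(f" {host} "))
```

The first three hosts split as the question asks. The last one comes back with something as the domain and bit.ly as the TLD, which is exactly the short-domain case described above.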

I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: The regex explicitly forbids com as the domain (so that com.au and similar are treated as the TLD). The lookahead needs to be extended if other TLDs of the form {3 characters}.xyz must be supported.

You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.
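A minimal sketch of the longest-suffix idea in Python. The SUFFIXES set is a hypothetical, hand-picked stand-in for the real Public Suffix List, and the sketch deliberately ignores the list's wildcard and exception rules, so treat it as an illustration rather than a complete implementation.

```python
# Hypothetical stand-in for the Public Suffix List. A real implementation
# would load the full list and honour its wildcard ("*.") and exception
# ("!") rules, which this sketch ignores.
SUFFIXES = {"com", "uk", "co.uk", "au", "com.au", "ly", "blogspot.com"}


def split_by_suffix(host, suffixes=SUFFIXES):
    """Split host into (subdomain, domain, suffix) using the longest known suffix."""
    labels = host.split(".")
    # Scanning start indexes left to right tests the longest candidate first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return "", "", candidate  # the host itself is a public suffix
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    return None  # no known suffix


print(split_by_suffix("strawberry.apple.orange.banana.clevername.co.uk"))
print(split_by_suffix("something.bit.ly"))
print(split_by_suffix("example.blogspot.com"))
```

With the suffix known, something.bit.ly splits into something / bit / ly, and example.blogspot.com reports blogspot.com as the suffix, matching the hosting-provider caveat above.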

Related

Validate domain with regex but not subdomain

I couldn't find a regex anywhere that validates a domain without also accepting subdomains.
I found a lot of rules that validate domains, but unfortunately all of them also validate subdomains.
Does anyone have tips on this?
I have this regex that is almost what I need:
/(?!www\.)(?=^.{5,254}$)(^(?:(?!\d+\.)[a-z0-9\-]{1,63}\.){1,2}(?:[a-z]{2,})$)/
If I use a subdomain like test.domain.com.br, it behaves correctly (rejecting it), but test.domain.com is not rejected.
"I couldn't find a regex anywhere that validates a domain without also accepting subdomains."
Because no regex can do that for you (and anyone pretending the opposite just doesn't understand the DNS).
Which is exactly why you found:
"a lot of rules that validate domains, but unfortunately all of them also validate subdomains."
Because a "subdomain" is just a domain seen differently (or you can say that any domain is also a subdomain of another domain, except for root and TLD). This is all because the DNS is a tree.
You can use the definition given in https://www.rfc-editor.org/rfc/rfc8499:
Subdomain: "A domain is a subdomain of another domain if it is
contained within that domain. This relationship can be tested by
seeing if the subdomain's name ends with the containing domain's
name." (Quoted from [RFC1034], Section 3.1) For example, in the
host name "nnn.mmm.example.com", both "mmm.example.com" and
"nnn.mmm.example.com" are subdomains of "example.com". Note that
the comparisons here are done on whole labels; that is,
"ooo.example.com" is not a subdomain of "oo.example.com".
You cannot find administrative boundaries just by looking at a hostname. You need either live DNS queries to find the delegation points or something like the Public Suffix List maintained by Mozilla. Both approaches have drawbacks that may or may not be a problem depending on your use case.
If you are not convinced, here is a list of valid hostnames (you can use them in a URL and it will work). Try to work out how a regex could be right in all of these cases:
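As an aside on the RFC 8499 definition quoted above, the whole-label comparison can be sketched in Python. The function name is_subdomain is illustrative, and treating a name as not being a subdomain of itself is an assumption of this sketch, not something the quoted text settles.

```python
def is_subdomain(name, parent):
    """True if `name` is contained within `parent`, comparing whole labels."""
    name_labels = name.lower().rstrip(".").split(".")
    parent_labels = parent.lower().rstrip(".").split(".")
    if len(name_labels) <= len(parent_labels):
        return False  # assumption: a name is not a subdomain of itself
    # Compare the trailing labels one by one, never raw characters.
    return name_labels[-len(parent_labels):] == parent_labels


print(is_subdomain("nnn.mmm.example.com", "example.com"))    # True
print(is_subdomain("ooo.example.com", "oo.example.com"))     # False
print("ooo.example.com".endswith("oo.example.com"))          # True: naive string test is wrong
```

Note that this only tests the relationship between two given names. It says nothing about where the administrative boundary lies.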
dk
www.sante.gouv.fr
www.com.com
www.nominet.co.uk
www.uk.com
www.walton.k12.fl.us
lagazettedesancetres.blogspot.fr
www.al.ma.leg.br
ab.m.wikibooks.nom.nu
1512f1.станок.спб.рус
You can obviously find shortcuts where a regex will still be wrong but good enough, if you restrict the cases you need to act on. Otherwise, if you need to stay generic and potentially work in any TLD, then, sorry, no regex will solve your problem.
Also, your regex is wrong in multiple other cases. For example, it won't handle IDN TLDs, which do exist: in ASCII form they look like xn--something, which [a-z]{2,} does not accept.
BTW, here is some useful terminology that is often clearer than domain/subdomain, taken from https://url.spec.whatwg.org/#host-miscellaneous
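The IDN point can be checked quickly in Python. This sketch uses the standard library's "idna" codec to get the ASCII form of the рус label from the hostname list above; TLD_PATTERN is just the TLD fragment of the question's regex pulled out for illustration.

```python
import re

TLD_PATTERN = re.compile(r"[a-z]{2,}")  # TLD fragment of the question's regex

# Python's built-in "idna" codec converts a Unicode label to its ASCII form.
ascii_tld = "рус".encode("idna").decode("ascii")

print(ascii_tld)
print(bool(TLD_PATTERN.fullmatch("com")))      # True
print(bool(TLD_PATTERN.fullmatch(ascii_tld)))  # False: the xn-- prefix contains hyphens
```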
"A host’s public suffix is the portion of a host which is included on the Public Suffix List."
"A host’s registrable domain is a domain formed by the most specific public suffix, along with the domain label immediately preceding it, if any."
I think what you are searching for is the "registrable domain" part of any given string. As you can see from the algorithm given at the URL above, you can't get that without first finding the public suffix, and you can't find the public suffix without an external resource: the information is NOT self-contained in the string.

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. I made some modifications and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Edit:
After some experimenting, I think this one is about as simple as it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
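For a quick check outside a regex tester, here is a Python sketch that runs the final pattern as written above. The helper name parse_url and the dictionary keys are illustrative labels for the five capture groups, not part of the original answer.

```python
import re

# The final pattern from the answer, copied as-is.
URL = re.compile(
    r"^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$"
)


def parse_url(url):
    """Return the captured parts as a dict, or None if the URL is rejected."""
    m = URL.match(url)
    if m is None:
        return None
    return dict(zip(("scheme", "host", "port", "path", "rest"), m.groups()))


for url in ["http://foo.co.uk:8080/path?x=1", "foo.com", "foo.co.u", "foo.co."]:
    print(url, "->", parse_url(url))
```

The last two inputs are rejected because the text after the final dot of the host must be at least two letters.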

Google Analytics IP Filter Exclude

Could someone help me with some REGEX...
I have been blocking internal traffic using the filter pattern:
10.*..
This just bit me in the foot as this is blocking all referral traffic between our sites.
What I want to do now is block everything except 10.103..
Do I need to apply two separate ranges, or can I accomplish this with one filter?
If you want to block everything but 10.103.xxx.xxx, use an include filter instead of the usual exclude filter.
NOTE ABOUT REGEXES MATCHING IPs IN ANALYTICS
I am not sure if the filter I suggested above uses regex or not (literal string match), but it doesn't make a difference because there's no way the expression 10.103. could be misinterpreted in an IP address.
Your original pattern, on the other hand, is bogus and is probably hurting you. That's because in a regex the dot . is not a literal dot, but represents any character. Your expression, in fact, excludes every single IP that merely starts with 10 (not just 10. that is ten-dot), including 100.xxx, 101.xxx etc.
The correct version of your original excluding regex would be 10\..*, which contains an escaped dot (\.), then proceeds to any characters after that (.*).
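The difference between the escaped and unescaped dot is easy to see in a short Python sketch. Matching from the start of the string with re.match is an assumption about how the filter is applied, and the include_only pattern is a hypothetical regex form of the 10.103. include filter suggested above.

```python
import re

unescaped = re.compile(r"10.*")          # "." matches any character
escaped = re.compile(r"10\..*")          # "\." matches only a literal dot
include_only = re.compile(r"10\.103\.")  # hypothetical include-filter pattern

for ip in ["10.103.4.5", "10.20.30.40", "100.20.30.40", "101.1.1.1"]:
    print(ip,
          bool(unescaped.match(ip)),
          bool(escaped.match(ip)),
          bool(include_only.match(ip)))
```

The unescaped pattern matches all four addresses, the escaped one only the two that start with ten-dot, and the include pattern only 10.103.4.5.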
Regular expressions are explained very well in the Google Analytics Help.
For multiple IPs, there is this little helper, which generates the REGEXP for you.
If you want to block internal traffic, just ADD NEW FILTER and CUSTOM then EXCLUDE and put the IP in REGEXP in the field, that's it.

Perl extract domain name from email address inc tld but excluding subdomains

I'm trying to do what the title says and I've got this:
sub getDomain {
    my $scalarRef = shift;
    # Split the address at the @ sign; element 1 is the host part.
    my @from_domain = split(/\@/, $$scalarRef);
    if ($from_domain[1] =~ m/^.*?(\w+\.\w+)$/) {
        # Debug output; $username is assumed to be defined elsewhere.
        print "$from_domain[1] $1" if ($username eq 'xxx');
        return $1;
    }
}
This works fine for user@domain.com, returning domain.com, but of course domain.co.uk returns co.uk and I need domain.co.uk. Any suggestions on how to proceed with this one? I'm guessing a module, and some suggest some kind of TLD lookup table.
Don't use a RegExp.
use Email::Address;
my ($addr) = Email::Address->parse('foo@domain.co.uk');
print "Domain: ".$addr->host."\n";
print "User: ".$addr->user."\n";
Prints:
Domain: domain.co.uk
User: foo
I think you're out of luck here. Net::Domain::TLD will give you a list of TLDs, but that's not actually what you want.
As I understand it, given an email address like user@sub.domain.com, you want to get domain.com. The TLD here is "com" and you want the TLD and the section of the domain that comes before it. That's easy.
And then there's user@sub.domain.co.uk. Here the TLD is "uk". But here you don't want the TLD and the section of the domain that precedes it - you want two sections before the TLD.
So perhaps you need a heuristic. If the TLD is three letters long, take the previous section of the domain, and if the TLD is two letters long, take the previous two sections.
But that doesn't work either. Not all ccTLDs have defined subdomains like .uk does. Take, for example, the popular .tv ccTLD. They allow you to register a domain directly under the ccTLD.
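A short Python sketch of that heuristic shows both where it works and where it breaks. The function name guess_domain and the example hosts are illustrative only.

```python
def guess_domain(host):
    """Naive heuristic: keep 3 labels for two-letter TLDs, otherwise 2."""
    labels = host.split(".")
    keep = 3 if len(labels[-1]) == 2 else 2
    return ".".join(labels[-keep:])


print(guess_domain("sub.domain.com"))    # domain.com
print(guess_domain("sub.domain.co.uk"))  # domain.co.uk
print(guess_domain("www.example.tv"))    # www.example.tv, but example.tv was wanted
```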
So you don't just need a list of TLDs. You also need to understand the rules that each of the TLDs apply to registrations. And they could change over time. And new TLDs are being introduced - you'd need to keep up with all of those.
Oh, and one last point. Even big ccTLDs like .uk don't always follow their own rules. There are a few .uk domains that don't have a top-level subdomain - british-library.uk, for example.
You might be able to implement this for a sub-set of domains that you're particularly interested in. But a full solution would be incredibly complex and almost impossible to keep up to date.

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
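The first-match-wins behaviour can be simulated in a few lines of Python. This is a sketch of the idea, not CodeIgniter code: the ROUTES list and resolve function are hypothetical, and re.fullmatch stands in for the anchoring that the framework is said to add around each route.

```python
import re

# Ordered (pattern, target) pairs; exact routes come before the wild card.
ROUTES = [
    (r"home", "pages"),
    (r"activity", "user/activity"),
    (r"([A-Za-z0-9][A-Za-z0-9_-]{2,254})", "view/slug/$1"),
]


def resolve(uri, routes=ROUTES):
    """Return the target of the first route whose pattern matches the whole URI."""
    for pattern, target in routes:
        m = re.fullmatch(pattern, uri)
        if m:
            # Replace $1, $2, ... with the captured groups.
            for i, value in enumerate(m.groups(), start=1):
                target = target.replace(f"${i}", value)
            return target
    return None


print(resolve("home"))                # pages
print(resolve("user-name"))           # view/slug/user-name
print(resolve("home", ROUTES[::-1]))  # view/slug/home: wild card listed first wins
```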
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. A dash at the very start or end of the class, as here, is already treated literally, but escaping it makes it explicit that you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but routing in most frameworks operates on precedence. In other words, the default controller, 404, etc. routes should be defined first. Then you can simplify your regex to only match the slugs.
OK, answering my own question.
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start-of-line anchor (^) and a bunch of end-of-line anchors ($).
Hope this helps someone who may run into this problem later.
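For anyone comparing the lookahead variants in this thread, here is a small Python check of how they behave. The names are illustrative, exact_lookahead is a condensed form of the home$|activity$ idea, and re.fullmatch stands in for the anchoring done by the router, which is an assumption of this sketch.

```python
import re

SLUG = r"[A-Za-z0-9][A-Za-z0-9_-]{2,254}"

# Lookahead without "$": rejects anything that merely starts with a reserved word.
prefix_lookahead = re.compile(r"(?!home|activity)" + SLUG)
# Lookahead with "$": rejects only slugs that are exactly a reserved word.
exact_lookahead = re.compile(r"(?!(?:home|activity)$)" + SLUG)

for slug in ["home", "homepage", "activity-log", "user-name"]:
    print(slug,
          bool(prefix_lookahead.fullmatch(slug)),
          bool(exact_lookahead.fullmatch(slug)))
```

Both variants reject home and accept user-name. They differ on homepage and activity-log, which only the variant with $ inside the lookahead accepts.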