Perl extract domain name from email address inc tld but excluding subdomains - regex

I'm trying to do what the title says and I've got this:
sub getDomain {
my $scalarRef = shift;
my #from_domain = split(/\#/,$$scalarRef);
if($from_domain[1] =~ m/^.*?(\w+\.\w+)$/){
print "$from_domain[1] $1" if($username eq 'xxx');
return $1;
}
}
Works fine for user#domain.com returning domain.com, but of course domain.co.uk will return .co.uk and I need domain.co.uk. Any suggestions on how to proceed with this one, I'm guessing a module and some suggest some kind of tld lookup table.

Don't use a RegExp.
use Email::Address;
my ($addr) = Email::Address->parse('foo#domain.co.uk');
print "Domain: ".$addr->host."\n";
print "User: ".$addr->user."\n";
Prints:
Domain: domain.co.uk
User: foo

I think you're out of luck here. Net::Domain::TLD will give you a list of TLDs, but that's not actually what you want.
As I understand it, given an email address like user#sub.domain.com, you want to get domain.com. The TLD here is "com" and you want the TLD and the section of the domain that comes before it. That's easy.
And then there's user#sub.domain.co.uk. Here the TLD is "uk". But here you don't want the TLD and the section of the domain that precedes it - you want two sections before the TLD.
So perhaps you need a heuristic. If the TLD is three letters long, take the previous section of the domain, and if the TLD is three letters long, take the previous two sections.
But that doesn't work either. Not all ccTLDs have defined subdomains like .uk does. Take, for example, the popular .tv ccTLD. They allow you to register a domain directly under the ccTLD.
So you don't just need a list of TLDs. You also need to understand the rules that each of the TLDs apply to registrations. And they could change over time. And new TLDs are being introduced - you'd need to keep up with all of those.
Oh, and one last point. Even big ccTLDs like .uk don't always follow their own rules. There are a few .uk domains that don't have a top-level subdomain - .british-library.for example.
You might be able to implement this for a sub-set of domains that you're particularly interested in. But a full solution would be incredibly complex and almost impossible to keep up to date.

Related

Validate domain with regex but not subdomain

I couldn't find anywhere a regex that could validate a domain but not accepting subdomains.
I found a lot of rules that validates domains but unfortunately all of them also validates subdomains.
Anyone have tips on this?
I have this regex that is almost what I need:
/(?!www\.)(?=^.{5,254}$)(^(?:(?!\d+\.)[a-z0-9\-]{1,63}\.){1,2}(?:[a-z]{2,})$)/
If I use a subdomain like test.domain.com.br, it validates good (rejecting it), but test.domain.com don't.
I couldn't find anywhere a regex that could validate a domain but not accepting subdomains.
Because no regex can do that for you (and anyone pretending the opposite just doesn't understand the DNS).
Which is exactly why you found out that:
a lot of rules that validates domains but unfortunately all of them also validates subdomains.
Because a "subdomain" is just a domain seen differently (or you can say that any domain is also a subdomain of another domain, except for root and TLD). This is all because the DNS is a tree.
You can use the definition given in https://www.rfc-editor.org/rfc/rfc8499:
Subdomain: "A domain is a subdomain of another domain if it is
contained within that domain. This relationship can be tested by
seeing if the subdomain's name ends with the containing domain's
name." (Quoted from [RFC1034], Section 3.1) For example, in the
host name "nnn.mmm.example.com", both "mmm.example.com" and
"nnn.mmm.example.com" are subdomains of "example.com". Note that
the comparisons here are done on whole labels; that is,
"ooo.example.com" is not a subdomain of "oo.example.com".
You can not find administrative boundaries given an hostname by just looking at it. You need either to do DNS live queries to find the delegation points OR you need to use something like the Public Suffix List maintained by Mozilla. Both cases have drawbacks that can be or not a problem depending on your use case.
If you are not convinced, here is some list of valid hostnames (you can use them in an URL and it will work), and try to find out how a regex could have helped you by being right in all cases:
dk
www.sante.gouv.fr
www.com.com
www.nominet.co.uk
www.uk.com
www.walton.k12.fl.us
lagazettedesancetres.blogspot.fr
www.al.ma.leg.br
ab.m.wikibooks.nom.nu
1512f1.станок.спб.рус
You can obviously find shortcuts where a regex will still be wrong but good enough, if you restrict the cases you need to act on. Otherwise, if you need to stay generic and potentially work in any TLD, then, sorry, no regex will solve your problem.
Also your regex is wrong in multiple other cases. For example it won't handle IDN TLDs, that do exist, as they will be like xn--something in ASCII form which won't be accepted by [a-z]{2,}
BTW, useful terminologies I suggest using which may often be clearer than domain/subdomain, as taken from https://url.spec.whatwg.org/#host-miscellaneous
"A host’s public suffix is the portion of a host which is included on the Public Suffix List."
"A host’s registrable domain is a domain formed by the most specific public suffix, along with the domain label immediately preceding it, if any."
I think what you are searching is the "registrable domain" part of any given string (and as you can see from the algorithm given at above URL, you can't do that without finding first the public suffix, which you can't do without using an external resource, the information is NOT self contained in the string).

detect domains from text with regular expression

I've been finding url from text by preg_match with this pattern /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/
Any further solutions for detecting domains instead of url? May be with a list of top-level domain like this:
.asia .biz .cat .com .net .edu .gov .info .com.eu .com.au
//Edit
For example I have a paragraph like this:
Hello world. https://stackoverflow.com/posts/22112284/edit
And I want to find this domain stackoverflow.com in that text.
If you just want the domain, then just stop at a slash. In fact, you've got it there already, just shorten it. I also added another spot the the end, as there are some wierd top level domains out there (.info, .mobi for instance)
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}

Regex for matching different parts of a domain

I am attempting to split up domains into different categories (Subdomain, Domain, TLD) and am having trouble..
I can't figure out a way to match any number of subdomains and not overtake my domain or TLD mathcing. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: google
tld: co.uk
If I add another fruit to the subdomain(strawberry.apple.orange.banana.clevername.co.uk), the match will fail. If I modify the {0,3} for the subdomain regex to a higher number or an unlimited number of matches, it gets too greedy and I no longer end up with a correct match for a domain/tld. Example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!
I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: The regex specifically states that the domain can't be com, this needs to be updated if other {3 characters}.xyz tlds need to be supported
You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
User's are only allowed to register URL's that contain alphanumeric characters, dashes (-), and under scores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the url routing feature to route a few url's to features on my site (ex. /home -> /pages/index, /activity -> /user/activity) so those particular URL's obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with codeIgniter specifically, but most frameworks routing operate based on precedence. In other words, the default controller, 404, etc routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I've seem to come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parenthesis around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start of line identifier: ^ and a bunch of end of line identifiers: $
Hope this helps someone who may run into this problem later.

Need regex to get domain + subdomain

So im using this function here:
function get_domain($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}
return false;
}
$referer = get_domain($_SERVER['HTTP_REFERER']);
And what i need is another regex for it, if someone would be so kind to help.
Exactly what i need is for it to get the whole domain, including subdomains.
Lets say as a real problem i have now. When people blogging link from example: myblog.blogger.com
The referer url will be just blogger.com, which is not ideal..
So if someone could help me so i can get the including subdomain as regex code for the function above, id apreciate it alot!
Thanks!
This regex should match a domain in a string, including any dubdomains:
/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/
Translated to rough english, it functions like this: "match the first part of the string that has 'sometextornumbers.sometext', and also include any number of 'sometextornumbers.' that might preceed it.
See it in action here: http://regexr.com?2vppk
Note that the multiline and global flags in that link are only there to be able to match the entire blob of test-text, so you don't need if you're passing only one line to the regex
Good luck with the above as Domain names now contain non-roman characters. These would have to be processed into equivalent but unique ascii before regex could work reliably. See RFC 3490 Internationalizing Domain Names in Applications (IDNA) ...
See https://www.rfc-editor.org/rfc/rfc3490
which has
Until now, there has been no standard method for domain names to use
characters outside the ASCII repertoire. This document defines
internationalized domain names (IDNs) and a mechanism called
Internationalizing Domain Names in Applications (IDNA) for handling
them in a standard fashion. IDNs use characters drawn from a large
repertoire (Unicode), but IDNA allows the non-ASCII characters to be
represented using only the ASCII characters already allowed in so-
called host names today. This backward-compatible representation is
required in existing protocols like DNS, so that IDNs can be
introduced with no changes to the existing infrastructure. IDNA is
only meant for processing domain names, not free text.
I guess this is an optimization for the first suggestion.
The main improvements:
does not react to invalid pattern sub..domain.xyz
captures more that one sub-domain as group
captures port if given
https://((?:[a-z0-9-]+\.)*)([a-z0-9-]+\.[a-z]+)($|\s|\:\d{1,5})
Test it: https://regex101.com/r/njFIil/1
This regex does not handle any unicode symbols, which could be a problem as mentioned above.
Better solution:
/^([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/
Regex sample:
https://regexr.com/4k71a
And for email address:
/^[a-z0-9|.|-]+[a-z0-9]{1,}#([a-z0-9|-]+[a-z0-9]{1,}\.)*[a-z0-9|-]+[a-z0-9]{1,}\.[a-z]{2,}$/