detect domains from text with regular expression - regex

I've been finding URLs in text using preg_match with this pattern: /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/
Is there a way to detect domains instead of URLs? Maybe with a list of top-level domains like this:
.asia .biz .cat .com .net .edu .gov .info .com.eu .com.au
//Edit
For example I have a paragraph like this:
Hello world. https://stackoverflow.com/posts/22112284/edit
And I want to find this domain stackoverflow.com in that text.

If you just want the domain, stop at the slash. In fact, you already have it in your pattern; just shorten it. I also allowed one more character at the end, as there are some weird top-level domains out there (.info and .mobi, for instance).
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}
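As a rough illustration of the shortened pattern, here is a minimal Python sketch (the question uses PHP's preg_match; Python is used here only for demonstration). The capture group around the host and the name find_domains are additions for this example, not part of the original answer.

```python
import re

# Same idea as the shortened pattern: scheme, "://", then the host.
# The capture group around the host is an addition for this sketch.
URL_HOST = re.compile(r"(?:https?|ftps?)://([a-zA-Z0-9\-.]+\.[a-zA-Z]{2,4})")


def find_domains(text):
    """Return the host of every URL found in text, in order of appearance."""
    return URL_HOST.findall(text)


text = "Hello world. https://stackoverflow.com/posts/22112284/edit"
print(find_domains(text))  # ['stackoverflow.com']
```

Note that this returns the full host, so a URL on www.example.com would yield www.example.com rather than example.com.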

Related

Regex for both website url versions with wildcard

I'm trying to add in allowed urls in a watchguard firebox webblocker list using regular expression. I'm trying to keep my list short by allowing one entry to apply to both www and non-www versions of a site including subdomains. I'm currently using the following:
(www\.)?ups\.com/*
Which works great for both versions plus subdomains, but has an issue as it allows other sites through that end their domain with ups.com such as jobs-ups.com
How can I make the regular expression know that if there is no subdomain that the url is only going to be ups.com without any other letters before the u, so it will block sites like jobs-ups.com?
You can use the caret ^ to accomplish this
^(?:www\.)?ups\.com\/
The caret forces the check at the start of the string. This means it will not match in mid-string, which is what you are wanting.
Not familiar with firebox at all, but generally you should escape your periods and forward slashes. You would also generally use a non-capturing group as well. But if this is simple regex, you can still preserve your original formatting:
^(www.)?ups.com/*
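To see the effect of the caret, here is a small Python sketch comparing the anchored and unanchored patterns. It assumes the filter sees the URL without its scheme (ups.com/... rather than http://ups.com/...); how the Firebox actually presents URLs to the pattern is not confirmed here, and is_allowed is a hypothetical helper name.

```python
import re

# Anchored pattern from the answer and the asker's original unanchored one.
ANCHORED = re.compile(r"^(?:www\.)?ups\.com/")
UNANCHORED = re.compile(r"(www\.)?ups\.com/*")


def is_allowed(url, pattern=ANCHORED):
    """Return True if the pattern matches somewhere in url."""
    return pattern.search(url) is not None


for url in ["ups.com/", "www.ups.com/tracking", "jobs-ups.com/"]:
    print(url, is_allowed(url), is_allowed(url, UNANCHORED))
# ups.com/ True True
# www.ups.com/tracking True True
# jobs-ups.com/ False True
```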

Regex for matching different parts of a domain

I am attempting to split up domains into different categories (Subdomain, Domain, TLD) and am having trouble.
I can't figure out a way to match any number of subdomains without overtaking my domain or TLD matching. I am using PCRE regex.
Current regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,3}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Data set:
apple.orange.banana.clevername.co.uk
strawberry.apple.orange.banana.clevername.co.uk
tangerine.com.au
simple.com
Note: There are spaces before and after the domains and they will always be lower case.
An example of how this data would match:
apple.orange.banana.clevername.co.uk
subdomain: apple.orange.banana
domain: clevername
tld: co.uk
If I add another fruit to the subdomain(strawberry.apple.orange.banana.clevername.co.uk), the match will fail. If I modify the {0,3} for the subdomain regex to a higher number or an unlimited number of matches, it gets too greedy and I no longer end up with a correct match for a domain/tld. Example of this:
Modified regex:
\s(?:(?<subdomain>[a-z0-9\-]*){0,1}\.){0,5}(?<domain>(?>([a-z0-9\-]+)))\.(?<tld>[a-z\.]{2,6})\s
Resulting match with new regex:
strawberry.apple.orange.banana.clevername.co.uk
subdomain: strawberry.apple.orange.banana.clevername
domain:
tld: co.uk
I'm sure the regex isn't the most efficient either so any help or suggestions would be greatly appreciated. Thanks!
I believe this should do it for you:
\s((?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>[a-z0-9\-]{3,}(?=\.[a-z\.]{3,6}))\.(?<tld>[a-z\.]{3,6})\s
Tested this in Splunk and it works with your test data set.
Do note that this won't work for very short domains like bit.ly because there is no way to tell the domain from the subdomain without doing a lookup of the TLD.
For example, compare something.bit.ly and clevername.com.au. Without outside information, there is no way to tell that bit and clevername are the domains.
I recently came across the same problem. So I took Syon's regex and modified it a bit. This is the result:
\s(?:(?<subdomain>[a-z0-9\.\-]*)\.)?(?<domain>(?!com)[a-z0-9\-]{3,}(?=\.[a-z\.]{2,}))\.(?:(?<tld>[a-z\.]{2,})$)\s
It works on the whole test data set (I trimmed the spaces though), as well as short domains like bit.ly. Also works for new top level domains like .cancerresearch. See result:
https://regex101.com/r/nX6yQ7/4
Note: The regex specifically states that the domain can't be com, this needs to be updated if other {3 characters}.xyz tlds need to be supported
You could try to find the longest suffix of the domain which is still listed in the Public Suffix List. After that, splitting the string should be easy.
Note that the list also considers domains of web hosters a public suffix. For example, in example.blogspot.com the public suffix is considered to be blogspot.com, not com. Also the list has to be parsed carefully as it contains comments and exceptions.
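The longest-suffix idea can be sketched in Python as follows. PUBLIC_SUFFIXES is a tiny hand-picked stand-in for the real Public Suffix List, and split_domain is a hypothetical helper; the real list also has wildcard and exception rules that this sketch ignores.

```python
# Hypothetical, hand-picked subset of public suffixes. The real list is far
# larger and contains wildcard and exception rules not handled here.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk", "au", "com.au", "ly"}


def split_domain(hostname):
    """Split hostname into (subdomain, domain, suffix) using the longest known suffix."""
    labels = hostname.lower().split(".")
    # Try the longest candidate suffix first, always leaving one label for the domain.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    return None  # no known suffix found


print(split_domain("apple.orange.banana.clevername.co.uk"))
# ('apple.orange.banana', 'clevername', 'co.uk')
print(split_domain("something.bit.ly"))
# ('something', 'bit', 'ly')
```

Because the suffix is found by lookup rather than by length, short domains such as bit.ly need no special case.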

Perl extract domain name from email address inc tld but excluding subdomains

I'm trying to do what the title says and I've got this:
sub getDomain {
    my $scalarRef = shift;
    my @from_domain = split(/\@/,$$scalarRef);
    if($from_domain[1] =~ m/^.*?(\w+\.\w+)$/){
        print "$from_domain[1] $1" if($username eq 'xxx');
        return $1;
    }
}
Works fine for user@domain.com, returning domain.com, but of course domain.co.uk will return co.uk and I need domain.co.uk. Any suggestions on how to proceed with this one? I'm guessing a module, and some suggest some kind of TLD lookup table.
Don't use a RegExp.
use Email::Address;
my ($addr) = Email::Address->parse('foo@domain.co.uk');
print "Domain: ".$addr->host."\n";
print "User: ".$addr->user."\n";
Prints:
Domain: domain.co.uk
User: foo
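For comparison, a roughly equivalent sketch in Python uses the standard library's email.utils.parseaddr instead of a regular expression. The helper name split_address is hypothetical. Like the Perl example, it returns the full host part of the address, not the registrable domain.

```python
from email.utils import parseaddr


def split_address(raw):
    """Return (user, host) for an address such as 'Foo <foo@domain.co.uk>'."""
    _display_name, address = parseaddr(raw)
    # Split on the last "@" so the host is everything after it.
    user, _at, host = address.rpartition("@")
    return user, host


user, host = split_address("foo@domain.co.uk")
print("Domain:", host)  # Domain: domain.co.uk
print("User:", user)    # User: foo
```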
I think you're out of luck here. Net::Domain::TLD will give you a list of TLDs, but that's not actually what you want.
As I understand it, given an email address like user@sub.domain.com, you want to get domain.com. The TLD here is "com" and you want the TLD and the section of the domain that comes before it. That's easy.
And then there's user@sub.domain.co.uk. Here the TLD is "uk". But here you don't want the TLD and the section of the domain that precedes it; you want two sections before the TLD.
So perhaps you need a heuristic. If the TLD is three letters long, take the previous section of the domain, and if the TLD is two letters long, take the previous two sections.
But that doesn't work either. Not all ccTLDs have defined subdomains like .uk does. Take, for example, the popular .tv ccTLD. They allow you to register a domain directly under the ccTLD.
So you don't just need a list of TLDs. You also need to understand the rules that each of the TLDs apply to registrations. And they could change over time. And new TLDs are being introduced - you'd need to keep up with all of those.
Oh, and one last point. Even big ccTLDs like .uk don't always follow their own rules. There are a few .uk domains that don't have a top-level subdomain: british-library.uk, for example.
You might be able to implement this for a sub-set of domains that you're particularly interested in. But a full solution would be incredibly complex and almost impossible to keep up to date.

Regex for checking a body of text for a URL?

I have a regex pattern for URL's that I use to check for links in a body of text. The only problem is that the pattern will match this link
stackoverflow.com
And this sentence
I'm a sentence.Next Sentence.
Obviously this makes sense, because my pattern doesn't strictly check for .com, .co.uk, .com.au, etc.
I want it to match stackoverflow.com and not the latter.
As I'm no Regex expert, does anyone know of any good Regex patterns for checking for all types of URL's in a body text, while not matching the sentences like above?
If I have to strictly check the domain extension, I suppose I'll have to settle for that.
Here's my pattern, though I don't think it will help.
(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
I would definitely suggest finding a working regex that someone else has made (which would probably include a strong check on the domain extension), but here is one possible way to just modify your existing regex.
It requires the assumption that links usually do not mix case in the domain extension: you might see .COM or .com, but probably not .Com. If you only match domain extensions that don't mix case, you avoid matching most sentences.
In the middle of your regex you have [\w]{2,4}, try changing this to ([A-Z]{2,4}|[a-z]{2,4}) (or (?:[A-Z]{2,4}|[a-z]{2,4}) if you don't want a new captured group).
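The effect of the single-case extension check can be tried with a much simplified stand-in for the pattern. The Python sketch below is not the asker's full regex; it keeps only dotted labels followed by an extension that must be all upper case or all lower case, and find_links is a hypothetical helper name.

```python
import re

# Simplified stand-in for the question's pattern: one or more dotted labels,
# then an extension that is entirely upper case or entirely lower case.
LINK = re.compile(
    r"\b(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+"
    r"(?:[A-Z]{2,4}|[a-z]{2,4})\b"
)


def find_links(text):
    """Return every substring of text that looks like a bare domain."""
    return [m.group(0) for m in LINK.finditer(text)]


print(find_links("Visit stackoverflow.com today"))  # ['stackoverflow.com']
print(find_links("I'm a sentence.Next Sentence."))  # []
```

A sentence typed without a space and without a capital, such as "one.two", would still match, which is why this only avoids most sentences.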

regex to match only .gov tlds

I am trying to write a regex to grab an entire url of any .gov or .edu web address to make it into a link.
I currently have:
/(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/
all in () so i can regurgitate it for any url, but I only want .gov or .edu ones.
Thanks in advance.
[-A-Z0-9+&@#\/%?=~_|!:,.;]* appears to be slurping up most of the url, so we need to jam the .gov and .edu in here somewhere. The quickest solution would be:
[-A-Z0-9+&@#\/%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*
However, this will match a url like: http://www.example.com/evil.gov/test.html
To fix this, we can take out the / that it is matching before the top level domain:
[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*
Or, in closing, we have:
/(\b(https?|ftp):\/\/[-A-Z0-9+&@#%?=~_|!:,.;]+(\.gov|\.edu)[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]?)/
Due to the problem that it doesn't match example.gov, I added a ? to the last token.
Damn that is ugly.
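A quick way to check the final pattern is to run it over a few sample URLs. The Python sketch below assumes case-insensitive matching (the character classes list only upper-case letters) and uses a hypothetical helper name, find_gov_edu_links.

```python
import re

# The final pattern from the answer, with non-capturing groups and
# case-insensitive matching (an assumption, since the classes use A-Z only).
GOV_EDU_URL = re.compile(
    r"\b(?:https?|ftp)://[-A-Z0-9+&@#%?=~_|!:,.;]+(?:\.gov|\.edu)"
    r"[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]?",
    re.IGNORECASE,
)


def find_gov_edu_links(text):
    """Return every URL in text whose host part contains .gov or .edu."""
    return [m.group(0) for m in GOV_EDU_URL.finditer(text)]


sample = (
    "See http://www.example.gov/page.html "
    "and http://www.example.com/evil.gov/test.html"
)
print(find_gov_edu_links(sample))  # ['http://www.example.gov/page.html']
print(find_gov_edu_links("http://example.gov"))  # ['http://example.gov']
```

The pattern only requires .gov or .edu to appear somewhere before the first slash, so a host such as example.gov.evil.com would also match; it is not a strict check that the URL ends in one of those TLDs.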