Extracting top-level and second-level domain from a URL using regex

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?

Here's my idea,
Match anything that isn't a dot, three times, from the end of the line using the $ anchor.
The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.
Both the last and second last matches will only match 2-3 characters, so that it doesn't confuse it with a second-level domain name.
Regex:
[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
Demonstration:
Regex101 Example
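To see how this pattern behaves, here is a minimal JavaScript sketch. The helper name extractDomain and the sample hostnames are made up for illustration, and it assumes you pass a bare hostname rather than a full URL. It also shows two limitations: a three-letter second-level name such as foo gets read as part of the suffix, and TLDs longer than three characters don't match at all.

```javascript
// Pattern from the answer above, applied to a bare hostname (not a full URL).
const DOMAIN_RE = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function extractDomain(hostname) {
  const m = hostname.match(DOMAIN_RE);
  return m ? m[0] : null;
}

console.log(extractDomain('www.example.com'));   // example.com
console.log(extractDomain('www.example.co.uk')); // example.co.uk
console.log(extractDomain('sub.foo.com'));       // sub.foo.com (foo is read as part of the suffix)
console.log(extractDomain('example.travel'));    // null (TLD longer than three characters)
```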

Updated 2019
This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.
The only way to reliably get the primary host is to consult a source that knows about them, like the Public Suffix List.
There are several open-source libraries out there that you can use, like psl, or you can write your own.
Usage for psl is quite intuitive. From their docs:
var psl = require('psl');
// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null
// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'
// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
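If you go the "write your own" route, the core idea is a longest-suffix lookup. The following is only a rough sketch: SAMPLE_SUFFIXES is a tiny hand-picked set (not the real list), registrableDomain is a hypothetical helper name, and the wildcard and exception rules of the real Public Suffix List are ignored.

```javascript
// Tiny hand-picked sample for illustration only. The real Public Suffix List
// has thousands of entries plus wildcard and exception rules ignored here.
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk', 'org.uk', 'au', 'com.au']);

// Hypothetical helper: returns the registrable domain (suffix plus one label),
// or null when the hostname is itself a suffix or its suffix is not in the sample.
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Walk from the longest candidate suffix to the shortest, so 'co.uk' wins over 'uk'.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.has(candidate)) {
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(registrableDomain('www.google.com'));    // google.com
console.log(registrableDomain('a.b.example.co.uk')); // example.co.uk
console.log(registrableDomain('co.uk'));             // null
```

In practice you would load the suffixes from the published list instead of hard-coding them, which is what libraries like psl do for you.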
Old answer
You could use this:
(\w+\.\w+)$
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.
Example: http://regex101.com/r/wD8eP2

Also, you can likely do that with some expression similar to,
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
and add as many capturing groups as you need to capture the components of a URL.
Demo
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also follow the link to see how it matches against some sample inputs.

For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
This matches a period followed by two or three word characters and then a word boundary.
Here's some example outputs:
'example.aus.com' // .aus.com
'example.austin.com' // .com (".austin" is longer than three characters, so it is skipped)
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy' // .co.uk
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.
Edit
I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
'example.aus.com'.match(/\.\w*\b/g).join('')
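One caveat with both one-liners: match returns null when nothing matches, so calling .join on the result throws for input like 'localhost'. A small wrapper avoids that. This is just a sketch, and the name domainSuffix is made up:

```javascript
// Hypothetical wrapper around the one-liner above that tolerates "no match".
function domainSuffix(host) {
  const parts = host.match(/\.\w*\b/g);
  // match() returns null when there is no dot-prefixed label at all.
  return parts ? parts.join('') : '';
}

console.log(domainSuffix('example.austin.com'));  // .austin.com
console.log(domainSuffix('example.co.uk/howdy')); // .co.uk
console.log(domainSuffix('localhost'));           // (empty string)
```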

Since TLDs now include names with more than three characters, like .wang and .travel, here's a regex that handles these newer TLDs:
([^.\s]+\.[^.\s]+)$
Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.
http://regexr.com/3bmb3
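A quick way to try this in JavaScript (the helper name lastTwoLabels is just for illustration). Note the trade-off: long TLDs now work, but two-part suffixes such as .co.uk come back without the registered name.

```javascript
// Pattern from the answer above: the last two dot-separated labels.
const LAST_TWO = /([^.\s]+\.[^.\s]+)$/;

// Hypothetical helper: returns capture group 1, or null when there is no dot.
function lastTwoLabels(hostname) {
  const m = hostname.match(LAST_TWO);
  return m ? m[1] : null;
}

console.log(lastTwoLabels('www.example.travel')); // example.travel
console.log(lastTwoLabels('www.example.co.uk'));  // co.uk
```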

With capturing groups you can achieve some magic.
For example, consider the following JavaScript:
let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');
document.write(domain);
This will result in a string containing 'else.be'. This is because the regex matches the complete string and the capturing group is mapped to $1. So it replaces the complete string 'test.something.else.be' with '$1', which is 'else.be'.
The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
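The "how many levels deep" idea could look something like this. It is only a sketch: the function name lastLabels and its levels parameter are my own, and it assumes levels is an integer of at least 1.

```javascript
// Hypothetical helper: keep only the last `levels` labels of a hostname.
function lastLabels(hostname, levels) {
  // (levels - 1) "label." groups followed by one final label, anchored at the end.
  const re = new RegExp('(?:[^.]+\\.){' + (levels - 1) + '}[^.]+$');
  const m = hostname.match(re);
  // Fall back to the whole hostname when it has fewer labels than requested.
  return m ? m[0] : hostname;
}

console.log(lastLabels('test.something.else.be', 2)); // else.be
console.log(lastLabels('test.something.else.be', 3)); // something.else.be
```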

If you want to allow only specific top-level domain names, you can write a regular expression like this:
[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]
You can also add more domain names from this link:
https://www.icann.org/resources/pages/tlds-2012-02-25-en

The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:
(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|
It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.

If you need to be more specific:
/\.(?:nl|se|no|es|ru|fr|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/
Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/
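One thing to watch with an alternation like this: without an end anchor it can match in the middle of a hostname. Here is a small sketch that builds the pattern from a list and compares both forms. The TLDS array is a shortened, made-up sample, and anchoring with $ assumes you are testing a bare hostname.

```javascript
// Hypothetical short list; extend it with whichever TLDs matter for your data.
const TLDS = ['com', 'org', 'net', 'uk', 'de', 'io'];

// Same shape as the regex above, once without and once with an end anchor.
const unanchored = new RegExp('\\.(?:' + TLDS.join('|') + ')');
const anchored = new RegExp('\\.(?:' + TLDS.join('|') + ')$', 'i');

console.log(unanchored.test('www.company.example')); // true: ".com" is found inside ".company"
console.log(anchored.test('www.company.example'));   // false
console.log(anchored.test('example.co.uk'));         // true
```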

Related

JMeter extract link using regular expression pass into next request with blank values

This is how I have Test Plan set up:
HTTP Request -> Regular Expression Extractor to extract multiple links - This is extracting correctly -- But some of the links are Blank
RegularExpressionExtractor --- <a href="(.*)" class="product-link">
BeanShell Sampler - to filter blank or null values -- This works fine
BeanShell Sampler
log.info("Enter Beanshell Sampler");
matches = vars.get("url_matchNr");
log.info(matches);
for (Integer i=1; i < Integer.parseInt(matches); i++)
{
    String url = vars.get("url_"+i);
    //log.info(url1);
    if(url != null && url.length() > 0)
    {
        log.info(i+"->" + url);
        //return url;
        //vars.put("url2", url);
        vars.put("url2", url);
        //props.put("url2", url);
        log.info("URL2:" + vars.get("url2"));
    }
}
The problem I am facing is that the ForEach Controller runs through all the values, including blank or null ones. How can I run the loop only for the non-blank, non-null values?

You should change your regular expression to exclude empty values.
Instead of matching any value, including an empty one, with the * quantifier:
<a href="(.*)" class="product-link">
match only non-empty strings with the + quantifier:
<a href="(.+)" class="product-link">
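The difference is easy to check outside JMeter. The sketch below uses JavaScript rather than JMeter's own regex engine, and the sample markup and the extractLinks helper are invented for the demonstration, but the * versus + behaviour is the same:

```javascript
// Hypothetical response body with one empty href in the middle.
const html = [
  '<a href="/product/1" class="product-link">',
  '<a href="" class="product-link">',
  '<a href="/product/2" class="product-link">',
].join('\n');

// Collect capture group 1 from every match of the given pattern.
function extractLinks(pattern) {
  return [...html.matchAll(pattern)].map((m) => m[1]);
}

const withStar = extractLinks(/<a href="(.*)" class="product-link">/g);
const withPlus = extractLinks(/<a href="(.+)" class="product-link">/g);

console.log(withStar); // [ '/product/1', '', '/product/2' ]
console.log(withPlus); // [ '/product/1', '/product/2' ]
```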

As mentioned earlier, you should change your regex!
you can replace it directly by
<a href="(.+)" class="product-link">
or by something more constraining like this:
<a href="((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)" class="product-link">
which is a regex to match only URLs.
https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149
The first capturing group is all optional. It allows the URL to begin
with "http://", "https://", or neither of them. I have a question mark
after the s to allow URL's that have http or https. In order to make
this entire group optional, I just added a question mark to the end of
it.
Next is the domain name: one or more numbers, letters, dots, or hyphens
followed by another dot then two to six letters or dots. The following
section is the optional files and directories. Inside the group, we
want to match any number of forward slashes, letters, numbers,
underscores, spaces, dots, or hyphens. Then we say that this group can
be matched as many times as we want. Pretty much this allows multiple
directories to be matched along with a file at the end. I have used
the star instead of the question mark because the star says zero or
more, not zero or one. If a question mark was to be used there, only
one file/directory would be able to be matched.
Then a trailing slash is matched, but it can be optional. Finally we
end with the end of the line.
String that matches:
http://net.tutsplus.com/about
String that doesn't match:
http://google.com/some/file!.html (contains an exclamation point)
Good luck!!!

ForEach controller doesn't work with JMeter Properties, you need to change the "Input Variable Prefix" to url_2 and your test should start working as expected.
Also be aware that since JMeter 3.1 it is recommended to use Groovy language for any form of scripting so consider migrating to JSR223 Sampler and Groovy language on next available opportunity.
Groovy has much better performance while Beanshell might become a bottleneck when it comes to immense loads.

Regex, how to match all urls but one?

My regex skills are minimal, and I have been trying for a while now to get this to work:
I need to match all URLs in one domain except one (the login one).
Example:
Match: domain.com/ANYTHING-GOES-HERE
but
Not Match: domain.com/login
I don't actually need to match the domain.com part because that's always the same, only what comes after it.
I have tried:
(?!\/login)\/.*
\/.*[^login]
Neither one seems to work as desired.
Update:
I should have explained that this is done in PHP. I don't have control over the actual code that runs the regex, but I do have control over how many regexes I can have. So I could have one regex that matches everything, and then another regex that matches (or excludes) "/login".

You're almost there:
// javascript
r = /domain\.com\/(?!login).+/
r.test("domain.com/ANYTHING-GOES-HERE") // true
r.test("domain.com/login") // false
This also rejects "domain.com/login/foobar". If you want it to be accepted, modify the regex to be
r = /domain\.com\/(?!login$).+/
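Since the question only needs to match the part after domain.com, the same lookahead can be applied to the path alone. The sketch below compares the two variants above in that form, plus a third one that is my own addition (not from the answer) for rejecting /login and everything beneath it while still accepting paths like /logins:

```javascript
// Path-only versions of the two patterns above.
const rejectPrefix = /^\/(?!login).+/;   // rejects every path starting with "login"
const rejectExact = /^\/(?!login$).+/;   // rejects only "/login" itself
// Assumed extra variant: rejects "/login" and "/login/...", accepts "/logins".
const rejectSegment = /^\/(?!login(?:\/|$)).+/;

console.log(rejectPrefix.test('/ANYTHING-GOES-HERE')); // true
console.log(rejectPrefix.test('/logins'));             // false
console.log(rejectExact.test('/login'));               // false
console.log(rejectExact.test('/login/foobar'));        // true
console.log(rejectSegment.test('/login/foobar'));      // false
console.log(rejectSegment.test('/logins'));            // true
```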

RegEx pattern to handle URL with dates

I moved to a new website and it mangled my URLs. Now blog posts are accessible from multiple URLs, and I would like to redirect one pattern to the other.
I am trying to redirect the first case to the second case:
~/blogs/johndoe/john-doe/2014/03/14/test-article1 =>
~/blogs/john-doe/2014/03/14/test-article1
~/blogs/jimjones/jim-jones/2014/03/14/test-articleb =>
~/blogs/jim-jones/2014/03/14/test-articleb
How do I create a pattern smart enough to slice out the first "johndoe" and "jimjones"? I am using this for IIS rewrite but I think any RegEx should work. Thanks for any help.

This works:
^~/blogs/\w+/(\w+)-(\w+)/(\d{4})/(\d\d)/(\d\d)/([\w-]+)$
Debuggex Demo
It just discards the non-dash name. It doesn't know whether it's equal to the dash name or not. It also assumes that the date numbers are valid: 9899/45/33 would be matched.
Capture groups:
First name
Last name
Year
Month
Day
Article name
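To show how the capture groups would be put back together, here is a small JavaScript sketch (the function name rewriteBlogUrl is made up, and JavaScript's $1 to $6 replacement syntax stands in for whatever back-reference syntax your rewrite engine uses). URLs that are already in the target form don't match and are left alone.

```javascript
// Same pattern as above, written as a JavaScript literal (slashes escaped).
const BLOG_RE = /^~\/blogs\/\w+\/(\w+)-(\w+)\/(\d{4})\/(\d\d)\/(\d\d)\/([\w-]+)$/;

// Hypothetical helper: drops the non-dash name and rebuilds the rest from the groups.
function rewriteBlogUrl(url) {
  return url.replace(BLOG_RE, '~/blogs/$1-$2/$3/$4/$5/$6');
}

console.log(rewriteBlogUrl('~/blogs/johndoe/john-doe/2014/03/14/test-article1'));
// ~/blogs/john-doe/2014/03/14/test-article1
console.log(rewriteBlogUrl('~/blogs/john-doe/2014/03/14/test-article1'));
// ~/blogs/john-doe/2014/03/14/test-article1 (unchanged)
```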

I don't know about IIS rewrites, but this should work:
/^~\/blogs\/[a-z]+\// -> ~/blogs/
The regular expression will match the start of a string, followed by ~/blogs/, followed by a run of lowercase letters and a trailing slash.

I don't use IIS, but this should be at least close.
Pattern:
^blogs/\w+/(\w+/)
Action
blogs/{R:1}
Handy usage doc

Regex to match anything after /

I basically don't have a clue about regex, but I need a regex that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.

To do what you ask you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
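You can confirm the group numbering in any regex engine. In JavaScript, for instance, the array returned by match holds group 0 at index 0 and group 1 at index 1:

```javascript
// Group 0 is the whole match; group 1 is the part inside the parentheses.
const m = 'aaaaaaaabbbbcccc'.match(/a*(b*)/);

console.log(m[0]); // aaaaaaaabbbb
console.log(m[1]); // bbbb
```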
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last / you can take another approach:
possible other solution: /([^/]*)$
The pattern above will find a / followed by a string of characters that are NOT another /, up to the end of the string, and keep that string in group 1. The issue here is that you could match things that do not have sweets in the URL.
Note: if you do not mind the / at the beginning, then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex. It marks the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this I would make sure you do not match on the literal "localhost" part. Also, I am assuming (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match an optional http:// part followed by a host (be it localhost, an IP or the host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
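Put together, the match-then-replace idea could be sketched like this in JavaScript. The function name redirectTarget is made up, the leading ^ anchor is my own addition, and the target URL is the one from the question; in WordPress you would wire the same logic into whatever redirect mechanism you use.

```javascript
// Pattern from above with an added ^ anchor (an assumption, to avoid mid-string matches).
const SWEETS_RE = /^(http:\/\/)?[^\/]+\/sweettemptations\/sweets\/.*/;
const TARGET = 'http://localhost/sweettemptations/available-sweets';

// Hypothetical helper: returns the redirect target for sweets URLs, else the URL unchanged.
function redirectTarget(url) {
  return SWEETS_RE.test(url) ? TARGET : url;
}

console.log(redirectTarget('http://localhost/sweettemptations/sweets/somethingmore.html'));
// http://localhost/sweettemptations/available-sweets
console.log(redirectTarget('http://localhost/sweettemptations/available-sweets'));
// http://localhost/sweettemptations/available-sweets (no redirect needed)
```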

use this regular expression (?<=://).+

This regex matches and shouldn't. Why is it?

This regex:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$
Formatted for readability:
^( # Begin regex / begin address clause
(https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
( # container for two address formats, more to come later
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
)|( # delimiter for address formats
((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
(\.) #dot for .com
([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$
is matching:
www.google
and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.
Keep in mind I will fix the following issues separately - this question is about why it matches www.google and shouldn't.
Querystring needs to support proper formats only, currently accepts any combination of querystring characters
Several protocols not supported, though the scope of my requirements may not include them
uncommon TLDs with 3 characters not included
Probably matches http://www.google..com - will check for consecutive dots
Doesn't support decimal IP address formats
What's wrong with my regex?
edit: See also a previous problem with an earlier version of this regex on a different test case:
How can I make this regex match correctly?
edit2: Fixed - The corrected regex (as asked) is:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$

"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.

Your TLD clause matches "go" in google and the querystring support part matches "ogle" afterwards. Try changing the querystring part to this:
([?/][a-zA-Z0-9\?\=\&\%\/]*)?

"google" doesn't fit into the "[a-z]{2}" clause.
But "go" does and then "ogle" matches "([a-zA-Z0-9\?\=\&\%/]*)?"
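To make the go + ogle split visible, here is a cut-down reproduction in JavaScript. It keeps only the domain branch of the original regex and a shortened TLD list (both simplifications are mine), then compares it with the variant that requires a / or ? before any trailing part:

```javascript
// Reduced version of the domain branch, with a shortened TLD list for brevity.
const before = /^(([a-zA-Z0-9]+\.?)+?)\.([a-z]{2}|com|org|net)([a-zA-Z0-9?=&%\/]*)?$/;
// Same pattern, but the trailing part must start with "/" or "?".
const after = /^(([a-zA-Z0-9]+\.?)+?)\.([a-z]{2}|com|org|net)([?\/][a-zA-Z0-9?=&%\/]*)?$/;

const m = 'www.google'.match(before);
console.log(m[3], m[4]);                              // go ogle
console.log(after.test('www.google'));                // false
console.log(after.test('www.google.com'));            // true
console.log(after.test('www.google.com/search?q=1')); // true
```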
