Regex Base URL Grabbing - regex

I'm trying to filter a bunch of URLs down to their base URL, which doesn't include the www or any other prefix. I'm having trouble writing an expression to capture it, and because some TLDs have more than one part (such as co.uk), it becomes a rather more complicated issue.
answers.yahoo.com => yahoo.com
www.google.com => google.com
uk.answers.yahoo.co.uk => yahoo.co.uk
www.g.se => g.se
Any suggestions?
I was using this expression, but it fails when the domain name is no more than 2 characters long or when the TLD is shorter than 2 characters.
(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$

How do you know that the base of uk.answers.yahoo.co.uk is yahoo.co.uk, but the base of, for example, foo.bar.maps.google.com isn't maps.google.com?

[^\.]*\.(?:co\.uk|\w{2,3})$
You'll need to add the known multi-part domains (such as co.uk) to the regex.
http://regexr.com?30p4r
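A minimal Python sketch of the same idea, assuming a small hand-maintained list of multi-part suffixes. The names MULTI_PART_SUFFIXES, build_pattern and base_domain are hypothetical, the list is deliberately incomplete, and the fallback [a-z]{2,} is a looser stand-in for the \w{2,3} above:

```python
import re

# Hypothetical, incomplete list of multi-part suffixes; extend as needed.
MULTI_PART_SUFFIXES = ["co.uk", "com.au"]


def build_pattern(suffixes):
    """Build a regex that captures one label plus a known suffix or a plain TLD."""
    known = "|".join(re.escape(s) for s in suffixes)
    # Known multi-part suffixes are tried first, then any alphabetic TLD.
    return re.compile(r"([^.]+\.(?:" + known + r"|[a-z]{2,}))$", re.IGNORECASE)


BASE_DOMAIN = build_pattern(MULTI_PART_SUFFIXES)


def base_domain(host):
    """Return the base domain of a bare host name, or None if nothing matches."""
    match = BASE_DOMAIN.search(host)
    return match.group(1) if match else None


for host in ["answers.yahoo.com", "www.google.com",
             "uk.answers.yahoo.co.uk", "www.g.se"]:
    print(host, "=>", base_domain(host))
```

Any suffix missing from the list falls back to the last two labels, so the result is only as good as the list.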

Related

Dart extract host from URL string

Supposing that I have the following URL as a String;
String urlSource = 'https://www.wikipedia.org/';
I want to extract the main page name from this URL string, 'wikipedia', removing the https://, www, and .com or .org parts.
What is the best way to extract this? In case of RegExp, what regular expression do I have to use?
You do not need to make use of RegExp in this case.
Dart has a premade class for parsing URLs:
Uri
What you want to achieve is quite simple using that API:
final urlSource = 'https://www.wikipedia.org/';
final uri = Uri.parse(urlSource);
uri.host; // www.wikipedia.org
The Uri.host property will give you www.wikipedia.org. From there, you should easily be able to extract wikipedia.
Uri.host will also remove the whole path, i.e. anything after the / after the host.
Extracting the second-level domain
If you want to get the second-level domain, i.e. wikipedia from the host, you could just do uri.host.split('.')[uri.host.split('.').length - 2].
However, note that this is not fail-safe because you might have subdomains or not (e.g. www) and the top-level domain might also be made up of multiple parts. For example, co.uk uses co as the second-level domain.
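For comparison, here is a rough Python analogue of the same approach (not Dart), using the standard library's urllib.parse. The helper name second_level_label is hypothetical, and the sketch shares the caveat described above:

```python
from urllib.parse import urlsplit


def second_level_label(url):
    """Naively return the second-to-last label of the URL's host, or None."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else None


print(second_level_label("https://www.wikipedia.org/"))  # wikipedia
print(second_level_label("https://www.bbc.co.uk/"))      # co
```

The second call shows the failure mode: for a host under co.uk the naive index picks co rather than the registered name.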

Regex to match anything after /

I'm basically clueless about regex, but I need a regex that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone, and one page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site), which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
Possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
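As an illustration, the group numbering described above can be checked with Python's re module. This is only a sketch; the variable names are hypothetical and the URL is the one from the question:

```python
import re

# Group 0 is the whole match, group 1 is the first parenthesised part.
demo = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")
print(demo.group(0))  # aaaaaaaabbbb
print(demo.group(1))  # bbbb

# The same idea applied to the sweets URL: group 1 holds what follows /sweets/.
url = "http://localhost/sweettemptations/sweets/sweet-name"
sweet = re.search(r"/sweets/(.*)", url)
print(sweet.group(1))  # sweet-name
```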
If what you really want is to match anything after the last /, you can take another approach:
Possible other solution: /([^/]*)$
The pattern above will find the last / followed by a string of characters that are NOT another /, and keep that string in group 1. The issue here is that you could match URLs that do not have sweets in them.
Note: if you do not mind the / at the beginning, just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex. It will mark the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would avoid hard-coding the "localhost" part in the pattern. Also, I am assuming the (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match an optional http:// followed by a host (be it localhost, an IP or a host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
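The match-then-replace approach can be sketched in Python as follows. The names SWEETS, TARGET and redirect_target are hypothetical, and in a real WordPress site the redirect would be done in the server or plugin configuration rather than in Python:

```python
import re

# Pattern from the answer: optional http://, any host, then the sweets path.
SWEETS = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")
TARGET = "http://localhost/sweettemptations/available-sweets"


def redirect_target(url):
    """Return the fixed target if the URL is under /sweets/, else the URL unchanged."""
    return TARGET if SWEETS.fullmatch(url) else url


print(redirect_target("http://localhost/sweettemptations/sweets/somethingmore.html"))
print(redirect_target("http://localhost/sweettemptations/available-sweets"))
```

Because the pattern requires the trailing / after sweets, the bare (http://)localhost/sweettemptations/sweets URL is not matched here and would need its own rule.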
Use this regular expression: (?<=://).+

RegEx : Grab domain after sub domain (if there is one) from URL

This is my problem: http://regexr.com?2temn
I'm sure it's pretty simple for some of you regex masters.
Cheers!
This also works:
(?<=\.|)\w+\.\w+$
Tested only with PHP.
"Grab domain after (possible) sub domain" is in fact the same as "grab domain before top level domain": it just gets the domain name from a URL. Possible duplicate.
Try with:
(\w+\.\w+)[\r\n]+
It matches a string containing a dot, immediately before a newline character.
Regex (generic form) :
/^(?:https?://)?(?:([\w_.-]+?).)*[\w_-]+\.\w+.+$/i
Test :
http://subdomain.domain.tld/foo/bar.html => One match (subdomain)
http://subdomain.subdomain2.domain.tld/bar => Two submatches (subdomain, subdomain2)
http://justdomain.tld => NO match
Tested with C#.
C# version of the regex :
^(?:http://)?(?:([\w+_.-]+?)\.)*[\w+_-]+\.\w+.+$
I adjusted your version slightly:
(?:\.|)\w+(\.\w+){1,}
I just made the trailing ".xyz" part be a separate token to loop one or many times ("{1,}").

Regex to match all valid links

Regarding this: http://stackoverflow.uservoice.com/pages/general/suggestions/103227-parser-does-not-match-all-valid-urls is this regex adequate, or does it need to be refined? If so, how?
\b(?P<link>(?:.*?://)[\w\-\_\.\#\:\/\?\#\=]*)\b
Even though the question is vague, I'll attempt to respond with possible solutions.
Possible Intention 1: To match any URLs in a given file (for replacement):
/^([^:]+):\/\/([-\w._]+)(\/[-\w._]\?(.+)?)?$/ig
The above should match nearly all URL formats, with the following captured groups:
0 => entire match
1 => protocol (eg. http, ftp, git, ...)
2 => hostname (eg. www.stackoverflow.com)
3 => requested_file_path (eg. /images/prod/1/4/success.gif)
4 => query_string (eg. param=1&param2=2&param3=3)
Possible Intention 2: To get details about the current request url
In order to get details about the URL such as the protocol, hostname, requested file path, and query string, you're better off using language/object methods to gather the results. In PHP you can get all of the above information using function calls:
$protocol = $_SERVER['SERVER_PROTOCOL']; // HTTP/1.0
$host = $_SERVER['HTTP_HOST']; // www.stackoverflow.com
$path_to_file = dirname($_SERVER['SCRIPT_NAME']);
$file = basename($_SERVER['SCRIPT_NAME']);
$query_string = $_SERVER['QUERY_STRING'];
Hope this helps in any way.
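The same "let the language do it" advice applies when the URL is an arbitrary string rather than the current request. As a sketch in Python (not PHP), the standard library's urllib.parse.urlsplit exposes the same four pieces; the example URL is assembled from the group descriptions above:

```python
from urllib.parse import urlsplit

url = ("http://www.stackoverflow.com/images/prod/1/4/success.gif"
       "?param=1&param2=2&param3=3")
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.stackoverflow.com
print(parts.path)      # /images/prod/1/4/success.gif
print(parts.query)     # param=1&param2=2&param3=3
```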
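I guess SO blocks comments after a while? localshred's answer is great, except for a missing wildcard and unescaped periods:
/^([^:]+):\/\/([-\w\._]+)(\/[-\w\._]*\?(.+)?)?$/ig
The * after the second [-\w\._] is the added wildcard, and the periods in both character classes are now escaped, since we don't want them to match everything. (Inside a character class a period is already literal, so the escaping is harmless but not strictly required.)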
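To see what the added wildcard changes, the two patterns can be compared in Python. This is a sketch only: the JavaScript-style /.../ig delimiters are dropped, re.IGNORECASE stands in for the i flag, and the names ORIGINAL, CORRECTED and the test URL are hypothetical:

```python
import re

# First answer's pattern: exactly one path character is allowed before the "?".
ORIGINAL = re.compile(r"^([^:]+):\/\/([-\w._]+)(\/[-\w._]\?(.+)?)?$", re.IGNORECASE)
# Corrected pattern: "*" allows any number of path characters before the "?".
CORRECTED = re.compile(r"^([^:]+):\/\/([-\w\._]+)(\/[-\w\._]*\?(.+)?)?$", re.IGNORECASE)

url = "http://www.stackoverflow.com/index.php?param=1&param2=2"

print(ORIGINAL.match(url))            # None
print(CORRECTED.match(url).groups())
```

Even the corrected pattern still expects a single path segment followed by a ?, so a URL such as http://example.com/a/b?x=1 is rejected by both.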

This regex matches and shouldn't. Why is it?

This regex:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$
Formatted for readability:
^( # Begin regex / begin address clause
(https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
( # container for two address formats, more to come later
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
)|( # delimiter for address formats
((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
(\.) #dot for .com
([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$
is matching:
www.google
and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.
Keep in mind I will fix the following issues separately - this question is about why it matches www.google and shouldn't.
Querystring needs to support proper formats only, currently accepts any combination of querystring characters
Several protocols not supported, though the scope of my requirements may not include them
uncommon TLDs with 3 characters not included
Probably matches http://www.google..com - will check for consecutive dots
Doesn't support decimal IP address formats
What's wrong with my regex?
edit: See also a previous problem with an earlier version of this regex on a different test case:
How can I make this regex match correctly?
edit2: Fixed - The corrected regex (as asked) is:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$
"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.
Your TLD clause matches "go" in google and the querystring support part matches "ogle" afterwards. Try changing the querystring part to this:
([?/][a-zA-Z0-9\?\=\&\%\/]*)?
"google" doesn't fit into the "[a-z]{2}" clause.
But "go" does and then "ogle" matches "([a-zA-Z0-9\?\=\&\%/]*)?"
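The explanation above can be reproduced with a cut-down version of the pattern. This Python sketch keeps only the domain, TLD and trailing clauses (the protocol and IP clauses are omitted and the TLD list is shortened); DOMAIN, LOOSE and STRICT are hypothetical names:

```python
import re

# Simplified domain + TLD clauses from the question's regex.
DOMAIN = r"((?:[a-zA-Z0-9]+\.?)+?)\.([a-z]{2}|com|org|net)"

# Original trailing clause: any of these characters may directly follow the TLD.
LOOSE = re.compile("^" + DOMAIN + r"([a-zA-Z0-9?=&%/]*)?$")
# Suggested fix: the trailing part must start with "?" or "/".
STRICT = re.compile("^" + DOMAIN + r"([?/][a-zA-Z0-9?=&%/]*)?$")

loose = LOOSE.match("www.google")
print(loose.groups())               # ('www', 'go', 'ogle')
print(STRICT.match("www.google"))   # None
print(STRICT.match("www.google.com/search?q=1").group(2))  # com
```

With the loose clause the engine backtracks until "go" serves as the TLD and "ogle" is absorbed by the trailing class; requiring a leading ? or / removes that escape route.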