WordPress PHP Redirect with Regex Wildcard


This is for a project. I need to redirect certain incoming URL requests to files hosted on the same webserver. I cannot use .htaccess and I cannot rely on any plugins.
I managed to do this with a plugin but I am having a hard time extracting the necessary code in order to write my own plugin which does what I need to do hardcoded.
WordPress Multisite
All software up to date
"Redirection" WordPress Plugin (successfully)
writing custom WP functions to do this (semi-successfully)
I found some code in the Pantheon documentation:
$blog_id = get_current_blog_id();
// You can easily put a list of many 301 url redirects in this format
// Trailing slashes matters here so /old-url1 is different from /old-url1/
$redirect_targets = array(
    '/test/test.xml' => '/files/' . $blog_id . '/test.xml',
    '/regex/wildcard.xml(.*)' => '/files/' . $blog_id . '/regex.xml',
);
if ( (isset($redirect_targets[ $_SERVER['REQUEST_URI'] ] ) ) && (php_sapi_name() != "cli") ) {
    echo 'https://'. $_SERVER['HTTP_HOST'] . $redirect_targets[ $_SERVER['REQUEST_URI'] ];
    header('HTTP/1.0 301 Moved Permanently');
    header('Location: https://'. $_SERVER['HTTP_HOST'] . $redirect_targets[ $_SERVER['REQUEST_URI'] ]);
    if (extension_loaded('newrelic')) {
        newrelic_name_transaction("redirect");
    }
    exit();
}
https://example.com/test/test.xml successfully
redirects to
/files/5/test.xml
However, some incoming requests contain a query string, e.g.
https://example.com/regex/wildcard.xml?somequerystring
obviously, the redirect from
https://example.com/regex/wildcard.xml
to
/files/5/regex.xml
works fine.
However, as soon as there is a query string involved, the redirect does not work. Given that I need to do this with PHP, how can I achieve a wildcard redirect from either /regex/* or /regex/wildcard.xml* to /files/5/regex.xml?
Any help would be greatly appreciated.
Thanks,
Daniel

If you prefer to retain your code logic and just make the code work, then try this:
$blog_id = get_current_blog_id();
$redirect_targets = array(
    '/wordpress\/test\/test\.xml/i' => 'files/'.$blog_id.'/test.xml',
    '/([^\/]+)\/([^\/]+)\.xml.*/i' => 'files/'.$blog_id.'/$1.xml',
);
// Get request URI without GET arguments.
$request_uri = get_request_uri();
// Loop through redirect rules.
foreach ($redirect_targets as $pattern => $redirect) {
    // If a rule matches, build the new redirect URL.
    if ( preg_match( $pattern, $request_uri ) ) {
        $new_request_uri = preg_replace( $pattern, $redirect, $request_uri );
        $new_url = 'https://'.$_SERVER['HTTP_HOST'].$new_request_uri;
        header( 'HTTP/1.0 301 Moved Permanently' );
        header( 'Location: '.$new_url );
        if ( extension_loaded( 'newrelic' ) ) {
            newrelic_name_transaction( "redirect" );
        }
        exit();
    }
}
// Returns REQUEST_URI without GET arguments:
// for example.com/test/test.php?some=arg it returns /test/test.php
function get_request_uri () {
    return strtok( $_SERVER['REQUEST_URI'], '?' );
}
You can modify the redirect rules as you want. Each key is a normal regex pattern, and the value is its replacement.
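For comparison, the same idea (drop the query string first, then test the path against an ordered list of regex rules) can be sketched in Python. This is only an illustration under assumptions: BLOG_ID, REDIRECT_RULES and resolve_redirect are hypothetical names, the blog id 5 is taken from the question's example, and the patterns are anchored versions written for this sketch rather than the ones used above.

```python
import re

# Hypothetical blog id; in WordPress this would come from get_current_blog_id().
BLOG_ID = 5

# Ordered (pattern, target) pairs; the first matching rule wins.
REDIRECT_RULES = [
    (re.compile(r"^/test/test\.xml$", re.IGNORECASE), f"/files/{BLOG_ID}/test.xml"),
    (re.compile(r"^/regex/.*$", re.IGNORECASE), f"/files/{BLOG_ID}/regex.xml"),
]


def resolve_redirect(request_uri):
    """Return the redirect target for request_uri, or None if no rule matches."""
    # Drop the query string so "/a.xml?x=1" is matched like "/a.xml".
    path = request_uri.split("?", 1)[0]
    for pattern, target in REDIRECT_RULES:
        if pattern.match(path):
            return pattern.sub(target, path, count=1)
    return None


print(resolve_redirect("/regex/wildcard.xml?somequerystring"))
print(resolve_redirect("/other/page"))
```

The first call prints the regex.xml target even though a query string is present; the second prints None because no rule applies.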

Related

nginx remove / rewrite GET Parameters for specific URL

I want to create an nginx location to do the following
Given URL:
example.com/foo/bar/123456?ItemID=123456&aid=0&bid=0
Task:
If both numbers are the same and aid and bid are zero, then rewrite the URL to example.com/foo/bar/123456
My Try:
location ~ ^/foo/bar/(?<prid>\d+)\?ItemID=\1&aid=0&bid=0$ {
rewrite ^ /foo/bar/$prid? permanent;
}
But that doesn't work. ;)
Would be great if s.o. could give me a hint.
EDIT:
nginx seems not to match GET-Parameters by regex at all (in location line) so you have to use $args and check with if (which can be evil according to documentation).

This should work:
location ~ /foo/bar/(\d+) {
    if ($arg_ItemID = $1) { set $check I; }
    if ($arg_aid = 0) { set $check "${check}A"; }
    if ($arg_bid = 0) { set $check "${check}B"; }
    if ($check = IAB) {
        rewrite ^ /foo/bar/$arg_ItemID? permanent;
    }
}
Explanation:
nginx doesn't include the parameters in the match for rewrite. You can access the parameters by name via $arg_name.
the set of if-statements is a work-around (described here), because nginx doesn't allow multiple conditions
the ? at the end of the replacement cuts off the arguments from the original request
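The decision the chained if-statements encode can also be written out as ordinary code, which may make the condition easier to check. The following Python sketch is an illustration only: canonical_redirect is a hypothetical helper, it is not how nginx evaluates the configuration, and it anchors the path match on both ends, which the location regex above does not.

```python
import re
from urllib.parse import parse_qs


def canonical_redirect(path, query):
    """Return the parameter-free path to redirect to, or None to leave the request alone."""
    match = re.fullmatch(r"/foo/bar/(\d+)", path)
    if match is None:
        return None
    args = parse_qs(query)
    # parse_qs maps each name to a list of values; this sketch looks at the first one.
    item_id = args.get("ItemID", [None])[0]
    aid = args.get("aid", [None])[0]
    bid = args.get("bid", [None])[0]
    # All three conditions must hold, like the "IAB" marker in the config.
    if item_id == match.group(1) and aid == "0" and bid == "0":
        return path  # the query string is dropped entirely
    return None


print(canonical_redirect("/foo/bar/123456", "ItemID=123456&aid=0&bid=0"))
```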

htaccess 404 error conditional code for 301 redirect

I have paginated URLs that look like http://www.domain.com/tag/apple/page/1/
If a URL such as http://www.domain.com/tag/apple/page/*2/ doesn't exist, or page/2 doesn't exist, I need code to redirect this to a page such as http://www.domain.com/tag/apple/ which would be the main tag page.
I currently have the following code:
RewriteCond %{HTTP_HOST} !^http://www.domain.com/tag/([0-9a-zA-Z]*)/page/([0-9]*)/$
RewriteRule (.*) http://www.domain.com/tag/$1/ [R=301,L]
This code is supposed to redirect to the main tag page if the URL does not exist, but it's not working.
Does anybody have any hints or solutions on how to solve this problem?

If I understand what you're saying, you are saying that you have a list of rewritten URLs (using mod_rewrite); some of them exist, some of them don't. With the ones that don't exist, you want them to be redirected to a new page location?
The short-answer is, you can't do that within htaccess. When you're working with mod_rewrite, your rewritten page names are passed to a controller file that translates the rewritten URL to what page/content it's supposed to display.
I'm only assuming you're using PHP, and if so, most PHP frameworks (CakePHP, Drupal, LithiumPHP, etc.) can take care of this issue for you and handle custom redirects for non-existing files. If you have a custom-written application, you'll need to handle the redirect within the PHP website and not in the .htaccess file.
A very simple example of this would be:
<?php
function getTag($url) {
    if (preg_match('|/tag/([0-9a-zA-Z]*)/|', $url, $match)) {
        return $match[1];
    }
    return '';
}
function validateUrl($url) {
    if (preg_match('|/tag/([0-9a-zA-Z]*)/page/([0-9]*)/|', $url, $match)) {
        $tag = $match[1];
        $page = $match[2];
        $isValid = false; // placeholder: replace with your code that checks if it's a URL/page that exists
        return $isValid;
    }
    return false;
}
if (!validateUrl($_SERVER['REQUEST_URI'])) {
    $tag = getTag($_SERVER['REQUEST_URI']);
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: /tag/' . $tag . '/');
    die();
}
?>
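To make the "check whether the page exists" step concrete, here is a minimal Python sketch of the same decision. Everything in it is an assumption for illustration: PAGES_PER_TAG stands in for whatever lookup the application really uses, and redirect_target is a hypothetical name. Unlike the PHP above, it only redirects paginated tag URLs and leaves every other path alone.

```python
import re

# Hypothetical stand-in for a real lookup: tag -> number of existing pages.
PAGES_PER_TAG = {"apple": 3}


def redirect_target(request_path):
    """Return the tag page to redirect to, or None if no redirect is needed."""
    match = re.fullmatch(r"/tag/([0-9a-zA-Z]+)/page/([0-9]+)/", request_path)
    if match is None:
        return None  # not a paginated tag URL; nothing to decide here
    tag, page = match.group(1), int(match.group(2))
    if 1 <= page <= PAGES_PER_TAG.get(tag, 0):
        return None  # the page exists, serve it normally
    return f"/tag/{tag}/"


print(redirect_target("/tag/apple/page/9/"))
```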

joomla forbid direct access to components and modules

I activated SEO friendly URLs. Basically URLs in my app looks like following:
http://x.com/en or http://x.com/en/gallery.
From my app there is no link, let's say, on com_users. But user still can open it with one of the following URLs: http://x.com/component/users or http://x.com/?option=com_banners.
I blocked first one with this:
RewriteCond %{REQUEST_URI} /component/ [NC]
RewriteRule ^.*$ - [F,L]
How can I block the second (?option=com_users)?
I understand that this behavior could be default and expected for Joomla, but I just want to give you one example.
When I restricted all my pages to registered users only, unregistered users were still able to access components. At the same time, there is no read permission in the Joomla administration. As a result, users get the template page, or some data if it is public, e.g. articles from com_content. So the question is: how can I raise a 403 in this case or, at least, redirect to / ?
Update:
I need to block /users?view=registration, reset remind and profile. And I need to redirect any error to login page. Doesn't matter whether it is whole Joomla component or view, task etc.

I would go another way, and use rel=canonical for this.
This is a much easier/better way of doing things, as the tag will appear on all "Page Versions" and you don't need to set many case-specific rules or carry around a heavy redirect file...
This is just one plugin that will help with your canonicalization.
http://extensions.joomla.org/extensions/site-management/seo-a-metadata/meta-data/11038?qh=YTo0OntpOjA7czo5OiJjYW5vbmljYWwiO2k6MTtzOjExOiInY2Fub25pY2FsJyI7aToyO3M6MTI6IidjYW5vbmljYWwnLiI7aTozO3M6NToiY2Fub24iO30%3D

I wrote my own plugin to handle all cases and redirect to the login page (/login) in case of any inconvenience. By inconvenience I mean any direct access to any component in Joomla, or a 403, or a 404, but not a 500. For now, my application works very well, accepting only the following URLs: /login, /home, /gallery, /gallery/album/any, and a few others. Direct access is completely forbidden, so users cannot use URL params (like ?option=com_users) or the /component/ path.
This approach wouldn't work with SEO URLs turned off.
<?php // no direct access
defined( '_JEXEC' ) or die( 'Restricted access' );
jimport( 'joomla.event.plugin' );
class plgSystemComcontrol extends JPlugin {
function plgSystemComcontrol(& $subject, $config) {
parent::__construct($subject, $config);
}
function onAfterRoute() {
// get plugin parameters
$com_redirect_url = $this->params->def('com_redirect_url', 'index.php?option=com_user&view=login');
$com_debug = $this->params->def('com_debug', '0');
$com_message = $this->params->def('com_message', '');
// get option, view, task ..
$mainframe = JFactory::getApplication();
$option = JRequest::getCmd('option');
$view = JRequest::getCmd('view');
$task = JRequest::getCmd('task');
// get current URL
$uri = JFactory::getURI();
$url = $uri->toString();
$u_host = $uri->getHost();
$u_path = $uri->getPath();
$path = substr($url, strlen(JURI::root()));
// get user permissions
$groupsUserIsIn = JAccess::getGroupsByUser(JFactory::getUser()->id);
$user_type = implode(" ",$groupsUserIsIn);
$group_sum = array_sum($groupsUserIsIn);
if ($com_debug == '1') {
$mainframe->enqueueMessage('--------------------------------');
$mainframe->enqueueMessage('$option = '.$option);
$mainframe->enqueueMessage('$view = '.$view);
$mainframe->enqueueMessage('$task = '.$task);
$mainframe->enqueueMessage('$url = '.$url);
$mainframe->enqueueMessage('$path = '.$path);
}
if (strpos($path, 'administrator') === 0) {
return;
}
// set default redirect page
$redirectPage = ($group_sum > 1) ? 'index.php' : 'index.php/login';
$directAccess = strpos($path, 'component') !== false || strpos($path, 'option') !== false;
// allow login page only
if ($option == 'com_users') {
if (($view == 'login' || empty($view) || $task == 'user.login' || $task == 'user.logout') && !$directAccess) { // $view == 'default'
return;
} else {
$mainframe->redirect($redirectPage, $directAccess ? 'Direct access to components forbidden' : 'Login/logout is enabled only');
//JError::raiseError(403, JText::_('Forbidden'));
//return;
}
}
// deny direct access to components
if ($directAccess) {
$mainframe->redirect($redirectPage, 'Direct access to components forbidden');
//JError::raiseError(401, JText::_('/component/'));
}
// get usertype to see if logged-in
// $user =& JFactory::getUser();
// $user_type = $user->get('usertype');
$groupsUserIsIn = JAccess::getGroupsByUser(JFactory::getUser()->id);
$user_type = implode(" ",$groupsUserIsIn);
$group_sum = array_sum($groupsUserIsIn);
if ($group_sum > '1') {
return ;
}
//if user logged-in, then return from function
if (empty($option)) {
return;
}
$mainframe->redirect( $com_redirect_url, $com_message );
return;
}
}
?>
I hope this will help to understand how to do some custom redirects and disable direct access to the components.

Turn set of urls in to a regex pattern (optional patterns)

Using an arbitrary set of urls (eg: http://api.longurl.org/v2/services) what is the best way to turn this list into a regex?
Is this appropriate regex?
(((easyuri|eepurl|eweri)\.com)|((migre|mke|myloc)\.me)|etc...)
Can you do multiple levels of optional patterns like that?

I see different ways to accomplish this.
Use XPath and try to select a node given the current URL.
Parse the xml into a dictionary and test your current URL if it exists as a key.
Store the domains of the XML in a database, index the url field and query your current URL.
If performance is not an issue: Match the current URL against the entire XML file as text.
Perhaps there are more ideas.
Building a regex from the XML does not seem like a good idea to me, since all the other solutions appear far easier to develop.

OP'S ANSWER:
Well it turns out that this does work:
/((?:easyuri|eepurl|eweri)\.com)|((?:migre|mke|myloc)\.me)/
Run against this:
easyuri.com eepurl.comer eweri.us migre.me mke.memo myloc.em
You get this:
[0] => Array
(
[0] => easyuri.com
[1] => eepurl.com
[2] => migre.me
[3] => mke.me
)
But the easiest way would just be something like this:
/0rz\.tw|1link\.in|1url\.com|2\.gp|2big\.at|etc\.\.\./
Regex helps you complicate things more than is possible with other methods. ;P
Here's the PHP I eventually used to create the regex:
Assumes that you have cURL'd http://api.longurl.org/v2/services and converted the xml to an array called $urlShorteners like: $urlShorteners = array('0rz.tw', '1link.in', 'etc...');
foreach($urlShorteners as $url) {
    $urls[] = array_reverse(explode('.', $url));
}
foreach($urls as $url) {
    $tldKeys[array_shift($url)][] = $url;
}
foreach($tldKeys as $tld => $doms) {
    if($tld != '') {
        $subPattern = array();
        foreach($doms as $subDomain) {
            $subPattern[] = implode("\.", array_reverse($subDomain));
        }
        if (count($subPattern) > 1) $optionPattern[] = "((?:" . implode("|", $subPattern) . ")\." . $tld . ")";
        else $optionPattern[] = "(" . $subPattern[0] . "\." . $tld . ")";
    }
}
$regex = '/' . implode('|', $optionPattern) . '/';
echo $regex . "\n";
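The grouping step (collect the names that share a last label, then emit one alternation per label) can be sketched more compactly in Python. This is an illustrative sketch, not a port of the PHP above: build_shortener_regex is a hypothetical name, the domain list is the small sample from this answer, and re.escape is added so that dots and other special characters in the names are matched literally.

```python
import re
from collections import defaultdict


def build_shortener_regex(domains):
    """Group domains by their last label and build one alternation per group."""
    by_tld = defaultdict(list)
    for domain in domains:
        name, _, tld = domain.rpartition(".")
        if name and tld:  # skip entries without a dot
            by_tld[tld].append(name)
    groups = []
    for tld, names in by_tld.items():
        alternatives = "|".join(re.escape(n) for n in names)
        groups.append(f"(?:{alternatives})\\.{re.escape(tld)}")
    return re.compile("|".join(groups))


pattern = build_shortener_regex(
    ["easyuri.com", "eepurl.com", "eweri.com", "migre.me", "mke.me", "myloc.me"]
)
text = "easyuri.com eepurl.comer eweri.us migre.me mke.memo myloc.em"
print(pattern.findall(text))
```

Run against the same test string as above, this finds the same four matches: easyuri.com, eepurl.com, migre.me and mke.me.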

How to get domain name from URL

How can I fetch a domain name from a URL String?
Examples:
+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+
Related:
Matching a web address through regex

I once had to write such a regex for a company I worked for. The solution was this:
Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example so for this it is not really usable.
Join the list like the example below. A warning: Ordering is important! If org.uk would appear after uk then example.org.uk would match org instead of example.
Example regex:
([^\.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
Very fast if regex is optimally ordered
The downside of this solution is of course:
Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
Very large regex so not very readable.
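The approach above can be sketched in Python by generating the regex from a suffix list instead of writing it by hand. This is a minimal sketch under assumptions: SUFFIXES is a tiny illustrative sample rather than a complete ccTLD/gTLD list, and build_domain_regex and registrable_label are hypothetical names.

```python
import re

# Small illustrative sample; a real list would come from IANA or a similar source.
SUFFIXES = ["com", "net", "org", "uk", "co.uk", "org.uk", "ac.uk"]


def build_domain_regex(suffixes):
    # Longest suffixes first, following the ordering advice above.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Capture the single label that sits directly in front of the suffix.
    return re.compile(rf"(?:^|\.)([^.]+)\.(?:{alternation})$")


DOMAIN_RE = build_domain_regex(SUFFIXES)


def registrable_label(hostname):
    """Return the label in front of the known suffix, or None if nothing matches."""
    match = DOMAIN_RE.search(hostname.lower())
    return match.group(1) if match else None


print(registrable_label("www.example.org.uk"))
```

Generating the pattern keeps the speed of a single regex match while moving the maintenance burden from the regex to the list.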

A little late to the party, but:
const urls = [
    'www.abc.au.uk',
    'https://github.com',
    'http://github.ca',
    'https://www.google.ru',
    'http://www.google.co.uk',
    'www.yandex.com',
    'yandex.ru',
    'yandex'
]
urls.forEach(url => console.log(url.replace(/.+\/\/|www\.|\..+/g, '')))

Extracting the Domain name accurately can be quite tricky mainly because the domain extension can contain 2 parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of these. EuroDNS.com for example lists over 800 domain name extensions.
I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name. The function is as follows:
function parse_url_all($url){
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}
This simple function will work in almost every case. There are a few exceptions, but these are very rare.
To demonstrate / test this function you can use the following:
$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
$info = parse_url_all($url);
echo "<tr><td>".$url."</td><td>".$info['host'].
"</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";
The output will be as follows for the URL's listed:
As you can see, the domain name and the domain name without the extension are consistently extracted whatever the URL that is presented to the function.
I hope that this helps.

/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/

There are two ways
Using split
Then just parse that string
var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
    domain = url.split('/')[2];
} else if (url.indexOf('//') === 0) {
    domain = url.split('/')[2];
} else {
    domain = url.split('/')[0];
}
//find & remove port number
domain = domain.split(':')[0];
Using Regex
var r = /:\/\/(.[^/]+)/;
"http://stackoverflow.com/questions/5343288/get-url".match(r)[1]
=> stackoverflow.com
Hope this helps

I don't know of any libraries, but the string manipulation of domain names is easy enough.
The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk it is not always the third level; some organisations (e.g. bl.uk, jet.uk) exist at the second level).
The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
]

def get_domain(url):
    hostname = urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse('http://' + url.lower()).netloc
    # Remove the 'user:passw', 'www.' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0]
    if hostname.startswith('www.'):
        hostname = hostname[4:]
    hostname = hostname.split('.')
    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    # Here num_parts >= 3 and the last two labels form the suffix
    return '.'.join(hostname[:-2])
This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.
However it'll do the job in most cases.

It is not possible without using a TLD list to compare with, as there exist many cases like http://www.db.de/ or http://bbc.co.uk/ that will be interpreted by a regex as the domains db.de (correct) and co.uk (wrong).
But even with that you won't have success if your list does not contain SLDs, too. URLs like http://big.uk.com/ and http://www.uk.com/ would be both interpreted as uk.com (the first domain is big.uk.com).
Because of that all browsers use Mozilla's Public Suffix List:
https://en.wikipedia.org/wiki/Public_Suffix_List
You can use it in your code by importing it through this URL:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Feel free to extend my function to extract the domain name, only. It won't use regex and it is fast:
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878
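A lookup against a suffix list needs no regex at all. The Python sketch below illustrates the idea under loud assumptions: SAMPLE_SUFFIXES is a tiny hand-picked set and NOT the real Public Suffix List, registrable_domain is a hypothetical name, and the wildcard and exception rules that the real list contains are ignored.

```python
# Tiny hand-picked sample of suffixes, NOT the real Public Suffix List.
SAMPLE_SUFFIXES = {"com", "de", "uk", "co.uk", "uk.com"}


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one more label, or None."""
    labels = hostname.lower().strip(".").split(".")
    # Try the longest candidate suffix first by dropping one leading label at a time.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[i - 1:])
    return None


for host in ("www.db.de", "bbc.co.uk", "big.uk.com"):
    print(host, "->", registrable_domain(host))
```

Because the longest suffix is tried first, bbc.co.uk is kept whole instead of being cut down to co.uk, and big.uk.com is recognised as a domain of its own once uk.com is in the list.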

Basically, what you want is:
google.com -> google.com -> google
www.google.com -> google.com -> google
google.co.uk -> google.co.uk -> google
www.google.co.uk -> google.co.uk -> google
www.google.org -> google.org -> google
www.google.org.uk -> google.org.uk -> google
Optional:
www.google.com -> google.com -> www.google
images.google.com -> google.com -> images.google
mail.yahoo.co.uk -> yahoo.co.uk -> mail.yahoo
mail.yahoo.com -> yahoo.com -> mail.yahoo
www.mail.yahoo.com -> yahoo.com -> mail.yahoo
You don't need to construct an ever-changing regex as 99% of domains will be matched properly if you simply look at the 2nd last part of the name:
(co|com|gov|net|org)
If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
$dest=$d[$c-2].'.'.$d[$c-1]; # use the last 2 parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
    $dest=$d[$c-3].'.'.$dest; # if so, add a third part
};
print $dest; # show it
To just get the name, as per your question:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
    $dest=$d[$c-3]; # if so, give the third last
    $dest=$d[$c-4].'.'.$dest if ($c>3); # optional bit
} else {
    $dest=$d[$c-2]; # else the second last
    $dest=$d[$c-3].'.'.$dest if ($c>2); # optional bit
};
print $dest; # show it
I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.
If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.
I'd love to see someone do all of this using just a regex, I'm sure it's possible.

/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim
usage of this javascript regex ignores www and following dot, while retaining the domain intact. also properly matches no www and cc tld

Could you just look for the word before .com (or another suffix) and take the first matching group? (The order of the suffix list would be the opposite of the frequency, see here.)
i.e.
window.location.host.match(/(\w|-)+(?=(\.(com|net|org|info|coop|int|co|ac|ie|co|ai|eu|ca|icu|top|xyz|tk|cn|ga|cf|nl|us|eu|de|hk|am|tv|bingo|blackfriday|gov|edu|mil|arpa|au|ru)(\.|\/|$)))/g)[0]
You can test it by copying this line into the developers' console on any tab.
This example works in the following cases:

So if you just have a string and not a window.location you could use...
String.prototype.toUrl = function () {
    if (!this || this.length === 0) {
        return undefined;
    }
    var original = this.toString();
    var s = original;
    if (!original.toLowerCase().startsWith('http')) {
        s = 'http://' + original;
    }
    // split the (possibly prefixed) string, not the original
    s = s.split('/');
    var protocol = s[0];
    var host = s[2];
    var relativePath = '';
    if (s.length > 3) {
        for (var i = 3; i < s.length; i++) {
            relativePath += '/' + s[i];
        }
    }
    s = host.split('.');
    var domain = s[s.length - 2] + '.' + s[s.length - 1];
    return {
        original: original,
        protocol: protocol,
        domain: domain,
        host: host,
        relativePath: relativePath,
        getParameter: function (param) {
            return this.getParameters()[param];
        },
        getParameters: function () {
            var vars = [], hash;
            var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
            for (var i = 0; i < hashes.length; i++) {
                hash = hashes[i].split('=');
                vars.push(hash[0]);
                vars[hash[0]] = hash[1];
            }
            return vars;
        }
    };
};
How to use.
var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl();
var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParameter('t');

For a certain purpose I did this quick Python function yesterday. It returns domain from URL. It's quick and doesn't need any input file listing stuff. However, I don't pretend it works in all cases, but it really does the job I needed for a simple text mining script.
Output looks like this :
http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com
import re

def getDomain(url):
    parts = re.split(r"/", url)
    match = re.match(r"([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", parts[2])
    if match is not None:
        if re.search(r"\.uk", parts[2]):
            match = re.match(r"([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", parts[2])
        return match.group(2)
    else:
        return ''
Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.

how is this
=((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3})
(you may want to add "\/" to the end of the pattern)
if your goal is to rid url's passed in as a param you may add the equal sign as the first char, like:
=((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)
and replace with "/"
The goal of this example to get rid of any domain name regardless of the form it appears in.
(i.e. to ensure URL parameters don't include domain names, to avoid an XSS attack)

All the answers here are very nice, but all of them will fail sometimes.
I know it is not common to link to something already answered elsewhere, but it will keep you from wasting your time on an impossible task.
The reason is that for domains like mydomain.co.uk there is no way to know whether an extracted domain is correct.
If you are extracting from URLs that always have http or https in front, or nothing at all (but if nothing in front is possible, you have to remove
filter_var($url, filter_var($url, FILTER_VALIDATE_URL))
from the code below, because FILTER_VALIDATE_URL does not recognize a string that does not begin with http as a URL), you can get by with something as simple as this, which will never fail:
$url = strtolower('hTTps://www.example.com/w3/forum/index.php');
if ( filter_var($url, FILTER_VALIDATE_URL) && substr($url, 0, 4) == 'http' )
{
    // array order is important
    $domain = str_replace(array("http://www.","https://www.","http://","https://"), array("","","",""), $url);
    $spos = strpos($domain,'/');
    if ($spos !== false)
    {
        $domain = substr($domain, 0, $spos);
    }
} else {
    $domain = "can't extract a domain";
}
echo $domain;
Check the default behavior of FILTER_VALIDATE_URL here.
But if you want to check a domain for validity and ALWAYS be sure that the extracted value is correct, you have to check it against an array of valid top-level domains, as explained here:
https://stackoverflow.com/a/70566657/6399448
Otherwise you will NEVER be sure that the extracted string is the correct domain. Unfortunately, all the answers here will fail sometimes.
P.S. The only answer here that makes sense to me is this one (sorry, I had not read it before; it proposes the same solution, although without an example like the one above or the one linked):
https://stackoverflow.com/a/569219/6399448

I know you actually asked for Regex and were not specific to a language. But In Javascript you can do this like this. Maybe other languages can parse URL in a similar way.
Easy Javascript solution
const domain = (new URL(str)).hostname.replace("www.", "");
Leave this solution in js for completeness.
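Other languages can indeed parse a URL in a similar way. As a small sketch, here is roughly the same thing in Python using the standard library's urllib.parse; hostname_without_www is a hypothetical helper name, and like the JavaScript version it needs a URL that includes a scheme such as http://.

```python
from urllib.parse import urlsplit


def hostname_without_www(url):
    """Return the host part of url with a leading "www." removed."""
    # hostname is None when the URL has no scheme or host part.
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(hostname_without_www("https://www.example.com:8080/path?q=1"))
```

Unlike replace("www.", ""), the startswith check only strips the prefix at the very beginning of the host name.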

In Javascript, the best way to do this is using the tld-extract npm package. Check out an example at the following link.
Below is the code for the same:
var tldExtract = require("tld-extract")
const urls = [
'http://www.mail.yahoo.co.in/',
'https://mail.yahoo.com/',
'https://www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'https://google.co.uk',
'https://www.yandex.com',
'https://yandex.ru',
]
const tldList = [];
urls.forEach(url => tldList.push(tldExtract(url)))
console.log({tldList})
which results in the following output:
0: Object {tld: "co.in", domain: "yahoo.co.in", sub: "www.mail"}
1: Object {tld: "com", domain: "yahoo.com", sub: "mail"}
2: Object {tld: "uk", domain: "au.uk", sub: "www.abc"}
3: Object {tld: "com", domain: "github.com", sub: ""}
4: Object {tld: "ca", domain: "github.ca", sub: ""}
5: Object {tld: "ru", domain: "google.ru", sub: "www"}
6: Object {tld: "co.uk", domain: "google.co.uk", sub: ""}
7: Object {tld: "com", domain: "yandex.com", sub: "www"}
8: Object {tld: "ru", domain: "yandex.ru", sub: ""}

Found a custom function which works in most of the cases:
function getDomainWithoutSubdomain(url) {
const urlParts = new URL(url).hostname.split('.')
return urlParts
.slice(0)
.slice(-(urlParts.length === 4 ? 3 : 2))
.join('.')
}

You need a list of what domain prefixes and suffixes can be removed. For example:
Prefixes:
www.
Suffixes:
.com
.co.in
.au.uk

#!/usr/bin/perl -w
use strict;
my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
print $3;
}

/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i

Just for knowledge:
'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');
// returns livreto.co

I know the question is seeking a regex solution but in every attempt it won't work to cover everything
I decided to write this method in Python which only works with urls that have a subdomain (i.e. www.mydomain.co.uk) and not multiple level subdomains like www.mail.yahoo.com
def urlextract(url):
    url_split=url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:",url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

Let's say we have this: http://google.com
and you only want the domain name
let url = "http://google.com";
let domainName = url.split("://")[1];
console.log(domainName);

Use this
(.)(.*?)(.)
then just extract the leading and end points.
Easy, right?
