varnish split url and change url

varnish split url and change url - regex

Is there a way to split url on varnish or change url structure with it.
I know regsub or regsuball support that but they are not enough in my case.
I would like to change a url and redirect it to another domain.
For example:
http://aaa.test.com/sport/99244-article-hyun-jun-suku-kapa.html?
to redirect below address
http://m.test.com/article-hyun-jun-suku-kapa-sport-99244/
I added some lines in vcl file to do that
set req.http.xrul=regsuball(req.url,".html",""); "clear .html"
set req.http.xrul=regsub(req.http.xrul,"(\d+)","\1"); find numbers --article ID =99244
I can rid of the article ID with
set req.http.xrul=regsub(req.http.xrul,"(\d+)","");
but cannot get just only article ID
set req.http.xrul=regsub(req.http.xrul,"(\d+)","\1"); or any other method
Does varnish support split the url with "-" pattern thus I could redesign the url? Or can we get only articleID with regsub?

Is this what you want to achieve?
set req.http.X-Redirect-URL = regsuball(req.url,"^/([^/]*)/([0-9]+)-([^/]+)\.html$","http://m.test.com/\3-\1-\2");
This is working code tailored to example you provided, just one level of section placement.
If you want to support more levels of sections, you only have to adjust regexp a bit and replace / to - in second step:
set req.http.X-Redirect-URL = "http://m.test.com/" + regsuball(regsuball(req.url, "^/(.*)/([0-9]+)-([^/]+)\.html$", "\3-\1-\2"), "/", "-");
Maybe you need one more refinement. What if URL doesn't match you pattern? X-Redirect-URL will be the very same value as req.url is. You definitely don't want redirect loop, so I suggest to add mark character to the begin of X-Redirect-URL and then test for it.
Let's say:
set req.http.X-Redirect-URL = regsuball(regsuball(req.url, "^/(.*)/([0-9]+)-([^/]+)\.html$", "#\3-\1-\2"), "/", "-");
if(req.http.X-Redirect-URL ~ "^#") {
set req.http.X-Redirect-URL = regsuball(req.http.X-Redirect-URL, "#", "http://m.test.com/");
return(synth(391));
} else {
unset req.http.X-Redirect-URL;
}
and for all cases, you need in vcl_synth:
if (resp.status == 391) {
set resp.status = 301;
set resp.http.Location = req.http.X-Redirect-URL;
return (deliver);
}
Hope this helps.

Related

Modify a cookie in JMeter to include curly braces around the value

I have the following JSR223 PostProcessor where the captured c_QUID value is to be surrounded by curly braces and used in a cookie. For example if c_QUID is 5575-9878-4848-8897, then I need to set the cookie as {5575-9878-4848-8897}. What can I modify in the below script to do that? As of now it does not add the curly braces.
import org.apache.jmeter.protocol.http.control.*
//Get cookie manager
CookieManager cm = sampler.getCookieManager()
log.info("XXXXXXXXXX QUID " + vars.get("c_QUID"));
Cookie c = new Cookie("QUID", vars.get("c_QUID"), "stage.randomtesting.com", "/", true, 1557578515)
cm.add(c);

Use string concatenation the value:
new Cookie("QUID", "{" + vars.get("c_QUID") + "}",

Shouldn't it be something like:
Cookie c = new Cookie("QUID", "{" + vars.get("c_QUID") + "}", "stage.randomtesting.com", "/", true, 1557578515)
By the way, you might want to use current timestamp as the cookie expiration date because your value of 1557578515 is in the past so the application might not accept it, you might want to replace it with something in the future? See Creating and Testing Dates in JMeter - Learn How article for more details
Also wouldn't that be easier just to define the custom cookie in the HTTP Cookie Manager?

Using Servlet filter on all the pages except the index

I'm trying to use a Filter to force my users to login if they want to access some pages.
So my Filter has to redirect them to an error page in there's no session.
But I don't want this to happen when they visit index.html, because they can login in the index page.
So I need an URL Pattern that matches all the pages excluding / and index.xhtml.
How can I do that? Can I use regex in my web.xml ?
EDIT:
After reading this
I thought that I can make something like :
if (!req.getRequestURI().matches("((!?index)(.*)\\.xhtml)|((.*)\\.(png|gif|jpg|css|js(\\.xhtml)?))"))
in my doFilter() method, but it still processes everything.
I'm sure that the regex works because I've tested it online and it matches the files that doesn't need to be filtered, but the content of the if is executed even for the excluded files!
EDIT 2 :
I'm trying a new way.
I've mapped the Filter to *.xhtml in my web.xml, so I don't need to exclude css, images and javascript with the regex above.
Here's the new code (into the doFilter())
if (req.getRequestURI().contains("index")) {
chain.doFilter(request, response);
} else {
if (!userManager.isLogged()) {
request.getRequestDispatcher("error.xhtml").forward(request, response);
} else {
chain.doFilter(request, response);
}
}
but it still doesn't because it calls the chain.doFilter() (in the outer if) on every page.
How can I exclude my index page from being filtered?

The web.xml URL pattern doesn't support regex. It only supports wildcard prefix (folder) and suffix (extension) matching like /faces/* and *.xhtml.
As to your concrete problem, you've apparently the index file defined as a <welcome-file> and are opening it by /. This way the request.getRequestURI() will equal to /contextpath/, not /contextpath/index.xhtml. Debug the request.getRequestURI() to learn what the filter actually retrieved.
I suggest a rewrite:
String path = request.getRequestURI().substring(request.getContextPath().length());
if (userManager.isLogged() || path.equals("/") || path.equals("/index.xhtml") || path.startsWith(ResourceHandler.RESOURCE_IDENTIFIER)) {
chain.doFilter(request, response);
} else {
request.getRequestDispatcher("/WEB-INF/error.xhtml").forward(request, response);
}
Map this filter on /*. Note that I included the ResourceHandler.RESOURCE_IDENTIFIER check so that JSF resources like <h:outputStylesheet>, <h:outputScript> and <h:graphicImage> will also be skipped, otherwise you end up with an index page without CSS/JS/images when the user is not logged in.
Note that I assume that the FacesServlet is mapped on an URL pattern of *.xhtml. Otherwise you need to alter the /index.xhtml check on path accordingly.

How to validate a complete and valid url using Regex

I have a URL validation method which works pretty well except that this url passes: "http://". I would like to ensure that the user has entered a complete url like: "http://www.stackoverflow.com".
Here is the pattern I'm currently using:
"^(https?://)"
+ "?(([0-9a-z_!~*'().&=+$%-]+: )?[0-9a-z_!~*'().&=+$%-]+#)?" //user#
+ #"(([0-9]{1,3}\.){3}[0-9]{1,3}" // IP- 199.194.52.184
+ "|" // allows either IP or domain
+ #"([0-9a-z_!~*'()-]+\.)*" // tertiary domain(s)- www.
+ #"([0-9a-z][0-9a-z-]{0,61})?[0-9a-z]\." // second level domain
+ "[a-z]{2,6})" // first level domain- .com or .museum
+ "(:[0-9]{1,4})?" // port number- :80
+ "((/?)|" // a slash isn't required if there is no file name
+ "(/[0-9a-z_!~*'().;?:#&=+$,%#-]+)+/?)$"
Any help to change the above to ensure that the user enters a complete and valid url would be greatly appreciated.

Why not use a urlparsing library? Let me list out some preexisting url parsing libraries for languages:
Python: http://docs.python.org/library/urlparse.html
Perl: http://search.cpan.org/dist/URI/URI/Split.pm
Ruby: http://www.ensta.fr/~diam/ruby/online/ruby-doc-stdlib/libdoc/uri/rdoc/classes/URI.html#M001444
PHP: http://php.net/manual/en/function.parse-url.php
Java: http://download.oracle.com/javase/1.4.2/docs/api/java/net/URI.html#URI(java.lang.String)
C#: http://msdn.microsoft.com/en-us/library/system.uri.aspx
Ask if I'm missing a language.
This way, you could first parse the uri, then check to make sure that it passes your own verification rules. Here's an example in Python:
url = urlparse.urlparse(user_url)
if not (url.scheme and url.path):
raise ValueError("User did not enter a correct url!")
Since you said you were using C# on asp.net, here's an example (sorry, my c# knowledge is limited):
user_url = "http://myUrl/foo/bar";
Uri uri = new Uri(user_url);
if (uri.Scheme == Uri.UriSchemeHttp && Uri.IsWellFormedUriString(user_url, UriKind.RelativeOrAbsolute)) {
Console.WriteLine("I have a valid URL!");
}

This is pretty much a FAQ. You could simply try a search with [regex] +validate +url or just look at this answer: What is the best regular expression to check if a string is a valid URL

I use this regex. It works fine for me.
/((([A-Za-z]{3,9}:(?://)?)(?:[-;:&=+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=+\$,\w]+#)[A-Za-z0-9.-]+)((?:/[+~%/.\w-]*)?\??(?:[-+=&;%#.\w])#?(?:[\w]))?)/

How to get domain name from URL

How can I fetch a domain name from a URL String?
Examples:
+----------------------+------------+
| input | output |
+----------------------+------------+
| www.google.com | google |
| www.mail.yahoo.com | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk | abc |
+----------------------+------------+
Related:
Matching a web address through regex

I once had to write such a regex for a company I worked for. The solution was this:
Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example so for this it is not really usable.
Join the list like the example below. A warning: Ordering is important! If org.uk would appear after uk then example.org.uk would match org instead of example.
Example regex:
.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
Very fast if regex is optimally ordered
The downside of this solution is of course:
Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
Very large regex so not very readable.

A little late to the party, but:
const urls = [
'www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'http://www.google.co.uk',
'www.yandex.com',
'yandex.ru',
'yandex'
]
urls.forEach(url => console.log(url.replace(/.+\/\/|www.|\..+/g, '')))

Extracting the Domain name accurately can be quite tricky mainly because the domain extension can contain 2 parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of these. EuroDNS.com for example lists over 800 domain name extensions.
I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name. The function is as follows:
function parse_url_all($url){
$url = substr($url,0,4)=='http'? $url: 'http://'.$url;
$d = parse_url($url);
$tmp = explode('.',$d['host']);
$n = count($tmp);
if ($n>=2){
if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
$d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
$d['domainX'] = $tmp[($n-3)];
} else {
$d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
$d['domainX'] = $tmp[($n-2)];
}
}
return $d;
}
This simple function will work in almost every case. There are a few exceptions, but these are very rare.
To demonstrate / test this function you can use the following:
$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
$info = parse_url_all($url);
echo "<tr><td>".$url."</td><td>".$info['host'].
"</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";
The output will be as follows for the URL's listed:
As you can see, the domain name and the domain name without the extension are consistently extracted whatever the URL that is presented to the function.
I hope that this helps.

/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/

There are two ways
Using split
Then just parse that string
var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
domain = url.split('/')[2];
} if (url.indexOf('//') === 0) {
domain = url.split('/')[2];
} else {
domain = url.split('/')[0];
}
//find & remove port number
domain = domain.split(':')[0];
Using Regex
var r = /:\/\/(.[^/]+)/;
"http://stackoverflow.com/questions/5343288/get-url".match(r)[1]
=> stackoverflow.com
Hope this helps

I don't know of any libraries, but the string manipulation of domain names is easy enough.
The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk is is not always the third level, some organisations (e.g. bl.uk, jet.uk) exist at the second level).
The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.

import urlparse
GENERIC_TLDS = [
'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
]
def get_domain(url):
hostname = urlparse.urlparse(url.lower()).netloc
if hostname == '':
# Force the recognition as a full URL
hostname = urlparse.urlparse('http://' + uri).netloc
# Remove the 'user:passw', 'www.' and ':port' parts
hostname = hostname.split('#')[-1].split(':')[0].lstrip('www.').split('.')
num_parts = len(hostname)
if (num_parts < 3) or (len(hostname[-1]) > 2):
return '.'.join(hostname[:-1])
if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
return '.'.join(hostname[:-1])
if num_parts >= 3:
return '.'.join(hostname[:-2])
This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.
However it'll do the job in most cases.

It is not possible without using a TLD list to compare with as their exist many cases like http://www.db.de/ or http://bbc.co.uk/ that will be interpreted by a regex as the domains db.de (correct) and co.uk (wrong).
But even with that you won't have success if your list does not contain SLDs, too. URLs like http://big.uk.com/ and http://www.uk.com/ would be both interpreted as uk.com (the first domain is big.uk.com).
Because of that all browsers use Mozilla's Public Suffix List:
https://en.wikipedia.org/wiki/Public_Suffix_List
You can use it in your code by importing it through this URL:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Feel free to extend my function to extract the domain name, only. It won't use regex and it is fast:
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878

Basically, what you want is:
google.com -> google.com -> google
www.google.com -> google.com -> google
google.co.uk -> google.co.uk -> google
www.google.co.uk -> google.co.uk -> google
www.google.org -> google.org -> google
www.google.org.uk -> google.org.uk -> google
Optional:
www.google.com -> google.com -> www.google
images.google.com -> google.com -> images.google
mail.yahoo.co.uk -> yahoo.co.uk -> mail.yahoo
mail.yahoo.com -> yahoo.com -> mail.yahoo
www.mail.yahoo.com -> yahoo.com -> mail.yahoo
You don't need to construct an ever-changing regex as 99% of domains will be matched properly if you simply look at the 2nd last part of the name:
(co|com|gov|net|org)
If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:
my #d=split /\./,$domain; # split the domain part into an array
$c=#d; # count how many parts
$dest=$d[$c-2].'.'.$d[$c-1]; # use the last 2 parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3].'.'.$dest; # if so, add a third part
};
print $dest; # show it
To just get the name, as per your question:
my #d=split /\./,$domain; # split the domain part into an array
$c=#d; # count how many parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3]; # if so, give the third last
$dest=$d[$c-4].'.'.$dest if ($c>3); # optional bit
} else {
$dest=$d[$c-2]; # else the second last
$dest=$d[$c-3].'.'.$dest if ($c>2); # optional bit
};
print $dest; # show it
I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.
If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.
I'd love to see someone do all of this using just a regex, I'm sure it's possible.

/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim
usage of this javascript regex ignores www and following dot, while retaining the domain intact. also properly matches no www and cc tld

Could you just look for the word before .com (or other) (the order of the other list would be the opposite of the frequency see here
and take the first matching group
i.e.
window.location.host.match(/(\w|-)+(?=(\.(com|net|org|info|coop|int|co|ac|ie|co|ai|eu|ca|icu|top|xyz|tk|cn|ga|cf|nl|us|eu|de|hk|am|tv|bingo|blackfriday|gov|edu|mil|arpa|au|ru)(\.|\/|$)))/g)[0]
You can test it could by copying this line into the developers' console on any tab
This example works in the following cases:

So if you just have a string and not a window.location you could use...
String.prototype.toUrl = function(){
if(!this && 0 < this.length)
{
return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
s = 'http://' + original;
}
s = this.split('/');
var protocol = s[0];
var host = s[2];
var relativePath = '';
if(s.length > 3){
for(var i=3;i< s.length;i++)
{
relativePath += '/' + s[i];
}
}
s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];
return {
original: original,
protocol: protocol,
domain: domain,
host: host,
relativePath: relativePath,
getParameter: function(param)
{
return this.getParameters()[param];
},
getParameters: function(){
var vars = [], hash;
var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
for (var i = 0; i < hashes.length; i++) {
hash = hashes[i].split('=');
vars.push(hash[0]);
vars[hash[0]] = hash[1];
}
return vars;
}
};};
How to use.
var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl;
var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParamter('t');

For a certain purpose I did this quick Python function yesterday. It returns domain from URL. It's quick and doesn't need any input file listing stuff. However, I don't pretend it works in all cases, but it really does the job I needed for a simple text mining script.
Output looks like this :
http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com
def getDomain(url):
parts = re.split("\/", url)
match = re.match("([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", parts[2])
if match != None:
if re.search("\.uk", parts[2]):
match = re.match("([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", parts[2])
return match.group(2)
else: return ''
Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.

how is this
=((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3})
(you may want to add "\/" to end of pattern
if your goal is to rid url's passed in as a param you may add the equal sign as the first char, like:
=((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)
and replace with "/"
The goal of this example to get rid of any domain name regardless of the form it appears in.
(i.e. to ensure url parameters don't incldue domain names to avoid xss attack)

All answers here are very nice, but all will fails sometime.
So i know it is not common to link something else, already answered elsewhere, but you'll find that you have to not waste your time into impossible thing.
This because domains like mydomain.co.uk there is no way to know if an extracted domain is correct.
If you speak about to extract by URLs, something that ever have http or https or nothing in front (but if it is possible nothing in front, you have to remove
filter_var($url, filter_var($url, FILTER_VALIDATE_URL))
here below, because FILTER_VALIDATE_URL do not recognize as url a string that do not begin with http, so may remove it, and you can also achieve with something stupid like this, that never will fail:
$url = strtolower('hTTps://www.example.com/w3/forum/index.php');
if( filter_var($url, FILTER_VALIDATE_URL) && substr($url, 0, 4) == 'http' )
{
// array order is !important
$domain = str_replace(array("http://www.","https://www.","http://","https://"), array("","","",""), $url);
$spos = strpos($domain,'/');
if($spos !== false)
{
$domain = substr($domain, 0, $spos);
} } else { $domain = "can't extract a domain"; }
echo $domain;
Check FILTER_VALIDATE_URL default behavior here
But, if you want to check a domain for his validity, and ALWAYS be sure that the extracted value is correct, then you have to check against an array of valid top domains, as explained here:
https://stackoverflow.com/a/70566657/6399448
or you'll NEVER be sure that the extracted string is the correct domain. Unfortunately, all the answers here sometime will fails.
P.s the unique answer that make sense here seem to me this (i did not read it before sorry. It provide the same solution, even if do not provide an example as mine above mentioned or linked):
https://stackoverflow.com/a/569219/6399448

I know you actually asked for Regex and were not specific to a language. But In Javascript you can do this like this. Maybe other languages can parse URL in a similar way.
Easy Javascript solution
const domain = (new URL(str)).hostname.replace("www.", "");
Leave this solution in js for completeness.

In Javascript, the best way to do this is using the tld-extract npm package. Check out an example at the following link.
Below is the code for the same:
var tldExtract = require("tld-extract")
const urls = [
'http://www.mail.yahoo.co.in/',
'https://mail.yahoo.com/',
'https://www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'https://google.co.uk',
'https://www.yandex.com',
'https://yandex.ru',
]
const tldList = [];
urls.forEach(url => tldList.push(tldExtract(url)))
console.log({tldList})
which results in the following output:
0: Object {tld: "co.in", domain: "yahoo.co.in", sub: "www.mail"}
1: Object {tld: "com", domain: "yahoo.com", sub: "mail"}
2: Object {tld: "uk", domain: "au.uk", sub: "www.abc"}
3: Object {tld: "com", domain: "github.com", sub: ""}
4: Object {tld: "ca", domain: "github.ca", sub: ""}
5: Object {tld: "ru", domain: "google.ru", sub: "www"}
6: Object {tld: "co.uk", domain: "google.co.uk", sub: ""}
7: Object {tld: "com", domain: "yandex.com", sub: "www"}
8: Object {tld: "ru", domain: "yandex.ru", sub: ""}

Found a custom function which works in most of the cases:
function getDomainWithoutSubdomain(url) {
const urlParts = new URL(url).hostname.split('.')
return urlParts
.slice(0)
.slice(-(urlParts.length === 4 ? 3 : 2))
.join('.')
}

You need a list of what domain prefixes and suffixes can be removed. For example:
Prefixes:
www.
Suffixes:
.com
.co.in
.au.uk

#!/usr/bin/perl -w
use strict;
my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
print $3;
}

/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i

Just for knowledge:
'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');
# returns livreto.co

I know the question is seeking a regex solution but in every attempt it won't work to cover everything
I decided to write this method in Python which only works with urls that have a subdomain (i.e. www.mydomain.co.uk) and not multiple level subdomains like www.mail.yahoo.com
def urlextract(url):
url_split=url.split(".")
if len(url_split) <= 2:
raise Exception("Full url required with subdomain:",url)
return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

Let's say we have this: http://google.com
and you only want the domain name
let url = http://google.com;
let domainName = url.split("://")[1];
console.log(domainName);

Use this
(.)(.*?)(.)
then just extract the leading and end points.
Easy, right?

Getting parts of a URL (Regex)

Given the URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The Subdomain (test)
The Domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)
The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html

A single regex to parse and breakup a
full URL including query parameters
and anchors e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RexEx positions:
url: RegExp['$&'],
protocol:RegExp.$2,
host:RegExp.$3,
path:RegExp.$4,
file:RegExp.$6,
query:RegExp.$7,
hash:RegExp.$8
you could then further parse the host ('.' delimited) quite easily.
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
the further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.

I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
12 3 4 5 6 7 8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
as $. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
For what it's worth, I found that I had to escape the forward slashes in JavaScript:
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:
var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
console.log(k+':', a[k]);
});
/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/

I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:
It can not handle port number.
The hash part is broken.
The following is a modified version:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$
Position of parts are as follows:
int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
Edit posted by anon user:
function getFileName(path) {
return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

I needed a regular Expression to match all urls and made this one:
/(?:([^\:]*)\:\/\/)?(?:([^\:\#]*)(?:\:([^\#]*))?\#)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/
It matches all urls, any protocol, even urls like
ftp://user:pass#www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag
The result (in JavaScript) looks like this:
["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]
An url like
mailto://admin#www.cs.server.com
looks like this:
["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]

I was trying to solve this in javascript, which should be handled by:
var url = new URL('http://a:b#example.com:890/path/wah#t/foo.js?foo=bar&bingobang=&king=kong#kong.com#foobar/bing/bo#ng?bang');
since (in Chrome, at least) it parses to:
{
"hash": "#foobar/bing/bo#ng?bang",
"search": "?foo=bar&bingobang=&king=kong#kong.com",
"pathname": "/path/wah#t/foo.js",
"port": "890",
"hostname": "example.com",
"host": "example.com:890",
"password": "b",
"username": "a",
"protocol": "http:",
"origin": "http://example.com:890",
"href": "http://a:b#example.com:890/path/wah#t/foo.js?foo=bar&bingobang=&king=kong#kong.com#foobar/bing/bo#ng?bang"
}
However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:
^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:#\/#\?]+)(?:\:([^:#\/#\?]*))?)#)?(([^:\/#\?\]\[]+|\[[^\/\]##?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?
Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.
The parts are in this order:
var keys = [
"href", // http://user:pass#host.com:81/directory/file.ext?query=1#anchor
"origin", // http://user:pass#host.com:81
"protocol", // http:
"username", // user
"password", // pass
"host", // host.com:81
"hostname", // host.com
"port", // 81
"pathname", // /directory/file.ext
"search", // ?query=1
"hash" // #anchor
];
There is also a small library which wraps it and provides query params:
https://github.com/sadams/lite-url (also available on bower)
If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.

Propose a much more readable solution (in Python, but applies to any regex):
def url_path_to_dict(path):
pattern = (r'^'
r'((?P<schema>.+?)://)?'
r'((?P<user>.+?)(:(?P<password>.*?))?#)?'
r'(?P<host>.*?)'
r'(:(?P<port>\d+?))?'
r'(?P<path>/.*?)?'
r'(?P<query>[?].*?)?'
r'$'
)
regex = re.compile(pattern)
m = regex.match(path)
d = m.groupdict() if m is not None else None
return d
def main():
print url_path_to_dict('http://example.example.com/example/example/example.html')
Prints:
{
'host': 'example.example.com',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}

subdomain and domain are difficult because the subdomain can have several parts, as can the top level domain, http://sub1.sub2.domain.co.uk/
the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)
the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
the path with the file : http://[^/]+/(.*)
the URL without the path : (http://[^/]+/)
(Markdown isn't very friendly to regexes)

This improved version should work as reliably as a parser.
// Applies to URI, not just URL or URN:
// http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
//
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
//
// http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
//
// $# matches the entire uri
// $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
// $2 matches authority (host, user:pwd#host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
// Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
//
// (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
//
// Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
{
if( !schemes )
schemes = '[^\\s:\/?#]+'
else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
throw TypeError( 'expected URI schemes' )
return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
}
// http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
function uriSchemesRegExp()
{
return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
}

const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s#]*#)?([^\/#:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;
/**
* GROUP 1 ([scheme][authority][host][port])
* GROUP 2 (scheme)
* GROUP 3 (authority)
* GROUP 4 (host)
* GROUP 5 (port)
* GROUP 6 (path)
* GROUP 7 (?query)
* GROUP 8 (query)
* GROUP 9 (fragment)
*/
URI_RE.exec("https://john:doe#www.example.com:123/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("ldap://[2001:db8::7]/c=GB?objectClass?one");
URI_RE.exec("mailto:John.Doe#example.com");
Above you can find javascript implementation with modified regex

Try the following:
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+#)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
It supports HTTP / FTP, subdomains, folders, files etc.
I found it from a quick google search:
Link

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)#)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/
From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).

You can get all the http/https, host, port, path as well as query by using Uri object in .NET.
just the difficult task is to break the host into sub domain, domain name and TLD.
There is no standard to do so and can't be simply use string parsing or RegEx to produce the correct result. At first, I am using RegEx function but not all URL can be parse the subdomain correctly. The practice way is to use a list of TLDs. After a TLD for a URL is defined the left part is domain and the remaining is sub domain.
However the list need to maintain it since new TLDs is possible. The current moment I know is publicsuffix.org maintain the latest list and you can use domainname-parser tools from google code to parse the public suffix list and get the sub domain, domain and TLD easily by using DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.
This answers also helpfull:
Get the subdomain from a URL
CaLLMeLaNN

Here is one that is complete, and doesnt rely on any protocol.
function getServerURL(url) {
var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
console.log(m[1]) // Remove this
return m[1];
}
getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")
Prints
http://dev.test.se
http://dev.test.se
//ajax.googleapis.com
//
www.dev.test.se
www.dev.test.se
www.dev.test.se
www.dev.test.se
//dev.test.se
http://www.dev.test.se
http://localhost:8080
https://localhost:8080

None of the above worked for me. Here's what I ended up using:
/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/

I like the regex that was published in "Javascript: The Good Parts".
Its not too short and not too complex.
This page on github also has the JavaScript code that uses it.
But it an be adapted for any language.
https://gist.github.com/voodooGQ/4057330

Java offers a URL class that will do this. Query URL Objects.
On a side note, PHP offers parse_url().

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.
http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

I tried a few of these that didn't cover my needs, especially the highest voted which didn't catch a url without a path (http://example.com/)
also lack of group names made it unusable in ansible (or perhaps my jinja2 skills are lacking).
so this is my version slightly modified with the source being the highest voted version here:
^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

I build this one. Very permissive it's not to check url juste divide it.
^((http[s]?):\/\/)?([a-zA-Z0-9-.]*)?([\/]?[^?#\n]*)?([?]?[^?#\n]*)?([#]?[^?#\n]*)$
match 1 : full protocole with :// (http or https)
match 2 : protocole without ://
match 3 : host
match 4 : slug
match 5 : param
match 6 : anchor
work
http://
https://
www.demo.com
/slug
?foo=bar
#anchor
https://demo.com
https://demo.com/
https://demo.com/slug
https://demo.com/slug/foo
https://demo.com/?foo=bar
https://demo.com/?foo=bar#anchor
https://demo.com/?foo=bar&bar=foo#anchor
https://www.greate-demo.com/
crash
#anchor#
?toto?

I needed some REGEX to parse the components of a URL in Java.
This is what I'm using:
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
Java Code Snippet:
final Pattern pattern = Pattern.compile(
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
);
final Matcher matcher = pattern.matcher(url);
System.out.println(" URL: " + url);
if (matcher.matches())
{
System.out.println(" Method: " + matcher.group(1));
System.out.println("Hostname: " + matcher.group(2));
System.out.println(" Port: " + matcher.group(3));
System.out.println(" Path: " + matcher.group(4));
System.out.println(" Query: " + matcher.group(5));
System.out.println(" ID: " + matcher.group(6));
return matcher.group(2);
}
System.out.println();
System.out.println();

Using http://www.fileformat.info/tool/regex.htm hometoast's regex works great.
But here is the deal, I want to use different regex patterns in different situations in my program.
For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has it's own regex depending on where it should look inside the URL.
Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).
That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?
If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:
(?:SOMESTUFF)
You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.
Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:
https?
would match 'http' or 'https' just fine.

regexp to get the URL path without the file.
url = 'http://domain/dir1/dir2/somefile'
url.scan(/^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$/i).to_s
It can be useful for adding a relative path to this url.

The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:
^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$
The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:
$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"
When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:
^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}};|[^#])+))?
(?:#(?P<fragment>.*))?$
In JavaScript, of course, you can't use named backreferences, so the regex becomes
^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$
and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.

//USING REGEX
/**
* Parse URL to get information
*
* #param url the URL string to parse
* #return parsed the URL parsed or null
*/
var UrlParser = function (url) {
"use strict";
var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:#\/#\?]+)(?:\:([^:#\/#\?]+))?)#)?(([^:\/#\?\]\[]+|\[[^\/\]##?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
matches = regx.exec(url),
parser = null;
if (null !== matches) {
parser = {
href : matches[0],
withoutHash : matches[1],
url : matches[2],
origin : matches[3],
protocol : matches[4],
protocolseparator : matches[5],
credhost : matches[6],
cred : matches[7],
user : matches[8],
pass : matches[9],
host : matches[10],
hostname : matches[11],
port : matches[12],
pathname : matches[13],
segment1 : matches[14],
segment2 : matches[15],
search : matches[16],
hash : matches[17]
};
}
return parser;
};
var parsedURL=UrlParser(url);
console.log(parsedURL);

I tried this regex for parsing url partitions:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$
URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2
Matches:
Group 1. 0-7 https:/
Group 2. 0-5 https
Group 3. 8-22 www.google.com
Group 6. 22-50 /my/path/sample/asd-dsa/this
Group 7. 22-46 /my/path/sample/asd-dsa/
Group 8. 46-50 this
Group 9. 50-74 ?key1=value1&key2=value2
Group 10. 51-74 key1=value1&key2=value2

The best answer suggested here didn't work for me because my URLs also contain a port.
However modifying it to the following regex worked for me:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

For browser / nodejs environment there is a built in URL class which share the same signature it seems. but check out the respective focus for your case.
https://nodejs.org/api/url.html#urlhost
https://developer.mozilla.org/en-US/docs/Web/API/URL
This is how it may be used though.
let url = new URL('https://test.example.com/cats?name=foofy')
url.protocall; // https:
url.hostname; // test.example.com
url.pathname; // /cats
url.search; // ?name=foofy
let params = url.searchParams
let name = params.get('name');// always string I think so parse accordingly
for more on parameters also see https://developer.mozilla.org/en-US/docs/Web/API/URL/searchParams

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";
System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));
Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl
If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888
enjoy..
Yosi Lev

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

varnish split url and change url - regex

Related

Modify a cookie in JMeter to include curly braces around the value

Using Servlet filter on all the pages except the index

How to validate a complete and valid url using Regex

How to get domain name from URL

Getting parts of a URL (Regex)

Categories

Resources