Why does Finch use Endpoint to represent the Router, Request Parameters and Request Body - web-services

In Finch, we can define a router, request parameters, and a request body like this.
case class Test(name: String, age: Int)
val router: Endpoint[Test] = post("hello") { Ok(Test("name", 30)) }
val requestBody: Endpoint[Test] = body.as[Test]
val requestParameters: Endpoint[Test] = Endpoint.derive[Test].fromParams
The benefit is that we can compose Endpoints together. For example, I can define:
The request path is hello and the parameters should include name and age. (router :: requestParameters)
However, I can still run an invalid endpoint that doesn't include any request path, and it starts successfully (there is no compilation error):
Await.ready(Http.serve(":3000", requestParameters.toService))
The result is a 404 Not Found page, even though I would expect the error to be reported earlier, e.g. at compile time. Is this a design drawback, or something Finch intends to fix?
Many thanks in advance

First of all, thanks a lot for asking this!
Let me give you some insight into how Finch's endpoints work. If you speak category theory, an Endpoint is an Applicative embedding StateT, represented as something close to Input => Option[(Input, A)].
Simply speaking, an endpoint takes an Input that wraps an HTTP request and also captures the current path (e.g., /foo/bar/baz). When an endpoint is applied to a given request, it either matches it (returning Some) or falls over (returning None). When it matches, it changes the state of the Input, usually removing the first path segment from it (e.g., removing foo from /foo/bar/baz) so the next endpoint in the chain can work with a new Input (and a new path).
Once an endpoint has matched, Finch checks whether there is anything left in the Input that wasn't matched. If something is left, the match is considered unsuccessful and your service returns 404.
scala> val e = "foo" :: "bar"
e: io.finch.Endpoint[shapeless.HNil] = foo/bar
scala> e(Input(Request("/foo/bar/baz"))).get._1.path
res1: Seq[String] = List(baz)
When it comes to endpoints matching/extracting query-string params, no path segments are touched and the state is passed to the next endpoint unchanged. So when an endpoint param("foo") is applied, the path is not affected. That simply means the only way to serve a query-string endpoint (note: an endpoint that only extracts query-string params) is to send it a request with the empty path /.
scala> val s = param("foo").toService
s: com.twitter.finagle.Service[com.twitter.finagle.http.Request,com.twitter.finagle.http.Response] = <function1>
scala> s(Request("/", "foo" -> "bar")).get
res4: com.twitter.finagle.http.Response = Response("HTTP/1.1 Status(200)")
scala> s(Request("/bar", "foo" -> "bar")).get
res5: com.twitter.finagle.http.Response = Response("HTTP/1.1 Status(404)")
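Putting this together: to get the behavior the question is after, the path matcher and the param extractors must be composed into a single endpoint before serving it. A minimal sketch against the Finch API used above (exact method names vary a bit between Finch versions, and serving Test requires the JSON encoder from your own setup):

case class Test(name: String, age: Int)

// Matches only POST /hello?name=...&age=... and builds a Test from the params.
val hello: Endpoint[Test] =
  post("hello" :: param("name") :: param("age").as[Int]) { (name: String, age: Int) =>
    Ok(Test(name, age))
  }

// Any request that doesn't match the whole path + params now gets a 404, by design:
// Await.ready(Http.serve(":3000", hello.toService))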

Related

Query parameters for GET requests using Akka HTTP (formerly known as Spray)

One of the features of Akka HTTP (formerly known as Spray) is its ability to automagically marshal and unmarshal data back and forth from JSON into case classes, etc. I've had success at getting this to work well.
At the moment, I am trying to make an HTTP client that performs a GET request with query parameters. The code currently looks like this:
val httpResponse: Future[HttpResponse] =
  Http().singleRequest(HttpRequest(
    uri = s"""http://${config.getString("http.serverHost")}:${config.getInt("http.port")}/""" +
      s"query?seq=${seq}" +
      s"&max-mismatches=${maxMismatches}" +
      s"&pam-policy=${pamPolicy}"))
Well, that's not so pretty. It would be nice if I could just pass in a case class containing the query parameters, and have Akka HTTP automagically generate the query parameters, kind of like it does for json. (Also, the server side of Akka HTTP has a somewhat elegant way of parsing GET query parameters, so one would think that it would also have a somewhat elegant way to generate them.)
I'd like to do something like the following:
val httpResponse: Future[HttpResponse] =
  Http().singleRequest(HttpRequest(
    uri = s"""http://${config.getString("http.serverHost")}:${config.getInt("http.port")}/query""",
    entity = QueryParams(seq = seq, maxMismatches = maxMismatches, pamPolicy = pamPolicy)))
Only, the above doesn't actually work.
Is what I want doable somehow with Akka HTTP? Or do I just need to do things the old-fashioned way? I.e, generate the query parameters explicitly, as I do in the first code block above.
(I know that if I were to change this from a GET to a POST, I could probably to get it to work more like I would like it to work, since then I could get the contents of the POST request automagically converted from a case class to json, but I don't really wish to do that here.)
You can leverage the Uri class to do what you want. It offers multiple ways to get a set of params into the query string using the withQuery method. For example, you could do something like this:
val params = Map("foo" -> "bar", "hello" -> "world")
HttpRequest(Uri(hostAndPath).withQuery(params))
Or
HttpRequest(Uri(hostAndPath).withQuery(("foo" -> "bar"), ("hello" -> "world")))
Obviously this could be done by extending the capabilities of Akka HTTP, but for what you need (just a tidier way to build the query string), you can do it with some Scala fun:
type QueryParams = Map[String, String]

object QueryParams {
  def apply(tuples: (String, String)*): QueryParams = Map(tuples: _*)
}

implicit class QueryParamExtensions(q: QueryParams) {
  def toQueryString = "?" + q.map {
    case (key, value) => s"$key=$value" // Need to do URL escaping here?
  }.mkString("&")
}

implicit class StringQueryExtensions(url: String) {
  def withParams(q: QueryParams) =
    url + q.toQueryString
}
val params = QueryParams(
"abc" -> "def",
"xyz" -> "qrs"
)
params.toQueryString // gives ?abc=def&xyz=qrs
"http://www.google.com".withParams(params) // gives http://www.google.com?abc=def&xyz=qrs

Need Perl SOAP::Transport::HTTP::CGI sanity check

OK, I think there's no easy (make that lazy) way to do what I want, but given the Perl SOAP::Transport::HTTP::CGI code fragment below, what I am looking to do is intercept all SOAP operations passing through the service and log the result of each operation or fault...
SOAP::Transport::HTTP::CGI
-> dispatch_to(
#first arg needs to be the directory holding the PackageName.pm modules with no trailing "/". The args after the first are the names of SPECIFIC packages to be loaded as needed by SOAP requests
#Failure to call out specific modules below will allow the external SOAP modules to be loaded, but the overall module @INC path for other Perl modules will be blocked for security reasons
SOAP_MODULE_INCULDE, #name of the directory holding the PackageName.pm modules with no trailing "/"
"TechnicalMetaDataExtraction", #prod - wrapper for EXIFTool
"Ingest", #module (package) name
"ImageManipulation", #module (package) name
"FacebookBroadcast", #unfinished
"CompressDecompress", #unfinished
"ImageOCR", #prod - tesseract
"HandleDotNet", #prod
"Pipeline", #prod (needs work)
"TwitterBroadcast", #prototype
"Messaging", #prototype but text format email works
"Property", #development
"FileManager", #prototype
"PassThrough" #prod - module to do location conversion (URL -> Fedora Obj+DS, Fedora Obj+DS -> file, URL -> InlineBase64, etc.) but format conversion
) #done with the dispatch_to section
-> on_action(sub {
#on_action method lets you specify SOAPAction understanding. It accepts a reference to a subroutine that takes three parameters: SOAPAction, method_uri and method_name.
#'SOAPAction' is taken from the HTTP header; method_uri and method_name are extracted from the request's body. Default behavior is to match 'SOAPAction' if present and ignore it otherwise.
#die SOAP::Data->type('string')->name('debug')->value("Intercepted call, SOAP request='".shift(@_)."'");
if($Debug) {
#@_ notes:
#[0] - "http://www.example.org/PassThrough/NewOperation"
#[1] - http://www.example.org/PassThrough/
#[2] - NewOperation
#[3] - "undefined"
my %DataHash=(
message => $_[0]
);
#SendMessageToAMQTopic(localtime()." - ".$_[0]);
SendDebugMessage(\%DataHash, "info");
} #there's only one element passed at this level
}) #end on_action
#-> on_debug() #not valid for SOAP::Transport::HTTP::CGI
#-> request() #valid, but does not fire - request method gives you access to HTTP::Request object which you can provide for Server component to handle request.
#-> response() #does not fire - response method gives you access to HTTP::Response object which you can access to get results from Server component after request was handled.
#-> options({compress_threshold => 10000}) #causes problems for the JavaScript soap client - removed for the moment
-> handle() #fires but ignores content in sub - handle method will handle your request. You should provide parameters with request() method, call handle() and get it back with response().
;
Initially I thought I could get the information I needed from the "on_action" method, but that only contains the destination of the SOAP call (before it is sent?) and I'm looking for data in the operation result that will be sent back to the SOAP client. The documentation of "SOAP::Transport::HTTP::CGI" is a bit thin and there are few examples online.
Does anyone know if this is possible, given how the code above is set up? If not, then the only other option is to alter each method of my SOAP service code modules to include the "SendDebugMessage" function.
I would suggest subclassing SOAP::Transport::HTTP::CGI and hooking into the handle() method. An untested and probably non-working example would be:
package MySoapCGI;
use Data::Dumper;
use SOAP::Transport::HTTP;
use base 'SOAP::Transport::HTTP::CGI';
sub handle {
my $self = shift;
$self->SUPER::handle(@_);
warn Dumper($self->request);
warn Dumper($self->response);
}
Replace the dumpers with whatever logging you want. You may need to do some XML parsing, because these will be the raw HTTP::Request and HTTP::Response.

Handling Cookies in Ocamlnet

I'm trying to write a bot pulling some data which is only available to authenticated users. I settled on OCaml (v. 3.12.1) and Ocamlnet (v. 3.6.5). The first part of the script sends a POST request to the website, and from the HTML I receive back I can tell that the authentication worked (p1 and p2's values in this code sample are obviously not the ones I'm using).
open Http_client
open Nethttp
let pipeline = new pipeline
let () =
  let post_call = new post
    "http://www.kraland.org/main.php?p=1&a=100"
    [("p1", "username");
     ("p2", "password");
     ("Submit", "Ok!")]
  in
  pipeline#add post_call;
  pipeline#run();
Then I extract the cookies where the PHP session id, the account name, a hash of the password, etc. are stored, put them in the header of the next request, and run it. And this is where I run into trouble: I systematically get the boring page every anonymous visitor gets.
let cookies = Header.get_set_cookie post_call#response_header in
let get_call = new get "http://www.kraland.org/main.php?p=1" in
let header = get_call#request_header `Base in
Header.set_set_cookie header cookies;
pipeline#add get_call;
pipeline#run();
When I print the contents of the cookies, I do get something weird: I would expect the domain of the cookies to be kraland.org, but that does not seem to be the case. This is the printing command I use, together with the output:
List.iter (fun c -> Printf.printf "%.0f [%s%s:%b] %s := %s\n"
    (match c.cookie_expires with None -> -1. | Some f -> f)
    (match c.cookie_domain with None -> "" | Some s -> s)
    (match c.cookie_path with None -> "" | Some s -> s)
    c.cookie_secure c.cookie_name c.cookie_value)
  cookies;
-1 [/:false] PHPSESSID := 410b97b0536b3e949df17edd44965926
1372719625 [:false] login := username
1372719625 [:false] id := myid
1372719625 [:false] password := fbCK/0M+blFRLx3oDp+24bHlwpDUy7x885sF+Q865ms=
1372719625 [:false] pc_id := 872176495311
Edit: I had a go at the problem using Haskell's http-conduit-browser and it works like a charm, using something very much like the doc's example.

Scala Play2 Error on Web Service call

I'm getting an error on compile with the following code.
I'm trying to call a Web Service.
def authenticate(username: String, password: String): String = {
  val request: Future[Response] =
    WS.url(XXConstants.URL_GetTicket)
      .withTimeout(5000)
      .post(Map("username" -> Seq(username), "password" -> Seq(password)))
  request map { response =>
    Ok(response.xml.text)
  } recover {
    case t: TimeoutException =>
      RequestTimeout(t.getMessage)
    case e =>
      ServiceUnavailable(e.getMessage)
  }
}
I'm seeing the following compiler error:
type mismatch; found : scala.concurrent.Future[play.api.mvc.SimpleResult[String]] required: String
The value being returned from your authenticate function is the Future produced by request map { ... } recover { ... }, but the function's signature says it returns a String, which, as the compiler says, is a type mismatch. Changing the return type of the function to a Future, or converting the future's result to a String before returning it, should fix it.
As Brian said, you're currently returning a Future, while your method declares that it returns a String.
The request returns a Future because it's an asynchronous call.
So, you have two alternatives:
Change your method definition to return a Future[String], and manage that future in the caller (with .map()); see the sketch after the snippet below.
Force the request to yield its result immediately, in a synchronous way. It's not a very good deal, but sometimes it's the simplest solution.
import scala.concurrent.Await
import scala.concurrent.duration.Duration
val response: String = Await.result(req, Duration.Inf)
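For completeness, here is a sketch of the first alternative, with the names taken from the question; it assumes Play's default ExecutionContext is in scope for .map():

import scala.concurrent.Future
import play.api.libs.concurrent.Execution.Implicits.defaultContext

def authenticate(username: String, password: String): Future[String] =
  WS.url(XXConstants.URL_GetTicket)
    .withTimeout(5000)
    .post(Map("username" -> Seq(username), "password" -> Seq(password)))
    .map(response => response.xml.text) // the caller decides how to await or chain it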

How to get domain name from URL

How can I fetch a domain name from a URL String?
Examples:
+----------------------+------------+
| input | output |
+----------------------+------------+
| www.google.com | google |
| www.mail.yahoo.com | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk | abc |
+----------------------+------------+
Related:
Matching a web address through regex
I once had to write such a regex for a company I worked for. The solution was this:
Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but it lacks ac.uk for example, so it is not really usable for this.
Join the list like the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.
Example regex:
.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$
This worked really well and also matched weird, unofficial top-levels like de.com and friends.
The upside:
Very fast if the regex is optimally ordered.
The downside of this solution is of course:
A handwritten regex which has to be updated manually whenever ccTLDs change or get added. A tedious job!
A very large regex, so not very readable.
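To illustrate the join step, here is a small Scala sketch of building and applying such an alternation; the suffix list is a tiny hypothetical sample, and a real one would have to be generated from the IANA data:

import java.util.regex.Pattern

// Keep longer, more specific suffixes before their shorter tails (org.uk before uk).
val suffixes = List("co.uk", "org.uk", "ac.uk", "com", "net", "org", "uk")
val re = ("""([^.]+)\.(""" + suffixes.map(Pattern.quote).mkString("|") + """)$""").r

re.findFirstMatchIn("www.example.org.uk").foreach { m =>
  println(m.group(1) + " | " + m.group(2)) // prints: example | org.uk
}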
A little late to the party, but:
const urls = [
'www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'http://www.google.co.uk',
'www.yandex.com',
'yandex.ru',
'yandex'
]
urls.forEach(url => console.log(url.replace(/.+\/\/|www.|\..+/g, '')))
Extracting the domain name accurately can be quite tricky, mainly because the domain extension can contain two parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of them. EuroDNS.com, for example, lists over 800 domain name extensions.
I therefore wrote a short PHP function that uses parse_url() and some observations about domain extensions to accurately extract the URL components AND the domain name. The function is as follows:
function parse_url_all($url){
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}
This simple function will work in almost every case. There are a few exceptions, but these are very rare.
To demonstrate / test this function you can use the following:
$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
$info = parse_url_all($url);
echo "<tr><td>".$url."</td><td>".$info['host'].
"</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";
The output for the URLs listed (an HTML table, not reproduced here) shows that the domain name and the domain name without the extension are consistently extracted, whatever URL is presented to the function.
I hope that this helps.
/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/
There are two ways
Using split
Then just parse that string
var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
domain = url.split('/')[2];
} else if (url.indexOf('//') === 0) {
domain = url.split('/')[2];
} else {
domain = url.split('/')[0];
}
//find & remove port number
domain = domain.split(':')[0];
Using Regex
var r = /:\/\/(.[^/]+)/;
"http://stackoverflow.com/questions/5343288/get-url".match(r)[1]
=> stackoverflow.com
Hope this helps
I don't know of any libraries, but the string manipulation of domain names is easy enough.
The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk it is not always the third level; some organisations (e.g. bl.uk, jet.uk) exist at the second level).
The source of Firefox from Mozilla has such a data file; check the Mozilla licensing to see if you could reuse it.
import urlparse
GENERIC_TLDS = [
'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs',
'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
]
def get_domain(url):
    hostname = urlparse.urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse.urlparse('http://' + url).netloc
    # Remove the 'user:passw' and ':port' parts, then a leading 'www.'
    hostname = hostname.split('@')[-1].split(':')[0]
    if hostname.startswith('www.'):  # lstrip('www.') would strip characters, not the prefix
        hostname = hostname[4:]
    hostname = hostname.split('.')
    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])
This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.
However it'll do the job in most cases.
It is not possible without using a TLD list to compare with, as there exist many cases like http://www.db.de/ or http://bbc.co.uk/ that a regex would interpret as the domains db.de (correct) and co.uk (wrong).
But even with such a list you won't have success if it does not also contain SLDs. URLs like http://big.uk.com/ and http://www.uk.com/ would both be interpreted as uk.com (the first domain is actually big.uk.com).
Because of that all browsers use Mozilla's Public Suffix List:
https://en.wikipedia.org/wiki/Public_Suffix_List
You can use it in your code by importing it through this URL:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Feel free to extend my function to extract only the domain name. It doesn't use regex and it is fast:
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878
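If you happen to be on the JVM, you don't have to parse that file yourself: as a hedged example, Guava's InternetDomainName ships with a compiled-in copy of the Public Suffix List:

import com.google.common.net.InternetDomainName

val host = InternetDomainName.from("www.mail.yahoo.co.uk")
// The registrable domain: one label plus the public suffix
// (throws if the input is itself a public suffix).
val registrable = host.topPrivateDomain().toString // "yahoo.co.uk"
val suffix = host.publicSuffix().toString          // "co.uk"
val name = registrable.stripSuffix("." + suffix)   // "yahoo"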
Basically, what you want is:
google.com -> google.com -> google
www.google.com -> google.com -> google
google.co.uk -> google.co.uk -> google
www.google.co.uk -> google.co.uk -> google
www.google.org -> google.org -> google
www.google.org.uk -> google.org.uk -> google
Optional:
www.google.com -> google.com -> www.google
images.google.com -> google.com -> images.google
mail.yahoo.co.uk -> yahoo.co.uk -> mail.yahoo
mail.yahoo.com -> yahoo.com -> mail.yahoo
www.mail.yahoo.com -> yahoo.com -> mail.yahoo
You don't need to construct an ever-changing regex, as 99% of domains will be matched properly if you simply look at the second-last part of the name:
(co|com|gov|net|org)
If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
$dest=$d[$c-2].'.'.$d[$c-1]; # use the last 2 parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3].'.'.$dest; # if so, add a third part
};
print $dest; # show it
To just get the name, as per your question:
my @d=split /\./,$domain; # split the domain part into an array
$c=@d; # count how many parts
if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
$dest=$d[$c-3]; # if so, give the third last
$dest=$d[$c-4].'.'.$dest if ($c>3); # optional bit
} else {
$dest=$d[$c-2]; # else the second last
$dest=$d[$c-3].'.'.$dest if ($c>2); # optional bit
};
print $dest; # show it
I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.
If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.
I'd love to see someone do all of this using just a regex, I'm sure it's possible.
/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim
Usage of this JavaScript regex ignores www and the dot that follows it while keeping the rest of the domain intact; it also properly matches hosts without www, and ccTLDs.
Could you just look for the word before .com (or another suffix; order the suffix alternatives by descending frequency of use) and take the first matching group, i.e.:
window.location.host.match(/(\w|-)+(?=(\.(com|net|org|info|coop|int|co|ac|ie|co|ai|eu|ca|icu|top|xyz|tk|cn|ga|cf|nl|us|eu|de|hk|am|tv|bingo|blackfriday|gov|edu|mil|arpa|au|ru)(\.|\/|$)))/g)[0]
You can test it by copying this line into the developers' console on any tab.
So if you just have a string and not a window.location you could use...
String.prototype.toUrl = function(){
if(!this || this.length === 0)
{
return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
s = 'http://' + original;
}
s = s.split('/');
var protocol = s[0];
var host = s[2];
var relativePath = '';
if(s.length > 3){
for(var i=3;i< s.length;i++)
{
relativePath += '/' + s[i];
}
}
s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];
return {
original: original,
protocol: protocol,
domain: domain,
host: host,
relativePath: relativePath,
getParameter: function(param)
{
return this.getParameters()[param];
},
getParameters: function(){
var vars = [], hash;
var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
for (var i = 0; i < hashes.length; i++) {
hash = hashes[i].split('=');
vars.push(hash[0]);
vars[hash[0]] = hash[1];
}
return vars;
}
};};
How to use:
var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl();
var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParameter('t');
For a certain purpose I wrote this quick Python function yesterday. It returns the domain from a URL. It's quick and doesn't need any input file listing stuff. I don't pretend it works in all cases, but it really does the job I needed for a simple text-mining script.
Output looks like this :
http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com
import re

def getDomain(url):
    parts = re.split("\/", url)
    match = re.match("([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", parts[2])
    if match != None:
        if re.search("\.uk", parts[2]):
            match = re.match("([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", parts[2])
        return match.group(2)
    else:
        return ''
Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.
How about this?
=((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3})
(You may want to add "\/" to the end of the pattern.)
If your goal is to strip URLs passed in as a parameter, you may add an equals sign as the first character, like:
=((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)
and replace with "/".
The goal of this example is to get rid of any domain name regardless of the form it appears in
(i.e. to ensure URL parameters don't include domain names, to avoid XSS attacks).
All the answers here are very nice, but all of them will fail sometimes.
I know it is not common to link to something already answered elsewhere, but you'll find that you should not waste your time on an impossible thing.
This is because with domains like mydomain.co.uk there is no way to know whether an extracted domain is correct.
If you are talking about extracting from URLs, they always have http or https (or possibly nothing) in front. If nothing in front is possible, you have to remove the
filter_var($url, FILTER_VALIDATE_URL)
check below, because FILTER_VALIDATE_URL does not recognize as a URL a string that does not begin with http. Alternatively, you can use something simple like this, which will never fail:
$url = strtolower('hTTps://www.example.com/w3/forum/index.php');
if( filter_var($url, FILTER_VALIDATE_URL) && substr($url, 0, 4) == 'http' )
{
    // array order is !important
    $domain = str_replace(array("http://www.","https://www.","http://","https://"), array("","","",""), $url);
    $spos = strpos($domain,'/');
    if($spos !== false)
    {
        $domain = substr($domain, 0, $spos);
    }
} else {
    $domain = "can't extract a domain";
}
echo $domain;
Check FILTER_VALIDATE_URL default behavior here
But, if you want to check a domain for validity, and ALWAYS be sure that the extracted value is correct, then you have to check it against an array of valid top-level domains, as explained here:
https://stackoverflow.com/a/70566657/6399448
or you'll NEVER be sure that the extracted string is the correct domain. Unfortunately, all the answers here will sometimes fail.
P.S. The one other answer here that makes sense to me is this one (I did not read it before, sorry). It provides the same solution, even though it does not include an example like the one above:
https://stackoverflow.com/a/569219/6399448
I know you actually asked for regex and were not specific about a language. But in JavaScript you can do it like this; maybe other languages can parse the URL in a similar way.
Easy Javascript solution
const domain = (new URL(str)).hostname.replace("www.", "");
I'll leave this solution in JS for completeness.
In JavaScript, the best way to do this is using the tld-extract npm package.
Below is the code for the same:
var tldExtract = require("tld-extract")
const urls = [
'http://www.mail.yahoo.co.in/',
'https://mail.yahoo.com/',
'https://www.abc.au.uk',
'https://github.com',
'http://github.ca',
'https://www.google.ru',
'https://google.co.uk',
'https://www.yandex.com',
'https://yandex.ru',
]
const tldList = [];
urls.forEach(url => tldList.push(tldExtract(url)))
console.log({tldList})
which results in the following output:
0: Object {tld: "co.in", domain: "yahoo.co.in", sub: "www.mail"}
1: Object {tld: "com", domain: "yahoo.com", sub: "mail"}
2: Object {tld: "uk", domain: "au.uk", sub: "www.abc"}
3: Object {tld: "com", domain: "github.com", sub: ""}
4: Object {tld: "ca", domain: "github.ca", sub: ""}
5: Object {tld: "ru", domain: "google.ru", sub: "www"}
6: Object {tld: "co.uk", domain: "google.co.uk", sub: ""}
7: Object {tld: "com", domain: "yandex.com", sub: "www"}
8: Object {tld: "ru", domain: "yandex.ru", sub: ""}
I found a custom function which works in most cases:
function getDomainWithoutSubdomain(url) {
const urlParts = new URL(url).hostname.split('.')
return urlParts
.slice(0)
.slice(-(urlParts.length === 4 ? 3 : 2))
.join('.')
}
You need a list of the domain prefixes and suffixes that can be removed. For example:
Prefixes:
www.
Suffixes:
.com
.co.in
.au.uk
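A minimal sketch of that stripping idea (in Scala here, since the question is language-agnostic; the affix lists are just the samples above and would need to be maintained):

// Strip one known prefix and one known suffix; list longer suffixes first.
val prefixes = List("www.")
val suffixes = List(".co.in", ".au.uk", ".com")

def stripAffixes(host: String): String = {
  val noPrefix = prefixes.collectFirst {
    case p if host.startsWith(p) => host.drop(p.length)
  }.getOrElse(host)
  suffixes.collectFirst {
    case s if noPrefix.endsWith(s) => noPrefix.dropRight(s.length)
  }.getOrElse(noPrefix)
}

stripAffixes("www.mail.yahoo.co.in") // "mail.yahoo"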
#!/usr/bin/perl -w
use strict;
my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
print $3;
}
/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i
Just for knowledge:
'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');
# returns livreto.co
I know the question asks for a regex solution, but no regex attempt will cover everything.
I decided to write this method in Python, which only works with URLs that have a single-label subdomain (i.e. www.mydomain.co.uk) and not multi-level subdomains like www.mail.yahoo.com:
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
Let's say we have this: http://google.com
and you only want the domain name
let url = "http://google.com";
let domainName = url.split("://")[1];
console.log(domainName);
Use this
(.)(.*?)(.)
then just extract the leading and end points.
Easy, right?