Spring-Boot @RequestMapping and @PathVariable with regular expression matching - regex

I'm attempting to use WebJars-Locator with a Spring-Boot application to map JAR resources. As per their website, I created a RequestMapping like this:
@ResponseBody
@RequestMapping(method = RequestMethod.GET, value = "/webjars-locator/{webjar}/{partialPath:.+}")
public ResponseEntity<ClassPathResource> locateWebjarAsset(@PathVariable String webjar, @PathVariable String partialPath)
{
The problem with this is that the partialPath variable is supposed to include anything after the third slash. What it ends up doing, however, is limiting the mapping itself. This URI is mapped correctly:
http://localhost/webjars-locator/angular-bootstrap-datetimepicker/datetimepicker.js
But this one is not mapped to the handler at all and simply returns a 404:
http://localhost/webjars-locator/datatables-plugins/integration/bootstrap/3/dataTables.bootstrap.css
The only difference is the number of components in the path. The regular expression (".+") should handle them, but it does not appear to work when that portion contains slashes.
If it helps, this is provided in the logs:
2015-03-03 23:03:53.588 INFO 15324 --- [ main] s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped "{[/webjars-locator/{webjar}/{partialPath:.+}],methods=[GET],params=[],headers=[],consumes=[],produces=[],custom=[]}" onto public org.springframework.http.ResponseEntity app.controllers.WebJarsLocatorController.locateWebjarAsset(java.lang.String,java.lang.String)
Is there some type of hidden setting in Spring-Boot to enable regular expression pattern matching on RequestMappings?

The original code in the docs wasn't prepared for the extra slashes, sorry for that!
Please try this code instead:
@ResponseBody
@RequestMapping(value="/webjarslocator/{webjar}/**", method=RequestMethod.GET)
public ResponseEntity<Resource> locateWebjarAsset(@PathVariable String webjar,
WebRequest request) {
try {
String mvcPrefix = "/webjarslocator/" + webjar + "/";
String mvcPath = (String) request.getAttribute(
HandlerMapping.PATH_WITHIN_HANDLER_MAPPING_ATTRIBUTE, RequestAttributes.SCOPE_REQUEST);
String fullPath = assetLocator.getFullPath(webjar,
mvcPath.substring(mvcPrefix.length()));
ClassPathResource res = new ClassPathResource(fullPath);
long lastModified = res.lastModified();
if ((lastModified > 0) && request.checkNotModified(lastModified)) {
return null;
}
return new ResponseEntity<Resource>(res, HttpStatus.OK);
} catch (Exception e) {
return new ResponseEntity<>(HttpStatus.NOT_FOUND);
}
}
I will also provide an update for webjar docs shortly.
Updated 2015/08/05: Added If-Modified-Since handling

It appears that you cannot have a PathVariable to match "the remaining part of the url". You have to use ant-style path patterns, i.e. "**" as described here:
Spring 3 RequestMapping: Get path value
You can then get the entire URL of the request object and extract the "remaining part".
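A rough model of why this happens, assuming (as the behaviour in the question suggests) that the mapping pattern is split on "/" and compared one segment at a time, so a {name:regex} variable only ever sees a single segment. This is a Python sketch for illustration only, not Spring's implementation; match_segments is a hypothetical helper.

```python
import re

# Matches "{name}" or "{name:regex}" pattern segments.
VAR = re.compile(r"^\{(\w+)(?::(.+))?\}$")

def match_segments(pattern, path):
    """Return extracted variables, or None when the path does not match."""
    pat = pattern.strip("/").split("/")
    segs = path.strip("/").split("/")
    variables = {}
    for i, p in enumerate(pat):
        if p == "**":
            # A trailing "**" swallows every remaining segment.
            variables["**"] = "/".join(segs[i:])
            return variables
        if i >= len(segs):
            return None
        m = VAR.match(p)
        if m:
            name, regex = m.group(1), m.group(2) or "[^/]+"
            # The regex is applied to ONE segment, never across slashes.
            if not re.fullmatch(regex, segs[i]):
                return None
            variables[name] = segs[i]
        elif p != segs[i]:
            return None
    return variables if len(pat) == len(segs) else None
```

Under this model the three-segment URL matches the {partialPath:.+} mapping, the six-segment one does not, and the "**" mapping accepts both and leaves the remainder to be extracted by hand.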

Related

How to match only paths in /[^/]+ with Spring @RequestMapping?

I need the following:
Any request to
https://localhost:8443
https://localhost:8443/
https://localhost:8443/test
https://localhost:8443/api
and so on, should be forwarded to
https://localhost:8443/a/web/index.html
Now, this is how I managed to do that:
@Controller
public class ForwardController {
@RequestMapping(value = "/*", method = RequestMethod.GET)
public String redirectRoot(HttpServletRequest request) {
return "forward:/a/web/index.html";
}
}
The problem is:
This also matches https://localhost:8443/api/ (note the / at the end).
This is a problem because that's where I want the Spring Data REST base path to be:
spring.data.rest.base-path=/api/
/api != /api/ when it comes to REST endpoints.
What should work but somehow doesn't
I have tried several different regular expressions but I am still not able to accomplish what I want. For example (demo):
@RequestMapping(value = "/[^/]+", method = RequestMethod.GET)
Will now work for Spring Data - I'm getting all the resource information I expect, but accessing https://localhost:8443/ is now broken and the web-client cannot be reached anymore.
The same goes for
@RequestMapping(value = "/{path}", method = RequestMethod.GET)
@RequestMapping(value = "/{path:[^/]+}", method = RequestMethod.GET)
which behave like /* (also matches the next /).
This issue has been haunting me for weeks and there is still no solution in sight.
This whole question can also be seen as:
Why is "/[^/]+" not matching https://localhost:8443/whatever ?
Regular expressions are usually not the fastest thing to try.
You can pass a list of paths to @RequestMapping instead:
@RequestMapping(value={"", "/", "/test", "/api"}, method = RequestMethod.GET)
See Multiple Spring @RequestMapping annotations

URL validation in typescript

I want to make a custom validator that should check the input Url is valid or not.
I want to use the following regex that I tested in expresso, but comes off invalid when used in typescript (the compiler fails to parse it):
(((ht|f)tp(s?))\://)?((([a-zA-Z0-9_\-]{2,}\.)+[a-zA-Z]{2,})|((?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)(?(\.?\d)\.)){4}))(:[a-zA-Z0-9]+)?(/[a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~]*)?
The above regex allows an optional scheme prefix such as http:// and will also validate an IP address.
The following url's should be valid :
192.1.1.1
http://abcd.xyz.in
https://192.1.1.126
abcd.jhjhj.lo
The following url's should be invalid:
192.1
http://hjdhfjfh
168.18.5
Kindly assist
The forward slashes / are not escaped in the regex.
What is valid or invalid in Javascript is valid or invalid in Typescript and vice-versa.
There may be another option for you, that relies on the URL class. The idea is to try converting the string into a URL object. If that fails, the string does not contain a valid URL.
public isAValidUrl(value: string): boolean {
  try {
    // new URL() throws a TypeError when the string cannot be parsed
    const url = new URL(value);
    return this.isValid(url);
  } catch (e) {
    return false;
  }
}
private isValid(url: URL): boolean {
  // you may do further tests here, e.g. by checking url.pathname
  // for certain patterns
  return true;
}
Instead of returning a boolean, you may return the created URL or null, or (if something like it exists in JavaScript or TypeScript) an Optional<URL>. You should adapt the method's name then, of course.
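For comparison, the same parse-then-inspect idea can be sketched in Python. This is only an illustration under stated assumptions: the accepted schemes and DOMAIN_RE are lifted loosely from the regex in the question, is_valid_url is a hypothetical name, and the host rules (a full dotted IPv4 address, or a dotted name ending in letters) are my reading of the valid/invalid examples listed above.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Assumed host rule: labels of 2+ characters, ending in an alphabetic TLD.
DOMAIN_RE = re.compile(r"^(?:[A-Za-z0-9_-]{2,}\.)+[A-Za-z]{2,}$")

def is_valid_url(value: str) -> bool:
    # urlsplit only recognises a host after "//", so add it when no scheme is given.
    candidate = value if "://" in value else "//" + value
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return False
    if parts.scheme not in ("", "http", "https", "ftp", "ftps"):
        return False
    host = parts.hostname or ""
    try:
        # Accepts only complete four-octet addresses such as 192.1.1.1.
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return DOMAIN_RE.fullmatch(host) is not None
```

Letting the parser find the host first keeps the remaining regex small enough to read.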

How can I set a RegularExpression data annotation's regular expression argument at runtime?

We manage several ASP.NET MVC client web sites, which all use a data annotation like the following to validate customer email addresses (I haven't included the regex here, for readability):
[Required(ErrorMessage="Email is required")]
[RegularExpression(@"MYREGEX", ErrorMessage = "Email address is not valid")]
public string Email { get; set; }
What I would like to do is to centralise this regular expression, so that if we make a change to it, all of the sites immediately pick it up and we don't have to manually change it in each one.
The problem is that the regex argument of the data annotation must be a constant, so I cannot assign a value I've retrieved from a config file or database at runtime (which was my first thought).
Can anyone help me with a clever solution to this—or failing that, an alternative approach which will work to achieve the same goal? Or does this just require us to write a specialist custom validation attribute which will accept non-constant values?
The easiest way is to write a custom ValidationAttribute that inherits from RegularExpressionAttribute, so something like:
public class EmailAttribute : RegularExpressionAttribute
{
public EmailAttribute()
: base(GetRegex())
{ }
private static string GetRegex()
{
// TODO: Go off and get your RegEx here
return @"^[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$";
}
}
That way, you still maintain use of the built in Regex validation but you can customise it. You'd just simply use it like:
[Email(ErrorMessage = "Please use a valid email address")]
Lastly, to get client-side validation to work, you would simply add the following in your Application_Start method within Global.asax, to tell MVC to use the normal regular expression validation for this validator:
DataAnnotationsModelValidatorProvider.RegisterAdapter(typeof(EmailAttribute), typeof(RegularExpressionAttributeAdapter));
Check out ScottGu's [Email] attribute (Step 4: Creating a Custom [Email] Validation Attribute).
Do you really want to put the regex in database/config file, or do you just want to centralise them? If you just want to put the regex together, you can just define and use constants like
public class ValidationRegularExpressions {
public const string Regex1 = "...";
public const string Regex2 = "...";
}
If you want to manage the regexes in external files, you can write an MSBuild task to do the replacement when you build for production.
If you REALLY want to change the validation regex at runtime, define your own ValidationAttribute, like
[RegexByKey("MyKey", ErrorMessage = "Email address is not valid")]
public string Email { get; set; }
It's just a piece of code to write:
public class RegexByKeyAttribute : ValidationAttribute {
// The constructor name must match the class name.
public RegexByKeyAttribute(string key) {
...
}
// override some methods
public override bool IsValid(object value) {
...
}
}
Or even just:
public class RegexByKeyAttribute : RegularExpressionAttribute {
public RegexByKeyAttribute(string key) : base(LoadRegex(key)) { }
// Be careful to cache the regex if this operation is expensive.
private static string LoadRegex(string key) { ... }
}
Hope it's helpful: http://msdn.microsoft.com/en-us/library/cc668224.aspx
Why not just write your own ValidationAttribute?
http://msdn.microsoft.com/en-us/library/system.componentmodel.dataannotations.validationattribute.aspx
Then you can configure that thing to pull the regex from a registry setting... config file... database... etc... etc..
How to: Customize Data Field Validation in the Data Model Using Custom

Pulling multiple values from JSON response using RegEx Extractor

I'm testing a web service that returns JSON responses and I'd like to pull multiple values from the response. A typical response would contain multiple values in a list. For example:
{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
A response would contain many sections like the above example.
What I'd like to do in Jmeter is go through the JSON response and pull each section outlined above in a manner that I can tie the returned name and description as one entry to iterate over.
What I've been able to do thus far is return the name value with regular expression extractor ("name":"(.+?)") using the template $1$. I'd like to pull both name and description but can't seem to get it to work. I've tried using a regex "name":"(.+?)","description":"(.+?)" with a template of $1$$2$ without any success.
Does anyone know how I might pull multiple values using regex in this example?
You can just add (?s) to the regex so that . also matches line breaks.
E.g: (?s)"name":"(.+?)","description":"(.+?)"
It works for me on assertions.
It may be worth using BeanShell scripting to process the JSON response.
So if you need to get ALL the "name/description" pairs from response (for each section) you can do the following:
1. extract all the "name/description" pairs from response in loop;
2. save extracted pairs in csv-file in handy format;
3. read saved pairs from csv-file later in code - using CSV Data Set Config in loop, e.g.
JSON response processing can be implemented using BeanShell scripting (~ java) + any json-processing library (e.g. json-rpc-1.0):
- either in BeanShell Sampler or in BeanShell PostProcessor;
- all the required beanshell libs are currently provided in default
jmeter delivery;
- to use json-processing library place jar into JMETER_HOME/lib folder.
Schematically it will look like:
in case of BeanShell PostProcessor:
Thread Group
. . .
YOUR HTTP Request
BeanShell PostProcessor // added as child
. . .
in case of BeanShell Sampler:
Thread Group
. . .
YOUR HTTP Request
BeanShell Sampler // added separate sampler - after your
. . .
In this case it makes no difference which one you use.
You can either put the code itself into the sampler body ("Script" field) or store in external file, as shown below.
Sampler code:
import java.io.*;
import java.util.*;
import org.json.*;
import org.apache.jmeter.samplers.SampleResult;
ArrayList nodeRefs = new ArrayList();
ArrayList fileNames = new ArrayList();
String extractedList = "extracted.csv";
StringBuilder contents = new StringBuilder();
try
{
if (ctx.getPreviousResult().getResponseDataAsString().equals("")) {
Failure = true;
FailureMessage = "ERROR: Response is EMPTY.";
throw new Exception("ERROR: Response is EMPTY.");
} else {
if ((ResponseCode != null) && (ResponseCode.equals("200") == true)) {
SampleResult result = ctx.getPreviousResult();
JSONObject response = new JSONObject(result.getResponseDataAsString());
FileOutputStream fos = new FileOutputStream(System.getProperty("user.dir") + File.separator + extractedList);
if (response.has("items")) {
JSONArray items = response.getJSONArray("items");
if (items.length() != 0) {
for (int i = 0; i < items.length(); i++) {
String name = items.getJSONObject(i).getString("name");
String description = items.getJSONObject(i).getString("description");
int list_id = items.getJSONObject(i).getInt("list_id");
if (i != 0) {
contents.append("\n");
}
contents.append(name).append(",").append(description).append(",").append(list_id);
System.out.println("\t " + name + "\t\t" + description + "\t\t" + list_id);
}
}
}
byte [] buffer = contents.toString().getBytes();
fos.write(buffer);
fos.close();
} else {
Failure = true;
FailureMessage = "Failed to extract from JSON response.";
}
}
}
catch (Exception ex) {
IsSuccess = false;
log.error(ex.getMessage());
System.err.println(ex.getMessage());
}
catch (Throwable thex) {
System.err.println(thex.getMessage());
}
As well a set of links on this:
JSON in JMeter
Processing JSON Responses with JMeter and the BSF Post Processor
Upd. on 08.2017:
At the moment JMeter has set of built-in components (merged from 3rd party projects) to handle JSON without scripting:
JSON Path Extractor (contributed from ATLANTBH jmeter-components project);
JSON Extractor (contributed from UBIK Load Pack since JMeter 3.0) - see answer below.
I am assuming that JMeter uses Java-based regular expressions... This could mean no named capturing groups. Apparently, Java7 now supports them, but that doesn't necessarily mean JMeter would. For JSON that looks like this:
{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
{
"name":"#AnotherThing",
"description":"Something to fill space",
"list_id":0048265,
}
{
"name":"#SomethingElse",
"description":"Something else as an example",
"list_id":9283641,
}
...this expression:
\{\s*"name":"((?:\\"|[^"])*)",\s*"description":"((?:\\"|[^"])*)",(?:\\}|[^}])*}
...should match 3 times, capturing the "name" value into the first capturing group, and the "description" into the second capturing group, similar to the following:
1 2
--------------- ---------------------------------------
#favorites Collection of my favorite places
#AnotherThing Something to fill space
#SomethingElse Something else as an example
Importantly, this expression supports quote escaping in the value portion (and really even in the identifier name portion as well), so that the JavaScript string I said, "What is your name?"! will be stored in JSON as, and parsed correctly as, I said, \"What is your name?\"!
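A quick way to check the expression outside JMeter is to run it over the sample text. The sketch below uses Python's re module rather than JMeter's Java-based engine, which is an assumption that holds for this pattern because it uses only basic syntax; extract_pairs is a hypothetical helper name.

```python
import re

# The expression from above, unchanged.
PAIR_RE = re.compile(
    r'\{\s*"name":"((?:\\"|[^"])*)",\s*"description":"((?:\\"|[^"])*)",(?:\\}|[^}])*}'
)

SAMPLE = '''{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
{
"name":"#AnotherThing",
"description":"Something to fill space",
"list_id":0048265,
}
{
"name":"#SomethingElse",
"description":"Something else as an example",
"list_id":9283641,
}'''

def extract_pairs(text):
    """Return one (name, description) tuple per matched section."""
    return PAIR_RE.findall(text)

for name, description in extract_pairs(SAMPLE):
    print(name, "|", description)
```

Each tuple keeps a name tied to its description, which is the pairing asked for in the question.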
Using Ubik Load Pack plugin for JMeter which has been donated to JMeter core and is since version 3.0 available as JSON Extractor you can do it this way with following Test Plan:
namesExtractor_ULP_JSON_PostProcessor config:
descriptionExtractor_ULP_JSON_PostProcessor config:
Loop Controller to loop over results:
Counter config:
Debug Sampler showing how to use name and description in one iteration:
And here is what you get for the following JSON:
[{ "name":"#favorites", "description":"Collection of my favorite places", "list_id": 4894636 }, { "name":"#AnotherThing", "description":"Something to fill space", "list_id": 48265 }, { "name":"#SomethingElse", "description":"Something else as an example", "list_id":9283641 }]
Compared to Beanshell solution:
It is more "standard approach"
It performs much better than Beanshell code
It is more readable

Getting parts of a URL (Regex)

Given the URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The Subdomain (test)
The Domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)
The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html
A single regex to parse and breakup a
full URL including query parameters
and anchors e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RegEx positions:
url: RegExp['$&'],
protocol:RegExp.$2,
host:RegExp.$3,
path:RegExp.$4,
file:RegExp.$6,
query:RegExp.$7,
hash:RegExp.$8
you could then further parse the host ('.' delimited) quite easily.
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
then further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
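The specification's expression can be tried directly. A minimal Python sketch follows (the choice of Python and the helper name split_uri are mine, the pattern is the one quoted above):

```python
import re

# Regular expression quoted above from the URI specification.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Return the five generic components; missing ones are None."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Every part of the pattern is optional, so it matches any input string; it splits a URI into components but does not validate it.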
For what it's worth, I found that I had to escape the forward slashes in JavaScript:
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:
var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
console.log(k+':', a[k]);
});
/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:
It can not handle port number.
The hash part is broken.
The following is a modified version:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$
Position of parts are as follows:
int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
Edit posted by anon user:
function getFileName(path) {
return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
I needed a regular Expression to match all urls and made this one:
/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/
It matches all urls, any protocol, even urls like
ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag
The result (in JavaScript) looks like this:
["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]
An url like
mailto://admin@www.cs.server.com
looks like this:
["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]
I was trying to solve this in javascript, which should be handled by:
var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');
since (in Chrome, at least) it parses to:
{
"hash": "#foobar/bing/bo@ng?bang",
"search": "?foo=bar&bingobang=&king=kong@kong.com",
"pathname": "/path/wah@t/foo.js",
"port": "890",
"hostname": "example.com",
"host": "example.com:890",
"password": "b",
"username": "a",
"protocol": "http:",
"origin": "http://example.com:890",
"href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}
However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:
^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?
Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.
The parts are in this order:
var keys = [
"href", // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
"origin", // http://user:pass@host.com:81
"protocol", // http:
"username", // user
"password", // pass
"host", // host.com:81
"hostname", // host.com
"port", // 81
"pathname", // /directory/file.ext
"search", // ?query=1
"hash" // #anchor
];
There is also a small library which wraps it and provides query params:
https://github.com/sadams/lite-url (also available on bower)
If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.
Propose a much more readable solution (in Python, but applies to any regex):
import re

def url_path_to_dict(path):
    # One named group per URL component, one component per line.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None
    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))
Prints:
{
'host': 'example.example.com',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}
subdomain and domain are difficult because the subdomain can have several parts, as can the top level domain, http://sub1.sub2.domain.co.uk/
the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)
the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
the path with the file : http://[^/]+/(.*)
the URL without the path : (http://[^/]+/)
(Markdown isn't very friendly to regexes)
This improved version should work as reliably as a parser.
// Applies to URI, not just URL or URN:
// http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
//
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
//
// http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
//
// $# matches the entire uri
// $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
// $2 matches authority (host, user:pwd@host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
// Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
//
// (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
//
// Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
{
if( !schemes )
schemes = '[^\\s:\/?#]+'
else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
throw TypeError( 'expected URI schemes' )
return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
}
// http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
function uriSchemesRegExp()
{
return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
}
const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s@]*@)?([^\/@:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;
/**
* GROUP 1 ([scheme][authority][host][port])
* GROUP 2 (scheme)
* GROUP 3 (authority)
* GROUP 4 (host)
* GROUP 5 (port)
* GROUP 6 (path)
* GROUP 7 (?query)
* GROUP 8 (query)
* GROUP 9 (fragment)
*/
URI_RE.exec("https://john:doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("ldap://[2001:db8::7]/c=GB?objectClass?one");
URI_RE.exec("mailto:John.Doe@example.com");
Above you can find javascript implementation with modified regex
Try the following:
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
It supports HTTP / FTP, subdomains, folders, files etc.
I found it from a quick google search:
Link
/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/
From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).
You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET.
The only difficult task is breaking the host into subdomain, domain name and TLD.
There is no standard way to do so, and simple string parsing or a regex cannot produce the correct result. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.
However, the list needs to be maintained, since new TLDs can appear. As far as I know, publicsuffix.org currently maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.
This answer is also helpful:
Get the subdomain from a URL
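The suffix-list approach can be sketched in a few lines of Python. Everything here is an assumption for illustration: split_host is a hypothetical helper, and SUFFIXES is a tiny hand-written stand-in for the real public suffix list, which is far larger and has wildcard and exception rules that this sketch ignores.

```python
# Hypothetical, deliberately tiny stand-in for the public suffix list.
SUFFIXES = {"com", "uk", "co.uk"}

def split_host(host, suffixes=SUFFIXES):
    """Return (subdomain, domain, tld), or None if no known suffix matches."""
    labels = host.lower().split(".")
    # Start at index 1 so at least one label is left over for the domain,
    # and try the longest candidate first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    return None

print(split_host("sub1.sub2.domain.co.uk"))
```

Trying the longest suffix first is what keeps "domain.co.uk" from being split as domain "co" with TLD "uk".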
Here is one that is complete, and doesn't rely on any protocol.
function getServerURL(url) {
var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
console.log(m[1]) // Remove this
return m[1];
}
getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")
Prints
http://dev.test.se
http://dev.test.se
//ajax.googleapis.com
//
www.dev.test.se
www.dev.test.se
www.dev.test.se
www.dev.test.se
//dev.test.se
http://www.dev.test.se
http://localhost:8080
https://localhost:8080
None of the above worked for me. Here's what I ended up using:
/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/
I like the regex that was published in "Javascript: The Good Parts".
It's not too short and not too complex.
This page on GitHub also has the JavaScript code that uses it.
But it can be adapted for any language.
https://gist.github.com/voodooGQ/4057330
Java offers a URL class that will do this. Query URL Objects.
On a side note, PHP offers parse_url().
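Python has the same kind of facility in its standard library. A minimal sketch, where url_parts is a hypothetical helper and the directory/file split on the last "/" is my own reading of what the question asks for:

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the pieces listed in the question, without a regex."""
    parts = urlsplit(url)
    # Everything up to the last "/" is the directory, the rest is the file.
    head, _, filename = parts.path.rpartition("/")
    return {
        "host": parts.hostname,
        "directory": head + "/",
        "file": filename,
        "path": parts.path,
        "base": f"{parts.scheme}://{parts.netloc}",
    }

print(url_parts("http://test.example.com/dir/subdir/file.html"))
```

Splitting the host into subdomain and domain is still left to the caller, for the reasons given in the other answers.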
I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.
http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx
I tried a few of these that didn't cover my needs, especially the highest voted which didn't catch a url without a path (http://example.com/)
also lack of group names made it unusable in ansible (or perhaps my jinja2 skills are lacking).
so this is my version slightly modified with the source being the highest voted version here:
^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$
I built this one. It is very permissive: it is not meant to validate a URL, just to split it.
^((http[s]?):\/\/)?([a-zA-Z0-9-.]*)?([\/]?[^?#\n]*)?([?]?[^?#\n]*)?([#]?[^?#\n]*)$
match 1 : full protocol with :// (http or https)
match 2 : protocol without ://
match 3 : host
match 4 : slug
match 5 : param
match 6 : anchor
work
http://
https://
www.demo.com
/slug
?foo=bar
#anchor
https://demo.com
https://demo.com/
https://demo.com/slug
https://demo.com/slug/foo
https://demo.com/?foo=bar
https://demo.com/?foo=bar#anchor
https://demo.com/?foo=bar&bar=foo#anchor
https://www.greate-demo.com/
crash
#anchor#
?toto?
I needed some REGEX to parse the components of a URL in Java.
This is what I'm using:
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
Java Code Snippet:
final Pattern pattern = Pattern.compile(
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
);
final Matcher matcher = pattern.matcher(url);
System.out.println(" URL: " + url);
if (matcher.matches())
{
System.out.println(" Method: " + matcher.group(1));
System.out.println("Hostname: " + matcher.group(2));
System.out.println(" Port: " + matcher.group(3));
System.out.println(" Path: " + matcher.group(4));
System.out.println(" Query: " + matcher.group(5));
System.out.println(" ID: " + matcher.group(6));
return matcher.group(2);
}
System.out.println();
System.out.println();
Using http://www.fileformat.info/tool/regex.htm hometoast's regex works great.
But here is the deal, I want to use different regex patterns in different situations in my program.
For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.
Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).
That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)
I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?
If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:
(?:SOMESTUFF)
You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.
Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:
https?
would match 'http' or 'https' just fine.
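To illustrate the non-capturing suggestion, here is a small Java sketch. It is a simplified pattern written for this example (not hometoast's full expression): the scheme and port are matched with (?:...) so that the hostname is the only numbered group. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostOnly {
    static final Pattern HOST = Pattern.compile(
        "^(?:(?:https?|ftp)://)?" + // scheme: matched but not captured
        "([^:/?#\\s]+)" +           // hostname: the only capturing group
        "(?::\\d+)?"                // port: matched but not captured
    );

    // Returns the hostname, or null if the input does not start with one.
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("https://example.com:8080/a/b"));
        System.out.println(HOST.matcher("").groupCount());
    }
}
```

Because the other subexpressions are non-capturing, groupCount() reports a single group, and the code that reads the match does not need to know about the parts it is not interested in.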
A regexp to get the URL path without the file (Ruby; the %r{...} literal avoids having to escape every slash):
url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s
It can be useful for adding a relative path to this url.
The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:
^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$
The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:
$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"
When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:
^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$
JavaScript engines that predate ES2018 have no named groups at all (and JavaScript never accepts the (?P<name>...) spelling), so there the regex becomes
^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$
and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.
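For comparison, Java does support named groups, spelled (?<name>...) rather than (?P<name>...). The sketch below is a deliberately simplified version of the expression above: it ignores HTML entities and folds path and file into one path group, so it is an assumption-laden illustration rather than a port of the full regex. The class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedUrlParser {
    // Simplified: no HTML-entity handling, and path + file share one group.
    static final Pattern URL_PATTERN = Pattern.compile(
        "^(?:(?<protocol>\\w+)://)?" +
        "(?<host>[^/?#:]+)" +
        "(?::(?<port>[0-9]+))?" +
        "(?<path>/[^?#]*)?" +
        "(?:\\?(?<querystring>[^#]*))?" +
        "(?:#(?<fragment>.*))?$"
    );

    static final String[] NAMES =
        {"protocol", "host", "port", "path", "querystring", "fragment"};

    // Returns name -> matched text (null for absent parts), or null if no match.
    static Map<String, String> parse(String url) {
        Matcher m = URL_PATTERN.matcher(url);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : NAMES) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("http://example.com:8080/dir/file.html?a=1#top"));
    }
}
```

Reading parts by name keeps the calling code stable if groups are later added or reordered, which is the legibility benefit the named version is after.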
//USING REGEX
/**
* Parse URL to get information
*
* @param url the URL string to parse
* @return the parsed URL object, or null
*/
var UrlParser = function (url) {
"use strict";
var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
matches = regx.exec(url),
parser = null;
if (null !== matches) {
parser = {
href : matches[0],
withoutHash : matches[1],
url : matches[2],
origin : matches[3],
protocol : matches[4],
protocolseparator : matches[5],
credhost : matches[6],
cred : matches[7],
user : matches[8],
pass : matches[9],
host : matches[10],
hostname : matches[11],
port : matches[12],
pathname : matches[13],
segment1 : matches[14],
segment2 : matches[15],
search : matches[16],
hash : matches[17]
};
}
return parser;
};
var parsedURL=UrlParser(url);
console.log(parsedURL);
I tried this regex for parsing url partitions:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$
URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2
Matches:
Group 1. 0-7 https:/
Group 2. 0-5 https
Group 3. 8-22 www.google.com
Group 6. 22-50 /my/path/sample/asd-dsa/this
Group 7. 22-46 /my/path/sample/asd-dsa/
Group 8. 46-50 this
Group 9. 50-74 ?key1=value1&key2=value2
Group 10. 51-74 key1=value1&key2=value2
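The same expression can be tried from Java; the sketch below only doubles the backslashes for the string literal and otherwise leaves the regex as posted. The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlPartitions {
    // The regex from the post, with backslashes doubled for a Java string.
    static final Pattern URL_PARTS = Pattern.compile(
        "^((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)(:([^\\/]*))?"
        + "((\\/?(?:[^\\/\\?#]+\\/+)*)([^\\?#]*))(\\?([^#]*))?(#(.*))?$");

    // Returns the text of one group, or null if the URL or that group did not match.
    static String group(String url, int index) {
        Matcher m = URL_PARTS.matcher(url);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String url =
            "https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2";
        int groups = URL_PARTS.matcher("").groupCount();
        for (int i = 1; i <= groups; i++) {
            System.out.println("Group " + i + ": " + group(url, i));
        }
    }
}
```

Groups that did not participate in the match (4, 5, 11 and 12 for this URL) come back as null, which is why they are missing from the listing above.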
The best answer suggested here didn't work for me because my URLs also contain a port.
However modifying it to the following regex worked for me:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
For browser and Node.js environments there is a built-in URL class, which appears to share the same interface in both. Check the respective documentation for your case:
https://nodejs.org/api/url.html#urlhost
https://developer.mozilla.org/en-US/docs/Web/API/URL
This is how it may be used:
let url = new URL('https://test.example.com/cats?name=foofy')
url.protocol; // https:
url.hostname; // test.example.com
url.pathname; // /cats
url.search; // ?name=foofy
let params = url.searchParams
let name = params.get('name'); // returns a string (or null if the parameter is absent), so parse accordingly
For more on parameters, see also https://developer.mozilla.org/en-US/docs/Web/API/URL/searchParams
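On the Java side, a comparable built-in option is java.net.URI. A minimal sketch follows, with a made-up class name; note that getScheme() returns the scheme without the trailing colon and getQuery() returns the query without the leading question mark, unlike the JavaScript properties above.

```java
import java.net.URI;

class UriParts {
    // Returns {scheme, host, path, query}; absent parts are null.
    static String[] parts(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on invalid syntax
        return new String[] {
            uri.getScheme(), uri.getHost(), uri.getPath(), uri.getQuery()
        };
    }

    public static void main(String[] args) {
        for (String part : parts("https://test.example.com/cats?name=foofy")) {
            System.out.println(part);
        }
    }
}
```

For well-formed URLs this avoids maintaining a regex at all; the regex approaches in this thread remain useful when the input is only loosely URL-shaped.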
String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";
System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));
Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl
If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888
enjoy..
Yosi Lev
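A possible variation on the snippet above, sketched with a made-up class name: running the same regex once through a Matcher returns all four parts together and makes a non-match explicit. With replaceAll, a URL that does not match (for example one with no slash or question mark after the host) is simply returned unchanged for every group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSplitter {
    // Same regex as in the answer above.
    static final Pattern PARTS = Pattern.compile("(^http.?://)(.*?)([/\\?]{1,})(.*)");

    // Returns {scheme, host, separator, rest}, or null if the URL does not match.
    static String[] split(String url) {
        Matcher m = PARTS.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String[] parts = split("https://www.thomas-bayer.com/axis2/services/BLZService?wsdl");
        for (int i = 0; i < parts.length; i++) {
            System.out.println((i + 1) + ": " + parts[i]);
        }
    }
}
```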