How to extract request parameter from $request? - regex

I want to extract the value of a request parameter (reqId) from $request in the http block, using a regex and map.
Can you please help me resolve this?
Sample URL :
test-registration.com/emp/reg?reqId=939393&usrName=Jimmy
I am not sure what the regular expression would be in this case, but a possible
solution would look like this:
http {
map $request $requestId {
"regular expression" $reqId;
}
}
If there is any other solution to resolve this issue, please let me know.
I thought I could use $arg_reqId, but I am not sure whether it can be used in the http block.
EDITED:
After extracting the id I want to apply sha-256 hashing on it and put it back to $request.
So new value of $request should be like :
test-registration.com/emp/reg?reqId=$#&$#&yewywjd3&usrName=Jimmy
Thanks

You could try this pattern: reqId=([^&]+)
Explanation:
reqId= - match reqId= literally
(...) - capturing group
[^&]+ - match one or more characters other than &
The required value will be stored in the first capturing group.
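As a rough illustration of that pattern outside nginx, the following Python sketch extracts reqId and swaps in its SHA-256 hex digest, which is what the edited question asks for. The function name hash_req_id is made up for this example. It is not an nginx configuration; as far as I know, a map block on its own only maps values and does not compute hashes.

```python
import hashlib
import re

# Same pattern as in the answer: capture everything after "reqId=" up to the next "&".
REQ_ID = re.compile(r"reqId=([^&]+)")


def hash_req_id(url: str) -> str:
    """Replace the first reqId value in `url` with its SHA-256 hex digest."""

    def _replace(match: re.Match) -> str:
        digest = hashlib.sha256(match.group(1).encode("utf-8")).hexdigest()
        return "reqId=" + digest

    # count=1: only the first occurrence is rewritten; other text is left untouched.
    return REQ_ID.sub(_replace, url, count=1)


if __name__ == "__main__":
    print(hash_req_id("test-registration.com/emp/reg?reqId=939393&usrName=Jimmy"))
```

Note that $request in nginx is the full request line (method, URI, protocol), so if reqId were the last parameter, [^&]+ would also swallow the trailing " HTTP/1.1"; a class such as [^&\s]+ avoids that.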

What about this map block:
http {
map $request $requestId {
"~/emp/reg\?reqId=(?<reqId>[0-9]+)" /doSomething/$reqId;
}
}


How to extract part of url - dart/flutter

I'm trying to extract part of a URL. To be more specific, I'm trying to extract the value of the page_info parameter in the URL that is tagged rel="next".
String testUrl = "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9>; rel='previous', <https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel='next'";
List<String> splitUrl = testUrl.split("=");
print(splitUrl[5]);
// this is what it prints out
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel
// this is what I'm trying to extract
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0
// value for rel="next"
I tried to split the url by using split function on String but that would also bring the angle bracket with it. I'm trying to extract only page_info= parameter value which is for rel="next"
I know this has to do something with regex but I'm not really good at it! Any help would be really appreciated
I grabbed that url from header response (paginated REST API), it returns two page_info parameters (one for next and other one for previous page) I'm trying to extract value for next page. Splitting the url didn't help me
thank you
An alternative approach is to use Uri.parse to parse the URL:
void main() {
String testUrl = "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9>; rel='previous', <https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel='next'";
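// Extract just the URL that is tagged rel='next' (firstMatch on <...> alone would return the 'previous' URL).
var match = RegExp(r"<([^>]*)>;\s*rel='next'").firstMatch(testUrl);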
if (match != null) {
var uri = Uri.parse(match.group(1)!);
print(uri.queryParameters['page_info']); // Prints: eyJkaXJlY3Rpb24iOiJ...
}
}
Note that the above wouldn't need any of the RegExp code if testUrl were a proper URL without the angle brackets and rel='next' junk.
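The same idea (pick the URL by its rel value, then let a URL parser read the query string) can be sketched in Python. This is only an illustration under assumptions: the function name next_page_info is hypothetical, and the shortened header in the usage line stands in for the real Shopify response.

```python
import re
from urllib.parse import parse_qs, urlparse

# One "<url>; rel='name'" entry; the quotes around the rel value may be ' or ".
LINK_ENTRY = re.compile(r"<([^>]*)>;\s*rel=['\"]?([^'\",\s]+)['\"]?")


def next_page_info(link_header):
    """Return the page_info value of the rel='next' URL, or None if there is none."""
    for match in LINK_ENTRY.finditer(link_header):
        url, rel = match.group(1), match.group(2)
        if rel == "next":
            values = parse_qs(urlparse(url).query).get("page_info")
            return values[0] if values else None
    return None


if __name__ == "__main__":
    header = (
        "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=abc>; rel='previous', "
        "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=xyz>; rel='next'"
    )
    print(next_page_info(header))
```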
The regex pattern page_info=([\w]+)
gives you both values (the second one belongs to rel='next'):
eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0
https://regexr.com/6qj1h

How to match specific string in ROBOT FRAMEWORK using regex?

I am using a REST API for testing.
I am stuck at the point where I check the response against a specific string.
Please refer to the info below.
The response I got from a request is:
{
"clusters":[
{
"id":10,
"name":"HP2",
"status":2,
"statusDisplay":"HParihar#4info.com",
"lastModifiedBy":"HParihar#4info.com",
"lastModifiedTime":"06/08/2017 23:42",
"sitesAppsCount":0
},
{
"id":799,
"name":"Regression_cluster_111_09",
"status":2,
"statusDisplay":"admin#4info.net",
"lastModifiedBy":"admin#4info.net",
"lastModifiedTime":"07/11/2017 08:19",
"sitesAppsCount":0
}
]}
and I wanted to match just
"name":"Regression_cluster_111_09",
"status":2,
"statusDisplay":"admin#4info.net",
"sitesAppsCount":0
right side values I'll be keeping as hard coded.
any guesses?
Since you are only checking whether those 4 parameters are in the response,
do not use regex for this.
Use the JSON object's key/value lookup instead.
Check whether the expected values are present for those keys.
If the value for a key is null, the parameter is not in the response.
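A minimal sketch of that key/value check, written in Python rather than Robot Framework keywords. The names EXPECTED and has_matching_cluster are invented for this example, and the response text is assumed to be the JSON shown in the question.

```python
import json

# The four hard-coded key/value pairs from the question.
EXPECTED = {
    "name": "Regression_cluster_111_09",
    "status": 2,
    "statusDisplay": "admin#4info.net",
    "sitesAppsCount": 0,
}


def has_matching_cluster(response_text, expected=EXPECTED):
    """True if any cluster in the response carries every expected key/value pair."""
    clusters = json.loads(response_text).get("clusters", [])
    return any(
        all(cluster.get(key) == value for key, value in expected.items())
        for cluster in clusters
    )
```

Fields that are not listed in EXPECTED (such as lastModifiedTime) are simply ignored, so they can change between runs without breaking the check.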
I got my answer
I used the following regex
"name":"Regression_cluster_111_09","status":2,"statusDisplay":"admin#4info.net","lastModifiedBy":"[a-z]+#[0-9a-z]+\.[a-z]+","lastModifiedTime":"[0-9]{2}\/[0-9]{2}\/[0-9]{4}\ [0-9]{2}:[0-9]{2}","sitesAppsCount":0
or I can simply use
"name":"Regression_cluster_111_09","status":2,"statusDisplay":"admin#4info.net",.+"sitesAppsCount":0
thank you all

Original string contains "+", regular expression extractor (.+?) replaces it with a space. How can I extract with the "+"

(Edit: The answer is to use check 'Encode?'option in the HTTP Request. Please see Vinoth's Edit 2 and comment below, thanks!)
This is interesting!
I'm trying to parse an HTTP response which contains (to give a concrete example):
bigH:"2a3a6CEH+iJakQpQtPm8efv"
Using Regular Expression Extractor when I try
bigH:"(.+?)"
it extracts the string but replaces all the "+" in the string with space. That is, instead of
"2a3a6CEH+iJakQpQtPm8efv"
it gives me:
"2a3a6CEH iJakQpQtPm8efv"
Note the space between H and i.
How can I stop it from replacing the "+" with a space? I'd really appreciate if someone can give an explanation also.
Btw, I tried (.+?) and (.\++?) and even ([.|\+]+?) - didn't work :(
Thanks,
--Ishtiaque
[Screenshots not reproduced here: the POST response data; the value after parsing with the Regular Expression Extractor in JMeter; both side by side in Notepad++; the 'Raw' tab, which shows the '+' characters; and the 'HTTP' tab, which does not.]
As you get the response in JSON format, I would go with JSON Path Extractor.
It seems to be a much easier approach than using Regular expression.
Below JSON Path should take care of getting the encoded string from your JSON & You should be able to access using ${bigH}.
Check this for more details (scroll down for JSON Path extractor details).
EDIT:
I was wrong that you get the response in JSON format. Are you trying to access bigH:"XXX" from the script tag? For this, we have to use the Regular Expression Extractor or Beanshell.
<script type='text/javascript' charset='utf-8'>
registerSubmit(document.forms[0].elements['SubmitTopButton']);
registerSubmit(document.forms[0].elements['SubmitBottomButton']);
(function($) {
$(".wb_tsauthall").wb_tsauthall({
auth : "Authorize All",
unauth : "Unauthorize All",
locMsgKeys : []
});
$(".wb_newedit").wb_newedit({
labels:['Job','Code','Work Premium','Flat Rate','Premium','Shift','Sched Times','LTA','Sched Times w Breaks','Delete Details','Employee Holiday','Work Detail','Schedule Detail'],values:[105,103,200,206,204,450,401,500,461,199,900,100,460],bigH:"PVxUbYIODBT31j8IZnPGxF/9O1iuKAkFzTO9WhXu8An8hAUa22tLiWrEHz8v9SIu/NXZH1a5IxO0xYeNwRIYM+3n1kNsrESnhiAYhwhCiqUY9mI4hvEPgAOx7B+MEB8iSIUyNGNZbeGx9nSogFYpNrzmCXirW7Nm9Tn7owPKHmc8dOf5SZ+eDzAOHIB8+5YzQ3bIdFoe60hOMkyd7FiUXtwPcNMUFEjOSMs9JhgIHTE4agpCdbFb6SLuSuLoO9rqxj+9GovUbzTmrxj4faBKZVATNN7iIFyDZHYAZuZRcPJBdUJ1xNHMCWyPZ4p2/Yk0Q0ujdKJbJw9NFysikZgBFNEhNXEA4w8HL1ycYCmZDgSUW1GsumDAKh0Brq3K8Kh2akep8YEjDMWipKgSPaNx3CVY4lf87e0oK70nK/zKGkmpWFvyMnxbkJtWmeuxmPgRZgg2lYbZXFauD1AidnQQhPULJTTV+P+Xkk9PYm3ZkIEcDnYJUmPg/D3iuwg84m2IZatFTdjiNuDAcGNKptTd54yMgohN87c3sRMiZlSY/r88u+Le3BKWJqyl7Xai7Odqz366DFgOzdPi92LnSaggKX++hy+Z04kjyfSZOUYWmiWlc38SUdeTq2v15egig2mMkSLMaUnHagk="
});
$("#codeSummaryBar").wb_expandableframe({
iframe : contextPath + '/dailytimesheet/summaryInline.jsp'
});
$("#codeSummaryBar").click(function(){$("#codeSummaryBar_expand_collapse_icon").toggleClass("collapse expand");});
$("#codeSummaryBar").click();
$("#selectionBar").wb_expandableframe({
iframe : contextPath + '/dailytimesheet/dailySelectInline.jsp',
onExpand : function() {
$(".selectionBarControl").css("visibility", "hidden");
$("#expand_collapse_icon").removeClass("expand").addClass("collapse");
},
onCollapse : function() {
$(".selectionBarControl").css("visibility", "");
$("#expand_collapse_icon").removeClass("collapse").addClass("expand");
}
});
DTS.onload();
})(jQuery);
</script>
EDIT 2:
I doubt that you might have checked the Encode in the HTTP Request.
Uncheck
Try with the regular expression ([a-zA-Z0-9+]+)
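One likely explanation for the symptom, offered as an assumption since the thread does not spell it out: in form-encoded (application/x-www-form-urlencoded) data a literal "+" stands for a space, so a value that is sent without being encoded first gets its "+" decoded into a space. The Python sketch below only demonstrates that decoding rule; it is not JMeter code.

```python
from urllib.parse import quote_plus, unquote_plus

raw = "2a3a6CEH+iJakQpQtPm8efv"

# Decoding the raw value as form data turns "+" into a space.
decoded_unencoded = unquote_plus(raw)   # "2a3a6CEH iJakQpQtPm8efv"

# Encoding first turns "+" into "%2B", which survives the decoding step.
encoded = quote_plus(raw)               # "2a3a6CEH%2BiJakQpQtPm8efv"
round_trip = unquote_plus(encoded)      # back to the original value

print(decoded_unencoded)
print(encoded)
print(round_trip)
```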

Express.JS regular expression for example.com/:username

// ex: http://example.com/john_smith
app.get('/^(a-z)_(0-9)', function(req, res) {
res.send('user');
});
// ex: http://example.com/john_smith/messages/1987234
app.get('/^(a-z)_(0-9)/messages/:id', function(req, res) {
res.send('message');
});
I wrote the above code for an app that I want to pass a username as a url variable to node.js like I would do: $username = $_GET['username']; in PHP. I'm not too good at writing regular expressions so I wanted to see if anyone could set me on the right track. Thanks in advance.
From your requirement it doesn't seem like you need a regular expression. Just use a variable in your route, like below:
// Grabs whatever comes after /user/ and maps it to req.params.id
app.get('/user/:id', function (req, res) {
var userId = req.params.id;
res.send(userId);
});
If you want to have better control, you could use a regular expression. To grab the parts you are interested in, use a capture group (typically written as a pair of matching parentheses):
// Grabs the lowercase string coming after /user/ and maps it to req.params[0]
app.get(/^\/user\/([a-z]+)$/, function (req, res) {
var userId = req.params[0];
res.send(userId);
});
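To see what the capture group in that pattern picks up, here is the same anchored regex tried out in Python. This is only a sketch of the regex itself, not Express code, and match_user is a hypothetical helper name.

```python
import re

# Same pattern as the Express route above: a lowercase-only name after /user/.
USER_ROUTE = re.compile(r"^/user/([a-z]+)$")


def match_user(path):
    """Return the captured user name, or None when the path does not match."""
    match = USER_ROUTE.match(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(match_user("/user/john"))    # matches: lowercase letters only
    print(match_user("/user/John1"))   # no match: uppercase letter and digit
```

In Python's re module group(0) is the whole match and group(1) is the first capture group, whereas the Express snippet above reads the first capture group from req.params[0].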
A little off topic, but here's a really good intro to express.js that will help you understand it better (including how the routes work):
http://evanhahn.com/understanding-express-js/
You're looking for req.params, which for a regex route holds the capture groups of the regex.
The capture groups are indexed from 0: req.params[0] is the first capture group, not the entire match.

Getting parts of a URL (Regex)

Given the URL (single line):
http://test.example.com/dir/subdir/file.html
How can I extract the following parts using regular expressions:
The Subdomain (test)
The Domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)
The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html
A single regex to parse and break up a full URL, including query parameters and anchors, e.g.
https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
RegEx positions:
url: RegExp['$&'],
protocol:RegExp.$2,
host:RegExp.$3,
path:RegExp.$4,
file:RegExp.$6,
query:RegExp.$7,
hash:RegExp.$8
you could then further parse the host ('.' delimited) quite easily.
What I would do is use something like this:
/*
^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
Then further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
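The subexpression matches listed above can be reproduced with a short Python sketch. The regex is the one quoted from the specification; the names URI_RE and split_uri are made up for this example.

```python
import re

# The regular expression quoted above, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI into its five generic components (None where a part is absent)."""
    match = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


if __name__ == "__main__":
    print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

Groups 1, 3, 6 and 8 include the delimiters (":", "//", "?", "#"), which is why the sketch reads the inner groups instead.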
For what it's worth, I found that I had to escape the forward slashes in JavaScript:
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:
var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
console.log(k+':', a[k]);
});
/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
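Outside the browser the same "let a parser do it" approach is available in most standard libraries. As one example, a sketch using Python's urllib.parse on the same URL (the function name describe is hypothetical):

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the main URL components using the standard-library parser."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,      # host plus port, like `host` in the browser
        "hostname": parts.hostname,
        "port": parts.port,          # an int, or None when no port is given
        "path": parts.path,
        "query": parts.query,        # without the leading "?"
        "fragment": parts.fragment,  # without the leading "#"
    }


if __name__ == "__main__":
    for key, value in describe("http://www.example.com:123/foo/bar.html?fox=trot#foo").items():
        print(key + ":", value)
```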
I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:
It can not handle port number.
The hash part is broken.
The following is a modified version:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$
Position of parts are as follows:
int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
Edit posted by anon user:
function getFileName(path) {
return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}
I needed a regular Expression to match all urls and made this one:
/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/
It matches all URLs, any protocol, even URLs like
ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag
The result (in JavaScript) looks like this:
["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]
A URL like
mailto://admin@www.cs.server.com
looks like this:
["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]
I was trying to solve this in javascript, which should be handled by:
var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');
since (in Chrome, at least) it parses to:
{
"hash": "#foobar/bing/bo@ng?bang",
"search": "?foo=bar&bingobang=&king=kong@kong.com",
"pathname": "/path/wah@t/foo.js",
"port": "890",
"hostname": "example.com",
"host": "example.com:890",
"password": "b",
"username": "a",
"protocol": "http:",
"origin": "http://example.com:890",
"href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}
However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:
^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?
Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.
The parts are in this order:
var keys = [
"href", // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
"origin", // http://user:pass@host.com:81
"protocol", // http:
"username", // user
"password", // pass
"host", // host.com:81
"hostname", // host.com
"port", // 81
"pathname", // /directory/file.ext
"search", // ?query=1
"hash" // #anchor
];
There is also a small library which wraps it and provides query params:
https://github.com/sadams/lite-url (also available on bower)
If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.
I propose a much more readable solution (in Python, but it applies to any regex):
import re

def url_path_to_dict(path):
    # Each URL component gets its own named group on its own line.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None
    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()
Prints:
{
'host': 'example.example.com',
'user': None,
'path': '/example/example/example.html',
'query': None,
'password': None,
'port': None,
'schema': 'http'
}
subdomain and domain are difficult because the subdomain can have several parts, as can the top level domain, http://sub1.sub2.domain.co.uk/
the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)
the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
the path with the file : http://[^/]+/(.*)
the URL without the path : (http://[^/]+/)
(Markdown isn't very friendly to regexes)
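For the path-related parts asked about in the question, a parser plus one string split is an alternative to the regexes above. The Python sketch below is illustrative only; path_parts and its dictionary keys are names chosen for this example.

```python
from urllib.parse import urlsplit


def path_parts(url):
    """Split a URL into base (no path), directory, file, and full path."""
    parts = urlsplit(url)
    # Everything up to the last "/" is the directory; the remainder is the file.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "base": parts.scheme + "://" + parts.netloc,
        "directory": directory + "/",
        "file": filename,
        "path": parts.path,
    }


if __name__ == "__main__":
    print(path_parts("http://test.example.com/dir/subdir/file.html"))
```

A URL whose path ends in "/" yields an empty file name, which is usually what is wanted for a directory URL.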
This improved version should work as reliably as a parser.
// Applies to URI, not just URL or URN:
// http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
//
// http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
//
// (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
//
// http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
//
// $# matches the entire uri
// $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
// $2 matches authority (host, user:pwd@host, etc)
// $3 matches path
// $4 matches query (http GET REST api, etc)
// $5 matches fragment (html anchor, etc)
//
// Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
// Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
//
// (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
//
// Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
{
if( !schemes )
schemes = '[^\\s:\/?#]+'
else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
throw TypeError( 'expected URI schemes' )
return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
}
// http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
function uriSchemesRegExp()
{
return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
}
const URI_RE = /^(([^:\/\s]+):\/?\/?([^\/\s@]*@)?([^\/@:]*)?:?(\d+)?)?(\/[^?]*)?(\?([^#]*))?(#[\s\S]*)?$/;
/**
* GROUP 1 ([scheme][authority][host][port])
* GROUP 2 (scheme)
* GROUP 3 (authority)
* GROUP 4 (host)
* GROUP 5 (port)
* GROUP 6 (path)
* GROUP 7 (?query)
* GROUP 8 (query)
* GROUP 9 (fragment)
*/
URI_RE.exec("https://john:doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("/forum/questions/?tag=networking&order=newest#top");
URI_RE.exec("ldap://[2001:db8::7]/c=GB?objectClass?one");
URI_RE.exec("mailto:John.Doe@example.com");
Above is a JavaScript implementation with a modified regex.
Try the following:
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
It supports HTTP / FTP, subdomains, folders, files etc.
I found it from a quick Google search.
/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/
From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).
You can get the scheme (http/https), host, port, path, and query by using the Uri object in .NET.
The difficult part is breaking the host into subdomain, domain name, and TLD.
There is no standard way to do this, and simple string parsing or a regex cannot produce the correct result in every case. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs: once the TLD of a URL is identified, the part to its left is the domain and whatever remains is the subdomain.
However, the list needs to be maintained, since new TLDs can be added. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and get the subdomain, domain, and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.
This answer is also helpful:
Get the subdomain from a URL
CaLLMeLaNN
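The suffix-list approach described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: SAMPLE_SUFFIXES is a tiny hand-written stand-in for the real list published at publicsuffix.org, split_host is a hypothetical helper, and the returned tuple is this sketch's own convention rather than the DomainName object's.

```python
# Tiny stand-in for the public suffix list; a real program would load the full list.
SAMPLE_SUFFIXES = {"com", "net", "org", "uk", "co.uk"}


def split_host(host, suffixes=SAMPLE_SUFFIXES):
    """Return (subdomain, domain, tld) for a host name, or None if no suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return ("", "", candidate)  # the host is itself a bare suffix
            domain = labels[i - 1] + "." + candidate
            subdomain = ".".join(labels[: i - 1])
            return (subdomain, domain, candidate)
    return None


if __name__ == "__main__":
    print(split_host("sub1.sub2.domain.co.uk"))
    print(split_host("test.example.com"))
```

The real list also contains wildcard and exception rules, which this sketch does not handle.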
Here is one that is complete, and doesn't rely on any protocol.
function getServerURL(url) {
var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
console.log(m[1]) // Remove this
return m[1];
}
getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")
Prints
http://dev.test.se
http://dev.test.se
//ajax.googleapis.com
//
www.dev.test.se
www.dev.test.se
www.dev.test.se
www.dev.test.se
//dev.test.se
http://www.dev.test.se
http://localhost:8080
https://localhost:8080
None of the above worked for me. Here's what I ended up using:
/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/
I like the regex that was published in "Javascript: The Good Parts".
It's not too short and not too complex.
This page on GitHub also has the JavaScript code that uses it.
But it can be adapted for any language.
https://gist.github.com/voodooGQ/4057330
Java offers a URL class that will do this. Query URL Objects.
On a side note, PHP offers parse_url().
I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.
http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx
I tried a few of these and they didn't cover my needs, especially the highest-voted one, which didn't match a URL without a path (http://example.com/).
The lack of group names also made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).
So this is my version, slightly modified from the highest-voted version here:
^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$
I built this one. It is very permissive: it is not meant to validate a URL, just to split it.
^((http[s]?):\/\/)?([a-zA-Z0-9-.]*)?([\/]?[^?#\n]*)?([?]?[^?#\n]*)?([#]?[^?#\n]*)$
match 1 : full protocol with :// (http or https)
match 2 : protocol without ://
match 3 : host
match 4 : slug
match 5 : param
match 6 : anchor
Works with:
http://
https://
www.demo.com
/slug
?foo=bar
#anchor
https://demo.com
https://demo.com/
https://demo.com/slug
https://demo.com/slug/foo
https://demo.com/?foo=bar
https://demo.com/?foo=bar#anchor
https://demo.com/?foo=bar&bar=foo#anchor
https://www.greate-demo.com/
Fails on:
#anchor#
?toto?
I needed some REGEX to parse the components of a URL in Java.
This is what I'm using:
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
Java Code Snippet:
final Pattern pattern = Pattern.compile(
"^(?:(http[s]?|ftp):/)?/?" + // METHOD
"([^:^/^?^#\\s]+)" + // HOSTNAME
"(?::(\\d+))?" + // PORT
"([^?^#.*]+)?" + // PATH
"(\\?[^#.]*)?" + // QUERY
"(#[\\w\\-]+)?$" // ID
);
final Matcher matcher = pattern.matcher(url);
System.out.println(" URL: " + url);
if (matcher.matches())
{
System.out.println(" Method: " + matcher.group(1));
System.out.println("Hostname: " + matcher.group(2));
System.out.println(" Port: " + matcher.group(3));
System.out.println(" Path: " + matcher.group(4));
System.out.println(" Query: " + matcher.group(5));
System.out.println(" ID: " + matcher.group(6));
return matcher.group(2);
}
System.out.println();
System.out.println();
Using http://www.fileformat.info/tool/regex.htm hometoast's regex works great.
But here is the deal, I want to use different regex patterns in different situations in my program.
For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.
Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).
That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)
I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?
If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:
(?:SOMESTUFF)
You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.
Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:
https?
would match 'http' or 'https' just fine.
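The effect of the non-capturing modifier is easy to see by comparing two otherwise identical patterns. The Python sketch below uses simplified patterns written for this example (they are not hometoast's full expression), applied to the URL from the question.

```python
import re

# Capturing version: the scheme becomes group 1 and the host group 2.
CAPTURING = re.compile(r"^(https?|ftp)://([^/:]+)")

# Non-capturing version: the scheme must still match, but only the host is captured.
NON_CAPTURING = re.compile(r"^(?:https?|ftp)://([^/:]+)")

url = "http://test.example.com/dir/subdir/file.html"

captured = CAPTURING.match(url).groups()
host_only = NON_CAPTURING.match(url).groups()

print(captured)   # two groups: scheme and host
print(host_only)  # one group: host
```

Both patterns accept and reject exactly the same URLs; only the numbering of the groups changes.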
A regexp to get the URL path without the file (written with %r{...} so the slashes do not need escaping):
url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s
It can be useful for adding a relative path to this url.
The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:
^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$
The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:
$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"
When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:
^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}};|[^#])+))?
(?:#(?P<fragment>.*))?$
In JavaScript, of course, you can't use named backreferences, so the regex becomes
^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$
and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.
//USING REGEX
/**
* Parse URL to get information
*
* @param url the URL string to parse
* @return parsed the URL parsed or null
*/
var UrlParser = function (url) {
"use strict";
var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
matches = regx.exec(url),
parser = null;
if (null !== matches) {
parser = {
href : matches[0],
withoutHash : matches[1],
url : matches[2],
origin : matches[3],
protocol : matches[4],
protocolseparator : matches[5],
credhost : matches[6],
cred : matches[7],
user : matches[8],
pass : matches[9],
host : matches[10],
hostname : matches[11],
port : matches[12],
pathname : matches[13],
segment1 : matches[14],
segment2 : matches[15],
search : matches[16],
hash : matches[17]
};
}
return parser;
};
var parsedURL=UrlParser(url);
console.log(parsedURL);
I tried this regex for parsing url partitions:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$
URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2
Matches:
Group 1. 0-7 https:/
Group 2. 0-5 https
Group 3. 8-22 www.google.com
Group 6. 22-50 /my/path/sample/asd-dsa/this
Group 7. 22-46 /my/path/sample/asd-dsa/
Group 8. 46-50 this
Group 9. 50-74 ?key1=value1&key2=value2
Group 10. 51-74 key1=value1&key2=value2
The best answer suggested here didn't work for me because my URLs also contain a port.
However modifying it to the following regex worked for me:
^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
In browser and Node.js environments there is a built-in URL class, and the two appear to share the same interface. Check the documentation for your environment:
https://nodejs.org/api/url.html#urlhost
https://developer.mozilla.org/en-US/docs/Web/API/URL
This is how it may be used:
let url = new URL('https://test.example.com/cats?name=foofy')
url.protocol; // https:
url.hostname; // test.example.com
url.pathname; // /cats
url.search; // ?name=foofy
let params = url.searchParams
let name = params.get('name'); // returns a string (or null if absent), so parse accordingly
for more on parameters also see https://developer.mozilla.org/en-US/docs/Web/API/URL/searchParams
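For comparison, a sketch of the same query-parameter lookup with Python's standard library. The extra age parameter is invented for this example to show the "parse accordingly" step, and query_params is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the query parameters as a dict mapping each name to a list of strings."""
    return parse_qs(urlsplit(url).query)


params = query_params("https://test.example.com/cats?name=foofy&age=3")

name = params["name"][0]     # values are always strings...
age = int(params["age"][0])  # ...so convert explicitly when a number is needed

print(name, age)
```

parse_qs returns a list per name because a parameter may appear more than once in a query string.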
String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";
System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));
Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl
If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888
enjoy..
Yosi Lev