Regular expression to extract different parts of a URL and path

Consider URLs like
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
http://stackoverflow.com/v2/summary/saas?test=123
I need a regular expression to match these URLs and convert them into
stackoverflow.com:v1:summary:1243PQ:details:P1:9981
stackoverflow.com:v2:summary:saas
I need to build a single rule using regex where I can extract paths using $1, $2, etc., without any JavaScript logic, because I need to use it in a classification rule builder tool.
I tried this URL contains ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? and extracted $4:$5 which returns stackoverflow.com:v1/summary/1243PQ/details/P1/9981
But, this is incorrect. Can anyone help me with the correct regex for this?

You may try this:
Regex
/(?:https?:\/\/([^\/?\s#]+))?\/([^\/?\s#]*)(?:[\?#].*)?/g
Substitution
$1:$2
(?: non-capturing group
https?:\/\/ "http://" or "https://"
([^\/?\s#]+) capture the domain and put it in group 1
)? make this capture optional
\/ "/"
([^\/?\s#]*) one segment of the url path, capture it in group 2
(?:[\?#].*)? an optional non-capturing group for consuming query string or # anchor at the end
Check the test cases
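For readers who want to check the pattern outside a rule builder, here is a minimal Python sketch of the same global substitution. It assumes Python 3.5 or later, where a group that did not take part in a match is substituted as an empty string. The name to_colon_form is only illustrative. Python writes the substitution as \1:\2 instead of $1:$2 and replaces every match by default, which plays the role of the g flag.

```python
import re

# Same pattern as above, without the JavaScript delimiters and slash escapes.
PATTERN = re.compile(r"(?:https?://([^/?\s#]+))?/([^/?\s#]*)(?:[?#].*)?")

def to_colon_form(url):
    # Each match is either "scheme://domain/segment" or "/segment".
    # Group 1 is empty for every match after the first one.
    return PATTERN.sub(r"\1:\2", url)

print(to_colon_form("https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"))
print(to_colon_form("http://stackoverflow.com/v2/summary/saas?test=123"))
```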
Update
If you can't use the g flag for substitution, there is no better way than to brute-force all the combinations:
For each additional segment of the URL path, you need to add a \/([^\/?#\s]+) to the pattern and a :$2, :$3, etc. to the substitution:
https://stackoverflow.com
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/?(?:[#?].*)?$
$1
https://stackoverflow.com/path1
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2
https://stackoverflow.com/path1/path2
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3
https://stackoverflow.com/path1/path2/path3
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4
https://stackoverflow.com/path1/path2/path3/path4
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5
https://stackoverflow.com/path1/path2/path3/path4/path5
^https?:\/\/(?:www\.)?([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/([^\/?#\s]+)\/?(?:[#?].*)?$
$1:$2:$3:$4:$5:$6
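Since these rules differ only in the number of repeated segment groups, they can be generated rather than typed by hand. The following Python sketch is one way to do that, offered as an illustration only: build_rule and convert are hypothetical helper names, and max_segments is an assumed upper bound on path depth. The generated pattern and substitution strings can then be pasted into the rule builder.

```python
import re

# One capture group per path segment, exactly as in the rules above.
SEGMENT = r"\/([^\/?#\s]+)"

def build_rule(n_segments):
    """Return (pattern, substitution) for a URL with n_segments path segments."""
    pattern = (r"^https?:\/\/(?:www\.)?([^\/?#\s]+)"
               + SEGMENT * n_segments
               + r"\/?(?:[#?].*)?$")
    substitution = ":".join(f"${i}" for i in range(1, n_segments + 2))
    return pattern, substitution

def convert(url, max_segments=8):
    """Try each rule in turn, the way a list of separate rules would be tried."""
    for n in range(max_segments + 1):
        pattern, _ = build_rule(n)
        m = re.match(pattern, url)
        if m:
            return ":".join(m.groups())
    return None

print(build_rule(2))
print(convert("http://stackoverflow.com/v2/summary/saas?test=123"))
```

At most one rule can match a given URL, because a segment group can never contain a slash.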

Related

Regex to extract a part of an URL

I'm trying to extract part of a URL, ignoring the http(s)://www. prefix.
These URLs come from a form that users fill in, so multiple formats and errors are expected. Here's a sample:
http://www.akashicbooks.com
https://deliciouselsalvador.com
http://altaonline.com
http://https://www.amtb-la.org/
http://https://www.amovacations.com/
http://dornsife.usc.edu/jep
I've tried in Google Sheets and Airtable using the REGEXEXTRACT formula:
=REGEXEXTRACT({URL},"[^/]+$")
Unfortunately, I can't make it work for all the cases.
Any ideas on how to make it work?
You can use
^(?:https?://(?:www\.)?)*(.*)
See the regex demo. Details:
^ - start of string
(?:https?://(?:www\.)?)* - zero or more occurrences of
https?:// - http:// or https://
(?:www\.)? - an optional sequence of www.
(.*) - Group 1: the rest of the string.
With REGEXEXTRACT, the output value is the text captured with Group 1.
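The same pattern can be tried outside a spreadsheet. This is a small Python sketch, assuming the sample URLs from the question; strip_prefix is an illustrative name, and taking group 1 mirrors what REGEXEXTRACT returns.

```python
import re

# Zero or more "http(s)://" prefixes, each optionally followed by "www.",
# then everything else in group 1.
PREFIX_RE = re.compile(r"^(?:https?://(?:www\.)?)*(.*)")

def strip_prefix(url):
    # The pattern can match the empty string, so match() always succeeds.
    return PREFIX_RE.match(url).group(1)

for sample in ["http://www.akashicbooks.com",
               "http://https://www.amtb-la.org/",
               "http://dornsife.usc.edu/jep"]:
    print(strip_prefix(sample))
```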

Simple regex to replace first part of URL

Given
http://localhost:3000/something
http://www.domainname.com/something
https://domainname.com/something
How do I select whatever is before the /something and replace it with staticpages?
The input URL is the result of a request.referer, but since you can't render request.referer (and I don't want a redirect_to), I'm trying to manually construct the appropriate template using controller/action where action is always the route, and I just need to replace the domain with the controller staticpages.
You could use a regex like this:
(https?://)(.*?)(/.*)
Working demo
As you can see in the Substitution section, you can use capturing groups and concatenate the strings you want to generate the needed URLs.
The idea of the regex is to capture the parts before and after the domain and use \1 + staticpages + \3.
If you want to change the protocol to ftp, you could play with capturing group index and use this replacement string:
ftp://\2\3
So, you would have:
ftp://localhost:3000/something
ftp://www.domainname.com/something
ftp://domainname.com/something
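Both replacements can be sketched in Python as follows. This is an illustration, not the original demo: the helper names are made up, and the replacement strings use Python's \1-style backreferences.

```python
import re

# Group 1: protocol, group 2: host (lazy, stops at the first "/"), group 3: path.
URL_RE = re.compile(r"(https?://)(.*?)(/.*)")

def to_staticpages(url):
    # Keep protocol and path, swap the host for "staticpages".
    return URL_RE.sub(r"\1staticpages\3", url)

def to_ftp(url):
    # Drop group 1 and keep host and path.
    return URL_RE.sub(r"ftp://\2\3", url)

print(to_staticpages("http://localhost:3000/something"))
print(to_ftp("https://domainname.com/something"))
```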

How to capture optional group and replace if not matched?

I have this situation where the user might enter a URL with or without http(s)://. I would like to keep it if it's there and otherwise add http:// myself. I have the below regex pattern:
Regex: \[url\](?:https?\:\/\/)?(.*?)\[\/url\] Replacement: $1
which makes this
[url]http://blog.sanspace.in[/url]
[url]https://blog.sanspace.in[/url]
[url]blog.sanspace.in[/url]
[url]blog.sanspace.in/scraperwiki[/url]
[url]www.sanspace.in[/url]
into this
http://blog.sanspace.in
http://blog.sanspace.in
http://blog.sanspace.in
http://blog.sanspace.in/scraperwiki
http://www.sanspace.in
Now, what I would like is to make it like this. (use http(s) if available. otherwise, http)
http://blog.sanspace.in
https://blog.sanspace.in
http://blog.sanspace.in
http://blog.sanspace.in/scraperwiki
http://www.sanspace.in
I tried adding the http(s) as a group.
Regex: \[url\](https?\:\/\/)?(.*?)\[\/url\] Replacement: $1$2
But, in that case the replacement order of $1 and $2 is different. If the user added http it becomes $1; otherwise, the URL becomes $1.
http://blog.sanspace.in
https://blog.sanspace.in
blog.sanspace.in
blog.sanspace.in/scraperwiki
www.sanspace.in
Note the last 3 URLs. Here I have to add http but only if I know there wasn't a user added http. I'm not sure how to achieve my goal.
I'm testing this problem here. http://regexr.com?3711a
Try following regex:
Match : \[url\](?:http(s)?\:\/\/)?(.*?)\[\/url\]
Replace : http$1://$2
regexr demo
Since you are willing to insert http:// if it's not present in the original string, the idea here is to not capture it even when it is present. Instead, capture only the optional s, indicating secure HTTP, into $1.
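A quick way to see this behave on all five inputs is the Python sketch below. It assumes Python 3.5 or later, where an unmatched optional group is replaced with an empty string; normalize is an illustrative name.

```python
import re

# Only the optional "s" is captured (group 1); the URL body is group 2.
URL_TAG_RE = re.compile(r"\[url\](?:http(s)?://)?(.*?)\[/url\]")

def normalize(text):
    # "http" is always written; group 1 adds the "s" only if it was present.
    return URL_TAG_RE.sub(r"http\1://\2", text)

for sample in ["[url]http://blog.sanspace.in[/url]",
               "[url]https://blog.sanspace.in[/url]",
               "[url]blog.sanspace.in/scraperwiki[/url]"]:
    print(normalize(sample))
```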

Regex to match anything after /

I know next to nothing about regex, but I need a regex that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone, and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have URLs of the form (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets, which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So, say, I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
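The same example, written as a short Python check (a sketch only; other regex flavours expose groups under different names):

```python
import re

m = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")

# group(0) is the whole match, group(1) is the first parenthesised group.
print(m.group(0))
print(m.group(1))
```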
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last / you can take another approach:
possible other solution: /([^/]*)
The pattern above will find a / with a string of characters that are NOT another / and keep it in group 1. Issue here is that you could match things that do not have sweets in the URL.
Note: if you do not mind the / at the beginning, just remove the ( and ) and you do not have to worry about groups.
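Here is a rough Python sketch of both patterns applied to a URL from the question. One assumption goes beyond the text above: a $ anchor is added to the second pattern, because an unanchored search stops at the first / it finds (here the one right after http:), not the last.

```python
import re

url = "http://localhost/sweettemptations/sweets/sweet-name"

# Group 1 holds whatever follows "/sweets/".
after_sweets = re.search(r"/sweets/(.*)", url).group(1)

# Assumption: "$" is added so the match is tied to the last "/".
after_last_slash = re.search(r"/([^/]*)$", url).group(1)

print(after_sweets)
print(after_last_slash)
```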
I like to use http://regexpal.com/ to test my regex. It marks the different matches in different colors.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would make sure the pattern does not hard-code the "localhost" part. Also, I am assuming that (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match an optional http:// part followed by a host (be it localhost, an IP, or a host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
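As a sketch of this match-then-replace idea in Python (the actual redirect would normally live in the web server or WordPress configuration, which is not shown here), with redirect_target as a hypothetical helper and the target URL taken from the question:

```python
import re

SWEETS_RE = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")
# Target taken from the question; the host is an assumption for this sketch.
TARGET = "http://localhost/sweettemptations/available-sweets"

def redirect_target(url):
    # No groups are used: a matching URL is replaced wholesale.
    return TARGET if SWEETS_RE.fullmatch(url) else url

print(redirect_target("http://localhost/sweettemptations/sweets/somethingmore.html"))
print(redirect_target("http://localhost/sweettemptations/available-sweets"))
```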
Use this regular expression: (?<=://).+

Regular Expression break down URL into parts

I've just recently started learning regex, so I'm not sure yet about a couple of aspects of the whole thing.
Right now my web page reads in the URL, breaks it up into parts, and only uses certain parts for processing:
E.g. 1) http://mycontoso.com/products/luggage/selloBag
E.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx
For some reason Sitefinity is giving us both possibilities, which is fine, but what I need from this is only the actual product details as in "luggage/selloBag"
My current Regex expression is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)", I combine this with a replace statement and extract the contents of group 4 (or $4), which is fine, but it doesn't work for example 2.
So the question is: Is it possible to match 2 possibilities with regular expressions where a part of a string might or might not be there and then still reference a group whose value you actually want to use?
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
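To see which group holds which component, the Appendix B regex can be run as in the Python sketch below. The split_uri helper and the dictionary keys are illustrative names; the group numbers are the ones listed above.

```python
import re

RE_3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    # Every part of the pattern is optional, so match() always succeeds.
    m = RE_3986.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```

Components that are absent from the input come back as None, while the path is always a string (possibly empty).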
Here is an enhanced (and commented) regex (in Python syntax) which utilizes named capture groups:
import re
re_3986_enhanced = re.compile(r"""
# Parse and capture RFC-3986 Generic URI components.
^ # anchor to beginning of string
(?: (?P<scheme> [^:/?#\s]+): )? # capture optional scheme
(?://(?P<authority> [^/?#\s]*) )? # capture optional authority
(?P<path> [^?#\s]*) # capture required path
(?:\?(?P<query> [^#\s]*) )? # capture optional query
(?:\#(?P<fragment> [^\s]*) )? # capture optional fragment
$ # anchor to end of string
""", re.MULTILINE | re.VERBOSE)
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation
Depends on your regex implementation, but most support a syntax like
(\.sf404\.aspx|)
Assuming that's your group 4 (i.e. zero-indexed groups). The | lists two alternatives, one of which is the empty string.
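Here is a hedged Python sketch of the optional-suffix idea applied to the two example URLs. Several details are assumptions that are not in the question: products stands in for map-search so that the pattern fits the examples, the product group is made lazy, and a $ anchor is added. Without the last two changes, a greedy (.*) would swallow the .sf404.aspx suffix and the empty alternative would match.

```python
import re

# Group 4 is the product path; group 5 is the suffix or the empty string.
PRODUCT_RE = re.compile(r"(.*)(products)(/)(.*?)(\.sf404\.aspx|)$")

def product_details(url):
    m = PRODUCT_RE.match(url)
    return m.group(4) if m else None

print(product_details("http://mycontoso.com/products/luggage/selloBag"))
print(product_details("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```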
You don't say if you're doing this in javascript, but if you are, the parseUri lib written by Steven Levithan does a pretty damn good job at parsing urls. You can get it from various places, including here (click on the "Source Code" tab) and here.