Regex to match only URLs that contain a certain path - regex

I am using the following regex
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
and it matches any URL, but I want to match only URLs that contain
/video/hd/
The following correction of the Regex above did not deal correctly with slashes
((?:https\:\/\/)|(?:http\:\/\/)|(?:www\.))?([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(?:\??)[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]+)

You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (( )) in the regex by non-capturing groups ((?: )). A few of the groups are redundant, and http|https can be simplified to https?. Together this gives us
(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
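_ is not allowed in hostnames, and the explicit _ in the character class is redundant anyway because \w already includes it, so we can drop it (a strictly underscore-free class would be [A-Za-z0-9-], since \w still matches _):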
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Technically - cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both \w and - are in both character classes). We can fix this by requiring a separator of either / or ? after the hostname:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Now we can start looking at your additional requirement: The URL should contain /video/hd/. Presumably this string should appear somewhere in the path. We can encode this as follows:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,^=%&:/~+-]*/)?video/hd/(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?
Instead of matching an optional separator of / or ?, we now always require a / after the hostname. This / must be followed by either video/hd/ directly or 0 or more path characters and another /, which is then followed by video/hd/. (The set of path characters does not include ? (which would start the query string) or # (which would start the anchor).)
As before, after /video/hd/ there can be a final part of more path components, a query string, and an anchor (all optional).
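A quick way to sanity-check the final pattern is to run it over a few sample URLs. This is only a sketch in Python: the example.com URLs are made-up placeholders, and the pattern is written with # left out of the path class before video/hd/, as the explanation describes.

```python
import re

# Final pattern from the answer; '#' and '?' are excluded from the
# path class that precedes video/hd/ (see the explanation above).
VIDEO_HD_URL = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/"
    r"(?:[\w.,^=%&:/~+-]*/)?video/hd/"
    r"(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?"
)

# Hypothetical sample text; the URLs are placeholders.
text = (
    "see http://example.com/video/hd/clip.mp4 and "
    "https://example.com/a/b/video/hd/ but not "
    "http://example.com/video/sd/clip.mp4 or "
    "http://example.com/?next=/video/hd/x"
)

# No capturing groups, so findall returns the whole matches.
matches = VIDEO_HD_URL.findall(text)
print(matches)
```

The last URL is skipped because /video/hd/ only shows up in its query string, not in the path.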

First of all, you need a regex to match URLs (be they http, https...)
(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))
Once you have that, you need to select them but not "consume" them. You can do this with a lookahead, i.e. a regex that asserts that what follows the current position is e.g. foo:
(?=foo)
Of course you will replace foo with the first regex I wrote.
At this point, you know you selected a URL; now you just constrain your search to URLs that contain /video/hd/:
.*\/video\/hd\/.*
So the complete regex is
(?=(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))).*\/video\/hd\/.*
You can test it here with a live demo.
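The lookahead-then-constrain idea can be sketched in Python as follows. Note the assumptions: https?://\S+ is a deliberately simplified stand-in for the long URL pattern, and the candidate strings are made up for illustration.

```python
import re

# Simplified stand-in for the long URL pattern (an assumption, not the
# pattern from the answer).
URL = r"https?://\S+"

# (?=URL) asserts that a URL starts here without consuming it;
# .*/video/hd/.* then requires /video/hd/ somewhere after that point.
pattern = re.compile(rf"(?={URL}).*/video/hd/.*")

# Hypothetical candidates, tested one string at a time.
candidates = [
    "http://example.com/video/hd/1",
    "http://example.com/video/sd/1",
    "/video/hd/ without a scheme",
]

# match() anchors at position 0, so the lookahead must succeed there.
kept = [c for c in candidates if pattern.match(c)]
print(kept)
```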

Related

Regex to pull last 2 segments from FQDN

Working on trying to figure out some regex to pull out the last 2 segments of an FQDN.
^.*\shostname=[\w-]+\.(?P<myfield>[^\t]+)
This RegEx works and takes out the first segment of an FQDN.
www.aaa.bbb.someurl.net --> aaa.bbb.someurl.net
But… I only want to keep the last 2 segments of any FQDN.
I need it to be --> someurl.net
Other restrictions:
The hostname field will always be at least 3 segments - don't know the max.
This is for Splunk so I can't use a script. I need it to be PCRE compatible regex.
Here is an example of data:
2021-07-20 18:19:14 reason=Not allowed to browse this category event_id=12345 protocol=HTTP action=Blocked transactionsize=16051 responsesize=789 requestsize=456 urlcategory=Blocked serverip=1.2.4.5 clienttranstime=0 requestmethod=GET refererURL=None useragent=Microsoft-Delivery location=Internal ClientIP=5.6.7.8 status=403 user=John url=dl.delivery.mp.microsoft.com/filestreamingservice/files/abcd-efgh-ijkl/pieceshash vendor=Zscaler hostname=dl.delivery.mp.microsoft.com
From this data I need the field “myfield” to be: microsoft.com.
The original answer with a much simpler regex ((?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)) that worked for OP is in the question history.
You can use
(?:\s|^)hostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
Or, to match the last hostname=... value on a line:
^.*\shostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))
See the regex #1 demo and regex #2 demo. Details:
(?:\s|^) - either a whitespace or start of string
hostname= - a literal substring
(?:[^\s.]+\.)*? - zero or more (but as few as possible) occurrences of one or more chars other than whitespace and dot and then a dot
(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S)) - Group "myfield": one or more chars other than whitespace and dot, then a dot, then any second-level domain or any one or more chars other than whitespace and dot and then either a whitespace or end of string.
Note: the \.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk pattern part (built from a regex trie) matches this list:
.ac.uk
.co.uk
.gov.uk
.judiciary.uk
.ltd.uk
.me.uk
.mod.uk
.net.uk
.nhs.uk
.nic.uk
.org.uk
.parliament.uk
.plc.uk
.police.uk
.royal.uk
.sch.uk
.govt.uk
.orgn.uk
.lea.uk
.mil.uk
If you want to add more second-level domain names, add more to the list and use https://www.myregextester.com or a similar service to build the word list regex.
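Outside Splunk, the first pattern can be tried in Python, whose re module accepts the same (?P<myfield>...) syntax. A minimal sketch; the function name extract_domain and the example.co.uk hostname are made up for illustration.

```python
import re

# Regex #1 from the answer, split across lines for readability only.
HOSTNAME_DOMAIN = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
    r"(?P<myfield>[^\s.]+\."
    r"(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk"
    r"|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk"
    r"|[^\s.]+)(?!\S))"
)


def extract_domain(line):
    """Return the 'myfield' capture, or None when hostname= is absent."""
    m = HOSTNAME_DOMAIN.search(line)
    return m.group("myfield") if m else None


print(extract_domain("vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"))
print(extract_domain("hostname=www.example.co.uk status=403"))
```

The lazy (?:[^\s.]+\.)*? is what lets the .co.uk alternative win for the second line: the engine skips only as many leading segments as it has to.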
You could match all following non-whitespace chars after hostname= and then use a capture group to capture the last part with a single dot.
^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)
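^.*\shostname= Match from the start of the line up to the last hostname= that is preceded by a whitespace char
(?:\S+\.)? Optionally match one or more non-whitespace chars followed by a dot (the leading segments)
( Capture group 1
[^\s.]+\.[^\s.]+ Match 2 parts without dots or whitespace, with a . in between
) Close group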
Regex demo
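For a quick check outside Splunk, here is a sketch of the same pattern in Python. The log line is an abbreviated version of the sample from the question.

```python
import re

# Greedy ^.* finds the last " hostname=" on the line; group 1 then keeps
# the final two dot-separated parts.
LAST_TWO = re.compile(r"^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)")

# Abbreviated sample line from the question.
line = (
    "2021-07-20 18:19:14 user=John "
    "url=dl.delivery.mp.microsoft.com/filestreamingservice "
    "vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"
)

m = LAST_TWO.search(line)
print(m.group(1))
```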
If you would like to account for country codes, I've previously answered this at: Get Domain Extension From Hostname
The regular expression would look something like (shortened version): \w+((\.[a-z]{2,3})(\.(uk|au))?)$
The full expression with all country codes: \w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$
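The shortened version can be tried like this. It is a sketch in Python; the helper name last_labels and the example.co.uk hostname are made up for illustration.

```python
import re

# Shortened pattern from the answer: a name, a 2-3 letter label, and an
# optional .uk/.au suffix, anchored at the end of the hostname.
DOMAIN = re.compile(r"\w+((\.[a-z]{2,3})(\.(uk|au))?)$")


def last_labels(host):
    """Return the matched trailing part of the hostname, or None."""
    m = DOMAIN.search(host)
    return m.group(0) if m else None


print(last_labels("dl.delivery.mp.microsoft.com"))
print(last_labels("www.example.co.uk"))
```

Keep in mind that \w does not match -, so a hyphenated name would be cut at the hyphen with this pattern.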

Data Studio Regex (Google RE2) to Extract Subdirectory from Path

I'm working with a Google Data Studio field that has a page URL Path contained within it. Examples:
/
/sample-url
/sample-url-2/
/#sample-url-5/
/sample-url-3/sample-url-4
/sample-url-3/sample-url-6
In each one, I want to capture, in a custom formula/field, the portion from the first slash up to but excluding the second slash if there is one, or just the first slash if that's the whole path. (In essence, the first subdirectory.) I would be open to including the second slash when there is one if that would make the solution simpler, but I'm guessing it's more complicated that way. I tried the following:
REGEXP_EXTRACT(Field, "^/[^/]+/$")
But it didn't work; everything returned null. What is wrong with that string?
The ^/[^/]+/$ pattern matches a string that starts with a / char, then contains one or more chars other than / and then ends with a / char. So, you can only match strings like /abc/, /123abc/, /abc-1 2 3.?!/, etc.
You can use
REGEXP_EXTRACT(Field, "^(/[^/]*)")
See the regex demo.
NOTE: REGEXP_EXTRACT requires a capturing group in the pattern, the content captured is the return value.
Here, ^ matches the start of string and (/[^/]*) is a capturing group with ID 1 that matches a / char and then any zero or more chars other than / (with [^/]*).
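To see what the capture group returns for the paths in the question, here is a sketch using Python's re module as a stand-in for REGEXP_EXTRACT. That substitution is an assumption: Data Studio uses RE2, though this particular pattern uses no engine-specific features.

```python
import re

FIRST_SEGMENT = re.compile(r"^(/[^/]*)")

paths = [
    "/",
    "/sample-url",
    "/sample-url-2/",
    "/#sample-url-5/",
    "/sample-url-3/sample-url-4",
]

# group(1) plays the role of REGEXP_EXTRACT's return value.
extracted = [FIRST_SEGMENT.search(p).group(1) for p in paths]
print(extracted)

# The original attempt only matches paths of the exact form /segment/.
original_attempt = re.search(r"^/[^/]+/$", "/sample-url")
print(original_attempt)
```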

URL rewrite using PCRE expression - append prefix to all incoming URIs except one pattern

I am using the match expression https://([^/]*)/(.*) and the replace expression constantprefix/$2, trying to rewrite incoming URLs by adding '/constantprefix' to all of them.
For the URLs below it works as expected:
https://hostname/incomingURI is converted to
/constantprefix/incomingURI
https://hostname/ is converted to /constantprefix/
https://hostname/login/index.aspx is converted to
/constantprefix/login/index.aspx
I have a problem with URLs that already start with /constantprefix: the output URL contains /constantprefix/constantprefix, which I don't want. Is there any way to avoid that?
If the incoming URL is https://hostname/constantprefix/login/index.aspx, the output URL becomes https://hostname/constantprefix/constantprefix/login/index.aspx
How can I avoid /constantprefix/constantprefix using the match expression?
You can do it with:
https://[^/]*/(?!constantprefix(?:/|$))(.*)
using the replacement string:
constantprefix/$1
(?!...) is a negative lookahead and means "not followed by". It is only a test and doesn't consume characters (pattern elements of this kind, like lookbehinds and the anchors ^ and $, are also called "zero-width assertions").
The first capture group in your pattern was useless, I removed it.
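The behaviour can be checked with a small Python sketch. Two assumptions: Python writes the replacement as \1 rather than $1, and the output below is the raw substitution result (the leading / in the question's outputs is presumably added by the rewrite tool itself). The function name rewrite is made up.

```python
import re

# Pattern from the answer: skip URLs whose path already starts with the
# constantprefix segment.
REWRITE = re.compile(r"https://[^/]*/(?!constantprefix(?:/|$))(.*)")


def rewrite(url):
    """Apply the substitution; non-matching URLs come back unchanged."""
    return REWRITE.sub(r"constantprefix/\1", url)


print(rewrite("https://hostname/incomingURI"))
print(rewrite("https://hostname/constantprefix/login/index.aspx"))
```

The (?:/|$) part matters: a path such as /constantprefixed/page is a different segment, so it still gets the prefix.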

Regex to match all URLs, excluding .css, .js resources

I'm looking for a regular expression to exclude the URLs from an extension I don't like.
For example resources ending with: .css, .js, .font, .png, .jpg etc. should be excluded.
However, I can put all resources to the same folder and try to exclude URLs to this folder, like:
.*\/(?!content\/media)\/.*
But that doesn't work! How can I improve this regex to match my criteria?
e.g.
Match:
http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other
No match:
http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843
The correct solution is:
^((?!\/content\/media\/).)*$
see: https://regex101.com/r/bD0iD9/4
Inspired by Regular expression to match a line that doesn't contain a word?
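A short Python sketch of that pattern against the two URLs from the question (the slashes are left unescaped, which Python allows):

```python
import re

# Each character is consumed only if /content/media/ does not start at
# that position, so the whole string can match only when it is absent.
NO_MEDIA = re.compile(r"^((?!/content/media/).)*$")

wanted = "http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other"
unwanted = "http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843"

print(bool(NO_MEDIA.match(wanted)))
print(bool(NO_MEDIA.match(unwanted)))
```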
Two things:
First, the ?! negative lookahead doesn't remove any characters from the input. Add [^\/]+ before the trailing slash. Right now it is trying to match two consecutive slashes. For example:
.*\/(?!content\/media)[^\/]+\/.*
(edit) Second, the .*s at the beginning and end match too much. Try tightening those up, or adding more detail to content\/media. As it stands, content/media can be swallowed by one of the .*s and never be checked against the lookahead.
Suggestions:
Use your original idea - test against the extensions: ^.*\.(?!css|js|font|png|jpeg)[a-z0-9]+$ (with case insensitive).
Instead of doing everything in one regular expression, use a regex that pulls any URL (e.g., https?:\/\/\S+, perhaps?) and then test each one you find with String.indexOf: if (candidateURL.indexOf('content/media') == -1) { /* do something with the OK URL */ }
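The two-step suggestion could look roughly like this in Python, where the in operator stands in for String.indexOf. The rough URL pattern https?://\S+ is an assumption, and the text is just the two URLs from the question joined by a space.

```python
import re

# Step 1: a rough pattern that pulls anything URL-like out of the text.
URL = re.compile(r"https?://\S+")

text = (
    "http://www.myapp.com/xyzOranotherContextRoot/rest/user/get/123?some=par#/other "
    "http://www.myapp.com/xyzOranotherContextRoot/content/media/css/main.css?7892843"
)

# Step 2: a plain substring test instead of a negative lookahead.
ok_urls = [u for u in URL.findall(text) if "content/media" not in u]
print(ok_urls)
```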

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/foo/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit. I only want the text after foo until the next forward slash. Some directories may also be named foo, and this would give incorrect results. Thanks
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
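The difference between the two quantifiers is easy to see in a small Python sketch, using the trailing-slash URL from above:

```python
import re

url = "http://blablalba.com/foo/"

# * accepts an empty segment after /foo/, + requires at least one char.
star = re.search(r"/foo/([^/]*)", url)
plus = re.search(r"/foo/([^/]+)", url)

print(repr(star.group(1)))
print(plus)

# On a longer URL both stop at the next slash.
longer = re.search(r"/foo/([^/]*)", "http://blablalba.com/foo/bar_soap/foo/dir2")
print(longer.group(1))
```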
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it up by /, and grab the segment you need. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a specified fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
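The parse-then-split idea can be sketched as follows. Note that this swaps in Python's urllib.parse.urlsplit for Ruby's URI::split, and that the helper name segment_after is made up for illustration.

```python
from urllib.parse import urlsplit


def segment_after(url, marker="foo"):
    """Return the path segment right after the first `marker` segment.

    Returns "" when `marker` is the last segment and None when it is absent.
    """
    # urlsplit separates query string and fragment from the path for us.
    segments = urlsplit(url).path.split("/")
    if marker not in segments:
        return None
    idx = segments.index(marker)
    return segments[idx + 1] if idx + 1 < len(segments) else ""


print(segment_after("http://blablalba.com/foo/bar_soap/foo/dir2"))
print(repr(segment_after("http://blablalba.com/foo#bar")))
```

Because the parser strips the query string and fragment before the split, neither ? nor # has to be handled in a character class.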
You can try this regular expression
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
the parentheses cause the regex engine to store the matched contents in a group ([^\/]+), so you can get bar_soap out of the entire match of /foo/bar_soap
For example, in javascript you would get the matched group as follows:
const regexp = /\/foo\/([^\/]+)/;
const match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap