Regex - Exclude URLs by keywords - regex

I am trying to use StackPath's EdgeRules, and their documentation is not very clear.
I need to match URLs in multiple directories but exclude any URLs that contain the extension m3u8 or the word segment. These are their docs: EdgeRules
This works to limit it to 2 directories.
/(https://example.com(/(pics|vids)/).*)/
But then this doesn't work.
/(https://example.com(/(pix|vids)/).+(?!m3u8|segment).*)/
I've been trying to use https://regex101.com/ but nothing I try seems to work. I don't even know what kind of regex they use. Hopefully I can get some help with this.

I can't test this, so apologies if something else is wrong...
The negative lookaheads need to be side by side, not combined in one group with alternation (|). I also added an end-of-line anchor ($) after .m3u8.
(https://example.com(/(pix|vids)/)(?!.*\.m3u8$)(?!.*segment.*).*)
See this example:
https://regex101.com/r/reVHWt/1
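One way to sanity-check the side-by-side lookaheads is to run them through another engine. The sketch below uses Python's re module, which is an assumption: EdgeRules may use a different flavor. The sample URLs are made up for illustration.

```python
import re

# Pattern from the answer above (no outer /.../ delimiters in Python).
FIXED = re.compile(
    r"(https://example.com(/(pix|vids)/)(?!.*\.m3u8$)(?!.*segment.*).*)"
)
# Pattern from the question, kept for comparison.
ORIGINAL = re.compile(r"(https://example.com(/(pix|vids)/).+(?!m3u8|segment).*)")

# Hypothetical test URLs, not taken from the question.
urls = [
    "https://example.com/vids/clip.mp4",
    "https://example.com/vids/playlist.m3u8",
    "https://example.com/pix/segment-1.ts",
    "https://example.com/docs/readme.txt",
    "https://example.com/vids/playlist.m3u8?token=1",
]

fixed_hits = [u for u in urls if FIXED.match(u)]
original_hits = [u for u in urls if ORIGINAL.match(u)]

for u in urls:
    print(f"{u}: fixed={u in fixed_hits} original={u in original_hits}")
```

In this engine the question's pattern lets the m3u8 and segment URLs through, because .+ can consume the rest of the URL and the lookahead then runs at a position where nothing follows. Note also that (?!.*\.m3u8$) only rejects URLs that end in .m3u8, so a URL with a query string after the extension still matches.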

The EdgeRules docs do not mention the regex flavor they support, and from the examples it is not clear. Also, the example /(^http://example.com(/.*/)+.$)/ shows unescaped forward slashes inside the / delimiters, indicating this is non-standard regex.
I see no other way than using a negative lookahead to exclude arbitrary patterns. Assuming their regex does support it you can try:
/^https://example.com/(pix|vids)/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$/
Or with properly escaped special chars:
/^https:\/\/example\.com\/(pix|vids)\/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$/
Explanation of regex:
^ -- anchor at start of string
https:\/\/example\.com\/ -- literal https://example.com/
(pix|vids) -- literal pix or vids
\/ -- slash
(?!.*\bm3u8\b) -- negative lookahead for m3u8, anchored on both sides with \b
(?!.*\bsegment\b) -- ditto for segment
.*$ -- any other chars up to end of string
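The effect of the \b anchors is easiest to see on a few inputs. This is a minimal sketch in Python's re module (an assumed engine, not necessarily what EdgeRules runs); the helper name allowed and the sample URLs are hypothetical.

```python
import re

# Same pattern as above, without the /.../ delimiters, so the slashes
# need no escaping in Python.
PATTERN = re.compile(
    r"^https://example\.com/(pix|vids)/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$"
)


def allowed(url: str) -> bool:
    """Return True when the URL is in pix/ or vids/ and not excluded."""
    return PATTERN.match(url) is not None


print(allowed("https://example.com/pix/cat.jpg"))        # plain file
print(allowed("https://example.com/vids/a.m3u8?x=1"))    # m3u8 as a whole word
print(allowed("https://example.com/vids/segment-1.ts"))  # segment as a whole word
print(allowed("https://example.com/vids/segment_1.ts"))  # underscore is a word char
```

Because \b treats letters, digits and the underscore as word characters, segment_1 and segments are not seen as the whole word segment and are therefore not excluded. Drop the \b anchors if those should be rejected too.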

Related

PCRE Regex Match /x... but not /y/x

When configuring redirections, it's common to run into multiple pages that include some of the same path strings. We've run into this multiple times where we need to redirect:
https://example.com/x...
But not:
https://example.com/y/x...
To match the /x... we use PCRE regex of:
/x.*
We've been struggling to get the exclude to match correctly; we apologize in advance as our regex is a bit weak, here's our pseudo code:
Match all /x... except /y/x...
Here is what we thought that looked like:
^\/(?!y\/).x.*
In our mind that reads:
Any query starting with /x..., except starting with /y/x...
Thank you in advance, and please feel free to suggest better formatting, we are not stack overflow pros.
Your regex matches a forward slash at the start of the string and then uses a negative lookahead to check that what follows is not y/. If that is true, it matches any single character followed by x and zero or more characters. That will match, for example, //x///
Leaving aside matching the rest of the URL, one way could be to use a negative lookahead (?! to check that what is on the right does not contain /y/x, and then match any characters:
^(?!.*/y/x).+
Regex demo
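To see both points in one place, here is a small sketch using Python's re module as a stand-in for PCRE (the two agree on these constructs). The sample paths are made up.

```python
import re

ORIGINAL = re.compile(r"^/(?!y/).x.*")  # pattern from the question
EXCLUDE = re.compile(r"^(?!.*/y/x).+")  # pattern from this answer

# Hypothetical sample paths.
samples = ["/x/page", "/y/x/page", "//x///", "/z"]

original_hits = [s for s in samples if ORIGINAL.match(s)]
exclude_hits = [s for s in samples if EXCLUDE.match(s)]

print("original:", original_hits)
print("exclude: ", exclude_hits)
```

The question's pattern misses /x/page because the extra . consumes the x. The lookahead version only excludes /y/x, so it accepts everything else, including /z; if the redirect rule must also require /x, that has to be added to the pattern.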
You may use a negative lookbehind assertion:
~(?<!/y)/x~
RegEx Demo
(?<!/y) is a negative lookbehind assertion that fails the match if /y appears immediately before /x.
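A quick check of the lookbehind, sketched in Python's re module, which supports fixed-width lookbehind like this one. The ~ delimiters are dropped and the sample URLs are illustrative.

```python
import re

# Match /x only when it is not immediately preceded by /y.
LOOKBEHIND = re.compile(r"(?<!/y)/x")

# Hypothetical sample URLs.
urls = [
    "https://example.com/x/page",
    "https://example.com/y/x/page",
    "https://example.com/z",
]

hits = [u for u in urls if LOOKBEHIND.search(u)]
print(hits)
```

Unlike a pattern that only excludes /y/x, this one also requires /x to be present, so the /z URL is not matched.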

Extract url based on specific keyword

I am crawling data from certain websites and want to extract data from specific URLs, for example URLs of the form *devicehelp.optus.com.au/web/*. Please find my regex below:
/[^]*devicehelp\.optus\.com\.au\/web\/[^.]*/
This regex doesn't give me the match I am looking for. Could someone please tell me what I am missing?
Test urls -
*devicehelp.optus.com.au/web/*
http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/
This regex works when I test it on http://regexr.com/ but doesn't on https://regex101.com/
In most regex flavors, [^] is an invalid regex construct, while on the site you tested (regexr.com), this will be parsed as any character (since the regexr regex flavor is JavaScript).
To match any character but a newline zero or more times, you may use .*.
.*\bdevicehelp\.optus\.com\.au\/web\/.*
The \b is a word boundary, so as to match devicehelp as a whole word (if you do not intend to match it as a whole word, you may remove it). Dots should be escaped to match literal dots.
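As a rough check, the corrected pattern can be tried in Python's re module (an assumed engine; the question does not say which one the crawler uses). The first URL is from the question, the other two are hypothetical.

```python
import re

# No /.../ delimiters in Python, so forward slashes need no escaping.
PATTERN = re.compile(r".*\bdevicehelp\.optus\.com\.au/web/.*")

urls = [
    "http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/",
    "http://mydevicehelp.optus.com.au/web/",  # hypothetical: no word boundary
    "http://devicehelp.optus.com.au/other/",  # hypothetical: wrong path
]

matches = [u for u in urls if PATTERN.search(u)]
print(matches)
```

The second URL is rejected only because of \b; without it, mydevicehelp would match as well.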

How can I use a regular expression to match words of a certain length but not urls?

For text such as
Save Favorites & Share expressions with friends or the Community.
A full Reference & Help is available in the Library, or watch the video Tutorial.
expressions can start some lines though eventuallys
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
http://regexr.com/foo.html?q=bar
https://mediatemple.net
mediatemple.net
I want to select words that are 11 letters long.
I can use
/\b[a-zA-Z]{11}\b/g
(http://regexr.com/3digk)
but it also matches the urls
https://mediatemple.net
mediatemple.net
How can I avoid that? I use \b rather than a space to match at the start and end of lines
By using negative lookahead, you could exclude the words which have .something after them, this would exclude any URL and not touch the words in the end of the sentence (i.e. if a space is following the dot or the newline).
/\b[a-zA-Z]{11}\b(?!\.[^\s]+)/g
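A minimal sketch of this lookahead in Python's re module (the question uses the JavaScript flavor on regexr; the construct behaves the same here). The sample text is a shortened, made-up variant of the text in the question.

```python
import re

# 11-letter word that is not followed by ".something" (no whitespace after the dot).
WORD_11 = re.compile(r"\b[a-zA-Z]{11}\b(?!\.[^\s]+)")

text = (
    "Share expressions with friends.\n"
    "I like expressions. Really.\n"
    "https://mediatemple.net\n"
    "mediatemple.net\n"
)

found = WORD_11.findall(text)
print(found)
```

Both occurrences of expressions are kept, including the one followed by a sentence-ending dot, while mediatemple is dropped in both URL forms.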
You can use a negative lookbehind to ensure that your match is not preceded by "//".
Use (?<!//), which is a negative lookbehind that asserts the preceding chars are not "//":
/(?<!//)\b[a-zA-Z]{11}\b/g
See live demo.
If you want to be more specific and allow double slashes, eg "foo//elevenchars", you can use 2 negative look behinds - one for each protocol (look behinds must match fixed length):
/(?<!http://)(?<!https://)\b[a-zA-Z]{11}\b/g
See live demo, matching foo//elevenchars, but not the urls.
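The two lookbehind variants can be compared side by side. This sketch uses Python's re module, which also requires each lookbehind to be fixed-width; the sample text is made up.

```python
import re

ANY_DOUBLE_SLASH = re.compile(r"(?<!//)\b[a-zA-Z]{11}\b")
PROTOCOL_ONLY = re.compile(r"(?<!http://)(?<!https://)\b[a-zA-Z]{11}\b")

# Hypothetical sample: a full URL, a bare domain, and a non-URL double slash.
text = "https://mediatemple.net mediatemple.net foo//elevenchars"

any_slash_hits = ANY_DOUBLE_SLASH.findall(text)
protocol_hits = PROTOCOL_ONLY.findall(text)

print(any_slash_hits)
print(protocol_hits)
```

In both variants the bare domain mediatemple.net still produces a match, because nothing precedes it that the lookbehind could reject. Excluding it as well would need an additional check on what follows the word.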

Test but not select with regex

I am trying to test for an expression but not to select it.
I need that for selecting custom TODOs in the PyCharm IDE.
I want to select comments that have the word to-cleanup in them.
When I use the following: # \b.*to-cleanup\b.* it also selects the #. I'm pretty sure there must be a way to test for the existence of # but not to select it.
I have only read the regex documentation in PyCharm Help, so I don't know how to do it. Any help would be greatly appreciated!
I checked here, but couldn't understand how to fit it into what I need.
You can use a positive lookbehind:
(?<=#) .*\bto-cleanup\b.*
The regex matches (see demo):
(?<=#) - a space preceded with a # symbol
.* - 0 or more characters other than a newline up to the last
\bto-cleanup\b - whole word to-cleanup
.* - 0 or more characters other than a newline (up to the end of the line).
This lookbehind is fixed-width and only checks if the space is preceded with # while the # itself is not part of the match.
See lookarounds details at regular-expressions.info
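To illustrate that the # is tested but not captured, here is a short sketch in Python's re module. PyCharm itself uses its own regex engine, so this is only an approximation; the sample line is made up.

```python
import re

TODO = re.compile(r"(?<=#) .*\bto-cleanup\b.*")

# Hypothetical source line with a trailing comment.
line = "x = compute()  # to-cleanup: remove debug output"

m = TODO.search(line)
matched = m.group(0) if m else None
print(repr(matched))

# A line without a "# " prefix is not matched at all.
no_comment = TODO.search("to-cleanup later")
print(no_comment)
```

The matched text starts at the space after the #, so the # itself is excluded while still being required.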

Negative lookahead to match server directories not properly working

Given the following 3 example paths representing server paths, I am trying to create a skip list for my FTP client using PCRE regular expressions, but I can't get the desired result.
/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1
/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2
/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3
I want to skip all folders (not paths) that do not end with
-Publisher1
I am trying to create a working pattern with the help of this online help and this regex tester, but I can't get any further than this negative lookahead pattern
.*-(?!Publisher1)
But with this pattern all lines match, apparently because in each of them the substring before the pattern does not contain it.
/subdir/subdir/.../Author1_-_Title1-(1234) -Publisher1
/subdir/subdir/.../Author2_-_Title2_(5678) -PUBLiSHER2
/subdir/subdir/.../Author3_-_Title3-4951 -publisher3
What is my mistake, and what would the correct pattern be to match only the second and third lines (to be skipped) while keeping the first?
EDIT to make it clearer what to highlight and what not.
Everything from the beginning of the path to the last slash must be ignored (allowed).
Everything after the last slash that matches the defined regex must be skipped.
EDIT to present an advanced pattern matching only the red part
[^/]*(?<!-Publisher2)$
Debuggex Demo
The regex which you have used is:
.*-(?!Publisher1)
Here is what is wrong with it.
This regex matches any line that contains a - not immediately followed by Publisher1. Notice that your text contains other hyphens, for example between author and title or after the title, so every line satisfies this condition. The hyphen has to be checked together with Publisher1 for the lookaround to be useful.
So you might move the hyphen inside the parentheses and write the regex like this:
^.*(?!-Publisher1)
But this will not work either: .* consumes the whole line, so the lookahead runs at the end of the string, where nothing follows, and a negative lookahead always succeeds there. Thus we will use a negative lookbehind instead:
.*(?<!-Publisher1)
This still does not work. Why?
Because the engine backtracks until it finds a position that is not preceded by -Publisher1.
This is a bit complex, so bear with me. Suppose your string is
/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1
We do a negative lookbehind for -Publisher1. At the end of the string, after the final 1, -Publisher1 is visible when we look back, so the lookbehind fails. The engine then gives back one character and tries again from the position before the final 1. From there only -Publisher is visible, so the condition is satisfied and the regex matches everything up to that point.
So it is essential to anchor the lookbehind to the end of the string, so that the engine cannot move one character to the left to find a match.
final regex:
.*(?<!-Publisher1)$
demo here : http://regex101.com/r/lE1vW2
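The backtracking described above can be observed directly. This is a sketch in Python's re module, used here as a stand-in for PCRE; the paths are the ones from the question, including the elided "..." segment as written.

```python
import re

paths = [
    "/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir/subdir/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir/subdir/.../Author3_-_Title3-4951-publisher3",
]

UNANCHORED = re.compile(r".*(?<!-Publisher1)")
ANCHORED = re.compile(r".*(?<!-Publisher1)$")

# Without $, the engine gives back the final "1" and still reports a match.
unanchored = UNANCHORED.match(paths[0]).group(0)
print(unanchored)

# With $, only paths that do not end in -Publisher1 match (i.e. are skipped).
skipped = [p for p in paths if ANCHORED.match(p)]
print(skipped)
```

The unanchored match on the first path is the whole string minus its last character, which is exactly the one-step-left position described above.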
This should suit your needs:
^.*(?<!-Publisher1)$
Debuggex Demo
I want to skip all folders that do not end with -Publisher1
You can use this negative lookahead based regex:
^(?!.*?-Publisher1$).+$
Working Demo
You could use the following regex in order to exclude lines containing Publisher1:
^((?!Publisher1).)*$
Online demo: http://regex101.com/r/gD8jK0
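The last two patterns are not equivalent: one rejects names that end with -Publisher1, the other rejects names that contain Publisher1 anywhere. A small sketch in Python's re module shows the difference; the third folder name is hypothetical and added only to separate the two cases.

```python
import re

ENDS_WITH = re.compile(r"^(?!.*?-Publisher1$).+$")  # skip unless it ends with -Publisher1
CONTAINS = re.compile(r"^((?!Publisher1).)*$")      # skip unless it contains Publisher1

names = [
    "Author1_-_Title1-(1234)-Publisher1",
    "Author2_-_Title2_(5678)-PUBLiSHER2",
    "Publisher1_Sampler-(0001)-publisher3",  # hypothetical folder name
]

ends_with_skipped = [n for n in names if ENDS_WITH.match(n)]
contains_skipped = [n for n in names if CONTAINS.match(n)]

print(ends_with_skipped)
print(contains_skipped)
```

For the requirement as stated in the question (folders that do not end with -Publisher1), the end-anchored variant is the closer fit; the other one keeps any folder that merely mentions Publisher1.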