Regular expression to search for specific Referer in HTTP Header - regex

I need to create a regular expression to match everything except a specific URL for a given Referer. I currently have a pattern that matches the URL, but I can't work out how to invert it.
What I currently have:
Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?
In the list below:
Referer:http://www.test.online/
Referer:https://www.test.online/
Referer:https://www.test.tv/
Referer:https://www.blah.com/
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
It will match:
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
However, I would like it to match everything except for those.
This is for our WAF, so unfortunately we are restricted in what we can use: the rule can only be expressed as a search on the HTTP header being passed back.

Try this regex:
^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test\.com(\/.*)?).*$
A good way to negate your regex is to use negative lookahead.
Explanation:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Working example: https://regex101.com/r/QJfeBB/1

You could use an anchor ^ to assert the start of the string and use a negative lookahead to assert what is on the right is not what you want to match.
Note that you have to escape the dot to match it literally and you could omit the last part (\/.*)?.
If you don't use the capturing groups for later use you might also turn those into non capturing groups (?:) instead.
^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$
About the pattern
^ Start of the string
(?! Negative lookahead to assert what is on the right does not match
Referer:(https?(:\/\/))?(www\.)?test\.com Match your pattern
) Close negative lookahead
.+ Match any char except a newline 1+ times
$ Assert end of the string
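As a quick sanity check, the pattern can be run against the list from the question. This is only a sketch in Python's re module, which is an assumption: the WAF's own regex engine may behave differently, and the name NOT_TEST_COM is made up for illustration.

```python
import re

# Pattern copied from the answer above; the name is hypothetical.
NOT_TEST_COM = re.compile(r"^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$")

headers = [
    "Referer:http://www.test.online/",
    "Referer:https://www.test.online/",
    "Referer:https://www.test.tv/",
    "Referer:https://www.blah.com/",
    "Referer:https://www.test.com/",
    "Referer:http://www.test.com/",
    "Referer:http://test.com/",
    "Referer:https://test.com/",
]

# Keep only the header lines that the negated pattern accepts.
matched = [h for h in headers if NOT_TEST_COM.match(h)]

for h in matched:
    print(h)
```

Only the four non-test.com lines are printed. Note that test\.com is not followed by any boundary, so a Referer such as https://test.community/ is rejected as well; whether that matters depends on the rule you need.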

Related

Regular expression to exactly match the last path segment of an URL without parameters, except if the path ends with a trailing slash

The goal of my regular expression adventure is to create a matcher for a mechanism that could add a trailing slash to URLs, even in the presence of parameters denoted by # or ? at the end of the URL.
For any of the following URLs, I'm looking for a match for segment as follows:
https://example.com/what-not/segment matches segment
https://example.com/what-not/segment?a=b matches segment
https://example.com/what-not/segment#a matches segment
In case there is a match for segment, I'm going to replace it with segment/.
For any of the following URLs, there should be no match:
https://example.com/what-not/segment/ no match
https://example.com/what-not/segment/?a=b no match
https://example.com/what-not/segment/#a no match
because here, there is already a trailing slash.
I've tried:
This primitive regex and its variants: .*\/([^?#\/]+). However, with this approach, I could not make it not match when there is already a trailing slash.
I experimented with negative lookaheads as follows: ([^\/\#\?]+)(?!(.*[\#\?].*))$. In this case, I could not get rid of any ? or # parts properly.
Thank you for your kind help!
Lookahead and lookbehind conditionals are so powerful!
(?<=\/)[\w]+(?(?=[\?\#])|$)
P.S. I used [\w]+, which means [a-zA-Z0-9_]+.
Of course URLs can contain many other characters, like - or ~, but for the examples provided it works nicely.
If you want to match urls, you might use
\b(https?://\S+/)[^\s?#/]+(?![^\s?#])
Explanation
\b A word boundary to prevent a partial word match
( Capture group 1
https?://\S+/ Match the protocol, 1+ non whitespace chars and then the last occurrence of /
) Close group 1
[^\s?#/]+ Match 1+ chars other than a whitespace char ? # /
(?![^\s?#]) Negative lookahead, assert that directly to the right is not a non whitespace char other than ? or #
See a regex demo.
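In the replacement, use group 1 followed by segment/. For an arbitrary segment, the whole match followed by / has the same effect.
To get only the segment as the match instead of using a capture group (this needs an engine with variable-length lookbehind, such as .NET or modern JavaScript):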
(?<=\bhttps?://\S+/)[^\s?#/]+(?![^\s?#])
See another regex demo.
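The capture-group pattern can be tried out as a replacement like this. It is a minimal sketch in Python (an assumption, the question does not name a language); add_trailing_slash is a hypothetical helper, and the replacement uses the whole match plus / so the segment does not need to be hard-coded.

```python
import re

# First pattern from the answer above, unchanged.
LAST_SEGMENT = re.compile(r"\b(https?://\S+/)[^\s?#/]+(?![^\s?#])")

def add_trailing_slash(url: str) -> str:
    """Append "/" to the last path segment unless one is already there."""
    # \g<0> is the whole match: everything up to and including the segment.
    return LAST_SEGMENT.sub(r"\g<0>/", url, count=1)

urls = [
    "https://example.com/what-not/segment",
    "https://example.com/what-not/segment?a=b",
    "https://example.com/what-not/segment#a",
    "https://example.com/what-not/segment/",
    "https://example.com/what-not/segment/?a=b",
    "https://example.com/what-not/segment/#a",
]

for url in urls:
    print(url, "->", add_trailing_slash(url))
```

The first three URLs gain a slash before any ? or # part; the last three come back unchanged. The lookbehind variant is not shown because Python's re module only accepts fixed-width lookbehind.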

Regex - find the param in a url in any position in the string

I am trying to match a URL param whose position is not fixed in the URI. It can show up right after the ? or after an &. I need to match the vr=359821 param in the URIs below. How can I do this?
Example urls:
/br/col/aon/11631?vr=359821&cId=9113
/br/col/aon/11631?cId=9113&vr=359821
/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something
Some things I tried:
I tried to use backreferencing (not sure if this is the right approach) but was not successful.
I was trying to group them and maybe use a backreference to find the string within that group.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)) # this matches second url above but not others.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).+?\3 # this is wrong I know.
(\/br\/col\/aon\/11631)(\?cId=9113&(vr=359821)).*?\2[vr=359821] # this is wrong
The regexes above are wrong, but my idea was to make it a group and match vr=359821 within that group. I don't know if this is even possible in regex.
Why I am doing this:
The final goal is to redirect this URL to a different URL in nginx, keeping all the params from the original request.
In the last 2 patterns that you tried, you are using a backreference like \2 and \3. But a backreference will match the same data that was already captured in the corresponding group.
In this case, that is not the desired behaviour. Instead, you want to match a key value pair in the uri, which does not have to exist in the content before.
Therefore you can match the start of the pattern followed by a non greedy quantifier (as it can also occur right after the question mark) to match the first occurrence of vr= followed by 1 or more digits.
In the comments I suggested this pattern \/br\/col\/aon\/11631\b.*?[?&](vr=\d+), but (depending on the regex delimiters) you don't have to escape the forward slash.
The pattern could be
/br/col/aon/11631\b.*?[?&](vr=\d+)
The pattern matches
/br/col/aon/11631\b Match the start of the pattern followed by a word boundary
.*? Match any char, as few as possible
[?&] Match either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
From what I have read, nginx uses PCRE. To get a more specific pattern, one option could be:
/br/col/aon/11631\?.*?(?<=[?&])(vr=\d+)(?=\&|$)
This pattern matches
/br/col/aon/11631\? Match the start of the pattern followed by the question mark
.*? Match any char, as few as possible
(?<=[?&]) Positive lookbehind, assert what is directly to the left is either ? or &
(vr=\d+) Capture group 1, match vr= followed by 1+ digits
(?=\&|$) Positive lookahead, assert what is directly to the right is & or the end of the string to prevent a partial match
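To see how the more specific pattern behaves on the three example URIs, here is a small sketch in Python (an assumption, since the target is nginx with PCRE; the constructs used here behave the same way in Python's re module). The name VR_PARAM is made up for illustration.

```python
import re

# Second pattern from the answer above; "&" needs no escaping in Python.
VR_PARAM = re.compile(r"/br/col/aon/11631\?.*?(?<=[?&])(vr=\d+)(?=&|$)")

uris = [
    "/br/col/aon/11631?vr=359821&cId=9113",
    "/br/col/aon/11631?cId=9113&vr=359821",
    "/br/col/aon/11631?cId=9113&vr=359821&grid=2&page=something",
]

# Group 1 holds the key=value pair wherever it sits in the query string.
found = [VR_PARAM.search(u).group(1) for u in uris]
print(found)

# A key that merely ends in "vr" is rejected by the lookbehind.
partial = VR_PARAM.search("/br/col/aon/11631?xvr=359821")
print(partial)
```

All three URIs yield vr=359821 in group 1, while the xvr= case gives no match.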

I need to exclude word from regular expression

I have this regexp:
^[a-z0-9]+([.\-][a-z0-9]+)*$
I need to exclude only one word, "www", from the match.
I tried a negative lookahead, but without success.
Use a negative lookahead like this:
^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$
 ^^^^^^^^
This will not match a string equal to www.
See the regex demo
If you want to fail a match with strings that contain -www- or .www., use
^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$
See another regex demo. This pattern contains a (?!.*\bwww\b) lookahead that fails the whole match if there is a www somewhere inside the string with no word characters (letters, digits or underscore) directly around it, which is what the \b word boundaries check.
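The difference between the two lookaheads is easiest to see side by side. The following is a sketch in Python (an assumption, the question does not name an engine); the sample strings and the names NOT_EXACTLY_WWW and NO_WWW_LABEL are invented for illustration.

```python
import re

# Rejects only the exact string "www".
NOT_EXACTLY_WWW = re.compile(r"^(?!www$)[a-z0-9]+([.-][a-z0-9]+)*$")
# Rejects any string that contains "www" as a whole word.
NO_WWW_LABEL = re.compile(r"^(?!.*\bwww\b)[a-z0-9]+([.-][a-z0-9]+)*$")

samples = ["www", "wwww", "www.example", "my-www-site", "example.com"]

exact = [s for s in samples if NOT_EXACTLY_WWW.match(s)]
label = [s for s in samples if NO_WWW_LABEL.match(s)]

print(exact)
print(label)
```

The first pattern drops only "www" itself; the second also drops "www.example" and "my-www-site", but keeps "wwww" because there is no word boundary inside it.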

Non capturing group included in capture?

This text
"dhdhd89(dd)"
Matched against this regex
.+?(?:\()
..returns "dhdhd89(".
Why is the start parenthesis included in the capture?
Two different tools, as well as the .NET Regex class, return the same result. So I gather there is something I don't understand about this.
The way I read my regex is:
Match any character, at least one occurrence. As few as possible.
The matched string should be followed by a start parenthesis, but not to be included in the capture.
I can find a workaround, but I still want to know what is going on.
Just turn the non-capturing group into a positive lookahead assertion.
.+?(?=\()
.+? is a non-greedy match of one or more characters, and (?=\() requires that an opening parenthesis follows. An assertion does not consume any characters; it only checks whether a match is possible at that position. A non-capturing group, by contrast, still matches and consumes the characters.
You can just use this negation based regex to capture only text before a literal (:
^([^(]+)
When you use:
.+?(?:\()
The regex engine does match the ( after the initial text; it just doesn't return it to you in a captured group.
You haven't defined any capture groups, so I guess you are displaying the whole match (group 0). You can do:
(.+?)(?:\()
and the string you want is in group 1,
or use a lookahead as @AvinashRaj said.
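The four variants discussed in these answers can be compared directly. This is a sketch in Python rather than .NET (an assumption made for brevity; group 0 and group 1 mean the same thing in both), and the variable names are made up.

```python
import re

text = "dhdhd89(dd)"

# Non-capturing group: "(" is consumed, so it is part of the whole match.
whole = re.search(r".+?(?:\()", text).group(0)

# Explicit capture group: group 1 holds only the text before "(".
captured = re.search(r"(.+?)(?:\()", text).group(1)

# Lookahead: "(" is checked but not consumed, so group 0 excludes it.
lookahead = re.search(r".+?(?=\()", text).group(0)

# Negated character class: everything up to the first "(".
negated = re.search(r"^([^(]+)", text).group(1)

print(whole, captured, lookahead, negated)
```

Only the first variant returns the parenthesis; the other three return dhdhd89.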

RegEx - String To Help Match

I read somewhere that it is possible to have a RegEx in which strings preceding and following are not to be matched, but instead help with ambiguities.
For example, I would like a RegEx that matches only "TESTING" from the second line ("defTESTINGghi") and nothing from line one and line three.
abcTESTINGdef
defTESTINGghi
ghiTESTINGjkl
If supported, you can use the \K escape sequence. \K resets the starting point of the reported match, so any previously consumed characters are no longer included. The positive lookahead asserts that the preceding match is followed by ghi.
def\KTESTING(?=ghi)
Or, depending on what you mean by the preceding and following strings not being matched, why not simply use a capturing group to capture only the desired subpattern?
def(TESTING)ghi
You could try the below regexes to match the string TESTING only on the second line,
Through positive lookahead and lookbehind,
(?<=def)TESTING(?=ghi)
Matches the string TESTING only if it comes directly after def and is followed by ghi.
Through positive lookahead,
TESTING(?=ghi)
Matches the string TESTING only if it's followed by ghi.
Through negative lookahead,
TESTING(?!def|jkl)
Matches the string TESTING if it's not followed by def or jkl.
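The lookaround and capture-group variants can be checked against the three sample lines. This is a sketch in Python (an assumption); the \K form is left out because Python's built-in re module does not support \K, and the constant names are made up for illustration.

```python
import re

lines = ["abcTESTINGdef", "defTESTINGghi", "ghiTESTINGjkl"]

# Lookbehind + lookahead: the match itself is only "TESTING".
LOOKAROUND = re.compile(r"(?<=def)TESTING(?=ghi)")
# Capture group: the whole match includes def/ghi, group 1 is "TESTING".
CAPTURE = re.compile(r"def(TESTING)ghi")
# Negative lookahead: reject "TESTING" when def or jkl follows.
NOT_FOLLOWED = re.compile(r"TESTING(?!def|jkl)")

# Record which line numbers (1-based) each pattern matches.
lookaround_hits = [i for i, s in enumerate(lines, 1) if LOOKAROUND.search(s)]
capture_hits = [i for i, s in enumerate(lines, 1) if CAPTURE.search(s)]
negative_hits = [i for i, s in enumerate(lines, 1) if NOT_FOLLOWED.search(s)]

m = LOOKAROUND.search(lines[1])
print(lookaround_hits, capture_hits, negative_hits, m.group(0), m.span())
```

All three patterns hit only the second line. With the lookaround pattern the reported match is TESTING at span (3, 10), so def and ghi help to disambiguate without being part of the match.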