How to capture a pattern not preceded by a fixed string - regex

Assume this string
http://foobar.com
I want to capture the domain only when it does not begin with foo, so in this case nothing should be captured.
Using a negative lookbehind such as
(?<!foo)[a-z]+\.[a-z]+
results in
foobar.com
because at position 7 of the string, where the match starts, the characters immediately behind are ://, not foo.

Use a positive lookbehind to require the match to be after /, and a negative lookahead to prohibit foo at the beginning.
(?<=\/)(?!foo)[a-z]+\.[a-z]+
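A minimal sketch of the difference between the two patterns, assuming Python's re module (the question does not name a regex flavor). The second URL, http://example.com, and the helper first_match are hypothetical and added only for contrast:

```python
import re

# Pattern from the question: the lookbehind only inspects the three
# characters before the match start, which are "://" here, so it passes.
naive = re.compile(r"(?<!foo)[a-z]+\.[a-z]+")

# Pattern from the answer: the match must start right after "/" and
# must not begin with "foo".
anchored = re.compile(r"(?<=/)(?!foo)[a-z]+\.[a-z]+")


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(naive, "http://foobar.com"))      # foobar.com
print(first_match(anchored, "http://foobar.com"))   # None
print(first_match(anchored, "http://example.com"))  # example.com
```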

Related

What does "(?!$)" inside a regexp mean? [duplicate]

In a section of the Svelte tutorial, there's a piece of code like this:
view = pin ? pin.replace(/\d(?!$)/g, '•') : 'enter your pin';
I know \d means a digit, but can't figure out what (?!$) means.
(And because it's composed of all punctuation, I can't manage to google for an explanation.)
Please help, thanks.
(?!$) is a negative lookahead, where (?!...) declares the negative lookahead and $ is what the expression is "looking ahead" for (in this case, the end-of-string anchor).
A negative lookahead is the inverse of a positive lookahead, so it is easier to understand once you know what a positive lookahead does. A digit followed by a positive lookahead, \d(?=$), matches anything that \d$ would match, but the part inside the lookahead is not included in the returned match. So \d(?=$) matches any digit that sits directly before the end of the string. The negative version, \d(?!$), matches every digit that is NOT directly before the end of the string. Replacing those matches with • therefore turns every digit in the string into • except a digit in the last position.
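The contrast between the two lookaheads can be sketched in a few lines. This assumes Python's re module rather than the JavaScript of the tutorial, and the PIN value is made up for illustration:

```python
import re

pin = "1234"  # hypothetical PIN, for illustration only

# Positive lookahead: only the digit that sits right before the end.
last_digit = re.findall(r"\d(?=$)", pin)

# Negative lookahead: every digit that is NOT right before the end.
masked = re.sub(r"\d(?!$)", "•", pin)

print(last_digit)  # ['4']
print(masked)      # •••4
```

One difference from JavaScript worth keeping in mind: in Python, $ also matches just before a trailing newline, so input that ends in "\n" behaves slightly differently.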
For the sake of being thorough, you should know that (?<=) is a positive lookbehind that looks for matches in the characters immediately before the given token instead of after, and (?<!) is a negative lookbehind.
Regex101.com and RegExr.com are fantastic resources when you are learning regex: you can paste in a regular expression you don't understand, get a piece-by-piece explanation of it, and test strings in real time to see what the expression captures and what it doesn't. Even if the built-in explanations don't make sense, you can still use them in situations like this to find out what something is called so you can search for it.
\d matches all digits
(?!something) means 'Negative Lookahead' for something
$ matches the end of a string
So \d(?!$) matches every digit except a digit that is the last character of the string
In this string:
$$//www12.example#news.com<~>998123000dasas00--987
These digits will be matched (the final 7 will not, because it is the last character):
129981230000098
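The result above can be reproduced with a short check, assuming Python's re module. The sample string is the one from this answer; the variable names are arbitrary:

```python
import re

sample = r"$$//www12.example#news.com<~>998123000dasas00--987"

# Collect every digit that is not the final character and join them.
matched = "".join(re.findall(r"\d(?!$)", sample))

print(matched)  # 129981230000098
```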
Referred to this answer
and Regex Cheat Sheet

SyntaxError: (irb):4: invalid pattern in look-behind (positive look-behind/ahead)

I am trying to write a regex-replace pattern in order to replace a number in a hash like this:
some_dict = {
TEST: 123
}
such that 123 could be captured and replaced.
(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)
You'll see that this works perfectly fine in regexr:
When I run this gsub in irb, however, here is what happens:
irb(main):005:0> " TEST: 123".gsub(/(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)/, "321")
SyntaxError: (irb):5: invalid pattern in look-behind: /(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+(?=.*)/
I was looking around for similar issues like Invalid pattern in look-behind but I made sure to exclude capture groups in my look-behind so I'm really not sure where the problem lies.
The reason is that Ruby's Onigmo regex engine does not support infinite-width lookbehind patterns.
In the general case, positive lookbehinds that contain quantifiers like *, + or {x,} can often be replaced with a consuming pattern followed by \K:
/(?: |\t*[a-zA-Z0-9_]+: |\t+)\K\d+(?=.*)/
#^^^                         ^^
However, you do not even need such a complicated pattern. (?=.*) is redundant: it requires nothing, because .* matches even an empty string. Every alternative in the lookbehind ends with a space or a tab, so the lookbehind succeeds exactly when there is a space or tab immediately to the left of the current location. The regex is therefore equivalent to
.gsub(/(?<=[ \t])\d+/, "321")
where the pattern matches
(?<=[ \t]) - a location immediately preceded by a space/tab
\d+ - one or more digits.
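As a rough cross-check in another engine, here is a sketch using Python's re module instead of Ruby. This is an assumption made for illustration: Python's engine is not Onigmo, but it also rejects this variable-width lookbehind. Python's re has no \K, so the last line substitutes a capturing group for it:

```python
import re

line = " TEST: 123"

# Python's re module, like the Ruby engine in the question, rejects a
# lookbehind whose alternatives contain unbounded quantifiers.
try:
    re.compile(r"(?<= |\t*[a-zA-Z0-9_]+: |\t+)\d+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Simplified pattern from the answer: digits preceded by a space or tab.
simple = re.sub(r"(?<=[ \t])\d+", "321", line)

# No \K in re, so this sketch keeps the prefix with a capturing group
# and writes it back in the replacement.
grouped = re.sub(r"( |\t*[a-zA-Z0-9_]+: |\t+)\d+", r"\g<1>321", line)

print(lookbehind_ok)  # False
print(repr(simple))   # ' TEST: 321'
print(repr(grouped))  # ' TEST: 321'
```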

How do I match what's between the quotes excluding these?

I want to match what's between the quotes, excluding the quotes themselves. I tried positive and negative lookahead, which works for the closing quote, but I cannot exclude the opening one. What am I doing wrong?
Here is the example I'm using:
A: $("div"),
B: $("img.some_class"),
B: $("img.some_class.another_class"),
C: $("#some_id"),
D: $(".some_class"),
E: $("input#some_id"),
F: $("div#some_id.some_class.some_other"),
G: $("div.some_class#some_id")
Here is my regex so far:
/(?!").*(?=")/g
Try this:
/\("\K[^"]+/g
\K discards everything matched so far from the reported match, so the returned value starts here.
For example, on the first line it will match ("div but return just div.
There are not two, but four different lookaround modifiers, because you need to specify two different aspects:
Are you asserting that something is there (positive) or is not there (negative)?
Are you asserting that it's before the specified pattern (lookbehind) or after it (lookahead)?
The four combinations are generally written like this:
?= for positive lookahead
?! for negative lookahead
?<= for positive lookbehind
?<! for negative lookbehind
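The four combinations can be compared side by side. This is a small sketch assuming Python's re module; the test string and the helper starts are made up for illustration:

```python
import re

text = "foobar foobaz xbar"  # hypothetical test string


def starts(pattern):
    """Start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


print(starts(r"foo(?=bar)"))   # [0]   foo followed by bar
print(starts(r"foo(?!bar)"))   # [7]   foo not followed by bar
print(starts(r"(?<=foo)bar"))  # [3]   bar preceded by foo
print(starts(r"(?<!foo)bar"))  # [15]  bar not preceded by foo
```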
You've used a negative lookahead when you wanted a positive lookbehind, so the fixed version of what you wrote would be:
/(?<=").*(?=")/g
Beware the "greediness" of .*, which will match as much of the string as possible; you might want to use .*? to make it "non-greedy", or explicitly say "anything other than a quote mark" ([^"]*).
Another approach is to match the quotes normally, rather than with a lookaround, but "capture" the part between them: /"(.*?)"/. How you get to the "captured group" will vary depending on your programming language / tool, which you haven't specified.
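The greediness problem shows up as soon as a line contains more than one quoted string. A minimal sketch, assuming Python's re module (the question does not specify a language) and a made-up input line:

```python
import re

line = 'x("a") y("b")'  # hypothetical line with two quoted strings

# Greedy .* between lookarounds runs from the first quote to the last one.
greedy = re.findall(r'(?<=").*(?=")', line)

# Matching the quotes and capturing lazily stops at the nearest closing quote.
lazy = re.findall(r'"(.*?)"', line)

# A negated character class gives the same result without backtracking.
negated = re.findall(r'"([^"]*)"', line)

print(greedy)   # ['a") y("b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```

In Python, findall returns the captured group rather than the whole match when the pattern has exactly one group, which is why the quotes are absent from the last two results.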
The pattern (?!").*(?=") first asserts with (?!") that what is directly on the right is not a double quote, which succeeds at the start of each example line because the first character there is not a quote.
Then .* is greedy and will match 0+ times any character except a newline and will match until the end of the string. Then it will backtrack to fulfill the assertion (?=") where directly on the right is a double quote.
If a positive lookbehind is supported, you might change the (?!") to (?<=") and the pattern could look like (?<=\$\(")[^"]+(?="\)) to not match empty double quotes.
Taking the dollar sign and the opening and closing parenthesis into account, you could use a capturing group and a negated character class [^"]+ to match any char except a double quote:
\$\("([^"]+)"\)
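Applied to the example data from the question, the capturing-group pattern could be used like this. The sketch assumes Python's re module, since the question does not name a language:

```python
import re

data = '''A: $("div"),
B: $("img.some_class"),
B: $("img.some_class.another_class"),
C: $("#some_id"),
D: $(".some_class"),
E: $("input#some_id"),
F: $("div#some_id.some_class.some_other"),
G: $("div.some_class#some_id")'''

# Match $(" ... ") and capture only what sits between the quotes.
selectors = re.findall(r'\$\("([^"]+)"\)', data)

for s in selectors:
    print(s)
```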
Using lookahead and lookbehind as you asked:
/(?<=").*(?=")/g
Test Here : https://regex101.com/r/kCEuow/2
You might also consider using a capturing group:
/"([^"]+)"/g
Test the regex : https://regex101.com/r/kCEuow/1

Regular expression to search for specific Referer in HTTP Header

I need to create a regular expression to match everything except a specific URL for a given Referer. I currently have it to match but can't reverse it and create the negative for it.
What I currently have:
Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?
In the list below:
Referer:http://www.test.online/
Referer:https://www.test.online/
Referer:https://www.test.tv/
Referer:https://www.blah.com/
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
It will match:
Referer:https://www.test.com/
Referer:http://www.test.com/
Referer:http://test.com/
Referer:https://test.com/
However, I would like it to match everything except for those.
This is for our WAF, so unfortunately we are restricted in what we can use: the rule can only be expressed as a search on the HTTP header being passed back.
Try this regex:
^(?!.*Referer:(http(s)?(:\/\/))?(www\.)?test.com(\/.*)?).*$
A good way to negate your regex is to use negative lookahead.
Explanation:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
Working example: https://regex101.com/r/QJfeBB/1
You could use an anchor ^ to assert the start of the string and use a negative lookahead to assert what is on the right is not what you want to match.
Note that you have to escape the dot to match it literally and you could omit the last part (\/.*)?.
If you don't need the capturing groups later, you might also turn them into non-capturing groups (?:...) instead.
^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$
About the pattern
^ Start of the string
(?! Negative lookahead to assert what is on the right does not match
Referer:(https?(:\/\/))?(www\.)?test\.com Match your pattern
) Close negative lookahead
.+ Match any char except a newline 1+ times
$ Assert end of the string
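To see which headers this pattern lets through, here is a sketch that assumes Python's re module. The WAF's actual regex engine is not named in the question and may differ. The header list is the one from the question; test.community is a hypothetical extra host:

```python
import re

# Pattern from the answer above, with the dot in test\.com escaped.
not_test_com = re.compile(r"^(?!Referer:(https?(:\/\/))?(www\.)?test\.com).+$")

headers = [
    "Referer:http://www.test.online/",
    "Referer:https://www.test.online/",
    "Referer:https://www.test.tv/",
    "Referer:https://www.blah.com/",
    "Referer:https://www.test.com/",
    "Referer:http://www.test.com/",
    "Referer:http://test.com/",
    "Referer:https://test.com/",
]

# Keep only the headers the negated pattern matches.
matched = [h for h in headers if not_test_com.match(h)]

for h in matched:
    print(h)

# Hypothetical host: the lookahead is not anchored after "com", so any
# host that merely starts with test.com is excluded as well.
prefix_excluded = not_test_com.match("Referer:http://test.community/") is None
```

Only the first four headers are printed. If hosts such as the hypothetical test.community should still match, the lookahead would need a boundary after com, for example (?:\/|$).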

Negative Lookahead Faults Regex

I have a regular expression:
^\/admin\/(?!(e06772ed-7575-4cd4-8cc6-e99bb49498c5)).*$
My input string:
/admin/e06772ed-7575-4cd4-8cc6-e99bb49498c5
As I understand it, the negative lookahead should check whether the group (e06772ed-7575-4cd4-8cc6-e99bb49498c5) has a match, or am I incorrect?
Since the input string contains the group, why does the negative lookahead not work? By that I mean that I expect my regex to match the input string /admin/e06772ed-7575-4cd4-8cc6-e99bb49498c5.
Removing negative lookahead makes this regex work correctly.
Tested with regex101.com
The takeaway message of this question is: a lookaround matches a position, not a string.
(?!e06772ed-7575-4cd4-8cc6-e99bb49498c5)
will match any position that is not followed by e06772ed-7575-4cd4-8cc6-e99bb49498c5.
Which means that:
^\/admin\/(?!(e06772ed-7575-4cd4-8cc6-e99bb49498c5)).*$
will match:
/admin/abc
and even:
/admin/e99bb49498c5
but not:
/admin/e06772ed-7575-4cd4-8cc6-e99bb49498c5/daffdakjf;adjk;af
This also explains why there is a match as soon as you get rid of the ?!: without it, the group simply matches the string literally.
Next, you can drop the parentheses inside your lookahead; they serve no grouping purpose here.
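The behavior described above can be checked with a short sketch. It assumes Python's re module in place of the regex101 tester used in the question, and drops the inner parentheses as suggested:

```python
import re

blocked_id = "e06772ed-7575-4cd4-8cc6-e99bb49498c5"

# Same regex as in the question, without the inner capturing group.
pattern = re.compile(r"^/admin/(?!" + re.escape(blocked_id) + r").*$")

paths = [
    "/admin/abc",
    "/admin/e99bb49498c5",
    "/admin/" + blocked_id,
    "/admin/" + blocked_id + "/daffdakjf;adjk;af",
]

# True where the regex matches, False where the lookahead rejects the path.
results = {p: pattern.match(p) is not None for p in paths}

for path, ok in results.items():
    print(path, ok)
```

The first two paths match and the last two do not, because the lookahead fails at the position right after /admin/ whenever the blocked id follows.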