Negative Lookbehind fails before an Optional Token - regex

(?<!a)b?c
Against abc, this regex matches c. Am I missing something?

Yes, that is correct. Here is a quick walk-through of the match from the engine's standpoint.
Try to match starting at the position before the a. Fail. Advance in the string.
Try to match starting at the position before the b. Fail (the lookbehind sees the a). Advance in the string.
Current position: right before the c
Can the negative lookbehind (?<!a) assert that what precedes is not a? Check. (It's b)
Can b? match zero or one b? Check. We match zero b
Can c match a c? Check.
Are there any more tokens to match? Nope. We have a match.
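The walk-through above can be reproduced with a short script. This is a minimal sketch in Python, whose re module supports fixed-width lookbehind; the variable names are illustrative only.

```python
import re

pattern = re.compile(r"(?<!a)b?c")
subject = "abc"

# Try the pattern at each start position, the way the engine advances.
# Passing a start position keeps the earlier characters visible to the lookbehind.
attempts = [(pos, pattern.match(subject, pos)) for pos in range(len(subject))]
for pos, m in attempts:
    print(pos, m.group() if m else "fail")

# The first successful attempt is the overall match: just "c".
first = pattern.search(subject)
print(first.group(), first.span())
```

The attempts at positions 0 and 1 fail, and the attempt at position 2 succeeds with b? matching zero characters.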
Looking Far Behind
In .NET, which has infinite lookbehind, you could use this:
(?<!a.*)b?c
But PCRE does not have infinite lookbehind. You can use this instead:
^[^a]*\Kb?c
How it works:
The ^ anchor asserts that we are at the beginning of the string
[^a]* matches any non-a chars
The \K tells the engine to drop what was matched so far from the final match it returns
b?c matches the optional b and the c
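Python's standard re module has neither \K nor infinite lookbehind, so the sketch below uses a capturing group as a stand-in for \K. Two things here are assumptions of the sketch rather than part of the answer: the helper name, and the lazy [^a]*?, which is used so that an optional b is left for the group instead of being consumed by the greedy [^a]*.

```python
import re

# Stand-in for ^[^a]*\Kb?c: group 1 plays the role of "everything after \K".
pattern = re.compile(r"^[^a]*?(b?c)")

def bc_with_no_a_before(subject):
    """Return the b?c part if no a occurs anywhere before it, else None."""
    m = pattern.search(subject)
    return m.group(1) if m else None

print(bc_with_no_a_before("abc"))  # None: an a precedes the c
print(bc_with_no_a_before("xbc"))  # bc
print(bc_with_no_a_before("xyc"))  # c
```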

Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors.
They do not consume characters in the string, but only assert whether a match is possible or not.
For more info, see Lookahead and Lookbehind Zero-Length Assertions.

Related

How do I match what's between the quotes excluding these?

I want to match what's between the quotes but excluding these. I tried positive and negative lookahead, which works for the end quote but I cannot exclude the first one. What am I doing wrong?
Here is the example I'm using:
A: $("div"),
B: $("img.some_class"),
B: $("img.some_class.another_class"),
C: $("#some_id"),
D: $(".some_class"),
E: $("input#some_id"),
F: $("div#some_id.some_class.some_other"),
G: $("div.some_class#some_id")
Here is my regex so far:
/(?!").*(?=")/g
Try this:
/\("\K[^"]+/g
\K means that the return value will start here.
For example, on the first line it will find ("div but return just div as the match.
There are not two, but four different lookaround modifiers, because you need to specify two different aspects:
Are you asserting that something is there (positive) or is not there (negative)?
Are you asserting that it's before the specified pattern (lookbehind) or after it (lookahead)?
The four combinations are generally written like this:
?= for positive lookahead
?! for negative lookahead
?<= for positive lookbehind
?<! for negative lookbehind
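The four combinations can be compared side by side. This is a small sketch in Python; the subject string "xay" and the variable names are made up for illustration.

```python
import re

subject = "xay"

pos_ahead = re.search(r"a(?=y)", subject)    # a followed by y
neg_ahead = re.search(r"a(?!y)", subject)    # a not followed by y
pos_behind = re.search(r"(?<=x)a", subject)  # a preceded by x
neg_behind = re.search(r"(?<!x)a", subject)  # a not preceded by x

for name, m in [("?=", pos_ahead), ("?!", neg_ahead),
                ("?<=", pos_behind), ("?<!", neg_behind)]:
    print(name, m.group() if m else None)
```

The two positive forms match only the a (the x and y are checked but not included), and the two negative forms find nothing in this string.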
You've used a negative lookahead when you wanted a positive lookbehind, so the fixed version of what you wrote would be:
/(?<=").*(?=")/g
Beware the "greediness" of .*, which will match as much of the string as possible; you might want to use .*? to make it "non-greedy", or explicitly say "anything other than a quote mark" ([^"]*).
Another approach is to match the quotes normally, rather than with a lookaround, but "capture" the part between them: /"(.*?)"/. How you get to the "captured group" will vary depending on your programming language / tool, which you haven't specified.
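To see the greediness problem and the capture-group alternative in action, here is a minimal Python sketch. The input line with two quoted strings is hypothetical, chosen so that the greedy version has something to overrun.

```python
import re

# Hypothetical line containing two quoted strings.
line = 'A: $("div"), B: $("img")'

# Greedy .* runs from the first quote to the last one.
greedy = re.findall(r'(?<=").*(?=")', line)

# Matching the quotes and capturing what is between them, lazily.
captured = re.findall(r'"(.*?)"', line)

print(greedy)    # ['div"), B: $("img']
print(captured)  # ['div', 'img']
```

In Python, re.findall returns the contents of the group when the pattern has exactly one capturing group; other languages expose the group differently.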
The pattern (?!").*(?=") first asserts what is directly on the right is not a double quote (?!") which succeeds because for the example data that is a $.
Then .* is greedy and will match 0+ times any character except a newline and will match until the end of the string. Then it will backtrack to fulfill the assertion (?=") where directly on the right is a double quote.
If a positive lookbehind is supported, you might change the (?!") to (?<=") and the pattern could look like (?<=\$\(")[^"]+(?="\)) to not match empty double quotes.
Taking the dollar sign and the opening and closing parenthesis into account, you could use a capturing group and a negated character class [^"]+ to match any char except a double quote:
\$\("([^"]+)"\)
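As a quick check against the data from the question, the sketch below runs both the capturing-group pattern and the lookaround variant in Python. The variable names are illustrative; Python accepts the lookbehind here because it has a fixed width.

```python
import re

sample = '''A: $("div"),
B: $("img.some_class"),
B: $("img.some_class.another_class"),
C: $("#some_id"),
D: $(".some_class"),
E: $("input#some_id"),
F: $("div#some_id.some_class.some_other"),
G: $("div.some_class#some_id")'''

# Capturing group: match $("...") and keep only what is between the quotes.
selectors = re.findall(r'\$\("([^"]+)"\)', sample)

# Lookaround variant: the same text, but the delimiters are only asserted.
with_lookaround = re.findall(r'(?<=\$\(")[^"]+(?="\))', sample)

for s in selectors:
    print(s)
```

Both lists contain the same eight selectors, starting with div and ending with div.some_class#some_id.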
Using lookahead and lookbehind, as you asked:
/(?<=").*(?=")/g
Test Here : https://regex101.com/r/kCEuow/2
You might also consider using a capturing group:
/"([^"]+)"/g
Test the regex : https://regex101.com/r/kCEuow/1

Regexp: how to match a word that isn't followed or preceded by other characters

I want to replace mm units with cm units in my code. Since there are a lot of such replacements, I am using a regexp.
I wrote this expression:
(?!a-zA-Z)mm(?!a-zA-Z)
But it still matches words like summa, gamma and dummy.
How do I write the regexp correctly?
Use character classes and change the first (?!...) lookahead into a lookbehind:
(?<![a-zA-Z])mm(?![a-zA-Z])
^^^^^^^^^^^^^  ^^^^^^^^^^^^
The pattern matches:
(?<![a-zA-Z]) - a negative lookbehind that fails the match if there is an ASCII letter immediately to the left of the current location
mm - a literal substring
(?![a-zA-Z]) - a negative lookahead that fails the match if there is an ASCII letter immediately to the right of the current location
NOTE: If you need to make your pattern Unicode-aware, replace [a-zA-Z] with [^\W\d_] (and use re.U flag if you are using Python 2.x).
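The difference between the two patterns can be seen in a short Python sketch. The sample text is made up; note that (?!a-zA-Z) without brackets is a lookahead for the literal text a-zA-Z, which is why the original pattern rejects nothing here.

```python
import re

# Hypothetical source text mixing units and ordinary words.
source = "width: 10mm; summa gamma dummy; gap = 5 mm"

# Original pattern: matches every mm, including those inside words.
broken = re.findall(r"(?!a-zA-Z)mm(?!a-zA-Z)", source)

# Fixed pattern: mm must not have an ASCII letter on either side.
fixed = re.sub(r"(?<![a-zA-Z])mm(?![a-zA-Z])", "cm", source)

print(len(broken))  # 5
print(fixed)        # width: 10cm; summa gamma dummy; gap = 5 cm
```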
There's no need to use lookaheads and lookbehinds, so if you wish to simplify your pattern you can try something like this:
\d+\s?(mm)\b
This does assume that your millimetre symbol will always follow a number, with an optional space in-between, which I think that in this case is a reasonable assumption.
The \b checks for a word boundary to make sure the mm is not part of a word such as dummy etc.
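A minimal Python sketch of this approach follows. The sample text is hypothetical, and the second pattern is a regrouped variant (an assumption of this sketch, not the pattern above) that captures the number and optional space so they can be put back during replacement.

```python
import re

# Hypothetical sample: two real units, one ordinary word, one false friend.
source = "10mm, 5 mm, dummy, 3mmx"

# The pattern as given: group 1 is the unit after a number.
units = re.findall(r"\d+\s?(mm)\b", source)

# Regrouped variant: keep the number (group 1) and swap the unit.
converted = re.sub(r"(\d+\s?)mm\b", r"\1cm", source)

print(units)      # ['mm', 'mm']
print(converted)  # 10cm, 5 cm, dummy, 3mmx
```

The 3mmx case is left alone because there is no word boundary between mm and x.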

Don't match regex when trailed by character

Current regex: [[\/\!]*?[^\[\]]*?]
The goal is to successfully match [size=16] and [/size] in the following test case but not match [abc].
[size=16]1234[/size]
[abc](htt)
The regex currently matches the third case, [abc], which is always followed by an opening parenthesis. So I was thinking about using logic like: if the character after the group is "(", do not match.
But I don't really know how to write logic like that in regex...
Lookaround assertions look behind or ahead of the current position to see if there's a match and then proceed (or not) depending on the result.
A negative lookahead assertion looks like this:
(?!regex)
Stick it on the end, supplying it with the escaped parenthesis, and you're good to go:
[[\/\!]*?[^\[\]]*?](?!\()
https://regex101.com/r/2jEApI/1
What you want is a "negative lookahead".
A "lookaround" is a group which gets matched, but not included in the result. They start with (? and end with ).
There are two types of lookaround, lookahead and lookbehind:
A "lookbehind" looks backward and is indicated with a < immediately after the ? (i.e. ?<), but that's not what you're here for.
A "lookahead" looks forward and is the default if there is no < after the ?.
Both types can be either positive or negative:
A positive lookaround requires the included group to be present to form a match and is indicated with an =.
A negative lookaround requires that the included group is NOT present to form a match and is indicated with an !.
Once you have the basic structure for a positive or negative lookahead or lookbehind, the contents in the middle are normal regular expression syntax, the same as in any other group, so in your case you'll need an escaped left parenthesis \(.
Put it all together and you just need to tack this on the end of what you have: (?!\()
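Here is a small Python sketch of the before and after. One assumption: the brackets in the question's pattern are written in escaped form ([\[/!] and \]) so that Python's re module parses the character class without a nested-set warning; the behaviour is intended to be the same.

```python
import re

text = "[size=16]1234[/size]\n[abc](htt)"

# The question's pattern, with brackets escaped for Python.
base = r"[\[/!]*?[^\[\]]*?\]"

without_lookahead = re.findall(base, text)
with_lookahead = re.findall(base + r"(?!\()", text)

print(without_lookahead)  # ['[size=16]', '[/size]', '[abc]']
print(with_lookahead)     # ['[size=16]', '[/size]']
```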

how to get sub-string using regex if I specify start and end, without start characters?

I have string like this:
12abcc?p_auth=123ABC&ABC&s
The substring I want starts after "p_auth=" and ends at the first "&" symbol.
P.S. The '&' and 'p_auth=' themselves must not be included.
I have written this regex:
(p_auth).+?(?=&)
OK, that works well; it gets this substring:
p_auth=123ABC
But how do I get the string without 'p_auth='?
Use look-arounds:
(?<=p_auth=).*?(?=&)
The look-behind (?<=p_auth=) and the look-ahead (?=&) do not consume characters as they are zero-width assertions. They just check for the substring presence either before or after a certain subpattern.
A couple more words about (?<=p_auth=). It is a positive look-behind. Positive because it requires the pattern inside it to appear on the left, before the "main" subpattern. If the look-behind subpattern is found, the result is just "true" and the regex goes on checking the rest of the subpatterns. If not, the match fails at this position and the engine goes on looking for another match at the next index.
Here is some description from regular-expressions.info:
It [the look-behind] tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt. (?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.
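The look-around version can be tried on the string from the question with a few lines of Python. This is a sketch; the variable names are illustrative.

```python
import re

url = "12abcc?p_auth=123ABC&ABC&s"

# The look-behind and look-ahead are checked but not included in the match.
m = re.search(r"(?<=p_auth=).*?(?=&)", url)
token = m.group() if m else None

print(token)  # 123ABC
```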
In most cases, you do not really need look-arounds. In this case, you could just use
p_auth=(.*?)&
And get the first capturing group value.
The .*? pattern will look for any number of characters other than a newline, but as few as possible that are required to find a match. It is called lazy dot matching, because the ? symbol makes the * quantifier stop before the first symbol that is matched by the subsequent subpattern in the regular expression.
The .*& would match all the substring until the last & because the * quantifier is greedy - it will consume as many characters as it can match.
See more at Repetition with Star and Plus regular-expressions.info page.
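A short Python sketch contrasting lazy and greedy matching on the string from the question. The = is included in the literal prefix here so that group 1 holds only the value; the variable names are illustrative.

```python
import re

url = "12abcc?p_auth=123ABC&ABC&s"

# Lazy: stop at the first & that lets the match succeed.
lazy = re.search(r"p_auth=(.*?)&", url).group(1)

# Greedy: run to the end, then back off to the last &.
greedy = re.search(r"p_auth=(.*)&", url).group(1)

print(lazy)    # 123ABC
print(greedy)  # 123ABC&ABC
```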
p_auth=(.+?)(?=&)
Simply use this and grab the group 1 or capture 1.

How to only match a single instance of a character?

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters, and the lookahead asserts whether a match is possible or not. So the above matches only the letter a which satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
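The (*SKIP)(*F) verbs are PCRE-specific and are not available in Python's standard re module. A rough stand-in, offered only as a sketch of the same idea, is to let the first alternative consume runs of two or more a without capturing, and capture only in the second alternative. The helper name is hypothetical.

```python
import re

# a{2,} swallows runs of two or more a; only a lone a reaches group 1.
pattern = re.compile(r"a{2,}|(a)")

def lone_a_positions(subject):
    """Return the index of every a that is not part of a longer run."""
    return [m.start(1) for m in pattern.finditer(subject)
            if m.group(1) is not None]

print(lone_a_positions("fooaxyz"))   # [3]
print(lone_a_positions("fooaaxyz"))  # []
print(lone_a_positions("abaab"))     # [0]
```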
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an a char.
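The pattern can be checked against the strings from the question with a few lines of Python. This is a minimal sketch; the names are illustrative.

```python
import re

pattern = re.compile(r"(?<!a)a(?!a)")
samples = ["aa", "aaa", "fooaaxyz", "a", "fooaxyz"]

# For each sample, collect the index of every a with no a next to it.
results = {s: [m.start() for m in pattern.finditer(s)] for s in samples}

for s, positions in results.items():
    print(s, positions)
```

Only a and fooaxyz produce a match, at index 0 and index 3 respectively.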
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
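A small Python sketch of the whole-string check follows; the helper name is hypothetical. Note that this pattern answers a slightly different question from the lookaround version: it accepts a string only if it contains exactly one a overall.

```python
import re

exactly_one_a = re.compile(r"^[^a]*a[^a]*$")

def has_single_a(subject):
    """True if the whole string contains exactly one a."""
    return exactly_one_a.search(subject) is not None

print(has_single_a("fooaxyz"))   # True
print(has_single_a("fooaaxyz"))  # False
print(has_single_a("abca"))      # False: two separate a characters
```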