Matching line without and with lower-case letters - regex

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For a example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?

To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?

Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, I'd assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your inital version:
^(.*?(?=[:lower:])).*?$
The reluctant qualifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The cursive parts are variable, might contains spaces or a new line might start from this point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line as I am unable to predict what could be between the fixed words tha I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How am I able to make my regex match as less characters as possible? Using "?" before the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real life, with html tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although, in html, barely possible to occur as the user would never write the matching tag himself, I do want to make sure my regex is correctjust in case
These things are certain: I know with what the part of text starts, no matter where in respect to the entire text, I know with what the part of text ends, and there are specific fixed words that might make the regex more reliable, but they can be ommitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches all character because of union of two complementary sets, it is like . with special option /s (dot matches newlines). and regex are greedy by default so the largest match will be returned.
Following correct link, the token just after the shortest match must be geschreven, so another way to write without using lazy expansion, which is more flexible is to prepend the repeated chracter set by a negative lookahead inside loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is for non capturing the group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead which ensures current position before next character is not the beginning of end token.
Following comments it can be explicitly mentioned what should not appear in repeating sequence :
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
to understand what the technic (?:(?!tokens)[\s\S])+ does.
in the first this can't appear between From and this
in the second From or this can't appear between From and this
in the third this or point can't appear between this and point
etc.

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=**Not**NeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)|(?<=**Not**NeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we capture it and dont use it any further. We could capture it with parenthesis and use it as another group, but that's out of scope here
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often times this is a file extension (i.e .c, .ts, or .json), or a top level domain (i.e. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point it is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. Its a variance because the amount of white-space that needs to be match against varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often times when working with regular expressions its necessary to work in reverse.
We will start at the end of the problem described above, and work backwards, henceforth; we are going to start at the The Before Variance (or #3)
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespce with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAS Regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine to not be concerned with the case of the letter, in other words; the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as: /\s+(?=\.com)/gi &/or /\s+(?=.COM)/gi. Everyone of the "Just Listed" regular expressions are equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though, only one match was returned in the end, and it was the correct match, when implemented in a program, or website, that's running in production, you can pretty much guarantee that the regex is not only going to fail, but its going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regex and regular expressions, and I am stuck in a situation where I want to apply a regex on an JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, and Minimum limit is 1 and Maximum limit is 5.
And for multiple values - they must be separated by comma (,)
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
By the help of stackoverflow, so far I am able to achieve only this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier, ensures that it is matched minimum 1 and maximum 5 characters
Regex demo
You've done pretty good. You need to add hyphen and dot to that first character class. Note: With the hyphen, since it delegates ranges within a character class, you need to position it where contextually it cannot be specifying a range--not to say put it where it seems like it would be an invalid range, e.g., 7-., but positionally cannot be a range, i.e., first or last. So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but most widely-used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1 which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parenthesis is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which signifies to the regex engine that this is a non-capturing group. Without going into great detail, there is a lot of overhead in maintaining the match locations--not that it's complex, per se, just potentially highly repetitive. Regex engines will match, store the information, then when the match fails, they "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they're storing that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.--the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by encapsulating our expression in ^ ... $ we are asserting that the subject matches in it's entirety, front-to-back. These assertions have zero-length; they consume no data, only assert a relative position. You can operate on them, e.g., s/^/ / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/ /)
Also, you didn't say if you need to capture the results to do something with it. My impression was it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parenthesis: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9. -]{1,5})(?:,(?1))*$
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions dot info, and a wonderful regex development and testing tool is regex101 dot com. And I can never say enough about stack overflow!

"Multiple pass" RegEx for fixing space indentations

Searched and found some apparently similar questions, that weren't quite.
I often find myself needing to replace leading 4-space indentations with tabs. I always do this with RegEx ^(\t*) {4}, replacing with $1\t. And then I just do multiple passes to catch nested indents. It works, it's easy. But I'm wondering, is it possible to write a RegEx that can do this in one pass (to handle nested indents)?
EDIT
Apologies for lack of input/output examples, I was in a hurry. Here's an example, let s mean space and t mean tab:
SMA
ssssRTP
ssssssssATR
ssssssssOLN
ssssOWH
ssssERE
TOGO
Output:
SMA
tRTP
ttATR
ttOLN
tOWH
tERE
TOGO
Essentially, the RegEx would need to allow for arbitrarily deeply nested chunks of 4 spaces. It does not need to allow for tabs following spaces in the initial input.
PCRE
(^\t*|\G) {4} replace with $1\t or (^|\G)( {4}|\t) replace with \t. You should use multiline mode.
How this works:
^\t* — this match start of string followed by any numbers of tabs.
\G — this match end of previous match.
​ {4} — this match four spaces.
So this regular expression match four spaces at the start of string or four spaces following four spaces already matched by this regular expression.
Tested this with .NET's regex engine. JavaScript's (at least Mozilla's) won't work, though; it relies on lookbehind, which isn't available. PCRE wants fixed-length lookbehinds, so this won't work there either, unfortunately.
(?<=^( {4}|\t)*) {4}
Basic idea is to match four spaces preceded by the beginning of a line plus all the spots where a previous match would naturally go. Since replacement is done atomically, there's no chance of missing such a spot; all such matches are gathered at once. Then make sure you're using Multiline flag and replace with a single tab character and you're good to go.
Test data, which is just random pseudocode in a vaguely Pythonesque style:
def a:
return true
# comment with embedded spaces etc.

Look behinds: all the rage in regex?

Many regex questions lately have some kind of look-around element in the query that appears to me is not necessary to the success of the match. Is there some teaching resource that is promoting them? I am trying to figure out what kinds of cases you would be better off using a positive look ahead/behind. The main application I can see is when trying to not match an element. But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
And this one from another question:
$url = "www.example.com/id/1234";
preg_match("/\d+(?<=id\/[\d])/",$url,$matches);
When is it truly better to use a positive look-around? Can you give some examples?
I realize this is bordering on an opinion-based question, but I think the answers would be really instructive. Regex is confusing enough without making things more complicated... I have read this page and am more interested in some simple guidelines for when to use them rather than how they work.
Thanks for all the replies. In addition to those below, I recommend checking out m.buettner's great answer here.
You can capture overlapping matches, and you can find matches which could lie in the lookarounds of other matches.
You can express complex logical assertions about your match (because many engines let you use multiple lookbehind/lookahead assertions which all must match in order for the match to succeed).
Lookaround is a natural way to express the common constraint "matches X, if it is followed by/preceded by Y". It is (arguably) less natural to add extra "matching" parts that have to be thrown out by postprocessing.
Negative lookaround assertions, of course, are even more useful. Combined with #2, they can allow you do some pretty wizard tricks, which may even be hard to express in usual program logic.
Examples, by popular request:
Overlapping matches: suppose you want to find all candidate genes in a given genetic sequence. Genes generally start with ATG, and end with TAG, TAA or TGA. But, candidates could overlap: false starts may exist. So, you can use a regex like this:
ATG(?=((?:...)*(?:TAG|TAA|TGA)))
This simple regex looks for the ATG start-codon, followed by some number of codons, followed by a stop codon. It pulls out everything that looks like a gene (sans start codon), and properly outputs genes even if they overlap.
Zero-width matching: suppose you want to find every tr with a specific class in a computer-generated HTML page. You might do something like this:
<tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)
This deals with the case in which a bare </tr> appears inside the row. (Of course, in general, an HTML parser is a better choice, but sometimes you just need something quick and dirty).
Multiple constraints: suppose you have a file with data like id:tag1,tag2,tag3,tag4, with tags in any order, and you want to find all rows with tags "green" and "egg". This can be done easily with two lookaheads:
(.*):(?=.*\bgreen\b)(?=.*\begg\b)
There are two great things about lookaround expressions:
They are zero-width assertions. They require to be matched, but they consume nothing of the input string. This allows to describe parts of the string which will not be contained in a match result. By using capturing groups in lookaround expressions, they are the only way to capture parts of the input multiple times.
They simplify a lot of things. While they do not extend regular languages, they easily allow to combine (intersect) multiple expressions to match the same part of a string.
Well one simple case where they are handy is when you are anchoring the pattern to the start or finish of a line, and just want to make sure that something is either right ahead or behind the pattern you are matching.
I try to address your points:
some kind of look-around element in the query that appears to me is not necessary to the success of the match
Of course they are necessary for the match. As soon as a lookaround assertions fails, there is no match. They can be used to ensure conditions around the pattern, that have additionally to be true. The whole regex does only match, if:
The pattern does fit and
The lookaround assertions are true.
==> But the returned match is only the pattern.
When is it truly better to use a positive look-around?
Simple answer: when you want stuff to be there, but you don't want to match it!
As Bergi mentioned in his answer, they are zero width assertions, this means they don't match a character sequence, they just ensure it is there. So the characters inside a lookaround expression are not "consumed", the regex engine continues after the last "consumed" character.
Regarding your first example:
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
I think there is a misunderstanding on your side, when you write "has a simple solution to capturing the .*". The .* is not "captured", it is the only thing that the expression does match. But only those characters are matched that have a "<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">" before and a "<\/a><span" after (those two are not part of the match!).
"Captured" is only something that has been matched by a capturing group.
The second example
\d+(?<=id\/[\d])
Is interesting. It is matching a sequence of digits (\d+) and after the sequence, the lookbehind assertion checks if there is one digit with "id/" before it. Means it will fail if there is more than one digit or if the text "id/" before the digit is missing. Means this regex is matching only one digit, when there is fitting text before.
teaching resources
www.regular-expressions.info
perlretut on Looking ahead and looking behind
I'm assuming you understand the good uses of lookarounds, and ask why they are used with no apparent reason.
I think there are four main categories of how people use regular expressions:
Validation
Validation is usually done on the whole text. Lookarounds like you describe are not possible.
Match
Extracting a part of the text. Lookarounds are used mainly due to developer laziness: avoiding captures.
For example, if we have in a settings file with the line Index=5, we can match /^Index=(\d+)/ and take the first group, or match /(?<=^Index=)\d+/ and take everything.
As other answers said, sometimes you need overlapping between matches, but these are relatively rare.
Replace
This is similar to match with one difference: the whole match is removed and is being replaced with a new string (and some captured groups).
Example: we want to highlight the name in "Hi, my name is Bob!".
We can replace /(name is )(\w+)/ with $1<b>$2</b>,
but it is neater to replace /(?<=name is )\w+/ with <b>$&</b> - and no captures at all.
Split
split takes the text and breaks it to an array of tokens, with your pattern being the delimiter. This is done by:
Find a match. Everything before this match is token.
The content of the match is discarded, but:
In most flavors, each captured group in the match is also a token (notably not in Java).
When there are no more matches, the rest of the text is the last token.
Here, lookarounds are crucial. Matching a character means removing it from the result, or at least separating it from its token.
Example: We have a comma separated list of quoted string: "Hello","Hi, I'm Jim."
Splitting by comma /,/ is wrong: {"Hello", "Hi, I'm Jim."}
We can't add the quote mark, /",/: {"Hello, "Hi, I'm Jim."}
The only good option is lookbehind, /(?<="),/: {"Hello", "Hi, I'm Jim."}
Personally, I prefer to match the tokens rather than split by the delimiter, whenever that is possible.
Conclusion
To answer the main question - these lookarounds are used because:
Sometimes you can't match text that need.
Developers are shiftless.
Lookaround assertions can also be used to reduce backtracking which can be the main cause for a bad performance in regexes.
For example: The regex ^[0-9A-Z]([-.\w]*[0-9A-Z])*#(1) can also be written ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])#(2) using a positive look behind (simple validation of the user name in an e-mail address).
Regex (1) can cause a lot of backtracking essentially because [0-9A-Z] is a subset of [-.\w] and the nested quantifiers. Regex (2) reduces the excessive backtracking, more information here Backtracking, section Controlling Backtracking > Lookbehind Assertions.
For more information about backtracking
Best Practices for Regular Expressions in the .NET Framework
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Runaway Regular Expressions: Catastrophic Backtracking
I typed this a while back but got busy (still am, so I might take a while to reply back) and didn't get around to post it. If you're still open to answers...
Is there some teaching resource that is promoting them?
I don't think so, it's just a coincidence I believe.
But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
This is most probably a C# regex, since variable width lookbehinds are not supported my many regex engines. Well, the lookarounds could be certainly avoided here, because for this, I believe it's really simpler to have capture groups (and make the .* lazy as we're at it):
(<td><a href="\/xxx\.html\?n=[0-9]{0,5}">).*?(<\/a><span)
If it's for a replace, or
<td><a href="\/xxx\.html\?n=[0-9]{0,5}">(.*?)<\/a><span
for a match. Though an html parser would definitely be more advisable here.
Lookarounds in this case I believe are slower. See regex101 demo where the match is 64 steps for capture groups but 94+19 = 1-3 steps for the lookarounds.
When is it truly better to use a positive look-around? Can you give some examples?
Well, lookarounds have the property of being zero-width assertions, which mean they don't really comtribute to matches while they contribute onto deciding what to match and also allows overlapping matches.
Thinking a bit about it, I think, too, that negative lookarounds get used much more often, but that doesn't make positive lookarounds less useful!
Some 'exploits' I can find browsing some old answers of mine (links below will be demos from regex101) follow. When/If you see something you're not familiar about, I probably won't be explaining it here, since the question's focused on positive lookarounds, but you can always look at the demo links I provided where there's a description of the regex, and if you still want some explanation, let me know and I'll try to explain as much as I can.
To get matches between certain characters:
In some matches, positive lookahead make things easier, where a lookahead could do as well, or when it's not so practical to use no lookarounds:
Dog sighed. "I'm no super dog, nor special dog," said Dog, "I'm an ordinary dog, now leave me alone!" Dog pushed him away and made his way to the other dog.
We want to get all the dog (regardless of case) outside quotes. With a positive lookahead, we can do this:
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*$)
to ensure that there are even number of quotes ahead. With a negative lookahead, it would look like this:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
to ensure that there are no odd number of quotes ahead. Or use something like this if you don't want a lookahead, but you'll have to extract the group 1 matches:
(?:"[^"]+"[^"]+?)?(\bdog\b)
Okay, now say we want the opposite; find 'dog' inside the quotes. The regex with the lookarounds just need to have the sign inversed, first and second:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*$)
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
But without the lookaheads, it's not possible. the closest you can get is maybe this:
"[^"]*(\bdog\b)[^"]*"
But this doesn't get all the matches, or you can maybe use this:
"[^"]*?(\bdog\b)[^"]*?(?:(\bdog\b)[^"]*?)?"
But it's just not practical for more occurrences of dog and you get the results in variables with increasing numbers... And this is indeed easier with lookarounds, because they are zero width assertions, you don't have to worry about the expression inside the lookaround to match dog or not, or the regex wouldn't have obtained all the occurrences of dog in the quotes.
Of course now, this logic can be extended to groups of characters, such as getting specific patterns between words such as start and end.
Overlapping matches
If you have a string like:
abcdefghijkl
And want to extract all the consecutive 3 characters possible inside, you can use this:
(?=(...))
If you have something like:
1A Line1 Detail1 Detail2 Detail3 2A Line2 Detail 3A Line3 Detail Detail
And want to extract these, knowing that each line starts with #A Line# (where # is a number):
1A Line1 Detail1 Detail2 Detail3
2A Line2 Detail
3A Line3 Detail Detail
You might try this, which fails because of greediness...
[0-9]+A Line[0-9]+(?: \w+)+
Or this, which when made lazy no more works...
[0-9]+A Line[0-9]+(?: \w+)+?
But with a positive lookahead, you get this:
[0-9]+A Line[0-9]+(?: \w+)+?(?= [0-9]+A Line[0-9]+|$)
And appropriately extracts what's needed.
Another possible situation is one where you have something like this:
#ff00fffirstword#445533secondword##008877thi#rdword#
Which you want to convert to three pairs of variables (first of the pair being a # and some hex values (6) and whatever characters after them):
#ff00ff and firstword
#445533 and secondword#
#008877 and thi#rdword#
If there were no hashes inside the 'words', it would have been enough to use (#[0-9a-f]{6})([^#]+), but unfortunately, that's not the case and you have to resort to .*? instead of [^#]+, which doesn't quite yet solve the issue of stray hashes. Positive lookaheads however make this possible:
(#[0-9a-f]{6})(.+?)(?=#[0-9a-f]{6}|$)
Validation & Formatting
Not recommended, but you can use positive lookaheads for quick validations. The following regex for instance allow the entry of a string containing at least 1 digit and 1 lowercase letter.
^(?=[^0-9]*[0-9])(?=[^a-z]*[a-z])
This can be useful when you're checking for character length but have patterns of varying length in the a string, for example, a 4 character long string with valid formats where # indicates a digit and the hyphen/dash/minus - must be in the middle:
##-#
#-##
A regex like this does the trick:
^(?=.{4}$)\d+-\d+
Where otherwise, you'd do ^(?:[0-9]{2}-[0-9]|[0-9]-[0-9]{2})$ and imagine now that the max length was 15; the number of alterations you'd need.
If you want a quick and dirty way to rearrange some dates in the 'messed up' format mmm-yyyy and yyyy-mm to a more uniform format mmm-yyyy, you can use this:
(?=.*(\b\w{3}\b))(?=.*(\b\d{4}\b)).*
Input:
Oct-2013
2013-Oct
Output:
Oct-2013
Oct-2013
An alternative might be to use a regex (normal match) and process separately all the non-conforming formats separately.
Something else I came across on SO was the indian currency format, which was ##,##,###.### (3 digits to the left of the decimal and all other digits groupped in pair). If you have an input of 122123123456.764244, you expect 1,22,12,31,23,456.764244 and if you want to use a regex, this one does this:
\G\d{1,2}\K\B(?=(?:\d{2})*\d{3}(?!\d))
(The (?:\G|^) in the link is only used because \G matches only at the start of the string and after a match) and I don't think this could work without the positive lookahead, since it looks forward without moving the point of replacement.)
Trimming
Suppose you have:
this is a sentence
And want to trim all the spaces with a single regex. You might be tempted to do a general replace on spaces:
\s+
But this yields thisisasentence. Well, maybe replace with a single space? It now yields " this is a sentence " (double quotes used because backticks eats spaces). Something you can however do is this:
^\s*|\s$|\s+(?=\s)
Which makes sure to leave one space behind so that you can replace with nothing and get "this is a sentence".
Splitting
Well, somewhere else where positive lookarounds might be useful is where, say you have a string ABC12DE3456FGHI789 and want to get the letters+digits apart, that is you want to get ABC12, DE3456 and FGHI789. You can easily do use the regex:
(?<=[0-9])(?=[A-Z])
While if you use ([A-Z]+[0-9]+) (i.e. the captured groups are put back in the resulting list/array/etc, you will be getting empty elements as well.
Note that this could be done with a match as well, with [A-Z]+[0-9]+
If I had to mention negative lookarounds, this post would have been even longer :)
Keep in mind that a positive/negative lookaround is the same for a regex engine. The goal of lookarounds is to perform a check somewhere in your "regular expression".
One of the main interest is to capture something without using capturing parenthesis (capturing the whole pattern), example:
string: aaabbbccc
regex: (?<=aaa)bbb(?=ccc)
(you obtain the result with the whole pattern)
instead of: aaa(bbb)ccc
(you obtain the result with the capturing group.)