How to use regex to look around a complex pattern? - regex

I have the following html element in Sublime Text:
<div class="exg"><div><strong class="syn">investigate</strong><span class="syn">, conduct investigations into, make inquiries into, inquire into, probe, examine, explore, research, study, look into, go into</span></div>
I want to use regex to select the content after and including the 5th comma in this element, stopping before
</span></div>.
So, in this case I'd want to select:
, examine, explore, research, study, look into, go into
So far, I was able to write this regex, which works:
(<div class="exg"><div><strong class="syn">(\w+)((\s)?(\w+)?)+</strong><span class="syn">((\,((\s)?(\w+)?)+)?){5})
This allows me to select the part before what I need to select. I tried to use this with a positive lookbehind, but it isn't working and I can't figure out how to fix it. Here is what I tried:
(?<=(<div class="exg"><div><strong class="syn">(\w+)((\s)?(\w+)?)+</strong><span class="syn">((\,((\s)?(\w+)?)+)?){3}))((\,?((\s)?(\w+)?)+?)+)

You make heavy use of parentheses. Also, your expression for matching the words between commas could be simpler. Replacing your groups with non-capturing ones, you get the expected match in the first (and only) capturing group with this regex:
(?<=<div class="exg"><div><strong class="syn">)(?:\s?\w)*<\/strong><span class="syn">(?:,(?:\s?\w)*){4}(.*?)(?=<\/span><\/div>)
By the way, if you want the match to start at the 5th comma, I think your quantifier should be {4}, so that only the first four comma-separated items are skipped (but I might have misunderstood).
Check the Demo
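For readers who want to try the capturing version outside an editor, here is a minimal sketch in Python. It assumes the element from the question is available as a string (the variable names `html`, `pattern` and `tail` are only illustrative) and reads group 1, which is where the text from the 5th comma onward ends up.

```python
import re

# The element from the question, split across lines only for readability.
html = (
    '<div class="exg"><div><strong class="syn">investigate</strong>'
    '<span class="syn">, conduct investigations into, make inquiries into, '
    'inquire into, probe, examine, explore, research, study, look into, '
    'go into</span></div>'
)

# Skip the headword and the first four comma-separated items, then
# capture lazily up to (but not including) the closing tags.
pattern = re.compile(
    r'(?<=<div class="exg"><div><strong class="syn">)'
    r'(?:\s?\w)*</strong><span class="syn">'
    r'(?:,(?:\s?\w)*){4}'
    r'(.*?)'
    r'(?=</span></div>)'
)

match = pattern.search(html)
tail = match.group(1) if match else None
print(tail)
```

Note that the lookbehind here is fixed-width, which is why Python's `re` accepts it; the variable-length lookbehind attempted in the question would be rejected by that engine.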
Update:
If you want to delete the matched text (i.e. replace it with an empty string), do the opposite: build one group before it and one group after it:
(<div class="exg"><div><strong class="syn">(?:\s?\w)*<\/strong><span class="syn">(?:,(?:\s?\w)*){4}).*?(<\/span><\/div>)
Demo
Then replace in your editor with \1\2 (the two groups one after the other, without the previously matched string in between).
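The same two-group replacement can be sketched in Python with `re.sub`, assuming the element is held in a string. The names `html`, `pattern` and `trimmed` are illustrative, not part of the original answer.

```python
import re

html = (
    '<div class="exg"><div><strong class="syn">investigate</strong>'
    '<span class="syn">, conduct investigations into, make inquiries into, '
    'inquire into, probe, examine, explore, research, study, look into, '
    'go into</span></div>'
)

# Group 1: everything up to and including the fourth synonym.
# Group 2: the closing tags. The lazy .*? in between is what gets dropped.
pattern = re.compile(
    r'(<div class="exg"><div><strong class="syn">(?:\s?\w)*</strong>'
    r'<span class="syn">(?:,(?:\s?\w)*){4})'
    r'.*?'
    r'(</span></div>)'
)

# Keep both groups and discard the text that was matched between them.
trimmed = pattern.sub(r'\1\2', html)
print(trimmed)
```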

Related

Regex to capture everything between two words only if falls under conditions [duplicate]

Following on from a previous question in which I asked:
How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?
I got this answer:
/outer-start.*?inner-start(.*?)inner-end.*?outer-end/
I would now like to know how to exclude certain strings from the text between the outer enclosing strings and the inner enclosing strings.
For example, if I have this text:
outer-start some text inner-start text-that-i-want inner-end some more text outer-end
I would like 'some text' and 'some more text' not to contain the word 'unwanted'.
In other words, this is OK:
outer-start some wanted text inner-start text-that-i-want inner-end some more wanted text outer-end
But this is not OK:
outer-start some unwanted text inner-start text-that-i-want inner-end some more unwanted text outer-end
Or to explain further, the expression between outer and inner delimiters in the previous answer above should exclude the word 'unwanted'.
Is this easy to match using regexes?
Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)
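As a rough illustration of that substitution, the sketch below builds the pattern in Python and runs it on the two sample strings from the question. The constant name `NO_UNWANTED` is just a label chosen here.

```python
import re

# Tempered dot: consume one character at a time, but only at positions
# where the word "unwanted" does not start.
NO_UNWANTED = r"(?:(?!unwanted).)*?"

pattern = re.compile(
    "outer-start" + NO_UNWANTED
    + "inner-start(.*?)inner-end"
    + NO_UNWANTED + "outer-end"
)

ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
not_ok = ("outer-start some unwanted text inner-start text-that-i-want "
          "inner-end some more unwanted text outer-end")

ok_match = pattern.search(ok)
not_ok_match = pattern.search(not_ok)

print(ok_match.group(1).strip())
print(not_ok_match)
```

The middle `(.*?)` is left untouched, so the captured text itself may still contain anything.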
However, this quickly runs into corner cases and caveats in any real (rather than example) use. If you ask about what you're really doing, with real examples (even simplified ones) instead of made-up ones, you'll likely get better answers.
A better question to ask yourself than "how do I do this with regular expressions?" is "how do I solve this problem?". In other words, don't get hung up on trying to solve a big problem with regular expressions. If you can solve half the problem with regular expressions, do so, then solve the other half with another regular expression or some other technique.
For example, make a pass over your data getting all matches, ignoring the unwanted text (read: get results both with and without the unwanted text). Then, make a pass over the reduced set of data and weed out those results that have the unwanted text. This sort of solution is easier to write, easier to understand and easier to maintain over time. And for any problem you're likely to need to solve with this approach, it will be fast enough.
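One way the two-pass idea might look in Python is sketched below. The function name `wanted_inner_texts`, its `banned` parameter, and the sample inner texts `keep-me` and `drop-me` are all hypothetical, chosen only to make the filtering visible.

```python
import re

# Pass 1: grab every block, keeping the text before, inside and after
# the inner delimiters, without caring about the banned word yet.
BLOCK = re.compile(
    r"outer-start(.*?)inner-start(.*?)inner-end(.*?)outer-end",
    re.DOTALL,
)

def wanted_inner_texts(data, banned="unwanted"):
    """Return inner texts of blocks whose outer parts lack `banned`."""
    results = []
    for before, inner, after in BLOCK.findall(data):
        # Pass 2: plain substring checks instead of more regex.
        if banned in before or banned in after:
            continue
        results.append(inner.strip())
    return results

data = (
    "outer-start some wanted text inner-start keep-me inner-end "
    "some more wanted text outer-end\n"
    "outer-start some unwanted text inner-start drop-me inner-end "
    "some more text outer-end"
)
print(wanted_inner_texts(data))
```

The filtering rule lives in ordinary code, so changing it later (several banned words, case-insensitive checks) does not require touching the pattern.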
You can replace .*? with
([^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]|unwan[^t]|unwant[^e]|unwante[^d])*?
This is a solution in "pure" regex; the language you are using might allow you to use some more elegant construct.
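A quick Python check of this alternation suggests one corner case worth knowing about: because `u[^n]` consumes two characters at once, a doubled `u` directly before the word appears to let `unwanted` slip through. The sketch below uses a non-capturing group in place of the plain parentheses; the test strings are made up for the demonstration.

```python
import re

# The alternation from the answer, wrapped in a non-capturing group.
NOT_UNWANTED = (
    r"(?:[^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]"
    r"|unwan[^t]|unwant[^e]|unwante[^d])*?"
)
pattern = re.compile("outer-start" + NOT_UNWANTED + "inner-start")

# Rejected as intended: no alternative can step over "unwanted".
plain = pattern.search("outer-start some unwanted text inner-start")

# Corner case: "uu" is consumed by u[^n], after which "nwanted" is
# just a run of non-u characters.
doubled = pattern.search("outer-start some uunwanted text inner-start")

print(plain)
print(doubled is not None)
```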
You can't easily do that with plain regexes, but some systems such as Perl have extensions that make it easier. One way is to use a negative look-ahead assertion:
/outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end/
The key is to split up the "unwanted" into ("u" not followed by "nwanted") or (not "u"). That allows the pattern to advance, but will still find and reject all "unwanted" strings.
People may start hating your code if you do much of this though. ;)
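To see how the `u(?!nwanted)|[^u]` split lets the pattern advance over ordinary words containing a `u`, here is a small Python sketch. The two test strings are invented for illustration, and note that the pattern as written only guards the text before inner-start.

```python
import re

pattern = re.compile(
    r"outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end"
)

# Words such as "menu" and "queue" contain a "u" that is not the start
# of "unwanted", so the first alternative lets the match move past them.
harmless = ("outer-start menu and queue inner-start text-that-i-want "
            "inner-end more outer-end")

# Here the "u" is followed by "nwanted", so neither alternative applies
# and the pattern cannot reach inner-start.
blocked = ("outer-start some unwanted text inner-start text-that-i-want "
           "inner-end more outer-end")

harmless_match = pattern.search(harmless)
blocked_match = pattern.search(blocked)

print(harmless_match.group(1).strip())
print(blocked_match)
```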
Tola, resurrecting this question because it had a fairly simple regex solution that wasn't mentioned. This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The idea is to build an alternation (a series of |) where the left sides match what we don't want in order to get it out of the way... then the last side of the | matches what we do want, and captures it to Group 1. If Group 1 is set, you retrieve it and you have a match.
So what do we not want?
First, we want to eliminate the whole outer block if there is unwanted between outer-start and inner-start. You can do it with:
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
This will be to the left of the first |. It matches a whole outer block.
Second, we want to eliminate the whole outer block if there is unwanted between inner-end and outer-end. You can do it with:
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
This will be the middle |. It looks a bit complicated because we want to make sure that the "lazy" *? does not jump over the end of a block into a different block.
Third, we match and capture what we want. This is:
inner-start\s*(text-that-i-want)\s*inner-end
So the whole regex, in free-spacing mode, is:
(?xs)
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
| # OR capture what we want
inner-start\s*(text-that-i-want)\s*inner-end
On this demo, look at the Group 1 captures on the right: It contains what we want, and only for the right block.
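The Group 1 approach carries over to engines without backtracking control verbs. The Python sketch below passes the free-spacing and dot-all options as flags rather than inline, and runs the pattern over three sample blocks (two containing `unwanted`, one clean) that were put together here for the demonstration.

```python
import re

PATTERN = re.compile(
    r"""
    # Block with "unwanted" before inner-start: matched, not captured.
    outer-start(?:(?!inner-start).)*?unwanted.*?outer-end
  |
    # Block with "unwanted" after inner-end: matched, not captured.
    outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
  |
    # What we actually want: captured in group 1.
    inner-start\s*(text-that-i-want)\s*inner-end
    """,
    re.VERBOSE | re.DOTALL,
)

bad_before = ("outer-start some unwanted text inner-start text-that-i-want "
              "inner-end some more text outer-end")
good = ("outer-start some wanted text inner-start text-that-i-want "
        "inner-end some more wanted text outer-end")
bad_after = ("outer-start some text inner-start text-that-i-want "
             "inner-end some more unwanted text outer-end")
text = "\n".join([bad_before, good, bad_after])

# Only matches that set group 1 count; the others merely consumed
# the blocks we wanted out of the way.
hits = [(m.start(1), m.group(1)) for m in PATTERN.finditer(text)
        if m.group(1) is not None]
print(hits)
```

Checking where the single hit starts confirms that it comes from the clean block and not from either rejected one.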
In Perl and PCRE (used for instance in PHP), you don't even have to look at Group 1: you can force the regex to skip the two blocks we don't want. The regex becomes:
(?xs)
(?: # non-capture group: the things we don't want
outer-start(?:(?!inner-start).)*?unwanted.*?outer-end # don't want this
| # OR (also don't want that)
outer-start(?:(?!outer-end).)*?inner-end(?:(?!outer-end).)*?unwanted.*?outer-end
)
(*SKIP)(*F) # we don't want this, so fail and skip
| # OR capture what we want
inner-start\s*\Ktext-that-i-want(?=\s*inner-end)
See demo: it directly matches what you want.
The technique is explained in full detail in the question and article below.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Try replacing the last .*? with: (?!(.*unwanted text.*))
Did it work?

Complicated regex to match anything NOT within quotes

I have this regex which scans a text for the word very: (?i)(?:^|\W)(very)[\W$] which works. My goal is to upgrade it and avoid doing a match if very is within quotes, standalone or as part of a longer block.
Now, I have this other regex which is matching anything NOT inside curly quotes: (?<![\S"])([^"]+)(?![\S"]) which also works.
My problem is that I cannot seem to combine them. For example the string:
Fred Smith very loudly said yesterday at a press conference that fresh peas will "very, very defintely not" be served at the upcoming county fair.
In this bit we have 3 instances of very, but I'm only interested in matching the first one and ignoring the whole Smith quotation.
What you describe is kind of tricky to handle with a regular expression. It's difficult to determine whether you are inside a quote. Your second regex is not effective as it only ignores the first very that is directly to the right of the quote and still matches the second one.
Drawing inspiration from this answer, that in turn references another answer that describes how to regex match a pattern unless ... I can capture the matches you want.
The basic idea is to use alternation | and match all the things you don't want and then finally match (and capture) what you do want in the final clause. Something like this:
"[^"]*"|(very)
We match quoted strings in the first clause but we don't capture them in a group and then we match (and capture) the word very in the second clause. You can find this match in the captured group. How you reference a captured group depends on your regex environment.
See this regex101 fiddle for a test case.
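In Python, for instance, the captured group can be checked per match as in the sketch below. The word boundaries and the case-insensitive flag are small additions made here (the question's original regex used `(?i)`); they are not part of the pattern above.

```python
import re

text = (
    'Fred Smith very loudly said yesterday at a press conference that '
    'fresh peas will "very, very defintely not" be served at the '
    'upcoming county fair.'
)

# First alternative swallows whole quoted strings without capturing;
# second alternative captures a standalone "very" in group 1.
pattern = re.compile(r'"[^"]*"|\b(very)\b', re.IGNORECASE)

# Keep only the matches where group 1 took part.
hits = [m.start(1) for m in pattern.finditer(text)
        if m.group(1) is not None]
print(hits)
```

The two occurrences inside the quotation are consumed by the first alternative, so they never reach the capturing group.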
This regex
(?i)(?<!(((?<DELIMITER>[ \t\r\n\v\f]+)(")(?<FILLER>((?!").)*))))\bvery\b(?!(((?<FILLER2>((?!").)*)(")(?<DELIMITER2>[ \t\r\n\v\f]+))))
could work under two conditions:
your regex engine allows unlimited lookbehind
quotes are delimited by spaces
Try it on http://regexstorm.net/tester

How to replace all lines based on previous lines in Notepad++?

I have an XML code:
<Line1>Matched_text Other_text</Line1>
<Line2>Text_to_replace</Line2>
How can I tell Notepad++ to find Matched_text and replace Text_to_replace with Replaced_text? There are several similar blocks of code, each with exactly the same Matched_text but different Other_text and Text_to_replace. I want to replace them all at once.
My idea is to put
Matched_text*<Line2>*</Line2>
in the Find field, and
Matched_text*<Line2>Replaced_text</Line2>
in the Replace field. I know that \1 in regex might be useful, but I don't know where to start.
The actual code is:
<Name>Matched_text, Other_text</Name>
<IsBillable>false</IsBillable>
<Color>-Text_to_replace</Color>
The regex you're looking for is something like the following.
Find: (Matched_text[\w,\s<>\/]*<Color>-).*(</Color>)
Replace: \1Replaced_text\2
Broken down:
`()` is how you tell regex that you want to keep things (for use in \1, \2, etc.); these are called capture groups in regex land.
`Matched_text[\w,\s<>\/]*` matches your anchor `Matched_text` and everything after it up to the next part of the expression.
`<Color>-).*(</Color>)` matches everything between <Color>- and </Color>, which is the part that gets replaced.
If you have any questions about the expression, I highly recommend looking at a regex cheatsheet.
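To try the expression outside Notepad++, a Python sketch along these lines should behave similarly. The second block (with `Different_text` and `Keep_me`) is a made-up example added to show that blocks without the anchor are left alone, and `\g<1>` is simply Python's unambiguous spelling of `\1` in a replacement string.

```python
import re

xml = (
    "<Name>Matched_text, Other_text</Name>\n"
    "<IsBillable>false</IsBillable>\n"
    "<Color>-Text_to_replace</Color>\n"
    "<Name>Different_text, Other_text</Name>\n"
    "<IsBillable>true</IsBillable>\n"
    "<Color>-Keep_me</Color>\n"
)

# Group 1 runs from the anchor up to and including "<Color>-";
# group 2 is the closing tag. Only the text between them is replaced.
pattern = re.compile(r"(Matched_text[\w,\s<>/]*<Color>-).*(</Color>)")

result = pattern.sub(r"\g<1>Replaced_text\g<2>", xml)
print(result)
```

The character class does not include `-`, which is what stops the first group from running past the nearest `<Color>-`.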

Regex to capture everything except the text that is coherent

I have this string and other ones like it:
<a href='/webapps/alrn-atomiclearning-bb_bb60/atomic/view.jsp?courseId=#X#course.pk_string#X#&contentId=#X#content.pk_string#X#&tt=Using+the+course+calendar&st=Blackboard+Learn%E2%84%A2+9.1+Instructor+-+Additional+Features+Training&d=00:02:09&tid=84425&sid=2389'><img src='/webapps/alrn-atomiclearning-bb_bb60/images/icon_play_UnlockedTutorial.png' alt='play icon'> Using the course calendar</a><br/>Duration: (00:02:09)
I'm trying to come up with a regex to capture everything EXCEPT the coherent labels that begin after and end just before the </a><br/>
So for example, I would capture everything and then delete it and end up only having:
Using the course calendar
as still there. I've tried multiple variations in Rubular but can only get up to the . Trying to use [^a-zA-Z|^\s]*<\/a>.* to skip every word character and whitespace up to the </a> does not work.
Thanks.
Using a lookahead and a lookbehind - the two sections in brackets. Modify the character class in the middle to capture everything you want to select.
(?<=> )[a-zA-Z\s]+(?=<\/)
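A short Python sketch of that lookaround pattern, run against an abbreviated version of the string from the question (the href and src values are shortened here purely for readability):

```python
import re

# Abbreviated form of the markup in the question.
html = (
    "<a href='/view.jsp?tid=84425&sid=2389'>"
    "<img src='/images/icon_play_UnlockedTutorial.png' alt='play icon'>"
    " Using the course calendar</a><br/>Duration: (00:02:09)"
)

# Lookbehind: the label must be preceded by "> ".
# Lookahead: it must be followed by "</". Neither is part of the match.
label = re.search(r"(?<=> )[a-zA-Z\s]+(?=</)", html)
print(label.group(0) if label else None)
```

Labels containing digits or punctuation would need a wider character class than `[a-zA-Z\s]`.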
Edit:
([\s\w\d\S\W\D]+)((?<=> )[a-zA-Z\s]+(?=<\/))\K([\s\w\d\S\W\D]+)
Ultimately this creates three match groups, the bit before what you want to be left with, the bit you want to be left with, and the bit after what you want to be left with. I'm not sure how, or if indeed you can, specify to select multiple matches as if it's a single match.
I'd still go with the selecting what you're actually after, if possible.
