I have a regex question. How do I capture every instance of the word 'yellow' that is not surrounded by quotation marks?
You can do almost anything with regular expressions, but chasing the perfect one often gives you a strange, unreadable expression that only a few people will understand, or one that trips over a corner case in certain tools. So I prefer to stick with only the basic parts.
The regular expression for 'yellow' is yellow.
The regular expression for not quotes is [^"]
So I would say the simplest version is:
[^"]yellow[^"]
But then you may start thinking of edge cases. For example, the word may appear at the beginning or end of a line. If we are not trying to be clever and just do this by brute force, we can check for yellow at the beginning and at the end of a line.
Note that in those cases one side has no " simply because of the line boundary. (I am assuming that we don't need to check for a quote at the other end, because the word is then not surrounded by quotes anyway :-)
Beginning of line anchor is ^ and end of line anchor is $
^yellow Check for yellow at the beginning
yellow$ Check for yellow at end of line
Combine all these options with | (or)
[^"]yellow[^"]|^yellow|yellow$
But that is also going to catch 'yellow' in the middle of a word. I am not sure how many words have yellow in the middle. If this is a short English text, I would not worry about it. For a large blob of text, it would depend on the situation.
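A quick way to check the combined expression is to run it over a few sample lines. The sketch below uses Python's re module. The sample text is made up for illustration, and the lookaround variant at the end is a different technique from the brute-force one above, shown only for comparison.

```python
import re

# Hypothetical sample text: one quoted and three unquoted occurrences.
text = 'yellow sun\nthe "yellow" car\na yellow bus\nends in yellow'

# Brute-force pattern from above; MULTILINE makes ^ and $ apply per line.
brute = re.findall(r'[^"]yellow[^"]|^yellow|yellow$', text, flags=re.MULTILINE)
print(brute)

# Alternative (not the approach above): lookarounds test the neighbouring
# characters without making them part of the match.
strict = re.findall(r'(?<!")yellow(?!")', text)
print(strict)
```

Note that the brute-force pattern returns the neighbouring characters as part of the match (for example the spaces around the word), which matters if the result is used in a replacement.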
Sure you can do it more compressed. But do you want to?
Now you have to think about the specifics of the tool you are using and what needs to be escaped.
This is a follow-up question to what was solved yesterday:
Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format
I managed to find a regex that removes the offending colons in the main text area, but only by cutting out the text and pasting back the result, which can only be done one at a time.
I'm not sure how this can be done, but an expert may be able to tell me.
So I have footnote references in markdown format. Two instances of the same thing:
[^1]:
[^2]:
.
.
.
[^99]:
I might not have 99 in a document but I wanted to show I need to match two digits here again.
As I said, there are two instances of these numbered references in the text. One in the main text pointing to the footnote and the footnote at the end of the document.
What I need is to delete the colons from the main text and leave the
[^3]:
[^15]:
etc.
references at the end intact.
Because the main text references come after a word or at the end of a sentence (usually before the sentence-ending period), there is never a case where a reference starts a line (even if one seems to appear there once or twice because of word wrap).
I provided the exact opposite of my needs here:
Click here for Regex101 website link
I put in the exact opposite of what I want because I already knew of the ^ sign, which matches at the start of a line.
Now I would like to negate this, if possible, so that I delete the colons in the main text and not the ones down at the bottom.
Of course, it is likely that my approach is not good and you'll come up with a completely different approach. Especially because there doesn't seem to be a NOT operator in Regex, if I read correctly.
I repeat: the Regex101 example with the match and substitution is exactly the opposite of what I want.
I am not sure if you can play around in the substitution line to get the desired negative effect.
I could probably have asked how to remove the first occurrence of each colon, but I thought the important part of the problem is that the items not to be matched are always at the start of a line, and the others never are.
Thanks for any suggestions
In Notepad++ you might use a negative lookbehind asserting that the position is not at the start of a line, and use \K to clear the match buffer so that only the colon is matched and can be replaced by an empty string.
(?<!^)\[\^\d{1,2}]\K:
Explanation
(?<!^) Negative lookbehind, assert that the current position is not the start of a line
\[\^ Match [^
\d{1,2} Match 1 or 2 digits
] Match ] literally
\K Forget what is matched so far
: Match a colon
Regex demo
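For readers who want to try the idea outside Notepad++, here is a minimal sketch in Python. Python's re module does not support \K, so this version swaps in a capture group and puts it back in the replacement instead; the sample text is hypothetical.

```python
import re

# Hypothetical document: two references in the body, two footnotes below.
text = ('A claim[^1]: in the text. Another one[^12]:.\n'
        '\n'
        '[^1]: First note\n'
        '[^12]: Second note')

# (?<!^) with MULTILINE rejects references that start a line.
# The capture group stands in for \K: the reference is re-inserted,
# and only the colon after it is dropped.
pattern = re.compile(r'(?<!^)(\[\^\d{1,2}\]):', flags=re.MULTILINE)
result = pattern.sub(r'\1', text)
print(result)
```

The footnote definitions at the start of a line keep their colon, while the in-text references lose it.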
I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts are variable; they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to stand for any letter, number, space, or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, can be found here. In it, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. In HTML this is barely possible, as the user would never write the matching tag himself, but I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided by making sure I do not use any [\s\S]+? before I have first used a (\r)?(\n)?
[\s\S] matches every character because it is the union of two complementary sets; it is like . with the special option /s (dot matches newlines). Quantifiers are greedy by default, so the largest match is returned. A lazy quantifier only shortens the match from the right: the regex still starts at the leftmost position where a match is possible.
In the "correct" link, the token right after the shortest match must be geschreven. So another, more flexible way to write this without a lazy quantifier is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group that just wraps the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside it is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the beginning of the end token.
Following the comments, you can state explicitly what must not appear in the repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These variants show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, From or this can't appear between From and this
in the third, additionally, this or point can't appear between this and point
etc.
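The difference between the lazy version and the lookahead-in-the-loop version can be seen in a small sketch. It uses Python rather than VB.NET purely for illustration, with a made-up sample text that contains an extra "From" above the sentence to match.

```python
import re

# Hypothetical text: a stray "From" appears above the target sentence.
text = 'From the top.\nFrom - this - point - everything: rest'

# Lazy quantifiers: the match still begins at the first "From".
lazy = re.search(r'From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:',
                 text, flags=re.IGNORECASE)
print(repr(lazy.group(0)))

# Lookahead inside the loop: the gap after "From" may not contain
# another "From" (or "this"), so the first "From" cannot start a match.
tempered = re.search(
    r'From(?:(?!From|this)[\s\S])+this'
    r'(?:(?!point)[\s\S])+point'
    r'(?:(?!everything)[\s\S])+everything:',
    text, flags=re.IGNORECASE)
print(repr(tempered.group(0)))
```

The second pattern rejects the first "From" because the loop stops at the second "From", where the literal this cannot follow.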
I would like to extract sentences with the word "flung" in the whole text.
For example, in the following text, I'd like to extract the sentence "It was exactly as if a hand had clutched them in the centre and flung them aside." using regular expression.
I tried to use this .*? flung (?<sub>.*?)\., but it starts searching from the beginning of the line.
How could I solve the problem?
As she did so, a most extraordinary thing happened. The bed-clothes gathered themselves together, leapt up suddenly into a sort of peak, and then jumped headlong over the bottom rail. It was exactly as if a hand had clutched them in the centre and flung them aside. Immediately after, .........
Here you go,
[^.]* flung [^.]*\.
DEMO
OR
[^.?!]*(?<=[.?\s!])flung(?=[\s.?!])[^.?!]*[.?!]
DEMO
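As a rough check of the first pattern, here is a minimal Python sketch run against the first three sentences of the sample text. The variable names are illustrative only. Because [^.]* also picks up the space that follows the previous full stop, the match is stripped afterwards.

```python
import re

text = ("As she did so, a most extraordinary thing happened. "
        "The bed-clothes gathered themselves together, leapt up suddenly "
        "into a sort of peak, and then jumped headlong over the bottom rail. "
        "It was exactly as if a hand had clutched them in the centre "
        "and flung them aside.")

# [^.]* cannot cross a full stop, so each match stays inside one sentence.
sentences = [m.strip() for m in re.findall(r'[^.]* flung [^.]*\.', text)]
print(sentences)
```

This simple version treats every full stop as a sentence boundary, so abbreviations or decimals inside a sentence would cut the match short.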
Simply anything between dots:
without a dot
[A-Za-z," ]+word[A-Za-z," ]+
with a dot
[A-Za-z," ]+word[A-Za-z," ]+\.
"[A-Z]\\s?\\w*\\s?(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*(?:your target\\s)(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*(([^(\\.\\s)|(\\?\\s)|(!\\s)])|\\s)*[\\.|\\?|!]"
A sentence starts with a capital letter; in the middle it may contain a decimal number or an abbreviation.
(?<=^|\s)[A-Z][^!?.]*( word\s*)[^!?.]*(?=\.|\!|\?)
Before the first capital letter there is a line start or a whitespace character. Then any number of characters outside the set [!?.] may follow (possibly none), then your target word, with or without whitespace after it (in case it is at the end of the sentence), then again any number of characters outside [!?.] (possibly none), and finally the sentence ends with a dot, ! or ?.
I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on Super User explains how to use bookmarks in Notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
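The per-line replacement can be tried outside Notepad++ as well. The following is only a sketch in Python with made-up input; it also shows the caveat mentioned above, namely that only one occurrence per line survives (the last one, because the leading .* is greedy).

```python
import re

# Hypothetical input: the second line contains the wanted text twice.
text = 'foo mc_gross=12 bar\nx mc_gross = 7 y mc_gross=9 z\nno match here'

# Replace each matching line with just the captured declaration.
cleaned = re.sub(r'^.*\b(mc_gross\s*=\s*\d+)\b.*$', r'\1', text,
                 flags=re.MULTILINE)
print(cleaned)

# If every occurrence is needed, collecting the matches directly
# avoids the one-per-line limitation.
found = re.findall(r'mc_gross\s*=\s*\d+', text)
print(found)
```

Lines without the wanted text are left untouched by the replacement, which is why the separate "remove unmarked lines" step is still needed.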
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
I have some text like this:
Note: this is example text so the content is unimportant
CAT SAT ON A DOG
REASON: No reason
CONCERN: He was cold
BECAUSE: Cold weather
CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work
CAT SAT ON A HORSE
REASON: He wants to ride
CONCERN: He might fall off
BECAUSE: Saddle is too big
I am trying to write a regular expression that could capture only the 'CAT SAT ON A MOUSE' part, but am having problems capturing the full text.
I have tried:
(\bCAT\sSAT\sON\sA\sMOUSE)(.*)\n{2}
The idea was to match the beginning part of the string and then to capture everything up till two line breaks.
{2} is to capture the two line breaks.
I have tried many more variations but all I manage to do is to capture the first line only.
Any sort of help would be really appreciated.
You were asking for anything then two line breaks.
You needed to ask for a line break followed by anything twice.
Try this one:
(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}
I think your main problem is that your text uses \r\n to separate lines, and you're only looking for \n. Try this:
/^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*/m
(?:\r\n|[\r\n]) matches any of the three most common line separators (which I'll call newlines): \r\n, \r, or \n. It matches exactly one newline at a time, no matter which kind it is. Then [^\r\n]+ takes over, so there can only be one line separator per line. Since paragraphs are delimited by two newlines, the match ends there.
I took the liberty of anchoring the first line with a start anchor (^) in multiline mode (m). It's not absolutely necessary to do that, but it helps the regex find a match more quickly and, more importantly, fail more quickly when no match is possible.
(You haven't said which regex flavor you're working with, so I made a wild guess and used JavaScript syntax.)
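Here is a small sketch of the same pattern in Python instead of JavaScript, with the /m flag written as re.MULTILINE. The sample string is an abbreviated, hypothetical version of the question's text using \r\n line endings and a blank line between paragraphs.

```python
import re

# Hypothetical input with Windows line endings and blank-line separators.
text = ("CAT SAT ON A DOG\r\nREASON: No reason\r\n"
        "\r\n"
        "CAT SAT ON A MOUSE\r\nREASON: He eats mice\r\n"
        "CONCERN: He was hungry\r\nBECAUSE: Can opener didn't work\r\n"
        "\r\n"
        "CAT SAT ON A HORSE\r\nREASON: He wants to ride")

pattern = re.compile(
    r'^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*',
    flags=re.MULTILINE)

match = pattern.search(text)
paragraph_lines = match.group(0).splitlines()
print(paragraph_lines)
```

The match stops at the blank line because a newline must be followed by at least one non-newline character for the loop to continue.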
What language are you working with? That'll help a bit. In Perl, you can add the s modifier so that . also matches newlines, which treats the multi-line string as a single piece of text:
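#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';   # say() is not enabled by default

my $string = <<STRING;
CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work

This is a test, and not part of the string to match.
STRING

# /s lets . match newlines; the lazy .*? stops at the first blank line.
if ($string =~ /^(CAT[^\n]+.*?)\n\n/s) {
    say "Match: $1";
}
else {
    say "Didn't match";
}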
In Perl, adding the s on the end treats the entire string as a single line.
This might work:
(\bCAT[^\S\n]SAT[^\S\n]ON[^\S\n]A[^\S\n]MOUSE\b[\s\S]*?)\n{2}
or
(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*?)\n{2}
Edit - The regex must be slowed down after the first anchor; otherwise the next anchor could be passed over in favor of speed. This can be done with a non-greedy quantifier or a look-ahead assertion (which allows aggressive behavior at the cost of a check that basically nullifies its speed).
Edit2 - Sometimes it may be desirable to match an 'apparent' gap between paragraphs that could include non-newline whitespace.
For example \n\n will not match an apparent gap like this:
'start ... \nend of paragraph\n \n' when it should.
In that case, replacing \n{2} with \n[^\S\n]*\n will allow it to match.
Furthermore, since the non-greedy quantifier is used (in this case \b[\s\S]*?), it is possible to account for and match the paragraph end when it is at or near the end of the file. Putting this all together yields:
/(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)/
which now looks pretty complicated, but does the complete job.
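To see both behaviours of the final expression (a gap that contains stray whitespace, and a paragraph that runs to the end of the input), here is a minimal Python sketch. The two sample strings are hypothetical, and the regex is used without its surrounding slashes.

```python
import re

pattern = re.compile(
    r'(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)')

# Case 1: the "blank" line between paragraphs contains a space.
text_gap = ("CAT SAT ON A DOG\nREASON: No reason\n\n"
            "CAT SAT ON A MOUSE\nREASON: He eats mice\n"
            "CONCERN: He was hungry\nBECAUSE: Can opener didn't work\n"
            " \n"
            "CAT SAT ON A HORSE\nREASON: He wants to ride\n")
gap_paragraph = pattern.search(text_gap).group(1)
print(gap_paragraph)

# Case 2: the paragraph is the last thing in the input.
text_end = "CAT SAT ON A MOUSE\nREASON: He eats mice"
end_paragraph = pattern.search(text_end).group(1)
print(end_paragraph)
```

In the first case the match ends at the whitespace-only line; in the second, the $ alternative lets the paragraph end at the end of the string.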