This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 6 years ago.
I was wondering how to match a line that does not contain a specific word using Python-style regex (regex only, without involving Python functions).
Example:
PART ONE OVERVIEW 1
Chapter 1 Introduction 3
I want to match lines that do not contain the word "PART".
This should work:
/^((?!PART).)*$/
Edit (by request): How this works
The (?!...) syntax is a negative lookahead, which I've always found tough to explain. Basically, it means "whatever follows this point must not match the regular expression /PART/." The site I've linked explains this far better than I can, but I'll try to break this down:
^ #Start matching from the beginning of the string.
(?!PART) #This position must not be followed by the string "PART".
. #Matches any character except line breaks (it will include those in single-line mode).
$ #Match all the way until the end of the string.
The ((?!xxx).)* idiom is probably hardest to understand. As we saw, (?!PART) looks at the string ahead and says that whatever comes next can't match the subpattern /PART/. So what we're doing with ((?!xxx).)* is going through the string letter by letter and applying the rule to all of them. Each character can be anything, but if you take that character and the next few characters after it, you'd better not get the word PART.
The ^ and $ anchors are there to demand that the rule be applied to the entire string, from beginning to end. Without those anchors, any piece of the string that didn't begin with PART would be a match. Even PART itself would have matches in it, because (for example) the letter A isn't followed by the exact string PART.
Since we do have ^ and $, if PART were anywhere in the string, the negative lookahead (?!PART) would fail at the position where PART begins, and the overall match would fail. Hope that's clear enough to be helpful.
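To make the idiom concrete, here is a minimal Python sketch run against the two example lines from the question. The re.MULTILINE flag is my assumption, so that ^ and $ apply to each line rather than to the whole text; the variable names are only for illustration.

```python
import re

text = "PART ONE OVERVIEW 1\nChapter 1 Introduction 3"

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
no_part = re.compile(r"^((?!PART).)*$", re.MULTILINE)
kept = [m.group(0) for m in no_part.finditer(text)]
print(kept)

# Without the anchors, a piece of "PART" itself can match:
# the search simply starts one character later, at "A".
unanchored = re.search(r"((?!PART).)+", "PART")
print(unanchored.group(0))
```

Only the second line is kept, and the unanchored search shows why the anchors matter: it still finds a match inside the word PART.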
This question already has answers here:
Regex match entire words only
(7 answers)
Closed 3 years ago.
I would like to conduct regex substitution. Here is the pattern I am using:
.*?fee.*?$|.*?charge.*?$
This matches the desired lines:
"fees credit card"
"charges for interest"
However, it also matches "coffee" and "feeder". I do not want it to match lines such as the following; how can I prevent these matches while still handling cases like fee and fees?
"coffee shop"
feeder cattle
You could use an alternation with 2 word boundaries \b to prevent the words being part of a larger word.
For your example data, if you want to match both the singular and the plural form, you can make the s at the end optional with a question mark.
^.*\b(?:fees?|charges?)\b.*$
^ Start of the string
.*\b Match any char except a newline 0+ times, followed by a word boundary
(?:fees?|charges?) Match fee or charge, each followed by an optional s
\b.* Word boundary, match any char except a newline 0+ times
$ Assert end of the string
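A quick way to check the word-boundary pattern against the four sample lines from the question, as a sketch in Python (the list and variable names are mine):

```python
import re

lines = [
    "fees credit card",
    "charges for interest",
    "coffee shop",
    "feeder cattle",
]

# \b on both sides keeps "fee" from matching inside "coffee" or "feeder".
pattern = re.compile(r"^.*\b(?:fees?|charges?)\b.*$")
matched = [line for line in lines if pattern.search(line)]
print(matched)
```

The first two lines match; "coffee shop" and "feeder cattle" do not, because there is no word boundary directly before or after "fee" in those words.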
Regex demo
If you are just trying to match those two lines, you can simply use an expression similar to this:
^(fees|charges).+$
If you wish to match certain words, you might add boundaries to group one similar to this expression:
^\b(fees|fee|charge|charges)\b(.+)$
If your pattern might be in the middle of string inputs, you can add another group in the left, similar to this expression:
(?:.+|)\b(fees|fee|charge|charges)\b(?:.+|)$
Designing a regular expression is much easier when real sample data is available.
This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 5 years ago.
Say that I have this text
I need to think of something to write here.
I am an alumni of the college.
The blue cat is blue the red cat is red.
I want to be able to select everything if the word "student" (ignoring case) does NOT exist in the text. Therefore, I need something to go beyond the first line. I originally had something like
/(?:[^student]).*/
but it isn't working correctly. I am not sure what "flavor" of Regex I am using but it is with PHP and in the backend of a Drupal site.
Thank you!
Problem
As suggested in ctwheels' comment, the regexp you have written, (?:[^student]).*, does not do what you think it does. Let's break down your attempt to illustrate the problem:
(?: ) <--- this part is a non-capturing group
[^student] <--- this part matches a single character not present in the list
.* <--- matches any character (except for line terminators)
I.e., your bracket expression is the same as [^tneduts], [^neduts], or even [^denstu]. It doesn't look for a word!
Solution
What you want instead is a negative lookahead, like this: ^(?!.*student).*$. Let's break it down.
^ <--- position at start of the string
(?!.*student) <--- look for anything followed by student, don't match if found
.* <--- take everything on the line unless the negative lookahead is in effect.
$ <--- position at the end of the string
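The difference between the two patterns can be seen in a short sketch. The question uses PHP; this uses Python's re module purely for illustration. The DOTALL and IGNORECASE flags, and the use of re.fullmatch in place of explicit ^ and $, are my assumptions to cover the "whole text, ignoring case" requirement from the question.

```python
import re

text = (
    "I need to think of something to write here.\n"
    "I am an alumni of the college.\n"
    "The blue cat is blue the red cat is red."
)

# [^student] only excludes single characters, so the match simply starts
# at the first character that is not one of s, t, u, d, e, n.
first = re.search(r"(?:[^student]).*", text)
print(first.group(0))

# Negative lookahead over the whole text: DOTALL lets . cross line breaks,
# IGNORECASE covers "Student", and fullmatch plays the role of ^ and $.
flags = re.DOTALL | re.IGNORECASE
whole = re.fullmatch(r"(?!.*student).*", text, flags)
rejected = re.fullmatch(r"(?!.*student).*", "I am a Student.\nMore text.", flags)
print(whole is not None, rejected is None)
```

The character-class version stops at the end of the first line, while the lookahead version accepts the full text and rejects any text containing "student" in any case.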
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Very quick and simple question.
Consider the vector of character strings ("AvAv", "AvAvAv")
Why does the pattern (Av)\1([^A]|$) match both strings?
The pattern says: have an instance of "Av", have another, then either have a character that is not an "A" or else come to an end. The first string clearly matches, but I do not see how the latter does. It has two copies of "Av", but then it fails to end (missing the second disjunct) and fails to be followed by a character other than "A" (missing the first disjunct), so how does the pattern successfully match it?
Thank you so much for your time and assistance. It is greatly appreciated.
Here is an explanation:
AvAv - matches (Av)\1$
In this case, we can match Av, followed by that captured quantity, followed by $ from the alternation. In the case of AvAvAv we also have a match:
AvAvAv - again matches (Av)\1$
^^^^ last four letters match
It is the same logic here, except that in order to match, we have to skip the first Av.
If the pattern were ^(Av)\1([^A]|$) then only AvAv would be a match.
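The explanation above can be checked with a small Python sketch; re.search is used because it looks for a match anywhere in the string, and the spans show where each match starts and ends.

```python
import re

pattern = r"(Av)\1([^A]|$)"

spans = {}
for s in ("AvAv", "AvAvAv"):
    m = re.search(pattern, s)
    spans[s] = m.span()
    print(s, m.span())

# Anchoring with ^ forces the match to begin at the first character.
anchored = re.search(r"^(Av)\1([^A]|$)", "AvAvAv")
print(anchored)
```

For "AvAvAv" the match covers only the last four characters, and the anchored pattern finds no match at all.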
A RegEx only needs to match a part of the string to be considered "a match".
In other words, for the second example your RegEx matches only the last four characters (AvAv) of
AvAvAv
and that is enough for the string to count as a match.
If you don't want it to match the second one, anchor the pattern with a caret ^:
^(Av)\1([^A]|$)
In this way the second one won't be matched.
This question already has answers here:
Regular expression for a string containing one word but not another
(5 answers)
Closed 3 years ago.
Have regex in our project that matches any url that contains the string
"/pdf/":
(.+)/pdf/.+
Need to modify it so that it won't match urls that also contain "help"
Example:
Shouldn't match: "/dealer/help/us/en/pdf/simple.pdf"
Should match: "/dealer/us/en/pdf/simple.pdf"
If lookarounds are supported, this is very easy to achieve:
^(?=.*/pdf/)(?!.*help)(.+)
See a demo on regex101.com.
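A minimal Python sketch of the lookahead approach on the two example URLs. It anchors the pattern at the start of the string with ^; the second half shows what can happen with an unanchored search, which is my own observation rather than part of the answer.

```python
import re

urls = [
    "/dealer/help/us/en/pdf/simple.pdf",
    "/dealer/us/en/pdf/simple.pdf",
]

# Both lookaheads are evaluated from the start of the string.
anchored = re.compile(r"^(?=.*/pdf/)(?!.*help)(.+)")
accepted = [u for u in urls if anchored.search(u)]
print(accepted)

# Without ^, the search can start after the "h" of "help",
# where no "help" lies ahead any more.
leak = re.search(r"(?=.*/pdf/)(?!.*help)(.+)", urls[0])
print(leak.group(1))
```

Engines or methods that always match from the beginning (for example re.match in Python) get the same effect without the explicit ^.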
(?:^|\s)((?:[^h ]|h(?!elp))+\/pdf\/\S*)(?:$|\s)
First, match either whitespace or the start of the line:
(?:^|\s)
Then we match any character that is not a space or h, OR any h that is not followed by elp, one or more times (+), until we find /pdf/; then we match non-space characters (\S) any number of times (*).
((?:[^h ]|h(?!elp))+\/pdf\/\S*)
If we also want to reject help after the /pdf/, we can repeat the same construct there:
((?:[^h ]|h(?!elp))+\/pdf\/(?:[^h ]|h(?!elp))+)
Finally, we match whitespace or the end of the line/string ($):
(?:$|\s)
The full match will include leading/trailing spaces, and should be stripped. If you use capture group 1, you don't need to strip the ends.
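As a sketch of how capture group 1 might be used, here is the variant that also rejects help after /pdf/, wrapped in the leading and trailing groups and applied to a made-up sentence (the sample text is hypothetical):

```python
import re

pattern = re.compile(
    r"(?:^|\s)((?:[^h ]|h(?!elp))+/pdf/(?:[^h ]|h(?!elp))+)(?:$|\s)"
)

# Hypothetical input containing one acceptable and one rejected URL.
text = "see /dealer/us/en/pdf/simple.pdf and /dealer/help/us/en/pdf/simple.pdf"

# Group 1 holds the URL without the surrounding whitespace.
found = [m.group(1) for m in pattern.finditer(text)]
print(found)
```

Only the URL without "help" is captured, and it needs no stripping because the surrounding whitespace sits outside group 1.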
Example on regex101
I'm having trouble matching the start and end of a regex on Python.
Essentially I'm confused about when to use word boundaries \b and when to use the start/end anchors ^ and $.
My regex of
^[A-Z]{2}\d{2}
matches 4-character strings (two uppercase letters, two digits), which is what I'm after
Matches AJ99, RD22, CP44 etc
However, I also noticed that AJAJAJAJAJAJAJAJAJSJHS99 could be matched as well. I've tried using ^ and $ together to match the whole string. This doesn't work
^[A-Z]{2}\d{2}$ # this doesn't work
but
^[A-Z]{2}\d{2} # this is fine
[A-Z]{2}\d{2}$ # this is fine
The string I'm matching against is 4 characters long, but in the first two examples the regex could pick the start and end of a longer string respectively.
s = "NZ43" # 4 characters, match perfect! However....
s = "AM27272727" # matches the first example
s = "HAHSHSHSHDS57" # matches the second example
The position anchors ^ and $ place a restriction on the position of your matched chars:
Analyzing your complete regex:
^[A-Z]{2}\d{2}$
^ matches only at the beginning of the text
[A-Z]{2} exactly 2 uppercase ASCII alphabetic characters
\d{2} exactly 2 digits (equivalent to [0-9]{2} for ASCII text)
$ matches only at the end of the text
If you remove one or both of the 2 position anchors (^ or $) you can match a substring starting from the beginning or the end as you stated above.
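The effect of each anchor can be checked with the three strings from the question. This is a small sketch; the variable names are mine, and the trailing-newline check at the end is an extra detail of Python's $ that the question does not raise.

```python
import re

start_only = r"^[A-Z]{2}\d{2}"
end_only = r"[A-Z]{2}\d{2}$"
both = r"^[A-Z]{2}\d{2}$"

results = {}
for s in ("NZ43", "AM27272727", "HAHSHSHSHDS57"):
    # One boolean per pattern: (start_only, end_only, both)
    results[s] = tuple(bool(re.search(p, s)) for p in (start_only, end_only, both))
    print(s, results[s])

# In Python, $ also matches just before a trailing newline;
# re.fullmatch requires the entire string to match.
trailing_newline_search = re.search(both, "NZ43\n") is not None
trailing_newline_full = re.fullmatch(r"[A-Z]{2}\d{2}", "NZ43\n") is not None
print(trailing_newline_search, trailing_newline_full)
```

With both anchors in place only "NZ43" matches, which is the behaviour the question was after.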
If you want to match exactly a word without using the start/end of the string use the \b anchor, like this:
\b[A-Z]{2}\d{2}\b
\b matches at the start or end of the text, and between a word character (in regex, \w, i.e. one of [a-zA-Z0-9_]) and a non-word character (\W).
The regex above matches the code (WS24 or NZ43) in all of the following strings:
WS24 alone
before WS24
WS24 after
before WS24 after
NZ43
It doesn't match:
AM27272727 (it would match if the string were AM27 272727 or AM27"272727)
HAHSHSHSHDS57 (it would match if the string were HAHSHSHSH DS57 or... you get it)
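The matching and non-matching cases listed above can be reproduced with a short Python sketch (the sample list is assembled from the strings in this answer):

```python
import re

code = re.compile(r"\b[A-Z]{2}\d{2}\b")

samples = [
    "WS24 alone",
    "before WS24 after",
    "AM27272727",
    "AM27 272727",
    'AM27"272727',
    "HAHSHSHSHDS57",
    "HAHSHSHSH DS57",
]

# findall returns every non-overlapping match in the string.
found = {s: code.findall(s) for s in samples}
for s, hits in found.items():
    print(repr(s), hits)
```

The strings without a separator next to the code produce no match, because \b needs a word character on one side and a non-word character (or the edge of the text) on the other.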
A demo online (the site will be useful to you also to experiment with regex).
The behaviour you show is exactly what is supposed to happen, which suggests that you may not have fully understood how regular expressions work yet.
As an addition to the very good and informative answer by GsusRecovery, here's a site that guides you through the concepts of regular expressions and teaches the basics with a lesson-based system. To be clear, I do not want to tout this website, as there are plenty like it, but I made good use of this one, so it's the one I'm suggesting.