what is wrong with my word boundary regex? [duplicate] - regex

I have the following little Python script:
import re

def main():
    thename = "DAVID M. D.D.S."
    theregex = re.compile(r"\bD\.D\.S\.\b")
    if re.search(theregex, thename):
        print("you did it")

main()
It's not matching. But if I adjust the regex just slightly and remove the last . it does work, like this:
\bD\.D\.S\b
I feel I'm pretty good at understanding regexes, but this has me baffled. My understanding is that \b (word boundary) is a zero-width match at a non-alphanumeric (and non-underscore) character. So I would expect
"\bD\.D\.S\.\b"
to match:
D.D.S.
What am I missing?

This doesn't do what you might think it does.
r"\bD\.D\.S\.\b"
Here is an explanation of that regex, with the same examples that are listed below:
D.D.S. # no match, as there is no word boundary after the final dot
D.D.S.S # matches since there is a word boundary between `.` and `S` at the end
Word boundaries are zero-width matchers between word characters (\w, which is [0-9A-Za-z_] plus other "letters" as defined by your locale) and non-word characters (\W, which is the inversion of the previous class). Dot (.) is not a word character, so D.D.S. (note trailing whitespace) has word boundaries (only!) in the following places: \bD\b.\bD\b.\bS\b. (I didn't escape the dots because I'm illustrating the word boundaries, not making a regular expression).
I assume you are trying to match an end of line or whitespace. There are two ways to do that:
r"\bD\.D\.S\.(?!\S)" # by negation: do not match a non-whitespace
r"\bD\.D\.S\.(?:\s|$)" # match either a whitespace character or end of line
I've refined the above regex explanation link to explain the negation example above (note the first ends in …/1 while the second ends in …/2; feel free to further experiment there, it is nice and interactive).
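As a quick check in Python, here is a minimal sketch that runs the original pattern and the two alternatives above against the name from the question. The variable names (strict, lookahead, either) are only illustrative.

```python
import re

name = "DAVID M. D.D.S."

strict = re.compile(r"\bD\.D\.S\.\b")          # original pattern from the question
lookahead = re.compile(r"\bD\.D\.S\.(?!\S)")   # by negation: no non-whitespace may follow
either = re.compile(r"\bD\.D\.S\.(?:\s|$)")    # a whitespace character or the end

print(strict.search(name))                # None: no word boundary after the final dot
print(lookahead.search(name).group())     # D.D.S.
print(either.search(name).group())        # D.D.S.
print(strict.search("D.D.S.S").group())   # D.D.S. (boundary between "." and "S")
```

Note that the (?:\s|$) variant consumes the whitespace character when one follows, while the lookahead variant never does.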

\.\b matches the dot in .bla: it requires a word character right after the dot.
\.\B is the opposite: it matches the dot in bla. but not in bla.bla, because it requires that no word character follows the dot. So use:
\bD\.D\.S\.\B
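A small Python sketch of the \B variant, using the name from the question plus one made-up counterexample ("D.D.S.S"); the variable names are only illustrative.

```python
import re

pattern = re.compile(r"\bD\.D\.S\.\B")

hit = pattern.search("DAVID M. D.D.S.")   # nothing word-like follows the last dot
miss = pattern.search("D.D.S.S")          # a word character follows the last dot

print(hit.group())   # D.D.S.
print(miss)          # None
```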

Related

Create regular expression remove word

Hello good afternoon!!
I'm new to the world of regular expressions and would like some help creating the following expression!
I have a query that returns the following values:
caixa-pod
config-pod
consultas-pod
entregas-pod
monitoramento-pod
vendas-pod
I would like the results to be presented as follows:
caixa
config
consultas
entregas
monitoramento
vendas
In this case, it would exclude the word "-pod" from each value.
I would try (.*)-pod. It is not clear where you want to use that regexp (the exact syntax can differ from tool to tool). I guess it is a dashboard variable.
You can try
\b[a-z]*(?=-pod)\b
This regex basically tells the regex engine to match
\b a word boundary
[a-z]* any number of lowercase characters in range a-z (feel free to extend to whatever is needed e.g. [a-zA-Z0-9] matches all alphanumeric characters)
(?=-pod) followed by -pod but exclude that from the result (positive lookahead)
\b another word boundary
\b matches a word boundary: a position between a word character and a non-word character, or the start / end of the string when a word character sits next to it.
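Both suggestions can be tried in Python. This is only a sketch: the question does not say which tool runs the regex, so the use of Python's re module here is an assumption, and the variable names are illustrative.

```python
import re

names = ["caixa-pod", "config-pod", "consultas-pod",
         "entregas-pod", "monitoramento-pod", "vendas-pod"]

lookahead = re.compile(r"\b[a-z]*(?=-pod)\b")   # match only the part before -pod
capture = re.compile(r"(.*)-pod")               # capture the part before -pod

# search() returns the first match, which is the name without the suffix
via_lookahead = [lookahead.search(n).group() for n in names]
# group(1) is whatever the parentheses captured
via_capture = [capture.fullmatch(n).group(1) for n in names]

print(via_lookahead)
print(via_capture)
```

With the lookahead version the whole match is already the wanted value; with (.*)-pod the value has to be read from capture group 1.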

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to write a regex that matches any run of letters, but ignores the match if it is followed by a :. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only ignores the last character before the :.
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
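The backtracking effect and the fix can be reproduced in Python with the string from the question. This is a minimal sketch; the variable names are illustrative.

```python
import re

text = "lame:joker"

# Original pattern: backtracks one letter and returns "lam"
naive = re.findall(r"[a-zA-Z]+(?!:)", text)
# Lookahead also forbids a following letter, so backtracking cannot help
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(naive)   # ['lam', 'joker']
print(fixed)   # ['joker']
```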
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", and also not in "12c" or "12_"
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_"; it fails only in "12:".
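The difference between the two number patterns can be checked in Python with the four sample strings mentioned above. A minimal sketch with illustrative variable names:

```python
import re

samples = ["12,", "12:", "12c", "12_"]

with_boundary = re.compile(r"\d+\b(?!:)")      # needs a non-word char (or end) after the digits
with_lookahead = re.compile(r"\d+(?![\d:])")   # only forbids a digit or a colon after the digits

boundary_hits = [s for s in samples if with_boundary.search(s)]
lookahead_hits = [s for s in samples if with_lookahead.search(s)]

print(boundary_hits)    # ['12,']
print(lookahead_hits)   # ['12,', '12c', '12_']
```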
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
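In Python, a run of this pattern on the string from the question might look like the sketch below (the variable name words is illustrative).

```python
import re

# findall returns the contents of the single capture group
words = re.findall(r"([a-zA-Z]+\b)(?!:)", "lame:joker")

print(words)   # ['joker']
```

The \b stops the engine from backtracking into the middle of lame, so the whole word is rejected once the colon is seen.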

How to fix regex to match the whole word, and not a substring? [duplicate]

I haven't found any success in fixing this regular expression:
B..y
I am currently searching a text file, and its output is the following:
Baby
Babylon
Babyland
eBaby
What should I change in the expression to only output 'Baby' and exclude the other three?
EDIT: What if I have another entry - 'Blay'? I need to get 'Baby' and 'Blay'.
The regex:
\bBaby\b
Test here.
To find both 'Baby' and 'Blay', you need to update the regex to:
\b(Baby|Blay)\b
Test here.
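A quick Python sketch of both patterns against the words from the question. The third pattern, \bB..y\b, is not part of the answer above: it is an assumption showing how the asker's original wildcard could be kept.

```python
import re

words = ["Baby", "Babylon", "Babyland", "eBaby", "Blay"]

only_baby = [w for w in words if re.search(r"\bBaby\b", w)]
baby_or_blay = [w for w in words if re.search(r"\b(Baby|Blay)\b", w)]
# Assumption: keep the asker's B..y wildcard and just add the boundaries
wildcard = [w for w in words if re.search(r"\bB..y\b", w)]

print(only_baby)      # ['Baby']
print(baby_or_blay)   # ['Baby', 'Blay']
print(wildcard)       # ['Baby', 'Blay']
```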
Explanations:
From here about \b:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
From here about (Baby|Blay) :
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping. If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog, and then another word boundary. If we had omitted the parentheses then the regex engine would have searched for a word boundary followed by cat, or, dog followed by a word boundary.
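The precedence point in the quoted passage can be seen in Python. The sample sentence below is made up for illustration.

```python
import re

text = "catalog hotdog cat dog"

# Without parentheses: (\bcat) OR (dog\b)
ungrouped = re.findall(r"\bcat|dog\b", text)
# With parentheses: boundary, then cat or dog, then boundary
grouped = re.findall(r"\b(cat|dog)\b", text)

print(ungrouped)   # ['cat', 'dog', 'cat', 'dog']
print(grouped)     # ['cat', 'dog']
```

The ungrouped pattern also picks up cat at the start of catalog and dog at the end of hotdog.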
In addition to virolino's answer:
The regex metacharacter \b matches word boundaries, i.e. positions between two characters where one is a word character and the other is not, plus the start and the end of the string if the first (or last, respectively) character is a word character.
A word character is anything matched by the \w character class. There seems to be no real consensus about what a word character actually is, but [A-Za-z0-9_] seems to be the minimum, hence your example should work with virolino's pattern (\bBaby\b) in any case.
Furthermore, the pattern also matches Baby in the following strings:
Baby-Boomer
Baby.Feed();
See my fork of virolinos regex test.
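A short Python check of those two strings, with Babylon added as a counterexample. A sketch only; the list name is illustrative.

```python
import re

samples = ["Baby-Boomer", "Baby.Feed();", "Babylon"]

# "-" and "." are non-word characters, so a boundary follows Baby in the first two
hits = [s for s in samples if re.search(r"\bBaby\b", s)]

print(hits)   # ['Baby-Boomer', 'Baby.Feed();']
```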

Regex - Match any word but ignore specific word [duplicate]

I want to match any word that starts or ends with "end", or does not contain it at all, but ignore the word "end" on its own, for example:
hello - would match
boyfriend - would match
endless - would match
endend - would match
but
end - would NOT match
I'm using ^(?!end)).*$ but it's not what I want.
Sorry for my English.
Try this:
^(?!(end)$).+$
This will match everything except end.
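Checked in Python against the examples from the question, one word per string (a sketch; the variable names are illustrative):

```python
import re

pattern = re.compile(r"^(?!(end)$).+$")
words = ["hello", "boyfriend", "endless", "endend", "end"]

# The lookahead rejects a string only when it is exactly "end"
matched = [w for w in words if pattern.search(w)]

print(matched)   # ['hello', 'boyfriend', 'endless', 'endend']
```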
You can use this \b(?!(?:end\b))[\w]+
Components:
\b -> a word boundary, here the start of each word.
(?! -> opens the negative lookahead that rejects the word end.
(?:end\b) -> non-capturing group containing the word end followed by a word boundary.
) -> closing parenthesis of the negative lookahead.
[\w]+ -> character class that matches the word itself.
Explanation: The regex only looks at positions that are word boundaries and rejects those where end is the whole word, i.e. [WORD BOUNDARY]end[WORD BOUNDARY]. [\w]+ then captures the word. You can extend this character class if you wish to capture some special characters like $ etc.
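Unlike an anchored pattern, this one can pull words out of running text. A Python sketch, with a sample sentence assembled from the words in the question:

```python
import re

text = "hello boyfriend endless endend end"

# Every word is returned except a standalone "end"
words = re.findall(r"\b(?!(?:end\b))[\w]+", text)

print(words)   # ['hello', 'boyfriend', 'endless', 'endend']
```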
So you want to match any word, but not "end"?
Unless I'm misunderstanding, a conditional statement is all that is needed. In pseudocode:
if (word != "end") {
    // Match
}
If you want to match all the words in a text that are not "end", you could just remove all the non-alpha characters, replace the pattern (^end | end | end$) with an empty string, and then do a string split.
The other answers with a single regex might be better then, because matching a simple pattern like these takes time roughly linear in the input length. (That does not hold for every pattern: backtracking engines can take far longer on pathological ones.)
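The split-and-compare idea can be sketched in Python as follows. This is a variation, not the exact steps described above: it extracts runs of ASCII letters with re.findall instead of deleting characters and splitting, and the sample text is made up.

```python
import re

text = "hello end, boyfriend endless end"

# Pull out the words, then drop the ones that are exactly "end"
kept = [w for w in re.findall(r"[A-Za-z]+", text) if w != "end"]

print(kept)   # ['hello', 'boyfriend', 'endless']
```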

Regex with start and end match

I'm having trouble matching the start and end of a regex in Python.
Essentially I'm confused about when to use word boundaries (\b) and the start/end anchors ^ and $.
My regex of
^[A-Z]{2}\d{2}
matches 4 characters (two uppercase letters, two digits), which is what I'm after.
It matches AJ99, RD22, CP44 etc.
However, I also noted that AJAJAJAJAJAJAJAJAJSJHS99 could be matched as well. I've tried using ^ and $ together to match the whole string. This doesn't work:
^[A-Z]{2}\d{2}$ # this doesn't work
but
^[A-Z]{2}\d{2} # this is fine
[A-Z]{2}\d{2}$ # this is fine
The string I'm matching against is 4 characters long, but in the first two examples the regex could pick the start and end of a longer string respectively.
s = "NZ43" # 4 characters, match perfect! However....
s = "AM27272727" # matches the first example
s = "HAHSHSHSHDS57" # matches the second example
The position anchors ^ and $ place a restriction on the position of your matched chars:
Analyzing your complete regex:
^[A-Z]{2}\d{2}$
^ matches only at the beginning of the text
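[A-Z]{2} exactly 2 uppercase ASCII alphabetic characters
\d{2} exactly 2 digits (equivalent to [0-9]{2} for ASCII input; in Python 3, \d in a str pattern also matches other Unicode decimal digits unless re.ASCII is set)
$ matches only at the end of the text (in Python, also just before a trailing newline)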
If you remove one or both of the 2 position anchors (^ or $) you can match a substring starting from the beginning or the end as you stated above.
If you want to match exactly a word without using the start/end of the string use the \b anchor, like this:
\b[A-Z]{2}\d{2}\b
\b matches between a regex word character (\w, i.e. one of [a-zA-Z0-9_]) and a character not in that group (\W), and also at the start/end of the text when the first/last character is a word character.
The regex above finds a match (WS24, or NZ43 in the last case) in all of the following strings:
WS24 alone
before WS24
WS24 after
before WS24 after
NZ43
It doesn't match:
AM27272727 (it will if it is AM27 272727 or AM27"272727)
HAHSHSHSHDS57 (it will if it is HAHSHSHSH DS57 or... you get it)
A demo online (the site will be useful to you also to experiment with regex).
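The same comparison can be run locally in Python. This is a minimal sketch using the strings from this thread; re.fullmatch is shown as an alternative to writing ^ and $ by hand, and the variable names are illustrative.

```python
import re

whole = re.compile(r"^[A-Z]{2}\d{2}$")     # the entire string must be the code
word = re.compile(r"\b[A-Z]{2}\d{2}\b")    # the code as a whole word inside a longer text

print(bool(whole.search("NZ43")))                    # True
print(bool(whole.search("AM27272727")))              # False
print(re.fullmatch(r"[A-Z]{2}\d{2}", "AM27272727"))  # None, same idea without anchors
print(word.findall("before WS24 after"))             # ['WS24']
print(word.findall("AM27272727"))                    # []
print(word.findall('AM27"272727'))                   # ['AM27']
print(word.findall("HAHSHSHSHDS57"))                 # []
```

Use the anchored form when the string should contain nothing but the code, and the \b form when the code may appear inside a longer text.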
The behaviour you show is exactly what it is supposed to be, so your question suggests that you may not have fully understood how regular expressions work.
As an addition to the very good and informative answer of GsusRecovery, here's a site that guides you through the concepts of regular expressions and tries to teach you the basics with a lesson-based system. To be clear, I do not want to tout this website, as there are plenty of those, but I made really good use of this one, so it's the one I'm suggesting.