I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
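To see where the boundaries actually fall, here is a small Python sketch. It assumes Python's re module, which treats \b the same way for these ASCII inputs; the helper name find_all is made up for illustration. It lists the boundary positions in the haystack and then runs the patterns from the question:

```python
import re

def find_all(pattern, text):
    """Return the text of every match of pattern in text (case-insensitive)."""
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]

haystack = "add +"

# Zero-width \b matches only at index 0 (before "a") and index 3 (after the second "d").
boundary_positions = [m.start() for m in re.finditer(r"\b", haystack)]

plain_plus = find_all(r"\+", haystack)            # the "+" is found
bounded_plus = find_all(r"\b\+", haystack)        # no boundary touches the "+"
bounded_word = find_all(r"\bplus", "add (plus)")  # "(" followed by "p" is a boundary

print(boundary_positions, plain_plus, bounded_plus, bounded_word)
```

No boundary is reported at index 4, the position just before the +, which is why the bounded pattern comes back empty.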
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
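A quick check of the two suggestions, written in Python with the literal plus escaped as \+ (an unescaped + directly after ? or * is read as part of a quantifier, or rejected as a syntax error, not as a literal plus). This is only a sketch of how the patterns behave on the haystacks from the question; matched_text is a hypothetical helper:

```python
import re

optional_space = re.compile(r"\b\s?\+", re.IGNORECASE)
first_plus = re.compile(r"\b[^+]*\+", re.IGNORECASE)

def matched_text(regex, text):
    """Return the first match as a string, or None when nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None

# The boundary after "add" lets \s? absorb the space before the "+".
space_hit = matched_text(optional_space, "add +")

# With "(" between the space and the "+", one optional space is not enough.
paren_miss = matched_text(optional_space, "add (+)")

# [^+]* skips everything up to the first "+", starting from the first boundary.
paren_hit = matched_text(first_plus, "add (+)")

print(repr(space_hit), repr(paren_miss), repr(paren_hit))
```

Note that both patterns include the characters between the boundary and the + in the match, so a capture group around \+ would be needed if only the plus itself is wanted.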
Hello good afternoon!!
I'm new to the world of regular expressions and would like some help creating the following expression!
I have a query that returns the following values:
caixa-pod
config-pod
consultas-pod
entregas-pod
monitoramento-pod
vendas-pod
I would like the results to be presented as follows:
caixa
config
consultas
entregas
monitoramento
vendas
In this case, it would exclude the word "-pod" from each value.
I would try (.*)-pod. It is not clear where you want to use the regexp, so the exact form may need to differ. I am guessing it is a dashboard variable.
You can try
\b[a-z]*(?=-pod)\b
This regex basically tells the regex engine to match
\b a word boundary
[a-z]* any number of lowercase characters in range a-z (feel free to extend to whatever is needed e.g. [a-zA-Z0-9] matches all alphanumeric characters)
(?=-pod) followed by -pod but exclude that from the result (positive lookahead)
\b another word boundary
\b matches a word boundary position between a word character and non-word character or position (start / end of string).
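Both suggestions can be tried against the values from the question. The sketch below uses Python's re module purely for illustration (the tool where the regex will finally run may use a different flavor), and the variable names are invented:

```python
import re

values = ["caixa-pod", "config-pod", "consultas-pod",
          "entregas-pod", "monitoramento-pod", "vendas-pod"]

# Capture-group approach: group 1 holds everything before "-pod".
capture = re.compile(r"(.*)-pod")

# Lookahead approach: "-pod" is checked but never becomes part of the match.
lookahead = re.compile(r"\b[a-z]*(?=-pod)\b")

via_capture = [capture.fullmatch(v).group(1) for v in values]
via_lookahead = [lookahead.search(v).group(0) for v in values]

print(via_capture)
print(via_lookahead)
```

The practical difference is where the result lives: with the capture group you read group 1, with the lookahead the whole match is already the trimmed value.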
I am trying to write a regex that matches any run of letters, but only when it is not followed by a :; if it is, the whole match should be ignored. I decided to use a negative lookahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
since i am using a character range it is matching one character at a time and only ignoring the last character before the : .
How do i ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before a colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
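The backtracking effect is easy to reproduce. Here is a minimal Python sketch (Python's re is assumed to backtrack the same way for these patterns; the helper name words is hypothetical) comparing the original pattern with the extended lookahead:

```python
import re

def words(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)

# Original pattern: the engine gives back one letter so that (?!:) succeeds.
original = words(r"[a-zA-Z]+(?!:)", "lame:joker")
original_short = words(r"[a-zA-Z]+(?!:)", "c:real")

# Extended lookahead: a match may not stop in front of another letter either.
fixed = words(r"[A-Za-z]+(?![A-Za-z:])", "lame:joker")
fixed_short = words(r"[A-Za-z]+(?![A-Za-z:])", "real:c")

print(original, original_short, fixed, fixed_short)
```

With the extra letter class in the lookahead, backing off by one character no longer helps, so the word in front of the colon is dropped entirely.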
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", and it also fails in "12c" and "12_" because there is no word boundary after the 2.
\d+(?![\d:]) matches 12 in "12,", "12c" and "12_"; it fails only in "12:".
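The difference between the two number patterns can be checked directly. A small Python sketch follows, assuming Python's re module and ASCII input; hits is a made-up helper:

```python
import re

samples = ["12,", "12:", "12c", "12_"]

def hits(pattern):
    """Map each sample to the list of matches the pattern finds in it."""
    return {s: re.findall(pattern, s) for s in samples}

# \b needs a non-word character (or the end) after the digits.
with_boundary = hits(r"\d+\b(?!:)")

# The lookahead only forbids a further digit or a colon.
with_lookahead = hits(r"\d+(?![\d:])")

# For letter-only words the boundary version is fine.
letters = re.findall(r"[A-Za-z]+\b(?!:)", "lame:joker")

print(with_boundary)
print(with_lookahead)
print(letters)
```

Which one is right depends on whether digits glued to letters or underscores should count as a number in your input.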
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.
i need a regex that matches an expression ending with a word boundary, but which does not consider the hyphen as a boundary.
i.e. get all expressions matched by
type ([a-z])\b
but do not match e.g.
type a-1
to rephrase: i want an equivalent of the word boundary operator \b which instead of using the word character class [A-Za-z0-9_], uses the extended class: [A-Za-z0-9_-]
You can use a lookahead for this, the shortest would be to use a negative lookahead:
type ([a-z])(?![\w-])
(?![\w-]) would mean "fail the match if the next character is in \w or is a -".
Here is an option that uses a normal lookahead:
type ([a-z])(?=[^\w-]|$)
You can read (?=[^\w-]|$) as "only match if the next character is not in the character class [\w-], or this is the end of the string".
See it working: http://www.rubular.com/r/NHYhv72znm
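For comparison, here is a short Python sketch (the sample strings and the helper accepted are invented for illustration) that runs the plain \b version and both lookahead versions over a few inputs:

```python
import re

plain_boundary = re.compile(r"type ([a-z])\b")
negative = re.compile(r"type ([a-z])(?![\w-])")
positive = re.compile(r"type ([a-z])(?=[^\w-]|$)")

samples = ["type a", "type a-1", "type a b", "type ab"]

def accepted(regex):
    """Return the samples in which the regex finds a match."""
    return [s for s in samples if regex.search(s)]

plain_ok = accepted(plain_boundary)  # the hyphen counts as a boundary here
negative_ok = accepted(negative)     # the hyphen is treated like a word character
positive_ok = accepted(positive)     # same result, written as a positive lookahead

print(plain_ok, negative_ok, positive_ok)
```

Only the plain \b pattern accepts "type a-1"; the two lookahead variants agree with each other on every sample.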
I had a pretty similar problem except I didn't want to consider the '*' as a boundary character. Here's what I did:
\b(?<!\*)([^\s\*]+)\b(?!\*)
Basically, if you're at a word boundary, look back one character and don't match if the previous character was an '*'. If you're in the middle, don't match on a space or asterisk. If you're at the end, make sure the end isn't an asterisk. In your case, I think you could use \w instead of \s. For me, this worked in these situations:
*word
wo*rd
word*
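A hedged check in Python, under two assumptions: that the trailing lookahead is meant to be (?!\*) with the asterisk escaped, and that "worked" means the three asterisk cases above are rejected while a plain word is accepted. The name no_star_edges is made up:

```python
import re

# Assumption: the final lookahead is (?!\*), i.e. the asterisk is escaped.
no_star_edges = re.compile(r"\b(?<!\*)([^\s\*]+)\b(?!\*)")

samples = ["word", "*word", "wo*rd", "word*"]

# True when the pattern finds a match somewhere in the sample.
results = {s: bool(no_star_edges.search(s)) for s in samples}

print(results)
```

Under those assumptions only the plain "word" is accepted; any word touching an asterisk on either side, or containing one, produces no match.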
I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the surrounding characters without making them part of the match, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
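The effect of consuming the trailing character can be reproduced with the query string from the question. This is a Python sketch; the variable names are illustrative only:

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Both neighbours are consumed, so the "e" in s+e+S cannot start another match.
consuming = re.findall(r"(?:\S)\++(?:\S)", query)

# The trailing neighbour is only looked at, so it stays available.
lookahead = re.findall(r"\S\++(?=\S)", query)

# No neighbours at all: every run of pluses is found.
plus_only = re.findall(r"\++", query)

print(consuming)
print(lookahead)
print(plus_only)
```

The first pattern stops at three matches, while the other two both find all four plus signs.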
You are correct: the fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; JavaScript before ES2018 doesn't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
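The lookaround variants differ mainly at the edges of the string. The following Python sketch uses an invented test string, "+a + b+c", and a hypothetical helper starts to show which plus signs each variant picks up:

```python
import re

text = "+a + b+c"

def starts(pattern):
    """Return (start index, matched text) for every match in text."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]

# Negative lookarounds: "no whitespace next to it", which also holds at the string start.
negative = starts(r"(?<!\s)\++(?!\s)")

# Positive lookarounds: a non-whitespace character must exist on both sides.
positive = starts(r"(?<=\S)\++(?=\S)")

# No lookbehind: the preceding character becomes part of the match.
no_lookbehind = starts(r"\S\++(?!\s)")

print(negative, positive, no_lookbehind)
```

The leading plus is matched only by the negative-lookaround version, and the plus surrounded by spaces is rejected by all three.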
If I have the following data:
"test1"."test2" AND "test1"."test2"
What regex can I use to match "test1"."test2"?
I tried the following but it did not work.
\b"test1"."test2"(\s+|$)
In the given example I'd like to match "test1"."test2", and, "test1"."test2"
\b matches at a word boundary, i.e. at a position that has a word character (letter, digit, or underscore) on one side and a non-word character or the start/end of the string on the other. Since " is not a word character (and assuming that there is no word character right before it), the assertion fails, and with it the entire regex.
Drop the \b, escape the dot, and you're set.
This should work
"test1"\."test2"
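A quick way to confirm this on the sample data, sketched in Python (assuming Python's re; the variable names are illustrative):

```python
import re

data = '"test1"."test2" AND "test1"."test2"'

# Original attempt: \b can never hold directly in front of a quote that
# follows a space or the start of the string.
attempt = [m.group(0) for m in re.finditer(r'\b"test1"."test2"(\s+|$)', data)]

# Without \b, and with the dot escaped so it only matches a literal ".".
fixed = re.findall(r'"test1"\."test2"', data)

print(attempt)
print(fixed)
```

The attempt with \b finds nothing, while the corrected pattern finds both occurrences.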