I'd like to make a regex query in Elasticsearch with word boundaries; however, it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?
In the Elasticsearch regex flavor, there is no direct equivalent to a word boundary. A leading \b is roughly (^|[^A-Za-z0-9_]) if the word starts with a word char, and a trailing \b is roughly ($|[^A-Za-z0-9_]) if the word ends with a word char.
Thus, we need to make sure that the word is preceded by a non-word char or the start of the string, and followed by a non-word char or the end of the string. Since a Lucene regex is anchored by default (it must match the whole string), all we need to do is put .* beside each [^A-Za-z0-9_] and wrap the pair in an optional group:
(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?
Details
(.*[^A-Za-z0-9_])? - an optional group: any 0+ chars (other than line break chars; use (.|\n)* otherwise) followed by one non-word char. When the group matches nothing, word has to sit at the start of the string
word - a word
([^A-Za-z0-9_].*)? - an optional group: one non-word char followed by any 0+ chars, up to the end of the string (the end anchor is implicit in Lucene regex).
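The pattern above can be tried out locally before sending it to Elasticsearch. The following is only a sketch: it uses Python's re.fullmatch to approximate Lucene's whole-string anchoring (Python's engine is not Lucene's), and the helper names and the field name my_field are made up for illustration. It also assumes the word contains only word characters, so no escaping is needed.

```python
import re

NON_WORD = "[^A-Za-z0-9_]"


def lucene_word_pattern(word: str) -> str:
    """Build the boundary-emulating pattern for a plain word.

    Assumption: `word` contains only letters, digits or underscores,
    so it needs no escaping in either regex flavor.
    """
    return f"(.*{NON_WORD})?{word}({NON_WORD}.*)?"


def matches_like_lucene(pattern: str, text: str) -> bool:
    # Lucene patterns must match the whole value; re.fullmatch
    # approximates that behavior in Python.
    return re.fullmatch(pattern, text) is not None


PATTERN = lucene_word_pattern("word")

# Body of an Elasticsearch regexp query. "my_field" is a hypothetical
# field name; the regexp query is applied to indexed terms, so this
# assumes a field whose whole value is stored as one term.
query_body = {"query": {"regexp": {"my_field": {"value": PATTERN}}}}

for sample in ["word", "a word here", "word!", "swordfish", "words", "word_"]:
    print(f"{sample!r}: {matches_like_lucene(PATTERN, sample)}")
```

The first three samples match, while swordfish, words and word_ do not, because a word char sits directly next to word in each of them.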
I need to match the following cases:
no word\ on your
no word\on your
no word\
no word
while ignoring:
no word on your
My regular expression doesn't cover these cases:
^.*(?=\bword\b)(?:\\.*|)$
To verify results you can use prepared configuration.
...Looking for help.
Your positive lookahead is followed by the optional \\.* pattern and then by the end of string, so word would have to start either at the same position as \ or at the very end of the string. Neither is possible, which means your regex will never match any text.
You can use
^.*\bword(?:\\.*)?$
See the regex demo. There is no need to add another \b: the \ after word is already a non-word char, and if the optional \\.* part is absent, the end of string already acts as a trailing word boundary for word.
Details:
^ - start of string
.* - zero or more chars other than line break chars as many as possible
\b - a word boundary
word - a word
(?:\\.*)? - an optional non-capturing group matching \ and then any zero or more chars other than line break chars as many as possible
$ - end of string.
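As a quick check, both patterns can be run against the sample strings from the question. This is a sketch in Python, whose re module supports \b and lookaheads; the variable names are illustrative and the question does not say which engine was used.

```python
import re

# Pattern suggested in the answer.
FIXED = re.compile(r"^.*\bword(?:\\.*)?$")
# Pattern from the question, kept for comparison.
ORIGINAL = re.compile(r"^.*(?=\bword\b)(?:\\.*|)$")

should_match = [
    "no word\\ on your",
    "no word\\on your",
    "no word\\",
    "no word",
]
should_not_match = ["no word on your"]

all_samples = should_match + should_not_match
fixed_results = {s: bool(FIXED.search(s)) for s in all_samples}
original_results = {s: bool(ORIGINAL.search(s)) for s in all_samples}

for s in all_samples:
    print(f"{s!r}: fixed={fixed_results[s]}, original={original_results[s]}")
```

The fixed pattern accepts the four wanted strings and rejects the last one, while the original pattern rejects all five.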
I thought [^0-9a-zA-Z]* excludes all alpha-numeric letters, but allows for special characters, spaces, etc.
With the search string [^0-9a-zA-Z]*ELL[^0-9A-Z]* I expect outputs such as
ELL
ELLs
The ELL
Which ELLs
However, I also get the following outputs:
Ellis Island
Bellis
How to correct this?
You may use
(?:\b|_)ELLs?(?=\b|_)
See the regex demo.
It will find ELL or ELLs if it is surrounded with _ or non-word chars, or at the start/end of the string.
Details:
(?:\b|_) - a non-capturing alternation group matching a word boundary position (\b) or (|) a _
ELLs? - matches ELL or ELLs since s? matches 1 or 0 s chars
(?=\b|_) - a positive lookahead that requires the presence of a word boundary or _ immediately to the right of the current location.
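A small sketch of how this pattern behaves on the strings from the question, written in Python as an assumption (the question does not name its regex engine). The extra samples my_ELL_value and BELLIS are added here only for illustration.

```python
import re

ELL = re.compile(r"(?:\b|_)ELLs?(?=\b|_)")

samples = [
    "ELL",
    "ELLs",
    "The ELL",
    "Which ELLs",
    "my_ELL_value",   # illustrative: ELL surrounded by underscores
    "Ellis Island",
    "Bellis",
    "BELLIS",         # illustrative: ELL inside a longer uppercase word
]

# Keep only the samples in which the pattern finds a match.
hits = [s for s in samples if ELL.search(s)]
print(hits)
```

Only the first five samples are kept. The match is case-sensitive here, so Ellis Island and Bellis are rejected, and BELLIS is rejected because ELL is surrounded by letters.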
Change the * to +.
A * means any amount, including none. A + means one or more. What you probably want, though, is a word boundary:
\bELL\b
A word boundary is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]). More here about that:
What is a word boundary in regexes?
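To make the definition concrete, here is a short Python sketch (the language choice and helper name are assumptions) that lists the positions where \b matches and then applies \bELL\b to a few strings.

```python
import re


def boundary_positions(text: str) -> list[int]:
    """Return every index in `text` where \\b matches."""
    return [m.start() for m in re.finditer(r"\b", text)]


# "The ELL": boundaries sit at the start, around the space, and at the end.
positions = boundary_positions("The ELL")
print(positions)

STRICT = re.compile(r"\bELL\b")
candidates = ["ELL", "The ELL", "ELLs", "Bellis"]
strict_hits = [s for s in candidates if STRICT.search(s)]
print(strict_hits)
```

Note that \bELL\b rejects ELLs as well, since the s is a word char right after ELL. If the plural should also match, \bELLs?\b is one option.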
I want to match all the words that contains the word "oana". I put "OANA" with uppercase letters in some words, at the beginning, middle, and at the end of words.
blah OANAmama blah aOANAtata aOANAt msmsmsOANAasfasfa mOANAmsmf OANAtata OANA3 oanTy
Anyway, I made a regex, but it is not very good, because it doesn't select all words that contains "oana"
\b\w+(oana)\w+\b
Can anyone give me another solution?
You need to use a case insensitive flag and replace + with *:
/\b\w*oana\w*\b/i
See the regex demo (a global modifier may or may not be used, depending on the regex engine). The case insensitive modifier may be passed as an inline option in some regex engines - (?i)\b\w*oana\w*\b.
Here,
\b - a word boundary
\w* - 0+ word chars
oana - the required char string inside a word
\w* - 0+ word chars
\b - a word boundary
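The difference between the two patterns can be seen on the sample text from the question. This sketch uses Python's re module as an assumption; the case-insensitive flag is applied to both patterns so that only the + versus * change is compared.

```python
import re

TEXT = (
    "blah OANAmama blah aOANAtata aOANAt msmsmsOANAasfasfa "
    "mOANAmsmf OANAtata OANA3 oanTy"
)

# Suggested pattern: \w* allows "oana" at the start or end of a word.
CONTAINS_OANA = re.compile(r"\b\w*oana\w*\b", re.IGNORECASE)
found = CONTAINS_OANA.findall(TEXT)

# Pattern from the question: \w+ demands at least one word char
# on both sides of "oana".
ORIGINAL = re.compile(r"\b\w+(oana)\w+\b", re.IGNORECASE)
original_found = [m.group(0) for m in ORIGINAL.finditer(TEXT)]

print(found)
print(original_found)
```

With \w*, all seven words containing OANA are returned. With \w+, OANAmama, OANAtata and OANA3 are missed because nothing precedes OANA in them.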