elasticsearch regex pattern [duplicate] - regex

I'd like to make a regex query in Elasticsearch with word boundaries; however, it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?

In the Elasticsearch regex flavor, there is no direct equivalent of a word boundary. Conceptually, a leading \b is like (^|[^A-Za-z0-9_]) if the word starts with a word char, and a trailing \b is like ($|[^A-Za-z0-9_]) if the word ends with a word char.
Thus, we need to make sure that the word is preceded and followed by either a non-word char or the start/end of the string. Since a Lucene regex is anchored by default (it must match the whole string), all we need to do to make [^A-Za-z0-9_] optional at the start/end of the string is to put .* beside it and wrap both in an optional group:
(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?
Details
(.*[^A-Za-z0-9_])? - either the start of string, or any 0+ chars (other than line break chars; use (.|\n)* if line breaks must be matched too) followed by any char other than a word char (basically, it is the start of string followed by 1 or 0 occurrences of the pattern inside the group)
word - a word
([^A-Za-z0-9_].*)? - an optional sequence of any char other than a word char followed by any 0+ chars, followed by the end of string position (implicit in Lucene regex).
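
The workaround can be checked locally with a rough Python sketch. Python's re module is not the Lucene engine, so re.fullmatch is used here only as a stand-in for Lucene's implicit anchoring. The helper names and the field name title.keyword are hypothetical; the query body follows the shape of the Elasticsearch regexp query.

```python
import re

NON_WORD = "[^A-Za-z0-9_]"


def lucene_word_pattern(word: str) -> str:
    """Build the boundary-emulating pattern for a plain alphanumeric word.

    Hypothetical helper: the word is inserted as-is, so it must not contain
    characters that are reserved in Lucene regex syntax.
    """
    return f"(.*{NON_WORD})?{word}({NON_WORD}.*)?"


def matches(word: str, text: str) -> bool:
    # fullmatch approximates Lucene's behaviour of matching the whole string.
    return re.fullmatch(lucene_word_pattern(word), text) is not None


def build_regexp_query(field: str, word: str) -> dict:
    # Body of an Elasticsearch regexp query; the field name is an assumption.
    return {"regexp": {field: {"value": lucene_word_pattern(word)}}}


if __name__ == "__main__":
    for text in ["word", "a word here", "word.", "swordfish", "words", "my_word"]:
        print(repr(text), matches("word", text))
    print(build_regexp_query("title.keyword", "word"))
```

Only the first three sample strings are expected to match, since in the others "word" is adjacent to a letter or an underscore.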

Related

Can't match word with following \ and all symbols after in my regular expression

I need to match the following cases:
no word\ on your
no word\on your
no word\
no word
while ignoring:
no word on your
My regular expression doesn't cover my needs:
^.*(?=\bword\b)(?:\\.*|)$
To verify results you can use prepared configuration.
...Looking for help.
Your positive lookahead is followed by the optional \\.* pattern and then the end of string. Since a lookahead consumes no text, word would have to start either at the same position as \ or at the very end of the string. Neither is possible, so your regex will never match any text.
You can use
^.*\bword(?:\\.*)?$
See the regex demo. There is no need to add another \b after word: a \ right after word is already a non-word char, and if the optional \\.* part is absent, the end of string already acts as a trailing word boundary for word.
Details:
^ - start of string
.* - zero or more chars other than line break chars as many as possible
\b - a word boundary
word - a word
(?:\\.*)? - an optional non-capturing group matching \ and then any zero or more chars other than line break chars as many as possible
$ - end of string.
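
A quick way to confirm the behaviour is to run the pattern against the sample strings from the question. This is a minimal Python sketch; the helper name is_match is made up for illustration, and other regex engines may differ in details such as how $ treats a trailing line break.

```python
import re

# The suggested pattern, written as a raw string so backslashes stay literal.
PATTERN = re.compile(r"^.*\bword(?:\\.*)?$")

SHOULD_MATCH = [
    "no word\\ on your",
    "no word\\on your",
    "no word\\",
    "no word",
]
SHOULD_NOT_MATCH = ["no word on your"]


def is_match(text: str) -> bool:
    """Return True if the whole line satisfies the pattern."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(f"{text!r}: {is_match(text)}")
```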

Regex that matches strings that are all lower case and do not contain specific string

I need a regular expression to ensure that entries in a form 1) are all lower case AND 2) do not contain the string ".net"
I can do either of those separately:
^((?!.net).)*$ gives me strings that do not contain .net.
[a-z] only matches lower-cased inputs. But I have not been able to combine these.
I've tried:
^((?!.net).)(?=[a-z])*$
(^((?!.net).)*$)([a-z])
And a few others.
Can anyone spot my error? Thanks!
The dot in your pattern matches any char except a newline, so you can replace it with a negated character class that also excludes uppercase chars.
As suggested by @Wiktor Stribiżew, to rule out a string that contains .net you can use a negative lookahead (?!.*\.net), where \.net (note the escaped dot) is preceded by .* to match any char 0+ times.
^(?!.*\.net)[^\nA-Z]+$
^ Start of string
(?!.*\.net) negative lookahead to make sure the string does not contain .net
[^\nA-Z]+ Match 1+ times any character except a newline or a char A-Z
$ End of string
Regex demo
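
The following Python sketch shows the combined pattern at work, and why escaping the dot matters. The sample inputs and the helper name is_valid are invented for illustration.

```python
import re

# Combined check: no ".net" anywhere, and no uppercase A-Z or newline chars.
COMBINED = re.compile(r"^(?!.*\.net)[^\nA-Z]+$")

# The pattern from the question, with the dot left unescaped.
UNESCAPED = re.compile(r"^((?!.net).)*$")


def is_valid(entry: str) -> bool:
    """Return True if the entry passes the combined pattern."""
    return COMBINED.search(entry) is not None


if __name__ == "__main__":
    for entry in ["example.com", "example.net", "Example.com", "dotnet"]:
        print(f"{entry!r}: {is_valid(entry)}")
    # With an unescaped dot, "tnet" counts as ".net", so "dotnet" is rejected.
    print(UNESCAPED.search("dotnet") is not None)
```

Here "dotnet" passes the combined pattern because it does not contain a literal ".net", while the unescaped version rejects it.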

Regex: Match all the words that contains some word

I want to match all the words that contain the word "oana". I put "OANA" in uppercase letters in some words, at the beginning, in the middle, and at the end of words.
blah OANAmama blah aOANAtata aOANAt msmsmsOANAasfasfa mOANAmsmf OANAtata OANA3 oanTy
Anyway, I made a regex, but it is not very good, because it doesn't select all the words that contain "oana":
\b\w+(oana)\w+\b
Can anyone give me another solution?
You need to use a case insensitive flag and replace + with *:
/\b\w*oana\w*\b/i
See the regex demo (a global modifier may or may not be used, depending on the regex engine). The case insensitive modifier may be passed as an inline option in some regex engines - (?i)\b\w*oana\w*\b.
Here,
\b - a word boundary
\w* - 0+ word chars
oana - the required char string inside a word
\w* - 0+ word chars
\b - a word boundary
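
For illustration, here is how the fixed pattern behaves on the sample text from the question in Python, where the /i modifier corresponds to the re.IGNORECASE flag. This is only a sketch; the function name words_containing is made up.

```python
import re

TEXT = (
    "blah OANAmama blah aOANAtata aOANAt msmsmsOANAasfasfa "
    "mOANAmsmf OANAtata OANA3 oanTy"
)


def words_containing(fragment: str, text: str) -> list[str]:
    """Return every whole word that contains the fragment, ignoring case."""
    pattern = rf"\b\w*{re.escape(fragment)}\w*\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(words_containing("oana", TEXT))
    # ['OANAmama', 'aOANAtata', 'aOANAt', 'msmsmsOANAasfasfa',
    #  'mOANAmsmf', 'OANAtata', 'OANA3']
```

Words such as OANAmama and OANA3 are found only because \w* allows zero chars before or after the fragment; with \w+ they would be skipped.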