This Regex is not working only in Solr - regex

This regex works perfectly in a plain C# console application, so we started using it with SolrNet. However, querying a Solr instance for a field with the same regex throws the exception shown below:
java.lang.IllegalArgumentException: expected ']' at position 70 at org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1087)

You are using Lucene regex engine that is different from the .NET regex engine.
In a Lucene pattern, an unescaped hyphen is a range operator even at the end of a character class. So, either escape the hyphen or move it to the start of the character class, i.e. [a-zA-Z'-] => [-a-zA-Z'] and [^a-zA-Z'-] => [^-a-zA-Z'].
It does not look like Lucene regex supports non-capturing groups, so remove all ?: from the pattern.
So, it will look like
([-a-zA-Z']+[^-a-zA-Z']+){0,5}the([^-a-zA-Z']+[-a-zA-Z']+){0,5}([-a-zA-Z']+[^-a-zA-Z']+){0,5}the([^-a-zA-Z']+[-a-zA-Z']+){0,5}
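The two rewrites described above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name `to_lucene_friendly` is hypothetical, the input pattern is a guess at what a .NET-style original might look like (the question does not show it), and the helper only handles simple character classes without escaped brackets.

```python
import re


def to_lucene_friendly(pattern: str) -> str:
    """Apply the two rewrites from the answer to a regex pattern string.

    Hypothetical helper: it does not parse the regex, it only performs
    two textual substitutions.
    """
    # 1. Drop the non-capturing marker: "(?:" becomes "(".
    pattern = pattern.replace("(?:", "(")
    # 2. Move a trailing hyphen inside a character class to the front,
    #    keeping a leading "^" (negation) first: [^a-z'-] -> [^-a-z'].
    return re.sub(r"\[(\^?)([^\]]*?)-\]", r"[\1-\2]", pattern)


# Assumed .NET-style input, not taken from the question.
dotnet_style = "(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}the(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}"
print(to_lucene_friendly(dotnet_style))
```

For real patterns it is probably safer to make these changes by hand, since a textual substitution can misfire on escaped brackets or on a literal "(?:" inside a character class.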

Based on your comment, your use case seems best suited to a phrase query. Did you try it?
A query like "website stackoverflow.com is"~5 could work and would be much more performant. If the order is important, you could use two queries ("website stackoverflow"~5 AND "stackoverflow.com is"~5) and use a custom scorer to remove the matches that are not in order.

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
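The approach from Edit 2 can be sketched in Python as follows. The function name `build_hyphen_agnostic_pattern` and the exact set of separator characters are assumptions taken from the example pattern above, and Python's `re` module is used only to sanity-check the result; the Lucene regex engine behind Azure Cognitive Search has its own syntax and may treat some constructs differently.

```python
import re

# Characters assumed to cause token breaks, copied from the example pattern.
SEPARATORS = set("-'.! ")
OPTIONAL_SEPARATOR = r"[-'\.! ]?"


def build_hyphen_agnostic_pattern(query: str) -> str:
    """Strip separator characters from the query, then allow an optional
    separator before, between, and after the remaining characters."""
    letters = [ch for ch in query if ch not in SEPARATORS]
    body = OPTIONAL_SEPARATOR.join(re.escape(ch) for ch in letters)
    return OPTIONAL_SEPARATOR + body + OPTIONAL_SEPARATOR


pattern = build_hyphen_agnostic_pattern("homework")
print(pattern)
print(bool(re.fullmatch(pattern, "home-work")))
print(bool(re.fullmatch(pattern, "homework")))
```

Because separators are stripped before the pattern is built, the queries homework and home-work produce the same pattern.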
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF56G
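A minimal sketch of the query-processing step, in Python. The function name `build_order_code_query` is hypothetical, "cleaning" is assumed to mean nothing more than trimming whitespace, and "special characters" is assumed to mean anything that is not an ASCII letter or digit. A real implementation would likely also need to escape query-syntax characters in the verbatim part.

```python
import re


def build_order_code_query(user_input: str) -> str:
    """Build an OR-query from the cleaned input and a copy of it with
    every non-alphanumeric character removed."""
    cleaned = user_input.strip()  # assumption: cleaning is just trimming
    stripped = re.sub(r"[^A-Za-z0-9]", "", cleaned)
    return f"{cleaned} OR OrderingCodeVariation:{stripped}"


print(build_order_code_query("ABC.123.DE.F.4.56G"))
```

Note that stripping only the separators keeps every digit, so for this input the sketch produces the variation ABC123DEF456G. The same regex would be applied on the content-processing side when populating the OrderingCodeVariation field, so both sides normalise identically.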
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, or with an empty string to remove them.

Lucene regex v4

I am trying to query on Kibana version 7.9.1 for a UUID v4. I disabled KQL and now it looks like it is using Lucene.
Example of a uuid v4:
2334e133-37a6-4039-8acd-b0a561b961b2
Now if I input :
/[0-9a-fA-F]{8}/
in the search bar I get hits, but as soon as I try to escape the hyphen like
/[0-9a-fA-F]{8}\-/
nothing shows up. I would like to use the full regular expression:
[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}
But I can't because of the hyphens.
Is there any other way to escape that pesky hyphen?
I am using elastic search 7.9.1 by the way
I'm not sure why that regex above won't work for you, but this was the best I could come up with given the context: ^[0-9a-fA-F]{8}[^\s\d\w!@#$%^&*()_+=\\\][{}|';:"\/.,<>?][0-9a-fA-F]{4}[^\s\d\w!@#$%^&*()_+=\\\][{}|';:"\/.,<>?][0-9a-fA-F]{4}[^\s\d\w!@#$%^&*()_+=\\\][{}|';:"\/.,<>?][0-9a-fA-F]{4}[^\s\d\w!@#$%^&*()_+=\\\][{}|';:"\/.,<>?][0-9a-fA-F]{12}$
It basically just replaces your "-" with a negated character class "[^...]" that I filled with almost everything except "-", and adds a start anchor "^" and an end anchor "$".
Again, I'm not sure whether Lucene simply doesn't support certain parts of regex, but try not escaping the hyphens. I know some programs will automatically escape symbols for you when using regex.
I ended up using the following regex on lucene in the kibana discover option:
/[0-9a-fA-F]{8}/ AND /[0-9a-fA-F]{4}/ AND /[0-9a-fA-F]{12}/
Not pretty, but it works.
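One possible explanation for this behaviour, offered as an assumption rather than a confirmed diagnosis: Lucene regex queries are matched against whole indexed terms, and if the field is an analyzed text field whose tokenizer splits on hyphens, the UUID is stored as five separate terms, none of which contains a hyphen. The Python sketch below only simulates that idea with `str.split` and `re.fullmatch`; it does not call Elasticsearch, and the helper name `any_term_matches` is hypothetical.

```python
import re

HEX = "[0-9a-fA-F]"


def any_term_matches(pattern: str, terms: list[str]) -> bool:
    """Return True if the pattern matches at least one whole term."""
    return any(re.fullmatch(pattern, term) for term in terms)


uuid = "2334e133-37a6-4039-8acd-b0a561b961b2"
# Assumption: the analyzer splits the value on hyphens at index time.
terms = uuid.split("-")

print(any_term_matches(HEX + "{8}", terms))   # a term of 8 hex characters exists
print(any_term_matches(HEX + "{8}-", terms))  # no term contains a hyphen
print(any_term_matches(HEX + "{4}", terms))
print(any_term_matches(HEX + "{12}", terms))
```

Under that assumption, the AND of three per-term patterns works because each pattern is satisfied by a different term of the same document. If the index has a non-analyzed keyword version of the field, the full UUID pattern may be worth trying against it instead.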

Regex Jersey Rest Service

I have the following regex in Jersey, which works:
/artist_{artistUID: [1-9][0-9]*}
However, if I do
/{artistUID: [artist_][1-9][0-9]*}
it does not. I do not understand how the regexes are built and cannot find any good documentation for it. What I want to do is something like this:
/{artistUID: ([uartist_]|[artist_])[1-9][0-9]*}
to recognize terms like "artist_123" and "uartist_123" and store them in the artistUID value.
You can use an alternation group ((...|...)) rather than a character class [...] (which matches a single character from those listed inside it).
Use
/{artistUID: (uartist|artist)_[1-9][0-9]*}
Or to make it shorter, use a ? quantifier after u to make it optional:
/{artistUID: u?artist_[1-9][0-9]*}
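The difference between the character class and the alternation can be checked outside Jersey. The sketch below uses Python's `re.fullmatch` purely for illustration; Jersey uses Java regular expressions, but these particular constructs behave the same way in both. The helper name `matches` is hypothetical.

```python
import re

CHAR_CLASS = r"[artist_][1-9][0-9]*"           # one character from the set, then digits
ALTERNATION = r"(uartist|artist)_[1-9][0-9]*"  # either literal word, then "_" and digits
OPTIONAL_U = r"u?artist_[1-9][0-9]*"           # same idea with an optional leading "u"


def matches(pattern: str, segment: str) -> bool:
    """Return True if the whole path segment matches the pattern."""
    return re.fullmatch(pattern, segment) is not None


for segment in ("artist_123", "uartist_123", "a123"):
    print(segment, matches(CHAR_CLASS, segment),
          matches(ALTERNATION, segment), matches(OPTIONAL_U, segment))
```

The character-class version accepts segments such as a123 or _7, because [artist_] consumes exactly one character, and it rejects artist_123.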

Filter by regex example

Could anyone provide an example of a regex filter for the Google Chrome Developer toolbar?
I especially need exclusion. I've tried many regexes, but somehow they don't seem to work.
It turned out that Google Chrome actually didn't support this until early 2015, see Google Code issue. With newer versions it works great, for example excluding everything that contains banners:
/^(?!.*?banners)/
It's possible -- at least in Chrome 58 Dev. You just need to wrap your regex with forward-slashes: /my-regex-string/
For example, this is one I'm currently using: /^(.(?!fallback font))+$/
It successfully filters out any messages that contain the substring "fallback font".
EDIT
Something else to note is that if you want to use the ^ (caret) symbol to search from the start of the log message, you have to first match the "fileName.js?someUrlParam:lineNumber " part of the string.
That is to say, the regex is matching against not just the log message, but also the stack-entry for the line which made the log.
So this is the regex I use to match all log messages where the actual message starts with "Dog":
/^.+?:[0-9]+ Dog/
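The three filter patterns quoted so far can be tried against sample lines in Python. This is only a sketch: the sample lines and the helper name `is_shown` are made up, and Chrome evaluates filters with JavaScript regular expressions, although lookaheads and lazy quantifiers as used here behave the same in Python's `re`.

```python
import re

EXCLUDE_BANNERS = re.compile(r"^(?!.*?banners)")
EXCLUDE_FALLBACK = re.compile(r"^(.(?!fallback font))+$")
STARTS_WITH_DOG = re.compile(r"^.+?:[0-9]+ Dog")


def is_shown(pattern: re.Pattern, line: str) -> bool:
    """A line stays visible when the filter pattern finds a match in it."""
    return pattern.search(line) is not None


# Hypothetical lines, shaped like "fileName.js?param:lineNumber message".
print(is_shown(EXCLUDE_BANNERS, "GET /img/banners/a.png"))
print(is_shown(EXCLUDE_BANNERS, "GET /img/logo.png"))
print(is_shown(EXCLUDE_FALLBACK, "main.js?v=3:42 Using fallback font Arial"))
print(is_shown(STARTS_WITH_DOG, "main.js?v=3:42 Dog barked"))
```

The negative lookahead (?!...) is what provides the exclusion: the pattern matches only when the unwanted substring does not appear at the checked position.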
The negative or exclusion case is much easier to write and think about when using the DevTool's native syntax. To provide the exclusion logic you need, simply use this:
-/app/ -/some\sother\sregex/
The "-" prior to the regex makes the result negative.
Your expression should not contain the forward slashes and /s; these are not needed for crafting a filter.
I believe your regex should finally read:
!(appl)
Depending on what exactly you want to filter.
The regex above will filter out all lines without the string "appl" in them.
edit: apparently exclusion is not supported?

hl.regex.pattern not working in solr

I am using solr to fetch data.
I was using below parameters to fetch data:
http://testURL/solr/core0/select?start=10&rows=10&hl.fl=CC&hl.requireFieldMatch=true&hl=on&hl.maxAnalyzedChars=1&hl.fragsize=145&hl.snippets=99&sort=COlumn1+desc&q=CC%3a%28%22test%22~2%29&fl=title120%2ccolumn2%2ccolumn3%2cRL_DateTime%2cSid%2ccolumn4%2cguid%2chour&hl.regex.pattern=^\d+%20%3E%3E%20
The above query does not work with the hl.regex.pattern parameter.
If I remove "hl.regex.pattern", then it provides results in the highlight section.
If I provide that regex pattern, then it does not.
The regex works in my C# code.
So am I missing anything here?
It's almost certainly the ^ and the \. Those aren't valid in a URI, so you'll have to escape them.
From RFC 1738:
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
This is a little dated, since non-Roman alphanumerics like λάμδα are allowed now, but the gist is the same.
Try hl.regex.pattern=%5E%5Cd+%20%3E%3E%20 instead.
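A minimal sketch of producing the encoded value with Python's standard library, using `urllib.parse.quote`. Passing `safe=""` is a choice made for this example; it also encodes `+` as `%2B`, which differs from the value suggested above. That may be preferable, since many servers decode a literal `+` in a query string as a space, but this has not been verified against Solr here.

```python
from urllib.parse import quote

# The raw highlighting pattern from the question: ^\d+ >> (with a trailing space).
raw_pattern = r"^\d+ >> "

# Percent-encode every character that is not an unreserved URL character.
encoded = quote(raw_pattern, safe="")

print("hl.regex.pattern=" + encoded)
```

In C#, `Uri.EscapeDataString` plays a similar role when building the request URL by hand.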