Lucene match only exact query ignoring repeated terms - regex

Given an index where the values of a property 'nodeName' reflect the list below, how can I use Lucene to return only nodes with an exactly matched name?
foo
bar
foobar
foo foo bar
If I search 'bar', I only want the second node returned.
I thought I could use regex in the search term (something like "+nodeName:\"/^{0}$\" where {0} is the query) to match on the start and end of the string, but that's not working - it returns all nodes that include the query.
Also tried an inclusive range ("+nodeName: [{0} TO {0}]") which returned nothing.

A regex query isn't really going to help you here. The regex in your query cannot span multiple analyzed terms. The best way to ensure that a match spans the entire contents of a field is to index it in a way that facilitates that, that is, as a single token. I'm assuming this is a TextField using StandardAnalyzer, or something like it. To match against the whole input, a StringField would be a good choice, since it indexes the entire field value as one token. A simple TermQuery can then be used for this sort of search:
new TermQuery(new Term("nodeName", "bar")) would match only the second document, rather than several.
new TermQuery(new Term("nodeName", "foo foo bar")) would match the last document, rather than none at all.
If you also need to be able to perform more standard (full-text) searches against analyzed text in this field, I would recommend indexing the same content in two separate fields, one StringField and one TextField.
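To make the difference concrete, here is a toy model in Python. It is not Lucene code: the two index functions are hypothetical stand-ins that only imitate, very roughly, how an analyzed text field and a single-token string field store the four example values.

```python
# Toy model only: imitates analyzed vs. single-token indexing, not Lucene itself.

def index_analyzed(values):
    """Store every whitespace-separated, lowercased term of each value."""
    index = {}
    for doc_id, value in enumerate(values):
        for term in value.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def index_single_token(values):
    """Store each value untouched as one term, as a string field would."""
    index = {}
    for doc_id, value in enumerate(values):
        index.setdefault(value, set()).add(doc_id)
    return index


def term_query(index, term):
    """Return the ids of the documents that contain exactly this term."""
    return sorted(index.get(term, set()))


values = ["foo", "bar", "foobar", "foo foo bar"]
analyzed = index_analyzed(values)
exact = index_single_token(values)

print(term_query(analyzed, "bar"))        # documents 1 and 3 both contain the term
print(term_query(exact, "bar"))           # only document 1 is exactly "bar"
print(term_query(exact, "foo foo bar"))   # only document 3
```

In the single-token index a term lookup is an exact match on the whole value, which is the behaviour the question asks for.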

Related

Extract id from URL using regex including underscores and alphanumeric characters [duplicate]

I am using a data analysis package that exposes a Regex function for string parsing. I am trying to parse a response from a website that is in the format...
key1=val1&key2=val2&key3=val3 ...
[There is the possibility that the keys and values may be percent encoded, but the current return values are not, the current return values are tokens and other info that are alphanumeric].
I understand this data to be www-form-urlencoded, or alternatively it might be known as query string format.
The objective is to extract the value for a given key when the order of the keys cannot be relied upon. For example, I might know that one of the keys I should receive is "token", so what regex pattern can I use to extract the value for the key "token"? I have searched for this but cannot find anything that does what I need. If there is a duplicate question, apologies in advance.
In Alteryx, you may use Tokenize with a regex containing a capturing group around the part you need to extract:
The Tokenize Method allows you to specify a regular expression to match on and that part of the string is parsed into separate columns (or rows). When using the Tokenize method, you want to match to the whole token, and if you have a marked group, only that part is returned.
The last clause of the method description ("if you have a marked group, only that part is returned") shows that when the pattern contains a capturing group, only that group is returned rather than the whole match.
Thus, you may use
(?:^|[?&])token=([^&]*)
where instead of token you may use any of the keys the value for which you want to extract.
Details
(?:^|[?&]) - the start of a string, ? or & (if the string is just a plain key-value pair string, you may omit ? and use (?:^|&) or (?<![^&]))
token - the key
= - an equal sign
([^&]*) - Group 1 (this will get extracted): 0 or more chars other than & (if you do not want to extract empty values, replace * with + quantifier).
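The same pattern can be tried outside Alteryx. The following is a minimal sketch in Python's re module, whose syntax for this pattern should behave the same way; the function name extract_value and the sample strings are made up for illustration.

```python
import re


def extract_value(query_string, key):
    """Return the value for `key` in a key=value&key=value string, or None."""
    # (?:^|[?&]) makes sure the key starts at the beginning or right after ? or &,
    # so looking for "token" does not accidentally match "mytoken".
    pattern = r"(?:^|[?&])" + re.escape(key) + r"=([^&]*)"
    match = re.search(pattern, query_string)
    return match.group(1) if match else None


print(extract_value("key1=val1&token=abc123&key3=val3", "token"))
print(extract_value("mytoken=zzz&token=abc", "token"))
print(extract_value("key1=val1", "token"))
```

If the values may be percent-encoded, decoding the captured value afterwards (for example with urllib.parse.unquote) would be a separate step.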

How do I use regex to return text following specific prefixes?

I'm using an application called Firemon which uses regex to pull text out of various fields. I'm unsure what specific version of regex it uses, I can't find a reference to this in the documentation.
My raw text will always be in the following format:
CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
CM will always be an integer, JST will be a sentence that may span multiple lines, and the other fields will be strings that consist of 1-2 words. There's always a return after each section.
The application, Firemon, has me create a regex entry for each field. Something simple that looks for each prefix and then a return should work, because I return after each value. I've tried several variations, such as "BZU:\s*(.*)", but can't seem to find something that works.
EDIT: To be clear I'm trying to get the value after each prefix. Firemon has a section for each field. "APP" for example is a field. I need a regex example to find "APP:" and return the text after it. So something as simple as regex that identifies "APP:", and grabs everything after the : and before the return would probably work.
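You can use (?=\w+ )(.*)
The lookahead does not consume anything; it only lets a match start where a word is followed by a space. Because the prefix is followed by a colon, the match starts at the value instead, and the group captures the rest of the line. Note that this only works for values of two or more words: a single-word value such as the CM number is not followed by a space, so it is not matched.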
I am a little late to the game, but maybe this is still an issue.
In the more recent versions of FireMon, sample regexes are provided. For instance:
jst:\s*([^;]*?)\s*;
will match on:
jst:anything in here;
and result in
anything in here
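For the format in the question (a prefix, a colon, and a value that ends at the line break), a per-field pattern can be sketched as below. This uses Python's re module purely for illustration; the regex flavor FireMon uses is not known here, so the multiline flag and the helper name field_value are assumptions.

```python
import re

RAW_TEXT = """CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text."""


def field_value(prefix, text):
    """Return the text after 'PREFIX:' up to the end of that line, or None."""
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # [ \t]* skips the blanks after the colon without crossing a line break.
    pattern = r"^" + re.escape(prefix) + r":[ \t]*(.*)$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


for prefix in ("CM", "APP", "BZU", "REQ", "JST"):
    print(prefix, "->", field_value(prefix, RAW_TEXT))
```

This captures only the first line of a value. A JST sentence that spans several lines would need a pattern that is allowed to cross line breaks.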

Regex pipe needs to take first match

My regular expression is like this:
.*(kgrj4e|\*)[^:]*:([^;]*);?
The 'kgrj4e' part is a userid and is dynamic. The PR.... parts are printers. If the userid is not found I want the default printer (PR12346).
For first test string below I want result to be PR12345, but I get PR12346
snljoe,snlaks,kgrj4e,snlbla:PR12345;*:PR12346
Note: the users snljoe, snlaks and snlbla are just examples and can be totally different. In fact the list of users can be longer or smaller.
For second test string below I want result to be PR12346
snljoe,snlaks,snlbla:PR12345;*:PR12346
How to fix the regular expression so both test strings give the expected result?
You can get the number with a search and replace:
Search for: ^(?:(?!.*,kgrj4e(?:[;,])).*\*:(\w+)|.*?(PR\d+).*)
Replace with: $1$2
I assume that kgrj4e is a user-defined value. If it is missing from the string, the printer after *: (the last one) is returned; if it is present, the first printer value is returned.
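As a quick check of the pattern against both test strings, here is a sketch in Python. Instead of a search and replace it reads whichever of the two groups matched; the function name printer_for is made up for illustration.

```python
import re


def printer_for(user_id, mapping):
    """Return the printer the pattern above selects for `user_id`, or None."""
    uid = re.escape(user_id)
    # Same pattern as above, with the user id inserted dynamically.
    pattern = r"^(?:(?!.*," + uid + r"(?:[;,])).*\*:(\w+)|.*?(PR\d+).*)"
    match = re.match(pattern, mapping)
    if match is None:
        return None
    # Group 1 is set when the id is absent (default printer after "*:"),
    # group 2 when the id is present (first printer in the string).
    return match.group(1) or match.group(2)


print(printer_for("kgrj4e", "snljoe,snlaks,kgrj4e,snlbla:PR12345;*:PR12346"))
print(printer_for("kgrj4e", "snljoe,snlaks,snlbla:PR12345;*:PR12346"))
```

As written, the lookahead only recognises the id when it is preceded by a comma and followed by a comma or semicolon, so an id that is first in the list or directly before the colon would fall through to the default printer.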

ElasticSearch Regexp Filter

I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.
POST to localhost:9200/_search
{
  "query" : {
    "match_all" : { },
    "filtered" : {
      "filter" : {
        "regexp": {
          "url":".*info-for/media.*"
        }
      }
    }
  },
}
This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?
First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is done, is that every single term in the index's dictionary is matched against the pattern, which in turn is constructed into an OR-query of the matching terms. This is O(n) in the number of unique terms in your corpus, with a subsequent search that is quite expensive as well.
This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media-term in the dictionary for the regexp to match.
What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.
Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis
I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo and the domain foo.example.com is tokenized to foo.example.com, example.com, com
A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
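The token generation described above can be imitated in a few lines of Python. This is only a sketch of the idea, not the Elasticsearch tokenizer: the function names are hypothetical, and the order of the tokens follows the listing above rather than whatever order the real tokenizer emits.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return the path and each of its ancestor paths, longest first."""
    parts = [part for part in path.split(delimiter) if part]
    return [delimiter + delimiter.join(parts[:i]) for i in range(len(parts), 0, -1)]


def domain_tokens(domain):
    """Return the domain and each of its parent domains, longest first."""
    labels = domain.split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


print(path_hierarchy_tokens("/foo/bar/baz"))
print(domain_tokens("foo.example.com"))

# With these tokens in the index, "anything below /info-for/media" becomes an
# exact term lookup instead of a regexp over every term in the dictionary.
tokens = path_hierarchy_tokens("/info-for/media/press-release-1")
print("/info-for/media" in tokens)
```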

Is there a regular expression for a comma separated list of discrete values?

I use the following regular expression to validate a comma separated list of values.
^Dog|Cat|Bird|Mouse(, (Dog|Cat|Bird|Mouse))*$
The values are also listed in a drop down list in Excel cell validation, so the user can select a single value from the drop down list, or type in multiple values separated by commas.
The regular expression does a good job of preventing the user from entering anything but the approved values, but it doesn't prevent the user from entering duplicates. For example, the user can enter "Dog" and "Dog, Cat", but the user can also enter "Dog, Dog".
Is there any way to prevent duplicates using a similar single regular expression? In other words I need to be able to enforce a discrete list of approved comma separated values.
Thanks!
Use a backreference and a negative lookahead:
^(Dog|Cat|Bird|Mouse)(, (?!\1)(Dog|Cat|Bird|Mouse))*$
EDIT: This won't work with cases such as "Cat, Dog, Dog", because \1 only refers to the first value. You'll need to come up with a hybrid solution for such instances - I don't believe there is a single regex that can handle that.
Here's another technique. You need to check two things, first, that it DOES match this:
(?:(?:^|, )(Dog|Cat|Bird|Mouse))+$
(That's just a slightly shorter version of your original regex)
Then, check that it DOES NOT match this:
(Dog|Cat|Bird|Mouse).+?\1
E.g.
var valid = string.match( /(?:(?:^|, )(Dog|Cat|Bird|Mouse))+$/ ) &&
!string.match( /(Dog|Cat|Bird|Mouse).+?\1/ );
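The two-step check can also be sketched in Python. One deliberate difference from the snippet above: the first check here uses re.fullmatch, so the whole string has to be a well-formed list. The unanchored search above would also accept something like "Fish, Dog", because it only requires the end of the string to look right. The function name is_valid is made up for illustration.

```python
import re

VALUES = "Dog|Cat|Bird|Mouse"


def is_valid(value):
    """True if `value` is a comma separated list of allowed values without repeats."""
    # Step 1: the whole string must be allowed values separated by ", ".
    well_formed = re.fullmatch(
        r"(?:" + VALUES + r")(?:, (?:" + VALUES + r"))*", value
    )
    # Step 2: no allowed value may be followed later by itself.
    has_duplicate = re.search(r"(" + VALUES + r").+?\1", value)
    return well_formed is not None and has_duplicate is None


for sample in ("Dog", "Dog, Cat", "Dog, Dog", "Cat, Dog, Dog", "Fish, Dog"):
    print(sample, "->", is_valid(sample))
```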
J-P, I tried editing your sample regular expressions so that I could look for duplicates in any comma separated string. Something like this:
var valid = string.match( /(?:(?:^|, )([a-z]*))+$/ ) &&
!string.match( /([a-z]*).+?\1/ );
Unfortunately, I failed. The Force is weak with me. ;)
Thanks again for your help.
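A likely reason the generic attempt fails: ([a-z]*) is allowed to match the empty string, and an empty \1 matches anywhere, so the duplicate check succeeds on practically every input. A minimal sketch of one way around that, in Python and assuming lowercase words only as in the attempt above, is to require at least one letter and whole-word boundaries. The function name has_duplicate_word is made up for illustration.

```python
import re


def has_duplicate_word(value):
    """True if some lowercase word occurs more than once in `value`."""
    # [a-z]+ (not *) forces a non-empty capture; \b keeps "dog" from
    # matching inside a longer word such as "hotdog".
    return re.search(r"\b([a-z]+)\b.*\b\1\b", value) is not None


# The attempted pattern reports a duplicate even when there is none,
# because the group can match the empty string:
print(re.search(r"([a-z]*).+?\1", "dog, cat") is not None)

print(has_duplicate_word("dog, cat"))
print(has_duplicate_word("dog, cat, dog"))
print(has_duplicate_word("dog, hotdog"))
```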
What about using some kind of expression like this:
(Dog|Cat|Bird|Mouse){1}
Then you can write a value from the array only once. It is then easy to add the commas, spaces, zero-or-more repetition, etc.
I know I'm necroposting, but I found this in my search, so I'll leave it here in case anyone finds it useful.