Sonatype NXRM - Asset Name Matcher [duplicate]

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
  bool: {
    must: {
      match: { text: 'google' }
    }
  }
}
But this only matches if the whole text is 'google' (case-insensitively, so it also matches Google or GooGlE etc.). How do I match 'google' when it appears inside a longer string?

The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any run of characters (other than newlines), you can use the .* pattern:
match: { text: '.*google.*'}
In ES6+, use regexp instead of match:
"query": {
"regexp": { "text": ".*google.*"}
}
One more variation is for cases when your string can contain newlines: match: { text: '(.|\n)*google(.|\n)*' }. This awful (.|\n)* is a must in ElasticSearch because this regex flavor allows neither the [\s\S] workaround nor a DOTALL/Singleline flag. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
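To see what this implicit anchoring means in a more familiar engine, here is a small JavaScript sketch (illustrative only; Elasticsearch itself is not involved) that emulates Lucene's behavior by wrapping the pattern in ^(?:...)$:

```javascript
// Lucene regexes are implicitly anchored: the pattern must consume the
// entire string. In a PCRE-style engine such as JavaScript's, the
// equivalent is wrapping the pattern in ^(?:...)$.
const luceneStyleMatch = (pattern, text) =>
  new RegExp(`^(?:${pattern})$`).test(text);

luceneStyleMatch('google', 'http://drive.google.com');        // false: not a full match
luceneStyleMatch('.*google.*', 'http://drive.google.com');    // true
luceneStyleMatch('(.|\n)*google(.|\n)*', 'one\ngoogle\ntwo'); // true even across newlines
```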
However, if you do not plan to match any complicated patterns and need no word boundary checking, regex search for a mere substring is better performed with a mere wildcard search:
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*google*",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind that wildcard patterns have only two special characters:
?, which matches any single character
*, which matches zero or more characters (including none)
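As a rough illustration of those rules, a wildcard pattern can be translated into an anchored regex. This JavaScript sketch (not Elasticsearch code) mirrors the whole-string-match behavior:

```javascript
// Rough translation of a wildcard pattern into an anchored regex:
// "?" becomes ".", "*" becomes ".*", everything else is literal, and the
// result is anchored because wildcard patterns must match the whole string.
const wildcardToRegExp = (pattern) => {
  // Escape regex metacharacters other than the wildcard ones themselves.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
};

wildcardToRegExp('*google*').test('http://drive.google.com'); // true
wildcardToRegExp('google*').test('http://drive.google.com');  // false
wildcardToRegExp('qu?ck').test('quick');                      // true
```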

Use a wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'

For both partial and full-text matching, the following worked:
"query": {
  "query_string": {
    "query": "*searchText*",
    "fields": [
      "fieldName"
    ]
  }
}

I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
"regexp": { "text": ".*google.*"}
}

For partial matching you can either use prefix or match_phrase_prefix.

For a more generic solution you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer, which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working: it is being compared against the whole token "drive.google.com".
If you instead indexed your documents with the simple analyzer, it would split the URL into "http", "drive", "google", and "com", allowing you to match any one of those terms on its own.
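A crude sketch of the difference in JavaScript (the real analyzers do considerably more, so treat this as an approximation of the tokenization only):

```javascript
// Crude approximation of the two analyzers' tokenization. The real
// analyzers do much more; this only shows why the term "google" exists in
// the index under the simple analyzer but not under the standard one.
const standardish = (text) =>
  text.toLowerCase().split(/[\s:\/]+/).filter(Boolean); // keeps "drive.google.com" as one token
const simpleish = (text) =>
  text.toLowerCase().split(/[^a-z]+/).filter(Boolean);  // splits on every non-letter

standardish('http://drive.google.com'); // ["http", "drive.google.com"]
simpleish('http://drive.google.com');   // ["http", "drive", "google", "com"]
```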

Using the node.js client, where tag_name is the field name and value is the incoming search value:
const { body } = await elasticWrapper.client.search({
  index: ElasticIndexs.Tags,
  body: {
    query: {
      wildcard: {
        tag_name: {
          value: `*${value}*`,
          boost: 1.0,
          rewrite: 'constant_score',
        },
      },
    },
  },
});

You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
  query: `*${keyword}*`,
  fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.

Related

Regular Expressions in Elasticsearch/Kibana

I only want to retrieve events that match my regular expression for a particular field. For example, events that have an IP address. Elastic doesn't support PCRE so is there a way I can achieve this from their supported regular expression syntax?
Here is the regex I had before discovering it was not supported by Elastic:
https://regex101.com/r/99b6dn/3
Here is what I attempted using Elastic's supported syntax, but it's not working:
/[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*/
Expected result would be myfield:/<myregex>/ returning only logs that match the regex.
Only the non-capture group (?: ) in your pattern is absent from the list of supported regex operators. Try:
([0-9]{1,3}\.){3}[0-9]{1,3}
The query should look something like this (note that the backslash must be doubled to survive JSON string escaping):
{
  "query": {
    "match": {
      "path": {
        "query": "([0-9]{1,3}\\.){3}[0-9]{1,3}",
        "type": "phrase"
      }
    }
  }
}
Or use a more accurate pattern:
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
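The difference between the two patterns can be checked quickly in JavaScript; ^ and $ are added here only to mirror Lucene's implicit anchoring:

```javascript
// The loose pattern accepts any 1-3 digit octets; the strict one enforces
// the 0-255 range. ^ and $ are added because JavaScript, unlike Lucene,
// does not anchor patterns implicitly.
const loose  = /^([0-9]{1,3}\.){3}[0-9]{1,3}$/;
const strict = /^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;

loose.test('192.168.0.1');      // true
strict.test('192.168.0.1');     // true
loose.test('999.999.999.999');  // true: out-of-range octets still pass
strict.test('999.999.999.999'); // false
```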

What is the correct way to format regex metacharacters and options when using the regex operator in $searchBeta in MongoDB?

I'm trying to do full-text search in MongoDB with $searchBeta (aggregation) and I'm using the 'regex' operator to do so. Here's the portion of the $searchBeta I have that isn't working how I expecting it would:
$searchBeta: {
  regex: {
    query: '\blightn', // '\b' is the word-boundary metacharacter
    path: ["name", "set_name"],
    allowAnalyzedField: true
  }
}
Here's an example of two documents that I'm expecting to get matched by the expression:
{
  "name": "Lightning Bolt",
  "set_name": "Masters 25"
},
{
  "name": "Chain Lightning",
  "set_name": "Battlebond"
}
What I actually get:
[] //empty array
If I use an expression like:
$searchBeta: {
  regex: {
    query: '[a-zA-Z]',
    path: ["name", "set_name"],
    allowAnalyzedField: true
  }
}
then I get results back.
I can't get any expression that has regex metacharacters and/or options in it to work, so I'm pretty sure I'm just entering it wrong in my query string. The $searchBeta regex documentation doesn't really cover how to format metacharacters into your query string. Also, the $searchBeta regex operator is different from $regex because it doesn't require slashes (i.e. "/your expression/" ). Really pulling my hair out on something so simple that I can't figure out.
$searchBeta uses Lucene for regular expressions, which is not Perl Compatible (PCRE) and doesn't support \b. You can read about the Lucene regex syntax here and also Elastic's docs on it are also helpful.
Here is a similar question for ElasticSearch and includes some workarounds.

Kibana Custom Filter: how to create a regex to eliminate all terms with numeric values

I have a list of requests coming in, based on free-text searches or codes.
I would like to eliminate the code-like requests and keep only the natural-language requests.
Therefore I need a query that can separate those terms.
Below is the query JSON I already tried:
{
  "query": {
    "regexp": {
      "q": "[^\d\W]"
    }
  }
}
The error I get is "Bad String" for the line "q": "[^\d\W]".
The expected outcome is an improved regex that keeps the relevant data.
You may use:
"regexp": {
  "q": "[^0-9]+"
}
The Lucene regex engine used in Kibana anchors all patterns by default, so [^0-9]+ will only match values consisting entirely of non-digit characters from start to end.
Moreover, shorthand character classes such as \d and \W are not supported either.
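A JavaScript sketch of the same behavior (since Lucene anchors implicitly, ^ and $ are added here to reproduce it):

```javascript
// Lucene (as used by the Kibana regexp query) anchors patterns implicitly,
// so [^0-9]+ there behaves like this anchored regex: it matches only
// values made up entirely of non-digit characters.
const noDigits = /^[^0-9]+$/;

noDigits.test('free text search'); // true: a natural-language request
noDigits.test('ERR-404');          // false: contains digits
```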

Regex: getting all the hashtags and mentions used in all my documents

I'm using the Kibana console to perform such queries (they are separate: one for the hashtags, one for the mentions). The collection of documents are blog entries with a textContent field, which may have user mentions like @theUserName @AnotherOne or hashtags like #helloWorld and #hello2. The queries look like the following one:
GET /xblog/_search
{
  "_source": [
    "id",
    "textContent"
  ],
  "query": {
    "regexp": {
      "textContent": {
        "value": "@([^-A-Za-z0-9])",
        "flags": "ALL"
      }
    }
  }
}
But the problem is that it also returns documents that do not contain a @userMention. I think the @ in the regex is being treated as a special symbol, but reading the documentation I couldn't find how to escape it.
In the docs, the authors say that you can escape any symbol with double quotes, so I tested:
"@"
But I got nothing.
I also tested expressions I'm used to, like:
/\s([@#][\w_-]+)/g
But that produces multiple errors in Kibana. I tried replacing some parts according to the documentation, but it's still not working.
Can you point me in the right direction?
Thanks in advance,
You enabled the ALL flag, which makes @ match any whole string; see the ElasticSearch regex documentation:
If you enable optional features (see below) then these characters may also be reserved:
# @ & < > ~
Then, in the Any string section:
The at sign "@" matches any string in its entirety.
Enabled with the ANYSTRING or ALL flags.
Since you do not need any special behavior here, you may simply tell the engine to use a "simple" regex by passing "flags": "NONE", or escape the @: "\\@([^-A-Za-z0-9])":
Any reserved character can be escaped with a backslash "\*" including a literal backslash character: "\\"
And since you need a whole string match, you may need to add .* on both ends (to match strings containing your match):
"query": {
"regexp": {
"textContent": {
"value": ".*#[^-A-Za-z0-9].*",
"flags": "NONE"
}
}
}
Or
"query": {
"regexp": {
"textContent": {
"value": ".*\\#[^-A-Za-z0-9].*",
"flags": "ALL"
}
}
}

Chrome Extension Multi URL filtering

I am having trouble making my Chrome extension work with multiple URLs. What should be the format for listing URLs to match?
chrome.webNavigation.onDOMContentLoaded.addListener(function (o) {
  chrome.tabs.executeScript(o.id, {
    code: "////////"
  });
}, {
  url: [
    { urlContains: ['/shop/jacket', 'shop/t-shirt'] }
  ]
});
I'm assuming a regex would work, but how would I write that?
Your code doesn't work because urlContains expects a single string only.
The simplest regex operator that matches "A or B" is A|B.
So, in your case: { urlMatches: "/shop/jacket|shop/t-shirt" }. It's simple in your case because your URL substrings contain no regex special characters; in the general case you may need to backslash-escape some of them.
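A quick JavaScript check of that alternation, plus a generic escaping helper for the general case (escapeForRegExp is an illustrative name, not part of the Chrome API):

```javascript
// The alternation urlMatches relies on: "A|B" matches A or B. The helper
// below shows the escaping needed when a URL fragment contains regex
// metacharacters (example.com is a placeholder domain).
const shopPages = new RegExp('/shop/jacket|shop/t-shirt');

shopPages.test('https://example.com/shop/jacket/123'); // true
shopPages.test('https://example.com/shop/shoes');      // false

const escapeForRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
escapeForRegExp('/shop/t-shirt?color=red'); // the "?" gets backslash-escaped
```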