elasticsearch span_near query false hits - regex

I have a text field containing an XML document, in which I try to find this kind of match:
<Payer> [...] bic=\"123456789\" [...] </Payer>
with the following query:
{
  "query": {
    "span_near" : {
      "clauses" : [
        { "span_term" : { "field" : "payer" }},
        { "span_term" : { "field" : "bic" }},
        { "span_term" : { "field" : "123456789" }},
        { "span_term" : { "field" : "payer" }}
      ],
      "slop" : 500,
      "in_order" : true
    }
  }
}
The problem is that sometimes I get wrong matches if the XML document contains something like:
<Payer>bic=\"111111111\"</Payer><Payee>bic=\"123456789\"</Payee><Payer>bic=\"222222222\"</Payer>
The query finds PayeE instead of PayeR; from Elasticsearch's point of view the match is still valid.
Any ideas how I can prevent this "greedy" search?
As far as I know from this topic, regexp is not an option because "Elasticsearch (and lucene) don't support full Perl-compatible regex syntax". That means a regexp query matches tokens, not the whole string.
I also tried to make the last span_term something like /payer or \\/payer or </payer, but then it finds nothing at all.

You may add a span_not query. From the documentation:
Removes matches which overlap with another span query. The span not query maps to Lucene SpanNotQuery.
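A minimal sketch of that idea, keeping the field and terms from your query; the exclude clause is an assumption (it drops any candidate span that also contains a payee token, which is what goes wrong in your example), so adjust it to how your XML is actually tokenized:
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            { "span_term": { "field": "payer" } },
            { "span_term": { "field": "bic" } },
            { "span_term": { "field": "123456789" } },
            { "span_term": { "field": "payer" } }
          ],
          "slop": 500,
          "in_order": true
        }
      },
      "exclude": { "span_term": { "field": "payee" } }
    }
  }
}
With this, a span_near match that stretches across a <Payee> element overlaps the excluded payee span and is removed from the results.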

Related

Kibana Query Language - find numbers in a field

I am struggling with a simple query that is supposed to work based on many tutorials, but I cannot make it work. Having a log field
Request sent, method=GET, headers={}, queryParams={forceArray=[true]}, entity=null, payload length=null} playerId=102
I am trying to get playerId with a 3-digit value. The following query fails
log: /playerId=[0-9]{1,3}/
with
KQLSyntaxError: Expected AND, OR, end of input, whitespace but "{" found.
log: /playerId=[0-9]{1,3}/
but it is supposed to work according to https://dzone.com/articles/getting-started-with-kibana-advanced-searches
This log: /playerId=[0-9][0-9][0-9]/ returns basically everything with a single '0' character.
This log: /playerId=*/ for some mysterious reason returns nothing.
Edit
A regular Elasticsearch Lucene-based query does not work either:
{
  "query": {
    "regexp": {
      "log": {
        "value": "*playerId*"
      }
    }
  }
}
mapping:
{
  "my-index" : {
    "mappings" : {
      "log" : {
        "full_name" : "log",
        "mapping" : {
          "log" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}
any help appreciated
Edit
I validated my regex queries in https://regex101.com/
and they all work.
Edit 2
This works:
"query": {
  "match": {
    "log": "playerId"
  }
}
This returns empty hits:
"query": {
  "regexp": {
    "log": "playerId"
  }
}
regards
This is because Kibana uses KQL (Kibana Query Language) by default and that doesn't support regular expressions.
You need to switch to the Lucene Query Language with the query string syntax which supports the regular expression you're trying.
Just click on KQL at the right end of the search bar to change the search syntax.
Also worth noting that regular expression queries are real performance hogs. You should really parse your logs before ingesting them so you can query the playerId field independently (see the ingest pipeline sketch after the query below).
In any case, if you really want to do it that way, your query is not that far off from the real thing. Here is the correct version that will work for your case:
{
  "query": {
    "query_string": {
      "query": "/.*playerId=[0-9]{3}/",
      "default_field": "log.keyword"
    }
  }
}
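As a sketch of the parsing suggestion above: an ingest pipeline with a grok processor can extract playerId into its own field at index time. The pipeline name and grok pattern here are assumptions based on the sample log line, so adapt them to your actual format:
PUT _ingest/pipeline/extract-player-id
{
  "description": "Copy the numeric playerId out of the raw log line (hypothetical pipeline)",
  "processors": [
    {
      "grok": {
        "field": "log",
        "patterns": ["playerId=%{NUMBER:playerId:int}"],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ]
}
Documents indexed through that pipeline can then be filtered with a plain term or range query on playerId, with no regex involved.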

Case Insensitive Search With Regex

I'm trying to implement a case-insensitive search with regex.
Example: /^sanford/i (searching for anything starting with "sanford", disregarding case sensitivity).
For case-insensitive queries, creating indexes with a custom collation is recommended by the documentation. This works fine as long as no regex is involved.
The problem: when searching with a regex (in this case "starts with"), the case-insensitive search does NOT take the index into account.
This is stated in the documentation multiple times and is also reproducible with an explain command.
To sum it up: it works, but without effectively using the index. I'd be glad to get any hints; I can't get rid of the feeling that I'm missing something fundamentally important here.
Inserting the string with toLowerCase and then searching only with lower-cased strings is not an option.
I can't use a text index because there can only be one per collection.
An example from the documentation: see here, the green info box at the bottom.
#D.SM: The index is used, but it scans all documents.
https://docs.atlas.mongodb.com/schema-suggestions/case-insensitive-regex/
Example document:
{
"name": [{
"family": "Test",
"given": "Name",
}],
}
Index with collation:
{ "name" : "name_family", "key" : { "name.family" : 1 }, "host" : "noneofyourbusiness.com", "accesses" : { "ops" : NumberLong(114), "since" : ISODate("2020-07-30T20:25:59.079Z") }, "shard" : "shA", "spec" : { "v" : 2, "key" : { "name.family" : 1 }, "name" : "name_family", "ns" : "noneofyourbusiness.somethingwithaname", "collation" : { "locale" : "de", "caseLevel" : false, "caseFirst" : "off", "strength" : 1, "numericOrdering" : false, "alternate" : "non-ignorable", "maxVariable" : "punct", "normalization" : false, "backwards" : false, "version" : "57.1" } } }
}
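For reference, a minimal shell sketch of the setup described above (the collection name is an assumption, and the collation is trimmed to its essentials):
// hypothetical collection, collation index matching the spec above
db.patients.createIndex(
  { "name.family": 1 },
  { name: "name_family", collation: { locale: "de", strength: 1 } }
)
// case-insensitive "starts with" via regex: results are correct, but the /i regex
// cannot take advantage of the collation, so the server falls back to a broad scan,
// as described on the linked Atlas schema-suggestions page
db.patients.find({ "name.family": /^sanford/i }).explain("executionStats")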

Elasticsearch regex for full documents does not work

I am using Elasticsearch to store sentences. I want to find sentences matching a regular expression. I tried query_string for this, though it does not return the required sentence.
Query:
{
  "_source": "doc.sent",
  "query": {
    "query_string" : {
      "query" : "/food.*table/",
      "default_field" : "doc.sent"
    }
  }
}
Example sentence:
My food is left at the table right now.
You do not need regex for this, but if you want to match multiple words or multiple patterns, you can use the & symbol:
Intersection
The ampersand "&" joins two patterns in a way that both of them have to match. For string "aaabbb":
aaa.+&.+bbb   # match
aaa&bbb       # no match
Using this feature usually means that you should rewrite your regular expression.
Enabled with the INTERSECTION or ALL flags.
For your purpose, the query would look like:
{
  "_source": "doc.sent",
  "query": {
    "query_string" : {
      "query" : "food&table",
      "default_field" : "doc.sent"
    }
  }
}
Or you could also use the AND or OR operators:
{
  "_source": "doc.sent",
  "query": {
    "query_string" : {
      "query" : "food AND table",
      "default_field" : "doc.sent"
    }
  }
}
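If you prefer the regexp query over query_string, the intersection feature quoted above can be enabled explicitly via flags. This is only a sketch: it assumes your mapping exposes a doc.sent.keyword sub-field, since against a keyword field the pattern has to match the entire untokenized sentence (hence the leading and trailing .*):
{
  "_source": "doc.sent",
  "query": {
    "regexp": {
      "doc.sent.keyword": {
        "value": ".*food.*&.*table.*",
        "flags": "ALL"
      }
    }
  }
}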

How can I sort mongodb regex search query results based on regex match count

I can't figure out how to sort query results based on the "best" match.
Here's a simple example: I have a "zone" collection containing a list of city/zipcode couples.
If I search for several words through regexes, combined like this:
"db.zones.find({$or : [ {ville: /ROQUE/}, {ville: /ANTHERON/}] })"
Results won't be ordered by "best match".
What other solutions do I have for that ?
You could try to use http://docs.mongodb.org/manual/reference/operator/query/text/#match-any-of-the-search-terms
db.zones.ensureIndex({ ville: "text" })
db.zones.find(
  { $text: { $search: "ROQUE ANTHERON" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })
Result:
{
  "_id" : ObjectId("547c2473371ea419f07b954c"),
  "ville" : "ANTHERON",
  "score" : 1.1
}
{
  "_id" : ObjectId("547c246f371ea419f07b954b"),
  "ville" : "ROQUE",
  "score" : 1
}
From the documentation:
If the search string is a space-delimited string, $text operator
performs a logical OR search on each term and returns documents that
contains any of the terms.
You have to use MongoDB 2.6 or later.
I ended up using the Elasticsearch search engine to do this query:
#zones = Zone.es.search(
  body: {
    query: {
      bool: {
        should: [
          { match: { city: search } },
          { match: { zipcode: search.to_i } }
        ]
      }
    },
    size: limit
  }
)
Where search is a search parameter sent by the view.

MongoDB - strip non numeric characters in field

I have a field of phone numbers where a random variety of separators have been used, such as:
932-555-1515
951.555.1255
(952) 555-1414
I would like to go through each field that already exists and remove the non-numeric characters.
Is that possible?
Whether it gets stored as an integer or as a string of digits, I don't care either way; it will only be used for display purposes.
You'll have to iterate over all your docs in code and use a regex replace to clean up the strings.
Here's how you'd do it in the mongo shell for a test collection with a phone field that needs to be cleaned up.
db.test.find().forEach(function(doc) {
  doc.phone = doc.phone.replace(/[^0-9]/g, '');
  db.test.save(doc);
});
Based on the previous example by #JohnnyHK, I also added a regex to the find query:
/*
MongoDB: Find by regular expression and run regex replace on results
*/
db.test.find({"url": { $regex: 'http:\/\/' }}).forEach(function(doc) {
doc.url = doc.url.replace(/http:\/\/www\.url\.com/g, 'http://another.url.com');
db.test.save(doc);
});
Starting in Mongo 4.4, the $function aggregation operator allows applying a custom JavaScript function to implement behaviour not supported by the MongoDB Query Language.
Coupled with the improvements made to db.collection.update() in Mongo 4.2, which can now accept an aggregation pipeline and thus update a field based on its own value,
we can manipulate and update a field in ways the language doesn't easily permit and avoid an inefficient find/forEach pattern:
// { "x" : "932-555-1515", "y" : 3 }
// { "x" : "951.555.1255", "y" : 7 }
// { "x" : "(952) 555-1414", "y" : 6 }
db.collection.updateMany(
{ "x": { $regex: /[^0-9]/g } },
[{ $set:
{ "x":
{ $function: {
body: function(x) { return x.replace(/[^0-9]/g, ''); },
args: ["$x"],
lang: "js"
}}
}
}
])
// { "x" : "9325551515", "y" : 3 }
// { "x" : "9515551255", "y" : 7 }
// { "x" : "9525551414", "y" : 6 }
The update consists of:
a match query { "x": { $regex: /[^0-9]/ } }, filtering the documents to update (in our case any document that contains non-numeric characters in the field we're interested in updating).
an update aggregation pipeline [ { $set: { "x": { $function: { ... } } } } ] (note the square brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator and an alias for $addFields.
$function takes 3 parameters:
body, which is the function to apply, whose parameter is the string to modify. The function here simply consists in replacing characters matching the regex with empty strings.
args, which contains the fields from the record that the body function takes as parameters. In our case, "$x".
lang, which is the language in which the body function is written. Only js is currently available.
In MongoDB 4.2 you have the $regexFind and $regexFindAll aggregation operators, which can be used together with string operators such as $substr inside an aggregation, without looping through all the documents in the client.
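A minimal sketch of that idea using $regexFindAll, reusing the phone field and test collection from the earlier answers (everything else is an assumption):
db.test.updateMany(
  { phone: { $regex: /[^0-9]/ } },
  [{ $set: {
    phone: {
      // collect every digit of the original string, then concatenate them back together
      $reduce: {
        input: {
          $map: {
            input: { $regexFindAll: { input: "$phone", regex: /[0-9]/ } },
            in: "$$this.match"
          }
        },
        initialValue: "",
        in: { $concat: ["$$value", "$$this"] }
      }
    }
  }}]
)
The whole rewrite stays server-side: each digit matched by $regexFindAll is joined back into a purely numeric string, similar in spirit to the $function approach above but without JavaScript execution.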