My collection contains the following two documents
{
"BornYear": 2000,
"Type": "Zebra",
"Owners": [
{
"Name": "James Bond",
"Phone": "007"
}
]
}
{
"BornYear": 2012,
"Type": "Dog",
"Owners": [
{
"Name": "James Brown",
"Phone": "123"
},
{
"Name": "Sarah Frater",
"Phone": "345"
}
]
}
I would like to find all the animals which have an owner whose name starts with James.
I tried to unwind the Owners array, but cannot get access to the Name variable.
Bit of a misnomer here. To just find the "objects" or items in a "collection", all you really need to do is match the "object/item":
db.collection.find({
"Owners.Name": /^James/
})
Which works, but does not of course limit the results to the "first" match of "James", which would be:
db.collection.find(
{ "Owners.Name": /^James/ },
{ "Owners.$": 1 }
)
As a basic projection. But that does not give any more than a "single" match, which means you need the .aggregate() method instead like so:
db.collection.aggregate([
// Match the document
{ "$match": {
"Owners.Name": /^James/
}},
// Flatten or de-normalize the array
{ "$unwind": "Owners" },
// Filter th content
{ "$match": {
"Owners.Name": /^James/
}},
// Maybe group it back
{ "$group": {
"_id": "$_id",
"BornYear": { "$first": "$BornYear" },
"Type": { "$first": "$Type" },
"Ownners": { "$push": "$Owners" }
}}
])
And that allows more than one match in a sub-document array while filtering.
The other point is the "anchor", or "^" caret, on the regular expression. Use it wherever you can, to make matches at the "start" of the string, where an index can be properly used. Open-ended regex operations cannot use an index.
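For example, a quick sketch (using the same collection as above) of why the anchor matters once an index is in place:
db.collection.createIndex({ "Owners.Name": 1 })

// Anchored: can be satisfied with an index range scan
db.collection.find({ "Owners.Name": /^James/ })

// Unanchored: still correct, but must inspect every index key
db.collection.find({ "Owners.Name": /James/ })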
You can use dot notation to match against the fields of array elements:
db.test.find({'Owners.Name': /James/})
I have a document which contains a field called info_list, which is basically a string with 9 space-separated segments.
Mapping of the field is
"info_list": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
And the source document looks like
"_source": {
"id": "1234",
"date": "1614556800000",
"info_list": [
"1234 2D 5678 8765 5678 1111 2222 3333 1"
]
}
The info list basically consists of 9 segments. For the sake of the question, we can say that a, b, c, d, e, f, g, h, i are those 9 segments.
info_list = a + ' ' + b + ' ' + c + ' ' + d + ' ' + e + ' ' + f + ' ' + g + ' ' + h + ' ' + i
Now suppose I want to search for c with a value of 5678. The current implementation uses a match_phrase query, something like this:
GET test/_search
{
"query": {
"match_phrase": {
"info_list": "5678"
}
}
}
The issue with the above approach is that, even though I want results where c = 5678, any segment of the info_list string containing 5678 will match, resulting in wrong search results.
I tried using regex query something like
GET /test/_search
{
"query" : {
"query_string" :
{ "fields" : ["info_list"],
"query" : ".* .* 5678 .*"
}
}
}
But this doesn't seem to work. Should I change the mapping of the field? Any help or suggestions would be appreciated, since I am new to Elasticsearch.
Fixing the regex
You've got to let the regex engine know exactly how many preceding groups there will be. On top of that, you'll need to use the .keyword field, because the standard analyzer that's applied to text fields by default will have split the original string by whitespace and lowercased each token -- both of which should be prevented if you aim to work with capture groups.
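You can see this for yourself by running the string through the _analyze API (just an illustrative check, not part of the fix):
POST _analyze
{
  "analyzer": "standard",
  "text": "1234 2D 5678 8765 5678 1111 2222 3333 1"
}
This returns nine separate tokens, with 2D lowercased to 2d, so a regex against the analyzed text field only ever sees one small token at a time, never the whole string.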
Having said that, here's a working regexp query:
GET /test/_search
{
"query": {
"regexp": {
"info_list.keyword": "( ?[a-zA-Z0-9]+){2} 5678 .*"
}
}
}
Extracting the groups before ingestion
Should I change the mapping of the field?
I'd say go for it. When you know which group you'll be targeting, you should, ideally, extract the groups before you ingest the documents.
See, what I'd do in your case is the following:
Preserve the original info_list as a keyword for consistency
Extract the groups in the programming language of your choice and annotate them with keys a to i, analogously to the way you naturally think about said groups (a minimal sketch follows this list).
Store them inside a nested field in order to guarantee that the connections between the keys and the values aren't lost due to array flattening.
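For the extraction step, a minimal sketch in JavaScript (the raw string and the a-to-i keys come from the question; the variable names are just illustrative):
// Split the raw string and pair each segment with its alphabetical key
const raw = "1234 2D 5678 8765 5678 1111 2222 3333 1";
const keys = ["a", "b", "c", "d", "e", "f", "g", "h", "i"];
const info_list_groups = raw.split(" ").map((value, i) => ({
  group_key: keys[i],
  value: value
}));
// info_list_groups[2] is { group_key: "c", value: "5678" }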
In concrete terms:
Set up a mapping
PUT extracted-groups-index
{
"mappings": {
"properties": {
"info_list": {
"type": "keyword"
},
"info_list_groups": {
"type": "nested",
"properties": {
"group_key": {
"type": "keyword"
},
"value": {
"type": "keyword"
}
}
}
}
}
}
Ingest the doc(s)
POST extracted-groups-index/_doc
{
"info_list": "1234 2D 5678 8765 5678 1111 2222 3333 1",
"info_list_groups": [
{
"group_key": "a",
"value": "1234"
},
{
"group_key": "b",
"value": "2D"
},
{
"group_key": "c",
"value": "5678"
},
{ ... } // omitted for brevity
]
}
Leverage a pair of nested term queries:
POST extracted-groups-index/_search
{
"query": {
"nested": {
"path": "info_list_groups",
"query": {
"bool": {
"must": [
{
"term": {
"info_list_groups.group_key": "c"
}
},
{
"term": {
"info_list_groups.value": "5678"
}
}
]
}
}
}
}
}
Harnessing the full power of Elasticsearch 🚀
The downside of the nested approach is that it'll increase your index size. Plus, the queries tend to get quite verbose and confusing. If you don't want to go that route, you can leverage what's called a custom analyzer.
Such an analyzer is typically composed of:
a tokenizer (which receives character streams and outputs a stream of tokens -- usually words)
and a few token filters whose role is to mold the tokens into the desired form.
In concrete terms, the aim here is to:
Take in the string 1234 2D 5678 8765 5678 1111 2222 3333 1 as a whole
Locate the individual groups separated by whitespace
--> (1234) (2D) (5678) (8765) (5678) (1111) (2222) (3333) (1)
Annotate each group with its alphabetical index
--> a:1234 b:2D c:5678 d:8765 e:5678 f:1111 g:2222 h:3333 i:1
And finally split the resulting string by whitespace in order to use queries like a:1234 and c:5678
All of this can be achieved through a combination of the "noop" keyword tokenizer, and pattern_replace + pattern_capture filters:
PUT power-of-patterns
{
"mappings": {
"properties": {
"info_list": {
"type": "text",
"fields": {
"annotated_groups": {
"type": "text",
"analyzer": "info_list_analyzer"
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"info_list_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["pattern_grouper", "pattern_splitter"]
}
},
"filter": {
"pattern_grouper": {
"type": "pattern_replace",
"pattern": "((?<a>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<b>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<c>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<d>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<e>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<f>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<g>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<h>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))((?<i>(?:\b ?[a-zA-Z0-9]+){0}.*?([a-zA-Z0-9]+) ?))",
"replacement": "a:${a}b:${b}c:${c}d:${d}e:${e}f:${f}g:${g}h:${h}i:${i}"
},
"pattern_splitter": {
"type" : "pattern_capture",
"preserve_original" : true,
"patterns" : [
"([a-i]\\:[a-zA-Z0-9]+)"
]
}
}
}
}
}
Note that the friendly-looking regex from above is nothing more than a repetitive named group catcher.
After setting up the mapping, you can ingest the document(s):
POST power-of-patterns/_doc
{
"info_list": [
"1234 2D 5678 8765 5678 1111 2222 3333 1"
]
}
And then search for the desired segment in a nice, human-readable form:
POST power-of-patterns/_search
{
"query": {
"term": {
"info_list.annotated_groups": "c:5678"
}
}
}
Change your query to:
"query" : "([^ ]+ ){2}5678( [^ ]){6}"
which is English is “two terms-with-a-space, then your search term, then 6 spaces-with-terms”
I'm trying to make a query that gets all the prices that start with '12'.
I have a collection like this:
{
"place": "Costa Rica",
"name": "Villa Lapas",
"price": 1353,
},
{
"place": "Costa Rica",
"name": "Hotel NWS",
"price": 1948,
},
{
"place": "Costa Rica",
"name": "Hotel Papaya",
"price": 1283,
},
{
"place": "Costa Rica",
"name": "Hostal Serine",
"price": 1248,
},
And I want my results like this:
{
'prices': [
1248,
1283
]
}
I'm converting all the prices to strings in order to use a regex function, but I don't understand very well how to use the regex in my query.
My query returns:
{ "prices" : null }
{ "prices" : null }
Could someone please guide me? :)
db.collection.aggregate([
{'$project': {
'_id': 0,
'price': {'$toString': '$price'}
}},
{'$project': {
'prices': {'$regexFind': { 'input': "$price", 'regex': '^12' }}
}}
]).pretty();
You are almost correct.
db.test.aggregate([
{'$project': {
'_id': 0,
'prices': {'$toString': '$price'}
^^^ -> I meant this
}},
{'$match': {
'prices': {'$regex': '^12' }
^^^ -> same here
}}
])
You need to use $match with $regex, which yields the result you expect.
If you use $regexFind, it runs against every document in the pipeline and returns null wherever the input doesn't match the pattern.
Also, in the first $project you have price instead of prices; the field name produced in the first stage must match the one referenced in the second, and then the pipeline matches.
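If you also want the matches collapsed into the single {'prices': [...]} document from the question, you can append a $group stage; a sketch, assuming MongoDB 4.0+ for $toInt:
db.test.aggregate([
  {'$project': {
    '_id': 0,
    'prices': {'$toString': '$price'}
  }},
  {'$match': {
    'prices': {'$regex': '^12'}
  }},
  // Gather every match into one array, restoring the numeric type
  {'$group': {
    '_id': null,
    'prices': {'$push': {'$toInt': '$prices'}}
  }},
  {'$project': {'_id': 0, 'prices': 1}}
])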
I have following json structure in my mongodb.
{
"country": "spain",
"language": "spanish",
"words": [
{
"word": "hello1",
....
},
{
"word": "hello2",
....
},
{
"word": "test",
....
},
]
}
I am trying to get all the dictionaries inside the 'words' list which match a particular substring.
For example, if I have the substring 'hel', how should I query my document using mongoengine so that it gives the two dictionaries with word: 'hello1' and 'hello2'?
The following query works only for an exactly matched word, not for a substring.
data = Data.objects.filter(words__match={"word":"hel"})
// data is empty in this case([])
Using $elemMatch (match in mongoengine) will return only the first element from the array that matches the criteria.
You need to use aggregation to return all matching elements from your array:
pipeline = [
{ "$unwind": "$words" },
{ "$match": {"words.word": {"$regex": "hel"} } },
{ "$project": {"word":"$words.word", "_id":0} },
]
Article.objects.aggregate(*pipeline)
result:
{u'word': u'hello1'}
{u'word': u'hello2'}
Note: with this $project stage you need to know all the fields in advance, so you can specify them in the projection to return them.
You can also use the $project for a different output, returning all fields wrapped in a 'words' dict:
pipeline = [
{ "$unwind": "$words" },
{ "$match": {"words.word": {"$regex": "hel"} } },
{ "$project": {"words":1, "_id":0} },
]
result:
{u'words': {u'otherfield': 1.0, u'word': u'hello1'}}
{u'words': {u'otherfield': 1.0, u'word': u'hello2'}}
I would like to ask if there is some documentation which describes how to work with Elasticsearch pattern regexes.
I need to write a Pattern Capture Token Filter which keeps only tokens that start with a specific word. For example, the input token stream could be ("abcefgh", "abc123", "aabbcc", "abc", "abdef"), and my filter should return only the tokens abcefgh, abc123, abc, because those tokens start with "abc".
Can someone help me achieve this use case?
Thanks.
I suggest something like this:
"analysis": {
"analyzer": {
"my_trim_keyword_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"generate_tokens",
"eliminate_tokens",
"remove_empty"
]
}
},
"filter": {
"eliminate_tokens": {
"pattern": "^(?!abc)\\w+$",
"type": "pattern_replace",
"replacement": ""
},
"generate_tokens": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": [
"(([a-z]+)(\\d*))"
]
},
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
}
}
If your tokens are the result of a pattern_capture filter, you'd need to add, after that filter, the one called eliminate_tokens in my example, which matches tokens that don't start with abc and replaces them with the empty string ("replacement": "").
After this, to remove the empty tokens, I added the remove_empty filter, which is basically a stop filter whose stopword is "" (the empty string).
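Once the analyzer is defined on an index, you can sanity-check the whole chain with the _analyze API (the index name my_index here is just a placeholder):
POST my_index/_analyze
{
  "analyzer": "my_trim_keyword_analyzer",
  "text": "abc123"
}
Tokens that don't start with abc are blanked by eliminate_tokens and then dropped by remove_empty, so only the abc-prefixed tokens come back.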
Basically I'm trying to implement tags functionality on a model.
> db.event.distinct("tags")
[ "bar", "foo", "foobar" ]
A simple distinct query retrieves all distinct tags. However, how would I go about getting all distinct tags that match a certain query? Say, for example, I wanted to get all tags matching foo, expecting ["foo","foobar"] as the result.
The following queries are my failed attempts at achieving this:
> db.event.distinct("tags",/foo/)
[ "bar", "foo", "foobar" ]
> db.event.distinct("tags",{tags: {$regex: 'foo'}})
[ "bar", "foo", "foobar" ]
Use the aggregation framework and not the .distinct() command:
db.event.aggregate([
// De-normalize the array content to separate documents
{ "$unwind": "$tags" },
// Filter the de-normalized content to remove non-matches
{ "$match": { "tags": /foo/ } },
// Group the "like" terms as the "key"
{ "$group": {
"_id": "$tags"
}}
])
You are probably better off using an "anchor" to the beginning of the regex if you mean from the "start" of the string, and also doing this $match before you process $unwind as well:
db.event.aggregate([
// Match the possible documents. Always the best approach
{ "$match": { "tags": /^foo/ } },
// De-normalize the array content to separate documents
{ "$unwind": "$tags" },
// Now "filter" the content to actual matches
{ "$match": { "tags": /^foo/ } },
// Group the "like" terms as the "key"
{ "$group": {
"_id": "$tags"
}}
])
That makes sure you are not processing $unwind on every document in the collection, but only on those that possibly contain your "matched tags" value, before you "filter" to make sure.
The really "complex" way to somewhat mitigate large arrays with possible matches takes a bit more work, and MongoDB 2.6 or greater:
db.event.aggregate([
{ "$match": { "tags": /^foo/ } },
{ "$project": {
"tags": { "$setDifference": [
{ "$map": {
"input": "$tags",
"as": "el",
"in": { "$cond": [
{ "$eq": [
{ "$substr": [ "$$el", 0, 3 ] },
"foo"
]},
"$$el",
false
]}
}},
[false]
]}
}},
{ "$unwind": "$tags" },
{ "$group": { "_id": "$tags" }}
])
So $map is a nice "in-line" processor of arrays but it can only go so far. The $setDifference operator negates the false matches, but ultimately you still need to process $unwind to do the remaining $group stage for distinct values overall.
The advantage here is that arrays are now "reduced" to only the "tags" elements that match. Just don't use this when you want a "count" of the occurrences where there are "multiple distinct" values in the same document. But again, there are other ways to handle that (one is sketched below).
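For instance, one such way is to go back to the $unwind pipeline and make the $group count its members; a sketch along the lines of the second listing above:
db.event.aggregate([
  { "$match": { "tags": /^foo/ } },
  { "$unwind": "$tags" },
  { "$match": { "tags": /^foo/ } },
  // Count how many times each distinct matching tag occurs
  { "$group": {
    "_id": "$tags",
    "count": { "$sum": 1 }
  }}
])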