Conditional Search a string & concat with another field using MongoDB aggregation - regex

I would like to search a set of documents on a field called SERVICES.
IF: the string begins with Mail, Envelopes delivered, Lost suitcase, or Found mail,
THEN: I append a period and the string value of the LAGUAGES field to the SERVICES string. The LAGUAGES string is: 'Needs immediate action'.
Sample Data:
/* 1 */
{
"SERVICES" : "Mail has been packaged and sitting in mail room",
"LAGUAGES" : "Needs immediate action"
}
/* 2 */
{
"SERVICES" : "Envelopes delivered to client but were not signed for by anyone",
"LAGUAGES" : "Needs immediate action"
}
/* 3 */
{
"SERVICES" : "There were problems with the client's luggage",
"LAGUAGES" : "Needs immediate action"
}
/* 4 */
{
"SERVICES" : "Lost suitcase at airport while in transit",
"LAGUAGES" : "Needs immediate action"
}
/* 5 */
{
"SERVICES" : "Found mail sitting at airport mailing room",
"LAGUAGES" : "Needs immediate action"
}
Required Output:
/* 1 */
{
"SERVICES" : "Mail has been packaged and sitting in mail room. Needs immediate action"
}
/* 2 */
{
"SERVICES" : "Envelopes delivered to client but were not signed for by anyone. Needs immediate action"
}
/* 3 */
{
"SERVICES" : "Lost suitcase at airport while in transit. Needs immediate action"
}
/* 4 */
{
"SERVICES" : "Found mail sitting at airport mailing room. Needs immediate action"
}
Tried below query:
I did a $match first just to filter the documents, but it seems to apply only the last condition inside my $or. Need help.
{
$match: {
$or: [
{
SERVICES: { $regex: "Mail.* " },
SERVICES: { $regex: "Envelopes delivered.* " },
SERVICES: { $regex: "Lost suitcase. * " },
SERVICES: { $regex: "Found mail. * " },
},
];
}
}
How do I search these strings and return the above output. Thanks.

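Your $match puts all four conditions into a single object, so the repeated SERVICES key overwrites itself and only the last condition survives. Give each condition its own object inside $or, anchor each regex with ^ so it only matches at the beginning of the string, and then use aggregation's $concat operator: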
db.collection.aggregate([
{
$match: {
$or: [
{
SERVICES: { $regex: /^Mail/ }
},
{
SERVICES: { $regex: /^Envelopes delivered/ }
},
{
SERVICES: { $regex: /^Lost suitcase/ }
},
{
SERVICES: { $regex: /^Found mail/ }
}
]
}
},
/** `$addFields` will re-create the `SERVICES` field with the new concatenated string value.
* Or if you want only the `SERVICES` field in the output, use `$project` with `_id: 0` */
{$addFields : {SERVICES : {$concat : ['$SERVICES','.',' ','$LAGUAGES']}}}
])
Or you can use $in instead of $or :
db.collection.aggregate([
{ $match: { SERVICES: { $in: [/^Mail/, /^Envelopes delivered/, /^Lost suitcase/, /^Found mail/] } } } ,
{$addFields : {SERVICES : {$concat : ['$SERVICES', '.', ' ','$LAGUAGES']}}}
]);
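To see why only the last condition in the question's $match took effect, and to check the expected output without a server, here is a minimal plain JavaScript sketch. The helper names buildPipeline and applyLocally are made up for illustration, and the prefixes are assumed to be plain text without regex special characters.

```javascript
// Repeating a key in one object literal keeps only the last value,
// so a $match written this way carries a single condition.
const broken = {
  SERVICES: { $regex: "Mail.* " },
  SERVICES: { $regex: "Found mail. * " },
};

const PREFIXES = ["Mail", "Envelopes delivered", "Lost suitcase", "Found mail"];

// Hypothetical helper: builds the $in variant of the pipeline from a list of prefixes.
function buildPipeline(prefixes) {
  const patterns = prefixes.map((p) => new RegExp("^" + p));
  return [
    { $match: { SERVICES: { $in: patterns } } },
    { $addFields: { SERVICES: { $concat: ["$SERVICES", ". ", "$LAGUAGES"] } } },
  ];
}

// Local stand-in for the two stages: filter on the anchored prefixes, then append.
function applyLocally(docs, prefixes) {
  const patterns = prefixes.map((p) => new RegExp("^" + p));
  return docs
    .filter((d) => patterns.some((re) => re.test(d.SERVICES)))
    .map((d) => ({ ...d, SERVICES: d.SERVICES + ". " + d.LAGUAGES }));
}
```

Against a real collection you would pass buildPipeline(PREFIXES) to db.collection.aggregate(); applyLocally only mimics the result on an in-memory array.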


MongoDB searching for occurrences by date

So I've got a large dataset stored in my MongoDB of each time a song has been played in my iTunes library, so each document contains the artist name, song name, and date/time it was played. I am currently able to use the following query to search for the most occurrences of a song in the database, which basically gives me the total number of times I have played it:
db.apple.aggregate([{ $sortByCount: "$song" }])
Returns:
{ "_id" : "Fireflies (feat. Grieves)", "count" : 336 }
{ "_id" : "Cinderella (feat. Ty Dolla $ign)", "count" : 267 }
{ "_id" : "Check", "count" : 241 }
{ "_id" : "100 Grandkids", "count" : 240 }
{ "_id" : "Late For the Sky (feat. Slug & Aesop Rock)", "count" : 226 }
This returns the total number of plays I have on a song, over the 5 years of plays I have in the database. What I was hoping to be able to do is create a query that returns the total number of plays of a song for a specific year. I have the following query:
db.apple.find({"playTime" : {$regex : ".*2019*"}}).pretty()
This one returns all the songs that were played in a year, but I can't figure out how I would combine these two queries.
Assuming playTime is a string data type ({ "playTime" : "2017-06-17T06:04:40.230Z" }), extract the first 4 characters of the string using the $substrCP and convert to an integer and match with an input year. The $sortByCount stage will remain as it is. The conversion to integer is optional; if not used the input year should be a string.
For example (using integer year):
var INPUT_YEAR = 2017
db.test.aggregate( [
{
$match: {
$expr: {
$eq: [ INPUT_YEAR, { $toInt: { $substrCP: [ "$playTime", 0, 4 ] } } ]
}
}
},
{
$sortByCount: "$song"
}
] )
Since you already have both queries, you just need to put them in the same aggregation pipeline, as JBone suggested in the comments. The $match stage has to come first, because $sortByCount outputs only _id and count, so playTime is no longer available after it. If your queries work as you have mentioned, this will do the trick:
db.apple.aggregate([
{ $match: { "playTime" : {$regex : ".*2019*"} } },
{ $sortByCount: "$song" }
])
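The order of the two stages matters: $sortByCount emits documents that contain only _id and count, so a $match on playTime placed after it has nothing left to match. Here is a small plain JavaScript sketch that mimics both orders on an in-memory array. The sample plays and the helper names are made up for illustration; the field names follow the question.

```javascript
// Made-up sample documents using the question's field names.
const plays = [
  { song: "Check", playTime: "2019-03-01T10:00:00.000Z" },
  { song: "Check", playTime: "2019-07-15T18:30:00.000Z" },
  { song: "Check", playTime: "2010-01-01T00:00:00.000Z" },
  { song: "100 Grandkids", playTime: "2019-05-05T12:00:00.000Z" },
];

// Local stand-in for { $sortByCount: "$song" }: the output has only _id and count.
function sortByCount(docs, field) {
  const counts = new Map();
  for (const d of docs) counts.set(d[field], (counts.get(d[field]) || 0) + 1);
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

// Local stand-in for a $match on playTime with a regex; a missing field never matches.
function matchPlayTime(docs, re) {
  return docs.filter((d) => typeof d.playTime === "string" && re.test(d.playTime));
}

// Filter first, then count: plays per song for 2019.
const matchFirst = sortByCount(matchPlayTime(plays, /^2019/), "song");

// Count first, then filter: playTime is gone, so the result is empty.
const countFirst = matchPlayTime(sortByCount(plays, "song"), /^2019/);

// ".*2019*" means "201" followed by zero or more 9s, so it also accepts 2010.
const looseMatches2010 = new RegExp(".*2019*").test("2010-01-01T00:00:00.000Z");
```

The last line is also why an anchored pattern such as /^2019/ is a safer year filter on an ISO 8601 string than ".*2019*".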
If playTime is a string in ISO 8601 format, then you can try this:
db.apple.aggregate([{
$match: {
$expr: {
$eq: [2019, {
$year: {
$dateFromString: {
dateString: '$playTime'
}
}
}]
}
}
}, { $sortByCount: "$song" }])
Or, if you can change it to (or already have) an ISODate(), then:
db.apple.aggregate([{
$match: {
$expr: {
$eq: [2019, {
$year: '$playTime'
}]
}
}
}, { $sortByCount: "$song" }])
Ref : $year,$dateFromString,$match or $isoWeekYear

Implement auto-complete feature using MongoDB search

I have a MongoDB collection of documents of the form
{
"id": 42,
"title": "candy can",
"description": "canada candy canteen",
"brand": "cannister candid",
"manufacturer": "candle canvas"
}
I need to implement auto-complete feature based on the input search term by matching in the fields except id. For example, if the input term is can, then I should return all matching words in the document as
{ hints: ["candy", "can", "canada", "canteen", ...] }
I looked at this question but it didn't help. I also tried searching how to do regex search in multiple fields and extract matching tokens, or extracting matching tokens in a MongoDB text search but couldn't find any help.
tl;dr
There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (using the below mapReduce inline instead of doing an output to a collection), but except for very small databases, it is not possible to do this in realtime.
The problem
As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportionate amount of RAM for that feature. If you didn't index all fields, a regex search would cause a collection scan, which means that every document would have to be loaded from disk, which would take too long for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would create considerable load on the backend.
The solution
The problem is quite similar to one I have already answered: We need to extract every word out of multiple fields, remove the stop words, and save the remaining words in a collection, together with a link to the respective document(s) each word was found in. Now, for getting an autocompletion list, we simply query the indexed word list.
Step 1: Use a map/reduce job to extract the words
db.yourCollection.mapReduce(
// Map function
function() {
// We need to save this in a local var as per scoping problems
var document = this;
// You need to expand this according to your needs
var stopwords = ["the","this","and","or"];
for(var prop in document) {
// We are only interested in strings and explicitly not in _id
if(prop === "_id" || typeof document[prop] !== 'string') {
continue
}
(document[prop]).split(" ").forEach(
function(word){
// You might want to adjust this to your needs
var cleaned = word.replace(/[;,.]/g,"")
if(
// We neither want stopwords...
stopwords.indexOf(cleaned) > -1 ||
// ...nor string which would evaluate to numbers
!(isNaN(parseInt(cleaned))) ||
!(isNaN(parseFloat(cleaned)))
) {
return
}
emit(cleaned,document._id)
}
)
}
},
// Reduce function
function(k,v){
// Kind of ugly, but works.
// Improvements more than welcome!
var values = { 'documents': []};
v.forEach(
function(vs){
if(values.documents.indexOf(vs)>-1){
return
}
values.documents.push(vs)
}
)
return values
},
{
// We need this for two reasons...
finalize:
function(key,reducedValue){
// First, we ensure that each resulting document
// has the documents field in order to unify access
var finalValue = {documents:[]}
// Second, we ensure that each document is unique in said field
if(reducedValue.documents) {
// We filter the existing documents array
finalValue.documents = reducedValue.documents.filter(
function(item,pos,self){
// The default return value
var loc = -1;
for(var i=0;i<self.length;i++){
// We have to do it this way since indexOf only works with primitives
if(self[i].valueOf() === item.valueOf()){
// We have found the value of the current item...
loc = i;
//... so we are done for now
break
}
}
// If the location we found equals the position of item, they are equal
// If it isn't equal, we have a duplicate
return loc === pos;
}
);
} else {
finalValue.documents.push(reducedValue)
}
// We have sanitized our data, now we can return it
return finalValue
},
// Our result are written to a collection called "words"
out: "words"
}
)
Running this mapReduce against your example would result in db.words looking like this:
{ "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
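If you want to check the word extraction without running the job, the logic of the map function can be mirrored in plain JavaScript. This is only a sketch: extractWords is a made-up helper, and the skip for empty strings is an addition that the map function above does not have.

```javascript
// Same stopword list as in the map function; expand it as needed.
const STOPWORDS = ["the", "this", "and", "or"];

// Hypothetical helper mirroring the map function for a single document:
// returns the sorted set of words that would be emitted for it.
function extractWords(doc) {
  const words = new Set();
  for (const [prop, value] of Object.entries(doc)) {
    // Only string fields are of interest, and explicitly not _id.
    if (prop === "_id" || typeof value !== "string") continue;
    for (const raw of value.split(" ")) {
      const cleaned = raw.replace(/[;,.]/g, "");
      if (
        cleaned === "" || // addition: skip empty tokens from double spaces
        STOPWORDS.includes(cleaned) ||
        !isNaN(parseInt(cleaned)) ||
        !isNaN(parseFloat(cleaned))
      ) {
        continue;
      }
      words.add(cleaned);
    }
  }
  return [...words].sort();
}

// The example document from the question.
const example = {
  id: 42,
  title: "candy can",
  description: "canada candy canteen",
  brand: "cannister candid",
  manufacturer: "candle canvas",
};
const extracted = extractWords(example);
```

For the example document, extracted holds the same eight words that appear as _id values in db.words above.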
Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indices in RAM, we can do a few tricks to both speed up autocompletion and reduce the load put on the server.
Step 2: Query for autocompletion
For autocompletion, we only need the words, without the links to the documents.
Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.
To stick with your example, we would use the following query to get the candidates for autocompletion:
db.words.find({_id:/^can/},{_id:1})
which gives us the result
{ "_id" : "can" }
{ "_id" : "canada" }
{ "_id" : "candid" }
{ "_id" : "candle" }
{ "_id" : "candy" }
{ "_id" : "cannister" }
{ "_id" : "canteen" }
{ "_id" : "canvas" }
Using the .explain() method, we can verify that this query uses only the index.
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 8,
"nscannedObjects" : 0,
"nscanned" : 8,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 8,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
"can",
"cao"
],
[
/^can/,
/^can/
]
]
},
"server" : "32a63f87666f:27017",
"filterSet" : false
}
Note the indexOnly:true field.
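In an application the search term comes from the user, so it is worth escaping it before building the anchored regex; otherwise an input like "c.n" would be treated as a pattern instead of a literal prefix. A minimal sketch, where escapeRegex and prefixQuery are made-up helper names:

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds the filter and projection for the covered prefix query.
// The regex stays case-sensitive and anchored with ^, like /^can/ above.
function prefixQuery(term) {
  return {
    filter: { _id: new RegExp("^" + escapeRegex(term)) },
    projection: { _id: 1 },
  };
}

const q = prefixQuery("can");
// In the shell this would be used as: db.words.find(q.filter, q.projection)
```

Keeping the pattern a case-sensitive prefix is what lets the query run against a range of the _id index, as the indexBounds in the explain output show.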
Step 3: Query the actual document
Although we will have to do two queries to get the actual document, we speed up the overall process, so the user experience should be good enough.
Step 3.1: Get the document of the words collection
When the user selects one of the autocompletion choices, we have to query the complete document in words in order to find the documents the chosen word originated from.
db.words.find({_id:"canteen"})
which would result in a document like this:
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Step 3.2: Get the actual document
With that document, we can now either show a page with search results or, like in this case, redirect to the actual document which you can get by:
db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
Notes
While this approach may seem complicated at first (well, the mapReduce is a bit), it is actually pretty easy conceptually. Basically, you are trading real time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. In order to make the rather costly mapReduce phase more efficient, implementing Incremental mapReduce could be an approach – improving my admittedly hacked mapReduce might well be another.
Last but not least, this way is a rather ugly hack altogether. You might want to dig into elasticsearch or lucene. Those products imho are much, much more suited for what you want.
Thanks to @Markus's solution, I came up with something similar using aggregations instead, since map-reduce is flagged as deprecated in later versions.
const { MongoDBNamespace, Collection } = require('mongodb')
//.replace(/(\b(\w{1,3})\b(\W|$))/g,'').split(/\s+/).join(' ')
const routine = `function (text) {
const stopwords = ['the', 'this', 'and', 'or', 'id']
text = text.replace(new RegExp('\\b(' + stopwords.join('|') + ')\\b', 'g'), '')
text = text.replace(/[;,.]/g, ' ').trim()
return text.toLowerCase()
}`
// If the pipeline includes the $out operator, aggregate() returns an empty cursor.
const agg = [
{
$match: {
a: true,
d: false,
},
},
{
$project: {
title: 1,
desc: 1,
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$concat: ['$title', ' ', '$desc'],
},
},
},
{
$addFields: {
cleaned: {
$function: {
body: routine,
args: ['$text'],
lang: 'js',
},
},
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$trim: {
input: '$cleaned',
},
},
},
},
{
$project: {
words: {
$split: ['$text', ' '],
},
qt: {
$const: 1,
},
},
},
{
$unwind: {
path: '$words',
includeArrayIndex: 'id',
preserveNullAndEmptyArrays: true,
},
},
{
$group: {
_id: '$words',
docs: {
$addToSet: '$_id',
},
weight: {
$sum: '$qt',
},
},
},
{
$sort: {
weight: -1,
},
},
{
$limit: 100,
},
{
$out: {
db: 'listings_db',
coll: 'words',
},
},
]
// Closure for db instance only
/**
*
* @param { MongoDBNamespace } db
*/
module.exports = function (db) {
/** @type { Collection } */
let collection
/**
* Runs the aggregation pipeline
* @return {Promise}
*/
this.refreshKeywords = async function () {
collection = db.collection('listing')
// .toArray() to trigger the aggregation
// it returns an empty cursor so it's fine
return await collection.aggregate(agg).toArray()
}
}
Please check it; it should need only minimal changes for your use case.

How can I sort mongodb regex search query results based on regex match count

I can't figure out how to sort query results based on the "best" match.
Here's a simple example, I have a "zone" collection containing a list of city/zipcode couples.
If I search for several words with regexes combined with $or, like this:
db.zones.find({$or : [ {ville: /ROQUE/}, {ville: /ANTHERON/}] })
Results won't be ordered by "best match".
What other solutions do I have for that ?
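You could try a text search, see http://docs.mongodb.org/manual/reference/operator/query/text/#match-any-of-the-search-terms
// The text index takes no score option; the score is requested in the query below.
// On MongoDB 3.0 and later, use createIndex instead of the deprecated ensureIndex.
db.zones.ensureIndex( { 'ville' : 'text' } )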
db.zones.find(
{ $text: { $search: "ROQUE ANTHERON"}},
{ score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
Result:
{
"_id" : ObjectId("547c2473371ea419f07b954c"),
"ville" : "ANTHERON",
"score" : 1.1
}
{
"_id" : ObjectId("547c246f371ea419f07b954b"),
"ville" : "ROQUE",
"score" : 1
}
From documentation
If the search string is a space-delimited string, $text operator
performs a logical OR search on each term and returns documents that
contains any of the terms.
This requires MongoDB 2.6 or later.
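If you want to stay with regexes rather than a text index, one possible alternative is to count in an aggregation how many of the patterns each document matches and sort on that count. This is a sketch under assumptions, not something tested against your data: it relies on $regexMatch, which needs MongoDB 4.2 or later, it assumes ville is always a string, and the helper names and the sample values are made up.

```javascript
// Hypothetical helper: builds a pipeline that keeps documents matching any term
// and sorts them by how many of the terms they match.
function buildScoredPipeline(field, terms) {
  const regexes = terms.map((t) => new RegExp(t));
  return [
    { $match: { $or: regexes.map((re) => ({ [field]: re })) } },
    {
      $addFields: {
        matchCount: {
          $add: regexes.map((re) => ({
            $cond: [{ $regexMatch: { input: "$" + field, regex: re } }, 1, 0],
          })),
        },
      },
    },
    { $sort: { matchCount: -1 } },
  ];
}

// Local stand-in for the pipeline, to check the ordering without a server.
function scoreLocally(docs, field, terms) {
  const regexes = terms.map((t) => new RegExp(t));
  return docs
    .map((d) => ({ ...d, matchCount: regexes.filter((re) => re.test(d[field])).length }))
    .filter((d) => d.matchCount > 0)
    .sort((a, b) => b.matchCount - a.matchCount);
}

// Made-up sample values.
const zones = [
  { ville: "ROQUE" },
  { ville: "LA ROQUE D ANTHERON" },
  { ville: "ANTHERON" },
  { ville: "PARIS" },
];
const ranked = scoreLocally(zones, "ville", ["ROQUE", "ANTHERON"]);
```

In the shell the pipeline would be passed as db.zones.aggregate(buildScoredPipeline("ville", ["ROQUE", "ANTHERON"])). Note that the unanchored regexes cannot use an index efficiently, so this is only reasonable on small collections.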
I ended up using the ElasticSearch search engine to do this query:
@zones = Zone.es.search(
body: {
query: {
bool: {
should: [
{match: {city: search}},
{match: {zipcode: search.to_i}}
]
}
},
size: limit
})
Where search is a search param sent by view.
ElasticSearch with Selectize plugin

How to search and replace in mongoose?

My objective: I want to update multiple documents in a collection. For every document where a certain path matches a regex, I want to search and replace a certain value in that path, and finally save all those documents persistently in the db.
example:
myCollection : [
{ doc1 : { summary : 'Summary 1 : one', value : 1 } },
{ doc2 : { summary : 'Summaryyuist 2 : two', value : 1 } },
{ doc3 : { summary : 'hello 3 : three', value : 3 } },
];
Now I want to replace all 'Summary' with 'hello' in the path 'summary',
so the result after the query should be:
myCollection : [
{ doc1 : { summary : 'hello 1 : one', value : 1 } },
{ doc2 : { summary : 'helloyuist 2 : two', value : 1 } },
{ doc3 : { summary : 'hello 3 : three', value : 3 } },
];
I am just looking for the query to use for the above.
From here
http://mongoosejs.com/docs/api.html#query_Query-regex
I found no information on how to implement this, especially what the 'Number' parameter does in the regex method.
I also could not find the 'show code' link for the mongoose regex method. Can someone at least reply with a link to the mongoose regex code?
I can only find how to find models, NOT how to update documents by replacing the part of a path's value that matches the regex.
How can I achieve my objective?
You could use mongoose's Model.update, I hope.
Model.update = function (query, doc, options, callback) { ... }
Example :
MyModel.update({ age: { $gt: 18 } }, { oldEnough: true }, fn);
MyModel.update({ name: 'Tobi' }, { ferret: true }, { multi: true }, function(err,numberAffected, raw) {
if (err) return handleError(err);
console.log('The number of updated documents was %d', numberAffected);
console.log('The raw response from Mongo was ', raw);
});
http://mongoosejs.com/docs/api.html#model_Model.update
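Model.update on its own sets fields to fixed values; it does not rewrite part of an existing string. On newer servers one possible approach is an update that takes an aggregation pipeline (MongoDB 4.2 or later) together with the $replaceAll operator (MongoDB 4.4 or later). The sketch below only builds the filter and update documents and mimics the result locally. buildReplaceUpdate and replaceLocally are made-up helper names, and $replaceAll replaces a literal string, not a regex match.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: filter selects documents whose path contains the text,
// update is a one-stage pipeline that rewrites the path with $replaceAll.
function buildReplaceUpdate(path, find, replacement) {
  return {
    filter: { [path]: { $regex: escapeRegex(find) } },
    update: [
      { $set: { [path]: { $replaceAll: { input: "$" + path, find, replacement } } } },
    ],
  };
}

// Local stand-in showing what the update does to each matching document.
function replaceLocally(docs, path, find, replacement) {
  return docs.map((d) =>
    typeof d[path] === "string" && d[path].includes(find)
      ? { ...d, [path]: d[path].split(find).join(replacement) }
      : d
  );
}

const { filter, update } = buildReplaceUpdate("summary", "Summary", "hello");
// In the shell: db.myCollection.updateMany(filter, update)
```

Whether your Mongoose version passes a pipeline-style update through to the server unchanged is something to check in its documentation; on older servers the usual fallback is to find the matching documents, change the string in application code, and save each one.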

MongoDB aggregation framework on secondary node

How can I execute an aggregation framework task on a secondary node using the C++ driver?
Here's an example that always executes on the primary node:
DBClientConnection c;
bo res;
vector<bo> pipeline;
pipeline.push_back( BSON( "$match" << BSON( "firstName" << "Stephen" ) ) );
c.connect( "localhost:12345" );
c.runCommand( "test", BSON( "aggregate" << "people" << "pipeline" << pipeline ), res );
cout << res.toString() << endl;
I need to execute it on secondary.
While I haven't worked with the C++ driver for MongoDB, running aggregations on a secondary is easily possible by simply setting the read preference to secondary. For example, in the shell:
mongo -u admin -p <pwd> --authenticationDatabase admin --host
RS-repl0-0/server-1.servers.example.com:27017,server-2.servers.example.com:27017
RS-repl0-0:PRIMARY> use test
switched to db test
RS-repl0-0:PRIMARY> db.setSlaveOk() // Ok to run commands on a slave
RS-repl0-0:PRIMARY> db.getMongo().setReadPref('secondary') // Set read pref
RS-repl0-0:PRIMARY> db.getMongo().getReadPrefMode()
secondary
RS-repl0-0:PRIMARY> db.zips.aggregate( [
... { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
... { $match: { totalPop: { $gte: 10*1000*1000 } } }
... ] )
{ "_id" : "CA", "totalPop" : 29754890 }
{ "_id" : "FL", "totalPop" : 12686644 }
...
One can verify from the MongoDB logs that this indeed ran on the secondary:
...
2016-12-05T06:20:14.783+0000 I COMMAND [conn200] command test.zips command: aggregate { aggregate: "zips", pipeline: [ { $group: { _id: "$state", totalPop: { $sum: "$pop" } } }, {
$match: { totalPop: { $gte: 10000000.0 } } } ], cursor: {} } keyUpdates:0 writeConflicts:0 numYields:229 reslen:338 locks:{ Global: { acquireCount: { r: 466 } }, Database: { acquire
Count: { r: 233 } }, Collection: { acquireCount: { r: 233 } } } protocol:op_command 49ms
...
Note that this is applicable on secondaries of a sharded MongoDB cluster as well.
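For drivers that accept a connection string, the same read preference can usually be set through the standard readPreference URI option instead of calling setReadPref. A minimal sketch that only assembles such a URI from the hosts shown above; secondaryUri is a made-up helper, and whether a particular C++ driver version accepts this URI is an assumption to verify against its documentation.

```javascript
// Hypothetical helper: builds a replica-set connection string that asks
// the driver to route reads (including aggregate) to a secondary.
function secondaryUri(hosts, replicaSet, dbName) {
  const params = new URLSearchParams({ replicaSet, readPreference: "secondary" });
  return "mongodb://" + hosts.join(",") + "/" + dbName + "?" + params.toString();
}

const uri = secondaryUri(
  ["server-1.servers.example.com:27017", "server-2.servers.example.com:27017"],
  "RS-repl0-0",
  "test"
);
```

Credentials are left out on purpose; they would go in front of the host list or be supplied through the driver's own options.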