MongoDB, performance of query by regular expression on indexed fields

I want to find an account by name in a MongoDB collection of 50K accounts.
The usual way is to match on the exact string:
db.accounts.find({ name: 'Jon Skeet' }) // indexes help improve performance!
What about a regular expression? Is it an expensive operation?
db.accounts.find( { name: /Jon Skeet/ }) // worry! how do indexes work with regex?
Edit:
According to WiredPrairie:
MongoDB uses the prefix of a regex to look up indexes (e.g. /^prefix.*/):
db.accounts.find( { name: /^Jon Skeet/ }) // indexes will help!
MongoDB $regex

Actually according to the documentation,
If an index exists for the field, then MongoDB matches the regular
expression against the values in the index, which can be faster than a
collection scan. Further optimization can occur if the regular
expression is a “prefix expression”, which means that all potential
matches start with the same string. This allows MongoDB to construct a
“range” from that prefix and only match against those values from the
index that fall within that range.
http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use
In other words:
For the /Jon Skeet/ regex, MongoDB scans all the keys in the index and then fetches the matching documents, which can be faster than a collection scan.
For the /^Jon Skeet/ regex, MongoDB scans only the range of index keys that start with the prefix, which is faster still.
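The difference between the two cases can be sketched in plain JavaScript, outside MongoDB. This is only an illustration of the "range from a prefix" idea over a sorted list of keys; prefixBounds and scanIndex are hypothetical helpers (not MongoDB APIs), the prefix is assumed to be non-empty, and a real B-tree index would seek straight to the lower bound instead of walking every key.

```javascript
// Smallest string that sorts after every string starting with `prefix`:
// bump the last character by one code unit (assumes a non-empty prefix).
function prefixBounds(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { lower: prefix, upper }; // candidates satisfy lower <= key < upper
}

// Count how many keys have to be tested against the regex.
// Without bounds every key is examined; with bounds only keys in range are.
function scanIndex(sortedKeys, regex, bounds) {
  let examined = 0;
  const matches = [];
  for (const key of sortedKeys) {
    if (bounds && (key < bounds.lower || key >= bounds.upper)) continue;
    examined++;
    if (regex.test(key)) matches.push(key);
  }
  return { examined, matches };
}

const keys = ['Alice', 'Jon Skeet', 'Jon Skeet Jr', 'Not Jon Skeet', 'Zed'].sort();

const substring = scanIndex(keys, /Jon Skeet/);
const prefix = scanIndex(keys, /^Jon Skeet/, prefixBounds('Jon Skeet'));

console.log(substring.examined, substring.matches.length); // 5 3
console.log(prefix.examined, prefix.matches.length);       // 2 2
```

The substring regex has to look at all five keys, while the prefix regex only looks at the two keys inside the range.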

In case anyone still has an issue with search performance, there is a way to optimize regex search even if it searches for a word in a sentence (not necessarily at the beginning ^ or the end $ of the string).
The field should have a text index
db.someCollection.createIndex({ someField: "text" })
and queries should apply the regex only after a plain text search:
db.someCollection.find({ $and:
[
{ $text: { $search: "someWord" }},
// one clause per pattern; the g flag is dropped because $regex does not support it
{ someField: { $regex: /test/i }},
{ someField: { $regex: /other/i }}
]
})
This ensures that the regex will run only for the results of the initial, plain search, which should be quite fast thanks to the index on this field.
It might have a huge impact on search performance, depending on how large the collection is.

Related

MongoDB: Match multiple values in string field

I have a collection of entities that contain a string field. I'm looking for a way to query the collection with a combined number of values, and get all entities that contain all of these values, with these specifications:
contain ALL provided query values, not just some of them
case-insensitive
regardless of order
'word' query values can be part of something bigger (for example separated by _ or any other character)
So as an example, if I provide these words as the query values:
i am spiderman
(I can separate them by whitespace, give an array, or whatever works..)
I expect these results:
- "i am_spiderMan" // should match
- "AM i spiderman?!" // should match
- "who am I? supermanspiderman" // should match
- "I am superman" // should not match
- "i am spider_man" // should not match
I hope this covers all the cases I tried to describe.
I tried regex, and also did some research with similar questions but could not get it to work.
You could use regular expressions. Split the input into words, turn each word into a case-insensitive regex, put them all into an array, and use $all so that every word must match:
db.collection.find ({ key: { $all: [ /spiderman/i, /i/i, /am/i ] } })
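Building that $all array from user input could look like the sketch below. The helper names (escapeRegex, containsAllWords) and the field name key are assumptions for illustration; escaping is added so that characters such as ? or . in the input are matched literally. The last part only checks the generated regexes locally against the sample strings from the question, it does not talk to MongoDB.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// "i am spiderman" -> { key: { $all: [/i/i, /am/i, /spiderman/i] } }
function containsAllWords(field, input) {
  const words = input.trim().split(/\s+/).filter(Boolean);
  return { [field]: { $all: words.map((w) => new RegExp(escapeRegex(w), 'i')) } };
}

const query = containsAllWords('key', 'i am spiderman');

// Local check: a value matches when every regex in $all matches it.
const matchesAll = (value) => query.key.$all.every((re) => re.test(value));

const samples = [
  'i am_spiderMan',
  'AM i spiderman?!',
  'who am I? supermanspiderman',
  'I am superman',
  'i am spider_man',
];
const results = samples.map(matchesAll);
console.log(results); // [ true, true, true, false, false ]
```

The resulting object can be passed to find() as the filter, for example db.collection.find(query) in the shell.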

Wrong regexp query for elasticsearch

I have some problems with the regexp query for Elasticsearch. In my index there's a text field with comma-separated numeric values (IDs), e.g.
2,140,3,2495
And I have the following query term:
"regexp" : {
"myIds" : {
"value" : "^2495,|,2495,|,2495$|^2495$",
"boost" : 1
}
}
But my result list is empty.
Let me say that I know that regexp queries are kind of slow but the index still exists and is filled with millions of documents so unfortunately it's not an option to restructure it. So I need a regex solution.
In Elasticsearch regexp queries, patterns are anchored by default, and ^ and $ are treated as literal characters.
What you mean to use is "2495,.*|.*,2495,.*|.*,2495|2495" - 2495, at the start of string, ,2495, in the middle, ,2495 at the end or a whole string equal to 2495.
Or, you may use a simpler
"(.*,)?2495(,.*)?"
That means
(.*,)? - an optional text (not including line breaks) ending with ,
2495 - your value
(,.*)? - an optional text (not including line breaks) starting with ,
Here is an online demo showing how this expression works (not a proof though).
Ok, I got it to work but run in another problem now. I built the string as follows:
(.*,)?2495(,.*)?|(.*,)?10(,.*)?|(.*,)?898(,.*)?
It works well for a few IDs, but if I have, say, 50 IDs, then ES throws an exception saying that the regexp is too complex to process.
Is there a way to simplify the regexp or restructure the query itself?
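One possible way to shorten the combined pattern is to share the (.*,)? and (,.*)? wrappers across a single alternation of IDs instead of repeating them for every ID. The sketch below assumes the IDs are plain digits (so no escaping is needed). buildIdRegexp and emulate are hypothetical names, and the JavaScript RegExp with explicit anchors only approximates the implicitly anchored Elasticsearch syntax. Whether the shorter pattern stays under the complexity limit for 50 IDs is not verified here.

```javascript
// Build "(.*,)?(id1|id2|...)(,.*)?" from a list of numeric IDs.
function buildIdRegexp(ids) {
  const alternation = ids.map(String).join('|');
  return `(.*,)?(${alternation})(,.*)?`;
}

const pattern = buildIdRegexp([2495, 10, 898]);
console.log(pattern); // (.*,)?(2495|10|898)(,.*)?

// Elasticsearch patterns must match the whole value; emulate that in
// JavaScript by wrapping the pattern in ^(?: ... )$ for local testing.
const emulate = (p) => new RegExp(`^(?:${p})$`);

const re = emulate(pattern);
console.log(re.test('2,140,3,2495')); // true: 2495 is the last ID
console.log(re.test('12495,3'));      // false: 12495 is a different ID
console.log(re.test('1,100,2'));      // false: 100 is not 10
```

The pattern string would then go into the "value" of the regexp query in place of the long per-ID alternation.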

Does mongodb $regex without the option `i` still make use of the index if I am searching on the Index?

I have a model with a normal index using Mongoose.
const mod = new mongoose.Schema({
number: { type: String, required: true, index: { unique: true } },
});
I am using a regex in a query to get the mod corresponding to a specific number. Will my regex query utilize the index that is on this model?
query.number = {
$regex: `.*Q10.*`
}
modelName.find(query)
I am concerned that this is looking through the entire collection without using the indexes. What would be the best way to know whether I am using the index? Or, if you happen to know a way that will utilize the index, could you show me? Here I am looking for everything close to Q10, not trying to get an exact match. Would using /^Q10.*/ be better and use the index?
References: the MongoDB $regex documentation on index use, and comments on a previous Stack Overflow question.
The best way to confirm index usage for a given query is using MongoDB's query explain() feature. See Explain Results in the manual for your version of MongoDB for more information on the output fields and interpretation.
With regular expressions a main concern is efficient use of indexes. An unanchored substring match like /Q10/ will require examining all index keys (assuming a candidate index exists, as in your example). This is an improvement over scanning the full collection data (as would be the case without an index), but not as ideal as being able to check a subset of relevant index keys as is possible with a regex prefix search.
If you are routinely searching for substring matches and there is a common pattern to your strings, you could design a more scalable schema. For example, you could save whatever your Q10 value represents into a separate field (such as part_number) where you could use a prefix match or an exact match (non-regex).
To illustrate, I set up some test data using MongoDB 3.4.2 and the mongo shell:
// Needles: strings to search for
db.mod.insert([{number:'Q10'}, {number: 'foo-Q10'}, {number:'Q10-123'}])
// Haystack: some string values to illustrate key comparisons
for (i=0; i<1000; i++) { db.mod.insert({number: "I" + i}) }
Regex search without an index:
db.mod.find({ number: { $regex: /Q10/ }}).explain('executionStats')
The winningPlan is a COLLSCAN (collection scan) which requires the server retrieve every document in the collection to perform the comparison. Note that the original regex includes an unnecessary .* prefix and suffix; this is implicit with a substring match so can be written more concisely as /Q10/.
Highlights from the executionStats section of the explain output:
"nReturned": 3,
"totalKeysExamined": 0,
"totalDocsExamined": 1003,
The explain output confirms there are no index keys examined and 1003 documents (all docs in this collection).
Add an index for the following two examples:
db.mod.createIndex({number:1}, {unique: true})
Regex substring search with an index:
db.mod.find({ number: { $regex: /Q10/}}).explain('executionStats')
The winningPlan is now an IXSCAN, but it still has to examine all 1003 indexed string values to find substring matches:
"nReturned": 3,
"totalKeysExamined": 1003,
"totalDocsExamined": 3,
Regex prefix search with an index:
db.mod.find({ number: { $regex: /^Q10/}}).explain('executionStats')
The winningPlan is an IXSCAN (Index scan) which requires 3 key comparisons and 2 document fetches to return the 2 matching documents:
"nReturned": 2,
"totalKeysExamined": 3,
"totalDocsExamined": 2,
A prefix search isn't equivalent to the first two searches, as it will not match the document with value foo-Q10. However, this does illustrate a more efficient regex search.
Note that totalKeysExamined is 3. It might be reasonable to expect this to be 2 since there were only 2 matches, but this metric includes any comparisons with out-of-range keys (e.g. the end of a range of values). For more information see Explain Results: keysExamined.
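To compare runs like the ones above without reading the full explain document each time, a small helper can condense it. This is a sketch under assumptions: summarizeExplain is a hypothetical name, and the two sample objects are trimmed by hand to the fields used here, with the plan shape (FETCH over IXSCAN) assumed to match the classic explain format of the MongoDB version used in this answer. Newer server versions may nest the plan differently.

```javascript
// Condense an explain('executionStats') document into a few numbers.
function summarizeExplain(explain) {
  // Follow inputStage links to collect stage names, e.g. FETCH -> IXSCAN.
  const stages = [];
  for (let s = explain.queryPlanner.winningPlan; s; s = s.inputStage) {
    stages.push(s.stage);
  }
  const { nReturned, totalKeysExamined, totalDocsExamined } = explain.executionStats;
  return {
    stages,
    usesIndex: stages.includes('IXSCAN'),
    // Values close to 1 mean little wasted work per returned document.
    keysPerResult: nReturned ? totalKeysExamined / nReturned : null,
    docsPerResult: nReturned ? totalDocsExamined / nReturned : null,
  };
}

// Hand-trimmed samples using the numbers reported in this answer.
const substringExplain = {
  queryPlanner: { winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN' } } },
  executionStats: { nReturned: 3, totalKeysExamined: 1003, totalDocsExamined: 3 },
};
const prefixExplain = {
  queryPlanner: { winningPlan: { stage: 'FETCH', inputStage: { stage: 'IXSCAN' } } },
  executionStats: { nReturned: 2, totalKeysExamined: 3, totalDocsExamined: 2 },
};

const substringSummary = summarizeExplain(substringExplain);
const prefixSummary = summarizeExplain(prefixExplain);
console.log(substringSummary.usesIndex, substringSummary.keysPerResult);
console.log(prefixSummary.usesIndex, prefixSummary.keysPerResult); // true 1.5
```

In the shell, the input would be the result of something like db.mod.find({ number: /^Q10/ }).explain('executionStats').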
With an index in place, a case-sensitive regular expression query that is not anchored to a prefix traverses the entire index (loading it into memory) and then loads the matching documents to be returned. It's expensive, but it can still be better than a full collection scan.
For the /John Doe/ regex, MongoDB scans the entire key set in the index and then fetches the matched documents.
However, if you use a prefix query:
Further optimization can occur if the regular expression is a “prefix
expression”, which means that all potential matches start with the
same string. This allows MongoDB to construct a “range” from that
prefix and only match against those values from the index that fall
within that range.

Mongo Match number that "begins with" query in an array of numbers

In the mongo shell I am trying to query an array of numbers.
An example object:
[{"name":"test", "numbers" : [76.3922, 42.9196154]},
{"name":"test", "numbers" : [87.938547, 42.9196154]},
{"name":"test", "numbers" : [42.9196154,87.938547]}]
I tried to use this to find any document where a number starts with 87:
db.TestColleciton.find( { "numbers": { $in: [ /^-87/ ] } } )
There are two misunderstandings in your approach here. The first is that regular expressions only ever work with strings and not with numeric values. JavaScript is an example of one language implementation that will "stringify" the numeric value for comparison, but that is the basic process. More later.
The other misunderstanding seems to be the use of $in. You don't need that operator "just" to perform a test on an array element, but rather it is the other way around, where you can supply an array of values to test against a field ( either a single value or an array ). This is basically a shorthand form of an $or condition for testing multiple values on the same field. Since there is only one value to test, you likely don't need it here.
If your intent is to match documents that "start with" the "87" value, then you can use JavaScript evaluation with $where. It is not the most efficient approach, though, since an index cannot be used for the matching and the supplied function must be tested by brute force against the whole collection, or at least against the results of the other query arguments:
db.TestCollection.find(
function() {
return this.numbers.some(function(el) {
return /^87/.test(el);
});
}
)
Directly supplying a function argument to .find() is a shell "shortcut" for $where, but you can also use the long form with any string that provides a JavaScript expression returning true/false. Also note that the caret ^ is the correct element to use for "starts with" in a regular expression. JavaScript "stringifies" the number here, so the test will return true.
A better approach is to use native query operators, which is more performant: look for all values between 87 and 88. This returns all floating point variations efficiently:
db.TestCollection.find({
"numbers": { "$elemMatch": { "$gte": 87, "$lt": 88 } }
})
So the $gte and $lt operators bound the range for all possible floating point values beginning with 87 in the most efficient way. The $elemMatch wrapper makes both bounds apply to the same array element; without it, one element could satisfy $gte while a different one satisfies $lt. All elements of the array are inspected.
So when looking for numbers that "begin with", then it is always better to look at the numeric "range" consideration rather than resorting to regular expressions.
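The range idea can be wrapped in a small helper, sketched here under a few assumptions: integerPartFilter and inRange are hypothetical names, the helper only handles the integer part of a number (not arbitrary digit prefixes), and negative inputs are treated as "begins with -87" meaning values in (-88, -87]. It uses $elemMatch so that both bounds apply to the same array element. The local filter at the end only emulates the range check on the sample documents from the question.

```javascript
// Build a filter for arrays that contain a number whose integer part is `whole`.
function integerPartFilter(field, whole) {
  const range = whole >= 0
    ? { $gte: whole, $lt: whole + 1 }   // 87  -> [87, 88)
    : { $gt: whole - 1, $lte: whole };  // -87 -> (-88, -87]
  return { [field]: { $elemMatch: range } };
}

// Minimal local emulation of the four comparison operators used above.
function inRange(n, r) {
  return (r.$gte === undefined || n >= r.$gte)
    && (r.$gt === undefined || n > r.$gt)
    && (r.$lt === undefined || n < r.$lt)
    && (r.$lte === undefined || n <= r.$lte);
}

const docs = [
  { name: 'test', numbers: [76.3922, 42.9196154] },
  { name: 'test', numbers: [87.938547, 42.9196154] },
  { name: 'test', numbers: [42.9196154, 87.938547] },
];

const filter = integerPartFilter('numbers', 87);
const hits = docs.filter((d) =>
  d.numbers.some((n) => inRange(n, filter.numbers.$elemMatch)));

console.log(JSON.stringify(filter)); // {"numbers":{"$elemMatch":{"$gte":87,"$lt":88}}}
console.log(hits.length);            // 2
```

In the shell the generated object would be passed directly, for example db.TestCollection.find(integerPartFilter('numbers', 87)).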
Refer to this
MongoDB regular expression capabilities for pattern matching strings in queries. We've thought about this, but don't think allowing regex against non string fields is a great idea. It will be very slow, and kind of misleading.
We can do it through $where
> db.TestColleciton.find({$where: 'return this.numbers.some(function(n){ return /^87.*/.test(n);})'})

$regex to find a document in mongodb that contains a string

I am working on a query in MongoDB where I need to match documents by applying a regular expression to a string field (fieldToQuery).
The data structure looks like this:
{
fieldToQuery : "4052577300",
someOtherField : "some value",
...
...
}
I have a value like "804052577300", which I need to use to find the document above.
How can I achieve this with the $regex operator?
Update:
I need something like an "ends with" regex in MongoDB.
You could do a sort of reverse regex query where you create a regex using the fieldToQuery value and then test that against your query value:
var value = "804052577300";
db.test.find({
$where: "new RegExp(this.fieldToQuery + '$').test('" + value + "');"
});
The $where operator allows you to execute arbitrary JavaScript against each doc in your collection; where this is the current doc, and the truthy result of the expression determines whether the doc should be included in the result set.
So in this case, a new RegExp is built for each doc's fieldToQuery field with a $ at the end to anchor the search to the end of the string. The test method of the regex is then called to test whether the value string matches the regex.
The $where string is evaluated server-side, so value's value must be evaluated client-side by directly including its value into the $where string.
Note that $where queries can be quite slow as they can't use indexes.
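When the value comes from user input, concatenating it between quotes by hand breaks as soon as it contains a quote or backslash. A possible variation, sketched below, uses JSON.stringify to produce the string literal. endsWithFieldWhere and evalWhere are hypothetical names; evalWhere only emulates locally how the expression sees the document as this, and the field values are assumed to contain no regex metacharacters (digits only, as in the question).

```javascript
// Build the $where expression with the field name and value safely quoted.
function endsWithFieldWhere(field, value) {
  // JSON.stringify yields a valid JavaScript string literal,
  // with quotes and backslashes escaped.
  return `new RegExp(this[${JSON.stringify(field)}] + '$').test(${JSON.stringify(value)})`;
}

const where = endsWithFieldWhere('fieldToQuery', '804052577300');
console.log(where);
// new RegExp(this["fieldToQuery"] + '$').test("804052577300")

// Local emulation only: evaluate the expression with `this` bound to a doc.
const evalWhere = (expr, doc) => new Function(`return ${expr};`).call(doc);

console.log(evalWhere(where, { fieldToQuery: '4052577300' })); // true
console.log(evalWhere(where, { fieldToQuery: '1234' }));       // false
```

The generated string would be used as db.test.find({ $where: where }), with the same performance caveat as above.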
You could do something like this:
db.collection.find({ fieldToQuery: /4052577300$/})
The regex pattern above is equivalent to SQL's LIKE operator:
select * from table where fieldToQuery like '%4052577300' => will return the document with the value "804052577300"