Mongo regex index matching - multiple start strings - regex

I want to match multiple start strings in MongoDB. explain() shows that it's using the index on indexedfield for this query:
db.mycol.find({indexedfield:/^startstring/,nonindexedfield:/somesubstring/});
However, the following query for multiple start strings is really slow. When I run explain I get an error. Judging by the faults I can see in mongostat (7k a second), it's scanning the entire collection. It's also alternating between 0% locked and 90-95% locked every few seconds.
db.mycol.find({indexedfield:/^(startstring1|startstring2)/,nonindexedfield:/somesubstring/}).explain();
JavaScript execution failed: error: { "$err" : "assertion src/mongo/db/key.cpp:421" } at src/mongo/shell/query.js:L128
Can anyone shed some light on how I can do this or what is causing the explain error?
UPDATE - more info
Ok, so I managed to get explain to work on the more complex query by limiting the number of results. The difference is this:
For a single prefix, /^BA1/ (yes, they're postcodes):
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 10,
"nscannedObjectsAllPlans" : 19,
"nscannedAllPlans" : 19,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"indexedfield" : [
[
"BA1",
"BA2"
],
[
/^BA1/,
/^BA1/
]
]
}
For multiple prefixes, /^(BA1|BA2)/:
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 1075276,
"nscannedObjectsAllPlans" : 1075285,
"nscannedAllPlans" : 2150551,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5,
"nChunkSkips" : 0,
"millis" : 4596,
"indexBounds" : {
"indexedfield" : [
[
"",
{
}
],
[
/^(BA1|BA2)/,
/^(BA1|BA2)/
]
]
}
which doesn't look very good: over a million index entries scanned to return 10 documents.

$or solves the problem in terms of using the indexes (thanks EddieJamsession). Queries are now lightning fast.
db.mycoll.find({$or: [{indexedfield:/^startstring1/},{indexedfield:/^startstring2/}], nonindexedfield:/somesubstring/})
However, I would still like to do this with a regex if possible so I'm leaving the question open. Not least because I now have to refactor my application to take these types of queries into account.
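If you'd rather keep a single clause than restructure everything around $or, an $in with one anchored regex per prefix may also be worth trying; as far as I know the planner can derive index bounds for each regex in the list, but verify with explain() on your own data (a sketch, not something confirmed in this thread):
db.mycoll.find({indexedfield: {$in: [/^startstring1/, /^startstring2/]}, nonindexedfield: /somesubstring/}).explain()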

Related

Neptune loader FROM_OR_TO_VERTEX_ARE_MISSING

I tried to follow this example, https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-data.html, to load data into Neptune:
curl -X POST -H 'Content-Type: application/json' https://endpoint:port/loader -d '
{
    "source" : "s3://source.csv",
    "format" : "csv",
    "iamRoleArn" : "role",
    "region" : "region",
    "failOnError" : "FALSE",
    "parallelism" : "MEDIUM",
    "updateSingleCardinalityProperties" : "FALSE",
    "queueRequest" : "TRUE"
}'
{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "411ee078-3c44-4620-85ac-e22ef5466bbb"
    }
}
I get status 200, but when I check whether the data was loaded, I get this:
curl -G 'https://endpoint:port/loader/411ee078-3c44-4620-85ac-e22ef5466bbb'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_FAILED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://source.csv",
            "runNumber" : 1,
            "retryNumber" : 1,
            "status" : "LOAD_FAILED",
            "totalTimeSpent" : 4,
            "startTime" : 1617653964,
            "totalRecords" : 10500,
            "totalDuplicates" : 0,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 10500
        }
    }
}
I had no idea why I got LOAD_FAILED, so I used the Get-Status API to see what errors caused the load failure, and got this:
curl -X GET 'endpoint:port/loader/411ee078-3c44-4620-85ac-e22ef5466bbb?details=true&errors=true'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_FAILED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://source.csv",
            "runNumber" : 1,
            "retryNumber" : 1,
            "status" : "LOAD_FAILED",
            "totalTimeSpent" : 4,
            "startTime" : 1617653964,
            "totalRecords" : 10500,
            "totalDuplicates" : 0,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 10500
        },
        "failedFeeds" : [
            {
                "fullUri" : "s3://source.csv",
                "runNumber" : 1,
                "retryNumber" : 1,
                "status" : "LOAD_FAILED",
                "totalTimeSpent" : 1,
                "startTime" : 1617653967,
                "totalRecords" : 10500,
                "totalDuplicates" : 0,
                "parsingErrors" : 0,
                "datatypeMismatchErrors" : 0,
                "insertErrors" : 10500
            }
        ],
        "errors" : {
            "startIndex" : 1,
            "endIndex" : 10,
            "loadId" : "411ee078-3c44-4620-85ac-e22ef5466bbb",
            "errorLogs" : [
                {
                    "errorCode" : "FROM_OR_TO_VERTEX_ARE_MISSING",
                    "errorMessage" : "Either from vertex, '1414', or to vertex, '70', is not present.",
                    "fileName" : "s3://source.csv",
                    "recordNum" : 0
                },
What does this error even mean and what is the possible fix?
It looks as if you were trying to load some edges. When an edge is loaded, the two vertices that the edge will be connecting must already have been loaded/created. The message:
"errorMessage" : "Either from vertex, '1414', or to vertex, '70',is not present.",
is letting you know that one (or both) of the vertices with ID values '1414' and '70' is missing. All vertices referenced by a CSV file containing edges must already exist (have been created or loaded) before the edges that reference them are loaded. If the CSV files for vertices and edges are in the same S3 location, the bulk loader can figure out the order to load them in. If you just ask the loader to load a file containing edges while the vertices are not yet loaded, you will get an error like the one you shared.
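For reference, this is roughly what minimal vertex and edge CSV files look like in the Gremlin load format (the file names and the "place"/"route" labels here are made up for illustration); the vertex file must be loaded, or sit in the same S3 location, before the edge file that references those IDs:
vertices.csv:
~id,~label
1414,place
70,place
edges.csv:
~id,~from,~to,~label
e1,1414,70,route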

Case Insensitive Search With Regex

I'm trying to implement a case-insensitive search with regex.
Example: /^sanford/i (searching for anything starting with "sanford", disregarding case sensitivity).
For case-insensitive queries, the documentation recommends creating indexes with a custom collation. This works fine as long as no regex is involved.
The problem: when searching with a regex (in this case "starts with"), the case-insensitive search does NOT take the index into account.
This is stated in the documentation multiple times and is also reproducible with an explain command.
To sum it up: it works, but without effectively using the index. I'd be glad to get any hints; I can't get rid of the feeling that I'm missing something fundamentally important here.
Inserting the string with toLowerCase and then searching only with lower cased strings is not an option.
I can't use a text index because there can only be one per collection.
Example from the documentation: see here (the green info box at the bottom).
@D.SM: The index is used, but it scans all documents.
https://docs.atlas.mongodb.com/schema-suggestions/case-insensitive-regex/
Example document:
{
    "name" : [{
        "family" : "Test",
        "given" : "Name"
    }]
}
Index with collation:
{ "name" : "name_family", "key" : { "name.family" : 1 }, "host" : "noneofyourbusiness.com", "accesses" : { "ops" : NumberLong(114), "since" : ISODate("2020-07-30T20:25:59.079Z") }, "shard" : "shA", "spec" : { "v" : 2, "key" : { "name.family" : 1 }, "name" : "name_family", "ns" : "noneofyourbusiness.somethingwithaname", "collation" : { "locale" : "de", "caseLevel" : false, "caseFirst" : "off", "strength" : 1, "numericOrdering" : false, "alternate" : "non-ignorable", "maxVariable" : "punct", "normalization" : false, "backwards" : false, "version" : "57.1" } } }
}
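For reference, an index like the one above would be created along these lines, and a plain (non-regex) equality query has to pass a matching collation to use it; the collection name here is taken from the ns field and may not match yours:
db.somethingwithaname.createIndex(
    { "name.family" : 1 },
    { name : "name_family", collation : { locale : "de", strength : 1 } }
)
// Equality match: can use the collation index for a case-insensitive lookup.
db.somethingwithaname.find({ "name.family" : "test" }).collation({ locale : "de", strength : 1 })
// Regex match: cannot use the collation, so it scans as described above.
db.somethingwithaname.find({ "name.family" : /^test/i })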

Mongodb db.collection.distinct() on aws documentdb doesn't use index

Transitioning to the new AWS DocumentDB service. Currently on Mongo 3.2. When I run db.collection.distinct("FIELD_NAME") it returns the results really quickly. I did a database dump to AWS DocumentDB (Mongo 3.6 compatible) and this simple query just gets stuck.
Here's my .explain() and the indexes on the working instance versus AWS documentdb:
Explain function on working instance:
> db.collection.explain().distinct("FIELD_NAME")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "db.collection",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [ ]
        },
        "winningPlan" : {
            "stage" : "PROJECTION",
            "transformBy" : {
                "_id" : 0,
                "FIELD_NAME" : 1
            },
            "inputStage" : {
                "stage" : "DISTINCT_SCAN",
                "keyPattern" : {
                    "FIELD_NAME" : 1
                },
                "indexName" : "FIELD_INDEX_NAME",
                "isMultiKey" : false,
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 1,
                "direction" : "forward",
                "indexBounds" : {
                    "FIELD_NAME" : [
                        "[MinKey, MaxKey]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
Explain on AWS documentdb, not working:
rs0:PRIMARY> db.collection.explain().distinct("FIELD_NAME")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "db.collection",
        "winningPlan" : {
            "stage" : "AGGREGATE",
            "inputStage" : {
                "stage" : "HASH_AGGREGATE",
                "inputStage" : {
                    "stage" : "COLLSCAN"
                }
            }
        }
    }
}
Index on both of these instances:
{
    "v" : 1,
    "key" : {
        "FIELD_NAME" : 1
    },
    "name" : "FIELD_INDEX_NAME",
    "ns" : "db.collection"
}
Also, this database has a couple million documents but there are only about 20 distinct values for that "FIELD_NAME". Any help would be appreciated.
I tried it with .hint("index_name") and that didn't work. I tried clearing the plan cache, but I get: Feature not supported: planCacheClear
COLLSCAN and IXSCAN don't differ much in this case; both need to scan all of the documents or all of the index entries.
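If all you need are the ~20 distinct values, a $group aggregation is a common workaround to try in place of distinct(); whether it can take advantage of the index on DocumentDB is something to verify with explain() (a sketch, not tested on DocumentDB):
db.collection.aggregate([
    { $group : { _id : "$FIELD_NAME" } }
])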

Mongodb Index usage slower with regex anchor

I've got a query that's using a regex anchor and it seems to be slower when running an index scan rather than a collection scan.
A bit of background to the question:
I have a MSSQL database that has approximately 2.8 million rows in a table. We were running the following query against the table to return approximately 2.6 million results in 23 seconds:
select * from table where column like 'IL%'
So out of curiosity I decided to see if MongoDB could perform this any faster than my MSSQL database. On a new test server I created a MongoDB database and filled one collection (test1) with just under 3 million objects. Here's the basic structure of a document in the collection:
> db.test1.findOne()
{
    "_id" : 2,
    "Other_REV" : "NULL",
    "Holidex_Code" : "W8BP0",
    "Segment_Name" : "NULL",
    "Source" : "Forecast",
    "Date_" : ISODate("2009-11-12T11:14:00Z"),
    "Rooms_Sold" : 3,
    "FB_REV" : "NULL",
    "Rate_Code" : "ILM87",
    "Export_Date" : ISODate("2010-12-12T11:14:00Z"),
    "Rooms_Rev" : 51
}
All of my records have Rate_Code prefixed with IL and I ran the following query against the database which took just over 3 seconds:
> db.test1.find({'Rate_Code':{$regex: /^IL/}}).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 2999999,
    "nscannedObjects" : 2999999,
    "nscanned" : 2999999,
    "nscannedObjectsAllPlans" : 2999999,
    "nscannedAllPlans" : 2999999,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 3398,
    "indexBounds" : {
    },
    "server" : "MONGODB:27017"
}
Out of curiosity I created an index to see if I could speed up the retrieval at all:
> db.test1.ensureIndex({'Rate_Code':1})
However, this actually appears to slow the query down, to approximately 6 seconds on average:
> db.test1.find({'Rate_Code':{$regex: /^IL/}}).explain()
{
    "cursor" : "BtreeCursor Rate_Code_1",
    "isMultiKey" : false,
    "n" : 2999999,
    "nscannedObjects" : 2999999,
    "nscanned" : 2999999,
    "nscannedObjectsAllPlans" : 2999999,
    "nscannedAllPlans" : 2999999,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 4,
    "nChunkSkips" : 0,
    "millis" : 5895,
    "indexBounds" : {
        "Rate_Code" : [
            [
                "IL",
                "IM"
            ]
        ]
    },
    "server" : "MONGODB:27017"
}
The OS has 2GB of memory and appears to be holding both indexes quite comfortably in memory, with no disk usage recorded when the query is run:
> db.test1.stats()
{
    "ns" : "purify.test1",
    "count" : 2999999,
    "size" : 623999808,
    "avgObjSize" : 208.0000053333351,
    "storageSize" : 790593536,
    "numExtents" : 18,
    "nindexes" : 2,
    "lastExtentSize" : 207732736,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 153218240,
    "indexSizes" : {
        "_id_" : 83722240,
        "Rate_Code_1" : 69496000
    },
    "ok" : 1
}
I'm thinking the slowdown is due to MongoDB performing a full scan of the index followed by a full collection scan, as it can't be sure that all my matches are in the index, but I'm not entirely sure if this is the case. Is there any way this could be improved for better performance?
Thanks for any help.
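One thing that might help when nearly every document matches: if you only need Rate_Code back, a covered query (project only the indexed field and exclude _id) lets MongoDB answer from the index alone and skip fetching each document; check that explain() then reports indexOnly: true (a sketch, not tested against this data set):
db.test1.find(
    { 'Rate_Code' : { $regex : /^IL/ } },
    { 'Rate_Code' : 1, '_id' : 0 }
).explain()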

combining regex and embedded objects in mongodb queries

I am trying to combine regex and embedded-object queries and failing miserably. I am either hitting a limitation of MongoDB or just getting something slightly wrong; maybe someone out there has encountered this. The documentation certainly doesn't cover this case.
data being queried:
{
    "_id" : ObjectId("4f94fe633004c1ef4d892314"),
    "productname" : "lightbulb",
    "availability" : [
        {
            "country" : "USA",
            "storeCode" : "abc-1234"
        },
        {
            "country" : "USA",
            "storeCode" : "xzy-6784"
        },
        {
            "country" : "USA",
            "storeCode" : "abc-3454"
        },
        {
            "country" : "CANADA",
            "storeCode" : "abc-6845"
        }
    ]
}
Assume the collection contains only one record.
This query returns 1:
db.testCol.find({"availability":{"country" : "USA","storeCode":"xzy-6784"}}).count();
This query returns 1:
db.testCol.find({"availability.storeCode":/.*/}).count();
But, this query returns 0:
db.testCol.find({"availability":{"country" : "USA","storeCode":/.*/}}).count();
Does anyone understand why? Is this a bug?
Thanks
You are referencing the embedded storeCode incorrectly: you are referencing it as an embedded object when in fact what you have is an array of objects. Compare these results:
db.testCol.find({"availability.0.storeCode":/x/});
db.testCol.find({"availability.0.storeCode":/a/});
Using your sample doc above, the first one will not return the document, because the first storeCode does not have an x in it ("abc-1234"); the second will return it. That's fine for the case where you are looking at a single element of the array and pass in the position. In order to search all of the objects in the array, you want $elemMatch.
As an example, I added this second example doc:
{
    "_id" : ObjectId("4f94fe633004c1ef4d892315"),
    "productname" : "hammer",
    "availability" : [
        {
            "country" : "USA",
            "storeCode" : "abc-1234"
        }
    ]
}
Now, have a look at the results of these queries:
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/a/}}}).count();
2
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/x/}}}).count();
1
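Applying the same fix to the original query that returned 0, putting the country match and the storeCode regex inside a single $elemMatch makes both conditions apply to the same array element:
db.testCol.find({"availability" : {$elemMatch : {"country" : "USA", "storeCode" : /.*/}}}).count();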