While searching for entries in a MongoDB instance using MongoDB's text index, I seem to receive results which contain any of the words in the input string. So for example, if I search for 'google seo', it returns results for 'google seo', 'google', and 'seo'. I need it to return only results which contain the entire string, or at least both of the words somewhere in the sentence, so results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo' etc. should be returned. Any combination of the words like that would be perfect.

I can currently mitigate the entire issue by just using a MongoDB regex, but that is far slower than the index search, as I have over 250m entries. As a test, index searches took 1.72s on average while the regex searches took over 27.23s. I want the speed of the index searches with even just half the accuracy of the regex searches; if the user can search quicker, it doesn't really matter if the results aren't the most accurate. Programmatically building regex searches that match all words of the input string wherever they appear (e.g. to return results which contain the words 'google' and 'seo' in the same sentence) is also an option, but it is a lot of unnecessary code and still isn't 100% accurate.

The current database schema is as follows:
{
  _id: 0000000000,
  search_string: string,
  difficulty: number,
  clicks: number,
  volume: number,
  keyword: string
}
The backend is a NodeJS server.
Any help is appreciated. Thanks!
Would combining the two approaches (text search and a regex) work?
No playground link since this needs a text index to demonstrate, but consider the following sample documents:
test> db.foo.find()
[
{ _id: 1, val: 'google seo' },
{ _id: 2, val: 'google ' },
{ _id: 3, val: 'seo random ' },
{ _id: 4, val: 'none' }
]
As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):
test> db.foo.find({$text:{$search:'google seo'}})
[
{ _id: 2, val: 'google ' },
{ _id: 1, val: 'google seo' },
{ _id: 3, val: 'seo random ' }
]
If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' }
]
It also works if the words are out of order as we'd expect:
test> db.foo.insert({_id:5, val:'seo out of order google string'})
{ acknowledged: true, insertedIds: { '0': 5 } }
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' },
{ _id: 5, val: 'seo out of order google string' }
]
The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
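For completeness, the equivalent combined query from a NodeJS backend might look roughly like the following. This is only a sketch: it assumes a text index exists on the search_string field from the question's schema, and the database and collection names are placeholders.

// Sketch only: combine the $text index search with an $all regex filter,
// using the official MongoDB Node.js driver. Assumes a text index on
// `search_string` and an already-connected MongoClient in `client`.
const words = ['google', 'seo']; // the user's query split into words
const docs = await client
  .db('mydb')              // placeholder database name
  .collection('searches')  // placeholder collection name
  .find({
    $text: { $search: words.join(' ') },                          // fast candidate selection via the text index
    search_string: { $all: words.map(w => new RegExp(w, 'i')) }   // final "must contain all words" filter
  })
  .toArray();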
Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use case as well (reference).
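If you do go the Atlas Search route, a compound query with must clauses would be one way to require every term. The following is only a sketch and assumes an Atlas Search index (here named 'default') covering the search_string field:

// Sketch only: Atlas Search aggregation that requires both terms.
db.searches.aggregate([
  {
    $search: {
      index: 'default',   // assumed index name
      compound: {
        must: [
          { text: { query: 'google', path: 'search_string' } },
          { text: { query: 'seo', path: 'search_string' } }
        ]
      }
    }
  }
])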
What worked for me
Running any sort of regex on possibly hundreds of thousands of data points will always be very time and resource intensive. Also, doing it natively with MongoDB means that data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
Instead, there are two approaches that can decrease time, server resources, or bandwidth usage.
Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware overhead to perform such an operation on the server end, then it is much better and faster to run string comparisons server side and send the data back in chunks to be lazy loaded into your app.
For me this decreased average search times from over 29.3s to just below 2.23s, with a database of 250m entries, roughly 80k results per search, and around 10k-15k filtered results.
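As a rough illustration of this server-side approach, the filtering can happen while iterating the cursor, with matches streamed back in chunks to be lazy loaded by the client. This is only a sketch; the route, collection name, chunk size and the assumption that db is a connected database handle are all placeholders, not my exact code.

// Sketch only: Express route that filters server side and streams matches in chunks.
app.get('/search', async (req, res) => {
  const words = req.query.q.toLowerCase().split(' ');
  const cursor = db.collection('searches')
    .find({ $text: { $search: req.query.q } }); // cheap candidate selection via the text index

  res.setHeader('Content-Type', 'application/x-ndjson');
  let chunk = [];
  for await (const doc of cursor) {
    const text = doc.search_string.toLowerCase();
    if (words.every(w => text.includes(w))) {   // keep only docs containing every word
      chunk.push(doc);
      if (chunk.length === 500) {               // flush every 500 matches
        res.write(JSON.stringify(chunk) + '\n');
        chunk = [];
      }
    }
  }
  if (chunk.length) res.write(JSON.stringify(chunk) + '\n');
  res.end();
});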
If you don't have the processing overhead and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering just how capable modern hardware is. This provides a bit more flexibility, such that all the data can be shown with the relevant data shown first and the other 'irrelevant' data shown last. This does need to be well optimized and implemented with the supposition that your app would mostly run on modern hardware, or preferably not on mobile devices.
These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any better (faster) solutions.
EDIT
I feel it's kind of necessary to elaborate on what kind of processing the data undergoes before it is sent out and exactly how I do it. Currently I have a database of around 250m entries, each entry having the schema described in the question. The average query is usually something like 'who is putin', 'android', 'iphone 13' etc. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who etc.), so the query is first stripped of those. So if the query was 'who is putin', it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
After that, we send the query to the database and retrieve the results.
The query then undergoes another function which strips 'unnecessary' words, so words like is, if, a, be etc. are also removed, and it returns an array of the major words; a query like 'What is going on between Russia and Ukraine' gets converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we go over each of them to see if they include all the words from the array, and whichever do are returned to the client. It's a pretty basic operation, as we don't care about case, spacing and so on; it simply uses the JS includes() method. Retrieving a query with precisely 2,344,123 results takes around 2.12s cold to return the first results and just over 8.32s cold to finish. Running the same query again reduces times to around 0.84s warm and 1.98s warm to finish (cold for first-time requests, warm for subsequent ones).
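For reference, the two query-processing steps described above might look something like this. The keyword and stop-word lists are abbreviated and the function names are made up; it's just a sketch of the idea.

// Sketch only: pre-processing the user's query before and after hitting the database.
const MAJOR_KEYWORDS = ['what', 'why', 'should', 'how', 'when', 'which', 'who'];
const STOP_WORDS = ['is', 'if', 'a', 'be', 'on', 'and'];   // abbreviated list

// Step 1: pick the collection from the leading major keyword and strip it.
function stripMajorKeyword(query) {
  const words = query.toLowerCase().split(' ');
  const keyword = MAJOR_KEYWORDS.includes(words[0]) ? words[0] : null;
  const rest = keyword ? words.slice(1).join(' ') : query;
  return { keyword, rest };   // 'who is putin' -> { keyword: 'who', rest: 'is putin' }
}

// Step 2: reduce the query to its major words for the final filter.
function majorWords(query) {
  return query
    .toLowerCase()
    .split(' ')
    .filter(w => w && !STOP_WORDS.includes(w) && !MAJOR_KEYWORDS.includes(w));
}

// A result is kept only if it includes every major word.
const keep = (doc, words) =>
  words.every(w => doc.search_string.toLowerCase().includes(w));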
Related
I'm stumped trying to figure out why my DynamoDB scan won't return anything but [ ]. Here are my scan params:
var params = {
  TableName: tableName,
  FilterExpression: "#wager = :wager",
  ExpressionAttributeNames: {
    "#wager": "wager"
  },
  ExpressionAttributeValues: {
    ":wager": wager
  }
};
My DynamoDB table works perfectly when I run a filter expression in the DynamoDB dashboard, like "wager [NUMBER] = 0.001".
jellycsc and Seth Geoghegan already mentioned the two most likely explanations in comments:
First, make sure you do not issue a single Scan operation, but rather loop to get all the pages of the scan result. The specific way to do this depends on which programming language you are using. When your filter leaves only a small subset of the results (e.g., only items where wager is exactly 0.001), remembering to read all the pages is critical, because the first page may be empty: DynamoDB might have read 1MB of items (the default page size) and none of them matched wager=0.001, so an empty first page is returned (a sketch of such a loop follows below).
Second, the wager might have the wrong type. Obviously, if you store numbers but search for a string, nothing will match, so check that you didn't do that. But a more subtle problem can be how you store numbers. DynamoDB holds floating-point numbers in an unusual manner, using decimal, not binary, digits. This means that DynamoDB can hold the number "0.001" precisely, without any rounding errors. The same cannot be said for most programming languages. On my machine, if I set a "double" variable to 0.001, the result is 0.0010000000000000000208. If I pass this to DynamoDB, the equality check will not match! This means you should make sure the wager variable is not a double. In Python, for example, wager should be set to Decimal("0.001"); note how it is constructed from the string "0.001", not from the floating-point 0.001 which already has rounding errors.
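Regarding the first point, in Node.js (which the original poster appears to be using, given the Number() fix below) such a pagination loop might look roughly like this. It is only a sketch; it assumes the params object from the question and an AWS.DynamoDB.DocumentClient instance named docClient.

// Sketch only: read every page of the Scan, not just the first one.
async function scanAll(params) {
  const items = [];
  let lastKey;
  do {
    const page = await docClient
      .scan(lastKey ? { ...params, ExclusiveStartKey: lastKey } : params)
      .promise();
    items.push(...page.Items);        // a page can be empty even though later pages match
    lastKey = page.LastEvaluatedKey;  // undefined once the whole table has been scanned
  } while (lastKey);
  return items;
}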
Thanks for the ideas everyone. It did indeed turn out to be a type issue; all I had to do was cast "wager" as
wager = Number(wager);
ahead of setting the scan params (the same params I have in the question worked).
I currently have a PyMongo collection with around 100,000 documents. I need to perform a regex search on each of these documents, checking each document against around 1,800 values to see if a particular field (which is an array) contains one of the 1,800 strings. After testing a variety of ways of using regex, such as compiling into a regular expression, multiprocessing and multi-threading, the performance is still abysmal, and takes around 30-45 minutes.
The current regex I'm using to find the value at the end of the string is:
rgx = re.compile(string_To_Be_Compared + '$')
And then this is ran using a standard pymongo find query:
coll.find( { 'field' : rgx } )
I was wondering if anyone had any suggestions for querying these values in a more optimal way? Ideally the search to return all the values should take less than 5 minutes. Would the best course of action be to use something like Elasticsearch, or am I missing something basic?
Thanks for your time.
I am new to working with data.
I have a lot of time-based data: a data row for every 15 minutes. Should I compute the data and store aggregates for every 1 hour, 1 day, and 1 month in the database?
If I do, would this schema be good?
{
  _id: "joe",
  name: "Joe Bookreader",
  "time min": [
    {
      time: "1",
      steps: "10"
    },
    {
      time: "2",
      steps: "4"
    }
  ],
  "time day": [
    {
      time: "1",
      steps: "30"
    },
    {
      time: "2",
      steps: "30"
    }
  ]
}
If you have any advice on how I can improve my data modeling knowledge with document databases, I would be really grateful.
For a minute, step away from the programmatic approach to the problem and think about the task at hand.
How are you going to use that data after you have stored it? When you use the data, is it important for you to know the exact number of steps for a particular user, or do you want to see the big picture based on particular sample points in time?
If you care about the per-user perspective, then your schema above will work. On the other hand, if you want to run global reports, like how far along users were on average (or in total) during a certain time, then I would opt for a schema where your document is time (a point in time or a range in time), while user and steps are your properties.
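As a purely illustrative example of what I mean, a time-centric document for a single 15-minute sample could look something like this (field names and values are just placeholders):

{
  _id: "2023-01-01T00:15",   // the sample point / bucket start
  interval: "15min",
  user: "joe",
  steps: 10
}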
Another important concept in databases is not to statically store data that can be calculated on the fly. As with any rule there are some exceptions, like cached values that are short-lived and will not have a major effect on your application if they are incorrect. Another one is reports: you produce a report for the user based on current values and store it; if the user feels like getting fresh data, they re-run the report. (I am sure there are a few others.)
But in most cases, the risk that comes with serving stale/wrong data, and the wrong decisions made based on that data, will outweigh the performance benefit of avoiding extra calculations.
The reason I am mentioning this is that you are storing both time min and time day. If time day can be calculated from time min, you should not store it in the database, but rather calculate it on the fly. You can write queries that produce the actual time day result without using any extra computational power on your application node. All computations will be done on the data node, much more efficiently than on a compute node and without network penalties.
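For instance, with the raw samples stored, a daily total per user could be computed on the fly with an aggregation along these lines. This is only a sketch; it assumes the samples live in a collection called samples, with a BSON date in time and a numeric steps value (rather than the string placeholders in the question's schema).

// Sketch only: compute daily step totals on the data node instead of storing them.
db.samples.aggregate([
  { $group: {
      _id: {
        user: "$user",
        day: { $dateToString: { format: "%Y-%m-%d", date: "$time" } }
      },
      totalSteps: { $sum: "$steps" }
  } },
  { $sort: { "_id.day": 1 } }
])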
I realize this post is a bit old, but I hope my answer will help someone.
I have a number of documents within RavenDB of the form:
{
  "Id": "composite of namespace and video id",
  "Namespace": "youtube",
  "VideoId": "12345678901",
  "Start": "00:00:05"
}
I have a number of documents that reference different segments of the actual thing; in this case, I have multiple documents representing different timestamps within a video.
What I'd like to do is get a count of the distinct number of VideoId instances for a particular Namespace.
At first, I thought I could handle the distinct in the mapping:
from v in docs.Clips.Select(c => new { c.Namespace, c.VideoId }).Distinct()
But that doesn't work, as that query isn't run over the entire document set (so it's impossible to perform a Distinct call here).
I've thought about trying to handle this in the reduce part, but I can't think of an aggregate operation which would group this appropriately.
The shape of the map/reduce structure right now is:
new { Type = "providercount", Key = "youtube", Count = 1 }
As this is part of a multi-map which produces a summary.
How can I produce the count of distinct Namespace/VideoId values with this document structure?
One way to do it might be to group by NameSpace and VideoId. That will get you distinct items. Then you would have to count all of those groups in a TransformResults section. However, I don't recommend doing this with a large number of items. Transformation steps run as part of the query, so performance would be a big problem.
A better approach would be to keep an additional separate document per video (not per clip). For example:
videos/youtube/12345678901
{
  "Title": "whatever",
  "NumberOfClips": 3,
  "Clips": ["clipid1", "clipid2", "clipid3"]
}
I put a few properties in there that might be useful for other purposes, but the main point is that there is only one document per video.
Building these documents could be done in a couple of different ways:
You could write code in your application to add/update the Video documents at the same time you are writing Clip documents.
You could write a map/reduce index for the Clip documents and group by the NameSpace/VideoId, and then use the Indexed Properties Bundle to maintain the Video documents from the results.
Either way, once you have the set of Video documents, you can then do a simple map/reduce on those to get the count of distinct videos.
This page shows an update reaching into a previously retrieved (find) document and querying a subelement (an array) to update it. I pretty much need to do the exact same thing. Code for the example:
> t.find()
{ "_id" : ObjectId("4b97e62bf1d8c7152c9ccb74"), "title" : "ABC",
"comments" : [ { "by" : "joe", "votes" : 3 }, { "by" : "jane", "votes" : 7 } ] }
> t.update( {'comments.by':'joe'}, {$inc:{'comments.$.votes':1}}, false, true )
What are the rules governing find-followed-by-update? I haven't noticed an explanation for this in the documentation. Does the same apply to using MongoDB via the drivers? A link to the relevant semantics would be helpful. I am using the C++ driver.
edit: self answer
The two commands can be rolled into one (and this is one way of removing the ambiguity this question raises): the query part of an update can refer to an array sub-element, and the $ symbol will reference it. I assume you can only reference one sub-element in the query part of an update operation. In my case the update operation looks as follows:
db.qrs.update ( { "_id" : ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id" : 171 }, { "$inc" : { "urls.$.CC": 1} })
The _id correctly "primes" the right unique row, and the second query element "urls.u_id" : 171 assures that the row in question has the right field. urls.$.CC then routes the $inc operation to the correct array entry.
Recommendation to any MongoDB dev or documentation writer
Do not show examples which have potential race conditions in them. Always avoid showing as multiple operations something that can be done atomically.
The rules are relatively straightforward. The results of the update may or may not be available to any subsequent reads depending on a number of things (slaveOk true/false in combination with replica sets, the update and find using different connections, write safety). You can guarantee it to be available if you do a safe write (w >= 1) and perform the find on the same connection. Most drivers offer functionality for this (typically "requestStart" and "requestDone").
All that said, there's a much better solution available to you for this, namely findAndModify. This operation finds a document, updates it, and returns either the old version of the document or the newly updated version. This command is available in the C++ driver. For a reference, look here: http://www.mongodb.org/display/DOCS/findAndModify+Command
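As an illustration in shell syntax (the same fields apply when issuing the command from a driver), the comments example from the question could be updated and read back atomically like this:

// Sketch only: atomically increment joe's votes and get the updated document back.
db.t.findAndModify({
  query: { "comments.by": "joe" },
  update: { $inc: { "comments.$.votes": 1 } },
  new: true   // return the document as it looks after the update
})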
EDIT: Please note that the "find" in the example is only there to show the reader of the documentation what the structure/schema of the documents inside the collection is, to place the subsequent "update" in context. The "update" operation is in no way affected by the "find" before it.