MongoDB find followed by update semantics - C++

This page shows an update reaching into a previously retrieved (find) document and querying a sub-element (array) to update it. I pretty much need to do the exact same thing. Code for the example:
> t.find()
{ "_id" : ObjectId("4b97e62bf1d8c7152c9ccb74"), "title" : "ABC",
"comments" : [ { "by" : "joe", "votes" : 3 }, { "by" : "jane", "votes" : 7 } ] }
> t.update( {'comments.by':'joe'}, {$inc:{'comments.$.votes':1}}, false, true )
What are the rules governing find-followed-by-update? I haven't noticed an explanation for this in the documentation. Does the same apply to use of MongoDB via drivers? A link to the relevant semantics would be helpful. I am using the C++ driver.
Edit: self-answer
The two commands can be rolled into one (which is one way of removing the ambiguity this question raises): the query part of an update can refer to an array sub-element, and the $ symbol will reference it. I assume you can only reference one sub-element in the query part of an update operation. In my case the update operation looks as follows:
db.qrs.update ( { "_id" : ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id" : 171 }, { "$inc" : { "urls.$.CC": 1} })
The _id correctly "primes" the right unique document, and the second query element "urls.u_id" : 171 ensures that the document in question has a matching array element. urls.$.CC then routes the $inc operation to the correct array entry.
Recommendation to any MongoDB dev or documentation writer
Do not show examples which have potential race conditions in them. Always avoid splitting into multiple operations what can be done in a single atomic operation.

The rules are relatively straightforward. The results of the update may or may not be visible to any subsequent reads depending on a number of things (slaveOk true/false in combination with replica sets, the update and find using different connections, write safety). You can guarantee they will be visible if you do a safe write (w >= 1) and perform the find on the same connection. Most drivers offer functionality for this (typically "requestStart" and "requestDone").
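As a rough illustration of that rule (a sketch only, using the Node.js driver rather than the C++ driver, with the collection "t" taken from the example above): an acknowledged write followed by a find on the same client, reading from the primary, will observe the update.

import { MongoClient } from "mongodb";

// Sketch: "safe" write (w >= 1) followed by a read over the same client.
async function readYourWrite(client: MongoClient): Promise<void> {
  const t = client.db("test").collection("t");
  // The write concern makes the call wait for the primary's acknowledgement.
  await t.updateOne(
    { "comments.by": "joe" },
    { $inc: { "comments.$.votes": 1 } },
    { writeConcern: { w: 1 } }
  );
  // The default read preference is the primary, and the write above has
  // already been acknowledged, so this read is guaranteed to see the increment.
  const doc = await t.findOne({ "comments.by": "joe" });
  console.log(doc?.comments);
}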
All that said, there's a much better solution available to you for this, namely findAndModify. This operation finds a document, updates it and returns either the old version of the document or the newly updated version. This command is available in the C++ driver. For a reference, look here: http://www.mongodb.org/display/DOCS/findAndModify+Command
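For the shape of the call, here is a minimal sketch of the same example through the Node.js driver's findOneAndUpdate (the C++ driver exposes the same underlying command; the exact option names and the shape of the result vary by driver version):

import { MongoClient } from "mongodb";

// Sketch: atomically update the matched array element and get the document back.
async function incrementAndFetch(client: MongoClient): Promise<void> {
  const t = client.db("test").collection("t");
  const updated = await t.findOneAndUpdate(
    { "comments.by": "joe" },
    { $inc: { "comments.$.votes": 1 } },
    { returnDocument: "after" } // ask for the post-update version
  );
  // Depending on driver version, the result is the document itself or is
  // wrapped in a { value } envelope.
  console.log(updated);
}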
EDIT: Please note that the "find" in the example is only there to show the reader of the documentation the structure/schema of the documents inside the collection, to place the subsequent "update" in context. The "update" operation is in no way affected by the "find" before it.

Related

MongoDB index-based text searches to match full string

While searching for entries in a MongoDB instance using MongoDB's text indexing feature, I seem to receive results which contain any of the words in the input string. For example, if I search for 'google seo', it returns results for 'google seo', 'google', and 'seo'. I only need it to return results which contain the entire string, or at least both of the words somewhere in the sentence, so results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo' etc. should be returned. Any combination of the above would be perfect.
I can currently mitigate the entire issue by just using a MongoDB regex, but that is way slower than the index search, as I have over 250M entries. As a test, index searches took 1.72s on average whilst the regex searches took over 27.23s. I want the speed of the index searches with even just half the accuracy of the regex searches; if the user can search quicker, it doesn't really matter if the results aren't the most accurate. Programmatically creating regex searches to match all the words in a string wherever they appear in the input (e.g. to return results which contain the words 'google' and 'seo' in the same sentence) also takes a lot of unnecessary code and still isn't 100% accurate.
The current database schema is as follows:
{
  _id: 0000000000,
  search_string: string,
  difficulty: number,
  clicks: number,
  volume: number,
  keyword: string
}
The backend is a NodeJS server.
Any help is appreciated. Thanks!
Would combining the two approaches (text search and a regex) work?
No playground link since this needs a text index to demonstrate, but consider the following sample documents:
test> db.foo.find()
[
{ _id: 1, val: 'google seo' },
{ _id: 2, val: 'google ' },
{ _id: 3, val: 'seo random ' },
{ _id: 4, val: 'none' }
]
As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):
test> db.foo.find({$text:{$search:'google seo'}})
[
{ _id: 2, val: 'google ' },
{ _id: 1, val: 'google seo' },
{ _id: 3, val: 'seo random ' }
]
If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' }
]
It also works if the words are out of order as we'd expect:
test> db.foo.insert({_id:5, val:'seo out of order google string'})
{ acknowledged: true, insertedIds: { '0': 5 } }
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' },
{ _id: 5, val: 'seo out of order google string' }
]
The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
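If the terms come from an arbitrary user search string rather than being hard-coded, the combined predicate can be built programmatically. A sketch using the Node.js driver against the sample data above (the field is "val" here, "search_string" in the question's schema; the escaping helper is my own assumption for literal keyword matching):

import { Collection } from "mongodb";

// Escape regex metacharacters so each word is matched literally.
function escapeRegex(word: string): string {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Text-index search for candidate documents, narrowed so that every word in
// the input must also appear in the indexed field.
async function searchAllWords(coll: Collection, query: string) {
  const words = query.split(/\s+/).filter(Boolean);
  return coll
    .find({
      $text: { $search: query },                                       // fast candidate selection
      val: { $all: words.map(w => new RegExp(escapeRegex(w), "i")) },  // final filtering
    })
    .toArray();
}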
Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use case as well (reference).
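I have not tested the Atlas Search route, but roughly it would look like the aggregation below. This assumes an Atlas Search index (named "default" here) covering the val field, neither of which is part of the original question:

import { Db } from "mongodb";

// Sketch only: "must" acts as a logical AND across the two text clauses, so a
// document has to contain both terms to be returned.
async function atlasSearchBothTerms(db: Db) {
  return db
    .collection("foo")
    .aggregate([
      {
        $search: {
          index: "default",
          compound: {
            must: [
              { text: { query: "google", path: "val" } },
              { text: { query: "seo", path: "val" } },
            ],
          },
        },
      },
    ])
    .toArray();
}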
What worked for me
Running any sort of regex on possibly hundreds of thousands of data points will always be very time- and resource-intensive. Also, doing it natively with MongoDB means that data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
Instead, there are two approaches that can decrease time, server resources or bandwidth usage.
Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware headroom to perform such an operation on the server end, then it is much better and faster to run the string comparisons server-side and send the data back in chunks to be lazy-loaded into your app.
For me this decreased search times from over 29.3s to just below 2.23s on average, with a database of 250M entries, roughly 80k results per search and around 10k-15k filtered results.
If you don't have the processing headroom and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering just how capable modern hardware is. This provides a bit more flexibility, in that all the data can be shown, with the relevant data shown first and the other 'irrelevant' data shown last. This does need to be well optimized and implemented with the supposition that your app will mostly be run on modern hardware, and preferably not on mobile devices.
These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any better (faster) solutions.
EDIT
I feel it's kind of necessary to elaborate on what kind of processing the data undergoes before it is sent out, and exactly how I do it. Currently I have a database of around 250M entries, each entry having the schema described in the question. The average query would usually be something like 'who is putin', 'android', 'iphone 13' etc. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who etc.), so the query is first stripped of those. If the query was 'who is putin', it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
After that we send the query to the database and retrieve the results.
The query then undergoes another function which rids it of 'unnecessary' words, so words like is, if, a, be etc. are also removed, and it returns an array of the major words; a query like 'What is going on between Russia and Ukraine' gets converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we go over each of them to see if they include all the words from the array, and whichever do are returned to the client. It's a pretty basic operation, as we don't care about cases, spaces and so on; it simply uses the JS includes() method. Retrieving a query with precisely 2,344,123 results takes around 2.12s cold to return the first results and just over 8.32s cold to finish. Running the same query again reduces the times to around 0.84s warm and 1.98s warm to finish (cold for the first request, warm for subsequent requests).
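Roughly, that post-processing step looks like the sketch below. The stop-word list, collection handle and chunking callback are placeholders of mine, not the actual production code:

import { Collection, Document } from "mongodb";

// Assumed stop-word list; the real one is larger.
const STOP_WORDS = new Set(["is", "if", "a", "be", "the", "and", "on", "what", "who"]);

function majorWords(query: string): string[] {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter(w => w.length > 0 && !STOP_WORDS.has(w));
}

// Stream the text-index candidates and keep only entries whose search_string
// contains every major word, flushing matches to the client in chunks.
async function filteredSearch(
  keywords: Collection,
  query: string,
  onChunk: (docs: Document[]) => void,
  chunkSize = 500
): Promise<void> {
  const words = majorWords(query);
  let chunk: Document[] = [];
  for await (const doc of keywords.find({ $text: { $search: query } })) {
    const haystack = String(doc.search_string).toLowerCase();
    if (words.every(w => haystack.includes(w))) {
      chunk.push(doc);
      if (chunk.length >= chunkSize) {
        onChunk(chunk); // e.g. write a chunk to the HTTP response
        chunk = [];
      }
    }
  }
  if (chunk.length > 0) onChunk(chunk);
}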

Map-Reduce with a wait

The concept of map-reduce is very familiar. It seems like a great fit for a problem I'm trying to solve, but either it's missing something or I lack enough understanding of the concept.
I have a stream of items, structured as follows:
{
  "jobId": 777,
  "numberOfParts": 5,
  "data": "some data..."
}
I want to do a map-reduce on many such items.
My mapping operation is straightforward - take the jobId.
My reduce operation is irrelevant for this phase, but all we know is that it takes multiple strings (the "some data..." part) and somehow reduces them to a single object.
The only problem is - I need all five parts of this job to complete before I can reduce all the strings into a single object. Every item has a "numberOfParts" property which indicates the number of items I must have before I apply the reduce operation. The items are not ordered, therefore I don't have a "partId" field.
Long story short - I need to apply some kind of waiting mechanism that waits for all parts of the job to complete before initiating the reduce operation, and I need this waiting mechanism to rely on a value that exists within the payload (therefore solutions like Kafka wouldn't work).
Is there a way to do that, hopefully using a single tool/framework?
I only want to write the map/reduce part and the "waiting" logic, the rest I believe should come out of the box.
**** EDIT ****
I'm currently in the design phase of the project and therefore not using any framework (such as Spark, Hadoop, etc.).
I asked this because I wanted to find out the best way to tackle this problem.
"Waiting" is not the correct approach.
Assuming your jobId is the key and data contains some number of parts (zero or more), then you need multiple reducers: one that gathers all parts of the same job, and another that processes only the jobs whose collection of parts has reached numberOfParts, ignoring the others.
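A rough, framework-agnostic sketch of that shape (the types and function names are mine, since no framework has been chosen yet): the first stage only accumulates parts per jobId, and the second stage reduces a job only once its part count has reached numberOfParts.

interface Item {
  jobId: number;
  numberOfParts: number;
  data: string;
}

interface PendingJob {
  numberOfParts: number;
  parts: string[];
}

// Stage 1: group incoming items by jobId (the "map" key) and collect their data.
function gatherParts(items: Item[]): Map<number, PendingJob> {
  const jobs = new Map<number, PendingJob>();
  for (const item of items) {
    const job = jobs.get(item.jobId) ?? { numberOfParts: item.numberOfParts, parts: [] };
    job.parts.push(item.data);
    jobs.set(item.jobId, job);
  }
  return jobs;
}

// Stage 2: reduce only the jobs that are complete; incomplete jobs are simply
// left alone until more parts arrive (no explicit "waiting" needed).
function reduceCompleteJobs<R>(
  jobs: Map<number, PendingJob>,
  reduce: (parts: string[]) => R
): Map<number, R> {
  const results = new Map<number, R>();
  for (const [jobId, job] of jobs) {
    if (job.parts.length >= job.numberOfParts) {
      results.set(jobId, reduce(job.parts));
    }
  }
  return results;
}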

Is there an equivalent of utassert.eqtable (available in utPLSQL version 2.X) in utPLSQL version 3.3?

After going through the documentation of utPLSQL 3.0.2, I couldn't find any reference to the assertion API that was available in the older versions. Please let me know whether there is an equivalent of an assertion like utassert.eqtable available in the newer versions.
I have just recently gone through the same pain. Most utPLSQL examples out there are for utPLSQL v2. It appears that the assertions have been deprecated and replaced by "expects". I found a great blog post by Jacek Gebal that describes this. I've tried to put this and other useful links on a page about how unit testing fits into Redgate's Oracle DevOps pipeline (I work for Redgate and we often get asked how best to implement automated unit testing for Oracle).
I don't think you can compare tables straight away, but you can compare cursors, which is quite flexible, because you can, for instance, set up a cursor with test data based on a dual query, and then check that against the actual data in the table, something like this:
procedure TestCursorExample is
  v_Expected sys_refcursor;
  v_Actual sys_refcursor;
begin
  -- Arrange (Nothing really to arrange, except setting the expectation).
  open v_Expected for
    select 'me#example.com' as Email
    from dual;
  -- Act
  SomeUpsertProc('me', 'me#example.com');
  -- Assert
  open v_Actual for
    select Email
    from Tbl_User
    where UserName = 'me';
  ut.expect(v_Actual).to_equal(v_Expected);
end;
Also, the example above works in Oracle 11, but if you're in 12c, apparently things got even easier, because you can use the table operator with locally defined types.
I've used a similar solution to verify that certain columns of a row were updated, while others were not. You can easily open a cursor for the original data, with some columns replaced by the new fixed values. Then do the update. Then open a cursor with the new actual data of all columns. You still have to write the queries, but it's way more compact than querying everything into variables and comparing those individually.
And, because you can open the 'expected' cursor before doing the actual 'act' step of the test, you can be sure that the query with 'expected' data is not affected by the test itself, and can even base that cursor on the data you are going to modify.
For comparing the data, the cursors are serialized to XML. This may have some side effects. In the test example above, my act step didn't actually do anything, so I got this difference, showing the count as well as showing the missing data.
If your cursors have more columns, and multiple differences, it can sometimes take a second to spot the differences between the XML tags. Also, there are currently some edge-case issues with this, I think because of how trimming works in XML.
1) testcursorexample
Actual: refcursor [ count = 0 ] was expected to equal: refcursor [ count = 1 ]
Diff:
Rows: [ 1 differences ]
Row No. 1 - Missing: <EMAIL>me#example.com</EMAIL>
at "MySchema.MyTestPackage", line 410 ut.expect(v_Actual).to_equal(v_Expected);
See also: 'comparing cursors' from utPLSQL 3 concepts

SAP JCo 3 RFC RSAQ_REMOTE_QUERY_CALL - unexpected results

We're using JCo 3.0 to connect to RFCs and read data from SAP R/3. We use one RFC, RFC_READ_TABLE, often, and a second custom RFC to read employee information. My questions revolve around a third RFC, RSAQ_REMOTE_QUERY_CALL. I'm calling an ad-hoc query I built in SAP using this RFC, but I'm not getting the expected results. The main problem is that SAP appears to be ignoring one of my selection criteria and using what was saved in SAP when I originally built the query. The date criterion stored in my ad-hoc is 6/23/2013. If I pass in 6/28/2013 from JCo, I get the same results as if I had passed 6/23/2013 from JCo.
We have built several ad-hoc queries whose only criteria is a personnel number and call them successfully using RFC RSAQ_REMOTE_QUERY_CALL.
Background on my ad-hoc query: reporting period of today, joining together four aspects of an employee’s information: their latest action (hire, rehire, etc.), organization (e.g. company), pay (e.g. pay scale level) and communication (e.g. email). The query will run every workday.
Here are my questions:
1. My ad-hoc has three selection criteria. The first two are simple strings. The third is a date. The date will vary each time the query runs. We are referencing the first criterion using SP$00001, the second with SP$00002 and the third with SP$00003. The order of the criteria changes from the ad-hoc to SQ01 (what was SP$00001 in the ad-hoc is now SP$00003). Shouldn't we reference them in the order defined in the ad-hoc (e.g. SP$00001)?
2. The two simple string selections are using OPTION "EQ". The date criterion is using OPTION "GT" (greater than). Is "GT" correct?
3. We have some limited accessibility to SAP. Is there a way to see which SP$ parameters are mapped to which criteria?
4. If my ad-hoc was saved with five criteria but four of them never change when I call the ad-hoc from JCo, do I just need to set the value of the one, or do I need to set the other four as well?
5. Do I have to call this ad-hoc using a variant (function.getImportParameterList().setValue("VARIANT", "VARIANT_NAME"))?
6. Does the Reporting Period have an impact on the date criterion? I have tried changing the Reporting Period to PNPBEGDA = today and PNPENDDA = today and noticed no change.
7. Is there a way in SAP to get a "declaration" of your ad-hoc (name, inputs, outputs, criteria)? I have looked at JCoFunction.toXml() and JCoFunctionTemplate. These are good if you want to see something at runtime before it goes to SAP, but I'm looking for something I can use on the JCo end to help me write Java code that matches the ad-hoc.
I have looked at length on the web for answers to my questions and have not found anything that is useful. If there is anything which would help me, please let me know.
Thanks,
LM
Since I don't know much about SQnn, I won't be able to answer all of your questions...
1. I don't know, sorry.
2. It should be, at least it's the usual operator for greater than.
3. Yes - set an external breakpoint right inside the function module and trace its execution while performing the RFC call. Warning: At least basic ABAP knowledge required.
4. I don't know, sorry.
5. I don't know either, sorry.
6. That would depend on the query, I suspect...
7. JCo won't be able to help you out there - it doesn't know about queries, it only knows function modules. There might be other RSAQ_* function modules to get that information though.
I played with setting up a variant in SQ01 for my query. I added some settings in the variant that solved my problem and answered several of my questions in my post. The main thing I did was add a dynamically calculated date as part of my criteria. Here's how:
1. In SQ01, access menu "Go To" -> "Maintain Variants".
2. Choose your variant and in subobjects, choose "Attributes" and click "Change".
3. In the displayed list, find your date criterion.
4. Choose "D" in Selection Variable, choose a comparison option (mine was GT for greater than), and a "Name of a Variable" (really, this is the type of dynamic date calculation you need).
5. Go back to the Subobjects panel, choose "Values" and click "Change".
6. Enter any other criteria you need in the "Program selections" section.
7. Save the variant.
By doing this, I don't need to pass anything into the query from JCo. Also, SAP will automatically update the date criterion you entered in step #4 above.
So, to answer the questions from my original post:
1 and 4. It doesn't matter because I'm no longer passing anything in from JCo.
2. "GT" is Greater Than.
3 and 7. If anyone knows, I'd really like to find out.
5. Use the variant name as it is in SAP (step #2 above).
6. I still don't know, but it's not holding me up.
I'm posting this in case anyone out there needs this type of information. Thanks to Esti and vwegert for helping me out.

Getting count of distinct groupings in RavenDB index

I have a number of documents within RavenDB of the form:
{
  "Id": "composite of namespace and video id",
  "Namespace": "youtube",
  "VideoId": "12345678901",
  "Start": "00:00:05"
}
I have a number of documents that reference different segments of the actual thing; in this case, I have multiple documents representing different timestamps within a video.
What I'd like to do is get a count of the distinct VideoId values for a particular Namespace.
At first, I thought I could handle the distinct in the mapping:
from v in docs.Clips.Select(c => new { c.Namespace, c.VideoId }).Distinct()
But that doesn't work, as that query isn't run over the entire document set (so it's impossible to perform a Distinct call here).
I've thought about trying to handle this in the reduce part, but I can't think of an aggregate operation which would group this appropriately.
The shape of the map/reduce structure right now is:
new { Type = "providercount", Key = "youtube", Count = 1 }
As this is part of a multi-map which produces a summary.
How can I produce the count of distinct Namespace/VideoId values with this document structure?
One way to do it might be to group by NameSpace and VideoId. That will get you distinct items. Then you would have to count all of those groups in a TransformResults section. However, I don't recommend doing this with a large number of items. Transformation steps run as part of the query, so performance would be a big problem.
A better approach would be to keep an additional separate document per video (not per clip). For example:
videos/youtube/12345678901
{
  "Title": "whatever",
  "NumberOfClips": 3,
  "Clips": ["clipid1","clipid2","clipid3"]
}
I put a few properties in there that might be useful for other purposes, but the main point is that there is only one document per video.
Building these documents could be done in a couple of different ways:
You could write code in your application to add/update the Video documents at the same time you are writing Clip documents.
You could write a map/reduce index for the Clip documents and group by the NameSpace/VideoId, and then use the Indexed Properties Bundle to maintain the Video documents from the results.
Either way, once you have the set of Video documents, you can then do a simple map/reduce on those to get the count of distinct videos.
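For illustration, here is a rough sketch of the first option (maintaining the Video document from application code whenever a Clip is written). The question is presumably using the .NET client; this sketch uses the RavenDB Node.js client instead, and the connection details, types and id scheme are assumptions of mine:

import { DocumentStore } from "ravendb";

interface Video {
  Title?: string;
  NumberOfClips: number;
  Clips: string[];
}

// Assumed connection details.
const store = new DocumentStore("http://localhost:8080", "Media");
store.initialize();

// Upsert the per-video document alongside each new Clip document.
// (Concurrent writers would need optimistic concurrency, omitted here.)
async function registerClip(namespace: string, videoId: string, clipId: string): Promise<void> {
  const session = store.openSession();
  const videoDocId = `videos/${namespace}/${videoId}`;
  let video = await session.load<Video>(videoDocId);
  if (!video) {
    video = { NumberOfClips: 0, Clips: [] };
    await session.store(video, videoDocId);
  }
  if (!video.Clips.includes(clipId)) {
    video.Clips.push(clipId);
    video.NumberOfClips = video.Clips.length;
  }
  // The session tracks the loaded/stored document, so this persists the changes.
  await session.saveChanges();
}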