Deduplication / matching in CouchDB? - mapreduce

I have documents in couchdb. The schema looks like below:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person if any one of the following fields is identical:
email or
personal_blog_url or
telephone
I have 3 views created, which basically maps email/blog_url/telephone to userIds and then combines the userIds into the group under the same key, e.g.,
_view/by_email:
----------------------------------
key values
a_email@gmail.com [123, 345]
b_email@gmail.com [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Or whether it is a good practice to do such deduplication in couchdb?
Or what would be a good way to do a deduplication in couch then?
P.S. In the final view, suppose that for all dupes we keep only the smallest userId.
Thanks.

Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
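function (doc) {
  // Emit each field only when it is present, so that a document missing
  // one of the fields is still indexed under the others.
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}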
Merge the lists of ids in the reduce function.
When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id in the merged list is smaller than the id of the changed doc, update the realId field of the changed doc; otherwise update the documents in the list to use the changed doc's id.
I suggest using a separate document to store the { userId, realId } relation.
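One way to turn the grouped view rows into a { userId, realId } relation is to treat every group of ids as "these are the same person" and take connected components, keeping the smallest id of each component. The following is a minimal Python sketch meant to run outside CouchDB (for example in the task that listens to _changes). The function name merge_duplicates and the way the view values are passed in are assumptions for illustration, not part of CouchDB.

```python
def merge_duplicates(groups):
    """Map every userId to the smallest userId among its duplicates.

    `groups` is an iterable of id lists, e.g. the value lists of the
    by_email, by_blog_url and by_telephone views from the question.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Always keep the smaller id as the root, so the root of a
            # component is its smallest userId.
            parent[max(ra, rb)] = min(ra, rb)

    for ids in groups:
        if not ids:
            continue
        find(ids[0])  # register single-element groups too
        for other in ids[1:]:
            union(ids[0], other)

    return {uid: find(uid) for uid in list(parent)}


# Value lists copied from the three example views in the question.
VIEW_GROUPS = [
    [123, 345], [23, 45, 333],        # by_email
    [23, 45], [2, 123, 345],          # by_blog_url
    [2, 123], [45, 1234], [1],        # by_telephone
]

real_ids = merge_duplicates(VIEW_GROUPS)
print(real_ids)
```

With the example data, users 2, 123 and 345 end up with realId 2, users 23, 45, 333 and 1234 with realId 23, and user 1 stays on its own.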

You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key Values
--- ------
[userId, 'phone'] 777-555-1212
[userId, 'email'] username@example.com
[userId, 'url'] favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.
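The startkey/endkey range described above has to be sent as JSON-encoded query parameters. Here is a small Python sketch of building that query string; the helper name user_range_params is made up for illustration, and the view URL it would be appended to is left out because the answer does not name one.

```python
import json
from urllib.parse import urlencode


def user_range_params(user_id):
    """Build the query string selecting every row whose key starts with user_id.

    In CouchDB's collation an object ({}) sorts after strings, so
    [user_id, {}] acts as an upper bound for [user_id, 'phone'],
    [user_id, 'email'] and [user_id, 'url'].
    """
    return urlencode({
        "startkey": json.dumps([user_id]),
        "endkey": json.dumps([user_id, {}]),
    })


print(user_range_params(123))
```

The resulting string is appended after the "?" of the view URL.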

Related

Django inner join with search terms for tables across many-to-many relationships

I have a dynamic hierarchical advanced search interface I created. Basically, it allows you to search terms, from about 6 or 7 tables that are all linked together, that can be and'ed or or'ed together in any combination. The search entries in the form are all compiled into a complex Q expression.
I discovered a problem today. If I provide a search term for a field in a many-to-many related sub-table, the output table can include results from that table that don't match the term.
My problem can be reproduced in the shell with a simple query:
qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
sqs = qs[0].msrun.sample.animal.studies.all()
sqs.count()
#result: 2
Specifically:
In [3]: qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
In [12]: ss = qs[0].msrun.sample.animal.studies.all()
In [13]: ss[0].__dict__
Out[13]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bfbf940>,
'id': 3,
'name': 'Small OBOB',
'description': ''}
In [14]: ss[1].__dict__
Out[14]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bea81f0>,
'id': 1,
'name': 'obob_fasted',
'description': ''}
The ids in sqs queryset include 1 & 3 even though I only searched on 3. I don't get back literally all studies, so it is filtering some un-matching study records. I understand why I see that, but I don't know how to execute a query that treats it like a join I could perform in SQL where I can restrict the results to only include records that match the query, instead of getting back only records in the root model and gathering everything left-joined to those root model records.
Is there a way to do such an inner join (as the result of a single complex Q expression in a filter) on the entire set of linked tables so that I only get back records that match the M:M field search term?
UPDATE:
By looking at the SQL:
In [3]: str(qs.query)
Out[3]: 'SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC'
...I can see that the query is as specific as I would like it to be, but in the template, how do I specify that I want what I would have seen in the SQL result, had I supplied all of the specific related table fields I wanted to see in the output? E.g.:
SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id", "DataRepo_animal_studies"."study_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC
All the Django ORM gives you back from a query whose filters use fields from many-to-many (henceforth "M:M") related tables is a set of records from the "root" table from which the query started. It uses the same join logic you would use in an SQL query, following the foreign keys, so you are guaranteed to get back root table records that DO link to an M:M related table record matching your search term. But when you send the root table records (the queryset) to be rendered in a template and use a nested loop to access records from the M:M related table, you always get everything linked to that root table record, whether it matches your search term or not. So if a root table record links to multiple records in an M:M related table, at least 1 of them is guaranteed to match your search term, but the others may not.
In order to get the inner join (i.e. only including combined records that match search terms in the table(s) queried), I ended up rolling my own, because I could not find a way to make Django do this when the related records are traversed in a template. I accomplished this in the following way:
When rendering search results, wherever you want to include records from an M:M related table, you have to create a nested loop. In the innermost loop, I essentially re-enforce that the record matches all of the search terms. I.e. I implement my own filter using conditionals on each combination of records (from the root table and from each of the M:M related tables). If any record combination does not match, I skip it.
Regarding the particulars as to how I did that, I won't get too into the weeds, but at each nested loop, I maintain a dict of its key path (e.g. msrun__sample__animal__studies) as the key and the current record as the value. I then use a custom simple_tag to re-run the search terms against the current record from each table (by checking the key path of the search term against those available in the dict).
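The idea of re-checking the search term in the innermost loop can be sketched in plain Python, independent of Django. Everything here is a simplified stand-in: the dictionaries replace model instances, the root record name "glucose" is invented, and matching_rows and study_matches are hypothetical names rather than the author's simple_tag.

```python
def matching_rows(peak_groups, study_matches):
    """Return one combined row per (root record, related record) pair
    whose related record passes the search predicate."""
    rows = []
    for peak_group in peak_groups:
        for study in peak_group["studies"]:
            # Re-apply the search term here: the root record is known to
            # link to at least one match, but not every linked study matches.
            if not study_matches(study):
                continue
            rows.append((peak_group["name"], study["id"], study["name"]))
    return rows


# Stand-in for one PeakGroup whose animal belongs to studies 3 and 1,
# as in the shell session from the question.
PEAK_GROUPS = [
    {
        "name": "glucose",  # hypothetical root record name
        "studies": [
            {"id": 3, "name": "Small OBOB"},
            {"id": 1, "name": "obob_fasted"},
        ],
    },
]

rows = matching_rows(PEAK_GROUPS, lambda study: study["id"] == 3)
print(rows)
```

Only the combination involving study 3 survives, which is the row an SQL inner join with the WHERE clause would have produced.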
Alternatively, you could do this in the view by compiling all of the matching records and sending the combined large table to the template - but I opted not to do this because of the hurdles surrounding cached_properties and since I already pass the search term form data to re-display the executed search form, the structure of the search was already available.
Watch out #1 - Even without "refiltering" the combined records in the template, note that the record count can be inaccurate when dealing with a combined/composite/joined table. To ensure that the number of records I report in my header above the table was always correct/accurate (i.e. represented the number of rows in my html table), I kept track of the filtered count and used javascript under the table to update the record count reported at the top.
Watch out #2 - There is one other thing to keep in mind when querying using terms from M:M related tables. For every M:M related table record that matches the search term, you will get back a duplicate record from the root table, and there's no way to tell the difference between them (i.e. which M:M record/field value matched in each case). For example, if you matched 2 records from an M:M related table, you would get back 2 identical root table records, and when you pass that data to the template, you would end up with 4 rows in your results, each with the same data from the root table record. To avoid this, all you have to do is append .distinct() to your results queryset.

django setting filter field with a variable

I show a model of sales that can be aggregated by different fields through a form. Products, clients, categories, etc.
view_by_choice = filter_opts.cleaned_data["view_by_choice"]
sales = sales.values(view_by_choice).annotate(........).order_by(......)
In the same form I have a string input where the user can filter the results. By "product code" for example.
input_code = filter_opts.cleaned_data["filter_code"]
sales = sales.filter(prod_code__icontains=input_code)
What I want to do is filter the queryset "sales" by the input_code, defining the field dynamically from the view_by_choice variable.
Something like:
sales = sales.filter(VARIABLE__icontains=input_code)
Is it possible to do this? Thanks in advance.
You can make use of dictionary unpacking [PEP-448] here:
sales = sales.filter(
**{'{}__icontains'.format(view_by_choice): input_code}
)
Given that view_by_choice contains, for example, 'foo', we first make a dictionary { 'foo__icontains': input_code }, and then we unpack it as named parameters with the two consecutive asterisks (**).
That being said, I strongly advise you to do some validation on view_by_choice: ensure that the set of valid options is limited. Otherwise a user might inject malicious field names, lookups, etc. to extract data from your database that should remain hidden.
For example, if your model has a ForeignKey named owner to the User model, he/she could use owner__email, and thus start trying to find out what emails are in the database by generating a large number of queries and each time looking at what values that query returned.
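A minimal sketch of that validation, written as plain Python so it runs without Django: the field name is checked against an allow-list before the lookup dictionary is built. The names ALLOWED_FIELDS and build_icontains_filter, and the field names other than prod_code, are assumptions for illustration.

```python
# Hypothetical allow-list; in practice it would mirror the choices
# offered by the form's view_by_choice field.
ALLOWED_FIELDS = frozenset({"prod_code", "client", "category"})


def build_icontains_filter(field, value, allowed=ALLOWED_FIELDS):
    """Return keyword arguments for .filter(), rejecting unknown fields."""
    if field not in allowed:
        raise ValueError(f"filtering on {field!r} is not allowed")
    return {f"{field}__icontains": value}


print(build_icontains_filter("prod_code", "AB12"))
```

In the view this would be used as sales.filter(**build_icontains_filter(view_by_choice, input_code)), so a value such as owner__email is rejected before it reaches the ORM.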

Querying nested attributes in Amazon DynamoDB

How can I efficiently query on nested attributes in Amazon DynamoDB?
I have a document structure as below, which lets me store related information in the document itself (rather than referencing it).
It makes sense to store the seminars nested in the course, since they will likely be queried alongside the course (they are all course-specific, i.e. a course has many seminars, and a seminar belongs to a course).
In CouchDB, which I’m migrating from, I could write a View that would project some nested attributes for querying. I understand that I can’t project anything that isn’t a top-level attribute into a dynamodb secondary index, so this approach doesn’t seem to work.
This brings me back to the question: how can I efficiently query on nested attributes without scanning, if I can’t use them as keys in an index?
For example, if I want to get average attendance at Nelson Mandela Theatre, how can I query for the values of registrations and attendees in all seminars that have a location of “Nelson Mandela Theatre” without resorting to a scan?
{
  "course_id": "ABC-1234567",
  "course_name": "Statistics 101",
  "tutors": ["Cognito-sub-1", "Cognito-sub-2"],
  "seminars": [
    {
      "seminar_id": "XXXYYY-12345",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "How to lie with statistics",
      "registrations": "92",
      "attendees": "61"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "155555555",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for dog owners",
      "registrations": "244",
      "attendees": "240"
    },
    {
      "seminar_id": "XXXAAA-54321",
      "epoch_time": "223456789",
      "duration": "4000",
      "location": "Starbucks",
      "name": "Is feral cat population growth a leading indicator for the S&P 500?",
      "registrations": "40"
    }
  ]
}
{
  "course_id": "CJX-5553389",
  "course_name": "Cat Health 101",
  "tutors": ["Cognito-sub-4", "Cognito-sub-9"],
  "seminars": [
    {
      "seminar_id": "TTRHJK-43278",
      "epoch_time": "123456789",
      "duration": "5400",
      "location": "Catwoman Hall",
      "name": "Emotional support octopi for cats",
      "registrations": "88",
      "attendees": "87"
    },
    {
      "seminar_id": "BBBCCC-44444",
      "epoch_time": "123666789",
      "duration": "5400",
      "location": "Nelson Mandela Theatre",
      "name": "Statistical significance for cat owners",
      "registrations": "44",
      "attendees": "44"
    }
  ]
}
An index cannot be created on nested attributes (i.e. the document data types in DynamoDB).
Document Types – A document type can represent a complex structure
with nested attributes—such as you would find in a JSON document. The
document types are list and map.
Query Api:-
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
Scan API:-
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
In order to use the Query API, the hash key value is required. The OP gives no indication that the hash key value is available. According to the OP, the data needs to be queried by the location attribute, which sits inside a DynamoDB List data type. So the next option to look at is a GSI.
Kindly read more about the GSI. One of the rules is that a GSI can be created using top-level attributes only. So location can't be used to create the index.
So, creating the GSI in order to use Query API has been ruled out as well.
The index key attributes can consist of any top-level String, Number,
or Binary attributes from the base table; other scalar types, document
types, and set types are not allowed.
For the reasons mentioned above, the Query API can't be used to get the data based on the location attribute, assuming the hash key value is not available.
If the hash key value is available, a FilterExpression can be used to filter the data. The only way to filter data inside a complex list data type is the CONTAINS function. To use CONTAINS, every attribute of the list element has to match (i.e. seminar_id, location, duration and all other attributes). So it is not possible to fulfil the use case mentioned in the OP using the current data model.
Proposed alternate solution:-
Re-modeling the data structure as mentioned below could be an option to resolve the problem. There is definitely no other solution available to fulfil the use case using Query API.
Main Table :-
Course Id - Hash Key
seminar_id - Sort Key
GSI :-
Seminar location - Hash Key
Course Id - Sort Key
In a DynamoDB table, each key value must be unique. However, the key
values in a global secondary index do not need to be unique.
Now you can use the Query API on the GSI to get the data where the seminar location equals Nelson Mandela Theatre. You can also supply the course id in the query if you know the value. The query will potentially return multiple items in the result set. You can use a FilterExpression if you would like to filter the data further based on non-key attributes.
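To make the proposed re-modeling concrete, here is a hedged Python sketch that flattens a nested course document into one item per seminar (course_id as hash key, seminar_id as sort key) and then computes the average number of attendees for a location. It runs purely in memory: the list comprehension stands in for a Query on the location GSI, no AWS SDK call is made, and flatten_course and average_attendance are names invented for this example. The sample data is an abbreviated copy of the documents in the question.

```python
def flatten_course(course):
    """Turn one nested course document into one item per seminar."""
    items = []
    for seminar in course.get("seminars", []):
        item = dict(seminar)
        item["course_id"] = course["course_id"]  # hash key of the main table
        items.append(item)
    return items


def average_attendance(items, location):
    """Mean attendees over seminars at `location` that recorded attendance."""
    # Stand-in for querying the GSI whose hash key is the seminar location.
    attended = [
        int(item["attendees"])
        for item in items
        if item["location"] == location and "attendees" in item
    ]
    return sum(attended) / len(attended) if attended else None


COURSES = [
    {"course_id": "ABC-1234567", "seminars": [
        {"seminar_id": "XXXYYY-12345", "location": "Nelson Mandela Theatre",
         "registrations": "92", "attendees": "61"},
        {"seminar_id": "BBBCCC-44444", "location": "Nelson Mandela Theatre",
         "registrations": "244", "attendees": "240"},
        {"seminar_id": "XXXAAA-54321", "location": "Starbucks",
         "registrations": "40"},
    ]},
    {"course_id": "CJX-5553389", "seminars": [
        {"seminar_id": "TTRHJK-43278", "location": "Catwoman Hall",
         "registrations": "88", "attendees": "87"},
        {"seminar_id": "BBBCCC-44444", "location": "Nelson Mandela Theatre",
         "registrations": "44", "attendees": "44"},
    ]},
]

items = [item for course in COURSES for item in flatten_course(course)]
print(average_attendance(items, "Nelson Mandela Theatre"))
```

Note that the numeric values are stored as strings in the original documents, so they are converted with int() before averaging.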
This is an example from here where you use a filter expression, it is with a scan operation, but maybe you can apply something similar for query instead of scan (take a look at the API):
{
"TableName": "MyTable",
"FilterExpression": "#k_Compatible.#k_RAM = :v_Compatible_RAM",
"ExpressionAttributeNames": {
"#k_Compatible": "Compatible",
"#k_RAM": "RAM"
},
"ExpressionAttributeValues": {
":v_Compatible_RAM": "RAM1"
}
}
One way to make this work with Scan is to store the object in stringified form, like:
{
"language": "[{\"language\":\"Male\",\"proficiency\":\"Female\"}]"
}
and then can perform scan operation
language: {
contains: "Male"
}
On the client side you can then call JSON.parse(language).
I don't have much experience with DynamoDB yet, but I started studying it since I'm planning to use it for my next project.
As far as I could understand from the AWS documentation, the answer to your question is: it's not possible to efficiently query on nested attributes.
Looking at Best Practices, especially Best Practices for Using Secondary Indexes in DynamoDB, it's possible to understand that the right approach is to use different item types under the same Partition Key, as shown here. Then under the same course_id you would have a generic sort key (sk). The first item would have sk = 'Details' with the course's data, then other items like "seminar-1" with its data, and so on.
You would then set the seminar properties you would like to query as a GSI (Global Secondary Index), bearing in mind that the number of GSIs per table is limited (5 when this was written; the default quota is now 20).
Hope it helps.
You can use document paths to filter the values. Use seminars.location as the document path.

Couchbase compound key with range filtering & ordering and filtering by second key

I have documents, that look like
{
...
date: '2013-05-25T04:06:20.277Z',
user: 'user1'
...
}
I would need to find documents that have a date within a given range, and a given user. I also need to sort by date.
I have tried emit([doc.user, dateToArray(doc.date)], null) but with this, I cannot sort by date, because AFAIK the key that needs to be sorted on has to be on the left side. Is this correct?
If I try to flip the keys the other way around, no matter what user I put in the startKey & endKey it does not change anything.
for example: startKey: [[0,11,21,0,0,0],"user1"], endKey: [[2016,11,21,0,0,0],"user1"] finds documents from all users, even though I would suppose it to find only documents, where the second key is user1.
How should I do this? My document count can go up to millions, so doing this on the code side is out of the question.
At least for now, I ended up having 2 separate views (byDate and byUserAndDate).
When I need to find only by date I use the byDate view which has the date as the only key, sorting works fine. Also, when I search by particular user I use the byUserAndDate which has [doc.user, doc.date] as its compound key, when the result contains only items for 1 user, the sort obviously works fine because it will sort by user first, and then by date.
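Why the order of the key elements matters can be seen with a small Python simulation. Python compares lists element by element from the left, which is used here only as a stand-in for how compound view keys are ordered; the sample rows and the helper name range_query are invented for illustration.

```python
def range_query(keys, startkey, endkey):
    """Return the keys between startkey and endkey in index order."""
    return [key for key in sorted(keys) if startkey <= key <= endkey]


DOCS = [
    ("user1", "2013-05-25T04:06:20.277Z"),
    ("user2", "2014-01-10T00:00:00.000Z"),
    ("user1", "2015-03-01T12:00:00.000Z"),
    ("user2", "2012-07-04T08:30:00.000Z"),
]

# Key layout [user, date]: the user is fixed, so the range only varies by date.
by_user_and_date = [[user, date] for user, date in DOCS]
user1_rows = range_query(by_user_and_date, ["user1", "2013"], ["user1", "2016"])

# Key layout [date, user]: the date is compared first, so the user in
# startkey/endkey only matters when the dates are exactly equal.
by_date_and_user = [[date, user] for user, date in DOCS]
flipped_rows = range_query(by_date_and_user, ["2013", "user1"], ["2016", "user1"])

print(user1_rows)
print(flipped_rows)
```

With [user, date] the result contains only user1's documents, already sorted by date. With [date, user] the same bounds also return user2's 2014 document, which matches the behaviour described in the question.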

Select distinct count cloudant/couchdb

I am starting a project using Cloudant.
It's a simple system for logging, so I can track the usage of my apps.
My documents looks like this:
{
  app: 'name of the app',
  type: 'page view | login | etc..',
  owner: 'email_of_the_user',
  device: 'iphone | android | etc..',
  date: 'yyyy-mm-dd'
}
I've tried to do some map reducing and faceted searches, but couldn't find so far the result for what I want.
I want to count the number of distinct documents grouped by same owner, date (yyyy-mm-dd), and app.
[For example, if the same user logs in to the app twice or 20 times on the same date, it will be counted only once.
I want to count how many single users used an app each day, no matter what's the type of the log, or the device he used.]
If it was SQL, assuming that each key of the document is a column, I would query something like this:
SELECT app, date, COUNT(DISTINCT owner) FROM logs GROUP BY app, date
and the result would be something like:
'App1', '2015-06-01', 200
'App1', '2015-06-02', 232
'App2', '2015-06-01', 142
'App2', '2015-06-02', 120
How can I get the same result using Cloudant/CouchDB?
You can do this using design documents, as Cesar mentioned. A concrete example would be to create a view whose map function emits the field you want to group on, such as:
function(doc) {
  emit(doc.owner, 1);
}
Then, you select your desired reduce function (such as _count). When viewing this on Cloudant dashboard, make sure you select Reduce as part of the query options. When accessing the view via URL you need to pass the appropriate parameters (reduce=true&group=true).
The documentation on Views here is pretty thorough: https://docs.cloudant.com/creating_views.html
For what you need there is a feature in Cloudant/CouchDB called design documents. You can check their documentation for this feature for details, or this guide:
http://guide.couchdb.org/draft/design.html
Cloudant documentation:
https://docs.cloudant.com/design_documents.html
Design documents are similar to views in the SQL world.
Regards,
We were able to do this in our project using the Cloudant Java API...
https://github.com/cloudant/java-cloudant
You should be able to get this sort of result by creating a view that has a map function like this...
function(doc) {
emit([doc.app, doc.date, doc.owner], 1);
}
The reduce function should look like this:
function(keys, values, rereduce) {
  // Summing is correct for both the reduce and the rereduce pass,
  // so no branch on rereduce is needed.
  return sum(values);
}
Then we used the following query to get the data we wanted.
Database db = ....
db.view(viewName).startKey(startKeys).endKey(endKeys)
.group(true).includeDocs(false).query(castClass)
We supplied the view name and some start and end keys (since we emitted a compound key and we needed to supply a filter) and then used the group method to get the data back as you need it.
Revised..
With this new emit key in the map function you should get results like this:
{[
{[app1, 2015,06,28, john@somewhere.net], 12}, <- john visited 12 times on that day...
{[app1, 2015,06,29, john@somewhere.net], 10},
{[app1, 2015,06,28, ann@somewhere.net], 1}
]}
If you use good start and end keys, the amount of records you're querying will stay small and the number of records you get back is the unique visitors you are seeking. Note that in this scenario you are getting back a bit more than you want, but it does work.
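The two steps described above (group by [app, date, owner], then count how many grouped rows each app and date has) can be simulated in plain Python. This is only an in-memory sketch of the logic, not Cloudant code; the sample log entries and the function names are made up for illustration.

```python
from collections import Counter


def map_doc(doc):
    """Mimic the map function: emit([doc.app, doc.date, doc.owner], 1)."""
    return (doc["app"], doc["date"], doc["owner"]), 1


def grouped_reduce(docs):
    """Mimic a sum reduce queried with group=true: one row per distinct key."""
    totals = Counter()
    for doc in docs:
        key, value = map_doc(doc)
        totals[key] += value
    return totals


def unique_users_per_day(grouped_rows):
    """Count grouped rows per (app, date); each row is one distinct owner."""
    per_day = Counter()
    for app, date, _owner in grouped_rows:
        per_day[(app, date)] += 1
    return per_day


LOGS = [
    {"app": "App1", "date": "2015-06-28", "owner": "john@somewhere.net", "type": "login"},
    {"app": "App1", "date": "2015-06-28", "owner": "john@somewhere.net", "type": "page view"},
    {"app": "App1", "date": "2015-06-28", "owner": "ann@somewhere.net", "type": "login"},
    {"app": "App1", "date": "2015-06-29", "owner": "john@somewhere.net", "type": "login"},
]

grouped = grouped_reduce(LOGS)
print(unique_users_per_day(grouped))
```

The value of each grouped row is the number of log entries for that user on that day; the number of rows per app and date is the distinct-user count the question asks for.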