Select distinct count cloudant/couchdb - mapreduce

I am starting a project using Cloudant.
It's a simple system for logging, so I can track the usage of my apps.
My documents looks like this:
{
app:'name of the app',
type:'page view | login | etc..',
owner:'email_of_the_user',
device: 'iphone | android | etc..', date:
'yyyy-mm-dd'
}
I've tried to do some map reducing and faceted searches, but couldn't find so far the result for what I want.
I want to count the number of distinct documents grouped by same owner, date (yyyy-mm-dd), and app.
[For example, if a the same guy logs in the app twice or 20 times in the same date, it will be counted only once.
I want to count how many single users used an app each day, no matter what's the type of the log, or the device he used.]
If it was SQL, assuming that each key of the document is a column, I would query something like this:
SELECT app, date, count(*) FROM LOGS group by date, owner, app
ant the result would be something like:
'App1', '2015-06-01', 200
'App1', '2015-06-02', 232
'App2', '2015-06-01', 142
'App2', '2015-06-02', 120
How can I get the same result using Cloudant/CouchDB?

You can do this using design documents, as Cesar mentioned. A concrete example would be to create a view where your map function emits the field on where you want to group on, such as:
function(doc) {
emit(doc.email, 1);
}
Then, you select your desired reduce function (such as _count). When viewing this on Cloudant dashboard, make sure you select Reduce as part of the query options. When accessing the view via URL you need to pass the appropriate parameters (reduce=true&group=true).
The documentation on Views here is pretty thorough: https://docs.cloudant.com/creating_views.html

For what you need there is a feature on couldant/couchdb called design document. You can check their documentation for this feature for details or this guide:
http://guide.couchdb.org/draft/design.html
Cloudant documentation:
https://docs.cloudant.com/design_documents.html
Design documents are similar views on the SQL world.
Regards,

We were able to do this in our project using the Cloudant Java API...
https://github.com/cloudant/java-cloudant
You should be able to get this sort of result by creating a view that has a map function like this...
function(doc) {
emit([doc.app, doc.date, doc.owner], 1);
}
The reduce function should look like this:
function(keys, values, rereduce){
if (rereduce){
return sum(values);
} else {
return sum(values);
}
}
Then we used the following query to get the data we wanted.
Database db = ....
db.view(viewName).startKey(startKeys).endKey(endKeys)
.group(true).includeDocs(false).query(castClass)
We supplied the view name and some start and end keys (since we emitted a compound key and we needed to supply a filter) and then used the group method to get the data back as you need it.
Revised..
With this new emit key in the map function you should get results like this:
{[
{[app1, 2015,06,28, john#somewhere.net], 12}, <- john visited 12 times on that day...
{[app1, 2015,06,29, john#somewhere.net], 10},
{[app1, 2015,06,28, ann#somewhere.net], 1}
]}
If you use good start and end keys, the amount of records you're querying will stay small and the number of records you get back is the unique visitors you are seeking. Note that in this scenario you are getting back a bit more than you want, but it does work.

Related

Querying nested attributes in Amazon DynamoDB

How can I efficiently query on nested attributes in Amazon DynamoDB?
I have a document structure as below, which lets me store related information in the document itself (rather than referencing it).
It makes sense to store the seminars nested in the course, since they will likely be queried alongside the course (they are all course-specific, i.e. a course has many seminars, and a seminar belongs to a course).
In CouchDB, which I’m migrating from, I could write a View that would project some nested attributes for querying. I understand that I can’t project anything that isn’t a top-level attribute into a dynamodb secondary index, so this approach doesn’t seem to work.
This brings me back to the question: how can I efficiently query on nested attributes without scanning, if I can’t use them as keys in an index?
For example, if I want to get average attendance at Nelson Mandela Theatre, how can I query for the values of registrations and attendees in all seminars that have a location of “Nelson Mandela Theatre” without resorting to a scan?
{
“course_id”: “ABC-1234567”,
“course_name”: “Statistics 101”,
“tutors”: [“Cognito-sub-1”, “Cognito-sub-2”],
“seminars”: [
{
“seminar_id”: “XXXYYY-12345”,
“epoch_time”: “123456789”,
“duration”: “5400”,
“location”: “Nelson Mandela Theatre”,
“name”: “How to lie with statistics”,
“registrations”: “92”,
“attendees”: “61”
},
{
“seminar_id”: “BBBCCC-44444”,
“epoch_time”: “155555555”,
“duration”: “5400”,
“location”: “Nelson Mandela Theatre”,
“name”: “Statistical significance for dog owners”,
“registrations”: “244”,
“attendees”: “240”
},
{
“seminar_id”: “XXXAAA-54321”,
“epoch_time”: “223456789”,
“duration”: “4000”,
“location”: “Starbucks”,
“name”: “Is feral cat population growth a leading indicator for the S&P 500?”,
“registrations”: “40”
}
]
}
{
“course_id”: “CJX-5553389”,
“course_name”: “Cat Health 101”,
“tutors”: [“Cognito-sub-4”, “Cognito-sub-9”],
“seminars”: [
{
“seminar_id”: “TTRHJK-43278”,
“epoch_time”: “123456789”,
“duration”: “5400”,
“location”: “Catwoman Hall”,
“name”: “Emotional support octopi for cats”,
“registrations”: “88”,
“attendees”: “87”
},
{
“seminar_id”: “BBBCCC-44444”,
“epoch_time”: “123666789”,
“duration”: “5400”,
“location”: “Nelson Mandela Theatre”,
“name”: “Statistical significance for cat owners”,
“registrations”: “44”,
“attendees”: “44”
}
]
}
Index cannot be created for nested attributes (i.e. document data types in Dynamodb).
Document Types – A document type can represent a complex structure
with nested attributes—such as you would find in a JSON document. The
document types are list and map.
Query Api:-
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
Scan API:-
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
In order to use Query API, the hash key value is required. The OP doesn't have any information that hash key value is available. As per OP, the data needs to be queried by location attribute which is inside the Dynamodb List data type. Now, the option is to look at GSI.
Kindly read more about the GSI. One of the rules is that GSI can be created using top level attributes only. So, the location can't be used to create the index.
So, creating the GSI in order to use Query API has been ruled out as well.
The index key attributes can consist of any top-level String, Number,
or Binary attributes from the base table; other scalar types, document
types, and set types are not allowed.
Because of the above mentioned reasons, the Query API can't be used to get the data based on location attribute assuming hash key value is not available.
If hash key value is available, FilterExpression can be used to filter the data. Only way to filter the data present in the complex list data type is CONTAINS function. In order to use CONTAINS function, all the attributes in the occurrence is required to match the data (i.e. seminar_id, location, duration and all other attributes). So, it is definitely not possible to fulfil the use case mentioned in the OP using the current data model.
Proposed alternate solution:-
Re-modeling the data structure as mentioned below could be an option to resolve the problem. There is definitely no other solution available to fulfil the use case using Query API.
Main Table :-
Course Id - Hash Key
seminar_id - Sort Key
GSI :-
Seminar location - Hash Key
Course Id - Sort Key
In a DynamoDB table, each key value must be unique. However, the key
values in a global secondary index do not need to be unique.
Now, you can use the Query API on GSI to get the data for Seminar location is equal to Nelson Mandela Theatre. You can use the course id in the query api if you know the value. The query api will potentially give multiple items in the result set. You can use FilterExpression if you would like to further filter the data based on some non key attributes.
This is an example from here where you use a filter expression, it is with a scan operation, but maybe you can apply something similar for query instead of scan (take a look at the API):
{
"TableName": "MyTable",
"FilterExpression": "#k_Compatible.#k_RAM = :v_Compatible_RAM",
"ExpressionAttributeNames": {
"#k_Compatible": "Compatible",
"#k_RAM": "RAM"
},
"ExpressionAttributeValues": {
":v_Compatible_RAM": "RAM1"
}
}
You can do one thing to make it working on Scan
Store the object in stringify format like
{
"language": "[{\"language\":\"Male\",\"proficiency\":\"Female\"}]"
}``
and then can perform scan operation
language: {
contains: "Male"
}
on client side you can perform JSON.parse(language)
I have not such experience with DynamoDB yet but started setudying it since I'm planning on use it for my next project.
As far as I could understand from AWS documentation, the answer to your question is: it's not possible to efficiently query on nested attributes.
Looking at Best Practices, spetially Best Practices for Using Secondary Indexes in DynamoDB, it's possible to understand that the right approach should be using diffent line types under the same Partition Key as shown here. Then under the same course_id you would have a generic sorting key(sk). The first register would then have sk = 'Details' with course's data, then other registers like "seminar-1" and it's data, and so on.
You would then set seminar's properties you would like to query as SGI (Secondary Global Index) bearing in mind that it can only have 5 SGI per table.
Hope it helps.
You can use document paths to filter the values. Use seminars.location as the document path.

RealmSwift limiting and fetching last 30 records into tableview

I'm a newbie of using RealmSwift and i'm creating chat like application using swift 3.0 with backend database as RealmSwift. while inserting the chat works good into realm, but the thing when fetch the records
let newChat = uiRealm.objects(Chats.self).filter(
"(from_id == \(signUser!.user_id)
OR from_id == \(selectedList.user_id))
AND (to_id == \(signUser!.user_id)
OR to_id == \(selectedList.user_id))"
).sorted(byProperty: "id", ascending: true)
i don't know how to limit the last 30 records for the chat conversation. In the above code i just fetch the records from "Chat" table with filtering the chat as "SIGNED USERID AND TO USERID". and also if i list all the records(like more than 150 chat conversation) for the particular chat, then scrolling up the records from tableview got stuck up or hang for a while. So please give some idea about how to limit last 30 records and stop hanging the tableview. Thanks in advance
Like I wrote in the Realm documentation, because Realm Results objects are lazily-loaded, it doesn't matter if you query for all of the objects and then simply load the ones you need.
If you want to line it up to a table view, you could create an auxiliary method that maps the last 30 results to a 0-30 index range, which would be easier to then pass straight to the table view's data source:
func chat(atIndex index: Integer) -> Chats {
let mappedIndex = (newChat.count - 30) + index
return newChat[mappedIndex]
}
If you've already successfully queried and started accessing these objects (i.e. the query itself didn't hang), I'm not sure why the table view would hang after the fact. You could try running the Time Profiler in Instruments to track down exactly what's causing the main thread to be blocked.

Django create date list from existing objects

i come with this doubt abobt how to make a linked dates list based on existing objects, first of all i have a model with a DateTimeField which stores the date and hour that the object was added.
I have something like:
|pk|name|date
|1|name 1|2016-08-02 16:14:30.405305
|2|name 2|2016-08-02 16:15:30.405305
|3|name 3|2016-08-03 16:46:29.532976
|4|name 4|2016-08-04 16:46:29.532976
And i have some records with the same day but different hour, what i want is to make a list displaying only the unique days:
2016-08-02
2016-08-03
2016-08-04
And also because i'm using the CBV DayArchiveView i want to add a link to that elements to list them per day with a url pattern like this:
url(r'^archive/(?P<year>[0-9]{4})/(?P<month>[-\w]+)/(?P<day>[0-9]+)/$', ArticleDayArchiveView.as_view(), name="archive_day"),
The truth is that i don't have a clue of how to achieve that, can you help me with that?
Extracting unique dates
instances = YourModel.objects.all()
unique_dates = list(set(map(lambda x: x.date.strftime("%Y-%m-%d"), instances)))
About listing them, your url pattern looks ok. You need to define a view in order to retrieve them and wire up with that url.
UPDATE:
If you want to order them, just:
sorted_dates = sorted(unique_dates)

Couchbase compound key with range filtering & ordering and filtering by second key

I have documents, that look like
{
...
date: '2013-05-25T04:06:20.277Z',
user: 'user1'
...
}
I would need to find documents that have a date within a given range, and a given user. I also need to sort by date.
I have tried emit([doc.user, dateToArray(doc.date)], null) but with this, I cannot sort by date, because AFAIK the key that needs to be sorted on has to be on the left side. Is this correct?
If I try to flip the keys the other way around, no matter what user I put in the startKey & endKey it does not change anything.
for example: startKey: [[0,11,21,0,0,0],"user1"], endKey: [[2016,11,21,0,0,0],"user1"] finds documents from all users, even though I would suppose it to find only documents, where the second key is user1.
How should I do this? My document count can go up to millions, so doing stuff on code-side is out of question..
Atleast for now, I ended up having 2 seperate views (byDate and byUserAndDate)
When I need to find only by date I use the byDate view which has the date as the only key, sorting works fine. Also, when I search by particular user I use the byUserAndDate which has [doc.user, doc.date] as its compound key, when the result contains only items for 1 user, the sort obviously works fine because it will sort by user first, and then by date.

Deduplicaton / matching in Couchdb?

I have documents in couchdb. The schema looks like below:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person as long as they have
email or
personal_blog_url or
telephone
be identical.
I have 3 views created, which basically maps email/blog_url/telephone to userIds and then combines the userIds into the group under the same key, e.g.,
_view/by_email:
----------------------------------
key values
a_email#gmail.com [123, 345]
b_email#gmail.com [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Or whether it is a good practice to do such deduplication in couchdb?
Or what would be a good way to do a deduplication in couch then?
ps. in the finial view, suppose for all dupes, we only keep the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
function (doc) {
if (!doc.email || !doc.personal_blog_url || !doc.telephone) return;
emit([1, doc.email], [doc._id]);
emit([2, doc.personal_blog_url], [doc._id]);
emit([3, doc.telephone], [doc._id]);
}
Merge the lists of id's in reduce
When new doc in changes feed arrives, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If its minimal id is smaller then the changed doc, update the field realId, otherwise update the documents in the list with the changed id.
I suggest using different document to store { userId, realId } relation.
You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key Values
--- ------
[userId, 'phone'] 777-555-1212
[userId, 'email'] username#example.com
[userId, 'url'] favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.