CouchDB - reprocessing view results - mapreduce

I decided to try out CouchDB today, using the ~9GB of Amazon Review data here:
http://snap.stanford.edu/data/web-Movies.html
What I am trying to do is find the least helpful users of all time. The people who have written the largest number of reviews which other people find unhelpful (are they Amazon's greatest trolls? Or just disagreeable? I want to see).
I've written a map function that emits the userId for every review whose helpfulness difference (doc.helpfulness[1] - doc.helpfulness[0]) is over 5, then a reduce function that sums the emitted values to find out how often each user appears.
// map function:
function(doc) {
  // skip documents without a helpfulness pair
  if (!doc.helpfulness) {
    return;
  }
  var unhelpfulness = doc.helpfulness[1] - doc.helpfulness[0];
  if (unhelpfulness > 5) {
    emit(doc.userId, 1);
  }
}
// reduce function:
function(keys, values) {
  return sum(values);
}
This gives me a view of userId : number of unhelpful reviews.
I want to take this output and then reprocess it with more map reduce, to find out who's written the most unhelpful reviews. How do I do this? Can I export a view as another table or something? Or am I just thinking about this problem in the wrong way?

You are on the right track. CouchDB does not let you sort view results by value, but it has list functions, which can post-process the rows a view returns. From the CouchDB book:
Just as show functions convert documents to arbitrary output formats, CouchDB list functions allow you to render the output of view queries in any format. The powerful iterator API allows for flexibility to filter and aggregate rows on the fly, as well as output raw transformations for an easy way to make Atom feeds, HTML lists, CSV files, config files, or even just modified JSON.
So we will use a list function to filter and aggregate. In your design document, create a list function like so:
function(head, req) {
  var row;
  var rows = [];
  // collect every row the view returns
  while ((row = getRow())) {
    rows.push(row);
  }
  // sort by value, largest first
  rows.sort(function(a, b) { return b.value - a.value; });
  send(JSON.stringify(rows[0]));
}
Now if you query
/your-database/_design/your-design-doc-name/_list/your-list-name/your-view-name?group=true
you should get the userId of the person who has written the most unhelpful reviews, along with the count. CouchDB makes it easy to find a troll :)
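Sorting every row just to read the first one means holding the whole result in memory. A minimal sketch of an alternative is to scan the rows once and keep only the current maximum. The names pickTopRow and makeRowSource are made up for this illustration, and the sample rows are invented. Inside a real list function, getRow would take the place of the stand-in row source, and the scanning loop would have to live inside the list function body.

```javascript
// Scan rows one at a time and remember only the row with the largest value.
// nextRow is any function that returns the next row, or null when done
// (inside a CouchDB list function this role is played by getRow).
function pickTopRow(nextRow) {
  var best = null;
  var row;
  while ((row = nextRow())) {
    if (best === null || row.value > best.value) {
      best = row;
    }
  }
  return best;
}

// Hypothetical stand-in for getRow(), fed from an in-memory array.
function makeRowSource(rows) {
  var i = 0;
  return function () {
    return i < rows.length ? rows[i++] : null;
  };
}

// Invented sample data shaped like grouped view rows.
var sampleRows = [
  { key: "userA", value: 3 },
  { key: "userB", value: 12 },
  { key: "userC", value: 7 }
];

var topRow = pickTopRow(makeRowSource(sampleRows));
console.log(JSON.stringify(topRow)); // {"key":"userB","value":12}
```

The same loop can be extended to keep the top N rows instead of one, if a ranking is wanted rather than a single winner.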

Related

couchdb mapreduce query intersection of multiple keys

I'm manipulating documents containing a dictionary of arbitrary metadata, and I would like to search documents based on that metadata.
So far my approach is to build an index from the metadata. For each document, I insert each (key,value) pair of the metadata dictionary.
var metaIndexDoc = {
_id: '_design/meta_index',
views: {
by_meta: {
map: function(doc) {
if (doc.meta) {
for (var k in doc.meta) {
var v = doc.meta[k];
emit(k,v);
}
}
}.toString()
}
}
};
That way, I can query for all the docs that have a metadata date, and the result will be sorted based on the value associated with date. So far, so good.
Now, I would like to make queries based on multiple criteria: docs that have date AND important in their metadata. From what I've gathered so far, it looks like the way I built my index won't allow that. I could create a new index with ['date', 'important'] as keys, but the number of indexes would grow exponentially as I add new metadata keys.
I also read about using a new kind of document to store the fact that document X has metadata (key,value), which is definitely how you would do it in a relational database, but I would rather have self-contained documents if it is possible.
Right now, I'm thinking about keeping my metaIndex, making one query for date and one for important, and then using underscore.intersection to compute the intersection of both lists.
Am I missing something?
EDIT: after discussion with #alexis, I'm reconsidering the option to create custom indexes when I need them and to let PouchDB manage them. It is still true that with a growing number of metadata fields, the number of possible combinations will grow exponentially, but as long as the indexes are created only when they are needed, I guess I'm good to go...
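The two-query approach described above can be sketched as follows. This is only an illustration: it assumes each query returns rows of the form { id, key, value }, as view results do, and intersectIds is a hypothetical helper name. The sample rows are invented. Intersecting on the document id (rather than on whole rows) matters here, because the two result sets carry different keys and values for the same document.

```javascript
// Return the ids of documents that appear in both result sets.
function intersectIds(rowsA, rowsB) {
  const idsA = new Set(rowsA.map((row) => row.id));
  // A second Set removes duplicates in case a document is listed twice.
  const common = new Set(
    rowsB.map((row) => row.id).filter((id) => idsA.has(id))
  );
  return Array.from(common);
}

// Invented results of querying the by_meta view with key "date" ...
const dateRows = [
  { id: "doc1", key: "date", value: "2014-01-01" },
  { id: "doc2", key: "date", value: "2014-02-01" }
];
// ... and with key "important".
const importantRows = [
  { id: "doc2", key: "important", value: true },
  { id: "doc3", key: "important", value: true }
];

const matches = intersectIds(dateRows, importantRows);
console.log(matches); // [ 'doc2' ]
```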

PowerBI and nested 1:N data

I'm trying to leverage the advantages of DocumentDB / Elastic / NoSQL for retrieving big data and visualizing it. I want to use PowerBI for that, which is pretty good. However, I have no clue how to model a document that has a 1:N nested data field, e.g.:
{
  name: string,
  age: int,
  children: [ { name: string }... ]
}
In a normal case, you would flatten the table by expanding the nested values and joining them, but how does one do that when the relationship is 1:N, i.e. a list? Is there a way to extract that into its own table?
I've been thinking about building a bridge that translates a document into data tables, but that feels like the wrong way to go, and it raises complications about how many endpoints and queries there would need to be.
I can't help but think this is a solved problem, as many places analyse and visualize large amounts of data stored in NoSQL. The alternative is a normalized relational database, but analyzing millions and millions of entries there also seems wrong when NoSQL is tuned for these scenarios.
If the data is 1:N, but not arbitrarily deep, you can use the expand option in the query tab. You will get one row for each instance of customer, and each row has all the attributes of the container.
If you want to get more sophisticated, you could normalize the schema by expanding just the customer id column (assuming there is one in your data) into one table, and expanding the customer details into another one, then creating a relationship across them. That makes aggregations easier (like a count of parents). You'd just load the data twice, and delete the columns you don't need.
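To make the two-table idea concrete, here is a rough sketch of the shape the data ends up in, written outside PowerBI. It assumes each document has an id field (the answer above makes the same assumption), and toTables and parentId are names invented for this illustration. The sample documents are made up.

```javascript
// Split documents with a nested 1:N list into two flat tables
// that can be related through parentId.
function toTables(docs) {
  const parents = [];
  const children = [];
  for (const doc of docs) {
    parents.push({ id: doc.id, name: doc.name, age: doc.age });
    for (const child of doc.children || []) {
      children.push({ parentId: doc.id, name: child.name });
    }
  }
  return { parents, children };
}

// Invented sample documents following the schema in the question.
const docs = [
  { id: 1, name: "Ann", age: 40, children: [{ name: "Bo" }, { name: "Cy" }] },
  { id: 2, name: "Dan", age: 35, children: [] }
];

const tables = toTables(docs);
console.log(tables.parents.length, tables.children.length); // 2 2
```

Counting parents is then a row count on the first table, and counting children per parent is a group-by on parentId in the second.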

CouchDB reduce error?

I have a sample database in CouchDB with the information of a number of aircraft, and a view which shows the manufacturer as key and the model as the value.
The map function is
function(doc) {
emit(doc["Manufacturer"], doc._id)
}
and the reduce function is
function(keys, values, rereduce){
return values.length;
}
This is pretty simple. And I indeed get the correct result when I show the view using Futon, where I have 26 aircraft of Boeing:
"BOEING" 26
But if I use a REST client to query the view using
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"
I get
{"rows":[
{"key":null,"value":2}
]}
I have tested different clients (including a web browser, REST client extensions, and curl), and all of them give me the value 2, while queries with other keys work correctly.
Is there something wrong with the MapReduce function or my query?
The issue could be caused by grouping.
Using group=true (which is Futon's default), you get a separate reduce value for each unique key in the map - that is, all values which share the same key are grouped together and reduced to a single value.
Were you passing group=true as a query parameter when querying with curl etc.? Since Futon passes it by default, you saw results like
BOEING : 26
whereas without group=true only the overall reduced value is returned.
So try this query:
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"&group=true
You seem to be falling into the re-reduce trap. Strictly speaking, CouchDB uses a map-reduce-rereduce process.
Map: reformats your data into the output format.
Reduce: aggregates the data of several entries with the same key (but not necessarily all of them) - which works correctly in your case.
Re-reduce: does the same as reduce, but on previously reduced data.
Because your reduce returns values.length, the re-reduce call counts how many already reduced values it receives instead of adding them up, which would explain the 2.
Solutions:
You can set the value in the map to 1 and have the reduce return the sum of the values.
You can check for rereduce == true and in that case return the sum of the values, which will be the integer counts returned by the initial reduce calls.
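The second solution could look roughly like the sketch below. The function names and the split of 26 rows into chunks of 20 and 6 are assumptions made for the demonstration; how CouchDB actually partitions rows is up to its B-tree. Inside a design document the reduce would be an anonymous function, and CouchDB's sum helper could replace the manual loop.

```javascript
// Rereduce-aware count: count raw values on the first pass,
// add up the partial counts on later passes.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    // values holds counts returned by earlier reduce calls
    return values.reduce(function (total, n) { return total + n; }, 0);
  }
  // values holds the raw values emitted by the map function
  return values.length;
}

// The reduce from the question, kept for comparison.
function naiveReduce(keys, values, rereduce) {
  return values.length;
}

// Hypothetical split of 26 emitted rows into two chunks.
var chunkA = new Array(20).fill("some-doc-id");
var chunkB = new Array(6).fill("some-doc-id");

var fixed = countReduce(
  null,
  [countReduce(null, chunkA, false), countReduce(null, chunkB, false)],
  true
);
var naive = naiveReduce(
  null,
  [naiveReduce(null, chunkA, false), naiveReduce(null, chunkB, false)],
  true
);
console.log(fixed, naive); // 26 2
```

With this split, the naive version returns 2 because it counts the two partial results rather than summing them.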

Overcoming querying limitations in Couchbase

We recently made a shift from relational (MySQL) to NoSQL (Couchbase). Basically it's a back-end for a social mobile game. We were facing a lot of problems scaling our backend to handle an increasing number of users. When using MySQL, loading a user took a lot of time as there were a lot of joins between multiple tables. We saw a huge improvement after moving to Couchbase, especially when loading data, as most of it is kept in a single document.
On the downside, Couchbase also seems to have a lot of limitations as far as querying is concerned. Couchbase's alternative to SQL queries is views. While we managed to handle most of our queries using map-reduce, we are really having a hard time figuring out how to handle time-based queries, e.g. we need to filter users based on a timestamp attribute. We only need a user in the view if its time is less than the current time:
if(user.time < new Date().getTime() / 1000)
What happens is that once a user's time is set to some future time, the user is excluded from this view, which is the desired behavior, but it never gets added back to the view unless we update it - a document only gets re-indexed in a view when it's updated.
Our solution right now is to load the first x user documents and then check the time in our application. Sorting is done on the user.time attribute, so we get those users whose time is less than or near the current time. But I am not sure if this is actually going to work in a live environment. Ideally we would like to avoid these types of checks at the application level.
Also, there are times, e.g. matchmaking, when we need to check multiple time-based attributes. Our current strategy doesn't work in such cases, and we frequently get documents from the view which do not pass these checks when done in the application. I would really appreciate it if someone who has already tackled similar problems could share their experience. Thanks in advance.
Update:
We tried using range queries, which work for only one key. Like I said, in most cases we have multiple time-based keys, meaning multiple ranges, which does not work.
If you use Date().getTime() inside a view function, you'll always get the time at which the document was indexed, not the time of the query. That is why, just as you said, "it never gets added back to view unless we update it".
There are two ways:
Bad way (don't do this in production): query views with the stale=false param. That forces the view to update before it returns results. But view indexing is a slow process, especially if you have more than 1 million records.
Good way: use range requests. You just need to emit your date in the map function as a key, or as part of a complex key, and use a range request on it. You can see one example here or here (also, if you want to use DateTime in Couchbase, this example will be more useful). Or just look at my example below:
For example, suppose you have docs like:
doc = {
  "id": 1,
  "type": "doctype",
  "timestamp": 123456, // document update or creation time
  "data": "lalala"
}
For those docs the map function will look like:
map = function (doc, meta) {
  if (doc.type === "doctype") {
    emit(doc.timestamp, null);
  }
}
And now to get recently "updated" docs you need to query this view with params:
startKey="dateTimeNowFromApp"
endKey="{}"
descending=true
Note that startKey and endKey are swapped, because I used descending order. Here is also a link to the documentation about the key types that Couchbase supports.
Also, I've found a link to a question that may help.
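The reason range requests solve the "never gets added back" problem is that the index stores only the timestamp, and the comparison against the current time happens when the query runs. The sketch below simulates that with plain arrays. It is not the Couchbase API: buildIndex and queryRange are hypothetical names, the sample docs are invented, and the range is ascending for simplicity.

```javascript
// Simulate a view keyed on doc.timestamp: rows sorted by key.
function buildIndex(docs) {
  return docs
    .filter((doc) => doc.type === "doctype")
    .map((doc) => ({ key: doc.timestamp, id: doc.id }))
    .sort((a, b) => a.key - b.key);
}

// Simulate a range request: keep rows with startKey <= key <= endKey.
function queryRange(index, startKey, endKey) {
  return index.filter((row) => row.key >= startKey && row.key <= endKey);
}

// Invented sample documents.
const sampleDocs = [
  { id: 1, type: "doctype", timestamp: 100 },
  { id: 2, type: "doctype", timestamp: 200 },
  { id: 3, type: "doctype", timestamp: 300 },
  { id: 4, type: "other", timestamp: 150 }
];

// The index is built once and never touched again below.
const index = buildIndex(sampleDocs);

// "Now" is passed in by the application at query time.
const early = queryRange(index, 0, 250).map((row) => row.id);
const later = queryRange(index, 0, 350).map((row) => row.id);
console.log(early, later); // [ 1, 2 ] [ 1, 2, 3 ]
```

Document 3 shows up in the second query without being updated or re-indexed, simply because the upper bound supplied by the application moved forward.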

Custom Date Aggregate Function

I want to sort my Store models by their opening times. The Store model has an is_open function which checks the store's opening time ranges and returns a boolean indicating whether it's open. The problem is that I don't want to sort my queryset manually, for efficiency reasons. I thought that if I wrote a custom annotate function, I could filter the query more efficiently.
So I googled and found that I can extend Django's aggregate class. From what I understood, I have to use predefined SQL functions like MAX, AVG etc. The thing is, I want to check whether today's date is in a given list of time intervals. Can anyone tell me which SQL function name I should use?
Edit
I'd like to put the code here, but it's really spaghetti. It's a page-long piece of code that only generates time intervals and checks for the suitable one.
I want to avoid:
alg = lambda s: not (s.is_open() and s.reachable)
sorted(stores, key=alg)
and replace it with:
Store.objects.annotate(is_open = CheckOpen(datetime.today())).order_by('is_open')
But I'm totally lost at how to write CheckOpen...
Have a look at the docs for extra().