CouchDB MapReduce: query intersection of multiple keys

I'm manipulating documents containing a dictionary of arbitrary metadata, and I would like to search documents based on that metadata.
So far my approach is to build an index from the metadata: for each document, I emit each (key, value) pair of the metadata dictionary.
var metaIndexDoc = {
  _id: '_design/meta_index',
  views: {
    by_meta: {
      map: function(doc) {
        if (doc.meta) {
          for (var k in doc.meta) {
            var v = doc.meta[k];
            emit(k, v);
          }
        }
      }.toString()
    }
  }
};
That way, I can query for all the docs that have a date metadata key, and the results will be sorted by the value associated with date. So far, so good.
Now, I would like to make queries based on multiple criteria: docs that have date AND important in their metadata. From what I've gathered so far, it looks like the way I built my index won't allow that. I could create a new index with ['date', 'important'] as keys, but the number of indexes would grow exponentially as I add new metadata keys.
I also read about using a new kind of document to store the fact that document X has metadata (key,value), which is definitely how you would do it in a relational database, but I would rather have self-contained documents if it is possible.
Right now, I'm thinking about keeping my metaIndex, making one query for date, one for important, and then use underscore.intersection to compute the intersection of both lists.
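Concretely, the plan would look something like this (a sketch only: `db` stands in for a PouchDB/CouchDB client whose `query()` returns a promise, and the `intersection` helper mirrors what underscore's `_.intersection` would do):

```javascript
// Sketch of the client-side intersection approach. Assumes the by_meta view
// from above; db is a hypothetical PouchDB/CouchDB client object.
function intersection(lists) {
  // Keep only the ids present in every list (stand-in for _.intersection).
  return lists.reduce(function (acc, list) {
    return acc.filter(function (id) { return list.indexOf(id) !== -1; });
  });
}

function docsWithMetaKeys(db, keys) {
  // One view query per metadata key; each yields the ids of matching docs.
  var queries = keys.map(function (k) {
    return db.query('meta_index/by_meta', { key: k }).then(function (res) {
      return res.rows.map(function (row) { return row.id; });
    });
  });
  // Intersect the per-key id lists.
  return Promise.all(queries).then(intersection);
}
```

So `docsWithMetaKeys(db, ['date', 'important'])` would resolve to the ids of docs carrying both keys, at the cost of one view query per key plus client-side filtering.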
Am I missing something?
EDIT: after discussion with @alexis, I'm reconsidering the option to create custom indexes when I need them and to let PouchDB manage them. It is still true that the number of possible combinations grows exponentially with the number of metadata fields, but as long as the indexes are created only when they are needed, I guess I'm good to go...


Compare values in dynamodb during query without knowing values in ExpressionAttributeValues

Is it possible to apply a filter based on values inside a dynamodb database?
Let's say the database contains an object info within a table:
info: {
  toDo: x,
  done: y
}
Using ExpressionAttributeValues, is it possible to check whether info.toDo = info.done and apply a filter on it, without knowing the current values of info.toDo and info.done?
At the moment I tried using ExpressionAttributeNames so it contains:
'#toDo': 'info.toDo', '#done': 'info.done'
and the filter FilterExpression is
#toDo = #done
but I'm retrieving no items when doing a query with this filter.
Thanks a lot!
DynamoDB is not designed to perform arbitrary queries as you might be used to in a relational database. It is designed for fast lookups based on keys.
Therefore, if you can add an index that lets you access the records you are looking for, you can use it for this new access pattern. For example, you could add an index that uses info.toDo as the partition key and info.done as the sort key; you can then query the index with the key condition PK = x AND SK = x, assuming the list of possible values is limited and known.
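As an aside, on why the original filter matched nothing: each ExpressionAttributeNames entry must alias a single path component, not a whole dotted path, and the nested path is then spelled out in the expression. A sketch of the corrected parameter shape (the table name is hypothetical):

```javascript
// Hypothetical scan/query parameters comparing two nested attributes.
// Each ExpressionAttributeNames entry aliases one path component;
// the nested paths are written out in the FilterExpression itself.
var params = {
  TableName: 'MyTable',                       // hypothetical table name
  FilterExpression: '#info.#toDo = #info.#done',
  ExpressionAttributeNames: {
    '#info': 'info',
    '#toDo': 'toDo',
    '#done': 'done'
  }
};
```

Mapping '#toDo' directly to 'info.toDo' makes DynamoDB look for a top-level attribute literally named "info.toDo", which is why no items come back.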

DynamoDB fast search on complex data types

I need to create a new table on AWS DynamoDB that will have a structure like the following:
{
  "email": String (key),
  ...: ...,
  "someStuff": SomeType,
  ...: ...,
  "listOfIDs": Array<String>
}
This table contains users' data and a list of strings that I'll often query (see listOfIDs).
I don't want to scan the table every time to find the user linked to a specific ID, because scans are slow, and I cannot create an index on listOfIDs since it's an array and not a "simple" type. How could I improve the structure of my table? Should I use a separate table where I have all my IDs and the users linked to them in a "flat" structure? Is there any other way?
Thank you all!
Perhaps another table that looks like:
ID string / hash key
Email string / range key
Any other attributes you may want to access
The unique combination of ID and email will allow you to search on the "List of IDs". You may want to include other attributes within this table to save you from needing to perform another query.
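A sketch of what a lookup against such a table could look like (low-level AWS API parameter shape; the table and attribute names here are hypothetical):

```javascript
// Hypothetical query against the flattened lookup table:
// partition key ID, sort key Email, one item per (ID, email) pair.
var params = {
  TableName: 'UserByID',                  // hypothetical lookup table name
  KeyConditionExpression: 'ID = :id',     // fetch every user linked to this ID
  ExpressionAttributeValues: {
    ':id': { S: 'some-id-from-listOfIDs' }
  }
};
```

Because ID is the hash key, this is a direct key lookup rather than a scan, and each returned item already carries the linked user's email (and any other attributes you duplicated into the table).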
Should I use a different table where I have all my IDs and the users linked to them in a "flat" structure?
I think this is going to be your best bet if you want to leverage DynamoDB's parallelism for query performance.
Another option might be using a contains() function in a filter expression if your listOfIDs is stored as a set, but I can't imagine that will scale performance-wise as your table grows.

PowerBI and nested 1:N data

I'm trying to leverage the advantages of DocumentDB / Elastic / NoSQL for retrieving and visualizing big data. I want to use PowerBI for that, which is pretty good; however, I have no clue how to model a document which has a 1:N nested data field. E.g.
{
  name: string,
  age: int,
  children: [ { name: string }, ... ]
}
In a normal case, you would flatten the table by expanding the nested values and joining them, but how does one do that when it's 1:N / a list? Is there a way to extract that into its own table?
I've been thinking about making a bridge which translates a document into data tables, but that feels like the wrong way to go, and it introduces further complications with regard to how many endpoints and queries there would have to be.
I can't help but think this is a solved problem, as many places analyse and visualize large amounts of data stored in NoSQL. The alternative is a normalized relational database, but having millions and millions of entries to analyze there also seems wrong when NoSQL is tuned for exactly these scenarios.
If the data is 1:N but not arbitrarily deep, you can use the expand option in the query tab. You will get one row for each instance of customer, carrying all the attributes of the container.
If you want to get more sophisticated, you could normalize the schema by expanding just the customer id column (assuming there is one in your data) into one table, and expanding the customer details into another one, then creating a relationship across them. That makes aggregations easier (like counts of parents). You'd just load the data twice and delete the columns you don't need.

Obtain value from HBase table by key

There is an HBase table with tens of billions of records, where each key is 40 bytes. There is also a list of hundreds of thousands of keys. I need to get all records with these keys and return the value of a certain table field. So, my purpose is to transform a set of keys into a set of values. What is the most convenient and/or efficient way to perform this task (with any programming language and technology)?
You can use the HBase Java API. In Java-like pseudo code:
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "ZOOKEEPER_USED_BY_HBASE");
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("tablename"));
List<Get> gets = new ArrayList<>();
for (byte[] key : keys) {
    gets.add(new Get(key));
}
Result[] results = table.get(gets);
A few more suggestions:
Have a look at the Get javadocs; you can configure it to return only the columns you are interested in.
If the keys share a common prefix, using a Scan with start/stop rows might work as well. Call scan.setCaching(5000) to make it slightly faster if you use it.
I was testing MapReduce on MongoDB to see how efficient it is at grabbing key/value pairs from a collection. It was only a collection of 100k records, but a small JavaScript function was able to retrieve all the countries and the number of times they appeared in the collection.
Map1 = function() {
  emit(this.country, 1);
};

Reduce1 = function(key, vals) {
  var sum = 0;
  for (var i = 0; i < vals.length; i++) {
    sum += vals[i];
  }
  return sum;
};
Then again, I don't know how effective M/R would be with billions of records.

CouchDB - reprocessing view results

I decided to try out CouchDB today, using the ~9GB of Amazon Review data here:
http://snap.stanford.edu/data/web-Movies.html
What I am trying to do is find the least helpful users of all time. The people who have written the largest number of reviews which other people find unhelpful (are they Amazon's greatest trolls? Or just disagreeable? I want to see).
I've written a map function to emit the userId of every review with an unhelpfulness difference of over 5, then a reduce function to sum them, to find out how often each userId appears.
// map function:
function(doc) {
  var unhelpfulness = doc.helpfulness[1] - doc.helpfulness[0];
  if (unhelpfulness > 5) {
    emit(doc.userId, 1);
  }
}

// reduce function:
function(keys, values) {
  return sum(values);
}
This gives me a view of userId : number of unhelpful reviews.
I want to take this output and then reprocess it with more map reduce, to find out who's written the most unhelpful reviews. How do I do this? Can I export a view as another table or something? Or am I just thinking about this problem in the wrong way?
You are on the right track. CouchDB does not allow view results to be sorted by value, but it has list functions that can be used to perform operations on the results of a view. From the CouchDB book:
Just as show functions convert documents to arbitrary output formats, CouchDB list functions allow you to render the output of view queries in any format. The powerful iterator API allows for flexibility to filter and aggregate rows on the fly, as well as output raw transformations for an easy way to make Atom feeds, HTML lists, CSV files, config files, or even just modified JSON.
So we will use a list function to filter and aggregate. In your design document, create a list function like so:
function(head, req) {
  var row;
  var rows = [];
  while ((row = getRow())) {
    rows.push(row);
  }
  rows.sort(function(a, b) { return b.value - a.value; });
  send(JSON.stringify(rows[0]));
}
Now if you query:
/your-database/_design/your-design-doc-name/_list/your-list-name/your-view-name?group=true
you should get back the userId with the most unhelpful reviews. CouchDB makes it easy to find a troll :)
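Putting the pieces together, the whole design document could look roughly like this (a sketch: the design doc, view, and list names are placeholders, and the built-in _sum reduce stands in for the hand-written sum):

```javascript
// Sketch of a complete design document: the map function from the question,
// a built-in reduce, and a list function that keeps the top row by value.
// All names (_design/reviews, unhelpful, top) are placeholders.
var designDoc = {
  _id: '_design/reviews',
  views: {
    unhelpful: {
      map: function (doc) {
        var unhelpfulness = doc.helpfulness[1] - doc.helpfulness[0];
        if (unhelpfulness > 5) {
          emit(doc.userId, 1);
        }
      }.toString(),
      reduce: '_sum'   // built-in reduce, equivalent to summing the values
    }
  },
  lists: {
    top: function (head, req) {
      // Single pass over the rows instead of sorting the whole array.
      var row, best = null;
      while ((row = getRow())) {
        if (!best || row.value > best.value) { best = row; }
      }
      send(JSON.stringify(best));
    }.toString()
  }
};
```

With these placeholder names, the query URL would be /your-database/_design/reviews/_list/top/unhelpful?group=true.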