Do values come into a Cloudant reducer in key order? - mapreduce

I'm writing map/reduce code for a database on Cloudant. Do the values come in to the reduce(keys, values, rereduce) function in key order when rereduce=false? I assume they would because that's how I am accustomed to things working in Hadoop, but I can't find anything in the Cloudant documentation that explicitly says they do.

It is not guaranteed that values come into the reduce function in key order when rereduce=false.
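If the reduce logic needs "the value with the greatest key", one option is to compare keys explicitly instead of relying on arrival order. The sketch below is a Python simulation of the reduce(keys, values, rereduce) contract, not Cloudant code. The function name `reduce_latest` and the sample data are hypothetical. It assumes keys arrive as (key, doc id) pairs when rereduce is false, as CouchDB does.

```python
def reduce_latest(keys, values, rereduce):
    """Return the (key, value) pair with the greatest key, whatever the input order."""
    if rereduce:
        # On rereduce, values are (key, value) pairs returned by earlier calls.
        return max(values, key=lambda pair: pair[0])
    # On the first pass, pair each value with its key and compare keys explicitly.
    pairs = [(key, value) for (key, _doc_id), value in zip(keys, values)]
    return max(pairs, key=lambda pair: pair[0])


# Hypothetical input that deliberately arrives out of key order.
sample_keys = [(3, "doc3"), (1, "doc1"), (2, "doc2")]
sample_values = ["c", "a", "b"]
latest = reduce_latest(sample_keys, sample_values, False)
```

Because the result is derived from the keys themselves, it stays the same however the rows are ordered or split across calls.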

Related

Difference between RangeKeyCondition and FilterKeyCondition in aws DynamoDb

I am new to AWS. While reading the docs and the example, I learned that the sort key is used not only to sort the data within a partition but also to narrow the search criteria on a DynamoDB table. But we can do the same with a filter condition, so what is the difference?
Also, according to the example, we can use the sort/range key in withKeyConditionExpression("CreateDate = :v_date and begins_with(IssueId, :v_issue)")
but when I tried it, I got this exception:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Query key condition not supported
Thanks
To limit the items returned, rather than returning all items with a particular HASH key, there are two different ways to handle this.
The ideal way is to build the element we want to query into the RANGE key. This allows us to use Key Expressions to query our data, allowing DynamoDB to quickly find the Items that satisfy our Query.
A second way to handle this is with filtering based on non-key attributes. This is less efficient than Key Expressions but can still be helpful in the right situations. Filter expressions apply server-side filters on Item attributes before the results are returned to the client making the call. Filtering is applied after the DynamoDB Query has completed. If you retrieve 100KB of data in the Query step but filter it down to 1KB of data, you will consume the Read Capacity Units for 100KB of data.
The moral: filtering and projection expressions aren't a magic bullet. They won't make it easy to quickly query your data in additional ways. However, they can save network transfer time by limiting the number and size of items transferred back to your network. They can also simplify application complexity by pre-filtering your results rather than requiring application-side filtering.
From dynamodbguide
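To make the 100KB versus 1KB point concrete, here is a rough sketch of the read capacity arithmetic. It assumes the standard accounting of one read capacity unit per 4KB read with strong consistency and half that with eventual consistency. The function name is hypothetical.

```python
import math


def query_read_capacity_units(kb_read, strongly_consistent=True):
    """Estimate RCUs for a Query based on the data read BEFORE any filter is applied."""
    # The total size read is rounded up to the next 4 KB boundary.
    units = math.ceil(kb_read / 4)
    # Eventually consistent reads cost half as much.
    return units if strongly_consistent else units / 2


# The filter expression shrinks the response to 1 KB, but the charge is for 100 KB.
cost_with_filter = query_read_capacity_units(100)
# If the key condition itself had narrowed the read to 1 KB, the charge would be lower.
cost_with_key_condition = query_read_capacity_units(1)
```

Under these assumptions the filtered query costs 25 units, while a key condition that reads only 1KB costs 1 unit.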

How to perform a range query over AWS dynamoDB

I have an AWS DynamoDB table storing book information; the hash key is the book id. There is an attribute for the book price.
Now I want to perform a query to return all the books whose price is lower than a certain value. How can I do this efficiently, without scanning the whole table?
A query on a secondary index seems able to return only the set of entries where the index has a certain value, so I am confused about how to perform a range query efficiently. Thank you very much!
There are two things you may be confusing: the range key, and a range condition on an attribute.
To clarify, in this case you would need a secondary index, and when querying the index you would specify a key condition (assuming Java and a secondary index on the value; this applies in pretty much any SDK-supported language).
see http://docs.amazonaws.cn/en_us/AWSJavaSDK/latest/javadoc/index.html?com/amazonaws/services/dynamodbv2/model/QueryRequest.html w/ a BETWEEN condition.
You can't do a query of that kind. DynamoDB is sharded across many nodes by hash key, so doing a query without a hash key (on all hash keys) is essentially a full scan.
A hack for your case would be to have a hash key with only one value for the whole table, but this is fundamentally wrong because you lose all the pros of using DynamoDB. See the hot hash key issue for more info: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
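As a middle ground between the two answers, a range condition on price can still be part of a key condition if the index has a hash key with a reasonable spread of values. The sketch below only builds the request parameters in the shape of the DynamoDB Query API. The table name `Books`, the index name `category-price-index`, and the `category` attribute are assumptions for illustration, not part of the original question.

```python
def build_price_range_query(category, max_price):
    """Build Query parameters for a hypothetical index keyed on (category, price)."""
    return {
        "TableName": "Books",                    # assumed table name
        "IndexName": "category-price-index",     # assumed global secondary index
        # Equality on the index hash key, range condition on the index range key.
        "KeyConditionExpression": "#c = :category AND #p < :max_price",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#c": "category", "#p": "price"},
        "ExpressionAttributeValues": {
            ":category": {"S": category},
            ":max_price": {"N": str(max_price)},  # numbers are sent as strings
        },
    }


params = build_price_range_query("fiction", 20)
```

With an SDK such as boto3, a dictionary like this could be passed to the client's query call. The query still targets one hash key value at a time, so it does not turn into a full scan.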

rereduce and group=true in CouchDB

This question was already raised by someone else at https://stackoverflow.com/questions/13338799/does-couchdbs-group-true-prevent-rereduce,
but there is no convincing answer.
group=true is the conceptual equivalent of group_level=exact, so CouchDB runs a reduce per unique key in the map row set.
This is how it is explained in the docs.
It sounds like CouchDB would collect all the values for the same key and only reduce one time per each distinct key.
But in another article, it is said that
If the query is on the reduce value of each key (group_by_key = true),
then CouchDB try to locate the boundary of each key. Since this range
is probably not fitting exactly along the B+Tree node, CouchDB need to
figure out the edge of both ends to locate the partially matched leave
B+Tree node and resend its map result (with that key) to the View
Server. This reduce result will then merge with existing rereduce
result to compute the final reduce result of this key.
It sounds like rereduce may happen when group=true.
In my project, there are many documents, but after grouping there are at most 2 values with the same key for each distinct key.
Will rereduce happen in this case?
Best Regards
Yes. Rereduce is always a possibility.
If this is a problem, there is a rereduce parameter in the reduce function, which allows you to detect if this is happening.
http://docs.couchdb.org/en/latest/couchapp/ddocs.html#reduce-and-rereduce-functions
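A minimal sketch of a reduce that handles both passes, written as a Python simulation of the reduce(keys, values, rereduce) signature rather than as actual CouchDB view code. The function name and sample rows are hypothetical. It assumes keys is null (None here) on the rereduce pass.

```python
def reduce_count(keys, values, rereduce):
    """Count map rows, whether called on raw rows or on partial results."""
    if rereduce:
        # values are counts returned by earlier reduce calls, so add them up.
        return sum(values)
    # values are raw map outputs, so each one counts as a single row.
    return len(values)


# Three rows share one key; simulate them being reduced in two chunks.
rows_for_key = ["v1", "v2", "v3"]
partial_1 = reduce_count([("a", "doc1")], rows_for_key[:1], False)
partial_2 = reduce_count([("a", "doc2"), ("a", "doc3")], rows_for_key[1:], False)
total = reduce_count(None, [partial_1, partial_2], True)
```

The result is the same whether the rows for a key are reduced in one call or merged through a rereduce, which is the property to aim for when rereduce cannot be ruled out.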

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for my use cases? And if so, what kind of query/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
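A small sketch of that conversion, assuming the dates are stored in the YYYY/MM/DD form shown in the question and are interpreted as UTC midnight. The function name is hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_millis(date_text):
    """Convert a 'YYYY/MM/DD' string to milliseconds since the Unix epoch (UTC)."""
    parsed = datetime.strptime(date_text, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


older = to_epoch_millis("2014/02/26")
newer = to_epoch_millis("2014/02/27")
```

Stored as numbers, the values sort in the same order as the dates they represent, so ascending or descending order by range key matches chronological order.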
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSDs they're still relatively quick. You don't get the single digit millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you could query by that key instead and might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key or non-item attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off denormalizing or duplicating some of the data in exchange for fast query performance.

Add Indexes (db_index=True)

I'm reading a book about coding style in Django and one thing they discuss is db_index=True. Ever since I started using Django, I've never used this function because I'm not really sure what it does.
So my question is, when to consider adding indexes?
This is not really Django specific; it has more to do with databases. You add indexes on columns when you want to speed up searches on that column.
Typically, only the primary key is indexed by the database. This means lookups using the primary key are optimized.
If you do a lot of lookups on a secondary column, consider adding an index to that column to speed things up.
Keep in mind, like most problems of scale, these only apply if you have a statistically large number of rows (10,000 is not large).
Additionally, every time you do an insert, indexes need to be updated. So be careful on which column you add indexes.
As always, you can only optimize what you can measure - so use the EXPLAIN statement and your database logs (especially any slow query logs) to find out where indexes can be useful.
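A quick way to see this measuring step in action is SQLite's EXPLAIN QUERY PLAN, sketched below with Python's built-in sqlite3 module. The `author` table, its columns, and the index name are made up for illustration. In Django, setting db_index=True on a field results in a comparable CREATE INDEX in the migration, though the exact SQL depends on the database backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO author (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(connection, sql, params=()):
    """Return SQLite's description of how it would execute the statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT * FROM author WHERE email = ?"
plan_before = query_plan(conn, lookup, ("user42@example.com",))

# Roughly what an index on the email column amounts to at the SQL level.
conn.execute("CREATE INDEX idx_author_email ON author (email)")
plan_after = query_plan(conn, lookup, ("user42@example.com",))

print(plan_before)
print(plan_after)
```

Before the index exists the plan reports a scan of the table; afterwards it reports a search that uses idx_author_email. The exact wording of the plan text varies between SQLite versions.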
The above answer is correct, but in cases where searches are done on varchar columns such as email, you do need to add an index.
The following is one way of doing that:
Index(name='covering_index', fields=['headline'], include=['pub_date'])
reference from https://docs.djangoproject.com/en/3.2/ref/models/indexes/