What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through the AWS DynamoDB docs and, for the life of me, cannot figure out the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved, but that doesn't seem like a groundbreaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?

There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItem requires the full primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItem is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and the sort (range) key, if the table has one. It’s literally like calling GetItem multiple times in a single operation. You don’t have the partial-searching and filtering options of Query, but you’re not limited to a single partition either.

As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
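The two rounding rules quoted above can be checked with a few lines of arithmetic. This is only a sketch of the size calculation as the documentation describes it; the function names are made up for illustration and nothing here calls DynamoDB.

```python
import math

READ_UNIT_KB = 4  # reads are rounded up to the next 4 KB boundary


def round_up_kb(size_kb):
    """Round a size in KB up to the next multiple of 4 KB."""
    return math.ceil(size_kb / READ_UNIT_KB) * READ_UNIT_KB


def batch_get_read_size_kb(item_sizes_kb):
    """BatchGetItem: each item is rounded up separately, then summed."""
    return sum(round_up_kb(size) for size in item_sizes_kb)


def query_read_size_kb(item_sizes_kb):
    """Query: item sizes are summed first, then rounded up once."""
    return round_up_kb(sum(item_sizes_kb))


print(batch_get_read_size_kb([1.5, 6.5]))  # 12 (4 KB + 8 KB)
print(query_read_size_kb([1.5, 6.5]))      # 8
print(round_up_kb(40.8))                   # 44
```

The same two items cost 12 KB of read size through BatchGetItem but only 8 KB through a Query, which is why many small items in one partition are cheaper to read with Query.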
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.

In a nutshell:
BatchGetItem works on tables and uses the full primary key (the hash key, plus the range key if the table has one) to identify the items you want to retrieve. You can get up to 16MB or 100 items in a response.
Query works on tables, local secondary indexes and global secondary indexes. You can get at most 1MB of data in a response. The biggest difference is that Query supports filter expressions, which means that you can request data and DDB will filter it server side for you.
You can probably achieve the same thing with either of these if you really want to, but the rule of thumb is that you do a BatchGet when you need to bulk dump stuff from DDB, and you Query when you need to narrow down what you want to retrieve (and you want Dynamo to do the heavy lifting of filtering the data for you).

DynamoDB stores values in two kinds of keys: a single key, called a partition key, like "jupiter"; or a compound partition and range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune", "planetInfo"]) if you're using partition + range keys. Each item is charged independently and the cost is the same as individual gets; it's just that the results are batched and the call saves time (not money).
A Query, on the other hand, works only inside a single partition key and helps you find items and range keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the total size of the data items read by the query, irrespective of how many there are.
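The pseudocode above maps fairly directly onto the low-level request parameters. Here is a rough sketch that only builds the parameter dictionaries; the table name "planets" and the attribute names "pk" and "sk" are assumptions for illustration, and the real condition function for a prefix match is begins_with rather than startsWith.

```python
def batch_get_request(table, keys):
    """BatchGetItem parameters: every (partition, range) key must be complete."""
    return {
        "RequestItems": {
            table: {
                "Keys": [
                    {"pk": {"S": pk}, "sk": {"S": sk}} for pk, sk in keys
                ]
            }
        }
    }


def count_moons_request(table, planet):
    """Query parameters: exact partition key, prefix match on the range key."""
    return {
        "TableName": table,
        "Select": "COUNT",
        "KeyConditionExpression": "pk = :p AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":p": {"S": planet},
            ":prefix": {"S": "moon"},
        },
    }


def moons_between_request(table, planet, first, last):
    """Query parameters: exact partition key, range key between two values."""
    return {
        "TableName": table,
        "Select": "ALL_ATTRIBUTES",
        "KeyConditionExpression": "pk = :p AND sk BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":p": {"S": planet},
            ":lo": {"S": first},
            ":hi": {"S": last},
        },
    }


batch = batch_get_request(
    "planets",  # hypothetical table name
    [("jupiter", "planetInfo"), ("saturn", "planetInfo"), ("neptune", "planetInfo")],
)
moons = moons_between_request("planets", "jupiter", "moon007", "moon015")
```

With boto3 installed and credentials configured, these dictionaries could be passed as client.batch_get_item(**batch) and client.query(**moons).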

Adding a note on consistency: I originally wrote that Query supports Consistent Reads while BatchGetItem does not. That was wrong.
BatchGetItem can use Consistent Reads through TableKeysAndAttributes.
Thanks #colmlg for the information.

Related

DynamoDB: When does 1MB limit for queries apply

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean, that KeyConditionExpression is applied before this 1MB limit?
Indeed, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria, and you only pay for this matching data and the 1MB read size applies to the matching data. But with FilterExpression the story is different: DynamoDB has no efficient way of filtering out the non-matching items before actually fetching all of it then filtering out the items you don't want. So you pay for reading the entire unfiltered data (before FilterExpression), and the 1MB maximum also corresponds to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: Imagine that you have 1 gigabyte of data in your database to be Scan'ed (or in a single key to be Query'ed), and after filtering, the result will be just 1 kilobyte. Were you to make this query and expect to get the 1 kilobyte back, Dynamo would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and the request would likely time out while you waited for the result. So instead, Dynamo makes sure to return to you after every 1MB of data it reads from disk (and for which you pay ;-)). Control will return to you about 1000 (= 1 gigabyte / 1 MB) times during the long query, and you won't have a chance to time out. Whether a 1MB limit actually makes sense here or it should have been more, I don't know, and maybe we should have had a different limit for the response size and the read amount. But some sort of limit was definitely needed on the read amount, even if it doesn't translate to large responses.
By the way, the Scan documentation includes a slightly differently-worded version of the explanation of the 1MB limit, maybe you will find it clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
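To make the "limit applies before filtering" point concrete, here is a toy simulation in plain Python. It is a simplified model, not DynamoDB's actual paging logic: the paged_read helper and the item sizes are invented for illustration.

```python
ONE_MB = 1024 * 1024


def paged_read(items, keep, page_limit=ONE_MB):
    """Yield (matching_items, bytes_read) for each simulated page.

    items is a list of (item, size_in_bytes) pairs. The page limit counts
    every item read, whether or not it passes the filter.
    """
    page, bytes_read = [], 0
    for item, size in items:
        if bytes_read and bytes_read + size > page_limit:
            yield page, bytes_read
            page, bytes_read = [], 0
        bytes_read += size  # charged before the filter is applied
        if keep(item):
            page.append(item)
    if bytes_read:
        yield page, bytes_read


# 4096 items of 1 KB each (4 MB in total); the filter keeps one item in 1024.
items = [(i, 1024) for i in range(4096)]
pages = list(paged_read(items, keep=lambda i: i % 1024 == 0))

print(len(pages))                                   # 4
print([len(matching) for matching, _ in pages])     # [1, 1, 1, 1]
print(sum(read for _, read in pages) // ONE_MB)     # 4
```

Four round trips and 4 MB of reads are needed to get back four items, which is the behaviour the answer describes.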

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

The Amazon DynamoDB docs stress that uniform distribution of partition keys is the most important point in designing a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow beyond one partition. That is, according to the doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1000 writes per second (for 1 KB items) to grow beyond one partition. But according to my calculation, most small applications don't even need the default 5 writes per second; 1 is enough. (To be precise, you can also grow beyond one partition if your data exceeds 10 GB, but that's a big number too.)
The question becomes more important when you realize that creating any additional index requires allocating additional writes per second.
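The arithmetic can be tried out with a small helper. This is a sketch of the formula as quoted above plus the 10 GB size threshold mentioned in the question; the function name is made up, and treating the result as the larger of the two estimates is an assumption.

```python
import math


def estimated_partitions(read_capacity_units, write_capacity_units, table_size_gb=0):
    """Estimate partitions from the quoted throughput formula and a 10 GB size limit."""
    by_throughput = math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)
    by_size = math.ceil(table_size_gb / 10)
    # Assumption: the table needs enough partitions to satisfy both estimates.
    return max(by_throughput, by_size, 1)


print(estimated_partitions(5, 5))        # 1: the default provisioning
print(estimated_partitions(5, 1001))     # 2: just over 1000 writes per second
print(estimated_partitions(5, 5, 25))    # 3: 25 GB of data
```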
Just imagine, I have some data related to particular user, for example, "posts".
I create "posts" data table and then according to Amazon guidelines I choose the next key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation is requesting a list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of this decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could have used this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? Maybe a partition key is much more efficient than a sort key, even inside one partition?
Addition: you may say "ok, having userId as the partition key for posts is fine for now, but when you have 100000 users in your app you'll run into trouble with scaling". But in reality the trouble can only arise in a "transition" case, when you have only a few partitions, with the posts of a group of active users all in one partition and those of inactive users in another. If you have thousands of users, it's natural that many of them have active posts, the impact of any one user is negligible, and statistically their posts are evenly distributed among many partitions simply because of the large numbers.
I think it's absolutely fine, as long as you make sure you won't exceed the partition limits by increasing RCU/WCU or through growth of your data. Moreover, the best practices guide says:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

DynamoDB: Making range query v/s query each item separately

Let's say I have several items in DynamoDB with the same partition key and different sort keys.
Is there any difference in consumed read capacity units if I query the records using a sort-key constraint in a single go vs. querying each item individually? Assume that around 50 sort keys are fetched at a time. The official documentation says that
One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
From this definition, it doesn't seem that there should be a difference since this definition is independent of how we query the database.
Apart from additional network delay, does the second approach have any other downside?
Please note that the costing is based on Read Capacity Units (RCU) and Write Capacity Units (WCU).
RCU formula:-
RCU = read capacity unit per item × number of reads per second
Before going into the calculation below, work out the item size. You can get the item size from the AWS console.
Go to the DynamoDB table in the AWS console --> Overview tab --> look at the bottom.
Let's talk about RCU. In the above case,
Scenario 1 - Getting all the data in one go using the hash key only:-
In this scenario, the amount of data read per request will be high (i.e. the data of 50 items). Calculate the total size and check how many RCUs are required.
Scenario 2 - Getting the data multiple times using the hash key and sort key:-
In this scenario, the API will be called multiple times, so the number of reads per second will go up. Calculate the number of reads required and check how many RCUs are required.
Compare the RCUs calculated in scenarios 1 and 2. Choose the option that needs fewer RCUs in order to save cost.
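As a rough worked comparison of the two scenarios, assuming 50 items of 400 bytes each (an invented size; use your real item size) and 4 KB taken as 4096 bytes:

```python
import math

READ_UNIT_BYTES = 4096  # assumption: 4 KB per read capacity unit


def rcu_for_read(size_bytes, strongly_consistent=True):
    """Capacity units for one read operation of the given total size."""
    units = math.ceil(size_bytes / READ_UNIT_BYTES)
    return units if strongly_consistent else units / 2


def rcu_single_query(item_sizes, strongly_consistent=True):
    """Scenario 1: one Query; sizes are summed, then rounded up once."""
    return rcu_for_read(sum(item_sizes), strongly_consistent)


def rcu_individual_reads(item_sizes, strongly_consistent=True):
    """Scenario 2: one read per item; each item is rounded up on its own."""
    return sum(rcu_for_read(size, strongly_consistent) for size in item_sizes)


sizes = [400] * 50  # hypothetical: 50 items of 400 bytes
print(rcu_single_query(sizes))       # 5
print(rcu_individual_reads(sizes))   # 50
```

Under these assumptions the single range query is about ten times cheaper, because small items share 4 KB units instead of each being rounded up separately.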

Why is AWS RDS MYSQL INSERT taking READ IOPS?

I have a db.r3.2xlarge with 4000 PIOPS. I'm inserting about 1 billion rows from EC2 instances. There is about 40GB of free RAM right now.
Currently, out of 4000 PIOPS, READ PIOPS is taking 3000 and I'm only getting 1000 WRITE PIOPS. So writing has been slow.
How do I check what is using the READ PIOPS? And how can I speed things up?
Thank you.
Edit:
insert ignore into dna (hash, time, song_id) values (b%s, b%s, %s)
I'm using self.cursor.executemany(query, rows) from python
hash + time + song_id is a composite primary key.
I'm using AWS RDS InnoDB.
I have 4000 PIOPS. However, it is now stuck at 2000 total. I have 60MB/s WRITE THROUGHPUT.
If the hash is your primary key or is indexed, you're not inserting in primary key and/or index order.
Also, you're using INSERT IGNORE, which suggests you are trying to avoid the inevitable duplicate key error because there's duplicate data among what you're inserting.
For both of these reasons, InnoDB has to do a lot of reading to load the appropriate pages from the tablespaces on disk into memory, in order to find the spot(s) in the primary and/or any secondary indexes where the next row needs to go. That may turn out to be wasted effort if the row is a duplicate, and it may require a page split so that space is available to insert the next hash into its proper place.
If hash is the primary key, it would probably be to your advantage to drop all other indexes while inserting, then add them at the end, where they can be built more efficiently.
Pre-sorting the inserts by hash should help somewhat, if the batches are large enough and hash is indeed the primary key.
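A minimal sketch of the pre-sorting idea, assuming the rows are (hash, time, song_id) tuples held in memory and that byte-wise ordering in Python matches the column ordering in MySQL (true for binary columns, not necessarily for other collations). The sorted_batches helper is invented for illustration.

```python
from itertools import islice


def sorted_batches(rows, batch_size):
    """Yield the rows in primary-key order, batch_size rows at a time."""
    ordered = iter(sorted(rows))  # tuples compare by hash, then time, then song_id
    while True:
        batch = list(islice(ordered, batch_size))
        if not batch:
            return
        yield batch


rows = [(b"c3", b"10", 7), (b"a1", b"30", 2), (b"b2", b"20", 5)]
batches = list(sorted_batches(rows, batch_size=2))
print(batches)
```

Each batch would then go to the existing call, self.cursor.executemany(query, batch), so that consecutive inserts land on neighbouring index pages.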

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve (usually the whole time series) time series from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table, with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item: a 'value'. This gives you the ability to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and a rangeKey BETWEEN clause). However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey, and will spread your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take you much longer than 10 seconds to read it, as your single partition will throttle you much more heavily.
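The ISO8601 ordering property and the BETWEEN query from the first design can be sketched as follows. The table name and the attribute names seriesName and ts are assumptions for illustration; the ordering only holds if every timestamp uses the same fixed-width format and the same time zone (UTC here).

```python
from datetime import datetime, timedelta, timezone


def sort_key(moment):
    """Fixed-width UTC ISO8601 string, so string order equals time order."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def range_query_request(table, series, start, end):
    """Query parameters for one time series between two timestamps."""
    return {
        "TableName": table,
        "KeyConditionExpression": "#s = :s AND #t BETWEEN :lo AND :hi",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#s": "seriesName", "#t": "ts"},
        "ExpressionAttributeValues": {
            ":s": {"S": series},
            ":lo": {"S": sort_key(start)},
            ":hi": {"S": sort_key(end)},
        },
    }


base = datetime(2020, 1, 1, tzinfo=timezone.utc)
moments = [base + timedelta(hours=h) for h in (30, 2, 17)]
keys = [sort_key(m) for m in moments]

# Sorting the strings gives the same order as sorting the datetimes.
print(sorted(keys) == [sort_key(m) for m in sorted(moments)])  # True

request = range_query_request("timeseries", "sensor-42", base, base + timedelta(days=1))
```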
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread out your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data in table 10, and all other data somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at the cost of a high degree of complexity in your design. Overall, it's probably not worth it: where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, and it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.
How about dumping each time series into JSON or similar and storing it in S3? At most you'd need a lookup from somewhere like Dynamo.
You may still need Redshift to process your inputs.