Is it possible to increase item size in DynamoDB?

After doing some research I found that the maximum size of an item (one row in a table) is 400 KB.
Source of research: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-items
I want to insert text data that is over 1 MB in size. This is basically one row's worth of data.
For example,
I have a table called users which contains a summary of each user.
The summary is a text field (String), and I want to insert data over 1 MB, but DynamoDB allows only 400 KB.
Note
I cannot store this in a file and keep a pointer

You cannot store more than 400 KB worth of data in one DynamoDB item (or record).
The approach in the link you shared in your comment requires you to break larger records into multiple items and handle the merges in your application layer. DynamoDB does not support this transparently.

Just want to note that a tactical way to store items larger than 400 KB in DynamoDB tables is to use a compression algorithm and save the compressed data as a binary attribute.
Clients need to compress and decompress the data themselves.
More information here and here.
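For illustration, here is a minimal sketch of the compress-on-write approach using boto3 and gzip. The table name ("users"), key ("user_id"), and attribute ("summary_gz") are placeholders, and the compressed value must still fit under the 400 KB item limit.

```python
# A sketch of compressing a large text attribute before writing it to DynamoDB.
# Table, key, and attribute names are illustrative assumptions.
import gzip

import boto3

table = boto3.resource("dynamodb").Table("users")

def put_summary(user_id, summary_text):
    # Compress the large text and store it as a binary (B) attribute.
    compressed = gzip.compress(summary_text.encode("utf-8"))
    table.put_item(Item={"user_id": user_id, "summary_gz": compressed})

def get_summary(user_id):
    item = table.get_item(Key={"user_id": user_id}).get("Item")
    if item is None:
        return None
    # The client is responsible for decompressing on read.
    return gzip.decompress(item["summary_gz"].value).decode("utf-8")
```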

As the link mentions, you can break the text summary down into smaller chunks, say of size w, and store them under indexes 1...n. The value of n can be kept in another table.
An additional advantage is that you can fetch the summary chunks in parallel, as in the sketch below.
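A minimal sketch of that chunking approach, assuming a table named "user_summaries" with partition key "user_id" and a numeric "chunk" sort key (all names are illustrative; the chunk count could be kept in another table to drive parallel fetches, as suggested above):

```python
# Split a large text into chunks that each fit under the 400 KB item limit,
# store each chunk as its own item, and reassemble them on read.
import boto3
from boto3.dynamodb.conditions import Key

CHUNK_SIZE = 350_000  # bytes, kept safely below the 400 KB item limit

table = boto3.resource("dynamodb").Table("user_summaries")

def put_summary(user_id, summary_text):
    data = summary_text.encode("utf-8")
    with table.batch_writer() as batch:
        for offset in range(0, len(data), CHUNK_SIZE):
            batch.put_item(Item={
                "user_id": user_id,
                "chunk": offset // CHUNK_SIZE,
                "body": data[offset:offset + CHUNK_SIZE],
            })

def get_summary(user_id):
    # A single Query on the partition key returns the chunks in sort-key order;
    # with a known chunk count they could also be fetched in parallel.
    # A very large summary may need LastEvaluatedKey paging as well.
    resp = table.query(KeyConditionExpression=Key("user_id").eq(user_id))
    return b"".join(item["body"].value for item in resp["Items"]).decode("utf-8")
```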

Related

Do projection fields improve query performance in DynamoDB?

I use DynamoDB to store user data. Each user has many fields like age, gender, first/last name, address, etc. I need to support a query API that returns only the first, last, and middle name, without the other fields.
To get better performance, I see two solutions:
Create a GSI which only includes those query fields. It will make each row very small.
Query the table with a projection-fields parameter that includes only those query fields.
The item size is 1 KB with 20 attributes, and 1 MB is the maximum data returned from one query, so I should receive about 1024 items from querying the main index. If I use field projection to reduce the number of fields, will I get more items in the response?
Given that DynamoDB returns at most 1 MB of data per query, which solution is better for me to use?
What you are trying to achieve with the GSI option is called a "sparse index".
It's hard to give a definitive answer without knowing the table's traffic pattern and the historical amount of data. Another consideration is the amount of RCU (read capacity units) consumed by the operation.
FilterExpression is applied after a Query finishes, but before the results are returned.
Link to Documentation
With that in mind, the amount of RCU used by the FilterExpression solution grows with the size of the items and the number of fields they have.
Your costs increase over time, and you have to keep worrying about item size and field count.
A review of how RCU works:
DynamoDB read requests can be either strongly consistent, eventually consistent, or transactional.
A strongly consistent read request of an item up to 4 KB requires one read request unit.
An eventually consistent read request of an item up to 4 KB requires one-half read request unit.
A transactional read request of an item up to 4 KB requires two read request units.
Link to documentation
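To make the arithmetic concrete, here is a back-of-the-envelope sketch using the numbers from the question (1 KB items, up to 1 MB read per Query page). Note that projection and filter expressions don't reduce this cost, since it is based on the items read, not on the data returned.

```python
# Rough RCU cost per Query page for 1 KB items and ~1 MB read per page.
import math

ITEM_SIZE_KB = 1
PAGE_LIMIT_KB = 1024                              # a Query page stops at ~1 MB read
items_per_page = PAGE_LIMIT_KB // ITEM_SIZE_KB    # ~1024 items

# For a Query, DynamoDB sums the sizes of all items it reads, rounds up to
# the next 4 KB, and charges per 4 KB unit.
four_kb_units = math.ceil(items_per_page * ITEM_SIZE_KB / 4)   # 256

print(four_kb_units * 0.5)   # ~128 RCU per page, eventually consistent
print(four_kb_units * 1.0)   # ~256 RCU per page, strongly consistent
```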
You can use a GSI to get separate throughput and control the RCU capacity that is used. The amount of data transferred becomes predictable, and the RCU utilization is based on the index entries only (first, last, and middle name).
You will need to update your application to use the new index and work with eventually consistent reads; GSIs don't support strongly consistent reads.
Global secondary indexes support eventually consistent reads, each of which consume one half of a read capacity unit. This means that a single global secondary index query can retrieve up to 2 × 4 KB = 8 KB per read capacity unit.
For global secondary index queries, DynamoDB calculates the provisioned read activity in the same way as it does for queries against tables. The only difference is that the calculation is based on the sizes of the index entries, rather than the size of the item in the base table.
Link to documentation
Returning to your question: "which solution is better for me to use?"
If you need strongly consistent reads, you have to use the base table with a FilterExpression. Otherwise, use a GSI.
A good read on this topic is the article: When to use (and when not to use) DynamoDB Filter Expressions
First of all, it's important to note that DynamoDB's 1 MB limit is not a blocker; it's there for performance reasons.
Your use case seems to be trying to unnecessarily squeeze the payload below the 1 MB limit, when you should just introduce pagination.
DynamoDB paginates the results from Query operations. With pagination, the Query results are divided into "pages" of data that are 1 MB in size (or less). An application can process the first page of results, then the second page, and so on.
The LastEvaluatedKey from a Query response should be used as the ExclusiveStartKey for the next Query request. If there is not a LastEvaluatedKey element in a Query response, then you have retrieved the final page of results. If LastEvaluatedKey is not empty, it does not necessarily mean that there is more data in the result set. The only way to know when you have reached the end of the result set is when LastEvaluatedKey is empty.
Ref
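For illustration, here is a minimal boto3 sketch of that paging loop, combined with a ProjectionExpression for the name attributes. The table ("users"), partition key ("account_id"), and attribute names are placeholders, not the asker's actual schema.

```python
# Page through a Query by feeding LastEvaluatedKey back in as ExclusiveStartKey,
# projecting only the name attributes. All names here are placeholders.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")

def fetch_names(account_id):
    items, start_key = [], None
    while True:
        kwargs = {
            "KeyConditionExpression": Key("account_id").eq(account_id),
            "ProjectionExpression": "first_name, last_name, middle_name",
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:   # no LastEvaluatedKey means this was the final page
            return items
```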
GSI or ProjectionExpression
This ultimately depends on what you need. For example, if you just want certain attributes and the base table's keys suit your access patterns, then I would 100% use a ProjectionExpression and paginate the results until I have all the data.
You should only create a GSI if the base table's keys don't suit your access patterns. A GSI increases your table costs: you store more data and consume extra throughput that your use case doesn't need.

Indexed Range Query with DynamoDB

With DynamoDB, there is simply no straightforward way to perform an indexed range query over a column. Primary key, local secondary index, and global secondary index all require a partition key to range query.
For example, suppose I have a high-scores table with a numerical score attribute. There is no way to get the top 10 scores, or scores 25 to 50, with an indexed range query.
So, what is the idiomatic or preferred way to perform this incredibly common task?
1. Settle for a table scan.
2. Use a static partition key and take advantage of partition queries.
3. Use a fixed number of static partition keys and use multiple partition queries.
It's either 2) or 3) but it depends on the amount and structure of data as well as the read/write activity.
There's no generic answer here as it's use-case specific.
As long as you can get away with it, you probably want to use 2), as it only requires a single Query API call (see the sketch below). If you have lots of data or heavy read/write activity, you'd use some bucketing strategy (very close to your third option) to write to multiple partitions, then run multiple queries and aggregate the results.
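A sketch of option 2, assuming a "high_scores" table whose items all carry a constant "bucket" attribute serving as the partition key of a GSI (here called "score-index") whose sort key is the score; all of these names are illustrative.

```python
# Query a GSI whose partition key is a constant ("ALL") and whose sort key is
# the score, in descending order. Index and attribute names are placeholders.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("high_scores")

def top_scores(n=10):
    resp = table.query(
        IndexName="score-index",
        KeyConditionExpression=Key("bucket").eq("ALL"),
        ScanIndexForward=False,  # highest scores first
        Limit=n,
    )
    return resp["Items"]
```

For ranks 25 to 50 you would keep paging with Limit and ExclusiveStartKey. The trade-off is that a single static partition key concentrates all of the index writes on one partition, which is exactly why option 3 (multiple buckets) exists.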
DDB isn't suited for analytics. As Maurice said, you can get what you need via a secondary index, but there are other options to consider as well:
If you are providing this top N to your customers consistently/frequently and N is fixed, then you can have dedicated item(s) that hold this information, and you would update that/those item(s) whenever you write an item to the table. You can have one item for the whole top N, or you can apply some bucketing strategy.
If your system needs this information infrequently (on some singular occasions), then a scan might also be fine.
If this is for analytics/research, consider exporting the table to S3 and using Athena.

Pagination in DynamoDB list of items

I have a list of items. Can I implement pagination on the locations attribute, e.g. fetch indexes 0 to 10, 10 to 20, and so on?
{
continent:String,
country:String,
locations:[ {address,phone},{address,phone}....]
}
DynamoDB read requests (e.g., GetItem) have a ProjectionExpression option where you can pass a list of attributes, as well as sub-attributes, that you want to fetch from the item instead of the whole item. So you can use
ProjectionExpression="locations[10],locations[11],...,locations[19]"
to have GetItem return just elements 10..19 from the "locations" list. Unfortunately there is no shorter syntax to get a range of indices.
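With boto3, that could look like the following sketch. The table name ("places") and key schema (continent as partition key, country as sort key) are assumptions based on the item shape shown above.

```python
# Read only elements 10..19 of the "locations" list with a ProjectionExpression.
# Table name and key schema are illustrative assumptions.
import boto3

table = boto3.resource("dynamodb").Table("places")

resp = table.get_item(
    Key={"continent": "Europe", "country": "France"},
    ProjectionExpression=", ".join(f"locations[{i}]" for i in range(10, 20)),
)
page = resp.get("Item", {}).get("locations", [])   # elements 10..19 only
```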
However, you should be aware that even if you ask to read only elements 10..19 from the list, you'll actually be paying to read the entire item. DynamoDB assumes that items (the whole thing in your example) are fairly small (the hard limit is 400 KB), and any read or write to an item - even if just to a part of it - incurs a cost proportional to the length of the entire item. So although you could do paging using the syntax I just described, it is usually not a cost-effective approach. If you read an item in 20 "pages", you'll be paying for reading the same item 20 times! You're better off just fetching the entire item (again, it's limited in size anyway) once and paging through it in the client.
When talking about paging being an effective approach, the discussion is usually about multiple items in the same partition, i.e., multiple items with the same hash key and different sort keys. You can then page through these items using the Query request, and you will be paying just for the items you paged through. With such an approach, the length of the "list" (it's not really a list, but rather many separate items) is practically unbounded.

Query result size in AWS Redshift

Is there a good way to know the query response size in Redshift?
Other providers create a temporary table (or you can do it yourself) and then check the table size. In Redshift, a temporary table does not have an entry in the svv_table_info view.
Creating a temp table does work for your case, and the size of all tables can be found by looking at stv_blocklist. Since every table is composed of blocks, and all blocks are 1 MB in size, this gives a "from the horse's mouth" size for the table. Just remember that this gives the blocks needed to store the table, and in some cases this can be misleading - DISTSTYLE ALL tables will have N copies of the data. In general this is a good way to find the size of any table. You can also query the number of rows in the temp table for a different size assessment.
The downside of the temp table approach is the time it takes to setup and organize the data into a table. You still need to select this data for output assuming that is the intent once the size has been assessed. A more common, but more advanced approach for dealing with potentially oversized output is to set up a cursor to hold the output.
A cursor is an output buffer on the Redshift leader node that holds the results of a query before transmitting to the requesting client. Then the cursor contents can be read in chunks of row (10,000 typically) and when more data is needed, more rows are read. Many BI tools will use cursors so that malformed reports don't swamp the tool with too much data. You can also query the size (rows and bytes) of a cursor by looking at stv_active_cursors.
The downside of cursors is that they require work from the leader node, not much, but some. Significant over-use / mis-use of cursors can slow the leader down or potentially fill up the leader node's disks (but this is unlikely since the leader has the same disk size as a compute node but not as much data to store). Also since cursors are read in a loop, fetch then process, fetch then process, the client is typically an application and not a user. However, I have done cursors interactively in order to see the size of the output before I pull it out over the network - run query to cursor, read the size of the cursor, and then if the size is ok read the entire cursor in one chunk.
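As an illustration only, a cursor-based read from Python might look like the sketch below. The psycopg2 driver, connection details, cursor name, and table name are all assumptions; DECLARE/FETCH must run inside a transaction.

```python
# Read a large Redshift result set in chunks through a server-side cursor.
# Connection details and the table being queried are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="analyst", password="...")
with conn, conn.cursor() as cur:
    # DECLARE opens a cursor on the leader node; FETCH reads it in chunks.
    cur.execute("DECLARE big_result CURSOR FOR SELECT * FROM some_large_table;")
    while True:
        cur.execute("FETCH FORWARD 10000 FROM big_result;")
        rows = cur.fetchall()
        if not rows:
            break
        print(f"fetched {len(rows)} rows")   # real code would process the chunk here
    cur.execute("CLOSE big_result;")
```

Between the DECLARE and the first FETCH you could also look at stv_active_cursors, as mentioned above, to see the size of the buffered result before pulling it over the network.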
Bottom line - cursors could be the solution you are looking for (I can't really tell without the use case being described).

What is the idiomatic way to perform a migration on a dynamo table

Suppose I have a Dynamo table named Person, which has 2 fields: name (string) and age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput, but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: Dynamo doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv and run a script (overnight, probably, since it's a TB worth of data) to add a default phone number to the file. (Not ideal: it will also lose any new data written to the old table, unless I take the service offline to perform the migration, which is not an option in this case.)
Use the Scan API call. (Scan will read all the values and then consume significant write throughput on the new table to insert all the old data.)
How can I perform a Dynamo migration on a large table without significant data loss?
You don't need to do anything. This is NoSQL, not SQL (i.e., there is no idiomatic way to do this, as you normally don't need migrations in NoSQL).
Just start writing entries with the additional attribute.
Records written before the change won't have this attribute; what you normally do is use a default value when it's missing.
If you want to backfill, just go through and read each item, then write it back with the additional field. You can do this in one run via a scan (as in the sketch below) or, again, do it lazily when accessing the data.
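A minimal backfill sketch along those lines, assuming the Person table is keyed on "name" (the question doesn't say) and that "unknown" is an acceptable default; a real run would also rate-limit the writes so it doesn't compete with the heavy write traffic.

```python
# Scan for items missing "Phone" and set a default, without clobbering items
# that gained a real phone number while the scan was running.
# Table key schema and the default value are illustrative assumptions.
import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Person")

start_key = None
while True:
    kwargs = {"FilterExpression": Attr("Phone").not_exists()}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.scan(**kwargs)
    for item in resp["Items"]:
        try:
            table.update_item(
                Key={"name": item["name"]},
                UpdateExpression="SET Phone = :p",
                ConditionExpression="attribute_not_exists(Phone)",
                ExpressionAttributeValues={":p": "unknown"},
            )
        except ClientError as e:
            # The item got a real Phone value since the scan read it; skip it.
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    start_key = resp.get("LastEvaluatedKey")
    if not start_key:
        break
```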