Maximum Partition key length of my data in Dynamo DB - amazon-web-services

I have a use case where I need to place constraints on the key size in my application. I am trying to find the maximum length of the partition key values stored so far in my DynamoDB table. This will help me understand my data before placing any internal constraints on the values I use as a partition key in DynamoDB.
Example: let's say this is my table, with partition key idempotent_id. I want to know the maximum length of the partition key values so far (in this case, 7).
idempotent_id
1234
12
1234567
12345
I tried using the DynamoDB console from my AWS account and looked at the Query and Scan APIs of DynamoDB, but nothing seems like a good fit. Maybe this is something that can't be found using DynamoDB, or maybe I am searching the wrong way?
Any help would be appreciated.

Partition Keys and Sort Keys
Partition Key Length
The minimum length of a partition key value is 1 byte. The maximum length is 2048 bytes.
Partition Key Values
There is no practical limit on the number of distinct partition key values, for tables or for secondary indexes.
Sort Key Length
The minimum length of a sort key value is 1 byte. The maximum length is 1024 bytes.
Sort Key Values
In general, there is no practical limit on the number of distinct sort key values per partition key value.
The exception is for tables with secondary indexes. With a local secondary index, there is a limit on item collection sizes: For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB. This might constrain the number of sort keys per partition key value. For more information, see Item Collection Size Limit.

From your comment:
I am trying to find the maximum size/length of idempotent_id so far in my single table.
In order to do this without any auxiliary data, you will need to perform a full table scan and read the attribute you care about from each item. You can use a ProjectionExpression to reduce the amount of data retrieved (a sketch of this approach follows below).
You could store the key length in another attribute and create a GSI on it, which would give you the ability to query that index in order.
Another option would be to use something like DynamoDB Streams to listen to events and keep track of the max size in a different storage medium.
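For completeness, here is a minimal sketch of the full-scan option in TypeScript with the AWS SDK for JavaScript (v2). The table name MyTable is a placeholder, and the code assumes the key attribute is the string idempotent_id:

```typescript
// A minimal sketch of the scan approach above (aws-sdk v2 for Node.js).
// Table name 'MyTable' and attribute name 'idempotent_id' are placeholders.
import * as AWS from 'aws-sdk';

const docClient = new AWS.DynamoDB.DocumentClient();

async function maxPartitionKeyLength(): Promise<number> {
  let maxLen = 0;
  let lastKey: AWS.DynamoDB.DocumentClient.Key | undefined;

  do {
    // Project only the key attribute to keep the scan as cheap as possible.
    const page = await docClient
      .scan({
        TableName: 'MyTable',
        ProjectionExpression: 'idempotent_id',
        ExclusiveStartKey: lastKey,
      })
      .promise();

    for (const item of page.Items ?? []) {
      // DynamoDB counts UTF-8 bytes, so measure bytes rather than characters.
      const len = Buffer.byteLength(String(item.idempotent_id), 'utf8');
      if (len > maxLen) maxLen = len;
    }
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);

  return maxLen;
}

maxPartitionKeyLength().then((len) => console.log('max key length so far:', len));
```

A scan reads every item, so this is something you would run occasionally (or against a small table) rather than on every request.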

Take a look at the data types section and their limits: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
String
The length of a String is constrained by the maximum item size of 400 KB.
Strings are Unicode with UTF-8 binary encoding. Because UTF-8 is a variable width encoding, DynamoDB determines the length of a String using its UTF-8 bytes.
Number
A Number can have up to 38 digits of precision, and can be positive, negative, or zero.
Positive range: 1E-130 to 9.9999999999999999999999999999999999999E+125
Negative range: -9.9999999999999999999999999999999999999E+125 to -1E-130
DynamoDB uses JSON strings to represent Number data in requests and replies. For more information, see DynamoDB Low-Level API.
If number precision is important, you should pass numbers to DynamoDB using strings that you convert from a number type.
Binary
The length of a Binary is constrained by the maximum item size of 400 KB.
Applications that work with Binary attributes must encode the data in Base64 format before sending it to DynamoDB. Upon receipt of the data, DynamoDB decodes it into an unsigned byte array and uses that as the length of the attribute.
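Because DynamoDB enforces the key limits in UTF-8 bytes rather than characters, it can be worth checking the byte length of a prospective key value. A small Node.js illustration of how character count and byte count diverge (the sample strings are arbitrary):

```typescript
// UTF-8 is variable-width, so .length (UTF-16 code units) can differ from
// the byte count DynamoDB uses to enforce the 2048 / 1024 byte key limits.
const ascii = 'abc';
const accented = 'héllo';   // 'é' takes 2 bytes in UTF-8
const emoji = 'id-🚀';      // the rocket takes 4 bytes in UTF-8

for (const s of [ascii, accented, emoji]) {
  console.log(s, 'chars:', s.length, 'utf8 bytes:', Buffer.byteLength(s, 'utf8'));
}
// 'abc'   -> chars: 3, utf8 bytes: 3
// 'héllo' -> chars: 5, utf8 bytes: 6
// 'id-🚀' -> chars: 5, utf8 bytes: 7
```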

Range Key Example
string = '1USER:01G23WSPRVXA8BWK5ERD0TDYT501G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sl'
string.length => 1024
You can also check the byte length of a string with an online byte counter.
What is interesting is that if I add one more character to this string and then create a new item with it as the range key, I get the following error.
UnhandledPromiseRejectionWarning: ValidationException: Hash primary key values must be under 2048 bytes, and range primary key values must be under 1024 bytes
The measured reality is that the range key length can be <= 1024 bytes, even though the error message implies < 1024. The same difference applies to the primary hash key at 2048: the warning says under 2048 bytes, but the database accepts <= 2048.
Note: both keys must be at least one character long. The following is the error message when the range key is an empty string.
UnhandledPromiseRejectionWarning: ValidationException: One or more parameter values are not valid. The AttributeValue for a key attribute cannot contain an empty string value. Key: SK
Primary (Hash) Key Example
string = '01G23XW4MCMTGMRWD4R78NMA29:USERS31:1USER:01G23WSPRVXA8BWK5ERD0TDYT501G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;sljs;gjasjfdaskfjas;dfjaskdfjaskldfjkasjfjkasjfjlsjdfjskjfjlsajf;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahskghlahkdhgkjashdjkghjakshghasjkdhg;jslfjklsajjj;sdfjsfjas;fjsa;fjs;ldfkjsadk;lsfasjjf;jfj;jdksfjsljfksfksf;afjs;lfjsljflsjkssdjfsjfsljskdkdkdkdkdkdkddkdkksajglks;jgsag;ljd;lagjdkllgkjsklgj01G0G9WD6E1JXPZTFBJAAFKJ9N:CLIENTONE:NAVBAR:V1:lsj;lajgjsldglsjgjasgjsjdgjsdjgjajjg;slskahlsfdjlashghjskhglhaskjghlhsdghjkahsdsagsdgdggg'
string.length => 2048
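If you want to reproduce this experiment yourself, here is a rough sketch in TypeScript with the AWS SDK v2. The table name is a placeholder, and the key attribute names PK/SK are assumptions based on the error output above:

```typescript
// Probes the sort key length limit by attempting two puts (aws-sdk v2).
// 'MyTable', 'PK' and 'SK' are placeholders/assumptions, not a required schema.
import * as AWS from 'aws-sdk';

const docClient = new AWS.DynamoDB.DocumentClient();

async function probeSortKeyLimit(): Promise<void> {
  const okSortKey = 'x'.repeat(1024);   // accepted: exactly 1024 bytes
  const badSortKey = 'x'.repeat(1025);  // rejected: one byte over the limit

  await docClient
    .put({ TableName: 'MyTable', Item: { PK: 'probe', SK: okSortKey } })
    .promise();

  try {
    await docClient
      .put({ TableName: 'MyTable', Item: { PK: 'probe', SK: badSortKey } })
      .promise();
  } catch (err) {
    // Expect: ValidationException ... range primary key values must be under 1024 bytes
    console.error((err as Error).message);
  }
}

probeSortKeyLimit();
```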

Related

Timestream- failed to satisfy constraint: Member must satisfy the bytes range [1, 2048]

I am trying to insert a record into a Timestream table where an attribute value exceeds the bytes range [1, 2048]. Is it possible to increase the range?
It seems that, based on the Timestream Record docs, MeasureValue can only contain a maximum of 2048 characters. If you are facing this problem, I would suggest rethinking the way you store the data: Timestream is meant for building time-series datasets, and this kind of large value is better off stored in a NoSQL or RDBMS store.
You might want to consider swapping the contents of that field for a key that points to an item in DynamoDB; a sketch of that approach follows the quoted limits below.
MeasureValue
- **Contains** the measure value for the time-series data point.
- **Type**: String
- **Length Constraints**: Minimum length of 1. Maximum length of 2048.
- **Required**: No
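Here is a rough sketch of that "store a pointer" idea using the AWS SDK for JavaScript (v2). All of the table, database, and dimension names are placeholders rather than a prescribed schema:

```typescript
// Park the oversized payload in DynamoDB and write only a short reference
// into Timestream as the measure value (aws-sdk v2; all names are placeholders).
import * as AWS from 'aws-sdk';
import { randomUUID } from 'crypto';

const docClient = new AWS.DynamoDB.DocumentClient();
const timestream = new AWS.TimestreamWrite();

async function recordLargeMeasure(deviceId: string, largePayload: string): Promise<void> {
  // 1. Store the oversized payload in DynamoDB under a generated key.
  const payloadId = randomUUID();
  await docClient
    .put({ TableName: 'measure-payloads', Item: { payloadId, body: largePayload } })
    .promise();

  // 2. Write only the (short) reference into Timestream as the measure value.
  await timestream
    .writeRecords({
      DatabaseName: 'metrics',
      TableName: 'device-events',
      Records: [
        {
          Dimensions: [{ Name: 'deviceId', Value: deviceId }],
          MeasureName: 'payload_ref',
          MeasureValue: payloadId, // well under the 2048-byte limit
          MeasureValueType: 'VARCHAR',
          Time: Date.now().toString(),
          TimeUnit: 'MILLISECONDS',
        },
      ],
    })
    .promise();
}
```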

What should be the value of VisitorId for Recommendations AI?

Are there specific requirements for the VisitorId value (a maximum length, or maybe it should be composed only of numbers...)? If we send, for example, "visitor76a008b9d38491ff142774559e740552", will it work?
Yes, it will work, since your string input for visitorId is 39 bytes. The visitorId field has a length limit of 128 bytes and there are no restrictions on its contents. As long as the string input is within that limit, it should be fine (a quick validation sketch follows the quoted documentation below).
visitorID -> string
Required. A unique identifier for tracking visitors, with a length limit of 128 bytes.
For example, this could be implemented with an HTTP cookie, which should be able to uniquely identify a visitor on a single device. This unique identifier should not change if the visitor logs in or out of the website. Maximum length 128 bytes. Cannot be empty.
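That validation sketch, in TypeScript, checks the 128-byte limit client-side before sending the event (the helper name is just for illustration):

```typescript
// Pre-flight check against the documented 128-byte limit for visitorId.
const VISITOR_ID_MAX_BYTES = 128;

function assertValidVisitorId(visitorId: string): string {
  const bytes = new TextEncoder().encode(visitorId).length;
  if (bytes === 0 || bytes > VISITOR_ID_MAX_BYTES) {
    throw new Error(`visitorId must be 1-${VISITOR_ID_MAX_BYTES} bytes, got ${bytes}`);
  }
  return visitorId;
}

// "visitor" + 32 hex characters = 39 bytes, so this passes.
assertValidVisitorId('visitor76a008b9d38491ff142774559e740552');
```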

AWS Redshift: How to store text field with size greater than 100K

I have a text field in a Parquet file with a maximum length of 141598. I am loading the Parquet file into Redshift and got an error while loading, as the maximum a VARCHAR can store is 65535.
Is there any other datatype I can use or another alternative to follow?
Error while loading:
S3 Query Exception (Fetch). Task failed due to an internal error. The length of the data column friends is longer than the length defined in the table. Table: 65535, Data: 141598
No, the maximum length of a VARCHAR data type is 65535 bytes, and that is the largest character type that Redshift is capable of storing. Note that the length is in bytes, not characters, so the actual number of characters stored depends on their byte length.
If the data is already in Parquet format, then you possibly don't need to load it into a Redshift table at all; instead, you could create a Spectrum external table over it. The external table definition will only support a VARCHAR definition of 65535, the same as a normal table, and any query against the column will silently truncate characters beyond that length. However, the original data will be preserved in the Parquet file and potentially accessible by other means if needed.

Dynamo DB: global secondary index, sparse index

I am considering taking advantage of sparse indexes as described in the AWS guidelines. In the example described --
... in the GameScores table, certain players might have earned a particular achievement for a game - such as "Champ" - but most players have not. Rather than scanning the entire GameScores table for Champs, you could create a global secondary index with a partition key of Champ and a sort key of UserId.
My question is: what happens when the number of champs becomes very large? I suppose that the "Champ" partition will become very large and you would start to experience uneven load distribution. In order to get uniform load distribution, would I need to randomize the "Champ" value by (effectively) sharding over n shards, e.g. Champ.0, Champ.1 ... Champ.99?
Alternatively, is there a different access pattern that can be used when fetching entities with a specific attribute that may grow large over time?
This is exactly the solution you need (Champ.0, Champ.1 ... Champ.N).
N should be the number of partitions you expect for this index plus some room for growth (if you expect high load or many 'champs', you could choose N=200), so that you get a good hash distribution over the partitions. I recommend deriving the shard number as a modulo of the userId, which also lets you do manipulations by userId (see the sketch below).
We also use this solution when the hash key is a Boolean (in DynamoDB you can represent a Boolean as a string); in that case the hash key values become "true.0", "true.1" ... "true.N", and the same for "false".
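A sketch of what this write-sharding could look like in TypeScript with the AWS SDK v2. The table, index, and attribute names are placeholders, and N=100 shards is an arbitrary choice:

```typescript
// Sharded sparse-index sketch (aws-sdk v2). 'GameScores', 'champ-index' and
// 'champShardKey' are placeholders, not names from the AWS example itself.
import * as AWS from 'aws-sdk';
import { createHash } from 'crypto';

const docClient = new AWS.DynamoDB.DocumentClient();
const SHARDS = 100;

// Deterministically map a userId to one of N shards ("modulo on userId").
function champShard(userId: string): string {
  const n = createHash('md5').update(userId).digest().readUInt32BE(0) % SHARDS;
  return `Champ.${n}`;
}

// When a player earns the achievement, set the sparse GSI partition key.
// (Key attributes are simplified; include your sort key too if the table has one.)
async function markChamp(userId: string): Promise<void> {
  await docClient
    .update({
      TableName: 'GameScores',
      Key: { UserId: userId },
      UpdateExpression: 'SET champShardKey = :s',
      ExpressionAttributeValues: { ':s': champShard(userId) },
    })
    .promise();
}

// Reading all champs means fanning out one Query per shard on the GSI
// (pagination via LastEvaluatedKey omitted for brevity).
async function listChamps(): Promise<string[]> {
  const users: string[] = [];
  for (let i = 0; i < SHARDS; i++) {
    const page = await docClient
      .query({
        TableName: 'GameScores',
        IndexName: 'champ-index',
        KeyConditionExpression: 'champShardKey = :s',
        ExpressionAttributeValues: { ':s': `Champ.${i}` },
      })
      .promise();
    (page.Items ?? []).forEach((item) => users.push(item.UserId));
  }
  return users;
}
```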

What's the difference between BatchGetItem and Query in DynamoDB?

I've been going through AWS DynamoDB docs and, for the life of me, cannot figure out what's the core difference between batchGetItem() and Query(). Both retrieve items based on primary keys from tables and indexes. The only difference is in the size of the items retrieved but that doesn't seem like a ground breaking difference. Both also support conditional updates.
In what cases should I use batchGetItem over Query and vice-versa?
There’s an important distinction that is missing from the other answers:
Query requires a partition key
BatchGetItem requires a primary key
Query is only useful if the items you want to get happen to share a partition (hash) key, and you must provide this value. Furthermore, you have to provide the exact value; you can’t do any partial matching against the partition key. From there you can specify an additional (and potentially partial/conditional) value for the sort key to reduce the amount of data read, and further reduce the output with a FilterExpression. This is great, but it has the big limitation that you can’t get data that lives outside a single partition.
BatchGetItem is the flip side of this. You can get data across many partitions (and even across multiple tables), but you have to know the full and exact primary key: that is, both the partition (hash) key and any sort (range) key. It's literally like calling GetItem multiple times in a single operation. You don't have the partial-searching and filtering options of Query, but you're not limited to a single partition either.
As per the official documentation:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.html#CapacityUnitCalculations
For BatchGetItem, each item in the batch is read separately, so DynamoDB first rounds up the size of each item to the next 4 KB and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchGetItem reads a 1.5 KB item and a 6.5 KB item, DynamoDB will calculate the size as 12 KB (4 KB + 8 KB), not 8 KB (1.5 KB + 6.5 KB).
For Query, all items returned are treated as a single read operation. As a result, DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
You should use BatchGetItem if you need to retrieve many items with little HTTP overhead when compared to GetItem.
A BatchGetItem costs the same as calling GetItem for each individual item. However, it can be faster since you are making fewer network requests.
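To make the rounding rules quoted above concrete, here is a tiny sketch that applies both rules to the same 1.5 KB + 6.5 KB example (read capacity units are then derived from these rounded sizes):

```typescript
// Illustration of the 4 KB rounding rules for BatchGetItem vs Query.
const KB = 1024;
const roundUpTo4KB = (bytes: number) => Math.ceil(bytes / (4 * KB)) * 4 * KB;

// BatchGetItem: each item is rounded up on its own, then summed.
const batchGetSize = [1.5 * KB, 6.5 * KB].map(roundUpTo4KB).reduce((a, b) => a + b, 0);
console.log(batchGetSize / KB); // 12 (4 KB + 8 KB)

// Query: item sizes are summed first, then the total is rounded up once.
const querySize = roundUpTo4KB(1.5 * KB + 6.5 * KB);
console.log(querySize / KB); // 8
```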
In a nutshell:
BatchGetItem works on tables and needs the full primary key of each item you want to retrieve. You can get up to 16 MB of data or 100 items in a response.
Query works on tables, local secondary indexes, and global secondary indexes. You can get at most 1 MB of data in a response. The biggest difference is that Query supports filter expressions, which means you can request data and DynamoDB will filter it server-side for you.
You can probably achieve the same thing with either if you really want to, but the rule of thumb is: you do a BatchGet when you need to bulk-dump items from DynamoDB, and you use Query when you need to narrow down what you want to retrieve (and you want DynamoDB to do the heavy lifting of filtering the data for you).
DynamoDB stores items under two kinds of keys: a single partition key, like "jupiter"; or a compound partition and range key, like "jupiter"/"planetInfo", "jupiter"/"moon001" and "jupiter"/"moon002".
A BatchGet helps you fetch the values for a large number of keys at the same time. This assumes that you know the full key(s) for each item you want to fetch. So you can do a BatchGet("jupiter", "saturn", "neptune") if you have only partition keys, or BatchGet(["jupiter","planetInfo"], ["saturn","planetInfo"], ["neptune","planetInfo"]) if you're using partition + range keys. Each item is charged for independently, and the cost is the same as individual gets; the results are just batched and the call saves time (not money).
A Query, on the other hand, works only within a single partition of a partition + range key combo and helps you find items and keys that you don't necessarily know. If you wanted to count Jupiter's moons, you'd do a Query(select(COUNT), partitionKey: "jupiter", rangeKeyCondition: "startsWith:'moon'"). Or if you wanted to fetch moons no. 7 to 15, you'd do Query(select(ALL), partitionKey: "jupiter", rangeKeyCondition: "BETWEEN:'moon007'-'moon015'"). Here you're charged based on the size of the data read by the query, irrespective of how many items there are.
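Translated into rough DocumentClient calls (AWS SDK for JavaScript v2), with the table name Planets and key attribute names pk/sk as placeholders, that pseudocode looks roughly like this:

```typescript
// The pseudocode above as DocumentClient calls (aws-sdk v2; names are placeholders).
import * as AWS from 'aws-sdk';

const docClient = new AWS.DynamoDB.DocumentClient();

// Count Jupiter's moons: exact partition key plus a begins_with sort key condition.
const countMoons = docClient
  .query({
    TableName: 'Planets',
    KeyConditionExpression: 'pk = :p AND begins_with(sk, :prefix)',
    ExpressionAttributeValues: { ':p': 'jupiter', ':prefix': 'moon' },
    Select: 'COUNT',
  })
  .promise();

// Fetch moons 7 through 15 using a BETWEEN condition on the sort key.
const moons7to15 = docClient
  .query({
    TableName: 'Planets',
    KeyConditionExpression: 'pk = :p AND sk BETWEEN :lo AND :hi',
    ExpressionAttributeValues: { ':p': 'jupiter', ':lo': 'moon007', ':hi': 'moon015' },
  })
  .promise();

// BatchGet needs the full key of every item, but can span partitions (and tables).
const planetInfo = docClient
  .batchGet({
    RequestItems: {
      Planets: {
        Keys: [
          { pk: 'jupiter', sk: 'planetInfo' },
          { pk: 'saturn', sk: 'planetInfo' },
          { pk: 'neptune', sk: 'planetInfo' },
        ],
      },
    },
  })
  .promise();
```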
Adding an important difference: Query supports consistent reads, while BatchGetItem does not.
(Correction: BatchGetItem can use consistent reads through TableKeysAndAttributes. Thanks @colmlg for the information.)