I have recently started exploring the Amazon Redshift database. However, I am not able to find the following database maximums anywhere in the documentation.
Parameters:
Columns: maximum per table or view
Names: maximum length of database and column names
Characters: maximum number of characters in a char/varchar field
Connections: maximum connections to the server
Concurrency: maximum number of concurrent users
Row size: maximum row size
DISTKEY: maximum per table
SORTKEY: maximum per table (compound/interleaved)
Cluster size: maximum cluster size (in terms of compressed data size)
It would be of great help if anyone can provide this information.
Connection limits, concurrency limits and naming constraints are detailed here:
http://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-limits.html
Currently there is a maximum of 500 connections and a concurrency limit of 50 per cluster.
You can have one DISTKEY per table, more details here:
http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Interleaved sort keys are limited to eight columns:
http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
Maximum length of character data types:
http://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html
Cluster size is determined by the number and type of nodes in the cluster. In terms of storage the current maximum possible is 128 x ds2.8xlarge nodes, for a max storage of 2 Petabytes:
http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html
With the recent RA3 node type (ra3.16xlarge) you can go up to 8 PB.
500 is the maximum for the Query Editor, not the real maximum number of DB connections.
I read the docs for the AWS Glue GetPartitions API and noticed that it has a "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
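As a rough illustration of what the Segment parameter enables, here is a minimal sketch that issues one GetPartitions request stream per segment in parallel and merges the results. The helper names `fetch_segment` and `fetch_all_partitions` are hypothetical; `glue` is assumed to be a boto3 Glue client (for example `boto3.client("glue")`), and the database and table names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_segment(glue, database, table, segment_number, total_segments):
    """Fetch every partition in one segment, following NextToken pagination."""
    partitions = []
    kwargs = {
        "DatabaseName": database,
        "TableName": table,
        "Segment": {
            "SegmentNumber": segment_number,  # zero-based index
            "TotalSegments": total_segments,
        },
    }
    while True:
        response = glue.get_partitions(**kwargs)
        partitions.extend(response.get("Partitions", []))
        token = response.get("NextToken")
        if not token:
            return partitions
        kwargs["NextToken"] = token


def fetch_all_partitions(glue, database, table, total_segments=5):
    """Run one worker per segment and concatenate the results."""
    if not 1 <= total_segments <= 10:
        raise ValueError("TotalSegments must be between 1 and 10")
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(fetch_segment, glue, database, table, n, total_segments)
            for n in range(total_segments)
        ]
        return [p for future in futures for p in future.result()]
```

Each segment covers a non-overlapping slice of the table's partitions, so the merged list should contain every partition exactly once. More segments means more concurrent API calls against the Glue catalog, which is only useful when the table has enough partitions to make listing them slow.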
It looks strikingly similar to Segments in DynamoDB, which are used by DynamoDB's parallel scan feature (as discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed is that segments in Glue are limited to only 10, while they seem unlimited in DynamoDB. Therefore:
1. I would like to know what Glue Segments are, whether they are really the same thing, and whether the use cases for parallel scans in DynamoDB apply here too (e.g. over 20 GB table size, read throughput not fully utilized).
Also, Presto/Trino has a Glue-segment-related parameter in config.properties, as seen in Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries kept using the same number of splits and the speed remained the same (I used the same number of nodes every time). Therefore:
2. How do Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks and splits?
And in this article about How to use AWS Glue Data Catalog as Metastore for Hive on AWS EMR, there is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How might Glue Segments cause throttling, and in what kind of scenario? (I think this one might become clear as my understanding of Glue Segments begins to form.)
In a DynamoDB table, each partition is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. What I don't understand is how these limits relate to the table's total RCU/WCU.
For example, suppose I configure a table's RCU to 6000 and WCU to 3000. Is this capacity evenly used by all partitions in the table? Or do all partitions compete for the total capacity?
I can't find a way to know how many partitions the DynamoDB table is using. Is there a metric that tells me that?
The single-partition limit only matters if your workload is so badly imbalanced that a significant percentage of requests go to the same partition. In a better designed data model, you have a large number of different partition keys, which allows DynamoDB to use a large number of different partitions, so you never see a significant percentage of your requests going to the same partition.
That does not mean, however, that the load on all partitions is equal. It might very well be that one partition sees twice the number of requests as another partition. A few years ago, this meant your performance suffered: DynamoDB split the provisioned capacity (RCU/WCU) equally between partitions, so as the busier partition got throttled sooner, the total capacity you got from DynamoDB was less than what you paid for. However, they fixed this a few years ago with what they call adaptive capacity: DynamoDB now detects when your workload's total capacity is under what you paid for, and increases the capacity limits on individual partitions.
For example, if you provision 10,000 RCU and DynamoDB divides your data into 10 partitions, each of those starts out with 1,000 RCU. However, if one partition gets double the requests of the others, the busy partition is throttled at 1,000 RCU while the other nine only reach 500 each, so the workload achieves only 1,000 + 9*500 = 5,500 RCU, significantly less than the 10,000 you are paying for. DynamoDB quickly recognizes this and increases the busy partition's limit from 1,000 to 1,818 RCU, and now the total performance is 1,818 + 9*909 = 9,999 RCU. DynamoDB does this automatically for you; you don't need to do anything special. All you need to do is make sure that your workload has enough different partition keys and that no significant percentage of requests goes to one specific partition key. Otherwise DynamoDB will not be able to achieve a high total RCU: it will always be limited by the single-partition limit of 3,000.
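The arithmetic in this example can be reproduced with a small model. This is only a simplified sketch of the reasoning above, not DynamoDB's actual adaptive capacity algorithm; the function names and the `weights` list (relative request rate per partition) are assumptions made for illustration.

```python
PARTITION_RCU_CAP = 3000  # hard per-partition read limit mentioned above


def throughput_without_adaptive(weights, provisioned_rcu):
    """Equal split: the busiest partition hits its share first and caps the rest."""
    per_partition_limit = provisioned_rcu / len(weights)
    busiest = max(weights)
    # Once the busiest partition is throttled at its limit, every other
    # partition only reaches its proportional fraction of that limit.
    return [per_partition_limit * w / busiest for w in weights]


def throughput_with_adaptive(weights, provisioned_rcu):
    """Idealized rebalancing: capacity follows demand, up to the hard cap."""
    total_weight = sum(weights)
    return [
        min(provisioned_rcu * w / total_weight, PARTITION_RCU_CAP)
        for w in weights
    ]


# One partition receives twice the requests of the other nine.
weights = [2] + [1] * 9
before = throughput_without_adaptive(weights, 10_000)
after = throughput_with_adaptive(weights, 10_000)
print(sum(before))       # total RCU achieved with an equal split
print(round(after[0]))   # limit of the busy partition after rebalancing
print(round(sum(after))) # total RCU achieved after rebalancing
```

With a single dominant key the `min(..., PARTITION_RCU_CAP)` term is what bites: no amount of rebalancing lets one partition exceed 3,000 RCU.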
Regarding your last question, I don't know if there is such a metric (maybe another responder will know), but the important thing to check is that you have a lot of partition keys. If that's the case, and your workload doesn't access one specific key for a large percentage of the requests, you should be safe.
I am currently working with BigQuery and understand that there is a limit of 4,000 partitions.
Does anyone know if this limit applies to the Active Storage tier only, or to both the Active and Long Term Storage tiers?
The reason for asking is that I have a table partitioned by hour that has been in use for more than 6 months, yet we don't get any error about exceeding the 4,000-partition limit when we insert new data.
I did a count of the number of partitions (see the attached image below):
As we can see, the total number of partitions is 6,401 and we are still able to insert new data.
At the same time, we also created a new partitioned table and tried moving data into it, but we encountered an error saying we had exceeded the limit of 4,000.
In addition, I also tried to insert data incrementally, but I still get the following error:
Steps to reproduce error:
Create a partitioned table (partition by hour)
Start moving data by month from another table
My finding:
The mentioned partition limit is only applicable to active storage tier.
Can anyone help confirm this?
As I understand the limitation, you can't modify more than 4,000 partitions in one job. The jobs you describe first are presumably working because they modify only a few partitions each.
When you try to move more than 4,000 partitions in one go, you will hit the limitation as you described.
I noticed I was hitting this limitation on both Active Storage and Long Term Storage. This is a BigQuery-wide limitation.
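If the per-job limit is indeed the cause, one way to stay under it is to copy the data in time windows that each touch at most 4,000 hourly partitions. The sketch below only computes the windows; `hourly_windows` is a hypothetical helper, it assumes `start` is aligned to an hour boundary, and the actual copy statement per window is left to you.

```python
from datetime import datetime, timedelta

MAX_PARTITIONS_PER_JOB = 4000  # per-job limit as described in the answer above


def hourly_windows(start, end, max_partitions=MAX_PARTITIONS_PER_JOB):
    """Yield (window_start, window_end) pairs covering [start, end).

    Each window spans at most `max_partitions` hours, so a job restricted
    to one window touches at most that many hourly partitions.
    """
    step = timedelta(hours=max_partitions)
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


# Nine months of hourly partitions need more than one job.
for window_start, window_end in hourly_windows(
    datetime(2021, 1, 1), datetime(2021, 10, 1)
):
    print(window_start, window_end)
```

Each pair can then be used as the half-open timestamp range of one insert or copy job.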
I'm writing a script that should fill a new table with data as quickly as possible (~650 GB table).
The partition (hash) key is different for every record, so I can't think of a better key.
I've set the provisioned WCU for this table to 4k.
When the script runs, 16 independent threads put different data into the table at a high rate. During execution, I receive ProvisionedThroughputExceededException. The CloudWatch graphs show that consumed WCU is capped at 1000 WCU.
This could happen if all the data goes to one partition.
As I understand it, DynamoDB creates a new partition when the data size exceeds the 10 GB limit. Is that so?
So, during this data fill operation, I have only 1 partition, and the limit of 1000 WCU is understandable.
I've checked https://aws.amazon.com/ru/premiumsupport/knowledge-center/dynamodb-table-throttled/
But it seems that these suggestions apply to tables that are already filled, where you are trying to add a lot of new data.
So I have 3 questions:
1. How can I speed up the process of inserting data into the new empty table?
2. When does DynamoDB decide to create a new partition?
3. Can I set a minimum number of partitions (e.g. 4) to use all of the provisioned WCU (4k)?
UPD Cloudwatch graph:
UPD2: the HASH key is a long number. Actually, it's not strictly unique, but at most 2 rows share the same HASH key with different RANGE keys.
You can't manually specify the number of partitions used by DDB. It's automatically handled behind the scenes.
However, the way it's handled is laid out in the link provided by F_SO_K.
1 for every 10 GB of data
1 for every 3000 RCU and/or 1000 WCU provisioned.
If you've provisioned 4000 WCU, then you should have at least 4 partitions and you should be seeing 4000 WCU consumed. Especially given that you said your hash key is unique for every record, your data should be spread uniformly and you should not be running into a "hot" partition.
You mentioned CloudWatch showing consumed WCU at 1000; does CloudWatch also show provisioned capacity at 4000 WCU?
If so, I'm not sure what's going on, and you may have to contact AWS.
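Besides CloudWatch, the provisioned value can be read from the table description. A minimal sketch, assuming `dynamodb` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`); the helper name `check_provisioned_wcu` and the table name are hypothetical.

```python
def check_provisioned_wcu(dynamodb, table_name, expected_wcu):
    """Return (actual_wcu, ok) where ok is True if the table has at least expected_wcu."""
    description = dynamodb.describe_table(TableName=table_name)
    throughput = description["Table"]["ProvisionedThroughput"]
    actual_wcu = throughput["WriteCapacityUnits"]
    return actual_wcu, actual_wcu >= expected_wcu
```

A call such as `check_provisioned_wcu(client, "my-table", 4000)` would confirm whether the 4000 WCU setting actually took effect before the load starts.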
I know that DynamoDB supports shards. I wanted to know whether it is possible to add shards dynamically.
Suppose I provisioned 4 shards and the shard key is customerID.
Now, in the future, I want to provision 6 more shards. Is it possible to add them?
If we can add 6 more shards, how will the old data get remapped to the new shards, and will availability or consistency take a hit?
For remapping, my guess is that they must be using consistent hashing.
No, there is no way to manually provision as many partitions as you want.
The number of DynamoDB partitions is determined by specific criteria.
These are the criteria.
Partitions by capacity = (RCUs/3000) + (WCUs/1000)
This depends on how much capacity you provision for the table.
Partitions by size = TableSizeInGB/10
This depends on how large the table is.
Total Partitions = Take the larger of Partitions by capacity and Partitions by size, and round it up to an integer.
For more information, I recommend you read the post.
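The formula above is easy to turn into a quick estimate. This is only a sketch of the rule as quoted here, not an official guarantee of how DynamoDB allocates partitions; the function name `estimate_partitions` and the floor of one partition are my own assumptions.

```python
import math


def estimate_partitions(rcu, wcu, table_size_gb):
    """Estimate partition count using the formula quoted above."""
    by_capacity = rcu / 3000 + wcu / 1000
    by_size = table_size_gb / 10
    # Assumption: a table always has at least one partition.
    return max(1, math.ceil(max(by_capacity, by_size)))


print(estimate_partitions(rcu=6000, wcu=3000, table_size_gb=20))
print(estimate_partitions(rcu=0, wcu=4000, table_size_gb=650))
```

In the first call capacity dominates (2 + 3 partitions), while in the second the 650 GB of data dominates over the 4 partitions implied by 4000 WCU.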