Understanding Segments in AWS Glue and its usage on Presto/Trino

I read the docs for the AWS Glue GetPartitions API and noticed that it has a "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
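For illustration, here is a minimal Python sketch of how the Segment parameter could be used to issue GetPartitions requests in parallel. It assumes a boto3-style Glue client is passed in (for example `boto3.client("glue")`); the helper names `segment_requests`, `fetch_segment`, and `fetch_all_partitions` are hypothetical and not part of any AWS SDK.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_requests(database, table, total_segments):
    """Build one GetPartitions request per segment (SegmentNumber is zero-based)."""
    if not 1 <= total_segments <= 10:
        raise ValueError("TotalSegments must be between 1 and 10")
    return [
        {
            "DatabaseName": database,
            "TableName": table,
            "Segment": {"SegmentNumber": n, "TotalSegments": total_segments},
        }
        for n in range(total_segments)
    ]


def fetch_segment(client, request):
    """Page through a single segment, following NextToken until it runs out."""
    partitions, token = [], None
    while True:
        kwargs = dict(request, NextToken=token) if token else dict(request)
        response = client.get_partitions(**kwargs)
        partitions.extend(response.get("Partitions", []))
        token = response.get("NextToken")
        if not token:
            return partitions


def fetch_all_partitions(client, database, table, total_segments=5):
    """Fetch all segments concurrently and merge them in segment order."""
    requests = segment_requests(database, table, total_segments)
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        results = pool.map(lambda r: fetch_segment(client, r), requests)
        return [p for segment in results for p in segment]
```

Each segment is an independent, non-overlapping slice of the table's partition list, so the requests do not need to coordinate with one another. Every extra segment is also an extra concurrent API call against the same account-level Glue request limits.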
It looks strikingly similar to Segments in DynamoDB, which are used by DynamoDB's parallel scan feature (as discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, one difference I noticed is that segments in Glue are limited to 10, while they seem to be unlimited in DynamoDB. Therefore:
1. I would like to know what Glue Segments are: are they really the same thing, and do the use cases for parallel scans in DynamoDB apply here too (e.g. table size over 20 GB, read throughput not fully utilized)?
Also, Presto/Trino has a Glue-segment-related parameter in config.properties, as listed under Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different values, queries kept using the same number of splits and the speed stayed the same (I used the same number of nodes every time). Therefore:
2. How do Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks and splits?
And in this article about How to use AWS Glue Data Catalog as Metastore for Hive on AWS EMR, there is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How can Glue Segments cause throttling, and in what kind of scenario? (I think this might become clear once I understand Glue Segments better.)

Related

Athena query timeout for bucket containing too many log entries

I am running a simple Athena query such as:
SELECT * FROM "logs"
WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')
BETWEEN parse_datetime('2021-12-01:00:00:00','yyyy-MM-dd:HH:mm:ss')
AND
parse_datetime('2021-12-21:19:00:00','yyyy-MM-dd:HH:mm:ss');
However, this times out because of the default 30-minute DML timeout.
The path I am querying contains a few million entries.
Is there a way to address this in Athena, or is there a better-suited alternative for this purpose?

This is normally solved with partitioning. For data that's organized by date, partition projection is the way to go (versus an explicit partition list that's updated manually or via Glue crawler).
That, of course, assumes that your data is organized by partition (e.g., s3://mybucket/2021/12/21/xxx.csv). If not, then I recommend changing your ingest process as a first step.
You may want to change your ingest process anyway: Athena isn't very good at dealing with a large number of small files. While the tuning guide doesn't give an optimal file size, I recommend at least a few tens of megabytes. If you're getting a steady stream of small files, use a scheduled Lambda to combine them into a single file. If you're using Firehose to aggregate files, increase the buffer sizes / time limits.
And while you're doing that, consider moving to a columnar format such as Parquet if you're not already using it.
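As a rough sketch of the partition projection suggestion above, the Python below builds the table properties for a date-based projection. The column name `dt`, the bucket `mybucket`, the start date, and both helper names are assumptions made for illustration; the property keys follow Athena's partition projection settings as I understand them, so check them against the Athena documentation before use.

```python
def date_projection_properties(column, bucket, start_date, date_format="yyyy/MM/dd"):
    """Table properties for projecting a date partition column (assumed layout)."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        # Range runs from the first day of data up to the current date.
        f"projection.{column}.range": f"{start_date},NOW",
        f"projection.{column}.format": date_format,
        # ${column} is substituted by Athena with each projected value.
        "storage.location.template": f"s3://{bucket}/${{{column}}}/",
    }


def to_tblproperties(props):
    """Render the properties as a TBLPROPERTIES clause for a CREATE TABLE statement."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES (\n  {pairs}\n)"


props = date_projection_properties("dt", "mybucket", "2021/01/01")
print(to_tblproperties(props))
```

With a layout like this, the query would need to filter on the partition column (here `dt`) so that Athena can skip the prefixes outside the requested date range instead of scanning every object.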

AWS Neptune Node counts timing out

We're running a large bulk load into AWS Neptune and can no longer query the graph to get node counts without the query timing out. What options do we have to ensure we can audit the total counts in the graph?
The query fails from both curl and a SageMaker notebook.

There are a few things you could consider.
The easiest is to just increase the timeout specified in the cluster and/or instance parameter group, so that the query can (hopefully) complete.
If your Neptune engine version is 1.0.5.x then you can use the DFE engine to improve Gremlin count performance. You just need to enable the DFE engine using DFEQueryEngine=viaQueryHint in the cluster parameter group.
If you get the status of the load, it will show you a value for the number of records processed so far. In this context a record is not a row from a CSV file or RDF format file. Instead, it is the count of triples loaded in the RDF case, and the count of property values and labels in the property graph case. As a simple example, imagine a CSV file with 100 rows where each row has 6 columns. Excluding the ID column, that leaves a label and 4 properties. The total number of records to load will be 100*5, i.e. 500. If you have sparse rows, the calculation will be approximate unless you add up every non-ID column that has an actual value.
If you have the Neptune streams feature enabled, you can inspect the stream and find the last vertex or edge created. Note that enabling streams just for this purpose may not be the ideal choice, because adding to the stream incurs some overhead and will slow down the load.
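The record-count arithmetic described above can be sketched in Python as follows. This is an illustration for a property-graph vertex CSV only, assuming the ID column is named `~id`; the function name `count_load_records` is hypothetical, and the result is an estimate to compare against the loader status, not a value produced by Neptune.

```python
import csv
import io


def count_load_records(csv_text, id_column="~id"):
    """Count non-empty cells outside the ID column (labels plus property values)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    for row in reader:
        total += sum(
            1
            for name, value in row.items()
            if name is not None and name != id_column and value not in (None, "")
        )
    return total


# 100 rows x 6 columns: an ID, a label, and 4 properties.
header = "~id,~label,name,age,city,score"
rows = [f"v{i},person,n{i},{i},c{i},{i * 2}" for i in range(100)]
dense = "\n".join([header] + rows)
print(count_load_records(dense))
```

For the dense file this matches the 100*5 calculation. For sparse rows, empty cells are skipped, which is the "add up every non-ID column with an actual value" case.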

AWS DynamoDB: What does the graph imply? What needs to be done? A few of my batch write (delete) requests failed

Can somebody tell me what needs to be done?
I'm facing a few issues when I have 1000+ events.
A few of them are not deleted after my process runs.
I'm doing a batch delete through BatchWriteItem.

Each partition on a DynamoDB table is subject to a hard limit of 1,000 write capacity units and 3,000 read capacity units. If your workload is unevenly distributed across partitions, or if the workload relies on short periods of time with high usage (a burst of read or write activity), the table might be throttled.
It seems you are using DynamoDB adaptive capacity, which automatically boosts throughput capacity for high-traffic partitions. However, each partition is still subject to the hard limit. This means that adaptive capacity can't solve larger issues with your table or partition design. To avoid hot partitions and throttling, optimize your table and partition structure.
https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/
One way to better distribute writes across a partition key space in Amazon DynamoDB is to expand the space. You can do this in several different ways. You can add a random number to the partition key values to distribute the items among partitions. Or you can use a number that is calculated based on something that you're querying on.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html
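The two sharding approaches mentioned above could look roughly like this in Python. The function names, the `.` separator, and the shard count of 10 are assumptions chosen for illustration; the hash is only there to turn an attribute into a stable shard number.

```python
import hashlib
import random


def random_sharded_key(partition_key, shard_count=10):
    """Append a random suffix so writes spread across shard_count key values."""
    return f"{partition_key}.{random.randrange(shard_count)}"


def calculated_sharded_key(partition_key, attribute, shard_count=10):
    """Derive the suffix from an attribute you query on, so it can be recomputed."""
    digest = hashlib.md5(str(attribute).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{partition_key}.{shard}"


def all_shard_keys(partition_key, shard_count=10):
    """Every key value to query when reading back items written with random suffixes."""
    return [f"{partition_key}.{n}" for n in range(shard_count)]
```

The trade-off is on the read side: with random suffixes a reader has to query every shard key and merge the results, while a calculated suffix lets the reader recompute the exact key for a given item.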

UNLOAD: what are slices in the cluster?

According to the AWS documentation for UNLOAD, the number of files written to S3 is the "number of slices in the cluster". Our cluster has 24 nodes.
What counts as a slice? Why are we getting 64 files on S3 (see screenshot) and not 24?
Most files are around 37 MB, but some are only ~300 B (they contain only the column headers). Why are these files added?

You can think of a slice as a partition on each node. By default each node has at least 2 slices, although this default varies by node size.
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
The files will vary in size depending on how the data is distributed across slices (this is decided by your distribution keys). If you're getting some files with little data, it means that your query's result set is retrieving minimal data from one slice while retrieving more from another.
If you have never configured a distribution style in a table's schema, it will be EVEN by default.
The links below go into further detail on these subjects:
Data warehouse system architecture
Amazon Redshift clusters - Node type details
Distribution styles

DynamoDB: filling an empty table with tons of data is capped at 1000 WCU

I'm writing a script that should fill a new table with data as quickly as possible (~650 GB table).
The partition (hash) key differs between all records, so I can't think of a better key.
I've set the provisioned WCU for this table to 4k.
When the script runs, 16 independent threads put different data into the table at a high rate. During execution, I receive ProvisionedThroughputExceededException. The CloudWatch graphs show that consumed WCU is capped at 1000 WCU.
This could happen if all the data is written to one partition.
As I understand it, DynamoDB creates a new partition when the data size exceeds the 10 GB limit. Is that so?
So, during this data fill operation, I have only 1 partition, and the limit of 1000 WCU is understandable.
I've checked https://aws.amazon.com/ru/premiumsupport/knowledge-center/dynamodb-table-throttled/
But it seems that these suggestions apply to already-filled tables to which you are adding a lot of new data.
So I have 3 questions:
1. How can I speed up the process of inserting data into a new, empty table?
2. When does DynamoDB decide to create a new partition?
3. Can I set a minimum number of partitions (e.g. 4) to use the full provisioned WCU (4k)?
UPD: CloudWatch graph:
UPD2: the HASH key is a long number. Actually, it's not strictly unique, but the maximum number of rows with the same HASH key and different RANGE keys is 2.

You can't manually specify the number of partitions DynamoDB uses. That is handled automatically behind the scenes.
However, the way it's handled is laid out in the link provided by F_SO_K:
1 partition for every 10 GB of data
1 partition for every 3000 RCU and/or 1000 WCU provisioned
If you've provisioned 4000 WCU, you should have at least 4 partitions and should be able to consume 4000 WCU. Especially since you said your hash key is unique for every record, your data should be spread uniformly and you should not be running into a "hot" partition.
You mentioned CloudWatch showing consumed WCU at 1000. Does CloudWatch also show provisioned capacity at 4000 WCU?
If so, I'm not sure what's going on; you may have to contact AWS.
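The rule of thumb above can be written as a small estimate in Python. This is only a sketch of that rule, assuming the size-based and throughput-based counts are combined by taking the larger one; the function name `estimate_partitions` is hypothetical, and the actual partition count is managed internally by DynamoDB and may differ.

```python
import math


def estimate_partitions(table_size_gb, rcu, wcu):
    """Rough partition estimate: 10 GB per partition, or 3000 RCU / 1000 WCU each."""
    by_size = math.ceil(table_size_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    # A table always has at least one partition.
    return max(by_size, by_throughput, 1)


print(estimate_partitions(0, 0, 4000))    # empty table with 4000 WCU provisioned
print(estimate_partitions(650, 0, 4000))  # same table once ~650 GB is loaded
```

Under these assumptions, the empty table in the question would start with about 4 partitions from its provisioned throughput, and size would only become the dominant factor as the load approaches the full 650 GB.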
