MAX function in Redshift - amazon-web-services

In Snowflake, MAX is a metadata query, which means it does not use any compute; the result comes from the service-layer metadata store. Does it work in a similar way in Redshift, or does Redshift have to scan all the rows of the table (which may be distributed across nodes) to get the MAX value?
Thanks
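One way to check is to look at the query plan Redshift produces for a MAX query. A minimal sketch, assuming a hypothetical sales table with an amount column and placeholder connection details (Redshift accepts standard PostgreSQL connections):

import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Placeholder cluster endpoint and credentials
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="REPLACE_ME",
)

with conn.cursor() as cur:
    # The plan shows whether the MAX comes back without touching the
    # compute nodes or requires a sequential scan of the table.
    cur.execute("EXPLAIN SELECT MAX(amount) FROM sales;")
    for (line,) in cur.fetchall():
        print(line)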

Related

Understanding Segments in AWS Glue and their usage in Presto/Trino

I read the docs for the AWS Glue GetPartitions API and noticed that it has a "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
It looks strikingly similar to Segments in DynamoDB, which are used by DynamoDB's parallel scan feature (as discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed is that segments in Glue are limited to only 10, while they seem unlimited in DynamoDB. Therefore:
1. I would like to know what Glue Segments are, whether they are really the same thing, and whether the use cases for parallel scans in DynamoDB (e.g. table size over 20 GB, read throughput not fully utilized) apply here too.
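For reference, here is a rough sketch of how I read the Segment parameter: issue one GetPartitions call per segment, in parallel. The database and table names are made up, and this only reflects my understanding of the API:

import boto3
from concurrent.futures import ThreadPoolExecutor

glue = boto3.client("glue")
TOTAL_SEGMENTS = 10  # Glue caps TotalSegments at 10

def fetch_segment(segment_number):
    """Fetch all partitions belonging to one non-overlapping segment."""
    partitions = []
    paginator = glue.get_paginator("get_partitions")
    pages = paginator.paginate(
        DatabaseName="my_database",   # hypothetical
        TableName="my_table",         # hypothetical
        Segment={
            "SegmentNumber": segment_number,
            "TotalSegments": TOTAL_SEGMENTS,
        },
    )
    for page in pages:
        partitions.extend(page["Partitions"])
    return partitions

# One request stream per segment, run in parallel.
with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = list(pool.map(fetch_segment, range(TOTAL_SEGMENTS)))

all_partitions = [p for segment in results for p in segment]
print(f"Fetched {len(all_partitions)} partitions")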
Also, Presto/Trino has a Glue-segment-related parameter on config.properties as seen on Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries kept using the same number of splits and the speed remained the same (I used the same number of nodes every time). Therefore:
2. How do Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks, and splits?
And in this article about How to use AWS Glue Data Catalog as Metastore for Hive on AWS EMR, there is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when the AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in the hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How may Glue Segments cause throttling, and in what kind of scenario? (I think this one might become clear as my understanding of Glue Segments begins to form.)
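For completeness, my understanding is that the hive-site property from that notice is set through EMR's configuration classifications; a hedged sketch of just the configuration piece (everything else about the cluster is omitted):

import boto3

emr = boto3.client("emr")

# Only the configuration portion is shown; all other run_job_flow
# arguments (instances, roles, release label, and so on) would need to
# be supplied for a real cluster.
glue_pruning_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            # Property named in the EMR notice above. 5 is the default;
            # setting it to "1" turns off parallel partition pruning if
            # throttling occurs.
            "aws.glue.partition.num.segments": "5",
        },
    }
]

# Example usage (other arguments elided):
# emr.run_job_flow(Name="my-cluster", Configurations=glue_pruning_config, ...)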

AWS Neptune Node counts timing out

We're running a large bulk load into AWS Neptune and can no longer query the graph to get node counts without the query timing out. What options do we have to ensure we can audit the total counts in the graph?
It fails from both curl and a SageMaker notebook.
There are a few things you could consider.
The easiest is to just increase the timeout specified in the cluster and/or instance parameter group, so that the query can (hopefully) complete.
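If you go the parameter group route, the timeout can also be changed with the Neptune management API; a sketch, assuming the parameter is neptune_query_timeout (milliseconds) and using a placeholder parameter group name:

import boto3

neptune = boto3.client("neptune")

# Raise the query timeout to 10 minutes on a hypothetical cluster
# parameter group; the same change can be made in the console.
neptune.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-neptune-cluster-params",
    Parameters=[
        {
            "ParameterName": "neptune_query_timeout",
            "ParameterValue": "600000",
            "ApplyMethod": "immediate",
        }
    ],
)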
If your Neptune engine version is 1.0.5.x then you can use the DFE engine to improve Gremlin count performance. You just need to enable the DFE engine using DFEQueryEngine=viaQueryHint in the cluster parameter group.
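Once that parameter is in place, the hint goes on the individual query. A sketch with gremlinpython, using a placeholder endpoint; Neptune#useDFE and evaluationTimeout are the hint names as I understand them, so double-check them against the docs for your engine version:

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Placeholder endpoint; use your cluster's reader endpoint for counts.
endpoint = "wss://my-neptune-cluster.cluster-abc.us-east-1.neptune.amazonaws.com:8182/gremlin"

conn = DriverRemoteConnection(endpoint, "g")
g = traversal().withRemote(conn)

# Run this one query on the DFE engine (honoured when
# DFEQueryEngine=viaQueryHint is set) with a longer per-query timeout.
node_count = (
    g.with_("Neptune#useDFE", True)
     .with_("evaluationTimeout", 600000)  # milliseconds
     .V()
     .count()
     .next()
)
print(node_count)

conn.close()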
If you get the status of the load it will show you a value for the number of records processed so far. In this context a record is not a row from a CSV file or an RDF format file; instead it is the count of triples loaded in the RDF case, and the count of property values and labels in the property graph case. As a simple example, imagine a CSV file with 100 rows where each row has 6 columns. Not counting the ID column, that is one label and 4 properties per row, so the total number of records to load will be 100 * 5, i.e. 500. If you have sparse rows then the calculation will be approximate unless you add up every non-ID column with an actual value.
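The load status comes from the bulk loader's status endpoint; a minimal sketch with a placeholder endpoint and load id (field names may differ slightly by engine version, so check the loader docs):

import requests

NEPTUNE_ENDPOINT = "https://my-neptune-cluster.cluster-abc.us-east-1.neptune.amazonaws.com:8182"  # placeholder
LOAD_ID = "your-load-id"  # placeholder

# Ask the loader for the status of a running or finished job.
resp = requests.get(
    f"{NEPTUNE_ENDPOINT}/loader/{LOAD_ID}",
    params={"details": "true"},
    timeout=30,
)
resp.raise_for_status()

# The overallStatus block carries the record counters (records in the
# sense described above, not CSV rows).
print(resp.json()["payload"]["overallStatus"])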
If you have the Neptune Streams feature enabled you can inspect the stream and find the last vertex or edge created. Note that enabling streams just for this purpose may not be the ideal choice, as writing to the stream adds some overhead and will slow down the load.

Getting a "Disk Full" error from Redshift Spectrum

I am facing frequent Disk Full errors with Redshift Spectrum; as a result, I have to repeatedly scale up the cluster. It seems that the cache gets deleted when I do.
Ideally, I would like scaling up to preserve the cache, and I would like a way to know how much disk space a query will need.
Is there any documentation about Redshift Spectrum's caching, or does it use the same mechanism as Redshift?
EDIT: As requested by Jon Scott, I am updating my question
SELECT p.postcode,
       SUM(p.like_count),
       COUNT(l.id)
FROM post AS p
INNER JOIN likes AS l
    ON l.postcode = p.postcode
GROUP BY 1;
The total zipped data on S3 is about 1.8 TB. Athena took 10 minutes, scanned 700 GB, and told me "Query exhausted resources at this scale factor".
EDIT 2: I used a 16 TB SSD cluster.
You did not mention the size of the Redshift cluster you are using but the simple answer is to use a larger Redshift cluster (more nodes) or use a larger node type (more disk per node).
The issue is occurring because Redshift Spectrum is not able to push the full join execution down to the Spectrum layer. A majority of the data is being returned to the Redshift cluster simply to execute the join.
You could also restructure the query so that more work can be pushed down to Spectrum, in this case by doing the grouping and counting before joining. This will be most effective if the total number of rows output from each subquery is significantly fewer than the rows that would be returned for the join otherwise.
SELECT p.postcode
     , p.like_count
     , l.like_ids
FROM (-- Summarize post data
      SELECT p.postcode
           , SUM(p.like_count) AS like_count
      FROM post AS p
      GROUP BY 1
     ) AS p
INNER JOIN (-- Summarize likes data
      SELECT l.postcode
           , COUNT(l.id) AS like_ids
      FROM likes AS l
      GROUP BY 1
     ) AS l
-- Join pre-summarized data only
    ON l.postcode = p.postcode
;

Columnar database queries in Amazon Redshift

I'm learning Amazon Redshift. I have heard that it is a very powerful storage service in the cloud and works very fast on data where aggregate operations are required, because it stores data column-wise.
I am not able to find any example queries. Could someone share some examples of aggregate queries running on Amazon Redshift? Are they different from normal relational database queries?
You are correct -- Amazon Redshift is a columnar database. This means that data is stored on disk per column, making operations on a column very fast. For example, adding the Sales column for a particular value in the Country column only requires accessing two columns rather than all columns in a table.
Other benefits are that data in Redshift is compressed (which works well with the columnar concept, because each column uses its own compression method based on the data stored) and the fact that it is a clustered database, so compute and storage can be scaled by adding additional nodes.
Amazon Redshift presents itself as a PostgreSQL database, so you just use industry-standard SQL to query data. No changes to queries are required.
However, you can optimize Redshift by wisely choosing a Distribution Key for each table that determines how data is distributed amongst nodes, and carefully select the Sort Key, which determines how data is stored on each node. Put simply, data should be distributed by how you JOIN tables and should be sorted by what you use in WHERE statements.
As for sample queries... it totally depends upon your data! Queries look exactly the same as normal SQL.
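To make that concrete, here is a small sketch against a hypothetical sales table; the connection details are placeholders, and the DDL just illustrates the DISTKEY/SORTKEY advice above:

import psycopg2  # Redshift accepts standard PostgreSQL connections

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="REPLACE_ME",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Hypothetical table: distribute on the column you join on,
    # sort on the column you filter on in WHERE clauses.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id   BIGINT,
            country   VARCHAR(64),
            sale_date DATE,
            amount    DECIMAL(12, 2)
        )
        DISTKEY (country)
        SORTKEY (sale_date);
    """)

    # An aggregate query is plain SQL; the columnar layout means only
    # the country and amount columns are read from disk.
    cur.execute("""
        SELECT country, SUM(amount) AS total_sales
        FROM sales
        WHERE sale_date >= '2023-01-01'
        GROUP BY country
        ORDER BY total_sales DESC;
    """)
    for country, total_sales in cur.fetchall():
        print(country, total_sales)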

AWS Data Pipeline doesn't use DynamoDB's indexes

I have a data pipeline running every hour, running a HiveCopyActivity to select the past hour's data from DynamoDB into S3. The table I'm selecting from has hash key VisitorID and range key Timestamp, contains around 4 million rows, and is 7.5 GB in size. To reduce the time taken for the job, I created a global secondary index on Timestamp, but after monitoring CloudWatch it seems that HiveCopyActivity doesn't use the index. I've read through all the relevant AWS documentation but can't find any mention of indexes.
Is there a way to force data pipeline to use an index while filtering like this? If not, are there any alternative applications which could transfer hourly (or any other period) data from DynamoDB to S3?
The DynamoDB EMR Hive adapter currently doesn't support using indexes, unfortunately. You would need to write your own sweeper that scans the index and outputs it to S3. You can check out https://github.com/awslabs/dynamodb-import-export-tool for some basics on implementing the import/export pipe; the library is essentially a parallel scan framework for sweeping DDB tables.
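As a rough illustration of such a sweeper (not the tool's actual code): a boto3 parallel Scan over a hypothetical GSI named Timestamp-index, filtered to the last hour and written to S3 as JSON lines. The table, index, and bucket names, and the epoch-seconds Timestamp format, are all assumptions:

import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

TABLE = "Visits"             # hypothetical table name
INDEX = "Timestamp-index"    # hypothetical GSI on Timestamp
BUCKET = "my-export-bucket"  # hypothetical S3 bucket
TOTAL_SEGMENTS = 8           # one worker per segment

now = int(time.time())
one_hour_ago = now - 3600

def sweep_segment(segment):
    """Scan one segment of the GSI for the past hour and upload it to S3."""
    paginator = dynamodb.get_paginator("scan")
    pages = paginator.paginate(
        TableName=TABLE,
        IndexName=INDEX,
        Segment=segment,
        TotalSegments=TOTAL_SEGMENTS,
        FilterExpression="#ts BETWEEN :start AND :end",
        ExpressionAttributeNames={"#ts": "Timestamp"},
        ExpressionAttributeValues={
            ":start": {"N": str(one_hour_ago)},
            ":end": {"N": str(now)},
        },
    )
    lines = [json.dumps(item) for page in pages for item in page["Items"]]
    s3.put_object(
        Bucket=BUCKET,
        Key=f"visits/{one_hour_ago}/segment-{segment}.jsonl",
        Body="\n".join(lines).encode("utf-8"),
    )

# Sweep all segments in parallel.
with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    list(pool.map(sweep_segment, range(TOTAL_SEGMENTS)))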