UNLOAD: what are slices in the cluster?

According to the AWS documentation for UNLOAD, the number of files written to S3 is the "number of slices in the cluster". Our cluster has 24 nodes.
What counts as a slice? Why are we getting 64 files on S3 and not 24?
Most files are around 37 MB, but some are only ~300 B and contain just the column headers. Why are these files created?

You can think of a slice as a partition on each node. By default each node has at least 2 slices, but the exact number varies by node size.
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
The files will vary in size depending on how the data is distributed across nodes (this is decided by your distribution keys). If you're getting some files with very little data, it means that your query's result set retrieved minimal data from one slice while retrieving more from another.
If you have never configured the distribution style in a table's schema, it defaults to EVEN.
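By default UNLOAD writes one file per slice (the PARALLEL OFF and MAXFILESIZE options change this), so the total slice count explains the 64 files. If you want to confirm the slice count yourself, you can query the STV_SLICES system view. Below is a minimal sketch using the redshift_connector driver; the endpoint and credentials are placeholders.

```python
# Minimal sketch: count slices per node, and in total, via STV_SLICES.
# Endpoint and credentials are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

# One row per slice; grouping by node shows how many slices each node has.
cur.execute("SELECT node, COUNT(*) AS slices FROM stv_slices GROUP BY node ORDER BY node;")
for node, slices in cur.fetchall():
    print(f"node {node}: {slices} slices")

# Total slices = expected number of UNLOAD output files with the default settings.
cur.execute("SELECT COUNT(*) FROM stv_slices;")
print("total slices:", cur.fetchone()[0])
```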
The links below go into further detail on these subjects:
Data warehouse system architecture
Amazon Redshift clusters - Node type details
Distribution styles

Understanding Segments in AWS Glue and its usage on Presto/Trino

I read the docs for the AWS Glue GetPartitions API and noticed that it has a "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
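For context, this is roughly how I understand the Segment parameter would be used with boto3 to fetch partitions in parallel; the database and table names below are placeholders.

```python
# Rough sketch: fetch Glue partitions in parallel, one segment per thread.
# Database and table names are placeholders; error handling is omitted.
from concurrent.futures import ThreadPoolExecutor

import boto3

TOTAL_SEGMENTS = 10  # Glue caps TotalSegments at 10

def fetch_segment(segment_number: int) -> list:
    glue = boto3.client("glue")  # one client per thread
    paginator = glue.get_paginator("get_partitions")
    pages = paginator.paginate(
        DatabaseName="my_database",
        TableName="my_table",
        Segment={"SegmentNumber": segment_number, "TotalSegments": TOTAL_SEGMENTS},
    )
    partitions = []
    for page in pages:
        partitions.extend(page["Partitions"])
    return partitions

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    results = pool.map(fetch_segment, range(TOTAL_SEGMENTS))

all_partitions = [p for segment in results for p in segment]
print(len(all_partitions), "partitions fetched")
```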
It looks strikingly similar to Segments in DynamoDB, which are used by DynamoDB's parallel scan feature (as discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed is that segments in Glue are limited to 10, while they seem unlimited in DynamoDB. Therefore:
1. I would like to know what Glue Segments are, whether they are really the same thing, and whether the use cases for parallel scans in DynamoDB apply here too (e.g. over 20GB table size, read throughput not fully utilized).
Also, Presto/Trino has a Glue-segment-related parameter in config.properties, as seen in Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries kept using the same number of splits and the speed remained the same (I used the same number of nodes every time). Therefore:
2. How do Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks, and splits?
And in this article about how to use the AWS Glue Data Catalog as the metastore for Hive on AWS EMR, there is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in the hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How might Glue Segments cause throttling, and in what kind of scenario? (I think this one might become clear as my understanding of Glue Segments begins to form.)
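For reference, my understanding is that the hive-site override from that notice would be supplied as an EMR configuration classification when the cluster is created; a rough boto3 sketch, where the cluster details are placeholders:

```python
# Rough sketch: set aws.glue.partition.num.segments through the hive-site
# configuration classification when launching an EMR cluster. Cluster details
# are placeholders.
import boto3

configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            # Allowed range is 1-10; setting 1 effectively disables parallel
            # partition pruning if throttling occurs.
            "aws.glue.partition.num.segments": "10",
        },
    }
]

emr = boto3.client("emr")
# The list above would be passed along with the usual cluster settings, e.g.:
# emr.run_job_flow(Name="my-cluster", ReleaseLabel="emr-6.2.0",
#                  Configurations=configurations, ...)
```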

Query pre-created sub-folders in s3 using a single table schema in Athena

I am exploring AWS Athena to query files in s3. We have a separate service that writes data into s3 in the following structure:
data/
  log1/
  log2/
  log3/
All the files have the same schema.
Following is the schema of the files:
id (a random string id)
timestamp
value
However, we need to be able to query data in a single folder (log1, log2, etc.) as well as query all the data together.
One option is to create separate tables for these. However, the sub-folders log1, log2, etc. correspond to devices, and there could be hundreds or thousands of them. These names are dynamic and will be entered by the user when querying. Also, there are other query capabilities we need, such as querying data between two timestamps. Such queries will be fired at the /data folder level.
What would be a good way to structure the folders and the corresponding tables? I have read multiple questions that suggest partitioning, but for my use case, I don't really understand how to partition the data. I am extremely new to Athena and still learning. Any advice would be much appreciated.
Thank you in advance.
Partitioning has an impact on how much data is scanned by every query, and therefore improves performance and lowers cost. A good explanation can be found in AWS Partitioning Data:
You can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour. Another customer, who has data coming from many different sources but loaded one time per day, may partition by a data source identifier and date.
If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only from that partition.
There are also some good recommendations regarding the partitions in Top 10 Performance Tuning Tips for AWS Athena:
When deciding the columns on which to partition, consider the following:
Columns that are used as filters are good candidates for partitioning.
Partitioning has a cost. As the number of partitions in your table increases, the higher the overhead of retrieving and processing the partition metadata, and the smaller your files. Partitioning too finely can wipe out the initial benefit.
If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit.
Athena recently released a feature called Partition Projection, which might be helpful in your case:
In partition projection, partition values and locations are calculated from configuration rather than read from a repository like the AWS Glue Data Catalog. Because in-memory operations are often faster than remote operations, partition projection can reduce the runtime of queries against highly partitioned tables.
The Dynamic ID Partitioning approach in particular could be interesting in your case.
How to partition ultimately depends on the queries and how they are designed:
Most queries include a time frame? Then you should consider date as a partition
Most queries filter for a specific device (or a small number of ids)? Then it might be a better choice to use the device id as the partition, or at least try bucketing these. It also depends on the number of rows per device, so the partitions don't become too granular.
You can also partition by date and device id.
Since your data is already laid out by device, I would go with that as the partition to begin with and use partition projection to query it.
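As a rough sketch of what the injected-id projection could look like, the DDL below uses placeholder bucket names, paths, file format, and column types; it can be submitted from the Athena console or, as here, with boto3.

```python
# Rough sketch: an Athena table with partition projection, where device_id is an
# "injected" partition key mapped onto the existing data/<device> folders.
# Bucket name, paths, file format, and column types are placeholders.
import boto3

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  id          string,
  `timestamp` bigint,
  value       double
)
PARTITIONED BY (device_id string)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.device_id.type' = 'injected',
  'storage.location.template' = 's3://my-bucket/data/${device_id}/'
);
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# With an 'injected' projection, queries must pin device_id in the WHERE clause, e.g.:
#   SELECT * FROM logs WHERE device_id = 'log1' AND `timestamp` BETWEEN 1 AND 100;
```

Note that an injected partition key covers the "query one device" case; for the "query everything" case you may still need either enumerated partition values or a second, non-projected table over the same location.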

Estimating the size of a AWS Neptune graph database

I am currently building a graph using AWS Neptune. Is there a way of determining or calculating the size of a filled database with AWS Neptune?
There is an answer already in this post, but I'm posting one more with a bit more detail, as the previous answer does not mention whether the storage includes space used by replication, deleted data, etc.
As @Morinaga already pointed out, CloudWatch exposes the number of bytes used by actual data pages under AWS/Neptune -> By Cluster -> VolumeBytesUsed. This shows the exact storage that you get charged for. Internally Neptune uses distributed storage for the data, which includes multiple copies, some additional storage for metadata, etc. None of that impacts how you get billed, so it is not included in VolumeBytesUsed.
Neptune also supports copy-on-write, where you can create a cloned volume from another cluster. One thing to note with cloned volumes is that the new cluster only takes up space for pages that have diverged from the source. So when you plot the VolumeBytesUsed metric for a clone, you would see a much smaller number for the clone as long as the source cluster is still active and lying around. If you delete the source cluster, the space is then re-adjusted in the clones. Do make a note of this, to avoid any possible confusion later on.
The last thing to note is that Neptune, as of Sept 2020, does not do volume shrinking. VolumeBytesUsed is pretty much a high watermark of how many data pages were used; deleting a lot of data just clears the data in the data pages, it does not remove them from the volume. So if you create a cluster, add a bunch of data and then delete everything, your VolumeBytesUsed would still show the high watermark. When you insert new data, we reuse the available data pages first, so you don't end up paying for new data pages.
AWS CloudWatch can be used to figure out the exact size of your filled database.
Under Metrics you can select Neptune and search for the MetricName='VolumeBytesUsed'. This will show you the amount of data that has been uploaded to your database.
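If you prefer to pull the number programmatically rather than clicking through the console, a minimal boto3 sketch along these lines reads the same metric (the cluster identifier is a placeholder):

```python
# Minimal sketch: read VolumeBytesUsed for a Neptune cluster from CloudWatch.
# The cluster identifier is a placeholder.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Neptune",
    MetricName="VolumeBytesUsed",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-neptune-cluster"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Maximum"] / 1024 ** 3, 2), "GiB")
```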
It really depends on how much data you store in vertex and edge properties. Taylor's answer here explains more, as storage capacity is dynamically allocated in Amazon Neptune.

Amazon Redshift optimizer (?) and distribution styles

I was studying Amazon Redshift using the Sybex Official Study Guide. On page 173 there are a couple of sentences:
You can configure the distribution style of a table to give Amazon RS hints as to how the data should be partitioned to best meet your query patterns. When you run a query, the optimizer shifts the rows to the compute node as needed to perform any joins and aggregates.
That leads me to some questions:
1) What is the role of the "optimizer"? Is data re-arranged across compute nodes to boost performance for each new query?
2) If 1) is true and a completely different new query is performed, what happens to the old data on the compute nodes?
3) Can you explain the 3 distribution styles (EVEN, KEY, ALL) in more detail, particularly the KEY style?
Extra question:
1) Does the leader node hold records?
To clarify a few things:
The Distribution Key is not a hint -- data is actually distributed according to the key
When running a query, the data is not "shifted" -- rather, copies of data might be sent to other nodes so that data can be joined on a particular node, but the data does not then reside on the destination node
The optimizer doesn't actually "do" anything -- it just computes the process that the nodes will follow (Redshift apparently writes C programs that are sent to each node)
The only thing you really need to know about the Optimizer is:
Query optimizer
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
From Data Warehouse System Architecture:
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes.
The Leader Node does not contain any data (unless you launch a single-node cluster, in which case the same server is used as the Leader Node and Compute Node).
For information on Distribution Styles, refer to: Distribution Styles
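To make the three styles concrete, here is a rough sketch of the corresponding DDL; table and column names are purely illustrative, and any Redshift SQL client could submit it (redshift_connector is used here).

```python
# Rough sketch: the three distribution styles as table DDL. Names are illustrative.
import redshift_connector

ddl_statements = [
    # EVEN: rows are spread round-robin across slices, regardless of their content.
    "CREATE TABLE events_even (event_id BIGINT, user_id BIGINT, payload VARCHAR(256)) DISTSTYLE EVEN;",
    # KEY: rows with the same user_id land on the same slice, so joins on user_id
    # can be done locally without shuffling rows between nodes.
    "CREATE TABLE events_key (event_id BIGINT, user_id BIGINT, payload VARCHAR(256)) DISTSTYLE KEY DISTKEY (user_id);",
    # ALL: a full copy of the table is stored on every node; useful for small,
    # frequently joined dimension tables.
    "CREATE TABLE countries_all (country_code CHAR(2), country_name VARCHAR(64)) DISTSTYLE ALL;",
]

conn = redshift_connector.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()
for ddl in ddl_statements:
    cur.execute(ddl)
conn.commit()
```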
If you really want to learn about Redshift, then read the Redshift Database Developer Guide. If you are merely studying for the Solutions Architect exam, the above links will be sufficient for the level of Redshift knowledge required.

Comparison of (1, 2, 3 node) clusters in Redshift and CSV files

While comparing (1 node, 2 node, 3 node) clusters, how do I find out which cluster performs best?
In your point of view, which cluster is the best in performance?
How do I copy a CSV file from S3 into Redshift, and how can I tell where I have made a mistake?
The best way to determine the setup that provides the best performance is to perform your own tests. For example, try loading data and running queries with different numbers and sizes of nodes. This way, you will have results that are accurate based upon your own particular data and the way that it is stored in Amazon Redshift.
When loading data into Redshift, the load will run better when you have more nodes because it is performed in parallel across all nodes.
In general, more nodes will always give better performance because each node adds additional storage and compute. However, you will have to trade-off performance against cost.
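On the CSV part of the question: the COPY command is the usual way to load a CSV file from S3, and any rows that fail to load are recorded in the STL_LOAD_ERRORS system table, which tells you where the mistake was. A rough sketch, where the bucket, table, IAM role, endpoint, and credentials are placeholders:

```python
# Rough sketch: COPY a CSV file from S3 into Redshift, then check for load errors.
# Bucket, table name, IAM role ARN, endpoint, and credentials are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

cur.execute("""
    COPY my_table
    FROM 's3://my-bucket/data/file.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV
    IGNOREHEADER 1;
""")
conn.commit()

# STL_LOAD_ERRORS records the file, line number, column, and reason for rejected rows.
cur.execute("""
    SELECT starttime, filename, line_number, colname, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)
```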