How does AWS charge for a Redshift Spectrum cluster?

The AWS documentation on Redshift Spectrum pricing says that we pay only for the TB scanned. However, I still need to create a Redshift cluster and specify the instance type as well as the number of nodes in the cluster. So my question is: does AWS charge for the created cluster? I'd assume that a cluster with 2 dc2.8xlarge nodes would be more expensive than a cluster with 2 dc2.large nodes, even for a Redshift Spectrum cluster, but I can't find any documentation that explicitly discusses this.

Redshift Spectrum is a querying engine service offered by AWS allowing customers to use only the computing capability of Redshift clusters on data available in S3 in different formats. This feature enables customers to add external tables to Redshift clusters and run complex read queries over them without actually loading or copying data to Redshift.
Since Redshift Spectrum is a built-in feature of Amazon Redshift, you need a Redshift cluster.
To your question: the answer is yes. AWS charges for the created cluster, even when you only use it for Redshift Spectrum. The Spectrum charge per TB scanned comes on top of the cluster charge.
Here is the pricing for the two node types you mention:
Node type    vCPU  Memory   Storage      I/O        Price
dc2.8xlarge  32    244 GiB  2.56 TB SSD  7.50 GB/s  $4.80 per hour
dc2.large    2     15 GiB   0.16 TB SSD  0.60 GB/s  $0.25 per hour
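To make the two-part charge concrete, here is a rough sketch in Python that adds the cluster's hourly node charges to the Spectrum scan charges. The node prices are the ones quoted above; the per-TB scan rate, the 730-hour month, and the function name are assumptions for illustration, so check the current pricing page for your region.

```python
# Hourly on-demand prices from the rows quoted above (USD per node-hour).
NODE_PRICE_PER_HOUR = {
    "dc2.8xlarge": 4.80,
    "dc2.large": 0.25,
}


def monthly_cost(node_type, node_count, tb_scanned,
                 spectrum_price_per_tb=5.00, hours=730):
    """Estimate one month of cluster charges plus Spectrum scan charges.

    spectrum_price_per_tb and hours are assumed defaults, not values
    taken from the post.
    """
    cluster = NODE_PRICE_PER_HOUR[node_type] * node_count * hours
    spectrum = tb_scanned * spectrum_price_per_tb
    return cluster + spectrum


if __name__ == "__main__":
    # Same workload (10 TB scanned), two nodes, different node types.
    for node_type in NODE_PRICE_PER_HOUR:
        print(node_type, round(monthly_cost(node_type, 2, tb_scanned=10), 2))
```

Under these assumptions the scan charge is identical for both clusters, and the difference in the bill comes entirely from the node type.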

Related

Run multiple Data Fusion replication jobs on one Dataproc cluster

I am currently evaluating the GCP Data Fusion replication feature to ingest an initial snapshot followed by CDC.
The plan is to create one replication job per table, because adding a new table is not supported once a replication job has been created. I tried to add a table by deleting and recreating the replication job with the same name, but that triggers the initial snapshot load for the other tables as well.
To work around both limitations, I am planning to create one replication job per table. However, every replication job creates its own Dataproc cluster, which will incur more costs.
Is it possible to run all replication jobs on one Dataproc autoscaling cluster?
Note: The instance type is Basic.
Yes, reusing the cluster is possible: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1683390465/Reusing+Dataproc+clusters
This will mark your already provisioned cluster as reusable at the end of the job, and will just save you the provisioning time of the Dataproc cluster (approx. 90 - 150 sec) for every run.
Not sure if multiple Data Fusion jobs can be submitted to the same cluster in parallel, which is what I am looking for :)

Are Amazon Aurora snapshot backups full or incremental?

For RDS, the first snapshot is a full backup and subsequent snapshots are incremental. This is stated in the following document.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html
The first snapshot of a DB instance contains the data for the full DB instance. Subsequent snapshots of the same DB instance are incremental, which means that only the data that has changed after your most recent snapshot is saved.
I'd like to know whether an Aurora snapshot is a full backup or a differential one.
Does anyone have any information on this?
I've checked the following passage in the manual, but I can't tell from this text whether it applies to Aurora snapshots.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Backups.html
Aurora backs up your cluster volume automatically and retains restore data for the length of the backup retention period. Aurora backups are continuous and incremental so you can quickly restore to any point within the backup retention period.
I've also checked the AWS re:Invent 2019 material below. My understanding is that a full image snapshot is taken of each segment (per 10 GB protection group). Is this right?
https://youtu.be/Ul-j5fKfv2k?t=1095
AWS re:Invent 2019: [REPEAT 1] Deep dive on Amazon Aurora with PostgreSQL compatibility (DAT328-R1)
AWS always works with incremental snapshots. Even an EBS volume snapshot is incremental.
Here is the link to the AWS documentation. Search for the word "incremental" on this page:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Backups.html
Deepak is correct; look at what AWS says in the documentation:
Backups
Aurora backs up your cluster volume automatically and retains restore
data for the length of the backup retention period. Aurora backups
are continuous and incremental so you can quickly restore to any
point within the backup retention period. No performance impact or
interruption of database service occurs as backup data is being
written. You can specify a backup retention period, from 1 to 35 days,
when you create or modify a DB cluster.
If you want to retain a backup beyond the backup retention period, you
can also take a snapshot of the data in your cluster volume. Because
Aurora retains incremental restore data for the entire backup
retention period, you only need to create a snapshot for data that you
want to retain beyond the backup retention period. You can create a
new DB cluster from the snapshot.
Note For Amazon Aurora DB clusters, the default backup retention
period is one day regardless of how the DB cluster is created.
You cannot disable automated backups on Aurora. The backup retention
period for Aurora is managed by the DB cluster.
Your costs for backup storage depend upon the amount of Aurora backup
and snapshot data you keep and how long you keep it. For information
about the storage associated with Aurora backups and snapshots, see
Understanding Aurora Backup Storage Usage. For pricing information
about Aurora backup storage, see Amazon RDS for Aurora Pricing. After
the Aurora cluster associated with a snapshot is deleted, storing that
snapshot incurs the standard backup storage charges for Aurora.
Aurora manual snapshots are technically incremental (only technically). That is why they can be generated so quickly. BUT, they are billed as "full backups".
So if you snapshot your database every day for 30 days, and the database is on average 10 GB in size, you will be billed for 30 x 10 GB = 300 GB of storage, even if the difference between consecutive snapshots is tiny.
So even if AWS was only using about 12 GB to store those backups, they will bill you for 300 GB.
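The arithmetic in this answer can be sketched in Python as below. This follows the billing model as described in the answer, not an official formula; the function names and the daily change size are hypothetical and only there to show the gap between billed and physically stored data.

```python
def billed_snapshot_gb(snapshot_count, db_size_gb):
    """Storage billed if every manual snapshot counts as a full backup."""
    return snapshot_count * db_size_gb


def stored_snapshot_gb(snapshot_count, db_size_gb, change_per_snapshot_gb):
    """Rough physical footprint: one full copy plus incremental changes.

    change_per_snapshot_gb is a hypothetical figure chosen by the caller.
    """
    if snapshot_count == 0:
        return 0
    return db_size_gb + (snapshot_count - 1) * change_per_snapshot_gb


if __name__ == "__main__":
    # 30 daily snapshots of a database that averages 10 GB.
    print("billed GB:", billed_snapshot_gb(30, 10))
    print("stored GB (assuming no changes):", stored_snapshot_gb(30, 10, 0))
```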

Redshift Pause and Resume

With Redshift's Pause and Resume feature, do we pay for both the storage used on the cluster and for the snapshots? If yes to the storage on the cluster, then how do we calculate the storage cost because today we pay combined for both the computing and the storage?
Thanks!
From Overview of managing clusters in Amazon Redshift - Amazon Redshift:
When you pause a cluster, Amazon Redshift creates a snapshot
...
When you pause a cluster, billing is suspended.
From Amazon Redshift Pricing - Cloud Data Warehouse - Amazon Web Services:
The pause and resume feature allows you to suspend on-demand billing during the time the cluster is paused. During the time that a cluster is paused you only pay for backup storage.
Therefore, it seems that you would only be charged for the snapshot.
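As a rough illustration of the quoted pricing text, the sketch below computes the on-demand charge for a month in which the cluster is paused part of the time. The rate, node count, and hours are hypothetical, and backup storage is deliberately left out because the quotes above do not give a rate for it.

```python
def on_demand_charge(hourly_rate, node_count, total_hours, paused_hours):
    """On-demand compute charge when billing is suspended while paused.

    Backup (snapshot) storage is billed separately and is not included.
    """
    if not 0 <= paused_hours <= total_hours:
        raise ValueError("paused_hours must be between 0 and total_hours")
    running_hours = total_hours - paused_hours
    return hourly_rate * node_count * running_hours


if __name__ == "__main__":
    # Hypothetical: 2 nodes at $0.25 per hour, paused for 500 of 730 hours.
    print(on_demand_charge(0.25, 2, total_hours=730, paused_hours=500))
```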

AWS' EMR vs EC2 pricing confusion

https://aws.amazon.com/emr/pricing/
Can someone explain why the prices for EMR and EC2 differ so much? We are considering whether to build our Spark cluster on EMR or to use Cloudera on EC2. Did I miss anything obvious? Thanks
What is EMR?
Amazon EMR provides a managed Hadoop framework that makes it easy,
fast, and cost-effective to process vast amounts of data across
dynamically scalable Amazon EC2 instances.
What is EC2?
Amazon EC2’s simple web service interface allows you to obtain and
configure capacity with minimal friction. It provides you with
complete control of your computing resources and lets you run on
Amazon’s proven computing environment.
Now, what is EMR pricing?
EMR pricing is essentially the price you pay for "cluster management" related computing.
In a big data setup, cluster management alone is not enough; you need "node computing" too, and that is where EC2 and its pricing come into the picture.
As explained in the EMR pricing documentation, you will be charged for both EMR computing and EC2 computing when you use EMR.
The Amazon EMR price is in addition to the Amazon EC2 price (the price
for the underlying servers). There are a variety of Amazon EC2 pricing
options you can choose from, including On-Demand (shown below), 1 year
& 3 year Reserved Instances, and Spot instances.
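In other words, the two prices on the EMR pricing page are added together per instance. A minimal sketch in Python, using made-up hourly rates (the real ones depend on instance type and region):

```python
def emr_cluster_hourly_cost(ec2_price, emr_price, node_count):
    """Hourly cost of an EMR cluster: EC2 price plus EMR price per node."""
    return (ec2_price + emr_price) * node_count


def ec2_only_hourly_cost(ec2_price, node_count):
    """Hourly cost of the same nodes on plain EC2 (software not included)."""
    return ec2_price * node_count


if __name__ == "__main__":
    # Hypothetical rates: $0.20/hour for EC2 and $0.05/hour for EMR per node.
    print("EMR cluster:", emr_cluster_hourly_cost(0.20, 0.05, node_count=5))
    print("EC2 only:   ", ec2_only_hourly_cost(0.20, node_count=5))
```

The EC2-only figure does not include any license or operations cost for a self-managed distribution, so it is not a like-for-like comparison on its own.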
Why is the pricing different?
It depends on the type of service and hardware being used, and ultimately only the AWS team can answer that.

Cost of storing AMI

I understand Amazon charges per GB of provisioned EBS storage. If I create an AMI of my instance, does this mean my EBS volume will be duplicated, and hence incur additional cost?
Are there other charges for creating and storing an AMI (Amazon Machine Image)?
You are only charged for the storage of the bits that make up your AMI, there are no charges for creating an AMI.
EBS-backed AMIs are made up of snapshots of the EBS volumes that form the AMI. You will pay storage fees for those snapshots according to the rates listed here. Your EBS volumes are not "duplicated" until the instance is launched, at which point a volume is created from the stored snapshots and you'll pay regular EBS volume fees and EBS snapshot billing.
S3-backed AMIs have their information stored in S3 and you will pay storage fees for the data being stored in S3 according to the S3 pricing, whether the instance is running or not.
In this case, you will pay for the size of the storage used, instead of the storage provisioned. Snapshots will not store any empty blocks.
In short, yes, you will incur additional charges, but at a lower rate, namely the EBS snapshot storage rate. Provisioned EBS is the 'live' disk, charged at $0.10 per GB per month for standard SSD (gp2; US East pricing for 2022 is used throughout). If you provision 50 GB, you are charged for the full 50 GB, even if you are only using 5% of it, and even if the volume is not attached to an EC2 instance. That is $5 per month in this case.
When you create an AMI, AWS will create a snapshot in the background. This snapshot is viewable under EBS Snapshots and will not be deletable as long as that AMI is in existence. You will get an error if you try to delete this snapshot. Snapshots cost less than 'provisioned' EBS at $0.05 per GB per month, and since snapshots ignore empty blocks, it will be shrunk to used size, so if you are only using 5% of 50GB, the snapshot should only be around 2.5 GB. $0.13 per month in this case. No other charges.
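The two figures above can be reproduced with a few lines of Python. The rates are the ones quoted in this answer (gp2 and snapshot storage, US East, 2022); the function names are only for illustration, and the snapshot size is approximated as the used fraction of the volume.

```python
GP2_PRICE_PER_GB_MONTH = 0.10       # provisioned gp2 volume, from the answer
SNAPSHOT_PRICE_PER_GB_MONTH = 0.05  # EBS snapshot storage, from the answer


def provisioned_volume_cost(provisioned_gb):
    """Monthly cost of a volume: billed on provisioned size, not used size."""
    return provisioned_gb * GP2_PRICE_PER_GB_MONTH


def ami_snapshot_cost(provisioned_gb, used_fraction):
    """Monthly cost of the AMI's snapshot: billed on used blocks only."""
    used_gb = provisioned_gb * used_fraction
    return used_gb * SNAPSHOT_PRICE_PER_GB_MONTH


if __name__ == "__main__":
    print("volume:  ", round(provisioned_volume_cost(50), 2))
    print("snapshot:", round(ami_snapshot_cost(50, 0.05), 3))
```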
If you are creating a lot of these, it can get expensive very quickly, so some people store these AMIs in S3, which is cheaper than EBS snapshots. This is somewhat advanced and, as far as I know, it can only be done via the AWS CLI, not in the console. You use a command called aws ec2 create-store-image-task, specify the destination bucket name, and make sure the permissions for S3, EBS and EC2 all allow it. More detail is in the official AWS documentation. This would reduce the cost to about $0.023 per GB per month. There are other charges relating to this method, e.g. for the EBS Direct API, but they are small and you can look them up in the documentation.
In November 2021, AWS released an archive tier for EBS snapshots, which lets you archive snapshots for a minimum of 90 days at $0.0125 per GB per month. You do have to pay $0.03 per GB to restore the data. However, this is designed for EBS backups (e.g. daily backups using snapshots), and you cannot archive an EBS snapshot that is associated with an AMI. You will get an error: Failed to archive snapshot... snap-xyz is in use by ami-123.
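To compare the storage options mentioned in this answer side by side, here is a small Python sketch. The per-GB rates are the ones quoted above; the tier names, the 100 GB size, and the 12-month period are hypothetical, and the 90-day minimum and any API request charges are ignored. Keep in mind that the archive tier only applies to snapshots that are not tied to an AMI.

```python
# USD per GB per month, as quoted in the answer above.
PRICE_PER_GB_MONTH = {
    "ebs_snapshot": 0.05,
    "ami_stored_in_s3": 0.023,
    "snapshot_archive": 0.0125,
}
ARCHIVE_RESTORE_PRICE_PER_GB = 0.03  # one-time charge per restore


def storage_cost(tier, size_gb, months):
    """Storage-only cost for keeping size_gb in the given tier."""
    return PRICE_PER_GB_MONTH[tier] * size_gb * months


def archive_cost_with_restores(size_gb, months, restores=1):
    """Archive storage cost plus the per-GB charge for each restore."""
    restore = ARCHIVE_RESTORE_PRICE_PER_GB * size_gb * restores
    return storage_cost("snapshot_archive", size_gb, months) + restore


if __name__ == "__main__":
    for tier in PRICE_PER_GB_MONTH:
        print(tier, round(storage_cost(tier, 100, 12), 2))
    print("archive + 1 restore", round(archive_cost_with_restores(100, 12), 2))
```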
Below is an excerpt of an actual AWS bill that will explain it in a visual sense.