AWS' EMR vs EC2 pricing confusion - amazon-web-services

https://aws.amazon.com/emr/pricing/
Can someone explain why the price for EMR and EC2 differs so much? We are considering whether to build our Spark cluster on EMR or to use Cloudera on EC2. Did I miss anything obvious? Thanks

What is EMR?
Amazon EMR provides a managed Hadoop framework that makes it easy,
fast, and cost-effective to process vast amounts of data across
dynamically scalable Amazon EC2 instances.
What is EC2?
Amazon EC2’s simple web service interface allows you to obtain and
configure capacity with minimal friction. It provides you with
complete control of your computing resources and lets you run on
Amazon’s proven computing environment.
Now, what is EMR pricing?
EMR pricing is essentially the price you pay for "Cluster Management" related computing.
In a big data setup, cluster management alone is not enough; you need "node computing" too, and that is where EC2 and its pricing come into the picture.
As explained in the EMR pricing documentation, you will be charged for both EMR computing and EC2 computing when you use EMR.
The Amazon EMR price is in addition to the Amazon EC2 price (the price
for the underlying servers). There are a variety of Amazon EC2 pricing
options you can choose from, including On-Demand (shown below), 1 year
& 3 year Reserved Instances, and Spot instances.
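To make the "in addition to" part concrete, here is a minimal sketch of how the two charges stack for a small cluster. The per-hour rates, node count, and hours are placeholders for illustration, not current AWS prices; check the EC2 and EMR pricing pages for your instance type and region.

```python
# Hypothetical per-hour rates for illustration only -- look up the real
# numbers on the EC2 and EMR pricing pages for your instance type and region.
EC2_RATE_PER_HOUR = 0.192   # underlying server price (placeholder)
EMR_RATE_PER_HOUR = 0.048   # EMR surcharge for the same instance (placeholder)

NODES = 5            # cluster size (placeholder)
HOURS = 24 * 30      # one month of continuous use

ec2_cost = EC2_RATE_PER_HOUR * NODES * HOURS
emr_cost = EMR_RATE_PER_HOUR * NODES * HOURS

print(f"EC2 (underlying servers):  ${ec2_cost:,.2f}")
print(f"EMR (cluster management):  ${emr_cost:,.2f}")
print(f"Total for the EMR cluster: ${ec2_cost + emr_cost:,.2f}")
```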
Why is the pricing different?
It depends on the type of service and hardware being used, and ultimately only the AWS team can answer that.

Related

Accessing instance storage in AWS SageMaker notebooks

I'm trying to train a model using AWS SageMaker notebooks and am disappointed with how slowly the model is training. I think my bottleneck lies with the IOPS speed to the persistent storage (EFS and EBS) my SageMaker notebooks are accessing for the dataset.
First, I tried training on a SageMaker Studio ml.g4dn.xlarge instance, then moved everything over to a SageMaker notebook ml.g4dn.xlarge instance through Jupyter. Even though g4dn.xlarge instances come with a physically attached 125 GB NVMe SSD, I'm unable to access it because SageMaker Studio automatically creates an EFS store, and SageMaker notebook instances automatically create an EBS store. How can I store my dataset on the 125 GB SSD instead of EFS or EBS to speed up the IOPS?
There are certainly instance types with storage and memory optimised for large amounts of data. In your case, if the dataset is fed to the model at exactly that size (i.e. there is no upstream preprocessing to reduce the amount of data), keep in mind that the g4dn family is EBS-optimised.
The most obvious answer I can think of is to use an S3 bucket.
From "Maximum transfer speed between Amazon EC2 and Amazon S3":
Traffic between Amazon EC2 and Amazon S3 can leverage up to 100 Gbps
of bandwidth to VPC endpoints and public IPs in the same region.
Besides being fast and performant, it is also the best solution in terms of design for all the components of your project on AWS. It does entail different costs and a different architecture, but you get the maximum speed the AWS services can offer (possibly with special configurations for even better performance).
My advice is to follow the AWS guidelines for developing a complex project from scratch: build, train, and deploy machine learning models.
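If you go the S3 route, a common pattern is to copy the dataset from S3 down to the notebook's local disk once before training, rather than reading it over the network on every epoch. Below is a minimal boto3 sketch; the bucket name, prefix, and local directory are placeholders for your own setup.

```python
import os
import boto3

# Placeholder names -- replace with your own bucket, prefix, and target directory.
BUCKET = "my-training-data-bucket"
PREFIX = "datasets/images/"
LOCAL_DIR = "/tmp/dataset"   # local disk on the notebook instance

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Download every object under the prefix, preserving the key layout locally.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "directory" placeholder keys
        target = os.path.join(LOCAL_DIR, os.path.relpath(key, PREFIX))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(BUCKET, key, target)
        print(f"downloaded s3://{BUCKET}/{key} -> {target}")
```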

Does the AWS ECS control plane, i.e. cluster, cost anything?

Looking at the EKS pricing page, it's very clear that the cluster, i.e. the control plane, as of today costs $0.10/hour. Quote from https://aws.amazon.com/eks/pricing/:
You pay $0.10 per hour for each Amazon EKS cluster that you create.
But looking at the ECS pricing page, https://aws.amazon.com/ecs/pricing/, I am not able to figure out the cluster (i.e. control plane) cost. So before creating an ECS cluster and leaving it there irrespective of usage, I want to know how I will be billed.
Please share your thoughts!
Also, my understanding is that for an EKS cluster I will be charged for the cluster irrespective of usage, i.e. whether the cluster is used for deployments/pods/services etc. or just left doing nothing. Please correct me if I'm wrong.
No, ECS does not have a control plane/cluster fee. You only pay for the EC2 or Fargate resources ECS runs your containers on.
Your understanding about the EKS cluster costs is correct.
Note: There are other fees an ECS cluster can generate, such as CloudWatch Logs and Metrics fees, but that's true for all the AWS compute services, including EKS, Elastic Beanstalk, Lambda, etc.

Amazon RDS instances and the new Compute Savings Plans

I have a small single-instance deployment running on an EC2 instance which hosts both a web application and its database (MySQL). I've been looking to separate the deployment out into an EC2 instance for the web app and an RDS cluster for the database, and wanted to take advantage of the new AWS Savings Plans for both if possible.
My questions then are:
AWS Savings Plans seem to only apply to 'pure' compute EC2 instances, not to RDS instances as well. Can someone confirm or disprove this?
If Savings Plans did apply to RDS instances, is there a reason to not use them, and instead just use an Instance Reservation?
Since August 2020, AWS Savings Plans include:
Amazon EC2
AWS Lambda
AWS Fargate
They do not apply to Amazon RDS db instances. For those, you can continue to use Amazon RDS Reserved Instances.
I want to clarify that even though Savings Plans do not cover RDS instances, they do cover EC2 instances that are part of EMR, ECS and EKS Clusters. Based on this link:
"Both plan types apply to EC2 instances that are a part of Amazon EMR, Amazon EKS, and Amazon ECS clusters. Amazon EKS charges will not be covered by Savings Plans, but the underlying EC2 instances will be. "
Compute Savings Plans also apply to your Fargate and Lambda usage.
We moved to RDS from EC2 instances running self-installed MySQL years ago. For me, it has been great. All of the RDS features work flawlessly, point and click, without the mundane work of spinning up, replicating, backing up, and failing over databases. It simply works great. Use Reserved Instances if you plan on keeping it for at least a year. At 30% savings the cost is a wash even if you bail on the server after about 9 months and don't use the entire year. Plus you can sell the unused remainder on the marketplace.
Downsides?
You do NOT get command-line OS access to the MySQL server. You get an admin login to MySQL. The only way to manage it is through the AWS UI, the mysql command-line client, or a management client (like MySQL Workbench or HeidiSQL).
You may want to run a mysqldump script on a separate EC2 instance to dump databases separately/additionally. AWS does snapshots, which require restoring an entire sandbox server just to recover a single table someone botched up, for example. I go to the mysqldump files all the time. I have never needed a snapshot unless I am spinning up a sandbox copy of the entire instance for some reason.
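As a rough illustration of that kind of nightly dump script driven from a separate EC2 instance, here is a minimal Python sketch. The endpoint, user, database names, and output path are placeholders, and it assumes the mysqldump client is installed and the password is supplied via ~/.my.cnf.

```python
import subprocess
from datetime import date

# Placeholder connection details -- use your own RDS endpoint and a backup user.
HOST = "mydb.abc123xyz.us-east-1.rds.amazonaws.com"
USER = "backup_user"
DATABASES = ["app_db", "reporting_db"]

for db in DATABASES:
    outfile = f"/backups/{db}-{date.today().isoformat()}.sql"
    # --single-transaction gives a consistent dump of InnoDB tables without
    # locking; the password comes from ~/.my.cnf, not the command line.
    with open(outfile, "w") as fh:
        subprocess.run(
            ["mysqldump", "--single-transaction", "-h", HOST, "-u", USER, db],
            stdout=fh,
            check=True,
        )
    print(f"dumped {db} to {outfile}")
```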
In a nutshell, MySQL on RDS is great.
One other side note: we migrated an app using MySQL 5.7 to Aurora MySQL with absolutely zero issues. Complete drop-in replacement (in our case).

Does running AWS Redshift, or taking/keeping a snapshot of it, run an EC2 instance internally?

I am running a few experiments on AWS Redshift in the free tier with a single-node dc2.large cluster. I take a snapshot and shut the cluster down at night, since I do not need it running then, and restore from that snapshot the next morning.
I can see my EC2 bill slowly rising with usage, but I could not find a single piece of documentation or a blog post explaining whether a running Redshift cluster uses an EC2 instance, or whether taking and keeping a snapshot of a Redshift cluster does the same.
Can anyone help me understand this behavior?
Your usage of Amazon Redshift (including running a cluster and creating/keeping snapshots) should not create a charge for Amazon EC2 resources.
It might generate traffic within the VPC, depending upon how you are connecting to it (e.g. cross-AZ traffic).
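As a side note, the shut-down-at-night / restore-in-the-morning cycle described in the question can be scripted with boto3. This is only a sketch under assumed names: the cluster and snapshot identifiers are placeholders, and it assumes any previous snapshot with the same identifier has been removed before the nightly delete.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # pick your region

CLUSTER_ID = "my-dc2-cluster"            # placeholder identifiers
SNAPSHOT_ID = "my-dc2-cluster-nightly"

def shutdown_for_the_night():
    # Deleting the cluster with a final snapshot stops the hourly node charge;
    # only snapshot storage remains billed.
    redshift.delete_cluster(
        ClusterIdentifier=CLUSTER_ID,
        SkipFinalClusterSnapshot=False,
        FinalClusterSnapshotIdentifier=SNAPSHOT_ID,
    )

def restore_in_the_morning():
    # Recreates the cluster from last night's snapshot.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=CLUSTER_ID,
        SnapshotIdentifier=SNAPSHOT_ID,
    )
```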

How to get usage details of AWS EBS service?

My team got a bill for AWS EBS and we have no idea what it is about.
We went to our console and tried to open an EBS section, but failed to find one, as shown in the screenshots below.
So my question is: how do we get a detailed breakdown of AWS EBS usage billing?
P.S.
The EBS usage cost is shown under the EC2 section of the Billing tab in the Management Console.
EBS is under EC2 in the AWS Management Console. If you look at the size of your volumes, the type of volumes, and also the size of your snapshots, that should help identify the cost.
E.g. faster volume types, provisioned IOPS, and large volumes/snapshots are more expensive.
You can put this information into the AWS Simple Monthly Calculator to work out the cost: https://calculator.s3.amazonaws.com/index.html
You can also look at the usage in CloudWatch; however, it's under EBS there and shows per-volume statistics.
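If you prefer to pull those volume and snapshot sizes programmatically rather than clicking through the console, a quick boto3 sketch along these lines lists them (the region is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # pick your region

# Volumes: size, type, and provisioned IOPS drive most of the EBS charge.
for vol in ec2.describe_volumes()["Volumes"]:
    print(f"volume {vol['VolumeId']}: {vol['Size']} GiB, "
          f"type={vol['VolumeType']}, iops={vol.get('Iops', 'n/a')}")

# Snapshots owned by this account also accrue storage cost.
for snap in ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]:
    print(f"snapshot {snap['SnapshotId']}: {snap['VolumeSize']} GiB, "
          f"started {snap['StartTime']:%Y-%m-%d}")
```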
For AWS billing you can use Cost Explorer to generate useful reports.
Billing break-down:
In AWS, multiple offerings are grouped under one service. For example, EBS, ELB, and EIP are all grouped under EC2.
Likewise, when they provide a billing breakdown, they provide it per service, then per region, and then per offering.
If you want a better breakdown, the best way is to use cost allocation tags and Cost Explorer.
For example, in your case you could tag resources with service_tag = ebs and application_tag = app1.
Then, using Cost Explorer, filter by application_tag, region, etc., and group by service_tag. By trying multiple combinations you can get a clear view of your cost and usage.
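The same filter/group idea can also be expressed against the Cost Explorer API. A minimal boto3 sketch is below; the tag keys, tag values, and date range are placeholders and assume the tags have already been activated as cost allocation tags.

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer API lives in us-east-1

# Placeholder tag keys/values and dates -- adjust to your own tagging scheme.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Tags": {"Key": "application_tag", "Values": ["app1"]}},
    GroupBy=[{"Type": "TAG", "Key": "service_tag"}],
)

# One group per service_tag value, with its unblended cost for the period.
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"{group['Keys'][0]}: ${float(amount):.2f}")
```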