amazon emr - does creating a cluster use data transfer out? - amazon-web-services

I am using AWS EMR on EC2 and getting tons of charges for data transfer out, about 900 GB over a few days, but I don't send any data out.
The only thing I am doing is creating an EMR cluster and downloading data from S3 to it.
I found this page about costs, and it says data transfer out to the internet is not supposed to happen when you are not sending any data!
I keep seeing multiple charges for data transfer out to many AWS regions and to the internet, and I can't find any explanation for it. What can it be?

The most likely cause is that you're accessing an S3 bucket in a different region, either for your data or for writing EMR cluster logs.
There are a couple of ways to diagnose this. First, of course, is to look at your EMR cluster config.
Second is to enable VPC Flow Logs, which will tell you the exact source and destination of your traffic. They may, however, be of limited use: if you're running all traffic through a NAT, they'll just show the NAT, not the ultimate source/destination.
A third approach is to use a security group that prevents outbound connections, and look in your logs to see what fails.
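For the first approach, here is a minimal sketch with boto3; the region, cluster ID, and bucket name are placeholders. It compares the cluster's region against the regions of its log bucket and a data bucket, since any mismatch is a candidate source of cross-region transfer charges:
import boto3

CLUSTER_REGION = "eu-west-1"           # placeholder: region the cluster runs in
CLUSTER_ID = "j-XXXXXXXXXXXXX"         # placeholder: your EMR cluster ID
DATA_BUCKET = "my-data-bucket"         # placeholder: bucket your jobs read from

emr = boto3.client("emr", region_name=CLUSTER_REGION)
s3 = boto3.client("s3")

def bucket_region(bucket):
    # get_bucket_location returns None for buckets in us-east-1
    loc = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    return loc or "us-east-1"

cluster = emr.describe_cluster(ClusterId=CLUSTER_ID)["Cluster"]
log_uri = cluster.get("LogUri") or ""
if log_uri.startswith("s3"):
    log_bucket = log_uri.split("//", 1)[1].split("/", 1)[0]
    print("log bucket region: ", bucket_region(log_bucket))
print("data bucket region:", bucket_region(DATA_BUCKET))
print("cluster region:    ", CLUSTER_REGION)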

Related

Can Kafka Connect be made rack aware so that my connector reads all partitions from one broker?

We have a Kafka cluster in Amazon MSK that has 3 brokers in different availability zones of the same region. I want to set up a Kafka Connect connector that backs up all data from our Kafka brokers to Amazon S3, and I'm trying to do it with MSK Connect.
I set up Confluent's S3 Sink Connector on MSK Connect and it works - everything is uploaded to S3 as expected. The problem is that it costs a fortune in data transfer charges - our AWS bills for MSK nearly double whenever the connector is running, with EU-DataTransfer-Regional-Bytes accounting for the entire increase.
It seems that the connector is pulling messages from all three of our brokers, i.e. from three different AZs, and so we're getting billed for inter-AZ data transfer. This makes sense because by default it will read a partition from that partition's leader, which could be any of the three brokers. But if we were creating a normal consumer, not a connector, it would be possible to restrict the consumer to read all partitions from a specific broker:
"client.rack" : "euw1-az3"
☝️ For a consumer in the euw1-az3 AZ, this setting makes the consumer read all partitions from the local broker, regardless of the partitions' leader - which avoids the need for inter-AZ data transfer and brings the bills down massively.
My question is, is it possible to do something similar for a Kafka Connector? What config setting do I have to pass to the connector, or the worker, to make it only read from one specific broker/AZ? Is this possible with MSK Connect?
Maybe I am missing something about your question. I think you want to have a look at this:
https://docs.confluent.io/platform/current/tutorials/examples/multiregion/docs/multiregion.html
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
I thought this was general knowledge; it applies to any on-premises or cloud deployment.
AWS confirmed to me on a call with support that MSK Connect doesn't currently support rack awareness. I was able to solve my problem by deploying the connector in an EC2 instance (not on MSK Connect) with the connect worker config consumer.client.rack set to the same availability zone that the EC2 instance is running in.
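For reference, a minimal sketch of the rack-aware consumer setting described above, using the confluent-kafka Python client (librdkafka supports client.rack for follower fetching); the bootstrap servers, group ID, topic, and AZ ID are placeholders:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "b-1.mycluster.kafka.eu-west-1.amazonaws.com:9092",  # placeholder
    "group.id": "s3-backup",  # placeholder
    # Match the AZ ID of the machine this consumer runs in, so partitions are
    # fetched from the local replica instead of each partition's leader.
    "client.rack": "euw1-az3",
})
consumer.subscribe(["my-topic"])  # placeholder topic
On a self-managed Connect worker, the equivalent is achieved by prefixing the setting with consumer. in the worker properties file, as described above.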

How to track AWS data transfer charges?

I want help understanding the AWS Cost Explorer graph to track a huge data transfer usage.
I have noticed in the AWS account bills for January, February, and March (up to the current date) a huge data transfer charge as a bill line item (image attached: AWS Bill Line Item):
regional data transfer - in/out/between EC2 AZs or using elastic IPs or ELB
Further, I checked it in the AWS Cost Explorer reports by applying a group-by filter on region, and can see data transfer for each region, but also for "No Region". I am not able to understand this bar graph (please see the attached image and yellow graph, AWS Cost Explorer Reports Region Wise) with the label "No Region".
A good starting point would be to enable VPC Flow Logs. Flow logs record the source and destination of all traffic within your VPC; after you've analysed the logs, you should have a good indication of where to begin investigating.
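A hedged sketch of enabling flow logs with boto3, assuming an existing VPC and an S3 bucket for delivery (the region, VPC ID, and bucket ARN are placeholders):
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
    TrafficType="ALL",  # capture both accepted and rejected traffic
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket",  # placeholder bucket ARN
)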
Out of context but adding it here as it might help you: for some services such as S3, you can enable object-level logging to help you understand what is accessing your objects, which could help you further understand why you're paying for data transfers.
You can avoid paying for data transfer charges between AWS services by using VPC Endpoints. VPC endpoints allow you to connect directly to the service rather than over the internet, which will avoid incurring extra data charges. More on VPC Endpoints here.
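For example, a gateway endpoint for S3 keeps S3 traffic on the AWS network and off your NAT; a minimal boto3 sketch (the region, VPC ID, and route table ID are placeholders):
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",  # S3 and DynamoDB support gateway endpoints
    VpcId="vpc-0123456789abcdef0",  # placeholder
    ServiceName="com.amazonaws.eu-west-1.s3",  # must match the region
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
)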

How to set the low memory alarm for AWS EC2 instance, S3 bucket?

I am using AWS EC2 for deployment of our Dropwizard server code. We recently came across a case where the instance was stopped automatically. On investigation we found that its whole memory was consumed; what consumed it mostly were backup files and log files. We removed those, restarted the server, and it is working well.
To avoid such behavior in future we thought of making use of the CloudWatch alarms provided by AWS, but the alarm parameters for EC2 are more about disk throughput and networking, not disk space:
EC2 monitoring.
They suggest having the CloudWatch agent installed on the actual instance.
In RDS there is a memory alarm type that gets triggered if memory remains below some predefined criterion.
For an EC2 instance or an S3 bucket, do we have any CloudWatch alarm type, or any other tool, that will trigger an email notification when the instance is low on memory?
Note: S3 provides object-count alarms, but I couldn't find any specific to low memory.
Update:
Comment by Michael suggests that "There are two problems, here. EC2 instances do not stop when they run out of "memory" (nor storage, which is what you are actually describing). This does not happen. Also, bucket storage is unlimited."
There could be two possibilities: the instance may have stopped for other reasons, but when we investigated, the storage added to the instance (8 GB) had got full. For S3, as he suggested, there is no limit on how much can be stored. [Hence the CloudWatch dashboard may be showing object count, not actual storage consumed, but is there any way to get notifications when S3 holds a certain amount of files (in MBs or GBs)?]
As you mentioned, you need to put a monitoring script or the CloudWatch agent on the EC2 instance to export memory or disk usage and attach an alarm to it - see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html
S3 is about storing blob files; you don't need to care about memory usage in that service, as it's handled under the hood by AWS, which is why you don't have access to any memory-related metric.
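That said, CloudWatch does publish daily storage metrics for S3 (BucketSizeBytes and NumberOfObjects), so you can alarm on total bucket size even though there is no "memory" metric. A sketch with boto3, assuming an existing SNS topic for the notification email (the region, bucket, threshold, and topic ARN are placeholders):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
cloudwatch.put_metric_alarm(
    AlarmName="s3-bucket-size-over-5gb",
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket"},  # placeholder bucket
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    Statistic="Average",
    Period=86400,  # S3 storage metrics are reported once a day
    EvaluationPeriods=1,
    Threshold=5 * 1024 ** 3,  # 5 GB, expressed in bytes
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:my-topic"],  # placeholder SNS topic
)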
Depending on your EC2 instance setup you can configure these kinds of alarms.
Memory, disk space, and CPU utilization are OS-level metrics.
There are various tools to monitor them, for example Nagios.
Alternatively, you can set up your own custom monitoring via email if your instances are Unix/AMI:
set up cron jobs that execute monitoring scripts (disk-specific, CPU, etc.) and set up email notifications targeting your email addresses.
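A sketch of such a script in Python, publishing root-filesystem usage as a custom CloudWatch metric that an alarm (with an SNS email action) can then watch; the region, namespace, and instance ID are placeholders:
import shutil
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: this instance's ID

# Percentage of the root filesystem currently in use.
usage = shutil.disk_usage("/")
used_percent = 100.0 * usage.used / usage.total

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
cloudwatch.put_metric_data(
    Namespace="Custom/System",  # placeholder namespace
    MetricData=[{
        "MetricName": "DiskUsedPercent",
        "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
        "Value": used_percent,
        "Unit": "Percent",
    }],
)
Run it from cron (e.g. every five minutes) and attach a CloudWatch alarm with an SNS email subscription to the DiskUsedPercent metric.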

AWS Disaster recovery together with backup and storage

I have a hybrid AWS setup: an on-prem Hadoop cluster with replication enabled towards an AWS environment running a similar Hadoop cluster at low capacity for disaster recovery. This is an active-active disaster recovery setup in AWS. Is it still recommended to take backups of data that is stored on AWS?
Is it still recommended to take backups for data that is stored on AWS?
It's not clear which AWS services you're referring to.
Well, let's say you have an S3 bucket bound only to us-east-1 and that region becomes unavailable: you can't access your data. Therefore, it's encouraged to replicate to another region. That said, S3 advertises several nines of availability, and if an AWS service is down in a major region, a good portion of the internet is probably inaccessible, not only your data.

amazon web services - Durability

Can you let me know whether the AWS services below keep data in
multiple facilities? How many? Different Availability Zones?
S3, EBS, DynamoDB
I also want to know, in general, the distance between two AZs; I want to make sure that no single catastrophe can destroy a complete region.
To start, all of the questions asked above are easily answered in the AWS documentation.
What is a region and an Availability Zone?
Refer to this documentation:
Each region is a separate geographic area. Each region has multiple,
isolated locations known as Availability Zones.
"Also want to know in general what is the distance between two AZs?"
I don't think anyone outside Amazon knows the answer to that; Amazon does not publish that kind of information about its data centers.
Now, to start with S3, as per the AWS documentation:
Although, by default, Amazon S3 stores your data across multiple
geographically distant Availability Zones.
You can also enable cross-region replication, as per the AWS documentation, but it will incur extra cost:
Cross-region replication is a bucket-level configuration that enables
automatic, asynchronous copying of objects across buckets in different
AWS Regions.
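A hedged sketch of enabling it with boto3; it assumes versioning is already enabled on both buckets, and the bucket names and IAM role ARN (which must grant S3 replication permissions) are placeholders:
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="my-source-bucket",  # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder role
        "Rules": [{
            "Prefix": "",  # empty prefix = replicate every object
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},  # placeholder
        }],
    },
)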
Now for EBS, as per the AWS documentation:
Each Amazon EBS volume is automatically replicated within its
Availability Zone to protect you from component failure, offering high
availability and durability
Also, as per the documentation, you can create a point-in-time snapshot and make it available in another AWS region; all snapshots are backed by Amazon S3.
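A minimal sketch of copying a snapshot to another region with boto3 (the regions and snapshot ID are placeholders):
import boto3

# The client must run in the *destination* region; SourceRegion names the origin.
ec2 = boto3.client("ec2", region_name="eu-west-1")
ec2.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",  # placeholder
    Description="DR copy of web server volume",
)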
Now for DynamoDB, as per the AWS documentation:
DynamoDB stores data in partitions. A partition is an allocation of
storage for a table, backed by solid-state drives (SSDs) and
automatically replicated across multiple Availability Zones within an
AWS Region.
Now you can also make a table available across regions; for more details please refer to this AWS documentation.
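A hedged sketch using the original (2017) global tables API in boto3; it assumes identical tables with streams enabled already exist in both regions, and the table name and regions are placeholders:
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # placeholder region
dynamodb.create_global_table(
    GlobalTableName="my-table",  # placeholder table
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "eu-west-1"},
    ],
)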
Hope this clears your doubts!
By default all these services replicate data across different AZs (Availability Zones) within the same AWS region.
But AWS also provides mechanisms to replicate data across different regions (which you can choose), so that you can be more fault tolerant and offer lower latency to your users (you can serve them from servers in the same region).
However, keep in mind that replicating data across regions involves more cost.
You can read the AWS doc http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html to learn where all the AWS regions and AZs are located.
The whole idea of keeping different AZs and regions is to provide high availability, so you shouldn't worry about distance and availability if you have replication across multiple AZs or regions.
Edit: thanks to Michael for pointing out that EBS volumes are only replicated (mirrored) within the AZ where the volume is created.