I use Apache Spark and Redshift inside a VPC, and I also use AWS S3 for the source data and the temporary data needed by the Redshift COPY.
I suspect that the read/write performance from/to AWS S3 is not good enough, so, based on the suggestion in the discussion at https://github.com/databricks/spark-redshift/issues/318, I created an S3 endpoint within the VPC. However, I can't see any performance difference before and after the S3 endpoint creation when I load data from S3.
In Apache Spark I read data in the following way:
spark.read.csv("s3://example-dev-data/dictionary/file.csv")
Do I need to add some extra logic or configuration to Apache Spark on AWS EMR in order to make proper use of the AWS S3 endpoint?
The S3 VPC endpoint is a gateway endpoint, so you have to add an entry to the route tables of the subnets where you launch your EMR clusters that routes S3 traffic to the endpoint. No Spark-side configuration is needed on EMR; once the route is in place, requests to S3 buckets in the same region go through the endpoint transparently.
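If you prefer to set this up programmatically rather than in the console, a minimal sketch with the AWS SDK for Java v2 from Scala could look like the following; the region, VPC ID and route table IDs are placeholders you would replace with your own:

import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.ec2.Ec2Client
import software.amazon.awssdk.services.ec2.model.{CreateVpcEndpointRequest, VpcEndpointType}

val ec2 = Ec2Client.builder().region(Region.US_EAST_1).build()

// Gateway endpoint for S3; associating the route tables here is what adds the
// route entries for the subnets that use those tables (e.g. your EMR subnets).
val request = CreateVpcEndpointRequest.builder()
  .vpcEndpointType(VpcEndpointType.GATEWAY)
  .vpcId("vpc-xxxxxxxx")                          // placeholder
  .serviceName("com.amazonaws.us-east-1.s3")      // must match the cluster's region
  .routeTableIds("rtb-aaaaaaaa", "rtb-bbbbbbbb")  // route tables of the EMR subnets
  .build()

println(ec2.createVpcEndpoint(request).vpcEndpoint().vpcEndpointId())

After that, a quick sanity check is to confirm in the route tables that a route with a vpce-... target and the S3 prefix list has appeared.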
I would like to know if it is possible to configure an S3 endpoint for AWS Athena, as opposed to the VPC endpoint. I have looked everywhere in the documentation and could not find it. Is this even possible?
The idea is to use the endpoint to get to S3 for all the Athena queries.
Thanks and best regards
Krishna
An --endpoint-url is normally used to override how the AWS CLI accesses an AWS service.
I see it used when people use an S3-compatible service such as Wasabi, where they are pointing to a different service rather than the 'real' S3.
Amazon Athena knows how to connect directly to Amazon S3. It is not possible to override the S3 Endpoint when Athena connects to S3.
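For comparison, here is what that kind of endpoint override looks like when you do control the client; a minimal sketch with the AWS SDK for Java v2 from Scala, where the Wasabi URL is only an illustrative assumption (Athena exposes no equivalent setting):

import java.net.URI
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client

// Point the S3 client at an S3-compatible service instead of the 'real' S3.
val s3 = S3Client.builder()
  .region(Region.US_EAST_1)
  .endpointOverride(URI.create("https://s3.wasabisys.com"))  // assumed provider URL
  .build()

s3.listBuckets().buckets().forEach(b => println(b.name()))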
I have a Lambda function accessing an S3 bucket using the aws-sdk.
There is a high number of operations (requests) against the S3 bucket, which is considerably increasing the cost of running the Lambda function.
I was hoping that the requests would use the s3:// protocol, but they are going over the internet.
I understand that one solution could be:
• Attach the Lambda to a VPC
• Create a VPC endpoint to S3
• Update the route tables of the VPC
Is there a simpler way to do so?
An alternative could be creating an API Gateway and a Lambda proxy method integration, following the AWS Guide or Tutorial.
You can then configure your API Gateway to act as your external-facing integration over the internet, while your Lambda/S3 traffic stays within AWS.
The traffic won't go over the internet or incur additional data-transfer cost as long as the non-VPC Lambda function executes in the same region as the S3 bucket, so a VPC is not needed in this case.
https://aws.amazon.com/s3/pricing/
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket.
• Data transferred out to Amazon CloudFront (CloudFront).
You can think of Lambda like EC2 here: the data transfer is free, but be careful, you still need to pay for the API requests themselves.
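To make the request-charge part concrete, here is a rough back-of-the-envelope calculation in Scala; the per-request price is an illustrative assumption, so check the current S3 pricing page for your region:

// Illustrative figures only; real prices vary by region and over time.
val getPricePer1000 = 0.0004      // assumed USD per 1,000 GET requests
val requestsPerDay  = 10000000L   // example workload: 10 million GETs per day

val requestCostPerDay = requestsPerDay / 1000.0 * getPricePer1000  // = 4.0 USD/day here
val dataTransferCost  = 0.0       // same-region access from Lambda, per the pricing quote above

println(f"GET request cost per day: $$$requestCostPerDay%.2f")

So even when the data transfer itself is free, a chatty function that makes millions of small GETs still shows up on the bill.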
I have Spark jobs running on an EKS cluster to ingest AWS logs from S3 buckets.
Now I have to ingest logs from another AWS account. I have managed to use the settings below to successfully read data from the other account with the Hadoop AssumedRoleCredentialProvider.
But how do I save the DataFrame back to S3 in my own AWS account? There seems to be no way to switch the Hadoop S3 configuration back to my own account.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.external.id","****")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.assumed.role.arn","****")
val data = spark.read.json("s3a://cross-account-log-location")
data.count
//change back to InstanceProfileCredentialsProvider not working
spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider")
data.write.parquet("s3a://bucket-in-my-own-aws-account")
As per the Hadoop documentation, different S3 buckets can be accessed with different S3A client configurations by setting per-bucket options that include the bucket name.
E.g.: fs.s3a.bucket.<bucket name>.access.key
Check the below URL: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
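Applied to your example, a sketch could look like the following; it reuses the bucket names from your question, keeps the role ARN and external ID masked, and assumes the instance profile remains the default credential source for everything else:

val hc = spark.sparkContext.hadoopConfiguration

// Cross-account bucket: assumed-role provider, scoped to this bucket only.
hc.set("fs.s3a.bucket.cross-account-log-location.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
hc.set("fs.s3a.bucket.cross-account-log-location.assumed.role.arn", "****")
hc.set("fs.s3a.bucket.cross-account-log-location.assumed.role.external.id", "****")
hc.set("fs.s3a.bucket.cross-account-log-location.assumed.role.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")

// Your own bucket: plain instance-profile credentials, again scoped per bucket.
hc.set("fs.s3a.bucket.bucket-in-my-own-aws-account.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")

val data = spark.read.json("s3a://cross-account-log-location")
data.write.parquet("s3a://bucket-in-my-own-aws-account")

Because each option is keyed by bucket name, there is no need to flip the global fs.s3a.aws.credentials.provider between the read and the write.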
AWS Glue is serverless, but there is a way to assign a VPC and subnet to a Glue ETL job when the job works with a DB connection (RDS, JDBC or Redshift). This part is fine.
The problem we are facing is when the Glue job only operates on S3 buckets and does not use any other DB.
How do we make sure that Glue accesses these S3 buckets through the VPC endpoint?
Even if we define a VPC endpoint for the VPC, how do we ensure the ETL job runs in that same VPC?
When a Glue job works with an S3 source and an S3 destination, it does not ask for VPC details.
Can any of you help resolve this?
It is possible to make sure the traffic does not leave the VPC when the source and destination are S3. In practice you create an S3 VPC endpoint in the VPC and attach a connection of type Network (which carries the VPC, subnet and security-group details) to the Glue job, so the job runs inside that VPC; please refer to this for configuring the S3 VPC endpoint and adding it to your Glue job.
Also refer to this if you see any issues accessing S3.
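Once the job has a Network-type connection attached and therefore runs in your VPC subnet, you can verify that the subnet's route table actually sends S3 traffic to a gateway endpoint. A small verification sketch with the AWS SDK for Java v2 from Scala; the subnet ID is a placeholder, and it assumes the subnet is explicitly associated with a route table:

import scala.collection.JavaConverters._
import software.amazon.awssdk.services.ec2.Ec2Client
import software.amazon.awssdk.services.ec2.model.{DescribeRouteTablesRequest, Filter}

val ec2 = Ec2Client.create()

// Route table associated with the subnet used by the Glue Network connection.
// Note: subnets that fall back to the main route table have no explicit
// association and would need a lookup on association.main=true instead.
val request = DescribeRouteTablesRequest.builder()
  .filters(Filter.builder().name("association.subnet-id").values("subnet-xxxxxxxx").build())
  .build()

val routes = ec2.describeRouteTables(request).routeTables().asScala.flatMap(_.routes().asScala)

// An S3 gateway endpoint appears as a route whose target is a vpce-... gateway
// with the region's S3 prefix list as the destination.
val viaEndpoint = routes.exists(r => Option(r.gatewayId()).exists(_.startsWith("vpce-")))
println(s"S3 gateway endpoint route present: $viaEndpoint")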
Suppose I create a VPC and a VPC endpoint in region1.
Can I communicate with an s3-bucket-in-region2 using this VPC endpoint, i.e. without going over the internet?
No, VPC endpoints do not support cross-region requests. Your bucket(s) need to be in the same region as the VPC.
Endpoints for Amazon S3
Endpoints currently do not support cross-region requests; ensure that you create your endpoint in the same region as your bucket. You can find the location of your bucket by using the Amazon S3 console, or by using the get-bucket-location command. Use a region-specific Amazon S3 endpoint to access your bucket; for example, mybucket.s3-us-west-2.amazonaws.com. For more information about region-specific endpoints for Amazon S3, see Amazon Simple Storage Service (S3) in Amazon Web Services General Reference.
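If you are not sure which region a bucket lives in, the get-bucket-location call mentioned above is also available through the SDK; a minimal sketch with the AWS SDK for Java v2 from Scala, using the mybucket name from the quote as a placeholder:

import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetBucketLocationRequest

val s3 = S3Client.builder().region(Region.US_EAST_1).build()

// The location constraint is the bucket's region; an empty value means us-east-1.
val constraint = s3
  .getBucketLocation(GetBucketLocationRequest.builder().bucket("mybucket").build())
  .locationConstraintAsString()

val bucketRegion = if (constraint == null || constraint.isEmpty) "us-east-1" else constraint
println(s"Bucket region: $bucketRegion")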