AWS Glue is serverless, but there is a way to assign a VPC and subnet to a Glue ETL job when the job works with a DB connection (RDS, JDBC, or Redshift). That part is fine.
The problem we are facing is when the Glue job only operates on S3 buckets and does not use any other database.
How do we make sure that Glue accesses these S3 buckets through a VPC endpoint?
Even if we define a VPC endpoint for a VPC, how do we ensure the ETL job runs in that same VPC?
When a Glue job works with an S3 source and an S3 destination, it does not ask for VPC details.
Can any of you help resolve this?
It is possible to make sure the traffic does not leave your VPC when both the source and destination are S3. Please refer to this to configure an S3 VPC endpoint and add it to your Glue job.
Also refer to this if you see any issues accessing S3.
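In case a concrete sketch helps, below is one common way to wire this up with boto3 (all names, the subnet/security-group IDs and the script location are placeholders): create a NETWORK-type Glue connection that carries the VPC and subnet details, then attach it to the job so the job's network interfaces land in that subnet and its S3 traffic can use the gateway endpoint on that subnet's route table.

```python
# Minimal sketch, assuming a NETWORK-type Glue connection is acceptable for an
# S3-only job; all names and IDs below are placeholders.
import boto3

glue = boto3.client("glue")

# NETWORK connection: no JDBC target, it only pins the job to a subnet and
# security group, so S3 calls are routed via the S3 gateway endpoint there.
glue.create_connection(
    ConnectionInput={
        "Name": "s3-only-network-connection",            # placeholder name
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",       # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

# Attach the connection to the job; the job then runs with ENIs in that subnet.
glue.create_job(
    Name="s3-to-s3-etl",                                  # placeholder
    Role="GlueServiceRoleWithS3Access",                   # placeholder IAM role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    Connections={"Connections": ["s3-only-network-connection"]},
)
```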
I would like to know if it is possible to configure an endpoint to S3 (an S3 endpoint) for AWS Athena, not a VPC endpoint. I have looked everywhere in the documentation and could not find it. Is this even possible?
The idea is to use the endpoint to get to S3 for all the Athena queries.
Thanks and best regards
Krishna
An --endpoint-url is normally used to override how the AWS CLI accesses an AWS service.
I have seen it used when people work with an S3-compatible service such as Wasabi, where they are pointing at a different service rather than the 'real' S3.
Amazon Athena knows how to connect directly to Amazon S3. It is not possible to override the S3 Endpoint when Athena connects to S3.
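For comparison, this is roughly what an endpoint override looks like at the SDK level (boto3 here; the endpoint URL and region are illustrative). It only changes where your own SDK or CLI sends its requests, which is why there is no equivalent knob for the S3 calls Athena makes on its service side.

```python
# Sketch of an SDK-side endpoint override; the URL below is just an example of
# an S3-compatible service, not a recommendation.
import boto3

# Default client: the SDK resolves the regional S3 endpoint itself.
s3 = boto3.client("s3", region_name="us-east-1")

# Overridden client: point the SDK at a different S3-compatible endpoint.
s3_compat = boto3.client(
    "s3",
    endpoint_url="https://s3.wasabisys.com",  # example endpoint override
    region_name="us-east-1",
)

print(s3_compat.list_buckets())
```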
I'm using AWS services to create a data pipeline.
I have data stored in an Amazon S3 bucket, and I plan to use a Glue crawler to crawl the data under a prefix to extract the metadata, and then a Glue job to do the ETL and save the data to another bucket.
My question is: on which network do these services work and communicate with each other? Is it possible that the data will be moved from Amazon S3 to Glue through the public internet?
Is there any link to AWS documentation that explains which networks AWS services use when they transfer data between them?
You need to grant explicit permission to any resource to be able to access your S3 bucket.
IAM roles: using a policy, create a role and attach that role to the AWS resource.
A bucket policy is another mechanism to grant access.
By default everything is private; you need to grant access, otherwise the bucket is not accessible from the internet.
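As an illustration of the bucket-policy route, here is a minimal boto3 sketch (the bucket name and role ARN are placeholders) that grants a specific IAM role read access to the bucket:

```python
# Minimal sketch: attach a bucket policy that lets one IAM role read the bucket.
# Bucket name and role ARN are placeholders.
import json
import boto3

s3 = boto3.client("s3")

bucket = "my-data-bucket"  # placeholder
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRoleRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/MyGlueRole"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # ListBucket applies to the bucket
                f"arn:aws:s3:::{bucket}/*",    # GetObject applies to the objects
            ],
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```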
I have a Lambda function accessing an S3 bucket using the aws-sdk.
There is a high number of operations (requests) to the S3 bucket, which is considerably increasing the cost of using Lambda.
I was hoping that the requests would use the s3:// protocol, but they are going over the internet.
I understand that one solution could be:
Attach the Lambda to a VPC
Create a VPC endpoint to S3
Update the route tables of the VPC
Is there a simpler way to do so?
An alternative could be to create an API Gateway and a Lambda proxy method integration, following the AWS guide or tutorial.
You can then configure API Gateway to act as your external-facing integration over the internet, while your Lambda and S3 traffic stays within AWS.
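As a rough sketch of what the Lambda side of a proxy integration can look like (the bucket name and key handling are placeholders, not a prescribed layout): API Gateway forwards the request to the function, the function reads from S3, and the function returns the response in the shape the proxy integration expects.

```python
# Minimal Lambda proxy-integration handler sketch; bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"  # placeholder

def handler(event, context):
    # With Lambda proxy integration, path parameters arrive inside the event.
    key = (event.get("pathParameters") or {}).get("key", "default.txt")
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain"},
        "body": obj["Body"].read().decode("utf-8"),
    }
```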
The traffic won't go over the internet or incur additional data transfer cost as long as the non-VPC Lambda function runs in the same region as the S3 bucket, so a VPC is not needed in this case.
https://aws.amazon.com/s3/pricing/
You pay for all bandwidth into and out of Amazon S3, except for the following:
• Data transferred in from the internet.
• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket.
• Data transferred out to Amazon CloudFront (CloudFront).
You can think of Lambda like EC2 here, so the data transfer is free, but be careful: you still need to pay for the API requests.
I use Apache Spark and Redshift in a VPC, and I also use AWS S3 for source data and for temp data for the Redshift COPY.
Right now I suspect that the read/write performance from/to AWS S3 is not good enough, and based on the suggestion in the following discussion https://github.com/databricks/spark-redshift/issues/318 I created an S3 endpoint within the VPC. So far I can't see any performance difference before and after the S3 endpoint creation when I load data from S3.
In Apache Spark I read data in the following way:
spark.read.csv("s3://example-dev-data/dictionary/file.csv")
Do I need to add or configure some extra logic/configuration on AWS EMR Apache Spark in order to make proper use of the AWS S3 endpoint?
The S3 VPC endpoint is a gateway endpoint, so you have to add an entry to the route tables of the subnets where you start your EMR clusters that routes the S3 traffic to the endpoint.
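If it helps, here is a minimal boto3 sketch of that step (the VPC ID, route table ID and region are placeholders): create the gateway endpoint with the route tables of the EMR subnets attached, or attach them to an existing endpoint.

```python
# Sketch: associate the S3 gateway endpoint with the EMR subnets' route tables.
# All IDs and the region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                  # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],        # route table(s) of the EMR subnets
)

# For an endpoint that already exists, the route tables can be added later:
# ec2.modify_vpc_endpoint(VpcEndpointId="vpce-...", AddRouteTableIds=["rtb-..."])
```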
I have two separate Lambda functions: one to read a file from an S3 bucket and one to write to a Memcached cluster. They work well individually. However, I am unable to 'merge' them together.
Firstly, the function that reads from S3 works with the 'No VPC' setting, whereas the function that writes to ElastiCache works only when the function and the cluster are in the same VPC.
Secondly, the function that reads from S3 had only the AmazonS3FullAccess policy applied. While I have now also applied the AWSLambdaVPCAccessExecutionRole, I am not sure whether this setup will work because of the VPC difference mentioned above.
Is AWS Step Functions the answer? How do I build a serverless application that reads a file from S3 and writes to an ElastiCache cluster?
You don't need step functions for this. Run the function in the VPC with your ElastiCache cluster. Either add an S3 endpoint to your VPC, or a NAT gateway. The S3 endpoint is the easiest solution. Then your function will have access to both ElastiCache and S3.
For the IAM role, you need to go into IAM and create a new role that has the permissions of AWSLambdaVPCAccessExecutionRole as well as the necessary S3 permissions. You can assign multiple policies to a single role if necessary. Then assign that role to the Lambda function.
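For illustration, a single function along those lines could look like the sketch below (the cluster endpoint, bucket and key are placeholders, and pymemcache is just one possible Memcached client that you would bundle with the deployment package). It assumes the function runs in the ElastiCache VPC with the S3 gateway endpoint in place.

```python
# Sketch of one Lambda that reads a file from S3 and writes it to Memcached.
# Endpoint, bucket and key are placeholders; pymemcache is an assumed client lib.
import boto3
from pymemcache.client.base import Client

s3 = boto3.client("s3")
cache = Client(("my-cluster.xxxxxx.cfg.use1.cache.amazonaws.com", 11211))  # placeholder

def handler(event, context):
    # Read the object via the S3 gateway endpoint (placeholders below).
    obj = s3.get_object(Bucket="my-data-bucket", Key="config/data.json")
    body = obj["Body"].read()
    # Write it to the Memcached cluster inside the VPC, with a 1-hour TTL.
    cache.set("data.json", body, expire=3600)
    return {"cached_bytes": len(body)}
```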