How can I access AWS resources in a VPC from AWS Glue?

I have a Glue job that calls an API hosted on an EC2 instance.
The problem is that the EC2 instance resides in a VPC that blocks all public access.
I tried creating an interface endpoint in my VPC, but I still can't reach the REST API.
The host is always unreachable from Glue, but the API works fine when I access it from inside the VPC.
The security group associated with the EC2 instance was used when creating the VPC endpoint.
Any help is appreciated

In the AWS Glue console, under Connections, create a connection. It can be a dummy connection, meaning one that points at a non-existent database or resource, for example: jdbc:mysql://some-fake-endpoint-here:3306/mydb. Then choose the correct VPC, subnet, and security group. The console's test connection will not work for a dummy connection, but the connection gives you a way to pass your VPC, subnet, and security group information to the job. To test connectivity, use a Python shell job, or launch an EC2 instance in the same VPC or subnet and run something like nc -vz <endpoint> <port>.
This connection metadata lets Glue launch elastic network interfaces in your account, which allow the Glue DPUs to communicate with your resource at runtime. More on how connections work in Glue is discussed here.
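For the Python shell job route, the nc -vz check can be approximated with the standard library. This is only a minimal sketch: the function name can_connect is made up for illustration, and you would pass your own API host and port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts, and unreachable hosts.
        return False
```

Running can_connect with the private IP or DNS name of the EC2 instance from inside the job shows whether the network path exists before you debug the REST call itself.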

Related

Connection from Lambda to RDS in a different account

I have an RDS in one AWS Account - say Acct-1.
The RDS instance is public (I know it's not a good idea and there are other solutions for that).
I have a lambda in another AWS Account - say Acct-2 which runs in a VPC.
I have set up VPC peering between the two accounts; the route table entries are in place, as are the inbound and outbound security group rules.
In Acct-2 I can verify that I can connect to the RDS instance in Acct-1 using a mysql client from an EC2 instance. The EC2 instance is in the same subnet as the Lambda and they both have the same security group.
But the Lambda gets a connection timeout. The Lambda has the typical Lambda execution role that allows logs and network interfaces.
Thoughts on what could be missing? Does the RDS instance need to grant specific access to the Lambda service even if it's running in a VPC?
Clarification: There is no route to the RDS instance from the internet. Clearly, the ec2 host is able to resolve the Private IP for the RDS instance from the DNS name and connect.
Lambda is unable to resolve the private IP for the RDS instance.
I'm trying to keep the traffic within AWS so as to not pay egress costs.

Amazon RDS and VPC Endpoints Connectivity

I have an Amazon RDS Postgres instance which resides in the default VPC.
To connect to it, I am using different EC2 instances (Java Spring Boot and Node.js) running in Elastic Beanstalk. These instances also reside in the default VPC.
Do these EC2 instances connect to/query the RDS instance through the internet or the calls do not leave the AWS Network?
If they leave the AWS network and the calls go through the internet, is creating a VPC endpoint the right solution? Or is my whole understanding incorrect?
Thanks a lot for your help.
Do these EC2 instances connect to/query the RDS instance through the internet or the calls do not leave the AWS Network?
The DNS name of the RDS endpoint resolves to a private IP address when used from within the VPC. So communication is private, even if you use public subnets or set your RDS instance as publicly accessible. However, for connections from outside of AWS, the RDS endpoint resolves to a public IP address if the DB instance is publicly accessible.
If they leave the AWS network and the calls go through the internet, is creating a VPC endpoint the right solution?
There is no VPC endpoint for RDS client connections, only for management actions (creating a DB instance, terminating it, etc.). In contrast, Aurora Serverless has a Data API with a corresponding VPC endpoint.
To secure communication with your DB instances, make sure of at least the following:
Place your RDS instance in a private subnet (one whose route table has no default outbound route to an internet gateway).
Have the RDS security group accept inbound traffic only from the instances' security group or groups, on the TCP port for PostgreSQL, which is usually 5432.
In this case, traffic to RDS stays local to your VPC. VPC endpoints can be used to reach the RDS API operations privately, which is not your case (you just need to connect your app to the DB using a connection string).
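If you want to confirm from an instance that the RDS endpoint name really resolves to a private address, a quick check along these lines may help. It is a sketch using only the standard library; the function names are made up for illustration, and you would pass your own RDS endpoint host name.

```python
import ipaddress
import socket
from typing import List


def resolve_addresses(hostname: str) -> List[str]:
    """Return the distinct IP addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: List[str]) -> bool:
    """True if there is at least one address and none of them is globally routable.

    Note that ipaddress treats loopback and link-local ranges as private too,
    not only the RFC 1918 ranges used by VPC subnets.
    """
    if not addresses:
        return False
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return all(ipaddress.ip_address(a.split("%", 1)[0]).is_private for a in addresses)
```

Calling all_private(resolve_addresses(...)) with the endpoint name from an EC2 instance in the VPC should return True if traffic is staying on private addresses.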

How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).
Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide something called Glue Database Connections which, when used with the Glue SDK, magically set up elastic network interfaces inside the specified VPC for Glue/Spark worker nodes. The network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.
Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?
You can create a database connection with NETWORK connection type and use that connection in your Glue job. It will allow your job to call a REST API or any other resource within your VPC.
https://docs.aws.amazon.com/glue/latest/dg/connection-using.html
Network (designates a connection to a data source within an Amazon
Virtual Private Cloud environment (Amazon VPC))
https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html
To allow AWS Glue to communicate with its components, specify a
security group with a self-referencing inbound rule for all TCP ports.
By creating a self-referencing rule, you can restrict the source to
the same security group in the VPC and not open it to all networks.
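As a rough sketch of what that looks like outside the console, the request payloads for a NETWORK connection and the self-referencing rule could be built as below. The helper names and all IDs are placeholders of mine; the boto3 calls in apply (create_connection on the Glue client and authorize_security_group_ingress on the EC2 client) need credentials and real resource IDs, so treat that part as untested against your account.

```python
from typing import Dict, List


def network_connection_input(name: str, subnet_id: str,
                             security_group_ids: List[str],
                             availability_zone: str) -> Dict:
    """Build a ConnectionInput for a Glue connection of type NETWORK."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC URL or credentials.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def self_referencing_tcp_rule(security_group_id: str) -> Dict:
    """Build an inbound rule allowing all TCP ports from the same security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def apply(connection_input: Dict, security_group_id: str) -> None:
    """Send both requests to AWS; requires boto3 and valid credentials."""
    import boto3  # third-party; imported lazily so the builders work without it

    boto3.client("ec2").authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[self_referencing_tcp_rule(security_group_id)],
    )
    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

The connection still has to be attached to the job (in the job's connections setting) before the job's network interfaces are created in that subnet.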
However, this requires the location and credentials of specific
databases, and it is not clear if and when other traffic (e.g., a REST
call to a service) is tunnelled through the VPC.
I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC, since you have to have a NAT Gateway or VPC endpoints to allow Glue to access things outside the VPC once you have configured it with VPC access:
All JDBC data stores that are accessed by the job must be available
from the VPC subnet. To access Amazon S3 from within your VPC, a VPC
endpoint is required. If your job needs to access both VPC resources
and the public internet, the VPC needs to have a Network Address
Translation (NAT) gateway inside the VPC.
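For the S3 part of that paragraph, a gateway endpoint is one common choice. The sketch below only builds the parameters; the helper name and the IDs are hypothetical, and the resulting dictionary is meant to be passed to the EC2 client's create_vpc_endpoint call in boto3, which I have left out so the snippet runs without credentials.

```python
from typing import Dict, List


def s3_gateway_endpoint_params(vpc_id: str, region: str,
                               route_table_ids: List[str]) -> Dict:
    """Build parameters for an S3 gateway endpoint in the given VPC."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service names follow the pattern com.amazonaws.<region>.s3.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables of the subnets the Glue connection uses.
        "RouteTableIds": list(route_table_ids),
    }
```

With boto3 installed, something like boto3.client("ec2").create_vpc_endpoint(**params) would create it, assuming the route tables listed are the ones associated with the Glue connection's subnet.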

How can AWS Glue access IP whitelisted resource

If I have a service that needs IP whitelisting, how can I connect AWS Glue to it? I read that I should be able to put AWS Glue in a private VPC and configure a NAT gateway. I can then allow my NAT IP to connect to the service. However, I cannot find any way to configure my Glue job to run inside a subnet/VPC. How do I do this?
The job will run automatically in a VPC if you attach a Database connection to a resource which is inside the VPC. For example, I have a job that reads data from S3 and writes into an Aurora database in a private VPC using a Glue connection (configured as JDBC).
That job automatically has access to all the resources inside the VPC, as explained here. If the VPC has enabled NAT for external access, then your job can also take advantage of that.
Note that if you use a connection that requires a VPC and you also use S3, you will need to enable an endpoint for S3 in that VPC as well.
Your question is answered here: https://stackoverflow.com/a/64414639 Note that Glue is a managed service, so it does not publish a list of IP addresses that can be whitelisted. As a workaround, you can use an EC2 instance to run your custom Python or PySpark script and whitelist the IP address of that particular EC2 instance.

AWS EMR on VPC with EC2 Instance

I am reading about AWS EMR in a VPC, but it seems to be more of a design consideration for how the AWS EMR service accesses the EMR cluster for calls.
What I am trying to do is host a VPC with an ALB and an EC2 instance running an application as a service that accesses the EMR cluster.
VPC -> Internet Gateway -> Load Balancer -> EC2 (Application endpoints) -> EMR Cluster
I don't want the cluster to be accessible from outside except through the public IP of the internet gateway. That public IP should reach only the EC2 instance hosting the application, which calls the EMR cluster in the same VPC.
Is this a recommended approach?
The design looks something like below.
Some challenges I am tackling are how to access S3 from EMR when it is in a VPC,
whether the application running on EC2 can access the EMR cluster, and whether the EMR cluster would be publicly available.
Any guidance links or recommendations would be welcome.
EDIT:
Or if I create EMR in a VPC, do I need to wrap it inside another VPC, something like below?
The simplest design is:
Put everything in a public subnet in a VPC
Use Security Groups to control access to the EMR cluster
If you are security-paranoid, then you could use:
Put publicly-accessible resources (eg EC2) in a public subnet
Put EMR in a private subnet
Use a NAT Gateway or VPC-Endpoints to allow EMR to communicate with S3 (which is outside the VPC)
The first option is simpler and Security Groups act as firewalls that can fully protect the EMR cluster. You would create three security groups:
ELB-SG: Permit inbound access from the Internet on your desired ports. Associate the security group with your Load Balancer.
EC2-SG: Permit inbound access from ELB-SG (from the Security Group itself). Associate the security group with your EC2 instances.
EMR-SG: Permit inbound access from EC2-SG (from the Security Group itself). Associate EMR-SG with the EMR cluster.
This will permit only the Load Balancer to communicate with the EC2 instances and only the EC2 instances to communicate with the EMR cluster. The EMR cluster will be able to connect directly to the Internet to access Amazon S3 due to default rules permitting Outbound access.
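To make the chain of security groups concrete, here is a toy model of the three inbound rules. It is plain Python and does not call AWS; the names INTERNET, INBOUND_RULES, and is_allowed are made up for illustration, and ports are ignored to keep it short.

```python
INTERNET = "internet"

# Each security group maps to the set of sources allowed to send it inbound traffic.
INBOUND_RULES = {
    "ELB-SG": {INTERNET},   # load balancer accepts traffic from the Internet
    "EC2-SG": {"ELB-SG"},   # instances accept traffic only from the load balancer
    "EMR-SG": {"EC2-SG"},   # cluster accepts traffic only from the instances
}


def is_allowed(source: str, target_group: str) -> bool:
    """Return True if source may open a connection to members of target_group."""
    return source in INBOUND_RULES.get(target_group, set())
```

In this model is_allowed(INTERNET, "EMR-SG") and is_allowed("ELB-SG", "EMR-SG") are both False, which is the point of referencing security groups rather than IP ranges: the cluster is reachable only by hopping through the EC2 instances.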