I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).
Unfortunately, AWS Glue doesn't seem to support running inside user defined VPCs. AWS does provide something called Glue Database Connections which, when used with the Glue SDK, magically set up elastic network interfaces inside the specified VPC for Glue/Spark worker nodes. The network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.
Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?
You can create a database connection with NETWORK connection type and use that connection in your Glue job. It will allow your job to call a REST API or any other resource within your VPC.
https://docs.aws.amazon.com/glue/latest/dg/connection-using.html
Network (designates a connection to a data source within an Amazon
Virtual Private Cloud environment (Amazon VPC))
https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html
To allow AWS Glue to communicate with its components, specify a
security group with a self-referencing inbound rule for all TCP ports.
By creating a self-referencing rule, you can restrict the source to
the same security group in the VPC and not open it to all networks.
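As a rough sketch of the two pieces quoted above (a NETWORK connection plus a self-referencing security group rule), the boto3 request payloads could look like this. All resource IDs, the connection name and the helper function names are hypothetical placeholders, not values from the question; the boto3 calls are wrapped in a function so nothing touches AWS until you call it.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput for a Glue connection of type NETWORK."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC URL or credentials.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def self_referencing_rule(security_group_id):
    """Inbound rule allowing all TCP ports, but only from the same group."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }],
    }


def apply(connection_input, rule):
    # Needs boto3 and AWS credentials; only runs when called explicitly.
    import boto3
    boto3.client("ec2").authorize_security_group_ingress(**rule)
    boto3.client("glue").create_connection(ConnectionInput=connection_input)


# Placeholder IDs for illustration only.
conn = network_connection_input("vpc-network", "subnet-0abc", ["sg-0abc"], "us-east-1a")
rule = self_referencing_rule("sg-0abc")
```

The job then needs to reference the connection by name (for example through the `Connections` parameter of `create_job`) for the network settings to take effect.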
However, this requires the location and credentials of specific
databases, and it is not clear if and when other traffic (e.g., a REST
call to a service) is tunnelled through the VPC.
I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC, since you have to have a NAT Gateway or VPC endpoints to allow Glue to access things outside the VPC once you have configured it with VPC access:
All JDBC data stores that are accessed by the job must be available
from the VPC subnet. To access Amazon S3 from within your VPC, a VPC
endpoint is required. If your job needs to access both VPC resources
and the public internet, the VPC needs to have a Network Address
Translation (NAT) gateway inside the VPC.
Related
I have a VPC with a couple of subnets containing EC2 instances.
The EC2 instances have code that invokes various AWS services like DynamoDB.
Is the connection from EC2 to AWS Service (like dynamodb) happening within the AWS Network, or via public internet?
Is there any way to control this?
Is the connection from EC2 to AWS Service (like dynamodb) happening within the AWS Network, or via public internet?
Technically the process on EC2 would be hitting the AWS DynamoDB public API which is on the Internet. The traffic would be routed through the Internet Gateway you have attached to the VPC. I think if it is all in the same region it may not actually leave the AWS data center, and you could try testing that via tools like traceroute, but I don't think there are any guarantees of that.
Is there any way to control this?
Yes, add a VPC Endpoint to your VPC for the service you want to connect to. Traffic to that service is then routed over the VPC Endpoint (through a route table entry for gateway endpoints such as DynamoDB and S3, or through private DNS for interface endpoints) instead of through your VPC's Internet Gateway. The traffic will then be guaranteed to stay within the AWS network.
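A minimal sketch of creating a gateway endpoint for DynamoDB with boto3 might look like the following. The VPC ID, region, route table ID and helper names are assumptions made up for illustration.

```python
def gateway_endpoint_request(vpc_id, region, service, route_table_ids):
    """Build the arguments for ec2.create_vpc_endpoint (gateway type)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Service names follow the pattern com.amazonaws.<region>.<service>.
        "ServiceName": f"com.amazonaws.{region}.{service}",
        # The endpoint adds a route to each of these route tables.
        "RouteTableIds": list(route_table_ids),
    }


def create_gateway_endpoint(request):
    # Needs boto3 and AWS credentials; only runs when called explicitly.
    import boto3
    return boto3.client("ec2").create_vpc_endpoint(**request)


# Placeholder IDs for illustration only.
request = gateway_endpoint_request("vpc-0abc", "eu-west-1", "dynamodb", ["rtb-0abc"])
```

The same shape works for S3 by passing `"s3"` as the service.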
I have an Amazon RDS Postgres instance which resides in the default VPC.
To connect to it, I am using different EC2 instances (Java Spring Boot and Node.js) running in Elastic Beanstalk. These instances also reside in the default VPC.
Do these EC2 instances connect to/query the RDS instance through the internet, or do the calls stay within the AWS network?
If they leave the AWS network and the calls go through the internet, is creating a VPC endpoint the right solution? Or is my whole understanding incorrect?
Thanks a lot for your help.
Do these EC2 instances connect to/query the RDS instance through the internet or the calls do not leave the AWS Network?
The DNS name of the RDS endpoint resolves to a private IP address when used from within the VPC. So communication is private, even if you use public subnets or set your RDS instance as publicly accessible. However, for connections from outside of AWS, the RDS endpoint will resolve to a public IP address if the DB instance is publicly accessible.
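One way to check this yourself is to resolve the endpoint from an instance inside the VPC and see whether the result is a private address. This is only a sketch using the Python standard library; the function names are made up for illustration and you would pass your own RDS endpoint hostname.

```python
import ipaddress
import socket


def is_private_address(address):
    """True for private ranges such as 10.0.0.0/8 or 172.16.0.0/12."""
    return ipaddress.ip_address(address).is_private


def resolves_privately(hostname):
    """True if every IPv4 address the hostname resolves to is private."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    addresses = {info[4][0] for info in infos}
    return all(is_private_address(a) for a in addresses)
```

Running `resolves_privately(...)` with the RDS endpoint from an EC2 instance in the VPC and again from a machine outside AWS should show the difference described above.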
If they leave the AWS network and the calls go through the internet, is creating a VPC endpoint the right solution?
There is no VPC endpoint for RDS client connections, only for management actions (creating db-instance, termination, etc). In contrast, Aurora Serverless has Data API with corresponding VPC endpoint.
To secure your DB instance's communications, make sure of at least the following:
Place your RDS instance in a private subnet (one whose route table does not contain a default outbound route to an internet gateway).
Have the RDS security group accept inbound traffic only from the instances' security group(s) on the TCP port for PostgreSQL, which is usually 5432.
In this case, traffic to RDS stays local to your VPC. VPC endpoints can be used to access RDS API operations privately, which is not your case (you just need to connect your app to the DB using a connection string).
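The "private subnet" condition above can be checked against a route table description. The sketch below assumes a dictionary shaped like one entry of the `RouteTables` list returned by boto3's `describe_route_tables`; the sample route tables and the function name are invented for illustration.

```python
def has_internet_gateway_default_route(route_table):
    """True if the table sends 0.0.0.0/0 to an internet gateway (igw-...)."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        via_igw = (route.get("GatewayId") or "").startswith("igw-")
        if is_default and via_igw:
            return True
    return False


# Invented sample data: a public and a private route table.
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123"},
]}
private_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123"},
]}
```

A subnet whose route table returns `False` here has no direct route to an internet gateway, which is what the first point asks for.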
I have a Glue job which calls an API hosted on an EC2 instance.
The problem is that the EC2 instance resides within a VPC that blocks all public access.
I tried creating an interface endpoint in my VPC but still can't access the REST API.
The host is always unreachable but when I try to access the API from VPC it is working fine.
The security group associated with the EC2 instance is used while creating the VPC Endpoint.
Any help is appreciated
In the AWS Glue console, under Connections, create a connection. A dummy connection just points at a non-existent database or resource, for example: jdbc:mysql://some-fake-endpoint-here:3306/mydb. After this, choose the correct VPC, subnet and security group. A test connection will not work in this context, but the connection gives you a way to introduce your VPC, subnet and security group information to the job. To test connectivity, use a python-shell job, or launch an EC2 instance in the same VPC or subnet and run something like nc -vz endpoint port.
This connection metadata facilitates the launching of elastic network interfaces in your account that allow Glue DPUs to communicate with your resource at runtime. More on how connections work in Glue is discussed here.
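For the python-shell variant of that connectivity test, a small standard-library check roughly equivalent to `nc -vz` could look like this. The function name is hypothetical, and the host and port you pass would be your own API's private address and port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try to open a TCP connection; True on success, False on any failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

If this returns `False` inside the job but `True` from an EC2 instance in the same subnet, the connection's subnet or security group settings are the likely place to look.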
I have production stacks inside a Production account and development stacks inside a Development account. The stacks are identical and are setup as follows:
Each stack has its own VPC.
Within the VPC are two public subnets spanning two AZs and two private subnets spanning two AZs.
The private Subnets contain the RDS instance.
The public Subnets contain a Bastion EC2 instance which can access the RDS instance.
To access the RDS instance, I either have to SSH into the Bastion machine and access it from there, or create an SSH tunnel via the Bastion to access it through a database client application such as pgAdmin.
Current DMS setup:
I would like to be able to use DMS (Database Migration Service) to replicate an RDS instance from Production into Development. So far I am trying the following but cannot get it to work:
Create a VPC peering connection between Development VPC and Production VPC
Create a replication instance in the private subnet of the Development VPC
Update the private subnet route tables in the development VPC to route traffic to the CIDR of the production VPC through the VPC peering connection
Ensure the Security group for the replication instance can access both RDS instances.
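The peering and routing steps above could be sketched as follows. VPC peering needs non-overlapping CIDR blocks, and return traffic needs a matching route on the production side as well. The CIDRs, IDs and function names here are made-up placeholders.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """True if the two VPC CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def peering_route(route_table_id, destination_cidr, peering_connection_id):
    """Build the arguments for ec2.create_route through a peering connection."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": destination_cidr,
        "VpcPeeringConnectionId": peering_connection_id,
    }


def create_route(route):
    # Needs boto3 and AWS credentials; only runs when called explicitly.
    import boto3
    return boto3.client("ec2").create_route(**route)


# Placeholder values: development VPC routes to the production CIDR.
dev_to_prod = peering_route("rtb-0dev", "10.0.0.0/16", "pcx-0abc")
# The production route table needs the reverse route for replies.
prod_to_dev = peering_route("rtb-0prod", "10.1.0.0/16", "pcx-0abc")
```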
Main Problem:
When creating the source endpoint in DMS, the wizard only shows RDS instances from the same account and the same region, and only allows RDS instances to be configured using server names and ports. However, the RDS instances in my stacks can only be accessed via Bastion machines using tunnelling, so the test endpoint connection always fails.
Any ideas of how to achieve this cross account replication?
Any good step by step blogs that detail how to do this? I have found a few but they don't seem to have RDS instances sitting behind bastion machines and so they all assume the endpoint configuration wizard can be populated using server names and ports.
Many thanks.
Securing the RDS instances via the Bastion host is sound security practice, of course, for developer/operational access.
For the DMS migration, however, you should expect to open the security groups of both the target and source RDS database instances so that the replication instance has access to both.
From Network Security for AWS Database Migration Service:
The replication instance must have access to the source and target endpoints. The security group for the replication instance must have network ACLs or rules that allow egress from the instance out on the database port to the database endpoints.
Database endpoints must include network ACLs and security group rules that allow incoming access from the replication instance. You can achieve this using the replication instance's security group, the private IP address, the public IP address, or the NAT gateway’s public address, depending on your configuration.
See
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Security.Network.html
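As an illustration of the "private IP address" option in the quoted passage, an inbound rule on a database security group that admits only the replication instance could be built like this. The group ID, IP address, default port and function name are assumptions for the example.

```python
import ipaddress


def db_ingress_from_replication_instance(db_security_group_id, replication_private_ip, db_port=5432):
    """Build authorize_security_group_ingress arguments for a single source IP."""
    # Raises ValueError early if the address is malformed.
    ip = ipaddress.ip_address(replication_private_ip)
    return {
        "GroupId": db_security_group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": db_port,
            "ToPort": db_port,
            "IpRanges": [{
                "CidrIp": f"{ip}/32",
                "Description": "DMS replication instance",
            }],
        }],
    }


def authorize(rule):
    # Needs boto3 and AWS credentials; only runs when called explicitly.
    import boto3
    return boto3.client("ec2").authorize_security_group_ingress(**rule)


# Placeholder values for illustration only.
source_rule = db_ingress_from_replication_instance("sg-0prod", "10.1.20.15")
```

The same rule would be applied to the target database's security group, in whichever account owns it.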
For network addressing and to open the RDS private subnet, you'll need a NAT on both source and target. They can be added easily, and then terminated after the migration.
You can now use Network Address Translation (NAT) Gateway, a highly available AWS managed service that makes it easy to connect to the Internet from instances within a private subnet in an AWS Virtual Private Cloud (VPC).
See
https://aws.amazon.com/about-aws/whats-new/2015/12/introducing-amazon-vpc-nat-gateway-a-managed-nat-service/
Is it safe to transfer unsecured HTTP messages between two EC2 instances within the same VPC in AWS?
Or is it necessary to use SSH tunneling etc?
It's safe in the sense that only your instances exist in the VPC. So the traffic between your two instances in your VPC cannot be sniffed by a 3rd party.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a
logically isolated section of the Amazon Web Services (AWS) Cloud
where you can launch AWS resources in a virtual network that you
define. You have complete control over your virtual networking
environment, including selection of your own IP address range,
creation of subnets, and configuration of route tables and network
gateways.
Source: Amazon VPC