I need to connect from Glue to an RDS PostgreSQL database that sits in a private subnet of a VPC. I am not able to connect to the database using the Glue connection that the Spark code in Glue will use.
If you check the Glue architecture, it spins up its servers in the VPC, subnet and security group that you select in the database connection.
So, if you want to access an RDS instance, ensure that the VPC and subnet can reach the RDS JDBC port.
Follow the link for details on setting up the VPC and subnet for Glue.
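One way to sanity-check the "can reach the JDBC port" condition is to inspect the inbound rules of the RDS security group. The sketch below is only an illustration: it assumes the rules are in the dictionary shape that boto3's describe_security_groups returns under IpPermissions, it only looks at rules that reference a source security group (not CIDR ranges), and the group ID and port 5432 are made-up example values.

```python
from typing import Any, Dict, List


def allows_port_from_group(
    ip_permissions: List[Dict[str, Any]], port: int, source_group_id: str
) -> bool:
    """Return True if any inbound rule lets source_group_id reach the TCP port."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            in_range = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            in_range = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            continue  # udp, icmp, ... are irrelevant for JDBC
        if not in_range:
            continue
        groups = {pair.get("GroupId") for pair in rule.get("UserIdGroupPairs", [])}
        if source_group_id in groups:
            return True
    return False


# Hypothetical inbound rules of the RDS security group.
rds_rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-glue-example"}],
    },
]

print(allows_port_from_group(rds_rules, 5432, "sg-glue-example"))
```

A passing check here only covers the security group side; the subnet's route table and network ACLs still have to permit the traffic.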
We want to move our AWS RDS database to GCP CloudSQL. We want to do this without downtime. So our approach was to set up a HA VPN tunnel and use Data Migration Service to sync everything to CloudSQL.
The RDS database is in a private subnet on the AWS side. I've successfully set up a HA VPN tunnel between this AWS private subnet and a private subnet in our GCP project.
I'm able to verify that this works because I can do the following things:
ping from an instance in GCP in the private subnet to an instance in AWS in that private subnet
ping from an instance in AWS in the private subnet to the instance in GCP
After installing MySQL on the GCP instance, I'm able to connect and query the RDS database
I'm struggling with setting up the Data Migration Service in GCP to sync the data from the RDS instance. I've chosen the CloudSQL instance to have a private IP, not a public one. As the connectivity method, I select VPC peering and choose the VPC that contains the GCP instance from which I can reach the RDS instance.
I understand that CloudSQL is created in a project peered to my GCP project, and the CloudSQL instance resides in a subnet in this new project. So there is no route from this subnet to my private subnet. However, I see that it is peered automatically. In this peering connection, I checked the option to import and export custom routes, but still, I cannot reach the RDS from the CloudSQL instance.
I've got routes in GCP for the private subnet IP range of AWS, with the next hop the VPN tunnels.
I'm not sure what I need to do to connect CloudSQL to RDS at this point.
I am trying to create an external schema, and my command is as follows. Of course, I have replaced the names of the components with meaningless ones to hide my production values:
create external schema sb_external
from data catalog
database 'dev'
iam_role 'arn:aws:iam::490412345678:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift'
create external database if not exists;
The query is run in the Redshift database using the "psql" CLI from within an EC2 instance. It is in a private subnet, and the EC2 instance and the Redshift database are in two different VPCs joined by VPC peering. In the VPC where we have the EC2 instance, we have a Glue endpoint.
When I run the above query from the same VPC as the Redshift database, I still get the following error, even though I have created an interface endpoint for Glue in that VPC.
Failed to perform AWS request, curlError=Failed to connect to glue.eu-west-1.amazonaws.com port 443: Connection timed out
With or without the VPC endpoint, we get the same error.
Any help in this regard would be highly appreciated.
I have also faced the same issue and somehow I managed to resolve it.
This error occurs when you enable Enhanced VPC Routing on your cluster.
By default, the Glue endpoint uses the default security group.
Since the error mentions "glue.eu-west-1.amazonaws.com", you need to enable DNS hostnames and DNS resolution for your VPC.
Also add an inbound rule for port 443 (HTTPS) to the default security group, with Redshift's security group as the source.
A few links that helped me:
[+]. https://docs.aws.amazon.com/glue/latest/dg/vpc-interface-endpoints.html
[+]. https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#vpce-interface-limitations
[+]. https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html#spectrum-enhanced-vpc-considerations
"Access to AWS Glue or Amazon Athena
Redshift Spectrum accesses your data catalog in AWS Glue or Athena. Another option is to use a dedicated Hive metastore for your data catalog.
To enable access to AWS Glue or Athena, configure your VPC with an internet gateway or NAT gateway. Configure your VPC security groups to allow outbound traffic to the public endpoints for AWS Glue and Athena. Alternatively, you can configure an interface VPC endpoint for AWS Glue to access your AWS Glue Data Catalog. When you use a VPC interface endpoint, communication between your VPC and AWS Glue is conducted within the AWS network."
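The DNS point in the answer above can be checked from a machine inside the VPC. The sketch below is an assumption-laden illustration, not an official diagnostic: it resolves the Glue hostname from the error message and reports whether every returned address is private, which is what one would expect when an interface endpoint with private DNS is in effect. The helper names resolve and all_private are made up for this example.

```python
import ipaddress
import socket
from typing import List


def all_private(addresses: List[str]) -> bool:
    """True if the list is non-empty and every address is in a private range."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


def resolve(hostname: str, port: int = 443) -> List[str]:
    """Return the sorted, de-duplicated addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    host = "glue.eu-west-1.amazonaws.com"  # hostname taken from the error message
    try:
        addrs = resolve(host)
    except socket.gaierror as exc:
        print(f"{host} did not resolve: {exc}")
    else:
        kind = "private (endpoint DNS in effect)" if all_private(addrs) else "public"
        print(f"{host} -> {addrs}: {kind}")
```

If the hostname still resolves to public addresses from inside the VPC, requests will not go through the interface endpoint and will time out unless an internet or NAT gateway is available.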
One reason for this error message is having Enhanced VPC Routing enabled on your Redshift cluster.
As per the documentation https://aws.amazon.com/premiumsupport/knowledge-center/redshift-enhanced-vpc-routing/ enabling Enhanced VPC Routing might impact UNLOAD / COPY commands. Here you are trying to create an external schema, and one potential reason for this error is having this setting enabled.
If you are using Enhanced VPC Routing:
Create interface VPC endpoints for S3, Glue and, if you use them, Lake Formation and Athena.
Create a gateway VPC endpoint for S3
Ensure all endpoints allow ingress on port 443 from the security group of Redshift
Ensure Redshift has egress to the endpoints
Check that the route table has a route to the S3 gateway prefix list (not just IP routing to the S3 interface endpoint)
Check DNS in the VPC
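The route table item in the checklist above can be sketched as a small check. This is a hedged illustration that assumes the routes are in the shape boto3's describe_route_tables returns under Routes, where a gateway endpoint shows up as a route whose destination is a prefix list (pl-...) and whose target is a VPC endpoint (vpce-...). The IDs are invented, and the check cannot tell an S3 prefix list from a DynamoDB one; confirming that would need a separate prefix list lookup.

```python
from typing import Any, Dict, List


def has_gateway_endpoint_route(routes: List[Dict[str, Any]]) -> bool:
    """True if an active route sends a prefix list to a gateway VPC endpoint."""
    for route in routes:
        prefix_list = route.get("DestinationPrefixListId", "")
        target = route.get("GatewayId", "")
        if (
            prefix_list.startswith("pl-")
            and target.startswith("vpce-")
            and route.get("State") == "active"
        ):
            return True
    return False


# Hypothetical routes of the Redshift subnet's route table.
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {
        "DestinationPrefixListId": "pl-00000000",
        "GatewayId": "vpce-0123456789abcdef0",
        "State": "active",
    },
]

print(has_gateway_endpoint_route(routes))
```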
The security group associated with the Redshift cluster needs to have egress configured for enabling outbound traffic.
Example egress configuration:
from port: 0
to port: 0
protocol: -1 (all protocols)
CIDR IP: "0.0.0.0/0"
References
AWS::EC2::SecurityGroupEgress
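For illustration, the example egress configuration can be written as an AWS::EC2::SecurityGroupEgress resource. The sketch below builds that resource as a Python dictionary and prints it as JSON for a CloudFormation template; the logical name RedshiftEgressAll and the group ID are placeholders, not values from the original post. With protocol -1 the port values are effectively ignored, since all traffic is allowed.

```python
import json
from typing import Any, Dict


def egress_all_resource(group_id: str) -> Dict[str, Any]:
    """Build a CloudFormation resource that allows all outbound traffic."""
    return {
        "Type": "AWS::EC2::SecurityGroupEgress",
        "Properties": {
            "GroupId": group_id,       # security group attached to the cluster
            "IpProtocol": "-1",        # all protocols
            "FromPort": 0,
            "ToPort": 0,
            "CidrIp": "0.0.0.0/0",     # any destination
        },
    }


# "sg-0123456789abcdef0" is a placeholder for the Redshift security group ID.
template_fragment = {"RedshiftEgressAll": egress_all_resource("sg-0123456789abcdef0")}
print(json.dumps(template_fragment, indent=2))
```

An egress rule this open is the simplest thing that works; restricting the destination to the endpoint security groups on port 443 would be tighter.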
I have a Glue job that calls an API hosted on an EC2 instance.
The problem is that the EC2 instance resides within a VPC that blocks all public access.
I tried creating an interface endpoint in my VPC but still can't access the REST API.
The host is always unreachable, but when I access the API from within the VPC it works fine.
The security group associated with the EC2 instance is used while creating the VPC Endpoint.
Any help is appreciated
In the AWS Glue console, under Connections, create a connection. A dummy connection here just means one that points at a non-existent database or resource, for example: jdbc:mysql://some-fake-endpoint-here:3306/mydb. After this, choose the correct VPC, subnet and security group. A test connection will not work in this context, but the connection gives you a way to pass your VPC, subnet and security group information to the job. You can test connectivity using a python-shell job, or by launching an EC2 instance in the same VPC or subnet and running something like nc -vz endpoint port.
This connection metadata lets Glue launch elastic network interfaces in your account, which allow the Glue DPUs to communicate with your resource at runtime. More on how connections work in Glue is discussed here.
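For the python-shell variant of the nc -vz check, a minimal sketch using only the standard library could look like the following. The host 10.0.0.12 and port 8080 are placeholders for the private address and port of the API, and can_connect is a made-up helper name.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connection, roughly what `nc -vz host port` does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder private address and port of the API on the EC2 instance.
    print(can_connect("10.0.0.12", 8080))
```

Run from a Glue job that uses the dummy connection, a False result points at routing or security group rules rather than at the job code.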
I set up a JDBC connection in AWS Glue to an RDS database. When I test the connection from the AWS Console, I get an error: Could not find S3 endpoint or NAT gateway for subnetId xxxx. Why does an AWS Glue connection to RDS need an S3 VPC endpoint?
The RDS instance has a security group that is completely open to all IP addresses.
I don't know exactly what it is needed for, but my Glue connection to RDS started working only after I created an S3 endpoint.
VPC → Endpoints
Create S3 endpoint
Service category: AWS services
Service name: com.amazonaws.eu-central-1.s3
VPC: choose the one that your RDS instance is associated with
Route tables: choose the one associated with the VPC's subnets
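The same console steps could be scripted. The sketch below only builds the request parameters for the EC2 create_vpc_endpoint call and prints them; the VPC and route table IDs are placeholders, and treating the endpoint as a gateway endpoint is an assumption based on the route table step above. The actual API call is left as a comment because it needs boto3 and valid credentials.

```python
from typing import Any, Dict, Iterable


def s3_gateway_endpoint_params(
    region: str, vpc_id: str, route_table_ids: Iterable[str]
) -> Dict[str, Any]:
    """Build keyword arguments for ec2.create_vpc_endpoint (gateway type, S3)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


# Placeholder IDs; substitute the VPC and route table used by the RDS subnets.
params = s3_gateway_endpoint_params(
    "eu-central-1", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]
)
print(params)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2", region_name="eu-central-1").create_vpc_endpoint(**params)
```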
I am setting up some AWS Glue jobs and wanted to switch the subnet that the connection was using. However, when I delete the connection, the ENI from Glue is still attached to the subnet and I can't detach it (permission denied). Also, when using "RDS" as the connection type, it defaults to using that subnet. If I create a JDBC connection I can force it to use the other subnet, but the ENI is then attached to both.
How do I get rid of this ENI? Deleting the connection didn't help.
How do I control the auto-population of the subnet when setting up an RDS connection?