AWS Glue Timeout: Creating External Schema In Redshift

I am trying to create an external schema with the command below. Of course, I have changed the names of the components to non-meaningful placeholders to hide my production values:
create external schema sb_external
from data catalog
database 'dev'
iam_role 'arn:aws:iam::490412345678:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift'
create external database if not exists;
The query is run against the Redshift database using the psql CLI from within an EC2 instance. Everything is in private subnets, and the EC2 instance and the Redshift database are in two different VPCs joined by VPC peering. The VPC with the EC2 instance has a Glue endpoint.
Even when I run the above query from the same VPC as the Redshift database, where I have also created an interface endpoint for Glue, I still get the following error:
Failed to perform AWS request, curlError=Failed to connect to glue.eu-west-1.amazonaws.com port 443: Connection timed out
With or without the VPC endpoint, we get the same error.
Any help in this regard would be highly appreciated.

I also faced the same issue and managed to resolve it.
This error occurs when Enhanced VPC routing is enabled on your cluster.
By default, the Glue endpoint uses the default security group.
Since the error references glue.eu-west-1.amazonaws.com, you need to enable DNS hostnames and DNS resolution for your VPC.
Also add an inbound rule for port 443 (HTTPS) to the default security group, with the Redshift cluster's security group as the source.
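For illustration, a rough sketch of that rule with boto3 (untested; the region and security group IDs are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-DEFAULT",  # placeholder: the default SG used by the Glue endpoint
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,  # HTTPS
        "ToPort": 443,
        # Source: the Redshift cluster's security group (placeholder ID)
        "UserIdGroupPairs": [{"GroupId": "sg-REDSHIFT"}],
    }],
)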
A few links that helped me:
- https://docs.aws.amazon.com/glue/latest/dg/vpc-interface-endpoints.html
- https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#vpce-interface-limitations
- https://docs.aws.amazon.com/redshift/latest/mgmt/spectrum-enhanced-vpc.html#spectrum-enhanced-vpc-considerations
"Access to AWS Glue or Amazon Athena
Redshift Spectrum accesses your data catalog in AWS Glue or Athena. Another option is to use a dedicated Hive metastore for your data catalog.
To enable access to AWS Glue or Athena, configure your VPC with an internet gateway or NAT gateway. Configure your VPC security groups to allow outbound traffic to the public endpoints for AWS Glue and Athena. Alternatively, you can configure an interface VPC endpoint for AWS Glue to access your AWS Glue Data Catalog. When you use a VPC interface endpoint, communication between your VPC and AWS Glue is conducted within the AWS network."
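If you go the interface endpoint route, a minimal boto3 sketch of creating it (untested; all IDs are placeholders, and the security group must allow 443 from Redshift):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-XXXX",                            # placeholder
    ServiceName="com.amazonaws.eu-west-1.glue",
    SubnetIds=["subnet-XXXX"],                   # placeholder
    SecurityGroupIds=["sg-XXXX"],                # placeholder
    PrivateDnsEnabled=True,  # so glue.eu-west-1.amazonaws.com resolves to private IPs
)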

A reason for this error message is having Enhanced VPC routing enabled on your Redshift cluster.
Per the documentation at https://aws.amazon.com/premiumsupport/knowledge-center/redshift-enhanced-vpc-routing/, enabling Enhanced VPC routing can affect UNLOAD/COPY commands. Here you are trying to create an external schema, and one potential cause of this error is having that configuration enabled.

If you are using Enhanced VPC routing:
Create interface VPC endpoints for S3 and Glue and, if you use them, Lake Formation and Athena.
Create a gateway VPC endpoint for S3.
Ensure all endpoints allow ingress on port 443 from the Redshift cluster's security group.
Ensure Redshift's security group has egress to the endpoints.
Check that the route table has a route for the S3 gateway endpoint's prefix list (not just an IP route to the S3 interface endpoint).
Check DNS in the VPC, as in the sketch below.
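For the DNS check, a small Python sketch you could run from a host in the VPC (the region and VPC CIDR are assumptions; adjust them to yours):

import ipaddress
import socket

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")  # assumption: your VPC CIDR

def resolves_privately(host: str) -> bool:
    # Resolve the service hostname the way the cluster would
    infos = socket.getaddrinfo(host, 443, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    addrs = {ipaddress.ip_address(info[4][0]) for info in infos}
    # With private DNS enabled on the interface endpoint, these should
    # fall inside the VPC CIDR rather than public IP space
    return all(addr in VPC_CIDR for addr in addrs)

print(resolves_privately("glue.eu-west-1.amazonaws.com"))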

The security group associated with the Redshift cluster needs egress configured to allow outbound traffic.
Example egress configuration:
from port: 0
to port: 0
protocol: -1 (all protocols)
CIDR IP: "0.0.0.0/0"
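An equivalent boto3 sketch (the group ID is a placeholder; note that the default security group already allows all egress):

import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_egress(
    GroupId="sg-REDSHIFT",  # placeholder: the cluster's security group
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)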
References
AWS::EC2::SecurityGroupEgress

Related

Django App in ECS Container Cannot Connect to S3 in Gov Cloud

I have a container running on an EC2 instance in ECS. The container hosts a Django-based application that uses S3 and RDS for its file storage and database needs, respectively. I have appropriately configured my VPC, subnets, VPC endpoints, internet gateway, roles, security groups, and other parameters, such that I am able to host the site, connect to the RDS instance, and even access the site.
The issue is with the connection to S3. When I run the command python manage.py collectstatic --no-input, which should upload/update any new/modified files to S3 as part of the application setup, the program hangs and will not continue. No files are transferred to the already-created S3 bucket.
Details of the set up:
All of the below is hosted on AWS Gov Cloud
VPC and Subnets
1 VPC located in Gov Cloud East with 2 availability zones (AZ) and one private and public subnet in each AZ (4 total subnets)
The 3 default routing tables (1 for each private subnet, and 1 for the two public subnets together)
DNS hostnames and DNS resolution are both enabled
VPC Endpoints
All endpoints have the "vpce-sg" security group attached and are associated to the above vpc
s3 gateway endpoint (set up to use the two private subnet routing tables)
ecr-api interface endpoint
ecr-dkr interface endpoint
ecs-agent interface endpoint
ecs interface endpoint
ecs-telemetry interface endpoint
logs interface endpoint
rds interface endpoint
Security Groups
Elastic Load Balancer Security Group (elb-sg)
Used for the elastic load balancer
Only allows inbound traffic from my local IP
No outbound restrictions
ECS Security Group (ecs-sg)
Used for the EC2 instance in ECS
Allows all traffic from the elb-sg
Allows http:80, https:443 from vpce-sg for s3
Allows postgresql:5432 from vpce-sg for rds
No outbound restrictions
VPC Endpoints Security Group (vpce-sg)
Used for all vpc endpoints
Allows http:80, https:443 from ecs-sg for s3
Allows postgresql:5432 from ecs-sg for rds
No outbound restrictions
Elastic Load Balancer
Set up to use an HTTPS connection with an Amazon-issued (ACM) certificate and a domain managed by GoDaddy, since Gov Cloud Route 53 does not allow public hosted zones
Listener on http permanently redirects to https
Roles
ecsInstanceRole (Used for the EC2 instance on ECS)
Attached policies: AmazonS3FullAccess, AmazonEC2ContainerServiceforEC2Role, AmazonRDSFullAccess
Trust relationships: ec2.amazonaws.com
ecsTaskExecutionRole (Used for executionRole in task definition)
Attached policies: AmazonECSTaskExecutionRolePolicy
Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com
ecsRunTaskRole (Used for taskRole in task definition)
Attached policies: AmazonS3FullAccess, CloudWatchLogsFullAccess, AmazonRDSFullAccess
Trust relationships: ec2.amazonaws.com, ecs-tasks.amazonaws.com
S3 Bucket
Standard bucket set up in the same Gov Cloud region as everything else
Troubleshooting
If I bypass the connection to S3, the application launches successfully and I can connect to the website, but since static files are supposed to be hosted on S3, there is less formatting and images are missing.
Using a bastion instance, I was able to SSH into the EC2 instance running the container and successfully test my connection to S3 from there using aws s3 ls s3://BUCKET_NAME
If I connect to a shell within the application container itself and try to connect to the bucket using...
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)  # BUCKET_NAME is defined elsewhere in the app
s3.meta.client.head_bucket(Bucket=bucket.name)
I receive a timeout error...
File "/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 179, in _new_conn
raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f3da4467190>, 'Connection to BUCKET_NAME.s3.amazonaws.com timed out. (connect timeout=60)')
...
File "/.venv/lib/python3.9/site-packages/botocore/httpsession.py", line 418, in send
raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://BUCKET_NAME.s3.amazonaws.com/"
Based on this article, I think this may have something to do with the fact that I am using the GoDaddy DNS servers, which may be preventing proper URL resolution for S3.
If you're using the Amazon DNS servers, you must enable both DNS hostnames and DNS resolution for your VPC. If you're using your own DNS server, ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS.
I am unsure how to ensure that requests to Amazon S3 resolve correctly to the IP addresses maintained by AWS. Perhaps I need to set up another private DNS on Route 53?
I have tried a very similar set up for this application in AWS non-Gov Cloud using route53 public DNS instead of GoDaddy and there is no issue connecting to S3.
Please let me know if there is any other information I can provide to help.
AWS Region
The issue lies in how boto3 handles different AWS regions. This may be unique to usage on AWS GovCloud. Originally I did not have a region configured for S3, but according to the django-storages docs, an optional setting named AWS_S3_REGION_NAME can be set.
AWS_S3_REGION_NAME (optional: default is None)
Name of the AWS S3 region to use (eg. eu-west-1)
I reached this conclusion thanks to a Stack Overflow answer I was using to try to manually connect to S3 via boto3. I noticed that it included a region_name argument when creating the session, which alerted me to make sure I had appropriately set the region in my app.settings and environment variables.
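For illustration, roughly what that manual check looks like with an explicit region (the region is an assumption for GovCloud; BUCKET_NAME is a placeholder):

import boto3

session = boto3.session.Session(region_name="us-gov-west-1")  # assumption: GovCloud West
s3 = session.resource("s3")
s3.meta.client.head_bucket(Bucket="BUCKET_NAME")  # placeholder bucket name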
If anyone has some background on why this needs to be set for GovCloud functionality but apparently not for commercial, I would be interested to know.
Signature Version
I also had to specify AWS_S3_SIGNATURE_VERSION in app.settings so boto3 knew to use version 4 of the signature. According to the docs:
As of boto3 version 1.13.21 the default signature version used for generating presigned urls is still v2. To be able to access your s3 objects in all regions through presigned urls, explicitly set this to s3v4. Set this to use an alternate version such as s3. Note that only certain regions support the legacy s3 (also known as v2) version.
Some additional information in this Stack Overflow response details that new S3 regions deployed after January 2014 only support signature version 4 (see also the AWS docs notice). Apparently GovCloud is in this group of newly deployed regions.
If you do not specify this, calls to the S3 bucket for static files, such as JS scripts, during operation of the web application will receive a 400 response. S3 responds with the error message:
<Error>
<Code>InvalidRequest</Code>
<Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message>
<RequestId>#########</RequestId>
<HostId>##########</HostId>
</Error>
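Putting both fixes together, a minimal sketch of the relevant django-storages settings (the values are placeholders/assumptions):

# app/settings.py
AWS_STORAGE_BUCKET_NAME = "BUCKET_NAME"  # placeholder
AWS_S3_REGION_NAME = "us-gov-west-1"     # assumption: your GovCloud region
AWS_S3_SIGNATURE_VERSION = "s3v4"        # force Signature Version 4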

Purpose of Redshift Enhanced VPC routing?

What is the purpose of Enhanced VPC routing for Redshift?
I've read the doc https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html but it is not clear to me.
When you create a Redshift cluster, the leader node resides in a VPC/subnet.
Hence it will always use VPC routing, security groups, etc. to route requests, right?
How come Redshift wouldn't use VPC traffic when performing COPY commands?
Enhanced VPC routing forces the traffic to go through your VPC.
With it disabled, even if your cluster is in a VPC, it will route traffic via the public Internet instead of going through the VPC.
This is because it uses an "internal" network interface that's outside of the VPC, regardless of whether or not the cluster itself is in a VPC.
Here's a relevant excerpt from the docs:
In Amazon Redshift, network traffic created by COPY, UNLOAD, and Amazon Redshift Spectrum flow through a network interface. This network interface is internal to the Amazon Redshift cluster, and is located outside of your Amazon Virtual Private Cloud (Amazon VPC). By default, the network traffic is then routed through the public internet to reach its destination.
However, when you enable Amazon Redshift enhanced VPC routing, Amazon Redshift routes the network traffic through a VPC instead. Amazon Redshift enhanced VPC routing uses an available routing option, prioritizing the most specific route for network traffic. The VPC endpoint is prioritized as the first route priority. If a VPC endpoint is unavailable, Amazon Redshift routes the network traffic through an internet gateway, NAT instance, or NAT gateway.
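For reference, a minimal boto3 sketch for toggling the setting (cluster identifier and region are placeholders):

import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")  # placeholder region
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",  # placeholder
    EnhancedVpcRouting=True,  # route COPY/UNLOAD/Spectrum traffic through the VPC
)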

Cannot Create Glue Connection

I setup a JDBC connection in AWS Glue to an RDS database. When I test the connection from AWS Console, I get an error: Could not find S3 endpoint or NAT gateway for subnetId xxxx. Why does AWS Glue connection to RDS need S3 VPC Endpoint?
The RDS instance has a security group that is completely open to all IP addresses.
I don't know exactly why it is needed, but my Glue connection to RDS only started working once I had created the S3 endpoint.
VPC → Endpoints
Create S3 endpoint
Service category: AWS services
Service name: com.amazonaws.eu-central-1.s3
VPC: choose the one your RDS is associated with
Route tables: choose the ones associated with the VPC's subnets
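The same endpoint via boto3, as an untested sketch (the VPC and route table IDs are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-XXXX",                            # the VPC your RDS is in (placeholder)
    ServiceName="com.amazonaws.eu-central-1.s3",
    RouteTableIds=["rtb-XXXX"],                  # route tables of the relevant subnets (placeholder)
)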

Database Connection Error from Glue spark

I need to connect to an RDS PostgreSQL database that sits behind a VPC, in a private subnet, from Glue. I am not able to connect to the database using a Glue connection, which will be used in Spark code in Glue.
If you look at the Glue architecture, it spins up its servers in the VPC, subnet, and security group that you select in the DB connection.
So, if you want to access an RDS instance, ensure that the VPC and subnet can reach the RDS JDBC port.
Follow the link for details of setting up the VPC and subnet for Glue.
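As a sketch, here is where the VPC subnet and security group come in when defining the connection with boto3 (all names and values are placeholders):

import boto3

glue = boto3.client("glue")
glue.create_connection(
    ConnectionInput={
        "Name": "rds-postgres-connection",  # placeholder
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://HOST:5432/DBNAME",  # placeholder
            "USERNAME": "user",    # placeholder
            "PASSWORD": "secret",  # placeholder
        },
        # Glue spins up its workers here, so this subnet/SG must reach the RDS port
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-XXXX",           # placeholder
            "SecurityGroupIdList": ["sg-XXXX"],  # placeholder
            "AvailabilityZone": "eu-west-1a",    # placeholder
        },
    },
)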

How can I connect an AWS Glue connection to MongoDB in Atlas, which is hosted on Google Cloud Platform?

Hi, I am facing trouble crawling Mongo data to S3 using a crawler from AWS Glue. In Mongo Atlas you need to whitelist the IPs it expects connections from. As AWS Glue is serverless, I do not have any fixed IP. Please suggest any solutions for this.
According to the document Connecting to a JDBC Data Store in a VPC, AWS Glue jobs that belong to the specified VPC (where the VPC has a NAT gateway) should have a fixed IP address. For example, after configuring a NAT gateway for a VPC, HTTP requests from an EC2 server in that VPC have a fixed IP address.
I've not tested this for Glue, but how about setting a VPC and that NAT gateway for the Glue job?
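A rough boto3 sketch of obtaining that fixed address (untested for Glue; the subnet ID is a placeholder):

import boto3

ec2 = boto3.client("ec2")
# Allocate an Elastic IP -- this is the fixed address you would whitelist in Atlas
eip = ec2.allocate_address(Domain="vpc")
# Attach a NAT gateway in a public subnet of the Glue connection's VPC
ec2.create_nat_gateway(
    SubnetId="subnet-PUBLIC",  # placeholder: a public subnet
    AllocationId=eip["AllocationId"],
)
print("Whitelist this IP in Atlas:", eip["PublicIp"])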