Redshift Enhanced VPC Routing - amazon-web-services

Question: What are the downsides (if any) to enabling Enhanced VPC Routing on an Amazon Redshift cluster?
According to the documentation, there is no extra charge and traffic is prevented from traveling over the public internet. Why wouldn't this be a default option always enabled?
https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html
If Enhanced VPC Routing is not enabled, Amazon Redshift routes traffic through the internet, including traffic to other services within the AWS network.
There is no additional charge for using Enhanced VPC Routing. You might incur additional data transfer charges for certain operations. These include such operations as UNLOAD to Amazon S3 in a different AWS Region, COPY from Amazon EMR, or Secure Shell (SSH) with public IP addresses.
Background: We have a Redshift cluster that intermittently drops ODBC connections with a TCP reset, but only when enhanced VPC routing is enabled.

The Query Editor in the Redshift console does not support clusters with Enhanced VPC Routing enabled. That is the only downside that I know of.
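If you do decide to enable it, the setting is exposed through the API, so you can audit and toggle it per cluster. Here is a minimal boto3 sketch (the cluster identifier and region are placeholders; note that changing this setting restarts the cluster):

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Placeholder cluster identifier.
    cluster = redshift.describe_clusters(
        ClusterIdentifier="my-cluster"
    )["Clusters"][0]
    print("EnhancedVpcRouting:", cluster["EnhancedVpcRouting"])

    if not cluster["EnhancedVpcRouting"]:
        # ModifyCluster takes a plain boolean; the change triggers a
        # cluster restart, so schedule it in a maintenance window.
        redshift.modify_cluster(
            ClusterIdentifier="my-cluster",
            EnhancedVpcRouting=True,
        )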

Related

Purpose of Redshift Enhanced VPC routing?

What is the purpose of Enhanced VPC routing for Redshift?
I've read the doc https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html
but it is not clear to me.
When you create a Redshift cluster, the leader node resides in a VPC/subnet.
Hence it will always use VPC routing, security groups, etc. to route requests, right?
How come Redshift wouldn't use VPC traffic when performing "COPY" commands?
Enhanced VPC routing forces the traffic to go through your VPC.
With it disabled, even if your cluster is in a VPC, it will route traffic via the public Internet instead of going through the VPC.
This is because it uses an "internal" network interface that's outside of the VPC, regardless of whether or not the cluster itself is in a VPC.
Here's a relevant excerpt from the docs:
In Amazon Redshift, network traffic created by COPY, UNLOAD, and Amazon Redshift Spectrum flows through a network interface. This network interface is internal to the Amazon Redshift cluster, and is located outside of your Amazon Virtual Private Cloud (Amazon VPC). By default, the network traffic is then routed through the public internet to reach its destination.
However, when you enable Amazon Redshift enhanced VPC routing, Amazon Redshift routes the network traffic through a VPC instead. Amazon Redshift enhanced VPC routing uses an available routing option, prioritizing the most specific route for network traffic. The VPC endpoint is prioritized as the first route priority. If a VPC endpoint is unavailable, Amazon Redshift routes the network traffic through an internet gateway, NAT instance, or NAT gateway.
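The practical catch is that your VPC must actually have one of those routes (VPC endpoint, NAT, or internet gateway), or COPY/UNLOAD/Spectrum traffic has nowhere to go; in a locked-down private subnet the usual fix is an S3 gateway endpoint. As a rough sketch (boto3; the VPC ID, route table ID, and region are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Gateway endpoints for S3 are free and add an S3 prefix-list route
    # to the route tables you name, keeping COPY/UNLOAD traffic in the VPC.
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",            # placeholder
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])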

How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).
Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide something called Glue Database Connections, which, when used with the Glue SDK, magically set up elastic network interfaces inside the specified VPC for Glue/Spark worker nodes. The network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.
Is there a reliable way to setup a Glue -> VPC connection that will tunnel all traffic through a VPC?
You can create a database connection with NETWORK connection type and use that connection in your Glue job. It will allow your job to call a REST API or any other resource within your VPC.
https://docs.aws.amazon.com/glue/latest/dg/connection-using.html
Network (designates a connection to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC))
https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html
To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks.
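As a rough illustration, the self-referencing rule can be added like this (boto3; the security group ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")
    sg_id = "sg-0123456789abcdef0"  # placeholder

    # Inbound rule for all TCP ports whose source is the group itself,
    # so Glue components can talk to each other without opening the
    # group to outside networks.
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": sg_id}],
        }],
    )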
However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.
I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC, since you have to have a NAT Gateway or VPC endpoints to allow Glue to access things outside the VPC once you have configured it with VPC access:
All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access Amazon S3 from within your VPC, a VPC endpoint is required. If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.
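For completeness, here is a minimal sketch of creating the NETWORK connection itself (boto3; all identifiers are placeholders). You then attach this connection to your Glue job, and its workers get elastic network interfaces in that subnet:

    import boto3

    glue = boto3.client("glue")

    # A NETWORK connection carries no database location or credentials;
    # it only pins the job's ENIs to a subnet and security group in your VPC.
    glue.create_connection(
        ConnectionInput={
            "Name": "my-vpc-network-connection",  # placeholder
            "ConnectionType": "NETWORK",
            "ConnectionProperties": {},  # none needed for NETWORK
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",           # placeholder
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
                "AvailabilityZone": "us-east-1a",
            },
        }
    )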

Within AWS and AWS to On-Premise private connectivity

I have done a clean sweep of the AWS docs but couldn't find an answer to my scenario. I'm looking for a solution wherein I will have private connectivity (no data flows through the Internet, only within the AWS network) between my two VPCs, and VPC to on-premises connectivity. I'm aware of AWS PrivateLink and Direct Connect, but they have some limitations, e.g., an RDS instance cannot be exposed as an endpoint service to be consumed, and things like that.
Is there any way I can achieve the above ?
AWS Transit Gateway allows you to set up direct networking between VPCs and your on-premises environment. It supports both VPN and Direct Connect for the on-premises leg of the connection.
https://aws.amazon.com/transit-gateway/
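As a rough sketch of the VPC side (boto3; IDs are placeholders, and the on-premises leg via VPN or Direct Connect is attached separately):

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    tgw = ec2.create_transit_gateway(
        Description="hub for VPC-to-VPC and on-premises traffic"
    )["TransitGateway"]

    # The gateway is created asynchronously; wait until it is usable.
    while ec2.describe_transit_gateways(
        TransitGatewayIds=[tgw["TransitGatewayId"]]
    )["TransitGateways"][0]["State"] != "available":
        time.sleep(15)

    # Repeat this attachment for each VPC that should join the hub.
    ec2.create_transit_gateway_vpc_attachment(
        TransitGatewayId=tgw["TransitGatewayId"],
        VpcId="vpc-0123456789abcdef0",           # placeholder
        SubnetIds=["subnet-0123456789abcdef0"],  # placeholder
    )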

Can you explain AWS billing rates on data transfer

To design a system, I need to decide where to deploy the instances (suppose I don't really care where they are and only want to optimize costs).
The on-demand page mentions several billing items:
Data Transfer IN To Amazon EC2 From Internet
Data Transfer OUT From Amazon EC2 To Internet
Data Transfer OUT From Amazon EC2 To (a list of regions)
Data Transfer Across AZ within this Region
My questions:
About item 1: they say this is free, is it? Does it make sense that Internet-to-Amazon traffic is free while Amazon-to-Amazon traffic is not? (I'm talking about the inbound data here, not the outbound.)
In items 2-3: does "Amazon" refer to all AWS services, including another EC2 instance?
Regarding item 4: it is written "Data transferred "in" to and "out" of Amazon EC2, Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon ElastiCache instances or Elastic Network Interfaces across VPC peering connections in the same AWS region is charged at $0.01/GB." Does that mean that if I run a process between two EC2 instances in the same region, I pay for each GB twice: first for the outbound from one instance and second for the inbound on the other?
The simple rules-of-thumb are:
Inbound traffic from the Internet to the AWS Cloud is free.
Outbound traffic from the AWS Cloud to the Internet is charged at the applicable rates in each region (this is the majority of the cost). This applies to anything that sends traffic out to the Internet from your AWS services.
Outbound traffic from the AWS Cloud to Amazon CloudFront is charged at a lower rate.
Traffic within a region but between Availability Zones is charged at $0.01/GB in each direction. In fact, the wording on the EC2 Instance Pricing page now shows this.
To answer your specific questions:
Inbound is free
Outbound is for any AWS service that sends traffic to the Internet
Traffic between AZs or via VPC Peering is charged in "each direction"
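To make "each direction" concrete, here is a toy calculation (rates as quoted above; always verify against the current pricing page):

    # Cross-AZ traffic is billed on both sides: moving 500 GB between two
    # EC2 instances in different AZs charges 500 GB out of instance A and
    # 500 GB in to instance B.
    RATE_PER_GB_EACH_DIRECTION = 0.01  # USD, rate quoted above

    gb_moved = 500
    cost = gb_moved * RATE_PER_GB_EACH_DIRECTION * 2  # out + in
    print(f"Cross-AZ transfer of {gb_moved} GB costs ${cost:.2f}")  # $10.00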

Secure data ingestion in Apache Spark on AWS EC2

I am planning to launch a Spark cluster on AWS EC2 instances using the spark-ec2 scripts (https://spark.apache.org/docs/1.6.2/ec2-scripts.html). This is planned to be in a private subnet in a custom VPC.
With this background I see two options for secure data ingestion from the Internet:
Use S3 as the landing area and move data to the Spark master node through a VPC S3 endpoint. There will be costs associated with POST/GET requests.
Use a NAT instance in a separate public subnet and land the data directly on the master node of the Spark cluster. There will be no costs apart from the extra EC2 NAT instance/NAT gateway (see the sketch below).
Do you consider both options secure? If so, which one would you prefer?
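For reference, provisioning the NAT gateway from option 2 looks roughly like this (boto3; all IDs are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Allocate an Elastic IP and create the NAT gateway in the public subnet.
    eip = ec2.allocate_address(Domain="vpc")
    nat = ec2.create_nat_gateway(
        SubnetId="subnet-0123456789abcdef0",  # placeholder: public subnet
        AllocationId=eip["AllocationId"],
    )["NatGateway"]

    # Wait until the gateway is usable before wiring up routes.
    ec2.get_waiter("nat_gateway_available").wait(
        NatGatewayIds=[nat["NatGatewayId"]]
    )

    # Point the private subnet's default route at the NAT gateway.
    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",  # placeholder: private route table
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat["NatGatewayId"],
    )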