Secure data ingestion in Apache Spark on AWS EC2

I am planning to launch a Spark cluster on AWS EC2 instances using the spark-ec2 scripts (https://spark.apache.org/docs/1.6.2/ec2-scripts.html). This will be in a private subnet in a custom VPC.
With this background, I see two options for secure data ingestion from the Internet:
1. Use S3 as the landing area and move data to the Spark master node through a VPC S3 endpoint. There will be costs associated with POST/GET requests.
2. Use a NAT instance in a separate public subnet and land the data directly on the master node of the Spark cluster. There will be no costs apart from the extra EC2 NAT instance/NAT gateway.
Do you consider both options secure? If so, which one would you prefer?
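For context, here is a minimal sketch of what option 1 would look like from the master node once the S3 gateway endpoint route is in place; the region, bucket, key, and local path below are placeholders:

import boto3

# Runs on the Spark master node. With an S3 gateway endpoint in the VPC's
# route table, this request stays on the AWS network. Credentials come from
# the instance profile attached to the node.
s3 = boto3.client("s3", region_name="us-east-1")  # placeholder region

# Hypothetical landing-area bucket and key.
s3.download_file("my-landing-bucket", "incoming/data.csv", "/data/incoming/data.csv")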

Related

Accessing S3 from inside EKS using boto3

I have a Python application deployed on EKS (Elastic Kubernetes Service). This application saves large files inside an S3 bucket using the AWS SDK for Python (boto3). Both the EKS cluster and the S3 bucket are in the same region.
My question is, how is communication between the two services (EKS and S3) handled by default?
Do both services communicate directly and internally through the Amazon network, or do they communicate externally via the Internet?
If they communicate via the Internet, is there a step-by-step guide on how to establish a direct internal connection between both services?
how is communication between the two services (EKS and S3) handled by default?
By default, the network topology of your EKS cluster provides a route to the public AWS S3 endpoints.
Do both services communicate directly and internally through the Amazon network, or do they communicate externally via the Internet?
Your cluster needs network access to those public AWS S3 endpoints: for example, worker nodes running in a public subnet, or a NAT gateway serving a private subnet.
...is there a step by step guide on how to establish a direct internal connection between both services?
You can create VPC endpoints for S3 in the VPC where your EKS cluster runs to ensure that network communication with S3 stays within the AWS network. VPC endpoints for S3 support both interface and gateway types. Try this article to learn the basics of S3 endpoints; you can use the same method to create endpoints in the VPC where your EKS cluster runs. Requests to S3 from your pods will then use the endpoint to reach S3 within the AWS network.
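For example, a minimal sketch of creating a gateway endpoint for S3 with boto3; the VPC and route table IDs are placeholders, and the service name follows the com.amazonaws.<region>.s3 pattern:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# A gateway endpoint is associated with route tables; traffic to S3 from
# subnets using those tables then stays on the AWS network.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder: your EKS VPC
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: your route table
)
print(response["VpcEndpoint"]["VpcEndpointId"])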
You can add S3 access to your EKS node IAM role. This link shows how to add ECR registry access to the EKS node IAM role, but the process is the same for S3.
The other way is to make credentials available as environment variables in your container (see this link), though I would recommend the first approach.
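As a sketch of the first approach, you could attach an inline S3 policy to the node role with boto3; the role name and bucket ARNs below are placeholders:

import json
import boto3

iam = boto3.client("iam")

# Inline policy granting the worker nodes read/write access to one bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-bucket",    # placeholder bucket
            "arn:aws:s3:::my-bucket/*",
        ],
    }],
}

iam.put_role_policy(
    RoleName="eks-node-instance-role",   # placeholder: your node role
    PolicyName="eks-node-s3-access",
    PolicyDocument=json.dumps(policy),
)

Pods then pick up these permissions through the node's instance profile via the default boto3 credential chain, with no keys in the container.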

AWS VPC endpoints - how do they work?

I am trying to understand how VPC endpoints work, and I am not sure that I understand the AWS documentation. For example, I have a private S3 bucket and an EKS cluster. Since my bucket is private, I believe that traffic from the EKS cluster to S3 does not go through the Internet, but only through the AWS network. But if my S3 bucket were public, then I would probably need to set up a VPC endpoint so that traffic does not leave AWS. I would expect the same logic with ECR: if it is private, you pull images into your EKS cluster through the AWS network.
So what is the exact case when you need to use VPC endpoint within your AWS account (not from on-prem or another VPC)?
VPC endpoints are typically used with public AWS services (such as S3, DynamoDB, ECR, etc.) when the client applications are hosted inside your VPC and you do not want to route traffic via the public Internet, which would otherwise result in a number of hops to reach the AWS service.
Imagine a situation where you have an app running on an EC2 instance deployed to a private subnet of your VPC (e.g. a Pod in your EKS cluster). This app reads/writes data from/to AWS S3. If you do not use a VPC endpoint, your traffic will first reach your NAT gateway, then your VPC's Internet gateway, and go out to the public Internet. Eventually, it will hit AWS S3. The response will travel back via the same route.
The same applies to ECR (e.g. when the kubelet starts a new instance of your Kubernetes Pod). It is better (i.e. quicker) to pick the shortest route to download a Docker image from ECR rather than traverse a number of switches/routers. With a VPC endpoint, your traffic will first hit the VPC endpoint (without leaving your private subnet) and then reach e.g. ECR directly; the traffic never leaves the Amazon network.
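You can see this in the route table itself: a gateway endpoint shows up as a route whose destination is an AWS-managed prefix list rather than a CIDR block. A small sketch with boto3 (the route table ID and region are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Gateway endpoints for S3 appear as routes whose destination is an
# AWS-managed prefix list (pl-...) pointing at the endpoint (vpce-...).
tables = ec2.describe_route_tables(RouteTableIds=["rtb-0123456789abcdef0"])
for route in tables["RouteTables"][0]["Routes"]:
    if "DestinationPrefixListId" in route:
        print(route["DestinationPrefixListId"], "->", route.get("GatewayId"))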
As correctly mentioned by @jarmod, one should differentiate between routing (Layer 3 in the OSI model) and authentication/authorization (Layer 7). For example, you can use a VPC endpoint to reach AWS S3 but still not be authorized (or even authenticated) to read a file from an S3 bucket.
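To make that distinction concrete, here is a hedged sketch: the request below can reach S3 through the endpoint at Layer 3 and still be refused at the application layer (the bucket and key are hypothetical):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    # Routing succeeds via the VPC endpoint, but authorization is decided
    # separately by IAM policies and the bucket policy.
    s3.get_object(Bucket="someone-elses-bucket", Key="file.txt")
except ClientError as err:
    # e.g. AccessDenied: the packet arrived, the request was refused.
    print(err.response["Error"]["Code"])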
Hope this clarifies the idea behind using VPC endpoints.

Purpose of Redshift Enhanced VPC routing?

What is the purpose of Enhanced VPC routing for Redshift?
I've read the doc https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html
but it is not clear to me.
When you create a Redshift cluster, the leader node resides in a VPC/subnet.
Hence it will always use VPC routing, security groups, etc. to route requests, right?
How come Redshift wouldn't use VPC traffic when performing COPY commands?
Enhanced VPC routing forces the traffic to go through your VPC.
With it disabled, even if your cluster is in a VPC, it will route traffic via the public Internet instead of going through the VPC.
This is because it uses an "internal" network interface that's outside of the VPC, regardless of whether or not the cluster itself is in a VPC.
Here's a relevant excerpt from the docs:
In Amazon Redshift, network traffic created by COPY, UNLOAD, and Amazon Redshift Spectrum flow through a network interface. This network interface is internal to the Amazon Redshift cluster, and is located outside of your Amazon Virtual Private Cloud (Amazon VPC). By default, the network traffic is then routed through the public internet to reach its destination.
However, when you enable Amazon Redshift enhanced VPC routing, Amazon Redshift routes the network traffic through a VPC instead. Amazon Redshift enhanced VPC routing uses an available routing option, prioritizing the most specific route for network traffic. The VPC endpoint is prioritized as the first route priority. If a VPC endpoint is unavailable, Amazon Redshift routes the network traffic through an internet gateway, NAT instance, or NAT gateway.
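Enabling it on an existing cluster is a single flag; a minimal sketch with boto3 (the cluster identifier and region are placeholders). Note that modifying this setting may briefly restart the cluster:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # placeholder region

# Forces COPY/UNLOAD/Spectrum traffic through the VPC instead of the
# cluster's internal, internet-routed network interface.
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",  # placeholder
    EnhancedVpcRouting=True,
)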

How to access Amazon DynamoDB service through a private VPC endpoint from another region?

We have two regions, primary and secondary, where each VPC is configured so that the EC2 instances in that VPC make requests to a private VPC endpoint serving DynamoDB from that region. Our Amazon DynamoDB tables are global tables. The goal is to have our requests stay within the Amazon network for security reasons.
We have a scheduled task that runs on an EC2 instance in our primary region. We want to make it more resilient by having it fail over DynamoDB requests to the secondary region in the event that the primary region's DynamoDB service is degraded. This was recommended by AWS in the Availability and Durability section.
I've looked through these documentations: Endpoints for Amazon DynamoDB and Using Amazon VPC Endpoints to Access DynamoDB, but they don't seem to offer any solution. Is it even possible to make requests to a private VPC endpoint from another region?
The goal is to have multi-region resilient and good security by not having requests going out to the internet.
Unfortunately, this isn't possible, per the documentation at https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-ddb.html:
Endpoints currently do not support cross-region requests—ensure that you create your endpoint in the same Region as your DynamoDB tables.
Also documented here: https://docs.aws.amazon.com/vpc/latest/privatelink/vpce-gateway.html#vpc-endpoints-limitations
Endpoint connections cannot be extended out of a VPC. Resources on the other side of a VPN connection, VPC peering connection, transit gateway, AWS Direct Connect connection, or ClassicLink connection in your VPC cannot use the endpoint to communicate with resources in the endpoint service.
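Given that limitation, one pragmatic pattern is to keep the in-region gateway endpoint for the primary region and fall back to the secondary region's regional endpoint on failure, accepting that the fallback leg cannot use the remote region's VPC endpoint. A hedged sketch with boto3 against a global table (regions and table name are placeholders):

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"    # placeholder
SECONDARY_REGION = "us-west-2"  # placeholder

def get_item_with_failover(key):
    # Primary first: traffic uses the in-VPC gateway endpoint. On failure,
    # the global table replica in the secondary region serves the read,
    # though per the limitation quoted above that request cannot ride the
    # remote region's VPC endpoint from this VPC.
    for region in (PRIMARY_REGION, SECONDARY_REGION):
        table = boto3.resource("dynamodb", region_name=region).Table("my-global-table")
        try:
            return table.get_item(Key=key)
        except (ClientError, EndpointConnectionError):
            continue
    raise RuntimeError("DynamoDB unavailable in both regions")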

How can AWS Glue access IP whitelisted resource

If I have a service that requires IP whitelisting, how can I connect AWS Glue to it? I read that I can put AWS Glue in a private VPC and configure a NAT gateway, and then allow my NAT IP to connect to the service. However, I cannot find any way to configure my Glue job to run inside a subnet/VPC. How do I do this?
The job will run in a VPC automatically if you attach a database connection to a resource inside the VPC. For example, I have a job that reads data from S3 and writes into an Aurora database in a private VPC using a Glue connection (configured as JDBC).
That job automatically has access to all the resources inside the VPC, as explained here. If the VPC has NAT enabled for external access, then your job can also take advantage of that.
Note that if you use a connection that requires a VPC and you also use S3, you will need to enable an S3 endpoint in that VPC as well.
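For reference, attaching the connection when you define the job is what pins it to the VPC. A minimal sketch with boto3 (all names, ARNs, and paths are placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# Attaching a connection makes the job run inside that connection's
# VPC/subnet, so outbound traffic can egress via the subnet's NAT
# gateway, whose Elastic IP you can whitelist.
glue.create_job(
    Name="my-vpc-job",                                      # placeholder
    Role="arn:aws:iam::123456789012:role/GlueJobRole",      # placeholder
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder
    },
    Connections={"Connections": ["my-jdbc-connection"]},    # placeholder
)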
Your question is answered here: https://stackoverflow.com/a/64414639. Note that Glue is a "managed" service, so it does not publish a list of IP addresses that can be whitelisted. As a workaround, you can use an EC2 instance to run your custom Python or PySpark script and whitelist the IP address of that particular EC2 instance.