Can Elasticsearch 1.x Discover Clusters on an IP Range? - amazon-web-services

I have two Elasticsearch nodes set up on AWS EC2 servers, sitting behind a load balancer. I want to configure AWS to start up new instances when the load on the load balancer hits a certain threshold. Currently I have the Elasticsearch instances speaking to each other with the following setting in config/elasticsearch.yml:
discovery.zen.ping.unicast.hosts: ["172.x.x.x","172.x.x.x"]
Since a new instance will start up on an indeterminate private IP within the subnet, I was wondering whether it is possible to point the config at an IP range so that new nodes can be discovered as they start up. Are there security implications to doing this?
If not, is there another way to achieve the same outcome? I am aware that Elasticsearch is meant to handle load balancing and scaling by itself; however, this is part of a larger solution and is specified in my project partner's security requirements.
The Question
When hosting Elasticsearch on AWS EC2:
Can you set up Elasticsearch discovery to use an IP range?
Are there security implications to doing this when using AWS?
If not, what is the appropriate way to handle this scenario?

I found this out myself. For anyone else who happens upon this, here is what I did.
I set up a basic instance with Elasticsearch, Apache and Java installed, and made a simple Elasticsearch startup script so that it would start on system boot.
I set the node name on my ES instances to node.name: ${HOSTNAME} so that each node is identifiable in the cluster and has a unique name in case I ever need to manually log into the server to check anything.
The big gotcha is that you have to use Elasticsearch's cloud-aws plugin, matched to the version of Elasticsearch that you're running. It can be found here.
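For reference, installing the plugin is a one-liner, but the plugin CLI syntax changed between Elasticsearch versions, so treat the commands below as a sketch and check the plugin's README for the release pinned to your ES version:

# Elasticsearch 2.x style plugin CLI, run from the ES home directory
bin/plugin install cloud-aws
# Elasticsearch 1.x style CLI takes the full plugin coordinates instead, roughly:
# bin/plugin -install elasticsearch/elasticsearch-cloud-aws/<plugin version matching your ES release>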
Using cloud-aws-specific settings in elasticsearch.yml, you can make it so that ES isn't dependent on a fixed IP address to join a cluster; it can instead discover the cluster based on instance tags, subnets, or anything else that is unique but determinable about your new instances in AWS.
I found this article very helpful. The main points from the article that are helpful are below.
AWS IAM Role
An IAM role needs to be set up and attached to your new instances so that they can communicate with AWS and check other instances for certain values. The role should have these permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeTags",
        "ec2:DescribeRegions",
        "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
    }
  ]
}
Elasticsearch config
After installing cloud-aws, you will be able to use new settings in your elasticsearch.yml file. First, remove discovery.zen.ping.unicast.hosts, as it is no longer needed. The most basic setup to get this working is:
cloud:
  aws:
    access_key: <YOUR AWS KEY>
    secret_key: <YOUR AWS SECRET>
discovery:
  type: ec2
But a more tuned configuration is below. It will look for instances with a tag of esdiscovery:enabled:
discovery.zen.hosts_provider: ec2
discovery.ec2.tag.esdiscovery: enabled
cloud:
  aws:
    access_key: <YOUR AWS KEY>
    secret_key: <YOUR AWS SECRET>
    region: <YOUR AWS REGION>
discovery:
  type: ec2
If you're using EU West, then you need to replace the region setting with the following explicit endpoint config:
cloud:
  aws:
    ec2:
      endpoint: ec2.eu-west-2.amazonaws.com
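For the tag-based filter above to match, the instances themselves need the esdiscovery tag set. One way to do that (instance ID is a placeholder) is via the AWS CLI:

# tag an instance so the discovery.ec2.tag.esdiscovery filter picks it up
aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=esdiscovery,Value=enabled

In practice you would set the tag in the launch configuration or Auto Scaling group so every new instance comes up already tagged.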

Related

ECS Fargate task fails: CannotPullContainerError: inspect image has been retried 5 time(s): httpReaderSeeker: failed open: unexpected status code

We have other ECS services running which use images from our private ECR repo. However, for our services in the same cluster which are trying to pull from Docker Hub, we are getting the following error:
CannotPullContainerError: inspect image has been retried 5 time(s): httpReaderSeeker: failed open: unexpected status code https://registry-1.docker.io...: 4...
(The message itself is truncated at the end: it is literally "4...").
The fact that it's getting a status code back suggests that it is able to talk to Docker Hub and that this is not a network connectivity issue within our AWS configuration. We are trying to use images from public repos in our ECS task: one is a Redis image and another is a Hasura image. I'm not sure how to see the status code itself since it's truncated in the AWS console.
When I hit the URL from the error in my browser this is the response:
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":[{"Type":"repository","Class":"","Name":"hasura/graphql-engine","Action":"pull"}]}]}
I get a similar response with the Redis image. I didn't think we needed any authentication to pull public images - we've run ECS Tasks in the past without requiring authentication to Docker Hub?
For completeness, I've included below the checks for troubleshooting this error; however, as mentioned, since we're getting a response code from Docker Hub, these checks don't look relevant.
AWS has this guide to troubleshoot 'CannotPullContainer' errors and for this particular error on Fargate there is this guide. Here are the things from the guide we have checked:
Confirm that your VPC networking configuration allows your Amazon ECS infrastructure to reach the image repository
This ECS task was in a private subnet, and its route table had the following routes:
10.0.0.0/16 -> local (active)
0.0.0.0/0 -> NAT Gateway (active)
The NAT Gateway has status available and an Elastic IP address assigned.
Check the VPC DHCP Option Set
Looking at the VPC's DHCP options set, we can see that Domain name servers is set to 'AmazonProvidedDNS'.
Check the task execution role permissions
More details about configuring this are in this guide.
The same IAM role is used in the task definition for both the 'task role' and the 'task execution role'. It has been configured with the following default policy, as defined in the guide mentioned:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogStream",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Resource": "*"
    }
  ]
}
Check that the image exists
This is the image we're trying to pull from Docker Hub. The image exists and I can pull it from my local machine without having to authenticate.
I faced the exact same issue and spent a whole day looking for an error in my configuration.
It turned out Docker Hub was rate-limiting image pulls from my ECS task: https://www.docker.com/increase-rate-limits.
If you pull images from Docker Hub and face the same issue, just wait for six hours, or create a new cluster so that you get a new IP address.
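If you want to confirm that the rate limit is what you're hitting, Docker documents a check against a test image whose response headers report your remaining anonymous pull allowance; roughly as follows (assumes curl and jq, and should be run from behind the same NAT gateway as the tasks, since anonymous limits are per source IP):

TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -s --head -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit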
For the Docker rate-limiting problem, another solution is to host a copy of the container image in ECR. The container I wanted had a sample Dockerfile which I was able to simply build. One caveat is that you will want to periodically rebuild this image to pick up future releases.
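A rough sketch of that mirroring flow with the AWS CLI and Docker (account ID, region, and tag are placeholders):

AWS_ACCOUNT=123456789012
REGION=us-east-1
# create a private ECR repo and log Docker in to it
aws ecr create-repository --repository-name hasura/graphql-engine --region $REGION
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com
# pull from Docker Hub, retag, and push to ECR
docker pull hasura/graphql-engine:latest
docker tag hasura/graphql-engine:latest $AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com/hasura/graphql-engine:latest
docker push $AWS_ACCOUNT.dkr.ecr.$REGION.amazonaws.com/hasura/graphql-engine:latest

The task definition's image field then points at the ECR URI instead of the Docker Hub name.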
Another option is ECR's 'pull through cache' feature, which acts as a caching layer in a private ECR repo. I haven't tried that approach myself.
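For anyone exploring that route, creating the cache rule looks roughly like the following (prefix and secret ARN are placeholders; I believe a Docker Hub upstream additionally requires Docker Hub credentials stored in Secrets Manager, unlike public upstreams such as ECR Public):

aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix docker-hub \
  --upstream-registry-url registry-1.docker.io \
  --credential-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:ecr-pullthroughcache/docker-hub

Images are then referenced as <account>.dkr.ecr.<region>.amazonaws.com/docker-hub/hasura/graphql-engine:latest in the task definition, and ECR caches them on first pull.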

AWS on Terraform: Error deleting resource: timeout while waiting for state to become 'destroyed'

I'm using Terraform (v0.12.28) to launch my AWS environment (aws provider v2.70.0).
When I try to remove all resources with terraform destroy I'm facing the error below:
error deleting subnet (subnet-XXX): timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)
I can add my Terraform code, but I think there is nothing special in my resource stack, which basically includes:
VPC and Subnets.
Internet and NAT GTWs.
Application Load Balancers.
Route tables.
Auto-generated NACL and Elastic Network Interfaces (ENIs).
In my case, the problem seems to be related to the ENIs attached to the ALBs, as can be seen in the AWS console.
While searching for solutions, I noticed that this is a common problem which can show up with different resources and types of dependencies.
In this question I'll focus on problems related to VPC components (subnets, ENIs, etc.) and the resources that depend on them (load balancers, EC2 instances, Lambda functions, etc.), which fail to be deleted, probably because a detach step is required before deletion.
Any help will be highly appreciated.
(*) The Terraform user for this environment (DEV) has full Admin privileges:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
So this shouldn't be related to policies.
Examples of related issues:
Update: Issue affecting HashiCorp Terraform resource deletions after the VPC Improvements to AWS Lambda (the solution doesn't work; I have an updated version of the AWS provider).
AWS VPC - cannot detach "in use" AWS Lambda VPC ENI
Lambda Associated EC2 Subnet and Security Group Deletion Issues and Improvements
AWS: deletion of subnet times out because of scaling group
Error waiting for route table (rtb-xxxxxx) to become destroyed: timeout while waiting for state to become
Error waiting for internet gateway to detach / Cluster has node groups attached
I ran into this issue while trying to destroy an EKS cluster after I had already deployed services onto the cluster, specifically a load balancer. To fix this I manually deleted the load balancer and the security group associated with it.
Terraform is not aware of the resources provisioned by k8s and will not clean up dependent resources.
If you're unsure what resources are preventing Terraform from destroying infrastructure, you can try any of the following:
Use terraform apply to get back into a good state and then use kubectl to clean up resources before running terraform destroy again.
This knowledge base article includes a script you can run to identify dependencies: https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-dependency-error-delete-vpc/
Review CloudTrail logs to see what resources were created. If this was an issue with EKS you can filter by username: AmazonEKS.
Another variation of this issue is a DependencyViolation error. Ex:
Error deleting VPC: DependencyViolation: The vpc 'vpc-xxxxx' has dependencies and cannot be deleted. status code: 400
I ran into this issue just now…
One (hacky) solution is to attempt to delete the subnet through the AWS Console. AWS will then tell you what is preventing the subnet from being deleted—for me, it was two network interfaces that needed to be detached and then deleted before Terraform had the power to delete my subnets.
Like Mathew Tinsley says, there are sometimes associated resources created implicitly by AWS that Terraform can't destroy by itself.
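If you'd rather do the same hunt from the CLI, something along these lines lists the ENIs still sitting in a subnet and then removes a leftover one (IDs are placeholders; check the Description of each ENI before deleting anything):

# list ENIs that are blocking the subnet deletion
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-0123456789abcdef0 \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Description:Description,Status:Status,Attachment:Attachment.AttachmentId}'
# detach (if attached) and delete a leftover interface
aws ec2 detach-network-interface --attachment-id eni-attach-0123456789abcdef0
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0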
I had a similar issue while destroying a Step Function: the problem was that it had active executions (status: Running). I stopped them and the Step Function then deleted successfully.

How to update AWS ECS Container Agent on Fargate launch type instances

I am trying to configure AWS ECS using awsvpc mode with an IAM role to use specifically for tasks. Our ECS tasks use the Fargate launch type. After specifying a task IAM role in the task configuration, we ssh into our task, try to run AWS CLI commands, and get the following error:
Unable to locate credentials. You can configure credentials by running "aws configure".
In order to troubleshoot, we ran the same Docker image in a container with the EC2 launch type, and when we ran the same AWS CLI command it errored, saying the assumed role does not have sufficient permissions. We noticed that this was because it was assuming the container instance IAM role rather than the task IAM role.
Based on the documentation here, it is clear that when using awsvpc networking mode, we need to set the ECS_AWSVPC_BLOCK_IMDS agent configuration variable to true in the agent configuration file and restart the agent in order for our instances to assume the Task IAM role rather than the container instance IAM role.
For the time being, for performance testing purposes, we need to deploy with the Fargate launch type and according to the docs, the container agent should be installed automatically for Fargate:
The Amazon ECS container agent is installed on the AWS managed infrastructure used for tasks using the Fargate launch type. If you are only using tasks with the Fargate launch type no additional configuration is needed and the content in this topic does not apply.
However, we still need to be able to assume our task IAM role. Is there a way to update the necessary environment variable in the AWS-managed agent configuration file so as to allow the assuming of the task IAM role? Or is there another way to allow this?
When creating the task definition for your Fargate task, are you assigning a Task Role ARN? There are two IAM ARNs needed. The Execution Role ARN is the IAM role used to start the container in your Fargate cluster; its permissions are used to set up the CloudWatch logs and possibly pull an image from ECR. The Task Role ARN is the IAM role that the container itself has. Make sure the Task Role ARN has the ECS trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
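For completeness, the two roles live on different fields of the task definition; a trimmed JSON sketch with placeholder ARNs (only the two role fields matter here):

{
  "family": "my-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myAppTaskRole",
  "containerDefinitions": [ ... ]
}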

How to use AWS ECS Task Role in Node AWS SDK code

Code that uses the AWS Node SDK doesn't seem to be able to gain the role permissions of the ECS task.
If I run the code on an EC2 ECS instance, the code seems to inherit the role on the instance, not of the task.
If I run the code on Fargate, the code doesn't get any permission.
By contrast, any bash scripts that run within the instance seem to have the proper permissions.
Indeed, the documentation doesn't mention this as an option for the Node SDK, just:
Loaded from IAM roles for Amazon EC2 (if running on EC2),
Loaded from the shared credentials file (~/.aws/credentials),
Loaded from environment variables,
Loaded from a JSON file on disk,
Hardcoded in your application
Is there any way to have your node code gain the permissions of the ECS task?
This seems to be the logical way to pass permissions to your code. It works beautifully with code running on an instance.
The only workaround I can think of is to create one IAM user per ECS service and pass the API key/secret as environment variables in the task definition. However, that doesn't seem very secure, since it would be visible in plain text to anyone with access to the task definition.
Your question is missing a lot of details on how you set up your ECS cluster, and I am not sure whether the question is about ECS in general or Fargate specifically.
Make sure that you are using the latest version of the SDK; the JavaScript SDK supports ECS and Fargate task credentials.
Often there is confusion about credentials on ECS. There is the IAM role that is assigned to the Cluster EC2 instances and the IAM role that is assigned to ECS tasks.
The most common problem is the "Trust Relationship" has not been setup on the ECS Task Role. Select your IAM role and then the "Trust Relationships" tab and make sure that it looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
In addition to the standard Amazon ECS permissions required to run tasks and services, IAM users also require iam:PassRole permissions to use IAM roles for tasks.
Next, verify that you are using the IAM role in the task definition. Specify the correct IAM role ARN in the Task Role field. Note that this is different from the Task Execution Role (which allows containers to pull images and publish logs).
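A quick way to confirm which role a running container actually has is to get a shell in it and ask STS directly (assuming the AWS CLI is available in the image); the relative-URI variable below is what the SDK's container credential provider reads:

# inside the running container
env | grep AWS_CONTAINER_CREDENTIALS_RELATIVE_URI    # should be set when a task role is attached
aws sts get-caller-identity                          # the Arn should show the task role, not the instance role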
Next make sure that your ECS Instances are using the latest version of the ECS Agent. The agent version is listed on the "ECS Instances" tab under the right hand side column "Agent version". The current version is 1.20.3.
Are you using an ECS optimized AMI? If not, add --net=host to your docker run command that starts the agent. Review this link for more information.
I figured it out. This was a weird one.
A colleague thought it would be "safer" if we called Object.freeze on process.env. This was somehow interfering with the SDK's ability to access the credentials.
We removed that "improvement" and all is fine again. I think the lesson is: do not mess with process.env.

Register AWS ECS task in service discovery namespace (private hosted zone)

I'm quite bad at using AWS, but I'm trying to automate the setup of an ECS cluster with private DNS names in Route 53, using the new service discovery mechanism. I am able to click my way through the AWS UI to get a DNS entry to show up in a private hosted zone, but I cannot figure out the JSON parameters to add to the JSON for the command below to accomplish the same thing.
aws ecs create-service --cli-input-json file://aws/createService.json
and below are the approximate contents of the createService.json referenced above:
"cluster": "clustername",
"serviceName": "servicename",
"taskDefinition": "taskname",
"desiredCount": 1,
// here is where I'm guessing there should be some DNS config referencing some
// namespace or similar that I cannot figure out...
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": [
"subnet-11111111"
],
"securityGroups": [
"sg-111111111"
],
"assignPublicIp": "DISABLED"
}
}
I'd be grateful for any ideas, since my googling skills apparently aren't good enough for this problem. Many thanks!
To have an ECS service automatically register instances into a service discovery service, you can use the serviceRegistries attribute. Add the following to the ECS service definition JSON:
{
  ...
  "serviceRegistries": [
    {
      "registryArn": "arn:aws:servicediscovery:region:aws_account_id:service/srv-utcrh6wavdkggqtk"
    }
  ]
}
The attribute contains a list of service discovery services that ECS should update when it creates or destroys a task as part of the service. Each registry is referenced using the ARN of the service discovery service.
To get the ARN, use the AWS CLI command aws servicediscovery list-services.
Strangely, the documentation for the ECS service definition does not mention this attribute; however, this tutorial about service discovery does.
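If you don't already have the service discovery service, creating one from the CLI looks roughly like this (names, VPC ID, and namespace ID are placeholders); the ARN in the create-service output is what goes into serviceRegistries above:

# create a private DNS namespace (asynchronous; find the resulting namespace ID with list-namespaces)
aws servicediscovery create-private-dns-namespace --name internal.local --vpc vpc-0123456789abcdef0
aws servicediscovery list-namespaces
# create the discovery service that ECS will register task IPs into
aws servicediscovery create-service --name servicename \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config 'RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=60}]'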
As it turns out, there is no support in ecs create-service for adding it to the service registry, i.e. the Route 53 private hosted zone. Instead I had to use aws servicediscovery create-service and then aws servicediscovery register-instance to finally get an entry in my private hosted zone.
This became quite a complicated solution, so I'll instead give Terraform a shot at it, since I found they recently added support for ECS service discovery, and see where that takes me...