Problems connecting to Neptune from Lambda - amazon-web-services

I have created a simple AWS Neptune cluster, with a writer and no read replicas. I used the option to create a new VPC for it, and two security groups were automatically created for it, too.
I also have a Lambda that calls that Neptune cluster's endpoint. I have configured the Lambda with the Neptune cluster's VPC, specifying all of its subnets and the two security groups mentioned above. I didn't manually modify the inbound and outbound rules after they were automatically assigned when I performed the VPC configuration from the AWS Console (just going through the steps).
The Lambda is written in Python and uses the requests library to make HTTPS calls, signed with AWS Signature V4. The execution role for the Lambda has NeptuneFullAccess and an inline policy that allows configuring a VPC for the Lambda (which has been done, so that policy works).
The Lambda calls the Neptune cluster's endpoint, with the cluster's name and ID redacted, on port 8182:
https://NAME.cluster-ID.us-east-1.neptune.amazonaws.com:8182
I get the following error:
{
"errorMessage": "2020-05-20T21:26:35.066Z c8ee70ac-6390-48fd-a32e-36f80d58a24e Task timed out after 3.00 seconds"
}
What am I doing wrong?
UPDATE: So, it looks like the second security group for the Neptune cluster was created because I selected an option when creating the cluster. So I tried again with the 'Choose existing' option for the security group instead of 'Create new'. (I guess I was confused before, because I was creating a whole new VPC, so how could a security group already exist? But the wizard just assumes the default security group that will have been created by then.)
Now, I no longer get the same error. However, what I see is this:
{
"errorType": "Runtime.ExitError",
"errorMessage": "RequestId: 48e3b4fb-1b88-48d3-8834-247dbb1a4f3f Error: Runtime exited without providing a reason"
}
The log shows this:
{
"requestId": "b8b91c18-34cd-c5f6-9103-ed3357b9241e",
"code": "BadRequestException",
"detailedMessage": "Bad request."
}
The query was (given the Lambda code described in https://docs.amazonaws.cn/en_us/neptune/latest/userguide/iam-auth-connecting-python.html):
{
"host": "NAME.cluster-ID.us-east-1.neptune.amazonaws.com:8182",
"method": "GET",
"query_type": "status",
"query": ""
}
Any suggestions?
UPDATE: Trying against another Neptune cluster, the '[Errno 111] Connection refused' error comes back. I have noticed an odd thing, however: I have some orphaned network interfaces from when the Lambda was associated with the VPCs of now-deleted Neptune clusters. The network interfaces are marked as in use, however, and I cannot detach and delete them, not even with the Force detachment option; I get the "You are not allowed to manage 'ela-attach' attachments" error.
UPDATE: Starting with a fresh Lambda (its VPC configuration done only once, so no orphaned network interfaces anymore) and a fresh Neptune cluster with IAM Auth enabled and configured (and even with the Lambda's execution role given full admin access for the purposes of debugging, to rule out any missing permissions), I still get this error:
{
"errorMessage": "HTTPSConnectionPool(host='NAME.cluster-ID.us-east-1.neptune.amazonaws.com', port=8182): Max retries exceeded with url: /status/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1f9f98c310>: Failed to establish a new connection: [Errno 111] Connection refused'))",
"errorType": "ConnectionError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 71, in lambda_handler\n return make_signed_request(host, method, query_type, query)\n",
" File \"/var/task/lambda_function.py\", line 264, in make_signed_request\n r = requests.get(request_url, headers=headers, verify=False, params=request_parameters)\n",
" File \"/var/task/requests/api.py\", line 76, in get\n return request('get', url, params=params, **kwargs)\n",
" File \"/var/task/requests/api.py\", line 61, in request\n return session.request(method=method, url=url, **kwargs)\n",
" File \"/var/task/requests/sessions.py\", line 530, in request\n resp = self.send(prep, **send_kwargs)\n",
" File \"/var/task/requests/sessions.py\", line 643, in send\n r = adapter.send(request, **kwargs)\n",
" File \"/var/task/requests/adapters.py\", line 516, in send\n raise ConnectionError(e, request=request)\n"
]
}

A few things to check:
Is the security group attached to the Neptune instance allowing traffic from the subnets that are configured for the Lambda function? The default inbound rule on the security group attached to Neptune only allows traffic from the IP address from which the cluster was provisioned (a sketch of opening that rule follows the links below).
The NeptuneFullAccess managed IAM policy covers control-plane actions, not data-plane operations. You'll need to create an IAM policy using the policy document defined here [1] and attach that policy to whichever Lambda execution role you are using. Then you need to use that role to sign the request being made to Neptune. The Python requests library does not do SigV4 signing on its own, so you'll need to follow a procedure similar to what is laid out here [2] (a signing sketch also follows the links below).
If you really want to simplify all of this, we've published a Python library that helps with managing connections, IAM auth, and sending queries to Neptune. You can find it here [3].
[1] https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth.html
[2] https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-python.html
[3] https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-python-utils
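For the security-group check, one way to open the Neptune group to the Lambda is to add an inbound rule for port 8182 that references the Lambda's security group. A minimal boto3 sketch; both group IDs are placeholders for your actual groups:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0neptune0000000000",  # placeholder: Neptune cluster's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8182,
        "ToPort": 8182,
        "UserIdGroupPairs": [{
            "GroupId": "sg-0lambda00000000000",  # placeholder: Lambda's security group
            "Description": "Allow Lambda to reach Neptune on 8182",
        }],
    }],
)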
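For the SigV4 point, a minimal sketch (not the AWS sample verbatim) of signing a GET to the /status endpoint with botocore and sending it with requests; the endpoint is a placeholder, 'neptune-db' is the data-plane service name used in [2], and the credentials are resolved from the Lambda execution role:

import botocore.session
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"
endpoint = "https://NAME.cluster-ID.us-east-1.neptune.amazonaws.com:8182/status"

# Sign the request with the credentials from the execution role.
credentials = botocore.session.Session().get_credentials()
aws_request = AWSRequest(method="GET", url=endpoint, data=None)
SigV4Auth(credentials, "neptune-db", region).add_auth(aws_request)

# Send the signed request with the generated Authorization/X-Amz-Date headers.
response = requests.get(endpoint, headers=dict(aws_request.headers))
print(response.status_code, response.text)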

Thanks to the help of the Neptune team (an amazing response! they called me to discuss this), I was able to figure this out.
First, the Connection refused error disappeared once I redid the setup with a fresh Neptune cluster and the 'Choose existing' option for the security group, as well as a brand-new Lambda added to the Neptune cluster's VPC. Apparently, redoing the VPC configuration on a Lambda sometimes leaves orphaned network interfaces that are hard to delete. So do the VPC config on a Lambda only once!
Second, the runtime error that started showing up after that is due to a bug in the Python code provided by AWS here: https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-python.html
Namely, the make_signed_request function in that script doesn't return a value. It should return r.text or, better yet, json.loads(r.text). With that change, everything works just fine.
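For reference, a hedged sketch of the fix with the helper's signature simplified (in the AWS sample the SigV4 headers and URL are built earlier in the same function):

import json
import requests

def make_signed_request(request_url, headers, request_parameters):
    # Send the already-signed request; the published sample performs this call
    # but never returns the result.
    r = requests.get(request_url, headers=headers, params=request_parameters)
    return json.loads(r.text)  # the missing return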

From your error message:
Task timed out after 3.00 seconds
You have to increase your Lambda execution timeout, as the current setting of 3 seconds is not enough for it to complete successfully:
The amount of time that Lambda allows a function to run before stopping it. The default is 3 seconds. The maximum allowed value is 900 seconds.
If your function runs longer than the configured timeout, the Lambda service terminates it for exceeding that threshold.
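If you want to script the change, a minimal boto3 sketch (the function name is a placeholder; the same setting is under Configuration > General configuration in the console):

import boto3

# Raise the timeout from the 3-second default to 30 seconds.
boto3.client("lambda").update_function_configuration(
    FunctionName="my-neptune-lambda",  # placeholder function name
    Timeout=30,
)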
As a side note:
Since you use the Lambda in a VPC, you have to remember that VPC-attached Lambda functions do not have public IPs or internet access. You may not be able to connect to your DB even if you increase the function timeout. This can be overcome by running your Lambda function in a private subnet with a NAT gateway or NAT instance correctly set up.

Related

Lambda invocation failed with status: 403 on new AWS region

I enabled a new AWS region (Africa, Cape Town).
I created a new Lambda in the new region. I connected the mentioned Lambda to my API Gateway located in the Frankfurt region, and when trying to access it, there is an internal server error.
CloudWatch shows the following:
(ee2d73a9-e0ff-4ba2-a445-4348e86bcfc1) Lambda invocation failed with status: 403. Lambda request id: ed3b6fc8-0959-4f43-8c3c-32d6c811e9f2
(ee2d73a9-e0ff-4ba2-a445-4348e86bcfc1) Execution failed due to configuration error: The security token included in the request is invalid
However, when I create another API Gateway in Africa, I can only access African Lambdas, and I get the same error trying to access anything outside Africa.
Any solutions?

GCP - Cloud Composer 2 - Create operation on this environment failed

I am trying to create a default Composer 2 instance on GCP and get the following errors:
CREATE operation on this environment failed 32 minutes ago with the following error message:
Composer Backend timed out. Currently running tasks are [stage: CP_COMPOSER_AGENT_RUNNING
description: "No agent response published."
...
or
CREATE operation on this environment failed 32 minutes ago with the following error message:
Environment couldn't be created, but no error was surfaced. This can be caused by a lack of
proper permissions. Check if this environment's service account ... .iam.gserviceaccount.com
has the 'roles/composer.worker' role and there is no firewall inhibiting internal
communications set.
I already tried to add the Composer Worker role to the service account, along with all other required roles (e.g. Cloud Composer v2 API Service Agent Extension) listed in https://cloud.google.com/composer/docs/composer-2/access-control (for public as well as for private, even though the instance is public).
I looked into the GKE cluster and found the composer-agent Pod failing with:
Traceback (most recent call last): File "composer_agent.py", line 467, in <module> main() File "composer_agent.py", line 292, in main responses = pubsub_subscriber.pull() (...)
oauth2client.client.HttpAccessTokenRefreshError: Failed to retrieve
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/(...)
compute#developer.gserviceaccount.com/token from the Google Compute Engine metadata service.
Response:
{'date': 'Thu, 17 Feb 2022 10:29:46 GMT', 'status': '403', 'content-length': '668', 'content-
type': 'text/plain; charset=utf-8', 'x-content-type-options': 'nosniff'}
So I assume there is still some permission issue, but I cannot figure out what it is. Composer 1 instances can be created without a problem, and so can Composer 2 instances in a different project with the same permissions on the service accounts.
I also tried creating a non-default compute service account with the required permissions, but also without success. I also checked that the service account I am adding permissions to is actually the service account sending the request in composer-agent and the one sending the environment-creation request to the GKE cluster.
I hope anyone who has faced similar issues or knows more about the error occurring in composer-agent can help. Thank you very much!
After being in contact with the Google Support Team, the solution was to manually enable the "IAM Service Account Credentials API". There was no issue with service account rights or firewall settings.
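For completeness, a hedged sketch of enabling that API through the Service Usage API with google-api-python-client, assuming Application Default Credentials that are allowed to enable services; the project ID is a placeholder, and the console's APIs & Services page works just as well:

from googleapiclient import discovery

# Enable the IAM Service Account Credentials API on the project.
serviceusage = discovery.build("serviceusage", "v1")
serviceusage.services().enable(
    name="projects/my-project/services/iamcredentials.googleapis.com"  # placeholder project ID
).execute()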

Internal Server Error on Terraform provisioned API Gateway Trigger

I'm using Terraform to allocate a Lambda function that's triggered by a request to an API Gateway on AWS. Here's my .tf file:
[.tf file contents not preserved]
This correctly sets up the Lambda function, but something is off with the API Gateway. When I curl the output URL (https://cqsso4fw67.execute-api.us-east-1.amazonaws.com), it gives me the following:
curl https://cqsso4fw67.execute-api.us-east-1.amazonaws.com/
{"message":"Internal Server Error"}%
If I manually remove this API Gateway and create a new one from the AWS Console within the Lambda function and run curl again, I get "SUCCESS!" (which is what my Lambda should do).
Any ideas how I can fix the .tf file to correctly setup the API Gateway trigger?
Here are the differences between the one created in the AWS Console and the one provisioned in Terraform; the Terraform one is at the top.
The TF code that you showed is correct. The most likely reason for the failure is your Lambda function, which you haven't shown. You have to inspect the Lambda function's logs in CloudWatch to check why it fails.
Edit:
The function should return:
return {
    'statusCode': 200,
    'body': 'SUCCESS'
}
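Put differently, a minimal handler that satisfies the proxy-integration response shape (a sketch assuming the Python runtime and the default lambda_handler entry point):

def lambda_handler(event, context):
    # API Gateway proxy integrations expect a statusCode and a string body.
    return {
        'statusCode': 200,
        'body': 'SUCCESS!'
    }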
If anyone comes across this issue, I managed to fix it by greatly simplifying things. I replaced all the API Gateway resources in the Terraform file with just:
resource "aws_apigatewayv2_api" "apigw-terraform" {
name = "terraform-api"
protocol_type = "HTTP"
target = aws_lambda_function.w-scan-terraform.arn
}
The tutorial I was following had implemented something more complex than what I needed.

Unable to create AWS EKS cluster with eksctl

Unable to create an AWS EKS cluster with eksctl from a Windows 10 PC. Here is the command I'm executing:
eksctl create cluster --name revit --version 1.17 --region ap-southeast-2 --fargate
Version of eksctl: 0.25.0
AWS CLI Version: aws-cli/2.0.38 Python/3.7.7 Windows/10 exe/AMD64
Error when executing the create cluster command:
2020-08-08T19:05:35+10:00 [ℹ] eksctl version 0.25.0
2020-08-08T19:05:35+10:00 [ℹ] using region ap-southeast-2
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 54.121635ms
2020-08-08T19:05:35+10:00 [!] retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network.) from ec2metadata/GetToken - will retry after delay of 86.006168ms
I had the same error; I got rid of it by providing my AWS credentials for programmatic access (AWS Access Key ID, AWS Secret Access Key):
$ aws configure
The next time I used eksctl it didn't try to authenticate on its own, and the command passed.
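As a quick sanity check that the default credential chain now resolves (eksctl and the SDKs use the same chain), a small boto3 sketch:

import boto3

# Prints the account and ARN of the credentials picked up by the default chain.
print(boto3.client("sts").get_caller_identity())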
I suspect this is related to this: https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/
Specifically:
Protecting against open layer 3 firewalls and NATs
Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can exist in just one. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.
Are you by any chance running the command above from within a container launched in bridge mode? I had a similar problem. If that is the case, you could run it using --network host or by passing the credentials as environment variables.

Elastic BeanStalk MultiContainer docker fails

I want to deploy a multi-container application in Elastic Beanstalk. I get the following error.
Error 1: The EC2 instances failed to communicate with AWS Elastic
Beanstalk, either because of configuration problems with the VPC or a
failed EC2 instance. Check your VPC configuration and try launching
the environment again.
I have set up the VPC with just a public subnet and a security group that allows all traffic, both inbound and outbound. I know this is not encouraged for a production-level deployment, but I have reduced the complexity to find the cause of the error.
So, the load balancer and the EC2 instance are inside the same public subnet that is attached with the internet gateway. They both share the same security group allowing all the traffic.
Before the above error, I also get another error stating
Error 2: No ecs task definition (or empty definition file) found in environment
That said, I have bundled my Dockerrun.aws.json file along with the .ebextensions folder inside the source bundle that Beanstalk uses for deployment.
After all these errors, it boils down to two questions:
I cannot understand why the "No ecs task definition" error appears when I have packaged my Dockerrun.aws.json file containing containerDefinitions.
Since there is no ECS task running, there is nothing running on the instance. Is this why Beanstalk and the ELB cannot communicate with the instance? (Assuming my public subnet and allow-all security group are not the problem.)
The problem was the VPC. Even though I had a simple VPC with just a public subnet, Beanstalk could not talk to the instance and so could not deploy the ECS task definition and Docker containers on the instance.
Creating two subnets, a public and a private one, with a NAT instance in the public subnet acting as the router for instances in the private subnet, made the setup work for me, and I could deploy the ECS task definition successfully to the EC2 instance in the private subnet.
I found this question because I got the same error. Here are the steps that worked for me to actually deploy a multi-container app on Beanstalk:
To get past this particular error, I used the eb CLI tools. For some reason, using eb deploy instead of zipping and uploading myself fixed this. It didn't actually work, but it gave me a new error.
So, I changed my Dockerrun.aws.json, a file format that needs WAY more documentation, until I stopped getting errors about that.
Then, I got an even better error!
ERROR: [Instance: i-0*********0bb37cf] Command failed on instance.
Return code: 1 Output: (TRUNCATED)..._api_call
raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when
calling the GetObject operation: Access Denied
Failed to download authentication credentials [config file name] from [bucket name].
Hook /opt/elasticbeanstalk/hooks/appdeploy/enact/02update-
credentials.sh failed. For more detail, check /var/log/eb-activity.log
using console or EB CLI.
Per this part of the docs, the way to solve this is to:
Open the Roles page in the IAM console.
Choose aws-elasticbeanstalk-ec2-role.
On the Permissions tab, under Managed Policies, choose Attach Policy.
Select the managed policy for the additional services that your application uses. For example, AmazonS3FullAccess or AmazonDynamoDBFullAccess. (For our problem, the S3 one)
Choose Attach Policies.
This part got really exciting, because I got yet another error: Authentication credentials are not in JSON format as expected. Please generate the credentials using 'docker login'. (Keep in mind, I tried to follow the instructions on how to do this to the letter, but, oh well.) Turns out this one was on me: I had malformed JSON in my DockerHub auth file stored on S3. I renamed the file to dockercfg.json to get syntax checking, and it seems Beanstalk/ECS is okay with having .json as part of the name, because this time... there was a different error: CannotPullContainerError: Error: image [DockerHub organization]/[repo name]:latest not found. Hmm, maybe there was a typo? Let's check:
$ docker run -it [DockerHub organization]/[repo name]:latest
Unable to find image '[DockerHub organization]/[repo name]:latest' locally
latest: Pulling from [DockerHub organization]/[repo name]
Ok, the repo is there. So... my auth is bad? Yup, it turns out I followed an example in the DockerHub auth docs that showed what you shouldn't do. Your dockercfg.json should look like:
{
  "https://index.docker.io/v1/": {
    "auth": "ZWpMQ=Vyd5zOmFsluMTkycN0ZGYmbn=WV2FtaGF2",
    "email": "your@email.com"
  }
}
There were a few more errors (volume sourcePath has to be an absolute path! That's what the invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed message means), but it eventually deployed. Sorry for the novel; hoping it helps someone.