AWS Lambda cannot reach internal servers from within VPC

I have a Lambda function that is attempting to make a REST call to an on-prem server outside of AWS. The Lambda runs in a VPC that has a VPN connection to our local resources. The same REST call runs successfully from an EC2 instance in that VPC, but the Lambda request hangs. The security groups are open. Any ideas how to debug this?
Here is the bulk of the Lambda:
import configparser
import json
import logging
import re

import requests

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    config = configparser.ConfigParser()
    config.read('config')
    pattern = re.compile(".*" + config['DEFAULT']['my-pattern'])
    logger.info(event['Records'])
    # The S3 event notification arrives wrapped in an SNS message
    sns_json = event['Records'][0]['Sns']
    sns_message = json.loads(sns_json['Message'])
    logger.info(sns_message['Records'][0]['s3'])
    s3_object = sns_message['Records'][0]['s3']
    new_file_name = s3_object['object']['key']
    bucket = s3_object['bucket']['name']
    if pattern.match(new_file_name):
        new_json = {"text": "New file (" + new_file_name + ") added to the bucket. " + bucket,
                    "title": config['DEFAULT']['default_message_title']}
        # An explicit timeout makes an unreachable host raise an exception
        # instead of hanging until the Lambda itself times out
        webhook_post = requests.get("http://some-ip:4500/", timeout=10)
        logger.info("Webhook Post Status: " + str(webhook_post.status_code) + str(webhook_post))
    logger.info("Skip teams webhook")
    outgoing_message_dict = {
        's3Bucket': bucket,
        'somefile': new_file_name
    }
    return outgoing_message_dict
I don't receive any errors from the request; it just hangs until my Lambda times out.
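One way to narrow down a hang like this is to test raw TCP reachability from inside the Lambda, with a short timeout, before making the HTTP call. A timeout usually points to routing (the VPN tunnel, route tables, network ACLs). "Connection refused" means the packets reached a host that rejected them. The sketch below uses only the standard library. `check_tcp` is a hypothetical helper name, and the host and port are the question's placeholders.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP handshake and classify the outcome (hypothetical helper).

    Returns "open", "refused", "timeout", or "error: <message>" for other
    OSErrors such as DNS failures or unreachable networks.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A host answered with a reset: routing works, nothing listens on the port
        return "refused"
    except socket.timeout:
        # No answer at all: typical of a missing route or a VPN tunnel that is down
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def lambda_handler(event, context):
    # "some-ip" and 4500 are the placeholders from the question
    result = check_tcp("some-ip", 4500)
    print("TCP check:", result)
    return {"tcp": result}
```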

I believe I found the source of the problem. Ultimately, I believe the issue is with our on-prem firewall: the VPN tunnel wasn't active at all times. Others have mentioned that it needs to be activated from the on-prem network. I created an EC2 instance and connected to it, which activated the VPN. When I ran the Lambda shortly after, I could successfully reach the local REST endpoint I was trying to connect to.
I have not implemented the final solution yet, but on the firewall we should be able to configure a keep-alive ping so the connection does not time out. I hope this helps others. Thank you for the feedback!

Related

AWS EC2 getting a new dynamic IP without shutting down and start back up

I am testing a component that requires changing the IP address for every test.
Using EC2, I can perform the following actions to change my IP:
Shutdown the VM
Start up the VM
New IP obtained
While that works wonderfully, I need to wait 3 minutes for it to shut down and start back up for each test, which becomes quite troublesome over time.
Is there any way to click a button or execute a script and obtain a new IP instantly? Thanks.
When you use an Elastic IP address (EIP), you can get a new IP address faster than by stopping and starting the EC2 instance.
Here is a snippet in Python with boto3 that checks that the IP address changes.
Prerequisites
boto3
An EC2 instance (InstanceID)
No EIPs already allocated in the region (check_eip() reports the first EIP it finds)
Code
import boto3
from botocore.exceptions import ClientError


def check_eip():
    # Reports the first EIP in the region, hence the "no EIP" prerequisite
    ec2 = boto3.client("ec2")
    res_describe = ec2.describe_addresses()
    if res_describe["Addresses"]:
        return res_describe["Addresses"][0]["PublicIp"]
    else:
        return "NO EIP"


InstanceID = "i-xxxxxxxxxxxxxxxxx"
ec2 = boto3.client("ec2")
try:
    print("Step0: ", check_eip())
    # Allocate a new EIP and attach it to the instance
    allocation = ec2.allocate_address(Domain="vpc")
    print("Step1: ", check_eip())
    response = ec2.associate_address(
        AllocationId=allocation["AllocationId"], InstanceId=InstanceID
    )
    print("Step2: ", check_eip())
    # Do something here with the new EIP
    # Detach and release the EIP so it stops accruing charges
    response2 = ec2.disassociate_address(
        AssociationId=response["AssociationId"])
    print("Step3: ", check_eip())
    response3 = ec2.release_address(AllocationId=allocation["AllocationId"])
    print("Step4: ", check_eip())
except ClientError as e:
    print(e)
Output
The output will be like this:
Step0: NO EIP
Step1: aaa.bbb.ccc.ddd
Step2: aaa.bbb.ccc.ddd
Step3: aaa.bbb.ccc.ddd
Step4: NO EIP
It takes several seconds to run one sequence.
When you run the code again, the EIP will be different.
Note
Please make sure to release unused EIPs:
$0.005 per Elastic IP address not associated with a running instance per hour on a pro rata basis
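If the goal is simply a different public IP for each test run, the same calls can be arranged so that a new address is allocated and attached before the previous one is released. The sketch below is minimal and makes these assumptions:

* `rotate_eip` is a hypothetical helper.
* It expects a boto3 EC2 client, such as `boto3.client("ec2")`.
* The instance is in a VPC.

```python
from typing import Optional


def rotate_eip(ec2, instance_id: str, previous: Optional[dict] = None) -> dict:
    """Attach a freshly allocated Elastic IP to instance_id (hypothetical helper).

    `ec2` is a boto3 EC2 client. `previous` is the dict returned by the last
    call, or None; its address is detached and released so that unused EIPs
    do not keep accruing charges.
    """
    # Allocate the new address while the old one is still held,
    # so the two cannot be the same address
    allocation = ec2.allocate_address(Domain="vpc")
    if previous:
        ec2.disassociate_address(AssociationId=previous["AssociationId"])
    association = ec2.associate_address(
        AllocationId=allocation["AllocationId"], InstanceId=instance_id
    )
    if previous:
        ec2.release_address(AllocationId=previous["AllocationId"])
    return {
        "PublicIp": allocation["PublicIp"],
        "AllocationId": allocation["AllocationId"],
        "AssociationId": association["AssociationId"],
    }
```

To use it, keep the returned dict between tests, for example `state = rotate_eip(boto3.client("ec2"), "i-xxxxxxxxxxxxxxxxx", state)`. Release the last address when you are done.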

AWS Lambda connection to SQS timed out

I am working on a task that involves a Lambda function running inside a VPC.
This function is supposed to push messages to SQS, and the Lambda execution role has the AWSLambdaSQSQueueExecutionRole and AWSLambdaVPCAccessExecutionRole policies attached.
Lambda function:
import boto3

# Create SQS client
sqs = boto3.client('sqs')
# The Region is ap-east-1; "ap-east-1a" is an Availability Zone name, not a Region
queue_url = 'https://sqs.ap-east-1.amazonaws.com/073x08xx43xx37/xyz-queue'
# Send message to SQS queue
response = sqs.send_message(
    QueueUrl=queue_url,
    DelaySeconds=10,
    MessageAttributes={
        'Title': {
            'DataType': 'String',
            'StringValue': 'Tes1'
        },
        'Author': {
            'DataType': 'String',
            'StringValue': 'Test2'
        },
        'WeeksOn': {
            'DataType': 'Number',
            'StringValue': '1'
        }
    },
    MessageBody=(
        'Testing'
    )
)
print(response['MessageId'])
When testing, the execution result is:
{
"errorMessage": "2020-07-24T12:12:15.924Z f8e794fc-59ba-43bd-8fee-57f417fa50c9 Task timed out after 3.00 seconds"
}
I increased the Timeout in Basic Settings to 5 seconds, and to 10 seconds as well, but the error kept coming.
If anyone has faced a similar issue in the past or has an idea how to resolve it, please help me out.
Thank you in advance.
When an AWS Lambda function is configured to use an Amazon VPC, it connects to a nominated subnet of the VPC. This allows the Lambda function to communicate with other resources inside the VPC. However, it cannot communicate with the Internet. This is a problem because the Amazon SQS public endpoint lives on the Internet, and the function is timing out because it is unable to reach the Internet.
Thus, you have 3 options:
Option 1: Do not connect to a VPC
If your Lambda function does not need to communicate with a resource in the VPC (such as the simple function you have provided above), simply do not connect it to the VPC. When a Lambda function is not connected to a VPC, it can communicate with the Internet and the Amazon SQS public endpoint.
Option 2: Use a VPC Endpoint
A VPC Endpoint provides a means of accessing an AWS service without going via the Internet. You would configure a VPC endpoint for Amazon SQS. Then, when the Lambda function wishes to connect with the SQS queue, it can access SQS via the endpoint rather than via the Internet. This is normally a good option if the Lambda function needs to communicate with other resources in the VPC.
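For Option 2, the interface endpoint can be created in the console or with boto3's `create_vpc_endpoint`. The sketch below is minimal, and the names are hypothetical helpers:

* The VPC, subnet, and security group IDs are placeholders you would supply.
* `PrivateDnsEnabled=True` lets the regular `sqs.<region>.amazonaws.com` hostname resolve to the endpoint inside the VPC.
* The endpoint's security group must allow inbound HTTPS (443) from the Lambda function's security group.

```python
def sqs_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Keyword arguments for EC2 create_vpc_endpoint (hypothetical helper)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Service names follow the com.amazonaws.<region>.<service> pattern
        "ServiceName": f"com.amazonaws.{region}.sqs",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Makes sqs.<region>.amazonaws.com resolve to the endpoint's private IPs
        "PrivateDnsEnabled": True,
    }


def create_sqs_endpoint(ec2, region, vpc_id, subnet_ids, security_group_ids):
    """Create the endpoint with a boto3 EC2 client and return its ID."""
    params = sqs_endpoint_params(region, vpc_id, subnet_ids, security_group_ids)
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]
```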
Option 3: Use a NAT Gateway
If the Lambda function is configured to use a private subnet, it will be able to access the Internet if a NAT Gateway has been provisioned in a public subnet and the Route Table for the private subnet points to the NAT Gateway. This involves extra expense and is only worthwhile if there is an additional need for a NAT Gateway.
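For Option 3, you need three pieces: an Elastic IP, a NAT gateway in a public subnet, and a default route in the private subnet's route table. The boto3 sketch below is minimal and makes these assumptions:

* All IDs are placeholders.
* `add_nat_route` is a hypothetical helper.
* The public subnet already routes `0.0.0.0/0` to an internet gateway.

```python
def add_nat_route(ec2, public_subnet_id, private_route_table_id):
    """Create a NAT gateway and send the private subnet's Internet traffic to it.

    `ec2` is a boto3 EC2 client (hypothetical helper). Returns the NAT gateway ID.
    """
    # A public NAT gateway needs an Elastic IP
    allocation = ec2.allocate_address(Domain="vpc")
    nat = ec2.create_nat_gateway(
        SubnetId=public_subnet_id, AllocationId=allocation["AllocationId"]
    )
    nat_id = nat["NatGateway"]["NatGatewayId"]
    # Wait until the gateway is usable before routing traffic to it
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
    ec2.create_route(
        RouteTableId=private_route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    return nat_id
```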
If you're using the boto3 Python library in a Lambda function in a VPC and it fails to connect to an SQS queue through a VPC endpoint, you must set endpoint_url when creating the SQS client. Issue 1900 describes the background behind this.
The solution looks like this (for an SQS VPC endpoint in us-east-1):
sqs_client = boto3.client('sqs',
endpoint_url='https://sqs.us-east-1.amazonaws.com')
Then call send_message or send_message_batch as normal.
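To avoid hard-coding the Region, the endpoint URL can be derived from the `AWS_REGION` environment variable that the Lambda runtime sets. The sketch below works under that assumption, and `sqs_client_kwargs` and `make_sqs_client` are hypothetical helpers. The short timeouts are an optional addition. They make a blocked connection raise an error well before the function's own timeout, since botocore's default connect timeout is 60 seconds.

```python
import os


def sqs_client_kwargs(region=None):
    """boto3.client('sqs', ...) keyword arguments for the regional endpoint."""
    # AWS_REGION is set by the Lambda runtime; pass region explicitly elsewhere
    region = region or os.environ["AWS_REGION"]
    return {
        "region_name": region,
        "endpoint_url": f"https://sqs.{region}.amazonaws.com",
    }


def make_sqs_client(region=None):
    # Imported here so the helper above can be used without boto3 installed
    import boto3
    from botocore.config import Config

    # Fail fast instead of waiting for the 60-second default connect timeout
    config = Config(connect_timeout=5, read_timeout=10, retries={"max_attempts": 2})
    return boto3.client("sqs", config=config, **sqs_client_kwargs(region))
```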
You need to place your Lambda function inside your VPC and then set up a VPC endpoint for SQS or a NAT gateway. When you add your Lambda function to a subnet, make sure you add it ONLY to private subnets; otherwise, nothing will work.
Reference
https://docs.aws.amazon.com/lambda/latest/dg/vpc.html
https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/
I am pretty convinced that you cannot call an SQS queue from a Lambda function inside a VPC using an SQS endpoint. I'd consider it a bug, but maybe the Lambda team did this for a reason. In any case, you will get a message timeout. I cooked up a simple test Lambda:
import json
import boto3
import socket


def lambda_handler(event, context):
    print('lambda-test SQS...')
    sqsDomain = 'sqs.us-west-2.amazonaws.com'
    addr1 = socket.gethostbyname(sqsDomain)
    print('%s=%s' % (sqsDomain, addr1))
    print('Creating sqs client...')
    sqs = boto3.client('sqs')
    print('Sending Test Message...')
    response = sqs.send_message(
        QueueUrl='https://sqs.us-west-2.amazonaws.com/1234567890/testq.fifo',
        MessageBody='Test SQS Lambda!',
        MessageGroupId='test')
    print('SQS send response: %s' % response)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }
I created a VPC, subnet, etc. per Configuring a Lambda function to access resources in a VPC. The EC2 instance in this example has no problem invoking SQS through the private endpoint from the CLI, per this tutorial.
If I drop my simple Lambda above into the same VPC and subnet, with SQS publishing permissions etc., and invoke the test function, it properly resolves the IP address of the SQS endpoint within the subnet, but the call times out (make sure your Lambda timeout is more than 60 seconds to let boto fail). Enabling boto debug logging further confirms that the IP is resolved correctly and that the HTTP request to SQS times out.
I didn't try this with a non-FIFO queue, but since the HTTP call fails on the connection request, this shouldn't matter. It has to be a routing issue from the Lambda, because the EC2 instance in the same subnet works.
I modified my simple Lambda to add an SNS endpoint and ran the same test, which worked. As best I can tell, the issue appears to be specific to SQS.
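The resolution step in that test can be turned into a quick check of whether the endpoint's private DNS is in effect. With an interface endpoint and private DNS enabled, the service hostname should resolve to private addresses inside the VPC. The sketch below uses only the standard library, and `resolves_privately` is a hypothetical helper name.

```python
import ipaddress
import socket


def resolves_privately(hostname: str) -> bool:
    """True if every IPv4 address the hostname resolves to is private (hypothetical helper)."""
    infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    addresses = {info[4][0] for info in infos}
    # is_private covers RFC 1918 ranges such as 10.0.0.0/8 used by VPC subnets
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

Inside the Lambda, `resolves_privately("sqs.us-west-2.amazonaws.com")` returning `False` would suggest the request is still headed for the public endpoint.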
import json
import boto3
import socket


def testSqs():
    print('lambda-test SQS...')
    sqsDomain = 'sqs.us-west-2.amazonaws.com'
    addr1 = socket.gethostbyname(sqsDomain)
    print('%s=%s' % (sqsDomain, addr1))
    print('Creating sqs client...')
    sqs = boto3.client('sqs')
    print('Sending Test Message...')
    response = sqs.send_message(
        QueueUrl='https://sqs.us-west-2.amazonaws.com/1234567890/testq.fifo',
        MessageBody='Test SQS Lambda!',
        MessageGroupId='test')
    print('SQS send response: %s' % response)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }


def testSns():
    print('lambda-test SNS...')
    print('Creating sns client...')
    sns = boto3.client('sns')
    print('Sending Test Message...')
    response = sns.publish(
        TopicArn='arn:aws:sns:us-west-2:1234567890:lambda-test',
        Message='Test SQS Lambda!'
    )
    print('SNS send response: %s' % response)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }


def lambda_handler(event, context):
    # return testSqs()
    return testSns()
I think your only options are NAT (per John above), bouncing your calls off a local EC2 instance (NAT will be simpler, cheaper, and more reliable), or using a Lambda proxy outside the VPC, which someone else suggested in a similar post. You could also subscribe an SQS queue to an SNS topic (I prototyped this and it works) and route messages that way, but that just seems silly unless you absolutely have to use SQS for some obscure reason.
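For the SNS-to-SQS route mentioned above, two things are needed. The queue needs an access policy that lets the topic deliver to it, and the queue is then subscribed with `Protocol="sqs"`. In the sketch below, the ARNs are placeholders, and `allow_topic_policy` and `subscribe_queue` are hypothetical helper names.

```python
import json


def allow_topic_policy(queue_arn: str, topic_arn: str) -> str:
    """Queue policy allowing only the given SNS topic to send messages (hypothetical helper)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Restrict delivery to this one topic
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    return json.dumps(policy)


def subscribe_queue(sns, sqs, topic_arn, queue_url, queue_arn):
    """`sns` and `sqs` are boto3 clients; returns the subscription ARN."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": allow_topic_policy(queue_arn, topic_arn)},
    )
    response = sns.subscribe(
        TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn,
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```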
I switched to SNS. I was just hoping to get some more experience with SQS. Hopefully somebody can prove me wrong, but I call it a bug.

Node Lambda AWS TimeoutError: Socket timed out without establishing a connection to cloudformation

I am running a Node.js (12.x) Lambda in AWS. The purpose of this Lambda is to interact with CloudFormation stacks, and I'm doing that via the aws-sdk. When I test this Lambda locally using lambda-local, it executes successfully and the stack can be seen in the CREATING state in the AWS console.
However, when I push and run this Lambda in AWS, it fails after 15 seconds with this error:
{"errorType":"TimeoutError","errorMessage":"Socket timed out without establishing a connection","code":"TimeoutError","message":"Socket timed out without establishing a connection","time":"2020-06-29T03:10:27.668Z","region":"us-east-1","hostname":"cloudformation.us-east-1.amazonaws.com","retryable":true,"stack":["TimeoutError: Socket timed out without establishing a connection"," at Timeout.connectTimeout [as _onTimeout] (/var/task/node_modules/aws-sdk/lib/http/node.js:69:15)"," at listOnTimeout (internal/timers.js:549:17)"," at processTimers (internal/timers.js:492:7)"]}
This led me to investigate the Lambda timeout and the possible configuration changes described in https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-retry-timeout-sdk/ and https://aws.amazon.com/premiumsupport/knowledge-center/lambda-vpc-troubleshoot-timeout/, but nothing worked.
I found a couple of similar issues, such as AWS Lambda: Task timed out, whose suggestions include raising the Lambda timeout and memory. However, I've set my timeout to 30 seconds, and the logs show max memory used is 88 MB out of a possible 128 MB. I tried increasing the memory anyway, with no luck.
The curious part is that it fails without establishing a connection to the hostname cloudformation.us-east-1.amazonaws.com. How is that possible when the role assigned to the Lambda has full CloudFormation privileges? I'm completely out of ideas, so any help would be greatly appreciated. Here's my code:
TEST EVENT:
{
"stackName": "mySuccessfulStack",
"app": "test"
}
Function my handler calls (createStack):
const AWS = require('aws-sdk');

const templates = {
    "test": {
        TemplateURL: "https://<bucket>.s3.amazonaws.com/<path_to_file>/test.template",
        Capabilities: ["CAPABILITY_IAM"],
        Parameters: {
            "HostingBucket": "test-hosting-bucket"
        }
    }
}

async function createStack(event) {
    AWS.config.update({
        maxRetries: 2,
        httpOptions: {
            timeout: 30000,
            connectTimeout: 5000
        }
    });
    const cloudformation = new AWS.CloudFormation();
    const { app, stackName } = event;
    // Shallow copy: the module-level `templates` object survives between warm
    // invocations, so mutating it would break the Parameters loop on the next call
    let stackParams = { ...templates[app] };
    stackParams['StackName'] = app + "-" + stackName;
    let formattedTemplateParams = [];
    for (let [key, value] of Object.entries(stackParams.Parameters)) {
        formattedTemplateParams.push({ "ParameterKey": key, "ParameterValue": value });
    }
    stackParams['Parameters'] = formattedTemplateParams;
    const result = await cloudformation.createStack(stackParams).promise();
    return result;
}
A Lambda function in a VPC has neither a public IP address nor internet access. From the docs:
Connect your function to private subnets to access private resources. If your function needs internet access, use NAT. Connecting a function to a public subnet does not give it internet access or a public IP address.
There are two common solutions:
Place the Lambda function in a private subnet and set up a NAT gateway in a public subnet. Then add a route from the private subnet's route table to the NAT device. This enables the Lambda to access the internet and therefore the CloudFormation service.
Set up a VPC interface endpoint for CloudFormation. This allows your Lambda function in a private subnet to access CloudFormation without going through the internet.
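To check which case applies, look at the route table associated with each subnet the function uses. A private subnet suitable for Lambda should send `0.0.0.0/0` to a NAT gateway. A route to an internet gateway (`igw-...`) does not help a Lambda function, because its network interfaces get no public IP. In the boto3 sketch below, `default_route_target` is a hypothetical helper. When the subnet has no explicit association, it falls back to the VPC's main route table.

```python
def default_route_target(ec2, subnet_id):
    """Target of the subnet's 0.0.0.0/0 route, or None (hypothetical helper).

    `ec2` is a boto3 EC2 client. A "nat-..." result means Internet access via
    NAT; "igw-..." means a public subnet, which does not help a Lambda function.
    """
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    if not tables:
        # No explicit association: the subnet uses its VPC's main route table
        vpc_id = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]["VpcId"]
        tables = ec2.describe_route_tables(
            Filters=[
                {"Name": "vpc-id", "Values": [vpc_id]},
                {"Name": "association.main", "Values": ["true"]},
            ]
        )["RouteTables"]
    for route in (tables[0]["Routes"] if tables else []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            # The target is stored under a key that depends on its type
            return (route.get("NatGatewayId") or route.get("GatewayId")
                    or route.get("TransitGatewayId") or route.get("NetworkInterfaceId"))
    return None
```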

Problems connecting to Neptune from Lambda

I have created a simple AWS Neptune cluster, with a writer and no read replicas. I used the option to create a new VPC for it, and two security groups were automatically created for it, too.
I also have a Lambda that calls that Neptune cluster's endpoint. I have configured the Lambda with the Neptune cluster's VPC, specifying all of its subnets and the two security groups mentioned above. I didn't manually modify the inbound and outbound rules after they were automatically assigned when I performed the VPC configuration from the AWS Console (just going through the steps).
The Lambda is written in Python and uses the requests library to make HTTPS calls, with AWS Signature Version 4. The execution role for the Lambda has NeptuneFullAccess and an inline policy that allows configuring a VPC for the Lambda (which has been done, so that policy works).
The Lambda calls the Neptune cluster's endpoint, with the cluster's name and ID redacted, on port 8182:
https://NAME.cluster-ID.us-east-1.neptune.amazonaws.com:8182
I get the following error:
{
"errorMessage": "2020-05-20T21:26:35.066Z c8ee70ac-6390-48fd-a32e-36f80d58a24e Task timed out after 3.00 seconds"
}
What am I doing wrong?
UPDATE: It looks like the second security group for the Neptune cluster was created because I selected an option when creating the cluster. So I tried again with the Choose existing option for the security group instead of Create new. (I guess I was confused before: I was creating a whole new VPC, so how could a security group already exist? But the wizard just assumes the default security group that will have been created by then.)
Now, I no longer get the same error. However, what I see is this:
{
"errorType": "Runtime.ExitError",
"errorMessage": "RequestId: 48e3b4fb-1b88-48d3-8834-247dbb1a4f3f Error: Runtime exited without providing a reason"
}
The log shows this:
{
"requestId": "b8b91c18-34cd-c5f6-9103-ed3357b9241e",
"code": "BadRequestException",
"detailedMessage": "Bad request."
}
The query was (given the Lambda code described in https://docs.amazonaws.cn/en_us/neptune/latest/userguide/iam-auth-connecting-python.html):
{
"host": "NAME.cluster-ID.us-east-1.neptune.amazonaws.com:8182",
"method": "GET",
"query_type": "status",
"query": ""
}
Any suggestions?
UPDATE: Trying against another Neptune cluster, the [Errno 111] Connection refused error comes back. I have noticed an odd thing, however: I have some orphaned network interfaces from when the Lambda was associated with the VPCs of now-deleted Neptune clusters. The network interfaces are marked in use, and I cannot detach and delete them, not even with the Force detachment option. I get the You are not allowed to manage 'ela-attach' attachments error.
UPDATE: I started with a fresh Lambda (without redoing its VPC configuration, so there are no orphaned network interfaces anymore) and a fresh Neptune cluster with IAM auth enabled and configured. For debugging, I even gave the Lambda's execution role full admin access to rule out missing permissions. I still get this error:
{
"errorMessage": "HTTPSConnectionPool(host='NAME.cluster-ID.us-east-1.neptune.amazonaws.com', port=8182): Max retries exceeded with url: /status/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1f9f98c310>: Failed to establish a new connection: [Errno 111] Connection refused'))",
"errorType": "ConnectionError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 71, in lambda_handler\n return make_signed_request(host, method, query_type, query)\n",
" File \"/var/task/lambda_function.py\", line 264, in make_signed_request\n r = requests.get(request_url, headers=headers, verify=False, params=request_parameters)\n",
" File \"/var/task/requests/api.py\", line 76, in get\n return request('get', url, params=params, **kwargs)\n",
" File \"/var/task/requests/api.py\", line 61, in request\n return session.request(method=method, url=url, **kwargs)\n",
" File \"/var/task/requests/sessions.py\", line 530, in request\n resp = self.send(prep, **send_kwargs)\n",
" File \"/var/task/requests/sessions.py\", line 643, in send\n r = adapter.send(request, **kwargs)\n",
" File \"/var/task/requests/adapters.py\", line 516, in send\n raise ConnectionError(e, request=request)\n"
]
}
A few things to check:
Is the security group attached to the Neptune instance allowing traffic from the subnets that are configured for the Lambda function? The default inbound rule for the security group attached to Neptune is to only allow traffic from the IP address from which it was provisioned.
The NeptuneFullAccess built-in IAM policy is for control plane actions, not for data plane operations. You'll need to create an IAM policy using the policy document defined here [1] and attach that policy to whichever Lambda execution role you are using. Then you need to use that role to sign the requests made to Neptune. The Python requests library does not do SigV4 signing, so you'll need to follow a procedure similar to the one laid out here [2].
If you really want to simplify all of this, we've published a Python library that helps with managing connections, IAM auth, and sending queries to Neptune. You can find it here [3].
[1] https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth.html
[2] https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-python.html
[3] https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-python-utils
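For the first check above, the rule that matters is inbound TCP 8182 on the Neptune security group, with the Lambda function's security group as the source. In the boto3 sketch below, the group IDs are placeholders, and `neptune_ingress_permission` and `allow_lambda_to_neptune` are hypothetical helper names.

```python
NEPTUNE_PORT = 8182


def neptune_ingress_permission(lambda_sg_id: str) -> dict:
    """Inbound rule: TCP 8182 from members of the Lambda's security group (hypothetical helper)."""
    return {
        "IpProtocol": "tcp",
        "FromPort": NEPTUNE_PORT,
        "ToPort": NEPTUNE_PORT,
        # A group reference covers whatever IPs the function's network interfaces get
        "UserIdGroupPairs": [{"GroupId": lambda_sg_id}],
    }


def allow_lambda_to_neptune(ec2, neptune_sg_id: str, lambda_sg_id: str) -> None:
    """Add the rule with a boto3 EC2 client."""
    ec2.authorize_security_group_ingress(
        GroupId=neptune_sg_id,
        IpPermissions=[neptune_ingress_permission(lambda_sg_id)],
    )
```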
Thanks to the help of the Neptune team (an amazing response! they called me to discuss this), I was able to figure this out.
First, the Connection refused error disappeared once I redid the setup with a fresh Neptune cluster and the Use existing option for the security group, as well as a brand new Lambda added to the Neptune cluster's VPC. Apparently, redoing VPC configuration on a Lambda sometimes leaves orphaned network interfaces that are hard to delete. So, do the VPC config on a Lambda only once!
Second, the runtime error that started showing up after that is due to a bug in the Python code provided by AWS here: https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-python.html
Namely, the make_signed_request function in that script doesn't return a value. It should return r.text or, better yet, json.loads(r.text). Then, everything works just fine.
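A condensed version of the signing step, with the return value in place, might look like the sketch below. It makes these substitutions and assumptions:

* It swaps the hand-written SigV4 code from the AWS example for botocore's `SigV4Auth`, with the service name `neptune-db`.
* It assumes `boto3`/`botocore` and `requests` are available in the function.
* It keeps TLS verification on.
* `neptune_url` and `signed_neptune_get` are hypothetical helper names.

It only covers GET requests such as `/status`.

```python
def neptune_url(host: str, path: str) -> str:
    """`host` is 'NAME.cluster-ID.<region>.neptune.amazonaws.com:8182' as in the question."""
    return f"https://{host}/{path.lstrip('/')}"


def signed_neptune_get(host: str, path: str, region: str):
    """SigV4-signed GET against Neptune (hypothetical helper); returns the parsed JSON."""
    # Imported lazily so neptune_url can be used without these packages
    import boto3
    import requests
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    url = neptune_url(host, path)
    # Credentials of the Lambda execution role
    credentials = boto3.Session().get_credentials().get_frozen_credentials()
    request = AWSRequest(method="GET", url=url)
    SigV4Auth(credentials, "neptune-db", region).add_auth(request)
    response = requests.get(url, headers=dict(request.headers.items()), timeout=10)
    response.raise_for_status()
    # Return the parsed body, the step missing from the AWS sample
    return response.json()
```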
From your error message:
Task timed out after 3.00 seconds
You have to increase your Lambda execution timeout, because the current setting of 3 seconds is not enough for the function to complete successfully:
The amount of time that Lambda allows a function to run before stopping it. The default is 3 seconds. The maximum allowed value is 900 seconds.
If your function runs longer than the configured timeout, the Lambda service terminates it.
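The timeout can be raised in the console or through the API. In the boto3 sketch below, `set_timeout` is a hypothetical helper and the function name is a placeholder. Inside the handler, `context.get_remaining_time_in_millis()` shows how much of that budget is left.

```python
MAX_TIMEOUT_SECONDS = 900  # upper limit quoted from the Lambda documentation above


def set_timeout(lambda_client, function_name: str, seconds: int) -> int:
    """Update a function's timeout with a boto3 Lambda client (hypothetical helper).

    Returns the timeout reported back by the service.
    """
    if not 1 <= seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    response = lambda_client.update_function_configuration(
        FunctionName=function_name, Timeout=seconds
    )
    return response["Timeout"]
```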
As a side note:
Since you use Lambda in a VPC, remember that Lambda functions do not have public IPs or internet access. You may not be able to connect to your database even if you increase the function timeout. This can be overcome by running your Lambda function in a private subnet with a correctly set up NAT gateway or NAT instance.

AWS Java SDK - Get EC2 instance info

Given an instance id, I want to get an EC2 instance info (for example, its running status, private IP, public IP).
I have done some research (e.g. looking at the sample code posted here: Managing Amazon EC2 Instances),
but there is only sample code of getting the Amazon EC2 instances for your account and region.
I tried to modify the sample and here is what I came up with:
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.*;

// Wrapper class name is illustrative; the snippet only needs the two methods
public class Ec2InstanceInfo {
    private static AmazonEC2 getEc2StandardClient() {
        // Using StaticCredentialsProvider
        final String accessKey = "access_key";
        final String secretKey = "secret_key";
        BasicAWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
        return AmazonEC2ClientBuilder.standard()
                .withRegion(Regions.AP_NORTHEAST_1)
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();
    }

    public static void getInstanceInfo(String instanceId) {
        final AmazonEC2 ec2 = getEc2StandardClient();
        DryRunSupportedRequest<DescribeInstancesRequest> dryRequest =
            () -> {
                DescribeInstancesRequest request = new DescribeInstancesRequest()
                    .withInstanceIds(instanceId);
                return request.getDryRunRequest();
            };
        DryRunResult<DescribeInstancesRequest> dryResponse = ec2.dryRun(dryRequest);
        if (!dryResponse.isSuccessful()) {
            System.out.println("Failed to get information of instance " + instanceId);
            return; // don't call describeInstances if the dry run failed
        }
        DescribeInstancesRequest request = new DescribeInstancesRequest()
            .withInstanceIds(instanceId);
        DescribeInstancesResult response = ec2.describeInstances(request);
        Reservation reservation = response.getReservations().get(0);
        Instance instance = reservation.getInstances().get(0);
        System.out.println("Instance id: " + instance.getInstanceId() + ", state: " + instance.getState().getName() +
            ", public ip: " + instance.getPublicIpAddress() + ", private ip: " + instance.getPrivateIpAddress());
    }
}
It is working fine but I wonder if it's the best practice to get info from a single instance.
but there is only sample code of getting the Amazon EC2 instances for your account and region.
Yes, you can only get information about instances you have permission to read.
It is working fine but I wonder if it's the best practice to get info from a single instance
You have multiple options.
For getting EC2 metadata from any client (e.g. from your on-premises network), your code seems OK.
If you are running the code in the AWS environment (on an EC2 instance, Lambda, Docker, ...), you can assign a service role that is allowed to call the describeInstances operation. Then you don't need to specify the AWS credentials explicitly (the DefaultAWSCredentialsProviderChain will pick them up).
If you are getting the EC2 metadata from the instance itself, you can use the EC2 metadata service.
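On the instance itself, the metadata service answers at `169.254.169.254` without any AWS credentials. The standard-library sketch below uses IMDSv2: it obtains a session token with a PUT and then sends that token on each GET. `metadata_requests` and `get_metadata` are hypothetical helper names. `instance-id`, `local-ipv4`, and `public-ipv4` are standard metadata categories.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def metadata_requests(path: str, token: str = "", base: str = IMDS_BASE):
    """Build the IMDSv2 token request and the metadata GET for `path` (hypothetical helper)."""
    token_req = urllib.request.Request(
        f"{base}/api/token",
        method="PUT",
        # Token lifetime in seconds (maximum 21600)
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    data_req = urllib.request.Request(
        f"{base}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return token_req, data_req


def get_metadata(path: str, base: str = IMDS_BASE, timeout: float = 2.0) -> str:
    """Fetch one metadata value; only works on an EC2 instance."""
    token_req, _ = metadata_requests(path, base=base)
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    _, data_req = metadata_requests(path, token, base)
    with urllib.request.urlopen(data_req, timeout=timeout) as resp:
        return resp.read().decode()


# Example (on an EC2 instance):
# for key in ("instance-id", "local-ipv4", "public-ipv4"):
#     print(key, get_metadata(key))
```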