How can I get an invoking lambda to run a Cloud Custodian policy in multiple different accounts in one run?

I have multiple c7n-org policies to be run in all regions across a list of accounts. Locally I can do this easily with c7n-org run -c accounts.yml -s out --region all -u cost-control.yml.
The goal is to have an AWS Lambda function run this daily against all accounts in the same way. Currently I have a child Lambda function for each policy in cost-control.yml and an invoker Lambda function that loops through each child function and calls it, passing it the appropriate role ARN to assume and the region each time. Because I am calling the child functions for all accounts and all regions, the child functions are called over and over with different parameters to parse.
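A minimal sketch of such an invoker loop, assuming a fan-out over child functions, accounts, and regions (the function names, payload keys, and account/role values below are illustrative, not the actual setup):

```python
import json

# Hypothetical child functions and parameters; adjust to your setup.
CHILD_FUNCTIONS = ["cost-control-policy-1", "cost-control-policy-2"]
ACCOUNTS = [
    {"account_id": "111111111111",
     "role": "arn:aws:iam::111111111111:role/CustodianRole"},
]
REGIONS = ["us-east-1", "eu-west-1"]

def build_payload(account, region):
    """Build the event passed to a child lambda for one account/region pair."""
    return {
        "account_id": account["account_id"],
        "role": account["role"],
        "region": region,
    }

def invoke_all(lambda_client):
    """Invoke every child function once per account/region combination."""
    for fn in CHILD_FUNCTIONS:
        for account in ACCOUNTS:
            for region in REGIONS:
                lambda_client.invoke(
                    FunctionName=fn,
                    InvocationType="Event",  # async fire-and-forget
                    Payload=json.dumps(build_payload(account, region)),
                )
```

Invoking asynchronously (`InvocationType="Event"`) keeps the invoker fast, at the cost of having to check the child functions' own logs for failures.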
To get the regions to change on each call I needed to remove an if statement in the SDK in handler.py (line 144) that caches the config file, so that it reads the new config with the parameters on subsequent invocations.
# one time initialization for cold starts.
global policy_config, policy_data
if policy_config is None:
    with open(file) as f:
        policy_data = json.load(f)
    policy_config = init_config(policy_data)
    load_resources(StructureParser().get_resource_types(policy_data))
I removed the "if policy_config is None:" line and changed the filename to point at a new config file, which my custodian_policy.py lambda code writes to /tmp and which holds the config with the parameters for the current invocation.
In the log streams for each invocation of the child lambdas, the roles are not assumed properly. The regions change correctly and Cloud Custodian runs the policy against the different regions, but it keeps the initial account from the first invocation. Each log stream shows the lambda assuming the role from the first set of parameters sent by the invoker and then not changing the role on subsequent calls, even though it receives the correct parameters.
I've tried changing the Cloud Custodian SDK code in handler.py's init_config() to try to force it to change the account_id each time. I know I shouldn't be changing the SDK code, though, and there is probably a way to do this properly using the policies.
I've thought about trying the fargate route which would be more like running it locally but I'm not sure if I would come across this issue there too.
Could anyone give me some pointers on how to get cloud custodian to assume roles on many different lambda invocations?

I found the answer in the local_session function in utils.py of the c7n SDK. It was caching the session info for up to 45 minutes, which is why it was reusing the old account info on each lambda invocation within a log stream.
By commenting out lines 324 and 325, I forced c7n to create a new session each time with the passed-in account parameter. The new function looks like this:
def local_session(factory, region=None):
    """Cache a session thread local for up to 45m"""
    factory_region = getattr(factory, 'region', 'global')
    if region:
        factory_region = region
    s = getattr(CONN_CACHE, factory_region, {}).get('session')
    t = getattr(CONN_CACHE, factory_region, {}).get('time')
    n = time.time()
    # if s is not None and t + (60 * 45) > n:
    #     return s
    s = factory()
    setattr(CONN_CACHE, factory_region, {'session': s, 'time': n})
    return s
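The caching behavior that caused the stale account can be reproduced in isolation. This stdlib-only sketch mimics the pattern above (a thread-local cache with a TTL) and shows why a factory carrying new role parameters is simply never consulted while the cached entry is still fresh:

```python
import threading
import time

CONN_CACHE = threading.local()

def cached_session(factory, region='global', ttl=60 * 45):
    """Return the cached session for this region while it is younger than ttl."""
    entry = getattr(CONN_CACHE, region, None)
    now = time.time()
    if entry is not None and entry['time'] + ttl > now:
        return entry['session']  # cache hit: the new factory is never called
    session = factory()
    setattr(CONN_CACHE, region, {'session': session, 'time': now})
    return session

# Two factories standing in for sessions that assume different account roles.
first = cached_session(lambda: "session-for-account-A")
second = cached_session(lambda: "session-for-account-B")  # still account A
```

Within a warm Lambda container, every invocation inside the 45-minute window hits the cached entry, which is exactly the symptom described: correct parameters arrive, but the old session (and its assumed role) is reused.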

Related

Terraform handle multiple lambda functions

I have a requirement to create AWS Lambda functions dynamically based on some input parameters, like name, Docker image, etc.
I have been able to build this using terraform (triggered using gitlab pipelines).
Now the problem is that for every unique name I want a new lambda function to be created/updated, i.e. if I trigger the pipeline 5 times with 5 names then there should be 5 lambda functions; instead, the older function gets destroyed and a new one created.
How do I achieve this?
I am using Resource: aws_lambda_function
Terraform code
resource "aws_lambda_function" "executable" {
  function_name = var.RUNNER_NAME
  image_uri     = var.DOCKER_PATH
  package_type  = "Image"
  role          = role.arn
  architectures = ["x86_64"]
}
I think there is a misunderstanding of how Terraform works.
Terraform maps 1 resource to 1 item in state, and the state file is used to manage all created resources.
The reason your function keeps getting destroyed and recreated with the new values is that you have only 1 resource in your Terraform configuration.
This is the correct and expected behavior from Terraform.
Now, as mentioned by some people above, you could use count or for_each to add new lambda functions without deleting the previous ones, as long as you keep track of the previously passed values (always adding the new values to the list).
Or, if there is no need to keep track of the state of the lambda functions you have created, Terraform may not be the best tool for this. The result you are looking for can easily be implemented in Python, or even in shell with AWS CLI commands.
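As a rough illustration of the non-Terraform route, here is a boto3 sketch that creates a function for a new name and just swaps the image for an existing one. The variable names are assumptions, and a real Glue-less deploy like this keeps no state at all, which is the trade-off being described:

```python
def build_create_kwargs(name, image_uri, role_arn):
    """Assemble the arguments for lambda create_function with a container image."""
    return {
        "FunctionName": name,
        "PackageType": "Image",
        "Code": {"ImageUri": image_uri},
        "Role": role_arn,
        "Architectures": ["x86_64"],
    }

def create_or_update(name, image_uri, role_arn):
    import boto3  # imported lazily so the helper above stays testable offline
    client = boto3.client("lambda")
    try:
        client.create_function(**build_create_kwargs(name, image_uri, role_arn))
    except client.exceptions.ResourceConflictException:
        # Function already exists: just point it at the new image.
        client.update_function_code(FunctionName=name, ImageUri=image_uri)
```

Each pipeline run with a new name then adds a function instead of replacing the previous one, at the cost of having no state file to destroy or audit them with.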

Evaluate AWS CDK Stack output to another Stack in different account

I am creating two stacks using AWS CDK. I use the first stack to create an S3 bucket and upload a Lambda zip file to the bucket using the BucketDeployment construct, like this.
//FirstStack
const deployments = new BucketDeployment(this, 'LambdaDeployments', {
  destinationBucket: bucket,
  destinationKeyPrefix: '',
  sources: [
    Source.asset(path)
  ],
  retainOnDelete: true,
  extract: false,
  accessControl: BucketAccessControl.PUBLIC_READ,
});
I use the second stack just to generate a CloudFormation template for my clients. In the second stack, I want to create a Lambda function parameterized by the S3 bucket name and the key of the Lambda zip I uploaded in the first stack.
//SecondStack
const lambdaS3Bucket = "??"; //TODO
const lambdaS3Key = "??"; //TODO
const bucket = Bucket.fromBucketName(this, "Bucket", lambdaS3Bucket);
const lambda = new Function(this, "LambdaFunction", {
  handler: 'index.handler',
  runtime: Runtime.NODEJS_16_X,
  code: Code.fromBucket(
    bucket,
    lambdaS3Key
  ),
});
How do I reference those parameters automatically in the second stack?
In addition, lambdaS3Bucket needs to include an AWS::Region parameter so that my clients can deploy it in any region (I just need to run the first stack in the region they require).
How do I do that?
I had a similar use case to this one.
The very simple answer is to hardcode the values. The bucketName is obvious.
The lambdaS3Key you can look up in the synthesized template of the first stack.
A more complex answer is to use pipelines for this. I did this: in the build step of the pipeline I extracted all lambdaS3Keys and exported them as environment variables, so in the second stack I could reuse them in the code, like:
code: Code.fromBucket(
  bucket,
  process.env.MY_LAMBDA_KEY
),
I see you are aware of this PR, because you are using the extract flag.
Knowing that, you can probably reuse this property for the Lambda key.
The problem of sharing the names between stacks in different accounts remains nevertheless. My suggestion is to use pipelines and export constants there in the different steps, but a local build script would also do the job.
Do not forget to update the BucketPolicy, and the KeyPolicy if you use encryption; otherwise the customer account won't have access to the file.
You could also read about AWS Service Catalog. It would probably be an easier way to share your CDK products with your customers (the CDK team plans to support out-of-the-box Lambda sharing next).

List all LogGroups using cdk

I am quite new to the CDK, but I'm adding a LogQueryWidget to my CloudWatch dashboard through the CDK, and I need a way to add all LogGroups ending with a given suffix to the query.
Is there a way to either loop through all existing LogGroups and find the ones with the correct suffix, or a way to search through LogGroups?
const queryWidget = new LogQueryWidget({
  title: "Error Rate",
  logGroupNames: ['/aws/lambda/someLogGroup'],
  view: LogQueryVisualizationType.TABLE,
  queryLines: [
    'fields #message',
    'filter #message like /(?i)error/'
  ],
})
Is there any way I can make logGroupNames contain all LogGroups that end with a specific suffix?
You cannot do that dynamically (i.e. you can't make the query adjust automatically when you add a new LogGroup) without using something like an AWS Lambda that periodically updates your Log Query.
However, because CDK is just code, there is nothing stopping you from making an AWS SDK API call inside the code to retrieve all the log groups (see https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/CloudWatchLogs.html#describeLogGroups-property) and then populating logGroupNames accordingly.
That way, when CDK synthesizes, it will make an API call to fetch the LogGroups, and the generated CloudFormation will contain the log groups you need. Note that this list is only updated when you re-synthesize and re-deploy your stack.
Finally, note that there is a limit on how many Log Groups you can query with Log Insights (20 according to https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html).
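In boto3 terms (the JS SDK call linked above is analogous), that lookup could be sketched like this; the suffix value is an assumption:

```python
def filter_by_suffix(group_names, suffix):
    """Keep only the log group names ending with the given suffix."""
    return [name for name in group_names if name.endswith(suffix)]

def log_groups_with_suffix(suffix):
    import boto3  # lazy import so filter_by_suffix is testable without AWS
    logs = boto3.client("logs")
    names = []
    # describe_log_groups is paginated; walk every page.
    for page in logs.get_paginator("describe_log_groups").paginate():
        names.extend(group["logGroupName"] for group in page["logGroups"])
    return filter_by_suffix(names, suffix)
```

The resulting list (trimmed to the 20-group Log Insights limit if necessary) would then be passed to logGroupNames at synth time.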
If you want to achieve this, you can create a custom resource using the AwsCustomResource and AwsSdkCall classes to make the AWS SDK API call (as mentioned by @Tofig above) as part of the deployment. You can read data from the API call response as well and act on it as you want.

Daily AWS Lambda not creating Athena partition, although the command runs successfully

I have an Athena database set up pointing at an S3 bucket containing ALB logs, and it all works correctly. I partition the table by a column called datetime and the idea is that it has the format YYYY/MM/DD.
I can manually create partitions through the Athena console, using the following command:
ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'
I have created a lambda to run daily to create a new partition; however, this doesn't seem to work. I use the boto3 Python client and execute the following:
result = athena.start_query_execution(
    QueryString = "ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'",
    QueryExecutionContext = {
        'Database': 'web'
    },
    ResultConfiguration = {
        "OutputLocation" : "s3://aws-athena-query-results-093305704519-eu-west-1/Unsaved/"
    }
)
This appears to run successfully without any errors, and the query execution even returns a QueryExecutionId as it should. However, if I run SHOW PARTITIONS web.alb_logs; via the Athena console, the partition hasn't been created.
I have a feeling it could be down to permissions; however, I have given the lambda execution role full permissions to all resources on S3 and on Athena, and it still doesn't seem to work.
Since Athena query execution is asynchronous, your Lambda function never sees the result of the query execution; it just gets the result of starting the query.
I would be very surprised if this wasn't a permissions issue, but because of the above, the error will not appear in the Lambda logs. What you can do is log the query execution ID and look it up with the GetQueryExecution API call to see whether the query succeeded.
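A sketch of that check; query_execution_id is the value returned by start_query_execution above:

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def is_terminal(state):
    """True once Athena will not change the query state anymore."""
    return state in TERMINAL_STATES

def wait_for_query(athena, query_execution_id, delay=1.0):
    """Poll GetQueryExecution until the query reaches a terminal state."""
    while True:
        status = athena.get_query_execution(
            QueryExecutionId=query_execution_id
        )["QueryExecution"]["Status"]
        if is_terminal(status["State"]):
            # On failure, StateChangeReason carries the error message
            # (e.g. the permissions error that never reached your logs).
            return status["State"], status.get("StateChangeReason")
        time.sleep(delay)
```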
Even better would be to rewrite your code to use the Glue APIs directly to add the partitions. Adding a partition is a quick, synchronous operation in Glue, which means you can make the API call and get a status back in the same Lambda execution. Have a look at the APIs for working with partitions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html
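A sketch of the Glue route, mirroring the ALTER TABLE statement above (database, table, bucket, and path layout are taken from the question; note that a real PartitionInput usually also needs the table's column and SerDe information, typically copied from glue.get_table):

```python
def partition_input(date_str, account_id, region):
    """Build a minimal Glue PartitionInput for one day of ALB logs."""
    year, month, day = date_str.split("-")
    location = (
        f"s3://mybucket/AWSLogs/{account_id}/elasticloadbalancing/"
        f"{region}/{year}/{month}/{day}/"
    )
    return {
        "Values": [date_str],
        # A complete StorageDescriptor also needs Columns, InputFormat,
        # OutputFormat and SerdeInfo, copied from the table definition.
        "StorageDescriptor": {"Location": location},
    }

def add_partition(date_str, account_id, region):
    import boto3  # lazy import keeps partition_input testable offline
    glue = boto3.client("glue")
    # create_partition raises AlreadyExistsException on duplicates,
    # and any permissions error surfaces immediately in this Lambda run.
    glue.create_partition(
        DatabaseName="web",
        TableName="alb_logs",
        PartitionInput=partition_input(date_str, account_id, region),
    )
```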

Listing Notebook instance tags can take ages

I am currently using the boto3 SDK from a Lambda function in order to retrieve various information about the SageMaker Notebook Instances deployed in my account (almost 70, so not that many...).
One of the operations I am trying to perform is listing the tags of each instance.
However, from time to time it takes ages to return the tags: my Lambda either gets stopped (I could increase the timeout, but still...) or a ThrottlingException is raised by the sagemaker.list_tags call (which could be avoided by increasing the number of retries when creating the sagemaker boto3 client):
sagemaker = boto3.client("sagemaker", config=Config(retries=dict(max_attempts=10)))
instances_dict = sagemaker.list_notebook_instances()
if not instances_dict['NotebookInstances']:
    return "No Notebook Instances"
while instances_dict:
    for instance in instances_dict['NotebookInstances']:
        print(instance['NotebookInstanceArn'])
        start = time.time()
        tags_notebook_instance = sagemaker.list_tags(ResourceArn=instance['NotebookInstanceArn'])['Tags']
        print(time.time() - start)
    instances_dict = sagemaker.list_notebook_instances(NextToken=instances_dict['NextToken']) if 'NextToken' in instances_dict else None
If you guys have any idea to avoid such delays :)
TY
As you've noted, you're getting throttled. Rather than increasing the number of retries, you might try changing the delay (i.e. increasing the growth_factor). It seems to be configurable, looking at https://github.com/boto/botocore/blob/develop/botocore/data/_retry.json#L83
Note that buckets (and refill rates) are usually at the second granularity. So with 70 ARNs you're looking at some number of seconds; double digits does not surprise me.
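Alternatively, the pacing can be done client-side. A minimal exponential-backoff wrapper around list_tags (the base delay, growth factor, and cap below are illustrative, not tuned values):

```python
import time

def backoff_delay(attempt, base=0.5, growth=2.0, cap=30.0):
    """Exponential backoff delay in seconds for the given (0-indexed) attempt."""
    return min(base * (growth ** attempt), cap)

def list_tags_with_backoff(sagemaker, arn, max_attempts=10):
    """Call list_tags, sleeping progressively longer each time we are throttled."""
    for attempt in range(max_attempts):
        try:
            return sagemaker.list_tags(ResourceArn=arn)["Tags"]
        except sagemaker.exceptions.ClientError:  # ThrottlingException and friends
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"still throttled after {max_attempts} attempts for {arn}")
```

As noted below, though, stretching the delays only makes the single Lambda run longer; splitting the work up is the more robust fix.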
You might want to consider breaking up the work differently since adding retries/larger growth_factor will just increase the length of time the function will run.
I've had pretty good success at breaking things up so that the Lambda function only processes a single ARN per invocation. The Lambda is processing work (I'll typically use a SQS queue to manage what needs to be processed) and the rate of work is configurable via a combination of configuring the Lambda and the SQS message visibility.
Not knowing what you're trying to accomplish outside of your original Lambda, I realize that breaking up the work this way might (or will) add challenges to what you're doing overall.
It's also worth noting that if you have CloudTrail enabled the tags will be part of the event data (request data) for the "EventName" (which matches the method called, i.e. CreateTrainingJob, AddTags, etc.).
A third option would be if you are trying to find all of the notebook instances with a specific tag then you can use Resource Groups to create a query and find the ARNs with those tags fairly quickly.
CloudTrail: https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/Welcome.html
Resource Groups: https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html
Lambda with SQS: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html