Enable AWS Glue Continuous Logging from create_job

I'm creating a Glue job using the boto3 create_job call. I'd like to pass a parameter that enables Continuous Logging (with no filter) for this new job.
Unfortunately, I can't find any useful parameter to enable it in the documentation I've checked.
Any suggestions?

Found a way to do it. As with other job arguments, you just need to pass the log-related arguments in DefaultArguments, like so:
glueClient.create_job(
    Name="testBoto",
    Role="Role_name",
    Command={
        'Name': "some_name",
        'ScriptLocation': "some_location"
    },
    DefaultArguments={
        # Turns on continuous logging to CloudWatch for this job.
        "--enable-continuous-cloudwatch-log": "true",
        # "true" applies the standard filter (prunes Spark driver/executor and
        # YARN heartbeat messages); use "false" if you want no filter at all.
        "--enable-continuous-log-filter": "true"
    }
)
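If you want to confirm the arguments were picked up, you can read the job back with get_job (a quick sketch; the client setup and job name are just illustrative):
import boto3

# Illustrative client and job name; reuse whatever you passed to create_job.
glueClient = boto3.client("glue")
job = glueClient.get_job(JobName="testBoto")

# The continuous-logging flags show up under the job's DefaultArguments.
print(job["Job"]["DefaultArguments"])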


Why is the JSON output of my task being escaped by AWS Step Functions?

I have a Lambda function that runs in a step function.
The Lambda function returns a JSON string as output.
When I debug the function locally, the JSON is valid, but when I run the step function and look at the step after my function, all my " characters have turned into \" and there is a " at the beginning and end of my JSON.
So a JSON object that looks like the following when I debug my function:
{"test":60,"test2":"30000","test3":"result1"}
Ends up looking like the following as the input of the step after my lambda:
"{\"test\":60,\"test2\":\"30000\",\"test3\":\"result1\"}"
Why does my valid JSON object end up being escaped?
How can I prevent this from happening?
"The Lambda function returns a JSON string as output."
That is exactly why your JSON is being escaped: you're returning your object as a JSON string (e.g. using JSON.stringify) rather than as a JSON object.
The easiest way to fix this would be to just return the object & not convert the output to a JSON string. That way, it won't be escaped & will be returned, as you expect, as an object.
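For example, in a Python Lambda the difference looks like this (a minimal sketch; json.dumps is Python's equivalent of JSON.stringify):
import json  # only needed if you were stringifying the result

def handler(event, context):
    result = {"test": 60, "test2": "30000", "test3": "result1"}

    # Don't do this: json.dumps turns the payload into a string, which
    # Step Functions then passes on (and escapes) as a string.
    # return json.dumps(result)

    # Do this: return the object itself so the next state receives a
    # proper JSON object.
    return result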
However, if it must stay as a JSON string for whatever reason, you can use the States.StringToJson(...) intrinsic function to unescape the escaped JSON string using the ResultSelector property of your task.
So for example, if your output is:
{
  "Payload": "{\"test\":60,\"test2\":\"30000\",\"test3\":\"result1\"}",
  ...
}
To be able to unescape the output before passing it to the next task, set the ResultSelector of your task to:
"ResultSelector": {
"Payload.$":"States.StringToJson($.Payload)"
}
Or, if you're using Workflow Studio, click on the task, tick the "Transform result with ResultSelector - optional" checkbox under Output, and fill in the text box with the ResultSelector object above.
Either way, the final result of your task definition should look like this:
{
  ...
  "States": {
    "Lambda Invoke": {
      "Type": "Task",
      ...
      "ResultSelector": {
        "Payload.$": "States.StringToJson($.Payload)"
      }
    }
  }
}
The output will then be as you expect:
{
  "Payload": {
    "test": 60,
    "test2": "30000",
    "test3": "result1"
  }
}
While the answer from @Ermiya Eskandary is entirely correct, you also have two more properties you can use to your advantage, with or without ResultSelector (if it's a stringified JSON, you pretty much have to use ResultSelector, as that answer mentions): ResultPath and OutputPath.
If you do not need the incoming event for anything else after this Lambda, then have your Lambda return a JSON-like object (i.e. if in Python, return a dict).
Then, in your state machine definition, include two properties in your Lambda task:
"OutputPath": "$.SomeKey",
"ResultPath": "$.SomeKey"
SomeKey has to be the same in both.
Together, these two task properties say (ResultPath) "put the output of this Lambda into the event under the key 'SomeKey'" and then (OutputPath) "only send that key on to the next task".
If you still need the data from the input, you can use ResultPath alone, which puts the output of the Lambda under the assigned key and appends it to the input event as well.
See the documentation for more info.
Newbie to step functions here. I noticed there are two different ways to call a lambda from step functions:
The AWS-SDK way using Resource: arn:aws:states:::aws-sdk:lambda:invoke
The "optimised" way using Resource: arn:aws:states:::lambda:invoke
I found that the "optimised" way does a much better job with the JSON coming back from the Python Lambda, whereas the AWS SDK way gave me an escaped mess.

Pass CDK context values per deployment environment

I am using context to pass values to CDK. Is there currently a way to define a project context file per deployment environment (dev, test), so that as the number of values I have to pass grows, they are easier to manage than passing them on the command line:
cdk synth --context bucketName1=my-dev-bucket1 --context bucketName2=my-dev-bucket2 MyStack
It would be possible to use one cdk.json context file and only pass the environment as the context value on the command line, and depending on its value select the correct values:
{
  ...
  "context": {
    "devBucketName1": "my-dev-bucket1",
    "devBucketName2": "my-dev-bucket2",
    "testBucketName1": "my-test-bucket1",
    "testBucketName2": "my-test-bucket2"
  }
}
But preferably, I would like to split it into separate files, e.g. cdk.dev.json and cdk.test.json, which would contain their corresponding values, and use the correct one depending on the environment.
According to the documentation, CDK will look for context in one of several places. However, there's no mention of defining multiple/additional files.
The best solution I've been able to come up with is to use nested JSON to separate context out per environment:
"context": {
"dev": {
"bucketName": "my-dev-bucket"
}
"prod": {
"bucketName": "my-prod-bucket"
}
}
This allows you to access the different values programmatically depending on which environment CDK is deploying to.
let myEnv = "dev" // This could be passed in as a property of the class instead and accessed via props.myEnv
const myBucket = new s3.Bucket(this, "MyBucket", {
  bucketName: this.node.tryGetContext(myEnv).bucketName
})
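A rough Python equivalent of the same pattern, assuming the nested cdk.json context above and CDK v2 (the stack, bucket, and "env" context key names are just placeholders):
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class MyStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Pick the environment, e.g. `cdk synth -c env=dev`; default to dev.
        my_env = self.node.try_get_context("env") or "dev"
        env_context = self.node.try_get_context(my_env) or {}

        s3.Bucket(self, "MyBucket",
                  bucket_name=env_context.get("bucketName"))

app = App()
MyStack(app, "MyStack")
app.synth()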
You can also handle this programmatically in your code.
For instance, I have a context variable deploy_tag that I pass on the command line: cdk deploy Stack\* -c deploy_tag=PROD
Then in my code I retrieve that deploy_tag variable and make the decisions there, such as (using Python, but the idea is the same):
bucket_name = BUCKET_NAME_PROD if deploy_tag == 'PROD' else BUCKET_NAME_DEV
This can give you a lot more control, and if you set up a constants file in your code you can keep that up to date with far less in your cdk.json, which may otherwise become very cluttered with larger stacks and multiple environments. If you go this route, you can have separate Prod and Dev constants files, and your context variable can tell your CDK app which one to load for a given deployment.
I also tend to create a class object with all my deployment properties either assigned or derived, pass that object into each stack, and retrieve what I need from it there; a rough sketch of that pattern follows.
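A minimal sketch of that constants-file approach, assuming CDK v2 and a hypothetical constants.py (all names and values are illustrative):
from aws_cdk import App

# Hypothetical constants module, e.g. constants.py containing:
#   PROD = {"bucket_name": "my-prod-bucket"}
#   DEV = {"bucket_name": "my-dev-bucket"}
import constants

app = App()

# Passed on the command line, e.g. `cdk deploy Stack\* -c deploy_tag=PROD`.
deploy_tag = app.node.try_get_context("deploy_tag") or "DEV"
config = constants.PROD if deploy_tag == "PROD" else constants.DEV

# Pass `config` (or a small properties object built from it) into each stack,
# e.g. MyStack(app, f"MyStack-{deploy_tag}", config=config), and read what you
# need from it inside the stack.
app.synth()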

substitution variable $BRANCH_NAME gives nothing while building

I'm building Docker images using a Cloud Build trigger. Previously $BRANCH_NAME was working, but now it's giving null.
Thanks in advance.
I will post my comment as an answer, as it is too long for the comment section.
According to this documentation, you should be able to use the $BRANCH_NAME default substitution for builds invoked by triggers.
The same documentation states:
"If a default substitution is not available (such as with sourceless builds, or with builds that use storage source), then occurrences of the missing variable are replaced with an empty string."
I assume this might be the reason you are receiving NULL.
Have you performed any changes? Could you please provide some further information, such as your .yaml/.json file, your trigger configuration and the error you are receiving?
The problem was not with $BRANCH_NAME; I was using the resulting build JSON to fetch the branch name,
like this:
"source": {
"repoSource": {
"projectId": "project_id",
"repoName": "bitbucket_repo_name",
"branchName": "integration"
}
}
and
I was using build_details['source']['repoSource']['branchName']
but now the response looks like this:
"source": {
"repoSource": {
"projectId": "project_id",
"repoName": "bitbucket_repo_name",
"commitSha": "ght8939jj5jd9jfjfjigk0949jh8wh4w"
}
},
so now I'm using build_details['substitutions']['BRANCH_NAME'] and it's working fine.
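If you're reading the build details from a Cloud Build Pub/Sub notification in Python, the lookup with a fallback might look roughly like this (a sketch; the message shape and function name are assumptions):
import base64
import json

def get_branch_name(event):
    # Cloud Build notifications arrive as a base64-encoded Build resource
    # in the Pub/Sub message data (assumed shape).
    build_details = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    # Prefer the BRANCH_NAME substitution; source.repoSource may only
    # carry a commitSha, as shown above.
    branch = build_details.get("substitutions", {}).get("BRANCH_NAME")
    if branch is None:
        branch = (build_details.get("source", {})
                               .get("repoSource", {})
                               .get("branchName"))
    return branch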

How to run Lambda created in CDK on a regular basis?

As the title says - I've created a Lambda in the Python CDK and I'd like to know how to trigger it on a regular basis (e.g. once per day).
I'm sure it's possible, but I'm new to the CDK and I'm struggling to find my way around the documentation. From what I can tell it will use some sort of event trigger - but I'm not sure how to use it.
Can anyone help?
Sure - it's fairly simple once you get the hang of it.
First, make sure you're importing the right libraries:
from aws_cdk import core, aws_events, aws_events_targets
Then you'll need to make an instance of the Schedule class and use core.Duration (docs for that here) to set the rate - let's say 1 day, for example:
lambda_schedule = aws_events.Schedule.rate(core.Duration.days(1))
Then you want to create the event target - this is the actual reference to the Lambda you created in your CDK earlier:
event_lambda_target = aws_events_targets.LambdaFunction(handler=lambda_defined_in_cdk_here)
Lastly you bind it all together in an aws_events.Rule like so:
lambda_cw_event = aws_events.Rule(
    self,
    "Rule_ID_Here",
    description="The once per day CloudWatch event trigger for the Lambda",
    enabled=True,
    schedule=lambda_schedule,
    targets=[event_lambda_target])
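If you'd rather run at a specific time of day than on a fixed rate, aws_events.Schedule.cron plugs into the Rule the same way (a small sketch; the times are just an example):
# Run every day at 12:00 UTC instead of every 24 hours from deploy time.
lambda_schedule = aws_events.Schedule.cron(minute="0", hour="12")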
Hope that helps!
The question is for Python, but I thought it might be useful to post a JavaScript equivalent:
const aws_cdk_lib = require("aws-cdk-lib");
const aws_events = require("aws-cdk-lib/aws-events");
const aws_events_targets = require("aws-cdk-lib/aws-events-targets");
const MyLambdaFunction = <...SDK code for Lambda function here...>
new aws_events.Rule(this, "my-rule-identifier", {
  schedule: aws_events.Schedule.rate(aws_cdk_lib.Duration.days(1)),
  targets: [new aws_events_targets.LambdaFunction(MyLambdaFunction)],
});
Note: The above is for version 2 of the CDK (aws-cdk-lib) - other major versions might need a few tweaks.

Should I have concern about datastoreRpcErrors?

When I run Dataflow jobs that write to Google Cloud Datastore, sometimes the metrics show that I had one or two datastoreRpcErrors.
Since these Datastore writes usually contain a batch of keys, I am wondering whether, in the case of an RpcError, some retry happens automatically. If not, what would be a good way to handle these cases?
tl;dr: By default, these Datastore RPC errors are retried automatically, up to 5 times.
I dug into the code of datastoreio in the Beam Python SDK. It looks like the final entity mutations are flushed in batches via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
    self._datastore, self._project, self._mutations,
    self._throttler, self._update_rpc_stats,
    throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper. There is a @retry.with_exponential_backoff decorator on the commit method, and the default number of retries is set to 5; retry_on_rpc_error defines the concrete RPCError and SocketError reasons that trigger a retry.
for mutation in mutations:
  commit_request.mutations.add().CopyFrom(mutation)

@retry.with_exponential_backoff(num_retries=5,
                                retry_filter=retry_on_rpc_error)
def commit(request):
  # Client-side throttling.
  while throttler.throttle_request(time.time()*1000):
    ...
  try:
    response = datastore.commit(request)
    ...
  except (RPCError, SocketError):
    if rpc_stats_callback:
      rpc_stats_callback(errors=1)
    raise
  ...
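If you write to Datastore outside of datastoreio and want similar behaviour, a generic retry wrapper is straightforward to sketch (this is an illustration, not the Beam implementation; which exception types you retry on is up to you):
import random
import time

def with_backoff(fn, num_retries=5, base_delay=1.0, retryable=(Exception,)):
    # Call fn(), retrying up to num_retries times on the given exception
    # types, sleeping base_delay * 2**attempt plus jitter between attempts.
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == num_retries:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))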
I think you should first of all determine which kind of error occurred in order to see what your options are.
In the official Datastore documentation, there is a list of all the possible errors and their error codes. Fortunately, they come with recommended actions for each.
My advice is that you implement their recommendations and look for alternatives if they are not effective for you.