Evaluating value of dag_run.conf - google-cloud-platform

I'm very new to airflow, and have been playing with it on GCP.
I'm modifying the example at https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf that shows how a DAG can be invoked by a cloud function.
That simple DAG just prints the content of dag_run.conf using the BashOperator.
I'm now trying to get the values of dag_run.conf['bucket'] and dag_run.conf['name'] in order to create an example where I use the CloudSqlImportOperator.
My problem is that I can't find a way to get those values passed as part of the body of the operator.
My understanding is that Jinja templates get evaluated by the operators. My first attempt was to do:
import_body = {
    "importContext": {
        "fileType": "csv",
        "database": "dw",
        "uri": "gs://{{ dag_run.conf['bucket'] }}/{{ dag_run.conf['name'] }}",
        "csvImportOptions": {
            "table": "SHAKESPEARE",
            "columns": ["word", "word_count", "corpus", "corpus_date"]
        }
    }
}
And that fails because the Jinja template never gets evaluated; the operator receives the literal string "gs://{{ dag_run.conf['bucket'] }}/{{ dag_run.conf['name'] }}" instead.
I tried to pass a string instead:
import_body = """{
"importContext": {
"fileType": "csv",
"database": "dw",
"uri": "gs://{{ dag_run.conf['bucket'] }}/{{ dag_run.conf['name'] }}",
"csvImportOptions": {
"table": "SHAKESPEARE",
"columns": ["word", "word_count", "corpus", "corpus_date"]
}
}
}"""
And now I get a different error: 'str' object has no attribute 'get'
I've seen examples using the PythonOperator and kwargs to fetch the contents, but so far no example that reads the contents of that dag_run.conf object inside the code.
What would be a proper way of doing that?
Cheers

In the example you mentioned, the Jinja template is passed to the parameter bash_command, which is a templated field. If you look at the PythonOperator source code you can see that the only templated parameter is templates_dict, so to make Airflow evaluate {{ dag_run.conf['bucket'] }} you need to pass it through this parameter. I am shooting in the dark here because you did not post the full code, but the solution should be something like the following:
Inside the Python function that your PythonOperator calls (works with Python 3.6+):
def my_func(**kwargs):
    # Airflow renders templates_dict before calling this function
    templates_dict = kwargs['templates_dict']
    import_body = {
        "importContext": {
            "fileType": "csv",
            "database": "dw",
            "uri": f"gs://{templates_dict['bucket']}/{templates_dict['name']}",
            "csvImportOptions": {
                "table": "SHAKESPEARE",
                "columns": ["word", "word_count", "corpus", "corpus_date"]
            }
        }
    }
When you define the PythonOperator in the DAG:
python_operator.PythonOperator(
    task_id='task_id',
    python_callable=my_func,
    provide_context=True,
    templates_dict={
        "bucket": "{{ dag_run.conf['bucket'] }}",
        "name": "{{ dag_run.conf['name'] }}"
    },
    dag=dag
)
Note that I referenced Airflow version 1.10.2, which I assume you are running because you tagged google-cloud-composer and that is the latest version it supports.
If you look at 1.10.3 you can see that op_args and op_kwargs were added to the templated fields of the PythonOperator, so after the next version update you will be able to pass the values through those as well, as in the sketch below.
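For example, once you are on 1.10.3+, a minimal sketch of that approach could look like this (the task id and callable name are illustrative, not from the original post):

# Sketch for Airflow 1.10.3+, where op_args and op_kwargs are templated fields.
def print_gcs_path(bucket, name, **kwargs):
    # bucket and name arrive already rendered from dag_run.conf
    print("gs://{}/{}".format(bucket, name))

python_operator.PythonOperator(
    task_id='print_gcs_path',  # illustrative task id
    python_callable=print_gcs_path,
    provide_context=True,
    op_kwargs={
        "bucket": "{{ dag_run.conf['bucket'] }}",
        "name": "{{ dag_run.conf['name'] }}",
    },
    dag=dag,
)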

Related

awscli DescribeUserPoolClient returns almost nothing, although the documentation says it should return all app client settings

I'm trying to run DescribeUserPoolClient through Python code and also through CloudShell, and the command returns almost nothing:
{
    "UserPoolClient": {
        "UserPoolId": "id",
        "ClientName": "name",
        "ClientId": "id",
        "ClientSecret": "secret",
        "LastModifiedDate": "2021-05-10T14:21:24.733000+00:00",
        "CreationDate": "2021-05-10T14:21:24.733000+00:00",
        "RefreshTokenValidity": 30,
        "TokenValidityUnits": {},
        "AllowedOAuthFlows": [
            "client_credentials"
        ],
        "AllowedOAuthScopes": [
            ":write"
        ],
        "AllowedOAuthFlowsUserPoolClient": true
    }
}
These are the only parameters it returns, but the documentation says there should be a lot more, like "ExplicitAuthFlows" and others. Is this an issue with AWS, or maybe with my access rights?
For anyone having trouble with the same issue: if a property is set to its default value and you have never touched (edited) it, Amazon won't return it in the response. This works the same way for many other AWS CLI commands.
Maybe it is common knowledge, but I struggled with it.
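For reference, a minimal boto3 sketch of the same call (the pool and client IDs are placeholders), showing how to read such optional fields defensively:

import boto3

client = boto3.client('cognito-idp')
response = client.describe_user_pool_client(
    UserPoolId='us-east-1_example',  # placeholder
    ClientId='exampleclientid'       # placeholder
)
# Properties still at their defaults (e.g. ExplicitAuthFlows, if never edited)
# may be absent from the response entirely, so read them with a fallback.
flows = response['UserPoolClient'].get('ExplicitAuthFlows', [])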

CDK adds random parameters

So I have this function I'm trying to declare, and it works and deploys just dandy unless you uncomment the logRetention setting. If logRetention is specified, the cdk deploy operation adds additional parameters to the stack. And, of course, this behavior is completely unexplained in the documentation.
https://docs.aws.amazon.com/cdk/api/latest/docs/aws-lambda-readme.html#log-group
SingletonFunction.Builder.create(this, "native-lambda-s3-fun")
    .functionName(funcName)
    .description("")
    // .logRetention(RetentionDays.ONE_DAY)
    .handler("app")
    .timeout(Duration.seconds(300))
    .runtime(Runtime.GO_1_X)
    .uuid(UUID.randomUUID().toString())
    .environment(new HashMap<String, String>() {{
        put("FILE_KEY", "/file/key");
        put("S3_BUCKET", junk.getBucketName());
    }})
    .code(Code.fromBucket(uploads, functionUploadKey(
        "formation-examples",
        "native-lambda-s3",
        lambdaVersion.getValueAsString()
    )))
    .build();
"Parameters": {
"lambdaVersion": {
"Type": "String"
},
"AssetParametersceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40aS3BucketB030C8A8": {
"Type": "String",
"Description": "S3 bucket for asset \"ceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40a\""
},
"AssetParametersceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40aS3VersionKey6A2AABD7": {
"Type": "String",
"Description": "S3 key for asset version \"ceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40a\""
},
"AssetParametersceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40aArtifactHashEDC522F0": {
"Type": "String",
"Description": "Artifact hash for asset \"ceefd938ac7ea929077f2e2f4cf09b5034ebdd14799216b1281f4b28427da40a\""
}
},
It's a bug. They're Working On It™. So, rejoice - we can probably expect a fix sometime within the next decade.
I haven't tried it yet, but I'm guessing the workaround is to manipulate the low-level CfnLogGroup construct, since it has the authoritative retentionInDays property. The relevant high-level Log Group construct can probably be obtained from the Function via its logGroup property. Failing that, the LogGroup can be created from scratch (which will probably be a headache all on its own).
I also encountered the problem described above. From what I can tell, we are unable to specify a log group name and thus the log group name is predictable.
My solution was to simply create a LogGroup with the same name as my Lambda function with the /aws/lambda/ prefix.
Example:
var function = new Function(
    this,
    "Thing",
    new FunctionProps
    {
        FunctionName = $"{Stack.Of(this).StackName}-Thing",
        // ...
    });

_ = new LogGroup(
    this,
    "ThingLogGroup",
    new LogGroupProps
    {
        LogGroupName = $"/aws/lambda/{function.FunctionName}",
        Retention = RetentionDays.ONE_MONTH,
    });
This does not create unnecessary "AssetParameters..." CF template parameters like the inline option does.
Note: I'm using CDK version 1.111.0 and 1.86.0 with C#

How to use StringMap parameters in SSM documents?

I have the following step in an SSM document. The result of the call is JSON, so I wanted to parse it as a StringMap (which seems to be the correct type for it) instead of creating an output for each variable I want to reference.
I've tried referencing this as both:
{{ GetLoadBalancerProperties.Description.Scheme }}
and
{{ GetLoadBalancerProperties.Description[\"LoadBalancerName\"] }}
In both cases I get an error saying the variable was never defined
{
    "name": "GetLoadBalancerProperties",
    "action": "aws:executeAwsApi",
    "isCritical": true,
    "maxAttempts": 1,
    "onFailure": "step:deleteParseCloudFormationTemplate",
    "inputs": {
        "Service": "elb",
        "Api": "describe-load-balancers",
        "LoadBalancerNames": [
            "{{ ResourceId }}"
        ]
    },
    "outputs": [
        {
            "Name": "Description",
            "Selector": "$.LoadBalancerDescriptions[0]",
            "Type": "StringMap"
        }
    ]
}
This is the actual message:
Step fails when it is validating and resolving the step inputs. Failed to resolve input: GetLoadBalancerProperties.Description["LoadBalancerName"] to type String. GetLoadBalancerProperties.Description["LoadBalancerName"] is not defined in the Automation Document.. Please refer to Automation Service Troubleshooting Guide for more diagnosis details.
I believe the answer you were searching for is here:
https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-plugins.html#top-level-properties-type
Just to name a few examples:
The Map type is a Python dict, hence if your output is a dict you should use StringMap in the SSM document.
The List type is the same as a Python list.
So if your output is a list of dictionaries, the type you want to use is MapList.
In some cases it seems that you cannot. I was able to work around this issue by using a Python script in the SSM document to output the right type, but otherwise I believe the SSM document is not flexible enough to cover all cases.
The script I used:
- name: myMainStep
  action: aws:executeScript
  inputs:
    Runtime: python3.6
    Handler: myMainStep
    InputPayload:
      param: "{{ myPreviousStep.myOutput }}"
    Script: |-
      def myMainStep(events, context):
          # the key must match the InputPayload key above
          myOutput = events['param']
          for tag in myOutput:
              if tag["Key"] == "myKey":
                  return tag["Value"]
          return "myDefaultValue"
  outputs:
    - Name: output
      Selector: "$.Payload"
      Type: String
You can find out what myOutput should be in the AWS web console under SSM > Automation > your execution (if you have already executed the automation once) > executeScript step > input parameters.

Can we use $ref in x-mediation-script?

I want to hard-code some data using x-mediation-script, and I would like to use a $ref that gets resolved in setPayloadJSON. Is this possible? Any suggestions or samples would be appreciated.
"x-mediation-script": "mc.setProperty('CONTENT_TYPE', 'application/json');mc.setPayloadJSON('$ref', '#/definitions/out');"
"definitions":{
"out":{
"type" : "object",
"required": ["NAME"],
"properties": {
"NAME2": {"type": "string"},
"NAME3": {"type": "string"},
"NAME3": {"type": "string"},
"NAME4": {"type": "string"},
}
}
}
It is not possible to access Swagger content from the mediation script using $ref because:
x-mediation-script is JS and cannot use Swagger syntax in its code.
API Manager does not process the script; when the API is published, only the x-mediation-script content is copied to the Synapse file.
As a solution, create a JS variable in the x-mediation-script and use it:
mc.setProperty('CONTENT_TYPE', 'application/json'); // Set the content type of the payload to the message context
var town = mc.getProperty('uri.var.town'); // Get the path parameter 'town' and store in a variable
mc.setPayloadJSON('{ "Town" : "'+town+'"}'); // Set the new payload to the message context.

Fetching With Distinct/Unique Values

I have a Cloudant database with objects that use the following format:
{
    "_id": "0ea1ac7d5ef28860abc7030444515c4c",
    "_rev": "1-362058dda0b8680a818b38e9c68c5389",
    "text": "text-data",
    "time-data": "1452988105",
    "time-text": "3:48 PM - 16 Jan 2016",
    "link": "http://url/to/website"
}
I want to fetch objects where the text attribute is distinct. There will be objects with duplicate text and I want Cloudant to handle removing them from a query.
How do I go about creating a MapReduce view that will do this for me? I'm completely new to MapReduce and I'm having difficulty understanding the relationship between the map and reduce functions. I tried tinkering with the built-in COUNT function and writing my own view, but they've failed catastrophically, haha.
Anyways, would it be easier to just delete the duplicates? If so, how do I do that?
While I'm trying to study this and find ELI5s, would anyone help me out? Thanks in advance! I appreciate it.
I'm not sure a MapReduce view is what you are looking for. A MapReduce view will essentially allow you to get the text and the number of docs with that same text, but you really won't be able to get the rest of the fields in the doc (because MapReduce has no idea which doc to return when multiple docs match the text). Here is a sample MapReduce view:
{
    "_id": "_design/textObjects",
    "views": {
        "by_text": {
            "map": "function (doc) { if (doc.text) { emit(doc.text, 1); }}",
            "reduce": "_count"
        }
    },
    "language": "javascript"
}
What this is doing:
The map part of the MapReduce takes each doc and maps it into a row that looks like this:
{"key":"text-data", "value":1}
So, if you had 7 docs, 2 where text="text-data" and 5 where text="other-text-data" the data would look like this:
{"key":"text-data", "value":1}
{"key":"text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
The reduce part of the MapReduce ("reduce": "_count") groups the docs above by the key and returns the count:
{"key":"text-data","value":2},
{"key":"other-text-data","value":5}
You can query this view on your Cloudant instance:
https://<yourcloudantinstance>/<databasename>/_design/textObjects/_view/by_text?group=true
This will result in something similar to the following:
{"rows":[
{"key":"text-data","value":2},
{"key":"other-text-data","value":5}
]}
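From Python, one way to issue that query is with the requests library; this is just a sketch, and the account, database, and credentials are placeholders:

import requests

view_url = ('https://ACCOUNT.cloudant.com/DATABASE'  # placeholders
            '/_design/textObjects/_view/by_text')
response = requests.get(view_url, params={'group': 'true'},
                        auth=('USERNAME', 'PASSWORD'))  # placeholder credentials
print(response.json())  # {"rows": [{"key": ..., "value": ...}, ...]}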
If this is not what you are looking for, but rather you are just looking to keep the latest info for a specific text value then you can simply find an existing document that matches that text and update it with new values:
Add an index on text:
{
    "index": {
        "fields": [
            "text"
        ]
    },
    "type": "json"
}
Whenever you add a new document, find the document with that same exact text:
{
    "selector": {
        "text": "text-value"
    },
    "fields": [
        "_id",
        "text"
    ]
}
If it exists, update it; if not, insert a new document.
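A sketch of that find-then-upsert flow using the requests library against the Cloudant Query endpoint (base URL and credentials are placeholders):

import requests

BASE = 'https://ACCOUNT.cloudant.com/DATABASE'  # placeholder
AUTH = ('USERNAME', 'PASSWORD')                 # placeholder credentials

def upsert_by_text(new_doc):
    # Find an existing document with the same text (uses the index above).
    result = requests.post(BASE + '/_find', auth=AUTH, json={
        'selector': {'text': new_doc['text']},
        'fields': ['_id', '_rev']
    }).json()
    docs = result.get('docs', [])
    if docs:
        # Reuse _id and _rev so the write updates the existing document.
        new_doc['_id'] = docs[0]['_id']
        new_doc['_rev'] = docs[0]['_rev']
    requests.post(BASE, auth=AUTH, json=new_doc)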
Finally, if you want to keep multiple docs with the same text value, but just want to be able to query the latest you could do something like this:
Add a property called latest or similar to your docs.
Add an index on text and latest:
{
    "index": {
        "fields": [
            "text",
            "latest"
        ]
    },
    "type": "json"
}
Whenever you add a new document, find the document with that same exact text where latest == true:
{
    "selector": {
        "text": "text-value",
        "latest": true
    },
    "fields": [
        "_id",
        "text",
        "latest"
    ]
}
Set latest = false on the existing document (if one exists)
Insert the new document with latest = true
This query will find the latest doc for all text values:
{
"selector": {
"text": {"$gt":null}
"latest" : true
},
"fields": [
"_id",
"text",
"latest"
]
}