I have an Airflow DAG that calls a GKEPodOperator, where a Python script is run to process and load data into BigQuery. Can the GKEPodOperator return something like a Python list or dictionary from the executed script back to the DAG, so I can use it to write custom emails with DAGOperator?
First, GKEPodOperator is deprecated. You should use GKEStartPodOperator.
Like other operators, you can pass values between tasks with XCom. Note that GKEStartPodOperator inherits from KubernetesPodOperator, so the XCom mechanism is different from that of other operators: it works by launching a sidecar container. You can read more about it, with examples, here.
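For illustration, a minimal sketch with placeholder project, cluster, and image names: with do_xcom_push=True, the sidecar picks up whatever the script writes as JSON to /airflow/xcom/return.json and pushes it as an XCom.

from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

# All project/cluster/image values below are placeholders.
load_to_bq = GKEStartPodOperator(
    task_id="load_to_bq",
    project_id="my-project",
    location="us-central1-a",
    cluster_name="my-cluster",
    name="load-to-bq",
    namespace="default",
    image="gcr.io/my-project/bq-loader:latest",
    cmds=["python", "load.py"],
    do_xcom_push=True,  # the script must write its result as JSON to /airflow/xcom/return.json
)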
Now that you have the desired value stored as an XCom (as a string), you want to pull it so you can use it in your custom operator:
Airflow >= 2.1.0 (not yet released)
There is native support for converting XComs directly into native Python objects. See PR. You will need to set render_template_as_native_obj=True. You can read more about it in the docs: Rendering Fields as Native Python Objects.
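For illustration, a hedged sketch assuming the value was pushed by a task called load_to_bq: with render_template_as_native_obj=True, the templated XCom pull below is handed to the callable as a native dict/list instead of a string.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def send_email(result):
    # 'result' arrives as a native object, e.g. result["Id"]
    print(result)

with DAG(
    dag_id="gke_xcom_native",          # placeholder dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    render_template_as_native_obj=True,
) as dag:
    notify = PythonOperator(
        task_id="notify",
        python_callable=send_email,
        op_kwargs={"result": "{{ ti.xcom_pull(task_ids='load_to_bq') }}"},
    )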
Airflow < 2.1.0:
In the follow-up task you need to pull the XCom value and convert it to whatever Python object you'd like. You can see an example of this here.
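For example, a hedged sketch (task ids are placeholders, and json.loads is just one way to convert the string):

import json
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on 1.10

def build_email(ti, **kwargs):
    raw = ti.xcom_pull(task_ids="load_to_bq")  # the XCom value, still a string
    result = json.loads(raw)                   # now a dict/list you can format into the email body
    print(result)

notify = PythonOperator(task_id="notify", python_callable=build_email)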
I am using an older version of Airflow (1.10). We are using Python operators to trigger Glue jobs because Glue operators aren't available in this version. We have multiple jobs that need to run in a particular order. When we run the DAG, our first job triggers and the task is then marked as succeeded, since the job was successfully started.
We are trying to use boto3 to check the status of the job, but we need it to do so continually. Any thoughts on how to check the status continually, and only move on to the next Python operator on success?
Well, you could try to replicate the .job_completion method from the GlueJobSensor. So basically (the sketch below uses boto3's start_job_run and get_job_run calls; adapt the job arguments, logging, and error handling to your setup):
import time
import boto3

POKE_INTERVAL = 60  # seconds to wait between status checks

def my_glue_job_that_waits(job_name):
    glue = boto3.client("glue")

    # botocore call that starts the job; keep the run id so we can poll it
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]

    while True:
        try:
            # botocore call to retrieve the current job state
            job_state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        except Exception:
            # what you want to happen if the call above fails
            raise

        if job_state == "SUCCEEDED":
            return run_id  # what you want the operator to return
        if job_state in ("FAILED", "STOPPED", "TIMEOUT"):
            raise RuntimeError("Glue job ended in state {}".format(job_state))
        time.sleep(POKE_INTERVAL)
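You can then wire it into your DAG with a plain PythonOperator; the task id and job name below are placeholders:

from airflow.operators.python_operator import PythonOperator  # Airflow 1.10 import path

# assumes a `dag` object is defined elsewhere in your DAG file
run_glue_job = PythonOperator(
    task_id="run_glue_job",
    python_callable=my_glue_job_that_waits,
    op_kwargs={"job_name": "my-glue-job"},
    dag=dag,
)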
But I highly encourage you to upgrade to Airflow 2 if you can. Long term it will save you a lot of time, both by letting you use new features and by avoiding conflicts with provider packages.
I have a requirement wherein I need to pass parameters to an Airflow DAG (e.g. tenantName), and the Airflow DAG (a Python operator) will use tenantName to filter data based on the parameter passed.
I'm able to access the parameters stored in Airflow using the following code:
def greeting():
    import datetime
    import logging
    from airflow import models
    logging.info("Hello there, World! {} {}".format(models.Variable.get('tenantName'), datetime.datetime.now()))
However, for my requirement, the tenantName will differ depending on the run.
How do I achieve this?
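One way to illustrate this, as a hedged sketch rather than a definitive answer: pass the value per run when triggering the DAG and read it from dag_run.conf inside the callable (the key name tenantName is kept from the question; the trigger command and default value are assumptions).

import logging

def greeting(**context):
    # e.g. triggered with: airflow trigger_dag my_dag --conf '{"tenantName": "acme"}'
    tenant = context["dag_run"].conf.get("tenantName", "unknown-tenant")
    logging.info("Hello there, World! %s", tenant)

# On Airflow 1.10 the PythonOperator needs provide_context=True for **context to be passed.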
My pipeline is IoT Core -> Pub/Sub -> Dataflow -> BigQuery. Initially the data I was getting was in JSON format and the pipeline was working properly. Now I need to shift to CSV, and the issue is that the Google-defined Dataflow template I was using expects JSON input rather than CSV. Is there an easy way of transferring CSV data from Pub/Sub to BigQuery through Dataflow? The template could probably be changed, but it is implemented in Java, which I have never used, so it would take a long time. I also considered implementing an entirely custom template in Python, but that would take too long.
Here is a link to the template provided by google:
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubSubToBigQuery.java
Sample: Currently my pub/sub messages are JSON and these work correctly
"{"Id":"123","Temperature":"50","Charge":"90"}"
But I need to change this to comma separated values
"123,50,90"
Very easy: do nothing! If you have a look at this line, you can see that the message type used is the Pub/Sub message JSON, not your JSON content.
So, to prevent any issues (with querying and inserting), write to another table and it should work nicely!
Can you please share your existing Python code where you are parsing the JSON data, along with samples of the new and old data, so that it can be customized accordingly.
Moreover, you can refer to the Python code here; it performs a word count transformation over a PCollection, and hopefully it can give you a reference for customizing your code accordingly.
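For illustration, a minimal hedged sketch of a Python pipeline (not the Google template): it reads CSV lines from Pub/Sub, maps them to dicts matching the fields in the question, and writes them to BigQuery. The topic, table, and schema are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def csv_to_row(message):
    # b"123,50,90" -> {"Id": "123", "Temperature": "50", "Charge": "90"}
    id_, temperature, charge = message.decode("utf-8").strip().split(",")
    return {"Id": id_, "Temperature": temperature, "Charge": charge}

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
     | "CsvToDict" >> beam.Map(csv_to_row)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="Id:STRING,Temperature:STRING,Charge:STRING"))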
Is there a way to bulk tag BigQuery tables with the Python google.cloud.datacatalog library?
If you want to take a look at sample code which uses the Python google.cloud.datacatalog client library, I've put together an open source utility script that creates bulk tags using a CSV as source. If you want to use a different source, you may use this script as a reference; hope it helps.
create bulk tags from csv
For this purpose you may consider using the DataCatalogClient() class, which is included in the google.cloud.datacatalog_v1 module of the PyPI google-cloud-datacatalog package, leveraging the Google Cloud Data Catalog API service.
First, enable the Data Catalog and BigQuery APIs in your project.

Install the Python Cloud Client Library for the Data Catalog API:

pip install --upgrade google-cloud-datacatalog

Set up authentication by exporting the GOOGLE_APPLICATION_CREDENTIALS environment variable, pointing at the JSON file that contains your service account key:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Refer to this example from the official documentation, which shows how to create a Data Catalog tag template and attach the appropriate tag fields to the target BigQuery table using the create_tag_template() function.
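For illustration, a minimal hedged sketch of bulk tagging: it assumes the tag template (with an 'owner' string field) already exists, and the project, location, dataset, and table names are placeholders.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
template_name = datacatalog_v1.DataCatalogClient.tag_template_path(
    "my-project", "us-central1", "my_template")

for table in ["table_a", "table_b", "table_c"]:
    # look up the Data Catalog entry for each BigQuery table
    resource = ("//bigquery.googleapis.com/projects/my-project/"
                "datasets/my_dataset/tables/" + table)
    entry = client.lookup_entry(request={"linked_resource": resource})

    tag = datacatalog_v1.types.Tag()
    tag.template = template_name
    tag.fields["owner"] = datacatalog_v1.types.TagField()
    tag.fields["owner"].string_value = "data-team"

    client.create_tag(parent=entry.name, tag=tag)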
If you have any doubts, feel free to extend your initial question or add a comment below this answer, so we can address your particular use case.
I have some current instances that get some data by passing a json blob through the user data string. I would like to also pass a script to be run at boot time through the user data. Is there a way to do both of these things? I've looked at cloud-config, but setting an arbitrary value doesn't seem to be one of the options.
You're correct that on EC2, there is only one 'user-data' blob that can be specified. Cloud-init addresses this limitation by allowing the blob to be an "archive" format of sorts.
MIME multipart
Cloud-config archive
cloud-config archive is unfortunately not documented right now, but there is an example in doc/examples/cloud-config-archive.txt. It is expected to be yaml and start with '#cloud-config-archive'. Note that yaml is a strict superset of json, so anything that can dump json can be used to produce this yaml.
Both of these formats require changes to every consumer so that they can "share" the single user-data resource. cloud-init will ignore MIME types that it does not understand and handle those that it does. You'd have to modify the other application that produces and consumes user-data to do the same.
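For illustration, a hedged sketch of building such a multi-part user-data blob with the standard library; the script contents and file names are placeholders, and the JSON part is left as text/plain for your application to read as before.

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

combined = MIMEMultipart()

# your existing JSON blob, kept as a plain part (cloud-init will ignore it)
json_part = MIMEText('{"tenant": "example"}', "plain")
json_part.add_header("Content-Disposition", "attachment", filename="config.json")
combined.attach(json_part)

# the boot script, typed so cloud-init runs it at first boot
script_part = MIMEText("#!/bin/sh\necho 'hello from boot'\n", "x-shellscript")
script_part.add_header("Content-Disposition", "attachment", filename="boot.sh")
combined.attach(script_part)

user_data = combined.as_string()  # pass this as the EC2 user-data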
Well, cloud-init supports multi-part MIME. With that in mind, you could have your boot script as one part and a custom MIME part for your blob. Note that you would need to write a Python handler that tells cloud-init what to do with that part (most likely moving it to wherever your app expects it). This handler code ends up in the handlers directory, as described here.
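For illustration, a hedged sketch of what such a part handler might look like; the custom MIME type and destination path are assumptions.

# Shipped as a text/part-handler part of the user-data; cloud-init imports it
# and calls it for the MIME types it declares.
def list_types():
    return ["text/x-my-app-config"]  # the custom MIME type of your extra part

def handle_part(data, ctype, filename, payload):
    if ctype in ("__begin__", "__end__"):
        return
    # move the payload to wherever your app expects it
    with open("/etc/myapp/config.json", "w") as fp:
        fp.write(payload)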