Currently, I am creating an emr_default connection in the Connections tab for my first DAG. For the other DAG I am creating another connection, emr_default2, and using it there. What is the best way to create an emr_default-style connection per DAG? Can we have something like a parameter file and use it in our DAGs? Please advise.
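For example, something like the rough sketch below is what I have in mind (hypothetical variable and operator names, using an Airflow Variable per DAG instead of a separate parameter file, with a recent amazon provider import path):

    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

    # Hypothetical: one Airflow Variable per DAG holding the EMR connection id,
    # so the DAG code never hard-codes 'emr_default' / 'emr_default2'.
    EMR_CONN_ID = Variable.get('dag1_emr_conn_id', default_var='emr_default')

    with DAG('dag1', start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id='create_emr_cluster',
            emr_conn_id=EMR_CONN_ID,                      # per-DAG connection id
            job_flow_overrides={'Name': 'dag1-cluster'},  # minimal placeholder config
        )

Is this a reasonable approach, or is there a better one?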
Thanks.
Xi
I need to automatically schedule a bq load process that gets AVRO files from a GCS bucket and loads them into BigQuery, then wait for its completion in order to execute another task, specifically one that reads from the table mentioned above.
As shown here, there is a nice API to run this [command][1], for example:
bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro
This will give me a job_id:
Waiting on bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1 ... (0s) Current status
a job_id that I can check with bq show --job=true bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1
And that is nice... I guess under the hood the bq load job is a DataTransfer. I found some operators related to this: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery_dts.html
Even though the documentation does not specifically cover the Avro load configuration, digging through it gave me what I was looking for.
My question is: is there an easier way of getting the status of the job given a job_id, similar to the bq show --job=true <job_id> command?
Is there something that might help me avoid creating a DataTransfer job, starting it, monitoring it, and then deleting it (I don't need it to stay around, since the parameters will change next time)?
Maybe a custom sensor, using the python-sdk-api?
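For context, with the python-sdk-api I was imagining something along these lines (just a sketch, assuming the job_id printed by bq load and the job location are known):

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')  # hypothetical project id

    # equivalent of `bq show --job=true <job_id>`
    job = client.get_job('bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1', location='US')
    print(job.job_type, job.state)  # e.g. 'load', 'RUNNING' / 'DONE'

    # or block until the job finishes (raises if it failed)
    job.result()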
Thank you in advance.
[1]: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
I think you can use the GCSToBigQueryOperator and task sequencing in Airflow to solve your issue:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with airflow.DAG(
        'dag_name',
        default_args=your_args,
        schedule_interval=None) as dag:

    # Load the Avro files from GCS into the target BigQuery table
    load_avro_to_bq = GCSToBigQueryOperator(
        task_id='load_avro_file_to_bq',
        bucket='{your_bucket}',
        source_objects=['folder/{SOURCE}/*.avro'],
        destination_project_dataset_table='{your_project}:{your_dataset}.{your_table}',
        source_format='AVRO',
        compression=None,
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE'
    )

    # Task that should run only after the load has completed successfully
    second_task = DummyOperator(task_id='next_operator', dag=dag)

    load_avro_to_bq >> second_task
The first operator loads the Avro file from GCS into BigQuery.
If this operator succeeds, the second task is executed; otherwise it is not.
I was trying to use Eventarc to monitor Firestore changes, which will trigger a Cloud Run service.
It works by listening to Any Resource or specific resource name.
Resource name
projects/PROJECT_Id/databases/(default)
It works, but it listens to all changes in Firestore. However, I want to filter the events to a specific collection. I have tried a few combinations of the pattern, and none of them work, e.g.:
projects/PROJECT_Id/databases/(default)/users/*
projects/PROJECT_Id/databases/(default)/users/{user}
Any ideas? Thanks :)
In the image you shared, under the “Resource” section you selected “Specific resource”. Can you instead change it to “Path pattern”? Then you will be able to write a specific path that the triggering resource needs to match.
You can refer to this documentation for Applying a path pattern when filtering in Eventarc.
You can also check out the AuditLog step of Trigger Cloud Run with events from Eventarc codelab for an example on how to use path patterns.
I have multiple DAGs created in Airflow, but I want to trigger all of them via some common module or DAG. Can we create a workflow like Azkaban that lists and invokes all the DAGs in one flow?
You can create a DAG that triggers other DAGs using the TriggerDagRunOperator, passing their dag_id and other appropriate arguments.
The source of the operator can be found in dagrun_operator.py.
I also suggest you go through the Scheduling and Triggers section of the docs.
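For example, a minimal controller DAG might look like the sketch below (hypothetical dag_ids; in recent Airflow versions the operator lives in airflow.operators.trigger_dagrun, in older versions in airflow.operators.dagrun_operator):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    with DAG(
            dag_id='trigger_all_dags',            # the common "master" DAG
            start_date=datetime(2023, 1, 1),
            schedule_interval=None,
            catchup=False) as dag:

        # one trigger task per downstream DAG; dag_a / dag_b are placeholders
        trigger_dag_a = TriggerDagRunOperator(
            task_id='trigger_dag_a',
            trigger_dag_id='dag_a',
        )
        trigger_dag_b = TriggerDagRunOperator(
            task_id='trigger_dag_b',
            trigger_dag_id='dag_b',
        )

        trigger_dag_a >> trigger_dag_b            # or leave them independent to run in parallel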
You can use the SubDagOperator.
The main DAG sees and manages all the SubDAGs as normal tasks.
The Airflow admin GUI lists only the main DAG in the main DAGs list; you can then "zoom in" to a SubDAG in the Graph View section of the GUI.
You can manage all the DAGs through the main DAG, or each sub-DAG separately via its zoom-in option.
here is an example for the operator:
https://github.com/apache/airflow/blob/master/airflow/example_dags/example_subdag_operator.py
I recommend using a common DAG factory to create all the sub-DAGs if they all follow the same pattern and workflow.
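As a rough sketch of such a factory-based setup (hypothetical names, recent Airflow 2.x import paths):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator
    from airflow.operators.subdag import SubDagOperator

    default_args = {'start_date': datetime(2023, 1, 1)}


    def subdag_factory(parent_dag_id, child_id, args):
        """Common factory: builds a sub-DAG whose dag_id must be '<parent>.<child>'."""
        subdag = DAG(
            dag_id=f'{parent_dag_id}.{child_id}',
            default_args=args,
            schedule_interval='@daily',
        )
        DummyOperator(task_id=f'{child_id}_task', dag=subdag)
        return subdag


    with DAG('main_dag', default_args=default_args, schedule_interval='@daily') as main_dag:
        section_1 = SubDagOperator(
            task_id='section_1',
            subdag=subdag_factory('main_dag', 'section_1', default_args),
        )
        section_2 = SubDagOperator(
            task_id='section_2',
            subdag=subdag_factory('main_dag', 'section_2', default_args),
        )
        section_1 >> section_2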
I'm exploring Apache Airflow 1.8. I'm curious whether there is any way to pass arguments to DAGs or tasks while backfilling.
Here is the kind of thing I'm looking for:
airflow backfill My_DAG -s some_date -e end_date argument_for_t1 argument_for_t2
or it could be an array of args.
Is there any way to pass arguments? I've searched a lot but wasn't able to find anything.
Backfill is meant to re-run a failed/missed/misconfigured DAG, and hence there is no provision for passing command-line arguments.
However, you can go to the DagBag folder and give params to the DAG object (change/fix the existing DAG) and then backfill the DAG.
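For example, a rough sketch of baking the values into the DAG file as params (hypothetical names, Airflow 1.8-era imports), which a subsequent backfill will pick up when the file is re-parsed:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        'My_DAG',
        start_date=datetime(2017, 1, 1),
        schedule_interval='@daily',
        # edit these values in the DAG file, then run the backfill
        params={'argument_for_t1': 'foo', 'argument_for_t2': 'bar'},
    )

    t1 = BashOperator(
        task_id='t1',
        bash_command='echo {{ params.argument_for_t1 }}',  # params are available via Jinja templating
        dag=dag,
    )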
I came across a solution: I can set variables using airflow variables -s my_var="some value".
Now I can access my_var anywhere across the application.
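For example (a minimal sketch):

    from airflow.models import Variable

    # read the value set with `airflow variables -s my_var="some value"`
    my_var = Variable.get('my_var')

    # it is also available in any templated operator field, e.g.
    # bash_command='echo {{ var.value.my_var }}'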
For more options, check the CLI documentation.
I am new to Grafana. I am setting it up to view data from CloudWatch for a custom metric. The custom metrics namespace is JVMStats, the metric is JVMHeapUsed, and the dimension is the instance id. If I configure these values, I am not able to get the graph. Can you please advise me on how to get the data?
Regards
Karthik
I want to do the same.
As far as I can tell, it's not possible out of the box with the latest Grafana (2.6 at time of writing). See this issue.
This pull request implements it. It's currently tagged as 3.0-beta1. So I expect we'll both be able to do what we want come version 3.0.
EDIT: inserting proof of 3.0-beta-1 working
I installed 3.0-beta-1 and was able to use Custom Metrics, as evidenced by the attached image.
I managed to add my custom metrics now; the only issue I had was that I listed my custom metrics in the "Data Source" configuration with commas and spaces:
Custom1, Custom2
but it must be only commas:
Custom1,Custom2
And it works for me. The preview in the text box shows this, but I missed it.
Another option is to configure an AWS CloudWatch job to collect data into Axibase Time Series Database, where the custom metrics namespace is enabled out of the box.
Disclosure: I work for Axibase.