Current state of the Airflow DAG:
ml_processors = [a, b, c, d, e]
abc_task >> ml_processors (all ML models a through e run in parallel once abc_task completes successfully)
ml_processors >> xyz_task (once a through e have all succeeded, xyz_task runs)
Problem statement: Occasionally one of the machine learning models (a task in Airflow) gets a new version with better accuracy and we want to reprocess our data. Say c_processor gets a new version and we need to reprocess the data for that processor only. In that case I would like to run only c_processor >> xyz_task.
What I know/tried
I know that I can go back to successful DAG runs and clear the task for a certain period of time, so that only a specific task reruns. But this is not very efficient when, say, both c_processor and d_processor need to be rerun, because I would end up doing two separate steps:
c_processor >> xyz_task
d_processor >> xyz_task
which I would like to avoid.
I read about backfill in Airflow, but it looks like it applies to the whole DAG rather than to specific/selected tasks.
Environment/setup
Using the Google Cloud Composer environment.
The DAG is triggered on file upload to GCP storage.
I am interested to know if there are any other ways to rerun only specific tasks from an Airflow DAG.
"clear"1 would also allow you to clear some specific tasks in a DAG with the --task-regex flag. In this case, you can run airflow tasks clear --task-regex "[c|d]_processor" --downstream -s 2021-03-22 -e 2021-03-23 <dag_id>, which clear the states for c and d processors with their downstreams.
One caveat though, this will also clean up the states for the original task runs.
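As a quick sanity check of which task_ids that regex selects, here is a minimal sketch with Python's re module (the task names are assumed from the question; note that inside brackets [c|d] is a character class, so the | is literal and it still matches "c" or "d"):

```python
import re

# Task ids assumed from the DAG in the question
task_ids = ["a_processor", "b_processor", "c_processor",
            "d_processor", "e_processor", "xyz_task"]

# The pattern passed to --task-regex; Airflow matches it
# against each task_id with a regex search
pattern = re.compile(r"[c|d]_processor")

selected = [t for t in task_ids if pattern.search(t)]
print(selected)  # ['c_processor', 'd_processor']
```

A plain group like (c|d)_processor would select the same tasks and reads less ambiguously.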
Related
I am having an issue with Celery; I will explain with the code:
def samplefunction(request):
    print("This is a samplefunction")
    a, b = 5, 6
    myceleryfunction.delay(a, b)
    return Response({"msg": "process execution started"})

@celery_app.task(name="sample celery", base=something)
def myceleryfunction(a, b):
    c = a + b
    my_obj = MyModel()
    my_obj.value = c
    my_obj.save()
In my case, when one person triggers the Celery task it works perfectly.
If many people send requests, they are processed one by one.
So imagine that my Celery function "myceleryfunction" takes 3 minutes to complete the background task.
If 10 requests arrive at the same time, the last one completes with a 30-minute delay.
How do I solve this issue, or is there any alternative?
Thank you
I'm assuming you are running a single worker with default settings.
That means the worker runs with worker_pool=prefork and worker_concurrency=<number of CPUs>.
If the machine it runs on has only a single CPU, you won't get any tasks running in parallel.
To get parallelism you can:
set worker_concurrency to something > 1, which uses multiple processes in the same worker
start additional workers
use celery multi to start multiple workers
when running the worker in a Docker container, add replicas of the container
See the Celery Concurrency documentation for more info.
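As a back-of-envelope sketch of why concurrency helps, using the 3-minute / 10-request numbers from the question (the concurrency value of 5 is just an illustrative choice):

```python
import math

task_minutes = 3   # time per task, from the question
n_requests = 10    # simultaneous requests, from the question

def worst_case_wait(concurrency: int) -> int:
    # With N concurrent worker processes, tasks run in ceil(n/N) waves;
    # the last request finishes after that many task durations.
    return math.ceil(n_requests / concurrency) * task_minutes

print(worst_case_wait(1))  # 30 -> single process, fully serial
print(worst_case_wait(5))  # 6  -> e.g. worker_concurrency=5
```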
I am trying to see how Airflow sets execution_date for a DAG. I have set catchup=False in the DAG. Here is my DAG definition:
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval=timedelta(minutes=5)
)
Now, since catchup=False, it should skip the runs prior to the current time. It does, but strangely it does not set execution_date correctly.
Here are the runs' execution times:
(screenshot: run execution times)
We can see the runs are scheduled at a frequency of 5 minutes. But why does it append seconds and milliseconds to the time?
This is impacting my sensors later.
Note that the behaviour runs fine when catchup=True.
I did some experimenting. It seems that execution_date comes out correctly when I specify a cron expression instead of a timedelta.
So my DAG is now:
dag = DAG(
    'child',
    max_active_runs=1,
    description='A sample pipeline run',
    start_date=days_ago(0),
    catchup=False,
    schedule_interval='*/5 * * * *'
)
Hope this helps someone. I have also raised a bug for this, which can be tracked at: https://github.com/apache/airflow/issues/11758
Regarding execution_date, you should have a look at the scheduler documentation. It marks the beginning of the schedule period, but the run is triggered at the end of that period (which is what you see as the run's start date).
The scheduler won’t trigger your tasks until the period it covers has ended, e.g. a job with schedule_interval set as @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the DAG is executed. In the UI, it appears as if Airflow is running your tasks a day late.
Note
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
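That relationship can be sketched in plain Python (a minimal illustration of the semantics, not Airflow's actual scheduler code):

```python
from datetime import datetime, timedelta

def trigger_time(execution_date: datetime, schedule_interval: timedelta) -> datetime:
    # execution_date marks the START of the data period;
    # the run is triggered once that period has ENDED.
    return execution_date + schedule_interval

# Daily schedule: the run stamped 2019-11-21 fires at 2019-11-22T00:00
print(trigger_time(datetime(2019, 11, 21), timedelta(days=1)))
```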
Also the article Scheduling Tasks in Airflow might be worth a read.
You should also avoid setting start_date to a relative value; it is re-evaluated every time the DAG file is parsed, which can lead to unexpected behaviour.
There is a long description within the Airflow FAQ:
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along.
I want to create a cron job in Chef that checks the size of a log file and deletes it if it is larger than 30 MB. Here is my code:
cron_d 'ganglia_tomcat_thread_max' do
  hour '0'
  minute '1'
  command "rm -f /srv/node/current/app/log/simplesamlphp.log"
  only_if { ::File.size('/srv/node/current/app/log/simplesamlphp.log').to_f / 1024000 > 30 }
end
Can you help me with it, please?
Welcome to Stackoverflow!
I suggest going with an existing tool like logrotate. There is a Chef cookbook available to manage logrotate.
Please note that cron in Chef manages the system cron service, which runs independently of Chef. You'll have to do the file-size check within the command itself. It's also better to use the cron_d resource as documented here.
The way you have written the cron_d resource, the cron entry is only added when the log file is larger than 30 MB at the time Chef runs. In all other cases the cron_d resource will not be created.
You can check this Ruby code:
File.size('file').to_f / 2**20
to get the file size in megabytes; there is a slight difference in the result compared to dividing by 1024000, and I believe 2**20 is more correct.
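For comparison, here is the same megabyte check as a minimal Python sketch (the function name and 30 MB limit are illustrative, mirroring the Ruby File.size(...).to_f / 2**20 check):

```python
import os

def log_too_big(path: str, limit_mb: float = 30) -> bool:
    # 2**20 bytes per MiB, matching the Ruby 2**20 divisor
    return os.path.getsize(path) / 2**20 > limit_mb
```

Dividing by 1024000 instead treats one "MB" as about 0.98 MiB, so the two thresholds differ slightly.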
So you can go with two solutions for your specific case:
create a new cron_d resource when the log file is less than 30 MB, to remove the existing cron entry, and provision your node periodically
move the file-size check into the command itself with bash, glued together with &&, so the file is only deleted when it is larger than 30 MB; something like:
[ "$(du -m file.txt | cut -f1)" -gt 30 ] && rm -f file.txt
(note that du -k file.txt | cut -f1 returns the size in kilobytes, not bytes; du -m gives megabytes)
To me, the cleaner way to do this is also to use the logrotate service and the Chef cookbook for it.
When I insert rows into BigQuery using writeTableRows, performance is really bad compared to InsertAllRequest. Clearly, something is not set up correctly.
Use case 1: I wrote a Java program to process 'sample' Twitter stream using Twitter4j. When a tweet comes in I write it to BigQuery using this:
insertAllRequestBuilder.addRow(rowContent);
When I run this program from my Mac, it inserts about 1000 rows per minute directly into BigQuery table. I thought I could do better by running a Dataflow job on the cluster.
Use case 2: When a tweet comes in, I write it to a topic of Google's PubSub. I run this from my Mac which sends about 1000 messages every minute.
I wrote a Dataflow job that reads this topic and writes to BigQuery using BigQueryIO.writeTableRows(). I have an 8-machine Dataproc cluster. I started this job on the master node of the cluster with DataflowRunner. It's unbelievably slow! Like 100 rows every 5 minutes or so. Here's a snippet of the relevant code:
statuses.apply("ToBQRow", ParDo.of(new DoFn<Status, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = new TableRow();
        Status status = c.element();
        row.set("Id", status.getId());
        row.set("Text", status.getText());
        row.set("RetweetCount", status.getRetweetCount());
        row.set("FavoriteCount", status.getFavoriteCount());
        row.set("Language", status.getLang());
        row.set("ReceivedAt", null);
        row.set("UserId", status.getUser().getId());
        row.set("CountryCode", status.getPlace().getCountryCode());
        row.set("Country", status.getPlace().getCountry());
        c.output(row);
    }
}))
.apply("WriteTableRows", BigQueryIO.writeTableRows().to(tweetsTable)//
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
What am I doing wrong? Should I use a 'SparkRunner'? How do I confirm that it's running on all nodes of my cluster?
With BigQuery you can either:
Stream data in: low latency, up to 100k rows per second, has a cost.
Batch data in: much higher latency, incredible throughput, completely free.
That's the difference you are experiencing. If you only want to ingest 1000 rows, batching will be noticeably slower. The same with 10 billion rows will be way faster through batching, and at no cost.
Dataflow/Beam's BigQueryIO.writeTableRows can either stream or batch data in.
With BigQueryIO.Write.Method.FILE_LOADS, the pasted code is choosing batch.
I want to create a temporary sorted set based on the original one in a timer, maybe at a 4-hour interval. I'm using the spring-data-redis API to do this:
ZUNIONSTORE tmp 2 A B AGGREGATE MAX
While the ZUNIONSTORE command is executing, will it block other commands like ZADD, ZREM, ZRANGE, or ZINCRBY on sorted set A or B? I don't know if this will cause concurrency problems; please give me some suggestions.