Making Airflow behave like Luigi: how to prevent tasks from being re-run in future DAG runs if their output only needed to be produced once? - airflow-scheduler

My experience is with Luigi, where if a task produced its output file successfully and the task itself was unmodified, re-runs of the DAG would not re-run that task but would reuse its previously produced output.
Is there any way to obtain the same behavior with Airflow?
Currently, if I re-run the DAG, it re-executes all the tasks, regardless of whether they already produced a successful (and unchanged) output in the past. So, basically, I need a task to be marked as successful if its code is unchanged.

It is a crucial and deliberate feature of Airflow that all tasks are expected to be idempotent. This means that re-running a task on the same input should generally overwrite the output with a newly processed version of that data - so that the tasks depending on it can be automatically re-run. But the data might be different after reprocessing than it was originally.
That's why Airflow has a backfill command, which basically means:
Please re-run this DAG for selected past runs (say, the last week's worth of runs) - but JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAGs but you know that you want to rely on the existing output of certain tasks there, you only backfill the tasks that depend on the output of that task (but not the task itself).
This allows for much more flexibility in defining which tasks in past DAG runs should be re-run (you essentially invalidate the outputs of certain tasks by making them the target of the backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task, you do not backfill that task - only the task(s) that follow from it
b) more importantly, if you want to re-process a task even if neither its input nor the task itself was modified, you can still do it - by backfilling that task.
Case b) is often important, because some tasks have implicit dependencies that change - even if the inputs and the task code did not change, processing again might produce a different (often better) result.
A good example I've heard of is telecom operators re-processing call records, where phone models had to be determined from the phones' IMEIs. In this case you might have a single service that does the mapping, but it gets updated to a newer version when manufacturers refresh their model database - and since that refresh happens with some delay after new phones are introduced, regularly reprocessing the last week's worth of data might give different results even if the input ("list of calls") and the task ("map IMEIs to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time - this means that limiting re-processing to cases where neither the input nor the task code has changed is very restrictive (but you can still deliberately choose that behavior by picking the backfill scope, i.e. which tasks to reprocess).
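To make cases a) and b) concrete, here is a minimal sketch of such a DAG, assuming Airflow 2.x; the DAG id, task ids and callables are hypothetical, and the backfill targets for each case are noted in the comments:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def list_calls():          # "task X": produces the list of calls
    print("listing calls")

def map_imeis():           # depends on task X's output
    print("mapping IMEIs to phone models")

def publish_report():      # depends on the mapping
    print("publishing report")

with DAG(
    dag_id="call_records",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    x = PythonOperator(task_id="list_calls", python_callable=list_calls)
    y = PythonOperator(task_id="map_imeis", python_callable=map_imeis)
    z = PythonOperator(task_id="publish_report", python_callable=publish_report)
    x >> y >> z

# Case a): keep list_calls' existing output - backfill/clear only map_imeis and
#          publish_report for the chosen date range.
# Case b): force full reprocessing even though nothing changed - backfill/clear
#          list_calls itself; Airflow then re-runs it and everything downstream.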

Related

Apache Beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow jobs with Pub/Sub in between. The first job reads filenames from the source and pushes them to a Pub/Sub topic. The second reads them and processes them. However, my use case is to start processing/reading the actual files at least one hour after they are created in the source.
So I have two options:
1) Delay publishing a message, so that it can be processed right away but at the right/expected moment
2) Delay processing of retrieved files
As mentioned above, I am looking for the best solution. I am not sure whether Guava's retry mechanism should be used in Apache Beam. Any other ideas?
You could likely achieve what you want via the triggering/windowing configuration in the publishing job: define a windowing configuration where the trigger does not fire until after a one-hour delay. Something like:
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardHours(1)))
    .withAllowedLateness(Duration.ZERO)   // required once a custom trigger is set
    .discardingFiredPanes()
Keep in mind that you'll end up with a job that sits doing not much of anything except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after the first element arrives even if the files' actual creation time is old enough that the results could be emitted immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.
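For reference, here is a rough sketch of that manual timestamp assignment using the Beam Python SDK (the snippet above uses the Java SDK); the filenames and the get_creation_time helper are hypothetical placeholders:

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


def get_creation_time(filename):
    # Hypothetical helper: look up the file's creation time (epoch seconds),
    # e.g. from object-store metadata. Placeholder value used here.
    return 0.0


with beam.Pipeline() as p:
    (
        p
        | "Filenames" >> beam.Create(["gs://bucket/a.csv", "gs://bucket/b.csv"])
        | "AssignEventTime" >> beam.Map(
            lambda name: TimestampedValue(name, get_creation_time(name))
        )
        # Downstream event-time windows/triggers now see the file creation
        # time instead of the pipeline's processing time.
    )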

Process flow gets stuck on table creations

I'm trying to understand the Enterprise Guide process flow. As I understand it, the process flow is supposed to make it easy to run related steps in the order they need to be run, so that a dependent action later in the flow can run and be up to date.
Given that understanding, I'm getting stuck trying to make the process flow work in cases where the temporary data is purged. When closing Enterprise Guide, I'm warned that the project has references to temporary data, which must be the tables I created. That should be fine: the data is on the SAS server, and I wrote code to import that data into SAS.
I would expect that the data can be regenerated when I later try to run an analysis that depends on it, but instead I get an error indicating that the input data does not exist. If I then run the code to import the data and/or join tables in each necessary place, the process flow works as expected.
See the flow that I'm working with below:
I'm sure I must be missing something. Imagine I want to rerun the rightmost linear regression. Is there a way to make the process flow import the data without doing so manually for each individual table creation the first time round?
The general answer to your question is probably that you can't really do what you want directly, but you can get there indirectly.
A process flow (of which you can have many per project, don't forget) is a single set of programs/tasks/etc. that you intend to run as a group. Typically, you will run whole process flows at once rather than individual pieces. If you have a point at which you want to pause, look at things, and then continue, you have a few choices.
One is to have a process flow that goes to that point, then a second process flow that starts from that point. You can even take your 'import data' steps out of the process flow entirely, make an 'import data' process flow, always run that first, then run the other process flows individually as you need them. In fact, if you use the AUTOEXEC process flow, you could have the import-data steps run whenever you open the project, with the imported data ready and waiting for you.
A second is to use the UI: Ctrl+click or drag a selection box on the process flow to choose a group of programs to run; select the first five, say, run them, then use the 'Run branch from program...' option to run from that point on. You could also make separate 'branches' and run just one branch at a time, making each branch dependent on its input streams.
A third option would be to have different starting points for different analysis tasks, with the import-data portion coming after those starting points. It could be common to the starting points, using macro variables and conditional execution to branch in different directions. For example, you could set a macro variable in the first program that says which analysis you're running, and then have a conditional after the last import step (with the imports in sequence, not in parallel as you have them) send you off to whichever analysis task the macro variable indicates. You could also have macro variables that record whether an import has already been run in the current session, so that conditional steps skip re-running it.
Unfortunately, there's no direct way to run something and say 'run this and all of its dependencies'.

How should I parallelize a mix of CPU- and network-intensive tasks (in Celery)

I have a job that scans a network file system (which can be remote), pulls many files, runs a computation on them and pushes the results (per file) into a DB. I am in the process of moving this to Celery so that it can be scaled up. The number of files can get really huge (1M+).
I am not sure what design approach to take, specifically:
Uniform "end2end" tasks
A task gets a batch (list of N files), pulls them, computes and uploads results.
(Using batches rather than individual files is meant to optimize the connections to the remote file system and the DB, although it is purely a heuristic at this point.)
Clearly, a task would spend a large part of its time waiting for I/O, so we'd need to play with the number of worker processes (many more than the number of CPUs) so that enough tasks are running (computing) concurrently.
pro: simple design, easier coding and control.
con: the process pool size will probably need to be tuned individually per installation, as it depends on the environment (network, machines, etc.)
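For illustration, a rough sketch of what option 1 could look like as a single Celery task - the broker URL and the pull/compute/save helpers are hypothetical placeholders:

from celery import Celery

app = Celery("file_pipeline", broker="redis://localhost:6379/0")


def pull_file(path):            # network-bound
    ...

def compute(data):              # CPU-bound
    ...

def save_result(path, result):  # DB-bound
    ...


@app.task
def process_batch(file_paths):
    """Pull, compute and store results for one batch of N files."""
    for path in file_paths:
        data = pull_file(path)
        save_result(path, compute(data))

# enqueue a batch: process_batch.delay(["//server/share/a.bin", "//server/share/b.bin"])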
Split into dedicated smaller tasks
download, compute, upload (again, batches).
This option is appealing intuitively, but I don't actually see the advantage.
I'd be glad to get some references to tutorials on concurrency design, as well as design suggestions.
How long does it take to scan the network file system compared to computation per file?
What does the hierarchy of the remote file system look like? Are the files evenly distributed? How can you use this to your advantage?
I would follow a process like this:
1. In one process, list the first two levels of the root remote target folder.
2. For each of the discovered folders, spin up a separate Celery task that further lists the contents of those folders. You may also want to save the locations of the discovered files, just in case things go wrong.
3. After you have listed the contents of the remote file system and all the Celery tasks doing the listing have terminated, you can move into processing mode.
4. You may want to keep listing files with 2 processes and use the rest of your cores to start doing the per-file work.
NB: Before doing everything in Python, I would also investigate how shell tools like xargs and find work together for remote file discovery. xargs lets you spin up multiple native processes that do what you want. That might be the most efficient way to do the remote file discovery, and you can then pipe everything to your Python code.
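As a rough sketch of combining that idea with the Celery batches (doing the fan-out in Celery rather than via xargs), assuming the remote file system is mounted at a hypothetical /mnt/remote and reusing the process_batch task sketched earlier in this thread:

import subprocess

# Let `find` walk the mounted remote tree (usually faster than a pure-Python walk).
listing = subprocess.run(
    ["find", "/mnt/remote", "-type", "f"],
    capture_output=True, text=True, check=True,
)
paths = listing.stdout.splitlines()

# Hand the discovered paths to the workers in batches of a hypothetical size.
BATCH = 100
for i in range(0, len(paths), BATCH):
    process_batch.delay(paths[i:i + BATCH])  # the Celery task from the earlier sketch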
Instead of Celery, you can write a simple Python script that runs k * cpu_count threads just to connect to the remote servers and fetch the files.
Personally, I found that a k value between 4 and 7 gives better results in terms of CPU utilization for IO-bound tasks. Depending on the number of files produced, or the rate at which you want to consume them, you can choose a suitable number of threads.
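A minimal sketch of that threaded fetcher, assuming a hypothetical fetch_file helper and a list of paths produced by the discovery step:

import os
from concurrent.futures import ThreadPoolExecutor


def fetch_file(path):
    # Hypothetical: copy one remote file to local storage and return the local path.
    ...

remote_paths = []  # filled in by the discovery step

k = 5  # somewhere between 4 and 7, as suggested above
with ThreadPoolExecutor(max_workers=k * os.cpu_count()) as pool:
    local_paths = list(pool.map(fetch_file, remote_paths))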
Alternatively, you can use Celery + gevent, or Celery with threads, if your tasks are IO-bound.
For the computation and for updating the DB you can use Celery, so that you can scale dynamically as per your requirements. If too many tasks need a DB connection at the same time, you should use DB connection pooling for the workers.

When does an action not run on the driver in Apache Spark?

I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say reduce) does not run in the driver program?
From the Spark tutorial: "Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to take place on the driver node.
Can you please describe a scenario where the reduce function won't execute on the driver? Does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster and results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs - rather, Spark uses it to create a plan that executes your code on the cluster. The plan groups into a single task all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it creates a new task and a shuffle between the former and the latter tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and do not submit anything, hence the need for an action. You can always call .toDebugString on your RDD to see where each job split will be; each level of indentation is a new stage. I think the reduce function showing up on the driver is a bit of a misnomer, as it will first run in parallel and then merge the results. So I would expect that the task does indeed run on the workers as far as it can.
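As a small illustration of the above, here is a hedged PySpark sketch (the input path is hypothetical): the per-partition part of reduce() runs on the executors, and only the partial results are merged on the driver.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-debug")

# One task per input split - roughly one per file when there are many small files.
words = sc.textFile("hdfs:///data/input/*").flatMap(lambda line: line.split())

print(words.toDebugString())  # each indentation level marks a new stage

# reduce() is an action: executors reduce their own partitions in parallel,
# then the driver merges the small per-partition results.
total_words = words.map(lambda _: 1).reduce(lambda a, b: a + b)
print(total_words)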

SAS Enterprise: Run multiple Process Flows at once

I can't seem to find a straightforward way to run multiple process flows at once. I can select multiple and right click, but the 'Run' function disappears.
Any ideas? Programmatic or otherwise?
Assuming you mean you want to run multiple flows sequentially, you would use an Ordered List to do that. You can include any number of programs in a process flow.
Process flows are intended to contain all of the items you want to run in one shot, so you would not normally run many entire process flows at once. You can of course run one, then run the next, if there are only a few. I don't believe you can link programs or objects from one process flow to another.
If you mean run them simultaneously, you can do that if you set your project up to allow parallel execution and your server allows it: under File -> Project Properties -> Code Submission, check "Allow parallel execution on the same server". This lets you run multiple things at once - but be aware that each submission runs in its own distinct SAS session and doesn't have direct access to the other submissions' temporary libraries or macro variables.