How to control the concurrency of a SubDAG or TaskGroup in Airflow?

I have a simple ETL workflow composed of three tasks for each table I want to extract:
1_extract_to_tmp >> 2_push_to_s3 >> 3_delete_tmp
As I want to reproduce the same steps for multiple tables, I was thinking of grouping these tasks in a TaskGroup or a SubDAG and instantiating it dynamically for each table to extract, so my final DAG would repeat this three-task pattern once per table.
From what I read, TaskGroup is now the preferred way to do that.
The problem is that I also need to control how many extractions run in parallel, because I don't have enough local disk space for all the data; apparently the SubDagOperator does not respect pool settings, and a TaskGroup has no concurrency setting of its own.
Do you know a way to achieve this? Am I doing it wrong? Maybe my DAG should not be designed like that.

TaskGroup is just a UI feature. It doesn't really contain any logic, so any parallelism limitation you want to enforce can be applied regardless of whether you use a TaskGroup.
You didn't explain what exactly you want to limit.
If you want to limit the overall number of tasks that can run in parallel in your DAG (overriding the airflow.cfg default), then set concurrency in your DAG constructor:
dag = DAG(dag_id='my_dag', concurrency=5, ...)
If you are looking to limit concurrency to protect a specific resource (database, API, etc.), then use pools:
my_op = PythonOperator(python_callable=func,
                       task_id='my_task',
                       pool='my_pool',
                       dag=dag)
The task will be executed only if there are free slots in the pool. The number of available slots is defined when you create the pool.
As for SubDAGs, I wouldn't recommend using them at all. While they're not officially deprecated, they probably will be. See https://github.com/apache/airflow/issues/12292
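Putting these together for the original question, here is a minimal sketch (the table list, pool name and callables are illustrative, not from the question) that builds one TaskGroup per table while a shared pool caps how many of these tasks run at once. The pool itself would be created beforehand via the UI or CLI with however many slots your disk can afford:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


def extract_to_tmp(table):
    print(f"extracting {table} to local tmp")  # placeholder logic


def push_to_s3(table):
    print(f"pushing {table} tmp file to S3")  # placeholder logic


def delete_tmp(table):
    print(f"deleting tmp file for {table}")  # placeholder logic


with DAG(dag_id="etl_per_table", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", concurrency=10) as dag:
    for table in ["orders", "customers", "invoices"]:  # illustrative table list
        with TaskGroup(group_id=f"etl_{table}"):
            steps = []
            for step_id, func in [("1_extract_to_tmp", extract_to_tmp),
                                  ("2_push_to_s3", push_to_s3),
                                  ("3_delete_tmp", delete_tmp)]:
                steps.append(PythonOperator(
                    task_id=step_id,
                    python_callable=func,
                    op_args=[table],
                    pool="tmp_disk_pool",  # shared pool limits how many of these tasks run at once
                ))
            steps[0] >> steps[1] >> steps[2]
```

Note that a pool slot is held only while a task is actually running, so this caps concurrent tasks rather than whole per-table pipelines; for strict disk-space control, leave some margin in the pool size.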

Related

Making Airflow behave like Luigi: how to prevent tasks from being re-run in future DAG runs if their output only needed to be obtained once?

I come from experience with Luigi, where if a file was produced successfully by a task and the task itself was unmodified, re-runs of the DAG would not re-run that task but would reuse its previously obtained output.
Is there any way to obtain the same behavior with Airflow?
Currently, if I re-run the dag, it re-executes all the tasks, no matter if they produced a successful (and unchanged) output in the past. So, basically I need a task to be marked as successful if its code was unchanged.
It is a crucial and important feature of Airflow that all tasks are idempotent. This means that re-running a task on the same input should generally overwrite the output with a newly processed version of that data, so that tasks depending on it can be automatically re-run. The data might be different after reprocessing than it was originally.
That's why Airflow has a backfill command, which basically means:
Please re-run this DAG for selected past runs (say, the last week's worth of runs), but JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).
This also means that when you want to re-run parts of past DAG runs but you know you want to rely on the existing output of certain tasks, you only backfill the tasks that depend on the output of those tasks (but not the tasks themselves).
This allows for much more flexibility in defining which tasks in past DAG runs should be re-run (you basically invalidate the outputs of certain tasks by making them the target of the backfill).
This covers more than the case you mention:
a) if you do not want to change the output of a certain task, you do not backfill that task, only the task(s) that follow from it;
b) more importantly, if you want to re-process a task even if neither its input nor the task itself were modified, you can still do it by backfilling that task.
Case b) is often important, because some tasks have implicit dependencies that change: even if the inputs and the task did not change, processing again might produce a different (often better) result.
A good example I've heard is telecom operators re-processing call records, where phone models have to be determined from the phones' IMEIs. You might have a single service that does the mapping, but it gets updated whenever manufacturers refresh their model database; since that refresh happens with some delay after new phones are introduced, regularly reprocessing last week's data might give different results even though the input ("list of calls") and the task ("map IMEIs to phone models") did not change from the DAG's Python point of view.
Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time - this means that limiting re-processing to the cases where "no input + no task code" has changed is very limiting (but you can still deliberately decide on it by choosing the backfill scope - i.e. which tasks to reprocess).

Apache Beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow jobs with PubSub in between. The first job reads filenames from the source and pushes them to a PubSub topic; the second reads them and processes them. However, my use case is to start processing/reading the actual files at least 1 hour after they are created in the source.
So I have two options:
1) Delay publishing a message in order to process it right away but in the good/expected moment
2) Delay processing of retrieved files
As mentioned above, I am looking for the best solution. I am not sure whether a Guava retry mechanism should be used in Apache Beam. Any other ideas?
You could likely achieve what you want via the triggering/windowing configuration in the publishing job: define a windowing configuration whose trigger does not fire until after a 1 hour delay. Something like:
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardHours(1)))
Keep in mind that you'll end up with a job that's simply sitting doing not much of anything except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after job start even if the actual creation time of the files is old enough that it could emit the results immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.
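To illustrate the event-time route, here is a minimal sketch using the Beam Python SDK (not the Java SDK used above). It assumes each element arrives as a hypothetical (filename, creation_unix_timestamp) pair from which you can assign an event-time timestamp, and then reuses the same processing-time delay trigger:

```python
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime
from apache_beam.transforms.window import FixedWindows, TimestampedValue


def add_event_timestamp(element):
    # Hypothetical input shape: (filename, creation time as unix seconds).
    filename, creation_ts = element
    return TimestampedValue(filename, creation_ts)


with beam.Pipeline() as p:
    delayed_filenames = (
        p
        | "ReadFilenames" >> beam.Create([("gs://bucket/file-1", 1614556800)])  # placeholder source
        | "AssignEventTime" >> beam.Map(add_event_timestamp)
        | "DelayOneHour" >> beam.WindowInto(
            FixedWindows(60),                              # 1-minute windows, as in the Java snippet
            trigger=AfterProcessingTime(60 * 60),          # fire one hour after the first element arrives
            accumulation_mode=AccumulationMode.DISCARDING,
        )
    )
```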

AWS boto3 -- Difference between `batch_writer` and `batch_write_item`

I'm currently using boto3 with DynamoDB, and I noticed that there are two types of batch write:
batch_writer is used in the tutorial, and it seems like you can just iterate through different JSON objects to do inserts (this is just one example, of course).
batch_write_item seems to me to be a DynamoDB-specific function. However, I'm not 100% sure about this, and I'm not sure what the difference is between these two functions (performance, methodology, and so on).
Do they do the same thing? If they do, why have two different functions? If they don't, what's the difference? How do they compare in performance?
As far as I understand and use these APIs, with batch_write_item() you can handle data for more than one table in a single call, whereas with batch_writer() the actions apply only to a single table. I think that is the most basic difference between them.
batch_writer creates a context manager for writing objects to Amazon DynamoDB in batch. The batch writer will automatically handle buffering and sending items in batches. In addition, the batch writer will also automatically handle any unprocessed items and resend them as needed. All you need to do is call put_item for any items you want to add, and delete_item for any items you want to delete. In addition, you can specify auto_dedup if the batch might contain duplicated requests and you want this writer to handle de-dup for you.
source
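To make the contrast concrete, here is a rough sketch of both styles (the table and attribute names are made up for illustration):

```python
import boto3

# High-level resource API: batch_writer() is a context manager bound to one table.
# It buffers put/delete requests, flushes them in chunks of up to 25 items, and
# retries any UnprocessedItems for you.
table = boto3.resource("dynamodb").Table("my_table")  # hypothetical table name
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"pk": f"item-{i}", "value": i})

# Low-level client API: batch_write_item() maps directly to the DynamoDB
# BatchWriteItem operation. It can target several tables in one call, but you
# must use the typed attribute-value format and handle UnprocessedItems yourself.
client = boto3.client("dynamodb")
response = client.batch_write_item(
    RequestItems={
        "my_table": [
            {"PutRequest": {"Item": {"pk": {"S": "item-0"}, "value": {"N": "0"}}}},
        ],
        # "my_other_table": [...]  # a second table could go in the same call
    }
)
unprocessed = response.get("UnprocessedItems", {})  # the caller must retry these
```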

Code changes needed for custom distributed ML Engine Experiment

I completed this tutorial on distributed TensorFlow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches," which suggests the config.yaml file below would be possible:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: complex_model_m
  workerCount: 10
  parameterServerCount: 4
Are any code changes needed to the MNIST tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers as the tutorial suggests is possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that when using the Estimator API (even within your own defined model) there should not be any additional changes needed.
Does any of this change if the config.yaml specifies using GPUs? This article suggests for the Estimator API: "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml specifically identifies the machine type for parameter servers and workers, I am expecting that within ML Engine the ClusterSpec will be configured properly based on the config.yaml file. That said, I am not able to find any ML Engine documentation confirming that no changes are needed to take advantage of GPUs.
Lastly, within ML Engine I am wondering whether there are any ways to assess the effect of different configurations. The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches" suggests that the benefit of additional workers is roughly linear, but I don't have any intuition for determining whether more parameter servers are needed. What could one check (either in the cloud dashboards or TensorBoard) to determine whether the number of parameter servers is sufficient?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with a different number or type of workers. To use a tf.estimator.Estimator on CloudML Engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads the TF_CONFIG environment variable and populates a RunConfig object with the relevant information, such as the ClusterSpec. It will automatically do the right thing on parameter server nodes and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
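For intuition, TF_CONFIG is just a JSON blob in the process environment. A rough sketch of what it might look like for one worker (the hosts, ports and counts here are hypothetical; the real value is generated by the service from your config.yaml):

```python
import json
import os

# Hypothetical TF_CONFIG as the service might set it for one worker process.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
    "environment": "cloud",
})

# RunConfig (and the Estimator built on top of it) reads this variable to
# discover the ClusterSpec and this process's role - no code changes needed.
tf_config = json.loads(os.environ["TF_CONFIG"])
print(tf_config["task"]["type"], tf_config["task"]["index"])
```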
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, each worker processes batches of size 50, for an effective batch size of 10*50=500. If you then increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you need to decrease your learning rate accordingly (linear scaling seems to generally work well; ref).
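A quick back-of-the-envelope sketch of that scaling (the 0.1 base learning rate and 10-worker baseline are just illustrative numbers, not from the article):

```python
# Effective batch size grows linearly with the number of workers, so a common
# heuristic is to scale the learning rate by the same factor.
per_worker_batch_size = 50
base_workers, base_lr = 10, 0.1   # illustrative baseline


def effective_batch_size(num_workers):
    return per_worker_batch_size * num_workers


def scaled_learning_rate(num_workers):
    # Linear scaling rule relative to the baseline worker count.
    return base_lr * num_workers / base_workers


print(effective_batch_size(10), scaled_learning_rate(10))  # 500 0.1
print(effective_batch_size(20), scaled_learning_rate(20))  # 1000 0.2
```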
I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed?
No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the RunConfig constructor grabs any properties not explicitly set during instantiation by using TF_CONFIG. It does so only as a convenience to figure out how many parameter servers and workers there are.
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.

When does an action not run on the driver in Apache Spark?

I have just started with Spark and am struggling with the concept of tasks.
Can anyone please help me understand when an action (say, reduce) does not run in the driver program?
From the Spark tutorial:
"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."
I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to take place on the driver node.
Can you please describe a scenario where the reduce function won't execute at the driver? Does a task always include "transformation + action", or only "transformation"?
All the actions are performed on the cluster, and the results of the actions may end up on the driver (depending on the action).
Generally speaking, the Spark code you write around your business logic is not the program that actually runs; rather, Spark uses it to create a plan which will execute your code on the cluster. The plan groups into one task all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it creates a new task, with a shuffle between the earlier and the later tasks.
I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. Transformations are lazy and don't submit anything by themselves, hence the need for an action. You can always call .toDebugString on your RDD to see where each job splits; each level of indentation is a new stage. I think the reduce function showing up on the driver is a bit of a misnomer, as it first runs in parallel and then merges the results. So, I would expect that the task does indeed run on the workers as far as it can.
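To make the stage/task split visible, here is a minimal PySpark sketch of the word-count scenario (the input path and app name are placeholders). The transformations and reduceByKey run as tasks on the executors; only the small collected result travels back to the driver:

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="word-count-sketch")  # placeholder app name

counts = (
    sc.textFile("hdfs:///data/input-dir")       # one task per input partition/file
      .flatMap(lambda line: line.split())        # transformation: lazy, runs on executors
      .map(lambda word: (word, 1))               # transformation: lazy, runs on executors
      .reduceByKey(add)                          # shuffle boundary: starts a new stage
)

print(counts.toDebugString())   # indentation levels show the stage/shuffle boundaries
result = counts.collect()       # action: executors compute, driver only receives the results
```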