Dynamic TaskGroups vs Dynamic DAGs - airflow-scheduler

Suppose I have a process with tasks:
T1 >> T2 >> T3
The process needs to be run for a set of ids [1,2,3]:
process_run_with_id1
process_run_with_id2
process_run_with_id3
I can either create a single DAG with multiple TaskGroups, where each TaskGroup represents the set of tasks to be run for the id:
DAG = > TG_for_1, TG_for_2, TG_for_3
Or multiple DAGs
DAG_for_1 = t1 >> t2 >> t3
DAG_for_2 = t1 >> t2 >> t3
DAG_for_3 = t1 >> t2 >> t3
Other than visually being different, what are the differences between the two approaches (and whether I'm creating the DAGs dynamically by having a file creating DAGs or having multiple DAG files)?

You can get the same result either by splitting the tasks across several DAGs or by joining them into a single DAG; the choice depends more on how you define the result.
You can also write an entire application in a single file, but is it smart to place everything in the same file? Probably not.
A few rules can help you decide between adding a TaskGroup to the current DAG and creating a separate DAG.
Points in favor of adding the tasks with Task Group to the current DAG:
The tasks are a subunit of the DAG.
The tasks in the group are never executed standalone. Execution is always part of the DAG itself.
The tasks share similar/close business logic with the DAG, so it makes sense to find these tasks within the specific DAG.
Points in favor of moving the tasks out of the current DAG into another DAG:
Tasks may need to run separately from the current DAG.
Tasks represent a separate business unit which may be independent or be executed after the current DAG.
Code change within the tasks will not require any additional change to the current DAG.
Overall, I think the most important point to understand is that task groups are just a UI representation: the grouped tasks still belong to the same DAG and the same DAG run. This means that if for any reason TG_for_1 takes longer to execute, it may delay the scheduling of new DAG runs, which affects not only the tasks in TG_for_1 but also the tasks of the following run of TG_for_2 and TG_for_3. Thus you want to "bind" the tasks into the same DAG only if it makes sense that they run together and "suffer" together when there are issues.
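The "suffer together" effect can be illustrated with a toy model in plain Python. This is not Airflow code: the function names and numbers are made up for illustration, and it assumes only one run of a DAG is active at a time (as with max_active_runs=1), so a run cannot start before the previous run's slowest group has finished.

```python
def run_starts_single_dag(durations, interval, runs):
    """Start time of each run when all ids share one DAG.

    Assumption: only one active run at a time, so a run waits for the
    previous run, which ends when its slowest group ends.
    """
    starts = []
    previous_end = 0
    for k in range(runs):
        start = max(k * interval, previous_end)
        starts.append(start)
        previous_end = start + max(durations.values())
    return starts


def run_starts_separate_dags(durations, interval, runs):
    """Start times per id when each id has its own DAG (same assumption)."""
    return {
        id_: run_starts_single_dag({id_: duration}, interval, runs)
        for id_, duration in durations.items()
    }


# Hypothetical numbers: id 1 is slow, ids 2 and 3 are fast.
durations = {1: 25, 2: 5, 3: 5}
shared = run_starts_single_dag(durations, interval=10, runs=3)
separate = run_starts_separate_dags(durations, interval=10, runs=3)

print(shared)       # [0, 25, 50]: every id waits for id 1 in the shared DAG
print(separate[2])  # [0, 10, 20]: id 2 keeps its schedule in its own DAG
```

In the shared DAG the fast ids inherit the delay of the slow one, while with separate DAGs only id 1 falls behind.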

Related

State machine in AWS (step function?)

I would like to get some advice to see whether step function is suitable for my use case.
I have a bunch of user records generated at random time. I need to do some pre-processing and validation before putting them into a pool. I have a stage which runs periodically (1-5min) to collect records from the pool and combine them, then publish them.
I need real-time traceability/monitoring of each record, and I need to notify the user once the record is published.
Here is a diagram to illustrate the flow.
Is a step function suitable for my use case? If not, is there any alternative that would help me simplify the solution? Thanks
Yes, Step Functions is an option. Step Functions "State Machines" add the greatest value over other AWS serverless workflow patterns, such as event-driven or pub/sub, when the scenario involves complex branching/retry logic and observability requirements. State Machine (SM) logic is explicit and visual, which makes it simple to reason about the workflow. For each SM execution, you can easily trace the exact path the execution took and where it failed. This added functionality is reflected in its higher cost.
In any case, you need to gather records until it's time to collect them. This batching requirement means that your architecture will need more elements than just a State Machine. Here are some ideas:
(1) A SM preprocesses Records one-by-one as they arrive
One option is to use State Machines to orchestrate the preprocessing and validation only. Each arriving event record kicks off a SM execution. Pre-processed records go into a queue, from which they are periodically polled and sent to be combined.
[Records EventBridge event] -> [preprocessing SM] -> [Record queue] -> [polling lambda] -> [Combining Service]
(2) Preprocess and process batched records in an end-to-end State Machine
Gather records in a queue as they arrive. A lambda periodically polls the queue and begins the SM execution on a batch of records. A SM Map Task pre-processes and validates the records in parallel then calls the combining service, all within a single execution. This setup gives you the greatest visibility, but is more complex because you have to handle cases where a single record causes the batched execution to fail.
[Records arrive] -> [Record source queue] -> [polling lambda gets batch] -> [SM for preprocessing, collecting and combining]
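As a rough sketch of option (2), the state machine definition could look something like the following, built here as a Python dict and serialized to Amazon States Language JSON. The Lambda ARNs, state names and the "$.records" input shape are placeholders, not anything from the question. The Catch inside the Map's item processor is one way to keep a single bad record from failing the whole batch: the failed record is passed on with an "error" field, and the combining step is assumed to skip such records.

```python
import json

# Hypothetical Lambda ARNs; replace with your own functions.
PREPROCESS_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:preprocess-record"
COMBINE_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:combine-records"

state_machine = {
    "StartAt": "PreprocessEachRecord",
    "States": {
        "PreprocessEachRecord": {
            "Type": "Map",
            "ItemsPath": "$.records",   # assumed input: {"records": [...]}
            "MaxConcurrency": 10,       # arbitrary limit for the sketch
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "Preprocess",
                "States": {
                    "Preprocess": {
                        "Type": "Task",
                        "Resource": PREPROCESS_ARN,
                        # Keep the record and attach the error instead of
                        # failing the whole batched execution.
                        "Catch": [
                            {
                                "ErrorEquals": ["States.ALL"],
                                "ResultPath": "$.error",
                                "Next": "MarkFailed",
                            }
                        ],
                        "End": True,
                    },
                    "MarkFailed": {"Type": "Pass", "End": True},
                },
            },
            "ResultPath": "$.processed",
            "Next": "Combine",
        },
        # The combining step is assumed to ignore items carrying "error".
        "Combine": {"Type": "Task", "Resource": COMBINE_ARN, "End": True},
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

The printed JSON is what the polling lambda's target state machine would be created from; the batch itself is passed as the execution input.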
Other
There are plenty of other combinations, including chaining SMs together if necessary, or avoiding SMs altogether. Which option is best for you will depend on which pain points matter most to you: observability, error handling, simplicity, cost.

when is it not performance practical to use persist() on a spark dataframe?

While working on improving code performance, since many of my jobs were failing (aborted), I thought about calling persist() on a Spark DataFrame whenever I need to reuse that same dataframe in many other operations. After doing so and following the jobs and stages in the Spark application UI, I got the impression that this is not always optimal: it depends on the number of partitions and the data size. I wasn't sure until a job was aborted because of a failure in the persist stage.
Is the best practice of using persist() whenever many operations will be performed on the dataframe always valid? If not, when is it not, and how can I judge?
To be more concrete I will present my code and the details of the aborted job:
#create a dataframe from another one df_transf_1 on which I made a lot of transformations but no actions
spark_df = df_transf_1.select('user_id', 'product_id').dropDuplicates()
#persist
spark_df.persist()
products_df = spark_df[['product_id']].distinct()
df_products_indexed = products_df.rdd.map(lambda r: r.product_id).zipWithIndex().toDF(['product_id', 'product_index'])
You may ask why I persisted spark_df?
It's because I'm going to use it multiple times, for example with products_df and also in joins (e.g. spark_df = spark_df.join(df_products_indexed, "product_id")).
Details of fail reason in Stage 3:
Job aborted due to stage failure: Task 40458 in stage 3.0 failed 4 times, most recent failure: Lost task 40458.3 in stage 3.0 (TID 60778, xx.xx.yyyy.com, executor 91): ExecutorLostFailure (executor 91 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
The input data is huge (4 TB). Is there a way to check the size of the data before calling persist? Should the size be a factor in deciding whether to persist or not? Also, the number of partitions (tasks) for the persist stage is > 100,000.
Here are two cases for using persist():
After using repartition, in order to avoid shuffling your data again and again as the dataframe is used by the next steps. This is useful only if you call more than one action on the persisted dataframe/RDD, since persist is lazily evaluated: nothing is cached until the first action runs. In general, it applies whenever you have multiple actions on the same dataframe/RDD.
Iterative computations, for instance when you want to query a dataframe inside a for loop. With persist Spark will save the intermediate results and omit reevaluating the same operations on every action call. Another example would be appending new columns with a join as discussed here.
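As a minimal sketch of the "several actions on one dataframe" case, a small helper can persist the dataframe, run the actions, and release the cache afterwards. The helper name is made up; it only relies on the persist() and unpersist() methods of a DataFrame. For very large inputs such as the 4 TB in the question, passing a disk-backed storage level (for example StorageLevel.DISK_ONLY) may reduce memory pressure, but whether that helps here is an assumption, not something verified against that job.

```python
def run_actions_with_cache(df, actions, storage_level=None):
    """Persist df, run every action against it, then release the cache.

    `actions` is a list of callables that each trigger a Spark action
    on the dataframe they receive.
    """
    if storage_level is None:
        df.persist()               # default storage level
    else:
        df.persist(storage_level)  # e.g. StorageLevel.DISK_ONLY
    try:
        # The first action materializes the cache; later ones reuse it.
        return [action(df) for action in actions]
    finally:
        df.unpersist()


# Usage with a real session (not executed here):
# from pyspark import StorageLevel
# n_rows, n_products = run_actions_with_cache(
#     spark_df,
#     [lambda d: d.count(),
#      lambda d: d.select("product_id").distinct().count()],
#     storage_level=StorageLevel.DISK_ONLY,
# )
```

If only a single action is ever run on the dataframe, the persist step adds cost without any reuse to pay for it.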
What my experience taught me is that when you perform several operations on a dataframe, you should materialize it by creating temporary tables (this also gives you a recovery point if something fails). By doing this you prevent huge DAGs that often never finish, for example when you have joins. So my advice would be to do something like this:
# operations
df.write.saveAsTable('database.tablename_temp')
df = spark.table('database.tablename_temp')
# more operations

Dataproc Pyspark job only running on one node

My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
    def __init__(self, dInputData, lDataInstance):
        # ...
    def does_the_thing(self):
        """About 0.01 seconds calculation time per row"""
        # ...
        return lProcessedData

#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}

#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = (
    rddData
    .map(
        lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
    )
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:
gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed the job status via the UI, started like this. Browsing through it I noticed that only one (seemingly random) of my worker nodes was doing anything. All others were completely idle.
The whole point of PySpark is to run this in parallel, and that is obviously not happening. I've run this data on all sorts of cluster configurations, the last one being massive, which is when I noticed the single-node use. That is why my jobs take so long to complete, and why the run time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
EDIT
I changed
llCalculated = rddCalculated.collect()
#... save to csv and export
to
rddCalculated.saveAsTextFile("gs://storage-bucket/results")
and only one node is still doing the work.
Depending on whether you loaded rddData from GCS or HDFS, the default split size is likely either 64MB or 128MB, meaning your 200MB dataset only has 2-4 partitions. Spark does this because typical basic data-parallel tasks churn through data fast enough that 64MB-128MB means maybe tens of seconds of processing, so there's no benefit in splitting into smaller chunks of parallelism since startup overhead would then dominate.
In your case, it sounds like the per-MB processing time is much higher due to your joining against the other dataset and perhaps performing fairly heavyweight computation on each record. So you'll want a larger number of partitions, otherwise no matter how many nodes you have, Spark won't know to split into more than 2-4 units of work (which would also likely get packed onto a single machine if each machine has multiple cores).
So you simply need to call repartition:
rddCalculated = (
    rddData
    .repartition(200)
    .map(
        lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
    )
)
Or add the repartition to an earlier line:
rddData = rddData.repartition(200)
Or you may have better efficiency if you repartition at read time:
rddData = sc.textFile("gs://storage-bucket/your-input-data", minPartitions=200)
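The arithmetic behind this answer can be checked with a few lines of plain Python. This is a back-of-the-envelope sketch, not a Spark API: the helper names and the 64-core cluster are assumptions, while the 200MB, 3M rows and 0.01 s per row come from the question.

```python
import math


def partitions_for(size_mb, split_mb):
    """Rough number of input partitions for a file of size_mb."""
    return math.ceil(size_mb / split_mb)


def estimated_seconds(rows, seconds_per_row, partitions, cores):
    """Rough wall-clock time: partitions run in waves of `cores` tasks."""
    rows_per_partition = math.ceil(rows / partitions)
    waves = math.ceil(partitions / cores)
    return waves * rows_per_partition * seconds_per_row


print(partitions_for(200, 64))   # 4
print(partitions_for(200, 128))  # 2

# Hypothetical cluster with 64 cores in total.
few = estimated_seconds(3_000_000, 0.01, partitions=4, cores=64)
many = estimated_seconds(3_000_000, 0.01, partitions=200, cores=64)
print(few, many)  # roughly 7500 s with 4 partitions vs roughly 600 s with 200
```

With only 4 partitions, at most 4 cores can ever be busy, no matter how large the cluster is, which matches the run time being independent of cluster size.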

What is the most efficient way to perform a large and slow batch job on GAE

Say I have retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. Each update takes ~30 seconds because of the API calls it has to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...] # list of objects to process
for obj in my_objects:  # "obj" avoids shadowing the built-in "object"
    obj.process_me() # takes around 30 seconds
    obj.put()
Two options:
You can run a task with a query cursor that processes only N entities each time. When these are processed and there are more entities to go, you fire another task with the next query cursor. Resources: query cursor, tasks
You can run a mapreduce job that will go over all entities in your query in a parallel manner (might require more resources). Simple tutorial: MapReduce on App Engine made easy
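The first option can be sketched as follows. To keep the sketch runnable without the App Engine SDK, the query and the enqueue function are passed in as parameters: `query` is assumed to behave like an NDB query (its fetch_page(page_size, start_cursor=...) returns the entities, the next cursor and a "more" flag), and `defer` is assumed to behave like deferred.defer from google.appengine.ext. In a real handler you would more likely call deferred.defer directly and rebuild the query inside the task. BATCH_SIZE is a made-up value chosen so that 10 entities at ~30 seconds each stay within a 10-minute task deadline.

```python
BATCH_SIZE = 10  # hypothetical; 10 entities x ~30 s = ~5 minutes per task


def process_batch(query, defer, cursor=None):
    """Process one page of entities, then enqueue a task for the next page."""
    entities, next_cursor, more = query.fetch_page(BATCH_SIZE, start_cursor=cursor)
    for entity in entities:
        entity.process_me()  # the slow update from the question
        entity.put()
    if more and next_cursor:
        # Chain the next task instead of looping, so no single request
        # has to survive for the whole list.
        defer(process_batch, query, defer, cursor=next_cursor)
    return len(entities)
```

Each task handles a bounded amount of work, and several chains can be started on different query ranges if one chain is too slow for the 15-minute schedule.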
You might consider using mapreduce for your purposes. When I wanted to update all my > 15000 entities I used mapreduce.
from mapreduce import operation as op

def process(entity):
    # update...
    yield op.db.Put(entity)

What factors decide the number of executors in a stand alone mode?

Given a Spark application
What factors decide the number of executors in standalone mode? In Mesos and YARN, according to this document, we can specify the number of executors/cores and the memory.
Once a number of executors have been started, does Spark assign the tasks in a round-robin fashion, or is it smart enough to see whether some of the executors are idle/busy and schedule the tasks accordingly?
Also, how does Spark decide on the number of tasks? I wrote a simple max temperature program with a small dataset, and Spark spawned two tasks in a single executor. This is in Spark standalone mode.
Answering your questions:
The standalone mode uses the same configuration variable as Mesos and Yarn modes to set the number of executors. The variable spark.cores.max defines the maximum number of cores used in the Spark context. The default value is infinity, so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark will allocate for a single task; the default value is 1. With these two variables you can define the maximum number of parallel tasks in your cluster.
When you create an RDD subclass you can define on which machines to run your task. This is defined in the getPreferredLocations method. But as the method signature suggests, this is only a preference, so if Spark detects that one machine is not busy, it will launch the task on this idle machine. However, I don't know the mechanism Spark uses to know which machines are idle. To achieve locality, we (Stratio) decided to make each partition smaller, so that each task takes less time and locality is achieved.
The number of tasks of each Spark operation is defined by the number of partitions of the RDD. This vector is the result of the getPartitions method, which you have to override if you want to develop a new RDD subclass. This method returns how an RDD is split, where the information is, and the partitions. When you combine two or more RDDs, the number of tasks of the resulting RDD depends on the operation: by default, a join takes the largest partition count among the RDDs involved (unless spark.default.parallelism is set), whereas a union of RDDs that do not share a partitioner keeps all the partitions of its inputs. For example: if you join RDD1 that has 100 partitions and RDD2 that has 1000 partitions, the next operation on the resulting RDD will have 1000 tasks. Note that a high number of partitions is not necessarily synonymous with more data.
I hope this will help.
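The relationship between spark.cores.max, spark.task.cpus and the number of tasks that can run at once can be written down as simple arithmetic. The helper names below are made up for illustration, and the numbers are hypothetical.

```python
def max_parallel_tasks(cores_max, task_cpus=1):
    """Upper bound on concurrently running tasks for one application."""
    return cores_max // task_cpus


def waves(num_partitions, cores_max, task_cpus=1):
    """How many rounds of tasks are needed to process all partitions."""
    slots = max_parallel_tasks(cores_max, task_cpus)
    return -(-num_partitions // slots)  # ceiling division


print(max_parallel_tasks(16))     # 16
print(max_parallel_tasks(16, 2))  # 8
print(waves(1000, 16, 2))         # 125
```

So an RDD with 1000 partitions on an application limited to 16 cores with 2 CPUs per task runs as 125 rounds of 8 tasks each.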
I agree with @jlopezmat about how Spark chooses its configuration. With respect to your test code, you are seeing two tasks because of the way textFile is implemented. From SparkContext.scala:
/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString)
}
and if we check what is the value of defaultMinPartitions:
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions will be equal to the number of HDFS blocks, by default. You can change the number of partitions in a number of different ways. The top two: as an extra argument to the SparkContext.textFile method; by calling the RDD.repartition method.
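To see why a small file ends up with two tasks while a large one gets one task per block, here is a simplified model of how Hadoop's file input format turns a file size, a minimum partition count and a block size into splits. It is a sketch of that logic as I understand it, for a single uncompressed file; the function name and the example sizes are made up.

```python
MB = 1024 * 1024


def count_splits(total_size, min_partitions, block_size, min_size=1):
    """Approximate number of input splits (and so tasks) for one file."""
    goal_size = total_size // max(min_partitions, 1)
    split_size = max(min_size, min(goal_size, block_size))
    splits = 0
    remaining = total_size
    # The last split may be up to 10% larger than split_size.
    while remaining / split_size > 1.1:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


print(count_splits(1024, 2, 128 * MB))       # 2: tiny file, minPartitions wins
print(count_splits(1024 * MB, 2, 128 * MB))  # 8: one split per 128MB block
print(count_splits(1024 * MB, 200, 128 * MB))  # 200: minPartitions raised by the user
```

With the default minPartitions of 2, a small dataset is cut in half, which is consistent with the two tasks observed in the question.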
Answering some points that were not addressed in previous answers:
In standalone mode, you need to play with --executor-cores and --total-executor-cores to set the number of executors that will be launched (provided that you have enough memory to fit that number if you specify --executor-memory).
Spark does not allocate tasks in a round-robin manner; it uses a mechanism called "Delay Scheduling", which is a pull-based technique allowing each executor to offer its availability to the master, which then decides whether or not to send a task to it.