Slow Performance with Apache Spark Gradient Boosted Tree training runs - amazon-web-services

I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that for my problem size adding more resources to the cluster seems to not have an effect on the length of the run. A 10-iteration training run seem to roughly take 13hrs. This isn't acceptable since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc)
// The input file is encoded in plain-text LIBSVM format ~59GB in size
<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in Parquete-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet"));
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on a AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instances I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to every executor, but I feel that I'm being generous in either case. Looking through Spark UI, I'm not seeing task with input size of more than a few GB. I'm steering on the side of caution when giving the driver process so much memory in order to ensure that it isn't memory starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5 as per suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more that 5 cores tends to introduce a HDFS IO bottleneck). I'm also making sure that there is enough of spare RAM and CPUs left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.

I haven't used MLLibs GBT implemention, but I have used both
LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently bottlenecked and not easily parallelizable (unlike random forests which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features 500K, you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since they're much larger than the number of samples which will cause it to overfit.
As for tuning your cluster:
You want to minimize data movement, so fewer executors with more memory. 1 executor per ec2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2s of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500000 * 50000 = 200 GB Try loading it all into ram by using .cache() on your dataframe. If you perform an operation over all the rows (like sum) you should force it to load and you can measure how long the IO takes. Once its in ram and cached any other operations over it will be faster.

With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

Related

Factors that contribute to XGBoost slower training time on SageMaker distributed vs single-instance

I want to create a module on distributed training to fulfill the customer's request. I'm editing this notebook by adding a few lines of code (below) to run the training using FullyReplicated and SharedByS3Key with two instances for training to show how the customer can use distributed training with XGBoost.
I successfully run all three training types -- single, distributed:sharedbys3key, and distributed:fullyreplicated. However, the training time for the single took 3 minutes, whereas the distributed both took longer (6 & 7 minutes, respectively).
Why would the distributed training in this case be slower? The training data is only 500 records. Does the small sample size affect training size and the smaller size would actually make distributed training slower? Would it be faster if the training set was significantly larger?
I'm using the same hyperparameters for all training types, and everything else is the same. The only difference between the jobs are the instance count (1 versus 2), and the S3DataDistributionType for the two distributed training methods.
I am assuming with 500 records you are dealing with one file in the training data . The SharedByS3Key shines if you have multiple files in the training data because it will allocate each file to one training instance .With such small data set I think the overhead to spin multiple instances and move data across those instances is the only thing that comes to mind that will take up time.

SpannerIO Batch Read Speed Affected by Slow Partition Reader

We have been using SpannerIO.readAll to scan large amount of data in google dataflow setting. The ReadOperations passed to spanner are created withQuery(query) and withBatching(true). I noticed that though initially the throughput is OK, it dropped to very low throughput in the end probably due to outliers with larger amount work. Looking at BatchSpannerRead code, one DoFn is taking care of all the batch scan work for a partition. Although in a perfect world, we should assume the generated partitions should handle this outlier issues, but in practice, will it make sense to re-split the work of those slow workers?

Apache Spark: Regex with ReduceByKey is lot slower than GREP command

I have a file with strings (textData) and a set of regex filters (regx) that I want to apply and get count. Before we migrated to Spark, I used GREP as follows:
from subprocess import check_output
result={}
for reg in regx: # regx is a list of all the filters
result[reg] = system.exec('grep -e ' + reg + 'file.txt | wc -l')
Note: I am paraphrasing here with 'system.exec', I am actually using check_output.
I upgraded to SPARK for other things, so I want to also take the benefit of spark here. So I wrote up this code.
import re
sc = SparkContext('local[*]')
rdd = sc.textFile('file.txt') #containing the strings as before
result = rdd.flatMap(lambda line: [(reg, line) for reg in regx])
.map(lambda line: (line[0], len(re.findall(line[0], line[1]))))
.reduceByKey(lambda a,b: a+b)
.collect()
I thought I was being smart but the code is actually slower. Can anyone point out any obvious errors? I am running it as
spark-submit --master local[*] filename.py
I haven't run both versions on the same exact data to check exactly how much slower. I could easily do that, if required. When I checked localhost:4040 most of the time is being taken by the reduceByKey job.
To give a sense of time taken, the number of rows in the file are 100,000 with average #chars per line of ~1000 or so. The number of filters len(regx)=20. This code has been running for 44min on an 8core processor with 128GB RAM.
EDIT: just to add, the number of regex filters and textfiles will multiply 100 folds in the final system. Plus rather than writing/reading data from text files, I would be querying for the data in rdd with an SQL statement. Hence, I thought Spark was a good choice.
I'm a quite heavy user of sort as well, and whilst Spark doesn't feel as fast in a local setup, you should consider some other things:
How big is your dataset? sort swaps records to /tmp when requiring high ammounts of RAM.
How many RAM have you assigned to your Spark app? by default it has only 1GB, that's pretty unfair in sorting vs a sort command without RAM restrictions.
Are both tasks executed on the same machine? is the Spark machine a virtual appliance running in an "auto-expand" disk file? (bad performance).
Spark Clusters will spread your tasks across multiple servers automatically. If running on Hadoop, remember that files are sliced in 128MB blocks, each block can be an RDD partition.
I.e. in a Hadoop cluster, RDD partitions could be processed in parallel. This is where you'll nottice performance.
Spark will deal with Hadoop to do its best to achieve "data locality", meaning that your processes run directly against local hard drives, otherwise the data is going to be replicated across the network, as when executing reduce-alike processes. These are the stages. Understanding stages and how data is moved across the executors will lead you nice improvements, moreover considering that sort is of type "reduce" and it triggers a new execution stage on Spark, potentially moving data across the network. Having spare resources on the same nodes where maps are being executed can save a lot of network overhead.
Otherwise it will still work frankly well, and you can't destroy a file in HDFS by mistake :-)
This is where you really get performance and safety of data and execution, by spreading the task in parallel to work against a lot of hard drives in a self-recovering execution environment.
In a local setup you simply feel it irresponsive, mostly because it takes a bit to load, launch and track back the process, but it feels quick and safe when dealing with many GBs across several nodes.
I do also love shell scripting and I deal with reasonable ammounts of GBs quite often, but you can't regex-match 5 TB of data without distributing disk IO or paying for RAM as if there was no tomorrow.

When does shuffling occur in Apache Spark?

I am optimizing parameters in Spark, and would like to know exactly how Spark is shuffling data.
Precisely, I have a simple word count program, and would like to know how spark.shuffle.file.buffer.kb is affecting the run time. Right now, I only see slowdown when I make this parameter very high (I am guessing this prevents every task's buffer from fitting in memory simultaneously).
Could someone explain how Spark is performing reductions? For example, the data is read and partitioned in an RDD, and when an "action" function is called, Spark sends out tasks to the worker nodes. If the action is a reduction, how does Spark handle this, and how are shuffle files / buffers related to this process?
Question : As for your question concerning when shuffling is triggered on Spark?
Answer : Any join, cogroup, or ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger.
Explanation : How does shuffle operation work in Spark?
The shuffle operation is implemented differently in Spark compared to Hadoop. I don't know if you are familiar with how it works with Hadoop but let's focus on Spark for now.
On the map side, each map task in Spark writes out a shuffle file (os disk buffer) for every reducer – which corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since scheduling overhead in Spark is lesser, the number of mappers (M) and reducers(R) is far higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overheads.
Similar to Hadoop, Spark also provide a parameter spark.shuffle.compress to specify compression libraries to compress map outputs. In this case, it could be Snappy (by default) or LZF. Snappy uses only 33KB of buffer for each opened file and significantly reduces risk of encountering out-of-memory errors.
On the reduce side, Spark requires all shuffled data to fit into memory of the corresponding reducer task, on the contrary of Hadoop that had an option to spill this over to disk. This would of course happen only in cases where the reducer task demands all shuffled data for a GroupByKey or a ReduceByKey operation, for instance. Spark throws an out-of-memory exception in this case, which has proved quite a challenge for developers so far.
Also with Spark there is no overlapping copy phase, unlike Hadoop that has an overlapping copy phase where mappers push data to the reducers even before map is complete. This means that the shuffle is a pull operation in Spark, compared to a push operation in Hadoop. Each reducer should also maintain a network buffer to fetch map outputs. Size of this buffer is specified through the parameter spark.reducer.maxMbInFlight (by default, it is 48MB).
For more information about shuffling in Apache Spark, I suggest the following readings :
Optimizing Shuffle Performance in Spark by Aaron Davidson and Andrew Or.
SPARK-751 JIRA issue and Consolidating Shuffle files by Jason Dai.
It occurs whenever data needs to moved between executors (worker nodes)

What factors decide the number of executors in a stand alone mode?

Given a Spark application
What factors decide the number of executors in a stand alone mode? In the Mesos and YARN according to this documents, we can specify the number of executers/cores and memory.
Once a number of executors are started. Does Spark start the tasks in a round robin fashion or is it smart enough to see if some of the executors are idle/busy and then schedule the tasks accordingly.
Also, how does Spark decide on the number of tasks? I did write a simple max temperature program with small dataset and Spark spawned two tasks in a single executor. This is in the Spark stand alone mode.
Answering your questions:
The standalone mode uses the same configuration variable as Mesos and Yarn modes to set the number of executors. The variable spark.cores.max defines the maximun number of cores used in the spark Context. The default value is infinity so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark will allocate for a single task, the default value is 1. With these two variables you can define the maximun number of parallel tasks in your cluster.
When you create an RDD subClass you can define in which machines to run your task. This is defined in the getPreferredLocations method. But as the method signatures suggest this is only a preference so if Spark detects that one machine is not busy, it will launch the task in this idle machine. However I don't know the mechanism used by Spark to know what machines are idle. To achieve locality, we (Stratio) decided to make each Partions smaller so the task takes less time and achieve locality.
The number of tasks of each Spark's operation is defined according to the length of the RDD's partitions. This vector is the result of the getPartitions method that you have to override if you want to develop a new RDD subClass. This method returns how a RDD is split, where the information is and the partitions. When you join two or more RDDs using, for example, union or join operations, the number of tasks of the resulting RDD is the maximum number of tasks of the RDDs involved in the operation. For example: if you join RDD1 that has 100 tasks and RDD2 that has 1000 tasks, the next operation of the resulting RDD will have 1000 tasks. Note that a high number of partitions is not necessarily synonym of more data.
I hope this will help.
I agree with #jlopezmat about how Spark chooses its configuration. With respect to your test code, your are seeing two task due to the way textFile is implemented. From SparkContext.scala:
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
and if we check what is the value of defaultMinPartitions:
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions with be equal to the number of HDFS blocks, by default. You can change the number of partitions in a number of different ways. The top two: as an extra argument to the SparkContext.textFile method; by calling the RDD.repartion method.
Answering some points that were not addressed in previous answers:
in Standalone mode, you need to play with --executor-cores and --max-executor-cores to set the number of executors that will be launched (granted that you have enough memory to fit that number if you specify --executor-memory)
Spark does not allocate task in a round-robin manner, it uses a mechanism called "Delay Scheduling", which is a pull-based technique allowing each executor to offer it's availability to the master, which will decide whether or not to send a task on it.