Dataproc Pyspark job only running on one node - python-2.7

My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
def __init__(self, dInputData, lDataInstance):
# ...
def does_the_thing(self):
"""About 0.01 seconds calculation time per row"""
# ...
return lProcessedData
#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}
#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = (
rddData
.map(
lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:
gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed the job status via the UI, started like this. Browsing through it I noticed that only one (seemingly random) of my worker nodes was doing anything. All others were completely idle.
Whole point of PySpark is to run this thing in parallel, and is obviously not the case. I've run this data in all sorts of cluster configurations, the last one being massive, which is when I noticed it's singular-node use. And hence why my jobs take too very long to complete, and time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
EDIT
I changed
llCalculated = rddCalculated.collect()
#... save to csv and export
to
rddCalculated.saveAsTextFile("gs://storage-bucket/results")
and only one node is still doing the work.

Depending on whether you loaded rddData from GCS or HDFS, the default split size is likely either 64MB or 128MB, meaning your 200MB dataset only has 2-4 partitions. Spark does this because typical basic data-parallel tasks churn through data fast enough that 64MB-128MB means maybe tens of seconds of processing, so there's no benefit in splitting into smaller chunks of parallelism since startup overhead would then dominate.
In your case, it sounds like the per-MB processing time is much higher due to your joining against the other dataset and perhaps performing fairly heavyweight computation on each record. So you'll want a larger number of partitions, otherwise no matter how many nodes you have, Spark won't know to split into more than 2-4 units of work (which would also likely get packed onto a single machine if each machine has multiple cores).
So you simply need to call repartition:
rddCalculated = (
rddData
.repartition(200)
.map(
lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
)
Or add the repartition to an earlier line:
rddData = rddData.repartition(200)
Or you may have better efficiency if you repartition at read time:
rddData = sc.textFile("gs://storage-bucket/your-input-data", minPartitions=200)

Related

Apache Spark: Regex with ReduceByKey is lot slower than GREP command

I have a file with strings (textData) and a set of regex filters (regx) that I want to apply and get count. Before we migrated to Spark, I used GREP as follows:
from subprocess import check_output
result={}
for reg in regx: # regx is a list of all the filters
result[reg] = system.exec('grep -e ' + reg + 'file.txt | wc -l')
Note: I am paraphrasing here with 'system.exec', I am actually using check_output.
I upgraded to SPARK for other things, so I want to also take the benefit of spark here. So I wrote up this code.
import re
sc = SparkContext('local[*]')
rdd = sc.textFile('file.txt') #containing the strings as before
result = rdd.flatMap(lambda line: [(reg, line) for reg in regx])
.map(lambda line: (line[0], len(re.findall(line[0], line[1]))))
.reduceByKey(lambda a,b: a+b)
.collect()
I thought I was being smart but the code is actually slower. Can anyone point out any obvious errors? I am running it as
spark-submit --master local[*] filename.py
I haven't run both versions on the same exact data to check exactly how much slower. I could easily do that, if required. When I checked localhost:4040 most of the time is being taken by the reduceByKey job.
To give a sense of time taken, the number of rows in the file are 100,000 with average #chars per line of ~1000 or so. The number of filters len(regx)=20. This code has been running for 44min on an 8core processor with 128GB RAM.
EDIT: just to add, the number of regex filters and textfiles will multiply 100 folds in the final system. Plus rather than writing/reading data from text files, I would be querying for the data in rdd with an SQL statement. Hence, I thought Spark was a good choice.
I'm a quite heavy user of sort as well, and whilst Spark doesn't feel as fast in a local setup, you should consider some other things:
How big is your dataset? sort swaps records to /tmp when requiring high ammounts of RAM.
How many RAM have you assigned to your Spark app? by default it has only 1GB, that's pretty unfair in sorting vs a sort command without RAM restrictions.
Are both tasks executed on the same machine? is the Spark machine a virtual appliance running in an "auto-expand" disk file? (bad performance).
Spark Clusters will spread your tasks across multiple servers automatically. If running on Hadoop, remember that files are sliced in 128MB blocks, each block can be an RDD partition.
I.e. in a Hadoop cluster, RDD partitions could be processed in parallel. This is where you'll nottice performance.
Spark will deal with Hadoop to do its best to achieve "data locality", meaning that your processes run directly against local hard drives, otherwise the data is going to be replicated across the network, as when executing reduce-alike processes. These are the stages. Understanding stages and how data is moved across the executors will lead you nice improvements, moreover considering that sort is of type "reduce" and it triggers a new execution stage on Spark, potentially moving data across the network. Having spare resources on the same nodes where maps are being executed can save a lot of network overhead.
Otherwise it will still work frankly well, and you can't destroy a file in HDFS by mistake :-)
This is where you really get performance and safety of data and execution, by spreading the task in parallel to work against a lot of hard drives in a self-recovering execution environment.
In a local setup you simply feel it irresponsive, mostly because it takes a bit to load, launch and track back the process, but it feels quick and safe when dealing with many GBs across several nodes.
I do also love shell scripting and I deal with reasonable ammounts of GBs quite often, but you can't regex-match 5 TB of data without distributing disk IO or paying for RAM as if there was no tomorrow.

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that for my problem size adding more resources to the cluster seems to not have an effect on the length of the run. A 10-iteration training run seem to roughly take 13hrs. This isn't acceptable since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc)
// The input file is encoded in plain-text LIBSVM format ~59GB in size
<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in Parquete-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet"));
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on a AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instances I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to every executor, but I feel that I'm being generous in either case. Looking through Spark UI, I'm not seeing task with input size of more than a few GB. I'm steering on the side of caution when giving the driver process so much memory in order to ensure that it isn't memory starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5 as per suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more that 5 cores tends to introduce a HDFS IO bottleneck). I'm also making sure that there is enough of spare RAM and CPUs left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLLibs GBT implemention, but I have used both
LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently bottlenecked and not easily parallelizable (unlike random forests which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features 500K, you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since they're much larger than the number of samples which will cause it to overfit.
As for tuning your cluster:
You want to minimize data movement, so fewer executors with more memory. 1 executor per ec2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2s of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500000 * 50000 = 200 GB Try loading it all into ram by using .cache() on your dataframe. If you perform an operation over all the rows (like sum) you should force it to load and you can measure how long the IO takes. Once its in ram and cached any other operations over it will be faster.
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

What is the most efficient way to perform a large and slow batch job on GAE

Say I have a retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. These updates take ~30 seconds due to API calls that it has to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...] # list of objects to process
for object in my_objects:
object.process_me() # takes around 30 seconds
object.put()
Two options:
you can run a task with a query cursor, that processes only N entities each time. When these are processed, and there are more entities to go, you fire another task with the next query cursor.Resources: query cursor, tasks
you can run a mapreduce job that will go over all entities in your query in a parallel manner (might require more resources).Simple tutorial: MapReduce on App Engine made easy
You might consider using mapreduce for your purposes. When I wanted to update all my > 15000 entities I used mapreduce.
def process(entity):
# update...
yield op.db.Put(entity)

Reading many small files from S3 very slow

Loading many small files (>200000, 4kbyte) from a S3 Bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot exactly figure out where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray)
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either :
use distcp to merge the file before your job starts : http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
have a pig script that will do it for you, once.
If you want to do it through PIG, you need to know how many mappers are spawned. You can play with the following parameters :
// to set mapper = nb block size. Set to true for one per file.
SET pig.noSplitCombination false;
// set size to have SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
Please provide metrics for thoses cases

What factors decide the number of executors in a stand alone mode?

Given a Spark application
What factors decide the number of executors in a stand alone mode? In the Mesos and YARN according to this documents, we can specify the number of executers/cores and memory.
Once a number of executors are started. Does Spark start the tasks in a round robin fashion or is it smart enough to see if some of the executors are idle/busy and then schedule the tasks accordingly.
Also, how does Spark decide on the number of tasks? I did write a simple max temperature program with small dataset and Spark spawned two tasks in a single executor. This is in the Spark stand alone mode.
Answering your questions:
The standalone mode uses the same configuration variable as Mesos and Yarn modes to set the number of executors. The variable spark.cores.max defines the maximun number of cores used in the spark Context. The default value is infinity so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark will allocate for a single task, the default value is 1. With these two variables you can define the maximun number of parallel tasks in your cluster.
When you create an RDD subClass you can define in which machines to run your task. This is defined in the getPreferredLocations method. But as the method signatures suggest this is only a preference so if Spark detects that one machine is not busy, it will launch the task in this idle machine. However I don't know the mechanism used by Spark to know what machines are idle. To achieve locality, we (Stratio) decided to make each Partions smaller so the task takes less time and achieve locality.
The number of tasks of each Spark's operation is defined according to the length of the RDD's partitions. This vector is the result of the getPartitions method that you have to override if you want to develop a new RDD subClass. This method returns how a RDD is split, where the information is and the partitions. When you join two or more RDDs using, for example, union or join operations, the number of tasks of the resulting RDD is the maximum number of tasks of the RDDs involved in the operation. For example: if you join RDD1 that has 100 tasks and RDD2 that has 1000 tasks, the next operation of the resulting RDD will have 1000 tasks. Note that a high number of partitions is not necessarily synonym of more data.
I hope this will help.
I agree with #jlopezmat about how Spark chooses its configuration. With respect to your test code, your are seeing two task due to the way textFile is implemented. From SparkContext.scala:
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
and if we check what is the value of defaultMinPartitions:
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions with be equal to the number of HDFS blocks, by default. You can change the number of partitions in a number of different ways. The top two: as an extra argument to the SparkContext.textFile method; by calling the RDD.repartion method.
Answering some points that were not addressed in previous answers:
in Standalone mode, you need to play with --executor-cores and --max-executor-cores to set the number of executors that will be launched (granted that you have enough memory to fit that number if you specify --executor-memory)
Spark does not allocate task in a round-robin manner, it uses a mechanism called "Delay Scheduling", which is a pull-based technique allowing each executor to offer it's availability to the master, which will decide whether or not to send a task on it.