Apache Spark: Regex with reduceByKey is a lot slower than the grep command

I have a file with strings (textData) and a set of regex filters (regx) that I want to apply to get counts. Before we migrated to Spark, I used grep as follows:
from subprocess import check_output
result = {}
for reg in regx:  # regx is a list of all the filters
    result[reg] = system.exec('grep -e ' + reg + ' file.txt | wc -l')
Note: I am paraphrasing here with 'system.exec'; I am actually using check_output.
I upgraded to Spark for other things, so I want to take advantage of Spark here as well. So I wrote up this code:
import re
from pyspark import SparkContext

sc = SparkContext('local[*]')
rdd = sc.textFile('file.txt')  # containing the strings as before
result = (rdd.flatMap(lambda line: [(reg, line) for reg in regx])
             .map(lambda pair: (pair[0], len(re.findall(pair[0], pair[1]))))  # pair = (reg, line)
             .reduceByKey(lambda a, b: a + b)
             .collect())
I thought I was being smart but the code is actually slower. Can anyone point out any obvious errors? I am running it as
spark-submit --master local[*] filename.py
I haven't run both versions on the exact same data to check exactly how much slower it is. I could easily do that if required. When I checked localhost:4040, most of the time was being taken by the reduceByKey job.
To give a sense of the time taken: the number of rows in the file is 100,000, with an average of ~1,000 characters per line. The number of filters is len(regx)=20. This code has been running for 44 minutes on an 8-core processor with 128 GB of RAM.
EDIT: just to add, the number of regex filters and text files will grow 100-fold in the final system. Also, rather than writing/reading data from text files, I would be querying for the data in the RDD with an SQL statement. Hence I thought Spark was a good choice.

I'm quite a heavy user of sort as well, and while Spark doesn't feel as fast in a local setup, you should consider some other things:
How big is your dataset? sort swaps records to /tmp when it requires large amounts of RAM.
How much RAM have you assigned to your Spark app? By default it gets only 1 GB, which is pretty unfair when racing against a sort command with no RAM restrictions (see the sketch after these questions for one way to raise it).
Are both tasks executed on the same machine? Is the Spark machine a virtual appliance running on an "auto-expand" disk file? (That hurts performance.)
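For example, a minimal way to give the local-mode app more memory (the 8g value is just an illustration; --driver-memory is the relevant flag here because in local[*] mode everything runs inside the driver JVM):
spark-submit --master local[*] --driver-memory 8g filename.py
Equivalently, spark.driver.memory can be set in spark-defaults.conf; note that setting it from inside the script after the SparkContext has already started has no effect.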
Spark clusters will spread your tasks across multiple servers automatically. If running on Hadoop, remember that files are sliced into 128 MB blocks, and each block can be an RDD partition.
That is, in a Hadoop cluster, RDD partitions can be processed in parallel. This is where you'll notice the performance.
Spark will work with Hadoop to do its best to achieve "data locality", meaning that your processes run directly against local hard drives; otherwise the data has to be replicated across the network, as when executing reduce-like processes. These are the stages. Understanding stages and how data is moved across the executors will lead you to nice improvements, especially considering that sort is a "reduce"-type operation that triggers a new execution stage in Spark, potentially moving data across the network. Having spare resources on the same nodes where the maps are being executed can save a lot of network overhead.
Otherwise it will, frankly, still work well, and you can't destroy a file in HDFS by mistake :-)
This is where you really get performance and safety of data and execution, by spreading the task in parallel to work against a lot of hard drives in a self-recovering execution environment.
In a local setup it simply feels unresponsive, mostly because it takes a while to load, launch and track the process, but it feels quick and safe when dealing with many GBs across several nodes.
I also love shell scripting and deal with reasonable amounts of GBs quite often, but you can't regex-match 5 TB of data without distributing the disk IO or paying for RAM as if there were no tomorrow.

Related

Dataproc Pyspark job only running on one node

My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
    def __init__(self, dInputData, lDataInstance):
        # ...

    def does_the_thing(self):
        """About 0.01 seconds calculation time per row"""
        # ...
        return lProcessedData
#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}
#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = (
    rddData
    .map(
        lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
    )
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:
gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed, via the UI, the status of the job started as above. Browsing through it, I noticed that only one (seemingly random) of my worker nodes was doing anything. All the others were completely idle.
The whole point of PySpark is to run this thing in parallel, and that is obviously not happening. I've run this data in all sorts of cluster configurations, the last one being massive, which is when I noticed its single-node use. Hence my jobs take far too long to complete, and the time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
EDIT
I changed
llCalculated = rddCalculated.collect()
#... save to csv and export
to
rddCalculated.saveAsTextFile("gs://storage-bucket/results")
and only one node is still doing the work.
Depending on whether you loaded rddData from GCS or HDFS, the default split size is likely either 64MB or 128MB, meaning your 200MB dataset only has 2-4 partitions. Spark does this because typical basic data-parallel tasks churn through data fast enough that 64MB-128MB means maybe tens of seconds of processing, so there's no benefit in splitting into smaller chunks of parallelism since startup overhead would then dominate.
In your case, it sounds like the per-MB processing time is much higher due to your joining against the other dataset and perhaps performing fairly heavyweight computation on each record. So you'll want a larger number of partitions, otherwise no matter how many nodes you have, Spark won't know to split into more than 2-4 units of work (which would also likely get packed onto a single machine if each machine has multiple cores).
So you simply need to call repartition:
rddCalculated = (
    rddData
    .repartition(200)
    .map(
        lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
    )
)
Or add the repartition to an earlier line:
rddData = rddData.repartition(200)
Or you may have better efficiency if you repartition at read time:
rddData = sc.textFile("gs://storage-bucket/your-input-data", minPartitions=200)
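As a quick sanity check (my own small sketch, not part of the original answer), you can verify how many partitions Spark actually created before and after repartitioning:
print(rddData.getNumPartitions())  # likely 2-4 for a ~200MB input
rddData = rddData.repartition(200)
print(rddData.getNumPartitions())  # should now report 200
The Spark UI's stage view should then show roughly 200 tasks for the map stage, which can be spread across all worker nodes.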

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with the Gradient Boosted Trees learning algorithm from the ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that, for my problem size, adding more resources to the cluster seems to have no effect on the length of the run. A 10-iteration training run seems to take roughly 13 hours. This isn't acceptable, since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
        // unlimited max result size for intermediate Map-Reduce ops.
        // Having no limit is probably bad, but I've not had time to find
        // a tighter upper bound and the default value wasn't sufficient.
        .set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc);
// The input file is encoded in plain-text LIBSVM format, ~59GB in size
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in the Parquet-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet");
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on an AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of one r3.8xlarge instance I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give each executor, but I feel that I'm being generous in either case. Looking through the Spark UI, I'm not seeing tasks with an input size of more than a few GB. I'm erring on the side of caution when giving the driver process so much memory, in order to ensure that it isn't memory-starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5, as per the suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more than 5 cores tends to introduce an HDFS IO bottleneck). I'm also making sure that there is enough spare RAM and CPU left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLlib's GBT implementation, but I have used both LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively, as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently bottlenecked and not easily parallelizable (unlike random forests, which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features (500K), you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since the feature count is much larger than the number of samples, which will cause the model to overfit.
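As a rough illustration of reducing the feature count (my own hedged sketch using MLlib's ChiSqSelector, shown in PySpark although the question's code is Java; the Java/Scala API is analogous, the numTopFeatures value is arbitrary, and the chi-squared test treats feature values as categorical):
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# load the same LIBSVM data as an RDD of LabeledPoint
data = MLUtils.loadLibSVMFile(sc, "s3://somebucket/somekey/plaintext_libsvm_file")

# keep only the top-k features according to a chi-squared test (k chosen arbitrarily here)
selector = ChiSqSelector(numTopFeatures=5000)
model = selector.fit(data)

# rebuild the labeled points with the reduced feature vectors
reduced = data.map(lambda p: LabeledPoint(p.label, model.transform(p.features)))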
As for tuning your cluster:
You want to minimize data movement, so fewer executors with more memory: one executor per EC2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2 instances of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500,000 * 50,000 = 200 GB. Try loading it all into RAM by using .cache() on your dataset. If you perform an operation over all the rows (like a sum), you force it to load and can measure how long the IO takes. Once it's in RAM and cached, any other operations over it will be faster (see the sketch below).
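A minimal PySpark-flavored sketch of that cache-then-force-load pattern (my own illustration; the question's code is Java, but the same cache()/count() calls exist on JavaRDD):
import time
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, "s3://somebucket/somekey/plaintext_libsvm_file")
data.cache()      # mark the RDD for in-memory storage

start = time.time()
n = data.count()  # first action materializes and caches the data; this roughly times the IO
print(n, time.time() - start)

start = time.time()
data.count()      # a second pass should be much faster if the data fit in memory
print(time.time() - start)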
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

When does shuffling occur in Apache Spark?

I am optimizing parameters in Spark, and would like to know exactly how Spark is shuffling data.
Precisely, I have a simple word count program, and would like to know how spark.shuffle.file.buffer.kb is affecting the run time. Right now, I only see slowdown when I make this parameter very high (I am guessing this prevents every task's buffer from fitting in memory simultaneously).
Could someone explain how Spark is performing reductions? For example, the data is read and partitioned in an RDD, and when an "action" function is called, Spark sends out tasks to the worker nodes. If the action is a reduction, how does Spark handle this, and how are shuffle files / buffers related to this process?
Question: As for your question concerning when shuffling is triggered in Spark:
Answer: Any join, cogroup, or *ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger (see the word-count sketch below).
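For concreteness, a minimal word-count sketch (my own illustration, not from the original post) where the shuffle happens at the reduceByKey boundary: map-side tasks pre-aggregate counts per partition, and the shuffled, partially aggregated records are then merged on the reduce side.
from pyspark import SparkContext

sc = SparkContext('local[*]')
counts = (sc.textFile('file.txt')              # stage 1: read and map-side work
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)   # shuffle boundary: a new stage starts here
            .collect())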
Explanation: How does the shuffle operation work in Spark?
The shuffle operation is implemented differently in Spark compared to Hadoop. I don't know if you are familiar with how it works with Hadoop but let's focus on Spark for now.
On the map side, each map task in Spark writes out a shuffle file (OS disk buffer) for every reducer, which corresponds to a logical block in Spark. These files are not intermediary in the sense that Spark does not merge them into larger partitioned ones. Since the scheduling overhead in Spark is smaller, the number of mappers (M) and reducers (R) is far higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overhead.
Similar to Hadoop, Spark also provides a parameter, spark.shuffle.compress, to control compression of the map outputs; the codec can be Snappy (by default) or LZF. Snappy uses only 33KB of buffer for each opened file and significantly reduces the risk of encountering out-of-memory errors.
On the reduce side, Spark requires all shuffled data to fit into the memory of the corresponding reducer task, unlike Hadoop, which has an option to spill it over to disk. This would of course happen only in cases where the reducer task demands all shuffled data, for instance for a groupByKey or a reduceByKey operation. Spark throws an out-of-memory exception in this case, which has proved quite a challenge for developers so far.
Also, with Spark there is no overlapping copy phase, unlike Hadoop, which has an overlapping copy phase where mappers push data to the reducers even before the map is complete. This means that the shuffle is a pull operation in Spark, compared to a push operation in Hadoop. Each reducer also maintains a network buffer to fetch map outputs; the size of this buffer is specified through the parameter spark.reducer.maxMbInFlight (by default, it is 48MB). A small configuration sketch follows.
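For illustration (my own sketch, using the parameter names as they existed in the Spark 1.x era discussed here; several were later renamed, e.g. spark.reducer.maxMbInFlight became spark.reducer.maxSizeInFlight and spark.shuffle.file.buffer.kb became spark.shuffle.file.buffer):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('wordcount-shuffle-tuning')
        .set('spark.shuffle.compress', 'true')       # compress map outputs
        .set('spark.shuffle.file.buffer.kb', '64')   # per-file write buffer on the map side
        .set('spark.reducer.maxMbInFlight', '48'))   # network buffer for fetching map outputs
sc = SparkContext(conf=conf)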
For more information about shuffling in Apache Spark, I suggest the following readings :
Optimizing Shuffle Performance in Spark by Aaron Davidson and Andrew Or.
SPARK-751 JIRA issue and Consolidating Shuffle files by Jason Dai.
It occurs whenever data needs to be moved between executors (worker nodes).

When to use MapReduce in HBase?

I want to understand MapReduce over HBase from an application point of view. I need some real use cases to better understand when it is efficient to write these jobs.
If there is any link to documentation or examples that explain real use cases, please share.
I can give some examples based on my use cases. If you already store your data in HBase, you can write a Java program which scans a table, does something, and then writes the output to HBase or somewhere else. Or you can use MapReduce to do the same. The difference is that MapReduce runs where the data is, and network traffic is used only for the result data. We have hourly jobs to calculate the sum and average of KPIs; the input data is huge but the output data is tiny for this task. If I did not use MapReduce, I would need to move one hour of data over the network, which is 18 GB, whereas the MapReduce output is only 1 MB, and I can write it to HBase, a file, or somewhere else.
MapReduce also gives you parallel task execution, which you could build yourself in plain Java, but why would you :)
Keep in mind that YARN creates map tasks according to your HBase table's split count. So if you need more map tasks, split your table.
If you already store your data in HDFS, you are lucky: a MapReduce job reading from HDFS is much faster than one reading from HBase. And you can still write the MapReduce output to HBase if you want.
Please look into the use cases given:
1. here,
2. in the small reference here - 30. Joins,
3. and maybe in an end-to-end example here.
In the end, it all depends on your understanding of each concept (MapReduce, HBase) and using them as needed in your project. The same task can be done with or without MapReduce. Happy coding.

How to parallelize a query in Apache Hive for a (small) dataset

I am testing the latest Hive on parts of my data set. It's only a couple of GB of log files that I am reading through a custom SerDe.
When I run simple GROUP BY queries (4 MR jobs), I am getting logs such as:
map : 100%
reduce : 0%
map : 85%
reduce : 0%
map : 86%
reduce : 0%
all the while using only one core on the 8-core server. Kind of a waste...
I have activated the parallel option but it still won't parallelize. I have set the number of reduce tasks to 8.
My expectation is that since my data set is partitioned (=> different files), at least some of the map-reduce phases could run in parallel on those files.
Is my understanding wrong? Is there a specific way to write the queries?
Thanks
If you're doing nothing but a simple GROUP BY, the only real processing is comparison, which isn't that hard. That said, how many mappers are you running? A tasktracker will not parallelize a single task on its own; rather, Hadoop relies on multiple tasks running across tasktrackers to achieve parallelism. So if you're only running one map task per node, you won't see anything.
Another possibility is that because you're doing a GROUP BY, you're bound by IO and not by the processor, so there's no benefit to bringing multiple cores into it.
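For reference on the knobs the question mentions ("the parallel option" and the reducer count), these are usually set per session roughly like this (treat it as a sketch; property names have shifted across Hive versions, e.g. mapred.reduce.tasks vs mapreduce.job.reduces):
SET hive.exec.parallel=true;              -- run independent MR stages of a query in parallel
SET hive.exec.parallel.thread.number=8;   -- how many such stages may run at once
SET mapred.reduce.tasks=8;                -- number of reduce tasks per MR job
Note that hive.exec.parallel only helps when a query produces independent MR jobs; it does not split a single map phase across more cores, which is governed by the input splits.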