How Hadoop calculate physical memory and virtual memory during a job execution - mapreduce

I have few queries related to the counters used in Hadoop to display memory usage.
A map reduce job executed on a cluster gives me below menitoned counter values. Input file used is just in KBs, but these counter shows 35GB and 420 GB usage.
PHYSICAL_MEMORY_BYTES=35110662144
VIRTUAL_MEMORY_BYTES=420121841664
For another different job on same input file it shows 309 MB (physical) and 3G(vitual) usage
PHYSICAL_MEMORY_BYTES=309526528
VIRTUAL_MEMORY_BYTES=3435827200
First job is more CPU intensive than other and creates more objects than the other one but still its usage shown seems very high.
So I just wanted to know how this memory usage is calculated. I tried going through some posts and gave an over view on this below link which seems to be
requirement task for describing these variables (https://issues.apache.org/jira/i#browse/MAPREDUCE-1218 ) but couldnt find how these are calculated. It does gives me an idea on how these values are passed to Job Tracker,but no information on how these are determined. So if some one could give some insight on this than it would be really helpfull.

You can find few references here and here. The second link in particular to map and reducer job and how slots are decided based on memory allocations. Happy Learning

Related

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing which should be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There's 3 indices with searchable data. 2 of them have roughly 1 million documents (500-600 MB) each and one with 25k (~20 MB). Indexing is not very simple (has history tracking) so I've been testing refresh with true, wait_for values or calling it separately when needed. The process is using search and bulk queries (been trying sizes of 500, 1000). There should be a limit of 10MB from AWS side so these are safely below that. I've also tested adding 0,5/1 second delays between requests, but none of this fiddling really has any noticeable benefit.
The project is currently in development so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even only updating the 25k data index twice in a row tends to fail with the above mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to have 429 circuit_breaking_exception-s (instead of the current 403) with a reason among the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. Used cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause for the issue.
Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide has also useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error the program will keep trying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. Will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.

Is there a way to specify the number of mappers in Scalding?

I am new to scalding world. My scalding job will have multiple stages, and I need to tune each stage individually.
I have found that we might be able to change the number of reducers by using withReducers. Also, I am able to set the split size for the input data by the job config. However, I didn't see there is any way to change the number of mappers for my sub-tasks on the fly.
Did I miss something? Does anyone know how to specify the number of mappers for my sub-tasks? Thanks.
Got some answers/ideas might be helpful for someone else who shared the same question.
It is much easier to control reducers compared to mappers.
Mappers are controlled by hadoop without a similar simple knob. You can set some config parameters to give hadoop an idea of how many map tasks to launch.
This stack overflow may be helpful:
Setting the number of map tasks and reduce tasks
One workaround I could think of is changing your major task to small ones, which you could individually tweak the size (# of mappers) of your input data.

Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’

ETL developer reports they have been trying to run our weekly and daily processes on ADW consistently. While for the most part they are executing without exception, I am now getting this error:
“Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.”
Is there a limit on TEMPDB space associated with the DWU setting?
The database is limited to 100TB (per the portal) and not full.
Azure SQL Data Warehouse does allocate space for a tempdb, at around 399 GB per 100 DWU. Reference here.
What DWU are you using at the moment? Consider temporarily raising your DWU aka service objective or refactoring your job to be less dependent on tempdb. Lower it when your batch process is finished.
It might also be worth checking your workload for anything like cartesian products, excessive sorting, over-dependency on temp tables etc to see if any optimisation can be done.
Have a look at the Explain Plans for your code, and see whether you have a lot more data movement going on than you expect. If you find that one query does moved a lot more into Q tables, you can probably tune it to avoid the data movement (which may mean redesigning tables to distribute in a different key).

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting GBT ensemble in human-readable format. My experience so far is that for my problem size adding more resources to the cluster seems to not have an effect on the length of the run. A 10-iteration training run seem to roughly take 13hrs. This isn't acceptable since I'm looking to do 100-300 iteration runs, and the execution time seems to explode with the number of iterations.
My Spark application
This isn't the exact code, but it can be reduced to:
SparkConf sc = new SparkConf().setAppName("GBT Trainer")
// unlimited max result size for intermediate Map-Reduce ops.
// Having no limit is probably bad, but I've not had time to find
// a tighter upper bound and the default value wasn't sufficient.
.set("spark.driver.maxResultSize", "0");
JavaSparkContext jsc = new JavaSparkContext(sc)
// The input file is encoded in plain-text LIBSVM format ~59GB in size
<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD();
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(10);
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(1);
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
// Somewhat-convoluted code below reads in Parquete-formatted output
// of the GBT model and writes it back out as json.
// There might be cleaner ways of achieving the same, but since output
// size is only a few KB I feel little guilt leaving it as is.
// serialize and output the GBT classifier model the only way that the library allows
String outputPath = "s3://somebucket/somekeyprefex";
model.save(jsc.sc(), outputPath + "/parquet");
// read in the parquet-formatted classifier output as a generic DataFrame object
SQLContext sqlContext = new SQLContext(jsc);
DataFrame outputDataFrame = sqlContext.read().parquet(outputPath + "/parquet"));
// output DataFrame-formatted classifier model as json
outputDataFrame.write().format("json").save(outputPath + "/json");
Question
What is the performance bottleneck with my Spark application (or with GBT learning algorithm itself) on input of that size and how can I achieve greater execution parallelism?
I'm still a novice Spark dev, and I'd appreciate any tips on cluster configuration and execution profiling.
More details on the cluster setup
I'm running this app on a AWS EMR cluster (emr-4.0.0, YARN cluster mode) of r3.8xlarge instances (32 cores, 244GB RAM each). I'm using such large instances in order to maximize flexibility of resource allocation. So far I've tried using 1-3 r3.8xlarge instances with a variety of resource allocation schemes between the driver and workers. For example, for a cluster of 1 r3.8xlarge instances I submit the app as follows:
aws emr add-steps --cluster-id $1 --steps Name=$2,\
Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/usr/lib/spark/bin/spark-submit,--verbose,\
--deploy-mode,cluster,--master,yarn,\
--driver-memory,60G,\
--executor-memory,30G,\
--executor-cores,5,\
--num-executors,6,\
--class,GbtTrainer,\
"s3://somebucket/somekey/spark.jar"],\
ActionOnFailure=CONTINUE
For a cluster of 3 r3.8xlarge instances I tweak resource allocation:
--driver-memory,80G,\
--executor-memory,35G,\
--executor-cores,5,\
--num-executors,18,\
I don't have a clear idea of how much memory is useful to give to every executor, but I feel that I'm being generous in either case. Looking through Spark UI, I'm not seeing task with input size of more than a few GB. I'm steering on the side of caution when giving the driver process so much memory in order to ensure that it isn't memory starved for any intermediate result-aggregation operations.
I'm trying to keep the number of cores per executor down to 5 as per suggestions in Cloudera's How To Tune Your Spark Jobs series (according to them, more that 5 cores tends to introduce a HDFS IO bottleneck). I'm also making sure that there is enough of spare RAM and CPUs left over for the host OS and Hadoop services.
My findings thus far
My only clue is Spark UI showing very long Scheduling Delay for a number of tasks at the tail-end of execution. I also get the feeling that the stages/tasks timeline shown by Spark UI does not account for all of the time that the job takes to finish. I suspect that the driver application is stuck performing some kind of a lengthy operation either at the end of every training iteration, or at the end of the entire training run.
I've already done a fair bit of research on tuning Spark applications. Most articles will give great suggestions on using RDD operations which reduce intermediate input size or avoid shuffling of data between stages. In my case I'm basically using an "out-of-the-box" algorithm, which is written by ML experts and should already be well tuned in this regard. My own code that outputs GBT model to S3 should take a trivial amount of time to run.
I haven't used MLLibs GBT implemention, but I have used both
LightGBM and XGBoost successfully. I'd highly suggest taking a look at these other libraries.
In general, GBM implementations need to train models iteratively as they consider the loss of the entire ensemble when building the next tree. This makes GBM training inherently bottlenecked and not easily parallelizable (unlike random forests which are trivially parallelizable). I'd expect it to perform better with fewer tasks, but that might not be your whole issue. Since you have so many features 500K, you're going to have very high overhead when calculating the histograms and split points during training. You should reduce the number of features you have, especially since they're much larger than the number of samples which will cause it to overfit.
As for tuning your cluster:
You want to minimize data movement, so fewer executors with more memory. 1 executor per ec2 instance, with the number of cores set to whatever the instance provides.
Your data is small enough to fit into ~2 EC2s of that size. Assuming you are using doubles (8 bytes), it comes to 8 * 500000 * 50000 = 200 GB Try loading it all into ram by using .cache() on your dataframe. If you perform an operation over all the rows (like sum) you should force it to load and you can measure how long the IO takes. Once its in ram and cached any other operations over it will be faster.
With a dataset of that size, you may well be better off loading the full dataset into memory and using XGBoost directly rather than the Spark implementation.
If you want to stick with Spark to give greater scalability, I'd recommend taking a closer look at your partitioning strategy. If your data isn't effectively partitioned, adding machines won't improve your runtime, as you describe above, and the subset of overloaded workers will remain your bottleneck. Ensure you have an effective partition key, and repartition your RDD before you begin your training stage.

How to increase Mappers and Reducer in Apache TEZ

I know this simple question, I need some help on this query from this community, When I create PartitionTable with ORC format, When I try to dump data from non partition table which is pointing to 2 GB File with 210 columns, I see Number of Mapper are 2 and reducer are 2 . is there a way to increase Mapper and reducer. My assumption is we cant set number of Mapper and reducer like MR 1.0, It is based on Settings like Yarn container size, Mapper minimum memory and maximum memory . can any one suggest me TEz Calculates mappers and reducers. What is best value to keep memory size setting, so that i dont come across : Java heap space, Java Out of Memory problem. My file size may grow upto 100GB. Please help me on this.
You can still set the number of mappers and reducers in Yarn. Have you tried that? If so, please get back here.
Yarn changes the underlying execution mechanism, but #mappers and #reducers is describing the Job requirements - not the way the job resources are allocated (which is how yarn and mrv1 differ).
Traditional Map/Reduce has a hard coded number of map and reduce "slot". As you say - Yarn uses containers - which are per-application. Yarn is thus more flexible. But the #mappers and #reducers are inputs of the job in both cases. And also in both cases the actual number of mappers and reducers may differ from the requested number. Typically the #reducers would either be
(a) precisely the number that was requested
(b) exactly ONE reducer - that is if the job required it such as in total ordering
For the memory settings, if you are using hive with tez, the following 2 settings will be of use to you:
1) hive.tez.container.size - this is the size of the Yarn Container that will be used ( value in MB ).
2) hive.tez.java.opts - this is for the java opts that will be used for each task. If container size is set to 1024 MB, set java opts to say something like "-Xmx800m" and not "-Xmx1024m". YARN kills processes that use more memory than specified container size and given that a java process's memory footprint usually can exceed the specified Xmx value, setting Xmx to be the same value as the container size usually leads to problems.