I am running a Spark job on AWS (cr1.8xlarge instances, 32 cores and 240 GB of memory per node) with the following configuration:
(The cluster has one master and 25 slaves, and I want each slave node to have 2 executors)
However, the job tracker shows only 25 executors:
Why does it have only 25 executors when I explicitly asked for 50? Thanks!
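A hypothetical spark-submit on YARN that requests two executors per node would look something like the following (these flag values are purely illustrative and are not the configuration actually used):

spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-cores 16 \
  --executor-memory 100G \
  my_job.py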
I have a 21-node Hive LLAP EMR cluster, and the Hive LLAP daemons are not consuming the available cluster vCPU allocation.
160 cores are available for YARN, but only 1 vCore is used per LLAP daemon.
Each node has 64 GB of memory and 8 vCores. Each node runs 1 LLAP daemon, which is allocated 70% of the memory but only 1 vCore.
Some of the properties:
yarn.nodemanager.resource.cpu-vcores=8;
yarn.scheduler.minimum-allocation-vcores=1;
yarn.scheduler.maximum-allocation-vcores=128;
hive.llap.daemon.vcpus.per.instance=4;
hive.llap.daemon.num.executors=4;
Why isn't the daemon allocated more than 1 vCore?
Will the executors be able to use the available vCores, or can they only use the 1 vCore allocated to the daemon?
If you are seeing this in the YARN UI, you probably have to add this:
yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
I had the same confusion. When DefaultResourceCalculator is used, the YARN UI only accounts for memory usage; behind the scenes the daemon may have been using more than 1 core, but you will see only 1 core reported. DominantResourceCalculator, on the other hand, takes both cores and memory into account for resource allocation and shows the actual number of cores and amount of memory.
You can enable Ganglia or look at the EMR metrics for more details.
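For reference, on EMR one way to apply that property is through a cluster configuration classification; a minimal sketch (assuming the standard capacity-scheduler classification, with the value from the answer above):

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]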
I am executing an AWS Glue job with Python shell. It fails intermittently with the error "Command failed with exit code 137", yet at other times it runs perfectly fine with no changes.
What does this error signify? Are there any changes we can make in the job configuration to handle it?
Adding the worker type to the Job Properties will resolve the issue. Exit code 137 generally means the process was killed (128 + SIGKILL), which typically happens when it runs out of memory, so moving to a worker type with more capacity helps. Based on the file size, please select the worker type as below:
Standard – When you choose this type, you also provide a value for Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs and jobs that run ML transforms.
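For reference, the same change can also be scripted; a rough boto3 sketch (the job name and worker count are placeholders, and the exact fields required in JobUpdate should be checked against the current Glue API):

import boto3

glue = boto3.client("glue")

# Fetch the existing definition so fields like Role and Command can be carried over.
job = glue.get_job(JobName="my-glue-job")["Job"]  # "my-glue-job" is a placeholder

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "WorkerType": "G.1X",   # or "G.2X" for more memory per worker
        "NumberOfWorkers": 10,  # placeholder count
    },
)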
Please see the attached screenshot of the CPU Load: Driver and Executors. It looks fine for the first 6 minutes: multiple executors are active. But after 6 minutes the chart only shows the Executor Average and Driver lines. When I hover the mouse over the lines, there is no usage data for any of the 17 executors. Does that mean all the executors are inactive after 6 minutes? How is the Executor Average calculated?
Thank you.
After talking to AWS support, I finally got the answer for why, after 04:07, there are no lines for individual executors but only the Executor Average and the Driver.
I was told there are 62 executors for each job; however, at any given moment at most 17 executors are in use. So the Executor Average is the average over a different set of 17 executors at each moment. The default CPU Load chart only shows Executors 1 to 17, not 18 to 62. In order to show the other executors, you need to add their metrics manually.
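For reference, metrics for the remaining executors can also be pulled directly from CloudWatch; a rough boto3 sketch (the metric name and dimensions follow the Glue naming scheme as I recall it, i.e. glue.<executorId>.system.cpuSystemLoad in the Glue namespace, and the job name, run id and time range are placeholders):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Executor 18 is not on the default chart; query its CPU load directly.
resp = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.18.system.cpuSystemLoad",        # assumed metric name
    Dimensions=[
        {"Name": "JobName", "Value": "my-glue-job"},  # placeholder
        {"Name": "JobRunId", "Value": "jr_example"},  # placeholder
        {"Name": "Type", "Value": "gauge"},
    ],
    StartTime=datetime.datetime(2020, 1, 1),
    EndTime=datetime.datetime(2020, 1, 2),
    Period=60,
    Statistics=["Average"],
)
print(resp["Datapoints"])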
I'm running a Spark slave inside a Docker container on AWS c4.8xlarge machines (one or more) and struggling to get the expected performance compared to just using multiprocessing on my laptop (with a quad-core Intel i7-6820HQ). (See the edits below; there is a huge overhead on the same hardware as well.)
I'm looking for ways to horizontally scale analytics model training with a "Multiprocessor" class which can work single-threaded, multi-process, or in a distributed Spark scenario:
import multiprocessing

# has_pyspark, _spark_context and self.max_n_parallel are assumed to be set up
# elsewhere (e.g. by trying to import pyspark and creating a SparkContext).

class Multiprocessor:
    # ...

    def map(self, func, args):
        if has_pyspark:
            # Distribute the work across the cluster and collect the results.
            n_partitions = min(len(args), 1000)
            return _spark_context.parallelize(args, n_partitions).map(func).collect()
        elif self.max_n_parallel > 1:
            # Fall back to a local process pool.
            with multiprocessing.Pool(self.max_n_parallel) as pool:
                return list(pool.map(func, args))
        else:
            # Plain single-threaded map.
            return list(map(func, args))
As you can see, Spark's role is to distribute the calculations and simply retrieve the results; parallelize().map() is the only API used. args is just a list of integer id tuples, nothing too heavy.
I'm using Docker 1.12.1 (--net host), Spark 2.0.0 (stand-alone cluster), Hadoop 2.7, Python 3.5 and openjdk-7. The results below are for the same training dataset; every run is CPU-bound:
5.4 minutes with local multiprocessing (4 processes)
5.9 minutes with four c4.8xlarge slaves (10 cores in use / each)
6.9 minutes with local Spark (master local[4])
7.7 minutes with three c4.8xlarge slaves (10 cores in use / each)
25 minutes with a single c4.8xlarge slave (10 cores) (!)
27 minutes with local VM Spark slave (4 cores) (!)
All 36 virtual CPUs seem to be in use, and load averages are 250 - 350. There were about 360 args values to be mapped, and their processing took 15 - 45 seconds (25th and 75th percentiles). GC times were insignificant. I even tried returning "empty" results to avoid network overhead, but it did not affect the total time. Ping to AWS via VPN is 50 - 60 ms.
Any tips on which other metrics I should look into? I feel I'm wasting lots of CPU cycles somewhere. I'd really like to build the architecture around Spark, but based on these PoCs, at least machines on AWS are way too expensive. I'll have to run tests with other local hardware I have access to.
EDIT 1: Tested on a Linux VM on my laptop; it took 27 minutes when using the stand-alone cluster, which is 20 minutes more than with local[4].
EDIT 2: There seem to be 7 pyspark daemons for each slave "core", all of them taking a significant amount of CPU resources. Is this expected behavior? (picture from the laptop's VM)
EDIT 3: Actually this happens even when starting the slave with just a single core; I get 100% CPU utilization. According to this answer, the red color indicates kernel-level threads, so could Docker play a role here? Anyway, I don't remember seeing this issue when I was prototyping with Python 2.7; I got very minimal performance overhead. I have now updated to OpenJDK 8, and it made no difference. I also got the same results with Spark 1.5.0 and Hadoop 2.6.
EDIT 4: I tracked down that by default scipy.linalg.cho_factor uses all available cores; that is why I'm seeing high CPU usage even with a single core allocated to the Spark slave. Must investigate further...
Final edit: The issue seems to have nothing to do with AWS or Spark; I get poor performance even with stand-alone Python within the Docker container. See my answer below.
Had the same problem - for me the root cause was memory allocation.
Make sure you allocate enough memory to your spark instances.
In start-slave.sh, run with --help to see the memory option (the default is 1 GB per node regardless of the actual memory in the machine).
You can view the allocated memory per node in the UI (port 8080 on the master).
You also need to set the memory per executor when you submit your application with spark-submit (again, the default is 1 GB); as before, run it with --help to see the memory option.
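For example (the master URL, memory sizes and core counts below are just placeholders), start the worker with an explicit memory allocation:

./sbin/start-slave.sh spark://master-host:7077 --memory 200G --cores 32

and give each executor more than the 1 GB default at submit time:

./bin/spark-submit --master spark://master-host:7077 --executor-memory 8G my_app.py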
Hope this helps.
Sorry for the confusion (I'm the OP); it took me a while to dig down to what is really happening. I did lots of benchmarking and finally realized that the Docker image was using OpenBLAS, which by default multithreads linalg functions. My code runs cho_solve hundreds of times on matrices ranging in size from 80 x 80 to 140 x 140. There was simply tons of overhead from launching all these threads, which I don't need in the first place since I'm already parallelizing via multiprocessing or Spark.
# N_CORES=4 python linalg_test.py
72.983 seconds
# OPENBLAS_NUM_THREADS=1 N_CORES=4 python linalg_test.py
9.075 seconds
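The linalg_test.py script itself is not shown here; a minimal sketch of the kind of benchmark it could be (the matrix size and iteration count are made up) would look like:

import time
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Build a random symmetric positive-definite matrix and time repeated solves.
n = 120
a = np.random.rand(n, n)
spd = np.dot(a, a.T) + n * np.eye(n)  # adding n*I keeps it positive definite
b = np.random.rand(n)

start = time.time()
for _ in range(10000):
    c, low = cho_factor(spd)
    x = cho_solve((c, low), b)
print("%.3f seconds" % (time.time() - start))

Run it with OPENBLAS_NUM_THREADS=1 to keep OpenBLAS single-threaded, as in the timings above.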
I am trying to get acquainted with the Amazon big data tools, and I want to preprocess data from S3 for eventual use in machine learning.
I am struggling to understand how to effectively read data into an AWS EMR Spark cluster.
I have a Scala script which takes a lot of time to run; most of that time is taken up by Spark's explode+pivot on my data and then by writing to file with Spark-CSV.
But even reading the raw data files takes up too much time in my view.
Then I created a script that only reads in data with sqlContext.read.json() from 4 different folders (data sizes of 0.18 MB, 0.14 MB, 0.0003 MB and 399.9 MB respectively). I used System.currentTimeMillis() before and after each read call to see how much time it takes; with 4 different instance settings the results were the following:
Folder    | m1.medium (1) | m1.medium (4) | c4.xlarge (1) | c4.xlarge (4)
1. folder | 00:00:34.042  | 00:00:29.136  | 00:00:07.483  | 00:00:06.957
2. folder | 00:00:04.980  | 00:00:04.935  | 00:00:01.928  | 00:00:01.542
3. folder | 00:00:00.909  | 00:00:00.673  | 00:00:00.455  | 00:00:00.409
4. folder | 00:04:13.003  | 00:04:02.575  | 00:03:05.675  | 00:02:46.169
The number after the instance type indicates how many nodes were used: 1 means only the master, and 4 means one master plus 3 slaves of the same type.
Firstly, it is weird that reading in the first two similarly sized folders takes different amounts of time.
But still, how can it take so much time (whole seconds) to read in less than 1 MB of data?
I had 1800 MB of data a few days ago, and my data-processing job on c4.xlarge (4 nodes) ran for 1.5 h before it failed with this error:
controller log:
INFO waitProcessCompletion ended with exit code 137 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 4870 seconds
2016-07-01T11:50:38.920Z INFO Step created jobs:
2016-07-01T11:50:38.920Z WARN Step failed with exitCode 137 and took 4870 seconds
stderr log:
16/07/01 11:50:35 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[21] at json at DataPreProcessor.scala:435)
16/07/01 11:50:35 INFO TaskSchedulerImpl: Adding task set 4.0 with 24 tasks
16/07/01 11:50:36 WARN TaskSetManager: Stage 4 contains a task of very large size (64722 KB). The maximum recommended task size is 100 KB.
16/07/01 11:50:36 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5330, localhost, partition 0,PROCESS_LOCAL, 66276000 bytes)
16/07/01 11:50:36 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 5331, localhost, partition 1,PROCESS_LOCAL, 66441997 bytes)
16/07/01 11:50:36 INFO Executor: Running task 0.0 in stage 4.0 (TID 5330)
16/07/01 11:50:36 INFO Executor: Running task 1.0 in stage 4.0 (TID 5331)
Command exiting with ret '137'
This data doubled in size over the weekend. So if I get ~1 GB of new data each day now (and it will grow fast, and soon), then I will hit big-data sizes very soon, and I really need an effective way to read and process the data quickly.
How can I do that? Is there anything I am missing? I can upgrade my instances, but it does not seem normal to me that reading 0.2 MB of data with 4x c4.xlarge (4 vCPU, 16 ECU, 7.5 GiB memory) instances takes 7 seconds (even with automatic schema inference for ~200 JSON attributes).
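For what it's worth, part of that fixed cost is schema inference, which makes an extra pass over the JSON before loading it. A minimal PySpark sketch of supplying an explicit schema instead (the field names and S3 path are made up, and the original script is Scala, so this is only illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType

sc = SparkContext()
sqlContext = SQLContext(sc)

# Hypothetical schema; in practice it would list all ~200 attributes.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("event_type", StringType(), True),
    StructField("payload", StringType(), True),
])

# Passing the schema skips the inference pass over the input files.
df = sqlContext.read.schema(schema).json("s3://my-bucket/folder1/")
df.printSchema()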