I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message:
Reason: Container killed by YARN for exceeding memory limits.
5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
So I google'd how to do this, and found that I should pass along the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it this way:
aws emr add-steps\
--cluster-id %s\
--profile EMR\
--region us-west-2\
--steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin/spark-submit,\
--deploy-mode,client,\
/home/hadoop/%s,\
--executor-memory,100g,\
--num-executors,3,\
--total-executor-cores,1,\
--conf,'spark.python.worker.memory=1200m',\
--conf,'spark.yarn.executor.memoryOverhead=15300',\
],ActionOnFailure=CONTINUE" % (cluster_id,script_name)\
But when I rerun the job it keeps giving me the same error message, with the same 5.5 GB of 5.5 GB physical memory used, which implies that my memory did not increase. Any hints on what I am doing wrong?
EDIT
Here are details on how I initially create the cluster:
aws emr create-cluster\
--name "Spark"\
--release-label emr-4.7.0\
--applications Name=Spark\
--bootstrap-action Path=s3://emr-code-matgreen/bootstraps/install_python_modules.sh\
--ec2-attributes KeyName=EMR2,InstanceProfile=EMR_EC2_DefaultRole\
--log-uri s3://emr-logs-zerex\
--instance-type r3.xlarge\
--instance-count 4\
--profile EMR\
--service-role EMR_DefaultRole\
--region us-west-2
Thanks.
After a couple of hours I found the solution to this problem. When creating the cluster, I needed to pass the following flag as a parameter:
--configurations file://./sparkConfig.json\
With the JSON file containing:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "10G"
    }
  }
]
This allows me to increase the memoryOverhead in the next step by using the parameter I initially posted.
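If you want to sanity-check that the overhead value actually reached the running job, you can print the effective settings from inside the job itself. A minimal PySpark sketch (the property names assume the Spark-on-YARN naming used above):

from pyspark.sql import SparkSession

# Minimal sketch: print the memory-related settings the running job actually sees.
spark = SparkSession.builder.appName("conf-check").getOrCreate()
conf = spark.sparkContext.getConf()

for key in ("spark.executor.memory",
            "spark.yarn.executor.memoryOverhead",
            "spark.python.worker.memory"):
    # getConf() returns a copy of the effective SparkConf; unset keys fall back to "not set"
    print(key, "=", conf.get(key, "not set"))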
If you are logged into an EMR node and want to further alter Spark's default settings without dealing with the AWS CLI tools, you can add a line to the spark-defaults.conf file. On EMR, Spark's configuration lives under /etc, so you can edit /etc/spark/conf/spark-defaults.conf directly.
So in this case we'd append spark.yarn.executor.memoryOverhead to the end of the spark-defaults file. The end of the file looks very similar to this example:
spark.driver.memory 1024M
spark.executor.memory 4305M
spark.default.parallelism 8
spark.logConf true
spark.executorEnv.PYTHONPATH /usr/lib/spark/python
spark.driver.maxResultSize 0
spark.worker.timeout 600
spark.storage.blockManagerSlaveTimeoutMs 600000
spark.executorEnv.PYTHONHASHSEED 0
spark.akka.timeout 600
spark.sql.shuffle.partitions 300
spark.yarn.executor.memoryOverhead 1000M
Similarly, the heap size can be controlled with the --executor-memory=xg flag or the spark.executor.memory property.
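If you'd rather keep these settings next to the job code instead of in spark-defaults.conf, they can also be set programmatically. A rough sketch with placeholder values (the spark.yarn.* property name matches the Spark versions shipped with EMR releases of this era; newer Spark calls it spark.executor.memoryOverhead, and the settings only take effect if applied before the SparkContext is created):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: set heap and overhead in code instead of spark-defaults.conf.
# The values are placeholders; adjust them to your cluster.
conf = (SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.yarn.executor.memoryOverhead", "1000"))

spark = SparkSession.builder.appName("memory-settings").config(conf=conf).getOrCreate()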
Hope this helps...
Related
I have an AWS CLI cluster creation command that I am trying to modify so that my driver and executor work with a customized log4j.properties file. With Spark stand-alone clusters I have successfully used the approach of passing the --files <log4j.file> switch together with -Dlog4j.configuration=<log4j.file> specified via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
I have tried many different permutations and variations, but have yet to get this working with the Spark job that I am running on an AWS EMR cluster.
I use the AWS CLI's create-cluster command with an intermediate step that downloads my Spark jar and unzips it to get at the log4j.properties packaged with that jar. I then copy the log4j.properties to my HDFS /tmp folder and attempt to distribute that log4j.properties file via --files. Note that I have also tried this without HDFS (specifying --files log4j.properties instead of --files hdfs:///tmp/log4j.properties) and that didn't work either.
My latest non-working version of this command (using HDFS) is given below. I'm wondering if anyone can share a recipe that actually does work. The output from the driver when I run this version is:
log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader#1e67b872.
log4j: Using URL [file:/etc/spark/conf.dist/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/etc/spark/conf.dist/log4j.properties
log4j: Parsing for [root] with value=[WARN,stdout].
From the above I can see that my log4j.properties file is not being picked up (the default is being used instead).
In addition to -Dlog4j.configuration=log4j.properties, I also tried configuring via -Dlog4j.configuration=classpath:log4j.properties (and again that failed).
Any guidance is much appreciated!
AWS COMMAND
jarPath=s3://com-acme/deployments/spark.jar
class=com.acme.SparkFoo
log4jConfigExtractCmd="aws s3 cp $jarPath /tmp/spark.jar ; cd /home/hadoop ; unzip /tmp/spark.jar log4j.properties ; hdfs dfs -put log4j.properties /tmp/log4j.properties "
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark \
--tags 'Project=mouse' \
'Owner=SwarmAnalytics'\
'DatadogMonitoring=True'\
'StreamMonitorRedshift=False'\
'DeployRedshiftLoader=False'\
'Environment=dev'\
'DeploySpark=False'\
'StreamMonitorS3=False'\
'Name=CCPASixCore' \
--ec2-attributes '{"KeyName":"mouse-spark-2021","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-07039960","EmrManagedSlaveSecurityGroup":"sg-09c806ca38fd32353","EmrManagedMasterSecurityGroup":"sg-092288bbc8812371a"}' \
--release-label emr-5.27.0 \
--log-uri 's3n://log-foo' \
--steps '[{"Args":["bash","-c", "$log4jConfigExtractCmd"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"downloadSparkJar"},{"Args":["spark-submit","--files", "hdfs:///tmp/log4j.properties","--deploy-mode","client","--class","$class","--driver-memory","24G","--conf","spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256 -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256 -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.yarn.executor.memoryOverhead=10g","--conf","spark.yarn.driver.memoryOverhead=10g","$jarPath"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"SparkFoo"}]'\
--instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"r5d.4xlarge","Name":"Core - 6"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"}]' \
--configurations '[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR","log4j.logger.org.apache.hadoop":"ERROR","log4j.appender.stdout":"org.apache.log4j.ConsoleAppender","log4j.logger.io.netty":"ERROR","log4j.logger.org.apache.spark.scheduler.cluster":"ERROR","log4j.rootLogger":"WARN,stdout","log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n","log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO"}},{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'\
--auto-terminate --ebs-root-volume-size 10 --service-role EMR_DefaultRole \
--security-configuration 'CCPA_dev_security_configuration_2' --enable-debugging --name 'SparkFoo' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1 --profile sandbox
Here is how to change the logging. The best way on AWS/EMR (that I have found) is to NOT fiddle with
spark.driver.extraJavaOptions or
spark.executor.extraJavaOptions
Instead, take advantage of a configuration block that looks like this:
[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",
And then, say you want to change all logging done by classes under com.foo and its descendants to TRACE. You'd change the above block to look like this:
[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"TRACE","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",
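If you want to confirm from inside a PySpark job that the spark-log4j settings actually took effect, a quick (and admittedly hacky) check is to ask the JVM-side log4j directly. This goes through the internal _jvm handle, so treat it as a debugging aid, not a stable API:

from pyspark.sql import SparkSession

# Debugging sketch: ask the JVM-side log4j what level com.foo ended up with.
# spark.sparkContext._jvm is an internal Py4J handle, so this is best-effort only.
spark = SparkSession.builder.appName("log4j-check").getOrCreate()
log4j = spark.sparkContext._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("com.foo")
print("effective level for com.foo:", logger.getEffectiveLevel().toString())
logger.trace("trace logging for com.foo is active")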
I am new to dataproc and PySpark. I created a cluster with the below configuration:
gcloud beta dataproc clusters create $CLUSTER_NAME \
--zone $ZONE \
--region $REGION \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--num-workers 3 \
--bucket $GCS_BUCKET \
--image-version 1.4-ubuntu18 \
--optional-components=ANACONDA,JUPYTER \
--subnet=default \
--enable-component-gateway \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--properties ${PROPERTIES}
Here are the property settings I am currently using, based on what I found on the internet.
PROPERTIES="\
spark:spark.executor.cores=2,\
spark:spark.executor.memory=8g,\
spark:spark.executor.memoryOverhead=2g,\
spark:spark.driver.memory=6g,\
spark:spark.driver.maxResultSize=6g,\
spark:spark.kryoserializer.buffer=128m,\
spark:spark.kryoserializer.buffer.max=1024m,\
spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark:spark.default.parallelism=512,\
spark:spark.rdd.compress=true,\
spark:spark.network.timeout=10000000,\
spark:spark.executor.heartbeatInterval=10000000,\
spark:spark.rpc.message.maxSize=256,\
spark:spark.io.compression.codec=snappy,\
spark:spark.shuffle.service.enabled=true,\
spark:spark.sql.shuffle.partitions=256,\
spark:spark.sql.files.ignoreCorruptFiles=true,\
yarn:yarn.nodemanager.resource.cpu-vcores=8,\
yarn:yarn.scheduler.minimum-allocation-vcores=2,\
yarn:yarn.scheduler.maximum-allocation-vcores=4,\
yarn:yarn.nodemanager.vmem-check-enabled=false,\
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
"
I want to understand whether this is the right property configuration for my cluster and, if not, how to assign the most suitable values to these properties (especially cores, memory and memoryOverhead) so that my PySpark jobs run as efficiently as possible. I am also asking because I am facing
this error: Container exited with a non-zero exit code 143. Killed by external signal.
It is important here to understand the configuration and limitations of the machines you are using, and how memory is allocated to spark components.
n1-standard-4 is a 4 core machine with 15GB RAM. By default, 80% of a machine's memory is allocated to YARN Node Manager. Since you are not setting it explicitly, in this case it will be 12GB.
Spark Executor and Driver run in the containers allocated by YARN.
Total memory allocated to a Spark executor is the sum of spark.executor.memory and spark.executor.memoryOverhead, which in this case is 10GB. I would advise you to allocate more memory to the executor than to the memoryOverhead, as the former is used for running tasks and the latter for special purposes. By default, spark.executor.memoryOverhead is max(384MB, 0.10 * executor.memory).
In this case, you can have only one executor per machine (10GB per executor versus 15GB machine capacity). With this configuration you are also underutilizing the cores, since each executor uses only 2 of them. It is advised to leave 1 core per machine for other OS processes, so it might help to change executor.cores to 3 here.
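To make that arithmetic concrete, here is a tiny sketch of the sizing calculation using the figures above (the 80% NodeManager share and the overhead default are the assumptions described in this answer):

import math

# Sketch of the sizing arithmetic above (all figures approximate).
node_ram_gb = 15                  # n1-standard-4
yarn_ram_gb = node_ram_gb * 0.80  # ~80% handed to the YARN NodeManager -> ~12 GB

executor_memory_gb = 8            # spark.executor.memory from the question
overhead_gb = 2                   # spark.executor.memoryOverhead from the question
# if overhead were left unset it would default to max(0.384, 0.10 * executor_memory_gb)

container_gb = executor_memory_gb + overhead_gb              # 10 GB per executor container
executors_per_node = math.floor(yarn_ram_gb / container_gb)  # -> 1 executor per machine
print(container_gb, executors_per_node)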
In general it is recommended to use default memory configurations, unless you have a very good understanding of all the properties you are modifying. Based on the performance of your application under default settings, you may tweak other properties. Also consider changing to a different machine type based on the memory requirements of your application.
References -
1. https://mapr.com/blog/resource-allocation-configuration-spark-yarn/
2. https://sujithjay.com/spark/with-yarn
I'm running an AWS EC2 m5.large (a non-burstable instance). I have set up one of AWS CloudWatch's default metrics (CPU %) plus some custom metrics (memory + disk usage) in my dashboard.
But when I compare the numbers CloudWatch reports to me, they are pretty far from the actual usage of the Ubuntu 20.04 server when I log in to it...
Actual usage:
CPU: ~ 35 %
Memory: ~ 33 %
CloudWatch report:
CPU: ~ 10 %
Memory: ~ 50-55 %
https://www.screencast.com/t/o1nAnOFjVZW
I have followed AWS own instructions to add the metrics for memory and disk usage (Because CloudWatch doesn't out of the box have access to O/S level stuff): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html
When numbers are so far from each other, it would be impossible to set up useful alarms and notifications. I can't believe that is what AWS wants to provide to the people who chose to follow their original instructions.
The only thing that matches exactly is the disk usage %.
HOW TO INSTALL THE AWS CLOUDWATCH AGENT ON UBUNTU 20.04 (NEWER WAY, REPLACING THE OLD "CloudWatchMonitoringScripts")
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/download-cloudwatch-agent-commandline.html
1. sudo wget https://s3.amazonaws.com/amazoncloudwatch-agent/debian/amd64/latest/amazon-cloudwatch-agent.deb
2. sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
3. sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
4. Go through all the steps in the wizard (The result is saved here: /opt/aws/amazon-cloudwatch-agent/bin/config.json)
Hint: I answered:
- Default to most questions and otherwise:
- NO --> Do you want to store the config in the SSM parameter store? (When I answered YES it failed later on because of a permission issue, I didn't know how to fix it, and I don't think I need SSM for this anyway)
- YES --> Do you want to turn on StatsD daemon?
- YES --> Do you want to monitor metrics from CollectD?
- NO --> Do you have any existing CloudWatch Log Agent?
Now to prevent this error: Error parsing /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml, open /usr/share/collectd/types.db: no such file or directory
https://github.com/awsdocs/amazon-cloudwatch-user-guide/issues/1
5. sudo mkdir -p /usr/share/collectd/
6. sudo touch /usr/share/collectd/types.db
7. sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
8. /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
{
"status": "running",
"starttime": "2020-06-07T10:04:41+00:00",
"version": "1.245315.0"
}
https://www.screencast.com/t/42VWgoS88Z (Create IAM role, add policies and make it the server default role).
https://www.screencast.com/t/fAUUHCPe (CloudWatch new custom metrics)
https://www.screencast.com/t/8J0Saw0co (data match OK now)
https://www.screencast.com/t/x0PxOa799 (data match OK now)
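To double-check that the agent's custom metrics are actually arriving, you can list what has landed in the CWAgent namespace. A small boto3 sketch (the region and namespace are assumptions, adjust them to your setup):

import boto3

# Sketch: list the metrics the CloudWatch agent has published under its default namespace.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="CWAgent"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])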
I realized that the second I log in to the machine, the CPU % usage goes up from 10 % to 30 % and stays there (of course some increase was to be expected, but not that much in my opinion), which in my case explains the big difference earlier... I honestly don't know if this way is more precise than the older script, but this should be the right way to do it in 2020 :-) And you get access to 179 custom metrics when selecting "Advanced" during the wizard (even though only a few would be valuable to most people).
I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: It gets further along when I manually create the Python file after SSH'ing into the cluster and specify it as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But it's still not doing the job.
Any help appreciated. This is really frustrating, as a well-documented method on AWS's own blog here https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit does not work.
I will ask, just in case: did you use the correct buckets and cluster IDs?
But anyway, I had similar problems; for example, I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, I think it worked. But I still needed something different (I can't remember exactly what), so I resorted to a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step to the cluster creation through the SDK in my Go application, but I can reconstruct that and give you the equivalent CLI command:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt> s3://<mybucket>/<outputFolder>/]
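If you prefer the SDK route, the equivalent call in boto3 looks roughly like this (a sketch only, with the same placeholder bucket, file names and cluster ID as the CLI command above, and an assumed region):

import boto3

# Sketch: the same spark-submit step added via the EMR API instead of the CLI.
emr = boto3.client("emr", region_name="us-west-2")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXX",
    Steps=[{
        "Name": "Spark Program",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--master", "local[4]",
                "/home/hadoop/wordcount.py",
                "s3://<mybucket>/<inputfile.txt>",
                "s3://<mybucket>/<outputFolder>/",
            ],
        },
    }],
)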
I searched for days and finally discovered this thread, which states:
PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file system. Until this is supported, the straightforward workaround then is to just copy the files to your local machine.
I want to monitor memory used by particular process under cloudwatch in AWS. Do I have to use script to do so? If yes, let me know the steps or some guideline or Can I use cloudwatch logs to report memory utilized by particular process in real time? Tell me the other alternatives as well.
Yes, you will need a script that runs on the instance you want to monitor. CloudWatch by default can only report on things it can 'see' at the hypervisor level, not things that are going on 'inside', so you'll need to create and report 'custom metrics'.
Here are some Linux script pointers:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/mon-scripts.html
and some for windows:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/mon-scripts-powershell.html
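If you only need the memory of one particular process rather than the whole script suite, you can push a custom metric yourself with a few lines of Python. A hedged sketch (psutil and boto3 being installed, the pid, the namespace, the region and the cloudwatch:PutMetricData permission on the instance role are all assumptions):

import boto3
import psutil

# Sketch: report the RSS of one process (pid 1234 here) as a custom CloudWatch metric.
process = psutil.Process(1234)
rss_mb = process.memory_info().rss / (1024 * 1024)

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_data(
    Namespace="Custom/ProcessMemory",
    MetricData=[{
        "MetricName": "RSS",
        "Dimensions": [{"Name": "ProcessName", "Value": process.name()}],
        "Value": rss_mb,
        "Unit": "Megabytes",
    }],
)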
Put this in a file called 001initial.config in the .ebextensions folder of the S3 bucket you're using for your application version. This will install the monitoring scripts and set them up as a cron job.
Note the perl modules that get installed.
You'll want to ssh into your box and test the script is running.
Go into security and update the IAM role for your EC2 instance with CloudWatch rights. Make sure to select the checkbox for the role and then click it to get to the rights page.
Once you know monitoring is running, go to the CloudWatch page, search for System/Linux from the very first page, and it will show you disk and memory stats.
---
files:
  "/etc/cron.d/my_cron":
    mode: "000644"
    owner: root
    group: root
    content: |
      # run a cloudwatch command every five minutes (as ec2-user)
      */5 * * * * ec2-user ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --disk-space-util --disk-path=/ --from-cron
    encoding: plain

commands:
  # delete backup file created by Elastic Beanstalk
  clear_cron_backup:
    command: rm -f /etc/cron.d/watson.bak

container_commands:
  02download:
    command: "curl http://aws-cloudwatch.s3.amazonaws.com/downloads/CloudWatchMonitoringScripts-1.2.1.zip -O"
    ignoreErrors: true
  03extract:
    command: "unzip CloudWatchMonitoringScripts-1.2.1.zip"
    ignoreErrors: true
  04rmzip:
    command: "rm CloudWatchMonitoringScripts-1.2.1.zip"
    ignoreErrors: true
  05cdinto:
    command: "mv aws-scripts-mon/ /home/ec2-user"
    ignoreErrors: true

packages:
  yum:
    perl-Switch: []
    perl-URI: []
    perl-Bundle-LWP: []
    perl-DateTime: []
    perl-Sys-Syslog: []
    perl-LWP-Protocol-https: []
While the reason provided by @EJBrennan in his answer is correct, a more recent update to this question is to simply install the scripts as described in this excellent documentation from AWS
AWS Documentation for Memory & Disk Metrics
So you need to
Install the scripts on your EC2 server
Push the metrics to CloudWatch using ./mon-put-instance-data.pl --mem-util --mem-used-incl-cache-buff --mem-used --mem-avail
Set up a dashboard in CloudWatch to see the metrics.
Alternatively, you can also set up a cron job to push the metrics on a periodic basis.
Hope that helps
You can try the AWS CloudWatch agent's procstat plugin. Apart from memory (using the memory_data measurement of procstat), you can monitor many other process metrics. I have answered here.
A sample JSON configuration file using procstat:
{
  "agent": {
    "metrics_collection_interval": 60,
    "region": "us-south-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/process-monitoring.log"
  },
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [
      [
        "AutoScalingGroupName"
      ]
    ],
    "force_flush_interval": 60,
    "metrics_collected": {
      "procstat": [
        {
          "pid_file": "/var/opt/data/myapp/tmp/sampleApp.pid",
          "measurement": [
            "memory_data",
            "memory_locked",
            "memory_rss"
          ],
          "metrics_collection_interval": 30
        }
      ]
    }
  }
}
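Once the agent is shipping procstat data, you can pull it back out for dashboards or alarms. A hedged boto3 sketch; the metric name (the agent prefixes measurements with procstat_), the dimensions and the region are assumptions, so check the CWAgent namespace in the console for the exact names your agent publishes:

import boto3
from datetime import datetime, timedelta

# Sketch: fetch the last hour of procstat memory samples back from CloudWatch.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="CWAgent",
    MetricName="procstat_memory_rss",
    Dimensions=[{"Name": "pid_file", "Value": "/var/opt/data/myapp/tmp/sampleApp.pid"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])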