I'm using Spark 2.4.5 on AWS EMR 5.30.0 with r5.4xlarge instances (16 vCores, 128 GiB memory, EBS-only storage, 256 GiB EBS volume): 1 master, 1 core, and 30 task nodes.
I launched the Spark Thrift Server on the master node, and it's the only job running on the cluster:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --conf spark.blacklist.enabled=true --conf spark.blacklist.stage.maxFailedExecutorsPerNode=4 --conf spark.blacklist.task.maxTaskAttemptsPerNode=3 --conf spark.driver.cores=12 --conf spark.driver.maxResultSize=10g --conf spark.driver.memory=86000M --conf spark.driver.memoryOverhead=10240 --conf spark.kryoserializer.buffer.max=768m --conf spark.rpc.askTimeout=700 --conf spark.sql.broadcastTimeout=800 --conf spark.sql.sources.partitionOverwriteMode=dynamic --conf spark.task.maxFailures=20
Then I run SQL queries against it over JDBC, but while heavy queries are running the UI gets really slow. I thought it would be fine if I set spark.driver.cores=12 (there are 16 on the master node) and spark.driver.memory=86000M (there is 128 GB of memory), leaving some margin for the master node to run other processes such as the history server, but it is still slow.
So I guess there are other settings I can tweak to make the UI work properly, but I'm not sure which ones.
These are the settings from spark-defaults.conf on the cluster, FYI:
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address <xxxxx>:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.yarn.dist.files /etc/spark/conf/hive-site.xml
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.dynamicAllocation.enabled true
spark.blacklist.decommissioning.enabled true
spark.blacklist.decommissioning.timeout 1h
spark.resourceManager.cleanupExpiredHost true
spark.stage.attempt.ignoreOnDecommissionFetchFailure true
spark.decommissioning.timeout.threshold 20
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS $(hostname -f)
spark.files.fetchFailure.unRegisterOutputOnHost true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem true
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds 2000
spark.sql.parquet.output.committer.class com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.sql.sources.partitionOverwriteMode dynamic
spark.executor.instances 1
spark.executor.cores 16
spark.driver.memory 2048M
spark.executor.memory 109498M
spark.default.parallelism 32
spark.emr.maximizeResourceAllocation true
The problem was having only one core instance: the event logs were saved in HDFS, so this single instance became a bottleneck.
I added another core instance and it's going much better now.
Another solution could be to save the logs to S3/S3A instead of HDFS by changing the following parameters in spark-defaults.conf (make sure they are changed in the UI config too), although it might require adding some JAR files to work. These are the current values, with a sketch of the S3A variant below them:
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
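For reference, a minimal sketch of what the S3A variant could look like; the bucket name is hypothetical, and depending on the EMR/Hadoop build you may need the hadoop-aws / AWS SDK JARs on the classpath for the history server to read it:
spark.eventLog.dir s3a://my-spark-logs-bucket/spark/apps
spark.history.fs.logDirectory s3a://my-spark-logs-bucket/spark/apps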
I started observing the below validation error on the EMR console.
Upon checking the status of the instance controller service, I observed that
the output of sudo systemctl status instance-controller.service is not consistent; it alternates between running and auto-restart.
The master node system logs show:
(console) 2023-02-03 21:55:23 About to start instance controller.
(console) 2023-02-03 21:55:23 Listing currently running instance controllers:
hadoop 8439 1 0 21:55 ? 00:00:00 /bin/bash -l /usr/bin/instance-controller
hadoop 8510 8439 0 21:55 ? 00:00:00 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.instancecontroller.Main
hadoop 8541 8439 0 21:55 ? 00:00:00 grep -i instance
root 8542 8439 0 21:55 ? 00:00:00 sudo tee -a /emr/instance-state/console.log-2023-02-03-21-55 /dev/console
oozie 26477 1 26 21:53 ? 00:00:22 /etc/alternatives/jre/bin/java -Xmx1024m -Xmx1024m -Doozie.home.dir=/usr/lib/oozie -Doozie.config.dir=/etc/oozie/conf -Doozie.log.dir=/var/log/oozie -Doozie.data.dir=/var/lib/oozie -Doozie.instance.id=ip-10-111-24-159.pvt.lp192.cazena.com -Doozie.config.file=oozie-site.xml -Doozie.log4j.file=oozie-log4j.properties -Doozie.log4j.reload=10 -Djava.library.path= -cp /usr/lib/oozie/embedded-oozie-server/*:/usr/lib/oozie/embedded-oozie-server/dependency/*:/usr/lib/oozie/lib/*:/usr/lib/oozie/libtools/*:/usr/lib/oozie/libext/*:/usr/lib/oozie/embedded-oozie-server:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/conf/*:/usr/share/aws/emr/emrfs/auxlib/* org.apache.oozie.server.EmbeddedOozieServer
root 27455 1 42 21:54 ? 00:00:35 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.logpusher.Main /etc/logpusher/logpusher.properties
(console) 2023-02-03 21:55:23 Displaying last 10 lines of instance controller logfile:
2023-02-03 21:55:17,719 INFO main: isV2FrameworkEnabled: false, extraInstanceData.numCandidates: 1
2023-02-03 21:55:17,735 WARN main: Invalid metrics information null fetched from checkpoint, will start continuing from current moment instead.
2023-02-03 21:55:17,735 INFO main: Initialized YARN checkpointing state with ckpFileAvl: true, ckpInfo: [ lastCkpTs(0), totalHdfsBytesReadCompletedApps(0), totalHdfsBytesWrittenCompletedApps(0), totalS3BytesReadCompletedApps(0), totalS3BytesWrittenCompletedApps(0)]
2023-02-03 21:55:17,745 ERROR main: Thread + 'main' failed with error
java.lang.RuntimeException: LocalStartupState is FAILED, so not allowing instance controller to start
at aws157.instancecontroller.common.InstanceConfigurator.hasAlreadyBeenConfigured(InstanceConfigurator.java:124)
at aws157.instancecontroller.common.InstanceConfigurator.<init>(InstanceConfigurator.java:100)
at aws157.instancecontroller.InstanceController.<init>(InstanceController.java:223)
at aws157.instancecontroller.Main.runV1Framework(Main.java:239)
at aws157.instancecontroller.Main.main(Main.java:222)
I tried restarting the service multiple times with
sudo systemctl start instance-controller.service, and I rebooted the node hoping the service would come back after the reboot, but it is not working. (By the way, this worked in a lower environment.)
Jobs on the cluster are running fine without any issues, but I am not able to see application logs pushed to S3 or on the console.
I need inputs on how to restart the instance controller service.
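For completeness, a sketch of the commands that could help gather more detail (the log paths assume the standard EMR layout and may differ between releases):
# Current service state and recent startup attempts
sudo systemctl status instance-controller.service
sudo journalctl -u instance-controller.service --since "1 hour ago"
# The instance controller logfile quoted above
sudo tail -n 100 /emr/instance-controller/log/instance-controller.log
# Startup/console snapshots
ls -lt /emr/instance-state/ | head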
I was streaming Kafka on AWS EC2 CentOS 7. My Session Manager idle timeout is set to 60 minutes, and yet, after running for much less than that, the terminal froze, saying my session had been terminated. Of course, the Kafka streaming was disrupted as well.
When I tried to start a new session in a new terminal, I got this error popup:
Your session has been terminated for the following reasons: Plugin with name Standard_Stream not found. Step name: Standard_Stream
and I am still unable to restart a terminal.
What does this error mean and how can I resolve it? Thanks.
So far, to debug you need to access the EC2 instance over SSH with the key .pem file
(ask your admin).
Running tail -f hit this issue:
tail: inotify resources exhausted
tail: inotify cannot be used, reverting to polling
Restarting the ssm-agent service also failed with No space left on device,
but it's not about disk space:
[root@env-test ec2-user]# systemctl restart amazon-ssm-agent.service
Error: No space left on device
[root@env-test ec2-user]# df -h | grep dev
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
/dev/nvme0n1p1 100G 82G 18G 83% /
So the error itself means that the system is running low on inotify
watches, which let programs monitor file/directory changes. To see
the currently set limit (output from my machine included):
$ cat /proc/sys/fs/inotify/max_user_watches
8192
Check which processes are using inotify, then either fix those apps or increase max_user_watches:
for foo in /proc/*/fd/*; do readlink -f $foo; done | grep inotify | sort | uniq -c | sort -nr
5 /proc/1/fd/anon_inode:inotify
2 /proc/7126/fd/anon_inode:inotify
2 /proc/5130/fd/anon_inode:inotify
1 /proc/4497/fd/anon_inode:inotify
1 /proc/4437/fd/anon_inode:inotify
1 /proc/4151/fd/anon_inode:inotify
1 /proc/4147/fd/anon_inode:inotify
1 /proc/4028/fd/anon_inode:inotify
1 /proc/3913/fd/anon_inode:inotify
1 /proc/3841/fd/anon_inode:inotify
1 /proc/31146/fd/anon_inode:inotify
1 /proc/2829/fd/anon_inode:inotify
1 /proc/21259/fd/anon_inode:inotify
1 /proc/1934/fd/anon_inode:inotify
Notice that the inotify list above includes the PIDs of the ssm-agent
processes, which explains why SSM runs into trouble once the
max_user_watches limit is reached:
ps -ef | grep ssm-ag
root 3841 1 0 00:02 ? 00:00:05 /usr/bin/amazon-ssm-agent
root 4497 3841 0 00:02 ? 00:00:33 /usr/bin/ssm-agent-worker
Final solution, a permanent fix (preserved across reboots):
echo "fs.inotify.max_user_watches=1048576" >> /etc/sysctl.conf
sysctl -p
Verify:
$ aws ssm start-session --target i-123abc456efd789xx --region ap-northeast-2
Starting session with SessionId: userdev-03ccb1a04a6345bf5
sh-4.2$
This issue comes from the EC2 instance, not from the SSM agent. Go to the link to
understand the SSM agent.
(optional link)
In my case, extending the disk space worked!
(The syslog had filled the disk in my case.)
In my case too, extending the disk space worked, as my /var/log was huge.
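A quick way to see what is actually filling the disk before extending it (just a sketch; /var/log is simply the usual suspect mentioned above):
df -h /
sudo du -sh /var/log/* | sort -h | tail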
I want to change dm.basesize to set the size of my containers to 20 GB.
These are my current block devices:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 8G 0 disk
`-xvda1 202:1 0 8G 0 part /
xvdf 202:80 0 8G 0 disk
xvdg 202:96 0 8G 0 disk
I have this shell script:
#cloud-boothook
#!/bin/bash
cloud-init-per once docker_options echo 'OPTIONS="${OPTIONS} --storage-opt dm.basesize=20G"' >> /etc/sysconfig/docker
I executed this script and then stopped the Docker service:
[ec2-user@ip-172-31-41-55 ~]$ sudo service docker stop
Redirecting to /bin/systemctl stop docker.service
[ec2-user@ip-172-31-41-55 ~]$
Then I started the Docker service again:
[ec2-user@ip-172-31-41-55 ~]$ sudo service docker start
Redirecting to /bin/systemctl start docker.service
[ec2-user@ip-172-31-41-55 ~]$
But the container size doesn't change.
This is the /etc/sysconfig/docker file:
# The max number of open files for the daemon itself, and all
# running containers. The default value of 1048576 mirrors the value
# used by the systemd service unit.
DAEMON_MAXFILES=1048576
# Additional startup options for the Docker daemon, for example:
# OPTIONS="--ip-forward=true --iptables=true"
# By default we limit the number of open files per container
OPTIONS="--default-ulimit nofile=1024:4096"
# How many seconds the sysvinit script waits for the pidfile to appear
# when starting the daemon.
DAEMON_PIDFILE_TIMEOUT=10
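For reference, one way to check whether the daemon actually picked up a dm.basesize option (a sketch, assuming the devicemapper storage driver is in use):
grep OPTIONS /etc/sysconfig/docker
docker info | grep -i "base device size"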
I read in the AWS documentation that I can execute scripts on the instance when I launch it. I don't want to restart my instance because I would lose my data.
Is there a way to update my container size without restarting the instance?
In the AWS documentation I can't find how to set a script to run when I launch the instance.
I followed this tutorial:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html
but I can't find an example there of how to set such a script.
UPDATE
I configured the file /etc/docker/daemon.json:
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.directlvm_device=/dev/xdf",
    "dm.thinp_percent=95",
    "dm.thinp_metapercent=1",
    "dm.thinp_autoextend_threshold=80",
    "dm.thinp_autoextend_percent=20",
    "dm.directlvm_device_force=false"
  ]
}
When I start Docker, I get:
Error starting daemon: error initializing graphdriver: /dev/xdf is not available for use with devicemapper
How can I correctly configure the dm.directlvm_device=/dev/xdf parameter?
I am trying out various options for setting the Spark driver memory in YARN.
Use Case:
I have a Spark cluster with 1 master and 2 slaves:
master: r5d.xlarge - 8 vCores, 32 GB
slave: r5d.xlarge - 8 vCores, 32 GB
I am using Apache Zeppelin to run queries on the Spark cluster. The Spark interpreter is configured with the properties provided through Zeppelin. I am using Spark 2.3.1 running on YARN. I want to create 4 interpreters so that 4 users can use this cluster in parallel.
Config 1:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 24G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Config 2:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 12G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Questions:
Why is the driver's container size 0?
Is spark.memory.fraction calculated as (spark.driver.memory - 300) * 0.6? If so, why is it not exact? (14.22 and 7.02 respectively; see the rough calculation sketched after the driver config below.)
Why is the executor container size 3.8 GB? According to my configuration it should be 3G + 512M = 3.5 GB. This issue was not there with Spark 2.1.
The number of VCores available to YARN is 8 per node. How is this possible, given that AWS advertises vCPUs for these instances? According to AWS I should only be getting 4 VCores.
https://aws.amazon.com/ec2/instance-types/r5/
If I want to use 4 interpreters, should I distribute the master's 32 GB equally across all the interpreters? For example, per interpreter:
Driver:
spark.driver.cores 2
spark.driver.memory 7G
spark.driver.memoryOverhead 1024M
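For the spark.memory.fraction question, this is how I am sketching the calculation (my assumption: unified memory = (heap - 300 MB reserved) * spark.memory.fraction with the default fraction of 0.6; the UI uses the JVM's actual max heap, which is close to but not exactly spark.driver.memory, so small deviations are expected):
echo "scale=2; (24 * 1024 - 300) * 0.6 / 1024" | bc   # ~14.22, matching the 24G driver (Config 1)
echo "scale=2; (12 * 1024 - 300) * 0.6 / 1024" | bc   # ~7.02, matching the 12G driver (Config 2)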
I'm trying to run a stand-alone Spark application on an EC2 YARN cluster from the command line. I'm submitting the following spark-submit script:
./bin/spark-submit --class PageRankGraphX --master yarn-cluster --properties-file spark-defaults.conf.2 --executor-memory 2G --total-executor-cores 5 ./SparkPageRank-assembly-1.0.jar s3://linkfilefull/full/links_small.txt s3://conansoutputbucket/smalloutput.txt 10 0.15 2
This is the output; there is no exception or error thrown, the job simply fails after running:
15/04/15 21:27:03 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/
appUser: hadoop
15/04/15 21:27:04 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/A
appUser: hadoop
Does anyone know what could be causing this, or how I could investigate? When I try to access the YARN logs, it says logs are disabled or not ready.
Check out Amazon's documentation on enabling access to the Hadoop web UI. Once in the UI, you can check the application's stderr output, which is where the exception will most likely be. As others mentioned, this log will also be available on S3.
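If log aggregation is enabled on the cluster, you can also pull the container logs from the command line; a sketch, reusing the application ID from the report above:
yarn logs -applicationId application_1429126831428_0027
# If this says aggregation is not enabled, set yarn.log-aggregation-enable=true
# in yarn-site.xml (and restart YARN), or fall back to the stderr files via the
# web UI / S3 as described above.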