Dataproc Initialization Script not running on master node - google-cloud-platform

I'm setting up a new Dataproc cluster and using an initialization action to run a custom script. The script runs fine on the two data nodes but does not execute on the master node.
I tried looking for logs under /var/log/dataproc-initialization-*.log but couldn't find the file on the master node.
Has anyone else faced this issue before?
Thanks in advance!!
gcloud command:
gcloud dataproc clusters create test-cluster \
--region=us-central1 --zone=us-central1-a \
--master-machine-type=n1-standard-4 --master-boot-disk-size=200 \
--initialization-actions=gs://dp_init_data/init2.sh --initialization-action-timeout="2m" \
--num-workers=2 --worker-machine-type=n1-standard-8 --worker-boot-disk-size=200
DataNode error log:
2019-07-11 03:29:22,123 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-268987178-10.32.1.248-1562675355441 (Datanode Uuid 71664f82-1d23-4184-b19b-28f86b01a251) service to exp-gcp-kerberos-m.c.exp-cdh-prod.internal/10.32.1.248:8051 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(10.32.1.60:9866, datanodeUuid=71664f82-1d23-4184-b19b-28f86b01a251, infoPort=0, infoSecurePort=9865, ipcPort=9867, storageInfo=lv=-57;cid=CID-aee57974-1706-4b8c-9654-97da47ad0464;nsid=128710770;c=1562675355441)

According to your DataNode error log, it seems you expect the init action to run first on the master and then on the workers. However, initialization actions run on all nodes in parallel, so you have to add your own logic to synchronize the master and the workers. The simplest option is to add a wait/retry loop on the workers; for something more reliable, have the master write a flag file to GCS when its init work is done and have the workers poll for that file (see the sketch below).
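A minimal sketch of the flag-file approach, assuming a single init script shared by all nodes. The dataproc-role metadata lookup and gsutil are standard on Dataproc images, but the flag object name and the poll timeout below are hypothetical; adjust them and make sure --initialization-action-timeout (2m in your command) is long enough to cover the wait:
#!/bin/bash
set -euo pipefail
FLAG="gs://dp_init_data/cluster-init-done.flag"   # hypothetical flag object
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == "Master" ]]; then
  # ... master-only setup here ...
  echo done > /tmp/init-done && gsutil cp /tmp/init-done "${FLAG}"   # signal the workers
else
  # Workers: poll GCS until the master publishes the flag (give up after ~10 minutes).
  for _ in $(seq 1 60); do
    gsutil -q stat "${FLAG}" && break
    sleep 10
  done
  # ... worker-only setup here ...
fi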

Related

How can I connect to a TLS-enabled DocumentDB cluster from Databricks Spark?

I want to connect to a DocumentDB cluster which has TLS enabled. I could do that from a Lambda function with the rds-combined-ca-bundle.pem file packaged alongside the Lambda code. I could not do the same from Databricks: all the nodes of the cluster need this file, and when Spark tries to connect it always times out. I tried to create the init scripts by following the link below:
https://learn.microsoft.com/en-us/azure/databricks/kb/python/import-custom-ca-cert
However, it does not help either. Let me know if anyone has any clue about this kind of use case.
Note: I can connect to a TLS-disabled DocumentDB cluster from the same Databricks instance.
If you are experiencing connection timeout errors when using an init script to import the rds-combined-ca-bundle.pem file on your Spark cluster, try the following steps:
Make sure that the rds-combined-ca-bundle.pem file is actually present on the nodes of your Spark cluster; a cluster-scoped init script runs on every node (driver and workers), and you will encounter connection timeout errors otherwise. A sketch of such an init script follows these steps.
Use the --conf option when starting spark-shell or spark-submit to point at the local path of the rds-combined-ca-bundle.pem file, for example:
spark-shell --conf spark.mongodb.ssl.caFile=path/to/rds-combined-ca-bundle.pem
Check the Spark cluster logs to see whether the init script is being executed correctly or whether it is encountering any errors.
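A minimal sketch of such an init script, assuming the bundle has already been uploaded to DBFS and that /dbfs is mounted on the node; the /dbfs/FileStore path and the target directory are hypothetical, and the resulting local path is what you would reference in --conf or your connection options:
#!/bin/bash
# Copy the CA bundle from DBFS onto the local filesystem of the node at cluster start,
# so the driver can reference it via a plain local path when connecting to DocumentDB.
mkdir -p /usr/local/share/docdb-certs
cp /dbfs/FileStore/certs/rds-combined-ca-bundle.pem /usr/local/share/docdb-certs/rds-combined-ca-bundle.pem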

Dataflow Job failing

I have a pipeline which requires a Dataflow job to run. I was using the gcloud CLI to start the Dataflow job, and it worked fine for over a month. But for the last three days the Dataflow job has been failing within 10-20 seconds with the following error log.
Failed to start the VM, launcher-2022012621245117717885921401920990, used for launching because of status code: UNAVAILABLE, reason: One or more operations had an error: 'operation-1643261093401-5d68989bed339-a33de830-9f90d92a': [UNAVAILABLE] 'HTTP_503'..
The command I'm using is:
gcloud dataflow sql query "SELECT tr.* FROM pubsub.topic.`my_project`.pubsub_topic as tr" \
  --job-name test_job \
  --region asia-south1 \
  --bigquery-write-disposition write-empty \
  --bigquery-project my_project \
  --bigquery-dataset test_dataset --bigquery-table table_name \
  --max-workers 1 --worker-machine-type n1-standard-1
I also tried starting the job from the Cloud Console with the same parameters, and it failed with the same error log. I had run the job from the console before and it worked fine; the issue started a couple of days ago.
What could be going wrong?
Thanks.
The Google Cloud error model indicates that a 503 means the service is unavailable [1].
You may try changing the region, for example from europe-north1 to europe-west4; that should work. Additionally, you shouldn't include your job ID on Stack Overflow.
[1] https://cloud.google.com/apis/design/errors#handling_errors
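A hedged sketch of such a retry, reusing the command from the question with only the region switched; asia-southeast1 here is just an example, so pick whichever region is valid for your project and data location:
gcloud dataflow sql query "SELECT tr.* FROM pubsub.topic.`my_project`.pubsub_topic as tr" \
  --job-name test_job \
  --region asia-southeast1 \
  --bigquery-write-disposition write-empty \
  --bigquery-project my_project \
  --bigquery-dataset test_dataset --bigquery-table table_name \
  --max-workers 1 --worker-machine-type n1-standard-1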

Amazon EMR and YARN deployment mode

I am learning Spark fundamentals, and in order to test my PySpark application I created an EMR cluster with Spark, YARN, Hadoop, and Oozie on AWS. I am able to execute a simple PySpark application from the driver node using spark-submit. I have the default /etc/spark/conf/spark-defaults.conf file created by AWS, which uses the YARN resource manager. Everything runs fine and I can monitor the tracking URL as well.
But I am not able to tell whether the Spark job is running in 'client' mode or 'cluster' mode. How do I determine that?
Excerpts from /etc/spark/conf/spark-defaults.conf
spark.master yarn
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory 4743M
spark.executor.cores 2
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory 2048M
Excerpts from my PySpark job:
import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from boto3.session import Session
conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
....# access S3 bucket
....
....
Is there a deployment mode called 'yarn-client' or is it just 'client' and 'cluster'?
Also, why is "num-executors" not specified in the config file by AWS? Is that something I need to add?
Thanks
It is determined by the option you pass when you submit the job; see the documentation.
Once you access the Spark history server from the EMR console or via the web UI, you can find the spark.submit.deployMode option in the Environment tab. In my case, it is client mode.
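If you would rather check from a shell on the master node, a rough sketch (assuming the standard EMR config location) is to see whether the defaults file pins a mode; if spark.submit.deployMode is neither set there nor passed via --deploy-mode, spark-submit falls back to client mode:
grep -i 'spark.submit.deployMode' /etc/spark/conf/spark-defaults.conf \
  || echo "spark.submit.deployMode not set; spark-submit defaults to client mode"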
By default, a Spark application runs in client mode, i.e. the driver runs on the node you submit the application from. Details about these deployment configurations can be found here. One easy way to verify it is to kill the submitting process with Ctrl+C in the terminal after the job reaches the RUNNING state: if it is running in client mode, the application dies; if it is running in cluster mode, it keeps running, because the driver is running on one of the worker nodes of the EMR cluster. A sample spark-submit command to run the job in cluster mode would be:
spark-submit --master yarn \
--py-files my-dependencies.zip \
--num-executors 10 \
--executor-cores 2 \
--executor-memory 5g \
--name sample-pyspark \
--deploy-mode cluster \
package.pyspark.main
By default, the number of executors is set to 1. You can check the default values for all Spark configs here.

aws: EMR cluster fails "ERROR UserData: Error encountered while try to get user data" on submitting spark job

I successfully started an AWS EMR cluster, but any job submission fails with:
19/07/30 08:37:42 ERROR UserData: Error encountered while try to get user data
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:296)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1711)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1748)
at com.amazon.ws.emr.hadoop.fs.util.UserData.getUserData(UserData.java:62)
at com.amazon.ws.emr.hadoop.fs.util.UserData.<init>(UserData.java:39)
at com.amazon.ws.emr.hadoop.fs.util.UserData.ofDefaultResourceLocations(UserData.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.buildSTSClient(AWSSessionCredentialsProviderFactory.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:17)
at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:22)
at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:130)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:86)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.doesBucketExist(AmazonS3LiteClient.java:90)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:139)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:116)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:508)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:111)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:190)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:146)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:144)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
userData.json isn't part of my application; it looks like it is an EMR internal file.
Any ideas what is wrong? I submit jobs via Livy requests.
Cluster setup:
2 core nodes m4.large
7 task nodes m5.4xlarge
1 master node m5.xlarge
The correct way to fix this is to run the following command as part of your bootstrap script when launching EMR (or, if running on a Glue endpoint, run it at any point on your endpoint):
chmod 444 /var/aws/emr/userData.json
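A minimal bootstrap-action sketch of that fix (the script name and S3 location are hypothetical; sudo is included in case the file is owned by root on the instances):
#!/bin/bash
# fix-userdata-perms.sh
# Upload this to S3 and reference it when creating the cluster, e.g.
#   --bootstrap-actions Path=s3://my-bucket/bootstrap/fix-userdata-perms.sh
sudo chmod 444 /var/aws/emr/userData.json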
I've faced a similar issue on AWS EMR emr-5.24.1 (Spark 2.4.1), but the jobs never actually failed.

Restart hiveserver2 on EMR

I killed the hiveserver2 process (after finding the PID with ps aux | grep -i hiveserver2) on my EMR cluster with one master and two workers. Before killing hiveserver2 I was able to browse and query Hive from my browser via Hue. I tried restarting it with hive --service hiveserver2, but now I can't connect from Hue anymore; it either hangs or says that it can't connect to <publicDNS>:10000.
My use case is that I want to modify the Hive configuration of my EMR cluster without shutting down the cluster. Is that possible at all?
On EMR releases where services are managed by upstart, you can list services and check or restart Hive Server 2 with:
initctl list
status hive-server2
sudo restart hive-server2
sudo stop hive-server2
sudo start hive-server2
See also: How do I restart a service in Amazon EMR? (AWS knowledge-center article)
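On newer EMR releases that have moved from upstart to systemd, the equivalent would be along these lines (a hedged sketch; the unit name can vary by release):
sudo systemctl status hive-server2
sudo systemctl stop hive-server2
sudo systemctl start hive-server2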
Hive configurations can be added when you launch your cluster, not after the cluster is already running; you can pass them as configuration settings at cluster-creation time.
For example, you can add your configurations to hive-site.xml using the following syntax (in Java):
Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.vectorized.execution.enabled","true");
hiveProperties.put("hive.vectorized.execution.reduce.enabled","true");
hiveProperties.put("hive.execution.engine","Tez");
hiveProperties.put("hive.auto.convert.join","true");
hiveProperties.put("hive.exec.parallel","true");
Configuration myHiveConfig = new Configuration()
.withClassification("hive-site")
.withProperties(hiveProperties);
List <Application> apps = new ArrayList<Application>();
apps.add(new Application().withName("Hadoop"));
apps.add(new Application().withName("Hive"));
apps.add(new Application().withName("Spark"));
//apps.add(new Application().withName("Pig"));
//apps.add(new Application().withName("Zeppelin-Sandbox"));
RunJobFlowRequest request = new RunJobFlowRequest()
.withName("abc")
.withReleaseLabel(emrVersion) //"emr-4.3.0"
.withServiceRole("EMR_DefaultRole")
.withConfigurations(myHiveConfig)
.withInstances(
new JobFlowInstancesConfig()
.withInstanceCount(numberofInstances)
.withKeepJobFlowAliveWhenNoSteps(true)
.withTerminationProtected(false)
.withMasterInstanceType(mserverType)
.withSlaveInstanceType(sserverType)
)
.withApplications(apps)
.withJobFlowRole("EMR_EC2_DefaultRole")
.withSteps(generalSteps);
More details in the link below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html
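For comparison, a sketch of passing the same hive-site classification through the AWS CLI instead of the Java SDK; the instance types and counts are illustrative, and the property list is abbreviated from the example above:
aws emr create-cluster \
  --name "abc" \
  --release-label emr-4.3.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.vectorized.execution.enabled":"true","hive.execution.engine":"Tez"}}]'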