EMR: Issue in running pyspark application - amazon-web-services

I am currently working with EMR 6.4.0 and facing an issue while running a PySpark application. The code was working fine, but it suddenly started failing. I am stuck on two errors and have no clue how to resolve them.
The objective of the code is to read data from Snowflake, save temporary data on S3, and at the end write the results back to a different Snowflake table.
1) NoClassDefFoundError:
I am getting the below error on my EMR Spark step. I have looked into many posts, but I am still not clear on how to fix it:
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
**Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException**
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
**Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException**
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
Command exiting with ret '1'
I submit my PySpark code using the below command in the EMR steps, on an m4.large instance for testing (in PROD I have a bigger instance type, m5.8xlarge).
spark-submit --deploy-mode cluster --master yarn --driver-memory 4g --executor-memory 1g --executor-cores 1 --num-executors 1 --conf spark.rpc.message.maxSize=100 --jars /home/hadoop/configure_cluster/snowflake-jdbc-3.13.8.jar,/home/hadoop/configure_cluster/spark-snowflake_2.12-2.9.1-spark_3.1.jar --py-files /home/hadoop/spark_utils.zip /home/hadoop/weibull_2.py dev dafehv-dse-weibull-processing-dev
2) InvalidResourceRequestException:
As shown in the command above, I try to limit memory by specifying it in the spark-submit command, but I can see the below error in the logs:
diagnostics: Uncaught exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request! Cannot allocate containers as requested resource is greater than maximum allowed allocation. Requested resource type=[memory-mb], Requested resource=<memory:35789, max memory:2147483647, vCores:2, max vCores:2147483647>, maximum allowed allocation=<memory:6144, vCores:4>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:6144, vCores:128>
Why does Spark try to allocate containers with resources not mentioned in the spark-submit command? I am lost here; I have been trying to fix these two issues since last week, but to no avail.
I haven't worked a lot with Spark. Can anyone please guide me on how to proceed to fix these issues?

Finally, I found what the issue was.
Through many other posts on Stack Overflow, I learned that the underlying Spark and Hadoop versions were somehow mismatched, which is why I had to change my hadoop-aws and aws-sdk jars.
This also made me change the Snowflake JDBC and spark-snowflake driver jars.
This solved my problem, and the code now seems to be running.
For point 2: I found that the driver and executor memory were configured in the Spark config, which took precedence over the spark-submit options.
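For context on why the request in the error was so large: YARN is asked for the configured memory plus a memory overhead, which by default is the larger of 384 MB and 10% of the configured value. A rough sketch of that arithmetic (the 32 GB figure below is purely illustrative, not taken from my actual cluster config):

```python
# YARN container request = configured driver/executor memory + memoryOverhead,
# where memoryOverhead defaults to max(384 MB, 10% of the configured memory).
def yarn_request_mb(memory_mb: int, overhead_fraction: float = 0.10) -> int:
    overhead = max(384, int(memory_mb * overhead_fraction))
    return memory_mb + overhead

print(yarn_request_mb(4096))   # the 4g driver from spark-submit -> 4505 MB, fits a 6144 MB node
print(yarn_request_mb(32768))  # an illustrative 32g from cluster config -> 36044 MB, rejected
```

On EMR, such cluster-level defaults typically live in /etc/spark/conf/spark-defaults.conf (spark.driver.memory, spark.executor.memory), so that file is worth checking when spark-submit values appear to be ignored.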
Update, 14 Nov:
I faced a few more errors, all caused by hadoop-aws jar issues. Check out the link below (the answer by LaserJesus); it solved my problems:
https://stackoverflow.com/a/72931205/3186923
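The underlying theme of those fixes is jar alignment: hadoop-aws must match the Hadoop version the cluster actually runs, and the aws-java-sdk(-bundle) must be the version that hadoop-aws was built against. A hedged sketch of how one might check on the cluster (paths are typical for EMR, not guaranteed):

```shell
# Print the Hadoop version the cluster runs (first line of output).
hadoop version | head -n 1

# On EMR the bundled Hadoop jars usually live here; the version suffix on
# hadoop-common tells you which hadoop-aws version to pair with it.
ls /usr/lib/hadoop/ | grep hadoop-common
```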
Update, 16 Nov:
Check this link as well:
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException

Related

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" is caused by worker setup taking too long. To solve this issue, you can try increasing worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) solves the problem.
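For a Beam Python pipeline, the machine type is just a pipeline option passed at launch time. A minimal sketch (the script name, project, region, and bucket below are placeholders):

```shell
# Launch the pipeline on Dataflow with a larger worker machine type.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=europe-west1 \
  --temp_location=gs://my-bucket/tmp \
  --machine_type=n1-standard-4
```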
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip issues with installing dependencies.
Do you have any error logs showing that the Dataflow workers are crashing when trying to start?
If not, maybe the worker VMs start but can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project); you can switch to a specific one with --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster that has one of the outdated versions (for example, 1.3.94-debian10) that contain the vulnerabilities in Apache Log4j 2 utility. The goal is to get the alert related (DATAPROC_IMAGE_OUTDATED), in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10, but got the following message: "ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster". That makes sense, since it protects the cluster.
I did some research and discovered that I will have to create a custom image with that version and generate the cluster from it. The thing is, I have tried to read the documentation and to find a tutorial, but I still can't understand how to start, or how to run the file generate_custom_image.py, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you
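As a hedged sketch of the usual flow with the GoogleCloudDataproc/custom-images generator (the image name, bucket, zone, and customization script below are placeholders, and flag names follow that repository's documentation):

```shell
# Clone Google's custom-image generator and build an image from the old version.
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images

python generate_custom_image.py \
  --image-name my-log4j-test-image \
  --dataproc-version 1.3.94-debian10 \
  --customization-script my-noop-script.sh \
  --zone us-east1-b \
  --gcs-bucket gs://my-bucket

# Then create the cluster from the custom image instead of --image-version:
gcloud dataproc clusters create dataproc-cluster \
  --region=us-east1 \
  --image=projects/my-project/global/images/my-log4j-test-image
```

The script can also be run from a local terminal with the gcloud SDK installed and authenticated, so Cloud Shell itself is not required.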

Dataflow job failing due to ZONE_RESOURCE_POOL_EXHAUSTED in europe-west3 region

My Dataflow job has been failing since 7 AM this morning with the error:
Startup of the worker pool in zone europe-west3-c failed to bring up any of the desired 1 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance '' creation failed: The zone 'projects//zones/europe-west3-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
I tried to launch the job in europe-west3-a and europe-west3-b and I get the same error. It's been well over 12 hours, but the problem persists. I know this is not a general resource-availability problem, as I can create a new VM in that region without any problems.
I even have a case open with Google Support, but unfortunately they don't even read my ticket and simply send a standard reply asking me to do things I've already tried.
Any idea what I can do here?
Update 1:
I tried to create a new job with --worker-machine-type=e2-standard-2, and that works. The problem seems to be related to the default machine type the service picks.
Update 2:
We are now going into day 2 of the problem in europe-west3. Our dev environment is in europe-west1 and this problem doesn't occur there.
This error occurs due to the current unavailability of Compute Engine resources, such as GPUs, in that zone.
It is not related to your Compute Engine quota.
You can resolve the issue by creating the resource in another zone within the region, or in a different region.
You can read more information and different resolutions regarding this error in this document.
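Following the answer above, a sketch of relaunching the same Beam Python pipeline while pinning a different region, or a specific worker zone within it (the script name, project, and bucket are placeholders):

```shell
# Retry in a different region entirely:
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=europe-west1 \
  --temp_location=gs://my-bucket/tmp

# Or stay in the region but pin a specific worker zone:
#   --region=europe-west3 --worker_zone=europe-west3-a
```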

Yarn UI shows no active node while it appeared in HDFS UI

I've set up Hadoop on my laptop,
and when I submit a job (through MapReduce or Tez),
the status is always ACCEPTED, the progress is stuck at 0%, and the description says something like "waiting for AM container to be allocated".
When I check the nodes through the YARN UI (localhost:8088),
it shows that there are 0 active nodes.
But the HDFS UI (localhost:50070) shows that there is one live node.
Is that the main reason the job is stuck, since there are no available nodes? If that's the case, what should I do?
Your YARN UI shows zero vcores and zero memory, so there is no way for any job to run, since you lack computing resources. The DataNode is only for storage (HDFS in this case) and has nothing to do with why your application is stuck.
To fix your problem, you need to update your yarn-site.xml and provide settings for the memory and vcore properties described in the following:
http://blog.cloudera.com/blog/2015/10/untangling-apache-hadoop-yarn-part-2/
You might consider using a Cloudera QuickStart VM or Hortonworks Sandbox, at least as a reference for yarn-site.xml configuration values.
https://www.cloudera.com/downloads/quickstart_vms/5-10.html
https://hortonworks.com/products/sandbox/
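As a minimal sketch, the properties in question look like the following in yarn-site.xml (the 4 GB / 2-vcore values are illustrative placeholders; size them to the machine, and restart YARN afterwards):

```xml
<configuration>
  <!-- Total resources this NodeManager advertises to the ResourceManager. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
  </property>
  <!-- Bounds for a single container request. -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
</configuration>
```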

Spark step on EMR just hangs as "Running" after done writing to S3

Running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just cleaning itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars, to no avail.
I'm using the default Spark 2.0 configuration, so I don't think anything else, like metadata, is being written. Also, the size of the data doesn't seem to have an impact on this problem.
If you aren't already, you should close your spark context.
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close it, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It has something to do with the fact that, when running in client mode, the driver runs on the master instance, and the spark-submit process gets stuck even after the Spark context closes. This causes the instance controller to poll continuously for the process, as it never receives a completion signal. Running the driver on one of the cluster nodes using the above option doesn't have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR, and I solved it by calling sys.exit(0) at the end of my Python script. The same worked in a Scala program with System.exit(0).