EMR not generating step logs - amazon-web-services

For some reason, I no longer see step logs for my jobs in EMR. It worked fine until recently, but logging has simply stopped.
I checked HDFS at the path /mnt/var/log/hadoop/steps/, but there are no logs there. The steps complete successfully; there are just no logs.
Is there anything I can do to diagnose the issue and get logging working again?
Thanks in advance for taking time to read and respond.
All the best.
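One place to start diagnosing: EMR only ships step logs if a log URI was configured when the cluster was created. A quick check, assuming the AWS CLI is configured and `j-XXXXXXXX` is a placeholder for your cluster id:

```shell
# Returns the S3 log destination configured at cluster creation,
# or null if logging was never enabled for this cluster.
aws emr describe-cluster --cluster-id j-XXXXXXXX \
    --query 'Cluster.LogUri'
```

If this returns null, step logging was not enabled; it can only be set when the cluster is launched, not afterwards.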

Related

How to debug an aws glue pyspark job

I have an AWS Glue PySpark job that hangs after a certain command. After that command, nothing is written to the log, not even a simple `print("hello")` statement.
How can I debug an AWS Glue PySpark job that is long running and not even writing logs? The job doesn't throw any errors; it just shows a running status in the console.
AWS Glue is based on Apache Spark, which means that until an action is called there is no actual execution. So even if you put print statements in between and see them in the logs, that doesn't mean your job has executed up to that point. Since your job is long running, check this article by AWS, which explains Debugging Demanding Stages and Straggler Tasks. This blog is also worth a look.
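The lazy-evaluation point can be illustrated without Spark at all; Python generators behave analogously. A rough sketch (an analogy, not Glue/Spark itself):

```python
log = []  # records when "work" actually happens

def transform(data):
    # Like a Spark transformation: calling this builds a lazy pipeline
    # but runs nothing yet.
    for x in data:
        log.append(x)   # side effect happens only when the pipeline is consumed
        yield x * 2

plan = transform([1, 2, 3])   # pipeline defined; nothing has executed
assert log == []              # no "action" yet, so no work was done

result = list(plan)           # the "action": now the work actually runs
assert result == [2, 4, 6]
assert log == [1, 2, 3]
```

In the same way, a Spark driver can log lines long before (or without) the corresponding distributed work having run, so driver-side prints are a poor guide to job progress.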

simple AWS Batch array job stuck in pending

I'm following the print-color AWS Batch tutorial for an array job from the official AWS Batch User Guide (page 23, https://docs.aws.amazon.com/batch/latest/userguide/batch_user.pdf). It is supposed to be a very simple tutorial, but my submitted array job stays stuck in PENDING indefinitely.
Does anybody have an idea? I can't find any more information to tell whether this is a bug, and there is nothing in CloudWatch. (Screenshot: pending job.)
Thanks in advance.
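When a Batch job sits in PENDING, the job's status reason often says why (for example, a dependency or an unsatisfiable resource requirement). A sketch of how to pull it with the AWS CLI, assuming `JOB_ID` is a placeholder for the stuck job's id:

```shell
# Print the scheduler's explanation for the job's current state, if any.
aws batch describe-jobs --jobs JOB_ID \
    --query 'jobs[0].statusReason'
```

Checking the compute environment's state and capacity the same way (`aws batch describe-compute-environments`) can also help, since jobs pend when no environment can place them.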

Why do Dataflow steps not start?

I have a linear three-step Dataflow pipeline. For some reason the last step started, but the two preceding steps hung in the Not started state for a long time before I gave up and killed the job. I'm not sure what caused this, as this same pipeline had run successfully in the past, and I'm surprised the logs showed no errors about what was preventing the first two steps from starting. What can cause such a situation, and how can I prevent it?
This was happening because of an error in the worker start up. Certain Dataflow steps do not seem to require workers (e.g. writing to GCS), which is why that step was able to start - i.e. that step starting does not imply that workers are being created correctly. Worker start up is not displayed in the job logs by default - you need to click the link to Stackdriver in the job logs and then add worker-startup in the logs drop down in order to see any of those errors.
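The same worker-startup stream can also be pulled from the command line instead of clicking through to Stackdriver. A sketch, assuming the Cloud SDK is installed and `JOB_ID` / `my-project` are placeholders:

```shell
# Read the worker-startup log entries for one Dataflow job; this stream
# is hidden from the default job-log view in the console.
gcloud logging read \
  'resource.type="dataflow_step"
   AND resource.labels.job_id="JOB_ID"
   AND logName:"worker-startup"' \
  --project=my-project --limit=50
```

If workers never come up cleanly, the errors usually appear here rather than in the job logs.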

Spark step on EMR just hangs as "Running" after done writing to S3

Running PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done with a _SUCCESS file written to S3 and Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't already, you should close your spark context.
sc.stop()
Also, if you are watching the Spark Web UI in a browser, close it, as it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It had something to do with client mode: the driver runs on the master instance, and the spark-submit process gets stuck even though the Spark context has closed. This caused the instance controller to poll continuously for the process, since it never receives a completion signal. Running the driver on one of the instance nodes using the option above doesn't have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR and solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
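The reason the exit-code trick works: sys.exit(0) just raises SystemExit with code 0, and when that propagates uncaught, the interpreter terminates the process with a success status that EMR's step poller can observe. A minimal illustration of the mechanism:

```python
import sys

def finish_cleanly():
    # sys.exit(0) raises SystemExit(0); left uncaught, it ends the
    # process with exit status 0, signalling completion to the poller.
    sys.exit(0)

try:
    finish_cleanly()
except SystemExit as e:      # caught here only so we can inspect it
    exit_code = e.code

assert exit_code == 0
```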

Spark application stops when reading file from s3

I have an application which runs on EMR and reads a csv file from s3.
However, the whole thing seems to stop (I've let it run for about an hour) when I try to read that file from S3. Nothing happens, and nothing more is written to the logs, except that the application is still running. The step in which this application runs does not fail!
I've tried copying the file to the cluster via the --files flag of spark-submit and reading it directly in the application with sc.textFile(filename).
Is there anything I am missing?
After a while I finally got back to this problem and could "solve" it myself (though I don't really know what the underlying problem was).
It seems Spark was failing to allocate worker nodes. After setting spark.dynamicAllocation.enabled to true, everything works as expected.
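That setting can also be passed per job at submit time rather than in the cluster configuration. A sketch, assuming the job is submitted as an EMR step and `my_app.py` is a placeholder for the application:

```shell
# Enable dynamic executor allocation for this submission only.
# (Dynamic allocation also needs the external shuffle service, which
# EMR enables by default via spark.shuffle.service.enabled=true.)
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  my_app.py
```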