I have an AWS EMR cluster job which runs every 2 hours. I have set up a CloudWatch schedule to trigger it every two hours.
But sometimes the next job (which starts 2 hours after the previous one) kicks off while the previous one has not finished, as it can take more than 2 hours to complete depending on the data to be processed.
I need some configuration by which I can prevent the next job from starting while the previous job is still running.
I tried but couldn't find any such setting. Does anyone know how to do this?
Add them as EMR steps. EMR steps run sequentially by default (unless you change the step concurrency setting).
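A minimal sketch of what that can look like with boto3 (the cluster ID, step name, and spark-submit arguments below are placeholders, not from the question):

```python
# Hedged sketch: submit the recurring job as an EMR step instead of launching it
# independently from CloudWatch. Steps on a cluster run one at a time by default.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "two-hourly-batch",  # hypothetical step name
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/jobs/batch_job.py",  # placeholder script location
                ],
            },
        }
    ],
)
```

The CloudWatch rule can keep firing every two hours; because each new step is queued behind any step that is still running, the runs no longer overlap.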
I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Also, other jobs in the same account run fine, so it cannot be a problem with account-wide service quotas.
Why am I getting this error?
I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name
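If you prefer to apply the first workaround from a script rather than the console, a rough sketch with boto3 could look like the following. The job name is a placeholder, and because UpdateJob overwrites the whole job definition, exactly which fields you need to drop or keep before calling it may vary with how the job was created:

```python
# Hedged sketch: raise MaxConcurrentRuns on an existing Glue job.
# "my-etl-job" is a placeholder name.
import boto3

glue = boto3.client("glue")

job = glue.get_job(JobName="my-etl-job")["Job"]

# UpdateJob replaces the whole job definition, so start from the current one
# and remove the read-only / response-only fields before sending it back.
for key in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"):
    job.pop(key, None)
# If the job uses WorkerType/NumberOfWorkers, MaxCapacity may also have to be
# dropped, since the API does not accept both at once.
job["ExecutionProperty"] = {"MaxConcurrentRuns": 2}

glue.update_job(JobName="my-etl-job", JobUpdate=job)
```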
The Glue container takes some time to start, and likewise takes some time to shut down after your job ends. If you try to start a new job run in that window, and the default concurrency is 1, you will get this error.
How to resolve:
Go to your Glue job --> under the "Job details" tab you can find "Maximum concurrency". The default value is 1; change it to 3 or more as per your need.
I tried changing "Maximum concurrency" to 2 and then ran it.
It worked, but running it again caused the same issue. However, when I looked into my S3 bucket, it had dumped the data, so it did run once.
I'm still looking for a stable solution, but this may work!
I'm relatively new to Spark. I have a Spark job that runs on an Amazon EMR cluster of 1 master and 8 cores. In a nutshell, the Spark job reads some .csv files from S3, transforms them to RDDs, performs some relatively complex joins on the RDDs and finally produces other .csv files on S3.
This job, executed on the EMR cluster, used to take about 5 hours. Suddenly, one day, it started taking over 30 hours, and it has done so ever since. There is no apparent difference in the inputs (the S3 files).
I've checked the logs, and in the lengthy (30-hour) run I can see OutOfMemory errors:
java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:472)
at java.util.IdentityHashMap.put(IdentityHashMap.java:441)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
....
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
In spite of the apparent OutOfMemory exception(s), the outputs (the S3 files) look good, so apparently the Spark job finishes properly.
What could suddenly produce the jump from a 5-hour execution to 30 hours?
How would you go about investigating such an issue?
Spark retries on failure. Your processes are failing. When that happens, all active tasks are probably considered failed and re-queued elsewhere in the cluster.
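If you want to confirm or mitigate that, here is a hedged sketch of the kind of settings worth looking at. The values are placeholders rather than recommendations, and the config names assume Spark 2.3+ on YARN:

```python
# Hedged sketch: give executors more heap and off-heap headroom so tasks stop
# dying with java.lang.OutOfMemoryError and being re-queued.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("emr-csv-joins")                       # hypothetical app name
    .config("spark.executor.memory", "8g")          # heap per executor (placeholder value)
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom; on older Spark this was spark.yarn.executor.memoryOverhead
    .config("spark.task.maxFailures", "4")          # default; each failed task is retried up to this many times before the job fails
    .getOrCreate()
)
```

The Spark UI's Stages tab (failed task counts and attempt numbers) is usually the quickest way to see whether the extra hours are going into retries.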
Is it possible to keep the master machine running in Dataproc? Every time I run the job, after a while (~1 hour) I see that the master node is stopped. It is not a real issue, since I can easily start it again, but I would like to know if there is a way to keep it awake.
One possible approach that occurs to me is to set up a scheduled job on the master machine, but I want to know if there is a more official way to achieve this.
Dataproc does not stop any cluster nodes (including master) when they are idle.
You need to check if you have some kind of automation or user that can do this on your end.
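One way to check that is the Cloud Audit Logs for the master VM. A rough sketch with the google-cloud-logging client follows; the project ID and VM name are placeholders, and the filter reflects my assumption of how GCE stop events appear in the activity log:

```python
# Hedged sketch: look for compute.instances.stop calls against the master VM
# to see which principal (user or service account) stopped it.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder project ID

log_filter = (
    'resource.type="gce_instance" '
    'AND protoPayload.methodName="v1.compute.instances.stop" '
    'AND protoPayload.resourceName:"my-cluster-m"'  # placeholder master VM name
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)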
I have a schedule which runs my flow twice a day - at 0910 and 1520 BST.
This is spawning a massive number of Dataflow jobs - so far today just the second schedule (1520) has spawned 80 jobs:
$ gcloud dataflow jobs list
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2018-07-29_12_17_06-14876588186269022154 project-name-513008-by-username Batch 2018-07-29 19:17:07 Running us-central1
2018-07-29_12_14_54-6436458673562317581 project-name-512986-by-username Batch 2018-07-29 19:14:55 Cancelled us-central1
2018-07-29_12_13_55-6167618802124600084 project-name-512985-by-username Batch 2018-07-29 19:13:57 Cancelled us-central1
...
(see PasteBin for the full list)
In the days after the Dataprep update last week, I had trouble accessing the run settings URL for the flow. I suspect that there's a process as part of the run settings which walks back through the flow (I have 12 flows chained by reference datasets) and sanity checks it - it seems that my flow was just on the cusp of being complex enough to cause the page load to time out, and I had to cut out a couple of steps just to get to the run settings.
I wonder if each time this timed out, it somehow duplicated the schedule or something else in the process - but then again, the number of duplicated jobs is inconsistent.
I recently rebuilt this project after seeing some issues with sampling errors (in that the sample was corrupt, so I couldn't load the transformation UI, but also couldn't build a new sample). After a hefty attempt at resolving the issue, I took the chance to rebuild as a dedicated GCP project with structure improvements, etc. I didn't see this scheduling error before the rebuild.
I have a linear three-step Dataflow pipeline - for some reason the last step started, but the preceding two steps hung in "Not started" for a long time before I gave up and killed the job. I'm not sure what caused this, as this same pipeline had run successfully in the past, and I'm surprised it didn't show any errors in the logs as to what was preventing the first two steps from starting. What can cause such a situation, and how can I prevent it from happening?
This was happening because of an error in the worker startup. Certain Dataflow steps do not seem to require workers (e.g. writing to GCS), which is why that step was able to start - i.e. that step starting does not imply that workers are being created correctly. Worker startup is not displayed in the job logs by default - you need to click the link to Stackdriver in the job logs and then add worker-startup in the logs drop-down in order to see any of those errors.
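If you would rather pull those worker-startup logs programmatically than through the console, something like this sketch should work. The project ID and job ID are placeholders, and the log name reflects my assumption of how Dataflow names that log:

```python
# Hedged sketch: read the worker-startup log for one Dataflow job via the
# Cloud Logging API instead of the Stackdriver UI drop-down.
from google.cloud import logging

PROJECT = "my-project"            # placeholder project ID
JOB_ID = "your-dataflow-job-id"   # placeholder job ID

client = logging.Client(project=PROJECT)

log_filter = (
    'resource.type="dataflow_step" '
    f'AND resource.labels.job_id="{JOB_ID}" '
    f'AND logName="projects/{PROJECT}/logs/dataflow.googleapis.com%2Fworker-startup"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)
```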