I have an AWS Glue PySpark job that hangs after a certain command. After that command, nothing is written to the log, not even a simple print("hello") statement.
How can I debug an AWS Glue PySpark job that is long running and not writing any logs? The job does not throw any error; it just shows a running status in the console.
AWS Glue is based on Apache Spark, which means that until an action is called there is no actual execution. So putting print statements in between and seeing them in the logs doesn't mean that your job has executed up to that point. As your job is long running, check this article by AWS, which explains Debugging Demanding Stages and Straggler Tasks. This is also a good blog to take a look at.
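To see where execution actually starts, here is a tiny generic PySpark sketch (not your Glue job; the app name and numbers are made up): the transformation is only recorded in the plan, and nothing runs until the action is called.
# Minimal illustration of Spark's lazy evaluation (generic PySpark, not Glue-specific).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.range(1_000_000)            # transformation: only builds a plan
evens = df.filter(df["id"] % 2 == 0)   # still no execution
print("hello before the action")       # prints immediately, even though no work has run

print(evens.count())                   # action: execution actually happens here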
Actually, these are the steps applied to my data:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished, so that a Dataflow pipeline is triggered only after the job is completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea. From what I saw, this trigger uses the job ID, which apparently cannot be fixed, so the function would have to be redeployed every time a job runs. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log that indicates that the job finished.
My files are large, so it isn't a good idea to use a Google Cloud Function to wait until the job finishes.
Despite your doubts about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can also add a dataset filter if needed.
Then create a sink on this advanced filter (for example to a Pub/Sub topic), trigger a Cloud Function from it, and run your Dataflow job from there.
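For illustration only, a rough sketch of the function behind that sink might look like the following, assuming the sink targets a Pub/Sub topic and a Dataflow template is already staged; the project, region, template path, and job name are placeholders.
# Hypothetical background Cloud Function triggered by the Pub/Sub log sink above.
# All names below are placeholders; adapt them to your project.
import base64
import json

from googleapiclient.discovery import build

PROJECT = "my-project"                              # placeholder
REGION = "us-central1"                              # placeholder
TEMPLATE = "gs://my-bucket/templates/my-pipeline"   # placeholder: staged Dataflow template

def trigger_dataflow(event, context):
    # The sink delivers the jobCompletedEvent log entry as a base64-encoded Pub/Sub message.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    state = (entry.get("protoPayload", {})
                  .get("serviceData", {})
                  .get("jobCompletedEvent", {})
                  .get("job", {})
                  .get("jobStatus", {})
                  .get("state"))
    if state != "DONE":
        return  # only react to completed load jobs

    # Launch the downstream pipeline from the staged template.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={"jobName": "run-after-bq-load"},
    ).execute()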
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. You define a DAG, and Composer executes each node of the DAG while checking dependencies, so that things run either in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
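Here is a rough, hedged sketch (separate from the linked example) of what such a DAG might look like on Composer's Airflow 1.x images; the project, dataset, table, and template path are all placeholders.
# Rough Airflow 1.x sketch: wait for the loaded BigQuery table, then launch a
# templated Dataflow pipeline. All IDs and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.contrib.sensors.bigquery_sensor import BigQueryTableSensor

with DAG(
    dag_id="bq_load_then_dataflow",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Block until the table produced by the load job exists.
    wait_for_table = BigQueryTableSensor(
        task_id="wait_for_bq_table",
        project_id="my-project",
        dataset_id="my_dataset",
        table_id="my_table",
    )

    # Start the downstream pipeline from a staged Dataflow template.
    run_dataflow = DataflowTemplateOperator(
        task_id="run_dataflow_pipeline",
        template="gs://my-bucket/templates/my-pipeline",
        dataflow_default_options={"project": "my-project", "region": "us-central1"},
    )

    wait_for_table >> run_dataflow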
AWS Glue looks promising, but I'm having a challenge with the development cycle time. If I edit PySpark scripts through the AWS console, it takes several minutes to run them even on a minimal test dataset. That makes it hard to iterate quickly when I have to wait 3-5 minutes just to see whether I called the right method on glueContext or understood a particular DynamicFrame behavior.
What techniques would allow me to iterate faster?
I suppose I could develop Spark code locally, and deploy it to Glue as an execution framework. But if I need to test code with Glue-specific extensions, I am stuck.
For developing and testing scripts, Glue has Development Endpoints, which you can use with notebooks like Zeppelin installed either on a local machine or on an Amazon EC2 instance (other options are a REPL shell and PyCharm Professional).
Please don't forget to remove the endpoint when you are done with testing, since you pay for it even when it's idling.
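If you prefer to script the endpoint lifecycle, a hedged boto3 sketch could look like this; the endpoint name and IAM role ARN are placeholders.
# Sketch: create a Glue development endpoint with boto3 and delete it when done,
# so you stop paying for an idle endpoint. Names below are placeholders.
import boto3

glue = boto3.client("glue")

# Provisioning takes a few minutes.
glue.create_dev_endpoint(
    EndpointName="my-dev-endpoint",
    RoleArn="arn:aws:iam::123456789012:role/MyGlueDevEndpointRole",
    NumberOfNodes=2,
)

# ... attach Zeppelin or a REPL shell and iterate on the script ...

# Tear the endpoint down once testing is finished.
glue.delete_dev_endpoint(EndpointName="my-dev-endpoint")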
I keep the PySpark code in a separate file and the Glue code in another file; we use Glue only for reading and writing data. We do test-driven development with pytest on a local machine, so there is no need for a dev endpoint or Zeppelin (see the sketch after the links below). Once all syntactical and business-logic bugs are fixed in the PySpark code, end-to-end testing is done using Glue. We also wrote a shell script that uploads the latest code to the S3 bucket from which the Glue job runs.
https://github.com/fatangare/aws-glue-deploy-utility
https://github.com/fatangare/aws-python-shell-deploy
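To make that split concrete, here is a minimal hedged sketch of the local pytest setup; the module layout and the add_full_name function are invented for the example, and only the (omitted) Glue script would import GlueContext.
# Pure PySpark logic that can be unit-tested locally without Glue.
import pytest
from pyspark.sql import SparkSession, functions as F

# --- would live in its own module, e.g. transforms.py (no Glue imports) ---
def add_full_name(df):
    """Derive full_name from first_name and last_name."""
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

# --- test_transforms.py, run locally with pytest ---
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    assert add_full_name(df).first()["full_name"] == "Ada Lovelace"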
Is there a way to trigger a Dataprep flow on a GCS (Google Cloud Storage) file upload? Or, at least, is it possible to make Dataprep run each day and take the newest file from a certain directory in GCS?
It should be possible, because otherwise what is the point in scheduling? Running the same job over the same data source with the same output?
It seems this product is very immature at the moment, so no API endpoint exists to run a job in this service. It is only possible to run a job in the UI.
In general, this is a pattern that is typically used for running jobs on a schedule. Maybe at some point the service will allow you to publish into the "queue" that Run Job already uses.
I'm running a PySpark 2 job on EMR 5.1.0 as a step. Even after the script is done, with a _SUCCESS file written to S3 and the Spark UI showing the job as completed, EMR still shows the step as "Running". I've waited for over an hour to see if Spark was just trying to clean itself up, but the step never shows as "Completed". The last thing written in the logs is:
INFO MultipartUploadOutputStream: close closed:false s3://mybucket/some/path/_SUCCESS
INFO DefaultWriterContainer: Job job_201611181653_0000 committed.
INFO ContextCleaner: Cleaned accumulator 0
I didn't have this problem with Spark 1.6. I've tried a bunch of different hadoop-aws and aws-java-sdk jars to no avail.
I'm using the default Spark 2.0 configurations so I don't think anything else like metadata is being written. Also the size of the data doesn't seem to have an impact on this problem.
If you aren't already, you should close your spark context.
sc.stop()
Also, if you are watching the Spark web UI in a browser, you should close that as well, since it sometimes keeps the Spark context alive. I recall seeing this on the Spark dev mailing list, but I can't find the JIRA for it.
We experienced this problem and resolved it by running the job in cluster deploy mode using the following spark-submit option:
spark-submit --deploy-mode cluster
It seems to have something to do with the fact that in client mode the driver runs on the master instance, and the spark-submit process was getting stuck even though the Spark context had closed. This caused the instance controller to keep polling for the process, since it never received the completion signal. Running the driver on one of the instance nodes using the above option doesn't seem to have this problem. Hope this helps.
I experienced the same issue with Spark on AWS EMR, and I solved it by calling sys.exit(0) at the end of my Python script. The same worked for a Scala program with System.exit(0).
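Pulling the suggestions in this thread together, a hedged end-of-script pattern would look something like this (the app name is a placeholder).
# Stop the Spark context explicitly and exit with status 0 so EMR can mark the step as completed.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emr-step").getOrCreate()

# ... job logic ...

spark.stop()   # same effect as sc.stop() on the underlying SparkContext
sys.exit(0)    # explicit success signal at the very end of the script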
I am using AWS EMR clusters to run Hive. I want to be able to enforce that certain tables, such as reference tables, should never be empty after their initial creation, and if they are found to be empty, to throw an error (or log a message) and stop processing.
Does anyone know of any ways to achieve this?
Thanks
You could install a cron job on the master server that periodically runs a check against your Hive table. If the table is empty, you can terminate the cluster, stop the job flow, or take some other action. These actions can be executed using the EMR CLI tools: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html
These commands can also be run using AWS SDK inside a Java Program - in case you want all of this as a Java program instead of a script.
You have not specified whether the cluster is persistent or transient. If it is persistent, this script can also run outside the master. A rough sketch of such a check is below.
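This is only an illustration: it assumes the hive CLI is available where the script runs and uses boto3 for the EMR action; the table name and cluster ID are placeholders, and terminating the cluster is just one possible response.
# Cron-driven check: fail loudly (and optionally terminate the cluster) if a
# reference table is empty. Placeholders: TABLE and CLUSTER_ID.
import subprocess
import sys

import boto3

TABLE = "my_db.reference_table"    # placeholder: table that must never be empty
CLUSTER_ID = "j-XXXXXXXXXXXXX"     # placeholder: this cluster's ID

def row_count(table):
    # -S keeps Hive quiet so stdout contains just the count.
    out = subprocess.check_output(["hive", "-S", "-e", "SELECT COUNT(*) FROM {};".format(table)])
    return int(out.decode("utf-8").strip().split()[-1])

if row_count(TABLE) == 0:
    sys.stderr.write("ERROR: {} is empty, terminating cluster\n".format(TABLE))
    boto3.client("emr").terminate_job_flows(JobFlowIds=[CLUSTER_ID])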