I have an AWS Glue job with the Spark UI enabled, set up by following these instructions: Enabling the Spark UI for Jobs
The Glue job has s3:* access to the arn:aws:s3:::my-spark-event-bucket/* resource. But for some reason, when I run the job (it finishes successfully within 40-50 seconds and generates the output Parquet files), it doesn't write any Spark event logs to the destination S3 path. I wonder what could have gone wrong and whether there is a systematic way to pinpoint the root cause.
How long is your Glue job running for?
I found that jobs with short execution times, less than or around 1 minute, do not reliably produce Spark UI logs in S3.
The AWS documentation states: "Every 30 seconds, AWS Glue flushes the Spark event logs to the Amazon S3 path that you specify." The reason short jobs do not produce Spark UI logs probably has something to do with this.
If you have a job with a short execution time, try adding additional steps to the job, or even a pause/wait, to lengthen the execution time. This should help ensure that the Spark UI logs are sent to S3.
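As a rough illustration of the pause/wait idea, here is a minimal sketch. The helper name pad_runtime and the 90-second minimum are assumptions chosen for the example, not values from the AWS documentation; the clock and sleep parameters exist only so the helper can be tried without actually waiting.

```python
import time


def pad_runtime(started_at, min_seconds=90.0, clock=time.monotonic, sleep=time.sleep):
    """Sleep until at least `min_seconds` have passed since `started_at`.

    Returns the number of seconds slept (0.0 if the job already ran long enough).
    The 90-second default is an assumed value, not an AWS recommendation.
    """
    elapsed = clock() - started_at
    remaining = max(0.0, min_seconds - elapsed)
    if remaining > 0:
        sleep(remaining)
    return remaining


# Usage inside a Glue script (sketch):
# started_at = time.monotonic()
# ... run the ETL steps ...
# pad_runtime(started_at)  # call just before the job finishes
```

Padding the runtime costs billed DPU time, so it is probably best kept to jobs where the Spark UI logs are actually needed.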
Related
Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a lambda function execution whose python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that at a time, multiple different Glue jobs run in parallel. How can I make it so that at any point in time, only one Glue job runs? And any Glue jobs sent for execution wait in a queue until the currently running Glue job is finished?
You can use AWS Step Functions and specify the job you want to run in each step. That gives you control over the order: once step one completes, step two's job is started, and so on.
If you are looking for a job queue so that the Glue jobs are triggered in sequence, you may consider using a combination of SQS -> Lambda -> Glue jobs. Please refer to this SO question for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
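A minimal sketch of that idea: the function below builds an Amazon States Language definition with one Task state per Glue job, chained so that each job starts only after the previous one finishes. The function name, the state naming scheme and the job names are hypothetical; the "arn:aws:states:::glue:startJobRun.sync" resource is the Step Functions integration that waits for the job run to complete.

```python
import json


def build_sequential_definition(job_names):
    """Build a state machine definition that runs Glue jobs one after another."""
    if not job_names:
        raise ValueError("at least one job name is required")
    # Hypothetical naming scheme; the index keeps state names unique.
    state_names = [f"Run{i + 1}_{name}" for i, name in enumerate(job_names)]
    states = {}
    for index, name in enumerate(job_names):
        state = {
            "Type": "Task",
            # The ".sync" suffix makes the state wait until the job run finishes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": name},
        }
        if index + 1 < len(job_names):
            state["Next"] = state_names[index + 1]
        else:
            state["End"] = True
        states[state_names[index]] = state
    return {"StartAt": state_names[0], "States": states}


# "job-a" and "job-b" are placeholder job names.
print(json.dumps(build_sequential_definition(["job-a", "job-b"]), indent=2))
```

The resulting JSON could be pasted into the Step Functions console or passed to a create-state-machine call; the Lambda would then start one execution of the state machine instead of starting the ten jobs directly.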
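One way the Lambda in such a queue setup might decide whether to start the next job is sketched below. It assumes a boto3 Glue client is passed in as glue (for example boto3.client("glue")); the helper names and the choice of which states count as active are assumptions. It only inspects the first page of get_job_runs results and does not guard against two Lambda invocations racing each other.

```python
# Run states treated as "still busy" in this sketch (an assumption).
ACTIVE_STATES = {"STARTING", "RUNNING", "STOPPING", "WAITING"}


def any_job_active(glue, job_names):
    """Return True if any listed job has a run in an active state."""
    for name in job_names:
        runs = glue.get_job_runs(JobName=name)["JobRuns"]
        if any(run["JobRunState"] in ACTIVE_STATES for run in runs):
            return True
    return False


def start_if_idle(glue, job_names, next_job):
    """Start `next_job` only when no listed job is active.

    Returns the new run id, or None so the caller can leave the
    message on the queue and retry later.
    """
    if any_job_active(glue, job_names):
        return None
    return glue.start_job_run(JobName=next_job)["JobRunId"]
```

Returning None rather than raising lets the SQS message become visible again after its visibility timeout, which is what gives the queueing behaviour.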
When running an AWS Glue crawler that points to S3, the second log entry in CloudWatch is always:
Crawl is not running in S3 event mode
What is S3 event mode?
The name sounds like some way of getting S3 to invoke Glue for partial crawls after every object upload to the prefix. But as far as I can tell, such functionality does not exist. So what is this log entry referring to?
The closest thing I found in the Glue documentation was event based triggers for Glue jobs, but Glue Jobs are different to Glue Crawlers.
Steps to reproduce
Create a Glue Crawler. Choose any configuration. Point it to anywhere in any S3 bucket with any dataset (even an empty one)
Run the crawler. It doesn't matter if the crawl fails or succeeds
Open the logs for that crawl
Look at the second log entry
2021-07-01T20:04:39.882+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] BENCHMARK : Running Start Crawl for Crawler my-crawler
2021-07-01T20:04:40.200+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] INFO : Crawl is not running in S3 event mode
AWS Support gave me an answer.
S3 Event mode is functionality available internally inside AWS. As I suspected it means S3 triggers crawler crawls for every file upload. But this functionality is not public at the moment.
I had the same problem and I found a solution in this article https://www.linkedin.com/pulse/my-top-5-gotchas-working-aws-glue-tanveer-uddin/
In short, the solution was to prefix the name of my bucket with aws-glue-. For example, trying to get a crawler to go through a bucket called test-bucket would not work, but if I change the name to aws-glue-test-bucket then it works.
I am trying to populate as many Glue job metrics as possible for some testing. Below is the setup I have created:
A crawler reads data (dummy customer data of 500 rows) from a CSV file placed in an S3 bucket.
Used another crawler to crawl tables created in Redshift cluster.
An ETL job finally reads data from csv file in s3 and dumps it into a Redshift table.
The job runs without any issue and I can see the final data being written to the Redshift table. However, in the end, only the following 5 CloudWatch metrics are populated:
glue.jvm.heap.usage
glue.jvm.heap.used
glue.s3.filesystem.read_bytes
glue.s3.filesystem.write_bytes
glue.system.cpuSystemLoad
There are approximately 20 more metrics which are not getting populated.
Any suggestions on how to populate those remaining metrics as well?
Double check that CloudWatch metrics are enabled for your job.
Make sure your job runs long enough, say more than 3 minutes, to give CloudWatch time to push the metrics.
For this you can add a sleep in your code.
Assuming that you are using Glue version 2.0+ for the above job, please be advised that AWS Glue version 2.0+ does not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available. Fall back to Glue 1.0 and you should be able to confirm that all the documented metrics are available.
I hit the same issue. Do your glue.s3.filesystem.read_bytes and glue.s3.filesystem.write_bytes metrics have any data?
One possible reason is that AWS Glue job metrics are not emitted if the job completes in less than 30 seconds.
When running the job, enable the metrics option under the Monitoring tab.
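The same option can probably also be set per run when the job is started from code. The sketch below assumes a boto3 Glue client is passed in as glue; the helper name is hypothetical. --enable-metrics is the Glue special parameter behind the console option; the documentation says specifying the key is enough, so the "true" value used here is an assumption.

```python
def start_run_with_metrics(glue, job_name, extra_args=None):
    """Start a Glue job run with job metrics enabled; return the run id."""
    # "--enable-metrics" is a Glue special parameter; the value is an assumption.
    arguments = {"--enable-metrics": "true"}
    arguments.update(extra_args or {})
    response = glue.start_job_run(JobName=job_name, Arguments=arguments)
    return response["JobRunId"]


# Usage (sketch, requires boto3 and AWS credentials):
# import boto3
# run_id = start_run_with_metrics(boto3.client("glue"), "my-etl-job")
```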
I am building a data lake pipeline on aws which includes many AWS services like s3, cloudwatch, lambda, glue crawler, glue job etc. The pipeline flow works like:
- CloudWatch schedules a cron job that triggers a Lambda to fetch external data and save it in an S3 bucket.
- Whenever a file is uploaded to the S3 bucket, a Lambda is triggered, which in turn triggers a Glue crawler.
- CloudWatch listens for Glue crawler state changes and triggers a Lambda, which calls a Glue job to do the data ETL.
It works fine, but I feel it is hard to monitor the whole process. The only things I can get are the logs saved in CloudWatch and some notifications / alerts. Is there a better way to monitor this pipeline? For example, viewing it as a workflow diagram showing each execution.
You can try AWS X-Ray. AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. It traces user requests as they travel through your entire application. It aggregates the data generated by the individual services and resources that make up your application, providing you an end-to-end view of how your application is performing. Check here for more details.
I have an S3 bucket where files are dumped every day. An AWS crawler crawls the data from this location. On the very first day my Glue job runs, it takes all the data present in the table created by the crawler. For example, on the first day there are three files (file1.txt, file2.txt, file3.txt), and the Glue job processes these files on its first execution. On the second day, another two files reach the S3 location, so the location now contains file1.txt, file2.txt, file3.txt, file4.txt and file5.txt. Can I somehow design my AWS crawler so that on the next day's job execution it reads just the two new files (file4.txt, file5.txt)? Or else, how can I write the AWS Glue job so that it identifies just these incremental files?
You need to enable job bookmarks for your Glue job; it will then be able to persist the state of already processed data. You can refer to the link below for how to do it.
aws glue job bookmark
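For reference, bookmarks are controlled by the --job-bookmark-option job parameter, which accepts job-bookmark-enable, job-bookmark-disable or job-bookmark-pause. The helper below is a hypothetical sketch that only builds the arguments dictionary to pass when starting a run. Inside the job script itself, bookmarks also rely on job.init(...) and job.commit() being called and on each source having a transformation_ctx.

```python
# Values accepted by the "--job-bookmark-option" special parameter.
BOOKMARK_OPTIONS = {
    "job-bookmark-enable",
    "job-bookmark-disable",
    "job-bookmark-pause",
}


def bookmark_arguments(option="job-bookmark-enable"):
    """Return job arguments that set the bookmark behaviour for a run."""
    if option not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark option: {option}")
    return {"--job-bookmark-option": option}


# Usage (sketch, requires boto3 and AWS credentials; the job name is a placeholder):
# import boto3
# boto3.client("glue").start_job_run(
#     JobName="my-etl-job", Arguments=bookmark_arguments()
# )
```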
You could introduce an intermediate service like SQS. Set up the SQS queue to receive event messages from S3 (such as the Put event in your case), then configure your crawler to poll SQS when a new message arrives, so that it only processes the new files.
The previous answer marked as correct does not answer your question and/or scenario.
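To make the SQS idea more concrete, here is a sketch of how a consumer might pull the newly uploaded object keys out of an S3 event notification delivered through SQS. The function name is hypothetical, and it assumes the bucket sends notifications straight to the queue (not wrapped by SNS). Object keys in these notifications are URL-encoded, hence unquote_plus.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(sqs_message_body):
    """Extract (bucket, key) pairs for created objects from an S3 event message."""
    event = json.loads(sqs_message_body)
    pairs = []
    # S3 test events carry no "Records" list, so they yield an empty result.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # skip deletes and other event types
        s3 = record["s3"]
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

With the example above, the second day's messages would yield only file4.txt and file5.txt, which could then be handed to the crawler or job as the incremental set.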