I have an S3 bucket where files are dumped every day, and an AWS Glue crawler crawls the data from this location. On the very first day my Glue job runs, it takes all the data present in the table created by the crawler. For example, on the first day three files are there (file1.txt, file2.txt, file3.txt) and the Glue job processes them. On the second day another two files arrive in the S3 location, so it now contains file1.txt, file2.txt, file3.txt, file4.txt, file5.txt. Can I somehow design my AWS crawler so that on the next job execution it reads only the two new files (file4.txt, file5.txt)? Or else, how can I write the AWS Glue job to identify just these incremental files?
You need to enable the AWS Glue job bookmark feature; it persists the state of already processed data so each run only picks up new files. You can refer to the link below for how to do it.
aws glue job bookmark
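Bookmarks are switched on with the job parameter --job-bookmark-option job-bookmark-enable; in the script itself, what matters is calling job.init / job.commit and giving each source a transformation_ctx. A minimal sketch, with placeholder database, table, and output path names:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Bookmark state is keyed by the job name and the transformation_ctx values,
    # so job.init / job.commit and a transformation_ctx on the source are required.
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Placeholder database/table created by the crawler
    source = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_crawled_table",
        transformation_ctx="source")

    glueContext.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},
        format="parquet",
        transformation_ctx="sink")

    job.commit()  # persists the bookmark so the next run only reads new files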
You could put an intermediate service such as SQS in front of the crawler. You can configure S3 to send events (such as the PUT events in your case) to an SQS queue, and then have your crawler (or the job that starts it) poll the queue when a new message arrives, so only the new files are processed.
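As a rough illustration of the S3-to-SQS wiring, here is a boto3 sketch; the bucket name and queue ARN are placeholders, and the queue's access policy must already allow s3.amazonaws.com to send messages:

    import boto3

    s3 = boto3.client("s3")

    # Send a message to the queue for every object created in the bucket
    s3.put_bucket_notification_configuration(
        Bucket="my-landing-bucket",  # placeholder
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:new-files-queue",  # placeholder
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )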
The previous answer marked as correct does not answer your question and/or scenario.
We have a simple ETL setup below:
The vendor uploads parquet data to our S3 bucket.
An S3 event triggers a Lambda function, which triggers a Glue crawler to update the existing table partitions in Glue.
This works fine most of the time, but in some cases our vendor might upload files consecutively within a short time period, for example when refreshing historical data. This causes an issue because the Glue crawler cannot run concurrently, and the job fails.
I'm wondering if there is anything we can do to avoid this potential error. I've looked into SQS but am not exactly sure whether it can help me; below is what I would like to achieve:
Vendor uploads a file to S3.
S3 sends an event to SQS.
SQS holds the event and waits until there have been no further events for a given time period, say 5 minutes.
After 5 minutes with no further events, SQS triggers the Lambda function to run the Glue crawler.
Is this doable with S3 and SQS?
SQS holds the event,
Yes, you can do this, as you can set an SQS delivery delay of up to 15 minutes.
waits until there have been no further events for a given time period, say 5 minutes.
No, there is no automated way to do that; you have to develop your own custom solution. The most straightforward approach would be to not couple SQS directly to Lambda, and instead run the Lambda on a schedule (e.g. every 5 minutes). The Lambda would need logic to determine whether any new files have been uploaded recently and, if not, trigger your Glue crawler. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions; a simplified sketch is shown below.
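A minimal sketch of that scheduled Lambda, simplified to use the S3 objects' LastModified timestamps instead of DynamoDB; the bucket, prefix, and crawler name are placeholders:

    import datetime
    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")
    QUIET_PERIOD = datetime.timedelta(minutes=5)

    def handler(event, context):
        # Placeholder bucket/prefix where the vendor uploads land (pagination omitted)
        resp = s3.list_objects_v2(Bucket="vendor-upload-bucket", Prefix="incoming/")
        objects = resp.get("Contents", [])
        if not objects:
            return "nothing uploaded yet"

        latest = max(obj["LastModified"] for obj in objects)
        if datetime.datetime.now(datetime.timezone.utc) - latest < QUIET_PERIOD:
            return "uploads still arriving, check again on the next schedule"

        try:
            glue.start_crawler(Name="vendor-data-crawler")  # placeholder crawler name
            return "crawler started"
        except glue.exceptions.CrawlerRunningException:
            return "crawler already running"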
We know that,
the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table; the crawler only discovers the schema. If you define the schema manually, you should be able to read your data when you run the AWS Glue job.
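For illustration, a hedged sketch of registering such a table yourself with boto3's create_table; the database, table, columns, and S3 location are placeholders for your own schema:

    import boto3

    glue = boto3.client("glue")

    # Placeholder database, table, columns and S3 location
    glue.create_table(
        DatabaseName="my_database",
        TableInput={
            "Name": "my_output_table",
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "id", "Type": "bigint"},
                    {"Name": "payload", "Type": "string"},
                ],
                "Location": "s3://my-bucket/output/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )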
We had this same problem for one of our customers, who had millions of small files in AWS S3. The crawler would practically stall, make no progress, and continue to run indefinitely. We came up with the following alternative approach:
A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries at AWS Athena.
The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
The queries fired:
alter table add partition (event_date='<event_date from above>', eventname='<list derived from the above S3 List output>')
This was triggered to run after the main ingestion job via Glue Workflows.
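A rough sketch of what that Python Shell job could look like with AWS Wrangler; the date, bucket, database, table, and partition layout are placeholders:

    import awswrangler as wr

    # Placeholder date (would come from the ingestion job / workflow) and paths
    event_date = "2024-01-01"
    prefix = f"s3://my-bucket/events/event_date={event_date}/"

    # Derive the eventname partition values from the S3 listing
    keys = wr.s3.list_objects(prefix)
    event_names = {k.split("eventname=")[1].split("/")[0] for k in keys if "eventname=" in k}

    for name in sorted(event_names):
        query_id = wr.athena.start_query_execution(
            sql=(
                "ALTER TABLE my_events_table ADD IF NOT EXISTS "
                f"PARTITION (event_date='{event_date}', eventname='{name}')"
            ),
            database="my_database",
        )
        wr.athena.wait_query(query_execution_id=query_id)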
If you are not expecting the schema to change, use the Glue job directly after manually creating the tables in the Glue database.
When running an AWS Glue crawler that points to S3, the second log entry in CloudWatch is always:
Crawl is not running in S3 event mode
What is S3 event mode?
The name sounds like some way of getting S3 to invoke Glue for partial crawls after every object upload to the prefix. But as far as I can tell, such functionality does not exist. So what is this log entry referring to?
The closest thing I found in the Glue documentation was event-based triggers for Glue jobs, but Glue jobs are different from Glue crawlers.
Steps to reproduce
Create a Glue Crawler. Choose any configuration. Point it to anywhere in any S3 bucket with any dataset (even an empty one)
Run the crawler. It doesn't matter if the crawl fails or succeeds
Open the logs for that crawl
Look at the second log entry
2021-07-01T20:04:39.882+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] BENCHMARK : Running Start Crawl for Crawler my-crawler
2021-07-01T20:04:40.200+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] INFO : Crawl is not running in S3 event mode
AWS Support gave me an answer.
S3 event mode is functionality available internally inside AWS. As I suspected, it means S3 triggers a crawl for every file upload, but this functionality is not public at the moment.
I had the same problem and found a solution in this article: https://www.linkedin.com/pulse/my-top-5-gotchas-working-aws-glue-tanveer-uddin/
In short, the solution was to put aws-glue- before the name of my bucket. For example, trying to get a crawler to go through a bucket called test-bucket would not work, but if I change the name to aws-glue-test-bucket then it works.
I have set up a Glue job which runs concurrently to process input files and writes them to S3. The Glue job runs periodically (it is not a one-time job).
The output in S3 is in the form of CSV files. The requirement is to copy all of those records into AWS SQS. Assume there might be hundreds of files, each containing up to a million records.
Initially I was planning to have a Lambda event send the records row by row; however, from the docs I see Lambda has a time limit of 15 minutes: https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/#:~:text=You%20can%20now%20configure%20your,Lambda%20function%20was%205%20minutes.
Would it be better to use AWS Batch for copying the records from S3 to SQS? I believe AWS Batch can scale the process when needed and also perform the task in parallel.
I want to know if AWS Batch is the right pick, or whether I am overcomplicating the design.
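For reference, a minimal sketch of the per-file Lambda approach described above, batching rows into SQS with send_message_batch; the queue URL is a placeholder. With millions of rows per file this can still run into the 15-minute limit, which is the argument for Batch or a Glue job instead:

    import csv
    import io
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/records-queue"  # placeholder

    def handler(event, context):
        # Triggered per uploaded CSV object; reads the whole object into memory,
        # which is itself a limitation for very large files
        s3_info = event["Records"][0]["s3"]
        obj = s3.get_object(Bucket=s3_info["bucket"]["name"], Key=s3_info["object"]["key"])
        reader = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

        batch = []
        for i, row in enumerate(reader):
            batch.append({"Id": str(i % 10), "MessageBody": json.dumps(row)})
            if len(batch) == 10:  # SQS batch limit is 10 messages
                sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)
                batch = []
        if batch:
            sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=batch)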
Data is being stored in an S3 bucket on a daily basis, and we are trying to automate the parsing and processing of that daily data. We already have the script that parses the data; we just need an approach on AWS to automate it. The approach/use case we thought of was an AWS Batch job scheduled to run the script daily, or to pick up the latest data for that day before EOD, but it seems Batch is not capable of doing this.
Any ideas or approaches? We've seen some approaches using Lambda and SQS/SNS.
Just to summarize:
data (daily) > stored in S3 > data is processed by our team > stored to Elasticsearch.
Thanks for your ideas.
AWS Lambda is exactly what you want in this case. You can trigger a Lambda execution when an S3 file shows up; the function processes the file and can then send the result to Elasticsearch or wherever you want it to end up.
Here's an official explanation from AWS: https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
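A minimal sketch of such an S3-triggered Lambda, with the parsing step and the Elasticsearch indexing left as placeholders for your existing script:

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def parse(raw_text):
        # Stand-in for your existing parsing script
        return [line for line in raw_text.splitlines() if line.strip()]

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            documents = parse(body)
            # Index the parsed documents into Elasticsearch here, e.g. with the
            # elasticsearch-py bulk helper pointed at your cluster endpoint
            print(f"parsed {len(documents)} records from s3://{bucket}/{key}")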
You can use Lambda + CloudWatch Events to execute your code on a regular schedule. You can specify a fixed rate or a cron expression; for example, in your case you can execute your Lambda every 24 hours so that your data-processing logic runs once daily.
Take a look at this article from AWS : Schedule AWS Lambda Functions Using CloudWatch Events
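If you prefer to wire the schedule up programmatically rather than in the console, here is a hedged boto3 sketch; the rule name, function name/ARN, and cron expression are placeholders:

    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    FUNCTION_NAME = "daily-s3-processor"  # placeholder
    FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:daily-s3-processor"  # placeholder

    # Run once a day at 23:00 UTC; a fixed rate such as rate(1 day) also works
    rule = events.put_rule(
        Name="daily-s3-processing",
        ScheduleExpression="cron(0 23 * * ? *)",
    )

    # Allow the rule to invoke the function, then attach the function as a target
    lambda_client.add_permission(
        FunctionName=FUNCTION_NAME,
        StatementId="allow-daily-schedule",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events.put_targets(
        Rule="daily-s3-processing",
        Targets=[{"Id": "1", "Arn": FUNCTION_ARN}],
    )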