On the first day I stored my data as Folder 1 in S3 and ran the job from Glue,
and I got the expected output.
On the second day I stored my data as Folder 2 under the same parent folder and ran the job from Glue again;
the data from Folder 1 was replicated and the output for the data in Folder 2 also came through.
How can I avoid replicating the data from Folder 1?
Have you enabled the bookmark in your AWS Glue Job? Enabling the bookmark will cause Glue to keep track of what it has already loaded. If you ever have to reload all your data, there's a "reset bookmark" option on the Jobs menu.
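As a rough sketch of how that can be wired up through the API (the job name below is a placeholder), bookmarks can be enabled when starting the run and reset later with reset_job_bookmark:

    import boto3

    glue = boto3.client("glue")

    # Run with bookmarks enabled so objects already processed
    # (e.g. Folder 1) are skipped on later runs.
    glue.start_job_run(
        JobName="my-glue-job",  # placeholder job name
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )

    # If the full dataset ever needs to be reprocessed, reset the bookmark.
    glue.reset_job_bookmark(JobName="my-glue-job")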
We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema; if you define the schema manually, you should be able to read your data when you run the AWS Glue job.
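As an illustrative sketch (the database, table, column, and bucket names are placeholders), the table definition can be registered once through the Glue API instead of running a crawler:

    import boto3

    glue = boto3.client("glue")

    # Register the table once; the Glue job can then read the data
    # through the Data Catalog without a crawler.
    glue.create_table(
        DatabaseName="my_database",  # placeholder
        TableInput={
            "Name": "my_table",  # placeholder
            "Parameters": {"classification": "csv"},
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "id", "Type": "string"},
                    {"Name": "value", "Type": "double"},
                ],
                "Location": "s3://my-bucket/data/",  # placeholder
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )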
We had this same problem for one of our customers, who had millions of small files in AWS S3. The crawler would practically stall and keep running indefinitely without making progress. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries towards AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired: alter table add partition (event_date='<event_date from above>', eventname='List derived from above S3 List output') (a sketch follows this list)
4. This was triggered to run after the main ingestion job via Glue Workflows.
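A minimal sketch of what such a Python Shell job might do, assuming awswrangler is available and treating the bucket, prefix, database, and table names below as placeholders:

    import awswrangler as wr
    import boto3

    s3 = boto3.client("s3")
    event_date = "2021-07-01"  # placeholder date, as derived in step 2

    # List the eventname "folders" under the date prefix (bucket/prefix are placeholders).
    resp = s3.list_objects_v2(
        Bucket="my-bucket",
        Prefix=f"events/event_date={event_date}/",
        Delimiter="/",
    )
    event_names = [
        p["Prefix"].split("eventname=")[-1].rstrip("/")
        for p in resp.get("CommonPrefixes", [])
    ]

    # Fire one ALTER TABLE ... ADD PARTITION per discovered event name via Athena.
    for name in event_names:
        wr.athena.start_query_execution(
            sql=(
                "ALTER TABLE my_table ADD IF NOT EXISTS "
                f"PARTITION (event_date='{event_date}', eventname='{name}')"
            ),
            database="my_database",
        )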
If you are not expecting the schema to change, create the Glue database and tables manually and then use the Glue job directly.
I have multiple ERPs ingesting data into S3, and I have AWS Glue for Spark processing.
I found out that I need Delta-format files for Spark processing, and that the best way to run this ETL is on EMR or Databricks.
Should I go for Databricks for incremental loads and a full-load refresh of the dashboard?
Or can EMR also manage a full data refresh along with update-matched and insert-new-data features? If yes, please share some info.
What I am confused about is this: if I only have new/updated/deleted data to process, how will the dashboard show me all of the previous data?
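For reference, the "update matched and insert new data" pattern mentioned above corresponds to a Delta Lake MERGE. A minimal PySpark sketch, assuming Delta Lake is configured on the cluster and using placeholder paths and a placeholder key column:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Incremental batch of new/updated records (path is a placeholder).
    updates = spark.read.parquet("s3://my-bucket/incoming/")

    # Existing Delta table holding the full history (path is a placeholder).
    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/erp_table/")

    # Update rows that match on the key, insert rows that do not;
    # the full history stays in the Delta table for the dashboard to query.
    (
        target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )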
When running an AWS Glue crawler that points to S3, the second log entry in CloudWatch is always:
Crawl is not running in S3 event mode
What is S3 event mode?
The name sounds like some way of getting S3 to invoke Glue for partial crawls after every object upload to the prefix. But as far as I can tell, such functionality does not exist. So what is this log entry referring to?
The closest thing I found in the Glue documentation was event-based triggers for Glue jobs, but Glue jobs are different from Glue crawlers.
Steps to reproduce
Create a Glue Crawler. Choose any configuration. Point it to anywhere in any S3 bucket with any dataset (even an empty one)
Run the crawler. It doesn't matter if the crawl fails or succeeds
Open the logs for that crawl
Look at the second log entry
2021-07-01T20:04:39.882+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] BENCHMARK : Running Start Crawl for Crawler my-crawler
2021-07-01T20:04:40.200+10:00
[6588c8ba-57e2-46e3-94b4-1bc4dfc5957d] INFO : Crawl is not running in S3 event mode
AWS Support gave me an answer.
S3 event mode is functionality available internally inside AWS. As I suspected, it means S3 triggers a crawler crawl for every file upload, but this functionality is not public at the moment.
I had the same problem and I found a solution in this article: https://www.linkedin.com/pulse/my-top-5-gotchas-working-aws-glue-tanveer-uddin/
In short, though, the solution was to prefix my bucket name with aws-glue-. So, for example, trying to get a crawler to go through a bucket called test-bucket would not work, but if I change the name to aws-glue-test-bucket then it works.
I am new to Spark, but I am reading up as much as I can. I have a small project where multiple data files (gzipped) will continuously land in an S3 bucket every hour. I need to be able to open/read these gzip files and consolidate/aggregate the data across them, so I need to look at them in a holistic fashion. What techniques and tools from Amazon AWS can be used? Do I create interim files in an S3 folder, hold DataFrames in memory, or use some database and blow away the data after each hour? I am looking for ideas more than a piece of code.
So far, in AWS, I have written a PySpark script that reads one file at a time and creates an output file back in an output S3 folder. But that leaves me with multiple output files for each hour. It would be nice if there were one file for a given hour.
From a technology perspective, I am using an EMR cluster with just one master and one core node, PySpark, and S3.
Thanks
You could use an AWS Glue ETL job written in PySpark. Glue jobs can be scheduled to run every hour.
I suggest reading the entire dataset, performing your operations, and then moving the data to another long-term storage location.
If you are working on a few GB of data, a PySpark job should complete within minutes. There's no need to keep an EMR cluster running for an hour if you'll only need it for 10 minutes. Consider using short-lived EMR clusters or a Glue ETL job.
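A rough PySpark sketch of that approach, assuming the hourly files land under an hour-based prefix (all paths and column names below are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

    # Spark reads gzipped CSV transparently; the wildcard picks up every file for the hour.
    df = spark.read.option("header", "true").csv("s3://my-bucket/incoming/2021/07/01/20/*.gz")

    # Example aggregation across all files for the hour.
    hourly = df.groupBy("event_type").agg(F.count("*").alias("events"))

    # coalesce(1) produces a single output file for the hour.
    hourly.coalesce(1).write.mode("overwrite").csv("s3://my-bucket/output/2021/07/01/20/")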
Athena supports querying GZipped data. If you're performing some sort of analysis, maybe executing an Athena query with a time range will work?
You could also use a CTAS (Create Table As Select) statement in Athena to copy data to a new location, performing basic ETL on it at the same time.
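A hedged sketch of what such a CTAS statement might look like when fired through awswrangler (the table, database, column, and location names are placeholders):

    import awswrangler as wr

    # Copy one hour of gzipped source data into Parquet at a new location,
    # doing light ETL (filtering and aggregation) in the same statement.
    wr.athena.start_query_execution(
        sql="""
            CREATE TABLE hourly_parquet
            WITH (
                format = 'PARQUET',
                external_location = 's3://my-bucket/ctas-output/2021/07/01/20/'
            ) AS
            SELECT event_type, COUNT(*) AS events
            FROM raw_events
            WHERE event_time BETWEEN timestamp '2021-07-01 20:00:00'
                                 AND timestamp '2021-07-01 21:00:00'
            GROUP BY event_type
        """,
        database="my_database",
    )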
What exactly does your PySpark code do?
I have an S3 bucket where files are dumped every day, and an AWS Glue crawler crawls the data from this location. On the very first day that my Glue job runs, it takes all of the data present in the table created by the crawler. For example, on the first day there are three files (file1.txt, file2.txt, file3.txt), and the Glue job processes them on its first execution. On the second day, another two files arrive at the S3 location, so it now contains file1.txt, file2.txt, file3.txt, file4.txt, and file5.txt. Can I somehow design my AWS crawler so that on the next day's job execution it reads just the two new files (file4.txt, file5.txt)? Or else, how can I write my AWS Glue job so that it identifies only these incremental files?
You need to enable the AWS Glue job bookmark, and it will be able to persist the state of already-processed data. You can refer to the link below for how to do it.
aws glue job bookmark
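For reference, a minimal sketch of the script-side pieces bookmarks rely on (the database and table names are placeholders): the job needs to call job.init()/job.commit() and pass a transformation_ctx on the reads it wants tracked.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # bookmark state is keyed on the job name

    # transformation_ctx lets the bookmark remember which files were already read,
    # so on day two only file4.txt and file5.txt are processed.
    datasource = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",        # placeholder
        table_name="my_table",         # placeholder
        transformation_ctx="datasource",
    )

    # ... transforms and writes go here ...

    job.commit()  # persists the bookmark state for the next run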
You could implement an intermediate service such as SQS. You can set up SQS to receive events or messages from S3 (such as a Put event in your case), and then configure your crawler to poll SQS when a new message arrives, so the crawl applies only to the new files.
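A hedged boto3 sketch of the S3-to-SQS wiring (the bucket name, prefix, and queue ARN are placeholders, and the queue policy must already allow S3 to send messages to it):

    import boto3

    s3 = boto3.client("s3")

    # Send a message to SQS whenever a new object is put under the watched prefix.
    s3.put_bucket_notification_configuration(
        Bucket="my-bucket",  # placeholder
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:123456789012:my-new-files-queue",  # placeholder
                    "Events": ["s3:ObjectCreated:Put"],
                    "Filter": {
                        "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
                    },
                }
            ]
        },
    )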
The previous answer marked as correct does not answer your question and/or scenario.