I'm looking for a way to set up an incremental Glue crawler for S3 data, where data arrives continuously and is partitioned by the date it was captured (so the S3 paths within the include path contain date=yyyy-mm-dd). My concern is that if I run the crawler during the day, that day's partition will be created and will not be revisited in subsequent crawls. Is there a way to force a given partition that I know might still be receiving updates to be crawled, while still running the crawler incrementally and not wasting resources on historic data?
With an incremental crawl, the crawler visits only new folders (assuming you have set the "Crawl new folders only" option). The only circumstance where adding more data to an existing folder would cause a problem is if you changed the schema by adding a differently formatted file to a folder that was already crawled. Otherwise the crawler has already created the partition and knows the schema, so the data can be queried even if new files are added to the existing folder.
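As a rough illustration of the "crawl new folders only" setting mentioned above, the sketch below builds the arguments for boto3's `glue.create_crawler`. The crawler name, role ARN, database, and S3 path are placeholders, not values from the question. The schema change policy is set to LOG because, as far as I know, incremental crawls require it.

```python
def incremental_crawler_config(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with incremental crawling."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Only folders added since the last crawl are visited.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls expect both behaviours to be LOG.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


# All values below are hypothetical placeholders.
config = incremental_crawler_config(
    name="daily-events-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-bucket/events/",
)

# To actually create the crawler (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_crawler(**config)
```

Keeping the configuration in a plain dictionary makes it easy to inspect or unit-test before any AWS call is made.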
Related
I am uploading CSV files to an S3 bucket, creating tables with a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically in QuickSight.
But what I need to do now is keep a history of the uploaded files. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Is that even a reasonable thing to do? I wonder whether it would create so many tables that it becomes a mess.
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
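A minimal sketch of the "upload to new files, keep the old ones" idea: give every upload a unique object key under the table's location so that nothing is overwritten. The helper name, prefix, and timestamp format are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timezone


def history_key(prefix, filename, uploaded_at):
    """Return an S3 key that is unique per upload time, so older files are kept."""
    stamp = uploaded_at.strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix.rstrip('/')}/{stamp}_{filename}"


# Hypothetical prefix and file name.
key = history_key(
    "sales/",
    "report.csv",
    datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc),
)
print(key)  # sales/20240115T103000Z_report.csv

# Uploading would then look like this (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").upload_file("report.csv", "example-bucket", key)
```

Because all files stay under the same prefix, Athena reads old and new files together without any change to the table definition.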
I am fairly new to AWS and am a bit overwhelmed with the options for a task I have.
Have: I have a dimensional model in an S3 bucket (read only access), that has a folder structure and contains partitioned parquet files. This bucket will be updated daily (+40GB a day), with changes to both dim and fact tables. I need to get this data out of S3, but it's extremely inefficient to set up a boto3 connection and repeatedly pull the entire raw data and continuously check if the data has even been updated.
What I was thinking for a solution: To maintain updated tables in another S3 bucket that I create (likely Athena query outputs), where I can just pull in the updated changes, so that boto can just check if there is data in the new bucket and pull, reducing load.
Considerations:
I need some kind of event notification that triggers the Athena query. I was looking into Lambda or CloudWatch, but I am unsure which is better or what the constraints are.
For the fact tables, I need an Athena query that gets the most recent "Last Updated" timestamp from the updated data. And then updates the updated bucket tables to include all the raw data that is greater than the found timestamp.
FYI: I am working with partitioned data, and I am not sure if I can just work with the tables as partitions (part-0000dim-table-3.parquet) or if additional steps are required to work with partitions.
For the dim tables, I need to somehow scan the entire table for changes (dim tables are a combination of SCD 0,1,2)... unsure how best to do this. In the worst case, I could just point the boto3 connection to the raw dim tables whenever the fact tables update.
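For the fact-table consideration above, the "everything newer than the last seen timestamp" filter could be expressed roughly as below. This is only a sketch: the table name `raw_db.fact_sales` and the column name `last_updated` are hypothetical, and the watermark would in practice come from a query against the already-copied data.

```python
from datetime import datetime


def incremental_fact_query(table, watermark, ts_column="last_updated"):
    """Build an Athena SQL query selecting rows newer than the watermark."""
    literal = watermark.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {ts_column} > TIMESTAMP '{literal}'"


# Hypothetical table and watermark value.
query = incremental_fact_query("raw_db.fact_sales", datetime(2024, 1, 15, 0, 0, 0))
print(query)
```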
What AWS APIs, workflows, should I think about using?
I am unclear on the constraints that I could run into with either Lambda, Cloudwatch, Athena step functions, etc. and trying to learn as I go. I am also struggling on how to compare Athena query results across the two buckets.
Thank you very much & if there is any more information that would help, just let me know!!
I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data which is generated by Kinesis, each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, a Glue Crawler creates a separate table per file. I've read that if the schema is matching Glue should provide only one table. As you can see in the screenshots below, the schema is identical. In Crawler options, I tried ticking Create a single schema for each S3 path on and off, but no change.
Files also sit in the same path, which leads me to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because files share a folder - I've read about it here, point 1, and tested several times. If I remove all but one file from S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder or Glue to record that data in a single table?
File1:
File2:
Regarding AWS Glue creating separate tables, there could be several reasons, based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. This doesn't seem to be your issue, but to make sure, I suggest testing with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing Create a single schema for each S3 path. For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible. For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
Another really important point: you should have one folder at the root, and inside it the partition sub-folders. If you have partitions at the S3 bucket level, the crawler will not create one table (mentioned by Sandeep in this Stack Overflow question).
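When the crawler is managed through the API rather than the console, the "Create a single schema for each S3 path" option corresponds to a grouping policy in the crawler's `Configuration` JSON. A small sketch, where the crawler name in the comment is a placeholder:

```python
import json


def single_schema_configuration():
    """Return the crawler Configuration JSON that combines compatible schemas."""
    return json.dumps(
        {
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }
    )


configuration = single_schema_configuration()

# Applying it to an existing crawler (requires boto3 and AWS credentials;
# "my-crawler" is a hypothetical name):
#   import boto3
#   boto3.client("glue").update_crawler(Name="my-crawler", Configuration=configuration)
```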
I hope this could help you to resolve your problem.
I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something to process them? For example, what if I set up a Lambda function to extract a new CSV every hour, or even every 5 minutes, to that same S3 directory?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files, you may want to consider partitioning your table to optimize query performance, see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't have any caching; every query reads from the S3 location of the table.
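Files added to an existing partition show up immediately, but a brand-new partition folder has to be registered in the catalog before Athena sees it (unless partition projection is used). A hedged sketch of building that DDL statement follows. The table name, partition column `dt`, and bucket are made up for illustration.

```python
def add_partition_ddl(table, partition_column, value, location):
    """Build an Athena DDL statement that registers one new partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_column} = '{value}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table, partition column, and S3 location.
ddl = add_partition_ddl(
    "events",
    "dt",
    "2024-01-15",
    "s3://example-bucket/events/dt=2024-01-15/",
)
print(ddl)
```

The statement can be run from the Athena console or submitted through the API. Running `MSCK REPAIR TABLE` or a crawler are alternative ways to pick up new partitions.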
I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The Crawler is populating the AWS Glue Data Catalog with tables only.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using the data, you need to follow these steps:
1. Map the data into a temporary Athena database/table using a Crawler.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to:
   - transform, clean, rename, and dedupe the data using PySpark or Scala
   - export the data to a new S3 location (.csv, .parquet, etc.), potentially partitioned
4. Run one more Crawler to map the cleaned data from the new S3 location into the Athena database.
The dedupe you are asking about happens in step 3.
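To make the dedupe step more concrete, here is the core logic in plain Python rather than PySpark: for each record key, keep only the row that came from the most recent file. The field names (`id`, `amount`, `source_file`) are hypothetical, and a real Glue job would express the same idea with DataFrame operations.

```python
def dedupe_latest(rows, key_field, version_field):
    """Keep, for each key, the row with the greatest version_field value."""
    latest = {}
    for row in rows:
        key = row[key_field]
        if key not in latest or row[version_field] > latest[key][version_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows: the same id appears in two daily files.
rows = [
    {"id": 1, "amount": 10, "source_file": "2024-01-14.csv"},
    {"id": 1, "amount": 12, "source_file": "2024-01-15.csv"},
    {"id": 2, "amount": 7, "source_file": "2024-01-14.csv"},
]

deduped = dedupe_latest(rows, "id", "source_file")
for row in deduped:
    print(row)
```

Here the file name doubles as the version because ISO-style dates sort correctly as strings. Any monotonically increasing field, such as an upload timestamp, would work the same way.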