inserting and reading data to/from athena tables - amazon-web-services

I apologise if the title is a bit misleading for the question I am going to ask. I am trying to understand how Athena works a bit more clearly.
I have a daily job which uploads files to an S3 location. I have created an Athena table which reads data from that S3 location. Every day the data gets updated and new files (i.e. new data) are uploaded to the location. (New doesn't necessarily mean overwriting; it can also mean adding more files.)
My issue is that when I try to read the latest data from the Athena GUI, it returns nothing but an empty table.
How do I read the latest data? Do I have to run another command like ALTER TABLE or INSERT INTO after uploading files to S3? My understanding was that uploading files to that S3 location is akin to inserting data into the table, and vice versa, i.e. running ALTER TABLE/INSERT INTO is akin to uploading files to S3.

Related

Athena tables having history of records of every csv

I am uploading CSV files to an S3 bucket, creating tables through a Glue crawler, viewing the tables in Athena, connecting Athena to QuickSight, and showing the results graphically in QuickSight.
What I need to do now is keep a history of the files uploaded. Instead of a new CSV file being uploaded and the crawler updating the table, can I have the crawler save each record separately? Is that even a reasonable thing to do? I wonder whether it would create so many tables that it becomes a mess.
I'm just trying to figure out a way to keep a history of previous records. How can I achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.
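As a rough sketch of that append-only pattern in Python: give every upload its own object key, so that a new file never overwrites an older one. The prefix `daily-data/`, the `data_<timestamp>` naming scheme and the helper `history_key` are assumptions made up for illustration, and the actual upload to S3 is left out.

```python
from datetime import datetime, timezone


def history_key(prefix: str, uploaded_at: datetime, extension: str = "csv") -> str:
    """Build a unique object key under the table's S3 location.

    The timestamp in the name keeps earlier files in place, so a query
    over the location sees the old and the new data together.
    """
    stamp = uploaded_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix.rstrip('/')}/data_{stamp}.{extension}"


# Two daily uploads (hypothetical dates) end up as two separate objects.
monday = history_key("daily-data/", datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc))
tuesday = history_key("daily-data/", datetime(2024, 1, 16, 6, 0, tzinfo=timezone.utc))
print(monday)
print(tuesday)
```

Reusing the same key every day would instead replace the previous file, which is the overwrite behaviour described above.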

AWS Glue Incremental crawl of continually arriving data on S3

I'm looking for a way to set up an incremental Glue crawler for S3 data, where data arrives continuously and is partitioned by the date it was captured (so the S3 paths within the include path contain date=yyyy-mm-dd). My concern is that if I run the crawler in the course of a day, the partition for that day will be created and will not be revisited in subsequent crawls. Is there a way to force a given partition that I know might still be receiving updates to be crawled, while running the crawler incrementally and not wasting resources on historic data?
The crawler will visit only new folders with an incremental crawl (assuming you have set the "crawl new folders only" option). The only circumstance where adding more data to an existing folder would cause a problem is if you were changing the schema by adding a differently formatted file into a folder that was already crawled. Otherwise the crawler has already created the partition and knows the schema, so the data is ready to be queried even if new files are added to the existing folder.
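To make the behaviour concrete, here is a small Python illustration of which folders an incremental crawl would pick up. It only mimics the "new folders only" idea described above and is not how Glue is implemented; the function name, keys and folder names are hypothetical.

```python
def new_partition_folders(object_keys, crawled_folders):
    """Return folders that contain objects but have not been crawled yet."""
    seen = set(crawled_folders)
    fresh = []
    for key in object_keys:
        # The folder is everything up to and including the last slash.
        folder = key.rsplit("/", 1)[0] + "/"
        if folder not in seen:
            seen.add(folder)
            fresh.append(folder)
    return fresh


keys = [
    "events/date=2024-01-14/part-0.csv",
    "events/date=2024-01-15/part-0.csv",
    "events/date=2024-01-15/part-1.csv",  # late file in an already crawled folder
    "events/date=2024-01-16/part-0.csv",
]
already_crawled = ["events/date=2024-01-14/", "events/date=2024-01-15/"]
to_crawl = new_partition_folders(keys, already_crawled)
print(to_crawl)
```

The late file `part-1.csv` does not trigger a new crawl of its folder, but because its partition is already registered with a known schema, it is still read by queries.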

AWS Athena - What happens when you add new files to S3 folder

I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something to process them? So for example, what if I set up a Lambda function to extract a new CSV every hour, or even every 5 minutes, to that same S3 directory?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files, you may want to consider partitioning your table to optimize query performance, see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't do any caching; every query hits the S3 location of the table.
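The converse case, a file landing in a partition the table does not know about yet, needs the partition to be registered before the data shows up. One way is an `ALTER TABLE ... ADD PARTITION` statement. The Python sketch below only builds that statement as a string; the table name `events`, the partition column `dt` and the bucket `my-bucket` are assumptions for illustration, and running the statement in Athena is left out.

```python
from datetime import date


def add_partition_sql(table: str, location: str, day: date, column: str = "dt") -> str:
    """Build the DDL that registers one date partition of a table.

    Values are interpolated directly, so only use trusted inputs here.
    """
    value = day.isoformat()
    partition_path = f"{location.rstrip('/')}/{column}={value}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{value}') "
        f"LOCATION '{partition_path}'"
    )


statement = add_partition_sql("events", "s3://my-bucket/events", date(2024, 1, 16))
print(statement)
```

`IF NOT EXISTS` makes the statement safe to repeat, which suits a job that uploads files into the same day's partition several times.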

AWS Glue Crawler Overwrite Data vs. Append

I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The crawler does not alter data.
The crawler only populates the AWS Glue Data Catalog with tables.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to clean the data using Athena/Glue before using it, you need to follow these steps:
1. Map the data using a crawler into a temporary Athena database/table.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to:
   - transform, clean, rename and dedupe the data using PySpark or Scala
   - export the data to a new S3 location (.csv, .parquet, etc.), potentially partitioning it
4. Run one more crawler to map the cleaned data from the new S3 location into an Athena database.
The dedupe you are asking about happens in step 3.
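To show what the dedupe in step 3 amounts to, here is a minimal sketch in plain Python using only the standard library. It is a stand-in for the logic a PySpark or Scala Glue job would run, not Glue code; the column names `id` and `updated` and the sample rows are hypothetical.

```python
import csv
import io


def dedupe_latest(csv_texts, key_field, version_field):
    """Keep one row per key, preferring the highest version_field value."""
    latest = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            current = latest.get(row[key_field])
            # ISO dates compare correctly as plain strings.
            if current is None or row[version_field] >= current[version_field]:
                latest[row[key_field]] = row
    return [latest[k] for k in sorted(latest)]


monday_file = "id,amount,updated\n1,10,2024-01-14\n2,20,2024-01-14\n"
tuesday_file = "id,amount,updated\n1,15,2024-01-15\n3,30,2024-01-15\n"
rows = dedupe_latest([monday_file, tuesday_file], "id", "updated")
for row in rows:
    print(row)
```

Record 1 appears in both files, and only the version from the later file survives; the result would then be written to the new S3 location for the second crawler to map.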

How to split data when archiving from AWS database to S3

For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using AWS Data Pipeline to move the data from the database into S3, according to this tutorial:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (in preview when this answer was written) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
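If the export were nevertheless split into 100-row files as the question proposes, finding the right file for a row is simple arithmetic. A small Python sketch, assuming the `foo_data_<start>_to_<end>.csv` naming from the question and that files start at multiples of 100; the helper name `chunk_filename` is made up here.

```python
def chunk_filename(row_id: int, rows_per_file: int = 100, stem: str = "foo_data") -> str:
    """Return the name of the file that holds the given row."""
    # Round the row id down to the first row of its chunk.
    start = (row_id // rows_per_file) * rows_per_file
    end = start + rows_per_file - 1
    return f"{stem}_{start}_to_{end}.csv"


name = chunk_filename(10239)
print(name)  # foo_data_10200_to_10299.csv
```

With Athena or S3 Select this bookkeeping is not needed, since the row can be selected with a SQL expression instead of by file name.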