Hi I need to copy tables and views from one Athena instance to another (I'm not using Glue). How to do this via AWS Console or via boto3/pyathena api without loosing any data? Cant find anything in documentation :(
Athena tables are built on top of files stored on S3. If you need the data, you will have to export the data.
Related
I have been working with AWS Athena for a while and need to do create a backup and version control of the views. I'm trying to build an automation for the backup to run daily and get all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find a way to do that. With Dbeaver I can see and export the views SQL script but from what I've seen only one at a time which not serve the goal.
I'm open for any way.
I try to find answer to my question in boto3 documentation and Dbeaver documentation. read thread on stack over flow and some google search did not took me so far.
Views and Tables are stored in the AWS Glue Data Catalog.
You can Query the AWS Glue Data Catalog - Amazon Athena to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3? get_query_results
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL, would someone mind explaining to me how to add an attribute to the data, or point me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running to automatically download the data from another site then upload it to S3 could be an option (deleting NaN columns and adding a column for site name), but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a python script to edit the .csv to start, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to s3. Then you can run your Glue crawler or whatever process you're using to map the columns.
I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The Crawler is populating the AWS Glue Data Catalog with tables only.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to do data cleaning using Athena/Glue before using data you need to follow the steps:
Map the data using Crawler into a temporary Athena database/table
Profile your data using Athena. SQL or QuickSight etc. to get the idea what you need to alter
Use Glue job to
make data transformation/cleaning/renaming/deduping using PySpark or Scala
export data into S3 new location (.csv / .paruqet etc.) potentially partitioning
Run one more Crawler to map cleaned data from the new S3 location into Athena database
The dedupe you are askinging about happens in step 3
For a project we've inherited we have a large-ish set of legacy data, 600GB, that we would like to archive, but still have available if need be.
We're looking at using the AWS data pipeline to move the data from the database to be in S3, according to this tutorial.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and giving each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
I created a database and some tables with Data on AWS Athena and would like to rename the database without deleting and re-creating the tables and database. Is there a way to do this? I tried the standard SQL alter database but it doesn't seem to work.
thanks!
I'm afraid there is no way to do this according to this official forum thread. You would need to remove the database and re-create it. However, since Athena does not store any data by itself, deleting a table or a database won't impact your data stored on S3. Therefore, if you kept all the scripts that create external tables, re-creating a database should be fairly quick thing to do.
Athena doesn't support renaming database. You need to recreate database with a new name.
You can use Presto which is an open source version of Athena and Presto supports more DDL queries.