Is it possible to re-partition the data using AWS Glue crawler?

I have inherited an S3 bucket from a former colleague, where the files inside are partitioned by id and time, such as:
s3://bucket/partition_id=0/year=2017/month=6/day=1/file
The data in all these files forms one table and can be queried through Athena. The Glue catalog also shows that partition(0) is id, partition(1) is year, and so on.
Recently I wanted to restructure this, since partitioning by id is not very straightforward. I tried pointing the Glue crawler at the S3 bucket, but there is nowhere I could choose to partition only by time, not id, like this:
s3://bucket/year=2017/month=6/day=1/file
I am quite new to AWS and not sure if this is possible or even makes sense. Please give me some feedback. Thank you.

I don't think you can do it with the help of a crawler; however, you can create a new table manually in Athena like this (also see https://docs.aws.amazon.com/en_us/athena/latest/ug/ctas-examples.html):
CREATE TABLE new_table
WITH (
  format = 'ORC',
  external_location = 's3://...',
  partitioned_by = ARRAY['year', 'month', 'day'])
AS SELECT *
FROM old_table;

Alternatively, write a Python shell job that uses the S3 boto3 APIs to reorganize the folder structure, and then run the crawler.
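A minimal sketch of such a job, assuming the bucket name and the partition_id=N/year=... layout from the question (all names here are illustrative):
import boto3

s3 = boto3.client("s3")
BUCKET = "bucket"  # hypothetical bucket name

# Copy every object from partition_id=N/year=.../... to year=.../...,
# dropping the leading partition_id level, then delete the old copy.
# (copy_object handles objects up to 5 GB; larger ones need a multipart copy.)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="partition_id="):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]                # partition_id=0/year=2017/month=6/day=1/file
        new_key = old_key.split("/", 1)[1]  # year=2017/month=6/day=1/file
        s3.copy_object(Bucket=BUCKET,
                       CopySource={"Bucket": BUCKET, "Key": old_key},
                       Key=new_key)
        s3.delete_object(Bucket=BUCKET, Key=old_key)
Once the objects are laid out as year=/month=/day=, pointing the crawler (or a manually created table) at the bucket gives you the time-only partitioning.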

Related

How can I save SQL script from AWS Athena view with boto3/python

I have been working with AWS Athena for a while and need to create a backup and version control of the views. I'm trying to build an automation for the backup to run daily and capture all the views.
I tried to find a way to copy all the views created in Athena using boto3, but I couldn't find one. With DBeaver I can see and export a view's SQL script, but from what I've seen only one at a time, which does not serve the goal.
I'm open to any approach.
I tried to find an answer in the boto3 and DBeaver documentation, read threads on Stack Overflow, and did some Google searching, but none of it took me very far.
Views and Tables are stored in the AWS Glue Data Catalog.
You can query the AWS Glue Data Catalog from Amazon Athena to obtain information about tables, partitions, columns, etc.
However, if you want to obtain the DDL that was used to create the views, you will probably need to use SHOW CREATE TABLE [db_name.]table_name, which:
Analyzes an existing table named table_name to generate the query that created it.
Have you tried using get_query_results in boto3?
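Since the views live in the Glue Data Catalog, one way to script the daily backup is to pull every view definition with boto3. A minimal sketch, assuming a hypothetical database name:
import boto3

glue = boto3.client("glue")
DATABASE = "my_database"  # hypothetical database name

# Athena views are catalog entries with TableType VIRTUAL_VIEW; Glue
# stores their definition in ViewOriginalText (for Athena-created views,
# a base64-encoded "/* Presto View: ... */" blob).
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        if table.get("TableType") == "VIRTUAL_VIEW":
            print(table["Name"])
            print(table.get("ViewOriginalText", ""))
Another route is to run SHOW CREATE VIEW for each view via the Athena query APIs and save the results.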

How does Amazon Athena manage rename of columns?

Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column gets renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script that iterates through all my files on S3, renames the columns, and updates the table schema in Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, Parquet columns are read by name, but you can change that to reading by index as well. Each way has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
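If you go with the read-by-index route mentioned above, a minimal sketch of such a table definition, submitted through boto3 (the SerDe property is what flips the column mapping from names to positions; the table, bucket, and output locations are all illustrative):
import boto3

athena = boto3.client("athena")

# With index access, renaming col_b to col_beta in newer Parquet files
# does not break the mapping, because the second physical column is
# read either way. All names below are illustrative.
ddl = """
CREATE EXTERNAL TABLE my_database.my_table (
  col_a string,
  col_beta string,
  col_c string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access' = 'true')
STORED AS PARQUET
LOCATION 's3://my-bucket/my-table/'
"""
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)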
You can schedule the AWS Glue crawler 'On Demand' or 'Time Based', so every time your data on S3 updates, a new schema will be generated (you can edit the data types of the attributes in the schema). This way your columns will stay updated and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

Update AWS Athena data & table to rename columns

Today I faced a seemingly simple problem: renaming a column of an Athena/Glue table from an old name to a new one.
First, I searched here and tried some solutions like this, this, and many others. Unfortunately, none worked, so I decided to use my knowledge and imagination.
I'm posting this question with the intention of sharing, but also to learn how others did it and maybe find out I reinvented the wheel. So please also share your way if you know how to do it.
My setup is an Athena JSON table, partitioned by day, with a valuable and enormous amount of data; the infrastructure is defined and updated through CloudFormation.
How to rename an Athena column and still keep the data?
I'll explain without all the CloudFormation infrastructure.
Imagine a table containing:
userId
score
otherColumns
eventDateUtc
dt_utc
Partitioned by dt_utc and stored using the JSON format. We need to change the column score to deltaScore.
Keep in mind, although I haven't tested with other formats/configurations, this should apply to any configuration supported by Athena, as we are going to use the Athena engine to do the job for us.
How to do it
Note: if you run the CloudFormation migration first, you will "lose" access to the dropped column, but you can simply rename the column back and the data reappears.
These are the steps required to rename an AWS Athena column:
Create a temporary table mapping the old column name to the new one.
This can be done with CREATE TABLE AS; read more in the aws docs.
With this command, we use the Athena engine to apply the transformation to the files of the original table for us and save the output at s3://bucket_name/A_folder/temp_table_rename/.
CREATE TABLE "temp_table_rename"
WITH(
format = 'JSON',
external_location = 's3://bucket_name/A_folder/temp_table_rename/',
partitioned_by = ARRAY['dt_utc']
)
AS
SELECT DISTINCT
userid,
score as deltascore,
otherColumns,
eventDateUtc,
"dt_utc"
FROM "my_database"."original_table"
Apply the rename by running the CloudFormation with the changes, or in whatever way you manage your infrastructure.
At this point, you can even drop the original_table and create it again using the right column name.
After the rename, you will notice that the renamed column has no data.
Remove the data of the original table by deleting its S3 source.
Copy the data from the temp table source to the original table source.
I prefer to use an aws CLI command, as there can be thousands of files to copy:
aws s3 cp s3://bucket_name/A_folder/temp_table_rename/ s3://bucket_name/A_folder/original_table/ --recursive
Restore the partitions of the original table:
MSCK REPAIR TABLE "my_database"."original_table"
done.
Final notes:
Using CREATE TABLE AS to do the transformation job allows you to do much more than just renaming a column, for example splitting the data of one column into two new columns, or merging several into a single one.
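For instance, a minimal sketch of a CTAS that splits a column in two while rewriting the data, submitted via boto3 (the fullName column and the locations are illustrative, not from the table above):
import boto3

athena = boto3.client("athena")

# Same pattern as the rename above, but the SELECT splits an
# illustrative fullName column into two new columns.
query = """
CREATE TABLE "temp_table_split"
WITH (
  format = 'JSON',
  external_location = 's3://bucket_name/A_folder/temp_table_split/',
  partitioned_by = ARRAY['dt_utc']
)
AS
SELECT
  userid,
  split_part(fullName, ' ', 1) AS firstName,
  split_part(fullName, ' ', 2) AS lastName,
  eventDateUtc,
  "dt_utc"
FROM "my_database"."original_table"
"""
athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://bucket_name/athena-results/"},
)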

AWS Glue crawler needs to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems but in fact solve few. If you're even slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure one to do what you want, but in my experience, trying to make Glue crawlers do things that aren't perfectly aligned with their design is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you create the table manually and write a one-off script that lists all the partition locations on S3 that you want to include in the table, then generates ALTER TABLE ADD PARTITION … SQL or Glue API calls to add those partitions to the table (see the sketch below).
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
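A minimal sketch of that one-off script, assuming a single partition key and hypothetical bucket/table names:
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-bucket", "data/"  # hypothetical names

# List the partition "folders" one level below the prefix and print an
# ALTER TABLE statement per location, to be run in Athena afterwards.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        prefix = cp["Prefix"]                      # e.g. data/dt=2020-02-03/
        value = prefix.rstrip("/").split("=")[-1]  # e.g. 2020-02-03
        print(f"ALTER TABLE my_table ADD IF NOT EXISTS "
              f"PARTITION (dt = '{value}') "
              f"LOCATION 's3://{BUCKET}/{prefix}';")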
One way to do what you want is to use just one of the tables created by the crawler as an example, and create a similar table manually (in AWS Glue -> Tables -> Add tables, or in Athena itself) with
CREATE EXTERNAL TABLE `tablename`(
`column1` string,
`column2` string, ...
using the existing table as an example. You can see the query used to create that table in Athena: go to Database -> select your database from the Glue Data Catalog, then click on the three dots in front of the one "automatically created by crawler" table that you chose as an example, and click on the "Generate Create table DDL" option. It will generate a big query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog. But it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore info for you. So in Athena you can run "MSCK REPAIR TABLE tablename;" (it's not very efficient, but it works for me), and it will add the missing partition information. In the Results tab you will see something like this (if you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.

AWS Glue: Do I really need a Crawler for new content?

What I understand from the AWS Glue docs is that a crawler will help crawl and discover new data. However, I noticed that once I have crawled once, if new data goes into S3, the data is actually already discovered when I query the Data Catalog from Athena, for example. So can I say I do not need a crawler to crawl every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog to be able to query them via Athena.
However, if the data is not partitioned or arrives into already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually.
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
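Registering a partition manually can also be scripted against the Glue API. A minimal sketch, assuming hypothetical database/table names and the year/month/day keys above:
import boto3

glue = boto3.client("glue")

# Register one new partition directly in the Data Catalog instead of
# running a crawler or MSCK. The StorageDescriptor is copied from the
# table so the formats match; only the location changes.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
sd = dict(table["StorageDescriptor"],
          Location="s3://bucket/data/year=2018/month=11/day=3/")
glue.create_partition(
    DatabaseName="my_database",
    TableName="my_table",
    PartitionInput={"Values": ["2018", "11", "3"], "StorageDescriptor": sd},
)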
If you have the schema, then you don't need to use the crawler, and you might get better results (the crawler assumes partition columns are strings, for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.
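If you script this, you can keep re-running MSCK until a run succeeds. A minimal sketch with boto3, assuming hypothetical table and output-location names:
import time
import boto3

athena = boto3.client("athena")

# Re-run MSCK REPAIR TABLE until a run finishes without failing (e.g.
# after time-outs); each run registers another batch of partitions.
while True:
    qid = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_database.my_table",
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    state = "RUNNING"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(5)
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
    if state == "SUCCEEDED":
        break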