Duplicate Table in AWS Glue using AWS Athena - amazon-web-services

I have a table in AWS Glue which uses an S3 bucket for its data location. I want to execute an Athena query on that existing table and use the query results to create a new Glue table.
I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but:
A .csv.metadata file is put in this location along with the actual .csv output (and is read by the Glue table, since it reads all files in the specified S3 location).
The .csv file places double quotes around each field, which breaks any fieldSchema defined in the Glue table that uses numbers.
These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)

The way to do that is by using CTAS query statements.
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.
For example:
CREATE TABLE new_table
WITH (
external_location = 's3://my_athena_results/new_table_files/'
) AS (
-- Here goes your normal query
SELECT
*
FROM
old_table
)
There are some limitations, though. For your case the most important are:
The destination location for storing CTAS query results in Amazon S3 must be empty.
The same applies to the name of new table, i.e. it shouldn't exist in AWS Glue Data Catalog.
In general, you don't have explicit control of how many files will be created as a result of CTAS query, since Athena is a distributed system.
However, you can try a workaround that uses the bucketed_by and bucket_count fields within the WITH clause:
CREATE TABLE new_table
WITH (
external_location = 's3://my_athena_results/new_table_files/',
bucketed_by=ARRAY['some_column_from_select'],
bucket_count=1
) AS (
-- Here goes your normal query
SELECT
*
FROM
old_table
)
Apart from creating new files and defining a table associated with them, you can also convert your data to a different file format, e.g. Parquet, JSON etc.
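As a rough illustration of the CTAS options discussed above, here is a minimal Python sketch that only assembles the statement text. The helper name build_ctas and its parameters are hypothetical; format, external_location, bucketed_by and bucket_count are the CTAS properties mentioned in this answer. Submitting the query to Athena is left out.

```python
def build_ctas(new_table, select_sql, external_location, fmt="PARQUET",
               bucketed_by=None, bucket_count=None):
    """Return an Athena CTAS statement as a string (hypothetical helper)."""
    props = [
        f"format = '{fmt}'",
        f"external_location = '{external_location}'",
    ]
    if bucketed_by:
        cols = ", ".join(f"'{c}'" for c in bucketed_by)
        props.append(f"bucketed_by = ARRAY[{cols}]")
        # Default to a single bucket, as in the workaround above.
        props.append(f"bucket_count = {int(bucket_count or 1)}")
    # A trailing semicolon inside the CTAS body is a syntax error, so drop it.
    body = select_sql.strip().rstrip(";").rstrip()
    return (
        f"CREATE TABLE {new_table}\nWITH (\n    "
        + ",\n    ".join(props)
        + f"\n) AS\n{body}"
    )


print(build_ctas(
    "new_table",
    "SELECT * FROM old_table;",
    "s3://my_athena_results/new_table_files/",
    bucketed_by=["some_column_from_select"],
    bucket_count=1,
))
```

Writing the result as Parquet rather than CSV also sidesteps the quoted-number problem from the question, since column types are stored in the files themselves.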

I guess you have to change your SerDe. If you are querying CSV data, either OpenCSVSerde or LazySimpleSerDe should work for you.

Related

How does Amazon Athena manage rename of columns?

everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column is renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script to iterate through all my files on S3, rename columns and update the table schema on Athena from 'col_b' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
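To make the superset-plus-view idea concrete, here is a small Python sketch that generates such a view. It assumes the table schema declares both the old and the new column name (col_b and col_beta in the question's example) with the same type; the helper name build_rename_view and the view name are hypothetical.

```python
def build_rename_view(view, table, columns, renames):
    """Build a CREATE VIEW statement that merges renamed columns.

    columns: column names the view should expose, in order.
    renames: maps a current column name to the list of its older names.
    """
    exprs = []
    for col in columns:
        old_names = renames.get(col, [])
        if old_names:
            # Take the first non-null value among the new and old names.
            merged = ", ".join([col] + old_names)
            exprs.append(f"COALESCE({merged}) AS {col}")
        else:
            exprs.append(col)
    return (
        f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n    "
        + ",\n    ".join(exprs)
        + f"\nFROM {table}"
    )


print(build_rename_view(
    "table_resolved",                 # hypothetical view name
    "table",
    ["col_a", "col_beta", "col_c"],
    {"col_beta": ["col_b"]},
))
```

Queries against the view then see col_beta for both Day 1 and Day 2 files, because Day 1 rows fall back to col_b.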
You can set the AWS Glue crawler's schedule to 'On Demand' or 'Time Based', so that each time the crawler runs after your data on S3 changes, a new schema is generated (you can edit the data types of the attributes in the schema). This way your columns will stay updated and you can query on the new field.
Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

AWS Glue job to convert table to Parquet w/o needing another crawler

Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?
Current set up:
JSON files in partitioned S3 bucket are crawled once a day
Glue Job creates Parquet files in specified folder
Run ANOTHER crawler to RECREATE the same table that was made in step 1
I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!
For convenience considerations - 2 crawlers is the way to go.
For cost considerations - a hacky solution would be:
Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command.
In the CREATE TABLE DDL, replace the table name and change the SerDe from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION.
Execute the new CREATE TABLE DDL in Athena.
For example:
SHOW CREATE TABLE json_table;
Original DDL:
CREATE EXTERNAL TABLE `json_table`(
`id` int COMMENT,
`name` string COMMENT)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
...
LOCATION
's3://bucket_name/table_data'
...
New DDL:
CREATE EXTERNAL TABLE `parquet_table`(
`id` int,
`name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_name/table_data'
You can also do it in the same way with the Glue API methods: get_table() > replace > create_table().
Note: if you want to run it periodically, you would need to wrap it in a script and schedule it with another scheduler (crontab etc.) to run after the first crawler finishes.
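The get_table() > replace > create_table() route could look roughly like the sketch below. It only does the "replace" step on the table dictionary, so it runs without AWS access; the helper name to_parquet_table_input is hypothetical, and the assumption is that the input has the shape of the "Table" entry returned by get_table(). Read-only fields such as CreateTime are dropped by building a fresh TableInput.

```python
import copy

PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"


def to_parquet_table_input(table, new_name, location):
    """Turn a JSON table definition into a TableInput for a Parquet table."""
    # Copy so the caller's dictionary is left untouched.
    sd = copy.deepcopy(table["StorageDescriptor"])
    sd["Location"] = location
    sd["InputFormat"] = PARQUET_INPUT
    sd["OutputFormat"] = PARQUET_OUTPUT
    sd["SerdeInfo"] = {
        "SerializationLibrary": PARQUET_SERDE,
        "Parameters": {"serialization.format": "1"},
    }
    return {
        "Name": new_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": sd,
        "PartitionKeys": copy.deepcopy(table.get("PartitionKeys", [])),
        "Parameters": {"classification": "parquet"},
    }
```

With boto3 this would be used along the lines of table = glue.get_table(DatabaseName=db, Name="json_table")["Table"] followed by glue.create_table(DatabaseName=db, TableInput=to_parquet_table_input(table, "parquet_table", parquet_location)), where db and parquet_location are placeholders. Partitions of the new table are not copied by this sketch and would still need to be registered separately.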

Converting data in AWS S3 to another schema structure (also in S3)

quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure. I added some more partition levels: the fields currently reside inside the JSON and I want to move them to the folder hierarchy. Also, I changed the inner JSON schema slightly. The new logs reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve this? I thought I need an EMR cluster of type step execution but it probably creates just one output file, no?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena:
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new_table SELECT * FROM old_table
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Yes, EMR can be used for such a conversion.
Here's sample code that converts data arriving in CSV format (in the stg folder, aka source folder) to the ORC file format. You may want to use INSERT OVERWRITE in case you have overlapping partitions between your staging (aka source) files and target files.
DROP TABLE IF EXISTS db_stg.stg_table;
CREATE EXTERNAL TABLE `db_stg.stg_table`(
GEO_KEY string,
WK_BEG_DT string,
FIS_WK_NUM Double,
AMOUNT1 Double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');
DROP TABLE IF EXISTS db_tgt.target_table;
CREATE EXTERNAL TABLE db_tgt.target_table(
GEO_KEY string,
WK_BEG_DT string,
AMOUNT1 Double
)
-- Partition columns are declared only here, with a type, not in the column list
PARTITIONED BY (FIS_WK_NUM Double)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
location 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES (
'orc.compress'='SNAPPY');
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- With dynamic partitioning, the partition column must come last in the SELECT
insert overwrite table db_tgt.target_table partition(FIS_WK_NUM)
select
GEO_KEY ,
WK_BEG_DT ,
AMOUNT1 ,
FIS_WK_NUM
from db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember, the key to using Athena is to minimize the amount of data you scan and read. Hence, if the data is in a columnar format and you are reading only certain partitions, your AWS Athena cost will go down significantly. All you need to do is make sure your Athena queries use a filter condition that selects the required partitions.
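Partition pruning depends on the key=value folder layout, which is what the question wants to derive from fields inside each JSON record. The following is a plain-Python sketch of that one step, not the EMR/Hive approach from the answers: it splits a JSON log line into a Hive-style prefix and the remaining body. The field names year and app, and the helper names, are made up for illustration.

```python
import json


def partition_prefix(record, partition_fields):
    """Build a Hive-style prefix such as 'year=2018/app=web/'."""
    parts = [f"{field}={record[field]}" for field in partition_fields]
    return "/".join(parts) + "/"


def split_record(line, partition_fields):
    """Return (prefix, json_body) with the partition fields moved out."""
    record = json.loads(line)
    prefix = partition_prefix(record, partition_fields)
    body = {k: v for k, v in record.items() if k not in partition_fields}
    return prefix, json.dumps(body, sort_keys=True)


prefix, body = split_record(
    '{"year": 2018, "app": "web", "msg": "hi"}',
    ["year", "app"],
)
print(prefix)  # year=2018/app=web/
print(body)    # {"msg": "hi"}
```

For large volumes the Hive INSERT with dynamic partitions does the same regrouping in a distributed way; this sketch is only meant to show what ends up in the path and what stays in the file.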

Can I delete data (rows in tables) from Athena?

Is it possible to delete data stored in S3 through an Athena query? I have some rows I have to delete from a couple of tables (they point to separate buckets in S3).
I couldn't find a way to do it in the Athena User Guide: https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf and DELETE FROM isn't supported, but I'm wondering if there is an easier way than trying to find the files in S3 and deleting them.
You can leverage Athena to find out all the files that you want to delete and then delete them separately. There is a special variable "$path".
Select "$path" from <table> where <condition to get row of files to delete>
To automate this, you can iterate over the Athena results, take each file name, and delete those files from S3.
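A sketch of that automation step, assuming the "$path" values have already been fetched from the Athena result set as strings of the form s3://bucket/key. The helper name group_paths_for_delete is hypothetical; it only groups the paths into per-bucket batches and does not call S3.

```python
def group_paths_for_delete(paths, batch_size=1000):
    """Group s3:// paths into (bucket, [{"Key": ...}, ...]) batches."""
    keys_by_bucket = {}
    for path in paths:
        if not path.startswith("s3://"):
            raise ValueError(f"not an S3 path: {path}")
        bucket, _, key = path[len("s3://"):].partition("/")
        # A dict keeps insertion order and drops duplicate paths,
        # which appear when several matching rows live in one file.
        keys_by_bucket.setdefault(bucket, {})[key] = None
    batches = []
    for bucket, keys in keys_by_bucket.items():
        key_list = list(keys)
        for start in range(0, len(key_list), batch_size):
            chunk = key_list[start:start + batch_size]
            batches.append((bucket, [{"Key": k} for k in chunk]))
    return batches
```

Each batch can then be passed to boto3 as s3.delete_objects(Bucket=bucket, Delete={"Objects": objects}), which accepts up to 1000 keys per call. Keep in mind that this removes whole files, including any rows in them that did not match the condition.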
I also would like to add that after you find the files to be updated you can filter the rows you want to delete, and create new files using CTAS:
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
Later you can replace the old files with the new ones created by CTAS. I think it is the simplest way to go.
The answer is yes: you can now delete data from Athena. AWS recently introduced support for Iceberg tables, which provide ACID transactions.
You need to create an Iceberg table that will hold the same data as your Athena table (the one where you want to delete/update records) using the steps below.
Create ICEBERG TABLE
CREATE TABLE new_iceberg_table
(id double, name string)
LOCATION 's3://path/where/you/want/to_save/'
TBLPROPERTIES ('table_type'='ICEBERG')
Load data from your Data Catalogue into this new iceberg table.
INSERT INTO datasource.new_iceberg_table
SELECT * FROM datasource.main_athena_table;
main_athena_table = Table where you want to perform Delete/Update or ACID.
new_iceberg_table = Newly created table
Now you can insert, update, and delete the data from iceberg table.
You can also time travel using SYSTEM_TIME (on Athena engine version 3 the clause is written FOR TIMESTAMP AS OF).
Relevant SQL's
#Update SQL =
UPDATE datasource.new_iceberg_table SET id = 4 WHERE name='ABC' ;
#Delete SQL =
DELETE from datasource.new_iceberg_table where name='ABC' ;
#Time travel SQL (In case you want to time travel and want to see the older data)
SELECT * from datasource.new_iceberg_table for SYSTEM_TIME as of (current_timestamp - interval '10' minute) where name='ABC'
Thank you.
I would just like to add to Dhaval's answer.
You can find the path of the file containing the rows that you want to delete and, instead of deleting the entire file, delete just those rows from the S3 file, which I am assuming is in JSON format.
The process is to download the particular file which has those rows, remove the rows from that file, and upload the file back to S3 under the same key.
This just replaces the original file with the one with modified data (in your case, without the rows that got deleted). After the upload, Athena will read the data again on the next query and the deleted rows won't show up.
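The "remove the rows" step could be sketched as follows, assuming the file holds one JSON object per line (the layout Athena expects for JSON tables). The helper name drop_rows and the id field in the example are hypothetical; downloading and re-uploading the file is left out.

```python
import json


def drop_rows(json_lines_text, should_delete):
    """Return the file content without the rows matched by should_delete."""
    kept = []
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not should_delete(json.loads(line)):
            kept.append(line)
    # Keep a trailing newline only when there is content left.
    return "\n".join(kept) + ("\n" if kept else "")


original = '{"id": 1, "name": "ABC"}\n{"id": 2, "name": "XYZ"}\n'
print(drop_rows(original, lambda row: row["name"] == "ABC"))
```

If every row in a file matches, the result is an empty string, in which case deleting the object is probably cleaner than uploading an empty file.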
Use AWS Glue for that.
Load your data, delete what you need to delete, save the data back.
Now you can also delete files from s3 and merge data: https://aws.amazon.com/about-aws/whats-new/2020/01/aws-glue-adds-new-transforms-apache-spark-applications-datasets-amazon-s3/
AWS has announced general availability of the Iceberg integration with Athena, and Athena now supports row-level DML for Iceberg tables.
UPDATE and DELETE rows can be done using SQLs:
DELETE FROM [db_name.]table_name [WHERE predicate]
UPDATE [db_name.]table_name SET xx=yy[,...] [WHERE predicate]
For more details - AWS DOCUMENT
Note that Athena Iceberg integration now is very restrictive - nested SQLs for Deletes and Updates are NOT supported.
Below query won't work:
Delete from table1 where uniqueid in (select b.uniqueid from delete_staging b)
Shivendra Singh's answer about ICEBERG should be accepted, as ICEBERG seems to answer all needs now. But if you need to stay on the Hive table, or if your file format is JSON and you need to keep it this way, you have the following option:
Use CTAS to create new table with values you want to keep. If it's hard to phrase query this way, you can always do something like where id not in (select id ...) or select * from ... except select * from .... If your table is partitioned, and after deletion there should be more than 100 partitions, you'll need to use "insert into" technique to create up to 100 partitions per query (https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html).
Move (just in case) original data from S3 for partitions that were relevant for the deletion
Move data that was created by (1)

can athena table be created for s3 bucket sub-directories?

Our s3 buckets generally have a number of sub-directories, so that the path to a bucket is something like s3:top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
Is it possible to create an Athena table that queries against this structure? And what would the S3 location be in the create table wizard? I tried it with s3:top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 Kb data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
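When there are many tenants, the ALTER TABLE statements above can be generated rather than written by hand. Here is a minimal Python sketch; the helper name add_partition_statements is hypothetical, and the tenant IDs are assumed to match the folder names under the table location.

```python
def add_partition_statements(table, base_location, tenant_ids,
                             key="orgtenantcompanyid"):
    """Return one ALTER TABLE ... ADD PARTITION statement per tenant."""
    base = base_location.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({key}='{tenant}') "
        f"LOCATION '{base}/{tenant}/'"
        for tenant in tenant_ids
    ]


for statement in add_partition_statements(
    "mydb.mytable",
    "s3://mybucket/top-level-function-group/more-specific-folder/",
    ["org1", "org2"],
):
    print(statement)
```

IF NOT EXISTS makes the statements safe to re-run when new tenant folders appear.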
Yes, it is possible to create tables that only use contents of a specific subdirectory.
It's normal that after creating your table you see 0kb read. That's because no data is read when you CREATE a table.
To check whether you can actually query the data, do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, not automatically if it's not in the right format /key=value/. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables