I need to insert data into AWS Redshift on a daily basis.
The requirement is to analyze only the daily batch inserted into Redshift. The Redshift cluster is used by BI tools for analytics.
Question:
What are the best practices to "renew" the data set on a daily basis?
My concern is that this is quite a heavy operation and performance will be poor, but at the same time it is a common situation and I believe it has been done before by multiple organizations.
If the data is on S3, why not create an EXTERNAL TABLE over it? Then, if query speed over the external table is not enough, you can load it into a temporary table using a CREATE TABLE AS SELECT statement and, once loaded, rename it to your usual table name.
Sketched SQL:
-- 1. Define an external (Spectrum) table over the day's files in S3
CREATE EXTERNAL TABLE external_daily_batch_20190422 (
<schema ...>
)
PARTITIONED BY (
<if anything to partition on>
)
ROW FORMAT SERDE <data format>
LOCATION 's3://my-s3-location/2019-04-22';

-- 2. Materialize the batch into a local Redshift table
CREATE TABLE internal_daily_batch_temp
DISTKEY ...
SORTKEY ...
AS
SELECT * FROM external_daily_batch_20190422;

-- 3. Swap the new table in, keeping yesterday's copy as a backup
DROP TABLE IF EXISTS internal_daily_batch__backup CASCADE;
ALTER TABLE internal_daily_batch RENAME TO internal_daily_batch__backup;
ALTER TABLE internal_daily_batch_temp RENAME TO internal_daily_batch;
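If the swap needs to look atomic to the BI tools, a variation (just a sketch, assuming your Redshift setup permits DROP and RENAME inside an explicit transaction block) is to run the three swap statements in one transaction:
BEGIN;
-- Readers either see yesterday's table or the freshly loaded one, never neither
DROP TABLE IF EXISTS internal_daily_batch__backup CASCADE;
ALTER TABLE internal_daily_batch RENAME TO internal_daily_batch__backup;
ALTER TABLE internal_daily_batch_temp RENAME TO internal_daily_batch;
COMMIT;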
Incremental load not possible?
By the way, is all of your 10TB of data mutable? Isn't an incremental update possible?
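For context, a minimal sketch of the common staging-table upsert pattern in Redshift, assuming the files are Parquet and the table has a key column id that identifies the rows to replace (all names, the IAM role, and the key column are hypothetical):
-- Stage the day's batch (hypothetical names and IAM role)
CREATE TEMP TABLE stage_daily_batch (LIKE internal_daily_batch);
COPY stage_daily_batch
FROM 's3://my-s3-location/2019-04-22/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;

-- Replace changed rows and append new ones in one transaction
BEGIN;
DELETE FROM internal_daily_batch
USING stage_daily_batch
WHERE internal_daily_batch.id = stage_daily_batch.id;
INSERT INTO internal_daily_batch SELECT * FROM stage_daily_batch;
COMMIT;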
Related
I have a partitioned table in RDS (Postgres). This table is read by DMS and put into S3 along with other tables. For the partitioned table, DMS creates a new folder each day on S3, named like this:
revenue/
revenue_y2022d001/
revenue_y2022d002/
sales/
sales_y2022d001/
sales_y2022d002/
...
I want to read only the sales tables into a Glue dynamic frame.
There is a way to use exclusions in the crawler, but then one crawler would need to be created for each partitioned table.
So is there a way to read only the folders starting with sales into a Glue dynamic frame?
Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries over Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS); I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem, but what will happen when a column is renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script to iterate through all my files on S3, rename the columns, and update the table schema in Athena from 'col_b' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this more!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
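As a sketch of the by-index option (assuming the SerDe property name from the Athena schema-update docs, 'parquet.column.index.access'; the table name, columns, and location are hypothetical):
CREATE EXTERNAL TABLE my_parquet_table (
  col_a string,
  col_b string,
  col_c string
)
-- Read Parquet columns by position instead of by name
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access' = 'true')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/my-parquet-table/';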
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
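For instance, a sketch of that view-based approach, assuming the catalog table (here called my_table_superset, a hypothetical name) declares both col_b and col_beta, so each Parquet file fills in whichever column it actually contains:
-- Resolve the rename manually: only one of the two columns is non-null per file
CREATE OR REPLACE VIEW my_table_resolved AS
SELECT
  col_a,
  COALESCE(col_beta, col_b) AS col_beta,
  col_c
FROM my_table_superset;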
You can set the AWS Glue crawler schedule to 'On Demand' or 'Time Based', so every time your data on S3 is updated a new schema version will be generated (you can edit the data types of the attributes in the schema). This way your columns will stay updated and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.
Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?
Current setup:
JSON files in partitioned S3 bucket are crawled once a day
Glue Job creates Parquet files in specified folder
Run ANOTHER crawler to RECREATE the same table that was made in step 1
I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!
For convenience considerations - two crawlers are the way to go.
For cost considerations - a hacky solution would be:
Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command;
In the CREATE TABLE DDL, replace the table name and change the SerDe (and storage format) from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION.
Execute the new CREATE TABLE DDL in Athena.
For example:
SHOW CREATE TABLE json_table;
Original DDL:
CREATE EXTERNAL TABLE `json_table`(
`id` int COMMENT,
`name` string COMMENT)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
...
LOCATION
's3://bucket_name/table_data'
...
New DDL:
CREATE EXTERNAL TABLE `parquet_table`(
  `id` int,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/table_data'
You can also do it in the same way with Glue api methods: get_table() > replace > create_table().
Notice - if you want to run it periodically you would need to wrap it in a script and schedule it with another scheduler (crontab etc.) after the first crawler runs.
I have a table in AWS Glue which uses an S3 bucket for its data location. I want to execute an Athena query on that existing table and use the query results to create a new Glue table.
I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but:
a .csv.metadata file is put in this location along with the actual .csv output (which is read by the Glue table, since it reads all files in the specified S3 location);
the CSV file places double quotes around each field, which ruins any fieldSchema defined in the Glue table that uses numbers.
These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)
The way to do that is by using CTAS query statements.
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.
For example:
CREATE TABLE new_table
WITH (
  external_location = 's3://my_athena_results/new_table_files/'
) AS (
  -- Here goes your normal query
  SELECT
    *
  FROM
    old_table
);
There are some limitations, though. For your case, the most important are:
The destination location for storing CTAS query results in Amazon S3 must be empty.
The same applies to the name of the new table, i.e. it shouldn't exist in the AWS Glue Data Catalog.
In general, you don't have explicit control over how many files will be created as a result of a CTAS query, since Athena is a distributed system.
However, you can try "this workaround", which uses the bucketed_by and bucket_count fields within the WITH clause to force a single output file:
CREATE TABLE new_table
WITH (
  external_location = 's3://my_athena_results/new_table_files/',
  bucketed_by = ARRAY['some_column_from_select'],
  bucket_count = 1
) AS (
  -- Here goes your normal query
  SELECT
    *
  FROM
    old_table
);
Apart from creating new files and defining a table associated with them, you can also convert your data to a different file format, e.g. Parquet, JSON etc.
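For example, a sketch of the same CTAS pattern that also converts the output to Parquet via the format property (table name and location are placeholders):
CREATE TABLE new_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my_athena_results/new_table_parquet_files/'
) AS (
  -- Result files are written as Parquet instead of the default format
  SELECT *
  FROM old_table
);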
I guess you have to change your SerDe. If you are querying CSV data, either OpenCSVSerDe or LazySimpleSerDe should work for you.
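A sketch of what that could look like for the quoted CSV results (the table name, columns, and location are hypothetical; declaring the columns as string and casting in queries sidesteps the quoted-number issue):
-- OpenCSVSerde strips the surrounding double quotes from each field
CREATE EXTERNAL TABLE my_query_results_csv (
  id string,
  amount string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://my_athena_results/new_table_files/';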
I am trying to drop all the partitions on an external table in a Redshift cluster. I am unable to find an easy way to do it. I am currently doing this by running a dynamic query to select the dates from the table, concatenating them with the drop logic, taking the result set, and running it separately, like this:
select 'ALTER TABLE procore_iad_ext.active_histories DROP PARTITION (values='''||rtrim(ltrim(values, '["'),'"]') ||''');' from svv_external_partitions
where tablename = 'xyz';
values looks like this -> ["2009-03-10"]
Looking for a simpler direct solution. Thanks.
The easiest way to do this would be to drop the table itself. As long as you have the DDL to recreate the table and don't mind dropping all partitions, just DROP TABLE <schemaname>.<tablename>; then recreate the table. The new table will not have any partitions.
Please check out the Glue catalog. It provides a UI to easily delete the tables/partitions etc.