Can BigLake tables detect schema changes on GCS? - google-cloud-platform

I am thinking of a BigQuery setup, where
the source files are CSVs stored on GCS, and
tables are created from those files with BigLake
tables are partitioned by created_at column
I would like to know how this setup can handle schema changes. The source files are actually exports from our application DB, so their schema can change when application schema is updated.
In order to set up a partition upon a specific column in the table, I noticed that I need to specify the schema beforehand and "Auto-detect schema" option cannot be used.
In this situation, will I have to update the schema manually every time the source schema is changed? Or will BigLake be able to handle the change on its own?
I have read the following page on BigLake, but I could not find them mentioning schema changes.
https://cloud.google.com/bigquery/docs/biglake-quickstart

BigLake will not handle the table’s schema if the external table’s(i.e CSV in your case) schema gets changed. Once you upload the CSV into GCS and if the source file’s schema is different from the previous version then you will also need to update the table’s schema.
You can follow any of the documented methods to alter the schema of a table.

Related

How to add columns to an existing Athena table using Avro storage

I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant and Athena seems to be lacking support for commands that are referenced in this same scenario in vanilla Hive world. Thanks for any insights.

Athena tables having history of records of every csv

I am uploading CSV files in the s3 bucket and creating tables through glue crawler and seeing the tables in Athena, making connection between Athena and Quicksight, and showing the result graphically there in quicksight.
But what I need to do now is keep the history of the files uploaded, instead of a new CSV file being uploaded and crawler updating the table, can I have crawler save each record separately? or is it even a reasonable thing to do? since I wonder it would then create so many tables and it'll be a mess?
I'm just trying to figure out a way to keep a history of previous records. how can i achieve this?
When you run an Amazon Athena query, Athena will look at the location parameter defined in the table's DDL. This specifies where the data is stored in an Amazon S3 bucket.
Athena will include all files in that location when it runs the query on that table. Thus, if you wish to add more data to the table, simply add another file in that S3 location. To replace data in that table, you can overwrite the file(s) in that location. To delete data, you can delete files from that location.
There is no need to run a crawler on a regular basis. The crawler can be used to create the table definition and it can be run again to update the table definition if anything has changed. But you typically only need to use the crawler once to create the table definition.
If you wish to preserve historical data in the table while adding more data to the table, simply upload the data to new files and keep the existing data files in place. That way, any queries will include both the historical data and the new data because Athena simply looks at all the files in that location.

How does Amazon Athena manage rename of columns?

everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
You can set a granularity based on 'On Demand' or 'Time Based' for the AWS Glue crawler, so every time your data on the S3 updates a new schema will be generated (you can edit the schema on the data types for the attributes). This way your columns will stay updated and you can query on the new field.
Since AWS Athena reads data in CSV and TSV in the "order of the columns" in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

Updating manually created aws glue data catalog table with crawler

I'm working with AWS glue and many files on s3, with new files appended every day. I try to create and run a crawler to deduce a schema of those csv files. Instead of just one data catalog table with schema, crawler creates many tables (even with Create a single schema for each S3 path option selected), which means that crawler recognize different schemas and can't combine them into one. But I need just one table in data catalog for all those files!
So I created separate data catalog table manually, and when I use this table with glue job, none of the s3 csv files are processed. I guess that is because every time crawler runs, it checks for new files and partitions (and in good case of single schema table we can see those files and partitions by clicking on View partitions button in Tables).
So in here there is way to update manually created table with a crawler, I followed it with a hope that crawler will not change data types for columns that I selected, but update list of files and partitions for glue job to process later:
You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.
To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
It doesn't happen for some reason, in crawler log I see this:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket1/customer/dt=2020-02-26/delta_20200226_080101.csv
INFO : Multiple tables are found under location bucket1/customer/. Table customer is skipped.
But there is no "Exclude patterns" option to exclude that file when crawler uses existing data catalog table, documentation says that in this case "The crawler then crawls the data stores specified by the catalog tables".
And crawler doesn't add any partitions or files to my table.
Is there a way to update my manually created table with new files from s3?
Considering your crawler is detecting different schemas, it will continue to do the same no matter what option I choose. You can get it to use the table definition from the table for all the partitions and then only log changes to avoid updating the table schema. But if there is a difference in schema for the files , I’m not sure if your queries will work.
Another option would be to add partitions using boto3 for your s3 path. I can get the table schema using the get table function and then create a partition in glue with that table schema
I don't know why, but the crawler I created can't update list of files and partitions for glue job to process later, it skips my manually created data catalog table, I see it in the cloudwatch log. To solve this problem, I needed to add repair table query into my glue script, so it does what crawler is supposed to do (and I disabled the crawler itself, so it doesn't changes my manually created table and doesn't create many tables for individual csv files and partitions), before actual ETL process:
import boto3
...
# Athena query part
client = boto3.client('athena', region_name='us-east-2')
data_catalog_table = "customer"
db = "inner_customer" # glue data_catalog db, not Postgres DB
# this supposed to update all partitions for data_catalog_table, so glue job can upload new file data into DB
q = "MSCK REPAIR TABLE "+data_catalog_table
# output of the query goes to s3 file normally
output = "s3://bucket_to_store_query_results/results/"
response = client.start_query_execution(
QueryString=q,
QueryExecutionContext={
'Database': db
},
ResultConfiguration={
'OutputLocation': output,
}
)`
After that query "MSCK REPAIR TABLE customer" executes, it writes to s3://bucket_to_store_query_results/results/ a xxx-xxx-xxx.txt file with content like this:
Partitions not in metastore: customer:dt=2020-03-28 customer:dt=2020-03-29 customer:dt=2020-03-30
Repair: Added partition to metastore customer:dt=2020-03-28
Repair: Added partition to metastore customer:dt=2020-03-29
Repair: Added partition to metastore customer:dt=2020-03-30
And if I open Glue->Tables-> select customer table, then click on "View partitions" button on the right top of the page, I see all my partitions from the s3 bucket. After that part the glue job continues as before. I understand that "repair table" query hack is not really optimal, and may be will change it to something more sophisticated, like described in here.

AWS Glue crawler need to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claims to solve many problems, but in fact solves few. If you're slightly outside the scope of what they designed for you're out of luck. There might be a way to configure it to do what you want, but in my experience trying to make Glue crawlers do things that aren't perfectly aligned with it is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case Glue crawlers also provide very little value. You probably have a better idea of what the schema should look than Glue will ever be able to figure out.
I suggest that you manually create the table, and write a one off script that lists all the partition locations on S3 that you want to include in the table and generate ALTER TABLE ADD PARTITION … SQL, or Glue API calls to add those partitions to the table.
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
One way to do what you want is to use just one of the tables created by the crawler as an example, and create a similar table manually (in AWS Glue->Tables->Add tables, or in Athena itself, with
CREATE EXTERNAL TABLE `tablename`(
`column1` string,
`column2` string, ...
using existing table as an example, you can see the query used to create that table in Athena when you go to Database -> select your data base from Glue Data Catalog, then click on 3 dots in front of the one "automatically created by crawler table" that you choose as an example, and click on "Generate Create table DDL" option. It will generate a big query for you, modify it as necessary (I believe you need to look at LOCATION and TBLPROPERTIES parts, mostly).
When you run this modified query in Athena, a new table will appear in Glue data catalog. But it will not have any information about your s3 files and partitions, and crawler most likely will not update metastore info for you. So you can in Athena run "MSCK REPAIR TABLE tablename;" query (it's not very efficient, but works for me), and it will add missing file information, in the Result tab you will see something like (in case you use partitions on s3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.