We have an ETL job that uses the code snippet below to update the catalog table:
sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc'], enableUpdateCatalog=True, updateBehavior='UPDATE_IN_DATABASE')
sink.setFormat('glueparquet')
sink.setCatalogInfo(catalogDatabase=config['glue_db'], catalogTableName=config['glue_table_bc'], catalogId=args['catalog_id'])
sink.writeFrame(dyF)
The table is non-partitioned and needs to be overwritten with new data daily. Since glueContext does not support overwrite, we use the purge_s3_path and purge_table methods to empty the S3 location in a step before the write above. We do the same thing for partitioned tables as well, and it has been working fine for us so far.
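For illustration, that purge step looks roughly like the following sketch (same config dict as above; retentionPeriod is set to 0 so no recent files are retained):

# Empty the catalog table's data and the S3 path before rewriting.
# retentionPeriod=0 deletes everything, not just files older than the default 168 hours.
glueContext.purge_table(
    database=config['glue_db'],
    table_name=config['glue_table_bc'],
    options={"retentionPeriod": 0},
)
glueContext.purge_s3_path(config['glue_s3_path_bc'], options={"retentionPeriod": 0})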
Recently, the schema of the data was updated (a few new columns were added). After the ETL job completed, it had successfully updated the partitioned table with the new schema, but the non-partitioned table's schema is still the same. We verified by inspecting the S3 files directly, and the new fields are present in the data files. Why is the schema not updated the way it is for the partitioned table? Is there a different method we can use?
Related
I am adding a new file in Parquet format, created by a Glue DataBrew job, to my S3 folder. The new file has the same schema as the previous file, but when I run the crawler a second time it neither updates the table nor creates a new one in the Data Catalog. However, when I crawl both files together, both of them are added.
The log file shows the following:
INFO : Created partitions with values [[New file name]] for table
BENCHMARK : Finished writing to Catalog
I have tried with and without "Create a single schema for each S3 path", but the crawler does not update the table with the new file. Soon I will be adding new files daily for my analysis. Any solution?
The best way to approach this, in my opinion, is to have AWS DataBrew output to the Data Catalog directly. The Data Catalog can be updated either by the crawler or by DataBrew directly, but the recommended practice is to use only one of those mechanisms, not both.
Can you try running the job with the Data Catalog as its output and letting DataBrew manage your catalog? It should update your catalog table with the right data/files.
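If it helps, here is a hedged boto3 sketch of a DataBrew recipe job that writes to the Data Catalog rather than to a plain S3 output; the job, dataset, recipe, role, database, table and bucket names below are all placeholders for your own setup:

import boto3

databrew = boto3.client('databrew')

databrew.create_recipe_job(
    Name='customer-clean-to-catalog',            # placeholder job name
    DatasetName='customer-raw',                  # placeholder DataBrew dataset
    RecipeReference={'Name': 'customer-clean-recipe', 'RecipeVersion': '1.0'},
    RoleArn='arn:aws:iam::123456789012:role/DataBrewServiceRole',
    # Write directly to the Glue Data Catalog so DataBrew keeps the table in sync,
    # instead of producing S3 files for a separate crawler to pick up.
    DataCatalogOutputs=[{
        'DatabaseName': 'my_databrew_db',
        'TableName': 'customer_clean',
        'S3Options': {'Location': {'Bucket': 'my-output-bucket', 'Key': 'customer_clean/'}},
        'Overwrite': True,
    }],
)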
I'm working with AWS Glue and many files on S3, with new files appended every day. I am trying to create and run a crawler to infer a schema for those CSV files. Instead of just one Data Catalog table with one schema, the crawler creates many tables (even with the "Create a single schema for each S3 path" option selected), which means that the crawler detects different schemas and can't combine them into one. But I need just one table in the Data Catalog for all those files!
So I created a separate Data Catalog table manually, and when I use this table with a Glue job, none of the S3 CSV files are processed. I guess that is because every time the crawler runs, it checks for new files and partitions (and in the good case of a single-schema table we can see those files and partitions by clicking the "View partitions" button in Tables).
So there is a way, described here, to update a manually created table with a crawler. I followed it, hoping that the crawler would not change the data types of the columns I selected, but would update the list of files and partitions for the Glue job to process later:
You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.
To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
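For example (not from the original post), a crawler that targets an existing catalog table instead of an S3 path can be defined roughly like this with boto3; the crawler name and role are placeholders:

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='customer-catalog-crawler',                        # placeholder name
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',  # placeholder role
    # Crawl the data store behind the existing table instead of creating new tables.
    Targets={'CatalogTargets': [{'DatabaseName': 'inner_customer', 'Tables': ['customer']}]},
    # Catalog-target crawlers require DeleteBehavior LOG; UpdateBehavior LOG also
    # keeps the crawler from rewriting the manually defined schema.
    SchemaChangePolicy={'UpdateBehavior': 'LOG', 'DeleteBehavior': 'LOG'},
)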
The behaviour described in the documentation doesn't happen for some reason; in the crawler log I see this:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket1/customer/dt=2020-02-26/delta_20200226_080101.csv
INFO : Multiple tables are found under location bucket1/customer/. Table customer is skipped.
But there is no "Exclude patterns" option to exclude that file when the crawler uses an existing Data Catalog table; the documentation says that in this case "The crawler then crawls the data stores specified by the catalog tables".
And the crawler doesn't add any partitions or files to my table.
Is there a way to update my manually created table with new files from S3?
Considering your crawler is detecting different schemas, it will continue to do so no matter which option you choose. You can get it to use the table definition from the table for all the partitions and then only log changes, to avoid updating the table schema. But if there really is a difference in schema between the files, I'm not sure your queries will work.
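A hedged sketch of that configuration via boto3 (the crawler name is a placeholder); it mirrors the console options for inheriting the table definition and only logging schema changes:

import boto3

glue = boto3.client('glue')

glue.update_crawler(
    Name='customer-crawler',  # placeholder crawler name
    # Make new/updated partitions inherit their schema from the table definition.
    Configuration='{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}',
    # Only log schema changes instead of updating the table in the Data Catalog.
    SchemaChangePolicy={'UpdateBehavior': 'LOG', 'DeleteBehavior': 'LOG'},
)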
Another option would be to add partitions using boto3 for your S3 path: you can get the table schema with the get_table call and then create a partition in Glue with that schema, roughly as sketched below.
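Something along these lines (a sketch; the partition value and S3 path are examples):

import boto3

glue = boto3.client('glue')

db, table = 'inner_customer', 'customer'

# Reuse the table's own storage descriptor so the partition inherits the manually defined schema.
sd = glue.get_table(DatabaseName=db, Name=table)['Table']['StorageDescriptor']

dt = '2020-03-31'  # example partition value
partition_sd = dict(sd, Location='s3://bucket1/customer/dt={}/'.format(dt))

glue.create_partition(
    DatabaseName=db,
    TableName=table,
    PartitionInput={'Values': [dt], 'StorageDescriptor': partition_sd},
)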
I don't know why, but the crawler I created can't update the list of files and partitions for the Glue job to process later; it skips my manually created Data Catalog table, as I can see in the CloudWatch log. To solve this problem, I added a repair table query to my Glue script, before the actual ETL process, so it does what the crawler was supposed to do (and I disabled the crawler itself, so it doesn't change my manually created table or create many tables for individual CSV files and partitions):
import boto3
...
# Athena query part
client = boto3.client('athena', region_name='us-east-2')

data_catalog_table = "customer"
db = "inner_customer"  # Glue Data Catalog db, not the Postgres DB

# this is supposed to update all partitions for data_catalog_table,
# so the Glue job can upload new file data into the DB
q = "MSCK REPAIR TABLE " + data_catalog_table

# output of the query goes to an S3 file as usual
output = "s3://bucket_to_store_query_results/results/"

response = client.start_query_execution(
    QueryString=q,
    QueryExecutionContext={
        'Database': db
    },
    ResultConfiguration={
        'OutputLocation': output,
    }
)
After the "MSCK REPAIR TABLE customer" query executes, it writes an xxx-xxx-xxx.txt file to s3://bucket_to_store_query_results/results/ with content like this:
Partitions not in metastore: customer:dt=2020-03-28 customer:dt=2020-03-29 customer:dt=2020-03-30
Repair: Added partition to metastore customer:dt=2020-03-28
Repair: Added partition to metastore customer:dt=2020-03-29
Repair: Added partition to metastore customer:dt=2020-03-30
And if I open Glue -> Tables -> select the customer table, then click the "View partitions" button at the top right of the page, I see all my partitions from the S3 bucket. After that, the Glue job continues as before. I understand that the "repair table" query hack is not really optimal, and I may change it to something more sophisticated, like what is described here.
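One small addition (a sketch, reusing the client and response variables from the snippet above): start_query_execution only kicks the query off, so the job should wait for MSCK REPAIR TABLE to finish before the ETL part touches the table.

import time

query_id = response['QueryExecutionId']
while True:
    state = client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(5)  # poll until Athena reports a terminal state
if state != 'SUCCEEDED':
    raise RuntimeError('MSCK REPAIR TABLE did not succeed: ' + state)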
What I understand from the AWS Glue docs is that a crawler helps crawl and discover new data. However, I noticed that once I have crawled once, if new data lands in S3, that data is already discovered when I query the Data Catalog, from Athena for example. So can I say that I do not need the crawler to run every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog to be able to query them via Athena.
However, if the data is not partitioned, or arrives in already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually.
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
If you have the schema then you don't need to use the crawler, and you might get better results (the crawler assumes partition columns are strings, for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.
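For reference, a hedged sketch of registering a single partition manually through Athena instead of running MSCK (the table, database, bucket and partition values are placeholders):

import boto3

athena = boto3.client('athena', region_name='us-east-2')

# ADD IF NOT EXISTS makes the statement safe to re-run for already registered partitions.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE customer ADD IF NOT EXISTS "
        "PARTITION (dt='2020-03-31') "
        "LOCATION 's3://bucket1/customer/dt=2020-03-31/'"
    ),
    QueryExecutionContext={'Database': 'inner_customer'},
    ResultConfiguration={'OutputLocation': 's3://bucket_to_store_query_results/results/'},
)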
I am trying to leverage Athena to run SQL on data that is pre-ETL'd by a third-party vendor and pushed to an internal S3 bucket.
CSV files are pushed to the bucket daily by the ETL vendor. Each file includes yesterday's data in addition to data going back to 2016 (i.e. new data arrives daily but historical data can also change).
I have an AWS Glue Crawler set up to monitor the specific S3 folder where the CSV files are uploaded.
Because each file contains updated historical data, I am hoping to figure out a way to make the crawler overwrite the existing table based on the latest file uploaded instead of appending. Is this possible?
Thanks very much in advance!
It is not possible the way you are asking. The Crawler does not alter data.
The crawler only populates the AWS Glue Data Catalog with table definitions.
Please see here for details: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If you want to clean the data using Athena/Glue before using it, you need to follow these steps:
1. Map the data using a crawler into a temporary Athena database/table.
2. Profile your data using Athena SQL, QuickSight, etc. to get an idea of what you need to alter.
3. Use a Glue job to:
   - transform/clean/rename/dedupe the data using PySpark or Scala
   - export the data to a new S3 location (.csv/.parquet etc.), potentially partitioned
4. Run one more crawler to map the cleaned data from the new S3 location into an Athena database.
The dedupe you are asking about happens in step 3; a rough sketch follows.
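A minimal sketch of step 3, assuming hypothetical database, table, column and bucket names:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the temporary table that the first crawler mapped (names are placeholders).
df = glueContext.create_dynamic_frame.from_catalog(
    database='temp_db', table_name='vendor_raw').toDF()

cleaned = (df
           .dropDuplicates(['record_id'])               # dedupe on an assumed business key
           .withColumnRenamed('DateTime', 'event_ts'))  # example rename

# Export cleaned data to a new S3 location as Parquet for the second crawler to map.
cleaned.write.mode('overwrite').parquet('s3://my-clean-bucket/vendor_clean/')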
I have a DynamoDB table that has been deprecated, and I need to merge it into another table. The schemas of the two tables are slightly different, so I need to do some minor work on each item before I can put it into the surviving table.
Now, I know I could always create a Lambda that writes batches of these records into a Kinesis stream watched by another Lambda that puts the records into the surviving table, but this seems kludgy to me. Data Pipeline seems like a better solution, but I'm not sure whether I can alter items before they are moved to the new table. Same with EMR.
Any suggestions would be appreciated.
Data Pipeline, in its import/export template, uses the DynamoDB connector's import/export tool to copy contents from source to destination. See https://github.com/awslabs/emr-dynamodb-connector
The tool simply launches a Hadoop job that runs mappers/reducers to do the copy. However, it doesn't give you enough control to alter items, nor does it offer a DynamoDB -> DynamoDB ETL.
However, since all EMR clusters come with the emr-dynamodb-connector libraries, you can use either Hive or Spark to write your own DDL and DML to copy DynamoDB -> DynamoDB (within the same AWS region). If you write the DML correctly, you can even copy data between two DynamoDB tables with different schemas using your own clauses. You can later automate those scripts to run on an EMR resource created by Data Pipeline on a schedule.
The pseudo-code in HQL to copy from one table to another can be as simple as:
CREATE EXTERNAL TABLE dynamodb_table (`ID` STRING,`DateTime` STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "ddb-table-1", "dynamodb.column.mapping" = "ID:ID,DateTime:DateTime");
CREATE EXTERNAL TABLE dynamodb_table2 (`ID` STRING,`DateTime` STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "ddb-table-2", "dynamodb.column.mapping" = "ID:ID,DateTime:DateTime");
INSERT OVERWRITE TABLE dynamodb_table SELECT * FROM dynamodb_table2;
Hive DDL and DML syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Some examples of using the DynamoDB storage handler on EMR:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html