I'm getting a schema mismatch error when querying parquet data from Athena.
The error is:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields
In the AWS Glue crawler I tried enabling the "Update all new and existing partitions with metadata from the table" option, which I thought would resolve this issue; however, I'm still getting the above error.
I did this because of the similar question:
How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
The table schema for the totals column is:
struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>
The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".
You have parquet files with two different schemas, and the Athena table schema matches the newer one. You can do one of the following:
1) Create two different tables in Athena, one pointing to data up to 2017 and the other pointing to data after 2017.
2) If the older data is no longer valid for the current use case, simply archive that data and remove the 2017 and older partitions from your current table, as in the sketch below.
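For option 2, the metadata cleanup can be done from Athena itself. A minimal sketch, assuming the table is named googleanalytics and eventDate is a partition key (the question's S3 path suggests version may be a partition column too; if so, include it in the spec):

ALTER TABLE googleanalytics DROP IF EXISTS PARTITION (eventDate = '2017-06-11');

Repeat (or script) this for every partition at or before the cutoff, after archiving the underlying S3 objects; DROP PARTITION only removes the metadata, not the data in S3.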
Hello everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what happens when a column gets renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script that iterates through all my files on S3, renames the columns, and updates the table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this further!
I recommend reading more about handling schema updates with Athena here. Generally, Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, Parquet columns are read by name, but you can change that to reading by index instead. Each approach has its own advantages and disadvantages when dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only ever appended to the end.
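For reference, index-based access is switched on with a SerDe property on the table. A minimal sketch, using hypothetical table/column names and an S3 location (the property name itself is from the Athena documentation on schema updates):

CREATE EXTERNAL TABLE my_table (
  col_a string,
  col_b string,
  col_c string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access' = 'true')
STORED AS PARQUET
LOCATION 's3://my-bucket/my-prefix/';

With index access, renaming col_b to col_beta is invisible to Athena as long as the column keeps its position; the trade-off is that reordering or removing a column mid-history will silently misalign the data.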
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
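A minimal sketch of that approach, reusing the column names from the question (the table and view names are made up). With read-by-name Parquet, files written before the rename leave col_beta null and newer files leave col_b null, so a COALESCE merges them into one logical column:

CREATE OR REPLACE VIEW my_view AS
SELECT col_a,
       COALESCE(col_b, col_beta) AS col_beta,
       col_c
FROM my_table;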
You can set the AWS Glue crawler to run on demand or on a time-based schedule, so that every time your data on S3 is updated a new schema version is generated (you can also edit the data types of the attributes in the schema). This way your columns stay up to date and you can query the new field.
Note that AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in the same order. It does not use column names to map data to columns, which is why you can rename columns in CSV or TSV without breaking Athena queries.
I have a parquet file with the structure below
column_name_old
First
Second
I am crawling this file into a table in AWS Glue; however, in the table schema I want the structure below, without changing anything in the parquet files themselves
column_name_new
First
Second
I tried updating the table structure using boto3:

import boto3

glue = boto3.client('glue')
# 'my_db' and 'my_table' are placeholders for the real database and table names
js = glue.get_table(DatabaseName='my_db', Name='my_table')

col_list = js['Table']['StorageDescriptor']['Columns']
for x in col_list:
    if isinstance(x, dict):
        # rename the column in the fetched table definition,
        # then write it back with glue.update_table(...)
        x.update({'Name': x['Name'].replace('column_name_old', 'column_name_new')})
And it works, in that I can see the table structure updated in the Glue catalog, but when I query the table using the new column name I don't get any data; it seems the mapping between the table structure and the partition files is lost.
Is this approach even possible, or must I change the parquet files themselves? If it is possible, what am I doing wrong?
You can create a view that maps the old column name to the new one.
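A minimal sketch of such a view, reusing the column names from the question (the view and table names are made up):

CREATE OR REPLACE VIEW my_table_renamed AS
SELECT column_name_old AS column_name_new
FROM my_table;

Queries then select column_name_new from the view, while the underlying table keeps the name that matches the parquet files.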
I believe changing the column name directly will break the metadata catalogue.
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://exp-mahesh-sandbox/Demo/Year=2017/Month=1/Day=3/part-00015-d0e1263a-616e-435f-b4f4-9154afb3f07d.c000.snappy.parquet (offset=0, length=12795): Schema mismatch, metastore schema for row column statistical has 17 fields but parquet schema has 9 fields
I have used an AWS Glue crawler to get the schema of the Parquet files. Initially I had a few files in the partitions Day=1 and Day=2, ran the crawler, and was able to query them using Athena. After adding a few more files in the partition Day=3, where the file's schema for the "statistical" (type: struct) column has some missing fields, Athena throws the above-mentioned error.
Is there any way to solve this issue? I am expecting null values in the missing fields.
I have tried the "Update the table definition in the data catalog" option in the crawler, but it gives the same result.
[Screenshot: crawler settings]
You're getting that error because at least one of your Parquet files has a schema that differs either from the other files that make up the table or from the table's definition itself; in your case it appears to be the "Day=3" partition.
This is a limitation of Athena, which requires that all the files backing a table have the same schema, i.e. every file's columns need to match Athena's table definition, even struct members.
This error happens despite the Glue crawler running successfully; the table definition is indeed updated by the crawler, but when you execute a query that touches a file with a different schema (e.g. one missing a column), you get a HIVE_CANNOT_OPEN_SPLIT error.
I have a table defined in AWS Glue. I use AWS Kinesis streams to stream logs into S3 with this table definition, in parquet file format. It's partitioned by date.
One of the fields in the table, event_payload, is a struct with several fields, one of which is an array of structs. Recently I added a new field to the inner struct in the log data. I want to add it to the table definition so that it will be written to S3, and so that I can query it using AWS Athena.
I tried editing the table schema directly in the console. It does write the data to S3, but I get an exception in Athena when querying:
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'event_payload' in table 'c2s.logs' is declared as type 'struct<...>', but partition 'year=2019/month=201910/day=20191026/hour=2019102623' declared column 'event_payload' as type 'struct<...>'.
I tried deleting all the partitions and repairing the table, as specified here, but I got another error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://.../year=2019/month=201910/day=20191022/hour=2019102216/beaconFirehose-4-2019-10-22-16-34-21-71f183d2-207e-4ae9-98fe-07dda0bab70c.parquet (offset=0, length=801679): Schema mismatch, metastore schema for row column event_payload.markings.element has 8 fields but parquet schema has 7 fields
So the schema has a field which is not present in the data.
Is there a way to specify an optional field? If it's not present, just make it null.
As per the linked documentation, schema updates on nested structures are not supported in Athena. One way to make this work is to flatten the struct type with the help of the relationalize transform in Glue, for example:
// lHistory is the source DynamicFrame; relationalize flattens its nested fields
// into a set of frames rooted at "hist_root", using stagingPath for intermediate output
val frames: Seq[DynamicFrame] = lHistory.relationalize(rootTableName = "hist_root", stagingPath = redshiftTmpDir, JsonOptions.empty)
I have a few files in S3 and used the Glue data catalog to get the table definition. I have a field called log_time, and I manually set its data type to timestamp in the Glue catalog. Now when I query that table from Athena I can see the timestamp values correctly.
Now I go to Redshift Spectrum and create an external schema pointing to the schema created by the Glue data catalog. I can see the tables defined there, and when I check the data type of the column I see that it is defined as timestamp. However, when I run the same query I ran in Athena, the log_time field displays the date part correctly, but the time part is all 00:00:00 for all rows.
Any idea?
Date value in the file: 2018-12-16 00:47:20.28
When I manually change the field's data type to timestamp in the Glue data catalog and then query in Athena, I see the value: 2018-12-16 00:47:20.280
When I create a Redshift Spectrum schema pointing to the data catalog's schema and then query it, I see the value 2018-12-16 00:00:00