AWS Glue - Adding field to a struct field - amazon-web-services

I have a table defined in AWS Glue. I use AWS Kinesis streams to stream logs into S3 with this table definition, in parquet file format. The table is partitioned by date.
One of the fields in the table, event_payload, is a struct with several fields, one of them an array of structs. Recently I added a new field to the inner struct in the log data. I want to add it to the table definition so that it will be written to S3, and so that I can query it using AWS Athena.
I tried editing the table schema directly in the console. It does write the data to S3, but I get an exception in Athena when querying:
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'event_payload' in table 'c2s.logs' is declared as type 'struct<...>', but partition 'year=2019/month=201910/day=20191026/hour=2019102623' declared column 'event_payload' as type 'struct<...>'.
I tried deleting all the partitions and repairing the table, as specified here, but I got another error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://.../year=2019/month=201910/day=20191022/hour=2019102216/beaconFirehose-4-2019-10-22-16-34-21-71f183d2-207e-4ae9-98fe-07dda0bab70c.parquet (offset=0, length=801679): Schema mismatch, metastore schema for row column event_payload.markings.element has 8 fields but parquet schema has 7 fields
So the schema has a field which is not present in the data.
Is there a way to specify an optional field? If it's not present, just make it null.

As per the linked documentation, schema updates on nested structures are not supported in Athena. One way to make this work is to flatten the struct type with the help of the relationalize operator in Glue. For example:
val frames: Seq[DynamicFrame] = lHistory.relationalize(rootTableName = "hist_root", stagingPath = redshiftTmpDir, options = JsonOptions.empty)
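Conceptually, relationalize flattens nested structs into top-level columns (and spills arrays into separate tables). A minimal pure-Python illustration of the struct-flattening idea, independent of Glue (the record shape below is a made-up example):

```python
def flatten(record, prefix=""):
    """Flatten a nested dict into dotted column names, similar in
    spirit to what Glue's relationalize does for struct fields."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

row = {"event_payload": {"user": {"id": 7}, "markings": [1, 2]}}
print(flatten(row))
# {'event_payload.user.id': 7, 'event_payload.markings': [1, 2]}
```

Once flattened, a newly added inner field simply becomes a new top-level column, which Athena can handle with ordinary schema evolution.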

Related

AWS Athena & AWS Glue column issues

I am having an issue with Athena having used AWS Glue to crawl an S3 bucket and create a table. The table created had a column type which was defined as a struct. I manually changed the type to String in the definition as I wish to use the contents by casting it to JSON.
However, when I attempt to query the data in Athena I receive
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced.
This is because the partitions created automatically by the crawler still refer to the column's old type (not the updated string type). I can't see how to edit the partitions, and there are thousands of them (one is generated per folder), so doing so manually would be difficult.
Does anyone have a way of editing a table schema that also updates the partitions generated for it?
Thanks.
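Editing thousands of partitions by hand isn't practical, but they can in principle be rewritten programmatically: list them with the Glue API (glue.get_partitions), rewrite each partition's column type, and send them back (e.g. with glue.batch_update_partition). A sketch of the per-partition rewriting step, assuming the dict shape that boto3's Glue client returns; the helper itself is pure Python:

```python
def retype_column(partition, column_name, new_type):
    """Rewrite one column's type in a Glue partition's StorageDescriptor.

    `partition` is the dict shape returned by glue.get_partitions();
    the modified dict can be sent back via glue.batch_update_partition().
    """
    for col in partition["StorageDescriptor"]["Columns"]:
        if col["Name"] == column_name:
            col["Type"] = new_type
    return partition

# Example against a fake partition dict:
part = {"Values": ["2019-10-26"],
        "StorageDescriptor": {"Columns": [
            {"Name": "event_payload", "Type": "struct<a:int>"},
            {"Name": "other", "Type": "bigint"}]}}
retype_column(part, "event_payload", "string")
print(part["StorageDescriptor"]["Columns"][0]["Type"])  # string
```

The table and column names above are placeholders from the question; only the listed column's type changes, the rest of the StorageDescriptor is left intact so the partition's data location still resolves.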

Renaming AWS glue table column name without changing underlying parquet files

I have a parquet file with the below structure
column_name_old
First
Second
I am crawling this file into a table in AWS Glue; however, in the table schema I want the structure below, without changing anything in the parquet files
column_name_new
First
Second
I tried updating table structure using boto3
col_list = js['Table']['StorageDescriptor']['Columns']
for x in col_list:
    if isinstance(x, dict):
        x.update({'Name': x['Name'].replace('column_name_old', 'column_name_new')})
And it works, as I can see the table structure updated in the Glue catalog, but when I query the table using the new column name I don't get any data; it seems the mapping between the table structure and the partition files is lost.
Is this approach even possible, or must I change the parquet files themselves? If it's possible, what am I doing wrong?
You can create a view that maps the column name to another value.
I believe changing the column name directly will break the meta catalogue.
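A sketch of what that view could look like, built as an Athena DDL string from Python (the view name is a placeholder; the column names come from the question):

```python
def rename_view_sql(view, table, old_col, new_col, other_cols=()):
    """Build Athena DDL exposing old_col under a new name via a view,
    so the underlying parquet files and table schema stay untouched."""
    cols = ", ".join([f"{old_col} AS {new_col}", *other_cols])
    return f"CREATE OR REPLACE VIEW {view} AS SELECT {cols} FROM {table}"

sql = rename_view_sql("my_view", "my_table", "column_name_old", "column_name_new")
print(sql)
# CREATE OR REPLACE VIEW my_view AS SELECT column_name_old AS column_name_new FROM my_table
```

Queries then read from the view with the new name, while the catalog's column name (and its mapping to the partition files) is left alone.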

AWS ATHENA : HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split, Schema mismatch when querying Parquet files

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://exp-mahesh-sandbox/Demo/Year=2017/Month=1/Day=3/part-00015-d0e1263a-616e-435f-b4f4-9154afb3f07d.c000.snappy.parquet (offset=0, length=12795): Schema mismatch, metastore schema for row column statistical has 17 fields but parquet schema has 9 fields
I have used the AWS Glue crawler to get the schema of the parquet files. Initially I had a few files in the partitions Day=1 and Day=2; I ran the crawler and was able to query the table using Athena. After adding a few more files in the partition Day=3, where the file's "statistical" (type: struct) column has some missing fields, Athena throws the above-mentioned error.
Is there any way to solve this issue? I am expecting a null value in the missing fields.
I have tried the UPDATE THE TABLE DEFINITION IN THE DATA CATALOG option in the crawler, but it gives the same result.
Crawler Settings
You're getting that error because at least one of your parquet files has a schema that differs either from the other files that compose the table or from the table's definition itself; it appears to be your Day=3 partition.
This is a limitation of Athena, which requires that the files backing a table all have the same schema, i.e. every file's columns need to match Athena's table definition, even struct members.
This error happens despite the Glue crawler running successfully; the table definition is indeed updated by the crawler, but when you execute a query that touches a file with a different schema (e.g. one missing a column) you get a HIVE_CANNOT_OPEN_SPLIT error.
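To pinpoint exactly which struct members a given file is missing, it can help to diff the metastore's field list against the file's (the file schema could be read with a tool like parquet-tools or pyarrow; the field names below are hypothetical). A minimal comparison helper:

```python
def missing_fields(table_fields, file_fields):
    """Return struct members declared in the table schema but absent
    from a file, preserving the table's declaration order."""
    present = set(file_fields)
    return [f for f in table_fields if f not in present]

table = ["min", "max", "mean", "stddev", "count"]  # 5 fields declared in the table
in_file = ["min", "max", "count"]                  # only 3 present in this parquet file
print(missing_fields(table, in_file))  # ['mean', 'stddev']
```

Knowing which fields are missing makes it easier to decide whether to rewrite the offending files with the full schema or split them into their own table.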

HIVE_BAD_DATA: Wrong type using parquets on AWS Athena

I've created a Glue crawler to read files from S3 and create a table for each S3 path. The table health_users was created with the wrong type for a specific column: the column two_factor_auth_enabled was created as int instead of string.
Manually, I went to the Glue catalog and updated the schema of the table health_users.
After that, I tried to run the query again on Athena and it still throws the same error:
Your query has the following error(s):
HIVE_BAD_DATA: Field two_factor_auth_enabled's type BOOLEAN in parquet is incompatible with type int defined in table schema
This query ran against the "test_parquets" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c3a86b98-70a2-4c70-97d8-8bc377c455b8.
I've checked the table structure on Athena and the column two_factor_auth_enabled is a string (the attached file shows the table definition):
What's wrong with my solution? How can I fix this error?

HIVE_CANNOT_OPEN_SPLIT: Schema mismatch when querying parquet files from Athena

I'm getting a schema mismatch error when querying parquet data from Athena.
The error is:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields
In the AWS Glue crawler I tried enabling "Update all new and existing partitions with metadata from the table", which I thought would resolve this issue; however, I'm still getting the above error.
I did this because of the similar question:
How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
The table schema for the totals column is:
struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>
The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".
You have parquet files with two different schemas, and the Athena table schema matches the newer one. You can do one of the following:
1) Create two different tables in Athena, one pointing to data up to 2017 and the other pointing to data after 2017.
2) In case the older data is no longer valid for the current use case, you can simply archive that data and remove the 2017-and-older partitions from your current table.