Google Cloud DataStream failing with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE - google-cloud-platform

I am getting error when I was trying to partition the destination table in BigQuery while working with DataStream.
step by step to reproduce this:
start DataStream from CloudSQL(MYSQL) to BigQuery
once the Stream Completed all tables in BigQuery, pause the job
Partition one of the table
Resume the job
Getting error log as below
====================================================
Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported..
{
insertId: "65ad79ec-0000-24c7-a66e-14223bbf970a#a1"
jsonPayload: {
context: "CDC"
event_code: "UNSUPPORTED_EVENTS_DISCARDED"
message: "Discarded 97 unsupported events for BigQuery destination:
833537404433.Test_Membership_1.internal_Membership, with reason code:
BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to
BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
read_method: ""
}
logName: "projects/gcp-everwash-wh-dw/logs/datastream.googleapis.com%2Fstream_activity"
receiveTimestamp: "2022-11-22T22:08:38.620495835Z"
resource: {2}
severity: "WARNING"
timestamp: "2022-11-22T22:08:37.726075Z"
}
What you expected to happen: ?
I am expecting to create Partition for the certain tables that are getting inserted in BigQuery via DataStream.

Partitioning to the existing BigQuery table is not supported.You have to add partitioning to a net-new table. You can create a newly partitioned table from the result of a query as mentioned in this document, however this approach won't work for existing Datastream sourced tables since there wouldn't be a _CHANGE_SEQUENCE_NUMBER field which is required to correctly apply UPSERT operations in the correct order. So the only option would be to pre-create the table with partitioning/clustering/primary keys before starting the Datastream stream like the below DDL SQL sample query.
CREATE TABLE `project.dataset.new_table`
(
`Primary_key_field` INTEGER PRIMARY KEY NOT ENFORCED,
`time_field` TIMESTAMP,
`field1` STRING,
#Just an example above. Add needed fields within the base table...
)
PARTITION BY
DATE(time_field)
CLUSTER BY
Primary_key_field #This must be an exact match of the specified primary key fields
OPTIONS(max_staleness = INTERVAL 15 MINUTE) #or whatever the desired max_staleness value is
For more information, you can check this issue tracker.

Related

Accessing datacatalog table in Glue properly

I created a table in Athena without a crawler from S3 source. It is showing up in my datacatalog. However, when I try to access it through a python job in Glue ETL, it shows that it has no column or any data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am trying to access the dynamic frame following the glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
database="datacatalog_database",
table_name="table_name",
transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The above logs output: Count: 0 & Schema: StructType([], {}), where the Athena table shows I have around ~800,000 rows.
Sidenotes:
The ETL job concerned has AWSGlueServiceRole attached.
I tried Glue Visual Editor as well, it showed the datacatalog database/table concerned but sadly, same error.
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders you need to add a flag adding additional_options = {"recurse": True} to your from_catalog(). This will help to recursively read records from s3 files.

AWS Data Pipeline Dynamo to Redshift

I have an issue:
I need to migrate data from DynamoDB to Redshift. The problem is that I receive such exception:
ERROR: Unsupported Data Type: Current Version only supports Strings and Numbers Detail: ----------------------------------------------- error: Unsupported Data Type: Current Version only supports Strings and Numbers code: 9005 context: Table Name = user_session query: 446027 location: copy_dynamodb_scanner.cpp:199 process: query0_124_446027 [pid=25424] -----------------------------------------------
In my Dynamo item I have boolean field. How can I modify field from Boolean to INT(for example)?
I tried to use as a VARCHAR(5), but didn't help(so it one ticket in Github without response)
Will be appreciate for any suggestions.
As a solution, I migrated data from DynamoDB to S3 first and then to Redshift.
I used Exports to S3 build in feature in DynamoDB. It saves all data as *.json files into S3 realy fast(but not sorted).
After that I used ETL script, using Glue Job and custom script with pyspark to process and save data into Redshift.
Also can be done with Glue crawler to define schema, but still need to validate its result, as sometimes it was not correct.
Using crawlers to parse DynamoDB directly is overkill of your tables if you are not using ONDEMAND read/write. So the better way is to do that with data from S3.

Athena - reserved words and table that cannot be queried

I'm putting JSON data files into S3, and use AWS-Glue to build the table definition. I have about 120 fields per each json "row". One of the fields is called "timestamp" in lower case. I have 1000s of large files, and would hate to change them all.
Here (https://docs.aws.amazon.com/athena/latest/ug/reserved-words.html), I see TIMESTAMP in DDL is a reserved word. Does that mean I won't be able to read those JSON file from Athena.
I'm getting this error, which lead me to the above being a potential reason.
I clicked the 3 dots to the right of the tablename, and clicked "Preview Table", which built and ran this select statement:
SELECT * FROM "relatixcurrdayjson"."table_currday" limit 10;
That lead to an error which seems wrong or misleading:
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c448f0ea-5086-4436-9107-2b60dab0f04f.
If I click the option that says "Generate Create Table DDL", it builds this line to execute:
SHOW CREATE TABLE table_currday;
and results in this error:
Your query has the following error(s):
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 6ac5d90f-8d52-4e3e-8f16-cd42e1edcfa3.
This is the AWS Glue Log:
UPDATE #1:
I used Athena a couple of weeks ago with CSV and it worked great.
This time I'm using JSON.
I created a new folder with one file containing the following, ran the Glue Crawler:
[
{"firstName": "Neal",
"lastName": "Walters",
"city": "Irving",
"state", "TX"
}
{"firstName": "Fred",
"lastName": "Flintstone",
"city": "Bedrock",
"state", "TX"
}
{"firstName": "Barney",
"lastName": "Rubble",
"city": "Stillwater",
"state", "OK"
}
]
and this SQL gives the same error as above:
SELECT * FROM "relatixcurrdayjson"."tbeasyeasytest" limit 10;
It's very easy to get Glue crawlers to create tables that don't work in Athena, which is surprising given that it's the primary goal it was designed for.
If the JSON you posted is exactly what you ran your crawler against the problem is that Athena does not support multi-line JSON documents. Your files must have exactly one JSON document per line. See Dealing with multi-line JSON? (And, bonus points, CRLF), Multi-line JSON file querying in hive, and Create Table in Athena From Nested JSON

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'
My use case is:
Partitions represent days
Files represent events
Each event is a json blob in a single s3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
This inconsistency causes the error in Athena I think
If I were to manually write a schema I could do this fine as there would just be one table schema, and keys which are missing in the JSON file would be treated as Nulls.
Thanks in advance!
I had the same issue, solved it by configuring crawler to update table metadata for preexisting partitions:
It also fixed my issue!
If somebody need to provision This Configuration Crawler with Terraform so here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
database_name = "my_glue_database"
name = "my_crawler"
role = "my_iam_role.arn"
configuration = <<EOF
{
"Version": 1.0,
"CrawlerOutput": {
"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
}
}
EOF
s3_target {
path = "s3://mybucket"
}
}
This helped me. Posting the image for others in case the link is lost
Despite selecting Update all new and existing partitions with metadata from the table. in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically jsonPath wasn't inherited from the table's properties in my case).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, "to drop the partition that is causing the error and recreate it" helped
After dropping the problematic partitions, glue crawler re-created them correctly on the following run

What's wrong with this HIVE script to export from DynamoDB to S3 in an AWS Data Pipeline?

Is there a problem with the HIVE script below or is this another issue, possibly related to the version of HIVE installed by AWS Data Pipeline?
The first part of my AWS Data Pipeline must export large tables from DynamoDB to S3 to later process using EMR. The DynamoDB table that I'm using for testing is only a few rows long, so I know that the data is formatted correctly.
The script associated with the AWS Data Pipeline "Export DynamoDB to S3" building block works correctly for tables that contain only primitive_types but don't export array_types. (reference - http://archive.cloudera.com/cdh/3/hive/language_manual/data-manipulation-statements.html)
I pulled out all Data Pipeline-specific stuff and am now trying to get the following minimal example based on the DynamoDB docs to work - (reference - http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html)
-- Drop table
DROP table dynamodb_table;
--http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
CREATE EXTERNAL TABLE dynamodb_table (song string, artist string, id string, genres array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "song:song,artist:artist,id:id,genres:genres");
INSERT OVERWRITE DIRECTORY 's3://umami-dev/output/colmap/' SELECT *
FROM dynamodb_table;
Here is the stack-trace / EMR errors that I'm see when running the above script -
Diagnostic Messages for this Task:
java.io.IOException: IO error in map input file hdfs://172.31.40.150:9000/mnt/hive_0110/warehouse/dynamodb_table
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:244)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:218)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:238)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.scan(AbstractDynamoDBRecordReader.java:176)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.fetchItems(HiveDynamoDBRecordReader.java:87)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:44)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:25)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Command exiting with ret '255'
I tried a few things already to debug, but none of them have been successful - creating an external table w/formatting, using a few different JSON SerDes. I'm not sure what to try next.
Many thanks.
I answered my own question by creating an EMR cluster and using Hue to quickly run HIVE queries in the Amazon environment.
The solution was to change the format of Items in the DynamoDB - what was originally a List of Strings is now a StringSet. Then my Hive tables could successfully operate on the array.
Logically speaking, I may lose the order of the Strings because I assume that a List is ordered but a Set is not. This doesn't matter in my specific problem.
Here's the relevant chunk of the final functioning Hive script -
-- depends on genres2 to be a StringSet (or not exist)
CREATE EXTERNAL TABLE sauce (id string, artist string, song string, genres2 array<string>)
STORED BY "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "id:id,artist:artist,song:song,genres2:genres2");
-- s3 location for export to
CREATE EXTERNAL TABLE pasta (id int, artist string, song string, genres array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
LOCATION "s3n://umami-dev/tmp2";
-- do the export
INSERT OVERWRITE TABLE pasta
SELECT * FROM sauce;