I'm putting JSON data files into S3 and using AWS Glue to build the table definition. I have about 120 fields per JSON "row". One of the fields is called "timestamp", in lower case. I have thousands of large files and would hate to change them all.
Here (https://docs.aws.amazon.com/athena/latest/ug/reserved-words.html) I see that TIMESTAMP is a reserved word in DDL. Does that mean I won't be able to read those JSON files from Athena?
I'm getting this error, which led me to the above as a potential reason.
I clicked the 3 dots to the right of the table name and clicked "Preview Table", which built and ran this SELECT statement:
SELECT * FROM "relatixcurrdayjson"."table_currday" limit 10;
That led to an error that seems wrong or misleading:
Your query has the following error(s):
SYNTAX_ERROR: line 1:8: SELECT * not allowed in queries without FROM clause
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c448f0ea-5086-4436-9107-2b60dab0f04f.
If I click the option that says "Generate Create Table DDL", it builds this line to execute:
SHOW CREATE TABLE table_currday;
and results in this error:
Your query has the following error(s):
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException
This query ran against the "relatixcurrdayjson" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 6ac5d90f-8d52-4e3e-8f16-cd42e1edcfa3.
This is the AWS Glue Log:
UPDATE #1:
I used Athena a couple of weeks ago with CSV and it worked great.
This time I'm using JSON.
I created a new folder with one file containing the following and ran the Glue crawler:
[
{"firstName": "Neal",
"lastName": "Walters",
"city": "Irving",
"state", "TX"
}
{"firstName": "Fred",
"lastName": "Flintstone",
"city": "Bedrock",
"state", "TX"
}
{"firstName": "Barney",
"lastName": "Rubble",
"city": "Stillwater",
"state", "OK"
}
]
and this SQL gives the same error as above:
SELECT * FROM "relatixcurrdayjson"."tbeasyeasytest" limit 10;
It's very easy to get Glue crawlers to create tables that don't work in Athena, which is surprising given that working with Athena is the primary use case they were designed for.
If the JSON you posted is exactly what you ran your crawler against, the problem is that Athena does not support multi-line JSON documents; your files must have exactly one JSON document per line. See "Dealing with multi-line JSON? (And, bonus points, CRLF)", "Multi-line JSON file querying in hive", and "Create Table in Athena From Nested JSON".
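If rewriting the files turns out to be acceptable, a small script can flatten a JSON array into the one-document-per-line format Athena expects. A minimal sketch, assuming each input file holds a single top-level JSON array (the file names are placeholders):

import json

# Placeholder file names; the input is assumed to be one top-level JSON array.
with open("easyeasytest.json") as src:
    records = json.load(src)

with open("easyeasytest.ndjson", "w") as dst:
    for record in records:
        # Write one JSON document per line, with no pretty-printing.
        dst.write(json.dumps(record) + "\n")

Pointing the crawler at the rewritten files then gives Athena one JSON document per line to work with.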
Related
I am getting an error when trying to partition the destination table in BigQuery while working with Datastream.
Steps to reproduce:
Start a Datastream stream from Cloud SQL (MySQL) to BigQuery.
Once the stream has completed all tables in BigQuery, pause the job.
Partition one of the tables.
Resume the job.
The error log below appears:
====================================================
Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported..
{
  insertId: "65ad79ec-0000-24c7-a66e-14223bbf970a#a1"
  jsonPayload: {
    context: "CDC"
    event_code: "UNSUPPORTED_EVENTS_DISCARDED"
    message: "Discarded 97 unsupported events for BigQuery destination: 833537404433.Test_Membership_1.internal_Membership, with reason code: BIGQUERY_UNSUPPORTED_PRIMARY_KEY_CHANGE, details: Failed to write to BigQuery due to an unsupported primary key change: adding primary keys to existing tables is not supported.."
    read_method: ""
  }
  logName: "projects/gcp-everwash-wh-dw/logs/datastream.googleapis.com%2Fstream_activity"
  receiveTimestamp: "2022-11-22T22:08:38.620495835Z"
  resource: {2}
  severity: "WARNING"
  timestamp: "2022-11-22T22:08:37.726075Z"
}
What did you expect to happen?
I expect to be able to partition certain tables that are being populated in BigQuery via Datastream.
Partitioning an existing BigQuery table is not supported. You have to add partitioning to a net-new table. You can create a newly partitioned table from the result of a query as mentioned in this document; however, this approach won't work for existing Datastream-sourced tables, since there would be no _CHANGE_SEQUENCE_NUMBER field, which is required to apply UPSERT operations in the correct order. So the only option is to pre-create the table with partitioning/clustering/primary keys before starting the Datastream stream, as in the DDL sample query below.
CREATE TABLE `project.dataset.new_table`
(
  `Primary_key_field` INTEGER PRIMARY KEY NOT ENFORCED,
  `time_field` TIMESTAMP,
  `field1` STRING
  # Just an example above. Add the needed fields within the base table...
)
PARTITION BY
  DATE(time_field)
CLUSTER BY
  Primary_key_field  # This must be an exact match of the specified primary key fields
OPTIONS(max_staleness = INTERVAL 15 MINUTE)  # or whatever the desired max_staleness value is
For more information, you can check this issue tracker.
I am trying to create a GCP Pub/Sub BigQuery subscription using the console: https://cloud.google.com/pubsub/docs/bigquery
However, I get the following error message:
API returned error: ‘Request contains an invalid argument.’
Any help would be appreciated.
NOTE:
1. When the BigQuery table does not exist, I get the following error:
2. The Pub/Sub schema was not deleted.
Update:
Actually, despite the generic error message, it turns out to be a straightforward issue.
To use the "write metadata" option, the BigQuery table must already have the following field structure:
subscription_name (string)
message_id (string)
publish_time (timestamp)
data (bytes, string or json)
attributes (string or json)
They are described in more detail in the docs here.
Once they are created, this option works fine.
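For anyone pre-creating that table programmatically, here is a minimal sketch using the google-cloud-bigquery Python client, assuming string columns for data and attributes; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Field structure required when "write metadata" is enabled on the subscription.
schema = [
    bigquery.SchemaField("subscription_name", "STRING"),
    bigquery.SchemaField("message_id", "STRING"),
    bigquery.SchemaField("publish_time", "TIMESTAMP"),
    bigquery.SchemaField("data", "STRING"),        # could also be BYTES or JSON
    bigquery.SchemaField("attributes", "STRING"),  # could also be JSON
]

# Placeholder identifiers; replace with your own project, dataset, and table.
table = bigquery.Table("my-project.my_dataset.pubsub_messages", schema=schema)
client.create_table(table)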
I believe the problem with the "Use topic schema" option is also related to the schema used in the topic, since the table must already have the same structure (but you will need to check that against your configuration). If your topic follows an Avro schema, this might help: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
---------- previous answer
Not a definitive answer, but I am having the same problem and figured out that it is somewhat related to the following options:
Use topic schema
Write metadata
Unchecking them makes it work.
The same thing happens when using Terraform to set up the infrastructure.
I am still investigating whether it is a bug or perhaps an error in my schema definition, but maybe it can be a starting point for you as well.
A 404 error would indicate that either the BigQuery table does not exist or the Pub/Sub schema associated with the topic was deleted. For the former, ensure that the project, dataset, and table names all match the names of the existing table to which you want to write data. For the latter, you could look at the topic details page and make sure that the schema name is not _deleted-schema_.
I created a table in Athena from an S3 source, without a crawler. It shows up in my Data Catalog. However, when I try to access it through a Python job in Glue ETL, it appears to have no columns or data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am accessing the dynamic frame the Glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The above logs output Count: 0 and Schema: StructType([], {}), whereas the Athena table shows around 800,000 rows.
Sidenotes:
The ETL job concerned has AWSGlueServiceRole attached.
I tried the Glue visual editor as well; it showed the Data Catalog database/table concerned but, sadly, produced the same error.
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders you need to pass additional_options = {"recurse": True} to your from_catalog() call. This makes Glue recursively read records from the S3 files.
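For reference, a minimal sketch of the call from the question with that flag added (the database and table names are the placeholders used above):

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource",
    additional_options={"recurse": True},  # read nested S3 folders recursively
)
print(f"Count: {datasource.count()}")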
I am new to QuickSight and was just test-driving it (in the QuickSight web console; I'm not using the command line anywhere in this) with some data (which I can't share, as it is confidential business info). I have a strange issue. When I create a dataset by uploading the file, which is only 50 MB, it works fine: I can see a preview of the table and I am able to proceed to the visualization. But when I upload the same file to S3, make a manifest, and submit it using the 'Use S3' option in the create dataset window, I get the INCORRECT_FIELD_COUNT error.
Here's the manifest file:
{
  "fileLocations": [
    {
      "URIs": [
        "s3://testbucket/analytics/mydata.csv"
      ]
    },
    {
      "URIPrefixes": [
        "s3://testbucket/analytics/"
      ]
    }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}
I know the data is not fully structured, with some rows missing a few columns, but how is it possible that QuickSight automatically infers and puts NULLs into the shorter rows when the file is uploaded from my local machine, but not when it comes from S3 with the manifest? Are there some settings I'm missing?
I'm getting the same thing - looks like this is fairly new code. It'd be useful to know what the expected field count is, especially as it doesn't say if it's too few or too many (both are wrong). One of those technologies that looks promising, but I'd say there's a little maturing required.
As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'
My use case is:
Partitions represent days
Files represent events
Each event is a json blob in a single s3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
I think this inconsistency is what causes the error in Athena.
If I were to manually write a schema I could do this fine, as there would just be one table schema, and keys missing from a JSON file would be treated as nulls.
Thanks in advance!
I had the same issue and solved it by configuring the crawler to update table metadata for pre-existing partitions:
It also fixed my issue!
If somebody needs to provision this crawler configuration with Terraform, here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
database_name = "my_glue_database"
name = "my_crawler"
role = "my_iam_role.arn"
configuration = <<EOF
{
"Version": 1.0,
"CrawlerOutput": {
"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
}
}
EOF
s3_target {
path = "s3://mybucket"
}
}
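If you are not using Terraform, the same configuration can also be applied with boto3. A rough sketch, assuming a crawler named my_crawler already exists (the name is a placeholder):

import json
import boto3

glue = boto3.client("glue")

# Same crawler configuration JSON as in the Terraform example above.
crawler_config = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
}

# "my_crawler" is a placeholder; use your crawler's name.
glue.update_crawler(Name="my_crawler", Configuration=json.dumps(crawler_config))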
This helped me. Posting the image for others in case the link is lost
Despite selecting "Update all new and existing partitions with metadata from the table" in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically, jsonPath wasn't inherited from the table's properties in my case).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, dropping the partition that causes the error and recreating it helped.
After dropping the problematic partitions, the Glue crawler re-created them correctly on the following run.
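If you need to do the drop-and-recreate step programmatically, here is a minimal sketch using boto3 and the Athena API; the database, table, partition key, and output location are hypothetical placeholders:

import boto3

athena = boto3.client("athena")

# Hypothetical table and partition; replace with the one Athena complains about.
query = "ALTER TABLE my_table DROP IF EXISTS PARTITION (dt = '2023-01-01')"

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)

Re-running the Glue crawler afterwards re-creates the dropped partition with the table-level metadata, as described above.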