I ran a Glue Crawler over a nested JSON data source on S3, and I tried to query the nested fields via Redshift Spectrum as per the documentation:
select c.id, c.my_nested_column.MyField
from my_external_schema.my_table c;
But I was getting the error message:
[42703] ERROR: column "my_nested_column" does not exist
which doesn't really make sense, as I can see from the metadata that the column exists. Because of this I'm unable to unnest any fields from "my_nested_column".
How can I fix this?
After some investigation, I noticed that one of the fields in the JSON documents being parsed contained a colon in its name, something like my:field.
This clashes with the JSON parsing logic and doesn't work well. I removed this field from the Glue Catalog, and afterwards I was able to query the nested column correctly.
The initial error message really didn't help, but the problem was caused by a malformed JSON field name.
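For reference, here is a minimal sketch of how the offending column could be dropped from the Glue Catalog programmatically instead of through the console. The database, table, and column names are hypothetical placeholders, and it assumes the colon-named field is a top-level column; if it lives inside a struct column, the struct's type string would need to be edited instead.

import boto3

# Minimal sketch: drop a problematic column (e.g. one whose name contains
# a colon) from an existing Glue Catalog table.
# Database/table/column names are hypothetical placeholders.
glue = boto3.client("glue")

database = "my_glue_db"   # assumption: your Glue database name
table_name = "my_table"   # assumption: the crawled table name
bad_column = "my:field"   # assumption: the column with the colon in its name

table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# Build a TableInput from the existing definition, keeping only keys that
# update_table accepts (get_table returns extra read-only metadata).
allowed_keys = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in allowed_keys}

# Remove the problematic column from the storage descriptor.
columns = table_input["StorageDescriptor"]["Columns"]
table_input["StorageDescriptor"]["Columns"] = [
    c for c in columns if c["Name"] != bad_column
]

glue.update_table(DatabaseName=database, TableInput=table_input)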
Related
I'm interested in using a streaming pipeline from Google Pub/Sub to BigQuery, and I want to know how it would handle a case where an updated JSON object is sent with missing fields/branches that already exist in the BigQuery table/schema. For example, will it set the value in the table to empty/null, retain what's in the table and only update the fields/branches that are present, or simply fail because the sent object does not match the schema one-to-one?
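With BigQuery's streaming insert API, the usual behaviour is that NULLABLE fields missing from an incoming row are left NULL in the appended row (streaming inserts append rather than merge), and unknown fields are rejected unless explicitly ignored. Below is a minimal sketch for probing that behaviour directly; the project, dataset, table, and field names are hypothetical, and the table is assumed to already exist with NULLABLE fields.

from google.cloud import bigquery

# Minimal sketch to probe how streaming inserts treat missing fields.
# Table and field names are hypothetical; the table is assumed to exist
# with NULLABLE fields id, name, detail.
client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"

rows = [
    {"id": 1, "name": "full row", "detail": "present"},
    {"id": 2, "name": "row with a missing branch"},  # "detail" omitted
]

# Missing NULLABLE fields end up NULL in the inserted row; unknown fields
# would be rejected unless ignore_unknown_values=True is passed.
errors = client.insert_rows_json(table_id, rows, ignore_unknown_values=True)
print(errors or "All rows streamed successfully")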
I have a JSON file that has spaces in the field names. The Glue crawler is able to infer them and creates the fields appropriately in a struct. But when I query this table in Athena I get a HIVE_METASTORE_ERROR complaining about the space in the field name. So I updated the table manually in the Glue Data Catalog to get rid of the spaces in the column names. When I go to Athena after the update and generate the DDL, I get the right columns without spaces. However, the SELECT still fails with the HIVE_METASTORE_ERROR, and it keeps telling me that I have spaces in the column names.
Does anyone know how to get around this problem? Changing the source data is not allowed.
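One thing that may be worth checking (an assumption on my part, not something confirmed in the post) is whether the spaces survive inside the struct type definition rather than only in the top-level column names; renaming the columns in the console does not rewrite the struct<...> type string. A minimal boto3 sketch to inspect that, with hypothetical database/table names:

import boto3

# Minimal sketch: print the full type definition of each column so that
# spaces hiding inside struct<...> definitions become visible.
# Database/table names are hypothetical placeholders.
glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_db", Name="my_json_table")["Table"]

for col in table["StorageDescriptor"]["Columns"]:
    # For a struct column the type looks like:
    #   struct<field one:string,field two:int>
    # and any spaces in the inner field names must be fixed here too.
    print(col["Name"], "->", col["Type"])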
I'm looking into OpenSearch and trying out the dashboards & data querying but am having some difficulty with some specific queries I'd like to run.
The data I have streaming into OpenSearch is a set of custom error logs whose #message field contains JSON such as ...{"code":400,"message":"Bad request","detail":"xxxx"}... (the entire message field is not valid JSON, as it contains other data too).
I can query the code field specifically with "code:400", but I would like a more generic query to match all 4XX codes. Adding any kind of wildcard or range breaks the query, and the surrounding quotes are required, otherwise the results include code OR 400.
Is there any way to achieve this with the kind of data I have, or is this a limitation of the query syntax?
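If the code value is actually indexed as a numeric field (rather than only living inside the unparsed #message text), a range query through the query DSL is one possible way to match all 4XX codes. Here is a minimal opensearch-py sketch, with the host and index name as hypothetical placeholders; if code exists only inside the raw #message string this will not apply, and the JSON would first need to be extracted into its own field (for example with an ingest pipeline).

from opensearchpy import OpenSearch

# Minimal sketch: match all documents whose numeric "code" field is 4xx.
# Host, credentials and index name ("error-logs-*") are hypothetical.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "query": {
        "range": {
            "code": {"gte": 400, "lt": 500}
        }
    }
}

response = client.search(index="error-logs-*", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])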
I've created a Glue Crawler to read files from S3 and create a table for each S3 path. The table health_users was created with the wrong type for a specific column: two_factor_auth_enabled was created as int instead of string.
I went to the Glue Catalog and manually updated the schema of the health_users table.
After that, I tried to run the query again on Athena and it still throws the same error:
Your query has the following error(s):
HIVE_BAD_DATA: Field two_factor_auth_enabled's type BOOLEAN in parquet is incompatible with type int defined in table schema
This query ran against the "test_parquets" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: c3a86b98-70a2-4c70-97d8-8bc377c455b8.
I've checked the table structure on Athena and the column two_factor_auth_enabled is a string (the attached file shows the table definition).
What's wrong with my solution? How can I fix this error?
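One thing that might be worth checking (this is an assumption, not something visible in the post) is whether the table is partitioned and whether existing partitions still carry the old column type: Athena resolves types per partition, so updating the table-level schema in the Glue Catalog does not rewrite partition-level storage descriptors. A minimal boto3 sketch to compare them, using the database, table, and column names from the question:

import boto3

# Minimal sketch: compare the table-level type of a column with the type
# recorded on each partition of the same table.
glue = boto3.client("glue")

database = "test_parquets"
table_name = "health_users"
column = "two_factor_auth_enabled"

table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
table_type = next(
    c["Type"] for c in table["StorageDescriptor"]["Columns"] if c["Name"] == column
)
print("table-level type:", table_type)

paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName=database, TableName=table_name):
    for partition in page["Partitions"]:
        for c in partition["StorageDescriptor"]["Columns"]:
            if c["Name"] == column:
                print(partition["Values"], "->", c["Type"])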
Per the BigQuery documentation I am attempting to modify a table's schema by adding a field. The table in question is a partition slice (partitioned by day). I am planning on performing the action on every slice.
Per the documentation (https://cloud.google.com/bigquery/docs/managing-partitioned-tables), I should be able to add a field to a partitioned table like any other table. However, whenever I attempt to add a field to a partitioned table, I am met with this error:
Could not edit table schema.: Cannot change partitioned/clustered table to non partitioned/clustered table.
I am not able to find any good information on what this error means, or what I'm doing wrong. I have successfully added a field to a non-partitioned table. Does the community have any good ideas to help me troubleshoot?
I understand that you are using the update_table method to update the schema in Python; correct me if I'm wrong. You have to do it through the patch API; you can try that API directly to get a better view of how it works.
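For reference, this is roughly what the documented Python pattern for adding a NULLABLE column looks like; it is a sketch assuming the google-cloud-bigquery client library, and the project/dataset/table and new field name are hypothetical. Passing an explicit field list to update_table sends a partial (patch-style) update, which should leave the partitioning definition untouched.

from google.cloud import bigquery

# Minimal sketch: append a new NULLABLE column to an existing partitioned
# table. Project/dataset/table and the new field name are hypothetical.
client = bigquery.Client()
table_id = "my_project.my_dataset.my_partitioned_table"

table = client.get_table(table_id)

new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("new_field", "STRING", mode="NULLABLE"))
table.schema = new_schema

# Only the "schema" field is sent, so the table's partitioning spec is
# not touched by this call.
client.update_table(table, ["schema"])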