BigQuery Cannot Modify Partitioned Table Schema - google-cloud-platform

Following the BigQuery documentation, I am attempting to modify a table's schema by adding a field. The table in question is a partition slice (partitioned by day), and I plan to perform the action on every slice.
Per the documentation (https://cloud.google.com/bigquery/docs/managing-partitioned-tables), I should be able to add a field to a partitioned table like any other table. However, whenever I attempt to add a field to a partitioned table, I am met with this error:
Could not edit table schema.: Cannot change partitioned/clustered table to non partitioned/clustered table.
I am not able to find any good information on what this error means, or what I'm doing wrong. I have successfully added a field to a non-partitioned table. Does the community have any good ideas to help me troubleshoot?

I understand that you are using the update_table method to update the schema in Python; correct me if I'm wrong. You have to do it with the tables.patch API; you can try that API to get a better view of how to do it.
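If so, here is a minimal sketch of that approach with the google-cloud-bigquery Python client (the dataset, table, and field names are placeholders): update_table issues a tables.patch request under the hood, and listing only "schema" leaves the partitioning configuration untouched.

```python
# Minimal sketch: add a field to a partitioned table's schema with the
# google-cloud-bigquery Python client. Project/dataset/table and field
# names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my_dataset.my_partitioned_table")

# Append the new column to the existing schema instead of replacing it.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("new_field", "STRING", mode="NULLABLE"))
table.schema = new_schema

# update_table sends a tables.patch request; passing only "schema" means
# the table's partitioning settings are not part of the update.
client.update_table(table, ["schema"])
```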

Related

How does Google BigQuery handle table updates with missing fields?

I'm interested in using a streaming pipeline from Google Pub/Sub to BigQuery, and I want to know how it would handle a case where an updated JSON object is sent with missing fields/branches that are already present in a BigQuery table/schema. For example, will it set the value in the table to empty/null, retain what's in the table and update only the fields/branches that are present, or simply fail because the sent object does not match the schema one to one?

How to fetch the latest schema change in BigQuery and restore a deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (a TTL of sorts), according to the official documentation.
If I then run the query above, the column is still there in the schema as well as in Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into such a column or view it in the GCP console. This inconsistency really causes an issue
when I want to write a bash script to monitor schema changes and perform some operation based on them.
I need more visibility into BigQuery table schemas. The least I need is for
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to expose a flag column indicating that the column is deleted or within the 7-day TTL.
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and extract the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE statement is processed as an InsertJob, and filters them based on the actual query text where it says drop. The regex I used is not strict (for the sake of the example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
In a sample snippet from the output of the command above, the column PADDING shows as deleted, based on the logged query.
If you have options other than Bash, I suggest creating a BigQuery sink for your logs; you can then run queries there and get this information. You can also use client libraries like Python, Node.js, etc. to either query the sink or query Cloud Logging directly.
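As a rough sketch of the client-library route (the filter string is the same one used above, and the project ID is a placeholder), fetching those log entries from Python with google-cloud-logging could look something like this:

```python
# Sketch: pull ALTER TABLE ... DROP jobs out of Cloud Logging with the
# google-cloud-logging client. The project ID and the loose regex are
# placeholders; tighten the filter for real use.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")
log_filter = (
    'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" '
    'AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
)

for entry in client.list_entries(filter_=log_filter):
    # entry.payload holds the audit-log protoPayload; the DROP statement is
    # nested under metadata.jobChange.job.jobConfig.queryConfig.query.
    print(entry.payload)
```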
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. The answer also explains BigQuery's behavior of retaining a deleted column for 7 days and a workaround for deleting the column immediately. See the actual query used to retrieve the deleted column, and the column-deletion workaround, at the previously provided link.
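For illustration, a time-travel query of that sort can be run with the Python client roughly as below; the table and column names are taken from the question, and the one-hour offset is an arbitrary assumption (the query in the linked answer is the authoritative one).

```python
# Sketch: read a recently dropped column back through BigQuery time travel.
# Table/column names and the timestamp offset are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT blabla
FROM `my_dataset.User`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(sql).result():
    print(row)
```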

AWS Glue table Map data type for arbitrary number of fields and challenges faced

We are working on a data lake project and we are getting the data from CloudWatch Logs, which in turn is sent to S3 with the help of the Kinesis service. Once the data is in S3, we need to create a table to see the data through Athena; we have JSON files. We have three fields: one is timestamp and the other is properties, which is an object that may hold an arbitrary number of fields and differs from case to case, hence while creating the table I defined it as map<string,string>, based on some research and advice.
Now the challenge is that while querying through Athena, it always says zero records returned when there is data for sure. To confirm this, I created a table this time through a crawler, and I am able to see the data through Athena; the only difference is that the properties column is defined as a struct with specific fields inside it, whereas the manual table has map<string,string> to handle the arbitrary fields coming in. I'd appreciate any help identifying the root cause. Thank you.
Below is a sample JSON line sitting in S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it was the former, so I suspect the latter. When you run a crawler it will add partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
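In the meantime, purely as a generic sketch (it assumes Hive-style date partitions and placeholder database, table, bucket, and output-location names; your real layout may differ), registering a partition from Python with boto3 could look like this:

```python
# Generic sketch: register one partition on an Athena table via boto3.
# Database, table, S3 paths, and the dt=2020-09-02 partition are placeholders;
# adjust them to match how your data is actually laid out in S3.
import boto3

athena = boto3.client("athena")

ddl = """
ALTER TABLE my_logs ADD IF NOT EXISTS
PARTITION (dt = '2020-09-02')
LOCATION 's3://my-bucket/logs/dt=2020-09-02/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)
```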
This is a case where using a Glue crawler will probably not work very well: it will try too hard to figure out the schema of the properties column, and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours where Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type, you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.
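For example, here is a sketch of query-time casting run through boto3; the table name and output location are placeholders, the map keys come from the sample JSON line above, and the chosen casts are assumptions about the value types:

```python
# Sketch: cast values out of a map<string,string> properties column at
# query time. "my_logs" and the output location are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT
  from_unixtime(CAST(properties['timestamp'] AS bigint) / 1000) AS event_time,
  properties['message']                                         AS message,
  CAST(properties['processFlag'] AS boolean)                    AS process_flag
FROM my_logs
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)
```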

Automatically generate data documentation in the Redshift cluster

I am trying to automatically generate data documentation in the Redshift cluster for all the maintained data products, but I am having trouble doing so.
Is there a way to fetch/store metadata about tables/columns in redshift directly?
Is there also some automatic way to determine what the unique keys in a Redshift table are?
For example an ideal solution would be to have:
Table location (cluster, schema, etc.)
Table description (what is the table for)
Each column's description (what is each column for, data type, is it a key column, if so what type, etc.)
Column's distribution (min, max, median, mode, etc.)
Columns which together form a unique entry in the table
I fully understand that getting the descriptions automatically is pretty much impossible, but I couldn't find a way to store the descriptions in Redshift directly; instead I'd have to use third-party solutions or generally documentation outside of the SQL scripts, which I'm not a big fan of due to the way the data products are built right now. Thus, having a way to store each table's/column's description in Redshift would be greatly appreciated.
Amazon Redshift has the ability to store a COMMENT on:
TABLE
COLUMN
CONSTRAINT
DATABASE
VIEW
You can use these comments to store descriptions. It might need a bit of table joining to access them.
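A rough sketch of both directions, run through the Redshift Data API with boto3 (cluster, database, user, and the table/column names are all placeholders), might look like the following; the pg_description/pg_class/pg_attribute join is the "bit of table joining" mentioned above.

```python
# Sketch: store and read back descriptions with Redshift COMMENTs, using the
# Redshift Data API via boto3. Cluster/database/user and the table/column
# names are placeholders; the catalog join is one way to read comments back.
import boto3

rsd = boto3.client("redshift-data")

def run(sql):
    # Fire-and-forget for brevity; in practice poll describe_statement /
    # get_statement_result to pick up the query output.
    return rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="my_db",
        DbUser="my_user",
        Sql=sql,
    )

# Store descriptions.
run("COMMENT ON TABLE sales IS 'Daily sales facts, one row per order line'")
run("COMMENT ON COLUMN sales.order_id IS 'Business key; unique together with line_no'")

# Read them back: comments live in pg_description, keyed by the object OID
# and the column number, hence the join against pg_class/pg_attribute.
run("""
SELECT c.relname  AS table_name,
       a.attname  AS column_name,
       d.description
FROM pg_description d
JOIN pg_class c ON c.oid = d.objoid
LEFT JOIN pg_attribute a
       ON a.attrelid = c.oid AND a.attnum = d.objsubid
WHERE c.relname = 'sales'
""")
```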
See: COMMENT - Amazon Redshift

How to query a Dynamo DB table without knowing the table name before runtime?

I want to query a DynamoDB table based on an attribute UpdateTime such that I get the records that were updated in the last 24 hours, but this attribute is not an index in the table. I understand that I need to make this attribute an index, but I do not know how to write a query expression for this.
I saw this question, but the problem is that I do not know the name of the table I want to query before runtime.
To find out the table names in your DynamoDB instance, you can use the "ListTables" API: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ListTables.html.
Another way to view tables and their data is via the DynamoDB Console: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ConsoleDynamoDB.html.
Once you know the table name, you can either create an index with the UpdateTime attribute as a key or scan the whole table to get the results you want. Keep in mind that scanning a table is a costly operation.
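A minimal sketch of that with boto3 is below; the attribute name UpdateTime comes from the question, while the assumption that it is stored as epoch milliseconds (and everything else) is a placeholder to adapt.

```python
# Sketch: discover table names at runtime, then scan for rows whose
# UpdateTime falls in the last 24 hours. Assumes UpdateTime is stored as
# epoch milliseconds; adapt the cutoff if you store ISO-8601 strings.
import time
import boto3
from boto3.dynamodb.conditions import Attr

client = boto3.client("dynamodb")
dynamodb = boto3.resource("dynamodb")

# ListTables is paginated; the paginator walks all pages for us.
table_names = []
for page in client.get_paginator("list_tables").paginate():
    table_names.extend(page["TableNames"])

cutoff = int((time.time() - 24 * 60 * 60) * 1000)

for name in table_names:
    table = dynamodb.Table(name)
    # A scan reads the whole table and filters afterwards, which is the
    # costly part mentioned above; a GSI on UpdateTime avoids this. This
    # only fetches the first page; follow LastEvaluatedKey for the rest.
    response = table.scan(FilterExpression=Attr("UpdateTime").gte(cutoff))
    print(name, len(response.get("Items", [])))
```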
Alternatively you can create a DynamoDB Stream that captures all of the changes to your tables: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html.