How does Google BigQuery handle table updates with missing fields? - google-cloud-platform

I'm interested in using a streaming pipeline from Google Pub/Sub to BigQuery, and I wanted to know how it would handle a case where an updated JSON object is sent with fields/branches missing that are already present in the BigQuery table/schema. For example, will it set the value in the table to empty/null, retain what's in the table and update only the fields/branches that are present, or simply fail because the sent object does not match the schema one-to-one?
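A minimal sketch of the scenario, assuming the BigQuery Python client and a hypothetical table; this is an illustration rather than a confirmed answer. Note that streaming inserts append rows rather than update existing ones, and what happens with a missing column depends on whether it is NULLABLE or REQUIRED.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical table

# Suppose the schema is: id (REQUIRED), name (NULLABLE), score (NULLABLE).
row_missing_field = {"id": "abc-123", "name": "example"}  # "score" omitted

errors = client.insert_rows_json(table_id, [row_missing_field])
if errors:
    # A missing REQUIRED column surfaces here as a per-row insert error.
    print("Insert errors:", errors)
else:
    # For a NULLABLE column, an omitted key is stored as NULL in the new row.
    print("Row inserted")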

Related

How should I create this DynamoDB table for almost 100000 records?

I've got a CSV with nearly 100000 records in it and I'm trying to query this data in AWS DynamoDB, but I don't think I did it right.
When creating the table I added a key column, airport_id, and then wrote a small console app to import all the records, setting a new GUID for the key column of each record. I've only ever used relational databases before, so I don't know whether this is supposed to work differently.
The problem came when querying the data: picking a record somewhere in the middle and querying for it using the AWS SDK in .NET produced no results at all. I can only put this down to bad DB design on my part, as it did return some results depending on the query.
What I'd like is to be able to query the data, for example:
Get all the records that have iso_country of JP (Japan) - doesn't find many
Get all the records that have municipality of Inverness - doesn't find any, the Inverness records are halfway through the set
The records are there, but I suspect the table doesn't have the right design to retrieve them in a timely fashion.
How should I create the DynamoDB table based on the below screenshot?
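One common pattern for queries like these, sketched here only as an illustration, is a global secondary index on the attribute being queried; with a random GUID as the partition key, a full table scan would otherwise be needed. A hedged sketch using boto3, with hypothetical table, attribute, and index names:

import boto3

dynamodb = boto3.client("dynamodb")

# Add a GSI keyed on municipality to an existing table (a provisioned-capacity
# table would also need ProvisionedThroughput for the index).
dynamodb.update_table(
    TableName="airports",
    AttributeDefinitions=[{"AttributeName": "municipality", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "municipality-index",
            "KeySchema": [{"AttributeName": "municipality", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    }],
)

# Once the index is active, Query (rather than Scan) can fetch the Inverness rows directly.
resp = dynamodb.query(
    TableName="airports",
    IndexName="municipality-index",
    KeyConditionExpression="municipality = :m",
    ExpressionAttributeValues={":m": {"S": "Inverness"}},
)
print(resp["Items"])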

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the query below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command: ALTER TABLE User DROP COLUMN blabla
the column blabla is not actually deleted for 7 days (TTL), according to the official documentation.
If I then run the query above, the column is still there in the schema as well as in the table Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and do some operations based on them, so I need more visibility into the table schema of BigQuery. The least thing I need is:
for Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to store a flag column that indicates deleted or TTL:7days
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column, you can try searching through Cloud Logging. I'm not sure what tools Spanner supports, but if you want to use Bash you can use gcloud to fetch the logs, though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE is considered an InsertJob, and filters them based on the actual query text where it says drop. The regex I used is not strict (for the sake of example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (the column PADDING is deleted, based on the query):
If you have options other than Bash, I suggest that you create a BigQuery sink for your logs; you can then run queries there to get this information. You can also use client libraries like Python, NodeJS, etc. to either query the sink or query Cloud Logging directly.
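A hedged sketch of that client-library route, assuming the Cloud Logging Python client and a hypothetical project ID; the filter simply mirrors the gcloud command above:

from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project

log_filter = (
    'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" '
    'AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
)

for entry in client.list_entries(filter_=log_filter):
    # Each entry's payload carries the job metadata, including the original
    # ALTER TABLE ... DROP COLUMN statement that removed the column.
    print(entry.timestamp, entry.payload)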
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. That answer also explains BigQuery's behavior of retaining the deleted column for 7 days, and a workaround to delete the column instantly. See the previously provided link for the actual query used to retrieve the deleted column and for the workaround for deleting a column immediately.
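A hedged sketch of what that time-travel read might look like, assuming the BigQuery Python client and hypothetical project, dataset, table, and column names:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT blabla
FROM `my-project.Dataset.User`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

# Within the 7-day time-travel window, the historical snapshot still contains
# the dropped column, so it can be read (or copied into a new table) from here.
for row in client.query(sql).result():
    print(row)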

AWS Glue table Map data type for arbitrary number of fields and challenges faced

We are working on a data-lake project and we are getting the data from CloudWatch Logs, which in turn is sent to S3 with the help of the Kinesis service. Once this is in S3, we need to create a table to see the data through Athena; we have JSON files. We have three fields: one is a timestamp and another is properties, which in turn is an object that may hold an arbitrary number of fields and differs from case to case. Hence, while creating the table, I defined it to be map<string,string>, based on some research and advice. Now the challenge is that while querying through Athena, it always says zero records returned when there is data for sure. To confirm this, I created a table this time through a crawler and I am able to see the data through Athena; the only difference is that the properties column is defined as a struct with specific fields inside it, whereas the manual table has map<string,string> to handle arbitrary fields coming in. I would appreciate any help identifying the root cause of this. Thank you.
Below is sample JSON line which is sitting in S3.
{"streamedAt":1599012411567,"properties":{"timestamp":1597998038615,"message":"Example Event 1","processFlag":"true"},"event":"aws:kinesis:record"}
Zero records returned usually means the table's location is wrong, or that you have a partitioned table but haven't added any partitions. I assume you would have figured it out if it was the former, so I suspect the latter. When you run a crawler it will add partitions it finds in addition to creating the table.
If you need help adding partitions please edit your question and provide examples of how your data is structured on S3.
This is a case where using a Glue crawler will probably not work very well; it will try too hard to figure out the schema of the properties column, and it's never going to get it right. Glue crawlers are in general pretty bad at things that aren't very basic (see this question for a similar use case to yours where Glue didn't work out).
I think you'll be fine with a manually created table that uses map<string,string> as the type for the properties column. When you know the type of a property and want to use it as that type, you just cast the value at query time. An alternative is to use string as the type and use the JSON functions to extract values at query time.
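A hedged sketch of the cast-at-query-time idea, run through the Athena API with boto3; the database name, table name, and output location are hypothetical, and the casts just illustrate the sample JSON line shown above:

import boto3

athena = boto3.client("athena")

# properties is declared as map<string,string>, so individual keys are pulled
# out by name and cast to the type they are needed as.
query = """
SELECT
  properties['message']                        AS message,
  CAST(properties['processFlag'] AS boolean)   AS process_flag,
  CAST(properties['timestamp']   AS bigint)    AS event_time_ms
FROM logs_table
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])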

How to get rid of __key__ columns in BigQuery table for every 'Record' Type field?

For every 'Record' type field in my Firestore table, BigQuery automatically adds the '__key__' columns. I do not want to have these added for each of the 'Record' type fields. How can I get rid of these extra columns being added automatically by BigQuery? (I want to get rid of the columns in my BigQuery table schema highlighted in yellow below.)
This is intended behavior, citing the BigQuery GCP documentation:
Each document in Firestore has a unique key that contains information such as the document ID and the document path. BigQuery creates a RECORD data type (also known as a STRUCT) for the key, with nested fields for each piece of information, as described in the following table.
Because the Firestore export method is fully integrated with the GCP managed import and export service, you can't change this behavior, i.e. you can't exclude the __key__.* properties created for each RECORD field in the target BigQuery table.
I guess that in your use case, modifying the BigQuery table will require some hands-on intervention, since it means manually changing the schema.
To request this capability, I would encourage you to raise a feature request with the vendor via the Google public issue tracker.

BigQuery Cannot Modify Partitioned Table Schema

Per the BigQuery documentation I am attempting to modify a table's schema by adding a field. The table in question is a partition slice (partitioned by day). I am planning on performing the action on every slice.
Per the documentation (https://cloud.google.com/bigquery/docs/managing-partitioned-tables), I should be able to add a field to a partitioned table like any other table. However, whenever I attempt to add a field to a partitioned table, I am met with this error:
Could not edit table schema.: Cannot change partitioned/clustered table to non partitioned/clustered table.
I am not able to find any good information on what this error means, or what I'm doing wrong. I have successfully added a field to a non-partitioned table. Does the community have any good ideas to help me troubleshoot?
I understand that you are using the update_table method to update the schema in Python; correct me if I'm wrong. You have to do it with the patch API; you can try that API to get a better view of how to do it.
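A hedged sketch of that patch-based schema update using the BigQuery Python client (update_table issues a tables.patch call under the hood); project, dataset, table, and field names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.partitioned_table"  # hypothetical

table = client.get_table(table_id)

# Append the new NULLABLE field to the existing schema, then patch only the
# schema property so the partitioning/clustering settings are left untouched.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("new_field", "STRING", mode="NULLABLE"))
table.schema = new_schema

client.update_table(table, ["schema"])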