Create a report from GCP Cloud SQL logs - google-cloud-platform

I have enabled logging on my GCP PostgreSQL 11 Cloud SQL database. The logs are being redirected to a bucket in the same project and they are in a JSON format.
The logs contain queries which were executed on the database. Is there a way to create a decent report from these JSON logs with a few fields from the log entries? Currently the log files are in JSON and not very reader friendly.
Additionally, if a multi-line query is run, then those many log entries are created for that query. If there is also a way to recognize logs which are belong to the same query, that will be helpful, too!

I guess the easiest way is using BigQuery.
BigQuery will import properly those jsonl files and will assign proper field names for the json data
When you have multiline-queries, you'll see that they appear as multiple log entries in the json files.
Looks like all entries from a multiline query have the same receiveTimestamp (which makes sense, since they were produced at the same time).
Also, the insertId field has a 's=xxxx' subfield that does not change for lines on the same statement. For example:
insertId: "s=6657e04f732a4f45a107bc2b56ae428c;i=1d4598;b=c09b782e120c4f1f983cec0993fdb866;m=c4ae690400;t=5b1b334351733;x=ccf0744974395562-0#a1"
The strategy to extract that statements in the right line order is:
Sort by the 's' field in insertId
Then sort by receiveTimestamp ascending (to get all the lines sent at once to the syslog agent in the cloudsql service)
And finally sort by timestamp ascending (to get the line ordering right)

Related

Is it possible to retrieve the Timestream schema via the API?

The AWS console for Timestream can show you the schema for tables, but I cannot find any way to retrieve this information programatically. While the schema is dynamically created from the write requests, it would be useful to check the schema to validate that writes are going to generate the same measure types.
As shown in this document, you can run the command beneath, it'll return the table's schema.
DESCRIBE "database"."tablename"
Having the right profile setup at your environment, to run this query from the CLI, you can run the command:
aws timestream-query query --query-string 'DESCRIBE "db-name"."tb-name"' --no-paginate
This should return the table schema.
You can also retrieve the same data running this query from source codes in Typescript/Javascript, Python, Java or C# just by using the Timestream-Query library from AWS's SDK.
NOTE:
If the table is empty, this query will return empty so, make sure you have data in your db/table before querying for it's schema, ok? Good luck!

Row level changes captured via AWS DMS

I am trying to migrate the database using AWS DMS. Source is Azure SQL server and destination is Redshift. Is there any way to know the rows updated or inserted? We dont have any audit columns in source database.
Redshift doesn’t track changes and you would need to have audit columns to do this at the user level. You may be able to deduce this from Redshift query history and save data input files but this will be solution dependent. Query history can be achieved in a couple of ways but both require some action. The first is to review the query logs but these are only saved for a few days. If you need to look back further than this you need a process to save these tables so the information isn’t lost. The other is to turn on Redshift logging to S3 but this would need to be turned on before you run queries on Redshift. There may be some logging from DMS that could be helpful but I think the bottom line answer is that row level change tracking is not something that is on in Redshift by default.

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch columns and data type of BQ tables via the below command:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using command : Alter TABLE User drop column blabla:
the column blabla is not actually deleted within 7 days(TTL) based on official documentation.
If I use the above command, the column is still there in the schema as well as the table Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
It is just that I cannot insert data into such column and view such column in the GCP console. This inconsistency really causes an issue.
If I want to write bash script to monitor schema changes and do some operation based on it.
I need more visibility on the table schema of BigQuery. The least thing I need is:
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS can store a flag column that indicates deleted or TTL:7days
My questions are:
How can I fetch the correct schema in spanner which reflects the recently deleted the column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column you can try searching through Cloud Logging. I'm not sure what tools Spanner supports but if you want to use Bash you can use gcloud to fetch logs. Though it will be difficult to parse the output and get the information you want.
Command used below fetched the logs for google.cloud.bigquery.v2.JobService.InsertJob since an ALTER TABLE is considered as an InsertJob and filter it based from the actual query where it says drop. The regex I used is not strict (for the sake of example), I suggest updating the regex to be stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (Column PADDING is deleted based from the query):
If you have options other than Bash, I suggest that you create a BQ sink for your logging and you can perform queries there and get these information. You can also use client libraries like Python, NodeJS, etc to either query in the sink or directly query in the GCP Logging.
As per this SO answer, you can use the time travel feature of BQ to query the deleted column. The answer also explains behavior of BQ to retain the deleted column within 7 days and a workaround to delete the column instantly. See the actual query used to retrieve the deleted column and the workaround on deleting a column on the previously provided link.

Extract script with which data was inserted into an old table and the owner in BigQuery - GCP

I am trying to find out what was the query run to be able to insert data into an empty table created earlier.
It would also be useful for me to know the user who was the creator of that table, if not, just to ask him about that script.
I tried with "INFORMATION_SCHEMA.TABLES;" but I only get the create script of the table.
I would look into BigQuery audit logs
most probably you can get information you want :
here is reference: https://cloud.google.com/bigquery/docs/reference/auditlogs/

Is it possible to query log data stored Cloud Storage without Cleaning it using BigQuery?

I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in BigQuery Dataset I am getting
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks which contains a large number of unique column names. Some of these names aren't valid as per BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load(put) my data into BigQuery Storage, just to query data which is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for clear understanding
I think you can go any of these two routes based on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then just simply load the data in its un-structured raw format into BigQuery. Once your data is in there, you can go about doing all sorts of stuff.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset in the table providing appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, now you can process it using SQL queries, and split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING#CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.
In the case you pick B or C, your table now has rows from your dataset as singular column rows. You can now choose to split these singular column rows into multiple column rows, based on your delimiter requirements.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now its all in the form of a singular column named data, which you can separate by the delimiter ,:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on #khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw to BigQuery - as if it was a CSV - and then parse in BigQuery
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And then you can use JSON_EXTRACT_SCALAR to parse JSON in BigQuery - and transform the existing field names into BigQuery compatible ones.
Unfortunately no!
As part of log analytics, it is common to reshape the log data and run few ETL's before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need for log analytics, and there is no rationale to create additional code for ETL, all metrics can be derived from REST API endpoints of stackdriver monitoring.
If you do not need fields containing - you can set up to ignore ignore_unknown_values. You have to provide the schema you want and using ignore_unknown_values any field not matching the schema will be ignored.