Is there any way to track a job across services in Stackdriver? - google-cloud-platform

We use lots of components in Google Cloud; for example, a job may start on App Engine, then do some work in Apache Airflow, then do some Dataflow work which will run a BigQuery insert.
Is there any way we can track the status of a job across all components using Stackdriver? For example, could we somehow give Stackdriver a custom job ID and query for it?

You can use advanced logs filters [1] to include log entries from various products. In the Logging page, search for your BigQuery job ID. Click on the job ID and select "Show matching entries". This will open the advanced filter text box with the proper syntax. Then you can add more queries with an OR in between.
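As a rough illustration, assuming you stamp every log entry with a custom job ID of your own (the project ID, job ID and label name below are made up), you could also query across products programmatically with the Cloud Logging client library:

from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="my-project")  # placeholder project id

# Match any entry, from any product, that mentions our custom job id,
# either as free text or in a (hypothetical) user label we set ourselves.
job_id = "job-2019-01-15-0042"
log_filter = f'"{job_id}" OR labels.custom_job_id="{job_id}"'

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.log_name, entry.payload)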

Related

How to update data in google cloud storage/bigquery for google data studio?

For context, we would like to visualize our data in Google Data Studio - this dataset receives more entries each week. I have tried hosting our data sets in Google Drive, but it seems that they're too large and this slows down Google Data Studio (the file is only 50 MB, am I doing something wrong?).
I have loaded our data into Google Cloud Storage --> Google BigQuery, and connected my Google Data Studio to my BigQuery table. This has allowed me to use the Google Data Studio dashboard much more quickly!
I'm not sure what the best way is to update our data weekly in Google Cloud/BigQuery. I have found a slow way to do this by uploading the new weekly data to Google Cloud Storage, then appending the data to my table manually in BigQuery, but I'm wondering if there's a better way to do this (or at least a more automated way)?
I'm open to any suggestions, and if you think that bigquery/google cloud storage is not the answer for me, please let me know!
If I understand your question correctly, you want to automate the query that populates your table, which is connected to Data Studio.
If this is the case, then you can use Scheduled Queries in BigQuery. A scheduled query allows you to define a query whose results can be inserted into a destination table. In particular, you can specify different rules for repetition (minimum every 15 minutes) and execution, as well as destination writing options (destination table, write mode: append or truncate).
In order to use Scheduled Queries, your account must have the right permissions. You can have a look at the following documentation to better understand how to use Scheduled Queries [1].
Also, please note that on the front end, the updated data in the BigQuery table will only show up in Data Studio at each refresh (click on the refresh button in Data Studio). To automatically refresh the front-end visualization you can use the following plugin [2] or automate the click on the refresh button through browser console commands.
[1] https://cloud.google.com/bigquery/docs/scheduling-queries
[2] https://chrome.google.com/webstore/detail/data-studio-auto-refresh/inkgahcdacjcejipadnndepfllmbgoag?hl=en
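For completeness, here is a hedged sketch of creating such a scheduled query programmatically with the BigQuery Data Transfer Service Python client; the project, dataset, table and query names are placeholders, and the schedule syntax should be checked against the scheduling-queries documentation [1]:

from google.cloud import bigquery_datatransfer_v1  # pip install google-cloud-bigquery-datatransfer

client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = client.common_project_path("my-project")  # placeholder project id

transfer_config = bigquery_datatransfer_v1.TransferConfig(
    destination_dataset_id="reporting",            # placeholder dataset
    display_name="weekly-append",
    data_source_id="scheduled_query",
    schedule="every monday 06:00",                 # check the docs for the exact schedule syntax
    params={
        "query": "SELECT * FROM `my-project.staging.weekly_upload`",  # placeholder query
        "destination_table_name_template": "report_table",
        "write_disposition": "WRITE_APPEND",       # append the new weekly rows
    },
)

config = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("Created scheduled query:", config.name)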

Google Data Studio Billing Report Demo for GCP multiple projects

Basically I am trying to set up the Google Cloud Billing Report Demo for multiple projects.
An example is mentioned in this link.
In it there are three steps to configure the data sources for Data Studio:
Create the Billing Export Data Source
Create the Spending Trends Data Source
Create the BigQuery Audit Data Source
Now the 1st point is quite clear.
For the 2nd point, the query example which is provided in the demo is based on a single project. In my case I want to have a spending data source covering multiple projects.
Does doing a UNION of the query for each project work in this case?
For the 3rd point, I need the BigQuery audit logs from all my projects. I thought that setting a single external dataset sink for BigQuery, as shown below, in all my projects should do the job.
bigquery.googleapis.com/projects/myorg-project/datasets/myorg_cloud_costs
But I see that the tables in my dataset are being created with a suffix _ (1), as shown below:
cloudaudit_googleapis_com_activity_ (1)
cloudaudit_googleapis_com_data_access_ (1)
and these tables don't contain any data, despite BigQuery queries having been run in all projects multiple times. In fact, previewing them shows the error below:
Unable to find table: myorg-project:cloud_costs.cloudaudit_googleapis_com_activity_20190113
I think the auto-generated name with the suffix _ (1) is causing some issue, and because of that the data is also not getting populated.
I believe there should be a very simple solution for this; I'm just not able to see it.
Can somebody please provide some information on how to solve the 2nd and 3rd requirements for multiple projects in the GCP Data Studio billing report demo?
For the 2nd point, the query example which is provided in the demo is based on a single project. In my case I want to have a spending data source covering multiple projects. Does doing a UNION of the query for each project work in this case?
That project is the project you specify for the billing logs in BigQuery. The logs are attached to the billing account, which can contain multiple projects underneath it. All projects in the billing account will be captured in the logs, more specifically in the project.id column, so no UNION per project is needed.
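Building on that, a hedged sketch of pulling per-project spend straight out of the billing export with the BigQuery Python client; the table name is a placeholder and the column names assume the standard v1 billing export schema:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="myorg-project")  # placeholder project

# Spend per project, straight from the billing export table (placeholder name).
query = """
    SELECT
      project.id AS project_id,
      SUM(cost)  AS total_cost
    FROM `myorg-project.billing_export.gcp_billing_export_v1_XXXXXX`
    GROUP BY project_id
    ORDER BY total_cost DESC
"""

for row in client.query(query).result():
    print(row.project_id, row.total_cost)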
For the 3rd point, I need the BigQuery audit logs from all my projects. I thought that setting a single external dataset sink for BigQuery, as shown below, in all my projects should do the job.
You use the includeChildren property. See here. If you don't have an organisation or use folders, then you will need to create a sink per project and point it at the dataset in BigQuery where you want all the logs to go. You can script this up using the gcloud tool; it's easy.
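If you do go sink-per-project, a rough sketch of scripting it (the project IDs, sink name and filter are placeholders; the destination matches the dataset from the question):

import subprocess

# Placeholder project ids and the shared destination dataset.
projects = ["project-a", "project-b", "project-c"]
destination = "bigquery.googleapis.com/projects/myorg-project/datasets/myorg_cloud_costs"
log_filter = 'logName:"cloudaudit.googleapis.com"'  # only ship the audit logs

for project in projects:
    subprocess.run(
        [
            "gcloud", "logging", "sinks", "create", "audit-to-bq",
            destination,
            "--log-filter", log_filter,
            "--project", project,
        ],
        check=True,
    )
    # Grant the sink's writer identity (printed by gcloud) write access
    # on the destination dataset, otherwise no data will arrive.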
I think the auto-generated name with the suffix _ (1) is causing some issue, and because of that the data is also not getting populated.
The suffix is normal. Also, it can take a few hours for your logs/sinks to start flowing.

Count number of GCP log entries during a specified time

Is it possible to count the number of occurrences of a specific log message over a specific period of time in GCP Stackdriver Logging? I want to answer the question "How many times did this event occur during this time period?" Basically, I would like the integral of the curve in the chart below.
It doesn't have to be a moving window; this time it's more of a one-time task. A count aggregator or similar on the advanced log query would also work, if that were available.
The query looks like this:
(resource.type="container"
logName="projects/xyz-142842/logs/drs"
"Publish Message for updated entity"
) AND (timestamp>="2018-04-25T06:20:53Z" timestamp<="2018-04-26T06:20:53Z")
My log-based metric for the graph above looks like this:
My dashboard is set up like this:
I ended up building stacked bars.
With the correct zoom level I can sum up the number of occurrences easily enough. It would have been nice to get the count directly from a graph (the integral), but this works for now.
There are multiple ways to do this; the two that I have seen actually work and that can apply to your situation are the following:
Making use of Logs-based Metrics. They can, for example, record the number of log entries containing particular error messages, or they can extract latency information reported in log entries.
Stackdriver Logging logs-based metrics can be one of two metric types: counter or distribution. [...] Counter metrics count the number of log entries matching an advanced logs filter. [...] Distribution metrics accumulate numeric data from log entries matching a filter.
I would advise you to go through the documentation to check whether this feature completely covers your use case.
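For reference, a counter logs-based metric can also be created from code instead of the console; a minimal sketch with the Cloud Logging client library, reusing the filter from the question (the metric name is made up):

from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="xyz-142842")

# Counter metric: counts every log entry that matches the filter.
metric = client.metric(
    "publish_message_for_updated_entity",  # placeholder metric name
    filter_=(
        'resource.type="container" '
        'logName="projects/xyz-142842/logs/drs" '
        '"Publish Message for updated entity"'
    ),
    description="Occurrences of the 'Publish Message for updated entity' log line",
)
metric.create()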
You can export your logs to BigQuery; once you have them there, you can make use of classic tools like GROUP BY and SELECT and everything else BigQuery offers.
Here you can find a very minimal step-by-step guide on how to export the logs and how to analyze audit logs using BigQuery, but I am sure you can find many more resources online.
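Once the logs are exported, the one-off count from the question becomes a simple aggregate; a sketch with the BigQuery Python client, where the exported table name is a placeholder and the column layout depends on how your entries were exported (text vs. JSON payload):

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="xyz-142842")

# Count matching entries per day in the exported log tables (placeholder name).
query = """
    SELECT
      DATE(timestamp) AS day,
      COUNT(*)        AS occurrences
    FROM `xyz-142842.exported_logs.drs_*`
    WHERE textPayload LIKE '%Publish Message for updated entity%'
      AND timestamp BETWEEN TIMESTAMP("2018-04-25T06:20:53Z")
                        AND TIMESTAMP("2018-04-26T06:20:53Z")
    GROUP BY day
    ORDER BY day
"""

for row in client.query(query).result():
    print(row.day, row.occurrences)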
The products and approaches are really different; I would say BigQuery is more flexible, but also more complex to configure and use properly. If you find a third, better way, please update your question with that information.
First you have to create a metric:
Go to the Logs Explorer.
Type your query.
Go to Actions >> Create Metric.
Then, in the Monitoring dashboard:
Create a chart.
Select the resource and metric.
Go to "Advanced" and provide the details as given below:
Preprocessing step: Rate
Alignment function: count
Alignment period: 1
Alignment unit: minutes
Group by: log
Group by function: count
This will give you the visualisation in a bar chart with the count of the desired events.
There is one more option.
You can read your custom metric using the Stackdriver Monitoring API ( https://cloud.google.com/monitoring/api/v3/ ) and process it in a script with whatever aggregation you need.
If you are working with Python, you may look into the gcloud Python library: https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/monitoring
It will be a very simple script, and you can stream the results of the calculation into a BigQuery table and use it in your dashboard.
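A hedged sketch of that approach with the Cloud Monitoring Python client, assuming a user-defined logs-based metric named as in the earlier example (the metric name, project and window are placeholders):

import time
from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

client = monitoring_v3.MetricServiceClient()
project_name = "projects/xyz-142842"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "start_time": {"seconds": int(now - 24 * 3600)},  # last 24 hours
        "end_time": {"seconds": int(now)},
    }
)

# Sum the counter over the whole window: "how many times did this happen".
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 24 * 3600},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type="logging.googleapis.com/user/publish_message_for_updated_entity"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        "aggregation": aggregation,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)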
With PacketAI, you can send logs of arbitrary formats, including from GCP. The logs dashboard will then automatically parse them and group them into patterns, as shown in this video: https://streamable.com/n50kr8
Counts and trends of the different log patterns are also displayed.
Disclaimer: I work for PacketAI

Matillion for Amazon Redshift support for job monitoring

I am working on Matillion for Amazon Redshift and we have multiple jobs running daily, triggered by SQS messages. Now I am looking into the possibility of creating a UI dashboard for stakeholders which will monitor the live progress of jobs and show a report of previous jobs: job name, tables impacted, job status/reason for failure, etc. Does Matillion maintain this kind of information implicitly, or will I have to maintain this information for each job?
Matillion has an API which you can use to obtain details of all task history. Information on the tasks API is here:
https://redshiftsupport.matillion.com/customer/en/portal/articles/2720083-loading-task-information?b_id=8915
You can use this to pull data on either currently running jobs or completed jobs, down to component level, including the name of the job, the name of the component, how long it took to run, whether it ran successfully or not, and any applicable error message.
This information can be pulled into a Redshift table using the Matillion API profile which comes built into the product and the API Query component. You could then build your dashboard on top of this table. For further information I suggest you reach out to Matillion via their Support Center.
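As a very rough illustration of pulling that task history into a script of your own, a hedged sketch with the requests library; the instance URL, endpoint path, credentials and field names below are all hypothetical placeholders, so check the task API documentation above for the exact resource paths and schema your version exposes:

import requests

# Placeholder instance URL and credentials.
BASE = "https://matillion.example.com/rest/v1"
AUTH = ("api-user", "api-password")

# Hypothetical task-history endpoint filtered by date; see the task API docs
# for the real path exposed by your Matillion version.
url = BASE + "/group/name/MyGroup/project/name/MyProject/task/filter/by/start/range/date/2019-01-13"

response = requests.get(url, auth=AUTH, timeout=30)
response.raise_for_status()

for task in response.json():
    # Field names are illustrative only; map them to the schema in the docs.
    print(task.get("jobName"), task.get("state"), task.get("startTime"), task.get("message"))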
The API is helpful, but you can only pass a date as a parameter (this is for Matillion for Snowflake; I assume it's the same for Redshift). I've requested the ability to pass a datetime so we can run the jobs throughout the day and not pull back the same set of records every time our API call runs.

How can i see metadata, lineage of data stored in AWS redshift?

I am using solutions like Cloudera Navigator, Atlas and WhereHows to get Hadoop, HDFS, Hive, Sqoop and MapReduce metadata and lineage.
Now we have a data warehouse in AWS Redshift as well. Is there a way to extract metadata, lineage, or both out of Redshift?
So far I have not found anything on this.
Is there a way to integrate it with WhereHows as a crawled solution?
I found only one post which gives some information about how to get this from Redshift, assuming it will be similar to PostgreSQL. I am sure someone would have written some open-source solution to this problem.
Or is it just a matter of writing a single simple script to extract this information?
I am looking for an enterprise-level solution. I hope someone will point me in the right direction.
The AWS Glue Data Catalog is a fully managed metadata management service. It has the AWS Glue crawler, which automatically crawls through your source (for you, that's Redshift) and creates a centralized metadata repository which can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
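As a hedged sketch of wiring that up with boto3, assuming a Glue connection to your Redshift cluster already exists (the connection, role, database and path names are placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# Crawl the Redshift warehouse through an existing JDBC connection (placeholder names).
glue.create_crawler(
    Name="redshift-metadata-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder IAM role
    DatabaseName="redshift_catalog",                        # Glue Data Catalog database
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "redshift-connection",    # existing Glue connection
                "Path": "mydb/public/%",                    # database/schema/table pattern
            }
        ]
    },
)
glue.start_crawler(Name="redshift-metadata-crawler")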
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
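As a small sketch of capturing that baseline yourself (connection details are placeholders), you could periodically copy recent query history and the tables it touched out of the system tables:

import psycopg2  # pip install psycopg2-binary

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="metadata_reader",
    password="change-me",
)

with conn.cursor() as cur:
    # Recent queries and which permanent tables they scanned.
    cur.execute(
        """
        SELECT DISTINCT q.query,
                        q.starttime,
                        TRIM(s.perm_table_name) AS scanned_table,
                        TRIM(q.querytxt)        AS query_text
        FROM stl_query q
        JOIN stl_scan s ON s.query = q.query
        WHERE q.starttime > DATEADD(day, -1, GETDATE())
          AND s.perm_table_name NOT LIKE 'Internal Worktable%'
        ORDER BY q.starttime
        """
    )
    for query_id, started, table_name, query_text in cur.fetchall():
        print(query_id, started, table_name, query_text[:80])

conn.close()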
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But to get true data lineage, you need to understand how queries relate to your workflows, e.g. to an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations/workflows, vs. looking at individual queries.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars