Matillion for Amazon Redshift support for job monitoring

Matillion for Amazon Redshift support for job monitoring - amazon-web-services

I am working on Amazon Matillion for Redshift and we have multiple jobs running daily triggered by SQS messages. Now I am checking the possibility of creating a UI dashboard for stakeholders which will monitor live progress of jobs and will show report of previous jobs, like Job name, tables impacted, job status/reason for failure etc. Does Matillion maintain this kind of information implicitly? Or I will have to maintain this information for each job.

Matillion has an API which you can use to obtain details of all task history. Information on the tasks API is here:
https://redshiftsupport.matillion.com/customer/en/portal/articles/2720083-loading-task-information?b_id=8915
You can use this to pull data on either currently running jobs or completed jobs down to component level including name of job, name of component, how long it took to run, whether it ran successfully or not and any applicable error message.
This information can be pulled into a Redshift table using the Matillion API profile which comes built into the product and the API Query component. You could then build your dashboard on top of this table. For further information I suggest you reach out to Matillion via their Support Center.

The API is helpful, but you can only pass a date as a parameter (this is for Matillion for Snowflake, assume it's the same for Redshift). I've requested the ability to pass a datetime so we can run the jobs throughout the day and not pull back the same set of records every time our API call runs.

Related

Bigquery BQ Alerts for Slots and Query concurrency

I'm trying to establish email alerts at a project level to send email alerts for when a certain number of query/job concurrency is reached e.g. 5 concurrent queries. We have a flat-rate pricing model.
I want to do a similar email notification when total slot Usage exceeds a certain threshold as well e.g. slot usage reaching 1000 slots
As a next step I would like to throttle new incoming queries based on the above mentioned thresholds. Meaning if there are already for example 5 queries actively running the 6th one will be put on hold until one of the 5 running earlier have completed.

You may create an Alert Policy in which you can set your desired metric type (eg. slots) and then configure your desired threshold similar to below.
In creating an Alert Policy you may also configure the notification channel to email notification which is also included on the same documentation.
For the available metric types for SLOTS in BigQuery, you may refer to this Google Cloud Metrics for BigQuery documentation.
For your next step, you may code (python, node.js, etc) using BigQuery API to count queries actively running (through JOB ID) and when the count hits 5, you may print "query queue is full" and then wait for the total JOBS to hit below 5 before running the next query. You may refer to this BigQuery Managing Jobs API Documentation.

Integration of BigQuery and Dialogflow

I'm getting started on Dialogflow and I would like to integrate it with BigQuery.
I have some tables in BigQuery with difference data, for instance a record of alarms that a wind turbine showed during time.
In one of my test cases, let's say I want my chatbot to tell me what alarms were raised in the wind turbine number 5 of my farm, on the 25th of October.
I have already created a chatbot in Dialogflow that asks for all the necessary parameters of the enquiry, such as the wind farm name, the wind turbine number, the date, and the name of the alarm.
My doubt now is how I can send those parameters to BigQuery in order to dig into my tables, extract the required information, and print it in Dialogflow.
I have been looking for documentation or tutorials but nothing came out that could fit my case...
Thanks in advance!

You need to implement a fulfillment. It triggers a webhook, for example a Cloud Functions or a Cloud Run service.
This webhook call contains the value gather by your intent and parameters. You have to extract them and perform your process, for example a call to BigQuery. Then format the response and display it on Dialogflow.

How to monitor if a BigQuery table contains current data and send an alert if not?

I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL-query is not a problem.
My question is how to best install such a monitoring in GCP? Can Stackdriver execute custom BigQuery SQL? Or would a CloudFunction be more suitable? An AppEngine application with a cronjob? What's the best practise?

Not sure what's the best practice here, but one simple solution is to use BigQuery scheduled query. Schedule query, make it fail is something is wrong using ERROR() function, configure scheduled query to notify (it sends email) if it fails.

Can we schedule StackDriver Logging to export log?

I am new to Google Cloud StackDriver Logging and as per this documentation StackDriver stores the Data Access audit logs for 30 days. Also mentioned on the same page, that Size of a log entry is limited to 100KB.
I am aware of the fact that the logs can be exported Google Cloud Storage using Cloud SDK as well as using Logging Libraries in many languages (we prefer Python).
I have two questions related to the exporting the logs, which are:
Is there any way in StackDriver to schedule something similar to a task or cronjob that keeps exporting the Logs in the Google Cloud storage automatically after a fixed interval of time?
What happens to the log entries which are larger than 100KB. I assume they get truncated. Is my assumption correct? If yes, is there any way in which we can export/view the full(which is not at all truncated) Log entry?

Is there any way in StackDriver to schedule something similar to a
task or cronjob that keeps exporting the Logs in the Google Cloud
storage automatically after a fixed interval of time?
Stackdriver supports exporting log data via sinks. There is no schedule that you can set as everything is automatic. Basically, the data is exported as soon as possible and you have no control over the amount exported at each sink or the delay between exports. I have never found this to be an issue. Logging, by design, is not to be used as a real-time system. The closest is to sink to PubSub which has a couple of second delay (based upon my experience).
There are two methods to export data from Stackdriver:
Create an export sink. Supported destinations are BigQuery, Cloud Storage and PubSub. The log entries will be written to the destination automatically. You can then use tools to process the exported entries. This is the recommended method.
Write your own code in Python, Java, etc. to read the log entries and do what you want with them. Scheduling is up to you. This method is manual and requires your management of schedule and destination.
What happens to the log entries which are larger than 100KB. I assume
they get truncated. Is my assumption correct? If yes, is there any way
in which we can export/view the full(which is not at all truncated)
Log entry?
Entries that exceed the max size of an entry cannot be written to Stackdriver. The API call that attempts to create the entry will fail with an error message similar to (Python error message):
400 Log entry with size 113.7K exceeds maximum size of 110.0K
This means that entries that are too large will be discarded unless the writer has logic to handle this case.

As per the documentation of stack driver logging the whole process is automatic. Export sink to google cloud storage is slower than the Bigquery and Cloud sub/pub. link for the documentation
I recently used the export sink to the big query, which is better than cloud pub/sub in case if you don't want to use other third-party application for log analysis. For Bigquery sink needs dataset where do you want to store the log entries. I noticed that sink create bigquery table on a timestamp basis in the bigquery dataset.
One more thing if you want to query timestamp partitioned tables check this link
Legacy SQL Functions and Operators

Google Dataflow job monitor

I am writing an app to monitor and view Google dataflow jobs.
To get the metadata about google dataflow jobs, I am exploring the REST APIs listed here :
https://developers.google.com/apis-explorer/#search/dataflow/dataflow/v1b3/
I was wondering if there are any APIs that could do the following :
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
2)Search or filter jobs on the basis of job name.Or for that matter, filtering of jobs of any other criteria apart from the job state.
3)Get log messages associated with a dataflow job
4)Get the records of "all" jobs, from the beginning of time. The current APIs seem to give records only of jobs in the last 30 days.
Any help would be greatly appreciated. Thank You

There is additional documentation about the Dataflow REST API at: https://cloud.google.com/dataflow/docs/reference/rest/
Addressing each of your questions separately:
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
No, there is no batch method for a list of jobs. You'll need to query them individually with projects.jobs.get.
2)Search or filter jobs on the basis of job name.Or for that matter, filtering of jobs of any other criteria apart from the job state.
The only other filter currently available is location.
3)Get log messages associated with a dataflow job
In Dataflow there are two types of log messages:
"Job Logs" are generated by the Dataflow service and provide high-level information about the overall job execution. These are available via the projects.jobs.messages.list API.
There are also "Worker Logs" written by the SDK and user code running in the pipeline. These are generated on the distributed VMs associated with a pipeline and ingested into Stackdriver. They can be queried via the Stackdriver Logging entries.list API by including in your filter:
resource.type="dataflow_step"
resource.labels.job_id="<YOUR JOB ID>"
4)Get the records of "all" jobs, from the beginning of time. The current APIs seem to give records only of jobs in the last 30 days.
Dataflow jobs are only retained by the service for 30 days. Older jobs are deleted and thus not available in the UI or APIs.

In our case we implemented such functionality by tracking the job stages and by using schedulers/cron jobs to report the details of running job in one file. This file withing 1 bucket is watched by our job which just gives all status to our application

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js