BigQuery Alerts for Slots and Query Concurrency

I'm trying to set up project-level email alerts that fire when a certain level of query/job concurrency is reached, e.g. 5 concurrent queries. We have a flat-rate pricing model.
I want a similar email notification when total slot usage exceeds a certain threshold, e.g. 1000 slots.
As a next step I would like to throttle new incoming queries based on the above-mentioned thresholds: if, for example, 5 queries are already running, the 6th one should be put on hold until one of the running queries has completed.

You may create an Alert Policy in which you set your desired metric type (e.g. slots) and then configure your desired threshold.
When creating an Alert Policy you can also set the notification channel to email; this is covered in the same documentation.
For the available slot-related metric types in BigQuery, refer to the Google Cloud Metrics for BigQuery documentation.
For your next step, you can write code (Python, Node.js, etc.) that uses the BigQuery API to count the queries that are actively running (via their job IDs); when the count hits 5, report that the query queue is full and wait for the number of running jobs to drop below 5 before submitting the next query. You may refer to the BigQuery Managing Jobs API documentation; a rough sketch follows.
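A minimal polling sketch of that idea, assuming the google-cloud-bigquery Python client and Application Default Credentials; the 5-job limit, the poll interval, and the example query are placeholders rather than anything from the question:

```python
import time

from google.cloud import bigquery

MAX_CONCURRENT_JOBS = 5   # assumed threshold from the question
POLL_SECONDS = 10         # assumed poll interval

client = bigquery.Client()  # uses the default project and credentials


def count_running_jobs() -> int:
    """Count jobs in the project that are currently in the RUNNING state."""
    return sum(1 for _ in client.list_jobs(state_filter="running"))


def submit_when_capacity(sql: str) -> bigquery.QueryJob:
    """Hold the query until fewer than MAX_CONCURRENT_JOBS jobs are running."""
    while count_running_jobs() >= MAX_CONCURRENT_JOBS:
        print("query queue is full")
        time.sleep(POLL_SECONDS)
    return client.query(sql)


job = submit_when_capacity("SELECT 1")  # placeholder query
print("submitted job:", job.job_id)
```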

Related

What is the best way to handle changes to the filter on a subscription in Google Cloud Pub/Sub?

Problem:
I know that a Google Pub/Sub subscription cannot be patched to update its filter. I am trying to figure out other ways to handle filter updates in production.
Approach I could come up with:
Change the existing subscription (old filter) from push to pull so that it accumulates messages
Create a new subscription with the latest filter
Transfer the messages from the old subscription to a topic using Dataflow
Detach the old subscription from the topic
Problems I see with the approach:
As both subscriptions exist at the same time for a while, I could end up processing duplicate messages
Any suggestions on the best way to handle this?
Timing is important here to minimize duplicates or message loss.
First, I would deploy a service (Cloud Run, for example) that saves the Pub/Sub messages as-is somewhere (a Cloud Storage file, BigQuery, Firestore, ...).
Then, at the same time, I would change the old subscription's push endpoint to point to that Cloud Run service, and create the new push subscription with the new filter.
Finally, detach the old subscription.
If your real app is able to detect messages that have already been processed, you can remove them from the save location (it's easier with BigQuery, for example) and then reprocess only the missing messages, either with Dataflow or manually; a sketch of the save service is shown below.
In any case, it's recommended to make your message processing idempotent. Keep in mind that Pub/Sub is at-least-once delivery, so even with a single subscription you can receive duplicates.
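A minimal sketch of that intermediate "save everything as-is" service, assuming a Flask app behind a Cloud Run push endpoint and a BigQuery table whose name and columns (message_id, publish_time, attributes, data) are made up for illustration:

```python
import base64
import json

from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()
TABLE_ID = "my-project.pubsub_backup.raw_messages"  # assumed table


@app.route("/", methods=["POST"])
def receive_push():
    # Pub/Sub push delivers a JSON envelope with the message inside.
    envelope = request.get_json()
    msg = envelope["message"]
    row = {
        "message_id": msg.get("messageId"),
        "publish_time": msg.get("publishTime"),
        "attributes": json.dumps(msg.get("attributes", {})),
        "data": base64.b64decode(msg.get("data", "")).decode("utf-8"),
    }
    errors = bq.insert_rows_json(TABLE_ID, [row])
    if errors:
        # A non-2xx response makes Pub/Sub retry the push later.
        return f"insert failed: {errors}", 500
    return "", 204
```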
As you note, a filter expression cannot be changed once a subscription has been created. To effectively change a filter, please do the following steps:
Create a snapshot on the existing subscription (subscription A).
Create a new subscription (subscription B) with the new filter expression.
Seek subscription B to the snapshot created in step 1.
Change your subscribers to use subscription B instead of A.
Please note that during the time subscribers are switching from subscription A to B, there will be a high rate of duplicates delivered as messages are acked independently on the two subscriptions.
If you need to minimize the duplicates processed and can afford downtime, you can stop all subscribing jobs/servers before step 1 and restart them configured to pull from the new subscription, B, after step 4. All steps must be completed well within the configured message_retention_duration to prevent message loss.
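The same snapshot/seek sequence can be scripted. A minimal sketch with the google-cloud-pubsub Python client; the project, topic, subscription, snapshot names and the filter expression are all placeholders:

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder
subscriber = pubsub_v1.SubscriberClient()

topic_path = f"projects/{project_id}/topics/my-topic"  # placeholder topic
old_sub = subscriber.subscription_path(project_id, "subscription-a")
new_sub = subscriber.subscription_path(project_id, "subscription-b")
snapshot_path = subscriber.snapshot_path(project_id, "filter-migration")

# 1. Snapshot subscription A so its backlog (and new messages) are retained.
subscriber.create_snapshot(
    request={"name": snapshot_path, "subscription": old_sub}
)

# 2. Create subscription B with the new filter expression.
subscriber.create_subscription(
    request={
        "name": new_sub,
        "topic": topic_path,
        "filter": 'attributes.event_type = "order"',  # placeholder filter
    }
)

# 3. Seek subscription B to the snapshot so it starts from A's backlog.
subscriber.seek(request={"subscription": new_sub, "snapshot": snapshot_path})

# 4. Switch subscribers over to subscription B.
```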

Can I aggregate data from a stream on AWS?

I have data coming from multiple machines and I would like to aggregate it by user. I'm thinking of producing batches of 1000 "rows", or 10 seconds of data (whichever comes first), per user.
I have some experience with AWS Kinesis and Lambda, but in my experience we don't have much control over how the aggregation is done. All machines would send the data through Kinesis, with the user id as the partition key. Then AWS calls our Lambda with small batches. This has been great for some other use cases, but here, if I receive 100 records I don't know what to do: I would like to "wait" to receive more, or wait until 10 seconds have elapsed since the timestamp of the first record.
Also, I'm not sure how the aggregation "by user id" would work. So far, in a Lambda, I would split the records in the batch by user id, but if I get called with a batch of 100 records, even though there is a partition key on the user id, there is no guarantee that those 100 records belong to one user. Maybe I will get 100 records from 100 different users, and there is no "aggregation" help at all.
Any idea whether Kinesis + Lambda is suited for this? I did look at the AWS documentation but I don't see my scenario covered. It looks like they also have a tool called "Data Streams", but it's hard for me to tell whether it would work for my case.
Thanks!
Your understanding is correct: AWS Lambda + Kinesis alone will not be sufficient for this kind of aggregation. The AWS Lambda programming model is stateless, so you can only aggregate over the batch of records received in that particular invocation (GetRecords API call). Furthermore, the batch size configured on the function does not guarantee that you will get that number of records; it is merely the maximum number of records (MaxRecords) you can get per invocation.
What you need is some kind of windowing mechanism, either row-based or time-based. Kinesis Analytics would be the easiest and fastest way to get on board with this. You can use either SQL or Flink with Kinesis Analytics, and you can even send the output to AWS Lambda for post-processing.
Another way would be to run a Spark Streaming job (for example on AWS EMR) and use windowing in your application; a sketch of the windowing idea is shown below.
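A minimal Spark Structured Streaming sketch of the time-based part of the windowing idea: group events by user_id over a 10-second tumbling window. The socket source, the column names (user_id, value, event_time), and the watermark are assumptions for illustration; in practice the source would be the Kinesis stream (via a Kinesis connector on EMR), and the "1000 rows, whichever comes first" part would need custom state (e.g. flatMapGroupsWithState in Spark, or a Flink trigger) on top of this.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, from_json, window
from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import IntegerType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("user-aggregation").getOrCreate()

# Assumed input: JSON lines like {"user_id": "u1", "value": 3, "event_time": "..."}
schema = (
    StructType()
    .add("user_id", StringType())
    .add("value", IntegerType())
    .add("event_time", TimestampType())
)

raw = (
    spark.readStream.format("socket")  # stand-in source for the sketch
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)
events = raw.select(from_json(col("value"), schema).alias("e")).select("e.*")

# Tumbling 10-second window per user; late data bounded by a 30-second watermark.
aggregated = (
    events.withWatermark("event_time", "30 seconds")
    .groupBy(window(col("event_time"), "10 seconds"), col("user_id"))
    .agg(count("*").alias("rows"), sum_("value").alias("total"))
)

query = aggregated.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```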

One or more points were written more frequently than the maximum sampling period configured for the metric

Background
I have a website deployed on multiple machines. I want to create a Google custom metric that reports its throughput: how many calls were served.
The idea was to collect information about served requests and write it to the custom metric once per minute. So, for each machine, this write happens at most once per minute, but the process runs on every machine in my cluster.
Running the code locally is working perfectly.
The problem
I'm getting this error:
Grpc.Core.RpcException: Status(StatusCode=InvalidArgument, Detail="One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/web/2xx, Timestamps: {Youngest Existing: '2019/09/28-23:58:59.000', New: '2019/09/28-23:59:02.000'}}: timeSeries[0]; One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/web/4xx, Timestamps: {Youngest Existing: '2019/09/28-23:58:59.000', New: '2019/09/28-23:59:02.000'}}: timeSeries[1]")
Then I read in the custom metric limitations that:
Rate at which data can be written to a single time series = one point per minute
I assumed Google Cloud custom metrics would handle the concurrency issues for me.
According to these limitations, the only option for implementing realtime monitoring is to add another application that collects the information from all machines and writes it into the custom metric. That sounds like too much work for such a common use case.
What am I missing?
Now that you add the machine name as a label on the metric, you get a separate time series per machine.
To SUM these metrics, go to Stackdriver > Metrics Explorer, group the metric by project id or by a label, for example, and then apply a SUM aggregation.
https://cloud.google.com/monitoring/charts/metrics-selector#alignment
You can save the chart in a custom dashboard.
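A minimal sketch of writing such a per-machine time series with the google-cloud-monitoring Python client, based on the standard custom-metric write sample; the project id, the "machine" label, and the example value are assumptions, and the label must exist on (or be auto-created with) the metric descriptor:

```python
import socket
import time

from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/web/2xx"
# One label value per machine gives each machine its own time series, so the
# one-point-per-minute limit applies per machine instead of per metric.
series.metric.labels["machine"] = socket.gethostname()
series.resource.type = "global"

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"int64_value": 42}}  # requests served this minute (example)
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```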

Google Dataflow job monitor

I am writing an app to monitor and view Google Dataflow jobs.
To get metadata about Google Dataflow jobs, I am exploring the REST APIs listed here:
https://developers.google.com/apis-explorer/#search/dataflow/dataflow/v1b3/
I was wondering if there are any APIs that could do the following :
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
2) Search or filter jobs by job name, or for that matter by any criteria other than job state
3) Get log messages associated with a Dataflow job
4) Get records of all jobs, from the beginning of time. The current APIs seem to return only jobs from the last 30 days.
Any help would be greatly appreciated. Thank you!
There is additional documentation about the Dataflow REST API at: https://cloud.google.com/dataflow/docs/reference/rest/
Addressing each of your questions separately:
1) Get the job details if we provide a list of job Ids (there is an API for one individual job ID, but I wanted the same for a list of Ids)
No, there is no batch method for a list of jobs. You'll need to query them individually with projects.jobs.get.
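A minimal sketch of that per-ID loop using the discovery-based google-api-python-client and Application Default Credentials; the project, region, and job IDs are placeholders:

```python
from googleapiclient.discovery import build

project_id = "my-project"   # placeholder
location = "us-central1"    # placeholder regional endpoint
job_ids = ["<JOB_ID_1>", "<JOB_ID_2>"]  # placeholder job IDs

dataflow = build("dataflow", "v1b3")

jobs = []
for job_id in job_ids:
    job = (
        dataflow.projects()
        .locations()
        .jobs()
        .get(projectId=project_id, location=location, jobId=job_id)
        .execute()
    )
    jobs.append(job)

for job in jobs:
    print(job["id"], job.get("name"), job.get("currentState"))
```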
2) Search or filter jobs by job name, or for that matter by any criteria other than job state
The only other filter currently available is location.
3) Get log messages associated with a Dataflow job
In Dataflow there are two types of log messages:
"Job Logs" are generated by the Dataflow service and provide high-level information about the overall job execution. These are available via the projects.jobs.messages.list API.
There are also "Worker Logs" written by the SDK and user code running in the pipeline. These are generated on the distributed VMs associated with a pipeline and ingested into Stackdriver. They can be queried via the Stackdriver Logging entries.list API by including in your filter:
resource.type="dataflow_step"
resource.labels.job_id="<YOUR JOB ID>"
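A minimal sketch of querying those worker logs with the google-cloud-logging Python client; the project id and job id are placeholders:

```python
from google.cloud import logging

project_id = "my-project"   # placeholder
job_id = "<YOUR JOB ID>"    # placeholder Dataflow job id

client = logging.Client(project=project_id)
log_filter = (
    'resource.type="dataflow_step" '
    f'resource.labels.job_id="{job_id}"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)
```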
4) Get records of all jobs, from the beginning of time. The current APIs seem to return only jobs from the last 30 days.
Dataflow jobs are only retained by the service for 30 days. Older jobs are deleted and thus not available in the UI or APIs.
In our case we implemented this functionality by tracking the job stages and using schedulers/cron jobs to report the details of running jobs into a single file. That file, stored in a bucket, is watched by a job that surfaces all the statuses to our application.

Matillion for Amazon Redshift support for job monitoring

I am working on Matillion for Amazon Redshift and we have multiple jobs running daily, triggered by SQS messages. I am now looking into creating a UI dashboard for stakeholders that will monitor the live progress of jobs and show a report of previous jobs: job name, tables impacted, job status/reason for failure, etc. Does Matillion maintain this kind of information implicitly, or will I have to maintain it for each job?
Matillion has an API which you can use to obtain details of all task history. Information on the tasks API is here:
https://redshiftsupport.matillion.com/customer/en/portal/articles/2720083-loading-task-information?b_id=8915
You can use this to pull data on either currently running jobs or completed jobs, down to component level, including the name of the job, the name of the component, how long it took to run, whether it ran successfully or not, and any applicable error message.
This information can be pulled into a Redshift table using the Matillion API profile, which comes built into the product, and the API Query component. You could then build your dashboard on top of this table. For further information I suggest you reach out to Matillion via their Support Center.
The API is helpful, but you can only pass a date as a parameter (this is for Matillion for Snowflake; I assume it's the same for Redshift). I've requested the ability to pass a datetime so we can run the jobs throughout the day and not pull back the same set of records every time our API call runs.
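For reference, a rough sketch of calling the task API directly from Python with requests; the host, credentials, and the endpoint path are placeholders and should be taken from the task API documentation linked above rather than from this sketch:

```python
import requests

BASE_URL = "https://matillion.example.com/rest/v1"  # placeholder host and API root
AUTH = ("api-user", "api-password")                 # placeholder credentials

# Placeholder endpoint; verify the exact path and parameters against the
# Matillion task API documentation for your product version.
ENDPOINT = "group/name/<GROUP>/project/name/<PROJECT>/task/running"


def fetch_tasks(endpoint: str) -> list:
    """GET a task API endpoint and return the parsed JSON payload."""
    response = requests.get(f"{BASE_URL}/{endpoint}", auth=AUTH)
    response.raise_for_status()
    return response.json()


for task in fetch_tasks(ENDPOINT):
    print(task)
```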