WARNING: Failed to add policy job since the add condition is not satisfied - google-cloud-platform

I'm trying to schedule automatic recommendation and population by following this doc.
I'm trying to run this query
SELECT google_columnar_engine_add_policy( 'RECOMMEND_AND_POPULATE_COLUMNS', 'EVERY', 10, 'HOURS');
But this query fails. I've tried many other combinations of policy_interval, duration, time_unit, and it fails with the same error every time.
The only case that works is when policy_interval is 'IMMEDIATE', but that is not what I'm after.

The basic steps to follow for the configuration and usage are as below:
1. Enable the columnar engine.
2. Let the engine's recommendation feature observe your workload and gather query statistics.
3. Size the engine's column store based on the recommendation feature's analysis.
4. Enable automatic population of the column store by the recommendation feature.
5. Let the recommendation feature observe your workload and automatically add columns to the column store.
The query that you are trying to run is the one used to schedule automatic recommendation and population:
google_columnar_engine_add_policy(
  'RECOMMEND_AND_POPULATE_COLUMNS',
  policy_interval, duration, time_unit
);
policy_interval: The time interval determining when the policy runs. You can specify these values:
'IMMEDIATE': The RECOMMEND_AND_POPULATE_COLUMNS operation runs immediately one time. When you use this value, specify 0 and 'HOURS' for the duration and time_unit parameters.
'AFTER': The RECOMMEND_AND_POPULATE_COLUMNS operation runs once when the duration time_unit amount of time passes.
'EVERY': The RECOMMEND_AND_POPULATE_COLUMNS operation runs repeatedly every duration time_unit amount of time.
duration: The number of time_units. For example, 24.
time_unit: The unit of time for duration. You can specify 'DAYS' or 'HOURS'.
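For reference, here is a minimal sketch of the three documented forms, run from an ordinary PostgreSQL client since AlloyDB is PostgreSQL-compatible. The psycopg2 usage and all connection details are placeholders, not part of the original setup:

import psycopg2  # AlloyDB speaks the PostgreSQL wire protocol

# Placeholder connection details -- replace with your instance's values.
conn = psycopg2.connect(host="10.0.0.2", dbname="postgres", user="postgres", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    # Run once, immediately (duration must be 0 and time_unit 'HOURS'):
    cur.execute("SELECT google_columnar_engine_add_policy('RECOMMEND_AND_POPULATE_COLUMNS', 'IMMEDIATE', 0, 'HOURS')")
    # Run once, 24 hours from now:
    cur.execute("SELECT google_columnar_engine_add_policy('RECOMMEND_AND_POPULATE_COLUMNS', 'AFTER', 24, 'HOURS')")
    # Run repeatedly, every 10 hours:
    cur.execute("SELECT google_columnar_engine_add_policy('RECOMMEND_AND_POPULATE_COLUMNS', 'EVERY', 10, 'HOURS')")

If the 'EVERY' and 'AFTER' forms still fail with "add condition is not satisfied", that suggests the prerequisites above (columnar engine enabled and the recommendation feature active) rather than the parameter values.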
Please check that the setup and configuration above were followed end to end and try again. Also, as you mentioned, the specific error messages are not available to you, which makes it hard to pinpoint where it breaks. I would recommend the links below for reference.
https://cloud.google.com/alloydb/docs
https://cloud.google.com/alloydb/docs/faq
Hope that helps.

Related

Calculate next_run at every run for a Schedule Object

I have got a question about django-q, where I could not find any answers in its documentation.
Question: Is it possible to calculate the next_run at every run at the end?
The reason behind it: the q cluster does not handle local times with DST (daylight saving time).
As an example:
A schedule should run at 6am German time.
For summer time: The schedule should be executed at 4am (UTC).
For winter time: The schedule should be executed at 5am (UTC).
To fix that I wrote custom logic for the next run. This logic is taking place in the custom function.
I tried to retrieve the schedule object in the custom function and set the next_run there.
The problem here is: if I put the next_run logic before the second section "other calculations", it does work, but if I place it after the second section "other calculations", it does not work. The other calculations are not related to the schedule object.
from django_q.models import Schedule

def custom_function(**kwargs):
    # 1. some calculations not related to the schedule object
    # putting the next_run logic here does work
    related_schedule = Schedule.objects.get(id=kwargs["schedule_id"])
    related_schedule.next_run = ...  # custom DST-aware calculation
    related_schedule.save()
    # 2. some other calculations not related to the schedule object
    # putting the next_run logic here does not work
That is very random behaviour which I cannot explain.
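One way around the DST problem is to compute next_run from the local wall-clock time and convert it to UTC just before saving. This is only a rough sketch, assuming Python 3.9+ (zoneinfo) and a fixed 6am Europe/Berlin schedule; the function name is made up for illustration:

from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

BERLIN = ZoneInfo("Europe/Berlin")
UTC = ZoneInfo("UTC")

def next_6am_berlin_in_utc(now_utc: datetime) -> datetime:
    """Return the next 06:00 Europe/Berlin occurrence, expressed in UTC."""
    now_local = now_utc.astimezone(BERLIN)
    candidate = datetime.combine(now_local.date(), time(6, 0), tzinfo=BERLIN)
    if candidate <= now_local:
        # 6am has already passed today; schedule for tomorrow (DST handled by zoneinfo).
        candidate = datetime.combine(now_local.date() + timedelta(days=1), time(6, 0), tzinfo=BERLIN)
    return candidate.astimezone(UTC)

Inside custom_function you would then set related_schedule.next_run = next_6am_berlin_in_utc(timezone.now()) (using django.utils.timezone) and save, which gives 4am UTC in summer and 5am UTC in winter.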

How to Decrease Query Compile Time in Redshift

I have seen that the first execution of a query takes a long time, but the second execution is much faster; it seems the query compile time is high on the first run. Can we do anything here to improve compile-time performance?
Scenario:
enable_result_cache_for_session is off
We have an SLA of 15 seconds for a specific query, but the first run takes 33 seconds to compile and execute, which misses the SLA; subsequent runs take 10 seconds, which meets it.
Q: How do I tune this part? How do I make sure this does not happen?
Do we have any database configuration parameter for the same?
The title of the question says compile time but I understand that you are interested in improving the execution time, right?
John Rotenstein's comment definitely makes sense: to improve Redshift query execution time you need to understand the Redshift architecture and how to distribute your data in the best way possible.
You will need to understand DISTKEY and SORTKEY.
Useful links
Redshift Architecture
https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
https://medium.com/@dpazetojr/redshift-architecture-basics-4aae5068b8e3
Redshift Distribution Styles
https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
https://medium.com/@dpazetojr/redshift-distkey-and-sortkey-d247b01b01f6
UPDATE 1:
In order to tune a query and to know how and when to use DISTKEY and SORTKEY, you can start by running the EXPLAIN command on the query and, based on the plan, act more precisely.
https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
https://dev.to/ronsoak/the-r-a-g-redshift-analyst-guide-understanding-the-query-plan-explain-360d
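To make the DISTKEY/SORTKEY and EXPLAIN advice concrete, here is a rough sketch using the redshift_connector driver; the cluster endpoint, credentials, table, and query are all hypothetical, not taken from the question:

import redshift_connector  # AWS's Python driver for Redshift

# Placeholder connection details.
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
conn.autocommit = True
cursor = conn.cursor()

# Hypothetical table with an explicit distribution key and sort key:
cursor.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP
    )
    DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id
    SORTKEY (order_ts)      -- prune blocks for time-range filters
""")

# EXPLAIN shows whether joins need redistribution (DS_DIST_* steps) and whether
# the sort key is being used, which is where DISTKEY/SORTKEY tuning starts.
cursor.execute("EXPLAIN SELECT customer_id, COUNT(*) FROM orders WHERE order_ts >= '2023-01-01' GROUP BY customer_id")
for row in cursor.fetchall():
    print(row[0])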

Amazon Sagemaker Groundtruth: Cannot get active learning to work

I am trying to test Sagemaker Groundtruth's active learning capability, but cannot figure out how to get the auto-labeling part to work. I started a previous labeling job with an initial model that I had to create manually. This allowed me to retrieve the model's ARN as a starting point for the next job. I uploaded 1,758 dataset objects and labeled 40 of them. I assumed the auto-labeling would take it from here, but the job in Sagemaker just says "complete" and is only displaying the labels that I created. How do I make the auto-labeler work?
Do I have to manually label 1,000 dataset objects before it can start working? I saw this post: Information regarding Amazon Sagemaker groundtruth, where the representative said that some of the 1,000 objects can be auto-labeled, but how is that possible if it needs 1,000 objects to start auto-labeling?
Thanks in advance.
I'm an engineer at AWS. In order to understand the "active learning"/"automated data labeling" feature, it will be helpful to start with a broader recap of how SageMaker Ground Truth works.
First, let's consider the workflow without the active learning feature. Recall that Ground Truth annotates data in batches [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-batching.html]. This means that your dataset is submitted for annotation in "chunks." The size of these batches is controlled by the API parameter MaxConcurrentTaskCount [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount]. This parameter has a default value of 1,000. You cannot control this value when you use the AWS console, so the default value will be used unless you alter it by submitting your job via the API instead of the console.
Now, let's consider how active learning fits into this workflow. Active learning runs in between your batches of manual annotation. Another important detail is that Ground Truth will partition your dataset into a validation set and an unlabeled set. For datasets smaller than 5,000 objects, the validation set will be 20% of your total dataset; for datasets larger than 5,000 objects, the validation set will be 10% of your total dataset. Once the validation set is collected, any data that is subsequently annotated manually constitutes the training set. The collection of the validation set and training set proceeds according to the batch-wise process described in the previous paragraph. A longer discussion of active learning is available in [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html].
That last paragraph was a bit of a mouthful, so I'll provide an example using the numbers you gave.
Example #1
Default MaxConcurrentTaskCount ("batch size") of 1,000
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch #1: Annotate 351 objects to populate the validation set (1,407 remaining).
Batch #2: Annotate 1,000 objects to populate the first iteration of the training set (407 remaining).
Batch #3: Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 407 objects.
Batch #4: (Assuming no objects were automatically labeled in batch #3) Annotate the remaining 407 objects. End of labeling job.
Example #2
Non-default MaxConcurrentTaskCount ("batch size") of 250
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch #1: Annotate 250 objects to begin populating the validation set (1,508 remaining).
Batch #2: Annotate 101 objects to finish populating the validation set (1,407 remaining).
Batch #3: Annotate 250 objects to populate the first iteration of the training set (1,157 remaining).
Batch #4: Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 1,157 objects. All else being equal, we would expect the model to be less accurate than the model in Example #1 at this stage, because our training set is only 250 objects here.
Batches #5 and onward: Repeat alternating steps of annotating batches of 250 objects and running active learning.
Hopefully these examples illustrate the workflow and help you understand the process a little better. Since your dataset consists of 1,758 objects, the upper bound on the number of automated labels that can be supplied is 407 objects (assuming you use the default MaxConcurrentTaskCount).
Ultimately, 1,758 objects is still a relatively small dataset. We typically recommend at least 5,000 objects to see meaningful results [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html]. Without knowing any other details of your labeling job, it's difficult to gauge why your job didn't result in more automated annotations. A useful starting point might be to inspect the annotations you received, and to determine the quality of the model that was trained during the Ground Truth labeling job.
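Since the MaxConcurrentTaskCount parameter mentioned earlier can only be set by submitting the job through the API rather than the console, here is a rough boto3 sketch of where it goes. It is not a complete, validated job definition: every name, ARN, and S3 path below is a placeholder, and the AWS-provided algorithm and Lambda ARNs vary by region (see the Ground Truth documentation for your region's values):

import boto3

sm = boto3.client("sagemaker")

sm.create_labeling_job(
    LabelingJobName="my-active-learning-job",
    LabelAttributeName="label",
    RoleArn="arn:aws:iam::123456789012:role/MyGroundTruthRole",  # placeholder
    InputConfig={
        "DataSource": {"S3DataSource": {"ManifestS3Uri": "s3://my-bucket/input.manifest"}}
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/output/"},
    LabelCategoryConfigS3Uri="s3://my-bucket/label-categories.json",
    # Including this block is what turns on automated data labeling (active learning).
    LabelingJobAlgorithmsConfig={
        "LabelingJobAlgorithmSpecificationArn": (
            "arn:aws:sagemaker:us-east-1:123456789012:labeling-job-algorithm-specification/image-classification"
        )
    },
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:workteam/private-crowd/my-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PRE-ImageMultiClass",
        "TaskTitle": "Classify images",
        "TaskDescription": "Choose the best label for each image",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 300,
        "MaxConcurrentTaskCount": 250,  # the "batch size" discussed above
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:ACS-ImageMultiClass"
        },
    },
)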
Best regards from AWS!

Airflow: how to get a response from BigQuery output for data availability and, based on the result, kick off tasks/subdags

The requirement is to kick off a DAG based on data availability in upstream/dependent tables.
A while-style condition should check data availability in the BigQuery tables for n iterations: if data is available, kick off the subdag/task, otherwise continue looping.
It would be great to see a clear example of how to use BigQueryOperator or BigQueryValueCheckOperator and then execute a BigQuery query something like this:
{Code}
SELECT 1
FROM
WHERE datetime BETWEEN TIMESTAMP(CURRENT_DATE())
  AND TIMESTAMP(DATE_ADD(CURRENT_DATE(), 1, 'day'))
LIMIT 1
{Code}
If the query output is 1 (meaning data is available for today's load), then kick off the DAG; otherwise, continue in the loop as shown in the attached diagram link.
Has anyone set up such a design in an Airflow DAG?
You may check the BaseSensorOperator and BigQueryTableSensor to implement your own Sensor for it. https://airflow.incubator.apache.org/_modules/airflow/operators/sensors.html
Sensor operators keep executing at a time interval and succeed when a
criteria is met and fail if and when they time out.
BigQueryTableSensor just checks whether the table exists or not, but does not check the data in the table. It might be something like this:
task1>>YourSensor>>task2
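For illustration, here is a rough sketch of such a sensor, assuming Airflow 2 with the Google provider package installed; import paths differ in older versions, and the SQL string is a placeholder standing in for the data-availability query above:

from airflow.sensors.base import BaseSensorOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


class BigQueryDataAvailabilitySensor(BaseSensorOperator):
    """Succeeds once the given query returns at least one row."""

    def __init__(self, sql, gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.gcp_conn_id = gcp_conn_id

    def poke(self, context):
        hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
        records = hook.get_records(self.sql)
        return bool(records)  # True lets downstream tasks run


# Inside your DAG definition:
wait_for_data = BigQueryDataAvailabilitySensor(
    task_id="wait_for_data",
    sql="SELECT 1 FROM `project.dataset.table` WHERE ... LIMIT 1",  # placeholder query
    poke_interval=10 * 60,   # re-check every 10 minutes
    timeout=6 * 60 * 60,     # give up after 6 hours
)

The sensor then sits between the tasks exactly as in the line above, with poke_interval and timeout covering the "n iterations" requirement.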

How to find resource intensive and time consuming queries in WX2?

Is there a way to find the resource intensive and time consuming queries in WX2?
I tried to check the SYS.IPE_COMMAND and SYS.IPE_TRANSACTION tables, but they were of no help.
The best way to identify such queries when they are still running is to connect as SYS with Kognitio Console and use Tools | Identify Problem Queries. This runs a number of queries against Kognitio virtual tables to understand how long current queries have been running, how much RAM they are using, etc. The most intensive queries are at the top of the list, ranked by the final column, "Relative Severity".
For queries which ran in the past, you can look in IPE_COMMAND to see duration but only for non-SELECT queries - this is because SELECT queries default to only logging the DECLARE CURSOR statement, which basically just measures compile time rather than run time. To see details for SELECT queries you should join to IPE_TRANSACTION to find the start and end time for the transaction.
For non-SELECT queries, IPE_COMMAND contains a breakdown of the time taken in a number of columns (all times in ms):
SM_TIME shows the compile time
TM_TIME shows the interpreter time
QUEUE_TIME shows the time the query was queued
TOTAL_TIME aggregates the above information
If it is for historic view image commands as mentioned in the comments, you can query
SELECT ... FROM SYS.IPE_COMMAND WHERE COMMAND IMATCHING 'create view image' AND TOTAL_TIME > 300000
If it is for currently running commands you can look in SYS.IPE_CURTRANS and join to IPE_TRANSACTION to find the start time of the transaction (assuming your CVI runs in its own transaction - if not, you will need to look in IPE_COMMAND to find when the last statement in this TNO completed and use that as the start time)
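If you want to pull the historic numbers programmatically rather than through Console, a minimal sketch over ODBC is shown below; pyodbc and the DSN name are assumptions, while the columns come from the timing list above (all times in ms):

import pyodbc  # assumes an ODBC DSN configured for the Kognitio server

conn = pyodbc.connect("DSN=kognitio;UID=SYS;PWD=...")  # placeholder DSN and credentials
cursor = conn.cursor()

# Historic 'create view image' commands that took longer than 5 minutes (300000 ms).
cursor.execute("""
    SELECT COMMAND, SM_TIME, TM_TIME, QUEUE_TIME, TOTAL_TIME
    FROM SYS.IPE_COMMAND
    WHERE COMMAND IMATCHING 'create view image'
      AND TOTAL_TIME > 300000
    ORDER BY TOTAL_TIME DESC
""")
for row in cursor.fetchall():
    print(row)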