I am trying to schedule a query with the Google Cloud Platform query scheduler, but whenever I schedule one, it rarely executes. What I did, in steps:
1) Created a dataset with location US
2) Created a table in same location
3) Wrote a query
4) Scheduled the query. To test it, I set the start time to 3 minutes in the future (not a cron expression, just a scheduled start time)
In the end, only about 1 in 10 runs executes as scheduled. The rest do not even start, so there is no error I could log. Please advise.
More clarification may be needed, but I see two possible situations:
1) You expected it to run at the exact "Scheduled start time", but it does not work like that: it runs according to the schedule you set in the "Repeats" dropdown. You can verify the exact scheduled time by going to "Scheduled queries" and checking "Next Scheduled".
2) If you set a custom schedule, did you consider that the time is in UTC? You can also check the "Next Scheduled" time to see what it corresponds to in your local time.
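For instance, a quick way to double-check what a UTC "Next Scheduled" time means locally (a minimal sketch; the example timestamp and timezone name are placeholders, substitute your own):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# "Next Scheduled" time as shown in the UI, which is UTC (example value)
next_scheduled_utc = datetime(2021, 1, 20, 14, 0, tzinfo=timezone.utc)

# Convert to a local timezone ("Asia/Kolkata" is just an example)
print(next_scheduled_utc.astimezone(ZoneInfo("Asia/Kolkata")))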
When a query job is executed from the bq command-line tool with the --batch option, a single statement gets BATCH priority. But for a set of statements, the parent SCRIPT job is assigned BATCH while the individual statements are assigned INTERACTIVE priority. The same happens with a CALL to a stored procedure.
The priorities were observed in the INFORMATION_SCHEMA.JOBS view. The same behavior happens through the Python API as well.
When a parent script job runs with BATCH priority, shouldn't the child jobs get BATCH priority as well? I did not find anything in the documentation that explains this. Maybe there is a reason for this.
Steps to reproduce:
bq query --batch --use_legacy_sql=False "select current_timestamp();"
-- This produces one entry in INFORMATION_SCHEMA.JOBS: QUERY/SELECT/BATCH
bq query --batch --use_legacy_sql=False "select current_timestamp();select current_timestamp();"
-- This produces 3 entries: the parent SCRIPT job is assigned BATCH, but the two child SELECT jobs get INTERACTIVE.
Note: without the --batch flag, all three entries in JOBS are INTERACTIVE.
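For reference, the priorities can be inspected from Python too (a minimal sketch assuming the google-cloud-bigquery client and jobs in the US region; adjust the region qualifier to yours):

from google.cloud import bigquery

client = bigquery.Client()

# List recent jobs with their parent job and priority, mirroring the
# INFORMATION_SCHEMA.JOBS observation above.
sql = """
SELECT job_id, parent_job_id, job_type, statement_type, priority
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY creation_time DESC
"""
for row in client.query(sql).result():
    print(row.job_id, row.parent_job_id, row.job_type, row.statement_type, row.priority)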
It is possible to get INTERACTIVE job priority even if your query was submitted with BATCH priority. If a batch query has not started within 24 hours, BigQuery changes its priority to INTERACTIVE, which makes the query eligible to be executed as soon as possible. BATCH and INTERACTIVE queries use the same resources.
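For completeness, this is roughly how a BATCH-priority query is submitted through the Python client, where the same parent/child behavior can be observed (a minimal sketch):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

# A multi-statement script: the parent SCRIPT job gets BATCH, but the
# child statements may still show INTERACTIVE, as described above.
job = client.query(
    "SELECT CURRENT_TIMESTAMP(); SELECT CURRENT_TIMESTAMP();",
    job_config=job_config,
)
job.result()
print(job.job_id, job.priority)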
You can go to this link for reference.
I am working on a use case where I need to trigger a DAG when records are inserted into a BigQuery table.
I am using Eventarc and listening for the insertJob event it provides for BigQuery.
It is working almost fine, but I get 2 events whenever I insert records. An event is also generated when I merely query the table, and the DAG gets triggered twice.
This is my Eventarc setting:
Your Eventarc configuration works well. When you perform a manual query in the UI, you get at least 2 insertJob entries.
Let's have a deeper look:
You get a first insertJob event, then a second one. Focus your attention on the last lines of the payloads: you can see a "dryRun" attribute.
Indeed, in the UI, a first dry-run query is performed to validate the query and to get the bytes-billed value (the volume of data the query will process, displayed in the upper right corner).
Therefore 2 insert jobs: one with dry run, one without (the real query execution)
That being said, you have to check in your Cloud Function whether the dry-run attribute is set in the event body; a sketch follows below.
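A hedged sketch of such a check (the exact location of the dry-run flag varies with the audit-log payload format, so this walks the whole payload rather than assuming a path; the handler name is illustrative):

import functions_framework


def contains_dry_run(payload):
    # Recursively look for a truthy "dryRun" key anywhere in the payload,
    # since its exact location depends on the audit-log format version.
    if isinstance(payload, dict):
        if payload.get("dryRun"):
            return True
        return any(contains_dry_run(v) for v in payload.values())
    if isinstance(payload, list):
        return any(contains_dry_run(v) for v in payload)
    return False


@functions_framework.cloud_event
def handle_insert_job(cloud_event):
    if contains_dry_run(cloud_event.data):
        print("Skipping dry-run insertJob event")
        return
    # ... trigger the DAG here ...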
Is there a way to add an expiry date to a Huey dynamic periodic task?
Just like the option on a Celery task to add an expiry date while creating it: some_celery_task.apply_async(args=('foo',), expires=expiry_date)
I want to add the expiry date while creating the Huey dynamic periodic task. I used "revoke"; it worked as it is supposed to, but I want to stop the task completely after the expiry date, not just revoke it. When a Huey dynamic periodic task is revoked, a message is displayed in the Huey terminal saying the function is revoked (every time the crontab condition becomes true).
(I am using Huey in Django.)
(Extra)
What I did to meet the need for an expiry date:
I created a function that returns day-month pairs for crontab:
For example, with start date = 2021-01-20 and end date = 2021-06-14, the function returns Days_Month = [['20-31', '1'], ['*', '2-5'], ['1-14', '6']].
Then I create the Huey dynamic periodic task once per pair (three times in this case); a sketch follows below.
(The Days_Month function returns day-month pairs per the requirement: daily, weekly, monthly, or repeating every n days.)
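A minimal sketch of that helper plus the dynamic registration (the helper and task names are mine, and it only handles ranges within a single calendar year):

from datetime import date

from huey import crontab
from huey.contrib.djhuey import HUEY as huey


def days_month_pairs(start, end):
    # Split [start, end] into (day_range, month_range) crontab pairs.
    if start.month == end.month:
        return [(f"{start.day}-{end.day}", str(start.month))]
    pairs = [(f"{start.day}-31", str(start.month))]
    if end.month - start.month > 1:
        pairs.append(("*", f"{start.month + 1}-{end.month - 1}"))
    pairs.append((f"1-{end.day}", str(end.month)))
    return pairs


def my_task():
    ...  # the work to run on each matching day


# e.g. [('20-31', '1'), ('*', '2-5'), ('1-14', '6')]
for day, month in days_month_pairs(date(2021, 1, 20), date(2021, 6, 14)):
    # A distinct name per registration avoids task-name collisions.
    huey.periodic_task(crontab(day=day, month=month),
                       name=f"my_task_{month}_{day}")(my_task)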
Is there a better way to do this?
Thank you for the help.
The best solution depends on how often you need periodic tasks with a specific end date, but the ideal approach probably involves your database.
I would create a database model (let's call it Job) with fields for your end_date, a next_execution_date and a field that indicates the interval between repetitions (like x days).
You would then create a periodic task with huey that runs every day (or even every hour/minute if you need finer grain of control). Every time this periodic task runs you would then go over all your Job instances and check whether their next_execution_date is in the past. If so, launch a new huey task that actually executes the functionality you need to have periodically executed per Job instance. On success, you calculate the new next_execution_date using the interval.
So whenever you want a new Job with a new end_date, you can just create this in the django admin (or make an interface for it) and you would set the next_execution_date as the first date where you want it to execute.
Your final solution would thus have the Job model and two huey decorated functions. One for the periodic task that merely checks whether Job instances need to be executed and updates their next_execution_date and another one that actually executes the periodic functionality per Job instance. This way you don't have to do any manual cancelling and you only need 1 periodic task that just runs indefinitely but doesn't actually execute anything if there are no Job instances that need to be run.
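A minimal sketch of that setup, assuming huey.contrib.djhuey (model and task names are illustrative):

from datetime import timedelta

from django.db import models
from django.utils import timezone
from huey import crontab
from huey.contrib.djhuey import db_periodic_task, db_task


class Job(models.Model):
    end_date = models.DateTimeField()
    next_execution_date = models.DateTimeField()
    interval_days = models.PositiveIntegerField(default=1)


@db_task()
def execute_job(job_id):
    job = Job.objects.get(pk=job_id)
    # ... the actual periodic work for this Job goes here ...
    # On success, push the next execution forward by the interval.
    job.next_execution_date += timedelta(days=job.interval_days)
    job.save(update_fields=["next_execution_date"])


@db_periodic_task(crontab(minute="*/15"))  # the checking task; tighten as needed
def dispatch_due_jobs():
    now = timezone.now()
    for job in Job.objects.filter(next_execution_date__lte=now, end_date__gte=now):
        execute_job(job.pk)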
Note: this will only be a reasonable approach if you have multiple of these tasks and you potentially want to control the end_dates in your interface.
ETL_JOB_ID:- There will be one ID per workflow; e.g., WF_X_STG will always have 1 and WF_Y_STG will always have 2.
ETL_JOB_NAME:- This will be the workflow name.
ETL_LOAD_PROCESS_NUMBER:- This will be the batch number, i.e., the number of executions on a given day. If the workflow runs every hour, there will be 24 entries per day.
JOB_RUNSTART_TS:- This will be the session start time.
JOB_RUN_STATUS:- This will be STARTED or IN_PROGRESS initially and later updated to COMPLETED or FAILED based on the outcome.
JOB_RUNEND_TS:- This will be NULL initially and later updated to the current time, which is the end time of the run.
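For illustration, the audit table described above might look like this (a hedged sketch; the table name, column types, and the sqlite3 stand-in driver are all assumptions, adjust to your actual database):

import sqlite3  # stand-in; use the driver matching your warehouse

DDL = """
CREATE TABLE ETL_JOB_AUDIT (
    ETL_JOB_ID              INTEGER,
    ETL_JOB_NAME            VARCHAR(100),
    ETL_LOAD_PROCESS_NUMBER INTEGER,
    JOB_RUNSTART_TS         TIMESTAMP,
    JOB_RUN_STATUS          VARCHAR(20),  -- STARTED / IN_PROGRESS / COMPLETED / FAILED
    JOB_RUNEND_TS           TIMESTAMP     -- NULL until the run finishes
)
"""

sqlite3.connect(":memory:").execute(DDL)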
Check this framework out: http://powercenternotes.blogspot.com/2014/01/an-etl-framework-for-operational.html
Or just use the repository metadata to fetch what you need directly from the Informatica database.
We have sysdig running on our WSO2 API gateway machine, and we notice that it fires a large number of SQL queries at the database for a minute, then waits a minute and repeats.
Every minute it goes wild, waits for a minute, and goes wild again with a query of the following format:
SELECT REG_PATH, REG_USER_ID, REG_LOGGED_TIME, REG_ACTION, REG_ACTION_DATA
FROM REG_LOG
WHERE REG_LOGGED_TIME>'2016-02-29 09:57:54'
AND REG_LOGGED_TIME<'2016-03-02 11:43:59.959' AND REG_TENANT_ID=-1234
There is no load on the server. What is causing this? What can we do to avoid this?
This particular query is the result of the registry indexing task that runs in the background. The REG_LOG table is queried periodically to retrieve the latest registry actions. The indexing task cannot be stopped, but you can configure its frequency through the following parameter in registry.xml. See [1] for more information.
indexingFrequencyInSeconds
If this table has filled up, you can clean the data with a simple SQL query. However, when deleting records, be careful not to delete all the data: the latest record for each resource path should be left in the REG_LOG table, since reindexing requires at least one reference to each resource path.
Also, if required, before clearing the REG_LOG table you can take a dump of the data, in case you do not want to lose old records. Hope this answer provides the information you require.
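A hedged sketch of such a cleanup, keeping only the newest row per resource path and tenant (the DSN is illustrative; verify the SQL against your registry database and take a backup first):

import pyodbc  # or the driver matching your registry database

CLEANUP_SQL = """
DELETE FROM REG_LOG
WHERE REG_LOGGED_TIME < (
    SELECT MAX(l2.REG_LOGGED_TIME)
    FROM (SELECT REG_PATH, REG_TENANT_ID, REG_LOGGED_TIME FROM REG_LOG) l2
    WHERE l2.REG_PATH = REG_LOG.REG_PATH
      AND l2.REG_TENANT_ID = REG_LOG.REG_TENANT_ID
)
"""

conn = pyodbc.connect("DSN=wso2regdb")  # illustrative DSN
conn.execute(CLEANUP_SQL)
conn.commit()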
[1] - https://docs.wso2.com/display/Governance510/Configuration+for+Indexing