(Django) RQ scheduler - Jobs disappearing from queue

Since my project has so many moving parts, it's probably best to explain the symptom.
I have 1 scheduler running on 1 queue. I add scheduled jobs (to be executed within seconds of the scheduling).
I keep repeating the scheduling of jobs with NO rq worker doing anything (in fact, the worker process is completely off). In other words, the queue should just be piling up.
But all of a sudden, the queue gets chopped off (seemingly at random) and the first 70-80% of jobs just disappear.
Does this have anything to do with:
the "max length" of the queue? (I don't recall setting any limits)
the scheduler automatically "discarding" jobs whose start time is BEFORE the current time?

I ran my own experiment: RQ scheduler does indeed remove jobs whose start date < now.
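For reference, a minimal sketch of that kind of experiment with rq-scheduler (the queue name and job function are illustrative):

# Sketch only; assumes a running rq-scheduler process and a local Redis.
from datetime import datetime, timedelta

from redis import Redis
from rq_scheduler import Scheduler

def dummy_job():
    pass

scheduler = Scheduler(queue_name='default', connection=Redis())

# One job slightly in the future, one whose start time is already in the past.
scheduler.enqueue_at(datetime.utcnow() + timedelta(seconds=30), dummy_job)
scheduler.enqueue_at(datetime.utcnow() - timedelta(seconds=30), dummy_job)

# With no worker running, inspect what the scheduler still holds;
# jobs whose start time has passed are moved out of the scheduled set.
print(list(scheduler.get_jobs(with_times=True)))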

Related

Reusing a database record created by a Celery task

There is a task which creates a database record (R) when it runs for the first time. When the task is started a second time it should read the database record, perform some calculations and call an external API. The first and second starts happen in a loop.
In the case of a single start of the task there are no problems, but in the case of loops (at each iteration of the loop a new task is created and starts at a certain time) there is a problem. In the task queue (which we monitor with Flower) we have a crashed task on every second iteration.
If we add time.sleep(1) at the end of the loop, sometimes the tasks work properly, but sometimes they don't. How can we avoid this problem? We are afraid that tasks for a different combination of two users started at the same time will also crash.
Is there some problem with running tasks in Celery simultaneously? Or is there something we should consider? The tasks are for scheduled payments, so they have to work rock solid.
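To make the scenario concrete, here is a hypothetical sketch of the pattern being described (model, field and function names are all illustrative, not the asker's actual code):

# Hypothetical sketch of the described setup; all names are illustrative.
from celery import shared_task
from myapp.models import PaymentRecord  # assumed model

@shared_task
def payment_task(user_a_id, user_b_id):
    record = PaymentRecord.objects.filter(
        user_a_id=user_a_id, user_b_id=user_b_id
    ).first()
    if record is None:
        # first start: create the record (R)
        PaymentRecord.objects.create(user_a_id=user_a_id, user_b_id=user_b_id)
        return
    # second start: read the record, do calculations, call the external API
    result = record.amount  # placeholder for the real calculation
    # call_external_api(result)  # external call elided

# The loop described above, scheduling a new task per iteration:
# for run_at in scheduled_times:
#     payment_task.apply_async(args=[user_a.id, user_b.id], eta=run_at)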

How does a parallel multi-instance loop work in Camunda 7.16.6?

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop that repeats in parallel 1000 times.
This loop is executed in parallel. My assumption was that n Camunda executors now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, and executor #2 and all others do the same. So after a short while at least some of the 1000 iterations would have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda itself takes a lot of time outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can this behavior be changed?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish are influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
It is described a bit here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since the jobs are executed in the order they appear in the database (see here), the order is 6,[7,7,7...8,8,8...9,9,9...],10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime: with transactions in parallel mode it takes 06:33 minutes.
If you turn off the exclusive flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
Afaik the recommended approach to gain true parallelism would be to move Task 7, Task 8 and Task 9 to a separate process and spawn 1000 instances of that process.
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6,[7,7,7|8,9,8,9 (in random order),...],10.
repeat 1000 times, sequential, no transaction
The order is 11,[12,13,14|12,13,14,...],15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16,[17,18,19|17,18,19|...],20.
Due to the transactions this takes 02:45 minutes.
I heard from colleagues that one should use parallel only if it involves long-running/blocking tasks like a human task: in sequential mode there would only be one human task, and after that one is done, another will be created. In parallel mode, you have 1000 human tasks, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8

Camunda external task messages are being deprioritised

We use the Node camunda-external-task-client-js to handle Camunda external tasks.
The following is the client configuration:
"topic_name": "app-ext-task",
"maxTasks": 5,
"maxParallelExecutions": 5,
"interval": 500,
"usePriority": true,
"lockDuration":2100000,
"workerId": "app-ext-task-worker"
We are getting the external task details and are able to process them, but sometimes we see that some tasks are getting deprioritised.
We are not setting any priority on any external task; by default all tasks are assigned priority 0.
We expect all tasks to execute in a sequential manner. We accept that some tasks may take more time than the subsequent task, so task-1 may take more time than task-2.
Example: if a queue contains 10 tasks [task-1, task-2, task-3, task-4, task-5, ... task-10],
all the tasks should execute sequentially since they all have the same priority.
1st:task-1,
2nd:task-2
3rd: task-3
Problem:
We see some tasks getting deprioritised, i.e. later messages are executed before earlier ones:
1st:task-1,
2nd:task-2
3rd: task-4
4th: task-5
5th: task-6
6th: task-7
7th: task-8
8th: task-3
I see the problem possibly arising in 2 places:
While producing the message, Camunda may not have posted the message to the queue correctly.
While reading the queue, the Camunda external tasks may not be processed in the right order.
I didn't find much documentation on this, and I don't know how to debug it.
For me this is an intermittent issue, and I have not been able to find the root cause of the problem.
Is my expectation of Camunda queues wrong?
The external tasks do not form a "queue". They are instances in a pool of possible tasks; your worker fetches "some" tasks, which might be in order or not. You could prioritise the tasks, but still, if you have 10 "highest" priority tasks in the pool and the worker fetches 5, you won't be able to determine which are chosen.
But you have a process engine at hand: if keeping the sequence is essential for your process, why do you start all tasks at once and rely on the external worker to keep the order? Why not just create one task at a time and continue when it is finished?
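To make the pool semantics concrete, this is roughly the fetch-and-lock call the client issues on each polling interval, shown as a hedged sketch against Camunda's REST API (the base URL is an assumption; the worker, topic and lock values are taken from the configuration above):

# Rough illustration of the fetchAndLock request behind the JS client.
import requests

CAMUNDA_REST = "http://localhost:8080/engine-rest"  # assumed engine endpoint

payload = {
    "workerId": "app-ext-task-worker",
    "maxTasks": 5,
    "usePriority": True,  # orders the fetched batch by priority only
    "topics": [
        {"topicName": "app-ext-task", "lockDuration": 2100000},
    ],
}

# The engine returns up to maxTasks tasks from the pool; with equal priorities
# there is no guarantee they come back in creation order.
resp = requests.post(f"{CAMUNDA_REST}/external-task/fetchAndLock", json=payload)
for task in resp.json():
    print(task["id"], task.get("priority"))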

django RQ / rq scheduler - Possible to disable job timeout?

Once in a while, I have a job that requires much longer processing than anticipated, so I'd like to disable the timeout if possible.
You can pass job_timeout when enqueuing a job and it will be honored. The default timeout is 3 minutes (180 seconds), so I believe your function is taking more than that.
By default, jobs should execute within 180 seconds. After that, the worker kills the work horse and puts the job onto the failed queue, indicating the job timed out.
If a job requires more (or less) time to complete, the default timeout period can be loosened (or tightened), by specifying it as a keyword argument to the enqueue() call, like so:
q = Queue()
q.enqueue(mytask, args=(foo,), kwargs={'bar': qux}, job_timeout=600)  # 10 mins
https://github.com/rq/rq/blob/6bfd47f735de3f297ba3c8f59d5e2dcfa1987107/docs/docs/results.md#L88
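In Django RQ specifically, the same keyword can be passed when enqueuing through django_rq; a minimal sketch (the task import and the 6-hour value are illustrative):

# Sketch only; raise the per-job timeout well above the 180-second default.
import django_rq

from myapp.tasks import my_long_task  # assumed task function

queue = django_rq.get_queue('default')

# 6 hours, given in seconds; RQ also accepts strings such as '6h'.
queue.enqueue(my_long_task, job_timeout=6 * 60 * 60)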

Dataflow job stuck and not reading messages from PubSub

I have a Dataflow job which reads JSON from 3 Pub/Sub topics, flattens them into one, applies some transformations and saves to BigQuery.
I'm using a GlobalWindow with the following configuration.
.apply(Window.<PubsubMessage>into(new GlobalWindows())
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
    .discardingFiredPanes());
The job is running with the following configuration:
Max Workers : 20
Disk Size: 10GB
Machine Type : n1-standard-4
Autoscaling Algo: Throughput Based
The problem I'm facing is that after processing a few messages (approx. 80k) the job stops reading messages from Pub/Sub. There is a backlog of close to 10 million messages in one of those topics, and yet the Dataflow job is not reading the messages or autoscaling.
I also checked the CPU usage of each worker, and it is hovering in the single digits after the initial burst.
I've tried changing machine type and max worker configuration but nothing seems to work.
How should I approach this problem ?
I suspect the windowing function is the culprit. GlobalWindow isn't suited to streaming jobs (which I assume this job is, due to the use of PubSub), because it won't fire the window until all elements are present, which never happens in a streaming context.
In your situation, it looks like the window will fire early once, when it hits either that element count or duration, but after that the window will get stuck waiting for all the elements to finally arrive. A quick fix to check if this is the case is to wrap the early firings in a Repeatedly.forever trigger, like so:
withEarlyFirings(
    Repeatedly.forever(
        AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations)))))
This should allow the early firing to fire repeatedly, preventing the window from getting stuck.
However, for a more permanent solution I recommend moving away from using GlobalWindow in streaming pipelines. Using fixed-time windows with early firings based on element count would give you the same behavior, but without the risk of getting stuck.
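For illustration, here is roughly what that combination (fixed windows plus early firings) looks like in the Beam Python SDK; the window size, element count and delay are placeholders, and the Java SDK offers the same building blocks:

# Sketch in the Beam Python SDK; sizes and thresholds are placeholders.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

windowed = (
    messages  # a streaming PCollection, e.g. read from Pub/Sub
    | "FixedWindows" >> beam.WindowInto(
        window.FixedWindows(60),  # 60-second windows
        trigger=trigger.AfterWatermark(
            early=trigger.AfterAny(
                trigger.AfterCount(20000),
                trigger.AfterProcessingTime(delay=60),
            )
        ),
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
    )
)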