Camunda 7.5 asynchronous job slow execution

After adding some asynchronous jobs to our workflow, the execution of some instances became slow. I use the embedded Camunda process engine (https://docs.camunda.org/get-started/spring/embedded-process-engine/).
Any idea?

It looks like your job executions result in timers being added. There was a bug where the process engine did not realize that new jobs had been added, or that there might be other jobs to execute, in that case.
The issue is described in CAM-6453.
The scenario for us was that we had several thousand processes accumulated due to a network problem. The process would execute one service task and then wait at an intermediate timer catch event. Because adding a timer did not hint the job executor, it would execute a few processes and then sleep for 60 seconds before acquiring the next batch of jobs, even though there were still a few thousand jobs available for execution.
It is fixed as of 7.4.10, 7.5.4, and 7.6.

Related

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after the specified timeout. Make sure that the workflow type's defaultTaskStartToCloseTimeout is not set too high. If the crash is not related to code correctness, the rescheduled task is processed and the workflow execution continues normally.
Workflow worker doesn't crash but the workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows (a sketch follows below).
Workflow worker doesn't crash, but a decision task cannot complete because RespondDecisionTaskCompleted fails due to a bug in the Flow framework. Since, from SWF's point of view, the task is never completed, it is eventually marked as timed out and rescheduled. Because the bug is still present, the new task again never completes and is rescheduled, and so on. A workflow execution experiencing this issue has a history whose tail consists of repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit, the best way to catch this issue is to set a reasonable executionStartToCloseTimeout and look for timed-out workflow executions. If the decision task timeout is set too low, such workflows can also hit the limit on history size before the execution timeout.
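As a rough illustration of the ListClosedWorkflowExecutions approach, here is a minimal sketch using boto3 (the question uses the Ruby Flow framework, so treat the client, region, and domain names as placeholders); it counts executions that closed as FAILED or TIMED_OUT over the last day:

    from datetime import datetime, timedelta
    import boto3

    swf = boto3.client("swf", region_name="us-east-1")  # region is an assumption

    def count_closed(domain, status, since):
        # Page through ListClosedWorkflowExecutions filtered by close status.
        total, token = 0, None
        while True:
            kwargs = {
                "domain": domain,
                "startTimeFilter": {"oldestDate": since},
                "closeStatusFilter": {"status": status},
            }
            if token:
                kwargs["nextPageToken"] = token
            page = swf.list_closed_workflow_executions(**kwargs)
            total += len(page["executionInfos"])
            token = page.get("nextPageToken")
            if not token:
                return total

    since = datetime.utcnow() - timedelta(days=1)
    for status in ("FAILED", "TIMED_OUT"):
        print(status, count_closed("my-domain", status, since))  # "my-domain" is a placeholder

Running this on a schedule (or from a monitoring job) gives you a simple count of failed and timed-out executions to alert on.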
All SWF metrics are published to CloudWatch, so completed and failed workflows send metrics to CloudWatch, where you can create alarms that notify you when any workflow fails.
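If you go the CloudWatch route, a hedged sketch of creating such an alarm with boto3 follows. The AWS/SWF namespace, the WorkflowsFailed metric, and the dimension names are written from memory, and the SNS topic ARN is a placeholder, so verify them against the SWF CloudWatch metrics documentation before relying on them:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="swf-workflow-failures",
        Namespace="AWS/SWF",              # assumed namespace for SWF metrics
        MetricName="WorkflowsFailed",     # assumed metric name
        Dimensions=[
            {"Name": "Domain", "Value": "my-domain"},
            {"Name": "WorkflowTypeName", "Value": "MyWorkflow.execute"},
            {"Name": "WorkflowTypeVersion", "Value": "1.0"},
        ],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:workflow-alerts"],  # placeholder ARN
    )

The SNS topic behind the alarm can then fan out to email, another process, or whatever should react to the failure.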

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the Pipeline API documentation says that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10 minutes. Now you can configure a non-default module (these used to be called "backends") with basic/manual scaling that will allow a task to run for 24 hours.
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10-mins)
The entire pipeline doesn't have to be limited to 24 hrs though. The "root" pipeline (the first task that runs) can yield many child pipelines, and each of those can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10 mins or 24 hrs)... when it is done, it signals the parent to wake up and finish... so the overall pipeline could run for days or months or whatever.
We have our app split into two modules: one for the front end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks.
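To illustrate the root/child structure described above, here is a rough sketch against the Python Pipeline API. The class names, the per-chunk work, and the queue name are all illustrative, the import path may differ depending on how the library is vendored, and yielding a child returns a future rather than its result:

    import pipeline  # the appengine-pipelines package; import path may vary in your setup

    class ProcessChunk(pipeline.Pipeline):
        # Child pipeline: each invocation is one task, bounded by the task deadline
        # (10 min on an auto-scaled module, up to 24 h on basic/manual scaling).
        def run(self, chunk_id):
            return chunk_id

    class RootPipeline(pipeline.Pipeline):
        # Root pipeline: fans out child pipelines; the overall run can span days.
        def run(self, chunk_ids):
            futures = []
            for chunk_id in chunk_ids:
                # Each yield schedules a separate child pipeline (its own task);
                # the yielded value is a future, not the child's result.
                futures.append((yield ProcessChunk(chunk_id)))
            # A further child could be yielded here with the futures as arguments
            # to aggregate results once all children have completed.

    stage = RootPipeline(list(range(100)))
    stage.start(queue_name="pipeline-queue")  # queue name is an assumption

Routing the pipeline's tasks to the basic-scaling module is typically done through the queue/module configuration rather than in the pipeline code itself.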

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it is to create an ActivityWorker and give it a task list, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long-poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up, and when you want to shut down the workers, you call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long-lived and need to deal with a few factors:
New versions of activities
Various task lists
Scaling independently on task list, activity implementations, workflow workers, host sizes, etc.
Handling error cases and polling
Handling shutdowns (in case of deployments and new versions)
I ended up using a solution where another script file is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing, so we refrain from launching another activity worker.
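For comparison, a long-lived poller (run under something like supervisord or systemd instead of cron) can look roughly like the sketch below. It uses boto3 rather than the Flow framework, and the domain, task list, identity, and handler are placeholders:

    import boto3

    swf = boto3.client("swf", region_name="us-east-1")  # region is an assumption

    def handle(activity_name, payload):
        # Placeholder for your activity implementation.
        return "done"

    def worker_loop(domain="my-domain", task_list="my-task-list"):
        while True:
            # poll_for_activity_task long-polls for up to 60 seconds and returns
            # without a task token if nothing was scheduled in that window.
            task = swf.poll_for_activity_task(
                domain=domain,
                taskList={"name": task_list},
                identity="worker-1",
            )
            token = task.get("taskToken")
            if not token:
                continue  # poll timed out with no work; loop and poll again
            try:
                result = handle(task["activityType"]["name"], task.get("input", ""))
                swf.respond_activity_task_completed(taskToken=token, result=result)
            except Exception as exc:
                swf.respond_activity_task_failed(
                    taskToken=token, reason="worker error", details=str(exc)[:3000]
                )

    if __name__ == "__main__":
        worker_loop()

Because the poll simply returns empty after 60 seconds and the loop immediately polls again, this gives continuous queue coverage without needing a cron trigger; the process supervisor takes care of restarting it if it crashes.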

Celery all generated tasks status

Django produces multiple Celery tasks through chains in one script run (e.g. when / is opened in the browser, 1000 tasks are called via the delay method).
I need something that will restrict new task generation if tasks queued in a previous script run are still running.
You need a distributed lock for this, which Celery doesn't offer natively.
For these kinds of locks I've found redis.Lock useful in most cases. If you need a semaphore, you can use Redis' atomic INCR/DECR commands along with some kind of watchdog mechanism to ensure your processes are still running.
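A minimal sketch of the lock approach with redis-py (the lock name, timeout, and connection details are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)  # connection details are placeholders

    # timeout makes the lock auto-expire so a crashed holder can't block forever.
    lock = r.lock("generate-tasks", timeout=600)
    if lock.acquire(blocking=False):
        try:
            for i in range(1000):
                pass  # e.g. my_task.delay(i) -- the chain/task calls from the view go here
        finally:
            lock.release()
    else:
        pass  # a previous run still holds the lock; skip generating new tasks

With blocking=False the view simply skips task generation while an earlier batch still holds the lock, which is the restriction you described.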
You can restrict the number of tasks of one type started in a given time window by setting:
rate_limit = "1000/m"
=> at most 1000 tasks of this type will be started per minute.
(see http://docs.celeryproject.org/en/latest/userguide/tasks.html#list-of-options)
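For completeness, here is roughly what that looks like on a task definition (the app name and broker URL are placeholders); note that Celery applies rate_limit per worker instance, not globally:

    from celery import Celery

    app = Celery("proj", broker="redis://localhost:6379/0")  # broker URL is a placeholder

    @app.task(rate_limit="1000/m")  # at most 1000 starts per minute, per worker instance
    def generate_report(item_id):
        return item_id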

Heroku Scheduler - why enqueue long-running jobs

The Heroku Scheduler documentation says:
Scheduled jobs are meant to execute short running tasks or enqueue longer running tasks into a background job queue. Anything that takes longer than a couple of minutes to complete should use a worker process to run
If the Scheduler starts a new dyno for these jobs and the cost is the same for a dyno vs. a worker, what is the advantage to adding a task to the queue and having a worker process run it?
It is an architectural best practice to only schedule, and not execute, interval tasks on the scheduler task (or your own custom clock process). The motivation for this is explained in the scheduled jobs article but, to summarize, you want your scheduler process/task to be as light-weight as possible since there should only be one of them. When you start overloading scheduling with execution you often run into schedule conflicts and erratic behavior.
Imagine that one interval job hangs, or takes much longer than expected. If your intervals are tight enough, this will start causing a backlog and future intervals could be pushed back or skipped altogether.
Also, it is just wise to keep component responsibilities as separated as possible - not having a single component be responsible for orthogonal tasks. This is a common design practice which is reflected in the scheduled job use-case by keeping scheduling and execution independent.
Best practices aside, if you're in development or bootstrap mode and understand the consequences stated above, you can certainly choose to ignore such advice and run everything within the scheduler task. Just watch out for hard-to-debug job conflicts or apparent duplication.
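To follow the schedule-only pattern described above, the command that Scheduler runs can simply enqueue into a background queue that a worker dyno processes. A minimal sketch assuming RQ and Redis (neither is mandated by Heroku; the job module and queue name are placeholders):

    from redis import Redis
    from rq import Queue

    from myapp.jobs import rebuild_reports  # hypothetical long-running job

    q = Queue("default", connection=Redis())

    if __name__ == "__main__":
        # This finishes in milliseconds, so the scheduler task stays lightweight;
        # a worker dyno (e.g. a "worker: rq worker" Procfile entry) does the real work.
        q.enqueue(rebuild_reports, job_timeout=3600)  # job_timeout value is an assumption

The same shape works with Celery, Delayed Job, or any other queue: the scheduled process only schedules, and the worker process executes.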
Well, I think this is just a recommendation. If you have a task that is run by the Scheduler and you run it manually (in the Heroku administration), you'll get an error caused by a timeout (because each task has a 30 s limit). But in fact, the task will not be interrupted; it will finish correctly.
If you have one dyno, that dyno serves your application. If you run a scheduled job, the dyno will be taken by the Scheduler, so if you have a long-running task, your page will be "idle" (not working correctly until the scheduled job finishes).