Camunda: How to locate the step in my workflow that provokes OptimisticLockingException

Under heavy load we are experiencing a lot of OptimisticLockingException exceptions and job retries for some of our processes (which causes a lot of trouble).
When not under load, the orchestrator doesn't throw any OptimisticLockingException.
Could you please suggest a way to locate which steps provoke these concurrent operations?
170556:2021/01/21 21:35:04.022 DEBUG ENGINE-16002 Exception while closing command context: ENGINE-03005 Execution of 'UPDATE ExecutionEntity[223d44fe-5c28-11eb-aa7e-eeeccf665d52]' failed. Entity was updated by another transaction concurrently. {"org.camunda.bpm.engine.OptimisticLockingException: ENGINE-03005 Execution of 'UPDATE ExecutionEntity[223d44fe-5c28-11eb-aa7e-eeeccf665d52]' failed. Entity was updated by another transaction concurrently.":null}
170986:2021/01/21 21:35:04.107 WARN ENGINE-14006 Exception while executing job 23e3a29c-5c28-11eb-80a2-eeeccf665d52: {"org.camunda.bpm.engine.OptimisticLockingException: ENGINE-03005 Execution of 'UPDATE ExecutionEntity[223d44fe-5c28-11eb-aa7e-eeeccf665d52]' failed. Entity was updated by another transaction concurrently.":null}
107264:2021/01/21 21:35:36.407 DEBUG ENGINE-16002 Exception while closing command context: ENGINE-03005 Execution of 'DELETE TimerEntity[f723f288-5c27-11eb-aa7e-eeeccf665d52]' failed. Entity was updated by another transaction concurrently. {"org.camunda.bpm.engine.OptimisticLockingException: ENGINE-03005 Execution of 'DELETE TimerEntity[f723f288-5c27-11eb-aa7e-eeeccf665d52]' failed. Entity was updated by another transaction concurrently.":null}
If you can suggest a way to avoid retries of async tasks that would be great, as asked in this question:
https://forum.camunda.org/t/how-to-avoid-retry-of-async-service-tasks-when-an-optimisticlockingexception-occurs/21301
Env:
2 instances of a Spring Boot Camunda orchestrator
<camunda-bpm.version>3.4.0</camunda-bpm.version>
<camunda-engine.version>7.12.0</camunda-engine.version>
Postgres 9.12 with read committed isolation

OptimisticLockingExceptions are a mechanism to protect you from lost updates, which could otherwise result from concurrent access to the same execution data. One transaction updates the parent execution first (V1 > V2). The process engine then makes the second transaction redo its operations (which were based on V1, meanwhile stale), but this time against the latest version of the execution (V2). The second transaction then creates a new version of the execution (V2 > V3).
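The version-bump behavior described above can be sketched in plain Python (an in-memory stand-in for a row with a revision column, not the Camunda API):

```python
class OptimisticLockingException(Exception):
    pass

class VersionedStore:
    """In-memory stand-in for an execution row with a revision column."""
    def __init__(self, value):
        self.value = value
        self.revision = 1  # V1

    def read(self):
        # A transaction reads both the value and the revision it saw.
        return self.value, self.revision

    def update(self, new_value, expected_revision):
        # The UPDATE only succeeds if nobody bumped the revision meanwhile.
        if self.revision != expected_revision:
            raise OptimisticLockingException(
                "Entity was updated by another transaction concurrently.")
        self.value = new_value
        self.revision += 1  # e.g. V1 -> V2
```

Two "transactions" that both read V1 illustrate the conflict: the first update succeeds (V2), the second fails and must redo its work against V2, producing V3.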
So OLEs can occur wherever concurrency occurs. Are you using parallel or inclusive gateways? Do events trigger concurrent token flows?
Understand where concurrency occurs in the process model / engine and evaluate whether the concurrent execution is really needed. In many cases people model e.g. two service calls in parallel which only take milliseconds each. Then there is no gain in total processing time (creating and merging concurrent jobs also costs time), but the concurrency can become a burden. So prefer sequential execution where possible.
Check the duration of your transactions. If you have longer transactions combining multiple service calls, it can be helpful to split them into multiple jobs (it depends on the use case; more jobs also mean more transactions).
The most important best practice when dealing with OLEs is setting async before on merging parallel gateways. This will not fully prevent OLEs, but the built-in retry mechanism of the job executor will take care of them for you.
Last but not least, OLEs occur increasingly when the system is under high load and the DB is not performing well. Tune the overall system performance to reduce DB load and OLEs.

Related

How do parallel multi-instance loops work in Camunda 7.16.6

I'm using camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats 1000 times in parallel.
This loop is executed in parallel. My assumption was that n Camunda executors now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, while executor #2 and all the others do the same. So after a short while at least some of the 1000 instances would have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda spends a lot of time by itself, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can you change that behavior?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish are influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
It is described a bit here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in the order they appear in the database (see here), the order is 6, [7,7,7...8,8,8...9,9,9...], 10.
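The resulting order can be reproduced with a toy FIFO job queue (a plain Python simulation, not the actual job executor): each completed Task 7 job appends a Task 8 job at the end of the queue, so all the 7s run before any 8.

```python
from collections import deque

def simulate(instances=3):
    # Seed one "Task 7" job per parallel instance, in database order.
    queue = deque(("Task 7", i) for i in range(instances))
    follow_up = {"Task 7": "Task 8", "Task 8": "Task 9"}
    executed = []
    while queue:
        task, i = queue.popleft()       # jobs are acquired in insertion order
        executed.append(task)
        if task in follow_up:
            # Finishing a job creates the next job at the *end* of the queue.
            queue.append((follow_up[task], i))
    return executed
```

With `instances=3` this yields 7,7,7,8,8,8,9,9,9, matching the observed batching.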
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime: with transactions in parallel mode it takes 6:33 minutes.
If you turn off the exclusive flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
Afaik the recommended approach to gain true parallelism is to move Task 7, Task 8, and Task 9 to a separate process and spawn 1000 instances of that process.
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6, [7,7,7|8,9,8,9 (in random order),...], 10.
repeat 1000 times, sequential, no transaction
The order is 11, [12,13,14|12,13,14|...], 15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16, [17,18,19|17,18,19|...], 20.
Due to the transactions this takes 2:45 minutes.
I heard from colleagues that one should use parallel mode only if the loop involves long-running/blocking tasks like a human task: in sequential mode there would only be one human task, and only after that one is done would another be created. In parallel mode you have 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8.

How to limit concurrency of a step in step functions

I have a state machine in AWS. I want to limit concurrency of a task (created via lambda) to reduce traffic to one of my downstream API.
I can restrict the lambda concurrency, but the task fails with "Lambda.TooManyExecutions" failure. Can someone please share a simple approach to limit concurrency of a lambda task?
Thanks,
Vinod.
Within the same state machine execution
You can use a Map state to run these tasks in parallel, and use the maximum concurrency setting to reduce excessive lambda executions.
The Map state ("Type": "Map") can be used to run a set of steps for each element of an input array. While the Parallel state executes multiple branches of steps using the same input, a Map state will execute the same steps for multiple entries of an array in the state input.
MaxConcurrency (Optional)
The MaxConcurrency field’s value is an integer that provides an upper bound on how many invocations of the Iterator may run in parallel. For instance, a MaxConcurrency value of 10 will limit your Map state to 10 concurrent iterations running at one time.
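A Map state capped at 10 concurrent iterations might look like this, sketched as a Python dict mirroring the ASL JSON (the state and Lambda ARN are hypothetical placeholders):

```python
import json

map_state = {
    "Type": "Map",
    "MaxConcurrency": 10,            # at most 10 iterations in flight at once
    "ItemsPath": "$.items",
    "Iterator": {
        "StartAt": "CallDownstream",
        "States": {
            "CallDownstream": {
                "Type": "Task",
                # Hypothetical Lambda ARN -- replace with your own function.
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-fn",
                "End": True,
            }
        },
    },
    "End": True,
}

print(json.dumps(map_state, indent=2))  # the ASL fragment to paste into the definition
```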
This should reduce the likelihood of issues. That said, you would still benefit from adding a retry statement for these cases. Here's an example:
{
  "Retry": [ {
    "ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
    "IntervalSeconds": 2,
    "MaxAttempts": 6,
    "BackoffRate": 2
  } ]
}
Across different executions
If you want to control this concurrency across different executions, you'll have to implement some kind of separate control yourself. One way to prepare your state machine for that is to request the data you need and then use an Activity to wait for a response.
You can use the Lambda concurrency limit you mentioned, but then add a retry clause to your step function so that when you hit the concurrency limit, Step Functions manages the retry of the task that failed.
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-examples
There’s a limit to the number of retries, but you get to define it.
Alternatively, if you want to retry without limit, you could use Catch to move to a Wait state when that concurrency error is thrown. You can read about Catch in the link above too. Here's a Wait state doc:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
You just have the Wait state transition back to the task state after it completes its wait.
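Put together, that catch-and-wait loop looks roughly like this (a Python dict mirroring the ASL JSON; the state names and ARN are made up):

```python
states = {
    "CallApi": {
        "Type": "Task",
        # Hypothetical Lambda ARN -- replace with your own function.
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-fn",
        "Catch": [{
            # On throttling, cool off instead of failing the whole execution.
            "ErrorEquals": ["Lambda.TooManyRequestsException"],
            "Next": "CoolOff",
        }],
        "End": True,
    },
    "CoolOff": {
        "Type": "Wait",
        "Seconds": 30,
        "Next": "CallApi",   # loop back and try again, with no retry cap
    },
}
```

The loop has no attempt limit, so pair it with an execution-level timeout if runaway retries are a concern.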

SplittableDoFn when using BigQueryIO

When reading large tables from BigQuery, I find that only one worker is sometimes active, and Dataflow then actively kills the other workers (then starts ramping up once the large PCollection requires processing, losing time).
So I wonder:
1. Will SplittableDoFn (SDF) alleviate this problem when applied to BigQueryIO
2. Will SDFs increase the use of num_workers (and stop them from being shut down)?
3. Are SDFs available in Python (yet), and in Java, are they available beyond just FileIO?
The real objective here is to reduce total processing time (quicker creation of the PCollection using more workers, faster execution of the DAG as Dataflow then scales up from --num_workers to --max_workers)

Oracle 12: maximum duration for "select for update" with OCCI C++

We are using OCCI to access Oracle 12 from a C++ process. One of the operations has to ensure that the client picks the latest data in the database and operates according to the latest value. The statement is
std::string sqlStmt = "SELECT REF(a) FROM O_RECORD a WHERE G_ID= :1 AND P_STATUS IN (:2, :3) FOR UPDATE OF PL_STATUS"
(we are using TYPEs). For some reason this command did not go through and the database table is LOCKED. All other operations are waiting for the first thread to finish; however, that thread was killed and we have reached a dead end.
What is the optimal solution to avoid this catastrophic scenario? Can I set a timeout on the statement in order to be 100% sure that a thread can operate on the "select for update" for, say, a maximum of 10 seconds? In other words, the thread of execution can lock the database table/row, but for no longer than a predefined time.
Is this possible?
There is a session parameter ddl_lock_timeout, but no dml_lock_timeout, so you cannot go that way. Either you have to use
SELECT REF(a)
FROM O_RECORD a
WHERE G_ID= :1 AND P_STATUS IN (:2, :3)
FOR UPDATE OF PL_STATUS SKIP LOCKED
And modify the application logic. Or you can implement your own interruption mechanism: simply fire a parallel thread and after some time execute OCIBreak. This is a documented and supported solution, and calling OCIBreak is thread safe. The blocked SELECT ... FOR UPDATE statement will be released and you will get the error ORA-01013: user requested cancel of current operation.
So at the OCCI level you will have to handle this error.
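The watchdog idea can be sketched like this (a Python stand-in; the `cancel` callback plays the role of the parallel thread calling OCIBreak on the blocked statement):

```python
import threading

def run_with_timeout(blocking_call, cancel, timeout):
    """Run blocking_call(); if it outlives `timeout` seconds, fire cancel().

    `cancel` stands in for OCIBreak: it must interrupt the blocking call
    from another thread. The watchdog is disarmed if the call finishes
    in time.
    """
    watchdog = threading.Timer(timeout, cancel)
    watchdog.start()
    try:
        return blocking_call()
    finally:
        watchdog.cancel()  # no-op if the watchdog already fired
```

In the real OCCI code, `blocking_call` would execute the SELECT ... FOR UPDATE and `cancel` would invoke OCIBreak on that session's handle, after which the statement fails with ORA-01013.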
Edit: added the Resource Manager, which can impose an even more precise limitation, focused just on those sessions that are blocking others.
by means of the Resource Manager:
The Resource Manager allows the definition of more complex policies than those available to profiles, and in your case is more suitable than the latter.
You have to define a plan and the groups of users associated with the plan, specify the policies associated with the plan/groups, and finally attach the users to the groups. To get an idea of how to do this, you can reuse this example from support.oracle.com (it is a bit too long to be posted here), replacing MAX_IDLE_TIME with MAX_IDLE_BLOCKER_TIME.
The core line would be
dbms_resource_manager.create_plan_directive(
  plan                  => 'TEST_PLAN',
  group_or_subplan      => 'my_limited_throttled_group',
  comment               => 'Limit blocking idle time to 300 seconds',
  max_idle_blocker_time => 300
);
by means of profiles:
You can limit the inactivity period of those sessions by specifying an IDLE_TIME.
CREATE PROFILE:
If a user exceeds the CONNECT_TIME or IDLE_TIME session resource limit, then the database rolls back the current transaction and ends the session. When the user process next issues a call, the database returns an error
To do so, specify a profile with a maximum idle time, and apply it to just the relevant users (so you won't affect all users or applications):
CREATE PROFILE o_record_consumer
LIMIT IDLE_TIME 2; --2 minutes timeout
alter user the_record_consumer profile o_record_consumer;
The drawback is that this setting is session-wide, so if the same session needs to stay idle in the course of other operations, this policy will be enforced anyway.
of interest...
Maybe you already know that the other sessions may coordinate their access to the same record in several ways:
FOR UPDATE WAIT x: if you append the WAIT x clause to your select for update statement, the waiting session will give up the wait after "x" seconds have elapsed (the integer "x" must be hardcoded there, for instance the value "3"; a variable won't do, at least in Oracle 11gR2).
SKIP LOCKED: if you append the SKIP LOCKED clause to your select for update statement, the select won't return the records that are locked (as ibre5041 already pointed out).
You may signal an additional session (a sort of watchdog) that your session is about to start the query and, upon successful execution, alert it about the completion. The watchdog session may implement its "kill-the-session-after-timeout" logic. You pay the added complexity but get the benefit of having the timeout applied to that specific statement, not to the session. To do so, see ORACLE-BASE - DBMS_PIPE or 3.2 DBMS_ALERT: Broadcasting Alerts to Users, by Steven Feuerstein, 1998.
Finally, it may be that you are attempting to implement a homemade queue infrastructure. In that case, bear in mind that Oracle already has its own queue mechanics, called Advanced Queuing, and you may get a lot for very little by simply using it; see ORACLE-BASE - Oracle Advanced Queuing.

Dynamo DB - Concurrent Updates

I need to manage concurrent updates to a DynamoDB table. After doing a bit of research, I came across two ways to do this:
Optimistic Locking through Version IDs - http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html
Conditional Save - http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.SpecifyingConditions.html
Are there any more ways to handle concurrent updates in DynamoDB?
User Scenario
Table A - maintains some kind of state [STARTED, IN_PROGRESS, COMPLETE]
The change of state can only be like this.
STARTED -> IN_PROGRESS
IN_PROGRESS -> COMPLETE
IN_PROGRESS -> STARTED (back step)
I think I can use both optimistic locking and conditional write here.
Optimistic Locking -
Have a version number. When Thread 1 updates the state to IN_PROGRESS, the version number is increased to 1.
Next, if some Thread 2 changes the state back to STARTED, the version number is increased to 2.
Later, if Thread 1 tries to change the state to COMPLETE, it is going to fail since the versions differ.
Conditional Save -
Have a condition to check that when we try to change the state to COMPLETE, the current state is IN_PROGRESS and not STARTED.
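The conditional save can be simulated in plain Python (this is not the DynamoDB API; in a real table the same check would live in a ConditionExpression on the UpdateItem call):

```python
# Allowed state transitions from the scenario above.
VALID_TRANSITIONS = {
    "STARTED": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETE", "STARTED"},  # complete, or back-step
}

class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB rejecting a conditional write."""

def transition(item, new_state):
    # Mirrors a conditional write: succeed only if the *current* state
    # allows moving to new_state; otherwise reject the whole update.
    if new_state not in VALID_TRANSITIONS.get(item["state"], set()):
        raise ConditionalCheckFailed(
            f"cannot go {item['state']} -> {new_state}")
    item["state"] = new_state
```

This enforces the rule directly (COMPLETE is only reachable from IN_PROGRESS) without tracking a version number, which is why a conditional save is enough when the condition itself captures the invariant.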
I think I can use either mechanism in this kind of situation. Is that correct? Or are there specific situations where we should prefer one over the other?