Since updating from Camunda engine 7.0.0-alpha4 to 7.0.0-Final, we are facing a problem when rolling back transactions that contain either deployment or delete-deployment commands. The engine defines the listeners DeploymentFailListener and DeleteDeploymentFailListener, which are called upon transaction rollback, but at the time of rollback we are outside Camunda's context (i.e. the Context has been emptied, and Context.getProcessEngineConfiguration().getRegisteredDeployments() throws a NullPointerException).
Is this a bug in the Camunda engine? Is there anything we can do to avoid it?
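For context, a minimal sketch of the kind of code that triggers this for us, assuming the engine shares a Spring-managed transaction (the class, method, and resource names are illustrative only):

    import org.camunda.bpm.engine.RepositoryService;
    import org.springframework.transaction.annotation.Transactional;

    public class DeploymentRollbackExample {
        private final RepositoryService repositoryService;

        public DeploymentRollbackExample(RepositoryService repositoryService) {
            this.repositoryService = repositoryService;
        }

        @Transactional
        public void deployThenRollback() {
            repositoryService.createDeployment()
                    .addClasspathResource("process.bpmn") // any process resource
                    .deploy();
            // any exception now rolls back the surrounding transaction, which
            // invokes DeploymentFailListener after Camunda's Context is gone
            throw new RuntimeException("force rollback");
        }
    }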
We are working on an event-sourced application with akka-persistence, using an Oracle database as the event store. The application has been running in production for some time now. Lately we are seeing the following error for some of the persistent actors:
Persistence failure when replaying events for persistenceId [some-persistence-id]. Last known sequence number [0]
Can someone who has faced a similar issue in their application share their experience of why this happens?
Also, going through the Akka documentation at https://doc.akka.io/docs/akka/current/persistence.html, onRecoveryFailure is responsible for handling such failures. Is there a way we can override this method to ignore the persisted events when replay fails? In our scenario replaying the events is not critical, and we can serve users even while ignoring them.
That log is typically a manifestation of something else. Since the failure is from sequence number zero, that points to an actual query to the DB failing (e.g. a timeout). There should be other logs around the time of that one which provide further information.
Akka Persistence has a fairly strong assumption that the persisted state is important (otherwise why would you be persisting?). Off the top of my head, I would consider separating the parts of the actor which are affected by persistence from the parts which aren't: the non-persistent actor can spawn a persistent child and interact with it (it can do tricks with stashing, for instance, to present an illusion that it and its child are the same actor).
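A rough sketch of that separation using the classic Java API (MyPersistentActor is a placeholder for your own persistent actor, and the default reply is a stand-in; by default a persistent actor whose recovery fails is stopped, so the parent can watch for that):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.Props;
    import akka.actor.Terminated;

    // Non-persistent front actor: forwards to a persistent child and falls
    // back to serving defaults if the child is stopped by a recovery failure.
    public class FrontActor extends AbstractActor {
        private ActorRef child;

        @Override
        public void preStart() {
            child = getContext().actorOf(Props.create(MyPersistentActor.class), "persistent-child");
            getContext().watch(child); // learn when a replay failure stops it
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Terminated.class, t -> child = null) // child died during recovery
                    .matchAny(msg -> {
                        if (child != null) {
                            child.forward(msg, getContext());
                        } else {
                            // persisted state unavailable: answer from defaults
                            getSender().tell("default-reply", getSelf());
                        }
                    })
                    .build();
        }
    }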
We have a microservice-based architecture; the frontend and backend are completely isolated. The backend microserviceA exposes a REST endpoint which calls a third-party service and updates a record in Cosmos DB. This microservice is deployed on a Kubernetes cluster and can therefore run with multiple replicas for load balancing. As mentioned, the frontend is isolated and consumes the exposed endpoint.
Problem:
The frontend is written such that if a response does not arrive within a certain time frame, or a network failure occurs, it retries the endpoint. It has been observed that in some rare scenarios (it doesn't matter which) the UI makes multiple calls (usually two) one after another, milliseconds apart. This causes a race condition in the backend logic.
If the first call reaches the third party first and gets a success response, the second call will get a failure (because the first one already succeeded). We cannot change the behaviour of the third party.
Taking the above as the base scenario: if the second call (the failed one) updates the DB first and reaches the UI first, the UI treats this as a failure (even though the first call succeeded) and takes failure actions.
If the success call reaches the UI first, everything works fine.
Possible solutions I can think of:
1) Put a cache as the source of truth, mapping apiCall : Status (a concrete sketch follows this list):
If (entry not present in cache) {
    Put entry in cache with status NULL (or similar) with a specific TTL
    (acquire lock on this specific entry) {
        If (status is SUCCESS) return successResponse
        Make third-party call
        Update DB
        Update cache
    } (release lock)
} else {
    (acquire lock on this specific entry) {
        Make third-party call
        Update DB
        Update cache
    } (release lock)
}
It seems, though, that the else block will never be executed.
2) Only in case of failure, instead of updating the DB immediately, Thread.sleep(10000) a couple of times in the hope that another thread updates the DB with a success response.
If there is still no success, return a failure response and update the DB.
3) Put a poller on the UI side. If the result is a failure, poll a couple more times in the hope that the status changes; if it doesn't, take the failure actions.
4) Optimistic locking for the Cosmos DB record
(https://cosmosdb.github.io/labs/dotnet/labs/10-concurrency-control.html).
I am not sure how this can help, though:
Let's say both API calls read the record while the version was 0.
The second API call updates the DB record first; since the version has not changed,
the update succeeds.
Now the DB holds Failure as the value.
The first API call then tries to update the record and finds a version mismatch,
so its update does not go through; since its own result was a success, another attempt will be made to update the DB.
(In case of a failure, no further attempt to update the DB would be made.)
But the second API call's response still reaches the UI first, and the UI again takes the failure action.
The UI would require a poller in such cases.
But if the UI requires a poller anyway, why do we need the optimistic locking in the first place? :)
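For reference, a minimal self-contained sketch of the cache idea from option 1 (all names are illustrative; a ConcurrentHashMap stands in for a real shared TTL cache such as Redis, which would be needed anyway since the service runs with multiple replicas):

    import java.util.concurrent.ConcurrentHashMap;

    public class DedupingHandler {
        // apiCall id -> last known status; compute() gives per-key mutual
        // exclusion within one replica (across replicas a shared cache plus
        // a distributed lock would be required)
        private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

        public String handle(String apiCallId) {
            return cache.compute(apiCallId, (key, status) -> {
                if ("SUCCESS".equals(status)) {
                    return status; // another call already succeeded: reuse its result
                }
                // note: blocking inside compute() is fine for a sketch, but in
                // production the remote call should live outside the map lock
                String result = callThirdParty(key);
                updateDb(key, result);
                return result;
            });
        }

        private String callThirdParty(String key) { return "SUCCESS"; } // stub
        private void updateDb(String key, String result) { }            // stub
    }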
I don't know Cosmos DB functionality well. If Cosmos provides something to handle this, please be kind enough to share.
What would be the best way to handle this kind of scenario?
It seems that in your application design you have made it necessary to wait for each execution to finish before you fire the next one. I am not debating whether this is good or bad, that's a different discussion, but it seems the only option you have is to fire all your DB updates in a synchronous manner in this case.
Optimistic locking is very good for ensuring that the document you are updating has not been updated while your code did other things, but it will not help your UI issue here.
I think you need to abstract the UI in order to make this work properly; otherwise you are stuck running things in synchronous mode.
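For completeness, this is roughly what the optimistic locking from the linked lab looks like, sketched here with the Azure Cosmos Java SDK v4 (the lab itself is .NET); the Record type, markSuccess, and the container/id/pk values are assumptions for illustration:

    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.CosmosException;
    import com.azure.cosmos.models.CosmosItemRequestOptions;
    import com.azure.cosmos.models.CosmosItemResponse;
    import com.azure.cosmos.models.PartitionKey;

    public class OptimisticUpdate {
        // Replace the item only if its _etag is still the one we read;
        // Cosmos rejects the write with HTTP 412 if someone got there first.
        static void updateIfUnchanged(CosmosContainer container, String id, String pk) {
            CosmosItemResponse<Record> read =
                    container.readItem(id, new PartitionKey(pk), Record.class);
            Record updated = markSuccess(read.getItem()); // hypothetical domain update

            CosmosItemRequestOptions options = new CosmosItemRequestOptions();
            options.setIfMatchETag(read.getETag()); // If-Match precondition
            try {
                container.replaceItem(updated, id, new PartitionKey(pk), options);
            } catch (CosmosException e) {
                if (e.getStatusCode() == 412) {
                    // precondition failed: the record changed since we read it;
                    // re-read and decide whether to retry the write
                }
            }
        }

        static Record markSuccess(Record r) { return r; } // stub
        static class Record { }                           // stub document type
    }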
We have a flow where, if some actions are not completed within a certain time period, we want to fail the workflow so that alarming mechanisms can be used.
For failing the workflow, I initially thought of just throwing an exception from the code. But after reading sources online, it seems an exception in the decider flow will not let the host return the result, and some other host will just pick up the pending decision task after some time.
I wanted to know if there is a programmatic way to terminate the workflow from within the workflow code and mark the SWF workflow execution as failed.
To fail a workflow using the Flow Framework, throw an Exception (or a subclass of it) from the workflow code.
Don't throw an Error from the workflow code. That indeed fails the decision task, which leads to the workflow getting blocked in a retry loop.
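A minimal sketch of what that looks like with the AWS Flow Framework for Java (the workflow interface and names are illustrative, not from the question):

    import com.amazonaws.services.simpleworkflow.flow.annotations.Execute;
    import com.amazonaws.services.simpleworkflow.flow.annotations.Workflow;
    import com.amazonaws.services.simpleworkflow.flow.annotations.WorkflowRegistrationOptions;

    @Workflow
    @WorkflowRegistrationOptions(defaultExecutionStartToCloseTimeoutSeconds = 3600)
    public interface OrderWorkflow {
        @Execute(version = "1.0")
        void processOrder(String orderId);
    }

    public class OrderWorkflowImpl implements OrderWorkflow {
        @Override
        public void processOrder(String orderId) {
            // ... workflow logic ...
            // Throwing a RuntimeException here fails the workflow execution
            // itself (recorded as WorkflowExecutionFailed); throwing an Error
            // would only fail the decision task and cause a retry loop.
            throw new RuntimeException("required actions not completed in time");
        }
    }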
The application has an implementation of IEventProcessor. When an unhandled exception is thrown from the ProcessEventsAsync method, the EventProcessorHost never re-sends those messages to the running instance of IEventProcessor. (It will re-send if the hosting application is stopped and restarted, or if the lease is lost and re-obtained.)
When an exception occurs in ProcessEventsAsync, the checkpoint will not be set; the checkpoint is set only on success, using context.CheckpointAsync().
Check out the ProcessErrorAsync method. According to the docs, it will be called in the event of an error. You'll have access to the context, where you can log the partition id and the error.
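The same contract, sketched here in the Java flavor of EventProcessorHost (com.microsoft.azure.eventprocessorhost), since the behavior is analogous to the .NET version in the question: checkpoint only after the batch succeeds, and let the host surface errors through the error callback (the process method is a stand-in for your logic):

    import com.microsoft.azure.eventhubs.EventData;
    import com.microsoft.azure.eventprocessorhost.CloseReason;
    import com.microsoft.azure.eventprocessorhost.IEventProcessor;
    import com.microsoft.azure.eventprocessorhost.PartitionContext;

    public class Processor implements IEventProcessor {
        @Override
        public void onOpen(PartitionContext context) { }

        @Override
        public void onClose(PartitionContext context, CloseReason reason) { }

        @Override
        public void onEvents(PartitionContext context, Iterable<EventData> events) throws Exception {
            for (EventData event : events) {
                process(event); // if this throws, the checkpoint below is skipped
            }
            context.checkpoint(); // only reached when the whole batch succeeded
        }

        @Override
        public void onError(PartitionContext context, Throwable error) {
            // the host calls this on errors: log the partition id and the error
            System.err.println(context.getPartitionId() + ": " + error);
        }

        private void process(EventData event) { /* application logic */ }
    }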
I have a website with four continuous webjobs listening on different topics of a service bus.
If, during the execution of one of these webjobs, an error occurs and the process exits, how do I prevent the webjob from starting up again (which in most cases would simply run into the error again)?
I tried keeping a disable.job file in the root of each webjob folder, thinking that if I then ran the webjob manually it would override it; but instead the webjob shuts down almost immediately after detecting that the file is present (I thought it would only be checked on an automatic restart).
There is no mechanism today to achieve that. If a continuous WebJob is not disabled, the WebJob engine will always try to restart it if it crashes for any reason. That is what most users expect.
If you don't want that, one thing you could do is catch the exception in your WebJob and simply do nothing (i.e. get into a sleep loop). However, I would suggest getting to the bottom of the error and seeing whether it can be avoided.
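A minimal sketch of that workaround, assuming a WebJob whose entry point is under your control (runListener is a stand-in for the actual topic listener):

    public class Program {
        public static void main(String[] args) {
            try {
                runListener(); // stand-in for the service bus topic listener loop
            } catch (Exception e) {
                e.printStackTrace();
                // Idle instead of exiting, so the WebJobs engine does not
                // treat the process as crashed and restart it.
                while (true) {
                    try {
                        Thread.sleep(60_000);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }

        private static void runListener() throws Exception { /* ... */ }
    }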