How to skip a parallel branch in Camunda

I have the following camunda flow
Sometimes an error might happen in any of the 8 service tasks. In that case I would like to skip the remaining service tasks on that branch (the ones that come after the task that throws the error), log the error, and let the other branch complete successfully.
Currently if an error happens in one of the branches the flow will hang.
The flow will also hang if there is a failure in both branches.
What is the best way to address this?

You can catch the error, log it, and then let the flow of execution continue.
Something like this:

Related

Can the last state in an AWS Step Functions flow contain a Catch statement?

I have a Step Functions orchestration flow and I want to do error handling in some of the states using the Catch field. However, the Catch field requires a Next assignment, so I am unable to include a Catch field in my last state if I want my state machine to run.
I would like to have a Catch field in the last state of the flow, but I am wondering whether it is good practice to have a Catch statement in the last state. When I introduce an ending state, e.g. a Type:Succeed state, the state machine is able to run. But this solution feels a bit hacky.
I have tried to set the value of Next in the Catch field to End, but CloudFormation threw this error when it tried to update the stack:
Resource handler returned message: "Invalid State Machine Definition: 'MISSING_TRANSITION_TARGET: Missing 'Next' target: EndState at /States/last_jobs/Branches[0]/States/last_state/Catch[0]/Next' (Service: AWSStepFunctions; Status Code: 400; Error Code: InvalidDefinition; Proxy: null)" (HandlerErrorCode: InvalidRequest)
The purpose of Catch is so you can tell Step Functions to take a different action in response to a failure (after retries) as the default behavior will be to fail the execution. That action needs to be captured in the workflow, hence the need for this to point to another state where that action is described.
I'm not 100% sure what you are looking to accomplish with this catch block, but I suspect it's one of the following cases.
If you are looking to take further action to compensate, then you will need to add that to your workflow (e.g. another task or a wait state that re-enters into the existing flow).
If you are looking to provide a specific failure cause and/or error, as opposed to the default you would get from the task failing, then you will need a Fail state with those specifics, set as Next for your Catch.
If you are looking to ignore this task failure and complete the workflow successfully, then you need a Succeed state that you can specify as Next for your Catch.
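As a rough sketch of the last case, the definition below wires the Catch of the final task to a Succeed state. It is built as a Python dict only so the wiring can be checked locally; the state name IgnoreFailure, the placeholder Lambda ARN, and the helper missing_catch_targets are assumptions made for illustration and are not part of the original question.

```python
import json

# Hypothetical definition: only the Catch -> Next wiring matters here.
definition = {
    "StartAt": "last_state",
    "States": {
        "last_state": {
            "Type": "Task",
            # Placeholder ARN, not a real function.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:HYPOTHETICAL",
            "Catch": [
                # Next must name a state that exists in the same States map.
                {"ErrorEquals": ["States.ALL"], "Next": "IgnoreFailure"}
            ],
            "End": True,
        },
        # Terminal state reached only when last_state fails.
        "IgnoreFailure": {"Type": "Succeed"},
    },
}


def missing_catch_targets(states):
    """Return Catch 'Next' values that do not name a state in this States map."""
    missing = []
    for state in states.values():
        for catcher in state.get("Catch", []):
            if catcher["Next"] not in states:
                missing.append(catcher["Next"])
    return missing


print(json.dumps(definition, indent=2))
print(missing_catch_targets(definition["States"]))
```

Swapping the Succeed state for a Fail state with its own Error and Cause fields would cover the second case in the same way. Pointing Next at a name that is not defined as a state, such as End, is what the local check flags, which mirrors the MISSING_TRANSITION_TARGET error above.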

How to debug a hanging job resulting from reading from lustre?

I have a job in interruptible sleep state (S) that has been hanging for a few hours.
I can't use gdb (gdb hangs when attaching to the PID).
I can't use strace either, because attaching strace makes the hanging job resume =(
The WCHAN field shows the PID is waiting in ptlrpc. After some searching online, this looks like a Lustre operation. The print files also revealed that the program is stuck reading data from Lustre. Any ideas or suggestions on how to proceed with the diagnosis, or on possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
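A minimal sketch of collecting that information without attaching a debugger is shown below. The helper name read_proc_entry and its proc_root parameter are assumptions made for illustration; reading /proc/PID/stack usually requires root, so the helper reports an unavailable entry instead of raising.

```python
import os
from pathlib import Path


def read_proc_entry(pid, name, proc_root="/proc"):
    """Return the text of <proc_root>/<pid>/<name>, or a marker if unreadable."""
    path = Path(proc_root) / str(pid) / name
    try:
        return path.read_text().strip()
    except OSError as err:
        # Covers a missing entry as well as insufficient permissions.
        return f"<unavailable: {type(err).__name__}>"


# Demo on the current process; pass the PID of the hung job instead.
pid = os.getpid()
for entry in ("wchan", "stack"):
    print(f"{entry}: {read_proc_entry(pid, entry)}")
```

Reading these files only inspects the process from the outside, so it should not disturb the hung job the way attaching gdb or strace does.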
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/, searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that a fix is already available, and that upgrading to the latest maintenance release (2.12.7 currently) or applying a patch (if the bug was recently fixed) will solve your problem.

Azure Event Hub ServiceBusException causing skipped messages

We are using the Azure Java event hub library to read messages out of an event hub. Most of the time it works perfectly, but periodically we see exceptions of type "com.microsoft.azure.servicebus.ServiceBusException". These coincide with times when messages that are in the event hub seem to be skipped.
Here are some examples of exception details:
"The message container is being closed (some number here)."
This generally hits multiple partitions at the same time, but not all.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
"The link 'xxx' is force detached by the broker due to errors occurred in consumer(link#). Detach origin: InnerMessageReceiver was closed."
This is generally tied to com.microsoft.azure.servicebus.amqp.AmqpException exceptions.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
Example callstack:
at com.microsoft.azure.servicebus.ExceptionUtil.toException(ExceptionUtil.java:93)
at com.microsoft.azure.servicebus.MessageReceiver.onError(MessageReceiver.java:393)
at com.microsoft.azure.servicebus.MessageReceiver.onClose(MessageReceiver.java:646)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:83)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:52)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:309)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:276)
at com.microsoft.azure.servicebus.MessagingFactory$RunReactor.run(MessagingFactory.java:340)
at java.lang.Thread.run(Thread.java:745)
There doesn't seem to be a way for clients of the library to recognize that a problem has occurred and avoid moving ahead in the event hub past our skipped messages. Has anyone else run into this? Is there some other way to recognize skipped messages and either avoid skipping them or retry them?
This error DOESN'T SKIP any messages by itself: the client throws an exception when it shouldn't have. This causes EPH (Event Processor Host) to RESTART the affected partitions' receivers. If the application using the EventHubs Java client doesn't handle these errors, it may experience loss of messages.
This is a bug in our retry logic in the current versions of the EventHubs Java client, up to 0.11.0.
Here's the corresponding issue to track progress.
In the EventHubs service, these errors happen if, for any reason, the container hosting your EventHubs' code has to close (for the sake of explanation, imagine we run a set of containers, like Docker containers, for every EventHub namespace). This is a transient error: the container will eventually be opened on another node.
Our Java client retry logic should have handled this error and retried. I will keep this thread posted with the fix.
EDIT
We just released 0.12.0 - which fixes this issue.
Thanks!
Sreeram

Gatling: polling a webservice, and failing the scenario on incorrect response-messages

Hard to write a good title for this question. I am developing a performance test in Gatling for a SOAP Webservice. I'm not very experienced with Gatling so I'm learning things as I go, but this conundrum has me entirely stumped.
One of the scenarios I am implementing a test for is an order-process consisting of several unique consecutive calls to the webservice, one of which is a polling call that returns the current status of the ordering process. Simplified, this call gets a SOAP Response with a status that can be of three types:
PROCESSING - Signifying the order is still processing.
ORDER_OK - Order completed without errors.
EVERYTHING_ELSE - A group of varying error-statuses and other results.
What I want to do, is have Gatling continuously poll the webservice until the processing-status changes - and then check that the status says it completed successfully. Polling continuously is easily implemented, but performing the check after it completes is turning out to be a far greater challenge than it has any business being.
So far, this is what I've done to solve the polling:
exec { session => session.set("status", "PROCESSING") }
  .asLongAs(session => session("status").as[String].equals("PROCESSING")) {
    exec(http("Poll order")
      .post("/MyWebService")
      .body(ELFileBody("bodies/ws/pollOrder.xml"))
      .check(
        status.is(200),
        regex("soapFault").notExists,
        regex("pollResponse").exists,
        xpath("//*[local-name(.)='result']").exists.saveAs("status")
      )
    ).exitHereIfFailed.pause(5 seconds)
  }
This snippet appears to perform the polling correctly: it continues to poll until the order status changes from PROCESSING to something else. However, I still need to check what the status changed to, because I don't know what it is, and only one of the many possible results should cause the scenario to continue for that user.
A potential fix would be to add more checks in that call that go something like this:
.check(regex("EVERYTHING_ELSE_XYZ").notExists)
The service can return a LOT of different "not a happy day" messages however and I'm only really interested in the two other ones, so it would be preferable for me to be able to do a check only for the two valid happy-day responses. Checking if one exact thing exists seems far more sensible than checking that dozens of things don't.
What I thought I would be able to do was performing a check on the status variable in the users session when the step exits the asLongAs-loop, and continue/exit the scenario for that user. As it's a session-variable I could probably do this in the next step of the total scenario and break the run for that user there, but that would also mean the error is reported in the wrong place, and the next calls fault-% would be polluted by errors from the previous call.
Using pseudocode, being able to do something like this immediately after it exits the asLongAs loop would have been perfect:
if (session("status").as[String].equals("ORDER_OK")) ? continueTheScenario : failTheScenario
but I've not been able to do anything similar to that inside a gatling-chain. It's almost starting to appear impossible to do something like that, but can anyone see a solution to this that I'm not seeing?
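Instead of "exists", use "in" on the xpath check to verify that the result is one of the 2 valid values. Any other status then fails the check on the polling request itself, so the error is reported where it belongs and exitHereIfFailed stops the scenario for that user.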

How should I handle an error in libpq for postgresql

I'm creating a few simple helper classes and methods for working with libpq, and am wondering if I receive an error from the database - (e.g. SQL error), how should I handle it?
At the moment, each method returns a bool depending on whether the operation was a success, and so is up to the user to check before continuing with new operations.
However, after reading the libpq docs, if an error occurs the best I can come up with is that I should log the error message / status and otherwise ignore. For example, if the application is in the middle of a transaction, then I believe it can still continue (Postgresql won't cancel the transaction as far as I know).
Is there something I can do with PostgreSQL / libpq to make the consequences of such errors safe regarding the database server, or is ignorance the better policy?
You should examine the SQLSTATE in the error and make handling decisions based on that and that alone. Never try to make decisions in code based on the error message text.
An application should simply retry transactions for certain kinds of errors:
Serialization failures
Deadlock detection transaction aborts
For connection errors, you should reconnect then re-try the transaction.
Of course you want to set a limit on the number of retries, so you don't loop forever if the issue doesn't clear up.
Other kinds of errors aren't going to be resolved by trying again, so the app should report an error to the client. Syntax error? Unique violation? Check constraint violation? Running the statement again won't help.
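The retry-with-a-limit idea can be sketched as below. This is an illustration in Python rather than libpq code: DbError, run_with_retries, and the default limit of three attempts are all assumptions. In libpq itself the SQLSTATE would come from PQresultErrorField(res, PG_DIAG_SQLSTATE).

```python
import time

# serialization_failure and deadlock_detected are worth retrying as-is.
RETRYABLE = {"40001", "40P01"}


class DbError(Exception):
    """Hypothetical error type carrying the SQLSTATE reported by the server."""

    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


def run_with_retries(transaction, max_attempts=3, delay=0.0):
    """Run transaction(), retrying transient failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DbError as err:
            # Give up on non-transient errors or once the limit is reached.
            if err.sqlstate not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(delay)
```

The callable passed in should re-run the whole transaction from its beginning, not only the statement that failed.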
There is a list of error codes in the documentation. The docs don't explain much about each error, but the preamble is quite informative.
On a side note: One trap to avoid falling into is "testing" connections with a trivial query before using them, and assuming that means the real query can't fail. That's a race condition. Don't bother testing connections; simply run the real query and handle any error.
The details of what exactly to do depend on the error and on the application. If there was a single always-right answer, libpq would already do it for you.
My suggestions:
Always keep a record of the transaction until you've got a confirmed commit from the DB, in case you have to re-run. Don't just fire-and-forget SQL statements.
Retry the transaction without a disconnect and reconnect for SQLSTATEs 40001 (serialization_failure) and 40P01 (deadlock_detected), as these are transient conditions generally resolved by re-trying. You should log them, as they're opportunities to improve how the app interacts with the DB and if they happen a lot they're a performance problem.
Disconnect, reconnect, and retry the transaction at least once for error class 08 (connection exceptions).
Handle 53300 (too_many_connections) and 53400 (configuration_limit_exceeded) with specific and informative errors to the user. Same with the other class 53 entries.
Handle class 57's entries with specific and informative errors to the user. Do not retry if you get a query_canceled (57014); it'll make sysadmins very angry.
Handle 25006 (read_only_sql_transaction) by reporting a different error, telling the user you tried to write to a read-only database or using a read-only transaction.
Report a different error for 23505 (UNIQUE violation), indicating that there's a conflict in a unique constraint or primary key constraint. There's no point retrying.
Error class 01 should never produce an exception.
Treat other cases as errors and report them to the caller, with details from the problem - most importantly SQLSTATE. Log all the details if you return a simplified error.
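The suggestions above amount to a dispatch on the SQLSTATE, which could be sketched like this. The function name classify_sqlstate and the action labels it returns are made up for illustration; only the SQLSTATE codes and classes come from the list.

```python
def classify_sqlstate(sqlstate):
    """Map a five-character SQLSTATE to a coarse handling action."""
    error_class = sqlstate[:2]
    if sqlstate in ("40001", "40P01"):
        # serialization_failure / deadlock_detected: retry on the same connection.
        return "retry"
    if error_class == "08":
        # Connection exceptions: disconnect, reconnect, then retry.
        return "reconnect_and_retry"
    if error_class == "01":
        # Warnings should be logged, never raised as errors.
        return "log_warning"
    if sqlstate in ("25006", "23505") or error_class in ("53", "57"):
        # Read-only transaction, unique violation, resource and operator
        # intervention errors: tell the user what happened, do not retry.
        return "report_specific_error"
    # Everything else: report to the caller along with the SQLSTATE.
    return "report_generic_error"


for code in ("40001", "08006", "57014", "42601"):
    print(code, classify_sqlstate(code))
```

Keeping the decision in one function keyed only on SQLSTATE keeps message text out of the control flow, in line with the advice never to branch on the error message.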
Hope that's useful.