Camunda: How to read incident stack trace with Java API

Let's say I have an incident (org.camunda.bpm.engine.runtime.Incident) found using RuntimeService.createIncidentQuery()....
Is there a way to read the actual incident stack trace using the Java API? The same stack trace is accessible in Cockpit.

If it is a failed job, then the configuration / payload of the incident will be the job id. If the incident is caused by a failed external task, then it will be the external task id.
See https://docs.camunda.org/javadoc/camunda-bpm-platform/7.15/org/camunda/bpm/engine/runtime/Incident.html
and
https://docs.camunda.org/javadoc/camunda-bpm-platform/7.15/org/camunda/bpm/engine/ManagementService.html#getJobExceptionStacktrace-java.lang.String-
and
https://docs.camunda.org/javadoc/camunda-bpm-platform/7.15/org/camunda/bpm/client/task/ExternalTask.html#getErrorDetails--
Hence:
Incident incident = runtimeService.createIncidentQuery().singleResult();
String configuration = incident.getConfiguration();
log.info("Incident type: {}", incident.getIncidentType());
if (incident.getIncidentType().equals(Incident.FAILED_JOB_HANDLER_TYPE)) {
    log.info("Here comes the stacktrace: {}", managementService.getJobExceptionStacktrace(configuration));
} else {
    log.info("Here come the error details: {}", externalTaskService.getExternalTaskErrorDetails(configuration));
}
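For context, a minimal sketch of how the services used above could be obtained, assuming the default process engine is configured (in a Spring or Spring Boot setup they would typically be injected instead):
// Hypothetical wiring against the default process engine
ProcessEngine processEngine = ProcessEngines.getDefaultProcessEngine();
RuntimeService runtimeService = processEngine.getRuntimeService();
ManagementService managementService = processEngine.getManagementService();
ExternalTaskService externalTaskService = processEngine.getExternalTaskService();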

Related

Chainlink async bridge task returns HTTP Error 422

I have to integrate my Chainlink Bridge task with an asynchronous call. Looking at the official documentation, there is a parameter named async that I can set to true to invoke the bridge asynchronously.
I also found this link, which explains how to create asynchronous callbacks.
I created the following job step:
fetch [type="bridge"
name="generate-hash"
requestData="{\\"id\\": $(jobSpec.externalJobID), \\"data\\": {\\"fileList\\": $(decode_cbor.fileList)}}"
async=true
]
I also set the environment variable BRIDGE_RESPONSE_URL to my Chainlink Oracle address; according to the documentation, it must be set when using async external adapters.
When I start the job execution, I can see the async call to my bridge, and it contains the responseURL set to {BRIDGE_RESPONSE_URL}/v2/resume/{id}.
Once the process finishes, I call the responseURL using the PATCH HTTP method with the following body:
{
    "data": {
        "fileList": ["aaa...fff", "bbbb...vvvv"],
        "hashList": ["0x...", "0x..."]
    }
}
The error I get is 422 - Unprocessable Entity.
What am I doing wrong?
I finally fixed the problem by looking at the DEBUG logs on the Chainlink node. These were my logs:
2023-01-17T21:49:01.538Z [DEBUG] PATCH /v2/resume/b5c2254d-6f85-4953-9f63-7911951a3fb4 web/router.go:535
body={"data":{"fileList":["aaa...fff","bbb...vvv"], hashList":["0x...","0x..."]}} clientIP={client_ip} ùerrors=Error #01: must provide only one of either 'value' or 'error' key
Even though the official Chainlink documentation says you have to send back the result or the error by setting data or error, it's wrong.
You have to set the value key and then nest the data parameter inside it. This is how I updated my PATCH HTTP request body:
{
    "value": {
        "data": {
            "fileList": ["aaa...fff", "bbbb...vvvv"],
            "hashList": ["0x...", "0x..."]
        }
    }
}
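For reference, here is a minimal sketch of sending that corrected PATCH request with Java's built-in HttpClient; the node URL and payload values are placeholders, and any authentication the node requires is omitted:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResumeJobRun {
    public static void main(String[] args) throws Exception {
        // Placeholder for the responseURL ({BRIDGE_RESPONSE_URL}/v2/resume/{id}) received from the bridge request.
        String responseUrl = "https://my-chainlink-node/v2/resume/b5c2254d-6f85-4953-9f63-7911951a3fb4";

        // Result wrapped in "value", as the node's resume endpoint expects.
        String body = "{\"value\": {\"data\": {\"fileList\": [\"aaa...fff\"], \"hashList\": [\"0x...\"]}}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(responseUrl))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}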

Can a MassTransit Consumer Saga be InitiatedBy the same message(s) that it Orchestrates?

The new support for Event Hub Riders in MassTransit 7.0, plus the existing InMemoryRepository backing for Sagas, looks like it could provide a straightforward means of creating aggregate states based on a stream of correlated messages (e.g. across all sensors in a Building). In this scenario, the Building's Identifier would be used as the CorrelationId of the Messages and the Saga, and as the PartitionKey of the EventData messages sent to the Event Hub, ensuring the same consuming service instance receives all messages for that building at a given time. Given the way Event Hub's rebalancing works, it can be assumed that at some point while this service is running, the service instance managing messages for a Partition will shift to a new host, which will start reading messages sent by the sensors in the building. At that moment:
The new host does not know anything about the old host's processing. It just knows that it is now receiving messages for the Event Hub partition that includes that Building's messages.
The devices sending the messages do not know anything about the transition in state aggregation responsibility "downstream of them" - they are still happily reporting new measurements as always.
The challenge this creates is that, on the new service instance, a new Saga needs to be created to take over from the previous Saga, but the only thing that knows no Saga exists yet for a given entity is MassTransit: nothing on the new instance knows that a sensor reading from Building A is the first one from Building A since this service instance took over tracking the aggregate Building A state. We thought this could be handled by marking the same Message (DataCollected) with both InitiatedBy and Orchestrates:
public class BuildingAggregator :
    ISaga,
    InitiatedBy<DataCollected>, // init saga on first DataCollected with a given CorrelationId seen
    Orchestrates<DataCollected> // then keep handling those in that saga
{
    // saga Consume methods
}
However, this throws the following exception when the BuildingAggregator receives its second DataCollected message with a given Guid:
Saga exception on receipt of MassTransitFW_POC.Program+DataCollected: The message cannot be accepted by an existing saga
at MassTransit.Saga.Policies.NewSagaPolicy`2.MassTransit.Saga.ISagaPolicy<TSaga,TMessage>.Existing(SagaConsumeContext`2 context, IPipe`1 next)
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MassTransit.Saga.InMemoryRepository.InMemorySagaRepositoryContextFactory`1.<Send>d__4`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
Is there another way of achieving this logic? Is this the "wrong way" to apply Sagas?
As per Chris Patterson's comments on the question above, this is achievable with the state machine syntax:
Initially(
    When(DataCollected)
        .Then(f => _logger.LogInformation("Initiating Network Manager for Network: {NetworkId}", f.Data.NetworkId))
        .TransitionTo(Running));

During(Running,
    When(DataCollected)
        .Then(f => { /* activities and state transitions */ }),
    When(SimulationComplete)
        .Then(f => _logger.LogInformation("Network {NetworkId} shutting down.", f.Instance.CorrelationId))
        .TransitionTo(Final));
Note how the DataCollected event is handled both in the Initially block and again in the Running state that the Initially block transitions to.

Xray multi threading - Failed to end subsegment: subsegment cannot be found

I have an issue using Xray in a multithreaded environment with my REST API.
I'm using the Xray auto-instrumentation agent for Java, and Spring Boot.
I've tried to follow the example found here
https://docs.aws.amazon.com/xray/latest/devguide/scorekeep-workerthreads.html
However when ending my subsegment, I get a log output of
"Suppressing AWS X-Ray context missing exception (SubsegmentNotFoundException): Failed to end subsegment: subsegment cannot be found."
I suspect what happens is that the request has been fully handled and the response returned to the client before the CompletableFuture is done. I assume this means that the Xray segment has been closed, which might explain why I'm seeing this issue. Is there anything I can do to fix this?
Entity segment = AWSXRay.getGlobalRecorder().getTraceEntity();
CompletableFuture.runAsync(() -> {
    AWSXRay.getGlobalRecorder().setTraceEntity(segment);
    AWSXRay.beginSubsegment("PostTest");
    try {
        defaultService.createStatus(accessToken, statusRequest);
    } finally {
        AWSXRay.endSubsegment();
    }
});
Thanks for your deep dive; this error log confuses users. The Xray SDK would do better to warn that the segment is missing when the subsegment begins, rather than throwing a missing-subsegment exception when it ends.
Your guess is right: a subsegment may end later than its parent segment, but it has to begin before the parent segment ends. Otherwise the subsegment is not created successfully and therefore cannot be ended later. Please try to adjust your code to start this async thread (and hence the subsegment) before the segment closes.
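One possible adjustment, shown only as a sketch and not verified against the auto-instrumentation agent (it reuses defaultService, accessToken, and statusRequest from the question): begin the subsegment on the request thread while the segment is still open, then hand it to the worker thread and end it there.
// Begin the subsegment while the parent segment is still open on the request thread.
AWSXRayRecorder recorder = AWSXRay.getGlobalRecorder();
Entity parent = recorder.getTraceEntity();
Subsegment subsegment = recorder.beginSubsegment("PostTest");
// Restore the parent as the current entity so the request thread can close the segment normally.
recorder.setTraceEntity(parent);

CompletableFuture.runAsync(() -> {
    // Make the already-started subsegment current on the worker thread.
    recorder.setTraceEntity(subsegment);
    try {
        defaultService.createStatus(accessToken, statusRequest);
    } finally {
        // The subsegment may legitimately end after the parent segment has closed.
        recorder.endSubsegment();
    }
});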

ValidationException error when calling the CreateTrainingJob operation: You can’t override the metric definitions for Amazon SageMaker algorithms

I'm trying to run a Lambda function to create a SageMaker training job using the same parameters as another previous training job. Here's my lambda function:
import os
import datetime
import boto3

def lambda_handler(event, context):
    training_job_name = os.environ['training_job_name']
    sm = boto3.client('sagemaker')
    job = sm.describe_training_job(TrainingJobName=training_job_name)
    training_job_prefix = 'new-randomcutforest-'
    training_job_name = training_job_prefix + str(datetime.datetime.today()).replace(' ', '-').replace(':', '-').rsplit('.')[0]
    print("Starting training job %s" % training_job_name)
    resp = sm.create_training_job(
        TrainingJobName=training_job_name,
        AlgorithmSpecification=job['AlgorithmSpecification'],
        RoleArn=job['RoleArn'],
        InputDataConfig=job['InputDataConfig'],
        OutputDataConfig=job['OutputDataConfig'],
        ResourceConfig=job['ResourceConfig'],
        StoppingCondition=job['StoppingCondition'],
        VpcConfig=job['VpcConfig'],
        HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
        Tags=job['Tags'] if 'Tags' in job else [])
    [...]
And I keep getting the following error message:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: You can’t override the metric definitions for Amazon SageMaker algorithms. Please retry the request without specifying metric definitions.: ClientError
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 96, in lambda_handler
StoppingCondition=job['StoppingCondition']
and I get the same error for HyperParameters and Tags.
I tried to remove these parameters, but they are required, so that's not a solution:
Parameter validation failed:
Missing required parameter in input: "StoppingCondition": ParamValidationError
I tried to hard-code these variables, but it led to the same error.
The exact same function used to work, but only for a few training jobs (around 5), and then it gave this error message. Now it stopped working completely, and the same error message comes up. Any idea why?
Before calling "sm.create_training_job", remove the MetricDefinitions key. To do this, pop that key from the 'AlgorithmSpecification' dictionary.
job['AlgorithmSpecification'].pop('MetricDefinitions',None)
It's hard to tell exactly what's going wrong here and why your previous job's hyperparameters didn't work. Perhaps instead of just passing them along to the new job, you could print them out so you can inspect them?
Going by this line...
training_job_prefix = 'new-randomcutforest-'
... I am going to hazard a guess and assume you are trying to run RCF. The hyperparameters that algorithm requires are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html

Unable to resolve akka.pattern.AskTimeoutException: Ask timed out

I am trying to replace the default LevelDB in OpenDaylight with Apache Ignite, which I am unable to do after making changes to the akka.conf file and deploying the akka-persistence-ignite jar that I found here: https://github.com/Romeh/akka-persistance-ignite
I am facing an issue in the following lines of the source code (the AbstractDataStoreClientActor class), where a RuntimeException is thrown.
private static final Function1<ActorRef, ?> GET_CLIENT_FACTORY = ExplicitAsk.toScala(GetClientRequest::new);

@SuppressWarnings("checkstyle:IllegalCatch")
public static DataStoreClient getDistributedDataStoreClient(@Nonnull final ActorRef actor,
        final long timeout, final TimeUnit unit) {
    return (DataStoreClient) Await.result(ExplicitAsk.ask(actor, GET_CLIENT_FACTORY,
        Timeout.apply(timeout, unit)), Duration.Inf());
}
which gives the following error
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://opendaylight-cluster-data/user/$a#-809157907]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.databroker.actors.dds.GetClientRequest".
My question is: how can I know the behavior of the actor to which the above message is sent? Is there any way to check whether the actor has been created properly? What could be the reason the Ask is timing out?
EDIT: error stack trace from karaf.log
2018-07-12T11:27:01,755 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | DistributedDataStoreClientActor | 90 - com.typesafe.akka.slf4j - 2.5.11 | Persistence failure when replaying events for persistenceId [member-1-frontend-datastore-config]. Last known sequence number [0]
java.lang.NullPointerException: null
at akka.japi.Util$.option(JavaAPI.scala:271) ~[84:com.typesafe.akka.actor:2.5.11]
at akka.persistence.snapshot.japi.SnapshotStore.$anonfun$loadAsync$1(SnapshotStore.scala:20) ~[87:com.typesafe.akka.persistence:2.5.11]
at scala.util.Success.$anonfun$map$1(Try.scala:251) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.util.Success.map(Try.scala:209) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.concurrent.Future.$anonfun$map$1(Future.scala:288) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[84:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[84:com.typesafe.akka.actor:2.5.11]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) [323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) [323:org.scala-lang.scala-library:2.12.5.v20180316-130912-VFINAL-30a1428]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) [84:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [84:com.typesafe.akka.actor:2.5.11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
at java.lang.Thread.run(Thread.java:748) [?:?]
The issue is not with DistributedDataStoreClientActor; it is a side effect of an issue with the persistence backend (see my previous comment). Notice that the error stack trace contains an NPE emanating from akka.persistence.snapshot.japi.SnapshotStore, which indicates the backing SnapshotStore unexpectedly returned null from loadAsync. This points to the Ignite plugin.
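To illustrate the contract being violated, here is a hypothetical Java snapshot-store stub (assuming the Akka 2.5 japi signatures; this is not the Ignite plugin's actual code): doLoadAsync must complete with an empty Optional when no snapshot exists, never with null.
import java.util.Optional;

import akka.dispatch.Futures;
import akka.persistence.SelectedSnapshot;
import akka.persistence.SnapshotMetadata;
import akka.persistence.SnapshotSelectionCriteria;
import akka.persistence.snapshot.japi.SnapshotStore;
import scala.concurrent.Future;

public class NullSafeSnapshotStore extends SnapshotStore {

    @Override
    public Future<Optional<SelectedSnapshot>> doLoadAsync(String persistenceId, SnapshotSelectionCriteria criteria) {
        // Completing this future with null is what produces the NPE in akka.japi.Util$.option
        // seen in the karaf.log trace; an empty Optional is the correct "no snapshot found" result.
        return Futures.successful(Optional.empty());
    }

    @Override
    public Future<Void> doSaveAsync(SnapshotMetadata metadata, Object snapshot) {
        return Futures.successful(null); // a real store would persist the snapshot here
    }

    @Override
    public Future<Void> doDeleteAsync(SnapshotMetadata metadata) {
        return Futures.successful(null); // a real store would delete the snapshot here
    }

    @Override
    public Future<Void> doDeleteAsync(String persistenceId, SnapshotSelectionCriteria criteria) {
        return Futures.successful(null); // a real store would delete matching snapshots here
    }
}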