Xray multi threading - Failed to end subsegment: subsegment cannot be found - amazon-web-services

I have an issue using Xray in a multithreaded environment with my REST API.
I'm using the Xray Auto instrumentation agent for Java, and Spring boot.
I've tried to follow the example found here
https://docs.aws.amazon.com/xray/latest/devguide/scorekeep-workerthreads.html
However when ending my subsegment, I get a log output of
"Suppressing AWS X-Ray context missing exception (SubsegmentNotFoundException): Failed to end subsegment: subsegment cannot be found."
I suspect what happens is that the request has been fully handled and response returned to the client, before the CompletableFuture is done. I assume this means that the Xray segment has been closed and that might explain why I'm seeing this issue. What I'm wondering is if there´s anything I can do to fix this?
Entity segment = AWSXRay.getGlobalRecorder().getTraceEntity();
CompletableFuture.runAsync(() -> {
AWSXRay.getGlobalRecorder().setTraceEntity(segment);
AWSXRay.beginSubsegment("PostTest");
try {
defaultService.createStatus(accessToken, statusRequest);
} finally {
AWSXRay.endSubsegment();
}
})

Thanks for your deep diving, this error log confuses users, the Xray SDK is better to remind user the segment is missing when beginning subsegment but not throw an missing subsegment exception at the end.
Your guess is right, it is logical that subsegment can end later than parent segment, but subsegment have to begin earlier than its parent segment end. Else subsegment would not be created successfully and certainly cannot be ended at the end. Please try to adjust your code to start this async thread before segment close.

Related

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing an MismatchingMessageCorrelationException for the receive task in some cases (less than 5%)
The call back to notify receive task is done by :
protected void respondToCallWorker(
#NonNull final String correlationId,
final CallWorkerResultKeys result,
#Nullable final Map<String, Object> variables
) {
try {
runtimeService.createMessageCorrelation("callWorkerConsumer")
.processInstanceId(correlationId)
.setVariables(variables)
.setVariable("callStatus", result.toString())
.correlateWithResult();
} catch(Exception e) {
e.printStackTrace();
}
}
When i check the logs : i found that the query executed is this one :
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Some times, When i look for the instance of the process in the database i found it waiting in the receive task
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when i check the subscription event, it's not yet created in the database
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think that the solution is to save the "receive task" before getting the response for respondToCallWorker, but sadly i can't figure it out.
I tried "asynch before" callWorker and "Message consumer" but it did not work,
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results,
I tried also parallel branches but i get OptimisticLocak exception and MismatchingMessageCorrelationException
Maybe i am doing it wrong
Thanks for your help
This is an interesting problem. As you already found out, the error happens, when you try to correlate the result from the "worker" before the main process ended its transaction, thus there is no message subscription registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue on BPMN model level, it might be worth to consider options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on spring boot) publish a "CallWorkerCommand" (simple pojo) and use a TransactionalEventLister on a spring bean to execute the actual call. By doing so, you first will finish the BPMN process by subscribing to the message and afterwards, spring will execute your worker call.
Second: you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes to quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.

Processing AWS Lambda messages in Batches

I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume batch messages. In my Lambda function I iterate each message, and if one fails, Lambda exits. And the cycle starts again.
I am wondering about slightly different approach
Let's assume I have three messages: A, B and C. I also take them in batches. Now if the message B fails (e.g. API call failed), I return message B to SQS and keep processing the message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks
There's an excellent article here. The relevant parts for you are...
Using a batchSize of 1, so that messages succeed or fail on their own.
Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
Handle errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
sqs = boto3.client('sqs')
def handler(event, context):
failed = False
for msg in event['Records']:
try:
# Do something with the message.
handle_message(msg)
except Exception:
# Ok it failed, but allow the loop to finish.
logger.exception('Failed to handle message')
failed = True
else:
# The message was handled successfully. We can delete it now.
sqs.delete_message(
QueueUrl=<queue_url>,
ReceiptHandle=msg['receiptHandle'],
)
# It doesn't matter what the error is. You just want to raise here
# to ensure the trigger doesn't delete any of the failed messages.
if failed:
raise RuntimeError('Failed to process one or more messages')
def handle_msg(msg):
...
For Node.js, check out https://www.npmjs.com/package/#middy/sqs-partial-batch-failure.
const middy = require('#middy/core')
const sqsBatch = require('#middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
.use(sqsBatch())
Check out https://medium.com/#brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with Maximum retries. If your function is idempotent this can be used.
In this approach you should throw an error from the function even if one item in the batch is failing. AWS with split the batch into two and retry. Now one half of the batch should pass successfully. For the other half the process is continued till the bad record is isolated.
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is iterating through a list of your events (ex [eventA, eventB, eventC]), and for each execution, append to a list of failed events if the event failed. Then, have an end case that checks to see if the list of failed events has anything in it, and if it does, manually send the messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events to the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly define the lambda event source mapping to expect a batch failure to be returned. It can be done by adding the key of FunctionResponseTypes with the value of a list containing ReportBatchItemFailures. Here is the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My sam template looks like this after adding this:
Type: SQS
Properties:
Queue: my-queue-arn
BatchSize: 10
Enabled: true
FunctionResponseTypes:
- ReportBatchItemFailures

Sitecore: Session Started Pipeline

I need to detect when a new session is started and I'm unable to locate a pipeline or processor for it. Ironically, I've found the sessionEnd pipeline!
The only other way I can seem to make it work is using Global.aspx to hook the session start event.
You can add your own processor to the httpRequestBegin pipeline with the following code:
if (Session["exist"] == null)
{
// your code that should be executed on session start.
Session["exist"] = true;
}
You might find what you're looking for here
Depending on what you're looking for maybe the startTracking pipeline is what you want.

Can Amazon Simple Workflow (SWF) be made to work with jRuby?

For uninteresting reasons, I have to use jRuby on a particular project where we also want to use Amazon Simple Workflow (SWF). I don't have a choice in the jRuby department, so please don't say "use MRI".
The first problem I ran into is that jRuby doesn't support forking and SWF activity workers love to fork. After hacking through the SWF ruby libraries, I was able to figure out how to attach a logger and also figure out how to prevent forking, which was tremendously helpful:
AWS::Flow::ActivityWorker.new(
swf.client, domain,"my_tasklist", MyActivities
) do |options|
options.logger= Logger.new("logs/swf_logger.log")
options.use_forking = false
end
This prevented forking, but now I'm hitting more exceptions deep in the SWF source code having to do with Fibers and the context not existing:
Error in the poller, exception:
AWS::Flow::Core::NoContextException: AWS::Flow::Core::NoContextException stacktrace:
"aws-flow-2.4.0/lib/aws/flow/implementation.rb:38:in 'task'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:292:in 'respond_activity_task_failed'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:204:in 'respond_activity_task_failed_with_retry'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:335:in 'process_single_task'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:388:in 'poll_and_process_single_task'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:447:in 'run_once'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:419:in 'start'",
"org/jruby/RubyKernel.java:1501:in `loop'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:417:in 'start'",
"/Users/trcull/dev/etl/flow/etl_runner.rb:28:in 'start_workers'"
This is the SWF code at that line:
# #param [Future] future
# Unused; defaults to **nil**.
#
# #param block
# The block of code to be executed when the task is run.
#
# #raise [NoContextException]
# If the current fiber does not respond to `Fiber.__context__`.
#
# #return [Future]
# The tasks result, which is a {Future}.
#
def task(future = nil, &block)
fiber = ::Fiber.current
raise NoContextException unless fiber.respond_to? :__context__
context = fiber.__context__
t = Task.new(nil, &block)
task_context = TaskContext.new(:parent => context.get_closest_containing_scope, :task => t)
context << t
t.result
end
I fear this is another flavor of the same forking problem and also fear that I'm facing a long road of slogging through SWF source code and working around problems until I finally hit a wall I can't work around.
So, my question is, has anyone actually gotten jRuby and SWF to work together? If so, is there a list of steps and workarounds somewhere I can be pointed to? Googling for "SWF and jRuby" hasn't turned up anything so far and I'm already 1 1/2 days into this task.
I think the issue might be that aws-flow-ruby doesn't support Ruby 2.0. I found this PDF dated Jan 22, 2015.
1.2.1
Tested Ruby Runtimes The AWS Flow Framework for Ruby has been tested
with the official Ruby 1.9 runtime, also known as YARV. Other versions
of the Ruby runtime may work, but are unsupported.
I have a partial answer to my own question. The answer to "Can SWF be made to work on jRuby" is "Yes...ish."
I was, indeed, able to get a workflow working end-to-end (and even make calls to a database via JDBC, the original reason I had to do this). So, that's the "yes" part of the answer. Yes, SWF can be made to work on jRuby.
Here's the "ish" part of the answer.
The stack trace I posted above is the result of SWF trying to raise an ActivityTaskFailedException due to a problem in some of my activity code. That part is my fault. What's not my fault is that the superclass of ActivityTaskFailedException has this code in it:
def initialize(reason = "Something went wrong in Flow",
details = "But this indicates that it got corrupted getting out")
super(reason)
#reason = reason
#details = details
details = details.message if details.is_a? Exception
self.set_backtrace(details)
end
When your activity throws an exception, the "details" variable you see above is filled with a String. MRI is perfectly happy to take a String as an argument to set_backtrace(), but jRuby is not, and jRuby throws an exception saying that "details" must be an Array of Strings. This exception blows through all the nice error catching logic of the SWF library and into this code that's trying to do incompatible things with the Fiber library. That code then throws a follow-on exception and kills the activity worker thread entirely.
So, you can run SWF on jRuby as long as your activity and workflow code never, ever throws exceptions because otherwise those exceptions will kill your worker threads (which is not the intended behavior of SWF workers). What they are designed to do instead is communicate the exception back to SWF in a nice, trackable, recoverable fashion. But, the SWF code that does the communicating back to SWF has, itself, code that's incompatible with jRuby.
To get past this problem, I monkey-patched AWS::Flow::FlowException like so:
def initialize(reason = "Something went wrong in Flow",
details = "But this indicates that it got corrupted getting out")
super(reason)
#reason = reason
#details = details
details = details.message if details.is_a? Exception
details = [details] if details.is_a? String
self.set_backtrace(details)
end
Hope that helps someone in the same situation as me.
I'm using JFlow, it lets you start SWF flow activity workers with JRuby.

Rails - run pusher in background

I use Pusher in my Rails-4 application.
The problem is that sometimes the connection is slow, so the execution of the code becomes slower.
I also get from time to time the following error:
Pusher::HTTPError: execution expired (HTTPClient::ConnectTimeoutError)
I send signals via Pusher with this code:
Pusher[channel].trigger!(event, msg)
I would like to execute it in background, so if an exception is thrown it will not break the flow of my app, and neither slow it down.
I tried to wrap the call with begin ... rescue but it didn't solve the exception problem. Of course even if it would, it wouldn't solve the slow-down problem i want to avoid.
Information on performing asynchronous triggers can be found here:
https://github.com/pusher/pusher-gem#asynchronous-requests
This also provides you within information on catching/handling errors.
Finally I implemented this solution:
Thread.new do
begin
Pusher[channel].trigger!(ch, ev, msg)
ActiveRecord::Base.connection.close
rescue Pusher::Error => e
Rails.logger.error "Pusher error: #{e.message}"
end
end