WSO2 BPS timeouts and wait nodes are not processed after restart

Using WSO2 BPS 3.6.0, we encountered a serious issue.
We have a few processes waiting for external events (with a timeout) and several processes polling for updates (using a wait node).
The problem arises as soon as we restart the server:
* timeouts that elapsed during the downtime are not processed
* wait nodes are not processed at all
Reading related articles:
https://issues.jboss.org/browse/RIFTSAW-466
wso2 wait loop doesn't work after restart
I found that the timeout timestamps are stored in the ode_job table, so I tried to update them before starting up the BPS server:
update ode_job set ts=(near_future_timestamp) where ts>(before_restart) and ts<(near_future_timestamp)
which resolved the scope timeouts; however, the wait nodes are still not processed, even those that were scheduled in the future. That effectively blocks all the polling instances without any means to move them forward.
Is there a way to "revive" or time out the wait nodes after restarting the server?

Related

WSO2 Micro Integrator - Message Processors Are Stuck After Server Stop and Start

I am running WSO2 MI 4.1 in a cluster with two nodes. After I re-enable all message forwarding processors in the Dashboard that forward messages from RabbitMQ's message store to an endpoint, each queue reports that it is running. When I stop the server on one node, wait a short period of time, start the same node back up, and then repeat this on the second node, the message processors look enabled and all report an enabled state. However, in RabbitMQ some of the queues are idle. If I send a message to these queues, the message just sits there. If I stop and start the message processor for the queue, the queue starts processing messages again. This happens with both empty queues and queues that have messages in them. Is this a bug, or is there a better way to do a system restart?
Removing the _meta_MSMP* files in the _system/governance/repository/components/org.wso2.carbon.tasks/definitions/-1234/ESB_TASK/ folder resolved this issue.

On Demand Scheduler

I have a daemon which constantly polls an AWS SQS queue for messages; once it receives a message, I need to keep increasing the visibility timeout until the message is processed.
I would like to set up an "on demand scheduler" which increases the visibility timeout of the message every X minutes or so and then stops the scheduler once the message is processed.
I have tried using the Spring Scheduler (https://spring.io/guides/gs/scheduling-tasks/) but that doesn't meet my needs since it's not on demand and runs no matter what.
This is done on a distributed system with a large fleet.
A message can take up to 10 hours to completely process.
We cannot set the default visibility timeout for the queue to be a high number (due to other reasons).
I would just like to know if there is a good library out there that I can leverage for doing this? Thanks for the help!
The maximum visibility timeout for an SQS message is 12 hours. You are nearing that limit. Perhaps you should consider removing the message from the queue while it is being processed and if an error occurs or the need arises you can re-queue the message.
You can set a Trigger for the Spring Scheduler, allowing you to manually set the next execution time; refer to this answer. This gives you more control over when the scheduled task runs.
Given the scenario, pulling a message (thus having the visibility timeout timer start) and then trying to acquire a lock was not the most feasible way to go about doing this (especially since messages can take so long to process).
Since the messages can take a very long time to process (and thus to delete), it's not feasible to keep increasing the timeout for messages you've already pulled, so we went a different way.
We first acquire a lock, then pull the message, and then increase its visibility timeout to 11 hours.
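For reference, that single visibility bump might look roughly like the sketch below using the AWS SDK for C++. This is a minimal sketch, not the poster's actual code: the queue URL and receipt handle are hypothetical placeholders, the client assumes default credentials and region configuration, and the lock acquisition that precedes this step is application-specific and not shown.
// Minimal sketch: bump one message's visibility timeout to 11 hours.
#include <aws/core/Aws.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/ChangeMessageVisibilityRequest.h>
#include <iostream>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::SQS::SQSClient sqs; // assumes default credentials/region

        Aws::SQS::Model::ChangeMessageVisibilityRequest request;
        request.SetQueueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"); // hypothetical queue URL
        request.SetReceiptHandle("AQEB...");      // receipt handle returned by the ReceiveMessage call (placeholder)
        request.SetVisibilityTimeout(11 * 3600);  // 11 hours, just under the 12-hour SQS maximum mentioned above

        auto outcome = sqs.ChangeMessageVisibility(request);
        if (!outcome.IsSuccess()) {
            std::cerr << "ChangeMessageVisibility failed: "
                      << outcome.GetError().GetMessage() << std::endl;
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}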

NetShareEnum Timeout

A NetShareEnum call sometimes takes upwards of 30 seconds, while successful connections generally take less than a second. Is there any way to set a manual timeout?
The documentation is quite silent on the subject. The protocol includes a timeout that seems to be the actual connection timeout instead of a failure timeout. I found SMB timeouts, which seem to be configurable to a degree (via registry settings) but I'd rather not mess up the default timeouts for a user.
If we can't set a manual timeout, is it acceptable to spawn a worker thread to run the call and kill that thread after a custom timeout (using WaitForSingleObject and TerminateThread)? Is there any possibility of crashing due to killing a thread that is running only that call?
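For illustration, the worker-thread idea might be sketched roughly as below, assuming a 5-second budget and a hypothetical server name. Note that the sketch abandons the stuck thread rather than calling TerminateThread, because terminating a thread that is blocked inside netapi32 can leave locks and heap state in an unknown condition.
// Minimal sketch of wrapping NetShareEnum in a worker thread with a timeout.
#include <windows.h>
#include <lm.h>
#include <cstdio>
#pragma comment(lib, "Netapi32.lib")

struct EnumResult {
    NET_API_STATUS status = NERR_Success;
    DWORD entriesRead = 0;
    PSHARE_INFO_1 buffer = nullptr;
};

static DWORD WINAPI EnumWorker(LPVOID param) {
    EnumResult* result = static_cast<EnumResult*>(param);
    DWORD totalEntries = 0;
    result->status = NetShareEnum(
        const_cast<LMSTR>(L"\\\\SOMESERVER"),          // hypothetical server name
        1,                                             // request SHARE_INFO_1 entries
        reinterpret_cast<LPBYTE*>(&result->buffer),
        MAX_PREFERRED_LENGTH,
        &result->entriesRead,
        &totalEntries,
        nullptr);
    return 0;
}

int main() {
    // Heap-allocate so an abandoned worker never writes into a dead stack frame.
    EnumResult* result = new EnumResult();
    HANDLE worker = CreateThread(nullptr, 0, EnumWorker, result, 0, nullptr);
    if (!worker) return 1;

    if (WaitForSingleObject(worker, 5000) == WAIT_TIMEOUT) {
        // Still blocked inside NetShareEnum. Abandon the thread (and intentionally
        // leak `result`) instead of calling TerminateThread, which risks corrupting
        // state held inside netapi32.
        printf("NetShareEnum timed out\n");
        CloseHandle(worker);
        return 2;
    }

    CloseHandle(worker);
    if (result->status == NERR_Success) {
        printf("Enumerated %lu shares\n", result->entriesRead);
        NetApiBufferFree(result->buffer);
    }
    delete result;
    return 0;
}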

Occasional high latency in qpid application

I'm hoping someone can help me with an issue I'm seeing with a Qpid C++ application I'm using. Essentially, we have one application publishing a status to a last_value_queue at about a 10Hz rate and a couple other applications continuously processing this status. The receivers also use the status as a kind of heartbeat and will complain if the status message isn't updated for a certain amount of time (500ms, to be exact.)
This works fine for about a day, after which we start seeing issues. Every couple of hours, a single fetch call by a receiver will block for over 500ms (sometimes for up to 900ms). This behavior continues until we restart the broker.
I'm no expert, but I don't think I'm doing anything particularly dumb. I've been able to repeat this behavior with a pair of small applications that connect to the broker. Every 100ms the sender sends a std::chrono::time_point object set to the current time. The receiver fetches the message and calculates the delay to the millisecond. The delay is always 0ms or 1ms, except for the single spikes every hour or so after the initial day of everything being happy. The connection is created like so:
qpid::messaging::Connection c("host1:5672","{ reconnect: true}");
and the sender and receiver are both created with the string
"testQueue; { mode: browse, create: always, node: { type: queue, x-declare:{ arguments:{'qpid.last_value_queue_key':'key','qpid.replicate':'none'}}}}"
High availability replication is enabled on the broker, but I have it explicitly disabled for everything for the purpose of my testing. I see no difference in behavior when the broker and apps are running on the same host or on different hosts on the LAN. Using qpid-stat, I can see that the broker replication queue is still transmitting quite a bit of data, but its message count is always at 0, so I don't think it's sending more than it can handle. Can anyone think of anything I might be missing that could cause this behavior? We're using Qpid 0.26 and the C++ broker.
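For reference, the receiver half of that test pair could look roughly like the sketch below. It assumes the sender publishes the current time as milliseconds since the epoch in a plain string body (rather than a raw std::chrono::time_point), and it flags any fetch whose delay exceeds the 500ms heartbeat budget mentioned above.
// Minimal sketch: measure the delay between send time and fetch time.
#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Session.h>
#include <qpid/messaging/Receiver.h>
#include <qpid/messaging/Message.h>
#include <qpid/messaging/Duration.h>
#include <chrono>
#include <cstdlib>
#include <iostream>

using namespace qpid::messaging;

static long long nowMillis() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
}

int main() {
    Connection connection("host1:5672", "{ reconnect: true}");
    connection.open();
    Session session = connection.createSession();
    Receiver receiver = session.createReceiver(
        "testQueue; { mode: browse, create: always, node: { type: queue, "
        "x-declare:{ arguments:{'qpid.last_value_queue_key':'key',"
        "'qpid.replicate':'none'}}}}");

    Message message;
    while (true) {
        // Block for up to one second waiting for the next status update.
        if (receiver.fetch(message, Duration::SECOND)) {
            long long sentAt = std::atoll(message.getContent().c_str());
            long long delay = nowMillis() - sentAt;
            if (delay > 500) {
                // This is the spike described above: a fetch that exceeds
                // the 500ms heartbeat budget.
                std::cout << "late status: " << delay << "ms" << std::endl;
            }
            session.acknowledge();
        }
    }
}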

How to detect stale workers (or auto-restart)

We recently experienced a nasty situation with the Celery framework. There were a lot of messages in the queue; however, those messages weren't being processed. We restarted Celery and the messages started being processed again. However, we do not want a situation like this to happen again and are looking for a permanent solution.
It appears that Celery's workers have gone stale. The Celery documentation notes the following about stale workers:
This shows that there’s 2891 messages waiting to be processed in the task queue, and there are two consumers processing them.
One reason that the queue is never emptied could be that you have a stale worker process taking the messages hostage. This could happen if the worker wasn’t properly shut down.
When a message is received by a worker the broker waits for it to be acknowledged before marking the message as processed. The broker will not re-send that message to another consumer until the consumer is shut down properly.
If you hit this problem you have to kill all workers manually and restart them
See documentation
However, this relies on manually checking for stale workers, which leaves a lot of room for error and costs manual labor. What would be a good solution to keep Celery working?
You could use supervisor or similar process-management tools to deploy the workers; refer to Running the worker as a daemon.
Moreover, assuming you are using RabbitMQ, you could monitor the queue status with the rabbitmq-management plugin to check whether a queue grows too large; Celery's monitoring guide also provides some mechanisms for this.