We're using WSO2 DAS 3.1.0 to receive events from WSO2 API Manager and send them off to a database.
If we send maybe 70-100 events/second to DAS for 4-5 hours, performance slowly deteriorates and it starts "lagging" behind. At first we suspected a problem pushing the resulting events to our database (we have an event receiver, an execution plan that summarizes events per second, and a publisher to our database), but we've since ruled this out; the database has no problem keeping up with the load at all.
To isolate the problem we have, for example, added an event publisher that writes the incoming events to a file directly from the event receiver (before any handling in our execution plan). We can see that when DAS performance deteriorates there is no output from this publisher either for several seconds; hence the problem is in the handling of incoming events. (We've also added a queue in front of the database publisher to make sure no back-pressure was propagating to the handling of incoming events.)
The really strange part, however, is that once this behavior occurs (the handling of incoming events in DAS deteriorates), there is no way out of it apart from restarting the entire server (after which it works again without problems for several hours). Even if we stop sending events to the server for several days, as soon as we start sending even 1-2 events there are several seconds between handled events (so it immediately "lags" behind incoming events again). It's as if handling incoming events gets exponentially slower until we restart DAS.
We would be very happy for any clues as to what to change so this behavior does not occur (purging internal events has no effect either).
After a lot of debugging we finally found the cause of this.
In our Siddhi statements we use 'group by' with dynamically changing timestamps, which it turns out is handled extremely inefficiently, as described in this bug report: https://github.com/wso2/siddhi/issues/431.
After patching the classes mentioned there the problem disappeared (although there is currently still a bug where the product runs out of memory because it never releases the dynamic 'group by' state).
I use Wowza GoCoder to publish video to a custom Wowza live application. In my application I attach an IRTSPActionNotify event listener within the onRTPSessionCreate callback. In the onRecord callback of my IRTSPActionNotify I perform various tasks, such as starting a recording of the live stream. In my onTeardown callback I then stop the recording and do some additional processing of the recorded video, like moving the video file to a different storage location.
What I just noticed is that if the encoder times out, due to a lost connection, power failure or some other sudden event, I won't receive an onTeardown event, not even when the RTSP session times out. This is a big issue for me, since I need to do this additional processing before I make the published stream available for on-demand viewing through another application.
I've been browsing through the documentation looking for an event or a utility-class that could help me out, but so far to no avail.
Is there some reliable event, or other reliable way, to know that a connection has timed out, so that I can trigger this processing also for streams that don't fire a teardown event?
I first discovered this issue when I lost connection on my mobile device while encoding video using the Wowza GoCoder app for iOS, but I guess the issue would be the same for any encoder.
In my Wowza modules, I have the following pattern, which proved to be quite reliable so far:
I have a custom worker thread that iterates over all client types. This allows me to keep track of clients, and I have found that eventually all kinds of disasters lead to clients being removed from those lists after somewhat unclear timeouts.
I suggest trying to track (add / remove) clients in your own Set and seeing if that is more accurate.
You could also try and see if anything gets called in that case in IMediaStreamActionNotify2.
I have seen onStreamCreate(IMediaStream stream) and onStreamDestroy(IMediaStream stream) being called in ModuleBase in the case of GoCoder on iOS, and I am attaching an instance of IMediaStreamActionNotify2 to the stream by calling stream.addClientListener(actionNotify).
On the GoCoder platform: I am not sure that it's the same on Android; the GoCoder Android and iOS versions have a fundamental difference, namely the streaming protocol itself, which leads to different API calls and behaviour on the backend side. Don't go live without testing on both platforms. :-)
In our project we have a stateful server. The server runs a rule engine (Drools) and exposes its functionality through a REST service. It is a monitoring system, and it is critical to have an uptime of more or less 100%. Therefore we need strategies both for shutting down a server for maintenance and for continuing to monitor an agent when one server is offline.
A first step might be to put a message queue or service bus in front of the Drools servers to hold messages that have not yet been processed, and to have mechanisms to back up the state of the server to a database or another storage. This makes it possible to shut down the server for a few minutes to deploy a new version. But the question is what to do when one server goes offline unexpectedly. Are there any failover strategies for stateful servers? What is your experience? Any advice is welcome.
There's no 'correct' way that I can think of. It rather depends on things like:
sensitivity to changes over time windows.
how quickly your application needs to be brought back up.
impact if events are missed.
impact if the events it is monitoring are not up to the second.
how the application raises events back to the outside world.
Some ideas for enabling fail-over:
1) Start from a clean slate. Examine the most serious impact of this before spending time developing anything else.
2) Load a list of facts (today's transactions perhaps) from a database. Potentially replay in order. Possibly whilst using a pseudo clock. I'm aware of this being used for some pricing applications in the financial sector, although at the same time, I'm also aware that some of those systems can take a very long time to catch up due to the number of events that need to be replayed.
3) Persist the stateful session periodically. The interval to be determined based on how far behind the DR application is permitted to be, and how long it takes to persist a session. This way, the DR application can retrieve the same session from the database. However, there will be a gap in events received based on the interval between persists. Of course, if the reason for failure is corruption of the session, then this doesn't work so well.
4) Configure middleware to forward events to 2 queues, and subscribe primary and DR applications to those queues. This way, both monitors should be in sync and able to make decisions based on the last 1 minute of activity. Note that if one leg is taken out for a period then it will need to catch up, and your middleware needs capacity to store multiple hours (however long an outage might be) worth of events on a queue. Also, your rules need to work off the timestamp on the event itself when queued, rather than the current time. Otherwise, when bringing a leg back after an outage, it could well raise alerts based on events in a time window.
An additional point to consider when replaying events is that you probably don't want any alerts to be raised to the outside world until you have completed the replay. For instance you probably don't want 50 alert emails sent to say that ApplicationX is down, up, down, up, down, up, ...
I'll assume that a monitoring application might be pushing alerts to the outside world in some form. If you have a hot-hot configuration as in 4, you also need to control your alerts. I would be tempted to deal with this by configuring each to push alerts to its own queue. Then middleware could forward alerts from the secondary monitor to a dead letter queue. Failover would be to reconfigure middleware so that primary alerts go to the dead letter queue and secondary alerts go to the alert channel. This mechanism could also be used to discard events raised during a replay recovery.
Given the complexity and potential mess that can arise from replaying events, for a monitoring application I would probably prefer starting from a clean slate, or going with persisted sessions. However this may well depend on what you are monitoring.
I have a problem with a client-server application. As I've almost run out of sane ideas for solving it, I am asking for help. I've stumbled into the described situation about three or four times now. The data provided is from the last failure, when I had turned on all possible logging, message dumping and so on.
System description
1) Client. Works under Windows. I take it as an assumption that there is no problem with its work (judging from the logs).
2) Server. Works under Linux (RHEL 5). This is where I have the problem.
3) Two connections are maintained between client and server: one for commands and one for sending data. Both work asynchronously. Both connections live in one thread and on one boost::asio::io_service.
4) Data to be sent from client to server consists of messages delimited by '\0'.
5) Data load is about 50 Mb/hour, 24 hours a day.
6) Data is read on the server side using boost::asio::async_read_until with the corresponding delimiter (a simplified sketch of this is shown below).
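For reference, the read side looks roughly like the sketch below; the names are illustrative rather than the production code (which also does logging and session management).

    // Simplified sketch of the data-connection read loop (illustrative names).
    // Messages are '\0'-delimited and read with boost::asio::async_read_until
    // on the single io_service that both connections share.
    #include <boost/asio.hpp>
    #include <iostream>
    #include <string>

    class DataConnection
    {
    public:
        explicit DataConnection(boost::asio::io_service& io) : socket_(io) {}

        boost::asio::ip::tcp::socket& socket() { return socket_; }

        void start_read()
        {
            boost::asio::async_read_until(
                socket_, buffer_, '\0',
                [this](const boost::system::error_code& ec, std::size_t /*bytes*/)
                {
                    if (ec)
                    {
                        std::cerr << "read error: " << ec.message() << "\n";
                        return; // the real code logs this and tears the session down
                    }
                    // Extract exactly one '\0'-terminated message from the streambuf.
                    std::istream is(&buffer_);
                    std::string message;
                    std::getline(is, message, '\0');
                    process(message);  // hand off to the application logic
                    start_read();      // re-arm the next asynchronous read
                });
        }

    private:
        void process(const std::string& message)
        {
            std::cout << "got a message of " << message.size() << " bytes\n";
        }

        boost::asio::ip::tcp::socket socket_;
        boost::asio::streambuf buffer_;
    };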
Problem
- For two days the system worked as expected.
- On the third day, at 18:55, the server read one last message from the client and then stopped reading. There was no info in the logs about new data.
- From 18:55 to 09:00 (14 hours) the client reported no errors, so it sent its data (about 700 Mb) successfully and no errors arose.
- At 08:30 I started investigating the problem. The server process was alive, and both connections between server and client were alive too.
- At 09:00 I attached to the server process with gdb. The server was in a sleeping state, waiting for some signal from the system. I believe I accidentally hit Ctrl+C, and maybe there was some message.
- Later in the logs I found a message with something like 'system call interrupted'. After that both connections to the client were dropped. The client reconnected and the server started working normally again.
- The first message then processed by the server was timestamped 18:57 on the client side. So after resuming normal work the server didn't drop the messages sent up to 09:00; they had been stored somewhere and it processed them accordingly.
Things I've tried
- I simulated the scenario above. Since the server dumped all incoming messages, I wrote a small script which presented itself as the client and sent all the messages back to the server again. The server died with an out-of-memory error, but unfortunately that was because of the higher data load (about 3 Gb/hour this time), not because of the original error. As it was Friday evening I had no time to repeat the experiment properly.
- Nevertheless, I ran the server through Valgrind to detect possible memory leaks. Nothing serious was found (apart from the fact that the server died because of the high load); no huge memory leaks.
Questions
- Where were these 700 Mb of data which the client sent and the server didn't get? Why were they persistent and not lost when the server restarted the connection?
- It seems to me that the problem is somehow connected with the server not getting notifications from boost::asio::io_service. The buffer gets filled with data, but no calls to the read handler are made. Could this be a problem on the OS side? Maybe something is wrong with the asynchronous calls? If so, how could this be checked?
- What can I do to detect the source of the problem? As I said, I've run out of sane ideas and each experiment is very costly in terms of time (it takes about two or three days to get the system into the described state), so I need to run as many checks per experiment as I can.
I would be grateful for any ideas I can use to get to the error.
Update: OK, it seems the error was a synchronous write left in the middle of the otherwise asynchronous client-server interaction. As both connections lived in one thread, this synchronous write was blocking the thread for some reason, and all interaction on both the command and data connections stopped. So I changed it to the async version and now it seems to work.
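Roughly the shape of the change (illustrative names, not the actual code): the blocking boost::asio::write call made from inside a handler is replaced by a queued boost::asio::async_write, so the single thread driving both connections is never stalled waiting for a slow peer.

    // Before (simplified): called from a completion handler, this blocks the
    // io_service thread until the peer has accepted all the bytes.
    //
    //     boost::asio::write(socket_, boost::asio::buffer(reply));
    //
    // After (simplified): the reply is queued and sent asynchronously.
    #include <boost/asio.hpp>
    #include <deque>
    #include <string>

    class CommandConnection
    {
    public:
        explicit CommandConnection(boost::asio::io_service& io) : socket_(io) {}

        // Called from handlers running on the io_service thread, so no locking
        // is needed; other threads would have to go through io_service::post().
        void send(const std::string& reply)
        {
            bool write_in_progress = !outbox_.empty();
            outbox_.push_back(reply);
            if (!write_in_progress)
                start_write();
        }

    private:
        void start_write()
        {
            boost::asio::async_write(
                socket_, boost::asio::buffer(outbox_.front()),
                [this](const boost::system::error_code& ec, std::size_t /*bytes*/)
                {
                    if (ec)
                        return; // the real code logs this and drops the connection
                    outbox_.pop_front();
                    if (!outbox_.empty())
                        start_write(); // keep draining the queue, one buffer at a time
                });
        }

        boost::asio::ip::tcp::socket socket_;
        std::deque<std::string> outbox_; // pending replies; front() is being sent
    };

The outbox queue is there because only one async_write should be outstanding on a socket at a time; chaining the next write from the completion handler keeps writes from interleaving.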
As I said, I've run out of sane ideas and each experiment is very costly in terms of time (it takes about two or three days to get the system into the described state)
One way to simplify the investigation of this problem is to run the server inside a virtual machine until it reaches this broken state. Then you can take a snapshot of the whole system and revert to it every time things go wrong during the investigation. At least you will not have to wait three days to get to this state again.
I am developing a Windows Phone app where users can update a list. Each update, delete, add, etc. needs to be stored in a database that sits behind a web service. As well as ensuring all the operations made on the phone end up in the cloud, I need to make sure the app is really responsive and the user doesn't feel any lag whatsoever.
What's the best design to use here? Should each check box change and each text box edit fire a new thread to contact the web service? Should I locally store a list of things that need to be updated and then send them to the server in a batch every so often (and what about the back button)? Am I missing another, even easier implementation?
Thanks in advance,
Data updates to your web service are going to take some time to execute, so in terms of providing the very best responsiveness to the user your best approach would be to fire these off on a background thread.
If updates not being sent (until your app resumes) because of a back press is a concern for your app, then you can increase the frequency with which you send these updates off.
Storing data locally following each change would be a good idea to make sure nothing is lost, since you don't know if your app will get interrupted, such as by a phone call.
You are able to intercept the back button, which would allow you to notify the user that pending updates are being processed, or to request confirmation to defer transmission (say, in the case of a poorly performing network location). Perhaps a visual cue in your UI would be helpful to indicate pending requests in your storage queue.
You may want to give some thought to the overall frequency of data updates in a typical usage scenario for your application and how intensely this would utilise the network connection. Depending on this you may want to balance frequency of updates with potential power consumption.
This may guide you on whether to fire updates off field-level changes, off a timer when the queue isn't empty, and/or when the user moves on to a different row of data, among other possibilities.
General efficiency guidance for mobile network communications is to have larger, less frequent transmissions rather than a "chatty" pattern of frequent transmissions; however, it is up to you to decide what is most applicable for your application.
You might want to look into something similar to REST or SOAP.
Each update, delete, add would send a request to the web service. After the request is fulfilled, the web service sends a message back to the Phone application.
Since you want to keep this simple on the Phone application, you would send a URL to the web service, and the web service would respond with a simple message you can easily parse.
Something like this:
http://webservice?action=update&id=10345&data=...
With a reply of:
Update 10345 successful
The id number is just an incrementing sequence to identify the request / response pair.
There is the Microsoft Sync Framework, recently released and discussed some weeks back on DotNetRocks. I must admit I didn't consider it till I read your comment.
I've not looked into the sync framework's dependencies, and thus its ability to run on the WP7 platform, as yet, but it's probably worth checking out.
Here's a link to the framework.
And a link to Carl and Richard's show with Lev Novik, an architect on the project if you're interested in some background info. It was quite an interesting show.
(Edited to try to explain better)
We have an agent, written in C++ for Win32. It needs to periodically post information to a server. It must support disconnected operation. That is: the client doesn't always have a connection to the server.
Note: This is for an agent running on desktop PCs to communicate with a server running somewhere in the enterprise.
This means that the messages to be sent to the server must be queued (so that they can be sent once the connection is available).
We currently use an in-house system that queues messages as individual files on disk, and uses HTTP POST to send them to the server when it's available.
It's starting to show its age, and I'd like to investigate alternatives before I consider updating it.
It must be available by default on Windows XP SP2, Windows Vista and Windows 7, or must be simple to include in our installer.
This product will be installed (by administrators) on a couple of hundred thousand PCs. They'll probably use something like Microsoft SMS or ConfigMgr. In this scenario, "frivolous" prerequisites are frowned upon. This means that, unless the client-side code (or a redistributable) can be included in our installer, the administrator won't be happy. This makes MSMQ a particularly hard sell, because it's not installed by default with XP.
It must be relatively simple to use from C++ on Win32.
Our client is an unmanaged C++ Win32 application. No .NET or Java on the client.
The transport should be HTTP or HTTPS. That is: it must go through firewalls easily; no RPC or DCOM.
It should be relatively reliable, with retries, etc. Protection against replays is a must-have.
It must be scalable -- there's a lot of traffic. Per-message impact on the server should be minimal.
The server end is C#, currently using ASP.NET to implement a simple HTTP POST mechanism.
(The slightly odd one.) It must support client-side in-memory queues, so that we can avoid spinning up the hard disk. It must allow flushing to disk periodically (a sketch of what we mean is shown after this list).
It must be suitable for use in a proprietary product (i.e. no GPL, etc.).
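To make the in-memory requirement concrete, here is a hypothetical illustration (names, thresholds and framing are made up; this is not our current code): messages accumulate in an in-memory container and only touch the disk when a periodic flush runs.

    // Hypothetical illustration of the "in-memory queue, periodic flush"
    // requirement; names and policy are made up, not our current code.
    #include <deque>
    #include <fstream>
    #include <mutex>
    #include <string>

    class OutgoingQueue
    {
    public:
        void enqueue(const std::string& message)
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(message);
        }

        // Called from a timer (say every 30 seconds) or when the queue grows
        // large; only then does the hard disk need to spin up.
        void flush_to_disk(const std::string& spool_file)
        {
            std::deque<std::string> batch;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                batch.swap(pending_); // take everything accumulated so far
            }
            if (batch.empty())
                return;

            std::ofstream out(spool_file.c_str(), std::ios::app | std::ios::binary);
            for (const std::string& msg : batch)
                out << msg << '\n'; // one message per line; real framing would differ
        }

    private:
        std::mutex mutex_;
        std::deque<std::string> pending_; // messages not yet written to disk
    };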
How is your current solution showing its age?
I would push the logic on to the back end, and make the clients extremely simple.
Messages are simply stored in the file system. Have the client write to c:/queue/{uuid}.tmp. When the file is written, rename it to c:/queue/{uuid}.msg. This makes writing messages to the queue on the client "atomic".
A C++ thread wakes up, scans c:\queue for "*.msg" files, and if it finds one it checks for the server and HTTP POSTs the message to it. When it receives the 200 status back from the server (i.e. the server has got the message), it can delete the file. It only scans for *.msg files, because the *.tmp files may still be being written to, and you'd have a race condition trying to send a file that was still being written; that's what the rename from .tmp is for. I'd also suggest scanning by creation date so early messages go first. A minimal sketch of both halves is below.
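Here is that sketch, using C++17's std::filesystem for brevity (an XP-era client would use the equivalent Win32 calls), and with http_post() as a stand-in for whatever HTTP client you pick (WinHTTP, WinInet, libcurl, ...):

    // Producer: write to a .tmp file, then rename to .msg so the scanner never
    // sees a half-written message. Consumer: scan, POST, delete on HTTP 200.
    #include <algorithm>
    #include <filesystem>
    #include <fstream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Stand-in for your HTTP client: POST the file body, return true on HTTP 200.
    bool http_post(const fs::path& msg_file)
    {
        (void)msg_file;
        return false; // wire up WinHTTP, WinInet or libcurl here
    }

    void enqueue_message(const fs::path& queue_dir,
                         const std::string& uuid,
                         const std::string& body)
    {
        fs::path tmp = queue_dir / (uuid + ".tmp");
        fs::path msg = queue_dir / (uuid + ".msg");
        {
            std::ofstream out(tmp, std::ios::binary);
            out << body;
        }                      // close (and flush) before the rename
        fs::rename(tmp, msg);  // atomic on the same volume
    }

    void drain_queue(const fs::path& queue_dir)
    {
        std::vector<fs::path> pending;
        for (const auto& entry : fs::directory_iterator(queue_dir))
            if (entry.path().extension() == ".msg")
                pending.push_back(entry.path());

        // Oldest first (last write time as a portable stand-in for creation date).
        std::sort(pending.begin(), pending.end(),
                  [](const fs::path& a, const fs::path& b)
                  { return fs::last_write_time(a) < fs::last_write_time(b); });

        for (const auto& msg : pending)
        {
            if (!http_post(msg))
                break;          // server unreachable: retry on the next pass
            fs::remove(msg);    // delete only after the server returned 200
        }
    }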
Your server receives the message, and there it can do any necessary dupe checking. Push this burden onto the server to centralize it. You could simply record every uuid of every message to do duplicate elimination. If that list gets too long (I don't know your traffic volume), perhaps you can cull items older than 30 days (I also don't know how long your clients can remain offline).
This system is simple, but pretty robust. If the file sending thread gets an error, it will simply try to send the file next time. The only time you should be getting a duplicate message is in the window between when the client gets the 200 ack from the server and when it deletes the file. If the client shuts down or crashes at that point, you will have a file that has been sent but not removed from the queue.
If your clients are stable, this is a pretty low risk. With dupe checking based on the message ID you can mitigate that at the cost of some bookkeeping; maintaining a list of uuids isn't spectacularly daunting, but again it depends on your message volume and other performance requirements.
The fact that you are allowed to work "offline" suggests you have some "slack" in your absolute messaging performance.
To be honest, the requirements listed don't make a lot of sense and show you have a long way to go in your MQ learning. Given that, if you don't want to use MSMQ (probably the easiest overall on Windows -- but with [IMO severe] limitations), then you should look into:
qpid - Decent use of AMQP standard
zeromq - (the best, IMO, technically but also requires the most familiarity with MQ technologies)
I'd recommend rabbitmq too, but that's an Erlang server and, last I looked, it didn't have usable C or C++ libraries. Still, if you are shopping for MQ, take a look at it...
[EDIT]
I've gone back and reread your reqs as well as some of your comments and think, for you, that perhaps client MQ -> server is not your best option. I would maybe consider letting your client -> server operations be HTTP POST or SOAP and allow the HTTP endpoint in turn queue messages on your MQ backend. IOW, abstract away the MQ client into an architecture you have more control over. Then your C++ client would simply be HTTP (easy), and your HTTP service (likely C# / .Net from reading your comments) can interact with any MQ backend of your choice. If all your HTTP endpoint does is spawn MQ messages, it'll be pretty darned lightweight and can scale through all the traditional load balancing techniques.
Last time I wanted to do any messaging I used C# and MSMQ. There are MSMQ libraries available that make using MSMQ very easy. It's free to install on both your servers, and it has never lost a message to this day. It handles reboots etc. all by itself. It's a thing of beauty, and hundreds of thousands of messages are processed daily.
I'm not sure why you ruled out MSMQ and I didn't get point 2.
Quite often for queues we just dump record data into a database table and another process lifts rows out of the table periodically.
How about using the Asynchronous Agents Library from the Visual C++ 2010 Concurrency Runtime? It is still in beta, though.
http://msdn.microsoft.com/en-us/library/dd492627(VS.100).aspx