Retrieving node address from RemotingShutdownEvent - akka

We're currently updating from Akka 2.0.4 to 2.4.2 (I know, quite a big leap, but nobody thought to do it incrementally).
Anyway, in the old codebase, our master node is connected to some remote slave nodes that sometimes fail (the "why" is still to be investigated). When a slave dies, the master receives a RemoteClientShutdown event from which we can extract the getRemoteAddress and process it accordingly (e.g. inform the admin per email pointing to the failed node address).
In version 2.4.2 the RemoteClientShutdown class is replaced (at least I suppose so) by RemotingShutdownEvent which, being an object, doesn't carry any specific information as to the source of the event.
I've checked the migration guides as well as the current documentation but couldn't find info on how to solve this problem. According to the Event Bus documentation, the only way to extract such information is by providing it in the message ("Please note that the EventBus does not preserve the sender of the published messages. If you need a reference to the original sender you have to provide it inside the message").
Should I somehow override the message sent on the remote system shutdown? Or is there any other recommended way to solve it? I hope this question is not too newbie, I'm still quite new to Akka.

Ok, solved it using DisassociatedEvent which actually contains the address and other useful information. Turns out I was misled by the name of RemotingShutdownEvent which is actually received "when the remoting subsystem has been shut down" (docs) and not when a remote actor has been shut down.

Related

DDD - Concurrency and Command retrying with side-effects

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station(whether it is network-connected, if a car is charging using the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error(under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connector, whether it's charging or not, if it has a problem in the power system, ...
And the Aggregate, which its memory representation resides in the server that I control, not in the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station(pseudocode):
class StationAggregate {
stationClient: StationClient
URL: string
connector: Connector[]
unlock(connectorId) {
if this.connectors.find(connectorId).isAvailableToBeUnlocked() {
return ErrorConnectorNotAvailable
}
error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
if error {
return ErrorStationRejectedUnlock
}
this.applyEvents([
StationUnlockedEvent(connectorId, now())
])
return Ok
}
receiveHeartbeat(timestamp) {
this.applyEvents([
StationSentHeartbeat(timestamp)
])
return Ok
}
}
I am using a optimistic concurrency, which means that, I load the Aggregate from a list of events, and I store the current version of the Aggregate in its memory representation: StationAggregate in version #2032, when a command is successfully processed and event(s) applied, it would the in version #2033, for example. In that way, I can put a unique constraint on the (StationID, Version) tuple on my persistence layer, and guarantee that only one event is persisted.
If by any chance, occurs a receival of a Heartbeat message, and the receival of a Unlock command. In both threads, they would load the StationAggregate and would be both in version X, in the case of the Heartbeat receival, there would be no side-effects, but in the case of the Unlock command, there would be a side-effect that tells the physical Charging Station to be unlocked. However as I'm using optimistic concurrency, that StationUnlocked event could be rejected from the persistence layer. I don't know how I could handle that, as I can't retry the command, because it its inherently not idempotent(as the physical Station would reject the second request)
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.
I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation has the version that is +1 from the version you've got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits since you know what happened. Imagine that by accident you got the Unlock command twice. When you got the "wrong version" back from the store, you can read the last event and decide if the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked, by the same customer), technically (put the command id to the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence of the command handling, ignore the duplicate and return OK instead of failing to the user's face.
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got with sending the command to the station at the same time as producing the event, is the variation of two-phase commit. Since it's not executed in a transaction, any of the two operations could fail and lead the system to an inconsistent state. You either don't know if the station got the command to unlock itself if the command failed to send, or you don't know that it's unlocked if the event persistence failed. You only got as far as the second operation, but the first case could happen too.
There are quite a few ways to solve it.
First, solving it entirely technical. With MassTransit, it's quite easy to fix using the Outbox. It will not send any outgoing messages until the consumer of the original message is fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then, the retry filter would engage and the whole operation would be executed again and you already get out of the race condition, so the operation would be properly completed.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved and here Event Sourcing is helpful. You'd need to convert sending the command to the station from the original (user-driven) command consumer to the subscriber. You subscribe to the event stream of StationUnlocked event and let the subscriber send commands to the station. With that, you would only send commands to the station if the event was persisted and you can retry sending the command as many times as you'd need.
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain, you can engage a domain server to consult with the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)) - you return OK and safely ignore the command. If it's locked - you proceed. If it's unlocked and the session id doesn't match - it's obviously an error and you need to do something else.
In this last option, you would clearly separate telemetry processing from the business so you won't have heartbeats impact your domain model, so you really won't have the versioning issue. You also would always have a place to look at to understand what is the current state of the physical station.

ZMQ - Client Server: Client is powered off unexpectedly, how server detects it?

Multiple clients are connected to a single ZMQ_PUSH socket. When a client is powered off unexpectedly, server does not get an alert and keep sending messages to it. Despite of using ZMQ_OBLOCK and setting ZMQ_HWM to 5 (queue only 5 messages at max), my server doesn't get an error until unless client is reconnected and all the messages in queue are received at once.
I recently ran into a similar problem when using ZMQ. We would cut power to interconnected systems, and the subscriber would be unable to reconnect automatically. It turns out the there has recently (past year or so) been implemented a heartbeat mechanism over ZMTP, the underlying protocol used by ZMQ sockets.
If you are using ZMQ version 4.2.0 or greater, look into setting the ZMQ_HEARTBEAT_IVL and ZMQ_HEARTBEAT_TIMEOUT socket options (http://api.zeromq.org/4-2:zmq-setsockopt). These will set the interval between heartbeats (ZMQ_HEARTBEAT_IVL) and how long to wait for the reply until closing the connection (ZMQ_HEARTBEAT_TIMEOUT).
EDIT: You must set these socket options before connecting.
There is nothing in zmq explicitly to detect the unexpected termination of a program at the other end of a socket, or the gratuitous and unexpected failure of a network connection.
There has been historical talk of adding some kind of underlying ping-pong are-you-still-alive internal messaging to zmq, but last time I looked (quite some time ago) it had been decided not to do this.
This does mean that crashes, network failures, etc aren't necessarily handled very cleanly, and your application will not necessarily know what is going on or whether messages have been successfully sent. It is Actor model after all. As you're finding your program may eventually determine something had previously gone wrong. Timeouts in zmtp will spot the failure, and eventually the consequences bubble back up to your program.
To do anything better you'd have to layer something like a ping-pong on top yourself (eg have a separate socket just for that so that you can track the reachability of clients) but that then starts making it very hard to use the nice parts of ZMQ such as push / pull. Which is probably why the (excellent) zmq authors decided not to put it in themselves.
When faced with a similar problem I ended up writing my own transport library. I couldn't find one off the shelf that gave nice behaviour in the face of network failures, crashes, etc. It implemented CSP, not actor model, wasn't terribly fast (an inevitability), didn't do patterns in the zmq sense, but did mean that programs knew exactly where messages were at all times, and knew that clients were alive or unreachable at all times. The CSPness also meant message transfers were an execution rendezvous, so programs know what each other is doing too.

Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?

The Zookeeper Watches documentation states:
"A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode." Furthermore, "Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper."
The point is, there is no guarantee you'll get a watch notification.
This is important, because in a sytem like Clojure's Avout, you're trying to mimic Clojure's Software Transactional Memory, over the network using Zookeeper. This relies on there being a watch notification for every change.
Now I'm trying to work out if this is a coding flaw, or a fundamental computer science problem, (ie the CAP Theorem).
My question is: Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?
This seems to be a limitation in the way ZooKeeper implements watches, not a limitation of the CAP theorem. There is an open feature request to add continuous watch to ZooKeeper: https://issues.apache.org/jira/browse/ZOOKEEPER-1416.
etcd has a watch function that uses long polling. The limitation here which you need to account for is that multiple events may happen between receiving the first long poll result, and re-polling. This is roughly analogous to the issue with ZooKeeper. However they have a solution:
However, the watch command can do more than this. Using the index [passing the last index we've seen], we can watch for commands that have happened in the past. This is useful for ensuring you don't miss events between watch commands.
curl -L 'http://127.0.0.1:4001/v2/keys/foo?wait=true&waitIndex=7'

Websphere MQ - error with reason code 2042 on a get

We're getting an intermittent error on a ImqQueue::get( ImqMsg &, ImqGetMessageOptions & ); call with reason code 2042, which Should Not Happen™ based on the Websphere documentation; we should only get that reason code on an open.
Would this error indicate that the server could not open a queue on its side, or does it indicate that there's a problem in our client? What is the best way to handle this error? Right now we just log that it occurs, but it's happening a lot. Unfortunately I'm not well-versed in Websphere MQ; I'm kind of picking this up as I go, so I don't have all the terminology correct.
Our client is written in C++ linking against libmq 6.0.2.4 and running on SLES-10. I don't know the particulars for the server other than it's running version 7.1. We're requesting an upgrade to bring our side up-to-date. We have multiple instances of the client running concurrently; all are using the same request queue, but each is creating its own dynamic reply queue with MQOO_INPUT_EXCLUSIVE + MQOO_INPUT_FAIL_IF_QUIESCING.
If the queue is not already open, the ImqQueue::get method will implicitly open the queue for you. This will end up with the MQOO_INPUT_AS_Q_DEF option being used which will therefore use the DEFSOPT(EXCL|SHARED) attribute on the queue. You should also double check that the queue is defined SHARE rather than NOSHARE, but I suspect that will already be correctly set.
You mention that you have multiple instances of the application running concurrently so if one of them has the queue opened implicitly as MQOO_INPUT_AS_Q_DEF resulting in MQOO_INPUT_SHARED from DEFSOPT, then it will get 2042 (MQRC_OBJECT_IN_USE) if others have it open. If nothing else had it open at the time, then the implicit open will work, and later instances will instead get the 2042.
If it is intermittent, then I suggest there is a path through your application where ImqQueue::open method is not invoked. While you look for that, changing the queue definition to DEFSOPT(SHARED) should get rid of the 2042s.

Quickfix: acceptor and initator in same application?

I am new to quickfix (I'm a student trying to teach myself), and have downloaded the examples from quickfix.org (in c++) and have been able to connect ordermatch to tradeclient and get them talking to each other. I changed the config file for ordermatch to allow multiple clients and got that working (ordermatch can receive orders from multiple clients and manage the order book).
I have been trying to find a way to alter ordermatch to send it's confirm messages to ALL clients, not just the sender.
I have a seperate implementation of a limit orderbook and want to crack the incoming messages (orders, cancels, etc) and store them in my limit orderbook. My orderbook watches the book an makes trading decisions based on it. The problem is, I can't figure out how to get ordermatch to send all updates to this client. Further, I am having a hard time figuring out how to "soup up" the tradeclient to not only send orders, but receive and crack them.
I'm thinking I need to have an acceptor and an initator in each application(in ordermatch and in one of the tradeclients)--I've read this is possible and common but can't find any sample code. Am I on the right track here, or is there a better way to set this up? Does anybody have some sample code they can share? I am not planning on using this for live trading so crude code is perfectly fine by me.
Thanks in advance
Brandon
Same application can act as Initiator for one session and Acceptor for different session.
Infact you can have multiple Acceptor/Initiator sessions from same application.
Config file needs to define multiple sessions.
Or you can have separate config file for each session.
If I understand correctly, I think what you're trying to do is intercept messages between an OMS and a broker (c.f. client and server) and act depending on what they contain. There are a few ways you could do this, including intercepting at the TCP layer, but I think that the easiest way might be to use 2 separate programs as #DumbCoder suggests and connect to one of them as an acceptor from your clients, process the messages and then pass them on to another program via another protocol and then send them on from the other program. Theoretically you can create another instance of the engine in your program and, by using different config files on creation (when FIX::FileStoreFactory storeFactory(*settings); is called) of each instance of the engine. However, I have never seen this done and so feel that it could cause problems. If you do try this method I would strongly advise putting the initiator and the connector in different dlls which might just separate the two engine instances enough.