Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?


The Zookeeper Watches documentation states:
"A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode." Furthermore, "Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper."
The point is, there is no guarantee that you'll get a watch notification for every change.
This is important because a system like Clojure's Avout tries to mimic Clojure's Software Transactional Memory over the network using ZooKeeper, and it relies on there being a watch notification for every change.
Now I'm trying to work out whether this is a coding flaw or a fundamental computer science problem (i.e. the CAP theorem).
My question is: Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?

This seems to be a limitation in the way ZooKeeper implements watches, not a limitation of the CAP theorem. There is an open feature request to add continuous watch to ZooKeeper: https://issues.apache.org/jira/browse/ZOOKEEPER-1416.
etcd has a watch function that uses long polling. The limitation here which you need to account for is that multiple events may happen between receiving the first long poll result, and re-polling. This is roughly analogous to the issue with ZooKeeper. However they have a solution:
However, the watch command can do more than this. Using the index [passing the last index we've seen], we can watch for commands that have happened in the past. This is useful for ensuring you don't miss events between watch commands.
curl -L 'http://127.0.0.1:4001/v2/keys/foo?wait=true&waitIndex=7'
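The index-based approach can be sketched with a toy in-memory log. This is only an illustration of the idea, not etcd's or ZooKeeper's API: `EventLog`, `watch` and `drain` are hypothetical names, and a real server would block on a long poll where this sketch returns `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EventLog:
    """Toy stand-in for a store that keeps an indexed history of changes."""
    events: List[Tuple[int, str]] = field(default_factory=list)  # (index, value)

    def set(self, value: str) -> int:
        """Record a change and return the index assigned to it."""
        index = len(self.events) + 1
        self.events.append((index, value))
        return index

    def watch(self, wait_index: int) -> Optional[Tuple[int, str]]:
        """Return the first event with index >= wait_index, or None if there is none yet."""
        for index, value in self.events:
            if index >= wait_index:
                return (index, value)
        return None


def drain(log: EventLog, next_index: int) -> Tuple[List[str], int]:
    """Consume every event from next_index onwards; return the values and the new index."""
    seen = []
    while True:
        event = log.watch(next_index)
        if event is None:
            return seen, next_index
        index, value = event
        seen.append(value)
        next_index = index + 1  # always resume right after the last event we saw
```

Because the client asks for "the first event at or after index N" rather than "the next event from now on", two changes that land between polls are both delivered on the following calls instead of being collapsed into one notification.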

Related

DDD - Concurrency and Command retrying with side-effects

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station(whether it is network-connected, if a car is charging using the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error(under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connector, whether it's charging or not, if it has a problem in the power system, ...
And the Aggregate, which its memory representation resides in the server that I control, not in the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station(pseudocode):
class StationAggregate {
  stationClient: StationClient
  URL: string
  connectors: Connector[]

  unlock(connectorId) {
    // Reject the command when the connector cannot be unlocked.
    if !this.connectors.find(connectorId).isAvailableToBeUnlocked() {
      return ErrorConnectorNotAvailable
    }
    // Side effect: ask the physical station to unlock the connector.
    error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
    if error {
      return ErrorStationRejectedUnlock
    }
    this.applyEvents([
      StationUnlockedEvent(connectorId, now())
    ])
    return Ok
  }

  receiveHeartbeat(timestamp) {
    this.applyEvents([
      StationSentHeartbeat(timestamp)
    ])
    return Ok
  }
}
I am using optimistic concurrency, which means that I load the Aggregate from a list of events and store the current version of the Aggregate in its memory representation: if StationAggregate is at version #2032, then after a command is successfully processed and its event(s) applied it would be at version #2033, for example. That way, I can put a unique constraint on the (StationID, Version) tuple in my persistence layer and guarantee that only one event is persisted for each version.
Suppose a Heartbeat message and an Unlock command arrive at the same time. Both threads would load the StationAggregate at version X. Handling the Heartbeat has no side effects, but handling the Unlock command has one: it tells the physical Charging Station to unlock. However, as I'm using optimistic concurrency, the StationUnlocked event could be rejected by the persistence layer. I don't know how to handle that, as I can't retry the command because it is inherently not idempotent (the physical Station would reject the second request).
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.

I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation has the version that is +1 from the version you've got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
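A minimal sketch of that retry loop, assuming an in-memory event store and a side-effect-free `decide` function (`InMemoryEventStore`, `WrongExpectedVersion` and `handle_with_retry` are hypothetical names, not taken from any particular event store client):

```python
from typing import Callable, Dict, List


class WrongExpectedVersion(Exception):
    """Raised when another writer appended to the stream first."""


class InMemoryEventStore:
    def __init__(self) -> None:
        self._streams: Dict[str, List[dict]] = {}

    def read(self, stream_id: str) -> List[dict]:
        return list(self._streams.get(stream_id, []))

    def append(self, stream_id: str, expected_version: int, events: List[dict]) -> None:
        stream = self._streams.setdefault(stream_id, [])
        # The stream length plays the role of the aggregate version.
        if len(stream) != expected_version:
            raise WrongExpectedVersion(stream_id)
        stream.extend(events)


def handle_with_retry(
    store: InMemoryEventStore,
    stream_id: str,
    decide: Callable[[List[dict]], List[dict]],
    max_attempts: int = 3,
) -> List[dict]:
    """Re-read the stream and re-run the decision whenever the append loses a race."""
    for _ in range(max_attempts):
        history = store.read(stream_id)
        new_events = decide(history)  # must not perform external side effects
        try:
            store.append(stream_id, len(history), new_events)
            return new_events
        except WrongExpectedVersion:
            continue  # someone else wrote first; try again on the fresh state
    raise WrongExpectedVersion(f"gave up on {stream_id} after {max_attempts} attempts")
```

The loop is only safe when `decide` has no external side effects, since it may run more than once for the same command.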
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits since you know what happened. Imagine that by accident you got the Unlock command twice. When you got the "wrong version" back from the store, you can read the last event and decide if the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked, by the same customer), technically (put the command id to the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence of the command handling, ignore the duplicate and return OK instead of failing to the user's face.
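The "technical" variant (command id in the event metadata) might look like the following sketch. The event shape and the names `already_handled` and `handle_unlock` are assumptions made for illustration only.

```python
from typing import List


def already_handled(history: List[dict], command_id: str) -> bool:
    """True when some earlier event was produced by the same command."""
    return any(
        event.get("metadata", {}).get("command_id") == command_id
        for event in history
    )


def handle_unlock(history: List[dict], command_id: str, connector_id: int) -> List[dict]:
    """Return the events to append; an empty list means 'duplicate, nothing to do'."""
    if already_handled(history, command_id):
        return []  # ignore the duplicate and report success to the caller
    return [
        {
            "type": "StationUnlocked",
            "connector_id": connector_id,
            "metadata": {"command_id": command_id},
        }
    ]
```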
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got with sending the command to the station at the same time as producing the event, is the variation of two-phase commit. Since it's not executed in a transaction, any of the two operations could fail and lead the system to an inconsistent state. You either don't know if the station got the command to unlock itself if the command failed to send, or you don't know that it's unlocked if the event persistence failed. You only got as far as the second operation, but the first case could happen too.
There are quite a few ways to solve it.
First, you can solve it entirely technically. With MassTransit, it's quite easy to fix using the Outbox. It will not send any outgoing messages until the consumer of the original message has fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then the retry filter would engage and the whole operation would be executed again; by then you are out of the race condition, so the operation would complete properly.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved and here Event Sourcing is helpful. You'd need to convert sending the command to the station from the original (user-driven) command consumer to the subscriber. You subscribe to the event stream of StationUnlocked event and let the subscriber send commands to the station. With that, you would only send commands to the station if the event was persisted and you can retry sending the command as many times as you'd need.
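A rough sketch of such a subscriber, assuming events are plain dictionaries, the checkpoint is a position in the stream, and `send` is a hypothetical callable that returns True when the station accepted the command:

```python
from typing import Callable, List


def run_subscriber(
    events: List[dict],
    checkpoint: int,
    send: Callable[[int], bool],
    max_attempts: int = 3,
) -> int:
    """Send station commands for persisted StationUnlocked events.

    Returns the new checkpoint, i.e. the position of the first event
    that still has to be processed on the next run.
    """
    position = checkpoint
    for event in events[checkpoint:]:
        if event["type"] == "StationUnlocked":
            delivered = False
            for _ in range(max_attempts):
                if send(event["connector_id"]):
                    delivered = True
                    break
            if not delivered:
                return position  # keep the checkpoint here and retry later
        position += 1
    return position
```

The command is only ever sent for events that were already persisted, and a failed delivery leaves the checkpoint in place so that a later run picks up the same event again. Delivery in this sketch is at-least-once, so the receiving side still has to tolerate a repeated command.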
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain, you can engage a domain service to consult the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)) - you return OK and safely ignore the command. If it's locked - you proceed. If it's unlocked and the session id doesn't match - it's obviously an error and you need to do something else.
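Those three cases can be written down as a small decision function. The `StationMirror` read model and the `Decision` values below are hypothetical and only illustrate the rule described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    IGNORE = "ignore"    # already unlocked for this session: return OK, do nothing
    PROCEED = "proceed"  # station is locked: go ahead with the unlock
    REJECT = "reject"    # unlocked for a different session: treat as an error


@dataclass(frozen=True)
class StationMirror:
    """Read model built from station telemetry (assumed shape)."""
    unlocked: bool
    session_id: Optional[str] = None


def decide_unlock(mirror: StationMirror, session_id: str) -> Decision:
    if not mirror.unlocked:
        return Decision.PROCEED
    if mirror.session_id == session_id:
        return Decision.IGNORE
    return Decision.REJECT
```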
In this last option, you would clearly separate telemetry processing from the business so you won't have heartbeats impact your domain model, so you really won't have the versioning issue. You also would always have a place to look at to understand what is the current state of the physical station.

Dataflow job stuck and not reading messages from PubSub

I have a dataflow job which reads JSON from 3 PubSub topics, flattening them in one, apply some transformations and save to BigQuery.
I'm using a GlobalWindow with following configuration.
.apply(Window.<PubsubMessage>into(new GlobalWindows()).triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterFirst.of(AfterPane.elementCountAtLeast(20000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
.discardingFiredPanes());
The job is running with following configuration
Max Workers : 20
Disk Size: 10GB
Machine Type : n1-standard-4
Autoscaling Algo: Throughput Based
The problem I'm facing is that after processing few messages (approx ~80k) the job stops reading messages from PubSub. There is a backlog of close to 10 Million messages in one of those topics and yet the Dataflow Job is not reading the messages or autoscaling.
I also checked the CPU usage of each worker and that is also hovering in single digit after initial burst.
I've tried changing machine type and max worker configuration but nothing seems to work.
How should I approach this problem ?

I suspect the windowing function is the culprit. GlobalWindow isn't suited to streaming jobs (which I assume this job is, due to the use of PubSub), because it won't fire the window until all elements are present, which never happens in a streaming context.
In your situation, it looks like the window will fire early once, when it hits either that element count or duration, but after that the window will get stuck waiting for all the elements to finally arrive. A quick fix to check if this is the case is to wrap the early firings in a Repeatedly.forever trigger, like so:
withEarlyFirings(
    Repeatedly.forever(
        AfterFirst.of(
            AfterPane.elementCountAtLeast(20000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
This should allow the early firing to fire repeatedly, preventing the window from getting stuck.
However for a more permanent solution I recommend moving away from using GlobalWindow in streaming pipelines. Using fixed-time windows with early firings based on element count would give you the same behavior, but without risk of getting stuck.

CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes process ArticleEditCommand messages concurrently, even though the commands will be taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a SQL transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex and can't live solely in SQL. I would like the entire command-processing logic to be guaranteed to run serially for ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to the user letting them know that there is a pending operation on that article that must be completed before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.

First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers, but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking that can be trivially handled by SQL contradicts complicated business logic/updates, maybe you could elaborate more?

It's actually quite simple and I did that. Basically, it looks like this ( pseudocode)
// message handler
ModelTools.TryUpdateEntity(
    () =>
    {
        var entity = _repo.Get(myId);
        entity.Do(whateverCommand);
        _repo.Save(entity);
    },
    10); // retry 10 times before giving up

// repository
long? _version;

public MyObject Get(Guid id)
{
    // query data and version
    _version = data.version;
    return data.ToMyObject();
}

public void Save(MyObject data)
{
    // update row in db where version = _version.Value
    if (rowsUpdated == 0)
    {
        // things have changed since we've retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools generic purpose library (available on Nuget).
The idea is to try doing things normally; then, if the object version (rowversion/timestamp in SQL) has changed, we retry the whole operation after waiting a couple of milliseconds. That's exactly what the TryUpdateEntity() method does. You can tweak how long to wait between tries and how many times it should retry the operation.
If you need to notify the user, then forget about retrying, just catch the exception directly and then tell the user to refresh or something.
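The same version check can be sketched end to end with Python's built-in sqlite3 module. This is an illustration under assumed names (an `article` table with `id`, `body` and `version` columns, and the helpers `save_article` and `edit_with_retry`); it is not the CavemanTools implementation.

```python
import sqlite3
from typing import Callable


class NewerVersionExists(Exception):
    """The row was changed by someone else since it was read."""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article ("
        "id INTEGER PRIMARY KEY, body TEXT, version INTEGER)"
    )


def save_article(conn: sqlite3.Connection, article_id: int, body: str, expected_version: int) -> None:
    # The WHERE clause makes the update a no-op if the version has moved on.
    cur = conn.execute(
        "UPDATE article SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (body, article_id, expected_version),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise NewerVersionExists(article_id)


def edit_with_retry(
    conn: sqlite3.Connection,
    article_id: int,
    edit: Callable[[str], str],
    max_attempts: int = 10,
) -> None:
    """Read, apply the edit in application code, save; start over on a version conflict."""
    for _ in range(max_attempts):
        body, version = conn.execute(
            "SELECT body, version FROM article WHERE id = ?", (article_id,)
        ).fetchone()
        try:
            save_article(conn, article_id, edit(body), version)
            return
        except NewerVersionExists:
            continue
    raise NewerVersionExists(article_id)
```

The business rules live in `edit`, which is ordinary application code; only the final compare-and-update runs in SQL.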

Partition based solution
Achieve node stickiness by routing each incoming command based on the object's ID (e.g. articleId modulo your-number-of-nodes) to make sure the commands of User1 and User2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one or, if you want to parallelize the execution, partition the commands on something like ID, odd/even, country or similar.
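As a small illustration of ID-based routing (the function names and the command shape are assumptions for this sketch), a stable hash of the article ID picks the partition, so commands for one article keep their arrival order on a single queue:

```python
import zlib
from typing import Dict, List


def partition_for(article_id: str, partitions: int) -> int:
    """Map an article ID to a partition using a hash that is stable across processes."""
    return zlib.crc32(article_id.encode("utf-8")) % partitions


def route(commands: List[Dict[str, str]], partitions: int) -> List[List[Dict[str, str]]]:
    """Distribute commands over per-partition queues, preserving arrival order."""
    queues: List[List[Dict[str, str]]] = [[] for _ in range(partitions)]
    for command in commands:
        queues[partition_for(command["article_id"], partitions)].append(command)
    return queues
```

`zlib.crc32` is used rather than Python's built-in `hash()`, because string hashes are randomized per process and every router node has to agree on the partition.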
Grid based solution
Use an in-memory grid (eg. Hazelcast or Coherence) and use a distributed Executor Service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User2's command would be accepted and User1 got a concurrency error back. As long as User1's changes are not lost and can be re-applied after a refresh of the article it might be perfectly fine.

What happens to running processes on a continuous Azure WebJob when website is redeployed?

I've read about graceful shutdowns here using the WEBJOBS_SHUTDOWN_FILE and here using Cancellation Tokens, so I understand the premise of graceful shutdowns, however I'm not sure how they will affect WebJobs that are in the middle of processing a queue message.
So here's the scenario:
I have a WebJob with functions listening to queues.
Message is added to Queue and job begins processing.
While processing, someone pushes to develop, triggering a redeploy.
Assuming I have my WebJobs hooked up to deploy on git pushes, this deploy will also trigger the WebJobs to be updated, which (as far as I understand) will kick off some sort of shutdown workflow in the jobs. So I have a few questions stemming from that.
1. Will jobs in the middle of processing a queue message finish processing the message before the job quits? Or is any shutdown notification essentially treated as "this bitch is about to shutdown. If you don't have anything to handle it, you're SOL."
2. If we are SOL, is our best option for handling shutdowns essentially to wrap anything you're doing in the equivalent of DB transactions and implement your shutdown handler in such a way that all changes are rolled back on shutdown?
3. If a queue message is in the middle of being processed and the WebJob shuts down, will that message be requeued? If not, does that mean that my shutdown handler needs to handle requeuing that message?
4. Is it possible for functions listening to queues to grab any more queue messages after the Job has been notified that it needs to shutdown?
Any guidance here is greatly appreciated! Also, if anyone has any other useful links on how to handle job shutdowns besides the ones I mentioned, it would be great if you could share those.

After no small amount of testing, I think I've found the answers to my questions and I hope someone else can gain some insight from my experience.
NOTE: All of these scenarios were tested using .NET Console Apps and Azure queues, so I'm not sure how blobs or table storage, or different types of Job file types, would handle these different scenarios.
1. After a Job has been marked to exit, the triggered functions that are running will have a configured grace period (5 seconds by default, but I think that is configurable using a settings.job file) to finish before they are exited. If they do not finish within the grace period, the function quits. Main() (or whichever file you declared host.RunAndBlock() in), however, will finish running any code after host.RunAndBlock() for up to the amount of time remaining in the grace period (I'm not sure how that would work if you used an infinite loop instead of RunAndBlock). As far as handling the quit in your functions, you can essentially "listen" to the CancellationToken that you can pass in to your triggered functions, checking IsCancellationRequested, and then handle it accordingly. Also, you are not SOL if you don't handle the quits yourself. Huzzah! See point #3.
2. While you are not SOL if you don't handle the quit (see point #3), I do think it is a good idea to wrap all of your jobs in transactions that you won't commit until you're absolutely sure the job has run its course. This way if your function exits mid-process, you'll be less likely to have to worry about corrupted data. I can think of a couple scenarios where you might want to commit transactions as they pass (batch jobs, for instance), however you would need to structure your data or logic so that previously processed entities aren't reprocessed after the job restarts.
3. You are not in trouble if you don't handle job quits yourself. My understanding of what's going on under the covers is virtually non-existent, however I am quite sure of the results. If a function is in the middle of processing a queue message and is forced to quit before it can finish, HAVE NO FEAR! When the job grabs the message to process, it will essentially hide it on the queue for a certain amount of time. If your function quits while processing the message, that message will "become visible" again after x amount of time, and it will be re-grabbed and run against the potentially updated code that was just deployed.
4. So I have about 90% confidence in my findings for #4. And I say that because to attempt to test it involved quick-switching between windows while not actually being totally sure what was going on with certain pieces. But here's what I found: on the off chance that a queue has a new message added to it in the grace period before a job quits, I THINK one of two things can happen: If the function doesn't poll that queue before the job quits, then the message will stay on the queue and it will be grabbed when the job restarts. However if the function DOES grab the message, it will be treated the same as any other message that was interrupted: it will "become visible" on the queue again and be rerun upon the restart of the job.
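The hide-then-reappear behaviour described in point #3 can be modelled with a toy queue. This is a simplified simulation with made-up names (`ToyQueue`, `visibility_timeout`), not the Azure Storage SDK, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # compare entries by identity so delete() removes the exact message
class Entry:
    body: str
    visible_at: float = 0.0


class ToyQueue:
    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._entries: List[Entry] = []

    def put(self, body: str) -> None:
        self._entries.append(Entry(body))

    def get(self, now: float) -> Optional[Entry]:
        """Hand out the first visible message and hide it for the timeout period."""
        for entry in self._entries:
            if entry.visible_at <= now:
                entry.visible_at = now + self.visibility_timeout
                return entry
        return None

    def delete(self, entry: Entry) -> None:
        """Called by the worker only after processing finished successfully."""
        self._entries.remove(entry)
```

A worker that is killed mid-message never calls `delete`, so the message becomes visible again once the timeout passes and the restarted job picks it up. That also means a message can be processed more than once, which is why the processing itself should be safe to repeat.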
That pretty much sums it up. I hope other people will find this useful. Let me know if you want any of this expounded on and I'll be happy to try. Or if I'm full of it and you have lots of corrections, those are probably more welcome!

Web application background processes, newbie design question

I'm building my first web application after many years of desktop application development (I'm using Django/Python but maybe this is a completely generic question, I'm not sure). So please beware - this may be an ultra-newbie question...
One of my user processes involves heavy processing in the server (i.e. user inputs something, server needs ~10 minutes to process it). On a desktop application, what I would do is throw the user input into a queue protected by a mutex, and have a dedicated low-priority background thread blocking on the queue using that mutex.
However in the web application everything seems to be oriented towards synchronization with the HTTP requests.
Assuming I will use the database as my queue, what is best practice architecture for running a background process?

There are two schools of thought on this (at least).
1. Throw the work on a queue and have something else outside your web-stack handle it.
2. Throw the work on a queue and have something else in your web-stack handle it.
In either case, you create work units in a queue somewhere (e.g. a database table) and let some process take care of them.
I typically work with number 1 where I have a dedicated windows service that takes care of these things. You could also do this with SQL jobs or something similar.
The advantage to item 2 is that you can more easily keep all your code in one place--in the web tier. You'd still need something that triggers the execution (e.g. loading the web page that processes work units with a sufficiently high timeout), but that could be easily accomplished with various mechanisms.
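Either way, the core is a table of work units plus a worker that claims and processes them. A minimal sketch using Python's built-in sqlite3 module follows; the `job` table, its columns and the helper names are assumptions for illustration, and a Django project would more likely express the same thing through its ORM or an existing task-queue library.

```python
import sqlite3
from typing import Callable, Optional, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job ("
        "id INTEGER PRIMARY KEY, payload TEXT, "
        "status TEXT DEFAULT 'queued', result TEXT)"
    )


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Called from the web request: store the work unit and return immediately."""
    cur = conn.execute("INSERT INTO job (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Pick the oldest queued job and mark it as running, or return None."""
    row = conn.execute(
        "SELECT id, payload FROM job WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status check in WHERE stops two workers from claiming the same job.
    cur = conn.execute(
        "UPDATE job SET status = 'running' WHERE id = ? AND status = 'queued'",
        (row[0],),
    )
    conn.commit()
    return row if cur.rowcount == 1 else None


def work_once(conn: sqlite3.Connection, process: Callable[[str], str]) -> bool:
    """Run by the background worker in a loop; returns False when the queue is empty."""
    job = claim_next(conn)
    if job is None:
        return False
    job_id, payload = job
    conn.execute(
        "UPDATE job SET status = 'done', result = ? WHERE id = ?",
        (process(payload), job_id),
    )
    conn.commit()
    return True
```

The web page can poll the `status` column to show the user whether the job is queued, running or done.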

Since:
1) This is a common problem,
2) You're new to your platform
-- I suggest that you look in the contributed libraries for your platform to find a solution to handle the task. In addition to queuing and processing the jobs, you'll also want to consider:
1) status communications between the worker and the web-stack. This will enable web pages that show the percentage complete number for the job, assure the human that the job is progressing, etc.
2) How to ensure that the worker process does not die.
3) If a job has an error, will the worker process automatically retry it periodically?
Will you or an operations person be notified if a job fails?
4) As the number of jobs increase, can additional workers be added to gain parallelism?
Or, even better, can workers be added on other servers?
If you can't find a good solution in Django/Python, you can also consider porting a solution from another platform to yours. I use delayed_job for Ruby on Rails. The worker process is managed by runit.
Regards,
Larry

Speaking generally, I'd look at running background processes on a different server, especially if your web server has any kind of load.

Running long processes in Django: http://iraniweb.com/blog/?p=56
