I am creating actors that represents physical devices and their state. As devices come online I create them "on demand" by sending and Identify message to the actor's path and then if it does not exist yet I create one. Potentially, there could be several million of these devices.
My concern is that the Identify look-up will take a performance hit as the number of actors increases. Is this a valid concern?
I was considering using a router strategy to segment the actors, but then I found that searching on the path with a wild card for the router yielded ActorIdentities from each router. I assume that a ConsistentHashingRouter would suit this scenario, but before I go down that rabbit hole I just want to make sure I am not optimizing prematurely.
The entity which creates an actor is only its parent (there no other way), which means that that parent actor does not need to use Identify at all, just check context.child(name).isDefined. That is very efficient, although you might want to shard your devices across multiple parents if you really have a massive number of them.
Related
If you are familiar with Trello, would storing an entire Trello board as an actor (with akka persistence) be a good use case?
A trello board consists of:
lists
tasks in a list
each task can have comments and other properties
What are the general best practices or considerations when deciding if akka persistance is a good use case for a given problem set?
Any context where event sourcing is a good fit is a good fit for Akka Persistence.
Event sourcing, in turn, is generally applicable (note that nearly any DB you're using is event sourcing (with exceptionally frequent snapshotting, truncation of the event log, and purging of old snapshots)).
Event sourcing works really well when you want to explicitly model how entities in your domain change over time: you're effectively defining an algebra of changes. The richer (i.e. the further from just create/update) this model of change is, the more it's a fit for event sourcing. This modeling of change in turn facilitates letting other components of a system only update their state when needed.
Akka Persistence, especially when used with cluster sharding, lets you handle commands/requests without having to read from a DB on every command/request (basically, you'll read from the DB when bringing back an already persisted actor, but subsequent commands/requests (until such time as the actor passivates or dies) don't require such reads). The model of parent and child actors in Akka also tends to lead to a natural encoding of many-to-one relationships.
In the example of a trello board, I would probably have
each board be a persistent actor, which is parent to
lists, which are persistent actors and are each parents to
list items, which are also persistent actors
Depending on how much was under a list item, they might in turn have child persistent actors (for comments, etc.).
It's probably worth reading up on domain-driven design. While DDD doesn't require the actor model (nor vice versa), and neither of them requires event sourcing (nor vice versa), I and many others have found that they reinforce each other.
It mostly depends on how much write the app wants to perform.
Akka persistence is an approach to achieve very high write throughput while ensuring the persistence of the data, i.e., if the actor dies and data in memory is lost, it is fine because the write logs are persisted to disk.
If the persistence of the data is necessary, while very high write throughput is not required (imagine the app updates the Trello board 1 time per second), then it is totally fine to simply writing the data to external storage.
would storing an entire Trello board as an actor (with akka
persistence) be a good use case
I would say the size of the actor should match the size of an Aggregate Root. Making an entire board an Aggregate Root seems like a very bad choice. It means that all actions on that board are now serialized and none can happen concurrently. Why should changing description of card #1 conflicts with moving car #2 to a different category? Why should creating a new board category conflict with assigning card #3 to someone?
I mean, you could make an entire system a single actor and you wouldn't ever have to care about race conditions, but you'd also kill your scalability...
I'm working to design a middle layer for an application that will receive up to ~5000 requests every few seconds and need to retrieve information from a database. I've been looking at use the Play Framework (I use scala for my REST api design) as they say its fully async and built on Akka. However, the main bottleneck of any solution seems to happen during read/writes to the database. Many Database cannot support simultaneous read/writes from a database of such a scale. How is such high concurrency achieved then for an app like this? I would guess Facebook/Twitter/ (name other big company) may have achieved this for their Applications as millions of people may be using them concurrently.
As Tim's comment was saying caching may or may not be able to help in your case. If not I would also recommend looking into horizontally scalable databases, for example cockroachdb if you want a transactional SQL db. Otherwise there are many no-sql choices a la mongodb etc. And if you really want to stick to traditional SQL systems you'll have to vertically scale your servers (buy the most expensive hardware) and work with read-replicas.
A huge component is your data model and query access pattern. If each query is incrementing a shared counter that has to be synchronized there will be a ton of contention, but if each query is touch completely separate data on the other end the spectrum than there will be a lot less contention.
I think there are a couple of dimensions I would consider:
Data Schema and Access Patterns (discussed above)
Language Choice
This is important becaues if you were in a web server context and were using prefork by default each process may have its own connection to the database. In an environment like python or ruby you may need hundreds of processes to handle your load. Contrast this with akka or another async networking based runtime (node, python gevent/asyncio, go, etc) where a single instance with a small thread pool can handle a large number of requests. Each have their tradeoffs.
Distributed Systems
Depending on your data schema and access patterns 5000 requests per second to a RDBMS is completely achievable. It would probably require relatively beefy hardware but but I'v personally done it a number of times. Getting to larger scales requires more computers in order to distribute the work/load. If your workload is right heavy and you can support potentially stale reads, a read replica is one option. With another machine in the mix reads are distributed over 2 machines but writes are still directed at a single machine (leader). Caching is another option.
At much higher workloads some sort of partitioning needs to occur in order to overcome the constraints of a single machine. https://github.com/vitessio/vitess
Many of the big contenders have solutions to horizontally scaling their databases. This has many drawbacks as well and will require careful planning.
The one thing I'd recommend is that if 5000 requests per second is projected for the near future, start with the minimal amount of hardware necessary (single instance) query patterns and operation get exponentially more complicated with a distributed database.
I have two nodes/machines/JVMs connected with Akka.
Each JVM has one different actor.
I am going to send large amount of messages from one actor to second.
Should I use tell() or create topic and subscribe/publish?
As I understand, tell is one-to-one communication and subscribe/publish is for one-to-many and many-to-many, however, as subscribe/publish also works for one-to-one (each actor may subscribe their own topic and publish to other, and thus communicate).
I do not know which one is better in this case.
What is difference in subscribe/publish and tell in one-to-one communication?
I am particularly interested in:
- performance (case 1: both on single node, case 2: on different nodes)
- design (with tell() I need to pass ActorRef first, while subscribe/publish allows to stay "anonymous")
I also should note, that in future I may need to stream big files between actors. By "big" I mean up to 4gb, but mostly thousands of kb/mb smaller files.
I'm trying to write up a tool that requires knowledge of the state of other machines in a cluster (local LAN). This is for a network failover/high availability system similar to VRRP and corosync/openais, but I wish to contain more information (such as near real-time speed/performance characteristics) so devices can make more intelligent choices. This means using a protocol more complicated than a predetermine weight-based mechanism: by allowing all clustered machines to see the state of each other, they can communally agree on which is the most suitable to be the master device.
From my searches, I haven't found any (C, C++ or JavaME) libraries that offer a distributed state mechanism. Ideally, I'm looking for something that broadcasts/multicasts each individual machines state periodically so participating machines can build up a global state table and all can see who the master should be. State in this case is arbitrary key/value pairs.
I'd rather not re-invent any wheels so am curious to know if anyone here can point me in the right direction?
If I were you I'd investigate memcached (memcached.org) or one of the nosql variants.
It sounds like Apache ZooKeeper might be a good match. It's distributed, hierarchical key-value store. To quote their Overview page:
ZooKeeper was designed to store coordination data: status information, configuration, location information, etc.
Here's an example of a simple Leader Election recipie, although it would require adaptation to determine a leader by some weighted criterion.
I'm not sure if there is any application for your purpose or not.
But I know that you can write a simple program with MPI library and broadcast any information that you want.
all client's can send their state to root node, and the root node then broadcast the message.
functions that you need for this are:
MPI_Bcast
MPI_Send
MPI_Recv
there is lots of tutorial on C++/MPI on net, just google it!
Since there's no complete BPM framework/solution in ColdFusion as of yet, how would you model a workflow into a ColdFusion app that can be easily extensible and maintainable?
A business workflow is more then a flowchart that maps nicely into a programming language. For example:
How do you model a task X that follows by multiple tasks Y0,Y1,Y2 that happen in parallel, where Y0 is a human process (need to wait for inputs) and Y1 is a web service that might go wrong and might need auto retry, and Y2 is an automated process; follows by a task Z that only should be carried out when all Y's are completed?
My thoughts...
Seems like I need to do a whole lot of storing / managing / keeping
track of states, and frequent checking with cfscheuler.
cfthread ain't going to help much since some tasks can take days
(e.g. wait for user's confirmation).
I can already image the flow is going to be spread around in multiple UDFs,
DB, and CFCs
any opensource workflow engine in other language that maybe we can port over to CF?
Thank you for your brain power. :)
Study the Java Process Definition Language specification where JBoss has an execution engine for it. Using this Java based engine may be your easiest solution, and it solves many of the problems you've outlined.
If you intend to write your own, you will probably end up modelling states and transitions, vertices and edges in a directed graph. And this as Ciaran Archer wrote are the components of a State Machine. The best persistence approach IMO is capturing versions of whatever data is being sent through workflow via serialization, capturing the current state, and a history of transitions between states and changes to that data. The mechanism probably needs a way to keep track of who or what has responsibility for taking the next action against that workflow.
Based on your question, one thing to consider is whether or not you really need to represent parallel tasks in your solution. Where instead it might be possible to en-queue a set of messages and then specify a wait state for all of those to complete. Representing actual parallelism implies you are moving data simultaneously through several different processes. In which case when they join again you need an algorithm to resolve deltas, which is very much a non trivial task.
In the context of ColdFusion and what you're trying to accomplish, a scheduled task may be necessary if the system you're writing needs to poll other systems. Consider WDDX as a serialization format. JSON, while seductively simple, I recall has some edge cases around numbers and dates that can cause you grief.
Finally see my answer to this question for some additional thoughts.
Off the top of my head I'm thinking about the State design pattern with state persisted to a database. Check out the Head First Design Patterns's Gumball Machine example.
Generally this will work if you have something (like a client / order / etc.) going through a number of changes of state.
Different things will happen to your object depending on what state you are in, and that might mean sitting in a database table waiting for a flag to be updated by a user manually.
In terms of other languages I know Grails has a workflow module available. I don't know if you would be better off porting to CF or jumping ship to Grails (right tool for the job and all that).
It's just a thought, hope it helps.