Apache storm execute bolts on different machines (designated node) - python-2.7

I want to create a topology with one spout that emits words and a bolt that creates a directory named after each word.
I have two supervisor nodes running, and I want the directory to be created on one node if the word starts with "a" to "l", and on the other node otherwise.
E.g. if the word is 'acknowledgement', the directory is created on one node, and if the word is "machine", it is created on the other node.
Please suggest a way to configure Storm to achieve this.
I would also like to know whether one bolt is enough or, if two bolts are deployed, how to ensure that one bolt runs on one machine and the other on the other machine.
P.S. I am using Pyleus(https://github.com/Yelp/pyleus) for creating bolts and spout.

You can use a single bolt with two instances of it. Each instance of this bolt runs on a different supervisor node. Use the custom grouping functionality to achieve this: your custom grouping logic decides which instance of the bolt each word is dispatched to.
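
The routing decision itself can be sketched in a few lines of Python. This is only an illustration of the grouping logic: in Storm the hook for it is the Java `CustomStreamGrouping` interface, and whether Pyleus exposes that hook is an assumption that needs checking. The function name `choose_task` and the two-element `target_tasks` list are hypothetical. The sketch picks a bolt instance; it does not by itself control which machine that instance is placed on.

```python
def choose_task(word, target_tasks):
    """Return the task id that should receive `word`.

    `target_tasks` is assumed to hold exactly two bolt task ids in a fixed order.
    """
    if len(target_tasks) != 2:
        raise ValueError("expected exactly two target tasks")
    first = word[:1].lower()
    # Words starting with 'a'..'l' go to the first task, everything else to the second.
    if "a" <= first <= "l":
        return target_tasks[0]
    return target_tasks[1]
```

With task ids `[3, 4]`, "acknowledgement" would be sent to task 3 and "machine" to task 4.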

Basically, you can't be sure a bolt or spout will be present in a specific worker (JVM). This is part of Storm's design: workers run on different but similar hardware, and Storm decides which bolt / spout instances go to which worker.
Storm includes an abstraction: topologies are not tied to a cluster and can be rebalanced at runtime. It is wonderfully resilient and performant, but you can't easily write code that runs on a specific node (doing so would also be an anti-pattern in Storm's philosophy).
AFAIK Storm uses a mod hash function to distribute tasks / executors among workers (managed by supervisors), and you can't override it easily.
So your best bet is to have tasks >= executors >= workers, as Storm will try to divide the load equally among the workers of your cluster.

Use of redis cluster vs standalone redis

I have a question about when it makes sense to use a Redis cluster versus standalone Redis.
Suppose one has a real-time gaming application that allows multiple instances of the game and wishes to implement
a real-time leaderboard for each instance. (Games are created by communities of users.)
Suppose that at any time we have, say, 100 simultaneous matches running.
Based on the use cases outlined here :
https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf
https://redislabs.com/solutions/use-cases/leaderboards/
https://aws.amazon.com/blogs/database/building-a-real-time-gaming-leaderboard-with-amazon-elasticache-for-redis/
We can implement each leaderboard using a Sorted Set dataset in memory.
Now I would like to implement some sort of persistence where leaderboard state is saved at the end of each
game as a snapshot. Thus each of these independent Sorted Sets is saved as a snapshot file.
I have a question about design choices:
Would a redis cluster make sense for this scenario ? Or would it make more sense to have standalone redis instances and create a new database for each game ?
As far as I know there is only a single database 0 for a single redis cluster.(https://redis.io/topics/cluster-spec)
In that case, how would one snapshot the dataset for each leaderboard at a different time?
From what I can see using a Redis cluster only makes sense for large-scale monolithic applications and may not be the best approach for the scenario described above. Is that the case ?
Or if one goes with AWS Elasticache for Redis Cluster mode can I configure snapshotting for individual datasets ?
You are correct, clustering is a way of scaling out to handle really high request loads and store tons of data.
It really doesn't quite sound like you need to bother with a cluster.
I'd be very surprised if a standalone Redis setup became your bottleneck before you have several tens of thousands of simultaneous players.
If you are unsure, you can probably mock some simulated load and see what it can handle. My guess is that you are better off focusing on other complexities of your game until you start reaching quite serious usage. Which is a good problem to have. :)
You might however want to consider having one or two replica instances, which is a different thing.
Secondly, regardless of cluster or not, why do you want to use snap-shots (SAVE or BGSAVE) to persist your scoreboard?
If you want to have individual snapshots per game, and it's only a few keys per game, why don't you just have your application read those keys and persist them to a traditional db when needed? You can for example use MULTI, DUMP and RESTORE to achieve something that is very similar to snapshotting, but on the specific keys you want.
It doesn't sound like multiple databases are warranted for this.
Multiple databases on clustered Redis is only supported in the Enterprise version, so not on ElastiCache. But the above mentioned approach should work just fine.
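
As a rough sketch of persisting one game's keys from the application, the following copies a single Sorted Set into a relational table. It reads the members with ZRANGE and stores them as JSON instead of using DUMP / RESTORE, so it is a variation on the idea rather than the exact commands named above. Assumptions: `redis_client` behaves like a redis-py client (only `zrange(..., withscores=True)` is used), SQLite stands in for the "traditional db", and the `leaderboard:<game_id>` key scheme and table name are hypothetical.

```python
import json
import sqlite3


def _text(value):
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return value.decode("utf-8") if isinstance(value, bytes) else value


def save_leaderboard(redis_client, db, game_id):
    """Copy one game's sorted set into a relational table as a JSON snapshot."""
    key = "leaderboard:%s" % game_id  # hypothetical key naming scheme
    # ZRANGE key 0 -1 WITHSCORES returns every member together with its score.
    entries = redis_client.zrange(key, 0, -1, withscores=True)
    payload = json.dumps([[_text(member), float(score)] for member, score in entries])
    db.execute(
        "CREATE TABLE IF NOT EXISTS leaderboard_snapshot "
        "(game_id TEXT PRIMARY KEY, entries TEXT)"
    )
    db.execute(
        "INSERT OR REPLACE INTO leaderboard_snapshot (game_id, entries) VALUES (?, ?)",
        (game_id, payload),
    )
    db.commit()
    return payload
```

Calling `save_leaderboard(client, sqlite3.connect("snapshots.db"), game_id)` at the end of a match would give one row per game, saved at whatever time that game finishes.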

What are cgroups and how are people using them for cluster administration?

Are there examples of how people are using cgroups to better manage research computing clusters that run parallel scientific codes and serial codes for an academic community?
The primary example I'm aware of is being able to set the cluster scheduler (e.g. Slurm) to assign multiple jobs to a single node without worrying about a renegade job utilizing more resources than assigned.
Cgroups are the mechanism that ensures the different jobs can only use the resources assigned to them by Slurm.
Prior to having cluster schedulers capable of doing this, many HPC centers allowed only one job per node or one user per node. Otherwise a job that requested only 1 core, for example, could, once running, actually use all the cores in the node, which would cause other jobs on the node to run poorly.

How to start a specific number of worker actors at startup?

I created a clustered Akka app based on:
https://github.com/typesafehub/activator-akka-distributed-workers-java/blob/master/tutorial/index.html
Is there any built-in feature to run a specific number of actors of a given type in a cluster? Should I create a router, or is there a better way?
http://doc.akka.io/docs/akka/snapshot/java/routing.html
Yes, have a look at Cluster-aware routers.

Akka clustering - force actors to stay on specific machines

I've got an Akka application that I will be deploying on many machines. I want each of these applications to communicate with the others by using the distributed publish/subscribe event bus features.
However, if I set the system up for clustering, then I am worried that actors for one application may be created on a different node to the one they started on.
It's really important that an actor is only created on the machine that the application it belongs to was started on.
Basically, I don't want the elasticity or the clustering of actors, I just want the distributed pub/sub. I can see options like singleton or roles, mentioned here http://letitcrash.com/tagged/spotlight22, but I wondered what the recommended way to do this is.
There is currently no feature in Akka which would move your actors around: either you programmatically deploy to a specific machine or you put the deployment into the configuration file. Otherwise it will be created locally as you want.
(Akka may one day get automatic actor tree partitioning, but that is not even specified yet.)
I think this is not the best way to use elastic clustering. But we have also considered the same issue, and found that it can be useful to spread actors over the nodes by a hash of the entity id (like database shards). For example, on each node we create one NodeRouterActor that proxies messages to multiple WorkerActors. When we send a message to a NodeRouterActor, it selects the endpoint node by looking it up in a hash table with the key id % nodeCount; the endpoint NodeRouterActor then proxies the message to the specific WorkerActor that controls the entity.
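
The id % nodeCount lookup can be sketched independently of Akka. The snippet below is plain Python, not actor code; the function names are hypothetical, and the way a worker is chosen inside a node is an assumption, since the answer only describes how the node is selected.

```python
def select_node(entity_id, nodes):
    """Pick the node responsible for an entity using id % nodeCount."""
    if not nodes:
        raise ValueError("at least one node is required")
    return nodes[entity_id % len(nodes)]


def route(entity_id, nodes, workers_per_node):
    """Return (node, worker index) for an entity.

    The worker index formula is an assumed scheme for spreading entities
    over the workers of the selected node.
    """
    node = select_node(entity_id, nodes)
    worker_index = (entity_id // len(nodes)) % workers_per_node
    return node, worker_index
```

Every sender that knows the same ordered node list computes the same target, so all messages for one entity reach the same worker. Note that with plain modulo routing, changing the number of nodes remaps most entities.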

AWS celery and database

I am writing a webapp that runs on AWS. My app requires users to upload their PDF files. I will convert them into images using the "convert" utility on Linux.
Here is my setup on Ubuntu 12.04:
Django
Celery
Django Celery
Boto
I am using apache as my webserver.
The work flow is as follows:
There are three asynchronous tasks and two queues for handling all the processing, and S3 for storing input and output files.
A user uploads a pdf then:
accept_file_task is called: this task takes the user-uploaded pdf, stores it in my S3 storage, and then inserts a message into the input_queue (SQS)
check_queue_and_launch_instance_task: a periodic task that keeps monitoring the number of messages in the input_queue and launches instances whenever the queue has more messages than the number of EC2 instances
The instances have a bootstrap script which is a while True: loop. Any of the instances can pick a message from the input_queue, do a subprocess.Popen("convert "+input+output), write the processed state to the output_queue, upload the generated image to the S3 output bucket, and make it available as a download link
output_process_task: another periodic task that keeps polling the output_queue and, whenever a message is available, updates the status in the table mentioned below.
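
For the conversion step inside the worker loop, a minimal sketch is shown here. It assumes that "convert" is the ImageMagick command-line tool and that it is on the PATH; the function names are hypothetical. Passing the arguments as a list, rather than concatenating them into one string, keeps the input and output paths as separate arguments even when they contain spaces.

```python
import subprocess


def build_convert_command(input_path, output_path):
    """Return the argument list for `convert INPUT OUTPUT`."""
    # A list keeps the two paths as separate arguments, even if they contain spaces.
    return ["convert", input_path, output_path]


def convert_pdf(input_path, output_path):
    """Run the conversion; raises CalledProcessError if convert exits non-zero."""
    subprocess.check_call(build_convert_command(input_path, output_path))
```

Because `check_call` raises on failure, the worker can catch the exception and write a failed state to the output_queue instead of silently continuing.
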
I am using a model called Document to store all the status information. I also have users registering and hence a table to store all user information. Also Celery created a lot of tables to store all its task information. Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
I am unsure about the following things
How do I scale up the database? Should I go for a RDS or a simpleDB or AmazonDB. If not celery, I could have easily used simpleDB. I am really stuck on this one
How do I get rid of the two periodic tasks check_queue_and_launch_instance_task and output_process_task? My idea is that Autoscaling must be used in some way so that, if needed at a later stage, an Elastic Load Balancer can be used.
If any of you have designed something similar please help me on how to go about it
How do I scale up the database? Should I go for a RDS or a simpleDB or AmazonDB. If not celery, I could have easily used simpleDB. I am really stuck on this one
Keep in mind that premature optimization is the root of all evil. The question of RDS (which is really just MySQL, Oracle, or MS SQL) vs. SimpleDB is more of an application design decision than one based on scalability. SimpleDB is just a simple key-value store. RDS, on the other hand, will give you full ACID functionality. If your data is relational, then you should be using a relational database. If you just need a place to store simple strings or integers, then something like SimpleDB would make more sense.
Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
Make sure that you understand the consequences of a) creating a single point-of-failure in your design and b) SQLite's limitations compared to using a standalone RDBMS in this application. (You can use it, but it's really intended for single-user applications).
How do I get rid of the two periodic tasks check_queue_and_launch_instance_task and output_process_task? My idea is that Autoscaling must be used in some way so that, if needed at a later stage, an Elastic Load Balancer can be used.
If you're willing to replace Celery with SQS, you can tie together SQS + SNS + CloudWatch to simplify this portion of your app. That said, what you're doing doesn't sound like a bad choice, especially if it's working well already. Your time is probably better spent working on the problems in front of you rather than those that might occur down the road.