Autoscale workers on DigitalOcean

I have 3 servers: DB, Web, and Worker. The worker just runs Sidekiq jobs all day long.
As soon as the queue is over 100,000 jobs, I want a second worker instance, and I struggle a bit with how to think this through (and if the queue is above 300,000 I need 3 workers, and so on). My plan:
I take my worker and make a snapshot.
Via the DigitalOcean API I create a new instance based on that image.
As soon as the instance boots, it pulls the latest code from the Git repository.
I tell the database server that it is allowed to receive connections from this instance's IP.
As soon as the queue is below 20,000, I can kill the instance.
Is this the right way to do it, or are there better approaches? Am I missing something?
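A minimal sketch of that loop, run periodically from the web or worker server (in Python for brevity; the same HTTP calls can be made from Ruby). The environment variables, Redis host, queue name, region, size and the "autoscale:droplets" tracking list are illustrative assumptions, not part of the question:

import os
import redis
import requests

DO_API = "https://api.digitalocean.com/v2/droplets"
HEADERS = {"Authorization": "Bearer " + os.environ["DO_TOKEN"]}   # DigitalOcean API token (assumed)
SNAPSHOT_ID = os.environ["WORKER_SNAPSHOT_ID"]                    # snapshot of the worker droplet (assumed)

r = redis.Redis(host="db.internal", port=6379)    # the same Redis that Sidekiq uses (assumed host)
queue_size = r.llen("queue:default")              # Sidekiq stores each queue as a plain Redis list
extra_workers = [int(i) for i in r.lrange("autoscale:droplets", 0, -1)]

wanted_extra = queue_size // 100000               # one extra worker per 100,000 queued jobs

if len(extra_workers) < wanted_extra:
    resp = requests.post(DO_API, headers=HEADERS, json={
        "name": "worker-extra-%d" % (len(extra_workers) + 1),
        "region": "nyc3",                         # region, size and image are illustrative
        "size": "s-2vcpu-4gb",
        "image": SNAPSHOT_ID,
    })
    resp.raise_for_status()
    r.rpush("autoscale:droplets", resp.json()["droplet"]["id"])
elif queue_size < 20000 and extra_workers:
    # the queue has drained: destroy the most recently added extra worker
    requests.delete("%s/%s" % (DO_API, extra_workers[-1]), headers=HEADERS).raise_for_status()
    r.rpop("autoscale:droplets")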
Additional Question:
On the DB server I only have MySQL and Redis, no Ruby or anything else, so there is also no Rails to run there. If my worker decides to create another worker, the new one needs access to MySQL. It seems impossible to create that access from a remote machine, and it looks like I need to create the access from the DB server itself.
mysql> show grants;
+-----------------------------------------------------------------------------------------+
| Grants for rails@162.243.10.147                                                          |
+-----------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'rails'@'162.243.10.147' IDENTIFIED BY PASSWORD <secret>  |
| GRANT ALL PRIVILEGES ON `followrado`.* TO 'rails'@'162.243.10.147'                       |
+-----------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
mysql> CREATE USER 'rails'@'162.243.243.127' IDENTIFIED BY 'swag';
ERROR 1227 (42000): Access denied; you need (at least one of) the CREATE USER privilege(s) for this operation

Is this the right way to do it, or are there better approaches? Am I missing something?
Yes, that seems reasonable.
As soon as the queue is below 20,000, I can kill the instance.
Maybe let it linger for a while before killing it, in case the queue goes up again.
On the DB server I only have MySQL and Redis, no Ruby or anything else, so there is also no Rails to run there. If my worker decides to create another worker, the new one needs access to MySQL. It seems impossible to create that access from a remote machine, and it looks like I need to create the access from the DB server itself.
Yes, you need to create the access from the DB server. In general, access is granted to the entire VPC CIDR rather than to single server IPs (per-IP grants are more common with static instances), especially if you plan to launch dynamic instances with constantly changing IPs. In MySQL terms that usually means a wildcard host pattern (e.g. a private-network pattern like 'rails'@'10.132.%') rather than a literal CIDR.
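If you do want a per-IP grant for each new worker, one option is to run the GRANT on the DB server over SSH right after the worker boots, since it can only be created there. A minimal sketch, assuming key-based SSH access to the DB host and a MySQL admin account with the CREATE USER privilege (which the 'rails' user above lacks); all hostnames and credentials are illustrative:

import subprocess

def grant_worker_access(db_host, worker_ip, password):
    # the statements are executed on the DB server itself, piped into the mysql CLI over SSH
    sql = (
        "CREATE USER 'rails'@'{ip}' IDENTIFIED BY '{pw}'; "
        "GRANT ALL PRIVILEGES ON `followrado`.* TO 'rails'@'{ip}'; "
        "FLUSH PRIVILEGES;"
    ).format(ip=worker_ip, pw=password)
    subprocess.run(["ssh", "root@" + db_host, "mysql"], input=sql.encode(), check=True)

grant_worker_access("db.internal", "162.243.243.127", "swag")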

Related

(AWS SWF) Is there a way to get a list of all activity workers listening on a particular tasklist?

In our beta stack, we have a single EC2 instance listening to a task list. Sometimes another developer on the team starts his own instance for testing purposes and forgets to turn it off. This creates problems for the next developer who tries to start an activity, only for it to be picked up by the last developer's machine. Is there a way to get the hostnames of all activity workers listening on a particular task list?
It is not currently possible to get a list of pollers waiting on a task list through the SWF API. The workaround is to look at the identity field on the ActivityTaskStarted history event after the task was picked up by the wrong worker.
One way to avoid this issue is to always use a task list name that is specific to a machine or developer, so collisions can't happen.
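A small sketch of that suggestion using boto3 (the region, domain and task list prefix are illustrative): each developer or machine polls its own task list and passes its hostname as the identity, so it both avoids collisions and shows up clearly in the workflow history.

import socket
import boto3

swf = boto3.client("swf", region_name="us-east-1")   # region is illustrative

hostname = socket.gethostname()
task = swf.poll_for_activity_task(
    domain="my-domain",                              # assumed SWF domain name
    taskList={"name": "beta-tasklist-" + hostname},  # per-machine task list avoids collisions
    identity=hostname,                               # recorded in the task-started history event
)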

CloudFoundry Instance Environment Variable?

I have a Java web app that persists some things to a database, and I would like to know what instance processed the order. A quick Google and SO search wasn't fruitful in answering my question:
Is there an environment variable or something that my application can use to glean an instance number from for persisting?
I assume that by "what instance" you mean that you have multiple instances of your Java application, and you want some way of knowing which of the multiple instances actually made the request to the database.
Googling "Cloud Foundry Instance Environment Variable" leads me to this first result. You can see one of the listed variables is CF_INSTANCE_INDEX. Those docs are for Pivotal's hosted Cloud Foundry service, I guess the OSS docs have worse SEO, but they also document this.
Do note that application instances are ephemeral. Instance #0 might be killed and restarted for any number of reasons (usually either because your application crashes, or the underlying application execution software/OS are being upgraded in a rolling deploy fashion so your instances are being transparently moved around to avoid downtime), in which case the new instance #0 will obviously be an entirely different process, possibly running on a different machine, in a different datacenter.
From the logs, you can see the app instance:
2015-11-13T11:44:42.000+00:00 [App/0] OUT 11:44:42.675 [main] INFO blah blah
2015-11-13T11:45:42.000+00:00 [App/1] OUT 11:45:42.676 [main] INFO blah2
Here App/0 is instance 0 and App/1 is instance 1.
Or, if you want to access the instance index in code, look for the CF_INSTANCE_* environment variables, e.g. CF_INSTANCE_INDEX, CF_INSTANCE_IP, CF_INSTANCE_PORT, etc.
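A minimal sketch of reading those variables (shown in Python for brevity; in a Java app the equivalent is System.getenv("CF_INSTANCE_INDEX")):

import os

# index of this application instance (0, 1, 2, ...), plus host details
instance_index = int(os.environ.get("CF_INSTANCE_INDEX", "0"))
instance_ip = os.environ.get("CF_INSTANCE_IP")
instance_port = os.environ.get("CF_INSTANCE_PORT")

# persist instance_index alongside the order to record which instance handled it
print("handled by instance %s at %s:%s" % (instance_index, instance_ip, instance_port))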

Call Sentinel failover from Python

I have a simple Redis setup running with 3 servers and 3 Sentinels (on different instances, though). This configuration runs almost flawlessly, but eventually my master fails (the common Redis problem where it can't finish a background save).
My problem is that when that happens, whenever I try to save (or delete) something, I get the error:
ResponseError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Is there any way for me to ask Sentinel to call "failover", forcing the election of a new master? From redis-cli it is quite easy, but I couldn't find a way of doing it from my Python (2.7) program (using redis-py).
Any ideas?
First, you are probably running out of disk space, which would cause that error. Address that and it will stop needing to be failed over.
That said, to do it in Python you need to use execute_command and pass it the Sentinel command and its arguments. Something like:
myclient.execute_command("SENTINEL failover", podname)
where myclient is your connection object (connected to the Sentinel, not the Redis server) and podname is the name you use for the pod/master in Sentinel, should do it.
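A minimal sketch with redis-py, assuming a Sentinel is reachable on localhost:26379 and the master is monitored under the name "mymaster" (both values are illustrative):

import redis

# connect to a Sentinel, not to the Redis master itself
sentinel = redis.StrictRedis(host="localhost", port=26379)

# force a failover for the monitored master
sentinel.execute_command("SENTINEL", "failover", "mymaster")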

Amazon Elasticache Failover

We have been using AWS ElastiCache for about 6 months now without any issues. Every night a Java app runs which flushes DB 0 of our Redis cache and then repopulates it with updated data. However, there were 3 occasions between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in ElastiCache we can see:
Failover from master node prod-redis-001 to replica node prod-redis-002 completed
We have not been able to diagnose the issue, and since the app ran fine for the past 6 months I am wondering if it is related to a recent ElastiCache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it fails, other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis operations are synchronous, including the flush, so it blocks all other requests. In our case we are actually flushing 19M keys and it takes more than 30 seconds.
ElastiCache performs a health check periodically, and since the flush is running, the health check is blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush only causes a failover 3-4 times a week. The best answer we could get is "We think it's every 30 seconds". However, our flush consistently takes more than 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check, but that this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster to load the new data into, and instead of flushing the previous cluster, re-point your application(s) to the new cluster and remove the old one.
2) If the data you are flushing is an updated version of the data, consider not flushing but updating and overwriting the existing keys.
3) Instead of flushing the data, set the expiry of the items to when you would normally flush and let the keys be reclaimed (possibly with a random offset to avoid thundering-herd issues), then reload the data; see the sketch below.
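A sketch of option 3 (in Python for brevity; the original app uses Jedis). The host, TTL and jitter values are illustrative:

import random
import redis

r = redis.Redis(host="prod-redis.example.com", port=6379)   # illustrative endpoint

BASE_TTL = 24 * 60 * 60    # expire roughly when the nightly reload would have flushed
JITTER = 30 * 60           # +/- 30 minutes of jitter to avoid a thundering herd

# walk the keyspace incrementally instead of issuing one long blocking command
for key in r.scan_iter(count=1000):
    r.expire(key, BASE_TTL + random.randint(-JITTER, JITTER))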
Hope this helps :)
For Redis versions from 6.2 onward, AWS ElastiCache has a new thread-monitoring feature, so the health check no longer happens on the same thread as all other Redis actions. Redis can keep processing a long command or Lua script and still be considered healthy. Because of this new feature, failovers should happen less often.

AWS AutoScaling, downscale - wait for processes termination

I want to use AWS Auto Scaling to scale down a group of instances when the SQS queue is short.
These instances do some heavy work that sometimes takes 5-10 minutes to complete, and I want this work to finish before the instance is terminated.
I imagine a lot of people have faced the same problem. Is it possible on EC2 to handle the AWS termination request and complete all my running processes before the instance is actually terminated? What is the best approach to this?
You could also use lifecycle hooks. You would need a way to control a specific worker remotely, because AWS will select a particular instance to put into the Terminating:Wait state and you need to manage that instance. You would want to take the following actions:
instruct the worker process running on the instance to not accept any more work;
wait for the worker to finish the work it is already handling;
call the complete-lifecycle action (see the sketch below).
AWS will take care of the rest for you.
PS: if you are using Celery to power your workers, you can remotely ask a worker to shut down gracefully. It won't shut down until it finishes the tasks it had already started executing.
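A sketch of the complete-lifecycle-action step, assuming a lifecycle hook named "worker-drain" on an Auto Scaling group named "worker-asg" (both names are illustrative) and that the instance can read its own ID from the metadata service:

import boto3
import requests

# this instance's ID, from the EC2 metadata endpoint
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

# ... stop accepting new work and wait for in-flight jobs to finish here ...

autoscaling = boto3.client("autoscaling")
autoscaling.complete_lifecycle_action(
    LifecycleHookName="worker-drain",        # assumed hook name
    AutoScalingGroupName="worker-asg",       # assumed group name
    LifecycleActionResult="CONTINUE",
    InstanceId=instance_id,
)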
Assuming you are using Linux, you can create a pre-baked AMI and use it in the Launch Configuration attached to your Auto Scaling Group.
In the AMI you can put a script under /etc/init.d, say /etc/init.d/servicesdown. This script would execute anything that you need to shut down, for example scripts under /usr/share/services.
Here's the gist of it: servicesdown always gets executed during a graceful shutdown.
Then, say on Ubuntu/Debian, you would do something like this to add it to your shutdown sequence:
/usr/sbin/update-rc.d servicesdown stop 25 0 1 6 .
On CentOS/RedHat you can use the chkconfig command to add it to the right shutdown runlevel.
I stumbled onto this problem because I didn't want to terminate an instance that was doing work. Thought I'd share my findings here. There are two ways to look at this:
I need to terminate a worker, but I only want to terminate one that's not working.
I need to terminate a SPECIFIC worker and I want that specific worker to wait until it's done with its work.
If your goal is #1, Amazon's new "Instance Protection" looks like it was designed to resolve this.
See the link below for an example; they give this code snippet:
https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling/
while (true)
{
SetInstanceProtection(False);
Work = GetNextWorkUnit();
SetInstanceProtection(True);
ProcessWorkUnit(Work);
SetInstanceProtection(False);
}
I haven't tested this myself, but I see API calls related to setting the protection, so it appears that this could be integrated into the EC2 worker app's code base; then, when scaling in, instances shouldn't be terminated while they are protected (currently working).
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/autoscaling/AmazonAutoScaling.html
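For reference, a runnable version of that loop using boto3's set_instance_protection (the group name, metadata lookup and the two work functions are illustrative placeholders):

import time
import boto3
import requests

autoscaling = boto3.client("autoscaling")
ASG_NAME = "worker-asg"                      # assumed Auto Scaling group name
INSTANCE_ID = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

def set_protection(protected):
    # toggle scale-in protection for this one instance
    autoscaling.set_instance_protection(
        InstanceIds=[INSTANCE_ID],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=protected,
    )

def get_next_work_unit():
    # placeholder: pull the next job from your queue (SQS, Redis, ...)
    time.sleep(5)
    return None

def process_work_unit(work):
    # placeholder: the actual 5-10 minute job goes here
    pass

while True:
    set_protection(False)    # idle: safe for Auto Scaling to terminate us
    work = get_next_work_unit()
    set_protection(True)     # a job is in flight: protect from scale-in
    process_work_unit(work)
    set_protection(False)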
As far as I know, there is currently no built-in option to terminate an instance with a graceful shutdown that lets the process complete its work.
I suggest you look at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html.
We implemented this for Resque workers by moving the instance to an unhealthy state and then downsizing the Auto Scaling group. There is a script that constantly checks the health state on each instance. Once an instance is moved to the unhealthy state, it stops all services gracefully and sends a terminate signal to EC2.
Hope it helps you.
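A sketch of that approach: after draining its work, the instance marks itself Unhealthy in its Auto Scaling group so the group removes it (the group configuration and metadata lookup are assumptions):

import boto3
import requests

instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).text

# ... gracefully stop the Resque workers and wait for running jobs first ...

boto3.client("autoscaling").set_instance_health(
    InstanceId=instance_id,
    HealthStatus="Unhealthy",
    ShouldRespectGracePeriod=False,
)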