Call Sentinel failover from Python - python-2.7

I have a simple Redis configuration running with 3 servers and 3 Sentinels (on different instances, though). This configuration runs almost flawlessly, but eventually my master fails (the common Redis problem where it can't finish a background save).
My problem is that when this happens, whenever I try to save (or delete) something, I get the error:
ResponseError: MISCONF Redis is configured to save RDB snapshots, but
is currently not able to persist on disk. Commands that may modify the
data set are disabled. Please check Redis logs for details about the error.
Is there any way for me to ask Sentinel to force a failover, electing a new master? From redis-cli it's quite easy, but I couldn't find a way to do it from my Python (2.7) program (using redis-py).
Any ideas?

First, you are probably running out of disk space, which would cause that error. Address that and the master will stop needing to be failed over.
That said, to do it in Python you need to use execute_command and pass it the SENTINEL command and its arguments. Something like:
myclient.execute_command("SENTINEL", "failover", podname)
where myclient is your connection object (connected to the Sentinel port, not to Redis itself) and podname is the name you use for the pod in Sentinel, should do it.
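As a minimal sketch, that call can be wrapped in a small helper (the host, port and the pod name "mymaster" in the usage comment are assumptions; adjust them for your deployment):

```python
def force_failover(sentinel_client, pod_name):
    # sentinel_client must be a redis-py client connected to the Sentinel
    # port (26379 by default), not to the Redis data port. pod_name is the
    # master name configured in sentinel.conf (often "mymaster").
    return sentinel_client.execute_command("SENTINEL", "failover", pod_name)

# Usage (hypothetical host, requires redis-py):
#   import redis
#   sentinel = redis.StrictRedis(host="sentinel-host", port=26379)
#   force_failover(sentinel, "mymaster")
```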

How do I Prevent Long Lived Selenium Processes from Generating OOM Errors?

How do I prevent containers that use selenium from 'freezing' after getting OOM errors and seeing Failed to start thread - pthread_create failed (EAGAIN)? What is the root cause and how do I fix it? Further, how can I test the solution locally and how can I implement the solution on AWS?
The following is a guide to diagnosing and solving OOM issues caused by long-lived Selenium instances. Specifically, on AWS you will likely get Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached. and the dreaded OOM error should your Selenium process run long enough. This guide assumes that you are using Docker and can run your image via docker run <some args if you wish> your_image.
Diagnosis
Run your app using docker run .... In a separate tab run docker ps, find your container ID, and run docker container stats CONTAINER_ID. The key is to observe the PIDS column. Now trigger the Selenium process to run, ideally many times (to test this you may want to wrap it in a simple for loop). You will notice that the PIDS count grows without bound. This is because Selenium leaves around zombie processes (reference: Selenium leaves behind running processes?).
Solution
The solution is to reap the zombie processes. Specifically, per Selenium leaves behind running processes?, there is a flag --init, and this is the key to the solution. You need to run docker run --init ... (note the --init). Per https://docs.docker.com/engine/reference/run/: "You can use the --init flag to indicate that an init process should be used as the PID 1 in the container. Specifying an init process ensures the usual responsibilities of an init system, such as reaping zombie processes, are performed inside the created container." To be confident that the solution will work for you, run your image with docker run --init .... Re-trigger the calls to Selenium. This time the PIDS count may grow, but not without bound (for me the number of PIDs never passed 200).
Solution - AWS
References are https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-linuxparameters.html and https://www.ernestchiang.com/en/posts/2021/using-amazon-ecs-exec/
In the 'Task Definition' (search for 'ECS' -> click on Task Definitions) select your task definition. Then scroll to the bottom and click on Configure via JSON. Next, find linuxParameters and, if it is null, replace null with the value:
{
"initProcessEnabled": true
}
If you already have a JSON value for linuxParameters then just add "initProcessEnabled": true as a JSON parameter. Next, create your task definition and deploy!
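Note that linuxParameters lives on each container entry inside containerDefinitions, so the relevant part of the task definition JSON ends up looking roughly like this (the container name and surrounding structure are illustrative):

```json
{
    "containerDefinitions": [
        {
            "name": "your-container",
            "linuxParameters": {
                "initProcessEnabled": true
            }
        }
    ]
}
```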
Solution - Other
I have not used Google Cloud or Microsoft offerings, so I do not know how to add the --init flag. If someone with such experience could tell me how to do that, I would be happy to update the guide.

AWS EC2 Cloudwatch monitoring

Firstly, I appreciate your patience in reading and thinking through the problem I describe here.
I have a unique problem with one of my AWS EC2 instances (Ubuntu 14.04): the instance just goes unreachable via both HTTP and ping, and it also locks me out of SSH access. I have to log in to the AWS console every time and reboot the instance manually. As a mitigation, I have configured CloudWatch monitoring to reboot the instance automatically and send me a notification email whenever a system check fails.
So far, so good.
Now, what I really want is the root cause of the instance going unreachable. I assume it is a memory issue. I have gone through get-system-logs, which helped a bit. But is there any way I can configure CloudWatch to send me the failure logs (or something similar) along with the alert email? Or is there any way I can alert myself with sufficient log info, for example memory usage hitting 80%, the network not responding, etc., when the instance goes unreachable? I have heard of the swap tool, but I am looking for something more generic, not limited to memory monitoring.
Any ideas?
I would go old skool and use a script on the server to log to a file.
Presumably (you don't mention this detail above) there is a particular program running on the system that is giving you this problem.
System programs usually store their PID in a file. Let's assume the file is /var/run/nginx.pid; you can work this out for your particular system.
Write a script to read the PID and record the memory use; for example, save this as /usr/local/bin/mymemory (and make it executable):
#!/bin/sh
PID=`cat /var/run/nginx.pid`
# the 3 fields are %MEM, VSZ and RSS
DATA=`ps uhp $PID | awk '{print $4, $5, $6}'`
NOW=`date --rfc-3339=sec`
echo "$NOW $DATA" >> /var/log/memory.log
Add a line to root's crontab:
* * * * * /usr/local/bin/mymemory
This will produce an ever-growing file with one memory sample per minute. I suggest you log in once a day to check it, download it if it looks interesting, and delete it.
(In a real production context log rotation could be used)
Every time there is a crash, the file should contain memory-use data from just before it.
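To pick out the worst moment from that log, a quick sketch like the following works (the line format assumed here matches the script above: an RFC 3339 timestamp followed by %MEM, VSZ and RSS):

```python
def peak_mem(log_lines):
    # Each line: "<date> <time+offset> <%mem> <vsz> <rss>".
    # Returns the line with the highest RSS (last field, in KiB).
    return max(log_lines, key=lambda line: int(line.split()[-1]))

# Usage: read /var/log/memory.log and print the peak sample:
#   with open("/var/log/memory.log") as f:
#       print(peak_mem(f.read().splitlines()))
```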

(AWS SWF) Is there a way to get a list of all activity workers listening on a particular tasklist?

In our beta stack, we have a single EC2 instance listening to a tasklist. Sometimes another developer on the team starts his own instance for testing purposes and forgets to turn it off. This creates problems for the next developer, who tries to start an activity only for it to be picked up by the previous developer's machine. Is there a way to get the hostnames of all activity workers listening on a particular tasklist?
It is not currently possible to get a list of pollers waiting on a task list through the SWF API. The workaround is to look at the identity field on the ActivityTaskStarted event after the task was picked up by the wrong worker.
One way to avoid this issue is to always use a task list name that is specific to a machine or developer, to avoid collisions.
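To make the culprit easy to spot in the history, workers can pass an explicit identity when polling. As a sketch, using the machine's hostname as the identity (the domain and task list names in the usage comment are hypothetical, and the boto call shown assumes boto 2's SWF Layer1 API):

```python
import socket

def poller_identity():
    # The identity string supplied when polling for an activity task is
    # recorded on the ActivityTaskStarted event in the workflow history,
    # so using the hostname shows which machine picked up the task.
    return socket.gethostname()

# Usage with boto 2's SWF Layer1 (hypothetical domain/task list):
#   layer1.poll_for_activity_task("my-domain", "my-tasklist",
#                                 identity=poller_identity())
```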

Autoscale workers on Digital Ocean

I have 3 servers: DB, Web and Worker. The worker just processes Sidekiq jobs all day long.
As soon as the queue goes over 100,000 jobs, I want a second worker instance, and I struggle a little with how to think this through (and if the queue is above 300,000 I need 3 workers, and so on). My plan:
I take my worker and make a snapshot.
Via the DigitalOcean API I create a new instance based on that image.
As soon as the instance boots, it needs to pull the latest code from the Git repository.
I need to tell the database server that it is allowed to receive connections from this instance's IP.
As soon as the queue is below 20,000, I can kill the extra instance.
Is this the right way of doing it, or are there better ways? Am I missing something?
Additional Question:
On the DB server I only have MySQL and Redis; there's no Ruby or anything else, so there is no Rails to run. If my worker decides to create another worker, the new one needs access to MySQL. It seems to be impossible to grant access from a remote machine, and it looks like I need to create the access from the DB server:
mysql> show grants;
+-----------------------------------------------------------------------------------------+
| Grants for rails@162.243.10.147 |
+-----------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'rails'@'162.243.10.147' IDENTIFIED BY PASSWORD <secret> |
| GRANT ALL PRIVILEGES ON `followrado`.* TO 'rails'@'162.243.10.147' |
+-----------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
mysql> CREATE USER 'rails'@'162.243.243.127' IDENTIFIED BY 'swag';
ERROR 1227 (42000): Access denied; you need (at least one of) the CREATE USER privilege(s) for this operation
Is this the right way of doing it, or are there better ways? Am I missing something?
Yes, that seems reasonable.
As soon as the queue is below 20,000, I can kill the extra instance.
Maybe let it linger for a while before killing it, in case the queue goes up again.
On the DB server I only have MySQL and Redis; there's no Ruby or anything else, so there is no Rails to run. If my worker decides to create another worker, the new one needs access to MySQL. It seems to be impossible to grant access from a remote machine, and it looks like I need to create the access from the DB server.
Yes, you need to create the access from the DB server. In general, access is granted to the entire VPC CIDR rather than to single server IPs (the latter is more common with static instances), especially if you plan to launch dynamic instances with constantly changing IPs.
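The scale-up/scale-down logic from the question (a second worker above 100,000 jobs, a third above 300,000, kill the extras below 20,000) can be captured in a small helper; the gap between 20,000 and 100,000 provides the "let it linger" hysteresis mentioned above. This is a sketch of the decision logic only, not DigitalOcean API code, and the thresholds beyond 300,000 are an assumed continuation of the pattern:

```python
# Scale-up thresholds from the question: above 100k jobs run a second
# worker, above 300k a third; later thresholds are assumptions.
SCALE_UP = [100000, 300000, 500000]
SCALE_DOWN = 20000  # below this, extra workers can be killed

def desired_workers(queue_size, current_workers):
    wanted = 1 + sum(1 for t in SCALE_UP if queue_size > t)
    if wanted > current_workers:
        return wanted            # boot new droplets from the snapshot
    if queue_size < SCALE_DOWN:
        return 1                 # safe to kill the extra instances
    return current_workers       # hysteresis: let extras linger
```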

Amazon Elasticache Failover

We have been using AWS ElastiCache for about 6 months now without any issues. Every night a Java app of ours runs, flushes DB 0 of our Redis cache, and then repopulates it with updated data. However, on 3 occasions between July 31 and August 5, our DB was successfully flushed and then we were not able to write the new data to it.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue, and since the app ran fine for the past 6 months, I am wondering if it is related to the ElastiCache release from June 30.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis commands are synchronous, including the flush, so it blocks all other requests. In our case we are flushing 19 million keys, and it takes more than 30 seconds.
ElastiCache performs a health check periodically, and since the flush is running, the health check is blocked, causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush only causes a failover 3-4 times a week. The best answer we could get is "We think it's every 30 seconds." However, our flush consistently takes more than 30 seconds yet doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check however they said this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data on, and
instead of flushing the previous cluster, re-point your application(s)
to the new cluster, and remove the old one.
2) If the data that you are flushing is an updated version of the data,
consider not flushing, but updating and overwriting new keys?
3) Instead of flushing the data, set the expiry of the items to be
when you would normally flush, and let the keys be reclaimed (possibly
with a random time to avoid thundering herd issues), and then reload
the data.
Hope this helps :)
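Option 3 above (expiry with jitter instead of flushing) can be sketched like this; the base TTL and jitter window are assumptions chosen to illustrate the idea, and client in the usage comment stands for any redis-py-style connection:

```python
import random

BASE_TTL = 24 * 60 * 60   # expire roughly when the nightly flush would run
JITTER = 15 * 60          # spread expirations over a 15-minute window

def jittered_ttl():
    # Randomising the TTL avoids a thundering herd of keys all
    # expiring (and being reloaded) at the same instant.
    return BASE_TTL + random.randint(0, JITTER)

# Usage when writing each key (hypothetical client/key/value):
#   client.setex(key, jittered_ttl(), value)
```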
As of Redis version 6.2, AWS ElastiCache has a new thread-monitoring feature, so the health check no longer happens in the same thread as all other Redis work. Redis can keep processing a long command or Lua script and still be considered healthy. Thanks to this feature, failovers should happen less often.