celery flower connection issue with --persistent True

I was running Celery Flower 1.0.0 as a systemd service with --persistent=True. Every restart failed with an SSLV3_ALERT_HANDSHAKE_FAILURE error, and underneath it was a second error message: db type could not be determined.
After removing --persistent=True, the service restarted perfectly every time, but then the Flower database did not survive a restart.
Here is what worked for me.

First, the SSLV3_ALERT_HANDSHAKE_FAILURE was caused by my misconfiguration of sentry + raven.
Second, the real error, db type could not be determined, arose because the restarted service could not open the database left behind by the previous Flower service. I couldn't find out why at first, but a very helpful GitHub issue is here.
Finally, I tried the --db flag to specify the location and name of my Flower database. This resolved the issue for me: the service now restarted successfully even with --persistent=True.
Later, I found out that the default database Flower had been creating was owned by the primary user of the host, whereas the database created after I added --db to the service file was owned by the celery user. That ownership mismatch was the real cause, and fixing it was the real solution for me.
Takeaway: if you run Flower as a systemd service, make sure the Flower database file is owned by the celery user. Better still, always use the --db flag to avoid this problem.
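
If you want to check the ownership yourself, here is a minimal Python sketch. The helper names are my own, and the path in the usage note is only a hypothetical example; use whatever you pass to --db and whatever user your unit file runs as.

```python
import os
import pwd


def owner_uid(path):
    """Return the numeric user ID that owns the file at `path`."""
    return os.stat(path).st_uid


def owned_by(path, username):
    """Return True if the file at `path` is owned by the account `username`."""
    return owner_uid(path) == pwd.getpwnam(username).pw_uid


def describe_owner(path):
    """Return the owner's account name, or the raw UID if it has no passwd entry."""
    uid = owner_uid(path)
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)
```

For example, owned_by("/var/lib/flower/flower.db", "celery") should return True once the file belongs to the celery user (both values are assumptions about your setup).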

Related

How to easily debug Redis and Django/Gunicorn when developing using Docker?

I'm not referring to more sophisticated debugging techniques, but how to get access to the same kind of error messages that are normally directed to terminal tabs?
Basically I'm adopting Docker in a Django project also using Redis.
In the old way of working I opened a Linux terminal tab for Gunicorn like this: gunicorn --reload --bind 0.0.0.0:8001 myapp.wsgi:application
This tab kept running Gunicorn, and any Python error was shown there so I could see the problem and fix it.
I could also open a second tab for the Celery worker: celery -A myapp worker --pool=solo -l info
The same thing happened: the tab was occupied by Celery, and any Python error in a task was shown there so I could see the problem and correct the code.
My question is: using Docker, is there a way to make each container send the errors that previously went to the screen to log files instead, so that I can debug my code when a Python error occurs?
What is the correct way to handle simple debugging during development using Docker containers?

After looking into this further in the Docker documentation, I found a page that solves this problem: View logs for a container or service
Basically, the command "docker logs CONTAINER_ID" shows on the screen exactly what we would see in the terminal running the application.
It works perfectly for viewing Django, Redis and Angular logs.
Just type:
docker logs CONTAINER_ID
Replace CONTAINER_ID with the real ID of the container whose logs you want to see.
To find the ID, type:
docker ps
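
Since the question asked about log files, here is a rough Python sketch that wraps the same docker logs command and writes the output to a file. The function names are my own invention; --follow and --tail are standard docker logs options, but treat the rest as an illustration rather than the recommended way.

```python
import subprocess


def docker_logs_command(container, follow=False, tail=None):
    """Build the argument list for `docker logs`."""
    cmd = ["docker", "logs"]
    if follow:
        cmd.append("--follow")  # keep streaming new output
    if tail is not None:
        cmd += ["--tail", str(tail)]  # only the last N lines
    cmd.append(container)
    return cmd


def save_logs(container, log_path):
    """Write a container's current stdout and stderr output to a file."""
    with open(log_path, "wb") as log_file:
        subprocess.run(
            docker_logs_command(container),
            stdout=log_file,
            stderr=subprocess.STDOUT,  # merge both streams into the file
            check=True,
        )
```

Calling save_logs("CONTAINER_ID", "web.log") would dump what the container has printed so far into web.log (the file name is just an example).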

Django Celery with Redis Issues on Digital Ocean App Platform

After quite a bit of trial and error and a step-by-step attempt to find solutions, I thought I would share the problems here and answer them myself according to what I've found. There is not much documentation on this anywhere except small bits and pieces, and this will hopefully help others in the future.
Please note that this is specific to Django, Celery, Redis and the Digital Ocean App Platform.
This is mostly about the below errors and further resulting implications:
OSError: [Errno 38] Function not implemented
and
Cannot connect to redis://......
The first error happens when you try to run the celery command celery -A your_app worker --beat -l info or similar on the App Platform. It appears that this is currently not supported on Digital Ocean. The second error can result from any of several configuration mistakes.

PART 1:
While Digital Ocean might remedy this in the future, here is a workaround. The problem is the unsupported execution pool. Google "celery execution pools" if you want to know more about how they work. The default one is prefork, but what you need is either gevent or eventlet. I went with the former for my purposes.
Whichever you pick, you will have to install it, as it doesn't come with Celery by default. In my case it was: pip install gevent (and don't forget to add it to your requirements as well).
Once you have that, you can re-run the celery command, but note that gevent and beat are not supported within a single command (it will result in an error). Instead do the following:
celery -A your_app worker --pool=gevent -l info
and then separately (if you want to run beat that is) in another terminal/console
celery -A your_app beat -l info
On the first command you can also specify the concurrency like so: --concurrency=100. This is not required but useful. Read up on what it does, as that goes beyond the solution here.
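
To make the split explicit, here is a small Python sketch that builds the two commands as separate argument lists, which you could hand to subprocess or copy into two Run Commands. The function names are hypothetical; the flags are the ones used above.

```python
def worker_command(app, pool="gevent", concurrency=None, loglevel="info"):
    """Argument list for the worker process, without --beat."""
    cmd = ["celery", "-A", app, "worker", f"--pool={pool}", "-l", loglevel]
    if concurrency is not None:
        cmd.append(f"--concurrency={concurrency}")
    return cmd


def beat_command(app, loglevel="info"):
    """Argument list for the separate beat (scheduler) process."""
    return ["celery", "-A", app, "beat", "-l", loglevel]
```

The point of keeping them apart is simply that each one runs as its own process, for example with subprocess.Popen(worker_command("your_app")) and subprocess.Popen(beat_command("your_app")).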
PART 2:
In my specific case I tested the above locally (development) first to make sure it worked. The next issue was getting this into production. I use Redis as the db/broker.
In my specific setup I have most of my Celery configuration in the the_main_app/celery/__init__.py file, but sometimes people put it directly into the_main_app/celery.py. Whichever you use, make sure that the REDIS_URL is set correctly. For development, read it from the environment with a local fallback and then set it as the broker, as below:
import os
from celery import Celery
# Use REDIS_URL from the environment, falling back to a local Redis for development.
YOUR_VAR_NAME = os.environ.get('REDIS_URL', 'redis://localhost:6379')
app = Celery('the_main_app')
app.conf.broker_url = YOUR_VAR_NAME
The remaining settings are all documented on the "celery first steps with django" help page but are not relevant for what I am showing here.
PART 3:
When you set up your Redis database on the App Platform (which is very simple), you will see the connection details as 'public network' and 'VPC network'.
The Celery documentation says to use the following URL format for production: redis://:password@hostname:port/db_number. This didn't work. If you are not using a yaml file, you can simply copy the entire connection string (select it from the dropdown!) from the Redis DB connection details, then set up an App-Level environment variable named REDIS_URL in your Digital Ocean project and paste in that entire string (and also encrypt it!).
The string should look something like this (rediss, with two s's!):
rediss://USER:PASS@URL.db.ondigitalocean.com:PORT
You are almost done. The last step is to set up the workers. It was fine for me to run the PART 1 commands as console commands on the App Platform to test them, but eventually I set up a small worker (+ Add Component) for each line and pasted the commands into the Run Command field.
That is basically the process step by step. Good luck!
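
If you keep getting "Cannot connect to redis://..." and want to sanity-check the REDIS_URL value before handing it to Celery, something like the following sketch may help. It only uses the standard library, the function name is my own, and the host in the usage note is a placeholder, not a real endpoint.

```python
from urllib.parse import urlsplit


def describe_redis_url(url):
    """Split a Redis URL into its parts without printing the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if parts.hostname is None:
        raise ValueError("no hostname found in the URL")
    return {
        "tls": parts.scheme == "rediss",  # the double-s scheme means TLS
        "username": parts.username,
        "has_password": parts.password is not None,
        "hostname": parts.hostname,
        "port": parts.port,
    }
```

For instance, describe_redis_url("rediss://user:secret@example-host.invalid:25061") reports tls as True, which is what the managed database connection string should give you.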

JupyterHub notebook server returning 500 error, pod stuck in "terminating" state

I have an AWS EKS cluster (Kubernetes version 1.14) which runs the JupyterHub application.
One user's notebook server is returning a 500 error:
500 : Internal Server Error
Redirect loop detected. Notebook has JupyterHub version unknown (likely < 0.8), but the hub expects 0.9.6. Try installing JupyterHub==0.9.6 in the user environment if you continue to have problems.
You can try restarting your server from the homepage.
Only one user is experiencing this issue; others are not. When I run "kubectl get pod", this user's pod shows the state "Terminating" (it appears to be stuck in this state).

I was able to fix it, but I can't say this is the right approach. (I would have preferred to diagnose the root cause.)
First, I tried deleting the pod with kubectl delete pod <pod_name> -- it did not work.
Second, I tried force deleting the pod with kubectl delete pod <pod_name> --grace-period=0 --force -- it worked, but it turns out this only deletes the handle; the pod resources are then orphaned on the cluster.
I checked the node status with kubectl get node and noticed one node was stuck in the NotReady state. I recycled this node -- it still did not work; the user's notebook server was still stuck and returning the 500 error.
Finally, I simply deleted the user's notebook server from the JupyterHub admin page. This fixed it.
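
For spotting stuck pods before reaching for a force delete, here is a hedged Python sketch. It relies on the fact that a pod being deleted has metadata.deletionTimestamp set, which kubectl displays as Terminating. The function names and the namespace argument are my own assumptions.

```python
import json
import subprocess


def terminating_pods(pod_list):
    """Return names of pods that have a deletionTimestamp set."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item["metadata"].get("deletionTimestamp")
    ]


def fetch_pods(namespace):
    """Load the pod list for a namespace as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Running terminating_pods(fetch_pods("your-namespace")) would list the pods that are in the middle of being deleted (the namespace name is a placeholder).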

How to access UI in Airflow 1.10?

To start with, I am trying to upgrade from version 1.9 to 1.10, so my setup contains two VMs running different versions of Airflow with different port forwarding.
I can access the UI on the VM running 1.9 but not on the one running 1.10.
To debug, I want to confirm whether the Airflow webserver is running. If I execute
sudo systemctl start airflow-webserver
it throws no error, but when I look at netstat I do not see any process listening on port 8080 (the default).
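
As an alternative to reading netstat output, the same check can be sketched in Python by trying to open a TCP connection to the port. The function name is made up for illustration, and 127.0.0.1 with port 8080 are assumptions about where the webserver would bind.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

Here is_listening("127.0.0.1", 8080) returning False would match what netstat shows.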
Also, I have not created any user, as I do not need RBAC authentication. Can that be a problem?
As requested by @kaxil, below is the output of ps aux | grep airflow
Can someone suggest how to fix this problem? If you need any further information, I can provide it. I am not sure what is relevant here.
Output of journalctl -u airflow-webserver.service -b

The error message shows that there is an issue with the airflow.cfg file, i.e. there might be a character in your airflow.cfg that is causing the issue. Recheck your config file; if you don't find an issue, post the file in your question and we will try to figure it out.
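
One way to hunt for a stray character is sketched below. This is only a guess at what kind of character might be involved: it flags anything outside printable ASCII (tabs are allowed), which catches things like smart quotes or zero-width spaces pasted in from a web page. The function names are hypothetical.

```python
from pathlib import Path


def suspicious_characters(text):
    """Return (line_number, column, character) for each non-ASCII or control character."""
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        # Ignore the carriage return of Windows-style line endings.
        for col, ch in enumerate(line.rstrip("\r"), start=1):
            if ord(ch) > 126 or (ord(ch) < 32 and ch != "\t"):
                findings.append((line_no, col, ch))
    return findings


def scan_file(path):
    """Scan a config file; undecodable bytes become U+FFFD and are flagged too."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return suspicious_characters(text)
```

An empty result from scan_file on your airflow.cfg would suggest looking elsewhere, for example at a malformed section header or option line.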

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
[Screenshot: Airflow tree view]
On hover on a failing task, it only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far:
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud Composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart Airflow completely, the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note: I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). The connection to localhost:5555 stays in the 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!

1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes
