Django celery beat: "Substantial drift" warning message

So I am developing using a VM (Vagrant) and I am getting this message when I start celery beat inside it:
[2014-07-15 10:16:49,627: INFO/MainProcess] beat: Starting...
[W 140715 09:16:51 state:74] Substantial drift from celery#worker_publications may mean clocks are out of sync. Current drift is 3600 seconds. [orig: 2014-07-15 09:16:51.476125 recv: 2014-07-15 10:16:51.474109]
[W 140715 09:16:51 state:74] Substantial drift from celery#worker_queue may mean clocks are out of sync. Current drift is 3600 seconds. [orig: 2014-07-15 09:16:51.480642 recv: 2014-07-15 10:16:51.475021]
When I run date inside the VM I get Tue Jul 15 09:25:11 UTC 2014, but I live in Portugal and my host machine gives me Ter Jul 15 10:25:39 WEST 2014.
What's the best approach to fix this?
And what about when I put this live?
I am using celery 3.1.12 and I do not have CELERY_TIME_ZONE set.

Sometimes adjusting all the Django settings will not help. The reason is that the local time on the instance (or on your local machine) is not correct, even by a few seconds.
To make sure the time is the same across multiple instances (in my case EC2 and DigitalOcean):
sudo apt-get install ntp
sudo /etc/init.d/ntp restart
The above will make sure the time is in sync.
After this, as mentioned above, I use the following in my Django settings:
TIME_ZONE = 'America/Los_Angeles'
TZINFO = 'UTC'
USE_TZ = True
# For celery
CELERY_ENABLE_UTC = True
I am using the following versions (from pip freeze):
django-celery==3.1.16
Django==1.6
celery==3.1.8

This usually indicates a time zone mismatch. Note that the drift is 3600 seconds, i.e. exactly one hour.
The Celery setting to use is CELERY_TIMEZONE, not CELERY_TIME_ZONE.
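A minimal settings.py sketch of that fix, assuming you want Celery to keep everything in UTC (the Europe/Lisbon value is only an illustration for a host in Portugal, not something from the original setup):
# Celery 3.1 configuration in Django settings.py
CELERY_TIMEZONE = 'UTC'        # note: CELERY_TIMEZONE, not CELERY_TIME_ZONE
CELERY_ENABLE_UTC = True
# Django's own time zone handling
TIME_ZONE = 'Europe/Lisbon'    # illustrative only; pick the zone you actually want
USE_TZ = True
With the VM's clock fixed via NTP and these settings in place, the one-hour drift warning should disappear.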

Related

GCP VM time sync issue after resuming from suspension (in both Linux and Windows)

GCP VM doesn't update the system datetime after resuming it from suspension.
It keeps the system date/time the same as it was at suspension. Because of this, my scripts that fetch gcloud resources are failing with an auth token expiry error.
According to the Google documentation https://cloud.google.com/compute/docs/instances/managing-instances#linux_1,
NTP is already configured, but on my VMs I get a "command not found" error for ntpq -p.
$ sudo timedatectl status
Local time: Wed 2020-08-05 15:31:34 EDT
Universal time: Wed 2020-08-05 19:31:34 UTC
RTC time: Wed 2020-08-05 19:31:34
Time zone: America/New_York (EDT, -0400)
System clock synchronized: yes
NTP service: inactive
RTC in local TZ: no
gcloud auth activate-service-account in my script is failing with the error below:
(gcloud.compute.instances.describe) There was a problem refreshing your current auth tokens: invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.
OS - Windows/Linux
After resuming, the hardware clock of the VM instance is set correctly as it gets time from the hypervisor. You can check it with sudo hwclock.
The problem is with the time service of the operating system.
For Windows, it could take a few minutes to sync the system time with the time source. If you can't wait for the timesync cycle to complete, you can log on to Windows and force time synchronization manually:
net stop W32Time
net start W32Time
w32tm /resync /force
In Linux, NTP cannot handle a time offset of more than 1000 seconds (see http://doc.ntp.org/4.1.0/ntpd.htm). Therefore you have to force time synchronization manually. There are various ways to do that (some of them are deprecated, but may still work):
netdate timeserver1
ntpdate -u -s timeserver1
hwclock --hctosys
service ntp restart
systemctl restart ntp.service
If you run into this issue while using Google Cloud Platform: GCP images replace ntpd and systemd-timesyncd with chronyd.
I had to use systemctl start chrony to get my time in working order. I tried hwclock --hctosys, but it ignored time zones and thus set the wrong time.
This happened because I was suspending every minute by accident. A permanent fix would be to modify the systemd definition and ask it to keep retrying to start the service (see the sketch below).
The reason it stopped was this error: Can't synchronise: no selectable sources
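A minimal sketch of such a systemd override, assuming the chrony unit is named chrony.service on your distro (on some images it is chronyd.service, so adjust the path accordingly):
# /etc/systemd/system/chrony.service.d/override.conf (hypothetical drop-in path)
[Service]
Restart=on-failure    # restart chrony whenever it exits with an error
RestartSec=30         # wait 30 seconds between restart attempts
Run sudo systemctl daemon-reload after creating the drop-in so systemd picks it up.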

Hyperledger Fabric: Peer nodes fail to restart with byfn script when machine is shut down while network is running

I have a hyperledger fabric network running on a single AWS instance using the default byfn script.
ERROR: Orderer, cli, CA docker containers show "Up" status. Peers show "Exited" status.
The error occurs when:
The byfn network is running and the machine is rebooted (not in my control, due to some external reason).
The network is left running overnight without shutting down the machine. It shows the same status the next morning.
Error shown:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b0523a7b1730 hyperledger/fabric-tools:latest "/bin/bash" 23 seconds ago Up 21 seconds cli
bfab227eb4df hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 23 seconds ago peer1.org1.example.com
6fd7e818fab3 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 19 seconds ago peer1.org2.example.com
1287b6d93a23 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 22 seconds ago peer0.org2.example.com
2684fc905258 hyperledger/fabric-orderer:latest "orderer" 28 seconds ago Up 26 seconds 0.0.0.0:7050->7050/tcp orderer.example.com
93d33b51d352 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 25 seconds ago peer0.org1.example.com
Attaching docker log: https://hastebin.com/ahuyihubup.cs
Only the peers fail to start up.
Steps I have tried to solve the issue:
docker start $(docker ps -aq), or manually starting individual peers.
byfn down, generate and then up again. Shows the same result as above.
Rolled back to previous versions of the fabric binaries. Same result on 1.1, 1.2 and 1.4. With the older binaries the error does not recur when the network is left running overnight, but it does when the machine is restarted.
Used older docker images such as 1.1 and 1.2.
Tried starting up only one peer, orderer and cli.
Changed network name and domain name.
Uninstalled docker, docker-compose and reinstalled.
Changed port numbers of all nodes.
Tried restarting without mounting any volumes.
The only thing that works is reformatting the AWS instance and reinstalling everything from scratch. Also, I am NOT using the AWS Blockchain Template.
Any help would be appreciated. I have been stuck on this issue for a month now.
The error was resolved by adding the following lines to peer-base.yaml:
GODEBUG=netdns=go
dns_search: .
Thanks to @gari-singh for the answer:
https://stackoverflow.com/a/49649678/5248781
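For context, here is a sketch of where those two lines end up in base/peer-base.yaml, assuming the stock first-network layout (service and variable names may differ in your compose files):
services:
  peer-base:
    environment:
      # ...existing CORE_* entries unchanged...
      - GODEBUG=netdns=go   # force Go's built-in DNS resolver instead of cgo/libc
    dns_search: .           # stop Docker from appending the host's DNS search domains
Note that dns_search sits at the service level (same indentation as environment), not inside the environment list.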

Decrease django's CPU Elapsed time from 31353.191 msec

According to the django-debug-toolbar my CPU Time is around 31000ms (on average). This is true for my own pages as well as for the admin. Here is the breakdown when loading http://127.0.0.1:8000/admin/:
Resource usage
User CPU time 500.219 msec
System CPU time 57.526 msec
Total CPU time 557.745 msec
Elapsed time 30236.380 msec
Context switches 11 voluntary, 1345 involuntary
Browser timing (Timing attribute / Milliseconds since navigation start (+length))
domainLookup 2 (+0)
connect 2 (+0)
request 7 (+30259)
response 30263 (+3)
domLoading 30279 (+1737)
domInteractive 31154
domContentLoadedEvent 31155 (+127)
loadEvent 32016 (+10)
As far as I understand, the "request" step [7 (+30259)] is the biggest bottleneck here. But what does that tell me? The request panel just shows some variables, and neither GET nor POST data.
The same code works fine hosted on PythonAnywhere; locally I am running a MacBook Air (i5, 1.3 GHz, 8 GB RAM). The performance hasn't always been this poor. IIRC it happened "overnight": one day I started the dev server and it was slow. I didn't change anything in the code or the DB.
Is it right to assume that it could be an issue with my local machine?
EDIT:
I tried running ./manage.py runserver --noreload but the performance didn't improve. Also, starting the dev server (using ./manage.py runserver) takes around 40s, and accessing the DB using Postico takes around 1 minute. Starting the dev server with the database commented out of Django's settings makes load times normal.
Solved it.
This post pointed me in the right direction. Essentially, I ended up "resetting" my hosts file according to this post. My hosts file now looks like this:
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
This does not, however, explain what caused the sudden "overnight" issue in the first place. My guess: a renamed hostname. A couple of days ago my hostname ended up being something along the lines of android- (similar to this), which was apparently caused by me using Android's file-share tool. I ended up renaming my hostname to my username (see the instructions below).
Perform the following tasks to change the workstation hostname using the scutil command.
Open a terminal.
Type the following command to change the primary hostname of your Mac (this is your fully qualified hostname, for example myMac.domain.com):
sudo scutil --set HostName <new hostname>
Type the following command to change the Bonjour hostname of your Mac (this is the name usable on the local network, for example myMac.local):
sudo scutil --set LocalHostName <new hostname>
Optional: If you also want to change the computer name (this is the user-friendly computer name you see in Finder, for example myMac), type the following command:
sudo scutil --set ComputerName <new hostname>
Flush the DNS cache by typing: dscacheutil -flushcache
Restart your Mac.
from here
Didn't test the "renaming hostname" theory, though.

aws s3 time not synced, authentication failure using awsaccesskeyid

I had the issue of my AWS access key and secret key not authenticating.
aws s3 ls
gave
An error occurred (RequestTimeTooSkewed) when calling the ListBuckets operation: The difference between the request time and the current time is too large.
So, I tried syncing the time with my local time, which was incorrect. Even after the sync, the issue persisted.
I am in the ap-south-1 (Mumbai) region; my time was set correctly but the error still occurred.
I tried launching an instance, and timedatectl gave this:
Local time: Sat 2018-09-08 08:25:06 UTC
Universal time: Sat 2018-09-08 08:25:06 UTC
RTC time: Sat 2018-09-08 08:25:05
Time zone: Etc/UTC (UTC, +0000)
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
The server is also in ap-south-1, so I don't get why the local time is (UTC, +0000).
Trying to set my system clock to a similar time (UTC, +0000) results in this:
Local time: Sat 2018-09-08 20:09:46 +00
Universal time: Sat 2018-09-08 20:09:46 UTC
RTC time: Sat 2018-09-08 20:09:46
Time zone: Atlantic/Azores (+00, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
I've tried adjusting my machine's time to everything I can think of, but I am still unable to fix this error. I also added servers from my region to ntpd.conf:
server 3.in.pool.ntp.org
server 3.asia.pool.ntp.org
server 0.asia.pool.ntp.org
But this didn't help either.
The local machine is running Ubuntu 18.04 LTS; the instance is Ubuntu 16.04 LTS.
Is there something I'm missing about this? Thanks in advance.
I don't know how or why this was caused, but changing the time manually (in the BIOS, and by going into system settings and entering the hours and minutes) fixed it. I should have tried that first.
Thanks for the help.
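For anyone hitting the same RequestTimeTooSkewed error, a software-only alternative to setting the clock in the BIOS (not what was done above, and assuming a systemd-based Ubuntu host) is to re-enable NTP synchronization and verify it:
sudo timedatectl set-ntp true             # turn NTP synchronization (systemd-timesyncd) back on
sudo systemctl restart systemd-timesyncd  # assumes systemd-timesyncd is the active time service
timedatectl                               # check that "System clock synchronized" reports yes
Once the clock is within a few minutes of real time, aws s3 ls should stop returning RequestTimeTooSkewed.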

Why do I have so many Celery messages with CloudAMQP on Heroku?

My Django site running on Heroku is using CloudAMQP to handle its scheduled Celery tasks. CloudAMQP is registering many more messages than I have tasks, and I don't understand why.
e.g., in the past couple of hours I'll have run around 150 scheduled tasks (two that run once a minute, another that runs once every five minutes), but the CloudAMQP console's Messages count increased by around 1,300.
My relevant Django settings:
BROKER_URL = os.environ.get("CLOUDAMQP_URL", "")
BROKER_POOL_LIMIT = 1
BROKER_HEARTBEAT = None
BROKER_CONNECTION_TIMEOUT = 30
CELERY_ACCEPT_CONTENT = ['json',]
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_RESULT_EXPIRES = 7 * 86400
CELERY_SEND_EVENTS = False
CELERY_EVENT_QUEUE_EXPIRES = 60
CELERY_RESULT_BACKEND = None
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
My Procfile:
web: gunicorn myproject.wsgi --log-file -
main_worker: python manage.py celery worker --beat --without-gossip --without-mingle --without-heartbeat --loglevel=info
Looking at the Heroku logs I only see the number of scheduled tasks running that I'd expect.
The RabbitMQ overview graphs tend to look like this most of the time:
I don't understand RabbitMQ enough to know if other panels shed light on the problem. I don't think they show anything obvious that would account for all these messages.
I'd like to at least understand what the extra messages are, and then whether there's a way I can eliminate some or all of them.
I had the same issue a few days ago.
For those who get the same issue, CloudAMQP's docs recommend adding some arguments when you launch your Celery worker:
--without-gossip --without-mingle --without-heartbeat
"This will decrease the message rates substantially. Without these
flags Celery will send hundreds of messages per second with different
diagnostic and redundant heartbeat messages."
And indeed, this fixed the issue so far! You finally get only YOUR messages sent.
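For reference, outside the django-celery manage.py wrapper shown in the Procfile above, the flags go on the plain Celery worker command line in the same way (the app name myproject is just a placeholder):
celery -A myproject worker --beat --without-gossip --without-mingle --without-heartbeat --loglevel=info
Gossip, mingle and heartbeat account for most of that background traffic; leaving CELERY_SEND_EVENTS = False (as in the settings above) avoids another source of extra messages.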