GCP VM time sync issue after resuming from suspension (in both linux and windows)

A GCP VM doesn't update the system date and time after it resumes from suspension.
It keeps the date and time it had when it was suspended. Because of this, my scripts that fetch gcloud resources fail with an auth token expiry error.
According to the Google documentation (https://cloud.google.com/compute/docs/instances/managing-instances#linux_1),
NTP is already configured, but on my VMs ntpq -p returns a "command not found" error.
$ sudo timedatectl status
Local time: Wed 2020-08-05 15:31:34 EDT
Universal time: Wed 2020-08-05 19:31:34 UTC
RTC time: Wed 2020-08-05 19:31:34
Time zone: America/New_York (EDT, -0400)
System clock synchronized: yes
NTP service: inactive
RTC in local TZ: no
gcloud auth activate-service-account in my script fails with the error below:
(gcloud.compute.instances.describe) There was a problem refreshing your current auth tokens: invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.
OS - Windows/Linux

After resuming, the hardware clock of the VM instance is set correctly as it gets time from the hypervisor. You can check it with sudo hwclock.
The problem is with the time service of the operating system.
For Windows, it could take a few minutes to sync the system time with the time source. If you can't wait for the time sync cycle to complete, you can log on to Windows and force time synchronization manually:
net stop W32Time
net start W32Time
w32tm /resync /force
In Linux, NTP cannot handle a time offset of more than 1000 seconds (see http://doc.ntp.org/4.1.0/ntpd.htm). Therefore you have to force time synchronization manually. There are various ways to do that (some of them are deprecated but may still work):
netdate timeserver1
ntpdate -u -s timeserver1
hwclock --hctosys
service ntp restart
systemctl restart ntp.service
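
To illustrate why the offset after a long suspension needs a manual sync, here is a minimal Python sketch of how ntpd reacts to different clock offsets. The 1000-second limit comes from the answer above; the 128 ms step threshold is the ntpd default as far as I know, and the function name ntpd_reaction is made up for this example. It is a model of the behavior, not ntpd's actual code.

```python
# Thresholds are ntpd defaults as assumed here; a custom build or
# "tinker" settings in ntp.conf can change them.
STEP_THRESHOLD_S = 0.128    # above this, ntpd steps the clock instead of slewing
PANIC_THRESHOLD_S = 1000.0  # above this, ntpd gives up instead of correcting


def ntpd_reaction(offset_seconds: float) -> str:
    """Classify how ntpd would treat a clock offset (hypothetical helper)."""
    magnitude = abs(offset_seconds)
    if magnitude > PANIC_THRESHOLD_S:
        return "panic"  # offset too large: manual sync needed
    if magnitude > STEP_THRESHOLD_S:
        return "step"   # clock is set in one jump
    return "slew"       # clock is adjusted gradually


if __name__ == "__main__":
    # A VM suspended for six hours resumes with an offset far past the limit.
    for offset in (0.05, 30.0, 6 * 3600.0):
        print(f"{offset:>10.2f} s -> {ntpd_reaction(offset)}")
```

A VM that was suspended for more than about 17 minutes lands in the "panic" range, which is why one of the commands above has to be run by hand.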

If you run into this issue on Google Cloud Platform, note that they replace ntpd and systemd-timesyncd with chronyd.
I had to run systemctl start chrony to get my time back in working order. I tried hwclock --hctosys, but it ignored time zones and therefore set the wrong time.
This happened because I was accidentally suspending every minute. A permanent fix would be to modify the systemd unit definition so that it keeps retrying to start the service.
The reason it stopped was this error: Can't synchronise: no selectable sources

Related

AWS-EC2 Redis-server RDB snapshot write error

I have a web application running on the Laravel 5.2 framework, with the session driver set to redis, on the following AWS setup.
Instance-1: Running the web application, with the Redis configuration in the .env file as follows
Redis-host: aws-private-ip-of-instance-2
Redis-password: NULL
Redis-port: 6379
Instance-2: Redis-server running with following configuration
Bind aws-private-ip-of-instance-2 and 127.0.0.1
Working directory /var/lib/redis with 775 permissions; owner and group are redis.
RDB snapshot name dump.rdb with 660 permissions; owner and group are redis.
NOTE: An AWS inbound rule for port 6379 is configured for Instance-2.
Everything works fine until Redis tries to write the data to the RDB file. The following error then shows on the front end:
MISCONF Redis is configured to save RDB snapshots, but is currently
not able to persist on disk. Commands that may modify the data set are
disabled. Please check Redis logs for details about the error.
In the Redis server logs I found the following:
4873:M 23 Sep 10:08:15.028 * 1 changes in 900 seconds. Saving...
4873:M 23 Sep 10:08:15.028 * Background saving started by pid 7392
7392:C 23 Sep 10:08:15.028 # Failed opening .rdb for saving: Read-only file system
4873:M 23 Sep 10:08:15.128 # Background saving error
Things I have tried
Added vm.overcommit_memory = 1 to /etc/sysctl.conf, as suggested in the Redis administration blog.
Changed the path of the dump.rdb file to the tmp folder and changed its permissions to 777.

This other Stack Exchange thread might help, since you are using a custom /tmp dir for data:
The simple way to do this is to run systemctl edit redis. This will create an override drop-in file /etc/systemd/system/redis.service.d/override.conf, in which you can place your changes (and the proper section):
[Service]
ReadWriteDirectories=-/my/custom/data/dir
You may also create that directory and place files ending in .conf in it manually. But do not leave the directory empty, as this will disable the service.
In either case, run systemctl daemon-reload and you are ready to restart your service.
Many threads also point to filesystem inconsistency as the root cause. Since you are using EC2, check this AWS forums post:
To fix this, you will have to:
Stop the instance
Detach the root volume of your instance
Attach the volume as a data volume to any running Linux instance in the same availability zone
Perform a filesystem check (fsck) on the volume and fix the issues
Detach the volume and attach it back to your instance as its root volume
Boot the instance again and verify that the volume mounts successfully
As a last resort, terminate the instance if possible.
Hope it helps!

It is embarrassing to answer my own question, since the cause was a really stupid mistake, but I hope newcomers here can learn from it too.
The first thing I did was enable detailed logging for redis-server in /etc/redis/redis.conf by changing the loglevel option to debug.
Watching the logs, I realized that my Redis port 6379 was open to everyone on the internet.
The logs showed that someone else's server was connecting to my Redis server and making it a slave of theirs. Because my Redis server is configured so that a slave is read-only, it threw the read-only error whenever I tried to access it.
After firewalling the Redis server port, I have not encountered this issue again.
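
A quick way to check for the same symptom is to ask the server for its replication role. The following is a rough Python sketch using only the standard library; the helper names parse_role and fetch_replication_info are made up for this example, and it assumes the server accepts unauthenticated inline commands, as the setup in the question (no password) did.

```python
import socket


def parse_role(info_text: str) -> str:
    """Return the 'role' field from INFO replication output, or 'unknown'."""
    for line in info_text.splitlines():
        if line.startswith("role:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def fetch_replication_info(host: str, port: int = 6379, timeout: float = 3.0) -> str:
    """Send 'INFO replication' as an inline command and return the raw reply.

    A single recv() is assumed to be enough for this short reply; a robust
    client would read until the announced bulk length has arrived.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"INFO replication\r\n")
        return conn.recv(65536).decode("utf-8", errors="replace")
```

Calling parse_role(fetch_replication_info(host)) with the private IP of Instance-2 should return master on a healthy standalone server. If it returns slave and you never configured replication, or if the call succeeds at all from outside your network, the port is exposed.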

Decrease django's CPU Elapsed time from 31353.191 msec

According to the django-debug-toolbar my CPU Time is around 31000ms (on average). This is true for my own pages as well as for the admin. Here is the breakdown when loading http://127.0.0.1:8000/admin/:
Resource usage
User CPU time 500.219 msec
System CPU time 57.526 msec
Total CPU time 557.745 msec
Elapsed time 30236.380 msec
Context switches 11 voluntary, 1345 involuntary
Browser timing (Timing attribute / Milliseconds since navigation start (+length))
domainLookup 2 (+0)
connect 2 (+0)
request 7 (+30259)
response 30263 (+3)
domLoading 30279 (+1737)
domInteractive 31154
domContentLoadedEvent 31155 (+127)
loadEvent 32016 (+10)
As far as I understand the "request" step [7 (+30259)] is the biggest bottleneck here. But what does that tell me? The request panel just shows some variables, and no GET nor POST data.
The same code works fine hosted on pythonanywhere; locally I am running a MacBook Air (i5, 1.3 GHz, 8 GB RAM). The performance hasn't always been this poor. IIRC it happened "overnight": one day I started the dev server and it was slow. I hadn't changed anything in the code or DB.
Is it right to assume that it could be an issue with my local machine?
EDIT:
I tried running ./manage.py runserver --noreload but the performance didn't improve. Starting the dev server (using ./manage.py runserver) also takes around 40s, and accessing the DB using Postico takes around 1 minute. Starting the dev server with the database commented out of Django's settings makes load times normal.

Solved it.
This post pointed me in the right direction. Essentially, I ended up "resetting" my hosts file according to this post. My hosts file now looks like this:
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
This does not, however, explain what caused the sudden "overnight" issue in the first place. My guess: a renamed hostname. A couple of days ago my hostname changed to something along the lines of android- (similar to this), which was apparently caused by my using Android's file share tool. I ended up renaming my hostname to my username (see instructions below).
Perform the following tasks to change the workstation hostname using the scutil command.
Open a terminal.
Type the following command to change the primary hostname of your Mac. This is your fully qualified hostname, for example myMac.domain.com:
sudo scutil --set HostName
Type the following command to change the Bonjour hostname of your Mac. This is the name usable on the local network, for example myMac.local:
sudo scutil --set LocalHostName
Optional: If you also want to change the computer name, type the following command. This is the user-friendly computer name you see in Finder, for example myMac:
sudo scutil --set ComputerName
Flush the DNS cache by typing:
dscacheutil -flushcache
Restart your Mac.
from here
Didn't test the "renaming hostname" theory, though.
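
One way to test the hostname theory is to time how long the machine takes to resolve its own hostname, since a name missing from the hosts file can make every lookup wait for a DNS timeout. This is only a diagnostic sketch in Python; the function name time_hostname_lookup is made up here, and a slow result would support the theory without proving it.

```python
import socket
import time


def time_hostname_lookup() -> tuple:
    """Resolve this machine's own hostname and measure how long it takes.

    Returns (hostname, elapsed_seconds, resolved).
    """
    hostname = socket.gethostname()
    start = time.perf_counter()
    try:
        socket.gethostbyname(hostname)
        resolved = True
    except OSError:
        # Covers socket.gaierror: the name could not be resolved at all.
        resolved = False
    elapsed = time.perf_counter() - start
    return hostname, elapsed, resolved


if __name__ == "__main__":
    name, seconds, ok = time_hostname_lookup()
    print(f"{name}: resolved={ok} in {seconds:.3f} s")
```

On a healthy machine the lookup should return almost immediately; a result measured in seconds points at name resolution rather than at Django or the database itself.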

aws s3 time not synced, authentication failure using awsaccesskeyid

I had an issue where my AWS access key and secret key would not authenticate:
aws s3 ls
gave
An error occurred (RequestTimeTooSkewed) when calling the ListBuckets operation: The difference between the request time and the current time is too large.
So, I tried syncing the time with my local time, which was incorrect. Even after the sync, the issue persisted.
I am in the ap-south-1 (Mumbai) region. My time was set correctly, but the error still occurred.
I tried launching an instance and timedatectl gave this,
Local time: Sat 2018-09-08 08:25:06 UTC
Universal time: Sat 2018-09-08 08:25:06 UTC
RTC time: Sat 2018-09-08 08:25:05
Time zone: Etc/UTC (UTC, +0000)
Network time on: yes
NTP synchronized: yes
RTC in local TZ: no
The server is also in ap-south-1, so I don't get why its local time is (UTC, +0000).
Trying to set my system clock to a similar time (UTC, +0000) results in this:
Local time: Sat 2018-09-08 20:09:46 +00
Universal time: Sat 2018-09-08 20:09:46 UTC
RTC time: Sat 2018-09-08 20:09:46
Time zone: Atlantic/Azores (+00, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: no
RTC in local TZ: no
I've tried adjusting my machine's time to everything I can think of but am still unable to fix this error. I also added servers from my region to ntpd.conf:
server 3.in.pool.ntp.org
server 3.asia.pool.ntp.org
server 0.asia.pool.ntp.org
But this didn't help either.
Local Machine is running Ubuntu 18.04LTS, Instance is Ubuntu 16.04LTS.
Is there something I'm missing about this? Thanks in advance.

I don't know how or why this was caused, but changing the time manually, both in the BIOS and in system settings (entering the hours and minutes), fixed it. I should have tried that first.
Thanks for the help.
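
For anyone debugging RequestTimeTooSkewed, it helps to measure the skew instead of guessing. Below is a small Python sketch that compares a local timestamp with the Date header of any HTTP response. The 15-minute tolerance is what I understand AWS to allow and is an assumption here, and the helper name clock_skew_seconds is made up. The sample values are borrowed from the two timedatectl outputs in the question purely as input; they were not captured at the same moment.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_SKEW_SECONDS = 15 * 60  # assumed AWS tolerance, not taken from the question


def clock_skew_seconds(date_header: str, local_now: datetime) -> float:
    """Return local time minus server time in seconds (positive = local clock ahead).

    date_header is an HTTP Date header value; local_now must be timezone-aware.
    """
    server_time = parsedate_to_datetime(date_header)
    return (local_now - server_time).total_seconds()


if __name__ == "__main__":
    local = datetime(2018, 9, 8, 20, 9, 46, tzinfo=timezone.utc)
    skew = clock_skew_seconds("Sat, 08 Sep 2018 08:25:06 GMT", local)
    verdict = "too skewed" if abs(skew) > MAX_SKEW_SECONDS else "ok"
    print(f"skew = {skew:.0f} s ({verdict})")
```

Because the comparison is done in UTC, the time zone shown by timedatectl does not matter; only the absolute clock value does.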

SSH Connection disconnected

I'm a student from Korea.
First, I'm sorry about my poor English :)
I'm building a web service using AWS + nginx + Django.
I connect to the AWS instance (Ubuntu) over SSH:
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-74-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information as of Sat Apr 30 07:03:51 UTC 2016
System load: 0.0 Processes: 105
Usage of /: 23.8% of 7.74GB Users logged in: 0
Memory usage: 14% IP address for eth0: 172.31.17.137
Swap usage: 0%
Graph this data and manage this system at:
https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
21 packages can be updated.
17 updates are security updates.
Last login: Sat Apr 30 07:03:52 2016 from 210.103.124.253
pyenv-virtualenv: no virtualenv has been activated.
and then I run
manage.py runserver --settings=abc.settings.production
so everyone can access my web service!
But after 30 minutes
the SSH connection breaks by itself
and prints this message:
packet_write_wait: Connection to 52.69.xxx.xxx: Broken pipe
After that nobody can access my web service.
So my website is not accessible when my computer is powered off or there is no SSH connection.
I want everyone to be able to access my web service 24/7.
Please suggest a method, thank you :)

When you want to run a command that continues after your current shell terminates, you should use the nohup command to launch it.
That causes the process to be detached from its initial parent shell so it is not killed when the parent terminates.
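
The same idea can be sketched in Python. Note that this is not nohup itself but a related technique: the child is started in a new session, so it no longer belongs to the terminal that the SSH connection controls. The function name launch_detached is made up for this example, and start_new_session only works on POSIX systems.

```python
import subprocess
import sys


def launch_detached(command: list) -> int:
    """Start command in its own session and return its PID (hypothetical helper)."""
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # do not hold on to the terminal
        stdout=subprocess.DEVNULL,  # redirect to a log file instead if output matters
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # POSIX only: runs setsid() in the child
    )
    return process.pid


if __name__ == "__main__":
    # Placeholder command; substitute the server command from the question.
    pid = launch_detached([sys.executable, "-c", "import time; time.sleep(1)"])
    print(f"started detached process {pid}")
```

The child keeps running after the launching shell exits, which is the property the answer relies on.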

Django celery beat Substantial drift from warning message

I am developing in a VM (Vagrant) and I get this message when I start celery beat inside it:
[2014-07-15 10:16:49,627: INFO/MainProcess] beat: Starting...
[W 140715 09:16:51 state:74] Substantial drift from celery#worker_publications may mean clocks are out of sync. Current drift is
3600 seconds. [orig: 2014-07-15 09:16:51.476125 recv: 2014-07-15 10:16:51.474109]
[W 140715 09:16:51 state:74] Substantial drift from celery#worker_queue may mean clocks are out of sync. Current drift is
3600 seconds. [orig: 2014-07-15 09:16:51.480642 recv: 2014-07-15 10:16:51.475021]
When I run date inside it I get Tue Jul 15 09:25:11 UTC 2014, but I live in Portugal and my host machine gives me Ter Jul 15 10:25:39 WEST 2014.
What's the best approach to fix this?
What about when I put this live?
I am using celery 3.1.12 and I do not have CELERY_TIME_ZONE set.

Sometimes the Django settings alone will not help. The reason is that the local time on the instance or on the local machine is not correct (even by a few seconds).
To make sure the time is the same across multiple instances (in my case EC2 and DigitalOcean):
sudo apt-get install ntp
sudo /etc/init.d/ntp restart
The above will make sure the time is in sync.
After this, as mentioned above, I use the following in Django:
TIME_ZONE = 'America/Los_Angeles'
TZINFO = 'UTC'
USE_TZ = True
# For celery
CELERY_ENABLE_UTC = True
My pip freeze shows the following:
django-celery==3.1.16
Django==1.6
celery==3.1.8

This usually indicates a time zone mismatch. Note that the drift is 3600 seconds = exactly one hour.
The celery variable to set is CELERY_TIMEZONE, not CELERY_TIME_ZONE.
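
The one-hour figure can be checked directly from the orig and recv timestamps in the warning. A minimal Python sketch (the helper name drift_seconds is made up for this example):

```python
from datetime import datetime

# Timestamp format used in the "[orig: ... recv: ...]" part of the warning.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def drift_seconds(orig: str, recv: str) -> float:
    """Return recv minus orig in seconds."""
    delta = datetime.strptime(recv, FMT) - datetime.strptime(orig, FMT)
    return delta.total_seconds()


if __name__ == "__main__":
    drift = drift_seconds("2014-07-15 09:16:51.476125", "2014-07-15 10:16:51.474109")
    print(round(drift))  # 3600
```

A drift that is a whole number of hours points to a time zone offset between the two sides (here UTC versus WEST, which is UTC+1) rather than to a clock that is actually wrong.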
