Why is the KernelRestarter killing my IBM DSX Python kernel?

On IBM DSX I find that if I leave a long-running Python notebook running overnight, the kernel dies around the same time each night (around midnight UTC).
The Jupyter log shows:
[I 2017-07-29 23:37:14.929 NotebookApp] KernelRestarter: restarting kernel (1/5)
WARNING:root:kernel e827e71b-6492-4dc4-9201-b6ce29c2100c restarted
[D 2017-07-29 23:37:14.950 NotebookApp] Starting kernel: [u'/usr/local/src/bluemix_jupyter_bundle.v54/provision/pyspark_kernel_wrapper.sh', u'/gpfs/fs01/user/sc1c-81b7dbb381fb6a-c4b9ad2fa578/notebook/jupyter-rt/kernel-e827e71b-6492-4dc4-9201-b6ce29c2100c.json', u'spark20master']
[D 2017-07-29 23:37:14.954 NotebookApp] Connecting to: tcp://127.0.0.1:42931
[D 2017-07-29 23:37:17.957 NotebookApp] KernelRestarter: restart apparently succeeded
Neither the kernel log nor the Jupyter log shows anything else before this point.
Is there some policy being enforced here that kills kernels? Or maybe some scheduled downtime each day? Does anybody know why the "KernelRestarter" is kicking in?

The KernelRestarter is not killing anything; it notices that the kernel has died and automatically starts a new one. DSX does have inactivity timeouts, but those would shut down your service altogether rather than kill a single kernel, and they are not tied to a wall-clock time. This looks like a bug in DSX.
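If you can reproduce this on a Jupyter server you control, starting it with debug logging (a standard Jupyter flag, nothing DSX-specific) makes the kernel's exit visible in the log just before the KernelRestarter kicks in:

jupyter notebook --debug    # log at DEBUG level; shows the kernel process dying before the restart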

Related

GCP VM time sync issue after resuming from suspension (in both Linux and Windows)

A GCP VM doesn't update the system date/time after being resumed from suspension; it keeps the same date/time it had when it was suspended. Because of this, my scripts that fetch gcloud resources fail with an auth-token-expiry error.
As per the Google documentation (https://cloud.google.com/compute/docs/instances/managing-instances#linux_1), NTP should already be configured, but on my VMs I get a "command not found" error for ntpq -p.
$ sudo timedatectl status
Local time: Wed 2020-08-05 15:31:34 EDT
Universal time: Wed 2020-08-05 19:31:34 UTC
RTC time: Wed 2020-08-05 19:31:34
Time zone: America/New_York (EDT, -0400)
System clock synchronized: yes
NTP service: inactive
RTC in local TZ: no
gcloud auth activate-service-account in my script fails with the error below:
(gcloud.compute.instances.describe) There was a problem refreshing your current auth tokens: invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.
OS - Windows/Linux
After resuming, the hardware clock of the VM instance is set correctly because it gets its time from the hypervisor; you can check it with sudo hwclock.
The problem is with the operating system's time service.
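A quick way to see the skew described above (hwclock needs root; a large gap right after resume confirms the OS time service is lagging):

date -u                # system clock, which auth tokens are checked against
sudo hwclock --show    # hardware clock, already corrected by the hypervisor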
For Windows, it can take a few minutes to sync the system time with the time source. If you can't wait for the timesync cycle to complete, you can log on to Windows and force time synchronization manually:
net stop W32Time
net start W32Time
w32tm /resync /force
In Linux, NTP cannot handle a time offset of more than 1000 seconds (see http://doc.ntp.org/4.1.0/ntpd.htm). Therefore you have to force time synchronization manually. There are various ways to do that (some of them are deprecated, but may still work):
netdate timeserver1             # deprecated
ntpdate -u -s timeserver1       # query and set the time (deprecated in favour of ntpd/chrony)
hwclock --hctosys               # copy the (correct) hardware clock to the system clock
service ntp restart             # restart the NTP daemon (SysV init)
systemctl restart ntp.service   # restart the NTP daemon (systemd)
If you run into this issue on Google Cloud Platform, note that their images replace ntpd and systemd-timesyncd with chronyd.
I had to use systemctl start chrony to get my time in working order. I tried hwclock --hctosys, but it ignored time zones and thus set the wrong time.
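On chrony-based images, the equivalent of the forced sync commands above is (assuming chronyd is running):

sudo chronyc makestep    # step the clock immediately instead of slewing, handles large offsets after a resume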
This happened because I was accidentally suspending every minute. A permanent fix would be to modify the systemd unit so that it keeps retrying to start the service (a sketch follows below).
The reason it stopped was this error: Can't synchronise: no selectable sources
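Here is a minimal sketch of that permanent fix as a systemd drop-in; the service name chronyd.service and the 30-second retry interval are assumptions:

sudo mkdir -p /etc/systemd/system/chronyd.service.d
sudo tee /etc/systemd/system/chronyd.service.d/restart.conf <<'EOF'
[Service]
# Keep retrying if chronyd stops, e.g. after "no selectable sources".
Restart=always
RestartSec=30
EOF
sudo systemctl daemon-reload
sudo systemctl restart chronyd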

AWS EC2: suddenly lost access due to "No /sbin/init, trying fallback"

My AWS EC2 instance was locked and I lost access from December 6th for an unknown reason. It cannot have been an action I took on the EC2, because I was overseas on holiday from December 1st to January 1st. When I came back I realized the server had lost its connection on December 6th, and I now have no way to connect to the EC2.
EC2 runs on CENTOS 7 and PHP, NGINX, SSHD setup.
When I checked the system log I saw the following:
[ OK ] Started Cleanup udevd DB.
[ OK ] Reached target Switch Root.
Starting Switch Root...
[ 6.058942] systemd-journald[99]: Received SIGTERM from PID 1 (systemd).
[ 6.077915] systemd[1]: No /sbin/init, trying fallback
[ 6.083729] systemd[1]: Failed to execute /bin/sh, giving up: No such file or directory
[ 180.596117] random: crng init done
Any idea what the issue is would be much appreciated.
In brief, I had to do the following to recover; the root cause was that the disk was completely full. A sketch of these steps follows the list.
1) Problem mounting the slaved volume (xfs_admin)
2) Not able to chroot into the environment (ln -s)
3) Disk at 100% (df -h); removed /var/log files
4) Rebuilt the initramfs (dracut -f)
5) Renamed /etc/fstab
6) Switched the slaved volume back to its original UUID (xfs_admin)
7) Configured GRUB to boot the latest version of the kernel/initramfs
8) Rebuilt the initramfs and GRUB
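For what it's worth, a hedged sketch of the cleanup and initramfs steps, run from a rescue instance with the broken root volume attached as /dev/xvdf1; the device name and mount points are assumptions, and the fstab/GRUB steps are omitted:

ORIG_UUID=$(sudo xfs_admin -u /dev/xvdf1 | awk '{print $NF}')   # remember the original UUID
sudo xfs_admin -U generate /dev/xvdf1    # temporary UUID so the slaved XFS volume can mount
sudo mount /dev/xvdf1 /mnt
sudo rm -rf /mnt/var/log/*               # the disk was at 100%, so free space first
for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt dracut -f               # rebuild the initramfs (assumes the chroot is usable)
for d in dev proc sys; do sudo umount /mnt/$d; done
sudo umount /mnt
sudo xfs_admin -U "$ORIG_UUID" /dev/xvdf1   # switch back to the original UUID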

GCloud compute instance halting but not stopping

I want to run a gcloud compute instance only for the duration of my startup script. My startup script calls a nodejs script which spawns sudo halt or shutdown -h 0. Watching the serial port, I see the system come to a halt, but it remains in the RUNNING state, never going into STOPPING or TERMINATED:
Starting Halt...
[ OK ] Stopped Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
Stopping LVM2 metadata daemon...
[ OK ] Stopped LVM2 metadata daemon.
[ 34.560467] reboot: System halted
How is it possible that the system can halt completely but the instance doesn't register as such?
As @DazWilkin said in the comments, you can use poweroff to stop a VM instance in Compute Engine. Alternatively, per the official Compute Engine documentation, you can also run sudo shutdown -h now or sudo poweroff, both of which will accomplish what you are trying to do.
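As an illustration, the tail end of such a startup script might look like this; the nodejs path is a placeholder. The point is that poweroff triggers the ACPI power-off that Compute Engine detects, whereas a bare halt stops the kernel without powering off, leaving the instance RUNNING:

#!/bin/bash
# Hypothetical startup script: run the workload, then power the VM off
# so Compute Engine moves the instance to STOPPING/TERMINATED.
node /opt/my-job.js   # placeholder for the actual nodejs work
sudo poweroff         # unlike "halt", this actually powers the machine off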

Jupyter notebook stopped running on AWS server

I used to run this Jupyter notebook with no problems at all (using port 8889), but since yesterday I have been having trouble accessing it. Below is what happens when I ssh to my server on AWS. When I copy/paste the URL into Safari I get this message: "Safari cannot open the page because the server unexpectedly dropped the connection". SSH works fine, though. I will appreciate the help of this community as I'm new to AWS.
ssh [xxx]
ubuntu@ip-xxx:~$ cd mydir
ubuntu@ip-xxx:~/mydir$ source myenv/bin/activate
(myenv) ubuntu@ip-xxx:~/mydir$ jupyter notebook
[I xxx NotebookApp] The port 8888 is already in use, trying another port.
[I xxx NotebookApp] Serving notebooks from local directory: /home/ubuntu/mydir
[I xxx NotebookApp] The Jupyter Notebook is running at:
[I xxx NotebookApp] http://localhost:8889/?token=XXX
[I xxx NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W xxx NotebookApp] No web browser found: could not locate runnable browser.
[C xxx NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8889/?token=XXX
Run this to see what process is using that port:
sudo netstat -tupln | grep 8888
Afterwards, kill that process and try to start the notebook again.
sudo kill -9 PID
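If lsof is installed, the two steps can be combined into one (the -t flag prints just the PID):

sudo kill -9 $(sudo lsof -t -i :8888)    # find and kill whatever is listening on 8888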

Get VirtualBox server booting status

I have a database server that is started from a VirtualBox VM, and I can start my development work only after the database VM has booted up. How can I check whether the database server has booted up successfully? I am well aware of the command VBoxManage showvminfo,
but it shows the State as 'running' even while the VM is still booting up.
Is there a way to check the booting status?
I suspect not, as even the Vagrant project seems to use a constantly-retrying SSH connection with a large timeout to determine if the machine is ready.
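For what it's worth, here is a minimal sketch of that approach; the user name, host name, and the roughly five-minute budget are placeholder assumptions:

# Keep retrying SSH with a short per-attempt timeout until the guest
# accepts a connection, then treat the database VM as booted.
for i in $(seq 1 60); do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes user@dbvm true 2>/dev/null; then
        echo "VM is up"
        break
    fi
    sleep 5
done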