I have an AWS EC2 p3.2xlarge instance. I can SSH into it without trouble. However, after about 20 minutes, while I am training a Keras model on it, the connection is reset and I am kicked out with the error Connection reset by 54.161.50.138 port 22. I am then able to reconnect, but I have to start training the model over again because my progress was lost. This happens every time I connect to the instance. Any idea why this is happening?
For SSH I am using Gow, which lets me run Linux commands on Windows - https://github.com/bmatzelle/gow/wiki
I checked my public IP address before and after the reset, and it was the same.
I also looked at the CPU usage in Amazon CloudWatch, and it was normal (20%).
I found a partial solution to this. In the instance terminal, do the following:
1) Run the command "tmux".
2) In the new shell that opens, start the job.
3) Detach from the tmux shell with the shortcut Ctrl+b, then d.
4) If the SSH connection resets, SSH into the instance again and run "tmux attach".
5) The job should have kept running, and you can resume where you left off.
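The same idea can also be scripted rather than typed interactively. The following is a minimal sketch, not the poster's exact procedure: it starts the job in a detached tmux session in one step. The function name start_in_tmux, the session name and the training script are made up for illustration.

```shell
# Start a command inside a detached tmux session so that it keeps running
# after the SSH connection drops. Both arguments are chosen by the caller.
start_in_tmux() {
    session_name=$1
    job_command=$2
    # -d: do not attach to the new session; -s: give the session a name.
    tmux new-session -d -s "$session_name" "$job_command"
}

# Example call on the instance (session name and script are hypothetical):
#   start_in_tmux train 'python train.py 2>&1 | tee train.log'
# After reconnecting over SSH:
#   tmux list-sessions            # confirm the session is still there
#   tmux attach-session -t train  # reattach to it
```

Note that a session started this way ends when its command exits, so sending the output to a log file (as in the example call) keeps it available afterwards.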
Related
I have launched 5 or 6 EC2 instances, and every time I reboot one (or stop and start it) using the AWS console (not the command line), I find myself unable to connect to it.
When I try to connect using ssh, this is what gets logged:
ssh: connect to host ec2-xx-xxx-xxx-xx.eu-west-3.compute.amazonaws.com port 22: Connection timed out
And when I use the AWS console to try to connect in the browser, this is what I see:
I have waited hours after restarting and still get the same result. This happens with every single instance, every time I reboot it or stop and restart it.
This is deeply frustrating and there seems to be no answer on the internet.
Wait for some time after the reboot before trying to connect to the instance again.
I have an Ubuntu 18.04-based EC2 instance using an Elastic IP address. I am able to SSH into the instance without any problems.
apt runs some unattended updates on the instance. If I reboot the system after the updates, I am no longer able to SSH into the system. I get the error ssh: connect to host XXX port 22: Connection refused
A few points:
Even after the updates, I am able to SSH in before the reboot.
The method of restart does not make a difference: sudo shutdown -r now and the EC2 dashboard give the same result.
There are no problems with sshd_config. I detached the volume and attached it to a new, working instance; sshd -t did not report any problems either.
I am able to run sudo systemctl restart ssh.service after the updates but before the system restart.
I have tried with and without an Elastic IP, with the same result.
From the system logs, I see that SSH is trying to start but failing for some reason.
I want to find out why the ssh daemon is not starting. Any pointers?
Update:
System Logs
Client Logs
No changes in the security groups before and after reboot
EC2 > Network & Security > Security Groups > Edit inbound rules > SSH 0.0.0.0/0
Step 1: EC2 > Instances > Actions > Image and templates > Create image
Step 2: Launch a new instance using the AMI image.
I had missed the error "Failed to start Create Static Device Nodes in /dev." in the system logs. The solution given at https://askubuntu.com/questions/1301750/ubuntu-16-04-failed-to-start-create-static-device-nodes-in-dev solved my problem.
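Because the decisive line was easy to overlook in a long system log, it can help to filter the log for failed units. This is a small sketch that assumes the console output has already been saved to a file; the function name find_failed_units, the file name console.log and the instance ID placeholder are made up for illustration.

```shell
# Print every line of a saved system log that reports a unit failing to
# start, prefixed with its line number.
find_failed_units() {
    log_file=$1
    grep -n 'Failed to start' "$log_file"
}

# One way to obtain the log with the AWS CLI (the instance ID is a placeholder):
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text > console.log
#   find_failed_units console.log
```

Any unit listed here that starts before the SSH service is worth checking first, since a failure early in boot can keep later services from coming up.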
I have a Google VM instance that stopped working sometime in the last 4 days. The last time I accessed it, everything was fine. By 'stopped working' I mean:
Unable to connect to websites hosted at that instance
Unable to connect to the instance using gcloud compute ssh
I can connect to the instance by opening an SSH terminal in a browser window from within console.cloud.google.com.
Running gcloud compute ssh from my local terminal results in:
ssh: connect to host 34.69.41.204 port 22: Operation timed out
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
Connecting over http results in:
wget http://panam.whensparksfly.org
--2021-10-11 11:18:00-- http://panam.whensparksfly.org/
Resolving panam.whensparksfly.org (panam.whensparksfly.org)... 34.69.41.204
Connecting to panam.whensparksfly.org (panam.whensparksfly.org)|34.69.41.204|:80... failed: Operation timed out.
If I run that same wget command from the browser-based terminal I started from https://console.cloud.google.com, it works.
I've tried stopping and restarting the instance. I also have another instance that I usually leave off. I started that instance and had the same problem.
Here are the firewall rules for that instance:
How should I go about troubleshooting this?
This is not an answer to my specific question about how to troubleshoot the problem, but here's how I resolved the issue:
Create a new machine image from the original instance
Create a new instance from the machine image. Go to the Machine Images page in the Google Cloud Console, click the Actions button for the desired image, then click create instance.
I was able to transfer my static external IP address to the new instance by following the instructions here.
Everything is now working as before.
I need to train a deep learning model on AWS EC2 instances. I can connect to the instances over SSH, and after establishing the connection I run the training on them. If my Wi-Fi goes down, I lose the connection to the instances, with the message "Connection broken pipes". I then have to establish the SSH connection again, which is like restarting the instances from scratch.
How can I save the state of the running instances so that, after reconnecting, I get the previous state back?
Running the command in the background is one way to handle this situation.
You can run commands with 'nohup', and they will keep running even if your SSH session disconnects.
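A minimal sketch of the nohup approach follows. The JOB string is only a stand-in so the example runs anywhere; on the instance it would be the real training command (for example python train.py, a hypothetical script name). The file names train.log and train.pid are likewise arbitrary.

```shell
# Stand-in for the real training command (an assumption for this sketch).
JOB='echo "training started"; sleep 1; echo "training finished"'

# nohup makes the job ignore the hangup signal sent when the SSH session
# ends; "&" runs it in the background; all output goes to train.log.
nohup sh -c "$JOB" > train.log 2>&1 &

# Remember the process ID so the job can be checked after reconnecting.
echo $! > train.pid
```

After reconnecting, tail -f train.log shows the progress so far, and kill -0 "$(cat train.pid)" succeeds only while the job is still running. nohup keeps the process alive across a dropped connection; it does not save or restore anything if the instance itself is stopped.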
About 6 months ago I created an AWS EC2 instance to mess around with on the free tier. After months of having no issues remoting into my AWS EC2 server, I've recently been unable to access it via SSH. I am using the following command:
ssh -i my-key-pair.pem ec2-user@ec2-**-**-***-***.us-****-*.compute.amazonaws.com
...and after a minute or two, I get this response:
ssh: connect to host ec2-**-**-***-***.us-****-*.compute.amazonaws.com port 22: Operation timed out
What's strange is that:
1) I can read and write to my RDS database just fine
2) I can ping the server
3) My port 22 is open
4) The instance is running and healthy
5) The Inbound section of the EC2 server's security group allows all traffic, and SSH from any location on port 22
6) I'm using the same key pair as always
I went through this documentation (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesConnecting.html) and can confirm that the VPC, subnet, network ACL and route tables all line up (I haven't changed anything with those since the SSH stopped working). Any insight would be extremely helpful!
Sometimes the instance itself fails; you can check its screen via the AWS console.
Run another instance in the same security group, connect to it, and from there try to connect to your original one, to verify whether SSH is still open (even if you do not have the SSH key, the error will not be 'timeout').
You can create a snapshot of your instance's volume, attach a volume created from it to a new instance, and investigate the logs; maybe something went wrong.
You can restart the instance. If, for example, it ran out of memory, it will most likely work after the reboot (hopefully for long enough for you to investigate).
You can contact AWS support.
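The second suggestion relies on the difference between a timeout and other SSH errors. As a rough rule of thumb, sketched below, the wording of the client error hints at where to look first. The function name classify_ssh_error and its three labels are made up for illustration, and the mapping is a heuristic, not a diagnosis.

```shell
# Map an SSH client error message to a coarse category.
classify_ssh_error() {
    case "$1" in
        *"timed out"*)
            # No reply at all: packets are probably being dropped on the way
            # (security group, network ACL, routing, or a host firewall).
            echo "timeout" ;;
        *"Connection refused"*)
            # The host answered, but nothing accepts connections on the
            # port: sshd is likely not running.
            echo "refused" ;;
        *"Connection reset"*|*"Broken pipe"*|*"broken pipe"*)
            # A session that was already established got cut off.
            echo "dropped" ;;
        *)
            echo "unknown" ;;
    esac
}

classify_ssh_error "ssh: connect to host example port 22: Operation timed out"
```

The call at the end prints timeout, which points toward the network path rather than the SSH daemon.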