GCE boot fails with disk created from snapshot - google-cloud-platform

I'm trying to create a copy of a running system in GCE. I took a snapshot of the running system's disk and set up a new instance from it. After starting, the instance shows as running, but the log from serial console 1 only says:
SeaBIOS (version 1.8.2-google)
Total RAM Size = 0x0000000100000000 = 4096 MiB
CPUs found: 2 Max CPUs supported: 2
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=20971520 = 10240 MiB
drive 0x000f22f0: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=20971520
Sending Seabios boot VM event.
Booting from Hard Disk 0...
So the system is not reachable. I tried to check the UUID of the drive, but it seems to be the right one. Can someone tell me how to fix this?
Best regards
Alex

What you're experiencing looks very much like a bug. You may try to stop the VM, take a snapshot, and then try creating a new instance from it.
My recommendation would be to contact GCP support to get more immediate help (however, it's a paid service) or open up a new issue at Google IssueTracker to get help for free, but there are no ETAs for that.
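If you want to script that workaround, here is a minimal sketch in Python driving the gcloud CLI; the zone and the source-vm / source-disk / clone-* names are placeholders, not taken from your setup. Stopping the VM first gives the snapshot a consistent filesystem to boot from.
import subprocess

ZONE = "europe-west1-b"  # assumption: replace with your zone

def gcloud(*args):
    # Run a gcloud compute subcommand and fail loudly if it errors.
    subprocess.run(["gcloud", "compute", *args], check=True)

# 1. Stop the source VM so the boot disk is quiesced before snapshotting.
gcloud("instances", "stop", "source-vm", "--zone", ZONE)

# 2. Snapshot the now-consistent boot disk.
gcloud("disks", "snapshot", "source-disk", "--zone", ZONE, "--snapshot-names", "clone-snap")

# 3. Create a new disk from the snapshot and boot a fresh instance from it.
gcloud("disks", "create", "clone-disk", "--source-snapshot", "clone-snap", "--zone", ZONE)
gcloud("instances", "create", "clone-vm", "--zone", ZONE, "--disk", "name=clone-disk,boot=yes,auto-delete=yes")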

Related

AWS EC2: cannot get bare metal instance

I have tried several times in the last two weeks to log on to a c5.metal instance. Each time I get "Initializing" in the status checks field, but after 10 minutes it is still "Initializing" and I'm not able to log on. I have had success with c5.metal before, but not anymore.
Today I also tried to get an m5.metal instance. This time the instance successfully initialized after 10 minutes, but I was not able to log on with PuTTY. I stopped the instance, then after about 30 minutes I tried again; this time I did not get past "Initializing" in the status check field, and I stopped it after 15 minutes.
I get billed for the 10 to 15 minute bare metal wait periods, even when initialization doesn't complete. I have no problems with AWS virtual instances.
Thanks for any ideas on what I can do to get the bare metal instances to work.
To reproduce your situation, I did the following:
Launched an Amazon EC2 instance in Ohio:
Instance Type: c5.metal
AMI: Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
Network: In my Default VPC so that it uses a Public Subnet
Security Group: Default settings, which grants port 22 access from the Internet
Instance entered running state very quickly, Status Checks showed as Initializing
It took about 8 minutes until the status checks were showing 2/2 checks (it might have been faster, but I was testing other things in the meantime).
I was able to successfully login to the instance:
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-1065-aws x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Sat Jun 6 23:21:18 UTC 2020
System load: 0.02 Processes: 924
Usage of /: 13.7% of 7.69GB Users logged in: 0
Memory usage: 0% IP address for enp125s0: 172.31.9.77
Swap usage: 0%
0 packages can be updated.
0 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
ubuntu@ip-172-31-9-77:~$
(Actually, I first tried to log in as ec2-user, and it took me a while to realize this was an Ubuntu AMI, so I connected as ubuntu.)
It is possible that the slow startup is due to the Operating System or hardware checking the 192GB of RAM that is allocated to the instance.
I booted another instance using an Amazon Linux 2 AMI and it required approximately 7 minutes before I could connect.
I also noticed that the c5.metal instances did not provide anything for "Get System Log" or "Get Instance Screenshot". This might be a result of using a bare-metal instance.
I joined John Rotenstein's twitch.tv channel and he showed how he got a c5.metal instance. What I learned is that if a metal instance does not work in the availability zone you had chosen, try launching a new instance in a different availability zone (us-east-2a and us-east-2c are zones within the same region). For example, I had a c5.metal instance in us-east-2a. Following John's directions, I launched an instance in us-east-2c, and after about 8 minutes the instance was ready for use.
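If you prefer to script the retry rather than click through the console, a minimal sketch using the Python SDK (boto3) follows; the AMI ID and key pair name are placeholders, and the availability zone is pinned explicitly so you can hop zones when a bare-metal launch misbehaves.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",        # placeholder: an Ubuntu 18.04 AMI in us-east-2
    InstanceType="c5.metal",
    KeyName="my-key",              # placeholder key pair
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-2c"},  # try a different AZ than the one that failed
)
instance_id = resp["Instances"][0]["InstanceId"]

# Bare-metal instances can take several minutes to reach 2/2 status checks.
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
print(instance_id, "passed both status checks")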

AWS EC2 Instance starting time

Sometimes an instance takes more than 5 minutes to start; in those cases, the status checks take more than 4 minutes.
How can I make the instance ready in less than a minute, including the status checks?
You do not need to wait for the Instance Status Check to complete before using an Amazon EC2 instance.
Linux instances are frequently ready 60-90 seconds after launch. Windows instances take considerably longer because the AMI has been configured for sysprep, which involves a reboot.
New instances take longer to be ready than existing instances because they typically run code on first startup. So, if you Stop an instance and later Start it again, the instance will be available quite quickly (especially Linux instances).
I'm not sure that "You do not need to wait for the Instance Status Check to complete" is correct; if the status check failed for any reason, you (obviously) have a problem and should investigate it before using the instance.
Doing a quick check using an AWS SDK script that creates a nano instance from a Linux image loaded with Ubuntu, Apache, Tomcat, Java, MySQL, etc., it took 45 seconds to reach "running" and 2 minutes 15 seconds to finish the status checks.
Starting an existing "stopped" nano instance took 18 seconds to reach "running" and 2 minutes 15 seconds to finish the status checks.
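For reference, here is a sketch of how such timings can be measured with the Python SDK (boto3) rather than the Java SDK; the AMI ID is a placeholder and t3.nano stands in for the nano instance mentioned above.
import time
import boto3

ec2 = boto3.client("ec2")
start = time.time()

resp = ec2.run_instances(ImageId="ami-xxxxxxxx",   # placeholder Linux AMI
                         InstanceType="t3.nano", MinCount=1, MaxCount=1)
iid = resp["Instances"][0]["InstanceId"]

# Time until the instance reports "running" (usually well under 2 minutes for Linux).
ec2.get_waiter("instance_running").wait(InstanceIds=[iid])
print(f"running after {time.time() - start:.0f} s")

# Time until both status checks pass (typically a couple of minutes more).
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[iid])
print(f"status checks passed after {time.time() - start:.0f} s")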
You can't change the instance health checks; they are managed by AWS. When a system status check fails, you can choose to wait for AWS to fix the issue, or you can resolve it yourself by stopping and starting the instance, which in most cases migrates it to a new host computer.
The following are examples of problems that can cause system status checks to fail:
Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network reachability
The instance should be accessible as soon as it boots; it should not take 5 minutes. You can check the instance boot log or screen from EC2 --> Actions --> Instance settings --> `Get system log` and `Get instance screenshot` to troubleshoot and improve instance startup time.
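The same information is reachable through the API; a hedged sketch with boto3 (the instance ID is a placeholder) that pulls the system log and screenshot and then performs the stop/start that usually migrates the instance to a new host:
import base64
import boto3

ec2 = boto3.client("ec2")
iid = "i-0123456789abcdef0"   # placeholder instance ID

# Equivalent of the console's "Get system log" (returned base64-encoded).
out = ec2.get_console_output(InstanceId=iid).get("Output", "")
print(base64.b64decode(out).decode(errors="replace") if out else "no console output yet")

# Equivalent of "Get instance screenshot" (JPG, base64-encoded).
shot = ec2.get_console_screenshot(InstanceId=iid)
with open("screenshot.jpg", "wb") as f:
    f.write(base64.b64decode(shot["ImageData"]))

# Stop and then start (not reboot) to move the instance to healthy hardware.
ec2.stop_instances(InstanceIds=[iid])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[iid])
ec2.start_instances(InstanceIds=[iid])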

Experienced problems with our RDS instance

We experienced problems with our RDS instance: it stops running. The RDS instance shows a "green" state on the AWS console, but we cannot connect to it.
In the logs we found the following errors:
2018-03-07 8:52:31 47886953160896 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:52:32 47886953160896 [ERROR] Plugin 'InnoDB' init function returned error.
2018-03-07 8:53:46 47508779897024 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:53:46 47508779897024 [ERROR] Plugin 'InnoDB' init function returned error.
When we tried to reboot the RDS instance, it took almost 2 hours to reboot. After rebooting, it is working fine again.
Can someone help us find the root cause of this incident?
A t2.small provides 2 GiB of RAM. As you may know, most DB engines tend to use up to 75% of the memory for caching purposes such as queries, temporary tables, and table scans to make things go faster.
For the MariaDB engine, the following parameters are set to these pre-optimized values by default:
innodb_buffer_pool_size (DB instance memory * 3/4 ≈ 1.5 GB)
key_buffer_size (16777216 bytes = 16 MiB)
innodb_log_buffer_size (8388608 bytes = 8 MiB)
Apart from that, the OS and the RDS processes also use some amount of RAM for their own operations. To summarize, roughly 1.6 GB is reserved by the DB engine, and the memory actually left after subtracting innodb_buffer_pool_size, key_buffer_size, and innodb_log_buffer_size is around 400 MB.
Overall, your FreeableMemory dropped as low as ~137 MB. As a result, SwapUsage increased drastically in the same period to approximately 152 MB.
FreeableMemory was quite low and swap utilization was high. Due to this memory pressure (insufficient memory and high swap usage), the RDS internal monitoring system was not able to communicate with the host, which in turn resulted in an underlying host replacement.
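To make the arithmetic explicit, here is a back-of-envelope sketch in Python using the figures above; the OS/RDS overhead is an assumed rough allowance, not a published number.
GIB = 1024 ** 3
MIB = 1024 ** 2

instance_ram            = 2 * GIB                 # t2.small
innodb_buffer_pool_size = instance_ram * 3 // 4   # default {DBInstanceClassMemory*3/4}
key_buffer_size         = 16777216                # 16 MiB default
innodb_log_buffer_size  = 8388608                 # 8 MiB default
os_and_rds_overhead     = 64 * MIB                # assumption: rough allowance for OS + RDS agents

usable = instance_ram - (innodb_buffer_pool_size + key_buffer_size +
                         innodb_log_buffer_size + os_and_rds_overhead)
print(f"approx. memory left for connections and queries: {usable / MIB:.0f} MiB")
# -> roughly 400 MiB, which is why FreeableMemory bottomed out so easily.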

Why is AWS EC2 CPU usage shooting up to 100% momentarily from IOWait?

I have a large web-based application running in AWS with numerous EC2 instances. Occasionally -- about twice or thrice per week -- I receive an alarm notification from my Sensu monitoring system notifying me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence, and the CPU goes back down to normal levels within seconds, so it seems like something not to get too worried about. But I'm still curious why it is happening. Notice that the CPU is entirely taken up by iowait (100%).
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU & IO levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying a 5 minute period, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling up the average to mask the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
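Here is a hedged sketch (boto3, with a placeholder instance ID) of pulling CPUUtilization at 1-minute granularity around the alarm time; note that 1-minute datapoints require detailed monitoring, otherwise CloudWatch only stores 5-minute averages that can hide the spike.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")
alarm_time = datetime(2016, 9, 28, 13, 38)     # timestamp from the Sensu notification

stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=alarm_time - timedelta(minutes=15),
    EndTime=alarm_time + timedelta(minutes=15),
    Period=60,                 # 1-minute buckets (needs detailed monitoring)
    Statistics=["Maximum"],    # Maximum rather than Average, so a brief spike shows up
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])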
I would add this as comment, but I have no reputation to do so.
I have noticed my ec2 instances have been doing this, but for far longer and after apt-get update + upgrade.
I thought it was an Apache thing, then started using Nginx on a new instance to test, and it did the same thing: I ran apt-get a few hours ago, then came back to find the instance at full CPU - for hours! Good thing it is just a test machine, but I wonder what is wrong with Ubuntu/apt-get that might have caused this. From now on I guess I will have to reboot the machine after apt-get, as it seems to be the only way to get it back to normal.

Memory issues on RDS PostgreSQL instance / Rails 4

We are running into a memory issue on our RDS PostgreSQL instance, i.e. memory usage of the PostgreSQL server reaches almost 100%, resulting in stalled queries and subsequent downtime of the production app.
The memory usage of the RDS instance doesn't go up gradually, but suddenly, within a period of 30 minutes to 2 hours.
Most of the time this happens, we see a lot of bot traffic, though there is no specific pattern in terms of frequency. It can happen anywhere from 1 week to 1 month after the previous occurrence.
Disconnecting all clients, and then restarting the application also doesn't help, as the memory usage again goes up very rapidly.
Running "Full Vaccum" is the only solution we have found that resolves the issue when it occurs.
What we have tried so far
Periodic vacuuming (not full vacuuming) of some tables that get frequent updates.
Stopped storing web sessions in the DB, as they are highly volatile and result in a lot of dead tuples.
Both these haven't helped.
We have considered using tools like pgcompact / pg_repack, as they don't acquire an exclusive lock. However, these can't be used with RDS.
We now see a strong possibility that this has to do with memory bloat that can happen on PostgreSQL with prepared statements in Rails 4, as discussed on the following pages:
Memory leaks on postgresql server after upgrade to Rails 4
https://github.com/rails/rails/issues/14645
As a quick trial, we have now disabled prepared statements in our rails database configuration, and are observing the system. If the issue re-occurs, this hypothesis would be proven wrong.
Setup details:
We run our production environment inside Amazon Elastic Beanstalk, with following configuration:
App servers
OS : 64bit Amazon Linux 2016.03 v2.1.0 running Ruby 2.1 (Puma)
Instance type: r3.xlarge
Root volume size: 100 GiB
Number of app servers : 2
Rails workers running on each server : 4
Max number of threads in each worker : 8
Database pool size : 50 (applicable for each worker)
Database (RDS) Details:
PostgreSQL Version: PostgreSQL 9.3.10
RDS Instance type: db.m4.2xlarge
Rails Version: 4.2.5
Current size on disk: 2.2GB
Number of tables: 94
The environment is monitored with AWS cloudwatch and NewRelic.
Periodic vacuum should help in containing table bloat but not index bloat.
1) Have you tried more aggressive autovacuum parameters?
2) Have you tried routine reindexing? If locking is a concern, then consider:
DROP INDEX CONCURRENTLY ...
CREATE INDEX CONCURRENTLY ...
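If it helps, here is a minimal sketch (psycopg2 assumed; the table, index, and connection details are illustrative only) of both suggestions: tightening per-table autovacuum so bloat is trimmed continuously, and rebuilding an index without an exclusive lock.
import psycopg2

# CONCURRENTLY operations cannot run inside a transaction, hence autocommit.
conn = psycopg2.connect("host=my-rds-endpoint dbname=mydb user=myuser")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Vacuum/analyze after ~5%/2% of the table changes instead of the 20%/10% defaults.
    cur.execute("""
        ALTER TABLE frequently_updated_table
        SET (autovacuum_vacuum_scale_factor = 0.05,
             autovacuum_analyze_scale_factor = 0.02);
    """)
    # Routine reindexing without taking an exclusive lock on the table.
    cur.execute("CREATE INDEX CONCURRENTLY my_index_new ON frequently_updated_table (some_column);")
    cur.execute("DROP INDEX CONCURRENTLY my_index;")
    cur.execute("ALTER INDEX my_index_new RENAME TO my_index;")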