When attempting to create a VM on EC2 with an Ubuntu 16.04 AMI ami-835b4efa, I see the following:
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded
This issue goes away if I create a VM using Ubuntu 14.04 with AMI ami-fc4f5e85. I've seen this in the past and thought it was just a fluke, but it has happened repeatedly today, often enough that I think there's a real issue here. Any thoughts on why the above fails with Ubuntu 16.04? I can use 14.04 for now, but I'd like to upgrade in the not-too-distant future and still use Docker Machine for managing my VMs.
I downloaded the latest version of Docker Toolbox for OS X today to take that off the table as a possible cause.
Check if this is similar to issue 2533 where:
What worked for me was adding a --amazonec2-ami param and setting it to
aws's Ubuntu 14.04 LTS image: ami-fce3c696
Since you are using Ubuntu 16.04, check the Amazon EC2 AMI Locator to find the matching AMI and try the same option; the right ID depends on your region.
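For example, a minimal sketch of such a create call (the machine name is a placeholder, credentials are assumed to come from the environment or extra flags, and the AMI ID must match your region):

docker-machine create \
  --driver amazonec2 \
  --amazonec2-region us-east-1 \
  --amazonec2-ami ami-fce3c696 \
  aws-sandbox
# ami-fce3c696 is the 14.04 image quoted above (us-east-1);
# substitute the 16.04 ID from the AMI Locator once the underlying issue is resolved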
What version of Docker and docker-machine are you on? What's in the logs on the machine? If it's docker-machine version 0.12.0, build 45c69ad and Docker version 17.06.0-ce, then it's probably this issue in docker-machine: https://github.com/docker/machine/issues/4156
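For reference, a quick way to gather that information (the machine name is a placeholder):

docker-machine version
docker version
# Docker daemon logs on a systemd-based machine such as Ubuntu 16.04:
docker-machine ssh aws-sandbox "sudo journalctl -u docker --no-pager | tail -n 50"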
Related
I use GitLab Runner for running CI jobs on AWS EC2 spot instances, using its autoscaling feature with Docker Machine.
All of a sudden, GitLab CI today started failing to run jobs, showing me the following job output for every job I try to start:
Running with gitlab-runner 14.9.1 (f188edd7)
on AWS EC2 runner ...
Preparing the "docker+machine" executor
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Job failed (system failure): exit status 1
I can see in the AWS console that the EC2 instances do get created, but GitLab Runner always stops the instances again immediately.
The GitLab Runner system logs show me the following errors:
ERROR: Machine creation failed error=exit status 1 name=runner-eauzytys-gitlab-ci-1651050768-f84b471e time=1m2.409578844s
ERROR: Error creating machine: Error running provisioning: error installing docker: driver=amazonec2 name=runner-xxxxxxxx-gitlab-ci-1651050768-f84b471e operation=create
So the error seems to be somehow related to Docker Machine. Upgrading GitLab Runner as well as GitLab's Docker Machine fork to the newest versions does not fix the error. I'm using GitLab 14.8 and have tried GitLab Runner 14.9 and 14.10.
What can be the reason for this?
Update:
In the meantime, GitLab has released a new version of its Docker Machine fork, which upgrades the default AMI to Ubuntu 20.04. That means upgrading Docker Machine to the latest release from GitLab will fix the issue without changing your runner configuration. The latest release can be found here.
Original Workaround/fix:
Explicitly specify the AMI in your runner configuration and do not rely on the default one anymore, i.e. add something like "amazonec2-ami=ami-02584c1c9d05efa69" to your MachineOptions:
MachineOptions = [
"amazonec2-access-key=xxx",
"amazonec2-secret-key=xxx",
"amazonec2-region=eu-central-1",
"amazonec2-vpc-id=vpc-xxx",
"amazonec2-subnet-id=subnet-xxx",
"amazonec2-use-private-address=true",
"amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
"amazonec2-security-group=ci-runners",
"amazonec2-instance-type=m5.large",
"amazonec2-ami=ami-02584c1c9d05efa69", # Ubuntu 20.04 for amd64 in eu-central-1
"amazonec2-request-spot-instance=true",
"amazonec2-spot-price=0.045"
]
You can get a list of Ubuntu AMI IDs here. Be sure to select one that fits your AWS region and instance architecture, and that is supported by Docker.
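If you prefer the CLI over the locator page, a sketch like this should return the newest Ubuntu 20.04 amd64 AMI for a given region (099720109477 is Canonical's AWS owner ID):

aws ec2 describe-images \
  --owners 099720109477 \
  --region eu-central-1 \
  --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*" \
  --query "sort_by(Images, &CreationDate)[-1].{ID:ImageId,Name:Name}"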
Explanation:
The default AMI that GitLab Runner / the Docker Machine EC2 driver use is Ubuntu 16.04. The install script for Docker, which is available on https://get.docker.com/ and which Docker Machine relies on, seems to have stopped supporting Ubuntu 16.04 recently. Thus, the installation of Docker fails on the EC2 instance spawned by Docker Machine and the job cannot run.
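You can reproduce this yourself on any Ubuntu 16.04 host (for instance via docker-machine ssh into a half-provisioned instance); this is effectively the same sequence Docker Machine runs during provisioning:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh   # aborts on 16.04 (xenial) now that the script has dropped support for it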
See also this GitLab issue.
Azure and GCP suffer from similar problems.
Make sure to select an AMI for Ubuntu (not Debian) and that your AWS account is subscribed to it.
What I did
subscribe in the AWS Marketplace to an Ubuntu Amazon Machine Image (Ubuntu 20.04 LTS - Focal)
select "Launch instance", choose the region, and copy the AMI ID shown (it can be sanity-checked with the CLI sketch below)
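A sketch of that sanity check (region and AMI ID taken from the example above):

aws ec2 describe-images \
  --image-ids ami-02584c1c9d05efa69 \
  --region eu-central-1 \
  --query "Images[0].{Name:Name,Arch:Architecture,Owner:OwnerId}"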
I have had the same issue since yesterday.
It could be related to GitLab releasing 15.0 with breaking changes (going live on GitLab.com sometime between April 23 – May 22)
https://about.gitlab.com/blog/2022/04/18/gitlab-releases-15-breaking-changes/
but there is no mention there of an AMI entry that now needs to be added to MachineOptions.
Adding the AMI entry solved the issue on my side.
Just wanted to add as well: go here for the Ubuntu AMI that corresponds to your region. AMIs are region-specific.
As Moritz pointed out:
Adding:
MachineOptions = [
"amazonec2-ami=ami-02584c1c9d05efa69",
]
solves the issue.
According to the documentation, the Galera Manager can be installed on Ubuntu or Amazon Linux 2.
Incidentally, at this point, Galera Manager runs either Ubuntu or Amazon Linux 2. Future releases of Galera Manager may allow for other distributions of Linux.
This doesn't appear to be the case though, as the installer starts by complaining about
WARN[0000] Debian / Ubuntu / Linux / focal / 20.04
and then asks a few questions before croaking with:
ERRO[0347] unsupported host os
I have tried 16.04, 18.04 and 20.04 without success. I haven't booted up an AWS instance, as I don't want to run Galera Manager in the cloud. Any thoughts/observations welcome.
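I don't know exactly how the installer detects the host OS, but a reasonable guess is that it reads the standard os-release data, so comparing that against the warning above might help narrow things down:

cat /etc/os-release   # on focal: NAME="Ubuntu", VERSION_ID="20.04"
lsb_release -a        # same information via the LSB tooling, if installed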
Ambari version: 2.2.2.18
HDP stack: 2.4.3
OS: CentOS 7.3
Issue description:
The Ambari server can't communicate with the Ambari agent. I can see the following error in the ambari-agent logs:
ERROR 2017-09-18 06:35:34,684 NetUtil.py:84 - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
ERROR 2017-09-18 06:35:34,684 NetUtil.py:85 - SSLError: Failed to connect. Please check openssl library versions.
I have been facing this issue recently, and it can be reproduced consistently after the instances are restarted (I am using EC2 instances).
I am able to register agent nodes successfully, install the HDP cluster, run YARN jobs etc. with no problem at all. Once I restart my instances, I see this problem.
There are some solutions already posted for this problem like:
Downgrade Python from 2.7 to a lower version; this is a known problem of Ambari with Python 2.7.
Disable certificate verification by setting verify = disable in /etc/python/cert-verification.cfg (shown below).
I don't want to play with the Python version, as it can disrupt many other things like Cassandra, the yum package manager, etc.
The second workaround is easy and works well!
Now comes my question: is it safe to disable certificate verification in Python, i.e. by setting verify = disable?
Generally, it's a bad idea. If somebody has access to the port on the server that is used for agent-server communication (8443, if I'm not mistaken), they can register as an agent and get all your cluster configs and passwords. A classic man-in-the-middle attack would also let them do the same by reading your unencrypted traffic. A somewhat more involved attack would allow sending commands to the agents (probably with root permissions).
Your issue sounds like you reprovisioned your ambari-server host and left old ambari-agent instances running, or maybe your certificates became outdated. At the first connection to ambari-server, agents generate certificates and send them to the server. The server signs these certificates with its own key, so the server-agent connection is then encrypted. Did you try to remove the old certificates and restart the server and agents, as suggested here?
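If stale certificates are indeed the culprit, the usual cleanup is a sketch like the following (assuming the default key directory; back it up first):

# on each agent node:
sudo ambari-agent stop
sudo rm -f /var/lib/ambari-agent/keys/*.crt /var/lib/ambari-agent/keys/*.key
sudo ambari-agent start   # the agent regenerates its certificate and re-registers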
How we investigated this issue and what solution we adopted:
Investigation Details:
Downgrading to Python 2.6 is not feasible as there are OS dependencies, and per the suggestion from 'Dmitriusan' in the previous answer, it's not a good idea to disable certificate verification in Python.
We use AWS EC2
With Python 2.7, JDK 1.8 and CentOS 7.2 there is no issue. Everything is smooth.
With Python 2.7, JDK 1.8 and CentOS 7.3 or CentOS 7.4 we are seeing this issue.
The issue I have reported here is with respect to CentOS 7.3; with CentOS 7.4 the issue is slightly different: certificate verification fails while adding nodes to the cluster itself.
Downgrading from CentOS 7.3 to 7.2 is not straightforward, and the AWS EC2 Marketplace provides a CentOS 7.0 image which, on instance creation, applies security and patch updates, resulting in CentOS 7.3.
We could create our own CentOS 7.2 image from existing servers, but it's always good to stay on the latest OS updates for security reasons.
In short, we had workarounds but not a solution.
Solution which we adopted:
After a series of tests, we decided to upgrade to CentOS 7.4, HDP 2.6.3.0, and Ambari 2.6.0.0.
With CentOS 7.4 and Ambari 2.6.0.0, we don't see this issue even though Python 2.7.5 is installed.
So this looks to be an issue with Ambari itself.
Older versions of Ambari (2.4.2) do not recognize the force-TLS configuration. We upgraded Ambari to 2.6.2 and the heartbeat started working.
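For reference, that force-TLS setting lives in /etc/ambari-agent/conf/ambari-agent.ini and looks like this (recognized by newer agents, ignored by 2.4.2 as described above):

[security]
force_https_protocol=PROTOCOL_TLSv1_2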
I installed OpenStack via devstack on an EC2 instance running Ubuntu 14.04 LTS. When I log into the dashboard I get the error "Error: Unable to retrieve usage information".
When I first installed it and logged in, everything was working fine. But after I stopped my EC2 instance and restarted it, I am facing this problem.
What might be causing this error?
I used the stable Juno version of devstack.
The AMI for my EC2 instance is Ubuntu Server 14.04 LTS (HVM), SSD Volume Type.
Might restarting the instance have caused some problem?
cd to devstack and execute ./rejoin-stack
That solved it. I was trying to reboot nova and other services individually.
But since the installation was done using devstack, you need to run the ./rejoin-stack script.
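In full (assuming devstack was cloned into your home directory; depending on the devstack version the script is named rejoin-stack.sh):

cd ~/devstack
./rejoin-stack.sh   # re-attaches the screen session and brings all services back up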
In addition to [akshay1188]'s answer, you can re-stack your system. Sometimes rejoin-stack does not work as expected; in that case, you can unstack it (unstack.sh) and stack it again (stack.sh). Note that this may take a long time.
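That is, from the devstack directory:

cd ~/devstack
./unstack.sh   # tears down all running OpenStack services
./stack.sh     # full re-deployment; this can take quite a while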
Another observation of mine: this can be an issue with the IP address of the system changing. Try to keep the IP address the same after a reboot.
I have an EC2 instance running on Amazon with AMI ami-1b814f72. It's running Red Hat 4.4.4-13.
I want to install nginx and gunicorn with Django. According to the nginx page http://wiki.nginx.org/Install#Official_Red_Hat.2FCentOS_packages, I need to create a file /etc/yum.repos.d/nginx.repo and paste in the lines that define the repo. But they also mention that:
Due to differences between how CentOS, RHEL, and Scientific Linux
populate the $releasever variable, it is necessary to manually replace
$releasever with either "5" (for 5.x) or "6" (for 6.x), depending upon
your OS version.
But I have neither version 5 nor 6; I have Red Hat 4.4.4-13. What should I do in that case to make it work and get nginx installed on my EC2 instance?
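For reference, the repo file from the nginx wiki, with $releasever already substituted for a 6.x system, looks like this (the underlying problem is that nginx.org offers no rhel/4 packages at all, as the accepted answer below confirms):

[nginx]
name=nginx repo
baseurl=http://nginx.org/packages/rhel/6/$basearch/
gpgcheck=0
enabled=1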
If I don't change the baseurl and try to install nginx, I get this error:
http://nginx.org/packages/rhel/latest/x86_64/repodata/repomd.xml:
[Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404"
Trying other mirror. Error: Cannot retrieve repository metadata
(repomd.xml) for repository: ngnix. Please verify its path and try
again
Please note: I want to stay within the AWS Free Usage Tier and I don't want to get charged.
I hope someone will help me :(
So I solved my own problem and am writing an answer to my own question. There is no nginx package available for RHEL 4.4. Either build from source specifically for RHEL 4.4, or just migrate to an updated AMI on Amazon. I moved to Ubuntu 11.10, which is up to date and currently supported by the Ubuntu community.