Docker-compose wouldn't start on Sagemaker's Notebook instance

Docker-compose seems to have stopped working on SageMaker Notebook instances. Running docker-compose up fails with the following error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/docker-compose", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/main.py", line 81, in main
command_func()
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/main.py", line 200, in perform_command
project = project_from_options('.', options)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/command.py", line 70, in project_from_options
enabled_profiles=get_profiles_from_options(options, environment)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/command.py", line 153, in get_project
verbose=verbose, version=api_version, context=context, environment=environment
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/docker_client.py", line 43, in get_client
environment=environment, tls_version=get_tls_version(environment)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/compose/cli/docker_client.py", line 170, in docker_client
client = APIClient(use_ssh_client=not use_paramiko_ssh, **kwargs)
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/docker/api/client.py", line 197, in __init__
self._version = self._retrieve_server_version()
File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/docker/api/client.py", line 222, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: Timeout value connect was Timeout(connect=60, read=60, total=None), but it must be an int, float or None
I can start Docker containers as usual.
sh-4.2$ docker version
Client:
Version: 20.10.7
API version: 1.41
Go version: go1.15.14
Git commit: f0df350
Built: Tue Sep 28 19:55:40 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.15.14
Git commit: b0f5bc3
Built: Tue Sep 28 19:57:35 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0
GitCommit: %runc_commit
docker-init:
Version: 0.19.0
GitCommit: de40ad0
But docker-compose itself doesn't work (although it still reports its version):
sh-4.2$ docker-compose version
docker-compose version 1.29.2, build unknown
docker-py version: 5.0.0
CPython version: 3.6.13
OpenSSL version: OpenSSL 1.1.1l 24 Aug 2021

For those of you who (might) have encountered the same issue, here's the fix:
1). Install the standalone docker-compose binary (release 1.29.2 in this case):
sh-4.2$ sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sh-4.2$ sudo chmod +x /usr/local/bin/docker-compose
2). Put /usr/local/bin first in your PATH (the conda-installed docker-compose is otherwise picked up first), or call /usr/local/bin/docker-compose by its full path from now on:
sh-4.2$ PATH=/usr/local/bin:$PATH
sh-4.2$ docker-compose version
docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l 10 Sep 2019
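To confirm which docker-compose the shell will resolve after changing PATH, it can help to list every match in lookup order. The following is a minimal sketch and not part of the original fix; the helper name find_all_on_path is made up for illustration.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` found on PATH, in lookup order."""
    # Fall back to the current environment's PATH when none is given.
    path = os.environ.get("PATH", "") if path is None else path
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Only regular files with the executable bit count as matches.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


if __name__ == "__main__":
    # The first entry printed is the one the shell would run.
    for position, executable in enumerate(find_all_on_path("docker-compose")):
        print("{}: {}".format(position, executable))
```

If the conda copy under JupyterSystemEnv is still printed first, the PATH change has not taken effect in that shell.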
Perhaps the issue is related to this announcement:
On August 9, 2021 the Jupyter Notebook and Jupyter Lab open source software projects announced 2 security concerns that could impact Amazon Sagemaker Notebook Instance customers.
Sagemaker has deployed updates to address these concerns, and we recommend customers with existing notebook sessions to stop and restart their notebook instance(s) to benefit from these updates. Notebook instances launched after August 10, 2021, when updates were deployed, are not impacted by this issue and do not need to be restarted.

Related

docker image is different when running from different host

While building a third-party library (libtorch, if it matters) in a Docker container, I ran into an error about a missing include file.
The same build worked fine when run from an Ubuntu 16.04 host, but when run from an Ubuntu 18.04 host, the file was missing.
After some backtracking, I'm now just running the base container from NVIDIA and looking for the file.
These are the outputs I get:
Ubuntu 16.04 host:
$ uname -a
Linux ub-carmel 4.15.0-123-generic #126~16.04.1-Ubuntu SMP Wed Oct 21 13:48:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 19.03.13, build 4484c46d9d
$ docker pull nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
11.1-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
Digest: sha256:c5bf5c984998cc18a3f3a741c2bd7187ed860dc6d993b6fb402d0effb9fe6579
Status: Image is up to date for nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
$ docker run -it nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
root@2ecc17248fab:/# ll /usr/lib/gcc/x86_64-linux-gnu/7/include | grep ia32
-rw-r--r-- 1 root root 7817 Dec 4 2019 ia32intrin.h
Ubuntu 18.04 host:
$ uname -a
Linux ub-carmel-18-04 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 19.03.14, build 5eb3275d40
$ docker pull nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
11.1-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
Digest: sha256:c5bf5c984998cc18a3f3a741c2bd7187ed860dc6d993b6fb402d0effb9fe6579
Status: Downloaded newer image for nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
$ docker run -it nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
root@89f771e82a51:/# ll /usr/lib/gcc/x86_64-linux-gnu/7/include | grep ia32
root@89f771e82a51:/#
As you can see, the sha256 digest of the image is the same on both hosts (and matches the digest listed on NVIDIA's NGC).
At first I thought that the includes might come from the host in some hidden way, but the ia32intrin.h file exists on both hosts.
What can cause such an issue?
EDIT
Added the docker --version outputs for each host. There's a difference, but I doubt this should cause such issues
EDIT 2
Added the output for uname -a
EDIT 3
Output of docker version:
Ubuntu 16:
$ docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:59 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:30 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Ubuntu 18:
$ docker version
Client: Docker Engine - Community
Version: 19.03.14
API version: 1.40
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:20:17 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.14
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:18:45 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
I also tested this on different Ubuntu machines (EC2 instances), and there the file exists on both 18.04 and 16.04, so it looks like the problem is specific to my machine.
Any thoughts on what could cause this?

My best guess is that the pulled layers on the Ubuntu 18.04 host are somehow corrupt. The nuclear option to clean that up is to reset Docker. This will delete all images, volumes, containers, logs, networks, everything, so back up anything you want to keep before running this:
sudo -s # these commands need root
systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker
exit # exit sudo
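Before wiping /var/lib/docker, it may be worth confirming how far the damage goes by comparing file listings taken from the same image on both hosts, for example with docker run --rm nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04 find /usr/lib/gcc -type f redirected to a file on each machine. The sketch below is an illustration only, not part of the original answer; the function name missing_files and the two listing files passed on the command line are hypothetical.

```python
import sys


def missing_files(reference_listing, suspect_listing):
    """Return paths present in the reference listing but absent from the suspect one."""
    # Each listing is the text output of `find`, one path per line.
    reference = {line.strip() for line in reference_listing.splitlines() if line.strip()}
    suspect = {line.strip() for line in suspect_listing.splitlines() if line.strip()}
    return sorted(reference - suspect)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python compare.py good_host.txt bad_host.txt
    with open(sys.argv[1]) as good, open(sys.argv[2]) as bad:
        for path in missing_files(good.read(), bad.read()):
            print(path)
```

A non-empty result for an image with an identical digest would support the theory that the locally stored layers are damaged rather than the image itself.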

docker-compose No such command: convert error

I'm trying to follow this tutorial on AWS ECS integration, which mentions the Docker command docker compose convert that is supposed to generate an AWS CloudFormation template.
However, when I run this command, it doesn't appear to exist.
$ docker-compose convert
No such command: convert
#...
$ docker compose convert
docker: 'compose' is not a docker command.
See 'docker --help'
$ docker context create ecs myecscontext
"docker context create" requires exactly 1 argument.
See 'docker context create --help'.
Usage: docker context create [OPTIONS] CONTEXT
Create a context
$ docker --version
Docker version 19.03.13, build 4484c46
$ docker-compose --version
docker-compose version 1.25.5, build unknown
$ docker version
Client:
Version: 19.03.13
API version: 1.40
Go version: go1.13.8
Git commit: 4484c46
Built: Thu Oct 15 18:34:11 2020
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 19.03.11
API version: 1.40 (minimum version 1.12)
Go version: go1.13.12
Git commit: 77e06fd
Built: Mon Jun 8 20:24:59 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit:
docker-init:
Version: 0.18.0
GitCommit: fec3683
$ docker info
Client:
Debug Mode: false
Server:
Containers: 12
Running: 3
Paused: 0
Stopped: 9
Images: 149
Server Version: 19.03.11
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version:
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.8.0-29-generic
Operating System: Ubuntu Core 16
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 7.202GiB
Name: HongLee
ID: GZ5R:KQDD:JHOJ:KCUF:73AE:N3NY:MWXS:ABQ2:2EVY:4ABJ:H375:J64V
Docker Root Dir: /var/snap/docker/common/var-lib-docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Any ideas?

To get the ECS integration, you need to be using an ECS docker context. First, enable the experimental flag in /etc/docker/daemon.json
// /etc/docker/daemon.json
{
  "experimental": true
}
Then create the context:
docker context create ecs myecscontext
docker context use myecscontext
$ docker context ls
NAME             TYPE   DESCRIPTION                               DOCKER ENDPOINT               KUBERNETES ENDPOINT    ORCHESTRATOR
default          moby   Current DOCKER_HOST based configuration   unix:///var/run/docker.sock   [redacted] (default)   swarm
myecscontext *   ecs
Now run convert:
$ docker compose convert
WARN[0000] services.build: unsupported attribute
AWSTemplateFormatVersion: 2010-09-09
Resources:
AdminwebService:
DependsOn:
- AdminwebTCP80Listener
Properties:
Cluster:
...
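For the daemon.json step above, note that JSON itself does not allow comments, so a line such as // /etc/docker/daemon.json is only a label for the snippet and should not end up in the file. A minimal sketch of setting the flag while keeping any existing keys follows; the helper name enable_experimental is made up for illustration, and writing the result back to the file and restarting Docker are left out.

```python
import json


def enable_experimental(config_text):
    """Return daemon.json text with "experimental" set to true, keeping other keys."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(config_text) if config_text.strip() else {}
    config["experimental"] = True
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example with an otherwise empty configuration.
    print(enable_experimental(""))
```

Feeding it text that still contains the // label line raises a ValueError from the JSON parser, which is a quick way to catch that mistake before restarting the daemon.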

You're running on Ubuntu. The /usr/bin/docker installed there (even with the latest docker-ce 20.10.6) does not enable the docker compose subcommand. It is enabled by default in Docker Desktop for Windows or Mac.
See the Linux installation instructions at https://github.com/docker/compose-cli to download and configure it so that docker compose works.
There's a curl|bash script for Ubuntu. Alternatively, download the latest release, put that docker executable into a PATH directory that comes before /usr/bin/, and make sure the original docker is available as com.docker.cli, e.g. ln -s /usr/bin/docker ~/bin/com.docker.cli.

AWSEBCLI does not work when running on Jenkins. No module named ERROR

I'm working on continuous deployment for an application at work using Jenkins with Multibranch Pipeline, AWSCLI and AWSEBCLI. When I run the commands via SSH, everything works fine, but on Jenkins it doesn't.
Application:
- Java 8
- Maven
- Quarkus Framework https://quarkus.io/
Jenkinsfile:
pipeline {
    // agent and other top-level directives were not shown in the post
    tools {
        jdk 'jdk_1.8.0'
        maven 'Maven'
    }
    stages {
        stage('Environment Configuration') {
            steps {
                sh 'sudo pip install awscli==1.16.9 awsebcli==3.14.4'
            }
        }
        stage('Deploy') {
            when {
                anyOf {
                    branch 'feature/CD'
                }
            }
            steps {
                sh 'zip -r application.zip target Dockerfile'
                sh 'aws configure set aws_access_key_id $ACCESS_KEY_DEV --profile eb-cli'
                sh 'aws configure set aws_secret_access_key $SECRET_KEY_DEV --profile eb-cli'
                sh 'eb deploy'
            }
        }
    }
}
On SSH:
[root]# eb --version
EB CLI 3.14.4 (Python 2.7.5)
[root]# python --version
Python 2.7.5
[root]# aws --version
aws-cli/1.16.9 Python/2.7.5 Linux/3.10.0-862.11.6.el7.x86_64 botocore/1.11.9
[root]# eb deploy
Uploading application/app-9d9c-191122_104206.zip to S3. This may take a while.
Upload Complete.
2019-11-22 13:42:09 INFO Environment update is starting.
2019-11-22 13:42:13 INFO Deploying new version to instance(s).
On Jenkins:
+ python --version
Python 2.7.5
[Pipeline] sh
+ aws --version
aws-cli/1.16.9 Python/2.7.5 Linux/3.10.0-862.11.6.el7.x86_64 botocore/1.11.9
+ eb deploy
Traceback (most recent call last):
File "/bin/eb", line 5, in <module>
from ebcli.core.ebcore import main
File "/usr/lib/python2.7/site-packages/ebcli/core/ebcore.py", line 21, in <module>
from ebcli.controllers.clone import CloneController
File "/usr/lib/python2.7/site-packages/ebcli/controllers/clone.py", line 17, in <module>
from ..operations import cloneops, commonops, solution_stack_ops
File "/usr/lib/python2.7/site-packages/ebcli/operations/solution_stack_ops.py", line 23, in <module>
from ebcli.operations import commonops, platformops
File "/usr/lib/python2.7/site-packages/ebcli/operations/platformops.py", line 22, in <module>
from semantic_version import Version
ImportError: No module named semantic_version

You probably have multiple pips on your computer. pip install awsebcli is supposed to install the semantic_version Python package as a dependency, but as the stack trace you posted shows, the interpreter that runs eb under Jenkins could not find it. To get around these problems, you should ideally use virtualenv. If you just want a clean installation, Beanstalk provides a set of scripts that install the EB CLI without any friction.
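One way to narrow this down is to run the same small check both over SSH and as a Jenkins sh step, and compare which interpreter is used and where (or whether) each module is found. This is a minimal sketch rather than an official diagnostic; the function name check_modules is made up for illustration, and the code is written to run on Python 2.7 as well as Python 3.

```python
import importlib
import sys


def check_modules(names):
    """Map each module name to its file location, or None if it cannot be imported."""
    result = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            result[name] = None
        else:
            # Built-in modules have no __file__ attribute.
            result[name] = getattr(module, "__file__", "<built-in>")
    return result


if __name__ == "__main__":
    print("interpreter: {}".format(sys.executable))
    found = check_modules(["ebcli", "semantic_version"])
    for name in sorted(found):
        print("{} -> {}".format(name, found[name] or "NOT FOUND"))
```

If the two runs print different interpreters or site-packages paths, the pip that installed awsebcli is not the one backing the Python that Jenkins uses.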

No module named packaging.version for Ansible VM provisioning in Azure

I am using CentOS 7.2 and trying to provision a VM in Azure through Ansible using the module "azure_rm_virtualmachine", but I get the error "No module named packaging.version". The full error is below:
Traceback (most recent call last):
File "/tmp/ansible_7aeFMQ/ansible_module_azure_rm_virtualmachine.py", line 445, in <module>
from ansible.module_utils.azure_rm_common import *
File "/tmp/ansible_7aeFMQ/ansible_modlib.zip/ansible/module_utils/azure_rm_common.py", line 29, in <module>
ImportError: No module named packaging.version
fatal: [localhost]: FAILED! => {
"changed": false,
"failed": true,
"module_stderr": "Traceback (most recent call last):\n File \"/tmp/ansible_7aeFMQ/ansible_module_azure_rm_virtualmachine.py\", line 445, in <module>\n from ansible.module_utils.azure_rm_common import *\n File \"/tmp/ansible_7aeFMQ/ansible_modlib.zip/ansible/module_utils/azure_rm_common.py\", line 29, in <module>\nImportError: No module named packaging.version\n",
"module_stdout": "",
"msg": "MODULE FAILURE",
"rc": 0
}
Below is my playbook. I am using Ansible 2.3.0.0, Python 2.7.5 and pip 9.0.1.
- name: Provision new VM in azure
  hosts: localhost
  connection: local
  tasks:
    - name: Create VM
      azure_rm_virtualmachine:
        resource_group: xyz
        name: ScriptVM
        vm_size: Standard_D1
        admin_username: xxxx
        admin_password: xxxx
        image:
          offer: CentOS
          publisher: Rogue Wave Software
          sku: '7.2'
          version: latest
I am running the playbook from the Ansible host. I also tried to create a resource group through Ansible, but I get the same error, "No module named packaging.version".

The error occurs because your environment doesn't have the packaging module.
To solve the issue, install the packaging module:
pip install packaging
The command above installs the packaging module (version 16.8 at the time of this answer).

You may try this; it solved the problem for me:
sudo pip install -U pip setuptools
FYI, my environment:
Ubuntu 16.04.2 LTS on Windows Subsystem for Linux (Windows 10 bash)
Python 2.7.12
pip 9.0.1
ansible 2.3.1.0
azure-cli (2.0.12)

SaltStack: What is the meaning of "'file.get_user' is not available"?

I am stuck at one point while creating a custom Salt module. I am running the Salt version shown below on both the master and the minion VMs, and I am trying to call the get_user function to find the owner of a file from its path. The path exists, but Salt responds with an error message:
saltuser@vmSaltMaster:/$ sudo salt '*' file.get_user /etc/passwd
[sudo] password for saltuser:
172.18.1.7:
'file.get_user' is not available.
saltuser@vmSaltMaster:/$ salt '*' --versions-report
Salt: 2015.5.3
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
Jinja2: 2.7.2
M2Crypto: 0.21.1
msgpack-python: 0.3.0
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.0.1
RAET: Not Installed
ZMQ: 4.0.4
Mako: Not Installed
Tornado: Not Installed
Debian source package: 2015.5.3+ds-1trusty1

The cause was that I had created a custom module named file.py under /srv/salt/_modules and run a sync_all command. Salt was confused between my custom module and the built-in file module that ships with the package, so file.get_user was no longer found. After I deleted my custom module, it worked fine.
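To catch this kind of name clash before syncing, one could compare the file names in the custom module directory against the names of built-in execution modules. The sketch below is only an illustration: the function name shadowed_modules is made up, and the set of built-in names is a small hand-written sample supplied by the caller, not something queried from Salt.

```python
import os


def shadowed_modules(custom_dir, builtin_names):
    """Return custom module files whose names collide with built-in module names."""
    collisions = []
    for filename in sorted(os.listdir(custom_dir)):
        stem, extension = os.path.splitext(filename)
        # A custom file.py would take the place of the built-in "file" module.
        if extension == ".py" and stem in builtin_names:
            collisions.append(filename)
    return collisions


if __name__ == "__main__":
    custom_dir = "/srv/salt/_modules"
    # Small illustrative subset of built-in module names (an assumption).
    sample_builtins = {"file", "cmd", "pkg"}
    if os.path.isdir(custom_dir):
        print(shadowed_modules(custom_dir, sample_builtins))
```

Giving the custom module a distinct name, such as a project-specific prefix, avoids the clash entirely.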
