Intermittent DNS issues while pulling docker image from ECR repository - amazon-web-services

Has anyone facing this issue with docker pull. we recently upgraded docker to 18.03.1-ce from then we are seeing the issue. Although we are not exactly sure if this is related to docker, but just want to know if anyone faced this problem.
We have done some troubleshooting using tcp dump the DNS queries being made were under the permissible limit of 1024 packet. which is a limit on EC2, We also tried working around the issue by modifying the /etc/resolv.conf file to use a higher retry \ timeout value, but that didn't seem to help.
we did a packet capture line by line and found something. we found some responses to be negative. If you use Wireshark, you can use 'udp.stream eq 12' as a filter to view one of the negative answers. we can see the resolver sending an answer "No such name". All these requests that get a negative response use the following name in the request:
354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal
Would anyone of you happen to know why ec2.internal is being adding to the end of the DNS? If run a dig against this name it fails. So it appears that a wrong name is being sent to the server which responds with 'no such host'. Is docker is sending a wrong dns name for resolution.
We see this issue happening intermittently. looking forward for help. Thanks in advance.
Expected behaviour
5.0.25_61: Pulling from rrg
Digest: sha256:50bbce4af6749e9a976f0533c3b50a0badb54855b73d8a3743473f1487fd223e
Status: Downloaded newer image forXXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Actual behaviour
docker-compose up -d rrg-node-1
Creating rrg-node-1
ERROR: for rrg-node-1 Cannot create container for service rrg-node-1: Error response from daemon: Get https:/XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/v2/: dial tcp: lookup XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com on 10.5.0.2:53: no such host
Steps to reproduce the issue
docker pull XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Output of docker version:
(Docker version 18.03.1-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3)
Output of docker info:
([ec2-user#ip-10-5-3-45 ~]$ docker info
Containers: 37
Running: 36
Paused: 0
Stopped: 1
Images: 60
Server Version: swarm/1.2.5
Role: replica
Primary: 10.5.4.172:3375
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 12
Plugins:
Volume:
Network:
Log:
Swarm:
NodeID:
Is Manager: false
Node Address:
Kernel Version: 4.14.51-60.38.amzn1.x86_64
Operating System: linux
Architecture: amd64
CPUs: 22
Total Memory: 80.85GiB
Name: mgr1
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false
WARNING: No kernel memory limit support)

Related

Conan fails to upload to Artifactory instance over HTTPS

Currently I have Artifactory set up through a system.yaml file
configVersion: 1
shared:
security:
exposeApplicationHeaders: true
node:
id: "*.example.com"
ip: artifacts.example.com
metrics:
enabled: true
artifactory:
#port: 8081
tomcat:
httpsConnector:
enabled: true
port: 8443
certificateFile: "$JFROG_HOME/artifactory/var/etc/artifactory/security/trusted/server2.crt"
certificateKeyFile: "$JFROG_HOME/artifactory/var/etc/artifactory/security/trusted/server.key"
frontend:
featureToggler:
commonProjects: true
And I'm able to access the webview on port 8082 through https just fine
I created a repo for conan artifacts and generated an api key. Then using the "set me up" prompt I ran the following commands on my dev machine
conan remote add myremote https://artifacts.example.com:8081/artifactory/api/conan/myremote
conan user -p <apikey> -r myremote will
I then get the following error from Conan
ERROR: HTTPSConnectionPool(host='artifacts.example.com', port=8081): Max retries exceeded with url: /artifactory/api/conan/myremote/v1/ping (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)')))
Unable to connect to myremote=https://artifacts.example.com:8081/artifactory/api/conan/myremote
1. Make sure the remote is reachable or,
2. Disable it by using conan remote disable,
Then try again.
I tried to repeat the same steps but using http instead of http and all worked fine. What am I doing wrong that won't let https access work?

DNS not found from single Docker container running on Beanstalk

I have an AWS Elastic Beanstalk environment and application running the "64bit Amazon Linux 2 v3.0.2 running Docker" solution stack in a standard way. (I followed the AWS documentation.)
I have deployed a Dockerrun.aws.json file to it, and found that it has DNS issues.
To troubleshoot, I SSHed into the EC2 instance where it is running and found that on this instance an nslookup of any of the hostnames in question runs fine.
But, running the nslookup from within Docker such as with . . .
sudo run busybox nslookup www.google.com
. . . yields the result:
*** Can't find www.google.com: No answer
Adding the typical solutions such as passing the --dns x.x.x.x argument or --network=host arguments does not fix the issue. It still times out contacting DNS.
Any thoughts as to what the issue might be? Here is a Docker info:
$ sudo docker info
Client:
Debug Mode: false
Server:
Containers: 30
Running: 1
Paused: 0
Stopped: 29
Images: 4
Server Version: 19.03.6-ce
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 107.4GB
Backing Filesystem: ext4
Udev Sync Supported: true
Data Space Used: 5.403GB
Data Space Total: 12.72GB
Data Space Available: 7.314GB
Metadata Space Used: 3.965MB
Metadata Space Total: 16.78MB
Metadata Space Available: 12.81MB
Thin Pool Minimum Free Space: 1.271GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.181-108.257.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name:
ID:
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
Live Restore Enabled: false
WARNING: the devicemapper storage-driver is deprecated, and will be removed in a future release.
Thank you

Apache Graceful restart with Ansible

What is ideal ansible way to do a apache graceful restart?
- name: Restart Apache gracefully
command: apachectl -k graceful
Ansible systemd module does the same? If not, what is the difference? Thanks !
- name: Restart apache service.
systemd:
name: apache2
daemon_reload: yes
state: restarted
What you can do with Ansible is to ensure that all established connections to Apache are closed (drained in Ansible lingo).
Use the wait_for module with the condition to wait for drained connections on the particular host and port, with the state set to drained. See below:
- name: wait until apache2 connections are drained.
wait_for:
host: 0.0.0.0
port: 80
state: drained
Note: You can use this for all your Linux network services, which becomes very handy if you want to shutdown services in a particular order in your Ansible playbook.
The wait_for directive is useful to ensuring that Ansible does not run your playbook until specific steps are completed.
There is not support of graceful state at this moment in service or systemd modules because this is quite specific to certain services, status is limited to started, stopped, restarted reloaded and running.
So now you need to use a command module as you wrote in the question to perform a graceful restart, this is the only proper solution.
However there is an issue to support custom status, perhaps someone will implement that soon.
The documentation for the Ansible service module is not clearly stating what "reloaded" state does but, I found that for standard Red Hat 7 install using service module "reloaded" state results in a graceful restart.
I was led to this solution by this Server Fault QA
You can verify by getting process list of the httpd processes prior running your playbook which triggers your handler.
ps -ef | grep httpd | grep -v grep
After your playbook runs and handler reloaded state for httpd service shows "changed", re-examine the process list again.
You should see the start times for all the child httpd (non-root) processes have updated while the root owned parent process's start time has stayed the same.
If you also look in the error log you should see an entry containing:
"... configured -- resuming normal operations ... "
And, finally, you can see this by examining the output of systemctl status for the httpd.service and see the apachectl graceful option was called:
sudo systemctl status httpd.service
My handler now looks like:
- name: "{{ service_name }} restart handler"
become: yes
ansible.builtin.service:
service: "{{ service_name }}"
# state: restarted
state: reloaded

Something wrong on deploy chaincode for hyperledger v1.0

I have tried to use docker toolbox to setup Hyperledger V1.0 in my local machines.
I according to this document:
http://hyperledger-fabric.readthedocs.io/en/latest/asset_setup.html
But when I tried to deploy chaincode.
$node deploy.js
I got an error message:
info: Returning a new winston logger with default configurations
info: [Chain.js]: Constructed Chain instance: name - fabric-client1, securityEnabled: true, TCert download batch size: 10, network mode: true
info: [Peer.js]: Peer.const - url: grpc://localhost:8051 options grpc.ssl_target_name_override=tlsca, grpc.default_authority=tlsca
info: [Peer.js]: Peer.const - url: grpc://localhost:8055 options grpc.ssl_target_name_override=tlsca, grpc.default_authority=tlsca
info: [Peer.js]: Peer.const - url: grpc://localhost:8056 options grpc.ssl_target_name_override=tlsca, grpc.default_authority=tlsca
info: [Client.js]: Failed to load user "admin" from local key value store
info: [FabricCAClientImpl.js]: Successfully constructed Fabric COP service client: endpoint - {"protocol":"http","hostname":"localhost","port":8054}
info: [crypto_ecdsa_aes]: This class requires a KeyValueStore to save keys, no store was passed in, using the default store C:\Users\daniel\.hfc-key-store
[2017-04-15 22:14:29.268] [ERROR] Helper - Error: Calling enrollment endpoint failed with error [Error: connect ECONNREFUSED 127.0.0.1:8054]
at ClientRequest.<anonymous> (C:\Users\daniel\node_modules\fabric-ca-client\lib\FabricCAClientImpl.js:304:12)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at Socket.socketErrorListener (_http_client.js:310:9)
at emitOne (events.js:96:13)
at Socket.emit (events.js:188:7)
at emitErrorNT (net.js:1278:8)
at _combinedTickCallback (internal/process/next_tick.js:74:11)
at process._tickCallback (internal/process/next_tick.js:98:9)
[2017-04-15 22:14:29.273] [ERROR] DEPLOY - Error: Failed to obtain an enrolled user
at ca_client.enroll.then.then.then.catch (C:\Users\daniel\helper.js:59:12)
at process._tickCallback (internal/process/next_tick.js:103:7)
events.js:160
throw er; // Unhandled 'error' event
^
Error: Connect Failed
at ClientDuplexStream._emitStatusIfDone (C:\Users\daniel\node_modules\grpc\src\node\src\client.js:201:19)
at ClientDuplexStream._readsDone (C:\Users\daniel\node_modules\grpc\src\node\src\client.js:169:8)
at readCallback (C:\Users\daniel\node_modules\grpc\src\node\src\client.js:229:12)
Is this an question about unable to connect to ca? Or other causes?
Edit:
Environment:
OS: Windows 10 Professional Edition
Docker Toolbox: 17.04.0-ce
Go: 1.7.5
Node.js: 6.10.0
My steps:
1.Open Docker Quickstart Terminal and key commands.
$curl -L https://raw.githubusercontent.com/hyperledger/fabric/master/examples/sfhackfest/sfhackfest.tar.gz -o sfhackfest.tar.gz 2> /dev/null; tar -xvf sfhackfest.tar.gz
$docker-compose -f docker-compose-gettingstarted.yml build
$docker-compose -f docker-compose-gettingstarted.yml up -d
$docker ps
It has been confirmed that six containers have been activated
2.Download examples and install modules.
$curl -OOOOOO https://raw.githubusercontent.com/hyperledger/fabric-sdk-node/v1.0-alpha/examples/balance-transfer/{config.json,deploy.js,helper.js,invoke.js,query.js,package.json}
//This link didn't work, so I downloaded the required files from GitHub of fabric-sdk-node
$npm install --global windows-build-tools
$npm install
3.Try to deploy chaincode.
$node deploy.js
There were several problems, not the least of which that documentation was outdated and was for a preview release of Hyperledger Fabric. The docs are actually in the process of being removed as we need to update our examples / samples.
You mentioned Docker Toolbox - so are you trying to run all of this on Windows or Mac?
UPDATE:
So one of the issue with Docker Toolbox or Docker for Windows is that you cannot use localhost / 127.0.0.1 as the address when trying to communicate from apps on the host (even in the QuickStart Terminal) to the endpoints of the Docker containers. When the QuickStart Terminal first launches Docker, you'll see that it will output the IP address of the endpoint you should use when communicating with exposed ports.
I was having the same issue while following the latest "Writing Your First Application" tutorial (http://hyperledger-fabric.readthedocs.io/en/latest/write_first_app.html). I had installed all the pre-requisites and the fabric-samples and started the local network.
When I got to the step of enrolling the Admin user, $ node enrollAdmin.js, I was getting the same error message as above, Error: connect ECONNREFUSED, followed by the localhost domain.
As the first answer suggests, the root cause is that I'm running Docker Toolbox. I'm developing on an older Mac, OSX v10.9.5, so I couldn't use Docker for Mac.
To fix the issue, I replaced 'localhost' in the enrollAdmin.js code with the IP from Docker Toolbox.
Here are the steps I took:
Started Docker with Applications > Docker Quickstart Terminal
Copied the IP from this sentence: docker is configured to use the default machine with IP...
Opened the copy of enrollAdmin.js from fabric-samples/fabcar directory
Found this code:
// be sure to change the http to https when the CA is running TLS enabled
fabric_ca_client = new Fabric_CA_Client('http://localhost:7054', tlsOptions , 'ca.example.com', crypto_suite); // <-- This is the line to change
Replaced 'localhost' with the Docker IP, leaving the port :7054 as is.
Saved
Re-ran the command, $ node enrollAdmin.js
The script connected to the CA and successfully completed the Admin enrollment.
On to the next step!

chef test-kitchen configuration for Vagrant ssh connect timeout

My google-fu is failing me. what do I need to put into my .kitchen.yml in order to get it to increase the config.vm.boot_timeout or number of attempts in my Vagrantfile. My kitchen converge almost always hits:
STDERR: Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.
After about another minute or so I can connect without issue.
I've tried many of what I thought it could be but none seem to be setting it to all of the following:
driver:
name: vagrant
vm.boot_timeout: 20
vm:
boot_timeout: 20
driver_config:
require_chef_omnibus: true
vm.boot_timeout: 20
vm:
boot_timeout: 20
What do I need to do to get this increased?
I added:
driver:
name: vagrant
boot_timeout: 1200
It appears to work, the boot_timout is already present in Vagantfile.erb, maybe because of a newer version.
This isn't supported directly, but you can copy the default Vagrantfile.erb and set
driver:
name: vagrant
vagrantfile_erb: path/to/your/Vagrantfile.erb
or possibly: (I forget which is needed)
driver:
name: vagrant
config:
vagrantfile_erb: path/to/your/Vagrantfile.erb