This has been irritating me for the past hour. I use Ansible's expect module to answer a command prompt, namely:
Re-format filesystem in Storage Directory /mnt/ephemeral-hdfs/dfs/name ? (Y or N)
for which I want to reply
Y
This should work according to standard regex matching and this other Stack Overflow question:
- name: Run Spark Cluster script
  expect:
    command: /home/ubuntu/cluster_setup/scripts/shell/utils-cluster_launcher-start_spark.sh
    responses:
      "Re-format filesystem": "Y"
    timeout: 600
    echo: yes
The issue I am facing is that when the script reaches the point where it expects keyboard input, it doesn't receive anything and hangs. There is no error output as such; it just sits still.
Any ideas how to fix this?
The task from the question works properly on the data included in the question:
---
- hosts: localhost
  gather_facts: no
  connection: local
  tasks:
    - name: Run script producing the same prompt as Spark Cluster script
      expect:
        command: ./prompt.sh
        responses:
          "Re-format filesystem": "Y"
        timeout: 600
        echo: yes
      register: prompt

    - debug:
        var: prompt.stdout_lines
Contents of the ./prompt.sh:
#!/bin/bash
read -p "Re-format filesystem in Storage Directory /mnt/ephemeral-hdfs/dfs/name ? (Y or N) " response
echo pressed: $response
Result:
PLAY [localhost] ***************************************************************
TASK [Run script producing the same prompt as Spark Cluster script] ************
changed: [localhost]
TASK [debug] *******************************************************************
ok: [localhost] => {
    "prompt.stdout_lines": [
        "Re-format filesystem in Storage Directory /mnt/ephemeral-hdfs/dfs/name ? (Y or N) Y",
        "pressed: Y"
    ]
}
PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0
The Ansible documentation for expect does not have quotes around the regex in the example.
# Case insensitive password string match
- expect:
    command: passwd username
    responses:
      (?i)password: "MySekretPa$$word"
Maybe try:
Re-format\sfilesystem: "Y"
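Applied to the task from the question, the unquoted pattern would look like this (a sketch only; the \s escape matches the space so no quoting is needed):

```yaml
- name: Run Spark Cluster script
  expect:
    command: /home/ubuntu/cluster_setup/scripts/shell/utils-cluster_launcher-start_spark.sh
    responses:
      Re-format\sfilesystem: "Y"
    timeout: 600
    echo: yes
```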
I know this is old, but I had the same trouble with this module and these answers didn't help. I did eventually find my own solutions and thought I'd save people some time.
First, the timeout in the poster's example is 10 minutes. Though this makes sense for a reformat, it means you need to wait 10 minutes before the task fails, e.g. if it is stuck waiting for a response to "Are you sure?". When debugging, keep that timeout low; if you can't, wait patiently.
Second, the entries in responses are matched in alphabetical order, so

responses:
  "Test a specific string": "Specific"
  "Test": "General"

will always respond to ALL prompts containing Test with General, because "Test" sorts first alphabetically in the responses map.
Third (following on), this caught me out because in my case expect was simply hitting Enter at the prompt, and the script asked again for valid data. The problem then is that the timeout never fires and nothing gets returned, so I don't see any response from the module; it just hangs. The solution in this case is to log in to the server you are provisioning with Ansible, find the command Ansible is running with ps, and kill it. This lets Ansible collect the output and show you where it is stuck in an infinite loop.
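The debugging procedure just described can be sketched with a stand-in process (the sleep below is only a placeholder for the real hung command spawned by the expect task):

```shell
# stand-in for the command Ansible's expect task spawned and that now hangs
sleep 300 &
pid=$!

# find the hung command on the target host, as you would with ps/grep
pgrep -f 'sleep 300'

# kill it; expect then sees EOF and Ansible finally reports the captured
# output, showing where the prompt loop was stuck
kill "$pid"
wait "$pid" 2>/dev/null || true
```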
Related
I have a Flask app that includes some NLP packages and takes a while to build some vectors before it starts the server. I've noticed this in the past with Google App Engine, where I was able to set a max timeout in the app.yaml file to fix it.
The problem is that when I start my cluster on Kubernetes with this app, I notice the workers keep timing out in the logs, which makes sense because I'm sure the default amount of time is not enough. However, I can't figure out how to configure GKE to allow the workers enough time to do everything they need to do before they start serving.
How do I increase the time the workers can take before they time out?
I deleted the old instances so I can't get the logs right now, but I can start it up if someone wants to see the logs.
It's something like this:
I 2020-06-26T01:16:04.603060653Z Computing vectors for all products
E 2020-06-26T01:16:05.660331982Z
95it [00:05, 17.84it/s][2020-06-26 01:16:05 +0000] [220] [INFO] Booting worker with pid: 220
E 2020-06-26T01:16:31.198002748Z [nltk_data] Downloading package stopwords to /root/nltk_data...
E 2020-06-26T01:16:31.198056691Z [nltk_data] Package stopwords is already up-to-date!
100it 2020-06-26T01:16:35.696015992Z [CRITICAL] WORKER TIMEOUT (pid:220)
E 2020-06-26T01:16:35.696015992Z [2020-06-26 01:16:35 +0000] [220] [INFO] Worker exiting (pid: 220)
I also see this:
The node was low on resource: memory. Container thoughtful-sha256-1 was using 1035416Ki, which exceeds its request of 0.
Obviously I don't exactly know what I'm doing. Why does it say I'm requesting 0 memory and can I set a timeout amount for the Kubernetes nodes?
Thanks for the help!
One thing you can do is add some sort of delay in a startup script for your GCP instances. You could try a simple:
#!/bin/bash
sleep <time-in-seconds>
Another thing you can try is adding some sort of delay to when your containers start on your Kubernetes nodes, for example a delay in an initContainer:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: myapa:latest
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "echo Waiting a bit && sleep 3600"]
Furthermore, you can try a startupProbe combined with the probe parameter initialDelaySeconds on your actual application container, so that it waits for some time before checking whether the application has started:
startupProbe:
  exec:
    command:
    - touch
    - /tmp/started
  initialDelaySeconds: 3600
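Since the log lines use gunicorn's format ([INFO] Booting worker, [CRITICAL] WORKER TIMEOUT), one more option, assuming the app is served by gunicorn, is to raise gunicorn's own worker timeout in the container's command (the module path app:app and the port here are placeholders):

```yaml
containers:
- name: myapp-container
  image: myapa:latest
  # assumption: gunicorn serves the app; --timeout extends the worker
  # heartbeat limit so slow startup work does not get the worker killed
  command: ["gunicorn", "--timeout", "600", "--bind", "0.0.0.0:8080", "app:app"]
```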
What is the ideal Ansible way to do an Apache graceful restart?
- name: Restart Apache gracefully
  command: apachectl -k graceful
Does the Ansible systemd module do the same? If not, what is the difference? Thanks!
- name: Restart apache service.
  systemd:
    name: apache2
    daemon_reload: yes
    state: restarted
What you can do with Ansible is to ensure that all established connections to Apache are closed (drained in Ansible lingo).
Use the wait_for module with state set to drained to wait until connections on the particular host and port are closed. See below:
- name: wait until apache2 connections are drained.
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained
Note: You can use this for all your Linux network services, which becomes very handy if you want to shut down services in a particular order in your Ansible playbook.
The wait_for directive is useful for ensuring that Ansible does not continue your playbook until specific steps are completed.
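For example (service names are illustrative), you can drain Apache before stopping it, as a sketch of an ordered shutdown:

```yaml
- name: Wait until apache2 has no active connections
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained

- name: Stop apache2 only once it is drained
  systemd:
    name: apache2
    state: stopped
```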
There is no support for a graceful state at the moment in the service or systemd modules, because this is quite specific to certain services; state is limited to started, stopped, restarted, reloaded and running.
So for now you need to use the command module, as you wrote in the question, to perform a graceful restart; this is the only proper solution.
However, there is an open issue about supporting custom states, so perhaps someone will implement that soon.
The documentation for the Ansible service module does not clearly state what the "reloaded" state does, but I found that on a standard Red Hat 7 install, using the service module's "reloaded" state results in a graceful restart.
I was led to this solution by this Server Fault Q&A.
You can verify by getting a process list of the httpd processes prior to running the playbook that triggers your handler:
ps -ef | grep httpd | grep -v grep
After your playbook runs and the handler's reloaded state for the httpd service shows "changed", re-examine the process list.
You should see that the start times for all the child httpd (non-root) processes have updated, while the root-owned parent process's start time has stayed the same.
If you also look in the error log you should see an entry containing:
"... configured -- resuming normal operations ... "
And finally, you can see this by examining the output of systemctl status for httpd.service, where you can see that the apachectl graceful option was called:
sudo systemctl status httpd.service
My handler now looks like:
- name: "{{ service_name }} restart handler"
become: yes
ansible.builtin.service:
service: "{{ service_name }}"
# state: restarted
state: reloaded
Has anyone faced this issue with docker pull? We recently upgraded Docker to 18.03.1-ce, and since then we are seeing the issue. We are not exactly sure it is related to Docker, but we want to know if anyone has faced this problem.
We have done some troubleshooting using tcpdump: the DNS queries being made were under the permissible limit of 1024 packets, which is a limit on EC2. We also tried working around the issue by modifying /etc/resolv.conf to use higher retry/timeout values, but that didn't seem to help.
We did a packet capture line by line and found some negative responses. If you use Wireshark, you can use 'udp.stream eq 12' as a filter to view one of the negative answers; the resolver sends the answer "No such name". All the requests that get a negative response use the following name:
354XXXXX.dkr.ecr.us-east-1.amazonaws.com.ec2.internal
Would anyone happen to know why ec2.internal is being appended to the end of the DNS name? If I run a dig against this name it fails, so it appears a wrong name is being sent to the server, which responds with 'no such host'. Is Docker sending a wrong DNS name for resolution?
We see this issue happening intermittently. Looking forward to your help. Thanks in advance.
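One plausible mechanism (an assumption, not confirmed by the capture): on EC2 the DHCP-supplied /etc/resolv.conf contains a search domain, and the glibc resolver retries a failed or timed-out lookup with that suffix appended, which would produce exactly the *.ec2.internal names seen in the trace:

```text
# /etc/resolv.conf as typically written by DHCP on an EC2 instance in us-east-1
nameserver 10.5.0.2
search ec2.internal
```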
Expected behaviour
5.0.25_61: Pulling from rrg
Digest: sha256:50bbce4af6749e9a976f0533c3b50a0badb54855b73d8a3743473f1487fd223e
Status: Downloaded newer image for XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Actual behaviour
docker-compose up -d rrg-node-1
Creating rrg-node-1
ERROR: for rrg-node-1 Cannot create container for service rrg-node-1: Error response from daemon: Get https:/XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/v2/: dial tcp: lookup XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com on 10.5.0.2:53: no such host
Steps to reproduce the issue
docker pull XXXXXXXX.dkr.ecr.us-east-1.amazonaws.com/rrg:5.0.25_61
Output of docker version:
Docker version 18.03.1-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
Output of docker info:
[ec2-user@ip-10-5-3-45 ~]$ docker info
Containers: 37
Running: 36
Paused: 0
Stopped: 1
Images: 60
Server Version: swarm/1.2.5
Role: replica
Primary: 10.5.4.172:3375
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 12
Plugins:
Volume:
Network:
Log:
Swarm:
NodeID:
Is Manager: false
Node Address:
Kernel Version: 4.14.51-60.38.amzn1.x86_64
Operating System: linux
Architecture: amd64
CPUs: 22
Total Memory: 80.85GiB
Name: mgr1
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false
WARNING: No kernel memory limit support
I'm writing an expect script to start an SSH tunnel.
It gets run on EC2 when the instance starts, as part of the deployment which creates the script from a .ebextensions config file.
When the script is run, it always gets stuck at this point:
Enter passphrase for key '/home/ec2-user/id_data_app_rsa':
If I run the same script manually on the server it succeeds, and I can see the tunnel process running:
ps aux | grep ssh
root 19046 0.0 0.0 73660 1068 ? Ss 16:58 0:00 ssh -i /home/ec2-user/id_data_app_rsa -p222 -vfN -L 3306:X.X.X.X:3306 root@X.X.X.X
I can verify that the script is reading the SSH_PASSPHRASE correctly by printing it to the console.
set password $::env(SSH_PASSPHRASE)
send_user "retrieved env variable : $password "
This is the debug output I get from the EC2 logs:
Enter passphrase for key '/home/ec2-user/id_data_app_rsa':
interact: received eof from spawn_id exp0
I'm baffled as to why it gets no further here when the EC2 deployer runs it, but continues normally when run manually.
This is the script in .ebextensions; the script itself starts at #!/usr/bin/expect:
files:
  "/scripts/createTunnel.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/expect
      exp_internal 1
      set timeout 60
      # set variables
      set password $::env(SSH_PASSPHRASE)
      send_user "retrieved env variable : $password "
      spawn -ignore HUP ssh -i /home/ec2-user/id_data_app_rsa -p222 -vfN -L 3306:X.X.X.X:3306 root@X.X.X.X
      expect {
        "(yes/no)?"            { send "yes\n" }
        -re "(.*)assphrase"    { sleep 1; send -- "$password\n" }
        -re "(.*)data_app_rsa" { sleep 1; send -- "$password\n" }
        -re "(.*)assword:"     { sleep 1; send -- "$password\n" }
        timeout                { send_user "un-able to login: timeout\n"; return }
        "denied"               { send_user "\nFatal Error: denied\n" }
        eof                    { send_user "Closed\n"; return }
      }
      interact
We finally resolved this. There were two things that seemed to be at issue:
- Changing the final interact to expect eof.
- Trimming down the expect pattern matching as much as possible.
We noticed in testing that expect seemed to match falsely, sending a password, for example, when it should have sent 'yes' to match the 'yes/no' prompt.
This is the final script we ended up with in case it's useful to anyone else:
#!/usr/bin/expect
exp_internal 1
set timeout 60
# set variables
set password $::env(SSH_TUNNEL_PASSPHRASE)
spawn -ignore HUP ssh -i /home/ec2-user/id_data_rsa -p222 -vfN -L 3306:X.X.X.X:3306 root@X.X.X.X
expect {
  "(yes/no)?"        { send "yes\r" }
  "Enter passphrase" { sleep 2; send -- "$password\r"; sleep 2; exit }
}
expect eof
Your problem is here:
set password $::env(SSH_PASSPHRASE)
and the way the shell handles environment variables. When the script is invoked, you assume your environment variables are set. Depending on how the script is invoked, $::env(SSH_PASSPHRASE) may not be set, resulting in the variable being null/blank. When init scripts (or cloud-init) run, they do not run with the environment of a login shell, so you should not assume that .profile or /etc/profile environment variables are set; source or set them explicitly.
A possible solution is to source the profile before invoking the script:
. ~ec2-user/.profile
/path/to/above.script
My google-fu is failing me. What do I need to put into my .kitchen.yml to get it to increase config.vm.boot_timeout or the number of attempts in my Vagrantfile? My kitchen converge almost always hits:
STDERR: Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.
After about another minute or so I can connect without issue.
I've tried setting it with all of the following, but none seem to work:
driver:
  name: vagrant
  vm.boot_timeout: 20
  vm:
    boot_timeout: 20

driver_config:
  require_chef_omnibus: true
  vm.boot_timeout: 20
  vm:
    boot_timeout: 20
What do I need to do to get this increased?
I added:
driver:
  name: vagrant
  boot_timeout: 1200
It appears to work; boot_timeout is already present in Vagrantfile.erb, maybe because of a newer version.
This isn't supported directly, but you can copy the default Vagrantfile.erb and set:
driver:
  name: vagrant
  vagrantfile_erb: path/to/your/Vagrantfile.erb
or possibly (I forget which is needed):
driver:
  name: vagrant
  config:
    vagrantfile_erb: path/to/your/Vagrantfile.erb