Ansible AWX step fails when using DSL for Jenkins job - jenkins-job-dsl

I'm running a Jenkins 2.140 server on Ubuntu 16.04 as well as an Ansible AWX 1.0.7.2 server using Ansible 2.6.2.
I'm creating a job in Jenkins which runs a template on my Ansible AWX server. I've got several other Jenkins jobs that run templates and they all work, so I know the general configuration I'm using for this is OK.
However, when I create the Jenkins job using a seed job that uses the Job DSL, the job fails at the Ansible AWX step with this output:
11:50:42 [EnvInject] - Loading node environment variables.
11:50:42 Building remotely on windows-slave (excel Windows orqaheadless windows) in workspace C:\JenkinsSlave\workspace\create-ec2-instance-2
11:50:42 ERROR: Build step failed with exception
11:50:42 java.lang.NullPointerException
11:50:42 at org.jenkinsci.plugins.ansible_tower.AnsibleTower.perform(AnsibleTower.java:129)
11:50:42 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
11:50:42 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
11:50:42 at hudson.model.Build$BuildExecution.build(Build.java:206)
11:50:42 at hudson.model.Build$BuildExecution.doRun(Build.java:163)
11:50:42 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
11:50:42 at hudson.model.Run.execute(Run.java:1815)
11:50:42 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
11:50:42 at hudson.model.ResourceController.execute(ResourceController.java:97)
11:50:42 at hudson.model.Executor.run(Executor.java:429)
11:50:42 Build step 'Ansible Tower' marked build as failure
11:50:42 [BFA] Scanning build for known causes...
11:50:42 [BFA] No failure causes found
11:50:42 [BFA] Done. 0s
11:50:42 Started calculate disk usage of build
11:50:42 Finished Calculation of disk usage of build in 0 seconds
11:50:42 Started calculate disk usage of workspace
11:50:42 Finished Calculation of disk usage of workspace in 0 seconds
11:50:42 Finished: FAILURE
That output doesn't really give me anything to work with, especially as I'm no Java expert.
If I configure the Jenkins job manually, all is well. This is the config.xml for the working job (only the AWX part); note that all of these extra variables are passed in earlier in the job as parameters:
<builders>
  <org.jenkinsci.plugins.ansible__tower.AnsibleTower plugin="ansible-tower#0.9.0">
    <towerServer>AWX Server</towerServer>
    <jobTemplate>create-ec2-instance</jobTemplate>
    <extraVars>
      key_name: ${key_name} ec2_termination_protection: ${ec2_termination_protection} vpc_subnet_id: ${vpc_subnet_id} security_groups: ${security_groups} instance_type: ${instance_type} instance_profile_name: ${instance_profile_name} assign_public_ip: ${assign_public_ip} region: ${region} image: ${image} instance_tags: ${instance_tags} ec2_wait_for_create: ${ec2_wait_for_create} ec2_wait_for_create_timeout: ${ec2_wait_for_create_timeout} exact_count: ${exact_count} delete_volume_on_termination: ${delete_volume_on_termination} data_disk_size: ${data_disk_size} private_domain: ${private_domain} route53_private_record_ttl: ${route53_private_record_ttl} dns_record: ${dns_record} elastic_ip: ${elastic_ip}
    </extraVars>
    <jobTags/>
    <skipJobTags/>
    <jobType>run</jobType>
    <limit/>
    <inventory/>
    <credential/>
    <verbose>true</verbose>
    <importTowerLogs>true</importTowerLogs>
    <removeColor>false</removeColor>
    <templateType>job</templateType>
    <importWorkflowChildLogs>false</importWorkflowChildLogs>
  </org.jenkinsci.plugins.ansible__tower.AnsibleTower>
</builders>
And the config.xml from the failing, JobDSL-generated job, which looks the same to me:
<builders>
  <org.jenkinsci.plugins.ansible__tower.AnsibleTower>
    <towerServer>AWX Server</towerServer>
    <jobTemplate>create-ec2-instance</jobTemplate>
    <jobType>run</jobType>
    <templateType>job</templateType>
    <extraVars>
      key_name: ${key_name} ec2_termination_protection: ${ec2_termination_protection} vpc_subnet_id: ${vpc_subnet_id} security_groups: ${security_groups} instance_type: ${instance_type} instance_profile_name: ${instance_profile_name} assign_public_ip: ${assign_public_ip} region: ${region} image: ${image} instance_tags: ${instance_tags} ec2_wait_for_create: ${ec2_wait_for_create} ec2_wait_for_create_timeout: ${ec2_wait_for_create_timeout} exact_count: ${exact_count} delete_volume_on_termination: ${delete_volume_on_termination} data_disk_size: ${data_disk_size} private_domain: ${private_domain} route53_private_record_ttl: ${route53_private_record_ttl} dns_record: ${dns_record} elastic_ip: ${elastic_ip}
    </extraVars>
    <verbose>true</verbose>
    <importTowerLogs>true</importTowerLogs>
  </org.jenkinsci.plugins.ansible__tower.AnsibleTower>
</builders>
There are some expected differences you always get with JobDSL-generated jobs, such as the empty fields being missing, but that's the case with all of our other (successful) jobs that follow this process.
The JobDSL script is here:
configure { project ->
    project / 'builders ' << 'org.jenkinsci.plugins.ansible__tower.AnsibleTower' {
        towerServer 'AWX Server'
        jobTemplate('create-ec2-instance')
        templateType 'job'
        jobType 'run'
        extraVars('''key_name: ${key_name}
ec2_termination_protection: ${ec2_termination_protection}
vpc_subnet_id: ${vpc_subnet_id}
security_groups: ${security_groups}
instance_type: ${instance_type}
instance_profile_name: ${instance_profile_name}
assign_public_ip: ${assign_public_ip}
region: ${region} image: ${image}
instance_tags: ${instance_tags}
ec2_wait_for_create: ${ec2_wait_for_create}
ec2_wait_for_create_timeout: ${ec2_wait_for_create_timeout}
exact_count: ${exact_count}
delete_volume_on_termination: ${delete_volume_on_termination}
data_disk_size: ${data_disk_size}
private_domain: ${private_domain}
route53_private_record_ttl: ${route53_private_record_ttl}
dns_record: ${dns_record}
elastic_ip: ${elastic_ip}''')
        verbose 'true'
        importTowerLogs 'true'
    }
}
The job that this generates looks identical in the UI (as well as in the XML) to my eye, and yet I keep getting that failure when I run it. Clearly I'm missing something, but I can't for the life of me see what.

Despite the fact that our other AWX jobs build fine without them, I added the missing (empty) fields and the job started succeeding.
So I changed my JobDSL script to this:
configure { project ->
    project / 'builders ' << 'org.jenkinsci.plugins.ansible__tower.AnsibleTower' {
        towerServer 'AWX Server'
        jobTemplate('create-ec2-instance')
        extraVars('''key_name: ${key_name}
ec2_termination_protection: ${ec2_termination_protection}
vpc_subnet_id: ${vpc_subnet_id}
security_groups: ${security_groups}
instance_type: ${instance_type}
instance_profile_name: ${instance_profile_name}
assign_public_ip: ${assign_public_ip}
region: ${region}
image: ${image}
instance_tags: ${instance_tags}
ec2_wait_for_create: ${ec2_wait_for_create}
ec2_wait_for_create_timeout: ${ec2_wait_for_create_timeout}
exact_count: ${exact_count}
delete_volume_on_termination: ${delete_volume_on_termination}
data_disk_size: ${data_disk_size}
private_domain: ${private_domain}
route53_private_record_ttl: ${route53_private_record_ttl}
dns_record: ${dns_record}
elastic_ip: ${elastic_ip}''')
        jobTags ''
        skipJobTags ''
        jobType 'run'
        limit ''
        inventory ''
        credential ''
        verbose 'true'
        importTowerLogs 'true'
        removeColor ''
        templateType 'job'
        importWorkflowChildLogs ''
    }
}
And now it's working as expected.

Related

GCP Helm Cloud Builder

Just curious: why isn't there an officially supported Helm cloud builder? It seems like a very common requirement, yet I'm not seeing one in the list here:
https://github.com/GoogleCloudPlatform/cloud-builders
I was previously using alpine/helm in my cloudbuild.yaml for my helm deployment as follows:
steps:
  # Build app image
  - name: gcr.io/cloud-builders/docker
    args:
      - build
      - -t
      - $_IMAGE_REPO/$_CONTAINER_NAME:$COMMIT_SHA
      - ./cloudbuild/$_CONTAINER_NAME/
  # Push my-app image to Google Cloud Registry
  - name: gcr.io/cloud-builders/docker
    args:
      - push
      - $_IMAGE_REPO/$_CONTAINER_NAME:$COMMIT_SHA
  # Configure a kubectl workspace for this project
  - name: gcr.io/cloud-builders/kubectl
    args:
      - cluster-info
    env:
      - CLOUDSDK_COMPUTE_REGION=$_CUSTOM_REGION
      - CLOUDSDK_CONTAINER_CLUSTER=$_CUSTOM_CLUSTER
      - KUBECONFIG=/workspace/.kube/config
  # Deploy with Helm
  - name: alpine/helm
    args:
      - upgrade
      - -i
      - $_CONTAINER_NAME
      - ./cloudbuild/$_CONTAINER_NAME/k8s
      - --set
      - image.repository=$_IMAGE_REPO/$_CONTAINER_NAME,image.tag=$COMMIT_SHA
      - -f
      - ./cloudbuild/$_CONTAINER_NAME/k8s/values.yaml
    env:
      - KUBECONFIG=/workspace/.kube/config
      - TILLERLESS=false
      - TILLER_NAMESPACE=kube-system
      - USE_GKE_GCLOUD_AUTH_PLUGIN=True
timeout: 1200s
substitutions:
  # substitutionOption: ALLOW_LOOSE
  # dynamicSubstitutions: true
  _CUSTOM_REGION: us-east1
  _CUSTOM_CLUSTER: demo-gke
  _IMAGE_REPO: us-east1-docker.pkg.dev/fakeproject/my-docker-repo
  _CONTAINER_NAME: app2
options:
  logging: CLOUD_LOGGING_ONLY
  # In this option we are providing the worker pool name that we have created in the previous step
  workerPool: 'projects/fakeproject/locations/us-east1/workerPools/cloud-build-pool'
And this was working with no issues. Then it recently started failing with the following error, so I'm guessing something changed:
Error: Kubernetes cluster unreachable: Get "https://10.10.2.2/version": getting credentials: exec: executable gke-gcloud-auth-plugin not found"
I get this error regularly on VMs and can work around it by setting USE_GKE_GCLOUD_AUTH_PLUGIN=True, but that does not seem to fix the issue here if I add it to the env section. So I'm looking for recommendations on how to use Helm with Cloud Build. alpine/helm was just something I randomly tried and it was working for me up until now, but there are probably better solutions out there.
Thanks!
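Not from the original thread, but one direction that comes up often is to build the community Helm builder from the GoogleCloudPlatform/cloud-builders-community repo and use that image for the deploy step; it runs gcloud container clusters get-credentials for you, so GKE auth is handled inside the builder image rather than via a kubeconfig you prepare yourself. A rough sketch of just the deploy step, assuming the builder has already been built and pushed as gcr.io/$PROJECT_ID/helm and reusing the substitutions from the question:

steps:
  # ... docker build and push steps as before ...
  # Deploy with Helm using the community builder image
  # (gcr.io/$PROJECT_ID/helm is assumed to exist; build it once from
  # https://github.com/GoogleCloudPlatform/cloud-builders-community)
  - name: gcr.io/$PROJECT_ID/helm
    args:
      - upgrade
      - -i
      - $_CONTAINER_NAME
      - ./cloudbuild/$_CONTAINER_NAME/k8s
      - --set
      - image.repository=$_IMAGE_REPO/$_CONTAINER_NAME,image.tag=$COMMIT_SHA
    env:
      # the builder fetches cluster credentials itself from these variables
      - CLOUDSDK_COMPUTE_REGION=$_CUSTOM_REGION   # or CLOUDSDK_COMPUTE_ZONE for a zonal cluster
      - CLOUDSDK_CONTAINER_CLUSTER=$_CUSTOM_CLUSTER

Whether the region variable (as opposed to the zone one) is honoured depends on the version of the builder you build, so treat the env names here as assumptions to verify against that repo's README.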

getting logs from a file with Ops Agent

I have a Python script on a VM that writes logs to a file, and I want to get those logs into Google Cloud Logging.
I tried this config yaml:
logging:
  receivers:
    syslog:
      type: files
      include_paths:
        - /var/log/messages
        - /var/log/syslog
    etl-error-logs:
      type: files
      include_paths:
        - /home/user/test_logging/err_*
    etl-info-logs:
      type: files
      include_paths:
        - /home/user/test_logging/out_*
  processors:
    etl_log_processor:
      type: parse_regex
      field: message
      regex: "(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s(?<severity>INFO|ERROR)\s(?<message>.*)"
      time_key: time
      time_format: "%Y-%m-%d %H:%M:%S"
  service:
    pipelines:
      default_pipeline:
        receivers: [syslog]
      error_pipeline:
        receivers: [etl-error-logs]
        processors: [etl_log_processor]
        log_level: error
      info_pipeline:
        receivers: [etl-info-logs]
        processors: [etl_log_processor]
        log_level: info
metrics:
  receivers:
    hostmetrics:
      type: hostmetrics
      collection_interval: 60s
  processors:
    metrics_filter:
      type: exclude_metrics
      metrics_pattern: []
  service:
    pipelines:
      default_pipeline:
        receivers: [hostmetrics]
        processors: [metrics_filter]
      error_pipeline:
        receivers: [hostmetrics]
        processors: [metrics_filter]
      info_pipeline:
        receivers: [hostmetrics]
        processors: [metrics_filter]
This is an example of the log lines: 2021-11-22 11:15:44 INFO testing normal
I didn't fully understand the Google docs, so I wrote the YAML as best I could, using their main example as a reference, but I have no idea why it doesn't work.
Environment: GCE VM
You want to use those logs in GCP Log Viewer: yes
Which docs did you follow: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/configuration#logging-receivers
How did you install the Ops Agent: in GCE I went into each VM instance, opened Observability, and there was the option to install the Ops Agent in Cloud Shell
What logs do you want to save: I want to save all of the logs that are being written to my log file, live.
Specific application logs: it's an ETL process that runs in Python and saves its logs to a local file on the VM
Try this command; it should show you the (almost) exact error:
sudo journalctl -xe | grep "google_cloud_ops_agent_engine"
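Beyond that, two things in the posted config look suspicious to me (assumptions on my part, not something confirmed in this thread): log_level is not a field in the documented pipeline schema, and \d inside a double-quoted YAML scalar is an invalid escape, so the regex is safer single-quoted. A minimal sketch of the logging block with those two changes, keeping the receiver and processor names from the question:

logging:
  receivers:
    etl-error-logs:
      type: files
      include_paths:
        - /home/user/test_logging/err_*
    etl-info-logs:
      type: files
      include_paths:
        - /home/user/test_logging/out_*
  processors:
    etl_log_processor:
      type: parse_regex
      field: message
      # single quotes so YAML passes \d and \s through to the regex engine unchanged
      regex: '(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s(?<severity>INFO|ERROR)\s(?<message>.*)'
      time_key: time
      time_format: '%Y-%m-%d %H:%M:%S'
  service:
    pipelines:
      etl_pipeline:
        receivers: [etl-error-logs, etl-info-logs]
        processors: [etl_log_processor]

After editing /etc/google-cloud-ops-agent/config.yaml, restart the agent with sudo systemctl restart google-cloud-ops-agent and re-run the journalctl command above to see whether any config validation errors remain.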

Try "Automated software delivery using Docker Compose and Amazon ECS", but fail at Compose2Cloudformation

■Problem summary
I am trying "Automated software delivery using Docker Compose and Amazon ECS", but it fails at the Compose2Cloudformation stage at the end of the CodePipeline.
■Verification environment
OS: Windows 10 Professional
Terminal: MINGW64
AWS CLI: aws-cli/2.2.13 Python/3.8.8 Windows/10 exe/AMD64 prompt/off
Docker Compose: Docker Compose version 1.0.17
■Procedures used for reference
https://aws.amazon.com/jp/blogs/containers/automated-software-delivery-using-docker-compose-and-amazon-ecs/
Translated from Japanese (contents are the same as the above link)
https://aws.amazon.com/jp/blogs/news/automated-software-delivery-using-docker-compose-and-amazon-ecs/
■Target Demo Project
https://github.com/aws-containers/demo-app-for-docker-compose.git
docker-compose.yml
x-aws-vpc: ${AWS_VPC}
x-aws-cluster: ${AWS_ECS_CLUSTER}
x-aws-loadbalancer: ${AWS_ELB}
services:
  frontend:
    image: ${IMAGE_URI:-frontend}:${IMAGE_TAG:-latest}
    build: ./frontend
    environment:
      REDIS_URL: "backend"
    networks:
      - demoapp
    ports:
      - 80:80
  backend:
    image: public.ecr.aws/bitnami/redis:6.2
    environment:
      ALLOW_EMPTY_PASSWORD: "yes"
    volumes:
      - redisdata:/data
    networks:
      - demoapp
volumes:
  redisdata:
networks:
  demoapp:
■Error log
compose-pipeline-ExtractBuild:17ef28f6-b566-47ed-a96d-0bb7a34cd47f
[Container] 2021/06/29 09:15:25 Running command docker context create ecs demoecs --from-env
Successfully created ecs context "demoecs"
[Container] 2021/06/29 09:15:25 Running command docker context use demoecs
demoecs
[Container] 2021/06/29 09:15:25 Phase complete: PRE_BUILD State: SUCCEEDED
[Container] 2021/06/29 09:15:25 Phase context status code: Message:
[Container] 2021/06/29 09:15:25 Entering phase BUILD
[Container] 2021/06/29 09:15:25 Running command echo Convert Compose File
Convert Compose File
[Container] 2021/06/29 09:15:25 Running command docker --debug compose convert > cloudformation.yml
level=debug msg=resolving host=098456798948.dkr.ecr.ap-northeast-1.amazonaws.com
.
.
.
level=debug msg="searching for existing filesystem as volume \"redisdata\""
multiple filesystems are tags as project="src", volume="redisdata"
[Container] 2021/06/29 09:15:26 Command did not exit successfully docker --debug compose convert > cloudformation.yml exit status 1
[Container] 2021/06/29 09:15:26 Phase complete: BUILD State: FAILED
[Container] 2021/06/29 09:15:26 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: docker --debug compose convert > cloudformation.yml. Reason: exit status 1
[Container] 2021/06/29 09:15:26 Entering phase POST_BUILD
[Container] 2021/06/29 09:15:26 Phase complete: POST_BUILD State: SUCCEEDED
Co-author of the blog here (thanks for giving it a try). So the message is kind of interesting:
searching for existing filesystem as volume \"redisdata\""
multiple filesystems are tags as project="src", volume="redisdata"
It almost feels like it's trying to find an existing EFS file system for this application to re-use (and if it doesn't exist, it will create it), but it says it finds "multiple", which should not happen: either it doesn't exist and will be created, or one exists and will be re-used. Can you check whether by any chance you see 2 or more EFS file systems with the tags project="src" and volume="redisdata"?
Also, at which point in the tutorial are you hitting this problem? At the first deployment? Or when you update the application and re-deploy?
Anyway, as we were digging into this we found there was a missing action in the IAM policy that prevented the pipeline from properly interacting with EFS. We just updated the repo with the missing action.
We believe these two things (the missing action and the error message you received) are not strictly related, but can I ask you to remove everything and restart from scratch? Please make sure you delete the EFS file systems manually (because Docker does not delete them) and also that you follow the Clean Up section at the end of the blog to delete everything properly.
Sorry for the inconvenience.
What I've done: I added the following at line 240 of compose-pipeline.
AmazonElasticFileSystemFullAccess:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: AmazonElasticFileSystemFullAccess
    Roles:
      - Ref: ExtractBuildRole
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Action:
            - ec2:CreateNetworkInterface
            - ec2:DeleteNetworkInterface
            - ec2:DescribeAvailabilityZones
            - ec2:DescribeNetworkInterfaceAttribute
            - ec2:DescribeNetworkInterfaces
            - ec2:DescribeSecurityGroups
            - ec2:DescribeSubnets
            - ec2:DescribeVpcs
            - ec2:ModifyNetworkInterfaceAttribute
            - elasticfilesystem:*
          Effect: Allow
          Resource:
            - "*"
I applied the change and tried again, and it worked!

Ansible RDS module only works once?

I'm using the RDS module in Ansible twice during a play. The first time, it records the CNAME; the second time, it waits until the database status is "available".
The problem I'm running into is that it works as expected the first time, but fails the second time.
---
- name: Wait for RDS to be out of creating state and get its CNAME
  rds:
    command: facts
    instance_name: '{{ wp_db_instance }}'
    region: "{{ region }}"
  register: rds_facts
  until: rds_facts.instance.status != "creating"
  retries: 15
  delay: 30
  become: no

- name: Set Endpoint variable
  set_fact: rds_db_endpoint="{{ (rds_facts.stdout|from_json).DBInstances[0].Endpoint.Address }}"
The above code works as expected. It waits for the instance to leave "creating" status and records the name.
Later on in the play, it runs the following.
---
- name: Wait for RDS instance to be in available state
  rds:
    command: facts
    instance_name: '{{ wp_db_instance }}'
    region: "{{ region }}"
  register: rds_facts_available
  until: rds_facts_available.instance.status == "available"
  retries: 55
  delay: 30
  become: no
This one fails with "'dict object' has no attribute 'instance'", which tells me the module did not return any facts. Why wouldn't it, though? Is there a problem with calling the module twice?
Any help would be appreciated.
I think I found the answer. It's kind of out there and I'm not sure if this is what fixed it, but here's what I did that works.
The first task (the one that worked) is part of a YAML file that is included by main.yml.
- name: Get RDS endpoint
  include: get_rds_endpoint.yml
  delegate_to: localhost
  become: no
The second task is part of a different YAML file, but that file's name is a variable:
- name: Configure DB
  include: "{{configureDBFile}}"
  delegate_to: localhost
  become: no
I changed "{{configureDBFile}}" to the name of the actual file and the issue went away.
I am doing regression testing of our old playbooks against the latest release so I have to wonder if an update at some point changed the way environment variables/credentials are passed down?
Change your second task as below:
---
- name: Wait for RDS instance to be in available state
  rds_instance_info:
    db_instance_identifier: '{{ wp_db_instance }}'
    region: "{{ region }}"
  register: rds_facts_available
  until: rds_facts_available.instances[0].db_instance_status == "available"
  retries: 55
  delay: 30
  become: no
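If you go that route, the first lookup can be pointed at rds_instance_info as well. A sketch (the instances[0].endpoint.address path is my assumption based on that module's snake_cased return values, so double-check it against your own output):

---
- name: Wait for RDS to be out of creating state and get its CNAME
  rds_instance_info:
    db_instance_identifier: '{{ wp_db_instance }}'
    region: "{{ region }}"
  register: rds_facts
  until: rds_facts.instances[0].db_instance_status != "creating"
  retries: 15
  delay: 30
  become: no

- name: Set Endpoint variable
  set_fact:
    rds_db_endpoint: "{{ rds_facts.instances[0].endpoint.address }}"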

Ansible docker_container 'no Host in request URL', docker pull works correctly

I'm trying to provision my infrastructure on AWS using Ansible playbooks. I have the instance and am able to provision docker-engine, docker-py, etc., and, I swear, this worked correctly yesterday and I haven't changed the code since.
The relevant portion of my playbook is:
- name: Ensure AWS CLI is available
  pip:
    name: awscli
    state: present
  when: aws_deploy

- block:
    - name: Add .boto file with AWS credentials.
      copy:
        content: "{{ boto_file }}"
        dest: ~/.boto
      when: aws_deploy
    - name: Log in to docker registry.
      shell: "$(aws ecr get-login --region us-east-1)"
      when: aws_deploy
    - name: Remove .boto file with AWS credentials.
      file:
        path: ~/.boto
        state: absent
      when: aws_deploy

- name: Create docker network
  docker_network:
    name: my-net

- name: Start Container
  docker_container:
    name: example
    image: "{{ docker_registry }}/example"
    pull: true
    restart: true
    network_mode: host
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone
My {{ docker_registry }} is set to my-acct-id.dkr.ecr.us-east-1.amazonaws.com and the result I'm getting is:
"msg": "Error pulling my-acct-id.dkr.ecr.us-east-1.amazonaws.com/example - code: None message: Get http://: http: no Host in request URL"
However, as mentioned, this worked correctly last night. Since then I've made some VPC/subnet changes, but I'm able to ssh to the instance, and run docker pull my-acct-id.dkr.ecr.us-east-1.amazonaws.com/example with no issues.
Googling hasn't gotten me very far, as I can't seem to find other folks with the same error. I'm wondering what changed, and how I can fix it! Thanks!
EDIT: Versions:
ansible - 2.2.0.0
docker - 1.12.3 6b644ec
docker-py - 1.10.6
I had the same problem. Downgrading the docker-compose pip package on that host machine from 1.9.0 to 1.8.1 solved it.
- name: Install docker-compose
  pip: name=docker-compose version=1.8.1
Per this thread: https://github.com/ansible/ansible-modules-core/issues/5775, the real culprit is requests. This fixes it:
- name: fix requests
  pip: name=requests version=2.12.1 state=forcereinstall
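If you would rather keep the workaround in the play itself, here is a sketch that pins both client libraries before any Docker tasks run (the versions are copied from the answers above; treat them as a starting point rather than gospel):

- name: Pin Python client libraries used by the docker modules
  pip:
    name: "{{ item.name }}"
    version: "{{ item.version }}"
    state: forcereinstall
  with_items:
    - { name: requests, version: "2.12.1" }
    - { name: docker-py, version: "1.10.6" }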