Argo CD - limit the time for an Application to transition to the Degraded state - argocd

Here is a common scenario we are stuck with:
An Argo CD Application is created and synced with Helm; it has a Deployment with 1 pod, all green.
We update the Deployment image tag to some broken value that does not exist in our Docker image registry and push the change to the git repo.
Argo CD picks up the update from the git repo; the sync status is a green Synced state, but the app health is "Progressing".
As a result of the change, the Deployment tries to roll out a new pod with the broken image tag and obviously cannot do so.
The Argo CD App stays in the "Progressing" health state for about 10 minutes and eventually transitions to the "Degraded" state.
Now the question: can we limit this time and reach the "Degraded" state in 1 or 2 minutes instead of 10?

Do you create the app with the GUI or with a yaml file?
If you create the app with a yaml file, you can do it by setting the limit or maxDuration fields under retry.
Here is an example:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  # You'll usually want to add your resources to the argocd namespace.
  namespace: argocd
  # Add this finalizer ONLY if you want these to cascade delete.
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  # The project the application belongs to.
  project: default
  # Source of the application manifests
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook
    # helm specific config
    helm:
      # Release name override (defaults to application name)
      releaseName: guestbook
      # Helm values files for overriding values in the helm chart
      # The path is relative to the spec.source.path directory defined above
      valueFiles:
        - values-prod.yaml
      # Optional Helm version to template with. If omitted it will fall back to look at the 'apiVersion' in Chart.yaml
      # and decide which Helm binary to use automatically. This field can be either 'v2' or 'v3'.
      version: v2
  # Destination cluster and namespace to deploy the application
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook
  # Sync policy
  syncPolicy:
    automated: # automated sync by default retries failed attempts 5 times with the following delays between attempts (5s, 10s, 20s, 40s, 80s); retry is controlled using the `retry` field.
      prune: true # Specifies if resources should be pruned during auto-syncing (false by default).
      selfHeal: true # Specifies if a partial app sync should be executed when resources are changed only in the target Kubernetes cluster and no git change is detected (false by default).
      allowEmpty: false # Allows deleting all application resources during automatic syncing (false by default).
    syncOptions: # Sync options which modify sync behavior
      - Validate=false # disables resource validation (equivalent to 'kubectl apply --validate=false') (true by default).
      - CreateNamespace=true # Namespace Auto-Creation ensures that the namespace specified as the application destination exists in the destination cluster.
      - PrunePropagationPolicy=foreground # Supported policies are background, foreground and orphan.
      - PruneLast=true # Allows resource pruning to happen as a final, implicit wave of a sync operation.
    # The retry feature is available since v1.7
    retry:
      limit: 5 # number of failed sync attempt retries; unlimited number of attempts if less than 0
      backoff:
        duration: 5s # the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")
        factor: 2 # a factor to multiply the base duration after each failed retry
        maxDuration: 3m # the maximum amount of time allowed for the backoff strategy

I believe the problem is not coming from Argo CD since, as you mentioned, the sync status is OK.
You may want to set progressDeadlineSeconds in your Deployment object.
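For illustration, a minimal sketch of such a Deployment (the name and image are hypothetical, not taken from the poster's manifest). The Kubernetes default for progressDeadlineSeconds is 600 seconds, which matches the roughly 10 minutes observed; lowering it should make the Deployment, and therefore the Argo CD app health, report Degraded sooner.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                          # hypothetical name
spec:
  progressDeadlineSeconds: 120          # fail the rollout after 2 minutes (default is 600)
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0   # hypothetical image reference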

Related

AWS copilot deployed service is not accessible?

I have a simple backend service that I just deployed with Copilot.
However, I don't know where to access it.
According to the AWS console it's running and active. I can even see in the logs that it has started.
My manifest:
# The manifest for the "user-service" service.
# Read the full specification for the "Backend Service" type at:
#  https://aws.github.io/copilot-cli/docs/manifest/backend-service/

# Your service name will be used in naming your resources like log groups, ECS services, etc.
name: user-service
type: Backend Service

# Your service does not allow any traffic.

# Configuration for your containers and service.
image:
  # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/backend-service/#image-build
  build: ./Dockerfile
  port: 9000

cpu: 256       # Number of CPU units for the task.
memory: 512    # Amount of memory in MiB used by the task.
count: 1       # Number of tasks that should be running in your service.

# Optional fields for more advanced use-cases.
#
variables:     # Pass environment variables as key value pairs.
  SERVER_PORT: 9000
  NODE_ENV: test

secrets:       # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
  ACCESS_TOKEN_SECRET: ACCESS_TOKEN_SECRET
  REFRESH_TOKEN_SECRET: REFRESH_TOKEN_SECRET
  MONGODB_URL: MONGODB_URL

# You can override any of the values defined above by environment.
environments:
  test:
    variables:
      NODE_ENV: test
    # count: 2   # Number of tasks to run for the "test" environment.
My Dockerfile
# Check out https://hub.docker.com/_/node to select a new base image
FROM node:lts-buster-slim
# Set to a non-root built-in user `node`
USER node
# Create app directory (with user `node`)
RUN mkdir -p /home/node/app
WORKDIR /home/node/app
# Install app dependencies
# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY --chown=node package*.json ./
RUN npm install
# Bundle app source code
COPY --chown=node . .
RUN npm run build
# Bind to all network interfaces so that it can be mapped to the host OS
ENV HOST=0.0.0.0 PORT=3000
EXPOSE 9000
CMD [ "node", "." ]
This works fine locally with docker-compose. But where can I find the URL of the deployed service? I checked the ECS console and the task has a public IP, but I can't connect to it.
What's missing here?
Nm.. my bad. Backend services are not supposed to be reachable via the internet. They expose endpoints but should talk to each other (or to the frontend) via service discovery.
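For example (a hedged sketch, not from the original post): another Copilot service in the same environment could reach this backend through the service-discovery namespace Copilot creates. Copilot exposes that namespace to each task via the COPILOT_SERVICE_DISCOVERY_ENDPOINT environment variable; the exact hostname format depends on your Copilot version.
# Hypothetical manifest snippet for a front-end service that calls user-service.
name: frontend
type: Load Balanced Web Service
variables:
  # The application code would build the URL at runtime, e.g.
  # http://user-service.${COPILOT_SERVICE_DISCOVERY_ENDPOINT}:9000
  USER_SERVICE_NAME: user-service
  USER_SERVICE_PORT: 9000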

docker exec cli peer channel create | failed to create new connection: context deadline exceeded | amazon managed blockchain

I am trying to set up a Hyperledger Fabric blockchain network using Amazon Managed Blockchain, following this guide. In step 6, to create the channel, I executed the following command:
docker exec cli peer channel create -c hrschannel -f /opt/home/hrschannel.pb -o orderer.n-zzzz.managedblockchain.us-east-1.amazonaws.com:30001 --cafile /opt/home/managedblockchain-tls-chain.pem --tls
But I am getting the following error:
Error: failed to create deliver client: orderer client failed to connect to orderer.n-zzzz.managedblockchain.us-east-1.amazonaws.com:30001: failed to create new connection: context deadline exceeded
Help me to fix this issue.
Edited:
I asked the same question on reddit. One user replied that he added a listenAddress environment variable to his configtx.yaml file, but he did not clearly say which listenAddress to use or where to add it in configtx.yaml. Here is my configtx.yaml file.
################################################################################
#
#   Section: Organizations
#
#   - This section defines the different organizational identities which will
#     be referenced later in the configuration.
#
################################################################################
Organizations:
    - &Org1
        # DefaultOrg defines the organization which is used in the sampleconfig
        # of the fabric.git development environment
        Name: m-CUB6HI

        # ID to load the MSP definition as
        ID: m-B6HI

        MSPDir: /opt/home/admin-msp

        # AnchorPeers defines the location of peers which can be used
        # for cross org gossip communication. Note, this value is only
        # encoded in the genesis block in the Application section context
        AnchorPeers:
            - Host:
              Port:

################################################################################
#
#   SECTION: Application
#
#   - This section defines the values to encode into a config transaction or
#     genesis block for application related parameters
#
################################################################################
Application: &ApplicationDefaults

    # Organizations is the list of orgs which are defined as participants on
    # the application side of the network
    Organizations:

################################################################################
#
#   Profile
#
#   - Different configuration profiles may be encoded here to be specified
#     as parameters to the configtxgen tool
#
################################################################################
Profiles:
    OneOrgChannel:
        Consortium: AWSSystemConsortium
        Application:
            <<: *ApplicationDefaults
            Organizations:
                - *Org1
Help me to fix this issue.
One must check whether the peer container is able to communicate with the orderer container. curl orderer.endpoint port can be used to check the connection. If the peer is unable to communicate, then either the orderer container is down or the containers are in different security groups.
Update:
As the OP mentioned in the comments, changing the port helped resolve the issue. One should give that a try.

NLP Flask app startup nodes timing out on Google Kubernetes GKE

I have a Flask app that includes some NLP packages and takes a while to initially build some vectors before it starts the server. I've noticed this in the past with Google App Engine, and I was able to set a max timeout in the app.yaml file to fix it.
The problem is that when I start my cluster on Kubernetes with this app, I notice that the workers keep timing out in the logs. Which makes sense, because I'm sure the default amount of time is not enough. However, I can't figure out how to configure GKE to allow the workers enough time to do everything they need to do before they start serving.
How do I increase the time the workers can take before they time out?
I deleted the old instances so I can't get the logs right now, but I can start it up if someone wants to see them.
It's something like this:
I 2020-06-26T01:16:04.603060653Z Computing vectors for all products
E 2020-06-26T01:16:05.660331982Z
95it [00:05, 17.84it/s][2020-06-26 01:16:05 +0000] [220] [INFO] Booting worker with pid: 220
E 2020-06-26T01:16:31.198002748Z [nltk_data] Downloading package stopwords to /root/nltk_data...
E 2020-06-26T01:16:31.198056691Z [nltk_data] Package stopwords is already up-to-date!
100it 2020-06-26T01:16:35.696015992Z [CRITICAL] WORKER TIMEOUT (pid:220)
E 2020-06-26T01:16:35.696015992Z [2020-06-26 01:16:35 +0000] [220] [INFO] Worker exiting (pid: 220)
I also see this:
The node was low on resource: memory. Container thoughtful-sha256-1 was using 1035416Ki, which exceeds its request of 0.
Obviously I don't exactly know what I'm doing. Why does it say I'm requesting 0 memory and can I set a timeout amount for the Kubernetes nodes?
Thanks for the help!
One thing you can do is add some sort of delay in a startup script for your GCP instances. You could try a simple:
#!/bin/bash
sleep <time-in-seconds>
Another thing you can try is adding some sort of delay to when your containers start on your Kubernetes nodes, for example a delay in an initContainer:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: myapa:latest
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "echo Waiting a bit && sleep 3600"]
Furthermore, you can try a startupProbe combined with the probe parameter initialDelaySeconds on your actual application container, so that it waits for some time before checking whether the application has started:
startupProbe:
  exec:
    command:
    - touch
    - /tmp/started
  initialDelaySeconds: 3600
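A hedged variation on the same idea: rather than a single long initialDelaySeconds, a startupProbe can poll the app and allow up to failureThreshold * periodSeconds for startup (here 30 * 20s = 10 minutes). The /healthz path and port below are assumptions to adapt to your app.
startupProbe:
  httpGet:
    path: /healthz          # hypothetical health endpoint in the Flask app
    port: 8080              # hypothetical container port
  periodSeconds: 20
  failureThreshold: 30      # 30 * 20s = up to 10 minutes allowed for startup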

istio is failing to install in a Kubernetes cluster built via Kops in AWS

I can't get the demo profile to work with istioctl. It seems like istioctl is having trouble creating the IngressGateways and the AddonComponents. I have tried the Helm installation with similar issues, and I built a fresh k8s cluster from kops and hit the same issue. Any help debugging this issue would be greatly appreciated.
I am following these instructions.
https://istio.io/docs/setup/getting-started/#download
I am running
istioctl manifest apply --set profile=demo --logtostderr
This is the output
2020-04-06T19:59:24.951136Z info Detected that your cluster does not support third party JWT authentication. Falling back to less secure first party JWT. See https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for details.
- Applying manifest for component Base...
✔ Finished applying manifest for component Base.
- Applying manifest for component Pilot...
✔ Finished applying manifest for component Pilot.
- Applying manifest for component IngressGateways...
- Applying manifest for component EgressGateways...
- Applying manifest for component AddonComponents...
✔ Finished applying manifest for component EgressGateways.
2020-04-06T20:00:11.501795Z error installer error running kubectl: exit status 1
✘ Finished applying manifest for component AddonComponents.
2020-04-06T20:00:40.418396Z error installer error running kubectl: exit status 1
✘ Finished applying manifest for component IngressGateways.
2020-04-06T20:00:40.421746Z info
Component AddonComponents - manifest apply returned the following errors:
2020-04-06T20:00:40.421823Z info Error: error running kubectl: exit status 1
2020-04-06T20:00:40.421884Z info Error detail:
Error from server (Timeout): error when creating "STDIN": Timeout: request did not complete within requested timeout 30s (repeated 1 times)
clusterrole.rbac.authorization.k8s.io/kiali unchanged
clusterrole.rbac.authorization.k8s.io/kiali-viewer unchanged
clusterrole.rbac.authorization.k8s.io/prometheus-istio-system unchanged
clusterrolebinding.rbac.authorization.k8s.io/kiali unchanged
clusterrolebinding.rbac.authorization.k8s.io/prometheus-istio-system unchanged
serviceaccount/kiali-service-account unchanged
serviceaccount/prometheus unchanged
configmap/istio-grafana unchanged
configmap/istio-grafana-configuration-dashboards-citadel-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-galley-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-istio-mesh-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-istio-performance-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-istio-service-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-istio-workload-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-mixer-dashboard unchanged
configmap/istio-grafana-configuration-dashboards-pilot-dashboard unchanged
configmap/kiali configured
configmap/prometheus unchanged
secret/kiali unchanged
deployment.apps/grafana unchanged
deployment.apps/istio-tracing unchanged
deployment.apps/kiali unchanged
deployment.apps/prometheus unchanged
service/grafana unchanged
service/jaeger-agent unchanged
service/jaeger-collector unchanged
service/jaeger-collector-headless unchanged
service/jaeger-query unchanged
service/kiali unchanged
service/prometheus unchanged
service/tracing unchanged
service/zipkin unchanged
2020-04-06T20:00:40.421999Z info
Component IngressGateways - manifest apply returned the following errors:
2020-04-06T20:00:40.422056Z info Error: error running kubectl: exit status 1
2020-04-06T20:00:40.422096Z info Error detail:
Error from server (Timeout): error when creating "STDIN": Timeout: request did not complete within requested timeout 30s (repeated 2 times)
serviceaccount/istio-ingressgateway-service-account unchanged
deployment.apps/istio-ingressgateway configured
poddisruptionbudget.policy/ingressgateway unchanged
role.rbac.authorization.k8s.io/istio-ingressgateway-sds unchanged
rolebinding.rbac.authorization.k8s.io/istio-ingressgateway-sds unchanged
service/istio-ingressgateway unchanged
2020-04-06T20:00:40.422134Z info
✘ Errors were logged during apply operation. Please check component installation logs above.
Error: failed to apply manifests: errors were logged during apply operation
I ran the below to verify install before running the above commands.
istioctl verify-install
Checking the cluster to make sure it is ready for Istio installation...
#1. Kubernetes-api
-----------------------
Can initialize the Kubernetes client.
Can query the Kubernetes API Server.
#2. Kubernetes-version
-----------------------
Istio is compatible with Kubernetes: v1.16.7.
#3. Istio-existence
-----------------------
Istio will be installed in the istio-system namespace.
#4. Kubernetes-setup
-----------------------
Can create necessary Kubernetes configurations: Namespace,ClusterRole,ClusterRoleBinding,CustomResourceDefinition,Role,ServiceAccount,Service,Deployments,ConfigMap.
#5. SideCar-Injector
-----------------------
This Kubernetes cluster supports automatic sidecar injection. To enable automatic sidecar injection see https://istio.io/docs/setup/kubernetes/additional-setup/sidecar-injection/#deploying-an-app
As mentioned in your logs
2020-04-06T19:59:24.951136Z info Detected that your cluster does not support third party JWT authentication. Falling back to less secure first party JWT.
As mentioned here
To determine if your cluster supports third party tokens, look for the TokenRequest API:
$ kubectl get --raw /api/v1 | jq '.resources[] | select(.name | index("serviceaccounts/token"))'
{
    "name": "serviceaccounts/token",
    "singularName": "",
    "namespaced": true,
    "group": "authentication.k8s.io",
    "version": "v1",
    "kind": "TokenRequest",
    "verbs": [
        "create"
    ]
}
While most cloud providers support this feature now, many local development tools and custom installations may not. To enable this feature, please refer to the Kubernetes documentation.
To authenticate with the Istio control plane, the Istio proxy will use a Service Account token. Kubernetes supports two forms of these tokens:
Third party tokens, which have a scoped audience and expiration.
First party tokens, which have no expiration and are mounted into all pods.
Because the properties of the first party token are less secure, Istio will default to using third party tokens. However, this feature is not enabled on all Kubernetes platforms.
If you are using istioctl to install, support will be automatically detected. This can be done manually as well, and configured by passing --set values.global.jwtPolicy=third-party-jwt or --set values.global.jwtPolicy=first-party-jwt.
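For reference, a minimal sketch of how that setting could be expressed as an IstioOperator overlay instead of --set flags (field paths as documented for Istio of that era; verify them against your Istio version). It could then be applied with istioctl manifest apply -f and the overlay file.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: demo
  values:
    global:
      jwtPolicy: first-party-jwt   # or third-party-jwt once the TokenRequest API is available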
If that doesn't work, I would open a new GitHub issue, or add a comment here, since that installation issue looks similar.

Deploying Django channels app on google flex engine

I am working with Django Channels and running into a problem deploying the app on the Google App Engine flexible environment. First I was getting the error 'deployment has failed to become healthy in the allotted time' and resolved it by adding readiness_check to app.yaml; now I am getting the error below:
(gcloud.app.deploy) Operation [apps/socketapp-263709/operations/65c25731-1e5a-4aa1-83e1-34955ec48c98] timed out. This operation may still be underway.
App.yaml
runtime: python
env: flex
runtime_config:
  python_version: 3
instance_class: F4_HIGHMEM

handlers:
# This configures Google App Engine to serve the files in the app's
# static directory.
- url: /static
  static_dir: static/
- url: /.*
  script: auto
# [END django_app]

readiness_check:
  check_interval_sec: 120
  timeout_sec: 40
  failure_threshold: 5
  success_threshold: 5
  app_start_timeout_sec: 1500
How can I fix this issue? Any suggestions?
This error is due to several issues:
1) You aren't configuring your app.yaml file correctly. In the App Engine flexible environment, resources are not requested through the instance_class option; to request resources, you have to use the resources option, like the following:
resources:
  cpu: 2
  memory_gb: 2.3
  disk_size_gb: 10
  volumes:
  - name: ramdisk1
    volume_type: tmpfs
    size_gb: 0.5
2) You're missing an entrypoint for your app. To deploy Django Channels, the documentation suggests having an entrypoint for the Daphne server. Add the following line to your app.yaml:
entrypoint: daphne -b 0.0.0.0 -p 8080 mysite.asgi:application
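Putting 1) and 2) together, a hedged sketch of how the top of the adjusted app.yaml might look (the module path mysite.asgi:application and the resource figures are placeholders to adapt to your project):
runtime: python
env: flex
entrypoint: daphne -b 0.0.0.0 -p 8080 mysite.asgi:application
runtime_config:
  python_version: 3
resources:
  cpu: 2
  memory_gb: 4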
3) If, after doing the previous, you still get the same error, it's possible that the In-use IP addresses quota in the region of your App Engine flexible application has reached its limit.
To check this, you can go to the "Activity" tab of your project home page. There you can see warnings about quota limits and VMs failing to be created.
By default, App Engine keeps the previous versions of your app, and keeping them running may consume IP addresses. You can delete the previous versions and/or request an increase of your IP address quota limit.
Also update the gcloud tools and SDK, which may resolve the issue.
To check your in-use addresses, click here; you will be able to increase your quota by clicking the 'Edit Quotas' button in the Cloud Console.