Istio Ingress not showing address (Kubeflow on AWS) - amazon-web-services

I'm trying to set up Kubeflow on AWS; I followed this tutorial to set up Kubeflow on AWS.
I used Dex instead of Cognito, with the following policy.
Then, at the step kfctl apply -V -f kfctl_aws.yaml, I first received this error:
IAM for Service Account is not supported on non-EKS cluster
So to fix this I set the property enablePodIamPolicy: false.
I then retried, and it successfully deployed Kubeflow. When checking the service status using kubectl -n kubeflow get all, I found all services ready except the MPI operator.
Ignoring that, I ran kubectl get ingress -n istio-system, but the ingress showed no ADDRESS.
On investigating with kubectl -n kubeflow logs $(kubectl get pods -n kubeflow --selector=app=aws-alb-ingress-controller --output=jsonpath={.items..metadata.name}), I found the following error:
E1104 12:09:37.446342 1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile LB managed SecurityGroup: failed to reconcile managed LoadBalancer securityGroup: UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: Lsvzm7f4rthL4Wxn6O8wiQL1iYXQUES_9Az_231BV7fyjgs7CHrwgUOVTNTf4334_C4voUogjSuCoF8GTOKhc5A7zAFzvcGUKT_FBs6if06KMQCLiCoujgfoqKJbG75pPsHHDFARIAdxNYZeIr4klmaUaxbQiFFxpvQsfT4ZkLMD7jmuQQcrEIw_U0MlpCQGkcvC69NRVVKjynIifxPBySubw_O81zifDp0Dk8ciRysaN1SbF85i8V3LoUkrtwROhUI9aQYJgYgSJ1CzWpfNLplbbr0X7YIrTDKb9sMhmlVicj_Yng0qFka_OVmBjHTnpojbKUSN96uBjGYZqC2VQXM1svLAHDTU1yRruFt5myqjhJ0fVh8Imhsk1Iqh0ytoO6eFoiLTWK4_Crb8XPS5tptBBzpEtgwgyk4QwOmzySUwkvNdDB-EIsTJcg5RQJl8ds4STNwqYV7XXeWxYQsmL1vGPVFY2lh_MX6q1jA9n8smxITE7F6AXsuRHTMP5q0jk58lbrUe-ZvuaD1b0kUTvpO3JtwWwxRd7jTKF7xde2InNOXwXxYCxHOw0sMX56Y1wLkvEDTLrNLZWOACS-T5o7mXDip43U0sSoUtMccu7lpfQzH3c7lNdr9s2Wgz4OqYaQYWsxNxRlRBdR11TRMweZt4Ta6K-7si5Z-rrcGmjG44NodT0O14Gzj-S4i6bK-qPYvUEsVeUl51ev_MsnBKtCXcMF8W6j9D7Oe3iGj13uvlVJEtq3OIoRjBXIuQQ012H0b3nQqlkoKEvsPAA_txAjgHXVzEVcM301_NDQikujTHdnxHNdzMcCfY7DQeeOE_2FT_hxYGlbuIg5vonRTT7MfSP8_LUuoIICGS81O-hDXvCLoomltb1fqCBBU2jpjIvNALMwNdJmMnwQOcIMI_QonRKoe5W43v\n\tstatus code: 403, request id: a9be63bd-2a3a-4a21-bb87-93532923ffd2" "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}
I don't understand what exactly went wrong with the security permissions.

The alb-ingress-controller doesn't have permission to create an ALB.
By setting enablePodIamPolicy: false, I assume you went for option 2 of the guide.
The alb-ingress-controller uses the kf-admin role, and the installer needs to attach to that role the policy found in aws-config/iam-alb-ingress-policy.json. Most probably it wasn't attached, so you'll have to add the policy in IAM and attach it to the role.
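For example, it could be attached as an inline policy with something like the following (a sketch; the exact role name depends on your cluster name and region, and the policy path assumes you run it from the Kubeflow deployment directory):
aws iam put-role-policy \
  --role-name kf-admin-<your-cluster-name> \
  --policy-name alb-ingress \
  --policy-document file://aws-config/iam-alb-ingress-policy.json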
After doing that, observe the reconciler logs of the alb-ingress-controller to see whether it's able to create the ALB.
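If the UnauthorizedOperation error persists, the encoded authorization failure message from the controller log can be decoded to see exactly which API call and permission were denied (assuming your credentials are allowed to call sts:DecodeAuthorizationMessage):
aws sts decode-authorization-message \
  --encoded-message '<encoded message from the controller log>' \
  --query DecodedMessage --output text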

It's likely that the cluster-name in the aws-alb-ingress-controller-config is not configured correctly.
If that's the case, edit the ConfigMap to the right cluster name using kubectl edit cm aws-alb-ingress-controller-config -n kubeflow.
After that, delete the pod so it restarts: kubectl -n kubeflow delete pod $(kubectl get pods -n kubeflow --selector=app=aws-alb-ingress-controller --output=jsonpath={.items..metadata.name}).
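To inspect the current value before editing, you can dump the ConfigMap (the exact key name may differ between Kubeflow versions, so check the output rather than grepping for a specific key):
kubectl -n kubeflow get cm aws-alb-ingress-controller-config -o yaml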


kubectl wait for Service on AWS EKS to expose Elastic Load Balancer (ELB) address reported in .status.loadBalancer.ingress field

As the kubernetes.io docs state about a Service of type LoadBalancer:
On cloud providers which support external load balancers, setting the type field to LoadBalancer provisions a load balancer for your Service. The actual creation of the load balancer happens asynchronously, and information about the provisioned balancer is published in the Service's .status.loadBalancer field.
On AWS Elastic Kubernetes Service (EKS), an AWS load balancer is provisioned that load balances the network traffic (see the AWS docs & the example project on GitHub provisioning an EKS cluster with Pulumi). Assuming we have a Deployment ready with the selector app=tekton-dashboard (it's the default Tekton dashboard you can deploy as stated in the docs), a Service of type LoadBalancer defined in tekton-dashboard-service.yml could look like this:
apiVersion: v1
kind: Service
metadata:
  name: tekton-dashboard-external-svc-manual
spec:
  selector:
    app: tekton-dashboard
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9097
  type: LoadBalancer
If we create the Service in our cluster with kubectl apply -f tekton-dashboard-service.yml -n tekton-pipelines, the AWS ELB gets created automatically.
There's only one problem: the .status.loadBalancer field is populated with the ingress[0].hostname field asynchronously and is therefore not available immediately. We can check this if we run the following commands together:
kubectl apply -f tekton-dashboard-service.yml -n tekton-pipelines && \
kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}'
The output will be an empty field:
{}%
So if we want to run this setup in a CI pipeline for example (e.g. GitHub Actions, see the example project's workflow provision.yml), we need to somehow wait until the .status.loadBalancer field got populated with the AWS ELB's hostname. How can we achieve this using kubectl wait?
TL;DR
Prior to Kubernetes v1.23 it's not possible with kubectl wait, but it can be done using until together with grep like this:
until kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}' | grep "ingress"; do : ; done
or even enhance the command using timeout (brew install coreutils on a Mac) to prevent the command from running infinitely:
timeout 10s bash -c 'until kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}' | grep "ingress"; do : ; done'
Problem with kubectl wait & the solution explained in detail
As stated in this SO Q&A and the Kubernetes issues kubectl wait unable to not wait for service ready #80828 & kubectl wait on arbitrary jsonpath #83094, using kubectl wait for this isn't possible in current Kubernetes versions right now.
The main reason is that kubectl wait assumes the status field of a Kubernetes resource queried with kubectl get service/xyz --output=yaml contains a conditions list, which a Service doesn't have. Using jsonpath here would be a solution and will be possible from Kubernetes v1.23 on (see this merged PR). But until that version is broadly available in managed Kubernetes clusters like EKS, we need another solution, and it should also be available as a "one-liner", just as kubectl wait would be.
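For reference, on clusters and kubectl versions where jsonpath support in kubectl wait is available, the wait could eventually look something like this sketch (an assumption to verify against your version: the earliest jsonpath support required an exact =value match, and waiting for a field to merely exist needs an even newer kubectl):
kubectl wait service/tekton-dashboard-external-svc-manual -n tekton-pipelines \
  --for=jsonpath='{.status.loadBalancer.ingress}' --timeout=120s
Until then, the until/grep approach below works on any version.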
A good starting point could be this superuser answer about "watching" the output of a command until a particular string is observed and then exit:
until my_cmd | grep "String Im Looking For"; do : ; done
If we use this approach together with a kubectl get we can craft a command which will wait until the field ingress gets populated into the status.loadBalancer field in our Service:
until kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}' | grep "ingress"; do : ; done
This will wait until the ingress field gets populated and then print out the AWS ELB address (e.g. by running kubectl get service tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer.ingress[0].hostname}' afterwards):
$ until kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}' | grep "ingress"; do : ; done
{"ingress":[{"hostname":"a74b078064c7d4ba1b89bf4e92586af0-18561896.eu-central-1.elb.amazonaws.com"}]}
Now we have a one-liner command that behaves just like a kubectl wait for our Service to become available through the AWS load balancer. We can double-check that this works with the following commands combined (be sure to delete the Service with kubectl delete service/tekton-dashboard-external-svc-manual -n tekton-pipelines before you execute it, because otherwise the Service, including the AWS load balancer, already exists):
kubectl apply -f tekton-dashboard-service.yml -n tekton-pipelines && \
until kubectl get service/tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer}' | grep "ingress"; do : ; done && \
kubectl get service tekton-dashboard-external-svc-manual -n tekton-pipelines --output=jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Here's also a full GitHub Actions pipeline run if you're interested.

AWS EKS - Failure creating load balancer controller

I am trying to create an application load balancer controller on my EKS cluster by following this link.
When I run these steps (after making the necessary changes to the downloaded YAML file):
curl -o v2_1_2_full.yaml https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.1.2/docs/install/v2_1_2_full.yaml
kubectl apply -f v2_1_2_full.yaml
I get this output
customresourcedefinition.apiextensions.k8s.io/targetgroupbindings.elbv2.k8s.aws configured
mutatingwebhookconfiguration.admissionregistration.k8s.io/aws-load-balancer-webhook configured
role.rbac.authorization.k8s.io/aws-load-balancer-controller-leader-election-role unchanged
clusterrole.rbac.authorization.k8s.io/aws-load-balancer-controller-role configured
rolebinding.rbac.authorization.k8s.io/aws-load-balancer-controller-leader-election-rolebinding unchanged
clusterrolebinding.rbac.authorization.k8s.io/aws-load-balancer-controller-rolebinding unchanged
service/aws-load-balancer-webhook-service unchanged
deployment.apps/aws-load-balancer-controller unchanged
validatingwebhookconfiguration.admissionregistration.k8s.io/aws-load-balancer-webhook configured
Error from server (InternalError): error when creating "v2_1_2_full.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: no endpoints available for service "cert-manager-webhook"
Error from server (InternalError): error when creating "v2_1_2_full.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: no endpoints available for service "cert-manager-webhook"
The load balancer controller doesn't appear to start up because of this and never gets to the ready state.
Does anyone have any suggestions on how to resolve this issue?
It turns out the taints on my nodegroup prevented the cert-manager pods from starting on any node.
These commands helped debug and led me to a fix for this issue:
kubectl get po -n cert-manager
kubectl describe po <pod id> -n cert-manager
My solution was to create another nodeGroup with no taints specified. This allowed the cert-manager to run.
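To see which taints are in play and whether the cert-manager pods are stuck in Pending because of them, checks along these lines can help (a sketch, not part of the original debugging):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
kubectl get po -n cert-manager -o wide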

Failed to retrieve token from the Google Compute Engine metadata service. Status: 404

I'm trying to set up Cloud SQL Proxy running as a sidecar in my GKE cluster. The configuration is done via Terraform. I've set up workload identity, required service accounts, and so on. When launching ./cloud_sql_proxy from within the GKE cluster (kubectl run -it --image google/cloud-sdk:slim --serviceaccount ksa-name --namespace k8s-namespace workload-identity-test), I get the following output:
root@workload-identity-test:/# ./cloud_sql_proxy -instances=project-id:europe-west4:db-instance=tcp:5432
2020/11/24 17:18:39 current FDs rlimit set to 1048576, wanted limit is 8500. Nothing to do here.
2020/11/24 17:18:40 GcloudConfig: error reading config: exit status 1; stderr was:
ERROR: (gcloud.config.config-helper) There was a problem refreshing your current auth tokens: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/db-proxy#project-id.iam.gserviceaccount.com/token from the Google Compute Enginemetadata service. Status: 404 Response:\nb'Unable to generate access token; IAM returned 404 Not Found: Requested entity was not found.\\n'", <google_auth_httplib2._Response object at 0x7fc5575545f8>)
Please run:
$ gcloud auth login
to obtain new credentials.
If you have already logged in with a different account:
$ gcloud config set account ACCOUNT
to select an already authenticated account to use.
2020/11/24 17:18:41 GcloudConfig: error reading config: exit status 1; stderr was:
ERROR: (gcloud.config.config-helper) There was a problem refreshing your current auth tokens: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/db-proxy#project-id.iam.gserviceaccount.com/token from the Google Compute Enginemetadata service. Status: 404 Response:\nb'Unable to generate access token; IAM returned 404 Not Found: Requested entity was not found.\\n'", <google_auth_httplib2._Response object at 0x7f06f72f45c0>)
Please run:
$ gcloud auth login
to obtain new credentials.
If you have already logged in with a different account:
$ gcloud config set account ACCOUNT
to select an already authenticated account to use.
2020/11/24 17:18:41 errors parsing config:
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/project-id/instances/europe-west4~db-instance?alt=json&prettyPrint=false": metadata: GCE metadata "instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.admin" not defined
Here's the troubleshooting I've done so far:
root@workload-identity-test:/# gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* db-proxy@project-id.iam.gserviceaccount.com
To set the active account, run:
$ gcloud config set account `ACCOUNT`
λ gcloud container clusters describe mycluster --format="value(workloadIdentityConfig.workloadPool)"
project-id.svc.id.goog
λ gcloud container node-pools describe mycluster-node-pool --cluster=mycluster --format="value(config.workloadMetadataConfig.mode)"
GKE_METADATA
λ gcloud container node-pools describe mycluster-node-pool --cluster=mycluster --format="value(config.oauthScopes)"
https://www.googleapis.com/auth/monitoring;https://www.googleapis.com/auth/devstorage.read_only;https://www.googleapis.com/auth/logging.write;https://www.googleapis.com/auth/cloud-platform;https://www.googleapis.com/auth/userinfo.email;https://www.googleapis.com/auth/compute;https://www.googleapis.com/auth/sqlservice.admin
λ kubectl describe serviceaccount --namespace k8s-namespace ksa-name
Name: ksa-name
Namespace: k8s-namespace
Labels: <none>
Annotations: iam.gke.io/gcp-service-account: db-proxy@project-id.iam.gserviceaccount.com
Image pull secrets: <none>
Mountable secrets: ksa-name-token-87n4t
Tokens: ksa-name-token-87n4t
Events: <none>
λ gcloud iam service-accounts get-iam-policy db-proxy@project-id.iam.gserviceaccount.com
bindings:
- members:
  - serviceAccount:project-id.svc.id.goog[k8s-namespace/ksa-name]
  role: roles/iam.workloadIdentityUser
etag: BwW02zludbY=
version: 1
λ kubectl get networkpolicy --namespace k8s-namespace
No resources found in k8s-namespace namespace.
λ gcloud projects get-iam-policy project-id
bindings:
- members:
  - serviceAccount:db-proxy@project-id.iam.gserviceaccount.com
  role: roles/cloudsql.editor
Expected result (I got this running on another cluster and changed the configuration afterwards; I can't find where my mistake is):
root@workload-identity-test:~# ./cloud_sql_proxy -instances=project-id:europe-west4:db-instance-2=tcp:5432
2020/11/24 18:09:54 current FDs rlimit set to 1048576, wanted limit is 8500. Nothing to do here.
2020/11/24 18:09:56 Listening on 127.0.0.1:5432 for project-id:europe-west4:db-instance-2
2020/11/24 18:09:56 Ready for new connections
What am I doing wrong? How do I troubleshoot or debug further?
This could be due to the service account not being enabled when the Kubernetes cluster was created, or to it not being configured properly. Try checking whether the service account is disabled and enable it if it is. You could also try creating a new service account and changing the service account in the pods. Or, finally, try providing the credentials to the gcloud command directly when running it.
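For the first suggestion, a quick way to check and re-enable the account could look like this (a sketch; gcloud iam service-accounts describe exposes a disabled field):
gcloud iam service-accounts describe db-proxy@project-id.iam.gserviceaccount.com --format="value(disabled)"
gcloud iam service-accounts enable db-proxy@project-id.iam.gserviceaccount.com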
I was able to resolve the problem by creating a service account with a different name. Only the name changed, nothing else. If I delete db-proxy@project-id.iam.gserviceaccount.com and then use the name again, the problem still persists. I was not able to find any other reference to said account. The problem has not been encountered again since my comment on Nov 30 '20.
Could you confirm that 'db-proxy@project-id.iam.gserviceaccount.com' is the correct account? I may be reading it wrong, but it seems that there is an error trying to refresh the auth token for that account, and the error is that the account does not exist.
I encountered a similar error today and discovered that it was because the GSA was in a different project from the GKE cluster. It seems that the iam.workloadIdentityUser binding needs to be between accounts in the same project.
So this worked:
gcloud iam service-accounts create custom-metrics-adapter \
  --project ${PLATFORM_PROJECT_ID}
gcloud iam service-accounts add-iam-policy-binding \
  "${GSA_NAME}@${PLATFORM_PROJECT_ID}.iam.gserviceaccount.com" \
  --member "serviceAccount:${PLATFORM_PROJECT_ID}.svc.id.goog[${KSA_NAMESPACE}/${KSA_NAME}]" \
  --role "roles/iam.workloadIdentityUser" \
  --project ${PLATFORM_PROJECT_ID}
with
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${KSA_NAME}
  namespace: ${KSA_NAMESPACE}
  annotations:
    iam.gke.io/gcp-service-account: ${GSA_NAME}@${PLATFORM_PROJECT_ID}.iam.gserviceaccount.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: ${KSA_NAMESPACE}
spec:
  template:
    spec:
      serviceAccountName: ${KSA_NAME}
      # Deployment spec truncated for clarity
Not sure if that helps you, but maybe it will help someone else who finds this by searching the error string:
Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/${GSA_NAME}@${DIFFERENT_PROJECT_ID}.iam.gserviceaccount.com/token from the Google Compute Engine metadata service. Status: 404 Response: Unable to generate access token; IAM returned 404 Not Found: Requested entity was not found.
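As an additional sanity check (not from the original answers), you can confirm which service account a pod actually resolves to through Workload Identity by querying the metadata server from inside the pod:
curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email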

kubectl : error: You must be logged in to the server (Unauthorized)

I have created a kops cluster and I'm getting the error below when logging in to the cluster.
Error log:
INFO! KUBECONFIG env var set to /home/user/scripts/kube/kubeconfig.yaml
INFO! Testing kubectl connection....
error: You must be logged in to the server (Unauthorized)
ERROR! Test Failed, AWS role might not be recongized by cluster
I'm using a script for IAM authentication and I logged in to the server with the proper role before connecting.
I am able to log in to another cluster in the same environment. I have tried with a different k8s version and a different configuration.
The KUBECONFIG doesn't have any problem and has the same entries and token details as the other cluster.
I can see the token with the aws-iam-authenticator command.
I have gone through most of the articles on this and nothing helped.
With kops v1.19 you need to add --admin or --user when updating the kubeconfig for your Kubernetes cluster, and each time you log out of your server you have to export the cluster name and the state store bucket and then update the cluster config again. This will work.
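A sketch of what that can look like (the cluster name and state store bucket below are placeholders for your own values):
export KOPS_CLUSTER_NAME=my-cluster.example.com
export KOPS_STATE_STORE=s3://my-kops-state-store
kops export kubecfg --admin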
This seems to be an AWS authorization issue. At cluster creation, only the IAM user who created the cluster has admin rights on it, so you may need to add your own IAM user first.
1- Start by verifying the IAM user identity used implicitly in all commands: aws sts get-caller-identity
If your AWS CLI is set up correctly, you will get output similar to this:
{
    "UserId": "ABCDEFGHIJK",
    "Account": "12344455555",
    "Arn": "arn:aws:iam::1234577777:user/Toto"
}
We will refer to the value in Account as YOUR_AWS_ACCOUNT_ID in step 3 (in this example, YOUR_AWS_ACCOUNT_ID="12344455555").
2- Once you have this identity, you have to add it to the AWS role binding to get EKS permissions.
3- You will need to edit the ConfigMap used by kubectl to add your user: kubectl edit -n kube-system configmap/aws-auth
In the editor that opens, create a username you want to use to refer to yourself when using the cluster, YOUR_USER_NAME (for simplicity you may use the same as your AWS user name, e.g. Toto from the identity output in step 1); you will need it in step 4. Then use the AWS account id you found in your identity info at step 1 as YOUR_AWS_ACCOUNT_ID (don't forget to keep the quotes ""), as follows in the mapUsers and mapAccounts sections:
mapUsers: |
  - userarn: arn:aws:iam::111122223333:user/ops-user
    username: YOUR_USER_NAME
    groups:
      - system:masters
mapAccounts: |
  - "YOUR_AWS_ACCOUNT_ID"
4- Finally, you need to create a role binding on the Kubernetes cluster for the user specified in the ConfigMap:
kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole cluster-admin \
--user YOUR_USER_NAME
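To verify the RBAC side afterwards, impersonation is a quick check (this only validates the cluster role binding, not the aws-auth mapping, and requires that your current identity is allowed to impersonate users):
kubectl auth can-i '*' '*' --as YOUR_USER_NAME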

kubectl Error: You must be logged in to the server

I've checked almost all of the answers on here, but nothing has resolved this yet.
When running kubectl, I will consistently get error: You must be logged in to the server (Unauthorized).
I have tried inspecting the config file via kubectl config --kubeconfig=config view, but I still receive the same error, even when running kubectl edit -n kube-system configmap/aws-auth.
Even when I just try to look at my clusters and run aws eks list-clusters, I receive a different error: An error occurred (UnrecognizedClientException) when calling the ListClusters operation: The security token included in the request is invalid.
I have completely torn down my clusters on EKS and rebuilt them, but I keep encountering these same errors. This is my first time attempting to use AWS EKS, and I've been trying different things for a few days.
I've set up my credentials with aws configure:
λ aws configure
AWS Access Key ID [****************Q]: *****
AWS Secret Access Key [****************5]: *****
Default region name [us-west-2]: us-west-2
Default output format [json]: json
Even when trying to look at the config map, I receive the same error:
λ kubectl describe configmap -n kube-system aws-auth
error: You must be logged in to the server (Unauthorized)
For me the problem was the system time; the commands below solved the issue for me.
sudo apt install ntp
service ntp restart
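As a quick sanity check before or after restarting ntp, you can verify whether the clock is considered synchronized (assumes a systemd-based system):
timedatectl status | grep -i synchronized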