Unable to create/read document to HDFS deployed with AWS EBS in EKS cluster - hdfs

I have EKS cluster with EBS storage class/volume.
I am able to deploy hdfs namenode and datanode images (bde2020/hadoop-xxx) using statefulset successfully.
When I am trying to put a file to hdfs from my machine using hdfs://:, it gives me success, but it does not get written on datanode.
In namenode log, I see below error.
Can it be something to do with EBS volume? I cannot even upload/download files from namenode GUI. Can it be due to as datanode host name hdfs-data-X.hdfs-data.pulse.svc.cluster.local is not resolvable to my local machine?
Please help
2020-05-12 17:38:51,360 INFO hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=10.8.29.112:9866, 10.8.29.176:9866, 10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:13,036 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:13,036 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:13,036 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:13,036 INFO hdfs.StateChange: BLOCK* allocate blk_1073741826_1002, replicas=10.8.29.176:9866, 10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:34,607 INFO namenode.FSEditLog: Number of transactions: 11 Total time for transactions(ms): 23 Number of transactions batched in Syncs: 3 Number of syncs: 8 SyncTimes(ms): 23
2020-05-12 17:39:35,146 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:35,146 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 2 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:35,146 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:35,147 INFO hdfs.StateChange: BLOCK* allocate blk_1073741827_1003, replicas=10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:57,319 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:57,319 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 3 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:57,319 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:57,320 INFO ipc.Server: IPC Server handler 5 on default port 8020, call Call#12 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.254.40.95:59328
java.io.IOException: File /vault/a.json could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2219)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2789)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:892)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
My namenode web page shows below:
Node Http Address Last contact Last Block Report Capacity Blocks Block pool used Version
hdfs-data-0.hdfs-data.pulse.svc.cluster.local:9866 http://hdfs-data-0.hdfs-data.pulse.svc.cluster.local:9864 1s 0m
975.9 MB
0 24 KB (0%) 3.2.1
hdfs-data-1.hdfs-data.pulse.svc.cluster.local:9866 http://hdfs-data-1.hdfs-data.pulse.svc.cluster.local:9864 2s 0m
975.9 MB
0 24 KB (0%) 3.2.1
hdfs-data-2.hdfs-data.pulse.svc.cluster.local:9866 http://hdfs-data-2.hdfs-data.pulse.svc.cluster.local:9864 1s 0m
975.9 MB
0 24 KB (0%) 3.2.1
My deployment:
NameNode:
#clusterIP service of namenode
apiVersion: v1
kind: Service
metadata:
name: hdfs-name
namespace: pulse
labels:
component: hdfs-name
spec:
ports:
- port: 8020
protocol: TCP
name: nn-rpc
- port: 9870
protocol: TCP
name: nn-web
selector:
component: hdfs-name
type: ClusterIP
---
#namenode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: hdfs-name
namespace: pulse
labels:
component: hdfs-name
spec:
serviceName: hdfs-name
replicas: 1
selector:
matchLabels:
component: hdfs-name
template:
metadata:
labels:
component: hdfs-name
spec:
initContainers:
- name: delete-lost-found
image: busybox
command: ["sh", "-c", "rm -rf /hadoop/dfs/name/lost+found"]
volumeMounts:
- name: hdfs-name-pv-claim
mountPath: /hadoop/dfs/name
containers:
- name: hdfs-name
image: bde2020/hadoop-namenode
env:
- name: CLUSTER_NAME
value: hdfs-k8s
- name: HDFS_CONF_dfs_permissions_enabled
value: "false"
ports:
- containerPort: 8020
name: nn-rpc
- containerPort: 9870
name: nn-web
volumeMounts:
- name: hdfs-name-pv-claim
mountPath: /hadoop/dfs/name
#subPath: data #subPath required as on root level, lost+found folder is created which does not cause to run namenode --format
volumeClaimTemplates:
- metadata:
name: hdfs-name-pv-claim
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: ebs
resources:
requests:
storage: 1Gi
Datanode:
#headless service of datanode
apiVersion: v1
kind: Service
metadata:
name: hdfs-data
namespace: pulse
labels:
component: hdfs-data
spec:
ports:
- port: 80
protocol: TCP
selector:
component: hdfs-data
clusterIP: None
type: ClusterIP
---
#datanode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: hdfs-data
namespace: pulse
labels:
component: hdfs-data
spec:
serviceName: hdfs-data
replicas: 3
selector:
matchLabels:
component: hdfs-data
template:
metadata:
labels:
component: hdfs-data
spec:
containers:
- name: hdfs-data
image: bde2020/hadoop-datanode
env:
- name: CORE_CONF_fs_defaultFS
value: hdfs://hdfs-name:8020
volumeMounts:
- name: hdfs-data-pv-claim
mountPath: /hadoop/dfs/data
volumeClaimTemplates:
- metadata:
name: hdfs-data-pv-claim
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: ebs
resources:
requests:
storage: 1Gi

It seems to be issue with the datanode not reachable over rpc port from my client machine.
I had datanodes http port reachable from my client machine. Tried using webhdfs:// (instead of hdfs://) after putting mapping of datanode podname vs IP in hosts file, it worked out.

Related

akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing failed. How can it be solved?

How can I solve this problem -- I am trying to run akka cluster on minikube. But failed to create a cluster.
17:46:49.093 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
My config is --
akka {
actor {
provider = cluster
}
cluster {
shutdown-after-unsuccessful-join-seed-nodes = 60s
}
coordinated-shutdown.exit-jvm = on
management {
cluster.bootstrap {
contact-point-discovery {
discovery-method = kubernetes-api
}
}
}
}
my yaml
kind: Deployment
metadata:
labels:
app: appka
name: appka
spec:
replicas: 2
selector:
matchLabels:
app: appka
template:
metadata:
labels:
app: appka
spec:
containers:
- name: appka
image: akkacluster:latest
imagePullPolicy: Never
readinessProbe:
httpGet:
path: /ready
port: management
periodSeconds: 10
failureThreshold: 10
initialDelaySeconds: 20
livenessProbe:
httpGet:
path: /alive
port: management
periodSeconds: 10
failureThreshold: 10
initialDelaySeconds: 20
ports:
- name: management
containerPort: 8558
protocol: TCP
- name: http
containerPort: 8080
protocol: TCP
- name: remoting
containerPort: 25520
protocol: TCP
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: pod-reader
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "watch", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: read-pods
subjects:
- kind: User
name: system:serviceaccount:default:default
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
Unfortunately my cluster is not formaing---
kubectl logs pod/appka-7c4b7df7f7-5v7cc
17:46:32.026 [appka-akka.actor.default-dispatcher-3] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
SLF4J: A number (1) of logging calls during the initialization phase have been intercepted and are
SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
SLF4J: See also http://www.slf4j.org/codes.html#replay
17:46:33.644 [appka-akka.actor.default-dispatcher-3] INFO akka.remote.artery.tcp.ArteryTcpTransport - Remoting started with transport [Artery tcp]; listening on address [akka://appka#172.17.0.4:25520] with UID [-8421566647681174079]
17:46:33.811 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Starting up, Akka version [2.6.14] ...
17:46:34.491 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Registered cluster JMX MBean [akka:type=Cluster]
17:46:34.512 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Started up successfully
17:46:34.883 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - No downing-provider-class configured, manual cluster downing required, see https://doc.akka.io/docs/akka/current/typed/cluster.html#downing
17:46:34.884 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - No seed nodes found in configuration, relying on Cluster Bootstrap for joining
17:46:39.084 [appka-akka.actor.default-dispatcher-11] INFO akka.management.internal.HealthChecksImpl - Loading readiness checks [(cluster-membership,akka.management.cluster.scaladsl.ClusterMembershipCheck), (sharding,akka.cluster.sharding.ClusterShardingHealthCheck)]
17:46:39.090 [appka-akka.actor.default-dispatcher-11] INFO akka.management.internal.HealthChecksImpl - Loading liveness checks []
17:46:39.104 [appka-akka.actor.default-dispatcher-3] INFO ClusterListenerActor$ - started actor akka://appka/user - (class akka.actor.typed.internal.adapter.ActorRefAdapter)
17:46:39.888 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Binding Akka Management (HTTP) endpoint to: 172.17.0.4:8558
17:46:40.525 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for ClusterHttpManagementRouteProvider
17:46:40.806 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for ClusterBootstrap
17:46:40.821 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Using self contact point address: http://172.17.0.4:8558
17:46:40.914 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for HealthCheckRoutes
17:46:44.198 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Initiating bootstrap procedure using kubernetes-api method...
17:46:44.200 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Bootstrap using `akka.discovery` method: kubernetes-api
17:46:44.226 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Bound Akka Management (HTTP) endpoint to: 172.17.0.4:8558
17:46:44.487 [appka-akka.actor.default-dispatcher-6] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Locating service members. Using discovery [akka.discovery.kubernetes.KubernetesApiServiceDiscovery], join decider [akka.management.cluster.bootstrap.LowestAddressJoinDecider], scheme [http]
17:46:44.490 [appka-akka.actor.default-dispatcher-6] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:44.493 [appka-akka.actor.default-dispatcher-6] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:45.626 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:45.627 [appka-akka.actor.default-dispatcher-12] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:48.428 [appka-akka.actor.default-dispatcher-13] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:48.485 [appka-akka.actor.default-dispatcher-22] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:48.586 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Discovered [2] contact points, confirmed [0], which is less than the required [2], retrying
17:46:49.092 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-4.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-4.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:49.093 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:49.603 [appka-akka.actor.default-dispatcher-22] INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Discovered [2] contact points, confirmed [0], which is less than the required [2], retrying
17:46:49.682 [appka-akka.actor.default-dispatcher-21] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:49.683 [appka-akka.actor.default-dispatcher-21] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:49.726 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:50.349 [appka-akka.actor.default-dispatcher-21] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:50.504 [appka-akka.actor.default-dispatcher-11] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-4.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-4.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
You are missing akka.remote setting block. Something like:
akka {
actor {
# provider=remote is possible, but prefer cluster
provider = cluster
}
remote {
artery {
transport = tcp # See Selecting a transport below
canonical.hostname = "127.0.0.1"
canonical.port = 25520
}
}
}

Istio: Health check / sidecar fails when I enable the JWT RequestAuthentication

OBSOLETE:
I keep this post for further reference, but you can check better diagnose (not solved yet, but workarounded) in
Istio: RequestAuthentication jwksUri does not resolve internal services names
UPDATE:
In Istio log we see the next error. uaa is a kubernetes pod serving OAUTH authentication/authorization. It is accessed with the name uaa from the normal services. I do not know why the istiod cannot find uaa host name. Have I to use an specific name? (remember, standard services find uaa host perfectly)
2021-03-03T18:39:36.750311Z error model Failed to fetch public key from "http://uaa:8090/uaa/token_keys": Get "http://uaa:8090/uaa/token_keys": dial tcp: lookup uaa on 10.96.0.10:53: no such host
2021-03-03T18:39:36.750364Z error Failed to fetch jwt public key from "http://uaa:8090/uaa/token_keys": Get "http://uaa:8090/uaa/token_keys": dial tcp: lookup uaa on 10.96.0.10:53: no such host
2021-03-03T18:39:36.753394Z info ads LDS: PUSH for node:product-composite-5cbf8498c7-jd4n5.chp18 resources:29 size:134.3kB
2021-03-03T18:39:36.754623Z info ads RDS: PUSH for node:product-composite-5cbf8498c7-jd4n5.chp18 resources:14 size:14.2kB
2021-03-03T18:39:36.790916Z warn ads ADS:LDS: ACK ERROR sidecar~10.1.1.56~product-composite-5cbf8498c7-jd4n5.chp18~chp18.svc.cluster.local-10 Internal:Error adding/updating listener(s) virtualInbound: Provider 'origins-0' in jwt_authn config has invalid local jwks: Jwks RSA [n] or [e] field is missing or has a parse error
2021-03-03T18:39:55.618106Z info ads ADS: "10.1.1.55:41162" sidecar~10.1.1.55~review-65b6886c89-bcv5f.chp18~chp18.svc.cluster.local-6 terminated rpc error: code = Canceled desc = context canceled
Original question
I have a service that is working fine, after injecting istio sidecar to a standard kubernetes pod.
I'm trying to add jwt Authentication, and for this, I'm following the official guide Authorization with JWT
My problem is
If I create the JWT resources (RequestAuthorization and AuthorizationPolicy) AFTER injecting the istio dependencies, everything (seems) to work fine
But if I create the JWT resources (RequestAuthorization and AuthorizationPolicy) and then inject the Istio the pod doesn't start. After checking the logs, seems that the sidecar is not able to work (maybe checking the health?)
My code:
JWT Resources
apiVersion: "security.istio.io/v1beta1"
kind: "RequestAuthentication"
metadata:
name: "ra-product-composite"
spec:
selector:
matchLabels:
app: "product-composite"
jwtRules:
- issuer: "http://uaa:8090/uaa/oauth/token"
jwksUri: "http://uaa:8090/uaa/token_keys"
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: "ap-product-composite"
spec:
selector:
matchLabels:
app: "product-composite"
action: ALLOW
# rules:
# - from:
# - source:
# requestPrincipals: ["http://uaa:8090/uaa/oauth/token/faf5e647-74ab-42cc-acdb-13cc9c573d5d"]
# b99ccf71-50ed-4714-a7fc-e85ebae4a8bb
2- I use destination rules as follows
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: dr-product-composite
spec:
host: product-composite
trafficPolicy:
tls:
mode: ISTIO_MUTUAL
3- My service deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: product-composite
spec:
replicas: 1
selector:
matchLabels:
app: product-composite
template:
metadata:
labels:
app: product-composite
version: latest
spec:
containers:
- name: comp
image: bthinking/product-composite-service
imagePullPolicy: Never
env:
- name: SPRING_PROFILES_ACTIVE
value: "docker"
- name: SPRING_CONFIG_LOCATION
value: file:/config-repo/application.yml,file:/config-repo/product-composite.yml
envFrom:
- secretRef:
name: rabbitmq-client-secrets
ports:
- containerPort: 80
resources:
limits:
memory: 350Mi
livenessProbe:
httpGet:
scheme: HTTP
path: /actuator/info
port: 4004
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 20
successThreshold: 1
readinessProbe:
httpGet:
scheme: HTTP
path: /actuator/health
port: 4004
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 2
failureThreshold: 3
successThreshold: 1
volumeMounts:
- name: config-repo-volume
mountPath: /config-repo
volumes:
- name: config-repo-volume
configMap:
name: config-repo-product-composite
---
apiVersion: v1
kind: Service
metadata:
name: product-composite
spec:
selector:
app: "product-composite"
ports:
- port: 80
name: http
targetPort: 80
- port: 4004
name: http-mgm
targetPort: 4004
4- Error log in the pod (combined service and sidecar)
2021-03-02 19:34:41.315 DEBUG 1 --- [undedElastic-12] o.s.s.w.s.a.AuthorizationWebFilter : Authorization successful
2021-03-02 19:34:41.315 DEBUG 1 --- [undedElastic-12] .b.a.e.w.r.WebFluxEndpointHandlerMapping : [0e009bf1-133] Mapped to org.springframework.boot.actuate.endpoint.web.reactive.AbstractWebFluxEndpointHandlerMapping$ReadOperationHandler#e13aa23
2021-03-02 19:34:41.316 DEBUG 1 --- [undedElastic-12] ebSessionServerSecurityContextRepository : No SecurityContext found in WebSession: 'org.springframework.web.server.session.InMemoryWebSessionStore$InMemoryWebSession#48e89a58'
2021-03-02 19:34:41.319 DEBUG 1 --- [undedElastic-15] .s.w.r.r.m.a.ResponseEntityResultHandler : [0e009bf1-133] Using 'application/vnd.spring-boot.actuator.v3+json' given [*/*] and supported [application/vnd.spring-boot.actuator.v3+json, application/vnd.spring-boot.actuator.v2+json, application/json]
2021-03-02 19:34:41.320 DEBUG 1 --- [undedElastic-15] .s.w.r.r.m.a.ResponseEntityResultHandler : [0e009bf1-133] 0..1 [java.util.Collections$UnmodifiableMap<?, ?>]
2021-03-02 19:34:41.321 DEBUG 1 --- [undedElastic-15] o.s.http.codec.json.Jackson2JsonEncoder : [0e009bf1-133] Encoding [{}]
2021-03-02 19:34:41.326 DEBUG 1 --- [or-http-epoll-3] r.n.http.server.HttpServerOperations : [id: 0x0e009bf1, L:/127.0.0.1:4004 - R:/127.0.0.1:57138] Detected non persistent http connection, preparing to close
2021-03-02 19:34:41.327 DEBUG 1 --- [or-http-epoll-3] o.s.w.s.adapter.HttpWebHandlerAdapter : [0e009bf1-133] Completed 200 OK
2021-03-02 19:34:41.327 DEBUG 1 --- [or-http-epoll-3] r.n.http.server.HttpServerOperations : [id: 0x0e009bf1, L:/127.0.0.1:4004 - R:/127.0.0.1:57138] Last HTTP response frame
2021-03-02 19:34:41.328 DEBUG 1 --- [or-http-epoll-3] r.n.http.server.HttpServerOperations : [id: 0x0e009bf1, L:/127.0.0.1:4004 - R:/127.0.0.1:57138] Last HTTP packet was sent, terminating the channel
2021-03-02T19:34:41.871551Z warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 1 successful, 0 rejected; lds updates: 0 successful, 1 rejected
5- Istio injection
kubectl get deployment product-composite -o yaml | istioctl kube-inject -f - | kubectl apply -f -
NOTICE: I have checked a lot of post in SO, and it seems that health checking create a lot of problems with sidecars and other configurations. I have checked the guide Health Checking of Istio Services with no success. Specifically, I tried to disable the sidecar.istio.io/rewriteAppHTTPProbers: "false", but it is worse (in this case, doesn't start neither the sidecar neither the service.

Prometheus alert manager doesnt send alert k8s

Im using prometheus operator 0.3.4 and alert manager 0.20 and it doesnt work, i.e. I see that the alert is fired (on prometheus UI on the alerts tab) but I didnt get any alert to the email. by looking at the logs I see the following , any idea ? please see the warn in bold maybe this is the reason but not sure how to fix it...
This is the helm of prometheus operator which I use:
https://github.com/helm/charts/tree/master/stable/prometheus-operator
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:232 build_context="(go=go1.13.5, user=root#00c3106655f8, date=20191211-14:13:14)"
level=warn ts=2019-12-23T15:42:28.109Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-12-23T15:42:28.109Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-12-23T15:42:28.131Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.132Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail2
level=info ts=2019-12-23T15:42:28.135Z caller=main.go:497 msg=Listening address=:9093
level=info ts=2019-12-23T15:42:30.110Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.00011151s
level=info ts=2019-12-23T15:42:38.110Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.000659096s
this is my config yaml
global:
imagePullSecrets: []
prometheus-operator:
defaultRules:
grafana:
enabled: true
prometheusOperator:
tolerations:
- key: "WorkGroup"
operator: "Equal"
value: "operator"
effect: "NoSchedule"
- key: "WorkGroup"
operator: "Equal"
value: "operator"
effect: "NoExecute"
tlsProxy:
image:
repository: squareup/ghostunnel
tag: v1.4.1
pullPolicy: IfNotPresent
resources:
limits:
cpu: 8000m
memory: 2000Mi
requests:
cpu: 2000m
memory: 2000Mi
admissionWebhooks:
patch:
priorityClassName: "operator-critical"
image:
repository: jettech/kube-webhook-certgen
tag: v1.0.0
pullPolicy: IfNotPresent
serviceAccount:
name: prometheus-operator
image:
repository: quay.io/coreos/prometheus-operator
tag: v0.34.0
pullPolicy: IfNotPresent
prometheus:
prometheusSpec:
replicas: 1
serviceMonitorSelector:
role: observeable
tolerations:
- key: "WorkGroup"
operator: "Equal"
value: "operator"
effect: "NoSchedule"
- key: "WorkGroup"
operator: "Equal"
value: "operator"
effect: "NoExecute"
ruleSelector:
matchLabels:
role: alert-rules
prometheus: prometheus
image:
repository: quay.io/prometheus/prometheus
tag: v2.13.1
alertmanager:
alertmanagerSpec:
image:
repository: quay.io/prometheus/alertmanager
tag: v0.20.0
resources:
limits:
cpu: 500m
memory: 1000Mi
requests:
cpu: 500m
memory: 1000Mi
serviceAccount:
name: prometheus
config:
global:
resolve_timeout: 1m
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alertmanager#vsx.com'
smtp_auth_username: 'ds.monitoring.grafana#gmail.com'
smtp_auth_password: 'mypass'
smtp_require_tls: false
route:
group_by: ['alertname', 'cluster']
group_wait: 45s
group_interval: 5m
repeat_interval: 1h
receiver: default-receiver
routes:
- receiver: str
match_re:
cluster: "canary|canary2"
receivers:
- name: default-receiver
- name: str
email_configs:
- to: 'rayndoll007#gmail.com'
from: alertmanager#vsx.com
smarthost: smtp.gmail.com:587
auth_identity: ds.monitoring.grafana#gmail.com
auth_username: ds.monitoring.grafana#gmail.com
auth_password: mypass
- name: 'AlertMail'
email_configs:
- to: 'rayndoll007#gmail.com'
https://codebeautify.org/yaml-validator/cb6a2781
The error says it failed in the resolve , the pod name called alertmanager-monitoring-prometheus-oper-alertmanager-0 which is up and running however it try to resolve : lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc not sure why...
Here is the output of kubectl get svc -n mon
update
This is warn logs
level=warn ts=2019-12-24T12:10:21.293Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.323Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-1.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.326Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-2.alertmanager-operated.monitoring.svc:9094
This is the kubectl get svc -n mon
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 6m4s
monitoring-grafana ClusterIP 100.11.215.226 <none> 80/TCP 6m13s
monitoring-kube-state-metrics ClusterIP 100.22.248.232 <none> 8080/TCP 6m13s
monitoring-prometheus-node-exporter ClusterIP 100.33.130.77 <none> 9100/TCP 6m13s
monitoring-prometheus-oper-alertmanager ClusterIP 100.33.228.217 <none> 9093/TCP 6m13s
monitoring-prometheus-oper-operator ClusterIP 100.21.229.204 <none> 8080/TCP,443/TCP 6m13s
monitoring-prometheus-oper-prometheus ClusterIP 100.22.93.151 <none> 9090/TCP 6m13s
prometheus-operated ClusterIP None <none> 9090/TCP 5m54s
Proper debug steps to help with these kind of scenarios:
Enable Alertmanager debug logs: add argument --log.level=debug
Verify Alertmanager cluster is formed properly (Check /status endpoint and verify all peers are listed)
Verify that Prometheus is sending alerts to all Alertmanager peers (Check /status endpoint and verify all Alertmanager peers are listed)
End to End testing: Generate a test alert, alert should be seen in Prometheus UI, then alert should be seen in Alertmanager UI, finally alert notification should be seen.

CrashLoopBackOff Error when deploying Django app on GKE (Kubernetes)

Folks,
What problem now still persists:
I have now gone beyond the code getting stuck on CrashLoopBackOff by fixing the Dockerfile run command as suggested by Emil Gi, however the external IP is not forwarding to my pod library app server
Status
Fixed port to 8080 in Dockerfile and ensured it is consistent across
Made sure Dockerfile has proper commands so that it doesn't terminate immediately post startup, this was what was causing the CrashLoop Back
Problem is still that the load balancer external IP I click on gives this error "This site can’t be reached34.93.141.11 refused to connect."
Original Question:
How do I resolve this CrashLoopBackOff? I looked at many docs and tried debugging but unsure what is causing this? The app runs perfectly in local mode, it even deploys smoothly into appengine standard, but GKE nope. Any pointers to debug this further most appreciated.
Problem: The cloudsql proxy container is running, but the library-app container is having CrashLoopBackOff error. The pod was assigned to a node, starts pulling the images, starting the images, and then it goes into this BackOff state.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
library-7699b84747-9skst 1/2 CrashLoopBackOff 28 121m
$ kubectl logs library-7699b84747-9skst
Error from server (BadRequest): a container name must be specified for pod library-7699b84747-9skst, choose one of: [library-app cloudsql-proxy]
​$ kubectl describe pods library-7699b84747-9skst
Name: library-7699b84747-9skst
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: gke-library-default-pool-35b5943a-ps5v/10.160.0.13
Start Time: Fri, 06 Dec 2019 09:34:11 +0530
Labels: app=library
pod-template-hash=7699b84747
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container library-app; cpu request for container cloudsql-proxy
Status: Running
IP: 10.16.0.10
Controlled By: ReplicaSet/library-7699b84747
Containers:
library-app:
Container ID: docker://e7d8aac3dff318de34f750c3f1856cd754aa96a7203772de748b3e397441a609
Image: gcr.io/library-259506/library
Image ID: docker-pullable://gcr.io/library-259506/library#sha256:07f54e055621ab6ddcbb49666984501cf98c95133bcf7405ca076322fb0e4108
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 06 Dec 2019 09:35:07 +0530
Finished: Fri, 06 Dec 2019 09:35:07 +0530
Ready: False
Restart Count: 2
Requests:
cpu: 100m
Environment:
DATABASE_USER: <set to the key 'username' in secret 'cloudsql'> Optional: false
DATABASE_PASSWORD: <set to the key 'password' in secret 'cloudsql'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-kj497 (ro)
cloudsql-proxy:
Container ID: docker://352284231e7f02011dd1ab6999bf9a283b334590435278442e9a04d4d0684405
Image: gcr.io/cloudsql-docker/gce-proxy:1.16
Image ID: docker-pullable://gcr.io/cloudsql-docker/gce-proxy#sha256:7d302c849bebee8a3fc90a2705c02409c44c91c813991d6e8072f092769645cf
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
--dir=/cloudsql
-instances=library-259506:asia-south1:library=tcp:3306
-credential_file=/secrets/cloudsql/credentials.json
State: Running
Started: Fri, 06 Dec 2019 09:34:51 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/cloudsql from cloudsql (rw)
/etc/ssl/certs from ssl-certs (rw)
/secrets/cloudsql from cloudsql-oauth-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-kj497 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cloudsql-oauth-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-oauth-credentials
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
cloudsql:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
default-token-kj497:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-kj497
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 86s default-scheduler Successfully assigned default/library-7699b84747-9skst to gke-library-default-pool-35b5943a-ps5v
Normal Pulling 50s kubelet, gke-library-default-pool-35b5943a-ps5v pulling image "gcr.io/cloudsql-docker/gce-proxy:1.16"
Normal Pulled 47s kubelet, gke-library-default-pool-35b5943a-ps5v Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.16"
Normal Created 46s kubelet, gke-library-default-pool-35b5943a-ps5v Created container
Normal Started 46s kubelet, gke-library-default-pool-35b5943a-ps5v Started container
Normal Pulling 2s (x4 over 85s) kubelet, gke-library-default-pool-35b5943a-ps5v pulling image "gcr.io/library-259506/library"
Normal Created 1s (x4 over 50s) kubelet, gke-library-default-pool-35b5943a-ps5v Created container
Normal Started 1s (x4 over 50s) kubelet, gke-library-default-pool-35b5943a-ps5v Started container
Normal Pulled 1s (x4 over 52s) kubelet, gke-library-default-pool-35b5943a-ps5v Successfully pulled image "gcr.io/library-259506/library"
Warning BackOff 1s (x5 over 43s) kubelet, gke-library-default-pool-35b5943a-ps5v Back-off restarting failed container​
Here is the library.yaml file I have to go with it.
# [START kubernetes_deployment]
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: library
labels:
app: library
spec:
replicas: 2
template:
metadata:
labels:
app: library
spec:
containers:
- name: library-app
# Replace with your project ID or use `make template`
image: gcr.io/library-259506/library
# This setting makes nodes pull the docker image every time before
# starting the pod. This is useful when debugging, but should be turned
# off in production.
imagePullPolicy: Always
env:
# [START cloudsql_secrets]
- name: DATABASE_USER
valueFrom:
secretKeyRef:
name: cloudsql
key: username
- name: DATABASE_PASSWORD
valueFrom:
secretKeyRef:
name: cloudsql
key: password
# [END cloudsql_secrets]
ports:
- containerPort: 8080
# [START proxy_container]
- image: gcr.io/cloudsql-docker/gce-proxy:1.16
name: cloudsql-proxy
command: ["/cloud_sql_proxy", "--dir=/cloudsql",
"-instances=library-259506:asia-south1:library=tcp:3306",
"-credential_file=/secrets/cloudsql/credentials.json"]
volumeMounts:
- name: cloudsql-oauth-credentials
mountPath: /secrets/cloudsql
readOnly: true
- name: ssl-certs
mountPath: /etc/ssl/certs
- name: cloudsql
mountPath: /cloudsql
# [END proxy_container]
# [START volumes]
volumes:
- name: cloudsql-oauth-credentials
secret:
secretName: cloudsql-oauth-credentials
- name: ssl-certs
hostPath:
path: /etc/ssl/certs
- name: cloudsql
emptyDir:
# [END volumes]
# [END kubernetes_deployment]
---
# [START service]
# The library-svc service provides a load-balancing proxy over the polls app
# pods. By specifying the type as a 'LoadBalancer', Container Engine will
# create an external HTTP load balancer.
# The service directs traffic to the deployment by matching the service's selector to the deployment's label
#
# For more information about external HTTP load balancing see:
# https://cloud.google.com/container-engine/docs/load-balancer
apiVersion: v1
kind: Service
metadata:
name: library-svc
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 8080
selector:
app: library
# [END service]
More error status
Container 'library-app' keeps crashing.
CrashLoopBackOff
Reason
Container 'library-app' keeps crashing.
Check Pod's logs to see more details. Learn more
Source
library-7699b84747-9skst
Conditions
Initialized: True Ready: False ContainersReady: False PodScheduled: True
- lastProbeTime: null
lastTransitionTime: "2019-12-06T06:03:43Z"
message: 'containers with unready status: [library-app]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
Key Events
Back-off restarting failed container BackOff Dec 6, 2019, 9:34:54
AM Dec 6, 2019, 12:24:26 PM 779 pulling image
"gcr.io/library-259506/library" Pulling Dec 6, 2019, 9:34:12 AM Dec 6,
2019, 11:59:26 AM 34
The Dockerfile is as follows (this fixed the CrashLoop btw):
FROM python:3
ENV PYTHONUNBUFFERED 1
RUN mkdir /code
WORKDIR /code
COPY requirements.txt /code/
RUN pip install -r requirements.txt
COPY . /code/
# Server
EXPOSE 8080
STOPSIGNAL SIGINT
ENTRYPOINT ["python", "manage.py"]
CMD ["runserver", "0.0.0.0:8080"]
I think a bunch of things all came together
I found the password to db had a special character that needed to be put within quotes and then ensuring port # where accurate across the Dockerfile, library.yaml files. This ensured the secrets actually worked, I detected in the logs a password mismatch issue.
IMPORTANT: the command line fix Emil G about ensuring my Dockerfile doesn't exit quickly, so make sure the CMD actually works and runs your server.
IMPORTANT: Finally I found a fix to the external IP not connecting to my server, see this thread where I explain what went wrong: basically I needed a security context where I had to fix the runAs to not run as root: RunAsUser issue & Clicking external IP of load balancer -> Bad Request (400) on deploying Django app on GKE (Kubernetes) and db connection failing:
I also documented all steps to deploy step 1-15 and

How to connect to GKE postgresql svc in GCP?

I'm trying to connect to the postgresql service (pod) in my kubernetes deployment but I GCP does not give a port (so I can not use something like: $ psql -h localhost -U postgresadmin1 --password -p 31070 postgresdb to connect to Postgresql and see my database).
I'm using a LoadBalancer in my service:
#cloudshell:~ (academic-veld-230622)$ psql -h 35.239.52.68 -U jhipsterpress --password -p 30728 jhipsterpress-postgresql
Password for user jhipsterpress:
psql: could not connect to server: Connection timed out
Is the server running on host "35.239.52.68" and accepting
TCP/IP connections on port 30728?
apiVersion: v1
kind: Service
metadata:
name: jhipsterpress
namespace: default
labels:
app: jhipsterpress
spec:
selector:
app: jhipsterpress
type: LoadBalancer
ports:
- name: http
port: 8080
NAME READY STATUS RESTARTS AGE
pod/jhipsterpress-84886f5cdf-mpwgb 1/1 Running 0 31m
pod/jhipsterpress-postgresql-5956df9557-fg8cn 1/1 Running 0 31m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/jhipsterpress LoadBalancer 10.11.243.22 35.184.135.134 8080:32670/TCP 31m
service/jhipsterpress-postgresql LoadBalancer 10.11.255.64 35.239.52.68 5432:30728/TCP 31m
service/kubernetes ClusterIP 10.11.240.1 <none> 443/TCP 35m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/jhipsterpress 1 1 1 1 31m
deployment.apps/jhipsterpress-postgresql 1 1 1 1 31m
NAME DESIRED CURRENT READY AGE
replicaset.apps/jhipsterpress-84886f5cdf 1 1 1 31m
replicaset.apps/jhipsterpress-postgresql-5956df9557 1 1 1 31m
#cloudshell:~ (academic-veld-230622)$ kubectl describe pod jhipsterpress-postgresql
Name: jhipsterpress-postgresql-5956df9557-fg8cn
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: gke-standard-cluster-1-default-pool-bf9f446d-9hsq/10.128.0.58
Start Time: Sat, 06 Apr 2019 13:39:08 +0200
Labels: app=jhipsterpress-postgresql
pod-template-hash=1512895113
Annotations: kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container postgres
Status: Running
IP: 10.8.0.14
Controlled By: ReplicaSet/jhipsterpress-postgresql-5956df9557
Containers:
postgres:
Container ID: docker://55475d369c63da4d9bdc208e9d43c457f74845846fb4914c88c286ff96d0e45a
Image: postgres:10.4
Image ID: docker-pullable://postgres#sha256:9625c2fb34986a49cbf2f5aa225d8eb07346f89f7312f7c0ea19d82c3829fdaa
Port: 5432/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 06 Apr 2019 13:39:29 +0200
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment:
POSTGRES_USER: jhipsterpress
POSTGRES_PASSWORD: <set to the key 'postgres-password' in secret 'jhipsterpress-postgresql'> Optional: false
Mounts:
/var/lib/pgsql/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mlmm5 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: spingular-bucket
ReadOnly: false
default-token-mlmm5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mlmm5
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 33m (x3 over 33m) default-scheduler persistentvolumeclaim "spingular-bucket" not found
Warning FailedScheduling 33m (x3 over 33m) default-scheduler pod has unbound immediate PersistentVolumeClaims
Normal Scheduled 33m default-scheduler Successfully assigned default/jhipsterpress-postgresql-5956df9557-fg8cn to gke-standard-cluster-1-default-pool-bf9f446d-9hsq
Normal SuccessfulAttachVolume 33m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-95ba1737-5860-11e9-ae59-42010a8000a8"
Normal Pulling 33m kubelet, gke-standard-cluster-1-default-pool-bf9f446d-9hsq pulling image "postgres:10.4"
Normal Pulled 32m kubelet, gke-standard-cluster-1-default-pool-bf9f446d-9hsq Successfully pulled image "postgres:10.4"
Normal Created 32m kubelet, gke-standard-cluster-1-default-pool-bf9f446d-9hsq Created container
Normal Started 32m kubelet, gke-standard-cluster-1-default-pool-bf9f446d-9hsq Started container
With the open firewall: posgresql-jhipster
Ingress
Apply to all
IP ranges: 0.0.0.0/0
tcp:30728
Allow
999
default
Thanks for your help. Any documentation is really appreciated.
Your service is currently a type clusterIP. This does not expose the service or the pods outside the cluster. You can't connect to the pod from the Cloud Shell like this since the Cloud shell is not on your VPC and the pods are not exposed.
Update your service using kubectl edit svc jhipsterpress-postgresql
Change the spec.type field to 'LoadBalancer'
You will then have an external IP that you can connect to