How can I solve this problem? I am trying to run an Akka cluster on minikube, but the cluster fails to form.
17:46:49.093 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
My config is:
akka {
  actor {
    provider = cluster
  }
  cluster {
    shutdown-after-unsuccessful-join-seed-nodes = 60s
  }
  coordinated-shutdown.exit-jvm = on
  management {
    cluster.bootstrap {
      contact-point-discovery {
        discovery-method = kubernetes-api
      }
    }
  }
}
My YAML:
kind: Deployment
metadata:
  labels:
    app: appka
  name: appka
spec:
  replicas: 2
  selector:
    matchLabels:
      app: appka
  template:
    metadata:
      labels:
        app: appka
    spec:
      containers:
        - name: appka
          image: akkacluster:latest
          imagePullPolicy: Never
          readinessProbe:
            httpGet:
              path: /ready
              port: management
            periodSeconds: 10
            failureThreshold: 10
            initialDelaySeconds: 20
          livenessProbe:
            httpGet:
              path: /alive
              port: management
            periodSeconds: 10
            failureThreshold: 10
            initialDelaySeconds: 20
          ports:
            - name: management
              containerPort: 8558
              protocol: TCP
            - name: http
              containerPort: 8080
              protocol: TCP
            - name: remoting
              containerPort: 25520
              protocol: TCP
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: read-pods
subjects:
  - kind: User
    name: system:serviceaccount:default:default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Unfortunately, my cluster is not forming:
kubectl logs pod/appka-7c4b7df7f7-5v7cc
17:46:32.026 [appka-akka.actor.default-dispatcher-3] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
SLF4J: A number (1) of logging calls during the initialization phase have been intercepted and are
SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
SLF4J: See also http://www.slf4j.org/codes.html#replay
17:46:33.644 [appka-akka.actor.default-dispatcher-3] INFO akka.remote.artery.tcp.ArteryTcpTransport - Remoting started with transport [Artery tcp]; listening on address [akka://appka#172.17.0.4:25520] with UID [-8421566647681174079]
17:46:33.811 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Starting up, Akka version [2.6.14] ...
17:46:34.491 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Registered cluster JMX MBean [akka:type=Cluster]
17:46:34.512 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - Started up successfully
17:46:34.883 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - No downing-provider-class configured, manual cluster downing required, see https://doc.akka.io/docs/akka/current/typed/cluster.html#downing
17:46:34.884 [appka-akka.actor.default-dispatcher-3] INFO akka.cluster.Cluster - Cluster Node [akka://appka#172.17.0.4:25520] - No seed nodes found in configuration, relying on Cluster Bootstrap for joining
17:46:39.084 [appka-akka.actor.default-dispatcher-11] INFO akka.management.internal.HealthChecksImpl - Loading readiness checks [(cluster-membership,akka.management.cluster.scaladsl.ClusterMembershipCheck), (sharding,akka.cluster.sharding.ClusterShardingHealthCheck)]
17:46:39.090 [appka-akka.actor.default-dispatcher-11] INFO akka.management.internal.HealthChecksImpl - Loading liveness checks []
17:46:39.104 [appka-akka.actor.default-dispatcher-3] INFO ClusterListenerActor$ - started actor akka://appka/user - (class akka.actor.typed.internal.adapter.ActorRefAdapter)
17:46:39.888 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Binding Akka Management (HTTP) endpoint to: 172.17.0.4:8558
17:46:40.525 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for ClusterHttpManagementRouteProvider
17:46:40.806 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for ClusterBootstrap
17:46:40.821 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Using self contact point address: http://172.17.0.4:8558
17:46:40.914 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Including HTTP management routes for HealthCheckRoutes
17:46:44.198 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Initiating bootstrap procedure using kubernetes-api method...
17:46:44.200 [appka-akka.actor.default-dispatcher-3] INFO akka.management.cluster.bootstrap.ClusterBootstrap - Bootstrap using `akka.discovery` method: kubernetes-api
17:46:44.226 [appka-akka.actor.default-dispatcher-3] INFO akka.management.scaladsl.AkkaManagement - Bound Akka Management (HTTP) endpoint to: 172.17.0.4:8558
17:46:44.487 [appka-akka.actor.default-dispatcher-6] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Locating service members. Using discovery [akka.discovery.kubernetes.KubernetesApiServiceDiscovery], join decider [akka.management.cluster.bootstrap.LowestAddressJoinDecider], scheme [http]
17:46:44.490 [appka-akka.actor.default-dispatcher-6] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:44.493 [appka-akka.actor.default-dispatcher-6] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:45.626 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:45.627 [appka-akka.actor.default-dispatcher-12] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:48.428 [appka-akka.actor.default-dispatcher-13] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:48.485 [appka-akka.actor.default-dispatcher-22] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:48.586 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Discovered [2] contact points, confirmed [0], which is less than the required [2], retrying
17:46:49.092 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-4.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-4.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:49.093 [appka-akka.actor.default-dispatcher-12] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:49.603 [appka-akka.actor.default-dispatcher-22] INFO akka.management.cluster.bootstrap.LowestAddressJoinDecider - Discovered [2] contact points, confirmed [0], which is less than the required [2], retrying
17:46:49.682 [appka-akka.actor.default-dispatcher-21] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Looking up [Lookup(appka,None,Some(tcp))]
17:46:49.683 [appka-akka.actor.default-dispatcher-21] INFO akka.discovery.kubernetes.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=appka]. Namespace: [default]. Port: [None]
17:46:49.726 [appka-akka.actor.default-dispatcher-12] INFO akka.management.cluster.bootstrap.internal.BootstrapCoordinator - Located service members based on: [Lookup(appka,None,Some(tcp))]: [ResolvedTarget(172-17-0-4.default.pod.cluster.local,None,Some(/172.17.0.4)), ResolvedTarget(172-17-0-3.default.pod.cluster.local,None,Some(/172.17.0.3))], filtered to [172-17-0-4.default.pod.cluster.local:0, 172-17-0-3.default.pod.cluster.local:0]
17:46:50.349 [appka-akka.actor.default-dispatcher-21] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-3.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-3.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
17:46:50.504 [appka-akka.actor.default-dispatcher-11] WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://172-17-0-4.default.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Tcp command [Connect(172-17-0-4.default.pod.cluster.local:8558,None,List(),Some(10 seconds),true)] failed because of java.net.ConnectException: Connection refused
You are missing the akka.remote settings block. Something like:
akka {
  actor {
    # provider=remote is possible, but prefer cluster
    provider = cluster
  }
  remote {
    artery {
      transport = tcp # See Selecting a transport below
      canonical.hostname = "127.0.0.1"
      canonical.port = 25520
    }
  }
}
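As a hedged side note (an assumption on my part, not something from the question): on Kubernetes the canonical hostname usually has to be an address the other pods can reach, so a hard-coded 127.0.0.1 is typically replaced by the pod IP, for example injected through the Downward API and picked up via an optional substitution. The POD_IP variable name below is just an illustrative choice:

# In the container spec of the Deployment (Downward API injection):
#   env:
#     - name: POD_IP
#       valueFrom:
#         fieldRef:
#           fieldPath: status.podIP
akka.remote.artery.canonical.hostname = ${?POD_IP}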
I created an image classifier model built on ResNet-50 to identify dog breeds. I created it in SageMaker Studio. Tuning and training are done and I deployed it, but when I try to predict with it, it fails. I believe this is related to the worker PID, because that is the first warning I see.
The CloudWatch log output below says the worker PID is not available yet, and soon after, the worker dies.
timestamp,message,logStreamName
1648240674535,"2022-03-25 20:37:54,107 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...",AllTraffic/i-055c5d00e53e84b93
1648240674535,"2022-03-25 20:37:54,188 [INFO ] main org.pytorch.serve.ModelServer - ",AllTraffic/i-055c5d00e53e84b93
1648240674535,Torchserve version: 0.4.0,AllTraffic/i-055c5d00e53e84b93
1648240674535,TS Home: /opt/conda/lib/python3.6/site-packages,AllTraffic/i-055c5d00e53e84b93
1648240674535,Current directory: /,AllTraffic/i-055c5d00e53e84b93
1648240674535,Temp directory: /home/model-server/tmp,AllTraffic/i-055c5d00e53e84b93
1648240674535,Number of GPUs: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Number of CPUs: 1,AllTraffic/i-055c5d00e53e84b93
1648240674535,Max heap size: 6838 M,AllTraffic/i-055c5d00e53e84b93
1648240674535,Python executable: /opt/conda/bin/python3.6,AllTraffic/i-055c5d00e53e84b93
1648240674535,Config file: /etc/sagemaker-ts.properties,AllTraffic/i-055c5d00e53e84b93
1648240674535,Inference address: http://0.0.0.0:8080,AllTraffic/i-055c5d00e53e84b93
1648240674535,Management address: http://0.0.0.0:8080,AllTraffic/i-055c5d00e53e84b93
1648240674535,Metrics address: http://127.0.0.1:8082,AllTraffic/i-055c5d00e53e84b93
1648240674535,Model Store: /.sagemaker/ts/models,AllTraffic/i-055c5d00e53e84b93
1648240674535,Initial Models: model.mar,AllTraffic/i-055c5d00e53e84b93
1648240674535,Log dir: /logs,AllTraffic/i-055c5d00e53e84b93
1648240674535,Metrics dir: /logs,AllTraffic/i-055c5d00e53e84b93
1648240674535,Netty threads: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Netty client threads: 0,AllTraffic/i-055c5d00e53e84b93
1648240674535,Default workers per model: 1,AllTraffic/i-055c5d00e53e84b93
1648240674535,Blacklist Regex: N/A,AllTraffic/i-055c5d00e53e84b93
1648240674535,Maximum Response Size: 6553500,AllTraffic/i-055c5d00e53e84b93
1648240674536,Maximum Request Size: 6553500,AllTraffic/i-055c5d00e53e84b93
1648240674536,Prefer direct buffer: false,AllTraffic/i-055c5d00e53e84b93
1648240674536,Allowed Urls: [file://.*|http(s)?://.*],AllTraffic/i-055c5d00e53e84b93
1648240674536,Custom python dependency for model allowed: false,AllTraffic/i-055c5d00e53e84b93
1648240674536,Metrics report format: prometheus,AllTraffic/i-055c5d00e53e84b93
1648240674536,Enable metrics API: true,AllTraffic/i-055c5d00e53e84b93
1648240674536,Workflow Store: /.sagemaker/ts/models,AllTraffic/i-055c5d00e53e84b93
1648240674536,"2022-03-25 20:37:54,195 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...",AllTraffic/i-055c5d00e53e84b93
1648240675536,"2022-03-25 20:37:54,217 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar",AllTraffic/i-055c5d00e53e84b93
1648240675536,"2022-03-25 20:37:55,505 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,515 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.",AllTraffic/i-055c5d00e53e84b93
1648240675786,"2022-03-25 20:37:55,569 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082",AllTraffic/i-055c5d00e53e84b93
1648240675786,Model server started.,AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,727 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,812 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,813 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:38.02598190307617|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,813 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:12.715518951416016|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,814 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:25.1|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,815 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:29583.98046875|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,815 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:1355.765625|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,816 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:5.7|#Level:Host|#hostname:container-0.local,timestamp:1648240675",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]48",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started.",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,994 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.6.13",AllTraffic/i-055c5d00e53e84b93
1648240676036,"2022-03-25 20:37:55,999 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,006 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last):",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 182, in <module>",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server()",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 154, in run_server",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,111 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 116, in handle_connection",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 89, in load_model",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_loader.py"", line 110, in load",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,112 [INFO ] W-9000-model_1-stdout MODEL_LOG - initialize_fn(service.context)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/home/model-server/tmp/models/23b30361031647d08792d32672910688/handler_service.py"", line 51, in initialize",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - super().initialize(context)",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py"", line 66, in initialize",AllTraffic/i-055c5d00e53e84b93
1648240676286,"2022-03-25 20:37:56,113 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,114 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,416 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240676536,"2022-03-25 20:37:56,461 [INFO ] W-9000-model_1 ACCESS_LOG - /169.254.178.2:39848 ""GET /ping HTTP/1.1"" 200 9",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:56,461 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-0.local,timestamp:null",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,567 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]86",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.6.13",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,568 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,569 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last):",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,642 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 182, in <module>",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server()",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 154, in run_server",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket)",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 116, in handle_connection",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg)",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout MODEL_LOG - File ""/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py"", line 89, in load_model",AllTraffic/i-055c5d00e53e84b93
1648240677787,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout",AllTraffic/i-055c5d00e53e84b93
1648240678037,"2022-03-25 20:37:57,643 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:57,991 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:59,096 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000",AllTraffic/i-055c5d00e53e84b93
1648240679288,"2022-03-25 20:37:59,097 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]114",AllTraffic/i-055c5d00e53e84b93
Model tuning and training came out alright, so I'm not sure why it won't predict if that part is fine. Someone mentioned to me that it might be due to the entry point script, but I don't know what would cause it to fail at prediction after deployment if it can predict fine during training.
Entry point script:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import json
import copy
import argparse
import os
import logging
import sys
from tqdm import tqdm
from PIL import ImageFile
import smdebug.pytorch as smd

ImageFile.LOAD_TRUNCATED_IMAGES = True

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def test(model, test_loader, criterion, hook):
    model.eval()
    running_loss = 0
    running_corrects = 0
    hook.set_mode(smd.modes.EVAL)
    for inputs, labels in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
    ##total_loss = running_loss // len(test_loader)
    ##total_acc = running_corrects.double() // len(test_loader)
    ##logger.info(f"Testing Loss: {total_loss}")
    ##logger.info(f"Testing Accuracy: {total_acc}")
    logger.info("New test acc")
    logger.info(f'Test set: Accuracy: {running_corrects}/{len(test_loader.dataset)} = {100*(running_corrects/len(test_loader.dataset))}%)')

def train(model, train_loader, validation_loader, criterion, optimizer, hook):
    epochs = 50
    best_loss = 1e6
    image_dataset = {'train': train_loader, 'valid': validation_loader}
    loss_counter = 0
    hook.set_mode(smd.modes.TRAIN)
    for epoch in range(epochs):
        logger.info(f"Epoch: {epoch}")
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()
                logger.info("Model Trained")
            else:
                model.eval()
            running_loss = 0.0
            running_corrects = 0
            for inputs, labels in image_dataset[phase]:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                if phase == 'train':
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    logger.info("Model Optimized")
                _, preds = torch.max(outputs, 1)
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            epoch_loss = running_loss // len(image_dataset[phase])
            epoch_acc = running_corrects // len(image_dataset[phase])
            if phase == 'valid':
                logger.info("Model Validating")
                if epoch_loss < best_loss:
                    best_loss = epoch_loss
                else:
                    loss_counter += 1
                logger.info(loss_counter)
            '''logger.info('{} loss: {:.4f}, acc: {:.4f}, best loss: {:.4f}'.format(phase,
                                                                                    epoch_loss,
                                                                                    epoch_acc,
                                                                                    best_loss))'''
            if phase == "train":
                logger.info("New epoch acc for Train:")
                logger.info(f"Epoch {epoch}: Loss {loss_counter/len(train_loader.dataset)}, Accuracy {100*(running_corrects/len(train_loader.dataset))}%")
            if phase == "valid":
                logger.info("New epoch acc for Valid:")
                logger.info(f"Epoch {epoch}: Loss {loss_counter/len(train_loader.dataset)}, Accuracy {100*(running_corrects/len(train_loader.dataset))}%")
        ##if loss_counter==1:
        ##    break
        ##if epoch==0:
        ##    break
    return model

def net():
    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Sequential(
        nn.Linear(2048, 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, 133))
    return model

def create_data_loaders(data, batch_size):
    train_data_path = os.path.join(data, 'train')
    test_data_path = os.path.join(data, 'test')
    validation_data_path = os.path.join(data, 'valid')
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    test_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_data = torchvision.datasets.ImageFolder(root=train_data_path, transform=train_transform)
    train_data_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_data = torchvision.datasets.ImageFolder(root=test_data_path, transform=test_transform)
    test_data_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True)
    validation_data = torchvision.datasets.ImageFolder(root=validation_data_path, transform=test_transform)
    validation_data_loader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size, shuffle=True)
    return train_data_loader, test_data_loader, validation_data_loader

def main(args):
    logger.info(f'Hyperparameters are LR: {args.lr}, Batch Size: {args.batch_size}')
    logger.info(f'Data Paths: {args.data}')
    train_loader, test_loader, validation_loader = create_data_loaders(args.data, args.batch_size)
    model = net()
    hook = smd.Hook.create_from_json_file()
    hook.register_hook(model)
    criterion = nn.CrossEntropyLoss(ignore_index=133)
    optimizer = optim.Adam(model.fc.parameters(), lr=args.lr)
    logger.info("Starting Model Training")
    model = train(model, train_loader, validation_loader, criterion, optimizer, hook)
    logger.info("Testing Model")
    test(model, test_loader, criterion, hook)
    logger.info("Saving Model")
    torch.save(model.cpu().state_dict(), os.path.join(args.model_dir, "model.pth"))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    '''
    TODO: Specify any training args that you might need
    '''
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        metavar="N",
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=1000,
        metavar="N",
        help="input batch size for testing (default: 1000)",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=5,
        metavar="N",
        help="number of epochs to train (default: 10)",
    )
    parser.add_argument(
        "--lr", type=float, default=0.01, metavar="LR", help="learning rate (default: 0.01)"
    )
    parser.add_argument(
        "--momentum", type=float, default=0.5, metavar="M", help="SGD momentum (default: 0.5)"
    )
    # Container environment
    parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
    parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--data", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

    args = parser.parse_args()
    main(args)
To test the model on the endpoint, I sent over an image using the following code:
from sagemaker.serializers import IdentitySerializer
import base64

predictor.serializer = IdentitySerializer("image/png")
with open("Akita_00282.jpg", "rb") as f:
    payload = f.read()

response = predictor.predict(payload)
The model serving workers are dying either because they cannot load your model or because they cannot deserialize the payload you are sending to them.
Note that you have to provide a model_fn implementation. Please read these docs or this blog to learn more about how to adapt inference scripts for SageMaker deployment. If you do not want to override the input_fn, predict_fn, and/or output_fn handlers, you can find their default implementations, for example, here.
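For illustration only, a minimal sketch of such an inference entry point is below. It is an assumption based on the training script in the question (not the poster's actual deployment code): it presumes the weights were saved as a state_dict named model.pth for the same net() architecture, which is what the training script does.

# inference.py -- minimal model_fn sketch for the SageMaker PyTorch serving container
import os

import torch
import torch.nn as nn
import torchvision.models as models


def model_fn(model_dir):
    # Rebuild the same architecture used in training ...
    model = models.resnet50(pretrained=False)
    model.fc = nn.Sequential(
        nn.Linear(2048, 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, 133))
    # ... and load the weights the training script saved as model.pth.
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location="cpu"))
    model.eval()
    return model

If you keep sending raw image bytes with IdentitySerializer("image/png"), you would typically also need an input_fn that decodes the image payload, because the default input_fn handles formats such as JSON, CSV, and NumPy rather than raw images.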
I have an EKS cluster with an EBS storage class/volume.
I am able to deploy the HDFS namenode and datanode images (bde2020/hadoop-xxx) successfully using StatefulSets.
When I try to put a file to HDFS from my machine using hdfs://:, it reports success, but the file does not get written to the datanodes.
In the namenode log, I see the error below.
Could it be something to do with the EBS volume? I cannot even upload/download files from the namenode GUI. Could it be because the datanode hostname hdfs-data-X.hdfs-data.pulse.svc.cluster.local is not resolvable from my local machine?
Please help.
2020-05-12 17:38:51,360 INFO hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=10.8.29.112:9866, 10.8.29.176:9866, 10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:13,036 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:13,036 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:13,036 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:13,036 INFO hdfs.StateChange: BLOCK* allocate blk_1073741826_1002, replicas=10.8.29.176:9866, 10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:34,607 INFO namenode.FSEditLog: Number of transactions: 11 Total time for transactions(ms): 23 Number of transactions batched in Syncs: 3 Number of syncs: 8 SyncTimes(ms): 23
2020-05-12 17:39:35,146 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:35,146 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 2 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:35,146 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:35,147 INFO hdfs.StateChange: BLOCK* allocate blk_1073741827_1003, replicas=10.8.29.188:9866 for /vault/a.json
2020-05-12 17:39:57,319 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2020-05-12 17:39:57,319 WARN protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 3 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2020-05-12 17:39:57,319 WARN blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2020-05-12 17:39:57,320 INFO ipc.Server: IPC Server handler 5 on default port 8020, call Call#12 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.254.40.95:59328
java.io.IOException: File /vault/a.json could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2219)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2789)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:892)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)
My namenode web page shows the following:
Node                                                 Http Address                                                Last contact  Last Block Report  Capacity  Blocks  Block pool used  Version
hdfs-data-0.hdfs-data.pulse.svc.cluster.local:9866   http://hdfs-data-0.hdfs-data.pulse.svc.cluster.local:9864   1s            0m                 975.9 MB  0       24 KB (0%)       3.2.1
hdfs-data-1.hdfs-data.pulse.svc.cluster.local:9866   http://hdfs-data-1.hdfs-data.pulse.svc.cluster.local:9864   2s            0m                 975.9 MB  0       24 KB (0%)       3.2.1
hdfs-data-2.hdfs-data.pulse.svc.cluster.local:9866   http://hdfs-data-2.hdfs-data.pulse.svc.cluster.local:9864   1s            0m                 975.9 MB  0       24 KB (0%)       3.2.1
My deployment:
NameNode:
#clusterIP service of namenode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    component: hdfs-name
spec:
  ports:
    - port: 8020
      protocol: TCP
      name: nn-rpc
    - port: 9870
      protocol: TCP
      name: nn-web
  selector:
    component: hdfs-name
  type: ClusterIP
---
#namenode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    component: hdfs-name
spec:
  serviceName: hdfs-name
  replicas: 1
  selector:
    matchLabels:
      component: hdfs-name
  template:
    metadata:
      labels:
        component: hdfs-name
    spec:
      initContainers:
        - name: delete-lost-found
          image: busybox
          command: ["sh", "-c", "rm -rf /hadoop/dfs/name/lost+found"]
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
      containers:
        - name: hdfs-name
          image: bde2020/hadoop-namenode
          env:
            - name: CLUSTER_NAME
              value: hdfs-k8s
            - name: HDFS_CONF_dfs_permissions_enabled
              value: "false"
          ports:
            - containerPort: 8020
              name: nn-rpc
            - containerPort: 9870
              name: nn-web
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
              #subPath: data #subPath required as on root level, lost+found folder is created which does not cause to run namenode --format
  volumeClaimTemplates:
    - metadata:
        name: hdfs-name-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi
Datanode:
#headless service of datanode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    component: hdfs-data
spec:
  ports:
    - port: 80
      protocol: TCP
  selector:
    component: hdfs-data
  clusterIP: None
  type: ClusterIP
---
#datanode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    component: hdfs-data
spec:
  serviceName: hdfs-data
  replicas: 3
  selector:
    matchLabels:
      component: hdfs-data
  template:
    metadata:
      labels:
        component: hdfs-data
    spec:
      containers:
        - name: hdfs-data
          image: bde2020/hadoop-datanode
          env:
            - name: CORE_CONF_fs_defaultFS
              value: hdfs://hdfs-name:8020
          volumeMounts:
            - name: hdfs-data-pv-claim
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:
    - metadata:
        name: hdfs-data-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi
It turned out to be an issue with the datanodes not being reachable over the RPC port from my client machine.
The datanodes' HTTP port was reachable from my client machine, so after adding mappings of the datanode pod names to their IPs in my hosts file and using webhdfs:// (instead of hdfs://), it worked.
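For illustration, the workaround described above would look roughly like the sketch below. The datanode IPs are taken from the namenode log earlier and would have to match the current pod IPs, and <namenode-host> stands in for the elided host in the original hdfs:// URL:

# /etc/hosts on the client machine (datanode pod name -> pod IP)
10.8.29.112  hdfs-data-0.hdfs-data.pulse.svc.cluster.local
10.8.29.176  hdfs-data-1.hdfs-data.pulse.svc.cluster.local
10.8.29.188  hdfs-data-2.hdfs-data.pulse.svc.cluster.local

# write via WebHDFS (HTTP, namenode web port 9870) instead of the native RPC protocol
hdfs dfs -put a.json webhdfs://<namenode-host>:9870/vault/a.json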
I have written a sample MVC application using the Spring framework and deployed it to Bluemix.
When I open the deployed URL, I receive the following error:
The application or context root for this request has not been found
What am I doing wrong? Does anything need to be changed in web.xml?
Log messages:
[AUDIT ] CWWKE0001I: The server defaultServer has been launched.
[AUDIT ] CWWKG0028A: Processing included configuration resource:
/home/vcap/app/wlp/usr/servers/defaultServer/runtime-vars.xml
[INFO ] CWWKE0002I: The kernel started after 10.005 seconds
[INFO ] CWWKF0007I: Feature update started.
[INFO ] CWWKO0219I: TCP Channel httpEndpoint-179 has been started
and is now listening for requests on host * (IPv6) port 61031.
[INFO ] CWWKO0219I: TCP Channel defaultHttpEndpoint has been
started and is now listening for requests on host localhost (IPv4:
127.0.0.1) port 9080.
[INFO ] CWSCX0122I: Register management Bean provider:
com.ibm.ws.cloudoe.management.client.provider.dump.JavaDumpBeanProvider#c68ae63e.
[INFO ] CWSCX0122I: Register management Bean provider:
com.ibm.ws.cloudoe.management.client.provider.logging.LibertyLoggingBeanProvider#f0d6d754.
[INFO ] SRVE0169I: Loading Web Module:
com.ibm.ws.cloudoe.management.client.liberty.connector.
[INFO ] SRVE0250I: Web Module
com.ibm.ws.cloudoe.management.client.liberty.connector has been bound
to default_host.
[AUDIT ] CWWKT0016I: Web application available (default_host):
http://localhost:9080/IBMMGMTRest/
[INFO ] CWWKZ0018I: Starting application myapp.
[INFO ] SRVE0169I: Loading Web Module: TaxBillReminder.
[INFO ] SRVE0250I: Web Module TaxBillReminder has been bound to
default_host.
[AUDIT ] CWWKT0016I: Web application available (default_host):
http://localhost:9080/
[AUDIT ] CWWKZ0001I: Application myapp started in 2.113 seconds.
[AUDIT ] CWWKF0012I: The server installed the following features:
[json-1.0, jpa-2.0, icap:managementConnector-1.0, beanValidation-1.0,
jdbc-4.0, managedBeans-1.0, jsf-2.0, jsp-2.2, servlet-3.0, jaxrs-1.1,
jndi-1.0, appState-1.0, ejbLite-3.1, cdi-1.0].
[INFO ] CWWKF0008I: Feature update completed in 9.472 seconds.
[AUDIT ] CWWKF0011I: The server defaultServer is ready to run a
smarter planet.
[INFO ] SESN8501I: The session manager did not find a persistent
storage location; HttpSession objects will be stored in the local
application server's memory.
[INFO ] SESN0176I: A new session context will be created for
application key default_host/
[INFO ] SESN0172I: The session manager is using the Java default
SecureRandom implementation for session ID generation.
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[INFO ] FFDC1015I: An FFDC Incident has been created:
"java.util.ServiceConfigurationError:
javax.servlet.ServletContainerInitializer: Provider
org.cloudfoundry.reconfiguration.spring.AutoReconfigurationServletContainerInitializer
could not be instantiated
com.ibm.ws.webcontainer.osgi.DynamicVirtualHost startWebApp" at
ffdc_15.05.22_06.28.59.0.log TaxBillReminder.mybluemix.net -
[22/05/2015:06:28:58 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.42:31418 x_forwarded_for:"-" vcap_request_id:430a380b-a68e-4123-6ff8-c87348c535a3
response_time:0.813611619 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:00 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.46:42514 x_forwarded_for:"-" vcap_request_id:c54dff7f-908f-4cc1-49d9-de6d8bd04fe7
response_time:0.127545436 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:01 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.43:29980 x_forwarded_for:"-" vcap_request_id:23bc66ac-c78e-42ab-5a07-60f99ffc492b
response_time:0.117255613 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". [WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:03 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.43:23392 x_forwarded_for:"-" vcap_request_id:c255a3fb-5eb1-44f5-4c08-b22222a4c8b7
response_time:0.111495485 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:04 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.46:41130
x_forwarded_for:"-"
vcap_request_id:0c009c84-f0c0-46e9-7b6d-da8e3ff91a55
response_time:0.115888617 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. [INFO ] SESN0175I: An existing session context
will be used for application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:05 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.46:52243 x_forwarded_for:"-" vcap_request_id:c4c29b52-ff3a-48b6-47e4-7e1fce0c3f74
response_time:0.187145593 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:06 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.42:11225
x_forwarded_for:"-"
vcap_request_id:54e0e021-826e-443b-6a7a-5f6bbc28a926
response_time:0.132534560 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp.
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:08 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.43:32255
x_forwarded_for:"-"
vcap_request_id:0ac50be0-e2e9-436c-4e97-d854f78e1f49
response_time:0.089186493 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp.
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:09 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.46:39103
x_forwarded_for:"-"
vcap_request_id:ddc4754a-cf0f-494c-78de-26fcd61ba1af
response_time:0.102293236 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp.
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:10 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.42:30749
x_forwarded_for:"-"
vcap_request_id:fa6ba947-4b8c-474b-4b48-ace26fc3274e
response_time:0.091226461 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp.
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:11 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.46:46353
x_forwarded_for:"-"
vcap_request_id:dfc99308-11c0-4ea7-48ca-b4061b3b4c6f
response_time:0.096913693 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp.
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:12 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.46:57429 x_forwarded_for:"-" vcap_request_id:4f7e9876-cf5d-46c2-6cb1-19f00329e029
response_time:0.100562784 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:13 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.43:52701 x_forwarded_for:"-" vcap_request_id:fd13c364-d65a-4ca6-66b1-9bc49c1ea427
response_time:0.098537113 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:15 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.42:10951 x_forwarded_for:"-" vcap_request_id:883eb6fc-cdb4-45c6-41f6-cc65970ef256
response_time:0.095498510 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:16 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.42:30830 x_forwarded_for:"-" vcap_request_id:fc251ebf-da3a-48ae-4312-5218bd83808b
response_time:0.134904531 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15". TaxBillReminder.mybluemix.net - [22/05/2015:06:29:17 +0000] "GET
/ HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.42:54827
x_forwarded_for:"-"
vcap_request_id:e09e1926-860b-481e-4b48-ed5a66330580
response_time:0.084558083 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. [INFO ] SESN0175I: An existing session context
will be used for application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:18 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.42:31009 x_forwarded_for:"-" vcap_request_id:a9c3a69f-ae27-4c72-7422-608fe01451fd
response_time:0.092770319 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:19 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.46:55458 x_forwarded_for:"-" vcap_request_id:20ebe389-2371-455a-5832-71c85f48c46d
response_time:0.083255059 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:21 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.46:44171 x_forwarded_for:"-" vcap_request_id:14081f78-3959-462f-5602-dd474718094c
response_time:0.104446356 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. TaxBillReminder.mybluemix.net -
[22/05/2015:06:29:22 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0"
75.126.70.43:21091 x_forwarded_for:"-" vcap_request_id:930a620b-e6a2-4bdb-6b72-36c072eea29b
response_time:0.100104583 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. taxbillreminder.mybluemix.net -
[22/05/2015:06:29:23 +0000] "GET / HTTP/1.1" 404 217 "-" "Mozilla/5.0
(compatible; MSIE 10.0; Windows NT 6.1; Win64; x64; Trident/6.0)"
75.126.70.43:45588 x_forwarded_for:"-" vcap_request_id:cd805473-5b36-423c-441f-4a013e0c91c3
response_time:0.092833842 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
[INFO ] SESN0175I: An existing session context will be used for
application key default_host/
[INFO ] JSPG8502I: The value of the JSP attribute jdkSourceLevel is
"15".
[WARNING ] SRVE0274W: Error while adding servlet mapping for
path-->/forms/, wrapper-->ServletWrapper[dispatcher:[/forms/]],
application-->myapp. taxbillreminder.mybluemix.net -
[22/05/2015:06:30:31 +0000] "GET / HTTP/1.1" 404 217 "-" "Mozilla/5.0
(Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"
75.126.70.43:54400 x_forwarded_for:"-" vcap_request_id:7ca062d7-13ff-4ae2-5441-265d3c2194b5
response_time:0.424214609 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
You need to look at the context-root or contextRoot that might be defined in your server.xml or web.xml. If no context-root or contextRoot is defined, then the name of the Liberty application is used; see here for the rules. The route to your app running on Liberty will normally be something like this:
http://your_bluemix_app.mybluemix.net/the_liberty_app_name
The deployed URL that Bluemix reports is the base URL for the application, which in this case is a Liberty server, so you need to append your context-root (or Liberty app name) to it.
You can imagine pushing two or more Liberty apps packaged in one Liberty server to Bluemix. In that case you have one Bluemix app with two web applications running within it, which can be accessed like this:
http://your_bluemix_app.mybluemix.net/the_liberty_app_name_1
http://your_bluemix_app.mybluemix.net/the_liberty_app_name_2
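For example, a context root can be set explicitly in the Liberty server.xml; this is only a sketch, and the application name and war file name below are placeholders, not taken from your deployment:

<server>
    <!-- contextRoot overrides the default, which is the application name -->
    <webApplication id="myapp" name="myapp" location="myapp.war" contextRoot="/myapp" />
</server>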
I had a similar issue. The solution described at http://developer.ibm.com/answers/answers/185697/view.html worked for me.
Looks like the application failed to initialize because of the following:
[INFO ] FFDC1015I: An FFDC Incident has been created: "java.util.ServiceConfigurationError: javax.servlet.ServletContainerInitializer: Provider org.cloudfoundry.reconfiguration.spring.AutoReconfigurationServletContainerInitializer could not be instantiated com.ibm.ws.webcontainer.osgi.DynamicVirtualHost startWebApp" at ffdc_15.05.22_06.28.59.0.log TaxBillReminder.mybluemix.net - [22/05/2015:06:28:58 +0000] "GET / HTTP/1.1" 404 217 "-" "Java/1.8.0" 75.126.70.42:31418 x_forwarded_for:"-" vcap_request_id:430a380b-a68e-4123-6ff8-c87348c535a3 response_time:0.813611619 app_id:70683a0f-06f4-4ad9-93b7-b37dc8241211
Your application is a Spring application, and the auto-reconfiguration is causing problems.
With the latest Liberty buildpack, you can set the JBP_CONFIG_SPRINGAUTORECONFIGURATION environment variable to '[enabled: false]' to disable Spring auto-reconfiguration. I think in your case the Spring auto-reconfiguration is the cause of this problem. Using the cf client, execute the following and then restage your application:
$ cf set-env myApplication JBP_CONFIG_SPRINGAUTORECONFIGURATION '[enabled: false]'
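Then restage so the new setting takes effect:
$ cf restage myApplication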