I followed the Run TensorFlow on TPU Pod slices instructions to a T to create a Cloud TPU VM and run a custom neural network. Note that I have been able to initialize Cloud TPUs when running this model on Google Colab, but the model needs more resources than Colab can provide, even with explicit memory management in the code.
When I create the VM, I use the following command:
gcloud compute tpus tpu-vm create test-tpu-vm --zone=us-central1-b --accelerator-type=v2-8 --version=tpu-vm-tf-2.11.0
Next I log into the instance like so:
gcloud compute tpus tpu-vm ssh test-tpu-vm --zone us-central1-b --project <project_id>
where I clone down the code repo as follows:
git clone https://github.com/messerb5467/kaggle-competitions.git
and need to install the pandas library as required by the script:
pip install pandas
After doing all of this, I run the script and get the following issue:
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$ ./allstate_claims_data_nn.py test-tpu-vm
2022-12-29 23:06:08.675448: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 23:06:08.855095: I tensorflow/core/tpu/tpu_initializer_helper.cc:275] Libtpu path is: libtpu.so
D1229 23:06:09.026751179 12938 config.cc:113] gRPC EXPERIMENT tcp_frame_size_tuning OFF (default:OFF)
D1229 23:06:09.026775139 12938 config.cc:113] gRPC EXPERIMENT tcp_read_chunks OFF (default:OFF)
D1229 23:06:09.026782994 12938 config.cc:113] gRPC EXPERIMENT tcp_rcv_lowat OFF (default:OFF)
D1229 23:06:09.026790374 12938 config.cc:113] gRPC EXPERIMENT peer_state_based_framing OFF (default:OFF)
D1229 23:06:09.026797554 12938 config.cc:113] gRPC EXPERIMENT flow_control_fixes OFF (default:OFF)
D1229 23:06:09.026804675 12938 config.cc:113] gRPC EXPERIMENT memory_pressure_controller OFF (default:OFF)
D1229 23:06:09.026812324 12938 config.cc:113] gRPC EXPERIMENT periodic_resource_quota_reclamation ON (default:ON)
D1229 23:06:09.026819471 12938 config.cc:113] gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1229 23:06:09.026826517 12938 config.cc:113] gRPC EXPERIMENT new_hpack_huffman_decoder OFF (default:OFF)
D1229 23:06:09.026833747 12938 config.cc:113] gRPC EXPERIMENT event_engine_client OFF (default:OFF)
D1229 23:06:09.026840808 12938 config.cc:113] gRPC EXPERIMENT monitoring_experiment ON (default:ON)
D1229 23:06:09.026847921 12938 config.cc:113] gRPC EXPERIMENT promise_based_client_call OFF (default:OFF)
I1229 23:06:09.027065091 12938 ev_epoll1_linux.cc:121] grpc epoll fd: 7
D1229 23:06:09.027080773 12938 ev_posix.cc:141] Using polling engine: epoll1
D1229 23:06:09.027107304 12938 dns_resolver_ares.cc:824] Using ares dns resolver
D1229 23:06:09.027394086 12938 lb_policy_registry.cc:45] registering LB policy factory for "priority_experimental"
D1229 23:06:09.027405540 12938 lb_policy_registry.cc:45] registering LB policy factory for "outlier_detection_experimental"
D1229 23:06:09.027414241 12938 lb_policy_registry.cc:45] registering LB policy factory for "weighted_target_experimental"
D1229 23:06:09.027422457 12938 lb_policy_registry.cc:45] registering LB policy factory for "pick_first"
D1229 23:06:09.027430746 12938 lb_policy_registry.cc:45] registering LB policy factory for "round_robin"
D1229 23:06:09.027444142 12938 lb_policy_registry.cc:45] registering LB policy factory for "ring_hash_experimental"
D1229 23:06:09.027470472 12938 lb_policy_registry.cc:45] registering LB policy factory for "grpclb"
D1229 23:06:09.027508895 12938 lb_policy_registry.cc:45] registering LB policy factory for "rls_experimental"
D1229 23:06:09.027531154 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_manager_experimental"
D1229 23:06:09.027539743 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_impl_experimental"
D1229 23:06:09.027548580 12938 lb_policy_registry.cc:45] registering LB policy factory for "cds_experimental"
D1229 23:06:09.027556928 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_resolver_experimental"
D1229 23:06:09.027565714 12938 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1229 23:06:09.051036639 12938 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
2022-12-29 23:06:09.090117: I tensorflow/core/tpu/tpu_initializer_helper.cc:225] GetTpuPjrtApi not found
Traceback (most recent call last):
File "./allstate_claims_data_nn.py", line 145, in <module>
main()
File "./allstate_claims_data_nn.py", line 141, in main
allstate_nn_model = AllStateModelTrainer(args.tpu_name)
File "./allstate_claims_data_nn.py", line 34, in __init__
os.environ['TPU_LOAD_LIBRARY'] = 0
File "/usr/lib/python3.8/os.py", line 680, in __setitem__
value = self.encodevalue(value)
File "/usr/lib/python3.8/os.py", line 750, in encode
raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
D1229 23:06:12.209423135 12938 init.cc:190] grpc_shutdown starts clean-up now
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$
Even if I follow the message and make the 0 a string, it continues on to produce a core dump instead of running as one would expect. Any help would be mighty appreciated.
I've tried using a string for TPU_LOAD_LIBRARY instead of the documented integer:
export TPU_LOAD_LIBRARY=0
I've also tried using a TPU_NAME of local instead of test-tpu-vm, since TPU VMs run directly on a TPU host.
Unfortunately, when following this, the errors start spinning out of control and I'm not able to register with the TPU nodes at all, despite initialization working just fine in Colab.
I imagine it has to be something simple and I'm just missing something somewhere.
This code needs a TPU Pod to run, so you would need to follow the instructions given here to create the Pod. Note that you need to use the version tpu-vm-tf-2.11.0-pod, not tpu-vm-tf-2.11.0, when creating the Pod.
For example:
gcloud compute tpus tpu-vm create test-tpu-vm \
--zone=us-central1-a --accelerator-type=v2-32 \
--version=tpu-vm-tf-2.11.0-pod
For line 33 you should pass the Pod name, in your case test-tpu-vm, so the call to the trainer would be ./allstate_claims_data_nn.py test-tpu-vm.
However, line 34 tries to set the environment variable to an integer. That will not work, because os.environ values must be strings when set from inside Python code. If you set it as a string instead, the TPU side throws errors, because it expects this variable to be numeric. So I would recommend skipping that part of the init function, following this instead, and using:
export TPU_NAME=test-tpu-vm
export TPU_LOAD_LIBRARY=0
./allstate_claims_data_nn.py test-tpu-vm
(or modify the code to skip taking tpu name as an argument)
This will get you past the TPU setup errors; there will be further code-logic errors from line 73 onward, which you can continue to work on.
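Putting the pieces together, the end-to-end flow would look roughly like this (a sketch only: the zone, accelerator type, and image come from the answer's example, the repo and script from the question; adjust names to your own project):
# Create a TPU Pod slice (note the -pod image variant), then SSH in.
gcloud compute tpus tpu-vm create test-tpu-vm \
  --zone=us-central1-a --accelerator-type=v2-32 \
  --version=tpu-vm-tf-2.11.0-pod
gcloud compute tpus tpu-vm ssh test-tpu-vm --zone us-central1-a --project <project_id>
# On the VM: fetch the code and install its dependency.
git clone https://github.com/messerb5467/kaggle-competitions.git
cd kaggle-competitions/allstate-insurance-claims
pip install pandas
# Set the TPU-related variables in the shell rather than from inside Python,
# then run the trainer with the Pod name as its argument.
export TPU_NAME=test-tpu-vm
export TPU_LOAD_LIBRARY=0
./allstate_claims_data_nn.py test-tpu-vm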
We're currently experiencing an issue with our GKE cluster where client requests are forwarded to pods that have been assigned IPs that other pods within the cluster previously held. The way we can see this is by using the following query in Logs Explorer:
resource.type="http_load_balancer"
httpRequest.requestMethod="GET"
httpRequest.status=404
Snippet from one of the logs:
httpRequest: {
latency: "0.017669s"
referer: "https://asdf.com/"
remoteIp: "5.57.50.217"
requestMethod: "GET"
requestSize: "34"
requestUrl: "https://[asdf.com]/api/service2/[...]"
responseSize: "13"
serverIp: "10.19.160.16"
status: 404
userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
...where the requestUrl property indicates the incoming URL to the load balancer.
Then I search for the IP 10.19.160.16 to find out which pod the IP is assigned to:
c:\>kubectl get pods -o wide | findstr 10.19.160.16
service1-675bfc4f97-slq6g 1/1 Terminated 0 40h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service2-574d69cf69-c7knp 0/1 Error 0 3d16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service3-6db4c97784-428pq 1/1 Running 0 16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
So based on requestUrl, the request should have been sent to service2. Instead, we see it sent to service3, because service3 has been assigned the IP that service2 used to have; in other words, the cluster still seems to think that service2 is holding on to the IP 10.19.160.16. The effect is that service3 returns status code 404 because it doesn't recognize the endpoint.
This behavior only stops if we manually delete the pods in a failed state (e.g. Error or Terminated) using the kubectl delete pod ... command.
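A bulk version of that manual cleanup would look roughly like the following (a sketch only; it assumes the Error/Terminated pods are reported with phase Failed):
# Delete every pod in the Failed phase, across all namespaces.
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces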
We suspect that this behavior started after we upgraded our cluster to v1.23, which required us to migrate from extensions/v1beta1 to networking.k8s.io/v1 as described in https://cloud.google.com/kubernetes-engine/docs/deprecations/apis-1-22.
Our test environment uses preemptible VMs, and while we're not 100% sure (but pretty close), it seems the pods end up in the Error state after a node is preempted.
Why does the cluster still think that a dead pod has the IP it used to have? Why does the problem go away after deleting the failed pods? Shouldn't they have been cleaned up after a node preemption?
Gari Singh provided the answer in the comment.
I have a Hyperledger Fabric network configured on Kubernetes which comes up perfectly when run via Docker Desktop, including invoking with successful endorsement across all three chaincode containers in the network.
On the remote side, I'm using AWS EKS to deploy my nodes, and I have more recently followed this guide on deploying a production-ready peer. I already had EFS set up and in use as a Kubernetes Persistent Volume, and it is populated with all the config each time I spin up a network. This means all the crypto material, connection profiles, etc. are mounted into the relevant containers, and, as per best practice, the references to these TLS certs point into this directory.
This all works as expected... my admin pods can communicate with my peers, the orderers connect, etcetera. I'm able to fully install chaincode, approve it and commit it to all three of my peers successfully.
When it comes to invoking the chaincode, my org1 container always succeeds, and successfully communicates with the peer in its organization.
I'm aware of the core.yaml setting localMspId and this is being overridden by the environment variable CORE_PEER_LOCALMSPID for each set of peers, such that in my org1 peer the value is Org1MSP, in org2 it's Org2MSP, etc.
When running peer chaincode invoke, the first container (org1) succeeds very quickly, while the other two try to contact their peers and hang for the timeout period set in the default gRPC settings (110000 ms). I have also set the env var CORE_PEER_ADDRESS_AUTODETECT: "true" on my peer to ensure it doesn't try to resolve using hostnames like peer0.org1 (this clearly works for org1 but not for the other two).
The environment variables set for TLS in each of the containers corresponds to the contents of the ones I am passing (in correct order) with my invoke command:
peer chaincode invoke --ctor '${CC_INIT_ARGS}' --channelID ${CHANNEL_ID} --name ${CC_NAME} --cafile \$ORDERER_TLS_ROOTCERT_FILE \
--tls true -o orderer.${ORG}:7050 \
--peerAddresses peer0.org1:7051 \
--peerAddresses peer0.org2:7051 \
--peerAddresses peer0.org3:7051 \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org2-cert.pem \
--tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org3-cert.pem >&invoke-log.txt
cat invoke-log.txt
That command is executed inside my container and, as mentioned, I have manually confirmed the certs by inspecting all three containers and cat-ing the contents of the files referenced by the environment variables versus the files at the paths above; they match exactly. That is to say, the contents of /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem match the file that CORE_PEER_TLS_ROOTCERT_FILE points to in org1, and so on per organization.
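For what it's worth, that comparison boils down to something like this inside each peer container (a sketch; the org1 path is taken from above, repeat per org):
# Compare the root CA the peer itself trusts with the one passed to peer chaincode invoke.
diff "$CORE_PEER_TLS_ROOTCERT_FILE" /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem \
  && echo "org1 root CA matches"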
Example org1 chaincode container logs:
2022-02-23T13:47:07.255Z debug [c-api:lib/handler.js] [allorgs-5e707801] Calling chaincode Invoke(), response status: 200
2022-02-23T13:47:07.256Z info [c-api:lib/handler.js] [allorgs-5e707801] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
For the org2 and org3 containers, once the timeout finally expires, they output:
2022-02-23T12:24:05.045Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: No connection established\n at Object.callErrorFromStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/@grpc/grpc-js/build/src/call-stream.js:182:78\n at processTicksAndRejections (internal/process/task_queues.js:79:11)"
2022-02-23T12:24:05.045Z debug [c-api:lib/handler.js] Chat stream ending
I have also enabled DEBUG logs on everything and I'm gleaning nothing useful from it. Any help or suggestions would be greatly appreciated!
The three peers share the same port. Is that even possible?
Also, when running invoke from the command line, I would normally use the following pattern, repeated for each peer.
--peerAddresses localhost:6051 --tlsRootCertFiles <path to peer on port 6051>
--peerAddresses localhost:6052 --tlsRootCertFiles <path to peer on port 6052>
not the three peers followed by the three TLS cert file paths.
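Applied to the command from the question, that interleaved pattern would look roughly like this (a sketch only; the addresses and cert paths are copied from the question and may need adjusting, e.g. if each peer is actually exposed on its own port):
# Each --peerAddresses is immediately followed by the --tlsRootCertFiles for that same peer.
peer chaincode invoke --ctor "${CC_INIT_ARGS}" --channelID ${CHANNEL_ID} --name ${CC_NAME} \
  --tls true -o orderer.${ORG}:7050 --cafile $ORDERER_TLS_ROOTCERT_FILE \
  --peerAddresses peer0.org1:7051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org1-cert.pem \
  --peerAddresses peer0.org2:7051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org2-cert.pem \
  --peerAddresses peer0.org3:7051 --tlsRootCertFiles /etc/hyperledger/fabric-peer/client-root-tlscas/tlsca.org3-cert.pem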
Specs:
The serverless Amazon MSK that's in preview.
t2.xlarge EC2 instance with Amazon Linux 2
Installed Kafka from https://dlcdn.apache.org/kafka/3.0.0/kafka_2.13-3.0.0.tgz
openjdk version "11.0.13" 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
Gradle 7.3.3
The aws-msk-iam-auth library (https://github.com/aws/aws-msk-iam-auth), successfully built.
I also tried adding IAM authentication information, as recommended by the Amazon MSK Library for AWS Identity and Access Management. It says to add the following in config/client.properties:
# Sets up TLS for encryption and SASL for authN.
security.protocol = SASL_SSL
# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM
# Binds SASL client implementation.
# sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;
# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
# Binds SASL client implementation. Uses the specified profile name to look for credentials.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsProfileName="kafka-client";
And kafka-client is the IAM role attached to the EC2 instance as an instance profile.
Networking: I used VPC Reachability Analyzer to confirm that the security groups are configured correctly and the EC2 instance I'm using as a Producer can reach the serverless MSK cluster.
What I'm trying to do: create a topic.
How I'm trying: bin/kafka-topics.sh --create --partitions 1 --replication-factor 1 --topic quickstart-events --bootstrap-server boot-zclcyva3.c2.kafka-serverless.us-east-2.amazonaws.com:9098
Result:
Error while executing topic command : Timed out waiting for a node assignment. Call: createTopics
[2022-01-17 01:46:59,753] ERROR org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics
(kafka.admin.TopicCommand$)
I'm also trying: the plaintext port 9092. (9098 is the IAM-authentication port in MSK, and serverless MSK uses IAM authentication by default.)
None of the other posts I found on SO about this node-assignment error involved MSK. I tried suggestions like uncommenting the listener setting in server.properties, but that didn't change anything.
Installing kcat for troubleshooting didn't work for me either: there's no out-of-the-box package for the yum package manager, which Amazon Linux 2 uses, and building it from source failed at the "checking for libcurl (by compile)... failed (fail)" step.
The Question: Any other tips on solving this "node assignment" error?
The documentation has been updated recently; I was able to follow it end to end without any issue (the IAM policy is now correct):
https://docs.aws.amazon.com/msk/latest/developerguide/serverless-getting-started.html
The properties file you created is not used automatically; your command needs to include --command-config client.properties. The contents of this properties file are documented in the MSK docs on the linked IAM page (a full example command follows the extract below).
Extract...
ssl.truststore.location=<PATH_TO_TRUST_STORE_FILE>
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
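Concretely, the topic-creation call from the question would then look roughly like this (the bootstrap server is copied from the question; the paths to client.properties and to the built aws-msk-iam-auth jar are assumptions about your layout):
# Make the IAM auth library visible to the Kafka CLI tools (jar path is an assumption).
export CLASSPATH=/home/ec2-user/aws-msk-iam-auth/build/libs/aws-msk-iam-auth-all.jar
bin/kafka-topics.sh --create --topic quickstart-events --partitions 1 --replication-factor 1 \
  --bootstrap-server boot-zclcyva3.c2.kafka-serverless.us-east-2.amazonaws.com:9098 \
  --command-config config/client.properties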
Alternatively, if the plaintext port didn't work either, then you have other networking issues.
Beyond these steps, I suggest reaching out to MSK support and asking them to update the "Create a Topic" page to no longer use ZooKeeper, keeping in mind that Kafka 3.0 is not (yet) supported.
I am using a very old Elasticsearch 1.x (I know it's EOL, but there's no choice here), and since there was no official ES client at that time, I'm using the Jest client to interact with Elasticsearch. I occasionally see a timeout exception when Jest is trying to establish a connection; below is the stack trace from the log:
org.apache.http.conn.ConnectTimeoutException: Connect to <es-ip>:9200 [/<es-ip>] failed: connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:109)
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:56)
One weird thing I noticed is that this mostly happens to the Elasticsearch instances hosted in AWS and not the ones in the data centre. I am using the Datadog integration with Elasticsearch for infrastructure monitoring and can provide more relevant details from there if required.
I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode join
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158)
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?
It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager Free Edition, you must have had to add the appropriate configs to the safety valves to enable security; dfs.https.port needs to be set there as well.
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.