Does the SageMaker Python SDK (training jobs) inherit all permissions from the edge node? - amazon-iam

I'm working in a corporate network, training a machine learning model. MLflow tracking works fine from a SageMaker notebook instance, but when I launch a hyperparameter tuning job from that same notebook instance, MLflow tracking fails:
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7eff60d845b0>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/requests/adapters.py", line 440, in send
resp = conn.urlopen(
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='mlflow.dev.corp.net', port=80): Max retr
The MLflow tracking URI does have corporate access restrictions, but I don't see why that would block the instances launched by the SageMaker SDK, since the training jobs inherit the IAM role ARN from the SageMaker notebook instance. Any solutions?
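For context, the tuning job is launched from the notebook roughly like this (a simplified sketch; the image URI, metric, hyperparameter range, S3 path, and the way the tracking URI is passed to the container are assumptions), with the execution role taken from the notebook instance:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Role of the notebook instance; the training/tuning jobs run with this same role
role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Tracking URI handed to the training container (assumed mechanism)
    environment={"MLFLOW_TRACKING_URI": "http://mlflow.dev.corp.net"},
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(0.001, 0.1)},
    metric_definitions=[{"Name": "validation:loss", "Regex": "loss=([0-9\\.]+)"}],
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path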

This error isn't related to IAM. The machine the training script actually runs on (the training container, not your notebook instance) has no network access to mlflow.dev.corp.net; it cannot even resolve the hostname, and that is what breaks execution. IAM roles govern authorization for AWS API calls; they do not provide network reachability.
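If the MLflow server is only reachable from inside the corporate network, the usual fix is to attach the training/tuning jobs to a VPC that has a route (and DNS resolution) to mlflow.dev.corp.net, by passing subnets and security groups to the estimator. A minimal sketch, assuming such a subnet and security group already exist (the IDs are placeholders):

import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Run the training containers inside the VPC that can resolve and reach mlflow.dev.corp.net
    subnets=["subnet-0123456789abcdef0"],         # placeholder private subnet ID(s)
    security_group_ids=["sg-0123456789abcdef0"],  # placeholder security group ID
)

The [Errno -2] Name or service not known part also means the container could not even resolve the hostname, so the DNS used by that VPC must be able to resolve your corporate names.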

Related

Google Cloud API Gateway - "internal error has occurred" on creation

I am looking to create a GCP API Gateway resource for my project. I have created the API and the API config, and am now trying to create the Gateway itself. However, when I run the Gateway creation command I get the following error:
gcloud beta api-gateway gateways create project-api-gateway --api=project-api --api-config=project-config --location=us-central1
Waiting for API Gateway [project-api-gateway] to be created with [projects/${MY_PROJECT}/locations/global/apis/project-api/configs/project-config] config...failed.
ERROR: (gcloud.beta.api-gateway.gateways.create) an internal error has occurred
The same happens when I try to do this through the Cloud Console and when creating the API Gateway through Terraform. Does anyone know what's going wrong here?
The API and API config were created through Terraform, but I believe the equivalent creation commands are:
gcloud beta api-gateway apis create project-api --project=

gcloud beta api-gateway api-configs create project-config \
  --api=project-api --openapi-spec=swagger.yaml \
  --project= --backend-auth-service-account=SERVICE_ACCOUNT_EMAIL
EDIT:
As requested, here is the command run with debug turned on (and slightly redacted):
gcloud beta api-gateway gateways create project-api-gateway \
--api=project-api --api-config=project-config \
--location=europe-west1 --verbosity=debug
DEBUG: Running [gcloud.beta.api-gateway.gateways.create] with arguments: [--api: "project-api", --api-config: "project-config", --location: "europe-west1", --verbosity: "debug", GATEWAY: "project-api-gateway"]
DEBUG: Making request: POST https://www.googleapis.com/oauth2/v4/token
Waiting for API Gateway [project-api-gateway] to be created with [projects/project-prd-f4be/locations/global/apis/project-api/configs/project-config] config...failed.
DEBUG: (gcloud.beta.api-gateway.gateways.create) an internal error has occurred
Traceback (most recent call last):
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/calliope/cli.py", line 983, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/calliope/backend.py", line 808, in Run
resources = command_instance.Run(args)
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/surface/api_gateway/gateways/create.py", line 66, in Run
return operations_util.PrintOperationResult(
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/command_lib/api_gateway/operations_util.py", line 67, in PrintOperationResult
return op_client.WaitForOperation(operation_ref, wait_string, service)
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/api_lib/api_gateway/operations.py", line 89, in WaitForOperation
return waiter.WaitFor(poller, operation_ref, message)
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/api_lib/util/waiter.py", line 261, in WaitFor
operation = PollUntilDone(
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/api_lib/util/waiter.py", line 322, in PollUntilDone
operation = retryer.RetryOnResult(
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
if not should_retry(result, state):
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
return not poller.IsDone(operation)
File "/Users/Person/Downloads/google-cloud-sdk 2/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
raise OperationError(operation.error.message)
googlecloudsdk.api_lib.util.waiter.OperationError: an internal error has occurred
ERROR: (gcloud.beta.api-gateway.gateways.create) an internal error has occurred
API can be found at http://www.filedropper.com/swagger_1

Airflow on GCP - Errno 111 Connection refused

We have Airflow 1.10.5, using CeleryExecutor, running on Google Cloud Platform.
Occasionally, the following error happens:
[2019-12-17 19:00:45,990] {{base_task_runner.py:115}} INFO - Job 704: Subtask our-task-name ERROR: (gcloud.container.clusters.get-credentials) [Errno 111] Connection refused
[2019-12-17 19:00:45,990] {{base_task_runner.py:115}} INFO - Job 704: Subtask our-task-name This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
[2019-12-17 19:00:46,279] {{taskinstance.py:1051}} ERROR - Command '['gcloud', 'container', 'clusters', 'get-credentials', 'airflow-pipeline-name', '--zone', 'us-central1-a', '--project', 'project-name']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 926, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/gcp_container_operator.py", line 271, in execute
"--project", self.project_id])
File "/usr/local/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'container', 'clusters', 'get-credentials', 'airflow-pipeline-name', '--zone', 'us-central1-a', '--project', 'project-name']' returned non-zero exit status 1.
[2019-12-17 19:00:46,358] {{taskinstance.py:1082}} INFO - Marking task as FAILED.
Is it a bug in Airflow itself, in its plugins (like the plugin for Kubernetes) or in Google Cloud Platform?
Is there any way to fix it?
The problem was that the metadata server was intermittently not responding. Our colleagues fixed this.
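Since the failures were intermittent, a pragmatic mitigation on the Airflow side (independent of fixing the metadata server) is to let the task retry. A minimal sketch, assuming a DAG similar to the one in the logs (the image and names are placeholders):

from datetime import timedelta

from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator
from airflow.utils.dates import days_ago

default_args = {
    "retries": 3,                        # re-run the task if gcloud get-credentials fails transiently
    "retry_delay": timedelta(minutes=2),
}

with DAG("our-dag", default_args=default_args, start_date=days_ago(1), schedule_interval=None) as dag:
    our_task = GKEPodOperator(
        task_id="our-task-name",
        project_id="project-name",
        location="us-central1-a",
        cluster_name="airflow-pipeline-name",
        name="our-task-name",
        namespace="default",
        image="gcr.io/project-name/our-image:latest",  # placeholder
    )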

SSL error running Jupyter Notebook on Google Cloud VM

I've set up a GCP virtual machine instance, installed Anaconda, torch, etc., and initialized my Jupyter notebooks (including the config setup). I've mounted my Google storage bucket and everything seems to be okay, except that when I run Jupyter:
~$ jupyter notebook --certfile=/home/username/certs/mycert.pem
[I 16:18:41.293 NotebookApp] [nb_conda_kernels] enabled, 3 kernels found
[I 16:18:44.879 NotebookApp] Serving notebooks from local directory:
/home/username
[I 16:18:44.879 NotebookApp] 0 active kernels
[I 16:18:44.879 NotebookApp] The Jupyter Notebook is running at:
https://0.0.0.0:8888/
[I 16:18:44.879 NotebookApp] Use Control-C to stop this server and shut
down all kernels (twice to skip confirmation).
I can access it on my VM's external IP:
https://xx.xxx.xxx.xx:8888
But as soon as I do that, I get this error:
ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:645)
Full traceback:
[W 16:18:52.343 NotebookApp] SSL Error on 11 ('73.43.19.83', 56932): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:645)
[W 16:18:52.343 NotebookApp] SSL Error on 14 ('73.43.19.83', 56936): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:645)
[E 16:18:52.343 NotebookApp] Uncaught exception
Traceback (most recent call last):
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/http1connection.py", line 693, in _server_request_loop
ret = yield conn.read_response(request_delegate)
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/concurrent.py", line 215, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/http1connection.py", line 168, in _read_message
quiet_exceptions=iostream.StreamClosedError)
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/concurrent.py", line 215, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/home/username/anaconda3/lib/python3.5/site-packages/tornado/iostream.py", line 1243, in _do_ssl_handshake
self.socket.do_handshake()
File "/home/username/anaconda3/lib/python3.5/ssl.py", line 988, in do_handshake
self._sslobj.do_handshake()
File "/home/username/anaconda3/lib/python3.5/ssl.py", line 633, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:645)
This is on a Linux VM. Locally, I'm on a PC. I've already tried rewriting the config file for Jupyter but I'm stuck on this problem. None of the solutions I've found online have worked.
It seems like you're trying to manually set up a VM instance with Jupyter. That tends to be pretty tricky; have you considered using GCP's AI Platform Notebooks instead?
https://cloud.google.com/ai-platform-notebooks
It gives you a VM with JupyterLab and many popular DL/ML libraries preinstalled and configured, and you don't even need SSH to access your notebook.
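For reference, if you do stay with a hand-rolled VM, the SSL settings usually live in ~/.jupyter/jupyter_notebook_config.py rather than on the command line. A minimal sketch (the key file path is an assumption, and with a self-signed certificate the browser will typically still warn about an unknown certificate, which is what shows up in the server log as SSLV3_ALERT_CERTIFICATE_UNKNOWN):

# ~/.jupyter/jupyter_notebook_config.py
c = get_config()

c.NotebookApp.ip = "0.0.0.0"          # listen on all interfaces
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False
c.NotebookApp.certfile = "/home/username/certs/mycert.pem"  # cert from the question
c.NotebookApp.keyfile = "/home/username/certs/mykey.key"    # assumed matching key file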

gcloud datastore emulator: [Errno 8] nodename nor servname provided, or not known

I just installed the Google Cloud Datastore emulator locally for testing my applications. I'm getting a weird error when starting it up. A Google search didn't return any results.
I installed it with the following command:
gcloud components install cloud-datastore-emulator
Installation seemed successful.
Here's the output of starting it with debug verbosity:
❯ gcloud beta emulators datastore start --verbosity debug
DEBUG: Running [gcloud.beta.emulators.datastore.start] with arguments: [--verbosity: "debug"]
DEBUG: (gcloud.beta.emulators.datastore.start) [Errno 8] nodename nor servname provided, or not known
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
Traceback (most recent call last):
File "/usr/local/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 787, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/usr/local/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 759, in Run
resources = command_instance.Run(args)
File "/usr/local/google-cloud-sdk/lib/surface/emulators/datastore/start.py", line 69, in Run
datastore_util.GetHostPort(), ipv6_enabled=socket.has_ipv6)
File "/usr/local/google-cloud-sdk/lib/googlecloudsdk/command_lib/emulators/datastore_util.py", line 162, in GetHostPort
return util.GetHostPort(DATASTORE)
File "/usr/local/google-cloud-sdk/lib/googlecloudsdk/command_lib/emulators/util.py", line 222, in GetHostPort
if sock.connect_ex((host, port)) != 0:
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
gaierror: [Errno 8] nodename nor servname provided, or not known
ERROR: (gcloud.beta.emulators.datastore.start) [Errno 8] nodename nor servname provided, or not known
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
Any ideas of what is going on are welcome.
I just made a quick test: I installed the Cloud Datastore emulator with $ gcloud components install cloud-datastore-emulator and ran $ gcloud beta emulators datastore start --verbosity debug without a problem.
What’s the output if you run $ gcloud beta emulators datastore start with no flags?
Looks to be a connectivity issue. Have you tried forcing the host and port? Try running $ gcloud beta emulators datastore start --verbosity debug --host-port 127.0.0.1:8080
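The gaierror comes straight out of getaddrinfo, so it is also worth checking that the default emulator host name resolves at all on your machine (the emulator defaults to localhost; port 8081 is assumed here). A quick check:

import socket

# Raises socket.gaierror ([Errno 8] nodename nor servname provided, or not known)
# if "localhost" cannot be resolved, e.g. due to a missing /etc/hosts entry.
print(socket.getaddrinfo("localhost", 8081))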

How do I register an AWS EC2 instance into an ECS cluster (not using the AWS console)?

I am unable to register previously created EC2 instances into a cluster using register_container_instance from the Python boto3 library:
response = client.register_container_instance(
cluster=cluster_name,
instanceIdentityDocument='inst_id_doc.txt',
instanceIdentityDocumentSignature='inst_id_sign.txt'
)
I am getting the following error:
Traceback (most recent call last):
File "whetstone_2.py", line 79, in <module>
instanceIdentityDocumentSignature='inst_id_sign.txt'
File "C:\Users\ishanigh\Anaconda3\lib\site-packages\botocore\client.py", line 251, in _api_call
return self._make_api_call(operation_name, kwargs)
File "C:\Users\ishanigh\Anaconda3\lib\site-packages\botocore\client.py", line 251, in _api_call
raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ServerException) when calling the RegisterContainerInstance operation (reached max retries: 4): Service Unavailable. Please try again later.
I got the instance_id_doc.txt and inst_id_sign.txt files from the instances, and the process of registering the container instances has to be automated.
How should this be done?
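For what it's worth, RegisterContainerInstance expects the identity document and its signature as strings, not file names, so the files would at least need to be read in first. A sketch under that assumption (the cluster name is a placeholder; the file names are taken from the question):

import boto3

ecs = boto3.client("ecs")

# The API wants the document/signature text itself, not a path to a file
with open("inst_id_doc.txt") as f:
    identity_document = f.read()
with open("inst_id_sign.txt") as f:
    identity_signature = f.read()

response = ecs.register_container_instance(
    cluster="my-cluster",  # placeholder
    instanceIdentityDocument=identity_document,
    instanceIdentityDocumentSignature=identity_signature,
)

That said, this call is normally made by the ECS container agent itself; running the agent on the instance with ECS_CLUSTER set in /etc/ecs/ecs.config (and an instance profile that allows ecs:RegisterContainerInstance) typically registers the instance automatically.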