MLflow proxied artifact access: Unable to locate credentials - amazon-web-services

I am using MLflow to track my experiments, with an S3 bucket as the artifact store. To access it, I want to use proxied artifact access, as described in the docs. However, this does not work for me, because MLflow looks for credentials locally (even though the server should handle this).
Expected Behaviour
As described in the docs, I would expect that locally, I do not need to specify my AWS credentials, since the server handles this for me. From docs:
This eliminates the need to allow end users to have direct path access to a remote object store (e.g., s3, adls, gcs, hdfs) for artifact handling and eliminates the need for an end-user to provide access credentials to interact with an underlying object store.
Actual Behaviour / Error
Whenever I run an experiment on my machine, I am running into the following error:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
So the error occurs locally. However, this should not happen, since the server should handle the authentication instead of me needing to store credentials locally. I would also expect not to need the boto3 library locally at all.
Solutions Tried
I am aware that I need to create a new experiment, because existing experiments might still use a different artifact location; this is suggested in this SO answer as well as in the note in the docs. Creating a new experiment did not solve the error for me. Whenever I run the experiment, I get an explicit log line in the console confirming that a new experiment is created:
INFO mlflow.tracking.fluent: Experiment with name 'test' does not exist. Creating a new experiment.
Related questions (#1 and #2) refer to a different scenario, which is also described in the docs.
Server Config
The server runs on a kubernetes pod with the following config:
mlflow server \
--host 0.0.0.0 \
--port 5000 \
--backend-store-uri postgresql://user:pw@endpoint \
--artifacts-destination s3://my_bucket/artifacts \
--serve-artifacts \
--default-artifact-root s3://my_bucket/artifacts
Looking at doc page 1 and page 2, I would expect my config to be correct.
I am able to see the MLflow UI if I forward the port to my local machine. I also see the experiment runs marked as failed because of the error above.
My Code
The relevant part of my code which fails is the logging of the model:
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("test2")
...
# this works
mlflow.log_params(hyperparameters)
model = self._train(model_name, hyperparameters, X_train, y_train)
y_pred = model.predict(X_test)
self._evaluate(y_test, y_pred)
# this fails with the error from above
mlflow.sklearn.log_model(model, "artifacts")
Question
I am probably overlooking something. Is there a need to indicate locally that I want to use proxied artifact access? If yes, how do I do this? Is there something I have missed?
Full Traceback
File "/dir/venv/lib/python3.9/site-packages/mlflow/models/model.py", line 295, in log
mlflow.tracking.fluent.log_artifacts(local_path, artifact_path)
File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 726, in log_artifacts
MlflowClient().log_artifacts(run_id, local_dir, artifact_path)
File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/client.py", line 1001, in log_artifacts
self._tracking_client.log_artifacts(run_id, local_dir, artifact_path)
File "/dir/venv/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 346, in log_artifacts
self._get_artifact_repo(run_id).log_artifacts(local_dir, artifact_path)
File "/dir/venv/lib/python3.9/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 141, in log_artifacts
self._upload_file(
File "/dir/venv/lib/python3.9/site-packages/mlflow/store/artifact/s3_artifact_repo.py", line 117, in _upload_file
s3_client.upload_file(Filename=local_file, Bucket=bucket, Key=key, ExtraArgs=extra_args)
File "/dir/venv/lib/python3.9/site-packages/boto3/s3/inject.py", line 143, in upload_file
return transfer.upload_file(
File "/dir/venv/lib/python3.9/site-packages/boto3/s3/transfer.py", line 288, in upload_file
future.result()
File "/dir/venv/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
return self._coordinator.result()
File "/dir/venv/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
raise self._exception
File "/dir/venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 139, in __call__
return self._execute_main(kwargs)
File "/dir/venv/lib/python3.9/site-packages/s3transfer/tasks.py", line 162, in _execute_main
return_value = self._main(**kwargs)
File "/dir/venv/lib/python3.9/site-packages/s3transfer/upload.py", line 758, in _main
client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 508, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 898, in _make_api_call
http, parsed_response = self._make_request(
File "/dir/venv/lib/python3.9/site-packages/botocore/client.py", line 921, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
return self._send_request(request_dict, operation_model)
File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 198, in _send_request
request = self.create_request(request_dict, operation_model)
File "/dir/venv/lib/python3.9/site-packages/botocore/endpoint.py", line 134, in create_request
self._event_emitter.emit(
File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
return self._emit(event_name, kwargs)
File "/dir/venv/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
response = handler(**kwargs)
File "/dir/venv/lib/python3.9/site-packages/botocore/signers.py", line 103, in handler
return self.sign(operation_name, request)
File "/dir/venv/lib/python3.9/site-packages/botocore/signers.py", line 187, in sign
auth.add_auth(request)
File "/dir/venv/lib/python3.9/site-packages/botocore/auth.py", line 407, in add_auth
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

The problem is that the server is started with the wrong parameters: --default-artifact-root needs to either be removed or set to mlflow-artifacts:/.
From mlflow server --help:
--default-artifact-root URI Directory in which to store artifacts for any
new experiments created. For tracking server
backends that rely on SQL, this option is
required in order to store artifacts. Note that
this flag does not impact already-created
experiments with any previous configuration of
an MLflow server instance. By default, data
will be logged to the mlflow-artifacts:/ uri
proxy if the --serve-artifacts option is
enabled. Otherwise, the default location will
be ./mlruns.
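For what it's worth, a quick way to verify that a freshly created experiment actually uses the proxy is to inspect its artifact location from the client. A minimal sketch (the experiment name is just an example):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
# create_experiment fails if the name already exists, so pick a fresh one
exp_id = mlflow.create_experiment("proxied-artifacts-check")
exp = mlflow.get_experiment(exp_id)
# With --serve-artifacts active and no explicit s3:// default artifact root,
# this should print something starting with mlflow-artifacts:/
print(exp.artifact_location)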

I had the same problem, and the accepted answer didn't solve my issue.
Neither removing --default-artifact-root nor setting it to mlflow-artifacts:/ instead of s3:// worked for me. Moreover, it gave me an error saying that since I have a remote --backend-store-uri, I need to set --default-artifact-root when running the MLflow server.
How I solved it: the error is actually self-explanatory. It says it was unable to locate credentials because MLflow uses boto3 under the hood for all transactions with the object store. Since I had set up my environment variables in a .env file, loading that file was enough to solve the issue. If you have a similar scenario, just run the following commands before starting your MLflow server:
set -a
source .env
set +a
This will load the environment variables and you will be good to go.
Note:
I was using remote services for both the backend store and artifact storage, namely Postgres and MinIO.
For a remote backend, --backend-store-uri is mandatory; otherwise you will not be able to start your MLflow server.
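As a quick sanity check that the loaded variables are actually visible in the shell where the server will run, something like the following can be executed first (a sketch, not part of MLflow itself):

import boto3

# If this prints None, boto3 (and therefore the MLflow artifact proxy running
# in this environment) cannot see any credentials.
# When using MinIO, MLFLOW_S3_ENDPOINT_URL must also be exported for the server.
session = boto3.session.Session()
print(session.get_credentials())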

The answer by @bk_ helped me. I ended up with the following command to get my tracking server running with a proxied connection for artifact storage:
mlflow server \
--backend-store-uri postgresql://postgres:postgres@postgres:5432/mlflow \
--default-artifact-root mlflow-artifacts:/ \
--serve-artifacts \
--host 0.0.0.0
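With the server started this way, the client side needs no AWS credentials or boto3 configuration at all. A minimal sketch of what the logging code can look like (experiment name and model are illustrative):

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("proxied-artifacts-demo")

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    # The artifact is uploaded through the tracking server's proxy,
    # not directly to S3, so no local AWS credentials are required.
    mlflow.sklearn.log_model(model, "model")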

Related

python3.6: ValueError: unsupported pickle protocol: 5

I developed a personal site locally with Python 3.8.
When I deployed it to an AWS Ubuntu EC2 server using the code files from my local machine, saving my blog contents produced the following error. By the way, the site saves fine on the server under Python 3.6, which I have tested.
File "/home/ubuntu/.local/lib/python3.6/site-packages/whoosh/index.py", line 123, in open_dir
return FileIndex(storage, schema=schema, indexname=indexname)
File "/home/ubuntu/.local/lib/python3.6/site-packages/whoosh/index.py", line 421, in init
TOC.read(self.storage, self.indexname, schema=self._schema)
File "/home/ubuntu/.local/lib/python3.6/site-packages/whoosh/index.py", line 664, in read
segments = stream.read_pickle()
File "/home/ubuntu/.local/lib/python3.6/site-packages/whoosh/filedb/structfile.py", line 245, in read_pickle
return load_pickle(self.file)
ValueError: unsupported pickle protocol: 5
I am wondering whether this could be caused by a file from the local environment.
I have solved it: I just deleted the protocol-5 pickle file that was generated locally by Python 3.8. You can detect the file name at load_pickle(self.file) in the code, for example with print(self.file); that gives you the file's location and name.
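The root cause is simply a pickle protocol mismatch between the two interpreters. A tiny sketch that makes it visible, and shows how to pin the protocol if you control the code doing the writing (the example object is a placeholder):

import pickle
import sys

# Python 3.8 can write protocol 5, which Python 3.6 (max protocol 4) cannot read.
print(sys.version_info[:2],
      "default:", pickle.DEFAULT_PROTOCOL,
      "highest:", pickle.HIGHEST_PROTOCOL)

# If you control the writing side, pinning the protocol keeps the data
# readable on the older interpreter:
payload = pickle.dumps({"example": 123}, protocol=4)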

Google cloud functions deployment through Cloud Source repositories stopped working

I managed to have a script deploying a GCP Function using the following command :
gcloud beta functions deploy pipeline-helper --set-env-vars PROPFILE_BUCKET=${my_bucket},PROPFILE_PATH=${some_property} --source https://source.developers.google.com/projects/{PROJECT}/repos/{REPO}/fixed-aliases/1.0.1/paths/ --entry-point onFlagFileCreation --runtime nodejs6 --trigger-resource ${my_bucket} --trigger-event google.storage.object.finalize --region europe-west1 --memory 1G --timeout 300s
That worked for a few days, the last one being December 4th. Then, when launched on December 27th, the command failed with the following output (with the debug option added):
Deploying function (may take a while - up to 2 minutes)...
..failed.
DEBUG: (gcloud.beta.functions.deploy) OperationError: code=13, message=Failed to retrieve function source code
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 841, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 770, in Run
resources = command_instance.Run(args)
File "/usr/lib/google-cloud-sdk/lib/surface/functions/deploy.py", line 203, in Run
return _Run(args, track=self.ReleaseTrack(), enable_env_vars=True)
File "/usr/lib/google-cloud-sdk/lib/surface/functions/deploy.py", line 157, in _Run
return api_util.PatchFunction(function, updated_fields)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/util.py", line 308, in CatchHTTPErrorRaiseHTTPExceptionFn
return func(*args, **kwargs)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/util.py", line 364, in PatchFunction
operations.Wait(op, messages, client, _DEPLOY_WAIT_NOTICE)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 126, in Wait
_WaitForOperation(client, request, notice)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 101, in _WaitForOperation
sleep_ms=SLEEP_MS)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 219, in RetryOnResult
result = func(*args, **kwargs)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/functions/operations.py", line 65, in _GetOperationStatus
raise exceptions.FunctionsError(OperationErrorToString(op.error))
FunctionsError: OperationError: code=13, message=Failed to retrieve function source code
ERROR: (gcloud.beta.functions.deploy) OperationError: code=13, message=Failed to retrieve function source code
Build step 'Execute shell' marked build as failure
Finished: FAILURE
My problem relates to the use of the --source option of this command when it points to a Google Cloud Source Repositories URL (it works with a GCS bucket or a local directory).
I tried using the minimal valid source repository URL https://source.developers.google.com/projects/PROJECT/repos/REPO as mentioned in the official doc here, with no success (same error).
After that, I cloned the official Google Cloud Functions "hello world" sample to Cloud Source Repositories and tried to deploy it using an equivalent command, with no more success (same error). However, I was able to deploy it via a zip uploaded to a GCS bucket in my project or from a local directory, but not from Cloud Source Repositories.
The account used to deploy the Function (xxx-compute@developer.gserviceaccount.com) has the following rights:
Stackdriver Debugger Agent
Cloud Functions Developer
Cloud Functions Service Agent
Editor
Service Account User
Source Repository Writer
Cloud Source Repositories Service Agent
Storage Object Creator
Storage Object Viewer
Any help would be greatly appreciated
As mentioned in my last comment to @Raj, the problem was due to a bug in GCP that is now fixed. The support people were kind and responsive.
Everything is working as expected now!

AssertionError: INTERNAL: No default project is specified

New to Airflow. I am trying to run a SQL query and store the result in a BigQuery table.
I am getting the following error. Not sure where to set up the default_project_id.
Please help me.
Error:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 28, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 585, in test
ti.run(ignore_task_deps=True, ignore_ti_state=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1374, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/bigquery_operator.py", line 82, in execute
self.allow_large_results, self.udf_config, self.use_legacy_sql)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 228, in run_query
default_project_id=self.project_id)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 917, in _split_tablename
assert default_project_id is not None, "INTERNAL: No default project is specified"
AssertionError: INTERNAL: No default project is specified
Code:
sql_bigquery = BigQueryOperator(
    task_id='sql_bigquery',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    #standardSQL
    SELECT ID, Name, Group, Mark, RATIO_TO_REPORT(Mark) OVER(PARTITION BY Group) AS percent FROM `tensile-site-168620.temp.marks`
    ''',
    destination_dataset_table='temp.percentage',
    dag=dag
)
EDIT: I finally fixed this problem by simply adding the bigquery_conn_id='bigquery' parameter in the BigQueryOperator task, after running the code below in a separate python script.
Apparently you need to specify your project ID in Admin -> Connection in the Airflow UI. You must do this as a JSON object such as "project" : "".
Personally I can't get the webserver working on GCP so this is unfeasible. There is a programmatic solution here:
from airflow.models import Connection
from airflow.settings import Session
session = Session()
gcp_conn = Connection(
    conn_id='bigquery',
    conn_type='google_cloud_platform',
    extra='{"extra__google_cloud_platform__project":"<YOUR PROJECT HERE>"}')

if not session.query(Connection).filter(
        Connection.conn_id == gcp_conn.conn_id).first():
    session.add(gcp_conn)
    session.commit()
These suggestions are from a similar question here.
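Putting the pieces together, the operator from the question can then reference that connection explicitly via bigquery_conn_id, as the EDIT above describes. A sketch against the Airflow 1.x contrib operator (DAG id, query, and destination table are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG('bq_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

sql_bigquery = BigQueryOperator(
    task_id='sql_bigquery',
    bigquery_conn_id='bigquery',      # the connection created above
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='SELECT 1 AS x',              # placeholder query
    destination_dataset_table='temp.percentage',
    dag=dag,
)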
I get the same error when running Airflow locally. My solution is to add the following connection string as an environment variable:
AIRFLOW_CONN_BIGQUERY_DEFAULT="google-cloud-platform://?extra__google_cloud_platform__project=<YOUR PROJECT HERE>"
BigQueryOperator uses the "bigquery_default" connection. When not specified, local Airflow uses an internal version of the connection which is missing the project_id property. As you can see, the connection string above provides the project_id property.
On startup, Airflow loads environment variables that start with "AIRFLOW_" into memory. This mechanism can be used to override Airflow properties and to provide connections when running locally, as explained in the Airflow documentation here. Note this also works when running Airflow directly, without starting the web server.
So I have set up environment variables for all my connections, for example AIRFLOW_CONN_MYSQL_DEFAULT. I have put them into a .env file that gets sourced by my IDE, but putting them into your .bash_profile would work fine too.
When you look inside your Airflow instance on Cloud Composer, you see that the "bigquery_default" connection has the project_id property set. That's why BigQueryOperator works when running through Cloud Composer.
(I am on airflow 1.10.2 and BigQuery 1.10.2)
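To check which project a locally resolved connection actually carries, a small diagnostic along these lines can be run (Airflow 1.10-era API; purely illustrative):

from airflow.hooks.base_hook import BaseHook

conn = BaseHook.get_connection('bigquery_default')
# extra_dejson parses the JSON "extra" field where the project id is stored
print(conn.extra_dejson.get('extra__google_cloud_platform__project'))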

boto3 throws error in when packaged under rpm

I am using boto3 in my project, and when I package it as an RPM it raises an error while initializing the EC2 client.
<class 'botocore.exceptions.DataNotFoundError'>:Unable to load data for: _endpoints. Traceback -Traceback (most recent call last):
File "roboClientLib/boto/awsDRLib.py", line 186, in _get_ec2_client
File "boto3/__init__.py", line 79, in client
File "boto3/session.py", line 200, in client
File "botocore/session.py", line 789, in create_client
File "botocore/session.py", line 682, in get_component
File "botocore/session.py", line 809, in get_component
File "botocore/session.py", line 179, in <lambda>
File "botocore/session.py", line 475, in get_data
File "botocore/loaders.py", line 119, in _wrapper
File "botocore/loaders.py", line 377, in load_data
DataNotFoundError: Unable to load data for: _endpoints
Can anyone help me here? Probably boto3 requires some runtime data files that it is not able to find when packaged as an RPM.
I tried using LD_LIBRARY_PATH in /etc/environment, which is not working:
export LD_LIBRARY_PATH="/usr/lib/python2.6/site-packages/boto3:/usr/lib/python2.6/site-packages/boto3-1.2.3.dist-info:/usr/lib/python2.6/site-packages/botocore:
I faced the same issue:
botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-04-01/service-2
It turned out the corresponding data directory was missing. Updating botocore by running the following solved my issue:
pip install --upgrade botocore
Botocore depends on a set of service definition files that it uses to generate clients on the fly. Boto3 further depends on another set of files that it uses to generate resource clients. You will need to include these in any installs of boto3 or botocore. The files will need to be located in the 'data' folder of the root of the respective library.
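A quick way to confirm whether a packaged install is missing those files is to look for the data directory next to the installed botocore module. A sketch (the exact file names vary between botocore versions; older releases used _endpoints.json, newer ones endpoints.json):

import os
import botocore

data_dir = os.path.join(os.path.dirname(botocore.__file__), 'data')
# If the RPM dropped the data files, this directory will be missing or empty.
print(data_dir, os.path.isdir(data_dir))
print(sorted(os.listdir(data_dir))[:10] if os.path.isdir(data_dir) else 'missing')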
I faced a similar issue which was due to an old version of botocore. Once I updated it, it started working.
Please consider using the command below.
pip install --upgrade botocore
Also, please ensure you have set up a boto configuration profile.
Boto searches for credentials in the following order (the first option is shown in the sketch after this list):
Passing credentials as parameters in the boto3.client() method
Passing credentials as parameters when creating a Session object
Environment variables
Shared credential file (~/.aws/credentials)
AWS config file (~/.aws/config)
Assume Role provider
Boto2 config file (/etc/boto.cfg and ~/.boto)
Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
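For completeness, the first option in that list looks roughly like this with boto3 (region and key values are placeholders; explicitly passed parameters take precedence over every other source):

import boto3

ec2 = boto3.client(
    'ec2',
    region_name='us-east-1',
    aws_access_key_id='AKIA...',        # placeholder
    aws_secret_access_key='SECRET...',  # placeholder
)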

Connection reset by peer when using s3, boto, django-storage for static files

I'm trying to switch to Amazon S3 to host the static files for our Django project. I am using django, boto, django-storages and django-compressor. When I run collectstatic on my dev server, I get the error
socket.error: [Errno 104] Connection reset by peer
The total size of all my static files is 74MB, which doesn't seem too large. Has anyone seen this before, or have any debugging tips?
Here is the full trace.
Traceback (most recent call last):
File "./manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 443, in execute_from_command_line
utility.execute()
File "/usr/local/lib/python2.7/dist-packages/django/core/management/__init__.py", line 382, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 196, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 232, in execute
output = self.handle(*args, **options)
File "/usr/local/lib/python2.7/dist-packages/django/core/management/base.py", line 371, in handle
return self.handle_noargs(**options)
File "/usr/local/lib/python2.7/dist-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 163, in handle_noargs
collected = self.collect()
File "/usr/local/lib/python2.7/dist-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 113, in collect
handler(path, prefixed_path, storage)
File "/usr/local/lib/python2.7/dist-packages/django/contrib/staticfiles/management/commands/collectstatic.py", line 303, in copy_file
self.storage.save(prefixed_path, source_file)
File "/usr/local/lib/python2.7/dist-packages/django/core/files/storage.py", line 45, in save
name = self._save(name, content)
File "/usr/local/lib/python2.7/dist-packages/storages/backends/s3boto.py", line 392, in _save
self._save_content(key, content, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/storages/backends/s3boto.py", line 403, in _save_content
rewind=True, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1222, in set_contents_from_file
chunked_transfer=chunked_transfer, size=size)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 714, in send_file
chunked_transfer=chunked_transfer, size=size)
File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 890, in _send_file_internal
query_args=query_args
File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 547, in make_request
retry_handler=retry_handler
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 966, in make_request
retry_handler=retry_handler)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 927, in _mexe
raise e
socket.error: [Errno 104] Connection reset by peer
UPDATE: I don't have an answer for how to debug this error, but it later just stopped happening, which makes me think it may have had something to do with S3 itself.
tl;dr
If your bucket is not in the default region, you need to tell boto which region to connect to, e.g. if your bucket is in us-west-2 you need to add the following line to settings.py:
AWS_S3_HOST = 's3-us-west-2.amazonaws.com'
Long explanation:
It's not a permission problem and you should not set your bucket permissions to 'Authenticated users'.
This problem happens if you create your bucket in a region which is not the default one (in my case I was using us-west-2).
If you don't use the default region and you don't tell boto in which region your bucket resides, boto will connect to the default region and S3 will reply with a 307 redirect to the region where the bucket belongs.
Unfortunately, due to this bug in boto:
https://github.com/boto/boto/issues/2207
if the 307 reply arrives before boto has finished uploading the file, boto won't see the redirect and will keep uploading to the default region.
Eventually S3 closes the socket resulting into a 'Connection reset by peer'.
It's a kind of race condition which depends on the size of the object being uploaded and the speed of your internet connection, which explains why it happens randomly.
There are two possible reasons why the OP stopped seeing the error after some time:
- he later created a new bucket in the default region and the problem went away by itself.
- he started uploading only small files, which are fast enough to be fully uploaded by the time S3 replies with 307
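The same region mismatch can be reproduced with plain boto outside Django, which makes the behaviour easier to isolate. A sketch (the bucket name is a placeholder and credentials come from the usual boto sources):

import boto.s3

# Connecting to the bucket's own region avoids the 307 redirect entirely,
# which is what setting AWS_S3_HOST achieves for django-storages.
conn = boto.s3.connect_to_region('us-west-2')
bucket = conn.get_bucket('my-static-bucket')
print(bucket.get_location())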
This issue sometimes occurs when you create a new bucket for the first time; you have to wait some hours or minutes before you start uploading. I don't know why S3 behaves like that. To prove it, try creating a new bucket and pointing your Django storage to it: you will see a connection-reset-by-peer error when you try to upload anything from your Django project, but wait a couple of hours or minutes, try again, and it will work. Repeat the same steps and see.
I just had this issue trying to set up a second S3 bucket to use for testing/devel and what helped was deploying an older version of the application.
I have no clue why that would help, but for those of you reading this way after the fact (like me, a couple hours ago), it's worth trying to deploy a different application version.
You have to set your bucket permissions to Authenticated Users List + Upload/Delete, or you can create a specific user in the IAM section of AWS and set up access rights only for that specific user.
This helped me some time ago: Setup S3 for Django