GCP Dataproc - unable to access GCP Cloud storage - google-cloud-platform

I have a Dataproc (Spark Structured Streaming) job which takes data from Kafka and does some processing. The checkpoint is in GCP Cloud Storage, and it is somehow unable to list the objects in GCP Storage.
Here is the error:
Traceback (most recent call last):
File "/tmp/3c930be06f4242378251aa6760390f8d/main-v1-test.py", line 352, in <module>
sys.exit(main())
File "/tmp/3c930be06f4242378251aa6760390f8d/main-v1-test.py", line 348, in main
ss.readData()
File "/tmp/3c930be06f4242378251aa6760390f8d/main-v1-test.py", line 301, in readData
query = df_stream.selectExpr("CAST(value AS STRING)", "timestamp", "topic").writeStream \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1491, in start
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o96.start.
: java.io.IOException: Error accessing gs://ss-checkpoint-10m-test-v1/metadata
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2221)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:2108)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfoInternal(GoogleCloudStorageFileSystem.java:1091)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1065)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:955)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1691)
at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:53)
at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:174)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:50)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:317)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:359)
at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:414)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
GET https://storage.googleapis.com/storage/v1/b/ss-checkpoint-10m-test-v1/o/metadata?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "939354532596-compute#developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).",
"reason" : "forbidden"
} ],
"message" : "939354532596-compute#developer.gserviceaccount.com does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist)."
}
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2215)
... 24 more
Additional details :
I'm using Service account kafka-admin#versa-sml-googl.iam.gserviceaccount.com to start the job, however the Dataproc VMs seem to be using SA -> 939354532596-compute#developer.gserviceaccount.com to access the buckets :
(base) Karans-MacBook-Pro:Versa-StructuredStreaming karanalang$ gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
dataproc-access#versa-kafka-poc.iam.gserviceaccount.com
* kafka-admin#versa-sml-googl.iam.gserviceaccount.com
karan.alang#gmail.com
karan#versa-networks.com
loki-748#versa-sml-googl.iam.gserviceaccount.com
loki-storage#versa-sml-googl.iam.gserviceaccount.com
Here is the list of roles in the above 2 Service Accounts:
kafka-admin#versa-sml-googl.iam.gserviceaccount.com
(base) Karans-MacBook-Pro:Versa-StructuredStreaming karanalang$ gcloud projects get-iam-policy versa-sml-googl \
> --flatten="bindings[].members" \
> --format='table(bindings.role)' \
> --filter="bindings.members:kafka-admin#versa-sml-googl.iam.gserviceaccount.com"
ROLE
roles/compute.admin
roles/compute.storageAdmin
roles/container.admin
roles/container.clusterAdmin
roles/dataproc.admin
roles/dataproc.editor
roles/dataproc.worker
roles/iam.serviceAccountTokenCreator
roles/iam.serviceAccountUser
roles/metastore.editor
roles/owner
roles/storage.admin
roles/storage.objectAdmin
roles/storage.objectViewer
(base) Karans-MacBook-Pro:Versa-StructuredStreaming karanalang$ gcloud projects get-iam-policy versa-sml-googl \
> --flatten="bindings[].members" \
> --format='table(bindings.role)' \
> --filter="939354532596-compute#developer.gserviceaccount.com"
ROLE
roles/compute.storageAdmin
roles/editor
roles/metastore.serviceAgent
roles/storage.objectAdmin
roles/storage.objectViewer
The service accounts seem to have the required roles, so I am not sure what the issue is.
Any input on how to debug/fix this issue?
tia!

Related

Creating Connection for RedshiftDataOperator

So I went to the Airflow documentation for AWS Redshift. There are 2 operators that can execute the SQL query: RedshiftSQLOperator and RedshiftDataOperator. I already implemented my job using RedshiftSQLOperator, but I want to do it using RedshiftDataOperator instead, because I don't want to use the Postgres connection in RedshiftSQLOperator but the AWS API.
RedshiftDataOperator Documentation
I had read this documentation, and there is aws_conn_id in the parameters. But when I try to use the same connection id there is an error.
[2023-01-11, 04:55:56 UTC] {base.py:68} INFO - Using connection ID 'redshift_default' for task execution.
[2023-01-11, 04:55:56 UTC] {base_aws.py:206} INFO - Credentials retrieved from login
[2023-01-11, 04:55:56 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 146, in execute
self.statement_id = self.execute_query()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 124, in execute_query
resp = self.hook.conn.execute_statement(**filter_values)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the ExecuteStatement operation: The security token included in the request is invalid.
From task id
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

redshift_data_task = RedshiftDataOperator(
    task_id='redshift_data_task',
    database='rds',
    region='ap-southeast-1',
    aws_conn_id='redshift_default',
    sql="""
        call some_procedure();
    """
)
What should I fill in the Airflow connection? In the documentation there is no example of the values that I should fill in to Airflow. Thanks
Airflow RedshiftDataOperator Connection Required Value
Have you tried using the Amazon Redshift connection? There is both an option for authenticating using your Redshift credentials:
Connection ID: redshift_default
Connection Type: Amazon Redshift
Host: <your-redshift-endpoint> (for example, redshift-cluster-1.123456789.us-west-1.redshift.amazonaws.com)
Schema: <your-redshift-database> (for example, dev, test, prod, etc.)
Login: <your-redshift-username> (for example, awsuser)
Password: <your-redshift-password>
Port: <your-redshift-port> (for example, 5439)
(source)
and an option for using an IAM role (there is an example in the first link).
Disclaimer: I work at Astronomer :)
EDIT: Tested the following with Airflow 2.5.0 and Amazon provider 6.2.0:
Added the IP of my Airflow instance to the VPC security group with "All traffic" access.
Airflow Connection with the connection id aws_default, Connection type "Amazon Web Services", extra: { "aws_access_key_id": "<your-access-key-id>", "aws_secret_access_key": "<your-secret-access-key>", "region_name": "<your-region-name>" }. All other fields blank. I used a root key for my toy-aws. If you use other credentials you need to make sure that IAM role has access and the right permissions to the Redshift cluster (there is a list in the link above).
Operator code:
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator

red = RedshiftDataOperator(
    task_id="red",
    database="dev",
    sql="SELECT * FROM dev.public.users LIMIT 5;",
    cluster_identifier="redshift-cluster-1",
    db_user="awsuser",
    aws_conn_id="aws_default"
)

GCP| Composer Dataproc submit job| Auth credential not found

I am running a GCP Composer cluster on GKE. I am defining a DAG to submit a job to a Dataproc cluster. I have read the GCP doc, and it says that Composer's service account will get used by the workers to send the Dataproc API requests.
But DataprocSubmitJobOperator reports an error in getting the auth credentials.
Stack trace below. Composer env info attached.
I need suggestions to fix this issue.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1448} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=harshit.bapna#dexterity.ai
AIRFLOW_CTX_DAG_ID=dataproc_spark_operators
AIRFLOW_CTX_TASK_ID=pyspark_task
AIRFLOW_CTX_EXECUTION_DATE=2022-08-23T16:03:16.986859+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-08-23T16:03:16.986859+00:00
[2022-08-23, 16:03:25 UTC] {dataproc.py:1847} INFO - Submitting job
[2022-08-23, 16:03:25 UTC] {credentials_provider.py:312} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/dataproc.py", line 1849, in execute
job_object = self.hook.submit_job(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 869, in submit_job
client = self.get_job_client(region=region)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/dataproc.py", line 258, in get_job_client
credentials=self._get_credentials(), client_info=CLIENT_INFO, client_options=client_options
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 261, in _get_credentials
credentials, _ = self._get_credentials_and_project_id()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 240, in _get_credentials_and_project_id
credentials, project_id = get_credentials_and_project_id(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 321, in get_credentials_and_project_id
return _CredentialProvider(*args, **kwargs).get_credentials_and_project()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 229, in get_credentials_and_project
credentials, project_id = self._get_credentials_using_adc()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/utils/credentials_provider.py", line 307, in _get_credentials_using_adc
credentials, project_id = google.auth.default(scopes=self.scopes)
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 459, in default
credentials, project_id = checker()
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 221, in _get_explicit_environ_credentials
credentials, project_id = load_credentials_from_file(
File "/opt/python3.8/lib/python3.8/site-packages/google/auth/_default.py", line 107, in load_credentials_from_file
raise exceptions.DefaultCredentialsError(
google.auth.exceptions.DefaultCredentialsError: File celery was not found.
[2022-08-23, 16:03:25 UTC] {taskinstance.py:1279} INFO - Marking task as UP_FOR_RETRY. dag_id=dataproc_spark_operators, task_id=pyspark_task, execution_date=20220823T160316, start_date=20220823T160324, end_date=20220823T160325
[2022-08-23, 16:03:25 UTC] {standard_task_runner.py:93} ERROR - Failed to execute job 32837 for task pyspark_task (File celery was not found.; 356144)
[2022-08-23, 16:03:26 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-08-23, 16:03:26 UTC] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
GCP Composer Env
Based on the error File celery was not found, I think that Application Default Credentials (ADC) tries to read a file named celery and doesn't find it. So check whether you set the environment variable GOOGLE_APPLICATION_CREDENTIALS, because if you set it, ADC will read the file to use it:
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the service account key or configuration file that the variable points to.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS isn't set, ADC uses the service account that is attached to the resource that is running your code.
This service account might be a default service account provided by Compute Engine, Google Kubernetes Engine, App Engine, Cloud Run, or Cloud Functions. It might also be a user-managed service account that you created.
GCP doc
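The two ADC rules quoted in the answer above can be checked before a task runs. This is an added example, not code from the thread or from google-auth; `diagnose_adc` and `preflight` are hypothetical helper names, and the check models only the GOOGLE_APPLICATION_CREDENTIALS variable.

```python
import json
import os


def diagnose_adc(environ):
    """Return (status, detail) for the credential source ADC would select.

    Simplified model of the two rules quoted above; it inspects only the
    GOOGLE_APPLICATION_CREDENTIALS variable and never returns key material.
    """
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path is None:
        return ("attached-service-account",
                "variable unset: ADC falls back to the service account "
                "attached to the resource running the code")
    if not os.path.isfile(path):
        # Same shape as the failure in the log above: the variable held the
        # word "celery", which is not a path to any file.
        return ("file-not-found", f"File {path} was not found.")
    try:
        with open(path, encoding="utf-8") as handle:
            info = json.load(handle)
    except (OSError, ValueError) as exc:
        return ("unreadable", f"{path}: {type(exc).__name__}")
    if not isinstance(info, dict) or "type" not in info:
        return ("invalid", f"{path}: JSON has no 'type' field")
    identity = info.get("client_email", "<no client_email>")
    return ("key-file", f"type={info['type']} identity={identity}")


def preflight(environ=None):
    """Raise before the task starts if the variable is set but unusable."""
    status, detail = diagnose_adc(os.environ if environ is None else environ)
    if status in ("file-not-found", "unreadable", "invalid"):
        raise RuntimeError(
            f"GOOGLE_APPLICATION_CREDENTIALS is set but unusable: {detail}")
    return status, detail


if __name__ == "__main__":
    # Reports on the current process environment only.
    print(diagnose_adc(os.environ))
```

The "key-file" result also names the identity in the file, which makes it visible when a job is about to run as a different service account than intended.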

GoogleStorageException - 401 Unauthorized / Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket

I want to transfer data from GCS to BigQuery by embulk and digdag.
But error occurs.
com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
.......
Error: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
↓ Details
command :
embulk run XXXX.yaml
XXXX.yaml :
in:
  type: gcs
  bucket: <bucket name>
  path_prefix: <file path>
  auth_method: compute_engine
  parser:
    type: poi_excel
    sheets: <sheet name>
    skip_header_lines: 4
    columns:
    - {name: 'name', type: string}
    .
    .
    .
out:
  type: bigquery
  mode: replace
  project: <project name>
  dataset: <dataset name>
  table: <table name>
  auth_method: compute_engine
  schema_file: <file name of json type>
  gcs_bucket: <gcs tmp bucket name>
output :
$ embulk run target_item_bottoms_config.yaml
2020-07-22 14:27:36.559 +0900: Embulk v0.9.23
2020-07-22 14:27:37.609 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2020-07-22 14:27:40.577 +0900 [INFO] (main): Gem's home and path are set by default: "/Users/oniki/.embulk/lib/gems"
2020-07-22 14:27:41.662 +0900 [INFO] (main): Started Embulk v0.9.23
2020-07-22 14:27:41.853 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-gcs (0.3.2)
2020-07-22 14:27:46.263 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-bigquery (0.6.4)
2020-07-22 14:27:46.369 +0900 [INFO] (0001:transaction): Loaded plugin embulk-parser-poi_excel (0.1.7)
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:566)
at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:35)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
at org.embulk.spi.Exec.doWith(Exec.java:22)
at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:242)
at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:291)
at org.embulk.EmbulkRunner.run(EmbulkRunner.java:155)
at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:431)
at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:90)
at org.embulk.cli.Main.main(Main.java:64)
Suppressed: java.lang.NullPointerException
at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:463)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:397)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:394)
at org.embulk.spi.Exec.doWith(Exec.java:22)
at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:394)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:245)
... 5 more
Caused by: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at org.embulk.input.gcs.AuthUtils.newClient(AuthUtils.java:81)
at org.embulk.input.gcs.GcsFileInput.listFiles(GcsFileInput.java:49)
at org.embulk.input.gcs.GcsFileInputPlugin.transaction(GcsFileInputPlugin.java:59)
at org.embulk.spi.FileInputRunner.transaction(FileInputRunner.java:62)
at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:507)
... 11 more
Caused by: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:226)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.list(HttpStorageRpc.java:366)
at com.google.cloud.storage.StorageImpl$8.call(StorageImpl.java:338)
at com.google.cloud.storage.StorageImpl$8.call(StorageImpl.java:335)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.storage.StorageImpl.listBlobs(StorageImpl.java:334)
at com.google.cloud.storage.StorageImpl.list(StorageImpl.java:290)
at org.embulk.input.gcs.AuthUtils.newClient(AuthUtils.java:77)
... 15 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
"code" : 401,
"errors" : [ {
"domain" : "global",
"location" : "Authorization",
"locationType" : "header",
"message" : "Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.",
"reason" : "required"
} ],
"message" : "Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:401)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.list(HttpStorageRpc.java:356)
... 23 more
Error: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
my environment :
$ gcloud config list
[compute]
region = us-east1
zone = us-east1-c
[core]
account = myname#xxx.com
disable_usage_reporting = False
project = <project ID>
Your active configuration is: [default]
$ gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* myname#xxxx.com
To set the active account, run:
$ gcloud config set account `ACCOUNT`
$ gsutil ls
gs://<bucket name>
my gcp IAM role :
owner
I understand that the solution to this error is authorization.
But my preferences seem to be fine.
What's wrong?
According to the documentation [1], if we have a 401 Unauthorized error then there could be many reasons. The related list of reasons from link [1] is given below, which could be helpful for troubleshooting:
Reason: AuthenticationRequiredRequesterPays
Access to a Requester Pays bucket requires authentication.
Reason: authError
This error indicates a problem with the authorization provided in the request to Cloud Storage. The following are some situations where that will occur:
The OAuth access token has expired and needs to be refreshed. This can be avoided by refreshing the access token early, but code can also catch this error, refresh the token and retry automatically.
Multiple non-matching authorizations were provided; choose one mode only.
The OAuth access token's bound project does not match the project associated with the provided developer key.
The Authorization header was of an unrecognized format or uses an unsupported credential type.
Reason: lockedDomainExpired
When downloading content from a cookie-authenticated site, e.g., using the Storage Browser, the response will redirect to a temporary domain. This error will occur if access to said domain occurs after the domain expires. Issue the original request again, and receive a new redirect.
Reason: push.webhookUrlUnauthorized
Requests to storage.objects.watchAll will fail unless you verify you own the domain.
Reason: required
Access to a non-public method that requires authorization was made, but none was provided in the Authorization header or through other means.
[1] https://cloud.google.com/storage/docs/json_api/v1/status-codes#401_Unauthorized
I tried locally: I created a Service Account key and saved it locally.
◾️XXXX.yaml
before
auth_method: compute_engine
after
auth_method: json_key
json_keyfile: /path/to/json_keyfile.json
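The 403 in the Dataproc question and the 401 in this thread are different failures: one request carried an identity that was denied, the other carried no identity at all. The following added example (not code from either thread) separates the two from the JSON error body; `triage_gcs_error` is a hypothetical helper, and the sample bodies are constructed with a placeholder principal.

```python
import json
import re

# Message shape visible in both payloads on this page:
#   "<principal> does not have <permission> access to the Google Cloud Storage ..."
_MESSAGE = re.compile(
    r"^(?P<who>Anonymous caller|\S+) does not have "
    r"(?P<perm>[A-Za-z]+(?:\.[A-Za-z]+)+) access"
)

NEXT_CHECK = {
    "no-credentials-sent": "no Authorization header reached Cloud Storage; "
                           "check the client's auth method before touching IAM roles",
    "authenticated-but-denied": "the request was authenticated; compare this principal "
                                "with the one you expected and check its access to the resource",
    "unclassified": "read the reason and message fields by hand",
    "unparseable": "body is not a JSON error object",
}


def triage_gcs_error(body):
    """Classify a Cloud Storage JSON error body without echoing more than it contains."""
    try:
        doc = json.loads(body)
    except ValueError:
        return {"finding": "unparseable", "next_check": NEXT_CHECK["unparseable"]}
    if not isinstance(doc, dict):
        return {"finding": "unparseable", "next_check": NEXT_CHECK["unparseable"]}

    errors = doc.get("errors")
    first = errors[0] if isinstance(errors, list) and errors and isinstance(errors[0], dict) else {}
    message = first.get("message") or doc.get("message") or ""
    match = _MESSAGE.match(message)
    who = match.group("who") if match else None

    result = {
        "code": doc.get("code"),
        "reason": first.get("reason"),
        # "Anonymous caller" means no identity, so it is not reported as a principal.
        "principal": None if who in (None, "Anonymous caller") else who,
        "permission": match.group("perm") if match else None,
    }
    if result["code"] == 401 and (who == "Anonymous caller" or result["reason"] == "required"):
        result["finding"] = "no-credentials-sent"
    elif result["code"] == 403 and result["principal"]:
        result["finding"] = "authenticated-but-denied"
    else:
        result["finding"] = "unclassified"
    result["next_check"] = NEXT_CHECK[result["finding"]]
    return result


def _sample(code, reason, message):
    """Build a constructed error body in the layout of the payloads above."""
    return json.dumps({
        "code": code,
        "errors": [{"domain": "global", "message": message, "reason": reason}],
        "message": message,
    })


# Constructed samples; the principal is a placeholder, not a value from this page.
SAMPLE_403 = _sample(
    403, "forbidden",
    "example-sa@example-project.iam.gserviceaccount.com does not have "
    "storage.objects.get access to the Google Cloud Storage object.")
SAMPLE_401 = _sample(
    401, "required",
    "Anonymous caller does not have storage.objects.list access to the "
    "Google Cloud Storage bucket.")

if __name__ == "__main__":
    for sample in (SAMPLE_403, SAMPLE_401):
        print(triage_gcs_error(sample))
```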

Unable to load AWS credentials from any provider in the chain - error - when trying to load model from S3

I have an MLLib model saved in a folder on S3, say bucket-name/test-model. Now, I have a spark cluster (let's say on a single machine for now). I am running the following commands to load the model:
pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
Then,
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.awsAccessKeyId", AWS_ACCESS_KEY)
hadoopConf.set("fs.s3a.awsSecretAccessKey", AWS_SECRET_KEY)
hadoopConf.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
m1 = RandomForestClassificationModel.load('s3a://test-bucket/test-model')
and I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.6/site-packages/pyspark/ml/util.py", line 362, in load
return cls.read().load(path)
File "/home/user/.local/lib/python3.6/site-packages/pyspark/ml/util.py", line 300, in load
java_obj = self._jread.load(path)
File "/home/user/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/user/.local/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/user/.local/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1343)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:615)
at org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:427)
at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:316)
at org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:306)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Honestly, these lines of code are taken from the web and I have no idea about storing and loading MLLib models on to S3. Any help here will be appreciated and also the next step for me is to do the same on a cluster of machines. So any heads up will also be appreciated.
You are using the wrong property names for the s3a connector.
See https://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/#Authentication_properties
Specifically:
fs.s3a.access.key your access key
fs.s3a.secret.key your secret key
Note in particular
it's lower case
there are dots/periods between access and key, secret and key
The mixedCaseOptions are from the s3n connector, which is obsolete and has long been deleted from the Hadoop codebase. The s3a connector will simply ignore them.
The AWS Java SDK has a credential resolution logic/chain to properly resolve the AWS credentials to use when interfacing with AWS services.
See http://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html
This error means the SDK could not find credentials in any of the places the SDK looks at. Make sure the credentials exist in at least one of the places mentioned in the above link.
As a starting point, populate environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The AWS SDK for Java uses the EnvironmentVariableCredentialsProvider class to load these credentials.
This piece of code did the trick for me.
First, define AWS credential:
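import configparser
import os

# Read the key pair from a local config file and export it to the environment
config = configparser.ConfigParser()
with open('aws/dl.cfg') as cfg_file:
    config.read_file(cfg_file)
os.environ["AWS_ACCESS_KEY_ID"] = config['default']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"] = config['default']['AWS_SECRET_ACCESS_KEY']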
Then, start a session like this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.awsAccessKeyId", os.environ['AWS_ACCESS_KEY_ID']) \
    .config("spark.hadoop.fs.s3a.awsSecretAccessKey", os.environ['AWS_SECRET_ACCESS_KEY']) \
    .getOrCreate()
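The first answer in this thread says s3a reads fs.s3a.access.key and fs.s3a.secret.key and ignores the mixed-case names. As an added example (not code from any answer), the check below audits a configuration dictionary for credential options that would be ignored; `audit_s3a_credentials` is a hypothetical helper, and it also accepts the `spark.hadoop.`-prefixed form used in the snippet above.

```python
import re

SPARK_PREFIX = "spark.hadoop."  # Spark copies spark.hadoop.* into the Hadoop conf
S3A_PREFIX = "fs.s3a."
# Supported option names, as given in the first answer above.
ACCESS, SECRET = "fs.s3a.access.key", "fs.s3a.secret.key"
# Squashed spellings (lower-cased, punctuation removed) mapped to the option
# the author most likely meant: the supported names plus the mixed-case names
# that appear on this page.
INTENT = {
    "accesskey": ACCESS, "awsaccesskeyid": ACCESS,
    "secretkey": SECRET, "awssecretaccesskey": SECRET,
}


def audit_s3a_credentials(conf):
    """Report static-credential options that s3a would not pick up.

    Findings name keys only; configured values are never included.
    """
    findings, configured = [], set()
    for raw_key in sorted(conf):
        has_prefix = raw_key.startswith(SPARK_PREFIX)
        key = raw_key[len(SPARK_PREFIX):] if has_prefix else raw_key
        if not key.startswith(S3A_PREFIX):
            continue
        squashed = re.sub(r"[^a-z0-9]", "", key[len(S3A_PREFIX):].lower())
        intended = INTENT.get(squashed)
        if intended is None:
            continue  # endpoint, impl and other non-credential options
        if key == intended:
            if conf[raw_key]:
                configured.add(intended)
            else:
                findings.append(f"{raw_key}: set but empty")
        else:
            # Wrong case or missing dots: the two points the answer calls out.
            fix = (SPARK_PREFIX if has_prefix else "") + intended
            findings.append(f"{raw_key}: not an s3a option, will be ignored; use {fix}")
    if len(configured) == 1:
        lonely = next(iter(configured))
        findings.append(f"{lonely} is set without its counterpart")
    return {
        "static_keys_configured": configured == {ACCESS, SECRET},
        "findings": findings,
    }


# Constructed sample shaped like the configuration in the question; the
# values are placeholders.
QUESTION_CONF = {
    "fs.s3a.awsAccessKeyId": "<placeholder-access-key>",
    "fs.s3a.awsSecretAccessKey": "<placeholder-secret-key>",
    "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}

if __name__ == "__main__":
    for line in audit_s3a_credentials(QUESTION_CONF)["findings"]:
        print(line)
```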

AWS create cluster running into error

New to AWS. The below job is running into an error. Please help.
mrjob create-cluster --instance-type m1.medium --region us-east-1 --num-core-instances 3
Trace:
File "c:\python27\lib\site-packages\mrjob\emr.py", line 817, in _set_cloud_tmp_dir
if (tmp_bucket.get_location() == s3_location_constraint_for_region(
File "c:\python27\lib\site-packages\boto\s3\bucket.py", line 1146, in get_location
response.status, response.reason, body)
error: boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
InvalidRequest
The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
AFAFB32563D25847nQrsnGiNfsvYpDYxIWKlvOCWEp5VPuPm2mEkKuDvb29+SpRCjs029CYTx3SjHEQJH5zuYB9XUUo=