Databricks notebook shows error java.io.IOException: Error getting access token from metadata server - google-cloud-platform

I'm using https://community.cloud.databricks.com/ (notebook), and when I try to access GCP Storage through the Python command below:
df = spark.read.format("csv").load("gs://test-gcs-doc-bucket-pr/test")
Error:
java.io.IOException: Error getting access token from metadata server at: 169.254.169.xxx/computeMetadata/v1/instance/service-accounts/default/token
Databricks Spark Configuration:
spark.hadoop.fs.gs.auth.client_id "10"
spark.hadoop.fs.gs.auth.auth_uri "https://accounts.google.com/o/oauth2/auth"
spark.databricks.delta.preview.enabled true
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email "test-gcs.iam.gserviceaccount.com"
spark.hadoop.fs.gs.auth.token_uri "https://oauth2.googleapis.com/token"
spark.hadoop.fs.gs.project_id "oval-replica-9999999"
spark.hadoop.fs.gs.auth.service.account.private_key "--BEGIN"
spark.hadoop.fs.gs.auth.service.account.private_key_id "3f869c98d389bb28c5b13a0e31785e73d8b"
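For reference, a minimal sketch (untested; all values are placeholders taken from the config above) of applying the same service-account properties from a notebook cell through the Hadoop configuration. Keys written as spark.hadoop.X in the cluster Spark config map to plain X here:
# Sketch only: set the GCS connector properties listed above at runtime.
# The keys simply mirror the question's cluster config; values are placeholders.
hconf = spark._jsc.hadoopConfiguration()
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("fs.gs.project_id", "oval-replica-9999999")
hconf.set("fs.gs.auth.service.account.email", "test-gcs.iam.gserviceaccount.com")
hconf.set("fs.gs.auth.service.account.private_key_id", "<private key id>")
hconf.set("fs.gs.auth.service.account.private_key", "<private key>")
df = spark.read.format("csv").load("gs://test-gcs-doc-bucket-pr/test")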

Related

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (data center) in the US region. Following is the PySpark code to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated.
Unless your Hadoop S3A client is region aware (3.3.1+), setting that region option won't work. There's an AWS SDK option "aws.region" which you can set as a system property instead.
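For example, a rough sketch (not verified against this setup) of passing that system property to the driver and executors through extraJavaOptions in the same SparkConf:
from pyspark.conf import SparkConf

# Sketch only: set the AWS SDK "aws.region" system property on driver and executors
# instead of relying on fs.s3a.endpoint.region (which needs Hadoop 3.3.1+).
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.driver.extraJavaOptions", "-Daws.region=us-east-1")
    .set("spark.executor.extraJavaOptions", "-Daws.region=us-east-1")
    # ... remaining s3a settings as in the question ...
)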

AWS unable to get query result because of ResourceNotFoundException

I'm trying to run a CloudWatch Logs Insights query with boto3, but I'm getting ResourceNotFoundException.
import boto3

if __name__ == "__main__":
    client = boto3.client('logs')
    response = client.start_query(
        logGroupName='/aws/lambda/My-Stack-Name-SE349DJ',
        startTime=123,
        endTime=123,
        queryString="fields @message",
        limit=1
    )
I ran the above code, and the error message is as follows.
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group '/aws/lambda/My-Stack-Name-SE349DJ' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
Here is what I tested.
The log group exists. I tested it with Logs Insights in the AWS console, and I also tested after pasting the log group name exactly as it appears.
I added a backslash to test whether '/' is the problem (e.g. '/aws/lambda/My-Stack-Name-SE349DJ'), and InvalidParameterException appears.
The AWS account has administrator access privileges on the log group.
I got the same error message when I tested with the AWS CLI.
An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group 'XXXXXXXXXXXXXX' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
How can I solve this problem?
Actually, the reason I'm trying this is that I need to get more than 500,000 records from the filtered log group, but 10,000 is the maximum per query. I think it's better to pull the data out by varying the start time and end time.
There is a high chance that there is too much data in certain time ranges, so I think it would be better to run it with boto3 rather than directly. Is there an easy way to extract more than 500,000 records from the console or by other means?
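As a rough, untested sketch of that time-window idea (the window size, epoch timestamps, log group, region, and query string are placeholders), the same query could be run over consecutive slices with boto3 so each slice stays under the 10,000-row cap:
import time
import boto3

client = boto3.client('logs', region_name='us-east-2')  # region is a placeholder

def query_window(log_group, query, start, end):
    # Run one Logs Insights query for [start, end) and wait for the results.
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
        limit=10000,
    )['queryId']
    while True:
        resp = client.get_query_results(queryId=query_id)
        if resp['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return resp['results']
        time.sleep(1)

results = []
window = 3600  # one-hour slices; shrink this if a slice still hits the row limit
for start in range(1690000000, 1690086400, window):  # placeholder epoch range
    results.extend(query_window('/aws/lambda/My-Stack-Name-SE349DJ',
                                'fields @message', start, start + window))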
As @Marcin commented, it was because of the region configuration.
I added these lines before creating the AWS client.
from botocore.config import Config
...
my_config = Config(
    region_name='us-east-2',
)
...
client = boto3.client('logs', config=my_config)

Error returned: 'OLE DB or ODBC error: [DataSource.Error] Teradata: [Teradata Database] [3119] Continue request submitted but no response to return

Failed to save modifications to the server. Error returned: 'OLE DB or ODBC error: [DataSource.Error] Teradata: [Teradata Database] [3119] Continue request submitted but no response to return.'
When trying to connect a view to my Power BI file, I get the above error once around 25M records have been imported. I do not have any issues with smaller tables.

GoogleStorageException - 401 Unauthorized / Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket

I want to transfer data from GCS to BigQuery with Embulk and Digdag, but an error occurs.
com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
.......
Error: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
Details:
Command:
embulk run XXXX.yaml
XXXX.yaml:
in:
  type: gcs
  bucket: <bucket name>
  path_prefix: <file path>
  auth_method: compute_engine
  parser:
    type: poi_excel
    sheets: <sheet name>
    skip_header_lines: 4
    columns:
    - {name: 'name', type: string}
    .
    .
    .
out:
  type: bigquery
  mode: replace
  project: <project name>
  dataset: <dataset name>
  table: <table name>
  auth_method: compute_engine
  schema_file: <file name of json type>
  gcs_bucket: <gcs tmp bucket name>
Output:
$ embulk run target_item_bottoms_config.yaml
2020-07-22 14:27:36.559 +0900: Embulk v0.9.23
2020-07-22 14:27:37.609 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2020-07-22 14:27:40.577 +0900 [INFO] (main): Gem's home and path are set by default: "/Users/oniki/.embulk/lib/gems"
2020-07-22 14:27:41.662 +0900 [INFO] (main): Started Embulk v0.9.23
2020-07-22 14:27:41.853 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-gcs (0.3.2)
2020-07-22 14:27:46.263 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-bigquery (0.6.4)
2020-07-22 14:27:46.369 +0900 [INFO] (0001:transaction): Loaded plugin embulk-parser-poi_excel (0.1.7)
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:566)
at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:35)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
at org.embulk.spi.Exec.doWith(Exec.java:22)
at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:242)
at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:291)
at org.embulk.EmbulkRunner.run(EmbulkRunner.java:155)
at org.embulk.cli.EmbulkRun.runSubcommand(EmbulkRun.java:431)
at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:90)
at org.embulk.cli.Main.main(Main.java:64)
Suppressed: java.lang.NullPointerException
at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:463)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:397)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:394)
at org.embulk.spi.Exec.doWith(Exec.java:22)
at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:394)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:245)
... 5 more
Caused by: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at org.embulk.input.gcs.AuthUtils.newClient(AuthUtils.java:81)
at org.embulk.input.gcs.GcsFileInput.listFiles(GcsFileInput.java:49)
at org.embulk.input.gcs.GcsFileInputPlugin.transaction(GcsFileInputPlugin.java:59)
at org.embulk.spi.FileInputRunner.transaction(FileInputRunner.java:62)
at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:507)
... 11 more
Caused by: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:226)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.list(HttpStorageRpc.java:366)
at com.google.cloud.storage.StorageImpl$8.call(StorageImpl.java:338)
at com.google.cloud.storage.StorageImpl$8.call(StorageImpl.java:335)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.storage.StorageImpl.listBlobs(StorageImpl.java:334)
at com.google.cloud.storage.StorageImpl.list(StorageImpl.java:290)
at org.embulk.input.gcs.AuthUtils.newClient(AuthUtils.java:77)
... 15 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 401 Unauthorized
{
"code" : 401,
"errors" : [ {
"domain" : "global",
"location" : "Authorization",
"locationType" : "header",
"message" : "Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.",
"reason" : "required"
} ],
"message" : "Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket."
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:401)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:499)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:549)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.list(HttpStorageRpc.java:356)
... 23 more
Error: org.embulk.config.ConfigException: com.google.cloud.storage.StorageException: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket.
My environment:
$ gcloud config list
[compute]
region = us-east1
zone = us-east1-c
[core]
account = myname@xxx.com
disable_usage_reporting = False
project = <project ID>
Your active configuration is: [default]
$ gcloud auth list
Credentialed Accounts
ACTIVE ACCOUNT
* myname@xxxx.com
To set the active account, run:
$ gcloud config set account `ACCOUNT`
$ gsutil ls
gs://<bucket name>
My GCP IAM role:
owner
I understand that the solution to this error is authorization, but my settings seem to be fine.
What's wrong?
As per the documentation [1], a 401 Unauthorized error can have many reasons. A list of those reasons follows [taken from link 1], which could be helpful for troubleshooting:
Reason: AuthenticationRequiredRequesterPays
Access to a Requester Pays bucket requires authentication.
Reason: authError
This error indicates a problem with the authorization provided in the request to Cloud Storage. The following are some situations where that will occur:
The OAuth access token has expired and needs to be refreshed. This can be avoided by refreshing the access token early, but code can also catch this error, refresh the token and retry automatically.
Multiple non-matching authorizations were provided; choose one mode only.
The OAuth access token's bound project does not match the project associated with the provided developer key.
The Authorization header was of an unrecognized format or uses an unsupported credential type.
Reason: lockedDomainExpired
When downloading content from a cookie-authenticated site, e.g., using the Storage Browser, the response will redirect to a temporary domain. This error will occur if access to that domain happens after the domain expires. Issue the original request again to receive a new redirect.
Reason: push.webhookUrlUnauthorized
Requests to storage.objects.watchAll will fail unless you verify you own the domain.
Reason: required
Access to a non-public method that requires authorization was made, but none was provided in the Authorization header or through other means.
[1] https://cloud.google.com/storage/docs/json_api/v1/status-codes#401_Unauthorized
I ran it locally after creating a Service Account key and saving it on my machine.
◾️XXXX.yaml
Before:
auth_method: compute_engine
After:
auth_method: json_key
json_keyfile: /path/to/json_keyfile.json

AWS Batch - Access denied 403

I am using AWS Batch with ECS to perform a job which needs to send a request to Athena. I use Python boto3 to send the query and get the request status:
start_query_execution: works fine
get_query_execution: has an error!
When I try to get the query execution, I get the following error:
{'QueryExecution': {'QueryExecutionId': 'XXXX', 'Query': "SELECT * FROM my_table LIMIT 10 ", 'StatementType': 'DML', 'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/query_id.csv'}, 'QueryExecutionContext': {'Database': 'my_database'}, 'Status': {'State': 'FAILED', 'StateChangeReason': 'Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 4. ; S3 Extended Request ID: ....=)'
I have given all permissions to the container role (only to test):
s3:*
athena : *
glue : *
I face this problem only in the container in AWS Batch: with the same policy and code in a Lambda it works!
Any help will be appreciated.
For the Athena output location, I have been using the Athena bucket name, not a file name,
since the result set that gets generated will have its own id:
'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/'}
If you are not sure of the bucket for the query, you can check in the query console --> Settings.
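A minimal sketch (untested; bucket, prefix, and database names are placeholders) of pointing OutputLocation at a bucket/prefix rather than a file name, with Athena generating its own <query-id>.csv object under that prefix:
import boto3

athena = boto3.client('athena')

# Sketch only: OutputLocation is a prefix, not a file; Athena names the result object itself.
start = athena.start_query_execution(
    QueryString='SELECT * FROM my_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my_bucket_name/athena-results/'},
)
state = athena.get_query_execution(
    QueryExecutionId=start['QueryExecutionId']
)['QueryExecution']['Status']['State']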