Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request - amazon-web-services

I am trying to read and write data to AWS S3 from Apache Spark Kubernetes Containervia vpc endpoint
The Kubernetes container is on premise (data center) in US region . Following is the Pyspark code to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = (
SparkConf()
.setAppName("PySpark S3 Example")
.set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
.set("spark.hadoop.fs.s3a.endpoint","<vpc-endpoint>")
.set("spark.hadoop.fs.s3a.access.key", "<access_key>")
.set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
.set("spark.executor.extraJavaOptions","-Dcom.amazonaws.services.s3.enableV4=true")
.set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
.set("spark.fs.s3a.path.style.access", "true")
.set("spark.hadoop.fs.s3a.server-side-encryption-algorithm","SSE-KMS")
.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated

unless your hadoop s3a client is region aware (3.3.1+), setting that region option won't work. There's an aws sdk option "aws.region which you can set as as a system property instead.

Related

Issue creating AWS DMS task from boto3 script

Using the below python boto3 script to create the AWS DMS task and start the replication task, but getting the below error:
Error:
botocore.errorfactory.InvalidResourceStateFault: An error occurred (InvalidResourceStateFault) when calling the StartReplicationTask operation: Replication Task cannot be started, invalid state
Python script:
#!/usr/bin/python
import boto3
client_dms = boto3.client('dms')
#Create a replication DMS task
response = client_dms.create_replication_task(
ReplicationTaskIdentifier='test-new1',
ResourceIdentifier='test-new',
ReplicationInstanceArn='arn:aws:dms:us-east-1:xxxxxxxxxx:rep:test1',
SourceEndpointArn='arn:aws:dms:us-east-1:xxxxxxxxxx:endpoint:source',
TargetEndpointArn='arn:aws:dms:us-east-1:xxxxxxxxxx:endpoint:target',
MigrationType='full-load',
TableMappings='{\n \"TableMappings\": [\n {\n \"Type\": \"Include\",\n \"SourceSchema\": \"test\",\n \"SourceTable\": \"table_name\"\n}\n ]\n}\n\n'
)
#Start the task from DMS
response = client_dms.start_replication_task(
ReplicationTaskArn='arn:aws:dms:us-east-1:xxxxxxxxxx:task:test-new',
StartReplicationTaskType='start-replication'
)
Probably have to use waiter for the task to be ready:
ReplicationTaskReady
before you can perform other actions on it.

AWS unable to get query result because of ResourceNotFoundException

I'm trying to get cloudwatch query with boto3, but I'm getting ResourceNotFoundException.
import boto3
if __name__ == "__main__":
client = boto3.client('logs')
response = client.start_query(
logGroupName='/aws/lambda/My-Stack-Name-SE349DJ',
startTime=123,
endTime=123,
queryString="fields #message",
limit=1
)
I attempted to the above code. And an error message is as follows.
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group '/aws/lambda/My-Stack-Name-SE349DJ' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
What I tested are as below.
The log group exists. I tested it with Logs Insights on the aws console. Also I tested after paste the log group as it is.
I added a backslash to test if '/' is a problem (ex. '/aws/lambda/My-Stack-Name-SE349DJ') and InvalidParameterException appears.
The aws account has administrate access privileges in the log group.
I got the same error message when I tested with aws cli.
An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group 'XXXXXXXXXXXXXX' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
How can I solve this problem?
Actually the reason why I'm trying this is because I need to get more than 500,000 data from the filtered log group, but 10,000 are the maximum. I think It's better to pull it out by changing the start time and end time.
There is a high possibility that there are too many data in certain time, so I think it would be better to run it with boto3 rather than directly. Is there an easy way to extract more than 500,000 pieces of data from the console or other methods?
As #Marcin commented, It was because of the region configuration.
I added these lines before creating an aws client.
from botocore.config import Config
...
my_config = Config(
region_name = 'us-east-2',
)
...
client = boto3.client('logs', config=my_config)

AWS Batch - Access denied 403

I am using AWS Batch with ECS to perform a job which need to send a request to Athena. I use python boto3 to send the query and the get the request status :
start_query_execution : work fine
get_query_execution : have an error !
When I try to get the query execution I have the following error :
{'QueryExecution': {'QueryExecutionId': 'XXXX', 'Query': "SELECT * FROM my_table LIMIT 10 ", 'StatementType': 'DML', 'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/query_id.csv'}, 'QueryExecutionContext': {'Database': 'my_database'}, 'Status': {'State': 'FAILED', 'StateChangeReason': '**Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 4.**. ; S3 Extended Request ID: ....=)'
I have the all permissions to the container role (only to test) :
s3:*
athena : *
glue : *
I face this problem only in container in AWS batch : with the same policy and code in a lambda it's working !
Any help will be appreciated.
In Athena Output location what I have been using Athena bucket name not file name.
As result set will be generated which will have its own id
'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/'}
If ypu are not sure of the bucket for query you can check in query console -->settings

How to set aws proxy host to Spark config

Any idea how to set aws proxy host, and region to spark session or spark context.
I am able to set in aws javasdk code, and it is working fine.
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setProxyHost("aws-proxy-qa.xxxxx.organization.com");
clientConfig.setProxyPort(8099));
AmazonS3ClientBuilder.standard()
.withRegion(getAWSRegion(Regions.US_WEST_2)
.withClientConfiguration(clientConfig) //Setting aws proxy host
Can help me to set same thing to spark context ( both region and proxy) since i am reading a s3 file which is different region from emr region.
based on fs.s3a.access.key and fs.s3a.secret.key region will be automatically determined.
just like other s3 properties
set this to sparkConf
/**
* example getSparkSessionForS3
* #return
*/
def getSparkSessionForS3():SparkSession = {
val conf = new SparkConf()
.setAppName("testS3File")
.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.set("spark.hadoop.fs.s3a.endpoint", "yourendpoint")
.set("spark.hadoop.fs.s3a.connection.maximum", "200")
.set("spark.hadoop.fs.s3a.fast.upload", "true")
.set("spark.hadoop.fs.s3a.connection.establish.timeout", "500")
.set("spark.hadoop.fs.s3a.connection.timeout", "5000")
.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.set("spark.hadoop.com.amazonaws.services.s3.enableV4", "true")
.set("spark.hadoop.com.amazonaws.services.s3.enforceV4", "true")
.set("spark.hadoop.fs.s3a.proxy.host","yourhost")
val spark = SparkSession
.builder()
.config(conf)
.getOrCreate()
spark
}

Pyspark - read data from elasticsearch cluster on EMR

I am trying to read data from elasticsearch from pyspark. I was using the elasticsearch-hadoop api in Spark. The es cluster sits on aws emr, which requires credential to sign in. My script is as below:
from pyspark import SparkContext, SparkConf sc.stop()
conf = SparkConf().setAppName("ESTest") sc = SparkContext(conf=conf)
es_read_conf = { "es.host" : "vhost", "es.nodes" : "node", "es.port" : "443",
"es.query": '{ "query": { "match_all": {} } }',
"es.input.json": "true", "es.net.https.auth.user": "aws_access_key",
"es.net.https.auth.pass": "aws_secret_key", "es.net.ssl": "true",
"es.resource" : "index/type", "es.nodes.wan.only": "true"
}
es_rdd = sc.newAPIHadoopRDD( inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=es_read_conf)
Pyspark keeps throwing error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; servernode:443] returned [403|Forbidden:]
I checked everything which all made sense except for the user and pass entries, would aws access key and secret key work here? We don't want to use the console user and password here for security purpose. Is there a different way to do the same thing?