How to set the AWS proxy host in the Spark config - amazon-web-services

Any idea how to set the AWS proxy host and region on a Spark session or Spark context?
I am able to set them in AWS Java SDK code, and it works fine:
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setProxyHost("aws-proxy-qa.xxxxx.organization.com");
clientConfig.setProxyPort(8099);
AmazonS3ClientBuilder.standard()
    .withRegion(getAWSRegion(Regions.US_WEST_2))
    .withClientConfiguration(clientConfig) // setting the AWS proxy host
    .build();
Can anyone help me set the same thing on the Spark context (both region and proxy)? I am reading an S3 file that is in a different region from the EMR region.

Based on fs.s3a.access.key and fs.s3a.secret.key, the region will be determined automatically. Like the other S3A properties, set the proxy host on the SparkConf:
/**
 * Example: getSparkSessionForS3
 * @return a SparkSession configured for S3A access
 */
def getSparkSessionForS3(): SparkSession = {
  val conf = new SparkConf()
    .setAppName("testS3File")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.hadoop.fs.s3a.endpoint", "yourendpoint")
    .set("spark.hadoop.fs.s3a.connection.maximum", "200")
    .set("spark.hadoop.fs.s3a.fast.upload", "true")
    .set("spark.hadoop.fs.s3a.connection.establish.timeout", "500")
    .set("spark.hadoop.fs.s3a.connection.timeout", "5000")
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .set("spark.hadoop.com.amazonaws.services.s3.enableV4", "true")
    .set("spark.hadoop.com.amazonaws.services.s3.enforceV4", "true")
    .set("spark.hadoop.fs.s3a.proxy.host", "yourhost")
  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()
  spark
}
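The answer above only sets the proxy host; the question also asks about the proxy port and the bucket's region. A minimal sketch of the extra keys (the values are placeholders, and fs.s3a.endpoint.region is an assumption that needs a region-aware S3A client, i.e. hadoop-aws 3.3.1+; on older versions point fs.s3a.endpoint at the regional endpoint such as s3.us-west-2.amazonaws.com instead):
import org.apache.spark.SparkConf

// Hypothetical additions for proxy port and cross-region reads (placeholder values).
val crossRegionConf = new SparkConf()
  .set("spark.hadoop.fs.s3a.proxy.host", "aws-proxy-qa.xxxxx.organization.com")
  .set("spark.hadoop.fs.s3a.proxy.port", "8099")
  // Requires hadoop-aws 3.3.1+; otherwise set spark.hadoop.fs.s3a.endpoint instead.
  .set("spark.hadoop.fs.s3a.endpoint.region", "us-west-2")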

Related

Idle connections AWS S3 on putObject

I need to copy files from an S3 bucket in one account to another. I am trying to do that via the aws-java-sdk client and its functions getObject and putObject. There are a lot of files that should be uploaded, so during the putObject run I get this error:
Exception in thread "main" software.amazon.awssdk.services.s3.model.S3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400)
How can this issue be fixed?
Here's the code that produces this error:
val clientConfigurationBuilder = ClientOverrideConfiguration.builder()
val clientConfiguration = clientConfigurationBuilder.build
val builder = AWSS3Client
  .builder
  .credentialsProvider(createCredentialsProvider(accessKey, secretKey))
  .region(Region.of(region))
  .overrideConfiguration(clientConfiguration)
  .httpClientBuilder(ApacheHttpClient.builder())
endpoint.map(URI.create).foreach(builder.endpointOverride)
val awsS3Client = builder.build
val getObjectRequest = GetObjectRequest
  .builder()
  .bucket(fromBucket)
  .key(fromKey)
  .build()
val getObjectResponse = awsS3Client.getObject(getObjectRequest)
val putObjectRequest = PutObjectRequest
  .builder()
  .bucket(toBucket)
  .key(toKey)
  .build()
val reqBody = RequestBody.fromInputStream(getObjectResponse,
  getObjectResponse.response().contentLength())
awsS3Client.putObject(putObjectRequest, reqBody)
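No accepted fix is recorded here, but one direction worth sketching (an assumption, not a verified answer for this exact setup): for S3-to-S3 copies, letting S3 perform a server-side copy avoids streaming the object body through the client with getObject/putObject, so no upload socket sits open long enough to hit the idle-connection timeout. A hedged SDK v2 sketch, reusing the question's client and bucket/key variables and assuming a recent SDK version that exposes sourceBucket/sourceKey on CopyObjectRequest (older versions use copySource):
import software.amazon.awssdk.services.s3.model.CopyObjectRequest

// Hypothetical server-side copy; the same credentials must be able to read the
// source bucket and write the destination bucket.
val copyRequest = CopyObjectRequest
  .builder()
  .sourceBucket(fromBucket)
  .sourceKey(fromKey)
  .destinationBucket(toBucket)
  .destinationKey(toKey)
  .build()
// Note: CopyObject handles objects up to 5 GB; larger objects need a multipart copy.
awsS3Client.copyObject(copyRequest)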

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (in a data center) in the US region. The following is the PySpark code used to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1": "val1", "key2": "val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated
Unless your Hadoop S3A client is region-aware (3.3.1+), setting that region option won't work. There is an AWS SDK option, "aws.region", which you can set as a system property instead.
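For example (shown in Scala to match the first answer on this page; the same two configuration keys work identically from the PySpark SparkConf in the question), the property can be pushed to the driver and executors as a JVM option. The region value is a placeholder:
import org.apache.spark.SparkConf

// Hypothetical sketch: pass aws.region as a system property to driver and executors.
val regionConf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Daws.region=us-east-1")
  .set("spark.executor.extraJavaOptions", "-Daws.region=us-east-1")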

How do I improve performance when downloading files from an Amazon S3 Bucket?

I am trying to download a 19 MB file from an Amazon S3 bucket using the AWS SDK, but it takes a lot more time than the AWS CLI. The code I am using is below:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
    .withRegion(Regions.EU_WEST_1)
    .withCredentials(new DefaultAWSCredentialsProviderChain())
    .build();
s3Client.getObject(new GetObjectRequest("bucketName", "path/fileName.zip"), new File("localFileName.zip"));
Comparing the download timings of both mechanisms: the AWS SDK took around 9 minutes to download the file, whereas the AWS CLI took around 5 seconds.
Is there a way to decrease the download time while using the AWS SDK?
The first issue here is that you are using the old AWS SDK for Java, which is V1. Amazon recommends moving to V2 as a best practice.
To learn about AWS SDK for Java V2, see:
Developer guide - AWS SDK for Java 2.x
Here is the code you should use to download an object from an Amazon S3 bucket. This is the V2 S3TransferManager:
package com.example.transfermanager;

import software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.transfer.s3.FileDownload;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import java.nio.file.Paths;

/**
 * To run this AWS code example, ensure that you have set up your development environment, including your AWS credentials.
 *
 * For information, see this documentation topic:
 *
 * https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
 */
public class GetObject {

    public static void main(String[] args) {
        final String usage = "\n" +
            "Usage:\n" +
            "    <bucketName> <objectKey> <objectPath> \n\n" +
            "Where:\n" +
            "    bucketName - the Amazon S3 bucket that contains the object.\n" +
            "    objectKey - the object to download (for example, book.pdf).\n" +
            "    objectPath - the path where the file is written (for example, C:/AWS/book2.pdf). \n\n";

        if (args.length != 3) {
            System.out.println(usage);
            System.exit(1);
        }

        long MB = 1024 * 1024; // bytes per megabyte
        String bucketName = args[0];
        String objectKey = args[1];
        String objectPath = args[2];
        Region region = Region.US_EAST_1;
        S3TransferManager transferManager = S3TransferManager.builder()
            .s3ClientConfiguration(cfg -> cfg.region(region)
                .credentialsProvider(EnvironmentVariableCredentialsProvider.create())
                .targetThroughputInGbps(20.0)
                .minimumPartSizeInBytes(10 * MB))
            .build();

        downloadObjectTM(transferManager, bucketName, objectKey, objectPath);
        System.out.println("Object was successfully downloaded using the Transfer Manager.");
        transferManager.close();
    }

    public static void downloadObjectTM(S3TransferManager transferManager, String bucketName, String objectKey, String objectPath) {
        FileDownload download =
            transferManager.downloadFile(d -> d.getObjectRequest(g -> g.bucket(bucketName).key(objectKey))
                .destination(Paths.get(objectPath)));
        download.completionFuture().join();
    }
}
I just ran this code and downloaded a PDF that is 25 MB in seconds...

AWS: How to programmatically create a RDS Aurora Cluster in Python/Boto3

My application is hosted on Amazon Web Services, and I'm starting to script the creation of all of my app's infrastructure (VPC, security group, Beanstalk, etc.). I did not find the proper way to create an RDS Aurora cluster, and I failed to reproduce the RDS wizard (which helps you create the DB instances and the cluster) in Python with Boto3. Maybe I lack knowledge of infrastructure and networks, but I think creating an Aurora cluster should be within my reach.
So here is my question:
Let's say I have a VPC ID, a security group ID, and some database info (user, password, ...): what are the minimum API calls I have to make to create a cluster and make it usable by my application? The procedure must end with a cluster reader/writer endpoint and a reader-only endpoint.
Here is how I create an Aurora MySQL instance in Python/Boto3. You will have to implement some missing helper functions yourself.
def create_aurora(
    instance_identifier,  # used for instance name and cluster name
    db_username,
    db_password,
    db_name,
    db_port,
    vpc_id,
    vpc_sg,  # must be an array
    dbsubnetgroup_name,
    public_access=False,
    AZ=None,
    instance_type="db.t2.small",
    multi_az=True,
    nb_instance=1,
    extratags=[]
):
    rds = boto3.client('rds')
    # Assumes a DB subnet group exists before creating the cluster. You must have created a
    # DBSubnetGroup associated with the subnets of the VPC of your cluster. AWS will find it automatically.
    #
    # Check whether the cluster already exists
    try:
        db_cluster = rds.describe_db_clusters(
            DBClusterIdentifier=instance_identifier
        )['DBClusters']
        db_cluster = db_cluster[0]
    except botocore.exceptions.ClientError as e:
        psa.printf("Creating empty cluster\r\n")
        res = rds.create_db_cluster(
            DBClusterIdentifier=instance_identifier,
            Engine="aurora",
            MasterUsername=db_username,
            MasterUserPassword=db_password,
            DBSubnetGroupName=dbsubnetgroup_name,
            VpcSecurityGroupIds=vpc_sg,
            AvailabilityZones=AZ
        )
        db_cluster = res['DBCluster']
    cluster_name = db_cluster['DBClusterIdentifier']
    instance_identifier = db_cluster['DBClusterIdentifier']
    psa.printf("Cluster identifier : %s, status : %s, members : %d\n", instance_identifier, db_cluster['Status'], len(db_cluster['DBClusterMembers']))
    if db_cluster['Status'] == 'deleting':
        psa.printf("  Please wait for the cluster to be deleted and try again.\n")
        return None
    psa.printf("  Writer Endpoint : %s\n", db_cluster['Endpoint'])
    psa.printf("  Reader Endpoint : %s\n", db_cluster['ReaderEndpoint'])
    # Now create the instances.
    # Loop over the requested number of instances and balance them across AZs.
    for i in range(1, nb_instance + 1):
        if AZ is not None:
            the_AZ = AZ[(i - 1) % len(AZ)]
            dbinstance_id = instance_identifier + "-" + str(i) + "-" + the_AZ
        else:
            the_AZ = None
            dbinstance_id = instance_identifier + "-" + str(i)
        psa.printf("Creating instance %d named '%s' in AZ %s\n", i, dbinstance_id, the_AZ)
        try:
            res = rds.create_db_instance(
                DBInstanceIdentifier=dbinstance_id,
                DBInstanceClass=instance_type,
                Engine='aurora',
                PubliclyAccessible=False,
                AvailabilityZone=the_AZ,
                DBSubnetGroupName=dbsubnetgroup_name,
                DBClusterIdentifier=instance_identifier,
                Tags=psa.tagsKeyValueToAWStags(extratags)
            )['DBInstance']
            psa.printf("  DbiResourceId=%s\n", res['DbiResourceId'])
        except botocore.exceptions.ClientError as e:
            psa.printf("  Instance seems to exist.\n")
            res = rds.describe_db_instances(DBInstanceIdentifier=dbinstance_id)['DBInstances']
            psa.printf("  Status is %s\n", res[0]['DBInstanceStatus'])
    return db_cluster
Yeah, you are on the right track. Here is the Boto3 documentation for creating an Aurora RDS cluster.
Further, to address the bigger-picture problem (i.e., managing your entire infrastructure as code), you should look at options like Terraform.
Check out the Terraform Git repo; you can accomplish the same task of creating the Aurora cluster with a Terraform template.

Cannot connect from EC2 to S3

I am trying to connect to S3 from an EC2 instance using AmazonS3Client, to get the list of objects present in an S3 bucket. While I can connect to S3 when running this code from my local machine, I am having a hard time running the same code on EC2.
Am I missing any setting or configuration on the EC2 instance?
Code
AWSCredentials credentials = new BasicAWSCredentials("XXXX", "YYYY");
AmazonS3Client conn = new AmazonS3Client(credentials);
String bucketName = "s3-xyz";
String prefix = "123";
ObjectListing objects = conn.listObjects(bucketName, prefix);
List<S3ObjectSummary> objectSummary = objects.getObjectSummaries();
for (S3ObjectSummary os : objectSummary)
{
    System.out.println(os.getKey());
}
Errors
ERROR com.amazonaws.http.AmazonHttpClient - Unable to execute HTTP request: Connect to s3-xyz.amazonaws.com:443 timed out
org.apache.http.conn.ConnectTimeoutException: Connect to s3-xyz.s3.amazonaws.com:443 timed out
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:640)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:318)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:202)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3037)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3008)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:531)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:515)
If the EC2 instance reaches S3 through a proxy, configure the client with the proxy host and port:
ClientConfiguration cc = new ClientConfiguration();
cc.setProxyHost("10.66.80.122");
cc.setProxyPort(8080);
propertiesCredentials = new BasicAWSCredentials(aws_access_key_id, aws_secret_access_key);
s3 = new AmazonS3Client(propertiesCredentials, cc);
To find the proxy host and port, check your LAN settings.
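Putting the answer's proxy configuration together with the question's listing code, a hedged sketch (written in Scala for consistency with the earlier examples on this page; the proxy address, credentials, bucket, and prefix are the placeholders used above):
import com.amazonaws.ClientConfiguration
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client

// Assumed values: proxy from the answer, bucket/prefix from the question.
val cc = new ClientConfiguration()
cc.setProxyHost("10.66.80.122")
cc.setProxyPort(8080)
val credentials = new BasicAWSCredentials("XXXX", "YYYY")
val conn = new AmazonS3Client(credentials, cc)
// List objects through the proxy, mirroring the question's loop.
conn.listObjects("s3-xyz", "123").getObjectSummaries.forEach(os => println(os.getKey))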