Idle connections AWS S3 on putObject - amazon-web-services

I need to copy files from an S3 bucket in one account to a bucket in another account. I am trying to do that via the aws-java-sdk client and its getObject and putObject functions. There are a lot of files to upload, and during the putObject run I get this error:
Exception in thread "main" software.amazon.awssdk.services.s3.model.S3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400)
How can this issue be fixed?
Here's the code that produces this error:
val clientConfigurationBuilder = ClientOverrideConfiguration.builder()
val clientConfiguration = clientConfigurationBuilder.build()

val builder = S3Client
  .builder()
  .credentialsProvider(createCredentialsProvider(accessKey, secretKey))
  .region(Region.of(region))
  .overrideConfiguration(clientConfiguration)
  .httpClientBuilder(ApacheHttpClient.builder())
endpoint.map(URI.create).foreach(builder.endpointOverride)
val awsS3Client = builder.build()

val getObjectRequest = GetObjectRequest
  .builder()
  .bucket(fromBucket)
  .key(fromKey)
  .build()
val getObjectResponse = awsS3Client.getObject(getObjectRequest)

val putObjectRequest = PutObjectRequest
  .builder()
  .bucket(toBucket)
  .key(toKey)
  .build()
val reqBody = RequestBody.fromInputStream(getObjectResponse,
  getObjectResponse.response().contentLength())
awsS3Client.putObject(putObjectRequest, reqBody)
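The thread carries no answer, but two mitigations are commonly suggested for this error, sketched here in Java against SDK v2 under the assumption that the cause is the download connection sitting idle while putObject slowly re-reads the stream. The first avoids streaming entirely with a server-side copy (the sourceBucket/sourceKey builder methods exist in newer v2 releases; older ones take a combined copySource string); the second raises the Apache HTTP client's socket timeout for the streaming path. Variable names (awsS3Client, fromBucket, and so on) are the ones from the question.
import java.time.Duration;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;

// Option 1: let S3 move the bytes server-side; no client connection is held
// open while data flows, so nothing can hit the idle-connection timeout.
CopyObjectRequest copyRequest = CopyObjectRequest.builder()
        .sourceBucket(fromBucket)
        .sourceKey(fromKey)
        .destinationBucket(toBucket)
        .destinationKey(toKey)
        .build();
awsS3Client.copyObject(copyRequest);

// Option 2: if the object must stream through the client, give slow reads
// and writes more headroom than the Apache client's default socket timeout.
S3Client patientClient = S3Client.builder()
        .httpClientBuilder(ApacheHttpClient.builder()
                .socketTimeout(Duration.ofMinutes(5))
                .connectionTimeout(Duration.ofSeconds(10)))
        .build();
For a cross-account copy, the server-side route needs one set of credentials with read access on the source bucket and write access on the destination.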

Related

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read data from and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (in a data center) in a US region. The following PySpark code connects to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions",
         "-Dcom.amazonaws.services.s3.enableV4=true -Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated.
Unless your Hadoop s3a client is region aware (3.3.1+), setting that region option won't work. There's an AWS SDK option, "aws.region", which you can set as a system property instead.
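A minimal sketch of that workaround in Java (passing the property through extraJavaOptions is an assumption about your setup, and the region value is a placeholder):
import org.apache.spark.SparkConf;

// Hand the SDK's aws.region system property to both the driver and executors.
SparkConf conf = new SparkConf()
        .set("spark.driver.extraJavaOptions", "-Daws.region=us-east-1")
        .set("spark.executor.extraJavaOptions", "-Daws.region=us-east-1");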

How do I improve performance when downloading files from an Amazon S3 Bucket?

I am trying to download a 19 MB file from an Amazon S3 bucket using the Amazon SDK, but it takes far longer than the AWS CLI. The code I am using is below:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.EU_WEST_1)
        .withCredentials(new DefaultAWSCredentialsProviderChain())
        .build();
s3Client.getObject(new GetObjectRequest("bucketName", "path/fileName.zip"), new File("localFileName.zip"));
Comparing the download timings of the two mechanisms: the Amazon SDK took around 9 minutes, whereas the AWS CLI took around 5 seconds.
Is there a way to decrease the download time while using the Amazon SDK?
The first issue here is that you are using the old AWS SDK for Java, which is V1. Amazon recommends moving to V2 as a best practice.
To learn about AWS SDK for Java V2, see:
Developer guide - AWS SDK for Java 2.x
Here is the code you should use to download an object from an Amazon S3 bucket. This is the V2 S3TransferManager:
package com.example.transfermanager;

import software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.transfer.s3.FileDownload;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import java.nio.file.Paths;

/**
 * To run this AWS code example, ensure that you have set up your development
 * environment, including your AWS credentials.
 *
 * For information, see this documentation topic:
 *
 * https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/get-started.html
 */
public class GetObject {

    public static void main(String[] args) {
        final String usage = "\n" +
                "Usage:\n" +
                "    <bucketName> <objectKey> <objectPath> \n\n" +
                "Where:\n" +
                "    bucketName - the Amazon S3 bucket to download an object from.\n" +
                "    objectKey - the object to download (for example, book.pdf).\n" +
                "    objectPath - the path where the file is written (for example, C:/AWS/book2.pdf). \n\n";

        if (args.length != 3) {
            System.out.println(usage);
            System.exit(1);
        }

        long MB = 1024 * 1024; // bytes per megabyte
        String bucketName = args[0];
        String objectKey = args[1];
        String objectPath = args[2];
        Region region = Region.US_EAST_1;
        S3TransferManager transferManager = S3TransferManager.builder()
                .s3ClientConfiguration(cfg -> cfg.region(region)
                        .credentialsProvider(EnvironmentVariableCredentialsProvider.create())
                        .targetThroughputInGbps(20.0)
                        .minimumPartSizeInBytes(10 * MB))
                .build();

        downloadObjectTM(transferManager, bucketName, objectKey, objectPath);
        System.out.println("Object was successfully downloaded using the Transfer Manager.");
        transferManager.close();
    }

    public static void downloadObjectTM(S3TransferManager transferManager, String bucketName, String objectKey, String objectPath) {
        FileDownload download =
                transferManager.downloadFile(d -> d.getObjectRequest(g -> g.bucket(bucketName).key(objectKey))
                        .destination(Paths.get(objectPath)));
        download.completionFuture().join();
    }
}
I just ran this code and downloaded a PDF that is 25 MB in seconds...
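One caveat worth adding: the s3ClientConfiguration(...) builder above comes from the Transfer Manager developer preview. After Transfer Manager went GA, the builder takes a CRT-based async client instead, so on a recent SDK (2.20+, an assumption about your dependency) the equivalent download looks roughly like this:
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.FileDownload;
import java.nio.file.Paths;

// Sketch: GA-style Transfer Manager download over the CRT async client.
S3AsyncClient s3 = S3AsyncClient.crtBuilder()
        .region(Region.US_EAST_1)
        .targetThroughputInGbps(20.0)
        .minimumPartSizeInBytes(10L * 1024 * 1024)
        .build();
S3TransferManager tm = S3TransferManager.builder()
        .s3Client(s3)
        .build();
FileDownload download = tm.downloadFile(d -> d
        .getObjectRequest(g -> g.bucket(bucketName).key(objectKey))
        .destination(Paths.get(objectPath)));
download.completionFuture().join();
tm.close();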

How to set aws proxy host to Spark config

Any idea how to set the AWS proxy host and region on a Spark session or Spark context?
I am able to set them in AWS Java SDK code, and it works fine:
ClientConfiguration clientConfig = new ClientConfiguration();
clientConfig.setProxyHost("aws-proxy-qa.xxxxx.organization.com");
clientConfig.setProxyPort(8099);
AmazonS3ClientBuilder.standard()
        .withRegion(getAWSRegion(Regions.US_WEST_2))
        .withClientConfiguration(clientConfig) // setting the AWS proxy host
        .build();
Can you help me set the same things on the Spark context (both region and proxy)? I am reading an S3 file that is in a different region from the EMR region.
Based on fs.s3a.access.key and fs.s3a.secret.key, the region will be determined automatically.
Set the proxy on the SparkConf just like the other s3a properties:
/**
 * Example: getSparkSessionForS3
 * @return a SparkSession configured for S3 access
 */
def getSparkSessionForS3(): SparkSession = {
  val conf = new SparkConf()
    .setAppName("testS3File")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.hadoop.fs.s3a.endpoint", "yourendpoint")
    .set("spark.hadoop.fs.s3a.connection.maximum", "200")
    .set("spark.hadoop.fs.s3a.fast.upload", "true")
    .set("spark.hadoop.fs.s3a.connection.establish.timeout", "500")
    .set("spark.hadoop.fs.s3a.connection.timeout", "5000")
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .set("spark.hadoop.com.amazonaws.services.s3.enableV4", "true")
    .set("spark.hadoop.com.amazonaws.services.s3.enforceV4", "true")
    .set("spark.hadoop.fs.s3a.proxy.host", "yourhost")
  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()
  spark
}
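If the proxy also needs a port (the SDK snippet in the question uses 8099), the s3a connector has a matching fs.s3a.proxy.port option. A minimal sketch in Java with placeholder values:
import org.apache.spark.SparkConf;

// The s3a proxy settings mirror the SDK's ClientConfiguration proxy fields.
SparkConf conf = new SparkConf()
        .set("spark.hadoop.fs.s3a.proxy.host", "aws-proxy-qa.xxxxx.organization.com")
        .set("spark.hadoop.fs.s3a.proxy.port", "8099")
        // pin the endpoint when the bucket lives in a different region than EMR
        .set("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com");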

Why does aws-java-sdk 1.11.95 return a File not found exception while uploading an image to AWS S3?

I am trying to upload an image to AWS S3.
The web app runs on my local desktop in a Tomcat server.
When I upload the image from the server, I see the file details in the HTTP request's multipart file; I'm able to get its size and details.
This is how I set up the connection:
File convFile = new File(file.getOriginalFilename());
file.transferTo(convFile);
AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_WEST_2) // regionName is a string for a region not supported by the SDK yet
        .withCredentials(new AWSStaticCredentialsProvider(
                new BasicAWSCredentials("key", "accessId")))
        // .setEndpointConfiguration(new EndpointConfiguration("https://s3.console.aws.amazon.com", "us-west-1"))
        .enablePathStyleAccess()
        .disableChunkedEncoding()
        .build();
s3.putObject(new PutObjectRequest(bucketName, "key", convFile));
I tried two approaches:
1) Converting the multipart file to java.io.File and uploading it.
Error: com.amazonaws.SdkClientException: Unable to calculate MD5 hash: MyImage.png (No such file or directory)
2) Sending the image as a byte stream.
Error: java.io.FileNotFoundException: /path/to/tomcat/MyImage.tmp not found
The actual image name is MyImage.png.
Whichever method I try, I get an exception.
OK, there were several issues.
I had mistyped the region for a different set of keys.
But the issue was still happening, so I went back to version 1.11.76. There were still some problems, and this is how I fixed them:
ObjectMetadata objectMetadata = new ObjectMetadata();
objectMetadata.setContentType(file.getContentType());
byte[] contentBytes = null;
try {
    InputStream is = file.getInputStream();
    contentBytes = IOUtils.toByteArray(is);
} catch (IOException e) {
    System.err.printf("Failed while reading bytes: %s", e.getMessage());
}
Long contentLength = Long.valueOf(contentBytes.length);
objectMetadata.setContentLength(contentLength);
objectMetadata.setHeader("filename", fileNameWithExtn);

/*
 * Reobtain the tmp uploaded file as an input stream.
 */
InputStream inputStream = file.getInputStream();
File convFile = new File(fileNameWithExtn); // without this I was getting the file-not-found or MD5 error
file.transferTo(convFile);
FileUtils.copyInputStreamToFile(inputStream, convFile); // needs commons-io in your pom.xml for this FileUtils

AmazonS3 s3 = new AmazonS3Client(new AWSStaticCredentialsProvider(
        new BasicAWSCredentials("<yourKeyId>", "<yourAccessKey>")));
s3.setRegion(Region.US_West.toAWSRegion());
s3.setEndpoint("yourRegion.amazonaws.com");
versionId = s3.putObject(new PutObjectRequest("YourBucketName", name, convFile)).getVersionId();
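An alternative worth noting (not part of the original answer): SDK v1 can upload straight from the multipart stream with explicit metadata, skipping the temp-file conversion that triggered the MD5 and file-not-found errors. A sketch, assuming the same Spring-style file, s3 client, and fileNameWithExtn from the answer:
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.IOException;
import java.io.InputStream;

// Stream the upload directly; no java.io.File on disk is required.
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType(file.getContentType());
metadata.setContentLength(file.getSize());
try (InputStream in = file.getInputStream()) {
    s3.putObject(new PutObjectRequest("YourBucketName", fileNameWithExtn, in, metadata));
} catch (IOException e) {
    System.err.println("Upload failed: " + e.getMessage());
}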

Cannot connect from EC2 to S3

I am trying to connect to S3 from an EC2 instance using AmazonS3Client, to get the list of objects in an S3 bucket. While I can connect to S3 when running this code from my local machine, I am having a hard time running the same code on EC2.
Am I missing a setting or configuration on the EC2 instance?
Code
AWSCredentials credentials = new BasicAWSCredentials("XXXX", "YYYY");
AmazonS3Client conn = new AmazonS3Client(credentials);
String bucketName = "s3-xyz";
String prefix = "123";
ObjectListing objects = conn.listObjects(bucketName, prefix);
List<S3ObjectSummary> objectSummary = objects.getObjectSummaries();
for (S3ObjectSummary os : objectSummary) {
    System.out.println(os.getKey());
}
Errors
ERROR com.amazonaws.http.AmazonHttpClient - Unable to execute HTTP request: Connect to s3-xyz.amazonaws.com:443 timed out
org.apache.http.conn.ConnectTimeoutException: Connect to s3-xyz.s3.amazonaws.com:443 timed out
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:640)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:318)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:202)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3037)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3008)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:531)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:515)
ClientConfiguration cc = new ClientConfiguration();
cc.setProxyHost("10.66.80.122");
cc.setProxyPort(8080);
propertiesCredentials = new BasicAWSCredentials(aws_access_key_id, aws_secret_access_key);
s3 = new AmazonS3Client(propertiesCredentials, cc);
To find the proxy host and port, check your LAN settings.
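The connect timeout in the trace usually points at networking rather than code: the instance needs a route to S3 through an internet gateway, a NAT, an S3 VPC endpoint, or (as above) an HTTP proxy. If a route exists, a common cleanup (a general suggestion, not from the original answer) is to drop the hard-coded keys in favor of the instance profile and pin the region so the client targets the regional endpoint:
import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch: instance-profile credentials plus an explicit region (placeholder).
AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1)
        .withCredentials(InstanceProfileCredentialsProvider.getInstance())
        .build();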