Does the AWS C++ S3 SDK support "Transfer Acceleration"? - c++

I enabled "Transfer Acceleration" on my bucket, but I don't see any improvement in upload speed in my C++ application. I have waited for more than the 20 minutes mentioned in the AWS documentation.
Does the SDK support "Transfer Acceleration" by default, or is there a runtime flag or compiler flag? I did not spot anything in the SDK code.
Thanks.

Currently, there isn't a configuration option that simply turns on transfer acceleration. You can, however, use the endpoint override in the client configuration to point at the accelerated endpoint.

What I did to enable (working) transfer acceleration:
Set "Transfer Acceleration" to Enabled in the bucket configuration in the AWS console.
Add the s3:PutAccelerateConfiguration permission to the IAM user that I use inside my C++ application.
Add the following to the S3 client configuration (bucket_ is your bucket name; the final URL must match the one shown in the AWS console under "Transfer Acceleration"):
Aws::Client::ClientConfiguration config;
/* other configuration options */
config.endpointOverride = bucket_ + ".s3-accelerate.amazonaws.com";
Request acceleration on the bucket before the transfer (see the PutBucketAccelerateConfiguration docs):
auto s3Client = Aws::MakeShared<Aws::S3::S3Client>("Uploader",
    Aws::Auth::AWSCredentials(id_, key_), config);
Aws::S3::Model::PutBucketAccelerateConfigurationRequest bucket_accel;
bucket_accel.SetAccelerateConfiguration(
    Aws::S3::Model::AccelerateConfiguration().WithStatus(
        Aws::S3::Model::BucketAccelerateStatus::Enabled));
bucket_accel.SetBucket(bucket_);
s3Client->PutBucketAccelerateConfiguration(bucket_accel);
You can check in the detailed AWS SDK logs that your code is using the accelerated endpoint, and also that there is a call to /?accelerate before the transfer starts.
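For reference, here is a minimal sketch of turning on that detailed SDK logging (the log level and allocation tag are arbitrary choices, not required values):
#include <aws/core/Aws.h>
#include <aws/core/utils/logging/LogLevel.h>

int main()
{
    Aws::SDKOptions options;
    // Debug-level logging writes aws_sdk_*.log files; the "host:" entries in them
    // should show the <bucket>.s3-accelerate.amazonaws.com endpoint if the
    // override is being picked up.
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Debug;
    Aws::InitAPI(options);
    {
        // ... build the client with the endpoint override and run the upload here ...
    }
    Aws::ShutdownAPI(options);
    return 0;
}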

What worked for me:
Enabling S3 Transfer Acceleration in the AWS console.
When configuring the client, using only the accelerated endpoint, without the bucket name in the override:
clientConfig->endpointOverride = "s3-accelerate.amazonaws.com";
@gabry - your solution was extremely close. I think the reason it wasn't working for me is perhaps due to SDK changes since it was originally posted, as the difference is relatively small. Or maybe it's because I am constructing put object templates for the requests used with the transfer manager.
Looking through the logs (Debug level), the SDK automatically prepends the bucket passed to transferManager::UploadFile() to the overridden endpoint, so with the bucket name included in the override I was getting unresolved-host errors because the requested host looked like:
[DEBUG] host: myBucket.myBucket.s3-accelerate.amazonaws.com
Overriding the endpoint with just s3-accelerate.amazonaws.com avoids that duplication. This way I could still keep the same S3_BUCKET macro name while only selectively applying the override when instantiating a new configuration for upload.
e.g.
<<
...
auto putTemplate = new Aws::S3::Model::PutObjectRequest();
putTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->putObjectTemplate = *putTemplate;

auto multiTemplate = new Aws::S3::Model::CreateMultipartUploadRequest();
multiTemplate->SetStorageClass(STORAGE_CLASS);
transferConfig->createMultipartUploadTemplate = *multiTemplate;

transferMgr = Aws::Transfer::TransferManager::Create(*transferConfig);
auto transferHandle = transferMgr->UploadFile(localFile, S3_BUCKET, s3File);
...
>>
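For completeness, here is a minimal end-to-end sketch of the accelerated TransferManager setup described above; the bucket name, key, and local path are placeholders, and it assumes the aws-cpp-sdk-transfer module is linked and the default credentials chain is used:
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/utils/memory/stl/AWSMap.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/s3/S3Client.h>
#include <aws/transfer/TransferManager.h>

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        // Use only the accelerate endpoint; the SDK prepends the bucket name
        // itself, producing <bucket>.s3-accelerate.amazonaws.com.
        Aws::Client::ClientConfiguration clientConfig;
        clientConfig.endpointOverride = "s3-accelerate.amazonaws.com";
        auto s3Client = Aws::MakeShared<Aws::S3::S3Client>("accel-upload", clientConfig);

        auto executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>("accel-upload", 4);
        Aws::Transfer::TransferManagerConfiguration transferConfig(executor.get());
        transferConfig.s3Client = s3Client;

        auto transferManager = Aws::Transfer::TransferManager::Create(transferConfig);
        auto handle = transferManager->UploadFile("/tmp/local-file.bin", "my-bucket",
                                                  "remote/key.bin", "application/octet-stream",
                                                  Aws::Map<Aws::String, Aws::String>());
        handle->WaitUntilFinished();
    }
    Aws::ShutdownAPI(options);
    return 0;
}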

Related

AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically @aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
  CloudWatchLogsClient,
  DescribeLogStreamsCommand,
  GetLogEventsCommand
} = require("@aws-sdk/client-cloudwatch-logs");

// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
  region: process.env.AWS_REGION,
});

// APPLICATION CODE
const logGroupName = getLogGroupName();

// Get the log streams for the given log group.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
  descending: true,
  logGroupName,
  orderBy: 'LastEventTime',
  limit: 50,
}));

// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;

// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
  logGroupName,
  logStreamName,
}));

const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me yet, the AWS SDK doesn't seem to recognize that.
Here are some interesting details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of pretty small log streams, whereas env-b has three extremely large log streams, one for each of its long-lived EC2 instances. The logs are easily >1 MB in size.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit that the AWS SDK handles by returning 0 log events for env-b? I tried setting a limit on the GetLogEventsCommand above, but it still causes the AWS SDK to return 0 events for env-b.
Another interesting note: if I go to Amazon CloudWatch > Log Group and select env-a's Log Group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors on the console. This seems to confirm that the log stream's event count is simply too large for both the AWS console and AWS SDK to process, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?
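For what it's worth, GetLogEvents responses are paginated, so a stream larger than one response page has to be walked explicitly via nextForwardToken (the call signals the end by returning the same token you passed in). A minimal sketch of that, reusing the same client setup as above and untested against these specific environments:
const { CloudWatchLogsClient, GetLogEventsCommand } = require("@aws-sdk/client-cloudwatch-logs");

const client = new CloudWatchLogsClient({ region: process.env.AWS_REGION });

// Read a single log stream oldest-first, following nextForwardToken until it
// stops changing, which is how GetLogEvents indicates the end of the stream.
async function fetchAllLogEvents(logGroupName, logStreamName) {
  const events = [];
  let nextToken;
  while (true) {
    const res = await client.send(new GetLogEventsCommand({
      logGroupName,
      logStreamName,
      startFromHead: true, // without this, reads begin at the tail of the stream
      nextToken,
      limit: 10000,
    }));
    events.push(...(res.events || []));
    if (!res.nextForwardToken || res.nextForwardToken === nextToken) break;
    nextToken = res.nextForwardToken;
  }
  return events;
}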

S3Client and Quarkus Native App Issue with Running

I am trying to create a Lambda S3 listener leveraging Lambda as a native image. The point is to get the S3 event and then do some work by pulling the file, etc. To get the file I am using the AWS 2.x S3 client as below:
S3Client.builder().httpClient().build();
This code results in
2020-03-12 19:45:06,205 ERROR [io.qua.ama.lam.run.AmazonLambdaRecorder] (Lambda Thread) Failed to run lambda: software.amazon.awssdk.core.exception.SdkClientException: Unable to load an HTTP implementation from any provider in the chain. You must declare a dependency on an appropriate HTTP implementation or pass in an SdkHttpClient explicitly to the client builder.
To resolve this, I added the AWS Apache HTTP client and updated the code to the following:
SdkHttpClient httpClient = ApacheHttpClient.builder()
    .maxConnections(50)
    .build();
S3Client.builder().httpClient(httpClient).build();
I also had to add:
[
  [
    "org.apache.http.conn.HttpClientConnectionManager",
    "org.apache.http.pool.ConnPoolControl",
    "software.amazon.awssdk.http.apache.internal.conn.Wrapped"
  ]
]
After this I am now getting the following stack trace:
Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
at java.security.cert.PKIXParameters.setTrustAnchors(PKIXParameters.java:200)
at java.security.cert.PKIXParameters.<init>(PKIXParameters.java:120)
at java.security.cert.PKIXBuilderParameters.<init>(PKIXBuilderParameters.java:104)
at sun.security.validator.PKIXValidator.<init>(PKIXValidator.java:86)
... 76 more
I am running version 1.2.0 of Quarkus on GraalVM 19.3.1. I am building this via Maven and the provided Docker container for Quarkus. I thought the trust store was added by default (in the build command it looks to be there), but am I missing something? Is there another way to get this to run without setting the HTTP client on the S3 client?
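One thing that might also be worth checking (an assumption on my part, not something confirmed in this thread): SSL support has to be enabled explicitly for Quarkus native images, which is normally done in application.properties:
# application.properties
# Enables SSL/TLS support (including the trust store) in the native image build.
quarkus.ssl.native=true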
There is a PR, under review at the moment, that introduces an AWS S3 extension for both JVM and native modes. The AWS clients are fully "Quarkified", meaning they are configured via application.properties and enabled for dependency injection. So stay tuned, as it will most probably be available in Quarkus 1.5.0.

Does boto2 use http or https to upload files to s3?

I noticed that uploading small files to an S3 bucket is very slow. For a file of 100 KB, it takes 200 ms to upload. Both the bucket and our app are in Oregon; the app is hosted on EC2.
I googled it and found some blogs, e.g. http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
They mention that http can bring a much bigger speed gain than https.
We're using boto 2.45; I'm wondering whether it uses https or http by default, and whether there is any parameter to configure this behavior in boto.
Thanks in advance!
The boto3 client includes a use_ssl parameter:
use_ssl (boolean) -- Whether or not to use SSL. By default, SSL is used. Note that not all services support non-ssl connections.
Looks like it's time for you to move to boto3!
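For illustration, a minimal boto3 sketch of an upload over plain HTTP; the bucket, key, and file names are placeholders:
import boto3

# use_ssl=False makes the client talk plain HTTP to the S3 endpoint.
# Only do this for non-sensitive data, since requests are not encrypted in transit.
s3 = boto3.client("s3", region_name="us-west-2", use_ssl=False)
s3.upload_file("local-100kb-file.bin", "my-bucket", "uploads/local-100kb-file.bin")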
I tried boto3, which has a nice "use_ssl" parameter in its connection constructor. However, it turned out that boto3 is significantly slower than boto2... there are actually already many posts online about this issue.
Finally, I found that in boto2 there's also a similar parameter, "is_secure":
self.s3Conn = S3Connection(config.AWS_ACCESS_KEY_ID, config.AWS_SECRET_KEY, host=config.S3_ENDPOINT, is_secure=False)
Setting is_secure to False saves us about 20 ms. Not bad.

Spark is inventing its own AWS secretKey

I'm trying to read an S3 bucket from Spark, and up until today Spark has always complained that the request returns 403:
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")
hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
logs = spark_context.textFile("s3a://mybucket/logs/*")
Spark was saying... Invalid Access key [ACCESSKEY]
However, with the same ACCESSKEY and SECRETKEY, this was working with the aws-cli:
aws s3 ls mybucket/logs/
and in Python boto3 this was working:
resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "logs/text.py") \
    .put(Body=open("text.py", "rb"), ContentType="text/x-py")
So my credentials ARE valid, and the problem is definitely something with Spark.
Today I decided to turn on DEBUG logging for all of Spark, and to my surprise... Spark is NOT using the [SECRETKEY] I have provided but instead... adds a random one???
17/03/08 10:40:04 DEBUG request: Sending Request: HEAD https://mybucket.s3.amazonaws.com / Headers: (Authorization: AWS ACCESSKEY:[RANDON-SECRET-KEY], User-Agent: aws-sdk-java/1.7.4 Mac_OS_X/10.11.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.65-b01/1.8.0_65, Date: Wed, 08 Mar 2017 10:40:04 GMT, Content-Type: application/x-www-form-urlencoded; charset=utf-8, )
This is why it still returns 403! Spark is not using the key I provide with fs.s3a.secret.key but instead invents a random one??
For the record, I'm running this locally on my machine (OS X) with this command:
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
Could someone enlighten me on this?
(updated as my original one was downvoted as clearly considered unacceptable)
The AWS auth protocol doesn't send your secret over the wire. It signs the message. That's why what you see isn't what you passed in.
For further information, please reread.
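To make that concrete, here is a small Python illustration (with placeholder values) of the legacy Signature Version 2 scheme that aws-sdk-java 1.7.4 uses in the request above: the value after the colon is a Base64-encoded HMAC-SHA1 of a string-to-sign, not the secret key itself.
import base64
import hashlib
import hmac

# Placeholder credentials and request details, roughly matching the HEAD request
# in the debug log above; the exact string-to-sign layout is defined by S3's
# (legacy) Signature Version 2 rules.
secret_key = b"SECRETKEY"
string_to_sign = (
    "HEAD\n"                                              # HTTP verb
    "\n"                                                  # Content-MD5 (empty)
    "application/x-www-form-urlencoded; charset=utf-8\n"  # Content-Type
    "Wed, 08 Mar 2017 10:40:04 GMT\n"                     # Date
    "/mybucket/"                                          # canonicalized resource
)
signature = base64.b64encode(
    hmac.new(secret_key, string_to_sign.encode("utf-8"), hashlib.sha1).digest()
).decode("ascii")

print("Authorization: AWS ACCESSKEY:" + signature)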
I ran into a similar issue. Requests that were using valid AWS credentials returned a 403 Forbidden, but only on certain machines. Eventually I found out that the system time on those particular machines were 10 minutes behind. Synchronizing the system clock solved the problem.
Hope this helps!
This random key is very intriguing. Maybe the AWS SDK is getting the credentials from the OS environment.
In Hadoop 2.8, the default AWS provider chain shows the following list of providers:
BasicAWSCredentialsProvider
EnvironmentVariableCredentialsProvider
SharedInstanceProfileCredentialsProvider
Order, of course, matters! The AWSCredentialProviderChain gets the keys from the first provider in the chain that supplies them:
if (credentials.getAWSAccessKeyId() != null &&
    credentials.getAWSSecretKey() != null) {
    log.debug("Loading credentials from " + provider.toString());
    lastUsedProvider = provider;
    return credentials;
}
See the code in "GrepCode for AWSCredentialProviderChain".
I faced a similar problem using profile credentials. The SDK was ignoring the credentials inside ~/.aws/credentials (as good practice, I encourage you not to store credentials inside the program in any way).
My solution...
Set the credentials provider to use ProfileCredentialsProvider:
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") # yes, I am using central eu server.
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.profile.ProfileCredentialsProvider')
Folks, go for the IAM configuration based on roles... that will open up the S3 access policies that should be added to the default EMR role.

AWS Lambda + Tinkerpop/Gremlin + TitanDB on EC2 + AWS DynamoDB in cloud

I am trying to execute the following flow:
user hits AWS Gateway (REST),
it triggers AWS Lambda,
that uses Tinkerpop/Gremlin connects to
TitanDB on EC2, that uses
AWS DynamoDB in cloud (not on EC2) as backend.
Right now I have managed to create a fully working TitanDB instance on EC2 that stores data in DynamoDB in the cloud.
I am also able to connect from AWS Lambda to EC2 through Tinkerpop/Gremlin, BUT only this way:
Cluster.build()
    .addContactPoint("10.x.x.x") // ip of EC2
    .create()
    .connect()
    .submit("here I type my query as string and it will work");
And this works; however, I would strongly prefer to use a "criteria API" (GremlinPipeline) instead of the plain Gremlin language.
In other words, I need an ORM or something like that.
I know that Tinkerpop includes one.
I have realized that what I need is an object of class Graph.
This is what I have tried:
Graph graph = TitanFactory
    .build()
    .set("storage.hostname", "10.x.x.x")
    .set("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
    .set("storage.dynamodb.client.credentials.class-name", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .set("storage.dynamodb.client.credentials.constructor-args", "")
    .set("storage.dynamodb.client.endpoint", "https://dynamodb.ap-southeast-2.amazonaws.com")
    .open();
However, it throws "Could not find implementation class: com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager".
Of course, the computer is correct, as IntelliJ IDEA also cannot find it.
My dependencies:
//
// aws
compile 'com.amazonaws:aws-lambda-java-core:+'
compile 'com.amazonaws:aws-lambda-java-events:+'
compile 'com.amazonaws:aws-lambda-java-log4j:+'
compile 'com.amazonaws:aws-java-sdk-dynamodb:1.10.5.1'
compile 'com.amazonaws:aws-java-sdk-ec2:+'
//
// database
// titan 1.0.0 is compatible with gremlin 3.0.2-incubating, but not yet with 3.2.0
compile 'com.thinkaurelius.titan:titan-core:1.0.0'
compile 'org.apache.tinkerpop:gremlin-core:3.0.2-incubating'
compile 'org.apache.tinkerpop:gremlin-driver:3.0.2-incubating'
My goal: have a fully working Graph object.
My problem: I don't have the DynamoDBStoreManager class, and I do not know which dependency I have to add.
My additional question is: why does connecting through the Cluster class require only an IP and work, while TitanFactory requires properties like those I have used on the gremlin-server on EC2?
I do not want to create a second server; I just want to connect to it as a client and obtain a Graph object.
EDIT:
After adding the resolver it builds, but in the output I get multiple warnings like:
13689 [TitanID(0)(4)[0]] WARN com.thinkaurelius.titan.diskstorage.idmanagement.ConsistentKeyIDAuthority - Temporary storage exception while acquiring id block - retrying in PT2.4S: com.thinkaurelius.titan.diskstorage.TemporaryBackendException: Wrote claim for id block [1, 51) in PT0.342S => too slow, threshold is: PT0.3S
and execution hangs on the open() method, so it does not allow me to execute any queries.
For the DynamoDBStoreManager class, you would need this dependency:
compile 'com.amazonaws:dynamodb-titan100-storage-backend:1.0.0'
Then for the DynamoDBLocal issue, try adding this resolver:
resolvers += "AWS DynamoDB Local Release Repository" at "http://dynamodb-local.s3-website-us-west-2.amazonaws.com/release"
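That resolver line is sbt syntax; since the dependencies in the question are declared with Gradle, the equivalent repository entry would presumably look like this (an assumption, not taken from the original answer):
repositories {
    maven { url "http://dynamodb-local.s3-website-us-west-2.amazonaws.com/release" }
}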
I'm not entirely clear on what this means -- "Criteria API" instead of plain Gremlin language. I'm guessing that you mean that you want to interact with the graph using Java rather than passing Gremlin as a string over to a running Titan/Gremlin Server? If this is the case, then you don't need to start a Titan/Gremlin Server at all (step 4 above). Write an AWS Lambda program (step 2-3 above) that creates a direct Titan client connection via TitanFactory, where all of the Titan configuration properties are for your DynamoDB instance (step 5 above).
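In other words, once TitanFactory.open() succeeds you can write against the Graph and its traversal source directly from Java instead of submitting Gremlin strings. A minimal sketch, with placeholder configuration values and not taken from the original answer:
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class DirectTitanClient {
    public static void main(String[] args) throws Exception {
        // Placeholder DynamoDB-backed configuration, as in the question above.
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
                .set("storage.dynamodb.client.endpoint", "https://dynamodb.ap-southeast-2.amazonaws.com")
                .open();

        // Write and read through the Graph API instead of Gremlin strings.
        Vertex user = graph.addVertex("name", "alice");
        graph.tx().commit();

        GraphTraversalSource g = graph.traversal();
        long count = g.V().has("name", "alice").count().next();
        System.out.println("vertices named alice: " + count);

        graph.close();
    }
}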