AWS S3: How to set maximum number of retries in C++?

I have a typical example of an S3 upload which works just fine. I decided to set a limit on the number of retries, since sometimes network issues cause delays that create problems. Looking at the AWS SDK, there is apparently a MaxErrorRetry option I can set on the client configuration; however, that doesn't seem to be available in C++. Instead, I found a RetryStrategy option, but I'm not sure how to use it. All I need is to set the number of retries instead of falling back to the default. Any advice?
Thanks

long maxRetry = 2;
long scope = 2;
// Strategy with custom max retries (the second argument scales the retry backoff delay)
std::shared_ptr<Aws::Client::DefaultRetryStrategy> retryStrategy =
    std::make_shared<Aws::Client::DefaultRetryStrategy>(maxRetry, scope);
Aws::Client::ClientConfiguration clientConfig;
clientConfig.retryStrategy = retryStrategy; // assign it to the client configuration
Aws::S3::S3Client s3Client(clientConfig);   // create the S3 client with your configuration

Found the answer:
std::shared_ptr<Aws::Client::RetryStrategy> retry; // initialise the retry strategy pointer
retry.reset(new Aws::Client::DefaultRetryStrategy(num_of_retries, scope)); // override the default with a DefaultRetryStrategy instance
client_config.retryStrategy = retry; // assign it to client_config
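For completeness, here is a minimal sketch of how the strategy slots into a full upload. This assumes AWS SDK for C++ 1.x; the bucket, key, and file names below are placeholders, not taken from the question.

#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/client/DefaultRetryStrategy.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <fstream>
#include <iostream>
#include <memory>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        // Retry at most 2 times; the second argument scales the backoff delay between retries.
        auto retryStrategy = std::make_shared<Aws::Client::DefaultRetryStrategy>(2, 2);

        Aws::Client::ClientConfiguration clientConfig;
        clientConfig.retryStrategy = retryStrategy;
        Aws::S3::S3Client s3Client(clientConfig);

        // Placeholder bucket/key/file -- replace with your own values.
        Aws::S3::Model::PutObjectRequest putRequest;
        putRequest.SetBucket("my-bucket");
        putRequest.SetKey("my-object-key");
        auto body = std::make_shared<std::fstream>("local-file.txt",
                                                   std::ios_base::in | std::ios_base::binary);
        putRequest.SetBody(body);

        auto outcome = s3Client.PutObject(putRequest);
        if (!outcome.IsSuccess()) {
            std::cerr << "Upload failed: " << outcome.GetError().GetMessage() << std::endl;
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}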

Related

Lambda random long execution while running QLDB query

I have a lambda triggered by an SQS FIFO queue when there are messages on the queue. Basically this lambda gets the message from the queue and connects to QLDB through a VPC endpoint in order to run a simple SELECT query and a subsequent INSERT query. The table selected by the query has an index on the field used in the WHERE condition.
Flow (all the services are running "inside" a VPC):
SQS -> Lambda -> VPC interface endpoint -> QLDB
Query SELECT:
SELECT FIELD1, FIELD2 FROM TABLE1 WHERE FIELD3 = "ABCDE"
Query INSERT:
INSERT INTO TABLE1 .....
This lambda is using a shared connection/session on QLDB, and this is how I'm connecting to it:
import { QldbDriver, RetryConfig } from 'amazon-qldb-driver-nodejs'

let driverQldb: QldbDriver
const ledgerName = 'MyLedger'

export function connectQLDB(): QldbDriver {
  if (!driverQldb) {
    const retryLimit = 4
    const retryConfig = new RetryConfig(retryLimit)
    const maxConcurrentTransactions = 1500
    driverQldb = new QldbDriver(ledgerName, {}, maxConcurrentTransactions, retryConfig)
  }
  return driverQldb
}
When I run a load test that simulates around 200 requests/messages per second to that lambda over a 15-minute interval, I start seeing random long executions of the lambda while running the queries on QLDB (mainly the SELECT query). Sometimes the same query returns data in around 100 ms, and sometimes it takes more than 40 seconds, which results in lambda timeouts. I have changed the lambda timeout to 1 minute, but this is not the best approach, and sometimes it is not enough either.
The VPC endpoint metrics show around 250 active connections and 1000 new connections during this load test execution. Is there any QLDB metric that could help identify the root cause of this behavior?
Could it be related to some QLDB limitation (like the 1500 active sessions described here: https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.default) or something related to concurrency read/write iops?
scodeler, I've read through the Node.js QLDB driver, and I think there's an order of operations error. If you provide your own backoff function in the RetryConfig, as in RetryConfig(4, newBackoffFunction), you should see a significant improvement in your lambdas completing.
The driver's default backoff is computed as
const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, Math.pow(SLEEP_BASE_MS * 2, retryAttempt));
and, summarized, it returns
return Math.random() * exponentialBackoff;
which does not match best-practice exponential backoff with jitter:
const newBackoffFunction: BackoffFunction = (retryAttempt: number, error: Error, transactionId: string) => {
  const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt));
  const jitterRand: number = Math.random();
  const delayTime: number = jitterRand * exponentialBackoff;
  return delayTime;
}
The difference is that SLEEP_BASE_MS should be multiplied by 2^retryAttempt, not raised to a power as (SLEEP_BASE_MS * 2)^retryAttempt.
Hope this helps!

Amazon Keyspaces "DefaultTokenFactoryRegistry and DefaultTopologyMonitor" causes High CPU and memory usage

We have partially moved some of our tables from AWS RDS to AWS Keyspaces to see if we could get better performance on Keyspaces. We have put a lot of work into migrating from MySQL to Keyspaces, and we have also been monitoring the system to keep inconsistency from growing. During our monitoring period, we have observed the following warnings, which come with high CPU and memory usage.
- DefaultTokenFactoryRegistry - [s0] Unsupported partitioner 'com.amazonaws.cassandra.DefaultPartitioner', token map will be empty.
- DefaultTopologyMonitor - [s0] Control node IPx/IPy:9142 has an entry for itself in system.peers: this entry will be ignored. This is likely due to a misconfiguration; please verify your rpc_address configuration in cassandra.yaml on all nodes in your cluster. (IPx and IPy are Cassandra node IPs.)
- Control node cassandra.{REGION}.amazonaws.com/{IP_1}:9142 has an entry for itself in system.peers: this entry will be ignored. This is likely due to a misconfiguration; please verify your rpc_address configuration in cassandra.yaml on all nodes in your cluster.
These warnings do not appear immediately after we deploy our code, nor in the following hours; they somehow show up 24-72 hours after the deployment.
What we have done so far:
- We have tried all the connection methods in the AWS Keyspaces Developer Guide: https://docs.aws.amazon.com/keyspaces/latest/devguide/using_java_driver.html
- We found an already open discussion in the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=945795
- We configured our client as stated by an Amazonian: https://forums.aws.amazon.com/profile.jspa?userID=512911
- We also created an issue on the GitHub repository of aws-sigv4-auth-cassandra-java-driver-plugin. You can see the details at https://github.com/aws/aws-sigv4-auth-cassandra-java-driver-plugin/issues/24
We have also walked through the DataStax Java driver code to see what's wrong. In the DefaultTopologyMonitor class there is a check that determines whether our access point to AWS Keyspaces, {IP_2}, which resolves from the contact point [cassandra.{REGION}.amazonaws.com:9142], is the control node or not. Because this IP address [{IP_2}] exists in system.peers, the control connection is always triggered, and the resulting iterations and assignments consume high CPU and create garbage. As we understand it, the contact point should not be listed in system.peers. We have no way to adjust the system.peers table or to set the control node; these are all managed by AWS Keyspaces.
Even though it's possible to suppress the warnings by setting the log level to ERROR, the driver says there's a misconfiguration in cassandra.yaml, which we do not have permission to edit or view. Is there a way to avoid this warning, or any suggested solution for this issue?
datastax-java-driver {
  basic {
    contact-points = ["cassandra.eu-west-1.amazonaws.com:9142"]
    load-balancing-policy {
      class = DefaultLoadBalancingPolicy
      local-datacenter = eu-west-1
    }
    request {
      timeout = 10 seconds
      default-idempotence = true
    }
  }
  advanced {
    auth-provider = {
      class = software.aws.mcs.auth.SigV4AuthProvider
      aws-region = eu-west-1
    }
    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./cassandra_truststore.jks"
      truststore-password = "XXX"
      keystore-path = "./cassandra_truststore.jks"
      keystore-password = "XXX"
    }
    retry-policy {
      class = com.ABC.DEF.config.cassandra.AmazonKeyspacesRetryPolicy
      max-attempts = 5
    }
    connection {
      pool {
        local {
          size = 9
        }
        remote {
          size = 1
        }
      }
      init-query-timeout = 5 seconds
      max-requests-per-connection = 1024
    }
    reconnect-on-init = true
    heartbeat {
      timeout = 1 seconds
    }
    metadata {
      schema {
        enabled = false
      }
      token-map {
        enabled = false
      }
    }
    control-connection {
      timeout = 1 seconds
    }
  }
}
----------
This is indeed a non-standard, unsupported partitioner: com.amazonaws.cassandra.DefaultPartitioner. Token-aware routing won't work with AWS Keyspaces unless you write your own TopologyMonitor and TokenFactory.
I suggest that you disable token-aware routing completely; see here for instructions.
The warning is just letting you know that the IP will be filtered out; see the line of code here on GitHub. In Cassandra, the system.peers table contains a list of nodes minus the IP of the control node. In Amazon Keyspaces, the system.peers table also contains the control node IP. You will see this warning when the driver initiates a connection or when the driver metadata is updated. When using Keyspaces this warning is expected and will not impact performance. There is a patch that will resolve the warning, but I do not have an ETA to share.
I suggest upgrading the Java driver to see if your issue is resolved. You can also download the latest SigV4 plugin, which brings in Java driver 4.13 as a dependency.
<dependency>
    <groupId>software.aws.mcs</groupId>
    <artifactId>aws-sigv4-auth-cassandra-java-driver-plugin</artifactId>
    <version>4.0.5</version>
</dependency>
Here is a sample driver config for reference.
datastax-java-driver {
  basic.contact-points = ["cassandra.us-east-2.amazonaws.com:9142"]
  basic.load-balancing-policy {
    class = DefaultLoadBalancingPolicy
    local-datacenter = us-east-2
  }
  advanced {
    auth-provider = {
      class = software.aws.mcs.auth.SigV4AuthProvider
      aws-region = us-east-2
    }
    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./src/main/resources/cassandra_truststore.jks"
      truststore-password = "my_password"
      hostname-validation = false
    }
  }
  advanced.metadata.token-map.enabled = false
  advanced.metadata.schema.enabled = false
  advanced.reconnect-on-init = true
  advanced.connection {
    pool {
      local.size = 3
      remote.size = 1
    }
  }
}

Prometheus push gateway how to increment message requests

I have a use case where we need to increment a counter of the requests received by a Nuclio serverless service. The pod is recreated each time the service is invoked. Following the examples from the Prometheus client library, I am not able to increment the request count using a Counter() or Gauge() object and the inc() method; here is the code I tried:
registry = CollectorRegistry()
c = Counter('my_requests', 'HTTP Failures', ['method', 'endpoint'], registry=registry)
c.labels(method='get', endpoint='/').inc()
c.labels(method='post', endpoint='/submit').inc()
pushadd_to_gateway('localhost:8082', job='countJob', registry=registry)
I tried both push_to_gateway and pushadd_to_gateway; both resulted in the counter value for my_requests remaining 1.
Question - does creating the Counter object each time reset the value back to 0, and if so, how do we handle this for ephemeral jobs? Any code example would be helpful.

How to specify the database in an ArangoDb AQL query?

If I have multiple databases defined on a particular ArangoDB server, how do I specify the database I'd like an AQL query to run against?
Running the query through the REST endpoint that includes the db name (substituted for [DBNAME] below), i.e.:
/_db/[DBNAME]/_api/cursor
doesn't seem to work. The error message says 'unknown path /_db/[DBNAME]/_api/cursor'
Is this something I have to specify in the query itself?
Also: The query I'm trying to run is:
FOR col in COLLECTIONS() RETURN col.name
Fwiw, I haven't found a way to set the "current" database through the REST API. Also, I'm accessing the REST API from C++ using fuerte.
Tom Regner deserves primary credit here for prompting the enquiry that produced this answer. I am posting my findings here as an answer to help others who might run into this.
I don't know if this is a fuerte bug, a shortcoming, or just an API caveat that wasn't clear to me... BUT...
In order for the '/_db/[DBNAME]/' prefix in an endpoint (e.g. the full endpoint '/_db/[DBNAME]/_api/cursor') to be registered and used in the header of a ::arangodb::fuerte::Request, it is NOT sufficient (as of ArangoDB 3.5.3 and the fuerte version available at the time of this answer) to simply call:
std::unique_ptr<fuerte::Request> request;
const char *endpoint = "/_db/[DBNAME]/_api/cursor";
request = fuerte::createRequest(fuerte::RestVerb::Post, endpoint);
// ...then add any arguments to the request using a VPackBuilder,
// in this case the query (omitted)
To have the database name included as part of such a request, you must additionally call the following:
request->header.parseArangoPath(endpoint);
Failure to do so seems to result in an error about an 'unknown path'.
Note 1: Simply setting the database member variable, i.e.
request->header.database = "[DBNAME]";
does not work.
Note 2: Operations without the leading '/_db/[DBNAME]/' prefix seem to work fine using the 'current' database (which, at least for me, seems to be stuck at '_system', since as far as I can tell there doesn't seem to be an endpoint to change it via the HTTP REST API).
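Putting the snippets above together, a minimal sketch of this approach might look like the following. It assumes an already-established fuerte Connection named conn, the usual fuerte/velocypack headers, and a placeholder database name "mydb".

// Endpoint that includes the database prefix ("mydb" is a placeholder name).
const char *endpoint = "/_db/mydb/_api/cursor";

// Create the request for that endpoint...
std::unique_ptr<fuerte::Request> request = fuerte::createRequest(fuerte::RestVerb::Post, endpoint);
// ...and ALSO parse the path so the database prefix ends up in the request header.
request->header.parseArangoPath(endpoint);

// Attach the AQL query as a VelocyPack payload.
VPackBuilder builder;
builder.openObject();
builder.add("query", VPackValue("FOR col IN COLLECTIONS() RETURN col.name"));
builder.close();
request->addVPack(builder.slice());

// Send it over the existing connection (blocking) and print the result.
std::unique_ptr<fuerte::Response> response = conn->sendRequest(std::move(request));
VPackSlice slice = response->slices().front();
std::cout << slice.get("result").toJson() << std::endl;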
The docs aren't very helpful right now, so just in case someone is looking for a more complete example, please consider the following code.
EventLoopService eventLoopService;
// adjust the connection for your environment!
std::shared_ptr<Connection> conn = ConnectionBuilder().endpoint("http://localhost:8529")
                                       .authenticationType(AuthenticationType::Basic)
                                       .user(?)     // enter a user with access
                                       .password(?) // enter the password
                                       .connect(eventLoopService);
// create the request
std::unique_ptr<Request> request = createRequest(RestVerb::Post, ContentType::VPack);
// enter the database name (ensure the user has access)
request->header.database = ?;
// API endpoint to submit AQL queries
request->header.path = "/_api/cursor";
// Create a payload to be submitted to the API endpoint
VPackBuilder builder;
builder.openObject();
// here is your query
builder.add("query", VPackValue("for col in collections() return col.name"));
builder.close();
// add the payload to the request
request->addVPack(builder.slice());
// send the request (blocking)
std::unique_ptr<Response> response = conn->sendRequest(std::move(request));
// check the response code - it should be 201
unsigned int statusCode = response->statusCode();
// slice has the response data
VPackSlice slice = response->slices().front();
std::cout << slice.get("result").toJson() << std::endl;

PeopleCode - how to create cookies?

We are trying to create a cookie in PeopleSoft PeopleCode by using the %Response object.
However, the code we tried is failing.
&YourCookie = %Response.AddCookie("YourCookieName", "LR");
Here is another snippet we tried in order to create the cookie:
Local object &Response = %Response;
Local object &YourCookie;
&YourCookie = &Response.CreateCookie("YourCookieName");
&YourCookie.Domain = %Request.AuthTokenDomain;
&YourCookie.MaxAge = -1; /* Makes this a session cookie (default) */
&YourCookie.Path = "/";
&YourCookie.Secure = True; /* Set to true if using https (will still work with http) */
&YourCookie.Value = "Set the cookie value here. Encrypt sensitive information.";
The documentation points to IScript functions such as the CreateCookie method:
http://docs.oracle.com/cd/E15645_01/pt850pbr0/eng/psbooks/tpcr/chapter.htm?File=tpcr/htm/tpcr21.htm
However, these don't work in PeopleCode. We don't have the knowledge to create or use IScript. Any insight into the PeopleCode API for cookies, or into IScript, is much appreciated.
I just tested on PeopleTools 8.54.11 and was able to create a cookie using the snippet you provided above.
I did find I had an issue if I set
&YourCookie.Secure = True;
in an environment where I was using HTTP.
If you set Secure to False, the cookie will be available over both HTTP and HTTPS; if you set Secure to True, the cookie is only available over HTTPS.
PeopleTools 8.54 Documentation showing the CreateCookie method
I have been trying to do this (same code snippet) from within Signon PeopleCode; the tools release is 8.54.09. I can execute the first two lines of code, but as soon as the line calling the CreateCookie() method executes, I get tossed out and end up on the signon error page.
This seems to support the previous answer saying that the API has removed the method, but the answer before that says it was successful on tools 8.54.11 -- does that mean they removed it, then put it back, and I happen to be stuck with a release where it was removed? :-/