Mongo C++ Driver - How to Change Timeout Configurations

How can I change the timeout duration for different operations that can fail due to server inaccessibility? (start_session, insert, find, delete, update, ...)
...
auto pool = mongocxx::pool(mongocxx::uri("bad_uri"), pool_options);
auto connection = pool.try_acquire();
auto db = (*(connection.value()))["test_db"];
auto collection = db["test_collection"];
// This does not help
mongocxx::write_concern wc;
wc.timeout(std::chrono::milliseconds(1000));
mongocxx::options::insert insert_options;
insert_options.write_concern(wc);
// takes about 30 seconds to fail
collection.insert_one(from_json(R"({"name": "john doe", "occupation": "_redacted_", "skills" : "a certain set"})"), insert_options);
[Edit]
Here is the exception message:
C++ exception with description "No suitable servers found:
serverSelectionTimeoutMS expired: [connection timeout calling
ismaster on '127.0.0.1:27017']

It would be helpful to see the actual error message from the insert_one() operation, but "takes about 30 seconds to fail" suggests that this may be due to the default server selection timeout. You can configure that via the serverSelectionTimeoutMS connection string option.
If you are connecting to a replica set, I would suggest keeping that timeout a bit above the expected time for a failover to complete. Replica Set Elections states:
The median time before a cluster elects a new primary should not typically exceed 12 seconds
You may find that is shorter in practice. By keeping the server selection timeout above the expected failover time, you'll allow the driver to insulate your application from an error (at the expense of wait time).
If you are not connecting to a replica set, feel free to lower serverSelectionTimeoutMS, albeit to a value still greater than the expected latency to your mongod (standalone) or mongos (sharded cluster) node.
Do note that since server selection occurs within a loop, the connectTimeoutMS connection string option won't affect the delay you're seeing. Lowering the connection timeout will allow the driver to give up sooner when attempting to connect to an inaccessible server, but server selection will still block for up to serverSelectionTimeoutMS (and likely retry connections to the server during that loop).
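For example, with the C++ driver you can append the option to the connection string. Here is a minimal sketch, assuming a single mongod on 127.0.0.1:27017; the timeout values are illustrative, not prescriptive:
#include <mongocxx/instance.hpp>
#include <mongocxx/pool.hpp>
#include <mongocxx/uri.hpp>

int main() {
    mongocxx::instance instance{};  // one instance per process, must outlive all driver objects

    // 3-second server selection timeout, 2-second connect timeout (illustrative values)
    mongocxx::uri uri{"mongodb://127.0.0.1:27017/?serverSelectionTimeoutMS=3000&connectTimeoutMS=2000"};
    mongocxx::pool pool{uri};

    auto connection = pool.try_acquire();
    // ... acquire the database/collection and call insert_one() as in the question ...
}
With that in place, insert_one() against an unreachable server should fail with the "No suitable servers found" exception after roughly serverSelectionTimeoutMS rather than the 30-second default.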

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 HTTP 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4 like this in the Python code that creates the batch job:
from google.cloud import aiplatform

model_parameters = dict(batch_size=4)

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we eventually end up in the GAPIC layer, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
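For illustration, here is a minimal sketch of passing that RPC timeout when creating the batch job through the lower-level GAPIC client; the project, region, and job fields are placeholders rather than values from the question:
from google.cloud import aiplatform_v1

client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}  # placeholder region
)

job = aiplatform_v1.BatchPredictionJob(
    display_name="example-batch-job",  # placeholder
    model="projects/PROJECT/locations/us-central1/models/MODEL_ID",  # placeholder
    # input_config, output_config and dedicated_resources omitted for brevity
)

# timeout is the per-attempt RPC timeout (in seconds) for the creation call itself;
# it does not bound how long the batch job runs or how long each prediction request may take.
response = client.create_batch_prediction_job(
    parent="projects/PROJECT/locations/us-central1",  # placeholder
    batch_prediction_job=job,
    timeout=120.0,
)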
What I would suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, build on that together with your initial solution of using a machine with a higher spec. You can also try using a machine with more memory, such as the n1-highmem-* family.

BigQuery Storage Write / managedwriter api return error server_shutting_down

Given the advantages of the BigQuery Storage Write API, we replaced insertAll with the managedwriter API on our server about a month ago. It seemed to work well for that month; however, we recently started seeing the following errors:
rpc error: code = Unavailable desc = closing transport due to: connection error:
desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR,
debug data: "server_shutting_down"
The versions of the managedwriter API dependencies are:
cloud.google.com/go/bigquery v1.25.0
google.golang.org/protobuf v1.27.1
There is a piece of retry logic for the Storage Write API on our server side that detects these error messages. We notice that the response time of the Storage Write API becomes longer after retrying, and as a result our server runs out of memory (OOM). We also tried increasing the request timeout to 30 seconds, and most of those requests still could not be completed within it.
How to handle the error server_shutting_down correctly?
Update 02/08/2022
Our server uses the default stream of the managedwriter API, and the server_shutting_down error comes up periodically. The issue appeared on 02/04/2022 12:00 PM UTC, after the default stream of the managedwriter API had worked well for over a month.
Here is a wrapper function around AppendRows; we log how long this function takes.
func (cl *GBOutput) appendRows(ctx context.Context, datas [][]byte, schema *gbSchema) error {
    var result *managedwriter.AppendResult
    var err error
    if cl.schema != schema {
        cl.schema = schema
        result, err = cl.managedStream.AppendRows(ctx, datas, managedwriter.UpdateSchemaDescriptor(schema.descriptorProto))
    } else {
        result, err = cl.managedStream.AppendRows(ctx, datas)
    }
    if err != nil {
        return err
    }
    _, err = result.GetResult(ctx)
    return err
}
When the server_shutting_down error comes up, this function can take several hundred seconds to return. It is very strange, and there seems to be no way to put a timeout on AppendRows.
Are you using the "raw" v1 storage API, or the managedwriter? I ask because managedwriter should handle stream reconnection automatically. Are you simply observing connection closes periodically, or does something about your retry traffic induce the closes?
The interesting question is how to deal with in-flight appends for which you haven't yet received an acknowledgement back (or the ack ended in failure). If you're using offsets, you should be able to re-send the append without risk of duplication.
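Here is a minimal sketch of both ideas, bounding the wait for the acknowledgement with a context deadline and re-sending at an explicit offset. It assumes a recent managedwriter version and a non-default stream (for example a committed stream), since the default stream does not support offsets; the offset bookkeeping is up to the caller:
import (
    "context"
    "time"

    "cloud.google.com/go/bigquery/storage/managedwriter"
)

// appendWithOffset sends rows at a known offset and bounds the wait for the ack.
// On a retriable failure the same rows can be re-sent at the same offset without
// risk of duplication, because the backend rejects out-of-order offsets.
func appendWithOffset(ctx context.Context, ms *managedwriter.ManagedStream, rows [][]byte, offset int64) error {
    result, err := ms.AppendRows(ctx, rows, managedwriter.WithOffset(offset))
    if err != nil {
        return err
    }
    // Don't block for hundreds of seconds waiting for the acknowledgement.
    waitCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
    _, err = result.GetResult(waitCtx)
    return err
}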
Per the GCP support guy,
The issue is hit once 10MB has been sent over the connection, regardless of how long it takes or how much is inflight at that time. The BigQuery Engineering team has identified the root cause and the fix would be rolled out by Friday, Feb 11th, 2022.

GCP CloudSql Connection limit from Dataflow Job/compute engine

I have a dataflow job that is connecting to cloudsql and persisting some data.
On average I have about 75 active connections (with a few spikes to just over 100 connections once in a while). I was therefore wondering if there is a maximum number of connections. The documentation doesn't seem to indicate one. (https://cloud.google.com/sql/docs/mysql/connect-admin-ip)
Backstory and for some context: I am getting an error with one of my jobs, it seems to just lock randomly and stops persisting data:
Operation ongoing in step X for at least 305h20m00s without outputting or completing in state start
at sun.misc.Unsafe.park (Native Method)
at java.util.concurrent.locks.LockSupport.park (LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await (AbstractQueuedSynchronizer.java:2039)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst (LinkedBlockingDeque.java:590)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:425)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:346)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection (PoolingDataSource.java:134)
at org.apache.commons.dbcp2.BasicDataSource.getConnection (BasicDataSource.java:809)
at org.apache.commons.dbcp2.DataSourceConnectionFactory.createConnection (DataSourceConnectionFactory.java:83)
at org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject (PoolableConnectionFactory.java:355)
at org.apache.commons.pool2.impl.GenericObjectPool.create (GenericObjectPool.java:874)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:417)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:346)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection (PoolingDataSource.java:134)
at x.io.jobs.common.mysql.function.MySqlReadAllFn.setup (MySqlReadAllFn.java:57)
at x.io.jobs.tracer.function.ReadAggTraceStatusByIdFn$DoFnInvoker.invokeSetup (Unknown Source)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.deserializeCopy (DoFnInstanceManagers.java:83)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.get (DoFnInstanceManagers.java:75)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.reallyStartBundle (SimpleParDoFn.java:296)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement (SimpleParDoFn.java:326)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process (ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process (OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output (GroupAlsoByWindowsParDoFn.java:185)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue (GroupAlsoByWindowFnRunner.java:108)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1 (ReduceFnRunner.java:1060)
Thanks.
There is a connection limit for Cloud SQL, which can be changed by setting the max_connections flag on an instance. There's more info on setting and viewing the value of database flags on an instance here: https://cloud.google.com/sql/docs/mysql/flags
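On the client side, the stack trace above shows the worker parked in GenericObjectPool.borrowObject, i.e. waiting indefinitely for a pooled connection. Separately from the max_connections flag, it may help to cap both the pool size and the wait time so the DoFn setup fails fast instead of hanging. A sketch assuming commons-dbcp2 BasicDataSource (as in the stack trace); all names and values are illustrative:
import org.apache.commons.dbcp2.BasicDataSource;

public class PooledDataSourceFactory {
    // Called once per worker, e.g. from the DoFn's setup; values are illustrative.
    public static BasicDataSource create() {
        BasicDataSource dataSource = new BasicDataSource();
        dataSource.setUrl("jdbc:mysql://CLOUD_SQL_IP:3306/mydb");  // placeholder connection details
        dataSource.setUsername("user");
        dataSource.setPassword("password");

        // Keep (number of workers x maxTotal) below the instance's max_connections.
        dataSource.setMaxTotal(8);
        // Fail after 30s instead of parking the bundle forever in borrowObject().
        dataSource.setMaxWaitMillis(30000);
        // Validate connections so broken ones are not handed back out.
        dataSource.setTestOnBorrow(true);
        dataSource.setValidationQuery("SELECT 1");
        return dataSource;
    }
}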

Checking SQL connection state in ADO

I am checking my SQL connection state as:
_ConnectionPtr m_pADOConnection;
// Connection is created and working fine...
// Now I disable network adapter (from Control panel)
if( (pApp->m_pConnection->GetState() == adStateOpen) )
{
// I got here every time....
}
The problem is that I get adStateOpen every time, even if the connection is really not working!
If I try to execute a query or do anything it fails, mostly with
SMux Provider: Physical connection is not usable [xFFFFFFFF].
or
Error number: 80004005 = Unable to open a logical session
Is this value of the State property reliable, or do I need to perform some other check to detect this state?
The State property does not become 0 (adStateClosed) when the connection is interrupted. So checking the state of the connection will always return 1 (adStateOpen), even after the connection is interrupted.
No, there is no way to check immediately; the architecture of SQL Server doesn't allow for it.
I advise you to add error handling.
It looks as though the only way to establish that the connection has been lost is to try to open it and handle the error.
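A minimal sketch of that approach: probe the connection with a trivial command and treat a COM error as a lost connection. IsConnectionAlive is a hypothetical helper name, and the sketch assumes ADO was imported via #import so that failures surface as _com_error:
#include <comdef.h>  // _com_error, _bstr_t

// Hypothetical helper: GetState() keeps reporting adStateOpen, so instead issue a
// cheap round trip to the server and catch the failure.
bool IsConnectionAlive(_ConnectionPtr connection)
{
    if (connection == NULL || (connection->GetState() & adStateOpen) == 0)
        return false;
    try
    {
        // adExecuteNoRecords: no recordset is built for this probe query.
        connection->Execute(_bstr_t(L"SELECT 1"), NULL, adExecuteNoRecords);
        return true;
    }
    catch (const _com_error&)
    {
        // e.g. "SMux Provider: Physical connection is not usable"
        return false;
    }
}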

Jetty 8.1 flooding the log file with "Dispatched Failed" messages

We are using Jetty 8.1 as an embedded HTTP server. Under overload conditions the server sometimes starts flooding the log file with these messages:
warn: java.util.concurrent.RejectedExecutionException
warn: Dispatched Failed! SCEP#76107610{l(...)<->r(...),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}...
The same message is repeated thousands of times, and the sheer amount of logging appears to slow down the whole system. The messages themselves are fine; our request handler is just too slow to process the requests in time. But the huge number of repeated messages makes things actually worse and makes it harder for the system to recover from the overload.
So, my question is: is this normal behaviour, or are we doing something wrong?
Here is how we set up the server:
Server server = new Server();
SelectChannelConnector connector = new SelectChannelConnector();
connector.setAcceptQueueSize( 10 );
server.setConnectors( new Connector[]{ connector } );
server.setThreadPool( new ExecutorThreadPool( 32, 32, 60, TimeUnit.SECONDS,
new ArrayBlockingQueue<Runnable>( 10 )));
The SelectChannelEndPoint is the origin of this log message.
To stop seeing it, set the named logger org.eclipse.jetty.io.nio.SelectChannelEndPoint to LEVEL=OFF.
Now as for why you see it, that is more interesting to the developers of Jetty. Can you detail what specific version of Jetty you are using and also what specific JVM you are using?
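Following the LEVEL=OFF suggestion above, here is a minimal sketch assuming Jetty's default StdErrLog backend is in use (if Jetty logging is routed through slf4j/log4j, configure the same logger name there instead). The property can equally be passed as a -D JVM argument or set in jetty-logging.properties:
public class QuietEndPointLogging {
    public static void main(String[] args) throws Exception {
        // Must be set before any Jetty logging classes are initialized.
        System.setProperty("org.eclipse.jetty.io.nio.SelectChannelEndPoint.LEVEL", "OFF");

        // ... construct and start the Server exactly as in the snippet above ...
    }
}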