Azure DW limit error - azure-sqldw

I scheduled around 70 concurrent queries using 70 logins to stress test Azure DW (DWU 200), and after a while started getting this error:
[Execute SQL Task]
Error: Executing the query "SELECT Distinct S.[Nurse ID],S.[Trust Code],S.[Loc..." failed with the following error: "110802;An internal DMS error occurred that caused this operation to fail.
Details:
Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException, Message: NativeOdbcConnection.Open, error in OdbcConnectionCreate: SqlState: HY000, NativeError: 10928, 'Error calling: SQLExecDirect(hstmt, (SQLWCHAR *) L"SELECT @@SPID", SQL_NTS), SQL return code: -1 |
SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 20, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Resource ID : 1. The request limit for the database is 1600 and has been reached. See 'http://go.microsoft.com/fwlink/?LinkId=267637' for assistance. |
ConnectionString: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB22\mpdwsvc (69820)-ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.22-f8e91ff83e68\sql\query, ConnectionPooling: 1 | Error calling: pConn->Create(connectionString, useConnectionPooling, packetSize, connectionLoginTimeout, environmentSettings, spid) | state: FFFF, number: 19183, active connections: 266', Connection String: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB22\mpdwsvc (69820)-ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.22-f8e91ff83e68\sql\query".
Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
But I can't find a documented 1600 limit anywhere, nor can I understand how I could have hit it. Any help would be truly appreciated, thanks.

Have you read the concurrency article on Azure.com? You are stress testing with 70 concurrent queries at a scale level that will, by design, begin queueing those requests. My suspicion is that the queue of pending requests grows throughout your load test until you hit one of the system's limits. The limit I expect you are hitting is the number of open sessions.
If you would like to be certain that is the case, you would need to open a support ticket. However, I would also suggest increasing the DWU if you want to run 70 concurrent queries in a saturated load test.
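If you want to watch that happening during the test, the request DMV shows queued versus running requests (sys.dm_pdw_exec_sessions similarly shows open sessions). A minimal monitoring sketch in Go, assuming the github.com/denisenkom/go-mssqldb driver and a placeholder connection string:
package main

import (
    "database/sql"
    "fmt"
    "log"

    // Assumed driver; registers the "sqlserver" driver name.
    _ "github.com/denisenkom/go-mssqldb"
)

func main() {
    // Placeholder connection string -- substitute your server, login and database.
    db, err := sql.Open("sqlserver", "sqlserver://user:password@yourserver.database.windows.net?database=yourdw")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Count requests by status so you can watch queued vs. running requests grow during the test.
    rows, err := db.Query("SELECT status, COUNT(*) AS cnt FROM sys.dm_pdw_exec_requests GROUP BY status")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var status string
        var cnt int
        if err := rows.Scan(&status, &cnt); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%-12s %d\n", status, cnt)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}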

Related

BigQuery Storage Write / managedwriter API returns error server_shutting_down

Given the advantages of the BigQuery Storage Write API, we replaced insertAll with the managedwriter API on our server about a month ago. It seemed to work well for that month; however, we have recently been getting the following errors:
rpc error: code = Unavailable desc = closing transport due to: connection error:
desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR,
debug data: "server_shutting_down"
The versions of the managedwriter API dependencies are:
cloud.google.com/go/bigquery v1.25.0
google.golang.org/protobuf v1.27.1
Our server has retry logic for the Storage Write API that detects these error messages. We notice that the response time of the Storage Write API grows longer after retrying, and as a result our server runs out of memory (OOM). We also tried increasing the request timeout to 30 seconds, but most of those requests still could not complete within it.
How do we handle the server_shutting_down error correctly?
Update 02/08/2022
Our server uses the default stream of the managedwriter API, and the server_shutting_down error comes up periodically. The issue started on 02/04/2022 12:00 PM UTC, after the default stream had worked well for over a month.
Here is our wrapper function around AppendRows; we log how long this function takes:
func (cl *GBOutput) appendRows(ctx context.Context, datas [][]byte, schema *gbSchema) error {
    var result *managedwriter.AppendResult
    var err error
    if cl.schema != schema {
        // The schema changed: send the new descriptor along with this append.
        cl.schema = schema
        result, err = cl.managedStream.AppendRows(ctx, datas, managedwriter.UpdateSchemaDescriptor(schema.descriptorProto))
    } else {
        result, err = cl.managedStream.AppendRows(ctx, datas)
    }
    if err != nil {
        return err
    }
    // GetResult blocks until the server acknowledges (or rejects) the append.
    _, err = result.GetResult(ctx)
    return err
}
When the server_shutting_down error comes up, this function can take several hundred seconds. It is very strange, and there seems to be no way to put a timeout on the append.
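As far as I know there is no per-append timeout option, but you can bound the wait for the acknowledgement with its own context deadline. A minimal sketch (the helper name and the separate wait deadline are illustrative, not something prescribed by the managedwriter docs):
import (
    "context"
    "time"

    "cloud.google.com/go/bigquery/storage/managedwriter"
)

// appendWithTimeout performs one append and bounds only the wait for the server
// acknowledgement, so a hung GetResult surfaces as context.DeadlineExceeded
// instead of blocking for hundreds of seconds. Note this does not cancel the
// append itself -- the rows may still be written server-side.
func appendWithTimeout(ctx context.Context, ms *managedwriter.ManagedStream, datas [][]byte, wait time.Duration) error {
    result, err := ms.AppendRows(ctx, datas)
    if err != nil {
        return err
    }
    waitCtx, cancel := context.WithTimeout(ctx, wait)
    defer cancel()
    _, err = result.GetResult(waitCtx)
    return err
}
A wrapper like this at least returns control to the caller quickly, so you can decide whether to retry, drop, or buffer the batch.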
Are you using the "raw" v1 storage API, or the managedwriter? I ask because managedwriter should handle stream reconnection automatically. Are you simply observing connection closes periodically, or does something about your retry traffic induce the closes?
The interesting question is how to deal with in-flight appends for which you haven't yet received an acknowledgement back (or the ack ended in failure). If you're using offsets, you should be able to re-send the append without risk of duplication.
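A hedged sketch of what such a re-send could look like (offsets apply to explicitly created streams, not to the default stream mentioned in the question; the helper name is illustrative):
// Re-sending an append at a known offset: WithOffset pins the append, so
// retrying an append that may already have been applied cannot write the rows
// twice -- the server reports an offset conflict instead.
func appendAtOffset(ctx context.Context, ms *managedwriter.ManagedStream, datas [][]byte, offset int64) (int64, error) {
    result, err := ms.AppendRows(ctx, datas, managedwriter.WithOffset(offset))
    if err != nil {
        return 0, err
    }
    // GetResult returns the offset at which the rows were written (or the error).
    return result.GetResult(ctx)
}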
Per the GCP support guy,
The issue is hit once 10MB has been sent over the connection, regardless of how long it takes or how much is inflight at that time. The BigQuery Engineering team has identified the root cause and the fix would be rolled out by Friday, Feb 11th, 2022.

GCP Cloud SQL connection limit from Dataflow job / Compute Engine

I have a Dataflow job that connects to Cloud SQL and persists some data.
On average I have about 75 active connections, with occasional spikes to just over 100. I was therefore wondering whether there is a maximum number of connections. The documentation doesn't seem to say (https://cloud.google.com/sql/docs/mysql/connect-admin-ip).
For some backstory and context: I am getting an error with one of my jobs; it seems to just lock up randomly and stop persisting data:
Operation ongoing in step X for at least 305h20m00s without outputting or completing in state start
at sun.misc.Unsafe.park (Native Method)
at java.util.concurrent.locks.LockSupport.park (LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await (AbstractQueuedSynchronizer.java:2039)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst (LinkedBlockingDeque.java:590)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:425)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:346)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection (PoolingDataSource.java:134)
at org.apache.commons.dbcp2.BasicDataSource.getConnection (BasicDataSource.java:809)
at org.apache.commons.dbcp2.DataSourceConnectionFactory.createConnection (DataSourceConnectionFactory.java:83)
at org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject (PoolableConnectionFactory.java:355)
at org.apache.commons.pool2.impl.GenericObjectPool.create (GenericObjectPool.java:874)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:417)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:346)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection (PoolingDataSource.java:134)
at x.io.jobs.common.mysql.function.MySqlReadAllFn.setup (MySqlReadAllFn.java:57)
at x.io.jobs.tracer.function.ReadAggTraceStatusByIdFn$DoFnInvoker.invokeSetup (Unknown Source)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.deserializeCopy (DoFnInstanceManagers.java:83)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.get (DoFnInstanceManagers.java:75)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.reallyStartBundle (SimpleParDoFn.java:296)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement (SimpleParDoFn.java:326)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process (ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process (OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output (GroupAlsoByWindowsParDoFn.java:185)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue (GroupAlsoByWindowFnRunner.java:108)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1 (ReduceFnRunner.java:1060)
Thanks.
There is a connection limit for Cloud SQL, which can be changed by setting the max_connections flag on an instance. There's more info on setting and viewing the value of database flags on an instance here: https://cloud.google.com/sql/docs/mysql/flags
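If you want to confirm what the instance currently allows, you can read max_connections from any client, and it is worth capping each worker's pool well below it. A minimal sketch in Go (the go-sql-driver/mysql driver and the DSN are assumptions; in the Java/dbcp2 pool shown in the stack trace, BasicDataSource.setMaxTotal plays the same role as SetMaxOpenConns):
package main

import (
    "database/sql"
    "fmt"
    "log"

    // Assumed driver for a MySQL-flavoured Cloud SQL instance.
    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Placeholder DSN -- substitute your user, password and instance address.
    db, err := sql.Open("mysql", "user:password@tcp(10.0.0.3:3306)/mydb")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Keep each client's pool well under the instance-wide limit; every Dataflow
    // worker contributes its own pool of connections.
    db.SetMaxOpenConns(10)

    var name, value string
    if err := db.QueryRow("SHOW VARIABLES LIKE 'max_connections'").Scan(&name, &value); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s = %s\n", name, value)
}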

Mongo C++ Driver - How to Change Timeout Configurations

How can I change the timeout duration for different operations that can fail due to server inaccessibility? (start_session, insert, find, delete, update, ...)
...
auto pool = mongocxx::pool(mongocxx::uri("bad_uri"), pool_options);
auto connection = pool.try_acquire();
auto db = (*(connection.value()))["test_db"];
auto collection = db["test_collection"];
// This does not help
mongocxx::write_concern wc;
wc.timeout(std::chrono::milliseconds(1000));
mongocxx::options::insert insert_options;
insert_options.write_concern(wc);
// takes about 30 seconds to fail
collection.insert_one(from_json(R"({"name": "john doe", "occupation": "_redacted_", "skills" : "a certain set"})"), insert_options);
[Edit]
Here is the exception message:
C++ exception with description "No suitable servers found:
serverSelectionTimeoutMS expired: [connection timeout calling
ismaster on '127.0.0.1:27017']
It would be helpful to see the actual error message from the insert_one() operation, but "takes about 30 seconds to fail" suggests that this may be due to the default server selection timeout. You can configure that via the serverSelectionTimeoutMS connection string option.
If you are connecting to a replica set, I would suggest keeping that timeout a bit above the expected time for a failover to complete. Replica Set Elections states:
The median time before a cluster elects a new primary should not typically exceed 12 seconds
You may find that it is shorter in practice. By keeping the server selection timeout above the expected failover time, you allow the driver to insulate your application from an error (at the expense of wait time).
If you are not connecting to a replica set, feel free to lower serverSelectionTimeoutMS, albeit keeping it greater than the expected latency to your mongod (standalone) or mongos (sharded cluster) node.
Do note that since server selection occurs within a loop, the connectTimeoutMS connection string option won't affect the delay you're seeing. Lowering the connection timeout will allow the driver to give up sooner when attempting to connect to an inaccessible server, but server selection will still block for up to serverSelectionTimeoutMS (and likely retry connections to the server during that loop).
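Since both options live in the connection string, the same URI string can be passed to mongocxx::uri in the code above. For illustration, here is a hedged sketch of the effect using the official Go driver's v1 API (the host and the timeout values are made up):
package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    // Hypothetical values: give up on server selection after 5s (instead of the
    // 30s default) and on individual connection attempts after 2s.
    uri := "mongodb://127.0.0.1:27017/?serverSelectionTimeoutMS=5000&connectTimeoutMS=2000"

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(context.Background())

    // Ping forces server selection; against an unreachable server it now fails
    // after roughly serverSelectionTimeoutMS rather than 30 seconds.
    if err := client.Ping(ctx, nil); err != nil {
        log.Fatal(err)
    }
}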

Azure SQL DW deadlock inside DMS

We are getting deadlocks inside DMS at least 30% of the time on several of the larger sprocs, which truncate and insert several million rows. However, only one query is running, so I don't see how the deadlock can be my fault:
Msg 110802, Level 16, State 1, Line 1
110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException, Message: SqlNativeBufferReader.Run, error in OdbcExecuteQuery: SqlState: 40001, NativeError: 1205, 'Error calling: SQLExecDirect(this->GetHstmt(), (SQLWCHAR *)statementText, SQL_NTS), SQL return code: -1 | SQL Error Info: SrvrMsgState: 71, SrvrSeverity: 13, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Transaction (Process ID 2265) was deadlocked on lock | generic waitable object resources with another process and has been chosen as the deadlock victim. Rerun the transaction. | Error calling: pReadConn->ExecuteQuery(statementText, bufferFormat) | state: FFFF, number: 7801, active connections: 120', Connection String: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB3\mpdwsvc (13732)-ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.3-b6c0a7b26544\sql\query
And:
110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException, Message: SqlNativeBufferReader.Run, error in OdbcExecuteQuery: SqlState: 40001, NativeError: 1205, 'Error calling: SQLExecDirect(this->GetHstmt(), (SQLWCHAR *)statementText, SQL_NTS), SQL return code: -1 | SQL Error Info: SrvrMsgState: 71, SrvrSeverity: 13, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Transaction (Process ID 804) was deadlocked on lock | generic waitable object resources with another process and has been chosen as the deadlock victim. Rerun the transaction. | Error calling: pReadConn->ExecuteQuery(statementText, bufferFormat) | state: FFFF, number: 8106, active connections: 240', Connection String: Driver={pdwodbc};APP=TypeC01-DmsNativeReader:DB38\mpdwsvc (14728)-ODBC;Trusted_Connection=yes;AutoTranslate=no;Server=\\.\pipe\DB.38-b6c0a7b26544\sql\query
Does this point at something obvious to check or fix? Or is an Azure support case the best road to a resolution?
update: support case 115111713384329 is open for this issue
update: our SQL DW got a new update on March 4, 2016 which supposedly fixes this issue. (I can't reproduce it on demand, so I can't say for sure.) If you run "select @@version", then 10.0.8224.5 or higher should have the fix. If you don't have the fix yet, I would imagine opening a support case and requesting it, or waiting a few weeks, would get you the fix.
Creating an Azure support case would be best in this case. If you can share your stored procedure, that would help us find the root cause.

How to debug "could not receive data from client: Connection reset by peer"

I'm running a django-celery application on Ubuntu 12.04.
When I run a Celery task from my web interface, I get the following error, taken from the PostgreSQL 9.3 logfile (at the maximum log level):
2013-11-12 13:57:01 GMT tss_usr 8113 LOG: could not receive data from client: Connection reset by peer
tss_usr is the PostgreSQL user of the Django application database, and (in this example) 8113 is, I guess, the pid of the process that killed the connection.
Have you got any idea why this happens, or at least how to debug this issue?
To make things work again I need to restart PostgreSQL, which is extremely inconvenient.
I know this is an older post, but I just found it because I had the same error today in my postgres logs. I narrowed it down to a PDO select statement. I'm using Zend Framework 1.10.3 on Ubuntu Precise.
The following PDO statement generated an error when $opinion is a long text string. The opinion column is of type text in my Postgres table. The query succeeds if $opinion is under a certain number of characters: 1000 characters works fine; 2000 characters fails with "could not receive data from client: Connection reset by peer".
$select = $this->db->select()
    ->from('datauserstopics')
    ->where("opinion = ?", trim($opinion))
    ->where("datatopicsid = ?", trim($tid))
    ->where("datausersid = ?", $datausersid);
$stmt = $this->db->query($select);
I circumvented the problem by using:
->where("substr(opinion,1,100) = ?",trim(substr($opinion,1,100)))
This is not a perfect solution, but for my purposes, the select statement using substr() suffices.
Note that I have no problem inserting long strings into the same table/column. The disconnect problem only appears for me on the PDO select with relatively long text strings.
I'm getting it in 2017 with 9.4. I have no text fields and don't know what a PDO is. My select statement is about 50 bytes long; I'm trying to fetch an int4 and a double precision. I suspect the error message can mean multiple things.
I've since found https://dba.stackexchange.com/questions/142350/postgres-could-not-receive-data-from-client-connection-reset-by-peer which indicates it could be a problem with the client configuration. My client is libpq, and PQconnectdb() is giving me a CONNECTION_OK return; it works at least partly.
For me, restarting the hypervisor hosting both Postgres and the application using it helped. I had seen stack traces in dmesg before, though.