BigQuery Storage Write / managedwriter API returns error server_shutting_down - google-cloud-platform

Given the advantages of the BigQuery Storage Write API, we replaced insertAll with the managedwriter API on our server a month ago. It worked well for about a month, but recently we started seeing the following error:
rpc error: code = Unavailable desc = closing transport due to: connection error:
desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR,
debug data: "server_shutting_down"
The versions of the managedwriter API dependencies are:
cloud.google.com/go/bigquery v1.25.0
google.golang.org/protobuf v1.27.1
Our server has retry logic for the Storage Write API that matches on error messages. We noticed that the response time of the Storage Write API grows longer after retrying, and as a result our server runs out of memory (OOM). We also tried increasing the request timeout to 30 seconds, but most of those requests still could not complete within it.
How to handle the error server_shutting_down correctly?
Update 02/08/2022
Our server uses the default stream of the managedwriter API, and the server_shutting_down error comes up periodically. The issue first appeared on 02/04/2022 12:00 PM UTC; before that, the default stream had worked well for over a month.
Here is our wrapper function around AppendRows; we log the elapsed time of this function.
func (cl *GBOutput) appendRows(ctx context.Context, datas [][]byte, schema *gbSchema) error {
    var result *managedwriter.AppendResult
    var err error
    if cl.schema != schema {
        // New schema object: send the updated descriptor along with the rows.
        cl.schema = schema
        result, err = cl.managedStream.AppendRows(ctx, datas, managedwriter.UpdateSchemaDescriptor(schema.descriptorProto))
    } else {
        result, err = cl.managedStream.AppendRows(ctx, datas)
    }
    if err != nil {
        return err
    }
    // Block until the append is acknowledged (or fails).
    _, err = result.GetResult(ctx)
    return err
}
When the server_shutting_down error comes up, this function can take several hundred seconds. It is very strange, and there seems to be no way to put a timeout on AppendRows.
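A context deadline can bound the whole call: both AppendRows and GetResult accept a ctx, so a stalled append should surface as context.DeadlineExceeded instead of blocking for hundreds of seconds. A minimal sketch building on the wrapper above (the helper name appendRowsWithTimeout is made up for illustration; it needs the standard library's context and time packages):

// Hypothetical helper: run appendRows with an upper bound on total time.
func (cl *GBOutput) appendRowsWithTimeout(ctx context.Context, datas [][]byte, schema *gbSchema, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()
    // The derived ctx covers both the AppendRows call and the GetResult wait.
    return cl.appendRows(ctx, datas, schema)
}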

Are you using the "raw" v1 storage API, or the managedwriter? I ask because managedwriter should handle stream reconnection automatically. Are you simply observing connection closes periodically, or does something about your retry traffic induce the closes?
The interesting question is how to deal with in-flight appends for which you haven't yet received an acknowledgement back (or the ack ended in failure). If you're using offsets, you should be able to re-send the append without risk of duplication.
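For illustration, a minimal retry sketch, not the library's built-in behavior: re-send an append at a fixed offset when the transport reports Unavailable. Offsets require an application-created stream (the default stream used in the question ignores them), and the attempt count and backoff values here are arbitrary assumptions:

// Assumed imports: context, fmt, time,
// cloud.google.com/go/bigquery/storage/managedwriter,
// google.golang.org/grpc/codes and google.golang.org/grpc/status.
func appendWithRetry(ctx context.Context, ms *managedwriter.ManagedStream, rows [][]byte, offset int64) error {
    backoff := 500 * time.Millisecond
    for attempt := 0; attempt < 5; attempt++ {
        result, err := ms.AppendRows(ctx, rows, managedwriter.WithOffset(offset))
        if err == nil {
            if _, err = result.GetResult(ctx); err == nil {
                return nil // acknowledged; safe to advance the offset
            }
        }
        if status.Code(err) != codes.Unavailable {
            return err // not a transient transport error; don't retry
        }
        time.Sleep(backoff) // back off, then re-send the same offset
        backoff *= 2
    }
    return fmt.Errorf("append at offset %d failed after retries", offset)
}

Because the offset pins each batch to a position in the stream, a re-sent append that actually landed the first time should be rejected as already written rather than duplicated.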

Per the GCP support engineer:
The issue is hit once 10MB has been sent over the connection, regardless of how long it takes or how much is inflight at that time. The BigQuery Engineering team has identified the root cause and the fix would be rolled out by Friday, Feb 11th, 2022.

Related

Google Cloud Functions Java 11 (Beta) Runtime - Performance Issue

I have created a new Cloud Function using Java 11 (Beta) Runtime to handle HTML form submission for my static site. It's a simple 3-field form (name, email, message). No file upload is involved. The function does 2 things primarily:
Creates a pull request with BitBucket
Sends email to me using SendGrid
NOTE: It also verifies recaptcha but I've disabled it for testing.
When run on my local machine (base model 2019 MacBook Pro 13"), the function takes about 3 secs. I'm based in SE Asia. The same function deployed to Google Cloud us-central1 takes about 25 secs (8 times slower). I have had almost the same code running in production for a few years as part of a servlet on the GAE Java 8 runtime, also in the US Central region; it takes about 2-3 secs including recaptcha verification and sending the email. I'm trying to port it over to a Cloud Function, but the performance is about 10 times slower, even without recaptcha verification.
For comparison, the Cloud Function runs on a 256MB / 400MHz instance, whereas my GAE Java 8 runtime runs on an F1 (128MB / 600MHz) instance. The function uses only about 75MB of memory and is configured to accept unauthenticated requests.
I noticed that even basic String concatenation like: String c = a + b; takes a good 100ms on the Cloud Function. I have timed the calls and a simple string concatenation of about 15 strings into one takes about 1.5-2.0 seconds.
Moreover, writing a small message (~ 1KB) to the HTTPUrlConnection output stream and reading the response back takes about 10 seconds (yes seconds)!
/* Writing < 1KB to output stream takes about 4-5 secs */
wr = new OutputStreamWriter(con.getOutputStream());
wr.write(encodedParams);
wr.flush();
wr.close();
/* Reading response also take about 4-5 secs */
String responseMessage = con.getResponseMessage();
Similarly, the SendGrid code below takes another 10 secs to send the email. It takes about 1 sec on my local machine.
Email from = new Email(fromEmail, fromName);
Email to = new Email(toEmail, toName);
Email replyTo = new Email(replyToEmail, replyToName);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);
mail.setReplyTo(replyTo);
SendGrid sg = new SendGrid(SENDGRID_API_KEY);
Request sgRequest = new Request();
Response sgResponse = null;
try {
    sgRequest.setMethod(Method.POST);
    sgRequest.setEndpoint("mail/send");
    sgRequest.setBody(mail.build());
    sgResponse = sg.api(sgRequest);
} catch (IOException ex) {
    throw ex;
}
Something is obviously wrong with the Cloud Function. Since my original code runs on the GAE Java 8 runtime, it was very easy for me to port it over to the Cloud Function with minor changes; otherwise I would have gone with the NodeJS runtime. I'm also not seeing any of these performance issues when running the function on my local machine.
Can someone help me make sense of the slow performance issue?
What you're seeing is almost certainly due to the "cold start" cost associated with the creation of a new server instance to handle the request. This is an issue with all types of Cloud Functions, as described in the documentation:
Several of the recommendations in this document center around what is known as a cold start. Functions are stateless, and the execution environment is often initialized from scratch, which is called a cold start. Cold starts can take significant amounts of time to complete. It is best practice to avoid unnecessary cold starts, and to streamline the cold start process to whatever extent possible (for example, by avoiding unnecessary dependencies).
I would expect JVM languages to have an even longer cold start time due to the amount of time that it takes to initialize a JVM, in addition to the server instance itself.
Other than the advice above, there is very little one can do to effectively mitigate cold starts. Efforts to keep a function warm are not as effective as you might imagine. There is a lot of discussion about this on the internet if you wish to search.
Keep in mind that the Java runtime is also in beta, so you can expect improvements in the future. The same thing happened with the other runtimes.
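One of the few levers that does help is streamlining what each cold start must do, chiefly by moving expensive setup into global scope so it runs once per instance rather than once per request. A minimal sketch of the pattern in Go (the idea carries over to the Java runtime; expensiveClient is a made-up stand-in for something heavy like a SendGrid client):

package function

import (
    "fmt"
    "net/http"
)

// Stand-in for a costly dependency (API clients, template parsing, etc.).
type expensiveClient struct{ ready bool }

func newExpensiveClient() *expensiveClient {
    // Imagine slow setup here; it runs once, during the cold start.
    return &expensiveClient{ready: true}
}

// Global scope: initialized when the instance starts and reused by
// every warm invocation afterwards.
var client = newExpensiveClient()

// HandleForm is the HTTP entry point; warm requests skip the setup cost.
func HandleForm(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "client ready: %v\n", client.ready)
}

In the Java runtime the equivalent is initializing such objects in static fields or the constructor rather than inside the request handler.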

Akka StreamRefs - IllegalStateException (Saw RemoteStreamCompleted while in state UpstreamTerminated)

I'm trying to send a stream of audio from service A to service B using Akka stream refs (Akka version 2.6.3). Everything works rather well, except that about once a month an exception is thrown in the stream ref (with usage of this service being around 50k calls per day), and I can't find the cause of the problem.
The stacktrace for error is following:
Caused by: java.lang.IllegalStateException: [SourceRef-46] Saw RemoteStreamCompleted(37) while in state UpstreamTerminated(Actor[akka://system-name#serviceA:34363/system/Materializers/StreamSupervisor-3/$$S4-SinkRef-3405#-939568637]), should never happen
at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.$anonfun$receiveRemoteMessage$1(SourceRefImpl.scala:285)
at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.$anonfun$receiveRemoteMessage$1$adapted(SourceRefImpl.scala:196)
at akka.stream.stage.GraphStageLogic$StageActor.internalReceive(GraphStage.scala:243)
at akka.stream.stage.GraphStageLogic$StageActor.$anonfun$callback$1(GraphStage.scala:202)
at akka.stream.stage.GraphStageLogic$StageActor.$anonfun$callback$1$adapted(GraphStage.scala:202)
at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:466)
at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:497)
at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:599)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:768)
at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:783)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:690)
... 11 common frames omitted
The code responsible for pushing audio through SourceRef in service A:
Materializer materializer = Materializer.createMaterializer(actorSystem);
// Gets all audio bytes from the audio file and puts them into chunks
// (byte arrays of a certain length).
AudioExtractor extractor = new AudioExtractorImpl("/path/to/audio/file");
List<AudioChunk> audioChunkList = extractor.getChunkedBytesIntoList();
SourceRef<AudioChunk> sourceRef = Source.from(audioChunkList)
        .runWith(StreamRefs.sourceRef(), materializer);
// Wrap the sourceRef into a msg and send it to service B.
serviceBActor.tell(wrappedAudioSourceRefInMsg, getSelf());
And here is the code responsible for accepting the audio in service B:
private final List<AudioChunk> audioChunksBuffer = new ArrayList<>();
private final Materializer materializer;

public Receive createReceive() {
    return receiveBuilder.match(WrappedAudioSourceRefInMsg.class, response -> {
        response.getSourceRef()
                .getSource()
                .runWith(Sink.forEach(chunk -> audioChunksBuffer.add(chunk)), materializer);
    }).build();
}
What I've confirmed is that this error always happens after all the audio has been sent from service A and the stream has completed. I can't figure out, though, why the SourceRef receives RemoteStreamCompleted while in state UpstreamTerminated. Especially frustrating is the "should never happen" part of the message. :|
Any help with this would be much welcome.
Closing; this is a bug in Akka, reported here: https://github.com/akka/akka/issues/28852

Go HTTP service implemented with Gin sends about 50% of the data then closes the connection, why?

I have an HTTP service which sends very large files to the client (most often between 50MB and 100MB). When I start several clients in parallel, I often get part of the file (500KB to 25MB) and then the server closes the connection.
If I'm the only client, doing a single GET over a single connection, the server doesn't close the connection until the entire file has been transferred. So the service works when not under load.
Here is what my Gin handler looks like:
func handler(c *gin.Context) {
    ...some initialization...
    // prepare list of readers
    readers := []io.Reader{}
    size := int64(0)
    ...load some data in a []byte buffer...
    // create one reader and stack it in `readers`
    readers = append(readers, bytes.NewReader(data))
    size += int64(len(data))
    ...repeat as needed for this request...
    // create stream
    r := io.MultiReader(readers...)
    // send data
    c.Status(http.StatusOK) // HTTP 200 response
    c.Header("Content-Length", strconv.FormatInt(size, 10))
    io.CopyN(c.Writer, r, size)
}
As I mentioned, the process works just fine when I do one GET at a time and wait for completion.
When I run two or more GET in parallel, that's when things break.
I'm wondering whether the Gin TCP connection gets reused and thus the io.CopyN() gets interrupted before it completes. If so, how would I wait to make sure io.CopyN() is done?!
I know a close() is possible if I'm directly in control of the socket, but that doesn't work well when the HTTP protocol said keep-alive, and by closing on my end I may be stepping on Gin's toes. I have some integration tests which send/receive hundreds of connections and transfer MBs of data, and these work great; only, I don't ever send two requests in parallel in those tests (at least not yet).
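One thing worth noting: io.CopyN already blocks until size bytes have been copied or an error occurs, so there is nothing extra to wait for. The first step is to inspect its return values, which the handler above currently discards; a minimal diagnostic sketch (reusing the names from the handler, plus the standard library's log package):

written, err := io.CopyN(c.Writer, r, size)
if err != nil {
    // e.g. the client went away mid-transfer, or a server write deadline fired
    log.Printf("transfer aborted after %d of %d bytes: %v", written, size, err)
}

Whatever is closing the connection under load should show up in that error.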

Mongo C++ Driver - How to Change Timeout Configurations

How can I change the timeout duration for different operations that can fail due to server inaccessibility? (start_session, insert, find, delete, update, ...)
...
auto pool = mongocxx::pool(mongocxx::uri("bad_uri"), pool_options);
auto connection = pool.try_acquire();
auto db = (*(connection.value()))["test_db"];
auto collection = db["test_collection"];
// This does not help
mongocxx::write_concern wc;
wc.timeout(std::chrono::milliseconds(1000));
mongocxx::options::insert insert_options;
insert_options.write_concern(wc);
// takes about 30 seconds to fail
collection.insert_one(from_json(R"({"name": "john doe", "occupation": "_redacted_", "skills" : "a certain set"})"), insert_options);
[Edit]
Here is the exception message:
C++ exception with description "No suitable servers found:
serverSelectionTimeoutMS expired: [connection timeout calling
ismaster on '127.0.0.1:27017']
It would be helpful to see the actual error message from the insert_one() operation, but "takes about 30 seconds to fail" suggests that this may be due to the default server selection timeout. You can configure that via the serverSelectionTimeoutMS connection string option.
If you are connecting to a replica set, I would suggest keeping that timeout a bit above the expected time for a failover to complete. Replica Set Elections states:
The median time before a cluster elects a new primary should not typically exceed 12 seconds
You may find that it is shorter in practice. By keeping the server selection timeout above the expected failover time, you'll allow the driver to insulate your application from an error (at the expense of wait time).
If you are not connecting to a replica set, feel free to lower serverSelectionTimeoutMS, albeit to a value still greater than the expected latency to your mongod (standalone) or mongos (sharded cluster) node.
Do note that since server selection occurs within a loop, the connectTimeoutMS connection string option won't affect the delay you're seeing. Lowering the connection timeout will allow the driver to internally give up when attempting to connect to an inaccessible server, but server selection will still block for up to serverSelectionTimeoutMS (and likely retry connections to the server during that loop).
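For illustration, here is a connection string with example timeout values (the numbers are placeholders, not recommendations), capping server selection at five seconds and each connection attempt at two:

mongodb://127.0.0.1:27017/?serverSelectionTimeoutMS=5000&connectTimeoutMS=2000

Passed to mongocxx::uri() in the pool construction shown in the question, this should make the insert_one() call fail after roughly five seconds instead of thirty.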

StatsD gauge timer send data issue - cannot send only one value to statsd server in one flush interval

I use a StatsD client based on the Akka IO source code to send data to a StatsD server. In my scenario, I want to monitor Spark job status: if the current job succeeds, we send 1 to the StatsD server, otherwise 0. So in one flush interval I want to send just one value (1 or 0), but that didn't work. If I add a for loop and send the value (1 or 0) at least twice, it works. I don't know why I should have to send the same value twice, so I checked the statsd source code and found:
for (key in gauges) {
    var namespace = gaugesNamespace.concat(sk(key));
    stats.add(namespace.join(".") + globalSuffix, gauges[key], ts);
    numStats += 1;
}
So gauges is something that gets iterated over; my thought was that if I send only one value it cannot be iterated, but maybe that is wrong. I hope someone can explain why I have to send the value at least twice.
My client code snippet:
for (i <- 1 to 2) {
    client ! ExcutionTime("StatsD_Prod.Reporting." + name + ":" + status_str + "|ms", status)
    Thread.sleep(100)
}