Akka StreamRefs - IllegalStateException (Saw RemoteStreamCompleted while in state UpstreamTerminated) - akka

I'm trying to send a stream of audio from service A to service B using Akka stream refs (akka-streams library version 2.6.3). Everything works rather well, except that about once a month an exception is thrown in the Akka stream ref (with this service handling around 50k calls per day), and I can't find the cause of the problem.
The stack trace for the error is the following:
Caused by: java.lang.IllegalStateException: [SourceRef-46] Saw RemoteStreamCompleted(37) while in state UpstreamTerminated(Actor[akka://system-name#serviceA:34363/system/Materializers/StreamSupervisor-3/$$S4-SinkRef-3405#-939568637]), should never happen
at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.$anonfun$receiveRemoteMessage$1(SourceRefImpl.scala:285)
at akka.stream.impl.streamref.SourceRefStageImpl$$anon$1.$anonfun$receiveRemoteMessage$1$adapted(SourceRefImpl.scala:196)
at akka.stream.stage.GraphStageLogic$StageActor.internalReceive(GraphStage.scala:243)
at akka.stream.stage.GraphStageLogic$StageActor.$anonfun$callback$1(GraphStage.scala:202)
at akka.stream.stage.GraphStageLogic$StageActor.$anonfun$callback$1$adapted(GraphStage.scala:202)
at akka.stream.impl.fusing.GraphInterpreter.runAsyncInput(GraphInterpreter.scala:466)
at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:497)
at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:599)
at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:768)
at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:783)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:690)
... 11 common frames omitted
The code responsible for pushing audio through the SourceRef in service A:
Materializer materializer = Materializer.createMaterializer(actorSystem);
AudioExtractor extractor = new AudioExtractorImpl("/path/to/audio/file"); // gets all audio bytes from audio file and puts them into chunks (byte arrays of certain length)
List<AudioChunk> audioChunkList = extractor.getChunkedBytesIntoList();
SourceRef<AudioChunk> sourceRef = Source.from(audioChunkList)
        .runWith(StreamRefs.sourceRef(), materializer);
// wrap the sourceRef into msg
serviceBActor.tell(wrappedAudioSourceRefInMsg, getSelf());
And the code responsible for accepting the audio in service B:
private final List<AudioChunk> audioChunksBuffer = new ArrayList<>();
private final Materializer materializer;
public Receive createReceive() {
    return receiveBuilder.match(WrappedAudioSourceRefInMsg.class, response -> {
        response.getSourceRef()
            .getSource()
            .runWith(Sink.forEach(chunk -> audioChunksBuffer.add(chunk)), materializer);
    }).build();
}
What I've confirmed is that this error always happens after all the audio has been sent from service A and the stream has completed. What I can't figure out is why the SourceRef receives RemoteStreamCompleted while in the UpstreamTerminated state. Especially frustrating is the should never happen part of the message. :|
Any help with this would be most welcome.

Closing; this is a bug in Akka, reported here: https://github.com/akka/akka/issues/28852

Related

BigQuery Storage Write / managedwriter API returns error server_shutting_down

Because of the advantages of the BigQuery Storage Write API, one month ago we replaced insertAll with the managedwriter API on our server. It seemed to work well for a month; however, we have recently been seeing the following errors:
rpc error: code = Unavailable desc = closing transport due to: connection error:
desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR,
debug data: "server_shutting_down"
The versions of the managedwriter API are:
cloud.google.com/go/bigquery v1.25.0
google.golang.org/protobuf v1.27.1
There is retry logic for the Storage Write API on our server side that detects these error messages. We noticed that the response time of the Storage Write API becomes longer after retrying; as a result, OOM occurs on our server. We also tried increasing the request timeout to 30 seconds, and most of those requests could not be completed within it.
How do we handle the server_shutting_down error correctly?
Update 02/08/2022
Our server uses the default stream of the managedwriter API, and the server_shutting_down error comes up periodically. This issue happened on 02/04/2022 12:00 PM UTC, and before that the default stream of the managedwriter API had worked well for over one month.
Here is a wrapper function around AppendRows; we log how long this function takes.
func (cl *GBOutput) appendRows(ctx context.Context, datas [][]byte, schema *gbSchema) error {
    var result *managedwriter.AppendResult
    var err error
    if cl.schema != schema {
        cl.schema = schema
        result, err = cl.managedStream.AppendRows(ctx, datas, managedwriter.UpdateSchemaDescriptor(schema.descriptorProto))
    } else {
        result, err = cl.managedStream.AppendRows(ctx, datas)
    }
    if err != nil {
        return err
    }
    _, err = result.GetResult(ctx)
    return err
}
When the server_shutting_down error comes up, this function can take several hundred seconds. It is very strange, and there seems to be no way to handle a timeout for AppendRows.
Are you using the "raw" v1 storage API, or the managedwriter? I ask because managedwriter should handle stream reconnection automatically. Are you simply observing connection closes periodically, or does something about your retry traffic induce the closes?
The interesting question is how to deal with in-flight appends for which you haven't yet received an acknowledgement back (or the ack ended in failure). If you're using offsets, you should be able to re-send the append without risk of duplication.
Per the GCP support engineer:
The issue is hit once 10MB has been sent over the connection, regardless of how long it takes or how much is inflight at that time. The BigQuery Engineering team has identified the root cause and the fix would be rolled out by Friday, Feb 11th, 2022.

QueueTrigger: Message with encoding error doesn't push message to poison queue

I have a WebJob queue trigger which responds to queue messages, and it works fine. However, sometimes we push messages into the queue manually, and a manual mistake can cause a DecoderFallbackException. The strange behavior is that the WebJob seems to keep retrying an unlimited number of times, and our AI logs turn into a mess. I tried restarting the WebJob to see if it clears any internal cache, but that doesn't help.
The only thing that helps is deleting the queue.
Ideally, any exception beyond the dequeue count should move the message to the poison queue.
I've tried to reproduce your issue on my side, but it works well. First I create a backend demo that inserts an invalid byte message into the queue, which could cause a DecoderFallbackException:
Encoding ae = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
string inputString = "XYZ";
byte[] encodedBytes = new byte[ae.GetByteCount(inputString)];
ae.GetBytes(inputString, 0, inputString.Length, encodedBytes, 0);
// make the bytes invalid
encodedBytes[0] = 0xFF;
encodedBytes[2] = 0xFF;
CloudQueueMessage message = new CloudQueueMessage(encodedBytes);
queue.AddMessage(message);
WebJob code:
public static void ProcessQueueMessage([QueueTrigger("queue")] string message, TextWriter log)
{
    log.WriteLine(message);
}
After the exception occurs 5 times, the message is moved to 'queue-poison'. This is the expected behavior. Check here for details:
maxDequeueCount (default 5): the number of times to try processing a message before moving it to the poison queue.
You may check whether you accidentally set "maxDequeueCount" to a bigger value. If not, please provide your WebJob code and how you encounter the DecoderFallbackException so we can investigate.

Akka HTTP Stream listener stops processing databytes after a while

I have an app which has 3 HTTP listeners like this one:
val futureResponse1: Future[HttpResponse] =
  Http().singleRequest(HttpRequest(uri = someUrl))
Each of the 3 listens to a non-stop stream (each to a different one) and handles it with a simple flow that starts with grouping, followed by relatively fast, non-blocking processing:
futureResponse1.flatMap { response =>
  response.status match {
    case StatusCodes.OK =>
      val source: Source[ByteString, Any] = response.entity.dataBytes
      source
        .grouped(100)
        .map(doSomethingFast)
        .runWith(Sink.ignore)
    case notOK => system.log.info("failed opening, status: " + notOK.toString())
  }
...
I get no exceptions or warnings, but after a while (could be 15-25 minutes) the listeners just suddenly stop, one after the other (not together).
Maybe it's the grouped stage that is the problem? Or maybe the connection/stream just stops? Or the dispatcher they share is getting starved / something is not getting released.
Any ideas why that may be happening please?
==== update ====
@Ramon J Romero y Vigil
I changed my run to have only 1 stream instead of 3, and I removed the grouped stage. It still happens after a few minutes. I suspect that the stream is closing based on a timeout. All I do is get chunks and consume them.
==== update ====
Found the reason, see below.
That was the reason:
EntityStreamSizeException: actual entity size (None) exceeded content length limit (8388608 bytes)! You can configure this by setting akka.http.[server|client].parsing.max-content-length or calling HttpEntity.withSizeLimit before materializing the dataBytes stream.
For anyone seeking the solution in the case of a continuous response stream, you can get the source this way, using withoutSizeLimit:
val source: Source[ByteString, Any] = response.entity.withoutSizeLimit().dataBytes

StatsD gauge timer send data issue - cannot send only one value to statsd server in one flush interval

I use a StatsD client based on the Akka IO source code to send data to a StatsD server. In my scenario, I want to monitor Spark job status: if the current job succeeds, we send 1 to the StatsD server, otherwise 0. So in one flush interval I just want to send one value (1 or 0) to the StatsD server, but that didn't work. If I add a for loop and send this value (1 or 0) at least twice, it works. But I don't know why I should have to send the same value twice, so I checked the StatsD source code and found:
for (key in gauges) {
    var namespace = gaugesNamespace.concat(sk(key));
    stats.add(namespace.join(".") + globalSuffix, gauges[key], ts);
    numStats += 1;
}
So I thought gauges must be something iterable, and if I just send one value it cannot be iterated. Maybe this is wrong; I hope someone can explain why I have to send a value at least twice.
My client code snippet:
for (i <- 1 to 2) {
  client ! ExcutionTime("StatsD_Prod.Reporting." + name + ":" + status_str + "|ms", status)
  Thread.sleep(100)
}

zeromq: reset REQ/REP socket state

When you use the simple ZeroMQ REQ/REP pattern you depend on a fixed send()->recv() / recv()->send() sequence.
As this article describes, you get into trouble when a participant disconnects in the middle of a request, because then you can't just start over with receiving the next request from another connection; the state machine would force you to send a request to the disconnected one.
Has a more elegant way to solve this emerged since the mentioned article was written?
Is reconnecting the only way to solve this (apart from not using REQ/REP and using another pattern instead)?
As the accepted answer seemed so terribly sad to me, I did some research and found that everything we need was actually in the documentation.
Calling .setsockopt() with the correct parameter can help you reset your socket state machine without brutally destroying it and rebuilding another one on top of the previous one's dead body.
(yeah I like the image).
ZMQ_REQ_CORRELATE: match replies with requests
The default behaviour of REQ sockets is to rely on the ordering of messages to match requests and responses and that is usually sufficient. When this option is set to 1, the REQ socket will prefix outgoing messages with an extra frame containing a request id. That means the full message is (request id, 0, user frames…). The REQ socket will discard all incoming messages that don't begin with these two frames.
Option value type: int
Option value unit: 0, 1
Default value: 0
Applicable socket types: ZMQ_REQ
ZMQ_REQ_RELAXED: relax strict alternation between request and reply
By default, a REQ socket does not allow initiating a new request with zmq_send(3) until the reply to the previous one has been received. When set to 1, sending another message is allowed and has the effect of disconnecting the underlying connection to the peer from which the reply was expected, triggering a reconnection attempt on transports that support it. The request-reply state machine is reset and a new request is sent to the next available peer.
If set to 1, also enable ZMQ_REQ_CORRELATE to ensure correct matching of requests and replies. Otherwise a late reply to an aborted request can be reported as the reply to the superseding request.
Option value type: int
Option value unit: 0, 1
Default value: 0
Applicable socket types: ZMQ_REQ
The complete documentation is here.
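To make the two options concrete, here is a minimal pyzmq sketch that combines them with a receive timeout (the endpoint, timeout, and payload are illustrative, not from the question):
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)

# Allow sending a new request even if the previous reply never arrived,
# and tag requests with an id so late replies are discarded.
sock.setsockopt(zmq.REQ_RELAXED, 1)
sock.setsockopt(zmq.REQ_CORRELATE, 1)
sock.setsockopt(zmq.RCVTIMEO, 500)  # milliseconds

sock.connect("tcp://127.0.0.1:5555")

sock.send(b"ping")
try:
    reply = sock.recv()
except zmq.Again:
    # Timed out waiting for the reply. With ZMQ_REQ_RELAXED set, the
    # state machine is reset and we may simply send again instead of
    # closing and recreating the socket.
    sock.send(b"ping")
With both options set, a timed-out request no longer wedges the REQ socket, which is exactly the state-machine reset the question asks about.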
The good news is that, as of ZMQ 3.0 and later (the modern era), you can set a timeout on a socket. As others have noted elsewhere, you must do this after you have created the socket, but before you connect it:
zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 ) # milliseconds
Then, when you actually try to receive the reply (after you have sent a message to the REP socket), you can catch the error that will be asserted if the timeout is exceeded:
try:
    send( message, 0 )
    send_failed = False
except zmq.Again:
    logging.warning( "Image send failed." )
    send_failed = True
However! When this happens, as observed elsewhere, your socket will be in a funny state, because it will still be expecting the response. At this point, I cannot find anything that works reliably other than just restarting the socket. Note that if you disconnect() the socket and then re-connect() it, it will still be in this bad state. Thus you need to:
def reset_my_socket():
    global zmq_req_socket  # rebind the module-level socket
    zmq_req_socket.close()
    zmq_req_socket = zmq_context.socket( zmq.REQ )
    zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 )  # milliseconds
    zmq_req_socket.connect( zmq_endpoint )
You will also notice that because I close()d the socket, the receive timeout option was "lost", so it is important to set it again on the new socket.
I hope this helps. And I hope that this does not turn out to be the best answer to this question. :)
There is one solution to this, and that is adding timeouts to all calls. Since ZeroMQ by itself does not really provide simple timeout functionality, I recommend using a subclass of the ZeroMQ socket that adds a timeout parameter to all important calls.
So, instead of calling s.recv() you would call s.recv(timeout=5.0), and if a response does not come back within that 5-second window it will return None and stop blocking. I had made a futile attempt at this when I ran into this problem.
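A minimal sketch of that idea, using pyzmq's poll() to implement a recv with a timeout (the wrapper name, the LINGER setting, and the 5-second default are illustrative additions, not from the original answer):
import zmq

class TimeoutReqSocket:
    """Thin wrapper around a REQ socket whose recv() takes a timeout."""

    def __init__(self, context, endpoint):
        self._context = context
        self._endpoint = endpoint
        self._connect()

    def _connect(self):
        self._socket = self._context.socket(zmq.REQ)
        self._socket.setsockopt(zmq.LINGER, 0)  # don't block on close()
        self._socket.connect(self._endpoint)

    def send(self, data):
        self._socket.send(data)

    def recv(self, timeout=5.0):
        # Poll instead of blocking forever; return None on timeout.
        if self._socket.poll(int(timeout * 1000), zmq.POLLIN):
            return self._socket.recv()
        # The REQ state machine is now stuck waiting for a reply,
        # so rebuild the socket before the next request.
        self._socket.close()
        self._connect()
        return None
The rebuild in the timeout branch is what keeps the strict send/recv alternation from wedging the socket after a lost reply.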
I'm actually looking into this at the moment, because I am retrofitting a legacy system.
I constantly come across code that "needs" to know about the state of the connection. However, I want to move to the message-passing paradigm that the library promotes.
I found the following function: zmq_socket_monitor
What it does is monitor the socket passed to it and generate events that are then passed to an "inproc" endpoint - at that point you can add handling code to actually do something.
There is also an example (actually test code) here: github
I have not got any specific code to give at the moment (maybe at the end of the week), but my intention is to respond to the connects and disconnects so that I can actually perform any resetting of logic required.
Hope this helps, and despite quoting the 4.2 docs, I am using 4.0.4, which seems to have the functionality as well.
Note: I notice you talk about Python above, but the question is tagged C++, so that's where my answer is coming from...
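For the Python side mentioned above, a minimal pyzmq sketch of the same monitoring idea might look like this (pyzmq wraps zmq_socket_monitor as get_monitor_socket; the endpoint and event handling are illustrative, and the monitor loop would normally run in its own thread):
import zmq
from zmq.utils.monitor import recv_monitor_message

context = zmq.Context()
socket = context.socket(zmq.REQ)

# Ask libzmq to publish socket events on an internal inproc PAIR socket.
monitor = socket.get_monitor_socket(zmq.EVENT_CONNECTED | zmq.EVENT_DISCONNECTED)
socket.connect("tcp://127.0.0.1:5555")

while True:
    event = recv_monitor_message(monitor)  # blocks until the next event
    if event["event"] == zmq.EVENT_DISCONNECTED:
        # The peer went away: any resetting of logic would go here,
        # e.g. closing and recreating the REQ socket.
        print("disconnected from", event["endpoint"])
    elif event["event"] == zmq.EVENT_CONNECTED:
        print("connected to", event["endpoint"])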
Update: I'm updating this answer with this excellent resource: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ Socket programming is complicated, so do check out the references in this post.
None of the answers here seem accurate or useful. The OP is not looking for information on BSD socket programming. He is trying to figure out how to robustly handle accept()ed client-socket failures in ZMQ on the REP socket to prevent the server from hanging or crashing.
As already noted, this problem is complicated by the fact that ZMQ tries to pretend that the server's listen()ing socket is the same as an accept()ed socket (and nowhere does the documentation describe how to set basic timeouts on such sockets).
My answer:
After doing a lot of digging through the code, the only relevant socket options passed along to accept()ed sockets seem to be the keep-alive options from the parent listen()er. So the solution is to set the following options on the listen socket before calling send or recv:
void zmq_setup(zmq::context_t** context, zmq::socket_t** socket, const char* endpoint)
{
    // Free old references.
    if (*socket != NULL)
    {
        (**socket).close();
        (**socket).~socket_t();
    }
    if (*context != NULL)
    {
        // Shutdown all previous server client-sockets.
        zmq_ctx_destroy((*context));
        (**context).~context_t();
    }
    *context = new zmq::context_t(1);
    *socket = new zmq::socket_t(**context, ZMQ_REP);

    // Enable TCP keep alive.
    int is_tcp_keep_alive = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE, &is_tcp_keep_alive, sizeof(is_tcp_keep_alive));

    // Only send 2 probes to check if client is still alive.
    int tcp_probe_no = 2;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_CNT, &tcp_probe_no, sizeof(tcp_probe_no));

    // How long does a con need to be "idle" for in seconds.
    int tcp_idle_timeout = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_IDLE, &tcp_idle_timeout, sizeof(tcp_idle_timeout));

    // Time in seconds between individual keep alive probes.
    int tcp_probe_interval = 1;
    (**socket).setsockopt(ZMQ_TCP_KEEPALIVE_INTVL, &tcp_probe_interval, sizeof(tcp_probe_interval));

    // Discard pending messages in buf on close.
    int is_linger = 0;
    (**socket).setsockopt(ZMQ_LINGER, &is_linger, sizeof(is_linger));

    // TCP user timeout on unacknowledged send buffer.
    int is_user_timeout = 2;
    (**socket).setsockopt(ZMQ_TCP_MAXRT, &is_user_timeout, sizeof(is_user_timeout));

    // Start internal enclave event server.
    printf("Host: Starting enclave event server\n");
    (**socket).bind(endpoint);
}
What this does is tell the operating system to aggressively check the client socket for timeouts and reap them for cleanup when a client doesn't return a heart beat in time. The result is that the OS will send a SIGPIPE back to your program and socket errors will bubble up to send / recv - fixing a hung server. You then need to do two more things:
1. Handle SIGPIPE errors so the program doesn't crash
#include <signal.h>
#include <zmq.hpp>

// zmq_setup def here [...]

int main(int argc, char** argv)
{
    // Ignore SIGPIPE signals.
    signal(SIGPIPE, SIG_IGN);

    // ... rest of your code after
    // (Could potentially also restart the server
    // sock on N SIGPIPEs if you're paranoid.)

    // Start server socket.
    const char* endpoint = "tcp://127.0.0.1:47357";
    zmq::context_t* context = NULL;  // must start as NULL so zmq_setup skips the cleanup branch
    zmq::socket_t* socket = NULL;
    zmq_setup(&context, &socket, endpoint);

    // Message buffers.
    zmq::message_t request;
    zmq::message_t reply;

    // ... rest of your socket code here
}
2. Check for -1 returned by send or recv and catch ZMQ errors.
// E.g. skip broken accepted sockets (pseudo-code.)
while (1)
{
    try
    {
        if ((*socket).recv(&request) == -1)
            throw -1;
    }
    catch (...)
    {
        // Prevent any endless error loops killing CPU.
        sleep(1);

        // Reset ZMQ state machine.
        try
        {
            zmq::message_t blank_reply = zmq::message_t();
            (*socket).send(blank_reply);
        }
        catch (...)
        {
        }
        continue;
    }
}
Notice the weird code that tries to send a reply on a socket failure? In ZMQ, a REP server "socket" is an endpoint to another program making a REQ socket to that server. The result is that if you do a recv on a REP socket with a hung client, the server socket becomes stuck in a broken receive loop where it will wait forever to receive a valid reply.
To force an update on the state machine, you try to send a reply. ZMQ detects that the socket is broken and removes it from its queue. The server socket becomes "unstuck", and the next recv call returns a new client from the queue.
To enable timeouts on an async client (in Python 3), the code would look something like this:
import asyncio
import zmq
import zmq.asyncio

ctx = zmq.asyncio.Context()

@asyncio.coroutine
def req(endpoint):
    ms = 2000  # In milliseconds.
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.SNDTIMEO, ms)
    sock.setsockopt(zmq.RCVTIMEO, ms)
    sock.setsockopt(zmq.LINGER, ms)  # Discard pending buffered socket messages on close().
    sock.setsockopt(zmq.CONNECT_TIMEOUT, ms)

    # Connect the socket.
    # Connections don't strictly happen here.
    # ZMQ waits until the socket is used (which is confusing, I know.)
    sock.connect(endpoint)

    # Send some bytes.
    yield from sock.send(b"some bytes")

    # Recv bytes and convert to unicode.
    msg = yield from sock.recv()
    msg = msg.decode(u"utf-8")
Now you have some failure scenarios when something goes wrong.
By the way -- if anyone's curious -- the default value for TCP idle timeout in Linux seems to be 7200 seconds or 2 hours. So you would be waiting a long time for a hung server to do anything!
Sources:
https://github.com/zeromq/libzmq/blob/84dc40dd90fdc59b91cb011a14c1abb79b01b726/src/tcp_listener.cpp#L82 TCP keep alive options preserved for client sock
http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ How does keep alive work
https://github.com/zeromq/libzmq/blob/master/builds/zos/README.md Handling sig pipe errors
https://github.com/zeromq/libzmq/issues/2586 for information on closing sockets
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
https://github.com/zeromq/libzmq/issues/976
Disclaimer:
I've tested this code and it seems to be working, but ZMQ does complicate testing this a fair bit because the client re-connects on failure? If anyone wants to use this solution in production, I recommend writing some basic unit tests, first.
The server code could also be improved a lot with threading or polling to be able to handle multiple clients at once. As it stands, a malicious client can temporarily take up resources from the server (3 second timeout) which isn't ideal.