gRPC: What are the best practices for long-running streaming? - c++

We've implemented a Java gRPC service that runs in the cloud, with a unidirectional (client-to-server) streaming RPC which looks like:
rpc PushUpdates(stream Update) returns (Ack);
A C++ client (a mobile device) calls this rpc as soon as it boots up, to continuously send an update every 30 or so seconds, perpetually as long as the device is up and running.
ChannelArguments chan_args;
// this will be secure channel eventually
auto channel_p = CreateCustomChannel(remote_addr, InsecureChannelCredentials(), chan_args);
auto stub_p = DialTcc::NewStub(channel_p);
// ...
Ack ack;
auto strm_ctxt_p = make_unique<ClientContext>();
auto strm_p = stub_p->PushUpdates(strm_ctxt_p.get(), &ack);
// ...
while (true) {
// wait until we are ready to send a new update
Update updt;
// populate updt;
if(!strm_p->Write(updt)) {
// stream is not kosher, create a new one and restart
break;
}
}
Different kinds of network interruptions happen while this is running:
the gRPC service running in the cloud may go down (for maintenance) or may simply become unreachable.
the device's own IP address keeps changing, as it is a mobile device.
We've seen that on such events neither the channel nor the Write() API is able to detect the network disconnection reliably. At times the client keeps calling Write() (which doesn't return false) but the server doesn't receive any data (Wireshark doesn't show any activity on the outgoing port of the client device).
What are the best practices to recover in such cases, so that the server starts receiving the updates again within X seconds of such an event? It is understandable that there would be a loss of X seconds' worth of data whenever such an event happens, but we want to recover reliably within X seconds.
gRPC version: 1.30.2, Client: C++14/Linux, Server: Java/Linux

Here's how we've hacked this. I want to check if this can be made any better or anyone from gRPC can guide me about a better solution.
The protobuf for our service looks like this. It has an RPC for pinging the service, which is used frequently to test connectivity.
// Message used in IsAlive RPC
message Empty {}
// Acknowledgement sent by the service for updates received
message UpdateAck {}
// Messages streamed to the service by the client
message Update {
...
...
}
service GrpcService {
// for checking if we're able to connect
rpc Ping(Empty) returns (Empty);
// streaming RPC for pushing updates by client
rpc PushUpdate(stream Update) returns (UpdateAck);
}
Here is how the c++ client looks, which does the following:
Connect():
Create the stub for calling the RPCs, if the stub is nullptr.
Call Ping() in regular intervals until it is successful.
On success call PushUpdate(...) RPC to create a new stream.
On failure reset the stream to nullptr.
Stream(): Do the following in a while(true) loop:
Get the update to be pushed.
Call Write(...) on the stream with the update to be pushed.
If Write(...) fails for any reason, break so that control goes back to Connect().
At a regular interval (15 minutes in the code below), reset everything (stub, channel, stream) to nullptr to start afresh. This is required because at times Write(...) does not fail even if there is no connection between the client and the service. The Write(...) calls are successful, but the outgoing port on the client does not show any activity in Wireshark!
Here is the code:
constexpr int GRPC_TIMEOUT_S = 10;
constexpr int RESTART_INTERVAL_M = 15;
constexpr int GRPC_KEEPALIVE_TIME_MS = 10000;
string root_ca, tls_key, tls_cert; // for SSL
string remote_addr = "https://remote.com:5445";
...
...
void ResetStreaming() {
if (stub_p) {
if (strm_p) { // graceful restart/stop, this pair of API are called together, in this order
if (!strm_p->WritesDone()) {
// Log a message
}
strm_p->Finish(); // Log if return value of this is NOT grpc::OK
}
strm_p = nullptr;
strm_ctxt_p = nullptr;
stub_p = nullptr;
channel_p = nullptr;
}
}
void CreateStub() {
if (!stub_p) {
ChannelArguments chan_args;
chan_args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, GRPC_KEEPALIVE_TIME_MS);
channel_p = CreateCustomChannel(
remote_addr,
SslCredentials(SslCredentialsOptions{root_ca, tls_key, tls_cert}),
chan_args);
stub_p = GrpcService::NewStub(channel_p);
}
}
void Stream() {
const auto restart_time = steady_clock::now() + minutes(RESTART_INTERVAL_M);
while (!stop) {
// restart every RESTART_INTERVAL_M (15m) even if ALL IS WELL!!
if (steady_clock::now() > restart_time) {
break;
}
Update updt = GetUpdate(); // get the update to be sent
if (!stop) {
if (channel_p->GetState(true) == GRPC_CHANNEL_SHUTDOWN ||
!strm_p->Write(updt)) {
// could not write!!
return; // we will Connect() again
}
}
}
// stopped due to stop = true or interval to create new stream has expired
ResetStreaming(); // channel, stub, stream are recreated once in every 15m
}
bool PingRemote() {
ClientContext ctxt;
ctxt.set_deadline(system_clock::now() + seconds(GRPC_TIMEOUT_S));
Empty req, resp;
CreateStub();
if (stub_p->Ping(&ctxt, req, &resp).ok()) {
static UpdateAck ack;
strm_ctxt_p = make_unique<ClientContext>(); // need new context
strm_p = stub_p->PushUpdate(strm_ctxt_p.get(), &ack);
return true;
}
if (strm_p) {
strm_p = nullptr;
strm_ctxt_p = nullptr;
}
return false;
}
void Connect() {
while (!stop) {
if (PingRemote() || stop) {
break;
}
sleep_for(seconds(5)); // wait before retrying
}
}
// set to true from another thread when we want to stop
atomic<bool> stop{false}; // brace-init: atomic is not copyable, so "= false" is ill-formed before C++17
void StreamUntilStopped() {
if (stop) {
return;
}
strm_thread_p = make_unique<thread>([&] {
while (!stop) {
Connect();
Stream();
}
});
}
// called by the thread that sets stop = true
void Finish() {
strm_thread_p->join();
}
With this we are seeing that the streaming recovers within 15 minutes (or RESTART_INTERVAL_M) whenever there is a disruption for any reason. This code runs in a fast path, so I am curious to know if this can be made any better.
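One small variation on the Connect() loop above is to replace the fixed 5-second sleep with a capped exponential backoff, so the device retries quickly after a short blip but does not hammer the service during a long outage. The sketch below is only an illustration under stated assumptions: NextBackoff and ConnectWithBackoff are hypothetical names, the ping and sleep functions are injected so that no gRPC types are needed, and the initial delay and cap are arbitrary. It does not address the undetected-disconnect problem itself.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>

using std::chrono::milliseconds;

// Doubles the delay after each failed attempt, but never exceeds `cap`.
milliseconds NextBackoff(milliseconds current, milliseconds cap) {
  return std::min(current * 2, cap);
}

// Calls `ping` until it succeeds or `max_attempts` is reached.
// `sleep_fn` is injected so the loop can be exercised without really
// sleeping; in the client above it would wrap std::this_thread::sleep_for.
// Returns the number of attempts made, or -1 if every attempt failed.
int ConnectWithBackoff(const std::function<bool()>& ping,
                       const std::function<void(milliseconds)>& sleep_fn,
                       milliseconds initial, milliseconds cap,
                       int max_attempts) {
  milliseconds delay = initial;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (ping()) {
      return attempt;  // connected
    }
    if (attempt < max_attempts) {
      sleep_fn(delay);  // wait before the next try
      delay = NextBackoff(delay, cap);
    }
  }
  return -1;
}
```

In the client above, `ping` would presumably be a lambda that calls PingRemote(), and the loop would also need to check `stop` between attempts.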

Related

How to use grpc c++ ClientAsyncReader<Message> for server side streams

I am using a very simple proto where the Message contains only 1 string field. Like so:
service LongLivedConnection {
// Starts a grpc connection
rpc Connect(Connection) returns (stream Message) {}
}
message Connection{
string userId = 1;
}
message Message{
string serverMessage = 1;
}
The use case is that the client should connect to the server, and the server will use this grpc for push messages.
Now, for the client code, assuming that I am already in a worker thread, how do I properly set it up so that I can continuously receive messages that come from server at random times?
void StartConnection(const std::string& user) {
Connection request;
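request.set_userid(user); // protobuf lowercases field names in the generated C++ accessors
Message message;
ClientContext context;
auto reader = stub_->Connect(&context, request); // synchronous API: returns a ClientReader<Message>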
// What should I do from now on?
// notify(serverMessage);
}
void notify(std::string message) {
// generate message events and pass to main event loop
}
I figured out how to use the API. It looks pretty flexible, but it still feels a little weird, given that I would typically expect an async API to take some kind of lambda callback.
The code below is blocking, so you'll have to run it in a different thread to keep it from blocking your application.
I believe you can have multiple threads accessing the CompletionQueue, but in my case I just had a single thread handling this gRPC connection.
GrpcConnection.h file:
public:
void StartGrpcConnection();
private:
std::shared_ptr<grpc::Channel> m_channel;
std::unique_ptr<grpc::ClientReader<push_notifications::Message>> m_reader;
std::unique_ptr<push_notifications::PushNotificationService::Stub> m_stub;
GrpcConnection.cpp file:
...
void GrpcConnectionService::StartGrpcConnection()
{
m_channel = grpc::CreateChannel("localhost:50051",grpc::InsecureChannelCredentials());
LongLiveConnection::Connect request;
request.set_user_id(12345);
m_stub = LongLiveConnection::LongLiveConnectionService::NewStub(m_channel);
grpc::ClientContext context;
grpc::CompletionQueue cq;
std::unique_ptr<grpc::ClientAsyncReader<LongLiveConnection::Message>> reader =
m_stub->PrepareAsyncConnect(&context, request, &cq);
void* got_tag;
bool ok = false;
LongLiveConnection::Message reply;
reader->StartCall((void*)1);
cq.Next(&got_tag, &ok);
if (ok && got_tag == (void*)1)
{
// startCall() is successful if ok is true, and got_tag is void*1
// start the first read message with a different hardcoded tag
reader->Read(&reply, (void*)2);
while (true)
{
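ok = false;
if (!cq.Next(&got_tag, &ok) || !ok)
{
// the queue was shut down or the read failed (e.g. the stream ended), so stop reading
break;
}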
if (got_tag == (void*)2)
{
// this is the message from server
std::string body = reply.server_message();
// do whatever you want with body, in my case i push it to my applications' event stream to be processed by other components
// lastly, initialize another read
reader->Read(&reply, (void*)2);
}
else if (got_tag == (void*)3)
{
// if you do something else, such as listening to GRPC channel state change, in your call, you can pass a different hardcoded tag, then, in here, you will be notified when the result is received from that call.
}
}
}
}
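Regarding the wish for lambda callbacks: since a completion-queue tag is just a void*, one common way to get callback-style code is to pass the address of a handler object as the tag and invoke it when the tag comes back. The sketch below shows only that dispatch idea, with no gRPC dependency. FakeQueue is a hypothetical stand-in that mimics the shape of CompletionQueue::Next, and Handler and DrainQueue are illustrative names, not part of gRPC.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for grpc::CompletionQueue: holds (tag, ok) events.
class FakeQueue {
 public:
  void Push(void* tag, bool ok) { events_.push({tag, ok}); }

  // Same shape as CompletionQueue::Next, but returns false when empty
  // instead of blocking.
  bool Next(void** tag, bool* ok) {
    if (events_.empty()) {
      return false;
    }
    *tag = events_.front().first;
    *ok = events_.front().second;
    events_.pop();
    return true;
  }

 private:
  std::queue<std::pair<void*, bool>> events_;
};

// Each tag is the address of a Handler, so the event loop does not need
// hardcoded values such as (void*)1 or (void*)2.
using Handler = std::function<void(bool ok)>;

// Pulls events until the queue has none left and invokes the handler behind
// each tag. Returns the number of events dispatched.
template <typename Queue>
int DrainQueue(Queue& cq) {
  void* tag = nullptr;
  bool ok = false;
  int dispatched = 0;
  while (cq.Next(&tag, &ok)) {
    (*static_cast<Handler*>(tag))(ok);
    ++dispatched;
  }
  return dispatched;
}
```

With a real reader, the idea would be to call something like `reader->Read(&reply, &on_read)` and re-issue the read from inside `on_read`; each Handler has to outlive every operation that uses its address as a tag.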

Detect client context destruction from gRPC server

I have created an async C++ gRPC server that offers several APIs with signatures similar to this:
service Foo {
rpc FunctionalityA(ARequest) returns (stream AResponse);
rpc FunctionalityB(BRequest) returns (stream BResponse);
}
The client creates one channel to connect to this service and calls the various RPCs from separate threads, something like this:
class FooClient {
// ...
void FunctionalityA() {
auto stub = example::Foo::NewStub(m_channel);
grpc::ClientContext context;
example::ARequest request;
example::AResponse response;
auto reader = stub->FunctionalityA(&context, request);
for(int i = 0; i < 3; i++) {
reader->Read(&response);
}
}
void FunctionalityB() {
auto stub = example::Foo::NewStub(m_channel);
grpc::ClientContext context;
example::BRequest request;
example::BResponse response;
auto reader = stub->FunctionalityB(&context, request);
for(int i = 0; i < 3; i++) {
reader->Read(&response);
}
}
// ...
};
int main() {
// ...
FooClient client(grpc::CreateChannel("127.0.0.1:12345", grpc::InsecureChannelCredentials()));
auto ta = std::thread(&FooClient::FunctionalityA, &client);
auto tb = std::thread(&FooClient::FunctionalityB, &client);
// ...
}
I want to implement the server so that:
when FunctionalityA is called, it starts streaming objects of type AResponse
when FunctionalityB is called, it starts streaming objects of type BResponse
when the context used to call FunctionalityA is cancelled, streaming of AResponse ends
when the context used to call FunctionalityB is cancelled, streaming of BResponse ends
The problem I face is that even when the ClientContext associated with one of the two Functionalities goes out of scope (after the 3 reads in the example) the server does not receive any information and keeps writing, and the "ok" status remains true.
The "ok" status goes to false and allows me to stop Writing only when the client disconnects.
Is this the intended behavior of gRPC? Does the client need to send a specific "kiss of death" message in order to inform the server to stop writing on the stream?
Here is an example of the implementation of a Functionality server side, for completeness:
void FunctionalityB::ProcessRequest(bool ok, RequestState state) {
if(!ok) {
if(state == RequestState::START) {
// the server has been Shutdown before this particular call got matched to an incoming RPC
delete this;
} else if(state == RequestState::WRITE || state == RequestState::FINISH) {
// not going to the wire because the call is already dead (i.e., canceled, deadline expired, other side dropped the channel, etc).
delete this;
} else {
// unhandled state
}
} else {
if(state == RequestState::START) {
// the RPC has indeed been started
m_writer.Write(m_response, CreateTag(RequestState::WRITE));
// the constructor of the functionality requests a new one to handle future new connections
new FunctionalityB(m_completion_queue, m_service, m_worker);
} else if(state == RequestState::WRITE) {
// TODO do some real work
std::this_thread::sleep_for(std::chrono::milliseconds(50));
m_writer.Write(m_response, CreateTag(RequestState::WRITE)); // this write will continue forever, even after client stops reading and TryCancel its context
} else if(state == RequestState::FINISH) {
delete this;
} else {
// unhandled state
}
}
}
There are two ways to detect call cancellation on the server.
The first one is to check ServerContext::IsCancelled(). That is something you can check right before you do a write, which in this case may be fine. In the general case, though, it may not be ideal, because your application might be waiting for some other event (other than the previous write completing) before it does another write, and you ideally want some async way of getting notified when the cancellation happens.
Which brings me to the second approach, which is to request an event on the completion queue when the call is cancelled by calling ServerContext::AsyncNotifyWhenDone() before the RPC starts. This will give you async notification of the cancellation, but unfortunately, the API is very cumbersome and has a few sharp edges. (This is something that is handled much more cleanly in the new callback-based API, but that API isn't that performant in OSS until we finish the EventEngine effort.)
I hope this info is helpful.
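To make the first approach concrete, here is a minimal sketch of polling for cancellation right before each write. It deliberately avoids gRPC types so it stands alone: FakeServerContext is a hypothetical stand-in that models only IsCancelled(), and WriteUntilCancelled is an illustrative name. In the async server from the question, the same check would go in the WRITE state before issuing the next m_writer.Write(...).

```cpp
#include <atomic>

// Hypothetical stand-in for grpc::ServerContext; only IsCancelled() is modeled.
class FakeServerContext {
 public:
  bool IsCancelled() const { return cancelled_.load(); }
  void Cancel() { cancelled_.store(true); }

 private:
  std::atomic<bool> cancelled_{false};
};

// Writes up to `max_messages`, checking for cancellation before each write.
// `write` stands in for the real stream write and returns false on failure.
// Returns the number of messages actually written.
template <typename Context, typename WriteFn>
int WriteUntilCancelled(const Context& ctx, WriteFn write, int max_messages) {
  int written = 0;
  while (written < max_messages) {
    if (ctx.IsCancelled()) {
      break;  // the client cancelled the call: stop producing responses
    }
    if (!write(written)) {
      break;  // the write itself failed
    }
    ++written;
  }
  return written;
}
```

As noted above, this only notices the cancellation at the next write; a server that waits on other events between writes would still want the asynchronous notification.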

How can I read and write from a grpc stream simultaneously

I am implementing the Raft algorithm, and I want to use gRPC streams to do it. My main idea is to create 3 streams from each node to every other peer, with each stream carrying one type of RPC: AppendEntries, RequestVote, and InstallSnapshot. I wrote some code with limited help from route_guide, because in its bidirectional streaming demo, RouteChat, the client sends all its data before it starts to read.
Firstly, I want to be able to write to a stream at any time, so I wrote the following code:
void RaftMessagesStreamClientSync::AsyncRequestVote(const RequestVoteRequest& request){
std::string peer_name = this->peer_name;
debug("GRPC: Send RequestVoteRequest from %s to %s\n", request.name().c_str(), peer_name.c_str());
request_vote_stream->Write(request);
}
Meanwhile, I want a thread to keep reading from the stream, as in the following code, which is called immediately after RaftMessagesStreamClientSync is constructed.
void RaftMessagesStreamClientSync::handle_response(){
// strongThis is a must
auto strongThis = shared_from_this();
t1 = new std::thread([strongThis](){
RequestVoteResponse response;
while (strongThis->request_vote_stream->Read(&response)) {
debug("GRPC: Recv RequestVoteResponse from %s, me %s\n", response.name().c_str(), strongThis->raft_node->name.c_str());
...
}
});
...
To initialize the 3 streams, I have to write the constructor like this. I use 3 ClientContext objects here because the documentation says to use one ClientContext per RPC.
struct RaftMessagesStreamClientSync : std::enable_shared_from_this<RaftMessagesStreamClientSync>{
typedef grpc::ClientReaderWriter<RequestVoteRequest, RequestVoteResponse> CR;
typedef grpc::ClientReaderWriter<AppendEntriesRequest, AppendEntriesResponse> CA;
typedef grpc::ClientReaderWriter<InstallSnapshotRequest, InstallSnapshotResponse> CI;
std::unique_ptr<CR> request_vote_stream;
std::unique_ptr<CA> append_entries_stream;
std::unique_ptr<CI> install_snapshot_stream;
ClientContext context_r;
ClientContext context_a;
ClientContext context_i;
std::thread * t1 = nullptr;
std::thread * t2 = nullptr;
std::thread * t3 = nullptr;
...
}
RaftMessagesStreamClientSync::RaftMessagesStreamClientSync(const char * addr, struct RaftNode * _raft_node) : raft_node(_raft_node), peer_name(addr) {
std::shared_ptr<Channel> channel = grpc::CreateChannel(addr, grpc::InsecureChannelCredentials());
stub = raft_messages::RaftStreamMessages::NewStub(channel);
// 1
request_vote_stream = stub->RequestVote(&context_r);
// 2
append_entries_stream = stub->AppendEntries(&context_a);
// 3
install_snapshot_stream = stub->InstallSnapshot(&context_i);
}
~RaftMessagesStreamClientSync() {
raft_node = nullptr;
t1->join();
t2->join();
t3->join();
delete t1;
delete t2;
delete t3;
}
Then I implement the server side
Status RaftMessagesStreamServiceImpl::RequestVote(ServerContext* context, ::grpc::ServerReaderWriter< ::raft_messages::RequestVoteResponse, RequestVoteRequest>* stream){
RequestVoteResponse response;
RequestVoteRequest request;
while (stream->Read(&request)) {
...
}
return Status::OK;
}
Then 2 problems happen:
When I test with 3 nodes, which actually creates 2 RaftMessagesStreamServiceImpl for each node, the statements from 1 to 3 take a long time to execute.
No RPC is received on the server side.
There are similar problems described in "Bidi Async Server"; however, I can't figure out how that post can help me.
UPDATE
After some debugging, I found that request_vote_stream->Write(request) returns false (0), which, according to the documentation, means the stream is closed. But why is it closed?
After some more debugging, I found that both problems have a single cause: I create the client before I create the server.
I originally used unary RPC calls, so a call made by the client before the server existed only caused gRPC error code 14 (UNAVAILABLE). The program continued because every call sent after the server was created was handled correctly.
However, when it comes to streaming calls, stub->RequestVote(&context_r) ends up calling the blocking constructor ClientReaderWriter::ClientReaderWriter, which tries to connect to the server, which has not been created yet.
/// Block to create a stream and write the initial metadata and \a request
/// out. Note that \a context will be used to fill in custom initial metadata
/// used to send to the server when starting the call.
ClientReaderWriter(::grpc::ChannelInterface* channel,
const ::grpc::internal::RpcMethod& method,
ClientContext* context)
: context_(context),
cq_(grpc_completion_queue_attributes{
GRPC_CQ_CURRENT_VERSION, GRPC_CQ_PLUCK,
GRPC_CQ_DEFAULT_POLLING}), // Pluckable cq
call_(channel->CreateCall(method, context, &cq_)) {
if (!context_->initial_metadata_corked_) {
::grpc::internal::CallOpSet<::grpc::internal::CallOpSendInitialMetadata>
ops;
ops.SendInitialMetadata(context->send_initial_metadata_,
context->initial_metadata_flags());
call_.PerformOps(&ops);
cq_.Pluck(&ops);
}
}
As a consequence, the connection has not yet been established.

Sending data in second thread with Mongoose server

I'm trying to create a multithreaded server application using the Mongoose web server library.
I have a main thread serving connections and sending requests to processors that work in their own threads. The processors then place results into a queue, and a queue observer must send the results back to the clients.
The sources look like this:
Here I prepare the data for the processors and place it in the queue.
typedef std::pair<struct mg_connection*, const char*> TransferData;
int server_app::event_handler(struct mg_connection *conn, enum mg_event ev)
{
Request req;
if (ev == MG_AUTH)
return MG_TRUE; // Authorize all requests
else if (ev == MG_REQUEST)
{
req = parse_request(conn);
task_queue->push(TransferData(conn,req.second));
mg_printf(conn, "%s", ""); // (1)
return MG_MORE; // (2)
}
else
return MG_FALSE; // Rest of the events are not processed
}
And here I'm trying to send the result back. This function runs in its own thread.
void server_app::check_results()
{
while(true)
{
TransferData res;
if(!res_queue->pop(res))
{
boost::this_thread::sleep_for(boost::chrono::milliseconds(100));
continue;
}
mg_printf_data(res.first, "%s", res.second); // (3)
}
}
The problem is that the client doesn't receive anything from the server.
If I run the check_results function manually in the event_handler after placing a task into the queue, and then pass the computed result back to event_handler, I'm able to send it to the client using mg_printf_data (returning MG_TRUE). Any other way, I'm not.
What exactly should I change in these sources to make it work?
Ok... It looks like I've solved it myself.
I'd been looking into mongoose.c code and an hour later I found the piece of code below:
static void write_terminating_chunk(struct connection *conn) {
mg_write(&conn->mg_conn, "0\r\n\r\n", 5);
}
static int call_request_handler(struct connection *conn) {
int result;
conn->mg_conn.content = conn->ns_conn->recv_iobuf.buf;
if ((result = call_user(conn, MG_REQUEST)) == MG_TRUE) {
if (conn->ns_conn->flags & MG_HEADERS_SENT) {
write_terminating_chunk(conn);
}
close_local_endpoint(conn);
}
return result;
}
So I've tried to do mg_write(&conn->mg_conn, "0\r\n\r\n", 5); after line (3) and now it's working.
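For context, the five bytes "0\r\n\r\n" are the zero-length chunk that ends an HTTP/1.1 body sent with chunked transfer encoding, which is why the client kept waiting until they were written. The sketch below shows the framing in isolation. EncodeChunk and TerminatingChunk are hypothetical helper names for illustration, not Mongoose functions.

```cpp
#include <sstream>
#include <string>

// Frames `data` as one HTTP/1.1 chunk: <size in hex>\r\n<data>\r\n
// Do not pass an empty string: a zero-size chunk means "end of body".
std::string EncodeChunk(const std::string& data) {
  std::ostringstream out;
  out << std::hex << data.size() << "\r\n" << data << "\r\n";
  return out.str();
}

// The zero-length chunk that tells the client the body is complete.
std::string TerminatingChunk() { return "0\r\n\r\n"; }
```

A complete chunked body is then one or more data chunks followed by the terminator, for example `EncodeChunk("hi") + TerminatingChunk()`.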

Client application crash causes Server to crash? (C++)

I'm not sure if this is a known issue that I am running into, but I couldn't find a good search string that would give me any useful results.
Anyway, here's the basic rundown:
We've got a relatively simple application that takes data from a source (DB or file) and streams that data over TCP to connected clients as new data comes in. It's a relatively low number of clients; I would say at most 10 clients per server, so we have the following rough design:
client: connects to the server and sets up a read (with the timeout set higher than the server's heartbeat message interval). It blocks on read.
server: one listening thread that accepts connections and then spawns a writer thread to read from the data source and write to the client. The writer thread is also detached (we use boost::thread, so we just call the .detach() function). It blocks on writes indefinitely, but does check errno for errors before writing. We start the servers using a single Perl script that calls "fork" for each server process.
The problem(s):
At seemingly random times, the client will shut down with a "connection terminated (SUCCESFUL)" message, indicating that the remote server shut down the socket on purpose. However, when this happens the SERVER application ALSO closes, without any errors or anything. It just crashes.
Now, to make matters worse, we have multiple instances of the server app being started by a startup script, running different files and different ports. When ONE of the servers crashes like this, ALL the servers crash out.
Both the server and the client use the same "Connection" library created in-house. It's mostly a C++ wrapper for the C socket calls.
Here's some rough code for the write and read functions in the Connection library:
int connectionTimeout_read = 60 * 60 * 1000;
int Socket::readUntil(char* buf, int amount) const
{
int readyFds = epoll_wait(epfd,epEvents,1,connectionTimeout_read);
if(readyFds < 0)
{
status = convertFlagToStatus(errno);
return 0;
}
if(readyFds == 0)
{
status = CONNECTION_TIMEOUT;
return 0;
}
int fd = epEvents[0].data.fd;
if( fd != socket)
{
status = CONNECTION_INCORRECT_SOCKET;
return 0;
}
int rec = recv(fd,buf,amount,MSG_WAITALL);
if(rec == 0)
status = CONNECTION_CLOSED;
else if(rec < 0)
status = convertFlagToStatus(errno);
else
status = CONNECTION_NORMAL;
lastReadBytes = rec;
return rec;
}
int Socket::write(const void* buf, int size) const
{
int readyFds = epoll_wait(epfd,epEvents,1,-1);
if(readyFds < 0)
{
status = convertFlagToStatus(errno);
return 0;
}
if(readyFds == 0)
{
status = CONNECTION_TERMINATED;
return 0;
}
int fd = epEvents[0].data.fd;
if(fd != socket)
{
status = CONNECTION_INCORRECT_SOCKET;
return 0;
}
if(epEvents[0].events != EPOLLOUT)
{
status = CONNECTION_CLOSED;
return 0;
}
int bytesWrote = ::send(socket, buf, size,0);
if(bytesWrote < 0)
status = convertFlagToStatus(errno);
lastWriteBytes = bytesWrote;
return bytesWrote;
}
Any help solving this mystery bug would be great! at the VERY least, I would like it to NOT crash out the server even if the client crashes (which is really strange for me, since there is no two-way communication).
Also, for reference, here is the server listening code:
while(server.getStatus() == connection::CONNECTION_NORMAL)
{
connection::Socket s = server.listen();
if(s.getStatus() != connection::CONNECTION_NORMAL)
{
fprintf(stdout,"failed to accept a socket. error: %s\n",connection::getStatusString(s.getStatus()));
}
DATASOURCE* dataSource;
dataSource = open_datasource(XXXX); /* edited */
if(dataSource == NULL)
{
fprintf(stdout,"FATAL ERROR. DATASOURCE NOT FOUND\n");
return;
}
boost::thread fileSender(Sender(s,dataSource));
fileSender.detach();
}
...And also here is the spawned child sending thread:
::signal(SIGPIPE,SIG_IGN);
//const int headerNeeds = 29;
const int BUFFERSIZE = 2000;
char buf[BUFFERSIZE];
bool running = true;
while(running)
{
memset(buf,'\0',BUFFERSIZE*sizeof(char));
unsigned int readBytes = 0;
while((readBytes = read_datasource(buf,sizeof(unsigned char),BUFFERSIZE,dataSource)) == 0)
{
boost::this_thread::sleep(boost::posix_time::milliseconds(1000));
}
socket.write(buf,readBytes);
if(socket.getStatus() != connection::CONNECTION_NORMAL)
running = false;
}
fprintf(stdout,"socket error: %s\n",connection::getStatusString(socket.getStatus()));
socket.close();
fprintf(stdout,"sender exiting...\n");
Any insights would be welcome! Thanks in advance.
You've probably got everything backwards... when the server crashes, the OS will close all sockets. So the server crash happens first and causes the client to get the disconnect message (FIN flag in a TCP segment, actually), the crash is not a result of the socket closing.
Since you have multiple server processes crashing at the same time, I'd look at resources they share, and also any scheduled tasks that all servers would try to execute at the same time.
EDIT: You don't have a single client connecting to multiple servers, do you? Note that TCP connections are always bidirectional, so the server process does get feedback if a client disconnects. Some internet providers have even been caught generating RST packets on connections that fail some test for suspicious traffic.
Write a signal handler. Make sure it uses only raw I/O functions to log problems (open, write, close, not fwrite, not printf).
Check return values. Check for negative return value from write on a socket, but check all return values.
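A minimal sketch of the kind of signal handler meant here, assuming a POSIX system (the question already uses epoll, so Linux). The handler only records the signal number and calls write(2), which is async-signal-safe; printf and fwrite are not. LogSignal, InstallHandler, and g_last_signal are hypothetical names for illustration.

```cpp
#include <signal.h>
#include <unistd.h>

// Last signal seen by the handler; sig_atomic_t is safe to set from a handler.
volatile sig_atomic_t g_last_signal = 0;

// Records the signal and logs a fixed message using only raw I/O.
extern "C" void LogSignal(int signo) {
  g_last_signal = signo;
  static const char msg[] = "caught signal\n";
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;  // nothing useful can be done about a failed log write here
}

// Installs LogSignal for `signo`. Returns true on success.
bool InstallHandler(int signo) {
  struct sigaction sa {};
  sa.sa_handler = LogSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(signo, &sa, nullptr) == 0;
}
```

Which signals to install it for is a judgment call; for fatal signals such as SIGSEGV the handler would normally restore the default action and re-raise after logging, which this sketch does not do.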
Thanks for all the comments and suggestions.
After looking through the code and adding the signal handling as Ben suggested, the applications themselves are far more stable. Thank you for all your input.
The original problem, however, was due to a rogue script that one of the admins was running as root that would randomly kill certain processes on the server-side machine (I won't get into what it was trying to do in reality; safe to say it was buggy).
Lesson learned: check the environment.
Thank you all for the advice.