Activity Management in distributed architecture using AWS SWF - amazon-web-services

I have two servers (EC2 instances). Server 1 runs 3 batches and server 2 runs 4 batches. One of the batches on server 2 must be executed only after a batch on server 1 has completed successfully.
Updated:
// Assumed to be a field of the workflow implementation, so that doTry can assign it
Promise<Void> r12 = null;
new TryCatchFinally() {
    @Override
    protected void doTry() throws Throwable {
        // First server job sequencing
        Promise<Void> r11 = client1.b1();
        r12 = client1.b2(r11);
        Promise<Void> r13 = client1.b3(r12);
        Promise<Void> r14 = client1.b4(r13);
    }
    @Override
    protected void doCatch(Throwable e) throws Throwable {
        System.out.println("Failed to execute commands in server 1");
    }
    @Override
    protected void doFinally() throws Throwable {
        // cleanup
    }
};
new TryCatchFinally() {
    @Override
    protected void doTry() throws Throwable {
        // Second server job sequencing
        Promise<Void> r21 = client2.b1();
        // Will execute only when both parameters are ready
        Promise<Void> r22 = client2.b2(r21, r12);
        Promise<Void> r23 = client2.b3(r22);
        Promise<Void> r24 = client2.b4(r23);
    }
    @Override
    protected void doCatch(Throwable e) throws Throwable {
        System.out.println("Failed to execute commands in server 2");
    }
    @Override
    protected void doFinally() throws Throwable {
        // cleanup
    }
};
Any activity on either server can throw a custom exception. However, an activity on one server should not be cancelled because an activity on the other server threw an exception; activities on a server should be cancelled only when one of the activities on that same server throws. (A dependent activity should also be cancelled, irrespective of server, if the activity it depends on fails or throws an exception.) To achieve this, I wrapped the two sequences in two separate try-catch blocks.
How do I terminate the workflow execution if activities on both server 1 and server 2 throw an exception or fail?

You can wrap each Spring Batch execution in an SWF activity and then use an SWF decider to sequence these activities. See the AWS Flow Framework documentation and recipes for more info.
Added after reading the updated description of the problem:
You can use Promises to sequence activities in any way. So in your case I would do something like:
// First server job sequencing
Promise<Void> r11 = client1.b1();
Promise<Void> r12 = client1.b2(r11);
Promise<Void> r13 = client1.b3(r12);
Promise<Void> r14 = client1.b4(r13);
// Second server job sequencing
Promise<Void> r21 = client2.b1();
// Will execute only when both parameters are ready
Promise<Void> r22 = client2.b2(r21, r12);
Promise<Void> r23 = client2.b3(r22);
Promise<Void> r24 = client2.b4(r23);
If any of the activities throws an exception, all outstanding activities are cancelled and the workflow fails, unless the exception is explicitly caught and handled using TryCatchFinally. An activity that hasn't started yet (for example, because it is waiting for its Promise parameters to become ready) is cancelled immediately. An activity that is already executing must handle cancellation explicitly. See the "Activity Heartbeat" section of the Error Handling page of the AWS Flow Framework Guide for more info.
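To illustrate these dependency semantics outside SWF, here is a minimal sketch that uses plain java.util.concurrent.CompletableFuture instead of the Flow Framework's Promise. The step helper, the class name, and the step names are hypothetical. The sketch only models "run a step when all of its inputs are ready, skip it when an input failed"; it does not model the Flow Framework's cancellation of sibling activities or its TryCatchFinally scopes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

class PromiseSequencingSketch {
    // Records which (hypothetical) batch steps actually ran.
    static final List<String> ran = new CopyOnWriteArrayList<>();

    // Runs a named step once all of its dependencies completed successfully.
    // If any dependency failed, the step is skipped and the failure propagates.
    static CompletableFuture<Void> step(String name, boolean fail, CompletableFuture<?>... deps) {
        return CompletableFuture.allOf(deps).thenRun(() -> {
            if (fail) {
                throw new IllegalStateException(name + " failed");
            }
            ran.add(name);
        });
    }

    // Wires the two sequences together and returns the steps that ran.
    static List<String> run(boolean failServer1B2) {
        ran.clear();
        // First server job sequencing
        CompletableFuture<Void> r11 = step("s1.b1", false);
        CompletableFuture<Void> r12 = step("s1.b2", failServer1B2, r11);
        CompletableFuture<Void> r13 = step("s1.b3", false, r12);
        // Second server job sequencing
        CompletableFuture<Void> r21 = step("s2.b1", false);
        // Runs only when both r21 and r12 completed successfully
        CompletableFuture<Void> r22 = step("s2.b2", false, r21, r12);
        CompletableFuture<Void> r23 = step("s2.b3", false, r22);
        // Wait for both chains to settle, swallowing any failure
        CompletableFuture.allOf(r13, r23).exceptionally(e -> null).join();
        return new ArrayList<>(ran);
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

When s1.b2 fails, only s1.b1 and s2.b1 run: every step that depends on r12, directly or transitively, is skipped on both servers.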
Added the error handling part:
Wrap the part that shouldn't affect other parts of the workflow in a TryCatch. In this example, any client2 activity that throws an exception cancels all future client2 activities, but not the activities called on client1, because the exception is not thrown into their scope.
// First server job sequencing
Promise<Void> r11 = client1.b1();
final Promise<Void> r12 = client1.b2(r11);
Promise<Void> r13 = client1.b3(r12);
Promise<Void> r14 = client1.b4(r13);
new TryCatch() {
    @Override
    protected void doTry() throws Throwable {
        // Second server job sequencing
        Promise<Void> r21 = client2.b1();
        // Will execute only when both parameters are ready
        Promise<Void> r22 = client2.b2(r21, r12);
        Promise<Void> r23 = client2.b3(r22);
        Promise<Void> r24 = client2.b4(r23);
    }
    @Override
    protected void doCatch(Throwable e) throws Throwable {
        // Handle exception without rethrowing it.
    }
};

Related

visualvm does not show profiling information but shows sampling information

I am running a simple Java program and trying to study it using VisualVM.
I am not trying to debug application performance. The app consists of 2 simple classes and creates 2 threads, which do nothing but sleep.
public class ThreadRunExample {
    public static void main(String[] args) {
        Thread t1 = new Thread(new HeavyWorkRunnable(), "t1");
        Thread t2 = new Thread(new HeavyWorkRunnable(), "t2");
        System.out.println("Starting Runnable threads");
        t1.start();
        t2.start();
        System.out.println("Runnable Threads has been started");
    }
}
public class HeavyWorkRunnable implements Runnable {
    @Override
    public void run() {
        System.out.println("Doing heavy processing - START " + Thread.currentThread().getName());
        try {
            Thread.sleep(1000);
            // Get database connection, delete unused data from DB
            doDBProcessing();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Doing heavy processing - END " + Thread.currentThread().getName());
    }
    private void doDBProcessing() throws InterruptedException {
        System.out.println("Doing heavy processing - start " + Thread.currentThread().getName());
        Thread.sleep(50000);
    }
}
The VisualVM Sampler tab shows the threads sleeping correctly.
However, the Profiler tab shows nothing after I click CPU profiling.
While the app is running, the Profiler tab stays blank. Any ideas?

gRPC: What are the best practices for long-running streaming?

We've implemented a Java gRPC service that runs in the cloud, with a unidirectional (client to server) streaming RPC which looks like:
rpc PushUpdates(stream Update) returns (Ack);
A C++ client (a mobile device) calls this rpc as soon as it boots up, to continuously send an update every 30 or so seconds, perpetually as long as the device is up and running.
ChannelArguments chan_args;
// this will be secure channel eventually
auto channel_p = CreateCustomChannel(remote_addr, InsecureChannelCredentials(), chan_args);
auto stub_p = DialTcc::NewStub(channel_p);
// ...
Ack ack;
auto strm_ctxt_p = make_unique<ClientContext>();
auto strm_p = stub_p->PushUpdates(strm_ctxt_p.get(), &ack);
// ...
while (true) {
    // wait until we are ready to send a new update
    Update updt;
    // populate updt;
    if (!strm_p->Write(updt)) {
        // stream is not kosher, create a new one and restart
        break;
    }
}
Different kinds of network interruptions can happen while this is running:
- The gRPC service running in the cloud may go down (for maintenance) or may simply become unreachable.
- The device's own IP address keeps changing, as it is a mobile device.
We've seen that on such events neither the channel nor the Write() API detects the network disconnection reliably. At times the client keeps calling Write() (which doesn't return false) but the server doesn't receive any data (Wireshark shows no activity on the outgoing port of the client device).
What are the best practices to recover in such cases, so that the server starts receiving the updates within X seconds from the time when such an event occurs? It is understandable that X seconds' worth of data would be lost whenever such an event happens, but we want to recover reliably within X seconds.
gRPC version: 1.30.2, Client: C++14/Linux, Server: Java/Linux
Here's how we've hacked this. I want to check whether this can be improved, or whether anyone from gRPC can point me to a better solution.
The protobuf for our service looks like this. It has an RPC for pinging the service, which is used frequently to test connectivity.
// Message used in IsAlive RPC
message Empty {}
// Acknowledgement sent by the service for updates received
message UpdateAck {}
// Messages streamed to the service by the client
message Update {
...
...
}
service GrpcService {
// for checking if we're able to connect
rpc Ping(Empty) returns (Empty);
// streaming RPC for pushing updates by client
rpc PushUpdate(stream Update) returns (UpdateAck);
}
Here is how the C++ client looks. It does the following:
Connect():
- Create the stub for calling the RPCs, if the stub is nullptr.
- Call Ping() at regular intervals until it is successful.
- On success, call the PushUpdate(...) RPC to create a new stream.
- On failure, reset the stream to nullptr.
Stream(): do the following in a while(true) loop:
- Get the update to be pushed.
- Call Write(...) on the stream with the update to be pushed.
- If Write(...) fails for any reason, break so that control goes back to Connect().
- Once every RESTART_INTERVAL_M minutes (15 in the code below, or some other regular interval), reset everything (stub, channel, stream) to nullptr to start afresh. This is required because at times Write(...) does not fail even if there is no connection between the client and the service. The Write(...) calls are successful, but the outgoing port on the client does not show any activity in Wireshark!
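The steps above can also be sketched independently of gRPC. The following minimal Java sketch models only the control flow of Connect() and Stream(). The Transport interface, the FakeTransport test double, and the injected clock are hypothetical stand-ins for the stub, channel, and stream; nothing here calls a real gRPC API. As a simplification, the sketch resets the transport on every exit from stream(), and it omits the sleep between ping attempts so that it stays deterministic.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.LongSupplier;

class ReconnectLoopSketch {
    /** Hypothetical stand-in for the gRPC stub, channel and stream. */
    interface Transport {
        boolean ping();               // true if the service answered before its deadline
        void openStream();            // start a new client-streaming call
        boolean write(String update); // false once the stream is known to be broken
        void reset();                 // drop stream, stub and channel
    }

    /** Test double: the first failingPings pings fail, and write number failingWrite fails. */
    static class FakeTransport implements Transport {
        int failingPings, failingWrite, writes, resets, opens;
        FakeTransport(int failingPings, int failingWrite) {
            this.failingPings = failingPings;
            this.failingWrite = failingWrite;
        }
        public boolean ping() { return failingPings-- <= 0; }
        public void openStream() { opens++; }
        public boolean write(String update) { return ++writes != failingWrite; }
        public void reset() { resets++; }
    }

    private final Transport transport;
    private final LongSupplier clockMillis;
    private final long restartIntervalMillis;

    ReconnectLoopSketch(Transport transport, LongSupplier clockMillis, long restartIntervalMillis) {
        this.transport = transport;
        this.clockMillis = clockMillis;
        this.restartIntervalMillis = restartIntervalMillis;
    }

    /** Connect(): ping until the service answers, then open a fresh stream.
     *  Returns the number of pings used, or -1 if every attempt failed. */
    int connect(int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (transport.ping()) {
                transport.openStream();
                return attempt;
            }
            transport.reset(); // a real client would also wait here before retrying
        }
        return -1;
    }

    /** Stream(): write until a write fails, the updates run out, or the restart
     *  interval elapses. Returns the number of successful writes. */
    int stream(Iterator<String> updates) {
        long restartAt = clockMillis.getAsLong() + restartIntervalMillis;
        int sent = 0;
        while (updates.hasNext() && clockMillis.getAsLong() < restartAt) {
            if (!transport.write(updates.next())) {
                break;
            }
            sent++;
        }
        transport.reset(); // always start the next round from a clean state
        return sent;
    }

    public static void main(String[] args) {
        FakeTransport fake = new FakeTransport(2, 3);
        ReconnectLoopSketch loop = new ReconnectLoopSketch(fake, System::currentTimeMillis, 60_000);
        System.out.println("pings used: " + loop.connect(5));
        System.out.println("updates sent: "
                + loop.stream(Arrays.asList("u1", "u2", "u3", "u4").iterator()));
    }
}
```

Injecting the clock keeps the forced-restart branch testable without waiting for the real interval to pass.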
Here is the code:
constexpr int GRPC_TIMEOUT_S = 10;
constexpr int RESTART_INTERVAL_M = 15;
constexpr int GRPC_KEEPALIVE_TIME_MS = 10000;
string root_ca, tls_key, tls_cert; // for SSL
string remote_addr = "https://remote.com:5445";
...
...
void ResetStreaming() {
    if (stub_p) {
        if (strm_p) { // graceful restart/stop, this pair of API are called together, in this order
            if (!strm_p->WritesDone()) {
                // Log a message
            }
            strm_p->Finish(); // Log if return value of this is NOT grpc::OK
        }
        strm_p = nullptr;
        strm_ctxt_p = nullptr;
        stub_p = nullptr;
        channel_p = nullptr;
    }
}
void CreateStub() {
    if (!stub_p) {
        ChannelArguments chan_args;
        chan_args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, GRPC_KEEPALIVE_TIME_MS);
        channel_p = CreateCustomChannel(
            remote_addr,
            SslCredentials(SslCredentialsOptions{root_ca, tls_key, tls_cert}),
            chan_args);
        stub_p = GrpcService::NewStub(channel_p);
    }
}
void Stream() {
    const auto restart_time = steady_clock::now() + minutes(RESTART_INTERVAL_M);
    while (!stop) {
        // restart every RESTART_INTERVAL_M (15m) even if ALL IS WELL!!
        if (steady_clock::now() > restart_time) {
            break;
        }
        Update updt = GetUpdate(); // get the update to be sent
        if (!stop) {
            if (channel_p->GetState(true) == GRPC_CHANNEL_SHUTDOWN ||
                !strm_p->Write(updt)) {
                // could not write!!
                return; // we will Connect() again
            }
        }
    }
    // stopped due to stop = true or interval to create new stream has expired
    ResetStreaming(); // channel, stub, stream are recreated once in every 15m
}
bool PingRemote() {
    ClientContext ctxt;
    ctxt.set_deadline(system_clock::now() + seconds(GRPC_TIMEOUT_S));
    Empty req, resp;
    CreateStub();
    if (stub_p->Ping(&ctxt, req, &resp).ok()) {
        static UpdateAck ack;
        strm_ctxt_p = make_unique<ClientContext>(); // need new context
        strm_p = stub_p->PushUpdate(strm_ctxt_p.get(), &ack);
        return true;
    }
    if (strm_p) {
        strm_p = nullptr;
        strm_ctxt_p = nullptr;
    }
    return false;
}
void Connect() {
    while (!stop) {
        if (PingRemote() || stop) {
            break;
        }
        sleep_for(seconds(5)); // wait before retrying
    }
}
// set to true from another thread when we want to stop
atomic<bool> stop{false};
void StreamUntilStopped() {
    if (stop) {
        return;
    }
    strm_thread_p = make_unique<thread>([&] {
        while (!stop) {
            Connect();
            Stream();
        }
    });
}
// called by the thread that sets stop = true
void Finish() {
    strm_thread_p->join();
}
With this we are seeing that the streaming recovers within 15 minutes (or RESTART_INTERVAL_M) whenever there is a disruption for any reason. This code runs in a fast path, so I am curious to know if this can be made any better.

Can I run some code in a WebJob after the timeout has been reached?

I have a continuous WebJob using the ServiceBusTrigger. I also set a timeout of 10 minutes.
If the function finishes before the timeout, it sends a response to another queue.
But in case of a timeout, it just exits. Is it possible to run some code when the timeout is reached, like a callback function, so that I can send a proper message saying the process timed out?
[Timeout("00:10:00")]
public static async Task ProcessPreflightQueueMessage([ServiceBusTrigger("%preflightQueue%")] string message, CancellationToken token, ILogger logger)
{ ... }
Yes, you can. When the function times out, the cancellation token is triggered and the awaited call throws a TaskCanceledException; you can catch it and run your own code there.
Here is my test:
public class Functions
{
    [Timeout("00:00:30")]
    public static async Task TimeoutJob(
        [QueueTrigger("myqueue")] string message,
        CancellationToken token,
        TextWriter log)
    {
        try
        {
            await log.WriteLineAsync(message);
            log.WriteLine($"-- [{DateTime.Now.ToString()}] Processing Begin --");
            await Task.Delay(TimeSpan.FromMinutes(1), token);
            log.WriteLine($"-- [{DateTime.Now.ToString()}] Complete Time-consuming jobs --");
        }
        catch (TaskCanceledException)
        {
            log.WriteLine("timeout");
        }
        log.WriteLine($"-- [{DateTime.Now.ToString()}] Processing End -- ");
    }
}

Catching Listener Exceptions in long running Cloud PubSub Subscriber service

I am trying to write a long-running Subscriber service in Java. I have set up listeners to detect any failures inside the Subscriber service. I am trying to make this fault tolerant, and there are a few things I do not quite understand. My questions are below.
I have followed the basic setup shown here: https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-examples/src/main/java/com/google/cloud/examples/pubsub/snippets/SubscriberSnippets.java. Specifically, I have set up addListener as shown below.
As shown in the following code, initializeSubscriber acts as a state variable which determines whether the Subscriber service should restart. Inside the while loop, this variable is continuously monitored to determine whether a restart is required.
My questions are:
1. How do I raise an exception inside Subscriber.Listener's failed method and catch it in the main while loop? I tried throwing a new Exception() in the failed method and catching it in the catch block inside the while loop. However, the code does not compile because it is a checked exception.
2. As shown here, I use a Java Executor thread to run the Listener. How do I handle Listener failures? Will I be able to catch Listener failures in the general Exception catch block as shown here?
try {
    boolean initializeSubscriber = true;
    while (true) {
        try {
            if (initializeSubscriber) {
                createSingleThreadedSubscriber();
                addErrorListenerToSubscriber();
                subscriber.startAsync().awaitRunning();
                initializeSubscriber = false;
            }
            // Checks the status of subscriber service every minute
            Thread.sleep(60000);
        } catch (Exception ex) {
            LOGGER.error("Could not start the Subscriber service", ex);
            cleanupSubscriber();
            initializeSubscriber = true;
        }
    }
} catch (RuntimeException e) {
} finally {
    shutdown();
}
private void addErrorListenerToSubscriber() {
    subscriber.addListener(
        new Subscriber.Listener() {
            @Override
            public void failed(Subscriber.State from, Throwable failure) throws RuntimeException {
                LOGGER.info("Subscriber reached a failed state due to " + failure.getMessage()
                    + ",Restarting Subscriber service");
                initializeSubscriber = true;
            }
        },
        Executors.newSingleThreadExecutor());
}
private void cleanupSubscriber() {
    try {
        if (subscriber != null) {
            subscriber.stopAsync().awaitTerminated();
        }
        if (!subscriptionListener.isShutdown()) {
            subscriptionListener.shutdown();
        }
    } catch (Exception ex) {
        LOGGER.error("Error in cleaning up Subscriber thread " + ex);
    }
}
It should not be necessary to add a listener to the subscriber if you just want to recreate the subscriber on a failure. You could instead catch the exception on awaitTerminated:
try {
    boolean initializeSubscriber = true;
    while (initializeSubscriber) {
        try {
            createSingleThreadedSubscriber();
            subscriber.startAsync().awaitRunning();
            initializeSubscriber = false;
            subscriber.awaitTerminated();
        } catch (Exception ex) {
            LOGGER.error("Error in the Subscriber service", ex);
            cleanupSubscriber();
            initializeSubscriber = true;
        }
    }
} catch (RuntimeException e) {
} finally {
    shutdown();
}
If the subscriber shuts down successfully because of a call to stopAsync, then awaitTerminated will not throw an exception. If there was some kind of failure, then awaitTerminated will throw an IllegalStateException because the state will be FAILED instead of TERMINATED.
Note that transient errors are handled by the library itself. For example, if the server becomes briefly unavailable due to a network hiccup, the library will seamlessly reconnect and continue to deliver messages. Failures that result in a change of state for the subscriber are likely permanent failures, such as permission issues (where the account running the subscriber does not have permission to subscribe to the subscription) or resource issues (such as the subscription having been deleted). In these permanent failure cases, recreating the subscriber will likely just result in the same error unless one takes manual steps to intervene and fix the problem.
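Because a permanent failure will simply recur, one option is to bound the number of restarts and back off between them instead of looping forever. The following is a minimal sketch of that idea using only the standard library. The class and method names are hypothetical, and the Runnable stands in for "create the subscriber, start it, and block in awaitTerminated"; it does not touch the Pub/Sub client itself.

```java
class BoundedRestartSketch {
    /**
     * Runs the task until it returns normally, restarting it after a failure with
     * exponential backoff. Returns the number of attempts used. Rethrows the last
     * failure once maxAttempts is reached, since it is probably permanent.
     */
    static int runWithRestarts(Runnable runUntilTerminated, int maxAttempts, long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Stands in for startAsync().awaitRunning() followed by awaitTerminated()
                runUntilTerminated.run();
                return attempt; // clean termination, no restart needed
            } catch (RuntimeException ex) {
                if (attempt == maxAttempts) {
                    throw ex; // give up and surface the failure to the caller
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    throw ex;
                }
                backoff *= 2; // wait twice as long before the next restart
            }
        }
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int attempts = runWithRestarts(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("subscriber FAILED");
            }
        }, 5, 10L);
        System.out.println("terminated cleanly after " + attempts + " attempts");
    }
}
```

Once the attempts are exhausted, the exception reaches the caller, which is the point where an alert or manual intervention can be triggered.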

How to handle AskTime out Exception in Playframework?

The Play application we are using has to be kept alive at all times. It is basically a RabbitMQ listener with an infinite while loop that keeps listening for messages from the message queue.
The following code is placed in a controller:
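public class Application extends Controller {
    public static Result wanHLPT() {
        ckmsg();
        return ok();
    }
    public static Result ckmsg() {
        try {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();
            channel.queueDeclare(QUEUE_NAME, false, false, false, null);
            // consumer setup was not shown in the original snippet; assumed to look like this
            QueueingConsumer consumer = new QueueingConsumer(channel);
            channel.basicConsume(QUEUE_NAME, true, consumer);
            while (true) {
                QueueingConsumer.Delivery delivery = consumer.nextDelivery();
                String message = new String(delivery.getBody());
                // business logic
                process();
            }
        } catch (Exception e) {
            // error handling was not shown in the original snippet
            return internalServerError(e.getMessage());
        }
    }
}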
But the Play application throws an Akka exception [AskTimeoutException: Timed out].
Please provide any pointers on how to handle this.
Here is the exception:
! #6f977jo6m - Internal server error, for (GET) [/myapp] ->
play.api.Application$$anon$1: Execution exception[[AskTimeoutException: Timed out]]
at play.api.Application$class.handleError(Application.scala:289) ~ [play_2.10.jar:2.1.0]
at play.api.DefaultApplication.handleError(Application.scala:383) ~[play_2.10.jar:2.1.0]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anon$2$$anonfun$handle$1.apply(PlayDefaultUpstreamHandler.scala:132) ~[play_2.10.jar:2.1.0]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anon$2$$anonfun$handle$1.apply(PlayDefaultUpstreamHandler.scala:128) ~[play_2.10.jar:2.1.0]
at play.api.libs.concurrent.PlayPromise$$anonfun$extend1$1.apply(Promise.scala:113) ~[play_2.10.jar:2.1.0]
at play.api.libs.concurrent.PlayPromise$$anonfun$extend1$1.apply(Promise.scala:113) ~[play_2.10.jar:2.1.0]
at play.api.libs.concurrent.PlayPromise$$anonfun$extend$1$$anonfun$apply$1.apply(Promise.scala:104) ~[play_2.10.jar:2.1.0]
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) ~[scala-library.jar:na]
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) ~[scala-library.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0]
at java.lang.Thread.run(Thread.java:781) ~[na:1.7.0]
akka.pattern.AskTimeoutException: Timed out
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:310) ~[akka-actor_2.10.jar:na]
at akka.actor.DefaultScheduler$$anon$8.run(Scheduler.scala:193) ~[akka-actor_2.10.jar:na]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:137) ~[akka-actor_2.10.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1417) ~[scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262) ~[scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) ~[scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478) ~[scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) ~[scala-library.jar:na]