PCF Tasks on Active/Passive

Can someone help me understand how PCF Tasks would work in an Active/Passive environment? My understanding is that when an app is deployed to the Active environment and mirrored to a Passive environment, its PCF Tasks would still run on their defined job schedules regardless of whether they're on Active or Passive.
If this is true, is there a way for my PCF Task (a Java application) to programmatically check whether it's running on Passive (then do nothing) or Active (then do my operations)? I don't want to perform tasks on Passive until failover happens (where Passive becomes Active, and Active becomes Passive), and I want only one Task running, from the Active side, at any given time.
I tried getting the A record for the FQDN (my app's route hostname) and comparing it to the localhost IP, to determine whether the IP I'm currently running on matches the resolved hostname IP and therefore I'm running on Active... but I believe I'm only getting the IP of a private Diego cell or something (not sure yet).
private boolean isActive() {
    try {
        InetAddress inetHost = InetAddress.getByName(properties.getFqdn());
        InetAddress inetSelf = InetAddress.getLocalHost();
        logger.info("host FQDN IP: {}, self localhost IP: {}", inetHost.getHostAddress(), inetSelf.getHostAddress());
        // getByName() and getLocalHost() throw UnknownHostException rather than
        // return null, so no null checks are needed before dereferencing
        return inetHost.getHostAddress().equals(inetSelf.getHostAddress());
    } catch (UnknownHostException e) {
        logger.error(e.getMessage());
    }
    return false;
}
What am I missing here? It doesn't seem like it should be this complicated, given that Tasks are part of PCF and Active/Passive is a normal and preferred setup.
I'd really just like Tasks to start and stop working on failover or failback without any additional interaction.
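For illustration, the kind of check I'd settle for is something like the following minimal sketch, assuming each foundation could be given a hypothetical SITE_ROLE environment variable (set via cf set-env or a deployment pipeline); the variable name is mine, not anything PCF provides:

private boolean isActive() {
    // Hypothetical marker set per foundation at deploy time,
    // e.g. cf set-env my-app SITE_ROLE active ("passive" on the other site)
    String role = System.getenv("SITE_ROLE");
    return "active".equalsIgnoreCase(role);
}

The catch is that someone still has to flip the variable during failover, which is exactly the manual interaction I'm hoping to avoid.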
Thank you for any suggestions!

Related

Connection to AWS MemoryDB cluster sometimes fails

We have an application that is using AWS MemoryDB for Redis. We have set up a cluster with one shard and two nodes. One of the nodes (named 0001-001) is the primary read/write node, while the other (named 0001-002) is a read replica.
After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint connection string to connect. If we restart the application a few times, it suddenly starts working. Whether it succeeds seems to be random. The error we get is the following:
Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
If we connect directly to the primary read/write node, we get no such errors.
If we connect directly to the read replica, it always fails. It even gets the error above, complaining about the "0001-001" node.
We use .NET Core 6
We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
The application is hosted in AWS ECS
StackExchangeRedisCache is added to the service collection in a startup file:
services.AddStackExchangeRedisCache(o =>
{
    o.InstanceName = redisConfiguration.Instance;
    o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});
...where ToRedisConfigurationOptions returns a basic ConfigurationOptions object:
new ConfigurationOptions()
{
    EndPoints =
    {
        { "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
    },
    User = "username",
    Password = "password",
    Ssl = true,
    AbortOnConnectFail = false,
    ConnectTimeout = 60000
};
We tried multiple shards with multiple nodes, and connecting to the cluster still sometimes fails. We even tried updating the StackExchange.Redis dependency to 2.5.43, but no luck.
We could "solve" it by connecting directly to the primary node, but if a failover occurs and 0001-002 becomes the primary node, we would have to manually change our connection string, which is not acceptable in a production environment.
Any help or advice is appreciated, thanks!

Firebase function connection with GCP Redis instance in the same VPC keeps on disconnecting

I am working on multiple Firebase cloud functions (all hosted in the same region) that connect to a GCP-hosted Redis instance in the same region, using a VPC connector. I am using version 3.0.2 of the Node.js redis library. In the cloud functions' debug logs, I am seeing frequent connection reset logs, triggered for each cloud function with no fixed pattern in the timing of the resets, and each time the error captured in the error event handler is ECONNRESET. While creating the Redis client, I provided a retry_strategy to reconnect after 5 ms with a maximum of 10 such attempts, along with retry_unfulfilled_commands set to true, expecting that any command unfulfilled at the time of a connection reset would be automatically retried (see the code below).
const redisLib = require('redis');

const client = redisLib.createClient(REDIS_PORT, REDIS_HOST, {
    enable_offline_queue: true,
    retry_unfulfilled_commands: true,
    retry_strategy: function(options) {
        if (options.error && options.error.code === "ECONNREFUSED") {
            // End reconnecting on a specific error and flush all commands
            // with an individual error
            return new Error("The server refused the connection");
        }
        if (options.attempt > REDIS_CONNECTION_RETRY_ATTEMPTS) {
            // End reconnecting with the built-in error
            console.log('Connection retry count exceeded 10');
            return undefined;
        }
        // Reconnect after 5 ms
        console.log('Retrying connection after 5 ms');
        return 5;
    },
});

client.on('connect', () => {
    console.log('Redis instance connected');
});

client.on('error', (err) => {
    console.error(`Error connecting to Redis instance - ${err}`);
});

exports.getUserDataForId = (userId) => {
    console.log('getUserDataForId invoked');
    return new Promise((resolve, reject) => {
        if (!client.connected) {
            console.log('Redis instance not yet connected');
        }
        client.get(userId, (err, reply) => {
            if (err) {
                console.error(JSON.stringify(err));
                reject(err);
            } else {
                resolve(reply);
            }
        });
    });
};
// more such exports for different operations
Following are the questions/issues I am facing:
1. Why is the connection getting reset intermittently?
2. I have seen in the logs that even while a cloud function is executing, the connection to the Redis server is lost, resulting in failure of the command.
3. With retry_unfulfilled_commands set to true, I hoped it would handle the scenario described in point 2 above, but per the debug logs the cloud function times out instead. This is what I observed in the logs in that case:
getUserDataForId invoked
Retrying connection after 5 ms
Redis instance connected
Function execution took 60002 ms, finished with status: 'timeout' --> coming from wrapper cloud function
4. Should I, instead of keeping a Redis connection instance at the global level, create a connection for each Redis operation? That would likely have performance issues, as well as issues around the number of concurrent Redis connections (since I have multiple cloud functions and each simultaneous invocation would create its own connection), right?
So, how do I best handle this? I am hitting all of these issues during development itself, so I'm not sure whether it's a code issue or an infrastructure configuration issue.
This behavior could be caused by background activity.
"Background activity is anything that happens after your function has terminated."
When background activity interferes with subsequent invocations in Cloud Functions, unexpected behavior and errors that are hard to diagnose may occur. Accessing the network after a function terminates usually leads to "ECONNRESET" errors.
To troubleshoot this, make sure that there is no background activity by searching the logs for entries after the line saying that the invocation finished. Background activity can sometimes be buried deep in the code, especially when asynchronous operations such as callbacks or timers are present. Review your code to make sure all asynchronous operations finish before you terminate the function.
Source

Unknown reason for Google Cloud Compute Engine VM shutdown?

One of the VM instances on Google Cloud Compute Engine was shut down, with an event log in Stackdriver showing no IP or actor (user, service, or system) that initiated the event. The instance has onHostMaintenance set to migrate and automaticRestart set to true. This particular instance has migrated on maintenance without error before. The Stackdriver event log looks like:
{
    actor: {
        user: ""
    },
    event_subtype: "compute.instances.stop",
    event_timestamp_us: "1531781734907624",
    event_type: "GCE_API_CALL",
    ip_address: "",
}
The user and ip_address fields are NOT redacted; they have empty values in the actual log.
Is this common? How does one identify the cause of a shutdown in these peculiar cases?
Based on your event type, I dug into the documentation on Activity logs:
Compute Engine API calls - GCE_API_CALL events are API calls that change the state of a resource.
It seems like someone may have used an API call to shut down your VM. Looking at your settings and the logs on the VM instance, it can't be a hostError or a maintenance event: since you have onHostMaintenance set to migrate and automaticRestart set to true, your VM will always be migrated to other hardware during maintenance.
Out of curiosity, did you restart your VM? Did it shut down again? How often does it happen?
Seems like App Engine had an outage during that time (see the incident).

Akka cluster gives false alarms when reporting nodes unreachable

I have a cluster event listener running on each node that sends me an email when nodes become unreachable, and I noticed two strange things:
most of the time, an unreachable event is followed by a reachable-again event
when an unreachable event occurs and I query the state of the cluster, it shows that all nodes are still UP
Here is my conf:
akka {
    loglevel = INFO
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    jvm-exit-on-fatal-error = on
    actor {
        provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
        // will be overwritten at runtime
        log-remote-lifecycle-events = off
        netty.tcp {
            hostname = "127.0.0.1"
            port = 9989
        }
    }
    cluster {
        failure-detector {
            threshold = 12.0
            acceptable-heartbeat-pause = 10 s
        }
        use-dispatcher = cluster-dispatcher
    }
}
// dedicated dispatcher to reduce the unreachable report rate
cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
        parallelism-min = 4
        parallelism-max = 8
    }
}
Please read the cluster membership lifecycle section in the documentation: http://doc.akka.io/docs/akka/2.4.0/common/cluster.html#Membership_Lifecycle
Unreachability is temporary; it indicates that there were no heartbeats from the remote node for a while. It is reverted once heartbeats arrive again. This is useful for rerouting data away from overloaded nodes or for riding out smaller, intermittent networking issues. Please note that a cluster member does not go from unreachable to DOWN automatically unless configured to do so: http://doc.akka.io/docs/akka/2.4.0/scala/cluster-usage.html#Automatic_vs__Manual_Downing
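For reference, the auto-downing knob that section documents (and generally advises against relying on) is a one-line setting in the same configuration format as your conf above; the 10s here is just an illustrative value, not a recommendation:

akka.cluster.auto-down-unreachable-after = 10s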
The reason DOWNing is manual rather than automatic by default is the risk of split-brain scenarios and their consequences, for example when Cluster Singletons are used (which won't be singletons once the cluster falls into two parts because of a broken network cable). For more options for automatically resolving such cases, there is the Split Brain Resolver (SBR) in the commercial version of Akka: http://doc.akka.io/docs/akka/rp-15v09p01/scala/split-brain-resolver.html
Also, DOWNing is permanent: once a node is marked as DOWN, it is forever banished from the surviving part of the cluster, i.e. even if it turns out to be alive in the future, it won't be allowed back again (see Fencing and STONITH for an explanation: https://en.wikipedia.org/wiki/STONITH or http://advogato.org/person/lmb/diary/105.html).
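Given that an unreachable event is so often followed by a reachable-again event, one practical option is to debounce the email alerts in the listener rather than firing on every UnreachableMember. A rough Java sketch against the classic cluster event API; the actor name and the grace-timer idea are mine, not something Akka prescribes:

import akka.actor.UntypedActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberRemoved;
import akka.cluster.ClusterEvent.ReachableMember;
import akka.cluster.ClusterEvent.UnreachableMember;

public class DebouncingClusterListener extends UntypedActor {
    private final Cluster cluster = Cluster.get(getContext().system());

    @Override
    public void preStart() {
        // Subscribe to reachability and membership events
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
            UnreachableMember.class, ReachableMember.class, MemberRemoved.class);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof UnreachableMember) {
            // Often transient: don't email yet; start (or reset) a grace timer,
            // e.g. via getContext().system().scheduler()
        } else if (message instanceof ReachableMember) {
            // Heartbeats resumed: cancel any pending alert for this member
        } else if (message instanceof MemberRemoved) {
            // The member was actually downed and removed: alert now
        } else {
            unhandled(message);
        }
    }
}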

Can't reach wicket quickstart from outside firewall

I have a project which, for purposes of server configuration, is just a wicket quickstart archetype. I've added some application code, but haven't really done anything to change the default jetty configuration.
I can run and test my application locally using:
http://localhost:8080
or:
http://bekkar:8080 (my PC's network name)
or:
http://192.168.1.2:8080/ (my PC's local IP)
I want to access my wicket app from outside my router firewall. (I will eventually test it on my BlackBerry, but for now I'm using Google Chrome to try to reach it externally.)
Using http://www.whatismyip.com/ I found my router's IP.
I use:
http://###.###.###.###:8080
and I get a screen that says Authentication Required, asking for a username and password. I don't have any kind of authentication set up in my wicket app.
I have a NetGear router, WGR614v7. Using the router admin, under port forwarding, I add the following custom service:
Service Name=wicket
Starting Port=8080
Ending Port=8080
Server IP Address=192.168.1.2 //my computer's local IP
After adding the port forwarding service definition, I get a different message from Chrome:
Oops! Google Chrome could not connect to ###.###.###.###:8080
How can I make my wicket jetty quickstart accessible from outside my router firewall? I don't know whether this is a wicket/jetty issue (belonging on SO) or a firewall issue (belonging on Server Fault), so I'll post it here first.
Thanks!
First, try with just a simple Apache server, or woof. Be sure to bind it to 0.0.0.0 (all IPs).
A) If you can't reach it, it's a router config problem.
B) If that works, you know it's a jetty/wicket config problem.
Case A) I don't know that router, but look for port forwarding. I wasn't able to get an ASUS WL500gP to pass requests in, so I'm not the right one to advise here :)
Case B) Does Jetty bind to 0.0.0.0? Can you reach it from another machine on the local network?
Not a very useful answer, but I hope it helps a bit.
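For case B, assuming the Jetty 6 SocketConnector that the old wicket quickstart's Start.java uses, you can make the bind address explicit as a diagnostic (by default a null host should already mean all interfaces, so treat this as a sketch to rule Jetty out, not a definitive fix):

SocketConnector connector = new SocketConnector();
connector.setHost("0.0.0.0"); // bind to all interfaces, not just loopback
connector.setPort(8080);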
I run jetty/wicket apps on my system all the time and access them remotely. I don't think there is anything special I've done with Jetty, and especially not with wicket, to make this work. But if it helps, here is an example Start.java file (this is from one of my apps -- not sure if it is the same as the one in quickstart, as I don't have a quickstart available right now):
public class Start {
    public static void main(String[] args) throws Exception {
        Server server = new Server();
        SocketConnector connector = new SocketConnector();
        // Set some timeout options to make debugging easier.
        connector.setMaxIdleTime(1000 * 60 * 60);
        connector.setSoLingerTime(-1);
        connector.setPort(8080);
        server.setConnectors(new Connector[] { connector });

        WebAppContext bb = new WebAppContext();
        bb.setServer(server);
        bb.setContextPath("/");
        bb.setWar("src/main/webapp");

        // START JMX SERVER
        // MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
        // MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
        // server.getContainer().addEventListener(mBeanContainer);
        // mBeanContainer.start();

        server.addHandler(bb);
        try {
            System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
            server.start();
            System.in.read();
            System.out.println(">>> STOPPING EMBEDDED JETTY SERVER");
            // while (System.in.available() == 0) {
            //     Thread.sleep(5000);
            // }
            server.stop();
            server.join();
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(100);
        }
    }
}
I'm using a D-Link router, so I'm not sure how to configure yours. However, you should also check whether your router has remote web admin turned on, and whether it is on port 8080. If so, turn it off, as it might be interfering with your port forwarding (it would also explain the Authentication Required prompt you saw before you set up the forwarding).