I'm using the ClusterSingletonManager from the Akka 2.2 contrib project to guarantee that there is always one, and only one, specific type of actor (a master) in a cluster. However, I've observed some odd behaviour (which, incidentally, may be expected, but I can't understand why). Whenever a master drops out of the cluster and joins again later, the following sequence of events occurs:
[INFO] [04/30/2013 17:47:35.805] [ClusterSystem-akka.actor.default-dispatcher-9] [akka://ClusterSystem/system/cluster/core/daemon] Cluster Node [akka.tcp://ClusterSystem@127.0.0.1:2551] - Welcome from [akka.tcp://ClusterSystem@127.0.0.1:2552]
[INFO] [04/30/2013 17:47:48.703] [ClusterSystem-akka.actor.default-dispatcher-8] [akka://ClusterSystem/user/singleton] Member removed [akka.tcp://ClusterSystem@127.0.0.1:52435]
[INFO] [04/30/2013 17:47:48.712] [ClusterSystem-akka.actor.default-dispatcher-2] [akka://ClusterSystem/user/singleton] ClusterSingletonManager state change [Start -> BecomingLeader]
[INFO] [04/30/2013 17:47:49.752] [ClusterSystem-akka.actor.default-dispatcher-9] [akka://ClusterSystem/user/singleton] Retry [1], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:50.850] [ClusterSystem-akka.actor.default-dispatcher-21] [akka://ClusterSystem/user/singleton] Retry [2], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:51.951] [ClusterSystem-akka.actor.default-dispatcher-20] [akka://ClusterSystem/user/singleton] Retry [3], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:47:53.049] [ClusterSystem-akka.actor.default-dispatcher-3]
...
[INFO] [04/30/2013 17:48:10.650] [ClusterSystem-akka.actor.default-dispatcher-21] [akka://ClusterSystem/user/singleton] Retry [20], sending HandOverToMe to [None]
[INFO] [04/30/2013 17:48:11.751] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] Timeout in BecomingLeader. Previous leader unknown, removed and no TakeOver request.
[INFO] [04/30/2013 17:48:11.752] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] Singleton manager [akka.tcp://ClusterSystem@127.0.0.1:2551] starting singleton actor
[INFO] [04/30/2013 17:48:11.754] [ClusterSystem-akka.actor.default-dispatcher-4] [akka://ClusterSystem/user/singleton] ClusterSingletonManager state change [BecomingLeader -> Leader]
Why is it attempting to send a HandOverToMe to [None]? It takes about 20 seconds (20 retries) until it becomes the new leader, even though in this particular situation the previous leader was well known...
I'm not sure if this will answer your question, but looking at the source code for ClusterSingletonManager, you can see the chain of events that leads to this scenario. The class uses Akka's Finite State Machine (FSM) support, and the behavior you are seeing is kicked off by a state transition from Start -> BecomingLeader. First, look at the Start state:
when(Start) {
  case Event(StartLeaderChangedBuffer, _) ⇒
    leaderChangedBuffer = context.actorOf(Props[LeaderChangedBuffer].withDispatcher(context.props.dispatcher))
    getNextLeaderChanged()
    stay
  case Event(InitialLeaderState(leaderOption, memberCount), _) ⇒
    leaderChangedReceived = true
    if (leaderOption == selfAddressOption && memberCount == 1)
      // alone, leader immediately
      gotoLeader(None)
    else if (leaderOption == selfAddressOption)
      goto(BecomingLeader) using BecomingLeaderData(None)
    else
      goto(NonLeader) using NonLeaderData(leaderOption)
}
The part to look at here is:
else if (leaderOption == selfAddressOption)
  goto(BecomingLeader) using BecomingLeaderData(None)
To me, this piece is saying: "If I'm the leader, go from Start to BecomingLeader with None as the previousLeader option."
Then, if you look at the BecomingLeader state:
when(BecomingLeader) {
  ...
  case Event(HandOverRetry(count), BecomingLeaderData(previousLeaderOption)) ⇒
    if (count <= maxHandOverRetries) {
      logInfo("Retry [{}], sending HandOverToMe to [{}]", count, previousLeaderOption)
      previousLeaderOption foreach { peer(_) ! HandOverToMe }
      setTimer(HandOverRetryTimer, HandOverRetry(count + 1), retryInterval, repeat = false)
    } else if (previousLeaderOption forall removed.contains) {
      // can't send HandOverToMe, previousLeader unknown for new node (or restart)
      // previous leader might be down or removed, so no TakeOverFromMe message is received
      logInfo("Timeout in BecomingLeader. Previous leader unknown, removed and no TakeOver request.")
      gotoLeader(None)
    } else
      throw new ClusterSingletonManagerIsStuck(
        s"Becoming singleton leader was stuck because previous leader [${previousLeaderOption}] is unresponsive")
}
This is the block that keeps repeating the message you are seeing in the log. It is attempting to ask the previous leader to hand over responsibility without knowing who that previous leader was, because the state transition passed in None as the previous leader. The million dollar question is: "If it doesn't know who the previous leader is, why keep attempting handoffs that will never succeed?"
Related
I have a simple Flask app that I start with Gunicorn with 4 workers.
I want to clear and warm up the cache when the server restarts, but when I do this inside the create_app() method it executes 4 times, once per worker.
import threading
from flask import Flask

def create_app(test_config=None):
    app = Flask(__name__)
    # ... different configuration here
    t = threading.Thread(target=reset_cache, args=(app,))
    t.start()
    return app
[2022-10-28 09:33:33 +0000] [7] [INFO] Booting worker with pid: 7
[2022-10-28 09:33:33 +0000] [8] [INFO] Booting worker with pid: 8
[2022-10-28 09:33:33 +0000] [9] [INFO] Booting worker with pid: 9
[2022-10-28 09:33:33 +0000] [10] [INFO] Booting worker with pid: 10
2022-10-28 09:33:36,908 INFO webapp reset_cache:38 Clearing cache
2022-10-28 09:33:36,908 INFO webapp reset_cache:38 Clearing cache
2022-10-28 09:33:36,908 INFO webapp reset_cache:38 Clearing cache
2022-10-28 09:33:36,909 INFO webapp reset_cache:38 Clearing cache
How can I make it run only once, without using any queues, rq workers, or Celery?
Signals, a mutex, some special check of the worker id (but that is always dynamic)?
I haven't found any solution so far.
I used Redis locks for that.
Here is an example using flask-caching, which I already had in the project, but you can obtain the Redis client from wherever you have one:
import time
from webapp.models import cache  # cache = flask_caching.Cache()

def reset_cache(app):
    with app.app_context():
        client = app.extensions["cache"][cache]._write_client  # redis client
        lock = client.lock("warmup-cache-key")
        locked = lock.acquire(blocking=False, blocking_timeout=1)
        if locked:
            app.logger.info("Clearing cache")
            cache.clear()
            app.logger.info("Warming up cache")
            # function call here with `cache.set(...)`
            app.logger.info("Completed warmup cache")
            # time.sleep(5)  # add some delay if procedure is really fast
            lock.release()
It can easily be extended with threads, loops, or whatever else you need in order to set values in the cache.
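If you are not using flask-caching, the same pattern works with a plain redis-py client. Here is a rough sketch; the REDIS_URL config key and the warm_cache helper are placeholders, not something from my project:

import redis
from flask import Flask

def reset_cache_plain(app: Flask) -> None:
    # placeholder config key; point it at the Redis instance all workers share
    client = redis.Redis.from_url(app.config["REDIS_URL"])
    # only the first worker to grab the lock performs the warmup
    lock = client.lock("warmup-cache-key", timeout=60)  # auto-expires as a safety net
    if lock.acquire(blocking=False):
        try:
            app.logger.info("Clearing and warming up cache")
            # warm_cache(app)  # placeholder for your own cache.set(...) calls
        finally:
            lock.release()

The timeout on the lock is only there so the key is released even if the worker doing the warmup dies halfway through.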
I have many warnings in WSO2 EI 6.5.0. What do they mean? How do I solve them?
[HTTPS-Sender I/O dispatcher-14] WARN {org.apache.synapse.transport.passthru.PassThroughHttpSSLSender} - System may be unstable: HTTPS ConnectingIOReactor encountered a runtime exception : Task org.apache.axis2.transport.base.threads.NativeWorkerPool$1@238612ec rejected from org.apache.axis2.transport.base.threads.Axis2ThreadPoolExecutor@3b57573[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 121209]
java.util.concurrent.RejectedExecutionException: Task org.apache.axis2.transport.base.threads.NativeWorkerPool$1@238612ec rejected from org.apache.axis2.transport.base.threads.Axis2ThreadPoolExecutor@3b57573[Shutting down, pool size = 15, active threads = 15, queued tasks = 0, completed tasks = 121209]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.axis2.transport.base.threads.NativeWorkerPool.execute(NativeWorkerPool.java:169)
at org.apache.synapse.transport.passthru.TargetErrorHandler.handleError(TargetErrorHandler.java:72)
at org.apache.synapse.transport.passthru.TargetHandler.closed(TargetHandler.java:590)
at org.apache.synapse.transport.passthru.ClientIODispatch.onClosed(ClientIODispatch.java:73)
at org.apache.synapse.transport.passthru.ClientIODispatch.onClosed(ClientIODispatch.java:41)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.disconnected(AbstractIODispatch.java:100)
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionClosed(BaseIOReactor.java:271)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processClosedSessions(AbstractIOReactor.java:439)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:284)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:586)
at java.lang.Thread.run(Thread.java:748)
This Erlang-based web client issues N concurrent requests to a given URL; e.g., escript client.erl http://127.0.0.1:1234/random?num=1000 3200 issues 3200 concurrent requests. For completeness, a corresponding server can be found here.
The client uses spawn_link to spawn N processes to issue one request each. The success/failure of httpc:request call in the spawned process is handled in the spawned process (within the dispatch_requests function) and propagated to the parent process. The parent process also handles both data messages from a spawned process and EXIT messages corresponding to the abnormal termination of a spawned process. So, the parent process waits to receive N messages from/about the spawned processes before it terminates normally.
Given the above context, the client hangs on some executions (e.g., when the server is terminated forcefully) because the parent process never receives N messages from/about the child processes. I am observing this behavior on a Raspberry Pi 3B running Raspbian 9.9 and esl-erlang 22.0-1.
The parent process does not seem to be handling all cases of termination of child processes. If so, what are these cases? If not, what might be the reason for fewer than N messages?
Client code:
% escript client.erl http://127.0.0.1:1234/random?num=5 30
-module(client).
-export([main/1]).
-import(httpc, [request/1]).
-mode(compile).

dispatch_request(Url, Parent) ->
    Start = erlang:monotonic_time(microsecond),
    {Status, Value} = httpc:request(get, {Url, []}, [{timeout, 60000}], []),
    Elapsed_Time = (erlang:monotonic_time(microsecond) - Start) / 1000,
    Msg = case Status of
              ok ->
                  io_lib:format("~pms OK", [Elapsed_Time]);
              error ->
                  io_lib:format("~pms REQ_ERR ~p", [Elapsed_Time, element(1, Value)])
          end,
    Parent ! {Status, Msg}.

wait_on_children(0, NumOfSucc, NumOfFail) ->
    io:format("Success: ~p~n", [NumOfSucc]),
    io:format("Failure: ~p~n", [NumOfFail]);
wait_on_children(Num, NumOfSucc, NumOfFail) ->
    receive
        {'EXIT', ChildPid, {ErrorCode, _}} ->
            io:format("Child ~p crashed ~p~n", [ChildPid, ErrorCode]),
            wait_on_children(Num - 1, NumOfSucc, NumOfFail);
        {Verdict, Msg} ->
            io:format("~s~n", [Msg]),
            case Verdict of
                ok -> wait_on_children(Num - 1, NumOfSucc + 1, NumOfFail);
                error -> wait_on_children(Num - 1, NumOfSucc, NumOfFail + 1)
            end
    end.

main(Args) ->
    inets:start(),
    Url = lists:nth(1, Args),
    Num = list_to_integer(lists:nth(2, Args)),
    Parent = self(),
    process_flag(trap_exit, true),
    [spawn_link(fun() -> dispatch_request(Url, Parent) end) ||
        _ <- lists:seq(1, Num)],
    wait_on_children(Num, 0, 0).
I have a suite of tests for Slick in a Play Framework environment, but when they run, the statements are duplicated in the logs, seemingly as many times as the number of tests. So if I have only one test, I get a single insert and select statement, but with a few tests I get a situation like this:
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] a.e.s.Slf4jLogger - Slf4jLogger started
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] a.e.s.Slf4jLogger - Slf4jLogger started
[info] a.e.s.Slf4jLogger - Slf4jLogger started
[debug] s.j.J.statement - Preparing insert statement (returning: ID): insert into `USER` (`EMAIL`,`PASSWORD`,`ACTIVATION_TOKEN`,`ACTIVATED`,`CREATED`) values (?,?,?,?,?)
[debug] s.j.J.statement - Preparing insert statement (returning: ID): insert into `USER` (`EMAIL`,`PASSWORD`,`ACTIVATION_TOKEN`,`ACTIVATED`,`CREATED`) values (?,?,?,?,?)
[debug] s.j.J.statement - Preparing statement: select `ACTIVATION_TOKEN`, `PASSWORD`, `ACTIVATED`, `CREATED`, `EMAIL`, `ID` from `USER` where `EMAIL` = 'test@example.com' limit 1
[debug] s.j.J.statement - Preparing statement: select `ACTIVATION_TOKEN`, `PASSWORD`, `ACTIVATED`, `CREATED`, `EMAIL`, `ID` from `USER` where `EMAIL` = 'test@example.com' limit 1
[info] c.z.h.p.HikariPool - HikariCP pool db is shutting down.
[info] c.z.h.p.HikariPool - HikariCP pool db is shutting down.
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] c.z.h.HikariDataSource - HikariCP pool db is starting.
[info] c.z.h.p.HikariPool - HikariCP pool db is shutting down.
[info] c.z.h.p.HikariPool - HikariCP pool db is shutting down.
The database state looks correct - there are no redundant statements; only these logs concern me. The "Slf4jLogger started" message appears 10 times, so it's not caused by parallel execution of the tests. The number of duplicates goes up to the number of tests (currently 4), and they appear sequentially, so there are no statements from other parallel executions in between.
Unit code:
class UserSpec extends PlaySpecification {

  val userRepo = Injector.inject[UserRepo]

  import scala.concurrent.ExecutionContext.Implicits.global

  def fakeApp: FakeApplication = {
    FakeApplication(additionalConfiguration =
      Map(
        "slick.dbs.default.driver" -> "slick.driver.H2Driver$",
        "slick.dbs.default.db.driver" -> "org.h2.Driver",
        "slick.dbs.default.db.url" -> "jdbc:h2:mem:test;MODE=MySQL;DATABASE_TO_UPPER=FALSE"
      ))
  }

  "User" should {

    "be created as not activated" in new WithApplication(fakeApp) {
      val email = "test@example.com"
      val action = userRepo.create(email, "Password")
        .flatMap(_ => userRepo.findByEmail(email))
      val result = Await.result(action, Duration.Inf)
      result must not(beNone)
      result.map {
        case User(id, email2, _, _, activated, _) => {
          activated must beFalse
          email2 must beEqualTo(email)
        }
      }
    }

    "cannot be create if same email" in new WithApplication(fakeApp) {
      val email = "test2@example.com"
      val action = userRepo.create(email, "Password")
        .flatMap(_ => userRepo.create(email, "Password"))
      val result = Await.result(action, Duration.Inf)
      result must beNone
    }

    // rest omitted
  }
code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I am inserting into Cassandra 2.0.13 (a single node, for testing) with the Python cassandra-driver, version 2.6.
The following are my keyspace and table definitions:
CREATE KEYSPACE test_keyspace WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
CREATE TABLE test_table (
key text PRIMARY KEY,
column1 text,
...,
column17 text
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
What I tried:
1) Multiprocessing (protocol version set to 1)
Each process has its own cluster and session (default_timeout set to 30.0):
def get_cassandra_session():
    """creates a cluster and gets the session based on the keyspace"""
    # be aware that the session cannot be shared between threads/processes
    # or it will raise an OperationTimedOut exception
    if CLUSTER_HOST2:
        cluster = cassandra.cluster.Cluster([CLUSTER_HOST1, CLUSTER_HOST2])
    else:
        # if only one address is available, we have to use an older protocol version
        cluster = cassandra.cluster.Cluster([CLUSTER_HOST1], protocol_version=1)
    session = cluster.connect(KEY_SPACE)
    session.default_timeout = 30.0
    return session
2) Batch insert (protocol version set to 2, because BatchStatement is enabled on Cassandra 2.x)
def batch_insert(session, batch_queue, batch):
    try:
        insert_user = session.prepare("INSERT INTO " + db.TABLE + " (" + db.COLUMN1 + "," + db.COLUMN2 + "," + db.COLUMN3 +
                                      "," + db.COLUMN4 + ") VALUES (?,?,?,?)")
        while batch_queue.qsize() > 0:
            '''batch queue size is 1000'''
            row_tuple = batch_queue.get()
            batch.add(insert_user, row_tuple)
        session.execute(batch)
    except Exception as e:
        logger.error("batch insert fail.... %s", e)
The above function is invoked by:
batch = BatchStatement(consistency_level=ConsistencyLevel.ONE)
batch_insert(session, batch_queue, batch)
The tuples are stored in batch_queue.
3) Synchronous execution
Several days ago I posted another question, Cassandra update fails; Cassandra was complaining about a timeout issue. I was using synchronous execution for updating.
Can anyone help? Is this an issue with my code, with the Python cassandra-driver, or with Cassandra itself?
Thanks a million!
If your question is about those errors at the top, those are server-side error responses.
The first says that the coordinator you contacted cannot satisfy the request at CL.ONE, with the nodes it believes are alive. This can happen if all replicas are down (more likely with a low replication factor).
The other two errors are timeouts, where the coordinator didn't get responses from 'live' nodes within the time configured in cassandra.yaml.
All of these indicate that the cluster you're connected to is not healthy. This could be because it is overwhelmed (high GC pauses), or experiencing network issues. Check the server logs for clues.
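For completeness, these server-side errors surface in the Python driver as exceptions you can catch. This sketch is not from the original code; it just shows, as I understand the protocol error codes, which driver exceptions correspond to the codes in your messages, assuming a session and prepared statement like the ones in your question:

from cassandra import Unavailable, WriteTimeout, ReadTimeout

def safe_execute(session, statement, params):
    # illustrative only: maps the error codes above to driver exceptions
    try:
        return session.execute(statement, params)
    except Unavailable as e:       # code=1000: not enough live replicas for the CL
        logger.error("unavailable: required=%s alive=%s", e.required_replicas, e.alive_replicas)
    except WriteTimeout as e:      # code=1100: coordinator timed out waiting on a write
        logger.error("write timeout at consistency %s", e.consistency)
    except ReadTimeout as e:       # code=1200: coordinator timed out waiting on a read
        logger.error("read timeout at consistency %s", e.consistency)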
I got the following error, which looks very similar:
cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level LOCAL_ONE" info={'consistency': 'LOCAL_ONE', 'alive_replicas': 0, 'required_replicas': 1}
When I added a sleep(0.5) in the code, it worked fine. I was trying to write too much too fast...
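A rough sketch of that kind of pacing (insert_stmt and rows are illustrative names, not from the original code):

import time

def paced_insert(session, insert_stmt, rows, delay=0.5):
    # slow the client down so a small single-node cluster can keep up
    for row in rows:
        session.execute(insert_stmt, row)
        time.sleep(delay)  # crude backpressure; reduce or remove once the node keeps up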