ClickHouse 2-Node Cluster (Keeper Replication)

Is a 2-node ClickHouse setup possible? If yes, would it be bad or unreasonable?
Most tutorials require a 3-node setup, for example the official documentation:
https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper
I found this example with a two-node setup and I'm not sure if it's a good idea.
https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/clickhouse-keeper/
The alternative would be to use the second server for backups (with https://github.com/AlexAkulov/clickhouse-backup), but I would prefer a replication setup.
Thanks for any hint!

Three clickhouse-keeper nodes are required to avoid a split-brain situation where the connection between servers is lost and each server decides it is the leader.
So you can just set up two clickhouse-server nodes plus one separate clickhouse-keeper node,
and use Engine=ReplicatedMergeTree
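For illustration, on such a setup the replicated table is declared the same way on both clickhouse-server nodes, roughly like the sketch below (the table, the ZooKeeper path and the {shard}/{replica} macros are placeholders to adapt to your own config):
CREATE TABLE events
(
    event_date Date,
    event_id   UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (event_date, event_id);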

I found this example with a two-node setup and I'm not sure if it's a good idea.
I made this example. Some people use this in production, and they are happy, though their ingestion rate is very low.
Two nodes are not a problem (split brain is impossible with Keeper and ZooKeeper). But with 2 nodes of Keeper or ZooKeeper you don't have high availability: if one Keeper node is down, the second node stops serving requests and Replicated tables in ClickHouse become read-only on all replicas.
Embedded Keeper, or Keeper on the same server as ClickHouse, is a bad idea if you have a heavily loaded system with a high ingestion rate. The ClickHouse server utilizes all CPU and all I/O, so Keeper or ZooKeeper produces timeout errors because it is starved of CPU and especially I/O (the disk is too slow for it).
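With a dedicated Keeper host, the two clickhouse-server nodes only need a client-side section pointing at it, something like this sketch (e.g. as a file under config.d/; the hostname is a placeholder and 9181 is Keeper's default client port):
<clickhouse>
    <zookeeper>
        <node>
            <host>keeper-host.example.com</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>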

Related

Akka design: How to add/remove routee from cluster aware router dynamically

I have the following use case and I am not sure if the Akka toolkit provides this out of the box:
I have a number of nodes (instances/machines) that can run a finite number of long-running tasks in the background and cannot accept more work while at max capacity.
Each instance can only process 50 tasks.
All instances are behind a load balancer.
Each task can respond to messages from the client who initiated the task; since the client sends the messages via the load balancer, the instances need to route them to the correct instance that handles the task.
I initially tried cluster sharding, but there doesn't seem to be a way to cap the maximum number of shard regions/actors per node (= #tasks).
Then I tried it with a cluster-aware router, which acts as a guard for accepting or rejecting work. This seems to work reasonably well; one problem is that once a routee reaches capacity I need to remove it from the router and add it back once it has capacity again.
Is there something out of the box that supports this use case or should I carry on with the routing option and if so how can I achieve this?
I'll update the description if you have further questions or something is unclear.
Your scenario sounds like a good fit for the work pulling pattern. The gist of this pattern is:
A master actor coordinates units of work among a number of worker actors.
Workers register themselves to the master, meaning that workers can be added or removed dynamically.
When the master receives work to be done, the master notifies the workers that work is available. Workers pull units of work when they're ready, do what needs to be done with their respective units of work, then ask the master for more work when they're finished.
To learn more about this pattern, read the following (the first two links are listed in the Akka documentation):
The original post (by Derek Wyatt): http://letitcrash.com/post/29044669086/balancing-workload-across-nodes-with-akka-2
A follow-on post (by Michael Pollmeier): http://www.michaelpollmeier.com/akka-work-pulling-pattern
An application of the pattern in a clustered environment with a cluster-aware router (by Ryan Tanner): https://www.conspire.com/blog/2013/10/akka-at-conspire-part-5-the-importance-of/
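As a rough sketch of the pattern with classic Akka actors (the names and messages here are illustrative, not a drop-in solution), the master only hands out work when a worker asks for it, so each node is naturally capped at the number of workers it starts (e.g. 50):
import akka.actor.{Actor, ActorRef, Terminated}

object WorkProtocol {
  case object RegisterWorker
  case object WorkAvailable
  case object GiveMeWork
  case object NoWorkToBeDone
  final case class Work(payload: String)
}

class Master extends Actor {
  import WorkProtocol._
  private var workers = Set.empty[ActorRef]
  private var pending = Vector.empty[String]

  def receive: Receive = {
    case RegisterWorker =>                    // workers join (and leave) dynamically
      context.watch(sender())
      workers += sender()
    case Terminated(worker) =>
      workers -= worker
    case payload: String =>                   // a new unit of work arrives
      pending :+= payload
      workers.foreach(_ ! WorkAvailable)      // notify the workers, but do not push
    case GiveMeWork =>
      if (pending.nonEmpty) {
        sender() ! Work(pending.head)
        pending = pending.tail
      } else {
        sender() ! NoWorkToBeDone
      }
  }
}

class Worker(master: ActorRef) extends Actor {
  import WorkProtocol._
  override def preStart(): Unit = master ! RegisterWorker

  def receive: Receive = {
    case WorkAvailable  => master ! GiveMeWork
    case Work(payload)  =>
      // ... run the long-running task here ...
      master ! GiveMeWork                     // pull the next unit only when finished
    case NoWorkToBeDone => ()                 // stay idle until more work is announced
  }
}
Because a worker only asks for more work after finishing its current unit, a node never runs more concurrent tasks than it has workers, which replaces the add/remove-routee bookkeeping.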

How to restart one live node from a multi node cassandra cluster?

I have a production cassandra cluster of 6 nodes. I have made some changes to the cassandra.yaml file on one node and hence need to restart it.
How can I do it without losing any data or causing any cluster related issues?
Can I just kill the Cassandra process on that particular node and start it again?
Cluster Info:
6 nodes. All active.
I am using AWS Ec2Snitch.
Thanks.
If you are using a replication factor greater than 1, and are not using the ALL consistency setting on your writes/reads, you can perform the steps listed below without any downtime or data loss. If you have one of the limitations listed above, you'll need to increase your replication factor or change your request consistency before you continue.
Perform nodetool drain on that node (http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsDrain.html).
Stop the service.
Start the service.
In Cassandra, if durable writes are enabled, you should not lose data anyway: there's a commitlog replay mechanism in case of an accidental restart, so you should not lose any data from just a restart, but replaying the commitlog can take some time.
The steps written above are part of the official upgrade procedure and should be the "safest" option. You can also do nodetool flush + restart; this ensures that the commitlog replay will be minimal and can be faster than the drain approach.
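Assuming a package install managed with service (adjust to your init system), the whole sequence on that one node would look like:
nodetool drain
sudo service cassandra stop
sudo service cassandra start
nodetool status    # the restarted node should come back as UN (Up/Normal)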
Can I just kill the Cassandra process on that particular node and start it again?
Essentially, yes. I'm assuming you have an RF of 3 with 6 nodes, so it shouldn't be a big deal. If you want to do what I call a "clean shutdown," you could run the following commands first:
nodetool disablegossip
nodetool drain
And then (depending on your install):
sudo service cassandra stop
Or:
kill `cat cassandra.pid`
Note that if you do not complete these steps, you should still be ok. The drain just flushes the memtables to disk. If that doesn't happen, the commit log is reconciled against what's on disk at boot-time anyway. Those steps will just make your boot go faster.

Elasticsearch percolation dead slow on AWS EC2

Recently we switched our cluster to EC2 and everything is working great... except percolation :(
We use Elasticsearch 2.2.0.
To reindex (and percolate) our data we use a separate EC2 c3.8xlarge instance (32 cores, 60GB, 2 x 160 GB SSD) and tell our index to include only this node in allocation.
Because we'll distribute it amongst the rest of the nodes later, we use 10 shards, no replicas (just for indexing and percolation).
There are about 22 million documents in the index and 15,000 percolators. The index is a tad smaller than 11GB (and so easily fits into memory).
About 16 PHP processes talk to the REST API, doing multi percolate requests with 200 requests in each (we made it smaller because of performance; it was 1,000 per request before).
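(For context, a multi percolate call in Elasticsearch 2.x is a newline-delimited body of header/document pairs, roughly like the hypothetical sketch below, repeated 200 times per request in our case.)
GET /_mpercolate
{"percolate": {"index": "our-index", "type": "our-type"}}
{"doc": {"message": "a document to test against the registered percolator queries"}}
{"percolate": {"index": "our-index", "type": "our-type"}}
{"doc": {"message": "the next document in the same batch"}}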
One percolation request (a real one, tapped off of the running PHP processes) takes around 2m20s under load (from the 16 PHP processes). That would've been OK if one of the resources on the EC2 instance were maxed out, but that's the strange thing (see the stats output here, but also seen in htop, iotop and iostat): load, CPU, memory, heap, I/O; everything is well (very well) within limits. There doesn't seem to be a shortage of resources, but still, percolation performance is bad.
When we back off the PHP processes and try the percolate request again, it comes in at around 15s. Just to be clear: I don't have a problem with a 2min+ multi percolate request, as long as I know that one of the resources is fully utilized (and I can act upon it by giving it more of what it wants).
So, ok, it's not the usual suspects, let's try different stuff:
To rule out network, coordination, etc. issues, we also did the same request from the node itself (enabling the client) with the same pressure from the PHP processes: no change.
We upped the processors configuration in elasticsearch.yml and restarted the node to fake our way to a higher usage of resources: no change.
We tried tweaking the percolate and get pool size and queue size: no change.
When we looked at the hot threads, we discovered that UsageTrackingQueryCachingPolicy was coming up a lot, so we did as suggested in this issue: no change.
Maybe it's the number of replicas, seeing as Elasticsearch uses those for searches as well? We upped it to 3 and used more EC2 instances to spread them out: no change.
To determine if we could actually use all resources on EC2, we did stress tests and everything seemed fine, getting it to loads of over 40. Also I/O, memory, etc. showed no issues under high strain.
It could still be the batch size. Under load we tried a batch of just one percolator in a multi percolate request, directly on the data & client node (dedicated to this index), and found that it took 1m50s. When we tried a batch of 200 percolators (still in one multi percolate request) it took 2m02s (which fits roughly with the 15s result from earlier, without pressure).
This last point might be interesting! It seems that it's stuck somewhere for a loooong time and then goes through the percolate phase quite smoothly.
Can anyone make anything out of this? Anything we have missed? We can provide more data if needed.
Have a look at the thread on the Elastic Discuss forum to see the solution.
TLDR;
Use multiple nodes on one big server to get better resource utilization.
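The gist of that solution: run several Elasticsearch nodes on the one big machine so more than one JVM works on the percolation load. A minimal sketch for 2.x, assuming default settings (the details are in the linked thread):
bin/elasticsearch -d    # first node, binds 9200/9300
bin/elasticsearch -d    # second node from the same install, picks the next free ports (9201/9301)
bin/elasticsearch -d    # third node, and so on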

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
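For reference, the usual way to inspect (and, once the cause is understood, manually clear) safe mode from the cluster's master node is:
hdfs dfsadmin -safemode get     # reports whether the NameNode is still in safe mode
hdfs dfsadmin -safemode leave   # forces it out; only sensible once the missing blocks are accounted for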
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking, other) and, therefore, the cluster never left safe mode. The bad news is this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news is that these errors should be quite uncommon.
The reason is probably that you started the master node (housing the NameNode) before starting the workers. If you shut down all the nodes, start the workers first and then start the master node, it should work. I suspect that the master node, starting first, checks whether the workers are there; if they are offline it goes into safe mode. In general this should not happen because of the heartbeat mechanism. However, it is what it is, and a restart of the master node will resolve the matter. In my case it was with Spark on Dataproc.
HTH

Amazon Elasticache Failover

We have been using AWS Elasticache for about 6 months now without any issues. Every night we have a Java app that runs which flushes DB 0 of our Redis cache and then repopulates it with updated data. However, we had 3 instances between July 31 and August 5 where our DB was successfully flushed and then we were not able to write the new data to the database.
We were getting the following exception in our application:
redis.clients.jedis.exceptions.JedisDataException:
redis.clients.jedis.exceptions.JedisDataException: READONLY You can't
write against a read only slave.
When we look at the cache events in Elasticache we can see
Failover from master node prod-redis-001 to replica node
prod-redis-002 completed
We have not been able to diagnose the issue and since the app was running fine for the past 6 months I am wondering if it is something related to a recent Elasticache release that was done on the 30th of June.
https://aws.amazon.com/releasenotes/Amazon-ElastiCache
We have always been writing to our master node and we only have 1 replica node.
If someone could offer any insight it would be much appreciated.
EDIT: This seems to be an intermittent problem. Some days it will fail other days it runs fine.
We have been in contact with AWS support for the past few weeks and this is what we have found.
Most Redis requests are synchronous, including the flush, so it will block all other requests. In our case we are actually flushing 19m keys and it takes more than 30 seconds.
Elasticache performs a health check periodically, and since the flush is running, the health check will be blocked, thus causing a failover.
We have been asking the support team how often the health check is performed so we can get an idea of why our flush is only causing a failover 3-4 times a week. The best answer we can get is "We think it's every 30 seconds". However, our flush consistently takes more than 30 seconds and doesn't consistently fail.
They said that they may implement the ability to configure the timing of the health check; however, this would not be done anytime soon.
The best advice they could give us is:
1) Create a completely new cluster for loading the new data onto, and instead of flushing the previous cluster, re-point your application(s) to the new cluster and remove the old one.
2) If the data that you are flushing is an updated version of the existing data, consider not flushing at all, but updating and overwriting the keys instead.
3) Instead of flushing the data, set the expiry of the items to the time when you would normally flush, and let the keys be reclaimed (possibly with some random jitter to avoid thundering-herd issues), and then reload the data.
Hope this helps :)
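As a rough illustration of option 3 (a sketch in Scala with the same Jedis client; the endpoint, TTL and jitter window are made-up values), the nightly loader could write each key with an expiry around the next scheduled refresh instead of flushing:
import redis.clients.jedis.Jedis
import scala.util.Random

object ExpiringLoader {
  // Placeholder endpoint; in practice, the ElastiCache primary endpoint.
  private val jedis = new Jedis("prod-redis.example.com", 6379)

  private val baseTtlSeconds = 24 * 60 * 60            // roughly until the next nightly load
  private def jitterSeconds  = Random.nextInt(15 * 60) // spread expiry over ~15 minutes

  // Instead of FLUSHDB + reload, overwrite each key and let stale ones age out on their own.
  def put(key: String, value: String): Unit =
    jedis.setex(key, baseTtlSeconds + jitterSeconds, value)
}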
Currently, for Redis versions from 6.2, AWS ElastiCache has a new feature of thread monitoring, so the health check doesn't happen in the same thread as all other Redis actions. Redis can continue to process a long command or Lua script but will still be considered healthy. Because of this new feature, failovers should happen less often.