Ambari recovery after one cluster node goes AWOL - amazon-web-services

I have Ambari and HDP installed on an AWS cluster of four nodes. One of my nodes died and is no longer accessible (I cannot reach it and it cannot reach anyone else). I am okay with the data loss, but I don't know how to tell Ambari to let go of that particular host. What should I do?

If you did not have any masters on this host, you may want to follow this manual.
I'd expect that HDFS supports the removal of an inaccessible datanode.
But if you had masters on this host, that is major trouble, and I'd expect that removing the dead host is not supported by Ambari (without manual database edits).
You may also want to ask at https://community.hortonworks.com/; maybe somebody can suggest a workaround (some kind of dirty hack).
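As a rough sketch of what "letting go" of the host can look like, Ambari's REST API can be driven from Python. The cluster name, host name and credentials below are placeholders, and the exact endpoints and behaviour may differ between Ambari versions, so treat this as an outline rather than a recipe:

# Sketch: removing a dead worker host via Ambari's REST API (placeholders throughout).
import requests

AMBARI = "http://ambari-server.example.com:8080/api/v1"
CLUSTER = "mycluster"              # hypothetical cluster name
DEAD_HOST = "node4.example.com"    # the unreachable host
auth = ("admin", "admin")
headers = {"X-Requested-By": "ambari"}

# Remove all host components still registered on the dead host.
r = requests.delete(f"{AMBARI}/clusters/{CLUSTER}/hosts/{DEAD_HOST}/host_components",
                    auth=auth, headers=headers)
print(r.status_code, r.text)

# Remove the host itself from the cluster.
r = requests.delete(f"{AMBARI}/clusters/{CLUSTER}/hosts/{DEAD_HOST}",
                    auth=auth, headers=headers)
print(r.status_code, r.text)

If the host carried master components, expect these calls to fail or to leave services in a broken state, which is exactly the "major trouble" case above.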

Related

Spawn EC2 instance via Python

Can someone help me understand the basics of spawning EC2 instances, deploying AMIs, and how to configure them properly?
Current situation:
In my company we have one server and a few clients which run calculations and return the results when they are done. The system is written in Python, but sometimes we run out of machine power, so I am considering supporting the clients with additional EC2 instances on demand. The clients connect to the server via an internal IP which is set in a config file.
Question I
Am I right in assuming that I just create an AMI where our Python client sits in autostart, and once it's started it connects to the public IP and picks up new tasks? Is that the entire magic, or am I missing some really great features in this concept?
Question II
While spawning a new instance, can I start the instance with updated configuration or meta information, or do I have to update my AMI every time I make a small change?
If you want to stick with just plain spawning of EC2 instances, here are the answers to your questions:
Question I - This is one of the valid approaches, and yes, if your Python client is configured properly, it will 'just work'.
Question II - Yes, you can achieve that, which is very well explained here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html. There's also another way: store your configuration somewhere else and fetch it when the instance is starting.
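As a minimal sketch of Question II, here is how an instance can be launched with per-launch configuration in user data using boto3; the AMI ID, key pair and security group are placeholders, and the script written into user data is only an illustration of passing the server address at launch instead of baking it into the AMI:

# Sketch: launch an EC2 instance with per-launch configuration in user data.
# Assumes boto3 is installed and AWS credentials are configured; IDs are placeholders.
import boto3

ec2 = boto3.resource("ec2")

user_data = """#!/bin/bash
# Runs on first boot; the client reads its config from this file.
echo "SERVER_HOST=203.0.113.10" > /etc/myclient.conf
systemctl start myclient
"""

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",        # your pre-built AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    UserData=user_data,
)
print("Launched:", instances[0].id)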

Scaling from one EC2 instance to two when your application and database are on the same instance

If I have one EC2 instance that hosts my web application and my MariaDB database, and I want to scale out at some point by separating the web application and database into separate instances, what is the standard practice for doing so without incurring any downtime? It seems like a complicated problem to me, but all the posts I've seen discussing the benefits of keeping the web and data tiers separate from the get-go mostly talk about security benefits and don't seem to emphasize the scalability benefits, which makes me think that it's not as complex a problem as it seems.
Also, in this same scenario, if scaling up and keeping the application and database coupled would be less complex, how would that work -- keeping in mind the zero-downtime requirement?
Familiarize yourself with the way replication works in MariaDB and the solution becomes intuitively obvious.
You create a replica database server by copying the existing database to a new server using mysqldump with particular attention to the options --master-data and --single-transaction to make a backup. Loading the results onto your new database server creates a replica of the original database as it existed at the moment you started making the backup. InnoDB MVCC assures that the version of each row in each table, as it existed at the beginning of the backup, is what appears on the new server as a result of loading this backup. (Yes, you have to be using InnoDB, as you should be doing anyway.)
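For illustration, the backup step might look roughly like the following, shelling out to mysqldump from Python; host names, credentials and the dump path are placeholders:

# Sketch: take a consistent backup that records the master's log coordinates.
import subprocess

dump_cmd = [
    "mysqldump",
    "--host=db-old.example.com", "--user=backup", "--password=secret",
    "--single-transaction",   # consistent InnoDB snapshot without locking writers
    "--master-data=2",        # record master log file/position as a comment in the dump
    "--all-databases",
]
with open("/tmp/migration-dump.sql", "wb") as out:
    subprocess.run(dump_cmd, stdout=out, check=True)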
You then connect the new database (as a slave) to the old database (as master), directing it to begin replicating from that same point in time -- the point in time identified by the master log coordinates contained in the backup -- the time the backup was started.
You wait for the slave to be in sync with the master.
Monitoring the replication status using SHOW MASTER STATUS; on the master and SHOW SLAVE STATUS; on the slave, it is trivial to determine when the slave is indeed "current" with the master. MariaDB replication is "asynchronous" in the sense that changes on the master are made before changes on the slave, but with a slave server of appropriate capacity, the typical replication lag is on the order of milliseconds... and, again, is easily determined. In the time it takes to stop/start your application, any lingering data can be confirmed to have finished replicating across.
Make the slave writable (typically a slave is set to read-only mode, with the only source of changes being the replication SQL thread, which can of course still write to it) ... then monitor replication to verify sync, stop app, point app to new database, verify replication still in sync, start app... done. Now, disconnect the slave from the master database and abandon the old master.
Of course, truly zero downtime is effectively impossible, since at some point the application must be reconfigured to connect to a different database... but the total downtime is essentially determined by how fast you can type, or automate the necessary steps to poll both database servers and compare replication coordinates, and make the transition.
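As a rough sketch of automating that check (assuming the PyMySQL driver and classic file/position replication, with placeholder hosts and credentials):

# Sketch: wait until the replica has executed everything the master has written.
import time
import pymysql

master = pymysql.connect(host="db-old.example.com", user="repl_check",
                         password="secret", cursorclass=pymysql.cursors.DictCursor)
replica = pymysql.connect(host="db-new.example.com", user="repl_check",
                          password="secret", cursorclass=pymysql.cursors.DictCursor)

def master_coords():
    with master.cursor() as cur:
        cur.execute("SHOW MASTER STATUS")
        row = cur.fetchone()
        return (row["File"], int(row["Position"]))

def replica_coords():
    with replica.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return (row["Relay_Master_Log_File"], int(row["Exec_Master_Log_Pos"]))

# With the application stopped, poll until the coordinates match, then repoint the app.
while master_coords() != replica_coords():
    time.sleep(0.2)
print("Replica has caught up; safe to switch the application over.")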
At the risk of stating the obvious, never put anything other than the database on the database server, and never collocate it with the application. No exceptions in production should even be open to discussion. A problem that comes up all too often, as seen here, here, here, and here is more often than not attributable to people disregarding this principle, running the application and its database on the same server. Performance and stability are not only at risk, but the symptoms that arise also give the (incorrect) impression that MySQL (or MariaDB or Percona Server) is at fault, "crashing," when in fact the application is at fault, prompting the OS to force-crash the database in an effort to try to preserve overall machine stability in the face of inevitable memory exhaustion.
One possible solution:
Put a load balancer in front of your EC2 instance, initially just directing traffic to the single instance you have.
Spin up a second instance that will run a copy of your website, get it all configured and pointing at the DB on the first instance, and then add it into the load balancer so it starts to get traffic.
Optional: Add a third instance configured the same as the second, also running a copy of the website only.
Take the original instance out of the LB pool so that web traffic now only goes to #2 and #3.
Uninstall the website from the #1 instance, so it is left running only the database server. (A boto3 sketch of the load-balancer steps follows.)
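Here is a rough boto3 sketch of those load-balancer steps, assuming an Application Load Balancer with a known target group ARN; the instance IDs and the ARN are placeholders:

# Sketch: add the new web-only instance to the pool, later remove the original one.
import boto3

elb = boto3.client("elbv2")
TARGET_GROUP_ARN = ("arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
                    "targetgroup/web/0123456789abcdef")

# Register the new web-only instance (#2) so it starts receiving traffic.
elb.register_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": "i-0bbbbbbbbbbbbbbb2", "Port": 80}],
)

# Later, drain the original combined instance (#1) out of the pool,
# leaving it to serve only the database.
elb.deregister_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": "i-0aaaaaaaaaaaaaaa1", "Port": 80}],
)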

How to make neutron-server availability-zone aware

I have read the manual:
http://docs.openstack.org/mitaka/networking-guide/adv-config-availability-zone.html
It tells us how to support multiple availability zones for the DHCP/L3 router services.
But I still do not know where to place the neutron-server and qpid processes in a multi-AZ environment.
Obviously, starting the neutron-server/qpid processes in only one availability zone is not a good solution; there is still a single-point-of-failure problem.
But so far I have not found any solution on the official OpenStack site.
Does anyone know?
Regards,
Joe

Redundancy without central control point?

Is it possible to provide a service to multiple clients whereby, if the server providing this service goes down, another one takes its place, without some sort of centralised "control" which detects whether the main server has gone down and redirects the clients to the new server?
Is it possible to do this without having a centralised interface/gateway?
In other words, it's a bit like asking: can you design a node balancer without a centralised control to direct clients?
Well, you are not giving much information about the "service" you are asking about, so I'll answer in a generic way.
For the first part of my answer, I'll assume you are talking about a "centralized interface/gateway" involving ip addresses. For this, there's CARP (Common Address Redundancy Protocol), quoted from the wiki:
The Common Address Redundancy Protocol or CARP is a protocol which
allows multiple hosts on the same local network to share a set of IP
addresses. Its primary purpose is to provide failover redundancy,
especially when used with firewalls and routers. In some
configurations CARP can also provide load balancing functionality. It
is a free, non patent-encumbered alternative to Cisco's HSRP. CARP is
mostly implemented in BSD operating systems.
Quoting the netbsd's "Introduction to CARP":
CARP works by allowing a group of hosts on the same network segment to
share an IP address. This group of hosts is referred to as a
"redundancy group". The redundancy group is assigned an IP address
that is shared amongst the group members. Within the group, one host
is designated the "master" and the rest as "backups". The master host
is the one that currently "holds" the shared IP; it responds to any
traffic or ARP requests directed towards it. Each host may belong to
more than one redundancy group at a time.
This might solve your question at the network level, by having the slaves take over the IP address in order, without a single point of failure.
Now, for the second part of the answer (the application level): with distributed Erlang, you can have several nodes (a cluster) that will give you fault tolerance and redundancy (so you would not use IP addresses here; instead you would use "distributed Erlang", a cluster of Erlang nodes).
You would have lots of nodes lying around with your Distributed Application started, and your application resource file would contain a list of nodes (ordered) where the application can be run.
Distributed Erlang will control which of the nodes is "the master" and will automagically start and stop your application on the different nodes as they go up and down.
Quoting (as little as possible) from http://www.erlang.org/doc/design_principles/distributed_applications.html:
In a distributed system with several Erlang nodes, there may be a need
to control applications in a distributed manner. If the node, where a
certain application is running, goes down, the application should be
restarted at another node.
The application will be started at the first node, specified by the
distributed configuration parameter, which is up and running. The
application is started as usual.
For distribution of application control to work properly, the nodes
where a distributed application may run must contact each other and
negotiate where to start the application.
When started, the node will wait for all nodes specified by
sync_nodes_mandatory and sync_nodes_optional to come up. When all
nodes have come up, or when all mandatory nodes have come up and the
time specified by sync_nodes_timeout has elapsed, all applications
will be started. If not all mandatory nodes have come up, the node
will terminate.
If the node where the application is running goes down, the
application is restarted (after the specified timeout) at the first
node, specified by the distributed configuration parameter, which is
up and running. This is called a failover
distributed = [{Application, [Timeout,] NodeDesc}]
If a node is started, which has higher priority according to
distributed, than the node where a distributed application is
currently running, the application will be restarted at the new node
and stopped at the old node. This is called a takeover.
Ok, that was meant as a general overview, since it can be a long topic :)
For the specific details, it is highly recommended to read the Distributed OTP Applications chapter of Learn You Some Erlang (and of course the previous link: http://www.erlang.org/doc/design_principles/distributed_applications.html).
Also, your "service" might depend on other external systems like databases, so you should consider fault tolerance and redundancy there, too. The whole architecture needs to be fault tolerant and distributed for "the service" to work in this way.
Hope it helps!
This answer is a general overview of high availability for networked applications, not specific to Erlang. I don't know too much about what is available in the OTP framework yet because I am new to the language.
There are a few different problems here:
Client connection must be moved to the backup machine
The session may contain state data
How to detect a crash
Problem 1 - Moving client connection
This may be solved in many different ways and on different layers of the network architecture. The easiest thing is to code it right into the client, so that when a connection is lost it reconnects to another machine.
If you need network transparency you may use some technology to sync TCP states between different machines and then reroute all traffic to the new machine, which may be entirely invisible to the client. This is much harder to do than the first suggestion.
I'm sure there are lots of things to do in-between these two.
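As a minimal sketch of the first suggestion (a client that simply reconnects to another machine), assuming plain TCP and placeholder host names:

# Sketch: try the known servers in order and fail over when the connection drops.
import socket
import time

SERVERS = [("primary.example.com", 9000), ("backup.example.com", 9000)]

def connect_any():
    """Cycle through the known servers until one accepts a connection."""
    while True:
        for host, port in SERVERS:
            try:
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                continue
        time.sleep(1)  # everything is down; back off and retry

conn = connect_any()
while True:
    try:
        conn.sendall(b"PING\n")
        if not conn.recv(1024):              # orderly close by the server
            raise ConnectionError("server closed connection")
        time.sleep(1)
    except OSError:
        conn.close()
        conn = connect_any()                 # reconnect to whichever server answers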
Problem 2 - State data
You obviously need to transfer the session state from the crashed machine onto the backup machine. This is really hard to do in a reliable way, and you may lose the last few transactions because the crashed machine may not be able to send the last state before the crash. You can use a synchronized call like this to be really sure about not losing state:
Transaction/message comes from the client into the main machine.
Main machine updates some state.
New state is sent to backup machine.
Backup machine confirms arrival of the new state.
Main machine confirms success to the client.
This may potentially be expensive (or at least not responsive enough) in some scenarios since you depend on the backup machine and the connection to it, including latency, before even confirming anything to the client. To make it perform better you can let the client check with the backup machine upon connection what transactions it received and then resend the lost ones, making it the client's responsibility to queue the work.
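As a toy sketch of that synchronized hand-off, where the main machine only acknowledges the client after the backup has confirmed it stored the new state (the wire protocol and host names are illustrative assumptions, not a real library):

# Sketch: replicate state to the backup before confirming success to the client.
import json
import socket

BACKUP_ADDR = ("backup.example.com", 9100)
session_state = {}

def backup_replicate(state):
    """Send the new state to the backup and wait for its confirmation."""
    with socket.create_connection(BACKUP_ADDR, timeout=2) as s:
        s.sendall(json.dumps(state).encode() + b"\n")
        return s.recv(16).strip() == b"OK"

def handle_request(client_conn, request):
    # Apply the client's transaction to local state.
    session_state[request["key"]] = request["value"]
    # Replicate and wait for the backup's acknowledgement.
    if backup_replicate(session_state):
        client_conn.sendall(b"ACK\n")        # only now confirm to the client
    else:
        client_conn.sendall(b"RETRY\n")      # backup did not confirm; let client retry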
Problem 3 - Detecting a crash
This is an interesting problem because a crash is not always well-defined. Did something really crash? Consider a network problem that closes the connection between the client and server while both are still up and connected to the network. Or worse, one that makes the client disconnect from the server without the server noticing. Here are some questions to think about:
Should the client connect to the backup machine?
What if the main server updates some state and sends it to the backup machine while the backup has the real client connected - will there be a data race?
Can both the main and backup machine be up at the same time or do you need to shut down work on one of them and move all sessions?
Do you need some sort of authority on this matter, some protocol to decide which one is master and which one is slave? Who is that authority? How do you decentralise it?
What if your nodes lose the connection between them but both continue to work as expected (this is called a network partition)?
See Google's paper "Chubby lock server" (PDF) and "Paxos made live" (PDF) to get an idea.
Briefly, this solution involves using a consensus protocol to elect a master among a group of servers that handles all the requests. If the master fails, the protocol is used again to elect the next master.
Also, see gen_leader for an example in leader election which works with detecting failures and transferring service ownership.

Cache data on Multiple Hosts in AppFabric

Let me first explain that I am very new when it comes to using AppFabric to improve the responsiveness of your application. I am trying to configure a server cluster with 2 nodes, using the XML provider over a network shared location.
My requirement is that the cached data should be created on both hosts, so that if one of the hosts is down, the other host in the cluster can serve the request and provide the cached data. As I said, I have 2 hosts in my cluster and one of them is defined as the lead host. Now when I save data in the cache, I cannot see the data on both hosts (I am not sure whether there is any specific command to see the data on a specific host). So what I want to test is this: I'll stop one of the cache hosts and see if I am still able to get the data from the second cache host.
thanks in advance
-Nitin
What you're talking about here is High Availability. To enable this, you'll need to be running Windows Server Enterprise Edition - if you're on Standard Edition then you just can't do it. You also really need a minimum of three hosts, so that if one goes down there are still two copies of your cached data to provide failover. If you can meet these requirements then the only extra step to create a highly-available cache is to set the Secondaries flag when you call New-Cache, e.g.:
new-cache myHACache -Secondaries 1
There's no programmatic way to query what data is held on a specific host, because you only ever address the logical cache, not an individual physical host.
From our experience, using SQL authentication to the database does not work. It's clearly stated that only the Integrated Security option is supported. We also faced issues with the service running with "Integrated Security": our SQL cluster was running under a domain account, AppFabric needs to run under "Network Service", and we couldn't successfully connect to the SQL cluster from the AppFabric service.
This was a painful experience for us, and I hope AppFabric caching improves the way it reports error messages and error codes, and also lets us decide how we want to connect to SQL. Kind of stupid having to undergo the pain of "has to run as Network Service" and "no SQL authentication".