NameNode keeps going down - hdfs

I am having a problem with the NameNode status Ambari shows. The following is happening:
- The NameNode keeps going down a few seconds after I start it through Ambari (it looks like it never really goes up, but the start process runs successfully);
- Despite being DOWN according to Ambari, if I run jps on the server where the NameNode is hosted, it shows that the service is running:
[hdfs@NNVM ~]$ jps
39395 NameNode
4463 Jps
and I can access the NameNode UI properly;
- I already restarted both the NameNode and the ambari-agent manually, but the behavior stays the same;
- This problem started after some heavy HBase/Phoenix queries that caused the NameNode to go down (not sure if this is actually related, but the exact same configuration was working well before this episode);
- I've been digging for some hours and have not been able to find error details in the NameNode logs or the ambari-agent logs that would allow me to understand the problem.
I am using HDP 2.4.0, Ambari 2.2.1.1 and no HA options.
Can someone help with this?
Thanks in advance
Edited: to add ambari version.
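One thing worth checking here (a rough sketch, not a confirmed fix): Ambari marks the NameNode DOWN based on the PID file its scripts track, so a stale or mismatched PID file can make a NameNode that is actually running show as down. The paths below are the usual HDP defaults and may differ on your cluster:
cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid           # PID the Ambari scripts expect (HDP default path)
jps | grep NameNode                                         # PID of the process actually running
tail -n 100 /var/log/ambari-agent/ambari-agent.log          # what the agent reports back to Ambari
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log # NameNode log; file name pattern may differ
If the two PIDs disagree, removing the stale PID file and restarting the NameNode from Ambari usually brings the status back in sync.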

Related

AWS EC2 - FTP return error 451 when trying to upload new file

I am running two t2.medium EC2 servers on AWS. They are both launched from the same AMI and have similar settings, FTP configuration (except passwords, of course) and locations. The only difference between the two servers is the content of the /var/www/html folder.
So far they have been working as expected, but yesterday something weird started happening. Whenever I try to upload a new version of a (PHP) file to one of the servers, it fails and returns the error "server did not report OK, got 451". I've tried different FTP users, different IDEs and rebooting my EC2 server, without any luck. This only happens on one of the servers, and it started happening "out of the blue".
Any suggestions how to fix this or at least in what direction I should continue my debugging?
The comment by @korgen led me to the server's error log. When I ran sudo less /var/log/secure I quickly saw the error message:
Failed to write to /var/log/btmp: No space left on device
I checked the storage volume by running df -h and saw that 20.0 of 20.0 GB was in use. I increased the server's volume size in AWS and, after a quick reboot, everything works again.
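For anyone who lands here later, a minimal sketch of the same kind of check, plus a generic way to find what is filling the disk (the du commands are a common follow-up, not part of the original debugging):
df -h                                                            # confirm which filesystem is full
sudo du -xh / --max-depth=2 2>/dev/null | sort -h | tail -n 20   # largest directories on the root filesystem
sudo du -sh /var/log/*                                           # logs such as /var/log/btmp are a frequent culprit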
I hope this helps a future lost soul :-)

Apache Superset error when installing locally using Docker Compose

I'm trying to install Superset on my Ubuntu 20.04 local machine using docker-compose. When I run sudo docker-compose -f docker-compose-non-dev.yml up, I get several errors; the process keeps producing errors and never finishes, so I aborted it. Can you please tell me what the problem is?
The errors I get during Init Step 1/4 [Starting] -- Applying DB migrations are:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "logs" does not exist
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "ab_permission_view_role" does not exist
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "report_schedule" does not exist
I had the same issue on macOS. Similar issues have been reported on the GitHub issues page as well, but they were not reproducible by everyone.
There is a possibility that something may have gone wrong in the first run.
Try running docker-compose down -v and then run docker-compose up again.
If the above fails, try upgrading your docker installation. Installing a new version solved my problem.
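For completeness, a sketch of that reset using the non-dev compose file from the question (note that -v removes the named volumes, including the metadata database, so anything from the failed run is lost):
docker-compose -f docker-compose-non-dev.yml down -v   # stop containers and remove their volumes
docker-compose -f docker-compose-non-dev.yml up        # start again from a clean state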
I had the same issue (macOS Monterey): I had a Docker container running Postgres for one of my apps, so when Superset started it was talking to that Postgres instance, which obviously didn't have the appropriate databases/tables/views/etc.
So just stopping that other instance and restarting the Superset containers fixed the errors and properly started Superset. #embarrassed #oops
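If you suspect the same kind of clash, a quick way to check whether something else is already bound to the Postgres port (5432 here is an assumption; check the port your compose file maps, and the container name below is hypothetical):
docker ps --filter "publish=5432"        # containers already publishing 5432
sudo lsof -iTCP:5432 -sTCP:LISTEN        # any other process listening on 5432
docker stop <other-postgres-container>   # then bring the Superset stack up again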
I experienced the same issue, but these errors came at the end of a long string of cascading errors, and they occurred consistently across all runs.
Looking at the first error, it seems the initialisation script is not waiting for PostgreSQL to be ready and starts transacting right away. If the first transactions fail, many others fail subsequently. In my case the database needed a few more seconds to be ready, so I just added a sleep 60 at the beginning of docker/docker-bootstrap.sh to give PostgreSQL time to start before the other services begin working.
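A rough sketch of that workaround, assuming the bootstrap script is plain bash (the 60 seconds is the value used above and is arbitrary; tune it to your machine, or replace it with a proper wait loop if your image ships a Postgres client):
#!/usr/bin/env bash
# added at the very top of docker/docker-bootstrap.sh, before the migrations run
sleep 60   # give the database container time to finish initializing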
I deleted the previously created Docker volumes and ran docker-compose -f docker-compose-non-dev.yml up again, and now everything works fine.

Remote go-agent doesn't connect to go-server

go-agent and go-server are the same version, v18.8.0. The server and the agent are installed on different machines. Under go-server/agents, the agent is not listed. The agent configuration file /etc/default/go-agent has been updated with the hostname of the go-server.
Please help.
As mentioned in my question, the /etc/default/go-agent file was updated with the remote go-server URL. After making this change, restarting the agent did not solve the problem, so I restarted the entire machine, and now the remote go-agent is listed under go-server/agents.
Any time the go-server goes down, make sure every machine/runtime running go-agent is restarted.
FYI, this is with go-server v18.8.0 and go-agent v18.8.0.
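For reference, a sketch of what that configuration and restart look like on a typical Linux install of GoCD 18.x (the GO_SERVER_URL key and the example hostname are illustrative; check your own /etc/default/go-agent and service manager):
grep GO_SERVER_URL /etc/default/go-agent
# e.g. GO_SERVER_URL=https://go-server.example.com:8154/go   (hypothetical hostname)
sudo systemctl restart go-agent               # or: sudo /etc/init.d/go-agent restart
sudo tail -f /var/log/go-agent/go-agent.log   # watch the agent try to register; log path may vary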

Puppet agents aren't applying changes from PuppetMaster

We have a deployment in AWS where a single PuppetMaster box services hundreds of other servers within the AWS ecosystem. Starting yesterday, we noticed that Puppet changes were not being applied to the agents. At first we thought it affected only newly provisioned boxes, but now we see that it happens everywhere, and we simply aren't getting any error message on any of the machines where puppet agent runs.
# puppet agent --test --verbose
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Caching catalog for blarg-follower-0e5385bace7e84fe2
Info: Applying configuration version '1529498155'
Notice: Finished catalog run in 0.24 seconds
I have access to the PuppetMaster and have validated that the code there is up to date. I need help figuring out how to get better logging out of this and how to debug what is wrong between the agent and the Puppet master.
In this case the issue was that our Puppet Master's /etc/puppet/puppet.conf file had been modified, and indeed the agents weren't getting the full catalog from Puppet Master. We found a backup copy of the file, restored it, and we were back in business.
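For future readers, a rough sketch of the checks that help in this situation (the backup file name is hypothetical, and the /etc/puppet path matches the layout used here; newer installs keep puppet.conf under /etc/puppetlabs/puppet):
puppet agent --test --debug                                 # far more detail than --verbose
puppet config print --section master                        # on the master: settings actually in effect
diff /etc/puppet/puppet.conf /etc/puppet/puppet.conf.bak    # compare against a known-good backup copy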

How to troubleshoot the installation for DSX Local?

The installation for DSX Local keeps failing at step 32.
I tried fresh installations on both RHEL 7.4 and CentOS.
It seems the error is somehow related to the network, though I'm not 100% sure. The cloudant and redis pods show an ImagePullBackOff status.
I tried multiple things, but it always fails at step 32. I believe it may have to do with the network between the nodes (inside VirtualBox on a host-only network), but I'm not sure.
How can I troubleshoot this problem?
Before answering the question, I think you need to provide more information about your installation: is it 1 node or 3/9 nodes? What version? What architecture?
The cluster has its own Docker registry set up. Tag the image with the localhost:5000/ prefix and then push it, and it will go into that registry. Back to your question: if you see pods in "ImagePullBackOff", they are probably unable to pull from the registry.
You can use curl http://<registry-host>:5000/v2/_catalog to query the registry and make sure it is working properly.
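A minimal sketch of checking and feeding that registry (the <registry-host> placeholder and the cloudant image name are illustrative; use whatever host and port your cluster's registry actually listens on):
curl http://<registry-host>:5000/v2/_catalog                     # list the repositories the registry knows about
docker tag cloudant:latest <registry-host>:5000/cloudant:latest  # re-tag a missing image for the local registry
docker push <registry-host>:5000/cloudant:latest                 # make it pullable by the failing pods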