How to remove banned.users in hadoop Hortonworks - mapreduce

I have a Hortonworks HDP 2.5.3 cluster, and MapReduce jobs in YARN are failing with the error:
java.io.IOException: DistCp failure: Job job_1498784032636_0015 has
failed:
Application application_1498784032636_0015 failed 2 times due to AM Container for appattempt_1498784032636_0015_000002 exited with
exitCode: -1000 For more detailed output, check the application
tracking page:
http://asterdart0005.labs.teradata.com:8088/cluster/app/application_1498784032636_0015 Then click on links to logs of each attempt. Diagnostics: Application
application_1498784032636_0015 initialization failed (exitCode=255)
with output: main : command provided 0 main : run as user is hdfs main
: requested yarn user is hdfs Requested user hdfs is banned
After some searching, it seems that hdfs is a banned user according to the configuration in /etc/hadoop/conf/container-executor.cfg on each node. Here is the content of the file:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
I modified the file on all nodes (namenode, edge and data nodes) as below:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
#banned.users=hdfs,yarn,mapred,bin
min.user.id=500
Then I restarted all services in HDFS, YARN and MapReduce2 through Ambari. After the restart my jobs still fail with the same error, and when I checked /etc/hadoop/conf/container-executor.cfg, it had been reset to its initial state:
yarn.nodemanager.local-dirs=/hadoop/yarn/local
yarn.nodemanager.log-dirs=/hadoop/yarn/log
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
min.user.id=500
Any idea what the solution is here, to remove users from the banned users list?

The first thing to note is that you cannot simply comment out the banned.users line; instead, set the users you do want banned as its value (i.e. if you do not want to ban user hdfs, change banned.users=hdfs,yarn,mapred,bin to banned.users=yarn,mapred,bin). If you comment out the banned.users line, hdfs, yarn and mapred will be banned by default anyway.
Also, Ambari regenerates container-executor.cfg from a template, which is why the manual edits were overwritten. You can follow the steps below to propagate the change to all nodes:
Go to the Ambari server node.
Modify /var/lib/ambari-server/resources/common-services/YARN/<version>/package/templates/container-executor.cfg.j2 to configure the banned users.
Restart the Ambari server and all Ambari agents.
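The edit to the banned.users value described above can be sketched in Python. This is only an illustration of the text change: the function name remove_banned_user is made up for this example, it works on the file content as a string, and it does not touch Ambari or the template by itself. It also does not handle the case where the list becomes empty.

```python
def remove_banned_user(cfg_text, user):
    """Return cfg_text with `user` dropped from the banned.users list.

    Hypothetical helper for illustration; commented-out lines are left alone.
    """
    out = []
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "banned.users":
            # Keep every listed user except the one being un-banned.
            users = [u.strip() for u in value.split(",") if u.strip()]
            users = [u for u in users if u != user]
            line = "banned.users=" + ",".join(users)
        out.append(line)
    return "\n".join(out) + "\n"
```

Applied to the file content from the question with user "hdfs", the banned.users line becomes banned.users=yarn,mapred,bin and the other lines are unchanged.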

Related

Unable to start Informatica Server

I installed Informatica server on my Windows 10 personal desktop, but I have not been able to bring up the service. Every time I start it, it shuts down automatically. I looked at the logs and here is what they say.
C:\Informatica\9.6.1\tomcat\logs\exceptions.log
2020-02-29 00:02:06,141 ERROR [Domain Monitor] [NODE_10017] Cannot write node metadata to the configuration file [C:\Informatica\9.6.1\isp\config\nodemeta.xml]. Verify that the file is accessible and there are no problems with the network or file path.
com.informatica.isp.corecommon.exceptions.ISPException: [FrameworkUtils_0021] Cannot find [WORKGROUP\LAPTOP-HH4DEO09$] while applying read-write permissions.
C:\Informatica\9.6.1\tomcat\logs\node.log
com.informatica.isp.corecommon.exceptions.ISPException: [SPC_10013] Process for service _AdminConsole failed to start.
2020-02-29 01:54:31,346 ERROR [Thread 6 of 6 in DomainServiceThreadPool] [DOM_10055] Cannot start the service process [_AdminConsole] on node [node01_LAPTOP-HH4DEO09]. Detail Message: [[SPC_10013] Process for service _AdminConsole failed to start.]
I have given all permissions on the C:\Informatica\9.6.1\isp\config\nodemeta.xml file as well as its parent directory.
Please help me with troubleshooting this.
Thanks
Sree

Getting a GETHDFS ERROR Admin Yield Error

I am getting this error on my GetHDFS processor and I don't understand why. I set the Hadoop Configuration Resources, Kerberos Principal and Kerberos Keytab, and there are files in the path; I just checked via SuperPuTTY and it's a valid path.
Currently the GetHDFS is just linked to a LogAttribute, as I am trying to get each step working before moving to the next.
Overall process: GetHDFS -> PutEmail. I am trying to print out a count of the rows of the CSV at that path.
It looks like your core-site.xml has a default filesystem that references a hostname 'gwhdpdevnnha' that is not reachable from where NiFi is running. You can troubleshoot this outside of NiFi by checking whether you can ping that hostname from a terminal on the machine where NiFi runs.
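If ping is not convenient, a name-resolution check can be sketched in Python. This is a minimal sketch, not a replacement for ping: it only tells you whether the hostname resolves to an address on the machine where it runs, not whether the host answers. The function name can_resolve and its resolver parameter are made up for this example.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if `hostname` resolves to an address, False otherwise.

    `resolver` is injectable only so the check can be exercised without DNS.
    """
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Running can_resolve("gwhdpdevnnha") on the NiFi host would show whether the name from core-site.xml is known there.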

Concurrent workflow not starting from PMCMD Command

I have a requirement to start a workflow concurrently with multiple instances; all instances need to run in parallel. When I run one instance, it runs and the related param file is picked up. But when I start another instance to run in parallel with the previous one, it gives the error below.
"Start Workflow Advanced: ERROR: Workflow [wf_name]: Could not start execution of this workflow because the current run on this Integration Service has not completed yet."
I tried doing this using the pmcmd command below. It starts without any param file and without an instance name, but the pmcmd log shows that the workflow was started for the given instance successfully.
pmcmd startworkflow -sv 'INT_......' -d 'DOM_......' -u 'venkat' -p MyPass.... -f 'MyFold...' -nowait -rin $inst_name $wf_name
This works fine in our test environment but not in QA. Is there a configuration setting to avoid this behavior?
Please make sure the workflow is properly configured to allow multiple executions: Configure Concurrent Execution has to be enabled and the Allow concurrent run... option needs to be set correctly. If you run with the same instance name, Allow concurrent run with same instance name must be chosen. Otherwise, choose Allow concurrent run only with unique instance name and add the instance name and desired parameter file to the list below it.
In your command I don't see the parameter file, so I assume the latter should be the proper setup.
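If you do want to pass a parameter file on the command line instead, the call could be assembled along these lines. This is a rough sketch: the helper build_start_command is made up for this example, it only builds the argument list without running anything, and the -paramfile option is my assumption about how pmcmd startworkflow accepts a parameter file, so please verify it against the pmcmd reference for your version. The other options are the ones from the command in the question.

```python
def build_start_command(service, domain, user, password, folder,
                        workflow, instance, param_file=None):
    """Build the argument list for one `pmcmd startworkflow` call.

    Hypothetical helper; nothing is executed here.
    """
    cmd = ["pmcmd", "startworkflow",
           "-sv", service, "-d", domain,
           "-u", user, "-p", password,
           "-f", folder, "-nowait"]
    if param_file:
        # Assumed option name for the parameter file; check your pmcmd docs.
        cmd += ["-paramfile", param_file]
    # Run instance name first, workflow name last, as in the question.
    cmd += ["-rin", instance, workflow]
    return cmd
```

Each concurrent run would get its own instance name (and parameter file, if used), and the resulting list could be handed to subprocess.run.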
The issue was resolved by restarting the Integration Service. We had not restarted the Integration Service specifically to fix this issue, but the restart resolved it. When we contacted Informatica support for a resolution, they provided the KB link below: https://kb.informatica.com/solution/23/Pages/59/501120.aspx
Here is the thread I opened on the Informatica Network:
https://network.informatica.com/thread/83540

GoCD Custom Command

I am trying to run a very simple custom command, "echo helloworld", in GoCD as per the Getting Started Guide Part 2. However, the job does not finish: the Console says "Waiting for console logs" and the raw output says "Console log for this job is unavailable as it may have been purged by Go or deleted externally."
My job looks like the following which was taken from typing "echo" in the Lookup Command (which is different to the Getting Started example which I tried first with the same result)
Judging from the screenshot, the problem seems to be that no agent is assigned to the task. For an agent to be assigned, it must satisfy all of these conditions:
An agent must be running, and connected to the server
The agent must be enabled on the "Agents" page
If you use environments, the job and the agent need to be in the same environment
The agent needs to have all of the resources assigned that are configured in the job
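The four conditions above can be written down as a small check. This is a toy model to make the rule explicit, not GoCD's actual scheduler or API: the Agent and Job classes and the can_run function are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Agent:
    connected: bool                      # running and connected to the server
    enabled: bool                        # enabled on the "Agents" page
    environment: Optional[str] = None    # None means no environment is used
    resources: Set[str] = field(default_factory=set)


@dataclass
class Job:
    environment: Optional[str] = None
    resources: Set[str] = field(default_factory=set)


def can_run(agent, job):
    """True only if the agent satisfies all four conditions for the job."""
    return (agent.connected
            and agent.enabled
            and agent.environment == job.environment
            and job.resources <= agent.resources)  # agent has every resource
```

An agent that is connected and enabled but sits in a different environment than the job fails the third condition, so the job keeps waiting for an agent.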
Found the issue.
The Pipelines have to be in the same Environment to work.

Cannot do simple task on ec2 spark cluster from local pyspark

I am trying to run pyspark from my Mac to do compute on an EC2 Spark cluster.
If I login to the cluster, it works as expected:
$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark
Then do a simple task
>>> data=sc.parallelize(range(1000),10)
>>> data.count()
Works as expected:
14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000
But now if I try the same thing from local machine,
$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark
it can't seem to connect to the cluster
14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I thought the problem was the EC2 security groups, but adding inbound rules to both the master and slave security groups to accept all ports did not help.
Any help will be greatly appreciated!
Others are asking the same question on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465
The spark-ec2 script configures the Spark cluster in EC2 as standalone, which means it cannot work with remote submits. I struggled with the same error you described for days before figuring out that it's not supported. The error message is unfortunately misleading.
So you have to copy your code to the master and log in there to run your Spark job.
In my experience, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory" usually means you have accidentally set the cores too high or the executor memory too high, i.e. higher than what your nodes actually have.
Other, less likely causes could be that you got the URI wrong and you're not really connecting to the master. And I once saw that problem when the /run partition was 100% full.
Even less likely, your cluster may actually be down, and you need to restart your Spark workers.
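The "asked for more than the nodes have" case can be illustrated with a small check. This is a simplified model for reasoning about the settings, not how Spark actually schedules: the function usable_workers and the worker numbers are made up, and you would read the real memory and core figures from the cluster UI.

```python
def usable_workers(workers, executor_memory_mb, executor_cores=1):
    """Return the names of workers that could host one executor.

    `workers` maps a worker name to (free memory in MB, free cores).
    """
    return [name for name, (mem_mb, cores) in workers.items()
            if mem_mb >= executor_memory_mb and cores >= executor_cores]


# Hypothetical two-worker cluster with 6 GB and 2 cores free on each node.
workers = {"worker-1": (6144, 2), "worker-2": (6144, 2)}
print(usable_workers(workers, executor_memory_mb=8192))  # no worker fits
print(usable_workers(workers, executor_memory_mb=4096))  # both workers fit
```

If the first call comes back empty for your real numbers, lowering the executor memory or core request is the thing to try first.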