Dependent jar file not found in MapReduce job

I have two almost identical CDH 5.8 clusters, Lab and Production. I have a MapReduce job that runs fine in Lab but fails on the Production cluster. I have already spent over 10 hours on this. I made sure I am running exactly the same code, and I also compared the configurations between the clusters; I couldn't find any difference.
The only difference I could see is that when I run in Production, I see these warnings:
Also note that the paths of the cached files start with "file://null/".
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/commons-httpclient-3.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/commons-httpclient-3.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/hadoop-yarn-server-common.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-server-common.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/stax-api-1.0-2.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/stax-api-1.0-2.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hbase/lib/snappy-java-1.0.4.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/snappy-java-1.0.4.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 INFO impl.YarnClientImpl: Submitted application application_1502835801144_0005
17/08/16 10:13:14 INFO mapreduce.Job: The url to track the job: http://myserver.com:8088/proxy/application_1502835801144_0005/
17/08/16 10:13:14 INFO mapreduce.Job: Running job: job_1502835801144_0005
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 running in uber mode : false
17/08/16 10:13:15 INFO mapreduce.Job: map 0% reduce 0%
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 failed with state FAILED due to: Application application_1502835801144_0005 failed 2 times due to AM Container for appattempt_1502835801144_0005_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1502835801144_0005/Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar does not exist
Failing this attempt. Failing the application.
17/08/16 10:13:15 INFO mapreduce.Job: Counters: 0
17/08/16 10:13:16 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25ba0c30a33ea46
17/08/16 10:13:16 INFO zookeeper.ZooKeeper: Session: 0x25ba0c30a33ea46 closed
17/08/16 10:13:16 INFO zookeeper.ClientCnxn: EventThread shut down
As we can see, the job tries to start but fails, saying that a jar file is not found. I made sure the jar file exists on the local filesystem with ample permissions. I suspect the issue happens when it tries to copy the jar files into the distributed cache and somehow fails.
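That check was along these lines (a sketch; the path is the one from the error above, and it would also need to hold on the Node Manager hosts, not just the host the job is submitted from):
# Confirm the jar exists and is readable where the job is submitted from.
ls -l /var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar
stat -c '%A %U:%G %n' /var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar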
Here is my shell script that starts the MR job:
#!/bin/bash
LIBJARS=`ls -m /var/cdr-ingest-mapreduce/lib/*.jar |tr -d ' '|tr -d '\n'`
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hbase/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*.jar |tr -d ' '|tr -d '\n'`"
job_start_timestamp=''
if [ -n "$1" ]; then
  job_start_timestamp="-overridedJobStartTimestamp $1"
fi
export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
yarn jar `ls /var/cdr-ingest-mapreduce/cdr-ingest-mapreduce-core*.jar` com.blah.CdrIngestor \
-libjars ${LIBJARS} \
-zookeeper 44.88.111.216,44.88.111.220,44.88.111.211 \
-yarnResourceManagerHost 44.88.111.220 \
-yarnResourceManagerPort 8032 \
-yarnResourceManagerSchedulerHost 44.88.111.220 \
-yarnResourceManagerSchedulerPort 8030 \
-mrClientSubmitFileReplication 6 \
-logFile '/var/log/cdr_ingest_mapreduce/cdr_ingest_mapreduce' \
-hdfsTempOutputDirectory '/cdr/temp_cdr_ingest' \
-versions '3' \
-jobConfigDir '/etc/cdr-ingest-mapreduce' \
${job_start_timestamp}
Node Manager Log:
2017-08-16 18:34:28,438 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2017-08-16 18:34:28,551 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-08-16 18:34:31,638 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2017-08-16 18:34:31,851 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: container_1502835801144_0006_01_000001 has no corresponding application!
2017-08-16 18:36:08,221 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2017-08-16 18:36:08,364 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1502933671610_0001 CONTAINERID=container_1502933671610_0001_01_000001
More logs from the Node Manager showing that the jars were not copied to the cache (I am not sure what the fourth parameter, "null", in the message is):
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1502835577753_0001_01_000001
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1502835577753_0001_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/commons-lang3-3.4.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/cdr-ingest-mapreduce-core-1.0.3-SNAPSHOT.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/opencsv-3.8.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download resource { { file:/var/cdr-ingest-mapreduce/lib/dataplatform-common-1.0.7.jar, 1502740240000, FILE, null },pending,[(container_1502835577753_0001_01_000001)],31900834426583787,DOWNLOADING}
Any help is appreciated.

Basically, the mappers/reducers were trying to read the dependent jar file(s) from the node manager's local filesystem. I confirmed that by comparing the configurations between the two clusters: the value of "fs.defaultFS" was set to "file:///" on the cluster that wasn't working. That value comes from /etc/hadoop/conf/core-site.xml on the server (edge server) where my MapReduce job was started. The file had no configuration in it because I had no service/role deployed on that edge server. I deployed HDFS/HttpFS on the edge server and redeployed the client configurations across the cluster; alternatively, one could deploy a gateway role on the server to pull the configurations without having to run any service. Thanks to @tk421 for the tip. This populated /etc/hadoop/conf/core-site.xml and fixed my problem.
For those who don't want to deploy any service/role on the edge server, you could copy the file's contents from one of your data nodes instead.
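A minimal sketch of that check and copy (the data node hostname is a placeholder; hdfs getconf reads the client-side configuration):
# On the edge node, print the filesystem the client configuration resolves to.
# On the broken cluster this resolved to file:/// instead of an hdfs:// URI.
hdfs getconf -confKey fs.defaultFS
# If core-site.xml on the edge node is empty, copy a working client config
# from a data node (datanode01 is a placeholder hostname).
scp datanode01:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml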
I added this little code snippet before starting the job to print the configuration values:
// "config" is the job's org.apache.hadoop.conf.Configuration, iterable as Map.Entry<String, String>.
for (Entry<String, String> entry : config) {
    System.out.println(entry.getKey() + "-->" + entry.getValue());
}
// Start the job and wait for it to finish
jobStatus = job.waitForCompletion(true);

Related

failed to start 'instance-controller' service on EMR master node

I started observing the below validation error on the EMR console.
Upon checking the status of the instance controller service with sudo systemctl status instance-controller.service, I observed that its output is not consistent; it flips between running and auto-restart.
The master node system logs show:
(console) 2023-02-03 21:55:23 About to start instance controller.
(console) 2023-02-03 21:55:23 Listing currently running instance controllers:
hadoop 8439 1 0 21:55 ? 00:00:00 /bin/bash -l /usr/bin/instance-controller
hadoop 8510 8439 0 21:55 ? 00:00:00 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.instancecontroller.Main
hadoop 8541 8439 0 21:55 ? 00:00:00 grep -i instance
root 8542 8439 0 21:55 ? 00:00:00 sudo tee -a /emr/instance-state/console.log-2023-02-03-21-55 /dev/console
oozie 26477 1 26 21:53 ? 00:00:22 /etc/alternatives/jre/bin/java -Xmx1024m -Xmx1024m -Doozie.home.dir=/usr/lib/oozie -Doozie.config.dir=/etc/oozie/conf -Doozie.log.dir=/var/log/oozie -Doozie.data.dir=/var/lib/oozie -Doozie.instance.id=ip-10-111-24-159.pvt.lp192.cazena.com -Doozie.config.file=oozie-site.xml -Doozie.log4j.file=oozie-log4j.properties -Doozie.log4j.reload=10 -Djava.library.path= -cp /usr/lib/oozie/embedded-oozie-server/*:/usr/lib/oozie/embedded-oozie-server/dependency/*:/usr/lib/oozie/lib/*:/usr/lib/oozie/libtools/*:/usr/lib/oozie/libext/*:/usr/lib/oozie/embedded-oozie-server:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/conf/*:/usr/share/aws/emr/emrfs/auxlib/* org.apache.oozie.server.EmbeddedOozieServer
root 27455 1 42 21:54 ? 00:00:35 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.logpusher.Main /etc/logpusher/logpusher.properties
(console) 2023-02-03 21:55:23 Displaying last 10 lines of instance controller logfile:
2023-02-03 21:55:17,719 INFO main: isV2FrameworkEnabled: false, extraInstanceData.numCandidates: 1
2023-02-03 21:55:17,735 WARN main: Invalid metrics information null fetched from checkpoint, will start continuing from current moment instead.
2023-02-03 21:55:17,735 INFO main: Initialized YARN checkpointing state with ckpFileAvl: true, ckpInfo: [ lastCkpTs(0), totalHdfsBytesReadCompletedApps(0), totalHdfsBytesWrittenCompletedApps(0), totalS3BytesReadCompletedApps(0), totalS3BytesWrittenCompletedApps(0)]
2023-02-03 21:55:17,745 ERROR main: Thread + 'main' failed with error
java.lang.RuntimeException: LocalStartupState is FAILED, so not allowing instance controller to start
at aws157.instancecontroller.common.InstanceConfigurator.hasAlreadyBeenConfigured(InstanceConfigurator.java:124)
at aws157.instancecontroller.common.InstanceConfigurator.<init>(InstanceConfigurator.java:100)
at aws157.instancecontroller.InstanceController.<init>(InstanceController.java:223)
at aws157.instancecontroller.Main.runV1Framework(Main.java:239)
at aws157.instancecontroller.Main.main(Main.java:222)
I tried restarting the service multiple times with sudo systemctl start instance-controller.service and rebooted the node hoping the service would come back after the reboot, but it is not working. (Btw, this worked in a lower environment.)
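For reference, the checks and restart attempts looked roughly like this (a sketch; the log path is the usual EMR location and may differ on your AMI):
# Watch whether the service stays up or keeps flapping between running and auto-restart.
sudo systemctl status instance-controller.service
# Restart attempt (a full reboot of the node was also tried).
sudo systemctl start instance-controller.service
# The LocalStartupState failure is usually explained in the instance controller log
# (path assumed from a typical EMR install).
sudo tail -n 100 /emr/instance-controller/log/instance-controller.log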
Jobs on the cluster are running fine without any issues, but I am not able to see application logs pushed to S3 or shown on the console.
I need inputs on how to get the instance controller service running again.

Django hosting with Azure Web App Server Error 500

I followed this tutorial (https://learn.microsoft.com/en-us/azure/app-service/containers/tutorial-python-postgresql-app?tabs=bash#clone-the-sample-app), as well as two others, trying to host my project on Azure. I managed to host the sample web app used in the tutorial, but could not host my own project.
**I keep getting "Server Error 500". I've spent around 36 hours trying to fix the problem.**
I checked the application logs - nothing
I checked the kudu/scm logs - nothing
I looked under "App Service logs" and checked the ftp logs - nothing
I checked to see if all the files had been uploaded to "<>.scm.azurewebsites.net/wwwroot/". The static files were uploaded successfully.
I went to "Web SSH" and installed all the dependencies with "pip install -r requirements.txt",
then ran "python manage.py runserver" with no errors, but I could not connect to "127.0.0.1:8000" or "localhost:8000".
I spent around 6 hours searching for answers and tried everything; nothing worked.
WEBSITES_PORT set to 8000 (tried different ports and removed this setting after no luck)
I changed DEBUG to False and True - didn't work
I did set all the necessary environment variables (eg, DB_HOST, DB_PASSWORD ...)
The App Service plan is F1 (free)
I went to all the pages on my web app and got server error 500 on all the pages except when logging into admin, after logging into admin I got the error again.
Possible solutions I thought might work:
I might be missing an important "Application setting"?
One of the dependencies might be causing the problem - but I highly doubt it
I don't know; please help.
This is roughly what the logs kept saying:
2020-06-24T08:28:13.331Z INFO - Starting container for site
2020-06-24T08:28:13.331Z INFO - docker run -d -p 5480:8000 --name forexflowcom_0_136ed024 -e WEBSITE_SITE_NAME=forexflowcom -e WEBSITE_AUTH_ENABLED=False -e WEBSITE_ROLE_INSTANCE_ID=0 -e WEBSITE_HOSTNAME=forexflowcom.azurewebsites.net -e WEBSITE_INSTANCE_ID=9072c805cf2bc663ced034398777a5d5f6115a51e64a73b6fc69b73f64c8660e -e HTTP_LOGGING_ENABLED=1 appsvc/python:3.7_20200101.1
2020-06-24T08:28:16.751Z INFO - Initiating warmup request to container forexflowcom_0_136ed024 for site forexflowcom
2020-06-24T08:28:28.970Z INFO - Container forexflowcom_0_136ed024 for site forexflowcom initialized successfully and is ready to serve requests.
2020-06-24T09:34:28.003Z INFO - Starting container for site
2020-06-24T09:34:28.010Z INFO - docker run -d -p 5757:8000 --name forexflowcom_1_86357e3d -e WEBSITE_SITE_NAME=forexflowcom -e WEBSITE_AUTH_ENABLED=False -e WEBSITE_ROLE_INSTANCE_ID=0 -e WEBSITE_HOSTNAME=forexflowcom.azurewebsites.net -e WEBSITE_INSTANCE_ID=9072c805cf2bc663ced034398777a5d5f6115a51e64a73b6fc69b73f64c8660e -e HTTP_LOGGING_ENABLED=1 appsvc/python:3.7_20200101.1
2020-06-24T09:34:31.507Z INFO - Initiating warmup request to container forexflowcom_1_86357e3d for site forexflowcom
2020-06-24T09:34:49.002Z INFO - Container forexflowcom_1_86357e3d for site forexflowcom initialized successfully and is ready to serve requests.
2020-06-24T09:38:04.238Z INFO - Starting container for site
2020-06-24T09:38:04.240Z INFO - docker run -d -p 7958:8000 --name forexflowcom_2_79f5bea0 -e WEBSITE_SITE_NAME=forexflowcom -e WEBSITE_AUTH_ENABLED=False -e WEBSITE_ROLE_INSTANCE_ID=0 -e WEBSITE_HOSTNAME=forexflowcom.azurewebsites.net -e WEBSITE_INSTANCE_ID=9072c805cf2bc663ced034398777a5d5f6115a51e64a73b6fc69b73f64c8660e -e HTTP_LOGGING_ENABLED=1 appsvc/python:3.7_20200101.1
2020-06-24T09:38:08.317Z INFO - Initiating warmup request to container forexflowcom_2_79f5bea0 for site forexflowcom
2020-06-24T09:38:23.838Z INFO - Waiting for response to warmup request for container forexflowcom_2_79f5bea0. Elapsed time = 15.5210597 sec
2020-06-24T09:38:41.054Z INFO - Container forexflowcom_2_79f5bea0 for site forexflowcom initialized successfully and is ready to serve requests.
EDIT:
I found the solution.
In settings.py I had:
try:
    from .local_settings import *
except ImportError:
    print("No local file, you're in production")
After removing this import, it worked.

Activate a Conda Environment During Ray Setup

I'm trying to start a local Ray cluster but the initialization and setup commands are raising errors and I'm not sure what they mean.
For each command, the following message is shown after it is executed (the full logs are shown further down):
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
They don't appear to stop the commands from executing successfully, but I'm still unable to activate a conda environment on each node using:
# List of shell commands to run to set up each nodes.
setup_commands:
- conda activate pytorch-dev
Any help or explanation would be greatly appreciated.
My cluster configuration file (cluster_config_local.yaml) contains:
# An unique identifier for the head node and workers of this cluster.
cluster_name: default
## NOTE: Typically for local clusters, min_workers == initial_workers == max_workers.
# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == initial_workers == max_workers.
min_workers: 12
# The initial number of worker nodes to launch in addition to the head node.
# Typically, min_workers == initial_workers == max_workers.
initial_workers: 12
# The maximum number of workers nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == initial_workers == max_workers.
max_workers: 12
# Autoscaling parameters.
# Ignore this if min_workers == initial_workers == max_workers.
autoscaling_mode: default
target_utilization_fraction: 0.8
idle_timeout_minutes: 5
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
  image: ""          # e.g., tensorflow/tensorflow:1.5.0-py3
  container_name: "" # e.g. ray_docker
  run_options: []    # Extra options to pass into "docker run"
# Local specific configuration.
provider:
  type: local
  head_ip: cs19090bs # Lab 3, machine 311
  worker_ips: [
    cs19091bs, cs19093bs, cs19094bs, cs19095bs, cs19096bs,
    cs19103bs, cs19102bs, cs19101bs, cs19100bs, cs19099bs, cs19098bs, cs19097bs
  ]
# How Ray will authenticate with newly launched nodes.
auth:
  ssh_user: user
  ssh_private_key: ~/.ssh/id_rsa
# Leave this empty.
head_node: {}
# Leave this empty.
worker_nodes: {}
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up each node.
setup_commands:
  - conda activate pytorch-dev
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  - ray stop
  - ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - ray start --redis-address=$RAY_HEAD_IP:6379
The full logs that are shown when I execute ray up cluster_config_local.yaml are:
2019-11-11 10:18:06,930 INFO node_provider.py:41 -- ClusterState: Loaded cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
This will create a new cluster [y/N]: y
2019-11-11 10:18:08,413 INFO commands.py:201 -- get_or_create_head_node: Launching new head node...
2019-11-11 10:18:08,414 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:08,416 INFO commands.py:214 -- get_or_create_head_node: Updating files on head node...
2019-11-11 10:18:08,417 INFO updater.py:356 -- NodeUpdater: cs19090bs: Updating to 345f31e4c980153f1c40ae2c0be26b703d4bbfde
2019-11-11 10:18:08,419 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:08,419 INFO updater.py:398 -- NodeUpdater: cs19090bs: Waiting for remote shell...
2019-11-11 10:18:08,420 INFO updater.py:210 -- NodeUpdater: cs19090bs: Waiting for IP...
2019-11-11 10:18:08,429 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Got IP [LogTimer=9ms]
2019-11-11 10:18:08,442 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running uptime on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
10:18:10 up 4 days, 22:41, 1 user, load average: 1.14, 0.56, 0.38
2019-11-11 10:18:10,178 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Got remote shell [LogTimer=1759ms]
2019-11-11 10:18:10,181 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:10,182 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running mkdir -p ~ on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:11,640 INFO updater.py:460 -- NodeUpdater: cs19090bs: Syncing /tmp/ray-bootstrap-aomvoo_d to ~/ray_bootstrap_config.yaml...
sending incremental file list
ray-bootstrap-aomvoo_d
sent 120 bytes received 47 bytes 111.33 bytes/sec
total size is 1,063 speedup is 6.37
2019-11-11 10:18:12,147 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Synced /tmp/ray-bootstrap-aomvoo_d to ~/ray_bootstrap_config.yaml [LogTimer=1964ms]
2019-11-11 10:18:12,147 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running mkdir -p ~ on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:13,610 INFO updater.py:460 -- NodeUpdater: cs19090bs: Syncing /home/cosc/student/atu31/.ssh/id_rsa to ~/ray_bootstrap_key.pem...
sending incremental file list
sent 60 bytes received 12 bytes 48.00 bytes/sec
total size is 3,243 speedup is 45.04
2019-11-11 10:18:14,131 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Synced /home/cosc/student/atu31/.ssh/id_rsa to ~/ray_bootstrap_key.pem [LogTimer=1984ms]
2019-11-11 10:18:14,133 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:14,134 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Initialization commands completed [LogTimer=0ms]
2019-11-11 10:18:14,134 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running conda activate pytorch-dev on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:15,740 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Setup commands completed [LogTimer=1605ms]
2019-11-11 10:18:15,740 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running ray stop on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:17,809 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:19,923 INFO scripts.py:303 -- Using IP address 132.181.15.173 for this node.
2019-11-11 10:18:19,924 INFO resource_spec.py:205 -- Starting Ray with 7.62 GiB memory available for workers and up to 3.81 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2019-11-11 10:18:20,169 INFO scripts.py:333 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --redis-address 132.181.15.173:6379
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(redis_address="132.181.15.173:6379")
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
2019-11-11 10:18:20,221 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Ray start commands completed [LogTimer=4480ms]
2019-11-11 10:18:20,222 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Applied config 345f31e4c980153f1c40ae2c0be26b703d4bbfde [LogTimer=11804ms]
2019-11-11 10:18:20,224 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:20,226 INFO commands.py:281 -- get_or_create_head_node: Head node up-to-date, IP address is: 132.181.15.173
To monitor auto-scaling activity, you can run:
ray exec cluster/cluster_config_local.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
To open a console on the cluster:
ray attach cluster_config_local.yaml
To get a remote shell to the cluster manually, run:
ssh -i ~/.ssh/id_rsa user@132.181.15.173
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
This error message is harmless (and should be muted in Ray). See the question "How to tell bash not to issue warnings 'cannot set terminal process group' and 'no job control in this shell' when it can't assert job control?".
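The conda activation itself is a separate issue that the linked question does not cover. A commonly used workaround (a sketch, assuming a Miniconda install under ~/miniconda3) is to load conda's shell hook explicitly in the setup command, because the non-interactive shells Ray opens over SSH do not run the conda init block from ~/.bashrc:
# Sketch of a setup command that works in a non-interactive shell: load conda's
# shell functions first, then activate the environment (install path assumed).
source ~/miniconda3/etc/profile.d/conda.sh && conda activate pytorch-dev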

Hadoop streaming mapreduce does not run

I have downloaded Hadoop 2.6.0 and the Hadoop streaming jar from here (since I don't have space for running CDH or a Sandbox).
I ran this command:
bin/hadoop jar contrib/hadoop-streaming-2.6.0.jar \
-file ${HADOOP_HOME}/py_mapred/mapper.py -mapper ${HADOOP_HOME}/py_mapred/mapper.py \
-file ${HADOOP_HOME}/py_mapred/reducer.py -reducer ${HADOOP_HOME}/py_mapred/reducer.py \
-input /input/davinci/* -output /input/davinci-output
where I stored the downloaded streaming jar in ${HADOOP_HOME}/contrib and the other files in py_mapred. I also used copyFromLocal to copy the input files to the /input directory on HDFS. Now, when I run the command, the following lines show up:
15/08/14 17:35:45 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
15/08/14 17:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/usr/local/cellar/hadoop/2.6.0/py_mapred/mapper.py, /usr/local/cellar/hadoop/2.6.0/py_mapred/reducer.py, /var/folders/c5/4xfj65v15g91f71c_b9whnpr0000gn/T/hadoop-unjar3313567263260134566/] [] /var/folders/c5/4xfj65v15g91f71c_b9whnpr0000gn/T/streamjob9165494241574343777.jar tmpDir=null
15/08/14 17:35:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/08/14 17:35:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/08/14 17:35:48 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/14 17:35:48 INFO mapreduce.JobSubmitter: number of splits:2
15/08/14 17:35:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439538212023_0002
15/08/14 17:35:49 INFO impl.YarnClientImpl: Submitted application application_1439538212023_0002
15/08/14 17:35:49 INFO mapreduce.Job: The url to track the job: http://Jonathans-MacBook-Pro.local:8088/proxy/application_1439538212023_0002/
15/08/14 17:35:49 INFO mapreduce.Job: Running job: job_1439538212023_0002
It looks like the command has been accepted. I checked localhost:8088 and the job does register. However, it's not running, despite the fact that it says Running job: job_1439538212023_0002. Is there something wrong with my command? Is it due to a permission setting? Why isn't the job running?
Thank you
Here is the right way to invoke streaming:
bin/hadoop jar contrib/hadoop-streaming-2.6.0.jar \
    -file ${HADOOP_HOME}/py_mapred/mapper.py -mapper '/usr/bin/python mapper.py' \
    -file ${HADOOP_HOME}/py_mapred/reducer.py -reducer '/usr/bin/python reducer.py' \
    -input /input/davinci/* -output /input/davinci-output
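A related variant that also commonly works (not from the original answer, so treat it as a sketch): keep -mapper mapper.py but give the scripts a shebang and the executable bit, so the streaming framework can run them directly.
# Hypothetical alternative to prefixing the interpreter: make the shipped scripts
# self-executing. Each script needs a shebang such as #!/usr/bin/env python on its
# first line, plus the executable bit, before the job is submitted.
chmod +x ${HADOOP_HOME}/py_mapred/mapper.py ${HADOOP_HOME}/py_mapred/reducer.py
head -n 1 ${HADOOP_HOME}/py_mapred/mapper.py   # should print the shebang line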

Hadoop MapReduce Job Hangs

I am trying to simulate the Hadoop environment using the latest Hadoop version 2.6.0 and Java SDK 1.7.0 on my Ubuntu desktop. I configured Hadoop with the necessary environment parameters, and all its processes are up and running, as can be seen with the following jps command:
nandu#nandu-Desktop:~$ jps
2810 NameNode
3149 SecondaryNameNode
3416 NodeManager
3292 ResourceManager
2966 DataNode
4805 Jps
I could also see the above information, plus the DFS files, through the Firefox browser. However, when I tried to run a simple WordCount MapReduce job, it hangs and doesn't produce any output or show any error message(s). After a while I killed the process using the "hadoop job -kill" command. Can you please guide me to find the cause of this issue and how to resolve it? I am including the job start and kill (end) output below.
If you need additional information, please let me know.
Your help will be highly appreciated.
Thanks,
===================================================================
nandu#nandu-Desktop:~/dev$ hadoop jar wc.jar WordCount /user/nandu/input /user/nandu/output
15/02/27 10:35:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/27 10:35:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/27 10:35:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/02/27 10:35:21 INFO input.FileInputFormat: Total input paths to process : 2
15/02/27 10:35:21 INFO mapreduce.JobSubmitter: number of splits:2
15/02/27 10:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425048764581_0003
15/02/27 10:35:22 INFO impl.YarnClientImpl: Submitted application application_1425048764581_0003
15/02/27 10:35:22 INFO mapreduce.Job: The url to track the job: http://nandu-Desktop:8088/proxy/application_1425048764581_0003/
15/02/27 10:35:22 INFO mapreduce.Job: Running job: job_1425048764581_0003
==================== at this point the job was killed ===================
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 running in uber mode : false
15/02/27 10:38:23 INFO mapreduce.Job: map 0% reduce 0%
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 failed with state KILLED due to: Application killed by user.
15/02/27 10:38:23 INFO mapreduce.Job: Counters: 0
I encountered a similar problem while running the MapReduce samples provided in the Hadoop package. In my case it was hanging due to low disk space on my VM (only about 1.5 GB was free). When I freed some disk space it ran fine. Also, please check that the other system resource requirements are fulfilled.
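A quick way to check for that kind of resource starvation (a sketch using standard Hadoop/YARN commands; YARN typically marks a node unhealthy when its local disks are nearly full, which leaves submitted jobs stuck without ever running):
# Free space on the local disks used by YARN and HDFS.
df -h
# Confirm the NodeManager is registered and RUNNING with usable memory/vcores.
yarn node -list
# HDFS view of remaining capacity per DataNode.
hdfs dfsadmin -report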