Activate a Conda Environment During Ray Setup - ray

I'm trying to start a local Ray cluster but the initialization and setup commands are raising errors and I'm not sure what they mean.
For each command, the following message is shown after it is executed (the full logs are shown further down):
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
These messages don't appear to prevent the commands from completing successfully, but I'm still unable to activate a conda environment on each node using:
# List of shell commands to run to set up each node.
setup_commands:
- conda activate pytorch-dev
Any help or explanation would be greatly appreciated.
My cluster configuration file (cluster_config_local.yaml) contains:
# An unique identifier for the head node and workers of this cluster.
cluster_name: default
## NOTE: Typically for local clusters, min_workers == initial_workers == max_workers.
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == initial_workers == max_workers.
min_workers: 12
# The initial number of worker nodes to launch in addition to the head node.
# Typically, min_workers == initial_workers == max_workers.
initial_workers: 12
# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == initial_workers == max_workers.
max_workers: 12
# Autoscaling parameters.
# Ignore this if min_workers == initial_workers == max_workers.
autoscaling_mode: default
target_utilization_fraction: 0.8
idle_timeout_minutes: 5
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
container_name: "" # e.g. ray_docker
run_options: [] # Extra options to pass into "docker run"
# Local specific configuration.
provider:
type: local
head_ip: cs19090bs #Lab 3, machine 311
worker_ips: [
cs19091bs, cs19093bs, cs19094bs, cs19095bs, cs19096bs,
cs19103bs, cs19102bs, cs19101bs, cs19100bs, cs19099bs, cs19098bs, cs19097bs
]
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: user
ssh_private_key: ~/.ssh/id_rsa
# Leave this empty.
head_node: {}
# Leave this empty.
worker_nodes: {}
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up each node.
setup_commands:
- conda activate pytorch-dev
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --redis-address=$RAY_HEAD_IP:6379
The full logs that are shown when I execute ray up cluster_config_local.yaml are:
2019-11-11 10:18:06,930 INFO node_provider.py:41 -- ClusterState: Loaded cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
This will create a new cluster [y/N]: y
2019-11-11 10:18:08,413 INFO commands.py:201 -- get_or_create_head_node: Launching new head node...
2019-11-11 10:18:08,414 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:08,416 INFO commands.py:214 -- get_or_create_head_node: Updating files on head node...
2019-11-11 10:18:08,417 INFO updater.py:356 -- NodeUpdater: cs19090bs: Updating to 345f31e4c980153f1c40ae2c0be26b703d4bbfde
2019-11-11 10:18:08,419 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:08,419 INFO updater.py:398 -- NodeUpdater: cs19090bs: Waiting for remote shell...
2019-11-11 10:18:08,420 INFO updater.py:210 -- NodeUpdater: cs19090bs: Waiting for IP...
2019-11-11 10:18:08,429 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Got IP [LogTimer=9ms]
2019-11-11 10:18:08,442 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running uptime on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
10:18:10 up 4 days, 22:41, 1 user, load average: 1.14, 0.56, 0.38
2019-11-11 10:18:10,178 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Got remote shell [LogTimer=1759ms]
2019-11-11 10:18:10,181 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:10,182 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running mkdir -p ~ on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:11,640 INFO updater.py:460 -- NodeUpdater: cs19090bs: Syncing /tmp/ray-bootstrap-aomvoo_d to ~/ray_bootstrap_config.yaml...
sending incremental file list
ray-bootstrap-aomvoo_d
sent 120 bytes received 47 bytes 111.33 bytes/sec
total size is 1,063 speedup is 6.37
2019-11-11 10:18:12,147 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Synced /tmp/ray-bootstrap-aomvoo_d to ~/ray_bootstrap_config.yaml [LogTimer=1964ms]
2019-11-11 10:18:12,147 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running mkdir -p ~ on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:13,610 INFO updater.py:460 -- NodeUpdater: cs19090bs: Syncing /home/cosc/student/atu31/.ssh/id_rsa to ~/ray_bootstrap_key.pem...
sending incremental file list
sent 60 bytes received 12 bytes 48.00 bytes/sec
total size is 3,243 speedup is 45.04
2019-11-11 10:18:14,131 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Synced /home/cosc/student/atu31/.ssh/id_rsa to ~/ray_bootstrap_key.pem [LogTimer=1984ms]
2019-11-11 10:18:14,133 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:14,134 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Initialization commands completed [LogTimer=0ms]
2019-11-11 10:18:14,134 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running conda activate pytorch-dev on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:15,740 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Setup commands completed [LogTimer=1605ms]
2019-11-11 10:18:15,740 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running ray stop on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:17,809 INFO updater.py:262 -- NodeUpdater: cs19090bs: Running ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml on 132.181.15.173...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-11 10:18:19,923 INFO scripts.py:303 -- Using IP address 132.181.15.173 for this node.
2019-11-11 10:18:19,924 INFO resource_spec.py:205 -- Starting Ray with 7.62 GiB memory available for workers and up to 3.81 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2019-11-11 10:18:20,169 INFO scripts.py:333 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --redis-address 132.181.15.173:6379
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(redis_address="132.181.15.173:6379")
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
2019-11-11 10:18:20,221 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Ray start commands completed [LogTimer=4480ms]
2019-11-11 10:18:20,222 INFO log_timer.py:21 -- NodeUpdater: cs19090bs: Applied config 345f31e4c980153f1c40ae2c0be26b703d4bbfde [LogTimer=11804ms]
2019-11-11 10:18:20,224 INFO node_provider.py:85 -- ClusterState: Writing cluster state: ['cs19091bs', 'cs19093bs', 'cs19094bs', 'cs19095bs', 'cs19096bs', 'cs19090bs', 'cs19103bs', 'cs19102bs', 'cs19101bs', 'cs19100bs', 'cs19099bs', 'cs19098bs', 'cs19097bs']
2019-11-11 10:18:20,226 INFO commands.py:281 -- get_or_create_head_node: Head node up-to-date, IP address is: 132.181.15.173
To monitor auto-scaling activity, you can run:
ray exec cluster/cluster_config_local.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
To open a console on the cluster:
ray attach cluster_config_local.yaml
To get a remote shell to the cluster manually, run:
ssh -i ~/.ssh/id_rsa user@132.181.15.173

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
This error message is harmless (and should arguably be muted by Ray). For background, see the question "How to tell bash not to issue warnings 'cannot set terminal process group' and 'no job control in this shell' when it can't assert job control?".
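As for the conda environment itself: setup_commands run in a non-interactive shell, which normally skips the conda init block in ~/.bashrc, so a bare conda activate often has no effect. A minimal sketch of a common workaround (not part of the original answer; the Miniconda install path is an assumption, adjust it to your setup):
setup_commands:
    # Source conda's shell hook explicitly, because non-interactive shells
    # do not run the "conda init" block in ~/.bashrc (install path assumed)
    - source ~/miniconda3/etc/profile.d/conda.sh && conda activate pytorch-dev
Note that each setup command is typically executed in its own shell session, so the activation only affects commands chained on the same line.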

Related

failed to start 'instance-controller' service on EMR master node

I started observing a validation error on the EMR console.
Upon checking the status of the instance controller service, I observed that the output of
sudo systemctl status instance-controller.service is not consistent; it alternates between running and auto-restart.
Master node system logs shows;
(console) 2023-02-03 21:55:23 About to start instance controller.
(console) 2023-02-03 21:55:23 Listing currently running instance controllers:
hadoop 8439 1 0 21:55 ? 00:00:00 /bin/bash -l /usr/bin/instance-controller
hadoop 8510 8439 0 21:55 ? 00:00:00 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.instancecontroller.Main
hadoop 8541 8439 0 21:55 ? 00:00:00 grep -i instance
root 8542 8439 0 21:55 ? 00:00:00 sudo tee -a /emr/instance-state/console.log-2023-02-03-21-55 /dev/console
oozie 26477 1 26 21:53 ? 00:00:22 /etc/alternatives/jre/bin/java -Xmx1024m -Xmx1024m -Doozie.home.dir=/usr/lib/oozie -Doozie.config.dir=/etc/oozie/conf -Doozie.log.dir=/var/log/oozie -Doozie.data.dir=/var/lib/oozie -Doozie.instance.id=ip-10-111-24-159.pvt.lp192.cazena.com -Doozie.config.file=oozie-site.xml -Doozie.log4j.file=oozie-log4j.properties -Doozie.log4j.reload=10 -Djava.library.path= -cp /usr/lib/oozie/embedded-oozie-server/*:/usr/lib/oozie/embedded-oozie-server/dependency/*:/usr/lib/oozie/lib/*:/usr/lib/oozie/libtools/*:/usr/lib/oozie/libext/*:/usr/lib/oozie/embedded-oozie-server:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/conf/*:/usr/share/aws/emr/emrfs/auxlib/* org.apache.oozie.server.EmbeddedOozieServer
root 27455 1 42 21:54 ? 00:00:35 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.logpusher.Main /etc/logpusher/logpusher.properties
(console) 2023-02-03 21:55:23 Displaying last 10 lines of instance controller logfile:
2023-02-03 21:55:17,719 INFO main: isV2FrameworkEnabled: false, extraInstanceData.numCandidates: 1
2023-02-03 21:55:17,735 WARN main: Invalid metrics information null fetched from checkpoint, will start continuing from current moment instead.
2023-02-03 21:55:17,735 INFO main: Initialized YARN checkpointing state with ckpFileAvl: true, ckpInfo: [ lastCkpTs(0), totalHdfsBytesReadCompletedApps(0), totalHdfsBytesWrittenCompletedApps(0), totalS3BytesReadCompletedApps(0), totalS3BytesWrittenCompletedApps(0)]
2023-02-03 21:55:17,745 ERROR main: Thread + 'main' failed with error
java.lang.RuntimeException: LocalStartupState is FAILED, so not allowing instance controller to start
at aws157.instancecontroller.common.InstanceConfigurator.hasAlreadyBeenConfigured(InstanceConfigurator.java:124)
at aws157.instancecontroller.common.InstanceConfigurator.<init>(InstanceConfigurator.java:100)
at aws157.instancecontroller.InstanceController.<init>(InstanceController.java:223)
at aws157.instancecontroller.Main.runV1Framework(Main.java:239)
at aws157.instancecontroller.Main.main(Main.java:222)
I tried restarting the service multiple times with
sudo systemctl start instance-controller.service and rebooted the node, hoping the service would come back up after the reboot, but it is not working. (This did work in a lower environment.)
Jobs on the cluster are running fine without any issues, but I am not able to see application logs pushed to S3 or shown on the console.
I need input on how to restart the instance controller service.

ssh tunnel script hangs forever on beanstalk deployment

I'm attempting to create an ssh tunnel when deploying an application to AWS Beanstalk. I want to run the tunnel as a background process that is always connected once the application is deployed. The script hangs forever during the deployment and I can't see why.
"/home/ec2-user/eclair-ssh-tunnel.sh":
mode: "000500" # u+rx
owner: root
group: root
content: |
cd /root
eval $(ssh-agent -s)
DISPLAY=":0.0" SSH_ASKPASS="./askpass_script" ssh-add eclair-test-key </dev/null
# we want this command to keep running in the background,
# so we add & at the end
nohup ssh -L 48682:localhost:8080 ubuntu@[host...] -N &
and here is the output I'm getting from /var/log/eb-activity.log:
[2019-06-14T14:53:23.268Z] INFO [15615] - [Application update suredbits-api-root-0.37.0-testnet-ssh-tunnel-fix-port-9#30/AppDeployStage1/AppDeployPostHook/01_eclair-ssh-tunnel.sh] : Starting activity...
The ssh tunnel is spawned, and I can find it by doing:
[ec2-user@ip-172-31-25-154 ~]$ ps aux | grep 48682
root 16047 0.0 0.0 175560 6704 ? S 14:53 0:00 ssh -L 48682:localhost:8080 ubuntu@ec2-34-221-186-19.us-west-2.compute.amazonaws.com -N
If I kill that process, the deployment continues as expected, which indicates that the bug is in the tunnel script. I can't seem to find out where though.
You need to add the -n option to ssh when running it in the background, so that it does not try to read from stdin.
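Applied to the script above, the tunnel line would presumably become (a sketch; the [host...] placeholder is left as in the question):
# -n detaches stdin so the backgrounded ssh cannot block the deployment hook
nohup ssh -n -N -L 48682:localhost:8080 ubuntu@[host...] &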

Dependent jar file not found in mapreduce job

I have two almost identical CDH 5.8 clusters, namely Lab and Production. I have a mapreduce job that runs fine in Lab but fails in the Production cluster. I have spent over 10 hours on this already. I made sure I am running the exact same code and also compared the configurations between the clusters; I couldn't find any difference.
The only difference I could see is that when I run in Production, I see these warnings.
Also note that the path of the cached files starts with "file://null/":
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/commons-httpclient-3.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/commons-httpclient-3.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/hadoop-yarn-server-common.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-server-common.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/stax-api-1.0-2.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/stax-api-1.0-2.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hbase/lib/snappy-java-1.0.4.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/snappy-java-1.0.4.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 INFO impl.YarnClientImpl: Submitted application application_1502835801144_0005
17/08/16 10:13:14 INFO mapreduce.Job: The url to track the job: http://myserver.com:8088/proxy/application_1502835801144_0005/
17/08/16 10:13:14 INFO mapreduce.Job: Running job: job_1502835801144_0005
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 running in uber mode : false
17/08/16 10:13:15 INFO mapreduce.Job: map 0% reduce 0%
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 failed with state FAILED due to: Application application_1502835801144_0005 failed 2 times due to AM Container for appattempt_1502835801144_0005_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1502835801144_0005/Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar does not exist
Failing this attempt. Failing the application.
17/08/16 10:13:15 INFO mapreduce.Job: Counters: 0
17/08/16 10:13:16 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25ba0c30a33ea46
17/08/16 10:13:16 INFO zookeeper.ZooKeeper: Session: 0x25ba0c30a33ea46 closed
17/08/16 10:13:16 INFO zookeeper.ClientCnxn: EventThread shut down
As we can see, the job tries to start but fails, saying that a jar file is not found. I made sure the jar file exists in the local fs with ample permissions. I suspect the issue happens when it tries to copy the jar files into the distributed cache and fails somehow.
Here is my shell script that start the MR job:
#!/bin/bash
LIBJARS=`ls -m /var/cdr-ingest-mapreduce/lib/*.jar |tr -d ' '|tr -d '\n'`
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hbase/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*.jar |tr -d ' '|tr -d '\n'`"
job_start_timestamp=''
if [ -n "$1" ]; then
job_start_timestamp="-overridedJobStartTimestamp $1"
fi
export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
yarn jar `ls /var/cdr-ingest-mapreduce/cdr-ingest-mapreduce-core*.jar` com.blah.CdrIngestor \
-libjars ${LIBJARS} \
-zookeeper 44.88.111.216,44.88.111.220,44.88.111.211 \
-yarnResourceManagerHost 44.88.111.220 \
-yarnResourceManagerPort 8032 \
-yarnResourceManagerSchedulerHost 44.88.111.220 \
-yarnResourceManagerSchedulerPort 8030 \
-mrClientSubmitFileReplication 6 \
-logFile '/var/log/cdr_ingest_mapreduce/cdr_ingest_mapreduce' \
-hdfsTempOutputDirectory '/cdr/temp_cdr_ingest' \
-versions '3' \
-jobConfigDir '/etc/cdr-ingest-mapreduce' \
${job_start_timestamp}
Node Manager Log:
2017-08-16 18:34:28,438 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2017-08-16 18:34:28,551 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-08-16 18:34:31,638 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2017-08-16 18:34:31,851 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: container_1502835801144_0006_01_000001 has no corresponding application!
2017-08-16 18:36:08,221 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2017-08-16 18:36:08,364 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1502933671610_0001 CONTAINERID=container_1502933671610_0001_01_000001
More logs from Node Manager showing that the jars were not copied to the cache (I am not sure what the 4th parameter "NULL" in the message is):
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1502835577753_0001_01_000001
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1502835577753_0001_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/commons-lang3-3.4.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/cdr-ingest-mapreduce-core-1.0.3-SNAPSHOT.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/opencsv-3.8.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download resource { { file:/var/cdr-ingest-mapreduce/lib/dataplatform-common-1.0.7.jar, 1502740240000, FILE, null },pending,[(container_1502835577753_0001_01_000001)],31900834426583787,DOWNLOADING}
Any help is appreciated.
Basically, the mapper/reducer was trying to read the dependent jar file(s) from the node manager's local filesystem. I confirmed that by comparing the configurations between the two clusters: the value of "fs.defaultFS" was set to "file:///" on the cluster that wasn't working.
That value comes from the file /etc/hadoop/conf/core-site.xml on the server (edge server) where my mapreduce job was started. The file had no configuration because I had no service/role deployed on that edge server. I deployed HDFS/HttpFs on the edge server and redeployed the client configurations across the cluster; alternatively, one could deploy a gateway role on the server to pull the configurations without having to run any role (thanks to @tk421 for the tip). This created the contents of /etc/hadoop/conf/core-site.xml and fixed my problem.
For those who don't want to deploy any service/role on the edge server, you could copy the file contents from one of your data nodes.
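A quick way to check which filesystem a client on the edge node will actually use (a hedged verification step, not from the original answer) is to query the client configuration directly; on the broken edge server this would presumably print file:/// instead of an hdfs:// URI:
# Prints the effective fs.defaultFS as seen by Hadoop clients on this host
hdfs getconf -confKey fs.defaultFS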
I added this little code snippet before starting the job to print the configuration values:
// config is the job's org.apache.hadoop.conf.Configuration, which is
// iterable as java.util.Map.Entry<String, String> key/value pairs
for (Entry<String, String> entry : config) {
    System.out.println(entry.getKey() + "-->" + entry.getValue());
}
// Start the job and wait for it to finish
jobStatus = job.waitForCompletion(true);

Geth private network problems generating ether

Short description
I have three Ethereum nodes connected in a private network and I am using the interactive Javascript console with geth.
The problem is, I cannot find a way to get ether on any of the accounts. The balance is always 0.
Details
For all three nodes, the configuration and output are similar, differing only in their addresses and account numbers.
File tree before running geth:
~/eth/
database/
keystore/
genesis/
CustomGenesis.json
Contents of CustomGenesis.json:
{
"config": {
"chainId": 15,
"homesteadBlock": 0,
"eip155Block": 0,
"eip158Block": 0
},
"nonce": "0x0000000000000042",
"timestamp": "0x00",
"parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"extraData": "0x00",
"gasLimit": "0x08000000",
"difficulty": "0x0400",
"mixhash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"coinbase": "0xd77821c8b92e3e29bc63c8f2a94a6c6a64b28b53",
"alloc": {
"0x862e90e6b6ebfe0535081d07be8e0f38e422932c": {"balance": "100"},
"0x47e4cf0cc71e7257663f3d2f95e3f8982ece3ad8": {"balance": "200"},
"0x1df2f4f40c03367a9bf42b28a090fed1cccb3068": {"balance": "300"},
"0xd77821c8b92e3e29bc63c8f2a94a6c6a64b28b53": {"balance": "4444444444444444444"},
"0x28685a4b9418c1cb85725318756aa815e8e34497": {"balance": "5555555555555555555"},
"0x86f0526280fea57255c6391a4c7dbdbe8e1181ab": {"balance": "6666666666666666666"}
}
}
While in the directory ~/eth/ I started geth with:
sudo geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain * --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet init ./genesis/CustomGenesis.json
... which produced the following output:
INFO [07-12|13:12:46] Starting peer-to-peer node instance=Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1
INFO [07-12|13:12:46] Allocated cache and file handles database=/home/ethereum6/eth/database/geth/chaindata cache=128 handles=1024
INFO [07-12|13:12:46] Writing default main-net genesis block
INFO [07-12|13:12:47] Initialised chain configuration config="{ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Metropolis: 9223372036854775807 Engine: ethash}"
INFO [07-12|13:12:47] Disk storage enabled for ethash caches dir=/home/ethereum6/eth/database/geth/ethash count=3
INFO [07-12|13:12:47] Disk storage enabled for ethash DAGs dir=/home/ethereum6/.ethash count=2
WARN [07-12|13:12:47] Upgrading db log bloom bins
INFO [07-12|13:12:47] Bloom-bin upgrade completed elapsed=222.754µs
INFO [07-12|13:12:47] Initialising Ethereum protocol versions="[63 62]" network=15
INFO [07-12|13:12:47] Loaded most recent local header number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Loaded most recent local full block number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Loaded most recent local fast block number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Starting P2P networking
INFO [07-12|13:12:47] HTTP endpoint opened: http://127.0.0.1:8080
INFO [07-12|13:12:47] RLPx listener up self="enode://5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf@[::]:30303?discport=0"
INFO [07-12|13:12:47] IPC endpoint opened: /home/ethereum6/eth/database/geth.ipc
The interactive JavaScript console is started in another terminal with:
sudo geth attach ipc:$HOME/eth/database/geth.ipc
... which gives:
Welcome to the Geth JavaScript console!
instance: Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1
coinbase: 0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540
at block: 0 (Thu, 01 Jan 1970 03:00:00 MSK)
datadir: /home/ethereum6/eth/database
modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
>
Etherbase is set on all nodes with miner.setEtherbase(personal.listAccounts[0]). Each node only has one account. (3 nodes, 3 accounts)
> eth.accounts
["0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540"]
> personal.listAccounts
["0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540"]
>
Calling admin.nodeInfo gives:
> admin.nodeInfo
{
enode: "enode://5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf#[::]:30303?discport=0",
id: "5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf",
ip: "::",
listenAddr: "[::]:30303",
name: "Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1",
ports: {
discovery: 0,
listener: 30303
},
protocols: {
eth: {
difficulty: 17179869184,
genesis: "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3",
head: "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3",
network: 15
}
}
}
>
The nodes are connected with admin.addPeer(..) such that each node shows two peers when calling admin.peers.
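For reference, adding a peer from the console looks something like this (a sketch that reuses the enode string from admin.nodeInfo; the IP address is a placeholder that must be replaced with the node's actual reachable address, since geth reports [::] there):
// On another node's console, substituting the real IP of this node for [::]
admin.addPeer("enode://5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf@192.168.0.10:30303")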
When I start mining with miner.start(), this is the output that I receive in the interactive js console:
> miner.start()
null
>
... and in the other terminal running the node:
INFO [07-12|13:16:34] Updated mining threads threads=0
INFO [07-12|13:16:34] Transaction pool price threshold updated price=18000000000
INFO [07-12|13:16:34] Starting mining operation
INFO [07-12|13:16:34] Commit new mining work number=1 txs=0 uncles=0 elapsed=749.279µs
After that nothing happens and the balance on all accounts is still 0 when checking with eth.getBalance(eth.accounts[0]).
What options do I have to try and get the nodes on the private network to start mining ether?
Why does the preallocation of ether not work in CustomGenesis.json?
Was the difficulty provided in CustomGenesis.json ignored? admin.nodeInfo showed a different number.
All comments and suggestions are welcome, thanks!
You probably set the genesis difficulty so high that your CPU miners don't have a chance of finding a block. You probably want to set the difficulty to something more reasonable, such as roughly 1 million (0x100000 in hex).
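In CustomGenesis.json that would mean lowering the difficulty field, for example (a sketch; the exact value is a judgement call):
"difficulty": "0x100000",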
Ok, I'll provide what input I have (bear in mind that I am new as well, so we are in the same boat!)
The part I am confident in, is the whole balance part, so:
1. Make a new account (on whichever node): personal.newAccount("password")
2. Set the new account to be the coinbase of this node: miner.setEtherbase(eth.accounts[0])
3. Start the mining: miner.start()
Then, you can check the balance while you are mining. Try:
web3.fromWei(eth.getBalance(eth.coinbase), "ether")
The problem was apparently the way the genesis block was initialized.
The Wrong Way
By calling geth with init and the other command-line arguments:
geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain * --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet init ./genesis/CustomGenesis.json
the node is started with the mainnet genesis block:
...
INFO [07-12|13:12:46] Writing default main-net genesis block
...
and after that, nothing else can work as expected.
The Solution
Call geth with init and --datadir arguments only:
geth --datadir /path/to/database init /path/to/CustomGenesis.json
A short output is given and geth immediately exits when the initialization is finished:
INFO [07-13|10:30:49] Allocated cache and file handles database=/path/to/database/geth/chaindata cache=16 handles=16
INFO [07-13|10:30:49] Writing custom genesis block
INFO [07-13|10:30:49] Successfully wrote genesis state database=chaindata hash=ed4e11…f40ac3
INFO [07-13|10:30:49] Allocated cache and file handles database=/path/to/database/geth/lightchaindata cache=16 handles=16
INFO [07-13|10:30:49] Writing custom genesis block
INFO [07-13|10:30:49] Successfully wrote genesis state database=lightchaindata hash=ed4e11…f40ac3
and after this, everything else works as expected.
Big thanks to Péter for helping me figure this out!
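For completeness, after the separate init step the node would presumably be started with the original flags minus init, along these lines (a sketch based on the command shown in the question, with the CORS wildcard quoted so the shell does not expand it):
sudo geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain "*" --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet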

Can't attach gdbserver to process through kubectl

It looks like I have some sort of permissions problem with kubectl. I have a Docker image that contains a server with a native dynamic library plus gdbserver. When I debug the Docker container running on my local machine, everything works. I'm using the following workflow:
start gdb
target remote | docker exec -i CONTAINER gdbserver - --attach PID
set sysroot /path/to/local/binary
Good to go!
But when I try the same operation with kubectl, I get the following error:
Cannot attach to lwp 7: Operation not permitted (1)
Exiting
Remote connection closed
The only difference is step 2:
target remote | kubectl exec -i POD -- gdbserver - --attach PID
I think you might need to add ptrace() capabilities and a seccomp profile in your YAML file, the equivalent of Docker's:
--cap-add sys_ptrace
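In a pod spec this would translate to something like the following securityContext (a sketch; the pod, container, and image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: debuggable-server            # placeholder name
spec:
  containers:
  - name: server                     # placeholder name
    image: my-registry/server:latest # placeholder image
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]          # allows gdbserver to attach to processes in the container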