AWS Replication agent problem when launching - amazon-web-services

I am trying to launch aws replication agent in a CENTOS 8.3 and always returns me an error during the process of replication agent installation ( python3 aws-replication-installer-init.py ......)
The output of the process shows me:
The installation of the AWS Replication Agent has started.
Identifying volumes for replication.
Identified volume for replication: /dev/sdb of size 7 GiB
Identified volume for replication: /dev/sda of size 11 GiB
All volumes for replication were successfully identified.
Downloading the AWS Replication Agent onto the source server... Finished.
Installing the AWS Replication Agent onto the source server...
Error: Failed Installing the AWS Replication Agent
Installation failed.
If i check the aws_replication_agent_installer.log i can see that appears messages like:
make -C /lib/modules/4.18.0-348.2.1.el8_5.x86_64/build M=/tmp/tmp8mdbz3st/AgentDriver modules
.....................
retcode: 0
Build essentials returned with code None
--- Building software
running: 'which zypper'
retcode: 256
running: 'make'
retcode: 0
running: 'chmod 0770 ./aws-replication-driver-commander'
retcode: 0
running: '/sbin/rmmod aws-replication-driver'
retcode: 256
running: '/sbin/insmod ./aws-replication-driver.ko'
retcode: 256
running: '/sbin/rmmod aws-replication-driver'
retcode: 256
Cannot insert module. Try 0.
running: '/sbin/rmmod aws-replication-driver'
retcode: 256
running: '/sbin/insmod ./aws-replication-driver.ko'
retcode: 256
running: '/sbin/rmmod aws-replication-driver'
retcode: 256
Cannot insert module. Try 1.
............
Cannot insert module. Try 9.
Installation returned with code 2
Installation failed due to unspecified error:
stderr: sh: /var/lib/aws-replication-agent/stopAgent.sh: No such file or directory
which: no zypper in (/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/usr/sbin)
which: no apt-get in (/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/usr/sbin)
which: no zypper in (/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/usr/sbin)
rmmod: ERROR: Module aws_replication_driver is not currently loaded
insmod: ERROR: could not insert module ./aws-replication-driver.ko: Required key not available
rmmod: ERROR: Module aws_replication_driver is not currently loaded
Any issue of the error?

Launching with the command:
mokutil --disable-validation
will allow to change kernel modules (next boot will confirm it introducing password that must be entered afet command mokutil)

Related

failed to start 'instance-controller' service on EMR master node

I started observing below validation error on EMR console,
Upon checking the status of the instance controller service, observed that
sudo systemctl status instnace-controller.service output is not consistent, it varies between running and auto-restart.
Master node system logs shows;
(console) 2023-02-03 21:55:23 About to start instance controller.
(console) 2023-02-03 21:55:23 Listing currently running instance controllers:
hadoop 8439 1 0 21:55 ? 00:00:00 /bin/bash -l /usr/bin/instance-controller
hadoop 8510 8439 0 21:55 ? 00:00:00 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.instancecontroller.Main
hadoop 8541 8439 0 21:55 ? 00:00:00 grep -i instance
root 8542 8439 0 21:55 ? 00:00:00 sudo tee -a /emr/instance-state/console.log-2023-02-03-21-55 /dev/console
oozie 26477 1 26 21:53 ? 00:00:22 /etc/alternatives/jre/bin/java -Xmx1024m -Xmx1024m -Doozie.home.dir=/usr/lib/oozie -Doozie.config.dir=/etc/oozie/conf -Doozie.log.dir=/var/log/oozie -Doozie.data.dir=/var/lib/oozie -Doozie.instance.id=ip-10-111-24-159.pvt.lp192.cazena.com -Doozie.config.file=oozie-site.xml -Doozie.log4j.file=oozie-log4j.properties -Doozie.log4j.reload=10 -Djava.library.path= -cp /usr/lib/oozie/embedded-oozie-server/*:/usr/lib/oozie/embedded-oozie-server/dependency/*:/usr/lib/oozie/lib/*:/usr/lib/oozie/libtools/*:/usr/lib/oozie/libext/*:/usr/lib/oozie/embedded-oozie-server:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/conf/*:/usr/share/aws/emr/emrfs/auxlib/* org.apache.oozie.server.EmbeddedOozieServer
root 27455 1 42 21:54 ? 00:00:35 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.logpusher.Main /etc/logpusher/logpusher.properties
(console) 2023-02-03 21:55:23 Displaying last 10 lines of instance controller logfile:
2023-02-03 21:55:17,719 INFO main: isV2FrameworkEnabled: false, extraInstanceData.numCandidates: 1
2023-02-03 21:55:17,735 WARN main: Invalid metrics information null fetched from checkpoint, will start continuing from current moment instead.
2023-02-03 21:55:17,735 INFO main: Initialized YARN checkpointing state with ckpFileAvl: true, ckpInfo: [ lastCkpTs(0), totalHdfsBytesReadCompletedApps(0), totalHdfsBytesWrittenCompletedApps(0), totalS3BytesReadCompletedApps(0), totalS3BytesWrittenCompletedApps(0)]
2023-02-03 21:55:17,745 ERROR main: Thread + 'main' failed with error
java.lang.RuntimeException: LocalStartupState is FAILED, so not allowing instance controller to start
at aws157.instancecontroller.common.InstanceConfigurator.hasAlreadyBeenConfigured(InstanceConfigurator.java:124)
at aws157.instancecontroller.common.InstanceConfigurator.<init>(InstanceConfigurator.java:100)
at aws157.instancecontroller.InstanceController.<init>(InstanceController.java:223)
at aws157.instancecontroller.Main.runV1Framework(Main.java:239)
at aws157.instancecontroller.Main.main(Main.java:222)
I tried restarting service multiple time with
sudo systemctl start instance-controller.service, rebooted the node hoping that service will start back after reboot. But it is not working. (Btw, this worked on lower environment)
Jobs on the cluster are running fine though without any issues, but I am not able to see application logs pushed to S3 or on console.
Need inputs on how to restart instance controller service.

Subprocess Error when Launching ec2 cluster instance

I am getting subprocess error when launching ec2 cluster instance.
The terminal lags on
Waiting for cluster to enter 'ssh-ready' state'
when running
./spark-ec2 --key-pair=ru_spark --identity-file=ru_spark.pem --region=us-east-1 --zone=us-east-1a launch mycluster
Console:
Warning: Permanently added 'ec2-52-87-225-32.compute-1.amazonaws.com,52.87.225.32' (RSA) to the list of known hosts.
Connection to ec2-52-87-225-32.compute-1.amazonaws.com closed.
Warning: Permanently added 'ec2-52-87-225-32.compute-1.amazonaws.com,52.87.225.32' (RSA) to the list of known hosts.
Transferring cluster's SSH key to slaves...
ec2-34-207-153-79.compute-1.amazonaws.com
Warning: Permanently added 'ec2-34-207-153-79.compute-1.amazonaws.com,34.207.153.79' (RSA) to the list of known hosts.
Cloning spark-ec2 scripts from https://github.com/amplab/spark-ec2/tree/branch-1.6 on master...
Warning: Permanently added 'ec2-52-87-225-32.compute-1.amazonaws.com,52.87.225.32' (RSA) to the list of known hosts.
Cloning into 'spark-ec2'...
error: Peer reports incompatible or unsupported protocol version. while accessing https://github.com/amplab/spark-ec2/info/refs?service=git-upload-pack
fatal: HTTP request failed
Connection to ec2-52-87-225-32.compute-1.amazonaws.com closed.
Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', '-i', 'ru_spark.pem', '-t', '-t', u'root#ec2-52-87-225-32.compute-1.amazonaws.com', 'rm -rf spark-ec2 && git clone https://github.com/amplab/spark-ec2 -b branch-1.6 spark-ec2']' returned non-zero exit status 128
I updated to curl ssl, changed file permissions to 400 and 600 for ru_spark.pem but neither have helped solve the issue.

Not able to access HDFS

I installed cloudera vm and started trying some basic stuff. First I just wanted to ls the hdfs directoires. so I issued the below command.
[cloudera#quickstart ~]$ hadoop fs -ls /
ls: Failed on local exception: java.net.SocketException: Network is unreachable; Host Details : local host is: "quickstart.cloudera/10.0.2.15"; destination host is: "quickstart.cloudera":8020;
though ps -fu hdfs says both namenode and data node is running. I checked the status using the service command.
[cloudera#quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
Thinking all the problems will be resolved if I restart all the services, I executed the below command.
[cloudera#quickstart conf]$ sudo /home/cloudera/cloudera-manager --express --force
[QuickStart] Shutting down CDH services via init scripts...
[QuickStart] Disabling CDH services on boot...
[QuickStart] Starting Cloudera Manager daemons...
[QuickStart] Waiting for Cloudera Manager API...
[QuickStart] Configuring deployment...
Submitted jobs: 92
[QuickStart] Deploying client configuration...
Submitted jobs: 93
[QuickStart] Starting Cloudera Management Service...
Submitted jobs: 101
[QuickStart] Enabling Cloudera Manager daemons on boot...
Now I thought all services will be up so again checked the status of namenode service. Again it came failed.
[cloudera#quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
Now I decided to manually stop and start the namenode service. Again not much use.
[cloudera#quickstart ~]$ sudo service hadoop-hdfs-namenode stop
no namenode to stop
Stopped Hadoop namenode: [ OK ]
[cloudera#quickstart ~]$ sudo service hadoop-hdfs-namenode status
Hadoop namenode is not running [FAILED]
[cloudera#quickstart ~]$ sudo service hadoop-hdfs-namenode start
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out
Failed to start Hadoop namenode. Return value: 1 [FAILED]
I checked the file /var/log/hadoop-hdfs/hadoop-hdfs-namenode-quickstart.cloudera.out . It just said below
log4j:ERROR Could not find value for key log4j.appender.RFA
log4j:ERROR Could not instantiate appender named "RFA".
I also checked /var/log/hadoop-hdfs/hadoop-cmf-hdfs-NAMENODE-quickstart.cloudera.log.out . Found below when I searched for error. Can anyone please suggest me what is the best way to get the services back on track. Unfortunately I am not able to access cloudera manager from browser. Anything that I can do from command line?
2016-02-24 21:02:48,105 WARN com.cloudera.cmf.event.publish.EventStorePublisherWithRetry: Failed to publish event: SimpleEvent{attributes={ROLE_TYPE=[NAMENODE], CATEGORY=[LOG_MESSAGE], ROLE=[hdfs-NAMENODE], SEVERITY=[IMPORTANT], SERVICE=[hdfs], HOST_IDS=[quickstart.cloudera], SERVICE_TYPE=[HDFS], LOG_LEVEL=[WARN], HOSTS=[quickstart.cloudera], EVENTCODE=[EV_LOG_EVENT]}, content=Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!, timestamp=1456295437905} - 1 of 17 failure(s) in last 79302s
java.io.IOException: Error connecting to quickstart.cloudera/10.0.2.15:7184
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:249)
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:198)
at com.cloudera.cmf.event.shaded.org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:133)
at com.cloudera.cmf.event.publish.AvroEventStorePublishProxy.checkSpecificRequestor(AvroEventStorePublishProxy.java:122)
at com.cloudera.cmf.event.publish.AvroEventStorePublishProxy.publishEvent(AvroEventStorePublishProxy.java:196)
at com.cloudera.cmf.event.publish.EventStorePublisherWithRetry$PublishEventTask.run(EventStorePublisherWithRetry.java:242)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Network is unreachable
You can try this:
check witch process is using the port 7184 of namenode (i.e netstat linux command)
and kill that and then restart
Or
change you namenode port from conf and restart hadoop

Why am I getting a permission denied error on docker/aws eb?

I don't know why but I cannot seem to figure out why this is happening. I can build and run the docker image locally.
Recent Events:
2015-05-25 12:57:07 UTC+1000 ERROR Update environment operation is complete, but with errors. For more information, see troubleshooting documentation.
2015-05-25 12:57:07 UTC+1000 INFO New application version was deployed to running EC2 instances.
2015-05-25 12:57:04 UTC+1000 INFO Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
2015-05-25 12:57:04 UTC+1000 ERROR [Instance: i-4775ec9b] Command failed on instance. Return code: 1 Output: (TRUNCATED)... run Docker container: vel="fatal" msg="Error response from daemon: Cannot start container 02c057b331bf3a3d912bf064f1dca3e00c95746b5748c3c4a28a5c6b452ff335: [8] System error: exec: \"bin/app\": permission denied" . Check snapshot logs for details. Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/04run.sh failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
2015-05-25 12:57:03 UTC+1000 ERROR Failed to run Docker container: vel="fatal" msg="Error response from daemon: Cannot start container 02c057b331bf3a3d912bf064f1dca3e00c95746b5748c3c4a28a5c6b452ff335: [8] System error: exec: \"bin/app\": permission denied" . Check snapshot logs for details.
Dockerfile:
FROM java:8u45-jre
MAINTAINER Terence Munro <terry#zenkey.com.au>
ADD ["opt", "/opt"]
WORKDIR /opt/docker
RUN ["chown", "-R", "daemon:daemon", "."]
USER daemon
ENTRYPOINT ["bin/app"]
EXPOSE 9000
Dockerrun.aws.json:
{
"AWSEBDockerrunVersion": "1",
"Ports": [
{
"ContainerPort": "9000"
}
],
"Volumes": []
}
Additional logs as attachment at: https://forums.aws.amazon.com/thread.jspa?threadID=181270
Any help is extremely appreciated.
#nick-humrich suggestion of trying eb local run worked. So using eb deploy ended up working.
I had previously been uploading through the web interface.
Initially using eb deploy was giving me a ERROR: TypeError :: data must be a byte string but I found this issue which was resolved by uninstalling pyopenssl.
So I don't know why the web interface was giving me permission denied perhaps something to do with the zip file?
But anyway I'm able to deploy now thank you.
I had a similar problem running Docker on Elastic Beanstalk. When I pointed CMD in the Dockerfile to a shell script (/path/to/my_script.sh), the EB deployment would fail with
/path/to/my_script.sh: Permission denied.
Apparently, even though I had run RUN chmod +x /path/to/my_script.sh during the Docker build, by the time the image was run, the permissions had been changed. Eventually, to make it work I settled on:
CMD ["/bin/bash","-c","chmod +x /path/to/my_script.sh && /path/to/my_script.sh"]

Spark cluster fails after running and no exception thrown

I'm trying to run a stand-alone Spark application on EC2 Yarn command line. I'm submitting the following spark-submit script:
./bin/spark-submit --class PageRankGraphX --master yarn-cluster --properties-file spark-defaults.conf.2 --executor-memory 2G --total-executor-cores 5 ./SparkPageRank-assembly-1.0.jar s3://linkfilefull/full/links_small.txt s3://conansoutputbucket/smalloutput.txt 10 0.15 2
This is the output - there is no exception or error thrown, the job simply fails after running:
15/04/15 21:27:03 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/
appUser: hadoop
15/04/15 21:27:04 INFO yarn.Client: Application report from ASM:
application identifier: application_1429126831428_0027
appId: 27
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-172-31-1-67.eu-west-1.compute.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1429133214320
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://172.31.10.227:9046/proxy/application_1429126831428_0027/A
appUser: hadoop
Does anyone know what could be causing this or how I could investigate? When I try to access the yarn logs, it says logs are disabled or not ready.
Check out Amazon's documentation on enabling access to the web UI of Hadoop. Once in the UI, you can check the stderr output for the application, where the exception will most likely be. As others mentioned, this log will also be available on S3.