I am using AWS EMR-6.5.0 with Hadoop-3.2.1
I'm following this guide to launch the stream job: https://levelup.gitconnected.com/map-reduce-with-python-hadoop-on-aws-emr-341bdd07b804
When I run the command:
$ hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input books-input -output books-output
I get the error:
ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs://ip-172-31-55-89.ec2.internal/172.31.55.89:8032/user/hadoop/books-input/1340.txt
Streaming Command Failed!
Complete log:
2022-08-26 15:55:12,295 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-55-89.ec2.internal/172.31.55.89:8032
2022-08-26 15:55:12,592 INFO client.AHSProxy: Connecting to Application History server at ip-172-31-55-89.ec2.internal/172.31.55.89:8032
2022-08-26 15:55:12,653 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-55-89.ec2.internal/172.31.55.89:8032
2022-08-26 15:55:12,654 INFO client.AHSProxy: Connecting to Application History server at ip-172-31-55-89.ec2.internal/172.31.55.89:8032
2022-08-26 15:55:13,083 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1661529292338_0001
2022-08-26 15:55:14,507 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
2022-08-26 15:55:14,518 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 049362b7cf53ff5f739d6b1532457f2c6cd495e8]
2022-08-26 15:55:14,690 INFO mapred.FileInputFormat: Total input files to process : 49
2022-08-26 15:55:14,691 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1661529292338_0001
2022-08-26 15:55:14,769 ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs:/ip-172-31-55-89.ec2.internal/172.31.55.89:8032/user/hadoop/books-input/1340.txt
Streaming Command Failed!
I don't know why it says "not a file" for a .txt file.
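A hedged first check (an assumption, not something the thread confirms): FileInputFormat reports "Not a file" when an entry under the input path is a directory rather than a plain file, so it is worth listing books-input for nested directories and, if they are intentional, enabling recursive input globbing:
$ hdfs dfs -ls /user/hadoop/books-input    # any entry marked 'd' is a directory, not a file
$ hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapreduce.input.fileinputformat.input.dir.recursive=true \
    -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
    -input books-input -output books-output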
Related: I referred to this question, but my question is different.
Get VS Code Python extension to connect to Jupyter running on remote AWS EMR master node
Environment:
The Jupyter Notebook is on AWS EMR.
I access the notebook from the browser using a SOCKS5 proxy. To do so, I have to connect to my work VPN, SSH in with PuTTY using a .ppk file, and set up a tunnel (dynamic port forwarding).
To automate the step above, I am using Plink: c:\stuff\plink.exe -ssh -i c:\stuff\file.ppk -D XXXX user-name@<some_IP_address>
I can successfully enable proxy by using msedge.exe --proxy-server="socks5://<address>" or chrome.exe --proxy-server="socks5://<address>"
VS Code Version:
C:\Users\ablaze>code --version
1.55.2
3c4e3df9e89829dce27b7b5c24508306b151f30d
x64
Objective:
How can I access the remote Jupyter Notebook hosted on AWS EMR behind a proxy from Visual Studio Code?
I tried and failed:
Visual Studio Code is built on top of Electron, so it benefits from the networking stack capabilities of Chromium. I used Plink as above to SSH, then executed the following command in my Windows command prompt: C:\Users\ablaze>code --proxy-server="socks5://<address>" --verbose
Error message:
[17668:0505/104410.116:WARNING:dns_config_service_win.cc(692)] Failed to read DnsConfig.
[main 2021-05-05T14:44:10.224Z] Starting VS Code
[main 2021-05-05T14:44:10.224Z] from: c:\Users\ablaze\AppData\Local\Programs\Microsoft VS Code\resources\app
[main 2021-05-05T14:44:10.224Z] args: {
_: [],
...
'no-proxy-server': false,
'proxy-server': 'socks5://localhost:8088',
...
logsPath: 'C:\\Users\\ablaze\\AppData\\Roaming\\Code\\logs\\20210505T104410'
}
...
[main 2021-05-05T14:44:10.258Z] windowsManager#open pathsToOpen [
{
backupPath: 'C:\\Users\\ablaze\\AppData\\Roaming\\Code\\Backups\\1620224957079',
remoteAuthority: undefined
}
]
To double-check, I also opened the log file C:\Users\ablaze\AppData\Roaming\Code\logs\20210505T104410\main.log
...
[2021-05-05 10:44:40.345] [main] [trace] update#checkForUpdates, state = idle
[2021-05-05 10:44:40.345] [main] [info] update#setState checking for updates
[2021-05-05 10:44:40.345] [main] [trace] RequestService#request https://update.code.visualstudio.com/api/update/win32-x64-user/stable/3c4e3df9e89829dce27b7b5c24508306b151f30d
[2021-05-05 10:44:40.346] [main] [trace] resolveShellEnv(): skipped (Windows)
[2021-05-05 10:44:44.354] [main] [error] Error: net::ERR_PROXY_CONNECTION_FAILED
at SimpleURLLoaderWrapper.<anonymous> (electron/js2c/browser_init.js:109:6508)
at SimpleURLLoaderWrapper.emit (events.js:315:20)
[2021-05-05 10:44:44.354] [main] [info] update#setState idle
I think I am almost there.
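A hedged workaround sketch, not something the thread confirms: replace dynamic port forwarding with Plink local port forwarding, so the notebook shows up on localhost and VS Code needs no proxy flag at all (the port 8888 below is a placeholder for the actual Jupyter port):
REM forward local port 8888 to the Jupyter server on the EMR master node
c:\stuff\plink.exe -ssh -i c:\stuff\file.ppk -L 8888:localhost:8888 user-name@<some_IP_address>
With the tunnel up, the Python/Jupyter extension can be pointed at http://localhost:8888 via its "Specify local or remote Jupyter server for connections" command (the exact command name varies across extension versions).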
I created an AWS Elastic Beanstalk instance and added an Oracle RDS DB instance to it.
When I checked the log, I saw the driver was loaded, but it keeps saying the URL is invalid.
Here are my RDS info and the log messages.
[RDS Info]
Endpoint = aa1c9autjaqoufk.c2k1ch01futy.ap-northeast-2.rds.amazonaws.com
Port = 1521
Public Access = yes
[System Log]
25-Jun-2018 02:42:56.759 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["http-nio-8080"]
25-Jun-2018 02:42:56.787 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
25-Jun-2018 02:42:56.796 INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["ajp-nio-8009"]
25-Jun-2018 02:42:56.799 INFO [main] org.apache.tomcat.util.net.NioSelectorPool.getSharedSelector Using a shared selector for servlet write/read
25-Jun-2018 02:42:56.800 INFO [main] org.apache.catalina.startup.Catalina.load Initialization processed in 1366 ms
25-Jun-2018 02:42:56.842 INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service Catalina
25-Jun-2018 02:42:56.848 INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet Engine: Apache Tomcat/8.0.50
25-Jun-2018 02:42:56.872 INFO [localhost-startStop-1] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory /var/lib/tomcat8/webapps/ROOT
25-Jun-2018 02:42:58.613 INFO [localhost-startStop-1] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
25-Jun-2018 02:42:58.689 INFO [localhost-startStop-1] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory /var/lib/tomcat8/webapps/ROOT has finished in 1,817 ms
25-Jun-2018 02:42:58.693 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
25-Jun-2018 02:42:58.720 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["ajp-nio-8009"]
25-Jun-2018 02:42:58.736 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in 1935 ms
Loading driver...
Driver loaded!
jdbc:oracle:oci://aa1c9autjaqoufk.c2k1ch01futy.ap-northeast-2.rds.amazonaws.com:1521/ebdb?user=username&password=password
SQLException: Invalid Oracle URL specified
SQLState: 99999
VendorError: 17067
Closing the connection.
SQLException: Invalid Oracle URL specified
SQLState: 99999
VendorError: 17067
Closing the connection.
I included the ojdbc8 driver in my web project library and made a build.
Is this a driver issue? What am I doing wrong?
The message clearly says your URL is incorrect: jdbc:oracle:oci:// is not a valid form here (the oci driver also requires an Oracle client installed on the host, while the thin driver is pure Java).
It should be something like below.
import java.sql.Connection;
import java.sql.DriverManager;

// step 1: load the driver class
Class.forName("oracle.jdbc.driver.OracleDriver");

// step 2: create the connection object (thin URL form: @host:port:SID)
Connection con = DriverManager.getConnection(
        "jdbc:oracle:thin:@aa1c9autjaqoufk.c2k1ch01futy.ap-northeast-2.rds.amazonaws.com:1521:ebdb",
        "username", "password");
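One more hedged check, separate from the URL format (an assumption, not from the answer above): confirm the Beanstalk host can reach the RDS listener at all, since a security-group block fails differently from a malformed URL. Assuming nc is available on the instance:
# from the Beanstalk EC2 instance: does the Oracle listener port answer?
$ nc -vz aa1c9autjaqoufk.c2k1ch01futy.ap-northeast-2.rds.amazonaws.com 1521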
I have created a runnable jar file and am executing it in Hadoop, but I am not getting the output. The code itself works fine: I checked it in Eclipse with the Hadoop jar files added, and I got the output absolutely right.
hduser#Strawhats:~$ hadoop jar /home/hduser/Desktop/project.jar /user/hduser/input /user/hduser/output
17/02/20 19:18:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/20 19:18:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/02/20 19:18:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/02/20 19:18:05 INFO mapred.FileInputFormat: Total input paths to process : 1
17/02/20 19:18:05 INFO mapreduce.JobSubmitter: number of splits:2
17/02/20 19:18:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1487596891791_0003
17/02/20 19:18:06 INFO impl.YarnClientImpl: Submitted application application_1487596891791_0003
17/02/20 19:18:06 INFO mapreduce.Job: The url to track the job: http://Strawhats:8088/proxy/application_1487596891791_0003/
17/02/20 19:18:06 INFO mapreduce.Job: Running job: job_1487596891791_0003
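A hedged aside, since the thread leaves this one unanswered: a job that prints "Running job:" and then goes silent usually means YARN accepted the application but cannot schedule it (no live NodeManager, or not enough container memory). Some checks under that assumption:
# are any NodeManagers registered and RUNNING?
$ yarn node -list
# what state does YARN report for the stuck application?
$ yarn application -status application_1487596891791_0003
# the output directory must not already exist when the job is submitted
$ hdfs dfs -ls /user/hduser/output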
Since I don't have space for running CDH or a sandbox, I downloaded Hadoop 2.6.0 and the Hadoop streaming jar from here.
I ran the command of
bin/hadoop jar contrib/hadoop-streaming-2.6.0.jar \
-file ${HADOOP_HOME}/py_mapred/mapper.py -mapper ${HADOOP_HOME}/py_mapred/mapper.py \
-file ${HADOOP_HOME}/py_mapred/reducer.py -reducer ${HADOOP_HOME}/py_mapred/reducer.py \
-input /input/davinci/* -output /input/davinci-output
I stored the downloaded streaming jar in ${HADOOP_HOME}/contrib and the other files in py_mapred, and I used copyFromLocal to put the input into the /input directory on HDFS. Now, when I run the command, the following lines show up:
15/08/14 17:35:45 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
15/08/14 17:35:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/usr/local/cellar/hadoop/2.6.0/py_mapred/mapper.py, /usr/local/cellar/hadoop/2.6.0/py_mapred/reducer.py, /var/folders/c5/4xfj65v15g91f71c_b9whnpr0000gn/T/hadoop-unjar3313567263260134566/] [] /var/folders/c5/4xfj65v15g91f71c_b9whnpr0000gn/T/streamjob9165494241574343777.jar tmpDir=null
15/08/14 17:35:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/08/14 17:35:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/08/14 17:35:48 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/14 17:35:48 INFO mapreduce.JobSubmitter: number of splits:2
15/08/14 17:35:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439538212023_0002
15/08/14 17:35:49 INFO impl.YarnClientImpl: Submitted application application_1439538212023_0002
15/08/14 17:35:49 INFO mapreduce.Job: The url to track the job: http://Jonathans-MacBook-Pro.local:8088/proxy/application_1439538212023_0002/
15/08/14 17:35:49 INFO mapreduce.Job: Running job: job_1439538212023_0002
It looks like the command has been accepted. I checked localhost:8088 and the job does register. However, it's not running, despite the line Running job: job_1439538212023_0002. Is there something wrong with my command? Is it due to a permission setting? Why isn't the job running?
Thank you
Here is the right way to run streaming:
bin/hadoop jar contrib/hadoop-streaming-2.6.0.jar \
    -file ${HADOOP_HOME}/py_mapred/mapper.py -mapper '/usr/bin/python mapper.py' \
    -file ${HADOOP_HOME}/py_mapred/reducer.py -reducer '/usr/bin/python reducer.py' \
    -input /input/davinci/* -output /input/davinci-output
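An alternative that should be equivalent (an assumption, not part of the answer above): keep -mapper mapper.py and make the scripts self-executing, so the task can exec them without naming the interpreter:
# each script's first line should be a shebang such as #!/usr/bin/env python
$ head -1 ${HADOOP_HOME}/py_mapred/mapper.py
# mark both scripts executable before submitting the job
$ chmod +x ${HADOOP_HOME}/py_mapred/mapper.py ${HADOOP_HOME}/py_mapred/reducer.py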
I am trying to run the wordcount example in C++, on Hadoop 1.0.4, on Ubuntu 12.04, but I am getting the following error:
Command:
hadoop pipes -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input bin/input.txt -output bin/output.txt -program bin/wordcount
Error message:
13/06/14 13:50:11 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/14 13:50:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/14 13:50:11 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/14 13:50:11 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/14 13:50:11 INFO mapred.JobClient: Running job: job_201306141334_0003
13/06/14 13:50:12 INFO mapred.JobClient:  map 0% reduce 0%
13/06/14 13:50:24 INFO mapred.JobClient: Task Id : attempt_201306141334_0003_m_000000_0, Status : FAILED
java.io.IOException
    at org.apache.hadoop.mapred.pipes.OutputHandler.waitForAuthentication(OutputHandler.java:188)
    at org.apache.hadoop.mapred.pipes.Application.waitForAuthentication(Application.java:194)
    at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:149)
    at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:71)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201306141334_0003_m_000000_0: Server failed to authenticate. Exiting
13/06/14 13:50:24 INFO mapred.JobClient: Task Id : attempt_201306141334_0003_m_000001_0, Status : FAILED
I didn't find any solution, and I've been trying for quite a while to make it work.
I appreciate your help. Thanks.
I found this SO question (hadoop not running in the multinode cluster) where the user got similar errors; per the top answer, the cause was that they did not set a class on the job. That was Java, however.
I found this tutorial about running the C++ wordcount example in Hadoop. Hopefully this helps you out.
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop
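Two hedged checks that may help narrow this down (assumptions, not from the answers above): hadoop pipes resolves -program against HDFS, and the waitForAuthentication IOException can appear when the C++ binary dies before the handshake completes, so it is worth confirming the binary exists on HDFS and starts at all locally:
# -program is resolved against HDFS, not the local filesystem
$ hadoop fs -ls bin/wordcount
# a binary that crashes immediately (missing shared library, wrong
# architecture) also kills the pipes handshake
$ ./bin/wordcount < /dev/null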