Django && hadoop

I just want to access HDFS from the web with Django, so I am using hadoopy. In views.py I wrote:
from django.http import HttpResponse
import hadoopy
def list(request):
    return HttpResponse(hadoopy.ls("."))
but something is wrong. The error information says: "IOError at /list/ Ran[hadoop fs -ls .]: /bin/sh: 1: hadoop: not found". I think the "hadoop" command can't be resolved by the shell, but I don't know what to do.

The hadoopy library you're attempting to use is simply a wrapper over the existing Apache Hadoop shell scripts (the hadoop, hdfs, mapred, etc. commands), so those must be installed and available on the PATH environment variable of the OS user or application process. When you call hadoopy.ls(…), it shells out to a hadoop fs -ls <path> command.
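A minimal workaround sketch, assuming Hadoop is installed under /usr/local/hadoop (adjust the path to your installation): prepend its bin directory to the PATH seen by the Django process, e.g. near the top of views.py before hadoopy is called.
import os
# Assumed install location; point this at wherever the `hadoop` binary actually lives.
HADOOP_BIN = "/usr/local/hadoop/bin"
# Prepend it so the /bin/sh spawned by hadoopy can resolve the `hadoop` command.
os.environ["PATH"] = HADOOP_BIN + os.pathsep + os.environ.get("PATH", "")
Alternatively, export PATH in the environment of whatever serves the Django app (the dev server, gunicorn, mod_wsgi, etc.), which keeps the configuration out of the code.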

Related

What is the correct way of installing a JDBC driver on EMR for Sqoop to use?

I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually by creating an EMR instance with Sqoop installed via the EMR Console.
Here are the preliminary steps I performed in order to run Sqoop on EMR:
Download the JDBC Driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to successfully run a Sqoop import when SSH'd into an EMR cluster after running these commands:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
When I try to run these commands from an EMR bootstrap script, however, I get the error:
usr/lib/sqoop/lib/ No such file or directory
After doing some investigation I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as noted in the EMR documentation.
So the /usr/lib/sqoop/lib directory doesn't exist when my bootstrap steps run.
Here are some solutions that work, but they feel like workarounds:
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step. (It turns out this is the correct approach; see the accepted answer below.)
What is the correct way of installing this JDBC driver on EMR?
The second option is the correct way to do it. The documentation explains how to run bash scripts as an EMR step.
You can also use command-runner.jar as the step's JAR, with the arguments:
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"
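As a rough sketch, the same step could also be submitted programmatically with boto3 (the region, cluster ID, and step name below are placeholders):
import boto3
emr = boto3.client("emr", region_name="us-east-1")  # assumed region
# Add a step that downloads the JDBC driver and moves it into Sqoop's lib directory.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "install-mssql-jdbc",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "bash", "-c",
                "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar; "
                "sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/",
            ],
        },
    }],
)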

Unable to execute a step on a running EMR

I have an EMR cluster 5.28.1 running in AWS, but I forgot to install some Python libraries as part of the bootstrap action. Now that the cluster is running, I was simply attempting to add a step via the EMR console. Here are my settings:
JAR: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://xxxx/install_python_libraries.sh
Unfortunately, I get the following error:
Cannot run program "s3://xxxxx/install_python_libraries.sh" (in directory "."): error=2, No such file or directory
I am not sure what I am doing wrong. The shell script looks like this:
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip-3.6 install boto3
sudo pip-3.6 install xmltodict
I also tried this by simply using 'command-runner.jar', but I get the same error. Can you please help me figure out the problem so I can do this via the console? I would like to install the libraries on all nodes - master and core.
Thanks
The issue is the xxx.sh file's EOL/carriage-return type.
In other words, if it uses the Windows type ("\r\n"), the step will not work and returns the ./ file-not-found error.
Convert it to the Unix type ("\n") using something like Notepad++ and it will run fine.
(In Notepad++: Edit > EOL Conversion > Unix (LF), hit save, and try again.)
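If you prefer to fix the line endings from a script rather than an editor, a small Python sketch (assuming the local copy is named install_python_libraries.sh before you upload it to S3):
# Rewrite the script with Unix (LF) line endings before uploading it to S3.
path = "install_python_libraries.sh"
with open(path, "rb") as f:
    data = f.read()
with open(path, "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))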

Spark cluster on AWS EMR can't find spark-env.sh

I am playing with Apache Spark on AWS EMR and trying to set the cluster to use Python 3.
I use this command as the last command in a bootstrap script:
sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
When I use it, the cluster crashes during the bootstrap with the following error:
sed: can't read /etc/spark/conf/spark-env.sh: No such file or directory
How should I set it to use python3 properly?
This is not a duplicate of the linked question: my issue is that the cluster does not find the spark-env.sh file while bootstrapping, while the other question addresses the system not finding python3.
In the end I did not use that script, but used the EMR configuration file that is available at the creation stage. It gave me the proper configuration via spark-submit (in the AWS GUI). If you need it to be available for pyspark scripts in a more programmatic way, you can use os.environ in the Python script to set the PySpark Python version.
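A minimal sketch of the os.environ approach (the /usr/bin/python3 path is an assumption; the variable must be set before the SparkSession/SparkContext is created):
import os
# Must be set before the SparkContext is created so the executors pick it up.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("py3-example").getOrCreate()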

Unable to copy from local file system to hdfs

I am trying to copy a text file from my Mac desktop to HDFS. For that purpose I am using this command:
hadoop fs -copyFromLocal Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
But it is throwing an error:
copyFromLocal: `deckofcards.txt': No such file or directory
It definitely exists on the desktop.
Your command is missing a leading slash / in the source file path. It should be:
hadoop fs -copyFromLocal /Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
Or, using the more common hdfs dfs -put form:
hdfs dfs -put /Users/Vishnu/Desktop/deckofcards.txt /user/gsaikiran/cards1
Also, if you are dealing with HDFS specifically, it is better to use the hdfs dfs syntax instead of hadoop fs [1]. (It doesn't change the output in your case, but hdfs dfs is intended specifically for interacting with HDFS, whereas hadoop fs is a generic entry point for any supported file system; the older hadoop dfs form is the one that is actually deprecated.)

Hadoop dfsadmin -report command is not working in MapR

I need to get the DFS report of the MapR cluster, but when I execute the following command I get an error:
hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
report: FileSystem maprfs:/// is not an HDFS file system
Usage: java DFSAdmin [-report] [-live] [-dead] [-decommissioning]
Is there any way to do it in MapR?
I tried this link as well, but it didn't provide the needed information.
Try the commands below:
maprcli node list
maprcli dashboard info