Hadoop commands from python script?

I have multiple Hadoop commands to run, and they are invoked from a Python script. So far I have tried the following:
import os
import xml.etree.ElementTree as etree
import subprocess
filename = "sample.xml"
__currentlocation__ = os.getcwd()
__fullpath__ = os.path.join(__currentlocation__,filename)
tree = etree.parse(__fullpath__)
root = tree.getroot()
hivetable = root.find("hivetable").text
dburl = root.find("dburl").text
username = root.find("username").text
password = root.find("password").text
tablename = root.find("tablename").text
mappers = root.find("mappers").text
targetdir = root.find("targetdir").text
print hivetable
print dburl
print username
print password
print tablename
print mappers
print targetdir
p = subprocess.call(['hadoop','fs','-rmr',targetdir],stdout = subprocess.PIPE, stderr = subprocess.PIPE)
However, the code is not working. It neither throws an error nor deletes the directory.

I suggest changing your approach slightly; this is how I do it. I use the Python standard library module commands (https://docs.python.org/2/library/commands.html); how you use it depends on your needs.
Here is a small demo:
import commands as com
print com.getoutput('hadoop fs -ls /')
This gives output like the following (depending on what is in the HDFS directory):
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/hadoop-env.sh: line 25: /Library/Java/JavaVirtualMachines/jdk1.8.0_112.jdk/Contents/Home: Is a directory
Found 2 items
drwxr-xr-x - someone supergroup 0 2017-03-29 13:48 /hdfs_dir_1
drwxr-xr-x - someone supergroup 0 2017-03-24 13:42 /hdfs_dir_2
Note: the commands module does not exist in Python 3 (it was removed); I'm using Python 2.7.
Note: Be aware of the limitations of commands.
If you use subprocess instead, which replaces commands and is the option available in Python 3, you will need a proper way to deal with your pipes. I found this discussion useful in that respect: (subprocess popen to run commands (HDFS/hadoop))
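As a rough sketch of the subprocess route: subprocess.call with stdout=subprocess.PIPE and stderr=subprocess.PIPE only returns the exit code, and nothing reads the pipes, so any error message from hadoop is never shown. The helper name run_command below is hypothetical (it is not part of the original script); it uses Popen.communicate() to collect the exit code, standard output and standard error so that a failure becomes visible.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of arguments.

    Returns a tuple (exit_code, stdout_text, stderr_text).
    """
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text instead of bytes
    )
    out, err = proc.communicate()  # reads both pipes and waits for exit
    return proc.returncode, out, err


# Example call (assumes the hadoop executable is on PATH and that
# targetdir was read from the XML file as in the question):
#   code, out, err = run_command(['hadoop', 'fs', '-rmr', targetdir])
#   if code != 0:
#       print('hadoop failed with exit code {0}: {1}'.format(code, err))
```

Printing err when the exit code is non-zero should show why the directory was not deleted.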
I hope this suggestion helps you!
Best

Related

PYTHON: How To Print SSH Key Using Python

I've been trying to make a script that can print the Ubuntu SSH key located in ~/.ssh/authorised_keys/
Basically I want the script to print out exactly what cat ~/.ssh/authorised_keys/ would output.
I have tried using subprocess.check_output but it always returns an error.
Thanks

What about this?
import os
os.system('cat ~/.ssh/authorized_keys')

If you want to capture the output in a variable, use subprocess. If not, you can use os.system, as user803422 has mentioned:
import os, subprocess
path = '~/.ssh/authorized_keys'
cmd = 'cat ' + os.path.expanduser(path)
output = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
response = output.communicate()
print (response)

You can read the file directly in Python; there is no real need to use subprocess:
import os
print(open(os.path.expanduser('~/.ssh/authorized_keys')).read())

Running locally blastn against nt db thru python script

I have a FASTA file with sequences that I want to BLAST locally against the 'nt' database, which I downloaded to my computer from the NCBI website.
I downloaded BLAST 2.6.0.
In order to access blast from anywhere, I did:
gedit ~/.bashrc
export PATH=/usr/local/ncbi-blast-2.6.0+/bin:$PATH
then I did:
source ~/.bashrc
Then I downloaded 'nt' database (155.6GB) and stored it in /usr/local/blastdb
I want to run in python script this command:
from Bio.Blast.Applications import NcbiblastnCommandline
cline = NcbiblastnCommandline(query="/home/proprietaire/Desktop/JADE/stage_scripts/seq_error_fasta.fasta", db="/usr/local/blastdb/nt", evalue=0.001, out="blast_result_local.xml", outfmt=5)
But it is not working for some reason. Please help me figure out what I'm doing wrong. Thank you for your help.
EDIT:
'seq_error_fasta.fasta' is my FASTA file with 64 sequences that I want to BLAST against the 'nt' database.
It contains sequences with error characters such as S, J and X, so I want to BLAST them against the 'nt' db in order to get the closest correct sequence.
I found out that I need to format the nt database downloaded from NCBI, so I did this:
makeblastdb -dbtype nucl -in nt
Then I added this after my cline variable in my python script:
stdout, stderr = cline()
The script is running but unfortunately I'm getting this error now:
Bus error (core dumped)
I think it's a RAM problem, so I thought I should shorten the 'nt' db by keeping only the bacterial sequences. I looked on NCBI for a bacteria-only database, but there are multiple databases for different species, more than a thousand of them.
I also tried blast online using this script:
from Bio import SeqIO
from Bio.Blast import NCBIWWW
f = open('output_blast.xml','w')
for rec in SeqIO.parse(open("seq_error_fasta.fasta"), 'fasta'):
    result_handle = NCBIWWW.qblast("blastn", "nt", rec.format("fasta"), format_type="XML", alignments=1, perc_ident=95, expect= 0.001)
    f.write(result_handle.read())
f.close()
but this is only processing one query sequence and returning all hits, although I specified 1 hit and 95% identity.
This is driving me crazy. Please help.

Download NCBI nr database:
$ mkdir db
$ cd db
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr*
Run blast:
import io
import shlex
import subprocess

BLAST_OUTFMT6 = """\
'6 qacc sacc pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq'\
"""
BLAST_OUTFMT6_COLUMN_NAMES = [
    'query_id', 'subject_id', 'pc_identity', 'alignment_length', 'mismatches', 'gap_opens',
    'q_start', 'q_end', 's_start', 's_end', 'evalue', 'bitscore', 'qseq', 'sseq',
]

def blastp(sequence, db, evalue=0.001, max_target_seqs=100000):
    system_command = (
        'blastp -db {db} -outfmt {outfmt} -evalue {evalue} -max_target_seqs {max_target_seqs}'
        .format(db=db, outfmt=BLAST_OUTFMT6, evalue=evalue, max_target_seqs=max_target_seqs)
    )
    cp = subprocess.Popen(
        shlex.split(system_command),
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    result, error_message = cp.communicate(sequence)
    if error_message.strip():
        print("Error: {}".format(error_message))
    return result

if __name__ == '__main__':
    result = blastp('AAAAAAAAAAAAAA', db='/path/to/db/nr')
    print(result)
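The answer defines BLAST_OUTFMT6_COLUMN_NAMES but does not use it. One possible use, sketched here under the assumption that the result string is the tab-separated text produced by -outfmt 6, is to turn each output line into a dictionary keyed by those column names. The function name parse_blast_outfmt6 is hypothetical and not part of the answer.

```python
# Column names matching the outfmt 6 field list used in the answer above.
BLAST_OUTFMT6_COLUMN_NAMES = [
    'query_id', 'subject_id', 'pc_identity', 'alignment_length',
    'mismatches', 'gap_opens', 'q_start', 'q_end', 's_start', 's_end',
    'evalue', 'bitscore', 'qseq', 'sseq',
]


def parse_blast_outfmt6(result):
    """Convert tab-separated BLAST output into a list of dicts, one per hit.

    All values are kept as strings; convert them (for example with float())
    where numbers are needed.
    """
    hits = []
    for line in result.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split('\t')
        hits.append(dict(zip(BLAST_OUTFMT6_COLUMN_NAMES, fields)))
    return hits
```

With this, parse_blast_outfmt6(blastp(...)) would give one dictionary per hit, so a field such as the identity percentage can be read as hit['pc_identity'].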

How to change the values of a parameter in multiple files using python

I am a new Python user. I learned a way of changing the value of a parameter in a single file. The script:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('3.7',sigma)
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
It's run in command line with test.txt as
test.py test.txt 2.
3.7 is replaced by 2 in test.txt, as a result.
Now if I want to do the same for all the .txt files in the directory e.g.
test.py *.txt 2
what are the suggested modifications?
Your suggestions are highly appreciated.
Hafiz.

Bash (or whatever your shell is) will expand *.txt (to test0.txt test1.txt ... or whatever the .txt files in your current directory are called) before passing it to your Python script. Your script will therefore receive many arguments, not just the 2 you expect. Print sys.argv to inspect them.
You could solve that in bash itself with something like
for name in *.txt; do test.py ${name} 2; done
Otherwise you would need to treat sys.argv differently in Python and allow for more than 2 arguments.
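The Python-side alternative could look roughly like the sketch below. It assumes the value (sigma) is always the last argument and that everything between the script name and that value is a file name produced by the shell's expansion of *.txt. The function names replace_in_file and main are illustrative only.

```python
def replace_in_file(filename, old, new):
    """Replace every occurrence of `old` with `new` in one file, in place."""
    with open(filename, 'r') as handle:
        txt = handle.read()
    with open(filename, 'w') as handle:
        handle.write(txt.replace(old, new))


def main(argv):
    # argv[0] is the script name, the last item is sigma, and everything
    # in between is a file name (the shell has already expanded *.txt).
    filenames, sigma = argv[1:-1], argv[-1]
    for filename in filenames:
        replace_in_file(filename, '3.7', sigma)
```

Calling main(sys.argv) at the bottom of test.py (after import sys) would then let test.py *.txt 2 update every matching file.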

Importing glob solved that issue. But I've got some queries.
Query 1:
I'm rewriting my code as:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('3.7'|'3',sigma) #gives syntax error
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
I want to replace 3.7 or 3 by sigma. What will be the corrected code?
Query 2:
I'm rewriting it in the following manner:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('x="2"','x=sigma')
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
With
py test.py test.txt 3.
I get x=sigma, but I want to get x=3
What'd be the modification?
Regards,
Hafiz
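A possible sketch for the two queries above, using only the standard library. For Query 1, str.replace takes a single search string, so a regular expression with alternation is one way to match either value. For Query 2, the word sigma inside quotes is literal text, so the variable has to be inserted into the string. The function names replace_either and set_x are hypothetical.

```python
import re


def replace_either(txt, sigma):
    """Query 1: replace '3.7' or '3' with sigma."""
    # '3.7' is listed first so it is tried before the shorter '3';
    # the dot is escaped so that it matches a literal '.'.
    return re.sub(r'3\.7|3', sigma, txt)


def set_x(txt, sigma):
    """Query 2: replace x="2" with x=<value of sigma>."""
    # format() inserts the value of the variable into the replacement text.
    return txt.replace('x="2"', 'x={0}'.format(sigma))
```

Note that the pattern in replace_either also matches a 3 inside longer numbers such as 13, so a stricter pattern may be needed depending on the file contents.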

ConfigParser does not load sections when run via crontab

When I run the Python script from the command line, everything works perfectly, but when the script is run from cron, ConfigParser returns an empty list of sections.
me = singleton.SingleInstance()
######### Accessing the configuration file #######################################
config = ConfigParser.RawConfigParser()
config.read('./matrix.cfg')
sections = config.sections()
######### Build the current map ##################################################
print sections
Here is the cron job
* * * * * /usr/bin/python /etc/portmatrix/matrix.py | logger
and here is the output
Feb 12 12:59:01 dns01 CRON[30879]: (root) CMD (/usr/bin/python /etc/portmatrix/matrix.py | logger)
Feb 12 12:59:01 dns01 logger: []

ConfigParser tries to read the file ./matrix.cfg.
Now the path of this file is ./ which means in the current directory.
So what assumption do you make about the current directory when being run from cron? (I guess you have a file /etc/portmatrix/matrix.cfg and you assume that ./ really means "in the same directory as the running script" - this however is not true)
The simple fix is to provide the full path to the configuration file. E.g.:
config = ConfigParser.RawConfigParser()
config.read('/etc/portmatrix/matrix.cfg')

I ran into a similar situation and I was able to fix it thanks to umläute's answer.
I'm using os however, so in your case, this would be something like:
import os
...
base_path = os.path.dirname(os.path.realpath(__file__))
config = ConfigParser.RawConfigParser()
config.read(os.path.join(base_path, 'matrix.cfg'))

Program for Backup - Python

I'm trying to run the following code in Python 2.7 on Windows 7. The purpose of the code is to back up a specified folder to a specified target folder, following the given naming pattern.
However, I'm not able to get it to work. The output has always been 'Backup FAILED'.
Please advise on how I can resolve this and get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys
sys.path.append('C:\Python27\GnuWin32\bin')
source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target,''.join(source))
print('This is a program for backing up files')
print(zip_command)
if os.system(zip_command)==0:
    print('Successful backup to', target)
else:
    print('Backup FAILED')

See if escaping the backslashes helps:
source = 'C:\\New'
target_dir = 'E:\\Backup'
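To illustrate why backslashes in Windows paths need care: in an ordinary string literal, a sequence such as \b is an escape (a single backspace character), which is what happens to the \bin part of the GnuWin32 path in the question. Doubling the backslash or using a raw string keeps it literal. The sketch below uses the paths from the question; the variable names backspace_length and raw_length are only there for illustration. Separately, sys.path only controls where Python looks for modules to import, so appending to it does not help os.system find the zip executable, which is looked up via the PATH environment variable.

```python
source = r'C:\New'          # raw string: backslashes are kept as written
target_dir = 'E:\\Backup'   # doubled backslash: one literal backslash

# In a normal literal, '\b' is one character (backspace);
# in a raw literal it is two characters (backslash and 'b').
backspace_length = len('\b')
raw_length = len(r'\b')
```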
