GitPython: How can I access the contents of a file in a commit in GitPython - python-2.7

I am new to GitPython and I am trying to get the content of a file within a commit. I am able to get each file from a specific commit, but I get an error each time I run the command. Now, I know that the file exists, but each time I run my program I get the following error:
returned non-zero exit status 1
I am using Python 2.7.6 and Ubuntu Linux 14.04.
I know that the file exists, since I can also go directly into Git from the command line, check out the respective commit, search for the file, and find it. I can also run the cat command on it, and the file contents are displayed. Many times when the error shows up, it says that the file in question does not exist. I am trying to go through each commit with GitPython, get every blob or file from each individual commit, and run an external Java program on the content of that file. The Java program is designed to return a string to Python. To capture the string returned from my Java code, I am also using subprocess.check_output. Any help will be greatly appreciated.
I tried passing in the command as a list:
cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava','absolute/path/to/file']
subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
And I have also tried passing the command as a string:
subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)
Is it possible to access the contents of a file from GitPython?
For example, say there is a commit and it has one file foo.java
That file contains the following lines of code:
foo.java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
public class foo {
    public static void main(String[] args) throws Exception {}
}
I want to access everything in the file and run an external program on it.
Below is a piece of the code I am using to do so:
#!/usr/bin/env python
__author__ = 'rahkeemg'

from git import *
import git, json, subprocess, re

git_dir = '/home/rahkeemg/Documents/GitRepositories/WhereHows'

# make an instance of the repository from the specified path
repo = Repo(path=git_dir)

heads = repo.heads  # obtain the different branches
master = heads.master  # get the master branch
print master

# get all of the commits on the master branch
commits = list(repo.iter_commits(master))

cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava']

# start at the very 1st commit, i.e. commit 0
for i in range(len(commits) - 1, 0, -1):
    commit = commits[i]
    commit_num = len(commits) - 1 - i
    print commit_num, ": ", commit.hexsha, '\n', commit.message, '\n'

    for entry in commit.tree.traverse():
        if re.search(r'\.java', entry.path):
            current_file = str(entry.abspath.strip())
            # add the current file or blob to the list for the command to run
            cmd.append(current_file)
            print entry.abspath

            try:
                # scenario where I pass the command as a string
                print subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)

                # scenario where I pass the command as a list
                j_response = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
            except subprocess.CalledProcessError as e:
                print "Error on file: ", current_file

            # pop the last string, which is the currently selected file, to make room for the next file
            cmd.pop()

First of all, when you traverse the commit history like this, the files will not be checked out. All you get is the filename, which may or may not lead to an existing file on disk, but it certainly will not lead to the file from a different revision than the one currently checked out.
However, there is a solution to this. Remember that, in principle, anything you can do with a git command, you can do with GitPython.
To get the file contents from a specific revision, you can do the following, which I've taken from that page:
git show <treeish>:<file>
therefore, in GitPython:
file_contents = repo.git.show('{}:{}'.format(commit.hexsha, entry.path))
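As an aside, if you already have the tree entry from traverse(), GitPython can usually hand you the file's bytes directly without shelling out to git show; a minimal sketch, assuming entry is a blob yielded by commit.tree.traverse():

# read the blob contents straight from the object database (no checkout needed);
# 'entry' is assumed to be a Blob from commit.tree.traverse()
if entry.type == 'blob':
    file_contents = entry.data_stream.read()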
However, that still wouldn't make the file appear on disk. If you need some real path for the file, you can use tempfile:
import os
import tempfile

f = tempfile.NamedTemporaryFile(delete=False)
f.write(file_contents)
f.close()
# at this point the file named f.name contains the contents of
# the file at path entry.path at revision commit.hexsha;
# your program launch goes here; use f.name as the filename to be read
os.unlink(f.name)  # delete the temp file
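Putting the pieces together with the loop from the question, the whole thing might look roughly like this. It is only a sketch (not tested against your repository), reusing repo, commit, and re from the code above and the classpath from your question:

import os
import subprocess
import tempfile

for entry in commit.tree.traverse():
    if not re.search(r'\.java$', entry.path):
        continue

    # contents of this file at this specific commit, without checking it out
    file_contents = repo.git.show('{}:{}'.format(commit.hexsha, entry.path))

    # write them to a temporary file so the Java program has a real path to read
    f = tempfile.NamedTemporaryFile(suffix='.java', delete=False)
    f.write(file_contents)
    f.close()

    try:
        j_response = subprocess.check_output(
            ['java', '-classpath',
             '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:',
             'java_gram.mainJava', f.name],
            stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        print "Error on file: ", entry.path
    finally:
        os.unlink(f.name)  # always clean up the temp file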

Related

How to download a snappy.parquet file from s3 using Boto in Python

I'm new to this, and I'm trying to download a snappy.parquet file from Amazon S3 that I can later convert to a CSV file.
I tried working with the following example I found online, and I get an empty folder. Can anyone please help me?
import boto
import sys, os
from boto.s3.key import Key
from boto.exception import S3ResponseError

DOWNLOAD_LOCATION_PATH = ""
BUCKET_NAME = ""
AWS_ACCESS_KEY_ID = ""
AWS_ACCESS_SECRET_KEY = ""

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_ACCESS_SECRET_KEY)
bucket = conn.get_bucket(BUCKET_NAME)

# go through the list of files
bucket_list = bucket.list()
for l in bucket_list:
    key_string = str(l.key)
    s3_path = DOWNLOAD_LOCATION_PATH + key_string
    try:
        print ("Current File is ", s3_path)
        l.get_contents_to_filename(s3_path)
    except (OSError, S3ResponseError) as e:
        pass
        # check if the file has been downloaded locally
        if not os.path.exists(s3_path):
            try:
                os.makedirs(s3_path)
            except OSError as exc:
                # guard against race conditions
                import errno
                if exc.errno != errno.EEXIST:
                    raise
The script you are using appears to recursively download the contents of the specified S3 bucket (BUCKET_NAME) to the specified local directory (DOWNLOAD_LOCATION_PATH). FWIW, I notice this script looks like it comes from here.
The "Current File is ..." output line should show you the progress of these files being written. One problem you might be having is due to this line:
s3_path = DOWNLOAD_LOCATION_PATH + key_string
If you had specified DOWNLOAD_LOCATION_PATH at the top as a directory without a trailing '/' character, e.g. like this:
DOWNLOAD_LOCATION_PATH = '/tmp/my_dir'
then the files being downloaded would be written not underneath the /tmp/my_dir directory, but directly in /tmp/ with a my_dir prefix on each filename! You can fix this by changing this line to:
s3_path = os.path.join(DOWNLOAD_LOCATION_PATH, key_string)
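To see the difference concretely, here is a quick check (the paths are just made up for illustration):

import os

DOWNLOAD_LOCATION_PATH = '/tmp/my_dir'   # note: no trailing slash
key_string = 'data/part-00000.snappy.parquet'

# plain concatenation glues the key onto the directory name itself
print(DOWNLOAD_LOCATION_PATH + key_string)
# -> /tmp/my_dirdata/part-00000.snappy.parquet

# os.path.join inserts the separator for you
print(os.path.join(DOWNLOAD_LOCATION_PATH, key_string))
# -> /tmp/my_dir/data/part-00000.snappy.parquet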
Other than that, the script appears to work alright. You may want to add this line at the very top:
from __future__ import print_function
if you are still using Python 2.x, otherwise the print output will look a bit odd (print will think you are printing a 2-Tuple).
Your question also makes it sound like you really only want/need to download a single file from the bucket -- if so, this isn't really a great script to be using, since it's downloading everything.
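If a single object really is all you need, a much smaller boto 2 sketch would do; the bucket name, key name, and local path below are placeholders:

import boto

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_ACCESS_SECRET_KEY)
bucket = conn.get_bucket('my-bucket-name')                    # placeholder
key = bucket.get_key('some/prefix/file.snappy.parquet')       # placeholder
if key is not None:
    # download just this one object instead of walking the whole bucket
    key.get_contents_to_filename('/tmp/file.snappy.parquet')  # placeholder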

How to load retrained_graph.pb and retrained_label.txt using pycharm editor

Using Pete Warden's tutorials I trained the Inception network, and from the training I get two files:
1. retrained_graph.pb
2. retrained_label.txt
Using these I wanted to classify a flower image.
I installed PyCharm and linked all the TensorFlow libraries; I also tested the sample TensorFlow code and it works fine.
Now when I run the label_image.py program, which is:
import tensorflow as tf, sys

image_path = sys.argv[1]

# Read in the image_data
image_data = tf.gfile.FastGFile(image_path, 'rb').read()

# Load the label file and strip off carriage returns
label_lines = [line.rstrip() for line
               in tf.gfile.GFile("/tf_files/retrained_labels.txt")]

# Unpersist the graph from file
with tf.gfile.FastGFile("/tf_files/retrained_graph.pb", 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    _ = tf.import_graph_def(graph_def, name='')

with tf.Session() as sess:
    # Feed the image_data as input to the graph and get the first prediction
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    predictions = sess.run(softmax_tensor,
                           {'DecodeJpeg/contents:0': image_data})

    # Sort to show labels of the first prediction in order of confidence
    top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]

    for node_id in top_k:
        human_string = label_lines[node_id]
        score = predictions[0][node_id]
        print('%s (score = %.5f)' % (human_string, score))
I am getting this error message:
/home/chandan/Tensorflow/bin/python /home/chandan/PycharmProjects/tf/tf_folder/tf_files/label_image.py
Traceback (most recent call last):
  File "/home/chandan/PycharmProjects/tf/tf_folder/tf_files/label_image.py", line 7, in <module>
    image_path = sys.argv[1]
IndexError: list index out of range
Could anyone please help me with this issue?
You are getting this error because the script expects an image name (with its path) as an argument.
In PyCharm go to View -> Tool Windows -> Terminal.
This is the same as opening a separate terminal. Then run:
python label_image.py /image_path/image_name.jpg
You are trying to get the command-line argument by calling sys.argv[1], so you need to supply command-line arguments to satisfy it. It looks like the required argument is a test image; you should pass its location as a parameter.
PyCharm has a Script parameters and interpreter options dialog which you can use to enter the required parameters.
Or you can call the script from a command line and enter the parameter via:
>python my_python_script.py my_python_parameter.jpg
EDIT:
According to the documentation (I don't have PyCharm installed on this computer), you should go to the Run/Debug Configurations menu and edit the configuration for your script. Add the absolute path of your file to the Script parameters box, in quotes.
Alternatively, if you want to skip the parameter handling completely, read the path with raw_input (input in Python 3), or simply hard-code it: image_path = r"absolute_image_path.jpg"
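A minimal sketch of that fallback, assuming Python 2 as in the script above:

import sys

if len(sys.argv) > 1:
    image_path = sys.argv[1]
else:
    # no command-line argument was given: prompt for the path instead of
    # crashing with IndexError (use input() instead of raw_input() on Python 3)
    image_path = raw_input("Path to the image to classify: ")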

I don't understand how this "os.join" function is working? I am getting errors constantly and no reading on os functions is helping me

Here's the code
import os
import sys

sys.path.append("../tools/")
from parse_out_email_text import parseOutText  # (it's just another .py file that has a function I wrote)

from_sara = open("from_sara.txt", "r")
from_chris = open("from_chris.txt", "r")

from_data = []
word_data = []

temp_counter = 0

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at the first 200 emails when developing
        ### once everything is working, remove this line to run over the full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])  # (THIS IS THE PART I CAN'T GET MY HEAD AROUND)
            print path
            email = open(path, "r")
            email.close()

print "emails processed"
from_sara.close()
from_chris.close()
When I run this, it gives me an error as shown below:
..\maildir/bailey-s/deleted_items/101.
Traceback (most recent call last):
  File "C:/Users/AmitSingh/Desktop/Data/Udacity/Naya_attempt/vectorize_text.py", line 47, in <module>
    email = open(path, "r")
IOError: [Errno 2] No such file or directory: '..\\maildir/bailey-s/deleted_items/101.'
I don't even have this '..\maildir/bailey-s/deleted_items/101.' directory path on my laptop. I tried to change the path by replacing the '..' in the code with the actual path to the folder where I keep all the files, and nothing changes.
path = os.path.join('..', path[:-1])
This code is part of an online course on machine learning, and I have been stuck at this point for 3 hours now. Any help would be really appreciated.
(P.S. This is not a homework question and there are no grades attached to it; it's a free online course.)
Your test data is not there, so it cannot find it. You should run the start-up code again and make sure the necessary maildir folders are all there.
Go to tools inside your Udacity project directory and run startup.py.
It is about 400 MB, so sit back and relax!
I know this is extremely late, but I found this post after having the exact same problem.
All the answers that I found here and on other sites, even the issues in the original GitHub repo, were just "run startup.py". I had already done that. However, it was still telling me:
Traceback (most recent call last):
  File "K:\documents\Udacity\Mini-Projects\ud120-projects\text_learning\vectorize_text.py", line 48, in <module>
    email = open(path, "r")
FileNotFoundError: [Errno 2] No such file or directory: '..\\maildir/bailey-s/deleted_items/101.'
Just like yours. I then found where this file was located, and it was indeed on my computer.
I added 'tools' to the os.path.join() line, as you can see here:
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200:
            #path = os.path.join('..', path[:-1])  # <-- original
            path = os.path.join('..', 'tools', path[:-1])
            print(path)
            email = open(path, "r")
This finally worked for me, so I hope it helps anyone else who stumbles on this problem in the future.
Also, I noticed in some other repos of the lessons that the 'tools' folder was named 'utils'. Here is an example: a repo that someone tweaked to use Jupyter notebooks to run the lessons. So, use the one that you have.
1. In your Udacity course folder, first go to the tools directory and check that the maildir folder is present and has subfolders in it. If they are present, go back to text_learning/vectorize_text.py, find the line of code path = os.path.join('..', path[:-1]) and change it to path = os.path.join('../tools/', path[:-1]).
2. In a terminal, cd text_learning, then run python vectorize_text.py; this should solve the issue.
3. If this does not solve the issue, go to tools inside your Udacity project directory and run startup.py. Wait until the process is complete.
4. Repeat step 1.
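To make the join itself less mysterious, here is what that line actually builds; the email path is a hypothetical line as it would be read from from_sara.txt:

import os

# one line read from from_sara.txt, including its trailing newline
line = "maildir/bailey-s/deleted_items/101.\n"

# path[:-1] drops the newline; os.path.join then prepends the parent directory
print(os.path.join('..', line[:-1]))
# on Linux/macOS this prints:  ../maildir/bailey-s/deleted_items/101.
# on Windows the first separator is a backslash, which is why the error
# message shows '..\\maildir/bailey-s/deleted_items/101.'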

Django: No such file or directory

I have a process that scans a tape library and looks for media that has expired, so they can be removed and reused before sending the tapes to an offsite vault. (We have some 7 day policies that never make it offsite.) This process takes around 20 minutes to run, so I didn't want it to run on-demand when loading/refreshing the page. Rather, I set up a django-cron job (I know I could have done this in Linux cron, but wanted the project to be as self-contained as possible) to run the scan, and creates a file in /tmp. I've verified that this works -- the file exists in /tmp from this morning's execution. The problem I'm having is that now I want to display a list of those expired (scratch) media on my web page, but the script is saying that it can't find the file. When the file was created, I use the absolute filename "/tmp/scratch.2015-11-13.out" (for example), but here's the error I get in the browser:
IOError at /
[Errno 2] No such file or directory: '/tmp/corpscratch.2015-11-13.out'
My assumption is that this is a "web root" issue, but I just can't figure it out. I tried copying the file to the /static/ and /media/ directories configured in Django, and even into the Django root directory and the project root directory, but nothing seems to work. When it says it can't find /tmp/file, where is it really looking?
def sample():
    """ Just testing """
    today = datetime.date.today()  # format 2015-11-31
    inputfile = "/tmp/corpscratch.%s.out" % str(today)
    with open(inputfile) as fh:  # This is the line reporting the error
        lines = [line.strip('\n') for line in fh]
    print(lines)
The print statement was used for testing in the shell (which works, I might add), but the browser gives an error.
And the file does exist:
$ ls /tmp/corpscratch.2015-11-13.out
/tmp/corpscratch.2015-11-13.out
Thanks.
Edit: I was mistaken; it doesn't work in the Python shell either. I was thinking of a previous issue.
Use this instead:
today = datetime.datetime.today().date()
inputfile = "/tmp/corpscratch.%s.out" % str(today)
Or:
today = datetime.datetime.today().strftime('%Y-%m-%d')
inputfile = "/tmp/corpscratch.%s.out" % today # No need to use str()
See the difference:
>>> str(datetime.datetime.today().date())
'2015-11-13'
>>> str(datetime.datetime.today())
'2015-11-13 15:56:19.578569'
I ended up finding this elsewhere:
today = datetime.date.today() #format 2015-11-31
inputfilename = "tmp/corpscratch.%s.out" % str(today)
inputfile = os.path.join(settings.PROJECT_ROOT, inputfilename)
With settings.py containing the following:
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
Completely resolved my issues.
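As a side note on why the leading slash disappeared from the filename: os.path.join discards everything before an absolute component, so a path starting with '/' would ignore PROJECT_ROOT entirely. A quick sketch (the project path is hypothetical):

import os

PROJECT_ROOT = '/srv/myproject'   # hypothetical project directory

# relative second argument: the file ends up under the project root
print(os.path.join(PROJECT_ROOT, 'tmp/corpscratch.2015-11-13.out'))
# -> /srv/myproject/tmp/corpscratch.2015-11-13.out

# absolute second argument: PROJECT_ROOT is thrown away
print(os.path.join(PROJECT_ROOT, '/tmp/corpscratch.2015-11-13.out'))
# -> /tmp/corpscratch.2015-11-13.out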

How to replace string in multiple files in the folder?

I'm trying to read two files and replace content from one file with content from the other, in files present in a folder which also has subdirectories.
But it tells me subprocess is not defined.
I'm new to Python and shell scripting; can anybody help me with this, please?
import os
import sys
import os.path

f = open("file1.txt", 'r')
g = open("file2.txt", 'r')

text1 = f.readlines()
text2 = g.readlines()

i = 0
for line in text1:
    l = line.replace("\r\n", "")
    t = text2[i].replace("\r\n", "")
    args = "find . -name *.tml"
    Path = subprocess.Popen(args, shell=True)
    os.system(" sed -r -i 's/" + l + "/" + t + "/g' " + Path)
    i = i + 1
To specifically address your actual error, you need to import the subprocess module as you are making use of it (oddly) in your code:
import subprocess
After that, you will find more problems. I will try to keep it as simple as possible with my suggestions. Code first, then I will break it down. Keep in mind that there are more robust ways to accomplish this task, but I am doing my best to keep your experience level in mind and stay as close to your current approach as possible.
import subprocess
import sys

# 1
results = subprocess.Popen("find . -name '*.tml'",
                           shell=True, stdout=subprocess.PIPE)
if results.wait() != 0:
    print "error trying to find tml files"
    sys.exit(1)

# 2
tml_files = []
for tml in results.stdout:
    tml_files.append(tml.strip())
if not tml_files:
    print "no tml files found"
    sys.exit(0)

tml_string = " ".join(tml_files)

# 3
with open("file1.txt") as f, open("file2.txt") as g:
    while True:
        # 4
        f_line = f.readline()
        if not f_line:
            break
        g_line = g.readline()
        if not g_line:
            break

        f_line = f_line.strip()
        g_line = g_line.strip()
        if not f_line or not g_line:
            continue

        # 5
        cmd = "sed -i -e 's/%s/%s/g' %s" % \
              (f_line.strip(), g_line.strip(), tml_string)
        ret = subprocess.Popen(cmd, shell=True).wait()
        if ret != 0:
            print "error doing string replacement"
            sys.exit(1)
You do not need to read in your entire files at once. If they are large, this could use a lot of memory. You can consume a line at a time, and you can also make use of what are called "context managers" when you open the files. This will ensure they close properly no matter what happens. Breaking the code down by the numbered comments:
1. We start with a subprocess command that is run only once to find all your .tml files. Your version had the same command being run multiple times; if the search path is the same, we only need it once. This checks the exit code of the command and quits if it failed.
2. We loop over stdout of the subprocess command and add the stripped lines to a list. This is a more robust version of your replace("\r\n"): it removes surrounding whitespace. (A "list comprehension" would be better suited here, down the line.) If we didn't find any tml files, then we have no work to do, so we exit. Otherwise, we join them together into a space-separated string to be suitable for our command later.
3. This uses "context managers". You can open the files in a way that ensures they will be closed properly no matter what. The files are open for the length of the context within that code block. We are going to loop forever and break when appropriate.
4. We pull a line, one at a time, from each file. If either line is empty, we have reached the end of that file and cannot do any more work, so we break out. We then strip the newlines, and if either string is empty (a blank line) we still can't do any work, so we just continue to the next available line.
5. A modified version of your sed command. We construct the command string on each loop from the source and replacement strings, and tack on the tml file string. Bear in mind this is a very naive approach to the replacement: it really expects your strings to contain safe characters and not break the s///g sed format. We run that with another subprocess command. The wait() simply waits for the return code, and we check it for an error. This approach replaces your os.system() version.
Hope this helps. Eventually you can improve this to do more checking and safe operations.
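For reference, one of those more robust directions is to drop find and sed entirely and do the walk and replacement in Python itself. A rough sketch, assuming (as above) that file1.txt holds the strings to search for and file2.txt the replacements, one per line:

import os

# pair up search strings with replacements, skipping blank lines
with open("file1.txt") as f, open("file2.txt") as g:
    pairs = [(a.strip(), b.strip()) for a, b in zip(f, g)
             if a.strip() and b.strip()]

# walk the directory tree and rewrite every .tml file in place
for dirpath, dirnames, filenames in os.walk("."):
    for name in filenames:
        if not name.endswith(".tml"):
            continue
        path = os.path.join(dirpath, name)
        with open(path) as fh:
            text = fh.read()
        for old, new in pairs:
            text = text.replace(old, new)
        with open(path, "w") as fh:
            fh.write(text)

This sidesteps the quoting problems you would otherwise hit when a search or replacement string contains characters that are special to sed.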