How to download a snappy.parquet file from s3 using Boto in Python

How to download a snappy.parquet file from s3 using Boto in Python - python-2.7

I'm new to this, and trying to download a snappy.parquet file from Amazon s3 I can later convert to CSV file.
I tried working with the following example I've found online, and I get an empty folder. can anyone please help me?
import boto
import sys, os
from boto.s3.key import Key
from boto.exception import S3ResponseError
DOWNLOAD_LOCATION_PATH =""
BUCKET_NAME = ""
AWS_ACCESS_KEY_ID= ""
AWS_ACCESS_SECRET_KEY = ""
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_ACCESS_SECRET_KEY)
bucket = conn.get_bucket(BUCKET_NAME)
#goto through the list of files
bucket_list = bucket.list()
for l in bucket_list:
key_string = str(l.key)
s3_path = DOWNLOAD_LOCATION_PATH + key_string
try:
print ("Current File is ", s3_path)
l.get_contents_to_filename(s3_path)
except (OSError, S3ResponseError) as e:
pass
# check if the file has been downloaded locally
if not os.path.exists(s3_path):
try:
os.makedirs(s3_path)
except OSError as exc:
# let guard againts race conditions
import errno
if exc.errno != errno.EEXIST:
raise

The script you are using appears to recursively download the contents of the specified S3 bucket (BUCKET_NAME) to the specified local directory (DOWNLOAD_LOCATION_PATH). FWIW, I notice this script looks like it comes from here.
The "Current File is ..." output line should show you the progress of these files being written. One problem you might be having is due to this line:
s3_path = DOWNLOAD_LOCATION_PATH + key_string
If you had specified DOWNLOAD_LOCATION_PATH at the top as a directory without a trailing '/' character, e.g. like this:
DOWNLOAD_LOCATION_PATH = '/tmp/my_dir'
then the files being downloaded would be written not underneath the /tmp/my_dir directory, but directly in /tmp/ with a my_dir prefix on each filename! You can fix this by changing this line to:
s3_path = os.path.join(DOWNLOAD_LOCATION_PATH, key_string)
Other than that, the script appears to work alright. You may want to add this line at the very top:
from __future__ import print_function
if you are still using Python 2.x, otherwise the print output will look a bit odd (print will think you are printing a 2-Tuple).
Your question also makes it sound like you really only want/need to download a single file from the bucket -- if so, this isn't really a great script to be using, since it's downloading everything.

Related

Creating a dmarc parser using parsedmarc in python3 for use in AWS s3

I am very new to programming. I am working on a pipeline to analyze DMARC report files that are sent to my email account, that I am manually placing in an s3 bucket. The goal of this task is to download, extract, and analyze files using parsedmarc: https://github.com/domainaware/parsedmarc The part I'm having difficulty with is setting a conditional statement to extract .gz files if the target file is not a .zip file. I'm assuming the gzip library will be sufficient for this purpose. Here is the code I have so far. I'm using python3 and the boto3 library for AWS. Any help is appreciated!
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
#Set default session profile and region for sandbox account. Access keys are pulled from /.aws/config and /.aws/credentials.
#The 'profile_name' value comes from the header for the account in question in /.aws/config and /.aws/credentials
boto3.setup_default_session(region_name="aws-region-goes-here")
boto3.setup_default_session(profile_name="aws-account-profile-name-goes-here")
#Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
s3_resource = boto3.resource(s3)
s3_resource.Bucket('dmarc-parsing').download_file('source-dmarc-report-filename.zip' '/home/user/dmarc/parseme.zip')
#Use the zipfile python library to extract the file into its raw state.
with zipfile.ZipFile('/home/user/dmarc/parseme.zip', 'r') as zip_ref:
zip_ref.extractall('/home/user/dmarc')
#Ingest all locations for xml file source
dmarc_report_directory = '/home/user/dmarc/'
dmarc_report_file = 'parseme.xml'
"""I need an if statement here for extracting .gz files if the file type is not .zip. The contents of every archive are .xml files"""
#Set report output variables using functions in parsedmarc. Variable set to equal the output
pd_report_output=parsedmarc.parse_aggregate_report_file(_input=f"{dmarc_report_directory}{dmarc_report_file}")
#use jsonify to make the output in json format
pd_report_jsonified = json.loads(json.dumps(pd_report_output))
dkim_status = pd_report_jsonified['records'][0]['policy_evaluated']['dkim']
spf_status = pd_report_jsonified['records'][0]['policy_evaluated']['spf']
if dkim_status == 'fail' or spf_status == 'fail':
print(f"{dmarc_report_file} reports failure. oh crap. report:")
else:
print(f"{dmarc_report_file} passes. great. report:")
pp.pprint(pd_report_jsonified['records'][0]['auth_results'])
if __name__ == "__main__":
main()

Here is the code using the parsedmarc.parse_aggregate_report_xml method I found. Hope this helps others in parsing these reports:
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
#Set default session profile and region for account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
#The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
boto3.setup_default_session(profile_name="aws_profile_name_goes_here", region_name="region_goes_here")
source_file = 'filename_in_s3_bucket.zip'
destination_directory = '/tmp/'
destination_file = 'compressed_report_file'
#Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
s3_resource = boto3.resource('s3')
s3_resource.Bucket('bucket-name-for-dmarc-report-files').download_file(source_file, f"{destination_directory}{destination_file}")
#Extract xml
outputxml = parsedmarc.extract_xml(f"{destination_directory}{destination_file}")
#run parse dmarc analysis & convert output to json
pd_report_output = parsedmarc.parse_aggregate_report_xml(outputxml)
pd_report_jsonified = json.loads(json.dumps(pd_report_output))
#loop through results and find relevant status info and pass fail status
dmarc_report_status = ''
for record in pd_report_jsonified['records']:
if False in record['alignment'].values():
dmarc_report_status = 'Failed'
#************ add logic for interpreting results
#if fail, publish to sns
if dmarc_report_status == 'Failed':
message = "Your dmarc report failed a least one check. Review the log for details"
sns_resource = boto3.resource('sns')
sns_topic = sns_resource.Topic('arn:aws:sns:us-west-2:112896196555:TestDMARC')
sns_publish_response = sns_topic.publish(Message=message)
if __name__ == "__main__":
main()

GitPython: How can I access the contents of a file in a commit in GitPython

I am new to GitPython and I am trying to get the content of a file within a commit. I am able to get each file from a specific commit, but I am getting an error each time I run the command. Now, I know that the file exist in GitPython, but each time I run my program, I am getting the following error:
returned non-zero exit status 1
I am using Python 2.7.6 and Ubuntu Linux 14.04.
I know that the file exist, since I also go directly into Git from the command line, check out the respective commit, search for the file, and find it. I also run the cat command on it, and the file contents are displayed. Many times when the error shows up, it says that the file in question does not exist. I am trying to go through each commit with GitPython, get every blob or file from each individual commit, and run an external Java program on the content of that file. The Java program is designed to return a string to Python. To capture the string returned from my Java code, I am also using subprocess.check_output. Any help will be greatly appreciated.
I tried passing in the command as a list:
cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava','absolute/path/to/file']
subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
And I have also tried passing the command as a string:
subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)
Is it possible to access the contents of a file from GitPython?
For example, say there is a commit and it has one file foo.java
In that file is the following lines of code:
foo.java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
public class foo{
public static void main(String[] args) throws Exception{}
}
I want to access everything in the file and run an external program on it.
Any help would be greatly appreciated. Below is a piece of the code I am using to do so
#! usr/bin/env python
__author__ = 'rahkeemg'
from git import *
import git, json, subprocess, re
git_dir = '/home/rahkeemg/Documents/GitRepositories/WhereHows'
# make an instance of the repository from specified path
repo = Repo(path=git_dir)
heads = repo.heads # obtain the different repositories
master = heads.master # get the master repository
print master
# get all of the commits on the master branch
commits = list(repo.iter_commits(master))
cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava']
# start at the very 1st commit, or start at commit 0
for i in range(len(commits) - 1, 0, -1):
commit = commits[i]
commit_num = len(commits) - 1 - i
print commit_num, ": ", commit.hexsha, '\n', commit.message, '\n'
for entry in commit.tree.traverse():
if re.search(r'\.java', entry.path):
current_file = str(entry.abspath.strip())
# add the current file or blob to the list for the command to run
cmd.append(current_file)
print entry.abspath
try:
# This is the scenario where I pass arguments into command as a string
print subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)
# scenario where I pass arguments into command as a list
j_response = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
except subprocess.CalledProcessError as e:
print "Error on file: ", current_file
# Use pop on list to remove the last string, which is the selected file at the moment, to make place for the next file.
cmd.pop()

First of all, when you traverse the commit history like this, the file will not be checked out. All you get is the filename, maybe leading to the file or maybe not, but certainly it will not lead to the file from different revision than currently checked-out.
However, there is a solution to this. Remember that in principle, anything you could do with some git command, you can do with GitPython.
To get file contents from specific revision, you can do the following, which I've taken from that page:
git show <treeish>:<file>
therefore, in GitPython:
file_contents = repo.git.show('{}:{}'.format(commit.hexsha, entry.path))
However, that still wouldn't make the file appear on disk. If you need some real path for the file, you can use tempfile:
f = tempfile.NamedTemporaryFile(delete=False)
f.write(file_contents)
f.close()
# at this point file with name f.name contains contents of
# the file from path entry.path at revision commit.hexsha
# your program launch goes here, use f.name as filename to be read
os.unlink(f.name) # delete the temp file

Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2

I’m doing simple operation to of downloading the gzip files from S3 bucket to the local directory. I’m extracting those into another local directory and then uploading them back to S3 bucket again into archive folder path. While doing this operation I want to make sure I am processing same set of files that I initially download from S3 bucket which is (f_name) in below code. Now, below code is not uploading those back to S3 , that’s where I’m stuck. But able to download from S3 and extract it into local directory. Can you please help me understand what is wrong with the _uploadFile function?
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
aws_bucket= "event-logs-dev” ## S3 Bucket name
local_download_directory= "/Users/TargetData/Download/test_queue1/“ ## local directory to download the gzip files from S3.
Target_directory_to_extract = "/Users/TargetData/unzip” ##local directory to gunzip the downloaded files.
Target_s3_path_to_upload= "event-logs-dev/data/clean/xact/logs/archive/“ ## S3 bucket path to upload the files.
def decompressAllFilesFromNetfiler(self,aws_bucket,local_download_directory,Target_d irectory_to_extract,Target_s3_path_to_upload):
zipFiles = [f for f in os.listdir(local_download_directory) if re.match(r'.*\.tar\.gz', f)]
for f_name in zipFiles:
if os.path.exists(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')]) and os.access(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')], os.R_OK):
print ('File {} already exists!'.format(f_name))
else:
f_name_with_path = os.path.join(local_download_directory, f_name)
os.system('mkdir -p {} && tar vxzf {} -C {}'.format(Target_directory_to_extract, f_name_with_path, Target_directory_to_extract))
print ('Extracted file {}'.format(f_name))
self._uploadFile(aws_bucket,f_name,Target_s3_path_to_upload,Target_directory_to_extract)
def _uploadFile(self, aws_bucket, f_name,Target_s3_path_to_upload,Target_directory_to_extract):
full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload, f_name))
path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
try:
print "Uploaded extracted file to: %s" % (full_key_name)
key = aws_bucket.new_key(full_key_name)
key.set_contents_from_filename(path)
except:
if full_key_name is None:
print "Error uploading”
Currently, the output prints that Uploaded extracted file to: event-logs-dev/data/clean/xact/logs/archive/1442235602129200000.tar.gz, but nothing is uploaded to S3 bucket. Your help is greatly appreciated!! Thank you in advance!

It appears that you have cut and pasted parts of your code - and maybe formatting was lost as your code above will not work as pasted. I've taken the liberty to make it PEP8 (mostly) however there is still some missing code to create the S3 objects. Since your import the modules, I presume that you have that section of code and just didn't paste it.
here is a cleaned up version of your code formatted correctly. I also added a Exception code to your try: block to print out the error you get. You should update the Exception to be more specific to the Exceptions thrown for make_key or set_contents_... but the general Exception will get you started. If nothing more this is more readable, but you should include your S3 connection code too - and remove anything that is specific to your domain (e.g. keys, trade secrets, etc).
#!/usr/bin/env python
"""
do some download
some extract
and some upload
"""
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
aws_bucket = 'event-logs-dev'
local_download_directory = '/Users/TargetData/Download/test_queue1/'
Target_directory_to_extract = '/Users/TargetData/unzip'
Target_s3_path_to_upload = 'event-logs-dev/data/clean/xact/logs/archive/'
'''
MUST BE SOME MAGIC HERE TO GET AN S3 CONNECTION ???
aws_bucket IS NOT A BUCKET OBJECT ...
'''
def decompressAllFilesFromNetfiler(self,
aws_bucket,
local_download_directory,
Target_directory_to_extract,
Target_s3_path_to_upload):
'''
decompress stuff
'''
zipFiles = [f for f in os.listdir(
local_download_directory) if re.match(r'.*\.tar\.gz', f)]
for f_name in zipFiles:
if os.path.exists(
"{}/{}".format(Target_directory_to_extract,
f_name[:len('.tar.gz')])) and os.access(
"{}/{}".format(Target_directory_to_extract,
f_name[:len('.tar.gz')])) and os.R_OK:
print ('File {} already exists!'.format(f_name))
else:
f_name_with_path = os.path.join(local_download_directory, f_name)
os.system('mkdir -p {} && tar vxzf {} -C {}'.format(
Target_directory_to_extract,
f_name_with_path,
Target_directory_to_extract))
print ('Extracted file {}'.format(f_name))
self._uploadFile(aws_bucket,
f_name,
Target_s3_path_to_upload,
Target_directory_to_extract)
def _uploadFile(self,
aws_bucket,
f_name,
Target_s3_path_to_upload,
Target_directory_to_extract):
full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload,
f_name))
path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
try:
S3CONN = S3Connection()
BUCKET = S3CONN.get_bucket(aws_bucket)
key = BUCKET.new_key(full_key_name)
key.set_contents_from_filename(path)
print "Uploaded extracted file to: {}".format(full_key_name)
except Exception as UploadERR:
if full_key_name is None:
print 'Error uploading'
else:
print "Error : {}".format(UploadERR)

Zipping a file from a directory and placing it in another Directory

I am trying to set up a program to put my minecraft server world into a zip and place it into another directory on another drive (/media/500gb/MinecraftWorldBackups)
But I keep getting this error
Although the folder doesn't contain a folder or file called 'h'
What do I need to do to fix this I believe it is due to file and folder?
#!/usr/bin/env python
import time, zipfile
while True:
FileName = 'MinecraftBackup_' + str(int(time.time()))
Path = '/home/bertie/Desktop/FeedTheBeastServer/world/'
print(FileName)
Zip = zipfile.ZipFile('/media/500gb/MinecraftWorldBackups/'+FileName+'.zip','w')
for each in Path:
print(each)
try: Zip.write(Path + each)
except IOError: None
Zip.Close()
print('Done')
time.sleep(60)

Program for Backup - Python

Im trying to execute the following code in Python 2.7 on Windows7. The purpose of the code is to take back up from the specified folder to a specified folder as per the naming pattern given.
However, Im not able to get it work. The output has always been 'Backup Failed'.
Please advise on how I get resolve this to get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys
sys.path.append('C:\Python27\GnuWin32\bin')
source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target,''.join(source))
print('This is a program for backing up files')
print(zip_command)
if os.system(zip_command)==0:
print('Successful backup to', target)
else:
print('Backup FAILED')

See if escaping the \'s helps :-
source = 'C:\\New'
target_dir = 'E:\\Backup'

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to download a snappy.parquet file from s3 using Boto in Python - python-2.7

Related

Creating a dmarc parser using parsedmarc in python3 for use in AWS s3

GitPython: How can I access the contents of a file in a commit in GitPython

Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2

Zipping a file from a directory and placing it in another Directory

Program for Backup - Python

Categories

Resources