Fetch a file (.csv) from an S3 bucket and copy it to an RDS - python-2.7

I want to connect to an S3 bucket, get the csv files, and copy the rows to an RDS DB. This script uses arcpy; I'm not that familiar with this package, and I'm just trying to read the csv file directly from the S3 bucket as the source without downloading it to the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection
access_key = ''
secret_key = ''
conn = boto.connect_s3(aws_access_key_id = access_key,aws_secret_access_key = secret_key,host = 's3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url
##Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))
##CopyRows_management (in_rows, out_table, {config_keyword})
#http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
Error in CopyRows: arcgisscripting.ExecuteError: Failed to execute. Parameters are not valid.
If we use a path on the server as the source, like below, it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.

It looks like you are trying to use the pandas DataFrame as the table to read from with CopyRows_management? I don't think that is a valid input for the function, hence the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyway.
So either save the csv somewhere that the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, just read the contents of the csv and iterate through it using an Insert Cursor to write it to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table.
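As a rough sketch of that second option (the table path and field names below are hypothetical, and the string values may need casting to match your schema):
import csv
import StringIO
import arcpy
# 'content' is the csv text already pulled from S3 with get_contents_as_string()
reader = csv.reader(StringIO.StringIO(content))
header = next(reader)  # skip the header row
target_table = "RDSTablePath"  # hypothetical target table path
fields = ["FIELD_A", "FIELD_B", "FIELD_C"]  # hypothetical field names matching the csv columns
with arcpy.da.InsertCursor(target_table, fields) as cursor:
    for row in reader:
        cursor.insertRow(row)  # values arrive as strings; cast if your field types require it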

If your RDS happens to be Aurora MySQL, then you should take a look at the Load Data from S3 feature, where you can skip the code above and just load the file line by line into your DB.
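Roughly, that looks like the sketch below, issued from Python with a MySQL driver such as pymysql (the host, credentials, bucket, and table name are placeholders, and it assumes the cluster has an IAM role with S3 access configured via aws_default_s3_role):
import pymysql
# placeholders -- swap in your own endpoint, credentials, bucket, and table
conn = pymysql.connect(host='mycluster.cluster-xxxx.us-east-1.rds.amazonaws.com',
                       user='admin', password='...', db='mydb')
try:
    with conn.cursor() as cur:
        # Aurora MySQL extension: load the S3 object straight into the table
        cur.execute("""
            LOAD DATA FROM S3 's3://mybucket/file1.csv'
            INTO TABLE my_table
            FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES
        """)
    conn.commit()
finally:
    conn.close()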

Related

Creating a DMARC parser using parsedmarc in python3 for use with AWS S3

I am very new to programming. I am working on a pipeline to analyze DMARC report files that are sent to my email account and that I am manually placing in an S3 bucket. The goal of this task is to download, extract, and analyze the files using parsedmarc: https://github.com/domainaware/parsedmarc The part I'm having difficulty with is writing a conditional statement to extract .gz files if the target file is not a .zip file. I'm assuming the gzip library will be sufficient for this purpose. Here is the code I have so far. I'm using python3 and the boto3 library for AWS. Any help is appreciated!
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
    #Set default session profile and region for sandbox account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
    #The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
    boto3.setup_default_session(region_name="aws-region-goes-here")
    boto3.setup_default_session(profile_name="aws-account-profile-name-goes-here")
    #Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('dmarc-parsing').download_file('source-dmarc-report-filename.zip', '/home/user/dmarc/parseme.zip')
    #Use the zipfile python library to extract the file into its raw state.
    with zipfile.ZipFile('/home/user/dmarc/parseme.zip', 'r') as zip_ref:
        zip_ref.extractall('/home/user/dmarc')
    #Ingest all locations for xml file source
    dmarc_report_directory = '/home/user/dmarc/'
    dmarc_report_file = 'parseme.xml'
    """I need an if statement here for extracting .gz files if the file type is not .zip. The contents of every archive are .xml files"""
    #Set report output variables using functions in parsedmarc. Variable set to equal the output
    pd_report_output = parsedmarc.parse_aggregate_report_file(_input=f"{dmarc_report_directory}{dmarc_report_file}")
    #use jsonify to make the output in json format
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    dkim_status = pd_report_jsonified['records'][0]['policy_evaluated']['dkim']
    spf_status = pd_report_jsonified['records'][0]['policy_evaluated']['spf']
    if dkim_status == 'fail' or spf_status == 'fail':
        print(f"{dmarc_report_file} reports failure. oh crap. report:")
    else:
        print(f"{dmarc_report_file} passes. great. report:")
    pp.pprint(pd_report_jsonified['records'][0]['auth_results'])
if __name__ == "__main__":
    main()
Here is the code using the parsedmarc.parse_aggregate_report_xml method I found. Hope this helps others in parsing these reports:
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
    #Set default session profile and region for account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
    #The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
    boto3.setup_default_session(profile_name="aws_profile_name_goes_here", region_name="region_goes_here")
    source_file = 'filename_in_s3_bucket.zip'
    destination_directory = '/tmp/'
    destination_file = 'compressed_report_file'
    #Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('bucket-name-for-dmarc-report-files').download_file(source_file, f"{destination_directory}{destination_file}")
    #Extract xml
    outputxml = parsedmarc.extract_xml(f"{destination_directory}{destination_file}")
    #run parse dmarc analysis & convert output to json
    pd_report_output = parsedmarc.parse_aggregate_report_xml(outputxml)
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    #loop through results and find relevant status info and pass/fail status
    dmarc_report_status = ''
    for record in pd_report_jsonified['records']:
        if False in record['alignment'].values():
            dmarc_report_status = 'Failed'
            #************ add logic for interpreting results
    #if fail, publish to sns
    if dmarc_report_status == 'Failed':
        message = "Your dmarc report failed at least one check. Review the log for details"
        sns_resource = boto3.resource('sns')
        sns_topic = sns_resource.Topic('arn:aws:sns:us-west-2:112896196555:TestDMARC')
        sns_publish_response = sns_topic.publish(Message=message)
if __name__ == "__main__":
    main()
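For the conditional the question itself asked about (extract with zipfile when the archive is a .zip, otherwise fall back to gzip), a minimal sketch with hypothetical paths might look like this:
import gzip
import os
import shutil
import zipfile
archive_path = '/home/user/dmarc/parseme.zip'  # hypothetical; could also end in .gz
extract_dir = '/home/user/dmarc/'
if zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
elif archive_path.endswith('.gz'):
    # a gzip archive holds a single member, so just strip the .gz suffix for the output name
    extracted_path = os.path.join(extract_dir, os.path.basename(archive_path)[:-3])
    with gzip.open(archive_path, 'rb') as gz_ref, open(extracted_path, 'wb') as out_file:
        shutil.copyfileobj(gz_ref, out_file)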

How to pass a dynamically created PowerPoint file to Google Cloud Storage in Python

I'm trying to build a PowerPoint in my Python 2.7 application and upload it on the fly to Google Cloud Storage.
I can create the ppt, store it on my local hard drive as an intermediate step, and then pick it up from there to upload to Google Cloud Storage. This works well. However, my production application will run on Google App Server, so I want to be able to create the PowerPoint and upload it to Google Storage directly (without the intermediate step).
Any ideas of how to do this? blob.upload_from_file() seems to only be able to pick up files that are physically stored somewhere, but as my app is building these powerpoints I don't know what to pass to blob.upload_from_file as an argument. I tried the StringIO module but it's generating the error message below.
from google.cloud import storage
from pptx import Presentation
from StringIO import StringIO
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Hello, World!"
subtitle.text = "python-pptx was here!"
out_file = StringIO()
prs.save(out_file)
client = storage.Client()
bucket = client.get_bucket([GCP_Bucket_Name])
blob = bucket.blob('test.pptx')
blob.upload_from_file(out_file)
print blob.public_url
ValueError: Stream must be at beginning.
You are uploading the out_file object with its stream position left at the end of the buffer after prs.save(), which is why the client reports that the stream must be at the beginning.
Instead you can:
Save the Presentation object to the file system with prs.save(savingPath)
Read the file back and put it in a StringIO object.
Finally, upload the StringIO object.
Here is the code:
from google.cloud import storage
from pptx import Presentation
from StringIO import StringIO
#Creates presentation
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Hello, World!"
subtitle.text = "python-pptx was here!"
prs.save('test.pptx')
client = storage.Client()
bucket = client.get_bucket('yourBucket')
blob = bucket.blob('test.pptx')
with open('test.pptx', 'rb') as ppt:
    out_file = StringIO(ppt.read())
blob.upload_from_file(out_file)
print blob.public_url
Finally, you mentioned the code will run on Google App Server; this product does not exist. If you meant Google App Engine, keep in mind that not all of its runtimes have access to the file system, so you might not be able to save the pptx file to the application's file system.
I solved this problem by moving my GAE application from the Standard to the Flexible environment. The Flexible environment allows writing the presentation file to the root directory; I pick it up from there, upload it to Google Storage, and then delete the file from the root. This worked for me.
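For completeness: the ValueError above only means the buffer's position was left at the end of the stream after prs.save(), so an alternative that avoids the file system entirely is to rewind the in-memory buffer before uploading. An untested sketch, reusing the placeholder bucket name from the answer:
from google.cloud import storage
from pptx import Presentation
from StringIO import StringIO
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[0])
slide.shapes.title.text = "Hello, World!"
slide.placeholders[1].text = "python-pptx was here!"
out_file = StringIO()
prs.save(out_file)   # python-pptx accepts a file-like object
out_file.seek(0)     # rewind so the upload starts at the beginning of the stream
client = storage.Client()
bucket = client.get_bucket('yourBucket')
blob = bucket.blob('test.pptx')
blob.upload_from_file(out_file)
print blob.public_url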

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without needing to download them locally, extract them, and then push them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip
# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'
# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False) # optional
s3.upload_fileobj(  # upload a new obj to s3
    Fileobj=gzip.GzipFile(  # read in the output of gzip -d
        None,  # just return output as BytesIO
        'rb',  # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,  # target bucket, writing to
    Key=uncompressed_key)  # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s)) # check to ensure some data was returned
The above answer is for gzip files; for zip files, you may try:
import boto3
import zipfile
from io import BytesIO
bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter = "/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
#This will give you list of files in the folder you mentioned as prefix
s3_resource = boto3.resource('s3')
#Now create zip object one by one, this below is for 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print (zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + filename)
This will work for your zip file, and the resulting unzipped data will be in the result_files folder. Make sure to increase memory and timeout on AWS Lambda to the maximum, since some files are pretty large and need time to write.
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Note, however, that there is a default limit of 500 MB of temporary disk space for Lambda, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
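A minimal sketch of wrapping that into the Lambda handler for the S3 trigger described above (the bucket and key are taken from the standard S3 event notification structure instead of being hardcoded):
import gzip
import io
import boto3
s3 = boto3.client('s3')
def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']  # note: keys with special characters arrive URL-encoded
        if not key.endswith('.gz'):
            continue  # only act on gzipped objects, which also avoids re-triggering on our own output
        compressed_file = io.BytesIO(
            s3.get_object(Bucket=bucket, Key=key)['Body'].read())
        uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
        s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])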

Vertica Query Result Export and Import using Python

I want to copy certain data from a Vertica cluster (let's say a test cluster) to another Vertica cluster (let's say a QA cluster). Manually I can do this by dumping the result of a query into a CSV file and then importing it on the other cluster. But how can I do it in a Python script without using os or system commands? I want to do it purely with some Python module or adapter. As of now I am using the vertica-python adapter; I am able to connect to the test cluster and get the data into a Python list, but I am unable to export it to a CSV file natively using the adapter (i.e. without the python csv module). Also, how can I import the CSV file into my QA cluster using the same adapter (or a different Vertica module for Python)?
You can do it with COPY FROM VERTICA for simple cases; see the Vertica documentation for more info.
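As a rough sketch of that route (hypothetical hosts, credentials, and table names; it assumes the QA cluster can reach the test cluster over the network, since the statements run on the destination side):
import vertica_python
# connect to the destination (QA) cluster
conn = vertica_python.connect(host='qa-host', port=5433, user='dbadmin',
                              password='...', database='qadb')
cur = conn.cursor()
# open a link from the QA cluster back to the test cluster
cur.execute("CONNECT TO VERTICA testdb USER dbadmin PASSWORD '...' ON 'test-host', 5433")
# pull table A from the test cluster straight into table B on QA, no intermediate CSV
cur.execute("COPY public.B FROM VERTICA testdb.public.A")
conn.commit()
cur.execute("DISCONNECT testdb")
cur.close()
conn.close()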
For Python, you can use my template:
Environment:
python=2.7.x
vertica-python==0.7.3
Vertica Analytic Database v8.1.1-10
Source code example:
#!/usr/bin/env python2
# coding: UTF-8
import csv
import cStringIO
import vertica_python
# connection info: username, password, etc
SRC_DB_INFO = {...}
DST_DB_INFO = {...}
csvbuffer = cStringIO.StringIO()
csvwriter = csv.writer(csvbuffer, delimiter='|', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
# establish connection to source database
connection = vertica_python.connect(**SRC_DB_INFO)
cursor = connection.cursor()
cursor.execute('SELECT * FROM A')
# convert data to csv format
for row in cursor.iterate():
    csvwriter.writerow(row)
# cleanup
cursor.close()
connection.close()
# establish connection to destination database
connection = vertica_python.connect(**DST_DB_INFO)
cursor = connection.cursor()
# copy data
cursor.copy('COPY B FROM STDIN ABORT ON ERROR', csvbuffer.getvalue())
connection.commit()
# cleanup
cursor.close()
connection.close()

Getting "new-line character seen in unquoted field" when parsing csv document using django-storages

I am trying to parse csv files that have been uploaded to Amazon S3 using django-storages. I keep getting "Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?". The normal workaround for this is to open the file with "rU", but that does not seem to work with django-storages. If I drop the file directly on the server and open it from there it works; I just want to avoid storing the files directly on the server if possible. Here is the code I am using:
import csv
from django.core.files.storage import default_storage as s3_storage
n = 'csvdumps/130331548894.csv'
csvf = s3_storage.open(n, "rU")
csvReader = csv.reader(csvf)
for item in csvReader:
    print item
I can see that this is a reported django-storages bug here http://jgrid.org/david/django-storages/issue/80/trying-to-parse-csv-file-from-django but perhaps you can try reading the contents and splitting the lines yourself instead of relying on universal-newline mode:
csvf = s3_storage.open(n, "r")
csvReader = csv.reader(csvf.read().splitlines())
It would also be great if you could share a link to some of your S3 (sample) csv files so I can open them and check the line endings.