Creating a DMARC parser using parsedmarc in Python 3 for use with AWS S3

I am very new to programming. I am working on a pipeline to analyze DMARC report files that are sent to my email account, that I am manually placing in an s3 bucket. The goal of this task is to download, extract, and analyze files using parsedmarc: https://github.com/domainaware/parsedmarc The part I'm having difficulty with is setting a conditional statement to extract .gz files if the target file is not a .zip file. I'm assuming the gzip library will be sufficient for this purpose. Here is the code I have so far. I'm using python3 and the boto3 library for AWS. Any help is appreciated!
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
    #Set default session profile and region for sandbox account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
    #The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
    #Both values go in one call: a second setup_default_session() call replaces the session created by the first.
    boto3.setup_default_session(profile_name="aws-account-profile-name-goes-here", region_name="aws-region-goes-here")
    #Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('dmarc-parsing').download_file('source-dmarc-report-filename.zip', '/home/user/dmarc/parseme.zip')
    #Use the zipfile python library to extract the file into its raw state.
    with zipfile.ZipFile('/home/user/dmarc/parseme.zip', 'r') as zip_ref:
        zip_ref.extractall('/home/user/dmarc')
    #Ingest all locations for xml file source
    dmarc_report_directory = '/home/user/dmarc/'
    dmarc_report_file = 'parseme.xml'
    """I need an if statement here for extracting .gz files if the file type is not .zip. The contents of every archive are .xml files"""
    #Set report output variables using functions in parsedmarc. Variable set to equal the output
    pd_report_output=parsedmarc.parse_aggregate_report_file(_input=f"{dmarc_report_directory}{dmarc_report_file}")
    #Round-trip through json so the output is plain JSON-compatible data
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    dkim_status = pd_report_jsonified['records'][0]['policy_evaluated']['dkim']
    spf_status = pd_report_jsonified['records'][0]['policy_evaluated']['spf']
    if dkim_status == 'fail' or spf_status == 'fail':
        print(f"{dmarc_report_file} reports failure. oh crap. report:")
    else:
        print(f"{dmarc_report_file} passes. great. report:")
    pp.pprint(pd_report_jsonified['records'][0]['auth_results'])
if __name__ == "__main__":
    main()

Here is the code using the parsedmarc.parse_aggregate_report_xml method I found. Hope this helps others in parsing these reports:
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip
pp = pprint.PrettyPrinter(indent=2)
def main():
    #Set default session profile and region for account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
    #The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
    boto3.setup_default_session(profile_name="aws_profile_name_goes_here", region_name="region_goes_here")
    source_file = 'filename_in_s3_bucket.zip'
    destination_directory = '/tmp/'
    destination_file = 'compressed_report_file'
    #Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('bucket-name-for-dmarc-report-files').download_file(source_file, f"{destination_directory}{destination_file}")
    #Extract xml
    outputxml = parsedmarc.extract_xml(f"{destination_directory}{destination_file}")
    #run parse dmarc analysis & convert output to json
    pd_report_output = parsedmarc.parse_aggregate_report_xml(outputxml)
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    #loop through results and find relevant status info and pass fail status
    dmarc_report_status = ''
    for record in pd_report_jsonified['records']:
        if False in record['alignment'].values():
            dmarc_report_status = 'Failed'
    #************ add logic for interpreting results
    #if fail, publish to sns
    if dmarc_report_status == 'Failed':
        message = "Your dmarc report failed at least one check. Review the log for details"
        sns_resource = boto3.resource('sns')
        sns_topic = sns_resource.Topic('arn:aws:sns:us-west-2:112896196555:TestDMARC')
        sns_publish_response = sns_topic.publish(Message=message)
if __name__ == "__main__":
    main()
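The answer above hands the archive handling to parsedmarc.extract_xml. For anyone who wants the explicit "zip, otherwise gzip" conditional from the question, here is a minimal standard-library sketch. The function name extract_report_xml is made up for illustration, and it assumes each archive holds a single XML report (the question says the contents are always .xml files). It looks at the file content rather than the file extension.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def extract_report_xml(data: bytes) -> str:
    """Return the XML text from a .zip, .gz, or plain .xml report (hypothetical helper)."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            # Assumption: one XML report per archive, so take the first member.
            first_member = archive.namelist()[0]
            return archive.read(first_member).decode("utf-8")
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data).decode("utf-8")
    # Neither zip nor gzip: assume the bytes are already XML.
    return data.decode("utf-8")
```

The returned string could then be passed to parsedmarc.parse_aggregate_report_xml, the same call the answer uses.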

Related

How to copy a file automatically between 2 buckets in two different GCP projects?

I currently use this command, and it works well:
gsutil cp gs://bucket1/file.xml gs://bucket2/destination_folder
(bucket1 is in project1 in GCP and bucket2 is in another GCP project.)
But I would like to run that command every day at 9am. How can I do that in my GCP project in an easy way?
Edit: It will copy the file each day from the source bucket to the destination bucket (the two buckets are each in a different project). (When the file arrives in the destination bucket, it is consumed and ingested into BigQuery automatically; I just want to trigger my gsutil command and stop doing it manually each morning.)
(Excluding the Data Transfer method: I have no rights on the source project, so I cannot activate the service account for Data Transfer; I only have rights on the destination project.)
Best regards,
Currently I can copy a file from one bucket into a specific folder in another bucket (note: the 2 buckets are in the same GCP project).
I can't get the second method, the one with a gs:// path, to work.
EDIT 2:
import base64
import sys
import urllib.parse
# Imports the Google Cloud client library , dont forget the requirement or else it's ko
from google.cloud import storage
def copy_blob(
    bucket_name="prod-data", blob_name="test.csv", destination_bucket_name="prod-data-f", destination_blob_name="channel_p"
):
    """Copies a blob from one bucket to another with a new name."""
    bucket_name = "prod-data"
    blob_name = "test.csv"
    destination_bucket_name = "prod-data-f"
    destination_blob_name = "channel_p/test.csv"
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    source_blob = source_bucket.blob("huhu/"+blob_name)
    destination_bucket = storage_client.bucket(destination_bucket_name)
    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )
    # Second Method (KO)
    #
    # client = storage.Client()
    # with open('gs://prod-data-f/channelp.xml','wb') as file_obj:
    #     client.download_blob_to_file(
    #         'gs://pathsource/somefolder/channelp.xml', file_obj)
    #
    # End of second Method
    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )
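On the commented-out second method: a likely reason it fails is that Python's built-in open() only understands local paths, so open('gs://...') cannot write into a bucket. That is an inference from the snippet, since no traceback was posted. A small sketch of splitting a gs:// URI into the bucket and object names that storage_client.bucket(...) and bucket.blob(...) expect; split_gs_uri is a hypothetical helper, not part of the client library.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        # Also catches a single-slash typo such as 'gs:/bucket/object'.
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, blob_name = split_gs_uri("gs://prod-data-f/channel_p/test.csv")
```

The two names can then go through the copy_blob path that already works in the function above.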
Data transfer is obviously the right tool for doing this, but since you cannot use it, there are alternative solutions.
One of them is to copy files using a Cloud Function (you can use this snippet), and trigger each day at 9am that Cloud Function using Cloud Scheduler. Cloud Function can also be triggered by a Pub/Sub message.
The solution I was looking for (it works when I test it):
main.py
import base64
import os
import sys
import json
import uuid
import logging
from time import sleep
from flask import request
from random import uniform
from google.cloud import firestore
from google.cloud.exceptions import Forbidden, NotFound
from google.cloud import storage
# set retry deadline to 60s
DEFAULT_RETRY = storage.retry.DEFAULT_RETRY.with_deadline(60)
def Move2FinalBucket(data, context):
    # if 'data' in event:
    #     name = base64.b64decode(event['data']).decode('utf-8')
    # else:
    #     name = 'NO_DATA'
    # print('Message {}!'.format(name))
    # Get cache source bucket
    cache_bucket = storage.Client().get_bucket('nameofmysourcebucket', timeout=540, retry=DEFAULT_RETRY)
    # Get source file to copy
    blob2transfer = cache_bucket.blob('uu/oo/pp/filename.csv')
    # Get cache destination bucket
    destination_bucket = storage.Client().get_bucket('nameofmydestinationbucket', timeout=540, retry=DEFAULT_RETRY)
    # Get destination file
    new_file = destination_bucket.blob('kk/filename.csv')
    #rewrite into new_file
    new_file.rewrite(blob2transfer, timeout=540, retry=DEFAULT_RETRY)
requirements.txt
# Function dependencies, for example:
# package>=version
#google-cloud-storage==1.22.0
google-cloud-storage
google-cloud-firestore
google-api-core
flask==1.1.4
Don't forget to attach a service account with the Storage Admin role to this Cloud Function, and it will work.
Best regards,

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and pushing them back to S3.
With boto3 + Lambda, how can I achieve this?
I didn't see anything about extraction in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip
# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'
# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False) # optional
s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s)) # check to ensure some data was returned
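The gzip-over-BytesIO part of this answer can be checked locally without touching S3. A minimal sketch, where fake_body stands in for the bytes returned by s3.get_object(...)['Body'].read() and gunzip_stream is a made-up helper name:

```python
import gzip
from io import BytesIO


def gunzip_stream(compressed: bytes) -> gzip.GzipFile:
    """Wrap gzip-compressed bytes in a readable file object that decompresses on read."""
    return gzip.GzipFile(None, "rb", fileobj=BytesIO(compressed))


# Stand-in for the S3 object body; the payload is invented for the example.
fake_body = gzip.compress(b"hello from s3\n")
stream = gunzip_stream(fake_body)
```

An object like stream is what gets handed to upload_fileobj as Fileobj. Note that the whole compressed object is still read into memory first, so this suits files that fit comfortably in the Lambda's memory.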
The answers above are for gzip files. For zip files, you can try:
import boto3
import zipfile
from io import BytesIO
bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter = "/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
#This will give you list of files in the folder you mentioned as prefix
s3_resource = boto3.resource('s3')
#Now create zip object one by one, this below is for 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print (zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')
This will work for your zip file, and the unzipped data will be in the result_files folder. Make sure to increase the memory and timeout of the AWS Lambda function, since some files are large and need time to write.
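The in-memory unzip step can be tried on its own. A small sketch, assuming the whole archive fits in memory; iter_zip_members is a hypothetical helper, and the result_files/ prefix is taken from the answer above.

```python
import zipfile
from io import BytesIO


def iter_zip_members(zip_bytes: bytes, dest_prefix: str = "result_files/"):
    """Yield (destination_key, file_bytes) for each file inside a zip archive."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data to upload
            yield dest_prefix + info.filename, archive.read(info)
```

Each pair could then be written back with something like s3.put_object(Bucket=bucket, Key=destination_key, Body=file_bytes).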
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Note that Lambda's temporary disk space (/tmp) defaults to 512 MB, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3
bucket = '<bucket_name>'
key = '<key_name>'
s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
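When the function is triggered by the bucket, the bucket and key come from the event rather than from constants. A sketch of the two pure-Python pieces, with made-up helper names: reading the first record of the S3 event notification (object keys arrive URL-encoded there), and a stricter version of key[:-3] that refuses keys not ending in .gz.

```python
from urllib.parse import unquote_plus


def object_from_event(event: dict):
    """Return (bucket, key) for the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    # Keys are URL-encoded in the notification, e.g. spaces arrive as '+'.
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def uncompressed_key(key: str) -> str:
    """Drop a trailing '.gz'; raise for keys that do not look gzipped."""
    if not key.endswith(".gz"):
        raise ValueError(f"not a .gz key: {key!r}")
    return key[: -len(".gz")]
```

The .gz check matters if the output is written to the same bucket that triggers the function: the uncompressed upload fires the trigger again, and without the check the function would keep re-processing its own output.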

Uploading multiple files to Google Cloud Storage via Python Client Library

The GCP python docs have a script with the following function:
def upload_pyspark_file(project_id, bucket_name, filename, file):
    """Uploads the PySpark file in this directory to the configured
    input bucket."""
    print('Uploading pyspark file to GCS')
    client = storage.Client(project=project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
I've created an argument parsing function in my script that takes in multiple arguments (file names) to upload to a GCS bucket. I'm trying to adapt the above function to parse those multiple args and upload those files, but am unsure how to proceed. My confusion is with the 'filename' and 'file' variables above. How can I adapt the function for my specific purpose?
I don't suppose you're still looking for something like this?
from google.cloud import storage
import os
files = os.listdir('data-files')
client = storage.Client.from_service_account_json('cred.json')
bucket = client.get_bucket('xxxxxx')
def upload_pyspark_file(filename, file):
    # """Uploads the PySpark file in this directory to the configured
    # input bucket."""
    # print('Uploading pyspark file to GCS')
    # client = storage.Client(project=project_id)
    # bucket = client.get_bucket(bucket_name)
    print('Uploading from ', file, 'to', filename)
    blob = bucket.blob(filename)
    # file is a local path here, so use upload_from_filename (upload_from_file expects an open file object)
    blob.upload_from_filename(file)
for f in files:
    upload_pyspark_file(f, os.path.join('data-files', f))
The difference between file and filename is as you may have guessed, file is the source file and filename is the destination file.
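Since the question was about taking several file names from the command line, here is a rough sketch of that part with argparse. The argument names and the choice to use the file's base name as the object name are assumptions for the example, not something from the GCP docs.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a bucket name followed by one or more local file paths."""
    parser = argparse.ArgumentParser(description="Upload files to a GCS bucket")
    parser.add_argument("bucket_name", help="destination bucket")
    parser.add_argument("files", nargs="+", help="one or more local file paths")
    return parser.parse_args(argv)


def plan_uploads(paths):
    """Map each local path (source) to the object name it would get in the bucket (destination)."""
    return [(path, os.path.basename(path)) for path in paths]
```

Each (local_path, object_name) pair could then be uploaded with bucket.blob(object_name).upload_from_filename(local_path).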

Issue with uploading files from local directory to aws S3 using python 2.7 and boto 2

I'm doing a simple operation: downloading gzip files from an S3 bucket to a local directory. I extract them into another local directory and then upload them back to the S3 bucket under an archive folder path. While doing this, I want to make sure I am processing the same set of files that I initially downloaded from the S3 bucket, which is (f_name) in the code below. The code below does not upload them back to S3; that's where I'm stuck. It can download from S3 and extract into the local directory, though. Can you please help me understand what is wrong with the _uploadFile function?
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
aws_bucket = "event-logs-dev"  ## S3 Bucket name
local_download_directory = "/Users/TargetData/Download/test_queue1/"  ## local directory to download the gzip files from S3.
Target_directory_to_extract = "/Users/TargetData/unzip"  ## local directory to gunzip the downloaded files.
Target_s3_path_to_upload = "event-logs-dev/data/clean/xact/logs/archive/"  ## S3 bucket path to upload the files.
def decompressAllFilesFromNetfiler(self,aws_bucket,local_download_directory,Target_directory_to_extract,Target_s3_path_to_upload):
    zipFiles = [f for f in os.listdir(local_download_directory) if re.match(r'.*\.tar\.gz', f)]
    for f_name in zipFiles:
        if os.path.exists(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')]) and os.access(Target_directory_to_extract+"/"+f_name[:-len('.tar.gz')], os.R_OK):
            print ('File {} already exists!'.format(f_name))
        else:
            f_name_with_path = os.path.join(local_download_directory, f_name)
            os.system('mkdir -p {} && tar vxzf {} -C {}'.format(Target_directory_to_extract, f_name_with_path, Target_directory_to_extract))
            print ('Extracted file {}'.format(f_name))
            self._uploadFile(aws_bucket,f_name,Target_s3_path_to_upload,Target_directory_to_extract)
def _uploadFile(self, aws_bucket, f_name,Target_s3_path_to_upload,Target_directory_to_extract):
    full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload, f_name))
    path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
    try:
        print "Uploaded extracted file to: %s" % (full_key_name)
        key = aws_bucket.new_key(full_key_name)
        key.set_contents_from_filename(path)
    except:
        if full_key_name is None:
            print "Error uploading"
Currently, the output prints that Uploaded extracted file to: event-logs-dev/data/clean/xact/logs/archive/1442235602129200000.tar.gz, but nothing is uploaded to S3 bucket. Your help is greatly appreciated!! Thank you in advance!
It appears that you have cut and pasted parts of your code, and maybe the formatting was lost, as the code above will not work as pasted. I've taken the liberty of making it (mostly) PEP8; however, the code that creates the S3 objects is still missing. Since you import the modules, I presume that you have that section of code and just didn't paste it.
Here is a cleaned-up version of your code, formatted correctly. I also added an Exception clause to your try: block to print out the error you get. You should update the Exception to be more specific to the exceptions thrown by new_key or set_contents_..., but the general Exception will get you started. If nothing more, this is more readable, but you should include your S3 connection code too, and remove anything that is specific to your domain (e.g. keys, trade secrets, etc).
#!/usr/bin/env python
"""
do some download
some extract
and some upload
"""
from boto.s3.connection import S3Connection
from boto.s3.key import *
import os
import os.path
import re
aws_bucket = 'event-logs-dev'
local_download_directory = '/Users/TargetData/Download/test_queue1/'
Target_directory_to_extract = '/Users/TargetData/unzip'
Target_s3_path_to_upload = 'event-logs-dev/data/clean/xact/logs/archive/'
'''
MUST BE SOME MAGIC HERE TO GET AN S3 CONNECTION ???
aws_bucket IS NOT A BUCKET OBJECT ...
'''
def decompressAllFilesFromNetfiler(self,
                                   aws_bucket,
                                   local_download_directory,
                                   Target_directory_to_extract,
                                   Target_s3_path_to_upload):
    '''
    decompress stuff
    '''
    zipFiles = [f for f in os.listdir(
        local_download_directory) if re.match(r'.*\.tar\.gz', f)]
    for f_name in zipFiles:
        # strip the '.tar.gz' suffix to get the extracted name
        extracted_path = "{}/{}".format(Target_directory_to_extract,
                                        f_name[:-len('.tar.gz')])
        if os.path.exists(extracted_path) and os.access(extracted_path,
                                                        os.R_OK):
            print ('File {} already exists!'.format(f_name))
        else:
            f_name_with_path = os.path.join(local_download_directory, f_name)
            os.system('mkdir -p {} && tar vxzf {} -C {}'.format(
                Target_directory_to_extract,
                f_name_with_path,
                Target_directory_to_extract))
            print ('Extracted file {}'.format(f_name))
            self._uploadFile(aws_bucket,
                             f_name,
                             Target_s3_path_to_upload,
                             Target_directory_to_extract)
def _uploadFile(self,
                aws_bucket,
                f_name,
                Target_s3_path_to_upload,
                Target_directory_to_extract):
    full_key_name = os.path.expanduser(os.path.join(Target_s3_path_to_upload,
                                                    f_name))
    path = os.path.expanduser(os.path.join(Target_directory_to_extract, f_name))
    try:
        S3CONN = S3Connection()
        BUCKET = S3CONN.get_bucket(aws_bucket)
        key = BUCKET.new_key(full_key_name)
        key.set_contents_from_filename(path)
        print "Uploaded extracted file to: {}".format(full_key_name)
    except Exception as UploadERR:
        if full_key_name is None:
            print 'Error uploading'
        else:
            print "Error : {}".format(UploadERR)
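One more thing that may be worth checking, inferred only from the posted code: path joins the extraction directory with the archive name (…tar.gz), but tar writes the archive's contents there, not the archive itself, so that path may not exist. A sketch of doing the extraction with the standard tarfile module so the extracted paths are known afterwards. It is written for Python 3, unlike the Python 2.7 code above, and extract_tar_gz is a hypothetical helper.

```python
import os
import shutil
import tarfile


def extract_tar_gz(archive_path, target_dir):
    """Extract the regular files of a .tar.gz into target_dir and return their paths."""
    os.makedirs(target_dir, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # basename() flattens the layout and keeps writes inside target_dir
            out_path = os.path.join(target_dir, os.path.basename(member.name))
            with archive.extractfile(member) as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(out_path)
    return extracted
```

The returned paths are the ones that exist on disk after extraction, so those are the ones to pass to the upload step.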

django boto3: NoCredentialsError -- Unable to locate credentials

I am trying to use boto3 in my django project to upload files to Amazon S3. Credentials are defined in settings.py:
AWS_ACCESS_KEY = xxxxxxxx
AWS_SECRET_KEY = xxxxxxxx
S3_BUCKET = xxxxxxx
In views.py:
import boto3
s3 = boto3.client('s3')
path = os.path.dirname(os.path.realpath(__file__))
s3.upload_file(path+'/myphoto.png', S3_BUCKET, 'myphoto.png')
The system complains about Unable to locate credentials. I have two questions:
(a) It seems that I am supposed to create a credential file ~/.aws/credentials. But in a django project, where do I have to put it?
(b) The s3 method upload_file takes a file path/name as its first argument. Is it possible that I provide a file stream obtained by a form input element <input type="file" name="fileToUpload">?
This is what I use for a direct upload; I hope it provides some assistance.
import boto
from boto.exception import S3CreateError
from boto.s3.connection import S3Connection
conn = S3Connection(settings.AWS_ACCESS_KEY,
                    settings.AWS_SECRET_KEY,
                    is_secure=True)
try:
    bucket = conn.create_bucket(settings.S3_BUCKET)
except S3CreateError as e:
    bucket = conn.get_bucket(settings.S3_BUCKET)
k = boto.s3.key.Key(bucket)
k.key = filename
k.set_contents_from_filename(filepath)
Not sure about (a), but Django is very flexible with file management.
Regarding (b), you can also sign the upload and do it directly from the client to reduce bandwidth usage; it's quite sneaky and secure too. You need to use some JavaScript to manage the upload. If you want details, I can include them here.
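For the boto3 side of the question, one option that avoids a credentials file is to pass the keys from settings.py straight to boto3.client, which accepts aws_access_key_id and aws_secret_access_key keyword arguments. A minimal sketch of building those arguments; s3_client_kwargs is a made-up helper, and SimpleNamespace stands in for django.conf.settings with placeholder values.

```python
from types import SimpleNamespace


def s3_client_kwargs(settings):
    """Build keyword arguments for boto3.client('s3', ...) from Django-style settings."""
    return {
        "aws_access_key_id": settings.AWS_ACCESS_KEY,
        "aws_secret_access_key": settings.AWS_SECRET_KEY,
    }


# Stand-in for django.conf.settings; the values are placeholders, not real keys.
fake_settings = SimpleNamespace(AWS_ACCESS_KEY="AKIA-EXAMPLE", AWS_SECRET_KEY="secret-example")
kwargs = s3_client_kwargs(fake_settings)
# In a view, roughly:
#   s3 = boto3.client("s3", **kwargs)
#   s3.upload_fileobj(request.FILES["fileToUpload"], settings.S3_BUCKET, "myphoto.png")
```

The commented upload_fileobj call is the likely answer to (b): it takes a file-like object instead of a path, so the uploaded form file can be passed in directly. Treat that part as untested here.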