Uploading multiple files to Google Cloud Storage via Python Client Library - google-cloud-platform

The GCP python docs have a script with the following function:
def upload_pyspark_file(project_id, bucket_name, filename, file):
    """Uploads the PySpark file in this directory to the configured
    input bucket."""
    print('Uploading pyspark file to GCS')
    client = storage.Client(project=project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
I've created an argument parsing function in my script that takes in multiple arguments (file names) to upload to a GCS bucket. I'm trying to adapt the above function to parse those multiple args and upload those files, but am unsure how to proceed. My confusion is with the 'filename' and 'file' variables above. How can I adapt the function for my specific purpose?

I don't suppose you're still looking for something like this?
from google.cloud import storage
import os

files = os.listdir('data-files')
client = storage.Client.from_service_account_json('cred.json')
bucket = client.get_bucket('xxxxxx')

def upload_pyspark_file(filename, file):
    # """Uploads the PySpark file in this directory to the configured
    # input bucket."""
    # print('Uploading pyspark file to GCS')
    # client = storage.Client(project=project_id)
    # bucket = client.get_bucket(bucket_name)
    print('Uploading from', file, 'to', filename)
    blob = bucket.blob(filename)
    # upload_from_filename takes a local path; upload_from_file expects an
    # open file object, so passing the path string to it would fail.
    blob.upload_from_filename(file)

for f in files:
    upload_pyspark_file(f, "data-files\\{0}".format(f))
The difference between file and filename is, as you may have guessed, that file is the source path on disk and filename is the destination object name in the bucket.
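Since your script already has an argument-parsing function for multiple file names, a rough sketch of the same upload loop driven by argparse could look like this (the argument names, the default-credential client, and using the base name as the destination object name are my assumptions, not from your question):

import argparse
import os
from google.cloud import storage

def main():
    # Hypothetical CLI: a bucket name plus one or more local files
    parser = argparse.ArgumentParser(description='Upload files to a GCS bucket')
    parser.add_argument('bucket_name', help='Target GCS bucket')
    parser.add_argument('files', nargs='+', help='Local files to upload')
    args = parser.parse_args()

    client = storage.Client()  # assumes default application credentials
    bucket = client.get_bucket(args.bucket_name)

    for local_path in args.files:
        # Use the local base name as the destination object name
        blob = bucket.blob(os.path.basename(local_path))
        blob.upload_from_filename(local_path)
        print('Uploaded', local_path, 'to', blob.name)

if __name__ == '__main__':
    main()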

Related

File truncated on upload to GCS

I am uploading a relatively small (<1 MiB) .jsonl file to Google Cloud Storage using the Python API. The function I used is from the GCP documentation:
def upload_blob(key_path, bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client.from_service_account_json(key_path)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
The issue I am having is that the .jsonl file is getting truncated at 9500 lines after the upload. In fact, the 9500th line is not complete. I am not sure what the issue is and don't think there would be any limit for this small file. Any help is appreciated.
I had a similar problem some time ago. In my case the upload to the bucket was called inside a Python with clause, right after the line where I wrote the contents to source_file_name, so I just needed to move the upload call outside the with block so that the local file was properly written and closed before being uploaded.
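In other words, make sure the local file is fully written and closed (or at least flushed) before uploading. A minimal sketch of the pattern, with hypothetical record and file names:

# Hypothetical example: write the file first, then upload after the
# 'with' block has closed it, so no buffered data is left behind.
with open(source_file_name, 'w') as f:
    for record in records:
        f.write(record + '\n')

# The file is closed here, so the full contents are on disk before upload.
upload_blob(key_path, bucket_name, source_file_name, destination_blob_name)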

Combine all txt files in an S3 bucket into 1 large file

Problem: I am trying to combine a large number of small text files into 1 large file in an S3 bucket, using Python.
The code I tested to try this locally is below. It works perfectly. (obtained from another post):
with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('UBXEvents*'):
        if filename == outfilename:  # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
Now, since my files are located in an S3 bucket, I have trouble referencing them. I wanted to run this code for all files (using the wildcard *) in an S3 bucket, but I am having a hard time connecting the two.
Below is the s3 object I created:
object = client.get_object(
    Bucket='my_bucket_name',
    Key='bucket_path/prefix_of_file_name*'
)
Question: How would I reference the S3 bucket/path in my combining code above?
Obtaining a list of files
You can obtain a list of files in the bucket like this:
import boto3

s3_client = boto3.client('s3')

response = s3_client.list_objects_v2(Bucket='my-bucket', Prefix='folder1/')
for object in response['Contents']:
    # Do stuff here
    print(object['Key'])
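One caveat worth noting: list_objects_v2() returns at most 1000 keys per call, so for larger buckets you would typically switch to a paginator, roughly like this:

import boto3

s3_client = boto3.client('s3')

# Paginate in case the prefix contains more than 1000 objects
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='folder1/'):
    for object in page.get('Contents', []):
        print(object['Key'])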
Reading & Writing to Amazon S3
Normally, you would need to download each file from Amazon S3 to the local disk (using download_file()) and then read the contents. However, you might instead want to use smart-open · PyPI, which is a library that allows files on S3 to be opened using similar syntax to the normal Python open() command.
Here's a program that uses smart-open to read files from S3 and combine them into an output file in S3:
import boto3
from smart_open import open

BUCKET = 'my-bucket'
PREFIX = 'folder1/'  # Optional

s3_client = boto3.client('s3')

# Open output file with smart-open
with open(f's3://{BUCKET}/out.txt', 'w') as out_file:
    response = s3_client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for object in response['Contents']:
        print(f"Copying {object['Key']}")
        # Open input file with smart-open
        with open(f"s3://{BUCKET}/{object['Key']}", 'r') as in_file:
            # Read content from input file
            for line in in_file:
                # Write content to output file
                out_file.write(line)

Creating a dmarc parser using parsedmarc in python3 for use in AWS s3

I am very new to programming. I am working on a pipeline to analyze DMARC report files that are sent to my email account and that I am manually placing in an S3 bucket. The goal of this task is to download, extract, and analyze the files using parsedmarc: https://github.com/domainaware/parsedmarc The part I'm having difficulty with is writing a conditional statement to extract .gz files if the target file is not a .zip file. I'm assuming the gzip library will be sufficient for this purpose. Here is the code I have so far. I'm using Python 3 and the boto3 library for AWS. Any help is appreciated!
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip

pp = pprint.PrettyPrinter(indent=2)

def main():
    # Set default session profile and region for sandbox account. Access keys are pulled from /.aws/config and /.aws/credentials.
    # The 'profile_name' value comes from the header for the account in question in /.aws/config and /.aws/credentials
    boto3.setup_default_session(region_name="aws-region-goes-here")
    boto3.setup_default_session(profile_name="aws-account-profile-name-goes-here")
    # Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('dmarc-parsing').download_file('source-dmarc-report-filename.zip', '/home/user/dmarc/parseme.zip')
    # Use the zipfile python library to extract the file into its raw state.
    with zipfile.ZipFile('/home/user/dmarc/parseme.zip', 'r') as zip_ref:
        zip_ref.extractall('/home/user/dmarc')
    # Ingest all locations for xml file source
    dmarc_report_directory = '/home/user/dmarc/'
    dmarc_report_file = 'parseme.xml'
    """I need an if statement here for extracting .gz files if the file type is not .zip. The contents of every archive are .xml files"""
    # Set report output variables using functions in parsedmarc. Variable set to equal the output
    pd_report_output = parsedmarc.parse_aggregate_report_file(_input=f"{dmarc_report_directory}{dmarc_report_file}")
    # use jsonify to make the output in json format
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    dkim_status = pd_report_jsonified['records'][0]['policy_evaluated']['dkim']
    spf_status = pd_report_jsonified['records'][0]['policy_evaluated']['spf']
    if dkim_status == 'fail' or spf_status == 'fail':
        print(f"{dmarc_report_file} reports failure. oh crap. report:")
    else:
        print(f"{dmarc_report_file} passes. great. report:")
    pp.pprint(pd_report_jsonified['records'][0]['auth_results'])

if __name__ == "__main__":
    main()
Here is the code using the parsedmarc.parse_aggregate_report_xml method I found. Hope this helps others in parsing these reports:
import parsedmarc
import pprint
import json
import boto3
import zipfile
import gzip

pp = pprint.PrettyPrinter(indent=2)

def main():
    # Set default session profile and region for account. Access keys are pulled from ~/.aws/config and ~/.aws/credentials.
    # The 'profile_name' value comes from the header for the account in question in ~/.aws/config and ~/.aws/credentials
    boto3.setup_default_session(profile_name="aws_profile_name_goes_here", region_name="region_goes_here")
    source_file = 'filename_in_s3_bucket.zip'
    destination_directory = '/tmp/'
    destination_file = 'compressed_report_file'
    # Define the s3 resource, the bucket name, and the file to download. It's hardcoded for now...
    s3_resource = boto3.resource('s3')
    s3_resource.Bucket('bucket-name-for-dmarc-report-files').download_file(source_file, f"{destination_directory}{destination_file}")
    # Extract xml
    outputxml = parsedmarc.extract_xml(f"{destination_directory}{destination_file}")
    # Run parsedmarc analysis & convert output to json
    pd_report_output = parsedmarc.parse_aggregate_report_xml(outputxml)
    pd_report_jsonified = json.loads(json.dumps(pd_report_output))
    # Loop through results and find relevant status info and pass/fail status
    dmarc_report_status = ''
    for record in pd_report_jsonified['records']:
        if False in record['alignment'].values():
            dmarc_report_status = 'Failed'
    # ************ add logic for interpreting results
    # If fail, publish to sns
    if dmarc_report_status == 'Failed':
        message = "Your dmarc report failed at least one check. Review the log for details"
        sns_resource = boto3.resource('sns')
        sns_topic = sns_resource.Topic('arn:aws:sns:us-west-2:112896196555:TestDMARC')
        sns_publish_response = sns_topic.publish(Message=message)

if __name__ == "__main__":
    main()
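If you still want the explicit zip-versus-gz branch from the original question rather than relying on parsedmarc.extract_xml, a rough standard-library sketch could look like this (the paths and the single-XML-per-archive assumption are placeholders taken from the question):

import gzip
import shutil
import zipfile

archive_path = '/home/user/dmarc/parseme.zip'  # placeholder path from the question
extract_dir = '/home/user/dmarc/'

if zipfile.is_zipfile(archive_path):
    # .zip archive: extract everything
    with zipfile.ZipFile(archive_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
else:
    # Otherwise assume a gzip archive containing a single XML report
    with gzip.open(archive_path, 'rb') as gz_in, \
            open(f"{extract_dir}parseme.xml", 'wb') as xml_out:
        shutil.copyfileobj(gz_in, xml_out)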

How to extract files in S3 on the fly with boto3?

I'm trying to find a way to extract .gz files in S3 on the fly, that is, without needing to download them locally, extract them, and then push them back to S3.
With boto3 + Lambda, how can I achieve my goal?
I didn't see any extract functionality in the boto3 documentation.
You can use BytesIO to stream the file from S3, run it through gzip, then pipe it back up to S3 using upload_fileobj to write the BytesIO.
# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # optional

s3.upload_fileobj(                  # upload a new obj to s3
    Fileobj=gzip.GzipFile(          # read in the output of gzip -d
        None,                       # just return output as BytesIO
        'rb',                       # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                  # target bucket, writing to
    Key=uncompressed_key)           # target key, writing to
Ensure that your key is reading in correctly:
# read the body of the s3 key object into a string to ensure download
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
The above answer is for gzip files; for zip files, you may try the following:
import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'
s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'
prefix = "folder_name/"

zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you the list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')

# Now create the zip object one by one; this below is for the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + filename)
This will work for your zip file and your unzipped data will be in the result_files folder. Make sure to increase memory and timeout on AWS Lambda to the maximum, since some files are pretty large and need time to write.
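Since the snippet above only unzips file_list[0], a small extension that walks every zipped key found under the prefix could look roughly like this (reusing the same names as above):

for zipped_key in file_list:
    zip_obj = s3_resource.Object(bucket_name=bucket, key=zipped_key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with zipfile.ZipFile(buffer) as z:
        for filename in z.namelist():
            # Write each extracted member back to S3 under result_files/
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=bucket,
                Key='result_files/' + filename)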
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back up again. Please note, however, that there is a default limit of 512 MB of temporary disk space for Lambda, so avoid decompressing too much data at the same time.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
Use boto3 to download the new file
Use the gzip Python library to extract files
Use boto3 to upload the resulting file(s)
Sample code:
import gzip
import io
import boto3

bucket = '<bucket_name>'
key = '<key_name>'

s3 = boto3.client('s3', use_ssl=False)
compressed_file = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=key)['Body'].read())
uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)
# key[:-3] drops the trailing '.gz' so the uncompressed object gets a clean name
s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
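To wire that sample up to the S3 trigger described above, a minimal Lambda handler might look roughly like this (the bucket and key come from the event payload; stripping the trailing '.gz' for the destination key is an assumption about your object naming):

import gzip
import io
import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The S3 event tells us which object was just created
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])

    compressed_file = io.BytesIO(
        s3.get_object(Bucket=bucket, Key=key)['Body'].read())
    uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)

    # Assumes the source key ends in '.gz'; drop the suffix for the output key
    s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])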

AWS static site file upload via boto 3 set the right content-types

What are the right content types for the different types of files of a static site hosted at AWS and how to set these in a smart way via boto3?
I use the upload_file method:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('allecijfers.nl')
bucket.upload_file('C:/Hugo/Sites/allecijfers/public/test/index.html', 'test/index.html', ExtraArgs={'ACL': 'public-read', 'ContentType': 'text/html'})
This works well for the HTML files. I initially left out the ExtraArgs, which results in a file download (probably because the content type is binary?). I found this page that states several content types but I am not sure how to apply it.
E.g. the CSS files should probably be uploaded with 'ContentType': 'text/css'.
But what about the js files, index.xml, etc.? And how to do this in a smart way? FYI this is my current script to upload from Windows to AWS; it requires .replace("\\", "/"), which probably is not the smartest either?
for root, dirs, files in os.walk(local_root + local_dir):
    for filename in files:
        # construct the full local path
        local_path = os.path.join(root, filename).replace("\\", "/")
        # construct the full S3 path
        relative_path = os.path.relpath(local_path, local_root)
        s3_path = os.path.join(relative_path).replace("\\", "/")
        bucket.upload_file(local_path, s3_path, ExtraArgs={'ACL': 'public-read', 'ContentType': 'text/html'})
I uploaded my complete Hugo site from the same source using the AWS CLI to the same S3 bucket and this works perfect without specifying content types, is this also possible via boto 3?
Many thanks in advance for your help!
There is a built-in Python library to guess MIME types, so you could just look up each filename first. It works like this:
import mimetypes
print(mimetypes.guess_type('filename.html'))
Result:
('text/html', None)
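Note that guess_type() returns (None, None) for extensions it does not recognize, so it is worth falling back to a generic type before passing it to boto3, for example:

import mimetypes

# Fall back to a generic binary type when the extension is unknown
content_type = mimetypes.guess_type('filename.unknown')[0] or 'application/octet-stream'
print(content_type)  # -> 'application/octet-stream'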
In your code, I also slightly improved the portability with respect to the Windows path. It will do the same thing, but be portable to a Unix platform by looking up the platform-specific separator (os.path.sep) that will be used in any paths.
import os
import boto3
import mimetypes

s3 = boto3.resource('s3')
bucket = s3.Bucket('allecijfers.nl')

for root, dirs, files in os.walk(local_root + local_dir):
    for filename in files:
        # construct the full local path (no need to convert to a unix-style
        # path here; keep it as a native windows path)
        local_path = os.path.join(root, filename)
        # construct the full S3 path
        relative_path = os.path.relpath(local_path, local_root)
        s3_path = relative_path.replace(os.path.sep, "/")
        # Get content type guess (fall back to a generic type if unknown)
        content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        bucket.upload_file(
            Filename=local_path,
            Key=s3_path,
            ExtraArgs={'ACL': 'public-read', 'ContentType': content_type}
        )