Lambda reading file on S3 - flushing S3 cache

I have a problem regarding caching on S3. Basically, I have a Lambda that reads a file on S3 which is used as configuration. The file is JSON, and I am using Python with boto3 to extract the needed info.
Snippet of my code:
import json
import boto3

s3 = boto3.resource('s3')
bucketname = "configurationbucket"
itemname = "conf.json"

obj = s3.Object(bucketname, itemname)
body = obj.get()['Body'].read()
json_parameters = json.loads(body)

def my_handler(event, context):
    # using the json_parameters data
    ...
The problem is that when I change the JSON content and upload the file again to S3, my Lambda seems to read the old values, which I suppose is due to S3 doing caching somewhere.
Now I think that there are two ways to solve this problem:
to force S3 to invalidate its cache content
to force my lambda to reload the file from S3 without using the cache
I would prefer the first solution, because I think it will reduce computation time (reloading the file is an expensive operation). So, how can I flush the cache? I couldn't find a simple way to do this in the console or in the AWS guides.

The problem is that the code outside of the function handler is initialized only once per execution environment. It won't be re-initialized while the Lambda is warm, so S3 isn't caching anything; your function is simply reusing the values it loaded at startup. Move the read inside the handler:
def my_handler(event, context):
    # read the configuration from S3 on every invocation
    obj = s3.Object(bucketname, itemname)
    body = obj.get()['Body'].read()
    json_parameters = json.loads(body)
    # use the json_parameters data
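If re-reading the object on every invocation feels too expensive, a common compromise is to cache the parsed configuration in the execution environment and refresh it only after a time-to-live expires. A rough sketch reusing the names from the question (the 60-second TTL is an assumption):

import json
import time
import boto3

s3 = boto3.resource('s3')
bucketname = "configurationbucket"
itemname = "conf.json"

_cache = {"loaded_at": 0.0, "data": None}
CACHE_TTL_SECONDS = 60  # assumed value; tune to how often the config changes

def load_parameters():
    # reload from S3 only when the cached copy is older than the TTL
    if _cache["data"] is None or time.time() - _cache["loaded_at"] > CACHE_TTL_SECONDS:
        body = s3.Object(bucketname, itemname).get()['Body'].read()
        _cache["data"] = json.loads(body)
        _cache["loaded_at"] = time.time()
    return _cache["data"]

def my_handler(event, context):
    json_parameters = load_parameters()
    # use the json_parameters data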

Related

Why is lambda getting randomly timed out while trying to read the head object for a key on S3 bucket?

I am working on a feature where a user can upload multiple files which need to be parsed and converted to PDF if required. For that I'm using AWS, and when the user selects N files for upload, the following happens:
The client browser is connected to an AWS WebSocket API which is responsible for sending back the parsed data to respective clients later.
A signed URL for S3 is fetched from the web server, and all of the user's files are uploaded to an S3 bucket with it.
As soon as each file is uploaded, a lambda function is triggered for it which fetches the object for that file in order to get the content and some metadata to associate the files with respective clients.
Once the files are parsed, the response data is sent back to the respective connected clients via the WebSocket and the browser JS catches the event data and renders it.
The issue I'm facing is that the Lambda function randomly times out at the line which fetches the object for the file (either head_object or get_object). This happens for roughly 50% of the files (usually I test by sending 15 files at once, and 6-7 of them fail).
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"], encoding="utf-8")
    response = s3.get_object(Bucket=bucket, Key=key)  # This (or head_object) gets stuck for ~50% of the files
What I have observed is that even if head_object or get_object is called for a file that already exists on S3, rather than for the file whose upload triggered the Lambda, it still times out at the same rate.
But if the objects are fetched in bulk via a local script using boto3, all 15 files are fetched in under a second.
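For reference, a local comparison script along these lines (bucket name and keys are placeholders) fetches the same objects almost instantly:

import time
import boto3

s3 = boto3.client("s3")
bucket = "my-upload-bucket"  # placeholder
keys = [f"uploads/file_{i}.docx" for i in range(15)]  # placeholder keys

start = time.time()
for key in keys:
    s3.head_object(Bucket=bucket, Key=key)  # or get_object
print(f"fetched {len(keys)} objects in {time.time() - start:.2f}s")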
I have also tried using my own AWS access key ID and secret key in the Lambda to rule out any issue caused by the temporarily generated credentials.
So it seems that multiple Lambda instances have trouble fetching the S3 objects in parallel, which shouldn't happen, as AWS is supposed to scale well.
What should be done to get around it?

When uploading a file into aws s3 with boto3 is it possible to get the s3 object url as a return value?

I am uploading an image file into AWS S3 using the boto3 library. I noticed that the ending of the S3 object URL does not match the given Key. Is it possible to get the S3 object URL as a return value from the boto3 upload_file function?
example:
import boto3
s3 = boto3.client('s3')
file_location = ...
bucket = ...
folder = ...
filename = ...
url = s3.upload_file(
    Filename=file_location,
    Bucket=bucket,
    Key=f'{folder}/{filename}',
)
I read in the docs that it might be possible with a callback function, but I could not get it working with boto3.
If not, what is the simplest way to get the uploaded object's URL?
Using the AWS SDK, you can get a URL for an object in an Amazon S3 bucket. I am not sure there is a Python example for this exact use case; however, you can get an idea of how to perform the task by looking at the Java example:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/GetObjectUrl.java
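In Python, a rough equivalent is to upload and then build the virtual-hosted-style URL yourself, since upload_file itself returns None. A sketch, assuming a standard region and URL-encoding the key (the bucket and key names are placeholders):

import urllib.parse
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'            # placeholder
key = 'my-folder/photo #1.jpg'  # placeholder

s3.upload_file(Filename='photo.jpg', Bucket=bucket, Key=key)

# upload_file returns None, so construct the URL from bucket, region and key
region = s3.get_bucket_location(Bucket=bucket)['LocationConstraint'] or 'us-east-1'
url = f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"
print(url)  # characters like '#' come out as %23, matching what the console shows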
OK, my problem was that the filenames I was using to build the object Key contained a hash symbol, which has a special meaning in URLs. AWS was automatically encoding the hash as %23, which created the mismatch between the Key and the URL. I changed the file naming convention so the names contain no hash, and the problem no longer occurs.

AWS Lambda create folder in S3 bucket

I have a Lambda that runs when files are uploaded to bucket S3-A and moves those files to another bucket, S3-B. The challenge is that I need to create a folder inside the S3-B bucket named after the upload date of the files and move the files into that folder. Any help or ideas are greatly appreciated. It might sound confusing, so feel free to ask questions. Thank you!
Here's a Lambda function that can be triggered by an Amazon S3 event and copies the object to another bucket under a date prefix:
import urllib.parse
from datetime import date

import boto3

DEST_BUCKET = 'bucket-b'

def lambda_handler(event, context):
    s3_client = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    dest_key = str(date.today()) + '/' + key
    s3_client.copy_object(
        Bucket=DEST_BUCKET,
        Key=dest_key,
        CopySource=f'{bucket}/{key}'
    )
The only thing to consider is timezones. The Lambda function runs in UTC, so you might be expecting a slightly different date in your own timezone and might need to adjust the time accordingly.
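For example, a small sketch that builds the date prefix with a fixed local offset instead of UTC (the +10 hour offset is only an example, and a fixed offset ignores daylight saving):

from datetime import datetime, timedelta, timezone

# UTC+10 is only an example; substitute your own offset
local_tz = timezone(timedelta(hours=10))
dest_key = str(datetime.now(local_tz).date()) + '/' + key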
Just to clear up some confusion, in S3 there is no such thing as a folder. What you see in the interface is actually running the ListObjects using a prefix. The prefix is what you are seeing as the folder hierarchy.
To help illustrate this, an object might have a key (which is a piece of metadata that defines its name) of folder/subfolder/file.txt; in the console you are actually browsing with a prefix of folder/subfolder/*. This makes sense if you think of S3 more like a key-value store, where the value is the object itself.
For this reason you can make a key on a prefix that has not existed before without creating any other hierarchical features.
In your Lambda function, you could download the files locally and then upload them under the new object key (remembering to delete the old object), but some SDKs have a function that performs all of these steps for you (such as Boto3 with its copy function).
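A minimal sketch of that copy-then-delete "move" using a server-side copy (bucket names and keys are whatever your trigger provides):

import boto3

s3 = boto3.client('s3')

def move_object(source_bucket, dest_bucket, key, dest_key):
    # server-side copy into the dated prefix, then remove the original
    s3.copy_object(
        Bucket=dest_bucket,
        Key=dest_key,
        CopySource={'Bucket': source_bucket, 'Key': key}
    )
    s3.delete_object(Bucket=source_bucket, Key=key)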

How to use the Amazon Textract with PDF files

I can already use Textract, but only with JPEG files. I would like to use it with PDF files.
I have the code below:
import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})
# print(response)

# Collect detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]
        # print('\033[94m' + item["Text"] + '\033[0m')

# Remove the quotation marks from the string, otherwise they would cause problems for the A.I.
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')

print(documentText)
As I said, it works fine, but I would like to use it by passing a PDF file, as in the web demo application.
I know it is possible to convert the PDF to JPEG in Python, but it would be nicer to work with the PDF directly. I read the documentation and did not find the answer.
How can I do that?
EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.
As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think:
Create a new S3 bucket in the console and write down the bucket name, then:
import random

import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path + filename, filename)

client = boto3.client('textract')
response = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename}},
    # ClientRequestToken must be a string (see the comment further down)
    ClientRequestToken=str(random.randint(1, 10**10)))

jobid = response['JobId']
response = client.get_document_text_detection(JobId=jobid)
It may take 5-50 seconds until the call to get_document_text_detection(...) returns a result; before that, it will report that the job is still processing.
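In practice that means polling until JobStatus leaves IN_PROGRESS, roughly like this (the sleep interval is just an assumption):

import time

response = client.get_document_text_detection(JobId=jobid)
while response['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)  # assumed polling interval
    response = client.get_document_text_detection(JobId=jobid)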
According to my understanding, exactly one paid API call is performed per token; if a token has been used before, the earlier result is retrieved instead of running a new analysis.
Edit:
I forgot to mention that there is one intricacy: if the document is large, the result may need to be stitched together from multiple 'pages'. The kind of code you will need to add is:
...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
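The detected lines can then be pulled out of every page, much like in the synchronous JPEG example above:

documentText = ""
for page in pages:
    for item in page["Blocks"]:
        if item["BlockType"] == "LINE":
            documentText += item["Text"] + "\n"
print(documentText)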
As mentioned on the AWS Textract FAQ page (https://aws.amazon.com/textract/faqs/), PDF files are supported, and they are supported in the SDK as well: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html
Sample usage https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/12-pdf-text.py
Since you want to work with PDF files, you will have to use the Amazon Textract asynchronous APIs (StartDocumentAnalysis, StartDocumentTextDetection), and it is currently not possible to pass PDF files to them directly.
This is because the Amazon Textract asynchronous APIs only accept documents located in S3.
From AWS Textract doc:
Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects.
Upload the PDF to an S3 bucket. After that, you can easily use the available functions, such as start_document_analysis, to read the PDF directly from S3 and run Textract on it.
It works (almost); I had to make ClientRequestToken a string instead of an integer.

aws lambda s3 events for existing files

I am considering moving to Lambdas, and after spending some time reading the docs and various blogs with user experiences I am still struggling with a simple question: is there a proposed/proper way to use Lambda with existing S3 files?
I have an S3 bucket that contains archived data spanning a couple of years. The amount of data is rather large (hundreds of GB). Each file is a simple txt file, and each line in a file represents an event as a comma-separated string.
My endgame is to consume these files, parse each one of them line by line, apply some transformation, create batches of lines and send them to an external service. From what I've read so far, if I write a proper Lambda, it will be triggered by S3 events (for example, the upload of a new file).
Is there a way to apply the lambda to all the existing contents of my bucket?
Thanks
For existing resources you would need to write a script that lists all your objects and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kinesis stream and configuring a Lambda function to pull records from that stream and process them.
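A rough sketch of that backfill script (the stream and bucket names are placeholders):

import json
import boto3

s3 = boto3.client('s3')
kinesis = boto3.client('kinesis')

BUCKET = 'my-archive-bucket'      # placeholder
STREAM = 'existing-files-stream'  # placeholder

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get('Contents', []):
        # one record per existing object; the Lambda consumes these from the stream
        kinesis.put_record(
            StreamName=STREAM,
            Data=json.dumps({'bucket': BUCKET, 'key': obj['Key']}).encode('utf-8'),
            PartitionKey=obj['Key'],
        )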
Try using s3cmd.
s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder
This modifies the objects' metadata and thereby invokes an S3 event for the Lambda.
I had a similar problem and solved it with minimal changes to my existing Lambda function. The solution involves creating an API Gateway trigger (in addition to the S3 trigger): the API Gateway trigger is used to process historical files already in S3, while the regular S3 trigger processes files as they are uploaded to my S3 bucket.
Initially, I built my function to expect an S3 event as the trigger. Recall that S3 events have this structure, so I would look up the S3 bucket name and key to process, like so:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
But when the API Gateway trigger fires, there is no "Records" object in the request/event. You can use query parameters with the API Gateway trigger, so the modification required to the above snippet is:
if 'Records' in event:
    # this means we are working off of an S3 event
    records_to_process = event['Records']
else:
    # this is for ad-hoc posts via the API Gateway trigger for Lambda
    records_to_process = [{
        "s3": {
            "bucket": {"name": event["queryStringParameters"]["bucket"]},
            "object": {"key": event["queryStringParameters"]["file"]}
        }
    }]

for record in records_to_process:
    # the lines below are the same as in the earlier snippet
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
    temp_dir = tempfile.TemporaryDirectory()
    video_filename = os.path.basename(key)
    local_video_filename = os.path.join(temp_dir.name, video_filename)
    s3_client.download_file(bucket, key, local_video_filename)
(Screenshot: Postman result of sending the POST request.)
Try copying your bucket contents and catching the create events with Lambda.
copy:
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket
for larger buckets:
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py
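If you would rather stay in Python, a rough boto3 equivalent of that sync (bucket names are placeholders):

import boto3

s3 = boto3.resource('s3')
SRC_BUCKET = 'from-this-bucket'  # placeholder
DST_BUCKET = 'to-this-bucket'    # placeholder

# copying each object into the destination bucket fires an ObjectCreated event there
for obj in s3.Bucket(SRC_BUCKET).objects.all():
    s3.Bucket(DST_BUCKET).copy({'Bucket': SRC_BUCKET, 'Key': obj.key}, obj.key)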