Using Eventbridge to trigger Glue job but with delay - amazon-web-services

I want to create an EventBridge rule which triggers after a certain number of files are uploaded into an S3 bucket. For example: consider a prefix in the bucket that is initially empty (bucket/folder/[empty]), and the user needs to upload 5 files. Only after those five files are uploaded should the EventBridge rule be triggered. I tried searching for a rule pattern, but was unable to find anything related to this. Currently I am using:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["test-bucket-for-event"]
    },
    "object": {
      "key": [{
        "prefix": "folder/Latest/"
      }]
    }
  }
}
Can I specify a number here, for example "greater than 5"? Or how else could I configure that?
Help is appreciated. Thanks.

I created a Lambda function which triggers the Glue job after a certain number of files have been created:
import json
import boto3


def lambda_handler(event, context):
    bucket = "bucket-name"
    folder = "Folder/Subfolder/"
    s3 = boto3.client('s3')

    # List the objects under the prefix and print each key
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=folder)
    for obj in objs['Contents']:
        print(obj['Key'])

    # KeyCount includes the folder placeholder object itself, so subtract one
    file_count = objs['KeyCount'] - 1
    print(file_count)

    if file_count >= 5:
        print('the files are present, going to send notification')  # here add the Glue trigger
    else:
        print("None")

Related

How do I use AWS Lambda to trigger Comprehend with S3?

I'm currently using AWS Lambda to trigger an Amazon Comprehend job, but the code only runs sentiment analysis on a single piece of text.
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = "bucketName"
    key = "textName.txt"
    file = s3.get_object(Bucket=bucket, Key=key)
    analysisdata = str(file['Body'].read())
    comprehend = boto3.client("comprehend")
    sentiment = comprehend.detect_sentiment(Text=analysisdata, LanguageCode="en")
    print(sentiment)
    return 'Sentiment detected'
I want to analyze a file where each line of the text file is a separate piece of text for sentiment analysis (this is an option when you enter text manually in Comprehend). Is there a way to alter this code to do that, and have the output sentiment analysis file placed into the same S3 bucket? Thank you in advance.
It looks like you can use start_sentiment_detection_job():
response = client.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': 'string',
        'InputFormat': 'ONE_DOC_PER_FILE'|'ONE_DOC_PER_LINE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT'|'TEXTRACT_ANALYZE_DOCUMENT',
            'DocumentReadMode': 'SERVICE_DEFAULT'|'FORCE_DOCUMENT_READ_ACTION',
            'FeatureTypes': [
                'TABLES'|'FORMS',
            ]
        }
    },
    OutputDataConfig={
        'S3Uri': 'string',
        'KmsKeyId': 'string'
    },
    ...
)
It can read from an object in Amazon S3 (S3Uri) and store the output in an S3 object.
It looks like you could use 'InputFormat': 'ONE_DOC_PER_LINE' to meet your requirements.
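Putting that together, a minimal sketch of such a call (the S3 paths, IAM role ARN, and job name below are placeholders, not values from the question):

import boto3

comprehend = boto3.client('comprehend')

# All S3 URIs and the role ARN below are placeholder values.
response = comprehend.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': 's3://bucketName/textName.txt',
        'InputFormat': 'ONE_DOC_PER_LINE'  # treat each line as its own document
    },
    OutputDataConfig={
        'S3Uri': 's3://bucketName/comprehend-output/'
    },
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendS3AccessRole',
    LanguageCode='en',
    JobName='line-by-line-sentiment'
)
print(response['JobId'])  # poll with describe_sentiment_detection_job to track progress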

Add a new item to a DynamoDB table using an AWS Lambda function each time the function is executed with CloudWatch

I'm trying to modify a DynamoDB table each time a Lambda function is executed.
Specifically, I created a simple Lambda function that returns a list of S3 bucket names, and this function runs each minute thanks to a CloudWatch rule.
However, as I said before, my goal is to also update a DynamoDB table each time the function is executed. Specifically, I want to add a new item each time with the same attribute (so if the function is executed 1000 times, I want 1,000 items/rows).
However, I don't know how to do it. Any suggestions? Here's the code:
import json
import boto3

s3 = boto3.resource('s3')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Table')

def lambda_handler(event, context):
    bucket_list = []
    for b in s3.buckets.all():
        print(b.name)
        bucket_list.append(b.name)
    response = "done"
    table.put_item(
        Item={
            "Update": response
        }
    )
    return {
        "statusCode": 200,
        "body": bucket_list
    }
Thank you in advance
Your problem is that PutItem overwrites existing items if they have the same key. So every time you try to insert Update=done, it just overwrites the same item.
The very first sentence of the documentation states:
Creates a new item, or replaces an old item with a new item.
So what you need to do is to put something in your item that is unique, so that a new item is created instead of the old one being overwritten.
You could create a UUID or something like that, but I think it would be beneficial to use the time of execution. This way you could see when your last execution was etc.
from datetime import datetime

[...]

table.put_item(
    Item={
        "Update": response,
        "ProcessingTime": datetime.now().isoformat()
    }
)
Adding to what Jens stated, which is 100% correct.
You could use data from the event. The event will look something like this:
{
    "id": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
    "detail-type": "Scheduled Event",
    "source": "aws.events",
    "account": "123456789012",
    "time": "1970-01-01T00:00:00Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:events:us-west-2:123456789012:rule/ExampleRule"
    ],
    "detail": {}
}
The id value will be 100% unique, and the time value will be the time it was triggered.
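Building on both answers, a sketch of folding those event fields into the item (the attribute names ExecutionId and ExecutionTime are illustrative; a new row per invocation only results if the unique value lands in the table's key attribute):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Table')

def lambda_handler(event, context):
    # event['id'] is unique per scheduled invocation,
    # event['time'] is the moment the rule fired.
    table.put_item(
        Item={
            "ExecutionId": event["id"],      # illustrative attribute name
            "ExecutionTime": event["time"],  # illustrative attribute name
            "Update": "done"
        }
    )
    return {"statusCode": 200}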

Google Cloud Vision API using Cloud Shell: How can I run the API for multiple images? What should my request.json look like?

I've run a test on a single image [using Cloud Shell] and the request.json looks like the one below. How can I run the Vision API for an entire folder of images?
Also, why do the permissions on the images need to be public for the API to run?
Thanks.
{
  "requests": [
    {
      "image": {
        "source": {
          "gcsImageUri": "gs://visionapitest/landmark/test.jpeg"
        }
      },
      "features": [
        {
          "type": "LABEL_DETECTION",
          "maxResults": 10
        }
      ]
    }
  ]
}
If you want to perform the request using the Cloud Shell, you have to do it in the following way:
{
  "requests": [
    {
      "image": {
        "source": {
          "gcsImageUri": "gs://visionapitest/landmark/test.jpeg"
        }
      },
      "features": [
        {
          "type": "LABEL_DETECTION",
          "maxResults": 10
        }
      ]
    },
    {
      "image": {
        "source": {
          "gcsImageUri": "gs://visionapitest/landmark/test2.jpeg"
        }
      },
      "features": [
        {
          "type": "LABEL_DETECTION",
          "maxResults": 10
        }
      ]
    },
    …
  ]
}
Please note that there isn’t a way to specify a complete folder; as you can see, the “requests” field is an array of AnnotateImageRequest objects, so you have to itemize every image within the JSON file.
On the other hand, you can dynamically create the “requests” array by using one of the available Vision client libraries to read all the images within the folder. I would like to share a Python code snippet I took from the Vision API documentation; it originally handled only a single image, but I modified it to read the entire folder.
from google.cloud import vision_v1
from google.cloud.vision_v1 import enums
from google.cloud import storage
from google.cloud.vision_v1 import types
from re import search


def sample_async_batch_annotate_images(bucket_name, output_uri):
    """Perform async batch image annotation."""
    client = vision_v1.ImageAnnotatorClient()
    storage_client = storage.Client()

    blobs = storage_client.list_blobs(
        bucket_name, prefix='vision/label/', delimiter='/'
    )

    requests = []
    for blob in blobs:
        if search('jpg', blob.name):
            input_image_uri = 'gs://' + bucket_name + '/' + blob.name
            print(input_image_uri)
            source = {"image_uri": input_image_uri}
            image = {"source": source}
            features = [
                {"type": enums.Feature.Type.LABEL_DETECTION},
            ]
            request = types.AnnotateImageRequest(image=image, features=features)
            requests.append(request)

    gcs_destination = {"uri": output_uri}

    # The max number of responses to output in each JSON file
    batch_size = 2
    output_config = {"gcs_destination": gcs_destination,
                     "batch_size": batch_size}

    operation = client.async_batch_annotate_images(requests, output_config)

    print("Waiting for operation to complete...")
    response = operation.result(90)

    # The output is written to GCS with the provided output_uri as prefix
    gcs_output_uri = response.output_config.gcs_destination.uri
    print("Output written to GCS with prefix: {}".format(gcs_output_uri))
However, take this only as a reference; the details will depend on your use case and language preference.
Regarding the question about permissions, I assume you are referring to the Cloud Storage bucket permissions. To my understanding it is not necessary to make your images public; you only need to grant read/write Cloud Storage permissions on the bucket to the service account with which you are executing the requests.

Boto3 - Create S3 'object created' notification to trigger a lambda function

How do I use boto3 to simulate the Add Event Source action on the AWS GUI Console in the Event Sources tab?
I want to programmatically create a trigger such that if an object is created in MyBucket, it will call MyLambda function (qualified with an alias).
The relevant API call that I see in the boto3 documentation is create_event_source_mapping, but it states explicitly that it is only for the AWS pull model, while I think that S3 belongs to the push model. Anyway, I tried using it but it didn't work.
Passing a prefix filter would be nice too.
I was looking in the wrong place. This is configured on the S3 side:
s3 = boto3.resource('s3')
bucket_name = 'mybucket'
bucket_notification = s3.BucketNotification(bucket_name)
response = bucket_notification.put(
    NotificationConfiguration={'LambdaFunctionConfigurations': [
        {
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:033333333:function:mylambda:staging',
            'Events': [
                's3:ObjectCreated:*'
            ],
        },
    ]})
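Since the question also mentions a prefix filter, a sketch of the same call with a Filter block added (the prefix value 'uploads/' is just an example):

import boto3

s3 = boto3.resource('s3')
bucket_notification = s3.BucketNotification('mybucket')

response = bucket_notification.put(
    NotificationConfiguration={'LambdaFunctionConfigurations': [
        {
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:033333333:function:mylambda:staging',
            'Events': ['s3:ObjectCreated:*'],
            # Only notify for keys under this (example) prefix.
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'prefix', 'Value': 'uploads/'}
                    ]
                }
            },
        },
    ]})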

How do you full text search an Amazon S3 bucket?

I have a bucket on S3 in which I have a large number of text files.
I want to search for some text within those text files. They contain raw data only, and each text file has a different name.
For example, I have:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search for text like "I am human" in the above text files.
How can I achieve this? Is it even possible?
The only way to do this will be via CloudSearch, which can use S3 as a source. It works using rapid retrieval to build an index. This should work very well but thoroughly check out the pricing model to make sure that this won't be too costly for you.
The alternative is, as Jack said, to transfer the files out of S3 to an EC2 instance and build a search application there.
Since October 1st, 2015, Amazon offers another search service, Amazon Elasticsearch Service, in more or less the same vein as CloudSearch; you can stream data into it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to an S3 bucket triggers an event notification to the Lambda, which then updates the ES index (see the sketch after the steps below).
All steps are well detailed in the Amazon documentation, with Java and JavaScript examples.
At a high level, setting up to stream data to Amazon ES requires the following steps:
Creating an Amazon S3 bucket and an Amazon ES domain.
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
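A minimal sketch of such a Lambda handler, assuming the requests and requests_aws4auth packages are bundled in the deployment package; the region, ES domain endpoint, and index name below are placeholders:

import boto3
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'  # placeholder region
host = 'https://search-mydomain.us-east-1.es.amazonaws.com'  # placeholder ES endpoint
index_url = host + '/s3-files/_doc'  # placeholder index name

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, 'es', session_token=credentials.token)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # For each newly created object, read its text and index it into ES.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        document = {'bucket': bucket, 'key': key, 'content': body}
        requests.post(index_url, auth=awsauth, json=document)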
Although not an AWS native service, there is Mixpeek, which runs text extraction like Tika, Tesseract and ImageAI on your S3 files then places them in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
    aws_access_key_id=aws['aws_access_key_id'],
    aws_secret_access_key=aws['aws_secret_access_key'],
    region_name='us-east-2',
    mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
    api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
    {
        "_id": "REDACTED",
        "api_key": "REDACTED",
        "highlights": [
            {
                "path": "document_str",
                "score": 0.8759502172470093,
                "texts": [
                    {
                        "type": "text",
                        "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
                    },
                    {
                        "type": "hit",
                        "value": "Heartgard"
                    },
                    {
                        "type": "text",
                        "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
                    }
                ]
            }
        ],
        "metadata": {
            "date_inserted": "2021-10-07 03:19:23.632000",
            "filename": "prescription.pdf"
        },
        "score": 0.13313256204128265
    }
]
Then you parse the results as needed.
You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index everything if you have a lot of data, and you should be good.
If you have an EMR cluster, then create a Spark application and do the search there. We did this; it works as a distributed search.
I know this is really old, but hopefully someone finds my solution handy.
This is a Python script using boto3.
import boto3
import json


def search_word(info, search_for):
    # Return True if the search string appears anywhere in the text.
    return search_for in info


aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'

client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)
s3 = boto3.resource('s3')

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking#emailaddress.com'

search_results = []
search_results_keys = []

response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix
)

for i in response['Contents']:
    obj = client.get_object(
        Bucket=bucket_name,
        Key=i['Key']
    )
    body = obj['Body'].read().decode("utf-8")
    key = i['Key']
    if search_word(body, search_for):
        mini = {key: body}
        search_results.append(mini)
        search_results_keys.append(key)

# YOU CAN EITHER PRINT THE KEY (FILE NAME/DIRECTORY), OR A MAP WHERE THE KEY IS
# THE FILE NAME/DIRECTORY AND THE VALUE IS THE TXT OF THE FILE
print(search_results)
print(search_results_keys)
There is a serverless and cheaper option available:
Use AWS Glue to convert the txt files into a table.
Use AWS Athena to run SQL queries on top of it.
I would recommend putting the data in Parquet files on S3; this makes the data size on S3 very small and queries super fast!
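As a sketch, once a Glue crawler has catalogued the files, the query could be run from Python like this (the database, table, column, and results bucket names are placeholders):

import boto3

athena = boto3.client('athena')

# Database, table, column, and output bucket below are placeholder names.
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_text_table WHERE line LIKE '%I am human%'",
    QueryExecutionContext={'Database': 'my_glue_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
print(response['QueryExecutionId'])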