How do I use AWS Lambda to trigger Comprehend with S3? - amazon-web-services

I'm currently using AWS Lambda to trigger an Amazon Comprehend job, but my code only runs sentiment analysis on a single piece of text.
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = "bucketName"
    key = "textName.txt"
    file = s3.get_object(Bucket=bucket, Key=key)
    analysisdata = str(file['Body'].read())
    comprehend = boto3.client("comprehend")
    sentiment = comprehend.detect_sentiment(Text=analysisdata, LanguageCode="en")
    print(sentiment)
    return 'Sentiment detected'
I want to run a file where each line of the text file is a separate piece of text to analyze with sentiment analysis (that's an option when you enter text manually into Comprehend). Is there a way to alter this code to do that, and to have the output sentiment analysis file placed into the same S3 bucket? Thank you in advance.

It looks like you can use start_sentiment_detection_job():
response = client.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': 'string',
        'InputFormat': 'ONE_DOC_PER_FILE'|'ONE_DOC_PER_LINE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT'|'TEXTRACT_ANALYZE_DOCUMENT',
            'DocumentReadMode': 'SERVICE_DEFAULT'|'FORCE_DOCUMENT_READ_ACTION',
            'FeatureTypes': [
                'TABLES'|'FORMS',
            ]
        }
    },
    OutputDataConfig={
        'S3Uri': 'string',
        'KmsKeyId': 'string'
    },
    ...
)
It can read from an object in Amazon S3 (S3Uri) and store the output in an S3 object.
It looks like you could use 'InputFormat': 'ONE_DOC_PER_LINE' to meet your requirements.
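A minimal sketch of how the Lambda could kick off such a job. The bucket names and the IAM role ARN are placeholders you would replace; Comprehend requires a DataAccessRoleArn it can assume to read the input and write the output, and the output prefix can point back at the same bucket as the input:

```python
def build_sentiment_job_args(input_s3_uri, output_s3_uri, role_arn):
    # ONE_DOC_PER_LINE: each line of the input file is analyzed as its
    # own document, matching the one-text-per-line file in the question.
    return {
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        # Comprehend writes a results archive under this output prefix.
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
    }

def lambda_handler(event, context):
    import boto3  # imported here so the builder above stays dependency-free
    comprehend = boto3.client("comprehend")
    args = build_sentiment_job_args(
        "s3://bucketName/textName.txt",
        "s3://bucketName/comprehend-output/",
        "arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder
    )
    job = comprehend.start_sentiment_detection_job(**args)
    # The job runs asynchronously; the results land under the output prefix.
    return job["JobId"]
```

The job is asynchronous, so unlike detect_sentiment there is nothing to print immediately; you can poll it with describe_sentiment_detection_job or just wait for the output file to appear in the bucket.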

Related

Why is AWS Lambda returning a Key Error when trying to upload an image to S3 and updating a DynamoDB Table with API Gateway?

I am trying to upload a binary image to S3 and update a DynamoDB table in the same AWS Lambda function. The problem is, whenever I try to make an API call, I get the following error in Postman:
{
    "errorMessage": "'itemId'",
    "errorType": "KeyError",
    "requestId": "bccaead6-cb60-4a5e-9fc7-14ff25380451",
    "stackTrace": [
        "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    s3_upload = s3.put_object(Bucket=bucket, Key=event[\"itemId\"] + \".png\", Body=decode_content)\n"
    ]
}
My event takes in three strings, and whenever I try to access those strings, I get this error. However, if I access them without trying to upload to an S3 bucket, everything works fine. My Lambda function looks like this:
import json
import boto3
import base64

dynamoclient = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamoclient.Table("Items")
bucket = "images"

def lambda_handler(event, context):
    get_file_content = event["content"]
    decode_content = base64.b64decode(get_file_content)
    s3_upload = s3.put_object(Bucket=bucket, Key=event["itemId"] + ".png", Body=decode_content)
    table.put_item(
        Item={
            'itemID': event["itemId"],
            'itemName': event['itemName'],
            'itemDescription': event['itemDescription']
        }
    )
    return {
        "code": 200,
        "message": "Item was added successfully"
    }
Again, if I remove everything about the S3 file upload, everything works fine and I am able to update the DynamoDB table successfully. As for the API Gateway side, I have added the image/png to the Binary Media Types section. Additionally, for the Mapping Templates section for AWS API Gateway, I have added the content type image/png. In the template for the content type, I have the following lines:
{
    "content": "$input.body"
}
For my Postman POST request, I have set the headers section accordingly.
Finally, for the body section, I have added the raw event data with this:
{
    "itemId": "0fx170",
    "itemName": "Mouse",
    "itemDescription": "Smooth"
}
Lastly, for the binary section, I have uploaded my PNG file.
What could be going wrong?
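One thing worth checking, given the mapping template above: with a non-proxy integration, the Lambda event contains only the keys the mapping template defines, which here is just content, so event["itemId"] will always raise a KeyError on the binary request path. A sketch of a template that also forwards the other fields, assuming (this is an assumption, not from the original post) that itemId, itemName, and itemDescription are sent as query string parameters alongside the binary body:

```
{
    "content": "$input.body",
    "itemId": "$input.params('itemId')",
    "itemName": "$input.params('itemName')",
    "itemDescription": "$input.params('itemDescription')"
}
```

The raw JSON body in Postman never reaches the Lambda in this setup, because the image/png template replaces the whole payload with whatever the template emits.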

Can an AWS Athena query read from S3 through an S3 Object Lambda?

I understand I can write an S3 Object Lambda which can transparently alter the returned S3 contents on the fly during retrieval, without modifying the original object.
I understand also that Athena can read json or csv files from S3.
My question is, can both of these capabilities be combined, so that Athena queries would read data which is transparently altered on the fly via S3 Object Lambda prior to being parsed by Athena?
SOME CODE
Suppose I write a CSV file:
hello
world
Then I write an S3 Object Lambda:
import boto3
import requests

def lambda_handler(event, context):
    print(event)

    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    original_object = response.content.decode('utf-8')

    # Transform object
    transformed_object = original_object.upper()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
(Astute readers will notice this is the example from the AWS docs.)
Now suppose I create an Athena EXTERNAL table and write this query:
SELECT * from hello
How can I ensure that the Athena query will return WORLD instead of world in this scenario?

I would like to know how to import data into the app by modifying this lambda code

import boto3
import json

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = "cloud-translate-output"
    key = "key value"
    try:
        data = s3.get_object(Bucket=bucket, Key=key)
        json_data = data["Body"].read()
        return {
            "response_code": 200,
            "data": str(json_data)
        }
    except Exception as e:
        print(e)
        raise e
I'm making an iOS app with Xcode, and I want to use AWS to bring data from S3 into the app, in the order app - API Gateway - Lambda - S3. If I upload this data to bucket number 1 of S3, the CloudFormation will translate the uploaded text file and automatically save it to bucket number 2. I then want to import the text data file stored in bucket number 2 back into the app through Lambda. Instead of a hard-coded key value, is there a way to use only the name of the bucket?
if I upload this data to bucket number 1 of s3, the cloudformation will translate the uploaded text file and automatically save it to bucket number 2
Sadly, this is not how CloudFormation works. It can't read or translate files from buckets automatically, or upload them to new buckets.
I would stick with a lambda function. It is more suited to such tasks.
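A minimal sketch of what that Lambda could look like, assuming the translation is done with Amazon Translate and the Lambda is subscribed to S3 ObjectCreated notifications on bucket number 1. The bucket names, language codes, and output naming are placeholder assumptions, not something from the question:

```python
def output_key_for(input_key):
    # Keep the same key in the destination bucket, tagged so the
    # translated copy is easy to distinguish (the naming is a free choice).
    return input_key.rsplit(".", 1)[0] + ".translated.txt"

def lambda_handler(event, context):
    import boto3  # imported here so the helper above stays dependency-free
    s3 = boto3.client("s3")
    translate = boto3.client("translate")

    # S3 event notifications deliver the bucket and key of the new object.
    record = event["Records"][0]["s3"]
    src_bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    text = s3.get_object(Bucket=src_bucket, Key=key)["Body"].read().decode("utf-8")
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode="auto",   # let Translate detect the source language
        TargetLanguageCode="en",     # placeholder target
    )
    s3.put_object(
        Bucket="cloud-translate-output",  # bucket number 2
        Key=output_key_for(key),
        Body=result["TranslatedText"].encode("utf-8"),
    )
    return {"response_code": 200}
```

The app-facing Lambda can then list that output bucket by name (list_objects_v2) and fetch the newest object, rather than relying on a hard-coded key.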

Creating a CloudWatch Metrics from the Athena Query results

My Requirement
I want to create a CloudWatch-Metric from Athena query results.
Example
I want to create a metric like user_count of each day.
In Athena, I will write an SQL query like this
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in Cloudwatch.
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and these dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets, see "Run Amazon Athena queries" in Samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
Setup your query in Athena and make sure it works using the Athena console.
Create a Lambda that:
Runs your Athena query
Pulls the query results from S3
Parses the query results
Sends the query results to CloudWatch as a metric
Use EventBridge to run your Lambda on a recurring basis
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and put a metric into CloudWatch.
import time
import boto3

query = 'select count(*) from mytable'
DATABASE = 'default'
bucket = 'BUCKET_NAME'
path = 'yourpath'

def lambda_handler(event, context):
    # Run query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)

    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    # The S3 file name uses the QueryExecutionId, so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]

    # Occasionally Athena hasn't written the file before the Lambda
    # tries to pull it out of S3, so pause a few seconds.
    # Note: you are charged for the time the Lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first, then sleep.
    time.sleep(3)

    # Get the query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()

    # Split the file on line breaks and take the second line
    # (the first line is the CSV header).
    lines = file_content.decode().splitlines()
    count = lines[1]

    # Remove double quotes and convert to int,
    # since CloudWatch wants a numeric value.
    count = count.replace("\"", "")
    count = int(count)

    # Post the query result as a CloudWatch metric.
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response

How do you full text search an Amazon S3 bucket?

I have a bucket on S3 in which I have large amount of text files.
I want to search for some text within a text file. It contains raw data only.
And each text file has a different name.
For example, I have objects like:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
& I want to search text like "I am human" in the above text files.
How to achieve this? Is it even possible?
The only way to do this will be via CloudSearch, which can use S3 as a source. It works by building an index for rapid retrieval. This should work very well, but thoroughly check out the pricing model to make sure that this won't be too costly for you.
The alternative, as Jack said, is to transfer the files out of S3 to EC2 and build a search application there.
Since October 1st, 2015, Amazon offers another search service, Elasticsearch, in more or less the same vein as CloudSearch; you can stream data into it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to an S3 bucket triggers an event notification to this Lambda, which updates the ES index.
All steps are well detailed in the Amazon docs, with Java and JavaScript examples.
At a high level, setting up to stream data to Amazon ES requires the following steps:
Creating an Amazon S3 bucket and an Amazon ES domain
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
Although not an AWS native service, there is Mixpeek, which runs text extraction like Tika, Tesseract and ImageAI on your S3 files then places them in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
    aws_access_key_id=aws['aws_access_key_id'],
    aws_secret_access_key=aws['aws_secret_access_key'],
    region_name='us-east-2',
    mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
    api_key=mixpeek_api_key
)

# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
    {
        "_id": "REDACTED",
        "api_key": "REDACTED",
        "highlights": [
            {
                "path": "document_str",
                "score": 0.8759502172470093,
                "texts": [
                    {
                        "type": "text",
                        "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
                    },
                    {
                        "type": "hit",
                        "value": "Heartgard"
                    },
                    {
                        "type": "text",
                        "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
                    }
                ]
            }
        ],
        "metadata": {
            "date_inserted": "2021-10-07 03:19:23.632000",
            "filename": "prescription.pdf"
        },
        "score": 0.13313256204128265
    }
]
Then you parse the results as needed.
You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. Give it a bit of time to index the entire thing if you have a whole lot of data, and you should be good.
If you have an EMR cluster, create a Spark application and do the search there. We did this; it works as a distributed search.
I know this is really old, but hopefully someone finds my solution handy.
This is a Python script, using boto3.
import boto3

def search_word(info, search_for):
    return search_for in info

aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'

client = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking#emailaddress.com'

search_results = []
search_results_keys = []

response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix
)

for i in response['Contents']:
    obj = client.get_object(
        Bucket=bucket_name,
        Key=i['Key']
    )
    body = obj['Body'].read().decode("utf-8")
    key = i['Key']
    if search_word(body, search_for):
        search_results.append({key: body})
        search_results_keys.append(key)

# You can either print the keys (file name/directory), or a map where the key
# is the file name/directory and the value is the text of the file.
print(search_results)
print(search_results_keys)
There is a serverless and cheaper option available:
Use AWS Glue to convert the txt files into a table.
Use AWS Athena to run SQL queries on top of it.
I would also recommend putting the data in Parquet files on S3; this makes the data size on S3 very small and queries super fast!