How do you full text search an Amazon S3 bucket? - amazon-web-services

I have an S3 bucket that contains a large number of text files, and I want to search for some text within them.
Each file contains raw data only, and each has a different name.
For example, the bucket contains objects such as:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search for text like "I am human" in those files.
How can I achieve this? Is it even possible?

One way to do this is via Amazon CloudSearch, which can use S3 as a source. It works by building an index to enable rapid retrieval. This should work very well, but thoroughly check out the pricing model to make sure it won't be too costly for you.
The alternative is, as Jack said, to transfer the files out of S3 to an EC2 instance and build a search application there.

Since October 1st, 2015, Amazon offers another search service, Amazon Elasticsearch Service, in more or less the same vein as CloudSearch; you can stream data into it from Amazon S3 buckets.
It works with a Lambda function: any new data sent to an S3 bucket triggers an event notification to the Lambda, which then updates the ES index.
All steps are well detailed in the Amazon docs, with Java and JavaScript examples.
At a high level, setting up to stream data to Amazon ES requires the following steps:
Creating an Amazon S3 bucket and an Amazon ES domain.
Creating a Lambda deployment package.
Configuring a Lambda function.
Granting authorization to stream data to Amazon ES.
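At a minimum, the Lambda function in these steps has to unpack the S3 event notification to find the new object before it can index it. A rough sketch of that part follows; the indexing step itself is left as a comment, and all names are illustrative, not from the AWS docs:

```python
import urllib.parse

def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    pairs = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # object keys arrive URL-encoded in event notifications
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        pairs.append((bucket, key))
    return pairs

def lambda_handler(event, context):
    for bucket, key in extract_s3_objects(event):
        # Hypothetical next steps: fetch the object with boto3 and index it,
        # e.g. via a signed HTTP PUT to the Amazon ES domain endpoint.
        print(f"would index s3://{bucket}/{key}")
```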

Although not an AWS-native service, there is Mixpeek, which runs text extraction (like Tika, Tesseract, and ImageAI) on your S3 files and then places the results in a Lucene index to make them searchable.
You integrate it as follows:
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
    aws_access_key_id=aws['aws_access_key_id'],
    aws_secret_access_key=aws['aws_secret_access_key'],
    region_name='us-east-2',
    mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
    api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
  {
    "_id": "REDACTED",
    "api_key": "REDACTED",
    "highlights": [
      {
        "path": "document_str",
        "score": 0.8759502172470093,
        "texts": [
          {
            "type": "text",
            "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
          },
          {
            "type": "hit",
            "value": "Heartgard"
          },
          {
            "type": "text",
            "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
          }
        ]
      }
    ],
    "metadata": {
      "date_inserted": "2021-10-07 03:19:23.632000",
      "filename": "prescription.pdf"
    },
    "score": 0.13313256204128265
  }
]
Then you parse the results.

You can use Filestash (disclaimer: I'm the author): install your own instance and connect it to your S3 bucket. If you have a whole lot of data, give it a bit of time to index everything, and you should be good.

If you have an EMR cluster, you can create a Spark application and run the search there. We did this; it works as a distributed search.
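As a rough illustration of that approach (this is not the answerer's actual code; the path pattern and session setup are placeholders), a PySpark job can filter matching lines, with the Spark calls left commented since they need a cluster:

```python
def contains_phrase(line, phrase="I am human"):
    """Predicate applied to each line of text; True when the phrase occurs."""
    return phrase in line

# Hypothetical Spark driver code (app name and S3 path pattern are placeholders):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("s3-text-search").getOrCreate()
# lines = spark.sparkContext.textFile("s3://abc/myfolder/*.txt")
# matches = lines.filter(contains_phrase).collect()
```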

I know this is really old, but hopefully someone finds my solution handy.
This is a Python script using boto3:
import boto3

def search_word(info, search_for):
    """Return True if `search_for` occurs in `info`."""
    return search_for in info

aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'
client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking@emailaddress.com'

search_results = []
search_results_keys = []
response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix
)
for i in response['Contents']:
    obj = client.get_object(
        Bucket=bucket_name,
        Key=i['Key']
    )
    body = obj['Body'].read().decode("utf-8")
    key = i['Key']
    if search_word(body, search_for):
        search_results.append({key: body})
        search_results_keys.append(key)

# You can either print the keys (file names/directories), or a map where the
# key is the file name/directory and the value is the text of the file.
print(search_results)
print(search_results_keys)

There is a serverless and cheaper option available:
Use AWS Glue to convert the text files into a table.
Use AWS Athena to run SQL queries on top of it.
I would recommend storing the data as Parquet on S3; this makes the data size on S3 very small and the queries super fast!
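As a sketch of the Athena side of this approach (the table and column names here are hypothetical; the real schema would come from your Glue crawler), a full-text match can be expressed with a LIKE predicate:

```python
def build_search_query(table, term):
    """Build an Athena SQL statement that returns rows containing `term`.

    `line` is a hypothetical column name produced by a Glue crawler over raw text.
    """
    return f"SELECT * FROM {table} WHERE line LIKE '%{term}%'"

# Hypothetical execution (database and output bucket are placeholders):
# import boto3
# athena = boto3.client('athena')
# athena.start_query_execution(
#     QueryString=build_search_query('my_text_table', 'I am human'),
#     QueryExecutionContext={'Database': 'my_glue_db'},
#     ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
# )
```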

Related

Using Eventbridge to trigger Glue job but with delay

I want to create an EventBridge rule which triggers after a certain number of files are uploaded into the S3 bucket. For example: consider that a certain prefix in the bucket is empty (bucket/folder/[empty]) and the user needs to upload 5 files. Only after those five files are uploaded should the rule be triggered. I tried searching for a rule pattern, but was unable to find anything related to this. Currently I am using:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["test-bucket-for-event"]
    },
    "object": {
      "key": [{
        "prefix": "folder/Latest/"
      }]
    }
  }
}
Can I specify numbers here, like using "greater than 5", etc.? Or how should I configure that?
Help is appreciated.
Thanks.
I created a Lambda function which is used to trigger the Glue job after a certain number of files have been created:
import boto3

def lambda_handler(event, context):
    bucket = "bucket-name"
    folder = "Folder/Subfolder/"
    s3 = boto3.client('s3')
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=folder)
    for key in objs['Contents']:
        print(key['Key'])
    # KeyCount includes the folder placeholder object itself, so subtract one
    file_count = objs['KeyCount'] - 1
    print(file_count)
    if file_count >= 5:
        print('the files are present, going to send notification')  # add the Glue trigger here
    else:
        print("None")

How do I use AWS Lambda to trigger Comprehend with S3?

I'm currently using AWS Lambda to trigger an Amazon Comprehend job, but my code only runs sentiment analysis on a single piece of text:
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    bucket = "bucketName"
    key = "textName.txt"
    file = s3.get_object(Bucket=bucket, Key=key)
    analysisdata = str(file['Body'].read())
    comprehend = boto3.client("comprehend")
    sentiment = comprehend.detect_sentiment(Text=analysisdata, LanguageCode="en")
    print(sentiment)
    return 'Sentiment detected'
I want to run a file where each line in the text file is a new piece of text to analyze with sentiment analysis (it's an option if you manually enter stuff into comprehend), but is there a way to alter this code to do that? And have the output sentiment analysis file be placed into that same S3 bucket? Thank you in advance.
It looks like you can use start_sentiment_detection_job():
response = client.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': 'string',
        'InputFormat': 'ONE_DOC_PER_FILE'|'ONE_DOC_PER_LINE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT'|'TEXTRACT_ANALYZE_DOCUMENT',
            'DocumentReadMode': 'SERVICE_DEFAULT'|'FORCE_DOCUMENT_READ_ACTION',
            'FeatureTypes': [
                'TABLES'|'FORMS',
            ]
        }
    },
    OutputDataConfig={
        'S3Uri': 'string',
        'KmsKeyId': 'string'
    },
    ...
)
It can read from an object in Amazon S3 (S3Uri) and store the output in an S3 object.
It looks like you could use 'InputFormat': 'ONE_DOC_PER_LINE' to meet your requirements.
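A minimal sketch of assembling such a request for the per-line case follows; the bucket paths and IAM role ARN are placeholders, and the actual service call is left commented:

```python
def build_sentiment_job_request(input_uri, output_uri, role_arn):
    """Assemble kwargs for comprehend.start_sentiment_detection_job so that
    each line of the input file is treated as a separate document."""
    return {
        'InputDataConfig': {
            'S3Uri': input_uri,
            'InputFormat': 'ONE_DOC_PER_LINE',  # one document per line of text
        },
        'OutputDataConfig': {'S3Uri': output_uri},
        'DataAccessRoleArn': role_arn,  # role Comprehend assumes to read/write S3
        'LanguageCode': 'en',
    }

# Hypothetical invocation (bucket paths and role ARN are placeholders):
# import boto3
# comprehend = boto3.client('comprehend')
# comprehend.start_sentiment_detection_job(**build_sentiment_job_request(
#     's3://bucketName/textName.txt',
#     's3://bucketName/comprehend-output/',
#     'arn:aws:iam::123456789012:role/ComprehendDataAccess'))
```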

Why is AWS Lambda returning a Key Error when trying to upload an image to S3 and updating a DynamoDB Table with API Gateway?

I am trying to upload a binary Image to S3 and update a DynamoDB table in the same AWS Lambda Function. The problem is, whenever I try to make an API call, I get the following error in postman:
{
  "errorMessage": "'itemId'",
  "errorType": "KeyError",
  "requestId": "bccaead6-cb60-4a5e-9fc7-14ff25380451",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n    s3_upload = s3.put_object(Bucket=bucket, Key=event[\"itemId\"] + \".png\", Body=decode_content)\n"
  ]
}
My events section takes in three strings, and whenever I try to access those strings, I get this error. However, if I try to access them without uploading to an S3 bucket, everything works fine. My Lambda function looks like this:
import json
import boto3
import base64

dynamoclient = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamoclient.Table("Items")
bucket = "images"

def lambda_handler(event, context):
    get_file_content = event["content"]
    decode_content = base64.b64decode(get_file_content)
    s3_upload = s3.put_object(Bucket=bucket, Key=event["itemId"] + ".png", Body=decode_content)
    table.put_item(
        Item={
            'itemID': event["itemId"],
            'itemName': event['itemName'],
            'itemDescription': event['itemDescription']
        }
    )
    return {
        "code": 200,
        "message": "Item was added successfully"
    }
Again, if I remove everything about the S3 file upload, everything works fine and I am able to update the DynamoDB table successfully. As for the API Gateway side, I have added the image/png to the Binary Media Types section. Additionally, for the Mapping Templates section for AWS API Gateway, I have added the content type image/png. In the template for the content type, I have the following lines:
{
    "content": "$input.body"
}
For my Postman POST request, I have set the headers accordingly (screenshot omitted).
Finally, for the body section, I have added the raw event data with this:
{
    "itemId": "0fx170",
    "itemName": "Mouse",
    "itemDescription": "Smooth"
}
Lastly, for the binary section, I have uploaded my PNG file.
What could be going wrong?

Search for 2 strings from multiple pdfs in AWS S3 Bucket which has sub directories without downloading those in local machine

I'm looking to search for two words in multiple PDFs located in an AWS S3 bucket. However, I don't want to download those docs to a local machine; instead, the search should run directly on those PDFs via URL. Note that these PDFs are located in multiple subdirectories within a bucket (like a year folder, then a month folder, then a date folder).
Amazon S3 does not have a 'Search' capability. It is a "simple storage service".
You would either need to download those documents to some form of compute platform (eg EC2, Lambda, or your own computer) and perform the searches, or you could pre-index the documents using a service like Amazon OpenSearch Service and then send the query to the search service.
Running a direct scan of PDFs to search for text in an S3 bucket is HARD:
Some PDFs contain text embedded inside images (it is not readable in text form).
If you want to process a PDF without saving it, consider using memory-optimized machines; don't store the files on the virtual machine's disk, and use in-memory streams instead.
To get around text inside images, you would need OCR logic, which is also HARD to execute. You'll probably want to use AWS Textract or Google Vision for OCR. If compliance and security are an issue, you could use Tesseract.
If you do have a reliable OCR solution, I would suggest running a text extraction job whenever an upload event happens; this will save you tons of money on whatever OCR service you consume, and it will also let your organization cache the contents of the PDFs in text form in more search-friendly services like AWS OpenSearch.
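The in-memory streaming mentioned above can be sketched with boto3's download_fileobj; the bucket and key names here are placeholders, and the client is passed in rather than constructed:

```python
import io

def download_to_memory(s3_client, bucket, key):
    """Stream an S3 object into an in-memory buffer instead of onto disk."""
    buffer = io.BytesIO()
    s3_client.download_fileobj(bucket, key, buffer)
    buffer.seek(0)  # rewind so a parser can read from the start
    return buffer

# Hypothetical usage (client construction and object names are placeholders):
# import boto3
# pdf_buffer = download_to_memory(boto3.client('s3'), 'my-bucket', '2022/05/01/report.pdf')
```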
Here's a tutorial which uses Tika (for PDF OCR) and OpenSearch (for search engine) to search the contents of PDF files within an S3 bucket:
import sys

import boto3
from tika import parser
from opensearchpy import OpenSearch

from config import *

# opensearch object
os = OpenSearch(opensearch_uri)

s3_file_name = "prescription.pdf"
bucket_name = "mixpeek-demo"


def download_file():
    """Download the file

    :param str s3_file_name: name of s3 file
    :param str bucket_name: bucket name of where the s3 file is stored
    """
    # s3 boto3 client instantiation
    s3_client = boto3.client(
        's3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name
    )
    # download the object into a local file
    with open(s3_file_name, 'wb') as file:
        s3_client.download_fileobj(
            bucket_name,
            s3_file_name,
            file
        )
    print("file downloaded")

    # parse the file
    parsed_pdf_content = parser.from_file(s3_file_name)['content']
    print("file contents extracted")

    # insert parsed pdf content into the search engine
    insert_into_search_engine(s3_file_name, parsed_pdf_content)
    print("file contents inserted into search engine")


def insert_into_search_engine(s3_file_name, parsed_pdf_content):
    """Insert the parsed contents into the index

    :param str s3_file_name: name of s3 file
    :param str parsed_pdf_content: extracted contents of PDF file
    """
    doc = {
        "filename": s3_file_name,
        "parsed_pdf_content": parsed_pdf_content
    }
    # insert
    resp = os.index(
        index=index_name,
        body=doc,
        id=1,
        refresh=True
    )
    print('\nAdding document:')
    print(resp)


def create_index():
    """Create the index"""
    index_body = {
        'settings': {
            'index': {
                'number_of_shards': 1
            }
        }
    }
    response = os.indices.create(index_name, body=index_body)
    print('\nCreating index:')
    print(response)


if __name__ == '__main__':
    globals()[sys.argv[1]]()
full tutorial: https://medium.com/@mixpeek/search-text-from-pdf-files-stored-in-an-s3-bucket-2f10947eebd3
Corresponding github repo: https://github.com/mixpeek/pdf-search-s3

Boto3 - Create S3 'object created' notification to trigger a lambda function

How do I use boto3 to simulate the "Add Event Source" action on the AWS GUI Console, in the Event Sources tab?
I want to programmatically create a trigger such that if an object is created in MyBucket, it will call the MyLambda function (qualified with an alias).
The relevant API call that I see in the boto3 documentation is create_event_source_mapping, but it states explicitly that it is only for the AWS pull model, while I think that S3 belongs to the push model. Anyway, I tried using it but it didn't work.
Passing a prefix filter would be nice too.
I was looking in the wrong place. This is configured on S3:
s3 = boto3.resource('s3')
bucket_name = 'mybucket'
bucket_notification = s3.BucketNotification(bucket_name)
response = bucket_notification.put(
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:033333333:function:mylambda:staging',
                'Events': [
                    's3:ObjectCreated:*'
                ],
            },
        ]
    }
)
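One caveat worth noting: before S3 can invoke the function, the Lambda's resource policy has to allow it, otherwise the notification put typically fails validation. A hedged sketch of granting that permission with add_permission follows; the statement id, ARNs, and account id are placeholders:

```python
def build_s3_invoke_permission(function_arn, bucket_arn, account_id):
    """Assemble kwargs for lambda_client.add_permission so S3 may invoke the function."""
    return {
        'FunctionName': function_arn,
        'StatementId': 's3-invoke',      # placeholder statement id
        'Action': 'lambda:InvokeFunction',
        'Principal': 's3.amazonaws.com',
        'SourceArn': bucket_arn,         # restrict to this bucket
        'SourceAccount': account_id,     # guard against bucket name reuse
    }

# Hypothetical call (ARNs and account id are placeholders):
# import boto3
# boto3.client('lambda').add_permission(**build_s3_invoke_permission(
#     'arn:aws:lambda:us-east-1:033333333:function:mylambda:staging',
#     'arn:aws:s3:::mybucket',
#     '033333333'))
```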