AWS Lambda Python - Push part of MediaInfo function response to SNS - python-2.7

Firstly, I am relatively new to code and attempting to teach myself what I need! I have managed to butcher bits of example code that I have found on various forums to get to where I am now. I am running an AWS Lambda function that triggers when a new file is uploaded to a bucket and then sends the file off to MediaInfo (I built a self-contained CLI executable that is uploaded to the Lambda function). The result of this is in XML format, and I have managed to pass it on to a DynamoDB database.
My question: I want to export the XML produced by this function and push it to an SNS topic so that I can pick it up and use it elsewhere (a Knack database). Here is my Lambda code in full (private info changed).
import logging
import subprocess

import boto3

SIGNED_URL_EXPIRATION = 300  # The number of seconds that the Signed URL is valid
DYNAMODB_TABLE_NAME = "demo_metadata"

DYNAMO = boto3.resource("dynamodb")
TABLE = DYNAMO.Table(DYNAMODB_TABLE_NAME)

logger = logging.getLogger('boto3')
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """
    :param event:
    :param context:
    """
    # Loop through records provided by S3 Event trigger
    for s3_record in event['Records']:
        logger.info("Working on new s3_record...")
        # Extract the Key and Bucket names for the asset uploaded to S3
        key = s3_record['s3']['object']['key']
        bucket = s3_record['s3']['bucket']['name']
        logger.info("Bucket: {} \t Key: {}".format(bucket, key))
        # Generate a signed URL for the uploaded asset
        signed_url = get_signed_url(SIGNED_URL_EXPIRATION, bucket, key)
        logger.info("Signed URL: {}".format(signed_url))
        # Launch MediaInfo
        # Pass the signed URL of the uploaded asset to MediaInfo as an input
        # MediaInfo will extract the technical metadata from the asset
        # The extracted metadata will be outputted in XML format and
        # stored in the variable xml_output
        xml_output = subprocess.check_output(["./mediainfo", "--full", "--output=XML", signed_url])
        logger.info("Output: {}".format(xml_output))
        save_record(key, xml_output)


def save_record(key, xml_output):
    """
    Save record to DynamoDB

    :param key: S3 Key Name
    :param xml_output: Technical Metadata in XML Format
    :return: xml_output
    """
    logger.info("Saving record to DynamoDB...")
    TABLE.put_item(
        Item={
            'keyName': key,
            'technicalMetadata': xml_output
        }
    )
    logger.info("Saved record to DynamoDB")


def get_signed_url(expires_in, bucket, obj):
    """
    Generate a signed URL

    :param expires_in: URL Expiration time in seconds
    :param bucket:
    :param obj: S3 Key name
    :return: Signed URL
    """
    s3_cli = boto3.client("s3")
    presigned_url = s3_cli.generate_presigned_url('get_object',
                                                  Params={'Bucket': bucket, 'Key': obj},
                                                  ExpiresIn=expires_in)
    return presigned_url
The output I get from the Lambda function when using the AWS GUI is below (truncated), and this is what I want to send to an SNS topic.
<Height>1080</Height>
<Height>1 080 pixels</Height>
<Stored_Height>1088</Stored_Height>
<Sampled_Width>1920</Sampled_Width>
<Sampled_Height>1080</Sampled_Height>
<Pixel_aspect_ratio>1.000</Pixel_aspect_ratio>
<Display_aspect_ratio>1.778</Display_aspect_ratio>
<Display_aspect_ratio>16:9</Display_aspect_ratio>
<Rotation>0.000</Rotation>
<Frame_rate_mode>CFR</Frame_rate_mode>
<Frame_rate_mode>Constant</Frame_rate_mode>
<Frame_rate>29.970</Frame_rate>
<Frame_rate>29.970 (30000/1001) fps</Frame_rate>
<FrameRate_Num>30000</FrameRate_Num>
<FrameRate_Den>1001</FrameRate_Den>
<Frame_count>630</Frame_count>
<Resolution>8</Resolution>
<Resolution>8 bits</Resolution>
<Colorimetry>4:2:0</Colorimetry>
<Color_space>YUV</Color_space>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Chroma_subsampling>4:2:0</Chroma_subsampling>
<Bit_depth>8</Bit_depth>
<Bit_depth>8 bits</Bit_depth>
<Scan_type>Progressive</Scan_type>
<Scan_type>Progressive</Scan_type>
<Interlacement>PPF</Interlacement>
<Interlacement>Progressive</Interlacement>
<Bits__Pixel_Frame_>0.129</Bits__Pixel_Frame_>
<Stream_size>21374449</Stream_size>
<Stream_size>20.4 MiB (99%)</Stream_size>
<Stream_size>20 MiB</Stream_size>
<Stream_size>20 MiB</Stream_size>
<Stream_size>20.4 MiB</Stream_size>
<Stream_size>20.38 MiB</Stream_size>
<Stream_size>20.4 MiB (99%)</Stream_size>
<Proportion_of_this_stream>0.98750</Proportion_of_this_stream>
<Encoded_date>UTC 2017-11-24 19:29:16</Encoded_date>
<Tagged_date>UTC 2017-11-24 19:29:16</Tagged_date>
<Buffer_size>16000000</Buffer_size>
<Color_range>Limited</Color_range>
<colour_description_present>Yes</colour_description_present>
<Color_primaries>BT.709</Color_primaries>
<Transfer_characteristics>BT.709</Transfer_characteristics>
<Matrix_coefficients>BT.709</Matrix_coefficients>
</track>
<track type="Audio">
<Count>272</Count>
<Count_of_stream_of_this_kind>1</Count_of_stream_of_this_kind>
<Kind_of_stream>Audio</Kind_of_stream>
<Kind_of_stream>Audio</Kind_of_stream>
<Stream_identifier>0</Stream_identifier>
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<ID>2</ID>
<Format>AAC</Format>
<Format_Info>Advanced Audio Codec</Format_Info>
<Commercial_name>AAC</Commercial_name>
<Format_profile>LC</Format_profile>
<Codec_ID>40</Codec_ID>
<Codec>AAC LC</Codec>
<Codec>AAC LC</Codec>
<Codec_Family>AAC</Codec_Family>
</File>
</Mediainfo>
[INFO] 2018-04-22T18:50:01.803Z efde8294-465d-11e8-9ad2-0db0d6b36746 Saving record to DynamoDB...
[INFO] 2018-04-22T18:50:02.21Z efde8294-465d-11e8-9ad2-0db0d6b36746
Saved record to DynamoDB
END RequestId: efde8294-465d-11e8-9ad2-0db0d6b36746
REPORT RequestId: efde8294-465d-11e8-9ad2-0db0d6b36746 Duration: 9769.02 ms Billed Duration: 9800 ms Memory Size: 128 MB Max Memory Used: 61 MB
Many thanks in advance to anyone with advice!
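One approach that should slot into the code above (a minimal sketch, not a tested drop-in): publish xml_output to an SNS topic with boto3, right next to the save_record call. The topic ARN below is a placeholder read from an environment variable, and note that a single SNS message is capped at 256 KB, so a very large XML document may need to be trimmed or replaced with a pointer (bucket/key) instead.

import os
import boto3

SNS = boto3.client("sns")
SNS_TOPIC_ARN = os.environ.get("SNS_TOPIC_ARN")  # placeholder: your topic ARN, e.g. set as a Lambda environment variable

def publish_metadata(key, xml_output):
    """Publish the MediaInfo XML to SNS so other systems can pick it up."""
    SNS.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="MediaInfo metadata for {}".format(key)[:100],  # SNS subjects are capped at 100 characters
        Message=xml_output
    )

Calling publish_metadata(key, xml_output) immediately after save_record(key, xml_output) inside the loop would then fan the metadata out to any subscriber (for example, the endpoint feeding the Knack database).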

Related

unable to get object metadata from S3. Check object key, region and/or access permissions.

I have a Lambda function that scans for text and is triggered by an S3 bucket. I get this error when trying to upload a photo directly into the S3 bucket using the browser:
Unable to get object metadata from S3. Check object key, region, and/or access permissions
However, if I hardcode the key (e.g., image01.jpg) which is in my bucket, there are no errors.
import json
import boto3

def lambda_handler(event, context):
    # Get bucket and file name
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    location = key[:17]

    s3Client = boto3.client('s3')
    client = boto3.client('rekognition', region_name='us-east-1')

    response = client.detect_text(Image={'S3Object': {'Bucket': 'myarrowbucket', 'Name': key}})
    detectedText = response['TextDetections']
I am confused, as it was working a few weeks ago, but now I am getting that error.
ANSWER
I have seen this question answered many times and I tried every solution; the one which worked for me was the 'key' name. I was getting the metadata error when the filename contained special characters (e.g. - or _), but when I changed the names of the uploaded files it worked. Hope this answer helps someone.
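A related detail that often causes this exact error: S3 event notifications URL-encode object keys, so a filename with spaces or special characters arrives encoded (a space becomes '+'). A minimal sketch of decoding the key before calling Rekognition (the handler shape mirrors the question's code; the bucket is taken from the event rather than hard-coded):

import boto3
from urllib.parse import unquote_plus  # Python 3; on Python 2 use urllib.unquote_plus

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    # S3 event keys are URL-encoded, e.g. "my photo.jpg" arrives as "my+photo.jpg"
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    client = boto3.client('rekognition', region_name='us-east-1')
    response = client.detect_text(Image={'S3Object': {'Bucket': bucket, 'Name': key}})
    detectedText = response['TextDetections']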

Sagemaker python sdk: accessing custom_attributes in inference job

I am using the SageMaker Python SDK for my inference job and following this guide. I am triggering my SageMaker inference job from Airflow with the below Python callable:
def transform(sage_role, inference_file_local_path, **kwargs):
    """
    Python callable to execute Sagemaker SDK train job. It takes infer_batch_output, infer_batch_input, model_artifact,
    instance_type and infer_file_name as run time parameter.

    :param inference_file_local_path: Local entry_point path for Inference file.
    :param sage_role: Sagemaker execution role.
    """
    model = TensorFlowModel(entry_point=infer_file_name,
                            source_dir=inference_file_local_path,
                            model_data=model_artifact,
                            role=sage_role,
                            framework_version="2.5.1")

    tensorflow_serving_transformer = model.transformer(
        instance_count=1,
        instance_type=instance_type,
        accept="text/csv",
        strategy="SingleRecord",
        max_payload=10,
        max_concurrent_transforms=10,
        output_path=batch_output)

    return tensorflow_serving_transformer.transform(data=batch_input, content_type='text/csv')
and my simple inference.py looks like:
def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API

    Args:
        data (obj): the request data, in format of dict or string
        context (Context): an object containing request and configuration details

    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """
    if context.request_content_type == 'application/x-npy':
        # very simple numpy handler
        payload = np.load(data.read().decode('utf-8'))
        x_user_feature = np.asarray(payload.item().get('test').get('feature_a_list'))
        x_channel_feature = np.asarray(payload.item().get('test').get('feature_b_list'))
        examples = []
        for index, elem in enumerate(x_user_feature):
            examples.append({'feature_a_list': elem, 'feature_b_list': x_channel_feature[index]})
        return json.dumps({'instances': examples})

    if context.request_content_type == 'text/csv':
        payload = pd.read_csv(data)
        print("Model name is ..............")
        model_name = context.model_name
        print(model_name)
        examples = []
        row_ch = []
        if config_exists(model_bucket, "{}{}".format(config_path, model_name)):
            config_keys = get_s3_json_file(model_bucket, "{}{}".format(config_path, model_name))
            feature_b_list = config_keys["feature_b_list"].split(",")
            row_ch = [float(ch_feature_str) for ch_feature_str in feature_b_list]
            if "column_names" in config_keys.keys():
                cols = config_keys["column_names"].split(",")
                payload.columns = cols
        for index, row in payload.iterrows():
            row_user = row['feature_a_list'].replace('[', '').replace(']', '').split()
            row_user = [float(x) for x in row_user]
            if not row_ch:
                row_ch = row['feature_b_list'].replace('[', '').replace(']', '').split()
                row_ch = [float(x) for x in row_ch]
            example = {'feature_a_list': row_user, 'feature_b_list': row_ch}
            examples.append(example)
        # hand the assembled examples to TensorFlow Serving
        return json.dumps({'instances': examples})

    raise ValueError('{{"error": "unsupported content type {}"}}'.format(
        context.request_content_type or "unknown"))


def output_handler(data, context):
    """Post-process TensorFlow Serving output before it is returned to the client.

    Args:
        data (obj): the TensorFlow serving response
        context (Context): an object containing request and configuration details

    Returns:
        (bytes, string): data to return to client, response content type
    """
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))
    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type
It is working fine; however, I want to pass custom arguments to inference.py so that I can modify the input data based on the requirement. I thought of using a config file per requirement and downloading it from S3 based on the model name, but as I am using model_data and passing model.tar.gz at runtime, context.model_name is always None.
Is there a way I can pass a runtime argument to inference.py that I can use for customization?
In the docs I see SageMaker provides custom_attributes, but I don't see any example of how to use it and access it in inference.py.
custom_attributes (string): content of ‘X-Amzn-SageMaker-Custom-Attributes’ header from the original request. For example, ‘tfs-model-name=half_plus_three,tfs-method=predict’
Currently CustomAttributes is supported in the InvokeEndpoint API call when using a realtime Endpoint.
As an example, you can look at passing JSON Lines as input to your Transform Job that contains the input payload and some custom arguments which you can consume in your inference.py file.
For example,
{
    "input": "1,2,3,4",
    "custom_args": "my_custom_arg"
}
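A minimal sketch (an assumption about how to consume such records, not something shown in the linked docs) of handling that shape in input_handler, assuming the Transform Job sends one JSON object per line with content type 'application/jsonlines' and the field names from the example record above:

import json

def input_handler(data, context):
    if context.request_content_type == 'application/jsonlines':
        instances = []
        custom_args = []
        for line in data.read().decode('utf-8').splitlines():
            record = json.loads(line)
            # "input" and "custom_args" are the hypothetical field names from the example record
            instances.append([float(x) for x in record["input"].split(",")])
            custom_args.append(record.get("custom_args"))
        # custom_args can now drive any per-record pre-processing before the payload is built
        return json.dumps({'instances': instances})
    raise ValueError('{{"error": "unsupported content type {}"}}'.format(
        context.request_content_type or "unknown"))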

Search for 2 strings from multiple pdfs in AWS S3 Bucket which has sub directories without downloading those in local machine

I'm looking to search for two words in multiple PDFs located in an AWS S3 bucket. However, I don't want to download those documents to a local machine; ideally the search would run directly against the PDFs via URL. Point to note: these PDFs are located in multiple sub-directories within a bucket (like a year folder, then a month folder, then a date folder).
Amazon S3 does not have a 'Search' capability. It is a "simple storage service".
You would either need to download those documents to some form of compute platform (eg EC2, Lambda, or your own computer) and perform the searches, or you could pre-index the documents using a service like Amazon OpenSearch Service and then send the query to the search service.
Running a direct scan of PDFs in an S3 bucket to search for text is HARD:
Some PDFs contain text that is embedded inside images (it is not readable in text form).
If you want to process a PDF without saving it, consider using memory-optimized machines, keep the files off the virtual machine's hard drive, and use in-memory streams.
To get around text inside images, you would need OCR logic, which is also HARD to execute. You'll probably want to use AWS Textract or Google Vision for OCR. If compliance and security are an issue, you could use Tesseract.
If you do have a reliable OCR solution, I would suggest running a text extraction job when an upload event happens; this will save you a lot of money on whatever OCR service you consume, and it will also let your organization cache the contents of the PDFs in text form in more search-friendly services like AWS OpenSearch.
Here's a tutorial which uses Tika (for PDF text extraction) and OpenSearch (as the search engine) to search the contents of PDF files within an S3 bucket:
import boto3
from tika import parser
from opensearchpy import OpenSearch
from config import *
import sys

# opensearch object
os = OpenSearch(opensearch_uri)

s3_file_name = "prescription.pdf"
bucket_name = "mixpeek-demo"


def download_file():
    """Download the file

    :param str s3_file_name: name of s3 file
    :param str bucket_name: bucket name of where the s3 file is stored
    """
    # s3 boto3 client instantiation
    s3_client = boto3.client(
        's3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name
    )

    # download the object to a local file
    with open(s3_file_name, 'wb') as file:
        s3_client.download_fileobj(
            bucket_name,
            s3_file_name,
            file
        )
    print("file downloaded")

    # parse the file
    parsed_pdf_content = parser.from_file(s3_file_name)['content']
    print("file contents extracted")

    # insert parsed pdf content into opensearch
    insert_into_search_engine(s3_file_name, parsed_pdf_content)
    print("file contents inserted into search engine")


def insert_into_search_engine(s3_file_name, parsed_pdf_content):
    """Index the extracted contents

    :param str s3_file_name: name of s3 file
    :param str parsed_pdf_content: extracted contents of PDF file
    """
    doc = {
        "filename": s3_file_name,
        "parsed_pdf_content": parsed_pdf_content
    }

    # insert
    resp = os.index(
        index=index_name,
        body=doc,
        id=1,
        refresh=True
    )
    print('\nAdding document:')
    print(resp)


def create_index():
    """Create the index
    """
    index_body = {
        'settings': {
            'index': {
                'number_of_shards': 1
            }
        }
    }

    response = os.indices.create(index_name, body=index_body)
    print('\nCreating index:')
    print(response)


if __name__ == '__main__':
    globals()[sys.argv[1]]()
full tutorial: https://medium.com/#mixpeek/search-text-from-pdf-files-stored-in-an-s3-bucket-2f10947eebd3
Corresponding github repo: https://github.com/mixpeek/pdf-search-s3
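The tutorial covers extraction and indexing but not the query side. A minimal sketch of searching the indexed content for two strings, assuming the same config values (opensearch_uri, index_name) and the parsed_pdf_content field used above:

from opensearchpy import OpenSearch
from config import *  # assumes opensearch_uri and index_name, as in the tutorial

os_client = OpenSearch(opensearch_uri)

def search_two_strings(first, second):
    """Return filenames of PDFs whose extracted text contains both phrases."""
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"parsed_pdf_content": first}},
                    {"match_phrase": {"parsed_pdf_content": second}}
                ]
            }
        }
    }
    resp = os_client.search(index=index_name, body=query)
    return [hit["_source"]["filename"] for hit in resp["hits"]["hits"]]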

I would like to know how to import data into the app by modifying this lambda code

import boto3
import json

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = "cloud-translate-output"
    key = "key value"
    try:
        data = s3.get_object(Bucket=bucket, Key=key)
        json_data = data["Body"].read()
        return {
            "response_code": 200,
            "data": str(json_data)
        }
    except Exception as e:
        print(e)
        raise e
I'm making an iOS app with Xcode.
I want to use AWS to bring data from S3 into the app in the order app - API Gateway - Lambda - S3. Is there a way to use that data from the API in the app? If I upload this data to bucket number 1 of S3, the CloudFormation will translate the uploaded text file and automatically save it to bucket number 2, and I want to import the text file stored in bucket number 2 back into the app through Lambda. Rather than a key value, is there a way to use only the name of the bucket?
if I upload this data to bucket number 1 of s3, the cloudformation will translate the uploaded text file and automatically save it to bucket number 2
Sadly, this is not how CloudFormation works. It can't read or translate any files from buckets automatically, or upload them to new buckets.
I would stick with a lambda function. It is more suited to such tasks.
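On the "only the name of the bucket" part: the Lambda function can list the objects in the output bucket and fetch the most recently written one, instead of relying on a hard-coded key. A minimal sketch (bucket name taken from the code in the question; adjust to your flow):

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = "cloud-translate-output"

    # list the objects in the bucket and pick the most recently modified one
    listing = s3.list_objects_v2(Bucket=bucket)
    objects = listing.get("Contents", [])
    if not objects:
        return {"response_code": 404, "data": ""}

    latest = max(objects, key=lambda obj: obj["LastModified"])
    data = s3.get_object(Bucket=bucket, Key=latest["Key"])
    return {
        "response_code": 200,
        "data": data["Body"].read().decode("utf-8")
    }

Note that list_objects_v2 returns at most 1,000 keys per call, so a bucket with many files would need pagination or a key prefix supplied by the app.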

Aws Lambda to write to firebase storage

I want to write a downloaded file to Firebase Storage using AWS Lambda.
I have already written a dynamic link for Firebase Storage.
Can someone give me a hint how to do that? I have it working for S3; I want to store the file in Firebase Storage now.
Only index.html needs to be stored, in the Firebase Storage bucket google-list as index.html.
def lambda_handler(event, context):
    url = 'https://www.google.com/index.html'  # put your url here
    bucket = 'google-list'  # your s3 bucket
    key = 'index.html'  # your path

    # write to s3, want to replace with firestorage
    # s3 = boto3.client('s3')
    # http = urllib3.PoolManager()
    # s3.upload_fileobj(http.request('GET', url, preload_content=False), bucket, key, ExtraArgs={'ACL': 'public-read'})
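Firebase Storage buckets are Google Cloud Storage buckets under the hood, so one possible approach (a sketch under that assumption, not a tested recipe) is to bundle the google-cloud-storage library and a Firebase service-account key with the Lambda deployment package and upload through the GCS client. The service-account filename and bucket name below are placeholders:

import urllib3
from google.cloud import storage  # Firebase Storage is backed by Google Cloud Storage

def lambda_handler(event, context):
    url = 'https://www.google.com/index.html'
    key = 'index.html'

    # download the page into memory
    http = urllib3.PoolManager()
    data = http.request('GET', url).data

    # authenticate with a service-account key bundled in the deployment package (placeholder filename)
    client = storage.Client.from_service_account_json('firebase-service-account.json')
    # placeholder bucket name; Firebase Storage buckets usually look like '<project-id>.appspot.com'
    bucket = client.bucket('google-list.appspot.com')
    bucket.blob(key).upload_from_string(data, content_type='text/html')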