AWS web service to upload and analyse users' videos - amazon-web-services

I'm developing a prototype of a video analysis service on AWS.
The question is: am I thinking in the right direction, or will I fail to implement this architecture?
Architecture:
Flask on EC2.
An authenticated user uploads a file via a web view, and I save it to S3.
A Lambda function triggers SageMaker.
SageMaker takes the file from S3, prepares and analyses it, then: 1) saves the results to a PostgreSQL DB; 2) triggers a Lambda function that notifies Flask that the analysis is done.
The user receives a notification from Flask that the analysis is done.
The Flask web page visualizes the analysis results for the user.
This is only a prototype, so I'm trying to keep it as simple as possible.
I would appreciate any comments and recommendations.

Rekognition can find labels, text, faces, and expressions in images and video. The function below shows how to find labels in an image stored in an S3 bucket: pass the bucket name and the key of the image object so that Rekognition knows which image to label.
import boto3

# AWS_KEY_ID and AWS_SECRET are assumed to be defined elsewhere
def detect_labels(bucket, key, max_labels=10, min_confidence=95, region="us-east-1"):
    rekognition = boto3.client(
        "rekognition",
        region_name=region,
        aws_access_key_id=AWS_KEY_ID,
        aws_secret_access_key=AWS_SECRET,
    )
    response = rekognition.detect_labels(
        Image={
            "S3Object": {
                "Bucket": bucket,
                "Name": key,
            }
        },
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    return response["Labels"]
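Since the question is about video rather than still images, here is a rough sketch of the asynchronous video variant. It assumes a boto3 Rekognition client is passed in as `rekognition`; the function name `detect_video_labels` and the `poll_seconds` and `sleep` parameters are made up for illustration. Polling is used only to keep the sketch short.

```python
import time


def detect_video_labels(rekognition, bucket, key, min_confidence=95,
                        poll_seconds=5, sleep=time.sleep):
    """Start a label-detection job for a video in S3 and wait for its labels."""
    job = rekognition.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = rekognition.get_label_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"label detection ended with status {result['JobStatus']}")

    # Follow NextToken to collect every page of results.
    labels = list(result["Labels"])
    while result.get("NextToken"):
        result = rekognition.get_label_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        labels.extend(result["Labels"])
    return labels
```

Each returned entry carries a timestamp and a label, so the Flask page could plot labels over the duration of the video.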

Related

Upload custom file to s3 from training script in training component of AWS SageMaker Pipeline

I am new to SageMaker, and I have created a pipeline from a SageMaker notebook, consisting of training and deployment components.
In the training script, we can upload the model to S3 via SM_MODEL_DIR. Now I also want to upload the classification report to S3. I tried the code below, but it fails with an error saying that the bucket name is not valid.
df_classification_report = pd.DataFrame(class_report).transpose()
classification_report_file_name = os.path.join(
    args.output_data_dir,
    f"{args.eval_model_name}_classification_report.csv")
df_classification_report.to_csv(classification_report_file_name)
# instantiate S3 client and upload to s3
# save classification report to s3
s3 = boto3.resource('s3')
print(f"classification_report is being uploaded to s3- {args.model_dir}")
s3.meta.client.upload_file(classification_report_file_name, args.model_dir,
                           f"{args.eval_model_name}_classification_report.csv")
And the error
Invalid bucket name "/opt/ml/output/data": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Can anybody help? I really appreciate any help you can provide.
SageMaker training jobs compress any files located in /opt/ml/model, which is the value of SM_MODEL_DIR, and upload the archive to S3 automatically. You could save your file to SM_MODEL_DIR, in which case your classification report will be uploaded to S3 inside the model tarball.
The upload_file() function requires an S3 bucket name as its second argument, not a local directory path such as args.model_dir.
Alternatively, you could specify an S3 bucket manually in your code and upload the file there.
s3.meta.client.upload_file(classification_report_file_name, <YourS3Bucket>,
                           f"{args.eval_model_name}_classification_report.csv")
You can save non-model artifacts, such as reports, to output_data_dir. See here.
parser.add_argument("--output_data_dir", type=str,
                    default=os.environ.get('SM_OUTPUT_DATA_DIR'),
                    help="Directory to save output data artifacts.")
If you want the artifacts to be packaged with the model files, then follow @Marc's answer. That may make sense for a report that pertains to a specific model, though capturing this in a model registry makes more sense to me.
Note that these additional artifacts would be carried over if you deploy the model to an endpoint, which might confuse the model loading code of the inference runtime.
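For the option of uploading to an explicit bucket, a minimal sketch follows. It assumes the destination is given as an s3:// URI; `split_s3_uri` and `upload_report` are hypothetical helper names, and `s3_client` is expected to be a boto3 S3 client. Passing a local path such as /opt/ml/output/data is rejected up front instead of surfacing as an "Invalid bucket name" error.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_report(s3_client, local_path, dest_uri, file_name):
    """Upload local_path to dest_uri/file_name using the given S3 client."""
    bucket, prefix = split_s3_uri(dest_uri)
    key = f"{prefix.rstrip('/')}/{file_name}" if prefix else file_name
    # upload_file takes (local filename, bucket name, object key)
    s3_client.upload_file(local_path, bucket, key)
    return bucket, key
```

In the training script this could be called as upload_report(boto3.client("s3"), classification_report_file_name, "s3://my-bucket/reports", "report.csv"), where my-bucket is a placeholder for a bucket the training role can write to.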

How to connect AWS lambda with Aerospike db cluster

I have a requirement in which an Excel file is uploaded to an S3 bucket. As soon as the file is uploaded, I want to trigger a Lambda function that reads the Excel file and persists the data in an Aerospike DB.
For reading the file, I have this piece of code:
import csv
import boto3

key = 'key-name'
bucket = 'bucket-name'
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, key)
# note: this parses the object as UTF-8 CSV text; a binary .xlsx file would need a different reader
data = s3_object.get()['Body'].read().decode('utf-8').splitlines()
lines = csv.reader(data)
headers = next(lines)
print('headers: %s' % (headers))
for line in lines:
    # print complete line
    print(line)
But I am not able to figure out how to connect to the Aerospike DB, as the boto3 library doesn't support Aerospike.
Please help me connect to the DB cluster and persist the data.
Any reference would also be helpful.
I think the way to interact with Aerospike from something like AWS Lambda is to use the Aerospike REST Client, which provides a server that translates RESTful API requests into messages to an Aerospike cluster (it is mentioned in the blog post).
Basically, you run a REST server (the Aerospike REST Client) and send HTTP requests to it from AWS Lambda using Python. The server translates these requests into Aerospike operations and is responsible for executing them.
This is the GitHub repository of the Aerospike REST Client. It also contains a couple of blog posts on how to use it and Swagger UI documentation of the supported requests:
https://github.com/aerospike/aerospike-client-rest
There is also this blog post of Serverless Event Stream Processing with Aerospike which can help you get started:
https://medium.com/aerospike-developer-blog/serverless-event-stream-processing-with-aerospike-679f2a5cbba6
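As a rough illustration of the Lambda side, the sketch below turns parsed CSV rows into HTTP requests using only the standard library. The record path /v1/kvs/<namespace>/<set>/<key> is my assumption about the REST Client's layout and should be checked against its Swagger documentation; the function names, the host, and the namespace and set names are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote


def build_put_record_request(base_url, namespace, set_name, user_key, bins):
    """Build (but do not send) an HTTP request that writes one record."""
    # ASSUMPTION: record path layout /v1/kvs/<namespace>/<set>/<key>
    path = "/".join(quote(str(part), safe="")
                    for part in (namespace, set_name, user_key))
    url = f"{base_url.rstrip('/')}/v1/kvs/{path}"
    body = json.dumps(bins).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def rows_to_requests(headers, rows, base_url, namespace, set_name, key_column):
    """Turn parsed CSV rows into one request per row, keyed by key_column."""
    requests = []
    for row in rows:
        bins = dict(zip(headers, row))
        requests.append(
            build_put_record_request(base_url, namespace, set_name,
                                     bins[key_column], bins)
        )
    return requests
```

Each request can then be sent with urllib.request.urlopen(req) from inside the Lambda handler, provided the function can reach the REST server over the network.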

Best practices of uploading a file to S3 and metadata to RDS?

Context
I'm building a mock service to learn AWS. I want a user to be able to upload a sound file that other users can listen to. To do this, the sound file needs to be uploaded to S3, and metadata such as file name, name of uploader, length, and S3 ID needs to be stored in RDS. Preferably, the user uploads directly to S3 with a signed URL, instead of doubling the data transferred by first uploading to my server and from there to S3.
Ideally this would be transactional, but from what I have gathered no such functionality exists. To minimize the risk of the file being uploaded to S3 without the metadata reaching RDS, or vice versa, my best guess is as follows:
My solution
With words:
First, I attempt to upload the file to S3 with a key (uuid) that I generate locally or server-side. If this succeeds, I make a request to my API to store the metadata, including the key, in RDS. If that request fails, I remove the object from S3.
With code:
uuid = get_uuid_from_server();
s3Client.putObject({.., key: uuid, ..}, function(err, data) {
    if (err) {
        reject(err);
    } else {
        resolve(data);
        // Upload metadata to RDS through API-call to EC2 server.
        // Remove s3 object with key: uuid if this call is unsuccessful
    }
});
As I'm learning, my approaches are seldom the best practices but I was unable to find any good information on this particular problem. Is my approach/solution above in line with best practices?
Bonus question: is it beneficial for security purposes to generate the file's key (uuid) server-side instead of client-side?
Here are two approaches you can pick from, assuming the client is a web browser or mobile app.
1. Use your server as a proxy to S3.
Your server acts as a proxy between your clients and S3. You have full control of the upload flow: you can restrict the supported file types and inspect file contents (for example, to make sure the file is a valid sound file) before uploading to S3.
2. Use your server to create pre-signed upload URLs.
In this approach, your client first asks the server to create one pre-signed URL, or several for a multi-part upload. The client then uploads to S3 using those URLs. Your server can store the URLs to keep track of the uploads later.
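A minimal server-side sketch of the pre-signed URL approach, assuming `s3_client` is a boto3 S3 client and that a single PUT upload is enough. The function name `create_upload_url` and the one-hour default expiry are illustrative choices.

```python
import uuid


def create_upload_url(s3_client, bucket, expires_in=3600):
    """Return (key, url) for a one-off pre-signed PUT upload."""
    # The key is generated server-side, so the client cannot choose it.
    key = str(uuid.uuid4())
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
    return key, url
```

The server would store the returned key as a pending upload before handing the URL to the client, so that it can later match the finished upload to its metadata.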
To be notified when the upload finishes, successfully or not, you can either
(1) Ask clients to call another API, e.g. /ack, after the upload finishes for a particular signed URL. If this API is not called within some time, e.g. 1 hour, you can check with S3 and delete the file accordingly. This works because you stored the signed URL in your DB at the start of the upload.
or
(2) Make use of S3 events. You can configure the ObjectCreated event in S3, which fires whenever an object is created, send all the events to an SQS queue, and have your server process each event from there. This way you do not rely on clients to update your server after an upload finishes; S3 notifies your server of all successful uploads.
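For option (2), the server has to pull the bucket and key out of each queued message. Below is a small sketch, assuming the S3 notification is delivered to SQS unwrapped, so the message body is the S3 event JSON itself. The function name `uploaded_objects` is made up. Object keys arrive URL-encoded in these events, hence the `unquote_plus` call.

```python
import json
from urllib.parse import unquote_plus


def uploaded_objects(sqs_message_body):
    """Return (bucket, key) pairs for ObjectCreated records in one message."""
    event = json.loads(sqs_message_body)
    found = []
    # Messages without "Records" (such as S3's test event) yield nothing.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        found.append((bucket, key))
    return found
```

For each returned key, the server can mark the matching pending upload as complete and write the metadata row to RDS.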

Updating data used by AWS Elastic Beanstalk deployed Webapp

I've created a dashboard and deployed it on AWS Elastic Beanstalk. The data fed into my dashboard is supplied by a CSV file in my S3 bucket, set to update every 12 hours with AWS EventBridge. For some reason, my deployed dashboard is not updating. It's still using the same old data from my previous deployment even though the CSV file has been updating correctly.
More specifically:
I'm trying to create a Dashboard with Plotly Dash to visualize some trends starting from 2020-01-01.
I have a Lambda function that scrapes the data and saves it as a CSV file in an S3 bucket. This CSV file gets overwritten every 12 hours to capture the latest available trends.
I used boto3 to fetch the CSV file directly from my S3 bucket and use its data to construct my dashboard.
The app was then deployed with Elastic Beanstalk.
Everything was written in a Cloud9 environment, except for setting up the EventBridge trigger.
Say I deployed the app on 2020-12-10. The CSV file would contain all data up till 2020-12-10, and my dashboard would show trends between 2020-01-01 and 2020-12-10.
However, if I check the dashboard anytime after 2020-12-10 (or when the CSV file is updated with data post 2020-12-10), it still shows the same trends (between 2020-01-01 and 2020-12-10), though the CSV file in my S3 bucket is up to date.
The dashboard would update only if I redeploy the app on Elastic Beanstalk. Not sure why this is the case since my app is pulling the data directly from the updated CSV file.
Is my architecture incorrect here? Or do I need to tweak some settings in AWS?
Thanks in advance!
Update:
I'm using the following code to load my data into the trends_data dataframe.
import boto3
import pandas as pd

# define bucket name
bucket = "mobilitytrends"
# define s3 client
s3 = boto3.client('s3')
# define file names
historical_file_name = 'historical_trends.csv'
# load historical data from s3
data_obj = s3.get_object(Bucket=bucket, Key=historical_file_name)
trends_data = pd.read_csv(data_obj['Body'], low_memory=False)
I then have some functions that clean this dataframe. I have a scatterplot that's rendered using the code snippet below:
fig.add_scatter(x=filtered_trend.index,
                y=filtered_trend[transportation],
                line=dict(color=line_color[idx]),
                name=transportation)
filtered_trend is a subset of trends_data, which gets selected based on some callback functions that I set up. But I don't think that's where the problem lies since everything worked fine locally.
In Dash, global variables will break your app. More specifically, modifying global variables will not work, at least not reliably.
One approach to avoid the use of global variables would be to create a single callback that first loads the data from S3, and then renders the layout. Other approaches are discussed in this similar question.
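The likely cause is that module-level code runs only once, when the app process starts, so a dataframe loaded there stays frozen until the next restart or redeploy. A minimal sketch of the per-request alternative follows, assuming `s3_client` is a boto3 S3 client. The name `load_trends` is hypothetical, and the standard-library csv module stands in for pandas to keep the example self-contained.

```python
import csv
import io


def load_trends(s3_client, bucket, key):
    """Fetch the CSV from S3 and parse it; call this per request, not at import."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    text = obj["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```

In the Dash app, the callback that builds the figure would call load_trends (or the equivalent pd.read_csv call) itself, so every render sees whatever the CSV contains at that moment.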
I had a similar problem: EB was not fetching the latest version of the CSV from the S3 bucket.
The only option I could find was to restart the app server after a new version of the CSV is uploaded to the S3 bucket.
You can use the code below in an AWS Lambda function to restart your app server at specific times of day:
import boto3

client = boto3.client('elasticbeanstalk', region_name='your-region')

def lambda_handler(event, context):
    try:
        response = client.restart_app_server(EnvironmentName='your-environment-name')
        if response:
            print('restarting app server')
        else:
            print('Failed to restart server')
    except Exception as e:
        print(e)
Make sure to set up a cron schedule in EventBridge to trigger the function at the desired times.

Amazon S3: Do not allow client to modify already uploaded images?

We are using S3 for our image upload process, and we approve every image that is uploaded to our website. The process is:
Clients upload images to S3 from JavaScript at a given path, using a token.
Once we get back the URL from S3, we save the S3 path in our database with the isApproved flag set to false in the photos table.
Once one of our executives approves the image, it starts displaying on our website.
The problem is that the user may use the generated token to change the image (to some obscene image) after the approval process. Can we somehow stop users from modifying images like this?
One temporary fix is to shorten the token lifetime, e.g. to 5 minutes, and approve images only after that interval.
I saw this, but it didn't help, as versioning also replaces the already uploaded image and moves the previously uploaded image to a new versioned path.
Any better solutions?
You should create a workflow around the uploaded images. The process would be:
The client uploads the image
This triggers an Amazon S3 event notification to you/your system
If you approve the image, move it to the public bucket that is serving your content
If you do not approve the image, delete it
This could be an automated process using an AWS Lambda function to update your database and flag photos for approval, or it could be done manually after receiving an email notification via Amazon SNS. The choice is up to you.
The benefit of this method is that nothing can be substituted once approved, provided clients only hold upload permissions for the incoming bucket and not for the public one.
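The approve and reject steps could look roughly like this, assuming two buckets (one for incoming uploads, one public) and a boto3 S3 client passed in as `s3_client`. The function name `review_image` and the bucket names in the usage note are placeholders.

```python
def review_image(s3_client, upload_bucket, public_bucket, key, approved):
    """Copy an approved image to the public bucket, then clear the upload."""
    if approved:
        s3_client.copy_object(
            Bucket=public_bucket,
            Key=key,
            CopySource={"Bucket": upload_bucket, "Key": key},
        )
    # Approved or not, the original is removed from the upload bucket.
    s3_client.delete_object(Bucket=upload_bucket, Key=key)
```

For example, review_image(boto3.client("s3"), "incoming-photos", "public-photos", "photo.jpg", approved=True) would publish the image, after which the database row can be flagged as approved.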