AWS service for video optimization and compression - amazon-web-services

I am trying to build a video/audio/image upload feature for a mobile application. Currently we have set the file size limit to 1 GB for video and 50 MB for audio and images. The uploaded files will be stored in an S3 bucket, and we will use the Amazon CloudFront CDN to serve them to users.
I want to compress/optimize the media content using some AWS service after it is stored in the S3 bucket. Ideally I would also like to put some restrictions on the output file, e.g. no video file should be larger than 200 MB or have a resolution greater than 720p. Can someone please tell me which AWS service I should use, with some helpful links if available? Thanks.

The AWS Elemental MediaConvert service transcodes files on demand. The service supports job templates that can specify output parameters, including resolution, so guaranteeing a 720p maximum resolution is simple.
Amazon S3 supports event notifications that can trigger other AWS actions, such as running a Lambda function when a new file arrives in a bucket. The Lambda function can load and customize a job template, then submit a transcoding job to MediaConvert for the newly arrived file. See https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html for details.
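As a rough, unofficial sketch of that Lambda, the handler below submits one MediaConvert job per uploaded object; the role ARN, template name, environment variables and bucket names are hypothetical placeholders, not values from the question or the AWS docs.

    import os
    import urllib.parse
    import boto3

    # Hypothetical configuration -- replace with your own values.
    MEDIACONVERT_ROLE = os.environ["MEDIACONVERT_ROLE_ARN"]   # IAM role MediaConvert assumes
    JOB_TEMPLATE      = os.environ["JOB_TEMPLATE_NAME"]       # e.g. a 720p QVBR template
    OUTPUT_BUCKET     = os.environ["OUTPUT_BUCKET"]           # where transcoded files land

    # MediaConvert uses an account-specific endpoint; look it up once per container.
    mc_endpoint = boto3.client("mediaconvert").describe_endpoints()["Endpoints"][0]["Url"]
    mediaconvert = boto3.client("mediaconvert", endpoint_url=mc_endpoint)

    def handler(event, context):
        """Triggered by an S3 ObjectCreated event; submits one MediaConvert job per object."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            mediaconvert.create_job(
                Role=MEDIACONVERT_ROLE,
                JobTemplate=JOB_TEMPLATE,
                Settings={
                    # Only the input needs to be filled in here; resolution, codec and
                    # bitrate settings come from the job template.
                    "Inputs": [{"FileInput": f"s3://{bucket}/{key}"}],
                },
                UserMetadata={"source_key": key, "output_bucket": OUTPUT_BUCKET},
            )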
Limiting the size of an output file is not currently a feature within MediaConvert, but you could leverage other AWS tools to do this. Checking the size of a transcoded output could be done with another Lambda function when the output file arrives in a certain bucket. This second Lambda function could then decide to re-transcode the input file with more aggressive job settings (higher compression, a different codec, time clipping, etc.) in order to produce a smaller output file.
Since file size is a factor for you, I recommend using QVBR or VBR rate control with a maximum bitrate cap, which lets you better predict the worst-case file size for a given quality, duration and bitrate. You can allocate your 200 MB per-file budget in different ways. For example, you could make 800 seconds (~13 min) of 2 Mbps video, or 1600 seconds (~26 min) of 1 Mbps video, et cetera. You may want to consider several quality tiers, or have your job-assembly Lambda function do the math for you based on the input file duration, which could be determined using mediainfo, ffprobe or other utilities.
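To make that math concrete, here is a tiny, hypothetical helper (not part of any AWS SDK) that converts a size budget and clip duration into a maximum bitrate cap, which could then feed the max-bitrate/QVBR settings of the job the Lambda submits:

    def max_bitrate_for_budget(duration_seconds: float, budget_megabytes: float = 200.0) -> int:
        """Return the highest average bitrate (bits/second) that keeps the output
        under the given size budget. 1 MB of output ~= 8 megabits of video+audio."""
        budget_bits = budget_megabytes * 8 * 1_000_000
        return int(budget_bits / duration_seconds)

    # Examples from the answer above:
    #   800 s at 2 Mbps  -> 800 * 2_000_000 bits  = 1.6e9 bits = 200 MB
    #   1600 s at 1 Mbps -> 1600 * 1_000_000 bits = 1.6e9 bits = 200 MB
    print(max_bitrate_for_budget(800))    # ~2_000_000 (2 Mbps)
    print(max_bitrate_for_budget(1600))   # ~1_000_000 (1 Mbps)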
FYI there are three ways customers can obtain help with AWS solution design and implementation:
[a] AWS Paid Professional Services - There is a large global AWS ProServices team able to help via paid service engagements.
The fastest way to start this dialog is by submitting the AWS Sales team 'contact me' form found here and specifying 'Sales Support': https://aws.amazon.com/contact-us/
[b] AWS Certified Consulting Partners -- AWS certified partners with expertise in many verticals. See search tool & listings here: https://iq.aws.amazon.com/services
[c] AWS Solutions Architects -- this service is focused on Enterprise-level AWS accounts. The Sales contact form in item [a] is the best way to engage them. Purchasing AWS Enterprise Support entitles the customer to a dedicated TAM/SA combination.

Related

Fastest way to get exact count of rows for a 100GB CSV file stored on S3

What is the fastest way of getting an exact count of rows for a 100GB CSV file stored on Amazon S3 without using Athena nor any Fargate or EC2 VM? I can't use Athena, because the CSV file isn't clean enough for it. I can't use Fargate or EC2 VMs, because I need a purely serverless solution. I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda Function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to timeout, because its read bandwidth from S3 is 100MB/s at best, and it cannot run for more than 15 minutes (900s).
I know the approximate size of the file.
Note: I have an inaccurate estimate of the number of rows provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure any out. For example, I tried adding an OFFSET with a value lower than but close to the number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request over chunks of 50 MB or so.
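A minimal sketch of that approach, assuming the CSV has no quoted embedded newlines; the bucket and key names are the hypothetical ones from the question, and in practice each range would be counted by its own Lambda invocation rather than in a single loop:

    import boto3

    s3 = boto3.client("s3")

    def count_rows_in_range(bucket: str, key: str, start: int, end: int) -> int:
        """Count the CSV records whose first byte falls inside [start, end) using
        S3 Select with a ScanRange. Summing the counts over contiguous,
        non-overlapping ranges gives the total row count of the object."""
        resp = s3.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression="SELECT COUNT(*) FROM S3Object",
            InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
            OutputSerialization={"CSV": {}},
            ScanRange={"Start": start, "End": end},
        )
        count = 0
        for event in resp["Payload"]:          # event stream
            if "Records" in event:
                count += int(event["Records"]["Payload"].decode().strip() or 0)
        return count

    # Example fan-out over 50 MB chunks (one Lambda per chunk in a real setup):
    bucket, key = "myBucket", "myFile.csv"     # hypothetical names from the question
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    chunk = 50 * 1024 * 1024
    total = sum(count_rows_in_range(bucket, key, s, min(s + chunk, size))
                for s in range(0, size, chunk))
    print(total)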
Have you tried AWS S3 Select (https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run queries on S3 files. I use the service to get basic insight into any file on S3 (provided it can be queried).

Putting a TWS file dependencies on AWS S3 stored file

I have an ETL application which is supposed to be migrated to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler (TWS), which we want to keep using on the cloud as well, and it has file dependencies.
Now when we move to AWS, the files to be watched will land in an S3 bucket. Can we put the OPEN dependency on files in S3? If yes, what would the hostname (HOST#Filepath) be?
If not, what services should be used to serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g. the file might get uploaded to S3 at 1 AM. At 3 AM my schedule will get triggered and look for the file in the S3 bucket. If it is present, execution starts; if not, it should wait as per the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM will identify all files uploaded within the last e.g. 24 hours.
You can list the S3 objects to find everything uploaded within a specific period of time.
A better solution would be to create an S3 upload trigger that sends a notification to SQS; your code can then inspect the queue depth (number of messages) and process the files one by one. An additional benefit is the assurance that all items are processed, without having to worry about time overlaps.
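A rough sketch of both options (the bucket, prefix and queue names are hypothetical, and this is generic boto3 code, not TWS-specific):

    import datetime
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    def files_uploaded_since(bucket: str, prefix: str, hours: int = 24) -> list[str]:
        """Option 1: list objects and keep those modified in the last `hours` hours."""
        cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=hours)
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if obj["LastModified"] >= cutoff:
                    keys.append(obj["Key"])
        return keys

    def pending_uploads(queue_url: str) -> int:
        """Option 2: S3 event notifications feed an SQS queue; check how many
        upload messages are waiting before kicking off downstream processing."""
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        return int(attrs["Attributes"]["ApproximateNumberOfMessages"])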

Amazon Macie to read database data

I am doing a POC with Amazon Macie. From the documentation I gathered that it identifies PII data such as credit card numbers. I even ran an example where I put some valid credit card numbers into a CSV, uploaded it to an S3 bucket, and they were identified by Macie.
I want to know: if the same PII data is inside a database backup/dump file that sits in the S3 bucket, will Macie be able to identify it? I didn't find anything about this in the documentation.
So a couple of things are important here:
Macie can only handle certain types of files and certain compression formats
If you specify S3 buckets that include files of a format that isn't supported in Macie, Macie doesn't classify them.
Compression formats
https://docs.aws.amazon.com/macie/latest/userguide/macie-compression-archive-formats.html
Encrypted Objects
Macie can only handle certain types of encrypted Amazon S3 objects
See the following link for more details:
https://docs.aws.amazon.com/macie/latest/userguide/macie-integration.html#macie-encrypted-objects
Macie Limits
Macie has a default limit on the amount of data that it can classify in an account. After this data limit is reached, Macie stops classifying the data. The default data classification limit is 3 TB. This can be increased if requested.
Macie's content classification engine processes up to the first 20 MB of an S3 object.
So, specifically: if your dump is compressed but is in a suitable format inside the compression, then yes, Macie can classify it. An important caveat is that it will only classify the first 20 MB of the file, which is a problem if the file is large.
Typically I use a Lambda function to split a large file into files just under 20 MB. You still need to think about how, once you have X number of files, you take a record from a file that has been classified as PII and map it back to something that is usable.
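A hypothetical sketch of that splitting step for a plain-text dump, cutting only on line boundaries so records are not broken across parts (the function, bucket and prefix names are made up):

    import boto3

    s3 = boto3.client("s3")
    CHUNK_BYTES = 19 * 1024 * 1024   # stay safely under Macie's 20 MB classification depth

    def split_object_for_macie(bucket: str, key: str, dest_prefix: str) -> int:
        """Stream a large text object from S3 and re-upload it as parts just
        under 20 MB each, splitting only at newline boundaries."""
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        part, buffered, size = 0, [], 0
        for line in body.iter_lines():
            buffered.append(line)
            size += len(line) + 1
            if size >= CHUNK_BYTES:
                s3.put_object(Bucket=bucket, Key=f"{dest_prefix}/part-{part:05d}.txt",
                              Body=b"\n".join(buffered) + b"\n")
                part, buffered, size = part + 1, [], 0
        if buffered:
            s3.put_object(Bucket=bucket, Key=f"{dest_prefix}/part-{part:05d}.txt",
                          Body=b"\n".join(buffered) + b"\n")
        return part + 1

Keeping the source key and part number in the output key, as above, is one way to map a classified part back to the original dump.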

Doubts using Amazon S3 monthly calculator

I'm using Amazon S3 to store videos and some audio files (average size 25 MB each), and users of my web and Android app (so far) can access them with no problem. But I want to know how much I'll pay later, once I exceed the S3 free tier, so I checked the S3 monthly calculator.
I saw that there are 5 fields:
Storage: I put 3 GB because right now there are 130 files (videos and audios).
PUT/COPY/POST/LIST Requests: I put 15 because I'll manually upload around 10-15 files each month.
GET/SELECT and Other Requests: I put 10000 because a projection tells me that users will watch/listen to those files around 10000 times monthly.
Data Returned by S3 Select: I put 250 GB (10000 x 25 MB).
Data Scanned by S3 Select: I don't know what to put because I don't need Amazon to scan or analyze those files.
Am I using that calculator in a proper way?
What do I need to put in "Data Scanned by S3 Select"?
Can I put only zero?
For audio and video, you can definitely specify 0 for S3 Select -- both data scanned and data returned.
S3 Select is an optional feature that only works with certain types of text files -- like CSV and JSON -- where you make specific requests for S3 to scan through the files and return matching values, rather than you downloading the entire file and filtering it yourself.
This would not be used with audio or video files.
Also, don't overlook "Data transfer out." In addition to the "get" requests, you're billed for bandwidth when files are downloaded, so this needs to show the total size of all the downloads. This line item is data downloaded from S3 via the Internet.
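To illustrate the arithmetic with the numbers from the question (the unit prices below are rough, made-up placeholders; always check the current S3 pricing page for your region):

    # Rough monthly estimate using the question's numbers.
    # NOTE: the unit prices are illustrative placeholders, not current AWS pricing.
    storage_gb         = 3                       # ~130 files at ~25 MB each
    get_requests       = 10_000                  # projected plays per month
    transfer_out_gb    = 10_000 * 25 / 1024      # ~244 GB downloaded by users

    price_per_gb_month = 0.023                   # placeholder standard-storage rate
    price_per_1k_gets  = 0.0004                  # placeholder GET request rate
    price_per_gb_out   = 0.09                    # placeholder data-transfer-out rate

    estimate = (storage_gb * price_per_gb_month
                + get_requests / 1000 * price_per_1k_gets
                + transfer_out_gb * price_per_gb_out)
    print(f"~${estimate:.2f}/month, dominated by data transfer out")   # roughly $22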

How to use watchForNewFiles in Dataflow with a GCS source bucket?

Referring to item: Watching for new files matching a filepattern in Apache Beam
Can you use this for simple use cases? My use case is that users upload data to Cloud Storage -> Pipeline (process CSV to JSON) -> BigQuery. I know Cloud Storage is a bounded collection, so it represents batch Dataflow.
What I would like to do is keep the pipeline running in streaming mode, and as soon as a file is uploaded to Cloud Storage, have it processed through the pipeline. Is this possible with watchForNewFiles?
I wrote my code as follows:
p.apply(TextIO.read().from("<bucketname>")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
None of the contents is being forwarded to BigQuery, but the pipeline shows that it is streaming.
You may use Google Cloud Storage triggers here:
https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
These triggers use Cloud Functions which, similar to Cloud Pub/Sub, get triggered on objects when they are created, deleted, archived, or have their metadata changed.
These events are sent using Pub/Sub notifications from Cloud Storage, but pay attention not to set up many functions over the same bucket, as there are notification limits.
Also, at the end of the document there is a link to a sample implementation.
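If the per-file work can be expressed as a straight CSV load, a background Cloud Function on the bucket's finalize event is one simple route; here is a hypothetical sketch (the destination table is a placeholder, and this sidesteps Dataflow rather than fixing the watchForNewFiles pipeline):

    from google.cloud import bigquery

    bq = bigquery.Client()
    TABLE_ID = "my-project.my_dataset.uploads"   # hypothetical destination table

    def on_file_uploaded(event, context):
        """Background Cloud Function triggered by google.storage.object.finalize.
        Loads the newly uploaded CSV straight into BigQuery."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,                      # infer the schema from the file
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = bq.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
        load_job.result()                         # wait so failures surface in the function logs
        print(f"Loaded {uri} into {TABLE_ID}")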