I am working on processing a dataset of large videos (~100 GB) for a collaborative project. To make it easier to share data and results, I keep all of the videos remotely in an Amazon S3 bucket and process them by mounting the bucket on an EC2 instance.
One of the processing steps I am trying to do involves cropping the videos, and rewriting them into smaller segments. I am doing this with moviepy, splitting the video with the subclip method and calling:
subclip.write_videofile("PathtoS3Bucket" + VideoName.split('.')[0] + 'part' + str(segment) + '.mp4', codec='mpeg4', bitrate="1500k", threads=2)
I found that when the videos are too large (with the parameters set as above), calls to this function sometimes generate empty files in my S3 bucket (~10% of the time). Does anyone have insight into features of moviepy/ffmpeg/S3 that would lead to this?
It is recommended not to use tools such as s3fs because these merely simulate a file system, whereas Amazon S3 is an object storage system.
It is generally better to create files locally, then copy them to S3 using standard API calls.
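For example, a minimal sketch of that pattern with moviepy and boto3 (the local path, bucket name, and object key below are placeholders):

    import boto3
    from moviepy.editor import VideoFileClip

    # Write the segment to local disk first (paths and names are placeholders).
    clip = VideoFileClip("local_video.mp4")
    segment = clip.subclip(0, 60)
    local_path = "/tmp/video_part0.mp4"
    segment.write_videofile(local_path, codec="mpeg4", bitrate="1500k", threads=2)

    # Then copy the finished file to S3 with the standard API.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, "my-bucket", "video_part0.mp4")

This way the encoder only ever writes to a real local file, and the object appears in S3 only once the upload completes.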
I am creating a mobile app as my own project. One of the functionalities of the application is uploading video files (up to 500 MB) and watching videos uploaded by other users.
I have been thinking about various server solutions, and there are so many different opinions out there. Unfortunately, it is hard for me to find someone among my friends who knows the topic well and could advise me. To start with, I think it makes sense to use AWS (though I've never used it), and I would like to ask you for advice.
In step one, I upload a video file to AWS S3 from the application (rough sketch below).
AWS MediaConvert compresses the video file; the original is removed and replaced with the new one (Elastic Transcoder is very expensive).
In the application, I can paste direct links to S3 which I can use to serve the videos.
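For step 1, this is roughly what I had in mind on the backend side; just a sketch, with the bucket name and key as placeholders:

    import boto3

    # Generate a presigned URL the app can use to PUT the video straight to S3
    # (bucket name and key are placeholders).
    s3 = boto3.client("s3")
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-video-bucket", "Key": "uploads/video123.mp4"},
        ExpiresIn=3600,
    )
    # The mobile app then uploads the file with an HTTP PUT to upload_url.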
As far as I understand, I don't need any other services than AWS S3 and AWS MediaConvert.
Or maybe I am thinking about this the wrong way, and using AWS for this does not make sense?
Thanks!
As per best practice, AWS resources should be split per account (prod, stage, ...), and it's also good to give devs their own accounts with defined limits (budget, region, ...).
I'm now wondering how I can create a fully working dev environment, especially when it comes to S3 buckets.
Most of the services are pay-per-use, so it's totally fine to spin up some Lambdas, SQS queues, etc. and use the real services for dev.
Now to the real question: what should be done with static assets like pictures, downloads, and so on that are stored in S3 buckets?
Duplicating those buckets for every dev/environment could get expensive, as you pay for storage and/or data transfer.
What I thought was to give the dev's S3 bucket a redirect rule, so that when a file is not found in the dev bucket (i.e. a 404), the request redirects to the prod bucket and images etc. are retrieved from there.
I have tested this and it works pretty well, but it solves only part of the problem.
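Roughly, the rule I set up looks like this as a boto3 sketch (bucket names and the website hostname are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # On the dev bucket's static website configuration: if a key is missing (404),
    # redirect the request to the prod bucket's website endpoint.
    # Bucket names and the hostname are placeholders.
    s3.put_bucket_website(
        Bucket="dev-assets-bucket",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "RoutingRules": [
                {
                    "Condition": {"HttpErrorCodeReturnedEquals": "404"},
                    "Redirect": {
                        "HostName": "prod-assets-bucket.s3-website.eu-central-1.amazonaws.com"
                    },
                }
            ],
        },
    )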
The other part is how to replace those files in a convenient way?
Currently, static assets and downloads are also in our Git repo (maybe not the best idea after all ... though how do you handle file changes that should go live together with new features? For now it is convenient to have them in Git as well), and when someone changes something, they push it and it gets deployed to prod.
We could of course sync the newly uploaded files from the dev's S3 bucket back to the prod bucket, but how do we combine that with merge requests and keep a good CI/CD experience?
What are your solutions for giving every dev their own S3 buckets, so that they can spin up a completely working dev environment with everything available to them?
My experience is that you don't want to complicate things just to save a few dollars. S3 costs are pretty cheap, so if you're just talking about website assets, like HTML, CSS, JavaScript, and some images, then you're probably going to spend more time creating, managing, and troubleshooting a solution than you'll save. Time is, after all, your most precious resource.
If you do have large items that need to be stored for your system to work, then maybe add a lifecycle policy to the S3 bucket that deletes those large items after some reasonable amount of time. If/when a dev needs such an object, they can retrieve it again from its source and re-upload it manually. You could write a script to do that pretty easily.
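For instance, a minimal sketch of such a rule with boto3 (bucket name, prefix, and retention period are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Expire large dev-only objects under a given prefix after 14 days
    # (bucket name, prefix, and number of days are placeholders).
    s3.put_bucket_lifecycle_configuration(
        Bucket="dev-assets-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-large-dev-objects",
                    "Filter": {"Prefix": "large/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 14},
                }
            ]
        },
    )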
What is the best way to reliably have our client send audio files to our S3 bucket, where we will process them (ML processes that do speech-to-text insights)?
The files could be in .wav, .mp3, or other such audio formats. Also, some files may be fairly large.
I would love to hear your best ideas (e.g. API Gateway / Lambda / S3?), especially from anyone who may have done this before.
Some questions and answers to give context:
How do users interface with your system? We are looking for an API-based approach rather than a browser-based approach. We can get a browser-based approach to work, but we are not sure it is the right technical, architectural, or scalable choice.
Do you require a bulk upload method? Yes. We would need bulk upload functionality, and some individual files may be large as well.
Will it be controlled by a human, or do you want it to upload automatically somehow? We certainly want it to upload automatically.
Ultimately, we are building a SaaS solution that will take the audio files and metadata, perform analytics on them, and deliver the results of our analysis back to the app through an API. So we are looking for an approach that will work within this context.
I have a similar scenario.
If you intend to use API Gateway / Lambda / S3, then you should know that there is a limit on the payload size that API Gateway and Lambda can accept. Specifically, API Gateway accepts payloads up to 10 MB and Lambda up to 6 MB.
There is a workaround for this, though: you can upload your files directly to an S3 bucket and attach a Lambda trigger on object creation.
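To give a rough idea, the trigger side could be as small as this sketch (the bucket is whatever you configure the notification on, and the actual processing is a placeholder):

    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Invoked by S3 on object creation; hands each new audio file to processing."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            obj = s3.get_object(Bucket=bucket, Key=key)
            # Placeholder: kick off the speech-to-text pipeline here.
            print(f"New audio file: s3://{bucket}/{key} ({obj['ContentLength']} bytes)")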
I'll leave some articles that may point you in the right direction:
Uploading a file using presigned URLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
Lambda trigger on S3 object creation: https://medium.com/analytics-vidhya/trigger-aws-lambda-function-to-store-audio-from-api-in-s3-bucket-b2bc191f23ec
A holistic view of the same issue: https://sookocheff.com/post/api/uploading-large-payloads-through-api-gateway/
Related GitHub issue: https://github.com/serverless/examples/issues/106
So from my point of view, regarding uploading files, the best way would be to return a presigned URL and then have the client upload the file directly to S3. Otherwise, you'll have to implement uploading the file in chunks.
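For example, a sketch of the presigned-upload side, assuming a small Lambda behind API Gateway hands out the URL (bucket name, key, and the size limit are placeholders):

    import json

    import boto3

    s3 = boto3.client("s3")

    def get_upload_url(event, context):
        """Returns a presigned POST so the client can upload straight to S3.
        Bucket name, key, and the size limit are placeholders."""
        post = s3.generate_presigned_post(
            Bucket="audio-uploads-bucket",
            Key="incoming/recording.wav",
            Conditions=[["content-length-range", 0, 500 * 1024 * 1024]],
            ExpiresIn=3600,
        )
        return {"statusCode": 200, "body": json.dumps(post)}

    # Client side (illustration): POST the file straight to S3 with the returned
    # fields, e.g. requests.post(post["url"], data=post["fields"],
    #                            files={"file": open("recording.wav", "rb")})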
I have a CSV file that has over 10,000 URLs pointing to images on the internet. I want to perform some machine learning tasks on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead, I want to transfer them directly to the bucket. I have looked at the Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed next? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned there for each URL in my list, and also get the number of bytes of the image at each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers can provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing you with a Content-MD5 header in its response, which may not be in the form that Storage Transfer Service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
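As a sketch of that check (the URL is a placeholder, and the exact TSV layout should be confirmed against the Storage Transfer Service docs on URL lists):

    import base64
    import re

    import requests

    url = "https://example.com/image001.jpg"  # placeholder URL from the CSV

    resp = requests.head(url, allow_redirects=True)
    size = resp.headers.get("Content-Length")
    md5 = resp.headers.get("Content-MD5")

    # Some servers return a hex digest rather than the base64 form the URL list
    # expects, so convert it if needed.
    if md5 and re.fullmatch(r"[0-9a-fA-F]{32}", md5):
        md5 = base64.b64encode(bytes.fromhex(md5)).decode()

    if size and md5:
        # One row of the TSV URL list: URL <tab> size in bytes <tab> base64 MD5
        print(f"{url}\t{size}\t{md5}")
    else:
        print(f"{url}: no size/MD5 from the server; it would have to be downloaded.")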
Another option worth considering is to set up one or more GCE instances and run a script from there that downloads the objects to your GCE instance and then uploads them into GCS. This still involves downloading them "locally," but "locally" no longer means somewhere outside Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1,000 objects each and setting up 10 GCE instances to do the work.
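A rough sketch of what each instance's script could look like, assuming the google-cloud-storage client library and one slice of the CSV per instance (bucket name, file paths, and key scheme are placeholders):

    import csv

    import requests
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-images-bucket")  # placeholder bucket name

    with open("urls_part_01.csv") as f:  # one of the split CSV files
        for row in csv.reader(f):
            url = row[0]
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            # Store the image under a key derived from the URL (placeholder scheme).
            blob = bucket.blob(url.rsplit("/", 1)[-1])
            blob.upload_from_string(
                resp.content, content_type=resp.headers.get("Content-Type")
            )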
Can someone please let me know what the efficient way of pulling data from S3 is? Basically, I want to pull out data for a given time range, apply some filters to the data (JSON), and store it in a DB. I am new to AWS, and after a little research I found that I can do it via the boto3 API, Athena queries, or the AWS CLI. But I need some advice on which one to go with.
If you are looking for the simplest and most straightforward solution, I would recommend the AWS CLI. It's perfect for running commands to download a file, list a bucket, etc. from the command line or a shell script.
If you are looking for a solution that is a little more robust and integrates with your application, then any of the various AWS SDKs will do fine. The SDKs are a little more feature rich IMO and much cleaner than running shell commands in your application.
If the application that is pulling the data is written in Python, then I definitely recommend boto3. Make sure to read up on the difference between a boto3 client and a resource.
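For example, a minimal boto3 sketch of that kind of pull-and-filter (bucket, prefix, time range, and the JSON field names are placeholders):

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    start = datetime(2023, 1, 1, tzinfo=timezone.utc)  # placeholder time range
    end = datetime(2023, 1, 2, tzinfo=timezone.utc)

    # List objects under a prefix and keep those last modified within the range.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-data-bucket", Prefix="events/"):
        for obj in page.get("Contents", []):
            if not (start <= obj["LastModified"] <= end):
                continue
            body = s3.get_object(Bucket="my-data-bucket", Key=obj["Key"])["Body"].read()
            record = json.loads(body)
            # Placeholder filter; replace with your own conditions and DB write.
            if record.get("status") == "active":
                print(record)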
Some options:
Download and process: Launch a temporary EC2 instance, have a script download the files of interest (e.g. one day's files?), and use a Python program to process the data. This gives you full control over what is happening.
Amazon S3 Select: This is a simple way to extract data from CSV files, but it only operates on a single file at a time.
Amazon Athena: Provides an SQL interface to query across multiple files using Presto. Serverless, fast. Charged based on the amount of data read from disk (so it is cheaper on compressed data).
Amazon EMR: Hadoop service that provides very efficient processing of large quantities of data. Highly configurable, but quite complex for new users.
Based on your description (10 files, 300MB, 200k records) I would recommend starting with Amazon Athena since it provides a friendly SQL interface across many data files. Start by running queries across one file (this makes it faster for testing) and once you have the desired results, run it across all the data files.
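As a starting point, a hedged sketch of kicking off such a query with boto3 (database, table, column names, and the output location are all placeholders):

    import boto3

    athena = boto3.client("athena")

    # Database, table, columns, and output location are placeholders.
    query = """
        SELECT *
        FROM my_events
        WHERE event_time BETWEEN timestamp '2023-01-01 00:00:00'
                             AND timestamp '2023-01-02 00:00:00'
    """

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("Query execution id:", execution["QueryExecutionId"])
    # Poll get_query_execution() until it completes, then read the results with
    # get_query_results() or directly from the output location in S3.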