I am about to embark on creating a REST API that accepts two base64-encoded images from an external party.
I have heard from people that it would be prudent to think about protecting against a malicious file being sent to me via some sort of attack.
My initial thoughts are that I need to think about virus scanning, but also about an incorrect or dodgy image (for example, an image from a porn site).
We are using AWS as our cloud platform. Can anyone help me with some ideas as to best practices / how I can protect against this?
In general (there are obviously exceptions), if you are just receiving and storing images into something such as S3, you do not need to worry about viruses and whatnot from the image upload, as these files are not going to be executed, and should only be rendered as jpg/png/whatever file format you use.
If you wish to check for unsafe or inappropriate images, you could utilize Amazon Rekognition's unsafe content detection feature. In addition, you could use Rekognition to perform image tagging and generate metadata for each image.
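As a rough sketch with boto3 (assuming the images arrive base64-encoded as in your API; the region and helper name are illustrative):

    import base64
    import boto3

    rekognition = boto3.client("rekognition", region_name="us-east-1")

    def check_image(image_b64: str) -> dict:
        """Decode the uploaded payload and ask Rekognition about unsafe content and tags."""
        image_bytes = base64.b64decode(image_b64)

        # Unsafe / inappropriate content detection
        moderation = rekognition.detect_moderation_labels(
            Image={"Bytes": image_bytes},
            MinConfidence=60,
        )

        # General image tagging for metadata
        labels = rekognition.detect_labels(
            Image={"Bytes": image_bytes},
            MaxLabels=10,
            MinConfidence=75,
        )

        return {
            "unsafe_labels": [l["Name"] for l in moderation["ModerationLabels"]],
            "tags": [l["Name"] for l in labels["Labels"]],
        }

You could reject or flag the upload whenever unsafe_labels comes back non-empty, and store the tags alongside the object.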
My problem:
I want to stream media that I record on the client (TypeScript code) to my AWS storage. Services like YouTube / Twitch / Zoom / Google Meet can record live and save the recording to their cloud; some of them even tolerate host failure and still create a file if the host disconnects.
I want each stream to have a different file name so that future triggers can pick it up.
I tried to save the stream into S3, but maybe there are storage solutions better suited to my problem.
What services I tried:
S3: I tried to stream directly into S3, but it doesn't really support updating existing objects.
I tried multipart uploads (roughly the sketch after this list), but they are not tolerant of host failure.
I also tried uploading each part as its own object and having a Lambda merge them (yes, it is very dirty and resource-consuming), but I sometimes had ordering problems.
Kinesis Video Streams: I tried to use Kinesis Video but couldn't enable the saving feature through the SDK.
Setting it up by hand, I saw that it saved a new file after a period of time or once a size threshold was reached, so maybe it is not the solution I want.
Amazon IVS: I tried it because Twitch recommends it, although it is way beyond my requirements.
I couldn't find a code example of what I want to do with the SDK (only by-hand / console examples).
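For reference, this is roughly what my S3 multipart attempt looked like (a simplified boto3 sketch; the bucket/key names and the iter_chunks chunker are placeholders for my own code):

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-recordings-bucket", "streams/session-1234.webm"  # placeholder names

    # Start a multipart upload; S3 assembles the parts by PartNumber, so ordering is handled.
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []

    for part_number, chunk in enumerate(iter_chunks(), start=1):  # iter_chunks() is my own 5 MB+ chunker
        resp = s3.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload["UploadId"],
            PartNumber=part_number,
            Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

    # The object only appears once the upload is completed; if the host dies before this,
    # no file is created, which is exactly the host-failure problem I described above.
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )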
Questions
Am I looking at the right services?
What can I do with the AWS SDK to make it work?
Is there a good place with code examples for future problems? Or maybe a way to search for solutions?
Thank you for your help.
I know S3 objects are immutable and their content cannot be edited, but I'm curious why.
Is it because S3 uses a RESTful API and PUT doesn't support partial writes? But why not just transfer the data of the updated blocks and have the S3 backend server update the file? Or implement the HTTP PATCH method?
By the way, when playing a video (.mp4) stored on S3, it seems to support random reads, because I can jump around in the progress bar instantly without waiting. I'm not sure whether the S3 client I use (RaiDrive) has a local cache or whether S3 itself supports random access for video playback.
It has nothing to do with a PATCH request. A PATCH request just means the HTTP method verb is "PATCH"; someone still has to implement the ability to do what you are asking for.
The main reason is that making objects entirely immutable makes it easier to build a distributed, replicated, versioned, scalable, and highly available storage system like S3.
If objects were mutable, building a system like S3 would be a lot more difficult.
Many distributed storage systems move to immutability to make them easier to architect and design. Even the Google File System only allows appends, which is why they use append-friendly stores like LevelDB to support those workloads instead of B-tree type databases that require writing to the middle of files.
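On the side question about seeking in the .mp4: as far as I know, S3 does support random reads via the HTTP Range header on GET, which is what lets a player jump around without downloading the whole object. A rough boto3 sketch (bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # Fetch only bytes 1,000,000 through 1,999,999 of the video; a player does something
    # similar when you seek in the progress bar.
    resp = s3.get_object(
        Bucket="my-video-bucket",       # placeholder name
        Key="videos/example.mp4",       # placeholder name
        Range="bytes=1000000-1999999",
    )
    chunk = resp["Body"].read()
    print(len(chunk), resp["ContentRange"])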
There are cases in a project where I'd like to store images on a model.
For example:
Company Logos
Profile Pictures
Programming Languages
Etc.
Recently I've been using AWS S3 for file storage (primarily hosting on Heroku) via ImageField uploads.
I feel like there's a better way to store files than what I've been doing.
For some things (like the examples above), I think it would make sense to just get an image URL from a more publicly available source rather than take up space in my own database.
For the experts in the Django community who have built and deployed really professional projects, do you typically store files directly into the Django media folder via ImageField?
Or do you normally use a URLField and then pull a URL from an API or an image link from the web (e.g., find any image on Google, right-click it, and copy and paste the image URL)?
Bonus: What does your image storing setup look like?
Hope this makes sense.
Thanks in advance!
The standard is what you've described: use something like AWS S3 to store the actual image and keep the URL in your database. Here are a few reasons why:
It's cheap. Like, really cheap.
Instead of making your web server serve the files, you're offloading that onto the client (e.g. their browser grabbing the file from S3)
If you're using an ephemeral system (like Heroku), your only option is to use something like S3.
Control. Sure, you can pull an image link from somewhere else that isn't managed by you. But this does not scale. What happens if that server goes offline? What if they take that image down? This way, you control what happens to the objects.
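If it helps, here's a minimal sketch of the usual setup with django-storages and its S3 backend (the bucket name, region, and model are just placeholders, not a drop-in config):

    # settings.py: store uploaded files in S3 instead of the local media folder
    # (requires the django-storages and boto3 packages)
    DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
    AWS_STORAGE_BUCKET_NAME = "my-app-uploads"   # placeholder name
    AWS_S3_REGION_NAME = "us-east-1"

    # models.py: the database only stores the object path; the bytes live in S3
    from django.db import models

    class Company(models.Model):
        name = models.CharField(max_length=100)
        logo = models.ImageField(upload_to="company-logos/")  # uploads go to S3; URL via company.logo.url

The nice part is that your templates and views just use company.logo.url as usual, and the storage backend decides where the file actually lives.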
An example of a decently large internet company (though not large enough to run its own infrastructure, like Facebook/Instagram or Google do) is VSCO. They're taking a decent number of photo uploads every day, and they're handling them with AWS.
I am trying to find the best practice for streaming images from s3 to client's app.
I created a grid-like layout using flutter on a mobile device (similar to instagram). How can my client access all its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout for all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB. A DynamoDB table holds the paths of all images the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all images to my server first and then send them to the client.
Basically my server is the middleman, downloading the images and then sending them back to the client. Is this the right way of doing it? It seems that if the client accessed S3 directly it would be faster, but I'm not sure whether that is safe. Plus I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources and unnecessary computation, and if you don't have scalability concerns.
However, if you're worried about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, it still seems necessary to control access to the resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long each URL remains valid).
Then send these links to your client so that it can stream directly from S3 via the URLs, rather than your server being the middleman.
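A rough boto3 sketch of that flow (the bucket name, keys, and helper name are placeholders):

    import boto3

    s3 = boto3.client("s3")

    def presign_image_urls(image_keys, bucket="user-images-bucket", expires_in=3600):
        """Turn the S3 keys from DynamoDB into temporary, directly downloadable URLs."""
        return [
            s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": key},
                ExpiresIn=expires_in,  # seconds the link stays valid
            )
            for key in image_keys
        ]

    # e.g. hand these straight to the Flutter client instead of proxying the bytes
    urls = presign_image_urls(["users/123/photo-001.jpg", "users/123/photo-002.jpg"])

Your backend still decides which keys a given user is allowed to see; it just stops shuttling the image bytes itself.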
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of mitigating uncontrolled access to your S3 objects. You probably need to handle some edge cases, though (e.g. how the client should behave when its access to an S3 object has expired, so that users don't notice, etc.). All of these are costs of making something work at scale, if you have those scalability concerns.
Say I have a bucket called uploads with two directories, both of which contain images.
The first directory, called catalog, has images with various extensions (.jpg, .png, etc.)
The second directory, called brands, has images with no extensions.
I can request uploads/catalog/some-image.jpg and uploads/brands/extensionless-image, and they both return an image as I expect.
We're already using a third-party service, imgix, which is just an image-processing CDN that links to the S3 bucket so that we can request, say, a smaller or cropped version of the image in the bucket.
Ideally, I'd like to keep the images and objects in their current formats in the bucket, but I would like the client side to be agnostic about which file it is requesting. In other words, I'd like to request some-image, and even though it may or may not actually have an extension in the bucket, I'd still like to somehow "intelligently guess" the image I'm requesting. We'll also assume that there are no collisions, i.e., there will never be both a some-image.jpg and a some-image with the same base name (our objects are named with a collision-free algorithm).
This is what I've tried:
Simply request images in one directory by their extension, and images in the other directory without an extension (however, even though the policy for requesting an image is the same, the mechanism has to be implemented in two different ways; I would like a single mechanism).
Another solution is to programmatically remove the extensions from all the images in catalog and re-sync the bucket
Anyone run into something similar before? Thoughts?
I suspect your best bet is going to be renaming the images. Not that there aren't other solutions, but because that is probably going to be the simplest and most straightforward approach.
First, S3 will not guess. The key on an S3 object is an opaque string from S3's perspective. The extension has no meaning, and even the slashes delimiting "directories" have no intrinsic meaning to S3. (Deleting a "directory" in S3 means sending a delete request for every individual object in the directory. The console creates a convenient illusion by doing this for you.)
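(As a small illustration of that last point: "deleting a directory" with boto3 is really just listing a prefix and deleting each key under it; the bucket and prefix below are made up.)

    import boto3

    s3 = boto3.client("s3")
    bucket, prefix = "uploads", "catalog/"  # illustrative names

    # "Deleting the directory" = deleting every object whose key starts with the prefix.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})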
S3 has redirect rules, but they only match and manipulate path prefixes, not suffixes, so no help there.
It would be possible, using a reverse proxy in front of S3, to inspect requests and for any 404 or 403, the proxy could retry the request with alternate extensions, until it found one that worked, and it could potentially "learn" the right extension for use on subsequent requests, but then you'd have the added turn-around time and additional cost for multiple requests.
I have developed systems whose job it is to "find" things requested over HTTP by trying multiple back-end URLs, without the requester being aware of the "hunting" going on in the background, and it can be very useful... but that is a much more complicated solution than you would probably want to consider, particularly in light of the fact that every millisecond counts when it comes to image loading.
There is no native solution for magic guessing with S3. You pretty much have to ask it for exactly what you want. Storage in S3 is cheap enough, of course, that you could probably duplicate your content, with and without extensions, without giving too much thought to the cost. If you used a Lambda event on the bucket, you could even automate the process of copying "kitten.jpg" to "kitten" each time "kitten.jpg" was modified.
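A rough sketch of such a Lambda handler (assuming an S3 "object created" trigger on the bucket; this is the idea, not a drop-in function):

    import os
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """On every upload, copy e.g. 'kitten.jpg' to 'kitten' so both keys exist."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded in the event

            base, ext = os.path.splitext(key)
            if not ext:
                continue  # already extensionless, nothing to duplicate

            s3.copy_object(
                Bucket=bucket,
                Key=base,
                CopySource={"Bucket": bucket, "Key": key},
            )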
If the content-type is set correctly in your object metadata, you should be fine regardless of extensions. If the content-type header is not set, you can set it, for example by using ImageMagick's identify to discover the image type and the AWS CLI to set it.
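If you'd rather do it from Python than the CLI, one way (a sketch only; this uses Pillow for the type sniffing instead of ImageMagick, and the bucket/key are placeholders) is an in-place copy that rewrites the metadata:

    import io

    import boto3
    from PIL import Image  # Pillow, standing in for ImageMagick's identify

    s3 = boto3.client("s3")
    bucket, key = "uploads", "brands/extensionless-image"  # placeholder names

    # Sniff the real image format from the bytes...
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    fmt = Image.open(io.BytesIO(body)).format.lower()  # e.g. "jpeg", "png"
    content_type = f"image/{fmt}"

    # ...then copy the object onto itself with the corrected Content-Type.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ContentType=content_type,
        MetadataDirective="REPLACE",  # required so S3 applies the new headers
    )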