where to store thumbnails, DB or S3? - amazon-web-services

We have lots of images stored in AWS S3. We plan to provide thumbnails for users to preview. If I store them in S3, I can only retrieve them one by one, which is not efficient. Should I store them in a database? (I need to query my database to decide which set of thumbnails to show the user.)

The best answer depends on the usage pattern of the images.
For many applications, S3 would be the best choice for the simple reason that you can easily use S3 as an origin for CloudFront, Amazon's CDN. By using CloudFront (or indeed any CDN), the images are hosted physically around the world and served from the fastest location for a given user.
With S3, you would not retrieve the images at all. You would simply use S3's URL in your final HTML page (or the CloudFront URL, if you go that route).
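As a rough illustration of "just use the URL", the sketch below builds the address that would go into an img tag. The function name, bucket name, region, and CloudFront domain are all hypothetical, and it assumes the objects are publicly readable (or readable by the CDN).

```python
from typing import Optional
from urllib.parse import quote


def image_url(key: str, bucket: str, region: str = "us-east-1",
              cdn_domain: Optional[str] = None) -> str:
    """Return a URL for an object key, preferring the CDN when one is configured.

    bucket, region and cdn_domain are placeholder values for illustration.
    """
    path = quote(key, safe="/")  # percent-encode spaces etc., keep the slashes
    if cdn_domain:
        return f"https://{cdn_domain}/{path}"
    # Virtual-hosted-style S3 URL
    return f"https://{bucket}.s3.{region}.amazonaws.com/{path}"


print(image_url("thumbs/cat 1.jpg", "my-bucket"))
print(image_url("thumbs/cat 1.jpg", "my-bucket", cdn_domain="d111111abcdef8.cloudfront.net"))
```

The server only emits the URL string; the browser fetches the image bytes directly from S3 or the CDN.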
If you serve images from the database, that increases resource consumption on the DB (more IO, more CPU, and some RAM used to cache image queries that is not available to cache other queries).
No matter which route you go, pre-create the thumbnail rather than producing it on the fly. Storage space is cheap, and the delay to fetch (from S3 or the DB), process, then re-serve the thumbnail will degrade the user experience. Additionally, if you create the thumbnail on the fly, you cannot benefit from a CDN.

If I store them in S3, I can only retrieve them one by one, which is not efficient.
No, it only looks inefficient because of the way you are using it.
S3 is massively parallel. It can serve your image to tens of thousands of simultaneous users without breaking a sweat. It can serve hundreds of images to the same user in parallel, so you can load 100 images in the same time it takes to load 1 image. So why is your page slow?
Your browser is trying to be a good citizen and only pulls 2-4 images from a site at a time. This "serialization" is what is slowing you down and causing the bottleneck.
You can trick the browser by hosting assets on multiple domains. This is called "domain sharding". You can do it with multiple buckets (put images into 4 different buckets, depending on the last digit of their ID). Alternatively, you can do it with CloudFront: http://abhishek-tiwari.com/post/CloudFront-design-patterns-and-best-practices/
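A minimal sketch of the sharding idea, assuming numeric image IDs and four hypothetical hostnames (each could be a bucket or a CloudFront alias). It uses the ID modulo the number of hosts rather than strictly the last digit, so the images spread evenly; the important property is that a given image always maps to the same host, otherwise browser and CDN caches are defeated.

```python
from typing import Sequence

# Hypothetical shard hostnames; in practice these would be your own
# bucket endpoints or CDN aliases.
SHARD_HOSTS = (
    "img0.example.com",
    "img1.example.com",
    "img2.example.com",
    "img3.example.com",
)


def shard_url(image_id: int, filename: str,
              hosts: Sequence[str] = SHARD_HOSTS) -> str:
    """Pick a host deterministically from the image ID and build the URL."""
    host = hosts[image_id % len(hosts)]
    return f"https://{host}/{filename}"


for image_id in (10, 11, 12, 13):
    print(shard_url(image_id, f"{image_id}.jpg"))
```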

As a best practice, you should store your static data in S3 and save references to it in the database.
In your particular case, you can save the filename or hyperlink of each image file in a database that you can query according to your business logic.
This gives you references to all the images, which you can then fetch from S3 and display to your users.
It also makes it easy to change which thumbnail a reference points to. For example, if you are running an e-commerce site, you can point the thumbnail reference at a new product image without much effort.
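The approach above could be sketched as follows, using an in-memory SQLite database as a stand-in for whatever database the application already has. The table name, columns, and keys are made up for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # The database stores only a reference (the S3 key), never the image bytes.
    conn.execute(
        """CREATE TABLE thumbnails (
               product_id INTEGER PRIMARY KEY,
               category   TEXT NOT NULL,
               s3_key     TEXT NOT NULL
           )"""
    )


def thumbnail_keys(conn: sqlite3.Connection, category: str) -> list:
    """Business-logic query: which thumbnails should this page show?"""
    rows = conn.execute(
        "SELECT s3_key FROM thumbnails WHERE category = ? ORDER BY product_id",
        (category,),
    )
    return [row[0] for row in rows]


def replace_thumbnail(conn: sqlite3.Connection, product_id: int, new_key: str) -> None:
    """Point a product at a new image by updating the reference only."""
    conn.execute(
        "UPDATE thumbnails SET s3_key = ? WHERE product_id = ?",
        (new_key, product_id),
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.executemany(
    "INSERT INTO thumbnails VALUES (?, ?, ?)",
    [(1, "shoes", "thumbs/1.jpg"), (2, "shoes", "thumbs/2.jpg"), (3, "hats", "thumbs/3.jpg")],
)
replace_thumbnail(conn, 2, "thumbs/2-v2.jpg")
print(thumbnail_keys(conn, "shoes"))  # ['thumbs/1.jpg', 'thumbs/2-v2.jpg']
```

The keys returned by the query are then turned into S3 or CloudFront URLs in the page, so one database query decides the whole set of thumbnails and the browser loads them in parallel.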
I hope this helps.

Related

Tagging photos in s3 for sorting versus storing in the database

So as a photo company we present photos to customers in a gallery view. Right now we show all the photos from a game. This is done by getting a list of the objects and getting a presigned URL to display them on our site. And it works very well.
The photos belong to an "event" and each photo object is stored in an easy to maintain/search folder structure. And every photo has a unique name.
But we are going to build a sorting system so customers can filter their view. Our staff would upload the images to S3 first, and then the images would be sorted. And as of right now, we do not store any reference to the image in the database until the customer adds it to their cart.
So I believe we have three ways to tag these images:
Store a reference to the image in our database, with tags.
Apply metadata to the S3 object.
Apply tags to the S3 object.
My issue with the database method is that we shoot hundreds of thousands of images a month, and I feel that would overly bloat the database. Maybe we create a separate DB just for this (DynamoDB, etc.)?
Metadata could work, and for the most part each image will only be tagged or untagged once on average. But I understand that every time the metadata is changed, a new copy of that image is created. We don't do versioning, so there would still only exist one copy. But there would be a cost associated with duplicating an image, right? The pro would be that the metadata comes down with the GET object response, so a second request wouldn't be needed. But would it be available in the response headers when using a presigned URL?
Tags, on the other hand, can be set/unset as needed with no/minimal additional cost. Getting the objects and their tags would be two separate calls, but on the other hand I could get the list of objects by tag and therefore only generate presigned URLs for the objects that need to be displayed rather than all of them. So that could be helpful.
That is at least how I understand it. If I am wrong please tell me.

I have some data to store in the cloud: what is the difference between Google Cloud Storage, Firestore, Firebase Realtime Database, S3, etc.?

I am learning to make an app. It lets users input data into the app and shows it to other users (for example, a social media app where a user uploads something).
The data might be text, images, etc. (but no video, because I want to keep costs low).
I am looking at cloud storage services, but I have no idea what the difference is between Firestore, Firebase Realtime Database, S3, etc.
For example, suppose the user uploads a pasta image and a text saying 'that is good'.
Which database should I choose? Do the text and the image need to be stored in separate databases?
If they need to be separate, what would the structure be?

Storing raw text data vs analytics

I’ve been working on a hobby project: a Django/React site that provides analytics and data visualization for texts. It will most likely be hosted on AWS. The user uploads a CSV of texts. The current logic is that they get stored in the DB, and when the user calls the API it runs the analytics on them and sends back the results. I’m trying to decide whether to store the raw text data (what I have now) or run the analytics on the texts once when they're uploaded and then discard them, storing only the analytics.
My thoughts are:
Raw data:
pros:
changes to analytics won’t require re uploading
probably simpler db schema
cons:
more sensitive data (not sure how safe it is in a django db on AWS, not sure what measures I could put in place to protect it more)
more data to store (not sure what it would cost to store a lot of rows of texts)
Analytics:
pros:
less sensitive, less space
cons:
if something goes wrong with the analytics on the first run (that doesn’t throw an error), then they could be inaccurate and will remain that way

Database suggestion for large unstructured datasets to integrate with elasticsearch

We have a scenario with millions of records saved in a database. Currently I am using DynamoDB for saving metadata (and for write, update, and delete operations on objects), S3 for storing files (e.g. images, whose associated metadata is stored in DynamoDB), and Elasticsearch for indexing and searching. But DynamoDB's limit of 400 KB per item (a single object) is not sufficient for the data we need to save. I thought about splitting an object across several versions within DynamoDB itself, but that would be too complicated.
So I am thinking of replacing DynamoDB with some better storage:
AWS DocumentDB
S3 for saving the metadata as well, along with the object files
Which of the two is the better option in your opinion, and why? It should also be cost-effective. (It should also be easy to sync with Elasticsearch, but ES syncing is not much of an issue, as it is possible with both.)
If you have any better suggestions than these two, please tell me those as well.
I would suggest looking at DocumentDB over Amazon S3 for your use case, considering the following:
S3 storage costs $0.023 per GB per month for Standard and $0.0125 for Infrequent Access, whereas DocumentDB costs $0.10 per GB-month; depending on your data size, the difference could add up greatly. If you use IA, be aware that retrieval costs could also add up.
With S3 you would not fetch the data directly; you would use either Athena or S3 Select to filter it. Depending on the size of the data being queried, that would take from a few seconds to possibly minutes (not the milliseconds you requested).
S3 and the querying technologies around it are targeted more at unstructured data in a data lake used for analysis, whereas DocumentDB is geared toward performance within live applications (it is a MongoDB-compatible data store, after all).
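To make the storage price difference concrete, here is a small arithmetic sketch using only the per-GB rates quoted above. Those rates may have changed since the answer was written, and the figures cover storage only: request, retrieval, query, and instance charges are deliberately left out, so this is not a full cost comparison.

```python
# Per-GB monthly storage prices as quoted in the answer (may be out of date).
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "s3_infrequent_access": 0.0125,
    "documentdb": 0.10,
}


def monthly_storage_cost(size_gb: float) -> dict:
    """Storage-only monthly cost in USD for each option, rounded to cents."""
    return {name: round(price * size_gb, 2)
            for name, price in PRICE_PER_GB_MONTH.items()}


print(monthly_storage_cost(1000))
# {'s3_standard': 23.0, 's3_infrequent_access': 12.5, 'documentdb': 100.0}
```

At 1 TB the gap is tens of dollars a month, which has to be weighed against the query latency difference described above.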

Store images with a tag and prefix for fast queries (AWS S3)

I use Ionic to build a mobile app that can take photos and upload images from the phone to S3. I wonder how to attach a prefix or tag to each uploaded image so that I can look it up quickly and uniquely. I am thinking of using a prefix to create a folder structure:
year/month/day/filename (e.g. 2018/11/27/image.png)
If there are a lot of images in the 2018/11/27/ folder, I think queries will be slow, and sometimes the image filename is not unique. Any suggestions? Thanks a lot.
Amazon S3 is an excellent storage service, but it is not a database.
You can store objects in Amazon S3 with whatever name you wish, but if you wish to list/sort/find objects quickly you should store the name of the object, together with its metadata, in a database. Then you can query the database to find the object of interest.
DynamoDB would be a good choice because it can be configured for guaranteed speed. You could also put DAX in front of DynamoDB for even greater performance.
With information about the objects stored in a database, you can quite frankly name each individual object anything you wish. Many people just use a UUID since it just needs to be a unique identifier. The object name itself does not need to convey any meaning - it is simply a Key to identify the object when it needs to be accessed later.
If, however, objects are typically processed in groups (such as having daily files grouped together into months for processing with Hadoop clusters), then locating objects in a particular path is useful. It allows the objects to be processed together without having to consult the database.
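A minimal sketch of the naming advice above: a UUID-based key that is unique regardless of the uploaded filename, with an optional year/month/day prefix for the case where objects are processed in date groups. The function name is hypothetical, and the key would still be recorded in the database alongside the object's metadata.

```python
import uuid
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def object_key(filename: str, day: Optional[date] = None) -> str:
    """Build a unique S3 key, keeping only the original file extension.

    If `day` is given, the key is placed under a year/month/day prefix so
    that objects from the same day can be listed and processed together.
    """
    ext = PurePosixPath(filename).suffix.lower()
    name = f"{uuid.uuid4().hex}{ext}"
    if day is None:
        return name
    return f"{day:%Y/%m/%d}/{name}"


print(object_key("image.png"))
print(object_key("image.png", date(2018, 11, 27)))
```

Two uploads both called image.png on the same day get different keys, which addresses the uniqueness concern in the question.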