In my application, I have to run a periodic job that needs data about all the photos on my client's Flickr account. Currently, I make several calls to flickr.photos.search to retrieve metadata about all the photos each time.
What I'm asking is: is there a way to be informed by Flickr when a photo is modified or deleted, so that I don't need to retrieve the metadata for every photo, but can instead store it once and only download what has changed since the last time I ran the job?
Thanks in advance for your help.
There is no such notification possible from the Flickr API to your code.
What you can do instead (recommended only if the volume of photo metadata changes is high) is the following:
Set up a cron job that scans through the photos and records whether each photo ID has been deleted, which can be used later.
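A minimal sketch of that diffing logic; the JSON store and the fetch_all_photo_ids() helper (wrapping your existing flickr.photos.search paging) are just illustrative names:

import json

def fetch_all_photo_ids():
    # Placeholder for your existing flickr.photos.search paging loop;
    # it should return the set of photo IDs currently on the account.
    raise NotImplementedError

def run_job(store_path='photo_ids.json'):
    try:
        with open(store_path) as f:
            known_ids = set(json.load(f))
    except FileNotFoundError:
        known_ids = set()

    current_ids = fetch_all_photo_ids()
    deleted = known_ids - current_ids  # photos removed since the last run
    added = current_ids - known_ids    # photos uploaded since the last run

    # Fetch metadata only for `added` and mark `deleted` in your own store here.

    with open(store_path, 'w') as f:
        json.dump(sorted(current_ids), f)
    return added, deleted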
So as a photo company we present photos to customers in a gallery view. Right now we show all the photos from a game. This is done by listing the objects and generating a presigned URL for each one to display them on our site, and it works very well.
The photos belong to an "event" and each photo object is stored in an easy to maintain/search folder structure. And every photo has a unique name.
But we are going to build a sorting system so customers can filter their view. Our staff would upload the images to S3 first, and then the images would be sorted. And as of right now, we do not store any reference to the image in the database until the customer adds it to their cart.
So I believe we have 3 ways we can tag these images:
Either store a reference to the image in our database with tags.
Apply metadata to the S3 object
Apply tags to the S3 object
My issue with the database method is that we shoot hundreds of thousands of images a month, and I feel that would overly bloat the database. Maybe we create a separate DB just for this (DynamoDB, etc.)?
Metadata could work, and for the most part the images will only be tagged or untagged about once. But I understand that every time the metadata is changed, a new copy of the image is created. We don't use versioning, so there would still only be one copy, but there would be a cost associated with duplicating the image, right? The pro would be that the metadata comes down with the GET object, so a second request wouldn't be needed. But is it available in the response when using a presigned URL?
Tags, on the other hand, can be set and unset as needed with minimal additional cost, although getting the objects and their tags would be two separate calls. On the other hand, I could get the list of objects by tag and therefore only generate presigned URLs for the objects that need to be displayed instead of all of them, so that could be helpful.
That is at least how I understand it. If I am wrong please tell me.
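For reference, a minimal sketch of what the tagging route might look like with boto3; the bucket, key, and tag names are made up:

import boto3

s3 = boto3.client('s3')

# Tag an uploaded photo so the sorting system can pick it up later.
s3.put_object_tagging(
    Bucket='example-photo-bucket',
    Key='events/2019/game-01/IMG_0001.jpg',
    Tagging={'TagSet': [{'Key': 'status', 'Value': 'approved'}]},
)

# Reading tags back is a separate request per object.
tags = s3.get_object_tagging(
    Bucket='example-photo-bucket',
    Key='events/2019/game-01/IMG_0001.jpg',
)['TagSet']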
I am trying to do something that would be relatively simple for a relational database but I don't know how to do it for a nonrelational one.
I am trying to make a simple task web app on AWS where people can post their tasks.
I have a table called tasks which uses the userid from the auth token provisioned by AWS Cognito. I am wondering how I can return the user information. I do not want to rely on Cognito by calling it every time a user sends a request, so my thought was to create another table to store all of the user information. That, however, is not a very nonrelational way of doing things, since JOINs are so bad.
So, I was wondering if I should do any of the following
a) Using RDS instead
b) Not using Cognito and setting up my own auth system
c) Just doing the JOIN with a table containing all of the user info
d) Doing the request to Cognito each time
Although I personally like the idea of Cognito, at this time it has some major drawbacks...
You cannot back up / restore a user pool without losing the users' passwords; you also have to implement your own backup/restore.
A way around this is to save the user's password in a Cognito custom attribute.
I expected that, by using an API Gateway Lambda authorizer, all the user data would be in the Lambda context, but it's not there. Or am I doing something wrong with the API Gateway template mapping? 😬
A good thing is that the API Gateway Lambda authorizer result can be cached for up to an hour, so it won't call the authorizer function again, which seems like a top feature.
It does not work well with CloudFormation: with every attribute update it recreates the user pool without restoring the users, thus losing them.
I used it only in one implementation and ended up duplicating the users in DynamoDB as well.
I've been avoiding it ever since. I wish they would solve these issues, as it looks like a service that could be included with every project, saving a lot of time.
Reading your post, I asked myself the same questions and I'm not sure of the answer either 😄
Pricing seems fair.
The default limit of 5 requests/second to get user info seems strange, as it would be consumed by one page load doing multiple AJAX API requests.
For this in DynamoDB, there is no need for another table. If the access patterns dictate you store the information in another object, then so be it, but more than likely it should be in the same table. Sounds like you need two different item types in the same table.
For a task, use a PK of userid and an SK of task::your-task-id. This would allow you to get all of a user's tasks easily, or even a specific task if you know the task ID. You might also have an attribute that is a timestamp and then a GSI with the userID as the PK and the timestamp as the SK. Then you could use the begins_with operator on the SK and "paginate through all of the user's tasks that are in the month of 2019-04".
For the user information, have the userID be the PK, the SK be user_info, and the attributes be the user's information.
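A minimal sketch of that layout with boto3; the table name, GSI name, and attribute values are made up:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('app-data')  # hypothetical single table

# User profile item: PK = userid, SK = 'user_info'
table.put_item(Item={'PK': 'user-123', 'SK': 'user_info', 'name': 'Alice'})

# Task item: PK = userid, SK = 'task::<task id>', plus a timestamp for the GSI
table.put_item(Item={
    'PK': 'user-123',
    'SK': 'task::7d1f',
    'title': 'Write report',
    'created_at': '2019-04-15T10:00:00Z',
})

# All of a user's tasks in a single query -- no join needed
tasks = table.query(
    KeyConditionExpression=Key('PK').eq('user-123') & Key('SK').begins_with('task::')
)['Items']

# Tasks from April 2019 via a GSI whose PK is userid and SK is the timestamp
april = table.query(
    IndexName='by-created-at',  # hypothetical GSI name
    KeyConditionExpression=Key('PK').eq('user-123') & Key('created_at').begins_with('2019-04')
)['Items']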
The one challenge with this is if you go to extremes and one single user is doing thousands of operations per second, e.g. "all tweets by a very popular celebrity". If you have such a use case, there are ways around that as well, e.g. write sharding. These are just examples for you to play with; without knowing all your access patterns, I cannot model everything you might want to do. I highly recommend you go watch this presentation from re:Invent 2018.
I'm building a Django web application which allows users to select a photo from their computer and keep adding photos onto their timeline. The timeline will show 10 photos initially and then have a pull-to-refresh to fetch the next 10 photos on the timeline.
So my first question is: I'm able to upload images, which get stored on the file system, but how do I show only the first 10 and then pull to refresh to fetch the next 10, and so on?
Next, I want the user experience of the app to be fast, so I'm considering caching. I was wondering what I should cache, since there are 3 types of cache in Django: database cache, Memcached, or filesystem caching.
So my second question is: should I cache the first 10 photos of each user, or something else?
Kindly answer with your suggestions.
So my first question is: I'm able to upload images, which get stored on the file system, but how do I show only the first 10 and then pull to refresh to fetch the next 10, and so on?
Fetch the first 10 with your initial logic, then fetch the next photos in chronological order. You must have some timestamp related to when each photo was posted; fetch images according to that. You can use Django's Paginator for this.
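A minimal sketch with Django's Paginator, assuming a hypothetical Photo model with created_at and image fields:

from django.core.paginator import Paginator
from django.http import JsonResponse

from myapp.models import Photo  # hypothetical model

def timeline(request):
    photos = Photo.objects.filter(owner=request.user).order_by('-created_at')
    paginator = Paginator(photos, 10)                       # 10 photos per page
    page = paginator.get_page(request.GET.get('page', 1))   # ?page=2 fetches the next 10
    return JsonResponse({
        'photos': [p.image.url for p in page],
        'has_next': page.has_next(),
    })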
what do I cache
Whatever static data you want to show to the user frequently and that won't change right away. You can cache per user or for all users; according to that, you choose what to cache.
should I cache the first 10 photos of each user or something else
Depends on you. Are those first pictures common to all the users? Then you can cache them. If not, and the pictures are user dependent, there is no point caching them: the user will have to fetch the first images anyway, and I highly doubt the user will keep asking for the same first 10 photos frequently. Again, it's your logic; if you think caching will help, you can go ahead and cache.
The DiskCache project was first created for a similar problem (caching images). It includes a couple of features that will help you to cache and serve images efficiently. DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.
diskcache.DjangoCache provides a Django-compatible cache interface with a few extra features. In particular, the get and set methods permit reading and writing files. An example:
from django.core.cache import cache
with open('filename.jpg', 'rb') as reader:
    cache.set('filename.jpg', reader, read=True)
Later you can get a reference to the file:
reader = cache.get('filename.jpg', read=True)
If you simply wanted the name of the file on disk (in the cache):
try:
    with cache.get('filename.jpg', read=True) as reader:
        filename = reader.name
except AttributeError:
    filename = None
The code above requests a file from the cache. If there is no such value, it will return None. None will cause an exception to be raised by the with statement because it lacks an __exit__ method. In that case, the exception is caught and filename is set to None.
With the filename, you can use something like X-Accel-Redirect to tell Nginx to serve the file directly from disk.
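A minimal sketch of that hand-off in a Django view; the /protected/ prefix and the internal Nginx location it maps to are assumptions that must match your Nginx configuration:

from django.core.cache import cache
from django.http import Http404, HttpResponse

def serve_image(request, key):
    # Look up the cached file on disk, as shown above.
    try:
        with cache.get(key, read=True) as reader:
            filename = reader.name
    except AttributeError:
        raise Http404(key)
    response = HttpResponse(content_type='image/jpeg')
    # Nginx intercepts this header and streams the file itself;
    # '/protected/' is a hypothetical internal location in nginx.conf.
    response['X-Accel-Redirect'] = '/protected/' + filename
    return response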
We have lots of images stored in AWS S3. We plan to provide thumbnails for users to preview. If I store them in S3, I can only retrieve them one by one, which is not efficient. Should I store them in a database? (I need to query my database to decide which set of thumbnails to show the user to preview.)
The best answer depends on the usage pattern of the images.
For many applications, S3 would be the best choice for the simple reason that you can easily use S3 as an origin for CloudFront, Amazon's CDN. By using CloudFront (or indeed any CDN), the images are hosted physically around the world and served from the fastest location for a given user.
With S3, you would not retrieve the images at all. You would simply use S3's URL in your final HTML page (or the CloudFront URL, if you go that route).
If you serve images from the database, that increases resource consumption on the DB (more IO, more CPU, and some RAM used to cache image queries that is not available to cache other queries).
No matter which route you go, pre-create the thumbnails rather than producing them on the fly. Storage space is cheap, and the delay to fetch (from S3 or the DB), process, and re-serve the thumbnail will degrade the user experience. Additionally, if you create the thumbnail on the fly, you cannot benefit from a CDN.
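For example, a minimal sketch of pre-creating thumbnails with Pillow and boto3; the bucket name, key layout, and thumbs/ prefix are made up:

import io

import boto3
from PIL import Image

s3 = boto3.client('s3')

def create_thumbnail(bucket, key, size=(200, 200)):
    # Download the original image from S3.
    original = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    img = Image.open(io.BytesIO(original)).convert('RGB')
    img.thumbnail(size)  # resizes in place, preserving aspect ratio
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG')
    # Store the thumbnail under a hypothetical thumbs/ prefix next to the original.
    s3.put_object(Bucket=bucket, Key='thumbs/' + key, Body=buffer.getvalue(),
                  ContentType='image/jpeg')

# e.g. create_thumbnail('example-image-bucket', 'photos/IMG_0001.jpg')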
If I store them in S3, I can only retrieve them one by one, which is not efficient.
No, it only looks inefficient because of the way you are using it.
S3 is massively parallel. It can serve your images to tens of thousands of simultaneous users without breaking a sweat. It can serve hundreds of images to the same user in parallel -- so you can load 100 images in the same time it takes to load 1 image. So why is your page slow?
Your browser is trying to be a good citizen and only pulls 2-4 images from a site at a time. This "serialization" is what is slowing you down and causing the bottleneck.
You can trick the browser by hosting assets on multiple domains. This is called "domain sharding". You can do it with multiple buckets (put images into 4 different buckets, depending on the last digit of their ID). Alternatively, you can do it with CloudFront: http://abhishek-tiwari.com/post/CloudFront-design-patterns-and-best-practices/
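A toy sketch of the sharding idea; the domain names are made up and each would point at its own bucket (or CloudFront distribution):

SHARD_DOMAINS = [
    'img0.example.com',
    'img1.example.com',
    'img2.example.com',
    'img3.example.com',
]

def image_url(image_id, filename):
    # Pick a shard from the last digit of the ID so the images on one page
    # spread across several hostnames and download in parallel.
    shard = int(str(image_id)[-1]) % len(SHARD_DOMAINS)
    return 'https://{}/{}'.format(SHARD_DOMAINS[shard], filename)

# image_url(1234, 'photos/cat.jpg') -> 'https://img0.example.com/photos/cat.jpg'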
As a best practice, you should store your static data in S3 and save a reference to it in your database.
In your particular case, you can save the filename / hyperlink to the image file in a database, which you can query depending on your business logic.
This will give you a reference to all the images, which you can then fetch from S3 and display to your users.
This can also help you replace the thumbnail reference depending on your needs. For example, if you are running an e-commerce site, you can point the thumbnail reference to a new product image without much effort.
I hope this helps.
I'm programming the API service; the iOS app wants to "pull down to refresh" some records (AKA streams), but I'm not sure how to program the "pagination"-like feature.
I was thinking of making a query using offset and limit (start and end in Python), but I think this is not the right approach.
Does anyone have an idea how this should be done?
My RESTful API is built on Django.
Thanks in advance.
If you only want to return the latest records, the app should query your API with a last_updated timestamp.
Based on that, you can filter your queryset to match the records that have been added since the last time the user refreshed their timeline.
If no timestamp is set, you return all records (or a part of them, ordered by date of creation).
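A minimal sketch of such a view, assuming a hypothetical Stream model with a created_at field:

from django.http import JsonResponse
from django.utils.dateparse import parse_datetime

from myapp.models import Stream  # hypothetical model

def streams(request):
    qs = Stream.objects.order_by('-created_at')
    last_updated = request.GET.get('last_updated')
    if last_updated:
        # Pull-to-refresh: only return records newer than the client's newest one.
        qs = qs.filter(created_at__gt=parse_datetime(last_updated))
    else:
        # First load: just the most recent page.
        qs = qs[:10]
    return JsonResponse({
        'streams': [{'id': s.id, 'created_at': s.created_at.isoformat()} for s in qs]
    })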