Invalidate all files in a folder in the CloudFront console - amazon-web-services

I know CloudFront provides a mechanism to invalidate a file, but what if I want to invalidate all files in a specific folder? The documentation mentions that I can't use wildcards to do this.
Here's the instruction taken from the official documentation:
You must explicitly invalidate every object and every directory that you want CloudFront to stop serving. You cannot use wildcards to invalidate groups of objects, and you cannot invalidate all of the objects in a directory by specifying the directory path.

Back in 2013, in a previous version of this answer, I wrote:
You can't do this because "files" in cloudfront are not in "folders." Everything is an object and every object is independent.
At the time, that was entirely true. It's still true that everything is an object and every object is independent, but CloudFront has changed its invalidation logic. Keep reading.
At the time, this was also true, and again, to a certain extent, it still is:
The CloudFront documentation mentions "invalidating directories," but this refers to web sites that actually allow a directory listing, where the listing itself is what you want to invalidate, so this won't help you either.
However, times have changed significantly.
Technically, each object is still independent, and CloudFront does not really store them in hierarchical folders, but the invalidation interface has been enhanced to support a left-anchored wildcard match. You can invalidate the contents of a "folder," or any number of objects that you can match with a wildcard at the end of the string. Anything that matches will be evicted from the cache:
To invalidate objects, you can specify either the path for individual objects or a path that ends with the * wildcard, which might apply to one object or to many, as shown in the following examples:
/images/image1.jpg
/images/image*
/images/*
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
Nice enhancement. But is there a catch?
Other than the fact that an invalidation requires, as always, 10 to 15 minutes to complete under normal operations, the answer is no, there's not really a catch. The first 1,000 invalidation paths you submit within a month are free (billing used to be per "request," and a request covered a single object); after that, there is a charge, but:
The price is the same whether you're invalidating individual objects or using the * wildcard to invalidate multiple objects.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html#PayingForInvalidation
Note that if you don't include the * at the end, then an invalidation for /images/ (for example) will only tell CloudFront to invalidate whatever single object your origin server returns for requests for /images/.
The leading slash is documented as optional.
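For illustration only (not part of the original answer), here is a minimal boto3 sketch of submitting a wildcard invalidation; the distribution ID and path are placeholders:
import time
import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="EDFDVBD6EXAMPLE",  # placeholder: your distribution ID
    InvalidationBatch={
        # One path with a trailing * covers every object under /images/
        "Paths": {"Quantity": 1, "Items": ["/images/*"]},
        "CallerReference": str(time.time()),  # any string unique per request
    },
)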

As of 2015-05-25, you can invalidate using a wildcard. Ex: /* or /images/*
It is also far less costly to do it this way, as something like /images/* counts as a single invalidation path, rather than being charged separately for each of the thousands of images in the /images directory.
http://aws.amazon.com/about-aws/whats-new/2015/05/amazon-cloudfront-makes-it-easier-to-invalidate-multiple-objects/

As long as you want to invalidate a reasonable number of objects, one of the easier ways I've found is to select the objects in Cyberduck, right-click > Info > Distribution tab, and invalidate from there. Cyberduck will submit one invalidation request to CloudFront with the list of selected files.
Cyberduck is open source too.
ps: not affiliated with the product in any way. Just listing an alternative.

Related

IPFS and Editing Permissions

I just uploaded a folder of 5 images to IPFS (using the Mac Desktop IPFS Client App, so it was a very simple drag and drop operation.)
So being that I’m the one that created and published this folder, does that mean that I’m the only one that’s allowed to make further modifications to it - like adding or deleting more images from it? Or can anyone out there on IPFS do that as well?
If they can, is there a way to prevent that from happening?
=======================================
UPDATED QUESTION:
My specific use-case has to do with updating the metadata of ERC721 Tokens - after they’ve already been minted.
Imagine for example a game where certain objects - like say a magical sword - gain special powers after a certain amount of usage or after the completion of certain missions by their owner. So we'd want to update this sword's attributes by editing its metadata and re-committing this updated metadata file to the blockchain.
If our game has 100 swords, for example, and we initially uploaded to IPFS a folder containing all 100 JSON files (one for each sword), then I'm pretty sure IPFS still lets you access the specific files within the hashed folder by their specific human-readable names (and not only by their hash.)
So if our sword happens to be sword #76, and our naming convention for our JSON files was of this format: "sword000.json", then sword #76's JSON metadata file would have a path such as:
http://ipfs.infura.io/QmY2xxxxxxxxxxxxxxxxxxxxxx/sword076.json
If we then edited the "sword076.json" file and drag-and-dropped it back into our master JSON folder, it would obviously cause that folder's hash/CID value to change. BUT, as long as we're able to update our Solidity contract's "tokenURI" method to look for and serve our ".json" files from this newly updated hash/CID folder name, we could still refer to the individual files within it by their regular English names. Which means we'd be good to go.
Whether or not this is a good scheme to employ is something we can definitely discuss, but I FIRST want to go back to my original question/concern, which is that I want to make sure that WE are the ONLY ones that can update the contents of our folder - and that no one else has permission to do that.
Does that make sense?
IPFS is immutable, meaning when you add your directory along with the files, the directory gets a unique CID based on the contents of the directory. So in a sense, nobody can modify it, not even you, because it's immutable. I believe this confusion can be resolved with more background on how IPFS works.
When you add things to IPFS each file is hashed, and given a CID. The same is true for directories, but their CID can more easily be understood as a sum of the contents of the directory. So if any files in the directory are updated, added, or deleted, the directory gets a new CID.
Understanding this, if someone else added the exact same content in the exact same way, they'd end up with the exact same CID! With this, if two people added the same content (and therefore hold the same CID), and a third person requested that file (or directory), both nodes would be able to serve the data, as we know it's exactly the same. The same is true if you simply shared your CID and another node pinned it: both nodes would have the same data, so if anyone requested it, both nodes would be able to serve it.
So your local copy cannot be edited by anyone - and, if you're relying on the IPFS CID as the address of your data, in a sense not even by you! This is why IPFS is typically referred to as "immutable": any data you request via an IPFS CID will always be the same. If you change any of the data, you'll get a new CID.
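As a conceptual illustration only (this is not the real IPFS CID algorithm, which builds a UnixFS DAG of multihashes), the following sketch shows why editing one file changes the directory's identifier:
import hashlib

def fake_cid(data: bytes) -> str:
    # Stand-in for a content hash; real CIDs are multihashes, not hex SHA-256.
    return hashlib.sha256(data).hexdigest()[:16]

def fake_dir_cid(files: dict) -> str:
    # A directory's identifier is derived from its entries' names and hashes.
    entries = sorted((name, fake_cid(content)) for name, content in files.items())
    return fake_cid(repr(entries).encode())

swords = {"sword%03d.json" % i: b'{"power": 1}' for i in range(100)}
before = fake_dir_cid(swords)

swords["sword076.json"] = b'{"power": 9001}'  # "edit" one file
after = fake_dir_cid(swords)

print(before != after)  # True: the folder now has a new identifier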
More info can be found here: Content Addressing & Immutability
If you read all this and thought "well what if I want mutable data?", I'd recommend looking into IPNS and possibly ipfs-sync if you're looking for a tool to automatically update IPNS for you.

Repository pattern: isn't getting the entire domain object bad behavior (read method)?

A repository pattern is there to abstract away the actual data source, and I do see a lot of benefits in that. But a repository should not expose IQueryable (to prevent leaking DB details), and it should always return domain objects, not DTOs or POCOs - and it is this last point I have trouble getting my head around.
If a repository always has to return a domain object, doesn't that mean it fetches way too much data most of the time? Let's say it returns an employee domain object with forty properties, while the service and view layers consuming that object actually use only five of them.
It means the database has fetched a lot of unnecessary data and pumped it across the network. Doing that with one object is hardly noticeable, but if millions of records are pushed across that way and a lot of the data is thrown away every time, is that not considered bad behavior?
Yes, when adding, editing, or deleting the object you will use the entire object, but reading the entire object and pushing it to another layer that uses only a fraction of it is not using the underlying database and network in the most optimal way. What am I missing here?
There's nothing preventing you from having a separate read model (which could be a separately stored projection of the domain, or a query-time projection) and separating out the command and query concerns - CQRS.
If you then put something like GraphQL in front of your read side then the consumer can decide exactly what data they want from the full model down to individual field/property level.
Your commands still interact with the full domain model as before (except where it's a performance no-brainer to use set based operations).
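As a rough illustration of that split (all names here are hypothetical, not from the answer), the read side can return a slim projection while commands still load the full aggregate:
from dataclasses import dataclass

@dataclass
class EmployeeSummary:
    # Read model: only the five fields the view actually needs.
    id: int
    name: str
    department: str
    email: str
    hire_date: str

class EmployeeReadModel:
    def __init__(self, connection):
        self._conn = connection

    def summaries_for_department(self, department):
        # Query-time projection: select only the columns the read side needs.
        rows = self._conn.execute(
            "SELECT id, name, department, email, hire_date "
            "FROM employees WHERE department = ?", (department,))
        return [EmployeeSummary(*row) for row in rows]

class EmployeeRepository:
    # Write side: commands still hydrate and persist the full domain object.
    def get(self, employee_id): ...
    def save(self, employee): ...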

List object versions in an S3 bucket via signed URI

In our project we're storing objects in an S3 bucket with versioning enabled. There's no logic on the server besides creating a signed URI for the client to use. We'd like to keep it this way, as we want the client to do all the processing.
To the problem: we're successfully able to generate signed URIs for GET and PUT on whole objects, but we're unable to generate a URI for listing all available versions.
This is an example of a GET URL on an object in one of our buckets which works (the 99/2 are folders in the bucket):
https://bucketname.s3.amazonaws.com/99/2?AWSAccessKeyId=ourkey&Signature=signature&Expires=1410784420
According to the docs (GET versions) we're supposed to append ?versions and the different versions. We've tried the following:
https://bucketname.s3.amazonaws.com/99/2?AWSAccessKeyId=ourkey&Signature=signature&Expires=1410784420&versions
This results in the browser complaining that the signature is wrong because it's missing "?versions". If I read the docs, I interpret them as saying it shouldn't be included in the signature unless we append a value to it as well, which we aren't doing. The problem is that it doesn't matter if I then add it to the signature creation, as it still fails with the error "There is no such thing as the ?versions sub-resource for a key".
Has anyone successfully created a signed URI for an object to list its versions? We'd really love to get some pointers on what we're doing wrong!
I'd also like to point out that we're not using the built-in URI generator, as we couldn't get it to fit our needs.
Listing object versions is an operation performed "against" the bucket, not against the object... so your path is always going to be /, no matter what keys you want to list.
You specify the key prefix in the query string as prefix=....
The string to sign would then begin with /bucketname/?versions&prefix=....
You sort all of the query string parameters lexically, except for the subresource (versions, in this case), which goes first. If more than one subresource, you also sort them lexically among themselves, but they still go first. Everything is separated by & in the string to sign.
Significant caveat: the list API may not be appropriate to hand over to the client, since you can end up returning the wrong things... "prefix" is just that: a prefix. If it doesn't match an exact key, it can still match on substrings, which might not be what you want. You may also need to use delimiter and max-keys, and be prepared to handle pagination through truncated listings, which becomes necessary when a large number of results is returned.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html
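Not part of the original answer, but for reference: if an AWS SDK is an option on the server side, boto3 can presign the list-versions call directly (the bucket name and prefix below are placeholders):
import boto3

s3 = boto3.client("s3")

# Presign the GET ?versions request for keys under the "99/2" prefix.
url = s3.generate_presigned_url(
    ClientMethod="list_object_versions",
    Params={"Bucket": "bucketname", "Prefix": "99/2"},
    ExpiresIn=3600,  # URL validity in seconds
)
print(url)  # hand this URL to the client; the response is the versions XML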

varnish invalidate url REGEX from backend

Say I have a highly visited front page that displays counts of items by category.
When an item is added or deleted I need to invalidate this front-page URL and two others.
What is the best practice how to invalidate those urls from backend in Varnish (4.x)?
From what I captured, I can:
implement my own HTTP PURGE handler in the VCL configuration file that "bans" URLs matching a received regex
from the backend, send 3x HTTP PURGE requests to Varnish for those 3 URLs.
But is this approach safe for automatic usage? Basically, I need to invalidate some views every time a related entity is inserted/updated/deleted.
Can it lead to ban-list accumulation and increasing CPU consumption?
Is there any other approach? Thanks.
According to this brilliant article http://www.smashingmagazine.com/2014/04/23/cache-invalidation-strategies-with-varnish-cache/ the solution is tags.
X-depends-on: 3483 4376 32095 28372 #http-header created by backend
ban obj.http.x-depends-on ~ "\D4376\D" #ban rule emitted to discard dependent objects
What I missed is that there is a background process, the "ban lurker", which iterates over cached objects against ban rules that are applicable but have not yet been tried; once all applicable objects have been tested, the ban rule is discarded. The ban rule only needs to be written so that it uses only data stored with the cached object, not e.g. req.url, since the req object is not stored with the object in the cache and so the lurker process does not have it.
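As an illustration of the backend side (not from the article), and assuming your VCL is set up to turn a custom BAN request with a tag header into the ban rule above, the call could look roughly like this:
import requests

def invalidate_tag(tag_id, varnish_url="http://127.0.0.1:6081/"):
    # Assumed contract: vcl_recv handles method BAN and reads X-Ban-Tag to
    # issue ban("obj.http.x-depends-on ~ <value>") internally.
    resp = requests.request(
        "BAN",
        varnish_url,
        headers={"X-Ban-Tag": r"\D%s\D" % tag_id},  # \D avoids partial ID matches
    )
    resp.raise_for_status()

invalidate_tag("4376")  # entity 4376 changed; evict every page that depends on it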
So now the ban approach plus tags looks pretty reliable to me.
Thanks Per Buer :)

Invalidating a path from the Django cache recursively

I am deleting a single path from the Django cache like this:
from models import Graph
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.db.models.signals import post_save
from django.core.cache import cache
def expire_page(path):
    # Build a fake request for the path so we can derive the same cache key
    # that the cache middleware used when it stored the page.
    request = HttpRequest()
    request.path = path
    key = get_cache_key(request)
    if key and cache.has_key(key):
        cache.delete(key)

def invalidate_cache(sender, instance, **kwargs):
    expire_page(instance.get_absolute_url())

post_save.connect(invalidate_cache, sender=Graph)
This works - but is there a way to delete recursively? My paths look like this:
/graph/123
/graph/123/2009-08-01/2009-10-21
Whenever the graph with id "123" is saved, the cache for both paths needs to be invalidated. Can this be done?
You might want to consider employing a generational caching strategy, it seems like it would fit what you are trying to accomplish. In the code that you have provided, you would store a "generation" number for each absolute url. So for example you would initialize the "/graph/123" to have a generation of one, then its cache key would become something like "/GENERATION/1/graph/123". When you want to expire the cache for that absolute url you increment its generation value (to two in this case). That way, the next time someone goes to look up "/graph/123" the cache key becomes "/GENERATION/2/graph/123". This also solves the issue of expiring all the sub pages since they should be referring to the same cache key as "/graph/123".
It's a bit tricky to understand at first, but it is a really elegant caching strategy which, if done correctly, means you never have to actually delete anything from the cache. For more information, here is a presentation on generational caching; it's for Rails, but the concept is the same regardless of language.
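A rough sketch of the idea in Django terms (the key scheme is illustrative, not from the presentation, and it assumes a Django version with cache.get_or_set):
from django.core.cache import cache

def generation_for(base_path):
    # The generation counter itself lives in the cache; default to 1.
    return cache.get_or_set('GEN:%s' % base_path, 1)

def page_key(path, base_path='/graph/123'):
    # Every page key under the graph embeds the current generation number.
    return '/GENERATION/%d%s' % (generation_for(base_path), path)

def expire_graph(base_path='/graph/123'):
    # Bumping the generation implicitly invalidates every dependent key;
    # the old entries simply age out of the cache on their own.
    try:
        cache.incr('GEN:%s' % base_path)
    except ValueError:
        cache.set('GEN:%s' % base_path, 2)

# Usage: cache.set(page_key('/graph/123/2009-08-01/2009-10-21'), rendered_page)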
Another option is to use a cache that supports tagging keys and evicting keys by tag. Django's built-in cache API does not have support for this approach. But at least one cache backend (not part of Django proper) does have support.
DiskCache is an Apache2-licensed, disk- and file-backed cache library, written in pure Python and compatible with Django. To use DiskCache in your project, simply install it and configure your CACHES setting.
Installation is easy with pip:
$ pip install diskcache
Then configure your CACHES setting:
CACHES = {
    'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': '/tmp/path/to/directory/',
    },
}
The cache set method is extended by an optional tag keyword argument like so:
from django.core.cache import cache
cache.set('/graph/123', value, tag='/graph/123')
cache.set('/graph/123/2009-08-01/2009-10-21', other_value, tag='/graph/123')
diskcache.DjangoCache uses a diskcache.FanoutCache internally. The corresponding FanoutCache is accessible through the _cache attribute and exposes an evict method. To evict all keys tagged with /graph/123 simply:
cache._cache.evict('/graph/123')
Though it may feel awkward to access an underscore-prefixed attribute, the DiskCache project is stable and unlikely to make significant changes to the DjangoCache implementation.
The Django cache benchmarks page has a discussion of alternative cache backends.
Disclaimer: I am the original author of the DiskCache project.
Check out shutil.rmtree() or os.removedirs(). I think the first is probably what you want.
Update based on several comments: Actually, the Django caching mechanism is more general and finer-grained than just using the path for the key (although you can use it at that level). We have some pages that have 7 or 8 separately cached subcomponents that expire based on a range of criteria. Our component cache names reflect the key objects (or object classes) and are used to identify what needs to be invalidated on certain updates.
All of our pages have an overall cache-key based on member/non-member status, but that is only about 95% of the page. The other 5% can change on a per-member basis and so is not cached at all.
How you iterate through your cache to find invalid items is a function of how it's actually stored. If it's files, you can simply use globs and/or recursive directory deletes; if it's some other mechanism, then you'll have to use something else.
What my answer, and some of the comments by others, are trying to say is that how you accomplish cache invalidation is intimately tied to how you are using/storing the cache.
Second Update: #andybak: So I guess your comment means that all of my commercial Django sites are going to explode in flames? Thanks for the heads up on that. I notice you did not attempt an answer to the problem.
Knipknap's problem is that he has a group of cache items that appear to be related and in a hierarchy because of their names, but the key-generation logic of the cache mechanism obliterates that name by creating an MD5 hash of the path + vary_on. Since there is no trace of the original path/params you will have to exhaustively guess all possible path/params combinations, hoping you can find the right group. I have other hobbies that are more interesting.
If you wish to be able to find groups of cached items based on some combination of path and/or parameter values you must either use cache keys that can be pattern matched directly or some system that retains this information for use at search time.
Because we had needs not-unrelated to the OP's problem, we took control of template fragment caching -- and specifically key generation -- over 2 years ago. It allows us to use regexps in a number of ways to efficiently invalidate groups of related cached items. We also added a default timeout and vary_on variable names (resolved at run time) configurable in settings.py, changed the ordering of name & timeout because it made no sense to always have to override the default timeout in order to name the fragment, made the fragment_name resolvable (ie. it can be a variable) to work better with a multi-level template inheritance scheme, and a few other things.
The only reason for my initial answer, which was indeed wrong for current Django, was because I have been using saner cache keys for so long I literally forgot the simple mechanism we walked away from.