CloudFront Top Referrers Report - ALL referrer URLs - amazon-web-services

In AWS I can find under:
Cloudfront >> Reports & Analytics >> Top Referrers (CloudFront Top Referrers Report)
There I get the top25 items. How can I get ALL of them?
I have turned on logging in my bucket, but it seems that the referrer is not part of the log-file. Any idea how amazon collects its top25 and how I can according to that get the whole list?
Thanks for your help, in advance.

Amazon's built in analytics are, as you've noticed, rather basic. The data you're looking for all lives in the logfiles that you can set cloudfront up to export (in the cs(Referer) field). If you know what you're looking for, you can set up a little pipeline to download logs, pull out the numbers you care about and generate reports.
Amazon also makes it easy[1] to set up Athena or Redshift to look directly at Cloudfront or S3 logfiles in their target bucket. After a one-time setup, you could query them directly for the numbers you need.
There are also paid services built to fill in the holes in Amazon's default reports. S3stat (https://www.s3stat.com/), for example, will give you a Top 200 Referrer list in its reports, with the ability to export complete lists.
[1] "easy", using Amazon's definition of the word, meaning really really hard.

Related

AWS AppSync searchItems type return data while table is empty

I deleted all the items in the DataTemplate table but when I query them again with the searchDataTemplates endpoint on the app or in AppSync it returns the old data, but when I use the listDataTemplates it returns nothing which is correct. Needed to repopulate the data in the table.
data template table
search endpoint
list endpoint
when I updated items individually it worked just fine but when i deleted all the items from the console (around 700 items) the search endpoint stopped working. Just the search
UPDATE:
I repopulated the data hoping it'd reset but now the listDataTemplates shows the new data and the search still shows the old data, is there some cache that needs to be reset?
SECOND UPDATE:
I removed the table and the appsync functions are gone however when i recreated the table (with no data) the testing out the function still returns the old data. I'm guessing the opensearch stuff hasn't been updated?
If you are using AppSync with Amplify CLI, #searchable will automatically create the followings:
An OpenSearch Domain
A Lambda Function that will be attached to the DynamoDB Streams and push the changes (create/update/delete) over to your OpenSearch Domain.
And the problem that you're facing is most likely due to the Lambda Function created failed to push the changes from DynamoDB Streams to OpenSearch. A quick suggestion is to check on the created Lambda Function first.
Reference: #searchable
This issue can only happen if caching is enabled in your application.
I am not sure what's the infrastructure you are using, so i would go ahead with some educated guess. Please feel free to correct me if i overstepped.
From your description of question, you have an AppSync as API layer and DynamoDb as primary database.
If these are the only two resources you have, please check the AppSync cache configuration.
Open AppSync console
from left panel select APIs -> your api -> caching
Validate Caching behavior is set to None
In case if you have AWS OpenSearch enabled for search query (i could be wrong, however picking up from previous comment). Then validate the cluster configuration.
Open AWS Open Search Service console
From left panel select Domains and click on the openserch domain that you are using
scroll to the bottom right and look for Advanced cluster settings and ensure the attribute Fielddata cache allocation is set to 0
If Fielddata cache allocation is not 0, update the cluster configuration and modify the advanced cluster setting to set the Fielddata cache allocation field to 0.
Wait for a few minutes (I would suggest 5 minutes) and then retry your use-case.
I hope this would help resolve your issue.

What is the difference between setting cache headers on CDN vs on AWS S3 objects?

I'm trying to figure out how to purge a set of URLs without purging one by one (which is inefficient and buggy).
I'm also trying to figure out how to do this without purging content that we don't want purged.
Essentially, when I push updated files to the S3 bucket that my CDN points to, I want to purge any files that have changed -- but not purge files that have stayed the same.
I'm trying to figure out the difference between setting cache headers on CDN vs setting cache headers (the x-amz-meta-surrogate-key specifically I think?).
Could I somehow configure the metadata for the changed objects (when I push them to the s3 bucket) such that those files get purged and not the others?
(for what its worth, I'm using Fastly for CDN service).
I'm trying to figure out how to purge a set of urls without purging one by one
This is typically done by setting a Surrogate-Key on your origin's response. You can set the same 'key' on multiple different pages to support purging all of those pieces of content at the same time from one purge request.
For example: you could have www.example.com/abc sending Surrogate-Key: red blue while www.example.com/xyz sending Surrogate-Key: green yellow red.
So with Fastly you can issue a 'purge by key' request and that means you can purge the /abc page using the blue key, as it's unique to that page (although in that case you might as well just 'purge by url') but you can purge both /abc and /xyz by issuing a 'purge by key' request using the key red as that key is set on the response for both pages.
As far as coupling this to AWS S3, there is a Fastly documentation page that might help...
You can mark content with a surrogate key and use it to purge groups of specific URLs at once without purging everything, or purging each URL singularly. On the Amazon S3 side, you can use the x-amz-meta-surrogate-key header to mark your content as you see fit, and then on the Fastly side set up a Header configuration to translate the S3 information into the header we look for. -- https://docs.fastly.com/en/guides/setting-surrogate-key-headers-for-amazon-s3-origins
Some other Fastly material that might help you here:
https://docs.fastly.com/en/guides/getting-started-with-surrogate-keys
https://developer.fastly.com/reference/http-headers/Surrogate-Key/

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago activated Cloud CDN for storage buckets. Our storage data is regularly changed via a backend. So to invalidate the cached version we added a query param with the changedDate to the url that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks) Google seemed to change that and is now ignoring the query string for caching from storage buckets.
First part: Does anyone know why this is changed and why noone was
notified about it?
Second part: How can you invalidate the Cache for a particular object
in a storage bucket without sending a cache-invalidation request
(which you shouldn't) everytime?
I don't like the idea of deleting the old file and uploading a new file with changed filename everytime something is uploaded...
EDIT:
for clarification: the official docu ( cloud.google.com/cdn/docs/caching ) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query > string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this also. After contacting Google Support, they have confirmed this is a permanent change. The recommended work around is to either use versioning in the object name, or use cache invalidation. The latter sounds a bit odd as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states.1 The bucket is not evaluating the query string but the CDN should still do that. I could reproduce this same scenario and currently is still possible to use a query string as cache buster.
Seems like the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs and higher latency. The only recommended workaround for now is to create the new objects by incorporating the version into the object's name (which seems is not valid options for your case), or using cache invalidation.
Invalidating the cache for a particular object will require to use a particular query. Maybe a Cache-Control header allowing such objects to be cached for a certain time may be your workaround. Cloud CDN cache has an expiration time defined by the "Cache-Control: s-maxage", "Cache-Control: max-age", and/or Expires headers 2.
According to the doc, when using backend bucket as origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Maybe using the query string to identify different versions of cached content is not the best practices promoted by GCP. But for some legacy issues, it has to be.
So, one way to workaround this is make backend bucket to be a static website (do NOT enable CDN here), then use custom origins (Cloud CDN backed by Internet network endpoint groups backend service) which points to that static website.
For backend service, query string IS part of cache key.
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, It is tedious but works!

Cheapest way to use AWS for simple response

What I wanted to achieve is pretty simple, if you send a request to some address, the response you get is a single integer number, like 13 for example. I think it is equivalent to hosting a .html page with single number on that page and then I can parse that string in my application. (It is a Unity game, using the WWW class to send the request.)
(This is actually a version number. If it is greater than what I stored in my app I would update it and then send another request to other place and retrieve something bigger)
I am looking for the cheapest way that can handle this. I planned to use AWS but confused what component should be use? S3? EC2? Lambda? CloudFront?
If you think doing this on a web hosting or Heroku or something else is better, I also wanted to hear about it.
To serve up a simple value, S3 should do the trick.
Create a bucket in the console, using lonely lowercase letters, digits, and dashes in the name. The name has to be globally unique among all of S3, so make up something unique. We'll call the bucket name example-bucket.
Create your file on your computer with the desired contents. If plain text, call it version.txt.
In the AWS console, select the bucket, and upload the file. While clicking through the "next" screens, put a check next to "make everything public" and accept the defaults. Upload the file.
Now, go to https://example-bucket.s3.amazonaws.com/version.txt in your browser and verify (using your actual bucket name. That's your download link.
Done. As long as you don't expect to handle over about 800 requests per second, this will do exactly what you want.
Review the S3 pricing, of course.
Although this question is suitable for Server Fault,
EC2 using nginx or apache web server will be sufficient.
Put Load balancer in front of EC2 instances.

AWS cloudfront not updating on update of files in S3

I created a distribution in cloudfront using my files on S3.
It worked fine and all my files were available. But today I updated my files on S3 and tried to access them via Cloudfront, but it still gave old files.
What am I missing ?
Just ran into the same issue. At first I tried updating the cache control to be 0 and max-age=0 for the files in my S3 bucket I updated but that didn't work.
What did work was following the steps from #jpaljasma. Here's the steps with some screen shots.
First go to your AWS CloudFront service.
Then click on the CloudFront distrubition you want to invalidate.
Click on the invalidations tab then click on "Create Invalidation" which is circled in red.
In the "object path" text field, you can list the specific files ie /index.html or just use the wildcard /* to invalidate all. This forces cloudfront to get the latest from everything in your S3 bucket.
Once you filled in the text field click on "Invalidate", after CloudFront finishes invalidating you'll see your changes the next time you go to the web page.
Note: if you want to do it via aws command line interface you can do the following command
aws cloudfront create-invalidation --distribution-id <your distribution id> --paths "/*"
The /* will invalidate everything, replace that with specific files if you only updated a few.
To find the list of cloud front distribution id's you can do this command aws cloudfront list-distributions
Look at these two links for more info on those 2 commands:
https://docs.aws.amazon.com/cli/latest/reference/cloudfront/create-invalidation.html
https://docs.aws.amazon.com/cli/latest/reference/cloudfront/list-distributions.html
You should invalidate your objects in CloudFront distribution cache.
Back in the old days you'd have to do it 1 file at a time, now you can do it wildcard, e.g. /images/*
https://aws.amazon.com/about-aws/whats-new/2015/05/amazon-cloudfront-makes-it-easier-to-invalidate-multiple-objects/
How to change the Cache-Control max-age via the AWS S3 Console:
Navigate to the file whose Cache-Control you would like to change.
Check the box next to the file name (it will turn blue)
On the top right click Properties
Click Metadata
If you do not see a Key named Cache-Control, then click Add more metadata.
Set the Key to Cache-Control set the Value to max-age=0 (where 0 is the number of seconds you would like the file to remain in the cache). You can replace 0 with whatever you want.
The main advantage of using CloudFront is to get your files from a source (S3 in your case) and store it on edge servers to respond to GET requests faster. CloudFront will not go back to S3 source for each http request.
To have CloudFront serve latest fiels/objects, you have multiple options:
Use CloudFront to Invalidate modified Objects
You can use CloudFront to invalidate one or more files or directories manually or using a trigger. This option have been described in other responses here. More information at Invalidate Multiple Objects in CloudFront. This approach comes handy if you are updating your files infrequently and do not want to impact the performance benefits of cached objects.
Setting object expiration dates on S3 objects
This is now the recommended solution. It is straight forward:
Log in to AWS Management Console
Go into S3 bucket
Select all files
Choose "Actions" drop down from the menu
Select "Change metadata"
In the "Key" field, select "Cache-Control" from the drop down menu.
In the "Value" field, enter "max-age=300" (number of seconds)
Press "Save" button
The default cache value for CloudFront objects is 24 hours. By changing it to a lower value, CloudFront checks with the S3 source to see if a newer version of the object is available in S3.
I use a combination of these two methods to make sure updates are propagated to an edge locations quickly and avoid serving outdated files managed by CloudFront.
AWS however recommends changing the object names by using a version identifier in each file name. If you are using a build command and compiling your files, that option is usually available (as in react npm build command).
For immediate reflection of your changes, you have to invalidate objects in the Cloudfront - Distribution list -> settings -> Invalidations -> Create Invalidation.
This will clear the cache objects and load the latest ones from S3.
If you are updating only one file, you can also invalidate exactly one file.
It will just take few seconds to invalidate objects.
Distribution List -> settings -> Invalidations -> Create Invalidation
I also faced similar issues and found out its really easy to fix in your cloudfront distribution
Step 1.
Login To your AWS account and select your target distribution as shown in the picture below
Step 2.
Select Distribution settings and select behaviour tab
Step 3.
Select Edit and choose option All as per the below image
Step 4.
Save your settings and that's it
I also had this issue and solved it by using versioning (not the same as S3 versioning). Here is a comprehensive link to using versioning with cloudfront
Invalidating Files
In summary:
When you upload a new file or files to your S3 bucket, change the version, and update your links as appropriate. From the documentation the benefit of using versioning vs. invalidating (the other way to do this) is that there is no additional charge for making CloudFront refresh by version changes whereas there is with invalidation. If you have hundreds of files this may be problematic, but its possible that by adding a version to your root directory, or default root object (if applicable) it wouldn't be a problem. In my case, I have an SPA, all I have to do is change the version of my default root object (index.html to index2.html) and it instantly updates on CloudFront.
Thanks tedder42 and Chris Heald
I was able to reduce the cache duration in my origin i.e. s3 object and deliver the files more instantly then what it was by default 24 hours.
for some of my other distribution I also set forward all headers to origin in which cloudfront doesn't cache anything and sends all request to origin.
thanks.
Please refer to this answer this may help you.
What's the difference between Cache-Control: max-age=0 and no-cache?
Adding a variable Cache-Control to 0 in the header to the selected file in S3
How to change the Cache-Control max-age via the AWS S3 Console:
Go to your bucket
Select all files you would like to change (you can select folders as well, it will include all files inside them
Click on the Actions dropdown, then click on Edit Metadata
On the page that will open, click on Add metadata
Set Type to System defined
Set Key to Cache-Control
Set value to 0 (or whatever you would like to set it to)
Click on Save Changes
Invalidate all distribution files:
aws cloudfront create-invalidation --distribution-id <dist-id> --paths "/*"
If you need to remove a file from CloudFront edge caches before it expires docs
The best practice for solving this issue is probably using the Object Version approach.
The invalidation method could solve this problem anyhow but it will bring you some side effects simultaneously. Such as cost increasing if exceeding 1000 times per month, or some object could not be deleted via this method.
Hope the official doc on "Why CloudFront is serving outdated content from Amazon" could help the poor guys.