Get User-Agent and IP Address of object anonymously uploaded to S3? - amazon-web-services

Is there a way to retrieve the User-Agent and/or IP Address of the client that uploaded an object anonymously to an S3 bucket?
I'm using an anonymous upload method similar to this: https://gist.github.com/jareware/d7a817a08e9eae51a7ea
which shows a method of testing against the client's User-Agent (via aws:UserAgent in a StringNotEquals block within the Condition block). The documentation shows that aws:SourceIp exists as well, but I see no way of retrieving either value after the file has been uploaded.
Am I missing something?

This is possible if you enabled logging on the bucket...
http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html
...or captured a bucket notification event about the upload...
http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
Otherwise, no.
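For illustration, a minimal sketch of pulling the uploader's IP out of a bucket notification event, assuming a Lambda function subscribed to s3:ObjectCreated:* (the function itself is hypothetical). Note that the notification carries the source IP but not the User-Agent, which only appears in the server access logs:

```python
# Minimal sketch: a Lambda handler subscribed to s3:ObjectCreated:* events.
# The notification payload includes requestParameters.sourceIPAddress, but
# NOT the User-Agent; the User-Agent only shows up in server access logs.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        source_ip = record.get("requestParameters", {}).get("sourceIPAddress")
        print(f"Object {key} uploaded to {bucket} from {source_ip}")
```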
Anonymous uploads are just a really, really bad idea. You can end up with objects where the only available action to you is deleting them. Bucket ownership != object ownership. Authenticating requests is simply not that difficult, so I would encourage you to back away slowly from anonymous S3 writes.

Related

Checking data integrity of downloaded AWS S3 data when using presigned URLs

Occasionally, a client requests a large chunk of data to be transferred to them.
We host our data in AWS S3, and a solution we use is to generate presigned URLs for the data they need.
My question:
When should data integrity checks actually be performed on data migration, or is relying on TLS good enough...
From my understanding, most uploads/downloads performed via the AWS CLI will automatically perform data integrity checks.
One potential solution I have is to manually generate MD5SUMS for all files transferred and have the client perform a local comparison.
I understand that the ETag is a checksum of sorts, but because a lot of the files are multipart uploads, the ETag becomes a very complicated mess to use as a comparison value.
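A rough sketch of that manual approach (the directory name is hypothetical):

```python
# Sketch: compute MD5 checksums for local files so the client can compare
# them against their own md5sum output after downloading.
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=8 * 1024 * 1024):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for path in Path("data_to_transfer").rglob("*"):  # hypothetical directory
    if path.is_file():
        print(f"{md5sum(path)}  {path}")
```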
You can activate "Additional checksums" in AWS S3.
The GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.
Check out this release blog: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
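As a rough sketch of that flow with boto3 (bucket and key names are hypothetical): the upload requests a SHA-256 checksum, and GetObjectAttributes reads it back:

```python
# Sketch: upload with an additional SHA-256 checksum, then read it back.
import boto3

s3 = boto3.client("s3")

# Ask S3 to compute and store a SHA-256 checksum alongside the object.
with open("report.csv", "rb") as f:
    s3.put_object(
        Bucket="my-bucket",            # hypothetical bucket
        Key="exports/report.csv",      # hypothetical key
        Body=f,
        ChecksumAlgorithm="SHA256",
    )

# GetObjectAttributes returns the object-level checksum and, for multipart
# uploads, the per-part checksums.
attrs = s3.get_object_attributes(
    Bucket="my-bucket",
    Key="exports/report.csv",
    ObjectAttributes=["Checksum", "ObjectParts", "ETag", "ObjectSize"],
)
print(attrs.get("Checksum"), attrs.get("ObjectSize"))
```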

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit, preview, and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. A caveat I'm finding is that this can lead to edge cases where S3 and the DB fall out of sync.
E.g., if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the URLs over the network by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it makes an S3 API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index of the files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, then use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
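For comparison, a sketch of the listing approach with boto3 (bucket and prefix are hypothetical); note the paginator that becomes necessary once a prefix holds more than 1000 keys:

```python
# Sketch: list a user's objects and pre-sign download URLs on the fly.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

urls = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="user1/"):  # hypothetical names
    for obj in page.get("Contents", []):
        urls.append(
            s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": "my-bucket", "Key": obj["Key"]},
                ExpiresIn=900,  # link valid for 15 minutes
            )
        )
```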

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago we activated Cloud CDN for our storage buckets. Our storage data is regularly changed via a backend, so to invalidate the cached version we added a query param with the changedDate to the URL that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks), Google seems to have changed this and is now ignoring the query string when caching content from storage buckets.
First part: Does anyone know why this was changed and why no one was notified about it?
Second part: How can you invalidate the cache for a particular object in a storage bucket without sending a cache-invalidation request (which you shouldn't) every time?
I don't like the idea of deleting the old file and uploading a new file with a changed filename every time something is uploaded...
EDIT:
For clarification: the official documentation (cloud.google.com/cdn/docs/caching) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this also. After contacting Google Support, they have confirmed this is a permanent change. The recommended workaround is to either use versioning in the object name or use cache invalidation. The latter sounds a bit odd, as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states. The bucket does not evaluate the query string, but the CDN should still do so. I could reproduce this same scenario, and it is currently still possible to use a query string as a cache buster.
It seems the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs, and higher latency. The only recommended workarounds for now are to create new objects by incorporating the version into the object's name (which does not seem to be a valid option for your case), or to use cache invalidation.
Invalidating the cache for a particular object requires an invalidation request for that particular path. A Cache-Control header that only allows such objects to be cached for a certain time may be your workaround: the Cloud CDN cache has an expiration time defined by the Cache-Control: s-maxage, Cache-Control: max-age, and/or Expires headers.
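A sketch of setting such a Cache-Control header on an existing object with the google-cloud-storage Python client (bucket and object names are hypothetical):

```python
# Sketch: cap how long Cloud CDN may cache an object by setting Cache-Control.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-assets-bucket")   # hypothetical bucket
blob = bucket.blob("images/cat.jpg")         # hypothetical object

blob.cache_control = "public, max-age=300"   # cached for at most 5 minutes
blob.patch()                                 # push the metadata update to GCS
```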
According to the docs, when using a backend bucket as the origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Using the query string to identify different versions of cached content may not be the best practice promoted by GCP, but for legacy reasons it has to work this way for us.
So, one way to work around this is to serve the bucket as a static website (do NOT enable CDN here), then use a custom origin (Cloud CDN backed by an internet network endpoint group backend service) that points to that static website.
For a backend service, the query string IS part of the cache key:
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, it is tedious, but it works!
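If you can change the upload path instead, the object-name versioning recommended in the answers above can be as simple as embedding a content hash in the key; a rough sketch (bucket and file names are hypothetical):

```python
# Sketch: embed a short content hash in the object name so every new version
# gets a new URL and the old cached copy simply stops being referenced.
import hashlib
from google.cloud import storage

def upload_versioned(bucket_name, local_path, base_name):
    with open(local_path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()[:12]
    stem, _, ext = base_name.rpartition(".")
    key = f"{stem}.{digest}.{ext}" if stem else f"{base_name}.{digest}"
    blob = storage.Client().bucket(bucket_name).blob(key)
    blob.upload_from_string(data)
    return key  # serve this key to clients instead of a cache-busting query param

# Example (hypothetical names):
# upload_versioned("my-assets-bucket", "/tmp/cat.jpg", "images/cat.jpg")
```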

AWS S3 object with data sensitive object names

We name S3 objects using the birthdays of employees. It is stupid. We want to avoid creating object names with sensitive data. Is it safer to store the sensitive data in S3 user-defined metadata, or to add an S3 bucket policy that denies the action s3:GetObject? Which will work?
As you mentioned, it's not a good idea to create object names with sensitive data, but it's OK... not too bad either. I would suggest removing listAllObjects() permissions from the S3 policy. The policy should only allow getObject(), which means anyone can get the object ONLY when they know the object name, i.e. when the caller already knows the DOB of the user.
With listAllObjects() permissions, a caller can list all the objects in the bucket and obtain the DOBs of users.
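As a hedged sketch of that policy shape applied via boto3 (the bucket name and principal are hypothetical) - the key point is granting s3:GetObject on objects while granting no s3:ListBucket on the bucket itself:

```python
# Sketch: allow GetObject on objects but grant no ListBucket permission,
# so objects can only be fetched by callers who already know the exact key.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetObjectOnlyNoListing",
            "Effect": "Allow",
            "Principal": "*",                       # hypothetical: restrict to your callers
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*"  # hypothetical bucket name
        }
    ]
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-bucket", Policy=json.dumps(policy)
)
```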
Object keys and user metadata should not be used for sensitive data. The reasoning behind object keys is readily apparent, but metadata may be less obvious;
metadata is returned in the HTTP headers every time an object is fetched. This can't be disabled, but it can be worked around with CloudFront and Lambda@Edge response triggers, which can be used to redact the metadata when the object is downloaded through CloudFront; however,
metadata is not stored encrypted in S3, even if the object itself is encrypted.
Object tags are also not appropriate for sensitive data, because they are also not stored encrypted. Object tags are useful for flagging objects that contain sensitive data, because tags can be used in policies to control access permissions on the object, but this is only relevant when the object itself contains the sensitive data.
In the case where "sensitive" means "proprietary" rather than "personal," tags can be an acceptable place for data... this might be data that is considered sensitive from a business perspective but that does not need to be stored encrypted, such as the identification of a specific software version that created the object. (I use this strategy so that if a version of code is determined later to have a bug, I can identify which objects might have been impacted because they were generated by that version). You might want to keep this information proprietary but it would not be "sensitive" in this context.
If your S3 bucket is used to store private data and you're allowing public access to the bucket, that is always a bad idea - it's basically security by obscurity.
Instead of changing your existing S3 structure, you could lock down the bucket to just your app and then serve the data via CloudFront signed URLs.
Basically, in your code where you currently inject the S3 URL, you can instead call the AWS API to create a signed URL from the S3 URL and a policy, and send this new URL to the end user. This masks the S3 URL, and you can enforce other restrictions such as how long the link is valid, require a specific header, or limit access to a specific IP, etc. You also get CDN edge caching and reduced costs as side benefits.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
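A minimal sketch of generating such a CloudFront signed URL in Python (the key pair ID, private key file, and distribution domain are hypothetical):

```python
# Sketch: sign a CloudFront URL so the object can only be fetched for a
# limited time, keeping the underlying S3 bucket locked down.
import datetime
import rsa
from botocore.signers import CloudFrontSigner

KEY_PAIR_ID = "KXXXXXXXXXXXXX"           # hypothetical CloudFront public key ID
PRIVATE_KEY_FILE = "cf_private_key.pem"  # hypothetical private key file

def rsa_signer(message):
    with open(PRIVATE_KEY_FILE, "rb") as f:
        private_key = rsa.PrivateKey.load_pkcs1(f.read())
    return rsa.sign(message, private_key, "SHA-1")

signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)
url = signer.generate_presigned_url(
    "https://dxxxxxxxxxxxx.cloudfront.net/employees/12345.json",  # hypothetical object URL
    date_less_than=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
)
print(url)
```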

AWS S3 - Privacy error when accessing file from link

I am working with a team that is using S3 to host content. They moved from a single bucket for all brands to one bucket for each brand, and now we are having trouble linking to the content from within a Salesforce site.com page. When I copy the link from S3 as HTTPS, I get "Your connection is not private. Attackers might be trying to steal your information from spiritxpress.s3.varsity.s3.amazonaws.com (for example, passwords, messages, or credit cards)."
I have asked them to compare the settings with the bucket that is working, but I don't have access to dig into it myself, and we are pretty new to this as well, so I thought I would see if there were any known paths to walk down. The ID and Key have not changed, and I can access the content via CyberDuck; it just does not load when reached via a link.
Let me know if additional information is needed and I will provide as quickly as I can.
[EDIT] The bucket naming convention they are using is all lowercase and meets the naming guidelines as well, but the way it is structured seems strange to me: they have named the bucket "brandname.s3.companyname", and when copying the link it comes across as "https://brandname.s3.company.s3.amazonaws.com/directory/filename", whereas the other bucket was being rendered as "https://s3.amazonaws.com/bucketname/......
Whoever made this change has failed to account for the way wildcard certificates work in HTTPS.
Requests to S3 using HTTPS are greeted with a certificate identifying itself as "*.s3[-region].amazonaws.com" and in order for the browser to consider this to be valid when compared to the link you're hitting, there cannot be any dots in the part of the hostname that matches the * offered by the cert. Bucket names with dots are valid, but they cannot be used on the left side of "s3[-region].amazonaws.com" in the hostname unless you are willing and able to accept a certificate that is deemed invalid... they can only be used as the first element of the path.
The only way to make dotted bucket names and S3 native wildcard SSL to work together is the other format: https://s3[-region].amazonaws.com/example.dotted.bucket.name/....
If your bucket isn't in us-standard, you likely need to use the region in the hostname, so that the request goes to the correct endpoint, e.g. https://s3-us-west-2.amazonaws.com/example.dotted.bucket.name/path... for a bucket in us-west-2 (Oregon). Otherwise S3 may return an error telling you that you need to use a different endpoint (and the endpoint they provide in the error message will be valid, but probably not the one you're wanting for SSL).
This is a limitation on how SSL certificates work, not a limitation in S3.
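If you are generating the links yourself with an SDK, you can force the path-style form described above; a sketch with boto3 (bucket name and region are hypothetical):

```python
# Sketch: force path-style addressing so a dotted bucket name stays in the
# URL path instead of the hostname, keeping the wildcard certificate valid.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    region_name="us-west-2",                        # hypothetical region
    config=Config(s3={"addressing_style": "path"}),
)

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example.dotted.bucket.name", "Key": "path/to/file"},
    ExpiresIn=3600,
)
print(url)  # e.g. https://s3.us-west-2.amazonaws.com/example.dotted.bucket.name/...
```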
Okay, it appears it did boil down to some permissions that were missed, and we were able to get the file to display as expected. Other issues remain, but this one is resolved, so marking as answered.