Is there a way to securely delete files on aws? - amazon-web-services

We store some sensitive stuff on S3 and on our instance filesystems on AWS. Is there a way to securely wipe it?

The short answer is no, there is no wipe utility. If you delete the file permanently, it is gone and cannot be recovered (unless you have snapshots or other copies that still hold the file). However, there is no way to wipe the underlying disk yourself. Keep in mind that AWS uses server virtualization, so your disk storage does not necessarily correspond to one physical disk platter; it is a virtual storage system spread over many drives. When your file is deleted, it is gone from public access, and Amazon then designates that area of storage for write-only operations so that your data is overwritten quickly.
Here is a quote from an Amazon document about their data security:
When an object is deleted from Amazon S3, removal of the mapping from the public name to the object starts immediately, and is generally processed across the distributed system within several seconds. Once the mapping is removed, there is no external access to the deleted object. That storage area is then made available only for write operations and the data is overwritten by newly stored data.
Retrieved from: http://aws.amazon.com/whitepapers/overview-of-security-processes/
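For completeness, here is a minimal sketch of permanently deleting an object with boto3. The bucket and key names are hypothetical, and on a versioned bucket you also have to delete each stored version explicitly:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-sensitive-bucket"   # hypothetical bucket name
KEY = "secrets/report.pdf"       # hypothetical object key

# On an unversioned bucket this removes the object outright.
s3.delete_object(Bucket=BUCKET, Key=KEY)

# On a versioned bucket, a plain delete only adds a delete marker.
# To remove the data itself, delete every stored version explicitly.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for v in versions.get("Versions", []) + versions.get("DeleteMarkers", []):
    s3.delete_object(Bucket=BUCKET, Key=v["Key"], VersionId=v["VersionId"])
```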

Related

Block Level vs File Level Storage

I have come across the terms Block Level Storage and File Level Storage. Can someone explain why one is better than the other?
Perhaps with examples it would be easier to understand.
For example, AWS articles say that Amazon EBS can be used for databases, but why is it better than file-level storage?
I like to think of it like this:
Amazon Elastic Block Store (Amazon EBS) is block storage. It is just like a USB disk that you plug into your computer. Information is stored in specific blocks on the disk and it is the job of the operating system to keep track of which blocks are used by each file. That's why disk formats vary between Windows and Linux.
Amazon Elastic File System (Amazon EFS) is a filesystem that is network-attached storage. It's just like the H: drive (or whatever) that companies provide their employees to store data on a fileserver. You mount the filesystem on your computer like a drive, but your computer sends files to the fileserver rather than managing the block allocation itself.
Amazon Simple Storage Service (Amazon S3) is object storage. You give it a file and it stores it as an object. You ask for the object and it gives it back. Amazon S3 is accessed via an API. It is not mounted as a disk. (There are some utilities that can mount S3 as a disk, but they actually just send API calls to the back-end and make it behave like a disk.)
When it comes to modifying files, they behave differently:
Files on block storage (like a USB disk) can be modified by the operating system. For example, changing one byte or adding data to the end of the file.
Files on a filesystem (like the H: drive) can be modified by making a request to the fileserver, much like block storage.
Files in object storage (like S3) are immutable and cannot be modified. You can upload another file with the same name, which will replace the original file, but you cannot modify a file. (Uploaded files are called objects.)
Amazon S3 has other unique attributes, such as making objects available via the Internet, offering multiple storage classes for low-cost backups, and triggering events when objects are created/deleted. It's a building block for applications as opposed to a simple disk for storing data. Plus, there is no limit to the amount of data you can store.
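To illustrate the immutability point above, here is a minimal boto3 sketch (the bucket and key names are made up): you cannot edit an object in place, only overwrite it with a new upload.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"   # hypothetical
KEY = "notes.txt"           # hypothetical

# Upload an object. There is no API call to change a byte inside it later.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"version one")

# To "modify" it, read the object, change it locally, and upload a full
# replacement under the same key.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
s3.put_object(Bucket=BUCKET, Key=KEY, Body=body + b" -- appended locally")
```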
Databases
Databases like to store their data in their own format that makes the data fast to access. Traditional databases are built to run on normal servers and they want fast access, so they store their data on directly-attached disks, which are block storage. Amazon RDS uses Amazon EBS for block storage.
A network-attached filesystem would slow the speed of disk access for a database, thereby reducing performance. However, sometimes this trade-off is worthwhile because it is easier to manage network-attached storage (SANs) than to keep adding disks to each individual server.
Some modern 'databases' (if you can use that term) like Presto can access data directly in Amazon S3 without loading the data into the database. Thus, the database processing layer is separated from the data layer. This makes it easier to access historical archived data since it doesn't need to be imported into the database.
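As a concrete (if simplified) illustration of querying data in place, here is a sketch using Amazon Athena, which is built on Presto/Trino, via boto3. The database, table and output location are hypothetical, and the table is assumed to already be defined over files in S3:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical names: an Athena table defined over files sitting in S3.
query = athena.start_query_execution(
    QueryString="SELECT user_id, COUNT(*) FROM archived_events GROUP BY user_id",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the results.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```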

AWS S3 scaling connections horizontally

Suppose you have to implement a video streaming platform from scratch. It doesn't matter where you store the metadata; your not-very-popular video files will be stored on a filesystem, or in an object store if you want to use the cloud. If you choose AWS, then to boost S3 read performance you can make multiple read requests against the same video file; see the Performance Guidelines for Amazon S3:
You can use concurrent connections to Amazon S3 to fetch different
byte ranges from within the same object. This helps you achieve higher
aggregate throughput versus a single whole-object request.
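(For illustration, a minimal sketch of such ranged reads with boto3; the bucket, key and chunk size below are made up:)

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-videos", "cat.mp4"   # hypothetical
CHUNK = 8 * 1024 * 1024                     # 8 MiB per ranged request

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch(offset):
    # Each worker asks S3 for a different byte range of the same object.
    rng = f"bytes={offset}-{min(offset + CHUNK, size) - 1}"
    return s3.get_object(Bucket=BUCKET, Key=KEY, Range=rng)["Body"].read()

with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(fetch, range(0, size, CHUNK)))

data = b"".join(parts)
```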
At the same time, as you know, disk I/O is sequential for HDD/SSD drives, so to boost read performance (ignoring the RAM needed to decompress/decrypt each video chunk) you have to read from multiple disks (YouTube uses RAID).
Why would S3 have better performance on concurrent byte-range requests against the same file? Isn't it stored on a single disk? I suppose S3 may have some replication factor and store the file on multiple disks; is that the case?

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced S3 Batch Operations feature to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up we had estimated the bandwidth and storage costs, and also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload request costs, which run at $0.05 per 1,000 requests. We have many millions of files, and each file upload counted as a request, so we are looking at several thousand dollars worth of spend :(
I am wondering if there was any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
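(A minimal sketch of turning versioning on with boto3; the bucket name is hypothetical:)

```python
import boto3

s3 = boto3.client("s3")

# Once enabled, a delete only adds a delete marker; older versions remain
# (and are billed) until removed explicitly or via a lifecycle rule.
s3.put_bucket_versioning(
    Bucket="example-backup-bucket",               # hypothetical
    VersioningConfiguration={"Status": "Enabled"},
)
```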
There are also the new Object Lock capabilities in S3, where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a set period, and it prevents accidental deletion. (There is also a Legal Hold capability that works the same way, but can be turned on/off if you have appropriate permissions.)
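(A sketch of setting a default retention period with boto3; note that Object Lock can only be used on buckets created with it enabled, and the bucket name here is made up:)

```python
import boto3

s3 = boto3.client("s3")

# The bucket must have been created with Object Lock enabled, eg:
# s3.create_bucket(Bucket="example-locked-bucket", ObjectLockEnabledForBucket=True)
s3.put_object_lock_configuration(
    Bucket="example-locked-bucket",   # hypothetical
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 3}},
    },
)
```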
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
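(A rough sketch of such a replication rule with boto3; the role ARN, bucket names and account ID are all hypothetical, and both buckets need versioning enabled:)

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="example-source-bucket",   # hypothetical source bucket (versioned)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/example-replication-role",  # hypothetical
        "Rules": [
            {
                "ID": "replicate-to-backup-account",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    # Bucket owned by a different AWS account, stored as Deep Archive.
                    "Bucket": "arn:aws:s3:::example-backup-bucket",
                    "Account": "444455556666",
                    "AccessControlTranslation": {"Owner": "Destination"},
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    },
)
```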
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The $0.05 per 1,000 requests cost means that it is better suited to archives (eg large zip files) than to many small files.
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.

AWS S3 deletion of files that haven't been accessed

I'm writing a service that takes screenshots of a lot of URLs and saves them in a public S3 bucket.
Due to storage costs, I'd like to periodically purge the aforementioned bucket and delete every screenshot that hasn't been accessed in the last X days.
By "accessed" I mean downloaded or acquired via a GET request.
I checked out the documentation and found a lot of ways to define an expiration policy for an S3 object, but couldn't find a way to "mark" a file as read once it's been accessed externally.
Is there a way to define the periodic purge without code (only AWS rules/services)? Does the API even allow that or do I need to start implementing external workarounds?
You can use Amazon S3 Storage Class Analysis:
By using Amazon S3 analytics storage class analysis you can analyze storage access patterns to help you decide when to transition the right data to the right storage class. This new Amazon S3 analytics feature observes data access patterns to help you determine when to transition less frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access) storage class.
After storage class analysis observes the infrequent access patterns of a filtered set of data over a period of time, you can use the analysis results to help you improve your lifecycle policies.
Even if you don't use it to change Storage Class, you can use it to discover which objects are not accessed frequently.
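(A sketch of enabling such an analysis with boto3; the bucket names, configuration ID and prefix are hypothetical:)

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_analytics_configuration(
    Bucket="example-screenshots-bucket",   # hypothetical
    Id="screenshot-access-analysis",
    AnalyticsConfiguration={
        "Id": "screenshot-access-analysis",
        "Filter": {"Prefix": "screenshots/"},   # hypothetical prefix
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::example-analysis-results",
                        "Prefix": "analysis/",
                    }
                },
            }
        },
    },
)
```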
There is no such service provided by AWS. You will have to write your own solution.

Google Cloud Storage Transfer

We've been using the Google Cloud Storage Transfer service, and in our data source (AWS) a directory was accidentally deleted. We figured it would still be in the data sink; however, upon taking a look, it wasn't there despite versioning being on.
This leads us to believe in Storage Transfer the option deleteObjectsUniqueInSink hard deletes objects in the sink and removes them from the archive.
We've been unable to confirm this in the documentation.
Is GCS Transfer Service's deleteObjectsUniqueInSink parameter in the TransferSpec mutually exclusive with GCS's object versioning soft-delete?
When the deleteObjectsUniqueInSink option is enabled, Google Cloud Storage Transfer will:
1. List only the live versions of objects in the source and destination buckets.
2. Copy any objects unique to the source into the destination bucket.
3. Issue a versioned delete for any objects unique to the destination bucket.
If the unique object is still live at the time that Google Cloud Storage Transfer issues the deletion, it will be archived. If another process, such as Object Lifecycle Management, archived the object before the deletion occurs, the object could be permanently deleted at this point rather than archived.
Edit: Specifying the version in the delete results in a hard delete (Objects Delete Documentation), so transfer service is currently performing hard deletes for unique objects. We will update the service to instead perform soft deletions.
Edit: The behavior has been changed. From now on deletions in versioned buckets will be soft deletes rather than hard deletes.
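(For reference, a minimal sketch of the difference using the google-cloud-storage Python client; the bucket, object name and generation number are hypothetical:)

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-sink-bucket")   # hypothetical versioned bucket

# A plain delete on a versioned bucket is a soft delete: the live version
# becomes a noncurrent (archived) generation.
bucket.delete_blob("reports/2020-01.csv")

# Deleting a specific generation is a hard delete: that version is removed
# outright and is not archived.
bucket.delete_blob("reports/2020-01.csv", generation=1234567890123456)
```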