Where is Databricks DBFS located?

I have read through the documentation but I don't see much technical detail on DBFS.
Is this a hosted service or is it in the client's account? I assume it's not hosted, but I can't find it in my azure account or my aws account. I'm very interested in how this is set up and the technical details I can provide to clients. The most technical detail I can find is that there is a 2 gig file limit.

DBFS is the name for Databricks' implementation of an abstraction layer over the underlying cloud storage, which can be of different types. Usually, when people refer to DBFS, they mean one of two things:
DBFS Root - the main entry point of DBFS (/, /tmp, etc.). On AWS you need to provision it yourself as an S3 bucket. On Azure it's created during workspace creation as a dedicated & isolated storage account in a separate managed resource group. You can't update the settings of that storage account after it's created, or access it directly. That's why it's recommended not to store critical data in the DBFS Root.
Other storage accounts (you can also use S3 or GCS buckets) that are mounted into a workspace. Although it's convenient to work with mounted storage, you need to understand that these mounts are available to everyone in the workspace (except so-called passthrough mounts).

Is it a hosted service? DBFS is provisioned as part of the workspace creation process.
If you prefer, you can also mount your own storage accounts.
You can find more details about DBFS here.
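As a minimal sketch of such a mount (the bucket name and mount point are placeholders, the cluster is assumed to have an instance profile that grants access, and dbutils/display are the built-ins available inside a Databricks notebook), mounting an S3 bucket could look like this:

```python
# Hypothetical mount from a Databricks notebook; "my-team-bucket" and the
# mount point are placeholders, and access is assumed to come from the
# cluster's instance profile rather than embedded keys.
dbutils.fs.mount(
    source="s3a://my-team-bucket",
    mount_point="/mnt/my-team-bucket",
)

# Once mounted, the bucket is visible to every user in the workspace:
display(dbutils.fs.ls("/mnt/my-team-bucket"))
```

Keep in mind the caveat above: unless you use a passthrough mount, anyone in the workspace can read /mnt/my-team-bucket.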

Related

What is the cheapest way to allow others to download a dataset I have?

I have some datasets (possibly up to 10 GB zipped altogether) for my Machine Learning applications.
In order to expose these datasets to others, I believe I have to host a server and let others download them over the network.
What is the cheapest server I can use for this? (I checked AWS free tiers; can these be used?)
Do I need to write a web server? Is there a premade tool that I can use for my use case?
You haven't indicated how much data will be downloaded (GB/month), and that's important because you pay for data transfer out to the internet (about $0.09 per GB) beyond an initial free amount (1 GB/month, I believe, but check whether the free tier offers more). This applies to both S3 and EC2.
That said, I'd consider a few options.
1. Storing the files in S3 and serving them from S3 via CloudFront may be cheaper than running a server 24x7 to host and serve the files.
2. A small EC2 server that fits into the free tier usage plan, running a web or FTP server, serving up your files.
3. Similar to #1, but you can also configure requester pays for S3 downloads (see the sketch below). This option requires your downloaders to have AWS credentials and for you to manage their access. It may not be feasible in your case.
4. Create an EBS volume containing your data, take a snapshot of that volume, share the snapshot with other AWS accounts, then shut down your EC2 instance. This option requires your users to be AWS account holders and to share their AWS account numbers with you. It may not be feasible in your case.
5. AWS SFTP serving up data stored in S3.
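For option 3, a rough sketch of enabling requester pays with boto3 (the bucket and key names are placeholders) could look like this:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; the caller's credentials must own the bucket.
bucket = "my-public-datasets"

# Turn on "Requester Pays" so downloaders are billed for the data transfer,
# not the bucket owner.
s3.put_bucket_request_payment(
    Bucket=bucket,
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Downloaders (using their own AWS credentials) must then opt in explicitly
# on every request:
s3.get_object(Bucket=bucket, Key="dataset-v1.zip", RequestPayer="requester")
```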

Prevent anyone from accessing my EC2 instance content on AWS

I am using a shared AWS account (everyone in the account has root access) to deploy servers on an EC2 instance that I created. Is there a way to prevent anyone else who has access to the same AWS account from accessing the content that I put on the EC2 instance?
As far as I know, creating a separate key pair won't work because someone else can snapshot the instance and launch it with another key pair that they own.
I am using a shared AWS account (everyone in the account has root access) to deploy servers on an EC2 instance that I created. Is there a way to prevent anyone else who has access to the same AWS account from accessing the content that I put on the EC2 instance?
No, you cannot achieve your objective.
When it comes to access security, you either have 100% security or none. If other users have root access, you cannot prevent them from doing what they want to your instance. They can delete, modify, clone, and access it. There are methods to make this more difficult, but anyone with solid experience will bypass these methods.
My recommendation is to create a separate account. This is not always possible, as in your case, but is a standard best practice (separation of access/responsibility). This would isolate your instance from others.
There are third-party tools that support the encryption of data. You will not be able to store the keys/passphrase on the instance. You will need to enter the keys/passphrase each time you encrypt/decrypt your data.
As far as I know, creating a separate key pair won't work because someone else can snapshot the instance and launch it with another key pair that they own.
With root access, there are many ways to access the data stored on your instance's disk. Cloning the disk and mounting it on another instance is one example.
By default, IAM Users do not have access to any AWS services. They can't launch any Amazon EC2 instances, access Amazon S3 data or snapshot an instance.
However, for them to do their work, it is necessary to assign permissions to IAM Users. It is generally recommended not to grant Admin access to everyone. Rather, people should be assigned sufficient permissions to do their assigned job.
Some companies separate Dev/Test/Prod resources, giving lots of people permissions in Dev environments but locking down access to Production. This is done to ensure continuity, recoverability and privacy.
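As a rough illustration of that principle (a hypothetical sketch, not a vetted production policy), you could grant teammates day-to-day EC2 permissions while deliberately omitting the snapshot and imaging actions that would let them copy another instance's disk:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical least-privilege policy: allow basic EC2 usage but omit actions
# such as ec2:CreateSnapshot and ec2:CreateImage that would let a user copy
# another instance's disk. Adjust to the actual job requirements.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:RunInstances",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="ec2-operator-no-snapshots",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
)
```

This only helps against users who are not already administrators; as noted below, anyone with Admin/root-level permissions can simply grant themselves whatever permissions they are missing.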
Your requirement is to prevent people from accessing information on a specific Amazon EC2 instance. This can be done by using a key pair that only you hold, so nobody else can log in to the instance.
However, as you point out, there can be ways around this, such as copying the disk (EBS snapshot) and mounting it on another computer, thereby gaining access to the data. This is analogous to security in a traditional data center: if somebody has physical access to a computer, they can extract the disk, attach it to another computer and access the data. This is why traditional data centers have significant physical security to prevent unauthorized access. The AWS equivalent of this physical security is IAM permissions, which grant specific users permission to perform certain actions (such as creating a disk snapshot).
If there are people who have Admin/root access on the AWS account, then they can do whatever they wish. This is by design. If you do not wish people to have such access, then do not assign them these permissions.
Sometimes companies face a trade-off: They want Admins to be able to do anything necessary to provide IT services, but they also want to protect sensitive data. An example of this is an HR system that contains sensitive information that they don't want accessible to general staff. A typical way this is handled is to put the HR system in a separate AWS Account that does not provide general access to IT staff, and probably has additional safeguards such as MFA and additional audit logging.
Bottom line: If people have physical access or Admin-like permissions, they can do whatever they like. You should either restrict the granting of permissions, or use a separate AWS Account.

Is Redshift/S3 Data co-mingled?

I am working on moving our business needs into the cloud, namely using AWS Simple Storage Service and Amazon Redshift. Because we are working with sensitive data we have some client concerns to consider. One question we will have to answer is whether or not the data in S3/Redshift is co-mingled with other data and show evidence that it is isolated.
While researching I found information about EC2 instances being shared on the same server unless the instance is specified as a dedicated instance. However, I have been totally unable to find anything similar regarding other AWS services. Is the data in S3 and Redshift co-mingled as well?
Everything in the cloud is co-mingled, but within security boundaries, unless you pay more for dedicated capacity (like Dedicated Hosts for EC2); if even that level of isolation isn't acceptable, you should stick with on-prem.
Your co-mingling concern falls under the Shared Responsibility Model, where AWS is responsible for making sure your data is not accessible to other tenants running on their hosts unless you open up that access.
Read this article on how the Shared Responsibility Model works:
https://aws.amazon.com/compliance/shared-responsibility-model/
Or this whitepaper
https://d0.awsstatic.com/whitepapers/Security/AWS_Security_Best_Practices.pdf

Comparative application: EBS vs S3

I am new to the cloud environment. I understand the definitions and storage types of EBS and S3, but I want to understand the applications of EBS as compared to S3.
I understand EBS looks like a device for heavy-throughput operations, but I cannot find any application where it would be used in preference to S3. I could think of putting server logs on an EBS magnetic volume, since one EBS volume can be attached to one instance.
With S3 you can use the scaling property to add heavy data and expand in real time, and we can deploy our self-managed DBs on this service.
Please correct me if I am wrong, and please help me understand what each is best suited for and how they compare with one another.
As you stated, they are primarily different types of storage:
Amazon Elastic Block Store (EBS) is a persistent disk-storage service, which provides storage volumes to a virtual machine (similar to VMDK files in VMWare)
Amazon Simple Storage Service (S3) is an object store system that stores files as objects and optionally makes them available across the Internet.
So, how do people choose which to use? It's quite simple... If they need a volume mounted on an Amazon EC2 instance, they need to use Amazon EBS. It gives them a C:, D: drive, etc in Windows and a mountable volume in Linux. Computers traditionally expect to have locally-attached disk storage. Put simply: If the operating system or an application running on an Amazon EC2 instance wants to store data locally, it will use EBS.
EBS Volumes are actually stored on two physical devices in case of failure, but an EBS volume appears as a single volume. The volume size must be selected when the volume is created. The volume exists in a single Availability Zone and can only be attached to EC2 instances in the same Availability Zone. EBS Volumes persist even when the attached EC2 instance is Stopped; when the instance is Started again, the disk remains attached and all data has been preserved.
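To make the Availability Zone and attachment behaviour concrete (the region, AZ, size, and instance ID below are placeholders), a volume is created in a specific AZ and then attached to an instance in that same AZ, roughly like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Hypothetical values: the volume must be created in the same AZ as the
# instance it will be attached to.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp3")

# Wait until the volume is ready before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```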
Amazon S3, however, is something quite different. It is a storage service that allows files to be uploaded/downloaded (PutObject, GetObject) and files are replicated across at least three data centers. Files can optionally be accessed via the Internet via HTTP/HTTPS without requiring a web server. There are no limits on the amount of data that can be stored. Access can be granted per-object, per-bucket via a Bucket Policy, or via IAM Users and Groups.
Amazon S3 is a good option when data needs to be shared (both locally and across the Internet), retained for long periods, backed up (or even used for storing backups) and made accessible to other systems. However, applications need to be specifically coded to use Amazon S3, and many traditional applications expect to store data on a local drive rather than on a separate storage service.
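To illustrate that difference in how applications use it (the bucket and key names below are hypothetical), every S3 read and write is an explicit API call rather than ordinary file I/O:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key names. Unlike a local EBS-backed filesystem,
# every read and write is an explicit API call over the network.
s3.put_object(Bucket="my-app-data", Key="reports/2024/summary.csv",
              Body=b"id,total\n1,42\n")

response = s3.get_object(Bucket="my-app-data", Key="reports/2024/summary.csv")
print(response["Body"].read().decode())

# There is no partial update: changing one line means re-uploading the whole
# object, which is one reason EBS suits partial-update workloads better.
```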
While Amazon S3 has many benefits, there are still situations where Amazon EBS is a better storage choice:
When using applications that expect to store data locally
For storing temporary files
When applications want to partially update files, because the smallest storage unit in S3 is a file and updating a part of a file requires re-uploading the whole file
For very High-IO situations, such as databases (EBS Provisioned IOPS can provide volumes up to 20,000 IOPS)
For creating volume snapshots as backups
For creating Amazon Machine Images (AMIs) that can be used to boot EC2 instances
Bottom line: They are primarily different types of storage and each have their own usage sweet-spot, just like a Database is a good form of storage depending upon the situation.

AWS backup account

We all know what happened to Code Spaces: they were hacked and their AWS account was essentially erased. I'm trying to put together a recommendation on a set of tools and best practices for archiving my entire production AWS account into a backup-only account that only I would have access to. The backup account would be purely for DR purposes, storing EBS snapshots, AMIs, RDS, etc.
Thoughts?
Separating the production account from the backup account for DR purposes is an excellent idea.
Setting up a "cross-account" backup solution can be based on the EBS snapshot sharing feature (a minimal sketch follows the list below), which is currently not available for RDS.
If you want to implement such a solution, please consider the following:
Will the snapshots be stored in both the source and DR accounts? If they are, you will pay for the storage twice.
How do you protect the credentials of the DR account? You should make sure the credentials used to copy snapshots across accounts are not permitted to delete snapshots.
Consider the way older snapshots get deleted at some point. You may want to deal with snapshot deletion separately using different credentials.
Make sure your snapshots can be easily recovered from the DR account back to the original account.
Think of ways to automate this cross-account process and make it simple and error-free.
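As a minimal sketch of the snapshot-sharing mechanism mentioned above (the snapshot ID, account ID, region, and profile name are placeholders), you would grant the DR account access to a snapshot and then copy it from the DR side so that account owns its own copy:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Hypothetical IDs: the production snapshot and the DR account number.
snapshot_id = "snap-0123456789abcdef0"
dr_account_id = "111122223333"

# Grant the DR account permission to read the snapshot.
ec2.modify_snapshot_attribute(
    SnapshotId=snapshot_id,
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[dr_account_id],
)

# Using the DR account's credentials, copy the shared snapshot so the DR
# account owns its own copy (the source copy can then be deleted separately,
# with different credentials, as suggested above).
dr_ec2 = boto3.Session(profile_name="dr-account").client("ec2", region_name="us-east-1")
dr_ec2.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snapshot_id,
    Description="DR copy of production snapshot",
)
```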
The company I work for recently released a product called “Cloud Protection Manager (CPM) v1.8.0” in the AWS Marketplace which supports cross-account backup and recovery in AWS and a process where a special account is used for DR only.
I think you would be able to set up a VPC and then use VPC peering to see the other account and access S3 in that account.
To prevent something like the Code Spaces incident, make sure you use MFA authentication (there is no excuse for not using it; the Google Authenticator app for your phone is free and safer than just having a single password as protection).
Also, don't use the account owner; set up a separate IAM role with just the permissions you need (and enable MFA on this account as well).
The only issue is that VPC peering doesn't work across regions; that would be nicer than having the DR in a different AZ in the same region.