Data Lakes - S3 and Databricks - amazon-web-services

I understand Data Lake Zones in S3 and I am looking at establishing 3 zones - LANDING, STAGING, CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders as various zones.
How would I do the equivalent in AWS - Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (i.e. s3://bucket_name/landing/..., s3://bucket_name/staging/). I understand AWS S3 is nothing more than containers.
Also, would I be able to mount multiple S3 buckets on Databricks AWS? If so is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well.
S3 performance Best Pratices

There is no single solution - the actual implementation depends on the amount of data, number of consumers/producers, etc. You need to take into account AWS S3 limits, like:
By default you may have only 100 buckets in an account - it could be increased although
You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket (although the number of prefixes is not limited)
You can mount each of the buckets, or individual folders into Databricks workspace as described in documentation. But it's really not recommended from the security standpoint, as everyone in workspace will have the same permissions as role that was used for mounting. Instead of that, just use full S3 URLs in combination with instance profiles.

Related

AWS SFTP from AWS Transfer Family

I am planning to spin-up AWS Managed SFTP Server. AWS Documentation say, I can create upto 20 users. Can I configure for 20 users 20 different buckets and assign seperate previleges ? Is this a possible configuration ?
All I am looking for exposing same endpoint with different vendors having access to different AWS S3 buckets to upload their files to designated AWS S3 buckets.
Appreciate all your thoughts and response at the earliest.
Thanks
Setting up separate buckets and AWS Transfer instances for each vendor is a best practice for workload separation. I would recommend setting up a custom URL in Route53 for each of your vendors and not attempt to consolidate on a single URL (it isn't natively supported).
https://docs.aws.amazon.com/transfer/latest/userguide/requirements-dns.html
While setting up separate AWS Transfer Family instances will work, it comes at a higher cost (remember you are charged even if you stop until the time you delete, you are billed $0.30 per hour which is ~ $216 per month).
The other way is to create different users (one per vendor) and use different home directories (one per vendor) and lock down permissions through IAM role for that user (also there is provision to use a scope-down policy along with AD). If using service managed users see this link https://docs.aws.amazon.com/transfer/latest/userguide/service-managed-users.html.

Is it better to have multiple s3 buckets or one bucket with sub folders?

Is it better to have multiple s3 buckets per category of uploads or one bucket with sub folders OR a linked s3 bucket? I know for sure there will be more user-images than there will be profille-pics and that there is a 5TB limit per bucket and 100 buckets per account. I'm doing this using aws boto library and https://github.com/amol-/depot
Which is the structure my folders in which of the following manner?
/app_bucket
/profile-pic-folder
/user-images-folder
OR
profile-pic-bucket
user-images-bucket
OR
/app_bucket_1
/app_bucket_2
The last one implies that its really a 10TB bucket where a new bucket is created when the files within bucket_1 exceeds 5TB. But all uploads will be read as if in one bucket. Or is there a better way of doing what I'm trying to do? Many thanks!
I'm not sure if this is correct... 100 buckets per account?
https://www.reddit.com/r/aws/comments/28vbjs/requesting_increase_in_number_of_s3_buckets/
Yes, there is actually a 100 bucket limit per account. I asked the reason for that to an architect in an AWS event. He said this is to avoid people hosting unlimited static websites on S3 as they think this may be abused. But you can apply for an increase.
By default, you can create up to 100 buckets in each of your AWS
accounts. If you need additional buckets, you can increase your bucket
limit by submitting a service limit increase.
Source: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
Also, please note that there are actually no folders in S3, just a flat file structure:
Amazon S3 has a flat structure with no hierarchy like you would see in
a typical file system. However, for the sake of organizational
simplicity, the Amazon S3 console supports the folder concept as a
means of grouping objects. Amazon S3 does this by using key name
prefixes for objects.
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
Finally, the 5TB limit only applies to a single object. There is no limit on the number of objects or total size of the bucket.
Q: How much data can I store?
The total volume of data and number of objects you can store are
unlimited.
Source: https://aws.amazon.com/s3/faqs/
Also the documentation states there is no performance difference between using a single bucket or multiple buckets so I guess both option 1 and 2 would be suitable for you.
Hope this helps.
Simpler Permission with Multiple Buckets
If the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket level permissions instead of directory level permissions.
2-way doors and migrations
On a similar note, using 2 buckets is more flexible down the road.
1 to 2:
If you switch from 1 bucket to 2, you now have to move all clients to the new set-up. You will need to update permissions for all clients, which can require IAM policy changes for both you and the client. Then you can move your clients over by releasing a new client library during the transition period.
2 to 1:
If you switch from 2 buckets to 1 bucket, your clients will already have access to the 1 bucket. All you need to do is update the client library and move your clients onto it during the transition period.
*If you don't have a client library than code changes are required in both cases for the clients.

Copy all files from S3 to ESB drive

Is there a way how to copy all files from S3 to an EBS drive belonging to a EC2 instance (which may belong to a a different AWS account than the S3)?
We are performing a migration of the whole account and upgrading the instances from t1 to t2 type and would like to backup the data from S3 somewhere outside S3 (and Glacier since Glacier is closely linked to S3) in case that something goes wrong and we lose the data.
I found only articles and docs talking about EBS snapshots but I am not sure if the S3 data can be actually copied to EBS (in some other way than manually).
According to this docs, I can ssh to my instance and copy the data from S3 buckets to my local EBS drive, but I have to specify the name of the bucket. Is there a way how to copy all the buckets there?
aws s3 sync s3://mybucket
I would like to achieve this:
Pseudocode:
for each bucket
do
aws s3 sync s3://bucketName bucketName
endfor
Is there a way how to do this using the AWS CLI?
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year and achieves this by automatically replicating the data you put into a bucket across 3 separate facilities (think datacenters), across Availability Zones, within a Region. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.
If you are still concerned about losing your data, you may consider copying the contents of the buckets into new buckets set up in another region. That means that you have your data in 1 system that offers 11x9's with a copy in another system that offers 11x9's. Say your original buckets reside in the Dublin region, create corresponding 'backup' buckets in the Frankfurt region and use the sync command.
eg.
aws s3 sync s3://originalbucket s3://backupbucket
That way you will have six copies of your data in six different facilities spread across Europe (naturally this is just as relevant if you use multiple regions in the US or ASIA). This would be a much more redundant configuration than pumping it into EBS volumes that have a meagre (when compared to S3) 99.999% availability. And better economics with S3 rates lower than EBS (1TB in S3 = US$30 vs 1TB in EBS(Magnetic) = US$50) and you only pay for the capacity you consume whereas EBS is based on what you provision.
Happy days...
References
http://aws.amazon.com/s3/faqs/
http://aws.amazon.com/ebs/faqs/
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
I would agree with rdp-cloud's answer, but if you insist on creating EBS backups, to answer your question - there is no single aws cli command that would sync all available buckets in one go. You can use a bash script to get the list of all available buckets and then sync looping through them:
#!/bin/bash
BUCKETS=($(aws s3 ls | awk '{print $3}'))
for (( i=0; i<${#BUCKETS[#]}; i++ ))
do
aws s3 sync s3://$BUCKETS[$i] <destination>
done
Make sure to test that aws s3 ls | awk '{print $3}' gives you the exact list you intend to sync before running the above.

What are possible ways to access Amazaon S3 data if S3 outage happens?

Can some one help me in understanding the S3 outage usecase here.
The probability of S3 outage is very less, but in case if this happens, what are the ways we can access data that sits in S3.
I know that there is one possibility, that is cross region replication, that works for new files, that I am going to put in my s3 bucket, if I enable it now. What happen to old files, I know if I go and upload all those historical files also to the other region, then it works.
Then again the same question, if both the regions went down, then what?
I am sure others would have thought of this. Any inputs on this.
From Protecting Data in Amazon S3:
Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.
...
Backed with the Amazon S3 Service Level Agreement
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
Designed to sustain the concurrent loss of data in two facilities
So, if you're still not happy with all those statements, how can you access your data in an outage?
If your data is in only one region, and the region is not accessible, then your data is not accessible. Note, however, that an external network connectivity problem could prevent access to Amazon S3, yet Amazon S3 might still be accessible from Amazon EC2 instances in the same region.
Cross-region replication will copy your data to another Amazon S3 region. It requires versioning to be activated. To copy any files that exist prior to activating cross-region replication, use the sync command in the AWS Command-Line Utility (CLI), eg:
aws s3 sync s3://bucket1/folder s3://bucket2/folder
Each AWS region operates independently, so the possibility of multiple regions suffering outages would presumably be even less likely.
If you are feeling particularly paranoid, you could copy your data to another cloud provider (Azure, Google, Rackspace, etc). There are tools that can assist:
CloudBerry Cloud Migrator
AzureCopy
...and no doubt many more!

Does AWS S3 cross-region replication use same URL for multiple regions?

Using S3 cross-region replication, if a user downloads http://mybucket.s3.amazonaws.com/myobject , will it automatically download from the closest region like cloudfront? So no need to specify the region in the URL like http://mybucket.s3-[region].amazonaws.com/myobject ?
http://aws.amazon.com/about-aws/whats-new/2015/03/amazon-s3-introduces-cross-region-replication/
Bucket names are global, and cross-region replication involves copying it to a different bucket.
In other words, example.us-west-1 and example.us-east-1 is not valid, as there can only be one bucket named 'example'.
That's implied in the announcement post- Mr. Barr is using buckets named jbarr and jbarr-replication.
Using S3 cross-Region replication will put your object into two (or more) buckets in two different Regions.
If you want a single access point that will choose the closest available bucket then you want to use Multi-Region Access Points (MRAP)
MRAP makes use of Global Accelerator and puts bucket requests onto the AWS backbone at the closest edge location, which provides faster, more reliable connection to the actual bucket. Global Accelerator also chooses the closest available bucket. If a bucket is not available, it will serve the request from the other bucket providing automatic failover
You can also configure it in an active/passive configuration, always serving from one bucket until you initiate a failover
From the MRAP page on AWS console it even shows you a graphical representation of your replication rules
s3 is global service, no need specify the region. The S3 name has to be unique globally.
when you create s3, you need specify the region, however it doesn't mean you need put the region name when you access it. To speed up the access speed from other region, there are several options like
-- Amazon S3 Transfer Acceleration with same bucket name.
-- Or use set up another bucket with different name in different region and enable cross region replication. Create an origin group with two origins for cloudfront.