AWS SFTP from AWS Transfer Family

I am planning to spin up an AWS managed SFTP server. The AWS documentation says I can create up to 20 users. Can I configure 20 different buckets for those 20 users and assign separate privileges to each? Is this a possible configuration?
All I am looking for is to expose the same endpoint to different vendors, with each vendor having access only to its own designated S3 bucket for uploading files.
I would appreciate all your thoughts and a response at the earliest.
Thanks

Setting up separate buckets and AWS Transfer instances for each vendor is a best practice for workload separation. I would recommend setting up a custom URL in Route 53 for each of your vendors rather than attempting to consolidate on a single URL (that isn't natively supported).
https://docs.aws.amazon.com/transfer/latest/userguide/requirements-dns.html

While setting up separate AWS Transfer Family instances will work, it comes at a higher cost: each server is billed from the time it is created until it is deleted, even while stopped, at $0.30 per hour, which is roughly $216 per month.
The other way is to create a different user per vendor on a single server, give each user a different home directory, and lock down permissions through the IAM role for that user (there is also provision to use a scope-down policy, including with AD-based identities). If you are using service-managed users, see https://docs.aws.amazon.com/transfer/latest/userguide/service-managed-users.html.
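If you go the single-server route, creating the per-vendor users can be scripted with boto3. Below is a minimal sketch, assuming service-managed users; the server ID, bucket, role ARN and SSH key are placeholders, and the scope-down policy follows the documented ${transfer:...} variable pattern so that one policy document can be reused for every vendor:

import json
import boto3

transfer = boto3.client("transfer")

# Scope-down (session) policy reused for every vendor user. The ${transfer:...}
# variables are resolved per session, so each vendor only ever sees its own
# home directory.
scope_down_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfUserFolder",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::${transfer:HomeBucket}"],
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["${transfer:HomeFolder}/*", "${transfer:HomeFolder}"]
                }
            },
        },
        {
            "Sid": "HomeDirObjectAccess",
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::${transfer:HomeDirectory}*"],
        },
    ],
}

# Server ID, bucket, role ARN and SSH key below are placeholders.
transfer.create_user(
    ServerId="s-1234567890abcdef0",
    UserName="vendor-a",
    Role="arn:aws:iam::123456789012:role/transfer-vendor-access",  # broad S3 role, narrowed by the policy above
    HomeDirectory="/vendor-a-uploads/incoming",  # /<bucket>/<prefix>
    Policy=json.dumps(scope_down_policy),
    SshPublicKeyBody="ssh-rsa AAAA... vendor-a",
)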

Related

Data Lakes - S3 and Databricks

I understand Data Lake Zones in S3 and I am looking at establishing 3 zones - LANDING, STAGING, CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders as various zones.
How would I do the equivalent in AWS - Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (i.e. s3://bucket_name/landing/..., s3://bucket_name/staging/). I understand AWS S3 is nothing more than containers.
Also, would I be able to mount multiple S3 buckets on Databricks AWS? If so is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well.
S3 Performance Best Practices
There is no single solution - the actual implementation depends on the amount of data, number of consumers/producers, etc. You need to take into account AWS S3 limits, like:
By default you may have only 100 buckets in an account, although this limit can be increased on request
You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket (although the number of prefixes is not limited)
You can mount each of the buckets, or individual folders, into the Databricks workspace as described in the documentation. However, this is really not recommended from a security standpoint, as everyone in the workspace will have the same permissions as the role that was used for mounting. Instead, just use full S3 URLs in combination with instance profiles.
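As a rough illustration of the instance-profile approach (bucket names, paths and formats below are hypothetical), once the cluster has an instance profile attached you can read and write the zones directly by URL in a Databricks notebook:

# Runs in a Databricks notebook; the cluster must have an S3 instance profile attached.
# Bucket names, paths and formats below are placeholders for this example.
landing_path = "s3://landing-data-bucket/events/2023/"
curated_path = "s3://curated-data-bucket/events/"

# Read raw files from the landing zone using the full S3 URL
raw_df = spark.read.json(landing_path)

# ...apply cleansing/transformations...
curated_df = raw_df.dropDuplicates()

# Write the result to the curated zone
curated_df.write.mode("overwrite").parquet(curated_path)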

File transfer from basespace to aws S3

I use the Illumina BaseSpace service to run high-throughput sequencing secondary analyses. This service runs on AWS servers, so all files are stored on S3.
I would like to transfer the files (analysis results) from BaseSpace to my own AWS S3 account. I would like to know the best strategy to make this go quickly, knowing that in the end it boils down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me.
The solutions I'm thinking of:
use the BaseSpace CLI tool to copy the files to our on-premise servers, then transfer them back up to AWS
use the same tool from an EC2 instance.
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket?).
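For reference, here is a minimal sketch of option 3, assuming you already obtained a pre-signed download URL from the Illumina API; the bucket and key names are placeholders. It streams the HTTP response straight into the destination bucket without writing to local disk:

import boto3
import requests

presigned_url = "https://..."            # pre-signed URL returned by the BaseSpace API
target_bucket = "my-sequencing-results"  # hypothetical destination bucket
target_key = "runs/run-1234/sample.fastq.gz"

s3 = boto3.client("s3")

with requests.get(presigned_url, stream=True) as response:
    response.raise_for_status()
    # upload_fileobj performs a managed multipart upload from the file-like stream
    s3.upload_fileobj(response.raw, target_bucket, target_key)

Note that the data still flows through the machine running this code; a pure server-side bucket-to-bucket copy would require cross-account read access to Illumina's bucket, which a pre-signed URL does not give you.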
If I use an EC2 instance, what kind of instance do you recommend so that it has enough resources without being oversized (and therefore spending money for nothing)?
Thanks in advance,
Quentin

AWS Codepipeline limitations on a single account

We are planning to leverage AWS CodePipeline in a single AWS account, and going forward the pipeline count will reach around 500. Does AWS impose any limit on the number of pipelines that can be hosted in a single account?
Do we need a separate account for hosting all these pipelines, or should we just host them in the AWS account in which the application is running? What are the best practices?
You can see the limits pertaining to CodePipeline at https://docs.aws.amazon.com/codepipeline/latest/userguide/limits.html.
It looks like, as of now, there is a soft limit of 300 pipelines per region per account. If you hit that number, you should be able to request an increase by following the link in that document.
As mentioned in another answer, the default limit for pipelines per account per region is 300. This limit can be raised on request.
While you can run more than 300 pipelines per account, you may also start running into related limits like IAM roles per account, CloudWatch Event rules per account, etc. You can get these limits raised too, but the complexity of dealing with all this can start to add up.
My personal recommendation would be to split things across multiple accounts so that there are about 300 pipelines per account at most. If you have multiple teams or multiple departments, splitting accounts by team/department can be a good idea anyway.
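If you do need to go beyond the default, the increase can also be requested programmatically through Service Quotas. A rough boto3 sketch (the region is just an example, and the quota code is a placeholder that should be taken from the listing output, not a real value):

import boto3

# Quotas are regional, so point the client at the region where the pipelines live.
client = boto3.client("service-quotas", region_name="us-east-1")

# Step 1: list the CodePipeline quotas and note the QuotaCode for the
# pipelines-per-account-per-region quota (inspect the printed output rather
# than relying on a hard-coded name).
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="codepipeline"):
    for quota in page["Quotas"]:
        print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Step 2: request an increase using the QuotaCode found above.
# "L-XXXXXXXX" is a placeholder, not a real code.
client.request_service_quota_increase(
    ServiceCode="codepipeline",
    QuotaCode="L-XXXXXXXX",
    DesiredValue=500.0,
)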

Should I use nginx reverse proxy for cloud object storage?

I am currently implementing the image storage architecture for my service.
As I read in one article, it is a good idea to move all image upload and download traffic to external cloud object storage.
https://medium.com/#jgefroh/software-architecture-image-uploading-67997101a034
As I noticed there are many cloud object storage providers:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Blob Storage
- Alibaba Object Storage
- Oracle Object Storage
- IBM Object Storage
- Backblaze B2 Object
- Exoscale Object Storage
- Aruba Object Storage
- OVH Object Storage
- DreamHost DreamObjects
- Rackspace Cloud Files
- Digital Ocean Spaces
- Wasabi Hot Object Storage
My first choice was Amazon S3, because almost all of my system infrastructure is located on AWS.
However, I see a lot of problems with this object storage.
(Please correct me if I am wrong on any point below.)
1) Expensive log delivery
AWS charges for all operational requests. If I have to pay for all requests, I would like to see all request logs, and I would like to get these logs as fast as possible. AWS S3 provides log delivery, but with a big delay, and each log is delivered as a separate file into another S3 bucket, so each log is itself a separate S3 write request. Write requests are more expensive; they cost approximately $5 per 1M requests. Another option is to trigger an AWS Lambda function whenever a request is made, but that is an additional cost of $0.20 per 1M Lambda invocations. In summary, in my opinion the log delivery for S3 requests is way too expensive.
2) Cannot configure maximum object content-length globally for a whole bucket.
I have not found a way to configure a maximum object size (content-length) restriction for a whole bucket. In short, I want to be able to block uploads of files larger than a specified limit for a chosen bucket. I know that it is possible to constrain the content-length of an uploaded file in presigned URLs, but I think this should be configurable globally for a whole bucket.
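For what it's worth, the per-URL workaround mentioned above can be done with a presigned POST, which supports a content-length-range condition; it is not a bucket-wide setting, but it does let the server cap the size of each individual upload. A minimal sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client("s3")

# Presigned POST that only accepts uploads between 1 byte and 5 MiB.
# Bucket and key are placeholders for this example.
post = s3.generate_presigned_post(
    Bucket="my-image-uploads",
    Key="avatars/user-123.jpg",
    Conditions=[["content-length-range", 1, 5 * 1024 * 1024]],
    ExpiresIn=300,  # URL valid for 5 minutes
)

# post["url"] and post["fields"] are handed to the client, which submits a
# multipart/form-data POST; S3 rejects bodies outside the size range.
print(post["url"])
print(post["fields"])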
3) Cannot configure a request rate limit per IP address per minute directly on a bucket.
Because all S3 requests are chargeable, I would like to be able to restrict the number of requests that can be made to my bucket from a single IP address. I want to prevent massive uploads and downloads from one IP address, and I want this to be configurable for a whole bucket. I know that this functionality can be provided by AWS WAF attached to CloudFront, but WAF-inspected requests are far too expensive: you pay $0.60 per 1M inspected requests, while direct Amazon S3 requests cost $0.40 per 1M requests, so it is completely unprofitable to use AWS WAF as a rate-limiting option and "wallet protection" against DoS attacks on S3.
4) Cannot create a "one-time upload" presigned URL.
Generated presigned URLs can be used multiple times as long as they have not expired, which means that you can upload a file many times using the same presigned URL. It would be great if the AWS S3 API provided a way to create "one-time upload" presigned URLs. I know that I can implement such "one-time upload" functionality myself; for example, see https://serverless.com/blog/s3-one-time-signed-url/. However, in my opinion such functionality should be provided directly by the S3 API.
5) Every request to S3 is chargeable!
Let's say you created a private bucket. No one can access the data in it, however anybody on the internet can run bulk requests against your bucket, and Amazon will charge you for all of those forbidden 403 requests! It is not very comfortable that someone can "drain my wallet" at any time just by knowing the name of my bucket, and it is far from secure, especially if you give someone a direct S3 presigned URL containing the bucket address: everyone who knows the name of a bucket can run bulk 403 requests and drain my wallet. Someone already asked this question here, and I guess it is still a problem: https://forums.aws.amazon.com/message.jspa?messageID=58518
In my opinion forbidden 403 requests should not be chargeable at all!
6) Cannot block network traffic to S3 via network ACL rules.
Because every request to S3 is chargeable, I would like to be able to completely block network traffic to my S3 bucket at a lower network layer. Because S3 buckets cannot be placed in a private VPC, I cannot block traffic from a particular IP address via network ACL rules. In my opinion, AWS should provide such network ACL rules for S3 buckets (and I mean network ACLs, not the S3 ACLs that work only at the application layer).
Because of all these problems, I am considering using nginx as a proxy for all requests made to my private S3 buckets.
Advantages of this solution:
I can rate-limit requests to S3 for free, however I want
I can cache images on my nginx for free, resulting in fewer requests to S3
I can add an extra layer of security with lua-resty-waf (https://github.com/p0pr0ck5/lua-resty-waf)
I can quickly cut off requests whose body is greater than a specified size
I can provide additional request authentication with OpenResty (custom Lua code can be executed on each request)
I can easily and quickly obtain all access logs from my EC2 nginx machine and forward them to CloudWatch using the CloudWatch agent
Disadvantages of this solution:
I have to transfer all the traffic to S3 through my EC2 machines and scale my nginx EC2 machines with an Auto Scaling group
Direct traffic to the S3 bucket is still possible from the internet for everyone who knows my bucket name (there is no way to hide an S3 bucket in a private network)
MY QUESTIONS
Do you think that this approach, with an nginx reverse proxy server in front of the object storage, is good?
Or would it be better to just find an alternative cloud object storage provider and not proxy object storage requests at all?
I would be very thankful for recommendations of alternative storage providers.
For any recommended provider, the following information would be appreciated:
Object storage provider name
A. What is the price for INGRESS traffic?
B. What is the price for EGRESS traffic?
C. What is the price for REQUESTS?
D. What payment options are available?
E. Are there any long-term agreements?
F. Where are the data centers located?
G. Does it provide S3 compatible API?
H. Does it provide full access for all request logs?
I. Does it provide a configurable rate limit per IP address per minute for a bucket?
J. Does it allow hiding the object storage in a private network, or allowing network traffic only from particular IP addresses?
In my opinion a PERFECT cloud object storage provider should:
1) Provide access logs for all requests made on a bucket (IP address, response code, content-length, etc.)
2) Provide the possibility to rate-limit bucket requests per IP address per minute
3) Provide the possibility to cut off traffic from malicious IP addresses at the network layer
4) Provide the possibility to hide object storage buckets in a private network, or to give access only to specified IP addresses
5) Not charge for forbidden 403 requests
I would be very thankful for all your answers, comments and recommendations.
Best regards
Using nginx as a reverse proxy for cloud object storage is a good idea for many use cases, and you can find some guides online on how to do so (at least with S3).
I am not familiar with all the features available from every cloud storage provider, but I doubt that any of them will give you all the features and flexibility you have with nginx.
Regarding your disadvantages:
Scaling is always an issue, but benchmark tests show that nginx can handle a lot of throughput even on small machines.
There are solutions for that in AWS. First make your S3 bucket private, and then you can:
Allow access to your bucket only from the EC2 instance(s) running your nginx servers
Generate pre-signed URLs for your S3 bucket and serve them to your clients through nginx.
Note that both solutions for your second problem require some development.
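As a rough illustration of the pre-signed URL option (bucket and key names are placeholders), a small service behind nginx can hand out short-lived URLs for the private bucket:

import boto3

s3 = boto3.client("s3")

def get_download_url(key: str) -> str:
    """Return a short-lived presigned GET URL for an object in the private bucket.

    "my-private-images" is a placeholder bucket name; the caller never needs
    direct S3 credentials, only this URL.
    """
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-private-images", "Key": key},
        ExpiresIn=60,  # keep the window short since the URL is shareable
    )

# Example: the proxy layer resolves an application-level image id to a key,
# then redirects or returns the presigned URL to the client.
print(get_download_url("avatars/user-123.jpg"))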
If you have AWS infrastructure and want to run an on-prem S3-compatible API, you can look into MinIO.
It is a performant object store that protects data through erasure coding.
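Because MinIO speaks the S3 API, existing S3 client code generally only needs an endpoint override. A minimal sketch, assuming a local MinIO deployment with placeholder endpoint and credentials:

import boto3

# Point the standard S3 client at a MinIO endpoint instead of AWS.
# The endpoint URL and credentials are placeholders for a local deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

s3.create_bucket(Bucket="images")
s3.put_object(Bucket="images", Key="avatars/user-123.jpg", Body=b"...")
print([obj["Key"] for obj in s3.list_objects_v2(Bucket="images").get("Contents", [])])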

Is it better to have multiple s3 buckets or one bucket with sub folders?

Is it better to have multiple S3 buckets per category of uploads, one bucket with sub-folders, or a linked S3 bucket? I know for sure there will be more user images than profile pics, and that there is a 5TB limit per bucket and 100 buckets per account. I'm doing this using the AWS boto library and https://github.com/amol-/depot
Which of the following structures should my folders use?
/app_bucket
/profile-pic-folder
/user-images-folder
OR
profile-pic-bucket
user-images-bucket
OR
/app_bucket_1
/app_bucket_2
The last option implies that it's really a 10TB store where a new bucket is created once the files within bucket_1 exceed 5TB, but all uploads are read as if they were in one bucket. Or is there a better way of doing what I'm trying to do? Many thanks!
I'm not sure if this is correct... 100 buckets per account?
https://www.reddit.com/r/aws/comments/28vbjs/requesting_increase_in_number_of_s3_buckets/
Yes, there is indeed a 100-bucket limit per account. I asked an architect at an AWS event the reason for this. He said it is to keep people from hosting unlimited static websites on S3, since they think this could be abused. But you can apply for an increase.
By default, you can create up to 100 buckets in each of your AWS
accounts. If you need additional buckets, you can increase your bucket
limit by submitting a service limit increase.
Source: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
Also, please note that there are actually no folders in S3, just a flat file structure:
Amazon S3 has a flat structure with no hierarchy like you would see in
a typical file system. However, for the sake of organizational
simplicity, the Amazon S3 console supports the folder concept as a
means of grouping objects. Amazon S3 does this by using key name
prefixes for objects.
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
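As a small illustration of this (the bucket name is a placeholder, since real bucket names cannot contain underscores like the question's app_bucket), a "folder" is simply part of the object key:

import boto3

s3 = boto3.client("s3")

# Single-bucket layout: "folders" are nothing more than key prefixes.
# "app-bucket" stands in for the question's app_bucket; file contents are placeholders.
s3.put_object(Bucket="app-bucket", Key="profile-pic-folder/user-42.jpg", Body=b"...")
s3.put_object(Bucket="app-bucket", Key="user-images-folder/photo-1.jpg", Body=b"...")

# Listing a "folder" is just listing by key prefix.
response = s3.list_objects_v2(Bucket="app-bucket", Prefix="profile-pic-folder/")
for obj in response.get("Contents", []):
    print(obj["Key"])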
Finally, the 5TB limit only applies to a single object. There is no limit on the number of objects or total size of the bucket.
Q: How much data can I store?
The total volume of data and number of objects you can store are
unlimited.
Source: https://aws.amazon.com/s3/faqs/
Also, the documentation states there is no performance difference between using a single bucket or multiple buckets, so I guess both options 1 and 2 would be suitable for you.
Hope this helps.
Simpler Permission with Multiple Buckets
If the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket level permissions instead of directory level permissions.
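For example, with a dedicated bucket a client's policy can grant access to the whole bucket, while a shared bucket forces the grant to be scoped to a key prefix. A rough sketch with placeholder names, reusing the bucket layouts from the question:

import json
import boto3

# With a dedicated bucket, a client can be granted the whole bucket:
per_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::profile-pic-bucket/*",
    }],
}

# With a shared bucket, the same grant has to be scoped to a key prefix:
per_prefix_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::app-bucket/profile-pic-folder/*",
    }],
}

# Attaching the simpler per-bucket version to a hypothetical client user:
iam = boto3.client("iam")
iam.put_user_policy(
    UserName="profile-pic-client",
    PolicyName="profile-pic-access",
    PolicyDocument=json.dumps(per_bucket_policy),
)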
2-way doors and migrations
On a similar note, using 2 buckets is more flexible down the road.
1 to 2:
If you switch from 1 bucket to 2, you now have to move all clients to the new set-up. You will need to update permissions for all clients, which can require IAM policy changes for both you and the client. Then you can move your clients over by releasing a new client library during the transition period.
2 to 1:
If you switch from 2 buckets to 1 bucket, your clients will already have access to the 1 bucket. All you need to do is update the client library and move your clients onto it during the transition period.
*If you don't have a client library, then code changes are required in both cases for the clients.