Faster Upload Speeds with AWS EC2 Instance

I've got a t2.medium instance with an EBS volume and EFS in the US West (Oregon) Region.
Users (often out of California) can upload image files using a JavaScript file uploader, but no matter how fast the user's connection is, they can't seem to upload any faster than ~500 KB/s.
For example, if a user speed-tests their upload rate at 5 MB/s and then uploads a 5 MB image file, it will still take nearly 11 seconds to complete.
I get similar results when using FTP to upload files.
My initial thought was that I should change my instance to something with better network performance, but since I'm uploading directly to EFS and not to an Amazon S3 bucket or something else, I wasn't sure networking was my problem.
How can I achieve faster upload rates? Is this a limitation of my instance?

I would definitely experiment with different instance types, as the instance family and size are directly correlated with network performance. The t2 family has one of the lowest network throughputs.
Here are two resources to help you figure out what to expect for network throughput for the various instance types:
Cloudonaut EC2 Network Performance Cheat Sheet
Amazon EC2 Instance Type documentation
The t3 family is the latest generation of low-cost, burstable T instances and includes enhanced networking, with a much-improved burstable network rate of up to 5 Gbps. This may work for you if your uploads are infrequent. At a minimum, you could switch to the t3 family to improve your network performance without changing your cost much at all.
Side note: if you are using an older AMI, you may not be able to use it directly on a t3 instance, as you will need a modern OS version that supports enhanced networking (ENA).
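As a rough illustration of that switch, here is a minimal boto3 sketch; the instance ID and target type are placeholders, not values from the question. It checks whether the instance reports ENA support and, after stopping it, changes its type to a t3:

```python
# Hypothetical sketch: check ENA support and move a stopped instance to a t3 type.
# The instance ID and target type below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
instance_id = "i-0123456789abcdef0"  # placeholder

# Enhanced networking (ENA) must be supported by the AMI/OS before using t3.
attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="enaSupport")
print("ENA supported:", attr.get("EnaSupport", {}).get("Value", False))

# The instance must be stopped before its type can be changed.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "t3.medium"})
ec2.start_instances(InstanceIds=[instance_id])
```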

Related

Accessing instance storage in AWS SageMaker notebooks

I'm trying to train a model using AWS SageMaker notebooks and am disappointed with how slowly the model is training. I think my bottleneck is the IOPS of the persistent storage (EFS and EBS) my SageMaker notebooks use for the dataset.
First, I tried training on a SageMaker Studio ml.g4dn.xlarge instance, then moved everything over to a SageMaker notebook ml.g4dn.xlarge instance through Jupyter. Even though g4dn.xlarge instances come with a physically attached 125 GB SSD, I'm unable to use it because SageMaker Studio automatically creates an EFS store, and SageMaker notebook instances automatically create an EBS store. How can I store my dataset on the 125 GB SSD instead of EFS or EBS to speed up the IOPS?
There are instance types with memory optimised for large amounts of data. In your case, if the dataset is fed to the model at exactly that size (i.e. there is no upstream preprocessing to reduce the amount of data), keep in mind that the g4dn family is EBS-optimised.
The most obvious answer I can think of is to use an S3 bucket.
From "Maximum transfer speed between Amazon EC2 and Amazon S3":
Traffic between Amazon EC2 and Amazon S3 can leverage up to 100 Gbps
of bandwidth to VPC endpoints and public IPs in the same region.
Besides being very fast, it is also the best solution in terms of design for all the components of your project on AWS. Clearly it entails different costs and a different architecture, but you will get the maximum speed that the set of AWS services can offer (possibly with special configuration for even better performance).
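As a hedged sketch of getting closer to that bandwidth in practice (the bucket, key, and local path below are made up), boto3's transfer manager can use multipart, concurrent transfers to pull a dataset from S3 onto the instance's fast local storage before training:

```python
# Illustrative sketch only: bucket, key, and local path are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart, multi-threaded transfer settings; tune for your instance and dataset.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=16,                    # parallel threads per transfer
)

# Pull the dataset from S3 onto local storage before training.
s3.download_file("my-dataset-bucket", "datasets/train.tar", "/tmp/train.tar", Config=config)
```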
My advice is to follow the AWS guidelines for developing a complex project from scratch: Build, train, and deploy machine learning models.

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (and later index them in Elasticsearch). This is a one-time operation, and after processing we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after the EC2 instance is shut down. It is accessible from a single EC2 instance in our AWS region. It offers redundancy and encryption, and is easy to scale. We can use it if we split the data into chunks ourselves and provide those to the different servers we have.
EFS: It allows us to mount the FS across multiple regions and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don’t have to worry about maintaining and deploying the FS
S3: Not limited to access from EC2 but S3 is not a file system
HDFS: Extremely good at scale but only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big a concern this is, considering our servers are pretty secure.
The small-files problem in Hadoop is explained at https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/. Considering most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number, so it will not give us the performance we think it should.
My confusion is about HDFS:
I went through a lot of resources comparing "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", which leaves me unsure whether they are really substitutes for each other or complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any kind of storage could work, but as your situation is a one-time scenario you need a choice that is:
Cost optimised
Performant
Secure
I can't answer all of your questions, but for your use case I assume you will access the data from EC2 instances; if you had mentioned how these files are produced and processed, and roughly how large each file is, I could help you better.
Considerations:
EBS has provisioned or otherwise limited throughput and forces you to provision capacity and then remove the data after processing. FYI: you can set an EBS volume to be deleted when the EC2 instance is terminated, but not when it is shut down.
If you really need the fastest option and don't care about cost, EBS with generous provisioning is a good idea, as you are charged for the volume's lifetime and storage.
EFS is NAS-style storage and also needs the data to be removed after processing.
HDFS is a distributed file system and the best choice for petabyte-scale, distributed workloads, but it is not suited to a one-shot job; it needs installation and configuration.
Personally, I propose S3, as it does not limit your throughput and, using a VPC endpoint, you can achieve up to 25 Gbps. Alternatively, you can use S3 lifecycle policies to remove your data automatically based on tags or after a set number of days, or archive it if needed.
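For instance, a minimal boto3 sketch of such a lifecycle policy (the bucket name, prefix, and expiry period are assumptions for illustration) that expires objects automatically once you are done processing them:

```python
# Hypothetical example: bucket name, prefix, and expiry period are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-one-shot-ingest-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-processed-files",
                "Filter": {"Prefix": "incoming/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},  # delete objects a week after upload
            }
        ]
    },
)
```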

EC2 Instance Types with Fastest Download Speed

I'm looking for the most appropriate EC2 Instance Type to download large files at a fast rate. There are several options of Network performances, and I'm leaning towards "Up to 10 Gigabit" or "10 Gigabit". Is there a recommended Model with this networking performance options that best fit the requirement? Would it be possible to download 4~6GB files in under an hour?
Network bandwidth available to an Amazon EC2 instance is based upon the Instance Type. Basically, larger instances have more bandwidth.
Instances that show 10+ Gigabit networking only provide this bandwidth within the same Placement Group, which is within one Availability Zone. It does not apply to Internet bandwidth.
You should create a test that you can run on various instance types to determine the throughput. Preferably multi-thread such tests so that you are fully-utilizing available bandwidth.
You should also experiment with running multiple, smaller instances because they might have more aggregate bandwidth than fewer, larger instances.
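A rough, hedged throughput test along those lines might look like the following; the URL and thread count are placeholders, and the MB/s figure it prints is only a crude aggregate:

```python
# Crude multi-threaded download benchmark; URL and thread count are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.com/testfile.bin"  # replace with a real large test object
THREADS = 8

def fetch(_):
    # Download the whole object and report how many bytes arrived.
    with urllib.request.urlopen(URL) as resp:
        return len(resp.read())

start = time.time()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    total_bytes = sum(pool.map(fetch, range(THREADS)))
elapsed = time.time() - start

# For scale: 6 GB in under an hour only needs roughly 13-14 Mbit/s sustained,
# so most instance types should cope if the remote end and the path keep up.
print(f"{total_bytes / elapsed / 1e6:.1f} MB/s aggregate over {THREADS} threads")
```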
There are a number of factors outside of AWS's control which could potentially mean that you don't get the files in the amount of time you need them in. Some of these include:
Server on the other side has poor upload speed
Bad routing
Internet backbone latency issues (can happen)
Attempting to download from geographically far distances
Existing network traffic to the instance
The instance availability zone is down
Number of security group and NACL rules (increases processing time of individual packets)
Assuming none of these are issues, you won't have trouble getting large files downloaded. For getting data to AWS at a decent speed from an on-site location, you can also look into Direct Connect, which helps on the routing front. When you get into the petabyte-plus level of data transfer, there are also Snowball and Snowmobile, which physically ship the data to AWS for loading into servers.

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files every day (an average of 150 uploads and 250 downloads per day in total).
I'm still trying to read up on the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas an EBS volume is a block device attached to a single instance. S3 will be more scalable from a data-warehousing perspective, and in terms of access from multiple concurrent instances (should you do that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. Though, you should design your system so that it doesn't really matter, and you can easily teardown/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application to newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc. So, you should be able to failover to another one on demand.
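One common way to keep the instances disposable in that pattern is to have users upload and download directly against S3 via presigned URLs, so the EC2 instances stay stateless. A minimal sketch, assuming placeholder bucket and key names:

```python
# Sketch: generate presigned URLs so clients talk to S3 directly,
# keeping the EC2 instances stateless. Names below are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-audio-files"  # placeholder bucket

# URL a user can PUT an upload to, valid for 15 minutes.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": BUCKET, "Key": "uploads/track-001.wav"},
    ExpiresIn=900,
)

# URL a user can GET a download from.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": "uploads/track-001.wav"},
    ExpiresIn=900,
)

print(upload_url)
print(download_url)
```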

Internet speed shared by multiple EC2 instances

I am new to AWS. My task is to download large files from the web and save them in S3. I am using an m4.xlarge instance to download and save them, and I get a download speed of ~11 MB/s.
But when I launch multiple m4.xlarge instances and try to download files in parallel, the download speed appears to be shared among the instances. For example, I am getting ~5.5 MB/s each with 2 instances.
I thought, instances are independent of each other. Is there any configuration which I need to change, to get ~11MB/s in all the instances in parallel? Is there anything I am missing?
The network bandwidth allocated to an Amazon EC2 instance depends upon its instance type. Larger instances have higher bandwidth than smaller instances.
However, the network performance of one Amazon EC2 instance will never impact the performance of another instance. This is intentional so that there will not be a noisy neighbour problem between instances.
However, if different instances are downloading content from the same website, performance may be impacted due to limited bandwidth to/from the remote site. For example, the remote server might only serve 3 concurrent sessions. This might be what you are experiencing.
To take full advantage of bandwidth available on EC2 instances, upload/download files in parallel so that the network bandwidth is fully utilised.
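As a minimal sketch of that idea (the URLs, bucket, and keys below are placeholders), each worker downloads one file from the web and saves it to S3, so transfers run in parallel from a single instance and its bandwidth is used fully:

```python
# Illustrative only: URLs and bucket are placeholders. Each worker downloads
# one file and uploads it to S3, so transfers run concurrently.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-download-bucket"  # placeholder

URLS = [
    "https://example.com/file1.bin",
    "https://example.com/file2.bin",
    "https://example.com/file3.bin",
]

def fetch_and_store(url):
    filename = url.rsplit("/", 1)[-1]
    local_path = "/tmp/" + filename
    urllib.request.urlretrieve(url, local_path)   # download from the web
    s3.upload_file(local_path, BUCKET, filename)  # save in S3
    return url

with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    for done in pool.map(fetch_and_store, URLS):
        print("done:", done)
```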