Can somebody explain what "Bandwidth" means in AWS Data Transfer (Billing)?

I have been trying to see how data transfer costs have scaled over the last year. I am confused about what the line items shown below mean: which one is transfer out to the internet? What is "Download Bandwidth consumed"? What is the difference between the two?
[screenshot: "Bandwidth" line items in the bill]
I have tried matching "Download Bandwidth consumed" to the Region-Out to Internet usage in Cost Explorer (CE), to no avail.

There are mainly two types of transfer line items here:
Data transfer
Bandwidth
Under "Data transfer" fall all the transfers you do within AWS.
All the transfers you do from AWS to the public internet (outside AWS) are counted under "Bandwidth".
Both are measured in GB per month.
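If you want to reconcile these line items yourself, one way is to group usage by usage type in the Cost Explorer API; internet egress typically shows up in usage types ending in "DataTransfer-Out-Bytes". Below is a minimal sketch using boto3; the date range is a placeholder and the exact usage-type names vary by service and region, so treat the filter as an assumption to adapt to your own bill.

```python
# Sketch: group one month's usage by USAGE_TYPE to see which line items are
# internet egress ("...DataTransfer-Out-Bytes") vs. intra-AWS transfer.
# Dates and the filter below are placeholders -- adjust for your account.
import boto3

ce = boto3.client("ce")  # Cost Explorer; assumes credentials/region are configured

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UsageQuantity", "UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    usage_type = group["Keys"][0]
    # Only print the transfer-related usage types
    if "DataTransfer" in usage_type or "Bytes" in usage_type:
        qty = group["Metrics"]["UsageQuantity"]["Amount"]    # GB for transfer usage types
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"{usage_type}: {qty} GB, ${cost}")
```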

Related

Fastest way to Import Data from AWS Redshift to BI Tool

I have a table in AWS Redshift (ra3.xlplus, 2 nodes) with 15 million rows. I am retrieving the data on-premise at the office and trying to load it into memory in a BI tool. It takes a long time (12 minutes) to import that data over a JDBC connection; I also tried an ODBC connection and got the same result. I tried spinning up an EC2 instance with a 25 gigabit connection on AWS, but got the same results.
For comparison, loading the same data from a CSV file takes about 90 seconds.
Are there any ways to speed up this data transfer?
There are ways to improve this, but the true limiter needs to be identified. The likely bottleneck is the network bandwidth between AWS and your on-prem system. Since you are bringing a large amount of data down from the cloud, you will want an efficient process for this transport.
JDBC and ODBC are not network efficient, as you are seeing. The first thing that will help in moving the data is compression. The second is parallel transfer, since there is a fair amount of handshaking in the TCP protocol and there is more usable bandwidth than one connection can consume. The way I have done this in the past is to UNLOAD the data, compressed, to S3, then copy the files from S3 to the local machine in parallel, piping them through a decompressor as they are saved. Lastly, these files are loaded into the BI tool.
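A rough sketch of that pipeline: run an UNLOAD in Redshift (the table, bucket, and IAM role below are placeholders), then pull the compressed parts down in parallel. Only the download/decompress half is shown in Python with boto3; assume you issue the UNLOAD through your usual SQL client.

```python
# Sketch: after running something like the following in Redshift
# (table name, bucket, and role ARN are placeholders):
#
#   UNLOAD ('SELECT * FROM my_big_table')
#   TO 's3://my-export-bucket/export/part_'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
#   GZIP PARALLEL ON;
#
# ...pull the compressed parts down in parallel and decompress locally.
import gzip
import pathlib
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-export-bucket"       # placeholder
PREFIX = "export/"                # placeholder
OUT_DIR = pathlib.Path("export")  # local destination
OUT_DIR.mkdir(exist_ok=True)

def fetch(key: str) -> None:
    """Download one gzipped part and write it out decompressed."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    out = OUT_DIR / pathlib.Path(key).name.replace(".gz", "")
    out.write_bytes(gzip.decompress(body))

keys = [
    obj["Key"]
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
]

# Several connections in parallel to use more of the available bandwidth.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fetch, keys))
```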
Clearly setting this up takes some time, so you want to be sure the process will be used enough to justify the effort. Another way to go is to bring your BI tool closer to Redshift by running it on an EC2 instance. The shorter network distance and higher bandwidth should bring the transfer time down significantly. The downside is that your BI tool then lives in the cloud and not on-prem.

Is there a way to expedite the uploading of files on AWS S3?

I am trying to upload data to AWS S3 and read it into a Jupyter Notebook. The total file size is only 42 MB, yet the upload to S3 is very slow: after 5 hours, only 22% has completed and the estimated time to finish is 12 hours. Are there other ways to upload to S3 effectively, or another platform that offers higher speeds?
The upload bandwidth is determined by many factors:
Your local internet connection
Any VPN the traffic goes through
The public internet
S3-Bandwidth
Typically the last two aren't your problem (especially for 42MB), but the first two may be.
If you are uploading to an S3 bucket in a region that is far away from you, you can take a look at S3 Transfer Acceleration, which lets you send the data to the nearest CloudFront edge location, from where it traverses the global AWS backbone to the destination region. Given the size of the data, though, I doubt this will help much; the problem is most likely one of the first two factors.
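If the distance to the bucket's region does turn out to matter, a minimal sketch of uploading through the acceleration endpoint with boto3 looks like the following (the bucket name and file path are placeholders, and the bucket must have Transfer Acceleration enabled beforehand):

```python
# Sketch: upload via the S3 transfer-acceleration endpoint with multipart
# settings. Bucket name and file path are placeholders; this will not help
# if your local uplink or VPN is the bottleneck.
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    max_concurrency=8,                    # parallel part uploads
)

s3.upload_file("local_data.csv", "my-bucket", "data/local_data.csv", Config=config)
```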

S3 docs: "one concurrent request per 85–90 MB/s of desired network throughput" -- Why?

On the page linked below, I found the following statement:
Make one concurrent request for each 85–90 MB/s of desired network throughput. To saturate a 10 Gb/s network interface card (NIC), you might use about 15 concurrent requests over separate connections. You can scale up the concurrent requests over more connections to saturate faster NICs, such as 25 Gb/s or 100 Gb/s NICs.
Performance Design Patterns for Amazon S3 - Horizontal Scaling and Request Parallelization for High Throughput
What is the origin of these numbers? I can't find any other documentation that justifies them. My guess is that this guidance speaks more to the limitations of the NIC on the EC2 instance than to S3 itself. Still, is there any other source that explains where these numbers came from?
To be clear, this is not a question about how to optimize S3 throughput -- I'm aware of the alternatives. This is a question about the AWS S3 documentation itself.
The only people who could answer this definitively are those who are working on S3 internals. And they're almost certainly covered by NDA. So what I'm about to write is complete speculation.
We know that S3 is distributed and redundant: each object is stored on multiple physical drives, across multiple availability zones.
We can infer, from the fact that S3 is available as a networked service, that there is some form of network interface between the S3 volume and the outside world. Obvious, yes, but if that network interface is limited to 1Gbit/sec, it would be able to achieve approximately 85-90 Mbyte/sec sustained throughput.
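As a back-of-the-envelope check (my own arithmetic, not anything from the docs): a 1 Gbit/s interface tops out at 125 MB/s raw, so 85-90 MB/s sustained implies roughly 30% overhead, and dividing a 10 Gb/s NIC across the ~15 connections the docs suggest lands in the same range.

```python
# Back-of-the-envelope arithmetic for the 85-90 MB/s figure (speculative).
GBIT = 1_000_000_000  # bits

# Raw ceiling of a 1 Gbit/s interface in MB/s.
one_gbit_mb_s = 1 * GBIT / 8 / 1_000_000
print(one_gbit_mb_s)  # 125.0 -- so 85-90 MB/s sustained implies ~30% overhead

# The docs' other framing: ~15 connections to saturate a 10 Gb/s NIC.
per_connection = 10 * GBIT / 8 / 15 / 1_000_000
print(round(per_connection, 1))  # ~83.3 MB/s per connection, close to 85-90
```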
It's also important to remember that AWS uses a software-defined network: so while the S3 service may in fact have a network interface that supports 10 Gbit/sec, AWS may restrict the bandwidth that is available to any given connection.
Far more interesting to me is this quote, from the same link:
we suggest making concurrent requests for byte ranges of an object at the granularity of 8–16 MB
This implies that redundancy is managed at a sub-object level, so that a large object is split into multiple pieces of maybe 64 MB, and those pieces are individually distributed. Which is how HDFS manages large files, so not a giant leap.
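For what it's worth, a minimal sketch of that byte-range pattern with boto3 looks like this (the bucket, key, and 16 MB chunk size are placeholders; it just issues ranged GETs in parallel and reassembles them in order):

```python
# Sketch: parallel ranged GETs against one object, as the S3 performance
# guide suggests. Bucket/key names and chunk size are placeholders.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big/object.bin"   # placeholders
CHUNK = 16 * 1024 * 1024                      # 16 MB, per the docs' 8-16 MB hint

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + CHUNK, size) - 1) for start in range(0, size, CHUNK)]

def fetch(byte_range):
    start, end = byte_range
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# One request per ~85-90 MB/s of desired throughput, per the guidance above.
with ThreadPoolExecutor(max_workers=15) as pool:
    data = b"".join(pool.map(fetch, ranges))
```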
As for your supposition that it's a limit of EC2 rather than S3, I think the suggestion to use multiple connections rules that out. Although it's possible that a single connection is limited to 1 Gbit/sec by EC2, I would expect the S3 designers to be more concerned about load on their system. You can always test that out by opening a single connection between two EC2 instances with high-bandwidth networking and seeing if it's throttled.

Amazon equivalent of Google Storage Transfer Service

I have a bucket in GCP that has millions of 3 KB files, and I want to copy them over to an S3 bucket. I know Google has a super fast transfer service, however I am not able to use it to push data to S3.
Due to the amount of objects, running a simple gsutil -m rsync gs://mybucket s3://mybucket might not do the job because it will take at least a week to transfer everything.
Is there a faster solution than this?
On the AWS side, you may want to see if S3 Transfer Acceleration would help. There are specific requirements for enabling it, including restrictions on the bucket's name. You would also want to make sure the bucket is in a location close to where the data is currently stored; that might help speed things up a bit.
We hit the same problem pushing small files to S3; compressing them and storing the archive back runs into the same thing. It comes down to the limits set on your account.
As mentioned in the documentation, you need to open a support ticket to increase your limits before you send a burst of requests.
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
It is NOT the size of each file or the total size of all objects that matters here; it is the number of files you have that is the problem.
Hope it helps.
Personally, I think the main issue you're going to have is not so much the ingress rate to Amazon's S3 service but rather the network egress rate from Google's network. Even if you enable the S3 Transfer Acceleration service, you'll still be restricted by the egress speed of Google's network.
There are other services that you can set up which might assist in speeding up the process. Perhaps look into one of the Interconnect solutions which allow you to set up fast links between networks. The easiest solution to set up is the Cloud VPN solution which could allow you to set up a fast uplink between an AWS and Google Network (1.5-3 Gbps for each tunnel).
Otherwise, given your data requirements, transferring 3,000 GB isn't a terrible amount of data, and setting up a cloud server to move it over the space of a week isn't too bad. You might find that by the time you set up another solution, it would have been easier to just spin up a machine and let it run for a week.
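If you do go the "spin up a machine and let it run" route, a minimal sketch of a parallel object-by-object copy is below. The bucket names are placeholders; it assumes both google-cloud-storage and boto3 are installed and credentialed, and for millions of objects you would likely shard the listing across several workers or machines.

```python
# Sketch: copy objects from a GCS bucket to an S3 bucket with a thread pool.
# Bucket names are placeholders; credentials for both clouds are assumed to
# be configured in the environment.
from concurrent.futures import ThreadPoolExecutor

import boto3
from google.cloud import storage

gcs = storage.Client()
s3 = boto3.client("s3")

SRC_BUCKET = "mybucket-gcs"   # placeholder
DST_BUCKET = "mybucket-s3"    # placeholder

def copy_blob(blob):
    """Download one small object from GCS and put it to S3 under the same key."""
    s3.put_object(Bucket=DST_BUCKET, Key=blob.name, Body=blob.download_as_bytes())

blobs = gcs.bucket(SRC_BUCKET).list_blobs()

# Many threads help here because each 3 KB object is dominated by request
# latency, not bandwidth.
with ThreadPoolExecutor(max_workers=64) as pool:
    for _ in pool.map(copy_blob, blobs):
        pass
```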

How to determine time needed to upload data to Amazon EC2?

We need to populate a database that sits on AWS { EC2 (Cluster Compute Eight Extra Large) + 1 TB EBS }. Given that we have close to 700 GB of data locally, how can I find out the (theoretical) time it would take to upload all of it? I could not find any information on data upload/download speeds for EC2.
Since this will depend strongly on the networking between your site and Amazon's data centre...
Test it with a few GB and extrapolate.
Be aware of AWS Import/Export and consider the option of simply couriering Amazon a portable hard drive. (Old saying: "Never underestimate the bandwidth of a station wagon full of tape".) In fact, I note the page includes a "When to use..." section that gives some indication of transfer times vs. connection bandwidth.
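For the "test and extrapolate" step, the arithmetic itself is trivial; the 100 Mbit/s in the sketch below is just an example uplink figure, not a measured value:

```python
# Back-of-the-envelope upload-time estimate. The uplink speed is an example
# value -- substitute the sustained throughput you measure with a test upload.
data_gb = 700                 # data to upload
uplink_mbit_s = 100           # example sustained uplink, in Mbit/s

seconds = data_gb * 8 * 1000 / uplink_mbit_s
print(f"~{seconds / 3600:.1f} hours")   # 700 GB at 100 Mbit/s -> ~15.6 hours
```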