Suppose you have to implement a video streaming platform from scratch. Regardless of where you store metadata, your not-very-popular video files will be stored on a file system, or in an object store if you want to use the cloud. If you choose AWS, then to boost S3 read performance you can make multiple read requests against the same video file; see the Performance Guidelines for Amazon S3:
You can use concurrent connections to Amazon S3 to fetch different
byte ranges from within the same object. This helps you achieve higher
aggregate throughput versus a single whole-object request.
At the same time, as you know, disk I/O is sequential on both HDD and SSD drives, so to boost read performance (neglecting the RAM needed to decompress/decrypt each video chunk) you have to read from multiple disks (YouTube uses RAID).
Why would S3 have better performance on concurrent byte-range requests against the same file? Isn't the file stored on a single disk? I suppose S3 may have some replication factor and thus store the file on multiple disks, does it?
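To make the guideline concrete, here is a minimal sketch of a parallel ranged download. The `fetch_range` callable is a stand-in so the sketch runs without AWS credentials; with real S3 you would replace it with a boto3 ranged GET such as `s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")["Body"].read()` (bucket/key names would be yours).

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(size, part_size):
    """Split an object of `size` bytes into (start, end) pairs, end inclusive,
    matching the HTTP Range header format S3 expects (bytes=start-end)."""
    return [(s, min(s + part_size, size) - 1) for s in range(0, size, part_size)]

def parallel_get(fetch_range, size, part_size=8 * 1024 * 1024):
    """Fetch all ranges concurrently and reassemble them in order.
    `fetch_range(start, end)` stands in for an S3 ranged GET."""
    ranges = byte_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = pool.map(lambda r: fetch_range(*r), ranges)
    return b"".join(parts)

# Simulated "object" so the sketch runs locally:
blob = bytes(range(256)) * 1000
result = parallel_get(lambda s, e: blob[s:e + 1], len(blob), part_size=4096)
assert result == blob
```

Each range request can be served independently on the S3 side, which is why aggregate throughput goes up even though a single connection is bandwidth-limited.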
Related
I keep coming across these terms: Block Level Storage and File Level Storage. Can someone explain why one is better than the other?
Perhaps with examples and some of the algorithmic thinking behind them it would be easier to understand.
For example, AWS articles say that AWS EBS can be used for databases, but why is it better than file-level storage?
I like to think of it like this:
Amazon Elastic Block Store (Amazon EBS) is block storage. It is just like a USB disk that you plug into your computer. Information is stored in specific blocks on the disk and it is the job of the operating system to keep track of which blocks are used by each file. That's why disk formats vary between Windows and Linux.
Amazon Elastic File System (Amazon EFS) is a filesystem that is network-attached storage. It's just like the H: drive (or whatever) that companies provide their employees to store data on a fileserver. You mount the filesystem on your computer like a drive, but your computer sends files to the fileserver rather than managing the block allocation itself.
Amazon Simple Storage Service (Amazon S3) is object storage. You give it a file and it stores it as an object. You ask for the object and it gives it back. Amazon S3 is accessed via an API. It is not mounted as a disk. (There are some utilities that can mount S3 as a disk, but they actually just send API calls to the back-end and make it behave like a disk.)
When it comes to modifying files, they behave differently:
Files on block storage (like a USB disk) can be modified by the operating system. For example, changing one byte or adding data to the end of the file.
Files on a filesystem (like the H: drive) can be modified by making a request to the fileserver, much like block storage.
Files in object storage (like S3) are immutable and cannot be modified. You can upload another file with the same name, which will replace the original file, but you cannot modify a file. (Uploaded files are called objects.)
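The difference in those modification bullets fits in a few lines of Python: on block/filesystem storage the OS can patch a single byte in place, whereas S3 has no equivalent call, so you would re-upload the whole object (the file name below is purely illustrative).

```python
import os
import tempfile

# Block/filesystem storage: the OS lets you modify one byte in place.
path = os.path.join(tempfile.mkdtemp(), "movie.bin")
with open(path, "wb") as f:
    f.write(b"AAAAAAAA")

with open(path, "r+b") as f:   # open for in-place update
    f.seek(3)                  # jump to byte offset 3
    f.write(b"X")              # overwrite a single byte, nothing else moves

with open(path, "rb") as f:
    print(f.read())            # b'AAAXAAAA'

# Object storage has no seek+write: to change one byte of an S3 object you
# would re-upload the entire object, e.g. (hypothetical bucket/key names):
#   s3.put_object(Bucket="my-bucket", Key="movie.bin", Body=new_full_content)
```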
Amazon S3 has other unique attributes, such as making objects available via the Internet, offering multiple storage classes for low-cost backups, and triggering events when objects are created/deleted. It's a building block for applications as opposed to a simple disk for storing data. Plus, there is no limit to the amount of data you can store.
Databases
Databases like to store their data in their own format that makes the data fast to access. Traditional databases are built to run on normal servers and they want fast access, so they store their data on directly-attached disks, which are block storage. Amazon RDS uses Amazon EBS for block storage.
A network-attached filesystem would slow the speed of disk access for a database, thereby reducing performance. However, sometimes this trade-off is worthwhile because it is easier to manage network-attached storage (SANs) than to keep adding disks to each individual server.
Some modern 'databases' (if you can use that term) like Presto can access data directly in Amazon S3 without loading the data into the database. Thus, the database processing layer is separated from the data layer. This makes it easier to access historical archived data since it doesn't need to be imported into the database.
I need an HTTP web service that serves files (1-10 GiB) produced by merging some smaller files in an S3 bucket. The logic is pretty easy to implement, but I need very high scalability, so I would prefer to run it in the cloud. Which Amazon service is most suitable for this particular case? Should I use AWS Lambda for it?
Unfortunately, you can't achieve that with Lambda, since it only offers 512 MB of storage and you can't mount volumes. You will need EBS or EFS to download and process the data. Since you need scalability, I would suggest Fargate + EFS. Plain EC2 instances would do just fine, but you might lose some money because it can be tricky to provision the correct amount for your needs, and most of the time it ends up overprovisioned.
If you don't need to process the file in real time, you can use a single instance and use SQS to queue the jobs and save some money. In that scenario you could use lambda to trigger the jobs, and even start/kill the instance when it is not in use.
Merging files
It is possible to concatenate Amazon S3 files by using the UploadPartCopy operation:
Uploads a part by copying data from an existing object as data source.
However, the minimum allowable part size for a multipart upload is 5 MB.
Thus, if each of your parts is at least 5 MB, then this would be a way to concatenate files without downloading and re-uploading.
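As a sketch of that call sequence, here is what the server-side concatenation looks like. The function is written against any object exposing the three boto3 S3 client methods it uses, and it assumes every source object (except possibly the last) meets the 5 MB minimum part size:

```python
def concatenate_objects(s3, bucket, source_keys, dest_key):
    """Concatenate existing S3 objects into `dest_key` server-side, via a
    multipart upload whose parts are copied with UploadPartCopy. Nothing is
    downloaded to the caller; S3 copies the bytes internally."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
    parts = []
    for n, key in enumerate(source_keys, start=1):
        resp = s3.upload_part_copy(
            Bucket=bucket, Key=dest_key,
            UploadId=upload["UploadId"], PartNumber=n,
            CopySource={"Bucket": bucket, "Key": key},
        )
        parts.append({"ETag": resp["CopyPartResult"]["ETag"], "PartNumber": n})
    return s3.complete_multipart_upload(
        Bucket=bucket, Key=dest_key, UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```

With boto3 you would pass `boto3.client("s3")` as `s3`; the bucket and key names are whatever your application uses.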
Streaming files
Alternatively, rather than creating new objects in Amazon S3, your endpoint could simply read each file in turn and stream the contents back to the requester. This could be done via API Gateway and AWS Lambda. Your AWS Lambda code would read each object from S3 and keep returning the contents until the last object has been processed.
First, let me clarify your goal: you want to have an endpoint, say https://my.example.com/retrieve that reads some set of files from S3 and combines them (say, as a ZIP)?
If yes, does whatever language/framework that you're using support chunked encoding for responses?
If yes, then it's certainly possible to do this without storing anything on disk: you read from one stream (the file coming from S3) and write to another (the response). I'm guessing you knew that already based on your comments to other answers.
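That read-one-stream/write-another idea can be sketched as a generator that never holds more than one chunk in memory. The in-memory bodies below are stand-ins so it runs locally; with boto3, `open_body` would be something like `lambda key: s3.get_object(Bucket=bucket, Key=key)["Body"]`.

```python
import io

def stream_concat(open_body, keys, chunk_size=1024 * 1024):
    """Yield the contents of several objects back-to-back, one chunk at a
    time, so a 10 GiB response never needs 10 GiB of RAM or disk.
    `open_body(key)` must return a file-like object with .read()."""
    for key in keys:
        body = open_body(key)
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Simulated objects so the sketch runs without AWS:
objects = {"part1": b"hello ", "part2": b"world"}
merged = b"".join(stream_concat(lambda k: io.BytesIO(objects[k]),
                                ["part1", "part2"], chunk_size=4))
assert merged == b"hello world"
```

A web framework that supports chunked responses can iterate this generator directly as the response body.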
However, based on your requirement of 1-10 GB of output, Lambda won't work because it has a limit of 6 MB for response payloads (and iirc that's after Base64 encoding).
So in the AWS world, that leaves you with an always-running server, either EC2 or ECS/EKS.
Unless you're doing some additional transformation along the way, this isn't going to require a lot of CPU, but if you expect high traffic it will require a lot of network bandwidth. Which to me says that you want to have a relatively large number of smallish compute units. Keep a baseline number of them always running, and scale based on network bandwidth.
Unfortunately, smallish EC2 instances in general have lower bandwidth, although the a1 family seems to be an exception to this. And Fargate doesn't publish bandwidth specs.
That said, I'd probably run on ECS with Fargate due to its simpler deployment model.
Beware: your biggest cost with this architecture will almost certainly be data transfer. And if you use a NAT, not only will you be paying for its data transfer, you'll also limit your bandwidth. I would at least consider running in a public subnet (with assigned public IPs).
In the AWS developer docs for SageMaker, they recommend using PIPE mode to stream large datasets directly from S3 to the model training containers (since it's faster, uses less disk storage, reduces training time, etc.).
However, they don't include information on whether this data streaming transfer is charged for (they only include data transfer pricing for their model building & deployment stages, not training).
So, I wanted to ask if anyone knew whether this data transfer in PIPE mode is charged for, since if it is, I don't get how this would be recommended for large datasets, since streaming a few epochs for each model iteration can get prohibitively expensive for large datasets (my dataset, for example, is 6.3TB on S3).
Thank you!
You are charged for the S3 GET calls that you do similarly to what you would be charged if you used the FILE option of the training. However, these charges are usually marginal compared to the alternatives.
When you are using the FILE mode, you need to pay for the local EBS on the instances, and for the extra time that your instances are up and only copying the data from S3. If you are running multiple epochs, you will not benefit much from the PIPE mode, however, when you have so much data (6.3 TB), you don't really need to run multiple epochs.
The best usage of PIPE mode is when you can use a single pass over the data. In the era of big data, this is a better model of operation, as you can't retrain your models often. In SageMaker, you can point to your "old" model in the "model" channel, and your "new" data in the "train" channel and benefit from the PIPE mode to the maximum.
I just realized that on S3's official pricing page, it says the following under the Data transfer section:
Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free.
And since my S3 bucket and my Sagemaker instances will be in the same AWS region, the data transfer costs should be free.
My client has a service which stores a lot of files, like video or sound files. The service works well; however, it looks like long-term file storage is quite a challenge, and we would like to use AWS for storing these files.
The problem is the following: the client wants to use AWS Kinesis to transfer every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, and we get more every day. And every file is relatively big.
We would also like to save some details of the files, possibly into DynamoDB; we could use Lambda functions for that.
The most important thing is that we need a reliable data transfer option.
Kinesis would not be the right tool to upload files, unless they were all very small, and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before
Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/
Use S3 with multipart upload using one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
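The "upload parts independently, retransmit only the failed part" idea can be sketched like this. The `upload_part` callable is a stand-in for the real S3 call (with boto3, `s3.upload_part(..., PartNumber=n, Body=chunk)` inside `create_multipart_upload`/`complete_multipart_upload`, or simply `s3.upload_file` with a `TransferConfig`, which does all of this for you):

```python
def upload_in_parts(upload_part, data, part_size, retries=3):
    """Split `data` into parts and upload each independently, retrying a
    failed part without re-sending the parts that already succeeded.
    Returns the per-part receipts (ETags in the real S3 flow), which would
    then be passed to complete_multipart_upload."""
    etags = []
    for n, start in enumerate(range(0, len(data), part_size), start=1):
        chunk = data[start:start + part_size]
        for attempt in range(retries):
            try:
                etags.append(upload_part(n, chunk))
                break                      # this part is done; move on
            except OSError:
                if attempt == retries - 1:
                    raise                  # give up only on this part
    return etags
```

The key property is that a network blip on part 7 of a 4 GB video costs you one part's worth of retransmission, not the whole file.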
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
Kinesis launched a new service "Kinesis Video Streams" - https://aws.amazon.com/kinesis/video-streams/ which may be helpful to move large amount of data.
I have a number of large (100GB-400GB) files stored on various EBS volumes in AWS. I need to have local copies of these files for offline use. I am wary of attempting to scp such files down from AWS, considering their size. I've considered cutting the files up into smaller pieces and reassembling them once they all successfully arrive. But I wonder if there is a better way. Any thoughts?
There are multiple ways, here are some:
Copy your files to S3 and download them from there. S3 has much better backend support for downloading files (it's handled by Amazon).
Use rsync instead of scp. rsync is a bit more reliable than scp, and with the --partial flag (bundled into -P along with --progress) it can resume interrupted downloads:
rsync -azvP remote-ec2-machine:/dir/iwant/to/copy /dir/where/iwant/to/put/the/files
Create a private torrent for your files. If you're using Linux, mktorrent is a good utility for this: http://mktorrent.sourceforge.net/
Here is one more option you can consider if you are wanting to transfer large amounts of data:
AWS Import/Export is a service that accelerates transferring data into and out of AWS using physical storage appliances, bypassing the Internet. AWS Import/Export Disk was originally the only AWS service for data transfer by mail. Disk supports transferring data directly onto and off of storage devices you own, using Amazon's high-speed internal network.
Basically, from what I understand, you send Amazon your HDD, they copy the data onto it for you, and they send it back.
As far as I know this is only available in the USA, but it might have been expanded to other regions.