Using AWS Kinesis for large file uploads

My client has a service which stores a lot of files, like video or sound files. The service works well, however it looks like long-term file storage is quite a challenge, and we would like to use AWS for storing these files.
The problem is the following: the client wants to use AWS Kinesis for transferring every file from our servers to AWS. Is this possible? Can we transfer files using that service? There are a lot of video files, we get more every day, and every file is relatively big.
We would also like to save some details of the files, possibly in DynamoDB; we could use Lambda functions for that.
The most important thing is that we need a reliable data transfer option.

Kinesis would not be the right tool to upload files, unless they were all very small - and most videos would almost certainly be over the 1 MB record size limit:
The maximum size of a data blob (the data payload before Base64-encoding) within one record is 1 megabyte (MB).
https://aws.amazon.com/kinesis/streams/faqs/
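Just to illustrate the constraint, a minimal boto3 sketch (the stream and file names are hypothetical): a whole video cannot go into a single Kinesis record, because anything over 1 MB is rejected by the service.

import boto3

kinesis = boto3.client("kinesis")

with open("video.mp4", "rb") as f:  # a typical video is far larger than 1 MB
    payload = f.read()

if len(payload) > 1024 * 1024:
    # Kinesis Data Streams rejects records larger than 1 MB, so the file
    # would have to be chunked and reassembled yourself - not what Kinesis is for.
    raise ValueError("payload exceeds the 1 MB Kinesis record limit")

kinesis.put_record(StreamName="my-stream", Data=payload, PartitionKey="video.mp4")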

Use S3 with multipart upload using one of the SDKs. Objects you won't be accessing for 90+ days can be moved to Glacier.
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 4302-4306). Amazon Web Services, Inc.. Kindle Edition.
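As a rough sketch of what this looks like with the Python SDK (boto3), with hypothetical bucket and file names: upload_file switches to multipart automatically once a file crosses the configured threshold, so failed parts can be retried without re-sending the whole file.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart for anything over 100 MB, sending 16 MB parts with up to 8 parallel workers.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=8,
)

# Bucket and key are hypothetical placeholders.
s3.upload_file("video.mp4", "my-media-bucket", "videos/video.mp4", Config=config)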
To further optimize file upload speed, use transfer acceleration:
Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
Amazon Web Services. Amazon Simple Storage Service (S3) Developer Guide (Kindle Locations 2060-2062). Amazon Web Services, Inc.. Kindle Edition.
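A hedged sketch of enabling this with boto3 (bucket name hypothetical): acceleration is switched on once per bucket, and the client is then pointed at the accelerate endpoint.

import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket (name is hypothetical).
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-media-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Then route uploads through the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("video.mp4", "my-media-bucket", "videos/video.mp4")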

Kinesis has launched a new service, "Kinesis Video Streams" (https://aws.amazon.com/kinesis/video-streams/), which may be helpful for moving large amounts of data.

Related

Is it possible to upload more than one file to Amazon S3 in a single request?

We are fetching binary blobs (PDF, JPG) from SQL Server and adding the objects to Amazon S3 using AWSSDK.S3 (.NET) v3.7.2.2.
Currently the process adds the binary objects to Amazon S3 sequentially (one by one).
Is there any way/API to add more than one object to Amazon S3 in a single request, as this could improve performance?
While adding the binary objects we have to pass metadata (binary object properties like width, height, extension, etc.) as well.
It is not possible to upload/download multiple objects in one request.
However, Amazon S3 is highly scalable, so you can send multiple requests in parallel. This will also make better use of your available bandwidth, since each individual transfer carries its own protocol overhead.
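The question uses the .NET SDK, so the exact code will differ, but as an illustrative sketch in Python with boto3 (the bucket, keys and the items list are hypothetical), the pattern is a pool of workers where each object is still its own PUT request, carrying its own metadata.

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
items = []  # in practice, populated with the blobs and properties fetched from SQL Server

def upload_blob(item):
    # One PUT per object; the object's properties travel as user metadata on that request.
    s3.put_object(
        Bucket="my-bucket",
        Key=item["key"],
        Body=item["data"],
        Metadata={
            "width": str(item["width"]),
            "height": str(item["height"]),
            "extension": item["extension"],
        },
    )

# Issue many uploads in parallel rather than one by one.
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload_blob, items))

In the .NET SDK the same shape is typically a batch of PutObjectAsync calls awaited together (for example with Task.WhenAll).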

Designing Cloud Functions to stream downloads to Google Storage

I have a workflow where I want to download many large (500 MB - 1.5 GB) files each day from an external server and upload the file to GCS. Each file has its own endpoint. Due to restrictions on the external server that I download each file from, each download can take a few minutes. I was thinking of using a Cloud Function where I send the URL of each file, and the function downloads the file locally, then uploads it to GCS. However, to do so, I would need one of the large Cloud Function instances (2 GB). These can get quite expensive -- is there a way to use a lower memory instance and stream the result to GCS directly?
If you are transferring data from other Cloud Service Providers, you can use Storage Transfer Service, which will only cost you the network charges, as mentioned in this document.
If you are transferring the data from your on-premises server, then you can use Transfer service for on-premises data and schedule the transfer as mentioned in this document, where you can set the interval to weeks, days or hours. This will cost you $0.0125 per GB transferred successfully to the destination, in addition to the networking cost. Since you are more concerned about the download time when using a Cloud Function (which in turn increases the cost), you may consider this option, as it is billed per GB of data rather than per unit of time.
You can use Streaming Transfers as mentioned in this document, which allow you to transfer data to Cloud Storage without saving it to a file first. Streaming uploads to Cloud Storage are helpful when you don't know the final size of the data at the start of the upload.
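As a rough sketch of that streaming approach with the Python client libraries (the URL and bucket/object names are hypothetical), the HTTP response body is piped straight into Cloud Storage, so the function never writes the file to local disk and memory stays bounded by the chunk size.

import requests
from google.cloud import storage

def stream_to_gcs(url, bucket_name, blob_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.chunk_size = 16 * 1024 * 1024  # keep memory bounded to one 16 MiB chunk at a time

    # Stream the HTTP response body straight into the Cloud Storage object.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        blob.upload_from_file(resp.raw)

stream_to_gcs("https://example.com/big-file.bin", "my-gcs-bucket", "downloads/big-file.bin")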

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services are a better fit when the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the total size is 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted you should be fine (if you're worried, you can encrypt your files beforehand, and they will then also be encrypted while stored, if this is a backup of something). To get to the point: you can use API tools for the transfer, or just file-explorer-type tools which also have connectivity to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage/transfer all depends on how frequently you need the data; if it is just a backup or just-in-case, archiving to Glacier is much cheaper.
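To make the encryption at rest explicit rather than relying on bucket defaults, here is a minimal boto3 sketch (bucket and file names are hypothetical) that requests SSE-S3 on upload; the transfer itself already runs over TLS.

import boto3

s3 = boto3.client("s3")

# Request server-side encryption (SSE-S3, AES-256) for the stored object.
s3.upload_file(
    "backup.tar.gz",
    "my-backup-bucket",
    "onprem/backup.tar.gz",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)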
1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using the AWS CLI, we can copy files from local storage to S3.
AWS Transfer using FTP or SFTP (AWS SFTP).
There are tools like CloudBerry clients which have a UI interface.
You can use the AWS DataSync tool.

AWS S3 - How to serve (and receive) files to/from user based on their geographical location?

Here's my situation and my goal(s):
I have a SaaS where users (globally) can upload audio files. These audio files are then later streamed (via HTML5 <audio>) to potentially anyone in the world. Currently, the only bucket hosting files is in us-west-2, which is obviously problematic when EU customers upload files, and EU users stream audio.
How can I have AWS:
Serve up audio files to a user, using the appropriate region based on their geographical location
Receive uploads using the S3 bucket (region) closest to the user uploading files
I thought maybe CloudFront would do the trick, but AFAIK, CloudFront requires a file to be downloaded once before it actually caches it, and that won't work for my SaaS. A common use case is that someone in the US might upload an important audio file for someone in Germany to listen to. I would need that person in Germany to have as fast a streaming experience as possible, and currently I'm getting complaints of slow load times and choppy audio.
S3 cross-region replication might make sense (replicating to eu-central-1 as a good starting point, to cover customers in Scandinavia, other European countries, and the UK), but I'm not sure how to make a single S3 URL pull the file from a specific bucket based on the user's geographical location.
What's the best solution here, and how do I execute it?
To improve file upload performance, you can use Amazon S3 Transfer Acceleration which enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
To improve file download performance, you can use Amazon CloudFront caching. Since CloudFront only caches content from the first request onwards, if you need to improve even first-request performance per region, you can pre-populate the cache by requesting the URL from different regions ahead of time.
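For the upload side, one hedged option (the bucket and key names below are hypothetical) is to hand each user a presigned PUT URL generated against the Transfer Acceleration endpoint, so uploads enter AWS at the nearest edge location regardless of where the bucket lives.

import boto3
from botocore.config import Config

# Client configured to sign URLs for the S3 accelerate endpoint;
# the bucket must already have Transfer Acceleration enabled.
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "my-audio-bucket", "Key": "uploads/track.mp3"},
    ExpiresIn=3600,  # the browser PUTs the audio file to this URL within an hour
)
print(upload_url)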

Difference between s3 bucket vs host files for Amazon Cloud Front

Background
We have developed an e-commerce application where I want to use CDN to improve the speed of the app and also to reduce the load on the host.
The application is hosted on an EC2 server and now we are going to use Cloud Front.
Questions
After reading a lot of articles and documents, I have created a distribution for my sample site. After all this experimentation I have come to understand the following things, and I want to be sure whether I am right about these points or not.
When we create a distribution it takes all the accessible data from the given origin path. We don't need to copy/sync our files to CloudFront.
We just have to change the paths in our application according to this distribution's CNAME (if a CNAME is given).
There is no difference between placing the images/JS/CSS files on S3 or on our own host; CloudFront will just fetch them by itself.
The application will have thousands of pictures of the products. Should we place them on S3, or is it OK if they stay on the host itself? Please share any good article to understand the difference between the two techniques.
Because if S3 is significantly better, then I'll have to write a program to sync all such data to S3.
Thanks for the help.
Some reasons to store the images on Amazon S3 rather than your own host (and then serve them via Amazon CloudFront):
Less load on your servers
Even though content is cached in Amazon CloudFront, your servers will still be hit with requests for the first access of each object from every edge location (each edge location maintains its own cache), repeated every time that the object expires. (Refreshes will generate a HEAD request, and will only re-download content that has changed or been flushed from the cache.)
More durable storage
Amazon S3 keeps copies of your data across multiple Availability Zones within the same Region. You could also replicate data between your servers to improve durability but then you would need to manage the replication and pay for storage on every server.
Lower storage cost
Storing data on Amazon S3 is lower cost than storing it on Amazon EBS volumes. If you are planning on keeping your data in both locations, then obviously using S3 adds cost, but you should also consider storing it only on S3, which makes it lower cost, more durable, and less for you to back up on your server.
Reasons to NOT use S3:
More moving parts -- maintaining code to move files to S3
Not as convenient as using a local file system
Having to merge log files from S3 and your own servers to gather usage information
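If you do decide to move the product images to S3, here is a rough sketch (bucket and paths are hypothetical) of the sync program mentioned in the question, using boto3 and setting a long Cache-Control header so CloudFront edge locations can keep each image cached and hit your origin less often.

import os
import boto3

s3 = boto3.client("s3")

def sync_images(local_dir, bucket, prefix):
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = prefix + os.path.relpath(path, local_dir).replace(os.sep, "/")
            # A long cache lifetime lets CloudFront serve the image from the edge.
            s3.upload_file(path, bucket, key,
                           ExtraArgs={"CacheControl": "max-age=31536000, public"})

sync_images("./product-images", "my-assets-bucket", "images/")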