Transferring lots of small files between EC2 and Amazon S3

I'm building a browser game and I have a lot of small files that need to be transferred between my EC2 instance and S3 when players perform certain key actions.
Although transferring a single big file is fairly fast, transferring multiple small files is extremely slow. I'm using Amazon's PHP SDK.
Is there a way to overcome this weakness in S3? Thanks.

It looks like combining the two solutions below is the way to go.
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
http://gearman.org/
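The gist of the first link is to parallelize many small PUTs rather than send them one by one. Here is a minimal sketch of that idea in Python with boto3 (the question uses the PHP SDK, but the same pattern applies there; the bucket name and file paths are placeholders):

    # Hypothetical sketch: upload many small files concurrently instead of one at a time.
    # Assumes boto3 is configured with credentials; bucket and key names are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-game-assets"  # placeholder bucket name


    def upload_one(path: Path) -> str:
        # Each small file is its own PUT; the win comes from running many in parallel.
        s3.upload_file(str(path), BUCKET, f"players/{path.name}")
        return path.name


    def upload_many(paths):
        # A modest thread pool usually hides the per-request latency to S3.
        with ThreadPoolExecutor(max_workers=16) as pool:
            return list(pool.map(upload_one, paths))


    if __name__ == "__main__":
        uploaded = upload_many(Path("save_data").glob("*.json"))
        print(f"uploaded {len(uploaded)} objects")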

If this transfer has to be made from an EC2 instance to S3, then maybe you can try using s3fuse, which will basically mount your S3 bucket as a storage volume on the EC2 instance.

The performance of S3 is not constant and can be quite slow sometimes. If you need real-time performance for a shared object, I would take a look at the AWS memcached service (ElastiCache), although I have not used it.

How exactly are you uploading the files? Is there a multithreaded method in the SDK? I'm asking because I've had to implement my own method for downloading stuff faster than the SDK.
Do you need to read those files right away? How many events do you have per second, and do you need them ordered?
My first thought would be to make a local buffer that uploads batches every once in a while.
Then, if that's too slow, I'd store them in a fast buffer first, instead of S3, and flush it every once in a while. My choices would be simple stuff like SQS or Redis. SQS has theoretically unlimited throughput for standard queues and 300 batches per second (1 batch = 1..10 messages, up to 256 KB) for FIFO queues, which you can further increase.
Then there are streams, Lambda, and so on.
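A minimal sketch of that buffer-and-flush idea, assuming boto3 and an SQS standard queue as the fast buffer; the queue URL and event format are placeholders to adapt:

    # Hypothetical sketch of the "fast buffer first, flush later" idea using SQS.
    # Queue URL and event shape are assumptions; adapt them to your own payloads.
    import json

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/player-events"  # placeholder

    _buffer = []


    def record_event(event: dict):
        # Accumulate events in memory instead of writing each one to S3 immediately.
        _buffer.append(event)
        if len(_buffer) >= 10:  # SQS allows at most 10 messages per batch
            flush()


    def flush():
        global _buffer
        if not _buffer:
            return
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(evt)}
            for i, evt in enumerate(_buffer)
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        _buffer = []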

Related

Is there a concurrent read/write limit on an S3 file

Is there a limit on the number of simultaneous/concurrent read/write operations on a single file stored in AWS S3? I'm thinking of designing a solution which requires parallel processing of a large amount of data stored in a single file, which means there will be multiple read/write operations at any given point in time. I want to understand if there is a limit for this. Consistency of data is not required for me.
S3 doesn't sound like the ideal service for your requirement. S3 is object storage. This is an oversimplification, but it basically means that you're dealing with the entire file. Clients don't go into S3 and read/write directly within it.
When a client "reads" a file on S3, it essentially has to retrieve a copy of that file into the memory of the client device and read it from there. This way, it can handle thousands of parallel reads (up to 5,500 requests according to this announcement).
Similarly, writing to S3 means creating a new copy/version of a file. There is no way to write to a file in-place. So while S3 can support a large number of concurrent writes in general, multiple clients can't write to the same copy of a file.
Maybe EFS might fit your requirement, but I think a database designed for this sort of performance would be a better option.
Reads: Multiple concurrent reads OK. The request limit is 5,500 GET/HEAD requests per second per partitioned prefix.
Writes: For object PUT and DELETE (3,500 RPS), strong read-after-write consistency. Last writer wins, no locking:
If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins.
The docs have several concurrent application examples. See also Best practices design patterns: optimizing Amazon S3 performance.
EFS is a file-based storage option for concurrent-access use cases. See the Comparing Amazon Cloud Storage table in the docs.
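To make the whole-object point concrete, here is a minimal boto3 sketch (bucket and key are placeholders): a read fetches a copy of the entire object to the client, and a write replaces the entire object under that key, with the latest write winning:

    # Sketch of S3's whole-object semantics; bucket and key names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-data-bucket", "dataset/part-0001.csv"

    # "Reading" an S3 object means fetching a copy of the whole object (or a byte range)
    # to the client; there is no server-side in-place access.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

    # "Writing" means uploading a complete new object under the key. If two clients
    # PUT the same key at the same time, the request with the latest timestamp wins.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body + b"\nnew,row,appended")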

Optimal way to use AWS S3 for a backend application

In order to learn how to connect backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to traditional WYSIWYG. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server for uploading the image. The server, in response, has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only doing a synchronization with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I do my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you are ensuring their availability and resilience to failure of your instance. S3 stores files across multiple Availability Zones (AZs), ensuring reliable long-term storage. On the other hand, an instance operates only within one AZ, and if something happens to it, all your data on the instance is lost. So you could potentially lose an entire batch of images if you wait to upload them.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
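As a rough illustration of that approach, here is a minimal sketch of an upload endpoint that writes straight to S3 and returns the link, so nothing sits on the instance itself. It assumes Flask and boto3; the bucket name, the "image" field name, and the response shape expected by the editor-js/image plugin are assumptions to adapt:

    # Minimal sketch of an image-upload endpoint that writes straight to S3 and
    # returns the object URL, keeping the instance stateless.
    # Bucket name, key scheme, field name, and response shape are assumptions.
    import uuid

    import boto3
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "notepad-user-uploads"  # placeholder


    @app.route("/upload-image", methods=["POST"])
    def upload_image():
        file = request.files["image"]  # assumed field name used by the plugin
        key = f"images/{uuid.uuid4()}-{file.filename}"
        # Stream the uploaded file object directly to S3.
        s3.upload_fileobj(file, BUCKET, key)
        url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
        return jsonify({"success": 1, "file": {"url": url}})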

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to get all of the Activity and Lead data in Marketo to be mirrored in an AWS S3 bucket so that I can build dashboards on it in Quicksight, so preferably I'd like to stream the data from Marketo into S3 in real-time, and then use Glue and Athena to connect the data to Quicksight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
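If you do end up on EC2, the "patiently wait for Marketo's queue" part can just be a polling loop. This is only a skeleton: submit_bulk_export, export_is_ready, and download_export are hypothetical stand-ins for your own Marketo REST calls, not a real client library; the bucket and key scheme are placeholders:

    # Skeleton of a long-running poller on EC2; the Marketo-facing helpers below are
    # hypothetical stand-ins for your own REST calls, not a real client library.
    import time

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "marketo-mirror"  # placeholder


    def submit_bulk_export(start, end) -> str:          # hypothetical helper
        raise NotImplementedError


    def export_is_ready(export_id: str) -> bool:        # hypothetical helper
        raise NotImplementedError


    def download_export(export_id: str) -> bytes:       # hypothetical helper
        raise NotImplementedError


    def mirror_window(start, end):
        export_id = submit_bulk_export(start, end)
        # Unlike Lambda, an EC2 process can wait as long as Marketo's queue needs.
        while not export_is_ready(export_id):
            time.sleep(60)
        data = download_export(export_id)
        s3.put_object(Bucket=BUCKET, Key=f"activities/{start}.csv.gz", Body=data)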
As an alternative solution, Amazon Appflow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
Also worth noting is that Stitch, Singer's paid product equivalent, supports native S3 export, so you could always first test with a non-Marketo data source and see if that performs the way you'd like, if you prefer spending money over time.

What is the best way to transfer data from AWS SQS to S3?

Here is the case: I have a large dataset (around 200 GB) temporarily retained in AWS SQS.
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration that it is kind of confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
IMHO a good idea. Indeed, S3 is the best option to retain the data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML services) can use content stored in S3 directly. Most machine learning frameworks can read files, and you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store the message reasonably fast); see the sketch below.
If you want to aggregate and/or concatenate the source messages, or processing a message would take too long, you may rather write a script to read and process the data on a server.
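A minimal sketch of that per-message Lambda, assuming the function is wired to the queue as an SQS event source; the bucket name and key scheme are placeholders:

    # Sketch of a Lambda handler fed by an SQS event source: each message body is
    # written to S3 as its own object. Bucket name and key scheme are placeholders.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-ml-dataset"  # placeholder


    def handler(event, context):
        for record in event["Records"]:
            # The SQS message ID gives a stable, unique object key.
            s3.put_object(
                Bucket=BUCKET,
                Key=f"raw/{record['messageId']}.json",
                Body=record["body"].encode("utf-8"),
            )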
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there?
Well, in theory you can do it on your laptop, but it would mean downloading 200 GB and then uploading 200 GB (not counting the overhead and network latency).
Your intuition is IMHO good: having an EC2 instance in the same region would be most feasible, accessing all the data almost locally.
Amazon has so many different solutions and ways of integration that it is kind of confusing.
You have many options, feasible for different use cases and often overlapping, so indeed it may look confusing.

Sync data between EC2 instances

While I'm looking to move our servers to AWS, I'm trying to figure out how to sync data between our web nodes.
I would like to mount a disk on every web node and have a local cache of the entire share.
Are there any preferred ways to do this?
It sounds like you should consider storing your files on S3 in the first place and, if performance is key, having a sync job that pulls copies of the files down to your EC2 instance. S3 is fast, durable, and cheap (maybe even fast enough without keeping a local cache), but if you do indeed need a local copy, there are tools such as the AWS CLI and other third-party tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
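If you do keep a local cache, one simple pattern is a periodic job wrapping aws s3 sync, which only copies new or changed objects. A minimal sketch, assuming the AWS CLI is installed on the instance; the bucket, local path, and interval are placeholders:

    # Minimal sketch of a periodic local-cache refresh using the AWS CLI's sync command.
    # Bucket name, local path, and interval are placeholders; assumes `aws` is installed.
    import subprocess
    import time

    BUCKET = "s3://my-shared-assets"   # placeholder
    LOCAL_DIR = "/var/cache/shared"    # placeholder


    def sync_once():
        # `aws s3 sync` only copies objects that are new or changed.
        subprocess.run(["aws", "s3", "sync", BUCKET, LOCAL_DIR], check=True)


    if __name__ == "__main__":
        while True:
            sync_once()
            time.sleep(300)  # refresh every 5 minutes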
Depending on what you are trying to sync, take a look at
http://aws.amazon.com/elasticache/
It is an extremely fast and efficient method for sharing data.
One absolutely easy solution is to install the Dropbox sync client on both machines and keep your files in Dropbox. This is by far the easiest!
In this approach, you can "load" data onto the machines by adding files to your Dropbox account externally (without even going to an AWS service to load it), from another machine or even from Dropbox's browser interface.