Is there a concurrent read/write limit on an S3 file - amazon-web-services

Is there a limit on the number of simultaneous/concurrent read/write operations on a single file stored in AWS S3? I'm thinking of designing a solution which requires parallel processing of a large amount of data stored in a single file, which means there will be multiple read/write operations at any given point in time. I want to understand if there is a limit on this. Consistency of data is not required for me.

S3 doesn't sound like the ideal service for your requirement. S3 is object storage. This is an oversimplification, but it basically means that you're dealing with the entire file. Clients don't go into S3 and read/write directly into it.
When a client "reads" a file on S3, it essentially retrieves a copy of that file into the memory of the client device and reads it from there. This way, S3 can handle thousands of parallel reads (up to 5,500 GET/HEAD requests per second per prefix, according to this announcement).
Similarly, writing to S3 means creating a new copy/version of a file. There is no way to write to a file in-place. So while S3 can support a large number of concurrent writes in general, multiple clients can't write to the same copy of a file.
Maybe EFS might fit your requirement, but I think a database designed for this sort of performance would be a better option.

Reads: Multiple concurrent reads OK. The request limit is 5,500 GET/HEAD requests per second per partitioned prefix.
Writes: Object PUT and DELETE requests are limited to 3,500 per second per prefix, with strong read-after-write consistency. The last writer wins; there is no locking:
If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins.
The docs have several concurrent application examples. See also Best practices design patterns: optimizing Amazon S3 performance.
EFS is a file-based storage option for concurrent access use cases. See the Comparing Amazon Cloud Storage table from the docs.
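To make the read side concrete, here is a minimal sketch (assuming boto3, and hypothetical bucket/key names and part size) of issuing many ranged GETs against a single object in parallel; each ranged request counts against the per-prefix GET/HEAD limit mentioned above.

```python
# Sketch: concurrent byte-range reads of one S3 object with boto3.
# Bucket, key and part size are placeholders; adjust to your data.
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-example-bucket"   # hypothetical
KEY = "data/large-file.bin"    # hypothetical
PART_SIZE = 8 * 1024 * 1024    # 8 MiB per ranged GET

s3 = boto3.client("s3")

def read_range(offset):
    # One ranged GET per call; S3 simply truncates the last range at end of object.
    byte_range = f"bytes={offset}-{offset + PART_SIZE - 1}"
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=byte_range)
    return offset, resp["Body"].read()

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
offsets = range(0, size, PART_SIZE)

with ThreadPoolExecutor(max_workers=16) as pool:
    parts = dict(pool.map(read_range, offsets))

data = b"".join(parts[o] for o in sorted(parts))
```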

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to traditional WYSIWYG. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server to upload the image to. The server, in response, has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I'm doing my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you are ensuring their availability and resilience to failure of your instance. S3 stores files across multiple availability zones (AZs), ensuring reliable long-term storage. On the other hand, an instance operates only within one AZ, and if something happens to it, all your data on the instance is lost. So you could potentially lose an entire batch of images if you hold off on the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
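As a rough illustration of the "upload straight to S3" approach, here is a minimal sketch of an image-upload endpoint, assuming Flask and boto3; the bucket name is hypothetical, and the response shape roughly follows what the editor-js/image plugin expects.

```python
# Sketch: a minimal Flask endpoint that uploads each image straight to S3
# and returns the URL in the shape the editor-js/image tool expects.
# Bucket name and URL format are placeholders.
import uuid

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-notepad-uploads"  # hypothetical

@app.route("/upload-image", methods=["POST"])
def upload_image():
    image = request.files["image"]  # multipart field used by the image plugin
    key = f"images/{uuid.uuid4()}-{image.filename}"
    s3.upload_fileobj(image, BUCKET, key, ExtraArgs={"ContentType": image.mimetype})
    url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
    return jsonify({"success": 1, "file": {"url": url}})
```

Because nothing is kept on the instance, this also keeps the application stateless, as recommended above.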

Does Dask communicate with HDFS to optimize for data locality?

In Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Is it sending computation to nodes where data is local? Is it optimizing the scheduler to take data locality on HDFS into account?
Quite right, with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.
However, we had already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that it made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory created an additional layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.
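For reference, this is roughly what reading from HDFS with Dask looks like today; the namenode address, path, and column name are placeholders, and no block-location hints are passed to the scheduler.

```python
# Sketch: Dask reading CSVs from HDFS via the arrow-based filesystem layer.
# Namenode host/port, path and column name are hypothetical.
import dask.dataframe as dd

# Partitions are handed to whichever worker is free, regardless of where
# the underlying HDFS blocks physically live.
df = dd.read_csv("hdfs://namenode.example.com:8020/path/to/files.*.csv")

print(df.npartitions)
result = df.groupby("some_column").size().compute()  # "some_column" is hypothetical
```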

What is the best way to transfer data from AWS SQS to S3?

Here is the case - I have a large dataset, temporarily retained in AWS SQS (around 200GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration, so it is kinda confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
Imho a good idea. Indeed, S3 is the best option to retain data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML) can directly use content stored in S3. Most machine learning frameworks can read files, and you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store each message reasonably fast); see the sketch below.
If you want to aggregate and/or concatenate the source messages, or if processing a message would take too long, you may rather write a script that reads and processes the data on a server.
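A minimal sketch of the Lambda route, assuming the SQS queue is configured as the function's trigger; the bucket name is hypothetical, and the Lambda role is assumed to have s3:PutObject permission.

```python
# Sketch: an SQS-triggered Lambda that writes each message body to S3 as its own object.
# Bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-ml-dataset-bucket"  # hypothetical

def handler(event, context):
    # SQS invokes Lambda with a batch of records; messageId is unique per message.
    for record in event["Records"]:
        key = f"raw/{record['messageId']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=record["body"].encode("utf-8"))
    return {"written": len(event["Records"])}
```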
There is no way I can do it locally on my laptop, is it? So, do I create a ec2 instance and process the data there?
Well - in theory you can do it on your laptop, but it would mean downloading 200GB and uploading 200GB (not counting the overhead and latency).
Your intuition is IMHO good: having an EC2 instance in the same region would be the most feasible option, as it accesses all the data almost locally.
Amazon has so many different solutions and ways of integration so it is kinda confusing.
You have many options feasible for different use cases, often overlapping, so indeed it may look confusing.

AWS services appropriate for concurrent access to a resource

I'm designing a system where a cluster of EC2 instances do some computing and then update a large file continually. What would be ideal is if I could have the file in S3, and have all the instances take turns writing to it one at a time, performing calculations while they wait.
As it stands, if 2 instances PUT to S3 at the same time, one will simply overwrite the other.
How can I solve this concurrency issue?
AWS has a preview service called EFS (http://aws.amazon.com/documentation/efs/) that is an NFSv4 file system that can be shared among EC2 instances. But such a service alone does not solve your problem, as you may still have concurrency issues. Consider something more sophisticated, such as exploiting "embarrassingly parallel" processing: have N processes create N file chunks, and finally have a single process join all the pieces together when everything is done.
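A rough sketch of that pattern with boto3; bucket, prefix, and key names are placeholders. Each worker writes its own chunk object, so no two writers ever touch the same key, and a single joiner concatenates the chunks once all workers are done.

```python
# Sketch: N workers write N chunk objects; one joiner concatenates them at the end.
# Bucket, prefix and output key are hypothetical.
import boto3

BUCKET = "my-results-bucket"   # hypothetical
PREFIX = "job-42/parts/"       # hypothetical

s3 = boto3.client("s3")

def write_chunk(worker_id: int, data: bytes):
    # Each worker writes to its own key, so there is no write contention.
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{worker_id:05d}", Body=data)

def join_chunks(output_key="job-42/result.bin"):
    # Run once, after all workers have finished; keys come back in lexicographic order.
    pieces = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            pieces.append(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
    s3.put_object(Bucket=BUCKET, Key=output_key, Body=b"".join(pieces))
```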
As it is, Amazon states that if you receive a success code then your S3 object is committed. Amazon also adds that there won't be any dirty writes or overlapping inconsistency - you read either one fully committed write or the other, never a mix.
If you need more control, you might be able to handle it at the application level, for example by implementing a critical section.
It certainly makes sense to enable versioning on the bucket so that all writes are preserved and you can later specify which version to treat as the latest.
You can also leverage lifecycle rules to keep deleting all but the last n versions to save cost.
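For example, a hedged sketch of enabling versioning and adding a lifecycle rule that expires old noncurrent versions; the bucket name and retention numbers are placeholders.

```python
# Sketch: turn on versioning and keep only a handful of noncurrent versions
# to limit storage cost. Bucket name and retention settings are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-shared-output-bucket"  # hypothetical

s3.put_bucket_versioning(
    Bucket=BUCKET, VersioningConfiguration={"Status": "Enabled"}
)

s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,          # expire noncurrent versions older than this
                    "NewerNoncurrentVersions": 5,  # but always keep the newest 5 noncurrent versions
                },
            }
        ]
    },
)
```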

Transferring lots of small files between EC2 and Amazon S3

I'm building a browser game and I have a lot of small files that need to be transferred between my EC2 instance and S3 when players perform some key actions.
Although transferring a single big file is fairly fast, transferring multiple small files is extremely slow. I'm using Amazon's PHP SDK.
Is there a way to overcome this weakness in S3? Thanks.
It looks like combining the two solutions below is the way to go.
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
http://gearman.org/
If this transfer has to be made from an EC2 instance to S3, then maybe you can try using s3fuse, which will basically mount your S3 bucket as a storage volume on the EC2 instance.
The performance of S3 is not constant and can be quite slow sometimes. If you need real-time performance for a shared object, I would take a look at AWS's managed memcached service (ElastiCache), although I have not used it.
How exactly are you uploading files? Is there a multithreaded method in the SDK? I'm asking because I've had to implement my own method for downloading stuff faster than the SDK.
Do you need to read those files right away? How many events do you have per second? Do you need them ordered?
My first thought would be to make a local buffer that uploads batches every once in a while.
Then, if that's too slow, I'd store them in a fast buffer first, instead of S3, and flush it every once in a while. My choices would be simple options like SQS or Redis. SQS has theoretically unlimited throughput for standard queues and 300 batches per second (1 batch = 1..10 messages = 0..256 KB) for FIFO queues - which you can further increase.
Then you have streams, Lambda and whatever.
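As a rough sketch of the parallel-upload idea (shown with boto3 rather than the PHP SDK, and with hypothetical bucket and local paths), something like this hides the per-request round trip behind a thread pool:

```python
# Sketch: pushing many small files to S3 in parallel, since each PUT has a
# fixed per-request latency. Bucket, key prefix and local directory are placeholders.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = "my-game-assets"  # hypothetical
s3 = boto3.client("s3")

def upload(path: Path):
    # One small file per PUT; parallelism hides the per-request round trip.
    s3.upload_file(str(path), BUCKET, f"saves/{path.name}")

files = list(Path("/tmp/pending").glob("*.json"))  # hypothetical local buffer directory
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(upload, files))
```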