I have a Spring Boot application which downloads around 300 MB of data at startup and saves it to the path /app/local/mydata. Currently, I have just one dev environment with a single node, so it is not a problem. However, once I create a prod environment with (say) 10 nodes, it would be a waste of bandwidth for each node to individually download the same 300 MB of data. It would also put a lot of stress on the service the data is downloaded from, and there is a cost associated with data flowing in/out of EC2.
I can build logic using a touchfile to make sure that only one box downloads the data and the others just wait until the download is complete. However, I don't know where to put the data so that the other nodes can read it too.
Any suggestions?
Download it to S3 if you want to keep it in a file, but it sounds like you might need to put the data in a database (RDS) or maybe cache it in Redis (ElastiCache).
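Purely as an illustration of that S3 route (bucket name and object key are made up), each node could try to pull the shared copy from S3 to the path the app already reads, and only fall back to the origin service if nothing has been published yet:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-shared-data"          # placeholder bucket
KEY = "mydata/dataset.bin"         # placeholder object key
LOCAL_PATH = "/app/local/mydata"   # path the app already reads from

def fetch_dataset():
    try:
        # Cheap intra-AWS transfer: pull the shared copy from S3.
        s3.download_file(BUCKET, KEY, LOCAL_PATH)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            # Not published yet: whichever node owns the download should fetch
            # from the origin service, upload to S3, and let the others retry.
            return False
        raise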
I'm not sure what a "touchfile" is but I assume you mean some sort of file lock mechanism. I don't see that as the best option for coordinating this across multiple servers. I would probably use a DynamoDB table with consistent reads and conditional writes as a distributed locking mechanism.
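The answer above doesn't include code, but a minimal sketch of that DynamoDB locking idea with boto3 could look like this (the table name download-locks and key lock_id are made up for illustration):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def try_acquire_lock(lock_id: str) -> bool:
    # Conditional write: succeeds only if no item with this lock_id exists yet,
    # so exactly one node "wins" and performs the download.
    try:
        dynamodb.put_item(
            TableName="download-locks",  # hypothetical table name
            Item={"lock_id": {"S": lock_id}},
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another node already holds the lock
        raise

The winner downloads the data and publishes it (to S3, RDS, or ElastiCache as suggested above); the other nodes poll until it is available.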
How often does the data you are downloading change? Perhaps you could just schedule a Lambda function to refresh the data periodically and update a database or something?
In general, you need to stop thinking about using the web server's local file system for this sort of thing.
I am currently using a Redis cluster with 2 node groups and a replica per node.
I chose Redis because of its high performance. I have a new requirement for persistent storage of the data in Redis. I want to keep the good latency Redis gives me and still have some procedure that saves the data in the background. The built-in backup snapshots are no longer good enough, since there is a maximum of 20 backups per 24 hours. I need the data to be synced approximately every minute.
The data needs to be stored in a way that a restart of the system will not cause it to be lost, and so that it can be restored at any time.
So if I summarize the requirements:
Keep working with Redis on ElastiCache
Keep the highest performance and lowest latency
Have the data persisted (including when the system is down or restarted)
Sync the data at intervals of about a minute
Be able to restore the data back to Redis when it is lost
While googling, I was looking at manually running BGSAVE from a sidecar Docker container on EC2, or at running a replica on another EC2 machine, and then having a Lambda take the RDB file/data and save it in S3.
Will this fit my needs?
What do the experts suggest? What are your ideas?
You can get close to your requirements by enabling AOF persistence.
This is done in the cluster's parameter group:
appendonly yes
appendfsync always|everysec
You will have to restart the cluster as well.
As you can see, Redis only gives you two options here for file system sync: on every write and every second.
Syncing on every write will be quite slow, so go with everysec if you want to keep good performance.
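If you prefer to set those parameters programmatically rather than in the console, a rough boto3 sketch (the parameter group name is a placeholder; it must be a non-default group attached to your cluster):

import boto3

elasticache = boto3.client("elasticache")

# Apply the AOF settings to the cluster's custom parameter group.
elasticache.modify_cache_parameter_group(
    CacheParameterGroupName="my-redis-params",  # placeholder name
    ParameterNameValues=[
        {"ParameterName": "appendonly", "ParameterValue": "yes"},
        {"ParameterName": "appendfsync", "ParameterValue": "everysec"},
    ],
)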
In order to learn how to connect a backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to a traditional WYSIWYG editor. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server that handles the image upload. In response, the server has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I do my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you are ensuring their availability and resistance to failure of your instance. S3 stores files across multiple Availability Zones (AZs), ensuring reliable long-term storage. On the other hand, an instance operates only within one AZ, and if something happens to it, all your data on the instance is lost. So you could potentially lose an entire batch of images if you wait with the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
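For what it's worth, the direct-to-S3 endpoint described in the question can be sketched like this with Flask and boto3 (bucket name, URL style, and field name are assumptions; check the editor-js/image README for the exact request field and response shape it expects):

import uuid
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-notepad-images"  # placeholder bucket name

@app.route("/upload-image", methods=["POST"])
def upload_image():
    f = request.files["image"]                    # field name configured in the plugin
    key = f"uploads/{uuid.uuid4()}-{f.filename}"  # unique key to avoid collisions
    s3.upload_fileobj(f, BUCKET, key, ExtraArgs={"ContentType": f.content_type})
    url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
    # The image tool expects a JSON body containing the saved file's URL.
    return jsonify({"success": 1, "file": {"url": url}})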
Here is the case: I have a large dataset, temporarily retained in AWS SQS (around 200 GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration that it is kind of confusing.
Thanks for your help!
"for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket."
IMHO a good idea. Indeed, S3 is the best option to retain the data and be able to reuse it (unlike SQS). AWS tools (SageMaker, the ML services) can directly use content stored in S3. Most machine learning frameworks can read files, and you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
"And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is."
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store each message reasonably fast).
If you want to aggregate and/or concatenate the source messages, or if processing a message would take too long, you may rather write a script to read and process the data on a server.
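For the Lambda-per-message route mentioned above, a rough sketch (the queue would be wired up via an SQS event source mapping; the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
BUCKET = "my-ml-dataset"  # placeholder bucket name

def handler(event, context):
    # Invoked via an SQS event source mapping: each record is one queue message.
    for record in event["Records"]:
        key = f"raw/{record['messageId']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=record["body"].encode("utf-8"))
    # Returning normally lets Lambda delete the processed messages from the queue.
    return {"stored": len(event["Records"])}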
"There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there?"
Well, in theory you can do it on your laptop, but it would mean downloading 200 GB and uploading 200 GB (not counting the overhead and latency).
Your intuition is IMHO good; having an EC2 instance in the same region would be most feasible, accessing all the data almost locally.
"Amazon has so many different solutions and ways of integration that it is kind of confusing."
You have many options, feasible for different use cases and often overlapping, so indeed it may look confusing.
Basically we save cached data in Redis and we want to dump it into MongoDB every X seconds.
We have a sorted set stored in Redis, with each user's last activity time as the score, and we want to periodically dump a user's final state after they have been inactive for a certain period of time (see the sketch after this list). We wish to make sure that:
We don't strain our API servers (that's why it has to run on a worker instance).
The data dump operation is critical: we require these worker instances to be scalable and highly resilient to failure (and they should handle failure gracefully).
We must ensure that if we have X machines, the data is spread across the instances and every item we pull from Redis is handled exactly once.
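Whatever coordinates the workers, the per-iteration Redis step could look roughly like this redis-py sketch (key name, endpoint, and inactivity threshold are placeholders; on its own this does not give exactly-once guarantees across workers):

import time
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)  # placeholder endpoint
INACTIVITY_SECONDS = 300                                    # placeholder threshold

def pull_inactive_users():
    cutoff = time.time() - INACTIVITY_SECONDS
    # Members whose last-activity score is older than the cutoff.
    inactive = r.zrangebyscore("user:last_activity", "-inf", cutoff)
    for user_id in inactive:
        # Dump the user's final state to MongoDB here, then drop it from the set.
        r.zrem("user:last_activity", user_id)
    return inactive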
I was wondering what would be the best architectural approach to deploy EC2 Windows instances that periodically handle data.
I was thinking of using Elastic Beanstalk as it's easy to deploy, scale & monitor, but I was wondering if there was a better approach to this.
Thanks in advance!
Other than the application, for which Amazon Elastic Beanstalk is a fine choice, I'd recommend you take a look at Amazon Kinesis: http://aws.amazon.com/kinesis/
That is because you mentioned "scalable", "resilient", "handle failure gracefully" and "exactly once". Those attributes are quite hard to get right in a distributed system, and Kinesis Streams and the Kinesis Client Library can greatly help with that.
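For illustration only (the answer does not include code), the producer side could push each item pulled from Redis into a stream with boto3; the stream name is a placeholder, and partitioning by user ID is what spreads the load across shards and their consumers:

import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "user-final-state"  # placeholder stream name

def publish_user_state(user_id: str, state: dict):
    # The partition key determines the shard, so records for one user stay
    # ordered while the overall load is spread across shards.
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({"user_id": user_id, "state": state}).encode("utf-8"),
        PartitionKey=user_id,
    )

On the consumer side, the Kinesis Client Library takes care of shard leases and checkpointing, which is where most of the resiliency comes from.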
While I'm looking to move our servers to AWS, I'm trying to figure out how to sync data between our web nodes.
I would like to mount a disk on every web node and have a local cache of the entire share.
Are there any preferred ways to do this?
Sounds like you should consider storing your files in S3 as the source of truth and, if performance is key, having a sync job that pulls copies of the files locally to your EC2 instance. S3 is fast, durable and cheap, maybe even fast enough without keeping a local cache, but if you do indeed need a local copy, there are tools such as the AWS CLI and other 3rd-party tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
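A minimal sketch of that pull-down step with boto3 (bucket name and local path are placeholders; the aws s3 sync command linked above does the same job and also skips unchanged files):

import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-shared-assets"      # placeholder bucket name
LOCAL_DIR = "/var/cache/assets"  # placeholder local cache path

def pull_bucket_to_local():
    # Walk every object in the bucket and download it into the local cache.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            dest = os.path.join(LOCAL_DIR, obj["Key"])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)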
Depending on what you are trying to sync, take a look at
http://aws.amazon.com/elasticache/
It is an extremely fast and efficient method for sharing data.
One absolutely easy solution is to install the Dropbox sync client on both machines and keep your files in Dropbox. This is by far the easiest!
In this approach, you can "load" data onto the machines by adding files to your Dropbox account externally (without even going to an AWS service to load them), from another machine or even from the Dropbox browser interface.