I'm a beginner with AWS and I'm using an EC2 instance for MCMC sampling, which takes several hours. Unfortunately I had a network problem in the middle of the sampling and got the message:
Network error: Software caused connection abort
As a result, I had to reboot the instance, losing all of my work (but not my data).
Is there a way to set up the instance to avoid this issue?
Thank you in advance
I'm unsure what MCMC sampling means, but I'll try to guess.
The only way not to lose information in such cases is to store it in reliable storage, e.g. S3.
If you mean long calculations, then you need to parallelize them, or at least subdivide them into smaller chunks, and then store the work queue, its status, and the intermediate results in reliable storage. Perhaps the code will have to be modified. If your calculations can be parallelized, then you may want to check out SQS and Spot Instances; sometimes you can save a lot of money.
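As a rough illustration of the checkpointing idea, here is a minimal Python sketch using boto3. The bucket name, object key, and the placeholder advance_sampler function are assumptions for the example, not anything from the original setup:

```python
# Rough sketch of checkpointing a long-running computation to S3 with boto3.
# Bucket name, key, and the placeholder sampler below are assumptions.
import pickle
import random

import boto3

s3 = boto3.client("s3")
BUCKET = "my-checkpoint-bucket"      # assumption: an existing bucket you own
KEY = "mcmc/checkpoint.pkl"

def advance_sampler(state):
    """Stand-in for one chunk of real MCMC work: draw a few more samples."""
    state = state or []
    state.extend(random.gauss(0, 1) for _ in range(1000))
    return state

def save_checkpoint(state):
    """Serialize the current state and overwrite the checkpoint object in S3."""
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=pickle.dumps(state))

def load_checkpoint():
    """Return the last saved state, or None if no checkpoint exists yet."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        return pickle.loads(body)
    except s3.exceptions.NoSuchKey:
        return None

state = load_checkpoint()            # resume from the last checkpoint, if any
for _ in range(100):                 # many small chunks instead of one long run
    state = advance_sampler(state)
    save_checkpoint(state)           # a dropped connection now costs one chunk at most
```

With the checkpoint living in S3, a rebooted or replacement instance can resume from the last saved chunk instead of starting the whole run over.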
If my guess is incorrect, then please clarify.
Instead of stopping and starting it, rebooting the instance will fix this issue most of the time. An instance reboot preserves any data on its instance store volumes.
I keep getting an error message that says there are not enough resources in the zone to create a VM (us-central1-f). This has been going on for a couple of days. Is there a way to fix this or report it? Any advice and answers would be appreciated!
You can reserve the resources you need, or wait and try your luck at creating the desired VM. Changing the machine type, the amount of RAM, etc. (lowering the VM specs) will also increase your chances.
Otherwise you have to use another zone or even another region; there's no way around it, since even GCP has limited resources and, due to high demand, some of them may not be available. The only difference will be higher latency.
My Google Cloud VM's hard disk got full, so I tried to increase its size. I have done this before, but this time things went differently. I increased the size, but the VM was not picking up the new size, so I stopped the VM. The next thing I know, my VM got deleted and recreated, and my hard disk returned to its previous size with all data lost. It had my database with over 2 months of changes.
I admit I was careless not to back up. But my current concern is: is there a way to retrieve the data? On Google Cloud, it shows $400 for the Gold Plan, which includes tech support. If I can be certain that they will be able to recover the data, I am willing to pay. Does anyone know whether, if I pay $400, the Google support team will be able to recover the data?
If there are other ways to recover data, kindly let me know.
UPDATE:
A few people have shown interest in investigating this.
This most likely happened because the "Auto-delete boot disk" option is selected by default, which I was not aware of. But even then, I would expect auto-delete to happen when I delete the VM, not when I simply stop it.
I am attaching a screenshot of all the activity that happened after I resized the boot partition.
As you can see, I resized the disk at 2:00 AM.
After receiving the resize-successful message, I stopped the VM.
Suddenly, at 2:01, the VM got deleted.
At this point I had not checked the notifications; I simply thought it had stopped. Then I started the VM, hoping to see the newly resized disk.
Instead of starting my VM, a new VM was created with a new disk, and all previous data was lost.
I tried stopping and starting the VM again, but the result was still the same.
UPDATE:
Adding activities before the incident.
It is not possible to recover deleted persistent disks (PDs).
You have no snapshots either?
The disk may have been marked for auto-delete.
However, this disk shouldn't have been deleted when the instance was stopped even if it was marked for auto-delete.
Also, you can only recover a persistent disk from a snapshot.
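For anyone trying to guard against this in the future, here is a hedged sketch with the google-api-python-client: turn off auto-delete on the boot disk and snapshot it before a risky change like a resize. It assumes Application Default Credentials, and the project, zone, instance, and disk names are placeholders:

```python
# Sketch using google-api-python-client (pip install google-api-python-client),
# assuming Application Default Credentials. Project, zone, instance, and disk
# names below are placeholders, not values from the question.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
PROJECT, ZONE = "my-project", "us-central1-f"

# 1) Turn off auto-delete on the boot disk so a deleted instance leaves the disk behind.
compute.instances().setDiskAutoDelete(
    project=PROJECT,
    zone=ZONE,
    instance="my-vm",
    deviceName="my-vm",      # assumption: usually matches the boot disk's device name
    autoDelete=False,
).execute()

# 2) Snapshot the disk before risky operations such as a resize.
compute.disks().createSnapshot(
    project=PROJECT,
    zone=ZONE,
    disk="my-vm",            # assumption: boot disk named after the instance
    body={"name": "my-vm-pre-resize"},
).execute()
```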
In a managed instance group (MIG), when you stop an instance, the health check fails and the MIG deletes and recreates the instance if the autoscaler is on. The process is discussed here. I hope that sheds some light, if that is your use case.
I've got an application that is built in Node.js and is primarily used to post photos to (up to 25 MB each). The app resizes each photo to thumbnail size and moves both the thumbnail and the full-size image to S3. When the uploads begin happening, they usually come in bursts of 10-15 pictures; rinse, wash, repeat, in 5-minute intervals. I'm seeing a lot of scaling, and the trigger is the default 6MB NetworkOut trigger. My question is: is moving the photos to S3 considered NetworkOut? Or should I consider a different scaling trigger? So far the app hasn't stuttered, so I'm hesitant to fix what ain't broken, but I am seeing quite a bit of scaling, so I thought I would investigate. Thanks for any help!
The short answer: scale whenever a resource is constrained, e.g. if your instances can't keep up with network IO, or CPU is above 80%, then scale. Yes, sending any data from your EC2 instance is NetworkOut traffic. You've got to get that data from point A to B somehow :)
As you go up in size on EC2 instances you get more memory and CPU, along with more network IO. If you don't see issues with transfers, you may want to switch the auto-scaling over to watch CPU or memory. In an app I'm working on, users can start jobs which require a fair bit of CPU, so I have my auto-scaling set to kick in when CPU is over 80%. But you might have a process that consumes a lot of memory and not much CPU...
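Since the question mentions the default NetworkOut trigger, the app sounds like it's on Elastic Beanstalk; if so, the trigger metric can be switched to CPU through environment option settings. A hedged sketch using boto3 (the environment name and thresholds are placeholders, not values from your setup):

```python
# Sketch of switching an Elastic Beanstalk scaling trigger from NetworkOut to CPU.
# The environment name and thresholds are assumptions for the example.
import boto3

eb = boto3.client("elasticbeanstalk")
eb.update_environment(
    EnvironmentName="my-photo-app-env",   # assumption: your EB environment name
    OptionSettings=[
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "MeasureName", "Value": "CPUUtilization"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "Unit", "Value": "Percent"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "UpperThreshold", "Value": "80"},
        {"Namespace": "aws:autoscaling:trigger", "OptionName": "LowerThreshold", "Value": "40"},
    ],
)
```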
On a side note, you may want to think about having your uploads go directly to your S3 bucket and using a Lambda function to trigger the resize routine. This has several advantages over your current design. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
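For the direct-to-S3 part, one common approach is to hand the browser a presigned URL and let it PUT the photo straight to the bucket, so the original never transits your instances; the S3 event can then trigger the resize Lambda as in the linked example. A minimal sketch (shown in Python for consistency with the other sketches here; the Node.js SDK has an equivalent getSignedUrl call), with the bucket name and key scheme as assumptions:

```python
# Sketch of issuing a presigned URL so browsers upload straight to S3, bypassing
# the app instances. Bucket name and key naming are assumptions for the example.
import uuid

import boto3

s3 = boto3.client("s3")

def presigned_upload_url(content_type="image/jpeg"):
    """Return a short-lived URL the client can HTTP PUT the original photo to."""
    key = f"uploads/{uuid.uuid4()}"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-photo-bucket", "Key": key, "ContentType": content_type},
        ExpiresIn=300,   # 5 minutes
    )
    return {"key": key, "url": url}
```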
I suggest getting familiar with the instance metrics. You can then recognize your app-specific bottlenecks on the current instance type and count.
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/health-enhanced-metrics.html
I'm using a micro ES instance from Amazon with 2 nodes.
However, while I'm re-indexing my data (around 300,000 docs, 300 MB), the instance becomes unresponsive several times. It usually hangs when I try to read from the instance at the same time.
I'm using this instance in production for my website, and this issue causes me big headaches.
Is anyone experiencing the same issues? Would it help if I moved to:
1) a larger instance?
2) the 2.x version?
Thank you
The only time I've had issues with ES queries being unresponsive during re-indexing is when the resources are exhausted, so I would advocate a larger instance.
You should use CloudWatch metrics to determine the resource usage on your current instance during a period when it's running well, and also during your re-index. Use this information to decide on the best instance type; the following table will give you an idea of what resources you get: https://aws.amazon.com/elasticsearch-service/pricing/
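As a starting point, here is a hedged boto3 sketch that pulls the domain's CPU from CloudWatch around a re-index window; the domain name and account ID are placeholders, and other ES metrics such as JVMMemoryPressure and FreeStorageSpace can be fetched the same way:

```python
# Sketch of pulling an Amazon ES domain's CPU metric from CloudWatch with boto3.
# Domain name and account ID are placeholders for the example.
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-es-domain"},   # assumption
        {"Name": "ClientId", "Value": "123456789012"},     # your AWS account ID
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),      # window covering the re-index
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))
```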
This question has a conceptual part and a practical part.
Conceptually, I'd like to know whether using the autoscaling functionality is equivalent to simply increasing the compute power by a factor of the number of added instances.
Practically ... how does this work? I have one running instance, its database sitting on an LVM composed of multiple EBS volumes, similarly with all website data. Judging from the load on the instance I either need to upgrade to a more powerful instance or introduce this autoscaling. Is it a copy of the running server? If so, how is the database (etc) kept consistent?
I've read through the AWS documentation and still haven't got the full picture. I could set up one autoscaling group, which would probably clear up my doubts, but I am very leery of doing this with a production server.
Any nudges in the right direction would be welcome.
Normally, if you have a solution that uses a database and several machines, the database is typically not on any of those machines but is instead hosted separately, with each worker machine pointing to the same database. If you are on the AWS platform already, then DynamoDB or RDS are both good options for this.
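To make that concrete, here is a small hedged sketch of what "every worker points at the same data store" looks like with boto3 and DynamoDB; the table and attribute names are made up for the example, and the table is assumed to already exist with session_id as its key:

```python
# Illustration of keeping shared state off the web servers: every instance in the
# group reads and writes the same DynamoDB table. Table and attribute names are
# assumptions, and the table must already exist with 'session_id' as its key.
import boto3

sessions = boto3.resource("dynamodb").Table("app-sessions")

def save_session(session_id, data):
    """Any instance in the group can write the session state..."""
    sessions.put_item(Item={"session_id": session_id, **data})

def load_session(session_id):
    """...and any other instance can read it back, even after the writer is terminated."""
    return sessions.get_item(Key={"session_id": session_id}).get("Item")
```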
In theory, for some applications, upgrading the size of the single machine will give you the same power as adding several smaller machines. But increasing the size of the single machine, while usually the easiest thing to do at first, should not be considered autoscaling, and it has its own drawbacks. Here are some things to consider:
Using multiple machines instead of one big one gives you some fault tolerance. One or more machines can go down, and if your solution is properly designed, new machines will spin up to replace them.
Increasing the size of a single-machine solution means you are probably paying too much. If you size that single machine big enough to handle peak workloads, then at other times (maybe most of the time) you are paying for a bigger machine than you need. If you set up your autoscaling solution properly, more machines come online in response to increasing demand and then terminate when that demand decreases; you only pay for the power you need, when you need it.
When your solution is designed in this manner, you need to think of all of the worker machines as ephemeral, i.e. likely to disappear at any time, so you need to build your solution differently. Besides using a hosted database (like DynamoDB or AWS RDS), you also should not store any data on the machines in your auto-scaling group that doesn't also live somewhere else. For example, if part of your app allows users to upload images, you don't store them on the instances; you store them in S3. The same would apply to any other new data that comes in.
You need to be able to figuratively 'pull the plug' at any instant on any of the machines in your ASG without losing data.
Ultimately, a properly set up auto-scaling solution will likely serve you better, but without doubt it is simpler to just 'buy a bigger machine', and the extra money you spend on running that bigger machine may be more than offset by the time and effort you don't have to spend re-architecting your solution to properly run in an autoscaling environment. The unique requirements of your solution will ultimately decide which approach is better.