Using a micro instance for Elastic MapReduce (EMR) on AWS

Amazon charges for a full hour even when I use an instance for only a few minutes, so it is getting a little expensive to do my school projects or just play around with EMR. Since micro instances are available for free, I want to use them to run my MapReduce jobs, but there seems to be no option for doing so. Any help in this regard would be great.
Also, if that is totally not possible, I want to know how I can pick an already running instance (probably a small instance, which EMR gives an option to select via the console) for a MapReduce job. I am basically planning to keep a few small instances running and have all my small MapReduce jobs use them; that way I get the most out of the money I pay.
Thanks in advance :)

I fired up some EC2 instances myself and tried to run a MapReduce job using the Elastic MapReduce console, and I was not given any option to use the instances that were already up. Amazon charging on a per-hour basis is turning out to be a bad thing for me, at least.
Please add more information if I am missing something or am wrong in any way.
PS: I chose to answer my own question as I did not see any help coming and thought this might someday be helpful to someone experimenting with AWS.

Related

How efficient is it to use EMR Spot Instances to run Spark jobs?

I want to use EMR Spot Instances to cut down my Redshift and AWS Glue costs, but after reading about them I want to know: if I am running a 30-minute job, how likely is it to get interrupted? How often are these Spot Instances taken away while running a job, and if they are taken away, how can I manage my job so it re-runs?
My focus is mostly on Spark jobs.
Opinion-based, but here goes.
Excellent read: https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/
Basically, AWS lets you use Spot Instances and recover gracefully thanks to the integration with YARN's decommissioning mechanism. You need to code nothing extra in your Spark app.
That said, if you run on Spot Instances the output will still arrive, but interruptions mean it may take a while.
AWS Glue is serverless and hence has nothing to do with EMR. Redshift is also priced differently.
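As a rough sketch, launching an EMR cluster whose task nodes run on Spot looks something like this with boto3 (the release label, instance types, IAM roles, and S3 log bucket below are placeholders, not a recommendation):

    import boto3

    # Sketch: EMR cluster with on-demand master/core nodes and Spot task nodes.
    # Release label, instance types, roles and the S3 log URI are placeholders.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-on-spot-demo",
        ReleaseLabel="emr-6.9.0",
        Applications=[{"Name": "Spark"}],
        LogUri="s3://my-bucket/emr-logs/",          # placeholder bucket
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate when steps finish
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Task nodes on Spot: cheap capacity that YARN decommissioning
                # lets Spark lose without failing the whole job.
                {"InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
    )
    print(response["JobFlowId"])

Keeping master and core on On-Demand and pushing only task nodes to Spot is the usual compromise: losing a task node costs you executors, not HDFS data.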

AWS Container (ECS) vs AMI & Spot instances

The core of my question is whether or not there are downsides to using an Amazon Machine Image + Micro Spot instances to run a task, vs using the Elastic Container Service (ECS).
Here's my situation: I have the need to run a task on demand that is triggered by a remote web hook.
There is the possibility this task can get triggered 10 times in a row, or go weeks w/o ever executing, so I definitely want a service that only runs (and bills) on demand.
My plan is to point the webhook to a Lambda function, but then the question is what to have the Lambda function do.
Though it doesn't take very long, this task requires several different runtimes (PowerShell Core, Python, PHP, Git) to get its job done, so Lambda on its own isn't really a possibility, as I'd hit the deployment package size limit. But I can use Lambda to kick off the job.
What I started doing was creating an AMI that has all the necessary runtimes and code, then using a Spot request to launch an instance, having it execute the operation via a startup script passed in through user data, and then having it shut itself down when it's done. I'd have to put in some rate-control logic to prevent two from running at once, but that's a solvable problem.
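Roughly, the launch call from the Lambda side looks like this (a boto3 sketch; the AMI ID, instance type, and the startup commands are placeholders for my real ones):

    import boto3

    ec2 = boto3.client("ec2")

    # Startup script passed via user data; the real one runs the task and
    # shuts the instance down when finished (placeholder commands here).
    user_data = """#!/bin/bash
    /opt/mytask/run.sh
    shutdown -h now
    """

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # placeholder: my pre-baked AMI
        InstanceType="t3a.micro",
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
        InstanceInitiatedShutdownBehavior="terminate",  # shutdown -> terminate
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )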
I hesitated halfway through developing this solution when I realized I could probably do this with a Docker container on ECS using Fargate.
I just don't know if there is any benefit to putting in the additional development time of switching to a Docker container, when I am not a Docker pro and already have the AMI configured. Plus, ECS/Fargate is actually more expensive than just running a micro instance.
Are there any concerns about spinning up short-lived (<5 min) Spot requests (t3a.micro) when there could be a dozen fired off in a single day? Are there rate limits on this? Will I get an angry email from AWS telling me to knock it off? Are there other reasons ECS is the only right answer? Something else entirely?
Your solution using a Spot Instance and an AMI is a valid one, though I've experienced slow times getting a Spot Instance in the past. You also incur the AMI startup time.
As mentioned in the comments, you will incur a minimum of one hour's charge for the instance, so you should leave the instance up for the hour before terminating it, in case more requests come in during the same hour.
IMHO you should build it all with Lambda. By splitting the workload for each runtime into its own Lambda function you can make it work.
AWS supports Python and PowerShell runtimes, and you can create a custom PHP one. Chain them together with your glue of choice (SNS, SQS, direct invocation, or Step Functions) and you have the most cost-effective solution. You also get the benefit of better, independent maintenance for each function/runtime.
Put the initial Lambda behind API Gateway and you get rate-limiting capability too.
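For illustration, here is a minimal sketch of one function handing off to the next by direct invocation (the function name and payload fields are made up):

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    def handler(event, context):
        # ... do this runtime's part of the work (e.g. the Python step) ...
        result = {"repo": event.get("repo"), "python_step": "done"}

        # Hand off asynchronously to the next function in the chain
        # (the function name and payload shape are placeholders).
        lambda_client.invoke(
            FunctionName="task-powershell-step",
            InvocationType="Event",          # async: don't wait for the result
            Payload=json.dumps(result).encode("utf-8"),
        )
        return {"statusCode": 202}

Step Functions gives you the same chaining with retries and visibility built in, at the cost of a little more setup.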

Cloud computing pricing?

I’m looking to use a cloud computing instance which will give me best value for my particular use case.
What I need to do is fire off a script periodically which performs some actions via selenium.
My questions are:
Will the packages installed be “remembered” if I reboot the instance?
Does the instance even need to be rebooted sometimes?
Do I get persistent storage or something else?
Am I charged when my instance is running but idle (between cron jobs)?
Any recommendations on which type of service would provide the best value for my use case?
EDIT: I made some edits because I offended people by mentioning specific vendors and asking about pricing. The question probably could have been worded better initially but what I really wanted to know was which cloud computing solutions would be best value for a particular use case which seems to be a bit niche.
I'm with the GCP Support team, so I will provide information on Google Cloud's side.
Google Cloud Compute Engine offers free usage up to a specific limit.
Will the packages installed be “remembered” if I reboot the instance? Do I get persistent storage or something else?
When you create a new VM instance, you get assigned at least 10 GB of persistent disk. Everything that is stored on that disk will stay there even if you reboot the instance.
Does the instance even need to be rebooted sometimes?
Instances don't have to be rebooted unless you want to. However, occasionally they get rebooted automatically by the Google Compute Engine service.
Am I charged when my instance is running but idle (between cron jobs)?
You are charged for the instance while it is up and running, even if it is idle. You can find more information about Google Compute Engine pricing here.
To get an approximation of what you will be paying, based on what you will be using, you can use the GCP Pricing Calculator.
After some investigation I found a couple of suitable options for running a Selenium script a few times a day.
1) GCP, which, as mentioned in another answer, has a free tier.
2) AWS Lambda, which runs code without needing to provision servers. It also has a free tier, and I found some pre-compiled AWS Lambda packages using Python + Selenium, plus some pretty clear instructions on how to get them running in AWS on a schedule: https://github.com/ryfeus/lambda-packs.
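For illustration, a handler built from one of those packs ends up looking roughly like this (a sketch only; it assumes the headless Chromium and chromedriver binaries are bundled with the function, and the paths, target URL, and exact constructor arguments vary with the Selenium version):

    # Sketch of a scheduled Lambda handler driving headless Chrome via Selenium.
    # Binary paths and the URL below are placeholders.
    from selenium import webdriver

    def handler(event, context):
        options = webdriver.ChromeOptions()
        options.binary_location = "/var/task/headless-chromium"  # placeholder path
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--single-process")
        options.add_argument("--disable-dev-shm-usage")

        # On newer Selenium releases the driver path is passed via a Service object.
        driver = webdriver.Chrome("/var/task/chromedriver", options=options)
        try:
            driver.get("https://example.com")      # placeholder target
            return {"title": driver.title}
        finally:
            driver.quit()

A CloudWatch Events (EventBridge) schedule rule then triggers the function on whatever cron expression you need.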

Finding the best deployment locations among AWS regions

Given that we are on the AWS platform and need to subscribe to different sources of data located around the world, how can we efficiently determine which region has the lowest latency to some target IP (not our browser)?
There is a service called CloudPing which pings from your current browser to the AWS regions, but that is not useful here for obvious reasons.
Is there any tool similar to CloudPing that would let us specify which IP we want to ping?
And a secondary question: I suppose it is possible to spawn instances using the AWS console or API, but does Amazon charge significant fees if I have a script that spawns a compute instance, does some short work, terminates it, and repeats this for every single region?
Worst case, we could spawn instances in all regions for a short amount of time and ping all the destinations we are interested in, but that would be a lot of work for something rather simple... My assumption is that even within one region you might end up with some instances having significantly better latency than others; a script could keep spawning instances until the best one is found and terminate the others...
UPDATE
It seems it is rather easy to spawn instances and execute commands on them, and it shouldn't be hard to terminate them either. Here is a good tool for this; now the question is, will AWS punish me with bills, and isn't there already a solution for this?
You can certainly launch and terminate Amazon EC2 instances in any region you wish. Amazon will not "punish" you -- the system will simply charge the normal cost for the resources you use.
If you launch an Amazon EC2 instance with the Amazon Linux AMI, then the instance will be charged per-second, so the cost will be very low. For example, you could use a t2.micro instance for a few cents per hour (charged per second).
You could then run your own timing test from each region. However, you could probably predict the best performance simply based upon the location of the region (US East, US West, Frankfurt, Sydney, etc).
Also, please note that Ping is not a reliable measure for how your actual application would perform. To obtain the best measure, you should run an application in each region that connects to the 'source of data' you are trying to use. Measure performance as it would be used by your actual application. You might find that the remote service has higher latency than the network, meaning that location would only have a minor impact on performance.
If you use somebody else's timing or somebody else's tool, it will not be as accurate as measuring your actual application doing "real" work.
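As a concrete example, here is a small timing script you could drop on an instance in each region to measure TCP connection setup to your data source (the host and port are placeholders; timing a real application request would be even better):

    import socket
    import time

    # Time TCP connection setup from this instance to the data source.
    # Host and port are placeholders; repeat and average to smooth out jitter.
    HOST, PORT, SAMPLES = "data.example.com", 443, 20

    timings = []
    for _ in range(SAMPLES):
        start = time.monotonic()
        with socket.create_connection((HOST, PORT), timeout=5):
            pass
        timings.append((time.monotonic() - start) * 1000)

    print(f"min {min(timings):.1f} ms, avg {sum(timings)/len(timings):.1f} ms")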

Setting up a basic AWS environment without the free tier

I am a newbie to AWS.
Well, my free tier just expired recently.
So far, I have been running an instance all day and using SSH to write code there.
But now I realize it could charge me more than I expected, so I decided to terminate the instance.
Then how can I connect to an instance easily? Do I need to turn it on every time I want to code?
And which is better: using DynamoDB (in AWS), or making a separate instance and installing Linux and MongoDB (or something else)?
Thanks =)
The best practice to save on cost is to stop (not terminate) your instance when you are not using it.
You will not pay for your instance while it is in the stopped state. You will just pay for the storage of your EBS volumes (boot drive).
When you terminate your instance, you cannot restart it. If you do terminate your instance, be sure your data is saved on a secondary EBS volume, in a snapshot, or stored on S3.
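For example, stopping and later restarting the instance is a couple of calls with boto3 (the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")
    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder

    # Stop (not terminate): compute billing stops, the EBS boot volume is kept.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])

    # Later, when you want to code again:
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

Note that a stopped/started instance normally gets a new public IP unless you attach an Elastic IP.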
Regarding your second question, it really depends on your application's needs: at what scale do you expect to run? How large will your data store be? What types of queries are you going to perform?
For most cases, DynamoDB will be more cost-effective than running a couple of EC2 instances with a MongoDB cluster. And you will not need to maintain or operate the infrastructure; AWS will do it for you.
You might get another point of view from this question: DynamoDB vs MongoDB NoSQL
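As a taste of how little operational work DynamoDB requires, a minimal read/write looks like this (the table and attribute names are made up, and the table must already exist with this key schema):

    import boto3

    # Minimal DynamoDB usage sketch; table and attribute names are made up.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Notes")

    table.put_item(Item={"UserId": "u1", "NoteId": "n1", "Text": "hello"})
    item = table.get_item(Key={"UserId": "u1", "NoteId": "n1"}).get("Item")
    print(item)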
It might be easier to just use MongoHQ: https://bridge.mongohq.com/signup
If you're interested in learning how to set up the servers yourself, DigitalOcean is a good bet, as they don't charge you for IOPS and give you SSDs on the instance:
http://www.digitalocean.com
Enjoy!