I am testing jobs in EMR, and each test takes a long time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API, but I wanted to know whether it can be done from the AWS console.
You cannot do this from the AWS console. To quote the developer guide:
The Amazon Elastic MapReduce tab in the AWS Management Console does not support adding steps to a job flow.
You can only do this via the CLI and API, by creating a job flow, then adding steps to it.
$ ./elastic-mapreduce --create --alive --stream
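Once the job flow is up with --alive set, you keep adding steps to the same job flow ID. Roughly, from memory of the retired Ruby CLI (so treat the exact flags as an approximation and check --help for your version), adding a streaming step looks like this; the job flow ID and S3 paths are placeholders:
$ ./elastic-mapreduce --jobflow j-YOURJOBFLOWID --stream \
    --mapper s3://mybucket/mapper.py \
    --reducer s3://mybucket/reducer.py \
    --input s3://mybucket/input/ \
    --output s3://mybucket/output/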
You can't do this with the web console, but through the API and programming tools you can add multiple steps to a long-running job flow, which is what I do. That way you can fire off jobs one after the other on the same long-running cluster, without having to create a new one each time.
If you are familiar with Python, I highly recommend the Boto library. The other AWS API tools let you do this as well.
If you follow the Boto EMR tutorial, you'll find some examples:
Just to give you an idea, this is what I do (with streaming jobs):
import sys
import time

import boto
from boto.emr.step import StreamingStep

# Connect to EMR
conn = boto.connect_emr()

# Start a long-running job flow; don't forget the keep_alive setting
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         keep_alive=True)

# Create your streaming job
step = StreamingStep(...)

# Add the step to the job flow
conn.add_jobflow_steps(jobid, [step])

# Wait until the step completes
while True:
    state = conn.describe_jobflow(jobid).steps[-1].state
    if state == "COMPLETED":
        break
    if state in ("FAILED", "TERMINATED", "CANCELLED"):
        sys.stderr.write("EMR job failed! State = %s!\n" % state)
        sys.exit(1)
    time.sleep(60)

# Create your next job here and add it to the same cluster
step = StreamingStep(...)
conn.add_jobflow_steps(jobid, [step])
# Repeat :)
To keep the machine alive, start an interactive Pig session; the cluster won't shut down while that session exists. You can then test your map/reduce logic from the command line using:
cat infile.txt | yourMapper | sort | yourReducer > outfile.txt
I have been looking for a script that automatically stops SageMaker notebook instances that were forgotten or are sitting idle. The few scripts I found don't work very well (e.g. one linked script only checks whether the .ipynb file is live, and I'm not using .ipynb; others rely on the last-modified timestamp, which never changes until you shut down or start the instance).
Is there a resource or script you can recommend?
You can use the following script to find idle instances. You can modify it to stop an instance that has been idle for more than 5 minutes, or run it from a cron job that stops the instance.
import time

import boto3

# Idle threshold: 5 minutes, expressed in seconds
last_modified_threshold = 5 * 60

sm_client = boto3.client('sagemaker')
response = sm_client.list_notebook_instances()

for item in response['NotebookInstances']:
    # Seconds elapsed since the instance was last modified
    idle_seconds = time.time() - item['LastModifiedTime'].timestamp()
    print(item['NotebookInstanceName'], idle_seconds / 60)
    if idle_seconds > last_modified_threshold:
        print('Notebook {0} has been idle for more than {1} minutes'.format(
            item['NotebookInstanceName'], last_modified_threshold / 60))
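If you want the script to actually stop the idle instances rather than just report them, boto3's stop_notebook_instance call can be dropped into the loop above. A minimal sketch (the InService check is my addition, since stopping only makes sense for a running instance):
    if idle_seconds > last_modified_threshold and item['NotebookInstanceStatus'] == 'InService':
        # Stop the idle notebook instance (same sm_client and item as in the loop above)
        sm_client.stop_notebook_instance(NotebookInstanceName=item['NotebookInstanceName'])
You can then schedule the whole script with a cron job so it runs every few minutes.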
Some jobs are stuck in the Pending state and I can't cancel them.
How do I cancel these jobs?
The web console shows the following:
"The graph is still being analyzed."
All logs are "No entries found matching current filter."
Job status: "Starting..."
A cancel button has not appeared yet.
There are no instances in the Compute Engine tab.
Here is what I did:
I created a streaming job from a simple template (Pub/Sub Subscription to BigQuery). I set machineType to e2-micro because it was just a test.
I also tried to drain and cancel the jobs with gcloud, but it doesn't work:
$ gcloud dataflow jobs drain --region asia-northeast1 JOBID
Failed to drain job [...]: (...): Workflow modification failed. Causes: (...):
Operation drain not allowed for JOBID.
Job is not yet ready for draining. Please retry in a few minutes.
Please ensure you have permission to access the job and the `--region` flag, asia-northeast1, matches the job's
region.
This is the job list:
$ gcloud dataflow jobs list --region asia-northeast1
JOB_ID NAME TYPE CREATION_TIME STATE REGION
JOBID1 pubsub-to-bigquery-udf4 Streaming 2021-02-09 04:24:23 Pending asia-northeast1
JOBID2 pubsub-to-bigquery-udf2 Streaming 2021-02-09 03:20:35 Pending asia-northeast1
...other jobs...
Please let me know how to stop/cancel/delete these streaming jobs.
Job IDs:
2021-02-08_20_24_22-11667100055733179687
2021-02-08_20_24_22-11667100055733179687
WebUI:
https://i.stack.imgur.com/B75OX.png
https://i.stack.imgur.com/LzUGQ.png
In my personal experience, an instance sometimes gets stuck: it keeps running, cannot be cancelled, or never shows the graphical pipeline. The best way to handle this kind of issue is to leave the job in that state, unless it impacts your solution by exceeding the maximum number of concurrent runs. It will eventually be cancelled automatically, or by the Google team, since Dataflow is a Google-managed service.
In the Dataflow UI of the GCP console, if you have running Dataflow jobs, you will see a "STOP" button, as in the image below.
Press the STOP button.
When you successfully stop your job, you will see a status like the one below. (I was too slow to stop the job on the first try, so I had to test it again. :) )
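The CLI equivalent of that button is a cancel rather than a drain. For a job that refuses to drain you could try the following (the job ID and region are just the ones from the question); note that if the job never left the Pending state, even a cancel may be rejected until the service finishes or gives up starting it, in which case waiting as described above is the remaining option:
$ gcloud dataflow jobs cancel --region asia-northeast1 JOBID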
I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Each process then has to wait for the job to finish before moving on to the next step. To do this, I added the --stream-logs parameter to the command above, which streams all the logs until the job is done.
The problem is, with so many parallel processes, I run out of requests for getting logs:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    ...
}
job_spec = {'jobId': 'your_job_name', 'trainingInput': training_inputs}

project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

# Build the AI Platform client and submit the training job
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
Now I can set up a loop that checks the job state every 60 seconds:
import time

# The job name must match the jobId submitted above
job_name = job_spec['jobId']

state = None
while state not in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=f'{project_id}/jobs/{job_name}')
    state = status_req.execute()['state']
    print(state)
Regarding the error message you are experiencing: you are indeed exceeding a Cloud Logging quota, and what you can do is request a quota increase.
As for a smarter way to check the status of a job without streaming the logs, you can poll the status once in a while by running gcloud ai-platform jobs describe <job_name>, or create a Python script that checks the status; this is explained in the following documentation.
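For example, a minimal polling loop around the describe command could look like the sketch below; the terminal state names are my assumption based on the usual AI Platform job states, so adjust them if your jobs report others. One describe call per minute per job is far fewer requests than continuous log streaming.
while true; do
  state=$(gcloud ai-platform jobs describe "$JOB_NAME" --format='value(state)')
  case "$state" in
    SUCCEEDED|FAILED|CANCELLED) break ;;
  esac
  sleep 60
done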
I have a couple of Python scripts that I would like to schedule to run once a month on Google Cloud. The scripts trigger DLP jobs and extract Data Catalog information to a file in GCS. These batch workloads would run for barely 30 minutes, so I don't need resource-intensive services like GKE, Composer, etc.
For these batch workloads I would like to know the best options available in GCP. Looking through some blog posts, I found the article below, which uses Cloud Scheduler -> Pub/Sub -> Cloud Functions -> Create VM (using a startup script).
https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25
I have the following question about that design:
1) How long does the Cloud Function run while it starts the VM? I know a Cloud Function has a timeout of 9 minutes. What happens if the VM takes longer than 9 minutes to process the startup script?
Any other design ideas are much appreciated.
Thanks
I'm the author of that medium post.
1) How long does the Cloud Function run as it starts the VM?
You can change the Cloud Function code so that it doesn't wait for the response; it's Node.js, so you simply don't wait for the Promise to resolve.
Also, in that solution the Cloud Function's only job is to trigger the VM creation.
.createVM(vmName, vmConfig)
.then(data => {
// Operation pending.
const vm = data[0];
const operation = data[1];
console.log(`VM being created: ${vm.id}`);
console.log(`Operation info: ${operation.id}`);
return operation.promise();
// This will return right away with the VM pending state, you can finish
// your logic here, and not wait for VM creation to finish.
// You can even ignore this step if you don't need the VM ID logged for
// debugging purposes
})
.then(() => {
const message = 'VM created with success, Cloud Function finished execution.';
console.log(message);
});
Using that same code, in the worst case (if it takes more than 9 minutes) the Cloud Function will time out, but the VM creation will continue.
The design that I suggest uses Cloud Scheduler + Pub/Sub + Compute Engine.
This design in a few words:
- your Compute Engine instance will run a utility that listens to a Cloud Pub/Sub topic (a minimal sketch of such a utility is shown below)
- this utility executes upon receiving a new message from the topic and runs the cron job on the instance
- Cloud Scheduler is used here to push messages to the Pub/Sub topic on the schedule you specify in your job.
By using Pub/Sub to decouple the task-scheduling logic from the logic running the commands on Compute Engine, you can update your cron scripts as needed, without updating the Cloud Scheduler configuration. You can also change your task schedule without updating the utility service on your Compute Engine instances.
You can find a full explanation of this design and sample code by following this and this.
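To make the flow concrete, here is a minimal sketch of the utility that would run on the Compute Engine instance, using the google-cloud-pubsub client. The project ID, subscription name, and run_monthly_batch function are placeholders for whatever your scripts actually do:
from google.cloud import pubsub_v1

PROJECT_ID = 'your-project'             # placeholder
SUBSCRIPTION_ID = 'monthly-batch-sub'   # placeholder

def run_monthly_batch():
    # Placeholder: trigger your DLP jobs and Data Catalog export here
    print('running monthly batch...')

def callback(message):
    # Acknowledge first so Pub/Sub does not redeliver while the batch runs
    message.ack()
    run_monthly_batch()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)
future = subscriber.subscribe(subscription_path, callback=callback)

# Block the main thread so the listener keeps running
future.result()
Cloud Scheduler then publishes a message to the matching topic on the monthly schedule you configure.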
Let me know if there is anything that is not obvious.
We're using S3, SimpleDB and SQS on quite a complicated project.
I'd like to be able to automatically track their usage, to be sure we don't suddenly spend large amounts of money when we didn't intend to (perhaps because of a bug).
Is there a way of reading the usage figures of all Amazon Web Services and/or the current real time dollar cost of an account from a script?
Or any service or script which provides alerts based on that?
Amazon just announced that you can now "set alarms for any metric that Amazon CloudWatch monitors" (CPU utilization, disk reads and writes, and network traffic, etc). Also, all instances now come with basic monitoring for free.
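Building on that, a CloudWatch alarm on the account's estimated charges is the closest thing to the dollar-based alert you describe. A rough boto3 sketch, assuming billing alerts are enabled for the account (billing metrics are only published in us-east-1; the threshold and SNS topic ARN are placeholders):
import boto3

# Billing metrics live in us-east-1 regardless of where your resources run
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='estimated-charges-over-100-usd',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=6 * 60 * 60,            # billing data only updates a few times a day
    EvaluationPeriods=1,
    Threshold=100.0,               # placeholder: alert once charges exceed $100
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'],  # placeholder SNS topic
)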
We just released a Lab Management service that adds policies to AWS usage: time limits, max number of instances, max machine sizes, etc. You may want to try it and see if it helps: http://LabSlice.com. As this is a startup, we would really value feedback about how to resolve issues such as yours (i.e., email me if you think the app could be modified to better meet your requirements).
I don't believe there is any direct way to control AWS costs to the dollar. I doubt that Amazon provides an API to get in-depth metrics on usage, as it's obviously not going to be in their interest to help you reduce costs. I actually ran into two instances where surprise costs arose in a company (bank) due to mis-configured scripts, so I know that it can be a problem.
I ran into the same issue with EC2 instances, but addressed it in a different way -- instead of monitoring the instances, I had them automatically kill themselves after a set amount of time. From your description, it sounds like this may not be practical in your environment, but I thought I would share just in case it helps. My AMI was Fedora-based, so I created the following bash script, registered it as a service, and had it run at startup:
#!/bin/bash
# chkconfig: 2345 68 20
# description: 50 Minute Kill

# Source functions
. /etc/rc.d/init.d/functions

start()
{
    # Shut down 50 minutes after starting up
    at now + 50 minutes < /root/atshutdown
}

stop()
{
    # Remove all jobs from the at queue because I'm not using at for anything else
    for job in $(atq | awk '{print $1}')
    do
        atrm $job
    done
}

RETVAL=0
case "$1" in
    start)
        start && success || failure
        echo
        ;;
    stop)
        stop && success || failure
        echo
        ;;
    restart)
        stop && start && success || failure
        echo
        ;;
    status)
        echo $"`atq`"
        ;;
    *)
        echo $"Usage: $0 {start | stop | restart | status}"
        RETVAL=1
esac

exit $RETVAL
You might consider doing something similar to suit your needs. If you do this, be especially careful to stop the service before modifying your image, so that the instance does not shut down before you get a chance to re-bundle.
If you wanted, you could have the instances shut down at a fixed time (after everyone leaves work?), or you could pass in a keep-alive length/shutdown time via the -d or -f parameters to ec2-run-instances and parse it out in the script.
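If you go the user-data route, the startup script can read whatever you passed with -d/-f from the instance metadata service. A minimal sketch, assuming the user data is just a number of minutes (the MINUTES variable is mine):
MINUTES=$(curl -s http://169.254.169.254/latest/user-data)
# Schedule the shutdown commands that many minutes in the future
at now + "$MINUTES" minutes < /root/atshutdown
This keeps the AMI generic and lets each launch decide its own lifetime.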