I have a problem.
I recently used SageMaker to process my jobs, and now there are some processing jobs in the history with "Completed" status. How do I stop these jobs?
I used the command aws sagemaker stop-processing-job --processing-job-name, but it didn't work the way I expected. The following message appeared when I ran it:
The request was rejected because the processing job is in status Completed.
I need to stop these processing jobs because they are probably costing me money.
You can't stop an inactive processing job because it is already Completed.
To free up resources and avoid any associated cost, try deleting the processing job with:
aws sagemaker delete-processing-job --processing-job-name my-processing-job
Note: this will delete any resources created by the job, such as S3 buckets/objects.
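If you want to double-check that a job really is finished before worrying about cost, a quick boto3 sketch (the job name below is a placeholder) looks like this:

import boto3

sm = boto3.client("sagemaker")

# "my-processing-job" is a placeholder; use your actual processing job name.
job = sm.describe_processing_job(ProcessingJobName="my-processing-job")

# A job in a terminal state (Completed, Failed, Stopped) is no longer running
# any instances, so it is not accruing compute charges.
print(job["ProcessingJobStatus"])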
Summary
I am using AWS Batch in order to run Monte Carlo simulations. Occasionally I realise that a group of jobs that I have submitted to my queue are incorrect in some way and I wish to clean up the queue before more jobs start running.
When I try to cancel a job through the AWS Console, I get the notification "Job cancellation request completed successfully". However, the job remains in the queue, even after waiting for multiple hours. I don't know how to cancel these jobs.
What I've tried
Cancelling jobs in the RUNNABLE state through the AWS Console manually. I get "Job cancellation request completed successfully", but no change.
Terminating jobs in the RUNNABLE state through the AWS Console manually, instead of cancelling them. No change either.
Cancelling jobs through the AWS CLI with aws batch cancel-job command as described in https://docs.aws.amazon.com/cli/latest/reference/batch/cancel-job.html
Terminating jobs through the AWS CLI with aws batch terminate-job command as described in https://docs.aws.amazon.com/cli/latest/reference/batch/terminate-job.html
For all of the previous cases, the job remained in the queue afterwards, with the same status (RUNNABLE).
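For completeness, the bulk clean-up I am attempting looks roughly like this in boto3 (the queue name and reason are placeholders):

import boto3

batch = boto3.client("batch")

# "monte-carlo-queue" is a placeholder for my actual job queue name.
runnable = batch.list_jobs(jobQueue="monte-carlo-queue", jobStatus="RUNNABLE")

for job in runnable["jobSummaryList"]:
    # terminate_job is documented to also cancel jobs that have not started yet.
    batch.terminate_job(jobId=job["jobId"], reason="Incorrect simulation parameters")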
Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a Lambda function execution whose Python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that multiple different Glue jobs run in parallel at the same time. How can I make it so that only one Glue job runs at any point in time, and any Glue jobs sent for execution wait in a queue until the currently running Glue job has finished?
You can use a Step Functions state machine and specify the job you want to run in each step. That gives you control over when jobs run: once step one completes, it calls the step two job, and so on.
If you are looking for a job queue so that the Glue jobs trigger in sequence, you may consider using a combination of SQS -> Lambda -> Glue jobs. Please refer to this SO answer for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
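To make the SQS -> Lambda -> Glue idea concrete, a rough sketch of the Lambda logic could look like this. The job name and argument name are placeholders, and it assumes the queue redrives messages when the invocation fails; it is not a tested implementation:

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Check whether a run of the (placeholder) job is already in progress.
    runs = glue.get_job_runs(JobName="my-glue-job", MaxResults=10)
    busy = any(r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING")
               for r in runs["JobRuns"])

    if busy:
        # Failing the invocation returns the SQS message to the queue,
        # so it is retried after the visibility timeout.
        raise RuntimeError("A Glue job run is already in progress; retry later")

    for record in event["Records"]:
        glue.start_job_run(JobName="my-glue-job",
                           Arguments={"--input_key": record["body"]})

Setting the SQS trigger's batch size to 1 keeps it to one Glue run per message.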
I have an AWS Glue job with the Spark UI enabled by following these instructions: Enabling the Spark UI for Jobs
The Glue job has s3:* access to the arn:aws:s3:::my-spark-event-bucket/* resource. But for some reason, when I run the Glue job (and it successfully finishes within 40-50 seconds and successfully generates the output parquet files), it doesn't generate any Spark event logs to the destination S3 path. I wonder what could have gone wrong and whether there is any systematic way for me to pinpoint the root cause.
How long is your Glue job running for?
I found that jobs with short execution times, less than or around 1 minute, do not reliably produce Spark UI logs in S3.
The AWS documentation states "Every 30 seconds, AWS Glue flushes the Spark event logs to the Amazon S3 path that you specify." The fact that short jobs do not produce Spark UI logs probably has something to do with this.
If you have a job with a short execution time, try adding additional steps to the job, or even a pause/wait, to lengthen the execution time. This should help ensure that the Spark UI logs are sent to S3.
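For example (just an illustration, not an official workaround), padding the end of a very short Glue script should push it past the ~30-second flush interval:

import time

# ... existing job logic ...

# Keep the job alive long enough for Glue to flush the Spark event logs to S3.
time.sleep(60)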
In my architecture, when a new file arrives in an S3 bucket, a Lambda function triggers an ECS task.
The problem occurs when I receive multiple files at the same time: the Lambda will trigger multiple instances of the same ECS task, which act on the same shared resources.
I want to ensure only one instance of a specific ECS task is running at a time. How can I do that?
Is there a specific setting that can ensure this?
I tried to query the ECS cluster before running a new instance of the ECS task, but (using the AWS Python SDK) I didn't receive any information while the task was in PROVISIONING status; the SDK only returns data once the task is in PENDING or RUNNING.
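For context, the pre-launch check I described looks roughly like this ("my-cluster" and "my-task" are placeholders):

import boto3

ecs = boto3.client("ecs")

# List instances of the task that the cluster currently reports.
arns = ecs.list_tasks(cluster="my-cluster", family="my-task",
                      desiredStatus="RUNNING")["taskArns"]
tasks = ecs.describe_tasks(cluster="my-cluster", tasks=arns)["tasks"] if arns else []

for task in tasks:
    print(task["taskArn"], task["lastStatus"])

# In my tests nothing shows up while a task is still PROVISIONING, so two
# Lambda invocations close together can both decide to start a new task.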
Thank you
I don't think you can control that, because your S3 event will keep triggering new tasks. Checking whether the task is already running is harder than it sounds, and you might miss executions if you receive a lot of files.
You should think about this differently. If you want only one task processing at a time, forget about triggering the ECS task directly from the S3 event. It might work better if you implement a queue: your S3 event should add the information (via Lambda, maybe?) to an SQS queue.
From there you can have an ECS service doing SQS long polling and processing one message at a time.
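A minimal sketch of that worker loop, assuming a placeholder queue URL and processing function, could look like this:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/new-file-events"  # placeholder

def process(body):
    # ... your existing file-processing logic ...
    pass

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=1,   # one file at a time
                               WaitTimeSeconds=20)      # long polling
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing so failures are retried.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])

Running this as an ECS service with a desired count of 1 guarantees a single consumer.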
I have a Dataflow pipeline running on GCP which reads messages from Pub/Sub and writes to a GCS bucket. My Dataflow pipeline was cancelled by some user, and I want to know who that user is.
You can view all Step Logs for a pipeline step in Stackdriver Logging by clicking the Stackdriver link on the right side of the logs pane.
Here is a summary of the different log types available for viewing from the Monitoring→Logs page:
job-message logs contain job-level messages that various components of Cloud Dataflow generate. Examples include the autoscaling configuration, when workers start up or shut down, progress on the job step, and job errors. Worker-level errors that originate from crashing user code and that are present in worker logs also propagate up to the job-message logs.
worker logs are produced by Cloud Dataflow workers. Workers do most of the pipeline work (for example, applying your ParDos to data). Worker logs contain messages logged by your code and Cloud Dataflow.
worker-startup logs are present on most Cloud Dataflow jobs and can capture messages related to the startup process. The startup process includes downloading a job's jars from Cloud Storage, then starting the workers. If there is a problem starting workers, these logs are a good place to look.
shuffler logs contain messages from workers that consolidate the results of parallel pipeline operations.
docker and kubelet logs contain messages related to these public technologies, which are used on Cloud Dataflow workers.
As mentioned in a previous comment, you should filter by the pipeline ID; the actor of the cancellation will be shown in the AuthenticationEmail entry.
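If you prefer to query this programmatically, a rough sketch with the Cloud Logging Python client could look like the following. The filter (admin-activity audit log plus the Dataflow job ID) and the payload field names are assumptions you may need to adjust for your project:

from google.cloud import logging as gcp_logging

client = gcp_logging.Client()

# Assumed filter: admin-activity audit entries for a specific Dataflow job.
log_filter = (
    'logName:"cloudaudit.googleapis.com%2Factivity" '
    'AND resource.type="dataflow_step" '
    'AND resource.labels.job_id="YOUR_JOB_ID"'
)

for entry in client.list_entries(filter_=log_filter):
    payload = entry.payload if isinstance(entry.payload, dict) else {}
    auth = payload.get("authenticationInfo", {})
    print(entry.timestamp, auth.get("principalEmail"))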