Is it possible to modify or add a layer to SLURM scheduling?

I am a non-paying user on a computing cluster that uses SLURM.
Occasionally, I've had long-running and multiple jobs that clogged up the queue for paying users, and because of this I've had jobs cancelled by the admin. Currently I have a cap on the number of nodes available to me. While I don't argue with the equity of this arrangement, it is a problem for me in terms of getting work done, especially because I can see free nodes that are not running any jobs while my own jobs sit waiting to pass through the node cap.
With that as background info, here are my two questions:
Isn't it possible for the admin to suspend and then resume jobs (either a single job, all of a user's jobs, or a set of jobs)? Is this suspend/resume onerous from the admin's perspective?
I suppose it should be possible to create a list of paying vs. non-paying users, and when a paying user submits a job with sbatch, to automatically instruct SLURM to suspend the non-paying user's job or jobs and resume them once the paying user's jobs have completed. Is this even possible? If yes, is it outside the skill scope of regular SLURM / Farm admins?
Could someone please suggest any other solutions (if what I have asked above are unreasonable or absurd)?
Thank you!

The admin can run scontrol suspend jobid and then scontrol resume jobid.
The keywords here are 'QOS' and 'preemption'. Typically a QOS is created for the paying users that has preemptive rights over the normal QOS. Jobs of the non-paying users can then be cancelled, checkpointed, requeued, or suspended.
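For concreteness, here is a rough sketch of what that setup could look like; the QOS names, user names, and exact slurm.conf lines are assumptions and would need to be adapted to the site's configuration:

```
# slurm.conf (assumed settings; requires accounting/QOS support)
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG

# Create a QOS for paying users and give it preemptive rights over
# the default 'normal' QOS (which usually exists already)
sacctmgr add qos paid
sacctmgr modify qos paid set Preempt=normal
sacctmgr modify user alice set QOS=paid        # 'alice' is a placeholder

# Manual suspend/resume of a single job
scontrol suspend 12345
scontrol resume 12345

# One way to suspend every running job of a given user
squeue -u bob -h -t R -o %i | xargs -r -n1 scontrol suspend
```

With preemption configured through QOS, SLURM performs the suspend/resume itself when a higher-priority job needs the resources, so the admin does not have to intervene job by job.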

Solution for two-node NestJS cron service

I have a NestJS cron service that is deployed to two nodes in UAT/production.
If I have a record that is scheduled to be created at 3 pm, both nodes will fire off at the same time, resulting in a duplicate record in the DB.
I am unable to use Redis as it would cost money to provision.
I have considered locking the table as well.
Does anyone have ideas on how I can solve this issue?
Thank you!
Ideally you would create a lock in shared memory and run the job only when the lock is acquired.
You can replicate that behaviour using the DB. Create a table that stores a lock and allow the cron job to fire only if it is able to acquire the lock; otherwise skip. This method is fine if you're okay with either of the two nodes executing the job (see the sketch below).
If you want to implement round-robin-like execution, you can use the lock along with the ID of the service that executed previously, and grant the lock to a different ID. But you'll have to ensure there is more than one ID.
If you are fixed on two nodes, a very simple approach would be to offset the first run of one instance by 3 hours and let each instance run the cron job once every 6 hours.
You may look into the different ways concurrency is handled at an OS level and implement it.
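As an illustration of the DB-lock idea above, here is a minimal sketch of the pattern. It uses Python and SQLite purely so it runs standalone; the table name, job name, and lease length are made up, and in the real setup the same UPDATE would be issued against the shared UAT/production database from the NestJS cron handler.

```
# Minimal sketch of a DB-based lock: both instances run this when the cron
# fires, but only the one whose UPDATE matches the row actually does the work.
import sqlite3
import time

LOCK_NAME = "create-record-job"   # hypothetical job name
LEASE_SECONDS = 300               # how long the winner holds the lock

def try_acquire_lock(conn: sqlite3.Connection) -> bool:
    now = int(time.time())
    # Atomic compare-and-set: only one instance's UPDATE will match the row.
    cur = conn.execute(
        "UPDATE cron_lock SET locked_until = ? "
        "WHERE name = ? AND locked_until <= ?",
        (now + LEASE_SECONDS, LOCK_NAME, now),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect("shared.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS cron_lock (name TEXT PRIMARY KEY, locked_until INTEGER)"
)
conn.execute("INSERT OR IGNORE INTO cron_lock VALUES (?, 0)", (LOCK_NAME,))
conn.commit()

if try_acquire_lock(conn):
    print("lock acquired - this instance creates the record")
else:
    print("another instance holds the lock - skipping this run")
```

Because the UPDATE is atomic, only one instance's statement modifies the row for a given firing; the other sees a rowcount of 0 and skips.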

Cron Jobs vs Task Scheduler table for scheduled emails

Preamble: I have a web app whose backend is based on a serverless architecture. It's basically an Amplify app hosted on AWS with a DynamoDB database. I've learnt it's possible to create a task scheduling system of sorts (more here). A quick summary of the article: "It's possible to create a task scheduling table taking advantage of TTL and DynamoDB streams to execute a Lambda function at specific times. The TTL specifies a set time for a record to be deleted; we can capture this delete event in a DynamoDB stream and run some tasks based on information from the stream."
Problem:
The goal is to send a series of emails to users who sign up for our service. Each user that signs up gets a series of "Getting Started" emails. The first of the emails is sent 24 hours after a user signs up, the second 3 days later and the third exactly 7 days after sign up.
I see how a cron job would be suitable here, but it just seems a bit inefficient to me. I would basically have to search the users table for users whose sign-up time falls within a specific 24-hour window and send the email to those users, whereas with a task scheduler table I could add a task to the table (something like "send first email to user300", with a TTL of when I want it to be sent) and listen for delete events to run the task. No need to run a cron job daily, just a function that handles each task as it comes.
I think this is more of a performance vs. storage problem. Having a task scheduler table would take up space: if we add all the emails to be sent to a user as tasks on the table (each email to be sent to a specific user is its own task) each time a user signs up, then the task scheduler table grows by 3n records for every n users signed up. But this may not really be a problem, as tasks are deleted after they are run. I do not know the performance cost of using a cron job for this particular task, hence I'm here. I may also be wrong, and the cost of running and updating this task scheduler table may be more than that of the cron job.
I initially thought of setting up a dummy user table and running both the cron and the task scheduler and documenting cost of running both, but you can imagine how much time and effort that would take.
So I guess my question is which is a more efficient solution in terms of performance and cost?
There is no perfect solution here. Keep in mind that DynamoDB TTL can take up to 48 hours to invoke, so it's probably unacceptable here. Cron jobs with Lambda are cheap and easy to set up. You could also use SQS and populate it with a daily cron. Yan Cui wrote a great article about this problem: https://theburningmonk.com/2019/03/dynamodb-ttl-as-an-ad-hoc-scheduling-mechanism/
This may not exactly be an answer. Based on the Medium article you linked, the author had a plausible reason why the TTL and DynamoDB streams would be better than a cron job, which you reiterated. Setting up a cron job is easier and cheaper (free), and I doubt the performance will be much worse unless the database is huge. I don't have any experience doing something like this, so I wouldn't know how large the database would have to be for it to make sense to switch over. Alternatively, you can have as many cron jobs as you want, so you could also just set up a user-specific cron job whenever someone signs up.
You can set up a CloudWatch Events rule to fire a Lambda function on a regular schedule. The Lambda function can search a database for an applicable result set and perform other actions: send an email, a text message, etc.
Here is an AWS tutorial that covers a very similar use case with step-by-step instructions. This tutorial is implemented using the AWS Java API (but you can implement it using other supported programming languages).
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/usecases/creating_scheduled_events
From a cost perspective, Lambda allows 1M free requests per month. Details are here: https://aws.amazon.com/lambda/pricing/
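Not from that tutorial, but as a rough sketch of the scheduled-Lambda approach in Python: a daily CloudWatch Events/EventBridge schedule invokes this handler, which looks up users who signed up roughly a day ago and emails them via SES. The table name, attribute names, and addresses are assumptions for illustration only.

```
# Sketch of a daily scheduled Lambda: find users who signed up ~24h ago
# and send them the first "Getting Started" email via SES.
import datetime
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
ses = boto3.client("ses")
users_table = dynamodb.Table("Users")        # hypothetical table name

def handler(event, context):
    now = datetime.datetime.utcnow()
    start = (now - datetime.timedelta(days=2)).isoformat()
    end = (now - datetime.timedelta(days=1)).isoformat()

    # A scan is fine for a small table; a GSI on signUpDate would scale better.
    resp = users_table.scan(
        FilterExpression=Attr("signUpDate").between(start, end)
    )
    for user in resp.get("Items", []):
        ses.send_email(
            Source="welcome@example.com",              # assumed sender
            Destination={"ToAddresses": [user["email"]]},
            Message={
                "Subject": {"Data": "Getting started"},
                "Body": {"Text": {"Data": "Thanks for signing up!"}},
            },
        )
```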

What is the best strategy for implementing scheduled payment in Django

I wanted to implement scheduled jobs/actions in my Django backend.
The actions are basically deducting a monthly recurring payment from the customer, sending a payment link, say, 10 days in advance, and so on. The dates will be based on when the user buys the subscription.
I have never implemented scheduled jobs before. I know there are some options like cron tabs and Celery.
I wanted to know what the best strategy/tool for scheduled payments would be.
So basically what I think I will do is run the scheduled job every day at a particular time, check the eligible candidates, and run the payment module.
Is this strategy of running jobs every day correct? Are there any better methods available? Is there a way for jobs to run automatically when, say, the customer's new billing cycle arrives?
Yes, the strategy you are following is correct. You can use Celery, Redis, and crontab to execute the payment system.
Firstly, you can specify the schedules using crontab. Also, the .delay() function will help you trigger jobs whenever a customer's new billing cycle arrives.
So the flow will be: tasks get triggered when a new billing cycle arrives, using .delay().
Then the Celery worker will register the task and the schedules. You can use Celery beat to run the tasks periodically, and Redis as the message queue.
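A minimal sketch of that setup follows; the broker URL, task names, and the way due subscriptions are looked up are assumptions, not a drop-in implementation.

```
# Sketch: Celery app with a beat schedule that checks due subscriptions daily
# and fans each one out to a payment task via .delay().
from celery import Celery
from celery.schedules import crontab

app = Celery("payments", broker="redis://localhost:6379/0")  # assumed broker

# Celery beat fires this task every day at 02:00.
app.conf.beat_schedule = {
    "charge-due-subscriptions": {
        "task": "tasks.charge_due_subscriptions",
        "schedule": crontab(hour=2, minute=0),
    },
}

def find_due_subscription_ids():
    # Placeholder: in Django this would be a queryset, e.g.
    # Subscription.objects.filter(next_billing_date=date.today())
    return []

@app.task(name="tasks.charge_payment")
def charge_payment(subscription_id):
    # Call the payment gateway / payment module here.
    ...

@app.task(name="tasks.charge_due_subscriptions")
def charge_due_subscriptions():
    for subscription_id in find_due_subscription_ids():
        charge_payment.delay(subscription_id)   # queued on Redis for a worker
```

Celery beat reads beat_schedule and enqueues charge_due_subscriptions once a day; that task then issues one charge_payment.delay() call per due subscription, which the worker picks up from Redis.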
Read about .delay() here
Read about celery configuration setting here
Read about setting a task scheduler using celery here

Scheduling strategy behind AWS Batch

I am wondering what the scheduling strategy behind AWS Batch looks like. The official documentation on this topic doesn't provide many details:
The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
(https://docs.aws.amazon.com/batch/latest/userguide/job_scheduling.html)
"Approximately" FIFO is quite vague, especially as the execution order I observed when testing AWS Batch didn't look like FIFO.
Did I miss something? Is there a possibility to change the scheduling strategy, or to configure Batch to execute the jobs in the exact order in which they were submitted?
I've been using Batch for a while now, and it has always seemed to behave in roughly a FIFO manner. Jobs that are submitted first will generally be started first, but because of limitations with distributed systems, this general rule won't work out perfectly. Jobs with dependencies are kept in the PENDING state until their dependencies have completed, and then they go into the RUNNABLE state. In my experience, whenever Batch is ready to run more jobs from the RUNNABLE state, it picks the job with the earliest time submitted.
However, there are some caveats. First, if Job A was submitted first but requires 8 cores while Job B was submitted later but only requires 4 cores, Job B might be selected first if Batch has only 4 cores available. Second, after a job leaves the RUNNABLE state, it goes into STARTING while Batch downloads the Docker image and gets the container ready to run. Depending on a number of factors, jobs that were submitted at the same time may take longer or shorter in the STARTING state. Finally, if a job fails and is retried, it goes back into the PENDING state with its original time submitted. When Batch decides to select more jobs to run, it will generally select the job with the earliest submit date, which will be the job that failed. If other jobs have started before the first job failed, the first job will start its second run after the other jobs.
There's no way to configure Batch to be perfectly FIFO because it's a distributed system, but generally if you submit jobs with the same compute requirements spaced a few seconds apart, they'll execute in the same order you submitted them.
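There is no strict-FIFO switch that I know of, but if exact ordering matters for a small set of jobs, one workaround is to chain them with dependsOn (the same dependency mechanism mentioned above), so each job only becomes RUNNABLE after the previous one finishes. A rough boto3 sketch, with the queue and job definition names as placeholders:

```
# Sketch: enforce strict ordering of a few jobs by chaining dependencies.
import boto3

batch = boto3.client("batch")

previous_job_id = None
for name in ["step-1", "step-2", "step-3"]:      # hypothetical job names
    kwargs = {
        "jobName": name,
        "jobQueue": "my-queue",                  # assumed queue name
        "jobDefinition": "my-job-def",           # assumed job definition
    }
    if previous_job_id:
        # This job stays PENDING until the previous one completes.
        kwargs["dependsOn"] = [{"jobId": previous_job_id}]
    resp = batch.submit_job(**kwargs)
    previous_job_id = resp["jobId"]
```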

Using any of the Amazon Web Services, how could I schedule something to happen 1 year from now?

I'd like to be able to create a "job" that will execute at an arbitrary time from now... let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers that execute the job.)
I realize I can poll SimpleDB every few seconds and check to see if there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think that's possible.
Alternatively, I could create a message in Amazon SQS that would not be visible for 1 year. After 1 year, it becomes visible and my polling code picks up on it and executes it.
It would seem this is a topic, like Singletons or Inversion of Control, that PhDs have discussed and come up with best practices for. I can't find the articles, if there are any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to run at least an EC2 server with a cron job on the EC2 server to trigger an action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger an action would be around $170 at the cheapest (8G t1.micro with Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources completely. It's a bit more work, but does not have the expense or maintenance issues with running an EC2 instance.
You can set up an Auto Scaling schedule (cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script for the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
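For reference, here is roughly what a one-off scheduled action could look like with boto3; the group name is a placeholder, and the Auto Scaling group is assumed to already exist with a launch configuration whose user-data script runs the job and shuts the instance down afterwards, as described above.

```
# Sketch: schedule an Auto Scaling group to launch one instance a year from
# now; the instance's user-data script performs the job and terminates itself.
import datetime
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="yearly-job-asg",       # assumed, pre-existing group
    ScheduledActionName="run-yearly-job",
    StartTime=datetime.datetime.utcnow() + datetime.timedelta(days=365),
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)
```

The same call accepts a Recurrence parameter in cron format if you want the job to repeat instead of running once.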
I think it really depends on what kind of job you want to execute in 1 year and whether that value (1 year) is actually hypothetical. There are many ways to schedule a task; Windows and Linux both offer a service to schedule tasks, Windows with Task Scheduler and Linux with crontab. In addition to those operating-system-specific solutions, you can use maintenance tasks on MS SQL Server, and I'm sure many of the larger databases have similar features.
Without knowing more about what you plan on doing, it's kind of hard to suggest any more alternatives, since I think many of the other solutions would be specific to the technologies and platforms you plan on using. If you provide some more insight into what you're going to be doing with these tasks, then I'd be more than happy to expand my answer to be more helpful.