I am in the middle of developing a web-based application that depends heavily on job scheduling. The jobs will be extremely short, such as a single HTTP request, but there will be lots of them: more than several thousand jobs may be scheduled every single day, though not all at the same time. My first inclination was to use crontab to schedule these jobs, but I am not sure if this is the best solution.
I see crontab mainly being used to schedule work-intensive administrative tasks, not very short jobs. Is crontab even suitable for that? Can it handle such a large number of jobs? Should I implement a custom solution? Are there any services out there that may provide a better solution and performance?
Thank you very much!
Cron is what I use for my personal website. Sure, cron has the ability to run long-running tasks, but it should by no means be limited to that.
Cron has a resolution of one minute; it only wakes up once per minute to see if anything should be run. If you need something with tighter resolution, you'll need a custom solution.
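If it came to that, even the standard library can get you a long way. Here is a rough sketch of a tiny in-process scheduler using Python's sched module; the endpoint URL and the 10-second interval are placeholders, not anything from your setup:

    import sched
    import time
    import urllib.request

    def run_job():
        # Hypothetical short job: a single HTTP request to a placeholder URL.
        urllib.request.urlopen("https://example.com/ping", timeout=10).read()

    scheduler = sched.scheduler(time.time, time.sleep)

    def run_and_reschedule(interval_seconds):
        # Run the job, then queue the next run, giving sub-minute resolution.
        run_job()
        scheduler.enter(interval_seconds, 1, run_and_reschedule, (interval_seconds,))

    scheduler.enter(10, 1, run_and_reschedule, (10,))  # first run 10 seconds from now
    scheduler.run()  # blocks and keeps firing every 10 seconds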
Also, if you are doing this on OS X you will be using launchd rather than cron. (cron is still supported, though).
My company makes CloudQuartz (www.thecloudblocks.com), which lets you schedule jobs through an API and get callbacks when they are due to run.
We built it so we could schedule jobs across a cluster of servers, which was not possible using cron or the Windows scheduler.
I'm currently working on a POC, focusing primarily on Dataflow for ETL processing. I have created the pipeline using the Dataflow 2.1 Java Beam API; it takes about 3-4 minutes just to initialize and about 1-2 minutes to terminate on each run, while the actual transformation (ParDo) takes less than a minute. I tried running the jobs using different approaches:
Running the job on local machine
Running the job remotely on GCP
Running the job via Dataflow template
But it looks like all of the above approaches take more or less the same time for initialization and termination, so this is becoming a bottleneck for the POC, as we intend to run hundreds of jobs every day.
I'm looking for a way to share the initialization/termination time across all jobs so that it becomes a one-time activity, or for any other approach that reduces this overhead.
Thanks in advance!
From what I know, there is no way to reduce startup or teardown time. You shouldn't consider that a bottleneck, as each run of a job is independent of the last one, so you can run them in parallel, etc. You could also consider converting this to a streaming pipeline, if that's an option, to eliminate those times entirely.
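To give a feel for the streaming option, here is a sketch using the Beam Python SDK (the question uses the Java SDK, but the shape is the same) with a hypothetical Pub/Sub topic; the worker stays resident, so the startup/teardown cost is paid once rather than on every job:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class ProcessRecord(beam.DoFn):
        # Stand-in for the ParDo transformation described in the question.
        def process(self, element):
            yield element.decode("utf-8").upper().encode("utf-8")

    options = PipelineOptions(streaming=True)  # keep the pipeline resident

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "ReadJobs" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/jobs")      # placeholder topic
         | "Transform" >> beam.ParDo(ProcessRecord())
         | "WriteResults" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/results"))  # placeholder topic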
I have a task that gathers some information from several websites and saves it to disk. I want this task to run on a daily basis and automatically.
I took a little tour of Google Cloud Platform, but couldn't understand how to fit this service to my needs.
I would really like it if someone could suggest some key-points/main guidelines on how it should be done.
Thanks!
The easiest way to run time-based or scheduled jobs is via a Linux cron job (https://help.ubuntu.com/community/CronHowto).
You can set up your scripts to be run at a specific time or interval and it should work. A checklist for you:
Bash scripts of tasks you want to perform
Cron jobs that are scheduled to run these scripts at specified time intervals
That should do it.
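For instance, a sketch of such a script (here in Python rather than bash, standard library only; the URLs, output directory, and crontab schedule are placeholders) might look like this:

    #!/usr/bin/env python3
    # Hypothetical daily fetch script. A crontab entry such as
    #   0 6 * * * /usr/bin/python3 /home/me/fetch_sites.py
    # would run it every morning at 06:00.
    import urllib.request
    from datetime import date
    from pathlib import Path

    SITES = ["https://example.com", "https://example.org"]  # placeholder URLs
    OUT_DIR = Path("/home/me/site-dumps")                    # placeholder output directory

    def main():
        OUT_DIR.mkdir(parents=True, exist_ok=True)
        for url in SITES:
            html = urllib.request.urlopen(url, timeout=30).read()
            name = url.split("//", 1)[1].replace("/", "_")
            (OUT_DIR / f"{name}-{date.today()}.html").write_bytes(html)

    if __name__ == "__main__":
        main()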
I have these tables
exam(id, start_date, deadline, duration)
exam_answer(id, exam_id, answer, time_started, status)
Possible values for exam_answer.status are: 0 = not yet started, 1 = started, 2 = submitted.
Is there a way to update exam_answer.status once now - exam_answer.time_started is greater than exam.duration, or once it is already past the deadline?
I'll also mention, in case it helps: I'm building this for a Django project.
Django applications, like any other WSGI/web applications, are only meant to handle request-response flows. If there aren't any requests, there is no activity, and such changes will not happen.
You could write a custom management command that's executed periodically by a cron job, but you run the risk of displaying incorrect data between runs. You also have elegant means at your disposal to compute the statuses before any related views start their processing, but that is potentially a wasteful use of resources.
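To illustrate the management-command route, here is a rough sketch; the app name, the model classes, and the assumption that exam.duration is a DurationField are all guesses based on the tables in the question:

    # yourapp/management/commands/expire_exam_answers.py  (hypothetical path)
    from django.core.management.base import BaseCommand
    from django.db.models import DateTimeField, ExpressionWrapper, F
    from django.utils import timezone

    from yourapp.models import ExamAnswer  # hypothetical app and model names

    STARTED, SUBMITTED = 1, 2

    class Command(BaseCommand):
        help = "Mark started exam answers as submitted once their time runs out."

        def handle(self, *args, **options):
            now = timezone.now()

            # Answers whose exam deadline has passed.
            past_deadline = ExamAnswer.objects.filter(status=STARTED, exam__deadline__lt=now)

            # Answers whose own time limit has run out (assumes duration is a DurationField).
            out_of_time = (
                ExamAnswer.objects.filter(status=STARTED)
                .annotate(expires_at=ExpressionWrapper(
                    F("time_started") + F("exam__duration"),
                    output_field=DateTimeField(),
                ))
                .filter(expires_at__lt=now)
            )

            updated = past_deadline.update(status=SUBMITTED)
            updated += ExamAnswer.objects.filter(pk__in=out_of_time.values("pk")).update(status=SUBMITTED)
            self.stdout.write(f"Marked {updated} answers as submitted")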
Your best bet might be to integrate a task scheduler with your application, such as Celery. Do not be discouraged by the fact that Celery normally runs as a concurrent, multi-process service across several machines; it can be configured to run single-threaded, and it provides a clean interface for scheduling tasks that have to run at some exact point in the future.
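As a sketch of what the Celery route could look like with a beat schedule (the task name, module paths, and the every-minute schedule are assumptions; the task can simply reuse the management command sketched above):

    # yourapp/tasks.py  (hypothetical module)
    from celery import shared_task
    from django.core.management import call_command

    @shared_task
    def expire_exam_answers():
        # Reuse the status-update logic from the management command above.
        call_command("expire_exam_answers")

    # settings.py: have celery beat fire the task every minute
    from celery.schedules import crontab

    CELERY_BEAT_SCHEDULE = {
        "expire-exam-answers": {
            "task": "yourapp.tasks.expire_exam_answers",
            "schedule": crontab(),  # crontab() with no arguments means every minute
        },
    }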
I'd like to be able to create a "job" that will execute at an arbitrary time from now... let's say 1 year from now. I'm trying to come up with a stable, distributed system that doesn't rely on me maintaining a server and scheduling code. (Obviously, I'll have to maintain the servers that execute the job.)
I realize I can poll simpleDB every few seconds and check to see if there's anything that needs to be executed, but this seems very inefficient. Ideally I could create an Amazon SNS topic that would fire off at the appropriate time, but I don't think it's possible.
Alternatively, I could create a message in the Amazon SQS that would not be visible for 1 year. After 1 year, it becomes visible and my polling code picks up on it and executes it.
It would seem this is a topic, like singletons or inversion of control, that PhDs have discussed and come up with best practices for. I can't find the articles, if there are any.
Any ideas?
Cheers!
The easiest way for most people to do this would be to keep an EC2 server running with a cron job on it to trigger the action. However, the cost of running an EC2 server 24 hours a day for a year just to trigger an action would be around $170 at the cheapest (8G t1.micro with a Heavy Utilization Reserved Instance). Plus, you have to monitor that server and recover from failures.
I have sketched out a different approach to running jobs on a schedule that uses AWS resources completely. It's a bit more work, but does not have the expense or maintenance issues with running an EC2 instance.
You can set up an Auto Scaling schedule (cron format) to start an instance at some point in the future, or on a recurring schedule (e.g., nightly). When you set this up, you specify the job to be run in a user-data script for the launch configuration.
I've written out sample commands in the following article, along with special settings you need to take care of for this to work with Auto Scaling:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
With this approach, you only pay for the EC2 instance hours when the job is actually running and the server can shut itself down afterwards.
This wouldn't be a reasonable way to schedule tens of thousands of emails with an individual timer for each, but it can make a lot of sense for large, infrequent jobs (a few times a day to once per year).
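As a rough sketch of the scheduling piece with today's boto3 API (the group name, action name, and timing are placeholders; the article above covers the rest of the setup):

    from datetime import datetime, timedelta, timezone

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # placeholder region

    # Assumes an Auto Scaling group named "one-shot-job" already exists, whose launch
    # configuration carries a user-data script that runs the job and then shuts the
    # instance down, as described above.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="one-shot-job",        # placeholder group name
        ScheduledActionName="run-in-one-year",      # placeholder action name
        StartTime=datetime.now(timezone.utc) + timedelta(days=365),  # one-off run a year out
        MinSize=0,
        MaxSize=1,
        DesiredCapacity=1,                          # bring up one instance at that time
    )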
I think it really depends on what kind of job you want to execute in 1 year, and whether that value (1 year) is actually hypothetical. There are many ways to schedule a task; Windows and Linux both offer a service for it, Windows with Task Scheduler and Linux with crontab. In addition to those operating-system-specific solutions, you can use maintenance tasks on MSSQL Server, and I'm sure many of the larger databases have similar features.
Without knowing more about what you plan on doing, it's kind of hard to suggest any more alternatives, since I think many of the other solutions would be specific to the technologies and platforms you plan on using. If you want to provide some more insight into what you're going to be doing with these tasks, then I'd be more than happy to expand my answer to be more helpful.
One of my view functions is a very long processing job and clearly needs to be handled differently.
Instead of making the user wait a long time, it would be best if I were able to launch the processing job, which would email the results, and without waiting for completion notify the user that their request is being processed and let them browse on.
I know I can use os.fork, but I was wondering if there is a 'right way' in terms of Django. Perhaps I can return the HTTP response and then carry on with this job somehow?
There are a couple of solutions to this problem, and the best one depends a bit on how heavy your workload will be.
If you have a light workload, you can use the approach used by django-mailer, which is to define a "jobs" model, save new jobs into the database, then have cron run a stand-alone script every so often to process the jobs stored in the database (deleting them when done). You can use something like django-chronograph to manage the job scheduling more easily.
If you need help understanding how to write a script to process the jobs, see James Bennett's article Standalone Django Scripts.
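The pattern itself is tiny; a bare-bones sketch (the field names and the do_the_work function are placeholders, not django-mailer's actual schema):

    # models.py  (hypothetical)
    from django.db import models

    class Job(models.Model):
        payload = models.TextField()                       # whatever the job needs
        created = models.DateTimeField(auto_now_add=True)

    # processing script run from cron every few minutes
    def do_the_work(payload):
        ...  # placeholder for the actual long-running processing

    def process_pending_jobs():
        for job in Job.objects.order_by("created"):
            do_the_work(job.payload)
            job.delete()  # delete the job when done, as described above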
If you have a very high workload, meaning you'll need more than a single server to process the jobs, then you want a real distributed task queue. There is a lot of competition here, so I can't really detail all the options, but a good one to use with Django apps is Celery.
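With Celery, the view just enqueues a task and returns immediately; a minimal sketch (the task, the processing function, and the email details are placeholders):

    # tasks.py  (hypothetical module)
    from celery import shared_task
    from django.core.mail import send_mail

    def expensive_processing(data):
        ...  # placeholder for the long-running work

    @shared_task
    def process_and_email(user_email, data):
        result = expensive_processing(data)
        send_mail("Your results are ready", str(result),
                  "noreply@example.com", [user_email])

    # views.py
    from django.http import HttpResponse

    def kick_off(request):
        process_and_email.delay(request.user.email, request.POST.get("data"))
        return HttpResponse("Your request is being processed; the results will be emailed to you.")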
Why not simply start a thread to do the processing and then go on to send the response?
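That can be as simple as the sketch below (heavy_work is a placeholder); keep in mind the thread lives inside the web process, so it only suits best-effort work:

    import threading
    from django.http import HttpResponse

    def heavy_work(data):
        ...  # placeholder for the processing and the results email

    def kick_off_in_thread(request):
        # Hand the work to a background thread, then respond immediately.
        threading.Thread(target=heavy_work, args=(request.POST.get("data"),)).start()
        return HttpResponse("Your request is being processed.")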
Before you select a solution, you need to determine how the process will be run. That is: is it the same process for every single user, with the same data, so that it can be scheduled regularly? Or does each user request something different, with slightly different results?
As an example, if the data will be the same for every single user and the job can be run on a schedule, you could use cron.
See: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
or
http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
However, if the requests will be ad hoc and you need something scalable that can handle high load and is asynchronous, what you are actually looking for is a message queuing system. Your view will add a request to the queue, which will then get acted upon.
There are a few options to implement this in Django:
Django Queue Service is pure Django and Python and is simple, though the last commit was in April and it seems the project has been abandoned.
http://code.google.com/p/django-queue-service/
The second option: if you need something that scales, is distributed, and makes use of open-source message queuing servers, Celery is what you need:
http://ask.github.com/celery/introduction.html
http://github.com/ask/celery/tree