How to set up website periodic tasks?

I'm not sure if the topic is appropriate to the question... Anyway, suppose I have a website, built with PHP or JSP, on a cheap hosting plan with limited functionality. More precisely, I don't own the server and I can't run daemons or services at will. Now I want to run periodic tasks (say every minute or two) to compute certain statistics. They are likely to be time consuming, so I can't just repeat the calculation for every user accessing a page. I could run the calculation when a user loads a page and enough time has passed since the last run, but then, if the calculation takes very long, the response time may become excessive and the request may time out (it's not likely, as I don't mean to run such long tasks, but I'm considering the worst case).
So given these constraints, what solutions would you suggest?

Almost every cheap hosting plan supports crontabs. Check the hosting packages first.
If not, trigger the task from the page via AJAX. That way your response time doesn't suffer, and the work is done in a separate request.

If you choose to use crontab, you'll need to know a bit more to execute your PHP scripts from it. Whether your PHP runs as CGI or as an Apache module affects how you invoke it. There's a good article on how to do this at:
http://www.htmlcenter.com/blog/running-php-scripts-with-cron/
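For illustration, a typical crontab entry for this looks something like the following (the binary path, script path, and schedule are placeholders, not taken from the article):

```
# run the statistics script every 2 minutes via the PHP CLI;
# adjust the php binary path and script path for your host
*/2 * * * * /usr/bin/php /home/youruser/public_html/cron/stats.php >> /home/youruser/stats.log 2>&1
```

If your PHP only runs as an Apache module, the common workaround is to have cron fetch the script over HTTP instead, e.g. with wget or curl.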
If you don't have access to crontab with your hosting provider (find a new one), there are other options. For example:
http://www.setcronjob.com/
It will call a script on your site remotely every X period; you have to renew it once a month, I think. If you take their paid service ($5/year according to the front page), I think the jobs you set up last until you cancel them or your paid term runs out.

In Java, you could just create a Timer. It can run a background thread that performs a given task every so often.
My preference would be to start the Timer in a ServletContextListener.contextInitialized() method, but you may have a more appropriate place.

Related

Is there a way to compute this amount of data and still serve a responsive website?

Currently I am developing a Django + React website that will (I hope) serve a decent number of users. The project demo is mostly complete, and I am starting to think about the scale required to put this thing into production.
The website essentially does three things:
Grab data from external APIs (e.g. Twitter) for 50,000 unique keywords (the keywords don't change). This process happens every 30 minutes.
Run computation on all of the data, and save that computation to the database. Assume that the algorithm is as optimized as possible
When a user visits the website it should serve a pretty graph/chart of all of the computational data per keyword
The issue is that this is far too intense a task to be done by the same application that serves the website; users would be waiting decades to see their data. My current plan is to have a separate API that services the website with the data, which the website can then store in its database. This separate API would process the data without fear of affecting users, and it should be able to finish each computation in under 30 minutes, in time for the next round of data.
Can anyone help me understand how I can better equip my project to handle the scale? I'd love some ideas.
As a 4th-year CS student, I figured it's time to put a real project out into the world, and I am very excited about it and the progress I've made so far. My main worry is that the end users will be negatively affected if I don't figure out some kind of pipeline to make this process happen.
To reiterate my idea:
Django + React - This is the forward facing website
External API - Grabs the data off the internet, processes it, and waits for a GET request from the website
Is there a better way to do this? Or on the other hand am I severely overestimating how computationally heavy this is.
Edit: Including current research
Handling computationally intensive tasks in a Django webapp
Separation of business logic and data access in django
What you want is to have the computation task executed by a separate process in the background.
The most straightforward and popular solution is to use Celery; see here.
The Celery worker(s) - which perform the background task - can either run on the same machine as the web application or (when scale becomes an issue) be reconfigured to run on an entirely different machine.
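As a minimal sketch of what that might look like (the module name, broker URL, and task body are illustrative assumptions, not taken from the question):

```python
# tasks.py - a minimal Celery setup; the broker URL and names are
# placeholders, adapt them to your project
from celery import Celery

app = Celery("keywords", broker="redis://localhost:6379/0")

@app.task
def recompute_keyword_stats():
    # fetch the external API data, run the heavy computation,
    # and save the results to the database for the views to read
    ...

# celery beat can then trigger the task every 30 minutes:
app.conf.beat_schedule = {
    "recompute-every-30-min": {
        "task": "tasks.recompute_keyword_stats",
        "schedule": 30 * 60,  # seconds
    },
}
```

The web process then only reads precomputed results from the database, so user-facing requests stay fast.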

Do schedulers slow down your web application?

I'm building a jewellery e-commerce store, and one feature I'm building is incorporating current market gold prices into the final product price. I intend to update the gold price every 3 days by making calls to an API using a scheduler, so the full process is automated and requires minimal interaction with the system.
My concern is this: will a scheduler, with one task executed every 72 hrs, slow down my server (using client-server model) and affect its performance?
The server-side application is built using Django, Django REST framework, and PostgreSQL.
The scheduler I have in mind is Advanced Python Scheduler.
As far as I can see from the docs of Advanced Python Scheduler, it does not provide a separate process to run the scheduled tasks; that is left up to you to figure out.
The docs recommend a BackgroundScheduler, which runs in a separate thread.
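For reference, a BackgroundScheduler setup along those lines might look like this (the job body is a placeholder matching the question's use case):

```python
# a minimal APScheduler sketch; the job body is a placeholder
from apscheduler.schedulers.background import BackgroundScheduler

def update_gold_price():
    # call the external price API and store the result
    ...

scheduler = BackgroundScheduler()
scheduler.add_job(update_gold_price, "interval", days=3)
scheduler.start()  # runs jobs in a thread inside this same process
```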
Now there are multiple issues which could arise:
If you're running multiple Django instances (under gunicorn or uwsgi), the scheduler will run in each of those processes, so a job may fire more than once. This is a non-trivial problem to solve unless APScheduler has addressed it (you will have to check the docs).
BackgroundScheduler runs in a thread, but Python is limited by the GIL, so if your background tasks are CPU intensive, your Django process will get slower at handling incoming requests.
Thread or not, if your background job is CPU intensive and runs for a long time, it can affect your server's performance.
APScheduler seems like a much lower-level library, and my recommendation would be to use something simpler:
Simply use system cron jobs to run every 3 days. Create a Django management command, and use cron to execute it (see the sketch after this list).
Use Django-supported libraries like celery, rq/rq-scheduler/django-rq, or django-background-tasks.
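A rough sketch of the management-command approach (the command and app names are hypothetical):

```python
# myapp/management/commands/update_gold_price.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Fetch the current gold price and store it"

    def handle(self, *args, **options):
        # call the external price API and update product prices here
        self.stdout.write("gold price updated")
```

A crontab entry such as `0 3 */3 * * /path/to/venv/bin/python /path/to/project/manage.py update_gold_price` would then run it roughly every three days, entirely outside the web server process.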
I think it would be wise to take a look at https://github.com/arteria/django-background-tasks, as it is the simplest of these, with the least amount of setup required. Once you get a bit familiar with it, you can weigh the pros and cons of what is appropriate for your use case.
Once again, your server's performance depends on what your background task is doing and how long it lasts.

How many seconds may an API call take until I need to worry?

I have to ask a more or less non-typical SO question and hope you don't mind. Right now I am developing my very first web application. I set up an AJAX function that requests some data from a third-party API and populates my HTML containers with the returned data.
Right now I query one single object and populate 3 HTML containers, with around 15 lines of JavaScript code. When I activate the process/function by clicking a button on my frontend, it takes around 6-7 seconds until the HTML content is updated.
Is this a reasonable time? The user experience will honestly be more than bad, considering that I will have to query and manipulate far more data (I'm building a one-page dashboard for soccer data).
There might be very controversial answers to that question, but what would be a fair time for the process to run on standard infrastructure? 1-2 seconds? (I will deploy the app on Heroku or DigitalOcean and will implement a proper caching environment to handle regular visitors.)
Right now:
I use a virtualenv and django server for the development
and a demo server from the third party API which might be slowed down for whatever reason (how to check this?)
which might affect the current time needed (there will be many more variables, obviously).
Looking forward to your answers.
I personally think (and probably a lot of people would agree) that 6-7 seconds is a significant delay for rendering a small page. The cause of this issue might not come from Django directly. Check the following:
I use a virtualenv and django server for the development
You may be running the Django dev server; a production server might make things a bit faster (use django-debug-toolbar to find what's causing the delay).
Add database indexes in your models.
a demo server from the third party API which might be slowed down for whatever reason
Use the Chrome developer tools' Network tab to see how long that third-party call takes. It might not be visible there if you call the API in your views.py; in that case, add some timing code there to measure how long it takes to return (see the sketch below).
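For example, a quick way to time the third-party call inside a view (the endpoint URL and view name are hypothetical placeholders):

```python
# timing a third-party API call inside a Django view;
# the URL below is a hypothetical placeholder
import time

import requests
from django.http import JsonResponse

def dashboard_data(request):
    start = time.monotonic()
    resp = requests.get("https://api.example.com/soccer/stats")
    elapsed = time.monotonic() - start
    print(f"third-party call took {elapsed:.2f}s")  # or use the logging module
    return JsonResponse(resp.json(), safe=False)
```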

Scheduler, is it too heavy?

I have a scheduled task that is too heavy, so my question is: could a too-heavy scheduled task bring down the ColdFusion server? Sometimes my scheduled task exceeds the loop time limit. Anyway, I am looking for another way to do the same thing, but less heavily.
Well, a scheduled task is really just an automated call to a normal CF page request. Which raises the question: if you manually bring up the scheduled task URL in a browser window, does it time out there as well?
Remember that a scheduled task is called and run by the server, which means you can have different session, CGI, request, and form scope values than an actual user would. However, you can use the requestTimeout attribute of the CFSETTING tag to extend how long the page has to complete its work before it times out. The requestTimeout attribute takes a value representing the number of seconds after which, if there is no response from the server, CF considers the page unresponsive.
However, it all depends greatly on what your scheduled task is actually doing. There are all kinds of ways you could break the code into constituent parts for quicker processing. Figuring out what the loop is doing (and whether it really needs to do all of everything it's doing) is a good place to start.
There's a few things to consider.
Firstly, your task takes a while to run and currently times out. There are some things to investigate here:
Well, not an investigation exactly, but you can set the requestTimeout for that request via <cfsetting>, as per #coldfusiondevshop's suggestion.
You could audit your code to see if it can be coded differently to not take so long to run. Due to the nature of work that's generally done as a task, this might not be possible.
Depending on your ColdFusion version (please always tag your questions with the CF version, as well as just "coldfusion"), you could leverage the enhanced scheduling CF10 has to split the task into smaller chunks and chain them together, thus making the parts of the whole less likely to time out. This too is not always possible, but it is a consideration.
The other thing to think about is whether it might be worth having a separate CF instance running for your tasks. As well as timing out, your task could also be slowing down processing for everyone else while it runs. Have you checked that? For a single task this is probably overkill, but it's good to bear in mind that tasks don't need to be run by the same CF instance(s) as the rest of the site, and indeed there's a compelling case not to do so if there are a lot of tasks that would strain a CF instance's resources.
In summary: increase your timeout via <cfsetting>, then audit your code to see if you can improve its performance. The other things I say are just "bonus point" suggestions.

How to best launch an asynchronous job request in Django view?

One of my view functions runs a very long processing job and clearly needs to be handled differently.
Instead of making the user wait a long time, it would be best if I were able to launch the processing job (which would email the results) and, without waiting for completion, notify the user that their request is being processed and let them browse on.
I know I can use os.fork, but I was wondering if there is a 'right way' in terms of Django. Perhaps I can return the HTTP response and then go on with this job somehow?
There are a couple of solutions to this problem, and the best one depends a bit on how heavy your workload will be.
If you have a light workload, you can use the approach taken by django-mailer, which is to define a "jobs" model, save new jobs into the database, then have cron run a stand-alone script every so often to process the jobs stored in the database (deleting them when done); this pattern is sketched below. You can use something like django-chronograph to make managing the job schedule easier.
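A bare-bones version of that pattern might look like this (all names are illustrative, not django-mailer's actual code):

```python
# models.py - a minimal "jobs" table
from django.db import models

class Job(models.Model):
    created = models.DateTimeField(auto_now_add=True)
    payload = models.TextField()  # whatever data the job needs

# process_jobs.py - run from cron; Django settings must be configured
# before the ORM is usable (see the standalone-scripts article below)
def run_job(payload):
    ...  # placeholder for the actual work

def process_jobs():
    for job in Job.objects.order_by("created"):
        run_job(job.payload)
        job.delete()  # delete the job when done
```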
If you need help understanding how to write a script to process the jobs, see James Bennett's article Standalone Django Scripts.
If you have a very high workload, meaning you'll need more than a single server to process the jobs, then you want a real distributed task queue. There is a lot of competition here, so I can't really detail all the options, but a good one to use with Django apps is celery.
Why not simply start a thread to do the processing and then go on to send the response?
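In its simplest form that could look like the sketch below. Note this is a judgment call: threads spawned from a request die if the server process restarts, so this only suits work you can afford to lose.

```python
# spawning a worker thread from a Django view - a minimal sketch
import threading

from django.http import HttpResponse

def long_job():
    # do the heavy processing and email the results here
    ...

def start_job(request):
    threading.Thread(target=long_job, daemon=True).start()
    return HttpResponse("Your request is being processed.")
```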
Before you select a solution, you need to determine how the process will be run. That is, is it the same process for every single user, with the same data, that can be scheduled regularly? Or does each user request something, with slightly different results?
As an example, if the data will be the same for every single user and can be run on a schedule, you could use cron.
See: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
or
http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
However, if the requests will be ad hoc and you need something scalable that can handle high load and is asynchronous, what you are actually looking for is a message queuing system. Your view will add a request to the queue, which will then get acted upon (see the sketch at the end of this answer).
There are a few options to implement this in Django:
Django Queue Service is purely Django and Python, and simple, though the last commit was in April and the project seems to have been abandoned.
http://code.google.com/p/django-queue-service/
The second option, if you need something that scales, is distributed, and makes use of open-source message queuing servers: celery is what you need.
http://ask.github.com/celery/introduction.html
http://github.com/ask/celery/tree
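To illustrate the queue-from-a-view idea mentioned above, a Celery-based sketch might look like this (the task and module names are hypothetical):

```python
# views.py - enqueueing work instead of doing it in the request
from django.http import HttpResponse
from myapp.tasks import process_request  # a hypothetical @task function

def submit(request):
    process_request.delay(request.GET.get("q", ""))  # returns immediately
    return HttpResponse("Queued - you'll receive an email when it's done.")
```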