I have a scheduled task that is too heavy, so my question is: could a too-heavy scheduled task bring down the ColdFusion server? I mean, sometimes my scheduled task exceeds the request timeout mid-loop. Either way, I am looking for another way to do the same thing that isn't so heavy.
Well, a scheduled task is really just an automated call to a normal CF page request. Which raises the question: if you manually bring up the scheduled task's URL in a browser window, does it time out there as well?
Remember that a scheduled task is being called and run by the server, which means you can have different session, CGI, request and form scope values than an actual user would. However, you can use the requestTimeout attribute of the CFSETTING tag to extend how long the page has to complete its work before it times out. The requestTimeout attribute takes the number of seconds the request is allowed to run before CF considers the page unresponsive and kills it; for example, `<cfsetting requestTimeout="300">` gives the request five minutes.
However, it would depend greatly upon what your scheduled task is actually doing. There are all kinds of ways you could break the code into constituent parts for quicker processing. Maybe figuring out what the loop is doing (and whether it really needs to do all of everything it's doing) is a good place to start.
There are a few things to consider.
Firstly, your task takes a while to run, and currently times out. There are some things to investigate here:
Well, not an investigation as such, but you can raise the requestTimeout for that request via <cfsetting>, as per #coldfusiondevshop's suggestion.
You could audit your code to see if it can be coded differently to not take so long to run. Due to the nature of work that's generally done as a task, this might not be possible.
Depending on your ColdFusion version (please always tag your questions with the CF version, as well as just "ColdFusion") you could leverage the enhanced scheduling CF10 has to split the task into smaller chunks and chain them together, thus making each part of the sum less likely to time out. This too is not always possible, but is a consideration.
The other thing to think about is whether it might be worth having a separate CF instance running for your tasks. As well as your task timing out, it could also be slowing down processing for everyone else too, whilst that's running. Have you checked that? For a single task this is probably overkill, but it's good to bear in mind that tasks don't need to be run by the same CF instance(s) as the rest of the site, and indeed there's a compelling case not to do so if there are a lot of tasks that will strain a CF instance's resources.
In summary: increase your timeout via <cfsetting>, then audit your code to see if you can improve its performance. The other things I say are just "bonus point" suggestions.
I am looking for some best practice advice on AWS, and hoping this question won't immediately be closed as too open to opinion.
I am working on a conversion of a Windows server application to AWS Lambda.
The server runs every 5 minutes and grabs all the files that have been uploaded to various FTP locations.
These files must be processed in a specific order, which might not be the order they arrive in, so it then sorts them and processes accordingly.
It interacts with a database to validate the files against information from previous files.
It then sends the relevant information on, and records new information in the database.
Errors are flagged, and logged in the database, to be dealt with manually.
Note that currently there is no parallel processing going on. This would be difficult because of the need to sort the files and process them in the correct order.
I have therefore been assuming the lambda will have to run as a single invocation on a schedule.
However, I have realised that the files can be partitioned according to where they come from, and those locations can be processed independently.
So I could have a certain amount of parallelism.
My question is what is the correct way to manage that limited parallelism in AWS?
A clunky way of doing it would be through the database, something like this:
A lambda spins up and reads a particular table in the database.
This table has a list of independent processing areas, with the columns "Status" and "StartTime".
The lambda finds the oldest one not currently being processed, registers it as "processing", and updates the "StartTime".
After processing, the status is set to "done" or some such.
I think this would work, but it doesn't feel quite right to be managing such things through the database.
Can someone suggest a pattern that my problem fits into, and the correct AWS way of doing this?
If you really want to do this with parallel Lambda invocations, then yes, you should absolutely use a database to coordinate their work.
The protocol you're thinking about seems reasonable. You need to use the transactional capabilities of the database to ensure that the parallel invocations don't interfere with each other, and you need to make sure that the system is resilient to Lambda invocations that fail partway or never happen at all.
When your Lambda is invoked to handle the event, it should decide how many additional parallel invocations are required, and then make asynchronous Lambda calls to run those additional instances. Those instances should recognize that they were invoked directly and skip the fan-out step.
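Roughly like this, as a non-authoritative sketch using the AWS SDK for Java v1; the "role" payload flag is a made-up convention for telling a copy it was spawned:

```java
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvocationType;
import com.amazonaws.services.lambda.model.InvokeRequest;

public class FanOut {
    /** Fire off extra copies of this function asynchronously. */
    public static void spawnWorkers(String functionName, int extraWorkers) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();
        for (int i = 0; i < extraWorkers; i++) {
            InvokeRequest req = new InvokeRequest()
                    .withFunctionName(functionName)
                    // Event = asynchronous: the call returns immediately
                    .withInvocationType(InvocationType.Event)
                    // made-up flag: "you were spawned, skip the fan-out step"
                    .withPayload("{\"role\":\"worker\"}");
            lambda.invoke(req);
        }
    }
}
```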
After that, all of the parallel lambda invocations should do exactly the same thing. Make sure that none of them are special in any way, so you don't need to rely on any particular one completing without error. They should each pull work from a work queue in the DB until all the work is done.
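For the "pull work from a work queue in the DB" loop, the claim step could look like the sketch below. This assumes a PostgreSQL-style database and a hypothetical processing_areas table with the status/start_time columns the question describes:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class WorkQueue {
    /**
     * Atomically claim the oldest pending area, or return null when none remain.
     * FOR UPDATE SKIP LOCKED (PostgreSQL, MySQL 8+) lets parallel invocations
     * each grab a different row without blocking or double-claiming.
     */
    public static String claimNext(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try {
            String area = null;
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT area FROM processing_areas WHERE status = 'pending' " +
                    "ORDER BY start_time LIMIT 1 FOR UPDATE SKIP LOCKED");
                 ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    area = rs.getString("area");
                }
            }
            if (area != null) {
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE processing_areas SET status = 'processing', " +
                        "start_time = now() WHERE area = ?")) {
                    update.setString(1, area);
                    update.executeUpdate();
                }
            }
            conn.commit();
            return area; // caller processes this area, then sets status = 'done'
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```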
BUT NOTE: Usually the kind of tasks you're talking about are not CPU-bound. If that is the case then running multiple parallel tasks inside the same lambda invocation will make better use of your resources. You can do both, of course.
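For example, a minimal sketch of running the per-area work on a thread pool inside a single invocation (processArea is a hypothetical stand-in for the real per-area job):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class InProcessParallelism {
    /** Process several IO-bound areas concurrently inside one invocation. */
    public static void processAll(List<String> areas) throws InterruptedException {
        // More threads than vCPUs is fine for IO-bound work: most of them
        // spend their time blocked on the network or the database anyway.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Callable<Void>> jobs = areas.stream()
                    .map(area -> (Callable<Void>) () -> { processArea(area); return null; })
                    .collect(Collectors.toList());
            pool.invokeAll(jobs); // blocks until every area has finished
        } finally {
            pool.shutdown();
        }
    }

    private static void processArea(String area) { /* fetch, validate, record */ }
}
```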
As far as I can tell, by default on Google Cloud (and presumably elsewhere) each vCPU = 1 hyperthread (3rd paragraph in the intro). From my perspective, that would suggest that unless one changes this setting to 2 or 4 vCPUs, concurrency in the code running on the Docker image achieves nothing. Is there some multithreading knowledge I'm missing that means concurrency on a single hyperthread accomplishes something? Scaling up the vCPU count isn't very attractive, as the minimum memory setting is already forced to 2 GB for 4 vCPUs.
This question is framed based on the Google Cloud tech stack, but is meant to umbrella all providers.
Do Serverless solutions ever really benefit from concurrency?
EDIT:
The accepted answer is a great first look, but I realized my assumptions above ignored the idle time that context switching can fill. For example:
If we wish to write a backend which talks to a database, a lot of our compute time might be spent idle, waiting for the database request results. Context switching to the next request in this case would allow us to fill the CPU more efficiently.
Therefore, depending on the use case, even on a single-threaded vCPU our Serverless app can benefit from concurrency.
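A toy sketch of that effect; the Thread.sleep stands in for a blocking database call, so the real numbers will vary:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class OverlappedWaits {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        long start = System.nanoTime();
        // Even on a single hyperthread these "queries" overlap: while one
        // thread sleeps (waits on the database), another gets the CPU.
        // Wall time ends up near one query's latency, not eight in a row.
        List<CompletableFuture<Void>> queries = IntStream.range(0, 8)
                .mapToObj(i -> CompletableFuture.runAsync(OverlappedWaits::fakeDbQuery, pool))
                .collect(Collectors.toList());
        queries.forEach(CompletableFuture::join);
        System.out.printf("elapsed: %d ms%n", (System.nanoTime() - start) / 1_000_000);
        pool.shutdown();
    }

    /** Stand-in for a blocking database call: pure waiting, no CPU work. */
    private static void fakeDbQuery() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```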
I wrote this. From my experience, YES, you can handle several threads in parallel, and performance increases with the number of CPUs. However, you need a process that supports multithreading.
In the case of Cloud Run, each request can be processed in its own thread, so parallelization is easy.
I have a job for my program which takes a long time to execute. Now I want to show the status of this job to my UI once it is completed. I have found two solutions to this problem:
Have an API call execute at the end of the 30-minute job to update the status that the job is complete. This is good because it can give additional information as to what happened in the job, but it has its drawback: if something goes completely wrong, there's a chance that the code which calls the API will never run, and hence the status will never update.
Have continuous monitoring on the task once it has started: a loop that keeps checking whether the task is done. This is a good approach in that we can almost always get the correct status of the task, but often we only see a high-level yes/no here instead of the fine-grained execution details which might otherwise be made available.
One thing I haven't implemented, though, which I think may be a good solution, is using both of these in tandem: in the success case I get the details of the execution from the job's own API call, and in the case of total failure I still get that output from the monitoring loop. What are the general principles followed when building such monitoring support for jobs which take a long time to process?
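What I have in mind is roughly the sketch below. The status sink is hypothetical (here it's just an in-memory value and prints; in reality each status update would be the API call, and the watchdog would be the external monitor):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class JobWithWatchdog {
    private static final AtomicReference<String> status = new AtomicReference<>("RUNNING");

    public static void main(String[] args) {
        // Solution 2: an independent poller that keeps reporting coarse state,
        // so even a hard crash in the job shows up as "still RUNNING / stuck".
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        watchdog.scheduleAtFixedRate(
                () -> System.out.println("watchdog sees: " + status.get()),
                10, 10, TimeUnit.SECONDS);
        try {
            doLongJob();
            // Solution 1: the job itself reports rich detail on the happy path.
            status.set("SUCCEEDED: processed 1234 records");
        } catch (Exception e) {
            status.set("FAILED: " + e.getMessage()); // best effort; may never run
        } finally {
            watchdog.shutdownNow();
        }
        System.out.println("final status: " + status.get());
    }

    private static void doLongJob() throws Exception {
        Thread.sleep(30_000); // stand-in for the 30-minute job
    }
}
```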
Use AWS Step Functions as a serverless state machine. It has support for interacting directly with a bunch of services: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/connect-supported-services.html
I have a Java web server and am currently using the Guava library to handle my in-memory caching, which I use heavily. I now need to expand to multiple servers (2+) for failover and load balancing. In the process, I switched from an in-process cache to Memcache (an external service) instead. However, I'm not terribly impressed with the results, as now for nearly every call I have to make an external call to another server, which is significantly slower than the in-memory cache.
I'm thinking instead of getting the data from Memcache, I could keep using a local cache on each server, and use RabbitMQ to notify the other servers when their caches need to be updated. So if one server makes a change to the underlying data, it would also broadcast a message to all other servers telling them their cache is now invalid. Every server is both broadcasting and listening for cache invalidation messages.
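To make that concrete, the idea would be roughly this sketch, using the RabbitMQ Java client and a Guava cache; the exchange name is made up:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;

import java.nio.charset.StandardCharsets;

public class InvalidatingCache {
    private static final String EXCHANGE = "cache-invalidation"; // made-up name

    private final Cache<String, Object> local =
            CacheBuilder.newBuilder().maximumSize(100_000).build();
    private final Channel channel;

    public InvalidatingCache(Connection rabbit) throws Exception {
        channel = rabbit.createChannel();
        channel.exchangeDeclare(EXCHANGE, BuiltinExchangeType.FANOUT);
        // Each server gets its own auto-delete queue bound to the fanout exchange,
        // so every server sees every invalidation message.
        String queue = channel.queueDeclare().getQueue();
        channel.queueBind(queue, EXCHANGE, "");
        channel.basicConsume(queue, true,
                (tag, msg) -> local.invalidate(new String(msg.getBody(), StandardCharsets.UTF_8)),
                tag -> { });
    }

    /** Called after this server changes the underlying data. */
    public void invalidateEverywhere(String key) throws Exception {
        local.invalidate(key); // drop our own copy first
        channel.basicPublish(EXCHANGE, "", null, key.getBytes(StandardCharsets.UTF_8));
    }
}
```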
Does anyone know any potential pitfalls of this approach? I'm a little nervous because I can't find anyone else that is doing this in production. The only problems I see would be that each server needs more memory (in-memory cache), and it might take a little longer for any given server to get the updated data. Anything else?
I am a little bit confused about your problem here, so I am going to restate in a way that makes sense to me, then answer my version of your question. Please feel free to comment if I am not in line with what you are thinking.
You have a web application that uses a process-local memory cache for data. You want to expand to multiple nodes and keep this same structure for your program, rather than rely upon a 3rd party tool (memcached, Couchbase, Redis) with built-in cache replication. So, you are thinking about rolling your own using RabbitMQ to publish the changes out to the various nodes so they can update the local cache accordingly.
My initial reaction is that what you want to do is best done by rolling over to one of the above-mentioned tools. In addition to sparing you the obvious development and rigorous testing involved, Couchbase, Memcached, and Redis were all designed to solve exactly the problem that you have.
Also, since every node holds its own full copy of the cache, in theory you will run out of available memory in your application nodes as your data set grows, and then you will really have a mess. Once you get to the point when this limitation makes your app infeasible, you will end up using one of those tools anyway, at which point all your hard work to design a custom solution will be for naught.
The only exception to this I can think of is if your app is heavily compute-intensive and does not use much memory. In this case, I think a RabbitMQ-based solution is easy, but you would need to have some sort of procedure in place to synchronize the caches between the servers on occasion, should messages be missed in RMQ. You would also need a way to handle node startup and shutdown.
Edit
In consideration of your statement in the comments that you are seeing access times in the hundreds of milliseconds, I'm going to advise that you first examine your setup. Typical read times for a single item in the cache from a Memcached (or Couchbase, or Redis, etc.) instance are sub-millisecond (somewhere around .1 milliseconds if I remember correctly), so your "problem child" of a cache server is several orders of magnitude from where it should be in terms of performance. Start there, then see if you still have the same problem.
We're using something similar for data which is read-only and doesn't require updating every time. I doubt that this is a good plan for you, though. Just imagine: you would need one more additional service on each instance to monitor the queue and apply changes to the in-memory storage. This is very hard to test.
Are you sure that most of the time is spent on communication between your servers? Maybe you are making multiple calls?
I'm not sure if the topic is appropriate to the question... Anyway, suppose I have a website, done with PHP or JSP, and cheap hosting with limited functionality. More precisely, I don't own the server and I can't run daemons or services at will. Now I want to run periodic tasks (say every minute or two) to compute certain statistics. They are likely to be time-consuming, so I can't just repeat the calculation for every user accessing a page. I could do the calculation once when a user loads a page and I see that enough time has passed, but in that situation, if the calculation runs very long, the response time may be excessive and time out (it's not likely, as I don't mean to run such long tasks, but I'm considering the worst case).
So given these constraints, what solutions would you suggest?
Almost every cheap host will have crontab support. Check out the hosting packages first.
If not, load the page normally and launch the task via AJAX. This way your response time doesn't suffer, and the work is done in a separate request.
If you choose to use crontab, you'll need to know a bit more to execute your PHP scripts from it. Whether your PHP runs as CGI or as an Apache module makes a difference. There's a good article on how to do this at:
http://www.htmlcenter.com/blog/running-php-scripts-with-cron/
If you don't have access to crontab on your hosting provider (find a new one) there are other options. For example:
http://www.setcronjob.com/
It will call a script on your site remotely every X period. You have to renew it once a month, I think. If you take their paid service ($5/year according to the front page), I think the jobs you set up last until you cancel them or your paid term runs out.
In Java, you could just create a Timer. It can create a background thread that will perform a given function every so often.
My preference would be to start the Timer in a ServletContextListener.contextInitialized() method, but you may have a more appropriate place.
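A minimal sketch of that, assuming a Servlet 3.0+ container; the listener class and the stats method are made up for illustration:

```java
import java.util.Timer;
import java.util.TimerTask;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class StatsScheduler implements ServletContextListener {
    private Timer timer;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        timer = new Timer("stats-timer", true); // daemon thread: won't block shutdown
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                recalculateStatistics(); // the periodic work
            }
        }, 0, 2 * 60 * 1000); // run now, then every two minutes
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        timer.cancel(); // stop the background thread when the app is undeployed
    }

    private void recalculateStatistics() { /* hypothetical: your stats job */ }
}
```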