I have a loop that runs over a couple thousand records, and for each record it hits it does some image resizing and manipulation on the server. The process runs well in testing over a small record set, but when it moves to the live server I would like to suspend and resume the process after 50 records so the server is not taxed to the point where it slows down or quits altogether.
The code looks like this:
<cfloop query="imageRecords">
<!--- create and save images to server - sometimes 3 - 7 images for each record --->
</cfloop>
Like I said, I would like to pause after 50 records, then resume where it left off. I looked at cfschedule but was unsure of how to work that into this.
I also looked at the sleep() function but the documentation talks about using this within cfthread tags. Others have posted about using it to simulate long processes.
So, I'm not sure sleep() can be safely used in the fashion I need it to.
Server is CF9 and db is MySQL.
I would create a column called worked in the table, defaulting to 0, and set the flag to 1 once the record's images have been processed. Then your query can be something like this (using LIMIT, since the database is MySQL):
SELECT imagename
FROM images
WHERE worked = 0
LIMIT 50
Then set up a CF scheduled task to run every x minutes.
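To make the flag-and-batch idea concrete, here is a minimal sketch in Python rather than CFML, using an in-memory SQLite table as a stand-in for the MySQL one. The id column, the process_batch helper and the handle_image callback are assumptions for illustration only; each call to process_batch plays the role of one scheduled-task run.

```python
import sqlite3

BATCH_SIZE = 50  # records handled per scheduled run


def process_batch(conn, handle_image, batch_size=BATCH_SIZE):
    """Process up to batch_size unworked rows and flag each one as done."""
    rows = conn.execute(
        "SELECT id, imagename FROM images WHERE worked = 0 LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, imagename in rows:
        handle_image(imagename)  # stand-in for the image resizing work
        conn.execute("UPDATE images SET worked = 1 WHERE id = ?", (row_id,))
        conn.commit()  # commit per record so an interrupted run keeps its progress
    return len(rows)


# Demo with an in-memory table of 120 hypothetical records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "id INTEGER PRIMARY KEY, imagename TEXT, worked INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO images (imagename) VALUES (?)",
    [(f"img{i}.jpg",) for i in range(120)],
)
conn.commit()

processed = []
batch_counts = []
while True:  # each iteration stands in for one scheduled-task run
    count = process_batch(conn, processed.append)
    if count == 0:
        break
    batch_counts.append(count)
```

Because the flag is written per record, a run that dies halfway simply leaves the remaining rows at worked = 0 for the next run to pick up.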
Here's a different approach which you could combine with probably any of the other approaches:
<cfscript>
th=CreateObject("java","java.lang.Thread").currentThread(); // the thread running this request
th.setPriority(th.MIN_PRIORITY);
// Your work here
th.setPriority(th.NORM_PRIORITY);
</cfscript>
That ought to give the thread a lower priority than the other threads that are serving your other requests. In theory, you'll get your work done in the shortest time, but with less effect on your other users. I've not had the opportunity to test this yet, so your mileage may vary.
I have a for loop in Django. It loops through a list, gets the corresponding data from the database, does some calculation based on the database value, and then appends the result to another list.
def getArrayList(request):
    list_loop = [...]  # set of values to loop through
    store_array = []  # values from the for loop are stored here
    for a in list_loop:
        val_db = SomeModel.objects.filter(somefield=a).first()
        result = perform_calculation(val_db)  # placeholder for the calculation on val_db
        store_array.append(result)
The list has 10,000 entries. A user who makes this request is ready to wait and will be informed that it will take time.
I have tried joblib with backend=threading, but it does not save much time compared with a normal loop.
But when I try backend=multiprocessing, it says "Apps aren't loaded yet".
I read that multiprocessing is not possible in module-based files.
So I am looking at Celery now. I am not sure how this can be done in Celery.
Can anyone explain how to speed up the for loop calculation using the multiprocessing techniques available?
You're very likely looking for the wrong solution. But then again - this is pseudo code so we can't be sure.
In either case, your pseudo code is a self-fulfilling prophecy, since you run queries in a for loop. That means network latency, result set fetching, tying up database resources etc etc. This is never a good pattern, at best it's a last resort.
The simple solution is to get all values in one query:
list_values = [ ... ]
results = []
db_values = SomeModel.objects.filter(field__in=list_values)
for value in db_values:
    results.append(calc(value))
If for some reason you need to loop, then to do this in Celery you would mark the function as a task (there are plenty of examples to find). It won't speed up anything, but it will be run in the background, so you render a "please wait" message and then somehow you need to notify the user that the job is done.
I'm saying somehow, because there isn't a really good integration package that I'm aware of that ties in all the components. There's django-notifications-hq, but if this is your only background task, it's a lot of extra baggage just for that - so you may want to change the notification part to "we will send you an email when the job is done", cause that's easy to achieve inside your function.
And thirdly, if this is simply creating a report, that doesn't need things like automatic retries on failure, then you can simply opt to use Django Channels and a browser-native websocket to start and report on the job (which also allows you to send email).
You could try concurrent.futures.ProcessPoolExecutor, which is a high-level API for processing CPU-bound tasks:
import concurrent.futures

def perform_calculation(item):
    pass  # replace with the real calculation

tasks = []  # the items to process (placeholder)
# max_workers defaults to the number of processors on your machine
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    res = list(executor.map(perform_calculation, tasks))
EDIT
For an I/O-bound operation, you could use ThreadPoolExecutor to open a few connections in parallel. You can wrap the pool in a context manager that handles the cleanup work for you (closing idle connections). Here is one example, but it handles the connection closing manually.
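A minimal sketch of the ThreadPoolExecutor variant follows. It is not the example the answer refers to: fetch_value is a hypothetical stand-in in which a short sleep simulates the latency of a database or network call, and the worker count of 10 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_value(key):
    """Hypothetical I/O-bound lookup; the sleep stands in for network latency."""
    time.sleep(0.05)
    return key * 2


keys = list(range(20))

# The with-block shuts the pool down and joins its threads on exit.
with ThreadPoolExecutor(max_workers=10) as executor:
    # map() returns results in input order even though the calls overlap in time
    results = list(executor.map(fetch_value, keys))
```

Threads help here only because the work is spent waiting; for CPU-bound calculations the process pool above is the better fit.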
I have created a framework in which I have used Set Browser Implicit Wait 30.
I have 50 suites that contain a total of 700 test cases. Some of the test cases (200 TCs) have steps that check whether an element is present or not present. My objective is to avoid waiting the full 30 seconds for these checks. I tried using Wait Until Element Is Visible ${locator} timeout=10, expecting to wait only 10 seconds for the element, but it waits for 30 seconds.
Question: Can somebody help with the right approach to deal with such scenarios in my framework? If I accept the 30-second wait, these test cases will take longer to complete. I am trying to save 20*200 seconds currently. Please advise.
The simplest solution is to change the implicit wait right before checking that an element does not exist, and then changing it back afterwards. You can do this with the keyword set selenium implicit wait.
For example, your keyword might look something like this:
*** Keywords ***
verify element is not on page
    [Arguments]    ${locator}
    ${old_wait}=    Set selenium implicit wait    10
    run keyword and continue on failure
    ...    page should not contain element    ${locator}
    set selenium implicit wait    ${old_wait}
You can simply add timeout=${Time} after the keyword you want to execute (e.g., Wait Until Page Contains Element ${locator} timeout=50).
The problem you're running into is the issue of "implicit wait vs explicit wait". Searching the internet will provide you with a lot of good explanations of why mixing them is not recommended, but I think Jim Evans (creator of the IE WebDriver) explained it nicely in this stackoverflow answer.
Improving the performance of your test run is typically done by utilizing one or both of these:
Shorten the duration of each individual test
Run tests in parallel.
Shortening the duration of a test typically means being in complete control of the application under test, so that the script knows the application has successfully loaded the moment it happens. This means having a low or no implicit wait and working exclusively with fluent waits (waiting for a condition to occur). This will result in your tests running at the speed your application allows.
This may mean investing time understanding the application you test on a technical level. By using a custom locator you can still use all the regular SeleniumLibrary keywords and have a centralized waiting function.
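The idea of a centralized wait that polls for a condition can be sketched in plain Python. This is not SeleniumLibrary code: wait_until, its parameters and the simulated "element" are all assumptions for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.2):
    """Poll condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Demo: a hypothetical "element" that becomes available after 0.3 seconds.
ready_at = time.monotonic() + 0.3
start = time.monotonic()
found = wait_until(
    lambda: time.monotonic() >= ready_at, timeout=5.0, poll_interval=0.05
)
waited = time.monotonic() - start
```

The point of the pattern is that the wait ends when the condition becomes true, so the timeout is only an upper bound and not the usual cost.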
Running tests in parallel starts with having tests that run standalone and have no dependencies on other tests. In Robot Framework this means having Test Suite Files that can run independently of each other. Most of us use Pabot to run our suites in parallel and merge the log file afterwards.
Running several browser application tests in parallel means running more than 1 browser at the same time. If you test in Chrome, this can be done on a single host - though it's not always recommended. When you run IE then you require multiple boxes/sessions. Then you start to require a Selenium Grid type solution to distribute the execution load across multiple machines.
We have a very large database of around 300,000 URLs. These URLs have to be pinged and data fetched from them. (The URLs are radio stations playing songs; the data is metadata.)
Some of them are sometimes inactive and sometimes active.
At any given time, around 80,000 are active. Some respond slowly, some respond quickly. I have a server and I am thinking of doing this in C++.
My goal is to ping and parse (or crawl) them within 1 minute and keep repeating the process, because the information (the song playing on them) changes over time, mostly every 2-7 minutes. But I am not sure if it is possible.
What should my approach be?
I have thought of creating two programs. One would run twice a day to test whether each URL is active and how long it generally takes to respond: does it usually respond slowly, or is it only responding slower now?
The other would do the actual crawling, where the fastest are crawled first, with some dedicated threads for URLs that respond faster.
I would love better ideas or solutions. Can anyone tell me how to do the maths to find the number of dedicated threads I should allot to each group to get the results in the least amount of time?
You don't need performance of your CPU (not your bottleneck at the moment), but you need to avoid network layer stall... if the request timeout is 60 seconds, and you have 16 threads, and hit 16 very slow servers (which will time-out eventually), you are generally stalled for 60 seconds and not processing anything more.
So I would start with, let's say, 500 threads (and a 15-30 s timeout, if you know even the very slow radios can fit within that), keep some statistics about their turnaround, and dynamically add another worker thread for every original one that didn't get a response within 2-3 s. 80000/500 = 160, so each "normally quick" worker thread then has to ping around 160 URLs; if each takes 2 seconds, that's still 320 s, or about 5 min! So 500 sounds like the minimum.
That said, having 500+ threads will somewhat burden CPU and memory (I'm not sure how much; with a decent thread/memory model implementation 500 doesn't sound like much for a modern x86 CPU with GBs of RAM, and even 5000 still sounds reasonable), but I would worry a lot more about the network layer and about possible firewalls along the way. You need a server-grade network for that many requests (if I tried something like that from home, my own router would filter me out with default settings, detecting it as some kind of DoS attack).
So get some statistics on how long a request takes on average, then take your target time (2-7 min) and divide the number of URLs by the number of requests one thread can make in that time. For example, with an average ping of 5 s and a round time of 3 min: 300,000/(3*60/5) = 8333.33, so at least 8334 threads are needed. Then you will have to profile your app to verify that with around 8000 threads it does not choke on something else and really handles the task as expected.
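The arithmetic above can be written as a small helper; this is only a sketch in Python, and threads_needed and its parameter names are made up for illustration. It uses the figures from the answer: 300,000 URLs, a 5-second average request and a 3-minute round, plus the 80,000 active URLs from the question.

```python
import math


def threads_needed(url_count, avg_request_seconds, round_seconds):
    """Minimum number of workers so every URL is visited once per round."""
    # how many sequential requests a single thread completes in one round
    requests_per_thread = round_seconds / avg_request_seconds
    return math.ceil(url_count / requests_per_thread)


all_urls = threads_needed(300_000, 5, 3 * 60)    # 300,000 / 36, rounded up
active_only = threads_needed(80_000, 5, 3 * 60)  # only the active stations
```

With these inputs all_urls comes out at 8334 and active_only at 2223, which shows how much the active/inactive bookkeeping can reduce the thread count.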
(The other option is to fire asynchronous HTTP requests from a single thread, but depending on the library that may create its own threads for the tasks anyway, so I would rather manage the threads myself and use synchronous HTTP calls.)
As for the dynamic growth mechanics: you can keep counters of how many new requests were added in the last second and how many finished (either responded or failed). After a few seconds of running, these should start to form some kind of "throughput" statistic; if the throughput is under the desired threshold, you can add more threads.
About active/inactive... keep the response time/last-seen/last-check together with url, and add some further logic to check url only when it makes sense (like not within next 60s, if it did just respond, or check inactive just after 6h from last test). You need also avoid checking the same url in two different threads at the same time, so some central manager code should feed the threads with target (maybe some FIFO thread-safe queue ... actually you can use its size to estimate how well the worker threads are processing it, so you can add more threads when you see the queue is not emptying fast enough = that avoids adding the statistic code to thread themselves).
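A minimal sketch of the central-queue arrangement, in Python rather than C++: one manager feeds a thread-safe FIFO queue, so no URL is handed to two workers, and the queue size is available as a backlog signal. fetch_metadata, the example URLs and the worker count of 20 are assumptions; the sleep stands in for a synchronous HTTP call.

```python
import queue
import threading
import time


def fetch_metadata(url):
    """Hypothetical stand-in for a synchronous HTTP request."""
    time.sleep(0.01)
    return f"metadata for {url}"


def worker(tasks, results, lock):
    while True:
        url = tasks.get()
        if url is None:  # sentinel: no more work for this worker
            tasks.task_done()
            return
        data = fetch_metadata(url)
        with lock:
            results[url] = data
        tasks.task_done()


urls = [f"http://station{i}.example/status" for i in range(200)]
tasks = queue.Queue()
results = {}
lock = threading.Lock()

workers = [
    threading.Thread(target=worker, args=(tasks, results, lock))
    for _ in range(20)
]
for t in workers:
    t.start()

for url in urls:  # the manager is the only producer, so each URL is queued once
    tasks.put(url)
backlog = tasks.qsize()  # approximate; a growing backlog suggests adding workers

for _ in workers:
    tasks.put(None)  # one sentinel per worker, queued behind the real work
tasks.join()
for t in workers:
    t.join()
```

A real manager would keep running, re-queue each URL according to its last-seen and last-check times, and start extra workers when the backlog stops shrinking.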
I have a very basic app that plugs data into a stored procedure which in turn returns a recordset. I've been experiencing what I thought were 'timeouts'. However, I'm now no longer convinced that this is what is really happening. The reason why is that the DBA and I watched sql server spotlight to see when the stored procedure was finished processing. As soon as the procedure finished processing and returned a recordset, the ColdFusion page returned a 'timeout' error. I'm finding this to be consistent whenever the procedure takes longer than a minute. To prove this, I created a stored procedure with nothing more than this:
BEGIN
WAITFOR DELAY '00:00:45';
SELECT TOP 1000 *
FROM AnyTableName
END
If I run it for 59 seconds I get a result back in ColdFusion. If I change it to one minute:
WAITFOR DELAY '00:01';
I get a cfstoredproc timeout error. I've tried running this in different instances of ColdFusion on the same server, different databases/datasources. Now, what is strange, is that I have other procedures that run longer than a minute and return a result. I've even tried this locally on my desktop with ColdFusion 10 and get the same result. At this point, I'm out of places to look so I'm reaching out for other things to try. I've also increased the timeout in the datasource connections and that didn't help. I even tried ColdFusion 10 with the timeout attribute but no luck there either. What is consistent is that the timeout error is displayed when the query completes.
Also, I tried adding the WAITFOR in cfquery and the same result happened. It worked when set for 59 seconds, but timed out when changed to a minute. I can change the sql to select top 1 and there is no difference in the result.
Per the comments, it looks like your request timeout is set to sixty seconds.
Use cfsetting to extend your timeout to whatever you need.
<cfsetting requesttimeout = "{numberOfSeconds}">
The default timeout for all pages is 60s. If that is not enough, you need to change it in the cfadmin, but most pages should not run this long.
Take some time to familiarise yourself with the cfadmin and all its settings to avoid such head scratching.
As stated, use the cfsetting tag to override it for specific pages.
I have a template doing a boatload of manipulations, and I expect it to take 30-45 minutes to complete its processing. I've had SOME success in setting my application and session vars to time out at 2 hr, and I've set my request timeout to 9999 (which should be 2.77 hrs).
However, there seems to be a magic threshold: somewhere around the 20 min mark, my browser goes to a white screen (no output) and it appears as though the CF engine has also stopped working on my task.
Can anyone suggest a reliable way to keep this process going until it's done or my astronomical timeout occurs? In addition, is there any way to push feedback to the browser so it doesn't time out? I've tried cfflush, but that doesn't seem to do it.
You could use cfthread to run the process in a separate thread, and then on the page you are accessing in the browser, you could use JavaScript to periodically poll the system to check on its status. For example, inside the long-running process in cfthread, as you work through, you could set application variables indicating that the process is still running and how far along it is, and retrieve and report those in the browser. When it's complete, you could clear the variables, or set a complete flag, etc., and your browser report page will be able to indicate that it is complete.
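The background-thread-plus-polling pattern can be sketched in Python as a rough analogue (not CFML): a shared status dictionary plays the role of the application variables, and the polling loop stands in for the browser's periodic JavaScript requests. The names long_job, read_status and the 30 work items are assumptions for illustration.

```python
import threading
import time

status = {"done": 0, "total": 0, "complete": False}
status_lock = threading.Lock()


def long_job(items):
    """Background work that records its progress in the shared status."""
    with status_lock:
        status.update(done=0, total=len(items), complete=False)
    for _ in items:
        time.sleep(0.01)  # stand-in for one unit of real work
        with status_lock:
            status["done"] += 1
    with status_lock:
        status["complete"] = True


def read_status():
    """What a polling endpoint would return to the browser."""
    with status_lock:
        return dict(status)


job = threading.Thread(target=long_job, args=(list(range(30)),))
job.start()

snapshots = []
while True:  # stands in for the browser polling every so often
    snap = read_status()
    snapshots.append(snap)
    if snap["complete"]:
        break
    time.sleep(0.05)
job.join()
```

The request that starts the job returns immediately, so neither the browser nor the request timeout is tied to how long the work takes.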
I strongly suggest refactoring the code to use a simple messaging / queue system. It wouldn't take but 30 minutes to implement (or write a simple one from scratch!) and would provide a lot of benefits over and above solving this issue.
For example, it's not a pass/fail for the entire operation. If you hit a snag at, say, the 1.5 hour mark, you won't be redoing the entire process, only the parts which fail.
Doing it this way there is literally no limit to how much processing you can do because you'll be adding and removing from the stack as needed.
If you give a little more background, I'd be happy to help you figure out logical divisions to make it possible.
If you have a process running for that long, then you'll want to run it as a scheduled task.
I imagine your browser is the one dying.
Did you check to see if the request is still running?
<cfsetting requesttimeout= "3600" /> will set the page to last for an hour. If you run it as a scheduled task, then session timeout shouldn't affect anything.
Please do not store queries in session like that. Depending on the size of the query and the number of concurrent users in the system, you could easily run out of memory, causing some current and all subsequent requests to fail.
The database should be more than able to handle the heavy lifting. I'd hazard a guess that much of the processing you're doing in the application could be re-factored to happen directly on the database and save you a considerable amount of time.
Regardless, you should look into something like CFTHREAD as Sean mentioned, a scheduled task or a queuing system to handle a long process like this. The user most likely doesn't want to wait for the process to end before seeing the next screen. If they're told up front that the process is lengthy, they'll cope with waiting as long as they can move on to other tasks.
I had the same problem. Generally the browser will time out after 3 minutes of nothing being sent from the server. For most of these long operations I was able to periodically output a dot to keep the browser alive, but when it came to some extremely long queries importing 20M records from a server-side CSV file I had to think of another way.
cURL was the answer.
So here's what I did.
<?php
function get_page($page)
{
    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_TIMEOUT, 0);        // no time limit on the request
    curl_setopt($ch, CURLOPT_NOPROGRESS, false); // enable the progress callback
    curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'progress');
    curl_setopt($ch, CURLOPT_BUFFERSIZE, 128);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
}
function progress($clientp, $dltotal, $dlnow, $ultotal, $ulnow = '')
{
    echo '. ';   // send a dot so the browser keeps receiving output
    flush();
    return 0;    // returning non-zero would abort the transfer
}
get_page('http://www.example.com/my_extremely_long_operation_script.php');
?>
Even with no output from the server, curl updates the download progress periodically.
Solved!