Django/Celery 4.3 - jobs seem to fail randomly

These are the tasks in tasks.py:
@shared_task
def add(x, y):
    return x + y

@shared_task
def verify_external_video(video_id, media_id, video_type):
    return True
I am calling verify_external_video 1000+ times from a custom Django management command that I run from the CLI:
verify_external_video.delay("1", "2", "3")
In Flower, I am then monitoring the success or failure of the jobs. A random number of jobs fail, others succeed...
Those that fail, do so because of two reasons that I just cannot understand:
NotRegistered('lstv_api_v1.tasks.verify_external_video')
If it's not registered, why do 371 of them succeed?
and...
TypeError: verify_external_video() takes 1 positional argument but 3 were given
Again, a mystery, as I quit Celery and Flower and ran them AGAIN from scratch before running my CLI Django command. There is no code anywhere in which verify_external_video() takes one parameter. And if that were the case... why are SOME of the calls successful?
This type of failure isn't sequential. I can have 3 successful jobs, followed by one that does not succeed, followed by success again, so it's not a timing issue.
I'm at a loss here.

In short: I had a number of rogue Celery worker processes still running from previous "violent" Ctrl-C's, which had prevented graceful termination of what was running. Those stale workers were still consuming from the queue with old code, which is why some tasks failed with NotRegistered or an outdated signature while the freshly started worker handled the rest successfully.
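One quick way to confirm whether stale workers are still attached to the broker is to ask Celery itself which workers respond and which tasks they have registered. A minimal diagnostic sketch, assuming the project's Celery app instance is importable (the import path below is a placeholder, not taken from the question):
from lstv_api_v1.celery import app  # placeholder path to the Celery app instance
insp = app.control.inspect()
print(insp.ping())        # which workers are alive, e.g. {'celery@host': {'ok': 'pong'}}
print(insp.registered())  # the tasks each responding worker actually has registered
# Any worker that answers the ping but does not list
# lstv_api_v1.tasks.verify_external_video is a leftover process running old
# code and should be shut down before re-running the command.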

Related

Why is Celery not executing tasks in parallel in Django?

I am having an issue with Celery; I will explain with the code below.
def samplefunction(request):
    print("This is a samplefunction")
    a, b = 5, 6
    myceleryfunction.delay(a, b)
    return Response({"msg": "process execution started"})

@celery_app.task(name="sample celery", base=something)
def myceleryfunction(a, b):
    c = a + b
    my_obj = MyModel()
    my_obj.value = c
    my_obj.save()
In my case, when one person calls the Celery task it works perfectly.
If many people send requests, it processes them one by one.
Imagine that my Celery function "myceleryfunction" takes 3 minutes to complete its background task.
So if 10 requests come in at the same time, the last one is delayed about 30 minutes before its output is complete.
How do I solve this issue, or is there any alternative?
Thank you
I'm assuming you are running a single worker with default settings for the worker.
This will have the worker running with worker_pool=prefork and worker_concurrency=<nr of CPUs>.
If the machine it runs on only has a single CPU, you won't get any parallel task execution.
To get parallelisation you can:
set worker_concurrency to something > 1; this will use multiple processes in the same worker (see the sketch after this list)
start additional workers
use celery multi to start multiple workers
when running the worker in a docker container, add replicas of the container
See Concurrency in the Celery docs for more info.
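As an illustration of the first option, here is a minimal sketch that assumes the same celery_app object used in the question's task decorator; the same effect can be had by passing -c/--concurrency when starting the worker:
# Run several prefork processes inside one worker. 4 is an arbitrary example
# value; match it to the CPU count and workload of the machine.
celery_app.conf.worker_concurrency = 4
# Optional: hand each process one task at a time, so a long-running task
# does not hold a backlog of prefetched tasks behind it.
celery_app.conf.worker_prefetch_multiplier = 1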

Pathos multiprocessing pool hangs

I'm trying to use multiprocessing inside a Docker container. However, I'm facing two issues.
(I'm using Python 2.7.)
Creating ProcessingPool()/Pool() (I tried both) takes an abnormally long time, maybe over a minute or two.
After it processes the function, it hangs.
I'm basically trying to run a very simple case inside my container. Here's what I have:
from pathos.multiprocessing import ProcessingPool
import multiprocessing

class MultiprocessClassExample():
    .
    .
    .
    def worker(self, number):
        return "Printing number %s" % (number)
    .
    .
    def generateNumber(self):
        PROCESSES = multiprocessing.cpu_count() - 1
        NUMBER = ['One', 'Two', 'Three', 'Four', 'Five']
        result = ProcessingPool(PROCESSES).map(self.worker, NUMBER)
        print("Finished processing.")
        print(result)
and I call it using the following code:
MultiprocessClassExample().generateNumber()
Now, this seems straightforward enough. I ran this in a Jupyter notebook and it ran without an issue. I also tried running Python inside my Docker container and running the above code there, and it went fine. So I'm assuming it has to do with the complete code that I have. Obviously I didn't write out all the code, but that's the main section I'm trying to handle right now.
I would expect the above code to work there as well. However, the first thing I notice is that when I call ProcessingPool(), it takes a long time. I tried regular multiprocessing.Pool() before and had the same effect, whereas in the notebook it ran quickly and smoothly.
After waiting several minutes, it prints :
Printing number One
Printing number Two
Printing number Three
Printing number Four
Printing number Five
and that's it. It never prints Finished processing.; it just hangs there.
But when the print statements appear, I notice that several debug messages appear at the same time. They say:
[CRITICAL] WORKER TIMEOUT
[WARNING] Worker graceful timeout
[INFO] Worker exiting
[INFO] Booting worker with pid:
Any suggestions would be greatly appreciated.

Vstest.console.exe exits with code 255 in Bamboo

We are running automated unit tests in our Bamboo build, but they sometimes fail even though our log indicates that all tests pass. I've done some Googling and am currently getting nowhere. Does anyone have a clue as to why VSTest.Console.exe is returning a value other than 0?
Thanks a ton!
Here are the last few lines of the log:
build 26-May-2016 14:11:25 Passed ReInitializeConnection
build 26-May-2016 14:11:25 Passed UserIdentifier_CRUD
build 26-May-2016 14:11:25 Results File: D:\build-dir\AVENTURA-T2-COREUNITTESTS\TestResults\bamboo_svc_BUILDP02 2016-05-26 14_10_58.trx
build 26-May-2016 14:11:25
build 26-May-2016 14:11:25 Total tests: 159. Passed: 159. Failed: 0. Skipped: 0.
build 26-May-2016 14:11:25 Test Run Successful.
build 26-May-2016 14:11:25 Test execution time: 27.3562 Seconds
simple 26-May-2016 14:11:32 Failing task since return code of [C:\Program Files\Bamboo\temp\AVENTURA-T2-COREUNITTESTS-345-ScriptBuildTask-2971562088758505573.bat] was 255 while expected 0
simple 26-May-2016 14:11:32 Finished task 'Run vstest.console.exe' with result: Failed
This isn't the solution I wanted but it does keep my build from failing if the return code is something other than 0 and all the tests are passing. At the end of our test command I add:
if %ERRORLEVEL% NEQ 0 (
    echo Failure Reason Given is %errorlevel%
    exit /b 0
)
All this does is catch the error coming out of vstest.console.exe and return 0 instead of 255. If anyone ever figures this out, I would greatly appreciate knowing why the return code is something other than 0.
As indicated in a comment to the question, I've come up against the issue in the test automation for my company too.
In our case, vstest would return 1 when tests failed, but then occasionally return 255. In the case of the 255 return, the test TRX output would not be generated.
In our situation, we are running integration tests that spawn child processes. The child processes have output handlers attached that write to the test context. The test starts the process, then uses the WaitForExit(int milliseconds) method to wait for it to complete.
The output handlers on the process output are then executing in a different thread, but have a reference to the test context to write their output.
This can cause issues in two ways:
In the documentation for WaitForExit(int milliseconds) on MSDN, it states:
When standard output has been redirected to asynchronous event handlers, it is possible that output processing will not have completed when this method returns. To ensure that asynchronous event handling has been completed, call the WaitForExit() overload that takes no parameter after receiving a true from this overload.
This means that it's possible that the output handlers are writing to the context after the test is complete.
When the timeout expires, the process continues to run in the background, and therefore might also be able to write to the test context.
The fix in our case was threefold:
After the call to WaitForExit(int), either kill the process (timeout) or call WaitForExit() again (non-timeout).
Deregister the output event handlers from the process object
Dispose the Process object properly (with using).
The specifics of your case might be different to ours, but look for threaded tests where (a) the thread might execute after the test is complete and (b) writes to the test output.

Delayed Job Overwhelming DB

I have a method which updates all DNS records for an account, with one delayed job for each record. There are a lot of workers and queues, which is great for getting other jobs done quickly, but this particular job completes quickly and overwhelms the database. Because each job requires DNS to resolve, it's difficult to move this to a process that collects the information and then writes once, so I'm instead looking for a way to stagger delayed jobs.
As far as I know, just using sleep(0.1) in the after method should do the trick. I wanted to see if anyone else has specifically dealt with this situation and solved it.
I've created a custom job to test out a few different ideas. Here's some example code:
def update_dns
  Account.active.find_each do |account|
    account.domains.where('processed IS NULL').find_each do |domain|
      begin
        Delayed::Job.enqueue StaggerJob.new(domain.id)
      rescue Exception => e
        self.domain_logger.error "Unable to update DNS for #{domain.name} (id=#{domain.id})"
        self.domain_logger.error e.message
        self.domain_logger.error e.backtrace
      end
    end
  end
end
When a cron job calls Domain.update_dns, the delayed job table floods with tens of thousands of jobs, and the workers start working through them. There are so many workers and queues that even setting the lowest priority overwhelms the database, and other requests suffer.
Here's the StaggerJob class:
class StaggerJob < Struct.new(:domain_id)
  def perform
    domain.fetch_dns_job
  end

  def enqueue(job)
    job.account_id = domain.account_id
    job.owner = domain
    job.priority = 10 # lowest
    job.save
  end

  def after(job)
    # Sleep to avoid overwhelming the DB
    sleep(0.1)
  end

  private

  def domain
    @domain ||= Domain.find self.domain_id
  end
end
This may entirely do the trick, but I wanted to verify if this technique was sensible.
It turned out the priority for these jobs was set to 0 (highest). Setting it to 10 (lowest) helped. Sleeping in the job's after method would work, but there's a better way:
Delayed::Job.enqueue StaggerJob.new(domain.id, :fetch_dns!), run_at: (Time.now + (0.2*counter).seconds) # stagger by 0.2 seconds
This ends up pausing outside the job instead of inside. Better!

Django: Gracefully restart nginx + fastcgi sites to reflect code changes?

Common situation: I have a client on my server who may update some of the code in his Python project. He can SSH into his shell and pull from his repository and all is fine -- but the code is held in memory (as far as I know), so I need to actually kill the fastcgi process and restart it for the code change to take effect.
I know I can gracefully restart fcgi, but I don't want to have to do this manually. I want my client to update the code and, within 5 minutes or so, have the new code running under the fcgi process.
Thanks
First off, if uptime is important to you, I'd suggest making the client do it. It can be as simple as giving him a command called deploy-code. With your method, if there is an error in his code, fixing it requires a 10 minute turnaround (read: downtime), assuming he gets it right the second time.
That said, if you actually want to do this, you should create a daemon which looks for files modified within the last 5 minutes. If it detects one, it executes the restart command.
Code might look something like:
import os, time

CODE_DIR = '/tmp/foo'
restarted = False

while True:
    if restarted:
        restarted = False
    time.sleep(5 * 60)
    for root, dirs, files in os.walk(CODE_DIR):
        if restarted:
            break
        for filename in files:
            if restarted:
                break
            updated_on = os.path.getmtime(os.path.join(root, filename))
            current_time = time.time()
            if current_time - updated_on <= 6 * 60:  # 6 min
                # 6 min could offer false negatives, but that's better
                # than false positives
                restarted = True
                print("We should execute the restart command here.")