Delayed Job Overwhelming DB - ruby-on-rails-4

I have a method which updates all DNS records for an account with one delayed job per record. There are a lot of workers and queues, which is great for getting other jobs done quickly, but this particular job completes quickly and overwhelms the database. Because each job requires a DNS lookup, it's difficult to move this to a process that collects the information and then writes once. So instead I'm looking for a way to stagger delayed jobs.
As far as I know, just using sleep(0.1) in the after method should do the trick. I wanted to see if anyone else has specifically dealt with this situation and solved it.
I've created a custom job to test out a few different ideas. Here's some example code:
def update_dns
  Account.active.find_each do |account|
    account.domains.where('processed IS NULL').find_each do |domain|
      begin
        Delayed::Job.enqueue StaggerJob.new(domain.id)
      rescue Exception => e
        self.domain_logger.error "Unable to update DNS for #{domain.name} (id=#{domain.id})"
        self.domain_logger.error e.message
        self.domain_logger.error e.backtrace
      end
    end
  end
end
When a cron job calls Domain.update_dns, the delayed job table floods with tens of thousands of jobs, and the workers start working through them. There are so many workers and queues that even setting the lowest priority overwhelms the database, and other requests suffer.
Here's the StaggerJob class:
class StaggerJob < Struct.new(:domain_id)
  def perform
    domain.fetch_dns_job
  end

  def enqueue(job)
    job.account_id = domain.account_id
    job.owner = domain
    job.priority = 10 # lowest
    job.save
  end

  def after(job)
    # Sleep to avoid overwhelming the DB
    sleep(0.1)
  end

  private

  def domain
    @domain ||= Domain.find(domain_id)
  end
end
This may entirely do the trick, but I wanted to verify if this technique was sensible.

It turned out the priority for these jobs was set to 0 (highest). Setting it to 10 (lowest) helped. Sleeping inside the job's after method would work, but there's a better way: stagger the run_at time when enqueuing.
Delayed::Job.enqueue StaggerJob.new(domain.id, :fetch_dns!), run_at: (Time.now + (0.2*counter).seconds) # stagger by 0.2 seconds
This ends up pausing outside the job instead of inside. Better!
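For context, here is a minimal sketch of how that enqueue line might sit inside the update_dns loop from the question. The counter variable and the :fetch_dns! argument are assumptions for illustration (they imply the StaggerJob struct was extended to accept a method name):

counter = 0
Account.active.find_each do |account|
  account.domains.where('processed IS NULL').find_each do |domain|
    # Spread the jobs out by 0.2 s each so the workers never hit the DB all at once.
    Delayed::Job.enqueue StaggerJob.new(domain.id, :fetch_dns!),
                         run_at: (Time.now + (0.2 * counter).seconds)
    counter += 1
  end
end

The workers stay busy with other queues in the meantime, and the DNS jobs trickle in at a rate the database can absorb.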

Related

Why does .save() still take up time when using transaction.atomic()?

In Django, I read that transaction.atomic() should leave all the queries to be executed until the end of the code segment in a single transaction. However, it doesn't seem to be working as expected:
import time
from django.db import transaction

my_objs = Obj.objects.all()[:100]
count = avg1 = 0
with transaction.atomic():
    for obj in my_objs:
        start = time.time()
        obj.save()
        end = time.time()
        avg1 += end - start
        count += 1
print("total:", avg1, "average:", avg1 / count)
Why, when I wrap each .save() call in start/end timestamps to check how long it takes, is it not instantaneous?
The result of the code above was:
total: 3.5636022090911865 average: 0.035636022090911865
When logging the SQL queries with the debugger, it also displays an UPDATE statement for each time .save() is called.
Any ideas why it's not working as expected?
PS. I am using Postgres.
There is probably just a misunderstanding here about what transaction.atomic actually does. It doesn't wait to execute all the queries -- the ORM is still talking to the database as you execute your code inside an atomic block. It simply waits to commit (SQL COMMIT;) the changes until the [successful] end of the block. If an exception is raised before the end of the transaction block, none of the modifications are committed; they are all rolled back.
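Here is a minimal sketch of that behaviour, reusing the Obj model from the question (the value field is an assumption for illustration):

from django.db import connection, transaction

try:
    with transaction.atomic():
        obj = Obj.objects.first()
        obj.value = 42               # "value" is a hypothetical field
        obj.save()                   # the UPDATE is sent to Postgres right here...
        raise RuntimeError("abort")  # ...but the block exits before COMMIT
except RuntimeError:
    pass

obj.refresh_from_db()
print(obj.value)   # still the old value: the pending changes were rolled back
# With DEBUG=True, connection.queries shows the UPDATE that was issued anyway.

So the per-.save() time you measured is the round trip for each UPDATE; only the commit is deferred.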

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 instances failing with 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas, assuming that the number of instances handed to the model would be (1000 / number of replicas), but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4 like this in the Python code that creates the batch job:
model_parameters = dict(batch_size=4)

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we eventually end up in GAPIC, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
What I can suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, build on that together with your initial fix of using a machine with a higher spec. You can also try a machine with more memory, like the n1-highmem-* family.
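As an illustration only, the higher-memory machine could be plugged into the question's own run_batch_prediction_job helper; vertex_config and its fields come from the question, while the specific machine type here is just an example:

# Hypothetical tweak: reuse the question's vertex_config, but swap in a
# higher-memory machine type before creating the batch prediction job.
vertex_config.machine_type = "n1-highmem-16"  # more RAM per replica than n1-highcpu-16

job = run_batch_prediction_job(vertex_config)
print(job.state)  # check the terminal state once batch_prediction_job.wait() returns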

Why is Celery not executing tasks in parallel in Django?

I am having an issue with Celery; I will explain with the code below.
def samplefunction(request):
    print("This is a samplefunction")
    a, b = 5, 6
    myceleryfunction.delay(a, b)
    return Response({"msg": "process execution started"})

@celery_app.task(name="sample celery", base=something)
def myceleryfunction(a, b):
    c = a + b
    my_obj = MyModel()
    my_obj.value = c
    my_obj.save()
In my case, when one person triggers the Celery task it works perfectly.
If many people send requests, the tasks are processed one by one.
So imagine that my Celery function "myceleryfunction" takes 3 minutes to complete the background task.
If 10 requests come in at the same time, the last one completes with a 30-minute delay.
How can I solve this issue, or is there any alternative?
Thank you
I'm assuming you are running a single worker with default settings for the worker.
This will have the worker running with worker_pool=prefork and worker_concurrency=<nr of CPUs>
If the machine it runs on only has a single CPU, you won't get any parallel running tasks.
To get parallelisation you can:
- set worker_concurrency to something > 1; this will use multiple processes in the same worker
- start additional workers
- use celery multi to start multiple workers
- when running the worker in a Docker container, add replicas of the container
See Concurrency for more info.
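For example, here is a minimal sketch of raising concurrency; the module name proj and the broker URL are placeholders, not from the question:

# celery.py (hypothetical app module; the broker URL is a placeholder)
from celery import Celery

celery_app = Celery("proj", broker="redis://localhost:6379/0")

# Allow up to 4 tasks to run at the same time in one worker's prefork pool.
celery_app.conf.worker_concurrency = 4

# Equivalent from the command line:
#   celery -A proj worker --concurrency=4 --loglevel=info
# Or start several workers at once:
#   celery multi start w1 w2 -A proj --concurrency=4

With a concurrency of 4, ten 3-minute tasks finish in roughly 9 minutes instead of 30.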

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation on this post, I updated to the latest version (3.11.1.1) and re-created all servers. In fact, this change caused my 5 servers to operate at a much lower CPU load (it was around 75% before; now it is around 20%, as shown in the graph below):
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, which recommends changing the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many values for batch-index-threads (2, 4, 8, 16) and also changed batch-max-unused-buffers, but nothing solved my problem. I keep receiving the All batch queues are full error.
Here is the relevant part of my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node starts cluster migrations, which add additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255, you will also want to raise batch-max-unused-buffers to at least batch-index-threads x batch-max-buffer-per-queue + 1. If you do not, new buffers will be created and destroyed constantly, because the number of free (unused) buffers is smaller than the number in use: the moment a batch response is served, the system will strive to trim the buffers back down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.
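For reference, both settings can be changed at runtime with the same set-config pattern shown in the question; the values 320 and 13000 below are just the examples from this answer, and if a parameter is not dynamic on your version you can instead set it in the service stanza of aerospike.conf and restart:

asadm -e "asinfo -v 'set-config:context=service;batch-max-buffer-per-queue=320'"
asadm -e "asinfo -v 'set-config:context=service;batch-max-unused-buffers=13000'"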

tensorflow fully connected control flow per n-epoch summary

When I don't use queues, I like to tally the loss, accuracy, ppv etc during an epoch of training and submit that tf.summary at the end of every epoch.
I'm not sure how to replicate this behavior with queues. Is there a signal I can listen to for when an epoch is complete?
(version 0.9)
A typical setup goes as follows:
queue = tf.train.string_input_producer(..., num_epochs=7)  # ... = the list of input filenames
... # build graph ...

# training
try:
    while not coord.should_stop():
        sess.run(train_op)
except:
    # file has been read num_epochs times
    # do some stuff.. maybe summaries
    coord.request_stop()
finally:
    coord.join(threads)
So, clearly I could just set num_epoch=1 and create summaries in the except block. This would require running my entire program once per epoch and somehow it doesn't seem the most efficient.
EDIT Changed to account for edits to the question.
An epoch is not something that is built in or 'known' to TensorFlow. You have to keep track of the epochs in your training loop and run the summary ops at the end of each epoch. Pseudo code like the following should work:
num_mini_batches_in_epoch = ...  # something like examples_in_file / mini_batch_size

try:
    while True:
        for i in range(num_mini_batches_in_epoch):
            if coord.should_stop():
                raise Exception()
            sess.run(train_op)
        # End of an epoch: run the summary ops here (and hand the results to a
        # summary writer, e.g. tf.train.SummaryWriter in 0.9, to persist them).
        sess.run([loss_summary, accuracy_summary])
except:
    # file has been read num_epochs times
    # do some stuff.. maybe summaries
    coord.request_stop()
finally:
    coord.join(threads)