PBS Professional hook not updating Priority - python-2.7

I am trying to implement a hook to determine a job's priority upon entering the queue.
The hook is enabled, imported, and event type is "queuejob", so it is in place (like other hooks we have enabled). This hook however does not seem to alter a job's priority as I am expecting.
Here is a simplified example of how I'm trying to alter the Priority for a job:
import pbs
try:
e=pbs.event()
j=e.job
if j.server == 'myserver':
j.Priority = j.Priority + 50
e.accept()
except SystemExit:
pass
Whenever I submit a job after importing this hook, I run the 'qstat -f' on my job, the Priority is always 0, whether I set it to another value in my qsub script or leave it to the default.
Thank you.

Couple of things I discovered:
It appears that PBS does not like using j.Priority in a calculation and assignment, so I had to use another internal variable (which was fine since I had one already for something else)
i.e.:
j.Priority = High_Priority
if pbs.server() == 'myserver'
j.Priority = High_Priority + 50
Also, (as can be seen in the last example), j.server should actually be pbs.server().

Related

Set or modify an AWS Lambda environment variable with Python boto3

i want to set or modify an environment variable in my lambda script.
I need to save a value for the next call of my script.
For exemple i create an environment variable with the aws lambda console and don't set value. After that i try this :
import boto3
import os
if os.environ['ENV_VAR']:
print(os.environ['ENV_VAR'])
os.environ['ENV_VAR'] = "new value"
In this case my value will never print.
I tried with :
os.putenv()
but it's the same result.
Do you know why this environment variable is not set ?
Thank you !
Consider using the boto3 lambda command, update_function_configuration to update the environment variable.
response = client.update_function_configuration(
FunctionName='test-env-var',
Environment={
'Variables': {
'env_var': 'hello'
}
}
)
I need to save a value for the next call of my script.
That's not how environment variables work, nor is it how lambda works. Environment variables cannot be set in a child process for the parent - a process can only set environment variables in its own and child process environments.
This may be confusing to you if you set environment variables at the shell, but in that case, the shell is the long running process setting and getting your environment variables, not the programs it calls.
Consider this example:
from os import environ
print environ['A']
environ['A'] = "Set from python"
print environ['A']
This will only set env A for itself. If you run it several times, the initial value of A is always the shell's value, never the value python sets.
$ export A="set from bash"
$ python t.py
set from bash
Set from python
$ python t.py
set from bash
Set from python
Further, even if that wasn't the case, it wouldn't work reliably with aws lambda. Lambda runs your code on whatever compute resources are available at the time; it will typically cache runtimes for frequently executed functions, so in these cases data could be written to the filesystem to preserve it. But if the next invocation wasn't run in that runtime, your data would be lost.
For your needs, you want to preserve your data outside the lambda. Some obvious options are: write to s3, write to dynamo, or, write to sqs. The next invocation would read from that location, achieving the desired result.
AWS Lambda just executes the piece of code with given set of inputs. Once executed, it returns the output and that's all. If you want to preserve the output for your next call, then you probably need to store that in DB or Queue as Dan said. I personally use SQS in conjunction with SNS that sends me notifications about current state. You can even store the end result like success or failure in SQS which you can use for next trigger. Just throwing the options here, rest all depends on your requirements.

Celery: dynamic queues at object level

I'm writing a django app to make polls which uses celery to put under control the voting system. Right now, I have two queues, default and polls, the first one with concurrency set to 8 and the second one set to 1.
$ celery multi start -A myproject.celery default polls -Q:default default -Q:polls polls -c:default 8 -c:polls 1
Celery routes:
CELERY_ROUTES = {
'polls.tasks.option_add_vote': {
'queue': 'polls',
},
'polls.tasks.option_subtract_vote': {
'queue': 'polls',
}
}
Task:
#app.task
def option_add_vote(pk):
"""
Updates given option id and its poll increasing vote number by 1.
"""
option = Option.objects.get(pk=pk)
try:
with transaction.atomic():
option.vote_quantity += 1
option.save()
option.poll.total_votes += 1
option.poll.save()
except IntegrityError as exc:
raise self.retry(exc=exc)
The option_add_vote method (task) updates the poll-object vote-number value adding 1 to the previous value. So, to avoid concurrency problems, I set the poll queue concurrency to 1. This allow the system to handle thousand of vote requests to be completed successfully.
The problem will be, as I can imagine, a bottle-neck when the system grows up.
So, I was thinking about some kind of dynamic queues where all vote requests to any options of a certain poll where routered to a custom queue. I think this will make the system more reliable and fast.
What do you think? How can I make it?
EDIT1:
I got a new idea thanks to Paul and Plahcinski. I'm storing the votes as objects in their own model (a user-options relationship). When someone votes an option it creates an object from this model, allowing me to count how many votes an option has. This free the system from the voting-concurrency problem, so it could be executed in parallel.
I'm thinking about using CELERYBEAT_SCHEDULE to cron a task that updates poll options based on the result of Vote.objects.get(pk=pk).count(). Maybe I could execute it every hour or do partial updates for those options that are getting new votes...
But, how do I give to the clients updated options in real time?
As Plahcinski says, I can have a cached value for my options in Redis (or any other mem-cached system?) and use it to temporally store this values, giving to any new request the cached value.
How can I mix this with my standar values in django models? Anyone could give me some code references or hints?
Am I in the good way or did I make mistakes?
What I would do is remove your incrementation for the database and move to redis and use the database model as your cached value. Have a celery beat that updates recently incremented redis keys to your database
http://redis.io/commands/INCR
What about just having a simple model that stores vote -1/+1 integers then a celery task that reconciles those with the FK object for atomic transactions and updates?

copying rather than modifying a job (APScheduler)

I'm writing a database-driven application with APScheduler (v3.0.0). Especially during development, I find myself frequently wanting to command a scheduled job to start running now without affecting its subsequent schedule.
It's possible to do this at job creation time, of course:
def dummy_job(arg):
pass
sched.add_job(dummy_job, trigger='interval', hours=3, args=(None,))
sched.add_job(dummy_job, trigger=None, args=(None,))
However, if I already have a job scheduled with an interval or date trigger...
>>> sched.print_jobs()
Jobstore default:
job1 (trigger: interval[3:00:00], next run at: 2014-08-19 18:56:48 PDT)
... there doesn't seem to be a good way to tell the scheduler "make a copy of this job which will start right now." I've tried sched.reschedule_job(trigger=None), which schedules the job to start right now, but removes its existing trigger.
There's also no obvious, simple way to duplicate a job object while preserving its args and any other stateful properties. The interface I'm imagining is something like this:
sched.dup_job(id='job1', new_id='job2')
sched.reschedule_job('job2', trigger=None)
Clearly, APScheduler already contains an internal mechanism to copy job objects since repeated calls to get_job don't return the same object (that is, (sched.get_job(id) is sched.get_job(id))==False).
Has anyone else come up with a solution here? I'm thinking of posting a suggestion on the developers' site if not.
As you've probably figured out by now, that phenomenon is caused by the job stores instantiating jobs on the fly based on data loaded from the back end. To run a copy of a job immediately, this should do the trick:
job = sched.get_job(id)
sched.add_job(job.func, args=job.args, kwargs=job.kwargs)

get number of pending tasks for a specific user

In one of my applications i want to limit users to make a only a specific number of document conversion each calendar month and want to notify them of the conversions they've made and number of conversions they can still make in that calendar month.
So I do something like the following.
class CustomUser(models.Model):
# user fields here
def get_converted_docs(self):
return self.document_set.filter(date__range=[start, end]).count()
def remaining_docs(self):
converted = self.get_converted_docs()
return LIMIT - converted
Now, document conversion is done in the background using celery. So there may be a situation when a conversion task is pending, so in that case the above methods would let a user make an extra conversion, because the pending task is not being included in the count.
How can i get the number of tasks pending for a specific CustomUser object here ??
update
ok so i tried the following:
from celery.task.control import inspect
def get_scheduled_tasks():
tasks = []
scheduled = inspect().scheduled()
for task in scheduled.values()
tasks.extend(task)
return tasks
This gives me a list of scheduled tasks but now all the values are unicode for the above mentioned task args look like this:
u'args': u'(<Document: test_document.doc>, <CustomUser: Test User>)'
is there a way these can be decoded back to original django objects so that i can filter them ?
Store the state of your documents somewhere else, don't inspect your queue.
Either create a seperate model for that, or eg. have a state on your document model, at least independently from your queue. This should have several advantages:
Inspecting the queue might be expensive - also depending on the backend for that. And as you see it can also turn out to be difficult.
Your queue might not be persistent, if eg. your server crashes and use something like Redis you would loose this information, so it's a good thing to have a log somewhere else to be able to reconstruct the queue)

How should I implement callback for taskset in celery

Question
I use celery to launch task sets that look like this:
I perform a batch of tasks that can be run in parallel, number of tasks in this batch varies from tens to couple thousands.
I aggregate results of these tasks into single answer, then do something with this answer --- like store to the database, save to special result file and so on. Basically after tasks done executing I have to call function that has following signature:
def callback(result_file_name, task_result_list):
#store in file
def callback(entity_key, task_result_list):
#store in db
For now step 1. is done in Celery queue and step 2 is done outside celery:
tasks = []
# add taksks to tasks list
task_group = group()
task_group.tasks = tasks
result = task_group.apply_async()
res = result.join()
# Aggregate results
# Save results to file, database whatever
This approach is cumbersome since I have to stop a single thread until all tasks are performed (which can take couple of hours).
I would like to somehow move step 2 to celery also --- esentially I would need to add a callback to entire taskset (as far as I know it is unsupported in Celery) or submit a task that is executed after all these subtasks.
Does anyone have idea how to do it? I use it in the django enviorment so I can store some state in the database.
To sum up my recent findings
Chords won't do
I'cant use chords straight forwardly because chords enable me to create callbacks that look this way:
def callback(task_result_list):
#store in file
there is no obvious way to pass additional parameters to callback (especially because these callbacks can't be local functions).
Using the database either
I can store results using TaskSetMeta but this entity has no status field --- so even if I would add a signal to TaskSetMeta i'd have to pool task results which could have siginificant overhead.
Well answer was really straightforward, and I can indeed use chords --- and additional parameters (like report file name and so on) must be passed as kwargs.
Here is chord task:
#task
def print_and_sum(to_sum, file_name):
print file_name
print sum(to_sum)
return file_name, sum(to_sum)
Here is how to instantiate it:
subtasks = [...]
result = chord(subtasks)(print_and_sum.subtask(kwargs={'file_name' : 'report_file.csv'}))