Represent forall operation in Ortools - scheduling

I am trying to solve a production scheduling problem in OR-Tools. The problem contains parallel machines. I have created a variable called all_tasks just like in the standard example at https://developers.google.com/optimization/scheduling/job_shop, but instead of the task id I indexed the variable by machine id:
all_tasks[job_id, machine_id] = task_type(start=start_var, end=end_var, interval=interval_var)
Now, while creating the constraints, I want the sum of the intervals (production durations) across all machines for a single job to be equal to the total interval required for that job.
How do I do this in OR-Tools? In the PuLP package I can create a for loop over the jobs and then another iterator inside the lpSum function within that first loop.
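In CP-SAT terms, the intended constraint looks something like this minimal sketch; the jobs, machines, horizon, and total_duration values are made up, and the per-machine duration variables stand in for the interval sizes stored in all_tasks:

from ortools.sat.python import cp_model

model = cp_model.CpModel()
horizon = 100  # assumed upper bound on time

jobs = [0, 1]
machines = [0, 1, 2]
total_duration = {0: 10, 1: 7}  # assumed total processing time per job

duration = {}   # duration[job, machine]: time the job spends on that machine
all_tasks = {}
for j in jobs:
    for m in machines:
        dur = model.NewIntVar(0, horizon, f"dur_{j}_{m}")
        start = model.NewIntVar(0, horizon, f"start_{j}_{m}")
        end = model.NewIntVar(0, horizon, f"end_{j}_{m}")
        interval = model.NewIntervalVar(start, dur, end, f"interval_{j}_{m}")
        duration[j, m] = dur
        all_tasks[j, m] = (start, end, interval)

# One sum constraint per job: the durations over all machines add up to the job's total.
for j in jobs:
    model.Add(sum(duration[j, m] for m in machines) == total_duration[j])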


Is it possible to perform db operations asynchronously in Django?

I'm writing a command to randomly create 5M orders in a database.
from random import sample
from typing import Iterator, Optional


def constrained_sum_sample(
    number_of_integers: int, total: Optional[int] = 5000000
) -> Iterator[int]:
    """Return a randomly chosen list of n positive integers summing to total.

    Args:
        number_of_integers (int): The number of integers;
        total (Optional[int]): The total sum. Defaults to 5000000.

    Yields:
        (int): The integers whose sum equals total.
    """
    dividers = sorted(sample(range(1, total), number_of_integers - 1))
    for i, j in zip(dividers + [total], [0] + dividers):
        yield i - j
def create_orders():
    customers = Customer.objects.all()
    number_of_customers = Customer.objects.count()
    for customer, number_of_orders in zip(
        customers,
        constrained_sum_sample(number_of_integers=number_of_customers),
    ):
        for _ in range(number_of_orders):
            create_order(customer=customer)
number_of_customers will be at least 1k, and the create_order function does at least 5 db operations (one to create the order, one to randomly get the order's store, one to create the order item (and this can go up to 30, also randomly), one to get each item's product (so possibly more, matching the number of items), and one to create the sales note).
As you may suspect, this takes a LONG time to complete. I've tried, unsuccessfully, to perform these operations asynchronously. All of my attempts (a dozen at least; most of them using sync_to_async) have raised the following error:
SynchronousOnlyOperation you cannot call this from an async context - use a thread or sync_to_async
Before I continue to break my head, I ask: is it possible to achieve what I desire? If so, how should I proceed?
Thank you very much!
Not yet supported, but in development.
Django 3.1 has official asynchronous support for views and middleware; however, if you try to call the ORM from within an async function you will get SynchronousOnlyOperation.
If you need to call the DB from an async function, they provide helper utilities such as async_to_sync and sync_to_async to switch between threaded and coroutine mode, as follows:
from asgiref.sync import sync_to_async
results = await sync_to_async(Blog.objects.get, thread_sensitive=True)(pk=123)
If you need to queue calls to the DB, you can use task queues like Celery or RabbitMQ.
By the way, if you really know what you are doing, you can call the ORM anyway at your own risk:
just turn off the async safety check, but watch out for data loss and integrity errors.
# settings.py (DJANGO_ALLOW_ASYNC_UNSAFE is read from the environment)
import os
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
The reason this is needed in Django is that many libraries, specifically database adapters, require that they are accessed in the same thread that they were created in. Also a lot of existing Django code assumes it all runs in the same thread, e.g. middleware adding things to a request for later use in views.
More fun news in the release notes:
https://docs.djangoproject.com/en/3.1/topics/async/
It's possible to achieve what you desire; however, you need a different perspective to solve this problem.
Try using asynchronous workers; simple options are RQ workers or Celery.
Use one of these libraries to process long-running tasks defined in Django asynchronously, in separate threads or processes.
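For example, with Celery the per-order work can be pushed onto a queue and executed by worker processes. A rough sketch, assuming a configured Celery app plus the Customer model and the create_order/constrained_sum_sample helpers from the question (the task name is illustrative):

from celery import shared_task


@shared_task
def create_order_task(customer_id: int) -> None:
    # Runs inside a worker process, so normal (synchronous) ORM calls are fine.
    customer = Customer.objects.get(pk=customer_id)
    create_order(customer=customer)


def create_orders():
    customers = Customer.objects.all()
    number_of_customers = Customer.objects.count()
    for customer, number_of_orders in zip(
        customers,
        constrained_sum_sample(number_of_integers=number_of_customers),
    ):
        for _ in range(number_of_orders):
            # Enqueue instead of creating inline; many workers drain the queue in parallel.
            create_order_task.delay(customer.pk)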
You can use bulk_create() to create a large number of objects; this will speed up the process. Additionally, you can run the bulk_create() call in a separate thread.
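A rough sketch of the bulk_create() idea, reusing the generator from the question; the Order model fields and the pick_random_store() helper are placeholders for whatever your schema actually needs:

from itertools import islice


def create_orders_bulk(batch_size: int = 1000) -> None:
    customers = Customer.objects.all()
    number_of_customers = Customer.objects.count()
    orders = (
        Order(customer=customer, store=pick_random_store())  # hypothetical fields/helper
        for customer, number_of_orders in zip(
            customers,
            constrained_sum_sample(number_of_integers=number_of_customers),
        )
        for _ in range(number_of_orders)
    )
    while True:
        batch = list(islice(orders, batch_size))
        if not batch:
            break
        # One INSERT per batch instead of one query per order.
        Order.objects.bulk_create(batch, batch_size=batch_size)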

Running multiple functions

I think I have confused myself on how I should approach this.
I have a number of functions that I use to interact with an api, for example get product ID, update product detail, update inventory. These calls need to be done one after another, and are all wrapped up in one function api.push().
Let's say I need to run api.push() 100 times, once for each of 100 product IDs.
What I want to do is run many api.push calls at the same time, so that I can speed up the processing of my products. For example, let's say I want to run 5 at a time.
I am confused as to whether this is multiprocessing or threading, or neither. I tried both, but they didn't seem to work. For example, I have this:
jobs = []
for n in range(0, 4):
    print "adding a job %s" % n
    p = multiprocessing.Process(target=api.push())
    jobs.append(p)

# Starts threads
for job in jobs:
    job.start()

for job in jobs:
    job.join()
Any guidance would be appreciated
Thanks
Please read the python doc and do some research on the global interpreter lock to see whether you should use threading or multiprocessing in your situation.
I do not know the inner workings of api.push, but please note that you should pass a function reference to multiprocessing.Process.
Using p = multiprocessing.Process(target=api.push()) will pass whatever api.push() returns as the function to be called in the subprocesses.
If api.push is the function to be called in the subprocess, you should use p = multiprocessing.Process(target=api.push) instead, as it passes a reference to the function rather than a reference to its result.
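A short sketch of running at most 5 pushes at a time; since API calls are mostly I/O-bound, a thread pool is usually sufficient despite the GIL. It assumes api.push accepts a product ID, which is not stated in the question:

from concurrent.futures import ThreadPoolExecutor

import api  # the module from the question, assumed importable

product_ids = range(100)  # placeholder for the real 100 product IDs

with ThreadPoolExecutor(max_workers=5) as executor:
    # Pass the function itself (api.push), not its result (api.push()).
    results = list(executor.map(api.push, product_ids))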

Amazon DynamoDB Atomic Writes

I have a list of Lambda worker functions (say 1000), each running simultaneously and doing its job. To be able to figure out the end result of all workers I have come up with this idea.
Before starting the job and spawning the Lambda worker functions, I save a record in DynamoDB, for example two attributes:
total_number_of_jobs
jobs_completed (set initially to 0)
When each Lambda worker function finishes, it will increment the jobs_completed attribute by one, then read the record and check whether total_number_of_jobs equals jobs_completed; if it does, it puts a record in SQS.
My questions are:
Is this a good idea?
Would the updates be consistent and atomic? Could there be any race conditions?
Any better solution than this?
I would update the counter, jobs_completed, in an UpdateItem API call like this:
SET jobs_completed = jobs_completed + :incr_by, where :incr_by equals 1.
As long as you use DynamoDB atomic counters, like your example shows, and you check the return value of the UpdateItem call instead of running a separate query, then your proposed solution should work fine.
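A hedged boto3 sketch of that call; the table name, key, and attribute layout are illustrative, not taken from the question:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("job_runs")  # hypothetical table

response = table.update_item(
    Key={"run_id": "run-123"},  # hypothetical key
    UpdateExpression="SET jobs_completed = jobs_completed + :incr_by",
    ExpressionAttributeValues={":incr_by": 1},
    ReturnValues="ALL_NEW",  # return the item as it is after the atomic increment
)

item = response["Attributes"]
if item["jobs_completed"] == item["total_number_of_jobs"]:
    # This worker was the last one to finish; signal completion, e.g. via SQS.
    pass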

task queue in Appengine (using NDB) stopping another function from updating data

cred_query = credits_tbl.query(ancestor=user_key).fetch(1)
for q in cred_query:
    q.total_credits = q.total_credits + credits_bought
    q.put()
I have a task running which is constantly updating a users total_credits in the credits table.
While that task runs the user can also buy additional credits at any point (as shown in the code above) to add to the total. However, when they try to do so, it does not update the total_credits in the credits table.
I guess I don't understand the 'strongly consistent' modelling of appengine (using ndb) as well as I thought.
Do you know why this happens?

Querying a growing data-set

We have a data set that grows while the application is processing it. After a long discussion we have come to the decision that we do not want blocking or asynchronous APIs at this time, and we will periodically query our data store.
We thought of two options to design an API for querying our storage:
A query method returns a snapshot of the data and a flag indicating whether we might have more data. When we finish iterating over the last returned snapshot, we query again to get another snapshot for the rest of the data.
A query method returns a "live" iterator over the data, and when this iterator advances it returns one of the following options: Data is available, No more data, Might have more data.
We are using C++ and we borrowed the .NET style enumerator API for reasons which are out of scope for this question. Here is some code to demonstrate the two options. Which option would you prefer?
/* ======== FIRST OPTION ============== */
// Similar to the familiar .NET enumerator.
class IFooEnumerator
{
public:
    // true  --> A data element may be accessed using the Current() method
    // false --> End of sequence. Calling Current() is an invalid operation.
    virtual bool MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IFooEnumerator() {}
};

enum class Availability
{
    EndOfData,
    MightHaveMoreData,
};

class IDataProvider
{
public:
    // Query params allow specifying the ID of the starting element. Here is the intended usage pattern:
    // 1. Call GetFoo() without specifying a starting point.
    // 2. Process all elements returned by IFooEnumerator until it ends.
    // 3. Check the availability.
    //    3.1 MightHaveMoreData --> Invoke GetFoo() again after some time, specifying the last processed
    //        element as the starting point, and repeat steps (2) and (3).
    //    3.2 EndOfData --> The data set will not grow any more and we know that we have finished processing.
    virtual std::tuple<std::unique_ptr<IFooEnumerator>, Availability> GetFoo(query-params) = 0;
};
/* ====== SECOND OPTION ====== */
enum class Availability
{
    HasData,
    MightHaveMoreData,
    EndOfData,
};

class IGrowingFooEnumerator
{
public:
    // HasData:
    //     We may access the current data element by invoking Current().
    // EndOfData:
    //     The data set has finished growing and no more data elements will arrive later.
    // MightHaveMoreData:
    //     The data set will grow and we need to continue calling MoveNext() periodically (preferably
    //     after a short delay) until we get a "HasData" or "EndOfData" result.
    virtual Availability MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IGrowingFooEnumerator() {}
};

class IDataProvider
{
public:
    virtual std::unique_ptr<IGrowingFooEnumerator> GetFoo(query-params) = 0;
};
Update
Given the current answers, here is some clarification. The debate is mainly over the interface: its expressiveness and intuitiveness in representing queries over a growing data-set that will, at some point in time, stop growing. The implementation of both interfaces is possible without race conditions (at least we believe so) because of the following properties:
The 1st option can be implemented correctly if the pair of the iterator + the flag represent a snapshot of the system at the time of querying. Getting snapshot semantics is a non-issue, as we use database transactions.
The 2nd option can be implemented given a correct implementation of the 1st option. The "MoveNext()" of the 2nd option will, internally, use something like the 1st option and re-issue the query if needed.
The data-set can change from "Might have more data" to "End of data", but not vice versa. So if we, wrongly, return "Might have more data" because of a race condition, we just get a small performance overhead because we need to query again, and the next time we will receive "End of data".
"Invoke GetFoo() again after some time by specifying the last processed element as the starting point"
How are you planning to do that? If it's by using the earlier-returned IFooEnumerator, then functionally the two options are equivalent. Otherwise, letting the caller destroy the "enumerator" and then, however long afterwards, call GetFoo() to continue iteration means you lose the ability to monitor the client's ongoing interest in the query results. It might be that right now you have no need for that, but I think it's poor design to exclude the ability to track state throughout the overall result processing.
Whether the overall system will work at all really depends on many things (not going into details about your actual implementation):
No matter how you twist it, there will be a race condition between checking for "is there more data" and more data being added to the system. Which means it is possibly pointless to try to capture the last few data items.
You probably need to limit the number of repeated runs of "is there more data", or you could end up in an endless loop of "new data came in while processing the last lot".
It also matters how easy it is to know whether data has been updated. If all the updates are "new items" with IDs that are sequentially higher, you can simply query "is there data above X", where X is your last ID. But if you are, for example, counting how many items in the data have property Y set to value A, and data may be updated anywhere in the database at any time (e.g. a database of where taxis currently are, updated via GPS every few seconds across thousands of cars), it may be hard to determine which cars have had updates since the last time you read the database.
As to your implementation, in option 2, I'm not sure what you mean by the MightHaveMoreData state - either there is more data, or there isn't, right? Repeated polling for more data is a bad design in this case, given that you will never be able to say with 100% certainty that no "new data" arrived in the time between fetching the last data and processing and acting on it (displaying it, buying shares on the stock market, stopping the train, or whatever it is that you want to do once you have processed your new data).
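A rough sketch of the "is there data above X" polling loop mentioned above, using sqlite3 purely for illustration; the items table, its columns, and the idle-poll limit are made up:

import sqlite3
import time


def poll_new_items(conn: sqlite3.Connection, last_id: int = 0,
                   max_idle_polls: int = 5, delay: float = 1.0):
    """Yield new rows in ID order, re-querying until several polls come back empty."""
    idle_polls = 0
    while idle_polls < max_idle_polls:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id", (last_id,)
        ).fetchall()
        if rows:
            idle_polls = 0
            for row_id, payload in rows:
                yield payload
                last_id = row_id  # highest ID processed so far
        else:
            idle_polls += 1      # nothing new; count an idle poll
            time.sleep(delay)    # wait before asking "is there more data" again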
A read-write lock could help. Many readers have simultaneous access to the data set, and only one writer.
The idea is simple:
- when you need read-only access, a reader takes the read lock, which can be shared with other readers but is exclusive with writers;
- when you need write access, the writer takes the write lock, which is exclusive for both readers and writers.
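A minimal sketch of the pattern, shown in Python for brevity (in C++17 a std::shared_mutex plays the same role): readers can overlap with each other, while a writer waits until it is alone.

import threading


class ReadWriteLock:
    """Many concurrent readers, one exclusive writer (no writer priority)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False

    def acquire_read(self) -> None:
        with self._cond:
            while self._writer_active:
                self._cond.wait()
            self._readers += 1

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self) -> None:
        with self._cond:
            while self._writer_active or self._readers > 0:
                self._cond.wait()
            self._writer_active = True

    def release_write(self) -> None:
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()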