So we have a timeline of T days in which some tasks have to be performed.
Every task has a penalty score. If a task is not performed within the timeline, its score adds to the final penalty score. Every task can be performed only on or after its given starting time.
The input will be given in the format:
T
Score Quantity_of_task Starting_time
For example:
T = 10
140 5 4
This means that 5 tasks with penalty score 140 have to be performed from the 4th day onwards.
You can perform at most 1 task on a particular day.
The goal is to minimize the final penalty score.
What I tried to do:
Example -
T = 10
Input size = 5
150 4 1
120 4 3
200 2 7
100 10 5
50 5 1
I sorted the list by penalty score and greedily assigned the tasks with the highest penalty scores to their corresponding days, i.e.
2 tasks with the highest score 200 are assigned to days 7 and 8
4 tasks with the next highest score 150 are assigned to days 1, 2, 3 and 4
4 tasks with the next highest score 120 are assigned to days 5, 6, 9 and 10
which gives the schedule as
150 150 150 150 120 120 200 200 120 120
Left out tasks:
10 tasks with 100 score = 1000 penalty
5 tasks with 50 score = 250 penalty
Final penalty = 1250.
This requires O(T * input_size). Is there a more elegant and optimized way of doing it?
Both input size and T have a constraint of 10^5.
Thanks.
If you store the available days in an ordered set, then you can perform your algorithm much faster.
For example, C++ provides an ordered set (std::set) whose lower_bound method will find, in O(log n) time, the first available day on or after the starting time.
Overall this gives an O(n log n) algorithm, where n = T + input_size.
For example, I suspect that when you have your 4 tasks of penalty 120 to assign from day 3 onwards, your current code loops over days 3, 4, 5, etc. until it finds a day that has not been assigned. You can replace this O(n) loop with a single O(log n) call to lower_bound to find the first unassigned day. When you greedily assign days, also remove them from the set so they won't be assigned twice.
Note that there are only T days, so there will be at most T day assignments. For example, suppose all tasks have starting time 1 and quantity T. Then the first task will take O(T log n) time to assign, but every subsequent task needs only a single call to lower_bound (because there are no days left to assign), so it takes O(log n).
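For illustration, here is a minimal C++ sketch of this approach (a sketch, not a reference implementation: the input parsing follows the format described in the question, and the variable names are my own):

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <set>
#include <vector>

int main() {
    int T, n;
    std::cin >> T >> n;

    // Each task group: {penalty score, quantity, starting day}.
    std::vector<std::array<long long, 3>> tasks(n);
    for (auto& t : tasks) std::cin >> t[0] >> t[1] >> t[2];

    // Greedy: handle the highest penalties first.
    std::sort(tasks.begin(), tasks.end(),
              [](const auto& a, const auto& b) { return a[0] > b[0]; });

    std::set<long long> freeDays;
    for (long long d = 1; d <= T; ++d) freeDays.insert(d);

    long long penalty = 0;
    for (const auto& [score, qty, start] : tasks) {
        long long left = qty;
        while (left > 0) {
            // First unassigned day >= start, in O(log n).
            auto it = freeDays.lower_bound(start);
            if (it == freeDays.end()) break;  // no valid day remains
            freeDays.erase(it);               // the day is now taken
            --left;
        }
        penalty += left * score;  // unscheduled copies add to the penalty
    }
    std::cout << penalty << '\n';
}
```

On the example above this prints 1250, matching the hand-computed schedule.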
Say we have a table with an average item size of 1 KB. We perform a query which reads 3 such items. According to what I have read, the number of RCUs (for strongly consistent reads) should be:
(Number of items read) * ceil(item_size / 4 KB) = 3 * ceil(1/4) = 3 * 1 = 3.
So I wanted to confirm: is this correct? Or do we use a single RCU, since the total size of the items read is 3 KB, which is less than 4 KB?
An RCU is good for 1 strongly consistent read of up to 4 KB.
Thus you can Query() four 1 KB items for 1 RCU.
Since you have only 3 KB to read, 1 RCU will be consumed.
Using GetItem() to read those same 3 records would cost 3 RCUs.
Say you had 100 items that matched the query (HK+SK), but you're also using a filter to further select the records to be returned, so you're only getting 4 records back. That query would consume 25 RCUs, as the records still have to be read even if they are not returned.
Reference (from the DynamoDB documentation):
Query—Reads multiple items that have the same partition key value. All items returned are treated as a single read operation, where DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
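To make that arithmetic concrete, here is a small Python sketch of the rounding rule for strongly consistent reads (the helper names are mine, not part of any AWS SDK):

```python
import math

def query_rcus(total_size_kb: float) -> int:
    """Strongly consistent Query: sum all item sizes read, round up to the next 4 KB."""
    return max(1, math.ceil(total_size_kb / 4))

def get_item_rcus(item_size_kb: float) -> int:
    """Strongly consistent GetItem: each item is rounded up individually."""
    return max(1, math.ceil(item_size_kb / 4))

print(query_rcus(3 * 1.0))                        # 1  -- three 1 KB items in one Query
print(sum(get_item_rcus(1.0) for _ in range(3)))  # 3  -- the same items via GetItem
print(query_rcus(100 * 1.0))                      # 25 -- the filtered Query still reads 100 items
```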
Could you please help me with a data structure that allows O(log N) (or at least O(sqrt N)) operations for the following:
Insert an item having ID (int64_t) and health (double)
Remove an item by ID
Find an item that is weighted random by health
The preferred language is C++ or C. By weighted random I mean the following:
Consider totalHealth = Sum(health[0], health[1], ..., health[N-1]). I need a fast (as described above) operation equivalent to:
Compute const double atHealth = rand_uint64_t()*totalHealth/numeric_limits<uint64_t>::max();
Iterate over i=0 to N-1 to find the first i such that Sum(health[0], health[1], ..., health[i]) >= atHealth
Constraints: health[i] > 0, rand_uint64_t() returns a uniformly distributed integer value between 0 and numeric_limits<uint64_t>::max().
What I have tried so far is a C++ unordered_map, which allows quick (Θ(1)) insertion by ID and removal by ID, but operation #3 is still linear in N, as described in my pseudo-code above.
Your help is very much appreciated!
I can't think of a way to do it with the existing STL containers, but I can think of a way if you're willing to code up your own binary search tree. The trick is that each node maintains the total health of all nodes to its left (it doesn't need to worry about nodes to its right, as you'll see below). Then, walking the tree in ID order also lets you compute the cumulative health, in ID order, in log(n) time. So the tree is effectively sorted by both ID and cumulative health, and you can do lookups in log(n) time by either one. For example, consider a very simple tree like the following:
            ID: 8
            h: 10
            chl: 15
      +-------+-------+
      |               |
    ID: 4          ID: 10
    h: 15          h: 7
    chl: 0         chl: 0
In the above, h is the health of the node and chl is the cumulative health of all nodes to its left. So the total health of all nodes above is 15 + 10 + 7 = 32 (I assume you maintain that total separately, though you could instead also track the cumulative health of nodes to the right, and then you wouldn't need to). Let's look at 3 cases:
You compute an atHealth < 15. At the root you see that your value is less than the chl, so you go left and end up at the correct leaf.
You compute an atHealth >= 15 and < 25. It is not less than the root's chl, so you don't go left; the root has health 10, and 15 + 10 = 25, so the root covers cumulative health in the range [15, 25) and the root itself is the answer.
You compute an atHealth >= 25. Every time you visit a node and go right, you must add the chl and h of the node you are leaving to keep the running cumulative health as you walk the tree. So when you go right from the root you know you're starting at 15 + 10 = 25, and you'll add that offset to the h or chl of any node you encounter after that. Thus you quickly find that the node to the right is the correct one.
When you insert a new node, you add its health to the chl of each node where you descend to the left on the way down; when you remove a node, you walk back up subtracting its health from those same nodes. Inserts and deletions are thus still O(log(n)), and lookups are also O(log(n)), whether by ID or by atHealth.
Things obviously get more complicated if you want to maintain a balanced tree, but it's still doable.
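To make this concrete, here is a minimal sketch of such a tree as a plain (unbalanced) BST keyed by ID. The h and chl fields follow the description above; deletion and rebalancing are omitted for brevity:

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <utility>

struct Node {
    int64_t id;
    double h;          // this node's health
    double chl = 0.0;  // cumulative health of the left subtree
    Node* left = nullptr;
    Node* right = nullptr;
};

// Insert keyed by ID. Whenever we descend left, the new node ends up in
// that node's left subtree, so that node's chl grows by the new health.
Node* insert(Node* root, int64_t id, double h) {
    if (!root) return new Node{id, h};
    if (id < root->id) {
        root->chl += h;
        root->left = insert(root->left, id, h);
    } else {
        root->right = insert(root->right, id, h);
    }
    return root;
}

// Weighted pick, following the three cases discussed above.
int64_t pick(const Node* n, double atHealth) {
    while (true) {
        if (atHealth < n->chl) {
            n = n->left;                // case 1: answer lies in the left subtree
        } else if (atHealth < n->chl + n->h) {
            return n->id;               // case 2: this node covers the interval
        } else {
            atHealth -= n->chl + n->h;  // case 3: skip this node and its left subtree
            n = n->right;
        }
    }
}

int main() {
    Node* root = nullptr;
    double totalHealth = 0.0;
    for (auto [id, h] : {std::pair<int64_t, double>{8, 10}, {4, 15}, {10, 7}}) {
        root = insert(root, id, h);
        totalHealth += h;
    }
    std::mt19937_64 rng{12345};
    std::uniform_real_distribution<double> dist(0.0, totalHealth);
    std::cout << pick(root, dist(rng)) << '\n';
}
```

Inserting IDs 8, 4, 10 with healths 10, 15, 7 reproduces the example tree, and pick returns ID 4 for atHealth < 15, ID 8 for 15 <= atHealth < 25, and ID 10 otherwise, so each item is chosen with probability h / totalHealth.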
On the input, we are given a number N stating how many presentations are to be given, followed by N rows of start and end times in HHMM format.
Example:
3
0800 0900
0830 1000
0900 1030
The code must calculate the maximum number of occupied rooms (one room can host only one presentation at a time), so the expected output for the provided example is 2.
My first idea was to create a 1440 x N table of bools (number of minutes in a day by number of presentations), mark each minute during which a presentation is being held, and then go column by column to find the maximum number of simultaneous presentations. It works, but I'm sure it can be done faster and better. Can someone suggest a better approach?
Pretty simple, actually: we just simulate the process. It doesn't matter which presentations are going on at any moment; all we care about is how many. So we keep a counter that we update whenever a presentation starts or ends.
We could iterate over every minute of the simulation, but the counter only changes when a presentation starts or ends, so we can instead make a big list of all the start and end events, sort the list by time, and iterate through it adjusting the counter appropriately, as in the sketch below.
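Here is a short C++ sketch of that event sweep (reading the HHMM stamps as plain integers is safe because their numeric order matches chronological order; whether a talk ending at 0900 frees a room for one starting at 0900 is decided by the tie-break in the sort):

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    int n;
    std::cin >> n;

    std::vector<std::pair<int, int>> events;  // (time, +1 for start / -1 for end)
    for (int i = 0; i < n; ++i) {
        int start, end;
        std::cin >> start >> end;
        events.push_back({start, +1});
        events.push_back({end, -1});
    }

    // Sorting pairs puts ends (-1) before starts (+1) at equal times,
    // so a talk starting exactly when another ends reuses the room.
    std::sort(events.begin(), events.end());

    int current = 0, best = 0;
    for (const auto& e : events) {
        current += e.second;  // e.first (the time) only matters for the sort order
        best = std::max(best, current);
    }
    std::cout << best << '\n';  // prints 2 for the sample input
}
```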
The way you propose needs 1440 x 3 = 4320 values. As you can tell, that's very inefficient. A better way is to store only the minutes that are actually required. To store even fewer values, divide the time into slots of 30 minutes.
Now use a dictionary (std::map) to keep a count of how many presentations are running during the same time slot. For your example this gives:
std::map<std::string, int> slots =
{
    {"0800", 1},
    {"0830", 2},
    {"0900", 2},
    {"0930", 2},
    {"1000", 1},
    {"1030", 1},
};
I'll let you figure out how to implement this.
Currently I am dealing with a massive amount of data, originally in the form of a list built through combinations. I am running conditions on each set in the list in a for loop. The problem is that this small for loop is taking hours with the data. I'm looking to optimize the speed by changing some functions or vectorizing it.
I know one of the biggest no-nos is doing Pandas or DataFrame operations inside for loops, but I need to sum up the columns and organize them a little to get what I want. It seems unavoidable.
So you have a better understanding, each list looks something like this when it's thrown into a DataFrame:
Name Role Cost Value
0 Johnny Tsunami Driver 1000 39
1 Michael B. Jackson Pistol 2500 46
2 Bobby Zuko Pistol 3000 50
3 Greg Ritcher Lookout 200 25
Name Role Cost Value
4 Johnny Tsunami Driver 1000 39
5 Michael B. Jackson Pistol 2500 46
6 Bobby Zuko Pistol 3000 50
7 Appa Derren Lookout 250 30
This is the current loop. Any ideas?
for element in itertools.product(*combine_list):
    combo = list(element)
    df = pd.DataFrame(np.array(combo).reshape(-1, 11))
    df[[2, 3]] = df[[2, 3]].apply(pd.to_numeric)
    if (df[2].sum()) <= 5000 and (df[3].sum()) > 190:
        df2 = pd.concat([df2, df], ignore_index=True)
A couple of things I've done that have sliced off some time, but not enough:
* Changed df[2].sum() to df[2].values.sum(); it's faster.
* Where the concat is in the if statement, I've tried using append and also adding the DataFrames together as a list; concat is actually about 2 seconds faster normally, or it ends up being about the same speed.
* Replaced .apply(pd.to_numeric) with .astype(np.int64); it's faster as well.
I'm currently looking at PyPy and Cython as well, but I want to start here first before I go through that headache.
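For reference, a sketch of the loop with those tweaks folded in, plus one further change: filtering on the raw NumPy arrays and doing a single concat at the end instead of one per iteration (combine_list and df2 come from the snippet above, and columns 2 and 3 are assumed to be Cost and Value):

```python
import itertools

import numpy as np
import pandas as pd

kept = []  # gather matching blocks; one concat at the end beats concat per iteration
for element in itertools.product(*combine_list):
    arr = np.array(element).reshape(-1, 11)
    # .astype(np.int64) instead of .apply(pd.to_numeric), per the notes above
    cost_sum = arr[:, 2].astype(np.int64).sum()
    value_sum = arr[:, 3].astype(np.int64).sum()
    if cost_sum <= 5000 and value_sum > 190:
        kept.append(arr)

df2 = pd.DataFrame(np.vstack(kept)) if kept else pd.DataFrame()
```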
My DataFrame has 3 fields: account, month and salary.
account month Salary
1 201501 10000
2 201506 20000
2 201506 20000
3 201508 30000
3 201508 30000
3 201506 10000
3 201506 10000
3 201506 10000
3 201506 10000
I am doing a groupby on account and month and calculating the sum of salary for each group, then removing duplicates.
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(sum)
MyDataFrame = MyDataFrame.drop_duplicates()
Expecting output like below:
account month Salary
1 201501 10000
2 201506 40000
3 201508 60000
3 201506 40000
It works well for a few records. I tried the same on 600 million records and it has been in progress for 4-5 hours. Initially, when I loaded the data using pd.read_csv(), it acquired 60 GB of RAM; for the first 1-2 hours RAM usage stayed between 90 and 120 GB. After 3 hours the process is taking 236 GB of RAM and it is still running.
Please suggest a faster alternative if one is available.
EDIT:
It now finishes in 15 minutes with df.groupby(['account', 'month'], sort=False)['Salary'].sum()
Just to follow up on chrisb's answer and Alexander's comment, you will indeed get more performance out of the .sum() and .agg('sum') methods. Timing the three approaches with Jupyter's %%timeit, the ones that chrisb and Alexander mention are about twice as fast on your very small example dataset.
Also, according to the Pandas API documentation, adding the kwarg sort=False will help performance. Your groupby should therefore look something like df.groupby(['account', 'month'], sort=False)['Salary'].sum(). Indeed, when I ran it, it was about 10% faster than the runs timed above.
Unless I'm misunderstanding something, you're really doing an aggregation; transform is for when you need the data in the same shape as the original frame. This should be somewhat faster, and it does it all in one step.
df.groupby(['account', 'month'])['Salary'].agg('sum')
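For example, on the sample data above (adding the sort=False kwarg mentioned in the other answer, so the group order matches the expected output), a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# One-step aggregation; reset_index turns the group keys back into columns.
result = (df.groupby(['account', 'month'], sort=False)['Salary']
            .agg('sum')
            .reset_index())
print(result)
#    account   month  Salary
# 0        1  201501   10000
# 1        2  201506   40000
# 2        3  201508   60000
# 3        3  201506   40000
```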
It might be worth downloading the development version of Pandas 0.17.0. They are unlocking the GIL, which limits multithreading. It's going to be natively implemented in groupby, and this blog post suggests speed increases of 3x on a group-mean example.
http://continuum.io/blog/pandas-releasing-the-gil
http://pandas.pydata.org/